DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The DEEM workshop will be held in person on Sunday, June 9th, in conjunction with SIGMOD/PODS 2024.

The workshop solicits regular research papers (8 pages plus unlimited references) describing preliminary or completed research results, as well as short papers (up to 4 pages) such as reports on applications and tools or preliminary results, interesting use cases, problems, datasets, benchmarks, visionary ideas, and descriptions of system components and tools related to end-to-end ML pipelines. Submissions should follow the same guidelines as SIGMOD, i.e., use the sigconf template for the ACM proceedings format.

Follow us on Twitter @deem_workshop or contact us via email at madelon@berkeley.edu. We also provide archived websites of previous editions of the workshop: DEEM 2017, DEEM 2018, DEEM 2019, DEEM 2020, DEEM 2021, DEEM 2022, and DEEM 2023.

DEEM 2024 Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3650203

The program will consist of contributed talks on accepted papers and a keynote from academia or industry.
New in 2024 are poster sessions to spark more discussion and networking with the DEEM audience.

Sunday, June 9th (all times are in Chile Standard Time/CLT):

9:00am - 10:00am
Opening Remarks and Keynote Address
Empowering Users with AI: The Role of AutoML and Data Discovery in Data-Driven Exploration
Keynote Speaker: Juliana Freire (New York University)

Artificial Intelligence (AI) is reshaping data-driven exploration. In this talk, we will explore how AutoML and data discovery enhance human capabilities. We present AlphaAutoML, an open-source Python library designed to support a wide range of machine learning tasks across various data types. AlphaAutoML combines deep reinforcement learning and meta-learning to effectively construct pipelines over a large collection of primitives. It seamlessly integrates AutoML within the data science lifecycle through an ecosystem of tools that facilitate user-in-the-loop tasks, such as selecting suitable pipelines and customizing these pipelines for complex problems. Additionally, we will discuss the emerging field of dataset search, a critical component of data-centric AI. We will review the opportunities it creates to enrich analytics and improve machine learning models, and present methods that support discovery in large dataset collections.

10:00am - 10:30am

10:30am - 11:20am
Session 1: Data Preparation and Pipelining - Chair: Madelon Hulsebos

Croissant: A Metadata Format for ML-Ready Datasets (10 min)
Mubashara Akhtar (King's College London); Omar Benjelloun (Google); Costanza Conforti (Google); Pieter Gijsbers (TU Eindhoven); Joan Giner-Miguelez (Universitat Oberta de Catalunya); Nitisha Jain (King's College London); Michael Kuchnik (Meta); Quentin Lhoest (Hugging Face); Pierre Marcenac (Google); Manil Maskey (NASA MSFC); Peter Mattson (Google); Luis Oala (Dotphoton AG); Pierre Ruyssen (Google); Rajat Shinde (NASA IMPACT UAH); Elena Simperl (King's College London); Geoff Thomas (Kaggle); Vyacheslav Tykhonov (DANS-KNAW); Jos van der Velde (TU Eindhoven); Joaquin Vanschoren (TU Eindhoven); Steffen Vogler (Bayer AG); Carole-Jean Wu (Meta / FAIR)

Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines" (10 min)
Stefan Grafberger (University of Amsterdam); Paul Groth (University of Amsterdam); Sebastian Schelter (University of Amsterdam)

Reactive Dataflow for Inflight Error Handling in ML Workflows (recording) (20 min)
Abhilash Jindal (IIT Delhi); Kaustubh Beedkar (IIT Delhi); Vishal Singh (Indian Institute of Technology Delhi); Jawahar Nausheen Mohammed (Indian Institute of Technology Delhi); Tushar Singla (IIT Delhi); Aman Gupta (Indian Institute of Technology Delhi); Keerti Choudhary (IIT Delhi)

Towards Efficient Data Wrangling with LLMs using Code Generation (10 min)
Xue Li (University of Amsterdam / MotherDuck); Till Döhmen (MotherDuck)

11:20am - 12:00pm
Morning Poster Session

12:00pm - 1:30pm

1:30pm - 2:20pm
Session 2: Systems and Frameworks - Chair: Matteo Interlandi

AIDB: a Sparsely Materialized Database for Queries using Machine Learning (10 min)
Tengjun Jin (UIUC); Akash Mittal (UIUC); Chenghao Mo (UIUC); Jiahao Fang (UIUC); Chengsong Zhang (UIUC); Timothy Dai (Stanford University); Daniel Kang (UIUC)

Nautilus: A Benchmarking Platform for DBMS Knob Tuning (10 min)
Konstantinos Kanellis (University of Wisconsin-Madison); Johannes Freischuetz (University of Wisconsin-Madison); Shivaram Venkataraman (University of Wisconsin-Madison)

DLProv: A Data-Centric Support for Deep Learning Workflow Analyses (20 min)
Debora B Pina (Federal University of Rio de Janeiro); Adriane Chapman (University of Southampton); Liliane Kunstmann (Federal University of Rio de Janeiro); Daniel Oliveira (UFF, Brazil); Marta Mattoso (Federal University of Rio de Janeiro)

Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie (10 min)
Jacopo Tagliabue (Bauplan); Ciro Greco (Bauplan)

2:20pm - 3:30pm
Session 3: Edge Computing and ML Deployment - Chair: Shreya Shankar

Reaching the Edge of the Edge: Image Analysis in Space (20 min)
Robert Bayer (IT University of Copenhagen); Julian Priest (IT University of Copenhagen); Pinar Tozun (IT University of Copenhagen)

Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly (20 min)
Herbert Woisetschlaeger (Technical University of Munich); Alexander Erben (Technical University of Munich); Shiqiang Wang (IBM Research); Ruben Mayer (University of Bayreuth); Hans-Arno Jacobsen (University of Toronto)

Towards Consistent Language Models Using Controlled Prompting and Decoding (10 min)
Jasmin Mousavi (Oregon State University); Arash Termehchy (Oregon State University)

tailwiz: Empowering Domain Experts with Easy-to-Use, Task-Specific Natural Language Processing Models (20 min)
Timothy Dai (Stanford University); Austin Peters (Stanford); Jonah B Gelbach (Berkeley Law); David Engstrom (Stanford Law School); Daniel Kang (UIUC)

3:30pm - 4:00pm

4:00pm - 4:40pm
Afternoon Poster Session

4:45pm - 5:00pm
Awards and Closing Remarks

Important Dates
Submission deadline: March 22, 2024, 5pm Pacific Time (extended from March 15)
Submission website: https://cmt3.research.microsoft.com/DEEM2024
Notification of acceptance: April 20, 2024
Final papers due: May 10, 2024
Workshop: Sunday, June 9, 2024

Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the data management community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML and the increased adoption of large language models (LLMs).

For example, data preprocessing and feature extraction workloads may be complicated and require simultaneous execution of relational and linear algebraic operations. Next, model selection may involve searching many combinations of model architectures, features, and hyper-parameters to find the best-performing model. After model training, the resulting model may have to be deployed and integrated into business workflows and require lifecycle management using metadata and lineage. As a further complication, the resulting system may have to take into account a heterogeneous audience, ranging from domain experts without programming skills to data engineers and statisticians who develop custom algorithms. Many such challenges are human or engineer-centered (e.g., monitoring ML pipelines, leveraging LLMs for domain-specific tasks at scale), and DEEM uniquely encourages submissions on such topics.

Additionally, the importance of incorporating ethics and legal compliance into machine-assisted decision-making is being broadly recognized. Critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream from model training and deployment. DEEM welcomes research on providing system-level support to data scientists who wish to develop and deploy responsible machine learning methods.

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios.

Topics of Interest
Areas of particular interest for the workshop include (but are not limited to):


We invite submissions in the following two tracks:

Papers of any category can have at most 2 additional appendix pages.
Authors are requested to prepare submissions following the ACM proceedings format consistent with the SIGMOD submission guidelines. Please use the latest ACM paper format with the sigconf template. DEEM is a single-anonymous workshop: authors must include their names and affiliations on the manuscript cover page.

Submission website: https://cmt3.research.microsoft.com/DEEM2024
Inclusion and Diversity in Writing: http://2024.sigmod.org/calls_papers_inclusion_and_diversity.shtml

Invited Speakers
Keynote: Juliana Freire (NYU)

Organization / People
Workshop Chairs:
Madelon Hulsebos
UC Berkeley, USA

Matteo Interlandi
Microsoft GSL, USA

Shreya Shankar
UC Berkeley, USA

Steering Committee: Program Committee:

Sponsored by
