DEEM: Workshop on Data Management for End-to-End Machine Learning @ ACM SIGMOD 2025

About

DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal to discuss the arising data management issues in ML application scenarios. The DEEM workshop will be held on Friday, June 27th, in conjunction with SIGMOD/PODS 2025. The workshop will be held in-person in Berlin (InterContinental Hotel, Charlottenburg I/II).

The workshop solicits regular research papers (8 pages plus unlimited references) describing preliminary or completed research results, as well as short papers (up to 4 pages) such as reports on applications and tools or preliminary results, interesting use cases, problems, datasets, benchmarks, visionary ideas, and descriptions of system components and tools related to end-to-end ML pipelines. Submissions should follow the guidelines as for SIGMOD, i.e. use the sigconf template for the ACM proceedings format.

Follow us on twitter @deem_workshop, bluesky @deem-workshop.bsky.social, or contact the organizers via email. We also provide archived websites of previous versions of the workshop: DEEM 2017, DEEM 2018, DEEM 2019, DEEM 2020, DEEM 2021, DEEM 2022, DEEM 2023, and DEEM 2024.

DEEM 2025 Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3735654

Schedule

The program will consist of contributed talks on accepted papers and two keynotes from academia.
Following last year program structure, also this year we are going to have poster sessions to spark more discussion and networking with the DEEM audience.

June 27th (all times are in Berlin Time / CEST)

09:00 - 09:10

Opening Remarks

09:10 - 10:30

Session 1: Data preparation & Model selection - Chair: Madelon Hulsebos

09:10

Keynote Title: Rethinking data engineering pipelines for machine learning, and vice versa

Keynote Speaker: Gaël Varoquaux (Inria, Probabl)

Keynote Abstract: For machine learning on relational data, the current practice relies heavily on wrangling, preparing, cleaning, massaging, torturing data. I claim that this painful situation arises because of a mismatch between the machine learning models and the data, which comprises a mixture of types (strings, dates, numbers) and are spread across multiple tables which must be merged and aggregated before prediction. I will show how we have been rethinking the data preparation pipeline, with the skrub (https://skrub-data.org) software implementing new twists ranging from encoding data types, heuristics applied across columns, to a cross-validating and tuning any sequence operations assembling dataframes. I will also discuss how we have been improving tabular learning, creating more flexible models that apply readily to complex tables. For this, we baking rich priors and knowledge to create table foundation models.

Keynote Speaker Bio: Gaël Varoquaux is a research director working on data science at Inria (French computer science national research) where he leads the Soda team. He is also co-founder and scientific advisor of Probabl. Varoquaux's research covers fundamentals of artificial intelligence, statistical learning, natural language processing, causal inference, as well as applications to health, with a current focus on public health and epidemiology. He also creates technology: he co-founded scikit-learn, one of the reference machine-learning toolboxes, and helped build various central tools for data analysis in Python. Varoquaux has worked at UC Berkeley, McGill, and university of Florence. He did a PhD in quantum physics supervised by Alain Aspect and is a graduate from Ecole Normale Superieure, Paris.

10:00

Dataset2Graph: A GNN-based Methodology for AutoML for Clustering (10 min)
Emmanouil Dilmperis (University of Piraeus), Yannis Poulakis (University of Piraeus), Dimitris Petratos (University of Piraeus), Christos Doulkeridis (University of Pireaus)

10:10

Towards Learning to Rank Deep-Learning Models for Multivariate Time-Series Transfer Learning (20 min)
Melanie Sigl (Universität Erlangen-Nürnberg), Klaus Meyer-Wegener (Universität Erlangen-Nürnberg)

10:30 - 11:00

Break

11:00 - 12:20

Session 2: Databases & ML - Chair: Matteo Interlandi

11:00

Towards Automated Task-Aware Data Validation (10 min)
Hao Chen (BIFOLD & TU Berlin), Sebastian Schelter (BIFOLD & TU Berlin)

11:10

SQL4NN: Validation and Expressive Querying of Models as Data (10 min)
Mark Gerarts (Hasselt University), Juno Steegmans (Hasselt University), Jan Van den Bussche (Hasselt University)

11:20

DuoLingo-AutoDiff: In-Database Automatic Differentiation with MLIR (20 min)
Kevin Gutjahr (University of Bamberg), Clemens Ruck (University of Bamberg), Maximilian Schüle (University of Bamberg)

11:40 - 12:20

Morning Poster Session

12:20 - 13:30

Lunch

13:30 - 15:00

Session 3: DBMS for ML - Chair: Matteo Interlandi

13:30

Keynote Title: Satisfying the Data Monster with Fewer Resources: A quest to feed the GPU in deep learning training

Keynote Speaker: Pınar Tözün (ITU)

Keynote Abstract: Deep learning tasks are computationally expensive requiring the use of powerful and expensive hardware accelerators such as GPUs and TPUs. Both the efficiency of the deep learning tasks and effective utilization of the accelerators depend on how fast the relevant data is moved to the accelerator, which still heavily depends on the CPUs. In this talk, we will look into the different aspects of reducing the CPU and data needs of deep learning to improve the end-to-end resource-efficiency of model training. First, we will explore today's landscape for the I/O path to GPUs. Then, we will investigate the impact of the work sharing and data selection on the performance of deep learning model training.

Keynote Speaker Bio: Pınar Tözün is an Associate Professor and the Head of Data, Systems, and Robotics Section at IT University of Copenhagen. Before ITU, she was a research staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD from EPFL. Her thesis received ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention in 2016. Her research focuses on resource-aware machine learning, performance characterization of data-intensive systems, and scalability and efficiency of data-intensive systems on modern hardware.

14:20

PQ Bench: Benchmarking Pruning and Quantization Techniques (10 min)
Jonas Schulze (University of Potsdam), Nils Straßenburg (Unviersity of Potsdam), Tilmann Rabl (HPI, University of Potsdam)

14:30

Towards An Improved Video RAG Workflow With Orchestration Support in A Visual Data Management System (20 min)
Sourish Chatterjee (Intel Labs), Rohit Verma (Intel Labs), Abhinav Kumar (IIT Hyderabad), Arun Raghunath (Intel Labs)

14:50

Buffer Management for Out-of-GPU LLM Execution (10 min)
Jiashen Cao (Georgia Tech), Joy Arulraj (Georgia Tech), Hyesoon Kim (Georgia Tech)

15:00 - 15:30

Break

15:30 - 16:50

Session 4: Data Retrieval & Selection - Chair: Stefan Grafberger

15:30

End-To-End ML with LLMs and Semantic Data Management: Experiences from Chemistry 4.0 (20 min)
Sayed Hoseini (Hochschule Niederrhein), Vincent Hermann (HSNR), Christoph Quix (Fraunhofer FIT)

15:50

Table Dissolution: Adding Salt To Your Data (10 min)
Francesco Pugnaloni (HPI), Tassilo Klein (SAP SE), Felix Naumann (HPI)

16:00

Towards a Framework for Hierarchical Text Segmentation using Large Language Models (20 min)
Lampros Flokas (Celonis), Jeffery Cao (Celonis), Yujian Xu (Celonis), Eugene Wu (Columbia University), Xu Chu (Celonis), Cong Yu (Celonis)

16:20 - 16:50

Afternoon Poster Session

16:50 -17:00

Closing Remarks

↑ top

Important Dates

Submission deadline: ~~March 21~~ April 1 2025, 5pm Pacific Time
Submission website: https://cmt3.research.microsoft.com/DEEM2025
Notification of acceptance: April 25, 2025
Final papers due: May 16, 2025
Workshop: Friday, June 27, 2025

↑ top

Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the data management community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML and increased adoption large language models (LLMs).

For example, data preprocessing and feature extraction workloads may be complicated and require simultaneous execution of relational and linear algebraic operations. Next, model selection may involve searching many combinations of model architectures, features, and hyper-parameters to find the best-performing model. After model training, the resulting model may have to be deployed and integrated into business workflows and require lifecycle management using metadata and lineage. As a further complication, the resulting system may have to take into account a heterogeneous audience, ranging from domain experts without programming skills to data engineers and statisticians who develop custom algorithms. Many such challenges are human or engineer-centered (e.g., monitoring ML pipelines, leveraging LLMs for domain-specific tasks at scale), and DEEM uniquely encourages submissions in such topics.

Additionally, the importance of incorporating ethics and legal compliance into machine-assisted decision-making is being broadly recognized. Critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream from model training and deployment. DEEM welcomes research on providing system-level support to data scientists who wish to develop and deploy responsible machine learning methods.

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal to discuss the arising data management issues in ML application scenarios.

↑ top

Topics of Interest

Areas of particular interest for the workshop include (but are not limited to):

Data Management in Machine Learning Applications
Definition, Execution and Optimization of Complex Machine Learning Pipelines
Systems for ML, e.g. for Managing the Lifecycle of ML Models, Efficient Hyper-parameter Search, or Feature Selection
Machine Learning Services in the Cloud
Modeling, Storage and Provenance of Machine Learning Artifacts
Integration of Machine Learning and Dataflow Systems
Integration of Machine Learning and ETL Processing
Definition and Execution of Complex Ensemble Predictors
Sourcing, Labeling, Integrating, and Cleaning Data for Machine Learning
MLOps, Data Validation, and Model Debugging Techniques
Privacy-preserving Machine Learning
Benchmarking of Machine Learning Applications
Responsible Data Management
Transparency and Accountability of Machine-Assisted Decision Making
Impact of Data Quality and Data Preprocessing on the Fairness of ML Predictions
Horror stories, Anecdotes, and Lessons Learned on data management for ML
Data management for multimodal ML
Vector Databases for Retrieval and Systems for Retrieval Augmented Generation
ML for data management for ML
Data Management challenges, e.g. responsible data management, for LLMs

↑ top

Submission

We invite submissions in the following two tracks:

Regular Papers (research and industrial papers; up to 8, plus unlimited references)
Short Papers (preliminary results, interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, and descriptions of system components and tools; up to 4 pages)

Papers of any category can have at most 2 additional appendix pages.
Authors are requested to prepare submissions following the ACM proceedings format consistent with the SIGMOD submission guidelines. Please use the latest ACM paper format with the sigconf template. DEEM is a single-anonymous workshop, authors must include their names and affiliations on the manuscript cover page.

Submission website: https://cmt3.research.microsoft.com/DEEM2025
Inclusion and Diversity in Writing: http://2025.sigmod.org/calls_papers_inclusion_and_diversity.shtml

↑ top

Invited Speakers

Keynote 1: Gaël Varoquaux (Inria, Probabl)
Keynote 2: Pınar Tözün (ITU)

Organization / People

Workshop Chairs:

Madelon Hulsebos
CWI, Netherlands

Matteo Interlandi
Microsoft GSL, USA

Shreya Shankar
UC Berkeley, USA

Stefan Grafberger
BIFOLD & TU Berlin, Germany

Steering Committee:

Juliana Freire (New York University)
Bill Howe (University of Washington)
H.V. Jagadish (University of Michigan)
Volker Markl (TU Berlin)
Stefan Seufert (Amazon Research)
Markus Weimer (Microsoft AI)
Sebastian Schelter (BIFOLD & TU Berlin)

Program Committee:

Anna Pavlenko, Microsoft Gray Systems Lab
Bojan Karlaš, Harvard University
Gerardo Vitagliano, MIT CSAIL
Haralampos Gavriilidis, TU Berlin
Jacopo Tagliabue, Bauplan
Joy Arulraj, Georgia Tech
Konstantinos Kanellis, University of Wisconsin-Madison
Manisha Luthra, TU Darmstadt and DFKI
Matthias Boehm, Technische Universität Berlin
Maximilian Böther, ETH Zurich
Maximilian Schüle, University of Bamberg
Pinar Tozun, IT University of Copenhagen
Rainer Gemulla, Universität Mannheim
Sebastian Schelter, BIFOLD & TU Berlin
Shreya Shankar, University of California Berkeley
Sivaprasad Sudhir, MIT
Ties Robroek, IT University of Copenhagen
Till Döhmen, MotherDuck
Xue Li, CWI
Yiming Lin, University of California, Berkeley
Zezhou Huang, Columbia University

↑ top