DEEM: Workshop on Data Management for End-to-End Machine Learning @ ACM SIGMOD'23

About

The DEEM workshop will be held on Sunday, June 18th, in conjunction with SIGMOD/PODS 2023, in hybrid (in-person and virtual) form. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss the data management issues that arise in ML application scenarios.

The workshop solicits regular research papers (10 pages plus unlimited references) describing preliminary or completed research results, as well as short papers (up to 4 pages) such as reports on applications and tools or preliminary results. With the short paper category on applications and tools (introduced in 2022), the DEEM workshop aims to establish a broader forum for sharing interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, and descriptions of system components and tools related to end-to-end ML pipelines. Submissions should follow the SIGMOD guidelines, i.e., use the sigconf template of the ACM proceedings format.

Follow us on Twitter @deem_workshop or contact us by email at info[at]deem-workshop[dot]org. Archived websites of previous editions of the workshop are also available: DEEM 2017, DEEM 2018, DEEM 2019, DEEM 2020, DEEM 2021, and DEEM 2022.

DEEM 2022 Proceedings: ACM DL Link

Schedule

Sunday, June 18th (all times are in PDT)

9:00am

Opening

9:10am - 10:30am

Session 1 - Systems for ML (Chair: Matthias Boehm)

9:10am

MLflow2PROV: Extracting Provenance from Machine Learning Experiments
Marius Schlegel (TU Ilmenau), Kai-Uwe Sattler (TU Ilmenau)

9:30am

Data Management and Visualization for Benchmarking Deep Learning Training Systems
Ties Robroek (IT University of Copenhagen), Aaron Duane (IT University of Copenhagen), Ehsan Yousefzadeh-Asl-Miandoab (IT University of Copenhagen), Pinar Tozun (IT University of Copenhagen)

9:50am

EVA: An End-to-End Exploratory Video Analytics System
Gaurav Tarlok Kakkar (Georgia Institute of Technology), Jiashen Cao (Georgia Tech), Pramod Chunduri (Georgia Institute of Technology), Zhuangdi Xu (Georgia Tech), Suryatej Reddy Vyalla (Georgia Institute of Technology), Anirudh Prabakaran (Georgia Institute of Technology), Jaeho Bang (Georgia Institute of Technology), Kaushik Ravichandran (Georgia Institute of Technology), Ishwarya Sivakumar (Georgia Institute of Technology), Aryan Rajoria (Georgia Institute of Technology), Ashmita Raju (Georgia Institute of Technology), Tushar Aggarwal (Georgia Institute of Technology), Shashank Suman (Georgia Institute of Technology), Myna Prasanna Kalluraya (Georgia Institute of Technology), Subrata Mitra (Adobe Research), Ali Payani (Cisco Systems Inc.), Yao Lu (Microsoft Research), Umakishore Ramachandran (Georgia Institute of Technology), Joy Arulraj (Georgia Tech)

10:10am

When Can we Ignore Missing Data in Model Training
Cheng Zhen (Oregon State University), Amandeep Sing Chabada (Oregon State University), Arash Termehchy (Oregon State University)

10:30am - 11:00am

Break

11:00am - 12:20pm

Session 2 - ML Optimization (Chair: Madelon Hulsebos)

11:00am

Using Pipeline Performance Prediction to Accelerate AutoML Systems
Haoxiang Zhang (New York University), Roque Enrique López Condori (New York University), Aécio Santos (New York University), Jorge H Piazentin Ono (NYU), Aline Bessa (New York University), Juliana Freire (New York University)

11:20am

P2D: A Transpiler Framework for Optimizing Data Science Pipelines
Yordan Grigorov (Technische Universität Berlin), Haralampos Gavriilidis (Technische Universität Berlin), Sergey Redyuk (TU Berlin), Kaustubh Beedkar (IIT Delhi), Volker Markl (Technische Universität Berlin)

11:40am

Teaching Blue Elephants the Maths for Machine Learning
Clemens Ruck (TUM), Maximilian E Schüle (University of Bamberg)

12:00pm

Transactional Python for Durable Machine Learning: Vision, Challenges, and Feasibility
Supawit Chockchowwat (University of Illinois at Urbana-Champaign), Zhaoheng Li (University of Illinois at Urbana-Champaign), Yongjoo Park (University of Illinois at Urbana-Champaign)

12:20pm - 1:30pm

Lunch

1:30pm - 3:00pm

Session 3 - Chair: Shreya Shankar

1:30pm

Enhance, Don't Replace: A Recipe for Success in Data Science Tooling [Keynote]
Aditya Parameswaran (University of California, Berkeley)

Abstract: A large fraction of the data science and machine learning workflow is performed in computational notebooks such as Jupyter with libraries such as pandas, NumPy, and scikit-learn in an ad-hoc, highly iterative manner. However, this process is not without its challenges. We describe three open-source tools that we've built that address scalability, interactivity, and reproducibility challenges along the way -- and have been adopted widely by data scientists. We also reflect on how our recipe -- of enhancing existing tools as opposed to replacing them -- may need revisiting in the exciting arena of LLM-powered data work, which forms the focus of our new EPIC Data lab at Berkeley.

2:30pm

DiffML: End-to-end Differentiable ML Pipelines
Benjamin Hilprecht (TU Darmstadt), Christian Hammacher (Software AG), Eduardo S Reis (TU Darmstadt), Mohamed Abdelaal (Software AG), Carsten Binnig (TU Darmstadt)

3:00pm - 3:30pm

Break

3:30pm - 4:45pm

Session 4 - ML Pipelines (Chair: Paroma Varma/Madelon Hulsebos)

3:30pm

Panel: Data Management challenges for LLM-powered solutions
TBA

4:30pm

Awards + closing

Important Dates

Submission deadline: March 22, 2023, 5pm Pacific Time (previously March 15)
Submission website: https://cmt3.research.microsoft.com/DEEM2023
Notification of acceptance: April 25, 2023 (previously April 19)
Final papers due: May 10, 2023
Workshop: Sunday, June 18, 2023

Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the data management community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML.

For example, data preprocessing and feature extraction workloads may be complicated and require simultaneous execution of relational and linear algebraic operations. Next, model selection may involve searching many combinations of model architectures, features, and hyper-parameters to find the best-performing model. After model training, the resulting model may have to be deployed and integrated into business workflows and require lifecycle management using metadata and lineage. As a further complication, the resulting system may have to take into account a heterogeneous audience, ranging from domain experts without programming skills to data engineers and statisticians who develop custom algorithms.

Additionally, the importance of incorporating ethics and legal compliance into machine-assisted decision-making is being broadly recognized. Critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream from model training and deployment. DEEM welcomes research on providing system-level support to data scientists who wish to develop and deploy responsible machine learning methods.

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss the data management issues that arise in ML application scenarios.

Topics of Interest

Areas of particular interest for the workshop include (but are not limited to):

Data Management in Machine Learning Applications
Definition, Execution and Optimization of Complex Machine Learning Pipelines
Systems for Managing the Lifecycle of Machine Learning Models
Systems for Efficient Hyper-parameter Search and Feature Selection
Machine Learning Services in the Cloud
Modeling, Storage, and Provenance of Machine Learning Artifacts
Integration of Machine Learning and Dataflow Systems
Integration of Machine Learning and ETL Processing
Definition and Execution of Complex Ensemble Predictors
Sourcing, Labeling, Integrating, and Cleaning Data for Machine Learning
MLOps, Data Validation, and Model Debugging Techniques
Privacy-preserving Machine Learning
Benchmarking of Machine Learning Applications
Responsible Data Management
Transparency and Accountability of Machine-Assisted Decision Making
Impact of Data Quality and Data Preprocessing on the Fairness of ML Predictions
War stories, Anecdotes, and Lessons Learned on Data Management for ML

Submission

We invite submissions in the following two tracks:

Regular Papers (research and industrial papers; up to 10 pages, plus unlimited references)
Short Papers (preliminary results, interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, and descriptions of system components and tools; up to 4 pages)

Authors are requested to prepare submissions in the ACM proceedings format, consistent with the SIGMOD submission guidelines. Please use the latest ACM paper format (last update 11/2022) with the sigconf template. DEEM is a single-blind workshop; authors must include their names and affiliations on the manuscript cover page.

Submission Website: https://cmt3.research.microsoft.com/DEEM2023
Inclusion and Diversity in Writing: http://2023.sigmod.org/calls_papers_inclusion_and_diversity.shtml

Invited Speakers

Academic Keynote: Aditya Parameswaran (University of California, Berkeley)

Organization / People

Workshop Chairs:

Matthias Boehm
TU Berlin, Germany

Madelon Hulsebos
University of Amsterdam, NL

Shreya Shankar
UC Berkeley, USA

Paroma Varma
Snorkel AI, USA

Steering Committee:

Juliana Freire (New York University)
Bill Howe (University of Washington)
H.V. Jagadish (University of Michigan)
Volker Markl (TU Berlin)
Stefan Seufert (Amazon Research)
Markus Weimer (Microsoft AI)

Program Committee:

Raul Castro Fernandez (University of Chicago)
Patrick Damme (TU Berlin)
Rainer Gemulla (University of Mannheim)
Stefan Grafberger (University of Amsterdam)
Nezihe Merve Gürel (ETH Zürich)
Matteo Interlandi (Microsoft Research)
Zoi Kaoudi (TU Berlin)
Bojan Karlaš (Harvard)
Asterios Katsifodimos (TU Delft)
Arun Kumar (University of California San Diego)
Nantia Makrynioti (RelationalAI)
Laurel Orr (Stanford University)
Tilmann Rabl (HPI and University of Potsdam)
Berthold Reinwald (IBM)
Sebastian Schelter (University of Amsterdam)
Nesime Tatbul (Intel Labs and MIT)
Eugene Wu (Columbia University)
Xiaozhe Yao (ETH Zürich)
Chi Zhang (Brandeis University)
