
About

NOTE THAT THIS IS AN ARCHIVED WEBSITE FOR A PREVIOUS VERSION OF DEEM!
THE WEBSITE FOR DEEM 2020 IS AVAILABLE HERE


The DEEM workshop will be held on Sunday, the 30th of June, in Amsterdam, NL, in conjunction with SIGMOD/PODS 2019. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios.

The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments. Submissions can be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD).

Follow us on Twitter at @deem_workshop or contact us via email at info[at]deem-workshop[dot]org. We also provide archived websites of previous editions of the workshop: DEEM 2017 and DEEM 2018.

Schedule
Sunday, June 30th


9:15 - 9:30
Welcome

9:30 - 10:30
"Software Engineering 2.0" for "Software 2.0": Towards Data Management for Statistical Generalization [Academic Keynote]
Ce Zhang (ETH Zürich)



 
10:30 - 11:00
Coffee Break


11:00 - 11:40
Distributed Training of Deep Learning Models for Recommendation Systems [Industry Keynote]
Leonidas Galanis (Facebook AI)

11:40 - 11:55
Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems [Short Paper]
Supun Nakandala*; Yuhao Zhang; Arun Kumar (University of California, San Diego)

11:55 - 12:10
Automated Management of Deep Learning Experiments [Short Paper]
Gharib Gharibi*; Vijay Walunj; Rakan N Alanazi; Sirisha Rella; Yugyung Lee (University of Missouri-Kansas City)

12:10 - 12:30
Osprey: Weak Supervision of Imbalanced Extraction Problems without Code [Long Paper]
Eran Bringer*; Abraham Israeli; Yoav Shoham (Intel); Alex Ratner; Christopher Ré (Stanford University)



 
12:30 - 14:00
Lunch Break


14:00 - 14:25
Follow the Data! Responsible Data Science Starts with Responsible Data Management [Invited Talk]
Julia Stoyanovich (New York University)

14:25 - 14:40
Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach [Short Paper]
Kihyun Tae; Yuji Roh; Young Hun Oh; Hyunsu Kim; Steven Whang* (KAIST)

14:40 - 14:55
The ML Data Prep Zoo: Towards Semi-Automatic Data Preparation for ML [Short Paper]
Vraj Shah*; Arun Kumar (University of California, San Diego)

14:55 - 15:15
Debugging Machine Learning Pipelines [Long Paper]
Raoni Lourenço*; Juliana Freire; Dennis Shasha (New York University)

15:15 - 15:30
MLearn: A Declarative Machine Learning Language for Database Systems [Short Paper]
Maximilian E Schüle*; Matthias Bungeroth; Alfons Kemper; Stephan Günnemann; Thomas Neumann (TU Munich)



 
15:30 - 16:30
Poster Session for all SIGMOD Workshops


16:30 - 16:50
CrossTrainer: Practical Domain Adaptation with Loss Reweighting [Long Paper]
Justin Y Chen*; Edward Gan; Kexin Rong; Sahaana Suri; Peter D Bailis (Stanford University)

16:50 - 17:05
NodeGroup: A Knowledge-driven Data Management Abstraction for Industrial Machine Learning [Short Paper]
Vijay S Kumar*; Paul Cuddihy; Kareem S Aggour (GE Research)

17:05 - 17:20
Expressiveness of Matrix and Tensor Query Languages in terms of ML Operators [Short Paper]
Jorge Pérez*; Pablo Barceló; Nelson Higuera; Bernardo Subercaseaux (Universidad de Chile)

17:20 - 17:35
DROP: A Workload-Aware Optimizer for Dimensionality Reduction [Long Paper]
Sahaana Suri*; Peter D Bailis (Stanford University)


 


Important Dates
Submission Deadline: 18th of March (extended from the 11th of March), 5pm Pacific Time
Submission Website: https://cmt3.research.microsoft.com/DEEM2019
Notification of Acceptance: 22nd of April
Final papers due: 6th of May
Workshop: Sunday, 30th of June


Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.

For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use must be chosen; this often means trying out a set of popular approaches, such as linear models, decision trees, and deep neural networks, on the problem at hand. The prediction quality of such ML models heavily depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing the lifecycle of models (which become stale over time) to be managed. Managing this lifecycle requires careful bookkeeping of metadata and lineage (“which data was used to train this model?”, “which models are affected by changes in this feature?”) and involves methods for continuous analysis, validation, and monitoring of data and models in production. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills, who may prefer an easy-to-use cloud-based solution, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms.
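To make concrete how quickly these concerns accumulate in even a small application, the sketch below (a minimal illustration only; the input files, column names, and model choices are hypothetical) combines relational-style preprocessing, feature encoding, and an offline hyperparameter search into a single scikit-learn pipeline. Deployment, lineage bookkeeping, and monitoring would still have to be handled outside of it, which is precisely where the data management questions above begin.

# Illustrative sketch only: file names, columns, and model choices are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Relational-style preprocessing: join two tables and derive features and a label.
orders = pd.read_csv("orders.csv")        # hypothetical input table
customers = pd.read_csv("customers.csv")  # hypothetical input table
data = orders.merge(customers, on="customer_id")
X = data[["country", "num_items", "total_price"]]
y = data["returned"]

# Feature extraction and model training expressed as one pipeline.
features = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ("numeric", StandardScaler(), ["num_items", "total_price"]),
])
pipeline = Pipeline([("features", features),
                     ("model", LogisticRegression(max_iter=1000))])

# Costly offline evaluation: hyperparameter search via cross-validation.
search = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)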

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.


Topics of Interest
Areas of particular interest for the workshop include (but are not limited to):


Submission

The workshop will have two tracks, for regular research papers and industrial papers. Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD). DEEM is a single-blind workshop; authors must include their names and affiliations on the manuscript cover page.

Submission Website: https://cmt3.research.microsoft.com/DEEM2019


Accepted Papers

LONG PAPERS

Osprey: Weak Supervision of Imbalanced Extraction Problems without Code
Eran Bringer; Abraham Israeli; Yoav Shoham (Intel); Alex Ratner; Christopher Ré (Stanford University)

Debugging Machine Learning Pipelines
Raoni Lourenço; Juliana Freire; Dennis Shasha (New York University)

CrossTrainer: Practical Domain Adaptation with Loss Reweighting
Justin Y Chen; Edward Gan; Kexin Rong; Sahaana Suri; Peter D Bailis (Stanford University)

DROP: A Workload-Aware Optimizer for Dimensionality Reduction
Sahaana Suri; Peter D Bailis (Stanford University)

SHORT PAPERS

Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems
Supun Nakandala; Yuhao Zhang; Arun Kumar (University of California, San Diego)

Automated Management of Deep Learning Experiments
Gharib Gharibi; Vijay Walunj; Rakan N Alanazi; Sirisha Rella; Yugyung Lee (University of Missouri-Kansas City)

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach
Kihyun Tae; Yuji Roh; Young Hun Oh; Hyunsu Kim; Steven Whang (KAIST)

The ML Data Prep Zoo: Towards Semi-Automatic Data Preparation for ML
Vraj Shah; Arun Kumar (University of California, San Diego)

MLearn: A Declarative Machine Learning Language for Database Systems
Maximilian E Schüle; Matthias Bungeroth; Alfons Kemper; Stephan Günnemann; Thomas Neumann (TU Munich)

NodeGroup: A Knowledge-driven Data Management Abstraction for Industrial Machine Learning
Vijay S Kumar; Paul Cuddihy; Kareem S Aggour (GE Research)

Expressiveness of Matrix and Tensor Query Languages in terms of ML Operators
Jorge Pérez; Pablo Barceló; Nelson Higuera; Bernardo Subercaseaux (Universidad de Chile)

Invited Speakers
Academic Keynote: Ce Zhang (ETH Zürich)

Ce Zhang is an Assistant Professor in Computer Science at ETH Zurich. He believes that by making data—along with the processing of data—easily accessible to non-CS users, we have the potential to make the world a better place. His current research focuses on building data systems to support machine learning and help facilitate other sciences. Before joining ETH, Ce was advised by Christopher Ré. He finished his PhD round-tripping between the University of Wisconsin-Madison and Stanford University, and spent another year as a postdoctoral researcher at Stanford. His PhD work produced DeepDive, a trained data system for automatic knowledge-base construction. He participated in the research efforts that won the SIGMOD Best Paper Award (2014) and the SIGMOD Research Highlight Award (2015), and his work was featured in Science (2017), the Communications of the ACM (2017), “Best of VLDB” (2015), and Nature (2015).

"Software Engineering 2.0" for "Software 2.0": Towards Data Management for Statistical Generalization

When training a machine learning model becomes fast, and model selection and hyper-parameter tuning become automatic, will non-CS experts finally have the tool they need to build ML applications all by themselves? In this talk, I will focus on those users who are still struggling -- not because of the speed and the lack of automation of an ML system, but because it is so powerful that it is easily misused as an "overfitting machine." For many of these users, the quality of their ML applications might actually decrease with these powerful tools without proper guidelines and feedback (like what "software engineering" provides for traditional software development).
In particular, I will talk about two systems, ease.ml/ci and ease.ml/meter, which we built as an early attempt at an ML system that tries to enforce the right user behavior during the development process of ML applications. The first, ease.ml/ci, is a "continuous integration engine" for ML that gives developers a pass/fail signal for each developed ML model depending on whether it satisfies certain predefined properties over the (unknown) "true distribution". The second, ease.ml/meter, is a system that continuously returns some notion of the "degree of overfitting" to the developer. The core technical challenge is how to answer adaptive statistical queries in a rigorous but practical (in terms of label complexity) way. Interestingly, both systems can be seen as a new type of data management system which, instead of managing the (relational) querying of the data, manages the statistical generalization power of the data.
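As a rough illustration of the kind of statistical pass/fail signal described above, consider the following sketch. It is not the ease.ml/ci API or protocol, just a simplified stand-in: a candidate model passes only if its estimated accuracy gain over the current model on a freshly drawn labeled sample exceeds a margin, with the sample size chosen via a Hoeffding bound so that the estimate is within epsilon of the true gain with probability at least 1 - delta (a crude view of label complexity).

# Hypothetical sketch of a statistical pass/fail test for a new model,
# in the spirit of (but not identical to) a CI engine for ML such as ease.ml/ci.
import math
import random

def required_sample_size(epsilon, delta):
    # Hoeffding bound for the mean of per-example accuracy differences in [-1, 1]:
    # P(|estimate - truth| >= epsilon) <= 2 * exp(-n * epsilon**2 / 2)
    return math.ceil(2.0 * math.log(2.0 / delta) / epsilon ** 2)

def passes_ci(new_model, old_model, labeled_sample, margin, epsilon):
    # Accept the new model only if its estimated accuracy gain over the old
    # model exceeds `margin` even after subtracting the estimation error.
    diffs = [
        (new_model(x) == y) - (old_model(x) == y)  # per-example gain in {-1, 0, 1}
        for x, y in labeled_sample
    ]
    estimated_gain = sum(diffs) / len(diffs)
    return estimated_gain - epsilon > margin

# Toy usage with synthetic data and trivial "models" (assumptions, not real workloads).
epsilon, delta = 0.05, 0.01
n = required_sample_size(epsilon, delta)           # roughly 4240 labels for these settings
sample = [(random.random(), 1) for _ in range(n)]  # hypothetical labeled examples
old = lambda x: 1 if x > 0.5 else 0
new = lambda x: 1 if x > 0.3 else 0
print(n, passes_ci(new, old, sample, margin=0.02, epsilon=epsilon))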


Industry Keynote: Leonidas Galanis (Facebook)

Leonidas Galanis is an Engineering Manager at Facebook, where he supports the distributed training platform team in Facebook's Artificial Intelligence Infrastructure organization. Prior to that, he managed the RocksDB and MySQL software engineering teams and built the engineering team that delivered the MyRocks server, which is used to store the Facebook social graph and Facebook Messenger data (among other data). Before his time at Facebook, he was a Director at Oracle, where he was responsible for the diagnostic & tuning pack and the real application testing option of the Oracle relational database. He received his Ph.D. in databases from the University of Wisconsin-Madison in 2004.

Distributed Training of Deep Learning Models for Recommendation Systems

Machine learning results have become increasingly impressive in recent years, with many successful real-world applications. Deep learning, in particular, has enabled major breakthroughs; for example, image classification can now achieve better results than humans. At Facebook, the need for faster training of deep learning models with larger amounts of data is growing. At the same time, it is essential to make use of existing hardware resources as they become available. We address these challenges with distributed training. This presentation provides an overview of distributed training at Facebook. There are challenges in reading training data and in interacting with model artifacts and training checkpoints during the often very long process of offline training. Moving and reconfiguring model data through the various training stages (offline, model warmup, and online training) while scaling the model size beyond what fits on a single host presents several problems. Devising distributed optimization algorithms that allow us to use increasingly many trainers to reduce training time, and to leverage existing resources opportunistically, is a very active research area.


Invited Talk: Julia Stoyanovich (New York University)

Julia Stoyanovich is an Assistant Professor at New York University in the Department of Computer Science and Engineering at the Tandon School of Engineering, and the Center for Data Science. She is a recipient of an NSF CAREER award and of an NSF/CRA CI Fellowship. Julia's research focuses on responsible data management and analysis practices: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data acquisition and processing lifecycle. She established the Data, Responsibly consortium, and serves on the New York City Automated Decision Systems Task Force. In addition to data ethics, Julia works on management and analysis of preference data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst.

Follow the Data! Responsible Data Science Starts with Responsible Data Management

Data science technology promises to improve people's lives, accelerate scientific discovery and innovation, and bring about positive societal change. Yet, if not used responsibly, this same technology can reinforce inequity, limit accountability, and infringe on the privacy of individuals. In my talk, I will describe the role that the database community can, and should, play in building a foundation of responsible data science -- for data-driven algorithmic decision making done in accordance with ethical and moral norms, and legal and policy considerations. I will highlight some of my recent technical work within the scope of the "Data, Responsibly" project, and will connect the technical insights to ongoing regulatory efforts in the US and elsewhere. Additional information about this project is available at https://dataresponsibly.github.io.




Organisation / People
Workshop Chairs:
Sebastian Schelter
New York University

Neoklis Polyzotis
Google AI

Stephan Seufert
Amazon Research

Manasi Vartak
MIT

Steering Committee:

Program Committee:


