DEEM: Workshop on Data Management for End-to-End Machine Learning @ SIGMOD'18

Academic Keynote: Jens Dittrich (Saarland University)

Jens Dittrich is a full professor of Computer Science in the area of Databases, Data Management, and Big Data at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He received an Outrageous Ideas and Vision Paper Award at CIDR 2011, a BMBF VIP Grant in 2011, a best paper award at VLDB 2014, three CS teaching awards (one in 2018 for a data science seminar), as well as ~10 presentation awards. He has been a PC member and area chair/group leader of prestigious international database conferences and journals such as PVLDB/VLDB, SIGMOD, ICDE, and VLDB Journal. He is on the scientific advisory board of Software AG. He was a keynote speaker at VLDB 2017: "Deep Learning (m)eats Databases". At Saarland University he co-organizes the Data Science Summer School. His research focuses on fast access to big data, in particular: data analytics on large datasets, scalability, main-memory databases, database indexing, reproducibility, and scalable data science. Since 2016 he has been working on a start-up at the intersection of data science and data management (http://daimond.ai). He tweets at https://twitter.com/jensdittrich.

Industry Keynote: Martin Zinkevich (Google)

Martin Zinkevich is a Research Scientist at Google. He received his Ph.D. from Carnegie Mellon University and has conducted research at Brown University, the University of Alberta, and the Machine Learning Group at Yahoo Research. His work has been published at numerous conferences such as NIPS, ICML, KDD, WWW, CIKM, AAAI, and COLT, as well as in the Journal of the ACM and the Journal of Machine Learning Research. Additionally, Martin contributes to the discussion on data management and engineering aspects of ML with his online book on Rules of Machine Learning: Best Practices for ML Engineering and a tutorial on Data Management Challenges in Production Machine Learning at SIGMOD 2017.

Invited Talk: Matei Zaharia (Stanford)

Matei Zaharia is an assistant professor at Stanford CS, where he works on computer systems and big data as part of Stanford DAWN. He is also co-founder and Chief Technologist of Databricks, the big data company commercializing Apache Spark. Prior to joining Stanford, he was an assistant professor of CS at MIT.

Invited Talk: Joaquin Vanschoren (TU Eindhoven)

Joaquin Vanschoren is an assistant professor of machine learning at the Eindhoven University of Technology (TU/e). His research focuses on the progressive automation of machine learning. He founded and leads OpenML.org, an open science platform for machine learning research used all over the world. He has received several demonstration and application awards and the Dutch Data Prize, and has been an invited speaker at ECDA, StatComp, AutoML@ICML, CiML@NIPS, Reproducibility@ICML, and many other conferences. He has also co-organized machine learning conferences (e.g. ECMLPKDD 2013, LION 2016, Discovery Science 2017) and many workshops, including the AutoML Workshop series at ICML.


Schedule

Friday, June 15th

8:15 - 8:30

Welcome

8:30 - 9:30

Data Science ≠ Machine Learning: Some Thoughts on the Role of Data Management in the new AI-Tsunami [Academic Keynote]
Jens Dittrich (Saarland University)

"Machine Learning", no wait, I mean "A.I.", no, that is the same as "Deep Learning", isn't it? What about "Data Science" or "Big Data Analytics", is that any better? Hmmmm, ok, let's phrase it like this: <something> is going on out there. And <something> has a lot to do with playing BS bingo with buzzwords. What is the relationship of the data management community to <something>? Where are opportunities? Where can we help? Where can we learn? How do we increase our impact in the <something>-world? In my talk, I will show: 1.) opportunities for doing research at the intersection of <something> and data management, 2.) experiences from teaching <something>, and 3.) experiences from solving problems in the <something>-domain together with domain experts.

9:30 - 10:00

Towards Interactive Curation & Automatic Tuning of ML Pipelines [Paper]
Carsten Binnig (TU Darmstadt), Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Zeyuan Shang (Brown University), Tim Kraska (MIT), Eli Upfal, Robert Zeleznik, Emanuel Zgraggen (Brown University)

10:00 - 10:30

MLflow: Supporting the End-to-End Machine Learning Lifecycle [Invited Talk]
Matei Zaharia (Stanford)

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, I’ll present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools. In addition, this kind of platform introduces new data management challenges that I will summarize in the talk.

Coffee Break

11:00 - 11:55

If Supervised Learning is the Answer, What is the Question? [Industry Keynote]
Martin Zinkevich (Google)

There have been many amazing results in machine learning: machines that play checkers, chess, poker, and Go better than humans, self-driving cars, self-flying helicopters, programs that can transcribe speech, et cetera. In comparison, supervised learning, where one has a labeled data set and wants to build a model for ranking, regression, or classification, is not considered as challenging. In fact, most of the above problems distinguish themselves by how they are not as simple as supervised learning, and require novel solutions.
In this talk, I will discuss how supervised learning is usually nestled inside a problem that is inherently more difficult to solve. Specifically, as teams continue to work on supervised learning problems, the issues they face have less and less to do with optimizing a known, simple objective, and have more to do with finding the right objective, which is inherently not a supervised learning problem. In order to solve this, we need to do three things: gather data and infrastructure with the intent of solving this problem in a disciplined fashion, formalize the questions that we need to ask, and develop algorithms to answer these questions. In this talk, I will show early steps toward these three tasks.

11:55 - 12:30

Short Talks (5 minutes per paper)
Learning State Representations for Query Optimization with Deep Reinforcement Learning
Jennifer Ortiz, Magdalena Balazinska (University of Washington), Johannes Gehrke (Microsoft), Sathiya Keerthi (Criteo Research)
Modelling Machine Learning Algorithms on Relational Data with Datalog
Nantia Makrynioti (Athens University of Economics and Business), Nikolaos Vasiloglou (RelationalAI), Emir Pasalic (Infor Retail), Vasilis Vassalos (Athens University of Economics and Business)
End-to-End Machine Learning with Apache AsterixDB
Xikui Wang, Wail Alkowaileet (UC Irvine), Sattam Alsubaiee (Center for Complex Engineering Systems at KACST and MIT), Michael Carey, Chen Li (UC Irvine), Heri Ramampiaro (Norwegian University of Science and Technology), Phanwadee Sinthong (UC Irvine)
Exploring the Utility of Developer Exhaust
Jian Zhang, Max Lam, Stephanie Wang, Paroma Varma, Luigi Nardi, Kunle Olukotun, Christopher Re (Stanford University)
AC/DC: In-Database Learning Thunderstruck
Mahmoud Abo Khamis, Hung Ngo (RelationalAI), XuanLong Nguyen (University of Michigan), Dan Olteanu, Maximilian Schleich (University of Oxford)
Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities
Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya Parameswaran (University of Illinois at Urbana-Champaign)
Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue
Jose Picado, Arash Termehchy, Sudhanshu Pathak (Oregon State University)

Lunch Break

14:00 - 15:30

ML/AI Systems and Applications: Is the SIGMOD/VLDB Community Losing Relevance? [Panel]

Ce Zhang (ETH Zurich)
Joseph Gonzalez (UC Berkeley)
Joaquin Vanschoren (Eindhoven University)
Matei Zaharia (Stanford)
Jens Dittrich (Saarland University)

Coffee Break & Poster Session

16:30 - 17:00

Democratizing and Automating Machine Learning [Invited Talk]
Joaquin Vanschoren (Eindhoven University)

Building machine learning systems remains something of a (black) art, requiring a lot of prior experience to compose appropriate ML workflows and their hyperparameters. To democratize machine learning, and make it easily accessible to those who need it, we need a more principled approach to experimentation to understand how to build machine learning systems and progressively automate this process as much as possible. First, we created OpenML, an open science platform allowing scientists to share datasets and train many machine learning models from many software tools in a frictionless yet principled way. It also organizes all results online, providing detailed insight into the performance of machine learning techniques, and allowing a more scientific, data-driven approach to building new machine learning systems. Second, we use this knowledge to create automatic machine learning (AutoML) techniques that learn from these experiments to help people build better models, faster, or automate the process entirely.

17:00 - 17:30

Snorkel MeTaL: Weak Supervision for Multi-Task Learning [Paper]
Alexander Ratner, Braden Hancock, Jared Dunnmon, Christopher Re (Stanford University)

17:30 - 18:00

Avatar: Large Scale Entity Resolution of Heterogeneous User Profiles [Paper]
Janani Balaji, Chris Min, Faizan Javed, Yun Zhu (CareerBuilder)

18:00 - 18:15

Closing Remarks


Accepted Papers

Long Talks

Towards Interactive Curation & Automatic Tuning of ML Pipelines
Carsten Binnig (TU Darmstadt), Benedetto Buratti, Yeounoh Chung, Cyrus Cousins, Zeyuan Shang (Brown University), Tim Kraska (MIT), Eli Upfal, Robert Zeleznik, Emanuel Zgraggen (Brown University)
Avatar: Large Scale Entity Resolution of Heterogeneous User Profiles
Janani Balaji, Chris Min, Faizan Javed, Yun Zhu (CareerBuilder)
Snorkel MeTaL: Weak Supervision for Multi-Task Learning
Alexander Ratner, Braden Hancock, Jared Dunnmon, Christopher Re (Stanford University)

Short Talks

Learning State Representations for Query Optimization with Deep Reinforcement Learning
Jennifer Ortiz, Magdalena Balazinska (University of Washington), Johannes Gehrke (Microsoft), Sathiya Keerthi (Criteo Research)
Modelling Machine Learning Algorithms on Relational Data with Datalog
Nantia Makrynioti (Athens University of Economics and Business), Nikolaos Vasiloglou (RelationalAI), Emir Pasalic (Infor Retail), Vasilis Vassalos (Athens University of Economics and Business)
End-to-End Machine Learning with Apache AsterixDB
Xikui Wang, Wail Alkowaileet (UC Irvine), Sattam Alsubaiee (Center for Complex Engineering Systems at KACST and MIT), Michael Carey, Chen Li (UC Irvine), Heri Ramampiaro (Norwegian University of Science and Technology), Phanwadee Sinthong (UC Irvine)
Exploring the Utility of Developer Exhaust
Jian Zhang, Max Lam, Stephanie Wang, Paroma Varma, Luigi Nardi, Kunle Olukotun, Christopher Re (Stanford University)
AC/DC: In-Database Learning Thunderstruck
Mahmoud Abo Khamis, Hung Ngo (RelationalAI), XuanLong Nguyen (University of Michigan), Dan Olteanu, Maximilian Schleich (University of Oxford)
Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities
Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya Parameswaran (University of Illinois at Urbana-Champaign)
Learning Efficiently Over Heterogeneous Databases: Sampling and Constraints to the Rescue
Jose Picado, Arash Termehchy, Sudhanshu Pathak (Oregon State University)


Important Dates

Submission Deadline: ~~12th of March~~ extended to March 19th, 5pm Pacific Time
Submission Website: https://cmt3.research.microsoft.com/DEEM2018
Notification of Acceptance: 16th of April
Final papers due: 30th of April
Workshop: Friday, 15th of June


Panel

ML/AI Systems and Applications: Is the SIGMOD/VLDB Community Losing Relevance?

ML/AI systems and ML/AI-powered applications are transforming the landscape of computing, with almost all major tech companies pivoting towards an "AI-first" future and many enterprise companies creating applied ML/AI labs. A motley of computing research communities ranging from core ML/AI to data management, systems, computer architecture, human-computer interaction, programming languages, software engineering, and more are increasingly tackling novel technical problems posed by the new preponderance of ML/AI. As the home of data management and data systems research, is the SIGMOD/VLDB community really stepping up to the plate of driving a data-centric agenda in this increasingly important direction, or is it losing relevance and ceding leadership on data-centric research to nearby communities such as NSDI/OSDI/SOSP, HPCA/ISCA, etc.?

This panel brings together experts from multiple pertinent research communities to discuss and debate various aspects of the above question and chart the paths forward. There will be three main topics for discussion: research content and problem selection, logistics and optics of publication venues, and training the next generation of students.

The following panelists have confirmed their participation so far:

Joseph Gonzalez
Assistant Professor at UC Berkeley

Joaquin Vanschoren
Assistant Professor of Machine Learning at the Eindhoven University of Technology

Matei Zaharia
Assistant Professor of Computer Science, Stanford

Jens Dittrich
Professor of Databases, Data Management and Big Data at Saarland University

Manasi Vartak
PhD student, MIT


Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.

For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen, which often means trying out a set of popular approaches such as linear models, decision trees, and deep neural networks on the problem at hand. The prediction quality of such ML models heavily depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing for the lifecycle of models (that become stale over time) to be managed. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills who may prefer an easy-to-use cloud-based solution on the one hand, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms on the other hand.
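
As a rough illustration of why the offline evaluation of features and hyperparameters lends itself to parallelization, the following minimal Python sketch evaluates every point of a small hyperparameter grid independently and keeps the best one. Everything in it is an assumption made for illustration and is not taken from any workshop paper: the toy data, the one-dimensional ridge model, and the names `fit_ridge`, `evaluate`, and `grid_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy data following y = 2x, split into training and validation parts
# (made up purely for illustration).
TRAIN = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
VALID = [(4.0, 8.0), (5.0, 10.0)]


def fit_ridge(data, l2, scale):
    """Closed-form 1-D ridge regression (no intercept) on the feature x * scale."""
    sxx = sum((x * scale) ** 2 for x, _ in data)
    sxy = sum((x * scale) * y for x, y in data)
    return sxy / (sxx + l2)


def evaluate(config):
    """Train with one (l2, scale) configuration and score it on held-out data."""
    l2, scale = config
    w = fit_ridge(TRAIN, l2, scale)
    mse = sum((w * x * scale - y) ** 2 for x, y in VALID) / len(VALID)
    return mse, config


def grid_search(l2_values, scale_values, workers=4):
    """Evaluate all configurations concurrently and return the lowest-error one."""
    grid = list(product(l2_values, scale_values))
    # Each configuration is independent of the others, so the evaluations
    # can be handed to a pool of workers without any coordination.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate, grid))
    return min(results)


best_mse, best_config = grid_search([0.0, 1.0, 10.0], [0.5, 1.0])
print(best_mse, best_config)
```

A thread pool keeps the sketch self-contained; a realistic workload would more likely distribute the independent evaluations across processes or machines, which is where the optimization opportunities mentioned above come in.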

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
