DEEM: Workshop on Data Management for End-to-End Machine Learning @ SIGMOD'17

Sunday, May 14th

9:00 - 9:15

Welcome

9:15 - 10:30

Democratizing Advanced Analytics Beyond Just Plumbing [Academic Keynote]
Arun Kumar (UC San Diego)

Coffee Break

11:00 - 11:30

Data Integration for Machine Learning & Machine Learning for Data Integration [Invited Talk]
Xin Luna Dong (Amazon)

11:30 - 12:00

Model-based Pricing: Do Not Pay for More than What You Learn! [Paper]
Lingjiao Chen, Paraschos Koutris (University of Wisconsin-Madison), Arun Kumar (UC San Diego)

12:00 - 12:30

Using Word Embedding to Enable Semantic Queries in Relational Databases [Paper]
Rajesh Bordawekar (IBM Research), Oded Shmueli (Technion Haifa)

Lunch Break

14:00 - 15:00

Machine Learning for Recommender Systems at Twitter [Industry Keynote]
Quannan Li (Twitter)

15:00 - 15:30

EMT: End To End Model Training for MSR Machine Translation [Paper]
Vishal Chowdhary, Scott Greenwood (Microsoft Research)

Coffee Break

16:00 - 16:30

Snorkel: Creating Noisy Training Data to Overcome Machine Learning's Biggest Bottleneck [Invited Talk]
Stephen Bach (Stanford)

16:30 - 17:00

Versioning for end-to-end machine learning pipelines [Paper]
Tom van der Weide, Dimitris Papadopoulos, Oleg Smirnov, Michal Zielinski, Tim van Kasteren (Schibsted Media Group)

17:00 - 17:30

On Model Discovery For Hosted Data Science Projects [Paper]
Hui Miao, Ang Li, Larry Davis, Amol Deshpande (University of Maryland)

17:30 - 18:00

Towards Automatically Setting Language Bias in Relational Learning [Paper]
Jose Picado, Arash Termehchy, Alan Fern, Sudhanshu Pathak (Oregon State University)

Accepted Papers

Vishal Chowdhary, Scott Greenwood (Microsoft Research)
EMT: End To End Model Training for MSR Machine Translation

Jose Picado, Arash Termehchy, Alan Fern, Sudhanshu Pathak (Oregon State University)
Towards Automatically Setting Language Bias in Relational Learning

Tom van der Weide, Dimitris Papadopoulos, Oleg Smirnov, Michal Zielinski, Tim van Kasteren (Schibsted Media Group)
Versioning for end-to-end machine learning pipelines

Lingjiao Chen, Paraschos Koutris (University of Wisconsin-Madison), Arun Kumar (University of California San Diego)
Model-based Pricing: Do Not Pay for More than What You Learn!

Rajesh Bordawekar (IBM Research), Oded Shmueli (Technion Haifa)
Using Word Embedding to Enable Semantic Queries in Relational Databases

Hui Miao, Ang Li, Larry Davis, Amol Deshpande (University of Maryland)
On Model Discovery For Hosted Data Science Projects

Invited Speakers

Academic Keynote: Arun Kumar

Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, with a focus on problems related to usability, developability, performance, and scalability. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. A paper he co-authored was accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS and the Anthony C. Klug NCR Fellowship in Database Systems in 2015.

Democratizing Advanced Analytics Beyond Just Plumbing

Fast, scalable, in-database, hardware-conscious, streaming, cloud-based, [insert other awesome adjectives here] implementations of machine learning (ML) and inference algorithms are important. But democratizing advanced analytics requires a lot more work on improving the processes inherent in building and using ML models, not just great plumbing, akin to how democratizing relational data management took a lot more than just well-plumbed operator implementations. Alas, the ML community seems mostly content with improving algorithmic accuracy. In this talk, I will explain why I think our community is well positioned to step up to the plate by drawing upon key process-oriented lessons from relational data management. In particular, I will discuss two crucial but painful ML processes that require more attention: the model selection lifecycle during model building and managing complex inference during model deployment. These challenges demand creative identification and formalization of problems by academic and industrial researchers, closer interactions between researchers and practitioners, and truly bridging the knowledge gap between the worlds of data management, systems, and ML. While I do not have all the answers, I will illustrate these points with some interesting stories from my recent efforts to build new abstractions and systems to this end, including interacting with practitioners and ML researchers. I will also give a shout-out to a few projects by other researchers that I think also exemplify this new direction for our community.

Industry Keynote: Quannan Li

Quannan Li is a staff software engineer at Twitter, where he leads the quality and relevance modeling effort for the recommendation systems. Before joining Twitter, he obtained his PhD in computer vision and machine learning from UCLA. He is interested in big data analysis, recommendation systems, machine learning, data mining, and computer vision.

Machine Learning for Recommender Systems at Twitter

Recommendation systems are a core product of Twitter and help drive user growth and engagement. We started with ad hoc rules in these systems, but ML later proved to be much more effective. In this keynote, I will go over the challenges we face when applying ML to the recommendation systems and our solutions to these challenges.

Invited Talk: Xin Luna Dong

Xin Luna Dong is a Principal Scientist at Amazon, leading the effort to construct the Amazon Product Knowledge Graph. She was one of the major contributors to the Knowledge Vault project, and has led the Knowledge-based Trust project, which The Washington Post called the "Google Truth Machine". She has co-authored a book on "Big Data Integration", published 65+ papers in top conferences and journals, and given 20+ keynotes, invited talks, and tutorials. She received the VLDB Early Career Research Contribution Award for advancing the state of the art of knowledge fusion, and the Best Demo Award at SIGMOD 2005. She is the PC co-chair for SIGMOD 2018 and WAIM 2015, and serves as an area chair for SIGMOD 2017, CIKM 2017, SIGMOD 2015, ICDE 2013, and CIKM 2011.

Data Integration for Machine Learning and Machine Learning for Data Integration

For over two decades, the field of data integration has studied how to integrate data from multiple heterogeneous sources. On the one hand, data integration effectively enriches data and addresses the big-data needs for improving machine learning results. On the other hand, machine learning has been applied since the beginning of data integration to generate high-quality integrated data. This talk gives an overview of how data integration and machine learning help each other, and discusses the challenges of further improving data integration with state-of-the-art ML techniques.

Invited Talk: Stephen Bach

Stephen Bach is a postdoctoral scholar in the Stanford computer science department. His research focuses on weakly supervised machine learning, statistical relational learning, and information extraction. His goal is to design algorithms and systems that empower people to use machine learning with minimal intervention from computer scientists. He co-leads the development of the Snorkel framework for training data creation using generative models. He was recognized with the Larry S. Davis Doctoral Dissertation Award from the University of Maryland, College Park department of computer science.

Snorkel: Creating Noisy Training Data to Overcome Machine Learning's Biggest Bottleneck

The bottleneck in machine learning has shifted with the advent of data-hungry representation learning techniques like deep neural networks. Curating labeled training data has replaced feature engineering as the most expensive and time-consuming task for practitioners. In this talk, I'll describe Snorkel, a new framework for overcoming this bottleneck. Using novel statistical methods, we combine weak supervision sources like heuristic rules and related data sets, i.e., distant supervision, which are far less expensive to use than hand-labeling data. With the resulting estimated labels, we can train many kinds of state-of-the-art models. I'll discuss how users in industry, bioinformatics, computational social science, and other disciplines have built high-quality models with Snorkel in far less time than previously possible.

Important Dates

Submission Deadline: ~~February 1, 2017~~ ~~extended to Friday, February 17, 2017~~
Submission Website: https://cmt3.research.microsoft.com/DEEM2017/
Notification of Acceptance: March 17, 2017
Final papers due: ~~March 31st, 2017~~
Workshop: May 14th, 2017

Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.

For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen, for which a set of popular approaches such as linear models, decision trees, and deep neural networks often has to be tried out on the problem at hand. The prediction quality of such ML models depends heavily on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing for the lifecycle of models (which become stale over time) to be managed. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills who may prefer an easy-to-use cloud-based solution on the one hand, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms on the other hand.

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
