NOTE THAT THIS IS AN ARCHIVED WEBSITE FOR A PREVIOUS VERSION OF DEEM!
THE WEBSITE FOR DEEM 2020 IS AVAILABLE HERE
The DEEM workshop will be held on Sunday, 30th of June in Amsterdam, NL in conjunction with SIGMOD/PODS 2019. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios.
The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments. Submissions can be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD).
Follow us on Twitter @deem_workshop or contact us via email at info[at]deem-workshop[dot]org. We also provide archived websites of previous versions of the workshop: DEEM 2017, DEEM 2018.
Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.
For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen; this often requires trying out a set of popular approaches, such as linear models, decision trees and deep neural networks, on the problem at hand. The prediction quality of such ML models heavily depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing the lifecycle of models (which become stale over time) to be managed. Managing this lifecycle requires careful bookkeeping of metadata and lineage (“which data was used to train this model?”, “which models are affected by changes in this feature?”) and involves methods for continuous analysis, validation, and monitoring of data and models in production. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills, who may prefer an easy-to-use cloud-based solution, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms.
DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
The workshop will have two tracks, one for regular research papers and one for industrial papers. Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD). DEEM is a single-blind workshop; authors must include their names and affiliations on the manuscript cover page.
Submission Website: https://cmt3.research.microsoft.com/DEEM2019
LONG PAPERS
SHORT PAPERS
Ce Zhang is an Assistant Professor in Computer Science at ETH Zurich. He believes that by making data, along with the processing of data, easily accessible to non-CS users, we have the potential to make the world a better place. His current research focuses on building data systems to support machine learning and help facilitate other sciences. Before joining ETH, Ce was advised by Christopher Ré. He finished his PhD round-tripping between the University of Wisconsin-Madison and Stanford University, and spent another year as a postdoctoral researcher at Stanford. His PhD work produced DeepDive, a trained data system for automatic knowledge-base construction. He participated in the research efforts that won the SIGMOD Best Paper Award (2014) and SIGMOD Research Highlight Award (2015), and was featured in special issues including Science magazine (2017), the Communications of the ACM (2017), “Best of VLDB” (2015), and Nature magazine (2015).
"Software Engineering 2.0" for "Software 2.0": Towards Data Management for Statistical Generalization
When training a machine learning model becomes fast, and model selection and hyper-parameter tuning become automatic, will non-CS experts finally have the tool they need to build ML applications all by themselves? In this talk, I will focus on those users who are still struggling -- not because of the speed and the lack of automation of an ML system, but because it is so powerful that it is easily misused as an "overfitting machine." For many of these users, the quality of their ML applications might actually decrease with these powerful tools without proper guidelines and feedback (like what "software engineering" provides for traditional software development).
In particular, I will talk about two systems, ease.ml/ci and ease.ml/meter, which we built as an early attempt at an ML system that tries to enforce the right user behavior during the development process of ML applications. The first, ease.ml/ci, is a "continuous integration engine" for ML that gives developers a pass/fail signal for each developed ML model depending on whether it satisfies certain predefined properties over the (unknown) "true distribution". The second, ease.ml/meter, is a system that continuously returns some notion of the "degree of overfitting" to the developer. The core technical challenge is how to answer adaptive statistical queries in a rigorous but practical (in terms of label complexity) way. Interestingly, both systems can be seen as a new type of data management system which, instead of managing the (relational) querying of the data, manages the statistical generalization power of the data.
Leonidas Galanis is an Engineering Manager at Facebook where he supports the distributed training platform team in Facebook's Artificial Intelligence Infrastructure organization. Prior to that, he was managing the RocksDB and MySQL Software Engineering teams, and built the engineering team that delivered the MyRocks server that is used to store the Facebook social graph and Facebook messenger data (among other data). Before his time at Facebook, he was a Director at Oracle, where he was responsible for the diagnostic & tuning pack and the real application testing option of the Oracle relational database. In 2004, he got his Ph.D. in databases from the University of Wisconsin-Madison.
Distributed Training of Deep Learning Models for Recommendation Systems
Machine learning results have become increasingly impressive in recent years, with many successful real-world applications. Deep learning, in particular, has enabled major breakthroughs. For example, image classification can now achieve better results than humans. At Facebook, the need for faster training of deep learning models with larger amounts of data is growing. At the same time, using existing hardware resources as they become available is essential. We are addressing these challenges with distributed training. This presentation provides an overview of distributed training at Facebook. There are challenges in reading training data and in interacting with model artifacts and training checkpoints during the often very long process of offline training. Moving and reconfiguring model data through the various training stages (offline, model warmup and online training), while scaling the model size beyond what can fit into a single host, presents several problems. Devising distributed optimization algorithms that allow us to use ever more trainers to reduce training time, as well as to leverage existing resources opportunistically, is a very active research area.
Julia Stoyanovich is an Assistant Professor at New York University in the Department of Computer Science and Engineering at the Tandon School of Engineering, and the Center for Data Science. She is a recipient of an NSF CAREER award and of an NSF/CRA CI Fellowship. Julia's research focuses on responsible data management and analysis practices: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data acquisition and processing lifecycle. She established the Data, Responsibly consortium, and serves on the New York City Automated Decision Systems Task Force. In addition to data ethics, Julia works on management and analysis of preference data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst.
Follow the Data! Responsible Data Science Starts with Responsible Data Management
Data science technology promises to improve people's lives, accelerate scientific discovery and innovation, and bring about positive societal change. Yet, if not used responsibly, this same technology can reinforce inequity, limit accountability, and infringe on the privacy of individuals. In my talk I will describe the role that the database community can, and should, play in building a foundation of responsible data science -- for data-driven algorithmic decision making done in accordance with ethical and moral norms, and legal and policy considerations. I will highlight some of my recent technical work in the scope of the "Data, Responsibly" project, and will connect the technical insights to ongoing regulatory efforts in the US and elsewhere. Additional information about this project is available at https://dataresponsibly.github.io.