Presentation by Matei Zaharia at the SOSP 2019 AI Systems workshop about the systems research challenges specific to machine learning systems, including debugging and performance optimization for ML. Covers research from Stanford DAWN and an industry perspective from Databricks.
1. What are the Unique Challenges and
Opportunities in Systems for ML?
Matei Zaharia
2. AI Researcher 😀: "AI is going to change all of computing!"
Systems Researcher 🙂
3. AI Researcher 😀: "It's intelligent and you don't need to program anymore and you just differentiate things..."
Systems Researcher 🙂
4. AI Researcher 😀: "How does it affect your research field?"
Systems Researcher 🙂
5. AI Researcher 😀: "How does it affect your research field?"
Systems Researcher 🙂: "Umm, I figured out a way to shave off some system calls!"
6. AI Researcher 😀: "How does it affect your research field?"
Networking Researcher 🙂: "I came up with a new congestion control scheme"
7. AI Researcher 😐
Networking Researcher 🙂: "I came up with a new congestion control scheme"
8. Motivation
ML workloads can certainly influence a lot of systems,
but what are the unique research challenges they raise?
Turns out there are a lot! ML is very different from
traditional software, and we should look at how that changes the systems we build.
9. My Perspective
Stanford DAWN: research lab focused on infrastructure for usable machine learning
Databricks: data & ML platform for 2000+ orgs
10. How Does ML Differ from Traditional Software?
Traditional Software:
• Goal: meet a functional specification
• Quality depends only on application code
• Mostly deterministic
Machine Learning:
• Goal: optimize a metric (e.g. accuracy)
• Quality depends on input data and tuning parameters
• Stochastic
11. Some Interesting Opportunities
ML Platforms: software for managing and productionizing ML
Data-oriented model training, QA and debugging tools
Optimizations leveraging the stochastic nature of ML
13. The ML Inference Bottleneck
Inference cost is often 100x higher than training
overall, and greatly limits deployments
Example: processing 1 video
stream in real time with CNNs
requires a $1000 GPU
14. Inference Optimization in NoScope
Idea: optimize execution of ML models for a specific application or query
• Model specialization: train a small DNN to recognize the specific class in the dataset (e.g. "buses in street video")
• Query optimization: tune a cascade of models to achieve a target accuracy
[Diagram: user query + dataset → specialized model → target model cascade]
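To make the cascade idea concrete, here is a minimal sketch (not the NoScope implementation; the specialized/reference model objects and the thresholds are hypothetical): a cheap, query-specific model answers the frames it is confident about, and only the uncertain frames are sent to the full reference model.

# Sketch of a NoScope-style cascade (illustrative; `specialized` and `reference`
# are hypothetical model objects, and the thresholds would be tuned offline to
# hit the user's target accuracy).
def cascade_predict(frame, specialized, reference, low=0.1, high=0.9):
    """Return True if the queried class (e.g. 'bus') appears in the frame."""
    p = specialized.predict_proba(frame)     # tiny, query-specific CNN
    if p >= high:                            # confidently present
        return True
    if p <= low:                             # confidently absent
        return False
    return reference.predict(frame)          # fall back to the full, expensive DNN

def answer_query(frames, specialized, reference):
    # Only frames the small model is unsure about pay the cost of the big model.
    return [cascade_predict(f, specialized, reference) for f in frames]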
16. Optimizing ML + SQL in BlazeIt
[Kang et al, CIDR 2019]
[Diagram: a SQL query over frames from video is compiled into a query plan with specialized DNNs, falling back to a full object detection DNN (ResNet-50)]
17. BlazeIt Optimizations
Aggregation queries: accelerate approximate queries by using the specialized model's output as a control variate for sampling. E.g.: find average # of cars/frame.
Limit queries: use specialized models to sort frames by likelihood of matching the query, then run the full model. E.g.: SELECT * FROM frames WHERE #(red buses) > 3 LIMIT 5
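As a rough illustration of the aggregation-query optimization (a sketch under assumed interfaces, not the BlazeIt code): the specialized model is cheap enough to count objects in every frame, so its exact mean is known, and it serves as a control variate for the expensive counts sampled from the full model.

import random
import statistics

def estimate_avg_count(frames, cheap_count, full_count, sample_size=1000):
    # Cheap counts over ALL frames: their exact mean is known.
    cheap_all = [cheap_count(f) for f in frames]
    mu_cheap = statistics.mean(cheap_all)

    # Expensive counts only on a random sample of frames.
    idx = random.sample(range(len(frames)), sample_size)
    y = [full_count(frames[i]) for i in idx]   # high-quality counts
    x = [cheap_all[i] for i in idx]            # matching cheap counts

    # Optimal control-variate coefficient c = Cov(x, y) / Var(x).
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    c = cov / statistics.variance(x)

    # The adjusted estimate has lower variance the more the two models correlate.
    return my + c * (mu_cheap - mx)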
20. Motivation
ML applications fail in complex, hard-to-debug ways
• Tesla cars crashing into lane dividers
• Gender classification that is much less accurate for some races
How can we test and improve quality of ML apps?
21. Model Assertions
Predicates on the input/output of an ML application (similar to software assertions)
[Kang, Raghavan et al, NeurIPS MLSys 2018]
Example over consecutive video frames: assert(cars should not flicker in and out)
Uses: improved training (data selection & weak supervision) and runtime monitoring
22. Example Assertions
Problem Domain                Assertion
Video analytics               Objects should not flicker in and out across frames
Autonomous vehicles           LIDAR and video object detectors should agree
Heart rhythm classification   Output class should not change frequently
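The first assertion in the table can be written as an ordinary predicate over consecutive frames; a minimal sketch, assuming detections are given as per-frame sets of (track_id, class) pairs:

def flicker_assertion(detections):
    """Flag frames where a tracked object is present, then missing, then present
    again: a likely false negative in the middle frame."""
    failures = []
    for t in range(1, len(detections) - 1):
        before, now, after = detections[t - 1], detections[t], detections[t + 1]
        for obj in before & after:
            if obj not in now:
                failures.append((t, obj))
    return failures   # an empty list means the assertion holds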
23. Using Model Assertions
Inference time
» Runtime monitoring
» Corrective action
Training time
» Active learning
» Weak supervision via
correction rules
24. Active Learning with Assertions:
Can assertions help select data to label & train on?
Key idea: new active learning algorithm samples data that
is most likely to reduce # failing assertions
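A hedged sketch of this selection policy (the paper's actual algorithm is more sophisticated, e.g. it learns how informative each assertion is; the model and assertion interfaces here are assumed):

def select_for_labeling(unlabeled, model, assertions, budget):
    # Count how many assertions each unlabeled point violates.
    def num_failures(x):
        output = model.predict(x)
        return sum(1 for check in assertions if not check(x, output))

    # Send the worst offenders to human labelers first.
    ranked = sorted(unlabeled, key=num_failures, reverse=True)
    return ranked[:budget]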
25. Active Learning with Assertions:
Can assertions help select data to label & train on?
Using assertions for active learning improves model quality.
[Chart: mAP by selection method for 2000 new labels]
26. Weak Supervision with Assertions:
Can assertions improve quality without human labeling?
Key idea: consistency constraints API lets devs say which
attributes should stay constant across outputs in a dataset
E.g. “each tracked object should always have same class”,
“each person should have consistent detected gender”
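One way such a constraint can generate labels without humans is through a correction rule; a minimal sketch, assuming a track is the list of per-frame predicted classes for one object:

from collections import Counter

def correct_track_labels(track):
    # "Each tracked object should always have the same class": overwrite
    # inconsistent predictions with the track's majority vote, then use the
    # corrected labels as weak supervision for retraining.
    majority_class, _ = Counter(track).most_common(1)[0]
    return [majority_class for _ in track]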
27. Weak Supervision with Assertions:
Can assertions improve quality without human labeling?
Task                     Pretrained   Weakly Supervised
AV perception (mAP)      10.6         14.1 (+33%)
Object detection (mAP)   34.4         49.9 (+45%)
ECG (% accuracy)         70.7         72.1 (+2%)
28. Model Quality After Retraining
[Images: detections from the original SSD model vs. the retrained SSD model]
[Kang, Raghavan et al, NeurIPS MLSys 2018]
30. ML at Industrial Scale
Today, ML development is ad-hoc:
• Hard to track experiments & metrics: users do it best-effort
• Hard to reproduce results: won’t happen by default
• Hard to share & deploy models: different dev & deploy stacks
Each app takes months to build, and then needs continuous maintenance!
31. ML Platforms
A new class of systems to manage the ML lifecycle
Pioneered by company-specific platforms: Facebook FBLearner, Uber Michelangelo, Google TFX, etc.
+ Standardize the data prep / training / deploy cycle: if you work with the platform, you get these!
– Limited to a few algorithms or frameworks
– Tied to one company's infrastructure
32. MLflow from Databricks
Open source, open-interface ML platform (mlflow.org)
• Works with any existing ML library and deployment service
[Diagram: three components. Reproducible Projects: a project spec with deps and params, plus your_code.py calling log_param("alpha", 0.5), log_metric("rmse", 0.2), log_model(my_model). Experiment Tracking: a tracking server with a UI and REST API. Deployment Targets: inference code, bulk scoring, and cloud serving tools.]
35. MLflow Tracking: Logging for ML
[Diagram: notebooks, local apps, and cloud jobs log to a tracking server over a REST API, with a UI and API on top]
mlflow.log_param("alpha", 0.5)
mlflow.log_metric("accuracy", 0.9)
...
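Putting the calls above into a complete run might look like the following sketch (the hyperparameter, metric value, and artifact are placeholders; by default runs are written to a local ./mlruns directory unless a tracking server is configured):

import mlflow

# mlflow.set_tracking_uri("http://localhost:5000")  # point at a tracking server if desired

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)

    # ... train the model here, then record its metrics ...
    mlflow.log_metric("accuracy", 0.9)

    # Any file (e.g. a serialized model) can be attached to the run as an artifact.
    with open("model.txt", "w") as f:
        f.write("serialized model would go here")
    mlflow.log_artifact("model.txt")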
37. MLflow Models: Packaging Models
Packages arbitrary code (not just model weights)
[Diagram: a packaging format wraps the model logic and exposes flavors (Python flavor, ONNX flavor, ...) consumed by batch inference, REST serving, and testing & debug tools such as LIME and TCAV]
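Because every logged model also exposes a generic "python function" flavor, downstream code can score it without knowing which library produced it; a small sketch with a placeholder run ID and feature names:

import mlflow.pyfunc
import pandas as pd

model_uri = "runs:/<run_id>/model"            # placeholder URI for a logged model
model = mlflow.pyfunc.load_model(model_uri)   # same call regardless of training library

batch = pd.DataFrame({"feature_1": [1.0, 2.0], "feature_2": [3.0, 4.0]})
predictions = model.predict(batch)            # usable for batch scoring or behind a REST server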
38. MLflow Community Growth
140 contributors from >50 companies since June 2018
850K downloads/month
Major external contributions:
• Docker & Kubernetes execution
• R API
• Integrations with PyTorch, H2O, HDFS, GCS, …
• Plugin system
39. Other ML-Specific Research Opportunities
Data validation and monitoring (e.g. TFX Data Validation)
Supervision-oriented systems (e.g. Snorkel, Overton)
Leveraging the numeric nature of ML for optimization,
security, etc (e.g. TASO, HogWild, SSP, federated ML)
40. Conclusion
Many systems problems specific to ML are not
heavily studied in research
• App lifecycle, data quality & monitoring, model QA, etc
These are also major problems in practice!
Follow DAWN’s research at dawn.cs.stanford.edu