Original event: https://www.meetup.com/NYC-Machine-Learning/events/256605862/
--
"Doing large scale ML in production is hard" – Everyone who's tried
This talk is focused on ML systems, especially the less obvious pitfalls that have caused us trouble at Spotify.
This talk assumes a certain level of familiarity with ML: you'll get the most out of it if you have some experience with applied ML, ideally on production systems.
Romain Yon is a Staff ML Engineer at Spotify. Over the years, Romain has worked on many of the core ML systems that power Spotify today (Music Recommendation, Catalog Quality, Search Ranking, Ads, ...).
During the past year, Romain has been mostly focusing on designing reusable ML Infrastructure that can be leveraged throughout Spotify.
Prior to Spotify, Romain co-founded the startup https://linkurio.us while getting his MSc in ML from Georgia Tech.
4.
● Over 190M active users
● More than 40M songs
● Over 3B playlists
● Over 1B plays per day
5. 30% of these teams use ML in some capacity
(Eng org map, blurred)
6. ML Use Cases
● Recommendation - What should appear in this user's Discover Weekly?
● Ranking - Which shelves should appear on the home page?
● Classification - Which items in our catalog contain certain instruments?
● Estimation - How likely is this user to skip an ad?
7. What is Machine Learning Infrastructure?
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison. Hidden Technical Debt in Machine Learning Systems. Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS 2015), pp. 2503–2511, December 7–12, 2015, Montreal, Canada.
11. "ML systems have a special capacity for incurring technical debt, because they have all of the maintenance problems of traditional code plus an additional set of ML-specific issues."¹
[1] D. Sculley et al., Hidden Technical Debt in Machine Learning Systems, NIPS 2015, pp. 2503–2511.
34. Standard ML data format @ Spotify
● Input to an ML pipeline should use tf.Example protos stored inside TFRecord files
● Shared libraries serve as the interface for reading the input data
● Tooling to create, share and discover ML datasets
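As a sketch of what that standard looks like in practice (the helper below is illustrative, not Spotify's actual tooling), here is a minimal round trip through tf.Example and a TFRecord file:

```python
import tensorflow as tf

def to_example(features: dict) -> tf.train.Example:
    """Pack a flat dict of float/int/string values into a tf.train.Example."""
    feature = {}
    for name, value in features.items():
        if isinstance(value, float):
            feature[name] = tf.train.Feature(
                float_list=tf.train.FloatList(value=[value]))
        elif isinstance(value, int):
            feature[name] = tf.train.Feature(
                int64_list=tf.train.Int64List(value=[value]))
        else:
            feature[name] = tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[str(value).encode("utf-8")]))
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Write one example to a TFRecord file...
with tf.io.TFRecordWriter("/tmp/train.tfrecord") as writer:
    ex = to_example({"user_id": "u1", "play_count": 3, "skip_rate": 0.1})
    writer.write(ex.SerializeToString())

# ...and read it back through a typed schema.
schema = {
    "user_id": tf.io.FixedLenFeature([], tf.string),
    "play_count": tf.io.FixedLenFeature([], tf.int64),
    "skip_rate": tf.io.FixedLenFeature([], tf.float32),
}
dataset = tf.data.TFRecordDataset("/tmp/train.tfrecord").map(
    lambda record: tf.io.parse_single_example(record, schema))
```

Because every pipeline reads the same serialized format through the same schema, datasets become easy to create, share, and discover.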
58. Sharing logic & weights
● Weights need to be shared, for both the model and the transformation stages
● Sharing logic is very hard when training and serving use different stacks
● The fewer moving pieces, the fewer issues
⇨ Try to group (f_features, Θ_features, f_model & Θ_model) inside a single object
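One way to read that advice (a toy sketch, not Spotify's actual implementation) is to bundle the feature transform and the model, with both sets of weights, into one serializable object, so training and serving always load them together and can never drift apart:

```python
import pickle

class BundledModel:
    """Groups f_features/Θ_features and f_model/Θ_model in a single artifact."""

    def __init__(self, feature_means, weights, bias):
        self.feature_means = feature_means  # Θ_features: normalization state
        self.weights = weights              # Θ_model: linear coefficients
        self.bias = bias

    def featurize(self, raw):
        # f_features: the same mean-centering at training and serving time.
        return [x - m for x, m in zip(raw, self.feature_means)]

    def predict(self, raw):
        # f_model applied to the featurized input: a simple linear score.
        feats = self.featurize(raw)
        return sum(w * f for w, f in zip(self.weights, feats)) + self.bias

# Serialize and reload as one object: logic and weights travel together.
bundle = BundledModel(feature_means=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
restored = pickle.loads(pickle.dumps(bundle))
score = restored.predict([3.0, 4.0])  # (3-1)*0.5 + (4-2)*(-0.25) + 0.1 = 0.6
```

The design point is that there is exactly one moving piece to version, deploy, and roll back.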
63. ML Wisdom from Google
"The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time."
⇨ Martin Zinkevich - Rules of ML
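A minimal sketch of that pattern (all names here are illustrative): at serving time, write the exact feature vector used for each prediction as a JSON line, then rebuild the training set from those logs instead of recomputing features:

```python
import io
import json

def serve_and_log(features, model_score, log_stream):
    """Log the served features alongside the prediction, one JSON line each."""
    record = {"features": features, "score": model_score}
    log_stream.write(json.dumps(record) + "\n")
    return model_score

def training_examples(log_stream):
    """Rebuild training inputs from the serving log, guaranteeing parity."""
    for line in log_stream:
        record = json.loads(line)
        yield record["features"], record["score"]

# Serving side: an in-memory stream stands in for a real log pipeline.
log = io.StringIO()
serve_and_log({"user_age_days": 420, "context": "home"}, 0.73, log)

# Training side: read back exactly what was served.
log.seek(0)
examples = list(training_examples(log))
```

Training then consumes the logged features verbatim, so training/serving skew from re-derived features cannot occur.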
71. “Using ML in real-world production systems is complicated by a host of issues not found in small toy examples or even large offline research experiments. Testing and monitoring are key considerations for assessing the production-readiness of an ML system.”
Breck et al.
⇨ What’s your ML Test Score? A rubric for ML production systems
82. Data Validation
Three main stages of data validation
1. Validation of data against a schema (human-curated)
2. Validation of data against past data
3. Validation of serving data against training data
⇨ Ideally all 3 should be used in tandem
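The three stages can be sketched in a few lines of plain Python (a toy illustration; in practice, tooling such as TensorFlow Data Validation does this with protobuf schemas and statistics):

```python
def validate_schema(rows, schema):
    """Stage 1: every row must have the curated fields with the right types."""
    errors = []
    for i, row in enumerate(rows):
        for field, ftype in schema.items():
            if field not in row:
                errors.append(f"row {i}: missing field '{field}'")
            elif not isinstance(row[field], ftype):
                errors.append(f"row {i}: field '{field}' is not {ftype.__name__}")
    return errors

def mean(values):
    return sum(values) / len(values)

def drifted(current_rows, baseline_rows, field, tolerance=0.25):
    """Stages 2 & 3: flag when a field's mean shifts too far from a baseline
    (past data for stage 2, training data for stage 3)."""
    cur = mean([r[field] for r in current_rows])
    base = mean([r[field] for r in baseline_rows])
    return abs(cur - base) / (abs(base) or 1.0) > tolerance

schema = {"play_count": int, "skip_rate": float}
training = [{"play_count": 10, "skip_rate": 0.2},
            {"play_count": 12, "skip_rate": 0.3}]
serving = [{"play_count": 11, "skip_rate": 0.9}]  # skip_rate has drifted

schema_errors = validate_schema(serving, schema)          # stage 1
skew = drifted(serving, training, "skip_rate")            # stage 3
```

Used in tandem, the schema catches structural breakage while the statistical checks catch silent distribution shifts that type-check cleanly.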
89. Browse Ranking Service (architecture diagram). Components shown: the Browse Service and its Business Logic in front of a Featurizer and Model; Entity Data from the Genre, Playlist, and User services; served features streaming to GCS; a Streaming Window feeding a Generate Statistics step (training_stats.pb), checked against schema.pb by Data Validation.
93. Avoid model “hot swap”
ML is not a special snowflake:
● Avoid custom model-swap logic
● Use (sealed) containers
● Use container management systems (e.g. Kubernetes)
95. CI/CD for Model
Critical to keep both quality & velocity high
● Use Continuous Integration (Offline & Online metrics)
● Use Continuous Delivery
● Use low user impact environments (Canaries / Shadow)
96. Summary: our six learnings
● Rely on data standards
● Share logic & weights
● Share decoration logic
● Validate your data
● Use “stateless” containers
● Leverage CI/CD for ML
100. Sounds exciting?!
● We have several openings for ML Infra engineers
● Application link: bit.ly/spotify-ml-infra-engineer
● Check out the Spotify jobs page: spotifyjobs.com
● Questions? ⇨