Talk from Software Engineering for Machine Learning Workshop (SW4ML) at the Neural Information Processing Systems (NIPS) 2014 conference in Montreal, Canada on 2014-12-13.
Abstract:
Building a real system that incorporates machine learning as a part can be a difficult effort, both in terms of the algorithmic and engineering challenges involved. In this talk I will focus on the engineering side and discuss some of the practical issues we’ve encountered in developing real machine learning systems at Netflix and some of the lessons we’ve learned over time. I will describe our approach for building machine learning systems and how it comes from a desire to balance many different, and sometimes conflicting, requirements such as handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. I will focus on what it takes to put machine learning into a real system that works in a feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. I will address the particular software engineering challenges that we’ve faced in running our algorithms at scale in the cloud. I will also mention some simple design patterns that we’ve found to be useful across a wide variety of machine-learned systems.
4. 4
Netflix Scale
> 50M members
> 40 countries
> 1000 device types
Hours: > 2B/month
Plays: > 70M/day
Log 100B events/day
34.2% of peak US
downstream traffic
5. 5
Goal
Help members find content to watch and enjoy
to maximize member satisfaction and retention
6. 6
Everything is a Recommendation
Rows
Ranking
Over 75% of what
people watch
comes from our
recommendations
Recommendations
are driven by
Machine Learning
12. 12
Lesson 1:
Be flexible about where and when
computation happens.
13. 13
System Architecture
Offline: Process data
Nearline: Process events
Online: Process requests
Learning, Features, or Model
evaluation can be done at any
level
[Diagram: system architecture split into OFFLINE, NEARLINE, and ONLINE layers. Labeled components: Member, UI Client, Algorithm Service, Online Data Service, User Event Queue, Event Distribution, Netflix.Hermes, Netflix.Manhattan, Offline Data, Model training, Models, and Offline, Nearline, and Online Computation, each with a Machine Learning Algorithm. Labeled flows: Play, Rate, Browse...; Query results; Recommendations.]
More details on Netflix Techblog
14. 14
Where to place components?
Example: Matrix Factorization
Offline:
Collect sample of play data
Run batch learning algorithm like
SGD to produce factorization
Publish video factors
Nearline:
Solve user factors
Compute user-video dot products
Store scores in cache
Online:
Presentation-context filtering
Serve recommendations
[Diagram: the system architecture diagram from the previous slide, annotated for the matrix factorization example with $X$, $X \approx UV^{T}$, $V$, $A u_i = b$, $s_{ij} = u_i \cdot v_j$, and $s_{ij} > t$.]
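The Offline / Nearline / Online split above can be sketched in a few functions. This is a minimal illustration, not the system described in the talk: the function names, the L2 regularization, and the reading of "A u_i = b" as a regularized least-squares solve over the published video factors are all assumptions made for the example.

```python
import random

def sgd_factorize(plays, n_users, n_videos, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Offline: batch SGD over (user, video, value) triples so that X ~ U V^T."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_videos)]
    for _ in range(epochs):
        for i, j, x in plays:
            err = x - sum(a * b for a, b in zip(U[i], V[j]))
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V  # only V (the video factors) would be published

def solve_user_factors(V, user_plays, reg=0.1):
    """Nearline: solve A u = b with A = sum_j v_j v_j^T + reg*I and b = sum_j x_j v_j."""
    k = len(V[0])
    A = [[reg if r == c else 0.0 for c in range(k)] for r in range(k)]
    b = [0.0] * k
    for j, x in user_plays:
        for r in range(k):
            b[r] += x * V[j][r]
            for c in range(k):
                A[r][c] += V[j][r] * V[j][c]
    # Gaussian elimination with partial pivoting on the augmented matrix [A | b].
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    u = [0.0] * k
    for r in reversed(range(k)):
        u[r] = (M[r][k] - sum(M[r][c] * u[c] for c in range(r + 1, k))) / M[r][r]
    return u

def score_videos(u, V):
    """Nearline: user-video dot products s_ij = u_i . v_j, ready to be cached."""
    return [sum(a * b for a, b in zip(u, v)) for v in V]

def serve(scores, threshold, watched):
    """Online: filter by context (s_ij > t, not already watched), then rank."""
    keep = [j for j, s in enumerate(scores) if s > threshold and j not in watched]
    return sorted(keep, key=lambda j: scores[j], reverse=True)
```

The expensive step (learning V) runs rarely and in batch, the per-user solve reacts to new events, and the request path only filters and sorts cached scores.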
15. 15
Lesson 2:
Think about distribution starting from the
outermost levels.
16. 16
Three levels of Learning Distribution/Parallelization
1. For each subset of the population (e.g.
region)
Want independently trained and tuned models
2. For each combination of (hyper)parameters
Simple: Grid search
Better: Bayesian optimization using Gaussian
Processes
3. For each subset of the training data
Distribute over machines (e.g. ADMM)
Multi-core parallelism (e.g. HogWild)
Or… use GPUs
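The first two levels can be sketched as nested loops that are each trivially parallel. This is a rough sketch under stated assumptions: thread pools stand in for separate machines, plain grid search is used rather than Bayesian optimization, and `train_and_validate` is a hypothetical toy stand-in for the level-3 training job.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train_and_validate(region, params, data):
    """Hypothetical level-3 job: fit a toy shrunken mean, return validation error."""
    train, heldout = data[region]
    estimate = sum(train) / (len(train) + params["reg"])
    target = sum(heldout) / len(heldout)
    return (estimate - target) ** 2

def grid(space):
    """Yield every combination of hyperparameter values in the search space."""
    keys = sorted(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

def tune_region(region, space, data):
    """Level 2: evaluate each hyperparameter combination independently."""
    candidates = list(grid(space))
    with ThreadPoolExecutor() as pool:
        errors = list(pool.map(lambda p: train_and_validate(region, p, data), candidates))
    best = min(range(len(candidates)), key=errors.__getitem__)
    return candidates[best], errors[best]

def tune_all(regions, space, data):
    """Level 1: each subset of the population gets its own tuned model."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda r: tune_region(r, space, data), regions))
    return dict(zip(regions, results))
```

Starting from the outermost level keeps the parallelism simple: regions and hyperparameter combinations never need to communicate, so only the innermost training job needs a genuinely distributed algorithm.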
17. 17
Example: Training Neural Networks
Level 1: Machines in different
AWS regions
Level 2: Machines in same AWS
region
Spearmint or MOE for parameter
optimization
Condor, StarCluster, Mesos, etc. for
coordination
Level 3: Highly optimized, parallel
CUDA code on GPUs
18. 18
Lesson 3:
Design application software for
experimentation.
19. 19
Example development process
Experimentation environment: Idea → Data → Offline Modeling (R, Python, MATLAB, …) → Iterate → Final model
Production environment (A/B test): Implement in production system (Java, C++, …) → Actual output
Problems between the two environments: Data discrepancies, Code discrepancies, Missing post-processing logic, Performance issues
20. 20
Avoid dual implementations
[Diagram: Experiment code and Production code both built on a Shared Engine.]
Shared between Experiment and Production:
• Models
• Features
• Algorithms
• …
21. 21
Solution: Share and lean towards production
Developing machine learning is an iterative process
Want a short pipeline to rapidly try ideas
Want to see output of complete system, not just learned component
Make application components easy to experiment with
Share them between online, nearline, and offline
Make it possible to run individual parts of the software
Use the real code whenever possible
Have well-defined interfaces and formats to allow you to go off the beaten path
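One way to picture "use the real code whenever possible" is a single scoring engine that both the experiment and the request path call. The sketch below is illustrative only: `ScoringEngine`, its two features, and the hit-rate metric are hypothetical names invented for this example.

```python
from collections import Counter

class ScoringEngine:
    """Shared engine: the only place where features and scoring are implemented."""

    def __init__(self, weights):
        self.weights = weights

    def features(self, history, video, play_counts):
        return {
            "popularity": play_counts[video],
            "seen_before": 1.0 if video in history else 0.0,
        }

    def rank(self, history, candidates, play_counts):
        def score(video):
            values = self.features(history, video, play_counts)
            return sum(self.weights[name] * value for name, value in values.items())
        return sorted(candidates, key=score, reverse=True)

def offline_hit_rate(engine, sessions, candidates, play_counts, n=1):
    """Experiment path: replay logged sessions through the same ranking code."""
    hits = 0
    for history, next_play in sessions:
        hits += next_play in engine.rank(history, candidates, play_counts)[:n]
    return hits / len(sessions)

def handle_request(engine, history, candidates, play_counts, n=1):
    """Production path: the same engine, called once per request."""
    return engine.rank(history, candidates, play_counts)[:n]

if __name__ == "__main__":
    engine = ScoringEngine({"popularity": 1.0, "seen_before": -10.0})
    counts = Counter({"a": 3, "b": 1})
    print(offline_hit_rate(engine, [({"a"}, "b"), (set(), "a")], ["a", "b"], counts))
    print(handle_request(engine, {"a"}, ["a", "b"], counts))
```

Because the offline evaluation exercises the full ranking path rather than a reimplementation, a change to a feature or weight shows up in both places at once.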
22. 22
Lesson 4:
Make algorithms extensible and modular.
23. 23
Make algorithms and models extensible and modular
Algorithms often need to be tailored for a
specific application
Treating an algorithm as a black box is
limiting
Better to make algorithms extensible and
modular to allow for customization
Separate models and algorithms
Many algorithms can learn the same model
(e.g. a linear binary classifier)
Many algorithms can be trained on the same
types of data
Support composing algorithms
[Diagram: two designs compared ("vs."), drawn with the boxes Data, Parameters, Model, and Algorithm.]
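Separating models from algorithms can be sketched as one model class that two different learning algorithms both produce. The class and function names below are assumptions for illustration; they are not taken from the talk or from any particular library.

```python
import math

class LinearBinaryClassifier:
    """The model: just parameters plus a prediction rule, no training logic."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias

    def score(self, x):
        return self.bias + sum(w * v for w, v in zip(self.weights, x))

    def predict(self, x):
        return self.score(x) > 0.0

def train_perceptron(data, dim, epochs=10):
    """Algorithm 1: perceptron updates on misclassified examples."""
    model = LinearBinaryClassifier([0.0] * dim)
    for _ in range(epochs):
        for x, label in data:
            if model.predict(x) != label:
                sign = 1.0 if label else -1.0
                model.weights = [w + sign * v for w, v in zip(model.weights, x)]
                model.bias += sign
    return model

def train_logistic_sgd(data, dim, epochs=50, lr=0.5):
    """Algorithm 2: SGD on the logistic loss, producing the same kind of model."""
    model = LinearBinaryClassifier([0.0] * dim)
    for _ in range(epochs):
        for x, label in data:
            p = 1.0 / (1.0 + math.exp(-model.score(x)))
            err = (1.0 if label else 0.0) - p
            model.weights = [w + lr * err * v for w, v in zip(model.weights, x)]
            model.bias += lr * err
    return model

if __name__ == "__main__":
    data = [([1.0], True), ([-1.0], False)]
    for train in (train_perceptron, train_logistic_sgd):
        model = train(data, dim=1)
        print(train.__name__, [model.predict(x) for x, _ in data])
```

Code that consumes a `LinearBinaryClassifier` does not need to know which algorithm trained it, and either algorithm can be swapped or composed without touching the serving side.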
24. 24
Provide building blocks
Don’t start from scratch
Linear algebra: Vectors, Matrices, …
Statistics: Distributions, tests, …
Models, features, metrics, ensembles, …
Cost, distance, kernel, … functions
Optimization, inference, …
Layers, activation functions, …
Initializers, stopping criteria, …
…
Domain-specific components
Build abstractions on
familiar concepts
Make the software put
them together
25. 25
Example: Tailoring Random Forests
Use a custom
tree split
Customize to
run it for an
hour
Report a
custom metric
each iteration
Inspect the
ensemble
Using Cognitive Foundry: http://github.com/algorithmfoundry/Foundry
26. 26
Lesson 5:
Describe your input and output
transformations with your model.
27. 27
Putting learning in an application
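[Diagram: inside the Application, Feature Encoding feeds a Machine Learned Model ($\mathbb{R}^d \to \mathbb{R}^k$), followed by Output Decoding. Question posed for the encoding and decoding steps: application or model code?]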
28. 28
Example: Simple ranking system
High-level API: List<Video> rank(User u, List<Video> videos)
Example model description file:
{
  "type": "ScoringRanker",
  "scorer": {
    "type": "FeatureScorer",
    "features": [
      {"type": "Popularity", "days": 10},
      {"type": "PredictedRating"}
    ],
    "function": {
      "type": "Linear",
      "bias": -0.5,
      "weights": {
        "popularity": 0.2,
        "predictedRating": 1.2,
        "predictedRating*popularity": 3.5
      }
    }
  }
}
[Callouts on the description file: Ranker, Scorer, Features, Linear function, Feature transformations]
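A loader for a description file like the one above might look as follows. This is a sketch under several assumptions: the feature implementations and the shape of the `user` and `video` records are hypothetical, the weight key for a feature is assumed to be its type name with a lowercase first letter, and a key such as "predictedRating*popularity" is assumed to mean the product of the two feature values.

```python
import json

DESCRIPTION = json.loads("""
{"type": "ScoringRanker",
 "scorer": {"type": "FeatureScorer",
            "features": [{"type": "Popularity", "days": 10},
                         {"type": "PredictedRating"}],
            "function": {"type": "Linear", "bias": -0.5,
                         "weights": {"popularity": 0.2,
                                     "predictedRating": 1.2,
                                     "predictedRating*popularity": 3.5}}}}
""")

# Hypothetical feature implementations, looked up by the "type" field.
FEATURES = {
    "Popularity": lambda user, video, days: sum(video["daily_plays"][-days:]),
    "PredictedRating": lambda user, video: user["predicted_ratings"][video["id"]],
}

def feature_name(type_name):
    """Assumed naming rule: 'PredictedRating' is weighted as 'predictedRating'."""
    return type_name[0].lower() + type_name[1:]

def build_ranker(description):
    """Turn a model description into a rank(user, videos) function."""
    if description["type"] != "ScoringRanker":
        raise ValueError("unsupported ranker type: " + description["type"])
    specs = description["scorer"]["features"]
    function = description["scorer"]["function"]

    def score(user, video):
        values = {}
        for spec in specs:
            args = {key: value for key, value in spec.items() if key != "type"}
            values[feature_name(spec["type"])] = FEATURES[spec["type"]](user, video, **args)
        total = function["bias"]
        for term, weight in function["weights"].items():
            term_value = 1.0
            for name in term.split("*"):  # product terms are feature transformations
                term_value *= values[name]
            total += weight * term_value
        return total

    def rank(user, videos):
        return sorted(videos, key=lambda video: score(user, video), reverse=True)

    return rank
```

Keeping the feature list and transformations in the description means the application only calls `rank`, while encoding details travel with the model.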
29. 29
Lesson 6:
Don’t just rely on metrics for testing.
30. 30
Importance of Testing
Temptation: Use validation metrics to test software
When things work this seems great
When metrics don’t improve: was it the code, data, metric, idea, …?
Machine learning code involves intricate math and logic
Rounding issues, corner cases, …
Is that a + or -? (The math or paper could be wrong.)
Solution: Unit test
Testing of metric code is especially important
Test the whole system
Compare output for unexpected changes across versions
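As an illustration of unit-testing metric code, the sketch below implements NDCG@k and checks it against hand-computed values and corner cases. The choice of NDCG and of these particular cases is an assumption for the example, as is the convention of returning 0.0 when nothing is relevant.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal ordering; 0.0 if nothing is relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal

def test_perfect_ranking_scores_one():
    assert ndcg_at_k([3, 2, 1], 3) == 1.0

def test_matches_hand_computation():
    # The only relevant item sits at position 2: DCG = 1/log2(3), ideal DCG = 1.
    assert math.isclose(ndcg_at_k([0, 1], 2), 1 / math.log2(3))

def test_corner_cases():
    assert ndcg_at_k([0, 0], 2) == 0.0  # no relevant items
    assert ndcg_at_k([], 3) == 0.0      # empty ranking
    assert ndcg_at_k([1], 5) == 1.0     # k longer than the list

def test_rejects_non_positive_k():
    try:
        ndcg_at_k([1], 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for k=0")

if __name__ == "__main__":
    for test in (test_perfect_ranking_scores_one, test_matches_hand_computation,
                 test_corner_cases, test_rejects_non_positive_k):
        test()
    print("all metric tests passed")
```

Tests like these separate "the metric code is wrong" from "the idea did not help", which validation numbers alone cannot do.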
32. 32
Two ways to solve computational problems
Software Development: Know solution → Write code → Compile code → Test code → Deploy code
Machine Learning (steps may involve Software Development): Know relevant data → Develop algorithmic approach → Train model on data using algorithm → Validate model with metrics → Deploy model
33. 33
Take-aways for building machine learning software
Building machine learning is an iterative process
Make experimentation easy
Take a holistic view of both the application and experimental
environments
Optimize only what matters
Testing can be hard but is worthwhile
34. 34
Thank You
Justin Basilico
jbasilico@netflix.com
@JustinBasilico
We’re hiring