1 
Lessons Learned from 
Building Machine Learning Software at Netflix 
Justin Basilico 
Page Algorithms Engineering December 13, 2014 
@JustinBasilico 
Workshop 2014
2 
Introduction
3 
Introduction 
2006 2014
4 
Netflix Scale 
• > 50M members 
• > 40 countries 
• > 1000 device types 
• Hours: > 2B/month 
• Plays: > 70M/day 
• Log 100B events/day 
• 34.2% of peak US downstream traffic
5 
Goal 
Help members find content to watch and enjoy 
to maximize member satisfaction and retention
6 
Everything is a Recommendation 
Rows 
Ranking 
Over 75% of what people watch comes from our recommendations 
Recommendations are driven by Machine Learning
7 
Machine Learning Approach 
Problem 
Data 
Metrics 
Model Algorithm
8 
Models & Algorithms 
• Regression (Linear, logistic, elastic net) 
• SVD and other Matrix Factorizations 
• Factorization Machines 
• Restricted Boltzmann Machines 
• Deep Neural Networks 
• Markov Models and Graph Algorithms 
• Clustering 
• Latent Dirichlet Allocation 
• Gradient Boosted Decision Trees/Random Forests 
• Gaussian Processes 
• …
9 
Design Considerations 
Recommendations 
• Personal 
• Accurate 
• Diverse 
• Novel 
• Fresh 
Software 
• Scalable 
• Responsive 
• Resilient 
• Efficient 
• Flexible
10 
Software Stack 
http://techblog.netflix.com
11 
Lessons Learned
12 
Lesson 1: 
Be flexible about where and when 
computation happens.
13 
System Architecture 
• Offline: Process data 
• Nearline: Process events 
• Online: Process requests 
• Learning, Features, or Model evaluation can be done at any level 
[Figure: system architecture diagram split into OFFLINE, NEARLINE, and ONLINE layers. Labeled components: Member, UI Client, Algorithm Service, Online Computation, Online Data Service, Nearline Computation, Offline Computation, Offline Data, Model training, Models, Machine Learning Algorithm, Event Distribution, User Event Queue, Netflix.Hermes, Netflix.Manhattan. Labeled flows: Play, Rate, Browse...; Query results; Recommendations.] 
More details on Netflix Techblog
14 
Where to place components? 
• Example: Matrix Factorization 
• Offline: 
    • Collect sample of play data 
    • Run batch learning algorithm like SGD to produce factorization 
    • Publish video factors 
• Nearline: 
    • Solve user factors 
    • Compute user-video dot products 
    • Store scores in cache 
• Online: 
    • Presentation-context filtering 
    • Serve recommendations 
[Figure: the system architecture diagram from the previous slide, annotated with the matrix factorization quantities $X \approx UV^{T}$, $V$, $A u_i = b$, $s_{ij} = u_i \cdot v_j$, and $s_{ij} > t$.]
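The placement above can be sketched end to end in a few functions. This is a minimal illustration, not the Netflix implementation: the function names, the ridge term `reg`, and the reading of `A u_i = b` as regularized least squares (`A = V_S^T V_S + reg * I`, `b = V_S^T x`) are assumptions made for the example, and the "cache" is just a Python list.

```python
import random

def sgd_factorize(plays, n_users, n_videos, k=2, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Offline: batch SGD over (user, video, value) triples so that X ~ U V^T."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_videos)]
    for _ in range(epochs):
        for i, j, x in plays:
            err = x - sum(a * b for a, b in zip(U[i], V[j]))
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V  # only V, the video factors, would be published

def solve_linear(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for cc in range(c, n + 1):
                M[r][cc] -= factor * M[c][cc]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][c] * u[c] for c in range(r + 1, n))) / M[r][r]
    return u

def solve_user_factors(V, user_plays, reg=0.1):
    """Nearline: solve A u_i = b for one user (reg > 0 keeps A invertible)."""
    k = len(V[0])
    A = [[reg if r == c else 0.0 for c in range(k)] for r in range(k)]
    b = [0.0] * k
    for j, x in user_plays:
        for r in range(k):
            b[r] += x * V[j][r]
            for c in range(k):
                A[r][c] += V[j][r] * V[j][c]
    return solve_linear(A, b)

def score_videos(u, V):
    """Nearline: s_ij = u_i . v_j for every video; the result would be cached."""
    return [sum(a * b for a, b in zip(u, v)) for v in V]

def serve(scores, already_on_page, threshold):
    """Online: drop videos already shown or with s_ij <= t, then sort by score."""
    keep = [j for j, s in enumerate(scores) if s > threshold and j not in already_on_page]
    return sorted(keep, key=lambda j: scores[j], reverse=True)
```

In this sketch the expensive step (`sgd_factorize`) runs rarely, the per-user solve runs when new events arrive, and the request path only filters and sorts cached scores.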
15 
Lesson 2: 
Think about distribution starting from the 
outermost levels.
16 
Three levels of Learning Distribution/Parallelization 
1. For each subset of the population (e.g. region) 
    • Want independently trained and tuned models 
2. For each combination of (hyper)parameters 
    • Simple: Grid search 
    • Better: Bayesian optimization using Gaussian Processes 
3. For each subset of the training data 
    • Distribute over machines (e.g. ADMM) 
    • Multi-core parallelism (e.g. HogWild) 
    • Or… use GPUs
17 
Example: Training Neural Networks 
• Level 1: Machines in different AWS regions 
• Level 2: Machines in same AWS region 
    • Spearmint or MOE for parameter optimization 
    • Condor, StarCluster, Mesos, etc. for coordination 
• Level 3: Highly optimized, parallel CUDA code on GPUs
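The three levels can be sketched as independent jobs. This is a rough illustration under stated assumptions: a thread pool stands in for separate machines, plain grid search stands in for the Bayesian optimization named on the slide, and a closed-form one-dimensional ridge regression stands in for the level-3 training code. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train(data, reg):
    """Level 3 stand-in: fit y ~ w * x by ridge regression (closed form)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + reg)

def validate(w, data):
    """Mean squared error of the fitted weight on held-out pairs."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

def run_job(job):
    """One unit of outer-level work: a single (region, hyperparameter) pair."""
    region, reg, train_set, valid_set = job
    w = train(train_set, reg)
    return region, reg, validate(w, valid_set)

def tune_all(datasets, grid, max_workers=4):
    """Levels 1 and 2: every region crossed with every grid point, run independently."""
    jobs = [(region, reg, tr, va)
            for (region, (tr, va)), reg in product(datasets.items(), grid)]
    best = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for region, reg, loss in pool.map(run_job, jobs):
            if region not in best or loss < best[region][1]:
                best[region] = (reg, loss)
    return best  # region -> (best reg, validation loss)
```

Because the outer jobs share nothing, they are the cheapest to distribute, which is the reason for starting from the outermost levels before parallelizing the training code itself.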
18 
Lesson 3: 
Design application software for 
experimentation.
19 
Example development process 
Experimentation environment: Idea → Data → Offline Modeling (R, Python, MATLAB, …) → Iterate → Final model 
Production environment (A/B test): Implement in production system (Java, C++, …) → Actual output 
Pitfalls: Data discrepancies, Code discrepancies, Missing post-processing logic, Performance issues
20 
Avoid dual implementations 
[Diagram: Experiment and Production, with Experiment code and Production code built on a Shared Engine of models, features, algorithms, …]
21 
Solution: Share and lean towards production 
• Developing machine learning is an iterative process 
    • Want a short pipeline to rapidly try ideas 
    • Want to see output of complete system, not just learned component 
• Make application components easy to experiment with 
    • Share them between online, nearline, and offline 
    • Make it possible to run individual parts of the software 
    • Use the real code whenever possible 
• Have well-defined interfaces and formats to allow you to go off the beaten path
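One way to read "share and lean towards production" is that a feature is defined in exactly one place and both environments call it. The sketch below is illustrative only: `Video`, `popularity_feature`, and the two entry points are hypothetical names, not part of any Netflix system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Video:
    video_id: str
    plays_last_10_days: int

def popularity_feature(video, total_plays):
    """Shared feature code: the single definition of 'popularity'."""
    return video.plays_last_10_days / total_plays if total_plays else 0.0

def score_catalog(videos):
    """Shared engine entry point, callable from a notebook, a batch job, or a server."""
    total = sum(v.plays_last_10_days for v in videos)
    return {v.video_id: popularity_feature(v, total) for v in videos}

def offline_experiment(logged_videos):
    """Experiment path: inspect the raw scores produced by the shared engine."""
    return score_catalog(logged_videos)

def online_request(live_videos):
    """Production path: the same scores, turned into an ordered response."""
    scores = score_catalog(live_videos)
    return sorted(scores, key=scores.get, reverse=True)
```

Since both paths go through `score_catalog`, a change to the feature shows up in experiments and in production at the same time, which removes one source of the data and code discrepancies listed earlier.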
22 
Lesson 4: 
Make algorithms extensible and modular.
23 
Make algorithms and models extensible and modular 
• Algorithms often need to be tailored for a specific application 
    • Treating an algorithm as a black box is limiting 
    • Better to make algorithms extensible and modular to allow for customization 
• Separate models and algorithms 
    • Many algorithms can learn the same model (e.g. linear binary classifier) 
    • Many algorithms can be trained on the same types of data 
• Support composing algorithms 
[Diagram: two designs compared ("Vs."), each relating Data, Parameters, and Model; one of them makes the Algorithm an explicit component.]
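The separation of model and algorithm can be sketched with one model class and two learners that both produce it. The class names and the `learn` method are assumptions for illustration; they are not the Cognitive Foundry API.

```python
import math
from dataclasses import dataclass

@dataclass
class LinearBinaryClassifier:
    """The model: parameters plus an evaluation rule, no training logic."""
    weights: list
    bias: float = 0.0

    def score(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

class Perceptron:
    """One learning algorithm that produces a LinearBinaryClassifier."""
    def __init__(self, epochs=20):
        self.epochs = epochs

    def learn(self, data):
        model = LinearBinaryClassifier([0.0] * len(data[0][0]))
        for _ in range(self.epochs):
            for x, y in data:
                if y * model.score(x) <= 0:  # update only on mistakes
                    model.weights = [w + y * xi for w, xi in zip(model.weights, x)]
                    model.bias += y
        return model

class LogisticRegressionSGD:
    """A different algorithm that learns the same kind of model."""
    def __init__(self, epochs=200, learning_rate=0.5):
        self.epochs = epochs
        self.learning_rate = learning_rate

    def learn(self, data):
        model = LinearBinaryClassifier([0.0] * len(data[0][0]))
        for _ in range(self.epochs):
            for x, y in data:
                # Derivative of log(1 + exp(-y * s)) with respect to the score s.
                g = -y / (1.0 + math.exp(y * model.score(x)))
                model.weights = [w - self.learning_rate * g * xi
                                 for w, xi in zip(model.weights, x)]
                model.bias -= self.learning_rate * g
        return model
```

Code that consumes the model (serving, evaluation, serialization) depends only on `LinearBinaryClassifier`, so swapping the learner requires no change downstream.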
24 
Provide building blocks 
• Don’t start from scratch 
• Linear algebra: Vectors, Matrices, … 
• Statistics: Distributions, tests, … 
• Models, features, metrics, ensembles, … 
• Cost, distance, kernel, … functions 
• Optimization, inference, … 
• Layers, activation functions, … 
• Initializers, stopping criteria, … 
• … 
• Domain-specific components 
Build abstractions on familiar concepts 
Make the software put them together
25 
Example: Tailoring Random Forests 
• Use a custom tree split 
• Customize to run it for an hour 
• Report a custom metric each iteration 
• Inspect the ensemble 
Using Cognitive Foundry: http://github.com/algorithmfoundry/Foundry
26 
Lesson 5: 
Describe your input and output 
transformations with your model.
27 
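Putting learning in an application 
[Diagram: Application → Feature Encoding → Machine Learned Model ($\mathbb{R}^d \to \mathbb{R}^k$) → Output Decoding → Application. Question posed for the encoding and decoding steps: application or model code?]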
28 
Example: Simple ranking system 
• High-level API: List<Video> rank(User u, List<Video> videos) 
• Example model description file: 
{
  "type": "ScoringRanker",
  "scorer": {
    "type": "FeatureScorer",
    "features": [
      {"type": "Popularity", "days": 10},
      {"type": "PredictedRating"}
    ],
    "function": {
      "type": "Linear",
      "bias": -0.5,
      "weights": {
        "popularity": 0.2,
        "predictedRating": 1.2,
        "predictedRating*popularity": 3.5
      }
    }
  }
}
[Callouts on the description file: Ranker, Scorer, Features, Linear function, Feature transformations]
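A description file like this one can be turned into a working ranker by a small loader. The sketch below is in Python rather than the Java-style API on the slide, and it rests on several assumptions that the slide does not state: weight keys are the feature type in lower camel case, a key containing `*` means the product of the named features, and the `user` and `video` dictionaries (with `predicted_ratings` and `popularity` fields) are invented for the example.

```python
import json

DESCRIPTION = """
{
  "type": "ScoringRanker",
  "scorer": {
    "type": "FeatureScorer",
    "features": [
      {"type": "Popularity", "days": 10},
      {"type": "PredictedRating"}
    ],
    "function": {
      "type": "Linear",
      "bias": -0.5,
      "weights": {
        "popularity": 0.2,
        "predictedRating": 1.2,
        "predictedRating*popularity": 3.5
      }
    }
  }
}
"""

# Hypothetical feature encoders, keyed by the "type" field of each feature entry.
FEATURES = {
    "Popularity": lambda spec: lambda user, video: video["popularity"][spec["days"]],
    "PredictedRating": lambda spec: lambda user, video: user["predicted_ratings"][video["id"]],
}

def feature_name(type_name):
    """Assumed convention: weight keys are the feature type in lower camel case."""
    return type_name[0].lower() + type_name[1:]

def build_ranker(description):
    """Build (rank, score) functions from a parsed description."""
    scorer = description["scorer"]
    encoders = {feature_name(f["type"]): FEATURES[f["type"]](f) for f in scorer["features"]}
    function = scorer["function"]

    def score(user, video):
        values = {name: encode(user, video) for name, encode in encoders.items()}
        total = function["bias"]
        for key, weight in function["weights"].items():
            term = weight
            for name in key.split("*"):  # "a*b" is read as a product of features
                term *= values[name]
            total += term
        return total

    def rank(user, videos):
        return sorted(videos, key=lambda v: score(user, v), reverse=True)

    return rank, score

rank, score = build_ranker(json.loads(DESCRIPTION))
```

Keeping the feature list and the transformations in the same file as the weights means the encoding step travels with the model, so the application only needs to call `rank`.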
29 
Lesson 6: 
Don’t just rely on metrics for testing.
30 
Importance of Testing 
• Temptation: Use validation metrics to test software 
    • When things work this seems great 
    • When metrics don’t improve: was it the code, data, metric, idea, …? 
• Machine learning code involves intricate math and logic 
    • Rounding issues, corner cases, … 
    • Is that a + or -? (The math or paper could be wrong.) 
• Solution: Unit test 
    • Testing of metric code is especially important 
• Test the whole system 
    • Compare output for unexpected changes across versions
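As an illustration of unit testing metric code, the sketch below checks an NDCG implementation against values worked out by hand, including the corner cases the slide warns about. NDCG is chosen here only as a common ranking metric; the slides do not say which metrics were tested, and the function names are hypothetical.

```python
import math
import unittest

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal ordering; 0.0 when nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

class NdcgTest(unittest.TestCase):
    def test_perfect_ranking_scores_one(self):
        self.assertAlmostEqual(ndcg_at_k([3, 2, 1, 0], 4), 1.0)

    def test_hand_computed_value(self):
        # Ranking [0, 1]: DCG = 0/log2(2) + 1/log2(3); ideal ranking [1, 0]: DCG = 1.
        self.assertAlmostEqual(ndcg_at_k([0, 1], 2), 1 / math.log2(3))

    def test_corner_cases(self):
        self.assertEqual(ndcg_at_k([], 5), 0.0)          # empty list
        self.assertEqual(ndcg_at_k([0, 0], 2), 0.0)      # nothing relevant
        self.assertAlmostEqual(ndcg_at_k([1], 10), 1.0)  # k longer than the list
```

A test like this fails for a specific, local reason (a wrong discount, a division by zero), whereas a flat validation metric leaves the code, the data, and the idea all under suspicion.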
31 
Conclusions
32 
Two ways to solve computational problems 
Software Development: Know solution → Write code → Compile code → Test code → Deploy code 
Machine Learning (steps may involve Software Development): Know relevant data → Develop algorithmic approach → Train model on data using algorithm → Validate model with metrics → Deploy model
33 
Take-aways for building machine learning software 
• Building machine learning is an iterative process 
• Make experimentation easy 
• Take a holistic view of both the application and experimental environments 
• Optimize only what matters 
• Testing can be hard but is worthwhile
34 
Thank You 
Justin Basilico 
jbasilico@netflix.com 
@JustinBasilico 
We’re hiring


Editor's notes

  1. http://techblog.netflix.com/2013/03/system-architectures-for.html
  2. http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html
  3. http://jobs.netflix.com/jobs.php?id=NFX01267