MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud

•

13 recomendaciones•3,818 vistas

Xavier Amatriain

Ingeniería Datos y análisis Tecnología

Outline
■ Introduction
■ Emmy-winning Algorithms
■ Distributing ML Algorithms in Practice
■ An example: ANN over GPUs & AWS Cloud

What we were interested in:
■ High quality recommendations
Proxy question:
■ Accuracy in predicted rating
■ Improve by 10% = $1million!
Data size:
■ 100M ratings (back then “almost massive”)

Netflix Scale
▪ > 44M members
▪ > 40 countries
▪ > 1000 device types
▪ > 5B hours in Q3 2013
▪ Plays: > 50M/day
▪ Searches: > 3M/day
▪ Ratings: > 5M/day
▪ Log 100B events/day
▪ 31.62% of peak US downstream
traffic

Smart Models ■ Regression models (Logistic,
Linear, Elastic nets)
■ GBDT/RF
■ SVD & other MF models
■ Factorization Machines
■ Restricted Boltzmann Machines
■ Markov Chains & other graphical
models
■ Clustering (from k-means to
modern non-parametric models)
■ Deep ANN
■ LDA
■ Association Rules
■ …

2007 Progress Prize
▪ Top 2 algorithms
▪ MF/SVD - Prize RMSE: 0.8914
▪ RBM - Prize RMSE: 0.8990
▪ Linear blend Prize RMSE: 0.88
▪ Currently in use as part of Netflix’ rating prediction component
▪ Limitations
▪ Designed for 100M ratings, we have 5B ratings
▪ Not adaptable as users add ratings
▪ Performance issues

1. Do I need all that data?
2. At what level should I distribute/parallelize?
3. What latency can I afford?

Really?
Anand Rajaraman: Former Stanford Prof. &
Senior VP at Walmart

[Banko and Brill, 2001]
Norvig: “Google does not
have better Algorithms,
only more Data”
Many features/
low-bias models

The three levels of Distribution/Parallelization
1. For each subset of the population (e.g.
region)
2. For each combination of the
hyperparameters
3. For each subset of the training data
Each level has different requirements

Level 1 Distribution
■ We may have subsets of the
population for which we need to
train an independently optimized
model.
■ Training can be fully distributed
requiring no coordination or data
communication

Level 2 Distribution
■ For a given subset of the population we
need to find the “optimal” model
■ Train several models with different
hyperparameter values
■ Worst-case: grid search
■ Can do much better than this (E.g. Bayesian
Optimization with Gaussian Process Priors)
■ This process *does* require coordination
■ Need to decide on next “step”
■ Need to gather final optimal result
■ Requires data distribution, not sharing

Level 3 Distribution
■ For each combination of
hyperparameters, model training may
still be expensive
■ Process requires coordination and
data sharing/communication
■ Can distribute computation over
machines splitting examples or
parameters (e.g. ADMM)
■ Or parallelize on a single multicore
machine (e.g. Hogwild)
■ Or… use GPUs

ANN Training over GPUS and AWS
■ Level 1 distribution: machines over different AWS
regions
■ Level 2 distribution: machines in AWS and same AWS
region
■ Use coordination tools
■ Spearmint or similar for parameter optimization
■ Condor, StarCluster, Mesos… for distributed cluster coordination
■ Level 3 parallelization: highly optimized parallel CUDA
code on GPUs

3 shades of latency
▪ Blueprint for multiple
algorithm services
▪ Ranking
▪ Row selection
▪ Ratings
▪ Search
▪ …
▪ Multi-layered Machine
Learning

Xavier Amatriain (@xamat)
xavier@netflix.com
Thanks!
(and yes, we are hiring)

Más contenido relacionado

La actualidad más candente

A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...Natalia Díaz Rodríguez

Machine Learning to Grow the World's KnowledgeXavier Amatriain

Recommending the world's knowledgeLei Yang

Introduction To Applied Machine Learningananth

A Multi-Armed Bandit Framework For Recommendations at NetflixJaya Kawale

Explainable AI - making ML and DL models more interpretableAditya Bhattacharya

Replicable Evaluation of Recommender SystemsAlejandro Bellogin

Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialXavier Amatriain

Artificial Intelligence Course: Linear models ananth

Understanding Basics of Machine LearningPranav Ainavolu

Artwork Personalization at NetflixJustin Basilico

Recommending for the WorldYves Raimond

Machine learning basics Akanksha Bali

MaxEnt (Loglinear) Models - Overviewananth

Factorization Machines with libFMLiangjie Hong

Introduction to-machine-learningBabu Priyavrat

Introduction to Machine LearningShahar Cohen

Machine Learning BasicsSuresh Arora

Time, Context and Causality in Recommender SystemsYves Raimond

Lecture 2 Basic Concepts in Machine Learning for Language TechnologyMarina Santini

La actualidad más candente (20)

A Folksonomy of styles, aka: other stylists also said and Subjective Influenc...

Machine Learning to Grow the World's Knowledge

Recommending the world's knowledge

Introduction To Applied Machine Learning

A Multi-Armed Bandit Framework For Recommendations at Netflix

Explainable AI - making ML and DL models more interpretable

Replicable Evaluation of Recommender Systems

Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial

Artificial Intelligence Course: Linear models

Understanding Basics of Machine Learning

Artwork Personalization at Netflix

Recommending for the World

Machine learning basics

MaxEnt (Loglinear) Models - Overview

Factorization Machines with libFM

Introduction to-machine-learning

Introduction to Machine Learning

Machine Learning Basics

Time, Context and Causality in Recommender Systems

Lecture 2 Basic Concepts in Machine Learning for Language Technology

Destacado

MLConf Seattle 2015 - ML@QuoraXavier Amatriain

May: If I Were 22LinkedIn Editors' Picks

Chela stress testsuperserch

Adding Identity Management and Access Control to your ApplicationFernando Lopez Aguilar

Cwin16 - Paris - ux designCapgemini

Ia32 Modo ProtegidoErwin Meza

Past present and future of Recommender Systems: an Industry PerspectiveXavier Amatriain

Marc Stickdorn & Jakob Schneider – Mobile ethnography and ExperienceFellow, a...Jakob Schneider

Netflix Nebula - Gradle Summit 2014Justin Ryan

Cwin16 tls-partner-hpe-digital economy & Hybrid ITCapgemini

Disciplined agile business analysisScott W. Ambler

How Comcast uses Data Science to Improve the Customer ExperienceTuri, Inc.

How to Start a Startup at NYUNYU Entrepreneurial Institute

Introduction to the Innovation Corps (NSF I-Corps)NYU Entrepreneurial Institute

[PREMONEY 2014] Mayfield Fund >> Tim Chang, "Mobile Is The Future Of YOU: Why...500 Startups

Cwin16 - lyon - faurecia customer cockpitCapgemini

Destacado (16)

MLConf Seattle 2015 - ML@Quora

May: If I Were 22

Chela stress test

Adding Identity Management and Access Control to your Application

Cwin16 - Paris - ux design

Ia32 Modo Protegido

Past present and future of Recommender Systems: an Industry Perspective

Marc Stickdorn & Jakob Schneider – Mobile ethnography and ExperienceFellow, a...

Netflix Nebula - Gradle Summit 2014

Cwin16 tls-partner-hpe-digital economy & Hybrid IT

Disciplined agile business analysis

How Comcast uses Data Science to Improve the Customer Experience

How to Start a Startup at NYU

Introduction to the Innovation Corps (NSF I-Corps)

[PREMONEY 2014] Mayfield Fund >> Tim Chang, "Mobile Is The Future Of YOU: Why...

Cwin16 - lyon - faurecia customer cockpit

Similar a MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud

The Power of Auto ML and How Does it WorkIvo Andreev

Machine Learning for Fraud DetectionNitesh Kumar

Computer Vision for BeginnersSanghamitra Deb

Multi Model Machine Learning by Maximo Gurmendez and Beth LoganSpark Summit

Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain

Spark Summit EU talk by Josef HabdankSpark Summit

Big data 2.0, deep learning and financial UsecasesArvind Rapaka

10 Lessons Learned from Building Machine Learning SystemsXavier Amatriain

Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank

Ed Snelson. Counterfactual AnalysisVolha Banadyseva

Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks

A Framework for Scene Recognition Using Convolutional Neural Network as Featu...Tahmid Abtahi

Cooperative Machine Learning Network with Ahmed Masud of saf.aiChristopher Conlan

C3 w1Ajay Taneja

SparkML: Easy ML Productization for Real-Time BiddingDatabricks

Making sense of the Graph RevolutionInfiniteGraph

IEEE.BigData.Tutorial.2.slidesNish Parikh

Large scale Click-streaming and tranaction log miningitstuff

Machine Learning Introduction introducing basics of Machine Learningvipulkondekar

MongoDB.local Seattle 2019: Global Cluster Topologies in MongoDB AtlasMongoDB

Similar a MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud (20)

The Power of Auto ML and How Does it Work

Machine Learning for Fraud Detection

Computer Vision for Beginners

Multi Model Machine Learning by Maximo Gurmendez and Beth Logan

Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix Scale

Spark Summit EU talk by Josef Habdank

Big data 2.0, deep learning and financial Usecases

10 Lessons Learned from Building Machine Learning Systems

Prediction as a service with ensemble model in SparkML and Python ScikitLearn

Ed Snelson. Counterfactual Analysis

Understanding Parallelization of Machine Learning Algorithms in Apache Spark™

A Framework for Scene Recognition Using Convolutional Neural Network as Featu...

Cooperative Machine Learning Network with Ahmed Masud of saf.ai

C3 w1

SparkML: Easy ML Productization for Real-Time Bidding

Making sense of the Graph Revolution

IEEE.BigData.Tutorial.2.slides

Large scale Click-streaming and tranaction log mining

Machine Learning Introduction introducing basics of Machine Learning

MongoDB.local Seattle 2019: Global Cluster Topologies in MongoDB Atlas

Más de Xavier Amatriain

Data/AI driven product development: from video streaming to telehealthXavier Amatriain

AI-driven product innovation: from Recommender Systems to COVID-19Xavier Amatriain

AI for COVID-19 - Q42020 updateXavier Amatriain

AI for COVID-19: An online virtual care approachXavier Amatriain

Lessons learned from building practical deep learning systemsXavier Amatriain

AI for healthcare: Scaling Access and Quality of Care for EveryoneXavier Amatriain

Towards online universal quality healthcare through AIXavier Amatriain

From one to zero: Going smaller as a growth strategyXavier Amatriain

Learning to speak medicineXavier Amatriain

ML to cure the worldXavier Amatriain

Recommender Systems In IndustryXavier Amatriain

Medical advice as a Recommender SystemXavier Amatriain

Staying Shallow & Lean in a Deep Learning WorldXavier Amatriain

Machine Learning for Q&A Sites: The Quora ExampleXavier Amatriain

Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain

10 more lessons learned from building Machine Learning systems - MLConfXavier Amatriain

10 more lessons learned from building Machine Learning systemsXavier Amatriain

Lean DevOps - Lessons Learned from Innovation-driven CompaniesXavier Amatriain

Recsys 2014 Tutorial - The Recommender Problem RevisitedXavier Amatriain

Kdd 2014 Tutorial - the recommender problem revisitedXavier Amatriain

Más de Xavier Amatriain (20)

Data/AI driven product development: from video streaming to telehealth

AI-driven product innovation: from Recommender Systems to COVID-19

AI for COVID-19 - Q42020 update

AI for COVID-19: An online virtual care approach

Lessons learned from building practical deep learning systems

AI for healthcare: Scaling Access and Quality of Care for Everyone

Towards online universal quality healthcare through AI

From one to zero: Going smaller as a growth strategy

Learning to speak medicine

ML to cure the world

Recommender Systems In Industry

Medical advice as a Recommender System

Staying Shallow & Lean in a Deep Learning World

Machine Learning for Q&A Sites: The Quora Example

Past, present, and future of Recommender Systems: an industry perspective

10 more lessons learned from building Machine Learning systems - MLConf

10 more lessons learned from building Machine Learning systems

Lean DevOps - Lessons Learned from Innovation-driven Companies

Recsys 2014 Tutorial - The Recommender Problem Revisited

Kdd 2014 Tutorial - the recommender problem revisited

Último

FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsArindam Chakraborty, Ph.D., P.E. (CA, TX)

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b

Air Compressor reciprocating single stageAbc194748

Thermal Engineering-R & A / C - unit - VDineshKumar4165

Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...soginsider

Unleashing the Power of the SORA AI lastest leapRishantSharmaFr

A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath

+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...Health

Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1

Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Servicemeghakumariji156

S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture

Online electricity billing project report..pdfKamal Acharya

kiln thermal load.pptx kiln tgermal loadhamedmustafa094

Hostel management system project report..pdfKamal Acharya

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)

School management system project Report.pdfKamal Acharya

Thermal Engineering -unit - III & IV.pptDineshKumar4165

Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6

Rums floating Omkareshwar FSPV IM_16112021.pdfsmsksolar

Double Revolving field theory-how the rotor develops torqueBhangaleSonal

MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud

1. Distributing Large-scale ML Algorithms: from GPUs to the Cloud MMDS 2014 June, 2014 Xavier Amatriain Director - Algorithms Engineering @xamat

2. Outline ■ Introduction ■ Emmy-winning Algorithms ■ Distributing ML Algorithms in Practice ■ An example: ANN over GPUs & AWS Cloud

3. What we were interested in: ■ High quality recommendations Proxy question: ■ Accuracy in predicted rating ■ Improve by 10% = $1million! Data size: ■ 100M ratings (back then “almost massive”)

4. 2006 2014

5. Netflix Scale ▪ > 44M members ▪ > 40 countries ▪ > 1000 device types ▪ > 5B hours in Q3 2013 ▪ Plays: > 50M/day ▪ Searches: > 3M/day ▪ Ratings: > 5M/day ▪ Log 100B events/day ▪ 31.62% of peak US downstream traffic

6. Smart Models ■ Regression models (Logistic, Linear, Elastic nets) ■ GBDT/RF ■ SVD & other MF models ■ Factorization Machines ■ Restricted Boltzmann Machines ■ Markov Chains & other graphical models ■ Clustering (from k-means to modern non-parametric models) ■ Deep ANN ■ LDA ■ Association Rules ■ …

7. Netflix Algorithms “Emmy Winning”

9. Rating Prediction

10. 2007 Progress Prize ▪ Top 2 algorithms ▪ MF/SVD - Prize RMSE: 0.8914 ▪ RBM - Prize RMSE: 0.8990 ▪ Linear blend Prize RMSE: 0.88 ▪ Currently in use as part of Netflix’ rating prediction component ▪ Limitations ▪ Designed for 100M ratings, we have 5B ratings ▪ Not adaptable as users add ratings ▪ Performance issues

11. Ranking Ranking

12. Page composition

13. Similarity

14. Search Recommendations

15. Postplay

16. Gamification

17. Distributing ML algorithms in practice

18. 1. Do I need all that data? 2. At what level should I distribute/parallelize? 3. What latency can I afford?

19. Do I need all that data?

20. Really? Anand Rajaraman: Former Stanford Prof. & Senior VP at Walmart

21. Sometimes, it’s not about more data

22. [Banko and Brill, 2001] Norvig: “Google does not have better Algorithms, only more Data” Many features/ low-bias models

23. Sometimes, it’s not about more data

24. At what level should I parallelize?

25. The three levels of Distribution/Parallelization 1. For each subset of the population (e.g. region) 2. For each combination of the hyperparameters 3. For each subset of the training data Each level has different requirements

26. Level 1 Distribution ■ We may have subsets of the population for which we need to train an independently optimized model. ■ Training can be fully distributed requiring no coordination or data communication

27. Level 2 Distribution ■ For a given subset of the population we need to find the “optimal” model ■ Train several models with different hyperparameter values ■ Worst-case: grid search ■ Can do much better than this (E.g. Bayesian Optimization with Gaussian Process Priors) ■ This process *does* require coordination ■ Need to decide on next “step” ■ Need to gather final optimal result ■ Requires data distribution, not sharing

28. Level 3 Distribution ■ For each combination of hyperparameters, model training may still be expensive ■ Process requires coordination and data sharing/communication ■ Can distribute computation over machines splitting examples or parameters (e.g. ADMM) ■ Or parallelize on a single multicore machine (e.g. Hogwild) ■ Or… use GPUs

29. ANN Training over GPUS and AWS

30. ANN Training over GPUS and AWS ■ Level 1 distribution: machines over different AWS regions ■ Level 2 distribution: machines in AWS and same AWS region ■ Use coordination tools ■ Spearmint or similar for parameter optimization ■ Condor, StarCluster, Mesos… for distributed cluster coordination ■ Level 3 parallelization: highly optimized parallel CUDA code on GPUs

31. What latency can I afford?

32. 3 shades of latency ▪ Blueprint for multiple algorithm services ▪ Ranking ▪ Row selection ▪ Ratings ▪ Search ▪ … ▪ Multi-layered Machine Learning

33. Matrix Factorization Example

34. Xavier Amatriain (@xamat) xavier@netflix.com Thanks! (and yes, we are hiring)

MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (16)

Similar a MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud

Similar a MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud (20)

Más de Xavier Amatriain

Más de Xavier Amatriain (20)

Último

Último (20)

MMDS 2014 Talk - Distributing ML Algorithms: from GPUs to the Cloud