Scalable Machine Learning at Yahoo 
Andy Feng 
Nov 14, 2014
My Background 
§ Current 
› VP Architecture, Yahoo 
› Committer, Apache Storm 
› Contributor, Apache Spark & Hadoop 
§ Past 
› NoSQL 
› Online advertisement 
› Personalization 
› Cloud services
Agenda 
§ Machine Learning 
› Use Cases 
› Challenges 
§ Scalable ML Architecture 
§ Design Patterns 
› Batch, real-time and hybrid
Evolution of Big Data @ Yahoo 
[Chart: number of servers (axis 0 to 45,000) and raw HDFS storage in PB (axis 0 to 600), by year, 2006 to 2014]
Milestones annotated on the chart:
› Increased user base with partitioned namespaces
› Hadoop 2.5
› Yahoo! commits to scaling Hadoop for production use
› Research workloads in Search and Advertising
› Production with machine learning & WebMap
› Revenue systems with security, multi-tenancy, and SLAs
› Open sourced with Apache
› Hortonworks spinoff for enterprise hardening
› Nextgen Hadoop (H 0.23)
› New services (HBase, Hive)
› Machine learning
Personalized Homepage 
http://www.yahoo.com
› Mobile
› Today Module (2012)
› Content stream w/ native ads (2013)
Web Search & Ads
• Web page rank
• Image/video insertion
• Ads targeting & ranking
Flickr Photo Search 
Google 
Flickr 
2013 … User tags based
2014 … Empowered by Scalable ML
Machine Learning @ Yahoo
§ Search
› Page ranking per user intention
§ Advertisement
› Ad click prediction
› Identify potential users for an ad campaign
§ Content
› Matching news articles against users
› Object detection, face recognition in photos
§ Security
› Email spam
› Fraud login and registration
Our Challenges
§ Scale
› 1,000,000,000’s examples
› 100,000,000’s features
› 10,000’s models
› 10’s algorithms
• Batch learning
• Incremental learning
• Real-time learning
§ Speed
› Temporal nature of user interests
› Time sensitive content
• Ex. breaking news
› Naïve solutions spend days/hours in model training
• Minutes/seconds desired
Our Approach: 
Big-Data Machine Learning
Apache Hadoop
http://hadoop.apache.org
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters built of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez and beyond
Apache Storm
http://storm.apache.org
§ “Hadoop for Realtime”
› Distributed and high-performance realtime data processing
§ Simple API
§ Horizontal scalability
§ Fault-tolerance
§ Guaranteed data processing
Apache Spark 
http://spark.apache.org 
§ Fast and expressive cluster 
computing system compatible 
with Apache Hadoop 
§ Supports general execution DAGs
› Ex. iterative programming 
§ Resilient Distributed Datasets 
› In-memory storage
30x Speedup for GBDT 
§ Gradient Boosted Decision Trees took days to train on our large datasets.
↑ High accuracy
↓ Sequential execution
§ A 30x speedup enables frequent model training.
› GBDT included in data pipeline (Hadoop Oozie workflow)
Auto-tag billions of Flickr photos
§ 10,000 mappers: pixels -> features
› Deep network as feature extractor
› Output records of the form (tag, label, feature vector):
dog, 1, [.2, -.3, …]
dog, 0, [.3, -.5, …]
cat, 1, [.2, -.3, …]
cat, 0, [.3, -.5, …]
§ Shuffle
§ 1,000 reducers: train models (Dog, …, Cat, …)
› 8000+ classifiers
Real-time Learning of Newly Uploaded Photos
[Diagram labels: Real-time Prediction & Training; User Experience]
Design Patterns Enabled 
1. Batch ML for scale 
› Parallel model training (ex. 1000 models for ad campaigns) 
› Distributed model training (ex. 1 model for all homepage content) 
2. Real-time ML for speed 
› Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture 
› Scale + Speedy learning (ex. Photo autotags) 
› Enabled by “Parameter Server on Grid”
1a. ML in Hadoop Reducers
§ Basic Requirements
› 100s to 1000s of models
› Training data for each model can be loaded into a single machine
§ Solution: 1 reducer per model
› hadoop jar hadoop-streaming.jar
-D mapreduce.job.reduces=$num_models
-reducer "vw --passes 20 --cache_file …"
› hadoop jar lib/hadoop-streaming.jar
-D mapreduce.job.reduces=$num_models
-reducer "svm_train_reducer.py …"
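A minimal sketch of what a per-model streaming reducer in the spirit of `svm_train_reducer.py` might look like. The slide elides the script's contents, so everything here is an assumption: the record format (`model_id<TAB>label<TAB>f1,f2,...`), the function names, and the perceptron that stands in for the real learner. It relies only on the fact that Hadoop streaming hands a reducer its input sorted by key, so consecutive lines with the same `model_id` can be grouped and trained together.

```python
from itertools import groupby


def parse(line):
    """Split 'model_id<TAB>label<TAB>f1,f2,...' (an assumed record format)."""
    model_id, label, feats = line.rstrip("\n").split("\t")
    return model_id, int(label), [float(v) for v in feats.split(",")]


def train_perceptron(examples, passes=20):
    """Stand-in learner: perceptron over {0, 1} labels; the last weight is the bias."""
    dim = len(examples[0][1])
    w = [0.0] * (dim + 1)
    for _ in range(passes):
        for label, x in examples:
            xb = x + [1.0]  # append constant bias input
            pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if pred != label:
                step = label - pred
                w = [wi + step * xi for wi, xi in zip(w, xb)]
    return w


def reduce_stream(lines):
    """Yield (model_id, weights); assumes lines arrive sorted by model_id."""
    records = (parse(line) for line in lines if line.strip())
    for model_id, group in groupby(records, key=lambda r: r[0]):
        examples = [(label, x) for _, label, x in group]
        yield model_id, train_perceptron(examples)


def main():
    """Entry point when used as a streaming reducer: read stdin, print one model per key."""
    import sys
    for model_id, w in reduce_stream(sys.stdin):
        print(model_id + "\t" + ",".join(f"{wi:.4f}" for wi in w))
```

Because each key is trained independently, setting the number of reducers to the number of models gives one training process per model, which is the pattern the slide describes.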
1b. ML in Hadoop Mappers
§ Basic Requirements
› Small # of models to be trained
› Training data are too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
1. spanning_tree
2. hadoop jar hadoop-streaming.jar
-D mapreduce.job.maps=$num_mappers
-input $training_data -output $model_loc
-mapper "runvw.sh $model_location $span_server $num_mappers"
-reducer NONE
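The idea behind mappers plus AllReduce can be illustrated with a small in-process simulation. This is a sketch under stated assumptions, not the Vowpal Wabbit implementation: the function names are hypothetical, the "workers" are plain list entries rather than Hadoop mappers, and the model is least-squares regression. Each worker computes a gradient on its own shard, AllReduce gives every worker the same summed gradient, and every replica applies the same update, so the replicas stay identical without a central reducer.

```python
def local_gradient(w, shard):
    """Sum of squared-error gradients over one worker's shard of (x, y) pairs."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def allreduce_sum(vectors):
    """Every worker contributes a vector; every worker receives the element-wise sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def train(shards, dim, steps=500, lr=0.1):
    """Synchronous data-parallel gradient descent with one weight replica per worker."""
    n = sum(len(shard) for shard in shards)
    replicas = [[0.0] * dim for _ in shards]
    for _ in range(steps):
        grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
        summed = allreduce_sum(grads)
        replicas = [
            [wi - lr * gi / n for wi, gi in zip(w, g)]
            for w, g in zip(replicas, summed)
        ]
    return replicas
```

In the real setup the sum would be computed over the network (the `spanning_tree` process on the slide coordinates the workers); here it is just a list operation.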
1c. Spark Native ML 
§ Spark based 
› Yahoo E-Commerce: 30 LOC Spark program for collaborative 
filtering 
§ Spark’s MLlib 
› Binary classification, Linear regression, Collaborative filtering, 
Clustering, Decision Trees etc. 
§ Third-party ML libraries
› Ex. Alpine Data Labs' Random Forest
1d. Approximate Computing 
§ Observations
› A large-scale ML job uses 100s of processes to train models for hours.
› Some learner processes will get stuck or fail due to hardware issues (ex. disk, network etc.).
› Existing ML algorithms will hang or fail.
§ Partial Reducer
› Enables a trade-off between speed and accuracy
› Tolerates failures of a percentage of learner processes
for (i <- 1 to ITERATIONS) {
  val gradient =
    points.pipe(learner_cmd)
      .partialReducer(reduceFunc, 0.99, timeout)
  w -= gradient
}
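The slide does not show how `partialReducer` is implemented, so the following is only a sketch of one plausible reading: run all learner tasks in parallel, stop waiting once a given fraction of them has finished or the timeout expires, and reduce whatever results are available. The function name `partial_reduce` and its return value are assumptions, and threads stand in for distributed learner processes.

```python
import math
from concurrent.futures import ThreadPoolExecutor, TimeoutError, as_completed
from functools import reduce


def partial_reduce(tasks, reduce_func, min_fraction, timeout):
    """Reduce the results of the tasks that finish in time.

    Stops early once min_fraction of the tasks have succeeded; tasks that
    fail or are still running at the timeout are ignored.
    Returns (reduced_value, number_of_results_used).
    """
    needed = max(1, math.ceil(min_fraction * len(tasks) - 1e-9))
    results = []
    pool = ThreadPoolExecutor(max_workers=len(tasks))
    futures = [pool.submit(task) for task in tasks]
    try:
        for future in as_completed(futures, timeout=timeout):
            if future.exception() is None:
                results.append(future.result())
            if len(results) >= needed:
                break
    except TimeoutError:
        pass  # accept whatever arrived before the deadline
    pool.shutdown(wait=False)  # do not block on stragglers
    if not results:
        raise RuntimeError("no learner finished before the timeout")
    return reduce(reduce_func, results), len(results)
```

Returning the count of results used lets the caller rescale the reduced gradient, since a sum over 99 learners is not directly comparable to a sum over 100.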
2. Realtime Training in Storm Bolts 
§ Basic Requirements 
› Freshness of ML model is critical 
§ Sample Solution 
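// Model, Instance and VW: the speaker's Vowpal Wabbit JNI wrapper (not shown).
public class TrainingBolt extends BaseBasicBolt {
    static final long EXPORT_INTERVAL_MS = 60_000L; // assumed export period
    Model model;
    long lastExport;
    public void prepare(Map conf, TopologyContext ctx) {
        System.loadLibrary("VW");
        model = VW.init(conf);
        lastExport = System.currentTimeMillis();
    }
    public void execute(Tuple input, BasicOutputCollector collector) {
        Instance example = (Instance) input.getValue(0);
        model.learn(example);
        long now = System.currentTimeMillis();
        if (now - lastExport >= EXPORT_INTERVAL_MS) { // time since last export
            collector.emit(new Values(model));
            lastExport = now;
        }
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("model"));
    }
}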
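To experiment with the same learn-then-periodically-export loop without a Storm cluster, it can be sketched in plain Python. This is an illustration under assumptions, not the speaker's code: online logistic regression stands in for Vowpal Wabbit, and the class names, learning rate, and injectable clock are all hypothetical.

```python
import math
import time


class OnlineLogisticModel:
    """Logistic regression updated one example at a time with SGD."""

    def __init__(self, dim, lr=0.5):
        self.w = [0.0] * dim
        self.lr = lr

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, y):
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]


class TrainingLoop:
    """Learn from every event; hand back a model snapshot when the interval elapses."""

    def __init__(self, model, export_interval_s, clock=time.monotonic):
        self.model = model
        self.export_interval_s = export_interval_s
        self.clock = clock
        self.last_export = clock()

    def on_event(self, x, y):
        self.model.learn(x, y)
        now = self.clock()
        if now - self.last_export >= self.export_interval_s:
            self.last_export = now
            return list(self.model.w)  # copy, so later updates do not change it
        return None
```

Passing the clock in as a parameter keeps the export logic testable; a deployment would leave the default and forward each non-`None` snapshot to the serving layer.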
3a. Hybrid Learning 
§ Basic Requirements 
› Bootstrap models via batch
learning from large datasets
› Update models via realtime 
learning from latest events 
§ Sample Solution 
› ML in Hadoop + Storm 
› ML in Spark + Storm
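The hybrid pattern can be sketched in a few lines: a batch phase makes many passes over historical data to bootstrap the weights, and a streaming phase then nudges those weights with each new event. This is a toy illustration with hypothetical function names and a linear model trained by SGD; in the deck the batch phase runs on Hadoop or Spark and the streaming phase on Storm.

```python
def sgd_step(w, x, y, lr):
    """One squared-error SGD update for a linear model."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]


def bootstrap(history, dim, passes=50, lr=0.05):
    """Batch phase: repeated passes over a large historical dataset."""
    w = [0.0] * dim
    for _ in range(passes):
        for x, y in history:
            w = sgd_step(w, x, y, lr)
    return w


def stream_update(w, events, lr=0.05):
    """Real-time phase: update the bootstrapped weights one event at a time."""
    for x, y in events:
        w = sgd_step(w, x, y, lr)
        yield w
```

If the relationship in the latest events drifts away from the historical one, the streamed weights follow it without rerunning the batch job.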
3b. Parameter Server on Grid 
• billions of features per model 
• millions of operations per second
• enable asynchronous learning
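The slide gives no implementation detail for the parameter server, so the following is a single-process sketch of the general idea only: workers pull the current values of just the features they need, compute a gradient locally, and push it back, with no barrier between workers. The class and method names (`ParameterServer`, `pull`, `push`) are assumptions, and threads with a lock stand in for networked servers.

```python
import threading


class ParameterServer:
    """Holds sparse weights keyed by feature name; unseen features read as 0.0."""

    def __init__(self):
        self._weights = {}
        self._lock = threading.Lock()

    def pull(self, keys):
        with self._lock:
            return {k: self._weights.get(k, 0.0) for k in keys}

    def push(self, grads, lr):
        with self._lock:
            for k, g in grads.items():
                self._weights[k] = self._weights.get(k, 0.0) - lr * g


def worker(server, examples, lr):
    """Asynchronous learner: pull, compute a squared-error gradient, push."""
    for feats, y in examples:  # feats is a sparse dict: feature -> value
        w = server.pull(feats.keys())
        err = sum(w[k] * v for k, v in feats.items()) - y
        server.push({k: err * v for k, v in feats.items()}, lr)


def run_async(server, shards, lr=0.1):
    """Start one worker thread per shard and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(server, s, lr)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because workers never wait for each other, a worker may compute its gradient from slightly stale weights; that is the price of asynchronous learning.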
Summary 
§ Applications
› Search Ranking
› Photo/Video Services
› Online Ads
› Personalization
› Abuse Detection
§ Machine Learning Libraries
› Logistic Regression
› Deep Learning
› Unsupervised Learning
› Decision Trees
› …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL
Committed to Apache Open Source 
8 Committers (6 PMCs) | Apache - 80 
3 Committers (2 PMCs) | Apache - 21 
5 Committers (3 PMCs) | Apache - 18 
5 Committers (5 PMCs) | Apache - 17
3 Committers | Apache - 32 
7 Committers (6 PMCs) | Apache - 33
§ Big-Data Blog … http://yahoohadoop.tumblr.com 
§ Hiring … http://careers.yahoo.com 
Thanks!


Andy Feng, Distinguished Architect, Yahoo at MLconf SF
