Abstract: Scalable Machine Learning at Yahoo
Yahoo scientists have developed a variety of machine learning libraries (supervised learning, unsupervised learning, deep learning) for online search, advertising, and personalization. Emerging business needs require us to address two problems:
- Can we apply these libraries against massive datasets (billions of training examples, and millions of features) using commodity hardware clusters?
- Can we reduce the learning time from days to minutes or seconds?
We have thus examined system architecture options (including Hadoop, Spark, and Storm) and developed a fault-tolerant MPI solution that allows hundreds of machines to jointly build a model. We are collaborating with the open source community on a better system architecture for next-generation machine learning applications. Yahoo's ML libraries are being revised for much better scalability and latency. In this talk, we will share the system architecture of our ML platform and its use cases.
3. Agenda
§ Machine Learning
› Use Cases
› Challenges
§ Scalable ML Architecture
§ Design Patterns
› Batch, real-time and hybrid
4. Evolution of Big Data @ Yahoo
[Chart: Raw HDFS Storage (in PB) and Number of Servers by Year, 2006–2014. Milestones:]
› Yahoo! commits to scaling Hadoop for production use
› Research workloads in Search and Advertising
› Production with machine learning & WebMap
› Open sourced with Apache
› Revenue systems with security, multi-tenancy, and SLAs
› Hortonworks spinoff for enterprise hardening
› Nextgen Hadoop (H 0.23)
› New services (HBase, Hive)
› Machine learning
› Increased user base with partitioned namespaces (Hadoop 2.5)
11. Apache Hadoop
http://hadoop.apache.org
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters built of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez, and beyond
12. Apache Storm
http://storm.apache.org
§ “Hadoop for Realtime”
› Distributed, high-performance realtime data processing
§ Simple API
§ Horizontal scalability
§ Fault tolerance
§ Guaranteed data processing
13. Apache Spark
http://spark.apache.org
§ Fast and expressive cluster computing system compatible with Apache Hadoop
§ Supports general execution DAGs
› Ex. iterative programming
§ Resilient Distributed Datasets (RDDs)
› In-memory storage
14. 30x Speedup for GBDT
§ Gradient Boosted Decision Trees took days to train on our large datasets.
› + High accuracy
› − Sequential execution
§ The 30x speedup enables frequent model retraining.
› GBDT is included in our data pipeline (Hadoop Oozie workflow)
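To make the "sequential execution" point concrete, here is a minimal plain-Python sketch of gradient boosting with depth-1 regression stumps. It is illustrative only: Yahoo's GBDT library and the distributed tricks behind the 30x speedup are not shown, and `fit_stump`/`gbdt` are hypothetical helper names. Note how each round depends on the residuals of the previous one, which is exactly the sequential dependency the slide calls out.

```python
def fit_stump(xs, residuals):
    """Pick the threshold that best splits residuals into two means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbdt(xs, ys, rounds=20, lr=0.5):
    """Sequentially fit stumps to residuals: round i needs round i-1's output."""
    trees = []
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

model = gbdt([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```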
15. Auto-tag Billions of Flickr Photos
[Pipeline diagram: 10,000 mappers run a deep network as a feature extractor (pixels -> features), emitting labeled feature vectors such as “dog, 1, [.2, -.3, …]” and “cat, 0, [.3, -.5, …]”; after a shuffle, 1,000 reducers train 8000+ per-tag classifiers (dog, cat, …).]
17. Design Patterns Enabled
1. Batch ML for scale
› Parallel model training (ex. 1000 models for ad campaigns)
› Distributed model training (ex. 1 model for all homepage content)
2. Real-time ML for speed
› Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture
› Scale + speedy learning (ex. photo autotags)
› Enabled by “Parameter Server on Grid”
18. 1a. ML in Hadoop Reducers
§ Basic Requirements
› 100’s–1000’s of models
› Training data for each model can be loaded into a single machine
§ Solution: 1 reducer per model
› hadoop jar hadoop-streaming.jar -Dmapreduce.job.reduces=$num_models -reducer "vw --passes 20 --cache_file …"
› hadoop jar lib/hadoop-streaming.jar -D mapreduce.job.reduces=$num_models -reducer "svm_train_reducer.py …"
19. 1b. ML in Hadoop Mappers
§ Basic Requirements
› Small # of models to be trained
› Training data are too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
1. spanning_tree
2. hadoop jar hadoop-streaming.jar -input $training_data -output $model_loc -Dmapreduce.job.maps=$num_mappers -mapper "runvw.sh $model_location $span_server $num_mappers" -reducer NONE
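The AllReduce step above is what lets every mapper see the global gradient. A toy single-process simulation of that pattern, under the assumption that the reduce operation is vector addition: pairwise tree-reduce the per-worker partial gradients, then broadcast the total back to all workers. VW's real implementation runs this over the network via the `spanning_tree` daemon, which is not modeled here.

```python
def allreduce(partials):
    """Tree-reduce the per-worker vectors (sum), then broadcast the result."""
    vecs = list(partials)
    while len(vecs) > 1:                      # pairwise reduce rounds
        nxt = []
        for i in range(0, len(vecs) - 1, 2):
            nxt.append([a + b for a, b in zip(vecs[i], vecs[i + 1])])
        if len(vecs) % 2:                     # odd worker carries over
            nxt.append(vecs[-1])
        vecs = nxt
    total = vecs[0]
    return [total[:] for _ in partials]       # each worker gets a copy
```

After the call, every worker holds the same summed gradient and can apply an identical model update, which is why the mappers stay in sync without a reducer.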
20. 1c. Spark Native ML
§ Spark-based
› Yahoo E-Commerce: 30-LOC Spark program for collaborative filtering
§ Spark’s MLlib
› Binary classification, linear regression, collaborative filtering, clustering, decision trees, etc.
§ 3rd-party ML libs
› Ex. Alpine Data Labs’ Random Forest
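For readers unfamiliar with the collaborative filtering mentioned twice above: the core idea is low-rank matrix factorization of the user-item rating matrix. MLlib implements this with ALS; the plain-Python sketch below uses SGD instead, purely to show the objective being minimized, and all names (`factorize`, `U`, `V`) are illustrative rather than MLlib's API.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500):
    """SGD on sum((r - u.v)^2) + L2 regularization, over (user, item, rating)."""
    rng = random.Random(0)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(U[u], V[i]))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)   # gradient step on user factor
                V[i][f] += lr * (err * uf - reg * vf)   # gradient step on item factor
    return U, V

# two users with opposite tastes over two items
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
U, V = factorize(ratings, 2, 2)
```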
21. 1d. Approximate Computing
§ Observations
› A large-scale ML job uses hundreds of processes to train models for hours.
› Some learner processes get stuck or fail due to hardware issues (ex. disk, network).
› Existing ML algorithms then hang or fail.
§ Partial Reducer
› Enables a trade-off between speed and accuracy
› Tolerates failure of a percentage of learner processes
for (i <- 1 to ITERATIONS) {
  val gradient = points.pipe(learner_cmd)
                       .partialReducer(reduceFunc, 0.99, timeout)
  w -= gradient
}
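`partialReducer` above is Yahoo-internal, so here is a hedged model of its contract in plain Python: accept the aggregate once a minimum fraction of learner outputs has arrived, fail fast otherwise, and rescale so the partial sum estimates the full-cluster total. The rescaling step assumes an additive reduce (e.g. summed gradients); the real operator's semantics may differ.

```python
def partial_reduce(results, reduce_func, min_fraction):
    """results: one entry per worker, None = worker stuck/failed before timeout.
    Raises if too few workers responded; otherwise reduces the survivors."""
    ok = [r for r in results if r is not None]
    if len(ok) < min_fraction * len(results):
        raise RuntimeError("too many failed learners: %d/%d"
                           % (len(results) - len(ok), len(results)))
    acc = ok[0]
    for r in ok[1:]:
        acc = reduce_func(acc, r)
    # rescale so a partial sum estimates the full-cluster sum
    return acc * len(results) / len(ok)
```

With `min_fraction=0.99` as on the slide, a job with 100 learners tolerates one straggler; lowering the fraction trades accuracy for speed, which is the point of the pattern.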
22. 2. Realtime Training in Storm Bolts
§ Basic Requirements
› Freshness of the ML model is critical
§ Sample Solution
public class TrainingBolt extends BaseBasicBolt {
  Model model;
  public void prepare(Map conf, TopologyContext ctx) {
    System.loadLibrary("VW");
    model = VW.init(conf);
  }
  public void execute(Tuple input, BasicOutputCollector collector) {
    Instance example = (Instance) input.getValue(0);
    model.learn(example);
    if (/* time since last export exceeds threshold */) collector.emit(new Values(model));
  }
}
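The bolt's logic, stripped of Storm plumbing, is "one online update per event, periodic model export." A plain-Python analogue of that loop, assuming a linear model with a squared-error SGD step in place of the VW binding, and gating exports on an event count where the slide gates on wall-clock time:

```python
class OnlineTrainer:
    def __init__(self, dim, export_every=100, lr=0.1):
        self.w = [0.0] * dim
        self.export_every = export_every
        self.lr = lr
        self.seen = 0
        self.exports = []          # stands in for collector.emit(model)

    def learn(self, label, x):
        """One SGD step on squared error, like model.learn(example)."""
        pred = sum(wi * xi for wi, xi in zip(self.w, x))
        err = label - pred
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.seen += 1
        if self.seen % self.export_every == 0:
            self.exports.append(list(self.w))   # snapshot for downstream consumers
```

Exporting snapshots rather than the live object matters in the real topology too: downstream bolts score against a consistent model while training continues.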
23. 3a. Hybrid Learning
§ Basic Requirements
› Bootstrap models via batch learning from large datasets
› Update models via realtime learning from the latest events
§ Sample Solution
› ML in Hadoop + Storm
› ML in Spark + Storm
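The hand-off between the two phases can be sketched in a few lines: the batch side makes many passes over the archive to produce initial weights, and the streaming side keeps updating those same weights one event at a time. Plain SGD stands in for both the Hadoop/Spark and Storm components here; the structure, not the learner, is the point.

```python
def sgd_step(w, label, x, lr=0.1):
    """Shared update rule used by both phases (squared-error SGD)."""
    err = label - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * err * xi for wi, xi in zip(w, x)]

def batch_bootstrap(history, dim, epochs=20):
    """Offline phase (Hadoop/Spark): many passes over historical data."""
    w = [0.0] * dim
    for _ in range(epochs):
        for label, x in history:
            w = sgd_step(w, label, x)
    return w

def realtime_update(w, event):
    """Online phase (Storm): single pass, one event at a time."""
    label, x = event
    return sgd_step(w, label, x)
```

Because both phases share one update rule and one weight vector, the streaming side starts from a well-trained model instead of from zero, which is the whole benefit of the hybrid pattern.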
24. 3b. Parameter Server on Grid
• Billions of features per model
• Millions of operations per second
• Enables asynchronous learning
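As a toy illustration of the parameter-server pattern (not Yahoo's system): workers pull the current weights for the sparse features they touch, compute an update locally, and push deltas back; the server applies deltas as they arrive, with no barrier between workers. A sparse dict keyed by feature id matches the "billions of features" setting only in spirit.

```python
class ParameterServer:
    def __init__(self):
        self.w = {}                      # sparse: feature id -> weight

    def pull(self, keys):
        """Return current weights for the requested feature ids."""
        return {k: self.w.get(k, 0.0) for k in keys}

    def push(self, deltas):
        """Apply deltas immediately: asynchronous, no synchronization barrier."""
        for k, d in deltas.items():
            self.w[k] = self.w.get(k, 0.0) + d

def worker_step(ps, example, lr=0.1):
    """One asynchronous SGD step against the server (squared error)."""
    label, x = example                   # x: sparse feature id -> value
    w = ps.pull(x.keys())
    err = label - sum(w[k] * v for k, v in x.items())
    ps.push({k: lr * err * v for k, v in x.items()})
```

Because pushes land without coordination, workers may compute against slightly stale weights; tolerating that staleness is what makes asynchronous learning scale.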
25. Summary
[Platform stack, top to bottom:]
§ Applications: Search Ranking, Photo/Video Services, Online Ads, Personalization, Abuse Detection
§ Machine Learning Libraries: Logistic Regression, Decision Trees, Deep Learning, Unsupervised Learning, …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL