Abstract: Scalable Machine Learning at Yahoo
Yahoo scientists have developed a variety of machine learning libraries (supervised learning, unsupervised learning, deep learning) for online search, advertising, and personalization. Emerging business needs require us to address two problems:
- Can we apply these libraries against massive datasets (billions of training examples, and millions of features) using commodity hardware clusters?
- Can we reduce the learning time from days to minutes or seconds?
We have thus examined system architecture options (including Hadoop, Spark, and Storm) and developed a fault-tolerant MPI solution that allows hundreds of machines to jointly build a model. We are collaborating with the open source community on a better system architecture for next-generation machine learning applications. Yahoo's ML libraries are being revised for much better scalability and latency. In this talk, we will share the system architecture of our ML platform and its use cases.
3. Agenda
§ Machine Learning
› Use Cases
› Challenges
§ Scalable ML Architecture
§ Design Patterns
› Batch, real-time and hybrid
4. Evolution of Big Data @ Yahoo
[Chart: Raw HDFS Storage (in PB) and Number of Servers by Year, 2006–2014. Milestones:]
› Yahoo! commits to scaling Hadoop for production use
› Research workloads in Search and Advertising
› Production with machine learning & WebMap
› Open sourced with Apache
› Revenue systems with security, multi-tenancy, and SLAs
› Hortonworks spinoff for enterprise hardening
› Nextgen Hadoop (H 0.23)
› New services (HBase, Hive)
› Machine learning
› Increased user base with partitioned namespaces (Hadoop 2.5)
11. Apache Hadoop
http://hadoop.apache.org
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters built of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez, and beyond
12. Apache Storm
http://storm.apache.org
§ “Hadoop for Realtime”
› Distributed, high-performance realtime data processing
§ Simple API
§ Horizontal scalability
§ Fault tolerance
§ Guaranteed data processing
13. Apache Spark
http://spark.apache.org
§ Fast and expressive cluster computing system compatible with Apache Hadoop
§ Supports general execution DAGs
› Ex. iterative programming
§ Resilient Distributed Datasets (RDDs)
› In-memory storage
14. 30x Speedup for GBDT
§ Gradient Boosted Decision Trees took days to train on our large datasets.
› + High accuracy
› − Sequential execution
§ The 30x speedup enables frequent model retraining.
› GBDT is included in our data pipeline (Hadoop Oozie workflow)
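To make the "sequential execution" point concrete, here is a minimal plain-Python sketch of gradient boosting with depth-1 regression stumps. It is illustrative only: Yahoo's GBDT library and the distributed tricks behind the 30x speedup are not shown, and `fit_stump`/`gbdt` are hypothetical helper names. Note how each round depends on the residuals of the previous one, which is exactly the sequential dependency the slide calls out.

```python
def fit_stump(xs, residuals):
    """Pick the threshold that best splits residuals into two means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gbdt(xs, ys, rounds=20, lr=0.5):
    """Sequentially fit stumps to residuals: round i needs round i-1's output."""
    trees = []
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

model = gbdt([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```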
15. Auto-tag Billions of Flickr Photos
[Pipeline diagram: 10,000 mappers run a deep network as a feature extractor (pixels -> features), emitting labeled feature vectors such as “dog, 1, [.2, -.3, …]” and “cat, 0, [.3, -.5, …]”; after a shuffle, 1,000 reducers train 8000+ per-tag classifiers (dog, cat, …).]
17. Design Patterns Enabled
1. Batch ML for scale
› Parallel model training (ex. 1000 models for ad campaigns)
› Distributed model training (ex. 1 model for all homepage content)
2. Real-time ML for speed
› Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture
› Scale + speedy learning (ex. photo autotags)
› Enabled by “Parameter Server on Grid”
18. 1a. ML in Hadoop Reducers
§ Basic Requirements
› 100’s–1000’s of models
› Training data for each model can be loaded into a single machine
§ Solution: 1 reducer per model
› hadoop jar hadoop-streaming.jar -Dmapreduce.job.reduces=$num_models -reducer "vw --passes 20 --cache_file …"
› hadoop jar lib/hadoop-streaming.jar -D mapreduce.job.reduces=$num_models -reducer "svm_train_reducer.py …"
19. 1b. ML in Hadoop Mappers
§ Basic Requirements
› Small # of models to be trained
› Training data are too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
1. spanning_tree
2. hadoop jar hadoop-streaming.jar -input $training_data -output $model_loc -Dmapreduce.job.maps=$num_mappers -mapper "runvw.sh $model_location $span_server $num_mappers" -reducer NONE
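The AllReduce step above is what lets every mapper see the global gradient. A toy single-process simulation of that pattern, under the assumption that the reduce operation is vector addition: pairwise tree-reduce the per-worker partial gradients, then broadcast the total back to all workers. VW's real implementation runs this over the network via the `spanning_tree` daemon, which is not modeled here.

```python
def allreduce(partials):
    """Tree-reduce the per-worker vectors (sum), then broadcast the result."""
    vecs = list(partials)
    while len(vecs) > 1:                      # pairwise reduce rounds
        nxt = []
        for i in range(0, len(vecs) - 1, 2):
            nxt.append([a + b for a, b in zip(vecs[i], vecs[i + 1])])
        if len(vecs) % 2:                     # odd worker carries over
            nxt.append(vecs[-1])
        vecs = nxt
    total = vecs[0]
    return [total[:] for _ in partials]       # each worker gets a copy
```

After the call, every worker holds the same summed gradient and can apply an identical model update, which is why the mappers stay in sync without a reducer.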
20. 1c. Spark Native ML
§ Spark-based
› Yahoo E-Commerce: 30-LOC Spark program for collaborative filtering
§ Spark’s MLlib
› Binary classification, linear regression, collaborative filtering, clustering, decision trees, etc.
§ 3rd-party ML libs
› Ex. Alpine Data Labs’ Random Forest
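For readers unfamiliar with the collaborative filtering mentioned twice above: the core idea is low-rank matrix factorization of the user-item rating matrix. MLlib implements this with ALS; the plain-Python sketch below uses SGD instead, purely to show the objective being minimized, and all names (`factorize`, `U`, `V`) are illustrative rather than MLlib's API.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500):
    """SGD on sum((r - u.v)^2) + L2 regularization, over (user, item, rating)."""
    rng = random.Random(0)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(U[u], V[i]))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)   # gradient step on user factor
                V[i][f] += lr * (err * uf - reg * vf)   # gradient step on item factor
    return U, V

# two users with opposite tastes over two items
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 5.0)]
U, V = factorize(ratings, 2, 2)
```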
21. 1d. Approximate Computing
§ Observations
› A large-scale ML job uses hundreds of processes to train models for hours.
› Some learner processes get stuck or fail due to hardware issues (ex. disk, network).
› Existing ML algorithms then hang or fail.
§ Partial Reducer
› Enables a trade-off between speed and accuracy
› Tolerates failure of a percentage of learner processes
for (i <- 1 to ITERATIONS) {
  val gradient = points.pipe(learner_cmd)
                       .partialReducer(reduceFunc, 0.99, timeout)
  w -= gradient
}
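`partialReducer` above is Yahoo-internal, so here is a hedged model of its contract in plain Python: accept the aggregate once a minimum fraction of learner outputs has arrived, fail fast otherwise, and rescale so the partial sum estimates the full-cluster total. The rescaling step assumes an additive reduce (e.g. summed gradients); the real operator's semantics may differ.

```python
def partial_reduce(results, reduce_func, min_fraction):
    """results: one entry per worker, None = worker stuck/failed before timeout.
    Raises if too few workers responded; otherwise reduces the survivors."""
    ok = [r for r in results if r is not None]
    if len(ok) < min_fraction * len(results):
        raise RuntimeError("too many failed learners: %d/%d"
                           % (len(results) - len(ok), len(results)))
    acc = ok[0]
    for r in ok[1:]:
        acc = reduce_func(acc, r)
    # rescale so a partial sum estimates the full-cluster sum
    return acc * len(results) / len(ok)
```

With `min_fraction=0.99` as on the slide, a job with 100 learners tolerates one straggler; lowering the fraction trades accuracy for speed, which is the point of the pattern.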
22. 2. Realtime Training in Storm Bolts
§ Basic Requirements
› Freshness of the ML model is critical
§ Sample Solution
public class TrainingBolt extends BaseBasicBolt {
  Model model;
  public void prepare(Map conf, TopologyContext ctx) {
    System.loadLibrary("VW");
    model = VW.init(conf);
  }
  public void execute(Tuple input, BasicOutputCollector collector) {
    Instance example = (Instance) input.getValue(0);
    model.learn(example);
    if (/* time since last export exceeds threshold */) collector.emit(new Values(model));
  }
}
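The bolt's logic, stripped of Storm plumbing, is "one online update per event, periodic model export." A plain-Python analogue of that loop, assuming a linear model with a squared-error SGD step in place of the VW binding, and gating exports on an event count where the slide gates on wall-clock time:

```python
class OnlineTrainer:
    def __init__(self, dim, export_every=100, lr=0.1):
        self.w = [0.0] * dim
        self.export_every = export_every
        self.lr = lr
        self.seen = 0
        self.exports = []          # stands in for collector.emit(model)

    def learn(self, label, x):
        """One SGD step on squared error, like model.learn(example)."""
        pred = sum(wi * xi for wi, xi in zip(self.w, x))
        err = label - pred
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.seen += 1
        if self.seen % self.export_every == 0:
            self.exports.append(list(self.w))   # snapshot for downstream consumers
```

Exporting snapshots rather than the live object matters in the real topology too: downstream bolts score against a consistent model while training continues.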
23. 3a. Hybrid Learning
§ Basic Requirements
› Bootstrap models via batch learning from large datasets
› Update models via realtime learning from the latest events
§ Sample Solution
› ML in Hadoop + Storm
› ML in Spark + Storm
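The hand-off between the two phases can be sketched in a few lines: the batch side makes many passes over the archive to produce initial weights, and the streaming side keeps updating those same weights one event at a time. Plain SGD stands in for both the Hadoop/Spark and Storm components here; the structure, not the learner, is the point.

```python
def sgd_step(w, label, x, lr=0.1):
    """Shared update rule used by both phases (squared-error SGD)."""
    err = label - sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * err * xi for wi, xi in zip(w, x)]

def batch_bootstrap(history, dim, epochs=20):
    """Offline phase (Hadoop/Spark): many passes over historical data."""
    w = [0.0] * dim
    for _ in range(epochs):
        for label, x in history:
            w = sgd_step(w, label, x)
    return w

def realtime_update(w, event):
    """Online phase (Storm): single pass, one event at a time."""
    label, x = event
    return sgd_step(w, label, x)
```

Because both phases share one update rule and one weight vector, the streaming side starts from a well-trained model instead of from zero, which is the whole benefit of the hybrid pattern.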
24. 3b. Parameter Server on Grid
• Billions of features per model
• Millions of operations per second
• Enables asynchronous learning
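As a toy illustration of the parameter-server pattern (not Yahoo's system): workers pull the current weights for the sparse features they touch, compute an update locally, and push deltas back; the server applies deltas as they arrive, with no barrier between workers. A sparse dict keyed by feature id matches the "billions of features" setting only in spirit.

```python
class ParameterServer:
    def __init__(self):
        self.w = {}                      # sparse: feature id -> weight

    def pull(self, keys):
        """Return current weights for the requested feature ids."""
        return {k: self.w.get(k, 0.0) for k in keys}

    def push(self, deltas):
        """Apply deltas immediately: asynchronous, no synchronization barrier."""
        for k, d in deltas.items():
            self.w[k] = self.w.get(k, 0.0) + d

def worker_step(ps, example, lr=0.1):
    """One asynchronous SGD step against the server (squared error)."""
    label, x = example                   # x: sparse feature id -> value
    w = ps.pull(x.keys())
    err = label - sum(w[k] * v for k, v in x.items())
    ps.push({k: lr * err * v for k, v in x.items()})
```

Because pushes land without coordination, workers may compute against slightly stale weights; tolerating that staleness is what makes asynchronous learning scale.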
25. Summary
[Platform stack, top to bottom:]
§ Applications: Search Ranking, Photo/Video Services, Online Ads, Personalization, Abuse Detection
§ Machine Learning Libraries: Logistic Regression, Decision Trees, Deep Learning, Unsupervised Learning, …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL