Abstract:
Machine learning researchers and practitioners develop computer
algorithms that "improve performance automatically through
experience". At Google, machine learning is applied to solve many
problems, such as prioritizing emails in Gmail, recommending tags for
YouTube videos, and identifying different aspects from online user
reviews. Machine learning on big data, however, is challenging. Some
"simple" machine learning algorithms with quadratic time complexity,
while running fine with hundreds of records, are almost impractical to
use on billions of records.
In this talk, I will describe lessons drawn from various Google
projects on developing large scale machine learning systems. These
systems build on top of Google's computing infrastructure such as GFS
and MapReduce, and attack the scalability problem through massively
parallel algorithms. I will present the design decisions made in
these systems and the strategies used to scale up and speed up machine
learning on web-scale data.
Speaker biography:
Max Lin is a software engineer with Google Research in the New York
City office. He is the tech lead of the Google Prediction API, a machine
learning web service in the cloud. Prior to Google, he published
research work on video content analysis, sentiment analysis, machine
learning, and cross-lingual information retrieval. He holds a PhD in
Computer Science from Carnegie Mellon University.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin, Google Research)
1. Machine Learning on Big Data
Lessons Learned from Google Projects
Max Lin
Software Engineer | Google Research
Massively Parallel Computing | Harvard CS 264
Guest Lecture | March 29th, 2011
2. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
3. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
4. “Machine Learning is a study
of computer algorithms that
improve automatically
through experience.”
10. Training (Input X → Output Y):
"The quick brown fox jumped over the lazy dog." → English
"To err is human, but to really foul things up you need a computer." → English
"No hay mal que por bien no venga." → Spanish
"La tercera es la vencida." → Spanish
Learn a model f(x) from the training pairs.
Testing (f(x') = y'):
"To be or not to be -- that is the question" → ?
"La fe mueve montañas." → ?
11. Linear Classifier
"The quick brown fox jumped over the lazy dog."
Vocabulary: ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...
x = [ 0, ..., 0, ..., 1, ..., 1, ..., 0, ... ]
w = [ 0.1, ..., 132, ..., 150, ..., 200, ..., -153, ... ]
f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p
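A minimal Python sketch of this bag-of-words linear classifier score; the weights, document, and function name below are illustrative, not the model from the talk.

from typing import Dict, Set

# Illustrative sketch: f(x) = w . x over binary bag-of-words features.
# Only words present in the document contribute, so the sum runs over the
# document's words rather than the full P-dimensional vocabulary.
def score(weights: Dict[str, float], words: Set[str]) -> float:
    return sum(weights.get(word, 0.0) for word in words)

weights = {"dog": 150.0, "the": 200.0, "montañas": -153.0}  # toy weights
doc = set("the quick brown fox jumped over the lazy dog".split())
print(score(weights, doc))  # 150 + 200 = 350.0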
12. Training Data
Input X: an N × P matrix (N training examples, P features per example); Output Y: a vector of N labels.
13. Typical machine learning data at Google
N: 100 billion (mean) / 1 billion (median)
P: 1 billion (mean) / 10 million (median)
http://www.flickr.com/photos/mr_t_in_dc/5469563053
14. Classifier Training
• Training: given {(x, y)} and f, minimize the following objective function
\arg\min_w \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)
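As a concrete, hedged instance of this objective, the sketch below plugs in logistic loss for L and an L2 penalty for R; both choices are assumptions for illustration, since the slide leaves L and R generic, and labels are assumed to be in {-1, +1}.

import numpy as np

# Illustrative instance of arg min_w sum_i L(y_i, f(x_i; w)) + R(w):
# logistic loss for L, L2 penalty for R (both assumed choices).
def objective(w, X, y, reg=0.1):
    margins = y * (X @ w)                       # y_i * f(x_i; w), y_i in {-1, +1}
    loss = np.sum(np.log1p(np.exp(-margins)))   # sum_i L(y_i, f(x_i; w))
    return loss + reg * np.dot(w, w)            # + R(w)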
15. Use Newton’s method?
w^{t+1} \leftarrow w^t - H(w^t)^{-1} \nabla J(w^t)
http://www.flickr.com/photos/visitfinland/5424369765/
16. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
21. Parallelize Estimates
• Naive Bayes Classifier
\arg\min_w \; -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x_p^i \mid y_i; w) \, P(y_i; w)
• Maximum Likelihood Estimates
w_{the \mid EN} = \frac{\sum_{i=1}^{N} 1_{EN,the}(x^i)}{\sum_{i=1}^{N} 1_{EN}(x^i)}
22. Word Counting
X: “The quick brown fox ...”, Y: EN
Map: emit (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1), ...
Reduce: [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ]
C(‘the’|EN) = SUM of values = 3
w_{the \mid EN} = \frac{C(the \mid EN)}{C(EN)}
23. Word Counting
Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M.
Map: Mapper 1 ... Mapper M each read one shard and emit pairs such as (‘the|EN’, 1), (‘fox|EN’, 1), ..., (‘montañas|ES’, 1).
Reduce: a reducer tallies the counts and updates w → Model.
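A toy, in-memory sketch of this word-counting MapReduce; map_fn, reduce_fn, and the grouping loop are illustrative stand-ins, not Google’s MapReduce API.

from collections import defaultdict

# Toy MapReduce for Naive Bayes word counts: map emits ('word|label', 1),
# the shuffle groups pairs by key, and reduce sums the values per key.
def map_fn(x, y):
    for word in x.lower().split():
        yield (word + "|" + y, 1)

def reduce_fn(key, values):
    return key, sum(values)

shards = [("The quick brown fox jumped over the lazy dog", "EN"),
          ("No hay mal que por bien no venga", "ES")]

grouped = defaultdict(list)          # stand-in for the shuffle phase
for x, y in shards:
    for key, value in map_fn(x, y):
        grouped[key].append(value)

counts = dict(reduce_fn(k, v) for k, v in grouped.items())
print(counts["the|EN"])              # 2: "the" appears twice in the EN shard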
24. Parallelize Optimization
• Maximum Entropy Classifiers
\arg\max_w \prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x_p^i y_i)}{1 + \exp(\sum_{p=1}^{P} w_p x_p^i)}
• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: large N
26. Gradient Descent
• w is initialized as zero
• for t in 1 to T
• Calculate gradients \nabla J(w)
• w^{t+1} \leftarrow w^t - \eta \nabla J(w^t)
\nabla J(w) = \sum_{i=1}^{N} \nabla L(w, x_i, y_i)
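A short Python sketch of this loop; the per-example gradient below uses logistic loss with labels in {-1, +1}, an assumed choice since the slide leaves the per-example term generic.

import numpy as np

# Batch gradient descent as on the slide: w starts at zero, and each of the
# T iterations sums per-example gradients over all N examples, then steps.
def per_example_gradient(w, x, y):
    # gradient of log(1 + exp(-y * w.x)) with respect to w
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def gradient_descent(X, y, T=100, eta=0.1):
    w = np.zeros(X.shape[1])            # w is initialized as zero
    for _ in range(T):
        grad = sum(per_example_gradient(w, X[i], y[i]) for i in range(len(y)))
        w = w - eta * grad              # w <- w - eta * grad J(w)
    return w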
27. Distribute Gradient
• w is initialized as zero
• for t in 1 to T
• Calculate gradients in parallel
• w^{t+1} \leftarrow w^t - \eta \nabla J(w^t)
• Training CPU: O(TPN) to O(TPN / M)
28. Distribute Gradient
Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M.
Map: Machine 1 ... Machine M each emit (dummy key, partial gradient sum) for their shard.
Reduce: sum the partial gradients and update w → Model.
Repeat the MapReduce until convergence.
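A serial sketch of one such iteration: each "mapper" computes the partial gradient sum for its shard, and the "reducer" adds the partial sums and takes the step. The shard loop runs sequentially here only to keep the sketch short, and the logistic per-example gradient is again an assumed choice.

import numpy as np

# One distributed-gradient iteration: partial gradient sums per shard (Map),
# then a single sum and weight update (Reduce).
def shard_gradient(w, X_shard, y_shard):
    g = np.zeros_like(w)
    for x, y in zip(X_shard, y_shard):
        g += -y * x / (1.0 + np.exp(y * np.dot(w, x)))   # logistic-loss gradient
    return g

def distributed_gradient_step(w, shards, eta=0.1):
    partials = [shard_gradient(w, X, y) for X, y in shards]   # Map phase
    return w - eta * np.sum(partials, axis=0)                 # Reduce + update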
30. Parallelize Subroutines
• Support Vector Machines
\arg\min_{w,b,\zeta} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i
s.t. \; 1 - y_i (w \cdot \phi(x_i) + b) \le \zeta_i, \; \zeta_i \ge 0
• Solve the dual problem
\arg\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - \alpha^T \mathbf{1}
s.t. \; 0 \le \alpha \le C, \; y^T \alpha = 0
31. The computational cost for the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.
http://www.flickr.com/photos/sea-turtle/198445204/
32. Parallel SVM [Chang et al., 2007]
• Parallel, row-wise Incomplete Cholesky Factorization (ICF) for Q
• Parallel interior point method
• Time O(n^3) becomes O(n^2 / M)
• Memory O(n^2) becomes O(n√N / M)
• Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/
• Implemented in MPI
33. Parallel ICF
• Distribute Q by row across M machines
Machine 1: rows 1, 2; Machine 2: rows 3, 4; Machine 3: rows 5, 6; ...
• For each dimension n ≤ N:
• Send local pivots to the master
• The master selects the largest of the local pivots and broadcasts the global pivot to the workers
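A small sketch of just this pivot-selection step; the dictionary layout and function name are illustrative, not the psvm implementation.

# Pivot selection in the parallel ICF sketch: every worker reports the row
# with its largest local diagonal value, and the master picks the global
# maximum to broadcast back. local_diagonals: {machine_id: {row: value}}.
def select_global_pivot(local_diagonals):
    local_pivots = {m: max(rows, key=rows.get)
                    for m, rows in local_diagonals.items()}          # per-worker pivot
    best = max(local_pivots,
               key=lambda m: local_diagonals[m][local_pivots[m]])    # master's choice
    return local_pivots[best]                                        # row index to broadcast

print(select_global_pivot({1: {0: 0.5, 1: 2.0}, 2: {2: 1.5, 3: 0.1}}))  # -> 1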
36. Majority Vote
Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M.
Map: Machine 1 ... Machine M each train a classifier on their own shard, producing Model 1, Model 2, Model 3, ..., Model M.
37. Majority Vote
• Train individual classifiers independently
• Predict by taking majority votes
• Training CPU: O(TPN) to O(TPN / M)
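A minimal sketch of the prediction step; `models` is assumed to be a list of per-shard classifiers exposing a predict() method, which is an illustrative interface rather than one from the talk.

from collections import Counter

# Majority-vote prediction over M independently trained per-shard models.
def majority_vote(models, x):
    votes = [model.predict(x) for model in models]   # one vote per model
    return Counter(votes).most_common(1)[0][0]       # most frequent label wins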
38. Parameter Mixture [Mann et al., 2009]
Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M.
Map: Machine 1 ... Machine M each train on their shard and emit (dummy key, w1), (dummy key, w2), ...
Reduce: average the w’s → Model.
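A sketch of the reduce step, assuming each shard’s trainer returns a weight vector; `train_on_shard` is a hypothetical stand-in for any per-shard learner.

import numpy as np

# Parameter mixture: one weight vector per shard (Map), then a single
# average over the M vectors (Reduce).
def parameter_mixture(shards, train_on_shard):
    weights = [train_on_shard(X, y) for X, y in shards]   # Map: w1, w2, ..., wM
    return np.mean(weights, axis=0)                       # Reduce: average w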
39. Much less network usage than distributed gradient descent: O(MN) vs. O(MNT)
http://www.flickr.com/photos/annamatic3000/127945652/
41. Iterative Param Mixture [McDonald et al., 2010]
Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M.
Map: Machine 1 ... Machine M each train on their shard and emit (dummy key, w1), (dummy key, w2), ...
Reduce after each epoch: average the w’s → Model.
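A sketch of the iterative variant, assuming a hypothetical `train_epoch_on_shard` that runs one pass over a shard starting from the current averaged weights.

import numpy as np

# Iterative parameter mixture: re-average after each epoch, and start the
# next epoch's shard-level training from the averaged weights.
def iterative_parameter_mixture(shards, train_epoch_on_shard, w0, epochs=10):
    w = w0
    for _ in range(epochs):
        weights = [train_epoch_on_shard(w, X, y) for X, y in shards]  # Map
        w = np.mean(weights, axis=0)                                  # Reduce: average w
    return w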
43. Outline
• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
60. Google APIs
• Prediction API
• machine learning service in the cloud
• http://code.google.com/apis/predict
• BigQuery
• interactive analysis of massive data in the cloud
• http://code.google.com/apis/bigquery