2. Mahout
Scalable Data Mining for Everybody
Wednesday, March 16, 2011 1
3. What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set mining, math)
5. Classification in Detail
• Naive Bayes family
  • Hadoop-based training
• Decision Forests
  • Hadoop-based training
• Logistic Regression (aka SGD)
  • fast on-line (sequential) training
7. So What?
Online training has low overhead for small and moderate size data-sets.
[Chart annotation: “big starts here”]
19. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual benefit.
I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur thirty-three million one hundred
thousand euros, only) into a foreign company's bank
account for our favor.
...
20. And Another
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
23. Mahout’s SGD
• Learns on-line, one example at a time
• O(1) memory
• O(1) time per training example
• Sequential implementation
  • fast, but not parallel
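As a rough illustration of the bullets above (a plain-Python sketch, not Mahout's Java implementation), an on-line logistic learner keeps a fixed-size weight vector and does one gradient step per example, which is what gives it O(1) memory and O(1) time per training example:

```python
import math

class OnlineLogisticRegression:
    """Toy on-line logistic regression: O(1) memory, O(1) time per example."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = [0.0] * num_features  # fixed-size weights: O(1) memory
        self.rate = learning_rate

    def classify_scalar(self, x):
        """Return P(y = 1 | x) for a dense feature vector x."""
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        """One gradient step on a single (x, y) example: O(1) time."""
        error = y - self.classify_scalar(x)
        for i, xi in enumerate(x):
            self.w[i] += self.rate * error * xi

# Learn y = 1 exactly when the second feature is active.
model = OnlineLogisticRegression(num_features=2)
for _ in range(1000):
    model.train([1.0, 0.0], 0)
    model.train([1.0, 1.0], 1)
print(model.classify_scalar([1.0, 1.0]))  # close to 1
print(model.classify_scalar([1.0, 0.0]))  # close to 0
```

Because each `train` call touches only the current example, the learner never needs the data set in memory, which is why the sequential implementation is fast but not parallel.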
24. Special Features
• Hashed feature encoding
• Per-term annealing
  • learn the boring stuff once
• Auto-magical learning knob turning
  • learns the correct learning rate, learns the correct learning rate for learning the learning rate, ...
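Hashed feature encoding (the first special feature above) maps feature names into a fixed-size vector by hashing, so memory stays bounded no matter how many distinct terms appear. A minimal sketch of the idea in Python (Mahout's actual encoders live in `org.apache.mahout.vectorizer.encoders`; the function below is illustrative only):

```python
import hashlib

def hashed_encode(features, num_buckets=1024):
    """Hash each feature name into a fixed-size vector (the 'hashing trick').

    Memory is bounded by num_buckets regardless of how many distinct
    terms occur, at the cost of occasional hash collisions.
    """
    v = [0.0] * num_buckets
    for name, value in features.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        v[h % num_buckets] += value
    return v

v = hashed_encode({"subject:lunch": 1.0, "from:george": 1.0})
print(len(v))                            # always num_buckets
print(sum(1 for x in v if x != 0.0))     # at most 2 non-zero buckets
```

Collisions simply add two terms' weights into the same bucket; with enough buckets this costs little accuracy and removes the need for a term dictionary.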
30. Learning Rate Per-term Annealing
[Figure: learning rate vs. # training examples seen, with curves annotated “Common Feature” and “Rare Feature”]
33. General Structure
• OnlineLogisticRegression
  • traditional logistic regression
  • stochastic gradient descent
  • per-term annealing
  • too fast for the disk + encoder to keep up
34. Next Level
• CrossFoldLearner
  • contains multiple primitive learners
  • on-line cross validation
  • 5x more work
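The CrossFoldLearner idea can be sketched as follows: each incoming example is scored by the one fold that will never train on it, so the running accuracy estimate stays out-of-sample, and training the remaining folds is what makes it roughly (folds − 1) times more work. This is an illustrative Python sketch under that assumption, not Mahout's Java class:

```python
import math

class TinyLogistic:
    """Minimal on-line logistic learner used as the primitive model."""
    def __init__(self, n=2, rate=0.1):
        self.w = [0.0] * n
        self.rate = rate
    def classify_scalar(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))
    def train(self, x, y):
        e = y - self.classify_scalar(x)
        self.w = [wi + self.rate * e * xi for wi, xi in zip(self.w, x)]

class CrossFoldLearner:
    """Each example tests the one fold that does not train on it."""
    def __init__(self, make_learner, folds=5):
        self.models = [make_learner() for _ in range(folds)]
        self.seen = 0
        self.correct = 0
    def train(self, x, y):
        held_out = self.seen % len(self.models)
        # Score on the held-out fold before any fold sees this example.
        p = self.models[held_out].classify_scalar(x)
        self.correct += int((p > 0.5) == (y == 1))
        self.seen += 1
        for i, m in enumerate(self.models):
            if i != held_out:
                m.train(x, y)   # ~(folds - 1)x the training work
    def accuracy(self):
        """Running out-of-sample accuracy estimate."""
        return self.correct / max(1, self.seen)

cfl = CrossFoldLearner(TinyLogistic, folds=5)
for _ in range(500):
    cfl.train([1.0, 0.0], 0)
    cfl.train([1.0, 1.0], 1)
print(cfl.accuracy())  # high: the data is separable
```

The payoff is that the quality estimate comes for free with training, which is what the AdaptiveLogisticRegression on the next slide exploits to compare learners.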
35. And again
• AdaptiveLogisticRegression
  • 20 x CrossFoldLearner
  • evolves good learning and regularization rates
  • 100x more work than the basic learner
  • still faster than disk + encoding
36. A comparison
• Traditional view
  • 400 x (read + OLR)
• Revised Mahout view
  • 1 x (read + mu x 100 x OLR) x eta
  • mu = efficiency from killing losers early
  • eta = efficiency from stopping early
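The arithmetic above can be made concrete with assumed numbers; all values below (per-pass costs, mu, eta) are illustrative assumptions for the sake of the calculation, not measurements from the talk:

```python
# Illustrative cost comparison of the two views above.
read = 1.0   # assumed cost of reading + encoding one pass of the data
olr = 0.2    # assumed cost of one OLR pass (cheaper than the I/O)

# Traditional view: 400 separate full passes, each paying read + OLR.
traditional = 400 * (read + olr)

mu = 0.25    # assumed efficiency from killing losing learners early
eta = 0.5    # assumed efficiency from stopping early
# Revised view: one read feeds 100 OLR learners, discounted by mu and eta.
revised = 1 * (read + mu * 100 * olr) * eta

print(traditional)
print(revised)
print(traditional / revised)  # effective speedup under these assumptions
```

The structural point survives any particular choice of numbers: because reading and encoding dominate a single OLR pass, sharing one read across many learners and pruning early turns 400 passes of I/O into one.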
37. Deployment
• Training
  • ModelSerializer.writeBinary(..., model)
• Deployment
  • m = ModelSerializer.readBinary(...)
  • r = m.classifyScalar(featureVector)
38. The Upshot
• One machine can go fast
  • SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
  • simple sample server farm