3. What is Mahout?
• Recommendations (people who x this also x
that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from
examples)
• Stuff (LDA, SVD, frequent item-set, math)
7. Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
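The "fast on-line (sequential) training" above is stochastic gradient descent on the logistic-regression weights: each example nudges the weights immediately, with no batch Hadoop pass. A minimal Python sketch of that update rule (this is not Mahout's actual API; the function name, feature encoding, and learning rate are illustrative):

```python
import math

def sgd_update(weights, features, label, learning_rate=0.1):
    """One online SGD step for logistic regression.

    weights: dict of feature index -> weight, mutated in place
    features: dict of feature index -> value (often binary)
    label: 1 or 0
    """
    # Current predicted probability via the logistic (sigmoid) function
    score = sum(weights.get(i, 0.0) * v for i, v in features.items())
    p = 1.0 / (1.0 + math.exp(-score))
    # Gradient step on log loss: push weights toward the observed label
    for i, v in features.items():
        weights[i] = weights.get(i, 0.0) + learning_rate * (label - p) * v
    return p

# Train on a tiny alternating stream of (features, label) examples
weights = {}
stream = [({0: 1.0, 1: 1.0}, 1), ({1: 1.0, 2: 1.0}, 0)] * 200
for x, y in stream:
    sgd_update(weights, x, y)
```

After the stream, feature 0 (seen only with positive labels) carries positive weight and feature 2 (seen only with negative labels) carries negative weight, which is exactly the sequential behavior the bullet points at.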
9. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
EUR 33,100,000.00 (thirty-three million one hundred
thousand euros only) into a foreign company's
bank account for our favor.
...
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
13. How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– The weighted sum of feature × weight is the key quantity
• Each weight is a single real value
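The scoring step those bullets describe fits in a few lines of Python; the feature and weight values here are made up purely for illustration:

```python
import math

# Binary feature vector and learned per-feature real-valued weights
features = [1, 0, 1, 1, 0]
weights = [0.8, -1.2, 0.4, -0.3, 2.0]

# The key quantity: the weighted sum of active features
score = sum(f * w for f, w in zip(features, weights))

# Squash through the logistic function to get a probability
probability = 1.0 / (1.0 + math.exp(-score))
```

Here the score is 0.8 + 0.4 − 0.3 = 0.9, giving a probability of roughly 0.71.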
14. A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
15. A First Conclusion
• Probability as expressed by humans is
subjective and depends on information and
experience
16. A Second Conclusion
• A single number is a bad way to express
uncertain knowledge
• A distribution of values might be better
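One standard way to carry a distribution rather than a single number is a Beta posterior over the coin's heads-probability: each observed flip sharpens it, and sampling from it expresses the uncertainty that remains. A Python sketch (the 70% bias and the flip count are illustrative):

```python
import random

random.seed(1)

# Beta(1, 1) prior over the heads-probability, i.e. uniform ignorance
heads, tails = 0, 0

# Observe 20 flips of a coin that is actually biased 70% heads
for _ in range(20):
    if random.random() < 0.7:
        heads += 1
    else:
        tails += 1

# Posterior is Beta(heads + 1, tails + 1); its mean is a point summary,
# but a draw from it carries the remaining uncertainty with it
posterior_mean = (heads + 1) / (heads + tails + 2)
uncertain_estimate = random.betavariate(heads + 1, tails + 1)
```

With few flips the draws scatter widely; as evidence accumulates they concentrate, which is exactly the "distribution instead of a single number" point above.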
23. Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
– Playing the lesser machine has “opportunity cost”
• But how do we know which is which?
– Explore versus Exploit!
25. Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
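The four steps above are Thompson sampling with Beta posteriors. A Python sketch (the payoff probabilities and trial count here are illustrative and deliberately well separated; the slides' own experiments used nearly equal rates):

```python
import random

random.seed(42)

# True (unknown to the algorithm) payoff probabilities of the two bandits
true_p = [0.6, 0.4]

# Beta(1, 1) priors: observed wins and losses per bandit
wins = [0, 0]
losses = [0, 0]

for _ in range(2000):
    # Sample p1 and p2 from each bandit's posterior distribution
    samples = [random.betavariate(wins[i] + 1, losses[i] + 1)
               for i in range(2)]
    # Put the coin in bandit 1 if p1 > p2, else in bandit 2
    choice = 0 if samples[0] > samples[1] else 1
    # Observe the payoff and update that bandit's posterior
    if random.random() < true_p[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

plays = [wins[i] + losses[i] for i in range(2)]
```

Because draws from a wide posterior sometimes come out on top, the rule explores uncertain bandits automatically, and as posteriors sharpen it shifts play toward the better machine: exploration and exploitation in one step.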
26. [Graph: 25th/50th/75th-percentile payoff convergence; see editor's notes]
27. [Graph: probability of choosing the better bandit vs. number of trials; see editor's notes]
28. The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
exploitation
• Can be extended to more general response
models
29. Deployment with Storm/MapR
[Architecture diagram: Impression Logs and Click Logs feed a Targeting Engine and a Conversion Detector; a Model Selector dispatches RPC calls to several Online Models, each backed by its own Training process, with results reaching a Conversion Dashboard over RPC. All state is managed transactionally in the MapR file system.]
30. Service Architecture
[Architecture diagram: the same pipeline (Impression Logs, Click Logs, Targeting Engine, Conversion Detector, Model Selector, Online Models with Training processes, Conversion Dashboard, connected by RPC) layered over Storm and Hadoop, running on MapR Lockless Storage Services under MapR Pluggable Service Management.]
31. Find Out More
• Me: tdunning@mapr.com
ted.dunning@gmail.com
tdunning@apache.org
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning
Editor's notes
With no information, the relative expected payoff would be -0.25. This graph shows the 25th, 50th, and 75th percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum is nearly at the optimal sqrt(n) rate. Note the log scale on the number of trials.
Here is how the system converges, in terms of how likely it is to pick the better bandit, when the two payoff probabilities are only slightly different. After 1000 trials the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.