2. Agenda and such…
What is ML (Machine Learning)
ML Common Use Cases
Mahout Overview
Algorithms in Mahout
Mahout Commercial Use
Mahout Summary
3. What is ML
“Machine Learning is programming
computers to optimize a performance
criterion using example data or past
experience”
“Introduction to Machine Learning” by E. Alpaydin
10. Mahout Overview – What ?
Began life in 2008 as a subproject of
Apache’s Lucene project
In 2010, Mahout became a top-level
Apache project in its own right
Implemented in Java
Built upon Apache’s Hadoop (Look! An
elephant!)
11. Mahout Overview – Why ?
Many open source ML libraries either:
Lack community
Lack documentation and examples
Lack scalability
Lack the Apache license
Are research oriented
Are not well tested
Are not built over existing production-quality
libraries
12. Mahout Overview – Why ?
Scalability
Scalable to reasonably large datasets (core
algorithms implemented in Map/Reduce,
runnable on Hadoop)
Scalable to support your business case
(Apache License)
Scalable community
13. Mahout Overview – Why ?
Built over existing production quality
libraries
14. Mahout Overview – Use Cases
Mahout currently supports four main
use cases:
1. Recommendation
2. Clustering
3. Classification
4. Frequent Itemset Mining
15. Mahout Overview - Technical
System Requirements
Linux (or Cygwin on Windows)
Java 1.6.x or greater
Maven 2.0.11 or greater to build the source
code
Hadoop 0.2 or greater*
* Not all algorithms are implemented to work on Hadoop clusters
16. Algorithms in Mahout
We’ll focus on one example:
Collaborative Filtering (Recommenders)
Yet there are many (many!) more; you
can find them all at
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
17. Algorithms Examples –
Recommendation
Help users find items they might like
based on historical preferences
Based on example by Sebastian Schelter in “Distributed Itembased
Collaborative Filtering with Apache Mahout”
19. Algorithms Examples –
Recommendation
Algorithm
Neighborhood-based approach
Works by finding similarly rated items in the
user-item matrix, using a similarity measure
(e.g. cosine, Pearson correlation, Tanimoto coefficient)
Estimates a user's preference towards an
item by looking at his/her preferences
towards similar items
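The three similarity measures named above can be sketched in a few lines. This is a minimal, single-machine illustration in Python, not Mahout's own implementation; the ratings are the Alice/Bob/Peter example used in the MapReduce slides of this deck, and the function names (`co_ratings`, `cosine`, `pearson`, `tanimoto`) are made up for the sketch. Cosine and Pearson are computed here only over users who rated both items, which is one common convention and an assumption.

```python
from math import sqrt

# Ratings from the deck's running example: user -> {item: rating}.
ratings = {
    "Alice": {"Matrix": 5, "Alien": 1, "Inception": 4},
    "Bob": {"Alien": 2, "Inception": 5},
    "Peter": {"Matrix": 4, "Alien": 3, "Inception": 2},
}

def co_ratings(item_a, item_b):
    """Rating pairs from users who rated both items."""
    return [(r[item_a], r[item_b])
            for r in ratings.values() if item_a in r and item_b in r]

def cosine(pairs):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(a * b for a, b in pairs)
    norm_a = sqrt(sum(a * a for a, _ in pairs))
    norm_b = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson(pairs):
    """Pearson correlation: cosine of the mean-centered rating vectors."""
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return cosine([(a - mean_a, b - mean_b) for a, b in pairs])

def tanimoto(item_a, item_b):
    """Tanimoto coefficient: ignores rating values, compares who rated what."""
    users_a = {u for u, r in ratings.items() if item_a in r}
    users_b = {u for u, r in ratings.items() if item_b in r}
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0

pairs = co_ratings("Matrix", "Alien")   # [(5, 1), (4, 3)]
print(cosine(pairs), pearson(pairs), tanimoto("Matrix", "Alien"))
```

With only two co-rating users, Pearson is always +1 or -1 (or undefined), which is why real systems usually require a minimum overlap before trusting a similarity.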
20. Algorithms Examples –
Recommendation
Prediction: Estimate Bob's preference
towards “The Matrix”
1. Look at all items that
a) are similar to “The Matrix”
b) have been rated by Bob
=> “Alien”, “Inception”
2. Estimate the unknown preference with a
weighted sum
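A small sketch of the weighted-sum step, in Python. Bob's ratings (Alien 2, Inception 5) come from the example data in this deck; the similarity values are hypothetical placeholders chosen for easy arithmetic, and dividing by the sum of absolute similarities is one common normalization, assumed here rather than taken from the slides.

```python
# Bob's known ratings, from the deck's example data.
bob_ratings = {"Alien": 2, "Inception": 5}

# Hypothetical item-item similarities to "The Matrix" (illustrative values only).
similarity_to_matrix = {"Alien": 0.4, "Inception": 0.6}

def predict(user_ratings, similarities):
    """Similarity-weighted average of the user's ratings for similar items."""
    rated_and_similar = [i for i in user_ratings if i in similarities]
    numerator = sum(similarities[i] * user_ratings[i] for i in rated_and_similar)
    denominator = sum(abs(similarities[i]) for i in rated_and_similar)
    if denominator == 0:
        return None  # no similar item rated by this user: no estimate
    return numerator / denominator

# (0.4 * 2 + 0.6 * 5) / (0.4 + 0.6), roughly 3.8
print(predict(bob_ratings, similarity_to_matrix))
```

Because Inception is the more similar item under these placeholder weights, Bob's high rating for it pulls the estimate toward 5.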
21. Algorithms Examples –
Recommendation
MapReduce phase 1
Map – Make user the key
(Alice, Matrix, 5) -> Alice (Matrix, 5)
(Alice, Alien, 1) -> Alice (Alien, 1)
(Alice, Inception, 4) -> Alice (Inception, 4)
(Bob, Alien, 2) -> Bob (Alien, 2)
(Bob, Inception, 5) -> Bob (Inception, 5)
(Peter, Matrix, 4) -> Peter (Matrix, 4)
(Peter, Alien, 3) -> Peter (Alien, 3)
(Peter, Inception, 2) -> Peter (Inception, 2)
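The map step above can be imitated on one machine. This is a plain-Python sketch rather than Hadoop or Mahout code; the function name `map_user_as_key` is invented for illustration.

```python
# Input records (user, item, rating), as listed on the slide.
records = [
    ("Alice", "Matrix", 5), ("Alice", "Alien", 1), ("Alice", "Inception", 4),
    ("Bob", "Alien", 2), ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4), ("Peter", "Alien", 3), ("Peter", "Inception", 2),
]

def map_user_as_key(record):
    """Emit (key, value) with the user as key and (item, rating) as value."""
    user, item, rating = record
    return user, (item, rating)

mapped = [map_user_as_key(r) for r in records]
for key, value in mapped:
    print(key, value)
```

Keying by user is what lets the framework route all of one user's ratings to the same reducer.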
22. Algorithms Examples –
Recommendation
MapReduce phase 1
Reduce – Create inverted index
Input:
Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
Output:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
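The reduce step amounts to grouping values by key. A minimal single-machine sketch in Python, assuming the keyed pairs from the map step as input; `reduce_by_user` is a hypothetical helper name, and in a real Hadoop job the grouping is done by the framework's shuffle before the reducer runs.

```python
from collections import defaultdict

# Output of the map step: (user, (item, rating)) pairs.
mapped = [
    ("Alice", ("Matrix", 5)), ("Alice", ("Alien", 1)), ("Alice", ("Inception", 4)),
    ("Bob", ("Alien", 2)), ("Bob", ("Inception", 5)),
    ("Peter", ("Matrix", 4)), ("Peter", ("Alien", 3)), ("Peter", ("Inception", 2)),
]

def reduce_by_user(pairs):
    """Collect every (item, rating) value under its user key."""
    grouped = defaultdict(list)
    for user, item_rating in pairs:
        grouped[user].append(item_rating)
    return dict(grouped)

by_user = reduce_by_user(mapped)
for user, item_ratings in by_user.items():
    print(user, item_ratings)
```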
23. Algorithms Examples –
Recommendation
MapReduce phase 2
Map – Isolate all co-occurred ratings (all
cases where a user rated both items)
Input:
Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
Output:
Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
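The co-occurrence map can be sketched with `itertools.combinations`, which yields every pair of items a single user rated. This is an illustrative Python version, not Mahout's code; `map_cooccurrences` is a made-up name, and the sketch assumes each user's items arrive in the same relative order so that a pair always gets the same key (a real job would normalize the key order).

```python
from collections import defaultdict
from itertools import combinations

# Output of phase 1: user -> list of (item, rating).
by_user = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}

def map_cooccurrences(item_ratings):
    """Emit ((item_a, item_b), (rating_a, rating_b)) for each co-rated pair."""
    for (item_a, rating_a), (item_b, rating_b) in combinations(item_ratings, 2):
        yield (item_a, item_b), (rating_a, rating_b)

# Group the emitted pairs by item pair, as the following reduce step would see them.
cooccurrences = defaultdict(list)
for item_ratings in by_user.values():
    for item_pair, rating_pair in map_cooccurrences(item_ratings):
        cooccurrences[item_pair].append(rating_pair)

for item_pair, rating_pairs in cooccurrences.items():
    print(item_pair, rating_pairs)
```

Each item pair now carries the rating pairs needed to compute a similarity between the two items.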
27. Mahout Resources
Mahout website - http://mahout.apache.org/
Introducing Apache Mahout –
http://www.ibm.com/developerworks/java/library/j-mahout/
“Mahout In Action” by Sean Owen and Robin
Anil
28. Mahout Summary
ML is all over the web today
Mahout is about scalable machine
learning
Mahout has functionality for many of
today’s common machine learning tasks
MapReduce magic in
action
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers (2008).
Apache Lucene(TM) is a high-performance, full-featured text search engine library (2005).