The document provides an introduction to machine learning with Mahout. It discusses machine learning concepts and algorithms like clustering, classification, and recommendation. It introduces Hadoop as a framework for distributed processing of big data and Mahout as an open-source library for machine learning algorithms on Hadoop. The document demonstrates how to run recommendation algorithms and clustering algorithms using Mahout on local machines or cloud platforms like Amazon EC2 and EMR. It also discusses preprocessing text data and classifiers.
10. Future Plans
Establish Non-Profit
Increase Global Following
Become Strong Networking and Education Resource for YOU
11. A (very) little bit about me…
Consultant (Management & Technology)
Open Source Evangelist
Full-spectrum data nerd
12. A little about you!
Rate yourself (1 – 10) on Mahout
Rate yourself (1 – 10) on Machine Learning/Data Mining
Rate yourself (1 – 10) on Big Data/Hadoop
Please wait… optimizing presentation…
13. “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
-- Tom M. Mitchell, 1997
“Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to … an economic advantage.”
-- Ian H. Witten & Eibe Frank, 2005
14. “If you’re in academia, you call it ‘machine learning.’ If you’re in business, you call it ‘data mining.’”
-- Mark Hall
I create or improve general-purpose algorithms for machine learning
I use multiple machine learning algorithms for practical data discovery
Source: xkcd
16. Machine Learning Algorithms
Regression
K-means Clustering
K-NN
CART
Neural Networks
Support Vector Machines
Association Rules
Principal Component Analysis
Singular Value Decomposition
Ensemble Methods
Naïve Bayes
…
17. Real-World Applications
Recommender Systems
Image recognition
Signal Processing
Propensity to buy/churn
Fraud analysis
Text analytics
Spam filtering
Forecasting methods
Revenue management
…
18. The Problem … and Opportunity
Big Data™
“If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset.”
-- Omar Tawakol
Weka Explorer can handle ~1M instances, 25 attributes (50 MB file)
-- Ian Witten
21. Finally -- Mahout
A Java-based library of machine learning algorithms designed to support distributed processing
Initially built on MapReduce, now leaning heavily towards Spark
Primarily focused on recommenders, clustering, and classification
22. Running Mahout
Locally – download the Mahout distribution.
/bin/mahout is the wrapper script; by default it lists all the example programs available.
Many tools are included to convert data into vector formats and pre-process text; they are worth a look.
Amazon EC2 – configure the stack from scratch on EC2 servers.
Amazon EMR – a quicker start: much of the build is already optimized for MapReduce jobs, so just add Mahout as a custom jar and pass the script as a parameter.
24. Running Recommenders
Tip: If your data has no preference values (only a record that a user interacted with an item), there are Boolean equivalents of the recommender classes
Evaluate user vs. item similarities
Example
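The idea behind a Boolean-preference, user-based recommender can be sketched in a few lines of plain Python. This is an illustration only, not Mahout's API: the `interactions` data, the `tanimoto` and `recommend` functions, and the scoring rule (sum of neighbor similarities) are all assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical toy data: user -> set of items the user interacted with.
# Boolean preferences: no rating values, only presence or absence.
interactions = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "carol": {"c", "e"},
}


def tanimoto(x, y):
    """Tanimoto (Jaccard) coefficient between two item sets."""
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)


def recommend(user, data, top_n=3):
    """Rank items the user has not seen by the summed similarity
    of the other users who have interacted with them."""
    seen = data[user]
    scores = defaultdict(float)
    for other, items in data.items():
        if other == user:
            continue
        sim = tanimoto(seen, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] += sim
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("alice", interactions))
```

Swapping the roles of users and items (comparing the sets of users per item) gives the item-based variant, which is the comparison the slide suggests evaluating.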
25. Clustering Algorithms
To cluster you need:
Location in n-dimensional space
Distance metric
Threshold
K-means
Canopy
Dirichlet
Fuzzy K-means
Spectral Clustering
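As a rough illustration of how the three ingredients above fit together, here is a minimal single-machine k-means sketch in plain Python. It is not Mahout code. It assumes points are tuples of numbers, Euclidean distance as the metric, and it interprets the threshold as a convergence threshold on centroid movement; the function names and default values are hypothetical.

```python
import math


def euclidean(p, q):
    """Distance metric: straight-line distance in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, centroids, threshold=1e-4, max_iter=100, distance=euclidean):
    """Assign points to the nearest centroid, recompute centroids, and
    stop once no centroid moves more than `threshold` (an assumption
    about what the threshold controls)."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: distance(p, centroids[i]))
            clusters[idx].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        shift = max(distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, groups = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centers)
```

Passing a different `distance` function changes the notion of closeness without touching the rest of the loop, which is why the metric is listed as a separate requirement.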
27. Clustering Text
Identify k topics in a document corpus
Requires conversion of text into vectors
Lucene utilities are available to vectorize text and apply stop-word or weighting criteria.
seqdirectory – from a directory of text files
lucene.vector – from a Lucene index
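To make "conversion of text into vectors" concrete, the sketch below builds TF-IDF vectors in plain Python. It does not use Mahout or Lucene, and it does not claim to reproduce their weighting formulas: the stop-word list, the tokenizer, and the `tf * log(N / df)` weighting are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, one dense TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "a cat and a dog"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
```

The resulting vectors are points in n-dimensional space, one dimension per vocabulary term, so they can be handed to any clustering routine that accepts numeric vectors.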