3. Introduction
• Data is increasing rapidly
• The data must be processed and analyzed
• Analyzing the data by machine, as a human
being would. …Different
4. Machine Learning
Supervised Learning:
Generates a function, based on assigned
labels, that maps inputs to desired outputs.
Unsupervised Learning:
Looks for patterns native to a dataset and
models them, for example by clustering (e.g. data mining
and knowledge discovery).
Reinforcement Learning:
Learns how to act given reward (or
punishment) from the world.
7. Types of problems
Classification:
Data is labeled, meaning each item is assigned a class
- Learn a model from manually classified data
- Predict the class of a new object based on its
features and the learned model
e.g.: spam/non-spam, fraud/non-fraud
Clustering:
Data is not labeled, but can be divided into groups based on
similarity
- Group similar-looking objects
- Notion of similarity: a distance measure
e.g.: organizing pictures by faces without names
Regression:
Data is labeled with a real value rather than a class
e.g.: time series data such as the price of a stock over time
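The three problem types above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names (`euclidean`, `classify_nearest`, `group_by_nearest_center`, `fit_line`) are hypothetical, Euclidean distance is assumed as the distance measure, and a nearest-neighbour rule stands in for "the learned model".

```python
import math

def euclidean(a, b):
    """Distance measure used here as the notion of similarity (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_nearest(labeled, point):
    """Classification: predict the class of the closest labeled example."""
    _, label = min(labeled, key=lambda item: euclidean(item[0], point))
    return label

def group_by_nearest_center(points, centers):
    """Clustering step: assign each unlabeled point to its closest center."""
    groups = {i: [] for i in range(len(centers))}
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        groups[nearest].append(p)
    return groups

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Classification needs labeled examples, clustering works on the points alone, and regression predicts a real value instead of a class.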
8. Supervised Learning Algorithms
Decision Trees
k-Nearest Neighbours
Naive Bayes
Logistic Regression
Perceptron and Multi-layer
Perceptrons
Neural Networks
SVM and Kernel estimation
10. Uses
Spam filtering
Credit card Fraud detection
Face recognition (computer vision)
Speech understanding
Medical diagnosis
and so on…
11. Current state of ML libraries
Lack scalability
Lack documentation and examples
Lack Apache licensing
Are not well tested
Are Research oriented
Not built on existing production-
quality libraries
Lack “Deployability”
12. MapReduce
A programming framework
Used for parallel processing over large
data sets
The application is divided into small
fragments of work that are distributed
across the cluster
The computation unit of Hadoop
Two functions: Map() and Reduce()
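The Map() and Reduce() pair can be sketched with a word count in plain Python. This is a single-process illustration only, not Hadoop code: `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical names, and the grouping step that a real cluster performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Simulate the framework: map every record, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(run_mapreduce(["big data", "Big cluster"], map_fn, reduce_fn))
```

Because each call to `map_fn` depends only on its own record and each call to `reduce_fn` only on its own key, the calls could run on different machines, which is what lets the work be distributed across the cluster.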
13. Apache Mahout
The starting place for MapReduce-based
machine learning
A disparate collection of algorithms for
Recommendation
Clustering
Classification
Frequent itemset mining
14. Mahout installation
Prerequisites
Java
Hadoop
Maven
Java installation
1. sudo apt-get install sun java jdk
2. sudo gedit .bashrc
and set JAVA_HOME in the .bashrc file
Installation of Maven
1. sudo apt-get install maven2
2. Open .bashrc and add the lines
############## Apache-Maven #########
export M2_HOME=/usr/local/apache-maven-3.0.4
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
export JAVA_HOME=$HOME/programs/jdk
15. Contd..
Run mvn --version to verify that it is
correctly installed.
16. Hadoop installation
A single-node Hadoop cluster has been set up, in the same way as Java was
installed
Installation of Mahout
1. Download Mahout from http://www.apache.org/dyn/closer.cgi/lucene/mahout/
2. Create a folder and move the downloaded file into it,
say, mkdir /usr/local/mahout
3. Run mvn install; it shows as
20. Applications of Mahout
Collaborative Filtering
Matrix-factorization-based recommenders
User-based recommender
Clustering
Canopy Clustering
K-Means Clustering
Fuzzy K-Means
Affinity Propagation Clustering
Classification
Naive Bayes
21. Conclusion
By using the MapReduce framework,
we can parallelize a wide range of
machine learning algorithms, and
Apache Mahout provides a platform
for machine learning in the MapReduce
paradigm.
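As an illustration of how a machine learning algorithm can fit the MapReduce paradigm, here is a minimal sketch of one k-means iteration written as a map step and a reduce step. It is plain single-process Python and not Mahout's implementation; the names `kmeans_map`, `kmeans_reduce` and `kmeans_iteration` are hypothetical, and squared Euclidean distance is an assumption.

```python
from collections import defaultdict

def kmeans_map(point, centers):
    """Map: emit (index of the nearest center, point)."""
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    nearest = min(range(len(centers)), key=lambda i: sq_dist(centers[i]))
    return nearest, point

def kmeans_reduce(points):
    """Reduce: the new center is the mean of the points assigned to it."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centers):
    """One iteration: map every point, group by center index, reduce each group."""
    grouped = defaultdict(list)
    for point in points:
        idx, p = kmeans_map(point, centers)
        grouped[idx].append(p)
    # A center that received no points keeps its previous position.
    return [
        kmeans_reduce(grouped[i]) if grouped[i] else tuple(centers[i])
        for i in range(len(centers))
    ]
```

Each point is mapped independently of the others and each center is reduced independently, so both steps could be spread over a cluster. Repeating `kmeans_iteration` until the centers stop moving gives the full algorithm.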