The document discusses tools and technologies for large-scale data mining, focusing on Apache Mahout. It gives an overview of machine learning and of the algorithms in Apache Mahout, including collaborative filtering, clustering, and classification. It demonstrates building recommendation engines and document classification with Mahout. References and resources on Mahout are also included. The presentation was given at a seminar on data mining and the semantic web.
1. Tools and Technologies for Large Scale Data Mining
Jaganadh G
Project Lead NLP R&D
365Media Pvt. Ltd.
jaganadhg@gmail.com
DRDO Sponsored National Level Seminar
on
Challenging Issues on Data Mining Semantic Web,
Sri Krishna College of Engineering and Technology,
Coimbatore
27th Jan 2012
Jaganadh G Tools and Technologies for Large Scale Data Mining
2. About me !!
Software engineer specializing in text analytics research and development
When free, teaches Python, speaks about FOSS, and blogs at http://jaganadhg.in
Working as Project Lead (NLP) at 365Media Pvt. Ltd., Coimbatore
Computational linguist, linguist, Indologist, and book reviewer
Master's degree holder in Sanskrit from the University of Kerala
6. Machine Learning
Machine Learning
Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn.
This talk is not intended as an introduction to machine learning.
Don't expect any mathy equations here.
11. Machine Learning and Our Life
Do you think that machine learning has any impact on our lives?
Yes.
In our day-to-day life we use many tools powered by machine learning:
E-mail spam filtering
Product recommendations
Fraud detection
12. Examples
15. Tool for building machine learning powered products and services
Apache Mahout
Apache Mahout is a machine learning library designed to scale to large data sets. The project's goal is to build scalable machine learning libraries.
Commercially friendly licence (Apache License 2.0)
Well documented
Healthy community
Targeted at developers
26. Algorithms in Apache Mahout
Collaborative Filtering
User-based and item-based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest (decision tree based) classifier
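To make the first item in the list concrete, here is a minimal sketch of the idea behind a user-based recommender, written in plain Python rather than with Mahout. It is not Mahout's implementation or API: the toy ratings, the names `RATINGS`, `cosine_similarity`, and `recommend`, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); not data from the talk.
RATINGS = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob": {"book_a": 4.0, "book_b": 3.0, "book_c": 5.0, "book_d": 4.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_d": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by a similarity-weighted
    average of the ratings given by other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # skip users with no overlap or no positive similarity
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items the user has not seen yet
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = [(item, scores[item] / weights[item]) for item in scores]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    print(recommend("alice", RATINGS))
```

Calling `recommend("alice", RATINGS)` scores the one item Alice has not rated. Bob's rating counts for more than Carol's because his ratings are closer to hers.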
27. Demo
Building recommendation engines with Mahout
Document Classification with Mahout
Some Python stuff on Machine Learning
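The demo code itself is not reproduced in these slides. As a rough illustration of what document classification involves, the sketch below trains a plain multinomial Naive Bayes classifier with add-one smoothing in standard-library Python. It is not the Complementary Naive Bayes variant listed for Mahout, and the training sentences, labels, and function names (`train`, `classify`) are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train(labelled_docs):
    """Count word occurrences per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set; a real run would use a corpus such as
# the 20 Newsgroups data.
TRAINING = [
    ("the team won the hockey game", "sport"),
    ("the pitcher threw a fast ball", "sport"),
    ("the new graphics card renders fast", "tech"),
    ("install the driver for the graphics card", "tech"),
]
MODEL = train(TRAINING)

if __name__ == "__main__":
    print(classify("hockey game tonight", MODEL))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents are longer than a few words.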
29. Reference
Mahout in Action, by Sean Owen and Robin Anil, published by Manning Publications.
Taming Text, by Grant Ingersoll and Tom Morton, published by Manning Publications.
Introducing Apache Mahout, by Grant Ingersoll: an introduction to Apache Mahout focused on clustering, classification, and collaborative filtering. https://www.ibm.com/developerworks/java/library/j-mahout/index.html
Programming Collective Intelligence: Building Smart Web 2.0 Applications, by Toby Segaran, published by O'Reilly Media. http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325
30. Useful Resources
Apache Mahout site: http://mahout.apache.org/
Apache Mahout mailing list: user@mahout.apache.org
The code used for the Mahout demo is available at http://bitbucket.org/jaganadhg/blog/src/tip/bck9/java/
20 Newsgroups data set: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
31. Questions?
32. Acknowledgments
Thanks to:
Manning Publications for a review copy of the book "Mahout in Action"
Apache Mahout mailing list members
Ted Dunning and Robin Anil for suggestions
Sreejith S and Biju B for Java help
@chelakkandupoda for review and criticism
Mukundhanchari, R&D Director, 365Media Pvt. Ltd., for support and encouragement
33. Finally