SlideShare una empresa de Scribd logo
1 de 28
Apache Mahout – Driving the Yellow Elephant Grant Ingersoll TriHUG http://www.trihug.org
Anyone Here Use Machine Learning? Any users of: Google? Search? Priority Inbox? Facebook? Twitter? LinkedIn?
Topics What is Machine Learning? ML Use Cases What is Mahout? A Word on Scaling What can I do with it right now? Mahout and Hadoop: An Example
Amazon.com What is Machine Learning? Google News
Really it’s… “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” Intro. To Machine Learning by E. Alpaydin Subset of Artificial Intelligence Lots of related fields: Information Retrieval Stats Biology Linear algebra Many more
Common Use Cases Recommend friends/dates/products Classify content into predefined groups Find similar content based on object properties Find associations/patterns in actions/behaviors Identify key topics in large collections of text Detect anomalies in machine output Ranking search results Others?
Apache Mahout http://dictionary.reference.com/browse/mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License ;-) Or are research-oriented
Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
What does scalable mean? Ted Dunning (Mahout committer): As data grows linearly, either scale linearly in time or in machines 2X data requires 2X time or 2X machines (or less!) Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won’t scale to massive machine clusters Others fit logically on a Map Reduce framework like Apache Hadoop Still others will need different distributed programming models Be pragmatic
What Can I do with Mahout Right Now?
Recommendations Extensive framework for collaborative filtering Recommenders User based Item based Online and Offline support Offline can utilize Hadoop Many different Similarity measures Cosine, LLR, Tanimoto, Pearson, others It’s Valentine’s Day soon!
Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift Distance Measures Manhattan, Euclidean, other Topic Modeling  Cluster words across documents to identify topics Latent Dirichlet Allocation
Categorization Place new items into predefined categories: Sports, politics, entertainment Recommenders Implementations Naïve Bayes Compl. Naïve Bayes Decision Forests Linear Regression ,[object Object]
http://awe.sm/5FyNe,[object Object]
Evolutionary Map-Reduce ready fitness functions for genetic programming Integration with Watchmaker http://watchmaker.uncommons.org/index.php Problems solved: Traveling salesman Class discovery Many others Caveat: Hasn’t received as much attention as others
Other Primitive Collections! Math library Vectors, Matrices, etc. Noise Reduction via Singular Value Decomposition Export from Lucene/Solr and other formats
Mahout and Hadoop Most Mahout implementations are built on Map-Reduce Many also have sequential implementations Linear Regression is blazingly fast without needing M/R Let’s look at how K-Means is implemented in Mahout
K-Means Clustering Algorithm Nicely parallelizable! http://en.wikipedia.org/wiki/K-means_clustering
K-Means in Map-Reduce Input: Mahout Vectors representing the original content Either: A predefined set of initial centroids (Can be from Canopy) --k – The number of clusters to produce Iterate Do the centroid calculation (more in a moment) Clustering Step (optional) Output Centroids (as Mahout Vectors) Points for each Centroid (if Clustering Step was taken)
Map-Reduce Iteration Each Iteration calculates the Centroids using: KMeansMapper KMeansCombiner KMeansReducer Clustering Step Calculate the points for each Centroid using: KMeansClusterMapper
KMeansMapper During Setup: Load the initial Centroids (or the Centroids from the last iteration) Map Phase For each input Calculate it’s distance from each Centroid and output the closest one Distance Measures are pluggable Manhattan, Euclidean, Squared Euclidean, Cosine, others
KMeansReducer Setup: Load up clusters Convergence information Partial sums from KMeansCombiner (more in a moment) Reduce Phase Sum all the vectors in the cluster to produce a new Centroid Check for Convergence Output cluster
KMeansCombiner A Combiner is like a Map-side Reducer which helps save on IO Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
KMeansClusterMapper Some applications only care about what the Centroids are, so this step is optional Setup: Load up the clusters and the DistanceMeasure used Map Phase Calculate which Cluster the point belongs to Output <ClusterId, Vector>
Summary Machine learning is all over the web today Mahout is about scalable machine learning Mahout has functionality for many of today’s common machine learning tasks Many Mahout implementations use Hadoop KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce
Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk http://hadoop.apache.org
Resources “Mahout in Action” by Owen, Anil, Dunning and Friedman http://awe.sm/5FyNe “Introducing Apache Mahout”  http://www.ibm.com/developerworks/java/library/j-mahout/ “Programming Collective Intelligence” by Toby Segaran “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
References HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg Google News: http://news.google.com Amazon.com: http://www.amazon.com Facebook: http://www.facebook.com Couple: http://www.vlemx.com/ Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/ http://www.theregister.co.uk/2006/08/15/beer_diapers/ DMOZ: http://www.dmoz.org Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html

Más contenido relacionado

La actualidad más candente

Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionVarad Meru
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Cataldo Musto
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whyKorea Sdec
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutTed Dunning
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache MahoutDaniel Glauser
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkEvan Casey
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine LearningJeff Tanner
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Recommendation Engine Powered by Hadoop
Recommendation Engine Powered by HadoopRecommendation Engine Powered by Hadoop
Recommendation Engine Powered by HadoopPranab Ghosh
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataBenjamin Bengfort
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for EveryoneAly Abdelkareem
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
Survey on Parallel/Distributed Search Engines
Survey on Parallel/Distributed Search EnginesSurvey on Parallel/Distributed Search Engines
Survey on Parallel/Distributed Search EnginesYu Liu
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 

La actualidad más candente (20)

Machine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An IntroductionMachine Learning and Apache Mahout : An Introduction
Machine Learning and Apache Mahout : An Introduction
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)Mahout Tutorial and Hands-on (version 2015)
Mahout Tutorial and Hands-on (version 2015)
 
SDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the whySDEC2011 Mahout - the what, the how and the why
SDEC2011 Mahout - the what, the how and the why
 
Whats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache MahoutWhats Right and Wrong with Apache Mahout
Whats Right and Wrong with Apache Mahout
 
Machine Learning with Apache Mahout
Machine Learning with Apache MahoutMachine Learning with Apache Mahout
Machine Learning with Apache Mahout
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
Recommendation Engine Powered by Hadoop
Recommendation Engine Powered by HadoopRecommendation Engine Powered by Hadoop
Recommendation Engine Powered by Hadoop
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Machine Learning for Everyone
Machine Learning for EveryoneMachine Learning for Everyone
Machine Learning for Everyone
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Survey on Parallel/Distributed Search Engines
Survey on Parallel/Distributed Search EnginesSurvey on Parallel/Distributed Search Engines
Survey on Parallel/Distributed Search Engines
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 

Similar a Apache Mahout: Driving the Yellow Elephant

Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101John Ternent
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningRobin Anil
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive BayesJosh Patterson
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.ASHISH JAGTAP
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Data science technology overview
Data science technology overviewData science technology overview
Data science technology overviewSoojung Hong
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability MahoutJake Mannix
 
Numenta ACM Data Min - PowerPoint Presentation
Numenta ACM Data Min - PowerPoint PresentationNumenta ACM Data Min - PowerPoint Presentation
Numenta ACM Data Min - PowerPoint Presentationbutest
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 
Apache mahout and R-mining complex dataobject
Apache mahout and R-mining complex dataobjectApache mahout and R-mining complex dataobject
Apache mahout and R-mining complex dataobjectsakthibalabalamuruga
 
Jane Recommendation Engines
Jane Recommendation EnginesJane Recommendation Engines
Jane Recommendation EnginesAdam Rogers
 
PhD experience and skills
PhD experience and skillsPhD experience and skills
PhD experience and skillsDr. Wei GUO
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 

Similar a Apache Mahout: Driving the Yellow Elephant (20)

Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101
 
Recommendation engine
Recommendation engineRecommendation engine
Recommendation engine
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive Bayes
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.IOTA 2016 Social Recomender System Presentation.
IOTA 2016 Social Recomender System Presentation.
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Data science technology overview
Data science technology overviewData science technology overview
Data science technology overview
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
 
Numenta ACM Data Min - PowerPoint Presentation
Numenta ACM Data Min - PowerPoint PresentationNumenta ACM Data Min - PowerPoint Presentation
Numenta ACM Data Min - PowerPoint Presentation
 
Mahout in action
Mahout in actionMahout in action
Mahout in action
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
Apache mahout and R-mining complex dataobject
Apache mahout and R-mining complex dataobjectApache mahout and R-mining complex dataobject
Apache mahout and R-mining complex dataobject
 
Jane Recommendation Engines
Jane Recommendation EnginesJane Recommendation Engines
Jane Recommendation Engines
 
PhD experience and skills
PhD experience and skillsPhD experience and skills
PhD experience and skills
 
Ajug april 2011
Ajug april 2011Ajug april 2011
Ajug april 2011
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 

Más de Grant Ingersoll

This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopGrant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and MahoutGrant Ingersoll
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopGrant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsGrant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopGrant Ingersoll
 

Más de Grant Ingersoll (20)

Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taming Text
Taming TextTaming Text
Taming Text
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 

Apache Mahout: Driving the Yellow Elephant

  • 1. Apache Mahout – Driving the Yellow Elephant Grant Ingersoll TriHUG http://www.trihug.org
  • 2. Anyone Here Use Machine Learning? Any users of: Google? Search? Priority Inbox? Facebook? Twitter? LinkedIn?
  • 3. Topics What is Machine Learning? ML Use Cases What is Mahout? A Word on Scaling What can I do with it right now? Mahout and Hadoop: An Example
  • 4. Amazon.com What is Machine Learning? Google News
  • 5. Really it’s… “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” Intro. To Machine Learning by E. Alpaydin Subset of Artificial Intelligence Lots of related fields: Information Retrieval Stats Biology Linear algebra Many more
  • 6. Common Use Cases Recommend friends/dates/products Classify content into predefined groups Find similar content based on object properties Find associations/patterns in actions/behaviors Identify key topics in large collections of text Detect anomalies in machine output Ranking search results Others?
  • 7. Apache Mahout http://dictionary.reference.com/browse/mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License ;-) Or are research-oriented
  • 8. Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • 9. What does scalable mean? Ted Dunning (Mahout committer): As data grows linearly, either scale linearly in time or in machines 2X data requires 2X time or 2X machines (or less!) Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won’t scale to massive machine clusters Others fit logically on a Map Reduce framework like Apache Hadoop Still others will need different distributed programming models Be pragmatic
  • 10. What Can I do with Mahout Right Now?
  • 11. Recommendations Extensive framework for collaborative filtering Recommenders User based Item based Online and Offline support Offline can utilize Hadoop Many different Similarity measures Cosine, LLR, Tanimoto, Pearson, others It’s Valentine’s Day soon!
  • 12. Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift Distance Measures Manhattan, Euclidean, other Topic Modeling Cluster words across documents to identify topics Latent Dirichlet Allocation
  • 13.
  • 14.
  • 15. Evolutionary Map-Reduce ready fitness functions for genetic programming Integration with Watchmaker http://watchmaker.uncommons.org/index.php Problems solved: Traveling salesman Class discovery Many others Caveat: Hasn’t received as much attention as others
  • 16. Other Primitive Collections! Math library Vectors, Matrices, etc. Noise Reduction via Singular Value Decomposition Export from Lucene/Solr and other formats
  • 17. Mahout and Hadoop Most Mahout implementations are built on Map-Reduce Many also have sequential implementations Linear Regression is blazingly fast without needing M/R Let’s look at how K-Means is implemented in Mahout
  • 18. K-Means Clustering Algorithm Nicely parallelizable! http://en.wikipedia.org/wiki/K-means_clustering
  • 19. K-Means in Map-Reduce Input: Mahout Vectors representing the original content Either: A predefined set of initial centroids (Can be from Canopy) --k – The number of clusters to produce Iterate Do the centroid calculation (more in a moment) Clustering Step (optional) Output Centroids (as Mahout Vectors) Points for each Centroid (if Clustering Step was taken)
  • 20. Map-Reduce Iteration Each Iteration calculates the Centroids using: KMeansMapper KMeansCombiner KMeansReducer Clustering Step Calculate the points for each Centroid using: KMeansClusterMapper
  • 21. KMeansMapper During Setup: Load the initial Centroids (or the Centroids from the last iteration) Map Phase For each input Calculate it’s distance from each Centroid and output the closest one Distance Measures are pluggable Manhattan, Euclidean, Squared Euclidean, Cosine, others
  • 22. KMeansReducer Setup: Load up clusters Convergence information Partial sums from KMeansCombiner (more in a moment) Reduce Phase Sum all the vectors in the cluster to produce a new Centroid Check for Convergence Output cluster
  • 23. KMeansCombiner A Combiner is like a Map-side Reducer which helps save on IO Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
  • 24. KMeansClusterMapper Some applications only care about what the Centroids are, so this step is optional Setup: Load up the clusters and the DistanceMeasure used Map Phase Calculate which Cluster the point belongs to Output <ClusterId, Vector>
  • 25. Summary Machine learning is all over the web today Mahout is about scalable machine learning Mahout has functionality for many of today’s common machine learning tasks Many Mahout implementations use Hadoop KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce
  • 26. Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk http://hadoop.apache.org
  • 27. Resources “Mahout in Action” by Owen, Anil, Dunning and Friedman http://awe.sm/5FyNe “Introducing Apache Mahout” http://www.ibm.com/developerworks/java/library/j-mahout/ “Programming Collective Intelligence” by Toby Segaran “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
  • 28. References HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg Google News: http://news.google.com Amazon.com: http://www.amazon.com Facebook: http://www.facebook.com Couple: http://www.vlemx.com/ Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/ http://www.theregister.co.uk/2006/08/15/beer_diapers/ DMOZ: http://www.dmoz.org Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html

Notas del editor

  1. A few things come to mind
  2. Convergence just checks to see how far the centroid has moved from the previous centroid