Apache Mahout: Driving the Yellow Elephant

Apache Mahout – Driving the Yellow Elephant Grant Ingersoll TriHUG http://www.trihug.org

Anyone Here Use Machine Learning? Any users of: Google? Search? Priority Inbox? Facebook? Twitter? LinkedIn?

Topics What is Machine Learning? ML Use Cases What is Mahout? A Word on Scaling What can I do with it right now? Mahout and Hadoop: An Example

Amazon.com What is Machine Learning? Google News

Really it’s… “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning by E. Alpaydin). Subset of Artificial Intelligence. Lots of related fields: Information Retrieval, Stats, Biology, Linear algebra, many more

Common Use Cases Recommend friends/dates/products Classify content into predefined groups Find similar content based on object properties Find associations/patterns in actions/behaviors Identify key topics in large collections of text Detect anomalies in machine output Ranking search results Others?

Apache Mahout http://dictionary.reference.com/browse/mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License ;-) Or are research-oriented

Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

What does scalable mean? Ted Dunning (Mahout committer): As data grows linearly, either scale linearly in time or in machines 2X data requires 2X time or 2X machines (or less!) Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won’t scale to massive machine clusters Others fit logically on a Map Reduce framework like Apache Hadoop Still others will need different distributed programming models Be pragmatic

What Can I do with Mahout Right Now?

Recommendations Extensive framework for collaborative filtering Recommenders User based Item based Online and Offline support Offline can utilize Hadoop Many different Similarity measures Cosine, LLR, Tanimoto, Pearson, others It’s Valentine’s Day soon!
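Three of the similarity measures named on this slide can be sketched in a few lines. This is a plain-Python illustration, not Mahout's API: the function names and the list/set inputs are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no ratings on one side: treat as "no similarity"
    return dot / (norm_a * norm_b)

def tanimoto_similarity(items_a, items_b):
    """Shared items divided by all items either user has interacted with."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def pearson_correlation(a, b):
    """Pearson correlation of two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_a * var_b)
```

Cosine and Pearson work on rating values, while Tanimoto only needs to know which items each user touched, so it suits data without explicit ratings.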

Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift Distance Measures Manhattan, Euclidean, other Topic Modeling Cluster words across documents to identify topics Latent Dirichlet Allocation
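The two distance measures named on this slide, written as a minimal plain-Python sketch. The function names are illustrative assumptions and are not Mahout class names.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```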

Categorization Place new items into predefined categories: Sports, politics, entertainment Recommenders Implementations Naïve Bayes Complementary Naïve Bayes Decision Forests Linear Regression

http://awe.sm/5FyNe

Evolutionary Map-Reduce ready fitness functions for genetic programming Integration with Watchmaker http://watchmaker.uncommons.org/index.php Problems solved: Traveling salesman Class discovery Many others Caveat: Hasn’t received as much attention as others

Other Primitive Collections! Math library Vectors, Matrices, etc. Noise Reduction via Singular Value Decomposition Export from Lucene/Solr and other formats

Mahout and Hadoop Most Mahout implementations are built on Map-Reduce Many also have sequential implementations Linear Regression is blazingly fast without needing M/R Let’s look at how K-Means is implemented in Mahout

K-Means Clustering Algorithm Nicely parallelizable! http://en.wikipedia.org/wiki/K-means_clustering

K-Means in Map-Reduce Input: Mahout Vectors representing the original content, plus either a predefined set of initial centroids (can be from Canopy) or --k, the number of clusters to produce. Iterate: do the centroid calculation (more in a moment). Clustering Step (optional). Output: Centroids (as Mahout Vectors) and the Points for each Centroid (if the Clustering Step was taken)
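The overall loop can be sketched in a single process as follows. This is an illustration under stated assumptions, not Mahout's driver: the function name, the `max_iterations` and `convergence_delta` parameters, and the use of squared Euclidean distance are all choices made for the example, and points are plain Python lists rather than Mahout Vectors.

```python
def kmeans(points, centroids, max_iterations=10, convergence_delta=1e-3):
    """Iterate the centroid calculation until centroids stop moving."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(max_iterations):
        # "Map": assign each point to its nearest centroid, keeping running sums.
        sums = {}
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            total, count = sums.get(idx, ([0.0] * len(p), 0))
            sums[idx] = ([t + x for t, x in zip(total, p)], count + 1)

        # "Reduce": the new centroid is the mean of the points assigned to it.
        new_centroids = []
        for i, old in enumerate(centroids):
            if i in sums:
                total, count = sums[i]
                new_centroids.append([t / count for t in total])
            else:
                new_centroids.append(list(old))  # empty cluster: keep old centroid

        converged = all(
            sq_dist(old, new) <= convergence_delta ** 2
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if converged:
            break
    return centroids
```

In the Map-Reduce version each pass of the outer loop is a separate job, and the centroids are handed from one job to the next.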

Map-Reduce Iteration Each Iteration calculates the Centroids using: KMeansMapper KMeansCombiner KMeansReducer Clustering Step Calculate the points for each Centroid using: KMeansClusterMapper

KMeansMapper During Setup: Load the initial Centroids (or the Centroids from the last iteration) Map Phase: For each input, calculate its distance from each Centroid and output the closest one Distance Measures are pluggable: Manhattan, Euclidean, Squared Euclidean, Cosine, others
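A minimal sketch of the map phase, assuming centroids are held in a list and the distance measure is passed in as a function (which is how "pluggable" is modelled here). The names `kmeans_map` and `squared_euclidean` are hypothetical and do not reflect the signatures of the Mahout classes.

```python
def squared_euclidean(a, b):
    """One possible pluggable distance measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_map(point, centroids, distance):
    """Emit (index of the nearest centroid, point) for one input vector."""
    nearest = min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))
    return nearest, point
```

Swapping the measure only means passing a different function; the mapper logic does not change.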

KMeansReducer Setup: Load up clusters Convergence information Partial sums from KMeansCombiner (more in a moment) Reduce Phase Sum all the vectors in the cluster to produce a new Centroid Check for Convergence Output cluster
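The reduce phase can be sketched like this, assuming each incoming value is a `(vector_sum, point_count)` partial sum and that convergence means the centroid moved no more than a threshold. The function name, the tuple layout, and the `convergence_delta` parameter are assumptions for illustration, not Mahout's API.

```python
import math

def kmeans_reduce(cluster_id, old_centroid, partials, convergence_delta):
    """Merge partial sums into a new centroid and report convergence."""
    total = [0.0] * len(old_centroid)
    count = 0
    for partial_sum, partial_count in partials:
        total = [t + x for t, x in zip(total, partial_sum)]
        count += partial_count
    if count == 0:
        # No points were assigned: keep the old centroid unchanged.
        return cluster_id, list(old_centroid), True
    new_centroid = [t / count for t in total]
    moved = math.sqrt(sum((o - n) ** 2 for o, n in zip(old_centroid, new_centroid)))
    return cluster_id, new_centroid, moved <= convergence_delta
```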

KMeansCombiner A Combiner is like a Map-side Reducer which helps save on IO Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
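A sketch of the combiner idea: collapse all points one mapper emitted for a cluster into a single `(vector_sum, point_count)` pair, so that only one record per cluster leaves the node. The function name and the tuple layout are assumptions made for this example.

```python
def kmeans_combine(cluster_id, points):
    """Produce a partial sum for the points local to one mapper."""
    total = [0.0] * len(points[0])
    for p in points:
        total = [t + x for t, x in zip(total, p)]
    return cluster_id, (total, len(points))
```

The count has to travel with the sum because the reducer needs both to compute a mean across mappers.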

KMeansClusterMapper Some applications only care about what the Centroids are, so this step is optional Setup: Load up the clusters and the DistanceMeasure used Map Phase Calculate which Cluster the point belongs to Output <ClusterId, Vector>
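The optional clustering step amounts to one more nearest-centroid lookup per point, this time against the final clusters. In this sketch the clusters are assumed to be a dict from cluster id to centroid, and the function names are hypothetical.

```python
def manhattan(a, b):
    """Distance measure used for this example; any measure could be plugged in."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster_map(point, clusters, distance):
    """Return (ClusterId, Vector) for the cluster whose centroid is nearest."""
    cluster_id = min(clusters, key=lambda cid: distance(point, clusters[cid]))
    return cluster_id, point
```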

Summary Machine learning is all over the web today Mahout is about scalable machine learning Mahout has functionality for many of today’s common machine learning tasks Many Mahout implementations use Hadoop KMeans clustering is an example of a machine learning algorithm in Mahout that is implemented using Map Reduce

Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk http://hadoop.apache.org

Resources “Mahout in Action” by Owen, Anil, Dunning and Friedman http://awe.sm/5FyNe “Introducing Apache Mahout” http://www.ibm.com/developerworks/java/library/j-mahout/ “Programming Collective Intelligence” by Toby Segaran “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank

References HAL: http://en.wikipedia.org/wiki/File:Hal-9000.jpg Terminator: http://en.wikipedia.org/wiki/File:Terminator1984movieposter.jpg Matrix: http://en.wikipedia.org/wiki/File:The_Matrix_Poster.jpg Google News: http://news.google.com Amazon.com: http://www.amazon.com Facebook: http://www.facebook.com Couple: http://www.vlemx.com/ Beer and Diapers: http://www.flickr.com/photos/baubcat/2484459070/ http://www.theregister.co.uk/2006/08/15/beer_diapers/ DMOZ: http://www.dmoz.org Shopping Cart: http://themeanestmom.blogspot.com/2010/09/shopping-carts.html
