SlideShare una empresa de Scribd logo
1 de 38
Descargar para leer sin conexión
Mahout



Wednesday, March 16, 2011            1
Mahout
                            Scalable Data Mining for Everybody




Wednesday, March 16, 2011                                        1
What is Mahout
                   • Recommendations (people who x this also
                            x that)
                   • Clustering (segment data into groups of)
                   • Classification (learn decision making from
                            examples)
                   • Stuff (LDA, SVD, frequent item-set, math)

Wednesday, March 16, 2011                                        2
What is Mahout?
                   • Recommendations (people who x this also
                            x that)
                   • Clustering (segment data into groups of)
                   • Classification (learn decision
                            making from examples)
                   • Stuff (LDA, SVM, frequent item-set, math)

Wednesday, March 16, 2011                                        3
Classification in Detail
                   • Naive Bayes Family
                    • Hadoop based training
                   • Decision Forests
                    • Hadoop based training
                   • Logistic Regression (aka SGD)
                    • fast on-line (sequential) training
Wednesday, March 16, 2011                                  4
Classification in Detail
                   • Naive Bayes Family
                    • Hadoop based training
                   • Decision Forests
                    • Hadoop based training
                   • Logistic Regression (aka SGD)
                    • fast on-line (sequential) training
Wednesday, March 16, 2011                                  5
So What?
                                   Online training
                                   has low
                                   overhead for
                                   small and
                                   moderate size
                                   data-sets




Wednesday, March 16, 2011                            6
So What?
                                   Online training
                                   has low
                                   overhead for
                                   small and
                                   moderate size
                                   data-sets




Wednesday, March 16, 2011                            6
So What?
                                   Online training
                                   has low
                                   overhead for
                                   small and
                                   moderate size
                                   data-sets




Wednesday, March 16, 2011                            6
So What?
                                   Online training
                                   has low
                                   overhead for
                                   small and
                                   moderate size
                                   data-sets




Wednesday, March 16, 2011                            6
So What?
                            big starts here

                                              Online training
                                              has low
                                              overhead for
                                              small and
                                              moderate size
                                              data-sets




Wednesday, March 16, 2011                                       6
An Example




Wednesday, March 16, 2011                7
An Example




Wednesday, March 16, 2011                7
An Example




Wednesday, March 16, 2011                7
An Example




Wednesday, March 16, 2011                7
An Example




Wednesday, March 16, 2011                7
An Example




Wednesday, March 16, 2011                7
An Example




Wednesday, March 16, 2011                7
And Another
                   From: Dr. Paul Acquah
                   Dear Sir,
                   Re: Proposal for over-invoice Contract Benevolence

                   Based on information gathered from the India
                   hospital directory, I am pleased to propose a
                   confidential business deal for our mutual benefit.
                   I have in my possession, instruments
                   (documentation) to transfer the sum of
                   33,100,000.00 eur thirty-three million one hundred
                   thousand euros, only) into a foreign company's bank
                   account for our favor.
                   ...




Wednesday, March 16, 2011                                                8
And Another
                    Date: Thu, May 20, 2010 at 10:51 AM
                    From: George <george@fumble-tech.com>

                    Hi Ted, was a pleasure talking to you last night
                    at the Hadoop User Group. I liked the idea of
                    going for lunch together. Are you available
                    tomorrow (Friday) at noon?




Wednesday, March 16, 2011                                              8
And Another
                    Date: Thu, May 20, 2010 at 10:51 AM
                    From: George <george@fumble-tech.com>

                    Hi Ted, was a pleasure talking to you last night
                    at the Hadoop User Group. I liked the idea of
                    going for lunch together. Are you available
                    tomorrow (Friday) at noon?




Wednesday, March 16, 2011                                              8
And Another
                    Date: Thu, May 20, 2010 at 10:51 AM
                    From: George <george@fumble-tech.com>

                    Hi Ted, was a pleasure talking to you last night
                    at the Hadoop User Group. I liked the idea of
                    going for lunch together. Are you available
                    tomorrow (Friday) at noon?




Wednesday, March 16, 2011                                              8
Mahout’s SGD

                   • Learns on-line per example
                    • O(1) memory
                    • O(1) time per training example
                   • Sequential implementation
                    • fast, but not parallel

Wednesday, March 16, 2011                              9
Special Features
                   • Hashed feature encoding
                   • Per-term annealing
                    • learn the boring stuff once
                   • Auto-magical learning knob turning
                    • learns correct learning rate, learns
                            correct learning rate for learning learning
                            rate, ...


Wednesday, March 16, 2011                                                 10
Feature Encoding




Wednesday, March 16, 2011                      11
Feature Encoding




Wednesday, March 16, 2011                      11
Hashed Encoding




Wednesday, March 16, 2011                     12
Feature Collisions




Wednesday, March 16, 2011                        13
Learning Rate Annealing
        Learning Rate




                            # training examples seen


Wednesday, March 16, 2011                              14
Learning Rate   Per-term Annealing




                                   # training examples seen



Wednesday, March 16, 2011                                     15
Learning Rate   Per-term Annealing

                                 Common
                                  Feature




                                     # training examples seen



Wednesday, March 16, 2011                                       15
Learning Rate   Per-term Annealing


                                                          Rare
                                                         Feature




                                   # training examples seen



Wednesday, March 16, 2011                                          15
General Structure

                • OnlineLogisticRegression
                 • Traditional logistic regression
                 • Stochastic Gradient Descent
                 • Per term annealing
                 • Too fast (for the disk + encoder)

Wednesday, March 16, 2011                              16
Next Level

                   • CrossFoldLearner
                    • contains multiple primitive learners
                    • online cross validation
                    • 5x more work

Wednesday, March 16, 2011                                    17
And again
                   • AdaptiveLogisticRegression
                    • 20 x CrossFoldLearner
                    • evolves good learning and regularization
                              rates
                            • 100 x more work than basic learner
                            • still faster than disk + encoding
Wednesday, March 16, 2011                                          18
A comparison
                   • Traditional view
                    • 400 x (read + OLR)
                   • Revised Mahout view
                    • 1 x (read + mu x 100 x OLR) x eta
                    • mu = efficiency from killing losers early
                    • eta = efficiency from stopping early
Wednesday, March 16, 2011                                        19
Deployment

                   • Training
                    • ModelSerializer.writeBinary(..., model)
                   • Deployment
                    • m = ModelSerializer.readBinary(...)
                    • r = m.classifyScalar(featureVector)

Wednesday, March 16, 2011                                       20
The Upshot

                   • One machine can go fast
                    • SITM trains in 2 billion examples in 3
                            hours
                   • Deployability pays off big
                    • simple sample server farm

Wednesday, March 16, 2011                                      21

Más contenido relacionado

Similar a MAHOUT classifier tour

Opensource Authentication and Authorization
Opensource Authentication and AuthorizationOpensource Authentication and Authorization
Opensource Authentication and AuthorizationConFoo
 
Visual Communication That Works! (PDF)
Visual Communication That Works! (PDF)Visual Communication That Works! (PDF)
Visual Communication That Works! (PDF)Barry Casey
 
Intro to Linked Data: Context
Intro to Linked Data: ContextIntro to Linked Data: Context
Intro to Linked Data: ContextDavid Wood
 
Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011
Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011
Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011Bachkoutou Toutou
 

Similar a MAHOUT classifier tour (6)

Mahout classifier tour
Mahout classifier tourMahout classifier tour
Mahout classifier tour
 
Opensource Authentication and Authorization
Opensource Authentication and AuthorizationOpensource Authentication and Authorization
Opensource Authentication and Authorization
 
Visual Communication That Works! (PDF)
Visual Communication That Works! (PDF)Visual Communication That Works! (PDF)
Visual Communication That Works! (PDF)
 
State of Social & Informal Learning
State of Social & Informal LearningState of Social & Informal Learning
State of Social & Informal Learning
 
Intro to Linked Data: Context
Intro to Linked Data: ContextIntro to Linked Data: Context
Intro to Linked Data: Context
 
Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011
Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011
Kill bottlenecks with gearman, sphinx, and memcached, Confoo 2011
 

Más de Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

Más de Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

MAHOUT classifier tour

  • 2. Mahout Scalable Data Mining for Everybody Wednesday, March 16, 2011 1
  • 3. What is Mahout • Recommendations (people who x this also x that) • Clustering (segment data into groups of) • Classification (learn decision making from examples) • Stuff (LDA, SVD, frequent item-set, math) Wednesday, March 16, 2011 2
  • 4. What is Mahout? • Recommendations (people who x this also x that) • Clustering (segment data into groups of) • Classification (learn decision making from examples) • Stuff (LDA, SVM, frequent item-set, math) Wednesday, March 16, 2011 3
  • 5. Classification in Detail • Naive Bayes Family • Hadoop based training • Decision Forests • Hadoop based training • Logistic Regression (aka SGD) • fast on-line (sequential) training Wednesday, March 16, 2011 4
  • 6. Classification in Detail • Naive Bayes Family • Hadoop based training • Decision Forests • Hadoop based training • Logistic Regression (aka SGD) • fast on-line (sequential) training Wednesday, March 16, 2011 5
  • 7. So What? Online training has low overhead for small and moderate size data-sets Wednesday, March 16, 2011 6
  • 8. So What? Online training has low overhead for small and moderate size data-sets Wednesday, March 16, 2011 6
  • 9. So What? Online training has low overhead for small and moderate size data-sets Wednesday, March 16, 2011 6
  • 10. So What? Online training has low overhead for small and moderate size data-sets Wednesday, March 16, 2011 6
  • 11. So What? big starts here Online training has low overhead for small and moderate size data-sets Wednesday, March 16, 2011 6
  • 19. And Another From: Dr. Paul Acquah Dear Sir, Re: Proposal for over-invoice Contract Benevolence Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor. ... Wednesday, March 16, 2011 8
  • 20. And Another Date: Thu, May 20, 2010 at 10:51 AM From: George <george@fumble-tech.com> Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon? Wednesday, March 16, 2011 8
  • 21. And Another Date: Thu, May 20, 2010 at 10:51 AM From: George <george@fumble-tech.com> Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon? Wednesday, March 16, 2011 8
  • 22. And Another Date: Thu, May 20, 2010 at 10:51 AM From: George <george@fumble-tech.com> Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon? Wednesday, March 16, 2011 8
  • 23. Mahout’s SGD • Learns on-line per example • O(1) memory • O(1) time per training example • Sequential implementation • fast, but not parallel Wednesday, March 16, 2011 9
  • 24. Special Features • Hashed feature encoding • Per-term annealing • learn the boring stuff once • Auto-magical learning knob turning • learns correct learning rate, learns correct learning rate for learning learning rate, ... Wednesday, March 16, 2011 10
  • 29. Learning Rate Annealing Learning Rate # training examples seen Wednesday, March 16, 2011 14
  • 30. Learning Rate Per-term Annealing # training examples seen Wednesday, March 16, 2011 15
  • 31. Learning Rate Per-term Annealing Common Feature # training examples seen Wednesday, March 16, 2011 15
  • 32. Learning Rate Per-term Annealing Rare Feature # training examples seen Wednesday, March 16, 2011 15
  • 33. General Structure • OnlineLogisticRegression • Traditional logistic regression • Stochastic Gradient Descent • Per term annealing • Too fast (for the disk + encoder) Wednesday, March 16, 2011 16
  • 34. Next Level • CrossFoldLearner • contains multiple primitive learners • online cross validation • 5x more work Wednesday, March 16, 2011 17
  • 35. And again • AdaptiveLogisticRegression • 20 x CrossFoldLearner • evolves good learning and regularization rates • 100 x more work than basic learner • still faster than disk + encoding Wednesday, March 16, 2011 18
  • 36. A comparison • Traditional view • 400 x (read + OLR) • Revised Mahout view • 1 x (read + mu x 100 x OLR) x eta • mu = efficiency from killing losers early • eta = efficiency from stopping early Wednesday, March 16, 2011 19
  • 37. Deployment • Training • ModelSerializer.writeBinary(..., model) • Deployment • m = ModelSerializer.readBinary(...) • r = m.classifyScalar(featureVector) Wednesday, March 16, 2011 20
  • 38. The Upshot • One machine can go fast • SITM trains in 2 billion examples in 3 hours • Deployability pays off big • simple sample server farm Wednesday, March 16, 2011 21