2. Mahout
Scalable Data Mining for Everybody
Wednesday, March 16, 2011 1
3. What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set mining, math)
5. Classification in Detail
• Naive Bayes family
  • Hadoop-based training
• Decision Forests
  • Hadoop-based training
• Logistic Regression (aka SGD)
  • fast on-line (sequential) training
7. So What?
Online training has low overhead for small and moderate size data-sets.
[Chart annotation: “big starts here”]
19. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual benefit.
I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur thirty-three million one hundred
thousand euros, only) into a foreign company's bank
account for our favor.
...
20. And Another
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
23. Mahout’s SGD
• Learns on-line, one example at a time
• O(1) memory
• O(1) time per training example
• Sequential implementation
  • fast, but not parallel
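As a rough illustration of the bullets above (a plain-Python sketch, not Mahout's Java implementation), an on-line logistic learner keeps a fixed-size weight vector and does one gradient step per example, which is what gives it O(1) memory and O(1) time per training example:

```python
import math

class OnlineLogisticRegression:
    """Toy on-line logistic regression: O(1) memory, O(1) time per example."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = [0.0] * num_features  # fixed-size weights: O(1) memory
        self.rate = learning_rate

    def classify_scalar(self, x):
        """Return P(y = 1 | x) for a dense feature vector x."""
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        """One gradient step on a single (x, y) example: O(1) time."""
        error = y - self.classify_scalar(x)
        for i, xi in enumerate(x):
            self.w[i] += self.rate * error * xi

# Learn y = 1 exactly when the second feature is active.
model = OnlineLogisticRegression(num_features=2)
for _ in range(1000):
    model.train([1.0, 0.0], 0)
    model.train([1.0, 1.0], 1)
print(model.classify_scalar([1.0, 1.0]))  # close to 1
print(model.classify_scalar([1.0, 0.0]))  # close to 0
```

Because each `train` call touches only the current example, the learner never needs the data set in memory, which is why the sequential implementation is fast but not parallel.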
24. Special Features
• Hashed feature encoding
• Per-term annealing
  • learn the boring stuff once
• Auto-magical learning knob turning
  • learns the correct learning rate, learns the correct learning rate for learning the learning rate, ...
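Hashed feature encoding (the first special feature above) maps feature names into a fixed-size vector by hashing, so memory stays bounded no matter how many distinct terms appear. A minimal sketch of the idea in Python (Mahout's actual encoders live in `org.apache.mahout.vectorizer.encoders`; the function below is illustrative only):

```python
import hashlib

def hashed_encode(features, num_buckets=1024):
    """Hash each feature name into a fixed-size vector (the 'hashing trick').

    Memory is bounded by num_buckets regardless of how many distinct
    terms occur, at the cost of occasional hash collisions.
    """
    v = [0.0] * num_buckets
    for name, value in features.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        v[h % num_buckets] += value
    return v

v = hashed_encode({"subject:lunch": 1.0, "from:george": 1.0})
print(len(v))                            # always num_buckets
print(sum(1 for x in v if x != 0.0))     # at most 2 non-zero buckets
```

Collisions simply add two terms' weights into the same bucket; with enough buckets this costs little accuracy and removes the need for a term dictionary.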
30. Learning Rate Per-term Annealing
[Figure: learning rate vs. # training examples seen, with curves annotated “Common Feature” and “Rare Feature”]
33. General Structure
• OnlineLogisticRegression
  • traditional logistic regression
  • stochastic gradient descent
  • per-term annealing
  • too fast for the disk + encoder to keep up
34. Next Level
• CrossFoldLearner
  • contains multiple primitive learners
  • on-line cross validation
  • 5x more work
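The CrossFoldLearner idea can be sketched as follows: each incoming example is scored by the one fold that will never train on it, so the running accuracy estimate stays out-of-sample, and training the remaining folds is what makes it roughly (folds − 1) times more work. This is an illustrative Python sketch under that assumption, not Mahout's Java class:

```python
import math

class TinyLogistic:
    """Minimal on-line logistic learner used as the primitive model."""
    def __init__(self, n=2, rate=0.1):
        self.w = [0.0] * n
        self.rate = rate
    def classify_scalar(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))
    def train(self, x, y):
        e = y - self.classify_scalar(x)
        self.w = [wi + self.rate * e * xi for wi, xi in zip(self.w, x)]

class CrossFoldLearner:
    """Each example tests the one fold that does not train on it."""
    def __init__(self, make_learner, folds=5):
        self.models = [make_learner() for _ in range(folds)]
        self.seen = 0
        self.correct = 0
    def train(self, x, y):
        held_out = self.seen % len(self.models)
        # Score on the held-out fold before any fold sees this example.
        p = self.models[held_out].classify_scalar(x)
        self.correct += int((p > 0.5) == (y == 1))
        self.seen += 1
        for i, m in enumerate(self.models):
            if i != held_out:
                m.train(x, y)   # ~(folds - 1)x the training work
    def accuracy(self):
        """Running out-of-sample accuracy estimate."""
        return self.correct / max(1, self.seen)

cfl = CrossFoldLearner(TinyLogistic, folds=5)
for _ in range(500):
    cfl.train([1.0, 0.0], 0)
    cfl.train([1.0, 1.0], 1)
print(cfl.accuracy())  # high: the data is separable
```

The payoff is that the quality estimate comes for free with training, which is what the AdaptiveLogisticRegression on the next slide exploits to compare learners.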
35. And again
• AdaptiveLogisticRegression
  • 20 x CrossFoldLearner
  • evolves good learning and regularization rates
  • 100x more work than the basic learner
  • still faster than disk + encoding
36. A comparison
• Traditional view
  • 400 x (read + OLR)
• Revised Mahout view
  • 1 x (read + mu x 100 x OLR) x eta
  • mu = efficiency from killing losers early
  • eta = efficiency from stopping early
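The arithmetic above can be made concrete with assumed numbers; all values below (per-pass costs, mu, eta) are illustrative assumptions for the sake of the calculation, not measurements from the talk:

```python
# Illustrative cost comparison of the two views above.
read = 1.0   # assumed cost of reading + encoding one pass of the data
olr = 0.2    # assumed cost of one OLR pass (cheaper than the I/O)

# Traditional view: 400 separate full passes, each paying read + OLR.
traditional = 400 * (read + olr)

mu = 0.25    # assumed efficiency from killing losing learners early
eta = 0.5    # assumed efficiency from stopping early
# Revised view: one read feeds 100 OLR learners, discounted by mu and eta.
revised = 1 * (read + mu * 100 * olr) * eta

print(traditional)
print(revised)
print(traditional / revised)  # effective speedup under these assumptions
```

The structural point survives any particular choice of numbers: because reading and encoding dominate a single OLR pass, sharing one read across many learners and pruning early turns 400 passes of I/O into one.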
37. Deployment
• Training
  • ModelSerializer.writeBinary(..., model)
• Deployment
  • m = ModelSerializer.readBinary(...)
  • r = m.classifyScalar(featureVector)
38. The Upshot
• One machine can go fast
  • SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
  • simple sample server farm