3. What is Mahout?
• Recommendations (people who x this also x
that)
• Clustering (segment data into groups of similar items)
• Classification (learn decision making from
examples)
• Stuff (LDA, SVD, frequent item-set, math)
7. Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
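The "fast on-line (sequential) training" above is stochastic gradient descent on the logistic-regression weights: each example nudges the weights immediately, with no batch Hadoop pass. A minimal Python sketch of that update rule (this is not Mahout's actual API; the function name, feature encoding, and learning rate are illustrative):

```python
import math

def sgd_update(weights, features, label, learning_rate=0.1):
    """One online SGD step for logistic regression.

    weights: dict of feature index -> weight, mutated in place
    features: dict of feature index -> value (often binary)
    label: 1 or 0
    """
    # Current predicted probability via the logistic (sigmoid) function
    score = sum(weights.get(i, 0.0) * v for i, v in features.items())
    p = 1.0 / (1.0 + math.exp(-score))
    # Gradient step on log loss: push weights toward the observed label
    for i, v in features.items():
        weights[i] = weights.get(i, 0.0) + learning_rate * (label - p) * v
    return p

# Train on a tiny alternating stream of (features, label) examples
weights = {}
stream = [({0: 1.0, 1: 1.0}, 1), ({1: 1.0, 2: 1.0}, 0)] * 200
for x, y in stream:
    sgd_update(weights, x, y)
```

After the stream, feature 0 (seen only with positive labels) carries positive weight and feature 2 (seen only with negative labels) carries negative weight, which is exactly the sequential behavior the bullet points at.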
9. And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
EUR 33,100,000.00 (thirty-three million one hundred
thousand euros only) into a foreign company's
bank account for our favor.
...
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
13. How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– The weighted sum of feature × weight is the key quantity
• Each weight is a single real value
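The scoring step those bullets describe fits in a few lines of Python; the feature and weight values here are made up purely for illustration:

```python
import math

# Binary feature vector and learned per-feature real-valued weights
features = [1, 0, 1, 1, 0]
weights = [0.8, -1.2, 0.4, -0.3, 2.0]

# The key quantity: the weighted sum of active features
score = sum(f * w for f, w in zip(features, weights))

# Squash through the logistic function to get a probability
probability = 1.0 / (1.0 + math.exp(-score))
```

Here the score is 0.8 + 0.4 − 0.3 = 0.9, giving a probability of roughly 0.71.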
14. A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
15. A First Conclusion
• Probability as expressed by humans is
subjective and depends on information and
experience
16. A Second Conclusion
• A single number is a bad way to express
uncertain knowledge
• A distribution of values might be better
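One standard way to carry a distribution rather than a single number is a Beta posterior over the coin's heads-probability: each observed flip sharpens it, and sampling from it expresses the uncertainty that remains. A Python sketch (the 70% bias and the flip count are illustrative):

```python
import random

random.seed(1)

# Beta(1, 1) prior over the heads-probability, i.e. uniform ignorance
heads, tails = 0, 0

# Observe 20 flips of a coin that is actually biased 70% heads
for _ in range(20):
    if random.random() < 0.7:
        heads += 1
    else:
        tails += 1

# Posterior is Beta(heads + 1, tails + 1); its mean is a point summary,
# but a draw from it carries the remaining uncertainty with it
posterior_mean = (heads + 1) / (heads + tails + 2)
uncertain_estimate = random.betavariate(heads + 1, tails + 1)
```

With few flips the draws scatter widely; as evidence accumulates they concentrate, which is exactly the "distribution instead of a single number" point above.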
23. Which One to Play?
• One may be better than the other
• The better machine pays off at some rate
• Playing the other will pay off at a lesser rate
– Playing the lesser machine has “opportunity cost”
• But how do we know which is which?
– Explore versus Exploit!
25. Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
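The four steps above are Thompson sampling with Beta posteriors. A Python sketch (the payoff probabilities and trial count here are illustrative and deliberately well separated; the slides' own experiments used nearly equal rates):

```python
import random

random.seed(42)

# True (unknown to the algorithm) payoff probabilities of the two bandits
true_p = [0.6, 0.4]

# Beta(1, 1) priors: observed wins and losses per bandit
wins = [0, 0]
losses = [0, 0]

for _ in range(2000):
    # Sample p1 and p2 from each bandit's posterior distribution
    samples = [random.betavariate(wins[i] + 1, losses[i] + 1)
               for i in range(2)]
    # Put the coin in bandit 1 if p1 > p2, else in bandit 2
    choice = 0 if samples[0] > samples[1] else 1
    # Observe the payoff and update that bandit's posterior
    if random.random() < true_p[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

plays = [wins[i] + losses[i] for i in range(2)]
```

Because draws from a wide posterior sometimes come out on top, the rule explores uncertain bandits automatically, and as posteriors sharpen it shifts play toward the better machine: exploration and exploitation in one step.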
26. [Graph: 25th/50th/75th-percentile payoff convergence; see editor's notes]
27. [Graph: probability of choosing the better bandit vs. number of trials; see editor's notes]
28. The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and
exploitation
• Can be extended to more general response
models
29. Deployment with Storm/MapR
[Architecture diagram: Impression Logs and Click Logs feed a Targeting Engine and a Conversion Detector; a Model Selector dispatches RPC calls to several Online Models, each backed by its own Training process, with results reaching a Conversion Dashboard over RPC. All state is managed transactionally in the MapR file system.]
30. Service Architecture
[Architecture diagram: the same pipeline (Impression Logs, Click Logs, Targeting Engine, Conversion Detector, Model Selector, Online Models with Training processes, Conversion Dashboard, connected by RPC) layered over Storm and Hadoop, running on MapR Lockless Storage Services under MapR Pluggable Service Management.]
31. Find Out More
• Me: tdunning@mapr.com
ted.dunning@gmail.com
tdunning@apache.org
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning
Editor's notes
With no information, the relative expected payoff would be -0.25. This graph shows the 25th, 50th, and 75th percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum is nearly at the optimal sqrt(n) rate. Note the log scale on the number of trials.
Here is how the system converges, in terms of how likely it is to pick the better bandit, when the two payoff probabilities are only slightly different. After 1000 trials the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.