Hands-on Classification
Preliminaries
• Code is available from GitHub:
  – git@github.com:tdunning/Chapter-16.git
• EC2 instances available
• Thumb drives also available
• Email to ted.dunning@gmail.com
• Twitter @ted_dunning
A Quick Review
• What is classification?
  – goes-ins: predictors
  – goes-outs: target variable
• What is classifiable data?
  – continuous, categorical, word-like, text-like
  – uniform schema
• How do we convert from classifiable data to a
  feature vector?
Data Flow
Not quite so simple
Classifiable Data
• Continuous
  – A number that represents a quantity, not an id
  – Blood pressure, stock price, latitude, mass
• Categorical
  – One of a known, small set (color, shape)
• Word-like
  – One of a possibly unknown, possibly large set
• Text-like
  – Many word-like things, usually unordered
But that isn’t quite there
• Learning algorithms need feature vectors
  – Have to convert from data to vector
• Can assign one location per feature
  – or category
  – or word
• Can assign one or more locations with hashing
  – scary
  – but safe on average
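The hashing idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not the code from the repository: the function names `bucket` and `encode`, the use of MD5 as a stable hash, the vector size of 32 and the two probes per feature are all assumptions made for the example.

```python
import hashlib

def bucket(key, size, probe=0):
    """Map a string key to a vector index with a stable hash (MD5 here)."""
    digest = hashlib.md5(f"{probe}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size

def encode(record, size=32, probes=2):
    """Encode one record into a fixed-length vector by feature hashing."""
    vec = [0.0] * size
    for name, value in record.items():
        if isinstance(value, (int, float)):
            # Continuous: hash the field name, add the value itself.
            keys, weight = [name], float(value)
        elif isinstance(value, str):
            # Categorical or word-like: hash "name=value", add 1.
            keys, weight = [f"{name}={value}"], 1.0
        else:
            # Text-like: an unordered bag of words, each hashed separately.
            keys, weight = [f"{name}={word}" for word in value], 1.0
        for key in keys:
            for p in range(probes):
                vec[bucket(key, size, p)] += weight
    return vec

record = {"mass": 2.5, "color": "red", "body": ["cheap", "pills", "cheap"]}
vector = encode(record)
```

The vector length is fixed in advance, so a word that has never been seen before still lands somewhere without any dictionary having to grow.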
Data Flow
Classifiable Data   Vectors
Hashed Encoding
What about collisions?
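One way to get a feel for the collision question is to count how many pairs of words become indistinguishable. The sketch below assumes that each word is hashed to one location per probe, and compares one probe against two; the helper names, the MD5 hash, the 200-word vocabulary and the 100-slot vector are hypothetical choices for illustration only.

```python
import hashlib
from itertools import combinations

def bucket(key, size, probe):
    """Stable hash of (probe, key) into the range [0, size)."""
    digest = hashlib.md5(f"{probe}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size

def locations(key, size, probes):
    """The vector index chosen for this key on each probe, in probe order."""
    return tuple(bucket(key, size, p) for p in range(probes))

def count_full_collisions(words, size, probes):
    """Count word pairs that land on the same index on every probe."""
    locs = [locations(w, size, probes) for w in words]
    return sum(1 for a, b in combinations(locs, 2) if a == b)

words = [f"word{i}" for i in range(200)]
one_probe = count_full_collisions(words, size=100, probes=1)
two_probes = count_full_collisions(words, size=100, probes=2)
print(one_probe, two_probes)
```

With 200 words and only 100 slots, single-probe collisions are unavoidable. A pair that collides on every one of two probes must also collide on the first, so the second count can never exceed the first, and the learner can still tell most words apart through the location they do not share.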
Let’s write some code
  (cue relaxing background music)
Generating new features
• Sometimes the existing features are difficult to
  use
• Restating the geometry using new reference
  points may help
• Automatic reference points using k-means can
  be better than manual references
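As a rough sketch of the reference-point idea, the example below runs a few plain Lloyd iterations and then restates a point as its distances to the resulting centroids. The function names, the toy data and the choice of starting centroids are assumptions for illustration; the talk's own code may differ, for example in how the target variable is used.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def distance_features(point, centroids):
    """Restate a point as its distances to each reference point."""
    return [dist(point, c) for c in centroids]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = kmeans(points, centroids=[points[0], points[3]])
features = distance_features((0.5, 0.5), centroids)
```

The new features can be appended to the original ones or used instead of them; either way the classifier now sees "how far from each reference point" rather than raw coordinates.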
K-means using target
K-means features
More code!
  (cue relaxing background music)
Integration Issues
• Feature extraction is ideal for map-reduce
  – Side data adds some complexity
• Clustering works great with map-reduce
  – Cluster centroids to HDFS


• Model training works better sequentially
  – Need centroids in normal files
• Model deployment shouldn’t depend on HDFS
Parallel Stochastic Gradient Descent
• Input → Train sub model → Average models → Model
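The train-then-average pattern can be sketched as follows, assuming logistic regression as the model and a simple round-robin split of the input. The learning rate, epoch count, toy data and function names are all hypothetical; the sub-models are trained in a plain loop here, where a real job would train them in parallel.

```python
import math

def sigmoid(z):
    """Logistic function, clamped to avoid overflow in exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability of the positive class under weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train_sgd(examples, dims, rate=0.1, epochs=50):
    """Sequential SGD for logistic regression on one shard of the input."""
    w = [0.0] * dims
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(w, x)
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """Average the weight vectors of the sub-models, coordinate by coordinate."""
    return [sum(ws) / len(models) for ws in zip(*models)]

# Toy data: first coordinate is a constant bias term, label is 1 when the
# second coordinate is positive.
data = [((1.0, x / 10.0), 1 if x > 0 else 0)
        for x in range(-20, 21) if x != 0]
shards = [data[i::4] for i in range(4)]          # four input splits
sub_models = [train_sgd(shard, dims=2) for shard in shards]
model = average_models(sub_models)
```

Averaging only needs the weight vectors, so each sub-model can be trained where its shard of the data lives and only a small amount of state has to be brought together.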
Variational Dirichlet Assignment
• Input → Gather sufficient statistics → Update model → Model
Old tricks, new dogs
• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)
• Output to HDFS
Diagram callouts: "Read from local disk from distributed cache"; "Read from HDFS to local disk by distributed cache"; "Written by map-reduce"
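The mapper, combiner and reducer steps listed above can be sketched in plain Python, with no Hadoop involved. Values are (count, mean) pairs, so the same `combine` function can serve as both combiner and reducer. The function names and the in-memory "splits" are assumptions for illustration.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to the point (squared distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def mapper(point, centroids):
    """Assign the point to a cluster and emit (cluster id, (1, point))."""
    return nearest(point, centroids), (1, tuple(point))

def combine(values):
    """Merge (n, mean) pairs: sum the counts, take the weighted sum of means."""
    values = list(values)
    total = sum(n for n, _ in values)
    dims = len(values[0][1])
    mean = tuple(sum(n * m[d] for n, m in values) / total
                 for d in range(dims))
    return total, mean

def run_iteration(splits, centroids):
    """One k-means pass: map each split, combine locally, reduce globally."""
    partials = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            cluster_id, value = mapper(point, centroids)
            local[cluster_id].append(value)
        for cluster_id, values in local.items():
            partials[cluster_id].append(combine(values))    # combiner output
    return {cid: combine(values) for cid, values in partials.items()}

splits = [[(0.0, 0.0), (2.0, 0.0)], [(1.0, 3.0), (10.0, 10.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
result = run_iteration(splits, centroids)
```

Emitting the mean together with its count is what makes the combiner safe: partial means can be merged in any grouping and still give the same centroid as a single pass over all the points.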
Old tricks, new dogs
• Mapper
  – Assign point to cluster
  – Emit cluster id, 1, point
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, n, sum/n
• Output to HDFS
Diagram callouts: "Read from NFS"; "Written by map-reduce"; "MapR FS"
Modeling architecture
• Input → Feature extraction and down sampling → Data join → Sequential SGD Learning
Diagram labels: "Side-data"; "Now via NFS"; "Map-reduce"

Oscon data-2011-ted-dunning
