Classification with Naïve Bayes A Deep Dive into Apache Mahout
Today’s speaker – Josh Patterson josh@cloudera.com / twitter: @jpatanooga Master’s Thesis: self-organizing mesh networks Published in IAAI-09: TinyTermite: A Secure Routing Algorithm Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff) Led a small team that designed classification techniques for time series data with MapReduce Open source work at http://openpdc.codeplex.com Now: Solutions Architect at Cloudera 2
What is Classification? Supervised Learning We give the system a set of instances to learn from System builds knowledge of some structure Learns “concepts” System can then classify new instances
Supervised vs Unsupervised Learning Supervised Give the system examples/instances of multiple concepts System learns “concepts” More “hands on” Examples: Naïve Bayes, Neural Nets Unsupervised Uses unlabeled data Builds a joint density model Example: k-means clustering
Naïve Bayes Called Naïve Bayes because it’s based on “Bayes’ Rule” and “naively” assumes independence given the label It is only valid to multiply probabilities when the events are independent A simplistic assumption in real life Despite the name, Naïve Bayes works well on actual datasets
Naïve Bayes Classifier A simple probabilistic classifier based on applying Bayes’ theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model”.
Naïve Bayes Classifier (2) Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.  Example:  a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.  Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
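The apple example above can be sketched numerically. This is a toy illustration with made-up priors and per-feature likelihoods (none of these numbers come from a real dataset); it only shows how the naive independence assumption lets us multiply per-feature probabilities:

```python
# Hypothetical per-feature likelihoods and priors (made up for illustration).
# Pr[apple | red, round, 4in] is proportional to
#   Pr[red|apple] * Pr[round|apple] * Pr[4in|apple] * Pr[apple]
priors = {"apple": 0.5, "grape": 0.5}
likelihoods = {
    "apple": {"red": 0.7, "round": 0.9, "4in": 0.8},
    "grape": {"red": 0.3, "round": 0.9, "4in": 0.01},
}

features = ["red", "round", "4in"]

def naive_score(label):
    score = priors[label]
    for f in features:
        # Multiplying is only valid if the features are independent --
        # the "naive" assumption.
        score *= likelihoods[label][f]
    return score

scores = {label: naive_score(label) for label in priors}
best = max(scores, key=scores.get)
print(best)  # apple
```

Even though redness, roundness, and diameter are clearly correlated in real fruit, the classifier scores each feature independently and still picks the sensible label.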
A Little Bit o’ Theory
Condensing Meaning To train our system we need: Total number of input training instances (count) Counts of tuples {attribute_n, outcome_o, value_m} Total counts of each outcome_o {outcome-count} To calculate each Pr[E_n|H]: ({attribute_n, outcome_o, value_m} / {outcome-count}) …From the Vapor of That Last Big Equation
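The counting above can be sketched in a few lines. This is a toy illustration with hypothetical weather-style instances (not Mahout’s internal data structures); it only shows how each Pr[E_n|H] falls out of the tuple counts and outcome counts:

```python
from collections import Counter, defaultdict

# Toy training set: each instance is ({attribute: value}, outcome).
training = [
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"},  "no"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "overcast", "windy": "true"}, "yes"),
]

total = len(training)            # total number of input training instances
outcome_count = Counter()        # total counts of each outcome_o
tuple_count = defaultdict(int)   # counts of {attribute_n, outcome_o, value_m}

for attrs, outcome in training:
    outcome_count[outcome] += 1
    for attribute, value in attrs.items():
        tuple_count[(attribute, outcome, value)] += 1

def pr_evidence_given_h(attribute, value, outcome):
    """Pr[E_n|H] = count(attribute, outcome, value) / count(outcome)."""
    return tuple_count[(attribute, outcome, value)] / outcome_count[outcome]

print(pr_evidence_given_h("outlook", "sunny", "no"))  # 1.0
```

Training is nothing more than a counting pass over the data, which is what makes Naïve Bayes such a natural fit for MapReduce.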
A Real Example From Witten, et al
Enter Apache Mahout What is it? Apache Mahout is a scalable machine learning library that supports large data sets What Are the Major Algorithm Types? Classification Recommendation Clustering http://mahout.apache.org/
Mahout Algorithms
Naïve Bayes and Text Naive Bayes does not model text well.  “Tackling the Poor Assumptions of Naive Bayes Text Classifiers” http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf Mahout does some modifications based around TF-IDF scoring (Next Slide) Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification
High Level Algorithm For Each Feature (word) in each Doc: Calc: “Weight-Normalized Tf-Idf” for a given feature in a label is the Tf-Idf calculated using the standard Idf multiplied by the Weight-Normalized Tf We calculate the sum of W-N-Tf-Idf for all the features in a label, called Sigma_k, with alpha_i == 1.0 Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N ) ]
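The weight formula above can be sketched directly. The per-feature W-N-Tf-Idf values here are made up for illustration; only the final Weight expression mirrors the slide:

```python
import math

# Hypothetical weight-normalized tf-idf values for features in one label
# (made-up numbers, standing in for the output of the earlier steps).
wn_tfidf = {"hadoop": 3.2, "bayes": 1.7, "mahout": 2.5}
alpha_i = 1.0                 # smoothing term, as on the slide
n_features = len(wn_tfidf)    # N: number of features in the label

# Sigma_k: sum of W-N-Tf-Idf over all features in the label
sigma_k = sum(wn_tfidf.values())

def weight(feature):
    # Weight = Log[ (W-N-Tf-Idf + alpha_i) / (Sigma_k + N) ]
    return math.log((wn_tfidf[feature] + alpha_i) / (sigma_k + n_features))

for f in wn_tfidf:
    print(f, weight(f))
```

Note the weights are log-probabilities (all negative here), so at classification time they are summed per token rather than multiplied, avoiding floating-point underflow.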
BayesDriver Training Workflow Naïve Bayes Training MapReduce Workflow in Mahout
Logical Classification Process Gather, Clean, and Examine the Training Data Really get to know your data! Train the Classifier, allowing the system to “Learn” the “Concepts” But not “overfit” to this specific training data set Classify New Unseen Instances With Naïve Bayes we’ll calculate the probabilities of each class wrt this instance
How Is Classification Done? Sequentially or via MapReduce TestClassifier.java Creates ClassifierContext For Each File in Dir For Each Line Break line into map of tokens Feed array of words to Classifier engine for new classification/label Collect classifications as output
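The per-line loop above can be sketched as follows. The model here is a hypothetical hand-built table of per-(label, token) log weights (a real run loads a trained model through ClassifierContext); it only illustrates tokenize, score each label, pick the max:

```python
# Hypothetical log-weights per (label, token), standing in for a trained model.
log_weight = {
    ("rec.sport", "game"): -1.0, ("rec.sport", "team"): -1.2,
    ("comp.os",   "game"): -3.0, ("comp.os",   "kernel"): -0.9,
}
default_weight = -5.0  # penalty for a token unseen under a label
labels = {"rec.sport", "comp.os"}

def classify(line):
    tokens = line.lower().split()       # break line into tokens
    scores = {}
    for label in labels:
        # Sum per-token log weights (equivalent to multiplying probabilities).
        scores[label] = sum(
            log_weight.get((label, t), default_weight) for t in tokens
        )
    return max(scores, key=scores.get)  # highest-scoring label wins

print(classify("the team won the game"))  # rec.sport
```

Because each line is scored independently against a read-only model, the loop parallelizes trivially, which is exactly why Mahout can run the same classification either sequentially or as a MapReduce job.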
A Quick Note About Training Data… Your classifier can only be as good as the training data lets it be… If you don’t do good data prep, everything will perform poorly Data collection and pre-processing takes the bulk of the time
Enough Math, Run the Code Download and install Mahout http://www.apache.org Run 20Newsgroups Example https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups Uses Naïve Bayes Classification Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset
Generate Test and Train Dataset

Training Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8

Test Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
Train and Test Classifier

Train:

$MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs

Test:

$MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce
Other Use Cases Predictive Analytics You’ll hear this term a lot in the field, especially in the context of SAS General Supervised Learning Classification We can recognize a lot of things with practice And lots of tuning! Document Classification Sentiment Analysis
Questions? We’re Hiring! Cloudera’s Distro of Apache Hadoop: http://www.cloudera.com Resources “Tackling the Poor Assumptions of Naive Bayes Text Classifiers” http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Más contenido relacionado

La actualidad más candente

Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Association rule mining and Apriori algorithm
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithmhina firdaus
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptxRoshan86572
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olapSalah Amean
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopApache Apex
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Data Quality for Machine Learning Tasks
Data Quality for Machine Learning TasksData Quality for Machine Learning Tasks
Data Quality for Machine Learning TasksHima Patel
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methodsrajshreemuthiah
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityRushali Deshmukh
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
Statistical learning
Statistical learningStatistical learning
Statistical learningSlideshare
 

La actualidad más candente (20)

Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Association rule mining and Apriori algorithm
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithm
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
k medoid clustering.pptx
k medoid clustering.pptxk medoid clustering.pptx
k medoid clustering.pptx
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
Data Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olapData Mining:  Concepts and Techniques (3rd ed.)— Chapter _04 olap
Data Mining: Concepts and Techniques (3rd ed.) — Chapter _04 olap
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data Quality for Machine Learning Tasks
Data Quality for Machine Learning TasksData Quality for Machine Learning Tasks
Data Quality for Machine Learning Tasks
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methods
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarity
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
Statistical learning
Statistical learningStatistical learning
Statistical learning
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 

Destacado

Lecture 5: Bayesian Classification
Lecture 5: Bayesian ClassificationLecture 5: Bayesian Classification
Lecture 5: Bayesian ClassificationMarina Santini
 
Bayesian classification
Bayesian classificationBayesian classification
Bayesian classificationManu Chandel
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classificationKrish_ver2
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsSalah Amean
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classificationKrish_ver2
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
 

Destacado (7)

Lecture 5: Bayesian Classification
Lecture 5: Bayesian ClassificationLecture 5: Bayesian Classification
Lecture 5: Bayesian Classification
 
Bayesian classification
Bayesian classificationBayesian classification
Bayesian classification
 
2.3 bayesian classification
2.3 bayesian classification2.3 bayesian classification
2.3 bayesian classification
 
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methodsData Mining: Concepts and techniques classification _chapter 9 :advanced methods
Data Mining: Concepts and techniques classification _chapter 9 :advanced methods
 
2.4 rule based classification
2.4 rule based classification2.4 rule based classification
2.4 rule based classification
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 

Similar a Classification with Naive Bayes

Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul Divyanshu
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningRobin Anil
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Tuning the Untunable - Insights on Deep Learning Optimization
Tuning the Untunable - Insights on Deep Learning OptimizationTuning the Untunable - Insights on Deep Learning Optimization
Tuning the Untunable - Insights on Deep Learning OptimizationSigOpt
 
Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)
Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)
Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)CODE WHITE GmbH
 
Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101John Ternent
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with HadoopSangchul Song
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNJosh Patterson
 
SCasia 2018 MSFT hands on session for Azure Batch AI
SCasia 2018 MSFT hands on session for Azure Batch AISCasia 2018 MSFT hands on session for Azure Batch AI
SCasia 2018 MSFT hands on session for Azure Batch AIHiroshi Tanaka
 

Similar a Classification with Naive Bayes (20)

Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Vipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentationVipul divyanshu mahout_documentation
Vipul divyanshu mahout_documentation
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Tuning the Untunable - Insights on Deep Learning Optimization
Tuning the Untunable - Insights on Deep Learning OptimizationTuning the Untunable - Insights on Deep Learning Optimization
Tuning the Untunable - Insights on Deep Learning Optimization
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)
Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)
Java Deserialization Vulnerabilities - The Forgotten Bug Class (RuhrSec Edition)
 
Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with Hadoop
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
SCasia 2018 MSFT hands on session for Azure Batch AI
SCasia 2018 MSFT hands on session for Azure Batch AISCasia 2018 MSFT hands on session for Azure Batch AI
SCasia 2018 MSFT hands on session for Azure Batch AI
 

Más de Josh Patterson

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Josh Patterson
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial IntelligenceJosh Patterson
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecJosh Patterson
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecJosh Patterson
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseJosh Patterson
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksJosh Patterson
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JJosh Patterson
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning ModelsJosh Patterson
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Josh Patterson
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JJosh Patterson
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Josh Patterson
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Josh Patterson
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Josh Patterson
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JJosh Patterson
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Josh Patterson
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopJosh Patterson
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Josh Patterson
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Josh Patterson
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkJosh Patterson
 

Más de Josh Patterson (20)

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial Intelligence
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
Deep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the EnterpriseDeep Learning and Recurrent Neural Networks in the Enterprise
Deep Learning and Recurrent Neural Networks in the Enterprise
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Building Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4JBuilding Deep Learning Workflows with DL4J
Building Deep Learning Workflows with DL4J
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
Enterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4JEnterprise Deep Learning with DL4J
Enterprise Deep Learning with DL4J
 
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015
 
Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015Vectorization - Georgia Tech - CSE6242 - March 2015
Vectorization - Georgia Tech - CSE6242 - March 2015
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4JGeorgia Tech cse6242 - Intro to Deep Learning and DL4J
Georgia Tech cse6242 - Intro to Deep Learning and DL4J
 
Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242Intro to Vectorization Concepts - GaTech cse6242
Intro to Vectorization Concepts - GaTech cse6242
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation Talk
 

Classification with Naive Bayes

  • 1. Classification with Naïve Bayes A Deep Dive into Apache Mahout
  • 2. Today’s speaker – Josh Patterson josh@cloudera.com / twitter: @jpatanooga Master’s Thesis: self-organizing mesh networks Published in IAAI-09: TinyTermite: A Secure Routing Algorithm Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff) Led small team which designed classification techniques for time series and Map Reduce Open source work at http://openpdc.codeplex.com Now: Solutions Architect at Cloudera 2
  • 3. What is Classification? Supervised Learning We give the system a set of instances to learn from System builds knowledge of some structure Learns “concepts” System can then classify new instances
  • 4. Supervised vs Unsupervised Learning Supervised Give system examples/instances of multiple concepts System learns “concepts” More “hands on” Example: Naïve Bayes, Neural Nets Unsupervised Uses unlabled data Builds joint density model Example: k-means clustering
  • 5. Naïve Bayes Called Naïve Bayes because its based on “Baye’sRule” and “naively” assumes independence given the label It is only valid to multiply probabilities when the events are independent Simplistic assumption in real life Despite the name, Naïve works well on actual datasets
  • 6. Naïve Bayes Classifier Simple probabilistic classifier based on applying Baye’s theorem (from Bayesian statistics) strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be “independent feature model".
  • 7. Naïve Bayes Classifier (2) Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. Example: a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.
  • 8. A Little Bit o’ Theory
  • 9. Condensing Meaning To train our system we need Total number input training instances (count) Counts tuples: {attributen,outcomeo,valuem} Total counts of each outcomeo {outcome-count} To Calculate each Pr[En|H] ({attributen,outcomeo,valuem} / {outcome-count} ) …From the Vapor of That Last Big Equation
  • 10. A Real Example From Witten, et al
  • 11. Enter Apache Mahout What is it? Apache Mahout is a scalable machine learning library that supports large data sets What Are the Major Algorithm Type? Classification Recommendation Clustering http://mahout.apache.org/
  • 13. Naïve Bayes and Text Naive Bayes does not model text well. “Tackling the Poor Assumptions of Naive Bayes Text Classifiers” http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf Mahout does some modifications based around TF-IDF scoring (Next Slide) Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification
  • 14. High Level Algorithm For Each Feature(word) in each Doc: Calc: “Weight Normalized Tf-Idf” for a given feature in a label is the Tf-idf calculated using standard idf multiplied by the Weight Normalized Tf We calculate the sum of W-N-Tf-idf for all the features in a label called Sigma_k, and alpha_i == 1.0 Weight = Log [ ( W-N-Tf-Idf + alpha_i ) / ( Sigma_k + N ) ]
  • 15. BayesDriver Training Workflow Naïve Bayes Training MapReduce Workflow in Mahout
  • 16. Logical Classification Process Gather, Clean, and Examine the Training Data Really get to know your data! Train the Classifier, allowing the system to “Learn” the “Concepts” But not “overfit” to this specific training data set Classify New Unseen Instances With Naïve Bayes we’ll calculate the probabilities of each class wrt this instance
  • 17. How Is Classification Done? Sequentially or via Map Reduce TestClassifier.java Creates ClassifierContext For Each File in Dir For Each Line Break line into map of tokens Feed array of words to Classifier engine for new classification/label Collect classifications as output
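The per-line loop described on this slide (break a line into tokens, feed them to the classifier, collect a label) can be sketched as follows. The model weights and the smoothing value for unseen words are hypothetical; Mahout's ClassifierContext does the real scoring, and summing log-weights here stands in for multiplying probabilities:

```python
import math

# Hypothetical per-label log-weights for a handful of words
model = {
    "sports": {"ball": math.log(0.3), "game": math.log(0.4)},
    "tech":   {"code": math.log(0.5), "game": math.log(0.1)},
}
UNSEEN = math.log(1e-6)  # smoothing for unseen words (an assumption, not Mahout's scheme)

def classify(line):
    """Break the line into tokens and pick the label with the highest summed log-weight."""
    tokens = line.lower().split()
    scores = {label: sum(weights.get(t, UNSEEN) for t in tokens)
              for label, weights in model.items()}
    return max(scores, key=scores.get)
```

Summing logs instead of multiplying raw probabilities avoids floating-point underflow on long documents, which is why trained weights are typically stored in log space.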
  • 18. A Quick Note About Training Data… Your classifier can only be as good as the training data lets it be… If you don’t do good data prep, everything will perform poorly Data collection and pre-processing takes the bulk of the time
  • 19. Enough Math, Run the Code Download and install Mahout http://mahout.apache.org/ Run the 20Newsgroups Example https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups Uses Naïve Bayes Classification Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset
  • 20. Generate Test and Train Dataset Training Dataset: mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
Test Dataset: mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
  • 21. Train and Test Classifier Train: $MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs
Test: $MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce
  • 22. Other Use Cases Predictive Analytics You’ll hear this term a lot in the field, especially in the context of SAS General Supervised Learning Classification We can recognize a lot of things with practice And lots of tuning! Document Classification Sentiment Analysis
  • 23. Questions? We're Hiring! Cloudera's Distro of Apache Hadoop: http://www.cloudera.com Resources "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Editor's notes

  1. https://cwiki.apache.org/MAHOUT/books-tutorials-and-talks.html
  2. Contrasts with the "1Rule" method (1Rule uses 1 attribute). NB allows all attributes to make contributions that are equally important and independent of one another
  3. This classifier produces a probability estimate for each class rather than a prediction. Considered "Supervised Learning"
  4. A comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  5. Pr[E|H] -> all evidence for instances with H -> "yes". Pr[H] -> percent of instances w/ this outcome. Pr[E] -> sum of the values ( ) for all outcomes
  6. Book reference: Snow Crash. For each attribute "a" there are multiple values, and given these combinations we need to look at how many times the instances were actually classified into each class. In training we use the term "outcome"; in classification we use the term "class". Example: say we have 2 attributes to an instance
  7. We don’t take into account some of the other things like “missing values” here
  8. Now that we've established the case for Naïve Bayes + text, show how it fits in with other classification algos
  9. *** Need to sell the case for using another feature-calculating mechanic *** When one class has more training examples than another, Naive Bayes selects poor weights for the decision boundary. To balance the amount of training examples used per estimate, they introduced a "complement class" formulation of Naive Bayes. A document is treated as a sequence of words and it is assumed that each word position is generated independently of every other word
  10. Term frequency = num occurrences of the considered term ti in document dj / size of (words in doc dj), normalized to protect against bias in larger docs. IDF = log( ). Normalized frequency for a term (feature) in a document is calculated by dividing the term frequency by the root mean square of term frequencies in that document. Weight Normalized Tf for a given feature in a given label = sum of the normalized frequency of the feature across all the documents in the label.
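The normalized frequency described in this note can be sketched on a toy document. This is one reading of the note (RMS taken as the square root of the mean of squared frequencies over distinct terms), not Mahout's actual code:

```python
import math
from collections import Counter

# Toy document; raw term frequencies via Counter
doc = "the cat sat on the mat".split()
tf = Counter(doc)

# Root mean square of the term frequencies in this document
rms = math.sqrt(sum(f * f for f in tf.values()) / len(tf))

def normalized_freq(term):
    """Term frequency divided by the RMS of frequencies in the document."""
    return tf[term] / rms
```

Dividing by the RMS dampens the advantage of long or repetitive documents, which is the bias-protection the note refers to.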
  11. Need to get a better handle on Sigma_k / SigmaWij https://cwiki.apache.org/MAHOUT/bayesian.html
  12. https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
  13. Can also test sequentially