Bayes Classifier and 
Naïve Bayes 
CS434
Bayes Classifiers 
• A formidable and sworn enemy of decision 
trees 
[Diagram: Input attributes → Classifier (DT or BC) → Prediction of categorical output]
Probabilistic Classification 
• Credit scoring: 
– Inputs are income and savings 
– Output is low‐risk vs high‐risk 
• Input: x = [x1, x2]^T, Output: C ∈ {0, 1} 
• Prediction: 

choose C = 1 if P(C=1 | x1, x2) > 0.5, otherwise choose C = 0 

or equivalently 

choose C = 1 if P(C=1 | x1, x2) > P(C=0 | x1, x2), otherwise choose C = 0
A side note: probabilistic inference 
[Venn diagram of events F and H] 
H = “Have a headache” 
F = “Coming down with Flu” 
P(H) = 1/10 
P(F) = 1/40 
P(H|F) = 1/2 
One day you wake up with a headache. You think: “Drat! 
50% of flus are associated with headaches so I must have a 
50-50 chance of coming down with flu” 
Is this reasoning good?
Probabilistic Inference 
[Venn diagram of events F and H] 
H = “Have a headache” 
F = “Coming down with Flu” 
P(H) = 1/10 
P(F) = 1/40 
P(H|F) = 1/2 

P(F ^ H) = P(F) P(H|F) = 1/40 * 1/2 = 1/80 
P(F|H) = P(F ^ H) / P(H) = (1/80) / (1/10) = 1/8 

(Copyright © Andrew W. Moore)
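A quick check of this arithmetic in Python (a minimal sketch; the numbers are exactly those given above):

```python
# Probabilities stated on the slide
P_H = 1 / 10          # P(H): headache
P_F = 1 / 40          # P(F): flu
P_H_given_F = 1 / 2   # P(H | F)

# Product rule: P(F ^ H) = P(F) P(H | F)
P_F_and_H = P_F * P_H_given_F        # 1/80 = 0.0125

# Bayes rule: P(F | H) = P(F ^ H) / P(H)
P_F_given_H = P_F_and_H / P_H        # 1/8 = 0.125, not 50%

print(P_F_and_H, P_F_given_H)
```

So the headache raises the probability of flu from 1/40 to 1/8, which answers the question on the previous slide: the 50-50 reasoning is not good.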
Bayes classifier 

P(y|x) = p(x|y) P(y) / p(x) 

posterior: P(y|x)    prior: P(y)    likelihood: p(x|y)    evidence: p(x) 
A simple Bayes net: y → x, with parameters P(y) and P(x|y) 
Given a set of training examples, to build a Bayes 
classifier, we need to 
1. Estimate P(y) from data 
2. Estimate P(x|y) from data 
Given a test data point x, to make prediction 
1. Apply Bayes rule: 

P(y|x) ∝ P(y) P(x|y) 

2. Predict 

y_predict = argmax_y P(y|x)
Maximum Likelihood Estimation (MLE) 
• Let y be the outcome of the credit scoring of a random loan 
applicant, y ∈ {0, 1} 
– P0 = P(y=0), and P1 = P(y=1) = 1 − P0 
• This can be compactly represented as 
P(y) = P0^(1−y) (1 − P0)^y 
• If you observe n samples of y: y1, y2,…,yn 
• we can write down the likelihood function (i.e., the probability of 
the observed data given the parameter): 

L(P0) = ∏_{i=1}^{n} P0^(1−yi) (1 − P0)^yi 

• The log-likelihood function: 

l(P0) = log L(P0) = ∑_{i=1}^{n} [ (1 − yi) log P0 + yi log(1 − P0) ]
MLE cont. 
• MLE maximizes the likelihood, or the log 
likelihood 
• For this case: 
P0_MLE = argmax_{P0} l(P0) 

P0_MLE = ( ∑_{i=1}^{n} (1 − yi) ) / n 
i.e., the frequency that one observes y=0 in 
the training data
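As a minimal sketch (not from the slides; the sample below is hypothetical), the closed-form MLE is just the empirical frequency of y=0, which can be checked against a brute-force search over the log-likelihood:

```python
import numpy as np

y = np.array([0, 0, 1, 0, 1, 0, 0, 1])   # hypothetical sample of n binary outcomes

# Closed-form MLE: fraction of observations with y = 0
p0_mle = np.mean(y == 0)                  # 5/8 here

# Log-likelihood l(p0) = sum_i [(1 - yi) log p0 + yi log(1 - p0)]
def log_likelihood(p0, y):
    return np.sum((1 - y) * np.log(p0) + y * np.log(1 - p0))

# The closed-form estimate should match the maximizer found by a grid search
grid = np.linspace(0.01, 0.99, 99)
best_on_grid = grid[np.argmax([log_likelihood(p, y) for p in grid])]
print(p0_mle, best_on_grid)               # 0.625 and the nearest grid point
```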
Bayes Classifiers in a nutshell 
1. Estimate P(x1, x2, …, xm | y=vi) for each value vi     (learning) 
2. Estimate P(y=vi) as the fraction of records with y=vi     (learning) 
3. For a new prediction:     (predict) 

y_predict = argmax_v P(y=v | x1=u1, …, xm=um) 
          = argmax_v P(x1=u1, …, xm=um | y=v) P(y=v) 
Estimating the joint 
distribution of x1, x2, … xm 
given y can be problematic!
Joint Density Estimator Overfits 
• Typically we don’t have enough data to estimate the 
joint distribution accurately 
• It is common to encounter the following situation: 
– If no training examples have the exact x=(u1, u2, …. um), 
then P(x|y=vi ) = 0 for all values of Y. 
• In that case, what can we do? 
– we might as well guess a random y based on the prior, i.e., 
p(y) 
P(y|x) = p(x|y) P(y) / p(x) 
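To make this concrete, here is a minimal sketch (with a tiny hypothetical dataset) of a joint density estimator that counts exact matches of the full attribute vector; any test point never seen verbatim in training gets P(x|y) = 0 for every class:

```python
from collections import Counter

# Hypothetical binary training data: ((x1, x2, x3), y)
train = [((1, 1, 1), 0), ((1, 1, 0), 0), ((0, 1, 0), 1), ((0, 0, 1), 1)]

# Count full attribute vectors separately for each class
joint_counts = {0: Counter(), 1: Counter()}
class_sizes = Counter()
for x, y in train:
    joint_counts[y][x] += 1
    class_sizes[y] += 1

def p_x_given_y(x, y):
    # Empirical joint estimate: fraction of class-y examples that equal x exactly
    return joint_counts[y][x] / class_sizes[y]

# (1, 0, 0) never appears in training, so the estimate is 0 for both classes
print(p_x_given_y((1, 0, 0), 0), p_x_given_y((1, 0, 0), 1))   # 0.0 0.0
```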
Example: Spam Filtering 
• Assume that our vocabulary contains 10k commonly 
used words & tokens‐‐‐ we have 10,000 attributes 
• Let’s assume these attributes are binary 
• How many parameters do we need to learn? 
2 × (2^10,000 − 1): 2 classes, and (2^10,000 − 1) parameters for each 
joint distribution p(x|y) 
Clearly we don’t have enough data to estimate 
that many parameters
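A back-of-the-envelope check of that count (the Naïve Bayes figure is a preview of the assumption introduced on the next slides, under which only one Bernoulli parameter per attribute per class is needed):

```python
m = 10_000        # binary attributes (vocabulary words)
n_classes = 2     # spam vs. not spam

# Full joint p(x | y): 2^m outcomes, hence 2^m - 1 free parameters per class
joint_params = n_classes * (2**m - 1)

# Naive Bayes: one P(x_i = 1 | y) per attribute per class
naive_params = n_classes * m   # 20,000

print(len(str(joint_params)))  # the joint count is a number with ~3,000 digits
print(naive_params)
```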
The Naïve Bayes Assumption 
• Assume that each attribute is independent of 
any other attributes given the class label 
P(x1=u1, …, xm=um | y=vi) = P(x1=u1 | y=vi) × ⋯ × P(xm=um | y=vi) 

[Figures: the full model, y → x with P(y) and P(x|y), versus the Naïve Bayes model, y → x1, x2, …, xm with P(y) and P(x1|y), P(x2|y), …, P(xm|y)]
A note about independence 
• Assume A and B are two Random Variables. 
Then 
“A and B are independent” 
if and only if 
P(A|B) = P(A) 
• “A and B are independent” is often notated as A ⊥ B
Examples of independent events 
• Two separate coin tosses 
• Consider the following four variables: 
– T: Toothache ( I have a toothache) 
– C: Catch (dentist’s steel probe catches in my 
tooth) 
– A: Cavity 
– W: Weather 
– P(T, C, A, W) = ? 
Weather is independent of the other three ({T, C, A} ⊥ W), so 
P(T, C, A, W) = P(T, C, A) P(W)
Conditional Independence 
• P(x1|x2,y) = P(x1|y) 
– X1 is independent of x2 given y 
– x1 and x2 are conditionally independent given y 
• If X1 and X2 are conditionally independent 
given y, then we have 
– P(X1,X2|y) = P(X1|y) P(X2|y)
Example of conditional independence 
– T: Toothache ( I have a toothache) 
– C: Catch (dentist’s steel probe catches in my tooth) 
– A: Cavity 
T and C are conditionally independent given A: P(T|C,A) =P(T|A) 
P(T, C|A) =P(T|A)*P(C|A) 
Events that are not independent of each other might be conditionally 
independent given some fact. 
It can also happen the other way around. Events that are independent might 
become conditionally dependent given some fact. 
B=Burglar in your house; A = Alarm (Burglar) rang in your house 
E = Earthquake happened 
B is independent of E (ignoring some minor possible connections between them) 
However, if we know A is true, then B and E are no longer independent. Why? 
P(B|A) >> P(B|A, E): knowing E is true makes it much less likely for B to be true.
Naïve Bayes Classifier 
• By assuming that each attribute is 
independent of any other attributes given the 
class label, we now have a Naïve Bayes 
Classifier 
• Instead of learning a joint distribution of all 
features, we learn p(xi|y) separately for each 
feature xi 
• Everything else remains the same
Naïve Bayes Classifier 
• Assume you want to predict output y which has nY values v1, v2, 
… vny. 
• Assume there are m input attributes called x=(x1, x2, … xm) 
• Learn a conditional distribution p(x|y) for each possible y 
value, y = v1, v2, …, vny. We do this by: 
– Break training set into nY subsets called S1, S2, … Sny based on the y 
values, i.e., Si contains examples in which y=vi 
– For each Si, learn p(y=vi)= |Si |/ |S| 
– For each Si, learn the conditional distribution of each input feature, 
e.g.: 
P(x1=u1 | y=vi), …, P(xm=um | y=vi) 

y_predict = argmax_v P(x1=u1 | y=v) ⋯ P(xm=um | y=v) P(y=v)
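A minimal end-to-end sketch of this procedure for binary features (count-based estimates, no smoothing; the class and method names are my own, not from the slides):

```python
import numpy as np

class NaiveBayes:
    """Naive Bayes for binary features, with count-based estimates and no smoothing."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = {}   # P(y = v) = |S_v| / |S|
        self.cond = {}    # P(x_j = 1 | y = v), one value per feature j
        for v in self.classes:
            Xv = X[y == v]
            self.prior[v] = len(Xv) / len(X)
            self.cond[v] = Xv.mean(axis=0)
        return self

    def predict(self, x):
        best_v, best_score = None, -1.0
        for v in self.classes:
            # P(y=v) * prod_j P(x_j | y=v), using the Naive Bayes factorization
            p_feat = np.where(x == 1, self.cond[v], 1 - self.cond[v])
            score = self.prior[v] * np.prod(p_feat)
            if score > best_score:
                best_v, best_score = v, score
        return best_v
```

The worked example on the next slide can be run through this sketch directly.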
Example 
X1 X2 X3 Y 
1 1 1 0 
1 1 0 0 
0 0 0 0 
0 1 0 1 
0 0 1 1 
0 1 1 1 
Apply Naïve Bayes and make a 
prediction for (1,0,1). 
1. Learn the prior distribution of y. 
P(y=0)=1/2, P(y=1)=1/2 
2. Learn the conditional distribution of 
xi given y for each possible y value: 
p(X1|y=0), p(X1|y=1) 
p(X2|y=0), p(X2|y=1) 
p(X3|y=0), p(X3|y=1) 
For example, p(X1|y=0): 
P(X1=1|y=0)=2/3, P(X1=1|y=1)=0 
… 
To predict for (1,0,1): 
P(y=0|(1,0,1)) = P((1,0,1)|y=0)P(y=0)/P((1,0,1)) 
P(y=1|(1,0,1)) = P((1,0,1)|y=1)P(y=1)/P((1,0,1))
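Running the numbers from the table in a minimal sketch (unnormalized scores; the common denominator P((1,0,1)) can be dropped when comparing the two classes):

```python
import numpy as np

X = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
x_test = np.array([1, 0, 1])

scores = {}
for v in (0, 1):
    Xv = X[y == v]
    prior = np.mean(y == v)                        # P(y = v) = 1/2
    p1 = Xv.mean(axis=0)                           # P(X_j = 1 | y = v)
    p_feat = np.where(x_test == 1, p1, 1 - p1)     # P(X_j = x_test_j | y = v)
    scores[v] = prior * p_feat.prod()

print(scores)   # y=0: 1/2 * 2/3 * 1/3 * 1/3 ~ 0.037;  y=1: 0 because P(X1=1|y=1) = 0
```

So the prediction is y = 0; the zero score for y = 1 is exactly the situation that Laplace smoothing (next slide) addresses.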
Laplace Smoothing 
• With the Naïve Bayes Assumption, we can still end up with zero 
probabilities 
• E.g., if we receive an email that contains a word that has never 
appeared in the training emails 
– P(x|y) will be 0 for all y values 
– We can only make prediction based on p(y) 
• This is bad because we ignored all the other words in the email 
because of this single rare word 
• Laplace smoothing can help 
P(X1=1|y=0) 
= (1+ # of examples with y=0, X1=1 ) /(k+ # of examples with y=0) 
k = the number of possible values the feature can take (k = 2 for a binary feature like X1) 
• For a binary feature like above, p(x|y) will not be 0
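Applied to the earlier example (where P(X1=1|y=1) was 0), a minimal sketch of the smoothed estimate with k = 2 for a binary feature:

```python
def laplace_p(count_u_and_v, count_v, k):
    # (1 + #{examples with y=v and X=u}) / (k + #{examples with y=v})
    return (1 + count_u_and_v) / (k + count_v)

# From the earlier table: three y=1 examples, none of them with X1=1
print(laplace_p(0, 3, k=2))   # P(X1=1 | y=1) = 1/5 instead of 0
print(laplace_p(2, 3, k=2))   # P(X1=1 | y=0) = 3/5 instead of 2/3
```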
Final Notes about (Naïve) Bayes Classifier 
• Any density estimator can be plugged in to estimate P(x1, x2, 
…, xm | y), or P(xi|y) for Naïve Bayes 
• Real valued attributes can be modeled using simple 
distributions such as Gaussian (Normal) distribution 
• Naïve Bayes is wonderfully cheap and survives tens of 
thousands of attributes easily
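For a real-valued attribute, one common choice (a sketch, not something the slides prescribe in detail) is a per-class Gaussian for p(xi|y), with mean and variance estimated from the training examples of that class:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # N(x; mu, sigma^2), used as the class-conditional density p(x_i | y)
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Hypothetical real-valued feature values observed for one class
x_train = np.array([1.2, 0.8, 1.1, 0.9])
mu, sigma2 = x_train.mean(), x_train.var()

# Plug this density into the Naive Bayes product in place of the Bernoulli estimate
print(gaussian_pdf(1.0, mu, sigma2))
```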
Bayes Classifier is a Generative 
Approach 
• Generative approach: 
– Learn p(y), p(x|y), and then apply bayes rule to 
compute p(y|x) for making predictions 
– This is equivalent to assuming that each data point 
is generated following a generative process 
governed by p(y) and p(X|y) 
[Figures: the Bayes classifier as a generative model, y → X, governed by p(y) and p(X|y); the Naïve Bayes classifier, y → X1, …, Xm, governed by p(y) and p(X1|y), …, p(Xm|y)]
• The generative approach is just one type of learning approach 
used in machine learning 
– Learning a correct generative model is difficult 
– And sometimes unnecessary 
• KNN and DT are both what we call discriminative methods 
– They are not concerned about any generative models 
– They only care about finding a good discriminative function 
– For KNN and DT, these functions are deterministic, not probabilistic 
• One can also take a probabilistic approach to learning 
discriminative functions 
– i.e., Learn p(y|X) directly without assuming X is generated based on 
some particular distribution given y (i.e., p(X|y)) 
– Logistic regression is one such approach