Information Gain,
Decision Trees and
Boosting
10-701 ML recitation
9 Feb 2006
by Jure
Entropy and
Information Gain
Entropy & Bits
 You are watching a set of independent random
samples of X
 X has 4 possible values:
P(X=A)=1/4, P(X=B)=1/4, P(X=C)=1/4, P(X=D)=1/4
 You get a string of symbols ACBABBCDADDC…
 To transmit the data over a binary link you can
encode each symbol with bits (A=00, B=01,
C=10, D=11)
 You need 2 bits per symbol
Fewer Bits – example 1
 Now someone tells you the probabilities are not
equal
P(X=A)=1/2, P(X=B)=1/4, P(X=C)=1/8, P(X=D)=1/8
 Now, it is possible to find a coding that uses only
1.75 bits per symbol on average. How?
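One coding that achieves this (a sketch, not from the original slides): give shorter codewords to more probable symbols, e.g. A=0, B=10, C=110, D=111. A quick check in Python that this prefix code averages 1.75 bits per symbol:

```python
# Assumed prefix code for the skewed distribution (illustration only,
# not taken from the slides): shorter codewords for likelier symbols.
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_len = sum(p * len(code[s]) for s, p in probs.items())
print(avg_len)   # 1.75 bits per symbol on average
```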
Fewer bits – example 2
 Suppose there are three equally likely values
P(X=A)=1/3, P(X=B)=1/3, P(X=C)=1/3
 Naïve coding: A = 00, B = 01, C=10
 Uses 2 bits per symbol
 Can you find a coding that uses 1.6 bits per
symbol?
 In theory it can be done with 1.58496 bits
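For reference (not shown on the slide), that number is simply the entropy of the uniform three-value distribution:
H(X) = – 3 · (1/3) log2(1/3) = log2 3 ≈ 1.58496 bits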
Entropy – General Case
 Suppose X takes n values, V1, V2,… Vn, and
P(X=V1)=p1, P(X=V2)=p2, … P(X=Vn)=pn
 What is the smallest number of bits, on average, per
symbol, needed to transmit the symbols drawn from the
distribution of X? It’s
H(X) = – p1 log2 p1 – p2 log2 p2 – … – pn log2 pn
     = – Σi=1..n pi log2 pi
 H(X) = the entropy of X
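A minimal Python sketch of this formula (an illustration, not part of the recitation), applied to the earlier examples:

```python
import math

def entropy(probs):
    """H(X) = -sum_i p_i * log2(p_i), skipping zero-probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/4, 1/4, 1/4, 1/4]))  # 2.0    bits (uniform over 4 values)
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75   bits (example 1)
print(entropy([1/3, 1/3, 1/3]))       # 1.585  bits (example 2)
```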
High, Low Entropy
 “High Entropy”
 X is from a uniform-like distribution
 Flat histogram
 Values sampled from it are less predictable
 “Low Entropy”
 X is from a varied (peaks and valleys) distribution
 Histogram has many lows and highs
 Values sampled from it are more predictable
Specific Conditional Entropy, H(Y|X=v)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
 I have input X and want to predict
Y
 From data we estimate
probabilities
P(LikeG = Yes) = 0.5
P(Major=Math & LikeG=No) = 0.25
P(Major=Math) = 0.5
P(Major=History & LikeG=Yes) = 0
 Note
H(X) = 1.5
H(Y) = 1
X = College Major
Y = Likes “Gladiator”
Specific Conditional Entropy, H(Y|X=v)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
 Definition of Specific Conditional
Entropy
 H(Y|X=v) = entropy of Y among
only those records in which X
has value v
 Example:
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0
X = College Major
Y = Likes “Gladiator”
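To see where these numbers come from (worked out here for clarity): 4 of the 8 records have Major = Math, 2 of them Yes and 2 No, so
H(Y|X=Math) = – ½ log2 ½ – ½ log2 ½ = 1;
the two History records are both No and the two CS records are both Yes, so
H(Y|X=History) = H(Y|X=CS) = 0.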
Conditional Entropy, H(Y|X)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
 Definition of Conditional Entropy
H(Y|X) = the average conditional
entropy of Y
= Σi P(X=vi) H(Y|X=vi)
 Example:
H(Y|X) = 0.5*1+0.25*0+0.25*0 = 0.5
X = College Major
Y = Likes “Gladiator”
vi P(X=vi) H(Y|X=vi)
Math 0.5 1
History 0.25 0
CS 0.25 0
Information Gain
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
 Definition of Information Gain
 IG(Y|X) = I must transmit Y.
How many bits on average
would it save me if both ends of
the line knew X?
IG(Y|X) = H(Y) – H(Y|X)
 Example:
H(Y) = 1
H(Y|X) = 0.5
Thus:
IG(Y|X) = 1 – 0.5 = 0.5
X = College Major
Y = Likes “Gladiator”
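A short Python check of this example (an illustration, not from the recitation), computing H(Y), H(Y|X) and the information gain directly from the 8 records:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The (Major, Likes "Gladiator") records from the slide.
data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

h_y = entropy([y for _, y in data])             # H(Y) = 1.0

h_y_given_x = 0.0                               # H(Y|X) = sum_v P(X=v) H(Y|X=v)
for v in set(x for x, _ in data):
    group = [y for x, y in data if x == v]
    h_y_given_x += (len(group) / len(data)) * entropy(group)

print(h_y, h_y_given_x, h_y - h_y_given_x)      # 1.0 0.5 0.5  ->  IG(Y|X) = 0.5
```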
Decision Trees
When do I play tennis?
Decision Tree
Is the decision tree correct?
 Let’s check whether the split on the Wind attribute is
correct.
 We need to show that the Wind attribute has the
highest information gain.
When do I play tennis?
Wind attribute – 5 records match
Note: calculate the entropy only on examples that got
“routed” to our branch of the tree (Outlook=Rain)
Calculation
 Let
S = {D4, D5, D6, D10, D14}
 Entropy:
H(S) = – 3/5 log2(3/5) – 2/5 log2(2/5) = 0.971
 Information Gain
IG(S,Temp) = H(S) – H(S|Temp) = 0.01997
IG(S, Humidity) = H(S) – H(S|Humidity) = 0.01997
IG(S,Wind) = H(S) – H(S|Wind) = 0.971
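These numbers can be reproduced directly; the sketch below assumes the attribute values of D4, D5, D6, D10, D14 from the standard PlayTennis table (the slide shows them only as an image, so they are filled in here from that table):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, attr):
    """IG(S, attr) = H(S) - sum_v P(attr = v) * H(S | attr = v)."""
    h_s = entropy([r["Play"] for r in records])
    rem = 0.0
    for v in set(r[attr] for r in records):
        subset = [r["Play"] for r in records if r[attr] == v]
        rem += (len(subset) / len(records)) * entropy(subset)
    return h_s - rem

# Assumed values of the Outlook=Rain records (PlayTennis data).
S = [
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Play": "Yes"},  # D4
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D5
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Strong", "Play": "No"},   # D6
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D10
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Strong", "Play": "No"},   # D14
]

for a in ("Temp", "Humidity", "Wind"):
    print(a, round(info_gain(S, a), 5))
# Temp 0.01997, Humidity 0.01997, Wind 0.97095 (i.e. the full H(S) = 0.971)
```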
More about Decision Trees
 How do I determine the classification in a leaf?
 If Outlook=Rain is a leaf, what is the classification rule?
 Classify Example:
 We have N boolean attributes, all are needed for
classification:
 How many IG calculations do we need?
 Strength of Decision Trees (boolean attributes)
 All boolean functions
 Handling continuous attributes
Boosting
Booosting
 Is a way of combining weak learners (also
called base learners) into a more accurate
classifier
 Learn in iterations
 Each iteration focuses on hard to learn parts of
the attribute space, i.e. examples that were
misclassified by previous weak learners.
Note: There is nothing inherently weak about the weak
learners – we just think of them this way. In fact, any
learning algorithm can be used as a weak learner in
boosting
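As a concrete reference for the next few slides, here is a minimal AdaBoost sketch (an illustration under simplifying assumptions, not the recitation's code): labels are ±1, and the weak learner is simply the hypothesis with the lowest weighted error from a fixed pool, whereas the recitation grows decision stumps by information gain computed with the weights Dt:

```python
import math

def adaboost(xs, ys, weak_learners, rounds):
    """xs: examples, ys: labels in {+1, -1}, weak_learners: candidate
    functions h(x) -> {+1, -1}. Returns the boosted classifier."""
    n = len(xs)
    D = [1.0 / n] * n                       # start from uniform weights
    ensemble = []                           # (alpha_t, h_t) pairs

    for _ in range(rounds):
        # weighted error of a candidate with respect to D
        def err(h):
            return sum(D[i] for i in range(n) if h(xs[i]) != ys[i])

        h = min(weak_learners, key=err)     # best weak learner this round
        eps = err(h)
        if eps == 0 or eps >= 0.5:          # perfect, or no better than chance
            break
        alpha = 0.5 * math.log((1 - eps) / eps)   # influence of this learner

        # misclassified examples get heavier, correct ones lighter
        D = [D[i] * math.exp(-alpha * ys[i] * h(xs[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]

        ensemble.append((alpha, h))

    def classify(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return classify
```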
Boooosting, AdaBoost
[Slide shows the AdaBoost formulas, annotated: εt = misclassifications with respect to the weights Dt; αt = influence (importance) of the weak learner]
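The two annotated quantities are, in the standard AdaBoost notation (stated here for reference, since the formulas themselves appear only in the slide image):
εt = Σi Dt(i) · [ht(xi) ≠ yi]   (misclassification rate with respect to the weights Dt)
αt = ½ ln((1 – εt)/εt)          (influence/importance of weak learner ht)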
Booooosting Decision Stumps
Boooooosting
 Weights Dt are uniform
 First weak learner is a stump that splits on Outlook
(since weights are uniform)
 4 misclassifications out of 14 examples, so ε1 = 4/14 ≈ 0.29:
α1 = ½ ln((1 – ε1)/ε1)
= ½ ln((1 – 4/14)/(4/14)) ≈ 0.46
 Update Dt [formula shown on the slide; the annotated term determines misclassifications]
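The update referred to is the standard AdaBoost rule (reproduced here for reference; the slide shows it only as an image):
Dt+1(i) = Dt(i) · exp(–αt yi ht(xi)) / Zt
where yi ht(xi) is +1 for correctly classified examples, –1 for misclassified ones, and Zt normalizes the new weights to sum to 1.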
Booooooosting Decision Stumps
[Slide highlights the misclassifications made by the 1st weak learner]
Boooooooosting, round 1
 1st weak learner misclassifies 4 examples (D6, D9, D11, D14):
 Now update weights Dt :
 Weights of examples D6, D9, D11, D14 increase
 Weights of other (correctly classified) examples
decrease
 How do we calculate IGs for the 2nd round of
boosting?
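Worked through with the standard update (a check, not on the slides): using the exact ε1 = 4/14, each of the 4 misclassified examples ends up with weight
D2(i) = (1/14) · e^α1 / Z1 = 1/8 = 0.125,
and each of the 10 correctly classified examples with
D2(i) = (1/14) · e^–α1 / Z1 = 1/20 = 0.05,
so the misclassified examples together carry exactly half of the total weight.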
Booooooooosting, round 2
 Now use Dt instead of counts (Dt is a distribution):
 So when calculating information gain we calculate the
“probability” by using weights Dt (not counts)
 e.g.
P(Temp=mild) = Dt(d4) + Dt(d8)+ Dt(d10)+
Dt(d11)+ Dt(d12)+ Dt(d14)
which is more than 6/14 (Temp=mild occurs 6 times)
 similarly:
P(Tennis=Yes|Temp=mild) = (Dt(d4) + Dt(d10)+
Dt(d11)+ Dt(d12)) / P(Temp=mild)
 and IG is then computed exactly as before (no magic needed)
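As a concrete check (assuming the round-1 weights worked out above, i.e. 0.125 for the misclassified D6, D9, D11, D14 and 0.05 for every other example):
P(Temp=mild) = Dt(d4) + Dt(d8) + Dt(d10) + Dt(d11) + Dt(d12) + Dt(d14)
             = 0.05 + 0.05 + 0.05 + 0.125 + 0.05 + 0.125 = 0.45 > 6/14 ≈ 0.43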
Boooooooooosting, even more
 Boosting does not easily overfit
 Have to determine stopping criteria
 Not obvious, but not that important
 Boosting is greedy:
 always chooses currently best weak learner
 once it chooses a weak learner and its α, they
remain fixed – no changes are possible in later
rounds of boosting
Acknowledgement
 Part of the slides on Information Gain borrowed
from Andrew Moore