This Edureka Decision Tree tutorial explains the basics of decision trees. It is suited to both beginners and professionals who want to learn or brush up on data science concepts and decision tree analysis through examples.
Below are the topics covered in this tutorial:
1) Machine Learning Introduction
2) Classification
3) Types of classifiers
4) Decision tree
5) How does Decision tree work?
6) Demo in R
You can also take the complete structured training; details are here: https://goo.gl/AfxwBc
Edureka’s Data Science Certification Training (www.edureka.co/data-science)
What Will You Learn Today?
1) Machine Learning
2) Classification
3) Types Of Classifiers
4) What Is Decision Tree?
5) How Decision Tree Works?
6) Demo In R: Diabetes Prevention Use Case
Introduction To Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without
being explicitly programmed.
[Figure: machine learning workflow with the stages Training Data, Learn (Algorithm), Build Model, Perform, and Feedback.]
Machine Learning - Example
Amazon has a huge amount of consumer purchasing data.
The data consists of consumer demographics (age, sex, location), purchasing history, and past browsing history.
Based on this data, Amazon segments its customers, identifies patterns, and recommends the right product to the right customer at the right time.
Introduction To Classification
Classification is the problem of identifying which of a set of categories a new observation belongs to.
It is a supervised learning task: the classifier is given a set of already classified examples and learns from them how to assign new, unseen examples.
Example: Assigning a given email to the "spam" or "non-spam" category.
Is this A or B?
Classification - Example
Feed the classifier a training data set with predefined labels.
It will learn to categorize particular data under a specific label.
How do I train my model to distinguish spam mails from genuine mails?
• Source IP Address
• Phrases in the text
• Subject Line
• HTML Tags
Classification Use Cases
• Banking: Identification of high-risk loan applicants by their probability of defaulting on payments.
• Medicine: Identification of at-risk patients and disease trends.
• Remote sensing: Identification of areas of similar land use in a GIS database.
• Marketing: Identification of customer churn.
Types Of Classifiers
Decision Tree
• Decision tree builds classification models in the form of a tree structure.
• It breaks down a dataset into smaller and smaller subsets.
Random Forest
• Random Forest is an ensemble classifier made using many decision tree models.
• Ensemble models combine the results from different models.
Naïve Bayes
• It is a classification technique based on Bayes' Theorem with an assumption of independence among attributes.
What Is Decision Tree?
A decision tree uses a tree structure to specify sequences of decisions and consequences.
A decision tree employs a structure of nodes and branches.
The depth of a node is the minimum number of steps required to reach the node from the root.
Eventually, a final point is reached and a prediction is made.
[Figure: example decision tree. Root node: Gender (branches: Female, Male). Internal nodes at depth 1: Age (branches: <=40, >40) and Income (branches: <=45000, >45000). Leaf nodes: Yes or No.]
Use Case - Credit Risk Detection
To minimize loss, the bank needs a decision rule to predict which loan applications to approve.
An applicant’s demographic (income, debts, credit history) and socio-economic profiles are considered.
Data science can help banks recognize behavior patterns and provide a complete view of individual customers.
How Decision Tree Works?
Let’s take an example.
We have a dataset consisting of:
• Weather information for the last 14 days
• Whether a match was played on each of those days
Using the decision tree, we need to predict whether the game will happen if the weather conditions are:
Outlook = Rain
Humidity = High
Wind = Weak
Play = ?
From our data, we first choose one variable, “Outlook”, and see how it affects the variable “Play”.
Day  Outlook   Humidity  Wind    Play
D1   Sunny     High      Weak    No
D2   Sunny     High      Strong  No
D3   Overcast  High      Weak    Yes
D4   Rain      High      Weak    Yes
D5   Rain      Normal    Weak    Yes
D6   Rain      Normal    Strong  No
D7   Overcast  Normal    Strong  Yes
D8   Sunny     High      Weak    No
D9   Sunny     Normal    Weak    Yes
D10  Rain      Normal    Weak    Yes
D11  Sunny     Normal    Strong  Yes
D12  Overcast  High      Strong  Yes
D13  Overcast  Normal    Weak    Yes
D14  Rain      High      Strong  No
Play: 9 Yes, 5 No
There are 3 types of Outlook here: Sunny, Overcast, Rain
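To see how Outlook relates to Play, the 14-day table can be entered by hand and cross-tabulated. This is a minimal sketch in base R; the names `weather` and `outlook_counts` are illustrative and do not come from the tutorial.

```r
# Weather table from the slide, entered by hand (14 days).
weather <- data.frame(
  Day      = paste0("D", 1:14),
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)

# Cross-tabulate Outlook against Play to get the class counts per branch.
outlook_counts <- table(weather$Outlook, weather$Play)
print(outlook_counts)
```

The Overcast row has no "No" entries, which is why that branch can stop splitting right away.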
We can further divide our data based on Outlook.
Outlook (9 Yes / 5 No)
Sunny: 2 Yes / 3 No (split further)
Day  Outlook   Humidity  Wind
D1   Sunny     High      Weak
D2   Sunny     High      Strong
D8   Sunny     High      Weak
D9   Sunny     Normal    Weak
D11  Sunny     Normal    Strong
Overcast: pure subset (will play)
Day  Outlook   Humidity  Wind
D3   Overcast  High      Weak
D7   Overcast  Normal    Strong
D12  Overcast  High      Strong
D13  Overcast  Normal    Weak
Rain: 3 Yes / 2 No (split further)
Day  Outlook   Humidity  Wind
D4   Rain      High      Weak
D5   Rain      Normal    Weak
D6   Rain      Normal    Strong
D10  Rain      Normal    Weak
D14  Rain      High      Strong
We will split the data until we get pure subsets at every branch.
We will use the Humidity column to split the subset “Sunny” further.
Outlook = Sunny, Humidity = High: pure subset (will not play)
Day  Humidity  Wind
D1   High      Weak
D2   High      Strong
D8   High      Weak
Outlook = Sunny, Humidity = Normal: pure subset (will play)
Day  Humidity  Wind
D9   Normal    Weak
D11  Normal    Strong
Outlook = Overcast: pure subset (will play)
Outlook = Rain: 3 Yes / 2 No (split further)
We will use the Wind column to split the subset “Rain” further.
Outlook = Rain, Wind = Weak: pure subset (will play)
Day  Humidity  Wind
D4   High      Weak
D5   Normal    Weak
D10  Normal    Weak
Outlook = Rain, Wind = Strong: pure subset (will not play)
Day  Humidity  Wind
D6   Normal    Strong
D14  High      Strong
The Sunny branch (split on Humidity) and the Overcast branch already end in pure subsets, so every branch is now pure.
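The final decision tree is:
Outlook
• Sunny: split on Humidity
  • High: will not play
  • Normal: will play
• Overcast: will play
• Rain: split on Wind
  • Weak: will play
  • Strong: will not play
For the query Outlook = Rain, Humidity = High, Wind = Weak, the tree follows the Rain branch and then the Weak branch, so it predicts Play = Yes.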
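The final tree can be read as a small set of nested rules. The following is a hand-coded sketch of those rules in base R, not output from a tree-learning package; `predict_play` is a hypothetical helper name introduced here for illustration.

```r
# Hand-coded version of the final weather tree.
# predict_play is a hypothetical helper, not part of the tutorial.
predict_play <- function(outlook, humidity, wind) {
  if (outlook == "Overcast") {
    "Yes"                                      # pure subset: always played
  } else if (outlook == "Sunny") {
    if (humidity == "Normal") "Yes" else "No"  # Sunny is split on Humidity
  } else if (outlook == "Rain") {
    if (wind == "Weak") "Yes" else "No"        # Rain is split on Wind
  } else {
    stop("Unknown outlook: ", outlook)
  }
}

# The query posed at the start of the example.
prediction <- predict_play(outlook = "Rain", humidity = "High", wind = "Weak")
print(prediction)
```

Note that Humidity is never consulted on the Rain branch, so the "High" value in the query does not affect the prediction.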
Problem – Client Subscription
Consider the case of a bank that wants to market its products to the appropriate customers.
Given the demographics of clients and their reactions to previous campaign phone calls, the bank's goal is to predict
which clients would subscribe.
The attributes are:
• Job
• Marital status
• Education
• Housing
• Loan
• Contact
• Poutcome
How To Choose An Attribute?
A common way to identify the most informative attribute is to use entropy-based methods.
Entropy (H) can be calculated as
$H = -\sum_{x} p(x) \log_2 p(x)$
where
x = Datapoint
p(x) = Probability of x
H = Entropy of x
Now, let’s do some mathematics on it.
Let’s say the number of clients who have not subscribed is 1,789 out of a total population of 2,000.
P(subscribed = yes) = 1 − 1789/2000 = 0.1055 (10.55%)
P(subscribed = no) = 1789/2000 = 0.8945 (89.45%)
$H_{subscribed} = -0.1055 \cdot \log_2 0.1055 - 0.8945 \cdot \log_2 0.8945 \approx 0.4862$
Therefore, the root is only 10.55% pure on the subscribed = yes class.
Conversely, it is 89.45% pure on the subscribed = no class.
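The base entropy calculation can be checked with a few lines of base R. This is a minimal sketch; the function name `entropy` and the variable names are illustrative, and only the counts 1,789 and 2,000 come from the slide.

```r
# Shannon entropy (base 2) of a vector of class probabilities.
entropy <- function(p) {
  p <- p[p > 0]          # drop zero probabilities, treating 0 * log2(0) as 0
  -sum(p * log2(p))
}

p_no  <- 1789 / 2000     # fraction of clients who did not subscribe
p_yes <- 1 - p_no        # fraction of clients who subscribed

h_subscribed <- entropy(c(p_yes, p_no))
print(round(h_subscribed, 4))
```

Rounded to four decimals, the result agrees with the 0.4862 given above.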
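The conditional entropy of a target Y given an attribute X is
$H_{Y|X} = \sum_{x} P(x) \, H(Y|X=x) = -\sum_{x} P(x) \sum_{y} P(y|x) \log_2 P(y|x)$
Calculating the conditional entropy for subscribed|contact gives the following result:
$H_{subscribed|contact} = 0.4661$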
The information gain of an attribute A is defined as the difference between the base entropy ($H_S$) and the conditional entropy of the attribute ($H_{S|A}$):
$InfoGain_A = H_S - H_{S|A}$
Attribute poutcome has the most information gain and is the most informative variable.
Therefore, poutcome is chosen for the first split of the decision tree.
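The bank dataset is not reproduced in these slides, so as an illustration the same information gain calculation can be applied to the 14-day weather table from the earlier example. This is a sketch in base R; `entropy`, `info_gain`, `weather`, and `gains` are names chosen here, not taken from the tutorial.

```r
# Weather table from the earlier example (attributes and target only).
weather <- data.frame(
  Outlook  = c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"),
  Humidity = c("High", "High", "High", "High", "Normal", "Normal", "Normal",
               "High", "Normal", "Normal", "Normal", "High", "Normal", "High"),
  Wind     = c("Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
               "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"),
  Play     = c("No", "No", "Yes", "Yes", "Yes", "No", "Yes",
               "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)

# Entropy (base 2) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain = base entropy minus conditional entropy given the attribute.
info_gain <- function(data, attribute, target) {
  base    <- entropy(data[[target]])
  groups  <- split(data[[target]], data[[attribute]])
  weights <- sapply(groups, length) / nrow(data)
  cond    <- sum(weights * sapply(groups, entropy))
  base - cond
}

gains <- sapply(c("Outlook", "Humidity", "Wind"),
                function(a) info_gain(weather, a, "Play"))
print(round(gains, 3))
```

On this table Outlook has the largest gain (about 0.247, against about 0.152 for Humidity and 0.048 for Wind), which matches the choice of Outlook as the first split in the weather example.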
Finally, we get the following decision tree
[Figure: resulting decision tree. Root node: Poutcome (branches: "Failure, Other, Unknown" and "Success"). Internal nodes: Education (branches: "Secondary, tertiary" and "Primary, Unknown") and Job (branches: "Admin, blue-collar, management, technician" and "Self-employed, student, unemployed"). Leaf nodes: Yes or No.]
What if we could predict the
occurrence of diabetes and
take appropriate measures
beforehand to prevent it?
Sure! Let me take you
through the steps to
predict the vulnerable
patients.
Demo
Data Acquisition
Divide dataset
Implement model
Visualize
Model Validation
The doctor gets the following data from the patient’s medical history.
We will divide our entire dataset into two subsets as:
• Training dataset -> to train the model
• Testing dataset -> to validate and make predictions
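A random train/test split can be sketched as follows. The tutorial's diabetes data is not reproduced in these slides, so the sketch uses the `kyphosis` data set bundled with the rpart package as a stand-in; the 70/30 ratio, the seed, and the names `demo_train` and `demo_test` are assumptions made for illustration.

```r
library(rpart)  # loaded here only for the bundled kyphosis data set (stand-in data)

set.seed(42)    # assumed seed, so the split is reproducible

n         <- nrow(kyphosis)
train_idx <- sample(n, size = floor(0.7 * n))  # assumed 70/30 split

demo_train <- kyphosis[train_idx, ]   # used to train the model
demo_test  <- kyphosis[-train_idx, ]  # held out to validate predictions

cat("training rows:", nrow(demo_train), " testing rows:", nrow(demo_test), "\n")
```

Sampling row indices without replacement guarantees that no record appears in both subsets.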
Here, we implement the decision tree in R using the following commands.
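The original commands are not preserved in this transcript. The plotting calls used later (`plot` and `text` with `use.n`) are consistent with a model fitted by the rpart package, so the following is a hedged sketch of such a fit, using rpart's bundled `kyphosis` data as a stand-in for the diabetes data. The name `demo_model` and the choice of data are assumptions; with the tutorial's data the call would presumably use `is_diabetic` as the response and the training subset as `data`.

```r
library(rpart)

# Fit a classification tree; kyphosis is stand-in data, not the tutorial's data set.
demo_model <- rpart(Kyphosis ~ Age + Number + Start,
                    data   = kyphosis,
                    method = "class")   # "class" requests a classification tree

# Text summary of the fitted splits.
print(demo_model)

# Class predictions, here on the same stand-in data for brevity.
demo_pred <- predict(demo_model, newdata = kyphosis, type = "class")
```

In a real workflow the predictions would be made on held-out test data rather than on the data used for fitting.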
We get the output as follows, but it is not easy to read, so let’s visualize it for better understanding.
For plotting, we can use the following commands:
plot(diabet_model, margin = 0.1)
text(diabet_model, use.n = TRUE, pretty = TRUE, cex = 0.6)
[Figure: plotted decision tree. Split conditions shown: glucose_conc < 154.5, glucose_conc < 131, Diabetes_pedigree_fn < 0.315, blood_pressure >= 72, glucose_conc < 100.5, BMI < 26.35, Age >= 53.5, Age < 30.5. Leaves are labeled NO or YES with class counts (for example NO 107/3 and YES 9/65).]
Now we can use our model to predict the output for our testing dataset with the following code:
pred_diabet <- predict(diabet_model, newdata = diabet_test, type = "class")
pred_diabet
We get the following output for our testing dataset, where:
“YES” means the model predicts that the patient is vulnerable to diabetes.
“NO” means the model predicts that the patient is not vulnerable to diabetes.
We can create a confusion matrix for the model using the caret library to see how good our model is:
library(caret)
confusionMatrix(table(pred_diabet, diabet_test$is_diabetic))
Accuracy = 71.13%
The accuracy (or the overall success rate) is a metric defining the rate at
which a model has classified the records correctly. A good model should
have a high accuracy score.
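Accuracy can also be computed directly from a confusion table as the share of records on the diagonal. The sketch below uses base R only; the two label vectors are made up for illustration and are not the tutorial's test results, so the resulting value is unrelated to the 71.13% reported above.

```r
# Hypothetical actual and predicted labels (illustrative values only).
actual    <- factor(c("YES", "NO", "NO",  "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))
predicted <- factor(c("YES", "NO", "YES", "YES", "NO", "NO", "NO",  "NO"),
                    levels = c("NO", "YES"))

# Confusion table: rows are predictions, columns are actual classes.
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Accuracy = correctly classified records / all records.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Six of the eight made-up records are classified correctly, so the sketch reports an accuracy of 0.75.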
Course Details
Go to www.edureka.co/data-science
Get Edureka Certified in Data Science Today!