Building classification model, tree model, confusion matrix and prediction accuracy
1. Department of Geomatics, National Cheng Kung University
[106-2] Data Mining, Homework 5, Instructor: Hsueh-Chan Lu
Muhammad Irsyadi Firdaus P66067055
Based on your collected dataset, please select the most relevant attribute as the target attribute. Use the R tool
to analyze the following items:
1. Based on C5.0 Classification, 70% of data are randomly sampled for building classification model and
other 30% of data are used for testing. Output the tree model, confusion matrix and prediction accuracy.
2. Based on naiveBayes Classification, 70% of data are randomly sampled for building classification model
and other 30% of data are used for testing. Output the confusion matrix and prediction accuracy.
3. Write a short report summarizing what you found after the classification analysis.
Hint: Observe the decision tree and try to explain why these attributes are important to the target attribute.
Compare the decision tree model and the naïve Bayes model in terms of prediction accuracy.
Answers
1. In HW1, a dataset of about 68 records was collected. In this dataset, the target attribute is Interest in Vacation,
which is classified as Yes or No; the other attributes are evaluation attributes. The Gender attribute consists
of Male and Female, the Age attribute consists of Young and Medium, the Marriage Status attribute consists
of Student and Not Student, the Intensity of a Vacation attribute consists of Low and High, and the Vacation Time
attribute consists of Weekend, School holidays, and National holiday.
Table 1. Training Data (classification model)
This method uses a tree structure to build the classification model by dividing the dataset into smaller subsets.
Each internal node of the tree tests a feature of the instance to be classified, each branch represents a value
of that feature, and each leaf node represents a decision (a class label). Classification of an instance starts
from the root node, and the instance is sorted down the tree according to its feature values. Decision trees
can handle both categorical and numerical data.
For classification, the dataset is divided into two parts: 70% of the data are randomly sampled for building the
classification model and the other 30% are used for testing. In this case, 48 records are used for building the
classification model and 20 for testing.
The 20 testing records, taken at random, can be seen below.
Table 2. Testing Data
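The 70/30 random split can be sketched as follows. The record IDs and seed are placeholders (the actual sampling for this homework was done in R):

```python
import random

random.seed(42)                        # arbitrary seed, for reproducibility

records = list(range(1, 69))           # 68 record IDs, as in the HW1 dataset
n_train = round(len(records) * 0.7)    # 70% of 68 -> 48 records

shuffled = random.sample(records, len(records))  # random permutation
train, test = shuffled[:n_train], shuffled[n_train:]

print(len(train), len(test))           # 48 20
```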
We need to become comfortable with some terminology. Recall that we can talk in terms of positive tuples
(tuples of the main class of interest) and negative tuples (all other tuples). Given two classes, for example,
the positive tuples may be those with Interest in Vacation = Yes while the negative tuples are those with
Interest in Vacation = No. Suppose we use our classifier on a test set of labeled tuples. As the confusion
matrix in Figure 1 shows, the decision tree model has an accuracy of about 0.45 with a sensitivity of about
0.70 on this dataset. The accuracy of a classifier on a given test set is the percentage of test set tuples
that are correctly classified by the classifier. That is,
accuracy = (TP + TN) / (P + N)
The sensitivity and specificity measures can be used, respectively, for this purpose. Sensitivity is also
referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly
identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly
identified). These measures are defined as
sensitivity = TP / P
specificity = TN / N
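All three measures follow directly from the confusion matrix counts. A minimal sketch; the counts below are hypothetical values chosen only because they reproduce the reported accuracy of 0.45 and sensitivity of 0.70 on 20 test records, and the actual counts are those in Figure 1:

```python
def accuracy(tp, tn, p, n):
    """accuracy = (TP + TN) / (P + N)"""
    return (tp + tn) / (p + n)

def sensitivity(tp, p):
    """true positive (recognition) rate = TP / P"""
    return tp / p

def specificity(tn, n):
    """true negative rate = TN / N"""
    return tn / n

# Hypothetical counts consistent with the reported figures:
# P = N = 10 among the 20 test records, TP = 7, TN = 2.
print(accuracy(7, 2, 10, 10))   # 0.45
print(sensitivity(7, 10))       # 0.7
print(specificity(2, 10))       # 0.2
```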
Figure 1. Confusion matrix and statistics using the decision tree
Figure 2. Decision Tree
Classification trees are used for the kind of data mining problems that are concerned with prediction.
2. Bayesian classification can predict class membership probabilities. The naïve Bayes algorithm assumes that
the effect of an attribute value on a given class is independent of the values of the other attributes. The
algorithm scales well in the number of predictors and rows and builds models rapidly. Naive Bayes derives
the probability of a prediction from Bayes' theorem: the probability of event X occurring given that event Y
has occurred, P(X|Y), is proportional to the probability of event Y occurring given that event X has occurred,
multiplied by the prior probability of event X, i.e. P(X|Y) ∝ P(Y|X)P(X).
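For categorical attributes, the probabilities in this rule are simply estimated by counting. A minimal from-scratch sketch; the toy rows are hypothetical, not the actual HW1 data, and the assignment itself uses R's naiveBayes:

```python
from collections import Counter, defaultdict

# Toy rows mirroring two of the HW attributes (hypothetical values).
rows = [
    {"Gender": "Male",   "Age": "Young"},
    {"Gender": "Male",   "Age": "Medium"},
    {"Gender": "Female", "Age": "Young"},
    {"Gender": "Female", "Age": "Medium"},
]
labels = ["Yes", "No", "Yes", "No"]

prior = Counter(labels)          # class counts, used for P(class)
cond = defaultdict(Counter)      # (attribute, class) -> value counts
vocab = defaultdict(set)         # attribute -> set of observed values
for row, y in zip(rows, labels):
    for attr, val in row.items():
        cond[(attr, y)][val] += 1
        vocab[attr].add(val)

def predict(row):
    """Pick the class maximizing P(class) * prod of P(value | class)."""
    total = sum(prior.values())
    scores = {}
    for y, ny in prior.items():
        p = ny / total           # prior P(y)
        for attr, val in row.items():
            # Laplace (+1) smoothing so unseen values never zero the product
            p *= (cond[(attr, y)][val] + 1) / (ny + len(vocab[attr]))
        scores[y] = p            # proportional to P(y | row)
    return max(scores, key=scores.get)

print(predict({"Gender": "Male", "Age": "Young"}))   # Yes
```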
Using the Bayesian classification method, the confusion matrix and statistics can be seen below.
Figure 3. Confusion Matrix and statistics using Bayesian classification
From the above confusion matrix, the Bayesian model has an accuracy of about 0.60 with a sensitivity of about 0.8750.
We wish to predict the class label of "Interest in Vacation" using naïve Bayesian classification, given the
same training data as in Table 1 for decision tree induction. The results of the prediction model can be seen
below.
3. Once we have the results of the decision tree model and the Bayesian classification model, we can compare
the accuracy of both. From these results, the Bayesian classification (accuracy about 0.60) is better than the
decision tree model (accuracy about 0.45).
Table 3. Comparison between Bayesian Classification and Decision Tree
Record | Interest.in.Vacation (Ground Truth) | by Bayesian Classification | by Decision Tree
21     | No                                  | No                         | No
33     | No                                  | No                         | No
39     | No                                  | No                         | No
43     | Yes                                 | Yes                        | Yes
49     | Yes                                 | Yes                        | Yes