Forest Cover Type Classification: A Decision Tree Approach
Laila van Ments, Kayleigh Beard, Anita Tran and Nathalie Post
March 31, 2015
Abstract
This paper describes our approach to a Kaggle classification task in which Forest Cover Types need to
be predicted. In order to solve this classification problem, three machine learning decision tree algorithms
were explored: Naive Bayes Tree (NBTree), Reduced Error Pruning Tree (REPTree) and J48 Tree. The
main focus of this research lies on two aspects. The first aspect concerns which of these trees yields the
highest accuracy for this classification task. The second aspect is finding which parameter settings and
which category- or bin size of the partitioned training set yield the highest accuracy for each decision tree.
Briefly, the experiments involve binning the training set, performing Cross Validated Parameter Selection
in Weka, calculating the true error and submitting the best performing trained decision trees to Kaggle in
order to determine the test set error. The results of this research show that bin sizes between 20 and 50
yield higher accuracy for each decision tree. Moreover, the optimal parameter settings that yielded the
highest accuracy on the training set were not necessarily the optimal parameter settings for the test set.
Contents

Introduction
1 Data inspection and Preparation
 1.1 The available data
 1.2 Becoming familiar with the data
 1.3 Determining missing values
 1.4 Data transformations
 1.5 Analyzing the data
 1.6 Data visualizations
 1.7 Evaluating the Information Gain
 1.8 Outlier and measurement error detection
 1.9 Discretize values
 1.10 Feature selection
2 Experimental Setup
 2.1 General setup
 2.2 Detailed experimental setup per tree
  2.2.1 J48 decision tree
  2.2.2 Naive Bayes decision tree
  2.2.3 Reduced Error Pruning decision tree
3 Results
 3.1 Training Set Results
 3.2 Test Set Results
4 Conclusion
5 Discussion
Bibliography
Introduction
This paper describes our approach to the Kaggle Competition ‘Forest Cover Type Prediction’. This competition
is a classification problem. The goal of this classification problem is to predict the correct forest cover
type, based on 54 cartographic features. Cartographic features are the features that describe the landscape,
such as elevation and soil type. The data was gathered in four wilderness areas in the Roosevelt National
Forest (Northern Colorado). [1]
With many large wilderness areas, it can be difficult and costly to determine the cover type of each area
manually. However, the information about the cover type is very useful for experts or ecosystem caretakers
who need to maintain the ecosystems of these areas. Thus, if a machine learning algorithm is capable of
accurately predicting the cover type of a certain area with given cartographic features, it will be easier for
experts to make decisions when preserving ecological environments.
In order to solve this classification problem, three machine learning decision tree algorithms were explored.
Decision trees are often used in classification problems. The basic idea of a decision tree algorithm
is as follows. Based on the data of a training set, a decision tree is constructed that maps the observations
about features onto the class value. While constructing the tree, for each node, a decision needs to be made
on which feature to split on next, i.e. which feature will be placed on the current node. The chosen feature is
the feature that has the highest information gain at that moment. The information gain indicates how well
a certain feature will predict the target class value. [2] After the tree is constructed based on the training
data, the class of every new and unknown data point can be predicted with the tree.
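For reference, the information gain used as this split criterion has the standard textbook definition below (the formula itself is not stated explicitly in the report): for a set of examples S and a feature A,

\[
IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
\qquad
H(S) = -\sum_{c} p_c \log_2 p_c ,
\]

where S_v is the subset of S for which A takes value v and p_c is the proportion of examples in S that belong to class c.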
The three decision trees that will be evaluated for this classification problem are Naive Bayes Tree
(NBtree), Reduced Error Pruning Tree (REPtree) and J48 Tree. The reason for testing these decision trees
is because they are all similar to or based on the C4.5 statistical classifier, which is an extension of the ID3
decision tree algorithm. [3] We are interested in the performance of these C4.5-like algorithms and will be
comparing these trees with each other.
The three decision tree algorithms were explored and tuned in several experiments in order to find the
parameters that yield the highest accuracy (highest percentage of correctly classified cover types) on both
the training set and the test set. We are particularly interested in assigning the real-valued features of the
dataset into categories or bins, by testing several bin-sizes in order to discover which size yields the highest
accuracy. Moreover, several parameter settings are tested for each decision tree algorithm. Therefore, the
research question that will be answered in this report is:
Which of the NBTree, REP Tree and J48 Tree yields the highest accuracy on the
Forest Cover Type classification problem?
With the following sub-question:
For each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy?
The first step in answering these questions consists of gaining a deeper understanding of the data and
pre-processing it where necessary, as described in Chapter 1. In the second chapter, the experimental setup
for each decision tree is discussed. The third chapter contains the accuracy results of the experiments on
the training set as well as the test set. Finally, the conclusion and discussion are reported.
Chapter 1
Data inspection and Preparation
1.1 The available data
The available data for this classification problem was provided by Kaggle, and consists of a training set and
a test set. For the initial phases of the process, only the training set was utilized. The training set consisted
of 15,120 instances with 54 attributes, all belonging to 1 of 7 forest cover type classes.
1.2 Becoming familiar with the data
The first step of the data analysis was to become familiar with the data. For that reason, the first step of
data inspection was simply analyzing the data with the bare eye in Excel. From this it was observed that all
the provided data was numeric. Also, part of the data (the Wilderness Areas and the Soil Types) consisted
of Boolean values.
1.3 Determining missing values
Excel was used in order to determine whether there were missing values in the data, which was not the case;
the data was complete.
1.4 Data transformations
The Boolean features (Wilderness Areas and Soil Types) were further analyzed using a Python script. It
turned out that every instance in the data set belonged to exactly one of the four Wilderness Areas and exactly
one of the forty Soil Types. For that reason, another Python script was written in order to reshape the data.
The four Wilderness Area Boolean features were merged into one feature: ‘Wilderness Areas Combined’. In
this combined feature, a Class value is assigned to each of the Wilderness Areas (1,2,3 or 4). This was also
done for the forty Boolean Soil Type features, resulting in one feature ‘Soil Type Combined’, with
forty Classes (1, 2,...,40).
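A minimal sketch of this reshaping step, assuming a pandas DataFrame with the Kaggle column names Wilderness_Area1..Wilderness_Area4 and Soil_Type1..Soil_Type40 (the authors' actual script is not included in the report):

```python
import pandas as pd

# Load the Kaggle training set (the file name is an assumption).
train = pd.read_csv("train.csv")

wilderness_cols = [f"Wilderness_Area{i}" for i in range(1, 5)]
soil_cols = [f"Soil_Type{i}" for i in range(1, 41)]

# Every row has exactly one 1 among each group of Boolean columns, so idxmax
# returns the name of the active column; stripping the prefix leaves the class
# number (1..4 for wilderness areas, 1..40 for soil types).
train["Wilderness_Areas_Combined"] = (
    train[wilderness_cols].idxmax(axis=1).str.replace("Wilderness_Area", "").astype(int)
)
train["Soil_Type_Combined"] = (
    train[soil_cols].idxmax(axis=1).str.replace("Soil_Type", "").astype(int)
)

# The original Boolean columns are no longer needed.
train = train.drop(columns=wilderness_cols + soil_cols)
```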
1.5 Analyzing the data
Weka (a machine learning software tool) was used to gain an understanding of the values and the ranges of
the data. In Table 1.1 an overview of all the features is displayed.
Feature | Unit | Type | Range | Median | Mode | Mean | Standard Deviation
Elevation | Meters | Numerical | [1863, 3849] | 2752 | 2830 | 2749.32 | 417.68
Aspect | Degrees | Numerical | [0, 360] | 126 | 45 | 156.68 | 110.09
Slope | Degrees | Numerical | [0, 52] | 15 | 11 | 16.5 | 8.45
Horizontal Distance To Hydrology | Meters | Numerical | [0, 1343] | 180 | 0 | 227.2 | 210.08
Vertical Distance To Hydrology | Meters | Numerical | [-146, 554] | 32 | 0 | 51.08 | 61.24
Horizontal Distance To Roadways | Meters | Numerical | [0, 6890] | 1316 | 150 | 1714.02 | 1325.07
Hillshade 9am | Illumination value | Numerical | [0, 254] | 220 | 226 | (not reported) | 30.56
Hillshade Noon | Illumination value | Numerical | [99, 254] | 223 | 225 | 218.97 | 22.8
Hillshade 3pm | Illumination value | Numerical | [0, 248] | 138 | 143 | 135.09 | 45.9
Horizontal Distance To Fire Points | Meters | Numerical | [0, 6993] | 1256 | 618 | 1511.15 | 1099.94
Wilderness Combined | Wilderness Class | Categorical | [0, 4] | - | 3 | - | -
Soil Type Combined | Soil Type Class | Categorical | [0, 40] | - | 10 | - | -
Cover Type | Cover Type Class | Categorical | [1, 7] | - | 5 | - | -
Table 1.1: Feature Overview
The unit, type, range, median, mode, mean and standard deviation of each feature are good indicators
of the meaning and distribution of the dataset. For example, it is useful to know that the hill shades have
illumination values that range from 0 to 255, and that it would be wrong if this vector contained values
outside this range.
Because some features are categorical, their mean was not calculated, as it is not a meaningful statistic
for such values. The same holds for the median and the standard deviation of these features.
Furthermore, Weka was used to visualize the shape of the distribution for each of the features.
1.6 Data visualizations
In order to gain insight on which variables are the most valuable for predicting the correct cover type, a
number of density plots were made in R, which are displayed in the figures below.
Figure 1.1: Density plot Elevation
From the Elevation density plot in Figure 1.1 it can be observed that a clear distinction can be made
between cover types that belong to class 3, 4 or 6 (elevation <2700) and cover types belonging to class 1, 2,
5 or 7 (elevation >2700).
Figure 1.2: Density plot Aspect
The Aspect density plot, displayed in Figure 1.2, is less informative than the Elevation density plot.
However, from this plot it can be observed that if the aspect is between 100 and 150, there is a greater likelihood
that the instance belongs to class 4. Also, instances belonging to class 6 are more likely to have an aspect
between 0-100 and 250-350 than between 100-250.
Figure 1.3: Density plot Horizontal Distance To Hydrology
In the Horizontal Distance To Hydrology density plot, displayed in Figure 1.3, most of the values are
stacked on top of each other, which makes it difficult to use this feature to distinguish the classes. However,
instances belonging to class 6 are the most likely to fall in the lowest range (between 0 and 100 approximately).
The density plots of the other attributes were not as informative as the plots displayed above. Most of
the classes in the rest of the density plots were stacked on top of each other, which makes it difficult to
clearly make distinctions between the classes.
1.7 Evaluating the Information Gain
From looking at the plots displayed above, it can be observed that the Elevation is likely to be one of the
best predictors for the Cover Type. To confirm this assumption, Weka‘s information gain attribute evaluator
tool was used. This tool calculates for all features of the dataset the information gain and ranks the features
accordingly. The results of the information gain ranking are displayed in Table 1.2. These results
confirm the earlier assumption that Elevation is a good predictor.
Furthermore, the Horizontal Distance To Roadways, Horizontal Distance To Fire Points, and Wilderness
Area4 all have quite a high information gain compared to the rest of the features.
Information Gain Feature
1.656762 Elevation
1.08141 Horizontal Distance To Roadways
0.859312 Horizontal Distance To Fire Points
0.601112 Wilderness Area4
0.270703 Horizontal Distance To Hydrology
0.251287 Wilderness Area1
0.234928 Vertical Distance To Hydrology
0.22248 Aspect
0.203344 Hillshade 9am
0.191957 Soil Type10
Table 1.2: Information Gain
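The ranking above was produced with Weka's information gain attribute evaluator; a comparable calculation in Python might look as follows (a sketch, assuming a DataFrame train with a Cover_Type column and features that are already categorical or discretized, since information gain is defined on categorical values):

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy (base 2) of a categorical series."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, feature: str, target: str = "Cover_Type") -> float:
    """Entropy of the target minus the weighted entropy after splitting on `feature`."""
    before = entropy(df[target])
    after = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return before - after

# Example: rank a set of (discretized) feature columns by information gain.
# ranking = sorted(feature_cols, key=lambda f: information_gain(train, f), reverse=True)
```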
1.8 Outlier and measurement error detection
With the information from the visualizations and information gain ranking, the data was inspected for
outliers and measurement errors.
The range of the feature Vertical Distance To Hydrology was explored, because the values of this feature
range from -146 to 554. The fact that this distance vector contains values below zero seemed odd. However,
it turns out that forest cover types with Vertical Distance To Hydrology values below zero are situated in
water. These cover types should be distinguished from ‘non hydrology’ values, and therefore values below
zero are used in the data set.
The value ranges of other features are assumed not to have crucial measurement errors, since their ranges
match the possible ranges for these values in ‘the real world’. However, no dedicated technique was
applied to detect outliers, so no outliers were found. The reason for this is that the only machine
learning algorithms used in this research are decision trees, and these algorithms can handle outliers
well, for example through pruning.
1.9 Discretize values
Another data preparation step could be discretization of the feature values. Decision tree algorithms are
essentially designed for data sets that contain nominal or categorical features. However, that does not mean
that they cannot handle real valued inputs. In order to enable decision trees to handle real valued inputs
the data has to be modified. An example of such a modification for a feature with real valued input is
‘binning’. The real valued inputs of each feature are split into a number of categories, based on numerical
thresholds. Each category then contains a range of real valued numbers.
The Forest Cover Type dataset contains real valued features that could be converted to categorical
features (‘bins’) as a preprocessing step. For example, values of Hillshade 3pm ranging from 0 to 50 belong to
category 1, values ranging from 51 to 100 belong to category 2, et cetera. It is therefore interesting to
consider whether ‘binning’ the features of the Forest Cover Type dataset would result in a higher accuracy,
and if so, which bin sizes perform best. The detailed process of finding the best (un)binned
feature set is further explained in the experimental setup section.
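As a small illustration of the Hillshade 3pm example above (column name as in the Kaggle data; the thresholds simply mirror the example in the text):

```python
import pandas as pd

# 0-50 -> category 1, 51-100 -> category 2, ..., 201-255 -> category 5.
edges = [-1, 50, 100, 150, 200, 255]
train["Hillshade_3pm_binned"] = pd.cut(
    train["Hillshade_3pm"], bins=edges, labels=range(1, len(edges))
).astype(int)
```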
1.10 Feature selection
After cleaning and preprocessing the features, it was decided not to eliminate or merge any features of the
dataset. The first reason for this is that the number of features available for classification is not extremely
large. A classifier that makes decisions based on 12 features is not necessarily overly complex, expensive
or prone to overfitting.[4] Besides, a certain form of feature selection is already part of the decision tree
algorithm in case pruning takes place.
Chapter 2
Experimental Setup
This chapter gives an overview of the setup of the research experiments and of the techniques and
algorithms that were used.
2.1 General setup
As mentioned in the introduction of this report, the research question is: Which of the Naive Bayes Tree,
REP Tree and J48 Tree yields the highest accuracy on the Forest Cover Type classification problem? With
the sub question: For each decision tree, which parameter settings and bin sizes of the features yield the
highest accuracy? Note that when using accuracy as a performance measure, the performance on the
training set, on estimated future data and on the test set should all be obtained and taken into account, because
performance on the training set alone is not a good estimator of the performance of the classifier on future
data. To answer the research questions, four variations of the training set need to be made prior to
the experiments:
• In addition to the original unbinned dataset, the set is divided into 5 bins, 20 bins and 50 bins. The
bins for each feature depend on the range of the feature vector, which is split into intervals covering
fixed percentages of that range. For efficiency this was done with a Python script; a minimal sketch of
this step is shown directly below.
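A minimal sketch of such a binning script, assuming a pandas DataFrame and equal-width intervals derived from each feature's range (the authors' actual script is not included in the report):

```python
import pandas as pd

def bin_features(df: pd.DataFrame, columns: list, n_bins: int) -> pd.DataFrame:
    """Replace each real-valued column by an equal-width discretization into n_bins categories."""
    binned = df.copy()
    for col in columns:
        # pd.cut splits the range [min, max] of the column into n_bins equal-width
        # intervals and labels them 1..n_bins.
        binned[col] = pd.cut(df[col], bins=n_bins, labels=range(1, n_bins + 1))
    return binned

# The four training set variants used in the experiments: unbinned, 5, 20 and 50 bins.
# real_valued = ["Elevation", "Aspect", "Slope", ...]   # the ten numerical features
# variants = {n: bin_features(train, real_valued, n) for n in (5, 20, 50)}
```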
Furthermore, concerning the performance measures, the general setup for testing each decision
tree is as follows:
1. For each training set variant (unbinned, 5 bins, 20 bins and 50 bins), Weka's Cross Validated Parameter
Selection (‘CVParameterSelection’) is used in order to find the optimal values for two important
parameters. The values of the other parameters are kept at the Weka defaults and are assumed to be
optimal, in order to prevent extremely complex and computationally expensive experiments.
Naturally, it is important to consider that this method may not result in an optimal classifier being
found. Conveniently, because of Weka's CVParameterSelection option, the optimal parameters for the
decision trees do not have to be searched for manually, which is a time consuming task.
2. For each training set variant (unbinned, 5 bins, 20 bins and 50 bins), the training set performance is
recorded, based on 10-fold cross validation in Weka, with accuracy as the result. Each fold in cross validation
involves partitioning the dataset into a part used for training and a held-out part used for validation.
This validation method is chosen, because it only wastes 10 percent of the training data and can still
give a good estimate of the training set performance. [5] The performance of the algorithms will be
measured with accuracy, which is the percentage of correctly classified instances out of all instances.
3. The future data set performance is estimated for each training set variant (unbinned, 5 bins, 20 bins
and 50 bins) and compared with the optimized values, as well as Weka‘s default values. This is done
because the training set accuracy might be biased due to overfitting. To estimate the future data set
performance, the true error interval with confidence factor is calculated. The interval for true error
can be calculated as displayed in Figure 2.1 [6].
Figure 2.1: Formula for estimating the True Error Interval. The interval is

\[ \mathrm{error}_S(h) \;\pm\; z_N \sqrt{\frac{\mathrm{error}_S(h)\,(1 - \mathrm{error}_S(h))}{n}} \]

where error_S(h) is the sample error, which is 1 minus the proportion of correctly classified training
samples, n is the number of samples and z_N is the z-index of the confidence factor. The
confidence factor gives the certainty of the actual error lying in the true error interval. A confidence
factor of 68% is chosen for the upcoming calculations. Because of the risk of either overfitting or
underfitting that decision trees are prone to, the chosen confidence interval allows for a large deviation
from the sample error. A small Python sketch of this calculation is given directly after this list.
4. The optimal version of the trained decision tree is used to classify the instances of the separate test
set that is provided by Kaggle. The results are submitted to Kaggle.com to determine the test set
accuracy.
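A small sketch of the true error calculation referenced in step 3, using z = 1.0 as the standard z-value for a confidence factor of roughly 68%:

```python
from math import sqrt

def true_error_interval(accuracy: float, n: int, z: float = 1.0):
    """Confidence interval for the true error, given the training accuracy on n samples.

    accuracy: proportion of correctly classified samples (e.g. 0.6226)
    z:        z-index of the chosen confidence factor (1.0 corresponds to about 68%)
    """
    sample_error = 1.0 - accuracy
    margin = z * sqrt(sample_error * (1.0 - sample_error) / n)
    return sample_error, margin

# Example reproducing the order of magnitude in Table 3.1 (15,120 training instances):
err, m = true_error_interval(0.6226, 15120)
print(f"{err:.4f} (+- {m:.4f})")   # -> 0.3774 (+- 0.0039)
```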
In the following section, the detailed parameter optimization process will be outlined per decision tree.
2.2 Detailed experimental setup per tree
In this section the three different decision trees and parameters that were experimented with are discussed
in detail.
2.2.1 J48 decision tree
J48 is Weka's implementation of the C4.5 algorithm (an extension of the ID3 algorithm) and can be
used for classification. It generates a pruned or unpruned C4.5 decision tree. J48 builds decision
trees from a set of labeled training data. It calculates which feature to split on by choosing the feature split
that yields the highest information gain. After each split, the algorithm recurses, and calculates again for
each feature which next split will yield the highest information gain. The splitting procedure stops when
all instances in a subset belong to the same class. Furthermore, the J48 tree provides an option for pruning
decision trees. [7]
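To make the recursive splitting concrete, a strongly simplified ID3-style sketch is shown below, reusing the information_gain helper sketched in Section 1.7. This is not Weka's actual J48 implementation, which additionally handles numeric split points, missing values and pruning:

```python
import pandas as pd

def build_tree(df: pd.DataFrame, features: list, target: str = "Cover_Type"):
    """Grow an unpruned tree over categorical features by greedy information gain."""
    # Stop when all instances share one class or no features remain: return the majority class.
    if df[target].nunique() == 1 or not features:
        return df[target].mode()[0]
    # Choose the feature whose split yields the highest information gain (see Section 1.7).
    best = max(features, key=lambda f: information_gain(df, f, target))
    remaining = [f for f in features if f != best]
    # Recurse into each branch induced by the chosen feature.
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in df.groupby(best)}}
```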
The parameters of the J48 decision tree that can be modified in Weka are displayed in Table 2.1.
confidenceFactor The confidence factor used for pruning (smaller values incur more pruning).
minNumObj The minimum number of instances per leaf.
numFolds Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.
reducedErrorPruning Whether reduced-error pruning is used instead of C4.5 pruning.
seed The seed used for randomizing the data when reduced-error pruning is used.
subtreeRaising Whether to consider the subtree raising operation when pruning.
unpruned Whether pruning is performed.
useLaplace Whether counts at leaves are smoothed based on Laplace.
Table 2.1: Parameters Adaptable for J48
The parameter settings displayed in Table 2.2 are the default settings for the J48 decision tree.
confidenceFactor 0.25
minNumObj 2
numFolds 3
reducedErrorPruning False
seed 1
subtreeRaising True
unpruned False
useLaplace False
Table 2.2: Default Parameter Settings for J48
Parameters for Optimization
The chosen parameters for optimization are minimum number of instances per leaf and confidence factor.
The rationale behind this parameter choice will be explained in this section.
The optimal size of a decision tree is crucial for the performance of the classification. When the decision
tree is too large (i.e. too many nodes), it may overfit on the training data and therefore generalize poorly to
new samples. On the other hand, if the tree is too small, it may not be able to capture important structural
information about the sample space. [9] A common strategy to prevent overfitting is to grow the tree
until each node contains a small number of instances and then use pruning to remove nodes
that do not provide additional information. Pruning is used to reduce the size of a learning tree without
reducing the predictive accuracy.
The parameter minimum number of instances per leaf influences the width of the tree. High values for
this parameter result in smaller trees.
The parameter confidence factor is used to discover at which level of pruning the prediction results improve
most. The smaller the confidence factor value, the more pruning will occur. Therefore, the combination
of optimal parameter values for the confidence factor and the minimum number of instances per leaf are
searched for using 10-fold ‘CVParameterSelection’ in Weka. This is done for each variant of the training set:
unbinned, 5 bins, 20 bins and 50 bins.
Test values for Parameters
Typical values for both parameters were investigated, in order to know which values ‘CVParameterSelection’
in Weka should test for each (un)binned training set. [10] The values of the parameter
confidence factor that ‘CVParameterSelection’ tested ranged from 0.1 to 0.9, with a step size of 0.1. The
values of the parameter minimum number of instances per leaf that ‘CVParameterSelection’ tested ranged
from 2 to 10, with a step size of 1.
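The parameter search itself was run with Weka's CVParameterSelection. As an analogous, but not identical, illustration in Python, a cross-validated grid search over a comparable tree learner could look like this; scikit-learn's DecisionTreeClassifier is a CART-style tree rather than C4.5, and min_samples_leaf and ccp_alpha only loosely correspond to Weka's minNumObj and confidenceFactor:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# X, y: the (binned or unbinned) training features and the Cover_Type labels.
param_grid = {
    "min_samples_leaf": list(range(2, 11)),          # analogue of minNumObj: 2..10
    "ccp_alpha": [0.0, 0.0005, 0.001, 0.005, 0.01],  # pruning strength, loose analogue of the confidence factor
}
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=1),
    param_grid,
    cv=10,                 # 10-fold cross validation, as in the experiments
    scoring="accuracy",
)
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```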
2.2.2 Naive Bayes decision tree
Naive Bayes Tree (‘NBTree’) is a hybrid of decision-tree classifiers and Naive Bayes
classifiers. This algorithm generates a decision tree that splits in the same way as regular decision trees, but
with Naive Bayes classifiers at the leaves. NBTree frequently achieves higher accuracy than either a naive
Bayesian classifier or a decision tree learner. This approach is suitable in scenarios where many attributes are
likely to be relevant for a classification task, yet the attributes are not necessarily conditionally independent
given the label. [8]
NBTree, as used here, is only able to process nominal and binary attributes and cannot handle numeric
inputs. Therefore, the unbinned data set could not be experimented with.
This algorithm does not contain any parameter settings in Weka. Therefore, this algorithm is only tested
with different bin sizes.
2.2.3 Reduced Error Pruning decision tree
Reduced Error Pruning Tree (‘REPTree’) is a fast decision tree learner that builds a decision tree based
on information gain (or on variance reduction for regression) and prunes it using
reduced-error pruning with backfitting. Starting at the leaves, each node is replaced by the majority class;
if the prediction accuracy is not affected, the change is kept. This results in the smallest version of the most
accurate subtree with respect to the pruning set. [9]
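A conceptual sketch of this pruning step is given below, operating on the nested-dict trees of the build_tree sketch in Section 2.2.1 and a held-out pruning set; it is illustrative only, and Weka's REPTree differs in its details (e.g. backfitting and numeric splits):

```python
import pandas as pd

def accuracy(tree, df: pd.DataFrame, target: str = "Cover_Type") -> float:
    """Fraction of rows in df classified correctly by a (nested-dict or leaf) tree."""
    def classify(node, row):
        while isinstance(node, dict):
            feature, branches = next(iter(node.items()))
            node = branches.get(row[feature])   # unseen value -> None, counted as wrong
            if node is None:
                return None
        return node
    predictions = df.apply(lambda row: classify(tree, row), axis=1)
    return float((predictions == df[target]).mean())

def reduced_error_prune(tree, prune_df: pd.DataFrame, target: str = "Cover_Type"):
    """Bottom-up: replace a subtree by the majority class if pruning-set accuracy does not drop."""
    if not isinstance(tree, dict):
        return tree
    feature, branches = next(iter(tree.items()))
    pruned = {feature: {value: reduced_error_prune(sub, prune_df[prune_df[feature] == value], target)
                        for value, sub in branches.items()}}
    if len(prune_df) == 0:
        return pruned
    majority_leaf = prune_df[target].mode()[0]
    # Keep the simpler majority-class leaf whenever it is at least as accurate on the pruning set.
    if accuracy(majority_leaf, prune_df, target) >= accuracy(pruned, prune_df, target):
        return majority_leaf
    return pruned
```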
The parameters of the REPTree that can be modified in Weka are displayed in Table 2.3.
debug If set to true, classifier may output additional info to the console.
maxDepth The maximum tree depth (-1 for no restriction).
minNum The minimum number of instances per leaf.
minVarianceProp The minimum proportion of the variance on all the data that needs to be present at a node in order for splitting to be performed in regression trees.
noPruning Whether pruning is performed.
numFolds Determines the amount of data used for pruning. One fold is used for pruning, the rest for growing the tree.
seed The seed used for randomizing the data.
Table 2.3: Parameters Adaptable for REPtree
The parameter settings displayed in Table 2.4 are the default settings for the REPTree.
debug False
maxDepth -1
minNum 2.0
minVarianceProp 0.001
noPruning False
numFolds 3
seed 1
Table 2.4: Default Parameter Settings for REPtree
Parameters for Optimization
The parameters maximum depth of tree and minimum number of instances per leaf are optimized, because
these are important parameters in the REPTree algorithm. The rationale behind this parameter choice will
be explained in this section.
The default setting for maximum depth of tree is -1, which means that there is no restriction on depth.
This parameter influences the amount of pruning that takes place. Pruning is an important factor
related to overfitting and underfitting, two things that both need to be prevented as much as possible. The
minimum number of instances per leaf influences the width of the tree. The higher this number, the smaller
the tree. Therefore, the combination of optimal parameter values for the maximum depth of tree and the
minimum number of instances per leaf are searched for using 10-fold ‘CVParameterSelection’ in Weka. This
is done for each variant of the training set: unbinned, 5 bins, 20 bins and 50 bins.
Test values for Parameters
Typical values for both parameters were investigated, in order to know which values
‘CVParameterSelection’ in Weka should test for each (un)binned training set. [10] The values of the parameter
maximum depth of tree that ‘CVParameterSelection’ tested ranged from 2 to 12, with a step size of 1. The
values of the parameter minimum number of instances per leaf that ‘CVParameterSelection’ tested ranged
from 2 to 10, with a step size of 1.
Chapter 3
Results
In this chapter, the results of the performed experiments are given. As described in the previous chapter
(The Experimental Setup), the results for each tree (J48 Tree, NBTree, REPtree) consist of:
Training Set Results:
• Default and optimal parameter settings found with Cross Validated Parameter Selection
• Training set accuracy based on 10-fold Cross Validation
• An estimation of the true error of a future dataset with 68% confidence
Test Set Results:
• Test set error, based on Kaggle submission
3.1 Training Set Results
The default and optimal parameter settings, training set accuracy and true error estimate for each decision
tree are summarized in Table 3.1 (J48 Tree), Table 3.2 (NBTree) and Table 3.3 (REPTree).
Dataset | Setting | Confidence Factor | Min. Instances per Leaf | Training Set Accuracy | True Error (68%) Estimate
Not binned | Default | 0.25 | 2 | 59.33% | 0.4067 (± 0.004)
Not binned | Optimized* | 0.4 | 8 | 59.84% | 0.4016 (± 0.004)
5 bins | Default | 0.25 | 2 | 60.22% | 0.3978 (± 0.004)
5 bins | Optimized | 0.2 | 5 | 61.17% | 0.3883 (± 0.004)
20 bins | Default | 0.25 | 2 | 61.75% | 0.3825 (± 0.004)
20 bins | Optimized | 0.3 | 7 | 62.07% | 0.3793 (± 0.0039)
50 bins | Default | 0.25 | 2 | 62.22% | 0.3778 (± 0.0039)
50 bins | Optimized | 0.3 | 5 | 62.26% | 0.3774 (± 0.0039)
Table 3.1: J48 Tree Test Results
Dataset | Training Set Accuracy | True Error (68%) Estimate
Not binned* | - | -
5 bins | 59.50% | 0.405 (± 0.004)
10 bins | 60.69% | 0.3931 (± 0.004)
20 bins | 60.81% | 0.3919 (± 0.004)
50 bins | 61.27% | 0.3873 (± 0.004)
*Naive Bayes Tree cannot handle numeric inputs.
Table 3.2: NBTree Test Results
Dataset | Setting | Max. Tree Depth | Min. Instances per Leaf | Training Set Accuracy | True Error (68%) Estimate
Not binned | Default | -1 | 2 | 53.35% | 0.4665 (± 0.0041)
Not binned | Optimized* | 2 | 2 | 53.35% | 0.4665 (± 0.0041)
5 bins | Default | -1 | 2 | 59.57% | 0.4043 (± 0.004)
5 bins | Optimized | 4 | 10 | 60.49% | 0.3951 (± 0.004)
20 bins | Default | -1 | 2 | 60.31% | 0.3969 (± 0.004)
20 bins | Optimized | 4 | 7 | 60.60% | 0.394 (± 0.004)
50 bins | Default | -1 | 2 | 59.67% | 0.4033 (± 0.004)
50 bins | Optimized | 3 | 10 | 59.80% | 0.402 (± 0.004)
Table 3.3: REPtree Test Results
Furthermore, in Figure 3.1 below, the accuracy of the decision trees is plotted against the used bin sizes
in the training set to provide a brief visual summary of the results.
Figure 3.1: Plot of accuracy against bin sizes for each tree
3.2 Test Set Results
The optimal parameter settings for each type of decision tree (J48 Tree, NBTree, REPTree) are used in
order to classify the instances from the test set that was provided by Kaggle.
As was shown in Table 3.1, Table 3.2 and Table 3.3 above, the following parameter and bin settings are
optimal for the decision trees, i.e. result in highest accuracy and lowest true error. Therefore these are used
for testing and submitting to Kaggle.com:
1. J48 Tree with confidence factor 0.3, a minimum of 5 instances per leaf and trained on 50 bins
2. REP Tree with maximum tree depth of 4, a minimum of 7 instances per leaf and trained on 20 bins
3. NBTree trained on 50 bins
These optimized trees will be referred to as ‘J48 Tree Optimized’, ‘REPtree Optimized’, and ‘NBtree
Optimized’ respectively. Table 3.4 displays the different performance measures of the optimized trees that
were submitted to Kaggle. [13]
Decision Tree | Training Set Accuracy | True Error (68%) Estimate | Test Set Accuracy (Kaggle)
J48 Tree Optimized | 62.26% | 0.3774 (± 0.0039) | 43.67%
NBtree Optimized | 61.27% | 0.3873 (± 0.004) | 43.35%
REPtree Optimized | 60.60% | 0.394 (± 0.004) | 47.07%
Table 3.4: Different performance measures of optimized trees.
Chapter 4
Conclusion
In this section the results of the performed experiments are interpreted with respect to the research question
of this report:
Which of the NBTree, REP Tree and J48 Tree yields the highest accuracy on the
Forest Cover Type classification problem?
With the following sub-question:
For each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy?
The accuracy on the training set alone is not a good predictor for the performance of the classifier on
future data, so the estimated performance on future data and test set performance were also taken into
account.
In pursuit of answering these questions, experiments were conducted with the unbinned, 5-bin, 20-bin and 50-bin
datasets. In addition, for each decision tree two important parameters were optimized. Accuracy
results based on training and test set performance were recorded, as well as the true error, which can be
found in chapter 3.
Conclusions concerning the training set
Considering the sub-question, the following can be concluded based on the results from the training set
experiments.
• J48 Tree yields the highest accuracy when trained on 50 bins, with confidence factor 0.3 and a minimum of 5
instances per leaf
• REPTree yields the highest accuracy when trained on 20 bins, with a maximum tree depth of 4 and a
minimum of 7 instances per leaf
• NBTree yields the highest accuracy when trained on 50 bins
With regard to the main research question, the results show that the J48 Tree yields the highest training
set accuracy (62.26%) on the Forest Cover Type classification problem. The trees that follow respectively in
accuracy ranking are NBtree (61.27%) and REPtree (60.60%). The settings for the trees and the training
set accuracy ranking also apply when considering the lowest estimated future error (true error). This means
that J48 Tree is expected to have the least errors on future unknown data.
Even though the differences in accuracy between the three best performing decision trees are not
large, possible reasons for this accuracy ranking, with respect to the properties of the trees, are as follows. J48 is the
most complex of the explored decision trees. By complex we mean that it can fit the training data very
precisely, which is beneficial for the training accuracy but might be harmful when predicting new data,
due to overfitting. NBtree is the simplest algorithm and generalizes more, because the Naive
Bayes rule is used when constructing the tree. The Naive Bayes rule assumes that features are independent,
which can cause more generalization of the data. This might be the reason for its lower training set accuracy.
Consider Figure 3.1, where the accuracy of each optimized tree is plotted against the different (un)binned
training sets. For J48 Tree and NBtree there seems to be a correlation between accuracy and the number of bins:
one could state that the more bins are used, the higher the accuracy. Even though it may appear so, it
is not concluded that this correlation actually exists; too few tests were conducted for that, the pattern does
not occur with the REPtree, and a correlation between two variables would need to be investigated further.
However, based on the results it is assumed that using between 20 and 50 bins yields higher accuracy for the
three decision trees.
Conclusions concerning the test set
The test set accuracy results from Kaggle show something entirely different. REPtree yields the highest test
set accuracy (47.07%) on the Forest Cover Type classification problem. The trees that follow in the
accuracy ranking are J48 Tree (43.67%) and NBTree (43.35%). The fact that REPTree outperforms
J48 Tree on the test set may indicate that, as mentioned before, J48 Tree was not generalizing enough.
Overall conclusion
The accuracy results of the trees on the training set experiments differ considerably from the accuracy
results on the test set. Firstly, the test set accuracy for each tree is roughly 20% lower than the
training set accuracy. The reason for this can be that the decision trees are overfitting on the training set
and are incapable of generalizing to the new test set. Moreover, on the training set the J48 Tree performs best,
while on the test set REPTree performs best. However, the results from the test set describe
the performance of the decision trees most accurately. This is because the test set has a larger number of
instances. Moreover, the test set performance is a better estimator of how the classifier will predict new
data. Therefore, the final conclusion is that the Reduced Error Pruning Tree yields highest accuracy on the
Forest Cover Type classification problem.
Chapter 5
Discussion
When testing each decision tree and while drawing conclusions based on the results, several assumptions and
simplifications were made.
First of all, in the conducted experiments only four variants of the (un)binned dataset were used in
pursuit of finding the optimal partitioning of the training set into bins. If more bin settings had been tested,
another training set with a certain number of bins might have been found that yields a higher accuracy.
Furthermore, the process of parameter optimization can be extended. In this research only two param-
eters were tuned while assuming other parameters with default values were optimal. A combination of all
parameter values being optimized might result in a decision tree with a much higher accuracy. In addition,
the step sizes of the values each parameter is tested on (e.g. from 0.1 to 0.9 with step size 0.1) could be
decreased. Using that approach, more values for that parameter are tested and an even more precise and
optimal value for that parameter may be found.
Finally, it was assumed that the optimal settings for each decision tree would result likewise in the highest
accuracy on the test set submitted to Kaggle. Hence only the test set predictions from trees with optimal
training settings were submitted. From the Kaggle results, it could be concluded that REPTree yields the
highest accuracy for the Forest Cover Type classification problem. However, another (version of a) tree
might have performed better on the Kaggle test set, even though it performed worse on the training set.
Therefore, the conclusion of this research would have been more reliable if test set results had been recorded
for each default and optimized decision tree trained on each (un)binned variant.
Bibliography
[1] Kaggle, http://www.kaggle.com/c/forest-cover-type-prediction
[2] Quinlan, J.R. ‘Simplifying decision trees’ International Journal of Man-Machine Studies. Volume 27,
Issue 3, September 1987, P 221-234.
[3] Zhao, Y., Zang, Y. ‘Comparison of decision tree methods for finding active objects’. Advances in Space
Research. Volume 41, Issue 12, 2008, P 1955-1959.
[4] Langley, P. Selection of relevant features in machine learning. 1994.
[5] Lecture Evert Haasdijk about Cross Validation. February 2015
[6] Lecture Evert Haasdijk about Evaluating Hypotheses. March 2015
[7] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo,
CA
[8] Kohavi, R. (1996). Scaling Up the Accuracy of Naive Bayes Classifier: a Decision-Tree Hybrid (KDD-96
Proceedings)
[9] Nor Haizan, W., Mohamed, W., Salleh, M. N. M., & Omar, A. H. (2012). A Comparative Study of
Reduced Error Pruning Method in Decision Tree Algorithms (IEEE International Conference on Control
System, Computing and Engineering)
[10] Weka Documentation, http://www.weka.wikispaces.com
[11] Lewicki, M. S. (2007, April 12). Artificial Intelligence: Decision Trees 2
[12] Kumar, T. S. (2004, April 18). Introduction to Data Mining
[13] Kaggle submissions - http://www.kaggle.com/users/301943/kayleigh-beard
19

Más contenido relacionado

La actualidad más candente

LE03.doc
LE03.docLE03.doc
LE03.doc
butest
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
Sanghun Kim
 

La actualidad más candente (20)

LE03.doc
LE03.docLE03.doc
LE03.doc
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Density based spatial clustering of applications with noises for dna methylat...
Density based spatial clustering of applications with noises for dna methylat...Density based spatial clustering of applications with noises for dna methylat...
Density based spatial clustering of applications with noises for dna methylat...
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2
 
An improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracyAn improvement in k mean clustering algorithm using better time and accuracy
An improvement in k mean clustering algorithm using better time and accuracy
 
The comparison study of kernel KC-means and support vector machines for class...
The comparison study of kernel KC-means and support vector machines for class...The comparison study of kernel KC-means and support vector machines for class...
The comparison study of kernel KC-means and support vector machines for class...
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using Clustering
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
Regularized Weighted Ensemble of Deep Classifiers
Regularized Weighted Ensemble of Deep Classifiers Regularized Weighted Ensemble of Deep Classifiers
Regularized Weighted Ensemble of Deep Classifiers
 
Clustering: A Survey
Clustering: A SurveyClustering: A Survey
Clustering: A Survey
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning Method
 
Clustering
ClusteringClustering
Clustering
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
IRJET- Plant Disease Detection and Classification using Image Processing a...
IRJET- 	  Plant Disease Detection and Classification using Image Processing a...IRJET- 	  Plant Disease Detection and Classification using Image Processing a...
IRJET- Plant Disease Detection and Classification using Image Processing a...
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
61_Empirical
61_Empirical61_Empirical
61_Empirical
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...
A STUDY OF DECISION TREE ENSEMBLES AND FEATURE SELECTION FOR STEEL PLATES FAU...
 

Destacado

Destacado (8)

Forest Cover Type Prediction
Forest Cover Type PredictionForest Cover Type Prediction
Forest Cover Type Prediction
 
Feature Engineering on Forest Cover Type Data with Decision Trees
Feature Engineering on Forest Cover Type Data with Decision TreesFeature Engineering on Forest Cover Type Data with Decision Trees
Feature Engineering on Forest Cover Type Data with Decision Trees
 
Limpiador XXL
Limpiador XXLLimpiador XXL
Limpiador XXL
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
Using Deep Learning to Find Similar Dresses
Using Deep Learning to Find Similar DressesUsing Deep Learning to Find Similar Dresses
Using Deep Learning to Find Similar Dresses
 
Tips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitionsTips and tricks to win kaggle data science competitions
Tips and tricks to win kaggle data science competitions
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 

Similar a forest-cover-type

Practical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationPractical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and Presentation
HariniMS1
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
butest
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
kevinlan
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_Markdown
Adrian Cuyugan
 
Observations
ObservationsObservations
Observations
butest
 

Similar a forest-cover-type (20)

Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
Practical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and PresentationPractical Data Science: Data Modelling and Presentation
Practical Data Science: Data Modelling and Presentation
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Data Mining in Market Research
Data Mining in Market ResearchData Mining in Market Research
Data Mining in Market Research
 
Data Mining In Market Research
Data Mining In Market ResearchData Mining In Market Research
Data Mining In Market Research
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_Markdown
 
M3R.FINAL
M3R.FINALM3R.FINAL
M3R.FINAL
 
Hx3115011506
Hx3115011506Hx3115011506
Hx3115011506
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
07 learning
07 learning07 learning
07 learning
 
Machine learning session6(decision trees random forrest)
Machine learning   session6(decision trees random forrest)Machine learning   session6(decision trees random forrest)
Machine learning session6(decision trees random forrest)
 
Supervised learning (2)
Supervised learning (2)Supervised learning (2)
Supervised learning (2)
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
 
G046024851
G046024851G046024851
G046024851
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Observations
ObservationsObservations
Observations
 

forest-cover-type

  • 1. Forest Cover Type Classification: A Decision Tree Approach Laila van Ments, Kayleigh Beard, Anita Tran and Nathalie Post March 31, 2015
  • 2. Abstract This paper describes our approach to a Kaggle classification task in which Forest Cover Types need to be predicted. In order to solve this classification problem, three machine learning decision tree algorithms were explored: Naive Bayes Tree (NBTree), Reduced Error Pruning Tree (REPTree) and J48 Tree. The main focus of this research lies on two aspects. The first aspect concerns which of these trees yields the highest accuracy for this classification task. The second aspect is finding which parameter settings and which category- or bin size of the partitioned training set yield the highest accuracy for each decision tree. Briefly, the experiments involve binning the training set, performing Cross Validated Parameter Selection in Weka, calculating the true error and submitting the best performing trained decision trees to Kaggle in order to determine the test set error. The results of this research show that bin sizes between 20 and 50 yield higher accurateness for each decision tree. Moreover, the optimal parameters settings that yielded the highest accuracy on the training set, were not necessarily the optimal parameter settings for the the test set.
  • 3. Contents Page Introduction 2 1 Data inspection and Preparation 3 1.1 The available data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Becoming familiar with the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Determining missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 Data transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Analyzing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.6 Data visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.7 Evaluating the Information Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.8 Outlier and measurement error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.9 Discretize values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.10 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Experimental Setup 8 2.1 General setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Detailed experimental setup per tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 J48 decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Naive Bayes decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3 Reduced Error Pruning decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3 Results 13 3.1 Training Set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 Test Set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 4 Conclusion 16 5 Discussion 18 Bibliography 18 1
  • 4. Introduction This paper describes our approach to the Kaggle Competition ‘Forest Cover Type Prediction’. This compe- tition is a classification problem. The goal of this classification problem is to predict the correct forest cover type, based on 54 cartographic features. Cartographic features are the features that describe the landscape, such as elevation and soil type. The data was gathered in four wilderness areas in the Roosevelt National Forest (Northern Colorado). [1] With many large wilderness areas, it can be difficult and costly to determine the cover type of each area manually. However, the information about the cover type is very useful for experts or ecosystem caretakers who need to maintain the ecosystems of these areas. Thus, if a machine learning algorithm is capable of accurately predicting the cover type of a certain area with given cartographic features, it will be easier for experts to make decisions when preserving ecological environments. In order to solve this classification problem, three machine learning decision tree algorithms were ex- plored. Decision trees are often used in classification problems. The basic idea of a decision tree algorithm is as follows. Based on the data of a training set, a decision tree is constructed that maps the observations about features onto the class value. While constructing the tree, for each node, a decision needs to be made on which feature to split on next, i.e. which feature will be placed on the current node. The chosen feature is the feature that has the highest information gain at that moment. The information gain indicates how well a certain feature will predict the target class value. [2] After the tree is constructed based on the training data, the class of every new and unknown data point can be predicted with the tree. The three decision trees that will be evaluated for this classification problem are Naive Bayes Tree (NBtree), Reduced Error Pruning Tree (REPtree) and J48 Tree. The reason for testing these decision trees is because they are all similar to or based on the C4.5 statistical classifier, which is an extension of the ID3 decision tree algorithm. [3] We are interested in the performance of these C4.5-like algorithms and will be comparing these trees with each other. The three decision tree algorithms were explored and tuned in several experiments in order to find the parameters that yield the highest accuracy (highest percentage of correctly classified cover types) on both the training set and the test set. We are particularly interested in assigning the real-valued features of the dataset into categories or bins, by testing several bin-sizes in order to discover which size yields the highest accuracy. Moreover, several parameter settings are tested for each decision tree algorithm. Therefore, the research question that will be answered in this report is: Which of the NBTree, REP Tree and J48 Tree yields the highest accuracy on the Forest Cover Type classification problem? With the following sub-question: For each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy? The first step in answering these questions consists of gaining a deeper understanding of the data and pre-processing it where necessary, as described in Chapter 1. In the second chapter, the experimental setup for each decision tree is discussed. The third chapter contains the accuracy results of the experiments on the training set as well as the test set. 
Finally, the conclusion and discussion are reported. 2
  • 5. Chapter 1 Data inspection and Preparation 1.1 The available data The available data for this classification problem was provided by Kaggle, and consists of a training set and a test set. For the initial phases of the process, only the training set was utilized. The training set consisted of 15.120 instances with 54 attributes, all belonging to 1 of 7 forest cover type classes. 1.2 Becoming familiar with the data The first step of the data analysis was to become familiar with the data. For that reason, the first step of data inspection was simply analyzing the data with the bare eye in Excel. From this it was observed that all the provided data was numeric. Also, part of the data (the Wilderness Areas and the Soil Types) consisted of Boolean values. 1.3 Determining missing values Excel was used in order to determine whether there were missing values in the data, which was not the case; the data was complete. 1.4 Data transformations The Boolean features (Wilderness Areas and Soil Types) were further analyzed using a Python script. It turned out that every instance in the data set belonged to exactly one of the four Wilderness Areas and exactly one of the forty Soil Types. For that reason, another Python script was written in order to reshape the data. The four Wilderness Area Boolean features were merged into one feature: ‘Wilderness Areas Combined’. In this combined feature, a Class value is assigned to each of the Wilderness Areas (1,2,3 or 4). This was also done for the forty Boolean Soil Type Boolean features, resulting in one feature ‘Soil Type Combined’, with forty Classes (1, 2,...,40). 1.5 Analyzing the data Weka (a machine learning software tool) was used to gain an understanding of the values and the ranges of the data. In Table 1.1 an overview of all the features is displayed. 3
  • 6. Feature: Unit: Type: Range Median: Mode: Mean: Standard Deviation: Elevation Meters Numerical [1863 , 3849] 2752 2830 2749,32 417,68 Aspect Degrees Numerical [0 , 360] 126 45 156,68 110,09 Slope Degrees Numerical [0 , 52] 15 11 16,5 8,45 Horizontal Distance To Hydrology Meters Numerical [0 , 1343] 180 0 227,2 210,08 Vertical Distance To Hydrology Meters Numerical [-146 , 554] 32 0 51,08 61,24 Horizontal Distance To Roadways Meters Numerical [0 , 6890] 1316 150 1714,02 1325,07 Hillshade 9 am Illumination value Numerical [0 , 254] 220 226 30,56 Hillshade Noon Illumination value Numerical [99 , 254] 223 225 218,97 22,8 Hillshade 3pm Illumination value Numerical [0 , 248] 138 143 135,09 45,9 Horizontal Distance To Fire Points Meters Numerical [0 , 6993] 1256 618 1511,15 1099,94 Wilderness Combined Wilderness Class Categorical [0 , 4] - 3 - - Soil Type Combined Soil Type Class Categorical [0 , 40] - 10 - - Cover Type Cover Type Class Categorical [1 , 7] - 5 - - Table 1.1: Feature Overview The unit, type, range, median, mode, mean and standard deviation of each feature are good indicators of the meaning and distribution of the dataset. For example, it is useful to know that the hill shades have illumination values that range from 0 to 255, and that it would be wrong if this vector would contain values outside this range. Because some data types are categorical, the mean of these values were not calculated, because for those values this is not a meaningful feature. The same holds for the median and the standard deviation of these values. Furthermore, Weka was used to visualize the shape of the distribution for each of the features. 1.6 Data visualizations In order to gain insight on which variables are the most valuable for predicting the correct cover type, a number of density plots were made in R, which are displayed in the figures below. Figure 1.1: Density plot Elevation From the Elevation density plot in Figure 1.1 can be observed that a clear distinction can be made between cover types that belong to class 3, 4 or 6 (elevation <2700) and cover types belonging to class 1, 2, 5 or 7 (elevation >2700). 4
1.6 Data visualizations

In order to gain insight into which variables are the most valuable for predicting the correct cover type, a number of density plots were made in R; these are displayed in the figures below.

Figure 1.1: Density plot Elevation

From the Elevation density plot in Figure 1.1 it can be observed that a clear distinction can be made between cover types that belong to class 3, 4 or 6 (elevation < 2700) and cover types belonging to class 1, 2, 5 or 7 (elevation > 2700).

Figure 1.2: Density plot Aspect

The Aspect density plot, displayed in Figure 1.2, is less informative than the Elevation density plot. However, it shows that if the aspect is between 100 and 150, there is a greater likelihood that the instance belongs to class 4. Also, instances belonging to class 6 are more likely to have an aspect between 0-100 or 250-350 than between 100-250.

Figure 1.3: Density plot Horizontal Distance To Hydrology

In the Horizontal Distance To Hydrology density plot, displayed in Figure 1.3, most of the values are stacked on top of each other, which makes it difficult to use this feature to distinguish the classes. However, instances belonging to class 6 are the most likely to fall in the lowest range (between 0 and approximately 100).

The density plots of the other attributes were not as informative as the plots displayed above. In most of them the classes are stacked on top of each other, which makes it difficult to clearly distinguish between the classes.
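The plots above were produced in R. For illustration, a roughly equivalent sketch with pandas and matplotlib is given below for the Elevation feature; the file and column names are again assumptions based on the Kaggle data, and the KDE requires SciPy.

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train_combined.csv")  # assumed: reshaped training set from Section 1.4

# One density curve per cover type, analogous to the R density plots above
fig, ax = plt.subplots()
for cover_type, group in train.groupby("Cover_Type"):
    group["Elevation"].plot.kde(ax=ax, label=f"Class {cover_type}")

ax.set_xlabel("Elevation (m)")
ax.set_ylabel("Density")
ax.legend(title="Cover type")
plt.savefig("density_elevation.png", dpi=150)
```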
1.7 Evaluating the Information Gain

The plots displayed above suggest that Elevation is likely to be one of the best predictors of the Cover Type. To confirm this assumption, Weka's information gain attribute evaluator was used. This tool calculates the information gain of every feature in the dataset and ranks the features accordingly. The results of the ranking are displayed in Table 1.2 and confirm the earlier assumption that Elevation is a good predictor. Furthermore, Horizontal Distance To Roadways, Horizontal Distance To Fire Points and Wilderness Area4 all have a relatively high information gain compared to the rest of the features.

Information Gain | Feature
1.656762 | Elevation
1.08141 | Horizontal Distance To Roadways
0.859312 | Horizontal Distance To Fire Points
0.601112 | Wilderness Area4
0.270703 | Horizontal Distance To Hydrology
0.251287 | Wilderness Area1
0.234928 | Vertical Distance To Hydrology
0.22248 | Aspect
0.203344 | Hillshade 9am
0.191957 | Soil Type10

Table 1.2: Information Gain

1.8 Outlier and measurement error detection

With the information from the visualizations and the information gain ranking, the data was inspected for outliers and measurement errors. The range of the feature Vertical Distance To Hydrology was explored, because its values range from -146 to 554 and a distance vector containing values below zero seemed odd. However, it turns out that forest cover types with a Vertical Distance To Hydrology below zero are situated in water. These cover types should be distinguished from 'non hydrology' values, and therefore the values below zero are kept in the data set. The value ranges of the other features are assumed not to contain crucial measurement errors, since they match the possible ranges for these values in 'the real world'. No outlier detection technique was applied, however, so no outliers were identified or removed. The reason for this is that the only machine learning algorithms used in this research are decision trees, which can handle outliers well, for example through pruning.

1.9 Discretize values

Another data preparation step could be discretization of the feature values. Decision tree algorithms are essentially designed for data sets that contain nominal or categorical features. That does not mean they cannot handle real valued inputs, but to enable this the data has to be modified. An example of such a modification is 'binning': the real valued inputs of each feature are split into a number of categories based on numerical thresholds, so that each category contains a range of real valued numbers. The Forest Cover Type dataset contains real valued features that could be converted to categorical features ('bins') as a preprocessing step. For example, Hillshade 3pm values ranging from 0 to 50 belong to category 1, values ranging from 51 to 100 belong to category 2, et cetera.
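A minimal sketch of such an equal-width binning step is given below, using pandas; the bin count and names are only illustrative here, and the script actually used for the experiments is described in Chapter 2.

```python
import pandas as pd

train = pd.read_csv("train_combined.csv")  # assumed: reshaped training set from Section 1.4

def bin_feature(series, n_bins):
    # Split the observed range of a real-valued feature into n_bins equal-width
    # intervals and label them 1..n_bins.
    return pd.cut(series, bins=n_bins, labels=range(1, n_bins + 1))

# Example: Hillshade 3pm (range 0-248) in 5 bins of width ~50, as in the text above
train["Hillshade_3pm_binned"] = bin_feature(train["Hillshade_3pm"], 5)
print(train["Hillshade_3pm_binned"].value_counts().sort_index())
```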
Nonetheless, it is interesting to consider whether 'binning' the features of the Forest Cover Type dataset would result in a higher accuracy and, if so, which bin sizes perform best. The detailed process of finding the best (un)binned feature set is further explained in the experimental setup.

1.10 Feature selection

After cleaning and preprocessing the features, it was decided not to eliminate or merge any further features of the dataset. The first reason for this is that the number of features available for classification is not extremely large: a classifier that makes decisions based on 12 features is not necessarily overly complex, expensive or prone to overfitting. [4] Besides, a certain form of feature selection is already part of the decision tree algorithms whenever pruning takes place.
Chapter 2 Experimental Setup

In this chapter an overview is given of the setup of the research experiments and of the techniques and algorithms used.

2.1 General setup

As mentioned in the introduction of this report, the research question is: which of the Naive Bayes Tree, REP Tree and J48 Tree yields the highest accuracy on the Forest Cover Type classification problem? With the sub-question: for each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy?

Note that when accuracy is used as the performance measure, the performance on the training set, on estimated future data and on the test set should all be obtained and taken into account, because performance on the training set alone is not a good estimator of the performance of the classifier on future data.

To answer the research questions, four variants of the training set need to be made prior to the experiments:

• In addition to the original unbinned dataset, the set is divided into 5 bins, 20 bins and 50 bins. The bins for each feature depend on the range of that feature vector: the range is split into intervals that are fixed percentages of its size. To ensure efficiency, this is done with a Python script.

Furthermore, concerning the performance measures, the general setup for testing each decision tree is as follows:

1. For each training set variant (unbinned, 5 bins, 20 bins and 50 bins), Weka's Cross Validated Parameter Selection ('CVParameterSelection') is used to find the optimal values of two important parameters. The values of the other parameters are kept at the Weka defaults and are assumed to be optimal, in order to prevent extremely complex and computationally expensive experiments. Naturally, it is important to consider that this method may not result in an optimal classifier being found. Thanks to Weka's CVParameterSelection option, the optimal parameters for the decision trees do not have to be searched for manually, which would be a time-consuming task.

2. For each training set variant (unbinned, 5 bins, 20 bins and 50 bins), the training set performance is recorded, based on 10-fold cross validation in Weka with accuracy as the result. Each fold in cross validation involves partitioning the dataset into a part used for training and a held-out part used for validation. This validation method is chosen because it only 'wastes' 10 percent of the training data per fold and can still give a good estimate of the training set performance. [5] The performance of the algorithms is measured with accuracy, the percentage of correctly classified instances out of all instances.

3. The future data set performance is estimated for each training set variant (unbinned, 5 bins, 20 bins and 50 bins) and compared for the optimized parameter values as well as Weka's default values. This is done because the training set accuracy might be biased due to overfitting. To estimate the future data set performance, the true error interval with a confidence factor is calculated. The interval for the true error can be calculated as displayed in Figure 2.1 [6].
Figure 2.1: Formula for estimating True Error Interval

Here errorS(h) is the sample error, i.e. 1 minus the proportion of correctly classified training samples, n is the number of samples and ZN is the z-value corresponding to the chosen confidence factor. The confidence factor gives the certainty that the actual error lies within the true error interval. A confidence factor of 68% is chosen for the upcoming calculations; because decision trees are prone to either overfitting or underfitting, this confidence level allows for a large deviation from the sample error.

4. The optimal version of each trained decision tree is used to classify the instances of the separate test set provided by Kaggle. The results are submitted to Kaggle.com to determine the test set accuracy.

In the following section, the detailed parameter optimization process is outlined per decision tree.

2.2 Detailed experimental setup per tree

In this section the three decision trees and the parameters that were experimented with are discussed in detail.

2.2.1 J48 decision tree

J48 is Weka's implementation of the C4.5 algorithm (an extension of the ID3 algorithm) and can be used for classification. It generates a pruned or unpruned C4.5 decision tree from a set of labeled training data. It chooses the feature split that yields the highest information gain; after each split, the algorithm recurses and again determines for each feature which next split yields the highest information gain. The splitting procedure stops when all instances in a subset belong to the same class. Furthermore, J48 provides an option for pruning decision trees. [7]
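To make this splitting criterion concrete, a minimal sketch of the information-gain computation for a single categorical feature is given below; it is an illustrative reimplementation of the measure (the same one behind the ranking in Section 1.7), not Weka's own code.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    # Entropy of the labels minus the weighted entropy after splitting on the feature
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: a feature that separates the classes perfectly has maximal gain
print(information_gain([1, 1, 2, 2], ["a", "a", "b", "b"]))  # 1.0 bit
print(information_gain([1, 2, 1, 2], ["a", "a", "b", "b"]))  # 0.0 bits
```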
The parameters of the J48 decision tree that can be modified in Weka are displayed in Table 2.1.

confidenceFactor | The confidence factor used for pruning (smaller values incur more pruning).
minNumObj | The minimum number of instances per leaf.
numFolds | Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.
reducedErrorPruning | Whether reduced-error pruning is used instead of C4.5 pruning.
seed | The seed used for randomizing the data when reduced-error pruning is used.
subtreeRaising | Whether to consider the subtree raising operation when pruning.
unpruned | Whether pruning is performed.
useLaplace | Whether counts at leaves are smoothed based on Laplace.

Table 2.1: Parameters Adaptable for J48

The parameter settings displayed in Table 2.2 are the default settings for the J48 decision tree.

confidenceFactor | 0.25
minNumObj | 2
numFolds | 3
reducedErrorPruning | False
seed | 1
subtreeRaising | True
unpruned | False
useLaplace | False

Table 2.2: Default Parameter Settings for J48

Parameters for Optimization

The parameters chosen for optimization are the minimum number of instances per leaf and the confidence factor. The rationale behind this choice is as follows. The size of a decision tree is crucial for classification performance. When the tree is too large (i.e. has too many nodes), it may overfit the training data and therefore generalize poorly to new samples; on the other hand, if the tree is too small, it may not be able to capture important structural information about the sample space. [9] A common strategy to prevent overfitting is to grow the tree until each node contains a small number of instances and then use pruning to remove nodes that do not provide additional information; pruning reduces the size of the tree without reducing its predictive accuracy. The parameter minimum number of instances per leaf influences the width of the tree: high values result in smaller trees. The parameter confidence factor is used to discover at which level of pruning the prediction results improve most; the smaller its value, the more pruning occurs. Therefore, the combination of optimal values for the confidence factor and the minimum number of instances per leaf is searched for using 10-fold 'CVParameterSelection' in Weka. This is done for each variant of the training set: unbinned, 5 bins, 20 bins and 50 bins.

Test values for Parameters

Common values for both parameters were investigated [10] in order to determine which values 'CVParameterSelection' in Weka should test for each (un)binned training set. The values tested for the confidence factor ranged from 0.1 to 0.9, with a step size of 0.1. The values tested for the minimum number of instances per leaf ranged from 2 to 10, with a step size of 1.

2.2.2 Naive Bayes decision tree

Naive Bayes Tree ('NBTree') is a hybrid of decision-tree classifiers and Naive Bayes classifiers. The algorithm generates a decision tree that splits in the same way as regular decision trees, but with Naive Bayes classifiers at the leaves. NBTree frequently achieves higher accuracy than either a naive Bayesian classifier or a decision tree learner. This approach is suitable in scenarios where many attributes are likely to be relevant for a classification task, yet the attributes are not necessarily conditionally independent given the label. [8] NBTree can only process nominal and binary attributes, so the unbinned data set could not be experimented with. The algorithm has no adjustable parameters in Weka and is therefore only tested with different bin sizes.
2.2.3 Reduced Error Pruning decision tree

Reduced Error Pruning Tree ('REPTree') is a fast decision tree learner that builds a decision tree based on information gain or variance reduction and prunes it using reduced-error pruning with backfitting. Starting at the leaves, each node is replaced by its majority class; if the prediction accuracy is not affected, the change is kept. This results in the smallest version of the most accurate subtree with respect to the pruning set. [9] The parameters of the REPTree that can be modified in Weka are displayed in Table 2.3.

debug | If set to true, the classifier may output additional info to the console.
maxDepth | The maximum tree depth (-1 for no restriction).
minNum | The minimum number of instances per leaf.
minVarianceProp | The minimum proportion of the variance on all the data that needs to be present at a node in order for splitting to be performed in regression trees.
noPruning | Whether pruning is performed.
numFolds | Determines the amount of data used for pruning. One fold is used for pruning, the rest for growing the tree.
seed | The seed used for randomizing the data.

Table 2.3: Parameters Adaptable for REPTree

The parameter settings displayed in Table 2.4 are the default settings for the REPTree.

debug | False
maxDepth | -1
minNum | 2.0
minVarianceProp | 0.001
noPruning | False
numFolds | 3
seed | 1

Table 2.4: Default Parameter Settings for REPTree

Parameters for Optimization

The parameters maximum depth of tree and minimum number of instances per leaf are optimized, because these are important parameters of the REPTree algorithm. The default setting for the maximum depth of the tree is -1, which means that depth is not restricted. This parameter influences the amount of pruning that takes place, and pruning is an important factor in preventing both overfitting and underfitting. The minimum number of instances per leaf influences the width of the tree: the higher this number, the smaller the tree. Therefore, the combination of optimal values for the maximum depth of the tree and the minimum number of instances per leaf is searched for using 10-fold 'CVParameterSelection' in Weka. This is done for each variant of the training set: unbinned, 5 bins, 20 bins and 50 bins.

Test values for Parameters

Common values for both parameters were investigated [10] in order to determine which values 'CVParameterSelection' in Weka should test for each (un)binned training set. The values tested for the maximum depth of the tree ranged from 2 to 12, with a step size of 1.
The values tested for the minimum number of instances per leaf ranged from 2 to 10, with a step size of 1.
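Weka's CVParameterSelection performs this search internally. As an illustration of the same idea outside Weka, the sketch below runs a 10-fold grid search with scikit-learn over comparable parameter ranges; DecisionTreeClassifier is used as a stand-in and is not REPTree itself, and the file and column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train_combined.csv")  # assumed: reshaped (and possibly binned) training set
X = train.drop(columns=["Id", "Cover_Type"], errors="ignore")
y = train["Cover_Type"]

# max_depth / min_samples_leaf play roughly the same role as REPTree's maxDepth / minNum
param_grid = {
    "max_depth": list(range(2, 13)),         # 2..12, step size 1
    "min_samples_leaf": list(range(2, 11)),  # 2..10, step size 1
}
search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                      param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```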
Chapter 3 Results

In this chapter, the results of the performed experiments are given. As described in the previous chapter (the experimental setup), the results for each tree (J48 Tree, NBTree, REPTree) consist of:

Training Set Results:
• Default and optimal parameter settings found with Cross Validated Parameter Selection
• Training set accuracy based on 10-fold Cross Validation
• An estimation of the true error on a future dataset with 68% confidence

Test Set Results:
• Test set error, based on the Kaggle submission

3.1 Training Set Results

The default and optimal parameter settings, training set accuracy and true error estimate for each decision tree are summarized in Table 3.1 (J48 Tree), Table 3.2 (NBTree) and Table 3.3 (REPTree).

Variant | Settings | Confidence Factor | Min. Instances per Leaf | Training Set Accuracy | True Error (68%) Estimate
Not binned | Default | 0.25 | 2 | 59.33% | 0.4067 (± 0.004)
Not binned | Optimized* | 0.4 | 8 | 59.84% | 0.4016 (± 0.004)
5 bins | Default | 0.25 | 2 | 60.22% | 0.3978 (± 0.004)
5 bins | Optimized | 0.2 | 5 | 61.17% | 0.3883 (± 0.004)
20 bins | Default | 0.25 | 2 | 61.75% | 0.3825 (± 0.004)
20 bins | Optimized | 0.3 | 7 | 62.07% | 0.3793 (± 0.0039)
50 bins | Default | 0.25 | 2 | 62.22% | 0.3778 (± 0.0039)
50 bins | Optimized | 0.3 | 5 | 62.26% | 0.3774 (± 0.0039)

Table 3.1: J48 Tree Test Results

Variant | Training Set Accuracy | True Error (68%) Estimate
Not binned* | - | -
5 bins | 59.50% | 0.405 (± 0.004)
10 bins | 60.69% | 0.3931 (± 0.004)
20 bins | 60.81% | 0.3919 (± 0.004)
50 bins | 61.27% | 0.3873 (± 0.004)

*Naive Bayes Tree cannot handle numeric inputs.

Table 3.2: NBTree Test Results
Variant | Settings | Max. Depth of Tree | Min. Instances per Leaf | Training Set Accuracy | True Error (68%) Estimate
Not binned | Default | -1 | 2 | 53.35% | 0.4665 (± 0.0041)
Not binned | Optimized* | 2 | 2 | 53.35% | 0.4665 (± 0.0041)
5 bins | Default | -1 | 2 | 59.57% | 0.4043 (± 0.004)
5 bins | Optimized | 4 | 10 | 60.49% | 0.3951 (± 0.004)
20 bins | Default | -1 | 2 | 60.31% | 0.3969 (± 0.004)
20 bins | Optimized | 4 | 7 | 60.60% | 0.394 (± 0.004)
50 bins | Default | -1 | 2 | 59.67% | 0.4033 (± 0.004)
50 bins | Optimized | 3 | 10 | 59.80% | 0.402 (± 0.004)

Table 3.3: REPTree Test Results

Furthermore, in Figure 3.1 below, the accuracy of the decision trees is plotted against the bin sizes used in the training set, to provide a brief visual summary of the results.

Figure 3.1: Plot of accuracy against bin sizes for each tree

3.2 Test Set Results

The optimal parameter settings for each type of decision tree (J48 Tree, NBTree, REPTree) are used to classify the instances of the test set provided by Kaggle. As shown in Tables 3.1, 3.2 and 3.3 above, the following parameter and bin settings are optimal for the decision trees, i.e. result in the highest accuracy and the lowest true error, and are therefore used for testing and for the submissions to Kaggle.com:

1. J48 Tree with confidence factor 0.3 and a minimum of 5 instances per leaf, trained on 50 bins
2. REPTree with a maximum tree depth of 4 and a minimum of 7 instances per leaf, trained on 20 bins
3. NBTree trained on 50 bins

These optimized trees will be referred to as 'J48 Tree Optimized', 'REPTree Optimized' and 'NBTree Optimized' respectively. Table 3.4 displays the performance measures of the optimized trees that were submitted to Kaggle. [13]
Decision Tree | Training Set Accuracy | True Error (68%) Estimate | Test Set Accuracy (Kaggle)
J48 Tree Optimized | 62.26% | 0.3774 (± 0.0039) | 43.67%
NBTree Optimized | 61.27% | 0.3873 (± 0.004) | 43.35%
REPTree Optimized | 60.60% | 0.394 (± 0.004) | 47.07%

Table 3.4: Performance measures of the optimized trees.
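As a quick check on the reported intervals, the true-error bound of Figure 2.1 can be evaluated directly. The sketch below assumes the standard interval errorS(h) ± ZN * sqrt(errorS(h)(1 - errorS(h)) / n), with n = 15,120 training instances and ZN ≈ 1.0 for 68% confidence; under these assumptions it reproduces the estimates in Table 3.4.

```python
import math

def true_error_interval(accuracy, n=15120, z=1.0):
    # 68% confidence (z ~ 1.0) interval for the true error, given training set accuracy
    sample_error = 1.0 - accuracy
    margin = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
    return sample_error, margin

for name, acc in [("J48 Tree Optimized", 0.6226),
                  ("NBTree Optimized", 0.6127),
                  ("REPTree Optimized", 0.6060)]:
    err, margin = true_error_interval(acc)
    print(f"{name}: {err:.4f} (+- {margin:.4f})")
# J48: 0.3774 (+- 0.0039), NBTree: 0.3873 (+- 0.0040), REPTree: 0.3940 (+- 0.0040)
```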
Chapter 4 Conclusion

In this section the results of the performed experiments are interpreted with respect to the research question of this report: which of the NBTree, REP Tree and J48 Tree yields the highest accuracy on the Forest Cover Type classification problem? With the following sub-question: for each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy? Since the accuracy on the training set alone is not a good predictor of the performance of a classifier on future data, the estimated performance on future data and the test set performance were also taken into account.

In pursuit of answering these questions, experiments with the unbinned, 5-bin, 20-bin and 50-bin datasets were performed, and for each decision tree two important parameters were optimized. Accuracy results based on training and test set performance were recorded, as well as the true error; these can be found in Chapter 3.

Conclusions concerning the training set

Considering the sub-question, the following can be concluded from the training set experiments:

• J48 Tree yields the highest accuracy when trained on 50 bins, with confidence factor 0.3 and a minimum of 5 instances per leaf
• REPTree yields the highest accuracy when trained on 20 bins, with a maximum tree depth of 4 and a minimum of 7 instances per leaf
• NBTree yields the highest accuracy when trained on 50 bins

With regard to the main research question, the results show that the J48 Tree yields the highest training set accuracy (62.26%) on the Forest Cover Type classification problem, followed by NBTree (61.27%) and REPTree (60.60%). The same settings and the same ranking apply when considering the lowest estimated future error (true error), which means that the J48 Tree is expected to make the fewest errors on future, unknown data.

Although the differences in accuracy between the three decision trees are not large, the ranking can be related to the properties of the trees as follows. J48 is the most complex of the explored decision trees: it can fit the training data very precisely, which benefits the training accuracy but may be harmful, through overfitting, when predicting new data. NBTree is the simplest algorithm and generalizes more, because the Naive Bayes rule is used when constructing the tree; this rule assumes that features are independent, which causes stronger generalization of the data and might explain the lower training set accuracy.

Consider Figure 3.1, where the accuracy of each optimized tree is plotted against the different (un)binned training sets. For J48 Tree and NBTree it seems to show a correlation between accuracy and the number of bins.
Hence, one could state that the more bins are used, the higher the accuracy. Even though it may appear so, it cannot be concluded that this correlation actually exists: too few tests were conducted for that, and the pattern does not occur for REPTree. Moreover, a correlation between two variables would need to be investigated further. Based on the results, however, it is assumed that bin sizes between 20 and 50 yield a higher accuracy for the three decision trees.

Conclusions concerning the test set

The test set accuracy results from Kaggle show something entirely different. REPTree yields the highest test set accuracy (47.07%) on the Forest Cover Type classification problem, followed by J48 Tree (43.67%) and NBTree (43.35%). REPTree surpassing J48 Tree on the test set may be because, as mentioned before, the J48 Tree did not generalize enough.

Overall conclusion

The accuracy results of the trees on the training set experiments differ considerably from the accuracy results on the test set. Firstly, the test set accuracy of each tree is roughly 20 percentage points lower than its training set accuracy. The reason for this may be that the decision trees overfit the training set and are incapable of generalizing to the new test set. Moreover, on the training set the J48 Tree performs best, whereas on the test set REPTree performs best. However, the test set results describe the performance of the decision trees most accurately, because the test set contains a larger number of instances and the test set performance is a better estimator of how the classifier will predict new data. Therefore, the final conclusion is that the Reduced Error Pruning Tree yields the highest accuracy on the Forest Cover Type classification problem.
Chapter 5 Discussion

When testing each decision tree and drawing conclusions based on the results, several assumptions and simplifications were made. First of all, only four variants of the (un)binned dataset were used in the pursuit of finding the optimal partitioning of the training set into bins. If more bin sizes had been tested, another training set variant might have been found that yields a higher accuracy.

Furthermore, the process of parameter optimization could be extended. In this research only two parameters were tuned, while the other parameters were assumed to be optimal at their default values. Optimizing a combination of all parameter values might result in a decision tree with a much higher accuracy. In addition, the step sizes of the values each parameter is tested on (e.g. from 0.1 to 0.9 with a step size of 0.1) could be decreased. With that approach, more values are tested for each parameter and an even more precise, optimal value may be found.

Finally, it was assumed that the optimal settings for each decision tree on the training set would likewise result in the highest accuracy on the test set submitted to Kaggle; hence only the test set predictions of the trees with optimal training settings were submitted. From the Kaggle results it was concluded that REPTree yields the highest accuracy on the Forest Cover Type classification problem. However, another (version of a) tree might have performed better on the Kaggle test set, even though it performed worse on the training set. The conclusion of this research would therefore have been more reliable if test set results had been recorded for every default and optimized decision tree trained on each (un)binned variant.
Bibliography

[1] Kaggle, http://www.kaggle.com/c/forest-cover-type-prediction
[2] Quinlan, J.R. (1987). 'Simplifying decision trees'. International Journal of Man-Machine Studies, 27(3), 221-234.
[3] Zhao, Y., & Zhang, Y. (2008). 'Comparison of decision tree methods for finding active objects'. Advances in Space Research, 41(12), 1955-1959.
[4] Langley, P. (1994). Selection of relevant features in machine learning.
[5] Lecture by Evert Haasdijk on Cross Validation, February 2015.
[6] Lecture by Evert Haasdijk on Evaluating Hypotheses, March 2015.
[7] Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[8] Kohavi, R. (1996). Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. KDD-96 Proceedings.
[9] Nor Haizan, W., Mohamed, W., Salleh, M. N. M., & Omar, A. H. (2012). A Comparative Study of Reduced Error Pruning Method in Decision Tree Algorithms. IEEE International Conference on Control System, Computing and Engineering.
[10] Weka Documentation, http://www.weka.wikispaces.com
[11] Lewicki, M. S. (2007, April 12). Artificial Intelligence: Decision Trees 2.
[12] Kumar, T. S. (2004, April 18). Introduction to Data Mining.
[13] Kaggle submissions, http://www.kaggle.com/users/301943/kayleigh-beard