C4.5 algorithm and Multivariate Decision Trees
Thales Sehn Korting
Image Processing Division, National Institute for Space Research – INPE
São José dos Campos – SP, Brazil
tkorting@dpi.inpe.br
Abstract

The aim of this article is to give a brief description of the C4.5 algorithm, used to create Univariate Decision Trees. We also discuss Multivariate Decision Trees and their process of classifying instances using more than one attribute per node in the tree. We try to discuss how they work and how to implement the algorithms that build such trees, including examples of Univariate and Multivariate results.
1. Introduction
Describing the Pattern Recognition process, the goal is to learn (or to "teach" a machine) how to classify objects, through the analysis of a set of instances whose classes¹ are known [5].

Since we know the classes of a set of instances (the training set), we can use several algorithms to discover how the attribute vectors of the instances behave, and so estimate the classes of new instances. One way to do this is through Decision Trees (DT's).

Figure 1. Simple example of a classification process.

A tree is either a leaf node labeled with a class, or a structure containing a test, linked to two or more nodes (or subtrees) [5]. So, to classify some instance, first we get its attribute vector and apply it to the tree. The tests are performed on these attributes, reaching one or another leaf and completing the classification process, as in Figure 1.

If our instances have n attributes, we have an n-dimensional space for the classes, and the DT creates hyperplanes (or partitions) to divide this space among the classes. A 2D space is shown in Figure 2, where the lines are the hyperplanes in this dimension.

DT's can deal with one attribute per test node or with more than one. The former approach is called the Univariate DT, and the second is the Multivariate method. This article explains the construction of Univariate DT's and the C4.5 algorithm, used to build such trees (Section 2). After this, we discuss the Multivariate approach and how to construct such trees (Section 3). At the end of each approach (Uni and Multivariate), we show some results for different test cases.

¹ Mutually exclusive labels, such as "buildings", "deforestment", etc.

2. C4.5 Algorithm

This section explains one of the algorithms used to create Univariate DT's. This one, called C4.5, is based on the ID3² algorithm, which tries to find small (or simple) DT's. We start by presenting some premises on which this algorithm is based, and then we discuss the inference of the weights and tests in the nodes of the trees.

Figure 2. Partitions created in a DT.

² ID3 stands for Iterative Dichotomiser 3

2.1. Construction

Some premises guide this algorithm, such as the following [4]:

• if all cases are of the same class, the tree is a leaf, and the leaf is returned labelled with this class;

• for each attribute, calculate the potential information provided by a test on the attribute (based on the probabilities of each case having a particular value for the attribute). Also calculate the gain in information that would result from a test on the attribute (based on the probabilities of each case with a particular value for the attribute being of a particular class);

• depending on the current selection criterion, find the best attribute to branch on.

2.2. Counting gain

This process uses "Entropy", i.e. a measure of the disorder of the data. The Entropy of y is calculated by

Entropy(y) = − sum_{j=1}^{n} (|y_j| / |y|) log(|y_j| / |y|)

iterating over all possible values of y. The conditional Entropy is

Entropy(j|y) = (|y_j| / |y|) log(|y_j| / |y|)

and finally, we define Gain by

Gain(y, j) = Entropy(y) − Entropy(j|y)

The aim is to maximize the Gain, dividing the overall entropy according to the split of attribute y by value j.

2.3. Pruning

This is an important step for the result, because of outliers. All data sets contain a small subset of instances that are not well defined and differ from the other ones in their neighborhood.

After the complete creation of the tree, which must classify all the instances in the training set, it is pruned. This reduces classification errors caused by specialization on the training set; it is done to make the tree more general.

2.4. Results

To show concrete examples of the application of the C4.5 algorithm, we used the WEKA system [6]. We used a training set describing some aspects of working people, like vacation time, working hours, and health plan. The resulting classes describe the work conditions, i.e. good or bad. Figure 3 shows the resulting DT, using the C4.5 implementation from WEKA.

Another example deals with levels of contact lenses, according to some characteristics of the patients. Results are shown in Figure 4.

3. Multivariate DT's

In inductive learning, Multivariate DT's are able to generalize well when dealing with attribute correlation. Also, the results are easy for humans to interpret, i.e. we can understand the influence of each attribute on the whole process [2].

One problem when using simple (Univariate) DT's is that, along a path, they may test some attributes more than once. Sometimes this harms the performance of the system, because with a simple transformation of the data, such as principal components, we can reduce the correlation between the attributes and perform the same classification with a single test. The aim of Multivariate DT's is to perform different tests with the data, as in Figure 5.

The purpose of the Multivariate approach is to use more than one attribute in the test nodes. In the example of Figure 5, we can change the whole set of tests by a single one, as we show next.
Figure 3. Simple Univariate DT, created by the C4.5 algorithm. In blue are the tests; in green and red, the resulting classes.

Figure 4. Another Univariate DT, created by the C4.5 algorithm. In blue are the tests, and in red the resulting classes.
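As a concrete illustration of Section 2.2, here is a minimal Python sketch of the entropy and gain computations. The attribute values and class labels below are invented for illustration (they are not the WEKA data sets), and the conditional term is summed over all partitions of the split, the usual C4.5 form:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(y) = -sum_j (|y_j|/|y|) * log2(|y_j|/|y|)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(values, labels):
    """Gain of splitting `labels` by the discrete attribute `values`:
    Entropy(y) minus the entropy remaining after the split."""
    total = len(labels)
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    remainder = sum(len(p) / total * entropy(p)
                    for p in partitions.values())
    return entropy(labels) - remainder

# Toy example: does a health plan predict good work conditions?
health_plan = ["yes", "yes", "no", "no"]
conditions = ["good", "good", "bad", "good"]
print(round(gain(health_plan, conditions), 3))  # 0.311
```

C4.5 evaluates this gain for every candidate attribute and branches on the best one, as described in Section 2.1.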
The whole set of tests in the example of Figure 5 can be replaced by the simple test x + y ≥ 8. But how do we develop an algorithm able to "discover" such planes? This is the content of the following sections.

We can think of this approach as a linear combination of the attributes at each internal node. For example, take an instance with attributes y = y_1, y_2, ..., y_n belonging to class C_j. The tests at each node of the tree follow the form

sum_{i=1}^{n+1} w_i y_i > 0

where w_1, w_2, ..., w_{n+1} are real-valued coefficients [3]. Let's also consider that the attributes y_1, y_2, ..., y_n can be real; some approaches also deal with symbolic attributes, most of the time by mapping them onto a numeric scale.

Multivariate and Univariate DT's share some properties when modelling the tree, specially at the stage of pruning statistically invalid branches.

3.1. Tree Construction

The first step in this phase is to have a set of training instances. Each of them has a attributes and an associated class. This is the default procedure for all classification methods.

Through a top-down decision tree algorithm and a merit selection criterion, the process chooses the best test to split the data, creating a branch. At this first step we have two partitions, on which the algorithm performs the same top-down analysis to make more partitions, according to the criteria.

One of the stop criteria is when some partition presents just a single class, so this node becomes a leaf with an associated class.

But we want to know how the process splits the data, and here lies the difference between Multi and Univariate DT's. Considering a multiclass instance set, we can represent the multivariate tests with a Linear Machine (LM) [2].
Figure 5. Problem with the Univariate approach [2]: it performs several tests, while the blue line (Multivariate) is much more efficient.
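Before formalizing the Linear Machine, a minimal sketch of the multivariate node test sum_i w_i y_i > 0 described above; the weights and attribute values are illustrative, and the inequality is taken as strict:

```python
def multivariate_test(weights, attributes):
    """Evaluate one multivariate node test: sum_i w_i * y_i > 0.
    A constant 1 is appended to the attribute vector, so `weights`
    has n + 1 entries; the last one plays the role of a threshold."""
    y = list(attributes) + [1.0]
    return sum(w * yi for w, yi in zip(weights, y)) > 0

# The test x + y >= 8 from the text, rewritten as x + y - 8 > 0:
print(multivariate_test([1.0, 1.0, -8.0], [5, 4]))  # True: 5 + 4 - 8 > 0
```

A Univariate DT node is the special case where all but one of the w_i are zero.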
LM: Let y be an instance description consisting of 1 and the n features that describe the instance. Then each discriminant function g_i(y) has the form

g_i(y) = w_i^T y

where w_i is a vector of n + 1 coefficients. The LM infers that instance y belongs to class i iff

(for all j, j ≠ i) g_i(y) > g_j(y)

Some methods for training an LM have been proposed. We can start the weight vectors with a default value for all w_i, i = 1, ..., N. Here we show the absolute error correction rule and the thermal perceptron.

3.1.1. Absolute Error Correction rule: One approach for updating the weights of the discriminant functions is the absolute error correction rule, which adjusts w_i, where i is the class to which the instance belongs, and w_j, where j is the class to which the LM incorrectly assigns the instance. The correction is accomplished by

w_i ← w_i + cy

and

w_j ← w_j − cy

where

c = (w_j − w_i)^T y / (2 y^T y)

is the smallest integer such that the updated LM will classify the instance correctly.

3.1.2. Thermal Perceptron: For instances that are not linearly separable, one method is the "thermal perceptron" [1], which also adjusts w_i and w_j, and uses the constants

c = B / (B + k)

and

k = (w_j − w_i)^T y / (2 y^T y)

The process follows this algorithm:

1. B = 2;
2. If LM is correct for all instances
   Or B < 0.001, RETURN
3. Otherwise, for each misclassified instance
   3.1. Compute correction c
        update w[i] and w[j]
   3.2. Adjust B <- aB - b
        with a = 0.99 and b = 0.0005
4. Back to step 2

The basic idea of this algorithm is to correct the weight vectors until all instances are classified correctly or, in the worst case, until a certain number of iterations is reached. This limit is given by the update of B: with a = 0.99 the value decreases geometrically according to B = aB − b, and b = 0.0005 adds a small linear decrease of B.

3.2. Pruning

When pruning Multivariate DT's, one must consider that this can result in more classification errors than gains in generalization. Generally, just some features (or attributes) are removed from the multivariate tests, instead of pruning the whole node. [2] states that a multivariate test with n − 1 features is more general than one based on n features.
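The numbered loop of Section 3.1.2 can be sketched as follows. This is an illustrative implementation, not the authors' code, and the tiny data set at the end is invented:

```python
import numpy as np

def train_thermal_perceptron(X, classes, n_classes,
                             B=2.0, a=0.99, b=0.0005, B_min=0.001):
    """Train a Linear Machine with the thermal perceptron rule:
    sweep the data, and for each misclassified instance apply the
    correction c = B / (B + k), then cool B <- a*B - b."""
    Y = np.hstack([X, np.ones((len(X), 1))])   # append the constant 1
    W = np.zeros((n_classes, Y.shape[1]))      # one weight vector per class
    while B >= B_min:                          # step 2: stop when B < B_min
        errors = 0
        for y, i in zip(Y, classes):
            j = int(np.argmax(W @ y))          # class the LM assigns
            if j != i:                         # misclassified instance
                errors += 1
                k = (W[j] - W[i]) @ y / (2 * y @ y)
                c = B / (B + k)                # step 3.1: thermal correction
                W[i] += c * y                  # reinforce the true class
                W[j] -= c * y                  # penalize the assigned class
                B = a * B - b                  # step 3.2: decrease B
        if errors == 0:                        # step 2: LM correct for all
            break
    return W

# Tiny linearly separable example: one attribute, two classes.
X = np.array([[0.0], [1.0], [4.0], [5.0]])
classes = [0, 0, 1, 1]
W = train_thermal_perceptron(X, classes, n_classes=2)
```

Since j is the argmax class, k is non-negative, so c stays in (0, 1]: corrections shrink as B cools, which is what lets the rule settle even on data that is not linearly separable.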
3.3. Results

Figure 6 shows a good example, performing the classification with simple tests, even on a complicated data set.

4. Conclusion

In this article we discussed Decision Trees, in both the Univariate and the Multivariate approaches. The C4.5 algorithm implements one way to build Univariate DT's, and some results were shown. For the Multivariate approach, we first discussed the advantages of using it, and then showed how to build such trees with the Linear Machine approach, using the Absolute Error Correction and the Thermal Perceptron rules.

DT's are a powerful tool for classification, specially when the results need to be interpreted by humans. Multivariate DT's deal well with attribute correlation, presenting advantages in the tests over the Univariate approach.

References

[1] C. Brodley and P. Utgoff. Multivariate Versus Univariate Decision Trees. 1992.
[2] C. Brodley and P. Utgoff. Multivariate decision trees. Machine Learning, 19(1):45–77, 1995.
[3] S. Murthy, S. Kasif, and S. Salzberg. A System for Induction of Oblique Decision Trees. Arxiv preprint cs.AI/9408103, 1994.
[4] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.
[5] J. Quinlan. Learning decision tree classifiers. ACM Computing Surveys (CSUR), 28(1):71–72, 1996.
[6] Weka. WEKA (Data Mining Software). Available at http://www.cs.waikato.ac.nz/ml/weka/. 2006.

Figure 6. Multivariate DT, created by the OC1 algorithm (Oblique Classifier 1) [3].