An application of nonparametric predictive inference for multinomial data (NPI) to classification tasks is presented. The model is applied to an established procedure for building classification trees using imprecise probabilities and uncertainty measures, thus far used only with the imprecise Dirichlet model (IDM), which is defined via a parameter expressing prior knowledge. When the IDM is applied, the accuracy of this classification procedure depends significantly on the value of that parameter. A detailed study involving 40 data sets shows that the procedure using the NPI model (which has no parameter dependence) achieves a better trade-off between accuracy and tree size than the procedure using the IDM, whatever the choice of parameter. A bias-variance study of the errors shows that the procedure with the NPI model has a lower variance than the one with the IDM, implying a lower level of over-fitting.
Classification with decision trees from a nonparametric predictive inference perspective
1. ERCIM'11 - Handling imprecision in graphical models
Classification with decision trees from a nonparametric predictive inference perspective
Joaquín Abellán†, Rebecca M. Baker§, Frank P.A. Coolen§,
Richard J. Crossman¶, Andrés R. Masegosa†
† Department of Computer Science and Artificial Intelligence, University of
Granada, Granada, Spain
§ Department of Mathematical Sciences, Durham University, Durham, UK
¶ Warwick Medical School, University of Warwick, Coventry, UK
London, December 2011
2. Outline
1 Introduction
2 Imprecise Models for Multinomial data
3 Uncertainty Measures and Decision Trees
4 Experimental Evaluation
5 Conclusions & Future Work
4. Introduction
Imprecise Probabilities
Imprecise probabilities is an umbrella term that subsumes several theories for reasoning under uncertainty:
Belief Functions
Reachable Probability Intervals
Capacities of various orders
Upper and lower probabilities
Credal Sets
.....
5. Introduction
Uncertainty Measures
Extensions of classical information theory have been developed for these imprecise models.
Several extensions of the Shannon entropy measure have been proposed:
The maximum entropy measure, proposed by Abellán and Moral (2003) as a total uncertainty measure for general credal sets.
The use of this measure with the imprecise Dirichlet model (IDM) has led to several successful applications in data mining, especially decision trees for supervised classification.
6. Introduction
Imprecise Dirichlet Model (IDM)
The IDM was proposed by Walley in 1996 for statistical inference from multinomial data; it was developed to correct shortcomings of previous alternative objective models.
It satisfies a set of principles claimed by Walley to be desirable for inference, such as the Representation Invariance Principle (RIP).
It can be expressed via a set of reachable probability intervals and a belief function (Abellán, 2006).
10. Introduction
Non-parametric Predictive Inference
The Nonparametric Predictive Inference model for multinomial data (NPI-M) was presented by Coolen and Augustin (2005) as an alternative to the IDM.
It learns from data in the absence of prior knowledge and with only a few modelling assumptions:
A post-data exchangeability-like assumption together with a latent variable representation of the data.
[Figure: observed categories represented as segments on a probability wheel.]
It does not satisfy the Representation Invariance Principle (RIP).
This is not a shortcoming for Coolen and Augustin; a weaker principle is proposed instead.
12. Introduction
Classification with Decision Trees using the NPI-M
In this work, we employ the NPI-M to build decision trees using the maximum entropy (ME) measure as the split criterion.
We experimentally compare against decision trees built with the IDM for different values of s:
Classification accuracy is slightly improved.
The decision trees are notably smaller.
The method is parameter-free.
13. Imprecise Models for Multinomial Data
Part II
Imprecise Models for Multinomial Data
17. Imprecise Models for Multinomial Data
Imprecise Dirichlet Model
The imprecise Dirichlet model (IDM) was introduced by Walley for inference about the probability distribution of a categorical variable.
Let us assume that X is a variable taking values in a finite set X = {x1, . . . , xK} and that we have a sample of N independent and identically distributed outcomes of X.
The goal is to estimate the probabilities θx = p(x) with which X takes its values.
Common Bayesian procedure:
Assume a Dirichlet prior for the parameter vector (θx)x∈X:
$$f((\theta_x)_{x \in X}) = \frac{\Gamma(s)}{\prod_{x \in X} \Gamma(s \cdot t_x)} \prod_{x \in X} \theta_x^{\, s \cdot t_x - 1}$$
where s > 0 and t = (tx)x∈X is a vector of positive real numbers satisfying $\sum_{x \in X} t_x = 1$.
Take the posterior expectation of the parameters given the sample.
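A worked step that the slides leave implicit (standard Dirichlet conjugacy): the posterior expectation of each parameter is
$$E[\theta_x \mid D] = \frac{n(x) + s \cdot t_x}{N + s},$$
and letting $t_x$ range over $(0, 1)$ sweeps out the IDM probability interval given on the next slide.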
20. Imprecise Models for Multinomial Data
Imprecise Dirichlet Model
Posterior Probability
$$f((\theta_x)_{x \in X} \mid D) = \frac{\Gamma(N + s)}{\prod_{x \in X} \Gamma(n(x) + s \cdot t_x)} \prod_{x \in X} \theta_x^{\, n(x) + s \cdot t_x - 1}$$
where n(x) is the count frequency of x in the data.
The IDM depends only on the parameter s and considers all possible values of the vector t.
This defines a closed and convex set of prior distributions.
The resulting credal set for a particular variable X can be represented by a system of probability intervals:
$$p(x) \in \left[ \frac{n(x)}{N + s},\ \frac{n(x) + s}{N + s} \right]$$
The parameter s determines how quickly the lower and upper probabilities converge as more data become available.
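A minimal computational sketch of these intervals (illustrative Python, not code from the paper; `idm_intervals` is a hypothetical helper name):

```python
from collections import Counter

def idm_intervals(sample, categories, s=1.0):
    """IDM probability intervals: p(x) in [n(x)/(N+s), (n(x)+s)/(N+s)]."""
    counts, N = Counter(sample), len(sample)
    return {x: (counts[x] / (N + s), (counts[x] + s) / (N + s))
            for x in categories}

# Running example used later in the talk: 4 B's and 5 P's among {B, P, R, Y, G}.
sample = ["B"] * 4 + ["P"] * 5
print(idm_intervals(sample, ["B", "P", "R", "Y", "G"], s=1.0))
# With s = 1: B -> (0.4, 0.5), P -> (0.5, 0.6), R/Y/G -> (0.0, 0.1)
```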
23. Imprecise Models for Multinomial Data
The NPI-M model
The NPI model for multinomial data (NPI-M), developed by Coolen and Augustin, is based on a variation of Hill's assumption A(n), which relates to predictive inference involving real-valued data observations.
A post-data exchangeability-like assumption together with a latent variable representation of the data.
Suppose there are 5 different categories {B, P, R, Y, G} and the observations are 4 outcomes of B and 5 outcomes of P.
[Figure: the nine observations (4 B's, 5 P's) arranged as segments on a probability wheel.]
25. Imprecise Models for Multinomial Data
The NPI-M model
Theorem (Coolen and Augustin, 2009). The lower and upper probabilities for the event x are:
$$\underline{P}(x) = \max\left(0, \frac{n(x) - 1}{N}\right), \qquad \overline{P}(x) = \min\left(\frac{n(x) + 1}{N}, 1\right)$$
For the previous case {B, P, R, Y, G}, with 4 B's and 5 P's (N = 9):
$$\left[\frac{3}{9}, \frac{5}{9}\right];\ \left[\frac{4}{9}, \frac{6}{9}\right];\ \left[0, \frac{1}{9}\right];\ \left[0, \frac{1}{9}\right];\ \left[0, \frac{1}{9}\right]$$
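The analogous sketch for the NPI-M singleton intervals (again illustrative Python with a hypothetical helper name, not the authors' code):

```python
from collections import Counter

def npi_intervals(sample, categories):
    """NPI-M lower/upper singleton probabilities (Coolen and Augustin, 2009):
    max(0, (n(x) - 1)/N) and min((n(x) + 1)/N, 1)."""
    counts, N = Counter(sample), len(sample)
    return {x: (max(0.0, (counts[x] - 1) / N), min((counts[x] + 1) / N, 1.0))
            for x in categories}

sample = ["B"] * 4 + ["P"] * 5
print(npi_intervals(sample, ["B", "P", "R", "Y", "G"]))
# B -> (3/9, 5/9), P -> (4/9, 6/9), R/Y/G -> (0, 1/9), matching the slide.
```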
28. Imprecise Models for Multinomial Data
The A-NPI-M model
The set of lower and upper probabilities, L, associated with the singleton events x expresses a reachable set of probability intervals, i.e. a credal set:
$$\mathcal{L} = \left\{ \left[ \max\left(0, \frac{n(x) - 1}{N}\right),\ \min\left(\frac{n(x) + 1}{N}, 1\right) \right] : x \in \{x_1, \ldots, x_K\} \right\}$$
However, the set of probability distributions obtained from the NPI-M is not a credal set (see a counterexample in the paper):
p = (3/9, 4/9, 2/27, 2/27, 2/27) belongs to the credal set [3/9, 5/9]; [4/9, 6/9]; [0, 1/9]; [0, 1/9]; [0, 1/9], yet it is not possible to find a configuration of the probability wheel that corresponds to p.
The A-NPI-M is a simplified approximate model derived from the NPI-M by considering all probability distributions compatible with the set of lower and upper probabilities L obtained from the NPI-M.
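A small exact-arithmetic check of the counterexample (this only verifies membership in the interval credal set, i.e. in the A-NPI-M; wheel-representability, which p lacks, cannot be checked this way):

```python
from fractions import Fraction as F

# The counterexample distribution p and the NPI-M singleton intervals above
p = [F(3, 9), F(4, 9), F(2, 27), F(2, 27), F(2, 27)]
intervals = [(F(3, 9), F(5, 9)), (F(4, 9), F(6, 9))] + [(F(0), F(1, 9))] * 3

ok = sum(p) == 1 and all(lo <= pi <= hi for pi, (lo, hi) in zip(p, intervals))
print(ok)  # True: p lies in the A-NPI-M credal set even though no
           # probability-wheel configuration realises it (so p is not in NPI-M)
```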
29. Uncertainty Measures and Decision Trees
Part III
Uncertainty Measures and
Decision Trees
30. Uncertainty Measures and Decision Trees
Decision Trees
A decision tree, also called a classification tree, is a simple structure that can
be used as a classifier.
Each node represents an attribute variable
Each branch represents one of the states of this variable
Each tree leaf specifies an expected value of the class variable
Example of a decision tree for three attribute variables Ai (i = 1, 2, 3), each with two possible values (0, 1), and a class variable C with cases or states c1, c2, c3:
[Figure: example decision tree.]
33. Uncertainty Measures and Decision Trees
Building Decision Trees from data
Basic Rule
Associate to each node the most informative attribute variable about the class C that:
has not already been selected in the path from the root to this node;
provides more information than if it had not been included.
We are given a data set D.
Each node of the decision tree defines a set of probabilities P^σ for the class variable C, where σ is the configuration of the node, e.g. σ = (A1 = 1, A3 = 0) for the node labelled c3 in the example tree.
P^σ is obtained via the IDM, NPI-M or A-NPI-M from the data subset D[σ]. A recursive sketch of this rule follows.
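A minimal recursive sketch of the basic rule (a hypothetical structure under the stated assumptions, not the authors' implementation; `gain` is any split score, e.g. the imprecise information gain defined on the next slide):

```python
def build_tree(data, attributes, gain):
    """data: list of (attribute_values: dict, class_label) pairs.
    gain(data, attribute) returns the split score for that attribute."""
    scores = [(gain(data, a), a) for a in attributes]
    best_score, best_attr = max(scores) if scores else (0.0, None)
    if best_attr is None or best_score <= 0:        # no informative split: stop
        labels = [c for _, c in data]
        return max(set(labels), key=labels.count)   # leaf = majority class
    subtree = {}
    for value in {row[best_attr] for row, _ in data}:
        subset = [(r, c) for r, c in data if r[best_attr] == value]
        remaining = [a for a in attributes if a != best_attr]
        subtree[(best_attr, value)] = build_tree(subset, remaining, gain)
    return subtree
```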
36. Uncertainty Measures and Decision Trees
Imprecise Information Gain
$$\mathrm{ImpInfGain} = TU(\mathcal{P}^{\sigma}) - \sum_{a_i \in A_i} r^{\sigma}_{a_i}\, TU\!\left(\mathcal{P}^{\sigma \cup (A_i = a_i)}\right)$$
TU is a total uncertainty measure function, normally defined on credal sets.
$r^{\sigma}_{a_i}$ is the relative frequency with which Ai takes the value ai in D[σ].
σ ∪ (Ai = ai) is the result of adding the value Ai = ai to the configuration σ.
The imprecise information gain compares the total uncertainty for configuration σ with the expected total uncertainty over σ ∪ (Ai = ai).
We add a new node, i.e. we split, only if this difference is positive.
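The criterion in the same sketch style (here `total_uncertainty` stands in for TU, e.g. the maximum entropy S* introduced on the next slide, applied to the credal set built from the class counts):

```python
def imprecise_info_gain(data, attribute, total_uncertainty):
    """ImpInfGain = TU(D[sigma]) - sum_a r_a * TU(D[sigma, attribute = a]).
    data: list of (attribute_values: dict, class_label) pairs for D[sigma]."""
    n = len(data)
    root_tu = total_uncertainty([c for _, c in data])
    expected_tu = 0.0
    for a in {row[attribute] for row, _ in data}:
        subset = [c for row, c in data if row[attribute] == a]
        expected_tu += (len(subset) / n) * total_uncertainty(subset)
    return root_tu - expected_tu  # split on `attribute` only if this is positive
```

With `functools.partial(imprecise_info_gain, total_uncertainty=...)` this plugs directly into the `build_tree` sketch above.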
39. Uncertainty Measures and Decision Trees
Maximum Entropy
Maximum entropy is an aggregate uncertainty measure that aims to generalize the classic measures (the Hartley measure and Shannon entropy).
It satisfies all the required properties (additivity, subadditivity, monotonicity, proper range, etc.).
It is defined as a functional S*:
$$S^{*}(\mathcal{P}^{\sigma}) = \max_{p \in \mathcal{P}^{\sigma}} \left\{ -\sum_{x \in X} p(x) \log_2 p(x) \right\}$$
Efficient algorithms for different imprecise models have been proposed: credal sets (Abellán and Moral, 2003); the IDM (Abellán and Moral, 2006); the NPI-M and A-NPI-M (Abellán et al., 2010).
This method for building decision trees for classification can be extended to other imprecise models in the same way.
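For interval-based credal sets such as those of the IDM and A-NPI-M, S* can also be obtained by generic convex optimisation; a sketch assuming NumPy/SciPy (the references above give dedicated, more efficient algorithms):

```python
import numpy as np
from scipy.optimize import minimize

def max_entropy_bits(intervals):
    """Maximise Shannon entropy over {p : lo_i <= p_i <= up_i, sum(p) = 1}."""
    k = len(intervals)
    neg_entropy = lambda p: float(np.sum(p * np.log2(np.clip(p, 1e-12, 1.0))))
    simplex = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
    result = minimize(neg_entropy, np.full(k, 1.0 / k),
                      bounds=intervals, constraints=[simplex])
    return -result.fun

# A-NPI-M credal set of the running example (4 B's, 5 P's, N = 9); the optimum
# should be attained near p = (3/9, 4/9, 2/27, 2/27, 2/27), the distribution
# from the counterexample slide.
print(max_entropy_bits([(3/9, 5/9), (4/9, 6/9), (0, 1/9), (0, 1/9), (0, 1/9)]))
```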
42. Experimental Evaluation
Experimental Set-up
The aim is to compare the use of the IDM with the use of the NPI-M and the A-NPI-M for building decision trees.
Decision trees for supervised classification using the imprecise models IDM, NPI-M and A-NPI-M:
Maximum entropy as the base uncertainty measure.
The classic Shannon entropy measure is also employed, via information gain and information gain ratio.
Experiments on 40 UCI data sets using a 10-fold cross-validation scheme.
Statistical tests for the comparison:
Corrected paired t-test at the 5% significance level.
Friedman test at the 5% significance level.
43. Experimental Evaluation
Evaluating Classification Accuracy
IDM NPI-M A-NPI-M IG IGR
Av. Accuracy 76.73 76.77 76.77 74.33 75.66
Friedman Rank 2.95 2.94 2.76 3.06 3.29
The Friedman test accepts the hypothesis that all algorithms perform equally well.
45. Experimental Evaluation
Evaluating the size of the decision trees
The size of a decision tree is relevant:
A decision tree is composed of a set of classification rules.
Smaller trees imply more representative classification rules (lower risk of over-fitted predictions).
IDM NPI-M A-NPI-M IG IGR
Av. N. Nodes 610.16 589.52 589.81 1119.46 1288.56
Friedman Rank 2.96 1.4 1.8 4.18 4.67
The imprecise models provide much more compact representations than the precise methods.
NPI-M and A-NPI-M build smaller decision trees than the IDM with s = 1 (the default value).
47. Experimental Evaluation
Evaluating IDM with different s values
The IDM depends on the parameter s (which affects the width of the intervals):
s = 0 reduces to the information gain criterion.
Higher s values imply wider intervals and hence smaller trees.
NPI-M IDM-S1 IDM-S1.5 IDM-S2.0
Av. N. Nodes 589.5 610.2 480.3 383.8
Friedman Rank 2.4 3.85 2.43 1.28
The IDM with s = 1.5 obtains decision trees of a size similar to NPI-M.
The IDM with s = 2.0 obtains smaller decision trees than NPI-M.
48. Experimental Evaluation
Evaluating IDM with different s values
NPI-M IDM-S1 IDM-S1.5 IDM-S2.0
Av. Accuracy 76.77 76.73 76.23 75.75
Friedman Rank 2.25 2.26 2.46 3.03
The IDM with s > 1 performs worse than the NPI-M and the IDM with s = 1.
The NPI-M thus offers the best trade-off: the best classification accuracy with notably small decision trees.
49. Conclusions and Future Work
Part V
Conclusions and Future Work
51. Conclusions and Future Work
Conclusions and Future Work
We have presented an application of the NPI model for multinomial data to classification.
A known scheme for building decision trees is used with two NPI-based models: an exact model (NPI-M) and an approximate model (A-NPI-M).
Models based on the NPI model achieve accuracy similar to that of the best IDM-based model (over all choices of its parameter s), but with smaller trees.
Future work:
Extend these methods to credal classification.
52. Conclusions and Future Work
Thank you for your attention!
Questions?