2. Data Generation Process
The process that generated the data is not completely known and is therefore modelled as a random process.
The outcome of a random process is modelled as a random variable.
Given the available information or features, the value of a random variable cannot be predicted with certainty, so it is non-deterministic.
Probability theory deals with the study and analysis of such random processes.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 2
3. Probability and Inference
Result of tossing a coin: outcome ∈ {Heads, Tails}
Random variable X ∈ {1, 0}
po denotes the probability of heads, P(X = 1)
This implies P(X = 0) = 1 − po
X is Bernoulli distributed, with probability
P(X = x) = po^x (1 − po)^(1 − x)
Data sample: X = {x^t}, t = 1, ..., N
Estimation: po = #{Heads} / #{Tosses} = ∑t x^t / N
Prediction of next toss: infer Heads if po > ½, Tails otherwise
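The estimate and prediction rule above can be sketched in a few lines of Python; the sample data below is illustrative, not from the book:

```python
# Maximum-likelihood estimate of the head probability p_o from a
# sample of Bernoulli outcomes (1 = heads, 0 = tails).
def estimate_po(tosses):
    return sum(tosses) / len(tosses)

def predict_next(tosses):
    # Infer heads if the estimated p_o exceeds 1/2, tails otherwise.
    return "Heads" if estimate_po(tosses) > 0.5 else "Tails"

sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(estimate_po(sample))   # 0.7
print(predict_next(sample))  # Heads
```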
4. Classification
Prediction is based on the values of the observable input variables.
Credit scoring: inputs are income and savings; output is low-risk vs. high-risk.
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:
Choose C = 1 if P(C = 1 | x1, x2) > 0.5, C = 0 otherwise,
or equivalently, choose C = 1 if P(C = 1 | x1, x2) > P(C = 0 | x1, x2), C = 0 otherwise.
Probability of error = 1 − max{ P(C = 1 | x1, x2), P(C = 0 | x1, x2) }
5. Bayes’ Rule
For binary classification, Bayes' rule gives the posterior:

P(C|x) = p(x|C) P(C) / p(x)

posterior = likelihood of x in C × prior / evidence,
where the evidence p(x) is the probability of x irrespective of C:

p(x) = p(x|C=1) P(C=1) + p(x|C=0) P(C=0)

with P(C=0) + P(C=1) = 1, and consequently P(C=0|x) + P(C=1|x) = 1.
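A minimal sketch of the binary Bayes computation; the likelihood and prior values are illustrative assumptions, not from the book:

```python
# Posterior P(C=1 | x) from class likelihoods and priors via Bayes' rule,
# for the binary case.
def posterior_c1(lik1, lik0, prior1):
    prior0 = 1.0 - prior1
    evidence = lik1 * prior1 + lik0 * prior0  # p(x)
    return lik1 * prior1 / evidence

# Illustrative numbers: p(x|C=1)=0.6, p(x|C=0)=0.2, P(C=1)=0.3
p = posterior_c1(lik1=0.6, lik0=0.2, prior1=0.3)
print(p)  # 0.5625
```

Note that even with a small prior P(C=1) = 0.3, the larger likelihood pushes the posterior past 0.5.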
6. Bayes’ Rule: K>2 Classes
P(Ci|x) = p(x|Ci) P(Ci) / p(x) = p(x|Ci) P(Ci) / ∑k=1..K p(x|Ck) P(Ck)

with P(Ci) ≥ 0 and ∑i=1..K P(Ci) = 1.

Choose Ci if P(Ci|x) = maxk P(Ck|x)
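The K-class rule can be sketched as follows; the likelihoods and priors below are illustrative assumptions:

```python
# Choose the class with maximum posterior when K > 2.
# Posteriors are computed from likelihoods p(x|C_k) and priors P(C_k).
def classify(likelihoods, priors):
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)                        # p(x) = sum_k p(x|C_k)P(C_k)
    posteriors = [j / evidence for j in joint]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

i, post = classify(likelihoods=[0.5, 0.3, 0.1], priors=[0.2, 0.5, 0.3])
print(i)  # 1
```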
7. Decision Making considering
Losses and Risks
The losses incurred by false positives and false negatives may not be equal in domains like finance, health, and disaster management.
αi: the action of assigning an input to Ci.
λik: the loss of taking action αi when the true state is Ck.
The expected risk of taking action αi is
R(αi|x) = ∑k=1..K λik P(Ck|x)

Choose αi if R(αi|x) = mink R(αk|x)
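A sketch of minimum-risk decision making with an asymmetric loss matrix; the loss values and posteriors are illustrative assumptions in the credit-scoring spirit of the earlier slide:

```python
# Expected risk R(alpha_i | x) = sum_k lambda_ik P(C_k | x);
# choose the action with minimum risk.
def min_risk_action(loss, posteriors):
    risks = [sum(lam * p for lam, p in zip(row, posteriors)) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# lambda[i][k]: loss of action alpha_i when the true class is C_k.
loss = [[0.0, 10.0],   # grant credit to a high-risk applicant: very costly
        [1.0, 0.0]]    # refuse a low-risk applicant: mildly costly
action, risks = min_risk_action(loss, posteriors=[0.8, 0.2])
print(action)  # 1
```

Even though C0 is far more probable (0.8), the asymmetric loss makes the cautious action the minimum-risk choice.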
8. Losses and Risks: 0/1 Loss Case
λik = 0 if i = k, 1 if i ≠ k

so the expected risk reduces to

R(αi|x) = ∑k≠i λik P(Ck|x) = ∑k≠i P(Ck|x) = 1 − P(Ci|x)
Under 0/1 loss all errors are equally costly, so for minimum risk, choose the most probable class.
9. Losses and Risks: Reject as CK+1
(useful when misclassification is costlier than manual handling,
e.g., sorting mail with an optical digit recognizer)
λik = 0 if i = k, λ if i = K+1 (reject), 1 otherwise, where 0 < λ < 1

R(αK+1|x) = ∑k=1..K λ P(Ck|x) = λ
R(αi|x) = ∑k≠i P(Ck|x) = 1 − P(Ci|x)

Choose Ci if P(Ci|x) > P(Ck|x) ∀k ≠ i and P(Ci|x) > 1 − λ; reject otherwise
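The reject rule can be sketched directly; the posteriors and λ value below are illustrative assumptions:

```python
# 0/1 loss with a reject option: rejecting costs lambda_ (0 < lambda_ < 1).
# Choose C_i if its posterior is the maximum AND exceeds 1 - lambda_;
# otherwise reject.
def decide_with_reject(posteriors, lambda_):
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i if posteriors[i] > 1.0 - lambda_ else "reject"

confident = decide_with_reject([0.9, 0.06, 0.04], lambda_=0.3)  # 0
unsure    = decide_with_reject([0.5, 0.3, 0.2],  lambda_=0.3)   # "reject"
```

With λ = 0.3 a class is accepted only when its posterior exceeds 0.7; the second, more ambiguous input is handed off for manual handling.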
10. Classification using Discriminant
Functions
Choose Ci if gi(x) = maxk gk(x)

Decision regions: Ri = {x | gi(x) = maxk gk(x)}. g(x) divides the feature space into K decision regions R1, ..., RK.

Discriminant functions can be defined as

gi(x) = −R(αi|x), or
gi(x) = P(Ci|x), or
gi(x) = p(x|Ci) P(Ci)
11. K=2 Classes
Dichotomizer (K = 2) vs. polychotomizer (K > 2).
A single discriminant function is often used for 2-class classification:
g(x) = g1(x) − g2(x)
Choose C1 if g(x) > 0, C2 otherwise

Log odds: log [ P(C1|x) / P(C2|x) ]
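A tiny sketch of the dichotomizer in log-odds form; the posterior values are illustrative:

```python
# Dichotomizer via log odds: g(x) = log[P(C1|x) / P(C2|x)];
# choose C1 when g(x) > 0.
import math

def log_odds(p_c1):          # p_c1 = P(C1 | x); P(C2 | x) = 1 - p_c1
    return math.log(p_c1 / (1.0 - p_c1))

def choose(p_c1):
    return "C1" if log_odds(p_c1) > 0 else "C2"

print(choose(0.7))  # C1
print(choose(0.3))  # C2
```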
12. Utility Theory for making Rational
Decisions under uncertainty
Probability of state k given evidence x: P(Sk|x)
Utility of action αi when the state is k: Uik
Expected utility:
Maximizing expected utility is equivalent to
minimizing expected risk.
EU(αi|x) = ∑k Uik P(Sk|x)

Choose αi if EU(αi|x) = maxj EU(αj|x)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
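A sketch of expected-utility maximization; the utility matrix below is an illustrative assumption, and setting Uik = −λik shows the equivalence to minimum risk:

```python
# Expected utility EU(alpha_i | x) = sum_k U_ik P(S_k | x);
# choose the action that maximizes it.
def best_action(utility, posteriors):
    eus = [sum(u * p for u, p in zip(row, posteriors)) for row in utility]
    best = max(range(len(eus)), key=eus.__getitem__)
    return best, eus

# With U_ik = -lambda_ik this is exactly minimum expected risk.
U = [[0.0, -10.0],
     [-1.0, 0.0]]
a, eus = best_action(U, posteriors=[0.8, 0.2])
print(a)  # 1
```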
13. Value of Information
• Observable features like blood tests, MRI scans, etc. are costly and should not be requested unless they are needed for diagnosis.
• The value of information has to be assessed in such domains.
• Observed features: x. Newly added feature: z.
• The expected utility of the best action before and after adding z is given by
• The value of the information given by z is EU(x, z) − EU(x); z is useful only if this difference is greater than 0.
EU(x) = maxj ∑k Ujk P(Sk|x)
EU(x, z) = maxj ∑k Ujk P(Sk|x, z)
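A sketch of the comparison the slide defines, for one observed value of z; the utility matrix and the posteriors (before and after z sharpens the belief over states S1, S2) are illustrative assumptions:

```python
# Value of a new feature z: compare the expected utility of the best
# action before (EU(x)) and after (EU(x, z)) observing z.
def best_eu(utility, posteriors):
    return max(sum(u * p for u, p in zip(row, posteriors)) for row in utility)

U = [[5.0, -10.0],   # alpha_1: act, good in S_1, bad in S_2
     [0.0, 0.0]]     # alpha_2: do nothing, utility 0 either way
eu_x  = best_eu(U, [0.6, 0.4])   # posterior before observing z
eu_xz = best_eu(U, [0.9, 0.1])   # sharper posterior after observing z
value_of_z = eu_xz - eu_x        # > 0, so z is worth acquiring here
```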
15. Bayesian Networks
A directed acyclic graphical model representing the interactions between random variables: nodes denote the variables and directed edges the dependencies between them.
Each node in the DAG carries conditional probabilities as parameters, learned from a set of known examples or specified through domain knowledge.
Bayesian networks represent conditional independence between certain nodes, which helps break the problem of finding the joint distribution of many variables into local structures:
P(X1, ..., Xd) = ∏i P(Xi | parents(Xi))
Accordingly, for the Bayesian network shown in the diagram,
P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R)
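The factorization above can be evaluated directly from conditional probability tables. The CPT numbers below are illustrative assumptions for a sprinkler-style network (cloudy, sprinkler, rain, wet grass, flood), not values from the book:

```python
# Joint probability from the factorization
# P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R).
P_C = {True: 0.5, False: 0.5}                     # P(C=True)
P_S = {True: 0.1, False: 0.5}                     # P(S=True | C)
P_R = {True: 0.8, False: 0.2}                     # P(R=True | C)
P_W = {(True, True): 0.95, (True, False): 0.9,    # P(W=True | S, R)
       (False, True): 0.9, (False, False): 0.1}
P_F = {True: 0.7, False: 0.05}                    # P(F=True | R)

def joint(c, s, r, w, f):
    p = P_C[c]
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    p *= P_F[r] if f else 1 - P_F[r]
    return p

# The joint over all 2^5 assignments must sum to 1.
total = sum(joint(c, s, r, w, f)
            for c in (True, False) for s in (True, False)
            for r in (True, False) for w in (True, False)
            for f in (True, False))
```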
16. Bayesian Networks contd…
In Bayesian networks the input and output variables are not explicitly designated. Based on the available evidence, belief propagates through the network to infer the probabilities of the other variables.
Hidden variables may also be represented by some of the nodes; their conditional probabilities are estimated based on the values of their parents, which represent related observed variables.
Handles both numeric and categorical variables.
The structure should be created by a human expert after identifying the causal relationships among the variables and the local hierarchies.
17. Influence Diagrams: Graphical models for
generalization of Bayesian Networks for Decision Making
An influence diagram contains:
chance nodes representing the random variables of a BN,
decision nodes representing the choice of action/classification, and
a utility node for utility estimation.
Bayesian Network (BN) for classification
18. Association Rules
Association rule: X → Y
People who buy X are also likely to buy Y.
A rule implies association, not necessarily causation.
To find such associations, frequent itemsets must first be mined from the transaction database.
The number of transactions that cover an itemset is referred to as its support.
An itemset is considered frequent if its support meets a minimum support threshold.
19. Apriori Property
All subsets of a frequent itemset are frequent. Hence, if a set is found to be infrequent, all of its supersets are also infrequent and can be pruned.
For (X,Y,Z), a 3-item set, to be frequent (have enough
support), (X,Y), (X,Z), and (Y,Z) should be frequent.
If (X,Y) is not frequent, none of its supersets can be
frequent.
Once we find the frequent k-item sets, we convert them
to rules: X, Y → Z, ...
and X → Y, Z, ...
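The pruning step above can be sketched as a minimal Apriori in Python; the transaction database and support threshold are illustrative assumptions:

```python
# Minimal Apriori sketch: extend frequent (k-1)-itemsets to k-itemsets,
# pruning any candidate that has an infrequent subset.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(s):
        return sum(1 for t in transactions if s <= t)

    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

freq = apriori([{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"Y", "Z"}],
               min_support=2)
```

In this toy database every pair is frequent but {X, Y, Z} appears in only one transaction, so it is pruned.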
20. Association measures
Support (X → Y): P(X, Y) = #{transactions covering X and Y} / #{all transactions}

Confidence (X → Y): P(Y|X) = P(X, Y) / P(X) = #{transactions covering X and Y} / #{transactions covering X}

Lift (X → Y): P(X, Y) / (P(X) P(Y)) = P(Y|X) / P(Y)
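The three measures can be computed directly from transaction counts; the toy database below is an illustrative assumption:

```python
# Support, confidence, and lift for a rule X -> Y over a toy
# transaction database of item sets.
def measures(transactions, X, Y):
    n = len(transactions)
    n_x  = sum(1 for t in transactions if X <= t)
    n_y  = sum(1 for t in transactions if Y <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    support = n_xy / n               # P(X, Y)
    confidence = n_xy / n_x          # P(Y | X)
    lift = confidence / (n_y / n)    # P(Y | X) / P(Y)
    return support, confidence, lift

db = [{"milk", "bread"}, {"milk", "bread", "eggs"},
      {"bread"}, {"milk", "eggs"}]
s, c, l = measures(db, X={"milk"}, Y={"bread"})
```

Here lift < 1, so buying milk actually makes bread slightly less likely than its base rate in this toy data.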
21. Conclusion
We discussed the formalism for optimal decision making under uncertainty.
Concepts from probability theory are useful for modelling uncertainty, and the utility of making a choice or decision is estimated accordingly.
The next chapters focus on how to estimate these probabilities from a given dataset. The approaches are categorised as:
Parametric approaches
Semiparametric and nonparametric approaches