2. Data Generation Process
The process that generated the data is not completely known and is therefore modelled as a random process.
The outcome of a random process is modelled as a random variable.
Given the available information or features, the value of a random variable cannot be predicted with certainty, so it is non-deterministic.
Probability theory deals with the study and analysis of such random processes.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 2
3. Probability and Inference
Result of tossing a coin: outcome ∈ {Heads, Tails}
Random variable X ∈ {1, 0}
po denotes the probability of heads, P(X = 1)
This implies P(X = 0) = 1 − po
X is Bernoulli distributed, with probability
P(X = x) = po^x (1 − po)^(1 − x)
Data sample: X = {x^t}, t = 1, ..., N
Estimation: po = #{Heads} / #{Tosses} = ∑t x^t / N
Prediction of next toss: infer Heads if po > ½, Tails otherwise
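The estimate and prediction rule above can be sketched in a few lines of Python; the sample data below is illustrative, not from the book:

```python
# Maximum-likelihood estimate of the head probability p_o from a
# sample of Bernoulli outcomes (1 = heads, 0 = tails).
def estimate_po(tosses):
    return sum(tosses) / len(tosses)

def predict_next(tosses):
    # Infer heads if the estimated p_o exceeds 1/2, tails otherwise.
    return "Heads" if estimate_po(tosses) > 0.5 else "Tails"

sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
print(estimate_po(sample))   # 0.7
print(predict_next(sample))  # Heads
```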
4. Classification
Prediction is based on the values of the observable input variables.
Credit scoring: inputs are income and savings; output is low-risk vs. high-risk.
Input: x = [x1, x2]^T, Output: C ∈ {0, 1}
Prediction:
Choose C = 1 if P(C = 1 | x1, x2) > 0.5, C = 0 otherwise,
or equivalently, choose C = 1 if P(C = 1 | x1, x2) > P(C = 0 | x1, x2), C = 0 otherwise.
Probability of error = 1 − max{ P(C = 1 | x1, x2), P(C = 0 | x1, x2) }
5. Bayes’ Rule
For binary classification, Bayes' rule gives the posterior:

P(C|x) = p(x|C) P(C) / p(x)

posterior = likelihood of x in C × prior / evidence,
where the evidence p(x) is the probability of x irrespective of C:

p(x) = p(x|C=1) P(C=1) + p(x|C=0) P(C=0)

with P(C=0) + P(C=1) = 1, and consequently P(C=0|x) + P(C=1|x) = 1.
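A minimal sketch of the binary Bayes computation; the likelihood and prior values are illustrative assumptions, not from the book:

```python
# Posterior P(C=1 | x) from class likelihoods and priors via Bayes' rule,
# for the binary case.
def posterior_c1(lik1, lik0, prior1):
    prior0 = 1.0 - prior1
    evidence = lik1 * prior1 + lik0 * prior0  # p(x)
    return lik1 * prior1 / evidence

# Illustrative numbers: p(x|C=1)=0.6, p(x|C=0)=0.2, P(C=1)=0.3
p = posterior_c1(lik1=0.6, lik0=0.2, prior1=0.3)
print(p)  # 0.5625
```

Note that even with a small prior P(C=1) = 0.3, the larger likelihood pushes the posterior past 0.5.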
6. Bayes’ Rule: K>2 Classes
P(Ci|x) = p(x|Ci) P(Ci) / p(x) = p(x|Ci) P(Ci) / ∑k=1..K p(x|Ck) P(Ck)

with P(Ci) ≥ 0 and ∑i=1..K P(Ci) = 1.

Choose Ci if P(Ci|x) = maxk P(Ck|x)
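The K-class rule can be sketched as follows; the likelihoods and priors below are illustrative assumptions:

```python
# Choose the class with maximum posterior when K > 2.
# Posteriors are computed from likelihoods p(x|C_k) and priors P(C_k).
def classify(likelihoods, priors):
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)                        # p(x) = sum_k p(x|C_k)P(C_k)
    posteriors = [j / evidence for j in joint]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

i, post = classify(likelihoods=[0.5, 0.3, 0.1], priors=[0.2, 0.5, 0.3])
print(i)  # 1
```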
7. Decision Making considering
Losses and Risks
The losses incurred by false positives and false negatives may not be equal in domains like finance, health, and disaster management.
αi: the action of assigning an input to Ci.
λik: the loss of taking action αi when the true state is Ck.
The expected risk of taking action αi is
R(αi|x) = ∑k=1..K λik P(Ck|x)

Choose αi if R(αi|x) = mink R(αk|x)
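A sketch of minimum-risk decision making with an asymmetric loss matrix; the loss values and posteriors are illustrative assumptions in the credit-scoring spirit of the earlier slide:

```python
# Expected risk R(alpha_i | x) = sum_k lambda_ik P(C_k | x);
# choose the action with minimum risk.
def min_risk_action(loss, posteriors):
    risks = [sum(lam * p for lam, p in zip(row, posteriors)) for row in loss]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# lambda[i][k]: loss of action alpha_i when the true class is C_k.
loss = [[0.0, 10.0],   # grant credit to a high-risk applicant: very costly
        [1.0, 0.0]]    # refuse a low-risk applicant: mildly costly
action, risks = min_risk_action(loss, posteriors=[0.8, 0.2])
print(action)  # 1
```

Even though C0 is far more probable (0.8), the asymmetric loss makes the cautious action the minimum-risk choice.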
8. Losses and Risks: 0/1 Loss Case
λik = 0 if i = k, 1 if i ≠ k

so the expected risk reduces to

R(αi|x) = ∑k≠i λik P(Ck|x) = ∑k≠i P(Ck|x) = 1 − P(Ci|x)
Under 0/1 loss all errors are equally costly, so for minimum risk, choose the most probable class.
9. Losses and Risks: Reject as CK+1
(useful when misclassification is costlier than manual handling,
e.g., sorting mail with an optical digit recognizer)
λik = 0 if i = k, λ if i = K+1 (reject), 1 otherwise, where 0 < λ < 1

R(αK+1|x) = ∑k=1..K λ P(Ck|x) = λ
R(αi|x) = ∑k≠i P(Ck|x) = 1 − P(Ci|x)

Choose Ci if P(Ci|x) > P(Ck|x) ∀k ≠ i and P(Ci|x) > 1 − λ; reject otherwise
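The reject rule can be sketched directly; the posteriors and λ value below are illustrative assumptions:

```python
# 0/1 loss with a reject option: rejecting costs lambda_ (0 < lambda_ < 1).
# Choose C_i if its posterior is the maximum AND exceeds 1 - lambda_;
# otherwise reject.
def decide_with_reject(posteriors, lambda_):
    i = max(range(len(posteriors)), key=posteriors.__getitem__)
    return i if posteriors[i] > 1.0 - lambda_ else "reject"

confident = decide_with_reject([0.9, 0.06, 0.04], lambda_=0.3)  # 0
unsure    = decide_with_reject([0.5, 0.3, 0.2],  lambda_=0.3)   # "reject"
```

With λ = 0.3 a class is accepted only when its posterior exceeds 0.7; the second, more ambiguous input is handed off for manual handling.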
10. Classification using Discriminant
Functions
Choose Ci if gi(x) = maxk gk(x)

Decision regions: Ri = {x | gi(x) = maxk gk(x)}. g(x) divides the feature space into K decision regions R1, ..., RK.

Discriminant functions can be defined as

gi(x) = −R(αi|x), or
gi(x) = P(Ci|x), or
gi(x) = p(x|Ci) P(Ci)
11. K=2 Classes
Dichotomizer (K = 2) vs. polychotomizer (K > 2).
A single discriminant function is often used for 2-class classification:
g(x) = g1(x) − g2(x)
Choose C1 if g(x) > 0, C2 otherwise

Log odds: log [ P(C1|x) / P(C2|x) ]
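A tiny sketch of the dichotomizer in log-odds form; the posterior values are illustrative:

```python
# Dichotomizer via log odds: g(x) = log[P(C1|x) / P(C2|x)];
# choose C1 when g(x) > 0.
import math

def log_odds(p_c1):          # p_c1 = P(C1 | x); P(C2 | x) = 1 - p_c1
    return math.log(p_c1 / (1.0 - p_c1))

def choose(p_c1):
    return "C1" if log_odds(p_c1) > 0 else "C2"

print(choose(0.7))  # C1
print(choose(0.3))  # C2
```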
12. Utility Theory for making Rational
Decisions under uncertainty
Probability of state k given evidence x: P(Sk|x)
Utility of action αi when the state is k: Uik
Expected utility:
Maximizing expected utility is equivalent to
minimizing expected risk.
EU(αi|x) = ∑k Uik P(Sk|x)

Choose αi if EU(αi|x) = maxj EU(αj|x)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
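A sketch of expected-utility maximization; the utility matrix below is an illustrative assumption, and setting Uik = −λik shows the equivalence to minimum risk:

```python
# Expected utility EU(alpha_i | x) = sum_k U_ik P(S_k | x);
# choose the action that maximizes it.
def best_action(utility, posteriors):
    eus = [sum(u * p for u, p in zip(row, posteriors)) for row in utility]
    best = max(range(len(eus)), key=eus.__getitem__)
    return best, eus

# With U_ik = -lambda_ik this is exactly minimum expected risk.
U = [[0.0, -10.0],
     [-1.0, 0.0]]
a, eus = best_action(U, posteriors=[0.8, 0.2])
print(a)  # 1
```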
13. Value of Information
• Observable features like blood tests, MRI scans, etc. are costly and should not be requested unless they are needed for diagnosis.
• The value of information has to be assessed in such domains.
• Observed features: x. Newly added feature: z.
• The expected utility of the best action before and after adding z is given by
• The value of the information given by z is EU(x, z) − EU(x); z is useful only if this difference is greater than 0.
EU(x) = maxj ∑k Ujk P(Sk|x)
EU(x, z) = maxj ∑k Ujk P(Sk|x, z)
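A sketch of the comparison the slide defines, for one observed value of z; the utility matrix and the posteriors (before and after z sharpens the belief over states S1, S2) are illustrative assumptions:

```python
# Value of a new feature z: compare the expected utility of the best
# action before (EU(x)) and after (EU(x, z)) observing z.
def best_eu(utility, posteriors):
    return max(sum(u * p for u, p in zip(row, posteriors)) for row in utility)

U = [[5.0, -10.0],   # alpha_1: act, good in S_1, bad in S_2
     [0.0, 0.0]]     # alpha_2: do nothing, utility 0 either way
eu_x  = best_eu(U, [0.6, 0.4])   # posterior before observing z
eu_xz = best_eu(U, [0.9, 0.1])   # sharper posterior after observing z
value_of_z = eu_xz - eu_x        # > 0, so z is worth acquiring here
```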
15. Bayesian Networks
A directed acyclic graphical model representing the interactions between random variables: nodes denote the variables and directed edges the dependencies between them.
Each node in the DAG carries conditional probabilities as parameters, learned from a set of known examples or specified through domain knowledge.
Bayesian networks represent conditional independence between certain nodes, which helps break the problem of finding the joint distribution of many variables into local structures:
P(X1, ..., Xd) = ∏i P(Xi | parents(Xi))
Accordingly, for the Bayesian network shown in the diagram,
P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R)
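The factorization above can be evaluated directly from conditional probability tables. The CPT numbers below are illustrative assumptions for a sprinkler-style network (cloudy, sprinkler, rain, wet grass, flood), not values from the book:

```python
# Joint probability from the factorization
# P(C,S,R,W,F) = P(C) P(S|C) P(R|C) P(W|S,R) P(F|R).
P_C = {True: 0.5, False: 0.5}                     # P(C=True)
P_S = {True: 0.1, False: 0.5}                     # P(S=True | C)
P_R = {True: 0.8, False: 0.2}                     # P(R=True | C)
P_W = {(True, True): 0.95, (True, False): 0.9,    # P(W=True | S, R)
       (False, True): 0.9, (False, False): 0.1}
P_F = {True: 0.7, False: 0.05}                    # P(F=True | R)

def joint(c, s, r, w, f):
    p = P_C[c]
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    p *= P_F[r] if f else 1 - P_F[r]
    return p

# The joint over all 2^5 assignments must sum to 1.
total = sum(joint(c, s, r, w, f)
            for c in (True, False) for s in (True, False)
            for r in (True, False) for w in (True, False)
            for f in (True, False))
```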
16. Bayesian Networks contd…
In Bayesian networks the input and output variables are not explicitly designated. Based on the available evidence, belief propagates through the network to infer the probabilities of the other variables.
Hidden variables may also be represented by some of the nodes; their conditional probabilities are estimated based on the values of their parents, which represent related observed variables.
Handles both numeric and categorical variables.
The structure should be created by a human expert after identifying the causal relationships among the variables and the local hierarchies.
17. Influence Diagrams: Graphical models for
generalization of Bayesian Networks for Decision Making
An influence diagram contains:
chance nodes representing the random variables of a BN,
decision nodes representing the choice of action/classification, and
a utility node for utility estimation.
Bayesian Network (BN) for classification
18. Association Rules
Association rule: X → Y
People who buy X are also likely to buy Y.
A rule implies association, not necessarily causation.
To find such associations, frequent itemsets must first be mined from the transaction database.
The number of transactions that cover an itemset is referred to as its support.
An itemset is considered frequent if its support meets a minimum support threshold.
19. Apriori Property
All subsets of a frequent itemset are frequent. Hence, if a set is found to be infrequent, all of its supersets are also infrequent and can be pruned.
For (X,Y,Z), a 3-item set, to be frequent (have enough
support), (X,Y), (X,Z), and (Y,Z) should be frequent.
If (X,Y) is not frequent, none of its supersets can be
frequent.
Once we find the frequent k-item sets, we convert them
to rules: X, Y → Z, ...
and X → Y, Z, ...
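The pruning step above can be sketched as a minimal Apriori in Python; the transaction database and support threshold are illustrative assumptions:

```python
# Minimal Apriori sketch: extend frequent (k-1)-itemsets to k-itemsets,
# pruning any candidate that has an infrequent subset.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(s):
        return sum(1 for t in transactions if s <= t)

    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

freq = apriori([{"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"}, {"Y", "Z"}],
               min_support=2)
```

In this toy database every pair is frequent but {X, Y, Z} appears in only one transaction, so it is pruned.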
20. Association measures
Support (X → Y): P(X, Y) = #{transactions covering X and Y} / #{all transactions}

Confidence (X → Y): P(Y|X) = P(X, Y) / P(X) = #{transactions covering X and Y} / #{transactions covering X}

Lift (X → Y): P(X, Y) / (P(X) P(Y)) = P(Y|X) / P(Y)
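The three measures can be computed directly from transaction counts; the toy database below is an illustrative assumption:

```python
# Support, confidence, and lift for a rule X -> Y over a toy
# transaction database of item sets.
def measures(transactions, X, Y):
    n = len(transactions)
    n_x  = sum(1 for t in transactions if X <= t)
    n_y  = sum(1 for t in transactions if Y <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    support = n_xy / n               # P(X, Y)
    confidence = n_xy / n_x          # P(Y | X)
    lift = confidence / (n_y / n)    # P(Y | X) / P(Y)
    return support, confidence, lift

db = [{"milk", "bread"}, {"milk", "bread", "eggs"},
      {"bread"}, {"milk", "eggs"}]
s, c, l = measures(db, X={"milk"}, Y={"bread"})
```

Here lift < 1, so buying milk actually makes bread slightly less likely than its base rate in this toy data.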
21. Conclusion
We discussed the formalism for optimal decision making under uncertainty.
Concepts from probability theory are useful for modelling uncertainty, and the utility of making a choice or decision is estimated accordingly.
The next chapters focus on how to estimate these probabilities from a given dataset. The approaches are categorised as:
Parametric approaches
Semiparametric and nonparametric approaches