INTRO TO EDL
Federico Cerutti
Educational material for students at Cardiff University, UK and the University of Brescia,
Italy.
2
a primer in bayesian analysis
Bayesian Probabilities
Bayes' theorem:

p(Y|X) = p(X|Y) p(Y) / p(X)    (1)

where

p(X) = Σ_Y p(X|Y) p(Y)    (2)
4
Suppose we randomly pick one of the boxes and from that
box we randomly select an item of fruit, and having observed
which sort of fruit it is we replace it in the box from which it
came.
We could imagine repeating this process many times. Let us
suppose that in so doing we pick the red box 40% of the time
and we pick the blue box 60% of the time, and that when we
remove an item of fruit from a box we are equally likely to
select any of the pieces of fruit in the box.
We are told that a piece of fruit has been selected and it is an orange.
Which box did it come from?
5
Image from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
p(B = r|F = o) = p(F = o|B = r) p(B = r) / p(F = o)
              = (6/8 · 4/10) / (6/8 · 4/10 + 1/4 · 6/10)
              = 3/4 · 2/5 · 20/9
              = 2/3    (3)
6
Image from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
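A minimal numeric check of Eq. (3) in Python; the box contents (2 apples and 6 oranges in the red box, 3 apples and 1 orange in the blue box) are inferred from the fractions above rather than stated in the text.

# Sketch: checking Eq. (3) with the fruit-box numbers.
p_box = {"red": 4 / 10, "blue": 6 / 10}             # p(B)
p_orange_given_box = {"red": 6 / 8, "blue": 1 / 4}  # p(F = o | B)

# Total probability of drawing an orange, Eq. (2)
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem, Eq. (1)
p_red_given_orange = p_orange_given_box["red"] * p_box["red"] / p_orange
print(p_orange, p_red_given_orange)  # 0.45, 0.666... = 2/3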
Given the parameters of our model w, we can capture our assumptions about w, before
observing the data, in the form of a prior probability distribution p(w). The effect of the
observed data D = {t1, . . . , tN } is expressed through the conditional p(D |w), hence Bayes
theorem takes the form:
p(w|D) = p(D|w) p(w) / p(D)    (4)

where p(D|w) is the likelihood and p(w) is the prior, hence

posterior ∝ likelihood · prior    (5)

p(D) = ∫ p(D|w) p(w) dw    (6)

The denominator p(D) ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.
7
Frequentist paradigm
• w is considered to be a fixed parameter,
whose value is determined by some
form of estimator, e.g. maximum
likelihood, in which w is set to the value
that maximises p(D |w)
• Error bars on this estimate are obtained
by considering the distribution of
possible data sets D .
• The negative log of the likelihood
function is called an error function: the
negative log is a monotonically
decreasing function hence maximising
the likelihood is equivalent to
minimising the error.
Bayesian paradigm
• There is a single data set D (the
one observed) and the uncertainty in the
parameters is expressed through a
probability distribution over w.
• The inclusion of prior knowledge arises
naturally: suppose that a fair-looking
coin is tossed three times and lands
heads each time. A classical maximum
likelihood estimate of the probability of
landing heads would give 1.
There are cases where you want to
reduce the dependence on the prior,
hence using noninformative priors.
8
Binary variable: Bernoulli
Let us consider a single binary random variable x ∈ {0, 1}, e.g. flipping a coin that is not necessarily fair; the probability is therefore conditioned on a parameter 0 ≤ µ ≤ 1:

p(x = 1|µ) = µ    (7)

The probability distribution over x is known as the Bernoulli distribution:

Bern(x|µ) = µ^x (1 − µ)^(1−x)    (8)

E[x] = µ    (9)
9
Binomial distribution
The distribution of the number m of observations of x = 1, given the data-set size N, is the binomial distribution:

Bin(m|N, µ) = (N choose m) µ^m (1 − µ)^(N−m)    (10)

with

E[m] ≡ Σ_{m=0}^{N} m Bin(m|N, µ) = Nµ    (11)

and

var[m] ≡ Σ_{m=0}^{N} (m − E[m])² Bin(m|N, µ) = Nµ(1 − µ)    (12)
10
Image from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
How many times, over N = 10 runs, would you see x = 1 if µ = 0.25?
[Histogram of Bin(m | N = 10, µ = 0.25) for m = 0, . . . , 10.]
11
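A quick way to answer the question above, assuming scipy is available: evaluate the Bin(m | N = 10, µ = 0.25) pmf and the moments from Eqs. (11)-(12).

# Sketch: pmf and moments of Bin(m | N=10, mu=0.25) with scipy.
from scipy.stats import binom

N, mu = 10, 0.25
for m in range(N + 1):
    print(m, round(binom.pmf(m, N, mu), 3))   # probability of seeing x = 1 exactly m times

print(binom.mean(N, mu), binom.var(N, mu))    # N*mu = 2.5, N*mu*(1-mu) = 1.875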
Let’s go back to the Bernoulli distribution
Now suppose that we have a data set of observations x = (x1, . . . , xN )T
drawn independently
from a Bernoulli distribution (iid) whose mean µ is unknown, and we would like to
determine this parameter from the data set.
p(D|µ) = Π_{n=1}^{N} p(x_n|µ) = Π_{n=1}^{N} µ^(x_n) (1 − µ)^(1−x_n)    (13)

Let's maximise the (log-)likelihood to identify the parameter (the log simplifies the algebra and reduces the risk of underflow):

ln p(D|µ) = Σ_{n=1}^{N} ln p(x_n|µ) = Σ_{n=1}^{N} { x_n ln µ + (1 − x_n) ln(1 − µ) }    (14)
12
The log likelihood depends on the N observations x_n only through their sum Σ_n x_n; hence the sum provides an example of a sufficient statistic for the data under this distribution:

"hence no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter" (Fisher, 1922)
13
d/dµ ln p(D|µ) = 0

Σ_{n=1}^{N} ( x_n/µ − (1 − x_n)/(1 − µ) ) = 0

Σ_{n=1}^{N} (x_n − µ) / (µ(1 − µ)) = 0

Σ_{n=1}^{N} x_n = Nµ

µ_ML = (1/N) Σ_{n=1}^{N} x_n

aka the sample mean. Risk of overfitting: consider tossing the coin three times and getting heads each time.
14
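A small sketch of this risk in Python: the same estimator that works well with many tosses returns µ_ML = 1 after three heads in a row.

# Sketch: the Bernoulli maximum likelihood estimate is the sample mean,
# and it overfits on tiny samples.
import numpy as np

rng = np.random.default_rng(0)
x_large = rng.binomial(1, 0.5, size=10_000)  # many tosses of a fair coin
x_small = np.array([1, 1, 1])                # three tosses, all heads

print(x_large.mean())  # close to the true mu = 0.5
print(x_small.mean())  # 1.0: predicts heads with certainty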
To develop a Bayesian treatment of the overfitting problem of the maximum likelihood estimator for the Bernoulli, note that the likelihood takes the form of a product of factors of the form µ^x (1 − µ)^(1−x): if we choose a prior proportional to powers of µ and (1 − µ), then the posterior distribution, proportional to the product of the prior and the likelihood, will have the same functional form as the prior. This property is called conjugacy.
15
Binary variables: Beta distribution
Beta(µ|a, b) = Γ(a + b) / (Γ(a)Γ(b)) µ^(a−1) (1 − µ)^(b−1)

with

Γ(x) ≡ ∫_0^∞ u^(x−1) e^(−u) du

E[µ] = a / (a + b)

var[µ] = ab / ((a + b)²(a + b + 1))
a and b are hyperparameters controlling the distribution of parameter µ.
16
[Plots of Beta(µ|a, b) over µ for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4).]
17
Images from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
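The shapes in the figure above can be reproduced with scipy's beta distribution; the grid of µ values below is only illustrative.

# Sketch: evaluating Beta(mu | a, b) for the four hyperparameter pairs above.
from scipy.stats import beta
import numpy as np

mu = np.linspace(0.01, 0.99, 5)
for a, b in [(0.1, 0.1), (1, 1), (2, 3), (8, 4)]:
    # mean a/(a+b) and a few pdf values along the mu grid
    print((a, b), beta.mean(a, b), np.round(beta.pdf(mu, a, b), 2))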
Considering a beta distribution prior and the binomial likelihood function, and given
l = N − m
p(µ|m, l, a, b) ∝ µ^(m+a−1) (1 − µ)^(l+b−1)

Hence p(µ|m, l, a, b) is another beta distribution, and we can rearrange the normalisation coefficient as follows:

p(µ|m, l, a, b) = Γ(m + a + l + b) / (Γ(m + a)Γ(l + b)) µ^(m+a−1) (1 − µ)^(l+b−1)
[Plots over µ of the prior, the likelihood function, and the posterior.]
18
Images from Bishop, C. M. Pattern Recognition and Machine Learning. (Springer-Verlag, 2006).
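A minimal sketch of this conjugate update with scipy; the prior hyperparameters a = b = 2 and the single observation are assumptions made purely for illustration.

# Sketch: prior Beta(a, b), m observations of x = 1 and l = N - m of x = 0,
# posterior Beta(a + m, b + l).
from scipy.stats import beta

a, b = 2, 2          # illustrative prior hyperparameters
m, l = 1, 0          # one observation of x = 1, none of x = 0

posterior = beta(a + m, b + l)
print(posterior.mean())   # (a + m) / (a + m + b + l) = 0.6
print(posterior.var())    # shrinks as more data are observed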
Epistemic vs Aleatoric uncertainty
Aleatoric uncertainty
Variability in the outcome of an experiment that is due to inherently random effects (e.g. flipping a fair coin): no additional source of information, short of Laplace's demon, can reduce this variability.
Epistemic uncertainty
Epistemic state of the agent using the model,
hence its lack of knowledge that—in
principle—can be reduced on the basis of
additional data samples.
It is a general property of Bayesian learning
that, as we observe more and more data, the
epistemic uncertainty represented by the
posterior distribution will steadily decrease
(the variance decreases).
19
See notebook at
https://nbviewer.jupyter.org/federicocerutti/UncertaintyAwarenessResources/
blob/master/notebooks/Beta.ipynb
20
Multinomial variables: categorical distribution
Let us suppose we roll a die with K = 6 faces. An observation of this variable x corresponding to x_3 = 1 (e.g. the number 3 face up) can be written as:

x = (0, 0, 1, 0, 0, 0)^T

Note that such vectors must satisfy Σ_{k=1}^{K} x_k = 1.

p(x|µ) = Π_{k=1}^{K} µ_k^(x_k)

where µ = (µ_1, . . . , µ_K)^T, and the parameters µ_k are such that µ_k ≥ 0 and Σ_k µ_k = 1.

This is a generalisation of the Bernoulli distribution.
21
p(D|µ) = Π_{n=1}^{N} Π_{k=1}^{K} µ_k^(x_nk)

The likelihood depends on the N data points only through the K quantities

m_k = Σ_n x_nk

which represent the number of observations of x_k = 1 (e.g. with k = 3, the third face of the die). These are called the sufficient statistics for this distribution.
22
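A short sketch of these sufficient statistics in Python, with hypothetical die rolls encoded as one-hot vectors; it also gives the maximum likelihood estimate discussed on the next slide.

# Sketch: the counts m_k are the sufficient statistics of one-hot data.
import numpy as np

rolls = np.array([2, 0, 2, 5, 2, 1])   # hypothetical observed faces (0-indexed)
X = np.eye(6)[rolls]                   # N x K one-hot matrix of x_{n,k}

m = X.sum(axis=0)                      # m_k = sum_n x_{n,k}
print(m)                 # [1. 1. 3. 0. 0. 1.]
print(m / len(rolls))    # maximum likelihood estimate mu_k^ML = m_k / N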
Finding the maximum likelihood solution requires a Lagrange multiplier λ, maximising

Σ_{k=1}^{K} m_k ln µ_k + λ ( Σ_{k=1}^{K} µ_k − 1 )

Hence

µ_k^ML = m_k / N

which is the fraction of the N observations for which x_k = 1.
23
Multinomial variables: the Dirichlet distribution
The Dirichlet distribution is the generalisation of the beta distribution to K dimensions.
Dir(µ|α) = Γ(α_0) / (Γ(α_1) · · · Γ(α_K)) Π_{k=1}^{K} µ_k^(α_k − 1)

such that Σ_k µ_k = 1, α = (α_1, . . . , α_K)^T, α_k ≥ 0 and

α_0 = Σ_{k=1}^{K} α_k
24
Considering a Dirichlet distribution prior and the categorical likelihood function, the
posterior is then:
p(µ|D, α) = Dir(µ|α + m) = Γ(α_0 + N) / (Γ(α_1 + m_1) · · · Γ(α_K + m_K)) Π_{k=1}^{K} µ_k^(α_k + m_k − 1)
The uniform prior is given by Dir(µ|1) and Jeffreys' non-informative prior is given by Dir(µ|(0.5, . . . , 0.5)^T).
The marginals of a Dirichlet distribution are beta distributions.
25
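A sketch of this update with scipy; the counts m reuse the hypothetical die rolls from before, and the uniform prior Dir(µ|1) is the starting point.

# Sketch: Dirichlet-categorical conjugate update and its beta marginals.
import numpy as np
from scipy.stats import dirichlet, beta

alpha = np.ones(6)                    # uniform prior Dir(mu | 1)
m = np.array([1, 1, 3, 0, 0, 1])      # counts from the hypothetical die rolls

posterior = alpha + m                 # Dir(mu | alpha + m)
print(dirichlet.mean(posterior))      # E[mu_k] = (alpha_k + m_k) / (alpha_0 + N)

# The marginal of one component of a Dirichlet is a beta distribution
k = 2
print(beta.mean(posterior[k], posterior.sum() - posterior[k]))  # same value as above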
neural networks and uncertainty awareness
Change the loss function so that the network outputs pieces of evidence in favour of the different classes; these are then combined through a Bayesian update, resulting in a Dirichlet distribution.
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
27
From Evidence to Dirichlet
Let us now assume a Dirichlet distribution over K classes that is the result of a Bayesian update with N observations, starting from a uniform prior:

Dir(µ | α) = Dir(µ | e_1 + 1, e_2 + 1, . . . , e_K + 1)

where e_k is the number of observations (pieces of evidence) for class k, and Σ_k e_k = N.
28
Dirichlet and Epistemic Uncertainty
The epistemic uncertainty associated with a Dirichlet distribution Dir(µ | α) is given by

u = K / S

with K the number of classes and S = α_0 = Σ_{k=1}^{K} α_k the Dirichlet strength.

Note that if the Dirichlet has been computed as the result of a Bayesian update from a uniform prior, then 0 ≤ u ≤ 1, and u = 1 implies that we are considering the uniform distribution (an extreme case of Dirichlet distribution).

Let us denote with µ̂_k the ratio α_k / S.
29
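A minimal sketch of this uncertainty mass in Python; the evidence vectors are hypothetical.

# Sketch: alpha = evidence + 1, S = sum(alpha), u = K / S, mu_hat_k = alpha_k / S.
import numpy as np

def dirichlet_uncertainty(evidence):
    alpha = evidence + 1.0          # Bayesian update from a uniform prior
    S = alpha.sum()                 # Dirichlet strength
    return len(alpha) / S, alpha / S

print(dirichlet_uncertainty(np.zeros(10)))                        # no evidence: u = 1, uniform
print(dirichlet_uncertainty(np.array([0., 0., 90.] + [0.] * 7)))  # strong evidence: u = 0.1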
Loss function
If we then consider Dir(µ_i | α_i) as the prior for a multinomial p(y_i | µ_i), we can then compute the expected squared error (aka Brier score):

E[ ||y_i − µ_i||²_2 ] = Σ_{k=1}^{K} E[ y²_{i,k} − 2 y_{i,k} µ_{i,k} + µ²_{i,k} ]
= Σ_{k=1}^{K} ( y²_{i,k} − 2 y_{i,k} E[µ_{i,k}] + E[µ²_{i,k}] )
= Σ_{k=1}^{K} ( y²_{i,k} − 2 y_{i,k} E[µ_{i,k}] + E[µ_{i,k}]² + var[µ_{i,k}] )
= Σ_{k=1}^{K} ( (y_{i,k} − E[µ_{i,k}])² + var[µ_{i,k}] )
= Σ_{k=1}^{K} ( (y_{i,k} − α_{i,k}/S_i)² + α_{i,k}(S_i − α_{i,k}) / (S_i²(S_i + 1)) )
= Σ_{k=1}^{K} ( (y_{i,k} − µ̂_{i,k})² + µ̂_{i,k}(1 − µ̂_{i,k}) / (S_i + 1) )
The loss over a batch of training samples is the sum of the loss for each sample in the batch.
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
30
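A sketch of the final expression above as a batched numpy function; the name edl_mse_loss and the example inputs are illustrative, not the authors' code.

# Sketch: sum over a batch of E[||y_i - mu_i||_2^2] under Dir(mu_i | alpha_i).
import numpy as np

def edl_mse_loss(y, alpha):
    S = alpha.sum(axis=1, keepdims=True)          # Dirichlet strengths S_i
    mu_hat = alpha / S                            # E[mu_{i,k}] = alpha_{i,k} / S_i
    err = (y - mu_hat) ** 2                       # squared error term
    var = mu_hat * (1 - mu_hat) / (S + 1)         # variance term
    return (err + var).sum(axis=1).sum()

y = np.eye(3)[[0, 2]]                             # two one-hot labels, classes 0 and 2
alpha = np.array([[10.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
print(edl_mse_loss(y, alpha))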
Learning to say “I don’t know”
To avoid generating evidence for all the classes when the network cannot classify a given sample (epistemic uncertainty), we introduce a term in the loss function that penalises the divergence from the uniform distribution:

L = Σ_{i=1}^{N} E[ ||y_i − µ_i||²_2 ] + λ_t Σ_{i=1}^{N} KL( Dir(µ_i | α̃_i) || Dir(µ_i | 1) )

where:
• λ_t is another hyperparameter; the suggestion is to make it depend on the number of training epochs, e.g. λ_t = min(1, t/CONST) with t the current training epoch, so that the effect of the KL divergence is gradually increased, avoiding premature convergence to the uniform distribution in the early epochs, when the learning algorithm still needs to explore the parameter space;
• α̃_i = y_i + (1 − y_i) · α_i are the Dirichlet parameters that the neural network, in a forward pass, has put on the wrong classes, and the idea is to minimise them as much as possible (both ingredients are sketched in the snippet below).
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
31
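A sketch of the two ingredients above in numpy; the function names and the value CONST = 10 are assumptions made for illustration.

# Sketch: annealing coefficient lambda_t and "misleading" parameters alpha_tilde,
# in which the evidence for the true class is removed.
import numpy as np

def annealing_coefficient(epoch, const=10):
    return min(1.0, epoch / const)

def remove_correct_evidence(y, alpha):
    # alpha_tilde = y + (1 - y) * alpha: 1 on the true class, alpha elsewhere
    return y + (1.0 - y) * alpha

y = np.array([[0.0, 0.0, 1.0]])
alpha = np.array([[4.0, 2.0, 30.0]])
print(annealing_coefficient(3))           # 0.3
print(remove_correct_evidence(y, alpha))  # [[4. 2. 1.]]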
KL recap
Consider some unknown distribution p(x) and suppose that we have modelled this using
q(x). If we use q(x) instead of p(x) to represent the true values of x, the average additional
amount of information required is:
KL(p||q) = − ∫ p(x) ln q(x) dx − ( − ∫ p(x) ln p(x) dx )
= − ∫ p(x) ln ( q(x) / p(x) ) dx
= −E[ ln ( q(x) / p(x) ) ]    (15)

This is known as the relative entropy or Kullback-Leibler (KL) divergence between the distributions p(x) and q(x).
Properties:
• KL(p||q) ≢ KL(q||p): the KL divergence is not symmetric;
• KL(p||q) ≥ 0, and KL(p||q) = 0 if and only if p = q.
32
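A small numeric illustration of these two properties on discrete distributions, using plain numpy.

# Sketch: KL divergence between two discrete distributions.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # both >= 0, but different: KL is not symmetric
print(kl(p, p))             # 0 if and only if the two distributions coincide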
KL( Dir(µ_i | α̃_i) || Dir(µ_i | 1) ) = ln ( Γ(Σ_{k=1}^{K} α̃_{i,k}) / ( Γ(K) Π_{k=1}^{K} Γ(α̃_{i,k}) ) ) + Σ_{k=1}^{K} (α̃_{i,k} − 1) [ ψ(α̃_{i,k}) − ψ( Σ_{j=1}^{K} α̃_{i,j} ) ]

where ψ(x) = d/dx ln Γ(x) is the digamma function
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
33
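A sketch of this closed form with scipy.special (gammaln for ln Γ and digamma for ψ); the function name is illustrative and the inputs are hypothetical batches of α̃_i.

# Sketch: KL( Dir(alpha_tilde) || Dir(1) ) for a batch of Dirichlet parameters.
import numpy as np
from scipy.special import gammaln, digamma

def kl_to_uniform_dirichlet(alpha_tilde):
    K = alpha_tilde.shape[1]
    S = alpha_tilde.sum(axis=1)                      # Dirichlet strengths
    log_norm = gammaln(S) - gammaln(K) - gammaln(alpha_tilde).sum(axis=1)
    dig = (alpha_tilde - 1.0) * (digamma(alpha_tilde) - digamma(S)[:, None])
    return log_norm + dig.sum(axis=1)

print(kl_to_uniform_dirichlet(np.array([[1.0, 1.0, 1.0]])))   # 0: already uniform
print(kl_to_uniform_dirichlet(np.array([[4.0, 2.0, 1.0]])))   # > 0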
EDL and robustness to FGS
Sensoy, Murat, Lance Kaplan, and Melih Kandemir. “Evidential deep learning to quantify classification
uncertainty.” Advances in Neural Information Processing Systems. 2018.
34
EDL + GAN for adversarial training
Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
35
VAE + GAN

G: generator in the latent space of the VAE
D': discriminator in the latent space
D: discriminator in the input space

For each data point in latent space, we generate a new noisy sample which is similar to it to some extent; hence, we avoid the mode-collapse problem. If the noise distribution is too far from the data distribution, the density ratio can be trivially predicted without learning the actual structure of the data; similarly, if the noise distribution is too close to the data distribution, the density ratio would be trivially one and the learning would be deprived.

[Figure 2 of the paper: original training samples (top), samples reconstructed by the VAE (middle), and samples generated by the proposed method (bottom) over a number of epochs.]

Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
36
Robustness against FGS
Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
37
Anomaly detection
(MNIST) (CIFAR-10)
Sensoy, Murat, et al. “Uncertainty-Aware Deep Classifiers using Generative Models.” AAAI 2020
38