1. Bayesian inversion of deterministic dynamic causal models
Kay H. Brodersen1,2 & Ajita Gupta2
1 Department of Economics, University of Zurich, Switzerland
2 Department of Computer Science, ETH Zurich, Switzerland
2. Overview
1. Problem setting
Model; likelihood; prior; posterior.
2. Variational Laplace
Factorization of the posterior; why the free-energy; energies; gradient ascent; adaptive step size; example.
3. Sampling
Transformation method; rejection method; Gibbs sampling; MCMC; Metropolis-Hastings; example.
4. Model comparison
Model evidence; Bayes factors; free-energy; prior arithmetic mean; posterior harmonic mean; Savage-Dickey; example.
With material from Will Penny, Klaas Enno Stephan, Chris Bishop, and Justin Chumbley.
5. Bayesian inference is conceptually straightforward
Question 1: what do the data tell us about the model parameters?
Compute the posterior:
p(θ | y, m) = p(y | θ, m) p(θ | m) / p(y | m)
Question 2: which model is best?
Compute the model evidence:
p(m | y) ∝ p(y | m) p(m), where p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ
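The two quantities above can be made concrete with a one-dimensional grid approximation. The sketch below is purely illustrative: the Gaussian likelihood, the prior, and all numerical values are invented for the example, not taken from any DCM.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy model: one observation y = theta + noise, noise ~ N(0, 1),
# with prior theta ~ N(0, 2^2).
y = 1.5
dtheta = 0.01
grid = [-10 + dtheta * i for i in range(2001)]   # theta values in [-10, 10]

# Unnormalized posterior p(y|theta,m) p(theta|m) on the grid
joint = [normal_pdf(y, th, 1.0) * normal_pdf(th, 0.0, 2.0) for th in grid]

# Model evidence p(y|m) = integral of the joint over theta (Riemann sum)
evidence = sum(joint) * dtheta

# Posterior p(theta|y,m) = joint / evidence
posterior = [j / evidence for j in joint]

# For conjugate Gaussians the evidence is known in closed form:
# p(y|m) = N(y; 0, 1^2 + 2^2), which the grid sum should reproduce.
exact = normal_pdf(y, 0.0, math.sqrt(1.0 + 4.0))
print(evidence, exact)
```

For a one-dimensional parameter this brute-force integral is easy; the rest of the deck is about what to do when it is not.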
9. Variational Laplace
Inversion strategy
Recall the relationship between the log model evidence and the negative free-energy F:
ln p(y | m) = E_q[ln p(y | θ)] − KL[q(θ) || p(θ | m)] + KL[q(θ) || p(θ | y, m)]
The first two terms together define F; the last term is ≥ 0.
Maximizing F implies two things:
(i) we obtain a good approximation to ln p(y | m)
(ii) the KL divergence between q(θ) and p(θ | y, m) becomes minimal
Practically, we can maximize F by iteratively maximizing the variational energies in an EM-like scheme.
W. Penny
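The decomposition above can be verified numerically in a conjugate linear-Gaussian toy model, where F, both KL terms, and the log evidence all have closed forms. This is a sketch for checking the identity only; the model and all values are invented and have nothing to do with an actual DCM inversion.

```python
import math

def kl_normal(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Toy model: y = theta + noise, noise ~ N(0, s^2), prior theta ~ N(0, s0^2)
y, s, s0 = 1.0, 0.5, 2.0

# Exact posterior and log evidence (conjugate case)
sp2 = 1.0 / (1.0 / s0**2 + 1.0 / s**2)
mp = sp2 * y / s**2
log_evidence = (-0.5 * math.log(2 * math.pi * (s0**2 + s**2))
                - y**2 / (2 * (s0**2 + s**2)))

# An arbitrary approximate posterior q(theta) = N(mq, sq^2)
mq, sq = 0.3, 0.4

# Expected log likelihood under q, closed form for this model
e_loglik = -0.5 * math.log(2 * math.pi * s**2) - ((y - mq)**2 + sq**2) / (2 * s**2)

# Negative free energy F = E_q[ln p(y|theta)] - KL(q || prior)
F = e_loglik - kl_normal(mq, sq, 0.0, s0)

# The decomposition: ln p(y|m) = F + KL(q || posterior), with KL >= 0
gap = kl_normal(mq, sq, mp, math.sqrt(sp2))
print(F + gap - log_evidence)  # ~0 up to floating point
```

Because the last KL term is nonnegative, F is a lower bound on ln p(y | m), tight exactly when q equals the true posterior.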
11. Variational Laplace
Implementation: gradient-ascent scheme (Newton's method)
Compute the Newton update (change) and form the new estimate. Newton's method for finding a root (1D):
x(new) = x(old) − f(x(old)) / f′(x(old))
Big curvature -> small step
Small curvature -> big step
W. Penny
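The 1D update above can be sketched as a generic Newton-Raphson root finder (the example function below is chosen for illustration; in variational Laplace the same curvature-scaled step is applied to the gradient of F rather than to an arbitrary f):

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson: x_new = x_old - f(x_old) / f'(x_old).
    A large derivative (big curvature) gives a small step, and vice versa."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Root of f(x) = x^2 - 2, starting from x0 = 1, converges to sqrt(2)
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
print(root)
```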
12. Newton's method – demonstration
Newton's method is very efficient. However, its solution is sensitive to the starting point, as the demonstration shows.
http://demonstrations.wolfram.com/LearningNewtonsMethod/
13. Variational Laplace
Nonlinear regression (example)
[Figure: a nonlinear regression model, showing the likelihood and data y(t) over time t, alongside the ground truth, i.e. the known parameter values that were used to generate the data.]
W. Penny
18. Strategy 1 – Transformation method
We can obtain samples from some distribution p(z) by first sampling from the uniform distribution and then transforming these samples.
Transformation: z(τ) = F⁻¹(u(τ)), where F is the cdf of p(z). In the figure, for example, u = 0.9 maps to z = 1.15.
+ yields high-quality samples
+ easy to implement
+ computationally efficient
− obtaining the inverse cdf can be difficult
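A minimal sketch of the transformation method, using the exponential distribution because its inverse cdf is available in closed form (the choice of target distribution is mine, for illustration):

```python
import math
import random

def sample_exponential(lam, rng):
    """Inverse-cdf (transformation) sampling for Exp(lam):
    F(z) = 1 - exp(-lam * z), hence z = F^{-1}(u) = -ln(1 - u) / lam."""
    u = rng.random()               # u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam

rng = random.Random(0)
samples = [sample_exponential(1.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # should be close to 1/lam = 1.0
```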
19. Strategy 2 – Rejection method
When the transformation method cannot be applied, we can resort to a more general method called rejection sampling. Here, we draw random numbers from a simpler proposal distribution q(z) and keep only some of these samples.
The proposal distribution q(z), scaled by a factor k, represents an envelope of p(z).
+ yields high-quality samples
+ easy to implement
+ can be computationally efficient
− computationally inefficient if the proposal is a poor approximation
Bishop (2007) PRML
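A minimal rejection-sampling sketch. The target, proposal, and envelope constant below are my own choices for illustration: a Beta(2,2) target under a uniform proposal, with k equal to the target density's maximum so that k·q(z) ≥ p(z) everywhere.

```python
import random

def rejection_sample(p, q_sample, q_pdf, k, rng):
    """Draw one sample from p using proposal q, where k*q(z) >= p(z)."""
    while True:
        z = q_sample(rng)                   # propose z ~ q
        u = rng.random() * k * q_pdf(z)     # uniform height under the envelope
        if u <= p(z):                       # keep z if u falls under p(z)
            return z

# Target: Beta(2,2) density p(z) = 6 z (1 - z) on [0, 1];
# proposal: Uniform(0, 1); envelope constant k = 1.5 (the density's maximum).
p = lambda z: 6.0 * z * (1.0 - z)
rng = random.Random(1)
samples = [rejection_sample(p, lambda r: r.random(), lambda z: 1.0, 1.5, rng)
           for _ in range(20_000)]
mean = sum(samples) / len(samples)
print(mean)  # Beta(2,2) has mean 0.5
```

Here the acceptance rate is 1/k = 2/3; a proposal that approximates p poorly forces a larger k and wastes correspondingly more draws.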
20. Strategy 3 – Gibbs sampling
Often the joint distribution of several random variables is unavailable, whereas the full-conditional distributions are available. In this case, we can cycle over the full-conditionals to obtain samples from the joint distribution.
+ easy to implement
− samples are serially correlated
− the full-conditionals may not be available
Bishop (2007) PRML
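A minimal Gibbs sketch for a case where the full-conditionals are known exactly: a zero-mean bivariate Gaussian with unit variances and correlation ρ (the example and all values are mine, for illustration).

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, rng, burn_in=500):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian.
    Full-conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically."""
    sd = math.sqrt(1.0 - rho**2)
    x1, x2 = 0.0, 0.0
    out = []
    for t in range(n_samples + burn_in):
        x1 = rng.gauss(rho * x2, sd)   # sample from p(x1 | x2)
        x2 = rng.gauss(rho * x1, sd)   # sample from p(x2 | x1)
        if t >= burn_in:
            out.append((x1, x2))
    return out

rng = random.Random(2)
draws = gibbs_bivariate_normal(0.8, 20_000, rng)
corr = sum(a * b for a, b in draws) / len(draws)   # E[x1 x2] = rho here
print(corr)  # close to rho = 0.8
```

The serial correlation noted above is visible here: consecutive draws are far from independent, so the effective sample size is smaller than the number of iterations.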
21. Strategy 4 – Markov chain Monte Carlo (MCMC)
Idea: we can sample from a large class of distributions and overcome the problems that previous methods face in high dimensions using a framework called Markov chain Monte Carlo.
A proposed move from z(τ) to z* is accepted with probability 1 when the density increases, i.e., p(z*) ≥ p(z(τ)). When the density decreases, the move is accepted with probability p̃(z*)/p̃(z(τ)), where p̃ is the unnormalized density.
M. Dümcke
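The acceptance rule above, combined with a symmetric random-walk proposal, is the Metropolis algorithm. A minimal sketch (the standard-normal target, known only up to a constant, is my choice for illustration), working in log densities for numerical stability:

```python
import math
import random

def metropolis(log_p, x0, step_sd, n_samples, rng, burn_in=1000):
    """Random-walk Metropolis: propose z* = z + noise and accept with
    probability min(1, p(z*)/p(z)), using an unnormalized density."""
    x = x0
    log_px = log_p(x)
    out = []
    for t in range(n_samples + burn_in):
        x_star = x + rng.gauss(0.0, step_sd)
        log_pstar = log_p(x_star)
        # Accept with probability 1 if the density increases,
        # otherwise with probability p(z*)/p(z)
        if math.log(rng.random() + 1e-300) < log_pstar - log_px:
            x, log_px = x_star, log_pstar
        if t >= burn_in:
            out.append(x)
    return out

# Target: standard normal, known only up to its normalizing constant
rng = random.Random(3)
chain = metropolis(lambda x: -0.5 * x * x, 0.0, 1.0, 50_000, rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, var)  # close to 0 and 1
```

Note that rejected proposals repeat the current point in the chain; this is what produces the "long constant stretches" mentioned in the demonstration below when the proposal is too wide.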
22. MCMC demonstration: finding a good proposal density
When the proposal distribution is too narrow, we might miss a mode. When it is too wide, we obtain long constant stretches without an acceptance.
http://demonstrations.wolfram.com/MarkovChainMonteCarloSimulationUsingTheMetropolisAlgorithm/
23. MCMC for DCM
MH creates a series of random points θ(1), θ(2), …, whose distribution converges to the target distribution of interest. For us, this is the posterior density p(θ|y).
We could use a random-walk proposal that perturbs the current point with Gaussian noise:
θ(τ) = θ(τ−1) + ε, ε ~ N(0, σ C_θ)
where C_θ is the prior covariance and σ is a scaling factor, adapted such that the acceptance rate is between 20% and 40%.
W. Penny
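The adaptation of σ can be sketched with a crude batch scheme: run the sampler for a short batch, measure the acceptance rate, and shrink or widen σ until the rate lands in the target band. This is my own illustrative loop on a 1D standard-normal target, not the scheme actually used for DCM:

```python
import math
import random

def tune_sigma(log_p, sigma, rng, batch=500, lo=0.20, hi=0.40, max_rounds=50):
    """Adapt the random-walk proposal scale sigma until the Metropolis
    acceptance rate falls between lo and hi (a crude illustrative sketch)."""
    x, log_px = 0.0, log_p(0.0)
    rate = 0.0
    for _ in range(max_rounds):
        accepted = 0
        for _ in range(batch):
            x_star = x + rng.gauss(0.0, sigma)
            log_pstar = log_p(x_star)
            if math.log(rng.random() + 1e-300) < log_pstar - log_px:
                x, log_px = x_star, log_pstar
                accepted += 1
        rate = accepted / batch
        if lo <= rate <= hi:
            return sigma, rate
        # Too few acceptances -> proposal too wide -> shrink, and vice versa
        sigma *= 0.8 if rate < lo else 1.25
    return sigma, rate

rng = random.Random(4)
sigma, rate = tune_sigma(lambda x: -0.5 * x * x, 10.0, rng)
print(sigma, rate)
```

Starting from a deliberately oversized σ = 10, the loop shrinks the proposal until roughly a fifth to two fifths of the moves are accepted.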
32. Savage-Dickey ratio
In many situations we wish to compare models that are nested. For example:
m_F: full model with parameters θ = (θ1, θ2)
m_R: reduced model with θ = (θ1, 0)
In this case, we can use the Savage-Dickey ratio to obtain a Bayes factor without having to compute the two model evidences:
B_RF = p(θ2 = 0 | y, m_F) / p(θ2 = 0 | m_F)
In the illustrated example, B_RF = 0.9.
W. Penny
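The ratio only needs the prior and posterior marginal densities of θ2 evaluated at zero. A minimal sketch, with invented Gaussian marginals standing in for the prior and posterior of θ2 under the full model:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Savage-Dickey: B_RF = p(theta2 = 0 | y, m_F) / p(theta2 = 0 | m_F).
# Illustrative marginals (values invented): prior theta2 ~ N(0, 1),
# posterior theta2 | y ~ N(0.5, 0.4^2).
prior_at_zero = normal_pdf(0.0, 0.0, 1.0)
posterior_at_zero = normal_pdf(0.0, 0.5, 0.4)

B_RF = posterior_at_zero / prior_at_zero
print(B_RF)  # > 1 favours the reduced model, < 1 the full model
```

Intuitively: if the data concentrate posterior mass near θ2 = 0 relative to the prior, the restriction is supported and B_RF exceeds 1.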