An [under]view of Monte Carlo methods, from
importance sampling to MCMC, to ABC
(& kudos to Bernoulli)
Christian P. Robert
Université Paris-Dauphine, University of Warwick, & CREST, Paris
2013 WSC, Hong Kong
bayesianstatistics@gmail.com
Outline
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation
(ABC)
Bernoulli as founding father of Monte Carlo methods
The weak law of large numbers (or Bernoulli’s [Golden] theorem)
provides the justification for Monte Carlo approximations:
if x1, . . . , xn are i.i.d. rv's with density f,
\lim_{n\to\infty} \frac{h(x_1) + \cdots + h(x_n)}{n} = \int_{\mathcal{X}} h(x)\, f(x)\, dx
Stigler’s Law of Eponymy: Cardano (1501–1576) first stated the
result
Bernoulli as founding father of Monte Carlo methods
...and indeed
\frac{h(x_1) + \cdots + h(x_n)}{n} \quad\text{converges to}\quad I = \int_{\mathcal{X}} h(x)\, f(x)\, dx
...meaning that provided we can simulate xi ∼ f (·) long and fast
“enough”, the empirical mean will be a good “enough”
approximation to I
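A minimal Python sketch of this Monte Carlo principle, under the assumed toy choice h(x) = x² with f the standard normal density (so the exact value is I = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo(h, sampler, n=100_000):
    """Plain Monte Carlo approximation of I = E_f[h(X)] from i.i.d. draws of f."""
    x = sampler(n)
    return np.mean(h(x))

# assumed toy case: h(x) = x**2 under f = N(0, 1), so the exact value is I = 1
print(monte_carlo(lambda x: x**2, lambda n: rng.normal(size=n)))
```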
Early implementations of the LLN
While Jakob Bernoulli
himself apparently did not
engage in simulation,
Buffon (1707–1788) resorted
to a (not-yet-Monte-Carlo)
experiment in 1735 to
estimate the value of the
Saint Petersburg game
(even though he did not
perform a similar experiment
for estimating π)
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
Early implementations of the LLN
While Jakob Bernoulli
himself apparently did not
engage in simulation,
De Forest (1834–1888)
found the median of a
log-Cauchy distribution,
using normal simulations
approximated to the second
digit (in 1876)
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
Early implementations of the LLN
While Jakob Bernoulli
himself apparently did not
engage in simulation,
followed closely by the
ubiquitous Galton using
“normal” dice in 1890, after
developing the Quincunx,
used both for checking the
CLT and simulating from a
posterior distribution as
early as 1877
[Stigler, STS, 1991; Stigler, JRSS A, 2010]
Importance Sampling
When focussing on integral approximation, a very loose principle:
any proposal distribution with pdf q(·) leads to the alternative
representation
I = \int_{\mathcal{X}} h(x)\, \{f/q\}(x)\, q(x)\, dx
Principle of importance
Generate an iid sample x_1, . . . , x_n ∼ q(·) and estimate I by
\hat{I}_{\mathrm{IS}} = n^{-1} \sum_{i=1}^{n} h(x_i)\, \{f/q\}(x_i)\,.
...provided q is positive on the right set
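A hedged Python sketch of the importance sampling estimator Î_IS, under the assumed toy configuration f = N(0,1), q = Student t with 3 degrees of freedom (heavier tails than f), and h(x) = x²:

```python
import numpy as np
from math import gamma, pi, sqrt

rng = np.random.default_rng(0)

def t3_pdf(x):
    # Student t density with 3 degrees of freedom (the proposal q)
    nu = 3
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1 + x**2 / nu) ** (-(nu + 1) / 2)

def norm_pdf(x):
    # standard normal density (the target f)
    return np.exp(-x**2 / 2) / sqrt(2 * pi)

def importance_sampling(h, f_pdf, q_pdf, q_sampler, n=100_000):
    """IS estimate of I = int h(x) f(x) dx from i.i.d. draws of the proposal q."""
    x = q_sampler(n)
    w = f_pdf(x) / q_pdf(x)          # importance ratios {f/q}(x_i)
    return np.mean(h(x) * w)

# assumed toy case: f = N(0,1), q = t_3, h(x) = x^2, so I = 1
est = importance_sampling(lambda x: x**2, norm_pdf, t3_pdf,
                          lambda n: rng.standard_t(3, size=n))
print(est)
```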
things aren’t all rosy...
LLN not sufficient to justify Monte Carlo methods: if
n^{-1} \sum_{i=1}^{n} h(x_i)\, \{f/q\}(x_i)
has an infinite variance, the estimator \hat{I}_{\mathrm{IS}} is useless
[Figure: importance sampling estimation of P(2 ≤ Z ≤ 6), where Z is Cauchy and the importance distribution is normal, compared with the exact value, 0.095]
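A small Python sketch in the spirit of this example, assuming the same P(2 ≤ Z ≤ 6) target with a Cauchy Z and a normal importance distribution; the repeated runs illustrate how unstable the estimates are:

```python
import numpy as np

rng = np.random.default_rng(1)

def cauchy_pdf(x):
    return 1.0 / (np.pi * (1.0 + x**2))

def norm_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def is_tail_prob(n=100_000):
    """IS estimate of P(2 <= Z <= 6) for Z ~ Cauchy, using a N(0,1) proposal.
    The ratios f/q explode where the normal proposal has essentially no mass,
    so the estimates jump around the exact value 0.095 however large n is."""
    x = rng.normal(size=n)
    h = (x >= 2) & (x <= 6)
    w = cauchy_pdf(x) / norm_pdf(x)
    return np.mean(h * w)

print([round(is_tail_prob(), 4) for _ in range(5)])   # erratic replicates
```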
The harmonic mean estimator
Bayesian posterior distribution defined as
π(θ|x) = π(θ)L(θ|x)/m(x)
When \theta_t \sim \pi(\theta|x),
\frac{1}{T} \sum_{t=1}^{T} \frac{1}{L(\theta_t|x)}
is an unbiased estimator of 1/m(x)
[Gelfand & Dey, 1994; Newton & Raftery, 1994]
Highly hazardous material: Most often leads to an infinite
variance!!!
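A Python sketch of the harmonic mean identity on an assumed conjugate toy model (x|θ ∼ N(θ,1), θ ∼ N(0,1)), where the evidence m(x) is available in closed form so the instability can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed conjugate toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1),
# so pi(theta|x) = N(x/2, 1/2) and the evidence m(x) has a closed form
x_obs = 1.0
m_true = np.exp(-x_obs**2 / 4) / np.sqrt(4 * np.pi)

def harmonic_mean_evidence(T=100_000):
    """Harmonic mean estimate of m(x) built from posterior draws (hazardous!)."""
    theta = rng.normal(loc=x_obs / 2, scale=np.sqrt(0.5), size=T)     # theta_t ~ pi(.|x)
    log_lik = -0.5 * (x_obs - theta) ** 2 - 0.5 * np.log(2 * np.pi)   # log L(theta_t|x)
    inv_m_hat = np.mean(np.exp(-log_lik))                             # unbiased for 1/m(x)
    return 1.0 / inv_m_hat

print(harmonic_mean_evidence(), m_true)   # unbiased for 1/m(x), yet highly variable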
“The Worst Monte Carlo Method Ever”
“The good news is that the Law of Large Numbers guarantees that this
estimator is consistent, i.e., it will very likely be very close to the correct
answer if you use a sufficiently large number of points from the posterior
distribution.
The bad news is that the number of points required for this estimator to
get close to the right answer will often be greater than the number of
atoms in the observable universe. The even worse news is that it’s easy
for people to not realize this, and to naïvely accept estimates that are
nowhere close to the correct value of the marginal likelihood.”
[Radford Neal’s blog, Aug. 23, 2008]
Comparison with regular importance sampling
Harmonic mean: Constraint opposed to usual importance sampling
constraints: proposal ϕ(·) must have lighter (rather than fatter)
tails than π(·)L(·) for the approximation
1 \Big/ \frac{1}{T} \sum_{t=1}^{T} \frac{\varphi(\theta_t)}{\pi_k(\theta_t)\, L(\theta_t)}\,, \qquad \theta_t \sim \pi_k(\theta|x)\,,
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
HPD indicator as ϕ
Use the convex hull of MCMC simulations (θt)t=1,...,T
corresponding to the 10% HPD region (easily derived!) and ϕ as
indicator:
\varphi(\theta) = \frac{10}{T} \sum_{t \in \mathrm{HPD}} \mathbb{I}_{d(\theta,\theta_t) \leq \epsilon}
[X & Wraith, 2009]
Bayesian computing (R)evolution
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation
(ABC)
computational jam
In the 1970’s and early 1980’s, theoretical foundations of Bayesian
statistics were sound, but methodology was lagging for lack of
computing tools.
restriction to conjugate priors
limited complexity of models
small sample sizes
The field was desperately in need of a new computing paradigm!
[X & Casella, STS, 2012]
MCMC as in Markov Chain Monte Carlo
Notion that i.i.d. simulation is definitely not necessary, all that
matters is the ergodic theorem
Realization that Markov chains could be used in a wide variety of
situations only came to mainstream statisticians with Gelfand and
Smith (1990) despite earlier publications in the statistical literature
like Hastings (1970) and growing awareness in spatial statistics
(Besag, 1986)
Reasons:
lack of computing machinery
lack of background on Markov chains
lack of trust in the practicality of the method
pre-Gibbs/pre-Hastings era
Early 1970’s, Hammersley, Clifford, and Besag were working on the
specification of joint distributions from conditional distributions
and on necessary and sufficient conditions for the conditional
distributions to be compatible with a joint distribution.
[Hammersley and Clifford, 1971]
“What is the most general form of the conditional
probability functions that define a coherent joint
function? And what will the joint look like?”
[Besag, 1972]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
Joint distribution of vector associated with a dependence graph
must be represented as product of functions over the cliques of the
graphs, i.e., of functions depending only on the components
indexed by the labels in the clique.
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
A probability distribution P with positive and continuous density f
satisfies the pairwise Markov property with respect to an
undirected graph G if and only if it factorizes according to G, i.e.,
(F) ≡ (G)
[Cressie, 1993; Lauritzen, 1996]
Hammersley-Clifford[-Besag] theorem
Theorem (Hammersley-Clifford)
Under the positivity condition, the joint distribution g satisfies
g(y_1, \ldots, y_p) \propto \prod_{j=1}^{p} \frac{g_{\ell_j}(y_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p})}{g_{\ell_j}(y'_{\ell_j} \mid y_{\ell_1}, \ldots, y_{\ell_{j-1}}, y'_{\ell_{j+1}}, \ldots, y'_{\ell_p})}
for every permutation \ell on \{1, 2, \ldots, p\} and every y' \in \mathcal{Y}.
[Cressie, 1993; Lauritzen, 1996]
Clicking in
After Peskun (1973), MCMC mostly dormant in mainstream
statistical world for about 10 years, then several papers/books
highlighted its usefulness in specific settings:
Geman and Geman (1984)
Besag (1986)
Strauss (1986)
Ripley (Stochastic Simulation, 1987)
Tanner and Wong (1987)
Younes (1988)
[Re-]Enters the Gibbs sampler
Geman and Geman (1984), building on
Metropolis et al. (1953), Hastings (1970), and
Peskun (1973), constructed a Gibbs sampler
for optimisation in a discrete image processing
problem with a Gibbs random field without
completion.
Back to Metropolis et al., 1953: the Gibbs
sampler is already in use therein and ergodicity
is proven on the collection of global maxima
Removing the jam
In the early 1990s, researchers found that Gibbs and then
Metropolis–Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991;
Wang & al., 1993, 1994)
generalized linear mixed models (Albert & Chib, 1993)
mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994;
Escobar & West, 1993)
changepoint analysis (Carlin & al., 1992)
point processes (Grenander & Møller, 1994)
&tc
Removing the jam
In the early 1990s, researchers found that Gibbs and then
Metropolis–Hastings algorithms would crack almost any problem!
Flood of papers followed applying MCMC:
genomics (Stephens & Smith, 1993; Lawrence & al., 1993;
Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly,
2000)
ecology (George & X, 1992)
variable selection in regression (George & McCulloch, 1993; Green,
1995; Chen & al., 2000)
spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993)
longitudinal studies (Lange & al., 1992)
&tc
MCMC and beyond
reversible jump MCMC which impacted considerably Bayesian model
choice (Green, 1995)
adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal,
2009)
exact approximations to targets (Tanner & Wong, 1987; Beaumont,
2003; Andrieu & Roberts, 2009)
comp’al stats catching up with comp’al physics: free energy sampling
(e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead,
2011)
sequential Monte Carlo (SMC) for non-sequential problems (Chopin,
2002; Neal, 2001; Del Moral et al 2006)
retrospective sampling
intractability: EP – GIMH – PMCMC – SMC2
– INLA
QMC[MC] (Owen, 2011)
Particles
Iterating/sequential importance sampling is about as old as Monte
Carlo methods themselves!
[Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955]
Found in the molecular simulation literature of the 50’s with
self-avoiding random walks and signal processing
[Marshall, 1965; Handschin and Mayne, 1969]
Use of the term “particle” dates back to Kitagawa (1996), and Carpenter
et al. (1997) coined the term “particle filter”.
pMC & pMCMC
Recycling of past simulations legitimate to build better
importance sampling functions as in population Monte Carlo
[Iba, 2000; Cappé et al, 2004; Del Moral et al., 2007]
synthesis by Andrieu, Doucet, and Holenstein (2010) using
particles to build an evolving MCMC kernel \hat{p}_\theta(y_{1:T}) in state
space models p(x_{1:T})\, p(y_{1:T}|x_{1:T})
importance sampling on discretely observed diffusions
[Beskos et al., 2006; Fearnhead et al., 2008, 2010]
Metropolis-Hastings revisited
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Reinterpretation and
Rao-Blackwellisation
Russian roulette
Approximate Bayesian computation
(ABC)
Metropolis Hastings algorithm
1. We wish to approximate
I = \frac{\int h(x)\,\pi(x)\,dx}{\int \pi(x)\,dx} = \int h(x)\,\bar\pi(x)\,dx
2. \pi(x) is known but not \int \pi(x)\,dx.
3. Approximate I with \delta = \frac{1}{n}\sum_{t=1}^{n} h(x^{(t)}) where (x^{(t)}) is a
Markov chain with limiting distribution \bar\pi.
4. Convergence obtained from Law of Large Numbers or CLT for
Markov chains.
Metropolis–Hastings Algorithm
Suppose that x^{(t)} is drawn.
1. Simulate y_t ∼ q(·|x^{(t)}).
2. Set x^{(t+1)} = y_t with probability
\alpha(x^{(t)}, y_t) = \min\left\{1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\, \frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\}
Otherwise, set x^{(t+1)} = x^{(t)}.
3. α is such that the detailed balance equation is satisfied:
\pi(x)\, q(y|x)\, \alpha(x, y) = \pi(y)\, q(x|y)\, \alpha(y, x)\,.
\bar\pi is the stationary distribution of (x^{(t)}).
The accepted candidates are simulated with the rejection
algorithm.
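A minimal random-walk Metropolis–Hastings sketch in Python, assuming a one-dimensional target known up to a constant (here a standard normal for checking purposes); the symmetric proposal makes the q-ratio cancel in the acceptance probability:

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolis_hastings(log_pi, x0, n_iter=10_000, scale=1.0):
    """Random-walk Metropolis-Hastings targeting an unnormalised density pi:
    only the ratio pi(y)/pi(x) is needed, so log_pi can be known up to a constant."""
    x = x0
    chain = np.empty(n_iter)
    for t in range(n_iter):
        y = x + scale * rng.normal()               # symmetric proposal q(y|x)
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):   # accept w.p. min(1, ratio)
            x = y
        chain[t] = x                               # on rejection, x^{(t+1)} = x^{(t)}
    return chain

# assumed toy target: standard normal, so the chain mean should be near 0
chain = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
print(chain.mean())
```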
Some properties of the HM algorithm
Alternative representation of the estimator δ is
\delta = \frac{1}{n}\sum_{t=1}^{n} h(x^{(t)}) = \frac{1}{n}\sum_{i=1}^{M_n} n_i\, h(z_i)\,,
where
the z_i's are the accepted y_j's,
M_n is the number of accepted y_j's till time n,
n_i is the number of times z_i appears in the sequence (x^{(t)})_t.
The ”accepted candidates”
\tilde q(\cdot|z_i) = \frac{\alpha(z_i, \cdot)\, q(\cdot|z_i)}{p(z_i)} \leq \frac{q(\cdot|z_i)}{p(z_i)}\,,
where p(z_i) = \int \alpha(z_i, y)\, q(y|z_i)\, dy. To simulate from \tilde q(\cdot|z_i):
1. Propose a candidate y ∼ q(·|z_i)
2. Accept with probability
\frac{\tilde q(y|z_i)}{q(y|z_i)/p(z_i)} = \alpha(z_i, y)
Otherwise, reject it and start again.
This is the transition of the HM algorithm. The transition kernel
\tilde q enjoys \tilde\pi as a stationary distribution:
\tilde\pi(x)\,\tilde q(y|x) = \tilde\pi(y)\,\tilde q(x|y)\,,
”accepted” Markov chain
Lemma (Douc & X., AoS, 2011)
The sequence (zi , ni ) satisfies
1. (zi , ni )i is a Markov chain;
2. zi+1 and ni are independent given zi ;
3. n_i is distributed as a geometric random variable with
probability parameter
p(z_i) := \int \alpha(z_i, y)\, q(y|z_i)\, dy\,; \qquad (1)
4. (z_i)_i is a Markov chain with transition kernel
\tilde Q(z, dy) = \tilde q(y|z)\,dy and stationary distribution \tilde\pi such that
\tilde q(\cdot|z) \propto \alpha(z, \cdot)\, q(\cdot|z) \quad\text{and}\quad \tilde\pi(\cdot) \propto \pi(\cdot)\, p(\cdot)\,.
Importance sampling perspective
1. A natural idea:
\delta^* = \frac{1}{n}\sum_{i=1}^{M_n} \frac{h(z_i)}{p(z_i)}\,,
Importance sampling perspective
1. A natural idea:
\delta^* = \frac{\sum_{i=1}^{M_n} h(z_i)/p(z_i)}{\sum_{i=1}^{M_n} 1/p(z_i)} = \frac{\sum_{i=1}^{M_n} \{\pi(z_i)/\tilde\pi(z_i)\}\, h(z_i)}{\sum_{i=1}^{M_n} \pi(z_i)/\tilde\pi(z_i)}\,.
2. But p not available in closed form.
3. The geometric n_i is the replacement, an obvious solution that
is used in the original Metropolis–Hastings estimate since
E[n_i] = 1/p(z_i).
The Bernoulli factory
The crude estimate of 1/p(z_i),
n_i = 1 + \sum_{j=1}^{\infty}\ \prod_{\ell \leq j} \mathbb{I}\{u_\ell \geq \alpha(z_i, y_\ell)\}\,,
can be improved:
Lemma (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i), the quantity
\hat\xi_i = 1 + \sum_{j=1}^{\infty}\ \prod_{\ell \leq j} \{1 - \alpha(z_i, y_\ell)\}
is an unbiased estimator of 1/p(z_i) whose variance, conditional on
z_i, is lower than the conditional variance of n_i, \{1 - p(z_i)\}/p^2(z_i).
Rao-Blackwellised, for sure?
\hat\xi_i = 1 + \sum_{j=1}^{\infty}\ \prod_{\ell \leq j} \{1 - \alpha(z_i, y_\ell)\}
1. Infinite sum but finite with at least positive probability:
\alpha(x^{(t)}, y_t) = \min\left\{1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\, \frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\}
For example: take a symmetric random walk as a proposal.
2. What if we wish to be sure that the sum is finite?
Finite horizon k version:
\hat\xi^k_i = 1 + \sum_{j=1}^{\infty}\ \prod_{\ell=1}^{k\wedge j} \{1 - \alpha(z_i, y_\ell)\}\ \prod_{\ell=k+1}^{j} \mathbb{I}\{u_\ell \geq \alpha(z_i, y_\ell)\}
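A Python sketch of the Rao-Blackwellised estimate ξ̂ of 1/p(z), assuming a symmetric random-walk proposal and a standard normal target; the infinite sum is stopped once the running product becomes negligible, a practical shortcut rather than part of the exact estimator:

```python
import numpy as np

rng = np.random.default_rng(4)

def xi_hat(z, propose, alpha, max_terms=10_000, tol=1e-12):
    """Rao-Blackwellised estimate of 1/p(z):
        xi_hat = 1 + sum_{j>=1} prod_{l<=j} (1 - alpha(z, y_l)),   y_l ~ q(.|z).
    The sum is truncated once the running product drops below `tol`
    (a practical shortcut, not part of the exact estimator)."""
    total, prod = 1.0, 1.0
    for _ in range(max_terms):
        y = propose(z)                     # fresh proposal y_l ~ q(.|z)
        prod *= 1.0 - alpha(z, y)          # product of rejection probabilities
        total += prod
        if prod < tol:
            break
    return total

# assumed toy setting: standard normal target, symmetric random-walk proposal
log_pi = lambda x: -0.5 * x**2
propose = lambda z: z + rng.normal()
alpha = lambda z, y: min(1.0, np.exp(log_pi(y) - log_pi(z)))

print(xi_hat(0.0, propose, alpha))         # estimates 1/p(z) at z = 0 (about 1.4 on average)
```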
which Bernoulli factory?!
Not the spice warehouse of Leon Bernoulli!
Query:
Given an algorithm delivering iid B(p) rv's, is it possible to derive
an algorithm delivering iid B(f(p)) rv's when f is known and p
unknown?
[von Neumann, 1951; Keane & O'Brien, 1994]
existence (e.g., impossible for f(p) = min(2p, 1))
condition: for some n,
\min\{f(p), 1 - f(p)\} \geq \min\{p, 1 - p\}^n
implementation (polynomial vs. exponential time)
use of sandwiching polynomials/power series
Variance improvement
Theorem (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an
iid uniform sequence, for any k ≥ 0, the quantity
\hat\xi^k_i = 1 + \sum_{j=1}^{\infty}\ \prod_{\ell=1}^{k\wedge j} \{1 - \alpha(z_i, y_\ell)\}\ \prod_{\ell=k+1}^{j} \mathbb{I}\{u_\ell \geq \alpha(z_i, y_\ell)\}
is an unbiased estimator of 1/p(z_i) with an almost surely finite
number of terms. Moreover, for k ≥ 1,
\mathbb{V}\big[\hat\xi^k_i \mid z_i\big] = \frac{1 - p(z_i)}{p^2(z_i)} - \frac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)}\, \frac{2 - p(z_i)}{p^2(z_i)}\, \big(p(z_i) - r(z_i)\big)\,,
where p(z_i) := \int \alpha(z_i, y)\, q(y|z_i)\, dy and r(z_i) := \int \alpha^2(z_i, y)\, q(y|z_i)\, dy.
Therefore, we have
\mathbb{V}\big[\hat\xi_i \mid z_i\big] \leq \mathbb{V}\big[\hat\xi^k_i \mid z_i\big] \leq \mathbb{V}\big[\hat\xi^0_i \mid z_i\big] = \mathbb{V}[n_i \mid z_i]\,.
B motivation for Russian roulette
prior π(θ), data density p(y|θ) = f(y; θ)/Z(θ) with
Z(\theta) = \int f(x; \theta)\, dx
intractable (e.g., Ising spin model, MRF, diffusion processes,
networks, &tc)
doubly-intractable posterior follows as
\pi(\theta|y) = p(y|\theta) \times \pi(\theta) \times \frac{1}{Z(y)} = \frac{f(y; \theta)}{Z(\theta)} \times \pi(\theta) \times \frac{1}{Z(y)}
where Z(y) = \int p(y|\theta)\, \pi(\theta)\, d\theta
both Z(θ) and Z(y) are intractable, with massively different
consequences
[thanks to Mark Girolami for his Russian slides!]
B motivation for Russian roulette
If Z(θ) is intractable, the Metropolis–Hastings acceptance
probability
\alpha(\theta', \theta) = \min\left\{1,\ \frac{f(y; \theta')\,\pi(\theta')}{f(y; \theta)\,\pi(\theta)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)} \times \frac{Z(\theta)}{Z(\theta')}\right\}
is not available
Use instead biased approximations, e.g. pseudo-likelihoods or
plugin \hat Z(\theta') estimates, without sacrificing exactness of MCMC
Existing solution
Unbiased plugin estimate
\frac{Z(\theta)}{Z(\theta')} \approx \frac{f(x; \theta)}{f(x; \theta')} \quad\text{where}\quad x \sim \frac{f(x; \theta')}{Z(\theta')}
[Møller et al, Bka, 2006; Murray et al, 2006]
auxiliary variable method
removes Z(θ)/Z(θ') from the picture
requires simulations from the model (e.g., via perfect sampling)
Exact approximate methods
Pseudo-marginal construction that allows for the use of unbiased,
positive estimates of the target in the acceptance probability
\alpha(\theta', \theta) = \min\left\{1,\ \frac{\hat\pi(\theta'|y)}{\hat\pi(\theta|y)} \times \frac{q(\theta|\theta')}{q(\theta'|\theta)}\right\}
[Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al, 2012]
Transition kernel has invariant distribution with exact target
density π(θ|y)
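A Python sketch of the pseudo-marginal construction, assuming a positive unbiased estimate π̂ of an (unnormalised) standard normal target obtained by multiplying the exact density with mean-one gamma noise:

```python
import numpy as np

rng = np.random.default_rng(7)

def pseudo_marginal_mh(pi_hat, theta0, n_iter=5_000, scale=0.5):
    """Pseudo-marginal Metropolis-Hastings: the target density is only available
    through an unbiased positive estimate pi_hat, refreshed at each proposal; the
    current estimate travels along with the current state."""
    theta, est = theta0, pi_hat(theta0)
    chain = np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + scale * rng.normal()       # symmetric random-walk proposal
        prop_est = pi_hat(prop)                   # fresh unbiased estimate at prop
        if rng.uniform() < prop_est / est:        # min(1, ratio) acceptance
            theta, est = prop, prop_est
        chain[t] = theta
    return chain

# assumed toy check: unbiased noisy estimate of an unnormalised N(0,1) density
def pi_hat(theta):
    noise = rng.gamma(shape=20.0, scale=1.0 / 20.0)   # positive, mean-one noise
    return np.exp(-0.5 * theta**2) * noise

print(pseudo_marginal_mh(pi_hat, 0.0).std())          # near 1, the target's std
```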
Infinite series estimator
For each (θ, y), construct rv's \{V^{(j)}_\theta,\ j \geq 0\} such that
\hat\pi(\theta, \{V^{(j)}_\theta\}|y) := \sum_{j=0}^{\infty} V^{(j)}_\theta
is a.s. finite with finite expectation
\mathbb{E}\big[\hat\pi(\theta, \{V^{(j)}_\theta\}|y)\big] = \pi(\theta|y)
Introduce a random stopping time \tau_\theta such that, with
\xi := (\tau_\theta, \{V^{(j)}_\theta,\ 0 \leq j \leq \tau_\theta\}), the estimate
\hat\pi(\theta, \xi|y) := \sum_{j=0}^{\tau_\theta} V^{(j)}_\theta
satisfies
\mathbb{E}\big[\hat\pi(\theta, \xi|y) \mid \{V^{(j)}_\theta,\ j \geq 0\}\big] = \hat\pi(\theta, \{V^{(j)}_\theta\}|y)
Infinite series estimator
Warning: the unbiased estimate \hat\pi(\theta, \xi|y) obtained from the series
construction comes with no general guarantee of positivity
Russian roulette
Method that requires unbiased truncation of a series
S(\theta) = \sum_{i=0}^{\infty} \phi_i(\theta)
Russian roulette employed extensively in simulation of neutron
scattering and computer graphics
Assign probabilities \{q_j,\ j \geq 1\}, q_j \in (0, 1], and generate
U(0, 1) i.i.d. r.v.'s \{U_j,\ j \geq 1\}
Find the first time k \geq 1 such that U_k \geq q_k
Russian roulette estimate of S(θ) is
\hat S(\theta) = \sum_{j=0}^{k} \phi_j(\theta) \Big/ \prod_{i=1}^{j-1} q_i\,,
If \lim_{n\to\infty} \prod_{j=1}^{n} q_j = 0, Russian roulette terminates with
probability one
[Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
Russian roulette
\mathbb{E}\{\hat S(\theta)\} = S(\theta)
variance finite under certain known conditions
[Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
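A Python sketch of the Russian roulette truncation on a toy geometric series (an assumption for illustration; the indexing of the survival probabilities may differ slightly from the slides' convention):

```python
import numpy as np

rng = np.random.default_rng(5)

def russian_roulette(phi, q, max_terms=10_000):
    """Unbiased truncation of S = sum_{j>=0} phi(j) using survival probabilities q(j):
    term j is kept only while every roulette draw so far survives, and is then
    reweighted by the inverse of its survival probability, so that E[S_hat] = S."""
    s_hat = phi(0)
    survival = 1.0
    for j in range(1, max_terms):
        if rng.uniform() >= q(j):          # the roulette kills the series here
            break
        survival *= q(j)                   # probability of having reached term j
        s_hat += phi(j) / survival
    return s_hat

# assumed toy series: S = sum_j 0.5**j = 2, with constant survival probability 0.8
draws = [russian_roulette(lambda j: 0.5**j, lambda j: 0.8) for _ in range(20_000)]
print(np.mean(draws))                      # close to 2, illustrating unbiasedness
```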
towards ever more complexity
Bernoulli, Jakob (1654–1705)
MCMC connected steps
Metropolis-Hastings revisited
Approximate Bayesian computation
(ABC)
New challenges
Novel statistical issues that force a different Bayesian answer:
very large datasets
complex or unknown dependence structures with maybe p ≫ n
multiple and involved random effects
missing data structures containing most of the information
sequential structures involving most of the above
New paradigm?
“Surprisingly, the confident prediction of the previous
generation that Bayesian methods would ultimately supplant
frequentist methods has given way to a realization that Markov
chain Monte Carlo (MCMC) may be too slow to handle
modern data sets. Size matters because large data sets stress
computer storage and processing power to the breaking point.
The most successful compromises between Bayesian and
frequentist methods now rely on penalization and
optimization.”
[Lange et al., ISR, 2013]
New paradigm?
sad reality constraint that
size does matter
focus on much smaller
dimensions and on sparse
summaries
many (fast if non-Bayesian)
ways of producing those
summaries
Bayesian inference can kick
in almost automatically at
this stage
Approximate Bayesian computation (ABC)
Case of a well-defined statistical model where the likelihood
function
\ell(\theta|y) = f(y_1, \ldots, y_n|\theta)
is out of reach!
Empirical approximations to the original
Bayesian inference problem
Degrading the data precision down
to a tolerance ε
Replacing the likelihood with a
non-parametric approximation
Summarising/replacing the data
with insufficient statistics
ABC methodology
Bayesian setting: target is π(θ)f (x|θ)
When likelihood f (x|θ) not in closed form, likelihood-free rejection
technique:
Foundation
For an observation y ∼ f (y|θ), under the prior π(θ), if one keeps
jointly simulating
θ′ ∼ π(θ) , z ∼ f(z|θ′) ,
until the auxiliary variable z is equal to the observed value, z = y,
then the selected
θ′ ∼ π(θ|y)
[Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
ABC algorithm
In most implementations, degree of approximation:
Algorithm 1 Likelihood-free rejection sampler
for i = 1 to N do
repeat
generate θ′ from the prior distribution π(·)
generate z from the likelihood f(·|θ′)
until ρ{η(z), η(y)} ≤ ε
set θ_i = θ′
end for
where η(y) defines a (not necessarily sufficient) statistic
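A Python sketch of this likelihood-free rejection sampler on an assumed toy normal-mean model, with the sample mean as summary statistic η (here actually sufficient) and a Euclidean distance ρ:

```python
import numpy as np

rng = np.random.default_rng(6)

def abc_rejection(y_obs, prior_sampler, simulator, summary, eps, n_keep=1_000):
    """Likelihood-free rejection sampler: keep theta' whenever the summary of its
    pseudo-data z falls within eps of the observed summary (Euclidean rho)."""
    eta_obs = summary(y_obs)
    accepted = []
    while len(accepted) < n_keep:
        theta = prior_sampler()                       # theta' ~ pi(.)
        z = simulator(theta)                          # z ~ f(.|theta')
        if np.linalg.norm(summary(z) - eta_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# assumed toy model: y ~ N(theta, 1) with n = 50, theta ~ N(0, 10), eta = sample mean
y_obs = rng.normal(loc=2.0, scale=1.0, size=50)
post = abc_rejection(
    y_obs,
    prior_sampler=lambda: rng.normal(0.0, np.sqrt(10.0)),
    simulator=lambda th: rng.normal(th, 1.0, size=y_obs.size),
    summary=lambda y: np.atleast_1d(y.mean()),
    eps=0.1,
)
print(post.mean())   # close to the exact posterior mean when eps is small
```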
Comments
role of distance paramount
(because ε ≠ 0)
scaling of components of η(y) also
capital
ε matters little if ε “small enough”
representative of “curse of
dimensionality”
small is beautiful!, i.e. data as a
whole may be weakly informative
for ABC
non-parametric method at core
ABC simulation advances
Simulating from the prior is often poor in efficiency
Either modify the proposal distribution on θ to increase the density
of x’s within the vicinity of y...
[Marjoram et al, 2003; Beaumont et al., 2009, Del Moral et al., 2012]
...or by viewing the problem as a conditional density estimation problem
and by developing techniques to allow for larger ε
[Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013]
...or even by including ε in the inferential framework [ABCµ]
[Ratmann et al., 2009]
ABC as an inference machine
Starting point is summary statistic
η(y), either chosen for computational
realism or imposed by external
constraints
ABC can produce a distribution on the parameter of interest
conditional on this summary statistic η(y)
inference based on ABC may be consistent or not, so it needs
to be validated on its own
the choice of the tolerance level is dictated by both
computational and convergence constraints
How Bayesian aBc is..?
At best, ABC approximates π(θ|η(y)):
approximation error unknown (w/o massive simulation)
pragmatic or empirical Bayes (there is no other solution!)
many calibration issues (tolerance, distance, statistics)
the NP side should be incorporated into the whole Bayesian
picture
the approximation error should also be part of the Bayesian
inference
Noisy ABC
ABC approximation error (under non-zero tolerance ε) replaced
with exact simulation from a controlled approximation to the
target, the convolution of the true posterior with a kernel function
\pi_\epsilon(\theta, z|y) = \frac{\pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)}{\int \pi(\theta)\, f(z|\theta)\, K_\epsilon(y - z)\, dz\, d\theta}\,,
with K_\epsilon a kernel parameterised by bandwidth ε.
[Wilkinson, 2013]
Theorem
The ABC algorithm based on a randomised observation y = \tilde y + \xi,
\xi \sim K_\epsilon, and an acceptance probability of
K_\epsilon(y - z)/M
gives draws from the posterior distribution π(θ|y).
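A Python sketch of this noisy ABC scheme with an assumed uniform kernel K_ε on [−ε, ε], for which the acceptance probability K_ε(y − z)/M reduces to the indicator |y − z| ≤ ε; the toy model is again a scalar normal-mean problem:

```python
import numpy as np

rng = np.random.default_rng(8)

def noisy_abc(y_obs, prior_sampler, simulator, eps, n_keep=1_000):
    """Noisy ABC sketch with a uniform kernel K_eps on [-eps, eps]: the observation
    is randomised once, y = y_obs + xi with xi ~ K_eps, and proposals are accepted
    with probability K_eps(y - z)/M, i.e. the indicator |y - z| <= eps here."""
    y = y_obs + rng.uniform(-eps, eps)           # randomised observation
    kept = []
    while len(kept) < n_keep:
        theta = prior_sampler()
        z = simulator(theta)
        if abs(y - z) <= eps:                    # K_eps(y - z)/M for the uniform kernel
            kept.append(theta)
    return np.array(kept)

# assumed toy model: scalar observation z ~ N(theta, 1), theta ~ N(0, 10)
y_obs = 1.5
draws = noisy_abc(y_obs,
                  prior_sampler=lambda: rng.normal(0.0, np.sqrt(10.0)),
                  simulator=lambda th: rng.normal(th, 1.0),
                  eps=0.2)
print(draws.mean())
```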
Which summary?
Fundamental difficulty of the choice of the summary statistic when
there is no non-trivial sufficient statistics [except when done by the
experimenters in the field]
Loss of statistical information balanced against gain in data
roughening
Approximation error and information loss remain unknown
Choice of statistics induces choice of distance function
towards standardisation
borrowing tools from data analysis (LDA) machine learning
[Estoup et al., ME, 2012]
Which summary?
Fundamental difficulty of the choice of the summary statistic when
there is no non-trivial sufficient statistics [except when done by the
experimenters in the field]
may be imposed for external/practical reasons
may gather several non-B point estimates
we can learn about efficient combination
distance can be provided by estimation techniques
Which summary for model choice?
‘This is also why focus on model discrimination typically (...)
proceeds by (...) accepting that the Bayes Factor that one obtains
is only derived from the summary statistics and may in no way
correspond to that of the full model.’
[S. Sisson, Jan. 31, 2011, xianblog]
Depending on the choice of η(·), the Bayes factor based on this
insufficient statistic,
B^\eta_{12}(y) = \frac{\int \pi_1(\theta_1)\, f^\eta_1(\eta(y)|\theta_1)\, d\theta_1}{\int \pi_2(\theta_2)\, f^\eta_2(\eta(y)|\theta_2)\, d\theta_2}\,,
is either consistent or not
[X et al., PNAS, 2012]
Which summary for model choice?
[Figure: two boxplot panels comparing the Gauss and Laplace models, n = 100]
Selecting proper summaries
Consistency only depends on the range of
\mu_i(\theta) = \mathbb{E}_i[\eta(y)]
under both models against the asymptotic mean \mu_0 of \eta(y)
Theorem
If P^n belongs to one of the two models and if \mu_0 cannot be
attained by the other one:
0 = \min\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big)
< \max\big(\inf\{|\mu_0 - \mu_i(\theta_i)|;\ \theta_i \in \Theta_i\},\ i = 1, 2\big)\,,
then the Bayes factor B^\eta_{12} is consistent
[Marin et al., JRSS B, 2013]
Selecting proper summaries
[Figure: boxplot panels of summary-statistic behaviour under models M1 and M2]
[Marin et al., JRSS B, 2013]
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxleah joy valeriano
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 

Último (20)

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 

Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013

  • 13. The harmonic mean estimator Bayesian posterior distribution defined as π(θ|x) = π(θ)L(θ|x)/m(x) When θt ∼ π(θ|x), (1/T) ∑_{t=1}^T 1/L(θt|x) is an unbiased estimator of 1/m(x) [Gelfand & Dey, 1994; Newton & Raftery, 1994] Highly hazardous material: Most often leads to an infinite variance!!!
  • 15. “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
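The instability is easy to reproduce. Below is a small sketch, not part of the original slides, in a toy conjugate normal model where m(x) is available in closed form; the prior, sample size, and seed are my own choices.

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)                  # data x_i ~ N(theta, 1)
n, S, prior_var = len(x), x.sum(), 100.0           # prior theta ~ N(0, 10^2)
post_var = 1.0 / (n + 1.0 / prior_var)
post_mean = post_var * S

# exact log m(x) from normal-normal conjugacy
log_m_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n * prior_var + 1.0)
               - 0.5 * np.sum(x ** 2) + S ** 2 / (2.0 * (n + 1.0 / prior_var)))

# harmonic mean: 1/m(x) estimated by the average of 1/L(theta_t|x), theta_t ~ posterior
T = 100_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)
loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)
log_m_hm = np.log(T) - logsumexp(-loglik)
print("exact log m(x):", log_m_exact)
print("harmonic mean estimate:", log_m_hm, " (unstable: rerun with other seeds)")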
  • 16. Comparison with regular importance sampling Harmonic mean: Constraint opposed to usual importance sampling constraints: proposal ϕ(·) must have lighter (rather than fatter) tails than π(·)L(·) for the approximation 1 / [ (1/T) ∑_{t=1}^T ϕ(θt) / {πk(θt)L(θt)} ], θt ∼ πk(θ|x), to have a finite variance. E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
  • 18. HPD indicator as ϕ Use the convex hull of MCMC simulations (θt)t=1,...,T corresponding to the 10% HPD region (easily derived!) and ϕ as indicator: ϕ(θ) = 10 T t∈HPD Id(θ,θt ) [X & Wraith, 2009]
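A companion sketch with toy choices of my own: the same conjugate normal model as above, but now estimating 1/m(x) with a finite-support ϕ, here a uniform density over a central credible interval standing in for the HPD indicator of the slide, so that the weights ϕ/(π L) stay bounded.

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(10)
x = rng.normal(1.0, 1.0, size=20)                  # data x_i ~ N(theta, 1)
n, S, prior_var = len(x), x.sum(), 100.0           # prior theta ~ N(0, 10^2)
post_var = 1.0 / (n + 1.0 / prior_var)
post_mean = post_var * S

T = 50_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)   # posterior draws
loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((x[None, :] - theta[:, None]) ** 2).sum(axis=1)
logprior = -0.5 * np.log(2 * np.pi * prior_var) - 0.5 * theta ** 2 / prior_var

lo, hi = np.quantile(theta, [0.25, 0.75])          # finite-support phi: uniform on [lo, hi]
inside = (theta >= lo) & (theta <= hi)
log_phi = -np.log(hi - lo)
# 1/m(x) estimate: average of phi/(prior * likelihood) over all T posterior draws
log_inv_m = logsumexp(log_phi - logprior[inside] - loglik[inside]) - np.log(T)

log_m_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n * prior_var + 1.0)
               - 0.5 * np.sum(x ** 2) + S ** 2 / (2.0 * (n + 1.0 / prior_var)))
print("reverse-IS log m(x):", -log_inv_m, "  exact:", log_m_exact)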
  • 19. Bayesian computing (R)evolution Bernoulli, Jakob (1654–1705) MCMC connected steps Metropolis-Hastings revisited Approximate Bayesian computation (ABC)
  • 20. computational jam In the 1970’s and early 1980’s, theoretical foundations of Bayesian statistics were sound, but methodology was lagging for lack of computing tools. restriction to conjugate priors limited complexity of models small sample sizes The field was desperately in need of a new computing paradigm! [X & Casella, STS, 2012]
  • 21. MCMC as in Markov Chain Monte Carlo Notion that i.i.d. simulation is definitely not necessary, all that matters is the ergodic theorem Realization that Markov chains could be used in a wide variety of situations only came to mainstream statisticians with Gelfand and Smith (1990) despite earlier publications in the statistical literature like Hastings (1970) and growing awareness in spatial statistics (Besag, 1986) Reasons: lack of computing machinery lack of background on Markov chains lack of trust in the practicality of the method
  • 22. pre-Gibbs/pre-Hastings era Early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. [Hammersley and Clifford, 1971]
  • 23. pre-Gibbs/pre-Hastings era Early 1970’s, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. “What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?” [Besag, 1972]
  • 24. Hammersley-Clifford[-Besag] theorem Theorem (Hammersley-Clifford) Joint distribution of vector associated with a dependence graph must be represented as product of functions over the cliques of the graphs, i.e., of functions depending only on the components indexed by the labels in the clique. [Cressie, 1993; Lauritzen, 1996]
  • 25. Hammersley-Clifford[-Besag] theorem Theorem (Hammersley-Clifford) A probability distribution P with positive and continuous density f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes according to G, i.e., (F) ≡ (G) [Cressie, 1993; Lauritzen, 1996]
  • 26. Hammersley-Clifford[-Besag] theorem Theorem (Hammersley-Clifford) Under the positivity condition, the joint distribution g satisfies g(y1, . . . , yp) ∝ ∏_{j=1}^p [ g_{ℓj}(y_{ℓj} | y_{ℓ1}, . . . , y_{ℓj−1}, y′_{ℓj+1}, . . . , y′_{ℓp}) / g_{ℓj}(y′_{ℓj} | y_{ℓ1}, . . . , y_{ℓj−1}, y′_{ℓj+1}, . . . , y′_{ℓp}) ] for every permutation ℓ on {1, 2, . . . , p} and every fixed y′ ∈ Y. [Cressie, 1993; Lauritzen, 1996]
  • 27. Clicking in After Peskun (1973), MCMC mostly dormant in mainstream statistical world for about 10 years, then several papers/books highlighted its usefulness in specific settings: Geman and Geman (1984) Besag (1986) Strauss (1986) Ripley (Stochastic Simulation, 1987) Tanner and Wong (1987) Younes (1988)
  • 28. [Re-]Enters the Gibbs sampler Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field without completion. Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein and ergodicity is proven on the collection of global maxima
  • 29. [Re-]Enters the Gibbs sampler Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), constructed a Gibbs sampler for optimisation in a discrete image processing problem with a Gibbs random field without completion. Back to Metropolis et al., 1953: the Gibbs sampler is already in use therein and ergodicity is proven on the collection of global maxima
  • 30. Removing the jam In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem! Flood of papers followed applying MCMC: linear mixed models (Gelfand & al., 1990; Zeger & Karim, 1991; Wang & al., 1993, 1994) generalized linear mixed models (Albert & Chib, 1993) mixture models (Tanner & Wong, 1987; Diebolt & X., 1990, 1994; Escobar & West, 1993) changepoint analysis (Carlin & al., 1992) point processes (Grenander & Møller, 1994) &tc
  • 31. Removing the jam In the early 1990s, researchers found that Gibbs and then Metropolis–Hastings algorithms would crack almost any problem! Flood of papers followed applying MCMC: genomics (Stephens & Smith, 1993; Lawrence & al., 1993; Churchill, 1995; Geyer & Thompson, 1995; Stephens & Donnelly, 2000) ecology (George & X, 1992) variable selection in regression (George & McCulloch, 1993; Green, 1995; Chen & al., 2000) spatial statistics (Raftery & Banfield, 1991; Besag & Green, 1993) longitudinal studies (Lange & al., 1992) &tc
  • 32. MCMC and beyond reversible jump MCMC which impacted considerably Bayesian model choice (Green, 1995) adaptive MCMC algorithms (Haario & al., 1999; Roberts & Rosenthal, 2009) exact approximations to targets (Tanner & Wong, 1987; Beaumont, 2003; Andrieu & Roberts, 2009) comp’al stats catching up with comp’al physics: free energy sampling (e.g., Wang-Landau), Hamiltonian Monte Carlo (Girolami & Calderhead, 2011) sequential Monte Carlo (SMC) for non-sequential problems (Chopin, 2002; Neal, 2001; Del Moral et al 2006) retrospective sampling intractability: EP – GIMH – PMCMC – SMC2 – INLA QMC[MC] (Owen, 2011)
  • 33. Particles Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves! [Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955] Found in the molecular simulation literature of the 50’s with self-avoiding random walks and signal processing [Marshall, 1965; Handschin and Mayne, 1969] Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.
  • 34. Particles Iterating/sequential importance sampling is about as old as Monte Carlo methods themselves! [Hammersley and Morton,1954; Rosenbluth and Rosenbluth, 1955] Found in the molecular simulation literature of the 50’s with self-avoiding random walks and signal processing [Marshall, 1965; Handschin and Mayne, 1969] Use of the term “particle” dates back to Kitagawa (1996), and Carpenter et al. (1997) coined the term “particle filter”.
  • 35. pMC & pMCMC Recycling of past simulations legitimate to build better importance sampling functions as in population Monte Carlo [Iba, 2000; Cappé et al, 2004; Del Moral et al., 2007] synthesis by Andrieu, Doucet, and Holenstein (2010) using particles to build an evolving MCMC kernel p̂θ(y1:T) in state space models p(x1:T)p(y1:T|x1:T) importance sampling on discretely observed diffusions [Beskos et al., 2006; Fearnhead et al., 2008, 2010]
  • 36. Metropolis-Hastings revisited Bernoulli, Jakob (1654–1705) MCMC connected steps Metropolis-Hastings revisited Reinterpretation and Rao-Blackwellisation Russian roulette Approximate Bayesian computation (ABC)
  • 37. Metropolis–Hastings algorithm 1. We wish to approximate I = ∫ h(x)π(x)dx / ∫ π(x)dx = ∫ h(x)π̄(x)dx 2. π(x) is known but not ∫ π(x)dx. 3. Approximate I with δ = (1/n) ∑_{t=1}^n h(x(t)) where (x(t)) is a Markov chain with limiting distribution π̄. 4. Convergence obtained from Law of Large Numbers or CLT for Markov chains.
  • 38. Metropolis–Hastings Algorithm Suppose that x(t) is drawn. 1. Simulate yt ∼ q(·|x(t)). 2. Set x(t+1) = yt with probability α(x(t), yt) = min{ 1, [π(yt)/π(x(t))] [q(x(t)|yt)/q(yt|x(t))] } Otherwise, set x(t+1) = x(t). 3. α is such that the detailed balance equation is satisfied: π(x)q(y|x)α(x, y) = π(y)q(x|y)α(y, x). π̄ is the stationary distribution of (x(t)). The accepted candidates are simulated with the rejection algorithm.
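A minimal random-walk Metropolis–Hastings sketch, with a target and tuning of my own choosing (not from the slides); the symmetric proposal makes the q-ratio cancel in the acceptance probability.

import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # unnormalised log-density: a two-component normal mixture with modes at -2 and 2
    return np.logaddexp(-0.5 * (x + 2.0) ** 2, -0.5 * (x - 2.0) ** 2)

def rw_metropolis(log_target, x0, n_iter, step):
    chain = np.empty(n_iter)
    x, lx = x0, log_target(x0)
    for t in range(n_iter):
        y = x + step * rng.normal()            # symmetric random-walk proposal
        ly = log_target(y)
        if np.log(rng.random()) < ly - lx:     # alpha = min(1, pi(y)/pi(x))
            x, lx = y, ly
        chain[t] = x
    return chain

chain = rw_metropolis(log_target, x0=0.0, n_iter=50_000, step=2.5)
print("posterior mean estimate:", chain.mean())     # should be near 0 by symmetry
print("P(X > 0) estimate:", (chain > 0).mean())     # should be near 0.5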
  • 42. Some properties of the HM algorithm Alternative representation of the estimator δ is δ = (1/n) ∑_{t=1}^n h(x(t)) = (1/n) ∑_{i=1}^{Mn} ni h(zi) , where zi’s are the accepted yj’s, Mn is the number of accepted yj’s till time n, ni is the number of times zi appears in the sequence (x(t))t.
  • 43. The ”accepted candidates” q̃(·|zi) = α(zi, ·) q(·|zi) / p(zi) , where p(zi) = ∫ α(zi, y) q(y|zi) dy. To simulate from q̃(·|zi): 1. Propose a candidate y ∼ q(·|zi) 2. Accept with probability q̃(y|zi) / {q(y|zi)/p(zi)} = α(zi, y) Otherwise, reject it and start again. This is the transition of the HM algorithm. The transition kernel q̃ enjoys π̃ as a stationary distribution: π̃(x)q̃(y|x) = π̃(y)q̃(x|y).
  • 45. ”accepted” Markov chain Lemma (Douc & X., AoS, 2011) The sequence (zi, ni) satisfies 1. (zi, ni)i is a Markov chain; 2. zi+1 and ni are independent given zi; 3. ni is distributed as a geometric random variable with probability parameter p(zi) := ∫ α(zi, y) q(y|zi) dy ; 4. (zi)i is a Markov chain with transition kernel Q̃(z, dy) = q̃(y|z)dy and stationary distribution π̃ such that q̃(·|z) ∝ α(z, ·) q(·|z) and π̃(·) ∝ π(·)p(·) .
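A small numerical check of this decomposition, on a toy chain of my own: the MH output is split into accepted values z_i and occupancy counts n_i, and the ergodic average is recovered exactly as (1/n) ∑_i n_i h(z_i).

import numpy as np

rng = np.random.default_rng(2)
log_target = lambda x: -0.5 * x ** 2            # standard normal target (unnormalised)

n_iter, step = 20_000, 1.0
x = 0.0
chain, accepted, counts = np.empty(n_iter), [], []
for t in range(n_iter):
    y = x + step * rng.normal()
    if np.log(rng.random()) < log_target(y) - log_target(x):
        accepted.append(y)                      # new accepted value z_i
        counts.append(1)                        # starts its occupancy count n_i
        x = y
    elif counts:
        counts[-1] += 1                         # current z_i repeated one more time
    else:
        accepted.append(x)                      # starting value, before the first acceptance
        counts.append(1)
    chain[t] = x

z, n_i = np.array(accepted), np.array(counts)
h = lambda x: x ** 2
print(np.mean(h(chain)))                        # usual ergodic average
print(np.sum(n_i * h(z)) / n_iter)              # same value, written as (1/n) sum_i n_i h(z_i)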
  • 49. Importance sampling perspective 1. A natural idea: δ∗ = (1/n) ∑_{i=1}^{Mn} h(zi)/p(zi) ,
  • 50. Importance sampling perspective 1. A natural idea: δ∗ = ∑_{i=1}^{Mn} {h(zi)/p(zi)} / ∑_{i=1}^{Mn} {1/p(zi)} = ∑_{i=1}^{Mn} {π(zi)/π̃(zi)} h(zi) / ∑_{i=1}^{Mn} {π(zi)/π̃(zi)} . 2. But p not available in closed form. 3. The geometric ni is the replacement, an obvious solution that is used in the original Metropolis–Hastings estimate since E[ni] = 1/p(zi).
  • 53. The Bernoulli factory The crude estimate of 1/p(zi), ni = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} I{uℓ ≥ α(zi, yℓ)} , can be improved: Lemma (Douc & X., AoS, 2011) If (yj)j is an iid sequence with distribution q(y|zi), the quantity ξ̂i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} {1 − α(zi, yℓ)} is an unbiased estimator of 1/p(zi) whose variance, conditional on zi, is lower than the conditional variance of ni, {1 − p(zi)}/p²(zi).
  • 54. Rao-Blackwellised, for sure? ξ̂i = 1 + ∑_{j=1}^∞ ∏_{ℓ≤j} {1 − α(zi, yℓ)} 1. Infinite sum but finite with at least positive probability: α(x(t), yt) = min{ 1, [π(yt)/π(x(t))] [q(x(t)|yt)/q(yt|x(t))] } For example: take a symmetric random walk as a proposal. 2. What if we wish to be sure that the sum is finite? Finite horizon k version: ξ̂^k_i = 1 + ∑_{j=1}^∞ ∏_{ℓ=1}^{k∧j} {1 − α(zi, yℓ)} ∏_{ℓ=k+1}^{j} I{uℓ ≥ α(zi, yℓ)}
  • 56. which Bernoulli factory?! Not the spice warehouse of Leon Bernoulli! Query: Given an algorithm delivering iid B(p) rv’s, is it possible to derive an algorithm delivering iid B(p) rv’s when f is known and p unknown? [von Neumann, 1951; Keane & O’Brien, 1994] existence (e.g., impossible for f (p) = min(2p, 1)) condition: for some n, min{f (p), 1 − f (p)} min{p, 1 − p}n implementation (polynomial vs. exponential time) use of sandwiching polynomials/power series
  • 57. which Bernoulli factory?! Not the spice warehouse of Leon Bernoulli! Query: Given an algorithm delivering iid B(p) rv’s, is it possible to derive an algorithm delivering iid B(p) rv’s when f is known and p unknown? [von Neumann, 1951; Keane & O’Brien, 1994] existence (e.g., impossible for f (p) = min(2p, 1)) condition: for some n, min{f (p), 1 − f (p)} min{p, 1 − p}n implementation (polynomial vs. exponential time) use of sandwiching polynomials/power series
  • 58. Variance improvement Theorem (Douc & X., AoS, 2011) If (yj )j is an iid sequence with distribution q(y|zi ) and (uj )j is an iid uniform sequence, for any k 0, the quantity ^ξk i = 1 + ∞ j=1 1 k∧j 1 − α(zi , yj ) k+1 j I {u α(zi , y )} is an unbiased estimator of 1/p(zi ) with an almost sure finite number of terms.
  • 59. Variance improvement Theorem (Douc & X., AoS, 2011) If (yj )j is an iid sequence with distribution q(y|zi ) and (uj )j is an iid uniform sequence, for any k 0, the quantity ^ξk i = 1 + ∞ j=1 1 k∧j 1 − α(zi , yj ) k+1 j I {u α(zi , y )} is an unbiased estimator of 1/p(zi ) with an almost sure finite number of terms. Moreover, for k 1, V ^ξk i zi = 1 − p(zi ) p2(zi ) − 1 − (1 − 2p(zi ) + r(zi ))k 2p(zi ) − r(zi ) 2 − p(zi ) p2(zi ) (p(zi ) − r(zi )) , where p(zi ) := α(zi , y) q(y|zi ) dy. and r(zi ) := α2(zi , y) q(y|zi ) dy.
  • 60. Variance improvement Theorem (Douc & X., AoS, 2011) If (yj)j is an iid sequence with distribution q(y|zi) and (uj)j is an iid uniform sequence, for any k ≥ 0, the quantity ξ̂^k_i = 1 + ∑_{j=1}^∞ ∏_{ℓ=1}^{k∧j} {1 − α(zi, yℓ)} ∏_{ℓ=k+1}^{j} I{uℓ ≥ α(zi, yℓ)} is an unbiased estimator of 1/p(zi) with an almost sure finite number of terms. Therefore, we have V[ξ̂i | zi] ≤ V[ξ̂^k_i | zi] ≤ V[ξ̂^0_i | zi] = V[ni | zi] .
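A toy comparison, with my own target and proposal scale, of the crude geometric count n_i against the finite-horizon Rao-Blackwellised estimator ξ̂^k_i of 1/p(z_i) at a fixed current value z.

import numpy as np

rng = np.random.default_rng(3)
log_target = lambda x: -0.5 * x ** 2             # standard normal target (unnormalised)
step = 3.0                                       # deliberately poor proposal scale

def alpha(z, y):
    return min(1.0, np.exp(log_target(y) - log_target(z)))

def n_crude(z):
    # geometric count: number of proposals until the first acceptance
    count = 1
    while rng.random() >= alpha(z, z + step * rng.normal()):
        count += 1
    return count

def xi_hat(z, k):
    # finite-horizon-k version: exact (1 - alpha) factors for the first k proposals,
    # accept/reject indicators afterwards, so the sum terminates almost surely
    est, prod_exact, prod_indic, j = 1.0, 1.0, 1.0, 0
    while prod_indic > 0.0:
        j += 1
        a = alpha(z, z + step * rng.normal())
        if j <= k:
            prod_exact *= 1.0 - a
        else:
            prod_indic *= float(rng.random() >= a)
        est += prod_exact * prod_indic
    return est

z, R = 2.0, 20_000
crude = np.array([n_crude(z) for _ in range(R)])
rb = np.array([xi_hat(z, k=5) for _ in range(R)])
print("means (both estimate 1/p(z)):", crude.mean(), rb.mean())
print("variances:", crude.var(), rb.var())       # theorem: no larger for k >= 1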
  • 61. B motivation for Russian roulette prior π(θ), data density p(y|θ) = f(y; θ)/Z(θ) with Z(θ) = ∫ f(x; θ)dx intractable (e.g., Ising spin model, MRF, diffusion processes, networks, &tc) doubly-intractable posterior follows as π(θ|y) = p(y|θ) × π(θ) × 1/Z(y) = {f(y; θ)/Z(θ)} × π(θ) × 1/Z(y) where Z(y) = ∫ p(y|θ)π(θ)dθ both Z(θ) and Z(y) are intractable with massively different consequences [thanks to Mark Girolami for his Russian slides!]
  • 63. B motivation for Russian roulette If Z(θ) is intractable, the Metropolis–Hastings acceptance probability α(θ′, θ) = min{ 1, [f(y; θ′)π(θ′)]/[f(y; θ)π(θ)] × q(θ|θ′)/q(θ′|θ) × Z(θ)/Z(θ′) } is not available Use instead biased approximations e.g. pseudo-likelihoods, plugin Ẑ(θ′) estimates without sacrificing exactness of MCMC
  • 65. Existing solution Unbiased plugin estimate Z(θ)/Z(θ′) ≈ f(x; θ)/f(x; θ′) where x ∼ f(x; θ′)/Z(θ′) [Møller et al, Bka, 2006; Murray et al 2006] auxiliary variable method removes Z(θ)/Z(θ′) from the picture requires simulations from the model (e.g., via perfect sampling)
  • 66. Exact approximate methods Pseudo-Marginal construction that allows for the use of unbiased, positive estimates of the target in the acceptance probability α(θ′, θ) = min{ 1, π̂(θ′|y)/π̂(θ|y) × q(θ|θ′)/q(θ′|θ) } [Beaumont, 2003; Andrieu and Roberts, 2009; Doucet et al, 2012] Transition kernel has invariant distribution with exact target density π(θ|y)
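A pseudo-marginal sketch on a toy latent-variable model of my own: the intractable likelihood p(y|θ) = ∫ N(y; z, 1) N(z; θ, 1) dz is replaced by an unbiased Monte Carlo estimate, and the estimate travels with the current state of the chain.

import numpy as np

rng = np.random.default_rng(4)
y = 1.5                                            # single observation
N = 10                                             # particles per likelihood estimate

def log_prior(theta):
    return -0.5 * theta ** 2 / 25.0                # theta ~ N(0, 5^2)

def log_lik_hat(theta):
    # unbiased estimate of p(y|theta): average of N(y; z_i, 1) over z_i ~ N(theta, 1)
    z = theta + rng.normal(size=N)
    logs = -0.5 * np.log(2 * np.pi) - 0.5 * (y - z) ** 2
    return np.logaddexp.reduce(logs) - np.log(N)

n_iter, step = 30_000, 2.0
theta = 0.0
log_post_hat = log_prior(theta) + log_lik_hat(theta)
chain = np.empty(n_iter)
for t in range(n_iter):
    prop = theta + step * rng.normal()
    log_post_prop = log_prior(prop) + log_lik_hat(prop)    # fresh estimate for the proposal only
    if np.log(rng.random()) < log_post_prop - log_post_hat:
        theta, log_post_hat = prop, log_post_prop          # estimate recycled, never refreshed
    chain[t] = theta
print("pseudo-marginal posterior mean:", chain.mean())
# for comparison, the exact posterior here is N(25y/27, 50/27), mean about 1.389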
  • 68. Infinite series estimator For each (θ, y), construct rv’s {V (j) θ , j 0} such that ^π(θ, {V (j) θ }|y) := ∞ j=0 V (j) θ is a.s. finite with finite expectation E ^π(θ, {V (j) θ } |y) = π(θ|y)
  • 69. Infinite series estimator For each (θ, y), construct rv’s {V (j) θ , j 0} such that ^π(θ, {V (j) θ }|y) := ∞ j=0 V (j) θ is a.s. finite with finite expectation E ^π(θ, {V (j) θ } |y) = π(θ|y) Introduce a random stopping time τθ, such that with ξ := (τθ, {V (j) θ , 0 j τθ}) the estimate ^π(θ, ξ|y) := τθ j=0 V (j) θ satisfies E ^π(θ, ξ|y)|{V (j) θ , j 0} = ^π(θ, {V (j) θ }|y)
  • 70. Infinite series estimator For each (θ, y), construct rv’s {V (j) θ , j 0} such that ^π(θ, {V (j) θ }|y) := ∞ j=0 V (j) θ is a.s. finite with finite expectation E ^π(θ, {V (j) θ } |y) = π(θ|y) Warning: unbiased estimate ^π(θ, ξ|y) using series construction no general guarantee of positivity
  • 71. Russian roulette Method that requires unbiased truncation of a series S(θ) = ∑_{i=0}^∞ φi(θ) Russian roulette employed extensively in simulation of neutron scattering and computer graphics Assign probabilities {qj, j ≥ 1}, qj ∈ (0, 1], and generate U(0, 1) i.i.d. r.v’s {Uj, j ≥ 1} Find the first time k ≥ 1 such that Uk ≥ qk Russian roulette estimate of S(θ) is Ŝ(θ) = ∑_{j=0}^{k} φj(θ) / ∏_{i=1}^{j−1} qi , If limn→∞ ∏_{j=1}^n qj = 0, Russian roulette terminates with probability one E{Ŝ(θ)} = S(θ) variance finite under certain known conditions [Girolami, Lyne, Strathmann, Simpson, & Atchadé, arXiv:1306.4032]
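A sketch of the roulette truncation on a toy geometric series of my own (true sum known), with constant continuation probabilities q_j; each retained term is divided by its survival probability to preserve unbiasedness.

import numpy as np

rng = np.random.default_rng(5)
x = 0.5
phi = lambda j: x ** j                         # geometric series, true sum S = 1/(1-x) = 2
q = lambda j: 0.9                              # constant continuation probability q_j

def roulette_once():
    total = phi(0)                             # term j = 0 is always included
    surv = 1.0                                 # prod_{i=1}^{j-1} q_i, survival probability of term j
    j = 1
    while True:
        total += phi(j) / surv                 # term j reached, reweighted by 1/survival prob
        if rng.random() >= q(j):               # roulette stops: k = j
            return total
        surv *= q(j)
        j += 1

draws = np.array([roulette_once() for _ in range(100_000)])
print("Russian roulette mean:", draws.mean(), " true sum:", 1.0 / (1.0 - x))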
  • 75. towards ever more complexity Bernoulli, Jakob (1654–1705) MCMC connected steps Metropolis-Hastings revisited Approximate Bayesian computation (ABC)
  • 76. New challenges Novel statistical issues that force a different Bayesian answer: very large datasets complex or unknown dependence structures with maybe p ≫ n multiple and involved random effects missing data structures containing most of the information sequential structures involving most of the above
  • 77. New paradigm? “Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo (MCMC) may be too slow to handle modern data sets. Size matters because large data sets stress computer storage and processing power to the breaking point. The most successful compromises between Bayesian and frequentist methods now rely on penalization and optimization.” [Lange et al., ISR, 2013]
  • 78. New paradigm? sad reality constraint that size does matter focus on much smaller dimensions and on sparse summaries many (fast if non-Bayesian) ways of producing those summaries Bayesian inference can kick in almost automatically at this stage
  • 79. Approximate Bayesian computation (ABC) Case of a well-defined statistical model where the likelihood function ℓ(θ|y) = f(y1, . . . , yn|θ) is out of reach! Empirical approximations to the original Bayesian inference problem Degrading the data precision down to a tolerance ε Replacing the likelihood with a non-parametric approximation Summarising/replacing the data with insufficient statistics
  • 83. ABC methodology Bayesian setting: target is π(θ)f(x|θ) When likelihood f(x|θ) not in closed form, likelihood-free rejection technique: Foundation For an observation y ∼ f(y|θ), under the prior π(θ), if one keeps jointly simulating θ′ ∼ π(θ) , z ∼ f(z|θ′) , until the auxiliary variable z is equal to the observed value, z = y, then the selected θ′ ∼ π(θ|y) [Rubin, 1984; Diggle & Gratton, 1984; Griffith et al., 1997]
  • 86. ABC algorithm In most implementations, ε controls the degree of approximation: Algorithm 1 Likelihood-free rejection sampler for i = 1 to N do repeat generate θ′ from the prior distribution π(·) generate z from the likelihood f(·|θ′) until ρ{η(z), η(y)} ≤ ε set θi = θ′ end for where η(y) defines a (not necessarily sufficient) statistic
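A bare-bones ABC rejection sampler for a toy normal-mean model of my own, with the sample mean as summary statistic and an absolute-difference distance.

import numpy as np

rng = np.random.default_rng(6)
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=50)          # observed data
eta_obs = y.mean()                                # summary statistic eta(y)

def abc_rejection(n_keep, eps):
    kept = []
    while len(kept) < n_keep:
        theta = rng.normal(0.0, 10.0)             # draw from the prior N(0, 10^2)
        z = rng.normal(theta, 1.0, size=len(y))   # pseudo-data from the likelihood
        if abs(z.mean() - eta_obs) <= eps:        # rho(eta(z), eta(y)) <= eps
            kept.append(theta)
    return np.array(kept)

post = abc_rejection(n_keep=1_000, eps=0.1)
print("ABC posterior mean:", post.mean(), " +/-", post.std())
# the sample mean is sufficient here, so as eps -> 0 this approaches the true posterior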
  • 87. Comments role of distance paramount (because ε ≠ 0) scaling of components of η(y) also capital ε matters little if ε “small enough” representative of “curse of dimensionality” small is beautiful!, i.e. data as a whole may be weakly informative for ABC non-parametric method at core
  • 88. ABC simulation advances Simulating from the prior is often poor in efficiency Either modify the proposal distribution on θ to increase the density of x’s within the vicinity of y... [Marjoram et al, 2003; Beaumont et al., 2009, Del Moral et al., 2012] ...or by viewing the problem as a conditional density estimation and by developing techniques to allow for larger ε [Beaumont et al., 2002; Blum & François, 2010; Biau et al., 2013] .....or even by including ε in the inferential framework [ABCµ] [Ratmann et al., 2009]
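A likelihood-free MCMC sketch in the spirit of Marjoram et al. (2003), reusing the toy normal-mean model above with my own tuning: the random-walk proposal is accepted only if the simulated summary falls within ε of the observed one, so no likelihood evaluation is ever needed.

import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(2.0, 1.0, size=50)
eta_obs, eps = y.mean(), 0.1

def log_prior(theta):
    return -0.5 * theta ** 2 / 100.0              # N(0, 10^2) prior

n_iter, step = 50_000, 0.5
theta = eta_obs                                   # start near the data to shorten burn-in
chain = np.empty(n_iter)
for t in range(n_iter):
    prop = theta + step * rng.normal()
    z = rng.normal(prop, 1.0, size=len(y))        # pseudo-data at the proposed theta
    close = abs(z.mean() - eta_obs) <= eps        # likelihood-free acceptance event
    if close and np.log(rng.random()) < log_prior(prop) - log_prior(theta):
        theta = prop                              # symmetric proposal: the q-ratio cancels
    chain[t] = theta
print("ABC-MCMC posterior mean:", chain.mean())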
  • 92. ABC as an inference machine Starting point is summary statistic η(y), either chosen for computational realism or imposed by external constraints ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y) inference based on ABC may be consistent or not, so it needs to be validated on its own the choice of the tolerance level is dictated by both computational and convergence constraints
  • 93. ABC as an inference machine Starting point is summary statistic η(y), either chosen for computational realism or imposed by external constraints ABC can produce a distribution on the parameter of interest conditional on this summary statistic η(y) inference based on ABC may be consistent or not, so it needs to be validated on its own the choice of the tolerance level is dictated by both computational and convergence constraints
  • 94. How Bayesian aBc is..? At best, ABC approximates π(θ|η(y)): approximation error unknown (w/o massive simulation) pragmatic or empirical Bayes (there is no other solution!) many calibration issues (tolerance, distance, statistics) the NP side should be incorporated into the whole Bayesian picture the approximation error should also be part of the Bayesian inference
  • 95. Noisy ABC ABC approximation error (under non-zero tolerance ε) replaced with exact simulation from a controlled approximation to the target, convolution of true posterior with kernel function πε(θ, z|y) = π(θ)f(z|θ)Kε(y − z) / ∫∫ π(θ)f(z|θ)Kε(y − z) dz dθ , with Kε kernel parameterised by bandwidth ε. [Wilkinson, 2013] Theorem The ABC algorithm based on a randomised observation y = ỹ + ξ, ξ ∼ Kε, and an acceptance probability of Kε(y − z)/M gives draws from the posterior distribution π(θ|y).
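A noisy-ABC sketch under toy choices of my own, with a uniform kernel K_ε on the summary so that the acceptance probability K_ε(y − z)/M reduces to an indicator; the only change to rejection ABC is the jittered observed summary.

import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(2.0, 1.0, size=50)
eps = 0.1
eta_noisy = y.mean() + rng.uniform(-eps, eps)     # y_tilde = eta(y) + xi, xi ~ K_eps uniform

kept = []
while len(kept) < 1_000:
    theta = rng.normal(0.0, 10.0)                 # prior draw
    z = rng.normal(theta, 1.0, size=len(y))       # pseudo-data
    if abs(z.mean() - eta_noisy) <= eps:          # accept w.p. K_eps(eta_noisy - eta(z)) / M
        kept.append(theta)
print("noisy-ABC posterior mean:", np.mean(kept))
# the draws are exact for the model in which the summary carries the extra uniform noise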
  • 97. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistics [except when done by the experimenters in the field]
  • 98. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistics [except when done by the experimenters in the field] Loss of statistical information balanced against gain in data roughening Approximation error and information loss remain unknown Choice of statistics induces choice of distance function towards standardisation borrowing tools from data analysis (LDA) machine learning [Estoup et al., ME, 2012]
  • 99. Which summary? Fundamental difficulty of the choice of the summary statistic when there is no non-trivial sufficient statistics [except when done by the experimenters in the field] may be imposed for external/practical reasons may gather several non-B point estimates we can learn about efficient combination distance can be provided by estimation techniques
  • 100. Which summary for model choice? ‘This is also why focus on model discrimination typically (...) proceeds by (...) accepting that the Bayes Factor that one obtains is only derived from the summary statistics and may in no way correspond to that of the full model.’ [S. Sisson, Jan. 31, 2011, xianblog] Depending on the choice of η(·), the Bayes factor based on this insufficient statistic, B^η_12(y) = ∫ π1(θ1)f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2)f^η_2(η(y)|θ2) dθ2 , is either consistent or not [X et al., PNAS, 2012]
  • 101. Which summary for model choice? Depending on the choice of η(·), the Bayes factor based on this insufficient statistic, B^η_12(y) = ∫ π1(θ1)f^η_1(η(y)|θ1) dθ1 / ∫ π2(θ2)f^η_2(η(y)|θ2) dθ2 , is either consistent or not [X et al., PNAS, 2012] [figure: boxplots comparing the Gauss and Laplace models, n = 100]
  • 102. Selecting proper summaries Consistency only depends on the range of µi (θ) = Ei [η(y)] under both models against the asymptotic mean µ0 of η(y) Theorem If Pn belongs to one of the two models and if µ0 cannot be attained by the other one : 0 = min (inf{|µ0 − µi (θi )|; θi ∈ Θi }, i = 1, 2) < max (inf{|µ0 − µi (θi )|; θi ∈ Θi }, i = 1, 2) , then the Bayes factor Bη 12 is consistent [Marin et al., JRSS B, 2013]
  • 103. Selecting proper summaries Consistency only depends on the range of µi(θ) = Ei[η(y)] under both models against the asymptotic mean µ0 of η(y) [figure: boxplots of ABC approximations under models M1 and M2] [Marin et al., JRSS B, 2013]