1. My life as a mixture
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
September 17, 2014
bayesianstatistics@gmail.com
2. Your next Valencia meeting:
I Objective Bayes section of ISBA major meeting:
I O-Bayes 2015 in Valencia, Spain, June 1-4(+1), 2015
I in memory of our friend Susie Bayarri
I objective Bayes, limited information, partly defined and approximate models, etc.
I all flavours of Bayesian analysis welcomed!
I Spain in June, what else...?!
4. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
5. birthdate: May 1989, Ottawa Civic Hospital
Distribution of grey levels in an unprocessed chest radiograph
[X, 1994]
6. Mixture models
Structure of mixtures of distributions:
x ∼ f_j with probability p_j ,
for j = 1, 2, . . . , k, with overall density
p_1 f_1(x) + · · · + p_k f_k(x) .
Usual case: parameterised components
∑_{i=1}^k p_i f(x|θ_i) , ∑_{i=1}^k p_i = 1
where the weights p_i are distinguished from the other parameters
8. Motivations
I Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering) or on the parameters (θ_i , p_i) or on the number of groups
I Semiparametric perspective where mixtures are functional basis approximations of unknown distributions
12. License
Dataset derived from [my] license plate image
Grey levels concentrated on 256 values [later jittered]
[Marin X, 2007]
13. Likelihood
For a sample of independent random variables (x_1, . . . , x_n), likelihood
∏_{i=1}^n { p_1 f_1(x_i) + · · · + p_k f_k(x_i) } .
Expanding this product involves k^n elementary terms: prohibitive to compute in large samples.
But likelihood still computable [pointwise] in O(kn) time.
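To make the O(kn) point concrete, here is a minimal Python sketch of pointwise evaluation of the mixture log-likelihood; the normal components and the parameter values are illustrative assumptions, and log-sum-exp is used for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_loglik(x, weights, means, sds):
    # (n, k) array of log p_j + log f_j(x_i): O(kn) work in total,
    # versus the k^n terms of the expanded product
    comp = np.log(weights) + norm.logpdf(x[:, None], means, sds)
    return logsumexp(comp, axis=1).sum()

x = np.random.default_rng(0).normal(size=100)
print(mixture_loglik(x, [0.3, 0.7], [0.0, 2.0], [1.0, 1.0]))
```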
15. Normal mean benchmark
Normal mixture
p N(μ_1, 1) + (1 − p) N(μ_2, 1)
with only means unknown (2-D representation possible)
Identifiable: μ_1 cannot be confused with μ_2 when p is different from 0.5.
Presence of a spurious mode, understood by letting p go to 0.5
21. Bayesian inference on mixtures
For any prior π(θ, p), posterior distribution of (θ, p) available up to a multiplicative constant
π(θ, p|x) ∝ [ ∏_{i=1}^n ∑_{j=1}^k p_j f(x_i|θ_j) ] π(θ, p)
at a cost of order O(kn)
Difficulty
Despite this, derivation of posterior characteristics like posterior expectations only possible in an exponential time of order O(k^n)!
23. Missing variable representation
Associate to each x_i a missing/latent variable z_i that indicates its component:
z_i|p ∼ M_k(p_1, . . . , p_k)
and
x_i|z_i, θ ∼ f(·|θ_{z_i}) .
Completed likelihood
ℓ(θ, p|x, z) = ∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i}) ,
and
π(θ, p|x, z) ∝ [ ∏_{i=1}^n p_{z_i} f(x_i|θ_{z_i}) ] π(θ, p)
where z = (z_1, . . . , z_n)
25. Gibbs sampling for mixture models
Take advantage of the missing data structure:
Algorithm
I Initialization: choose p^(0) and θ^(0) arbitrarily
I Step t. For t = 1, . . .
1. Generate z_i^(t) (i = 1, . . . , n) from (j = 1, . . . , k)
P( z_i^(t) = j | p_j^(t−1), θ_j^(t−1), x_i ) ∝ p_j^(t−1) f( x_i | θ_j^(t−1) )
2. Generate p^(t) from π(p|z^(t)),
3. Generate θ^(t) from π(θ|z^(t), x).
[Brooks & Gelman, 1990; Diebolt X, 1990, 1994; Escobar & West, 1991]
26. Normal mean example (cont'd)
Algorithm
I Initialization. Choose μ_1^(0) and μ_2^(0),
I Step t. For t = 1, . . .
1. Generate z_i^(t) (i = 1, . . . , n) from
P( z_i^(t) = 1 ) = 1 − P( z_i^(t) = 2 ) ∝ p exp( −(1/2)( x_i − μ_1^(t−1) )² )
2. Compute n_j^(t) = ∑_{i=1}^n I_{z_i^(t)=j} and (s_x^j)^(t) = ∑_{i=1}^n I_{z_i^(t)=j} x_i
3. Generate μ_j^(t) (j = 1, 2) from
N( (λδ + (s_x^j)^(t)) / (λ + n_j^(t)) , 1/(λ + n_j^(t)) )
[under a N(δ, 1/λ) prior on the μ_j's].
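A runnable Python sketch of this two-component Gibbs sampler follows; the weight p is taken as known, and the hyperparameter values λ = 1, δ = 0 together with the simulated dataset are assumptions made for illustration.

```python
import numpy as np

def gibbs_two_means(x, p=0.5, iters=5000, lam=1.0, delta=0.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=2)                  # random initialisation
    trace = np.empty((iters, 2))
    for t in range(iters):
        # 1. allocations: P(z_i = 1) proportional to p exp(-(x_i - mu_1)^2 / 2)
        w1 = p * np.exp(-0.5 * (x - mu[0]) ** 2)
        w2 = (1 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
        z = (rng.random(n) * (w1 + w2) > w1).astype(int)
        # 2. sufficient statistics n_j and (s_x^j)
        nj = np.array([np.sum(z == 0), np.sum(z == 1)])
        sj = np.array([x[z == 0].sum(), x[z == 1].sum()])
        # 3. means from N((lam*delta + s_j)/(lam + n_j), 1/(lam + n_j))
        mu = rng.normal((lam * delta + sj) / (lam + nj), 1 / np.sqrt(lam + nj))
        trace[t] = mu
    return trace

# usage: data simulated from 0.7 N(0,1) + 0.3 N(2.5,1)
rng = np.random.default_rng(1)
x = np.where(rng.random(500) < 0.7, rng.normal(0, 1, 500), rng.normal(2.5, 1, 500))
trace = gibbs_two_means(x, p=0.7)
```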
27. Normal mean example (cont'd)
[Figures: Gibbs samples in the (m1, m2) plane, (a) initialised at random; (b) initialised close to the lower mode]
[X & Casella, 2009]
29. License
Consider k = 3 components, a D_3(1/2, 1/2, 1/2) prior for the weights, a N(x̄, σ̂²/3) prior on the means μ_i and a Ga(10, σ̂²) prior on the precisions σ_i^{−2}, where x̄ and σ̂² are the empirical mean and variance of License
[Empirical Bayes]
[Marin X, 2007]
30. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
31. weakly informative priors
I possible symmetric empirical Bayes priors
p ∼ D(γ, . . . , γ), θ_i ∼ N(μ̂, ω̂_i²), σ_i^{−2} ∼ Ga(α, β̂)
which can be replaced with hierarchical priors
[Diebolt X, 1990; Richardson & Green, 1997]
I independent improper priors on the θ_j's prohibited, thus standard flat and Jeffreys priors impossible to use (except with the exclude-empty-component trick)
[Diebolt X, 1990; Wasserman, 1999]
32. weakly informative priors
I Reparameterization to compact set for use of uniform priors
μ_i → e^{μ_i}/(1 + e^{μ_i}) , σ_i → σ_i/(1 + σ_i)
[Chopin, 2000]
I dependent weakly informative priors
p ∼ D(k, . . . , 1), θ_i ∼ N(θ_{i−1}, σ_{i−1}²), σ_i ∼ U([0, σ_{i−1}])
[Mengersen X, 1996; X & Titterington, 1998]
I reference priors
p ∼ D(1, . . . , 1), θ_i ∼ N(0, (σ_i² + σ_0²)/2), σ_i² ∼ C⁺(0, σ_0²)
[Moreno & Liseo, 1999]
33. Re-ban on improper priors
Difficult to use improper priors in the setting of mixtures because independent improper priors,
π(θ) = ∏_{i=1}^k π_i(θ_i) , with ∫ π_i(θ_i) dθ_i = ∞
end up, for all n's, with the property
∫ π(θ, p|x) dθ dp = ∞
Reason
There are (k − 1)^n terms among the k^n terms in the expansion that allocate no observation at all to the i-th component
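The reason can be written out in one display; the expansion below is a reconstruction consistent with the completed-likelihood notation of the earlier slides.

```latex
\int \pi(\theta,p\mid x)\,\mathrm{d}\theta\,\mathrm{d}p
  \;\propto\; \sum_{z\in\{1,\dots,k\}^n} \int \prod_{i=1}^{n}
  p_{z_i}\, f(x_i\mid\theta_{z_i})\;\pi(\theta,p)\,\mathrm{d}\theta\,\mathrm{d}p
% For any allocation z leaving component i empty, the factor
% \int \pi_i(\theta_i)\,\mathrm{d}\theta_i = \infty is left untouched by the
% likelihood, so that term, and hence the whole sum, is infinite.
```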
35. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
36. Connected difficulties
1. Number of modes of the likelihood of order O(k!):
⇒ Maximization and even [MCMC] exploration of the posterior surface harder
2. Under exchangeable priors on (θ, p) [prior invariant under permutation of the indices], all posterior marginals are identical:
⇒ Posterior expectation of θ_1 equal to posterior expectation of θ_2
38. License
When the Gibbs output does not (re)produce exchangeability, the Gibbs sampler has failed to explore the whole parameter space: not enough energy to switch enough component allocations simultaneously
[Marin X, 2007]
39. Label switching paradox
I We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler.
I If we observe it, then we do not know how to estimate the parameters.
I If we do not, then we are uncertain about the convergence!!!
[Celeux, Hurn X, 2000]
[Frühwirth-Schnatter, 2001, 2004]
[Holmes, Jasra & Stephens, 2005]
43. Identifiability: impose constraints like
θ_1 ≤ . . . ≤ θ_k
in the prior
Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints.
Computational detail
The constraint need not be imposed during the simulation but can instead be imposed after simulation, by reordering the MCMC output according to the constraints. [This avoids possible negative effects on convergence]
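As a post-processing sketch, sorting each stored draw enforces the ordering constraint on the MCMC output without touching the sampler; `trace` is assumed to be a (T, k) array of simulated means, e.g. from the earlier Gibbs sketch.

```python
import numpy as np

order = np.argsort(trace, axis=1)                  # per-draw ranking of the means
trace_sorted = np.take_along_axis(trace, order, axis=1)
# any component-specific parameters (weights, variances) must be permuted
# with the same 'order' so that labels stay consistent across parameters
```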
48. Relabeling towards the mode
Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP
(θ, p)^(i*) with i* = arg max_{i=1,...,M} π( (θ, p)^(i) | x )
Pivotal Reordering
At iteration i ∈ {1, . . . , M},
1. Compute the optimal permutation
τ_i = arg min_{τ∈S_k} d( τ((θ^(i), p^(i))) , (θ^(i*), p^(i*)) )
where d(·, ·) is a distance in the parameter space.
2. Set (θ^(i), p^(i)) = τ_i((θ^(i), p^(i))).
[Celeux, 1998; Stephens, 2000; Celeux, Hurn X, 2000]
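A Python sketch of pivotal reordering, using Euclidean distance on the means as the (assumed) choice of d(·, ·); `trace` holds the draws and `logpost` their unnormalised log-posterior values.

```python
import numpy as np
from itertools import permutations

def pivotal_reorder(trace, logpost):
    pivot = trace[np.argmax(logpost)]              # approximate MAP draw
    k = trace.shape[1]
    perms = [list(s) for s in permutations(range(k))]
    out = np.empty_like(trace)
    for i, draw in enumerate(trace):
        # permutation bringing the draw closest to the pivot
        best = min(perms, key=lambda s: np.sum((draw[s] - pivot) ** 2))
        out[i] = draw[best]
    return out
```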
50. Loss functions for mixture estimation
Global loss function that considers distance between predictives
L(θ, θ̂) = ∫_X f_θ(x) log( f_θ(x)/f_θ̂(x) ) dx
eliminates the labelling effect
Similar solution for estimating clusters through allocation variables
L(z, ẑ) = ∑_{i<j} ( I_{z_i=z_j}(1 − I_{ẑ_i=ẑ_j}) + I_{ẑ_i=ẑ_j}(1 − I_{z_i=z_j}) ) .
[Celeux, Hurn X, 2000]
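The allocation loss is directly computable; the sketch below counts pairs clustered together in one of z, ẑ but not in the other, and is invariant to relabelling of either partition.

```python
import numpy as np

def allocation_loss(z, zhat):
    same = z[:, None] == z[None, :]            # true co-clustering matrix
    same_hat = zhat[:, None] == zhat[None, :]  # estimated co-clustering matrix
    # XOR marks disagreeing pairs; upper triangle counts each pair once
    return int(np.triu(same ^ same_hat, k=1).sum())

print(allocation_loss(np.array([0, 0, 1, 1]), np.array([1, 1, 1, 0])))  # -> 3
```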
51. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
52. MAP estimation
For high-dimensional parameter spaces, difficulty with marginal MAP (MMAP) estimates because nuisance parameters must be integrated out
θ_1^MMAP = arg max_{θ_1} p(θ_1|y)
where
p(θ_1|y) = ∫_{Θ_2} p(θ_1, θ_2|y) dθ_2
53. MAP estimation
SAME stands for State Augmentation for Marginal Estimation
[Doucet, Godsill X, 2001]
Artificially augmented probability model whose marginal distribution is
p_α(θ_1|y) ∝ p(θ_1|y)^α
via replications of the nuisance parameters:
I Replace θ_2 with α artificial replications θ_2(1), . . . , θ_2(α)
I Treat the θ_2(j)'s as distinct random variables:
q_α(θ_1, θ_2(1), . . . , θ_2(α)|y) ∝ ∏_{k=1}^α p(θ_1, θ_2(k)|y)
I Use corresponding marginal for θ_1
q_α(θ_1|y) = ∫ q_α(θ_1, θ_2(1), . . . , θ_2(α)|y) dθ_2(1) · · · dθ_2(α)
∝ ∫ ∏_{k=1}^α p(θ_1, θ_2(k)|y) dθ_2(1) · · · dθ_2(α) = p_α(θ_1|y)
I Build an MCMC algorithm in the augmented space, with invariant distribution q_α(θ_1, θ_2(1), . . . , θ_2(α)|y)
I Use simulated subsequence {θ_1^(i); i ∈ ℕ} as drawn from marginal posterior p_α(θ_1|y)
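A minimal SAME sketch for the two-component normal-mean mixture of the earlier slides: α replications of the allocation vector sharpen the conditional of the means towards the marginal MAP. The fixed weight p, the N(δ, 1/λ) prior and a constant α (rather than the usual increasing, annealing-style schedule) are simplifying assumptions.

```python
import numpy as np

def same_two_means(x, p=0.7, alpha=10, iters=2000, lam=1.0, delta=0.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = np.array([x.min(), x.max()])           # crude initialisation
    for _ in range(iters):
        counts, sums = np.zeros(2), np.zeros(2)
        for _ in range(alpha):                  # alpha independent replications of z
            w1 = p * np.exp(-0.5 * (x - mu[0]) ** 2)
            w2 = (1 - p) * np.exp(-0.5 * (x - mu[1]) ** 2)
            z = (rng.random(n) * (w1 + w2) > w1).astype(int)
            for j in (0, 1):
                counts[j] += np.sum(z == j)
                sums[j] += x[z == j].sum()
        # conditional of mu given all replications: the prior enters alpha
        # times, so precision = alpha*lam + sum_k n_j(k)
        prec = alpha * lam + counts
        mu = rng.normal((alpha * lam * delta + sums) / prec, 1 / np.sqrt(prec))
    return mu   # concentrates near the marginal MAP as alpha grows
```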
64. example: Galaxy dataset benchmark
82 observations of galaxy velocities from 3 (?) groups
Algorithm                 EM     MCEM   SAME
Mean log-posterior        65.47  60.73  66.22
Std dev of log-posterior  2.31   4.48   0.02
[Doucet X, 2002]
65. Really the SAME?!
SAME algorithm re-invented in many guises:
I Gaetan & Yao, 2003, Biometrika
I Jacquier, Johannes & Polson, 2007, J. Econometrics
I Lele, Dennis & Lutscher, 2007, Ecology Letters [data cloning]
I ...
66. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
67. Propp and Wilson's perfect sampler
Difficulty devising MCMC stopping rules:
when should one stop an MCMC algorithm?!
Principle: Coupling from the past
rather than start at t = 0 and wait till t = +∞, start at t = −∞ and wait till t = 0
[Propp & Wilson, 1996]
⇒ Outcome at time t = 0 is stationary
69. CFTP Algorithm
Algorithm (Coupling from the past)
1. Start from the m possible values at time −t
2. Run the m chains till time 0 (coupling allowed)
3. Check if the chains are equal at time 0
4. If not, start further back: t ← 2t, using the same random numbers at times already simulated
I requires a finite state space
I probability of merging chains must be high enough
I hard to implement w/o a monotonicity in both state space and transition
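A minimal CFTP sketch for a monotone chain, a reflecting ±1 random walk on {0, . . . , m}: the top and bottom chains are driven by the same stored random numbers, and coalescence at time 0 delivers an exact stationary draw. The chain itself is an illustrative assumption, not taken from the slides.

```python
import numpy as np

def cftp_reflecting_walk(m=10, seed=0):
    rng = np.random.default_rng(seed)
    us = []                                  # us[j] drives the move at time -(j+1)
    t = 1
    while True:
        while len(us) < t:
            us.append(rng.random())          # extend, never regenerate, randomness
        lo, hi = 0, m                        # sandwiching chains started at time -t
        for s in range(t - 1, -1, -1):
            step = 1 if us[s] < 0.5 else -1  # common random number for both chains
            lo = min(max(lo + step, 0), m)   # monotone update preserves lo <= hi
            hi = min(max(hi + step, 0), m)
        if lo == hi:                         # coalescence: all starting points agree
            return lo
        t *= 2                               # otherwise restart further back in time

print(cftp_reflecting_walk())
```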
73. Mixture models
Simplest possible mixture structure
p f_0(x) + (1 − p) f_1(x),
with uniform prior on p.
Algorithm (Data Augmentation Gibbs sampler)
At iteration t:
1. Generate n iid U(0, 1) rv's u_1^(t), . . . , u_n^(t).
2. Derive the indicator variables z_i^(t) as z_i^(t) = 0 iff
u_i^(t) ≤ q_i^(t−1) = p^(t−1) f_0(x_i) / ( p^(t−1) f_0(x_i) + (1 − p^(t−1)) f_1(x_i) )
and compute
m^(t) = ∑_{i=1}^n z_i^(t).
3. Simulate p^(t) ∼ Be(n + 1 − m^(t), 1 + m^(t)).
74. Mixture models
Algorithm (CFTP Gibbs sampler)
At iteration −t:
1. Generate n iid uniform rv's u_1^(−t), . . . , u_n^(−t).
2. Partition [0, 1) into intervals [q_[j], q_[j+1]).
3. For each [q_[j]^(−t), q_[j+1]^(−t)), generate p_j^(−t) ∼ Be(n − j + 1, j + 1).
4. For each j = 0, 1, . . . , n, set r_j^(−t) = p_j^(−t)
5. For (ℓ = 1; ℓ ≤ T; ℓ++) set r_j^(−t+ℓ) = p_k^(−t+ℓ) with k such that
r_j^(−t+ℓ−1) ∈ [ q_[k]^(−t+ℓ), q_[k+1]^(−t+ℓ) )
6. Stop if the r_j^(0)'s (0 ≤ j ≤ n) are all equal. Otherwise, t ← 2t.
[Hobert et al., 1999]
75. Mixture models
Extension to the case k = 3:
Sample of n = 35 observations from
.23N(2.2, 1.44) + .62N(1.4, 0.49) + .15N(0.6, 0.64)
[Hobert et al., 1999]
76. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
77. Bayesian model choice
Comparison of models M_i by Bayesian means:
probabilise the entire model/parameter space
I allocate probabilities p_i to all models M_i
I define priors π_i(θ_i) for each parameter space Θ_i
I compute
π(M_i|x) = p_i ∫_{Θ_i} f_i(x|θ_i) π_i(θ_i) dθ_i / ∑_j p_j ∫_{Θ_j} f_j(x|θ_j) π_j(θ_j) dθ_j
79. Bayesian model choice
Comparison of models M_i by Bayesian means:
Relies on a central notion: the evidence
Z_k = ∫_{Θ_k} π_k(θ_k) L_k(θ_k) dθ_k ,
aka the marginal likelihood.
80. Chib's representation
Direct application of Bayes' theorem: given x ∼ f_k(x|θ_k) and θ_k ∼ π_k(θ_k),
Z_k = m_k(x) = f_k(x|θ_k) π_k(θ_k) / π_k(θ_k|x)
Replace with an approximation to the posterior
Ẑ_k = m̂_k(x) = f_k(x|θ_k*) π_k(θ_k*) / π̂_k(θ_k*|x) .
[Chib, 1995]
82. Case of latent variables
For missing variable z as in mixture models, natural Rao-Blackwell estimate
π̂_k(θ_k*|x) = (1/T) ∑_{t=1}^T π_k(θ_k*|x, z_k^(t)) ,
where the z_k^(t)'s are Gibbs sampled latent variables
[Diebolt & Robert, 1990; Chib, 1995]
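Putting the last two slides together for the two-component normal-mean benchmark, a hedged Python sketch of Chib's estimate: the Rao-Blackwell average over Gibbs-sampled allocations supplies the posterior ordinate at θ* = μ*. The known weight p and the N(δ, 1/λ) prior are carried over from the earlier sketches.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_chib_evidence(x, z_draws, mu_star, p, lam=1.0, delta=0.0):
    mu_star = np.asarray(mu_star)
    # log f(x | mu*) for the mixture and the N(delta, 1/lam) log-prior
    comp = np.log([p, 1 - p]) + norm.logpdf(x[:, None], mu_star, 1.0)
    loglik = logsumexp(comp, axis=1).sum()
    logprior = norm.logpdf(mu_star, delta, 1 / np.sqrt(lam)).sum()
    # Rao-Blackwell: pi(mu* | x, z) is a product of two normal ordinates
    ords = np.empty(len(z_draws))
    for t, z in enumerate(z_draws):
        nj = np.array([np.sum(z == 0), np.sum(z == 1)])
        sj = np.array([x[z == 0].sum(), x[z == 1].sum()])
        ords[t] = norm.logpdf(mu_star, (lam * delta + sj) / (lam + nj),
                              1 / np.sqrt(lam + nj)).sum()
    log_post_ord = logsumexp(ords) - np.log(len(z_draws))
    return loglik + logprior - log_post_ord  # log Z = log f + log prior - log ordinate
```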
83. Compensation for label switching
For mixture models, z_k^(t) usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
Consequences on numerical approximation, biased by an order k!
Recover the theoretical symmetry by using
π̃_k(θ_k*|x) = (1/(T k!)) ∑_{σ∈S_k} ∑_{t=1}^T π_k(σ(θ_k*)|x, z_k^(t))
for all σ's in S_k, set of all permutations of {1, . . . , k}
[Berkhof, Mechelen & Gelman, 2003]
87. Galaxy dataset (k)
Using Chib's estimate, with θ_k* as MAP estimator,
log(Ẑ_k(x)) = −105.1396
for k = 3, while introducing permutations leads to
log(Ẑ_k(x)) = −103.3479
Note that
−105.1396 + log(3!) = −103.3479
k        2        3        4        5        6        7        8
Z_k(x)   −115.68  −103.35  −102.66  −101.93  −102.88  −105.48  −108.44
Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on 10⁵ Gibbs iterations and, for k ≥ 5, 100 permutations selected at random in S_k).
[Lee et al., 2008]
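A sketch of the symmetrised ordinate in the same normal-mean setting: averaging the Rao-Blackwell density over all k! relabellings of μ* restores the exchangeability that the Gibbs allocations fail to reproduce, replacing the denominator in the Chib sketch above.

```python
import numpy as np
from itertools import permutations
from scipy.special import logsumexp
from scipy.stats import norm

def log_symmetrised_ordinate(x, z_draws, mu_star, lam=1.0, delta=0.0):
    mu_star = np.asarray(mu_star)
    k = len(mu_star)
    terms = []
    for sigma in permutations(range(k)):   # the k! relabellings of theta*
        mu_perm = mu_star[list(sigma)]
        for z in z_draws:
            nj = np.array([np.sum(z == j) for j in range(k)])
            sj = np.array([x[z == j].sum() for j in range(k)])
            terms.append(norm.logpdf(mu_perm, (lam * delta + sj) / (lam + nj),
                                     1 / np.sqrt(lam + nj)).sum())
    # (1 / (T k!)) sum over permutations and Gibbs draws, on the log scale
    return logsumexp(terms) - np.log(len(terms))
```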
90. More efficient sampling
Difficulty with the explosive number of terms in
π̃_k(θ_k*|x) = (1/(T k!)) ∑_{σ∈S_k} ∑_{t=1}^T π_k(σ(θ_k*)|x, z_k^(t))
when most terms are equal to zero...
Iterative bridge sampling:
Ê^(t)(k) = Ê^(t−1)(k) × [ M_1^{−1} ∑_{l=1}^{M_1} π̂(θ̃_l|x) / (M_1 q(θ̃_l) + M_2 π̂(θ̃_l|x)) ] / [ M_2^{−1} ∑_{m=1}^{M_2} q(θ̂_m) / (M_1 q(θ̂_m) + M_2 π̂(θ̂_m|x)) ]
[Frühwirth-Schnatter, 2004]
where
q(θ) = (1/J_1) ∑_{j=1}^{J_1} p(·|z^(j)) ∏_{i=1}^k p(θ_i|·, z^(j), x)
or, symmetrised over labels, where
q(θ) = (1/k!) ∑_{σ∈S(k)} p(·|σ(z°)) ∏_{i=1}^k p(θ_i|·, σ(z°), x)
93. Further efficiency
After de-switching (un-switching?), representation of importance function as
q(θ) = (1/(J k!)) ∑_{j=1}^J ∑_{σ∈S_k} π(θ|σ(φ^(j)), x) = (1/k!) ∑_{σ∈S_k} h_σ(θ)
where h_σ is associated with a particular mode of q
Assuming generations
(θ^(1), . . . , θ^(T)) ∼ h_c(θ)
how many of the h_σ(θ^(t)) are non-zero?
94. Sparsity for the sum
Contribution of each term relative to q(θ)
ψ_σ(θ) = h_σ(θ) / (k! q(θ)) = h_σ(θ) / ∑_{σ'∈S_k} h_{σ'}(θ)
and importance of permutation σ evaluated by
Ê_{h_c}[ψ_σ(θ)] = (1/M) ∑_{l=1}^M ψ_σ(θ^(l)) , θ^(l) ∼ h_c(θ)
Approximate set A(k) ⊂ S(k) consists of [σ_1, · · · , σ_n] for the smallest n that satisfies the stopping rule ρ̂_n ≈ 1 of Step 3.3 below
101. dual importance sampling with approximation
DIS2A
1 Randomly select {z^(j), θ^(j)}_{j=1}^J from the Gibbs sample and un-switch
Construct q(θ)
2 Choose h_c(θ) and generate particles {θ^(t)}_{t=1}^T ∼ h_c(θ)
3 Construction of approximation q̃(θ) using first M-sample
3.1 Compute Ê_{h_c}[ψ_{σ_1}(θ)], · · · , Ê_{h_c}[ψ_{σ_{k!}}(θ)]
3.2 Reorder the σ's such that Ê_{h_c}[ψ_{σ_1}(θ)] ≥ · · · ≥ Ê_{h_c}[ψ_{σ_{k!}}(θ)].
3.3 Initially set n = 1 and compute the q̃_n(θ^(t))'s and ρ̂_n. If ρ̂_n ≈ 1, go to Step 4. Otherwise increase n = n + 1
4 Replace q(θ^(1)), . . . , q(θ^(T)) with q̃(θ^(1)), . . . , q̃(θ^(T)) to estimate Ê
[Lee X, 2014]
103. illustrations
k   k!   |A(k)|  reduction rate
3   6    1.0000  0.1675
4   24   2.7333  0.1148
Fishery data
k   k!   |A(k)|    reduction rate
3   6    1.000     0.1675
4   24   15.7000   0.6545
6   720  298.1200  0.4146
Galaxy data
Table: Mean estimates of approximate set sizes |A(k)| and of the reduction rate of the number of evaluated h-terms, for (a) the fishery data and (b) the galaxy data
111. I O(k) matrix
I unavailable in closed form except in special cases
I unidimensional integrals approximated by Monte Carlo tools
[Grazian [talk tomorrow] et al., 2014+]
113. I significant computing requirement (reduced by delayed acceptance)
[Banterle et al., 2014]
I differs from component-wise Jeffreys
[Diebolt X, 1990; Stoneking, 2014]
I when is the posterior proper?
I how to check properness via MCMC outputs?
114. Outline
Gibbs sampling
weakly informative priors
imperfect sampling
SAME algorithm
perfect sampling
Bayes factor
less informative prior
no Bayes factor
115. Difficulties with Bayes factors
I delicate calibration towards supporting a given hypothesis or model
I long-lasting impact of prior modelling, despite overall consistency
I discontinuity in the use of improper priors in most settings
I binary outcome more suited for immediate decision than for model evaluation
I related impossibility to ascertain misfit or outliers
I missing assessment of uncertainty associated with the decision
I difficult computation of marginal likelihoods in most settings
117. Reformulation
I Representation of the test problem as a two-component mixture estimation problem where the weights are formally equal to 0 or 1
I Mixture model thus contains both models under comparison as extreme cases
I Inspired by consistency result of Rousseau and Mengersen (2011) on overfitting mixtures
I Use of posterior distribution of the weight of a model instead of single-digit posterior probability
[Kamary [see poster] et al., 2014+]
119. Construction of Bayes tests
Given two statistical models,
M_1 : x ∼ f_1(x|θ_1) , θ_1 ∈ Θ_1 and M_2 : x ∼ f_2(x|θ_2) , θ_2 ∈ Θ_2 ,
embed both models within an encompassing mixture model
M_α : x ∼ α f_1(x|θ_1) + (1 − α) f_2(x|θ_2) , 0 ≤ α ≤ 1 . (1)
Both models as special cases of the mixture model, one for α = 1 and the other for α = 0
⇒ Test as inference on α
120. Arguments
I substituting estimate of the weight α for the posterior probability of model M_1 produces an equally convergent indicator of which model is true while removing the need of often artificial prior probabilities on model indices
I interpretation at least as natural as for the posterior probability, while avoiding the zero-one loss setting
I highly problematic computation of marginal likelihoods bypassed by standard algorithms for mixture estimation
I straightforward extension to a collection of models allows one to consider all models at once
I posterior on α evaluates thoroughly strength of support for a given model, compared with a single-digit Bayes factor
I mixture model acknowledges possibility that both models [or none] could be acceptable
122. Arguments
I standard prior modelling can be reproduced here but improper priors now acceptable, when both models reparameterised towards common-meaning parameters, e.g. location and scale
I using same parameters on both components is essential: opposition between components is not an issue with different parameter values
I parameters of components, θ_1 and θ_2, integrated out by Monte Carlo
I contrary to common testing settings, data signal lack of agreement with either model when posterior on α away from both 0 and 1
I in most settings, approach easily calibrated by parametric bootstrap providing posterior of α under each model and prior predictive error
123. Toy examples (1)
Test of a Poisson P(λ) versus a geometric Geo(p) [as a number of failures, starting at zero]
Same parameter λ used in Poisson P(λ) and geometric Geo(p) with p = 1/(1 + λ)
Improper noninformative prior π(λ) = 1/λ is valid
Posterior on λ conditional on allocation vector ζ
π(λ | x, ζ) ∝ exp(−n_1(ζ)λ) λ^{∑_{i=1}^n x_i − 1} (λ + 1)^{−(n_2(ζ)+s_2(ζ))}
and α ∼ Be(n_1 + a_0, n_2 + a_0)
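A hedged Gibbs sketch for this Poisson-versus-geometric test: allocations and α have the closed-form conditionals above, while the non-standard conditional of λ is updated by one random-walk Metropolis step on log λ (the proposal scale 0.1 and the a_0 = .5 default are illustrative assumptions).

```python
import numpy as np
from scipy.stats import poisson, geom

def log_cond_lambda(lam, n1, n2, sx, s2):
    # log of exp(-n1*lam) * lam^(sum_i x_i - 1) * (lam+1)^-(n2+s2)
    return -n1 * lam + (sx - 1) * np.log(lam) - (n2 + s2) * np.log(lam + 1)

def mixture_test(x, a0=0.5, iters=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, sx = len(x), x.sum()
    lam, alpha = max(x.mean(), 0.1), 0.5
    alphas = np.empty(iters)
    for t in range(iters):
        # allocations: component 1 = Poisson, 2 = geometric (failures from 0)
        w1 = alpha * poisson.pmf(x, lam)
        w2 = (1 - alpha) * geom.pmf(x + 1, 1 / (1 + lam))  # scipy's geom starts at 1
        z1 = rng.random(n) * (w1 + w2) < w1
        n1, s2 = int(z1.sum()), x[~z1].sum()
        n2 = n - n1
        # weight: alpha ~ Be(n1 + a0, n2 + a0)
        alpha = rng.beta(n1 + a0, n2 + a0)
        alphas[t] = alpha
        # lambda: log-normal random walk with Jacobian correction
        prop = lam * np.exp(0.1 * rng.standard_normal())
        logr = (log_cond_lambda(prop, n1, n2, sx, s2)
                - log_cond_lambda(lam, n1, n2, sx, s2) + np.log(prop / lam))
        if np.log(rng.random()) < logr:
            lam = prop
    return alphas   # posterior sample of the Poisson weight
```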
124. Toy examples (1)
[Figure: posterior of the Poisson weight α when a_0 = .1, .2, .3, .4, .5, 1, for a sample of 10⁵ observations from a geometric Geo(0.5); lBF = −509684]
125. Toy examples (2)
Normal N(θ, 1) model versus double-exponential L(θ, √2) [Scale √2 is intentionally chosen to make both distributions share the same variance]
Location parameter θ can be shared by both models with a single flat prior π(θ). Beta distributions B(a_0, a_0) are compared wrt their hyperparameter a_0
126. Toy examples (2)
Posterior of the double-exponential weight for L(0, √2) data, with 5, . . . , 10³ observations and 10⁵ Gibbs iterations
127. Toy examples (2)
Posterior of the normal weight for N(0, .7²) data with 10³ observations and 10⁴ Gibbs iterations
128. Toy examples (2)
Posterior of the normal weight for N(0, 1) data with 10³ observations and 10⁴ Gibbs iterations
129. Toy examples (2)
Posterior of the normal weight for double-exponential data, with 10³ observations and 10⁴ Gibbs iterations