On some computational methods for Bayesian model choice




             On some computational methods for Bayesian
                          model choice

                                            Christian P. Robert

                               CREST-INSEE and Université Paris Dauphine
                               http://www.ceremade.dauphine.fr/~xian


                          MaxEnt 2009, Oxford, July 6, 2009
                      Joint work with M. Beaumont, N. Chopin,
                      J.-M. Cornuet, J.-M. Marin and D. Wraith
On some computational methods for Bayesian model choice




Outline


      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions

      4    Nested sampling

      5    ABC model choice
On some computational methods for Bayesian model choice
  Evidence
     Bayes tests



Formal construction of Bayes tests

      Definition (Test)
      Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ of a
      statistical model, a test is a statistical procedure that takes its
      values in {0, 1}.

      Theorem (Bayes test)
      The Bayes estimator associated with π and with the 0 − 1 loss is

      \[
      \delta^\pi(x) = \begin{cases} 1 & \text{if } \pi(\theta \in \Theta_0 \mid x) > \pi(\theta \notin \Theta_0 \mid x), \\ 0 & \text{otherwise.} \end{cases}
      \]
On some computational methods for Bayesian model choice
  Evidence
     Bayes factor



Bayes factor

      Definition (Bayes factors)
      For testing the hypothesis H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0, under the prior

      \[
      \pi(\Theta_0)\,\pi_0(\theta) + \pi(\Theta_0^c)\,\pi_1(\theta)\,,
      \]

      the central quantity is

      \[
      B_{01}(x) = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_0^c \mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)}
      = \frac{\int_{\Theta_0} f(x \mid \theta)\,\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x \mid \theta)\,\pi_1(\theta)\,d\theta}
      = \frac{m_0(x)}{m_1(x)}
      \]


                                                                            [Jeffreys, 1939]
On some computational methods for Bayesian model choice
  Evidence
     Model choice



Model choice and model comparison



      Choice between models
      Several models available for the same observation x

                                    Mi : x ∼ fi (x|θi ),   i∈I

      where the family I can be finite or infinite

      Identical setup: Replace hypotheses with models but keep
      marginal likelihoods and Bayes factors
On some computational methods for Bayesian model choice
  Evidence
     Model choice



Bayesian model choice
      Probabilise the entire model/parameter space
          allocate probabilities pi to all models Mi
          define priors πi (θi ) for each parameter space Θi
           compute

      \[
      \pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x \mid \theta_j)\,\pi_j(\theta_j)\,d\theta_j}
      \]

              take largest π(Mi |x) to determine the “best” model,
              or use the averaged predictive

      \[
      \sum_j \pi(M_j \mid x) \int_{\Theta_j} f_j(x' \mid \theta_j)\,\pi_j(\theta_j \mid x)\,d\theta_j
      \]
On some computational methods for Bayesian model choice
  Evidence
     Evidence



Evidence



      All these problems end up with a similar quantity, the evidence

      \[
      Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,d\theta_k\,,
      \]

      aka the marginal likelihood.
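
      As a toy illustration (not from the original slides): a minimal Monte
      Carlo sketch of the evidence for the conjugate model x ∼ N(θ, 1),
      θ ∼ N(0, 1), where Zk is known in closed form as a check; all numerical
      choices below are assumptions made for the example.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(0)
          x = 1.2                                   # a single (toy) observation

          # Naive estimate: Z = E_prior[L(theta)], averaging the likelihood
          # over draws from the prior theta ~ N(0, 1)
          theta = rng.normal(0.0, 1.0, size=100_000)
          Z_hat = stats.norm.pdf(x, loc=theta, scale=1.0).mean()

          # Closed form: marginally x ~ N(0, sqrt(2)) for this conjugate pair
          Z_true = stats.norm.pdf(x, loc=0.0, scale=np.sqrt(2.0))
          print(Z_hat, Z_true)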
On some computational methods for Bayesian model choice
  Importance sampling solutions




Importance sampling revisited

      1    Evidence

      2    Importance sampling solutions
             Regular importance
             Harmonic means
             Chib’s solution

      3    Cross-model solutions

      4    Nested sampling

      5    ABC model choice
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Bayes factor approximation

      When approximating the Bayes factor

      \[
      B_{01} = \frac{\int_{\Theta_0} f_0(x \mid \theta_0)\,\pi_0(\theta_0)\,d\theta_0}{\int_{\Theta_1} f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1}\,,
      \]

      use of importance functions ϱ0 and ϱ1 and of the estimator

      \[
      \widehat{B}_{01} = \frac{n_0^{-1} \sum_{i=1}^{n_0} f_0(x \mid \theta_0^i)\,\pi_0(\theta_0^i) \big/ \varrho_0(\theta_0^i)}{n_1^{-1} \sum_{i=1}^{n_1} f_1(x \mid \theta_1^i)\,\pi_1(\theta_1^i) \big/ \varrho_1(\theta_1^i)}\,, \qquad \theta_j^i \sim \varrho_j
      \]
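
      A sketch of this estimator on two toy models sharing the same likelihood
      x ∼ N(θ, 1) but with different priors; the Student t densities standing
      in for ϱ0 and ϱ1 are illustrative assumptions, not a recommendation from
      the slides.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          x = 1.2

          # M0: prior theta ~ N(0, 1); M1: prior theta ~ N(0, 3)
          like = lambda t: stats.norm.pdf(x, loc=t, scale=1.0)
          prior0, prior1 = stats.norm(0.0, 1.0), stats.norm(0.0, 3.0)

          # Importance functions: t's centred near each posterior mode
          rho0 = stats.t(df=4, loc=x / 2.0, scale=1.0)
          rho1 = stats.t(df=4, loc=9.0 * x / 10.0, scale=1.5)

          t0 = rho0.rvs(50_000, random_state=rng)
          t1 = rho1.rvs(50_000, random_state=rng)
          num = np.mean(like(t0) * prior0.pdf(t0) / rho0.pdf(t0))
          den = np.mean(like(t1) * prior1.pdf(t1) / rho1.pdf(t1))
          print("B01 estimate:", num / den)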
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Bridge sampling


      Special case:
      If

      \[
      \pi_1(\theta_1 \mid x) \propto \tilde\pi_1(\theta_1 \mid x)\,, \qquad \pi_2(\theta_2 \mid x) \propto \tilde\pi_2(\theta_2 \mid x)
      \]

      live on the same space (Θ1 = Θ2 ), then

      \[
      B_{12} \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde\pi_1(\theta_i \mid x)}{\tilde\pi_2(\theta_i \mid x)}\,, \qquad \theta_i \sim \pi_2(\theta \mid x)
      \]

                           [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Bridge sampling variance



      The bridge sampling estimator does poorly if

      \[
      \frac{\operatorname{var}(B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}\left[\left(\frac{\pi_1(\theta) - \pi_2(\theta)}{\pi_2(\theta)}\right)^{\!2}\right]
      \]

      is large, i.e. if π1 and π2 have little overlap.
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



(Further) bridge sampling

      In addition,

      \[
      B_{12} = \frac{\int \tilde\pi_1(\theta \mid x)\,\alpha(\theta)\,\pi_2(\theta \mid x)\,d\theta}{\int \tilde\pi_2(\theta \mid x)\,\alpha(\theta)\,\pi_1(\theta \mid x)\,d\theta} \qquad \forall\,\alpha(\cdot)
      \]

      \[
      \approx \frac{\dfrac{1}{n_2} \sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i} \mid x)\,\alpha(\theta_{2i})}{\dfrac{1}{n_1} \sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i} \mid x)\,\alpha(\theta_{1i})}\,, \qquad \theta_{ji} \sim \pi_j(\theta \mid x)
      \]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



An infamous example
      When

      \[
      \alpha(\theta) = \frac{1}{\tilde\pi_1(\theta)\,\tilde\pi_2(\theta)}
      \]

      this gives the harmonic mean approximation,

      \[
      \hat B_{21} = \frac{\dfrac{1}{n_1} \sum_{i=1}^{n_1} 1 \big/ \tilde\pi_1(\theta_{1i} \mid x)}{\dfrac{1}{n_2} \sum_{i=1}^{n_2} 1 \big/ \tilde\pi_2(\theta_{2i} \mid x)}\,, \qquad \theta_{ji} \sim \pi_j(\theta \mid x)
      \]

                                               [Newton & Raftery, 1994]
      Infamous: Most often leads to an infinite variance!!!
                                             [Radford Neal’s blog, 2008]
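
      A small simulation on the same toy conjugate model suggesting why the
      harmonic mean is so unreliable: the estimator is consistent but has
      infinite variance here, so the estimates are driven by rare posterior
      draws with tiny likelihood values; the model and sample sizes are
      assumptions for illustration.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(2)
          x = 1.2
          post = stats.norm(x / 2.0, np.sqrt(0.5))  # exact posterior, toy model
          Z_true = stats.norm.pdf(x, 0.0, np.sqrt(2.0))

          for n in (10**3, 10**5, 10**6):
              theta = post.rvs(n, random_state=rng)
              lik = stats.norm.pdf(x, loc=theta, scale=1.0)
              Z_hm = 1.0 / np.mean(1.0 / lik)       # Newton-Raftery estimate
              print(n, Z_hm, "true:", Z_true)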
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



“The Worst Monte Carlo Method Ever”

      “The good news is that the Law of Large Numbers guarantees that
      this estimator is consistent, i.e., it will very likely be very close to the
      correct answer if you use a sufficiently large number of points from
      the posterior distribution.
      The bad news is that the number of points required for this
      estimator to get close to the right answer will often be greater
      than the number of atoms in the observable universe. The even
      worse news is that it’s easy for people to not realize this, and to
      naively accept estimates that are nowhere close to the correct
      value of the marginal likelihood.”
                                       [Radford Neal’s blog, Aug. 23, 2008]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Optimal bridge sampling

      The optimal choice of auxiliary function is

      \[
      \alpha^\star = \frac{n_1 + n_2}{n_1\,\pi_1(\theta \mid x) + n_2\,\pi_2(\theta \mid x)}
      \]

      leading to

      \[
      B_{12} \approx \frac{\dfrac{1}{n_2} \sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i} \mid x) \big/ \{ n_1\,\pi_1(\theta_{2i} \mid x) + n_2\,\pi_2(\theta_{2i} \mid x) \}}{\dfrac{1}{n_1} \sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i} \mid x) \big/ \{ n_1\,\pi_1(\theta_{1i} \mid x) + n_2\,\pi_2(\theta_{1i} \mid x) \}}
      \]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Optimal bridge sampling (2)



      Reason (by the δ method):

      \[
      \frac{\operatorname{Var}(B_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2} \left[ \frac{\int \pi_1(\theta)\,\pi_2(\theta)\,[n_1\,\pi_1(\theta) + n_2\,\pi_2(\theta)]\,\alpha(\theta)^2\,d\theta}{\left( \int \pi_1(\theta)\,\pi_2(\theta)\,\alpha(\theta)\,d\theta \right)^2} - 1 \right]
      \]

      Drag: dependence on the unknown normalising constants, solved
      iteratively
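
      A minimal sketch of that fixed-point iteration on two unnormalised
      densities with known masses, so the exact value B12 = 4 can be checked;
      direct sampling replaces MCMC, and every target and sample size is an
      illustrative assumption.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(3)

          # Unnormalised "posteriors" with masses m1 = 2 and m2 = 0.5
          p1, p2 = stats.norm(0.0, 1.0), stats.norm(0.5, 1.2)
          tp1 = lambda t: 2.0 * p1.pdf(t)
          tp2 = lambda t: 0.5 * p2.pdf(t)

          n1 = n2 = 20_000
          th1 = p1.rvs(n1, random_state=rng)        # draws from pi_1(.|x)
          th2 = p2.rvs(n2, random_state=rng)        # draws from pi_2(.|x)

          B = 1.0                                   # starting value for B12
          for _ in range(50):
              # alpha* needs the normalised pi_j; since pi_1 = tp1/m1 and
              # m1 = B * m2, rescale tp1 by the current B (common factors
              # in alpha cancel in the ratio)
              D1 = n1 * tp1(th1) / B + n2 * tp2(th1)
              D2 = n1 * tp1(th2) / B + n2 * tp2(th2)
              B = np.mean(tp1(th2) / D2) / np.mean(tp2(th1) / D1)
          print("B12 estimate:", B, "(exact: 4)")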
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Ratio importance sampling


      Another identity:

      \[
      B_{12} = \frac{\mathbb{E}_\varphi[\tilde\pi_1(\theta)/\varphi(\theta)]}{\mathbb{E}_\varphi[\tilde\pi_2(\theta)/\varphi(\theta)]}
      \]

      for any density ϕ with sufficiently large support
                                                     [Torrie & Valleau, 1977]
      Use of a single sample θ1 , . . . , θn from ϕ:

      \[
      \hat B_{12} = \frac{\sum_{i=1}^{n} \tilde\pi_1(\theta_i)/\varphi(\theta_i)}{\sum_{i=1}^{n} \tilde\pi_2(\theta_i)/\varphi(\theta_i)}
      \]
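
      The same toy pair of unnormalised densities gives a short sketch of this
      single-sample estimator; the Student t choice for ϕ, with wide support
      covering both targets, is an illustrative assumption.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(4)

          p1, p2 = stats.norm(0.0, 1.0), stats.norm(0.5, 1.2)
          tp1 = lambda t: 2.0 * p1.pdf(t)           # unnormalised, mass 2
          tp2 = lambda t: 0.5 * p2.pdf(t)           # unnormalised, mass 0.5

          phi = stats.t(df=3, loc=0.25, scale=2.0)  # wide-support density
          th = phi.rvs(100_000, random_state=rng)
          w = phi.pdf(th)
          B12 = np.sum(tp1(th) / w) / np.sum(tp2(th) / w)
          print("B12 estimate:", B12, "(exact: 4)")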
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Regular importance



Ratio importance sampling (2)

      Approximate variance:

      \[
      \frac{\operatorname{var}(B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}_\varphi\!\left[ \frac{(\pi_1(\theta) - \pi_2(\theta))^2}{\varphi(\theta)^2} \right]
      \]

      Optimal choice:

      \[
      \varphi^\star(\theta) = \frac{|\pi_1(\theta) - \pi_2(\theta)|}{\int |\pi_1(\eta) - \pi_2(\eta)|\,d\eta}
      \]

                                                              [Chen, Shao & Ibrahim, 2000]
      Formally better than bridge sampling
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Approximating Zk from a posterior sample


      Use of the [harmonic mean] identity

      \[
      \mathbb{E}^{\pi_k}\!\left[ \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)} \,\Big|\, x \right]
      = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\, \frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}
      \]

      no matter what the proposal ϕ(·) is.
                          [Gelfand & Dey, 1994; Bartolucci et al., 2006]
      Direct exploitation of the MCMC output
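
      A sketch of this identity on the toy conjugate model, with a ϕ
      deliberately given lighter tails than πk(θk)Lk(θk), as the
      finite-variance constraint of the next slide requires; all tuning
      values are assumptions.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(5)
          x = 1.2
          post = stats.norm(x / 2.0, np.sqrt(0.5))  # stands in for MCMC output
          Z_true = stats.norm.pdf(x, 0.0, np.sqrt(2.0))

          theta = post.rvs(50_000, random_state=rng)
          tilde = stats.norm.pdf(theta, 0.0, 1.0) * stats.norm.pdf(x, theta, 1.0)

          # phi: normal fitted to the sample but deliberately narrower,
          # hence lighter tails than the (Gaussian) posterior
          phi = stats.norm(theta.mean(), 0.8 * theta.std())
          Z_hat = 1.0 / np.mean(phi.pdf(theta) / tilde)
          print(Z_hat, Z_true)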
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Comparison with regular importance sampling


      Harmonic mean: the constraint is opposed to the usual importance
      sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails
      than πk(θk)Lk(θk) for the approximation

      \[
      \hat Z_{1k} = 1 \bigg/ \frac{1}{T} \sum_{t=1}^{T} \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}
      \]

      to have a finite variance.
      E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Comparison with regular importance sampling (cont’d)



      Compare Ẑ1k with a standard importance sampling approximation

      \[
      \hat Z_{2k} = \frac{1}{T} \sum_{t=1}^{T} \frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}
      \]

      where the θk^{(t)} ’s are generated from the density ϕ(·) (with fatter
      tails, like t’s)
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Approximating Zk using a mixture representation



      Design a specific mixture for simulation [importance sampling]
      purposes, with density

      \[
      \varphi_k(\theta_k) \propto \omega_1\,\pi_k(\theta_k)\,L_k(\theta_k) + \varphi(\theta_k)\,,
      \]

      where ϕ(·) is arbitrary (but normalised)
      Note: ω1 is not a probability weight
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Approximating Z using a mixture representation (cont’d)

      Corresponding MCMC (= Gibbs) sampler
      At iteration t
          1   Take δ(t) = 1 with probability

      \[
      \omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)}) \right\}
      \]

              and δ(t) = 2 otherwise;
          2   If δ(t) = 1, generate θk^{(t)} ∼ MCMC(θk^{(t−1)}, θk^{(t)}), where
              MCMC(θk, θk′) denotes an arbitrary MCMC kernel associated
              with the posterior πk(θk|x) ∝ πk(θk)Lk(θk);
          3   If δ(t) = 2, generate θk^{(t)} ∼ ϕ(θk) independently
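
      A compact sketch of this two-component sampler together with the
      Rao-Blackwellised estimate Ẑ3k of the next slide, on the toy conjugate
      model; the independent Metropolis–Hastings step used as the “arbitrary
      MCMC kernel”, the value ω1 = 1, and the absence of burn-in are all
      illustrative assumptions.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(6)
          x = 1.2
          Z_true = stats.norm.pdf(x, 0.0, np.sqrt(2.0))

          tilde = lambda t: stats.norm.pdf(t, 0.0, 1.0) * stats.norm.pdf(x, t, 1.0)
          phi = stats.norm(0.0, 2.0)                # arbitrary normalised density
          w1 = 1.0                                  # omega_1

          th, num, den = 0.0, 0.0, 0.0
          for _ in range(20_000):
              if rng.random() < w1 * tilde(th) / (w1 * tilde(th) + phi.pdf(th)):
                  # delta = 1: one independent-MH step targeting pi_k(.|x)
                  prop = rng.normal(0.0, 2.0)
                  ratio = tilde(prop) * phi.pdf(th) / (tilde(th) * phi.pdf(prop))
                  if rng.random() < ratio:
                      th = prop
              else:
                  th = phi.rvs(random_state=rng)    # delta = 2: draw from phi
              d = w1 * tilde(th) + phi.pdf(th)
              num += tilde(th) / d                  # Rao-Blackwell numerator
              den += phi.pdf(th) / d                # and denominator
          print("Z3 estimate:", num / den, "true:", Z_true)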
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Harmonic means



Evidence approximation by mixtures
      Rao-Blackwellised estimate

      \[
      \hat\xi = \frac{1}{T} \sum_{t=1}^{T} \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}\,,
      \]

      which converges to ω1 Zk /{ω1 Zk + 1}
      Deduce Ẑ3k from ω1 Ẑ3k /{ω1 Ẑ3k + 1} = ξ̂, i.e.

      \[
      \hat Z_{3k} = \frac{\sum_{t=1}^{T} \pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) \big/ \{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\}}{\sum_{t=1}^{T} \varphi(\theta_k^{(t)}) \big/ \{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\}}
      \]

                                                                              [Bridge sampler redux]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Chib’s representation


      Direct application of Bayes’ theorem: given x ∼ fk (x|θk ) and
      θk ∼ πk (θk ),

      \[
      Z_k = m_k(x) = \frac{f_k(x \mid \theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k \mid x)}
      \]

      holds for any θk. Use of an approximation to the posterior at a point θk*:

      \[
      \hat Z_k = \hat m_k(x) = \frac{f_k(x \mid \theta_k^*)\,\pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^* \mid x)}\,.
      \]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Case of latent variables



      For missing-variable models, with z as in mixture models, natural
      Rao-Blackwell estimate

      \[
      \hat\pi_k(\theta_k^* \mid x) = \frac{1}{T} \sum_{t=1}^{T} \pi_k(\theta_k^* \mid x, z_k^{(t)})\,,
      \]

      where the zk^{(t)} ’s are Gibbs-sampled latent variables
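
      A sketch of Chib’s representation with this Rao-Blackwell density, for a
      two-component normal mixture with known weights and variances; using the
      posterior mean instead of the MAP for θk* is an assumption of
      convenience, and the estimate deliberately ignores the label-switching
      correction discussed on the following slides.

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(7)

          # x_i ~ 0.5 N(mu_0, 1) + 0.5 N(mu_1, 1), priors mu_j ~ N(0, 10)
          x = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(2.0, 1.0, 60)])
          n, v0 = len(x), 10.0

          def cond_moments(z, j):
              # moments of mu_j | x, z under the conjugate normal prior
              nj = np.sum(z == j)
              v = 1.0 / (nj + 1.0 / v0)
              return v * x[z == j].sum(), v

          mu, store = np.array([-1.0, 1.0]), []
          for t in range(3000):                     # Gibbs over (z, mu)
              p = stats.norm.pdf(x[:, None], mu, 1.0)
              z = (rng.random(n) < p[:, 1] / p.sum(axis=1)).astype(int)
              for j in (0, 1):
                  m, v = cond_moments(z, j)
                  mu[j] = rng.normal(m, np.sqrt(v))
              if t >= 500:
                  store.append((mu.copy(), z))

          mu_star = np.mean([m for m, _ in store], axis=0)

          # pi_hat(mu*|x) = average of the full conditionals over the z^(t)'s
          dens = []
          for _, z in store:
              val = 1.0
              for j in (0, 1):
                  m, v = cond_moments(z, j)
                  val *= stats.norm.pdf(mu_star[j], m, np.sqrt(v))
              dens.append(val)

          log_lik = np.log(0.5 * stats.norm.pdf(x[:, None], mu_star, 1.0).sum(axis=1)).sum()
          log_prior = stats.norm.logpdf(mu_star, 0.0, np.sqrt(v0)).sum()
          print("Chib log-evidence:", log_lik + log_prior - np.log(np.mean(dens)))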
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Label switching impact


      A mixture model [special case of missing variable model] is
      invariant under permutations of the indices of the components.
      E.g., mixtures
                          0.3N (0, 1) + 0.7N (2.3, 1)
      and
                                       0.7N (2.3, 1) + 0.3N (0, 1)
      are exactly the same!
       ⇒ The component parameters θi are not identifiable marginally
      since they are exchangeable
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Connected difficulties


          1   Number of modes of the likelihood of order O(k!):
              ⇒ maximization and even [MCMC] exploration of the
              posterior surface are harder
          2   Under exchangeable priors on (θ, p) [prior invariant under
              permutation of the indices], all posterior marginals are
              identical:
              ⇒ the posterior expectation of θ1 equals the posterior
              expectation of θ2
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



License


      Since the Gibbs output does not produce exchangeability, the Gibbs
      sampler has not explored the whole parameter space: it lacks the
      energy to switch enough component allocations at once

      [Figure: Gibbs output for a mixture: traces of µi , pi and σi against
      iteration n (0–500), and pairwise scatter plots of (µi , pi ), (pi , σi )
      and (σi , µi ), with no sign of label switching.]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Label switching paradox




      We should observe the exchangeability of the components [label
      switching] to conclude about convergence of the Gibbs sampler.
      If we observe it, then we do not know how to estimate the
      parameters.
      If we do not, then we are uncertain about the convergence!!!
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Compensation for label switching
      For mixture models, zk^{(t)} usually fails to visit all configurations in a
      balanced way, despite the symmetry predicted by the theory

      \[
      \pi_k(\theta_k \mid x) = \pi_k(\sigma(\theta_k) \mid x) = \frac{1}{k!} \sum_{\sigma \in \mathfrak{S}_k} \pi_k(\sigma(\theta_k) \mid x)
      \]

      for all σ’s in Sk , the set of all permutations of {1, . . . , k}.
      Consequences on the numerical approximation, biased by a factor of
      order k!
      Recover the theoretical symmetry by using

      \[
      \hat\pi_k(\theta_k^* \mid x) = \frac{1}{T\,k!} \sum_{\sigma \in \mathfrak{S}_k} \sum_{t=1}^{T} \pi_k(\sigma(\theta_k^*) \mid x, z_k^{(t)})\,.
      \]

                                                    [Berkhof, Mechelen, & Gelman, 2003]
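
      On top of the earlier mixture sketch, the correction amounts to averaging
      the Rao-Blackwell conditional density over all k! relabellings of θk*;
      the helper below is hypothetical (its argument dens_given_z is assumed to
      wrap the averaging over the sampled allocations).

          from itertools import permutations
          import numpy as np

          def symmetrised_density(dens_given_z, mu_star, k):
              # average over sigma in S_k of the Rao-Blackwell estimate
              # evaluated at the permuted parameter sigma(mu*)
              vals = [dens_given_z(np.asarray(mu_star)[list(s)])
                      for s in permutations(range(k))]
              return float(np.mean(vals))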
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Galaxy dataset

      n = 82 galaxies as a mixture of k normal distributions with both
      mean and variance unknown.
                                                          [Roeder, 1992]
      [Figure: histogram of the 82 galaxy observations (relative frequency
      against the data) with the average posterior density overlaid.]
On some computational methods for Bayesian model choice
  Importance sampling solutions
     Chib’s solution



Galaxy dataset (k)
      Using only the original estimate, with θk* as the MAP estimator,

      \[
      \log \hat m_k(x) = -105.1396
      \]

      for k = 3 (based on 10^3 simulations), while introducing the
      permutations leads to

      \[
      \log \hat m_k(x) = -103.3479
      \]

      Note that −105.1396 + log(3!) = −103.3479.

        k                  2         3         4         5         6         7         8
        log m̂k(x)      −115.68   −103.35   −102.66   −101.93   −102.88   −105.48   −108.44

      Estimates of the marginal likelihoods by the symmetrised Chib’s
      approximation (based on 10^5 Gibbs iterations and, for k > 5, 100
      permutations selected at random in Sk ).
                                 [Lee, Marin, Mengersen & Robert, 2008]
On some computational methods for Bayesian model choice
  Cross-model solutions




Cross-model solutions

      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions
             Reversible jump
             Saturation schemes
             Implementation error

      4    Nested sampling

      5    ABC model choice
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Reversible jump


      Idea: Set up a proper measure-theoretic framework for designing
      moves between models Mk
                                                          [Green, 1995]
      Create a reversible kernel K on H = ∪k {k} × Θk such that

      \[
      \int_A \int_B K(x, dy)\,\pi(x)\,dx = \int_B \int_A K(y, dx)\,\pi(y)\,dy
      \]

      for the invariant density π [x is of the form (k, θ(k) )]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Local moves
      For a move between two models, M1 and M2 , the Markov chain
      being in state θ1 ∈ M1 , denote by K1→2 (θ1 , dθ2 ) and K2→1 (θ2 , dθ1 )
      the corresponding kernels, under the detailed balance condition

      \[
      \pi(d\theta_1)\,K_{1\to 2}(\theta_1, d\theta_2) = \pi(d\theta_2)\,K_{2\to 1}(\theta_2, d\theta_1)\,,
      \]

      and take, wlog, dim(M2 ) > dim(M1 ).
      Proposal expressed as

      \[
      \theta_2 = \Psi_{1\to 2}(\theta_1, v_{1\to 2})
      \]

      where v1→2 is a random variable of dimension dim(M2 ) − dim(M1 ),
      generated as

      \[
      v_{1\to 2} \sim \varphi_{1\to 2}(v_{1\to 2})\,.
      \]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Local moves (2)
      In this case, q1→2 (θ1 , dθ2 ) has density

      \[
      \varphi_{1\to 2}(v_{1\to 2})\,\left| \frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})} \right|^{-1}\,,
      \]

      by the Jacobian rule.
      If π1→2 is the probability of choosing a move to M2 while in M1 , the
      acceptance probability reduces to

      \[
      \alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_2, \theta_2)\,\pi_{2\to 1}}{\pi(M_1, \theta_1)\,\pi_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}\,\left| \frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})} \right|\,.
      \]

      If several models are considered simultaneously, the same probability
      π1→2 of choosing a move to M2 while in M1 appears in the overall kernel

      \[
      K(x, B) = \sum_{m=1}^{\infty} \int_B \rho_m(x, y)\,q_m(x, dy) + \omega(x)\,\mathbb{I}_B(x)
      \]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Generic reversible jump acceptance probability


      Acceptance probability of θ2 = Ψ1→2 (θ1 , v1→2 ) is

      \[
      \alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_2, \theta_2)\,\pi_{2\to 1}}{\pi(M_1, \theta_1)\,\pi_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}\,\left| \frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})} \right|
      \]

      while the acceptance probability of θ1 with (θ1 , v1→2 ) = Ψ1→2^{-1}(θ2 ) is

      \[
      \alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_1, \theta_1)\,\pi_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}{\pi(M_2, \theta_2)\,\pi_{2\to 1}}\,\left| \frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})} \right|^{-1}
      \]

      ⇒ Difficult calibration
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Green’s sampler

      Algorithm
Iteration t (t ≥ 1): if x(t) = (m, θ(m)),
    1. Select model Mn with probability πmn
    2. Generate umn ∼ ϕmn(u) and set (θ(n), vnm) = Ψm→n(θ(m), umn)
    3. Take x(t+1) = (n, θ(n)) with probability

           min{ [π(n, θ(n)) πnm ϕnm(vnm) / π(m, θ(m)) πmn ϕmn(umn)] · |∂Ψm→n(θ(m), umn)/∂(θ(m), umn)| , 1 }

       and take x(t+1) = x(t) otherwise.
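To fix ideas, here is a minimal Python sketch of Green's sampler on a hypothetical two-model toy problem, M1: x ∼ N(0, 1) with no parameter versus M2: x ∼ N(θ, 1) with θ ∼ N(0, 1); the completion is u ∼ ϕ = N(0, 1) with θ(2) = u, so the Jacobian is 1, jumps are always proposed (πmn = 1), and within-model moves are omitted. Data and names are illustrative, not from the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=20)        # hypothetical data

def log_target(m, theta):
    # unnormalised pi(M_m, theta | x), with equal prior model weights
    if m == 1:                           # M1: x ~ N(0,1), no parameter
        return norm.logpdf(x, 0.0, 1.0).sum()
    # M2: x ~ N(theta,1), theta ~ N(0,1)
    return norm.logpdf(x, theta, 1.0).sum() + norm.logpdf(theta, 0.0, 1.0)

m, theta, chain = 1, None, []
for t in range(10_000):
    if m == 1:                           # propose the jump 1 -> 2
        u = rng.normal()                 # u ~ phi_{1->2} = N(0,1); theta_2 = u
        log_alpha = log_target(2, u) - log_target(1, None) - norm.logpdf(u)
        if np.log(rng.uniform()) < log_alpha:
            m, theta = 2, u
    else:                                # propose the reverse jump 2 -> 1
        log_alpha = log_target(1, None) + norm.logpdf(theta) - log_target(2, theta)
        if np.log(rng.uniform()) < log_alpha:
            m, theta = 1, None
    chain.append(m)

print("P(M = 2 | x) approx:", np.mean(np.array(chain) == 2))
```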
On some computational methods for Bayesian model choice
  Cross-model solutions
     Reversible jump



Interpretation


The representation puts us back in a fixed-dimension setting:
    M1 × V1→2 and M2 are in one-to-one relation;
    reversibility imposes that θ1 is derived as

          (θ1, v1→2) = Ψ1→2⁻¹(θ2);

    the move appears like a regular Metropolis–Hastings move from the couple (θ1, v1→2) to θ2 when the stationary distributions are π(M1, θ1) × ϕ1→2(v1→2) and π(M2, θ2), and when the proposal distribution is deterministic.
On some computational methods for Bayesian model choice
  Cross-model solutions
     Saturation schemes



Alternative
Saturation of the parameter space H = ∪k {k} × Θk by creating
    θ = (θ1, . . . , θD)
    a model index M
    pseudo-priors πj(θj|M = k) for j ≠ k
                                              [Carlin & Chib, 1995]
Validation by

      P(M = k|x) = ∫ P(M = k|x, θ) π(θ|x) dθ = Zk

where the (marginal) posterior is [not πk]

      π(θ|x) = Σ_{k=1}^D P(θ, M = k|x)
             = Σ_{k=1}^D pk Zk πk(θk|x) ∏_{j≠k} πj(θj|M = k) .
On some computational methods for Bayesian model choice
  Cross-model solutions
     Saturation schemes



MCMC implementation
Run a Markov chain (M(t), θ1(t), . . . , θD(t)) with stationary distribution π(θ, M|x) by
    1. Pick M(t) = k with probability π(θ(t−1), k|x)
    2. Generate θk(t) from the posterior πk(θk|x) [or MCMC step]
    3. Generate θj(t) (j ≠ k) from the pseudo-prior πj(θj|M = k)
Approximate P(M = k|x) = Zk by

      p̌k(x) ∝ pk Σ_{t=1}^T { fk(x|θk(t)) πk(θk(t)) ∏_{j≠k} πj(θj(t)|M = k) } / { Σ_{ℓ=1}^D pℓ fℓ(x|θℓ(t)) πℓ(θℓ(t)) ∏_{j≠ℓ} πj(θj(t)|M = ℓ) }
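As a hedged illustration of this Gibbs scheme, the sketch below runs Carlin & Chib's sampler on the two-normal toy example used later in the slides (M1: θ ∼ N(0, 1), M2: θ ∼ N(5, 1), x|θ ∼ N(θ, 1)), under the simplifying assumption that the pseudo-priors equal the priors, in which case step 1 only involves pk fk(x|θk); the data value is hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = 1.0                       # hypothetical single observation
p = np.array([0.5, 0.5])      # prior model weights p_k
mu0 = np.array([0.0, 5.0])    # theta ~ N(mu0[k], 1) under M_k; x|theta ~ N(theta, 1)

theta = rng.normal(mu0, 1.0)  # one theta block per model
visits = np.zeros(2)
for t in range(20_000):
    # step 1: P(M=k | x, theta) prop. to p_k f_k(x|theta_k) pi_k(theta_k)
    # times the pseudo-priors of the other blocks; with pseudo-prior = prior,
    # the product of all priors is common to both models and cancels
    logw = np.log(p) + norm.logpdf(x, theta, 1.0)
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(2, p=w)
    # step 2: theta_k | x, M=k is the conjugate posterior N((x+mu0[k])/2, 1/2)
    theta[k] = rng.normal((x + mu0[k]) / 2.0, np.sqrt(0.5))
    # step 3: the other block is refreshed from its pseudo-prior (= prior here)
    j = 1 - k
    theta[j] = rng.normal(mu0[j], 1.0)
    visits[k] += 1

print("P(M = k | x) estimates:", visits / visits.sum())
```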
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Scott’s (2002) proposal


Suggest estimating P(M = k|x) by

      Zk ∝ pk Σ_{t=1}^T { fk(x|θk(t)) / Σ_{j=1}^D pj fj(x|θj(t)) } ,

based on D simultaneous and independent MCMC chains

      (θk(t))t ,    1 ≤ k ≤ D,

with stationary distributions πk(θk|x) [instead of above joint]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Congdon’s (2006) extension


Selecting flat [prohibited] pseudo-priors, Congdon (2006) uses instead

      Zk ∝ pk Σ_{t=1}^T { fk(x|θk(t)) πk(θk(t)) / Σ_{j=1}^D pj fj(x|θj(t)) πj(θj(t)) } ,

where again the θk(t)'s are MCMC chains with stationary distributions πk(θk|x)
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Examples

      Example (Model choice)
Model M1: x|θ ∼ U(0, θ) with prior θ ∼ Exp(1) versus model M2: x|θ ∼ Exp(θ) with prior θ ∼ Exp(1). Equal prior weights on both models: p1 = p2 = 0.5.




[Figure: approximations of P(M = 1|x) against y — Scott's (2002) in blue, Congdon's (2006) in red; N = 10⁶ simulations.]
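This example admits closed-form marginals, m1(x) = E1(x) (the exponential integral) and m2(x) = 1/(1 + x)², so both approximations can be checked numerically. The Python sketch below is a hypothetical reconstruction: i.i.d. draws from each within-model posterior stand in for the MCMC chains (π2(θ|x) is Ga(2, 1 + x); π1(θ|x) ∝ e^{−θ}/θ on (x, ∞) is simulated by rejection), and Scott's and Congdon's ratios are formed with p1 = p2.

```python
import numpy as np
from scipy.special import exp1

rng = np.random.default_rng(2)
x, T = 1.5, 100_000

# exact marginal likelihoods: m1(x) = E1(x), m2(x) = 1/(1+x)^2
m1, m2 = exp1(x), (1.0 + x) ** -2
exact = m1 / (m1 + m2)

def draw_post1(n):
    # pi_1(theta|x) prop. to e^{-theta}/theta on (x, inf),
    # sampled by rejection from the shifted proposal x + Exp(1)
    out, i = np.empty(n), 0
    while i < n:
        th = x + rng.exponential(1.0)
        if rng.uniform() < x / th:     # accept with prob. (1/th)/(1/x)
            out[i] = th
            i += 1
    return out

th1 = draw_post1(T)                               # stands in for the M1 chain
th2 = rng.gamma(2.0, 1.0 / (1.0 + x), size=T)     # pi_2(theta|x) = Ga(2, 1+x)

f1 = 1.0 / th1                                    # f_1(x|theta) = 1/theta, theta >= x
f2 = th2 * np.exp(-th2 * x)                       # f_2(x|theta) = theta e^{-theta x}
pr1, pr2 = np.exp(-th1), np.exp(-th2)             # Exp(1) prior densities

scott = np.mean(f1 / (f1 + f2))                             # Scott (2002)
congdon = np.mean(f1 * pr1 / (f1 * pr1 + f2 * pr2))         # Congdon (2006)
print(exact, scott, congdon)
```

Both ratios differ from the exact value, which is the implementation error documented in the figure above.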
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Examples (2)
      Example (Model choice (2))
      Normal model M1 : x ∼ N (θ, 1) with θ ∼ N (0, 1) vs. normal
      model M2 : x ∼ N (θ, 1) with θ ∼ N (5, 1)



[Figure: comparison of both approximations with P(M = 1|x) against y — Scott's (2002) (green, mixed dashes) and Congdon's (2006) (brown, long dashes); N = 10⁴ simulations.]
On some computational methods for Bayesian model choice
  Cross-model solutions
     Implementation error



Examples (3)
      Example (Model choice (3))
      Model M1 : x ∼ N (0, 1/ω) with ω ∼ Exp(a) vs.
      M2 : exp(x) ∼ Exp(λ) with λ ∼ Exp(b).




[Figure: four panels comparing Congdon's (2006) approximation (brown, dashed) with P(M = 1|x) against y, for (a, b) equal to (.24, 8.9), (.56, .7), (4.1, .46) and (.98, .081), resp.; N = 10⁴ simulations.]
On some computational methods for Bayesian model choice
  Nested sampling




Nested sampling
      1    Evidence

      2    Importance sampling solutions

      3    Cross-model solutions

      4    Nested sampling
             Purpose
             Implementation
             Error rates
             Constraints
             Importance variant
             A mixture comparison
On some computational methods for Bayesian model choice
  Nested sampling
     Purpose



Nested sampling: Goal


      Skilling’s (2007) technique using the one-dimensional
      representation:
                                                              1
                                     Z = Eπ [L(θ)] =              ϕ(x) dx
                                                          0

      with
                                        ϕ−1 (l) = P π (L(θ) > l).
      Note; ϕ(·) is intractable in most cases.
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: First approximation

Approximate Z by a Riemann sum:

      Ẑ = Σ_{i=1}^N (x_{i−1} − x_i) ϕ(x_i)

where the x_i's are either:
    deterministic: x_i = e^{−i/N}
    or random:

      x_0 = 1,    x_{i+1} = t_i x_i,    t_i ∼ Be(N, 1)

so that E[log x_i] = −i/N.
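A quick check under illustrative assumptions: with prior E(1) and likelihood L(θ) = e^{−θ}, one has ϕ(x) = 1 − x and Z = 1/2 exactly, so both grids can be compared on a case where everything is known (the sum is run well past N so the mass below e^{−i/N} becomes negligible).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 10 * 1000          # pool size N; M = 10N terms so x_M = e^{-10} is tiny
phi = lambda x: 1.0 - x         # known phi for prior E(1), L(theta) = e^{-theta}; Z = 1/2

def riemann(xs):                # sum of (x_{i-1} - x_i) phi(x_i), with x_0 = 1
    x_prev = np.concatenate(([1.0], xs[:-1]))
    return np.sum((x_prev - xs) * phi(xs))

x_det = np.exp(-np.arange(1, M + 1) / N)     # deterministic grid x_i = e^{-i/N}
x_rnd = np.cumprod(rng.beta(N, 1, size=M))   # random grid: x_{i+1} = t_i x_i
print(riemann(x_det), riemann(x_rnd))        # both close to Z = 0.5
```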
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Extraneous white noise
Take

      Z = ∫ e^{−θ} dθ = ∫ (1/δ) e^{−(1−δ)θ} · δ e^{−δθ} dθ = E_δ[(1/δ) e^{−(1−δ)θ}]

      Ẑ = (1/N) Σ_{i=1}^N δ⁻¹ e^{−(1−δ)θ_i} (x_{i−1} − x_i) ,    θ_i ∼ E(δ) I(θ_i ≤ θ_{i−1})

Comparison of variances and MSEs (two entries per N):

      N      deterministic    random
      50         4.64          10.5
                 4.65          10.5
      100        2.47           4.9
                 2.48           5.02
      500        .549           1.01
                 .550           1.14
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Alternative representation

Another representation is

      Ẑ = Σ_{i=0}^{N−1} {ϕ(x_{i+1}) − ϕ(x_i)} x_i

which is a special case of

      Ẑ = Σ_{i=0}^{N−1} {L(θ_(i+1)) − L(θ_(i))} π({θ; L(θ) > L(θ_(i))})

where · · · L(θ_(i+1)) < L(θ_(i)) · · ·
                                    [Lebesgue version of Riemann's sum]
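A hypothetical numerical illustration of the Lebesgue version, again with prior E(1) and L(θ) = e^{−θ}, so that the survival function π({θ; L(θ) > L(θ_(i))}) = 1 − e^{−θ_(i)} is available in closed form and Z = 1/2:

```python
import numpy as np

rng = np.random.default_rng(1)
# ordered sample: theta_(i) sorted so that L(theta_(i)) = e^{-theta_(i)} increases
theta = np.sort(rng.exponential(1.0, size=10_000))[::-1]
L = np.exp(-theta)                     # increasing likelihood values
surv = 1.0 - np.exp(-theta)            # pi({L(theta) > L(theta_(i))}) in closed form
print(np.sum(np.diff(L) * surv[:-1]))  # Lebesgue sum, close to Z = 0.5
```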
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Second approximation
Estimate (intractable) ϕ(x_i) by ϕ̂_i:

Nested sampling
Start with N values θ1, . . . , θN sampled from π
At iteration i,
    1. Take ϕ̂_i = L(θk), where θk is the point with smallest likelihood in the pool of θi's
    2. Replace θk with a sample from the prior constrained to L(θ) > ϕ̂_i: the current N points are sampled from the prior constrained to L(θ) > ϕ̂_i.

Note that

      π({θ; L(θ) > L(θ_(i+1))}) / π({θ; L(θ) > L(θ_(i))}) ≈ 1 − 1/N
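A minimal runnable sketch of this scheme, under illustrative assumptions: prior N(0, 1) and likelihood L(θ) = N(x; θ, 1) for a single hypothetical datum x, so that the constrained-prior step can be done exactly (the constraint L(θ) > l is the interval |θ − x| < r) and the answer Z = N(x; 0, √2) is known.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(3)
x, N = 2.0, 500                          # hypothetical datum, pool size

def L(theta):                            # likelihood L(theta) = N(x; theta, 1)
    return norm.pdf(x, theta, 1.0)

def prior_constrained(lmin):
    # exact draw from the N(0,1) prior restricted to {L(theta) > lmin},
    # i.e. to the interval (x - r, x + r) with r = sqrt(-2 log(lmin sqrt(2 pi)))
    r = np.sqrt(-2.0 * np.log(lmin * np.sqrt(2.0 * np.pi)))
    return truncnorm.rvs(x - r, x + r, loc=0.0, scale=1.0, random_state=rng)

pool = rng.normal(size=N)                # start: N draws from the prior
Zhat, x_prev = 0.0, 1.0
for i in range(1, 6 * N + 1):            # stop once e^{-i/N} is negligible
    Ls = L(pool)
    k = int(np.argmin(Ls))
    phi_i = Ls[k]                        # phi_hat_i = smallest likelihood in the pool
    x_i = np.exp(-i / N)                 # deterministic x_i = e^{-i/N}
    Zhat += (x_prev - x_i) * phi_i
    x_prev = x_i
    pool[k] = prior_constrained(phi_i)   # replace by a constrained-prior draw

print("nested sampling:", Zhat, "  exact:", norm.pdf(x, 0.0, np.sqrt(2.0)))
```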
On some computational methods for Bayesian model choice
  Nested sampling
     Implementation



Nested sampling: Third approximation


Iterate the above steps until a given stopping iteration j is reached: e.g.,
    observe very small changes in the approximation Ẑ;
    reach the maximal value of L(θ) when the likelihood is bounded and its maximum is known;
    truncate the integral Z at level ε, i.e. replace

      ∫_0^1 ϕ(x) dx    with    ∫_ε^1 ϕ(x) dx
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



Approximation error


Error = Ẑ − Z

      = Σ_{i=1}^j (x_{i−1} − x_i) ϕ̂_i − ∫_0^1 ϕ(x) dx

      = −∫_0^ε ϕ(x) dx                                               (Truncation Error)
      + { Σ_{i=1}^j (x_{i−1} − x_i) ϕ(x_i) − ∫_ε^1 ϕ(x) dx }         (Quadrature Error)
      + Σ_{i=1}^j (x_{i−1} − x_i) {ϕ̂_i − ϕ(x_i)}                    (Stochastic Error)

                                    [Dominated by Monte Carlo]
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



A CLT for the Stochastic Error


The (dominating) stochastic error is OP(N^{−1/2}):

      N^{1/2} (Ẑ − Z) →D N(0, V)

with

      V = −∫∫_{s,t ∈ [ε,1]} s ϕ′(s) t ϕ′(t) log(s ∨ t) ds dt.

                                    [Proof based on Donsker's theorem]
On some computational methods for Bayesian model choice
  Nested sampling
     Error rates



What of log Z?

If the interest lies in log Z, a Slutsky transform of the CLT gives

      N^{1/2} (log Ẑ − log Z) →D N(0, V/Z²)

Note: The number of simulated points equals the number of iterations j, and is a multiple of N: if one stops at the first iteration j such that e^{−j/N} < ε, then

      j = N ⌈− log ε⌉ .
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Sampling from constr’d priors



Exact simulation from the constrained prior is intractable in most cases.

Skilling (2007) proposes to use MCMC steps instead, but this introduces a bias (through the stopping rule).

When exact simulation is implementable, a slice sampler can be devised at the same cost [or less].
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Banana illustration
Case of a banana target obtained by twisting a 2D normal: the second coordinate becomes x2 + βx1² − 100β
                                    [Haario, Saksman, Tamminen, 1999]

[Figure: banana-shaped target with β = .03, σ = (100, 1)]
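For reference, a small sketch of the corresponding log-density under these conventions, assuming the untwisted target is N(0, diag(100, 1)); parameter names are illustrative. The twist is volume-preserving (unit Jacobian), so the density is obtained exactly by untwisting the second coordinate.

```python
import numpy as np

def banana_logpdf(x1, x2, beta=0.03):
    # N(0, diag(100, 1)) log-density evaluated after untwisting the
    # second coordinate, as in Haario, Saksman & Tamminen (1999)
    y2 = x2 + beta * x1**2 - 100.0 * beta
    return -0.5 * (x1**2 / 100.0 + y2**2) - np.log(2.0 * np.pi * 10.0)
```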
On some computational methods for Bayesian model choice
  Nested sampling
     Constraints



Banana illustration (2)
Use of nested sampling with N = 1000 and 50 MCMC steps of size 0.1, compared with a population Monte Carlo (PMC) sampler based on 10 iterations with 5000 points per iteration and a final sample of 50,000 points, using nine Student's t components with 9 df
                [Wraith, Kilbinger, Benabed et al., 2009, Physical Review D]

[Figure: evidence estimation]
[Figure: E[X1] estimation]

[Figure: E[X2] estimation]
On some computational methods for Bayesian model choice
  Nested sampling
     Importance variant



An IS variant of nested sampling

Consider an instrumental prior π̃ and likelihood L̃, with weight function

      w(θ) = π(θ) L(θ) / π̃(θ) L̃(θ)

and weighted NS estimator

      Ẑ = Σ_{i=1}^j (x_{i−1} − x_i) ϕ̂_i w(θ_i).

Then choose (π̃, L̃) so that sampling from π̃ constrained to L̃(θ) > l is easy; e.g. N(c, Id) constrained to ||c − θ|| < r.
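Sampling N(c, Id) restricted to the ball ||θ − c|| < r is indeed straightforward; a hypothetical sketch using the direction–radius decomposition (uniform direction, radius from a truncated chi distribution):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)

def constrained_gaussian(c, r, size):
    # theta - c ~ N(0, I_d) has a uniform direction and R^2 = ||theta - c||^2 ~ chi2_d;
    # restricting to R < r truncates the chi2 to (0, r^2), sampled by inverse cdf
    c = np.asarray(c, dtype=float)
    d = c.size
    u = rng.normal(size=(size, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)       # uniform directions
    q = rng.uniform(0.0, chi2.cdf(r**2, d), size=size)  # truncated chi2 quantiles
    R = np.sqrt(chi2.ppf(q, d))
    return c + R[:, None] * u

print(constrained_gaussian([0.0, 0.0], 1.5, 3))
```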
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Mixture pN (0, 1) + (1 − p)N (µ, σ) posterior


    Posterior on (µ, σ) for n observations with µ = 2 and σ = 3/2, when p is known
    Use of a uniform prior both on (−2, 6) for µ and on (.001, 16) for log σ²
    Occurrences of posterior bursts at µ = xi
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment




   MCMC sample for n = 16                                 Nested sampling sequence
   observations from the mixture.                         with M = 1000 starting points.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Experiment




   MCMC sample for n = 50                                 Nested sampling sequence
   observations from the mixture.                         with M = 1000 starting points.
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison

    1. Nested sampling: M = 1000 points, with 10 random walk moves at each step, simulations from the constr'd prior and a stopping rule at 95% of the observed maximum likelihood
    2. T = 10⁴ MCMC (= Gibbs) simulations producing non-parametric estimates ϕ̂
                                    [Diebolt & Robert, 1990]
    3. Monte Carlo estimates Ẑ1, Ẑ2, Ẑ3 using the product of two Gaussian kernels
    4. Numerical integration based on an 850 × 950 grid [reference value, confirmed by Chib's method]
On some computational methods for Bayesian model choice
  Nested sampling
     A mixture comparison



Comparison (cont’d)
[Figure: boxplots of the four approximations V1–V4 of the evidence (vertical scale 0.85–1.15). Graph based on a sample of 10 observations for µ = 2 and σ = 3/2 (150 replicas).]
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 

Recently uploaded (20)

BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

MaxEnt 2009 talk

  • 8. Bayesian model choice — Probabilise the entire model/parameter space:
    - allocate probabilities $p_i$ to all models $M_i$
    - define priors $\pi_i(\theta_i)$ for each parameter space $\Theta_i$
    - compute
      $$\pi(M_i|x) = \frac{p_i \int_{\Theta_i} f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x|\theta_j)\,\pi_j(\theta_j)\,d\theta_j}$$
    - take largest $\pi(M_i|x)$ to determine the "best" model, or use the averaged predictive
      $$\sum_j \pi(M_j|x) \int_{\Theta_j} f_j(x'|\theta_j)\,\pi_j(\theta_j|x)\,d\theta_j$$
  • 12. Evidence — All these problems end up with a similar quantity, the evidence
    $$Z_k = \int_{\Theta_k} \pi_k(\theta_k)\,L_k(\theta_k)\,d\theta_k\,,$$
    aka the marginal likelihood.
  • 13. Importance sampling revisited — Outline:
    1 Evidence
    2 Importance sampling solutions: Regular importance; Harmonic means; Chib's solution
    3 Cross-model solutions
    4 Nested sampling
    5 ABC model choice
  • 14. Bayes factor approximation — When approximating the Bayes factor
    $$B_{01} = \frac{\int_{\Theta_0} f_0(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0}{\int_{\Theta_1} f_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}$$
    use of importance functions $\varpi_0$ and $\varpi_1$ and
    $$\hat B_{01} = \frac{n_0^{-1}\sum_{i=1}^{n_0} f_0(x|\theta_0^i)\,\pi_0(\theta_0^i)/\varpi_0(\theta_0^i)}{n_1^{-1}\sum_{i=1}^{n_1} f_1(x|\theta_1^i)\,\pi_1(\theta_1^i)/\varpi_1(\theta_1^i)}$$
    (a sketch follows below)
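To fix ideas, here is a minimal Python sketch of this importance-sampling approximation on an assumed conjugate toy pair of models; the models, data point and importance function below are illustrative choices, not taken from the slides:

```python
import numpy as np
from scipy import stats

# Assumed toy setup: M0: x ~ N(theta, 1), theta ~ N(0, 1)
#              vs    M1: x ~ N(theta, 1), theta ~ N(5, 1),
# with a common importance function centred at the observation.
x = 1.2
rng = np.random.default_rng(0)

def is_marginal(prior_mean, proposal, n=100_000):
    """Importance-sampling estimate of m(x) = int f(x|t) pi(t) dt."""
    t = proposal.rvs(size=n, random_state=rng)
    w = stats.norm(t, 1).pdf(x) * stats.norm(prior_mean, 1).pdf(t) / proposal.pdf(t)
    return w.mean()

varpi = stats.norm(x, 1.5)            # importance function varpi_0 = varpi_1
B01 = is_marginal(0.0, varpi) / is_marginal(5.0, varpi)
print(B01)
```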
  • 15. Bridge sampling — Special case: if
    $\tilde\pi_1(\theta|x) \propto \pi_1(\theta|x)$ and $\tilde\pi_2(\theta|x) \propto \pi_2(\theta|x)$
    live on the same space ($\Theta_1 = \Theta_2$), then
    $$B_{12} \approx \frac{1}{n}\sum_{i=1}^n \frac{\tilde\pi_1(\theta_i|x)}{\tilde\pi_2(\theta_i|x)}\,, \qquad \theta_i \sim \pi_2(\theta|x)$$
    [Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
  • 16. Bridge sampling variance — The bridge sampling estimator does poorly if
    $$\frac{\mathrm{var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}\left[\left(\frac{\pi_1(\theta)-\pi_2(\theta)}{\pi_2(\theta)}\right)^{\!2}\,\right]$$
    is large, i.e. if $\pi_1$ and $\pi_2$ have little overlap...
  • 18. (Further) bridge sampling — In addition, for any $\alpha(\cdot)$,
    $$B_{12} = \frac{\int \tilde\pi_1(\theta|x)\,\alpha(\theta)\,\pi_2(\theta|x)\,d\theta}{\int \tilde\pi_2(\theta|x)\,\alpha(\theta)\,\pi_1(\theta|x)\,d\theta}
    \approx \frac{n_2^{-1}\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}|x)\,\alpha(\theta_{2i})}{n_1^{-1}\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}|x)\,\alpha(\theta_{1i})}\,, \qquad \theta_{ji} \sim \pi_j(\theta|x)$$
  • 19. An infamous example — When $\alpha(\theta) = 1\big/\tilde\pi_1(\theta)\,\tilde\pi_2(\theta)$, harmonic mean approximation to $B_{12}$:
    $$\hat B_{21} = \frac{n_1^{-1}\sum_{i=1}^{n_1} 1/\tilde\pi_1(\theta_{1i}|x)}{n_2^{-1}\sum_{i=1}^{n_2} 1/\tilde\pi_2(\theta_{2i}|x)}\,, \qquad \theta_{ji} \sim \pi_j(\theta|x)$$
    [Newton & Raftery, 1994]
    Infamous: most often leads to an infinite variance!!! [Radford Neal's blog, 2008]
  • 21. "The Worst Monte Carlo Method Ever" — "The good news is that the Law of Large Numbers guarantees that this estimator is consistent, i.e., it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it's easy for people to not realize this, and to naively accept estimates that are nowhere close to the correct value of the marginal likelihood." [Radford Neal's blog, Aug. 23, 2008]
  • 23. Optimal bridge sampling — The optimal choice of auxiliary function is
    $$\alpha^\star = \frac{n_1 + n_2}{n_1\,\pi_1(\theta|x) + n_2\,\pi_2(\theta|x)}$$
    leading to
    $$\hat B_{12} \approx \frac{n_2^{-1}\sum_{i=1}^{n_2} \tilde\pi_1(\theta_{2i}|x)\big/\{n_1\pi_1(\theta_{2i}|x)+n_2\pi_2(\theta_{2i}|x)\}}{n_1^{-1}\sum_{i=1}^{n_1} \tilde\pi_2(\theta_{1i}|x)\big/\{n_1\pi_1(\theta_{1i}|x)+n_2\pi_2(\theta_{1i}|x)\}}$$
    Back later!
  • 24. Optimal bridge sampling (2) — Reason:
    $$\frac{\mathrm{Var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n_1 n_2}\left[\frac{\int \pi_1(\theta)\,\pi_2(\theta)\,[n_1\pi_1(\theta)+n_2\pi_2(\theta)]\,\alpha(\theta)^2\,d\theta}{\left(\int \pi_1(\theta)\,\pi_2(\theta)\,\alpha(\theta)\,d\theta\right)^2} - 1\right]$$
    ($\delta$ method)
    Drag: dependence on the unknown normalising constants, solved iteratively (a sketch of the iteration follows below)
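Since $\alpha^\star$ involves the unknown normalising constants, the estimator is computed by a fixed-point iteration. A minimal sketch on assumed toy targets (unnormalised densities with known constants, so the answer can be checked; this is the Meng & Wong-type iteration, not code from the slides):

```python
import numpy as np
from scipy import stats

# Assumed toy targets: pi1_tilde = 2 * N(0,1), pi2_tilde = 0.5 * N(0.5,1),
# so the true B12 = Z1/Z2 = 4; alpha* is only needed up to the ratio B.
rng = np.random.default_rng(1)
pi1_t = lambda t: 2.0 * stats.norm(0.0, 1).pdf(t)
pi2_t = lambda t: 0.5 * stats.norm(0.5, 1).pdf(t)
t1 = rng.normal(0.0, 1.0, 5000)      # draws from pi1(theta|x)
t2 = rng.normal(0.5, 1.0, 5000)      # draws from pi2(theta|x)
n1, n2 = len(t1), len(t2)

B = 1.0                              # initial guess for B12
for _ in range(25):                  # iterate because alpha* depends on B
    num = np.mean(pi1_t(t2) / (n1 * pi1_t(t2) + n2 * B * pi2_t(t2)))
    den = np.mean(pi2_t(t1) / (n1 * pi1_t(t1) + n2 * B * pi2_t(t1)))
    B = num / den
print(B)                             # close to 4
```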
  • 26. Ratio importance sampling — Another identity:
    $$B_{12} = \frac{\mathbb{E}_\varphi[\tilde\pi_1(\theta)/\varphi(\theta)]}{\mathbb{E}_\varphi[\tilde\pi_2(\theta)/\varphi(\theta)]}$$
    for any density $\varphi$ with sufficiently large support [Torrie & Valleau, 1977]
    Use of a single sample $\theta_1, \ldots, \theta_n$ from $\varphi$:
    $$\hat B_{12} = \frac{\sum_{i=1}^n \tilde\pi_1(\theta_i)/\varphi(\theta_i)}{\sum_{i=1}^n \tilde\pi_2(\theta_i)/\varphi(\theta_i)}$$
  • 28. Ratio importance sampling (2) — Approximate variance:
    $$\frac{\mathrm{var}(\hat B_{12})}{B_{12}^2} \approx \frac{1}{n}\,\mathbb{E}_\varphi\left[\frac{(\pi_1(\theta)-\pi_2(\theta))^2}{\varphi(\theta)^2}\right]$$
    Optimal choice:
    $$\varphi^\star(\theta) = \frac{|\pi_1(\theta)-\pi_2(\theta)|}{\int |\pi_1(\eta)-\pi_2(\eta)|\,d\eta}$$
    [Chen, Shao & Ibrahim, 2000]
    Formally better than bridge sampling
  • 31. Approximating $Z_k$ from a posterior sample — Use of the [harmonic mean] identity
    $$\mathbb{E}^{\pi_k}\!\left[\left.\frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\right|\, x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\,\frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}$$
    no matter what the proposal $\varphi(\cdot)$ is. [Gelfand & Dey, 1994; Bartolucci et al., 2006]
    Direct exploitation of the MCMC output
  • 33. Comparison with regular importance sampling — Harmonic mean: constraint opposed to usual importance sampling constraints: $\varphi(\theta)$ must have lighter (rather than fatter) tails than $\pi_k(\theta_k)\,L_k(\theta_k)$ for the approximation
    $$\hat Z_{1k} = 1\bigg/\frac{1}{T}\sum_{t=1}^T \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}$$
    to have a finite variance. E.g., use finite support kernels (like Epanechnikov's kernel) for $\varphi$
  • 35. Comparison with regular importance sampling (cont'd) — Compare $\hat Z_{1k}$ with a standard importance sampling approximation
    $$\hat Z_{2k} = \frac{1}{T}\sum_{t=1}^T \frac{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\varphi(\theta_k^{(t)})}$$
    where the $\theta_k^{(t)}$'s are generated from the density $\varphi(\cdot)$ (with fatter tails, like $t$'s). A sketch follows below.
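A sketch contrasting the two tail constraints on an assumed conjugate normal model, where the evidence is available in closed form; the truncated normal below stands in for a finite-support (light-tailed) $\varphi$, and exact posterior draws stand in for MCMC output:

```python
import numpy as np
from scipy import stats

# Assumed conjugate model: x_i ~ N(theta, 1), theta ~ N(0, 1), so both the
# posterior and the evidence are available in closed form for checking.
rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, 20)
n = len(x)
pv = 1.0 / (n + 1.0)                          # posterior variance
pm = x.sum() * pv                             # posterior mean

def log_joint(t):                             # log pi(t) + log L(t)
    return (stats.norm(0, 1).logpdf(t)
            + stats.norm(t, 1).logpdf(x[:, None]).sum(axis=0))

draws = rng.normal(pm, np.sqrt(pv), 100_000)  # stand-in for MCMC output
sd = np.sqrt(pv)
phi = stats.truncnorm(-2, 2, loc=pm, scale=sd)  # light-tailed phi
w = np.exp(phi.logpdf(draws) - log_joint(draws))
logZ_hat = -np.log(w.mean())                  # harmonic-mean identity

logZ_exact = stats.multivariate_normal(np.zeros(n),
                                       np.eye(n) + np.ones((n, n))).logpdf(x)
print(logZ_hat, logZ_exact)
```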
  • 36. Approximating $Z_k$ using a mixture representation — [bridge sampling recall] Design a specific mixture for simulation [importance sampling] purposes, with density
    $$\varphi_k(\theta_k) \propto \omega_1\,\pi_k(\theta_k)\,L_k(\theta_k) + \varphi(\theta_k)\,,$$
    where $\varphi(\cdot)$ is arbitrary (but normalised). Note: $\omega_1$ is not a probability weight
  • 38. Approximating $Z$ using a mixture representation (cont'd) — Corresponding MCMC (= Gibbs) sampler. At iteration $t$:
    1. Take $\delta^{(t)} = 1$ with probability
       $$\omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) \Big/ \left\{\omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)})\right\}$$
       and $\delta^{(t)} = 2$ otherwise;
    2. If $\delta^{(t)} = 1$, generate $\theta_k^{(t)} \sim \mathrm{MCMC}(\theta_k^{(t-1)}, \theta_k)$, where $\mathrm{MCMC}(\theta_k, \theta_k')$ denotes an arbitrary MCMC kernel associated with the posterior $\pi_k(\theta_k|x) \propto \pi_k(\theta_k)\,L_k(\theta_k)$;
    3. If $\delta^{(t)} = 2$, generate $\theta_k^{(t)} \sim \varphi(\theta_k)$ independently
  • 41. Evidence approximation by mixtures — Rao-Blackwellised estimate
    $$\hat\xi = \frac{1}{T}\sum_{t=1}^T \frac{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})}\,,$$
    converges to $\omega_1 Z_k\big/\{\omega_1 Z_k + 1\}$.
    Deduce $\hat Z_{3k}$ from $\omega_1 \hat Z_{3k}\big/\{\omega_1 \hat Z_{3k} + 1\} = \hat\xi$, i.e.
    $$\hat Z_{3k} = \frac{1}{\omega_1}\;\frac{\sum_{t=1}^T \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) \Big/ \left\{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}{\sum_{t=1}^T \varphi(\theta_k^{(t)}) \Big/ \left\{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})\right\}}$$
    [Bridge sampler redux] — a sketch of the whole scheme follows below
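A self-contained sketch of the scheme on an assumed one-dimensional toy target (unnormalised $\pi_k L_k = 3\,\mathcal{N}(0,1)$, so $Z_k = 3$), with a random-walk Metropolis kernel standing in for the MCMC step and $\omega_1 = 1$; none of the particular choices below come from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
piL = lambda t: 3.0 * stats.norm(0, 1).pdf(t)   # unnormalised pi_k * L_k, Z = 3
phi = stats.norm(0, 2)                          # arbitrary normalised phi
omega1, T = 1.0, 50_000

theta, chain = 0.0, np.empty(T)
for t in range(T):
    w = omega1 * piL(theta)
    if rng.random() < w / (w + phi.pdf(theta)):  # delta = 1: MCMC move on pi_k
        prop = theta + rng.normal(0, 1)
        if rng.random() < piL(prop) / piL(theta):
            theta = prop
    else:                                        # delta = 2: independent phi draw
        theta = phi.rvs(random_state=rng)
    chain[t] = theta

num = omega1 * piL(chain)
den = num + phi.pdf(chain)
Z3 = (num / den).sum() / (omega1 * (phi.pdf(chain) / den).sum())
print(Z3)                                        # should be close to 3
```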
  • 43. Chib's representation — Direct application of Bayes' theorem: given $x \sim f_k(x|\theta_k)$ and $\theta_k \sim \pi_k(\theta_k)$,
    $$Z_k = m_k(x) = \frac{f_k(x|\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k|x)}$$
    Use of an approximation to the posterior:
    $$\hat Z_k = \hat m_k(x) = \frac{f_k(x|\theta_k^*)\,\pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^*|x)}\,.$$
    (checked below on a toy conjugate model)
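The identity is easy to verify on an assumed conjugate model where the posterior density is exact; in realistic settings $\hat\pi_k(\theta_k^*|x)$ would instead come from the MCMC output, as on the next slide:

```python
import numpy as np
from scipy import stats

# Assumed conjugate toy model: x_i ~ N(theta, 1), theta ~ N(0, 1); the
# posterior is N(pm, pv), so Chib's identity can be checked at any theta*.
rng = np.random.default_rng(4)
x = rng.normal(0.7, 1.0, 15)
n = len(x)
pv = 1.0 / (n + 1.0)
pm = x.sum() * pv

theta_star = pm                            # e.g. the posterior mean
log_lik   = stats.norm(theta_star, 1).logpdf(x).sum()
log_prior = stats.norm(0, 1).logpdf(theta_star)
log_post  = stats.norm(pm, np.sqrt(pv)).logpdf(theta_star)
log_Z = log_lik + log_prior - log_post     # log m(x), exact here
print(log_Z)
```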
  • 45. Case of latent variables — For missing variable $z$ as in mixture models, natural Rao-Blackwell estimate
    $$\hat\pi_k(\theta_k^*|x) = \frac{1}{T}\sum_{t=1}^T \pi_k(\theta_k^*|x, z_k^{(t)})\,,$$
    where the $z_k^{(t)}$'s are Gibbs sampled latent variables
  • 46. Label switching impact — A mixture model [special case of missing variable model] is invariant under permutations of the indices of the components. E.g., the mixtures $0.3\,\mathcal{N}(0,1) + 0.7\,\mathcal{N}(2.3,1)$ and $0.7\,\mathcal{N}(2.3,1) + 0.3\,\mathcal{N}(0,1)$ are exactly the same! Hence the component parameters $\theta_i$ are not identifiable marginally since they are exchangeable
  • 48. Connected difficulties —
    1. Number of modes of the likelihood of order $O(k!)$: maximization and even [MCMC] exploration of the posterior surface harder
    2. Under exchangeable priors on $(\theta, p)$ [prior invariant under permutation of the indices], all posterior marginals are identical: posterior expectation of $\theta_1$ equal to posterior expectation of $\theta_2$
  • 50. License — Since the Gibbs output does not produce exchangeability, the Gibbs sampler has not explored the whole parameter space: it lacks the energy to switch enough component allocations simultaneously.
    [Figure: Gibbs traces and pairwise scatterplots of the component parameters $(\mu_i, p_i, \sigma_i)$, showing no label switching]
  • 51. Label switching paradox — We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. If we observe it, then we do not know how to estimate the parameters. If we do not, then we are uncertain about the convergence!!!
  • 54. Compensation for label switching — For mixture models, $z_k^{(t)}$ usually fails to visit all configurations in a balanced way, despite the symmetry predicted by the theory
    $$\pi_k(\theta_k|x) = \pi_k(\sigma(\theta_k)|x) = \frac{1}{k!}\sum_{\sigma \in \mathfrak{S}_k} \pi_k(\sigma(\theta_k)|x)$$
    for all $\sigma$'s in $\mathfrak{S}_k$, the set of all permutations of $\{1, \ldots, k\}$. Consequences on the numerical approximation, biased by an order $k!$.
    Recover the theoretical symmetry by using
    $$\hat\pi_k(\theta_k^*|x) = \frac{1}{T\,k!}\sum_{\sigma \in \mathfrak{S}_k}\sum_{t=1}^T \pi_k(\sigma(\theta_k^*)|x, z_k^{(t)})\,.$$
    [Berkhof, van Mechelen & Gelman, 2003] (sketched below)
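A sketch of this symmetrisation step; the names below (`cond_dens` for the conditional density $\pi_k(\cdot|x,z)$ and `z_draws` for the Gibbs output) are hypothetical placeholders, not functions from the slides:

```python
import numpy as np
from itertools import permutations

def symmetrised_post_density(theta_star, z_draws, cond_dens, k):
    """Average the Rao-Blackwell plug-in over all k! relabellings of theta*."""
    perms = list(permutations(range(k)))
    total = 0.0
    for sigma in perms:
        theta_perm = [theta_star[j] for j in sigma]   # relabelled components
        total += np.mean([cond_dens(theta_perm, z) for z in z_draws])
    return total / len(perms)
```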
  • 56. Galaxy dataset — $n = 82$ galaxies as a mixture of $k$ normal distributions with both mean and variance unknown. [Roeder, 1992]
    [Figure: histogram of the galaxy data with the average posterior density overlaid]
  • 57. Galaxy dataset (k) — Using only the original estimate, with $\theta_k^*$ as the MAP estimator, $\log(\hat m_k(x)) = -105.1396$ for $k = 3$ (based on $10^3$ simulations), while introducing the permutations leads to $\log(\hat m_k(x)) = -103.3479$. Note that $-105.1396 + \log(3!) = -103.3479$.

    k         | 2       | 3       | 4       | 5       | 6       | 7       | 8
    m_k(x)    | -115.68 | -103.35 | -102.66 | -101.93 | -102.88 | -105.48 | -108.44

    Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on $10^5$ Gibbs iterations and, for $k > 5$, 100 permutations selected at random in $\mathfrak{S}_k$). [Lee, Marin, Mengersen & Robert, 2008]
  • 60. Cross-model solutions — Outline:
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions: Reversible jump; Saturation schemes; Implementation error
    4 Nested sampling
    5 ABC model choice
  • 61. Reversible jump — Idea: set up a proper measure-theoretic framework for designing moves between models $M_k$ [Green, 1995]
    Create a reversible kernel $K$ on $\mathfrak{H} = \bigcup_k \{k\} \times \Theta_k$ such that
    $$\int_A \int_B K(x, dy)\,\pi(x)\,dx = \int_B \int_A K(y, dx)\,\pi(y)\,dy$$
    for the invariant density $\pi$ [$x$ is of the form $(k, \theta^{(k)})$]
  • 63. Local moves — For a move between two models, $M_1$ and $M_2$, the Markov chain being in state $\theta_1 \in M_1$, denote by $K_{1\to 2}(\theta_1, d\theta)$ and $K_{2\to 1}(\theta_2, d\theta)$ the corresponding kernels, under the detailed balance condition
    $$\pi(d\theta_1)\,K_{1\to 2}(\theta_1, d\theta) = \pi(d\theta_2)\,K_{2\to 1}(\theta_2, d\theta)\,,$$
    and take, wlog, $\dim(M_2) > \dim(M_1)$.
    Proposal expressed as $\theta_2 = \Psi_{1\to 2}(\theta_1, v_{1\to 2})$, where $v_{1\to 2}$ is a random variable of dimension $\dim(M_2) - \dim(M_1)$, generated as $v_{1\to 2} \sim \varphi_{1\to 2}(v_{1\to 2})$.
  • 65. Local moves (2) — In this case, $q_{1\to 2}(\theta_1, d\theta_2)$ has density
    $$\varphi_{1\to 2}(v_{1\to 2})\left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|^{-1},$$
    by the Jacobian rule. [Reverse importance link]
    If probability $\varrho_{1\to 2}$ of choosing a move to $M_2$ while in $M_1$, the acceptance probability reduces to
    $$\alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_2, \theta_2)\,\varrho_{2\to 1}}{\pi(M_1, \theta_1)\,\varrho_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}\left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|.$$
    If several models are considered simultaneously, with probability $\varrho_{1\to 2}$ of choosing a move to $M_2$ while in $M_1$, as in
    $$K(x, B) = \sum_{m=1}^\infty \int_B \rho_m(x, y)\,q_m(x, dy) + \omega(x)\,\mathbb{I}_B(x)$$
  • 68. Generic reversible jump acceptance probability — Acceptance probability of $\theta_2 = \Psi_{1\to 2}(\theta_1, v_{1\to 2})$ is
    $$\alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_2, \theta_2)\,\varrho_{2\to 1}}{\pi(M_1, \theta_1)\,\varrho_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}\left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|$$
    while the acceptance probability of $\theta_1$ with $(\theta_1, v_{1\to 2}) = \Psi_{1\to 2}^{-1}(\theta_2)$ is
    $$\alpha(\theta_1, v_{1\to 2}) = 1 \wedge \frac{\pi(M_1, \theta_1)\,\varrho_{1\to 2}\,\varphi_{1\to 2}(v_{1\to 2})}{\pi(M_2, \theta_2)\,\varrho_{2\to 1}}\left|\frac{\partial \Psi_{1\to 2}(\theta_1, v_{1\to 2})}{\partial(\theta_1, v_{1\to 2})}\right|^{-1}$$
    Difficult calibration
  • 71. Green's sampler — Algorithm, iteration $t$ ($t \ge 1$): if $x^{(t)} = (m, \theta^{(m)})$,
    1. Select model $M_n$ with probability $\pi_{mn}$
    2. Generate $u_{mn} \sim \varphi_{mn}(u)$ and set $(\theta^{(n)}, v_{nm}) = \Psi_{m\to n}(\theta^{(m)}, u_{mn})$
    3. Take $x^{(t+1)} = (n, \theta^{(n)})$ with probability
       $$\min\left\{\frac{\pi(n, \theta^{(n)})}{\pi(m, \theta^{(m)})}\,\frac{\pi_{nm}\,\varphi_{nm}(v_{nm})}{\pi_{mn}\,\varphi_{mn}(u_{mn})}\left|\frac{\partial \Psi_{m\to n}(\theta^{(m)}, u_{mn})}{\partial(\theta^{(m)}, u_{mn})}\right|,\; 1\right\}$$
       and take $x^{(t+1)} = x^{(t)}$ otherwise.
  • 72. Interpretation — The representation puts us back in a fixed dimension setting:
    - $M_1 \times V_{1\to 2}$ and $M_2$ are in one-to-one relation
    - reversibility imposes that $\theta_1$ is derived as $(\theta_1, v_{1\to 2}) = \Psi_{1\to 2}^{-1}(\theta_2)$
    - this appears like a regular Metropolis-Hastings move from the couple $(\theta_1, v_{1\to 2})$ to $\theta_2$ when the stationary distributions are $\pi(M_1, \theta_1) \times \varphi_{1\to 2}(v_{1\to 2})$ and $\pi(M_2, \theta_2)$, and when the proposal distribution is deterministic (a toy sketch follows below)
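A minimal runnable sketch of such a sampler on an assumed toy pair of models, M1: $x_i \sim \mathcal{N}(0,1)$ (no parameter) versus M2: $x_i \sim \mathcal{N}(\theta,1)$ with $\theta \sim \mathcal{N}(0,1)$ and equal model weights; the dimension-matching variable is the proposed $\theta$ itself, so the Jacobian is 1. This toy example is an illustrative assumption, not taken from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, 10)
xbar = x.mean()
loglik = lambda mu: stats.norm(mu, 1).logpdf(x).sum()

state = (1, None)                      # (model index, theta)
counts = {1: 0, 2: 0}
for t in range(20_000):
    k, theta = state
    if rng.random() < 0.5:             # attempt a between-model jump
        if k == 1:
            u = rng.normal(xbar, 1.0)  # dimension-matching variable, Jacobian = 1
            log_alpha = (loglik(u) + stats.norm(0, 1).logpdf(u)
                         - loglik(0.0) - stats.norm(xbar, 1).logpdf(u))
            if np.log(rng.random()) < log_alpha:
                state = (2, u)
        else:
            log_alpha = (loglik(0.0) + stats.norm(xbar, 1).logpdf(theta)
                         - loglik(theta) - stats.norm(0, 1).logpdf(theta))
            if np.log(rng.random()) < log_alpha:
                state = (1, None)
    elif k == 2:                       # within-model random-walk move
        prop = theta + rng.normal(0, 0.3)
        a = (loglik(prop) + stats.norm(0, 1).logpdf(prop)
             - loglik(theta) - stats.norm(0, 1).logpdf(theta))
        if np.log(rng.random()) < a:
            state = (2, prop)
    counts[state[0]] += 1
print(counts)   # visit frequencies approximate the posterior model probabilities
```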
  • 74. Alternative [saturation scheme] — Saturation of the parameter space $\mathfrak{H} = \bigcup_k \{k\} \times \Theta_k$ by creating
    - $\theta = (\theta_1, \ldots, \theta_D)$
    - a model index $M$
    - pseudo-priors $\pi_j(\theta_j|M = k)$ for $j \ne k$ [Carlin & Chib, 1995]
    Validation by
    $$\mathbb{P}(M = k|x) = \int \mathbb{P}(M = k|x, \theta)\,\pi(\theta|x)\,d\theta \propto p_k Z_k$$
    where the (marginal) posterior is [not $\pi_k$]
    $$\pi(\theta|x) = \sum_{k=1}^D \mathbb{P}(\theta, M = k|x) \propto \sum_{k=1}^D p_k Z_k\,\pi_k(\theta_k|x) \prod_{j \ne k} \pi_j(\theta_j|M = k)\,.$$
  • 76. MCMC implementation — Run a Markov chain $(M^{(t)}, \theta_1^{(t)}, \ldots, \theta_D^{(t)})$ with stationary distribution $\pi(\theta, M|x)$ by
    1. Pick $M^{(t)} = k$ with probability $\propto \pi(\theta^{(t-1)}, k|x)$
    2. Generate $\theta_k^{(t)}$ from the posterior $\pi_k(\theta_k|x)$ [or MCMC step]
    3. Generate $\theta_j^{(t)}$ ($j \ne k$) from the pseudo-prior $\pi_j(\theta_j|M = k)$
    Approximate $\mathbb{P}(M = k|x)$ by the Rao-Blackwellised average
    $$\check p_k(x) = \frac{1}{T}\sum_{t=1}^T \frac{p_k\,f_k(x|\theta_k^{(t)})\,\pi_k(\theta_k^{(t)})\prod_{j\ne k}\pi_j(\theta_j^{(t)}|M = k)}{\sum_{\ell=1}^D p_\ell\,f_\ell(x|\theta_\ell^{(t)})\,\pi_\ell(\theta_\ell^{(t)})\prod_{j\ne \ell}\pi_j(\theta_j^{(t)}|M = \ell)}$$
    (a sketch follows below)
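A compact sketch of the saturated sampler for two assumed conjugate normal models, with the pseudo-priors simply set equal to the priors (a legal, if inefficient, choice); the models and data below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Assumed models: M1: x_i ~ N(theta1, 1), M2: x_i ~ N(theta2, 4),
# theta_k ~ N(0, 1) priors, pseudo-priors pi_j(theta_j | M=k) = N(0, 1).
rng = np.random.default_rng(6)
x = rng.normal(0.8, 1.0, 25)
n, p, sig2 = len(x), np.array([0.5, 0.5]), np.array([1.0, 4.0])

def post_draw(k):                     # conjugate posterior draw for theta_k
    v = 1.0 / (n / sig2[k] + 1.0)
    return rng.normal(v * x.sum() / sig2[k], np.sqrt(v))

def log_w(k, th):                     # p_k f_k(x|th_k) pi_k(th_k) * pseudo-prior
    return (np.log(p[k])
            + stats.norm(th[k], np.sqrt(sig2[k])).logpdf(x).sum()
            + stats.norm(0, 1).logpdf(th[k])        # prior of the active block
            + stats.norm(0, 1).logpdf(th[1 - k]))   # pseudo-prior of the idle block

theta, probs, T = np.zeros(2), np.zeros(2), 10_000
for t in range(T):
    lw = np.array([log_w(0, theta), log_w(1, theta)])
    w = np.exp(lw - lw.max()); w /= w.sum()
    M = rng.choice(2, p=w)            # step 1: model index
    theta[M] = post_draw(M)           # step 2: posterior draw
    theta[1 - M] = rng.normal(0, 1)   # step 3: pseudo-prior draw
    probs += w                        # Rao-Blackwellised P(M = k | x, theta)
print(probs / T)                      # approximates P(M = k | x)
```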
  • 79. Scott's (2002) proposal — Suggests estimating $\mathbb{P}(M = k|x)$ by
    $$\hat Z_k \propto p_k \sum_{t=1}^T \frac{f_k(x|\theta_k^{(t)})}{\sum_{j=1}^D p_j\,f_j(x|\theta_j^{(t)})}\,,$$
    based on $D$ simultaneous and independent MCMC chains $(\theta_k^{(t)})_t$, $1 \le k \le D$, with stationary distributions $\pi_k(\theta_k|x)$ [instead of the above joint]
  • 81. Congdon's (2006) extension — Selecting flat [prohibited] pseudo-priors, uses instead
    $$\hat Z_k \propto p_k \sum_{t=1}^T \frac{f_k(x|\theta_k^{(t)})\,\pi_k(\theta_k^{(t)})}{\sum_{j=1}^D p_j\,f_j(x|\theta_j^{(t)})\,\pi_j(\theta_j^{(t)})}\,,$$
    where again the $\theta_k^{(t)}$'s are MCMC chains with stationary distributions $\pi_k(\theta_k|x)$
  • 82. Examples — Example (Model choice): model $M_1$: $x|\theta \sim \mathcal{U}(0, \theta)$ with prior $\theta \sim \mathcal{E}xp(1)$, versus model $M_2$: $x|\theta \sim \mathcal{E}xp(\theta)$ with prior $\theta \sim \mathcal{E}xp(1)$. Equal prior weights on both models: $\varrho_1 = \varrho_2 = 0.5$.
    [Figure: approximations of $\mathbb{P}(M = 1|x)$: Scott's (2002) (blue) and Congdon's (2006) (red); $N = 10^6$ simulations]
  • 84. Examples (2) — Example (Model choice (2)): normal model $M_1$: $x \sim \mathcal{N}(\theta, 1)$ with $\theta \sim \mathcal{N}(0, 1)$ vs. normal model $M_2$: $x \sim \mathcal{N}(\theta, 1)$ with $\theta \sim \mathcal{N}(5, 1)$.
    [Figure: comparison of both approximations with $\mathbb{P}(M = 1|x)$: Scott's (2002) (green, mixed dashes) and Congdon's (2006) (brown, long dashes); $N = 10^4$ simulations]
  • 85. Examples (3) — Example (Model choice (3)): model $M_1$: $x \sim \mathcal{N}(0, 1/\omega)$ with $\omega \sim \mathcal{E}xp(a)$ vs. $M_2$: $\exp(x) \sim \mathcal{E}xp(\lambda)$ with $\lambda \sim \mathcal{E}xp(b)$.
    [Figure: comparison of Congdon's (2006) approximation (brown, dashed) with $\mathbb{P}(M = 1|x)$ when $(a, b)$ equals $(.24, 8.9)$, $(.56, .7)$, $(4.1, .46)$ and $(.98, .081)$, resp.; $N = 10^4$ simulations]
  • 86. Nested sampling — Outline:
    1 Evidence
    2 Importance sampling solutions
    3 Cross-model solutions
    4 Nested sampling: Purpose; Implementation; Error rates; Constraints; Importance variant; A mixture comparison
    5 ABC model choice
  • 87. Nested sampling: Goal — Skilling's (2007) technique using the one-dimensional representation:
    $$Z = \mathbb{E}^\pi[L(\theta)] = \int_0^1 \varphi(x)\,dx$$
    with $\varphi^{-1}(l) = P^\pi(L(\theta) > l)$. Note: $\varphi(\cdot)$ is intractable in most cases.
  • 88. Nested sampling: First approximation — Approximate $Z$ by a Riemann sum:
    $$\hat Z = \sum_{i=1}^N (x_{i-1} - x_i)\,\varphi(x_i)$$
    where the $x_i$'s are either
    - deterministic: $x_i = e^{-i/N}$, or
    - random: $x_0 = 1$, $x_{i+1} = t_i x_i$, $t_i \sim \mathcal{B}e(N, 1)$,
    so that $\mathbb{E}[\log x_i] = -i/N$.
  • 89. Extraneous white noise — Take
    $$Z = \int e^{-\theta}\,d\theta = \int \frac{1}{\delta}\,e^{-(1-\delta)\theta}\,\delta e^{-\delta\theta}\,d\theta = \mathbb{E}_\delta\left[\frac{1}{\delta}\,e^{-(1-\delta)\theta}\right]$$
    $$\hat Z = \frac{1}{N}\sum_{i=1}^N \delta^{-1} e^{-(1-\delta)\theta_i}\,(x_{i-1} - x_i)\,, \qquad \theta_i \sim \mathcal{E}(\delta)\,\mathbb{I}(\theta_i \le \theta_{i-1})$$
    Comparison of variances and MSEs:

    N    | deterministic | random
    50   | 4.64, 10.5    | 4.65, 10.5
    100  | 2.47, 4.9     | 2.48, 5.02
    500  | .549, 1.01    | .550, 1.14
  • 91. Nested sampling: Alternative representation — Another representation is
    $$Z = \sum_{i=0}^{N-1} \{\varphi(x_{i+1}) - \varphi(x_i)\}\,x_i$$
    which is a special case of
    $$Z = \sum_{i=0}^{N-1} \{L(\theta_{(i+1)}) - L(\theta_{(i)})\}\,\pi(\{\theta;\, L(\theta) > L(\theta_{(i)})\})$$
    where $\cdots < L(\theta_{(i)}) < L(\theta_{(i+1)}) < \cdots$ [Lebesgue version of Riemann's sum]
  • 93. Nested sampling: Second approximation — Estimate the (intractable) $\varphi(x_i)$ by $\hat\varphi_i$:
    Nested sampling: start with $N$ values $\theta_1, \ldots, \theta_N$ sampled from $\pi$. At iteration $i$,
    1. Take $\hat\varphi_i = L(\theta_k)$, where $\theta_k$ is the point with smallest likelihood in the pool of $\theta_i$'s
    2. Replace $\theta_k$ with a sample from the prior constrained to $L(\theta) > \hat\varphi_i$: the current $N$ points are then sampled from the prior constrained to $L(\theta) > \hat\varphi_i$.
    Note that
    $$\frac{\pi(\{\theta;\, L(\theta) > L(\theta_{(i+1)})\})}{\pi(\{\theta;\, L(\theta) > L(\theta_{(i)})\})} \approx 1 - \frac{1}{N}$$
    (a runnable sketch follows below)
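A runnable sketch of the algorithm on an assumed toy problem where the constrained-prior step is exact, because the constraint set $\{L > \hat\varphi_i\}$ is an interval that can be inverted analytically; the prior, likelihood and tuning constants below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Assumed toy problem: prior U(0,1), likelihood L(t) = N(t; 0.5, 0.1),
# so Z = int_0^1 L(t) dt is essentially 1.
rng = np.random.default_rng(7)
s = 0.1
L = lambda t: stats.norm(0.5, s).pdf(t)

N, eps = 100, 1e-6
live = rng.random(N)                     # N points from the prior
Z, x_prev = 0.0, 1.0
for i in range(1, int(np.ceil(-N * np.log(eps))) + 1):
    k = int(np.argmin(L(live)))
    phi_i = L(live[k])                   # smallest likelihood in the pool
    x_i = np.exp(-i / N)                 # deterministic prior-volume schedule
    Z += (x_prev - x_i) * phi_i
    x_prev = x_i
    # exact constrained-prior draw: {L > phi_i} is the interval |t - 0.5| < half
    half = s * np.sqrt(-2 * np.log(phi_i * s * np.sqrt(2 * np.pi)))
    live[k] = rng.uniform(max(0.0, 0.5 - half), min(1.0, 0.5 + half))
Z += x_prev * L(live).mean()             # remainder from the final live points
print(Z)                                 # close to 1
```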
  • 97. Nested sampling: Third approximation — Iterate the above steps until a given stopping iteration $j$ is reached: e.g.,
    - observe very small changes in the approximation $\hat Z$;
    - reach the maximal value of $L(\theta)$ when the likelihood is bounded and its maximum is known;
    - truncate the integral $Z$ at level $\varepsilon$, i.e. replace $\int_0^1 \varphi(x)\,dx$ with $\int_\varepsilon^1 \varphi(x)\,dx$
  • 98. Approximation error —
    $$\text{Error} = \hat Z - Z = \sum_{i=1}^j (x_{i-1} - x_i)\,\hat\varphi_i - \int_0^1 \varphi(x)\,dx$$
    $$= -\int_0^\varepsilon \varphi(x)\,dx \quad \text{(Truncation Error)}$$
    $$\quad + \left[\sum_{i=1}^j (x_{i-1} - x_i)\,\varphi(x_i) - \int_\varepsilon^1 \varphi(x)\,dx\right] \quad \text{(Quadrature Error)}$$
    $$\quad + \sum_{i=1}^j (x_{i-1} - x_i)\,\{\hat\varphi_i - \varphi(x_i)\} \quad \text{(Stochastic Error)}$$
    [Dominated by Monte Carlo]
  • 99. A CLT for the Stochastic Error — The (dominating) stochastic error is $O_P(N^{-1/2})$:
    $$N^{1/2}\,\{\hat Z - Z\} \overset{D}{\longrightarrow} \mathcal{N}(0, V)$$
    with
    $$V = -\int_{s,t \in [\varepsilon, 1]} s\,\varphi'(s)\,t\,\varphi'(t)\,\log(s \vee t)\,ds\,dt\,.$$
    [Proof based on Donsker's theorem]
  • 100. What of log Z? — If the interest lies in $\log Z$, Slutsky's transform of the CLT:
    $$N^{1/2}\,\{\log \hat Z - \log Z\} \overset{D}{\longrightarrow} \mathcal{N}(0, V/Z^2)$$
    Note: the number of simulated points equals the number of iterations $j$, and is a multiple of $N$: if one stops at the first iteration $j$ such that $e^{-j/N} < \varepsilon$, then $j = N \lceil -\log \varepsilon \rceil$.
  • 103. Sampling from constrained priors — Exact simulation from the constrained prior is intractable in most cases.
    Skilling (2007) proposes to use MCMC, but this introduces a bias (stopping rule).
    If implementable, then a slice sampler can be devised at the same cost [or less]
  • 106. Banana illustration — Case of a banana target made of a twisted 2D normal:
    $$x_2 \mapsto x_2 + \beta x_1^2 - 100\beta$$
    [Haario, Saksman & Tamminen, 1999], with $\beta = .03$, $\sigma = (100, 1)$
  • 107. Banana illustration (2) — Use of nested sampling with $N = 1000$ and 50 MCMC steps with size 0.1, compared with a population Monte Carlo (PMC) based on 10 iterations with 5000 points per iteration and a final sample of 50000 points, using nine Student's $t$ components with 9 df [Wraith, Kilbinger, Benabed et al., 2009, Phys. Rev. D]
    [Figures over three slides: evidence estimation, $\mathbb{E}[X_1]$ estimation, and $\mathbb{E}[X_2]$ estimation]
  • 110. An IS variant of nested sampling — Consider an instrumental prior $\tilde\pi$ and likelihood $\tilde L$, the weight function
    $$w(\theta) = \frac{\pi(\theta)\,L(\theta)}{\tilde\pi(\theta)\,\tilde L(\theta)}$$
    and the weighted NS estimator
    $$\hat Z = \sum_{i=1}^j (x_{i-1} - x_i)\,\hat\varphi_i\,w(\theta_i)\,.$$
    Then choose $(\tilde\pi, \tilde L)$ so that sampling from $\tilde\pi$ constrained to $\tilde L(\theta) > l$ is easy; e.g. $\mathcal{N}(c, I_d)$ constrained to $\|c - \theta\| < r$ (a sketch of this constrained draw follows below).
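A sketch of this "easy" constrained draw: for $\theta \sim \mathcal{N}(c, I_d)$ restricted to $\|\theta - c\| < r$, the direction is uniform on the sphere and the squared radius is $\chi^2_d$ truncated to $[0, r^2]$, which can be inverted exactly (the function name below is a hypothetical helper):

```python
import numpy as np
from scipy import stats

def gaussian_in_ball(c, r, rng):
    """Draw theta ~ N(c, I_d) restricted to ||theta - c|| < r."""
    d = len(c)
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                   # uniform direction on the sphere
    F_r = stats.chi2(d).cdf(r ** 2)          # Gaussian mass of the ball
    rho = np.sqrt(stats.chi2(d).ppf(rng.random() * F_r))  # truncated radius
    return c + rho * u

rng = np.random.default_rng(8)
theta = gaussian_in_ball(np.zeros(5), 2.0, rng)
```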
  • 112. Mixture $p\,\mathcal{N}(0,1) + (1-p)\,\mathcal{N}(\mu, \sigma)$ posterior — Posterior on $(\mu, \sigma)$ for $n$ observations with $\mu = 2$ and $\sigma = 3/2$, when $p$ is known. Use of a uniform prior both on $(-2, 6)$ for $\mu$ and on $(.001, 16)$ for $\log \sigma^2$. Occurrences of posterior bursts for $\mu = x_i$
    [Figure: posterior surface on $(\mu, \sigma)$]
  • 113. Experiment — MCMC sample for $n = 16$ observations from the mixture; nested sampling sequence with $M = 1000$ starting points. [Figures]
  • 114. Experiment — MCMC sample for $n = 50$ observations from the mixture; nested sampling sequence with $M = 1000$ starting points. [Figures]
  • 115. Comparison —
    1. Nested sampling: $M = 1000$ points, with 10 random walk moves at each step, simulations from the constrained prior and a stopping rule at 95% of the observed maximum likelihood
    2. $T = 10^4$ MCMC (= Gibbs) simulations producing non-parametric estimates of $\hat\varphi$ [Diebolt & Robert, 1990]
    3. Monte Carlo estimates $\hat Z_1$, $\hat Z_2$, $\hat Z_3$ using a product of two Gaussian kernels
    4. Numerical integration based on an $850 \times 950$ grid [reference value, confirmed by Chib's]
  • 116. Comparison (cont'd) — [Figure: boxplots of the four evidence approximations $V_1$–$V_4$]
    Graph based on a sample of 10 observations for $\mu = 2$ and $\sigma = 3/2$ (150 replicas).