The 21st Century Belongs to Bayes

The 21st Bayesian Century

“The 21st Century belongs to Bayes”
as argued by a discussion on Bayesian testing and
Bayesian model choice

Christian P. Robert

Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
http://xianblog.wordpress.com

July 1, 2009


A consequence of Bayesian statistics being given a proper
name is that it encourages too much historical deference
from people who think that the bibles of Jeﬀreys, de
Finetti, Jaynes, and others have all the answers.
—Gelman, Bayesian Analysis 3(3), 2008


Outline

Anyone not shocked by the Bayesian theory of inference has not
understood it
Senn, BA., 2008

Introduction

Tests and model choice

Bayesian Calculations

A Defense of the Bayesian Choice

Introduction

Vocabulary and concepts
Bayesian inference is a coherent mathematical theory
but I don’t trust it in scientiﬁc applications.
Gelman, BA, 2008

Introduction
Models
The Bayesian framework
Improper prior distributions
Noninformative prior distributions




Introduction
Models

Parametric model

Bayesians promote the idea that a multiplicity of parameters can be
handled via hierarchical, typically exchangeable, models, but it seems
implausible that this could really work automatically [instead of] giving
reasonable answers using minimal assumptions.
Gelman, BA, 2008

Observations x1 , . . . , xn generated from a probability distribution
fi (xi |θi , x1 , . . . , xi−1 ) = fi (xi |θi , x1:i−1 )

x = (x1 , . . . , xn ) ∼ f (x|θ), θ = (θ1 , . . . , θn )

Associated likelihood
ℓ(θ|x) = f (x|θ)
[inverted density & starting point]

And [B] nonparametrics?!
Equally very active and deﬁnitely very 21st, thank you,
but not mentioned in this talk!
7th Workshop on Bayesian Nonparametrics - Collegio... http://bnpworkshop.carloalberto.org/

21 - 25 June 2009, Moncalieri

The 7th Workshop on Bayesian Nonparametrics will be held at
the Collegio Carlo Alberto from June 21 to 25, 2009. The Collegio is a
Research Institution housed in an historical building located in
Moncalieri on the outskirts of Turin, Italy.
The meeting will feature the latest developments in the area and will
cover a wide variety of both theoretical and applied topics such as:
foundations of the Bayesian nonparametric approach, construction
and properties of prior distributions, asymptotics, interplay with
probability theory and stochastic processes, statistical modelling,
computational algorithms and applications in machine learning,
biostatistics, bioinformatics, economics and econometrics.

The Workshop will be structured in 4 tutorials on special topics, a
series of invited talks and contributed posters sessions.


Bayesian approach
The impact of treating x as a ﬁxed constant
is to increase statistical power as an artefact
Templeton, Molec. Ecol., 2009

New perspective
◮ Uncertainty on the parameters θ of a model modeled through
a probability distribution π on Θ, called prior distribution
◮ Inference based on the distribution of θ conditional on x,
π(θ|x), called posterior distribution

\[ \pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta}. \]
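
As a minimal sketch of this formula (not taken from the talk), the posterior can be approximated on a grid by multiplying likelihood and prior and then renormalising. The normal model x ∼ N(θ, 1), the N(0, 1) prior, the observation x = 1.5 and the helper names `normal_pdf` and `grid_posterior` are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def grid_posterior(x, grid, prior, likelihood):
    """Normalised posterior weights on an equally spaced grid of theta values."""
    unnorm = [likelihood(x, t) * prior(t) for t in grid]
    total = sum(unnorm)  # Riemann sum; the grid step cancels in the ratio
    return [u / total for u in unnorm]


step = 0.001
grid = [-10.0 + step * i for i in range(20001)]
x_obs = 1.5  # hypothetical observation

weights = grid_posterior(
    x_obs,
    grid,
    prior=lambda t: normal_pdf(t, 0.0, 1.0),          # assumed prior N(0, 1)
    likelihood=lambda x, t: normal_pdf(x, t, 1.0),    # assumed model N(theta, 1)
)
post_mean = sum(t * w for t, w in zip(grid, weights))
print(post_mean)
```

Under these assumptions the exact posterior is N(x/2, 1/2), so `post_mean` should be very close to 0.75.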

[Nonphilosophical] justiﬁcations

Ignoring the sampling error of x undermines
the statistical validity of all inferences made by the method
Templeton, Molec. Ecol., 2009
◮ Semantic drift from unknown to random
◮ Actualization of the information on θ by extracting the
information on θ contained in the observation x
◮ Allows incorporation of imperfect information in the decision
process
◮ Unique mathematical way to condition upon the observations
(conditional perspective)
◮ Unique way to give meaning to statements like P(θ > 0)

Posterior distribution

Bayesian methods are presented as an automatic inference engine,
and this raises suspicion in anyone with applied experience
Gelman, BA, 2008

π(θ|x) central to Bayesian inference
◮ Operates conditional upon the observations
◮ Incorporates the requirement of the Likelihood Principle
◮ Avoids averaging over the unobserved values of x
◮ Coherent updating of the information available on θ
◮ Provides a complete inferential machinery

Improper distributions

If we take P (dσ) ∝ dσ as a statement that σ may have any value
between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeﬀreys, ToP, 1939

Necessary extension from a prior distribution to a prior σ-finite
measure π such that

\[ \int_\Theta \pi(\theta)\,d\theta = +\infty \]

Improper prior distribution
[Weird? Inappropriate?? report!! ]

Justiﬁcations

If the parameter may have any value from −∞ to +∞,
its prior probability should be taken as uniformly distributed
Automated prior determination often leads to improper priors
1. Similar performances of estimators derived from these
generalized distributions
2. Improper priors as limits of proper distributions in many
[mathematical] senses

More justifications

There is no good objective principle for choosing a noninformative prior
(even if that concept were mathematically defined, which it is not)
Gelman, BA, 2008

4. Robust answer against possible misspecifications of the prior
5. Frequentist justifications, such as:
(i) minimaxity
(ii) admissibility
(iii) invariance (Haar measure)
6. Improper priors [much] preferred to vague proper priors like
N(0, 10^6)

Validation

The mistake is to think of them as representing ignorance
Lindley, JASA, 1990
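Extension of the posterior distribution π(θ|x) associated with an
improper prior π as given by Bayes's formula

\[ \pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int_\Theta f(x|\theta)\,\pi(\theta)\,d\theta}, \]

when
\[ \int_\Theta f(x|\theta)\,\pi(\theta)\,d\theta < \infty \]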
Delete all emotional names

Noninformative priors

...cannot be expected to represent exactly total ignorance about the
problem, but should rather be taken as reference priors, upon which
everyone could fall back when the prior information is missing.
Kass and Wasserman, JASA, 1996

What if all we know is that we know “nothing” ?!
In the absence of prior information, prior distributions solely
derived from the sample distribution f (x|θ)
Diﬃculty with uniform priors, lacking invariance properties.

Jeffreys’ prior

If we took the prior density for the parameters to be proportional to
$|I(\theta)|^{1/2}$, it could be stated for any law that is differentiable with respect
to all parameters that the total probability in any region of the $\theta_i$ would
be equal to the total probability in the corresponding region of the $\theta_i'$

Based on Fisher information
\[ I(\theta) = \mathbb{E}_\theta\left[ \frac{\partial \ell}{\partial \theta^T}\,\frac{\partial \ell}{\partial \theta} \right] \]

Jeffreys' prior distribution is

\[ \pi^*(\theta) \propto |I(\theta)|^{1/2} \]
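
A short worked example may help here. The binomial model below is an illustrative assumption (it is not on the slide), with ℓ read as the log-likelihood, as the Fisher information formula requires.

```latex
% Assumed model: x ~ Binomial(n, \theta), with log-likelihood
\ell(\theta|x) = x \log\theta + (n - x)\log(1-\theta) + \text{const}
% Second derivative and its expectation (using E_\theta[x] = n\theta):
-\frac{\partial^2 \ell}{\partial\theta^2} = \frac{x}{\theta^2} + \frac{n-x}{(1-\theta)^2},
\qquad
I(\theta) = \frac{n}{\theta} + \frac{n}{1-\theta} = \frac{n}{\theta(1-\theta)}
% Jeffreys' prior is then the Beta(1/2, 1/2) density, up to a constant:
\pi^*(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}
```

In this example the Jeffreys prior happens to be proper, which is not the case in general.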


The Jeﬀreys-subjective synthesis betrays a much more dangerous
confusion than the Neyman-Pearson-Fisher synthesis as regards
hypothesis tests
Senn, BA, 2008

Introduction

Bayesian tests
Bayes factors
Opposition to classical tests
Model choice
Compatible priors
Variable selection

Bayesian tests

Construction of Bayes tests

What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008

Definition (Test)
Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
statistical model, a test is a statistical procedure that takes its
values in {0, 1}.

Example (Normal mean)
For x ∼ N (θ, 1), decide whether or not θ ≤ 0.

Decision-theoretic perspective
Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008

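Theorem (Optimal Bayes decision)
Under the 0 − 1 loss function

\[ L(\theta, d) = \begin{cases} 0 & \text{if } d = \mathbb{I}_{\Theta_0}(\theta) \\ a_0 & \text{if } d = 1 \text{ and } \theta \notin \Theta_0 \\ a_1 & \text{if } d = 0 \text{ and } \theta \in \Theta_0 \end{cases} \]

the Bayes procedure is

\[ \delta^\pi(x) = \begin{cases} 1 & \text{if } \Pr^\pi(\theta \in \Theta_0 | x) \ge a_0/(a_0 + a_1) \\ 0 & \text{otherwise} \end{cases} \]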
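
A minimal sketch of this decision rule for the normal mean example (H0 : θ ≤ 0 with x ∼ N(θ, 1)) could look as follows. The N(0, τ²) prior and the function names `posterior_prob_null` and `bayes_test` are assumptions made for illustration; the slide does not fix a prior.

```python
from statistics import NormalDist


def posterior_prob_null(x, tau2=1.0):
    """Pr(theta <= 0 | x) for x ~ N(theta, 1) and an assumed prior theta ~ N(0, tau2)."""
    post_mean = tau2 * x / (1.0 + tau2)          # conjugate normal update
    post_sd = (tau2 / (1.0 + tau2)) ** 0.5
    return NormalDist(post_mean, post_sd).cdf(0.0)


def bayes_test(x, a0, a1, tau2=1.0):
    """Return 1 (accept H0) when the posterior probability of H0 reaches a0/(a0 + a1)."""
    return 1 if posterior_prob_null(x, tau2) >= a0 / (a0 + a1) else 0


print(bayes_test(-1.0, a0=1.0, a1=1.0))  # negative observation
print(bayes_test(1.0, a0=1.0, a1=1.0))   # positive observation
```

Raising a0 relative to a1 makes wrongly accepting H0 more costly, so the posterior probability needed to accept H0 increases.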

Bayes factors

A function of posterior probabilities
The method posits two or more alternative hypotheses and tests their
relative fits to some observed statistics
Templeton, Mol. Ecol., 2009

Definition (Bayes factors)
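For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0

\[ B_{01} = \frac{\pi(\Theta_0|x)}{\pi(\Theta_0^c|x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x|\theta)\,\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x|\theta)\,\pi_1(\theta)\,d\theta} \]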

[Good, 1958 & Jeffreys, 1961]

Self-contained concept

Having a high relative probability does not mean that a hypothesis is true
or supported by the data

Non-decision-theoretic:
◮ eliminates choice of π(Θ0 )
◮ Bayesian/marginal equivalent to the likelihood ratio
◮ Jeffreys' scale of evidence:
  ◮ if $\log_{10}(B^\pi_{10})$ between 0 and 0.5, evidence against H0 weak,
  ◮ if $\log_{10}(B^\pi_{10})$ between 0.5 and 1, evidence substantial,
  ◮ if $\log_{10}(B^\pi_{10})$ between 1 and 2, evidence strong and
  ◮ if $\log_{10}(B^\pi_{10})$ above 2, evidence decisive
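
The scale above can be sketched as a small lookup. The function name `jeffreys_grade`, the handling of the boundary values and the label returned when the Bayes factor favours H0 are assumptions, since the slide does not specify them.

```python
import math


def jeffreys_grade(b10):
    """Grade the evidence against H0 from the Bayes factor B10 on Jeffreys' scale."""
    if b10 <= 0:
        raise ValueError("a Bayes factor must be positive")
    log_bf = math.log10(b10)
    if log_bf < 0:
        return "none (data favour H0)"   # assumed label, not on the slide
    if log_bf < 0.5:
        return "weak"
    if log_bf < 1:
        return "substantial"
    if log_bf < 2:
        return "strong"
    return "decisive"


for bf in (0.5, 2.0, 5.0, 50.0, 500.0):
    print(bf, jeffreys_grade(bf))
```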

A major modiﬁcation

Considering whether a location parameter α is 0. The prior is uniform
and we should have to take f (α) = 0 and B10 would always be inﬁnite

When the null hypothesis is supported by a set of measure 0,
π(Θ0 ) = 0 and thus π(Θ0 |x) = 0.
[End of the story?!]

Changing the prior to ﬁt the hypotheses

Requirement
Deﬁned prior distributions under both assumptions,

π0 (θ) ∝ π(θ)IΘ0 (θ), π1 (θ) ∝ π(θ)IΘ1 (θ),

(under the standard dominating measures on Θ0 and Θ1 )

Using the prior probabilities π(Θ0) = ϱ0 and π(Θ1) = ϱ1,

π(θ) = ϱ0 π0(θ) + ϱ1 π1(θ).

Point null hypotheses

I have no patience for statistical methods that assign positive probability
to point hypotheses of the θ = 0 type that can never actually be true
Gelman, BA, 2008

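Take $\rho_0 = \Pr^\pi(\theta = \theta_0)$ and $g_1$ prior density under Ha, with marginal $m_1(x) = \int f(x|\theta)\,g_1(\theta)\,d\theta$. Then

\[ \pi(\Theta_0|x) = \frac{f(x|\theta_0)\,\rho_0}{\int f(x|\theta)\,\pi(\theta)\,d\theta} = \frac{f(x|\theta_0)\,\rho_0}{f(x|\theta_0)\,\rho_0 + (1-\rho_0)\,m_1(x)} \]

and Bayes factor

\[ B^\pi_{01}(x) = \frac{f(x|\theta_0)\,\rho_0}{m_1(x)\,(1-\rho_0)} \bigg/ \frac{\rho_0}{1-\rho_0} = \frac{f(x|\theta_0)}{m_1(x)} \]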

Point null hypotheses (cont’d)

Example (Normal mean)
Test of H0 : θ = 0 when x ∼ N(θ, σ²) with σ² = 1: we take π1 as N(0, τ²)

\[ \frac{m_1(x)}{f(x|0)} = \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\; \exp\left\{ \frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)} \right\} \]

and the posterior probability of H0 (for ρ0 = 1/2) is

τ² \ x   0      0.68   1.28   1.96
1        0.586  0.557  0.484  0.351
10       0.768  0.729  0.612  0.366
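
The table can be checked numerically with a short sketch. It assumes σ² = 1, equal prior weights ρ0 = 1/2, and that the row labels 1 and 10 are values of τ²; these assumptions are not stated on the slide but reproduce its figures to within rounding. The function name is hypothetical.

```python
import math


def posterior_prob_point_null(x, tau2, sigma2=1.0, rho0=0.5):
    """Posterior probability of H0: theta = 0 under a N(0, tau2) prior on the alternative."""
    # Ratio m1(x) / f(x|0) of the marginal under Ha to the density under H0.
    ratio = math.sqrt(sigma2 / (sigma2 + tau2)) * math.exp(
        tau2 * x * x / (2.0 * sigma2 * (sigma2 + tau2))
    )
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 * ratio)


for tau2 in (1.0, 10.0):
    row = [posterior_prob_point_null(x, tau2) for x in (0.0, 0.68, 1.28, 1.96)]
    print(tau2, [f"{p:.4f}" for p in row])
```

Even at x = 1.96, the classical 5% threshold, the posterior probability of the null stays above one third under these assumptions.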


Comparison with classical tests

The 95 percent frequentist intervals will live up to their advertised
coverage claims
Wasserman, BA, 2008

Standard answer
Deﬁnition (p-value)
The p-value p(x) associated with a test is the largest signiﬁcance
level for which H0 is rejected


Problems with p-values

The use of P implies that a hypothesis that may be true may be rejected
because it had not predicted observable results that have not occurred

◮ Evaluation of the wrong quantity, namely the probability to
exceed the observed quantity.(wrong conditioning)
◮ Evaluation only under the null hypothesis
◮ Huge numerical diﬀerence with the Bayesian range of answers


Bayesian lower bounds

If the Bayes estimator has good frequency behavior
then we might as well use the frequentist method.
If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008

Least favourable Bayesian answer is

\[ B(x, G_A) = \inf_{g \in G_A} \frac{f(x|\theta_0)}{\int_\Theta f(x|\theta)\,g(\theta)\,d\theta}, \]

i.e., if there exists a mle for θ, $\hat\theta(x)$,

\[ B(x, G_A) = \frac{f(x|\theta_0)}{f(x|\hat\theta(x))} \]


Illustration

Example (Normal case)
When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are

\[ B(x, G_A) = e^{-x^2/2} \quad\text{and}\quad P(x, G_A) = \left(1 + e^{x^2/2}\right)^{-1}, \]

i.e.
p-value  0.10   0.05   0.01   0.001
P        0.205  0.128  0.035  0.004
B        0.256  0.146  0.036  0.004
[Quite diﬀerent!]
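
The comparison can be reproduced with the sketch below, which assumes the p-values are two-sided so that x is the 1 − p/2 standard normal quantile. The function name `lower_bounds` is hypothetical, and small discrepancies in the third decimal with the slide's table are to be expected from rounding of the quantiles.

```python
import math
from statistics import NormalDist


def lower_bounds(p_value):
    """Return (x, B, P): the observation matching a two-sided p-value and both lower bounds."""
    x = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    b = math.exp(-x * x / 2.0)                   # lower bound on the Bayes factor
    p = 1.0 / (1.0 + math.exp(x * x / 2.0))      # lower bound on the posterior probability
    return x, b, p


for pv in (0.10, 0.05, 0.01, 0.001):
    x, b, p = lower_bounds(pv)
    print(f"p-value={pv}: x={x:.3f}, P={p:.3f}, B={b:.3f}")
```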

Model choice

Model choice and model comparison

There is no null hypothesis, which complicates the computation of
sampling error

Choice among models
Several models available for the same observation(s)

Mi : x ∼ fi (x|θi ), i∈I

where I can be ﬁnite or inﬁnite

Bayesian resolution
The posterior probabilities are constructed by using a numerator that is a
function of the observation for a particular model, then divided by a
denominator that ensures that the ”probabilities” sum to one

Probabilise the entire model/parameter space
◮ allocate probabilities pi to all models Mi

◮ deﬁne priors πi (θi ) for each parameter space Θi

◮ compute

\[ \pi(M_i|x) = \frac{p_i \int_{\Theta_i} f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x|\theta_j)\,\pi_j(\theta_j)\,d\theta_j} \]
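
Once the marginal likelihoods (evidences) of the models are available, the last step is a simple normalisation. The sketch below assumes they are supplied on the log scale, a common choice for numerical stability; the function name and the two-model example are illustrative.

```python
import math


def model_posteriors(prior_probs, log_evidences):
    """Posterior model probabilities from prior weights p_i and log marginal likelihoods."""
    m = max(log_evidences)  # subtract the maximum to avoid underflow
    weights = [p * math.exp(le - m) for p, le in zip(prior_probs, log_evidences)]
    total = sum(weights)
    return [w / total for w in weights]


# Two models with equal prior weight and a Bayes factor of 3 in favour of the first.
example = model_posteriors([0.5, 0.5], [math.log(3.0), math.log(1.0)])
print(example)
```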

Bayesian resolution(2)

The numerators are not co-measurable across hypotheses, and the
denominators are sums of non-co-measurable entities. This means that it
is mathematically impossible for them to be probabilities.

◮ take largest π(Mi|x) to determine "best" model,
or use averaged predictive

\[ \sum_j \pi(M_j|x) \int_{\Theta_j} f_j(x'|\theta_j)\,\pi_j(\theta_j|x)\,d\theta_j \]

Natural Ockham’s razor

Pluralitas non est ponenda sine necessitate

Variation is random until the
contrary is shown; and new
parameters in laws, when they
are suggested, must be tested
one at a time, unless there is
specific reason to the contrary.


The Bayesian approach naturally weights differently models with
different parameter dimensions (BIC).

Compatible priors

Compatibility principle

Further complicating dimensionality of test statistics is the fact that the
models are often not nested, and one model may contain parameters that
do not have analogues in the other models and vice versa

Diﬃculty of ﬁnding simultaneously priors on a collection of models
Easier to start from a single prior on a “big” [encompassing] model
and to derive others from a coherence principle
[Dawid & Lauritzen, 2000]

An illustration for linear regression
In the case M1 and M2 are two nested Gaussian linear regression
models with Zellner's g-priors and the same variance σ² ∼ π(σ²):
◮ M1 : y|β1, σ² ∼ N(X1 β1, σ²) with

\[ \beta_1|\sigma^2 \sim N\left(s_1, \sigma^2 n_1 (X_1^T X_1)^{-1}\right) \]

where X1 is a (n × k1) matrix of rank k1 ≤ n
◮ M2 : y|β2, σ² ∼ N(X2 β2, σ²) with

\[ \beta_2|\sigma^2 \sim N\left(s_2, \sigma^2 n_2 (X_2^T X_2)^{-1}\right), \]

where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
[© Marin & Robert, Bayesian Core]

Compatible g-priors

I don’t see any role for squared error loss, minimax, or the rest of what is
sometimes called statistical decision theory
Gelman, BA, 2008

Since σ² is a nuisance parameter, minimize the Kullback-Leibler
divergence between both marginal distributions conditional on σ²:
m1(y|σ²; s1, n1) and m2(y|σ²; s2, n2), with solution

\[ \beta_2|X_2, \sigma^2 \sim N\left(s_2^*, \sigma^2 n_2^* (X_2^T X_2)^{-1}\right) \]

with
\[ s_2^* = (X_2^T X_2)^{-1} X_2^T X_1 s_1, \qquad n_2^* = n_1 \]

Variable selection

Regression setup where y is regressed on a
set {x1, ..., xp} of p potential
explanatory regressors (plus intercept)
Corresponding 2^p submodels Mγ, where
γ ∈ Γ = {0, 1}^p indicates
inclusion/exclusion of variables by a
binary representation,
e.g. γ = 101001011 means that x1, x3,
x6, x8 and x9 are included.

Notations
For model Mγ ,
◮ qγ variables included
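◮ $t_1(\gamma) = \{t_{1,1}(\gamma), \ldots, t_{1,q_\gamma}(\gamma)\}$ indices of those variables and
$t_0(\gamma)$ indices of the variables not included
◮ For $\beta \in \mathbb{R}^{p+1}$,

\[ \beta_{t_1(\gamma)} = \left(\beta_0, \beta_{t_{1,1}(\gamma)}, \ldots, \beta_{t_{1,q_\gamma}(\gamma)}\right) \]

\[ X_{t_1(\gamma)} = \left[\, 1_n \,|\, x_{t_{1,1}(\gamma)} \,|\, \ldots \,|\, x_{t_{1,q_\gamma}(\gamma)} \,\right]. \]

Submodel Mγ is thus

\[ y|\beta, \gamma, \sigma^2 \sim N\left(X_{t_1(\gamma)}\,\beta_{t_1(\gamma)},\; \sigma^2 I_n\right) \]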

Global and compatible priors
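Use Zellner's g-prior, i.e. a normal prior for β conditional on σ²,

\[ \beta|\sigma^2 \sim N\left(\tilde\beta,\; c\sigma^2 (X^T X)^{-1}\right) \]

and a Jeffreys prior for σ²,

\[ \pi(\sigma^2) \propto \sigma^{-2} \]

Resulting compatible prior

\[ \beta_{t_1(\gamma)} \sim N\left( \left(X_{t_1(\gamma)}^T X_{t_1(\gamma)}\right)^{-1} X_{t_1(\gamma)}^T X \tilde\beta,\; c\sigma^2 \left(X_{t_1(\gamma)}^T X_{t_1(\gamma)}\right)^{-1} \right) \]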

Posterior model probability

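Can be obtained in closed form:

\[ \pi(\gamma|y) \propto (c+1)^{-(q_\gamma+1)/2} \left[ y^T y - \frac{c\, y^T P_1 y}{c+1} + \frac{\tilde\beta^T X^T P_1 X \tilde\beta}{c+1} - \frac{2\, y^T P_1 X \tilde\beta}{c+1} \right]^{-n/2}. \]

Conditionally on γ, posterior distributions of β and σ²:

\[ \beta_{t_1(\gamma)}|\sigma^2, y, \gamma \sim N\left( \frac{c}{c+1}\left(U_1 y + U_1 X \tilde\beta / c\right),\; \frac{\sigma^2 c}{c+1}\left(X_{t_1(\gamma)}^T X_{t_1(\gamma)}\right)^{-1} \right), \]

\[ \sigma^2|y, \gamma \sim IG\left( \frac{n}{2},\; \frac{y^T y}{2} - \frac{c\, y^T P_1 y}{2(c+1)} + \frac{\tilde\beta^T X^T P_1 X \tilde\beta}{2(c+1)} - \frac{y^T P_1 X \tilde\beta}{c+1} \right). \]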

Noninformative case

Use the same compatible informative g-prior distribution with
$\tilde\beta = 0_{p+1}$ and a hierarchical diffuse prior distribution on c,

\[ \pi(c) \propto c^{-1}\,\mathbb{I}_{\mathbb{N}^*}(c) \quad\text{or}\quad \pi(c) \propto c^{-1}\,\mathbb{I}_{c>0} \]

The choice of this hierarchical diffuse prior distribution on c is due
to the model posterior sensitivity to large values of c:

Taking $\tilde\beta = 0_{p+1}$ and c large does not work

Processionary caterpillar

Inﬂuence of some forest settlement characteristics on the
development of caterpillar colonies

Response y log-transform of the average number of nests of
caterpillars per tree on an area of 500 square meters (n = 33 areas)
[© Marin & Robert, Bayesian Core]

Processionary caterpillar (cont’d)

Potential explanatory variables
x1 altitude (in meters),
x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).

[Figure: panels labelled x1 to x9]

Bayesian regression output
Estimate BF log10(BF)

(Intercept) 9.2714 26.334 1.4205 (***)
X1 -0.0037 7.0839 0.8502 (**)
X2 -0.0454 3.6850 0.5664 (**)
X3 0.0573 0.4356 -0.3609
X4 -1.0905 2.8314 0.4520 (*)
X5 0.1953 2.5157 0.4007 (*)
X6 -0.3008 0.3621 -0.4412
X7 -0.2002 0.3627 -0.4404
X8 0.1526 0.4589 -0.3383
X9 -1.0835 0.9069 -0.0424
X10 -0.3651 0.4132 -0.3838
evidence against H0: (****) decisive, (***) strong, (**)
substantial, (*) poor

Bayesian variable selection

t1 (γ) π(γ|y, X)
0,1,2,4,5 0.0929
0,1,2,4,5,9 0.0325
0,1,2,4,5,10 0.0295
0,1,2,4,5,7 0.0231
0,1,2,4,5,8 0.0228
0,1,2,4,5,6 0.0228
0,1,2,3,4,5 0.0224
0,1,2,3,4,5,9 0.0167
0,1,2,4,5,6,9 0.0167
0,1,2,4,5,8,9 0.0137
Noninformative G-prior model choice



Bayesian methods seem to quickly move to elaborate computation
Gelman, BA, 2008

Introduction


Implementation diﬃculties
Bayes factor approximation
ABC model choice



B Implementation diﬃculties
◮ Computing the posterior distribution

π(θ|x) ∝ π(θ)f (x|θ)

◮ Resolution of

\[ \arg\min \int_\Theta L(\theta, \delta)\,\pi(\theta)\,f(x|\theta)\,d\theta \]

◮ Maximisation of the marginal posterior

\[ \arg\max \int_{\Theta_{-1}} \pi(\theta|x)\,d\theta_{-1} \]


B Implementation further diﬃculties
A statistical test returns a probability value, but rarely is the probability
value per se the reason for an investigator performing the test

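◮ Computing posterior quantities

\[ \delta^\pi(x) = \int_\Theta h(\theta)\,\pi(\theta|x)\,d\theta = \frac{\int_\Theta h(\theta)\,\pi(\theta)\,f(x|\theta)\,d\theta}{\int_\Theta \pi(\theta)\,f(x|\theta)\,d\theta} \]

◮ Resolution (in k) of

\[ P(\pi(\theta|x) \ge k \,|\, x) = \alpha \]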


Monte Carlo methods
Bayesian simulation seems stuck in an inﬁnite regress of inferential
uncertainty
Gelman, BA, 2008

Approximation of

\[ I = \int_\Theta g(\theta)\,f(x|\theta)\,\pi(\theta)\,d\theta, \]

takes advantage of the fact that f(x|θ)π(θ) is proportional to a
density: If the θi's are from π(θ),

\[ \frac{1}{m}\sum_{i=1}^m g(\theta_i)\,f(x|\theta_i) \]

converges (almost surely) to I
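
A minimal sketch of this estimator, assuming x ∼ N(θ, 1), a N(0, 1) prior and g ≡ 1, so that I is the marginal density of x and can be compared with its closed form N(x; 0, 2). The seed, sample size and function names are arbitrary choices for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def marginal_mc(x, m, rng):
    """Average the likelihood f(x|theta_i) over m draws theta_i from the prior (g = 1)."""
    total = 0.0
    for _ in range(m):
        theta = rng.gauss(0.0, 1.0)          # theta_i ~ pi, assumed N(0, 1)
        total += normal_pdf(x, theta, 1.0)   # f(x | theta_i), assumed N(theta, 1)
    return total / m


rng = random.Random(2009)
x_obs = 1.5
estimate = marginal_mc(x_obs, 100_000, rng)
exact = normal_pdf(x_obs, 0.0, math.sqrt(2.0))  # closed-form marginal under these assumptions
print(estimate, exact)
```

Sampling from the prior is simple but can be inefficient when the likelihood is concentrated far from the bulk of the prior, which motivates importance functions.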


Importance function

A simulation method of inference hides unrealistic assumptions

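No need to simulate from π(·|x) or from π: if h is a probability
density,

\[ \int_\Theta g(\theta)\,f(x|\theta)\,\pi(\theta)\,d\theta = \int_\Theta \frac{g(\theta)\,f(x|\theta)\,\pi(\theta)}{h(\theta)}\,h(\theta)\,d\theta \]

and, for θi's drawn from h,

\[ \frac{\sum_{i=1}^m g(\theta_i)\,\omega(\theta_i)}{\sum_{i=1}^m \omega(\theta_i)} \quad\text{with}\quad \omega(\theta_i) = \frac{f(x|\theta_i)\,\pi(\theta_i)}{h(\theta_i)} \]

approximates $\mathbb{E}^\pi[g(\theta)|x]$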
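
The self-normalised estimator can be sketched as follows for g(θ) = θ. The model x ∼ N(θ, 1), the N(0, 1) prior and the importance function h = N(x, 2²) are assumptions chosen so that the exact posterior mean x/2 is available for comparison.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def snis_posterior_mean(x, m, rng, h_mean, h_sd):
    """Self-normalised importance sampling estimate of E[theta | x]."""
    num = 0.0
    den = 0.0
    for _ in range(m):
        theta = rng.gauss(h_mean, h_sd)                   # theta_i ~ h
        weight = (normal_pdf(x, theta, 1.0)               # f(x | theta_i)
                  * normal_pdf(theta, 0.0, 1.0)           # pi(theta_i)
                  / normal_pdf(theta, h_mean, h_sd))      # h(theta_i)
        num += theta * weight
        den += weight
    return num / den


rng = random.Random(2007)
x_obs = 1.5
is_estimate = snis_posterior_mean(x_obs, 50_000, rng, h_mean=x_obs, h_sd=2.0)
print(is_estimate, x_obs / 2.0)
```

The importance function here has heavier tails than the posterior, which keeps the weights bounded.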


ABC’s When approximating the Bayes factor

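\[ B_{12} = \frac{\int_{\Theta_1} f_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(x|\theta_2)\,\pi_2(\theta_2)\,d\theta_2} = \frac{Z_1}{Z_2} \]

use of importance functions $\varpi_1$ and $\varpi_2$ and

\[ \hat B_{12} = \frac{n_1^{-1} \sum_{i=1}^{n_1} f_1(x|\theta_1^i)\,\pi_1(\theta_1^i) / \varpi_1(\theta_1^i)}{n_2^{-1} \sum_{i=1}^{n_2} f_2(x|\theta_2^i)\,\pi_2(\theta_2^i) / \varpi_2(\theta_2^i)}, \qquad \theta_j^i \sim \varpi_j(\theta) \]

[Chopin & Robert, 2007]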


Approximating Zk from a posterior sample

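Use of the [harmonic mean] identity

\[ \mathbb{E}^{\pi_k}\left[ \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)} \,\Big|\, x \right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)\,L_k(\theta_k)}\; \frac{\pi_k(\theta_k)\,L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k} \]

no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output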


Comparison with regular importance sampling

Harmonic mean: Constraint opposed to usual importance sampling
constraints: ϕ(θ) must have lighter (rather than fatter) tails than
πk(θk)Lk(θk) for the approximation

\[ \hat Z_{1k} = 1 \bigg/ \frac{1}{T} \sum_{t=1}^T \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})} \]

to have a finite variance.
E.g., use finite support kernels (like Epanechnikov's kernel) for ϕ
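
A minimal sketch of this evidence estimator in a conjugate setting where Z is known. It assumes x ∼ N(θ, 1) and a N(0, 1) prior, replaces MCMC output with exact draws from the posterior N(x/2, 1/2), and takes ϕ as a normal density with a smaller standard deviation than the posterior so that its tails are lighter. All names and tuning values are illustrative.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gelfand_dey_evidence(x, n_draws, rng):
    """Estimate Z by inverting the posterior average of phi / (prior * likelihood)."""
    post_mean, post_sd = x / 2.0, math.sqrt(0.5)
    phi_sd = 0.5  # lighter tails than the posterior (sd about 0.707)
    total = 0.0
    for _ in range(n_draws):
        theta = rng.gauss(post_mean, post_sd)  # stand-in for MCMC output
        unnorm_post = normal_pdf(theta, 0.0, 1.0) * normal_pdf(x, theta, 1.0)
        total += normal_pdf(theta, post_mean, phi_sd) / unnorm_post
    return n_draws / total


rng = random.Random(1994)
x_obs = 1.5
z_hat = gelfand_dey_evidence(x_obs, 50_000, rng)
z_exact = normal_pdf(x_obs, 0.0, math.sqrt(2.0))  # marginal density of x under these assumptions
print(z_hat, z_exact)
```

Choosing ϕ with heavier tails than the posterior would make the ratio unbounded and could give the estimator infinite variance, which is the constraint stated above.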


Approximating Z using a mixture representation

Bridge sampling redux

Design a speciﬁc mixture for simulation [importance sampling]
purposes, with density

\[ \tilde\varphi_k(\theta_k) \propto \omega_1\,\pi_k(\theta_k)\,L_k(\theta_k) + \varphi(\theta_k), \]

where ϕ(·) is arbitrary (but normalised)
Note: ω1 is not a probability weight


Approximating Z using a mixture representation (cont’d)

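Corresponding MCMC (=Gibbs) sampler
At iteration t
1. Take $\delta^{(t)} = 1$ with probability

\[ \omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t-1)})\,L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)}) \right\} \]

and $\delta^{(t)} = 2$ otherwise;
2. If $\delta^{(t)} = 1$, generate $\theta_k^{(t)} \sim \text{MCMC}(\theta_k^{(t-1)}, \theta_k)$ where
$\text{MCMC}(\theta_k, \theta_k')$ denotes an arbitrary MCMC kernel associated
with the posterior $\pi_k(\theta_k|x) \propto \pi_k(\theta_k)\,L_k(\theta_k)$;
3. If $\delta^{(t)} = 2$, generate $\theta_k^{(t)} \sim \varphi(\theta_k)$ independently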


Evidence approximation by mixtures
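Rao-Blackwellised estimate

\[ \hat\xi = \frac{1}{T} \sum_{t=1}^T \frac{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)})}{\omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)})}, \]

converges to $\omega_1 Z_k / \{\omega_1 Z_k + 1\}$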
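Deduce $\hat Z_{3k}$ from $\omega_1 \hat Z_{3k} / \{\omega_1 \hat Z_{3k} + 1\} = \hat\xi$, i.e.

\[ \hat Z_{3k} = \frac{1}{\omega_1}\; \frac{\sum_{t=1}^T \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}}{\sum_{t=1}^T \varphi(\theta_k^{(t)}) \Big/ \left\{ \omega_1\,\pi_k(\theta_k^{(t)})\,L_k(\theta_k^{(t)}) + \varphi(\theta_k^{(t)}) \right\}} \]

[Bridge sampler]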


Case of latent variables

For missing variable z as in mixture models, natural Rao-Blackwell
estimate

\[ \hat\pi_k(\theta_k^*|x) = \frac{1}{T} \sum_{t=1}^T \pi_k(\theta_k^*|x, z_k^{(t)}), \]

where the $z_k^{(t)}$'s are Gibbs sampled latent variables

ABC model choice

Approximate Bayesian Computation

Simulation target is π(θ)f (x|θ) with likelihood f (x|θ) not in
closed form.
Likelihood-free rejection technique:
ABC algorithm
For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
simulating
θ′ ∼ π(θ) , x ∼ f (x|θ′ ) ,
until the auxiliary variable x is equal to the observed value, x = y.

[Pritchard et al., 1999]
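
A minimal sketch of the exact rejection version on discrete data, where the event x = y has positive probability. The binomial model, the uniform prior on θ and all names are illustrative assumptions; the loop is simply repeated until the requested number of accepted draws is reached.

```python
import random


def abc_exact(y_obs, n_trials, n_samples, rng):
    """Likelihood-free rejection sampler: keep theta only when the simulated x equals y_obs."""
    accepted = []
    while len(accepted) < n_samples:
        theta = rng.random()                                      # theta' ~ U(0, 1) prior
        x = sum(rng.random() < theta for _ in range(n_trials))    # x ~ Binomial(n_trials, theta')
        if x == y_obs:
            accepted.append(theta)
    return accepted


rng = random.Random(1999)
draws = abc_exact(y_obs=7, n_trials=10, n_samples=2000, rng=rng)
abc_mean = sum(draws) / len(draws)
print(abc_mean)
```

Under these assumptions the accepted values are exact draws from the Beta(8, 4) posterior, whose mean is 2/3, even though the likelihood is never evaluated.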

A as approximative

When y is a continuous random variable, equality x = y is replaced
with a tolerance condition,

\[ \varrho(x, y) \le \epsilon \]

where ϱ is a distance between summary statistics
Output distributed from

\[ \pi(\theta)\,P_\theta\{\varrho(x, y) < \epsilon\} \propto \pi(\theta \,|\, \varrho(x, y) < \epsilon) \]

Gibbs random ﬁelds

Gibbs distribution
The rv y = (y1, ..., yn) is a Gibbs random field associated with
the graph G if

\[ f(y) = \frac{1}{Z} \exp\left\{ -\sum_{c \in C} V_c(y_c) \right\}, \]

where Z is the normalising constant, C is the set of cliques of G
and $V_c$ is any function also called potential
$U(y) = \sum_{c \in C} V_c(y_c)$ is the energy function

Z is usually unavailable in closed form

Potts model
$V_c(y)$ is of the form

\[ V_c(y) = \theta S(y) = \theta \sum_{l \sim i} \delta_{y_l = y_i} \]

where l∼i denotes a neighbourhood structure

In most realistic settings, summation

\[ Z_\theta = \sum_{x \in \mathcal{X}} \exp\{\theta^T S(x)\} \]

involves too many terms to be manageable and numerical
approximations cannot always be trusted
[Cucala, Marin, CPR & Titterington, JASA, 2009]
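
To make the cost of this summation concrete, the sketch below computes Zθ by brute-force enumeration on a tiny rectangular lattice with nearest-neighbour (4-neighbour) edges. The lattice shape, the scalar θ and the function names are illustrative assumptions; the number of terms is n_states^(rows × cols), so this is only feasible for toy sizes.

```python
import itertools
import math


def grid_edges(rows, cols):
    """Pairs of neighbouring sites on a rows x cols lattice (right and down neighbours)."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                edges.append((i, i + 1))
            if r + 1 < rows:
                edges.append((i, i + cols))
    return edges


def potts_normalising_constant(theta, rows, cols, n_states=2):
    """Sum exp(theta * S(x)) over every configuration x of the lattice."""
    edges = grid_edges(rows, cols)
    z = 0.0
    for config in itertools.product(range(n_states), repeat=rows * cols):
        s = sum(1 for i, j in edges if config[i] == config[j])  # S(x): equal neighbour pairs
        z += math.exp(theta * s)
    return z


z_pair = potts_normalising_constant(1.0, 1, 2)   # two sites, one edge
z_flat = potts_normalising_constant(0.0, 3, 3)   # theta = 0 counts the configurations
print(z_pair, z_flat)
```

A 3 × 3 two-state lattice already needs 512 terms; a 10 × 10 lattice would need 2^100, which is why likelihood-free methods are attractive here.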

Neighbourhood relations

Choice to be made between M neighbourhood relations

\[ i \overset{m}{\sim} i' \qquad (0 \le m \le M-1) \]

with

\[ S_m(x) = \sum_{i \overset{m}{\sim} i'} \mathbb{I}_{\{x_i = x_{i'}\}} \]

driven by the posterior probabilities of the models.

Model index

Formalisation via a model index M, new parameter with prior
distribution π(M = m) and π(θ|M = m) = πm (θm )
Computational target:

\[ P(M = m|x) \propto \int_{\Theta_m} f_m(x|\theta_m)\,\pi_m(\theta_m)\,d\theta_m \;\pi(M = m) \]

Sufficient statistics
If S(x) sufficient statistic for the joint parameters
$(M, \theta_0, \ldots, \theta_{M-1})$,

\[ P(M = m|x) = P(M = m|S(x)). \]

For each model m, sufficient statistic $S_m(\cdot)$ makes
$S(\cdot) = (S_0(\cdot), \ldots, S_{M-1}(\cdot))$ also sufficient.
For Gibbs random fields,

\[ x|M = m \sim f_m(x|\theta_m) = f_m^1(x|S(x))\,f_m^2(S(x)|\theta_m) = \frac{1}{n(S(x))}\,f_m^2(S(x)|\theta_m) \]

where

\[ n(S(x)) = \sharp\{\tilde x \in \mathcal{X} : S(\tilde x) = S(x)\} \]

S(x) also sufficient for the joint parameters
[Specific to Gibbs random fields!]

ABC model choice Algorithm

ABC-MC
◮ Generate $m^*$ from the prior π(M = m).
◮ Generate $\theta^*_{m^*}$ from the prior $\pi_{m^*}(\cdot)$.
◮ Generate $x^*$ from the model $f_{m^*}(\cdot|\theta^*_{m^*})$.
◮ Compute the distance $\rho(S(x^0), S(x^*))$.
◮ Accept $(\theta^*_{m^*}, m^*)$ if $\rho(S(x^0), S(x^*)) < \epsilon$.

[Cornuet, Grelaud, Marin & Robert, BA, 2008]

Note When ǫ = 0 the algorithm is exact
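
The ABC-MC steps can be sketched on a pair of binary-sequence models: an iid Bernoulli model (index 0) against a two-state first-order Markov chain (index 1), with uniform priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) and a uniform prior on the model index. The observed sequence, the L1 distance between summary statistics and all function names are assumptions; acceptance uses ≤ ε so that ε = 0 corresponds to exact matching of the summaries.

```python
import math
import random


def simulate_bernoulli(theta0, n, rng):
    """iid binary sequence with P(x_i = 1) = exp(theta0) / (1 + exp(theta0))."""
    p = 1.0 / (1.0 + math.exp(-theta0))
    return [1 if rng.random() < p else 0 for _ in range(n)]


def simulate_markov(theta1, n, rng):
    """Two-state chain: uniform start, stays in place w.p. exp(theta1) / (1 + exp(theta1))."""
    p_stay = 1.0 / (1.0 + math.exp(-theta1))
    x = [rng.randrange(2)]
    for _ in range(n - 1):
        x.append(x[-1] if rng.random() < p_stay else 1 - x[-1])
    return x


def summary(x):
    """S(x) = (number of ones, number of equal consecutive pairs)."""
    return (sum(x), sum(1 for a, b in zip(x, x[1:]) if a == b))


def abc_model_choice(x_obs, n_proposals, eps, rng):
    """Return ABC estimates of the posterior probabilities of models 0 and 1."""
    n = len(x_obs)
    s_obs = summary(x_obs)
    counts = [0, 0]
    for _ in range(n_proposals):
        m = rng.randrange(2)  # uniform prior on the model index
        if m == 0:
            x = simulate_bernoulli(rng.uniform(-5.0, 5.0), n, rng)
        else:
            x = simulate_markov(rng.uniform(0.0, 6.0), n, rng)
        s = summary(x)
        dist = abs(s[0] - s_obs[0]) + abs(s[1] - s_obs[1])
        if dist <= eps:
            counts[m] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else None


rng = random.Random(2008)
x_obs = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # hypothetical data
probs = abc_model_choice(x_obs, n_proposals=40_000, eps=0, rng=rng)
print(probs)
```

The acceptance frequencies of each model index estimate the posterior model probabilities, and their ratio (divided by the prior odds) estimates the Bayes factor.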

Toy example

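iid Bernoulli model versus two-state first-order Markov chain, i.e.

\[ f_0(x|\theta_0) = \exp\left( \theta_0 \sum_{i=1}^n \mathbb{I}_{\{x_i = 1\}} \right) \Big/ \{1 + \exp(\theta_0)\}^n, \]

versus

\[ f_1(x|\theta_1) = \frac{1}{2} \exp\left( \theta_1 \sum_{i=2}^n \mathbb{I}_{\{x_i = x_{i-1}\}} \right) \Big/ \{1 + \exp(\theta_1)\}^{n-1}, \]

with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by "phase
transition" boundaries).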

Toy example (2)

(left) Comparison of the true $BF_{m_0/m_1}(x^0)$ with its estimate $\widehat{BF}_{m_0/m_1}(x^0)$
(in logs) over 2,000 simulations and 4·10^6 proposals from the
prior. (right) Same when using tolerance ε corresponding to the
1% quantile on the distances.



Given the advances in practical Bayesian methods in the past two
decades, anti-Bayesianism is no longer a serious option
Gelman, BA, 2009

Bayesians are of course their own worst enemies. They make
non-Bayesians accuse them of religious fervour, and an unwillingness to
see another point of view.
Davidson, 2009


1. Choosing a probabilistic representation

Bayesian statistics is about making probability statements
Gelman, BA, 2009

Bayesian Statistics appears as the calculus of uncertainty
Reminder:
A probabilistic model is nothing but an interpretation of a given
phenomenon
What is the meaning of RD’s t test example?!


1. Choosing a probabilistic representation (2)

Inference is impossible.
Davidson, 2009

The Bahadur–Savage problem stems from the inability to make
choices about the shape of a statistical model, not from an
impossibility to draw [Bayesian] inference.
Further, a probability distribution is more than the sum of its
moments. Ill-posed problems thus highlight issues with the model,
not the inference.


2. Conditioning on the data

Bayesian data analysis is a method for summarizing uncertainty and
making estimates and predictions using probability statements conditional
on observed data and an assumed model
Gelman, BA, 2009

At the basis of statistical inference lies an inversion process
between cause and eﬀect. Using a prior distribution brings a
necessary balance between observations and parameters and enables
one to operate conditional upon x
What is the data in RD’s t test example?! U ’s? Y ’s?


3. Exhibiting the true likelihood

Frequentist statistics is an approach for evaluating statistical procedures
conditional on some family of posited probability models
Gelman, BA, 2009

Provides a complete quantitative inference on the parameters and
predictive that points out inadequacies of frequentist statistics,
while implementing the Likelihood Principle.
There needs to be a true likelihood, including in
non-parametric settings
[Rousseau, Van der Vaart]


4. Using priors as tools and summaries

Bayesian techniques allow prior beliefs to be tested and discarded as
appropriate
Gelman, BA, 2009

The choice of a prior distribution π does not require any kind of
belief in this distribution: rather consider it as a tool that
summarizes the available prior information and the uncertainty
surrounding this information
Non-identiﬁability is an issue in that the prior may strongly
impact inference about identiﬁable bits


4. Using priors as tools and summaries (2)

No uninformative prior exists for such models.
Davidson, 2009

Reference priors can be deduced from the sampling distribution by
an automated procedure, based on a minimal information principle
that maximises the information brought by the data.
Important literature on prior modelling for non-parametric
problems, incl. smoothness constraints.


5. Accepting the subjective basis of knowledge

Knowledge is a critical confrontation between a prioris and
experiments. Ignoring these a prioris impoverishes analysis.
We have, for one thing, to use a language and our
language is entirely made of preconceived ideas and has to be
so. However, these are unconscious preconceived ideas, which
are a million times more dangerous than the other ones. Were
we to assert that if we are including other preconceived ideas,
consciously stated, we would aggravate the evil! I do not
believe so: I rather maintain that they would balance one
another.
Henri Poincaré, 1902


6. Choosing a coherent system of inference

Bayesian data analysis has three stages: formulating a model,
fitting the model to data, and checking the model fit.
The second step—inference—gets most of the attention,
but the procedure as a whole is not automatic
Gelman, BA, 2009

To force inference into a decision-theoretic mold allows for a
clariﬁcation of the way inferential tools should be evaluated, and
therefore implies a conscious (although subjective) choice of the
retained optimality.
Logical inference process Start with requested properties, i.e.
loss function and prior distribution, then derive the best solution
satisfying these properties.


6. Choosing a coherent system of inference (2)

Asymptopia annoys Bayesians.
Davidson, 2009

Asymptotics [for inference] sounds like a proxy for not specifying
completely the model and thus for using another model. While
asymptotics [for simulation] is quite acceptable. Bayesian inference
does not escape asymptotic diﬃculties, see e.g. mixtures.
NP Bootstrap aims at inference with no[t enough]
modelling, while P Bayesian bootstrap is essentially using the
Bayesian predictive


7. Looking for optimal frequentist procedures

At intermediate levels of a Bayesian model, frequency properties typically
take care of themselves. It is typically only at the top level of
unreplicated parameters that we have to worry
Gelman, BA, 2009

Bayesian inference widely intersects with the three notions of
minimaxity, admissibility and equivariance (Haar). Looking for an
optimal estimator most often ends up ﬁnding a Bayes estimator.
Optimality is easier to attain through the Bayes “ﬁlter”


8. Solving the actual problem

Frequentist methods have coverage guarantees; Bayesian methods don’t.
In science, coverage matters
Wasserman, BA, 2009

Frequentist methods justiﬁed on a long-term basis, i.e., from the
statistician viewpoint. From a decision-maker’s point of view, only
the problem at hand matters! That is, he/she calls for an inference
conditional on x.


9. Providing a universal system of inference

Bayesian methods are presented as an automatic inference engine
Gelman, BA, 2009

Given the three factors

(X , f (x|θ), (Θ, π(θ)), (D, L(θ, d)) ,

the Bayesian approach validates one and only one inferential
procedure


10. Computing procedures as a minimization problem

The discussion of computational issues should not be allowed to obscure
the need for further analysis of inferential questions
Bernardo, BA, 2009

Bayesian procedures are easier to compute than procedures of
alternative theories, in the sense that there exists a universal
method for the computation of Bayes estimators
Convergence assessment is an issue, but recent developments
in adaptive MCMC allow for more conﬁdence in the output
