1) The document discusses Bayesian testing and model choice, arguing that the 21st century belongs to Bayesian statistics.
2) It introduces Bayesian tests which are constructed from a decision-theoretic perspective to minimize expected loss.
3) Bayes factors are discussed as a function of posterior probabilities that allows comparison of alternative hypotheses without choosing a prior probability. Bayes factors provide a scale to assess the strength of evidence against a null hypothesis.
4) Issues with improper prior distributions and noninformative priors are addressed, justifying their use in certain situations. Changes to Bayes factors when the null hypothesis has zero prior probability are also described.
The 21st Century Belongs to Bayes
1. The 21st Bayesian Century
“The 21st Century belongs to Bayes”
as argued by a discussion on Bayesian testing and
Bayesian model choice
Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
http://xianblog.wordpress.com
July 1, 2009
2. The 21st Bayesian Century
A consequence of Bayesian statistics being given a proper
name is that it encourages too much historical deference
from people who think that the bibles of Jeffreys, de
Finetti, Jaynes, and others have all the answers.
—Gelman, Bayesian Analysis 3(3), 2008
3. The 21st Bayesian Century
Outline
Anyone not shocked by the Bayesian theory of inference has not
understood it
Senn, BA, 2008
Introduction
Tests and model choice
Bayesian Calculations
A Defense of the Bayesian Choice
4. The 21st Bayesian Century
Introduction
Vocabulary and concepts
Bayesian inference is a coherent mathematical theory
but I don’t trust it in scientific applications.
Gelman, BA, 2008
Introduction
Models
The Bayesian framework
Improper prior distributions
Noninformative prior distributions
Tests and model choice
Bayesian Calculations
A Defense of the Bayesian Choice
5. The 21st Bayesian Century
Introduction
Models
Parametric model
Bayesians promote the idea that a multiplicity of parameters can be
handled via hierarchical, typically exchangeable, models, but it seems
implausible that this could really work automatically [instead of] giving
reasonable answers using minimal assumptions.
Gelman, BA, 2008
Observations x1 , . . . , xn generated from a probability distribution
fi (xi |θi , x1 , . . . , xi−1 ) = fi (xi |θi , x1:i−1 )
x = (x1 , . . . , xn ) ∼ f (x|θ), θ = (θ1 , . . . , θn )
Associated likelihood
ℓ(θ|x) = f (x|θ)
[inverted density & starting point]
6. The 21st Bayesian Century
Introduction
Models
And [B] nonparametrics?!
Equally very active and definitely very 21st, thank you,
but not mentioned in this talk!
7th Workshop on Bayesian Nonparametrics - Collegio... http://bnpworkshop.carloalberto.org/
21 - 25 June 2009, Moncalieri
The 7th Workshop on Bayesian Nonparametrics will be held at
the Collegio Carlo Alberto from June 21 to 25, 2009. The Collegio is a
Research Institution housed in an historical building located in
Moncalieri on the outskirts of Turin, Italy.
The meeting will feature the latest developments in the area and will
cover a wide variety of both theoretical and applied topics such as:
foundations of the Bayesian nonparametric approach, construction
and properties of prior distributions, asymptotics, interplay with
probability theory and stochastic processes, statistical modelling,
computational algorithms and applications in machine learning,
biostatistics, bioinformatics, economics and econometrics.
The Workshop will be structured in 4 tutorials on special topics, a
series of invited talks and contributed posters sessions.
7. The 21st Bayesian Century
Introduction
The Bayesian framework
Bayes theorem 101
Bayes theorem = Inversion of probabilities
If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A)
are related by
$$P(A|E) = \frac{P(E|A)P(A)}{P(E|A)P(A) + P(E|A^c)P(A^c)} = \frac{P(E|A)P(A)}{P(E)}$$
[Thomas Bayes (?)]
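A small numerical check of the inversion formula may help. The function name and the three input probabilities below are illustrative assumptions, not values taken from the talk.

```python
def bayes_inversion(p_e_given_a, p_a, p_e_given_not_a):
    """Return P(A|E) from P(E|A), P(A) and P(E|A^c) via Bayes' theorem."""
    p_not_a = 1.0 - p_a
    # Total probability: P(E) = P(E|A)P(A) + P(E|A^c)P(A^c)
    p_e = p_e_given_a * p_a + p_e_given_not_a * p_not_a
    if p_e == 0.0:
        raise ValueError("P(E) must be nonzero for P(A|E) to be defined")
    return p_e_given_a * p_a / p_e


# Hypothetical inputs: a rare event A and a fairly reliable indicator E.
posterior = bayes_inversion(p_e_given_a=0.9, p_a=0.01, p_e_given_not_a=0.05)
print(round(posterior, 4))
```

Even with P(E|A) = 0.9, the small prior P(A) keeps P(A|E) well below one half, which is the usual reminder that P(A|E) and P(E|A) are different quantities.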
8. The 21st Bayesian Century
Introduction
The Bayesian framework
Bayesian approach
The impact of treating x as a fixed constant
is to increase statistical power as an artefact
Templeton, Molec. Ecol., 2009
New perspective
◮ Uncertainty on the parameters θ of a model modeled through
a probability distribution π on Θ, called prior distribution
◮ Inference based on the distribution of θ conditional on x,
π(θ|x), called posterior distribution
$$\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{\int f(x|\theta)\pi(\theta)\,d\theta}.$$
9. The 21st Bayesian Century
Introduction
The Bayesian framework
[Nonphilosophical] justifications
Ignoring the sampling error of x undermines
the statistical validity of all inferences made by the method
Templeton, Molec. Ecol., 2009
◮ Semantic drift from unknown to random
◮ Actualization of the information on θ by extracting the
information on θ contained in the observation x
◮ Allows incorporation of imperfect information in the decision
process
◮ Unique mathematical way to condition upon the observations
(conditional perspective)
◮ Unique way to give meaning to statements like P(θ > 0)
10. The 21st Bayesian Century
Introduction
The Bayesian framework
Posterior distribution
Bayesian methods are presented as an automatic inference engine,
and this raises suspicion in anyone with applied experience
Gelman, BA, 2008
π(θ|x) central to Bayesian inference
◮ Operates conditional upon the observations
◮ Incorporates the requirement of the Likelihood Principle
◮ Avoids averaging over the unobserved values of x
◮ Coherent updating of the information available on θ
◮ Provides a complete inferential machinery
11. The 21st Bayesian Century
Introduction
Improper prior distributions
Improper distributions
If we take P (dσ) ∝ dσ as a statement that σ may have any value
between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty.
Jeffreys, ToP, 1939
Necessary extension from a prior distribution to a prior σ-finite
measure π such that
$$\int_\Theta \pi(\theta)\,d\theta = +\infty$$
Improper prior distribution
[Weird? Inappropriate?? report!! ]
12. The 21st Bayesian Century
Introduction
Improper prior distributions
Justifications
If the parameter may have any value from −∞ to +∞,
its prior probability should be taken as uniformly distributed
Jeffreys, ToP, 1939
Automated prior determination often leads to improper priors
1. Similar performances of estimators derived from these
generalized distributions
2. Improper priors as limits of proper distributions in many
[mathematical] senses
13. The 21st Bayesian Century
Introduction
Improper prior distributions
More justifications
There is no good objective principle for choosing a noninformative prior
(even if that concept were mathematically defined, which it is not)
Gelman, BA, 2008
4. Robust answer against possible misspecifications of the prior
5. Frequentist justifications, such as:
(i) minimaxity
(ii) admissibility
(iii) invariance (Haar measure)
6. Improper priors [much] preferred to vague proper priors like
$N(0, 10^6)$
14. The 21st Bayesian Century
Introduction
Improper prior distributions
Validation
The mistake is to think of them as representing ignorance
Lindley, JASA, 1990
Extension of the posterior distribution π(θ|x) associated with an
improper prior π as given by Bayes’s formula
$$\pi(\theta|x) = \frac{f(x|\theta)\pi(\theta)}{\int_\Theta f(x|\theta)\pi(\theta)\,d\theta},$$
when
$$\int_\Theta f(x|\theta)\pi(\theta)\,d\theta < \infty$$
Delete all emotional names
15. The 21st Bayesian Century
Introduction
Noninformative prior distributions
Noninformative priors
...cannot be expected to represent exactly total ignorance about the
problem, but should rather be taken as reference priors, upon which
everyone could fall back when the prior information is missing.
Kass and Wasserman, JASA, 1996
What if all we know is that we know “nothing” ?!
In the absence of prior information, prior distributions solely
derived from the sample distribution f (x|θ)
Difficulty with uniform priors, lacking invariance properties.
16. The 21st Bayesian Century
Introduction
Noninformative prior distributions
Jeffreys’ prior
If we took the prior density for the parameters to be proportional to
$|I(\theta)|^{1/2}$, it could be stated for any law that is differentiable with respect
to all parameters that the total probability in any region of the $\theta_i$ would
be equal to the total probability in the corresponding region of the $\theta_i'$
Jeffreys, ToP, 1939
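Based on Fisher information
$$I(\theta) = \mathbb{E}_\theta\left[\frac{\partial \ell}{\partial \theta^T}\,\frac{\partial \ell}{\partial \theta}\right]$$
Jeffreys’ prior distribution is
$$\pi^*(\theta) \propto |I(\theta)|^{1/2}$$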
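As a worked illustration that is not part of the original slides, consider a single Bernoulli observation. The derivation reads ℓ as the log-likelihood, which is an assumption about the slide's notation.

```latex
\begin{align*}
\ell(\theta|x) &= x\log\theta + (1-x)\log(1-\theta),
  \qquad x\in\{0,1\},\ \theta\in(0,1),\\
\frac{\partial\ell}{\partial\theta}
  &= \frac{x}{\theta} - \frac{1-x}{1-\theta}
   = \frac{x-\theta}{\theta(1-\theta)},\\
I(\theta) &= \mathbb{E}_\theta\left[\left(\frac{\partial\ell}{\partial\theta}\right)^2\right]
   = \frac{\mathrm{Var}_\theta(x)}{\theta^2(1-\theta)^2}
   = \frac{1}{\theta(1-\theta)},\\
\pi^*(\theta) &\propto \theta^{-1/2}(1-\theta)^{-1/2}.
\end{align*}
```

In this case the Jeffreys prior is the Beta(1/2, 1/2) distribution, so it happens to be proper; for a normal mean with known variance the same recipe gives a constant $I(\theta)$ and hence the flat, improper prior.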
17. The 21st Bayesian Century
Tests and model choice
Tests and model choice
The Jeffreys-subjective synthesis betrays a much more dangerous
confusion than the Neyman-Pearson-Fisher synthesis as regards
hypothesis tests
Senn, BA, 2008
Introduction
Tests and model choice
Bayesian tests
Bayes factors
Opposition to classical tests
Model choice
Compatible priors
Variable selection
18. The 21st Bayesian Century
Tests and model choice
Bayesian tests
Construction of Bayes tests
What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008
Definition (Test)
Given an hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a
statistical model, a test is a statistical procedure that takes its
values in {0, 1}.
Example (Normal mean)
For x ∼ N (θ, 1), decide whether or not θ ≤ 0.
19. The 21st Bayesian Century
Tests and model choice
Bayesian tests
Decision-theoretic perspective
Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008
Theorem (Optimal Bayes decision)
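Under the 0 − 1 loss function
$$L(\theta, d) = \begin{cases} 0 & \text{if } d = \mathbb{I}_{\Theta_0}(\theta) \\ a_0 & \text{if } d = 1 \text{ and } \theta \notin \Theta_0 \\ a_1 & \text{if } d = 0 \text{ and } \theta \in \Theta_0 \end{cases}$$
the Bayes procedure is
$$\delta^\pi(x) = \begin{cases} 1 & \text{if } \Pr^\pi(\theta \in \Theta_0|x) \ge a_0/(a_0 + a_1) \\ 0 & \text{otherwise} \end{cases}$$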
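The decision rule can be sketched on the normal mean example, x ∼ N(θ, 1) with H0 : θ ≤ 0. The N(0, τ²) prior on θ, the function name and the numerical inputs are assumptions made only for this illustration; the talk does not specify a prior at this point.

```python
from statistics import NormalDist


def bayes_test(x, a0, a1, tau2=1.0):
    """Bayes test of H0: theta <= 0 for x ~ N(theta, 1), theta ~ N(0, tau2).

    Returns (decision, posterior probability of H0), with decision 1
    meaning that H0 is accepted.
    """
    # Conjugate update: theta | x ~ N(tau2 x / (1 + tau2), tau2 / (1 + tau2))
    post_mean = tau2 * x / (1.0 + tau2)
    post_sd = (tau2 / (1.0 + tau2)) ** 0.5
    p_h0 = NormalDist(post_mean, post_sd).cdf(0.0)
    # Accept H0 when its posterior probability reaches a0 / (a0 + a1)
    decision = 1 if p_h0 >= a0 / (a0 + a1) else 0
    return decision, p_h0


decision, p_h0 = bayes_test(x=2.0, a0=1.0, a1=1.0)
print(decision, round(p_h0, 4))
```

Raising a0 (the penalty for wrongly accepting H0) raises the threshold a0/(a0 + a1), so H0 is accepted less often, which is the only way the losses enter the procedure.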
20. The 21st Bayesian Century
Tests and model choice
Bayes factors
A function of posterior probabilities
The method posits two or more alternative hypotheses and tests their
relative fits to some observed statistics
Templeton, Mol. Ecol., 2009
Definition (Bayes factors)
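For hypotheses $H_0: \theta \in \Theta_0$ vs. $H_a: \theta \notin \Theta_0$
$$B_{01} = \frac{\pi(\Theta_0|x)}{\pi(\Theta_0^c|x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_0^c)} = \frac{\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\,d\theta}{\int_{\Theta_0^c} f(x|\theta)\pi_1(\theta)\,d\theta}$$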
[Good, 1958 & Jeffreys, 1961]
21. The 21st Bayesian Century
Tests and model choice
Bayes factors
Self-contained concept
Having a high relative probability does not mean that a hypothesis is true
or supported by the data
Templeton, Mol. Ecol., 2009
Non-decision-theoretic:
◮ eliminates choice of π(Θ0 )
◮ Bayesian/marginal equivalent to the likelihood ratio
◮ Jeffreys’ scale of evidence:
◮ if $\log_{10}(B^\pi_{10})$ between 0 and 0.5, evidence against H0 weak,
◮ if $\log_{10}(B^\pi_{10})$ between 0.5 and 1, evidence substantial,
◮ if $\log_{10}(B^\pi_{10})$ between 1 and 2, evidence strong and
◮ if $\log_{10}(B^\pi_{10})$ above 2, evidence decisive
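The scale is easy to encode. In the sketch below the function name is hypothetical, the handling of the boundary values (0.5, 1, 2) is an arbitrary choice, and the label returned for a negative log Bayes factor is an addition, since the scale on the slide only covers evidence against H0.

```python
import math


def jeffreys_scale(b10):
    """Classify a Bayes factor B10 on Jeffreys' scale of evidence against H0."""
    if b10 <= 0.0:
        raise ValueError("a Bayes factor must be positive")
    log_b = math.log10(b10)
    if log_b < 0.0:
        # Not on the slide: the data favour H0 rather than the alternative.
        return "no evidence against H0"
    if log_b < 0.5:
        return "weak"
    if log_b < 1.0:
        return "substantial"
    if log_b < 2.0:
        return "strong"
    return "decisive"


for b10 in (2.0, 5.0, 50.0, 1000.0):
    print(b10, jeffreys_scale(b10))
```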
22. The 21st Bayesian Century
Tests and model choice
Bayes factors
A major modification
Considering whether a location parameter α is 0. The prior is uniform
and we should have to take f (α) = 0 and B10 would always be infinite
Jeffreys, ToP, 1939
When the null hypothesis is supported by a set of measure 0,
π(Θ0 ) = 0 and thus π(Θ0 |x) = 0.
[End of the story?!]
23. The 21st Bayesian Century
Tests and model choice
Bayes factors
Changing the prior to fit the hypotheses
Requirement
Defined prior distributions under both assumptions,
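$$\pi_0(\theta) \propto \pi(\theta)\mathbb{I}_{\Theta_0}(\theta), \qquad \pi_1(\theta) \propto \pi(\theta)\mathbb{I}_{\Theta_1}(\theta),$$
(under the standard dominating measures on Θ0 and Θ1)
Using the prior probabilities $\pi(\Theta_0) = \varrho_0$ and $\pi(\Theta_1) = \varrho_1$,
$$\pi(\theta) = \varrho_0\pi_0(\theta) + \varrho_1\pi_1(\theta).$$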
24. The 21st Bayesian Century
Tests and model choice
Bayes factors
Point null hypotheses
I have no patience for statistical methods that assign positive probability
to point hypotheses of the θ = 0 type that can never actually be true
Gelman, BA, 2008
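Take $\rho_0 = \Pr^\pi(\theta = \theta_0)$ and $g_1$ prior density under $H_a$, with marginal $m_1(x) = \int f(x|\theta)g_1(\theta)\,d\theta$. Then
$$\pi(\Theta_0|x) = \frac{f(x|\theta_0)\rho_0}{\int f(x|\theta)\pi(\theta)\,d\theta} = \frac{f(x|\theta_0)\rho_0}{f(x|\theta_0)\rho_0 + (1-\rho_0)m_1(x)}$$
and Bayes factor
$$B^\pi_{01}(x) = \frac{f(x|\theta_0)\rho_0}{m_1(x)(1-\rho_0)} \bigg/ \frac{\rho_0}{1-\rho_0} = \frac{f(x|\theta_0)}{m_1(x)}$$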
25. The 21st Bayesian Century
Tests and model choice
Bayes factors
Point null hypotheses (cont’d)
Example (Normal mean)
Test of H0 : θ = 0 when x ∼ N(θ, 1) (so σ² = 1 below): we take π1 as N(0, τ²)
$$\frac{m_1(x)}{f(x|0)} = \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\,\exp\left\{\frac{\tau^2x^2}{2\sigma^2(\sigma^2+\tau^2)}\right\}$$
and the posterior probability is
τ² \ x   0      0.68   1.28   1.96
1        0.586  0.557  0.484  0.351
10       0.768  0.729  0.612  0.366
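The table can be recomputed from the point-null formulas. The sketch below assumes ρ0 = 1/2, which the slide does not state, and reads the row label as τ²; both assumptions are supported by the fact that they reproduce the reported values to three decimals. The function name is hypothetical.

```python
import math


def posterior_prob_null(x, tau2, sigma2=1.0, rho0=0.5):
    """Posterior probability of H0: theta = 0 for x ~ N(theta, sigma2),
    with prior weight rho0 on the null and theta ~ N(0, tau2) otherwise."""
    # m1(x) / f(x|0), the inverse of the Bayes factor B01
    ratio = math.sqrt(sigma2 / (sigma2 + tau2)) * math.exp(
        tau2 * x * x / (2.0 * sigma2 * (sigma2 + tau2))
    )
    return 1.0 / (1.0 + (1.0 - rho0) / rho0 * ratio)


xs = (0.0, 0.68, 1.28, 1.96)
table = {tau2: [posterior_prob_null(x, tau2) for x in xs] for tau2 in (1.0, 10.0)}
for tau2, row in table.items():
    print(tau2, [round(p, 3) for p in row])
```

Even at x = 1.96, where the two-sided p-value is 0.05, the posterior probability of the null stays above one third for both prior variances.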
26. The 21st Bayesian Century
Tests and model choice
Opposition to classical tests
Comparison with classical tests
The 95 percent frequentist intervals will live up to their advertised
coverage claims
Wasserman, BA, 2008
Standard answer
Definition (p-value)
The p-value p(x) associated with a test is the largest significance
level for which H0 is rejected
27. The 21st Bayesian Century
Tests and model choice
Opposition to classical tests
Problems with p-values
The use of P implies that a hypothesis that may be true may be rejected
because it had not predicted observable results that have not occurred
Jeffreys, ToP, 1939
◮ Evaluation of the wrong quantity, namely the probability of
exceeding the observed quantity (wrong conditioning)
◮ Evaluation only under the null hypothesis
◮ Huge numerical difference with the Bayesian range of answers
28. The 21st Bayesian Century
Tests and model choice
Opposition to classical tests
Bayesian lower bounds
If the Bayes estimator has good frequency behavior
then we might as well use the frequentist method.
If it has bad frequency behavior then we shouldn’t use it.
Wasserman, BA, 2008
Least favourable Bayesian answer is
$$B(x, G_A) = \inf_{g\in G_A} \frac{f(x|\theta_0)}{\int_\Theta f(x|\theta)g(\theta)\,d\theta},$$
i.e., if there exists a mle for θ, $\hat\theta(x)$,
$$B(x, G_A) = \frac{f(x|\theta_0)}{f(x|\hat\theta(x))}$$
29. The 21st Bayesian Century
Tests and model choice
Opposition to classical tests
Illustration
Example (Normal case)
When x ∼ N(θ, 1) and H0 : θ0 = 0, the lower bounds are
$$B(x, G_A) = e^{-x^2/2} \quad\text{and}\quad P(x, G_A) = \left(1 + e^{x^2/2}\right)^{-1},$$
i.e.
p-value  0.10   0.05   0.01   0.001
P        0.205  0.128  0.035  0.004
B        0.256  0.146  0.036  0.004
[Quite different!]
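The comparison can be reproduced by mapping each p-value to the corresponding observation. The sketch assumes a two-sided p-value, so that x is the 1 − p/2 standard normal quantile; that convention is not stated on the slide, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def lower_bounds(p_value):
    """Lower bounds on the posterior probability P and the Bayes factor B
    of H0: theta = 0 when x ~ N(theta, 1), given a two-sided p-value."""
    x = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    b = math.exp(-x * x / 2.0)
    p = 1.0 / (1.0 + math.exp(x * x / 2.0))
    return p, b


for pv in (0.10, 0.05, 0.01, 0.001):
    p, b = lower_bounds(pv)
    print(pv, round(p, 3), round(b, 3))
```

The values agree with the table up to the rounding of the quantile used for each p-value.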
30. The 21st Bayesian Century
Tests and model choice
Model choice
Model choice and model comparison
There is no null hypothesis, which complicates the computation of
sampling error
Templeton, Mol. Ecol., 2009
Choice among models
Several models available for the same observation(s)
Mi : x ∼ fi (x|θi ), i∈I
where I can be finite or infinite
31. The 21st Bayesian Century
Tests and model choice
Model choice
Bayesian resolution
The posterior probabilities are constructed by using a numerator that is a
function of the observation for a particular model, then divided by a
denominator that ensures that the ”probabilities” sum to one
Templeton, Mol. Ecol., 2009
Probabilise the entire model/parameter space
◮ allocate probabilities pi to all models Mi
◮ define priors πi (θi ) for each parameter space Θi
◮ compute
$$\pi(M_i|x) = \frac{p_i \int_{\Theta_i} f_i(x|\theta_i)\pi_i(\theta_i)\,d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x|\theta_j)\pi_j(\theta_j)\,d\theta_j}$$
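Once the marginal likelihoods are available, the last step is a plain normalisation. The sketch below uses two hypothetical models for a single observation x ∼ N(θ, 1), namely M0 with θ = 0 and M1 with θ ∼ N(0, 1), for which both marginals are known in closed form; the models, weights and names are assumptions for illustration.

```python
from statistics import NormalDist


def posterior_model_probs(prior_probs, evidences):
    """Return pi(M_i | x) from prior weights p_i and marginal likelihoods m_i(x)."""
    weights = [p * m for p, m in zip(prior_probs, evidences)]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one model needs a positive weight")
    return [w / total for w in weights]


x = 1.96
m0 = NormalDist(0.0, 1.0).pdf(x)         # M0: theta = 0, so m0(x) = N(x; 0, 1)
m1 = NormalDist(0.0, 2.0 ** 0.5).pdf(x)  # M1: theta ~ N(0, 1), so m1(x) = N(x; 0, 2)
probs = posterior_model_probs([0.5, 0.5], [m0, m1])
print([round(p, 3) for p in probs])
```

With equal prior weights, the probability of M0 coincides with the point-null posterior probability for a unit-variance prior at x = 1.96, since testing a point null is the two-model special case.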
32. The 21st Bayesian Century
Tests and model choice
Model choice
Bayesian resolution(2)
The numerators are not co-measurable across hypotheses, and the
denominators are sums of non-co-measurable entities. This means that it
is mathematically impossible for them to be probabilities.
Templeton, Mol. Ecol., 2009
◮ take largest π(Mi |x) to determine “best” model,
or use averaged predictive
$$\sum_j \pi(M_j|x) \int_{\Theta_j} f_j(x'|\theta_j)\pi_j(\theta_j|x)\,d\theta_j$$
33. The 21st Bayesian Century
Tests and model choice
Model choice
Natural Ockham’s razor
Pluralitas non est ponenda sine necessitate
Variation is random until the
contrary is shown; and new
parameters in laws, when they
are suggested, must be tested
one at a time, unless there is
specific reason to the contrary.
Jeffreys, ToP, 1939
The Bayesian approach naturally weights differently models with
different parameter dimensions (BIC).
34. The 21st Bayesian Century
Tests and model choice
Compatible priors
Compatibility principle
Further complicating dimensionality of test statistics is the fact that the
models are often not nested, and one model may contain parameters that
do not have analogues in the other models and vice versa
Templeton, Mol. Ecol., 2009
Difficulty of finding simultaneously priors on a collection of models
Easier to start from a single prior on a “big” [encompassing] model
and to derive others from a coherence principle
[Dawid & Lauritzen, 2000]
35. The 21st Bayesian Century
Tests and model choice
Compatible priors
An illustration for linear regression
In the case M1 and M2 are two nested Gaussian linear regression
models with Zellner’s g-priors and the same variance σ 2 ∼ π(σ 2 ):
◮ $M_1: y|\beta_1, \sigma^2 \sim N(X_1\beta_1, \sigma^2)$ with
$$\beta_1|\sigma^2 \sim N\left(s_1, \sigma^2 n_1 (X_1^TX_1)^{-1}\right)$$
where X1 is a (n × k1) matrix of rank k1 ≤ n
◮ $M_2: y|\beta_2, \sigma^2 \sim N(X_2\beta_2, \sigma^2)$ with
$$\beta_2|\sigma^2 \sim N\left(s_2, \sigma^2 n_2 (X_2^TX_2)^{-1}\right),$$
where X2 is a (n × k2) matrix with span(X2) ⊆ span(X1)
[© Marin & Robert, Bayesian Core]
36. The 21st Bayesian Century
Tests and model choice
Compatible priors
Compatible g-priors
I don’t see any role for squared error loss, minimax, or the rest of what is
sometimes called statistical decision theory
Gelman, BA, 2008
Since σ² is a nuisance parameter, minimize the Kullback-Leibler
divergence between both marginal distributions conditional on σ²:
$m_1(y|\sigma^2; s_1, n_1)$ and $m_2(y|\sigma^2; s_2, n_2)$, with solution
$$\beta_2|X_2, \sigma^2 \sim N\left(s_2^*, \sigma^2 n_2^* (X_2^TX_2)^{-1}\right)$$
with
$$s_2^* = (X_2^TX_2)^{-1}X_2^TX_1s_1, \qquad n_2^* = n_1$$
37. The 21st Bayesian Century
Tests and model choice
Variable selection
Variable selection
Regression setup where y regressed on a
set {x1 , . . . , xp } of p potential
explanatory regressors (plus intercept)
Corresponding $2^p$ submodels Mγ, where
$\gamma \in \Gamma = \{0, 1\}^p$ indicates
inclusion/exclusion of variables by a
binary representation,
e.g. γ = 101001011 means that x1, x3,
x6, x8 and x9 are included.
38. The 21st Bayesian Century
Tests and model choice
Variable selection
Notations
For model Mγ ,
◮ qγ variables included
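◮ $t_1(\gamma) = \{t_{1,1}(\gamma), \ldots, t_{1,q_\gamma}(\gamma)\}$ indices of those variables and
$t_0(\gamma)$ indices of the variables not included
◮ For $\beta \in \mathbb{R}^{p+1}$,
$$\beta_{t_1(\gamma)} = \left[\beta_0, \beta_{t_{1,1}(\gamma)}, \ldots, \beta_{t_{1,q_\gamma}(\gamma)}\right]$$
$$X_{t_1(\gamma)} = \left[1_n | x_{t_{1,1}(\gamma)} | \ldots | x_{t_{1,q_\gamma}(\gamma)}\right].$$
Submodel Mγ is thus
$$y|\beta, \gamma, \sigma^2 \sim N\left(X_{t_1(\gamma)}\beta_{t_1(\gamma)}, \sigma^2 I_n\right)$$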
39. The 21st Bayesian Century
Tests and model choice
Variable selection
Global and compatible priors
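Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ²,
$$\beta|\sigma^2 \sim N(\tilde\beta, c\sigma^2(X^TX)^{-1})$$
and a Jeffreys prior for σ²,
$$\pi(\sigma^2) \propto \sigma^{-2}$$
Resulting compatible prior
$$\beta_{t_1(\gamma)} \sim N\left(\left(X_{t_1(\gamma)}^TX_{t_1(\gamma)}\right)^{-1}X_{t_1(\gamma)}^TX\tilde\beta,\ c\sigma^2\left(X_{t_1(\gamma)}^TX_{t_1(\gamma)}\right)^{-1}\right)$$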
T
40. The 21st Bayesian Century
Tests and model choice
Variable selection
Posterior model probability
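Can be obtained in closed form:
$$\pi(\gamma|y) \propto (c+1)^{-(q_\gamma+1)/2}\left[y^Ty - \frac{c\,y^TP_1y}{c+1} + \frac{\tilde\beta^TX^TP_1X\tilde\beta}{c+1} - \frac{2y^TP_1X\tilde\beta}{c+1}\right]^{-n/2}.$$
Conditionally on γ, posterior distributions of β and σ²:
$$\beta_{t_1(\gamma)}|\sigma^2, y, \gamma \sim N\left(\frac{c}{c+1}\left(U_1y + U_1X\tilde\beta/c\right),\ \frac{\sigma^2c}{c+1}\left(X_{t_1(\gamma)}^TX_{t_1(\gamma)}\right)^{-1}\right),$$
$$\sigma^2|y, \gamma \sim IG\left(\frac{n}{2},\ \frac{y^Ty}{2} - \frac{c\,y^TP_1y}{2(c+1)} + \frac{\tilde\beta^TX^TP_1X\tilde\beta}{2(c+1)} - \frac{y^TP_1X\tilde\beta}{c+1}\right).$$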
41. The 21st Bayesian Century
Tests and model choice
Variable selection
Noninformative case
Use the same compatible informative g-prior distribution with
$\tilde\beta = 0_{p+1}$ and a hierarchical diffuse prior distribution on c,
$$\pi(c) \propto c^{-1}\mathbb{I}_{\mathbb{N}^*}(c) \quad\text{or}\quad \pi(c) \propto c^{-1}\mathbb{I}_{c>0}$$
The choice of this hierarchical diffuse prior distribution on c is due
to the model posterior sensitivity to large values of c:
Taking $\tilde\beta = 0_{p+1}$ and c large does not work
42. The 21st Bayesian Century
Tests and model choice
Variable selection
Processionary caterpillar
Influence of some forest settlement characteristics on the
development of caterpillar colonies
Response y log-transform of the average number of nests of
caterpillars per tree on an area of 500 square meters (n = 33 areas)
[© Marin & Robert, Bayesian Core]
43. The 21st Bayesian Century
Tests and model choice
Variable selection
Processionary caterpillar (cont’d)
Potential explanatory variables
x1 altitude (in meters),
x2 slope (in degrees),
x3 number of pines in the square,
x4 height (in meters) of the tree at the center of the square,
x5 diameter of the tree at the center of the square,
x6 index of the settlement density,
x7 orientation of the square (from 1 if southbound to 2 otherwise),
x8 height (in meters) of the dominant tree,
x9 number of vegetation strata,
x10 mix settlement index (from 1 if not mixed to 2 if mixed).
45. The 21st Bayesian Century
Tests and model choice
Variable selection
Bayesian variable selection
t1 (γ) π(γ|y, X)
0,1,2,4,5 0.0929
0,1,2,4,5,9 0.0325
0,1,2,4,5,10 0.0295
0,1,2,4,5,7 0.0231
0,1,2,4,5,8 0.0228
0,1,2,4,5,6 0.0228
0,1,2,3,4,5 0.0224
0,1,2,3,4,5,9 0.0167
0,1,2,4,5,6,9 0.0167
0,1,2,4,5,8,9 0.0137
46. The 21st Bayesian Century
Bayesian Calculations
Bayesian Calculations
Bayesian methods seem to quickly move to elaborate computation
Gelman, BA, 2008
Introduction
Tests and model choice
Bayesian Calculations
Implementation difficulties
Bayes factor approximation
ABC model choice
A Defense of the Bayesian Choice
47. The 21st Bayesian Century
Bayesian Calculations
Implementation difficulties
B Implementation difficulties
◮ Computing the posterior distribution
π(θ|x) ∝ π(θ)f (x|θ)
◮ Resolution of
$$\arg\min \int_\Theta L(\theta, \delta)\pi(\theta)f(x|\theta)\,d\theta$$
◮ Maximisation of the marginal posterior
$$\arg\max \int_{\Theta_{-1}} \pi(\theta|x)\,d\theta_{-1}$$
48. The 21st Bayesian Century
Bayesian Calculations
Implementation difficulties
B Implementation further difficulties
A statistical test returns a probability value, but rarely is the probability
value per se the reason for an investigator performing the test
Templeton, Mol. Ecol., 2009
◮ Computing posterior quantities
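$$\delta^\pi(x) = \int_\Theta h(\theta)\pi(\theta|x)\,d\theta = \frac{\int_\Theta h(\theta)\pi(\theta)f(x|\theta)\,d\theta}{\int_\Theta \pi(\theta)f(x|\theta)\,d\theta}$$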
◮ Resolution (in k) of
P (π(θ|x) ≥ k|x) = α
49. The 21st Bayesian Century
Bayesian Calculations
Implementation difficulties
Monte Carlo methods
Bayesian simulation seems stuck in an infinite regress of inferential
uncertainty
Gelman, BA, 2008
Approximation of
$$I = \int_\Theta g(\theta)f(x|\theta)\pi(\theta)\,d\theta,$$
takes advantage of the fact that f(x|θ)π(θ) is proportional to a
density: If the θi’s are from π(θ),
$$\frac{1}{m}\sum_{i=1}^m g(\theta_i)f(x|\theta_i)$$
converges (almost surely) to I
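A minimal sketch of this estimator, on a toy model chosen only because the answer is known: x ∼ N(θ, 1) with prior θ ∼ N(0, 1) and g ≡ 1, so that I is the marginal density N(x; 0, 2). The model, the seed and the function names are assumptions for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mc_integral(x, g, m, rng):
    """Average g(theta_i) f(x|theta_i) over m draws theta_i from the prior N(0, 1)."""
    total = 0.0
    for _ in range(m):
        theta = rng.gauss(0.0, 1.0)                    # theta_i ~ pi
        total += g(theta) * normal_pdf(x, theta, 1.0)  # g(theta_i) f(x|theta_i)
    return total / m


rng = random.Random(0)
estimate = mc_integral(1.0, lambda theta: 1.0, 100_000, rng)
exact = normal_pdf(1.0, 0.0, 2.0)  # closed-form marginal m(x) for this toy model
print(round(estimate, 4), round(exact, 4))
```

Sampling from the prior is convenient but can be wasteful when the likelihood is concentrated far from the bulk of the prior, since most draws then contribute almost nothing to the average.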
50. The 21st Bayesian Century
Bayesian Calculations
Implementation difficulties
Importance function
A simulation method of inference hides unrealistic assumptions
Templeton, Mol. Ecol., 2009
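No need to simulate from π(·|x) or from π: if h is a probability
density,
$$\int_\Theta g(\theta)f(x|\theta)\pi(\theta)\,d\theta = \int_\Theta \frac{g(\theta)f(x|\theta)\pi(\theta)}{h(\theta)}\,h(\theta)\,d\theta$$
and
$$\frac{\sum_{i=1}^m g(\theta_i)\omega(\theta_i)}{\sum_{i=1}^m \omega(\theta_i)} \quad\text{with}\quad \omega(\theta_i) = \frac{f(x|\theta_i)\pi(\theta_i)}{h(\theta_i)}$$
approximates $\mathbb{E}^\pi[g(\theta)|x]$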
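The self-normalised estimator can be sketched on the same kind of toy model, x ∼ N(θ, 1) with θ ∼ N(0, 1), whose posterior mean is x/2. The importance function h = N(0, 2²), the seed and the function names are assumptions made for the illustration; h is deliberately given heavier tails than the posterior so that the weights stay bounded.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def snis_posterior_mean(x, m, rng, scale=2.0):
    """Self-normalised importance sampling estimate of E[theta | x]
    for x ~ N(theta, 1), theta ~ N(0, 1), with draws from h = N(0, scale^2)."""
    num = 0.0
    den = 0.0
    for _ in range(m):
        theta = rng.gauss(0.0, scale)  # theta_i ~ h
        # omega(theta_i) = f(x|theta_i) pi(theta_i) / h(theta_i)
        w = (normal_pdf(x, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)
             / normal_pdf(theta, 0.0, scale ** 2))
        num += theta * w               # g(theta) = theta
        den += w
    return num / den


rng = random.Random(1)
post_mean = snis_posterior_mean(1.0, 100_000, rng)
print(round(post_mean, 3))
```

Because the same weights appear in numerator and denominator, the unknown normalising constant of the posterior cancels, which is why only f(x|θ)π(θ) needs to be evaluated.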
51. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Bayes factor approximation
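When approximating the Bayes factor
$$B_{12} = \frac{\int_{\Theta_1} f_1(x|\theta_1)\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(x|\theta_2)\pi_2(\theta_2)\,d\theta_2} = \frac{Z_1}{Z_2}$$
use of importance functions $\varpi_1$ and $\varpi_2$ and
$$\widehat{B}_{12} = \frac{n_1^{-1}\sum_{i=1}^{n_1} f_1(x|\theta_1^i)\pi_1(\theta_1^i)/\varpi_1(\theta_1^i)}{n_2^{-1}\sum_{i=1}^{n_2} f_2(x|\theta_2^i)\pi_2(\theta_2^i)/\varpi_2(\theta_2^i)}, \qquad \theta_j^i \sim \varpi_j(\theta)$$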
[Chopin & Robert, 2007]
52. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Bridge sampling
Special case:
If
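$$\tilde\pi_1(\theta_1|x) \propto \pi_1(\theta_1|x), \qquad \tilde\pi_2(\theta_2|x) \propto \pi_2(\theta_2|x)$$
live on the same space (Θ1 = Θ2), then
$$B_{12} \approx \frac{1}{n}\sum_{i=1}^n \frac{\tilde\pi_1(\theta_i|x)}{\tilde\pi_2(\theta_i|x)}, \qquad \theta_i \sim \pi_2(\theta|x)$$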
[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]
54. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Optimal bridge sampling
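The optimal choice of auxiliary function is
$$\alpha^\star(\theta) = \frac{n_1 + n_2}{n_1\pi_1(\theta|x) + n_2\pi_2(\theta|x)}$$
leading to
$$B_{12} \approx \frac{\dfrac{1}{n_2}\displaystyle\sum_{i=1}^{n_2} \frac{\tilde\pi_1(\theta_{2i}|x)}{n_1\pi_1(\theta_{2i}|x) + n_2\pi_2(\theta_{2i}|x)}}{\dfrac{1}{n_1}\displaystyle\sum_{i=1}^{n_1} \frac{\tilde\pi_2(\theta_{1i}|x)}{n_1\pi_1(\theta_{1i}|x) + n_2\pi_2(\theta_{1i}|x)}}, \qquad \theta_{ji} \sim \pi_j(\theta|x)$$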
55. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Approximating Zk from a posterior sample
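Use of the [harmonic mean] identity
$$\mathbb{E}^{\pi_k}\left[\frac{\varphi(\theta_k)}{\pi_k(\theta_k)L_k(\theta_k)}\,\bigg|\,x\right] = \int \frac{\varphi(\theta_k)}{\pi_k(\theta_k)L_k(\theta_k)}\,\frac{\pi_k(\theta_k)L_k(\theta_k)}{Z_k}\,d\theta_k = \frac{1}{Z_k}$$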
no matter what the proposal ϕ(·) is.
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Direct exploitation of the MCMC output
56. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Comparison with regular importance sampling
Harmonic mean: Constraint opposed to usual importance sampling
constraints: ϕ(θ) must have lighter (rather than fatter) tails than
πk(θk)Lk(θk) for the approximation
$$\widehat{Z}_{1k} = 1 \bigg/ \frac{1}{T}\sum_{t=1}^T \frac{\varphi(\theta_k^{(t)})}{\pi_k(\theta_k^{(t)})L_k(\theta_k^{(t)})}$$
to have a finite variance.
E.g., use finite support kernels (like Epanechnikov’s kernel) for ϕ
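A sketch of this estimator on a toy conjugate model, x ∼ N(θ, 1) with θ ∼ N(0, 1), for which Z = N(x; 0, 2) is known. Two simplifications are assumptions of the example rather than recommendations from the talk: exact posterior draws stand in for MCMC output, and ϕ is a Gaussian that is narrower than the posterior instead of a finite-support kernel.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def harmonic_evidence(x, T, rng):
    """Estimate Z = m(x) from posterior draws via E[phi / (pi L) | x] = 1 / Z."""
    total = 0.0
    for _ in range(T):
        # Stand-in for MCMC output: the posterior is N(x/2, 1/2) in this model.
        theta = rng.gauss(x / 2.0, math.sqrt(0.5))
        unnorm_post = normal_pdf(theta, 0.0, 1.0) * normal_pdf(x, theta, 1.0)  # pi L
        phi = normal_pdf(theta, x / 2.0, 0.25)  # lighter tails than the posterior
        total += phi / unnorm_post
    return 1.0 / (total / T)


rng = random.Random(2)
z_hat = harmonic_evidence(1.0, 50_000, rng)
z_exact = normal_pdf(1.0, 0.0, 2.0)
print(round(z_hat, 4), round(z_exact, 4))
```

Replacing ϕ by the prior would give the original harmonic mean estimator, whose ratio ϕ/(πL) is unbounded in the tails here and typically has infinite variance.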
57. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Approximating Z using a mixture representation
Bridge sampling redux
Design a specific mixture for simulation [importance sampling]
purposes, with density
ϕk (θk ) ∝ ω1 πk (θk )Lk (θk ) + ϕ(θk ) ,
where ϕ(·) is arbitrary (but normalised)
Note: ω1 is not a probability weight
58. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Approximating Z using a mixture representation (cont’d)
Corresponding MCMC (=Gibbs) sampler
At iteration t
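1. Take $\delta^{(t)} = 1$ with probability
$$\omega_1\pi_k(\theta_k^{(t-1)})L_k(\theta_k^{(t-1)}) \Big/ \left(\omega_1\pi_k(\theta_k^{(t-1)})L_k(\theta_k^{(t-1)}) + \varphi(\theta_k^{(t-1)})\right)$$
and $\delta^{(t)} = 2$ otherwise;
2. If $\delta^{(t)} = 1$, generate $\theta_k^{(t)} \sim \mathrm{MCMC}(\theta_k^{(t-1)}, \theta_k)$ where
$\mathrm{MCMC}(\theta_k, \theta_k')$ denotes an arbitrary MCMC kernel associated
with the posterior $\pi_k(\theta_k|x) \propto \pi_k(\theta_k)L_k(\theta_k)$;
3. If $\delta^{(t)} = 2$, generate $\theta_k^{(t)} \sim \varphi(\theta_k)$ independently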
60. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Chib’s representation
Direct application of Bayes’ theorem: given
x ∼ fk(x|θk) and θk ∼ πk(θk),
$$Z_k = m_k(x) = \frac{f_k(x|\theta_k)\,\pi_k(\theta_k)}{\pi_k(\theta_k|x)}$$
Use of an approximation to the posterior
$$\widehat{Z}_k = \widehat{m}_k(x) = \frac{f_k(x|\theta_k^*)\,\pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^*|x)}.$$
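The identity behind this representation can be checked numerically on a conjugate toy model, x ∼ N(θ, 1) with θ ∼ N(0, 1), where the posterior N(x/2, 1/2) is available exactly. Using the exact posterior in place of an approximation is an assumption of the example, as are the function names; in practice the denominator is what has to be estimated.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def chib_evidence(x, theta_star):
    """Evaluate f(x|theta*) pi(theta*) / pi(theta*|x) at a chosen point theta*."""
    likelihood = normal_pdf(x, theta_star, 1.0)       # f(x | theta*)
    prior = normal_pdf(theta_star, 0.0, 1.0)          # pi(theta*)
    posterior = normal_pdf(theta_star, x / 2.0, 0.5)  # pi(theta* | x), exact here
    return likelihood * prior / posterior


x = 1.0
exact = normal_pdf(x, 0.0, 2.0)  # closed-form marginal m(x)
values = [chib_evidence(x, t) for t in (-1.0, 0.0, 0.5, 2.0)]
print([round(v, 6) for v in values], round(exact, 6))
```

The ratio is the same for every θ*, so with an estimated posterior density the point is usually chosen in a high-density region, where that estimate is most accurate.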
61. The 21st Bayesian Century
Bayesian Calculations
Bayes factor approximation
Case of latent variables
For missing variable z as in mixture models, natural Rao-Blackwell
estimate
$$\hat\pi_k(\theta_k^*|x) = \frac{1}{T}\sum_{t=1}^T \pi_k(\theta_k^*|x, z_k^{(t)}),$$
where the $z_k^{(t)}$’s are Gibbs sampled latent variables
62. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Approximate Bayesian Computation
Simulation target is π(θ)f (x|θ) with likelihood f (x|θ) not in
closed form.
Likelihood-free rejection technique:
ABC algorithm
For an observation y ∼ f (y|θ), under the prior π(θ), keep jointly
simulating
θ′ ∼ π(θ) , x ∼ f (x|θ′ ) ,
until the auxiliary variable x is equal to the observed value, x = y.
[Pritchard et al., 1999]
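A minimal sketch of this likelihood-free rejection sampler, on a discrete toy model where exact matching is feasible: y ∼ Binomial(n, θ) with a uniform prior on θ, so that the accepted values should follow the Beta(y + 1, n − y + 1) posterior. The model, the data, the seed and the function name are assumptions for illustration.

```python
import random


def abc_rejection(y, n, n_keep, rng):
    """Keep prior draws theta' whose simulated data x equal the observation y."""
    kept = []
    while len(kept) < n_keep:
        theta = rng.random()                             # theta' ~ pi = U(0, 1)
        x = sum(rng.random() < theta for _ in range(n))  # x ~ Binomial(n, theta')
        if x == y:                                       # exact match required
            kept.append(theta)
    return kept


rng = random.Random(42)
sample = abc_rejection(y=3, n=10, n_keep=5000, rng=rng)
post_mean = sum(sample) / len(sample)
print(round(post_mean, 3))  # to be compared with (y + 1) / (n + 2) = 1/3
```

The likelihood is never evaluated, only simulated from. The price is the acceptance rate, which here is 1/(n + 1) per proposal and collapses as the data space grows.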
63. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
A as approximative
When y is a continuous random variable, equality x = y is replaced
with a tolerance condition,
$$\varrho(x, y) \le \epsilon$$
where ϱ is a distance between summary statistics
Output distributed from
$$\pi(\theta)\,P_\theta\{\varrho(x, y) < \epsilon\} \propto \pi(\theta|\varrho(x, y) < \epsilon)$$
64. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Gibbs random fields
Gibbs distribution
The rv y = (y1, . . . , yn) is a Gibbs random field associated with
the graph G if
$$f(y) = \frac{1}{Z}\exp\left\{-\sum_{c\in C} V_c(y_c)\right\},$$
where Z is the normalising constant, C is the set of cliques of G
and Vc is any function also called potential
$U(y) = \sum_{c\in C} V_c(y_c)$ is the energy function
Z is usually unavailable in closed form
65. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Potts model
Potts model
Vc(y) is of the form
$$V_c(y) = \theta S(y) = \theta\sum_{l\sim i} \delta_{y_l = y_i}$$
where l∼i denotes a neighbourhood structure
In most realistic settings, summation
$$Z_\theta = \sum_{x\in\mathcal{X}} \exp\{\theta^TS(x)\}$$
involves too many terms to be manageable and numerical
approximations cannot always be trusted
[Cucala, Marin, CPR & Titterington, JASA, 2009]
66. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Neighbourhood relations
Choice to be made between M neighbourhood relations
$$i \stackrel{m}{\sim} i' \qquad (0 \le m \le M - 1)$$
with
$$S_m(x) = \sum_{i \stackrel{m}{\sim} i'} \mathbb{I}_{\{x_i = x_{i'}\}}$$
driven by the posterior probabilities of the models.
67. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Model index
Formalisation via a model index M, new parameter with prior
distribution π(M = m) and π(θ|M = m) = πm (θm )
Computational target:
$$P(M = m|x) \propto \int_{\Theta_m} f_m(x|\theta_m)\pi_m(\theta_m)\,d\theta_m\ \pi(M = m)$$
68. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Sufficient statistics
If S(x) sufficient statistic for the joint parameters
(M, θ0 , . . . , θM −1 ),
P(M = m|x) = P(M = m|S(x)) .
For each model m, sufficient statistic Sm (·) makes
S(·) = (S0 (·), . . . , SM −1 (·)) also sufficient.
For Gibbs random fields,
$$x|M = m \sim f_m(x|\theta_m) = f_m^1(x|S(x))\,f_m^2(S(x)|\theta_m) = \frac{1}{n(S(x))}\,f_m^2(S(x)|\theta_m)$$
where
$$n(S(x)) = \sharp\{\tilde x \in \mathcal{X} : S(\tilde x) = S(x)\}$$
S(x) also sufficient for the joint parameters
[Specific to Gibbs random fields!]
69. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
ABC model choice Algorithm
ABC-MC
◮ Generate $m^*$ from the prior π(M = m).
◮ Generate $\theta^*_{m^*}$ from the prior $\pi_{m^*}(\cdot)$.
◮ Generate $x^*$ from the model $f_{m^*}(\cdot|\theta^*_{m^*})$.
◮ Compute the distance $\rho(S(x^0), S(x^*))$.
◮ Accept $(\theta^*_{m^*}, m^*)$ if $\rho(S(x^0), S(x^*)) < \epsilon$.
[Cornuet, Grelaud, Marin & Robert, BA, 2008]
Note When ε = 0 the algorithm is exact
70. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Toy example
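iid Bernoulli model versus two-state first-order Markov chain, i.e.
$$f_0(x|\theta_0) = \exp\left(\theta_0\sum_{i=1}^n \mathbb{I}_{\{x_i=1\}}\right) \Big/ \{1 + \exp(\theta_0)\}^n,$$
versus
$$f_1(x|\theta_1) = \frac{1}{2}\exp\left(\theta_1\sum_{i=2}^n \mathbb{I}_{\{x_i=x_{i-1}\}}\right) \Big/ \{1 + \exp(\theta_1)\}^{n-1},$$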
with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase
transition” boundaries).
71. The 21st Bayesian Century
Bayesian Calculations
ABC model choice
Toy example (2)
(left) Comparison of the true $BF_{m_0/m_1}(x^0)$ with $\widehat{BF}_{m_0/m_1}(x^0)$
(in logs) over 2,000 simulations and $4\cdot10^6$ proposals from the
prior. (right) Same when using tolerance ε corresponding to the
1% quantile on the distances.
72. The 21st Bayesian Century
A Defense of the Bayesian Choice
A Defense of the Bayesian Choice
Given the advances in practical Bayesian methods in the past two
decades, anti-Bayesianism is no longer a serious option
Gelman, BA, 2009
Bayesians are of course their own worst enemies. They make
non-Bayesians accuse them of religious fervour, and an unwillingness to
see another point of view.
Davidson, 2009
73. The 21st Bayesian Century
A Defense of the Bayesian Choice
1. Choosing a probabilistic representation
Bayesian statistics is about making probability statements
Gelman, BA, 2009
Bayesian Statistics appears as the calculus of uncertainty
Reminder:
A probabilistic model is nothing but an interpretation of a given
phenomenon
What is the meaning of RD’s t test example?!
74. The 21st Bayesian Century
A Defense of the Bayesian Choice
1. Choosing a probabilistic representation (2)
Inference is impossible.
Davidson, 2009
The Bahadur–Savage problem stems from the inability to make
choices about the shape of a statistical model, not from an
impossibility to draw [Bayesian] inference.
Further, a probability distribution is more than the sum of its
moments. Ill-posed problems thus highlight issues with the model,
not the inference.
75. The 21st Bayesian Century
A Defense of the Bayesian Choice
2. Conditioning on the data
Bayesian data analysis is a method for summarizing uncertainty and
making estimates and predictions using probability statements conditional
on observed data and an assumed model
Gelman, BA, 2009
At the basis of statistical inference lies an inversion process
between cause and effect. Using a prior distribution brings a
necessary balance between observations and parameters and enables
one to operate conditional upon x
What is the data in RD’s t test example?! U ’s? Y ’s?
76. The 21st Bayesian Century
A Defense of the Bayesian Choice
3. Exhibiting the true likelihood
Frequentist statistics is an approach for evaluating statistical procedures
conditional on some family of posited probability models
Gelman, BA, 2009
Provides a complete quantitative inference on the parameters and
predictive that points out inadequacies of frequentist statistics,
while implementing the Likelihood Principle.
There needs to be a true likelihood, including in
non-parametric settings
[Rousseau, Van der Vaart]
77. The 21st Bayesian Century
A Defense of the Bayesian Choice
4. Using priors as tools and summaries
Bayesian techniques allow prior beliefs to be tested and discarded as
appropriate
Gelman, BA, 2009
The choice of a prior distribution π does not require any kind of
belief in this distribution: rather consider it as a tool that
summarizes the available prior information and the uncertainty
surrounding this information
Non-identifiability is an issue in that the prior may strongly
impact inference about identifiable bits
78. The 21st Bayesian Century
A Defense of the Bayesian Choice
4. Using priors as tools and summaries (2)
No uninformative prior exists for such models.
Davidson, 2009
Reference priors can be deduced from the sampling distribution by
an automated procedure, based on a minimal information principle
that maximises the information brought by the data.
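A familiar special case, given here only as an illustration: in regular one-parameter models the reference prior coincides with the Jeffreys prior, which is computed from the sampling density f(x|θ) alone through the Fisher information I(θ).

```latex
\pi^{J}(\theta) \propto I(\theta)^{1/2},
\qquad
I(\theta) = -\,\mathbb{E}_{\theta}\!\left[
  \frac{\partial^{2} \log f(X \mid \theta)}{\partial \theta^{2}}
\right]
```

For instance, for a binomial observation X ~ B(n, θ), I(θ) = n / (θ(1 − θ)), so the resulting prior is the Beta(1/2, 1/2) distribution.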
Important literature on prior modelling for non-parametric
problems, incl. smoothness constraints.
79. The 21st Bayesian Century
A Defense of the Bayesian Choice
5. Accepting the subjective basis of knowledge
Knowledge is a critical confrontation between a prioris and
experiments. Ignoring these a prioris impoverishes analysis.
We have, for one thing, to use a language and our
language is entirely made of preconceived ideas and has to be
so. However, these are unconscious preconceived ideas, which
are a million times more dangerous than the other ones. Were
we to assert that if we are including other preconceived ideas,
consciously stated, we would aggravate the evil! I do not
believe so: I rather maintain that they would balance one
another.
Henri Poincaré, 1902
80. The 21st Bayesian Century
A Defense of the Bayesian Choice
6. Choosing a coherent system of inference
Bayesian data analysis has three stages: formulating a model,
fitting the model to data, and checking the model fit.
The second step—inference—gets most of the attention,
but the procedure as a whole is not automatic
Gelman, BA, 2009
To force inference into a decision-theoretic mold allows for a
clarification of the way inferential tools should be evaluated, and
therefore implies a conscious (although subjective) choice of the
retained optimality.
Logical inference process: start with the requested properties, i.e.
loss function and prior distribution, then derive the best solution
satisfying these properties.
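In symbols, and using the notation L(θ, d) for the loss and π(θ) for the prior that appears elsewhere in these slides, the "best solution" is the decision minimizing the posterior expected loss. The quadratic-loss case is added here as a standard textbook illustration:

```latex
\delta^{\pi}(x)
  = \arg\min_{d \in \mathcal{D}}
    \int_{\Theta} L(\theta, d)\,\pi(\theta \mid x)\,\mathrm{d}\theta ,
\qquad
L(\theta, d) = (\theta - d)^{2}
  \;\Longrightarrow\;
  \delta^{\pi}(x) = \mathbb{E}^{\pi}[\theta \mid x]
```

Changing the loss changes the answer: under absolute error loss the same recipe returns a posterior median instead of the posterior mean.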
81. The 21st Bayesian Century
A Defense of the Bayesian Choice
6. Choosing a coherent system of inference (2)
Asymptopia annoys Bayesians.
Davidson, 2009
Asymptotics [for inference] sounds like a proxy for not specifying
the model completely, and thus for using another model, while
asymptotics [for simulation] is quite acceptable. Bayesian inference
does not escape asymptotic difficulties, see e.g. mixtures.
The nonparametric (NP) bootstrap aims at inference with no[t enough]
modelling, while the parametric (P) Bayesian bootstrap essentially uses
the Bayesian predictive
82. The 21st Bayesian Century
A Defense of the Bayesian Choice
7. Looking for optimal frequentist procedures
At intermediate levels of a Bayesian model, frequency properties typically
take care of themselves. It is typically only at the top level of
unreplicated parameters that we have to worry
Gelman, BA, 2009
Bayesian inference widely intersects with the three notions of
minimaxity, admissibility and equivariance (Haar). Looking for an
optimal estimator most often ends up finding a Bayes estimator.
Optimality is easier to attain through the Bayes “filter”
83. The 21st Bayesian Century
A Defense of the Bayesian Choice
8. Solving the actual problem
Frequentist methods have coverage guarantees; Bayesian methods don’t.
In science, coverage matters
Wasserman, BA, 2009
Frequentist methods are justified on a long-run basis, i.e., from the
statistician’s viewpoint. From a decision-maker’s point of view, only
the problem at hand matters! That is, he/she calls for an inference
conditional on x.
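To make "inference conditional on x" concrete, here is a minimal sketch for a toy model that is not in the slides: normal observations with known standard deviation and a conjugate normal prior on the mean. The data values, the prior settings and the function names are all hypothetical.

```python
import math

def normal_posterior(xs, sigma, prior_mean, prior_sd):
    """Posterior mean and sd of a normal mean with known sigma,
    under a conjugate N(prior_mean, prior_sd^2) prior."""
    n = len(xs)
    xbar = sum(xs) / n
    # Posterior precision = prior precision + data precision
    precision = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + n * xbar / sigma**2) / precision
    post_sd = math.sqrt(1.0 / precision)
    return post_mean, post_sd

def credible_interval_95(post_mean, post_sd):
    """Equal-tailed 95% credible interval of a normal posterior."""
    z = 1.959963984540054  # 97.5% quantile of the standard normal
    return post_mean - z * post_sd, post_mean + z * post_sd

# Hypothetical observations and a vague prior (illustrative assumptions)
xs = [1.2, 0.7, 1.9, 1.4]
m, s = normal_posterior(xs, sigma=1.0, prior_mean=0.0, prior_sd=10.0)
lo, hi = credible_interval_95(m, s)
print(f"posterior mean {m:.3f}, 95% credible interval ({lo:.3f}, {hi:.3f})")
```

The interval is a probability statement about the mean given these four observations, not a statement about the behaviour of the procedure over hypothetical repetitions.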
84. The 21st Bayesian Century
A Defense of the Bayesian Choice
9. Providing a universal system of inference
Bayesian methods are presented as an automatic inference engine
Gelman, BA, 2009
Given the three factors
(X, f(x|θ)), (Θ, π(θ)), (D, L(θ, d)),
the Bayesian approach validates one and only one inferential
procedure
85. The 21st Bayesian Century
A Defense of the Bayesian Choice
10. Computing procedures as a minimization problem
The discussion of computational issues should not be allowed to obscure
the need for further analysis of inferential questions
Bernardo, BA, 2009
Bayesian procedures are easier to compute than procedures of
alternative theories, in the sense that there exists a universal
method for the computation of Bayes estimators
Convergence assessment is an issue, but recent developments
in adaptive MCMC allow for more confidence in the output
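One way to read the "universal method" is that, given draws from the posterior, any Bayes estimator can be approximated by minimizing the Monte Carlo average of the loss. The sketch below assumes a hypothetical Beta(8, 4) posterior sampled directly rather than by MCMC, and a simple grid search. The function name `bayes_estimate` is invented for this illustration.

```python
import numpy as np

def bayes_estimate(posterior_draws, loss, candidates):
    """Return the candidate decision that minimizes the Monte Carlo
    approximation of the posterior expected loss."""
    draws = np.asarray(posterior_draws, dtype=float)
    cand = np.asarray(candidates, dtype=float)
    # risks[j] approximates E[L(theta, d_j) | x] by an average over draws
    risks = np.array([loss(draws, d).mean() for d in cand])
    return cand[np.argmin(risks)]

def quadratic_loss(theta, d):
    return (theta - d) ** 2

def absolute_loss(theta, d):
    return np.abs(theta - d)

rng = np.random.default_rng(0)
# Hypothetical posterior: Beta(8, 4), e.g. 7 successes out of 10 trials
# under a uniform prior (an assumption made for this example only)
draws = rng.beta(8.0, 4.0, size=20_000)
grid = np.linspace(0.0, 1.0, 1001)

d_quad = bayes_estimate(draws, quadratic_loss, grid)  # close to the posterior mean
d_abs = bayes_estimate(draws, absolute_loss, grid)    # close to the posterior median
print(d_quad, d_abs)
```

The same function handles any loss that can be evaluated on the draws. In practice the draws would come from an MCMC sampler, which is where the convergence caveat applies.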