1. Habilitation à diriger des recherches
Pierre Pudlo
Université Montpellier 2
Institut de Mathématiques et Modélisation de Montpellier (I3M)
Institut de Biologie Computationnelle
Labex NUMEV
12/12/2014
Pierre Pudlo (UM2) HDR 12/12/2014 1 / 22
2. Contents
1 Graph based clustering
2 Approximate Bayesian computation
3 Bayesian computation with empirical likelihood
3. Contents
1 Graph based clustering
2 Approximate Bayesian computation
3 Bayesian computation with empirical likelihood
4. Graph based clustering
Idea. Define a (weighted) graph which links
similar observations Xi, i = 1, . . . , n.
Example: link Xi and Xj iff d(Xi, Xj) ≤ h
where h is a tuning parameter (bandwidth).
To obtain the clusters, cut the graph at a
minimal number of edges while partitioning
the data into large groups.
The Ncut (Cheeger) for k = 2 groups.
Optimize
Ncut(A, Ā) = cut(A, Ā) / min{vol(A), vol(Ā)}
where Ā is the complement of A in the set
of observations.
Taking vol(A) and vol(Ā) into account avoids
A = {a single outlier},
Ā = {the rest of the data set}
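As a minimal numeric sketch of the criterion (the helper names, the two-blob data and the 0/1 kernel are mine, not from the talk), the following builds the h-neighbourhood graph and evaluates Ncut, showing that isolating a single point scores badly once the vol terms are included:

```python
import numpy as np

def epsilon_graph(X, h):
    """Link X_i and X_j iff d(X_i, X_j) <= h (unweighted 0/1 graph)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (D <= h).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def ncut(W, A):
    """Ncut(A, Abar) = cut(A, Abar) / min{vol(A), vol(Abar)}."""
    n = W.shape[0]
    A = np.asarray(A)
    Abar = np.setdiff1d(np.arange(n), A)
    cut = W[np.ix_(A, Abar)].sum()   # weight of edges crossing the cut
    vol_A = W[A, :].sum()            # vol = sum of degrees
    vol_Abar = W[Abar, :].sum()
    return cut / min(vol_A, vol_Abar)

# Two well-separated blobs: the balanced cut crosses no edge at all,
# while A = {a single point} cuts every edge incident to that point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
W = epsilon_graph(X, h=1.0)
good = ncut(W, np.arange(20))   # split along the true blobs -> 0
bad = ncut(W, np.arange(1))     # A = a single point -> 1
```

With h = 1.0 the two blobs are internally connected and mutually disconnected, so the balanced partition achieves Ncut 0 while the singleton gets Ncut 1.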
5. A first asymptotic result
On spectral clustering.
The relaxed optimization problem
based on the spectral decomposition
of the matrix
Qh(i, j) = kh(Xj − Xi) / Σℓ kh(Xℓ − Xi)
• von Luxburg et al. (2008) obtained a
deterministic limit when
(∗) n → ∞ & h is fixed
Looks like the limit of the kernel density
estimator fn,h(x), whose limit when h is
fixed is
∫ f(y) kh(x − y) dy
• With Bruno Pelletier, we have
obtained asymptotic results under (∗).
We add a pre-processing: remove data
points in area of low density
Cluster the kept data points
Then the limit does not depend on h and
can be characterized geometrically in
terms of level sets of the true density
f(x).
• But the asymptotic behavior of
spectral clustering when
(∗∗) n → ∞ & h → 0
is much more difficult to
understand.
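A small sketch of the relaxation (the Gaussian choice of kh, the 1-D two-cluster data and all names are mine; the talk leaves kh generic): build the row-normalised matrix Qh and threshold the sign of its second eigenvector.

```python
import numpy as np

def q_matrix(X, h):
    """Qh(i, j) = kh(Xj - Xi) / sum_l kh(Xl - Xi), Gaussian kernel kh.
    Rows sum to 1, i.e. Qh is the transition matrix of a random walk."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-D2 / (2.0 * h ** 2))
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.3, (25, 1)),
               rng.normal(+2.0, 0.3, (25, 1))])
Q = q_matrix(X, h=0.5)

# spectral relaxation: the second-largest eigenvector is nearly piecewise
# constant on the two clusters; its sign recovers the partition
vals, vecs = np.linalg.eig(Q)
v2 = vecs[:, np.argsort(-vals.real)[1]].real
labels = (v2 > 0).astype(int)
```

With two well-separated clusters Qh is nearly block-diagonal, so the sign pattern of the second eigenvector matches the true groups (up to a label swap).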
6. A second asymptotic result
On the NP-hard optimization
problem which minimizes
Ncut(A, Ā) = cut(A, Ā) / min{vol(A), vol(Ā)}
on the graph with bandwidth h.
Assume M is a compact subset of R^d,
X1, . . . , Xn a uniform random sampling
of M.
(Known results if the points form a
regular grid of M)
• With Ery Arias-Castro and Bruno
Pelletier,
we have proven the asymptotic
consistency of the Ncut problem when
(∗∗∗) n → ∞ & h → 0, h^(2d+1) ≫ 1/√n
The limit is the solution of the Cheeger
problem
(cut M with a surface of minimal
perimeter)
7. Contents
1 Graph based clustering
2 Approximate Bayesian computation
3 Bayesian computation with empirical likelihood
8. Intractable likelihoods
Problem
How to perform a Bayesian analysis when the likelihood f(y|φ) is intractable?
Example 1. Gibbs random fields
f(y|φ) ∝ exp(−H(y, φ))
is known up to a constant
Z(φ) = Σy exp(−H(y, φ))
Example 2. Neutral population
genetics
Aim. Infer demographic parameters on
the past of some populations based on
the trace left in genomes of individuals
sampled from current populations.
Latent process (past history of the
sample) ∈ space of high dimension.
If y is the genetic data of the sample,
the likelihood is
f(y|φ) = ∫Z f(y, z | φ) dz
Typically, dim(Z) ≫ dim(Y).
No hope of computing the likelihood,
even with clever Monte Carlo algorithms?
9. Approximate Bayesian computation
Idea
Infer the conditional distribution of φ given yobs from simulations of the joint π(φ)f(y|φ)
ABC algorithm
A) Generate a large set of (φ, y)
from the Bayesian model
π(φ) f(y|φ)
B) Keep the particles (φ, y) such
that d(η(yobs), η(y)) ≤ ε
C) Return the φ’s of the kept
particles
Curse of dimensionality: y is replaced
by some numerical summaries η(y)
Stage A) is computationally heavy!
We end up rejecting almost all
simulations, except those falling in the
neighborhood of η(yobs).
Sequential ABC algorithms try to avoid
drawing φ in areas of low π(φ|y).
An auto-calibrated ABC-SMC
sampler with Mohammed Sedki,
Jean-Michel Marin, Jean-Marie Cornuet
and Christian P. Robert
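The A)–C) rejection scheme above can be sketched numerically. The toy model (Gaussian likelihood, Gaussian prior, sample mean as summary) and the 1% acceptance quantile are my own illustrative choices, not the talk's:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(phi, n=50):
    """Toy model: y | phi ~ N(phi, 1), n observations."""
    return rng.normal(phi, 1.0, n)

y_obs = simulate(1.0)          # pretend this is the observed dataset
eta_obs = y_obs.mean()         # summary statistic eta(y) = sample mean

# A) generate a large set of (phi, y) from pi(phi) f(y|phi)
N = 20000
phis = rng.normal(0.0, 3.0, N)                       # prior N(0, 3^2)
etas = np.array([simulate(p).mean() for p in phis])

# B) keep the particles with d(eta(y_obs), eta(y)) <= eps
dist = np.abs(etas - eta_obs)
eps = np.quantile(dist, 0.01)      # tolerance = the closest 1% of simulations
kept = phis[dist <= eps]

# C) the kept phi's approximate a sample from pi(phi | eta(y_obs))
posterior_mean = kept.mean()
```

Note how stage A) dominates the cost: 99% of the 20,000 simulations are thrown away, which is exactly what sequential samplers try to avoid.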
10. ABC sequential sampler
How to calibrate ε1 ≥ ε2 ≥ · · · ≥ εT and T to be efficient?
The auto-calibrated ABC-SMC sampler developed with Mohammed Sedki,
Jean-Michel Marin, Jean-Marie Cornuet and Christian P. Robert
11. ABC target
Three levels of approximation of the
posterior π(φ | yobs)
1 the ABC posterior distribution
π(φ | η(yobs))
2 approximated with a kernel of
bandwidth ε (or with k-nearest
neighbours)
π(φ | d(η(y), η(yobs)) ≤ ε)
3 a Monte Carlo error:
sample size N < ∞
See, e.g., our review with J.-M. Marin,
C. Robert and R. Ryder
If η(y) are not sufficient statistics,
π(φ | yobs) ≠ π(φ | η(yobs))
Information regarding yobs might be
lost!
Curse of dimensionality:
cannot have both ε small and N large
when η(y) is of large dimension
Post-processing of Beaumont et al.
(2002) with local linear regression.
But the lack of sufficiency might still be
problematic. See Robert et al. (2011)
for model choice.
12. ABC model choice
ABC model choice
A) Generate a large set of
(m, φ, y) from the Bayesian
model, π(m)πm(φ) fm(y|φ)
B) Keep the particles (m, φ, y)
such that d(η(y), η(yobs)) ≤ ε
C) For each m, return
pm(yobs) = proportion of m
among the kept particles
Likewise, if η(y) is not sufficient for the
model choice issue,
π(m | y) ≠ π(m | η(y))
It might be difficult to design
informative η(y).
Toy example.
Model 1. yi iid∼ N(φ, 1)
Model 2. yi iid∼ N(φ, 2)
Same prior on φ (whatever the model)
& uniform prior on model index
η(y) = y1 + · · · + yn is sufficient to
estimate φ in both models
But η(y) carries no information
regarding the variance (hence the
model choice issue)
Other examples in Robert et al. (2011)
In population genetics. Might be
difficult to find summary statistics that
help discriminate between models
(= possible historical scenarios on the
sampled populations)
13. ABC model choice
ABC model choice
A) Generate a large set of
(m, φ, y) from the Bayesian
model π(m)πm(φ) fm(y|φ)
B) Keep the particles (m, φ, y)
such that d(η(y), η(yobs)) ≤ ε
C) For each m, return
pm(yobs) = proportion of m
among the kept particles
If ε is tuned so that the number of kept
particles is k, then pm is a k-nearest
neighbor estimate of
E[1{M = m} | η(yobs)]
Approximating the posterior
probabilities of model m is a
regression problem where
the response is 1{M = m},
the co-variables are the summary
statistics η(y),
the loss is L2 (whose minimizer is the
conditional expectation)
The preferred method to approximate
the posterior probabilities in DIYABC
is local multinomial regression.
Delicate if dim(η(y)) is large, or if the
summary statistics are highly correlated.
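The k-nearest-neighbour reading of step C) in one picture (the synthetic reference table with a 1-D summary is my own toy, not DIYABC): pm is the frequency of model m among the k reference simulations closest to η(yobs).

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic reference table: model index m ~ uniform{0, 1},
# one-dimensional summary eta | m ~ N(m, 1)
N, k = 10000, 100
m = rng.integers(0, 2, N)
eta = rng.normal(m.astype(float), 1.0)

eta_obs = 0.9
nearest = np.argsort(np.abs(eta - eta_obs))[:k]   # k closest particles
p1 = (m[nearest] == 1).mean()   # k-NN estimate of pi(m = 1 | eta_obs)
```

Here the exact posterior probability of model 1 given η = 0.9 is about 0.6, and the k-NN frequency lands in that neighbourhood; the estimate degrades quickly as the dimension of η grows, which is the curse mentioned above.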
14. Choosing between hidden random fields
Choosing between dependency
graphs: 4 or 8 neighbours?
Models. α, β ∼ prior
z | β ∼ Potts on G4 or G8 with interaction β
y | z, α ∼ Πi P(yi | zi, α)
How to sum up the noisy y?
Without noise (directly observed field),
sufficient statistics for the model choice
issue.
With Julien Stoehr and Lionel Cucala,
a method to design new summary
statistics
Based on a clustering of the observed
data on possible dependency graphs
number of connected components
size of the largest connected
component,
. . .
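Summaries of that kind are cheap to compute by flood fill. The following sketch (my own code, not the authors') extracts the number of 4-connected monochromatic components and the size of the largest one from a labelled grid:

```python
import numpy as np
from collections import deque

def component_stats(z):
    """Number of 4-connected constant-label components of grid z,
    and the size of the largest one."""
    H, W = z.shape
    seen = np.zeros_like(z, dtype=bool)
    sizes = []
    for i in range(H):
        for j in range(W):
            if seen[i, j]:
                continue
            # flood fill the component containing (i, j)
            q, size = deque([(i, j)]), 0
            seen[i, j] = True
            while q:
                a, b = q.popleft()
                size += 1
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    u, v = a + da, b + db
                    if (0 <= u < H and 0 <= v < W and not seen[u, v]
                            and z[u, v] == z[a, b]):
                        seen[u, v] = True
                        q.append((u, v))
            sizes.append(size)
    return len(sizes), max(sizes)

z = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
n_comp, largest = component_stats(z)   # 3 components, largest of size 5
```

Run on fields clustered under G4 versus G8, such counts react to the dependency structure even when the field is only observed through noise.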
15. Machine learning to analyse machine simulated data
ABC model choice
A) Generate a large set of
(m, φ, y) from π(m)πm(φ) fm(y|φ)
B) Infer (anything?) about
m | η(y) with machine learning
methods
In this machine learning perspective:
the (iid) simulations of A) form the
training set
yobs becomes a new data point
With J.-M. Marin, J.-M. Cornuet, A.
Estoup, M. Gautier and C. P. Robert
Predicting m is a classification
problem
Computing π(m|η(y)) is a
regression problem
It is well known that classification is
much simpler than regression
(lower dimension of the object we infer).
Why compute π(m|η(y)) if we know
that
π(m|y) ≠ π(m|η(y))?
16. An example with random forest on human SNP data
Out of Africa
6 scenarios, 6 models
Observed data. 4 populations, 30
individuals per population; 10,000
genotyped SNP from the 1000 Genome
Project
Random forest trained on 40,000
simulations (112 summary statistics)
predicts the model which supports
a single out-of-Africa colonization
event,
a secondary split between European
and Asian lineages and
a recent admixture for Americans
with African origin
Confidence in the selected model?
17. Example (continued)
Observed data. 4 populations, 30
individuals per population; 10,000
genotyped SNP from the 1000 Genome
Project
Random forest trained on 40,000
simulations (112 summary statistics)
predicts the model which supports
a single out-of-Africa colonization
event,
a secondary split between European
and Asian lineages and
a recent admixture for Americans
with African origin
Benefits of random forests?
1 Can find the relevant statistics in a
large set of statistics (112) to
discriminate models
2 Lower prior misclassification error
(≈ 6%) than other methods (ABC, i.e.
k-nn ≈ 18%)
3 Supply a similarity measure to
compare η(y) and η(yobs)
Confidence in the selected model?
Compute the average of the
misclassification error over an ABC
approximation of the predictive (∗). Here,
≤ 0.1%
(∗) π(m, φ, y | ηobs) = π(m | ηobs)πm(φ | ηobs)fm(y | φ)
18. Contents
1 Graph based clustering
2 Approximate Bayesian computation
3 Bayesian computation with empirical likelihood
19. Another approximation of the likelihood
What if both
the likelihood is intractable and
we cannot simulate a dataset in a reasonable amount of time to resort to ABC?
First answer: use pseudo-likelihoods
such as the pairwise composite likelihood
fPCL(y | φ) = Πi<j f(yi, yj | φ)
Maximum composite likelihood
estimators φ̂(y) are suitable estimators.
But they cannot substitute for a true likelihood
in a Bayesian framework:
this leads to credible intervals which are
too narrow, i.e. over-confidence in φ̂(y); see
e.g. Ribatet et al. (2012)
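The over-confidence is easy to see in the iid case (a worked toy of mine, not from the talk): for iid N(φ, 1) data, log fPCL = Σi<j [log f(yi|φ) + log f(yj|φ)] = (n − 1) × the true log-likelihood, so treating it as a genuine likelihood inflates the curvature by a factor n − 1 and shrinks the naive credible interval by √(n − 1).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
y = rng.normal(0.0, 1.0, n)

def neg_curvature(loglik, phi_hat, eps=1e-4):
    """Minus the second derivative at phi_hat by central differences
    (exact here, both functions being quadratic in phi)."""
    return -(loglik(phi_hat + eps) - 2 * loglik(phi_hat)
             + loglik(phi_hat - eps)) / eps ** 2

loglik = lambda phi: -0.5 * np.sum((y - phi) ** 2)   # true log-likelihood
pcl = lambda phi: (n - 1) * loglik(phi)              # pairwise composite

I_true = neg_curvature(loglik, y.mean())   # Fisher information: n
I_pcl = neg_curvature(pcl, y.mean())       # inflated: n * (n - 1)
ratio = np.sqrt(I_pcl / I_true)            # naive CI shrinks by sqrt(n - 1)
```

Both objectives peak at the same φ̂(y) = ȳ, so point estimation is fine; it is only the implied uncertainty that is wrong by a factor of about 10 here.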
Our proposal with Kerrie Mengersen and
Christian P. Robert:
use the empirical likelihood of Owen
(2001, 2011)
It relies on iid blocks in the dataset y to
reconstruct a likelihood
& permits likelihood ratio tests
confidence intervals are correct
Original aim of Owen: remove parametric
assumptions
20. Bayesian computation via empirical likelihood
With empirical likelihood, the parameter φ
is defined as
(∗) E[h(yb, φ)] = 0
where
yb is one block of y,
E the expected value according to
the true distribution of the block yb
h is a known function
E.g., if φ is the mean of an iid sample,
h(yb, φ) = yb − φ
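For that mean example, Owen's profile empirical likelihood can be computed explicitly: maximize Σb log(n pb) subject to Σb pb (yb − φ) = 0 and Σb pb = 1, which reduces to a one-dimensional Lagrange-multiplier equation. A sketch (my own solver, using plain bisection):

```python
import numpy as np

def log_empirical_likelihood(y, phi):
    """log EL(phi) for the mean: p_b = 1 / (n (1 + lam * h_b)) with
    lam solving sum_b h_b / (1 + lam * h_b) = 0, h_b = y_b - phi."""
    h = y - phi
    if h.min() >= 0 or h.max() <= 0:
        return -np.inf                     # phi outside the convex hull
    # g(lam) = sum_b h_b / (1 + lam h_b) is decreasing on the valid range
    lo = -1.0 / h.max() + 1e-10
    hi = -1.0 / h.min() - 1e-10
    for _ in range(200):                   # bisection for the root of g
        lam = 0.5 * (lo + hi)
        if np.sum(h / (1.0 + lam * h)) > 0:
            lo = lam
        else:
            hi = lam
    p = 1.0 / (len(y) * (1.0 + lam * h))
    return np.sum(np.log(len(y) * p))

rng = np.random.default_rng(4)
y = rng.normal(2.0, 1.0, 200)
at_mean = log_empirical_likelihood(y, y.mean())   # maximal: log EL = 0
away = log_empirical_likelihood(y, y.mean() + 0.5)  # strictly smaller
```

The profile is maximal (log EL = 0, all pb = 1/n) exactly at the sample mean and drops off around it, which is what makes likelihood-ratio tests and correct confidence intervals possible without a parametric model.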
In population genetics, what is (∗) with
dates of population splits
population sizes, etc. ?
21. Bayesian computation via empirical likelihood
A block = genetic data at given locus
h(yb, φ) is the pairwise composite score
function, which we can compute explicitly
in many situations:
h(yb, φ) = ∇φ log fPCL(yb | φ)
Benefits.
much faster than ABC (no need to
simulate fake data)
the same accuracy as ABC, or even
better precision: no loss of information
through summary statistics
22. An experiment
Evolutionary scenario:
MRCA
POP 0 POP 1 POP 2
τ1
τ2
Dataset:
50 genes per population,
100 microsat. loci
Assumptions:
Ne identical over all populations
φ = log10(θ, τ1, τ2)
non-informative prior
Comparison of ABC and EL
histogram = EL
curve = ABC
vertical line = “true” parameter