Maximum likelihood methods may be inadequate for parameter estimation in models where many nuisance parameters are present. The modified profile likelihood (MPL) of Barndorff-Nielsen (1983) serves as a highly accurate approximation to the marginal or conditional likelihood, when either exists, and can be viewed as an approximate conditional likelihood when they do not. We examine the modified profile likelihood, its variants, and its connections with Laplace and saddlepoint approximations from both theoretical and pragmatic perspectives.
1. Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference
Jared Tobin (BSc, MAS)
Department of Statistics
The University of Auckland
Auckland, New Zealand
February 24, 2011
2. ./helloWorld
I’m from St. John’s, Canada
3. ./helloWorld
It’s a charming city known for
A street containing the most pubs per square foot in North America
What could be the worst weather on the planet
these characteristics are probably related..
4. ./tellThemAboutMe
I recently completed my Master’s degree in Applied Statistics at
Memorial University.
I’m also a Senior Research Analyst with the Government of
Newfoundland & Labrador.
This basically means I do a lot of programming and statistics..
(thank you for R!)
Here at Auckland, my main supervisor is Russell.
I’m also affiliated with Fisheries & Oceans Canada (DFO) via my
co-supervisor, Noel.
5. ./whatImDoingHere
Today I’ll be talking about my Master’s research, as well as what I
plan to work on during my PhD studies here in Auckland.
So let’s get started.
6. Likelihood inference
Consider a probabilistic model Y ∼ f(y; ζ), ζ ∈ R, and a sample y of size n from f(y; ζ).
How can we estimate ζ from the sample y?
Everybody knows maximum likelihood.. (right?)
It’s the cornerstone of frequentist inference, and remains quite
popular today.
(ask Russell)
7. Likelihood inference
Define $L(\zeta; y) = \prod_{j=1}^{n} f(y_j; \zeta)$ and call it the likelihood function of ζ, given the sample y.
If we take R to be the (non-extended) real line, then ζ ∈ R lets us define $\hat\zeta = \operatorname{argmax}_\zeta L(\zeta; y)$ to be the maximum likelihood estimator (or MLE) of ζ.
We can think of $\hat\zeta$ as the value of ζ that maximizes the probability of observing the sample y.
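To make the definitions concrete, here is a minimal numerical MLE sketch in Python. The exponential model, the data, and the function names are all illustrative assumptions of mine, not anything from the talk.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative model (an assumption): f(y; zeta) = zeta * exp(-zeta * y), zeta > 0.
def neg_log_likelihood(zeta, y):
    # l(zeta; y) = sum_j log f(y_j; zeta); we minimize its negative.
    return -np.sum(np.log(zeta) - zeta * y)

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=50)   # true zeta = 1 / scale = 0.5

# zeta_hat = argmax_zeta L(zeta; y), found numerically.
fit = minimize_scalar(neg_log_likelihood, args=(y,), bounds=(1e-8, 50.0), method="bounded")
print(fit.x, 1 / y.mean())                # numerical MLE vs. the closed-form MLE 1/ybar
```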
8. Nuisance parameter models
Now consider a probabilistic model Y ∼ f(y; ζ) where
$\zeta = (\theta, \psi) \in \mathbb{R}^{P}$, with $\theta \in \mathbb{R}$ and $\psi \in \mathbb{R}^{P-1}$.
If we are only interested in θ, then ψ is called a nuisance
parameter, or incidental parameter.
It turns out that these things are aptly named..
9. The nuisance parameter problem: example
Take a very simple example: an H-strata model with two observations per stratum. $Y_{h1}$ and $Y_{h2}$ are iid $N(\mu_h, \sigma^2)$ random variables, $h = 1, \ldots, H$, and we are interested in estimating σ².
Define $\mu = (\mu_1, \ldots, \mu_H)$. Assuming σ² is known, the log-likelihood for µ is
$$l(\mu; \sigma^2, y) = \sum_h \left[ -\log 2\pi\sigma^2 - \frac{(y_{h1} - \mu_h)^2 + (y_{h2} - \mu_h)^2}{2\sigma^2} \right]$$
and the MLE for the hth stratum, $\hat\mu_h$, is that stratum's sample mean $(y_{h1} + y_{h2})/2$.
10. Example, continued..
To estimate σ², however, we must use that estimate for µ.
It is common to use the profile likelihood, defined for this example as
$$l^{(P)}(\sigma^2; \hat\mu, y) = \sup_{\mu} l(\mu, \sigma^2; y)$$
to estimate σ². Maximizing yields
$$\hat\sigma^2 = \frac{1}{H} \sum_h \frac{(y_{h1} - \hat\mu_h)^2 + (y_{h2} - \hat\mu_h)^2}{2}$$
as the MLE for σ².
11. Example, continued..
Let’s check for bias.. let $S_h^2 = [(Y_{h1} - \hat\mu_h)^2 + (Y_{h2} - \hat\mu_h)^2]/2$ and note that $S_h^2 = (Y_{h1} - Y_{h2})^2/4$ and $\hat\sigma^2 = H^{-1}\sum_h S_h^2$.
Some algebra shows that
$$E[S_h^2] = \frac{1}{4}\left( \operatorname{var} Y_{h1} + \mu_h^2 + \operatorname{var} Y_{h2} + \mu_h^2 - 2E[Y_{h1} Y_{h2}] \right)$$
and since $Y_{h1}, Y_{h2}$ are independent, $E[Y_{h1} Y_{h2}] = \mu_h^2$, so that $E[S_h^2] = \sigma^2/2$.
12. Example, continued..
Put it together and we have
$$E[\hat\sigma^2] = \frac{1}{H}\sum_h E[S_h^2] = \frac{1}{H}\sum_h \frac{\sigma^2}{2} = \frac{\sigma^2}{2}$$
No big deal - everyone knows the MLE for σ² is biased..
But notice the implication for consistency..
$$\lim_{n\to\infty} P(|\hat\sigma^2 - \sigma^2| < \epsilon) = 0$$
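Here is a small simulation sketch of the inconsistency just shown; the seed, the stratum means, and σ² = 4 are arbitrary choices of mine, used only to illustrate that σ̂² settles near σ²/2 as H grows.

```python
import numpy as np

# H strata, two observations per stratum; ML estimate of sigma^2 with the
# mu_h profiled out at the stratum means.
rng = np.random.default_rng(123)
sigma2 = 4.0

for H in (10, 100, 10_000):
    mu = rng.normal(0.0, 10.0, size=H)                       # arbitrary stratum means
    y = rng.normal(mu[:, None], np.sqrt(sigma2), size=(H, 2))
    mu_hat = y.mean(axis=1)                                   # per-stratum MLEs
    sigma2_hat = np.mean(((y - mu_hat[:, None]) ** 2).sum(axis=1) / 2)
    print(H, round(sigma2_hat, 3))                            # tends to sigma2 / 2 = 2, not 4
```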
13. Neyman-Scott problems
This result isn’t exactly new..
It was described by Neyman & Scott as early as 1948.
That’s merit for the name; this type of problem is typically known
as a Neyman-Scott problem in the literature.
The problem is that one of the required regularity conditions is not
met. We usually require that the dimension of (µ, σ²) remain constant for increasing sample size..
But notice that $n = \sum_{h=1}^{H} 2 = 2H$, so $n \to \infty$ iff $H \to \infty$ iff $\dim(\mu) \to \infty$.
14. The profile likelihood
In a general setting where Y ∼ f(y; ψ, θ) with nuisance parameter ψ, consider the partial information $i_{\theta\theta|\psi} = i_{\theta\theta} - i_{\theta\psi}\, i_{\psi\psi}^{-1}\, i_{\psi\theta}$.
It can be shown that the profile expected information $i_{\theta\theta}^{(P)}$ is first-order equivalent to $i_{\theta\theta}$..
so $i_{\theta\theta|\psi} < i_{\theta\theta}^{(P)}$ in an asymptotic sense.
In other words, the profile likelihood places more weight on information about θ than it 'ought' to.
15. Two-index asymptotics for the profile likelihood
Take a general stratified model again.. Yh ∼ f (yh ; ψh , θ) with H
strata. Remove the per-stratum sample size restriction of nh = 2.
Now, if we let both nh and H approach infinity, we get different
results depending on the speed at which nh → ∞ and H → ∞.
If $n_h \to \infty$ faster than H does, we have that $\hat\theta - \theta = O_p(n^{-1/2})$.
If $H \to \infty$ faster, the same difference is of order $O_p(n_h^{-1})$.
In other words, we can probably expect the profile likelihood to
make relatively poor estimates of θ if H > nh on average.
16. Alternatives to using the profile likelihood
This type of model comes up a lot in practice.. (example later)
What solutions are available to tackle the nuisance parameter
problem?
For the normal nuisance parameter model, the method of moments
estimator is an option (and happens to be unbiased & consistent).
17. Alternatives to using the profile likelihood
But what if the moments themselves involve multiple parameters?
... then it may be difficult or impossible to construct a method of
moments estimator.
Also, likelihood-based estimators generally have desirable statistical
properties, and it would be nice to retain those.
There may be a way to patch up the problem with ‘standard’ ML
in these models..
hint: there is. my plane ticket was $2300 CAD, so I’d better have
something entertaining to tell you..
18. Motivating example
This research started when looking at a stock assessment problem.
Take the waters off the coast of Newfoundland & Labrador, which
historically supported the largest fishery of Atlantic cod in the
world.
(keyword: historically)
19. Motivating example (continued)
Here’s the model: divide these waters (the stock area) into N
equally-sized sampling units, where the j th unit, j = 1, . . . , N
contains λj fish. Each sampling unit corresponds to the area over
the ocean bottom covered by a standardized trawl tow made at a
fixed speed and duration.
Then the total number of fish in the stock is $\lambda = \sum_{j=1}^{N} \lambda_j$, and we want to estimate this.
In practice we estimate a measure of trawlable abundance. We
weight λ by the probability of catching a fish on any given tow, q,
and estimate µ = qλ.
20. Motivating example (continued)
DFO conducts two research trawl surveys on these waters every
year using a stratified random sampling scheme..
21. Motivating example (continued)
For each stratum h, h = 1, . . . , H, we model an observed catch as
Yh ∼ negbin(µh , k).
The negative binomial mass function is
$$P(Y_h = y_h; \mu_h, k) = \frac{\Gamma(y_h + k)}{\Gamma(y_h + 1)\,\Gamma(k)} \left( \frac{\mu_h}{\mu_h + k} \right)^{y_h} \left( \frac{k}{\mu_h + k} \right)^{k}$$
and it has mean $\mu_h$ and variance $\mu_h + k^{-1}\mu_h^2$.
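For concreteness, a short sketch of this mass function in the (µ, k) parameterization, checked against scipy's (n, p) parameterization; the particular values of µ, k and y are just placeholders.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import nbinom

def negbin_logpmf(y, mu, k):
    # log P(Y = y; mu, k) in the mean/dispersion parameterization above.
    return (gammaln(y + k) - gammaln(y + 1) - gammaln(k)
            + y * np.log(mu / (mu + k)) + k * np.log(k / (mu + k)))

# scipy's nbinom uses (n, p) with n = k and p = k / (mu + k),
# which gives mean mu and variance mu + mu^2 / k.
mu, k, y = 3.0, 1.5, 4
print(negbin_logpmf(y, mu, k), nbinom.logpmf(y, k, k / (mu + k)))
```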
22. Motivating example (continued)
We have a nuisance parameter model.. if we want to make interval
estimates for µ, we must estimate the dispersion parameter k.
Breezing through the literature will suggest any number of
increasingly esoteric ways to do this..
method of moments
pseudo-likelihood
optimal quadratic estimating equations
extended quasi-likelihood
adjusted extended quasi-likelihood
double extended quasi-likelihood
etc.
WTF
23. Motivating example (continued)
Which of these is best??
(and why are there so many??)
Noel and I wrote a paper that tried to answer the first question..
.. unfortunately, we also wound up adding another long-winded
estimator to the list, and so increased the scope of the second.
We coined something called the ‘adjusted double-extended
quasi-likelihood’ or ADEQL estimator, which performed best in our
simulations.
24. Motivating example (fini)
When writing my Master’s thesis, I wanted to figure out why this
estimator worked as well as it did.
And what exactly are all the other ones?
This involved looking into functional approximations and likelihood
asymptotics..
.. but I managed to uncover some fundamental answers that
simplified the whole nuisance parameter estimation mumbo jumbo.
25. Conditional inference
For simplicity, recall the general nonstratified nuisance parameter
model.. i.e. Y ∼ f (y ; ψ, θ).
Start with some theory. Let (t1 , t2 ) be jointly sufficient for (ψ, θ)
and let a be ancillary.
If we could factorize the likelihood as
L(ψ, θ) ≈ L(θ; t2 |a)L(ψ, θ; t1 |t2 , a)
or
L(ψ, θ) ≈ L(θ; t2 |t1 , a)L(ψ, θ; t1 |a)
then we could maximize L(θ; t2 |a), called the marginal likelihood,
or L(θ; t2 |t1 , a), the conditional likelihood, to obtain an estimate of
θ.
26. Conditional inference
Each of these functions conditions on statistics that contain all of
the information about θ, and negligible information about ψ.
They seek to eliminate the effect of ψ when estimating θ, and thus
theoretically solve the nuisance parameter problem.
(both the marginal and conditional likelihoods are special cases of Cox’s
partial likelihood function)
27. Approximate conditional inference?
Disclaimer: theory vs. practice
It is pretty much impossible to show that a factorization like this
even exists in practice.
The best we can usually do is try to approximate the conditional or
marginal likelihood.
28. Approximations
There are many, many ways to approximate functions and
integrals..
We’ll briefly touch on an important one..
(recall the title of this talk for a hint)
.. the saddlepoint approximation, which is a highly accurate approximation to density and mass functions.
It's often capable of outperforming more computationally demanding methods, e.g. Metropolis-Hastings/Gibbs MCMC.
For a few cases (the normal, gamma, and inverse Gaussian densities), it's even exact.
29. Laplace approximation
Familiar with the Laplace approximation? We need it first. We're interested in $\int_a^b f(y; \theta)\,dy$ for some $a < b$, and the idea is to write the integrand as $e^{-g(y;\theta)}$, for $g(y; \theta) = -\log f(y; \theta)$.
Truncate a Taylor expansion of $g(y; \theta)$ about $\hat y$, where $\hat y = \operatorname{argmin}_y g(y; \theta)$ on $(a, b)$ (equivalently, the maximizer of f).
Then integrate over $(a, b)$.. we wind up integrating the kernel of a $N(\hat y,\, 1/g''(\hat y))$ density and get
$$\int_a^b f(y; \theta)\,dy \approx \exp\{-g(\hat y)\} \left( \frac{2\pi}{g''(\hat y)} \right)^{1/2}$$
It works because the value of the integral depends mainly on $g(\hat y)$ (the maximum of f on (a, b), on the log scale) and $g''(\hat y)$ (the curvature at that maximum).
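A quick numerical sketch of the Laplace approximation as just described, using an unnormalized Gamma(5, 1) integrand (an arbitrary choice of mine) so the exact answer Γ(5) = 24 is available for comparison.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Laplace approximation to int f(y) dy with f(y) = exp(-g(y)), g = -log f.
f = lambda y: y ** 4 * np.exp(-y)            # unnormalized Gamma(5, 1) integrand
g = lambda y: -np.log(f(y))

y_hat = minimize_scalar(g, bounds=(1e-6, 50.0), method="bounded").x   # maximizer of f
h = 1e-4
g2 = (g(y_hat - h) - 2 * g(y_hat) + g(y_hat + h)) / h ** 2            # curvature g''(y_hat)

laplace = np.exp(-g(y_hat)) * np.sqrt(2 * np.pi / g2)
exact, _ = quad(f, 0, np.inf)
print(laplace, exact)                         # roughly 23.5 vs Gamma(5) = 24
```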
30. Saddlepoint approximation
We can then do some nifty math in order to refine Taylor's approximation to a function. Briefly, we want to relate the cumulant generating function to its corresponding density by creating an approximate inverse mapping.
For K(t) the cumulant generating function, the moment generating function can be written
$$e^{K(t)} = \int_Y e^{ty + \log f(y; \theta)}\,dy$$
so fix t and let $g(t, y) = -ty - \log f(y; \theta)$. Laplace's approximation yields
$$e^{K(t)} \approx e^{t y_t}\, f(y_t; \theta) \left( \frac{2\pi}{g''(t, y_t)} \right)^{1/2}$$
where $y_t$ solves $g'(t, y_t) = 0$.
31. Saddlepoint approximation
Sparing the nitty gritty, we do some solving and rearranging (particularly involving the saddlepoint equation $K'(t) = y$), and we come up with
$$f(y_t; \theta) \approx \left\{ 2\pi K''(t) \right\}^{-1/2} \exp\{K(t) - t y_t\}$$
where $y_t$ solves the saddlepoint equation.
This guy is called the unnormalized saddlepoint approximation to f, and is typically denoted $\hat f$.
32. Saddlepoint approximation
We can normalize the saddlepoint approximation by using $c = \int_Y \hat f(y; \theta)\,dy$.
Call $f^*(y; \theta) = c^{-1} \hat f(y; \theta)$ the renormalized saddlepoint approximation.
How does it perform relative to Taylor’s approximation?
For a sample of size n, Taylor’s approximation is typically accurate
to O(n−1/2 ).. the unnormalized saddlepoint approximation is
accurate to O(n−1 ), while the renormalized version does even
better at O(n−3/2 ).
If small samples are involved, the saddlepoint approximation can
make a big difference.
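A sketch of the unnormalized saddlepoint approximation for one case where everything is available in closed form: the mean of n standard exponentials, whose CGF is K(t) = −n log(1 − t/n) and whose exact density is Gamma(n, scale 1/n). The choice of n and the evaluation point are arbitrary.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import brentq

# Mean of n standard exponentials: K(t) = -n log(1 - t/n), valid for t < n.
n = 5
K  = lambda t: -n * np.log(1 - t / n)
K1 = lambda t: n / (n - t)            # K'(t)
K2 = lambda t: n / (n - t) ** 2       # K''(t)

def saddlepoint_density(y):
    # Solve the saddlepoint equation K'(t) = y, then apply the formula above.
    t_hat = brentq(lambda t: K1(t) - y, -50.0, n - 1e-9)
    return np.exp(K(t_hat) - t_hat * y) / np.sqrt(2 * np.pi * K2(t_hat))

y = 1.3
print(saddlepoint_density(y), gamma.pdf(y, a=n, scale=1 / n))   # close even for n = 5
```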
33. The p ∗ formula
We could use the saddlepoint approximation to directly
approximate the marginal likelihood.
(or could we? where would we start?)
Best to continue from Barndorff-Nielsen’s idea.. he and Cox did
some particularly horrific math in the 80’s and came up with a
second-order approximation to the distribution of the MLE.
Briefly, it involves taking a regular exponential family and then
using an unnormalized saddlepoint approximation to approximate
the distribution of a minimal sufficient statistic.. make a particular
reparameterization and renormalize, and you get the p ∗ formula:
$$p^*(\hat\theta; \theta) = \kappa(\theta)\,\big|j(\hat\theta)\big|^{1/2}\,\frac{L(\theta; y)}{L(\hat\theta; y)}$$
where κ is a renormalizing constant (depending on θ) and j is the observed information.
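A small check of the p∗ formula in a case where it is known to be exact after renormalization: the MLE µ̂ = ȳ of an exponential mean, whose true density is Gamma(n, scale µ/n). The sample size, µ and the evaluation point are arbitrary choices, and the renormalizing constant κ is computed numerically.

```python
import numpy as np
from scipy.stats import gamma
from scipy.integrate import simpson

# Exponential sample with mean mu: L(mu; y) depends on y only through mu_hat = ybar,
# with L(mu) = mu^{-n} exp(-n * mu_hat / mu) and j(mu_hat) = n / mu_hat^2.
n, mu = 8, 2.0

def p_star_unnorm(mu_hat):
    log_ratio = n * np.log(mu_hat / mu) - n * mu_hat / mu + n   # log L(mu) - log L(mu_hat)
    return np.sqrt(n) / mu_hat * np.exp(log_ratio)              # |j(mu_hat)|^{1/2} * ratio

grid = np.linspace(1e-3, 12, 4000)
kappa = 1 / simpson(p_star_unnorm(grid), x=grid)                # renormalizing constant

print(kappa * p_star_unnorm(2.5), gamma.pdf(2.5, a=n, scale=mu / n))   # these agree
```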
34. Putting likelihood asymptotics to work
So how does the p ∗ formula help us approximate the marginal
likelihood?
Let t = (t₁, t₂) be a minimal sufficient statistic and u be a statistic such that both $(\hat\psi, u)$ and $(\hat\psi, \hat\theta)$ are one-to-one transformations of t, with the distribution of u depending only on θ.
Barndorff-Nielsen (who else) showed that the marginal density of u can be written as
$$f(u; \theta) = f(\hat\psi, \hat\theta; \psi, \theta)\,\left|\frac{\partial(\hat\psi, \hat\theta)}{\partial(\hat\psi, u)}\right| \bigg/ \left( f(\hat\psi_\theta; \psi, \theta \mid u)\,\left|\frac{\partial\hat\psi_\theta}{\partial\hat\psi}\right| \right)$$
35. Putting likelihood asymptotics to work
It suffices that we know
$$\left|\frac{\partial\hat\psi_\theta}{\partial\hat\psi}\right| = \left|\, j_{\psi\psi}(\hat\lambda_\theta)^{-1}\, l_{\psi;\hat\psi}(\hat\lambda_\theta; \hat\psi, u)\, \right|$$
where $\hat\lambda_\theta = (\theta, \hat\psi_\theta)$ and $|l_{\psi;\hat\psi}(\hat\lambda_\theta; \hat\psi, u)|$ is the determinant of a sample space derivative, defined as the matrix $\partial^2 l(\lambda; \hat\lambda, u)/\partial\psi\,\partial\hat\psi^T$.
We don't need to worry about the other term $|\partial(\hat\psi, \hat\theta)/\partial(\hat\psi, u)|$. It doesn't depend on θ.
36. Putting likelihood asymptotics to work
We can use the p∗ formula to approximate both $f(\hat\psi, \hat\theta; \psi, \theta)$ and $f(\hat\psi_\theta; \psi, \theta \mid u)$. Doing so, we get
$$\begin{aligned}
L(\theta; u) &\propto \frac{L(\psi,\theta)}{L(\hat\psi,\hat\theta)}\,\frac{\big|j(\hat\psi,\hat\theta)\big|^{1/2}}{\big|j_{\psi\psi}(\hat\psi_\theta,\theta)\big|^{1/2}}\,\frac{L(\hat\psi_\theta,\theta)}{L(\psi,\theta)}\;\big|j_{\psi\psi}(\hat\psi_\theta,\theta)\big|\;\big|l_{\psi;\hat\psi}(\hat\psi_\theta,\theta)\big|^{-1}\\
&\propto L(\hat\psi_\theta,\theta)\,\big|j_{\psi\psi}(\hat\psi_\theta,\theta)\big|^{1/2}\,\big|l_{\psi;\hat\psi}(\hat\psi_\theta,\theta)\big|^{-1}\\
&= L^{(P)}(\theta;\hat\psi_\theta)\,\big|j_{\psi\psi}(\theta,\hat\psi_\theta)\big|^{1/2}\,\big|l_{\psi;\hat\psi}(\theta,\hat\psi_\theta)\big|^{-1}
\end{aligned}$$
37. Modified profile likelihood (MPL)
Taking the logarithm, we get
$$l(\theta; u) \approx l^{(P)}(\theta; \hat\psi_\theta) + \frac{1}{2}\log\big|j_{\psi\psi}(\theta, \hat\psi_\theta)\big| - \log\big|l_{\psi;\hat\psi}(\theta, \hat\psi_\theta)\big|,$$
known as the modified profile likelihood for θ and denoted $l^{(M)}(\theta)$.
As it's based on the saddlepoint approximation, it is a highly accurate approximation to the marginal log-likelihood l(θ; u) and thus (from before) to L(θ; t₂ | a).
In cases where a marginal or conditional likelihood does not exist, it can be thought of as an approximate conditional likelihood for θ.
38. Two-index asymptotics for the MPL
Recall the stratified model with H strata.. how does the modified
profile likelihood perform in a two-index asymptotic setting?
If $n_h \to \infty$ faster than H, we have a similar bound as before: $\hat\theta^{(M)} - \theta = O_p(n^{-1/2})$.
The difference this time is that $n_h$ need only increase without bound faster than $H^{1/3}$, which is a much weaker condition.
If $H \to \infty$ faster than $n_h$, then we have a boost in performance over the profile likelihood, in that $\hat\theta^{(M)} - \theta = O_p(n_h^{-2})$ (as opposed to $O_p(n_h^{-1})$).
39. Modified profile likelihood (MPL)
The profile observed information term $j_{\psi\psi}(\theta, \hat\psi_\theta)$ in the MPL corrects the profile likelihood's habit of putting excess information on θ.
What about the sample space derivative term $l_{\psi;\hat\psi}$?
.. this preserves the structure of the parameterization. If θ and ψ
are not parameter orthogonal, this term ensures that
parameterization invariance holds.
What if θ and ψ are parameter orthogonal?
40. Adjusted profile likelihood (APL)
If θ and ψ are orthogonal, we can do without the sample space
derivative..
.. we can define
$$l^{(A)}(\theta) = l^{(P)}(\theta) - \frac{1}{2}\log\big|j_{\psi\psi}(\theta, \hat\psi_\theta)\big|$$
as the adjusted profile likelihood, which is equivalent to the MPL
when θ and ψ are parameter orthogonal.
As a special case of the MPL, the APL has comparable
performance as long as θ and ψ are approximately orthogonal.
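To make the adjustment concrete, here is a sketch of the profile and adjusted profile log-likelihoods for the negative binomial dispersion k in the stratified model, with each µ_h profiled out at ȳ_h (the MLE of µ_h for any fixed k). The toy data and the search interval are assumptions of mine, not anything from the paper or thesis; here the µ block of j_ψψ is diagonal across strata.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

# Toy stratified data (illustrative only): one array of counts per stratum.
y = [np.array([0, 3, 1, 7]), np.array([2, 2, 5]), np.array([1, 0, 0, 4, 2])]

def profile_loglik(k):
    # l^(P)(k): negative binomial log-likelihood with mu_h = ybar_h plugged in.
    lp = 0.0
    for yh in y:
        m = yh.mean()
        lp += np.sum(gammaln(yh + k) - gammaln(yh + 1) - gammaln(k)
                     + yh * np.log(m / (m + k)) + k * np.log(k / (m + k)))
    return lp

def adjusted_profile_loglik(k):
    # l^(A)(k) = l^(P)(k) - 0.5 * log|j_psipsi|, where the mu_h block is diagonal
    # with entries n_h * k / (ybar_h * (ybar_h + k)).
    log_det_j = sum(np.log(len(yh) * k / (yh.mean() * (yh.mean() + k))) for yh in y)
    return profile_loglik(k) - 0.5 * log_det_j

k_ml  = minimize_scalar(lambda k: -profile_loglik(k), bounds=(1e-3, 100.0), method="bounded").x
k_apl = minimize_scalar(lambda k: -adjusted_profile_loglik(k), bounds=(1e-3, 100.0), method="bounded").x
print(k_ml, k_apl)   # compare the maximizers of l^(P) and l^(A)
```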
41. MPL vs APL
It’s interesting to note the nature of the difference between the
MPL and APL..
While the MPL arises via the p∗ formula, the APL can actually be derived via a lower-order Laplace approximation to the integrated likelihood:
$$L(\theta) = \int_{\mathbb{R}} L(\psi, \theta)\,d\psi \approx \exp\left\{ l^{(P)}(\theta; \hat\psi_\theta) \right\} \left( \frac{2\pi}{\left. -\,\partial^2 l(\psi, \theta)/\partial\psi^2 \right|_{\psi = \hat\psi_\theta}} \right)^{1/2} = (2\pi)^{1/2}\, L^{(P)}(\theta; \hat\psi_\theta)\, j_{\psi\psi}(\hat\psi_\theta)^{-1/2}$$
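A quick numerical check of this Laplace route for a single negative binomial stratum (an illustrative setup of mine): integrate the likelihood over µ by quadrature and compare with the Laplace/APL form.

```python
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

# One stratum of counts and a fixed k (both arbitrary, for illustration).
yh = np.array([0, 3, 1, 7])
k = 2.0

def loglik(mu):
    return np.sum(gammaln(yh + k) - gammaln(yh + 1) - gammaln(k)
                  + yh * np.log(mu / (mu + k)) + k * np.log(k / (mu + k)))

mu_hat = yh.mean()                                  # MLE of mu for this fixed k
j_mumu = len(yh) * k / (mu_hat * (mu_hat + k))      # observed information for mu at mu_hat

integrated, _ = quad(lambda m: np.exp(loglik(m)), 1e-8, 100.0)   # L(k) = int L(mu, k) dmu
laplace = np.exp(loglik(mu_hat)) * np.sqrt(2 * np.pi / j_mumu)   # L^(P)(k) (2 pi)^{1/2} j^{-1/2}
print(integrated, laplace)
```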
42. MPL vs APL
In practice we can often get away with using the APL.
It may require assuming that θ and ψ are parameter orthogonal, but this is often the case anyway (e.g. joint mean/dispersion GLMs, mixed models via REML).
In particular, if θ is a scalar, then an orthogonal reparameterization can always be found.
This applicability means that the adjustment term $-\frac{1}{2}\log\big|j_{\psi\psi}(\theta, \hat\psi_\theta)\big|$ can be broadly used in GLMs, quasi-GLMs, HGLMs, etc.
43. Getting back to the problem..
In the paper Noel and I wrote, we compared a bunch of estimators for
the negative binomial dispersion parameter k.. the most relevant
methods to us are
maximum (profile) likelihood (ML)
adjusted profile likelihood (AML)
extended quasi-likelihood (EQL)
adjusted extended quasi-likelihood (AEQL)
double extended quasi-likelihood (DEQL)
adjusted double extended quasi-likelihood (ADEQL)
What light did the whole likelihood asymptotics exercise shed on
this?
It revealed two branches of estimators and a theoretical hierarchy
within each..
44. Insight
The EQL function is actually a saddlepoint approximation to an
exponential family likelihood..
$$q^{(P)}(k) = \sum_{h,i} \left[\, y_{hi} \log \bar y_h + (y_{hi} + k)\log\frac{y_{hi} + k}{\bar y_h + k} - \frac{1}{2}\log(y_{hi} + k) + \frac{1}{2}\log k - \frac{y_{hi}}{12k(y_{hi} + k)} \,\right]$$
.. so it should perform similarly to (but worse than) the MLE.
The double extended quasi-likelihood function is actually the EQL
function for the strata mean model.
And the AEQL function is actually an approximation to the
adjusted profile likelihood.. so the adjusted profile likelihood should
intuitively perform better.
45. Insight
In our paper, the results didn’t exactly follow this theoretical
pattern..
46. Insight
.. but in that paper we had capped estimates of k at 10k.
I decided to throw out (and not resample) nonconverging estimates
of k in my thesis.
This means I had some information about how many estimates
failed to converge, but those estimates didn’t throw off my
simulation averages.
Sure enough, upon doing that, the estimators performed
according to the theoretical hierarchy.
47. Estimator performance (thesis)
Table: Average performance measures across all factor combinations, by estimator.
ML AML EQL CREQL LNEQL CTEQL
Avg. abs. % bias 109.00 30.00 110.00 31.00 33.00 26.00
Avg. MSE 10.21 0.47 10.22 0.47 1.58 0.55
Avg. prop. NC 0.09 0.00 0.09 0.00 0.02 0.00
Table: Ranks of estimators by criterion. Overall rank is calculated as the
ranked average of all other ranks.
ML AML EQL CREQL LNEQL CTEQL
Avg. abs. % bias 5 2 6 3 4 1
Avg. MSE 5 1 6 2 4 3
Avg. prop. NC. 5.5 2 5.5 2 4 2
Overall rank 5 1 6 3 4 2
48. End of the story..
The odd estimator out was the one we originally called ADEQL in
our paper. It’s the k-root of this guy:
$$\sum_{h,i} \left[\, 2k \log\frac{\bar y_h + k}{y_{hi} + k} + \frac{k}{y_{hi} + k} - \frac{y_{hi}(y_{hi} + 2k)}{6k(y_{hi} + k)^2} \,\right] - (n - H) = 0$$
The whole saddlepoint deal revealed the DEQL part was really just
EQL.. so really it’s an adjustment of an approximation to the
profile likelihood, and the adjustment itself is degrees-of-freedom
based.
It performed very well; it was best in our paper, and second only to the
adjusted profile likelihood in my thesis. I have since called it the
Cadigan-Tobin EQL (or CTEQL) estimator for k.
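A sketch of solving this estimating equation for k with a standard root-finder, using the same toy data as in the earlier sketch; the bracketing interval is an assumption and would need care with real survey data.

```python
import numpy as np
from scipy.optimize import brentq

# Toy stratified data (illustrative only); n = total sample size, H = number of strata.
y = [np.array([0, 3, 1, 7]), np.array([2, 2, 5]), np.array([1, 0, 0, 4, 2])]
n, H = sum(len(yh) for yh in y), len(y)

def estimating_equation(k):
    # Left-hand side of the CTEQL equation above, as a function of k.
    total = 0.0
    for yh in y:
        ybar = yh.mean()
        total += np.sum(2 * k * np.log((ybar + k) / (yh + k))
                        + k / (yh + k)
                        - yh * (yh + 2 * k) / (6 * k * (yh + k) ** 2))
    return total - (n - H)

k_hat = brentq(estimating_equation, 0.05, 50.0)   # assumes a sign change in this interval
print(k_hat)
```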
49. Future research direction
Integrated likelihood
Should the adjusted profile likelihood be adopted as a
‘standard’ way to remove nuisance parameters?
How does the adjusted profile likelihood compare to the MPL
if we depart from parameter orthogonality?
If we do find poor performance under parameter
non-orthogonality, how difficult is it to approximate a sample
space derivative in general?
Can autodiff assist with this? Or is there some neat
saddlepoint-like approximation that will do the trick?
50. Acronym treadmill..
Russell calls integrated likelihood
’GREML’, for Generalized REstricted
Maximum Likelihood.
Tacking on ‘INference’ shows the
dangers of acronymization..
We’re oh-so-close to GREMLIN..
(hopefully that won’t be my greatest
contribution to statistics.. )
51. Future research direction
Other interests
Machine learning
Information geometry and asymptotics
Quantum information theory and L2 -norm probability (?)
I would be happy to work with anyone in any of these areas!