Saddlepoint approximations, likelihood
asymptotics, and approximate conditional inference

               Jared Tobin (BSc, MAS)

                 Department of Statistics
                The University of Auckland
                 Auckland, New Zealand


                  February 24, 2011




./helloWorld
  I’m from St. John’s, Canada




./helloWorld



It’s a charming city known for
    A street containing the most pubs per square foot in North America
    What could be the worst weather on the planet

these characteristics are probably related..
./tellThemAboutMe

  I recently completed my Master’s degree in Applied Statistics at
  Memorial University.

  I’m also a Senior Research Analyst with the Government of
  Newfoundland & Labrador.

  This basically means I do a lot of programming and statistics..


                          (thank you for R!)


  Here at Auckland, my main supervisor is Russell.

  I’m also affiliated with Fisheries & Oceans Canada (DFO) via my
  co-supervisor, Noel.


./whatImDoingHere




  Today I’ll be talking about my Master’s research, as well as what I
  plan to work on during my PhD studies here in Auckland.

  So let’s get started.




Likelihood inference


  Consider a probabilistic model Y ∼ f (y ; ζ), ζ ∈ R, and a sample y
  of size n from f (y ; ζ).

  How can we estimate ζ from the sample y?

  Everybody knows maximum likelihood.. (right?)

  It’s the cornerstone of frequentist inference, and remains quite
  popular today.
                                                          (ask Russell)




Likelihood inference



  Define L(ζ; y) = ∏_{j=1}^{n} f(y_j; ζ) and call it the likelihood function of
  ζ, given the sample y.

  If we take R to be the (non-extended) real line, then ζ ∈ R lets us
  define ζ̂ = argmax_ζ L(ζ; y) to be the maximum likelihood
  estimator (or MLE) of ζ.

  We can think of ζ̂ as the value of ζ that maximizes the probability
  of observing the sample y.
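  To make this concrete, here's a minimal sketch (mine, not from the talk) of
  computing an MLE numerically, using an exponential model where the
  closed-form answer ζ̂ = 1/ȳ is available as a check; the data and function
  names are illustrative assumptions.

      import numpy as np
      from scipy.optimize import minimize_scalar

      rng = np.random.default_rng(1)
      y = rng.exponential(scale=1 / 2.5, size=200)   # sample with true rate zeta = 2.5

      def neg_log_likelihood(zeta, y):
          # l(zeta; y) = n*log(zeta) - zeta*sum(y); negate it for the minimizer
          return -(len(y) * np.log(zeta) - zeta * y.sum())

      fit = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50),
                            args=(y,), method="bounded")
      print(fit.x, 1 / y.mean())   # numerical MLE vs closed-form MLE 1/ybar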
Nuisance parameter models



  Now consider a probabilistic model Y ∼ f (y ; ζ) where

      ζ = (θ, ψ), ζ ∈ R^P
      θ ∈ R
      ψ ∈ R^{P−1}

  If we are only interested in θ, then ψ is called a nuisance
  parameter, or incidental parameter.

  It turns out that these things are aptly named..




The nuisance parameter problem: example


  Take a very simple example; an H-strata model with two
  observations per stratum. Y_h1 and Y_h2 are iid N(µ_h, σ²) random
  variables, h = 1, . . . , H, and we are interested in estimating σ².

  Define µ = (µ_1, . . . , µ_H). Assuming σ² is known, the
  log-likelihood for µ is

      l(µ; σ², y) = Σ_h [ −log(2πσ²) − ((y_h1 − µ_h)² + (y_h2 − µ_h)²) / (2σ²) ]

  and the MLE for the hth stratum, µ̂_h, is that stratum’s sample
  mean (y_h1 + y_h2)/2.
Example, continued..


  To estimate σ 2 , however, we must use that estimate for µ.

  It is common to use the profile likelihood, defined for this example
  as
      l^(P)(σ²; µ̂, y) = sup_µ l(µ, σ²; y)

  to estimate σ². Maximizing yields

      σ̂² = (1/H) Σ_h [ (y_h1 − µ̂_h)² + (y_h2 − µ̂_h)² ] / 2

  as the MLE for σ².
Example, continued..



  Let’s check for bias.. let S_h² = [(Y_h1 − µ̂_h)² + (Y_h2 − µ̂_h)²]/2 and
  note that S_h² = (Y_h1 − Y_h2)²/4 and σ̂² = H^{-1} Σ_h S_h².

  Some algebra shows that

      E[S_h²] = (1/4)( var Y_h1 + µ_h² + var Y_h2 + µ_h² − 2 E[Y_h1 Y_h2] )

  and since Y_h1, Y_h2 are independent, E[Y_h1 Y_h2] = µ_h², so that
  E[S_h²] = σ²/2.
Example, continued..

  Put it together and we have

      E[σ̂²] = (1/H) Σ_h E[S_h²]
             = (1/H) Σ_h σ²/2
             = σ²/2

  No big deal - everyone knows the MLE for σ² is biased..

  But notice the implication for consistency..

      lim_{n→∞} P(|σ̂² − σ²| < ε) = 0    for any ε < σ²/2
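  A quick simulation (my own sketch, not part of the talk) makes the
  inconsistency visible: as the number of strata H grows with n_h = 2, the MLE
  settles at σ²/2 rather than σ².

      import numpy as np

      rng = np.random.default_rng(42)
      sigma2 = 4.0

      def mle_sigma2(H):
          # two observations per stratum, each N(mu_h, sigma2), arbitrary stratum means
          mu = rng.normal(0, 10, size=H)
          y = rng.normal(loc=mu[:, None], scale=np.sqrt(sigma2), size=(H, 2))
          ybar = y.mean(axis=1, keepdims=True)
          return ((y - ybar) ** 2).sum() / (2 * H)   # plug-in MLE of sigma^2

      for H in (10, 100, 10_000):
          print(H, mle_sigma2(H))   # approaches sigma2 / 2 = 2.0, not 4.0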
Neyman-Scott problems


  This result isn’t exactly new..

  It was described by Neyman & Scott as early as 1948.

  Hence the name: this type of problem is typically known
  as a Neyman-Scott problem in the literature.

  The problem is that one of the required regularity conditions is not
  met. We usually require that the dimension of (µ, σ 2 ) remain
  constant for increasing sample size..
  But notice that n = Σ_{h=1}^{H} 2 = 2H, so n → ∞ iff H → ∞ iff
  dim(µ) → ∞.
The profile likelihood



  In a general setting where Y ∼ f (y ; ψ, θ) with nuisance parameter
  ψ, consider the partial information i_{θθ|ψ} = i_θθ − i_θψ i_ψψ^{-1} i_ψθ.

  It can be shown that the profile expected information i_θθ^(P) is
  first-order equivalent to i_θθ..

      so i_{θθ|ψ} < i_θθ^(P) in an asymptotic sense.

  In other words, the profile likelihood places more weight on
  information about θ than it ‘ought’ to.
Two-index asymptotics for the profile likelihood


  Take a general stratified model again.. Y_h ∼ f(y_h; ψ_h, θ) with H
  strata. Remove the per-stratum sample size restriction of n_h = 2.

  Now, if we let both n_h and H approach infinity, we get different
  results depending on the speed at which n_h → ∞ and H → ∞.

  If n_h → ∞ faster than H does, we have that θ̂ − θ = Op(n^{-1/2}).
  If H → ∞ faster, the same difference is of order Op(n_h^{-1}).

  In other words, we can probably expect the profile likelihood to
  make relatively poor estimates of θ if H > n_h on average.
Alternatives to using the profile likelihood




  This type of model comes up a lot in practice.. (example later)

  What solutions are available to tackle the nuisance parameter
  problem?

  For the normal nuisance parameter model, the method of moments
  estimator is an option (and happens to be unbiased & consistent).




Alternatives to using the profile likelihood


  But what if the moments themselves involve multiple parameters?

  ... then it may be difficult or impossible to construct a method of
  moments estimator.

  Also, likelihood-based estimators generally have desirable statistical
  properties, and it would be nice to retain those.

  There may be a way to patch up the problem with ‘standard’ ML
  in these models..


     hint: there is. my plane ticket was $2300 CAD, so I’d better have
                     something entertaining to tell you..




Motivating example




  This research started when looking at a stock assessment problem.

  Take the waters off the coast of Newfoundland & Labrador, which
  historically supported the largest fishery of Atlantic cod in the
  world.

                        (keyword: historically)



Motivating example (continued)


  Here’s the model: divide these waters (the stock area) into N
  equally-sized sampling units, where the j-th unit, j = 1, . . . , N,
  contains λ_j fish. Each sampling unit corresponds to the area over
  the ocean bottom covered by a standardized trawl tow made at a
  fixed speed and duration.

  Then the total number of fish in the stock is λ = Σ_{j=1}^{N} λ_j, and we
  want to estimate this.

  In practice we estimate a measure of trawlable abundance. We
  weight λ by the probability of catching a fish on any given tow, q,
  and estimate µ = qλ.
Motivating example (continued)
  DFO conducts two research trawl surveys on these waters every
  year using a stratified random sampling scheme..




Motivating example (continued)



  For each stratum h, h = 1, . . . , H, we model an observed catch as
  Y_h ∼ negbin(µ_h, k).

  The negative binomial mass function is

      P(Y_h = y_h; µ_h, k) = [ Γ(y_h + k) / (Γ(y_h + 1) Γ(k)) ]
                             × (µ_h / (µ_h + k))^{y_h} (k / (µ_h + k))^{k}

  and it has mean µ_h and variance µ_h + k^{-1} µ_h².
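  As a sanity check on this parameterization (a sketch of mine, not from the
  talk), the mass function can be evaluated directly and compared against
  simulated draws; numpy's negative binomial sampler takes (n, p) = (k, k/(µ+k)).

      import numpy as np
      from scipy.special import gammaln

      def negbin_logpmf(y, mu, k):
          # log P(Y = y; mu, k) in the mean/dispersion parameterization above
          return (gammaln(y + k) - gammaln(y + 1) - gammaln(k)
                  + y * np.log(mu / (mu + k)) + k * np.log(k / (mu + k)))

      mu, k = 3.0, 1.5
      rng = np.random.default_rng(0)
      draws = rng.negative_binomial(n=k, p=k / (mu + k), size=200_000)
      print(draws.mean(), mu)                        # ~ mu
      print(draws.var(), mu + mu ** 2 / k)           # ~ mu + mu^2 / k
      print(np.exp(negbin_logpmf(np.arange(60), mu, k)).sum())   # ~ 1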
Motivating example (continued)
  We have a nuisance parameter model.. if we want to make interval
  estimates for µ, we must estimate the dispersion parameter k.

  Breezing through the literature will suggest any number of
  increasingly esoteric ways to do this..

      method of moments
      pseudo-likelihood
      optimal quadratic estimating equations
      extended quasi-likelihood
      adjusted extended quasi-likelihood
      double extended quasi-likelihood
      etc.

                                    WTF

Motivating example (continued)


  Which of these is best??

                     (and why are there so many??)

  Noel and I wrote a paper that tried to answer the first question..

  .. unfortunately, we also wound up adding another long-winded
  estimator to the list, and so increased the scope of the second.

  We coined something called the ‘adjusted double-extended
  quasi-likelihood’ or ADEQL estimator, which performed best in our
  simulations.




Motivating example (fini)



  When writing my Master’s thesis, I wanted to figure out why this
  estimator worked as well as it did.

  And what exactly are all the other ones?

  This involved looking into functional approximations and likelihood
  asymptotics..

  .. but I managed to uncover some fundamental answers that
  simplified the whole nuisance parameter estimation mumbo jumbo.




Conditional inference

  For simplicity, recall the general nonstratified nuisance parameter
  model.. i.e. Y ∼ f (y ; ψ, θ).

  Start with some theory. Let (t1 , t2 ) be jointly sufficient for (ψ, θ)
  and let a be ancillary.

  If we could factorize the likelihood as

                   L(ψ, θ) ≈ L(θ; t2 |a)L(ψ, θ; t1 |t2 , a)

  or
                   L(ψ, θ) ≈ L(θ; t2 |t1 , a)L(ψ, θ; t1 |a)
  then we could maximize L(θ; t2 |a), called the marginal likelihood,
  or L(θ; t2 |t1 , a), the conditional likelihood, to obtain an estimate of
  θ.


Conditional inference



  Each of these functions conditions on statistics that contain all of
  the information about θ, and negligible information about ψ.

  They seek to eliminate the effect of ψ when estimating θ, and thus
  theoretically solve the nuisance parameter problem.


   (both the marginal and conditional likelihoods are special cases of Cox’s
                         partial likelihood function)




Approximate conditional inference?




                   Disclaimer: theory vs. practice

  It is pretty much impossible to show that a factorization like this
  even exists in practice.

  The best we can usually do is try to approximate the conditional or
  marginal likelihood.




Approximations


  There are many, many ways to approximate functions and
  integrals..

  We’ll briefly touch on an important one..
                  (recall the title of this talk for a hint)

  .. the saddlepoint approximation, which is a highly accurate
  approximation to density functions.

  Often it's capable of outperforming more computationally
  demanding methods, e.g. Metropolis-Hastings/Gibbs MCMC.

  For a few cases (normal, gamma, inverse Gaussian densities), it's
  even exact up to renormalization.
Laplace approximation

  Familiar with the Laplace approximation? We need it first. We’re
  interested in ∫_a^b f(y; θ) dy for some a < b, and the idea is to write
  f(y; θ) = e^{−g(y;θ)}, for g(y; θ) = − log f(y; θ), and work with g.

  Truncate a Taylor expansion of e^{−g(y;θ)} about ŷ, where
  ŷ = argmin_y g(y; θ) on (a, b), i.e. the point where f is largest.

  Then integrate over (a, b).. we wind up integrating the kernel of a
  N(ŷ, 1/g''(ŷ)) density and get

      ∫_a^b f(y; θ) dy ≈ exp{−g(ŷ)} ( 2π / g''(ŷ) )^{1/2}

  It works because the value of the integral depends mainly on g(ŷ)
  (the function's minimum, i.e. where f attains its maximum on (a, b))
  and g''(ŷ) (the curvature there).
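  Here's a minimal numerical sketch of the recipe (mine, with an illustrative
  integrand): Laplace-approximate the integral of a gamma kernel over (0, ∞)
  and compare against quadrature; the true value is Γ(6) = 120.

      import numpy as np
      from scipy.integrate import quad
      from scipy.optimize import minimize_scalar

      s = 6.0   # illustrative shape; the integral of f over (0, inf) is Gamma(s)

      def g(y):
          # g(y) = -log f(y) for the gamma kernel f(y) = y**(s-1) * exp(-y)
          return y - (s - 1) * np.log(y)

      # locate yhat = argmin g, then approximate g''(yhat) by central differences
      yhat = minimize_scalar(g, bounds=(1e-8, 100), method="bounded").x
      h = 1e-4
      g2 = (g(yhat + h) - 2 * g(yhat) + g(yhat - h)) / h ** 2

      laplace = np.exp(-g(yhat)) * np.sqrt(2 * np.pi / g2)
      exact, _ = quad(lambda y: np.exp(-g(y)), 1e-12, np.inf)
      print(laplace, exact)   # ~118.0 vs 120.0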
Saddlepoint approximation
  We can then do some nifty math in order to refine Taylor’s
  approximation to a function. Briefly, we want to relate the
  cumulant generating function to its corresponding density by
  creating an approximate inverse mapping.

  For K(t) the cumulant generating function, the moment
  generating function can be written

      e^{K(t)} = ∫_Y e^{ty + log f(y;θ)} dy

  so fix t and let g(t, y) = −ty − log f(y; θ). Laplace's
  approximation yields

      e^{K(t)} ≈ ( 2π / g''(t, y_t) )^{1/2} e^{t y_t} f(y_t; θ)

  where y_t solves ∂g(t, y_t)/∂y = 0 and g'' denotes ∂²g/∂y².
Saddlepoint approximation



  Sparing the nitty gritty, we do some solving and rearranging
  (particularly involving the saddlepoint equation K'(t) = y), and we
  come up with

      f(y_t; θ) ≈ ( 2π K''(t) )^{-1/2} exp{ K(t) − t y_t }

  where y_t solves the saddlepoint equation.

  This guy is called the unnormalized saddlepoint approximation to
  f, and is typically denoted f̂.
Saddlepoint approximation

  We can normalize the saddlepoint approximation by using
  c = ∫_Y f̂(y; θ) dy.

  Call f*(y; θ) = c^{-1} f̂(y; θ) the renormalized saddlepoint
  approximation.

  How does it perform relative to Taylor’s approximation?

  For a sample of size n, Taylor’s approximation is typically accurate
  to O(n−1/2 ).. the unnormalized saddlepoint approximation is
  accurate to O(n−1 ), while the renormalized version does even
  better at O(n−3/2 ).

  If small samples are involved, the saddlepoint approximation can
  make a big difference.


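  A small sketch of the recipe (my own, using a gamma density, where the
  renormalized approximation happens to be exact): compute K, K' and K''
  analytically, solve the saddlepoint equation for each y, renormalize, and
  compare with the true density.

      import numpy as np
      from scipy.integrate import quad
      from scipy.stats import gamma

      alpha = 3.0   # illustrative shape; Gamma(alpha, rate 1) has K(t) = -alpha*log(1 - t)

      def saddlepoint_unnorm(y):
          t = 1 - alpha / y                    # solves the saddlepoint equation K'(t) = y
          K, K2 = -alpha * np.log(1 - t), alpha / (1 - t) ** 2
          return (2 * np.pi * K2) ** -0.5 * np.exp(K - t * y)

      c, _ = quad(saddlepoint_unnorm, 1e-12, np.inf)   # renormalizing constant
      ys = np.array([0.5, 2.0, 5.0])
      print(saddlepoint_unnorm(ys) / c)   # renormalized saddlepoint approximation
      print(gamma.pdf(ys, a=alpha))       # exact density: the two agree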
The p ∗ formula
  We could use the saddlepoint approximation to directly
  approximate the marginal likelihood.

                  (or could we? where would we start?)

  Best to continue from Barndorff-Nielsen’s idea.. he and Cox did
  some particularly horrific math in the 80’s and came up with a
  second-order approximation to the distribution of the MLE.

  Briefly, it involves taking a regular exponential family and then
  using an unnormalized saddlepoint approximation to approximate
  the distribution of a minimal sufficient statistic.. make a particular
  reparameterization and renormalize, and you get the p ∗ formula:
      p*(θ̂; θ) = κ(θ) |j(θ̂)|^{1/2} L(θ; y) / L(θ̂; y)

  where κ is a renormalizing constant (depending on θ) and j is the
  observed information.
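  As a quick worked check (a standard example, not on the slide): for an iid
  exponential sample with mean θ we have θ̂ = ȳ, j(θ̂) = n/θ̂², and
  L(θ)/L(θ̂) = (θ̂/θ)^n exp{n − nθ̂/θ}, so p*(θ̂; θ) ∝ θ̂^{n−1} θ^{−n} exp{−nθ̂/θ},
  which is proportional to the exact Gamma(n, θ/n) density of ȳ — the
  renormalized p* formula is exact here.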
Putting likelihood asymptotics to work


  So how does the p ∗ formula help us approximate the marginal
  likelihood?

  Let t = (t_1, t_2) be a minimal sufficient statistic and u be a statistic
  such that both (ψ̂, u) and (ψ̂, θ̂) are one-to-one transformations of
  t, with the distribution of u depending only on θ.

  Barndorff-Nielsen (who else) showed that the marginal density of u
  can be written as

      f(u; θ) = f(ψ̂, θ̂; ψ, θ) |∂(ψ̂, θ̂)/∂(ψ̂, u)|  /  [ f(ψ̂_θ; ψ, θ | u) |∂ψ̂_θ/∂ψ̂| ]
Putting likelihood asymptotics to work



  It suffices that we know

      ∂ψ̂_θ/∂ψ̂ = j_ψψ(λ̂_θ)^{-1} l_{ψ;ψ̂}(λ̂_θ; ψ̂, u)

  where λ̂_θ = (θ, ψ̂_θ) and |l_{ψ;ψ̂}(λ̂_θ; ψ̂, u)| is the determinant of a
  sample space derivative, defined as the matrix ∂²l(λ; λ̂, u)/∂ψ ∂ψ̂^T.

  We don't need to worry about the other term |∂(ψ̂, θ̂)/∂(ψ̂, u)|. It
  doesn't depend on θ.
Putting likelihood asymptotics to work


  We can use the p* formula to approximate both f(ψ̂, θ̂; ψ, θ) and
  f(ψ̂_θ; ψ, θ | u). Doing so, we get

      L(θ; u) ∝ [ |j(ψ̂, θ̂)|^{1/2} / |j(ψ̂_θ, θ)|^{1/2} ] · [ L(ψ, θ) L(ψ̂_θ, θ) / ( L(ψ̂, θ̂) L(ψ, θ) ) ] · |j(ψ̂_θ, θ)| |l_{ψ;ψ̂}(ψ̂_θ, θ)|^{-1}

              ∝ L(ψ̂_θ, θ) |j(ψ̂_θ, θ)|^{1/2} |l_{ψ;ψ̂}(ψ̂_θ, θ)|^{-1}

              = L^(P)(θ; ψ̂_θ) |j_ψψ(θ, ψ̂_θ)|^{1/2} |l_{ψ;ψ̂}(ψ̂_θ, θ)|^{-1}
Modified profile likelihood (MPL)


  Taking the logarithm, we get

      l(θ; u) ≈ l^(P)(θ; ψ̂_θ) + (1/2) log |j_ψψ(θ, ψ̂_θ)| − log |l_{ψ;ψ̂}(θ, ψ̂_θ)|

  known as the modified profile likelihood for θ and denoted l^(M)(θ).

  As it's based on the saddlepoint approximation, it is a highly
  accurate approximation to the marginal likelihood l(θ; u) and thus
  (from before) L(θ; t_2 | a).

  In cases where the marginal or conditional likelihood does not exist,
  it can be thought of as an approximate conditional likelihood for θ.
Two-index asymptotics for the MPL


  Recall the stratified model with H strata.. how does the modified
  profile likelihood perform in a two-index asymptotic setting?

  If n_h → ∞ faster than H, we have a similar bound as before:
  θ̂^(M) − θ = Op(n^{-1/2}).

  The difference this time is that n_h must only increase without
  bound faster than H^{1/3}, which is a much weaker condition.

  If H → ∞ faster than n_h, then we have a boost in performance
  over the profile likelihood in that θ̂^(M) − θ = Op(n_h^{-2}) (as opposed
  to Op(n_h^{-1})).
Modified profile likelihood (MPL)


  The profile observed information term j_ψψ(θ, ψ̂_θ) in the MPL
  corrects the profile likelihood's habit of putting excess information
  on θ.

  What about the sample space derivative term l_{ψ;ψ̂}?

  .. this preserves the structure of the parameterization. If θ and ψ
  are not parameter orthogonal, this term ensures that
  parameterization invariance holds.

  What if θ and ψ are parameter orthogonal?
Adjusted profile likelihood (APL)


  If θ and ψ are orthogonal, we can do without the sample space
  derivative..

  .. we can define

      l^(A)(θ) = l^(P)(θ) − (1/2) log |j_ψψ(θ, ψ̂_θ)|
  as the adjusted profile likelihood, which is equivalent to the MPL
  when θ and ψ are parameter orthogonal.

  As a special case of the MPL, the APL has comparable
  performance as long as θ and ψ are approximately orthogonal.




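  To tie this back to the earlier Neyman-Scott example, here's a sketch (mine,
  not from the talk) comparing the profile and adjusted profile likelihoods for
  σ² numerically. For that model j_µµ = diag(2/σ²), so the adjustment term is
  (H/2) log(2/σ²); maximizing l^(A) gives a consistent estimate while the
  profile MLE sits near σ²/2.

      import numpy as np
      from scipy.optimize import minimize_scalar

      rng = np.random.default_rng(7)
      H, sigma2 = 2_000, 4.0
      mu = rng.normal(0, 10, size=H)
      y = rng.normal(mu[:, None], np.sqrt(sigma2), size=(H, 2))
      ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum()   # within-stratum sum of squares

      def neg_profile(s2):
          # profile log-likelihood of sigma^2, each mu_h replaced by its MLE
          return -(-H * np.log(2 * np.pi * s2) - ss / (2 * s2))

      def neg_adjusted(s2):
          # adjusted profile likelihood: subtract 0.5*log|j_mumu| = (H/2)*log(2/s2)
          return neg_profile(s2) + 0.5 * H * np.log(2 / s2)

      for f in (neg_profile, neg_adjusted):
          print(minimize_scalar(f, bounds=(0.01, 100), method="bounded").x)
      # profile maximizer ~ sigma2/2 = 2; adjusted maximizer ~ sigma2 = 4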
MPL vs APL

  It’s interesting to note the nature of the difference between the
  MPL and APL..

  While the MPL arises via the p ∗ formula, the APL can actually be
  derived via a lower-order Laplace approximation to the integrated
  likelihood

      L(θ) = ∫ L(ψ, θ) dψ

           ≈ exp{ l^(P)(θ; ψ̂_θ) } [ 2π / ( −∂²l(ψ, θ)/∂ψ² |_{ψ=ψ̂_θ} ) ]^{1/2}

           ∝ L^(P)(θ; ψ̂_θ) |j_ψψ(ψ̂_θ)|^{-1/2}
MPL vs APL


  In practice we can often get away with using the APL.

  May require assuming that θ and ψ are parameter orthogonal, but
  this is often the case anyway (e.g. joint mean/dispersion GLMs,
  mixed models via REML).

  In particular, if θ is a scalar, then an orthogonal reparameterization
  can always be found.

  This applicability means that the adjustment term
  −(1/2) log |j_ψψ(θ, ψ̂_θ)| can be broadly used in GLMs, quasi-GLMs,
  HGLMs, etc.
Getting back to the problem..

  In the paper Noel and I wrote, we compared a bunch of estimators for
  the negative binomial dispersion parameter k.. the most relevant
  methods to us are
      maximum (profile) likelihood (ML)
      adjusted profile likelihood (AML)
      extended quasi-likelihood (EQL)
      adjusted extended quasi-likelihood (AEQL)
      double extended quasi-likelihood (DEQL)
      adjusted double extended quasi-likelihood (ADEQL)
  What insight did the whole likelihood asymptotics exercise shed on
  this?

  It showed two branches of estimators and developed a theoretical
  hierarchy in each..

Insight

  The EQL function is actually a saddlepoint approximation to an
  exponential family likelihood..


      q^+_(P)(k) = Σ_{h,i} [ y_hi log ȳ_h + (y_hi + k) log( (y_hi + k)/(ȳ_h + k) )
                             − (1/2) log(y_hi + k) + (1/2) log k − y_hi / (12k(y_hi + k)) ]

  .. so it should perform similarly to (but worse than) the MLE.

  The double extended quasi-likelihood function is actually the EQL
  function for the strata mean model.
  And the AEQL function is actually an approximation to the
  adjusted profile likelihood.. so the adjusted profile likelihood should
  intuitively perform better.

Insight

  In our paper, the results didn’t exactly follow this theoretical
  pattern..




Insight



  .. but in that paper we had capped estimates of k at 10k.

  I decided to throw out (and not resample) nonconverging estimates
  of k in my thesis.

  This means I had some information about how many estimates
  failed to converge, but those estimates didn’t throw off my
  simulation averages.

  Sure enough, upon doing that, the estimators performed
  according to the theoretical hierarchy.




Estimator performance (thesis)

  Table: Average performance measures across all factor combinations, by
  estimator.

                           ML     AML     EQL   CREQL   LNEQL   CTEQL
    Avg. abs. % bias   109.00   30.00  110.00   31.00   33.00   26.00
    Avg. MSE            10.21    0.47   10.22    0.47    1.58    0.55
    Avg. prop. NC        0.09    0.00    0.09    0.00    0.02    0.00

  Table: Ranks of estimator by criterion. Overall rank is calculated as the
  ranked average of all other ranks.

                           ML   AML   EQL   CREQL   LNEQL   CTEQL
    Avg. abs. % bias       5     2     6      3       4       1
    Avg. MSE               5     1     6      2       4       3
    Avg. prop. NC          5.5   2     5.5    2       4       2
    Overall rank           5     1     6      3       4       2
End of the story..

  The odd estimator out was the one we originally called ADEQL in
  our paper. It's the root in k of this guy:

      Σ_{h,i} [ 2k log( (ȳ_h + k)/(y_hi + k) ) + k/(y_hi + k)
                − y_hi(y_hi + 2k) / (6k(y_hi + k)²) ] − (n − H) = 0

  The whole saddlepoint deal revealed the DEQL part was really just
  EQL.. so really it's an adjustment of an approximation to the
  profile likelihood, and the adjustment itself is degrees-of-freedom
  based.

  It performed very well; best in our paper, and second only to the
  adjusted profile likelihood in my thesis. I have since called it the
  Cadigan-Tobin EQL (or CTEQL) estimator for k.
Future research direction


  Integrated likelihood
      Should the adjusted profile likelihood be adopted as a
      ‘standard’ way to remove nuisance parameters?
      How does the adjusted profile likelihood compare to the MPL
      if we depart from parameter orthogonality?
      If we do find poor performance under parameter
      non-orthogonality, how difficult is it to approximate a sample
      space derivative in general?
      Can autodiff assist with this? Or is there some neat
      saddlepoint-like approximation that will do the trick?




Acronym treadmill..


Russell calls integrated likelihood
’GREML’, for Generalized REstricted
Maximum Likelihood.

Tacking on ‘INference’ shows the
dangers of acronymization..

We’re oh-so-close to GREMLIN..


   (hopefully that won’t be my greatest
       contribution to statistics.. )




Future research direction




  Other interests
      Machine learning
      Information geometry and asymptotics
      Quantum information theory and L2-norm probability (?)


  I would be happy to work with anyone in any of these areas!




Contact




Email: jared@jtobin.ca
Skype: jahredtobin

Website/Blog:
http://jtobin.ca

MAS Thesis:
http://jtobin.ca/jTobin MAS thesis.pdf





Más contenido relacionado

La actualidad más candente

11.[104 111]analytical solution for telegraph equation by modified of sumudu ...
11.[104 111]analytical solution for telegraph equation by modified of sumudu ...11.[104 111]analytical solution for telegraph equation by modified of sumudu ...
11.[104 111]analytical solution for telegraph equation by modified of sumudu ...Alexander Decker
 
Nonlinear perturbed difference equations
Nonlinear perturbed difference equationsNonlinear perturbed difference equations
Nonlinear perturbed difference equationsTahia ZERIZER
 
Homotopy perturbation and elzaki transform for solving nonlinear partial diff...
Homotopy perturbation and elzaki transform for solving nonlinear partial diff...Homotopy perturbation and elzaki transform for solving nonlinear partial diff...
Homotopy perturbation and elzaki transform for solving nonlinear partial diff...Alexander Decker
 
11.homotopy perturbation and elzaki transform for solving nonlinear partial d...
11.homotopy perturbation and elzaki transform for solving nonlinear partial d...11.homotopy perturbation and elzaki transform for solving nonlinear partial d...
11.homotopy perturbation and elzaki transform for solving nonlinear partial d...Alexander Decker
 
Doering Savov
Doering SavovDoering Savov
Doering Savovgh
 
Optimalpolicyhandout
OptimalpolicyhandoutOptimalpolicyhandout
OptimalpolicyhandoutNBER
 
A brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFTA brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFTJiahao Chen
 
A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015Junfeng Liu
 
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALESNONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALESTahia ZERIZER
 
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton TensorDual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton TensorSebastian De Haro
 
2003 Ames.Models
2003 Ames.Models2003 Ames.Models
2003 Ames.Modelspinchung
 
Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayesPhong Vo
 

La actualidad más candente (20)

11.[104 111]analytical solution for telegraph equation by modified of sumudu ...
11.[104 111]analytical solution for telegraph equation by modified of sumudu ...11.[104 111]analytical solution for telegraph equation by modified of sumudu ...
11.[104 111]analytical solution for telegraph equation by modified of sumudu ...
 
Nonlinear perturbed difference equations
Nonlinear perturbed difference equationsNonlinear perturbed difference equations
Nonlinear perturbed difference equations
 
Fdtd
FdtdFdtd
Fdtd
 
Homotopy perturbation and elzaki transform for solving nonlinear partial diff...
Homotopy perturbation and elzaki transform for solving nonlinear partial diff...Homotopy perturbation and elzaki transform for solving nonlinear partial diff...
Homotopy perturbation and elzaki transform for solving nonlinear partial diff...
 
11.homotopy perturbation and elzaki transform for solving nonlinear partial d...
11.homotopy perturbation and elzaki transform for solving nonlinear partial d...11.homotopy perturbation and elzaki transform for solving nonlinear partial d...
11.homotopy perturbation and elzaki transform for solving nonlinear partial d...
 
Doering Savov
Doering SavovDoering Savov
Doering Savov
 
Optimalpolicyhandout
OptimalpolicyhandoutOptimalpolicyhandout
Optimalpolicyhandout
 
A brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFTA brief introduction to Hartree-Fock and TDDFT
A brief introduction to Hartree-Fock and TDDFT
 
A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015A Geometric Note on a Type of Multiple Testing-07-24-2015
A Geometric Note on a Type of Multiple Testing-07-24-2015
 
Valencia 9 (poster)
Valencia 9 (poster)Valencia 9 (poster)
Valencia 9 (poster)
 
Lectures4 8
Lectures4 8Lectures4 8
Lectures4 8
 
9 pd es
9 pd es9 pd es
9 pd es
 
Holographic Cotton Tensor
Holographic Cotton TensorHolographic Cotton Tensor
Holographic Cotton Tensor
 
Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility Basic concepts and how to measure price volatility
Basic concepts and how to measure price volatility
 
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALESNONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
NONLINEAR DIFFERENCE EQUATIONS WITH SMALL PARAMETERS OF MULTIPLE SCALES
 
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton TensorDual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
Dual Gravitons in AdS4/CFT3 and the Holographic Cotton Tensor
 
Rouviere
RouviereRouviere
Rouviere
 
2003 Ames.Models
2003 Ames.Models2003 Ames.Models
2003 Ames.Models
 
Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayes
 
Rousseau
RousseauRousseau
Rousseau
 

Destacado

Production function
Production function Production function
Production function Tej Kiran
 
50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)Heinz Marketing Inc
 
Prototyping is an attitude
Prototyping is an attitudePrototyping is an attitude
Prototyping is an attitudeWith Company
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer ExperienceYuan Wang
 
How to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanHow to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanPost Planner
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionIn a Rocket
 
SEO: Getting Personal
SEO: Getting PersonalSEO: Getting Personal
SEO: Getting PersonalKirsty Hulse
 

Destacado (7)

Production function
Production function Production function
Production function
 
50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)50 Essential Content Marketing Hacks (Content Marketing World)
50 Essential Content Marketing Hacks (Content Marketing World)
 
Prototyping is an attitude
Prototyping is an attitudePrototyping is an attitude
Prototyping is an attitude
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience
 
How to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media PlanHow to Build a Dynamic Social Media Plan
How to Build a Dynamic Social Media Plan
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming Convention
 
SEO: Getting Personal
SEO: Getting PersonalSEO: Getting Personal
SEO: Getting Personal
 

Similar a Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
Mit18 330 s12_chapter5
Mit18 330 s12_chapter5Mit18 330 s12_chapter5
Mit18 330 s12_chapter5CAALAAA
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
My PhD talk "Application of H-matrices for computing partial inverse"
My PhD talk "Application of H-matrices for computing partial inverse"My PhD talk "Application of H-matrices for computing partial inverse"
My PhD talk "Application of H-matrices for computing partial inverse"Alexander Litvinenko
 
Jyokyo-kai-20120605
Jyokyo-kai-20120605Jyokyo-kai-20120605
Jyokyo-kai-20120605ketanaka
 
Talk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesTalk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesPierre Jacob
 
Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Jae-kwang Kim
 
The 2 Goldbach's Conjectures with Proof
The 2 Goldbach's Conjectures with Proof The 2 Goldbach's Conjectures with Proof
The 2 Goldbach's Conjectures with Proof nikos mantzakouras
 
lecture 8
lecture 8lecture 8
lecture 8sajinsc
 
On probability distributions
On probability distributionsOn probability distributions
On probability distributionsEric Xihui Lin
 

Similar a Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference (20)

Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
Mit18 330 s12_chapter5
Mit18 330 s12_chapter5Mit18 330 s12_chapter5
Mit18 330 s12_chapter5
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
Problems and solutions_4
Problems and solutions_4Problems and solutions_4
Problems and solutions_4
 
My PhD talk "Application of H-matrices for computing partial inverse"
My PhD talk "Application of H-matrices for computing partial inverse"My PhD talk "Application of H-matrices for computing partial inverse"
My PhD talk "Application of H-matrices for computing partial inverse"
 
Jyokyo-kai-20120605
Jyokyo-kai-20120605Jyokyo-kai-20120605
Jyokyo-kai-20120605
 
Talk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesTalk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniques
 
Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach Chapter2: Likelihood-based approach
Chapter2: Likelihood-based approach
 
lec2_CS540_handouts.pdf
lec2_CS540_handouts.pdflec2_CS540_handouts.pdf
lec2_CS540_handouts.pdf
 
pattern recognition
pattern recognition pattern recognition
pattern recognition
 
The 2 Goldbach's Conjectures with Proof
The 2 Goldbach's Conjectures with Proof The 2 Goldbach's Conjectures with Proof
The 2 Goldbach's Conjectures with Proof
 
Equivariance
EquivarianceEquivariance
Equivariance
 
lecture 8
lecture 8lecture 8
lecture 8
 
LDP.pdf
LDP.pdfLDP.pdf
LDP.pdf
 
02 math essentials
02 math essentials02 math essentials
02 math essentials
 
On probability distributions
On probability distributionsOn probability distributions
On probability distributions
 
Chris Sherlock's slides
Chris Sherlock's slidesChris Sherlock's slides
Chris Sherlock's slides
 
Quantum chaos of generic systems - Marko Robnik
Quantum chaos of generic systems - Marko RobnikQuantum chaos of generic systems - Marko Robnik
Quantum chaos of generic systems - Marko Robnik
 

Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

  • 1. Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference Jared Tobin (BSc, MAS) Department of Statistics The University of Auckland Auckland, New Zealand February 24, 2011 Jared Tobin Approximate conditional inference
  • 2. ./helloWorld I’m from St. John’s, Canada Jared Tobin Approximate conditional inference
  • 3. ./helloWorld It’s a charming city known for A street containing the most pubs per square foot in North America What could be the worst weather on the planet these characteristics are probably related.. Jared Tobin Approximate conditional inference
  • 4. ./tellThemAboutMe I recently completed my Master’s degree in Applied Statistics at Memorial University. I’m also a Senior Research Analyst with the Government of Newfoundland & Labrador. This basically means I do a lot of programming and statistics.. (thank you for R!) Here at Auckland, my main supervisor is Russell. I’m also affiliated with Fisheries & Oceans Canada (DFO) via my co-supervisor, Noel. Jared Tobin Approximate conditional inference
  • 5. ./whatImDoingHere Today I’ll be talking about my Master’s research, as well as what I plan to work on during my PhD studies here in Auckland. So let’s get started. Jared Tobin Approximate conditional inference
  • 6. Likelihood inference Consider a probabilistic model Y ∼ f (y ; ζ), ζ ∈ R, and a sample y of size n from f (y ; ζ). How can we estimate ζ from the sample y? Everybody knows maximum likelihood.. (right?) It’s the cornerstone of frequentist inference, and remains quite popular today. (ask Russell) Jared Tobin Approximate conditional inference
  • 7. Likelihood inference Define L(ζ; y) = n f (yj ; ζ) and call it the likelihood function of j=1 ζ, given the sample y. If we take R to be the (non-extended) real line, then ζ ∈ R lets us ˆ define ζ = argmaxζ L(ζ; y) to be the maximum likelihood estimator (or MLE) of ζ. ˆ We can think of ζ as the value of ζ that maximizes the probability of observing the sample y. Jared Tobin Approximate conditional inference
  • 8. Nuisance parameter models Now consider a probabilistic model Y ∼ f (y ; ζ) where ζ = (θ, ψ), ζ ∈ RP θ∈R ψ ∈ RP−1 If we are only interested in θ, then ψ is called a nuisance parameter, or incidental parameter. It turns out that these things are aptly named.. Jared Tobin Approximate conditional inference
  • 9. The nuisance parameter problem: example Take a very simple example; a H-strata model with two observations per stratum. Yh1 and Yh2 are iid N(µh , σ 2 ) random variables, h = 1, . . . , H, and we are interested in estimating σ 2 . Define µ = (µ1 , . . . , µH ). Assuming σ 2 is known, the log-likelihood for µ is (yh1 − µh )2 + (yh2 − µh )2 l(µ; σ 2 , y) = − log 2πσ 2 − 2σ 2 h and the MLE for the hth stratum, µh , is that stratum’s sample ˆ mean (yh1 + yh2 )/2. Jared Tobin Approximate conditional inference
  • 10. Example, continued.. To estimate σ 2 , however, we must use that estimate for µ. It is common to use the profile likelihood, defined for this example as l (P) (σ 2 ; µ, y) = sup l(µ, σ 2 ; y) ˆ µ to estimate σ 2 . Maximizing yields 1 (yh1 − µh )2 + (yh2 − µh )2 ˆ ˆ σ2 = ˆ H 2 h as the MLE for σ 2 . Jared Tobin Approximate conditional inference
  • 11. Example, continued.. Let’s check for bias.. let Sh = [(Yh1 − µh )2 + (Yh2 − µh )2 ]/2 and 2 ˆ ˆ note that Sh2 = (Y − Y )2 /4 and σ 2 = H −1 ˆ 2. h1 h2 h Sh Some algebra shows that 1 E [Sh ] = (varYh1 + µ2 + varYh2 + µ2 − 2E [Yh1 Yh2 ]) 2 h h 4 and since Yh1 , Yh2 are independent, E [Yh1 Yh2 ] = µ2 so that h E [Sh ] = σ 2 /2. 2 Jared Tobin Approximate conditional inference
  • 12. Example, continued.. Put it together and we have 1 E [ˆ 2 ] = σ 2 E [Sh ] H h 1 σ2 = H 2 h σ2 = 2 No big deal - everyone knows the MLE for σ 2 is biased.. But notice the implication for consistency.. lim P(|ˆ 2 − σ 2 | < ) = 0 σ n→∞ Jared Tobin Approximate conditional inference
  • 13. Neyman-Scott problems This result isn’t exactly new.. It was described by Neyman & Scott as early as 1948. That’s merit for the name; this type of problem is typically known as a Neyman-Scott problem in the literature. The problem is that one of the required regularity conditions is not met. We usually require that the dimension of (µ, σ 2 ) remain constant for increasing sample size.. H But notice that n = h=1 2, so n → ∞ iff H → ∞ iff dim(µ) → ∞. Jared Tobin Approximate conditional inference
  • 14. The profile likelihood In a general setting where Y ∼ f (y ; ψ, θ) with nuisance parameter −1 ψ, consider the partial information iθθ|ψ = iθθ − iθψ iψψ iψθ . (P) It can be shown that the profile expected information iθθ is first-order equivalent to iθθ .. (P) so iθθ|ψ < iθθ in an asymptotic sense. In other words, the profile likelihood places more weight on information about θ than it ’ought’ to be doing. Jared Tobin Approximate conditional inference
  • 15. Two-index asymptotics for the profile likelihood Take a general stratified model again.. Yh ∼ f (yh ; ψh , θ) with H strata. Remove the per-stratum sample size restriction of nh = 2. Now, if we let both nh and H approach infinity, we get different results depending on the speed at which nh → ∞ and H → ∞. ˆ If nh → ∞ faster than H does, we have that θ − θ = Op (n−1/2 ). −1 If H → ∞ faster, the same difference is of order Op (nh ). In other words, we can probably expect the profile likelihood to make relatively poor estimates of θ if H > nh on average. Jared Tobin Approximate conditional inference
  • 16. Alternatives to using the profile likelihood This type of model comes up a lot in practice.. (example later) What solutions are available to tackle the nuisance parameter problem? For the normal nuisance parameter model, the method of moments estimator is an option (and happens to be unbiased & consistent). Jared Tobin Approximate conditional inference
  • 17. Alternatives to using the profile likelihood But what if the moments themselves involve multiple parameters? ... then it may be difficult or impossible to construct a method of moments estimator. Also, likelihood-based estimators generally have desirable statistical properties, and it would be nice to retain those. There may be a way to patch up the problem with ‘standard’ ML in these models.. hint: there is. my plane ticket was $2300 CAD, so I’d better have something entertaining to tell you.. Jared Tobin Approximate conditional inference
  • 18. Motivating example This research started when looking at a stock assessment problem. Take the waters of the coast of Newfoundland & Labrador, which historically supported the largest fishery of Atlantic cod in the world. (keyword: historically) Jared Tobin Approximate conditional inference
  • 19. Motivating example (continued) Here’s the model: divide these waters (the stock area) into N equally-sized sampling units, where the j th unit, j = 1, . . . , N contains λj fish. Each sampling unit corresponds to the area over the ocean bottom covered by a standardized trawl tow made at a fixed speed and duration, N Then the total number of fish in the stock is λ = j=1 λj , and we want to estimate this. In practice we estimate a measure of trawlable abundance. We weight λ by the probability of catching a fish on any given tow, q, and estimate µ = qλ. Jared Tobin Approximate conditional inference
  • 20. Motivating example (continued) DFO conducts two research trawl surveys on these waters every year using a stratified random sampling scheme.. Jared Tobin Approximate conditional inference
  • 21. Motivating example (continued) For each stratum h, h = 1, . . . , H, we model an observed catch as Yh ∼ negbin(µh , k). The negative binomial mass function is yh k Γ(yh + k) µh k P(Yh = yh ; µh , k) = Γ(yh + 1)Γ(k) µh + k µh + k and it has mean µh and variance µh + k −1 µ2 . h Jared Tobin Approximate conditional inference
  • 22. Motivating example (continued) We have a nuisance parameter model.. if we want to make interval estimates for µ, we must estimate the dispersion parameter k. Breezing through the literature will suggest any number of increasingly esoteric ways to do this.. method of moments pseudo-likelihood optimal quadratic estimating equations extended quasi-likelihood adjusted extended quasi-likelihood double extended quasi-likelihood etc. WTF Jared Tobin Approximate conditional inference
  • 23. Motivating example (continued) Which of these is best?? (and why are there so many??) Noel and I wrote a paper that tried to answer the first question.. .. unfortunately, we also wound up adding another long-winded estimator to the list, and so increased the scope of the second. We coined something called the ‘adjusted double-extended quasi-likelihood’ or ADEQL estimator, which performed best in our simulations. Jared Tobin Approximate conditional inference
  • 24. Motivating example (fini) When writing my Master’s thesis, I wanted to figure out why this estimator worked as well as it did? And what exactly are all the other ones? This involved looking into functional approximations and likelihood asymptotics.. .. but I managed to uncover some fundamental answers that simplified the whole nuisance parameter estimation mumbojumbo. Jared Tobin Approximate conditional inference
  • 25. Conditional inference For simplicity, recall the general nonstratified nuisance parameter model.. i.e. Y ∼ f (y ; ψ, θ). Start with some theory. Let (t1 , t2 ) be jointly sufficient for (ψ, θ) and let a be ancillary. If we could factorize the likelihood as L(ψ, θ) ≈ L(θ; t2 |a)L(ψ, θ; t1 |t2 , a) or L(ψ, θ) ≈ L(θ; t2 |t1 , a)L(ψ, θ; t1 |a) then we could maximize L(θ; t2 |a), called the marginal likelihood, or L(θ; t2 |t1 , a), the conditional likelihood, to obtain an estimate of θ. Jared Tobin Approximate conditional inference
  • 26. Conditional inference Each of these functions condition on statistics that contain all of the information about θ, and negligible information about ψ. They seek to eliminate the effect of ψ when estimating θ, and thus theoretically solve the nuisance parameter problem. (both the marginal and conditional likelihoods are special cases of Cox’s partial likelihood function) Jared Tobin Approximate conditional inference
  • 27. Approximate conditional inference? Disclaimer: theory vs. practice It is pretty much impossible to show that a factorization like this even exists in practice. The best we can usually do is try to approximate the conditional or marginal likelihood. Jared Tobin Approximate conditional inference
  • 28. Approximations There are many, many ways to approximate functions and integrals.. We’ll briefly touch on an important one.. (recall the title of this talk for a hint) .. the saddlepoint approximation, which is a highly accurate approximation to arbitrary functions. Often it’s capable of outperforming more computationally demanding methods, i.e. Metropolis Hastings/Gibbs MCMC. For a few cases (normal, gamma, inverse gamma densities), it’s even exact. Jared Tobin Approximate conditional inference
• 29. Laplace approximation Familiar with the Laplace approximation? We need it first. We're interested in $\int_a^b f(y;\theta)\,dy$ for some a < b, and the idea is to write the integrand as $e^{-g(y;\theta)}$, for $g(y;\theta) = -\log f(y;\theta)$. Truncate a Taylor expansion of $g(y;\theta)$ about $\hat y = \operatorname{argmin}_y g$ on (a, b), then integrate over (a, b).. we wind up integrating the kernel of a $N(\hat y,\, 1/g''(\hat y))$ density and get

$$ \int_a^b f(y;\theta)\,dy \;\approx\; \exp\{-g(\hat y)\} \left(\frac{2\pi}{g''(\hat y)}\right)^{1/2}. $$

It works because the value of the integral depends mainly on $g(\hat y)$ (the value at the minimum of g, i.e. where f is largest on (a, b)) and $g''(\hat y)$ (the curvature there). Jared Tobin Approximate conditional inference
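A quick numerical sketch of the recipe (the helper name, integrand, and interval below are arbitrary choices of mine, not from the talk): minimize $g = -\log f$, take the curvature by finite differences, and compare with R's integrate().

```r
## Laplace approximation to int_a^b f(y) dy, following the recipe above.
## g = -log f is minimized numerically; g'' is taken by central differences.
laplace_int <- function(f, a, b, eps = 1e-4) {
  g    <- function(y) -log(f(y))
  yhat <- optimize(g, c(a, b))$minimum                            # minimizer of g, maximizer of f
  gpp  <- (g(yhat + eps) - 2 * g(yhat) + g(yhat - eps)) / eps^2   # curvature g''(yhat)
  exp(-g(yhat)) * sqrt(2 * pi / gpp)
}

## Example integrand: an (unnormalized) gamma-like kernel.
f <- function(y) y^3 * exp(-y)
c(laplace = laplace_int(f, 0.01, 20),
  numeric = integrate(f, 0.01, 20)$value)
```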
• 30. Saddlepoint approximation We can then do some nifty math in order to refine Taylor's approximation to a function. Briefly, we want to relate the cumulant generating function to its corresponding density by creating an approximate inverse mapping. For K(t) the cumulant generating function, the moment generating function can be written

$$ e^{K(t)} \;=\; \int_{\mathcal{Y}} e^{\,ty + \log f(y;\theta)}\, dy, $$

so fix t and let $g(t, y) = -ty - \log f(y;\theta)$. Laplace's approximation yields

$$ e^{K(t)} \;\approx\; e^{\,t y_t}\, f(y_t;\theta) \left(\frac{2\pi}{g''(t, y_t)}\right)^{1/2}, $$

where $y_t$ solves $g'(t, y_t) = 0$ (derivatives taken in y). Jared Tobin Approximate conditional inference
• 31. Saddlepoint approximation Sparing the nitty gritty, we do some solving and rearranging (particularly involving the saddlepoint equation $K'(t) = y$), and we come up with

$$ f(y_t;\theta) \;\approx\; \left(2\pi K''(t)\right)^{-1/2} \exp\{K(t) - t y_t\}, $$

where $y_t$ solves the saddlepoint equation. This guy is called the unnormalized saddlepoint approximation to f, and is typically denoted $\hat f$. Jared Tobin Approximate conditional inference
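To see this in action, take the gamma case mentioned earlier (a standard worked example, not specific to anything in the talk): for a Gamma(α, λ) density, $K(t) = -\alpha\log(1 - t/\lambda)$ for $t < \lambda$, so the saddlepoint equation $K'(t) = \alpha/(\lambda - t) = y$ gives $\hat t = \lambda - \alpha/y$ and $K''(\hat t) = y^2/\alpha$. Plugging in,

$$ \hat f(y;\alpha,\lambda) \;=\; \left(\frac{2\pi y^2}{\alpha}\right)^{-1/2} \exp\left\{\alpha\log\frac{\lambda y}{\alpha} + \alpha - \lambda y\right\} \;=\; \frac{\sqrt{\alpha/(2\pi)}\; e^{\alpha}}{\alpha^{\alpha}}\; \lambda^{\alpha} y^{\alpha-1} e^{-\lambda y}, $$

which is proportional in y to the exact Gamma(α, λ) density $\lambda^{\alpha} y^{\alpha-1} e^{-\lambda y}/\Gamma(\alpha)$; renormalizing therefore recovers the exact density.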
• 32. Saddlepoint approximation We can normalize the saddlepoint approximation by using $c = \int_{\mathcal{Y}} \hat f(y;\theta)\, dy$. Call $f^*(y;\theta) = c^{-1}\hat f(y;\theta)$ the renormalized saddlepoint approximation. How does it perform relative to Taylor's approximation? For a sample of size n, Taylor's approximation is typically accurate to $O(n^{-1/2})$.. the unnormalized saddlepoint approximation is accurate to $O(n^{-1})$, while the renormalized version does even better at $O(n^{-3/2})$. If small samples are involved, the saddlepoint approximation can make a big difference. Jared Tobin Approximate conditional inference
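A numerical sketch of both versions for the gamma example above (base R only; the shape and rate values are arbitrary):

```r
## Unnormalized and renormalized saddlepoint approximations to a Gamma(shape, rate)
## density, built directly from its cumulant generating function K.
shape <- 3; rate <- 2
K   <- function(t) -shape * log(1 - t / rate)   # CGF, valid for t < rate
Kpp <- function(t)  shape / (rate - t)^2        # K''(t)

fhat <- function(y) {
  that <- rate - shape / y                      # solves the saddlepoint equation K'(t) = y
  (2 * pi * Kpp(that))^(-1/2) * exp(K(that) - that * y)
}

cnorm <- integrate(fhat, 1e-10, Inf)$value      # renormalizing constant c
fstar <- function(y) fhat(y) / cnorm

y <- c(0.5, 1, 2, 4)
rbind(exact        = dgamma(y, shape, rate),
      unnormalized = fhat(y),
      renormalized = fstar(y))                  # renormalized version matches dgamma
```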
• 33. The p∗ formula We could use the saddlepoint approximation to directly approximate the marginal likelihood. (or could we? where would we start?) Best to continue from Barndorff-Nielsen's idea.. he and Cox did some particularly horrific math in the '80s and came up with a second-order approximation to the distribution of the MLE. Briefly, it involves taking a regular exponential family and then using an unnormalized saddlepoint approximation to approximate the distribution of a minimal sufficient statistic.. make a particular reparameterization and renormalize, and you get the p∗ formula:

$$ p^*(\hat\theta;\, \theta) \;=\; \kappa(\theta)\, |j(\hat\theta)|^{1/2}\, \frac{L(\theta;\, y)}{L(\hat\theta;\, y)}, $$

where κ is a renormalizing constant (depending on θ) and j is the observed information. Jared Tobin Approximate conditional inference
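The simplest sanity check (the usual textbook one, not from the slides): for $y_1, \ldots, y_n$ iid $N(\theta, 1)$ we have $\hat\theta = \bar y$, $j(\hat\theta) = n$, and $L(\theta; y)/L(\hat\theta; y) = \exp\{-\tfrac{n}{2}(\hat\theta - \theta)^2\}$, so with $\kappa = (2\pi)^{-1/2}$,

$$ p^*(\hat\theta;\, \theta) \;=\; \sqrt{\frac{n}{2\pi}}\, \exp\left\{-\frac{n}{2}(\hat\theta - \theta)^2\right\}, $$

which is exactly the $N(\theta, 1/n)$ density of $\hat\theta$: in this case the p∗ formula is exact.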
• 34. Putting likelihood asymptotics to work So how does the p∗ formula help us approximate the marginal likelihood? Let t = (t1, t2) be a minimal sufficient statistic and u be a statistic such that both $(\hat\psi, u)$ and $(\hat\psi, \hat\theta)$ are one-to-one transformations of t, with the distribution of u depending only on θ. Barndorff-Nielsen (who else) showed that the marginal density of u can be written as

$$ f(u;\, \theta) \;=\; f(\hat\psi, \hat\theta;\, \psi, \theta)\, \left|\frac{\partial(\hat\psi, \hat\theta)}{\partial(\hat\psi, u)}\right| \bigg/ \left\{ f(\hat\psi_\theta;\, \psi, \theta \mid u)\, \left|\frac{\partial \hat\psi_\theta}{\partial \hat\psi}\right| \right\}. $$

Jared Tobin Approximate conditional inference
• 35. Putting likelihood asymptotics to work It suffices that we know

$$ \frac{\partial \hat\psi_\theta}{\partial \hat\psi^{T}} \;=\; j_{\psi\psi}(\hat\lambda_\theta)^{-1}\, l_{\psi;\hat\psi}(\hat\lambda_\theta;\, \hat\psi, u), $$

where $\hat\lambda_\theta = (\theta, \hat\psi_\theta)$ and $|l_{\psi;\hat\psi}(\hat\lambda_\theta;\, \hat\psi, u)|$ is the determinant of a sample space derivative, defined as the matrix $\partial^2 l(\lambda;\, \hat\lambda, u)/\partial\psi\, \partial\hat\psi^{T}$. We don't need to worry about the other term $|\partial(\hat\psi, \hat\theta)/\partial(\hat\psi, u)|$. It doesn't depend on θ. Jared Tobin Approximate conditional inference
• 36. Putting likelihood asymptotics to work We can use the p∗ formula to approximate both $f(\hat\psi, \hat\theta;\, \psi, \theta)$ and $f(\hat\psi_\theta;\, \psi, \theta \mid u)$. Doing so, we get

$$ \begin{aligned} L(\theta;\, u) \;&\propto\; |j(\hat\psi, \hat\theta)|^{1/2}\, \frac{L(\psi, \theta)}{L(\hat\psi, \hat\theta)} \left( |j_{\psi\psi}(\hat\psi_\theta, \theta)|^{1/2}\, \frac{L(\psi, \theta)}{L(\hat\psi_\theta, \theta)} \right)^{-1} |j_{\psi\psi}(\hat\psi_\theta, \theta)|\, |l_{\psi;\hat\psi}(\hat\psi_\theta, \theta)|^{-1} \\ &\propto\; L(\hat\psi_\theta, \theta)\, |j_{\psi\psi}(\hat\psi_\theta, \theta)|^{1/2}\, |l_{\psi;\hat\psi}(\hat\psi_\theta, \theta)|^{-1} \\ &=\; L^{(P)}(\theta;\, \hat\psi_\theta)\, |j_{\psi\psi}(\theta, \hat\psi_\theta)|^{1/2}\, |l_{\psi;\hat\psi}(\hat\psi_\theta, \theta)|^{-1}. \end{aligned} $$

Jared Tobin Approximate conditional inference
• 37. Modified profile likelihood (MPL) Taking the logarithm, we get

$$ l(\theta;\, u) \;\approx\; l^{(P)}(\theta;\, \hat\psi_\theta) + \frac{1}{2} \log\left| j_{\psi\psi}(\theta, \hat\psi_\theta) \right| - \log\left| l_{\psi;\hat\psi}(\theta, \hat\psi_\theta) \right|, $$

known as the modified profile likelihood for θ and denoted $l^{(M)}(\theta)$. As it's based on the saddlepoint approximation, it is a highly accurate approximation to the marginal likelihood l(θ; u) and thus (from before) L(θ; t2 | a). In cases where the marginal or conditional likelihood do not exist, it can be thought of as an approximate conditional likelihood for θ. Jared Tobin Approximate conditional inference
• 38. Two-index asymptotics for the MPL Recall the stratified model with H strata.. how does the modified profile likelihood perform in a two-index asymptotic setting? If nh → ∞ faster than H, we have a similar bound as before: $\hat\theta^{(M)} - \theta = O_p(n^{-1/2})$. The difference this time is that nh must only increase without bound faster than $H^{1/3}$, which is a much weaker condition. If H → ∞ faster than nh, then we have a boost in performance over the profile likelihood, in that $\hat\theta^{(M)} - \theta = O_p(n_h^{-2})$ (as opposed to $O_p(n_h^{-1})$). Jared Tobin Approximate conditional inference
• 39. Modified profile likelihood (MPL) The profile observed information term $j_{\psi\psi}(\theta, \hat\psi_\theta)$ in the MPL corrects the profile likelihood's habit of putting excess information on θ. What about the sample space derivative term $l_{\psi;\hat\psi}$? .. this preserves the structure of the parameterization. If θ and ψ are not parameter orthogonal, this term ensures that parameterization invariance holds. What if θ and ψ are parameter orthogonal? Jared Tobin Approximate conditional inference
• 40. Adjusted profile likelihood (APL) If θ and ψ are orthogonal, we can do without the sample space derivative.. .. we can define

$$ l^{(A)}(\theta) \;=\; l^{(P)}(\theta) - \frac{1}{2} \log\left| j_{\psi\psi}(\theta, \hat\psi_\theta) \right| $$

as the adjusted profile likelihood, which is equivalent to the MPL when θ and ψ are parameter orthogonal. As a special case of the MPL, the APL has comparable performance as long as θ and ψ are approximately orthogonal. Jared Tobin Approximate conditional inference
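A toy sketch of what this buys us for the motivating problem (my own illustration, not code from the paper or thesis): in the stratified negative binomial model the strata means are the nuisance parameters, $\hat\mu_h = \bar y_h$ for every k, the mean and dispersion are orthogonal, and the per-stratum observed information at the profile maximum works out to $n_h k/\{\bar y_h(\bar y_h + k)\}$, so the APL for k can be written down directly.

```r
## Adjusted profile log-likelihood for the negative binomial dispersion k,
## treating the strata means as (orthogonal) nuisance parameters.
## Toy data: H strata with n_h observations each (values chosen arbitrarily).
set.seed(42)
H <- 20; nh <- 10; k_true <- 1.5
y <- lapply(runif(H, 5, 15), function(m) rnbinom(nh, size = k_true, mu = m))

profile_ll <- function(k)        # l^(P)(k): strata means profiled out at ybar_h
  sum(sapply(y, function(yh) sum(dnbinom(yh, size = k, mu = mean(yh), log = TRUE))))

adjustment <- function(k)        # -(1/2) * sum_h log j_{mu_h mu_h}(k, ybar_h)
  -0.5 * sum(sapply(y, function(yh) log(nh * k / (mean(yh) * (mean(yh) + k)))))

apl <- function(k) profile_ll(k) + adjustment(k)

c(ml  = optimize(profile_ll, c(0.01, 50), maximum = TRUE)$maximum,
  aml = optimize(apl,        c(0.01, 50), maximum = TRUE)$maximum)
```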
• 41. MPL vs APL It's interesting to note the nature of the difference between the MPL and APL.. While the MPL arises via the p∗ formula, the APL can actually be derived via a lower-order Laplace approximation to the integrated likelihood:

$$ L(\theta) \;=\; \int_{\mathbb{R}} L(\psi, \theta)\, d\psi \;\approx\; \exp\left\{ l^{(P)}(\theta;\, \hat\psi_\theta) \right\} \left( \frac{2\pi}{\, -\partial^2 l(\psi, \theta)/\partial\psi^2 \big|_{\psi = \hat\psi_\theta}} \right)^{1/2} \;\propto\; L^{(P)}(\theta;\, \hat\psi_\theta)\, \left| j_{\psi\psi}(\hat\psi_\theta) \right|^{-1/2}. $$

Jared Tobin Approximate conditional inference
• 42. MPL vs APL In practice we can often get away with using the APL. It may require assuming that θ and ψ are parameter orthogonal, but this is often the case anyway (e.g. joint mean/dispersion GLMs, or mixed models via REML). In particular, if θ is a scalar, then an orthogonal reparameterization can always be found. This broad applicability means that the adjustment term $-\frac{1}{2}\log|j_{\psi\psi}(\theta, \hat\psi_\theta)|$ can be used in GLMs, quasi-GLMs, HGLMs, etc. Jared Tobin Approximate conditional inference
• 43. Getting back to the problem.. In my paper with Noel, we compared a bunch of estimators for the negative binomial dispersion parameter k.. the most relevant methods to us are
    maximum (profile) likelihood (ML)
    adjusted profile likelihood (AML)
    extended quasi-likelihood (EQL)
    adjusted extended quasi-likelihood (AEQL)
    double extended quasi-likelihood (DEQL)
    adjusted double extended quasi-likelihood (ADEQL)
What insight did the whole likelihood asymptotics exercise shed on this? It showed two branches of estimators and developed a theoretical hierarchy in each.. Jared Tobin Approximate conditional inference
• 44. Insight The EQL function is actually a saddlepoint approximation to an exponential family likelihood..

$$ q^{(P)}(k) \;=\; \sum_{h,i} \left[\, y_{hi}\log\bar y_h + (y_{hi} + k)\log\frac{y_{hi} + k}{\bar y_h + k} - \frac{1}{2}\log(y_{hi} + k) + \frac{1}{2}\log k - \frac{y_{hi}}{12k(y_{hi} + k)} \,\right] $$

.. so it should perform similarly to (but worse than) the MLE. The double extended quasi-likelihood function is actually the EQL function for the strata mean model. And the AEQL function is actually an approximation to the adjusted profile likelihood.. so the adjusted profile likelihood should intuitively perform better. Jared Tobin Approximate conditional inference
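A quick numerical check of that claim (again a toy sketch of mine, not thesis code): the displayed $q^{(P)}(k)$ differs from the exact profile log-likelihood only by k-free constants plus a small Stirling-type remainder, so the two should produce nearly the same $\hat k$.

```r
## Compare the (profile) EQL surface q^(P)(k) above with the exact negative
## binomial profile log-likelihood, on simulated stratified counts.
set.seed(7)
H <- 20; nh <- 10
y <- lapply(runif(H, 5, 15), function(m) rnbinom(nh, size = 2, mu = m))

eql <- function(k) sum(sapply(y, function(yh) {
  yb <- mean(yh)
  sum(yh * log(yb) + (yh + k) * log((yh + k) / (yb + k)) -
      0.5 * log(yh + k) + 0.5 * log(k) - yh / (12 * k * (yh + k)))
}))

loglik <- function(k) sum(sapply(y, function(yh)
  sum(dnbinom(yh, size = k, mu = mean(yh), log = TRUE))))

## The maximizers should be close, since the two surfaces differ only by
## k-free constants and a small remainder.
c(eql = optimize(eql,    c(0.01, 50), maximum = TRUE)$maximum,
  ml  = optimize(loglik, c(0.01, 50), maximum = TRUE)$maximum)
```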
  • 45. Insight In our paper, the results didn’t exactly follow this theoretical pattern.. Jared Tobin Approximate conditional inference
• 46. Insight .. but in that paper we had capped estimates of k at 10k. I decided to throw out (and not resample) nonconverging estimates of k in my thesis. This means I had some information about how many estimates failed to converge, but those estimates didn't throw off my simulation averages. Sure enough, upon doing that, the estimators performed according to the theoretical hierarchy. Jared Tobin Approximate conditional inference
• 47. Estimator performance (thesis)

Table: Average performance measures across all factor combinations, by estimator.

                       ML       AML     EQL      CREQL   LNEQL   CTEQL
    Avg. abs. % bias   109.00   30.00   110.00   31.00   33.00   26.00
    Avg. MSE           10.21    0.47    10.22    0.47    1.58    0.55
    Avg. prop. NC      0.09     0.00    0.09     0.00    0.02    0.00

Table: Ranks of estimator by criterion. Overall rank is calculated as the ranked average of all other ranks.

                       ML       AML     EQL      CREQL   LNEQL   CTEQL
    Avg. abs. % bias   5        2       6        3       4       1
    Avg. MSE           5        1       6        2       4       3
    Avg. prop. NC      5.5      2       5.5      2       4       2
    Overall rank       5        1       6        3       4       2

Jared Tobin Approximate conditional inference
• 48. End of the story.. The odd estimator out was the one we originally called ADEQL in our paper. It's the k-root of this guy:

$$ \sum_{h,i} \left[\, 2k\log\frac{\bar y_h + k}{y_{hi} + k} + \frac{k}{y_{hi} + k} - \frac{y_{hi}(y_{hi} + 2k)}{6k(y_{hi} + k)^2} \,\right] - (n - H) \;=\; 0 $$

The whole saddlepoint deal revealed the DEQL part was really just EQL.. so really it's an adjustment of an approximation to the profile likelihood, and the adjustment itself is degrees-of-freedom based. It performed very well; best in our paper, and second only to the adjusted profile likelihood in my thesis. I since called it the Cadigan-Tobin EQL (or CTEQL) estimator for k. Jared Tobin Approximate conditional inference
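Solving that estimating equation for its k-root is a one-liner with uniroot; here is a toy sketch on simulated data (my own illustration, not the thesis code):

```r
## Solve the CTEQL estimating equation above for k with uniroot, on simulated
## stratified negative binomial counts (arbitrary parameter values).
set.seed(11)
H <- 30; nh <- 4; k_true <- 2
y <- lapply(runif(H, 5, 15), function(m) rnbinom(nh, size = k_true, mu = m))
n <- H * nh

cteql_eq <- function(k) {
  sum(sapply(y, function(yh) {
    yb <- mean(yh)
    sum(2 * k * log((yb + k) / (yh + k)) + k / (yh + k) -
        yh * (yh + 2 * k) / (6 * k * (yh + k)^2))
  })) - (n - H)
}

uniroot(cteql_eq, c(0.05, 100))$root    # CTEQL estimate of k
```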
• 49. Future research direction: Integrated likelihood
    Should the adjusted profile likelihood be adopted as a 'standard' way to remove nuisance parameters?
    How does the adjusted profile likelihood compare to the MPL if we depart from parameter orthogonality?
    If we do find poor performance under parameter non-orthogonality, how difficult is it to approximate a sample space derivative in general? Can autodiff assist with this? Or is there some neat saddlepoint-like approximation that will do the trick?
Jared Tobin Approximate conditional inference
  • 50. Acronym treadmill.. Russell calls integrated likelihood ’GREML’, for Generalized REstricted Maximum Likelihood. Tacking on ‘INference’ shows the dangers of acronymization.. We’re oh-so-close to GREMLIN.. (hopefully that won’t be my greatest contribution to statistics.. ) Jared Tobin Approximate conditional inference
• 51. Future research direction: Other interests
    Machine learning
    Information geometry and asymptotics
    Quantum information theory and L2-norm probability (?)
I would be happy to work with anyone in any of these areas! Jared Tobin Approximate conditional inference
  • 52. Contact Email: jared@jtobin.ca Skype: jahredtobin Website/Blog: http://jtobin.ca MAS Thesis: http://jtobin.ca/jTobin MAS thesis.pdf Jared Tobin Approximate conditional inference