1. Online EM Algorithm and Some Extensions
Olivier Cappé
Télécom ParisTech & CNRS
March 2011
2. Online Estimation for Missing Data Models
Based on (C & Moulines, 2009) and (C, 2010)
Goals
1 Maximum likelihood estimation, or
1' Competitive with maximum likelihood estimation when #obs. is large
2 Good scaling (performance vs. computational cost) as #obs. increases
3 Process data on-the-fly (no storage)
4 Simple to implement (no line-search, projection, preconditioning, etc.)
3. Outline
1 The EM Algorithm in Exponential Families
2 The Limiting EM Recursion
3 Online EM Algorithm
The Algorithm
Properties and Discussion
4 Use for Batch ML Estimation
5 Extensions
6 References
4. The EM Algorithm in Exponential Families
Missing Data Model
A missing data model is a statistical model $\{p_\theta(x, y)\}_{\theta \in \Theta}$ in which only $Y$ may be observed (the couple $(X, Y)$ is referred to as the complete data)
Hence, parameter estimates $\theta_n$ must be a function of the observations $Y_1, \dots, Y_n$ only (here assumed to be independent and identically distributed)
Of course, the statistical model could also be defined as $\{f_\theta(y)\}_{\theta \in \Theta}$, where $f_\theta(y) = \int p_\theta(x, y)\,dx$, but the specific structure of $f_\theta$ needs to be exploited
To analyze the methods, the data $\{Y_t\}_{t \geq 1}$ is assumed to be generated by an i.i.d. process with marginal $\pi$, not necessarily equal to $f_\theta$
5. The EM Algorithm in Exponential Families
Finite Mixture Model
Mixture PDF
$$f(y) = \sum_{i=1}^m \alpha_i f_i(y)$$
Missing Data Interpretation
$$P(X_t = i) = \alpha_i, \qquad Y_t \mid X_t = i \sim f_i(y)$$
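To make the missing-data interpretation concrete, here is a minimal simulation sketch (Python; the two-component Poisson mixture anticipates the example a few slides below, and all parameter values and names are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical mixture parameters: weights alpha and Poisson means lam
alpha = np.array([0.3, 0.7])
lam = np.array([2.0, 9.0])

# complete data: draw the latent label X_t, then Y_t given X_t
n = 2_000
X = rng.choice(len(alpha), size=n, p=alpha)   # missing component labels
Y = rng.poisson(lam[X])                       # the only observed part
```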
6. The EM Algorithm in Exponential Families
To determine the maximum likelihood estimate
$$\theta_n = \arg\max_\theta \sum_{t=1}^n \log f_\theta(Y_t)$$
numerically, the standard approach is the following.
Expectation-Maximization (Dempster, Laird & Rubin, 1977)
Given a current parameter guess $\theta_n^k$
E-Step Compute
$$q_{n,\theta_n^k}(\theta) = \frac{1}{n} \sum_{t=1}^n E_{\theta_n^k}\left[\, \log p_\theta(X_t, Y_t) \mid Y_t \,\right]$$
M-Step Update the parameter estimate to
$$\theta_n^{k+1} = \arg\max_{\theta \in \Theta} q_{n,\theta_n^k}(\theta)$$
7. The EM Algorithm in Exponential Families
Rationale
1 It is an ascent algorithm (shown using Jensen's inequality)
Figure: The EM intermediate quantity is a minorizing surrogate
2 Because of the Fisher relation, the algorithm can only stop at a stationary point of the log-likelihood*
* See (Wu, 1983) for the necessary topological and regularity assumptions
8. The EM Algorithm in Exponential Families
An Example: Poisson Mixture
Likelihood
$$f_\theta(Y) = \sum_{j=1}^m \alpha_j \frac{\lambda_j^Y}{Y!} e^{-\lambda_j}$$
"Complete-Data" Log-Likelihood
$$\log p_\theta(X, Y) = -\log(Y!) + \sum_{j=1}^m \left[\log(\alpha_j) - \lambda_j\right] \mathbf{1}\{X = j\} + \sum_{j=1}^m \log(\lambda_j)\, Y\, \mathbf{1}\{X = j\}$$
9. The EM Algorithm in Exponential Families
EM Algorithm for the Poisson Mixture
EM E-Step
$$q_{n,\theta_n^k}(\theta) = \frac{1}{n} \sum_{j=1}^m \sum_{t=1}^n \left[\log(\alpha_j) - \lambda_j\right] P_{\theta_n^k}(X_t = j \mid Y_t) + \frac{1}{n} \sum_{j=1}^m \sum_{t=1}^n \log(\lambda_j)\, Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)$$
EM M-Step
$$\alpha_{n,j}^{k+1} = \frac{1}{n} \sum_{t=1}^n P_{\theta_n^k}(X_t = j \mid Y_t), \qquad \lambda_{n,j}^{k+1} = \frac{\sum_{t=1}^n Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)}{\sum_{t=1}^n P_{\theta_n^k}(X_t = j \mid Y_t)}$$
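A minimal numpy sketch of this batch EM iteration, reusing the simulated `Y` from the earlier snippet (function and variable names are mine, not from the slides):

```python
import numpy as np

def em_step_poisson(Y, alpha, lam):
    """One batch EM iteration for an m-component Poisson mixture."""
    # E-step: posterior membership probabilities P(X_t = j | Y_t), shape (n, m);
    # the -log(Y!) term is constant across components and cancels on normalization
    logw = np.log(alpha) + Y[:, None] * np.log(lam) - lam
    p = np.exp(logw - logw.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # M-step: closed-form updates given the posterior probabilities
    alpha_new = p.mean(axis=0)
    lam_new = (p * Y[:, None]).sum(axis=0) / p.sum(axis=0)
    return alpha_new, lam_new

# a few EM iterations from a rough starting point
alpha_hat, lam_hat = np.array([0.5, 0.5]), np.array([1.0, 5.0])
for _ in range(20):
    alpha_hat, lam_hat = em_step_poisson(Y, alpha_hat, lam_hat)
```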
10. The EM Algorithm in Exponential Families
Exponential Family Model
In the following, we assume that the complete-data model belongs to an exponential family
(Curved) Exponential Family Model
$$p_\theta(x, y) = \exp\left(\langle s(x, y), \psi(\theta) \rangle - A(\theta)\right)$$
where $s(x, y)$ is the vector of (complete-data) sufficient statistics
Explicit Complete-Data Maximum Likelihood
$$S \mapsto \bar{\theta}(S) = \arg\max_\theta\, \langle S, \psi(\theta) \rangle - A(\theta)$$
is available in closed form
11. The EM Algorithm in Exponential Families
The EM Algorithm Revisited
The k-th EM Iteration (From n Observations)
E-Step
$$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^n E_{\theta_n^k}\left[\, s(X_t, Y_t) \mid Y_t \,\right]$$
M-Step
$$\theta_n^{k+1} = \bar{\theta}\left(S_n^{k+1}\right)$$
12. The Limiting EM Recursion
A Key Remark
The k-th EM Iteration (From n Observations)
E-Step
$$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^n E_{\theta_n^k}\left[\, s(X_t, Y_t) \mid Y_t \,\right]$$
M-Step
$$\theta_n^{k+1} = \bar{\theta}\left(S_n^{k+1}\right)$$
Can be fully reparameterized in the domain of sufficient statistics
$$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^n E_{\bar{\theta}(S_n^k)}\left[\, s(X_t, Y_t) \mid Y_t \,\right]$$
13. The Limiting EM Recursion
The Limiting EM Recursion
By letting n tend to infinity, one obtains two equivalent updates:
Sufficient Statistics Update
$$S^k = E_\pi\left(E_{\bar{\theta}(S^{k-1})}\left[\, s(X_1, Y_1) \mid Y_1 \,\right]\right)$$
Parameter Update
$$\theta^k = \bar{\theta}\left\{E_\pi\left(E_{\theta^{k-1}}\left[\, s(X_1, Y_1) \mid Y_1 \,\right]\right)\right\}$$
Using the usual EM arguments, these updates are such that
1 The Kullback-Leibler divergence $D(\pi | f_{\theta^k})$ is monotonically decreasing with k
2 They converge to $\{\theta : \nabla_\theta D(\pi | f_\theta) = 0\}$
14. The Limiting EM Recursion
Batch EM Is Not Efficient for Large Data Records
see also (Neal & Hinton, 1999)
Figure: Convergence of batch EM estimates of $\|u\|^2$ as a function of the number of EM iterations (1 to 20) for 2,000 (top) and 20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE (from [C, 2010]).
15. Online EM Algorithm: The Algorithm
The Online EM Algorithm
The online EM algorithm outputs one updated parameter estimate $\theta_n$ after processing each individual observation $Y_n$
The parameter update is very similar to applying the EM algorithm to the single observation $Y_n$ (with smoothing)
The memory footprint of the algorithm is constant, while its computational cost is proportional to the number of processed observations
16. Online EM Algorithm: The Algorithm
Online EM: Rationale
We try to locate the solutions of
$$E_\pi\left(E_{\bar{\theta}(S)}\left[\, s(X_1, Y_1) \mid Y_1 \,\right]\right) - S = 0$$
Viewing $E_{\bar{\theta}(S)}\left[\, s(X_n, Y_n) \mid Y_n \,\right]$ as a noisy observation of $E_\pi\left(E_{\bar{\theta}(S)}\left[\, s(X_1, Y_1) \mid Y_1 \,\right]\right)$, this is exactly the usual Stochastic Approximation (or Robbins-Monro) setup:
$$S_n = S_{n-1} + \gamma_n \left(E_{\bar{\theta}(S_{n-1})}\left[\, s(X_n, Y_n) \mid Y_n \,\right] - S_{n-1}\right)$$
where $(\gamma_n)$ is a sequence of decreasing positive stepsizes
17. Online EM Algorithm: The Algorithm
The Algorithm
Online EM Algorithm
Stochastic E-Step
$$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}\left[\, s(X_n, Y_n) \mid Y_n \,\right]$$
M-Step
$$\theta_n = \bar{\theta}(S_n)$$
Practical Recommendations
$\gamma_n = 1/n^\alpha$ with $\alpha \in [0.6, 0.7]$
Don't do the M-step for the first 10–20 obs.
(optional) Use Polyak-Ruppert averaging (requires choosing $n_0$)
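A minimal generic sketch of this recursion (Python; the `cond_expect_s` and `theta_bar` callbacks stand for the model-specific quantities $E_\theta[s(X, Y) \mid Y]$ and $\bar{\theta}(\cdot)$, and the default settings reflect the recommendations above; all names are mine):

```python
import numpy as np

def online_em(Y, cond_expect_s, theta_bar, S0, step_exp=0.6, m_delay=20):
    """Generic online EM: stochastic E-step on the sufficient statistics,
    followed by the explicit M-step theta = theta_bar(S)."""
    S = np.asarray(S0, dtype=float)
    theta = theta_bar(S)
    for n, y in enumerate(Y, start=1):
        gamma = n ** (-step_exp)                        # gamma_n = 1/n^alpha
        S = (1 - gamma) * S + gamma * cond_expect_s(y, theta)
        if n > m_delay:                                 # skip the first M-steps
            theta = theta_bar(S)
    return theta
```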
18. Online EM Algorithm: The Algorithm
Online EM in the Poisson Mixture Example
SA E-Step
Computing Conditional Expectations
$$p_{n,j} = \frac{\alpha_{n-1,j}\, \lambda_{n-1,j}^{Y_n}\, e^{-\lambda_{n-1,j}}}{\sum_{i=1}^m \alpha_{n-1,i}\, \lambda_{n-1,i}^{Y_n}\, e^{-\lambda_{n-1,i}}}$$
Statistics Update (Stochastic Approximation)
$$S_{n,j}^\alpha = (1 - \gamma_n) S_{n-1,j}^\alpha + \gamma_n p_{n,j}$$
$$S_{n,j}^\lambda = (1 - \gamma_n) S_{n-1,j}^\lambda + \gamma_n p_{n,j} Y_n$$
M-Step: Parameter Update
$$\alpha_{n,j} = S_{n,j}^\alpha, \qquad \lambda_{n,j} = S_{n,j}^\lambda / S_{n,j}^\alpha$$
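Instantiating the generic sketch above for the Poisson mixture might look as follows (a sketch under the same hypothetical setup as the earlier snippets; the log-domain weight computation is a numerical-stability choice of mine):

```python
import numpy as np

def online_em_poisson(Y, alpha, lam, step_exp=0.6, m_delay=20):
    """Online EM for a Poisson mixture; S_alpha and S_lam are the running
    sufficient statistics, updated by stochastic approximation."""
    S_alpha, S_lam = alpha.copy(), alpha * lam
    for n, y in enumerate(Y, start=1):
        gamma = n ** (-step_exp)
        # conditional expectation: posterior membership probabilities p_{n,j}
        logw = np.log(alpha) + y * np.log(lam) - lam
        p = np.exp(logw - logw.max())
        p /= p.sum()
        # stochastic approximation update of the sufficient statistics
        S_alpha = (1 - gamma) * S_alpha + gamma * p
        S_lam = (1 - gamma) * S_lam + gamma * p * y
        if n > m_delay:       # don't do the M-step for the first observations
            alpha, lam = S_alpha, S_lam / S_alpha
    return alpha, lam
```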
19. Online EM Algorithm: Properties and Discussion
Analysis (C & Moulines, 2009)
Under $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 < \infty$, compactness of $\Theta$ and other regularity assumptions:
1 The estimate $\theta_n$ converges to one of the roots of $\nabla_\theta D(\pi | f_\theta) = 0$
2 The algorithm is asymptotically equivalent to
$$\theta_n = \theta_{n-1} + \gamma_n J^{-1}(\theta_{n-1})\, \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
where $J(\theta) = -E_\pi\left(E_\theta\left[\, \nabla_\theta^2 \log p_\theta(X_1, Y_1) \mid Y_1 \,\right]\right)$
3 For a well-specified model ($\pi = f_{\theta_\star}$) and under Polyak-Ruppert averaging†, $\tilde{\theta}_n$ is Fisher efficient:
$$\sqrt{n}\,(\tilde{\theta}_n - \theta_\star) \xrightarrow{L} N\left(0, I_f^{-1}(\theta_\star)\right)$$
where $I_f(\theta_\star) = -E_{\theta_\star}\left[\nabla_\theta^2 \log f_{\theta_\star}(Y_1)\right]$
† $\tilde{\theta}_n = \frac{1}{n - n_0} \sum_{t = n_0 + 1}^{n} \theta_t$, with $\gamma_n = n^{-\alpha}$ and $\alpha \in (1/2, 1)$
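Polyak-Ruppert averaging is a simple accumulation on top of the Poisson-mixture sketch above; a hedged illustration (the burn-in $n_0$ is the user's choice, and the names are mine):

```python
import numpy as np

def online_em_poisson_avg(Y, alpha, lam, n0, step_exp=0.6):
    """Online EM with Polyak-Ruppert averaging:
    returns theta_tilde_n = 1/(n - n0) * sum_{t = n0+1}^{n} theta_t."""
    S_alpha, S_lam = alpha.copy(), alpha * lam
    sum_alpha, sum_lam = np.zeros_like(alpha), np.zeros_like(lam)
    for n, y in enumerate(Y, start=1):
        gamma = n ** (-step_exp)
        logw = np.log(alpha) + y * np.log(lam) - lam
        p = np.exp(logw - logw.max())
        p /= p.sum()
        S_alpha = (1 - gamma) * S_alpha + gamma * p
        S_lam = (1 - gamma) * S_lam + gamma * p * y
        alpha, lam = S_alpha, S_lam / S_alpha
        if n > n0:                     # accumulate post-burn-in iterates only
            sum_alpha += alpha
            sum_lam += lam
    return sum_alpha / (len(Y) - n0), sum_lam / (len(Y) - n0)
```

Note that averaging the $\alpha$ iterates is itself a convex combination, so the averaged weights remain on the simplex.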
20. Online EM Algorithm: Properties and Discussion
Some More Details
1 (Andrieu et al., 2005), but also (Delyon, 1994) and (Benaïm, 1999), using the fact that $D(\pi | f_{\bar{\theta}(S)})$ is a Lyapunov function:
$$\left\langle \nabla_S D(\pi | f_{\bar{\theta}(S)})\,,\, \underbrace{E_\pi\left(E_{\bar{\theta}(S)}\left[\, s(X_1, Y_1) \mid Y_1 \,\right]\right) - S}_{\text{mean field}} \right\rangle \leq 0$$
2 Taylor series expansion of $\bar{\theta}$ to establish the equivalence (with remainder a.s. $o(\gamma_n)$)
3 (Pelletier, 1998) to show that
$$\gamma_n^{-1/2}(\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_p^{-1}(\theta_\star)/2\right)$$
in well-specified models (where $I_p$ is the complete-data Fisher information matrix)
General results of (Polyak and Juditsky, 1992) and (Mokkadem and Pelletier, 2006) on averaging
21. Online EM Algorithm: Properties and Discussion
Illustration of Polyak-Ruppert Averaging
Figure: Four superimposed trajectories of the estimate of $u_1$ (first component of $u$) over 2,000 observations, for various algorithm settings ($\alpha = 0.9$, $\alpha = 0.6$ and $\alpha = 0.6$ with Polyak-Ruppert averaging started halfway, from top to bottom). The actual value of $u_1$ is equal to zero.
22. Online EM Algorithm: Properties and Discussion
Performance of Online EM
Figure: Online EM estimates of $\|u\|^2$ for various data sizes (200, 2,000 and 20,000 observations, from left to right) and algorithm settings ($\alpha = 0.9$, $\alpha = 0.6$ and $\alpha = 0.6$ with Polyak-Ruppert averaging started halfway, from top to bottom). The box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data. The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE.
23. Online EM Algorithm: Properties and Discussion
Related Works
(Titterington, 1984) Proposes a gradient algorithm
$$\theta_n = \theta_{n-1} + \gamma_n I_p^{-1}(\theta_{n-1})\, \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
It is asymptotically equivalent to the algorithm previously described for well-specified models ($\pi = f_{\theta_\star}$)
(Neal & Hinton, 1999) Describe an algorithm called incremental EM that is equivalent (up to the first batch scan only) to online EM used with $\gamma_n = 1/n$
(Sato, 2000; Sato & Ishii, 2000) Describe the algorithm and provide some analysis in the flat model case and for Gaussian mixtures
24. Online EM Algorithm: Properties and Discussion
How Does This Work in Practice?
Fine, but don't use‡ $\gamma_n = 1/n$
Simulations in (C & Moulines, 2009) on mixtures of Gaussian regressions
Large-scale experiments on real data in (Liang & Klein, 2009), where the use of mini-batch blocking was found useful: apply the proposed algorithm considering $Y_{mk+1}, Y_{mk+2}, \dots, Y_{m(k+1)}$ as one observation (see the sketch below)
Mini-batch blocking is useful in dealing with mixture-like models with infrequent components
‡ $\gamma_n = \gamma_0/(n_0 + n)$ can be an option but requires carefully setting $\gamma_0$ and $n_0$
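A hedged sketch of mini-batch blocking for the Poisson mixture (the block size `m` is a tuning choice, and the function name and structure are mine, not Liang & Klein's implementation):

```python
import numpy as np

def online_em_poisson_minibatch(Y, alpha, lam, m=32, step_exp=0.6):
    """Online EM treating blocks of m observations as a single observation:
    the E-step statistics are averaged over the block before the SA update."""
    S_alpha, S_lam = alpha.copy(), alpha * lam
    # any leftover observations shorter than m are ignored in this sketch
    for k, start in enumerate(range(0, len(Y) - m + 1, m), start=1):
        block = Y[start:start + m]
        gamma = k ** (-step_exp)
        # posterior membership probabilities for the whole block, shape (m, ncomp)
        logw = np.log(alpha) + block[:, None] * np.log(lam) - lam
        p = np.exp(logw - logw.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # average the per-observation statistics over the block
        S_alpha = (1 - gamma) * S_alpha + gamma * p.mean(axis=0)
        S_lam = (1 - gamma) * S_lam + gamma * (p * block[:, None]).mean(axis=0)
        alpha, lam = S_alpha, S_lam / S_alpha
    return alpha, lam
```

Averaging over a block makes each noisy observation of the mean field less variable, which helps when some components are rarely responsible for an observation.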
25. Online EM Algorithm: Properties and Discussion
Some Intuition About the Weights
If $r_k = (1 - \gamma_k) r_{k-1} + \gamma_k E_k$ for $k \geq 1$:
1 $r_n = \sum_{k=1}^n \omega_k^n E_k + \omega_0^n r_0$ with $\sum_{k=0}^n \omega_k^n = 1$
2 $\omega_k^n = \frac{1}{n+a}$ (for $k \geq 1$) when $\gamma_k = 1/(k + a)$, and is strictly increasing in $k$ otherwise
3 $\sum_{k=1}^n (\omega_k^n)^2 \asymp n^{-\alpha}$ when $\gamma_k = k^{-\alpha}$, with $1/2 < \alpha < 1$
Figure: The weights $\omega_k^n$ as a function of $k$, for $n = 10{,}000$ and $\gamma_k = k^{-\alpha}$ with $\alpha = 1$, $\alpha = 0.9$ and $\alpha = 0.6$ (from top to bottom).
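These weights are easy to compute numerically; a small sketch (names are mine) that can be used to check claims 1–3 above, using $\omega_k^n = \gamma_k \prod_{j=k+1}^n (1 - \gamma_j)$ and $\omega_0^n = \prod_{j=1}^n (1 - \gamma_j)$:

```python
import numpy as np

def em_weights(n, gamma):
    """Weights omega_k^n in r_n = sum_{k=1}^n omega_k^n E_k + omega_0^n r_0
    for the recursion r_k = (1 - gamma_k) r_{k-1} + gamma_k E_k."""
    g = np.array([gamma(k) for k in range(1, n + 1)])
    # suffix[i] = prod_{j = i+1}^{n} (1 - gamma_j), with suffix[n] = 1
    suffix = np.ones(n + 1)
    suffix[:-1] = np.cumprod((1 - g)[::-1])[::-1]
    return g * suffix[1:], suffix[0]      # (omega_1..omega_n, omega_0)

n = 10_000
for a in (1.0, 0.9, 0.6):
    omega, omega0 = em_weights(n, lambda k: k ** (-a))
    # claim 1: weights sum to one; claim 3: sum of squares scales like n^{-alpha}
    print(f"alpha={a}: sum={omega.sum() + omega0:.6f}, "
          f"sum of squares={np.square(omega).sum():.2e}")
```

For $\gamma_k = 1/k$ (claim 2 with $a = 0$) the computed weights are all equal to $1/n$, i.e. plain averaging.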
26. Use for Batch ML Estimation
How to Use Online EM for Batch ML Estimation?
The most popular use of the method is to perform batch ML estimation from very large datasets
Because we did not assume that $\pi = f_\theta$, the previous analysis can be applied to $\pi \equiv$ the empirical measure associated with $Y_1, \dots, Y_n$
Online EM can be used for batch ML estimation by (randomly) scanning the data $Y_1, \dots, Y_n$ (see the sketch below)
Convergence "speed" (with averaging) is $(n_{\text{obs}} \times n_{\text{scans}})^{-1/2}$, versus $\rho^{n_{\text{scans}}}$ for batch EM
This is not a fair comparison in terms of computing time, as the M-step is not free and possible parallelization is ignored
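A minimal sketch of the random-scan use, reusing the `online_em_poisson` function from the earlier snippet (the number of scans and the shuffling scheme are the user's choices):

```python
import numpy as np

def batch_ml_by_random_scans(Y, alpha, lam, n_scans=5, step_exp=0.6, seed=0):
    """Batch ML estimation with online EM: random scans over a finite dataset.
    Feeding each scan in a fresh random order corresponds to taking pi to be
    the empirical measure of Y_1, ..., Y_n."""
    rng = np.random.default_rng(seed)
    stream = np.concatenate([rng.permutation(Y) for _ in range(n_scans)])
    return online_em_poisson(stream, alpha, lam, step_exp=step_exp)
```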
27. Use for Batch ML Estimation
Comparison With Batch and Incremental EM
Figure: Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and online EM, as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points $\theta_0$.
28. Use for Batch ML Estimation
Comparison With Batch and Incremental EM (Contd.)
Figure: Same display for a data record of length N = 1,000.
29. Extensions
Summary
The Good:
Easy (esp. when an EM implementation is available)
Can be used for ML estimation from a batch of observations
Robust w.r.t. stepsize selection (note that the scale is fixed due to the use of convex combinations)
Handles parameter constraints nicely (only requires that $S$ be closed under convex combinations with expected sufficient statistics)
30. Extensions
Summary (Contd.)
The Bad:
Needs the E-step to be explicit
Needs $\bar{\theta}$ to be explicit
Not appropriate for short data records (say, less than 1,000 observations) without cycling
What about non-independent observations?
31. Extensions
Online EM in Latent Factor Models (Ongoing Work)
Many models of the form
$$C_n \mid H_n \sim g_{\sum_{k=1}^K \theta_k H_{n,k}}$$
where $\{g_\lambda\}_{\lambda \in \Lambda}$ is an exponential family of distributions and $H_n$ is a latent random vector of positive weights (probabilistic matrix factorization, discrete component analysis, partial membership models, simplicial mixtures)
Figure: Bayesian network representations of Latent Dirichlet Allocation (LDA)
32. Extensions
Simulated Online EM Algorithm for LDA
For n = 1, ...
Simulated E-step
Simulate $\tilde{H}_n$ given $C_n$ and $\theta_{n-1}$ (in practice, using a short run of Metropolis-Hastings or collapsed Gibbs sampling)
Use the Rao-Blackwellized update
$$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}\left[\, s(Z_n, W_n) \mid W_n, \tilde{H}_n \,\right]$$
M-step: $\theta_n = \bar{\theta}(S_n)$
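The general pattern (simulation in place of an intractable E-step) can be sketched generically; this is not the authors' LDA implementation, and the `sample_latent` and `rb_stat` callbacks are model-specific assumptions of mine:

```python
import numpy as np

def simulated_online_em(C, sample_latent, rb_stat, theta_bar, S0, step_exp=0.6):
    """Generic simulated online EM: the intractable E-step expectation is
    replaced by a Rao-Blackwellized statistic evaluated at a sampled latent
    variable (e.g. drawn by a short MCMC run)."""
    S = np.asarray(S0, dtype=float)
    theta = theta_bar(S)
    for n, c in enumerate(C, start=1):
        gamma = n ** (-step_exp)
        H = sample_latent(c, theta)       # short Metropolis-Hastings / Gibbs run
        S = (1 - gamma) * S + gamma * rb_stat(c, H, theta)
        theta = theta_bar(S)              # explicit M-step
    return theta
```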
33. Extensions
Ignoring the sampling bias, this recursion can be analyzed and has the same asymptotic properties as the online EM algorithm
In particular, for well-specified models,
$$\gamma_n^{-1/2}(\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_f^{-1}(\theta_\star)\right)$$
instead of
$$\gamma_n^{-1/2}(\theta_n - \theta_\star) \xrightarrow{L} N\left(0, I_p^{-1}(\theta_\star)\right)$$
for the "exact" online EM algorithm ($I_p(\theta_\star) = -E_{\theta_\star}\left[\nabla_\theta^2 \log p_{\theta_\star}(X_1, Y_1)\right]$).
34. References
Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. J. Roy. Statist. Soc. B, 71(3):593-613.
Cappé, O. (2011). Online Expectation-Maximisation. To appear in Mengersen, K., Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
Liang, P. & Klein, D. (2009). Online EM for unsupervised models. In Proc. NAACL Conference.
Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in Graphical Models, pages 355-368. MIT Press, Cambridge, MA, USA.
Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent factor models. Submitted.
Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International Conference on Neural Information Processing, 1:476-481.
Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12:407-432.
Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. B, 46(2):257-267.