Maximum likelihood estimation of regularisation parameters in inverse problems: an empirical Bayesian approach
1. Maximum likelihood estimation of regularisation parameters in inverse problems: an empirical Bayesian approach.
V. De Bortoli
joint work with: A.F. Vidal, M. Pereyra, A. Durmus
January 28, 2021
Oxford University
5. General setting
How can we recover an unknown image $x \in \mathbb{R}^d$?
We measure y, related to x by some mathematical model.
For example, in many imaging problems
y = A(x) + w,
for some operator $A : \mathbb{R}^d \to \mathbb{R}^d$ (not necessarily linear) that
might be poorly conditioned or rank deficient, and an unknown
perturbation or “noise” w.
The recovery of x from y is often ill-posed or ill-conditioned,
so we regularise the problem to make it well posed.
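To make the ill-conditioning concrete, here is a minimal sketch under stated assumptions: the forward operator is a hypothetical circular Gaussian blur, the signal is a box function, and the regularisation is a plain Tikhonov (ridge) penalty with a hand-picked weight `alpha`. None of these choices come from the talk; they only illustrate why direct inversion fails and regularisation helps.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Hypothetical forward operator: circular Gaussian blur written as a d x d matrix.
offsets = np.minimum(np.arange(d), d - np.arange(d))  # circular distance to index 0
kernel = np.exp(-0.5 * (offsets / 2.0) ** 2)
kernel /= kernel.sum()
A = np.stack([np.roll(kernel, i) for i in range(d)], axis=0)

# Unknown signal (a box) and noisy measurement y = A x + w.
x_true = np.zeros(d)
x_true[20:40] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(d)

# The blur matrix is invertible but very poorly conditioned.
cond_A = np.linalg.cond(A)

# Naive inversion amplifies the noise in the weakly observed directions.
x_naive = np.linalg.solve(A, y)

# Tikhonov regularisation: minimise ||y - A x||^2 + alpha * ||x||^2.
alpha = 1e-3  # assumed value, chosen by hand for this illustration
x_reg = np.linalg.solve(A.T @ A + alpha * np.eye(d), A.T @ y)

err_naive = np.linalg.norm(x_naive - x_true)
err_reg = np.linalg.norm(x_reg - x_true)
print(f"cond(A) = {cond_A:.2e}, naive error = {err_naive:.2e}, regularised error = {err_reg:.2e}")
```

The weight `alpha` plays the role of the regularisation parameter whose automatic selection is the subject of the rest of the talk.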
6. Bayesian basics
A probabilistic framework to estimate the unknown x and related
quantities (uncertainty, highest posterior density intervals, ...).
Adopting a subjective probability approach, we propose a
prior on x, denoted p(x) (more on the choice of the prior later).
To derive inferences about x from y we postulate a joint
statistical model p(x, y); typically specified via the decomposition
p(x, y) = p(y|x)p(x).
Using this decomposition, we then compute quantities related to
p(x|y) using Bayes’ rule.
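For reference, Bayes' rule applied to the decomposition above can be written out as follows (the integration variable $\tilde{x}$ is only a notational choice):

```latex
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)},
\qquad
p(y) = \int_{\mathbb{R}^d} p(y \mid \tilde{x})\, p(\tilde{x})\, \mathrm{d}\tilde{x}.
```

The marginal $p(y)$ does not depend on x, so for point estimation of x it is enough to know $p(x \mid y) \propto p(y \mid x)\, p(x)$.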
7. Likelihood VS prior information
The decomposition p(x, y) = p(y|x)p(x) has two key ingredients:
The likelihood function: the conditional distribution p(y|x) that
models the data observation process (forward model).
The prior function: the marginal distribution p(x) that models
our knowledge about x “before observing y”.
In our examples, p(y|x) is Gaussian (with positive semi-definite
covariance matrix). This covers many imaging problems provided
that the noise is Gaussian (deblurring, denoising, hyperspectral
unmixing).
Many other choices are possible for the noise (Poisson, binomial, ...),
leading to other (and often more complicated) models.
Usually p(x) enforces desirable properties on the solution
(sparsity in a wavelet basis, smoothness) but new
machine-learning based approaches use data-based priors, see
Song and Ermon (2019).
8. Regularisation parameters and prior
Often the prior will be of the form
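$$p(x \mid \theta) = \frac{\exp[-\langle \theta, \varphi(x) \rangle]}{\int_{\mathbb{R}^d} \exp[-\langle \theta, \varphi(\tilde{x}) \rangle]\, \mathrm{d}\tilde{x}} , \qquad (1)$$
where $\theta \in \mathbb{R}^p$ is a regularisation parameter and $\varphi : \mathbb{R}^d \to \mathbb{R}^p$.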
θ controls the trade-off between the likelihood information and
the prior information.
θ might be hard to select, depending on the problem. There exist
numerous approaches to tune θ:
generalised cross-validation Golub et al. (1979)
L-curve Lawson and Hanson (1995)
the discrepancy principle Morozov (2012)
residual whiteness measures Almeida and Figueiredo (2013)
Stein’s Unbiased Risk Estimator Deledalle et al. (2014)
hierarchical Bayes Pereyra et al. (2013)
empirical Bayes Carlin and Louis (2000)
9. Maximum-a-posteriori (MAP) estimation
A first estimator: Maximum A Posteriori estimation
$$x^\star = \arg\max_{x \in \mathbb{R}^d} p(x \mid y, \theta) = \arg\max_{x \in \mathbb{R}^d} \{ p(y \mid x)\, p(x \mid \theta) \} . \qquad (2)$$
In the convex case, huge literature on the topic (Nesterov (2005);
Nemirovski (2004); Chambolle and Pock (2011))
Fast algorithms in many cases (non-differentiable priors: ISTA,
FISTA Beck and Teboulle (2009), constrained composite
problems Chaux et al. (2009), ADMM Boyd et al. (2011)...)
However, depending on the application, MAP estimation has some
limitations e.g.,
This is a point estimator. Can we trust our estimator? How to
perform model selection?
Is the mode really what we want? (what about the mean or the
median?)
Sensitivity w.r.t. θ.
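As an illustration of MAP estimation with a non-differentiable prior, the following is a minimal ISTA sketch. It assumes a Gaussian likelihood with a random matrix `A` and an ℓ1 prior, i.e. it minimises $\|y - Ax\|^2 / (2\sigma^2) + \theta \|x\|_1$. The problem sizes, the matrix and the values of θ are hypothetical and only serve to show the iteration and its dependence on θ.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def map_objective(x, A, y, sigma, theta):
    """Negative log-posterior up to a constant: data fit plus l1 penalty."""
    return np.sum((y - A @ x) ** 2) / (2 * sigma**2) + theta * np.sum(np.abs(x))

def ista(A, y, sigma, theta, n_iter=500):
    """ISTA: gradient step on the smooth term, then prox of the l1 term."""
    lipschitz = np.linalg.norm(A, 2) ** 2 / sigma**2  # Lipschitz constant of the gradient
    step = 1.0 / lipschitz
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / sigma**2
        x = soft_threshold(x - step * grad, step * theta)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))          # hypothetical measurement matrix
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [2.0, -1.5, 1.0]      # sparse ground truth
sigma = 0.1
y = A @ x_true + sigma * rng.standard_normal(40)

x_map = ista(A, y, sigma, theta=20.0)
x_map_large_theta = ista(A, y, sigma, theta=1e6)  # prior dominates: estimate collapses to 0

print("non-zeros with theta = 20:", np.count_nonzero(x_map))
print("non-zeros with theta = 1e6:", np.count_nonzero(x_map_large_theta))
```

The two runs differ only in θ, which illustrates the sensitivity mentioned in the last bullet: the point estimate alone gives no indication of which value is appropriate.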
10. Illustrative example: astronomical image reconstruction
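Recover $x \in \mathbb{R}^d$ from a low-dimensional degraded observation
$$y = MFx + w,$$
where F is the Fourier transform, $M \in \mathbb{C}^{m \times d}$ is a measurement
operator and w is Gaussian noise. We use the model
$$p(x \mid y) \propto \exp\left( -\|y - MFx\|^2 / (2\sigma^2) - \theta \|\Psi x\|_1 \right) \mathbb{1}_{\mathbb{R}^n_+}(x) . \qquad (3)$$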
Figure 1: Radio-interferometric image reconstruction of the W28
supernova (panels: $y$ and $x^\star$). Image from Repetti et al. (2019)
11. Contribution
Our goal here is to:
Define efficient samplers in high-dimensional space (sampling
from pθ(x|y)).
Estimate the regularisation parameter using the Bayesian
framework.
Our main ingredients:
Markov chain sampling (functional autoregressive models),
Stochastic approximation with Markovian noise,
Empirical Bayes methodology.
13. Langevin diffusion
Sampling from $\pi(x) \propto e^{-U(x)}$: a continuous-time solution
$$\mathrm{d}X_t = -\nabla U(X_t)\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t ,$$
with $(B_t)_{t \ge 0}$ a d-dimensional Brownian motion.
Existence of a unique strong solution for Lipschitz ∇U.
$P_t(x, A) = \mathbb{P}(X_t^x \in A)$ (semigroup of the diffusion).
Ergodicity under weak assumptions Roberts and Tweedie (1996).
Ergodicity of Langevin diffusion
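If there exists $R \ge 0$ such that for any $x \in \mathbb{R}^d$ with $\|x\| \ge R$,
$$\langle \nabla U(x), x \rangle \ge -a \|x\|^2 ,$$
then the diffusion is ergodic, i.e.
$$\lim_{t \to +\infty} \|P_t(x, \cdot) - \pi\|_{\mathrm{TV}} = 0 ,$$
where we recall that $\|\mu - \nu\|_{\mathrm{TV}} = \sup_{A \in \mathcal{B}(\mathbb{R}^d)} \{\mu(A) - \nu(A)\}$.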
14. Euler-Maruyama discretization
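We cannot simulate general continuous-time processes exactly.
The Euler-Maruyama scheme discretizes these continuous-time
dynamics, which gives the Unadjusted Langevin Algorithm (ULA)
$$X_{k+1} = X_k - \gamma \nabla U(X_k) + \sqrt{2\gamma}\, Z_{k+1} ,$$
where $\gamma > 0$ is the step size and $(Z_k)_{k \in \mathbb{N}}$ are i.i.d. Gaussian random variables with zero mean and identity covariance matrix.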
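The ULA recursion is short enough to sketch directly. The example below assumes the toy potential $U(x) = x^2/2$ (standard Gaussian target) and runs many independent one-dimensional chains in parallel; the function name `ula` and all numerical values are illustrative choices. For this particular target the recursion is a Gaussian autoregression whose stationary variance is $1 / (1 - \gamma/2)$ rather than 1, which gives a concrete view of the discretisation bias.

```python
import numpy as np

def ula(grad_U, x0, gamma, n_steps, rng):
    """Run ULA: x <- x - gamma * grad_U(x) + sqrt(2 * gamma) * standard normal noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
gamma = 0.1

# Toy target: U(x) = x^2 / 2, so grad U(x) = x and pi is the standard Gaussian.
# Each entry of the array is an independent one-dimensional chain.
x_final = ula(lambda x: x, np.zeros(20000), gamma, n_steps=500, rng=rng)

empirical_var = x_final.var()
stationary_var = 1.0 / (1.0 - gamma / 2.0)  # variance of the ULA invariant law for this target
print(f"empirical variance = {empirical_var:.3f}, ULA stationary variance = {stationary_var:.3f}")
```

The chain therefore does not target π exactly; the gap shrinks as γ decreases, at the price of slower mixing per step.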
15. Ergodicity results
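Similarly to the continuous-time process, define
$$R_\gamma(x, A) = \mathbb{P}\left( x - \gamma \nabla U(x) + \sqrt{2\gamma}\, Z \in A \right) . \qquad (4)$$
We define
$$R_\gamma f(x) = \int_{\mathbb{R}^d} f(y)\, R_\gamma(x, \mathrm{d}y) = \mathbb{E}[f(X^x)] , \qquad (5)$$
and
$$R_\gamma^n(x, A) = \int_{\mathbb{R}^d} \cdots \int_{\mathbb{R}^d} R_\gamma(x, \mathrm{d}x_2)\, R_\gamma(x_2, \mathrm{d}x_3) \cdots R_\gamma(x_n, A) . \qquad (6)$$
$(R_\gamma^n)_{n \in \mathbb{N}}$ admits an invariant measure $\pi_\gamma$ under a Lyapunov-type
condition $R_\gamma V(x) \le V(x) - \gamma + b\gamma \mathbb{1}_{x \in K}$.
The chain is ergodic, $\lim_{n \to +\infty} \|R_\gamma^n(x, \cdot) - \pi_\gamma\|_{\mathrm{TV}} = 0$, see (Douc
et al., 2018, Theorem 10.2.13, Theorem 11.3.1).
We have $\lim_{\gamma \to 0} \|\pi - \pi_\gamma\|_{\mathrm{TV}} = 0$ (Durmus and Moulines, 2017).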
16. Quantitative convergence bounds
Can we get quantitative convergence rates?
Using Foster-Lyapunov conditions and minorization conditions
Douc et al. (2018) we can obtain geometric convergence for some
distance even without strong convexity.
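We introduce the Wasserstein distance with cost c, given for
any $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ by
$$W_c(\mu, \nu) = \inf_{\pi \in \Lambda(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} c(x, y)\, \mathrm{d}\pi(x, y) , \qquad (7)$$
where $\Lambda(\mu, \nu)$ is the set of couplings between µ and ν.
Example 1: $c(x, y) = \mathbb{1}_{x \neq y}$ → total variation.
Example 2: $c(x, y) = \|x - y\|$ → Wasserstein distance of order 1.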
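Example 2 can be computed exactly in a simple special case. The sketch below assumes two one-dimensional empirical measures with the same number of equally weighted atoms; in that case the optimal coupling for the cost $|x - y|$ matches the sorted samples. The helper name `w1_empirical_1d` is hypothetical.

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """Order-1 Wasserstein distance between two 1D empirical measures
    with the same number of equally weighted atoms (sorted matching)."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes the same number of samples in both sets")
    return float(np.mean(np.abs(xs - ys)))

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
shifted = samples + 2.0  # translating a measure by 2 moves every atom by 2

w1_shift = w1_empirical_1d(samples, shifted)
w1_self = w1_empirical_1d(samples, samples)
print(w1_shift, w1_self)
```

In higher dimensions the infimum over couplings has no such closed form and is usually computed with an optimal transport solver.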
17. Convergence of the EM discretization
Geometric ergodicity of ULA
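Assume that
$$\|\nabla U(x) - \nabla U(y)\| \le L \|x - y\| ,$$
$$\langle \nabla U(x) - \nabla U(y), x - y \rangle \ge m \|x - y\|^2 \quad \text{for } \|x - y\| \ge R .$$
There exist $\bar\gamma > 0$, $D_{\bar\gamma,1}, D_{\bar\gamma,2}, E_{\bar\gamma} \ge 0$ and $\lambda_{\bar\gamma}, \rho_{\bar\gamma} \in [0, 1)$ with
$\lambda_{\bar\gamma} \le \rho_{\bar\gamma}$, such that for any $\gamma \in (0, \bar\gamma]$, $x, y \in \mathbb{R}^d$ and $k \in \mathbb{N}$
$$W_c(\delta_x R_\gamma^k, \delta_y R_\gamma^k) \le \lambda_{\bar\gamma}^{k\gamma/4} [D_{\bar\gamma,1}\, c(x, y) + D_{\bar\gamma,2} \mathbb{1}_{x \neq y}] + E_{\bar\gamma}\, \rho_{\bar\gamma}^{k\gamma/4} \mathbb{1}_{x \neq y} ,$$
where $c(x, y) = \mathbb{1}_{x \neq y}(1 + \|x - y\| / R)$.
The first convergence rate characterizes the forgetting of the initial
conditions.
The second convergence rate characterizes the effective convergence rate.
Independence w.r.t. the dimension d.
Geometric ergodicity w.r.t. $\|\cdot\|_{\mathrm{TV}}$ and $W_1$.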
18. Non-differentiable case
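What if U = f + g with g non-differentiable (but convex)? → use
the Moreau-Yosida envelope.
Different converging schemes
$$X_{k+1} = \operatorname{prox}_g^\gamma\left( X_k - \gamma \nabla f(X_k) + \sqrt{2\gamma}\, Z_{k+1} \right) , \qquad (8)$$
$$X_{k+1} = \operatorname{prox}_g^\gamma(X_k) - \gamma \nabla f(\operatorname{prox}_g^\gamma(X_k)) + \sqrt{2\gamma}\, Z_{k+1} , \qquad (9)$$
$$X_{k+1} = X_k - \gamma\left( \nabla f(X_k) + (X_k - \operatorname{prox}_g^\gamma(X_k)) / \gamma \right) + \sqrt{2\gamma}\, Z_{k+1} . \qquad (10)$$
Note that (10), the Moreau-Yosida Unadjusted Langevin Algorithm
(MYULA), is ULA applied to $f + g^\gamma$, where $g^\gamma$ is the Moreau-Yosida envelope of g.
Geometric convergence under similar conditions as in the
differentiable case (regularity + strong convexity at infinity).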
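A minimal MYULA sketch on a one-dimensional toy target, assuming $f(x) = (x - y)^2 / (2\sigma^2)$ and $g(x) = \theta |x|$, whose proximal operator is soft-thresholding. The smoothing parameter `lam` is kept separate from the step size `gamma`; setting `lam = gamma` gives recursion (10). The numerical values and function names are illustrative only.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|, applied componentwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def myula_step(x, grad_f, prox_g, gamma, lam, rng):
    """One MYULA step: ULA applied to f plus the Moreau-Yosida envelope of g.
    The envelope gradient is (x - prox_g(x, lam)) / lam."""
    grad_envelope = (x - prox_g(x, lam)) / lam
    noise = rng.standard_normal(x.shape)
    return x - gamma * (grad_f(x) + grad_envelope) + np.sqrt(2.0 * gamma) * noise

# Toy 1D posterior: pi(x) proportional to exp(-(x - y)^2 / (2 sigma^2) - theta * |x|).
y_obs, sigma, theta = 2.0, 1.0, 1.0
grad_f = lambda x: (x - y_obs) / sigma**2
prox_g = lambda x, lam: soft_threshold(x, lam * theta)

rng = np.random.default_rng(0)
gamma, lam = 0.01, 0.1          # assumed step size and smoothing parameter
x = np.zeros(2000)              # 2000 independent chains, one per entry
for _ in range(3000):
    x = myula_step(x, grad_f, prox_g, gamma, lam, rng)

posterior_mean = x.mean()
print(f"estimated posterior mean = {posterior_mean:.2f}")
```

The ℓ1 term pulls the posterior mean below the observation y = 2, and the estimate carries a small bias from both the smoothing and the discretisation.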
20. Regularisation parameter MLE
Back to the estimation of θ.
p(x|y, θ) ∝ p(y|x)p(x|θ) . (11)
In this talk we adopt an empirical Bayes approach and consider the
Maximum Likelihood Estimation (MLE)
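$$\theta^\star = \arg\max_{\theta \in \Theta} p(y \mid \theta) = \arg\max_{\theta \in \Theta} \int_{\mathbb{R}^d} p(y, x \mid \theta)\, \mathrm{d}x ,$$
where Θ is some convex compact set in $\mathbb{R}^p$.
We solve it by using a stochastic gradient algorithm driven by
two proximal MCMC kernels.
Given $\theta^\star$, we then compute
$$x^\star = \arg\min_{x \in \mathbb{R}^d} \{ -\log p(y \mid x) - \log p(x \mid \theta^\star) \} , \qquad (12)$$
using efficient algorithms available in the optimization field.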
21. Projected gradient algorithm
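First idea to find the minimizers of $\theta \mapsto -\log p(y \mid \theta)$: use a
projected gradient algorithm
$$\theta_{n+1} = \Pi_\Theta[\theta_n + \delta_n \nabla_\theta \log p(y \mid \theta_n)] , \qquad (13)$$
with $(\delta_n)_{n \in \mathbb{N}}$ some sequence of stepsizes and $\Pi_\Theta$ the projection
onto Θ.
If $\theta \mapsto -\log p(y \mid \theta)$ is convex then this scheme converges towards $\theta^\star$ (if
it is unique).
Problem: $\nabla_\theta \log p(y \mid \theta)$ is intractable.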
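To show the mechanics of the projected recursion when the gradient is available, here is a sketch on a made-up concave objective standing in for the log-likelihood: $\ell(\theta) = -(\theta - 3)^2 / 2$ on the box $\Theta = [0, 2]$. The objective, the box and the step-size constants are assumptions for illustration; the real $\log p(y \mid \theta)$ is not tractable in this way.

```python
import numpy as np

def projected_gradient_ascent(grad, theta0, lower, upper, delta0=0.5, n_iter=50):
    """Iterate theta <- Proj_Theta[theta + delta_n * grad(theta)] on a box Theta."""
    theta = np.array(theta0, dtype=float)
    for n in range(n_iter):
        delta_n = delta0 * (n + 1) ** -0.8          # decreasing step sizes
        theta = np.clip(theta + delta_n * grad(theta), lower, upper)  # box projection
    return theta

# Stand-in objective l(theta) = -(theta - 3)^2 / 2, whose unconstrained maximiser is 3.
grad_loglik = lambda theta: -(theta - 3.0)

theta_hat = projected_gradient_ascent(grad_loglik, theta0=[0.5], lower=0.0, upper=2.0)
print(theta_hat)
```

Because the unconstrained maximiser lies outside Θ, the iterates settle on the boundary point closest to it, which is the constrained maximiser.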
22. Stochastic projected gradient algorithm
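Remark that we have (Fisher’s identity), for $p(x \mid \theta) \propto \exp[-\langle \theta, \varphi(x) \rangle]$,
$$\nabla_\theta \log p(y \mid \theta) = \mathbb{E}_{x \mid y, \theta}[\nabla_\theta \log p(x, y \mid \theta)] = -\mathbb{E}_{x \mid y, \theta}[\varphi(x) + \nabla_\theta \log Z(\theta)] ,$$
where Z(θ) is the normalizing constant
$$Z(\theta) = \int_{\mathbb{R}^d} \exp[-\langle \theta, \varphi(x) \rangle]\, \mathrm{d}x .$$
In addition, since $\nabla_\theta \log Z(\theta) = -\mathbb{E}_{x \mid \theta}[\varphi(x)]$, we get that
$$\nabla_\theta \log p(y \mid \theta) = \mathbb{E}_{x \mid \theta}[\varphi(x)] - \mathbb{E}_{x \mid \theta, y}[\varphi(x)] . \qquad (14)$$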
But, again, most of the time these expectations are intractable.
Similarities with Energy-based models (EBM) for generative
modelling (in this setting θ represents the parameter of a neural
network).
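The identity $\nabla_\theta \log Z(\theta) = -\mathbb{E}_{x \mid \theta}[\varphi(x)]$ used above can be checked numerically in a case where everything is computable. The sketch assumes d = p = 1 and $\varphi(x) = |x|$ (a Laplace prior), for which $Z(\theta) = 2/\theta$ and $\mathbb{E}_{x \mid \theta}[\varphi(x)] = 1/\theta$; the grid and the finite-difference step are arbitrary choices.

```python
import numpy as np

# 1D example with phi(x) = |x|: p(x | theta) is proportional to exp(-theta * |x|).
h = 1e-3
grid = np.arange(-50000, 50001) * h   # symmetric grid on [-50, 50]
phi = np.abs(grid)

def log_Z(theta):
    """log of the normalising constant, approximated by a Riemann sum on the grid."""
    return np.log(np.sum(np.exp(-theta * phi)) * h)

theta = 1.5
eps = 1e-4

# Left-hand side: derivative of log Z by central finite differences.
grad_log_Z = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)

# Right-hand side: minus the prior expectation of phi, by the same Riemann sum.
weights = np.exp(-theta * phi)
mean_phi = np.sum(phi * weights) / np.sum(weights)

print(grad_log_Z, -mean_phi, -1.0 / theta)
```

In realistic imaging models neither side is available in closed form or by quadrature, which is why the expectations are replaced by Markov chain samples.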
23. Our algorithm
In the differentiable case: Stochastic Optimization with Unadjusted
Langevin (SOUL) Algorithm.
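Initialisation $X_0, U_0 \in \mathbb{R}^d$, $\theta_0 \in \Theta$, $(\delta_n)_{n \in \mathbb{N}} = (\delta_0 (n + 1)^{-0.8})_{n \in \mathbb{N}}$.
for n = 0 to N − 1
(i) Markov chain update (MYULA) $X_{n+1}$ with target
$x \mapsto p(x \mid y, \theta_n)$
(ii) Markov chain update (MYULA) $U_{n+1}$ with target
$x \mapsto p(x \mid \theta_n)$
(iii) Stochastic gradient update
$$\theta_{n+1} = \Pi_\Theta[\theta_n + \delta_n (\varphi(U_{n+1}) - \varphi(X_{n+1}))] . \qquad (15)$$
end for
Output The iterates $(\theta_n)_{0 \le n \le N}$.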
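The three steps above can be sketched on a toy model where the answer is known. The assumptions are: a Gaussian prior $p(x \mid \theta) \propto \exp(-\theta \|x\|^2 / 2)$, so $\varphi(x) = \|x\|^2 / 2$, and a denoising model $y = x + w$ with known noise level. Both targets are smooth, so plain ULA kernels are used here instead of MYULA. The burn-in, the averaging of the last iterates and every numerical constant are illustrative choices, not the settings of the talk. For this model $y \sim N(0, (1/\theta + \sigma^2) I)$, so the marginal MLE is available in closed form for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, theta_true = 100, 0.5, 1.0

# Synthetic data from the toy model: x ~ N(0, I / theta), y = x + noise.
x_true = rng.standard_normal(d) / np.sqrt(theta_true)
y = x_true + sigma * rng.standard_normal(d)

def phi(x):
    """Sufficient statistic of the prior p(x | theta) ~ exp(-theta * phi(x))."""
    return 0.5 * np.sum(x**2)

theta_min, theta_max = 0.1, 10.0   # the compact set Theta
gamma = 0.02                       # ULA step size (assumed)
delta0 = 1.0 / d                   # step-size scale (assumed)
n_burn, n_iter = 500, 3000

theta = 2.0
X, U = y.copy(), np.zeros(d)
thetas = []
for n in range(-n_burn, n_iter):
    # (i) ULA step targeting the posterior p(x | y, theta).
    grad_post = (X - y) / sigma**2 + theta * X
    X = X - gamma * grad_post + np.sqrt(2 * gamma) * rng.standard_normal(d)
    # (ii) ULA step targeting the prior p(x | theta).
    U = U - gamma * theta * U + np.sqrt(2 * gamma) * rng.standard_normal(d)
    # (iii) Projected stochastic gradient update (skipped during burn-in).
    if n >= 0:
        delta_n = delta0 * (n + 1) ** -0.8
        theta = float(np.clip(theta + delta_n * (phi(U) - phi(X)), theta_min, theta_max))
        thetas.append(theta)

theta_hat = float(np.mean(thetas[n_iter // 2:]))          # average of the last iterates
theta_star = 1.0 / (np.sum(y**2) / d - sigma**2)          # closed-form marginal MLE
print(f"SOUL-type estimate = {theta_hat:.3f}, closed-form MLE = {theta_star:.3f}")
```

Each iteration uses a single sample from each chain, so the gradient estimate is noisy and biased; the decreasing step sizes average this out over the iterations.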
24. Our algorithm (explicit recursion)
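Initialisation $X_0, U_0 \in \mathbb{R}^d$, $\theta_0 \in \Theta$, $(\delta_n)_{n \in \mathbb{N}} = (\delta_0 (n + 1)^{-0.8})_{n \in \mathbb{N}}$.
for n = 0 to N − 1
(i) Markov chain update (MYULA) $X_{n+1}$
$$X_{n+1} = (1 - \gamma_n / \lambda_n) X_n + \gamma_n \nabla \log p(y \mid X_n, \theta_n) + (\gamma_n / \lambda_n) \operatorname{prox}^{\lambda_n}_{-\log p(\cdot \mid \theta_n)}(X_n) + \sqrt{2\gamma_n}\, Z^1_{n+1} . \qquad (16)$$
(ii) Markov chain update (MYULA) $U_{n+1}$
$$U_{n+1} = (1 - \gamma_n / \lambda_n) U_n + (\gamma_n / \lambda_n) \operatorname{prox}^{\lambda_n}_{-\log p(\cdot \mid \theta_n)}(U_n) + \sqrt{2\gamma_n}\, Z^2_{n+1} . \qquad (17)$$
(iii) Stochastic gradient update
$$\theta_{n+1} = \Pi_\Theta[\theta_n + \delta_n (\varphi(U_{n+1}) - \varphi(X_{n+1}))] . \qquad (18)$$
end for
Output The iterates $(\theta_n)_{0 \le n \le N}$.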
25. Convergence results
Convergence of the averaged sequence
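Assume that
$\theta \mapsto -\log p(y \mid \theta)$ is convex with Lipschitz gradient.
$\|\log p(x)\| \ge \eta \|x\| - c$ for any $x \in \mathbb{R}^d$.
$\sum_{n \in \mathbb{N}} \delta_n = +\infty$, $\sum_{n \in \mathbb{N}} \delta_n \gamma_n^{1/2} < +\infty$, $\sum_{n \in \mathbb{N}} \delta_n^2 \gamma_n^{-2} < +\infty$.
Then almost surely
$$\left\{ \sum_{k=1}^n \delta_k \left( -\log p(y \mid \theta_k) \right) \Big/ \sum_{k=1}^n \delta_k \right\} - \min_{\theta \in \Theta} \left( -\log p(y \mid \theta) \right) \le C \Big/ \left( \sum_{k=1}^n \delta_k \right) . \qquad (19)$$
Other conditions on log p can be considered (tail conditions).
A similar result holds in expectation with explicit bounds.
Possible extension to the non-convex setting (convergence of the
averaged sequence associated with $(\|\nabla f(\theta_k)\|^2)_{k \in \mathbb{N}}$, where $f = -\log p(y \mid \cdot)$).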
26. Deblurring with Total-Variation Prior
               SNR = 20 dB           SNR = 30 dB           SNR = 40 dB
Method         MSE     Time (min)    MSE     Time (min)    MSE     Time (min)
Best           23.29   -             21.39   -             19.06   -
Emp. Bayes     23.50   0.86          21.46   0.85          19.24   0.85
Hier. Bayes    25.07   0.58          22.84   1.27          19.84   3.27
SUGAR          24.44   3.92          24.24   4.50          24.21   4.81
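[Figure: original and degraded images with the estimates $x_{EB}$, $x_{HB}$, $x_{DP}$ and $x_{SUG}$; plots of MSE(θ) against θ (logarithmic axis from 10^-4 to 1) for SNR = 20, 30 and 40, marking the minimum-MSE value and the values selected by empirical Bayes, the discrepancy principle, hierarchical Bayes and SUGAR. Image: flinstones.]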
27. Denoising with Total Generalized Variation
We consider
$$\mathrm{TGV}^2_\theta(u) = \inf_{r \in \mathbb{R}^{2d}} \left\{ \theta_1 \|r\|_{1,2} + \theta_2 \|J(\Delta x - r)\|_{1,\mathrm{Frob}} \right\}$$
(Chambolle and Lions, 1997).
Figure 2: Goldhill image (original, degraded, estimated MAP),
SNR = 12 dB.
29. Denoising with Total Generalized Variation
[Figure: evolution of θ through the iterations, starting from different initial values ($\theta_{\mathrm{init}} = 0.1$, $10$ and $40$).]
31. Conclusion
The Bayesian framework provides a mathematical setting to
compute many statistics on p(x|y, θ).
In this presentation, we focus on the problem of selecting the
regularisation parameter.
Combining tools from Markov chain theory and stochastic
approximation we derive a scheme which provably converges
towards the optimal regularizing parameter in an empirical
Bayesian sense.
The algorithm works well in practice (even in cases not covered
by the theory (yet!)).
32. Perspectives
The inverse problems we consider are still quite
simple/generic (total variation, $\ell_1$ loss, ...). Can we extend our
tools to cover more intricate and problem-specific priors
(wavelet-based or composite priors)?
Can we use more advanced optimization schemes for
sampling/optimization? (dual averaging, mirror descent) Better
convergence guarantees?
Can we use data-based priors Song and Ermon (2019)?
Thank you for your attention!
33. Bibliography:
Mariana SC Almeida and Mário AT Figueiredo. Parameter estimation for blind
and non-blind deblurring using residual whiteness measures. IEEE
Transactions on Image Processing, 22(7):2751–2763, 2013.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm
for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202,
2009.
Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical
learning via the alternating direction method of multipliers. Now Publishers
Inc, 2011.
Bradley P. Carlin and Thomas A. Louis. Empirical Bayes: past, present and
future. J. Amer. Statist. Assoc., 95(452):1286–1289, 2000. ISSN 0162-1459.
doi: 10.2307/2669771. URL https://doi.org/10.2307/2669771.
Antonin Chambolle and Pierre-Louis Lions. Image recovery via total variation
minimization and related problems. Numerische Mathematik, 76(2):167–188,
1997.
Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for
convex problems with applications to imaging. Journal of mathematical
imaging and vision, 40(1):120–145, 2011.
Caroline Chaux, Jean-Christophe Pesquet, and Nelly Pustelnik. Nested iterative
algorithms for convex constrained image recovery problems. SIAM Journal on
Imaging Sciences, 2(2):730–762, 2009.
Charles-Alban Deledalle, Samuel Vaiter, Jalal Fadili, and Gabriel Peyré. Stein
Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter
selection. SIAM Journal on Imaging Sciences, 7(4):2448–2487, 2014.
Randal Douc, Eric Moulines, Pierre Priouret, and Philippe Soulier. Markov
chains. Springer Series in Operations Research and Financial Engineering.
Springer, Cham, 2018. ISBN 978-3-319-97703-4; 978-3-319-97704-1.
A. Durmus and É. Moulines. Nonasymptotic convergence analysis for the
unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 2017.
ISSN 1050-5164.
Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as
a method for choosing a good ridge parameter. Technometrics, 21(2):215–223,
1979.
Charles L Lawson and Richard J Hanson. Solving least squares problems,
volume 15. Siam, 1995.
Vladimir Alekseevich Morozov. Methods for solving incorrectly posed problems.
Springer Science & Business Media, 2012.
Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational
inequalities with Lipschitz continuous monotone operators and smooth
convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):
229–251, 2004.
Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical
programming, 103(1):127–152, 2005.
Marcelo Pereyra, Nicolas Dobigeon, Hadj Batatia, and Jean-Yves Tourneret.
Estimating the granularity coefficient of a Potts-Markov random field within a
Markov chain Monte Carlo algorithm. IEEE Transactions on Image
Processing, 22(6):2385–2397, 2013.
Audrey Repetti, Marcelo Pereyra, and Yves Wiaux. Scalable bayesian uncertainty
quantification in imaging inverse problems via convex optimization. SIAM
Journal on Imaging Sciences, 12(1):87–118, 2019.
G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin
distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
ISSN 1350-7265.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of
the data distribution. arXiv preprint arXiv:1907.05600, 2019.