Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step-size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term, for which we have access to its gradient, and an arbitrary number of potentially non-smooth terms, for which we have access to their proximal operators. The proposed method leverages local information about the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step-size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing a sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non-adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step-size strategy.
1. Adaptive Three Operator Splitting.
Fabian Pedregosa, Gauthier Gidel
2018 International Symposium on Mathematical Programming, Bordeaux
2. Three Operator Splitting (TOS)
• Recently proposed method (Davis and Yin, 2017).
Solves optimization problems of the form
  minimize_{x ∈ R^d}  f(x) + g(x) + h(x),
with access to ∇f, proxγg, proxγh.
• Can be generalized to an arbitrary number of proximal terms.
• Many complex penalties can be written as a sum of proximable terms: overlapping group lasso, ℓ1 trend filtering, isotonic constraints, total variation, intersection of constraints, etc.
5. Importance of step-size
Guaranteed to converge for any step-size γ < 2/L, with L = Lipschitz constant of ∇f.
In practice, best performance is often achieved for γ ≫ 2/L.
[Figure: objective minus optimum vs. iterations for step-sizes γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L.]
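For concreteness, a small NumPy illustration of what L is in the common least-squares case f(x) = (1/(2n))‖Ax − b‖²; there L is the largest eigenvalue of AᵀA/n. The data and names below are assumptions for illustration only, not part of the slides.

import numpy as np

# Hypothetical data; f(x) = 1/(2n) * ||Ax - b||^2 is L-smooth with
# L = largest eigenvalue of A^T A / n = ||A||_2^2 / n.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
n = A.shape[0]
L = np.linalg.norm(A, ord=2) ** 2 / n   # spectral norm squared, divided by n
print("Lipschitz constant of the gradient:", L)
print("classical step-size bound 2/L:", 2 / L)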
6. Motivation
• L is a global upper bound on the Lipschitz constant; locally it can be much smaller.
• Adaptive step-size strategies (aka inexact line search) have a long tradition (Armijo, 1966) and have been adapted to proximal-gradient (Beck and Teboulle, 2009).
• Goal. Can these adaptive step-size methods be adapted to the three operator splitting?
9. Outline
1. Revisiting the Three Operator Splitting
2. Adaptive Three Operator Splitting
3. Experiments
11. Three Operator Splitting (TOS)
• Three operator splitting (Davis and Yin, 2017):
  zt = proxγh(yt)
  xt = proxγg(2zt − yt − γ∇f(zt))
  yt+1 = yt − zt + xt
• Generalization of both Proximal-Gradient and Douglas-Rachford.
• Depends only on one step-size parameter.
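A minimal NumPy-style sketch of the fixed step-size iteration above, under the assumption that the user supplies grad_f and prox operators prox_g(v, gamma), prox_h(v, gamma) returning the prox of gamma*g and gamma*h (hypothetical helper names, not from the slides):

def three_operator_splitting(y, grad_f, prox_g, prox_h, step_size, max_iter=1000):
    # Fixed step-size Davis-Yin three operator splitting for min_x f(x) + g(x) + h(x).
    # prox_g(v, gamma) and prox_h(v, gamma) return the prox of gamma*g and gamma*h at v.
    x = y
    for _ in range(max_iter):
        z = prox_h(y, step_size)                                   # prox step on h
        x = prox_g(2 * z - y - step_size * grad_f(z), step_size)   # forward-backward step on g
        y = y + x - z                                              # auxiliary variable update
    return x

As noted on the slide, with h ≡ 0 (prox_h the identity) this reduces to proximal-gradient on f + g, and with ∇f ≡ 0 it reduces to Douglas-Rachford on g + h.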
14. Revisiting the three operator splitting
1. Introduce ut := (yt − xt)/γ. We can rewrite TOS as
  xt+1 = proxγg(zt − γ(∇f(zt) + ut))
  ut+1 = proxh∗/γ(ut + xt+1/γ)
  zt+1 = xt+1 − γ(ut+1 − ut)
2. Saddle-point reformulation of the original problem:
  min_{x∈R^d} f(x) + g(x) + h(x) = min_{x∈R^d} max_{u∈R^d} f(x) + g(x) + ⟨x, u⟩ − h∗(u) =: L(x, u)
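The min-max identity above follows from biconjugation; a short derivation sketch, assuming h is closed, proper and convex (a standard assumption not stated on the slide):

\[
h^*(u) \;=\; \sup_{x \in \mathbb{R}^d} \{\langle x, u\rangle - h(x)\},
\qquad
h(x) \;=\; h^{**}(x) \;=\; \sup_{u \in \mathbb{R}^d} \{\langle x, u\rangle - h^*(u)\}
\quad \text{(Fenchel--Moreau)},
\]

so substituting the biconjugate of h into min_x f(x) + g(x) + h(x) yields the saddle-point problem min_x max_u L(x, u).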
17. Minimizing with respect to the primal variable
  min_{x∈R^d} L(x, ut) = min_{x∈R^d} [ f(x) + ⟨x, ut⟩ ]  (smooth)  +  g(x)  (proximal)
• Proximal-gradient iteration, with x = zt as starting point:
  xt+1 = proxγg(zt − γ(∇f(zt) + ut)) = first step of TOS
20. Minimizing with respect to the dual variable
  min_{u∈R^d} L(xt, u) = min_{u∈R^d} h∗(u) − ⟨xt, u⟩  (proximal)
• Proximal-point iteration:
  ut+1 = proxσh∗(ut + σxt+1) = second step in TOS with σ = 1/γ
• Third line:
  zt+1 = xt+1 − γ(ut+1 − ut)  (extrapolation step)
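In implementations one typically has access to the prox of h rather than of h∗; the Moreau decomposition recovers the latter from the former. A minimal NumPy sketch under that assumption (prox_l1 and prox_conjugate are illustrative names, not from the slides):

import numpy as np

def prox_l1(v, gamma):
    # prox of gamma * ||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def prox_conjugate(v, sigma, prox):
    # Moreau decomposition: prox_{sigma * h*}(v) = v - sigma * prox_{h / sigma}(v / sigma),
    # where prox(w, gamma) returns prox_{gamma * h}(w).
    return v - sigma * prox(v / sigma, 1.0 / sigma)

# Example: h = ||.||_1, so h* is the indicator of the unit l_inf ball and
# prox_{sigma * h*} is the projection onto [-1, 1]^d for any sigma > 0.
v = np.array([-2.5, 0.3, 1.7])
print(prox_conjugate(v, sigma=0.8, prox=prox_l1))   # -> [-1.   0.3  1. ]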
23. Revisiting the three operator splitting
  xt+1 = proxγg(zt − γ(∇f(zt) + ut))
  ut+1 = proxh∗/γ(ut + xt+1/γ)
  zt+1 = xt+1 − γ(ut+1 − ut)
[Animated figure: trajectory of the iterates in the (x, u) plane over iterations 1 through 6, cycling through a proximal-gradient step (x update), a proximal-point step (u update) and an extrapolation step (z update).]
Take-Home Message
TOS is (basically) alternated proximal-gradient and proximal-point.
Can we adapt the adaptive step-size of proximal-gradient?
37. Adaptive Three Operator Splitting¹
Start with an optimistic step-size γt and decrease it until:
  f(xt+1) ≤ f(zt) + ⟨∇f(zt), xt+1 − zt⟩ + (1/(2γt))‖xt+1 − zt‖²
with xt+1 = proxγt g(zt − γt(∇f(zt) + ut)).
Run the rest of the algorithm with that step-size:
  ut+1 = proxh∗/γt(ut + xt+1/γt)
  zt+1 = xt+1 − γt(ut+1 − ut)
Benefits
• Automatic tuning of the step-size
• (practically) hyperparameter-free
¹ Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning (ICML).
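A minimal NumPy-style sketch of one adaptive iteration; f, grad_f, prox_g and prox_h_conj are hypothetical user-supplied callables (prox_h_conj(v, sigma) returning the prox of sigma*h∗), and the grow/shrink factors are illustrative choices rather than the paper's exact defaults:

def adaptive_tos_step(z, u, gamma, f, grad_f, prox_g, prox_h_conj,
                      grow=1.02, shrink=0.7):
    # Optimistic increase of the step-size, then backtrack until the
    # sufficient-decrease (quadratic upper bound) condition holds at x.
    gamma *= grow
    fz, gz = f(z), grad_f(z)
    while True:
        x = prox_g(z - gamma * (gz + u), gamma)
        diff = x - z
        if f(x) <= fz + gz.dot(diff) + diff.dot(diff) / (2 * gamma):
            break
        gamma *= shrink
    u_next = prox_h_conj(u + x / gamma, 1.0 / gamma)   # proximal-point step on h*
    z_next = x - gamma * (u_next - u)                  # extrapolation step
    return x, u_next, z_next, gamma

The accepted γ can be carried over as the starting step-size of the next iteration, which is consistent with the claim that only an initial estimate γ0 needs to be chosen.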
40. Performance of the adaptive step-size strategy
[Figure: objective minus optimum vs. iterations for γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L and the adaptive variant.]
Performance is as good as the best hand-tuned step-size.
41. Convergence rates 1/3
Convergence rate in terms of the average (aka ergodic) sequence:
  st := ∑_{i=0}^{t−1} γi,   x̄t := (∑_{i=0}^{t−1} γi xi+1)/st,   ūt := (∑_{i=0}^{t−1} γi ui+1)/st.
Theorem (sublinear convergence rate)
For any (x, u) ∈ dom L:
  L(x̄t, u) − L(x, ūt) ≤ (‖z0 − x‖² + γ0²‖u0 − u‖²)/(2 st).
43. Convergence rates 2/3
If h is Lipschitz, the saddle-point gap above bounds the objective suboptimality, giving rates in terms of the objective function.
Corollary
Let h be βh-Lipschitz and let P(x) = f(x) + g(x) + h(x). Then
  P(x̄t+1) − P(x∗) ≤ (‖z0 − x∗‖² + 2γ0²(‖u0‖² + βh²))/(2 st) = O(1/t).
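A brief sketch of why the Lipschitz assumption yields this primal bound (a reconstruction of the standard argument, not the slides' wording):

\[
h \ \beta_h\text{-Lipschitz} \;\Longleftrightarrow\; \operatorname{dom} h^* \subseteq \{u : \|u\| \le \beta_h\},
\qquad
P(\bar{x}) - P(x^*) \;\le\; \max_{\|u\| \le \beta_h} \mathcal{L}(\bar{x}, u) - \mathcal{L}(x^*, \bar{u}),
\]

and the right-hand side is controlled by the theorem on the previous slide, using \(\|u_0 - u\|^2 \le 2\|u_0\|^2 + 2\beta_h^2\).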
45. Convergence rates 3/3
Linear convergence under (somewhat unrealistic) assumptions.
Theorem
If f is Lf-smooth and µ-strongly convex and h is Lh-smooth, then
  ‖xt+1 − x∗‖² ≤ (1 − min{τ µ/Lf, 1/(1 + γ0Lh)})^{t+1} C0,
with τ = line-search decrease factor and C0 a constant that depends only on the initial conditions.
• Better rate than µ/Lf × 1/(1 + γLh)² from (Davis and Yin, 2015).
51. Conclusion
• Sufficient decrease condition to set the step-size in three operator splitting.
• (Mostly) hyperparameter-free, adaptivity to local geometry.
• Same convergence guarantees as the fixed step-size method.
• Large empirical improvements, especially in the low-regularization and non-quadratic regime.
Perspectives
• Linear convergence under less restrictive assumptions?
• Acceleration.
https://arxiv.org/abs/1804.02339
55. References
Armijo, Larry (1966). “Minimization of functions having Lipschitz continuous first partial derivatives”. In: Pacific Journal of Mathematics.
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to signal recovery”. In: Convex Optimization in Signal Processing and Communications.
Davis, Damek and Wotao Yin (2015). “A three-operator splitting scheme and its optimization applications”. In: preprint arXiv:1504.01032v1.
Davis, Damek and Wotao Yin (2017). “A three-operator splitting scheme and its optimization applications”. In: Set-Valued and Variational Analysis.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning (ICML).