Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step-size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term, for which we have access to its gradient, and an arbitrary number of potentially non-smooth terms, for which we have access to their proximal operators. The proposed method leverages local information about the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step-size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing a sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non-adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step-size strategy.
1. Adaptive Three Operator Splitting.
Fabian Pedregosa, Gauthier Gidel
2018 International Symposium on Mathematical Programming, Bordeaux
2. Three Operator Splitting (TOS)
• Recently proposed method (Davis and Yin, 2017).
Solves optimization problems of the form
  minimize_{x ∈ R^d}  f(x) + g(x) + h(x),
with access to ∇f, proxγg, proxγh.
• Can be generalized to an arbitrary number of proximal terms.
• Many complex penalties can be written as a sum of proximable terms: overlapping group lasso, ℓ1 trend filtering, isotonic constraints, total variation, intersection of constraints, etc.
5. Importance of step-size
Guaranteed to converge for any step-size γ < 2/L, with L = Lipschitz constant of ∇f.
In practice, best performance is often achieved for γ ≫ 2/L.
[Figure: objective minus optimum vs. iterations for step-sizes γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L.]
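For concreteness, a small NumPy illustration of what L is in the common least-squares case f(x) = (1/(2n))‖Ax − b‖²; there L is the largest eigenvalue of AᵀA/n. The data and names below are assumptions for illustration only, not part of the slides.

import numpy as np

# Hypothetical data; f(x) = 1/(2n) * ||Ax - b||^2 is L-smooth with
# L = largest eigenvalue of A^T A / n = ||A||_2^2 / n.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
n = A.shape[0]
L = np.linalg.norm(A, ord=2) ** 2 / n   # spectral norm squared, divided by n
print("Lipschitz constant of the gradient:", L)
print("classical step-size bound 2/L:", 2 / L)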
6. Motivation
• L is a global upper bound on the Lipschitz constant; locally it can be much smaller.
• Adaptive step-size strategies (aka inexact line search) have a long tradition (Armijo, 1966) and have been adapted to proximal-gradient (Beck and Teboulle, 2009).
• Goal. Can these adaptive step-size methods be adapted to the three operator splitting?
9. Outline
1. Revisiting the Three Operator Splitting
2. Adaptive Three Operator Splitting
3. Experiments
11. Three Operator Splitting (TOS)
• Three operator splitting (Davis and Yin, 2017):
  zt = proxγh(yt)
  xt = proxγg(2zt − yt − γ∇f(zt))
  yt+1 = yt − zt + xt
• Generalization of both Proximal-Gradient and Douglas-Rachford.
• Depends only on one step-size parameter.
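A minimal NumPy-style sketch of the fixed step-size iteration above, under the assumption that the user supplies grad_f and prox operators prox_g(v, gamma), prox_h(v, gamma) returning the prox of gamma*g and gamma*h (hypothetical helper names, not from the slides):

def three_operator_splitting(y, grad_f, prox_g, prox_h, step_size, max_iter=1000):
    # Fixed step-size Davis-Yin three operator splitting for min_x f(x) + g(x) + h(x).
    # prox_g(v, gamma) and prox_h(v, gamma) return the prox of gamma*g and gamma*h at v.
    x = y
    for _ in range(max_iter):
        z = prox_h(y, step_size)                                   # prox step on h
        x = prox_g(2 * z - y - step_size * grad_f(z), step_size)   # forward-backward step on g
        y = y + x - z                                              # auxiliary variable update
    return x

As noted on the slide, with h ≡ 0 (prox_h the identity) this reduces to proximal-gradient on f + g, and with ∇f ≡ 0 it reduces to Douglas-Rachford on g + h.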
14. Revisiting the three operator splitting
1. Introduce ut := (yt − xt)/γ. We can rewrite TOS as
  xt+1 = proxγg(zt − γ(∇f(zt) + ut))
  ut+1 = proxh∗/γ(ut + xt+1/γ)
  zt+1 = xt+1 − γ(ut+1 − ut)
2. Saddle-point reformulation of the original problem:
  min_{x∈R^d} f(x) + g(x) + h(x) = min_{x∈R^d} max_{u∈R^d} f(x) + g(x) + ⟨x, u⟩ − h∗(u) =: L(x, u)
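The min-max identity above follows from biconjugation; a short derivation sketch, assuming h is closed, proper and convex (a standard assumption not stated on the slide):

\[
h^*(u) \;=\; \sup_{x \in \mathbb{R}^d} \{\langle x, u\rangle - h(x)\},
\qquad
h(x) \;=\; h^{**}(x) \;=\; \sup_{u \in \mathbb{R}^d} \{\langle x, u\rangle - h^*(u)\}
\quad \text{(Fenchel--Moreau)},
\]

so substituting the biconjugate of h into min_x f(x) + g(x) + h(x) yields the saddle-point problem min_x max_u L(x, u).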
17. Minimizing with respect to the primal variable
  min_{x∈R^d} L(x, ut) = min_{x∈R^d} [ f(x) + ⟨x, ut⟩ ]  (smooth)  +  g(x)  (proximal)
• Proximal-gradient iteration, with x = zt as starting point:
  xt+1 = proxγg(zt − γ(∇f(zt) + ut)) = first step of TOS
20. Minimizing with respect to the dual variable
  min_{u∈R^d} L(xt, u) = min_{u∈R^d} h∗(u) − ⟨xt, u⟩  (proximal)
• Proximal-point iteration:
  ut+1 = proxσh∗(ut + σxt+1) = second step in TOS with σ = 1/γ
• Third line:
  zt+1 = xt+1 − γ(ut+1 − ut)  (extrapolation step)
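In implementations one typically has access to the prox of h rather than of h∗; the Moreau decomposition recovers the latter from the former. A minimal NumPy sketch under that assumption (prox_l1 and prox_conjugate are illustrative names, not from the slides):

import numpy as np

def prox_l1(v, gamma):
    # prox of gamma * ||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

def prox_conjugate(v, sigma, prox):
    # Moreau decomposition: prox_{sigma * h*}(v) = v - sigma * prox_{h / sigma}(v / sigma),
    # where prox(w, gamma) returns prox_{gamma * h}(w).
    return v - sigma * prox(v / sigma, 1.0 / sigma)

# Example: h = ||.||_1, so h* is the indicator of the unit l_inf ball and
# prox_{sigma * h*} is the projection onto [-1, 1]^d for any sigma > 0.
v = np.array([-2.5, 0.3, 1.7])
print(prox_conjugate(v, sigma=0.8, prox=prox_l1))   # -> [-1.   0.3  1. ]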
23. Revisiting the three operator splitting
  xt+1 = proxγg(zt − γ(∇f(zt) + ut))
  ut+1 = proxh∗/γ(ut + xt+1/γ)
  zt+1 = xt+1 − γ(ut+1 − ut)
[Animated figure: trajectory of the iterates in the (x, u) plane over iterations 1 through 6, cycling through a proximal-gradient step (x update), a proximal-point step (u update) and an extrapolation step (z update).]
Take-Home Message
TOS is (basically) alternated proximal-gradient and proximal-point.
Can we adapt the adaptive step-size of proximal-gradient?
37. Adaptive Three Operator Splitting¹
Start with an optimistic step-size γt and decrease it until:
  f(xt+1) ≤ f(zt) + ⟨∇f(zt), xt+1 − zt⟩ + (1/(2γt))‖xt+1 − zt‖²
with xt+1 = proxγt g(zt − γt(∇f(zt) + ut)).
Run the rest of the algorithm with that step-size:
  ut+1 = proxh∗/γt(ut + xt+1/γt)
  zt+1 = xt+1 − γt(ut+1 − ut)
Benefits
• Automatic tuning of the step-size
• (practically) hyperparameter-free
¹ Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning (ICML).
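A minimal NumPy-style sketch of one adaptive iteration; f, grad_f, prox_g and prox_h_conj are hypothetical user-supplied callables (prox_h_conj(v, sigma) returning the prox of sigma*h∗), and the grow/shrink factors are illustrative choices rather than the paper's exact defaults:

def adaptive_tos_step(z, u, gamma, f, grad_f, prox_g, prox_h_conj,
                      grow=1.02, shrink=0.7):
    # Optimistic increase of the step-size, then backtrack until the
    # sufficient-decrease (quadratic upper bound) condition holds at x.
    gamma *= grow
    fz, gz = f(z), grad_f(z)
    while True:
        x = prox_g(z - gamma * (gz + u), gamma)
        diff = x - z
        if f(x) <= fz + gz.dot(diff) + diff.dot(diff) / (2 * gamma):
            break
        gamma *= shrink
    u_next = prox_h_conj(u + x / gamma, 1.0 / gamma)   # proximal-point step on h*
    z_next = x - gamma * (u_next - u)                  # extrapolation step
    return x, u_next, z_next, gamma

The accepted γ can be carried over as the starting step-size of the next iteration, which is consistent with the claim that only an initial estimate γ0 needs to be chosen.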
40. Performance of the adaptive step-size strategy
[Figure: objective minus optimum vs. iterations for γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L and the adaptive variant.]
Performance is as good as the best hand-tuned step-size.
41. Convergence rates 1/3
Convergence rate in terms of the average (aka ergodic) sequence:
  st := ∑_{i=0}^{t−1} γi,   x̄t := (∑_{i=0}^{t−1} γi xi+1)/st,   ūt := (∑_{i=0}^{t−1} γi ui+1)/st.
Theorem (sublinear convergence rate)
For any (x, u) ∈ dom L:
  L(x̄t, u) − L(x, ūt) ≤ (‖z0 − x‖² + γ0²‖u0 − u‖²)/(2 st).
43. Convergence rates 2/3
If h is Lipschitz, the saddle-point gap above bounds the objective suboptimality, giving rates in terms of the objective function.
Corollary
Let h be βh-Lipschitz and let P(x) = f(x) + g(x) + h(x). Then
  P(x̄t+1) − P(x∗) ≤ (‖z0 − x∗‖² + 2γ0²(‖u0‖² + βh²))/(2 st) = O(1/t).
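A brief sketch of why the Lipschitz assumption yields this primal bound (a reconstruction of the standard argument, not the slides' wording):

\[
h \ \beta_h\text{-Lipschitz} \;\Longleftrightarrow\; \operatorname{dom} h^* \subseteq \{u : \|u\| \le \beta_h\},
\qquad
P(\bar{x}) - P(x^*) \;\le\; \max_{\|u\| \le \beta_h} \mathcal{L}(\bar{x}, u) - \mathcal{L}(x^*, \bar{u}),
\]

and the right-hand side is controlled by the theorem on the previous slide, using \(\|u_0 - u\|^2 \le 2\|u_0\|^2 + 2\beta_h^2\).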
45. Convergence rates 3/3
Linear convergence under (somewhat unrealistic) assumptions.
Theorem
If f is Lf-smooth and µ-strongly convex and h is Lh-smooth, then
  ‖xt+1 − x∗‖² ≤ (1 − min{τ µ/Lf, 1/(1 + γ0Lh)})^{t+1} C0,
with τ = line-search decrease factor and C0 a constant that depends only on the initial conditions.
• Better rate than µ/Lf × 1/(1 + γLh)² from (Davis and Yin, 2015).
51. Conclusion
• Sufficient decrease condition to set the step-size in three operator splitting.
• (Mostly) hyperparameter-free, adaptivity to local geometry.
• Same convergence guarantees as the fixed step-size method.
• Large empirical improvements, especially in the low-regularization and non-quadratic regime.
Perspectives
• Linear convergence under less restrictive assumptions?
• Acceleration.
https://arxiv.org/abs/1804.02339
55. References
Armijo, Larry (1966). “Minimization of functions having Lipschitz continuous first partial derivatives”. In: Pacific Journal of Mathematics.
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to signal recovery”. In: Convex Optimization in Signal Processing and Communications.
Davis, Damek and Wotao Yin (2015). “A three-operator splitting scheme and its optimization applications”. In: preprint arXiv:1504.01032v1.
Davis, Damek and Wotao Yin (2017). “A three-operator splitting scheme and its optimization applications”. In: Set-Valued and Variational Analysis.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning (ICML).