Adaptive Three Operator Splitting.
Fabian Pedregosa, Gauthier Gidel
2018 International Symposium on Mathematical Programming, Bordeaux
Three Operator Splitting (TOS)
• Recently proposed method (Davis and Yin, 2017)
Solves optimization problems of the form
minimize_{x ∈ ℝ^d} f(x) + g(x) + h(x),
with access to ∇f, prox_{γg} and prox_{γh}.
• Can be generalized to an arbitrary number of proximal terms.
• Many complex penalties can be written as a sum of proximable terms: overlapping group lasso, ℓ1 trend filtering, isotonic constraints, total variation, intersection of constraints, etc.
1/18
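For concreteness, "access to prox_{γg}, prox_{γh}" means a routine that evaluates the proximal operator, ideally in closed form. A minimal sketch of two such routines (not from the slides; the penalties chosen here are only illustrative):

import numpy as np

def prox_l1(v, step):
    # prox of step * ||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - step, 0.0)

def prox_nonneg(v, step=None):
    # prox of the indicator of {x >= 0}: projection onto the orthant
    return np.maximum(v, 0.0)

v = np.array([-1.5, 0.2, 3.0])
print(prox_l1(v, 0.5))    # [-1.   0.   2.5]
print(prox_nonneg(v))     # [0.  0.2 3. ]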
Importance of step-size
Guaranteed to converge for any step-size γ < 2/L, with L = Lipschitz constant of ∇f.
In practice, the best performance is often achieved for γ ≫ 2/L.
[Figure: objective minus optimum vs. iterations, for step-sizes γ = 1/L, 2/L, 5/L, 10/L, 20/L and 50/L]
2/18
Motivation
• L is a global upper bound on the Lipschitz constant; locally it can be much smaller.
• Adaptive step-size strategies (aka inexact line search) have a long tradition (Armijo, 1966) and have been adapted to proximal-gradient (Beck and Teboulle, 2009).
• Goal. Can these adaptive step-size methods be adapted to the three operator splitting?
3/18
Outline
1. Revisiting the Three Operator Splitting
2. Adaptive Three Operator Splitting
3. Experiments
4/18
Revisiting the Three Operator Splitting
Three Operator Splitting (TOS)
• Three operator splitting (Davis and Yin, 2017):
z_t = prox_{γh}(y_t)
x_t = prox_{γg}(2z_t − y_t − γ∇f(z_t))
y_{t+1} = y_t − z_t + x_t
• Generalization of both Proximal-Gradient and Douglas-Rachford.
• Depends only on one step-size parameter.
5/18
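A minimal NumPy sketch of this fixed step-size iteration (not from the slides; the problem instance, names and constants below are illustrative), with f a least-squares loss, g = λ‖·‖₁ and h the indicator of the nonnegative orthant:

import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 10)), rng.standard_normal(20), 0.1

grad_f = lambda x: A.T @ (A @ x - b)                                    # gradient of f(x) = 0.5*||Ax - b||^2
prox_g = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0)   # prox of s*lam*||.||_1
prox_h = lambda v, s: np.maximum(v, 0)                                  # projection onto {x >= 0}

L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad_f
gamma = 1.0 / L                 # any gamma < 2/L is covered by the theory
y = np.zeros(10)
for _ in range(500):
    z = prox_h(y, gamma)
    x = prox_g(2 * z - y - gamma * grad_f(z), gamma)
    y = y - z + x
print(x)   # approximate solution of min 0.5*||Ax - b||^2 + lam*||x||_1 subject to x >= 0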
Revisiting the three operator splitting
1. Introduce u_t := (y_t − x_t)/γ. We can rewrite TOS as
x_{t+1} = prox_{γg}(z_t − γ(∇f(z_t) + u_t))
u_{t+1} = prox_{h*/γ}(u_t + x_{t+1}/γ)
z_{t+1} = x_{t+1} − γ(u_{t+1} − u_t)
2. Saddle-point reformulation of the original problem
min_{x ∈ ℝ^d} f(x) + g(x) + h(x)
= min_{x ∈ ℝ^d} f(x) + g(x) + max_{u ∈ ℝ^d} {⟨x, u⟩ − h*(u)}
= min_{x ∈ ℝ^d} max_{u ∈ ℝ^d} f(x) + g(x) + ⟨x, u⟩ − h*(u) =: L(x, u)
6/18
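The reformulation uses the Fenchel conjugate identity h(x) = max_u {⟨x, u⟩ − h*(u)}, valid for closed convex h. A small numerical illustration (not from the slides): for h = |·| in one dimension, h* is the indicator of [−1, 1], and the inner maximum recovers h.

import numpy as np

x = 1.7
u_grid = np.linspace(-1.0, 1.0, 2001)   # dom h* = [-1, 1] when h = |.|
inner_max = np.max(x * u_grid)          # max_u <x, u> - h*(u), with h* = 0 on its domain
print(inner_max, abs(x))                # both approximately 1.7, recovering h(x) = |x|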
Minimizing with respect to the primal variable
min_{x ∈ ℝ^d} L(x, u_t) = min_{x ∈ ℝ^d} [f(x) + ⟨x, u_t⟩] + g(x), where the bracketed term is smooth and g is proximal.
• Proximal-gradient iteration, with x = z_t as starting point:
x_{t+1} = prox_{γg}(z_t − γ(∇f(z_t) + u_t)) = first step of TOS
7/18
Minimizing with respect to the dual variable
min_{u ∈ ℝ^d} −L(x_t, u) = min_{u ∈ ℝ^d} h*(u) − ⟨x_t, u⟩ (a proximal term, up to terms that do not depend on u).
• Proximal-point iteration:
u_{t+1} = prox_{σh*}(u_t + σx_{t+1}) = second step in TOS with σ = 1/γ
• Third line:
z_{t+1} = x_{t+1} − γ(u_{t+1} − u_t) (extrapolation step)
8/18
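In practice prox_{h*/γ} rarely needs to be derived by hand: the Moreau decomposition expresses it through prox_{γh}. A short sketch (not from the slides; the example h and the function names are illustrative):

import numpy as np

def prox_conjugate(v, gamma, prox_h):
    # Moreau decomposition: prox_{h*/gamma}(v) = v - (1/gamma) * prox_{gamma*h}(gamma*v)
    return v - prox_h(gamma * v, gamma) / gamma

prox_h = lambda v, g: np.minimum(v, 1.0)   # h = indicator of {x <= 1}, prox = clipping from above
u, x, gamma = np.zeros(3), np.array([0.5, 2.0, -1.0]), 0.1
u_next = prox_conjugate(u + x / gamma, gamma, prox_h)   # dual update of the rewritten TOS
print(u_next)   # [ 0. 10.  0.]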
Revisiting the three operator splitting
x_{t+1} = prox_{γg}(z_t − γ(∇f(z_t) + u_t))
u_{t+1} = prox_{h*/γ}(u_t + x_{t+1}/γ)
z_{t+1} = x_{t+1} − γ(u_{t+1} − u_t)
[Animation over iterations 1-6: the iterates traced in the (x, u) plane, alternating the proximal-gradient step on x, the proximal-point step on u and the extrapolation step on z]
Take-Home Message
TOS is (basically) alternated proximal-gradient and proximal-point
Can we adapt the adaptive step-size of proximal-gradient?
9/18
Adaptive Three Operator Splitting
Adaptive Three Operator Splitting¹
Start with an optimistic step-size γ_t and decrease it until:
f(x_{t+1}) ≤ f(z_t) + ⟨∇f(z_t), x_{t+1} − z_t⟩ + (1/(2γ_t)) ‖x_{t+1} − z_t‖²
with x_{t+1} = prox_{γ_t g}(z_t − γ_t(∇f(z_t) + u_t))
Run the rest of the algorithm with that step-size:
u_{t+1} = prox_{h*/γ_t}(u_t + x_{t+1}/γ_t)   (1)
z_{t+1} = x_{t+1} − γ_t(u_{t+1} − u_t)   (2)
Benefits
• Automatic tuning of the step-size
• (practically) hyperparameter-free
¹ Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning (ICML).
10/18
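A minimal sketch of one iteration with this backtracking rule (not the authors' reference implementation; the decrease factor and all names are illustrative, and the two adaptive variants that appear in the experiments are not distinguished here). The arrays are assumed to be NumPy vectors and the prox/gradient callables follow the earlier sketches.

def adaptive_tos_step(z, u, gamma, f, grad_f, prox_g, prox_h_conj, shrink=0.7):
    # Decrease gamma until the sufficient-decrease (quadratic upper bound) condition holds.
    while True:
        x = prox_g(z - gamma * (grad_f(z) + u), gamma)
        d = x - z
        if f(x) <= f(z) + grad_f(z) @ d + d @ d / (2 * gamma):
            break
        gamma *= shrink
    u_new = prox_h_conj(u + x / gamma, gamma)   # eq. (1): proximal-point step on u
    z_new = x - gamma * (u_new - u)             # eq. (2): extrapolation step
    return x, u_new, z_new, gamma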
Performance of the adaptive step-size strategy
[Figure: objective minus optimum vs. iterations for γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L and the adaptive step-size]
Performance is as good as the best hand-tuned step-size
11/18
Convergence rates 1/3
Convergence rate in terms of the average (aka ergodic) sequence:
s_t := Σ_{i=0}^{t−1} γ_i ,   x̄_t := (Σ_{i=0}^{t−1} γ_i x_{i+1}) / s_t ,   ū_t := (Σ_{i=0}^{t−1} γ_i u_{i+1}) / s_t .
Theorem (sublinear convergence rate)
For any (x, u) ∈ dom L:
L(x̄_t, u) − L(x, ū_t) ≤ (‖z_0 − x‖² + γ_0² ‖u_0 − u‖²) / (2 s_t) .
12/18
Convergence rates 2/3
If h is Lipschitz, we can bound it and obtain rates in terms of objective function suboptimality.
Corollary
Let h be β_h-Lipschitz. Then we have
P(x̄_{t+1}) − P(x*) ≤ (‖z_0 − x*‖² + 2γ_0² (‖u_0‖² + β_h²)) / (2 s_t) = O(1/t),
with P(x) = f(x) + g(x) + h(x).
13/18
Convergence rates 3/3
Linear convergence under (somewhat unrealistic) assumptions.
Theorem
If f is L_f-smooth and µ-strongly convex and h is L_h-smooth, then
‖x_{t+1} − x*‖² ≤ (1 − min{τ µ/L_f , 1/(1 + γ_0 L_h)})^{t+1} C_0   (3)
with τ = line-search decrease factor and C_0 a constant that only depends on the initial conditions.
• Better rate than (µ/L_f) × 1/(1 + γL_h)² from (Davis and Yin, 2015).
14/18
Experiments
Logistic + Nearly-isotonic penalty
Problem
arg min_x logistic(x) + λ Σ_{i=1}^{p−1} max{x_i − x_{i+1}, 0}
[Figure: estimated coefficients vs. ground truth, and objective minus optimum vs. time (in seconds), for λ = 10⁻⁶, 10⁻³, 0.01, 0.1; methods: Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG]
15/18
Logistic + Overlapping group lasso penalty
Problem
arg min_x logistic(x) + λ Σ_{g∈G} ‖[x]_g‖₂
[Figure: estimated coefficients vs. ground truth, and objective minus optimum vs. time (in seconds), for λ = 10⁻⁶, 10⁻³, 0.01, 0.1; same methods as above]
16/18
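Why this penalty fits the three-term template, as a sketch rather than the experimental code: assuming the overlapping groups can be partitioned into two collections of pairwise-disjoint groups, one collection plays the role of g and the other of h, and each is proximable by block soft-thresholding.

import numpy as np

def prox_group_lasso(v, step, groups, lam=1.0):
    # prox of step * lam * sum_g ||[x]_g||_2 over pairwise-disjoint groups
    out = v.copy()
    for g in groups:
        norm = np.linalg.norm(v[g])
        out[g] = 0.0 if norm <= step * lam else (1 - step * lam / norm) * v[g]
    return out

# overlapping groups {0,1,2} and {2,3,4}, split into two disjoint collections
collection_g = [np.array([0, 1, 2])]   # handled by prox_g
collection_h = [np.array([2, 3, 4])]   # handled by prox_h
x = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
print(prox_group_lasso(x, 0.1, collection_g))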
Quadratic loss + total variation penalty
Problem
arg min_x least_squares(x) + λ ‖x‖_TV
[Figure: recovered coefficients and objective minus optimum vs. time (in seconds), for λ = 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³; same methods as above]
17/18
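A sketch of how the 1D total variation ‖x‖_TV = Σ_i |x_{i+1} − x_i| splits into two proximable terms (an illustrative decomposition, not necessarily the one used in these experiments): the differences starting at even positions and those starting at odd positions each decouple over disjoint pairs, and the prox of each pair has a closed form.

import numpy as np

def prox_pair_tv(v, step, lam, start):
    # prox of step * lam * sum_{i = start, start+2, ...} |v[i+1] - v[i]|:
    # soft-threshold each pairwise difference while keeping the pair's mean fixed
    out = v.copy()
    for i in range(start, len(v) - 1, 2):
        d = v[i + 1] - v[i]
        d_new = np.sign(d) * max(abs(d) - 2 * step * lam, 0.0)
        shift = (d - d_new) / 2
        out[i] += shift
        out[i + 1] -= shift
    return out

x = np.array([0.0, 1.0, 1.2, 3.0])
print(prox_pair_tv(x, 0.1, 1.0, start=0))   # [0.1 0.9 1.3 2.9]: both differences shrunk by 0.2

The two halves (start=0 and start=1) then give one possible choice of g and h in the TOS template, with the quadratic loss as f.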
Conclusion
• Sufficient decrease condition to set the step-size in three operator splitting.
• (Mostly) hyperparameter-free, adaptivity to local geometry.
• Same convergence guarantees as the fixed step-size method.
• Large empirical improvements, especially in the low-regularization and non-quadratic regime.
Perspectives
• Linear convergence under less restrictive assumptions?
• Acceleration.
https://arxiv.org/abs/1804.02339
18/18
References
Armijo, Larry (1966). “Minimization of functions having Lipschitz continuous first
partial derivatives”. In: Pacific Journal of Mathematics.
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to
signal recovery”. In: Convex optimization in signal processing and communications.
Davis, Damek and Wotao Yin (2015). “A three-operator splitting scheme and its
optimization applications”. In: preprint arXiv:1504.01032v1.
— (2017). “A three-operator splitting scheme and its optimization applications”. In:
Set-valued and variational analysis.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”.
In: Proceedings of the 35th International Conference on Machine Learning (ICML).
18/18
