This document presents Proximal Asynchronous SAGA (ProxASAGA), a parallel method for solving composite optimization problems. ProxASAGA extends SAGA to nonsmooth objectives using proximal operators and runs asynchronously in parallel without locks. Theoretically, it converges at the same linear rate as the sequential algorithm; empirically, it achieves speedups of 6-12x on a 20-core machine on large datasets, with greater speedups on sparser problems, as predicted by the theory.
1. Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Fabian Pedregosa, Rémi Leblond, Simon Lacoste-Julien
2. Motivation
Since 2005, the speed of processors has stagnated, while the number of cores has increased. This motivates the development of parallel asynchronous variants of stochastic gradient algorithms:
SGD → Hogwild (Niu et al. 2011).
SVRG → Kromagnon (Reddi et al. 2015; Mania et al. 2017).
SAGA → ASAGA (Leblond, Pedregosa, and Lacoste-Julien 2017).
3. Composite objective
These methods assume the objective function is smooth, so they cannot be applied to the Lasso, Group Lasso, box constraints, etc.
Objective: minimize the composite objective function
    minimize_x f(x) + h(x),  with f(x) = (1/n) ∑_{i=1}^n f_i(x),
where each f_i is smooth and h is a block-separable (i.e., h(x) = ∑_B h([x]_B)) convex function for which we have access to its proximal operator.
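As a concrete example (not from the slides): for the Lasso, h(x) = λ‖x‖₁ is separable across coordinates (blocks of size 1), and its proximal operator is the closed-form soft-thresholding map. A minimal sketch in Python/NumPy:

import numpy as np

def prox_l1(x, step):
    # Proximal operator of step * ||.||_1: the soft-thresholding map.
    # Solves argmin_z 0.5 * ||z - x||^2 + step * ||z||_1, which
    # decouples coordinate-wise, hence is block-separable.
    return np.sign(x) * np.maximum(np.abs(x) - step, 0.0)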
4. Sparse Proximal SAGA
Contribution 1: Sparse Proximal SAGA. A variant of SAGA (Defazio, Bach, and Lacoste-Julien 2014) that is particularly efficient when the ∇f_i are sparse.
Like SAGA, it relies on an unbiased gradient estimate and a proximal step:
    v_i = ∇f_i(x) − α_i + D_i ᾱ ;   x⁺ = prox_{γφ_i}(x − γ v_i) ;   α_i⁺ = ∇f_i(x) ,
where ᾱ = (1/n) ∑_j α_j is the average of the stored gradients.
Unlike SAGA, D_i and φ_i are designed to give sparse updates while verifying the unbiasedness conditions.
Convergence: same linear convergence rate as SAGA, with cheaper updates in the presence of sparsity.
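For intuition, a minimal dense sketch of this update in Python/NumPy (illustrative only: the function names are hypothetical, prox_l1 from above stands in for prox_{γφ_i}, and the actual Sparse variant restricts each update to the support of ∇f_i through D_i and φ_i):

import numpy as np

def prox_saga_step(x, alpha, alpha_bar, grad_i, i, step, prox):
    # One (dense) proximal SAGA step for sample i.
    # x: current iterate; alpha: table of past gradients, one row per
    # sample; alpha_bar: running average of the rows of alpha.
    g = grad_i(x)
    v = g - alpha[i] + alpha_bar          # unbiased gradient estimate
    x_new = prox(x - step * v, step)      # proximal step on h
    alpha_bar = alpha_bar + (g - alpha[i]) / alpha.shape[0]  # keep average in sync
    alpha[i] = g                          # update the gradient memory
    return x_new, alpha, alpha_bar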
5. Proximal Asynchronous SAGA (ProxASAGA)
Contribution 2: Proximal Asynchronous SAGA (ProxASAGA). Each core runs Sparse Proximal SAGA asynchronously without locks and updates x, α and ᾱ in shared memory.
All read/write operations to shared memory are inconsistent, i.e., there are no performance-destroying vector-level locks while reading or writing.
Convergence: under sparsity assumptions, ProxASAGA converges at the same rate as the sequential algorithm ⇒ theoretical linear speedup with respect to the number of cores.
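To illustrate the lock-free pattern, a hypothetical Python sketch (not the authors' implementation, which runs as compiled code; in pure Python the GIL would serialize these threads): each worker repeatedly applies the step above to shared arrays without taking any vector-level lock.

import threading
import numpy as np

def worker(x, alpha, alpha_bar, grads, step, prox, n_steps, seed):
    # Each thread samples i and updates the shared x, alpha, alpha_bar
    # in place; reads and writes may interleave inconsistently.
    rng = np.random.default_rng(seed)
    n = alpha.shape[0]
    for _ in range(n_steps):
        i = rng.integers(n)
        g = grads[i](x)                    # inconsistent read of shared x
        v = g - alpha[i] + alpha_bar
        x[:] = prox(x - step * v, step)    # lock-free in-place write
        alpha_bar += (g - alpha[i]) / n
        alpha[i] = g

# Spawn one worker per core (x, alpha, alpha_bar, grads shared by reference):
# threads = [threading.Thread(target=worker, args=(x, alpha, alpha_bar,
#            grads, step, prox_l1, 10_000, s)) for s in range(n_cores)]
# for t in threads: t.start()
# for t in threads: t.join()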
6. Empirical results - Speedup
Speedup = (time to 10⁻¹⁰ suboptimality on one core) / (time to the same suboptimality on k cores)
[Figure: time speedup vs. number of cores (2-20) on the KDD10, KDD12, and Criteo datasets; curves shown for Ideal, ProxASAGA, AsySPCD, and FISTA.]
• ProxASAGA achieves speedups between 6x and 12x on a 20-core architecture.
• As predicted by the theory, there is a high correlation between the degree of sparsity and the speedup.
Thanks for your attention, see you at poster #159.
7. References
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient
method with support for non-strongly convex composite objectives”. In: Advances in Neural
Information Processing Systems.
Leblond, Rémi, Fabian Pedregosa, and Simon Lacoste-Julien (2017). “ASAGA: asynchronous parallel
SAGA”. In: Proceedings of the 20th International Conference on Artificial Intelligence and
Statistics (AISTATS 2017).
Mania, Horia et al. (2017). “Perturbed iterate analysis for asynchronous stochastic optimization”. In:
SIAM Journal on Optimization.
Niu, Feng et al. (2011). “Hogwild: A lock-free approach to parallelizing stochastic gradient descent”.
In: Advances in Neural Information Processing Systems.
Pedregosa, Fabian, Rémi Leblond, and Simon Lacoste-Julien (2017). “Breaking the Nonsmooth
Barrier: A Scalable Parallel Method for Composite Optimization”. In: Advances in Neural
Information Processing Systems 30.
Reddi, Sashank J et al. (2015). “On variance reduction in stochastic gradient descent and its
asynchronous variants”. In: Advances in Neural Information Processing Systems.