Logistic regression with L1 and L2 regularization is a widely used technique for solving classification and class probability estimation problems. With the numbers of both features and examples growing rapidly in fields like text mining and clickstream data analysis, parallelization and the use of cluster architectures become important. We present a novel algorithm for fitting regularized logistic regression in a distributed environment. The algorithm splits data between nodes by features, uses coordinate descent on each node, and merges results globally via line search. A convergence proof is provided. A modification of the algorithm addresses the slow-node problem. We empirically compare our program with several state-of-the-art approaches that rely on different algorithmic and data-splitting methods. Experiments demonstrate that our approach is scalable and superior when training on large and sparse datasets.
----------------------------------------------------------
Distributed Coordinate Descent for Logistic Regression with Regularization
1. Distributed Coordinate Descent for Logistic Regression with Regularization
Ilya Trofimov (Yandex Data Factory)
Alexander Genkin (AVG Consulting)
presented by Ilya Trofimov
Machine Learning: Prospects and Applications
5–8 October 2015, Berlin, Germany
6. Large Scale Machine Learning
Large Scale Machine Learning = Big Data + ML
Many applications in web search, online advertising, e-commerce, text processing, etc.
Key features of Large Scale Machine Learning problems:
1 Large number of examples n
2 High dimensionality p
Datasets are often:
1 Sparse
2 Don't fit in the memory of a single machine
Linear methods for classification and regression are often used for large-scale problems:
1 Training and testing for linear models are fast
2 High-dimensional datasets are rich, so non-linearities are not required
7. Binary Classification
Supervised machine learning problem: given a feature vector $x_i \in \mathbb{R}^p$, predict $y_i \in \{-1, +1\}$.
A function $F : x \to y$ should be built using the training dataset $\{x_i, y_i\}_{i=1}^n$ and minimize the expected risk
$$\mathbb{E}_{x,y}\, \Psi(y, F(x))$$
where $\Psi(\cdot, \cdot)$ is some loss function.
10. Logistic Regression
Logistic regression is a special case of the Generalized Linear Model with the logit link function, $y_i \in \{-1, +1\}$:
$$P(y = +1 \mid x) = \frac{1}{1 + \exp(-\beta^T x)}$$
Negated log-likelihood (empirical risk) $L(\beta)$:
$$L(\beta) = \sum_{i=1}^{n} \log\left(1 + \exp(-y_i \beta^T x_i)\right)$$
$$\beta^* = \operatorname*{argmin}_{\beta} \big( L(\beta) + \underbrace{R(\beta)}_{\text{regularizer}} \big)$$
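For concreteness, a minimal numpy sketch of this objective. The function names and the elastic-net form of $R(\beta)$ (with a 1/2 factor on the L2 term) are illustrative assumptions; the slides only write $L(\beta) + R(\beta)$.

```python
import numpy as np

def neg_log_likelihood(beta, X, y):
    """L(beta) = sum_i log(1 + exp(-y_i * beta^T x_i)), computed stably."""
    margins = y * (X @ beta)                  # y_i * beta^T x_i for each example
    return np.sum(np.logaddexp(0.0, -margins))

def objective(beta, X, y, lambda1=0.0, lambda2=0.0):
    """L(beta) + R(beta), here with an assumed elastic-net style regularizer."""
    reg = lambda1 * np.sum(np.abs(beta)) + 0.5 * lambda2 * np.dot(beta, beta)
    return neg_log_likelihood(beta, X, y) + reg
```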
18. Logistic Regression, regularization
L1 regularization provides feature selection:
$$\operatorname*{argmin}_{\beta} \big( L(\beta) + \lambda_1 \|\beta\|_1 \big)$$
This is minimization of a non-smooth convex function.
Optimization techniques for large datasets, distributed:
Subgradient method: slow
Online learning via truncated gradient: poor parallelization
Coordinate descent (GLMNET, BBR): ?
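As a point of reference for the coordinate-descent option above, here is a minimal single-machine sketch of one coordinate-descent pass with a soft-thresholded Newton step per coordinate, in the spirit of GLMNET/BBR. It is not the authors' implementation; all names are illustrative.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator induced by the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def cd_epoch_l1_logreg(beta, X, y, lambda1):
    """One coordinate-descent pass for L1-regularized logistic regression.

    Each coordinate takes a Newton step on a one-dimensional quadratic
    approximation of L(beta), followed by soft-thresholding.
    """
    margins = y * (X @ beta)                          # y_i * beta^T x_i
    for j in range(X.shape[1]):
        prob = 1.0 / (1.0 + np.exp(margins))          # sigma(-y_i * beta^T x_i)
        grad_j = -np.sum(y * X[:, j] * prob)          # dL/dbeta_j
        hess_j = np.sum(X[:, j] ** 2 * prob * (1.0 - prob)) + 1e-12
        new_bj = soft_threshold(hess_j * beta[j] - grad_j, lambda1) / hess_j
        margins += y * X[:, j] * (new_bj - beta[j])   # keep margins in sync
        beta[j] = new_bj
    return beta
```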
20. How to run coordinate descent in parallel?
Suppose we have several machines (a cluster).
[Figure: the examples-by-features data matrix split column-wise into blocks S1, S2, ..., SM, one per machine]
The dataset is split by features among machines:
$$S_1 \cup \ldots \cup S_M = \{1, \ldots, p\}, \qquad S_m \cap S_k = \emptyset, \; k \neq m$$
$$\beta^T = \big((\beta^1)^T, (\beta^2)^T, \ldots, (\beta^M)^T\big)$$
Each machine makes steps on its own subset of input features, producing $\Delta\beta^m$.
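A toy illustration of this feature split. The slide does not specify how features are assigned to machines; a contiguous partition is assumed here purely for illustration.

```python
import numpy as np

def split_features(p, M):
    """Partition feature indices {0, ..., p-1} into M disjoint blocks S_1..S_M."""
    return np.array_split(np.arange(p), M)

# Example: p = 10 features over M = 3 machines
# -> [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]
```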
23. Problems
Two main questions:
1 How to compute $\Delta\beta^m$
2 How to organize communication between machines
Answers:
1 Each machine makes a step using the GLMNET algorithm.
2 $$\Delta\beta = \sum_{m=1}^{M} \Delta\beta^m$$
Steps from different machines can come in conflict, so that the target function may increase:
$$L(\beta + \Delta\beta) + R(\beta + \Delta\beta) > L(\beta) + R(\beta)$$
26. Problems
To resolve conflicts, the merged step is damped:
$$\beta \leftarrow \beta + \alpha\Delta\beta, \quad 0 < \alpha \leq 1$$
where $\alpha$ is found by an Armijo rule:
$$L(\beta + \alpha\Delta\beta) + R(\beta + \alpha\Delta\beta) \leq L(\beta) + R(\beta) + \alpha\sigma D_k$$
$$D_k = \nabla L(\beta)^T \Delta\beta + R(\beta + \Delta\beta) - R(\beta)$$
The quantities needed by the line search decompose as
$$L(\beta + \alpha\Delta\beta) = \sum_{i=1}^{n} \log\left(1 + \exp(-y_i (\beta + \alpha\Delta\beta)^T x_i)\right)$$
$$R(\beta + \alpha\Delta\beta) = \sum_{m=1}^{M} R(\beta^m + \alpha\Delta\beta^m)$$
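A sketch of the backtracking line search implied by the Armijo rule above. The constant sigma and the halving factor are assumptions (the slides do not give them), and the function names are illustrative.

```python
def armijo_line_search(beta, delta, loss, reg, grad_dot_delta,
                       sigma=0.01, shrink=0.5, max_iter=30):
    """Backtracking line search for beta <- beta + alpha * delta.

    Accepts the largest alpha in {1, shrink, shrink^2, ...} satisfying
    loss(beta + a*d) + reg(beta + a*d) <= loss(beta) + reg(beta) + a*sigma*D,
    where D = grad(L)^T d + reg(beta + d) - reg(beta).
    """
    f0 = loss(beta) + reg(beta)
    D = grad_dot_delta + reg(beta + delta) - reg(beta)
    alpha = 1.0
    for _ in range(max_iter):
        if loss(beta + alpha * delta) + reg(beta + alpha * delta) <= f0 + alpha * sigma * D:
            return alpha
        alpha *= shrink
    return alpha
```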
27. Effective communication between machines
The decompositions of $L(\beta + \alpha\Delta\beta)$ and $R(\beta + \alpha\Delta\beta)$ above determine the data transfer:
$(\beta^T x_i)$ are kept synchronized
$(\Delta\beta^T x_i)$ are summed up via MPI_AllReduce (M vectors of size n)
$R(\beta^m + \alpha\Delta\beta^m)$ and $\nabla L(\beta)^T \Delta\beta^m$ are calculated separately on each machine and then summed up (M scalars)
Total communication cost: M(n + 1)
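A rough mpi4py sketch of this communication pattern. mpi4py and the function names are illustrative stand-ins; the actual implementation (linked at the end) is not reproduced here.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def merge_local_steps(local_delta_margin, local_scalars):
    """Sum per-machine contributions with MPI allreduce.

    local_delta_margin : (n,) array holding Delta(beta^m)^T x_i on this machine
    local_scalars      : small array, e.g. [R(beta^m + dbeta^m), gradL^T dbeta^m]
    """
    total_delta_margin = np.empty_like(local_delta_margin)
    comm.Allreduce(local_delta_margin, total_delta_margin, op=MPI.SUM)
    total_scalars = np.empty_like(local_scalars)
    comm.Allreduce(local_scalars, total_scalars, op=MPI.SUM)
    return total_delta_margin, total_scalars
```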
28. Distributed GLMNET (d-GLMNET)
d-GLMNET Algorithm
Input: training dataset $\{x_i, y_i\}_{i=1}^n$, split into M parts over features.
$\beta^m \leftarrow 0$, $\Delta\beta^m \leftarrow 0$, where m is the index of a machine.
Repeat until converged:
1 Do in parallel over M machines:
2   Find $\Delta\beta^m$ and calculate $(\Delta(\beta^m)^T x_i)$
3 Sum up $\Delta\beta^m$, $(\Delta(\beta^m)^T x_i)$ using MPI_AllReduce:
4   $\Delta\beta \leftarrow \sum_{m=1}^{M} \Delta\beta^m$
5   $(\Delta\beta^T x_i) \leftarrow \sum_{m=1}^{M} (\Delta(\beta^m)^T x_i)$
6 Find $\alpha$ using line search with the Armijo rule
7 $\beta \leftarrow \beta + \alpha\Delta\beta$
8 $(\exp(\beta^T x_i)) \leftarrow (\exp(\beta^T x_i + \alpha\Delta\beta^T x_i))$
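Putting the pieces together, here is a single-process simulation of the d-GLMNET outer loop, with the M machines played by feature blocks inside one process. This is an illustrative sketch built from the formulas above, not the distributed implementation; all names and constants are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_cd_step(X, y, beta, margins, block, lambda1):
    """One coordinate-descent pass over one feature block (one simulated machine).

    margins holds y_i * beta^T x_i; the block works on a private copy,
    as each machine would in the distributed setting.
    """
    delta = np.zeros_like(beta)
    local_margins = margins.copy()
    for j in block:
        prob = sigmoid(-local_margins)                    # sigma(-y_i * beta^T x_i)
        grad = -np.sum(y * X[:, j] * prob)                # dL/dbeta_j
        hess = np.sum(X[:, j] ** 2 * prob * (1 - prob)) + 1e-12
        bj = beta[j] + delta[j]
        z = hess * bj - grad
        new_bj = np.sign(z) * max(abs(z) - lambda1, 0.0) / hess   # soft-thresholded Newton step
        delta[j] = new_bj - beta[j]
        local_margins += y * X[:, j] * (new_bj - bj)
    return delta

def d_glmnet_sim(X, y, M=4, lambda1=0.1, outer_iters=20, sigma=0.01):
    """Single-process simulation of the d-GLMNET outer loop."""
    n, p = X.shape
    beta, margins = np.zeros(p), np.zeros(n)              # margins = y_i * beta^T x_i
    blocks = np.array_split(np.arange(p), M)
    obj = lambda m, b: np.sum(np.logaddexp(0, -m)) + lambda1 * np.sum(np.abs(b))
    for _ in range(outer_iters):
        # steps 1-2: each "machine" computes its own delta from the shared state
        deltas = [block_cd_step(X, y, beta, margins, blk, lambda1) for blk in blocks]
        # steps 3-5: merge the per-machine steps (MPI_AllReduce in the real algorithm)
        delta = np.sum(deltas, axis=0)
        delta_margins = y * (X @ delta)
        # step 6: Armijo line search on the merged step
        D = (-np.sum(sigmoid(-margins) * delta_margins)
             + lambda1 * (np.sum(np.abs(beta + delta)) - np.sum(np.abs(beta))))
        alpha = 1.0
        while (obj(margins + alpha * delta_margins, beta + alpha * delta)
               > obj(margins, beta) + alpha * sigma * D and alpha > 1e-4):
            alpha *= 0.5
        # steps 7-8: apply the damped step and keep the margins synchronized
        beta += alpha * delta
        margins += alpha * delta_margins
    return beta
```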
31. Solving the "slow node" problem
Distributed Machine Learning Algorithm:
Do until converged:
1 Do some computations in parallel over M machines
2 Synchronize: PROBLEM! M − 1 fast machines will wait for 1 slow one.
Our solution: at iteration k, machine m updates only a subset $P^m_k \subseteq S^m$ of its input features.
Synchronization is done asynchronously in a separate thread; we call this Asynchronous Load Balancing (ALB).
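A very rough sketch of the subset-per-iteration idea: a slower machine simply updates a smaller subset $P^m_k$ within a fixed time budget. This only illustrates the load-balancing intuition; the actual ALB mechanism with a separate synchronization thread is not reproduced here, and all names are assumptions.

```python
import time

def alb_iteration(update_coordinate, my_features, time_budget):
    """Update as many of this machine's coordinates as fit into time_budget seconds."""
    updated = []
    deadline = time.monotonic() + time_budget
    for j in my_features:
        if time.monotonic() >= deadline:
            break                      # slow machine: stop early, so P_k^m is smaller
        update_coordinate(j)           # e.g. one Newton + soft-threshold step for feature j
        updated.append(j)
    return updated                     # coordinates actually updated this iteration
```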
33. Theoretical Results
Theorem 1. Each iteration of d-GLMNET is equivalent to
$$\beta \leftarrow \beta + \alpha\Delta\beta^*$$
$$\Delta\beta^* = \operatorname*{argmin}_{\Delta\beta} \left( L(\beta) + \nabla L(\beta)^T \Delta\beta + \frac{1}{2}\Delta\beta^T H(\beta)\Delta\beta + \lambda_1\|\beta + \Delta\beta\|_1 \right)$$
where $H(\beta)$ is an iteration-dependent block-diagonal approximation to the Hessian $\nabla^2 L(\beta)$.
Theorem 2. The d-GLMNET algorithm converges at least linearly.
37. Numerical Experiments
We compared:
d-GLMNET
Online learning via truncated gradient (Vowpal Wabbit)
L-BFGS (Vowpal Wabbit)
ADMM with sharing (feature splitting)
1 We selected the best L1 and L2 regularization on the test set from the range $\{2^{-6}, \ldots, 2^{6}\}$
2 We found the parameters of online learning and ADMM yielding the best performance
3 For evaluating timing performance we repeated training 9 times and selected the run with the median time
40. Conclusions, Future Work
d-GLMNET is faster than state-of-the-art algorithms (online learning, L-BFGS, ADMM) on sparse high-dimensional datasets.
d-GLMNET can be easily extended to:
other [block-]separable regularizers: bridge, SCAD, group Lasso, etc.
other generalized linear models
Extending the software architecture to boosting:
$$F^*(x) = \sum_{i=1}^{M} f_i(x), \quad \text{where } f_i(x) \text{ is a weak learner}$$
Let machine m fit a weak learner $f_i^m(x^m)$ on its subset of input features $S_m$. Then
$$f_i(x) = \alpha \sum_{m=1}^{M} f_i^m(x^m)$$
where $\alpha$ is calculated via line search, in a similar way as in the d-GLMNET algorithm.
42. Conclusions, Future Work
Software implementation: https://github.com/IlyaTrofimov/dlr
The paper is available upon request: Ilya Trofimov, trofim@yandex-team.ru