Distributed Coordinate Descent for Logistic Regression with Regularization
Ilya Trofimov (Yandex Data Factory)
Alexander Genkin (AVG Consulting)
presented by Ilya Trofimov
Machine Learning: Prospects and Applications
5-8 October 2015, Berlin, Germany
Large Scale Machine Learning
Large Scale Machine Learning = Big Data + ML
Many applications in web search, online advertising, e-commerce, text processing etc.
Key features of Large Scale Machine Learning problems:
1 Large number of examples n
2 High dimensionality p
Datasets are often:
1 Sparse
2 Don't fit in the memory of a single machine
Linear methods for classification and regression are often used for large-scale problems:
1 Training & testing for linear models are fast
2 High-dimensional datasets are rich and non-linearities are not required
Binary Classification
Supervised machine learning problem:
given a feature vector x_i ∈ R^p, predict y_i ∈ {−1, +1}.
The function
F : x → y
should be built using the training dataset {(x_i, y_i)}_{i=1}^n and should minimize the expected risk:
E_{x,y} Ψ(y, F(x))
where Ψ(·, ·) is some loss function.
Logistic Regression
Logistic regression is a special case of the Generalized Linear Model with the logit link function:
y_i ∈ {−1, +1}
P(y = +1 | x) = 1 / (1 + exp(−β^T x))
Negated log-likelihood (empirical risk) L(β):
L(β) = Σ_{i=1}^n log(1 + exp(−y_i β^T x_i))
β* = argmin_β [L(β) + R(β)], where R(β) is a regularizer.
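The loss and its gradient above can be sketched in plain Python (an illustrative sketch, not code from the talk; `logistic_loss` and `logistic_grad` are hypothetical names):

```python
# Negated log-likelihood L(beta) and its gradient for labels y_i in {-1, +1}.
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def logistic_loss(beta, X, y):
    """L(beta) = sum_i log(1 + exp(-y_i * beta^T x_i))."""
    return sum(math.log1p(math.exp(-yi * dot(beta, xi)))
               for xi, yi in zip(X, y))

def logistic_grad(beta, X, y):
    """dL/dbeta_j = -sum_i y_i * x_ij * sigma(-y_i * beta^T x_i)."""
    p = len(beta)
    g = [0.0] * p
    for xi, yi in zip(X, y):
        s = 1.0 / (1.0 + math.exp(yi * dot(beta, xi)))  # sigma(-y_i * m_i)
        for j in range(p):
            g[j] -= yi * xi[j] * s
    return g
```

At β = 0 every margin is zero, so L(0) = n · log 2, which is a quick sanity check for an implementation.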
Logistic Regression, regularization
L2-regularization:
argmin_β [L(β) + (λ_2 / 2) ||β||^2]
Minimization of a smooth convex function.
Optimization techniques for large datasets:
SGD
Conjugate gradients
L-BFGS
Coordinate descent (GLMNET, BBR)
In the distributed setting:
SGD: poor parallelization
Conjugate gradients: good parallelization
L-BFGS: good parallelization
Coordinate descent (GLMNET, BBR): ?
Logistic Regression, regularization
L1-regularization, which provides feature selection:
argmin_β (L(β) + λ_1 ||β||_1)
Minimization of a non-smooth convex function.
Optimization techniques for large datasets:
Subgradient method
Online learning via truncated gradient
Coordinate descent (GLMNET, BBR)
In the distributed setting:
Subgradient method: slow
Online learning via truncated gradient: poor parallelization
Coordinate descent (GLMNET, BBR): ?
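The building block of coordinate descent for L1-regularized objectives (as in GLMNET/BBR-style solvers) is the one-dimensional soft-thresholding update, which has a closed form. A minimal sketch on a least-squares + L1 (lasso) toy objective, not the solver from the talk; names are hypothetical:

```python
def soft_threshold(v, lam):
    """Closed-form argmin_z 0.5*(z - v)**2 + lam*abs(z)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # residual r = y - X beta, kept in sync as coordinates change
    r = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(n_iter):
        for j in range(p):
            a = sum(X[i][j] ** 2 for i in range(n))               # curvature
            rho = sum(X[i][j] * r[i] for i in range(n)) + a * beta[j]
            new_bj = soft_threshold(rho, lam) / a
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta
```

The thresholding step is what zeroes out coordinates, producing the feature selection mentioned above.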
How to run coordinate descent in parallel?
Suppose we have several machines (a cluster).
The dataset is split by features among machines (column blocks S_1, S_2, …, S_M of the examples × features matrix):
S_1 ∪ … ∪ S_M = {1, …, p}
S_m ∩ S_k = ∅ for k ≠ m
β^T = ((β^1)^T, (β^2)^T, …, (β^M)^T)
Each machine makes steps on its own subset of input features, producing ∆β^m.
Problems
Two main questions:
1 How to compute ∆β^m
2 How to organize communication between machines
Answers:
1 Each machine makes a step using the GLMNET algorithm.
2 ∆β = Σ_{m=1}^M ∆β^m
Steps from different machines can come into conflict, so that the target function increases:
L(β + ∆β) + R(β + ∆β) > L(β) + R(β)
Problems
β ← β + α∆β, 0 < α ≤ 1
where α is found by the Armijo rule:
L(β + α∆β) + R(β + α∆β) ≤ L(β) + R(β) + ασD_k
D_k = ∇L(β)^T ∆β + R(β + ∆β) − R(β)
L(β + α∆β) = Σ_{i=1}^n log(1 + exp(−y_i (β + α∆β)^T x_i))
R(β + α∆β) = Σ_{m=1}^M R(β^m + α∆β^m)
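The step size α is typically found by backtracking: start at α = 1 and halve until the Armijo condition above holds. A minimal sketch (illustrative, not the authors' implementation; `objective` stands for L(β) + R(β), `D` for the decrease estimate D_k, and σ is a small constant):

```python
def armijo_line_search(objective, beta, dbeta, D, sigma=0.01, max_halvings=30):
    """Backtracking line search: accept the largest alpha in {1, 1/2, 1/4, ...}
    with objective(beta + alpha*dbeta) <= objective(beta) + alpha*sigma*D."""
    f0 = objective(beta)
    alpha = 1.0
    for _ in range(max_halvings):
        trial = [b + alpha * d for b, d in zip(beta, dbeta)]
        if objective(trial) <= f0 + alpha * sigma * D:
            return alpha, trial
        alpha *= 0.5
    return alpha, trial  # fall back to the smallest step tried
```

Since D_k < 0 for a descent direction, the accepted step is guaranteed to decrease the regularized objective, which is exactly what repairs conflicting block steps.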
Effective communication between machines
L(β + α∆β) = Σ_{i=1}^n log(1 + exp(−y_i (β + α∆β)^T x_i))
R(β + α∆β) = Σ_{m=1}^M R(β^m + α∆β^m)
Data transfer:
(β^T x_i) are kept synchronized
(∆β^T x_i) are summed up via MPI_AllReduce (M vectors of size n)
R(β^m + α∆β^m) and ∇L(β)^T ∆β^m are calculated separately, then summed up (M scalars)
Total communication cost: M(n + 1)
Distributed GLMNET (d-GLMNET)
d-GLMNET Algorithm
Input: training dataset {(x_i, y_i)}_{i=1}^n, split into M parts over features.
β^m ← 0, ∆β^m ← 0, where m is the index of a machine.
Repeat until converged:
1 Do in parallel over M machines:
2 Find ∆β^m and calculate ((∆β^m)^T x_i)
3 Sum up ∆β^m, ((∆β^m)^T x_i) using MPI_AllReduce
4 ∆β ← Σ_{m=1}^M ∆β^m
5 (∆β^T x_i) ← Σ_{m=1}^M ((∆β^m)^T x_i)
6 Find α using line search with the Armijo rule
7 β ← β + α∆β
8 (exp(β^T x_i)) ← (exp(β^T x_i + α∆β^T x_i))
Solving the "slow node" problem
Distributed Machine Learning Algorithm
Do until converged:
1 Do some computations in parallel over M machines
2 Synchronize (PROBLEM: M − 1 fast machines will wait for the 1 slow one)
Our solution: machine m at iteration k updates only a subset P_k^m ⊆ S_m of its input features.
Synchronization is done asynchronously in a separate thread; we call this Asynchronous Load Balancing (ALB).
Theoretical Results
Theorem 1. Each iteration of d-GLMNET is equivalent to
β ← β + α∆β*
∆β* = argmin_{∆β} [L(β) + ∇L(β)^T ∆β + (1/2) ∆β^T H(β) ∆β + λ_1 ||β + ∆β||_1]
where H(β) is an iteration-dependent block-diagonal approximation to the Hessian ∇²L(β).
Theorem 2. The d-GLMNET algorithm converges at least linearly.
Numerical Experiments

dataset   | size  | #examples (train / test / validation) | #features   | nnz
epsilon   | 12 GB | 0.4 / 0.05 / 0.05 × 10^6              | 2000        | 8.0 × 10^8
webspam   | 21 GB | 0.315 / 0.0175 / 0.0175 × 10^6        | 16.6 × 10^6 | 1.2 × 10^9
yandex_ad | 56 GB | 57 / 2.35 / 2.35 × 10^6               | 35 × 10^6   | 5.57 × 10^9

16 machines: Intel(R) Xeon(R) CPU E5-2660 2.20GHz, 32 GB RAM, gigabit Ethernet.
Numerical Experiments
We compared:
d-GLMNET
Online learning via truncated gradient (Vowpal Wabbit)
L-BFGS (Vowpal Wabbit)
ADMM with sharing (feature splitting)
1 We selected the best L1 and L2 regularization on the test set from the range {2^−6, …, 2^6}
2 We found the parameters of online learning and ADMM that yielded the best performance
3 For evaluating timing performance we repeated training 9 times and selected the run with the median time
"yandex_ad" dataset, testing quality vs. time (two panels: L2 regularization, L1 regularization)
Conclusions & Future Work
d-GLMNET is faster than state-of-the-art algorithms (online learning, L-BFGS, ADMM) on sparse high-dimensional datasets.
d-GLMNET can be easily extended to:
other [block-]separable regularizers: bridge, SCAD, group Lasso, etc.
other generalized linear models
Extending the software architecture to boosting:
F*(x) = Σ_{i=1}^M f_i(x), where f_i(x) is a weak learner.
Let machine m fit a weak learner f_i^m(x^m) on its subset of input features S_m. Then
f_i(x) = α Σ_{m=1}^M f_i^m(x^m)
where α is calculated via line search, in a similar way as in the d-GLMNET algorithm.
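The proposed stage combination can be sketched as follows (illustrative only; the slides do not give an implementation, and `boosted_stage` and its arguments are hypothetical names):

```python
def boosted_stage(x_blocks, weak_learners, alpha):
    """One boosting stage: f_i(x) = alpha * sum_m f_i^m(x^m),
    where machine m's weak learner sees only its feature block x^m."""
    return alpha * sum(f(xm) for f, xm in zip(weak_learners, x_blocks))
```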
Conclusions & Future Work
Software implementation: https://github.com/IlyaTrofimov/dlr
The paper is available by request: Ilya Trofimov - trofim@yandex-team.ru
Thank you :)
Questions?
Más contenido relacionado

La actualidad más candente

Bayesian Dark Knowledge and Matrix Factorization
Bayesian Dark Knowledge and Matrix FactorizationBayesian Dark Knowledge and Matrix Factorization
Bayesian Dark Knowledge and Matrix FactorizationPreferred Networks
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks남주 김
 
Variational Autoencoded Regression of Visual Data with Generative Adversarial...
Variational Autoencoded Regression of Visual Data with Generative Adversarial...Variational Autoencoded Regression of Visual Data with Generative Adversarial...
Variational Autoencoded Regression of Visual Data with Generative Adversarial...NAVER Engineering
 
Additive model and boosting tree
Additive model and boosting treeAdditive model and boosting tree
Additive model and boosting treeDong Guo
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersKimikazu Kato
 
Gradient descent optimizer
Gradient descent optimizerGradient descent optimizer
Gradient descent optimizerHojin Yang
 
MATLAB for Technical Computing
MATLAB for Technical ComputingMATLAB for Technical Computing
MATLAB for Technical ComputingNaveed Rehman
 
Algebraic data types: Semilattices
Algebraic data types: SemilatticesAlgebraic data types: Semilattices
Algebraic data types: SemilatticesBernhard Huemer
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsChristopher Conlan
 
Learning Deep Learning
Learning Deep LearningLearning Deep Learning
Learning Deep Learningsimaokasonse
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonChristopher Conlan
 
Ch01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonCh01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonshin
 
Circular convolution Using DFT Matlab Code
Circular convolution Using DFT Matlab CodeCircular convolution Using DFT Matlab Code
Circular convolution Using DFT Matlab CodeBharti Airtel Ltd.
 

La actualidad más candente (20)

Bayesian Dark Knowledge and Matrix Factorization
Bayesian Dark Knowledge and Matrix FactorizationBayesian Dark Knowledge and Matrix Factorization
Bayesian Dark Knowledge and Matrix Factorization
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Variational Autoencoded Regression of Visual Data with Generative Adversarial...
Variational Autoencoded Regression of Visual Data with Generative Adversarial...Variational Autoencoded Regression of Visual Data with Generative Adversarial...
Variational Autoencoded Regression of Visual Data with Generative Adversarial...
 
Additive model and boosting tree
Additive model and boosting treeAdditive model and boosting tree
Additive model and boosting tree
 
Parallel Algorithms
Parallel AlgorithmsParallel Algorithms
Parallel Algorithms
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning Programmers
 
Gradient descent optimizer
Gradient descent optimizerGradient descent optimizer
Gradient descent optimizer
 
MATLAB for Technical Computing
MATLAB for Technical ComputingMATLAB for Technical Computing
MATLAB for Technical Computing
 
Algebraic data types: Semilattices
Algebraic data types: SemilatticesAlgebraic data types: Semilattices
Algebraic data types: Semilattices
 
Algorithms 101 for Data Scientists
Algorithms 101 for Data ScientistsAlgorithms 101 for Data Scientists
Algorithms 101 for Data Scientists
 
Learning Deep Learning
Learning Deep LearningLearning Deep Learning
Learning Deep Learning
 
cyclic_code.pdf
cyclic_code.pdfcyclic_code.pdf
cyclic_code.pdf
 
Cryptography
CryptographyCryptography
Cryptography
 
3 analysis.gtm
3 analysis.gtm3 analysis.gtm
3 analysis.gtm
 
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahonGraph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
Graph Algorithms, Sparse Algebra, and the GraphBLAS with Janice McMahon
 
Ch01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluitonCh01 basic concepts_nosoluiton
Ch01 basic concepts_nosoluiton
 
Analysis of Algorithum
Analysis of AlgorithumAnalysis of Algorithum
Analysis of Algorithum
 
Circular convolution Using DFT Matlab Code
Circular convolution Using DFT Matlab CodeCircular convolution Using DFT Matlab Code
Circular convolution Using DFT Matlab Code
 
2D Plot Matlab
2D Plot Matlab2D Plot Matlab
2D Plot Matlab
 

Similar a Distributed Coordinate Descent for Logistic Regression with Regularization

Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues listsJames Wong
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
StacksqueueslistsFraboni Ec
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsYoung Alista
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsTony Nguyen
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues listsHarry Potter
 
complexity analysis.pdf
complexity analysis.pdfcomplexity analysis.pdf
complexity analysis.pdfpasinduneshan
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonSimone Piunno
 
Introduction to Matlab
Introduction to MatlabIntroduction to Matlab
Introduction to MatlabAmr Rashed
 
CP4151 Advanced data structures and algorithms
CP4151 Advanced data structures and algorithmsCP4151 Advanced data structures and algorithms
CP4151 Advanced data structures and algorithmsSheba41
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Gael Varoquaux
 
Python Programming - IX. On Randomness
Python Programming - IX. On RandomnessPython Programming - IX. On Randomness
Python Programming - IX. On RandomnessRanel Padon
 
dynamic programming Rod cutting class
dynamic programming Rod cutting classdynamic programming Rod cutting class
dynamic programming Rod cutting classgiridaroori
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
 
A MATLAB project on LCR circuits
A MATLAB project on LCR circuitsA MATLAB project on LCR circuits
A MATLAB project on LCR circuitssvrohith 9
 

Similar a Distributed Coordinate Descent for Logistic Regression with Regularization (20)

Stack squeues lists
Stack squeues listsStack squeues lists
Stack squeues lists
 
Stacksqueueslists
StacksqueueslistsStacksqueueslists
Stacksqueueslists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
Stacks queues lists
Stacks queues listsStacks queues lists
Stacks queues lists
 
complexity analysis.pdf
complexity analysis.pdfcomplexity analysis.pdf
complexity analysis.pdf
 
Es272 ch1
Es272 ch1Es272 ch1
Es272 ch1
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
 
Matlab intro
Matlab introMatlab intro
Matlab intro
 
Introduction to Matlab
Introduction to MatlabIntroduction to Matlab
Introduction to Matlab
 
Dynamic pgmming
Dynamic pgmmingDynamic pgmming
Dynamic pgmming
 
CP4151 Advanced data structures and algorithms
CP4151 Advanced data structures and algorithmsCP4151 Advanced data structures and algorithms
CP4151 Advanced data structures and algorithms
 
Slides
SlidesSlides
Slides
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
Python Programming - IX. On Randomness
Python Programming - IX. On RandomnessPython Programming - IX. On Randomness
Python Programming - IX. On Randomness
 
dynamic programming Rod cutting class
dynamic programming Rod cutting classdynamic programming Rod cutting class
dynamic programming Rod cutting class
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
A MATLAB project on LCR circuits
A MATLAB project on LCR circuitsA MATLAB project on LCR circuits
A MATLAB project on LCR circuits
 

Último

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 

Último (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 

Distributed Coordinate Descent for Logistic Regression with Regularization

  • 1. Distributed Coordinate Descent for Logistic Regression with Regularization Ilya Tromov (Yandex Data Factory) Alexander Genkin (AVG Consulting) presented by Ilya Tromov Machine Learning: Prospects and Applications 58 October 2015, Berlin, Germany
  • 2. Large Scale Machine Learning Large Scale Machine Learning = Big Data + ML
  • 3. Large Scale Machine Learning Large Scale Machine Learning = Big Data + ML Many applications in web search, online advertising, e-commerce, text processing etc.
  • 4. Large Scale Machine Learning Large Scale Machine Learning = Big Data + ML Many applications in web search, online advertising, e-commerce, text processing etc. Key features of Large Scale Machine Learning problems: 1 Large number of examples n 2 High dimensionality p
  • 5. Large Scale Machine Learning Large Scale Machine Learning = Big Data + ML Many applications in web search, online advertising, e-commerce, text processing etc. Key features of Large Scale Machine Learning problems: 1 Large number of examples n 2 High dimensionality p Datasets are often: 1 Sparse 2 Don't t memory of a single machine
  • 6. Large Scale Machine Learning Large Scale Machine Learning = Big Data + ML Many applications in web search, online advertising, e-commerce, text processing etc. Key features of Large Scale Machine Learning problems: 1 Large number of examples n 2 High dimensionality p Datasets are often: 1 Sparse 2 Don't t memory of a single machine Linear methods for classication and regression are often used for large-scale problems: 1 Training testing for linear models are fast 2 High dimensional datasets are rich and non-linearities are not required
  • 7. Binary Classication Supervised machine learning problem: given feature vector xi ∈ Rp predict yi ∈ {−1, +1}. Function F : x → y should be built using training dataset {xi , yi }n i=1 and minimize expected risk: Ex,y Ψ(y, F(x)) where Ψ(·, ·) is some loss function.
  • 8. Logistic Regression Logistic regression is a special case of Generalized Linear Model with the logit link function: yi ∈ {−1, +1} P(y = +1|x) = 1 1 + exp(−βT x)
  • 9. Logistic Regression Logistic regression is a special case of Generalized Linear Model with the logit link function: yi ∈ {−1, +1} P(y = +1|x) = 1 1 + exp(−βT x) Negated log-likelihood (empirical risk) L(β) L(β) = n i=1 log(1 + exp(−yi βT xi )) β∗ = argmin β L(β)
  • 10. Logistic Regression Logistic regression is a special case of Generalized Linear Model with the logit link function: yi ∈ {−1, +1} P(y = +1|x) = 1 1 + exp(−βT x) Negated log-likelihood (empirical risk) L(β) L(β) = n i=1 log(1 + exp(−yi βT xi )) β∗ = argmin β L(β) + R(β) regularizer
• 12. Logistic Regression, regularization L2-regularization: argmin_β (L(β) + (λ2/2) ||β||^2). L1-regularization, provides feature selection: argmin_β (L(β) + λ1 ||β||_1)
• 13. Logistic Regression, regularization L2-regularization: argmin_β (L(β) + (λ2/2) ||β||^2). Minimization of a smooth convex function.
• 14. Logistic Regression, regularization L2-regularization: argmin_β (L(β) + (λ2/2) ||β||^2). Minimization of a smooth convex function. Optimization techniques for large datasets: SGD, Conjugate gradients, L-BFGS, Coordinate descent (GLMNET, BBR)
• 15. Logistic Regression, regularization L2-regularization: argmin_β (L(β) + (λ2/2) ||β||^2). Minimization of a smooth convex function. Optimization techniques for large datasets, distributed: SGD - poor parallelization; Conjugate gradients - good parallelization; L-BFGS - good parallelization; Coordinate descent (GLMNET, BBR) - ?
• 16. Logistic Regression, regularization L1-regularization, provides feature selection: argmin_β (L(β) + λ1 ||β||_1). Minimization of a non-smooth convex function.
• 17. Logistic Regression, regularization L1-regularization, provides feature selection: argmin_β (L(β) + λ1 ||β||_1). Minimization of a non-smooth convex function. Optimization techniques for large datasets: Subgradient method, Online learning via truncated gradient, Coordinate descent (GLMNET, BBR)
• 18. Logistic Regression, regularization L1-regularization, provides feature selection: argmin_β (L(β) + λ1 ||β||_1). Minimization of a non-smooth convex function. Optimization techniques for large datasets, distributed: Subgradient method - slow; Online learning via truncated gradient - poor parallelization; Coordinate descent (GLMNET, BBR) - ?
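The inner update used by GLMNET-style coordinate descent for the L1 case can be sketched as a one-dimensional Newton step on feature j followed by soft-thresholding. This is an illustration of the technique, not the paper's implementation; the helper names and the cached `margins` vector are assumptions:

```python
import math

def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0)
    return math.copysign(max(abs(z) - lam, 0.0), z)

def cd_step(j, beta, X, y, margins, lam1):
    """One coordinate-descent update on feature j for L1 logistic regression:
    Newton step on the quadratic approximation, then soft-thresholding.
    `margins` caches beta^T x_i for all examples and is updated in place."""
    g, h = 0.0, 1e-10
    for i, (xi, yi) in enumerate(zip(X, y)):
        p = 1.0 / (1.0 + math.exp(-margins[i]))   # P(y = +1 | x_i)
        g += (p - (1 if yi == 1 else 0)) * xi[j]  # dL/dbeta_j
        h += p * (1 - p) * xi[j] ** 2             # d2L/dbeta_j^2
    new_bj = soft_threshold(h * beta[j] - g, lam1) / h
    delta = new_bj - beta[j]
    for i, xi in enumerate(X):                    # keep margins in sync
        margins[i] += delta * xi[j]
    beta[j] = new_bj
    return beta
```

A sufficiently large λ1 drives the coordinate exactly to zero, which is where the feature selection comes from.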
  • 19. How to run coordinate descent in parallel? Suppose we have several machines (cluster)
• 20. How to run coordinate descent in parallel? Suppose we have several machines (cluster) [diagram: the feature columns are split into blocks S1, S2, ..., SM; rows are examples] The dataset is split by features among machines: S1 ∪ ... ∪ SM = {1, ..., p}, Sm ∩ Sk = ∅ for k ≠ m, β^T = ((β^1)^T, (β^2)^T, ..., (β^M)^T). Each machine makes steps ∆β^m on its own subset of input features.
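A minimal sketch of such a disjoint feature split; contiguous blocks are just one convenient choice, and `split_features` is a hypothetical helper name:

```python
def split_features(p, M):
    """Partition feature indices {0, ..., p-1} into M disjoint contiguous
    blocks S_1, ..., S_M whose union covers all features."""
    base, extra = divmod(p, M)
    blocks, start = [], 0
    for m in range(M):
        size = base + (1 if m < extra else 0)  # spread the remainder evenly
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks
```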
• 21. Problems Two main questions: 1 How to compute ∆β^m 2 How to organize communication between machines
• 22. Problems Two main questions: 1 How to compute ∆β^m 2 How to organize communication between machines Answers: 1 Each machine makes a step using the GLMNET algorithm.
• 23. Problems Two main questions: 1 How to compute ∆β^m 2 How to organize communication between machines Answers: 1 Each machine makes a step using the GLMNET algorithm. 2 ∆β = Σ_{m=1}^M ∆β^m. Steps from machines can come in conflict, so that the target function may increase: L(β + ∆β) + R(β + ∆β) > L(β) + R(β)
• 24. Problems β ← β + α∆β, 0 < α ≤ 1
• 25. Problems β ← β + α∆β, 0 < α ≤ 1, where α is found by the Armijo rule: L(β + α∆β) + R(β + α∆β) ≤ L(β) + R(β) + ασD_k, D_k = ∇L(β)^T ∆β + R(β + ∆β) − R(β)
• 26. Problems β ← β + α∆β, 0 < α ≤ 1, where α is found by the Armijo rule: L(β + α∆β) + R(β + α∆β) ≤ L(β) + R(β) + ασD_k, D_k = ∇L(β)^T ∆β + R(β + ∆β) − R(β), L(β + α∆β) = Σ_{i=1}^n log(1 + exp(−y_i (β + α∆β)^T x_i)), R(β + α∆β) = Σ_{m=1}^M R(β^m + α∆β^m)
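The backtracking search implied by the Armijo rule above can be sketched as follows. `L` and `R` are callables for the loss and regularizer, and `grad_dot_delta` stands for ∇L(β)^T ∆β; this is an illustrative sketch, not the authors' code:

```python
def armijo_line_search(beta, delta, L, R, grad_dot_delta, sigma=0.01, b=0.5):
    """Backtracking line search with the Armijo rule from the slides:
    accept the largest alpha in {1, b, b^2, ...} such that
      L(beta + alpha*delta) + R(beta + alpha*delta)
        <= L(beta) + R(beta) + alpha * sigma * D,
    where D = grad(L)^T delta + R(beta + delta) - R(beta)."""
    f0 = L(beta) + R(beta)
    beta_full = [bi + di for bi, di in zip(beta, delta)]
    D = grad_dot_delta + R(beta_full) - R(beta)
    alpha = 1.0
    while alpha > 1e-10:
        trial = [bi + alpha * di for bi, di in zip(beta, delta)]
        if L(trial) + R(trial) <= f0 + alpha * sigma * D:
            return alpha
        alpha *= b  # shrink the step and try again
    return alpha
```

When the combined step ∆β overshoots (the conflict case from the previous slide), the loop simply shrinks α until the objective stops increasing.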
• 27. Effective communication between machines L(β + α∆β) = Σ_{i=1}^n log(1 + exp(−y_i (β + α∆β)^T x_i)), R(β + α∆β) = Σ_{m=1}^M R(β^m + α∆β^m) Data transfer: (β^T x_i) are kept synchronized; (∆β^T x_i) are summed up via MPI_AllReduce (M vectors of size n); R(β^m + α∆β^m) and ∇L(β)^T ∆β^m are calculated separately and then summed up (M scalars). Total communication cost: M(n + 1)
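The role of MPI_AllReduce here is an elementwise sum whose result every machine receives. A single-process stand-in (no real MPI; purely for illustrating the data flow) makes this explicit:

```python
def all_reduce_sum(per_machine_vectors):
    """Simulate MPI_AllReduce with the sum operation: every machine ends up
    with the elementwise sum of all machines' vectors. In d-GLMNET each
    machine m holds the vector ((Delta beta^m)^T x_i) over all n examples,
    so M vectors of length n collapse into one full margin-update vector."""
    n = len(per_machine_vectors[0])
    total = [0.0] * n
    for vec in per_machine_vectors:
        for i, v in enumerate(vec):
            total[i] += v
    # every machine receives a copy of the reduced result
    return [list(total) for _ in per_machine_vectors]
```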
• 28. Distributed GLMNET (d-GLMNET) d-GLMNET Algorithm Input: training dataset {x_i, y_i}_{i=1}^n, split into M parts over features. β^m ← 0, ∆β^m ← 0, where m is the machine index. Repeat until converged: 1 Do in parallel over M machines: 2 Find ∆β^m and calculate ((∆β^m)^T x_i) 3 Sum up ∆β^m, ((∆β^m)^T x_i) using MPI_AllReduce 4 ∆β ← Σ_{m=1}^M ∆β^m 5 (∆β^T x_i) ← Σ_{m=1}^M ((∆β^m)^T x_i) 6 Find α using line search with the Armijo rule 7 β ← β + α∆β 8 (exp(β^T x_i)) ← (exp(β^T x_i + α∆β^T x_i))
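Putting the pieces together, a single-process simulation of the d-GLMNET loop. This is an illustrative sketch under simplifying assumptions (dense data, contiguous feature blocks, a plain halving line search instead of the full Armijo rule), not the authors' implementation:

```python
import math

def d_glmnet_sim(X, y, M=2, lam1=0.1, iters=20):
    """Simulate d-GLMNET in one process: features are split over M
    'machines', each machine proposes a coordinate-descent step on its
    block, the proposals are summed (the role of MPI_AllReduce), and a
    halving line search keeps the L1-regularized objective from rising."""
    n, p = len(X), len(X[0])
    beta, margins = [0.0] * p, [0.0] * n

    def objective(marg, b):
        return sum(math.log1p(math.exp(-yi * mi)) for yi, mi in zip(y, marg)) \
               + lam1 * sum(abs(bj) for bj in b)

    blocks = [list(range(m * p // M, (m + 1) * p // M)) for m in range(M)]
    for _ in range(iters):
        delta = [0.0] * p
        for block in blocks:                      # "in parallel" over machines
            for j in block:                       # GLMNET step on one block
                g, h = 0.0, 1e-10
                for xi, yi, mi in zip(X, y, margins):
                    prob = 1.0 / (1.0 + math.exp(-mi))
                    g += (prob - (1 if yi == 1 else 0)) * xi[j]
                    h += prob * (1 - prob) * xi[j] ** 2
                z = h * beta[j] - g
                delta[j] = math.copysign(max(abs(z) - lam1, 0.0), z) / h - beta[j]
        alpha = 1.0                               # halving line search
        while alpha > 1e-6:
            trial_b = [bj + alpha * dj for bj, dj in zip(beta, delta)]
            trial_m = [mi + alpha * sum(delta[j] * xi[j] for j in range(p))
                       for mi, xi in zip(margins, X)]
            if objective(trial_m, trial_b) <= objective(margins, beta):
                beta, margins = trial_b, trial_m
                break
            alpha *= 0.5
    return beta
```

On a tiny separable dataset the informative coordinate grows positive while an all-zero feature stays exactly at zero under the L1 penalty.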
• 29. Solving the "slow node" problem Distributed Machine Learning Algorithm Do until converged: 1 Do some computations in parallel over M machines 2 Synchronize
• 30. Solving the "slow node" problem Distributed Machine Learning Algorithm Do until converged: 1 Do some computations in parallel over M machines 2 Synchronize PROBLEM! M − 1 fast machines will wait for 1 slow one
• 31. Solving the "slow node" problem Distributed Machine Learning Algorithm Do until converged: 1 Do some computations in parallel over M machines 2 Synchronize PROBLEM! M − 1 fast machines will wait for 1 slow one Our solution: machine m at iteration k updates only a subset P_k^m ⊆ S_m of input features. The synchronization is done asynchronously in a separate thread; we call this Asynchronous Load Balancing (ALB).
• 32. Theoretical Results Theorem 1. Each iteration of d-GLMNET is equivalent to β ← β + α∆β*, ∆β* = argmin_{∆β} (L(β) + ∇L(β)^T ∆β + (1/2) ∆β^T H(β) ∆β + λ1 ||β + ∆β||_1), where H(β) is an iteration-dependent block-diagonal approximation to the Hessian ∇^2 L(β)
• 33. Theoretical Results Theorem 1. Each iteration of d-GLMNET is equivalent to β ← β + α∆β*, ∆β* = argmin_{∆β} (L(β) + ∇L(β)^T ∆β + (1/2) ∆β^T H(β) ∆β + λ1 ||β + ∆β||_1), where H(β) is an iteration-dependent block-diagonal approximation to the Hessian ∇^2 L(β) Theorem 2. The d-GLMNET algorithm converges at least linearly.
• 34. Numerical Experiments Datasets (size; #examples train/test/validation; #features; nnz): epsilon - 12 Gb; 0.4 / 0.05 / 0.05 × 10^6; 2000; 8.0 × 10^8. webspam - 21 Gb; 0.315 / 0.0175 / 0.0175 × 10^6; 16.6 × 10^6; 1.2 × 10^9. yandex_ad - 56 Gb; 57 / 2.35 / 2.35 × 10^6; 35 × 10^6; 5.57 × 10^9
• 35. Numerical Experiments Datasets (size; #examples train/test/validation; #features; nnz): epsilon - 12 Gb; 0.4 / 0.05 / 0.05 × 10^6; 2000; 8.0 × 10^8. webspam - 21 Gb; 0.315 / 0.0175 / 0.0175 × 10^6; 16.6 × 10^6; 1.2 × 10^9. yandex_ad - 56 Gb; 57 / 2.35 / 2.35 × 10^6; 35 × 10^6; 5.57 × 10^9 Hardware: 16 machines, Intel(R) Xeon(R) CPU E5-2660 2.20GHz, 32 GB RAM, gigabit Ethernet.
• 36. Numerical Experiments We compared: d-GLMNET; Online learning via truncated gradient (Vowpal Wabbit); L-BFGS (Vowpal Wabbit); ADMM with sharing (feature splitting)
• 37. Numerical Experiments We compared: d-GLMNET; Online learning via truncated gradient (Vowpal Wabbit); L-BFGS (Vowpal Wabbit); ADMM with sharing (feature splitting) 1 We selected the best L1 and L2 regularization on the test set from the range {2^−6, ..., 2^6} 2 We found the parameters of online learning and ADMM yielding the best performance 3 For evaluating timing performance we repeated training 9 times and selected the run with the median time
  • 38. ¾yandex_ad¿ dataset, testing quality vs time L2 regularization L1 regularization
• 39. Conclusions Future Work d-GLMNET is faster than state-of-the-art algorithms (online learning, L-BFGS, ADMM) on sparse high-dimensional datasets d-GLMNET can be easily extended to other [block-]separable regularizers: bridge, SCAD, group Lasso, etc. other generalized linear models
• 40. Conclusions Future Work d-GLMNET is faster than state-of-the-art algorithms (online learning, L-BFGS, ADMM) on sparse high-dimensional datasets d-GLMNET can be easily extended to other [block-]separable regularizers: bridge, SCAD, group Lasso, etc. other generalized linear models Extending the software architecture to boosting: F*(x) = Σ_{i=1}^M f_i(x), where f_i(x) is a weak learner. Let machine m fit a weak learner f_i^m(x^m) on the subset of input features S_m. Then f_i(x) = α Σ_{m=1}^M f_i^m(x^m), where α is calculated via line search, in a similar way as in the d-GLMNET algorithm.
  • 41. Conclusions Future Work Software implementation: https://github.com/IlyaTrofimov/dlr
• 42. Conclusions Future Work Software implementation: https://github.com/IlyaTrofimov/dlr The paper is available by request: Ilya Trofimov - trofim@yandex-team.ru