Understanding SparseNet: Theory and Practice
Nicholas Dronen
July 2, 2013
Overview
• First half technical, then practical
• Theory is necessary for intuition of good practice
• Practice - examples using R
What is SparseNet?
• sparsenet: an R package (2012) based on the MC+ algorithm
• Related packages: lars (2003), glmnet (2008)
• More genealogy later
• SparseNet fits linear models
A linear model
[Figure: a straight line y = mx + b fit to points (x, y), with slope m and intercept b.]
Linear models with many predictor variables
Y = Xβ + ε
• Y is a (n × 1) vector of response variables
• X is a (n × p) matrix of predictor variables
• β is a (p × 1) vector of coefficients
• ε is noise
Two tasks:
• Parameter estimation: estimate β
• Subset selection: find ≤ p variables to include in model
Task 1: Parameter estimation: which β is best?
A common measure of quality of a particular β is residual sum of
squares (RSS). Let
ŷ_i = Σ_{j=1}^{p} x_{ij} β_j
be the prediction for observation i. Then
RSS(β) = Σ_{i=1}^{n} (y_i − ŷ_i)²
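To make this concrete, here is a minimal R sketch (not from the slides; the data and coefficients are made up) that evaluates RSS for a candidate β:

# Minimal sketch (hypothetical data): RSS for a candidate beta
set.seed(1)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
beta.true <- c(3, 6, 0, -10)           # made-up coefficients
y <- drop(X %*% beta.true) + rnorm(n)

rss <- function(beta, X, y) {
  y.hat <- X %*% beta                  # predictions
  sum((y - y.hat)^2)                   # residual sum of squares
}
rss(beta.true, X, y)                   # small: beta is near the truth
rss(rep(0, p), X, y)                   # large: the all-zero beta ignores X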
Residuals
[Figure: a scatterplot of y against x with a fitted line; the vertical distance |y_i − ŷ_i| between an observed y_i and its prediction ŷ_i is the residual.]
Task 1: Parameter estimation: minimizing RSS
β̂ = arg min_β RSS(β)
Is β̂ computed by an exhaustive search over β? No
Task 1: Parameter estimation (cont.)
After some matrix calculus, estimating the parameters becomes a
linear algebra problem (an inversion and a few multiplications):
β̂ = (X^T X)^{−1} X^T y
This solution is sometimes called “ordinary least squares”
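A minimal sketch of the normal-equations solution on made-up data; in practice lm() (which uses a QR decomposition) does the same job more stably:

# Sketch: ordinary least squares via the normal equations, checked against lm()
set.seed(1)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, 6, 0, -10)) + rnorm(n)

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T y
drop(beta.hat)                                 # normal-equations estimate
coef(lm(y ~ X - 1))                            # same estimates from lm(), no intercept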
Task 2: Subset selection
• Find ≤ p predictor variables to include in model
• Reasons for subset selection
• Principle of parsimony, Occam’s razor
• Interpretation: each βi should be meaningful
• Prediction: if p is too large, the model overfits the training data
Task 2: Subset selection; linear models
• Different methods, different eras
• Stepwise regression (circa Common Era)
• Leaps and bounds (Furnival, 1974)
• Ridge regression: constrain β (Hoerl and Kennard,
1970)
• Lasso, sparsenet: constrain β, shrink irrelevant β_i's to 0 (Tibshirani, 1996; Mazumder et al, 2012)
• Ridge, lasso, sparsenet are shrinkage methods
Task 2: Subset selection; leaps and bounds
• Intelligently enumerate all possible subsets -
computationally expensive
• ∼ O(2^p) time complexity – very high latency
• Prevents modeler from iterating quickly
• Not feasible for wide matrices (p > 50) – job runs a day
or more
Shrinkage methods
When computing β̂, add a penalty term J and a parameter λ:
β̂(λ) = arg min_β [ RSS(β) + λ J(β) ]
Equivalently,
β̂(t) = arg min_β RSS(β)  subject to  J(β) ≤ t
The correspondence between λ and t is one to one
When λ is 0, β̂(λ) is just the ordinary least squares solution
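A small sketch of the penalized criterion as an R function; the penalty J is passed in so the same code covers the penalties introduced next. Nothing here comes from the slides:

# Sketch: the penalized criterion RSS(beta) + lambda * J(beta)
penalized <- function(beta, X, y, lambda, J) {
  sum((y - X %*% beta)^2) + lambda * J(beta)
}
ridge.J <- function(beta) sum(beta^2)      # squared l2 penalty
lasso.J <- function(beta) sum(abs(beta))   # l1 penalty
# With lambda = 0, penalized() reduces to RSS, i.e. the OLS criterion.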
Norms
Most penalty terms based on some norm of β:
• ℓ2: ||β||_2 = √(Σ_i β_i²) (Euclidean norm; the ridge penalty uses its square, Σ_i β_i²)
• ℓ1: ||β||_1 = Σ_i |β_i| (Manhattan/taxicab norm)
• ℓ0: ||β||_0 = Σ_i I(β_i ≠ 0) (number of non-zero entries)
Norms (concretely)
Let x be the vector [2, 1]
• ||x||_2² = 2² + 1² = 5 (so ||x||_2 = √5)
• ||x||_1 = |2| + |1| = 3
• ||x||_0 = I(2 ≠ 0) + I(1 ≠ 0) = 2
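The same arithmetic in R:

x <- c(2, 1)
sum(x^2)      # squared l2 norm: 2^2 + 1^2 = 5 (Euclidean length is sqrt(5))
sum(abs(x))   # l1 norm: |2| + |1| = 3
sum(x != 0)   # l0 "norm": number of non-zero entries = 2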
Visualizing norms
[Figure: the vector x = (2, 1) drawn from the origin (0, 0).]
Visualizing norms: ℓ2
[Figure: the straight-line (Euclidean) distance from (0, 0) to (2, 1); ||x||_2² = 2² + 1².]
Visualizing norms: ℓ1
[Figure: the taxicab path from (0, 0) to (2, 1): 2 units across, 1 unit up; ||x||_1 = |2| + |1|.]
Visualizing norms: ℓ0
[Figure: the vector (2, 1) has two non-zero coordinates; ||x||_0 = I(2 ≠ 0) + I(1 ≠ 0).]
Penalty terms
• Ridge: J(β) = ||β||_2² = Σ_i β_i²
• Lasso: J(β) = ||β||_1 = Σ_i |β_i|
• Geometry of a penalty defines the region of possible
solutions
Ridge regression: ℓ2 penalty
β̂(λ) = arg min_β [ RSS(β) + λ Σ_{i=1}^{p} β_i² ]
Equivalently,
β̂(t) = arg min_β RSS(β)  subject to  Σ_{i=1}^{p} β_i² ≤ t
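A hedged sketch of a ridge fit with the glmnet package (alpha = 0 selects the ridge penalty); the simulated data are for illustration only, not the data behind the slides:

# Sketch: ridge regression with glmnet; alpha = 0 gives the l2 (ridge) penalty
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 4), 100, 4)
y <- drop(X %*% c(3, 6, 0, -10)) + rnorm(100)

fit <- glmnet(X, y, alpha = 0, nlambda = 5)
coef(fit)   # coefficients shrink toward 0 as lambda grows, but are not set exactly to 0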
Region of an ℓ2 ball (λ = 1)
[Figure: points on the boundary of the ℓ2 ball Σ_i β_i² ≤ 1, e.g. (β1, β2) = (1, 0), (0.67, 0.74), (0.33, 0.94), (0, 1).]
Lasso: ℓ1 penalty
β̂(λ) = arg min_β [ RSS(β) + λ Σ_{i=1}^{p} |β_i| ]
Equivalently,
β̂(t) = arg min_β RSS(β)  subject to  Σ_{i=1}^{p} |β_i| ≤ t
Region of an ℓ1 ball (λ = 1)
[Figure: points on the boundary of the ℓ1 ball Σ_i |β_i| ≤ 1, e.g. (β1, β2) = (1, 0), (0.67, 0.33), (0.33, 0.67), (0, 1).]
Sparsity and shrinkage
Unlike the ℓ2 penalty (right green region), the ℓ1 penalty (left
green region) produces sparse models
[Figure: contours of the RSS meeting the ℓ1 diamond at a corner, where one coefficient is exactly 0, versus meeting the ℓ2 disc away from the axes.]
Hastie et al., The Elements of Statistical Learning, 2009
Why (some) shrinkage methods are great
Lasso and sparsenet perform two tasks simultaneously:
• Estimate β
• Find ≤ p predictor variables to include in model
Rest of talk
• Lasso (with examples)
• Model selection concepts
• SparseNet
Solving the lasso: original approach
• Estimating ˆβ for the lasso is a convex optimization
problem (specifically, quadratic programming [QP])
• Original approach – not terribly fast
• Find the smallest λ_0 at which all coefficients β̂ are 0
• Run a QP solver to estimate β̂(λ_i) for each λ_i in some subset of the range [0, λ_0]
• Use cross validation to determine the best λ_i
Solving the lasso: newer approaches
• Fastest algorithms for obtaining lasso solution are
implemented as R packages:
• lars (Hastie and Efron, 2003)
• Least angle regression
• Lasso solution is piecewise linear; a series of
clever projections
• glmnet (Friedman et al, 2008)
• Very fast coordinate descent
• Compute β̂(λ) by iteratively updating each β̂_i(λ) until convergence; akin to solving univariate regressions
• These methods are as fast as a single ordinary least
squares solution.
Lasso examples
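The slides do not show how X, y, and newX were constructed. A hypothetical simulation, roughly consistent with the coefficients that appear in the output below, might look like this:

# Hypothetical data generation for the examples that follow (not shown in the slides)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 4), n, 4)
colnames(X) <- paste0("V", 1:4)
y <- 4 + 2.7 * X[, 1] + 6.8 * X[, 2] + 0.2 * X[, 3] - 10.5 * X[, 4] + rnorm(n)
newX <- X[1, , drop = FALSE]   # a single observation for the predict() examples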
LARS
> library(lars)
> object <- lars(X, y)
> coef(object)
           V1       V2        V3        V4
[1,] 0.000000 0.000000 0.0000000   0.00000
[2,] 0.000000 4.059090 0.0000000   0.00000
[3,] 0.000000 5.597966 0.1454205   0.00000
[4,] 1.081084 5.712019 0.1590585   0.00000
[5,] 2.723949 6.789427 0.2128308 -10.53639
Coefficient profile (LARS)
[Figure: lars coefficient profile (LASSO path) – standardized coefficients plotted against |beta|/max|beta|; each variable enters the model at a different point along the path.]
GLMNET
> library(glmnet)
> object <- glmnet(X, y, nlambda=5)
> coef(object)
              s0     s1     s2      s3      s4
(Intercept) 4.07 -2.193  3.068   3.593   3.644
V1             .  1.362  2.586   2.709   2.723
V2             .  5.895  6.700   6.781   6.789
V3             .  0.168  0.208   0.212   0.213
V4             . -1.792 -9.664 -10.450 -10.528
Coefficient profile (GLMNET)
[Figure: glmnet coefficient profile – coefficients plotted against the L1 norm of the fit; the number of non-zero coefficients is shown along the top axis.]
Theory of the lasso
[Figure: a map of conditions used in the theory of the lasso – coherence, adaptive restricted regression, minimal adaptive restricted eigenvalue, restricted eigenvalue, compatibility, and the irrepresentable condition – and the results connecting them (Theorems 6.2, 6.4, 7.2, 7.3, Corollary 6.13, Lemma 6.11).]
Buhlmann and van de Geer, Statistics for High-Dimensional Data:
Methods, Theory, and Applications, Springer-Verlag, 2011
Some limitations of the lasso
• “The lasso penalty is somewhat indifferent to the choice
among a set of strong but correlated variables” –
Tibshirani (link)
• Variable selection consistency only under the
irrepresentable condition (“On Model Selection
Consistency of Lasso”, Zhao and Yu, 2006)
• Some theoretical conditions apply only in the extreme
p >> n
• For our purposes: does the learner yield models with high
predictive accuracy?
Model selection concepts
Cross validation
• Leave-one-out (LOO) cross validation is nearly unbiased but has high variance
• Bias: how accurately CV estimates performance on held-out data
• Variance: how sensitive the estimate is to variations in the training data
• K-fold cross validation is preferred (K = 5 . . . 20)
Model selection
You are given a set of models and their K-fold CV performance
estimates. How do you choose the one that will perform well on
unseen data?
Model selection: choose best
• Choose the model with the best CV performance estimate
• If you trust CV estimates, why not?
Model selection: choose using 1 standard-error rule
Choose the model that
• is the most parsimonious
• has CV performance close enough to the best one
Breiman’s one standard-error rule (Classification and Regression
Trees, 1984, Breiman et al, p. 80) defines “close enough” as
having an error no more than one standard error above the error of
the best model
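A hedged sketch of the rule applied to a cv.glmnet fit on simulated data; cv.glmnet already stores the same choice as lambda.1se, so the last line is only a check:

# Sketch: the one standard-error rule, recovered from a cv.glmnet object
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 4), 100, 4)
y <- drop(X %*% c(3, 6, 0, -10)) + rnorm(100)

cvfit <- cv.glmnet(X, y, nfolds = 10)                 # nfolds is K in K-fold CV
best <- which.min(cvfit$cvm)                          # index of the lowest CV error
cutoff <- cvfit$cvm[best] + cvfit$cvsd[best]          # "close enough": within one SE
lambda.1se <- max(cvfit$lambda[cvfit$cvm <= cutoff])  # most parsimonious such lambda
c(manual = lambda.1se, stored = cvfit$lambda.1se)     # should agree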
Model selection: best vs. 1 SE rule
[Figure: train and test MSE plotted against the number of variables (NVars); the model picked by the one SE rule uses fewer variables than the model with the lowest CV error, with nearly the same test error.]
Model selection example with GLMNET
> library(glmnet)
> object <- cv.glmnet(X, y, nlambda=5)
> object$lambda.1se; object$lambda.min
[1] 0.182
[1] 0.00182
> predict(object, newX, s=object$lambda.1se)[1,]
   1
5.21
> predict(object, newX, s=object$lambda.min)[1,]
   1
5.36
SparseNet
SparseNet
• Based on Cun-Hui Zhang’s (Rutgers statistics) MC+
algorithm
• “Nearly unbiased variable selection under minimax
concave penalty”, Zhang, 2010
• Uses same coordinate descent approach as glmnet
• MC+ has two components
• Minimax concave penalty (MCP) (more complex than the ℓ1 penalty)
• Penalized linear unbiased selection (PLUS) algorithm
• Variable selection consistent under more general
conditions than the lasso
• Zhang has implementation on CRAN: plus
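The MCP itself is easy to write down. Below is a sketch of the penalty for a single coefficient (the penalty function only, not the sparsenet/PLUS fitting algorithm); the form follows Zhang (2010):

# Sketch: the minimax concave penalty (MCP) for one coefficient b
# P(b; lambda, gamma) = lambda * (|b| - b^2 / (2 * gamma * lambda))  if |b| <= gamma * lambda
#                     = gamma * lambda^2 / 2                         otherwise
mcp <- function(b, lambda, gamma) {
  ifelse(abs(b) <= gamma * lambda,
         lambda * (abs(b) - b^2 / (2 * gamma * lambda)),
         gamma * lambda^2 / 2)
}
# gamma -> Inf behaves like the lasso penalty lambda * |b|;
# gamma near 1 approaches a hard-thresholding (subset-like) penalty.
mcp(seq(-3, 3, by = 1.5), lambda = 1, gamma = 2)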
SparseNet
> library(sparsenet)
> object <- sparsenet(X, y, nlambda=5, ngamma=4)
> coef(object)
$g1
                 l0    l1    l2     l3     l4
(Intercept) 4.1e+00 -2.22  3.03   3.58   3.61
V1                .  1.39  2.62   2.72   2.75
V2          2.1e-15  5.89  6.69   6.78   6.78
V3                .  0.17  0.21   0.21   0.21
V4                . -1.77 -9.64 -10.44 -10.51
$g2
                 l0    l1    l2     l3     l4
(Intercept) 4.1e+00 -2.05  3.30   3.67   3.66
V1                .  1.30  2.49   2.70   2.72
V2          4.2e-15  5.95  6.77   6.79   6.79
...
Coefficient profile (SparseNet)
[Figure: four coefficient profiles, one per value of γ – Lasso (large γ), Gamma = 150, Gamma = 12.2, and Subset (γ near 1) – each plotting coefficients against the L1 norm of the fit; smaller γ gives sparser solution paths.]
Model selection example with SparseNet
> library(sparsenet)
> object <- cv.sparsenet(X, y, nlambda=5)
> object$parms.1se; object$parms.min
  gamma  lambda
  8.563   0.081
  gamma  lambda
9.9e+35 8.1e-04
Model selection example with SparseNet (cont.)
> library(sparsenet)
> object <- cv.sparsenet(X, y, nlambda=5)
> predict(object, newX, which="parms.min")[1,]
  1
5.4
> predict(object, newX, which="parms.1se")[1,]
  1
5.2
Summary
• A refinement of the lasso (Tibshirani, 1996)
• Strong theoretical guarantees, good empirical performance
• Best-in-class linear modeling algorithm