2. Overview
• First half technical, second half practical
• Theory is necessary to build intuition for good practice
• Practice: examples using R
3. What is SparseNet?
• sparsenet: an R package (2012) based on the MC+ algorithm
• Related packages: lars (2003), glmnet (2008)
• More genealogy later
• SparseNet fits linear models
5. Linear models with many predictor variables
Y = Xβ + ε
• Y is an (n × 1) vector of response variables
• X is an (n × p) matrix of predictor variables
• β is a (p × 1) vector of coefficients
• ε is noise
Two tasks:
• Parameter estimation: estimate β
• Subset selection: find ≤ p variables to include in model
6. Task 1: Parameter estimation: which β is best?
A common measure of the quality of a particular β is the residual sum of squares (RSS). Let

ŷᵢ = Σⱼ₌₁ᵖ xᵢⱼ βⱼ

be the prediction for observation i. Then

RSS(β) = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
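RSS is a one-liner in R. A minimal sketch with made-up toy data (these X, y, and β are not from the talk):

# Toy data: n = 5 observations, p = 2 predictors (made-up numbers)
X <- matrix(c(1, 2, 3, 4, 5,
              2, 1, 0, 1, 2), ncol = 2)
y <- c(3, 4, 3, 5, 7)
beta <- c(1.1, 0.4)        # a candidate coefficient vector

y_hat <- X %*% beta        # predictions: y_hat[i] = sum_j X[i,j] * beta[j]
rss <- sum((y - y_hat)^2)  # residual sum of squares
rss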
8. Task 1: Parameter estimation: minimizing RSS
β̂ = arg min_β RSS(β)

Is β̂ computed by an exhaustive search over candidate βs? No.
9. Task 1: Parameter estimation (cont.)
After some matrix calculus, estimating the parameters becomes a linear algebra problem (one inversion and a few multiplications):

β̂ = (XᵀX)⁻¹ Xᵀy

This solution is sometimes called “ordinary least squares” (OLS)
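A minimal R sketch of this closed form, reusing the toy X and y above and checking against R's built-in lm():

# Closed-form OLS: one inversion and a few multiplications
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat

# Should agree with R's least-squares fit (no intercept in this toy setup)
coef(lm(y ~ X - 1))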
10. Task 2: Subset selection
• Find ≤ p predictor variables to include in model
• Reasons for subset selection
• Principle of parsimony, Occam’s razor
• Interpretation: each βi should be meaningful
• Prediction: if p is too large, the model overfits the training data
11. Task 2: Subset selection; linear models
• Different methods from different eras
• Stepwise regression (circa Common Era)
• Leaps and bounds (Furnival, 1974)
• Ridge regression: constrain β (Hoerl and Kennard, 1970)
• Lasso, sparsenet: constrain β, shrink irrelevant βᵢ's to 0 (Tibshirani, 1996; Mazumder et al., 2012)
• Ridge, lasso, and sparsenet are shrinkage methods
12. Task 2: Subset selection; leaps and bounds
• Intelligently enumerate all possible subsets – computationally expensive
• ~O(2ᵖ) time complexity – very high latency
• Prevents the modeler from iterating quickly
• Not feasible for wide matrices (p > 50) – a job runs a day or more
13. Shrinkage methods
When computing β̂, add a penalty term J and a parameter λ:

β̂(λ) = arg min_β [RSS(β) + λ J(β)]

Equivalently,

β̂(t) = arg min_β RSS(β)  subject to  J(β) ≤ t

The correspondence between λ and t is one-to-one.
When λ = 0, β̂(λ) is just the ordinary least squares solution. One concrete penalty is sketched below.
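One penalty with a closed form is ridge regression, where J(β) = Σᵢ βᵢ²; the minimizer is β̂(λ) = (XᵀX + λI)⁻¹Xᵀy. A minimal R sketch, again reusing the toy X and y:

# Ridge regression: penalized closed form
lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
beta_ridge  # shrunk toward 0 relative to the OLS solution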
14. Norms
Most penalty terms are based on some norm of β:
• ℓ₂: ||β||₂² = Σᵢ βᵢ² (squared Euclidean norm)
• ℓ₁: ||β||₁ = Σᵢ |βᵢ| (Manhattan/taxicab norm)
• ℓ₀: ||β||₀ = Σᵢ I(βᵢ ≠ 0) (number of non-zero entries)
15. Norms (concretely)
Let x be the vector [2, 1].
• ||x||₂²: 2² + 1² = 5
• ||x||₁: |2| + |1| = 3
• ||x||₀: I(2 ≠ 0) + I(1 ≠ 0) = 2
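Each of these is a one-liner in R:

x <- c(2, 1)
sum(x^2)     # squared l2 norm: 5
sum(abs(x))  # l1 norm: 3
sum(x != 0)  # l0 "norm": number of non-zero entries, 2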
25. Sparsity and shrinkage
Unlike the ℓ₂ penalty (right green region), the ℓ₁ penalty (left green region) produces sparse models: the ℓ₁ constraint region is a diamond with corners on the coordinate axes, so the RSS contours often first touch it at a corner, where some coefficients are exactly 0
[Figure: RSS contours meeting the diamond-shaped ℓ₁ constraint region (left) at a corner where β₁ = 0, versus the circular ℓ₂ region (right), where the contact point has no zero coordinates; axes β₁, β₂; optimum β̂]
Hastie et al., The Elements of Statistical Learning, 2009
26. Why (some) shrinkage methods are great
Lasso and sparsenet perform two tasks simultaneously:
• Estimate β
• Find ≤ p predictor variables to include in model
27. Rest of talk
• Lasso (with examples)
• Model selection concepts
• SparseNet
28. Solving the lasso: original approach
• Estimating β̂ for the lasso is a convex optimization problem (specifically, quadratic programming [QP])
• Original approach – not terribly fast:
  • Find λ₀ such that β̂(λ₀) = 0
  • Run a QP solver to estimate β̂(λᵢ) for each λᵢ in some subset of the range [0, λ₀]
  • Use cross-validation to determine the best λᵢ
29. Solving the lasso: newer approaches
• The fastest algorithms for obtaining the lasso solution are implemented as R packages:
  • lars (Hastie and Efron, 2003)
    • Least angle regression
    • The lasso solution path is piecewise linear; a series of clever projections
  • glmnet (Friedman et al., 2008)
    • Very fast coordinate descent
    • Computes β̂(λ) by iteratively updating each β̂ᵢ(λ) until convergence; akin to solving univariate regressions
• These methods are as fast as a single ordinary least squares solution.
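The transcripts on the following slides assume a design matrix X and response y are already in the workspace. The talk doesn't show how they were built; a hypothetical simulation consistent with the outputs (four predictors, two with large effects) might look like:

# Hypothetical data for the examples below (not from the original talk)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 4), ncol = 4)
y <- 3 + 2.7 * X[, 1] + 6.8 * X[, 2] + 0.2 * X[, 3] - 10.5 * X[, 4] + rnorm(n)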
31. LARS
> library(lars)
> object <- lars(X, y)
> coef(object)   # one row of coefficients per LARS step
           V1       V2        V3        V4
[1,] 0.000000 0.000000 0.0000000   0.00000
[2,] 0.000000 4.059090 0.0000000   0.00000
[3,] 0.000000 5.597966 0.1454205   0.00000
[4,] 1.081084 5.712019 0.1590585   0.00000
[5,] 2.723949 6.789427 0.2128308 -10.53639
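To get predictions at a chosen point on the path, predict() on a lars object takes s plus a mode saying how s is interpreted. A brief sketch, assuming a matrix of new observations newX:

# Predictions at step 4 of the LARS path (newX is assumed to exist)
predict(object, newx = newX, s = 4, mode = "step")$fit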
33. GLMNET
> library(glmnet)
> object <- glmnet(X, y, nlambda=5)
> coef(object)   # "." entries are exact zeros
              s0     s1     s2      s3      s4
(Intercept) 4.07 -2.193  3.068   3.593   3.644
V1           .    1.362  2.586   2.709   2.723
V2           .    5.895  6.700   6.781   6.789
V3           .    0.168  0.208   0.212   0.213
V4           .   -1.792 -9.664 -10.450 -10.528
35. Theory of the lasso
[Diagram: implications among conditions in lasso theory – coherence, irrepresentable, restricted regression, restricted eigenvalue, minimal adaptive restricted eigenvalue, and compatibility – leading to oracle bounds for prediction and ℓ₂-error and oracle bounds for ℓ₁-error, via Theorems 6.2, 6.4, 7.2, and 7.3, Corollary 6.13, and Lemma 6.11]
Bühlmann and van de Geer, Statistics for High-Dimensional Data: Methods, Theory, and Applications, Springer-Verlag, 2011
36. Some limitations of the lasso
• “The lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables” – Tibshirani (link)
• Variable selection consistency holds only under the irrepresentable condition (“On Model Selection Consistency of Lasso”, Zhao and Yu, 2006)
• Some theoretical conditions apply only in the extreme p ≫ n regime
• For our purposes: does the learner yield models with high predictive accuracy?
38. Cross validation
• Leave-one-out (LOO) cross-validation is unbiased but has high variance
  • Bias: how accurate the estimate of performance on held-out data is
  • Variance: how sensitive the estimate is to variations in the training data
• K-fold cross-validation is preferred (K = 5 … 20); see the sketch below
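To make the mechanics concrete, a minimal hand-rolled K-fold CV of an OLS fit, assuming the simulated X and y from earlier (in practice cv.glmnet and cv.sparsenet do this internally):

# Manual 5-fold cross-validation (illustration only)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(X)))
cv_err <- sapply(1:K, function(k) {
  train <- folds != k
  fit <- lm(y[train] ~ X[train, ] - 1)  # fit on K-1 folds
  pred <- X[!train, ] %*% coef(fit)     # predict on the held-out fold
  mean((y[!train] - pred)^2)            # held-out mean squared error
})
mean(cv_err)  # cross-validated estimate of prediction error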
39. Model selection
You are given a set of models and their K-fold CV performance
estimates. How do you choose the one that will perform well on
unseen data?
40. Model selection: choose best
• Choose the model with the best CV performance estimate
• If you trust CV estimates, why not?
41. Model selection: choose using 1 standard-error rule
Choose the model that
• is the most parsimonious
• has CV performance close enough to the best one
Breiman’s one-standard-error rule (Classification and Regression Trees, Breiman et al., 1984, p. 80) defines “close enough” as having an error no more than one standard error above the error of the best model; a sketch of the rule follows
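A minimal sketch of the rule, given hypothetical vectors cv_err (CV error per model) and cv_se (its standard error), ordered from most to least parsimonious:

# One standard-error rule
best <- which.min(cv_err)
threshold <- cv_err[best] + cv_se[best]  # best error plus one standard error
chosen <- which(cv_err <= threshold)[1]  # first (most parsimonious) model under threshold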
43. Model selection example with GLMNET
> library(glmnet)
> object <- cv.glmnet(X, y, nlambda=5)
> object$lambda.1se; object$lambda.min
[1] 0.182
[1] 0.00182
> predict(object, newX, s=object$lambda.1se)[1,]
   1
5.21
> predict(object, newX, s=object$lambda.min)[1,]
   1
5.36
45. SparseNet
• Based on Cun-Hui Zhang’s (Rutgers statistics) MC+ algorithm
  • “Nearly unbiased variable selection under minimax concave penalty”, Zhang, 2010
• Uses the same coordinate descent approach as glmnet
• MC+ has two components:
  • Minimax concave penalty (MCP), more complex than the ℓ₁ penalty (see the sketch below)
  • Penalized linear unbiased selection (PLUS) algorithm
• Variable selection consistent under more general conditions than the lasso
• Zhang has an implementation on CRAN: plus
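For reference, the MCP with parameters λ and γ > 1 charges λ|β| − β²/(2γ) while |β| ≤ γλ and a constant γλ²/2 beyond that, interpolating between the ℓ₁ penalty (γ → ∞) and hard-thresholding-like behavior (γ → 1). A minimal R sketch of the penalty function:

# Minimax concave penalty (MCP) for a single coefficient
mcp <- function(beta, lambda, gamma) {
  b <- abs(beta)
  ifelse(b <= gamma * lambda,
         lambda * b - b^2 / (2 * gamma),  # concave region: less shrinkage as |beta| grows
         gamma * lambda^2 / 2)            # flat region: no shrinkage of large coefficients
}
mcp(c(0, 0.5, 1, 5), lambda = 1, gamma = 3)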
46. SparseNet
> library(sparsenet)
> object <- sparsenet(X, y, nlambda=5, ngamma=4)
> coef(object)   # one coefficient matrix per gamma value; "." entries are exact zeros
$g1
                 l0    l1    l2     l3     l4
(Intercept) 4.1e+00 -2.22  3.03   3.58   3.61
V1            .      1.39  2.62   2.72   2.75
V2          2.1e-15  5.89  6.69   6.78   6.78
V3            .      0.17  0.21   0.21   0.21
V4            .     -1.77 -9.64 -10.44 -10.51

$g2
                 l0    l1    l2     l3     l4
(Intercept) 4.1e+00 -2.05  3.30   3.67   3.66
V1            .      1.30  2.49   2.70   2.72
V2          4.2e-15  5.95  6.77   6.79   6.79
...
48. Model selection example with SparseNet
> library(sparsenet)
> object <- cv.sparsenet(X, y, nlambda=5)
> object$parms.1se; object$parms.min
  gamma  lambda
  8.563   0.081
  gamma  lambda
9.9e+35 8.1e-04
49. Model selection example with SparseNet (cont.)
> library(sparsenet)
> object <- cv.sparsenet(X, y, nlambda=5)
> predict(object, newX, which="parms.min")[1,]
  1
5.4
> predict(object, newX, which="parms.1se")[1,]
  1
5.2
50. Summary
• A refinement of the lasso (Tibshirani, 1996)
• Strong theoretical guarantees, good empirical performance
• Best-in-class linear modeling algorithm