How to Create Predictive Models in
R using Ensembles
Giovanni Seni, Ph.D.
Intuit
@IntuitInc

Giovanni_Seni@intuit.com
Santa Clara University
GSeni@scu.edu

Strata - Hadoop World, New York
October 28, 2013
Reference

Overview
•  Motivation, In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
•  Ensemble Methods - Diversity & Importance Sampling
–  Bagging
–  Random Forest
–  AdaBoost
–  Gradient Boosting
–  Rule Ensembles

•  Summary

Motivation

[Image: journal cover, Volume 9, Issue 2]
Motivation (2)

“1st Place Algorithm Description: … 4. Classification: Ensemble classification methods are used to combine
multiple classifiers. Two separate Random Forest ensembles are created based on the shadow index (one for the
shadow-covered area and one for the shadow-free area). The random forest “Out of Bag” error is used to
automatically evaluate features according to their impact, resulting in 45 features selected for the shadow-free and
55 for the shadow-covered part.”
Motivation (3)
•  “What are the best of the best techniques at winning
Kaggle competitions?
–  Ensembles of Decision Trees
–  Deep Learning

account for 90% of top 3 winners!”
Jeremy Howard, Chief Scientist of Kaggle
KDD 2013
⇒ Key common characteristics:
–  Resistance to overfitting
–  Universal approximations
Ensemble Methods in a Nutshell
•  “Algorithmic” statistical procedure
•  Based on combining the fitted values from a number of
fitting attempts
•  Loosely related to:
–  Iterative procedures
–  Bootstrap procedures

•  Original idea: a “weak” procedure can be strengthened if
it can operate “by committee”
–  e.g., combining low-bias/high-variance procedures

•  Accompanied by interpretation methodology
Timeline
•  CART (Breiman, Friedman, Olshen, Stone, 1984)
•  Bagging (Breiman, 1996)
–  Random Forest (Ho, 1995; Breiman 2001)

•  AdaBoost (Freund, Schapire, 1997)
•  Boosting – a statistical view (Friedman et al., 2000)
–  Gradient Boosting (Friedman, 2001)
–  Stochastic Gradient Boosting (Friedman, 1999)

•  Importance Sampling Learning Ensembles (ISLE)
(Friedman, Popescu, 2003)
Timeline (2)
•  Regularization – variance control techniques:
–  Lasso (Tibshirani, 1996)
–  LARS (Efron, 2004)
–  Elastic Net (Zou, Hastie, 2005)
–  GLMs via Coordinate Descent (Friedman, Hastie, Tibshirani, 2008)

•  Rule Ensembles (Friedman, Popescu, 2008)

Overview
•  Motivation, In a Nutshell & Timeline
Ø  Predictive Learning & Decision Trees
•  Ensemble Methods
•  Summary

Predictive Learning
Procedure Summary
•  Given "training" data $D = \{y_i, x_{i1}, x_{i2}, \ldots, x_{in}\}_1^N = \{y_i, \mathbf{x}_i\}_1^N$
   –  D is a random sample from some unknown (joint) distribution

•  Build a functional model $\hat{y} = \hat{F}(x_1, x_2, \ldots, x_n) = \hat{F}(\mathbf{x})$
   –  Offers an adequate and interpretable description of how the inputs affect the outputs
   –  Parsimony is an important criterion: simpler models are preferred for the sake of scientific insight into the x–y relationship

•  Need to specify: <model, score criterion, search strategy>
Predictive Learning
Procedure Summary (2)
•  Model: underlying functional form sought from data
   $F(\mathbf{x}) = F(\mathbf{x}; \mathbf{a}) \in \mathcal{F}$ – a family of functions indexed by parameters $\mathbf{a}$

•  Score criterion: judges (lack of) quality of the fitted model
   –  Loss function $L(y, F)$: penalizes individual errors in prediction
   –  Risk $R(\mathbf{a}) = E_{y,\mathbf{x}}\, L(y, F(\mathbf{x}; \mathbf{a}))$: the expected loss over all predictions

•  Search strategy: minimization procedure for the score criterion
   $\mathbf{a}^* = \arg\min_{\mathbf{a}} R(\mathbf{a})$

Predictive Learning
Procedure Summary (3)
•  "Surrogate" score criterion:
   –  Training data: $\{y_i, \mathbf{x}_i\}_1^N \sim p(\mathbf{x}, y)$
   –  $p(\mathbf{x}, y)$ unknown ⇒ $\mathbf{a}^*$ unknown
   ⇒ Use an approximation: Empirical Risk

•  $\hat{R}(\mathbf{a}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, F(\mathbf{x}_i; \mathbf{a}))$  ⇒  $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \hat{R}(\mathbf{a})$

•  If not $N \gg n$, then $R(\hat{\mathbf{a}}) \gg R(\mathbf{a}^*)$
Predictive Learning
Example
•  A simple data set:

      Attribute-1 (x1)   Attribute-2 (x2)   Class (y)
      1.0                2.0                blue
      2.0                1.0                green
      …                  …                  …
      4.5                3.5                ?

   [Scatter plot of the data in the (x1, x2) plane]

•  What is the class of the new point?

•  Many approaches… no method is universally better; try several / use a committee
Predictive Learning
Example (2)
•  Ordinary Linear Regression (OLR)
   –  Model: $F(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j$ ; predict one class if $F(\mathbf{x}) \ge 0$, the other otherwise

   [Plot: linear decision boundary in the (x1, x2) plane]

⇒ Not flexible enough
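
For reference, a minimal R sketch of this OLR-as-classifier baseline (toy data and names are illustrative, not the tutorial's scripts):

    # Toy 2-class data; labels coded -1/+1 so the F(x) >= 0 rule applies
    set.seed(1)
    n  <- 200
    x1 <- runif(n); x2 <- runif(n)
    y  <- ifelse(x1 + x2 > 1, 1, -1)          # linear "truth"
    df <- data.frame(x1, x2, y)

    fit  <- lm(y ~ x1 + x2, data = df)        # OLR on the coded response
    pred <- ifelse(predict(fit, df) >= 0, 1, -1)
    mean(pred != df$y)                        # training misclassification rate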
Decision Trees
Overview
   [Diagram: splits x1 ≥ 5, x2 ≥ 3, x1 ≥ 2 partitioning the (x1, x2) plane into regions R1–R4]

•  Model: $\hat{y} = \hat{T}(\mathbf{x}) = \sum_{m=1}^{M} \hat{c}_m I_{R_m}(\mathbf{x})$
   –  $\{R_m\}_{m=1}^{M}$ = sub-regions of the input variable space
   –  where $I_R(\mathbf{x}) = 1$ if $\mathbf{x} \in R$, 0 otherwise
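
A hedged sketch of fitting such a piecewise-constant tree in R with rpart (dataset and settings are illustrative):

    library(rpart)

    # method = "class" requests a classification tree (0-1 loss)
    fit <- rpart(Species ~ ., data = iris, method = "class")
    print(fit)                            # the fitted regions R_m and constants c_m
    pred <- predict(fit, iris, type = "class")
    mean(pred != iris$Species)            # training misclassification error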
Decision Trees
Overview (2)
•  Score criterion:
   –  Classification – "0-1 loss" ⇒ misclassification error (or a surrogate)

      $\hat{T}_M = \{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{\{c_m, R_m\}_1^M} \sum_{i=1}^{N} I\big(y_i \ne T_M(\mathbf{x}_i)\big)$

   –  Regression – least squares – i.e., $L(y, \hat{y}) = (y - \hat{y})^2$

      $\hat{T}_M = \arg\min_{\{c_m, R_m\}_1^M} \sum_{i=1}^{N} \big(y_i - T_M(\mathbf{x}_i)\big)^2$, where the sum is the empirical risk $R(T_M)$

•  Search: find $\hat{T} = \arg\min_T R(T)$
   –  i.e., find the best regions $R_m$ and constants $c_m$
Decision Trees
Overview (3)
•  Joint optimization with respect to the $R_m$ and $c_m$ simultaneously is very difficult
   ⇒ use a greedy iterative procedure

   [Diagram: a sequence of greedy binary splits $(j_1, s_1), (j_2, s_2), (j_3, s_3), (j_4, s_4)$ recursively partitioning the root region $R_0$ into regions $R_1, R_2, \ldots, R_8$]
Decision Trees
What is the “right” size of a model?
   [Plots: the same data fit with a 1-split tree (constants c1, c2) vs. a 3-split tree (c1, c2, c3) – a crude fit vs. a fit that chases the noise]

•  Dilemma
   –  If the model (# of splits) is too small, the approximation is too crude (bias) ⇒ increased errors
   –  If the model is too large, it fits the training data too closely (overfitting, increased variance) ⇒ increased errors
Decision Trees
What is the “right” size of a model? (2)
   [Plot: prediction error vs. model complexity – training-sample error decreases monotonically while test-sample error is U-shaped (high bias / low variance on the left, low bias / high variance on the right), with its minimum at M*]

   –  Right-sized tree: $M^*$, where the test error is at a minimum
   –  Error on the training set is not a useful estimator!
      •  If a test set is not available, an alternative method is needed
Decision Trees
Pruning to obtain “right” size
•  Two strategies
   –  Prepruning – stop growing a branch when information becomes unreliable
      •  #(R_m) – i.e., the number of data points – too small
         ⇒ same bound everywhere in the tree
      •  Next split not worthwhile
         ⇒ not a sufficient condition

   –  Postpruning – take a fully-grown tree and discard unreliable parts (i.e., parts not supported by test data)
      •  C4.5: pessimistic pruning
      •  CART: cost-complexity pruning (more statistically grounded)
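
A minimal sketch of CART-style cost-complexity pruning with rpart (settings illustrative):

    library(rpart)

    # Grow a deliberately large tree, then prune by cost-complexity
    set.seed(1)
    fit <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2, xval = 10))

    printcp(fit)                          # cross-validated error for each cp value
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)   # the "right-sized" subtree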
Decision Trees

Hands-on Exercise
•  Start RStudio
•  Set working directory: use setwd() or the GUI
•  Navigate to directory: example.1.LinearBoundary
•  Load and run "fitModel_CART.R"
•  If curious, also see "gen2DdataLinear.R"
•  After the boosting discussion, load and run "fitModel_GBM.R"

   [Scatter plot: two-class training data on x1, x2 ∈ [0, 1] with a linear class boundary]

Decision Trees
Key Features
•  Ability to deal with irrelevant inputs
–  i.e., automatic variable subset selection
–  Measure anything you can measure
–  Score provided for selected variables ("importance")

•  No data preprocessing needed
   –  Naturally handles all types of variables
      •  numeric, binary, categorical
   –  Invariant under monotone transformations: $\tilde{x}_j = g_j(x_j)$
      •  Variable scales are irrelevant
      •  Immune to bad $x_j$ distributions (e.g., outliers)
Decision Trees
Key Features (2)
•  Computational scalability
–  Relatively fast: O(nN log N )

•  Missing value tolerant
-  Moderate loss of accuracy due to missing values
-  Handling via "surrogate" splits

•  "Off-the-shelf" procedure
-  Few tunable parameters

•  Interpretable model representation
-  Binary tree graphic
Decision Trees
Limitations
•  Discontinuous piecewise-constant model

   [Plot: step-function approximation $F(\mathbf{x})$ of a smooth target in one input x]

   –  In order to have many splits you need a lot of data
      •  In high dimensions, you often run out of data after a few splits
   –  Also note the error is bigger near region boundaries
Decision Trees
Limitations (2)
•  Not good for low-interaction targets $F^*(\mathbf{x})$
   –  e.g., $F^*(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j = \sum_{j=1}^{n} f_j^*(x_j)$ (no interactions, additive) is the worst function for trees
   –  In order for $x_l$ to enter the model, the tree must split on it
      •  The path from the root to a node is a product of indicators

•  Not good for an $F^*(\mathbf{x})$ that depends on many variables
   –  Each split reduces the training data available for subsequent splits (data fragmentation)

Decision Trees
Limitations (3)
•  High variance caused by greedy search strategy (local optima)
–  Errors in upper splits propagate down and affect all splits below them
⇒ Small changes in data (sampling fluctuations) can cause
big changes in tree
- Very deep trees might be questionable
- Pruning is important

•  What to do next?
–  Live with problems
–  Use other methods (when possible)
–  Fix-up trees: use ensembles
Overview
•  Motivation, In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
Ø  Ensemble Methods
–  In a Nutshell, Diversity & Importance Sampling
–  Generic Ensemble Generation
–  Bagging, RF, AdaBoost, Boosting, Rule Ensembles

•  Summary

Ensemble Methods
In a Nutshell
•  Model: $F(\mathbf{x}) = c_0 + \sum_{m=1}^{M} c_m T_m(\mathbf{x})$
   –  $\{T_m(\mathbf{x})\}_1^M$: "basis" functions (or "base learners")
   –  i.e., a linear model in a (very) high-dimensional space of derived variables

•  Learner characterization: $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
   –  $\mathbf{p}_m$: a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes
   –  $\{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in P}$: function class – i.e., the set of all base learners of the specified family
Ensemble Methods
In a Nutshell (2)
•  Learning: a two-step process; an approximate solution to

   $\{c_m, \mathbf{p}_m\}_0^M = \arg\min_{\{c_m, \mathbf{p}_m\}_0^M} \sum_{i=1}^{N} L\Big(y_i,\; c_0 + \sum_{m=1}^{M} c_m T(\mathbf{x}_i; \mathbf{p}_m)\Big)$

   –  Step 1: Choose points $\{\mathbf{p}_m\}_1^M$
      •  i.e., select $\{T_m(\mathbf{x})\}_1^M \subset \{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in P}$
   –  Step 2: Determine weights $\{c_m\}_0^M$
      •  e.g., via regularized linear regression

Ensemble Methods
Importance Sampling (Friedman, 2003)
•  How to judiciously choose the "basis" functions (i.e., $\{\mathbf{p}_m\}_1^M$)?

•  Goal: find "good" $\{\mathbf{p}_m\}_1^M$ so that $F(\mathbf{x}; \{\mathbf{p}_m\}_1^M, \{c_m\}_1^M) \cong F^*(\mathbf{x})$

•  Connection with numerical integration:
   –  $\int_P I(\mathbf{p})\, \partial\mathbf{p} \approx \sum_{m=1}^{M} w_m I(\mathbf{p}_m)$

   [Plot: an integrand $I(\mathbf{p})$ – accuracy improves when we choose more points from the high-mass region]
Importance Sampling
Numerical Integration via Monte Carlo Methods
•  $r(\mathbf{p})$ = sampling pdf of $\mathbf{p} \in P$ – i.e., $\{\mathbf{p}_m \sim r(\mathbf{p})\}_1^M$
   –  Simple approach: $\mathbf{p}_m$ i.i.d. – i.e., uniform
   –  In our problem: inversely related to $\mathbf{p}_m$'s "risk"
      •  i.e., $T(\mathbf{x}; \mathbf{p}_m)$ has high error ⇒ lack of relevance of $\mathbf{p}_m$ ⇒ low $r(\mathbf{p}_m)$

•  "Quasi" Monte Carlo:
   –  with/without knowledge of the other points that will be used
      •  i.e., single-point vs. group importance
   –  Sequential approximation: $\mathbf{p}$'s relevance is judged in the context of the (fixed) previously selected points
Ensemble Methods
Importance Sampling – Characterization of r(p)
•  Let $\mathbf{p}^* = \arg\min_{\mathbf{p}} \mathrm{Risk}(\mathbf{p})$

•  Narrow $r(\mathbf{p})$:
   –  Ensemble $\{T(\mathbf{x}; \mathbf{p}_m)\}_1^M$ of "strong" base learners – i.e., all with $\mathrm{Risk}(\mathbf{p}_m) \approx \mathrm{Risk}(\mathbf{p}^*)$
   –  The $T(\mathbf{x}; \mathbf{p}_m)$'s yield similar, highly correlated predictions ⇒ unexceptional performance

•  Broad $r(\mathbf{p})$:
   –  Diverse ensemble – i.e., predictions are not highly correlated with each other
   –  However, many "weak" base learners – i.e., $\mathrm{Risk}(\mathbf{p}_m) \gg \mathrm{Risk}(\mathbf{p}^*)$ ⇒ poor performance

Ensemble Methods
Approximate Process of Drawing from r(p)
•  Heuristic sampling strategy: sample around $\mathbf{p}^*$ by iteratively applying small perturbations to the existing problem structure
   –  Generating ensemble members $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$:

      For m = 1 to M {
          p_m = PERTURB_m { argmin_p  E_xy L(y, T(x; p)) }
      }

   –  PERTURB{·} is a (random) modification of any of:
      •  the data distribution – e.g., by re-weighting the observations
      •  the loss function – e.g., by modifying its argument
      •  the search algorithm (used to find min_p)

   –  The width of $r(\mathbf{p})$ is controlled by the degree of perturbation
Generic Ensemble Generation
Step 1: Choose Base Learners {p_m}

•  Forward Stagewise Fitting Procedure:

      F_0(x) = 0
      For m = 1 to M {
          // Fit a single base learner; the subsample S_m(η) modifies the data
          // distribution, and F_{m-1} modifies the loss ("sequential" approximation)
          p_m = argmin_p  Σ_{i ∈ S_m(η)} L(y_i, F_{m-1}(x_i) + T(x_i; p))

          // Update the additive expansion
          T_m(x) = T(x; p_m)
          F_m(x) = F_{m-1}(x) + υ · T_m(x)
      }
      write {T_m(x)}_1^M

   –  Algorithm controls: L, η, υ
      •  $S_m(\eta)$: random subsample of size $\eta \le N$ ⇒ impacts ensemble "diversity"
      •  $F_{m-1}(\mathbf{x}) = \upsilon \cdot \sum_{k=1}^{m-1} T_k(\mathbf{x})$: "memory" function ($0 \le \upsilon \le 1$)
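
A compact R sketch of this generic recipe with small rpart trees as base learners, under squared-error loss (all data and settings are illustrative assumptions):

    library(rpart)

    # Toy regression data
    set.seed(1)
    n  <- 500
    df <- data.frame(x1 = runif(n), x2 = runif(n))
    df$y <- sin(4 * df$x1) + 2 * df$x2 + rnorm(n, sd = 0.2)

    M   <- 100         # ensemble size
    eta <- n %/% 2     # subsample size η
    nu  <- 0.1         # memory / shrinkage υ
    Fm  <- rep(0, n)   # F_0(x) = 0
    ensemble <- vector("list", M)

    for (m in 1:M) {
      S <- sample(n, eta)               # S_m(η): random subsample, no replacement
      r <- df$y[S] - Fm[S]              # under L2 loss, the argmin target = residual
      tm <- rpart(r ~ x1 + x2, data = df[S, ],
                  control = rpart.control(maxdepth = 2, cp = 0))
      ensemble[[m]] <- tm
      Fm <- Fm + nu * predict(tm, df)   # F_m = F_{m-1} + υ · T_m
    }

    mean((df$y - Fm)^2)                 # training MSE of the ensemble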
Generic Ensemble Generation
Step 2: Choose Coefficients {c_m}

•  Given $\{T_m(\mathbf{x})\}_{m=1}^M = \{T(\mathbf{x}; \mathbf{p}_m)\}_{m=1}^M$, the coefficients can be obtained by a regularized linear regression:

   $\{\hat{c}_m\} = \arg\min_{\{c_m\}} \sum_{i=1}^{N} L\Big(y_i,\; c_0 + \sum_{m=1}^{M} c_m T_m(\mathbf{x}_i)\Big) + \lambda \cdot P(\mathbf{c})$

   –  Regularization here helps reduce the bias (in addition to the variance) of the model
   –  New fast iterative algorithms exist for various loss/penalty combinations
      •  "GLMs via Coordinate Descent" (2008)
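
A hedged sketch of this post-processing step with glmnet (lasso penalty), reusing the ensemble and df objects from the stagewise sketch above:

    library(glmnet)

    # Design matrix of base-learner predictions: one column per tree T_m(x)
    Tmat <- sapply(ensemble, function(tm) predict(tm, df))

    # Lasso-regularized fit of the coefficients c_m (the intercept plays the role of c_0)
    cvfit <- cv.glmnet(Tmat, df$y, alpha = 1)     # λ chosen by cross-validation
    coef(cvfit, s = "lambda.min")                 # many c_m shrunk exactly to 0
    yhat <- predict(cvfit, Tmat, s = "lambda.min")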

Bagging (Breiman, 1996)
•  Bagging = Bootstrap Aggregation – the generic procedure with:
   •  $L(y, \hat{y})$: as available for a single tree
   •  $\upsilon = 0$ ⇒ no memory
   •  $\eta = N/2$
   •  $T_m(\mathbf{x})$ ⇒ large un-pruned trees
   •  $c_0 = 0$, $\{c_m = 1/M\}_1^M$ – i.e., not fit to the data (simple average)

      F_0(x) = 0
      For m = 1 to M {
          p_m = argmin_p  Σ_{i ∈ S_m(η)} L(y_i, F_{m-1}(x_i) + T(x_i; p))
          T_m(x) = T(x; p_m)
          F_m(x) = F_{m-1}(x) + υ · T_m(x)
      }
      write {T_m(x)}_1^M

   –  i.e., a perturbation of the data distribution only
   –  Potential improvements?
   –  R package: ipred
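
A minimal ipred example (dataset and settings illustrative):

    library(ipred)

    # Bagged classification trees; nbagg = M, trees grown large by default
    set.seed(1)
    fit <- bagging(Species ~ ., data = iris, nbagg = 50, coob = TRUE)
    fit$err                              # out-of-bag misclassification estimate
    pred <- predict(fit, newdata = iris)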
Bagging
Hands-on Exercise

•  Set working directory: use setwd() or the GUI
•  Navigate to directory: example.2.EllipticalBoundary
•  Load and run
   –  fitModel_Bagging_by_hand.R
   –  fitModel_CART.R (optional)
•  If curious, also see gen2DdataNonLinear.R
•  After class, load and run fitModel_Bagging.R

   [Scatter plot: two-class training data on x1 ∈ [-2, 2], x2 ∈ [-1, 1] with an elliptical class boundary]

Bagging
Why does it help?

•  Under $L(y, \hat{y}) = (y - \hat{y})^2$, averaging reduces variance and leaves bias unchanged

•  Consider the "idealized" bagging (aggregate) estimator $\bar{f}(\mathbf{x}) = E\,\hat{f}_Z(\mathbf{x})$
   –  $\hat{f}_Z$ is fit to a bootstrap data set $Z = \{y_i, \mathbf{x}_i\}_1^N$
   –  Z is sampled from the actual population distribution (not the training data)
   –  We can write:

      $E[Y - \hat{f}_Z(\mathbf{x})]^2 = E[Y - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - \hat{f}_Z(\mathbf{x})]^2$
      $= E[Y - \bar{f}(\mathbf{x})]^2 + E[\hat{f}_Z(\mathbf{x}) - \bar{f}(\mathbf{x})]^2$
      $\ge E[Y - \bar{f}(\mathbf{x})]^2$

   ⇒ true population aggregation never increases mean squared error!
   ⇒ Bagging will often decrease MSE…
Random Forest (Ho, 1995; Breiman, 2001)
•  Random Forest = Bagging + algorithm randomization
   –  Subset splitting: as each tree is constructed…
      •  Draw a random sample of $n_s$ predictors before each node is split, e.g. $n_s = \lfloor \log_2(n) + 1 \rfloor$
      •  Find the best split as usual, but selecting only from the subset of predictors

   ⇒ Increased diversity among $\{T_m(\mathbf{x})\}_1^M$ – i.e., a wider $r(\mathbf{p})$
      •  Width is (inversely) controlled by $n_s$

   –  Speed improvement over Bagging
   –  R package: randomForest
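
A minimal randomForest example (mtry plays the role of n_s; settings illustrative):

    library(randomForest)

    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris,
                        ntree = 500,       # M
                        mtry  = 2,         # n_s: predictors tried at each split
                        importance = TRUE)
    print(fit)                 # includes the out-of-bag error estimate
    importance(fit)            # variable importance scores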
Bagging vs. Random Forest vs. ISLE
100 Target Functions Comparison (Popescu, 2005)
•  ISLE improvements:
   –  Different data sampling strategy (not fixed)
   –  Fit coefficients to the data

•  xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing – i.e., using estimated "optimal" quadrature coefficients
   ⇒ Significantly faster to build!

   [Bar chart: comparative RMS error over 100 target functions for Bag, RF, Bag_6_5%_P, RF_6_5%_P]
AdaBoost (Freund & Schapire, 1997)
•  AdaBoost algorithm:

      Observation weights: w_i^(0) = 1/N
      For m = 1 to M {
          a. Fit a classifier T_m(x) to the training data with weights w_i^(m)
          b. Compute  err_m = Σ_{i=1}^N w_i^(m) · I(y_i ≠ T_m(x_i))
          c. Compute  α_m = log((1 − err_m) / err_m)
          d. Set  w_i^(m+1) = w_i^(m) · exp[α_m · I(y_i ≠ T_m(x_i))]
      }
      Output  sign(Σ_{m=1}^M α_m · T_m(x))

•  Equivalence to the Forward Stagewise Fitting Procedure:

      F_0(x) = 0
      For m = 1 to M {
          (c_m, p_m) = argmin_{c,p}  Σ_{i ∈ S_m(η)} L(y_i, F_{m-1}(x_i) + c · T(x_i; p))
          T_m(x) = T(x; p_m)
          F_m(x) = F_{m-1}(x) + υ · c_m · T_m(x)
      }
      write {c_m, T_m(x)}_1^M

   –  We need to show that p_m = argmin(·) is equivalent to line a. above,
      and that c_m = argmin(·) is equivalent to line c. (see the book)

•  R package: adabag
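
A minimal adabag example (settings illustrative):

    library(adabag)

    set.seed(1)
    fit  <- boosting(Species ~ ., data = iris, mfinal = 50)   # mfinal = M
    pred <- predict(fit, newdata = iris)
    pred$error                           # training misclassification rate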
AdaBoost
Hands-on Exercise

•  Navigate to directory: example.2.EllipticalBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run
   –  fitModel_Adaboost_by_hand.R
•  After class, load and run fitModel_Adaboost.R and fitModel_RandomForest.R

   [Scatter plot: the same two-class elliptical-boundary data]

Stochastic Gradient Boosting (Friedman, 2001)
•  Boosting with any differentiable loss criterion $L(y, \hat{y})$:

      F_0(x) = c_0
      For m = 1 to M {
          (c_m, p_m) = argmin_{c,p}  Σ_{i ∈ S_m(η)} L(y_i, F_{m-1}(x_i) + c · T(x_i; p))
          T_m(x) = T(x; p_m)
          F_m(x) = F_{m-1}(x) + υ · c_m · T_m(x)
      }
      write {(υ · c_m), T_m(x)}_1^M

   Settings:
   •  $c_0 = \arg\min_c \sum_{i=1}^N L(y_i, c)$
   •  $\upsilon = 0.1$ ⇒ sequential sampling
   •  $\eta = N/2$
   •  $T_m(\mathbf{x})$ ⇒ any "weak" learner
   •  $\{c_m\}_1^M$ ⇒ "shrunk" sequential partial regression coefficients

   –  Potential improvements?
   –  R package: gbm
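
A minimal gbm example (bernoulli loss for a 0/1 response; all settings illustrative):

    library(gbm)

    set.seed(1)
    n  <- 1000
    df <- data.frame(x1 = runif(n), x2 = runif(n))
    df$y <- as.numeric(df$x1^2 + df$x2^2 > 0.5)   # 0/1 response

    fit <- gbm(y ~ x1 + x2, data = df,
               distribution = "bernoulli",
               n.trees = 500,             # M
               shrinkage = 0.1,           # υ
               bag.fraction = 0.5,        # η / N
               interaction.depth = 2,
               cv.folds = 5)

    best <- gbm.perf(fit, method = "cv")  # pick M by cross-validation
    p    <- predict(fit, df, n.trees = best, type = "response")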
Stochastic Gradient Boosting
LAD Regression – L(y, ŷ) = |y − ŷ|

•  More robust than $(y - F)^2$
•  Resistant to outliers in y
   …trees already provide resistance to outliers in x

      F_0(x) = median{y_i}_1^N
      For m = 1 to M {
          // Step 1: find T_m(x)
          ỹ_i = sign(y_i − F_{m-1}(x_i))
          {R_jm}_1^J = J-terminal-node LS regression tree fit to {ỹ_i, x_i}_1^N

          // Step 2: find the coefficients
          γ̂_jm = median over {x_i ∈ R_jm} of {y_i − F_{m-1}(x_i)},   j = 1…J

          // Update the expansion
          F_m(x) = F_{m-1}(x) + υ · Σ_{j=1}^J γ̂_jm · I(x ∈ R_jm)
      }

•  Notes:
   –  Trees are fitted to the pseudo-response ⇒ can't interpret individual trees
   –  A "shrunk" version of each tree gets added to the ensemble
   –  The original tree constants are overwritten
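
In gbm, the analogous LAD fit just swaps the loss (a sketch, with hypothetical heavy-tailed data):

    library(gbm)

    set.seed(1)
    df2 <- data.frame(x1 = runif(500), x2 = runif(500))
    df2$y <- sin(4 * df2$x1) + 2 * df2$x2 + rt(500, df = 2)   # outlier-prone noise

    # distribution = "laplace" gives LAD (absolute-loss) boosting
    fit_lad <- gbm(y ~ x1 + x2, data = df2, distribution = "laplace",
                   n.trees = 500, shrinkage = 0.1, bag.fraction = 0.5)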
Parallel vs. Sequential Ensembles
100 Target Functions Comparison (Popescu, 2005)

•  xxx_6_5%_P: 6-terminal-node trees, 5% samples without replacement, post-processing – i.e., using estimated "optimal" quadrature coefficients
•  Seq_υ_η%_P: "sequential" ensemble, 6-terminal-node trees, υ = "memory" factor, η% samples without replacement, post-processing

   [Bar chart: comparative RMS error of the "parallel" ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) vs. the "sequential" ones (Boost, Seq_0.01_20%_P, Seq_0.1_50%_P)]

•  Sequential ISLEs tend to perform better than parallel ones
   –  Consistent with results observed in classical Monte Carlo integration
Rule Ensembles (Friedman & Popescu, 2005)
•  Trees as collections of conjunctive rules: $\hat{T}_m(\mathbf{x}) = \sum_{j=1}^{J} \hat{c}_{jm} I(\mathbf{x} \in R_{jm})$

   [Diagram: a tree partitioning the (x1, x2) plane into regions R1–R5, with splits at x1 = 15, 22 and x2 = 15, 27]

      R1 ⇒ r1(x) = I(x1 > 22) · I(x2 > 27)
      R2 ⇒ r2(x) = I(x1 > 22) · I(0 ≤ x2 ≤ 27)
      R3 ⇒ r3(x) = I(15 < x1 ≤ 22) · I(0 ≤ x2)
      R4 ⇒ r4(x) = I(0 ≤ x1 ≤ 15) · I(x2 > 15)
      R5 ⇒ r5(x) = I(0 ≤ x1 ≤ 15) · I(0 ≤ x2 ≤ 15)

   –  These simple rules, $r_m(\mathbf{x}) \in \{0, 1\}$, can be used as base learners
   –  The main motivation is interpretability
Rule Ensembles
ISLE Procedure
•  Rule-based model: $F(\mathbf{x}) = a_0 + \sum_m a_m r_m(\mathbf{x})$
   –  Still a piecewise-constant model ⇒ complement the non-linear rules with purely linear terms:
      $F(\mathbf{x}) = a_0 + \sum_m a_m r_m(\mathbf{x}) + \sum_j b_j x_j$

•  Fitting
   –  Step 1: derive the rules from a tree ensemble (shortcut)
      •  Tree size controls rule "complexity" (interaction order)
   –  Step 2: fit the coefficients using a regularized linear procedure:

      $(\{\hat{a}_k\}, \{\hat{b}_j\}) = \arg\min_{\{a_k\},\{b_j\}} \sum_{i=1}^{N} L\big(y_i,\, F(\mathbf{x}_i; \{a_k\}_0^K, \{b_j\}_1^P)\big) + \lambda \cdot \big(P(a) + P(b)\big)$
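
One way to fit such a model in R today is the pre package, a rule-ensemble implementation in the spirit of Friedman & Popescu (it postdates this tutorial, so this is an assumption, not the tutorial's own fitModel_RE.R):

    library(pre)   # prediction rule ensembles: rules + linear terms, lasso-fit

    set.seed(1)
    fit <- pre(Sepal.Length ~ ., data = iris)   # derives rules from tree ensembles,
                                                # then fits a_k, b_j with the lasso
    print(fit)                                  # surviving rules and linear terms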
Boosting & Rule Ensembles
Hands-on Exercise

•  Navigate to directory: example.3.Diamonds
•  Set working directory: use setwd() or the GUI
•  Load and run
   –  viewDiamondData.R
   –  fitModel_GBM.R
   –  fitModel_RE.R
•  After class, go to example.1.LinearBoundary and run fitModel_GBM.R

   [Plot: absolute loss vs. iteration, decreasing over ~1000 boosting iterations]

Overview
•  Motivation, In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
•  Ensemble Methods
Ø  Summary

Summary
•  Ensemble methods have been found to perform extremely
well in a variety of problem domains
•  Shown to have desirable statistical properties
•  Latest ensemble research brings together important
foundational strands of statistics
•  Emphasis on accuracy but significant progress has been
made on interpretability
Go build Ensembles and keep in touch!

More Related Content

What's hot

Supervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesSupervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesDaniil Mirylenka
 
Learning to Rank with Neural Networks
Learning to Rank with Neural NetworksLearning to Rank with Neural Networks
Learning to Rank with Neural NetworksBhaskar Mitra
 
Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GAN)Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GAN)Manohar Mukku
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial NetworksMark Chang
 
Introduction to Generative Adversarial Networks
Introduction to Generative Adversarial NetworksIntroduction to Generative Adversarial Networks
Introduction to Generative Adversarial NetworksBennoG1
 
Deep Learning and Optimization Methods
Deep Learning and Optimization MethodsDeep Learning and Optimization Methods
Deep Learning and Optimization MethodsStefan Kühn
 
Machine learning with R
Machine learning with RMachine learning with R
Machine learning with RMaarten Smeets
 
A Short Introduction to Generative Adversarial Networks
A Short Introduction to Generative Adversarial NetworksA Short Introduction to Generative Adversarial Networks
A Short Introduction to Generative Adversarial NetworksJong Wook Kim
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational AutoencoderMark Chang
 
GAN - Theory and Applications
GAN - Theory and ApplicationsGAN - Theory and Applications
GAN - Theory and ApplicationsEmanuele Ghelfi
 
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...Rizwan Habib
 
Unsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANUnsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANShyam Krishna Khadka
 
Reading group gan - 20170417
Reading group   gan - 20170417Reading group   gan - 20170417
Reading group gan - 20170417Shuai Zhang
 
Adversarial learning for neural dialogue generation
Adversarial learning for neural dialogue generationAdversarial learning for neural dialogue generation
Adversarial learning for neural dialogue generationKeon Kim
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networksYunjey Choi
 
Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)NamHyuk Ahn
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Sri Ambati
 
Finding connections among images using CycleGAN
Finding connections among images using CycleGANFinding connections among images using CycleGAN
Finding connections among images using CycleGANNAVER Engineering
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksZak Jost
 

What's hot (20)

Supervised Prediction of Graph Summaries
Supervised Prediction of Graph SummariesSupervised Prediction of Graph Summaries
Supervised Prediction of Graph Summaries
 
Learning to Rank with Neural Networks
Learning to Rank with Neural NetworksLearning to Rank with Neural Networks
Learning to Rank with Neural Networks
 
Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GAN)Generative Adversarial Networks (GAN)
Generative Adversarial Networks (GAN)
 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial Networks
 
Introduction to Generative Adversarial Networks
Introduction to Generative Adversarial NetworksIntroduction to Generative Adversarial Networks
Introduction to Generative Adversarial Networks
 
Deep Learning and Optimization Methods
Deep Learning and Optimization MethodsDeep Learning and Optimization Methods
Deep Learning and Optimization Methods
 
Machine learning with R
Machine learning with RMachine learning with R
Machine learning with R
 
A Short Introduction to Generative Adversarial Networks
A Short Introduction to Generative Adversarial NetworksA Short Introduction to Generative Adversarial Networks
A Short Introduction to Generative Adversarial Networks
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
GAN - Theory and Applications
GAN - Theory and ApplicationsGAN - Theory and Applications
GAN - Theory and Applications
 
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...
NYAI - A Path To Unsupervised Learning Through Adversarial Networks by Soumit...
 
Unsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGANUnsupervised learning represenation with DCGAN
Unsupervised learning represenation with DCGAN
 
Reading group gan - 20170417
Reading group   gan - 20170417Reading group   gan - 20170417
Reading group gan - 20170417
 
Adversarial learning for neural dialogue generation
Adversarial learning for neural dialogue generationAdversarial learning for neural dialogue generation
Adversarial learning for neural dialogue generation
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)Generative Adversarial Network (+Laplacian Pyramid GAN)
Generative Adversarial Network (+Laplacian Pyramid GAN)
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Finding connections among images using CycleGAN
Finding connections among images using CycleGANFinding connections among images using CycleGAN
Finding connections among images using CycleGAN
 
그림 그리는 AI
그림 그리는 AI그림 그리는 AI
그림 그리는 AI
 
InfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial NetworksInfoGAN and Generative Adversarial Networks
InfoGAN and Generative Adversarial Networks
 

Viewers also liked

Building a Predictive Model
Building a Predictive ModelBuilding a Predictive Model
Building a Predictive ModelDKALab
 
Normalization_BCA_
Normalization_BCA_Normalization_BCA_
Normalization_BCA_Bhavini Shah
 
Using Rule Ensembles to Predict Credit Risk #GHC15
Using Rule Ensembles to Predict Credit Risk #GHC15Using Rule Ensembles to Predict Credit Risk #GHC15
Using Rule Ensembles to Predict Credit Risk #GHC15Intuit Inc.
 
Bagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification NoiseBagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification NoiseNTNU
 
Monte Carlo Methods
Monte Carlo MethodsMonte Carlo Methods
Monte Carlo MethodsJames Bell
 
Machine Learning with R and Tableau
Machine Learning with R and TableauMachine Learning with R and Tableau
Machine Learning with R and TableauKayden Kelly
 
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Andrea Dal Pozzolo
 
Statistical Modeling: The Two Cultures
Statistical Modeling: The Two CulturesStatistical Modeling: The Two Cultures
Statistical Modeling: The Two CulturesChristoph Molnar
 
Particle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer VisionParticle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer Visionzukun
 
Developing Custom Controls with UI5 (OpenUI5 video lecture)
Developing Custom Controls with UI5 (OpenUI5 video lecture)Developing Custom Controls with UI5 (OpenUI5 video lecture)
Developing Custom Controls with UI5 (OpenUI5 video lecture)Michael Graf
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part Ijayroy
 

Viewers also liked (20)

Predictive Analytics using R
Predictive Analytics using RPredictive Analytics using R
Predictive Analytics using R
 
Building a Predictive Model
Building a Predictive ModelBuilding a Predictive Model
Building a Predictive Model
 
Normalization_BCA_
Normalization_BCA_Normalization_BCA_
Normalization_BCA_
 
Using Rule Ensembles to Predict Credit Risk #GHC15
Using Rule Ensembles to Predict Credit Risk #GHC15Using Rule Ensembles to Predict Credit Risk #GHC15
Using Rule Ensembles to Predict Credit Risk #GHC15
 
Bagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification NoiseBagging Decision Trees on Data Sets with Classification Noise
Bagging Decision Trees on Data Sets with Classification Noise
 
Model Inventory
Model InventoryModel Inventory
Model Inventory
 
Monte Carlo Methods
Monte Carlo MethodsMonte Carlo Methods
Monte Carlo Methods
 
Machine Learning with R and Tableau
Machine Learning with R and TableauMachine Learning with R and Tableau
Machine Learning with R and Tableau
 
Conditional trees
Conditional treesConditional trees
Conditional trees
 
Predictive Modeling and Analytics select_chapters
Predictive Modeling and Analytics select_chaptersPredictive Modeling and Analytics select_chapters
Predictive Modeling and Analytics select_chapters
 
Predictive Modeling with Enterprise Miner
Predictive Modeling with Enterprise MinerPredictive Modeling with Enterprise Miner
Predictive Modeling with Enterprise Miner
 
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...Using HDDT to avoid instances propagation in unbalanced and evolving data str...
Using HDDT to avoid instances propagation in unbalanced and evolving data str...
 
Statistical Modeling: The Two Cultures
Statistical Modeling: The Two CulturesStatistical Modeling: The Two Cultures
Statistical Modeling: The Two Cultures
 
predictive models
predictive modelspredictive models
predictive models
 
Particle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer VisionParticle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer Vision
 
From Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive AnalyticsFrom Business Intelligence to Predictive Analytics
From Business Intelligence to Predictive Analytics
 
Developing Custom Controls with UI5 (OpenUI5 video lecture)
Developing Custom Controls with UI5 (OpenUI5 video lecture)Developing Custom Controls with UI5 (OpenUI5 video lecture)
Developing Custom Controls with UI5 (OpenUI5 video lecture)
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
 
Xgboost
XgboostXgboost
Xgboost
 
Management Consulting
Management ConsultingManagement Consulting
Management Consulting
 

Similar to Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles

Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionSeonho Park
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryRikiya Takahashi
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Presentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in InformaticaPresentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in InformaticaLuca Marignati
 
Hadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopHadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopYahoo Developer Network
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2yannabraham
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화NAVER Engineering
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Shardinginside-BigData.com
 
Big Data And Machine Learning Using MATLAB.pdf
Big Data And Machine Learning Using MATLAB.pdfBig Data And Machine Learning Using MATLAB.pdf
Big Data And Machine Learning Using MATLAB.pdfssuserb2837a
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 

Similar to Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles (20)

Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Planet
PlanetPlanet
Planet
 
Uncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game TheoryUncertainty Awareness in Integrating Machine Learning and Game Theory
Uncertainty Awareness in Integrating Machine Learning and Game Theory
 
Data reduction
Data reductionData reduction
Data reduction
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Cluster
ClusterCluster
Cluster
 
Presentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in InformaticaPresentazione Tesi Laurea Triennale in Informatica
Presentazione Tesi Laurea Triennale in Informatica
 
Hadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopHadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using Hadoop
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
 
Big Data And Machine Learning Using MATLAB.pdf
Big Data And Machine Learning Using MATLAB.pdfBig Data And Machine Learning Using MATLAB.pdf
Big Data And Machine Learning Using MATLAB.pdf
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and Techniques
 
Machine learning
Machine learningMachine learning
Machine learning
 
Machine learning
Machine learning Machine learning
Machine learning
 

More from Intuit Inc.

State of Small Business – Growth and Success Report
State of Small Business – Growth and Success ReportState of Small Business – Growth and Success Report
State of Small Business – Growth and Success ReportIntuit Inc.
 
The State of Small Business Cash Flow
The State of Small Business Cash FlowThe State of Small Business Cash Flow
The State of Small Business Cash FlowIntuit Inc.
 
Small Business in the Age of AI
Small Business in the Age of AI Small Business in the Age of AI
Small Business in the Age of AI Intuit Inc.
 
Get financially Fit: Tips for Using QuickBooks
Get financially Fit: Tips for Using QuickBooksGet financially Fit: Tips for Using QuickBooks
Get financially Fit: Tips for Using QuickBooksIntuit Inc.
 
SEO, Social, and More: Digital Marketing for your Business
SEO, Social, and More: Digital Marketing for your BusinessSEO, Social, and More: Digital Marketing for your Business
SEO, Social, and More: Digital Marketing for your BusinessIntuit Inc.
 
Why Building Your Brand is Key to Getting Customers
Why Building Your Brand is Key to Getting CustomersWhy Building Your Brand is Key to Getting Customers
Why Building Your Brand is Key to Getting CustomersIntuit Inc.
 
Get Found Fast: Google AdWords Strategies for Growth
Get Found Fast: Google AdWords Strategies for GrowthGet Found Fast: Google AdWords Strategies for Growth
Get Found Fast: Google AdWords Strategies for GrowthIntuit Inc.
 
Giving Clients What They Want
Giving Clients What They WantGiving Clients What They Want
Giving Clients What They WantIntuit Inc.
 
What Accounting Will Look Like in 2030
What Accounting Will Look Like in 2030What Accounting Will Look Like in 2030
What Accounting Will Look Like in 2030Intuit Inc.
 
Pricing in the Digital Age
Pricing in the Digital Age Pricing in the Digital Age
Pricing in the Digital Age Intuit Inc.
 
Handbook: Power Panel on Apps you need to give you more time to serve your cl...
Handbook: Power Panel on Apps you need to give you more time to serve your cl...Handbook: Power Panel on Apps you need to give you more time to serve your cl...
Handbook: Power Panel on Apps you need to give you more time to serve your cl...Intuit Inc.
 
Handbook: Advanced QuickBooks Online - Handling Tricky Transactions
Handbook: Advanced QuickBooks Online - Handling Tricky TransactionsHandbook: Advanced QuickBooks Online - Handling Tricky Transactions
Handbook: Advanced QuickBooks Online - Handling Tricky TransactionsIntuit Inc.
 
Advanced QuickBooks Online - Handling Tricky Transactions
Advanced QuickBooks Online - Handling Tricky TransactionsAdvanced QuickBooks Online - Handling Tricky Transactions
Advanced QuickBooks Online - Handling Tricky TransactionsIntuit Inc.
 
Handling tricky transactions in QuickBooks Online
Handling tricky transactions in QuickBooks OnlineHandling tricky transactions in QuickBooks Online
Handling tricky transactions in QuickBooks OnlineIntuit Inc.
 
Social media is social business
Social media is social business  Social media is social business
Social media is social business Intuit Inc.
 
Conversation guide: Forming deep relationships with your clients
Conversation guide: Forming deep relationships with your clientsConversation guide: Forming deep relationships with your clients
Conversation guide: Forming deep relationships with your clientsIntuit Inc.
 
Making tax digital
Making tax digital  Making tax digital
Making tax digital Intuit Inc.
 
Giving clients what they want
Giving clients what they want Giving clients what they want
Giving clients what they want Intuit Inc.
 
100 percent cloud your action plan for success
100 percent cloud your action plan for success 100 percent cloud your action plan for success
100 percent cloud your action plan for success Intuit Inc.
 
Attracting and retaining top talent
Attracting and retaining top talent Attracting and retaining top talent
Attracting and retaining top talent Intuit Inc.
 

More from Intuit Inc. (20)

State of Small Business – Growth and Success Report
State of Small Business – Growth and Success ReportState of Small Business – Growth and Success Report
State of Small Business – Growth and Success Report
 
The State of Small Business Cash Flow
The State of Small Business Cash FlowThe State of Small Business Cash Flow
The State of Small Business Cash Flow
 
Small Business in the Age of AI
Small Business in the Age of AI Small Business in the Age of AI
Small Business in the Age of AI
 
Get financially Fit: Tips for Using QuickBooks
Get financially Fit: Tips for Using QuickBooksGet financially Fit: Tips for Using QuickBooks
Get financially Fit: Tips for Using QuickBooks
 
SEO, Social, and More: Digital Marketing for your Business
SEO, Social, and More: Digital Marketing for your BusinessSEO, Social, and More: Digital Marketing for your Business
SEO, Social, and More: Digital Marketing for your Business
 
Why Building Your Brand is Key to Getting Customers
Why Building Your Brand is Key to Getting CustomersWhy Building Your Brand is Key to Getting Customers
Why Building Your Brand is Key to Getting Customers
 
Get Found Fast: Google AdWords Strategies for Growth
Get Found Fast: Google AdWords Strategies for GrowthGet Found Fast: Google AdWords Strategies for Growth
Get Found Fast: Google AdWords Strategies for Growth
 
Giving Clients What They Want
Giving Clients What They WantGiving Clients What They Want
Giving Clients What They Want
 
What Accounting Will Look Like in 2030
What Accounting Will Look Like in 2030What Accounting Will Look Like in 2030
What Accounting Will Look Like in 2030
 
Pricing in the Digital Age
Pricing in the Digital Age Pricing in the Digital Age
Pricing in the Digital Age
 
Handbook: Power Panel on Apps you need to give you more time to serve your cl...
Handbook: Power Panel on Apps you need to give you more time to serve your cl...Handbook: Power Panel on Apps you need to give you more time to serve your cl...
Handbook: Power Panel on Apps you need to give you more time to serve your cl...
 
Handbook: Advanced QuickBooks Online - Handling Tricky Transactions
Handbook: Advanced QuickBooks Online - Handling Tricky TransactionsHandbook: Advanced QuickBooks Online - Handling Tricky Transactions
Handbook: Advanced QuickBooks Online - Handling Tricky Transactions
 
Advanced QuickBooks Online - Handling Tricky Transactions
Advanced QuickBooks Online - Handling Tricky TransactionsAdvanced QuickBooks Online - Handling Tricky Transactions
Advanced QuickBooks Online - Handling Tricky Transactions
 
Handling tricky transactions in QuickBooks Online
Handling tricky transactions in QuickBooks OnlineHandling tricky transactions in QuickBooks Online
Handling tricky transactions in QuickBooks Online
 
Social media is social business
Social media is social business  Social media is social business
Social media is social business
 
Conversation guide: Forming deep relationships with your clients
Conversation guide: Forming deep relationships with your clientsConversation guide: Forming deep relationships with your clients
Conversation guide: Forming deep relationships with your clients
 
Making tax digital
Making tax digital  Making tax digital
Making tax digital
 
Giving clients what they want
Giving clients what they want Giving clients what they want
Giving clients what they want
 
100 percent cloud your action plan for success
100 percent cloud your action plan for success 100 percent cloud your action plan for success
100 percent cloud your action plan for success
 
Attracting and retaining top talent
Attracting and retaining top talent Attracting and retaining top talent
Attracting and retaining top talent
 

Recently uploaded

Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWQuiz Club NITW
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleCeline George
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxVanesaIglesias10
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptxJonalynLegaspi2
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 

Recently uploaded (20)

Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Mythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITWMythology Quiz-4th April 2024, Quiz Club NITW
Mythology Quiz-4th April 2024, Quiz Club NITW
 
Multi Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP ModuleMulti Domain Alias In the Odoo 17 ERP Module
Multi Domain Alias In the Odoo 17 ERP Module
 
prashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Professionprashanth updated resume 2024 for Teaching Profession
prashanth updated resume 2024 for Teaching Profession
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
week 1 cookery 8 fourth - quarter .pptx
week 1 cookery 8  fourth  -  quarter .pptxweek 1 cookery 8  fourth  -  quarter .pptx
week 1 cookery 8 fourth - quarter .pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 

Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles

• 12. Predictive Learning: Procedure Summary (2)
•  Model: the underlying functional form sought from data, $F(\mathbf{x}) = F(\mathbf{x};\mathbf{a}) \in \mathcal{F}$, a family of functions indexed by parameters $\mathbf{a}$
•  Score criterion: judges the (lack of) quality of a fitted model
–  Loss function $L(y, F)$: penalizes individual errors in prediction
–  Risk $R(\mathbf{a}) = E_{y,\mathbf{x}}\, L(y, F(\mathbf{x};\mathbf{a}))$: the expected loss over all predictions
•  Search strategy: a minimization procedure for the score criterion, $\mathbf{a}^{*} = \arg\min_{\mathbf{a}} R(\mathbf{a})$
• 13. Predictive Learning: Procedure Summary (3)
•  "Surrogate" score criterion:
–  Training data: $\{y_i, \mathbf{x}_i\}_1^N \sim p(\mathbf{x}, y)$
–  $p(\mathbf{x}, y)$ unknown ⇒ $\mathbf{a}^{*}$ unknown ⇒ use an approximation, the empirical risk $\hat{R}(\mathbf{a}) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, F(\mathbf{x}_i;\mathbf{a}))$, and take $\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \hat{R}(\mathbf{a})$
•  If not $N \gg n$, then $R(\hat{\mathbf{a}}) \gg R(\mathbf{a}^{*})$, i.e., the empirical minimizer can badly overfit
• 14. Predictive Learning: Example
•  A simple data set:

  Attribute-1 (x1)   Attribute-2 (x2)   Class (y)
  1.0                2.0                blue
  2.0                1.0                green
  …                  …                  …
  4.5                3.5                ?

•  What is the class of the new point (4.5, 3.5)?
•  Many approaches exist; no method is universally better: try several, or use a committee
• 15. Predictive Learning: Example (2)
•  Ordinary Linear Regression (OLR)
–  Model: $F(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j$; classify as one class if $F(\mathbf{x}) \ge 0$, the other otherwise
–  A single linear boundary ⇒ not flexible enough (see the sketch below)
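For concreteness, a minimal R sketch (not from the tutorial; the data-generating rule is an assumption for illustration) of OLR used as a classifier:

  # OLR as a classifier: fit F(x) = a0 + a1*x1 + a2*x2 by least squares,
  # then classify by the sign of F(x). Toy data with a linear true boundary.
  set.seed(1)
  n  <- 200
  x1 <- runif(n); x2 <- runif(n)
  y  <- ifelse(x2 > x1, 1, -1)              # class coded as y in {-1, +1}
  fit  <- lm(y ~ x1 + x2)
  pred <- ifelse(predict(fit) >= 0, 1, -1)  # the slide's decision rule
  mean(pred != y)                           # training misclassification rate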
• 16. Decision Trees: Overview
[Figure: a tree with splits x1 ≥ 5, x2 ≥ 3, x1 ≥ 2 partitioning the (x1, x2) plane into regions R1 … R4]
•  Model: $\hat{y} = T(\mathbf{x}) = \sum_{m=1}^{M} \hat{c}_m\, I_{R_m}(\mathbf{x})$
–  $\{R_m\}_{m=1}^{M}$: sub-regions of the input variable space, where $I_R(\mathbf{x}) = 1$ if $\mathbf{x} \in R$, 0 otherwise
• 17. Decision Trees: Overview (2)
•  Score criterion:
–  Classification: "0–1 loss" ⇒ misclassification error (or a surrogate), $\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{\{c_m, R_m\}_1^M} \sum_{i=1}^{N} I\left(y_i \ne T_M(\mathbf{x}_i)\right)$
–  Regression: least squares, i.e., $L(y, \hat{y}) = (y - \hat{y})^2$, $\{\hat{c}_m, \hat{R}_m\}_1^M = \arg\min_{\{c_m, R_m\}_1^M} \sum_{i=1}^{N} \left(y_i - T_M(\mathbf{x}_i)\right)^2 = \hat{R}(T_M)$
•  Search: find $\hat{T} = \arg\min_T \hat{R}(T)$
–  i.e., find the best regions $R_m$ and constants $c_m$
• 18. Decision Trees: Overview (3)
•  Joint optimization with respect to $R_m$ and $c_m$ simultaneously is very difficult ⇒ use a greedy iterative procedure
[Figure: recursive binary partitioning; each step picks a split variable $j$ and split point $s$, splitting one region (starting from $R_0$) into two children]
• 19. Decision Trees: What is the "right" size of a model?
[Figure: the same scatter fit with a 2-region tree (c1, c2) vs. a 3-region tree (c1, c2, c3)]
•  Dilemma:
–  If the model (# of splits) is too small, the approximation is too crude (bias) ⇒ increased errors
–  If the model is too large, it fits the training data too closely (overfitting, increased variance) ⇒ increased errors
• 20. Decision Trees: What is the "right" size of a model? (2)
[Figure: prediction error vs. model complexity; training-sample error decreases monotonically while test-sample error is U-shaped, moving from high bias/low variance to low bias/high variance]
–  The right-sized tree, $M^{*}$, is where the test error is at a minimum
–  Error on the training data is not a useful estimator!
•  If a test set is not available, an alternative method is needed
• 21. Decision Trees: Pruning to obtain the "right" size
•  Two strategies:
–  Prepruning: stop growing a branch when information becomes unreliable
•  $\#(R_m)$, i.e., the number of data points in a node, too small ⇒ same bound everywhere in the tree
•  Next split not worthwhile ⇒ not a sufficient condition
–  Postpruning: take a fully-grown tree and discard unreliable parts (i.e., parts not supported by test data); see the sketch below
•  C4.5: pessimistic pruning
•  CART: cost-complexity pruning (more statistically grounded)
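A minimal sketch of CART cost-complexity pruning in R, using rpart and its bundled kyphosis data (an illustrative choice; the tutorial's own scripts are not shown here):

  library(rpart)
  # Grow a deliberately large tree, then postprune at the cp value that
  # minimizes the cross-validated error from rpart's internal CV.
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               control = rpart.control(cp = 0, minsplit = 2))
  printcp(fit)                                              # CV error vs. cp
  best   <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pruned <- prune(fit, cp = best)                           # "right-sized" tree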
• 22. Decision Trees: Hands-on Exercise
[Figure: scatter plot of the two classes in the (x1, x2) unit square, separated by a linear boundary]
•  Start RStudio
•  Navigate to directory: example.1.LinearBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run "fitModel_CART.R"
•  If curious, also see "gen2DdataLinear.R"
•  After the boosting discussion, load and run "fitModel_GBM.R"
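fitModel_CART.R itself is not reproduced here; a hedged sketch in the same spirit, assuming linear-boundary data like that of gen2DdataLinear.R:

  library(rpart)
  set.seed(2)
  n   <- 500
  x1  <- runif(n); x2 <- runif(n)
  cls <- factor(ifelse(x2 > x1, "blue", "green"))   # assumed generating rule
  fit <- rpart(cls ~ x1 + x2)
  # Axis-parallel splits can only approximate the diagonal boundary with a
  # "staircase"; compare with the boosted fit from fitModel_GBM.R later.
  plot(x1, x2, col = as.integer(cls), pch = 19)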
• 23. Decision Trees: Key Features
•  Ability to deal with irrelevant inputs
–  i.e., automatic variable subset selection
–  Measure anything you can measure
–  Score provided for selected variables ("importance")
•  No data preprocessing needed
–  Naturally handle all types of variables: numeric, binary, categorical
–  Invariant under monotone transformations $\tilde{x}_j = g_j(x_j)$
•  Variable scales are irrelevant
•  Immune to bad $x_j$ distributions (e.g., outliers)
• 24. Decision Trees: Key Features (2)
•  Computational scalability
–  Relatively fast: $O(nN \log N)$
•  Missing-value tolerant
–  Moderate loss of accuracy due to missing values
–  Handled via "surrogate" splits
•  "Off-the-shelf" procedure
–  Few tunable parameters
•  Interpretable model representation
–  Binary tree graphic
• 25. Decision Trees: Limitations
•  Discontinuous piecewise-constant model
[Figure: a step function F(x) approximating a smooth curve]
–  In order to have many splits you need a lot of data; in high dimensions, you often run out of data after a few splits
–  Also note the error is larger near region boundaries
• 26. Decision Trees: Limitations (2)
•  Not good for low-interaction targets $F^{*}(\mathbf{x})$
–  e.g., $F^{*}(\mathbf{x}) = a_0 + \sum_{j=1}^{n} a_j x_j = \sum_{j=1}^{n} f_j^{*}(x_j)$ (no interactions, additive) is the worst kind of function for trees
–  In order for $x_l$ to enter the model, the tree must split on it
•  A path from the root to a node is a product of indicators
•  Not good for $F^{*}(\mathbf{x})$ that depends on many variables
–  Each split reduces the training data available for subsequent splits (data fragmentation)
• 27. Decision Trees: Limitations (3)
•  High variance caused by the greedy search strategy (local optima)
–  Errors in upper splits propagate down to affect all splits below them
⇒ Small changes in the data (sampling fluctuations) can cause big changes in the tree
–  Very deep trees might be questionable
–  Pruning is important
•  What to do next?
–  Live with the problems
–  Use other methods (when possible)
–  Fix up trees: use ensembles
• 28. Overview
•  In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
Ø  Ensemble Methods
–  In a Nutshell, Diversity & Importance Sampling
–  Generic Ensemble Generation
–  Bagging, RF, AdaBoost, Boosting, Rule Ensembles
•  Summary
• 29. Ensemble Methods: In a Nutshell
•  Model: $F(\mathbf{x}) = c_0 + \sum_{m=1}^{M} c_m T_m(\mathbf{x})$
–  $\{T_m(\mathbf{x})\}_1^M$: "basis" functions (or "base learners")
–  i.e., a linear model in a (very) high-dimensional space of derived variables
•  Learner characterization: $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
–  $\mathbf{p}_m$: a specific set of joint parameter values, e.g., split definitions at internal nodes and predictions at terminal nodes
–  $\{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in \mathcal{P}}$: the function class, i.e., the set of all base learners of the specified family
• 30. Ensemble Methods: In a Nutshell (2)
•  Learning: a two-step process; an approximate solution to
$\{c_m, \mathbf{p}_m\}_0^M = \arg\min_{\{c_m, \mathbf{p}_m\}_0^M} \sum_{i=1}^{N} L\left(y_i,\, c_0 + \sum_{m=1}^{M} c_m T(\mathbf{x}_i; \mathbf{p}_m)\right)$
–  Step 1: choose the points $\{\mathbf{p}_m\}_1^M$
•  i.e., select $\{T_m(\mathbf{x})\}_1^M \subset \{T(\mathbf{x}; \mathbf{p})\}_{\mathbf{p} \in \mathcal{P}}$
–  Step 2: determine the weights $\{c_m\}_0^M$
•  e.g., via regularized linear regression
• 31. Ensemble Methods: Importance Sampling (Friedman, 2003)
•  How to judiciously choose the "basis" functions (i.e., $\{\mathbf{p}_m\}_1^M$)?
•  Goal: find "good" $\{\mathbf{p}_m\}_1^M$ so that $F(\mathbf{x}; \{\mathbf{p}_m\}_1^M, \{c_m\}_1^M) \cong F^{*}(\mathbf{x})$
•  Connection with numerical integration: $\int_{\mathcal{P}} I(\mathbf{p})\, d\mathbf{p} \approx \sum_{m=1}^{M} w_m I(\mathbf{p}_m)$
[Figure: two integrands compared; accuracy improves when more points are chosen from the region where the integrand is large]
• 32. Importance Sampling: Numerical Integration via Monte Carlo Methods
•  $r(\mathbf{p})$ = sampling pdf of $\mathbf{p} \in \mathcal{P}$, i.e., $\{\mathbf{p}_m \sim r(\mathbf{p})\}_1^M$
–  Simple approach: draw the $\mathbf{p}_m$ i.i.d., i.e., uniformly
–  In our problem: $r(\mathbf{p})$ should be inversely related to $\mathbf{p}_m$'s "risk"
•  i.e., if $T(\mathbf{x}; \mathbf{p}_m)$ has high error ⇒ $\mathbf{p}_m$ lacks relevance ⇒ low $r(\mathbf{p}_m)$
•  "Quasi" Monte Carlo:
–  with/without knowledge of the other points that will be used
•  i.e., single-point vs. group importance
–  Sequential approximation: a point's relevance is judged in the context of the (fixed) previously selected points
• 33. Ensemble Methods: Importance Sampling – Characterization of $r(\mathbf{p})$
•  Let $\mathbf{p}^{*} = \arg\min_{\mathbf{p}} \mathrm{Risk}(\mathbf{p})$
•  Narrow $r(\mathbf{p})$: an ensemble $\{T(\mathbf{x}; \mathbf{p}_m)\}_1^M$ of "strong" base learners, i.e., all with $\mathrm{Risk}(\mathbf{p}_m) \approx \mathrm{Risk}(\mathbf{p}^{*})$
–  The $T(\mathbf{x}; \mathbf{p}_m)$ yield similar, highly correlated predictions ⇒ unexceptional performance
•  Broad $r(\mathbf{p})$: a diverse ensemble, i.e., predictions are not highly correlated with each other
–  However, many "weak" base learners, i.e., $\mathrm{Risk}(\mathbf{p}_m) \gg \mathrm{Risk}(\mathbf{p}^{*})$ ⇒ poor performance
• 34. Ensemble Methods: Approximate Process of Drawing from $r(\mathbf{p})$
•  Heuristic sampling strategy: sample around $\mathbf{p}^{*}$ by iteratively applying small perturbations to the existing problem structure
–  Generating ensemble members $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$:

  For $m = 1$ to $M$ {
      $\mathbf{p}_m = \mathrm{PERTURB}_m\{\arg\min_{\mathbf{p}} E_{\mathbf{x}y}\, L(y, T(\mathbf{x}; \mathbf{p}))\}$
  }

–  PERTURB{·} is a (random) modification of any of:
•  The data distribution, e.g., by re-weighting the observations
•  The loss function, e.g., by modifying its argument
•  The search algorithm (used to find the minimizing $\mathbf{p}$)
–  The width of $r(\mathbf{p})$ is controlled by the degree of perturbation
• 35. Generic Ensemble Generation – Step 1: Choose Base Learners $\{\mathbf{p}_m\}$
•  Forward stagewise fitting procedure:

  $F_0(\mathbf{x}) = 0$
  For $m = 1$ to $M$ {
      // Fit a single base learner (modification of the data distribution, and of the
      // loss function via the "sequential" approximation)
      $\mathbf{p}_m = \arg\min_{\mathbf{p}} \sum_{i \in S_m(\eta)} L\left(y_i,\, F_{m-1}(\mathbf{x}_i) + T(\mathbf{x}_i; \mathbf{p})\right)$
      // Update the additive expansion
      $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
      $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot T_m(\mathbf{x})$
  }
  write $\{T_m(\mathbf{x})\}_1^M$

–  Algorithm controls: $L$, $\eta$, $\upsilon$
•  $S_m(\eta)$: a random sub-sample of size $\eta \le N$ ⇒ impacts ensemble "diversity"
•  $F_{m-1}(\mathbf{x}) = \upsilon \cdot \sum_{k=1}^{m-1} T_k(\mathbf{x})$: the "memory" function ($0 \le \upsilon \le 1$)
• 36. Generic Ensemble Generation – Step 2: Choose Coefficients $\{c_m\}$
•  Given $\{T_m(\mathbf{x})\}_{m=1}^{M} = \{T(\mathbf{x}; \mathbf{p}_m)\}_{m=1}^{M}$, the coefficients can be obtained by a regularized linear regression (see the sketch below):
$\{\hat{c}_m\} = \arg\min_{\{c_m\}} \sum_{i=1}^{N} L\left(y_i,\, c_0 + \sum_{m=1}^{M} c_m T_m(\mathbf{x}_i)\right) + \lambda \cdot P(\mathbf{c})$
–  Regularization here helps reduce the bias (in addition to the variance) of the model
–  New fast iterative algorithms exist for various loss/penalty combinations
•  "GLMs via Coordinate Descent" (2008)
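A hedged sketch of Step 2 in R: build a matrix of base-tree predictions, then let a lasso (glmnet) pick the coefficients. The data and the perturbation scheme are illustrative assumptions, not the tutorial's:

  library(rpart)
  library(glmnet)
  set.seed(3)
  n <- 300; M <- 50
  dat   <- data.frame(x1 = runif(n), x2 = runif(n))
  dat$y <- dat$x1 - dat$x2 + rnorm(n, sd = 0.1)
  # Step 1 stand-in: M trees, each fit to a random half of the data
  TM <- sapply(1:M, function(m) {
    idx <- sample(n, n / 2)
    predict(rpart(y ~ x1 + x2, data = dat[idx, ]), newdata = dat)
  })
  # Step 2: regularized linear regression over the tree predictions;
  # the lasso penalty shrinks many of the cm exactly to zero
  cvfit <- cv.glmnet(TM, dat$y, alpha = 1)
  coef(cvfit, s = "lambda.min")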
• 37. Bagging (Breiman, 1996)
•  Bagging = Bootstrap Aggregation; an instance of the generic procedure with:
–  $L(y, \hat{y})$: as available for a single tree
–  $\upsilon = 0$ ⇒ no memory
–  $\eta = N/2$ (a 50% sub-sample; classically, a bootstrap sample of size $N$)
–  $T_m(\mathbf{x})$: large un-pruned trees
–  $c_0 = 0$, $\{c_m = 1/M\}_1^M$, i.e., coefficients not fit to the data (a simple average)
•  i.e., a perturbation of the data distribution only
•  Potential improvements?
•  R package: ipred (a by-hand sketch follows)
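A by-hand sketch of the procedure above (the tutorial's fitModel_Bagging_by_hand.R is not shown; data and sizes here are assumptions):

  library(rpart)
  set.seed(4)
  n  <- 400
  x1 <- runif(n, -2, 2); x2 <- runif(n, -1, 1)
  y  <- as.numeric(x1^2 / 2 + x2^2 < 1)            # elliptical boundary
  dat <- data.frame(y, x1, x2)
  M <- 100
  trees <- lapply(1:M, function(m) {
    idx <- sample(n, n / 2)                        # eta = N/2, no memory
    rpart(y ~ x1 + x2, data = dat[idx, ],
          control = rpart.control(cp = 0, minsplit = 5))  # large unpruned tree
  })
  # c0 = 0, cm = 1/M: a plain average of the M trees
  bag_pred <- rowMeans(sapply(trees, predict, newdata = dat))
  mean((bag_pred > 0.5) != dat$y)                  # training error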
• 38. Bagging: Hands-on Exercise
[Figure: scatter plot of the two classes in the (x1, x2) plane, separated by an elliptical boundary]
•  Navigate to directory: example.2.EllipticalBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run:
–  fitModel_Bagging_by_hand.R
–  fitModel_CART.R (optional)
•  If curious, also see gen2DdataNonLinear.R
•  After class, load and run fitModel_Bagging.R (sketch below)
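fitModel_Bagging.R itself is not shown; the packaged route presumably uses ipred, along these lines (arguments are illustrative):

  library(ipred)
  # bagging() with coob = TRUE reports the out-of-bag error estimate
  fit <- bagging(Species ~ ., data = iris, nbagg = 50, coob = TRUE)
  fit$err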
• 39. Bagging: Why does it help?
•  Under $L(y, \hat{y}) = (y - \hat{y})^2$, averaging reduces variance and leaves bias unchanged
•  Consider the "idealized" bagging (aggregate) estimator $f(\mathbf{x}) = E \hat{f}_Z(\mathbf{x})$
–  $\hat{f}_Z$ is fit to the bootstrap data set $Z = \{y_i, \mathbf{x}_i\}_1^N$
–  $Z$ is sampled from the actual population distribution (not the training data)
–  We can write:
$E[Y - \hat{f}_Z(\mathbf{x})]^2 = E[Y - f(\mathbf{x}) + f(\mathbf{x}) - \hat{f}_Z(\mathbf{x})]^2 = E[Y - f(\mathbf{x})]^2 + E[\hat{f}_Z(\mathbf{x}) - f(\mathbf{x})]^2 \ge E[Y - f(\mathbf{x})]^2$
⇒ true population aggregation never increases mean squared error!
⇒ Bagging will often decrease MSE (see the small simulation below)
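A small simulation (not from the slides) consistent with this argument: averaging bootstrap trees typically lowers test MSE relative to a single deep tree. The target function and noise level are assumptions:

  library(rpart)
  set.seed(5)
  n  <- 200
  x  <- runif(n, -3, 3);    y  <- sin(x) + rnorm(n, sd = 0.3)
  xt <- runif(1000, -3, 3); yt <- sin(xt) + rnorm(1000, sd = 0.3)
  ctl  <- rpart.control(cp = 0, minsplit = 4)      # deliberately high-variance trees
  deep <- rpart(y ~ x, data = data.frame(x, y), control = ctl)
  bag  <- replicate(100, {                         # 100 bootstrap trees
    idx <- sample(n, n, replace = TRUE)
    fm  <- rpart(y ~ x, data = data.frame(x, y)[idx, ], control = ctl)
    predict(fm, newdata = data.frame(x = xt))
  })
  mean((yt - predict(deep, newdata = data.frame(x = xt)))^2)  # single-tree MSE
  mean((yt - rowMeans(bag))^2)                                # bagged MSE, typically lower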
• 40. Random Forest (Ho, 1995; Breiman, 2001)
•  Random Forest = Bagging + randomization of the search algorithm: subset splitting
–  As each tree is constructed:
•  Draw a random sample of $n_s = \lfloor \log_2(n) + 1 \rfloor$ predictors before each node is split
•  Find the best split as usual, but selecting only from that subset of predictors
⇒ Increased diversity among $\{T_m(\mathbf{x})\}_1^M$, i.e., a wider $r(\mathbf{p})$
•  The width is (inversely) controlled by $n_s$
–  Speed improvement over Bagging
–  R package: randomForest (usage sketch below)
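A usage sketch with the randomForest package named on the slide; mtry mimics the slide's $n_s = \lfloor \log_2(n) + 1 \rfloor$ rule (the data and parameter choices are illustrative):

  library(randomForest)
  p <- ncol(iris) - 1                              # number of predictors
  fit <- randomForest(Species ~ ., data = iris,
                      ntree = 500,
                      mtry  = floor(log2(p) + 1))  # random predictor subset per split
  print(fit)                                       # includes the OOB error estimate
  importance(fit)                                  # variable importance scores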
• 41. Bagging vs. Random Forest vs. ISLE: 100 Target Functions Comparison (Popescu, 2005)
[Figure: comparative RMS error for Bag, RF, Bag_6_5%_P, and RF_6_5%_P]
•  ISLE improvements:
–  Different data sampling strategy (not fixed)
–  Fit the coefficients to the data
•  Naming: xxx_6_5%_P = 6-terminal-node trees, 5% samples without replacement, post-processing (i.e., using the estimated "optimal" quadrature coefficients)
⇒ Significantly faster to build!
• 42. AdaBoost (Freund & Schapire, 1997)
•  The original algorithm:

  Observation weights: $w_i^{(0)} = 1/N$
  For $m = 1$ to $M$ {
      a. Fit a classifier $T_m(\mathbf{x})$ to the training data weighted by $w_i^{(m)}$
      b. Compute $err_m = \sum_{i=1}^{N} w_i^{(m)} I(y_i \ne T_m(\mathbf{x}_i)) \big/ \sum_{i=1}^{N} w_i^{(m)}$
      c. Compute $\alpha_m = \log\left((1 - err_m)/err_m\right)$
      d. Set $w_i^{(m+1)} = w_i^{(m)} \cdot \exp[\alpha_m \cdot I(y_i \ne T_m(\mathbf{x}_i))]$
  }
  Output $\mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m T_m(\mathbf{x})\right)$

•  Equivalence to the forward stagewise fitting procedure (with exponential loss):

  $F_0(\mathbf{x}) = 0$
  For $m = 1$ to $M$ {
      $(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i \in S_m(\eta)} L\left(y_i,\, F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\right)$
      $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$
      $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot c_m \cdot T_m(\mathbf{x})$
  }
  write $\{c_m, T_m(\mathbf{x})\}_1^M$

–  We need to show $\mathbf{p}_m = \arg\min(\cdot)$ is equivalent to line a. above
–  and that $c_m = \arg\min(\cdot)$ is equivalent to line c. (see the book)
•  R package: adabag (usage sketch below)
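A usage sketch with the adabag package named on the slide (mfinal and the data are illustrative assumptions):

  library(adabag)
  fit  <- boosting(Species ~ ., data = iris, mfinal = 50)  # AdaBoost.M1
  pred <- predict(fit, newdata = iris)
  pred$error                                               # resubstitution error rate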
• 43. AdaBoost: Hands-on Exercise
[Figure: scatter plot of the two classes in the (x1, x2) plane with the elliptical boundary]
•  Navigate to directory: example.2.EllipticalBoundary
•  Set working directory: use setwd() or the GUI
•  Load and run:
–  fitModel_Adaboost_by_hand.R
•  After class, load and run fitModel_Adaboost.R and fitModel_RandomForest.R
• 44. Stochastic Gradient Boosting (Friedman, 2001)
•  Boosting with any differentiable loss criterion; an instance of the generic procedure with:
–  General $L(y, \hat{y})$
–  $F_0(\mathbf{x}) = c_0 = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$
–  $\upsilon = 0.1$ ⇒ sequential sampling
–  $\eta = N/2$
–  $T_m(\mathbf{x})$: any "weak" learner
–  For $m = 1$ to $M$: $(c_m, \mathbf{p}_m) = \arg\min_{c,\mathbf{p}} \sum_{i \in S_m(\eta)} L\left(y_i,\, F_{m-1}(\mathbf{x}_i) + c \cdot T(\mathbf{x}_i; \mathbf{p})\right)$, $T_m(\mathbf{x}) = T(\mathbf{x}; \mathbf{p}_m)$, $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot c_m \cdot T_m(\mathbf{x})$; write $\{(\upsilon \cdot c_m), T_m(\mathbf{x})\}_1^M$
•  $\{c_m\}_1^M$ ⇒ "shrunk" sequential partial regression coefficients
–  Potential improvements?
–  R package: gbm (usage sketch below)
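A usage sketch with the gbm package named on the slide; shrinkage plays the role of $\upsilon$ and bag.fraction the role of $\eta/N$ (data and settings are illustrative):

  library(gbm)
  set.seed(6)
  n  <- 1000
  x1 <- runif(n); x2 <- runif(n)
  y  <- x1 + 2 * x2 + rnorm(n, sd = 0.2)
  fit <- gbm(y ~ x1 + x2, data = data.frame(y, x1, x2),
             distribution = "gaussian",
             n.trees = 1000, shrinkage = 0.1,   # M and the "memory" factor
             bag.fraction = 0.5,                # eta = N/2 subsampling
             interaction.depth = 4, cv.folds = 5)
  gbm.perf(fit, method = "cv")                  # M* minimizing the CV error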
• 45. Stochastic Gradient Boosting: LAD Regression, $L(y, \hat{y}) = |y - \hat{y}|$
•  More robust than $(y - F)^2$; resistant to outliers in $y$ (trees already provide resistance to outliers in $\mathbf{x}$)
•  The algorithm:

  $F_0(\mathbf{x}) = \mathrm{median}\{y_i\}_1^N$
  For $m = 1$ to $M$ {
      // Step 1: find $T_m(\mathbf{x})$, a $J$-terminal-node LS-regression tree fit to $\{\tilde{y}_i, \mathbf{x}_i\}_1^N$
      $\tilde{y}_i = \mathrm{sign}\left(y_i - F_{m-1}(\mathbf{x}_i)\right)$
      // Step 2: find the coefficients for the regions $\{R_{jm}\}_1^J$
      $\hat{\gamma}_{jm} = \mathrm{median}\{y_i - F_{m-1}(\mathbf{x}_i)\}_{\mathbf{x}_i \in R_{jm}}, \quad j = 1 \ldots J$
      // Update the expansion
      $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \upsilon \cdot \sum_{j=1}^{J} \hat{\gamma}_{jm}\, I(\mathbf{x} \in R_{jm})$
  }

•  Notes:
–  Trees are fitted to the pseudo-response ⇒ can't interpret individual trees
–  A "shrunk" version of the tree gets added to the ensemble
–  The original tree constants are overwritten (a gbm sketch follows)
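In gbm, LAD boosting is a one-argument change: distribution = "laplace" implements $L(y, \hat{y}) = |y - \hat{y}|$. The toy data below, with injected y-outliers, is an assumption for illustration:

  library(gbm)
  set.seed(7)
  n  <- 500
  x1 <- runif(n); x2 <- runif(n)
  y  <- x1 - x2 + rnorm(n, sd = 0.2)
  y[sample(n, 10)] <- 10                         # inject gross y-outliers
  fit <- gbm(y ~ x1 + x2, data = data.frame(y, x1, x2),
             distribution = "laplace",           # absolute-loss (LAD) boosting
             n.trees = 500, shrinkage = 0.1, bag.fraction = 0.5)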
• 46. Parallel vs. Sequential Ensembles: 100 Target Functions Comparison (Popescu, 2005)
[Figure: comparative RMS error for "parallel" ensembles (Bag, RF, Bag_6_5%_P, RF_6_5%_P) and "sequential" ones (Boost, Seq_0.01_20%_P, Seq_0.1_50%_P)]
•  Naming: xxx_6_5%_P = 6-terminal-node trees, 5% samples without replacement, post-processing (i.e., using the estimated "optimal" quadrature coefficients)
•  Seq_υ_η%_P: a "sequential" ensemble of 6-terminal-node trees, with "memory" factor υ, η% samples without replacement, and post-processing
•  Sequential ISLEs tend to perform better than parallel ones
–  Consistent with results observed in classical Monte Carlo integration
• 47. Rule Ensembles (Friedman & Popescu, 2005)
•  Trees as collections of conjunctive rules: $T_m(\mathbf{x}) = \sum_{j=1}^{J} \hat{c}_{jm}\, I(\mathbf{x} \in R_{jm})$
[Figure: a tree over $x_1, x_2$ whose five terminal regions R1 … R5 translate into rules:]
–  $r_1(\mathbf{x}) = I(x_1 > 22) \cdot I(x_2 > 27)$
–  $r_2(\mathbf{x}) = I(x_1 > 22) \cdot I(0 \le x_2 \le 27)$
–  $r_3(\mathbf{x}) = I(15 < x_1 \le 22) \cdot I(0 \le x_2)$
–  $r_4(\mathbf{x}) = I(0 \le x_1 \le 15) \cdot I(x_2 > 15)$
–  $r_5(\mathbf{x}) = I(0 \le x_1 \le 15) \cdot I(0 \le x_2 \le 15)$
•  These simple rules, $r_m(\mathbf{x}) \in \{0, 1\}$, can be used as base learners
•  The main motivation is interpretability
• 48. Rule Ensembles: ISLE Procedure
•  Rule-based model: $F(\mathbf{x}) = a_0 + \sum_m a_m r_m(\mathbf{x})$
–  Still a piecewise-constant model ⇒ complement the non-linear rules with purely linear terms: $F(\mathbf{x}) = a_0 + \sum_k a_k r_k(\mathbf{x}) + \sum_j b_j x_j$
•  Fitting:
–  Step 1: derive the rules from a tree ensemble (a shortcut)
•  Tree size controls rule "complexity" (interaction order)
–  Step 2: fit the coefficients using a regularized linear procedure (sketch below):
$(\{\hat{a}_k\}, \{\hat{b}_j\}) = \arg\min_{\{a_k\},\{b_j\}} \sum_{i=1}^{N} L\left(y_i,\, F(\mathbf{x}_i; \{a_k\}, \{b_j\})\right) + \lambda \cdot \left(P(\mathbf{a}) + P(\mathbf{b})\right)$
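The tutorial's fitModel_RE.R is not shown; as one open-source illustration (a later CRAN package, not the tutorial's code), pre fits rule ensembles in this spirit: rules are derived from a boosted tree ensemble, and a lasso then selects and weights them together with linear terms:

  library(pre)
  # airquality ships with R; complete cases only, since NAs are not handled
  fit <- pre(Ozone ~ ., data = na.omit(airquality))
  print(fit)       # the selected rules / linear terms and their coefficients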
• 49. Boosting & Rule Ensembles: Hands-on Exercise
[Figure: absolute loss vs. iteration (0–1000) for the boosted model]
•  Navigate to directory: example.3.Diamonds
•  Set working directory: use setwd() or the GUI
•  Load and run:
–  viewDiamondData.R
–  fitModel_GBM.R
–  fitModel_RE.R
•  After class, go to example.1.LinearBoundary and run fitModel_GBM.R (a sketch follows)
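The exercise scripts are not reproduced here; a sketch in their spirit, assuming ggplot2's diamonds data and absolute loss as in the plot:

  library(gbm)
  library(ggplot2)                                 # for the diamonds data
  set.seed(8)
  d <- as.data.frame(diamonds[sample(nrow(diamonds), 5000), ])
  fit <- gbm(price ~ carat + cut + color + clarity, data = d,
             distribution = "laplace",             # absolute loss, as in the plot
             n.trees = 1000, shrinkage = 0.05,
             bag.fraction = 0.5, cv.folds = 5)
  gbm.perf(fit, method = "cv")                     # loss vs. iteration curve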
• 50. Overview
•  Motivation, In a Nutshell & Timeline
•  Predictive Learning & Decision Trees
•  Ensemble Methods
Ø  Summary
• 51. Summary
•  Ensemble methods have been found to perform extremely well in a variety of problem domains
•  They have been shown to have desirable statistical properties
•  The latest ensemble research brings together important foundational strands of statistics
•  The emphasis has been on accuracy, but significant progress has been made on interpretability
•  Go build ensembles and keep in touch!