SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Ensemble Learners
• Classification Trees (Done)
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
• Details in chaps 10, 15 and 16

Methods for dramatically improving the performance of weak
learners such as Trees. Classification trees are adaptive and robust,
but do not make good prediction models. The techniques discussed
here enhance their performance considerably.

227
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Properties of Trees
• ✔ Can handle huge datasets
• ✔ Can handle mixed predictors—quantitative and qualitative
• ✔ Easily ignore redundant variables
• ✔ Handle missing data elegantly through surrogate splits
• ✔ Small trees are easy to interpret
• ✖ Large trees are hard to interpret
• ✖ Prediction performance is often poor

228
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

[Figure: pruned classification tree for the SPAM data. The root node (email, 600/1536) splits on ch$ < 0.0555; subsequent splits use remove, hp, ch!, george, free, CAPMAX, business, receive, edu, our, CAPAVE, and 1999, and each terminal node is labeled spam or email with its misclassification count.]

229
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

ROC curve for pruned tree on SPAM data

[Figure: ROC curve for the pruned tree on the SPAM data (TREE − Error: 8.7%); sensitivity plotted against specificity.]

Overall error rate on test data:
8.7%.
ROC curve obtained by varying the threshold c0 of the classifier:
C(X) = +1 if Pr(+1|X) > c0 .
Sensitivity: proportion of true
spam identified
Specificity: proportion of true
email identified.

We may want specificity to be high, and suffer some spam:
Specificity : 95% =⇒ Sensitivity : 79%

230
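The ROC curve is traced by sweeping the threshold c0 and recomputing sensitivity and specificity at each value; below is a minimal sketch, with made-up probabilities standing in for a fitted spam model's Pr(+1|X).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                                 # 1 = spam, 0 = email
p_spam = np.clip(0.6 * y + rng.normal(0.2, 0.25, 1000), 0, 1)     # stand-in for Pr(+1|X)

for c0 in np.linspace(0.05, 0.95, 10):
    pred = (p_spam > c0).astype(int)              # C(X) = +1 if Pr(+1|X) > c0
    sensitivity = (pred[y == 1] == 1).mean()      # proportion of true spam identified
    specificity = (pred[y == 0] == 0).mean()      # proportion of true email identified
    print(f"c0={c0:.2f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```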
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Toy Classification Problem
Bayes Error Rate: 0.25

[Figure: scatter of the two classes (labeled 0 and 1) in the (X1, X2) plane, with the Bayes decision boundary drawn in black.]

• Data X and Y , with Y taking values +1 or −1.
• Here X ∈ IR2.
• The black boundary is the Bayes Decision Boundary, the best one can do; here the Bayes error is 25%.
• Goal: given N training pairs (Xi , Yi ), produce a classifier Ĉ(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels Pr(Y = +1|X).

231
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Toy Example — No Noise
Bayes Error Rate: 0

[Figure: scatter of the two classes in the (X1, X2) plane; class 0 falls in a central region surrounded by class 1, so the classes are separable.]

• Deterministic problem; noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes Error is 0%.

232
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Classification Tree
[Figure: classification tree fit to the toy training sample. The root node (94/200) splits on x.2 < −1.06711, followed by splits on x.2 < 1.14988, x.1 < 1.13632, x.1 < −0.900735, x.1 < −1.1668, x.2 < −0.823968, and x.1 < −1.07831; each node is labeled with its class (0 or 1) and misclassification count.]

233
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

234

Decision Boundary: Tree
Error Rate: 0.073

[Figure: training points and the tree's decision boundary in the (X1, X2) plane; the boundary is made up of axis-parallel rectangles.]

• The eight-node tree is boxy, and has a test-error rate of 7.3%.
• When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Model Averaging
Classification trees can be simple, but often produce noisy (bushy)
or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to
bootstrap-resampled versions of the training data, and classify
by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small
trees to reweighted versions of the training data. Classify by
weighted majority vote.
• Random Forests (Breiman 1999): Fancier version of bagging.
In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.

235
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Bagging
• Bagging or bootstrap aggregation averages a noisy fitted
function refit to many bootstrap samples, to reduce its
variance. Bagging is a poor man’s Bayes; see pp 282.

• Suppose C(S, x) is a fitted classifier, such as a tree, fit to our
training data S, producing a predicted class label at input
point x.
• To bag C, we draw bootstrap samples S∗1 , . . . , S∗B , each of size
N with replacement from the training data. Then

Cbag (x) = Majority Vote { C(S∗b , x) }, b = 1, . . . , B.

• Bagging can dramatically reduce the variance of unstable
procedures (like trees), leading to improved prediction.
However, any simple structure in C (e.g. a tree) is lost.

236
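A minimal sketch of this recipe in Python, with scikit-learn trees as the fitted classifier C(S, x) (the slides' own computations use R); labels are assumed to be coded 0/1, and the function names are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, B=100, seed=0):
    """Fit B large trees to bootstrap samples S*1, ..., S*B drawn with replacement."""
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)          # bootstrap sample of size N
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """C_bag(x): majority vote over the bootstrap trees (labels assumed 0/1)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_points)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```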
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

[Figure: the original tree (root split x.1 < 0.395) and trees grown on eleven bootstrap samples (b = 1, . . . , 11). The bootstrap trees have different root splits (e.g. x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985, x.4 < −1.36), illustrating the tree-to-tree variability that bagging averages over.]
237
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Decision Boundary: Bagging

Error Rate: 0.032

[Figure: training points and the bagged decision boundary in the (X1, X2) plane; the boundary is much smoother than that of the single tree.]

• Bagging reduces error by variance reduction.
• Bagging averages many trees, and produces smoother decision boundaries.
• Bias is slightly increased because bagged trees are slightly shallower.

238
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Random forests
• refinement of bagged trees; very popular—see chapter 15.

• at each tree split, a random sample of m features is drawn, and
only those m features are considered for splitting. Typically
m = √p or log2 p, where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for
observations left out of the bootstrap sample is monitored.
This is called the “out-of-bag” or OOB error rate.
• random forests tries to improve on bagging by “de-correlating”
the trees, thereby reducing the variance.
• Each tree has the same expectation, so increasing the number
of trees does not alter the bias of bagging or random forests.

239
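A short sketch of these settings using scikit-learn's RandomForestClassifier instead of the R randomForest package mentioned later in the slides; the synthetic data and the choice of 500 trees are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=57, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees
    max_features="sqrt",   # m = sqrt(p) features considered at each split
    oob_score=True,        # error on observations left out of each bootstrap sample
    random_state=0,
).fit(X, y)

print("OOB error rate:", 1 - rf.oob_score_)
```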
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Wisdom of Crowds
• 10 movie categories, 4 choices in each category
• 50 people; for each question, 15 informed and 35 guessers
• For informed people, Pr(correct) = p and Pr(incorrect) = (1 − p)/3
• For guessers, Pr(each choice) = 1/4.

[Figure: expected number correct out of 10 versus p, the probability of an informed person being correct, for the consensus vote and for an individual.]

Consensus (orange) improves over individuals (teal) even for moderate p.

240
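A small Monte Carlo sketch of this setup under the slide's assumptions (15 informed voters correct with probability p, 35 pure guessers, 4 choices per question); the number of replications and the tie-breaking rule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_correct(p, n_reps=2000, n_q=10, informed=15, guessers=35, choices=4):
    consensus, individual = 0.0, 0.0
    for _ in range(n_reps):
        for _ in range(n_q):
            truth = 0
            # informed voters: correct with probability p, else one of the 3 wrong answers
            inf_votes = np.where(rng.random(informed) < p,
                                 truth,
                                 rng.integers(1, choices, size=informed))
            guess_votes = rng.integers(0, choices, size=guessers)   # each choice w.p. 1/4
            counts = np.bincount(np.concatenate([inf_votes, guess_votes]),
                                 minlength=choices)
            consensus += counts.argmax() == truth    # ties broken toward the true answer
            individual += inf_votes[0] == truth      # a single informed person
    return consensus / n_reps, individual / n_reps

for p in (0.5, 0.75):
    cons, ind = expected_correct(p)
    print(f"p={p}: consensus ~ {cons:.1f}/10, informed individual ~ {ind:.1f}/10")
```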
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

241

ROC curve for TREE, SVM and Random Forest on SPAM data

[Figure: ROC curves (sensitivity versus specificity) for the three classifiers. Random Forest − Error: 4.75%; SVM − Error: 6.7%; TREE − Error: 8.7%.]

Random Forest dominates both other methods on the SPAM data — 4.75% error. Used 500 trees with default settings for the randomForest package in R.
SLDM III c Hastie & Tibshirani - October 9, 2013

242

Random Forests and Boosting

RF: decorrelation
• Var f̂RF (x) = ρ(x) σ²(x)
• The correlation is because we are growing trees to the same data.
• The correlation decreases as m decreases—m is the number of variables sampled at each split.

[Figure: correlation between trees versus the number m of randomly selected splitting variables, based on a simple experiment with random forest regression models.]
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Random Forest Ensemble

[Figure: mean squared error, squared bias, and variance of the random forest ensemble as functions of m.]

• The smaller m, the lower the variance of the random forest ensemble.
• Small m is also associated with higher bias, because important variables can be missed by the sampling.

243
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

244

Boosting
• Breakthrough invention of Freund & Schapire (1997).
• Average many trees, each grown to reweighted versions of the training data.
• Weighting decorrelates the trees, by focussing on regions missed by past trees.
• Final classifier is a weighted average of classifiers:

C(x) = sign( Σ_{m=1}^{M} αm Cm (x) )

Note: Cm (x) ∈ {−1, +1}.

[Figure: schematic of boosting: the training sample yields C1(x), and a sequence of weighted samples yields C2(x), C3(x), . . . , CM(x).]
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

245

Boosting vs Bagging
• 2000 points from Nested Spheres in R10.
• Bayes error rate is 0%.
• Trees are grown best first without pruning.
• Leftmost term is a single tree.

[Figure: test error versus number of terms (up to 400) for Bagging and AdaBoost, using 100-node trees.]
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N .
2. For m = 1 to M repeat steps (a)–(d):
(a) Fit a classifier Cm (x) to the training data using weights wi .
(b) Compute the weighted error of the newest tree:

errm = Σ_{i=1}^{N} wi I(yi ≠ Cm (xi )) / Σ_{i=1}^{N} wi .

(c) Compute αm = log[(1 − errm )/errm ].
(d) Update the weights for i = 1, . . . , N :

wi ← wi · exp[αm · I(yi ≠ Cm (xi ))]

and renormalize the wi to sum to 1.
3. Output C(x) = sign( Σ_{m=1}^{M} αm Cm (x) ).

246
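A direct sketch of these steps in Python, with decision stumps as the weak classifiers Cm(x); labels are assumed to be coded ±1, and clipping errm away from 0 and 1 is a numerical safeguard that is not part of the algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=200):
    N = len(y)
    w = np.full(N, 1.0 / N)                       # 1. w_i = 1/N
    classifiers, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # (a) fit C_m using weights w_i
        miss = stump.predict(X) != y
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)  # (b) weighted error
        alpha = np.log((1 - err) / err)           # (c) alpha_m
        w = w * np.exp(alpha * miss)              # (d) up-weight misclassified points
        w = w / w.sum()                           #     renormalize to sum to 1
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    # 3. C(x) = sign( sum_m alpha_m C_m(x) ), with y coded as -1/+1
    agg = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(agg)
```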
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Boosting Stumps
• A stump is a two-node tree, after a single split.
• Boosting stumps works remarkably well on the nested-spheres problem.

[Figure: test error versus boosting iterations (up to 400), with the error of a single stump and of a single 400-node tree marked for reference.]

247
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Training Error
• Nested spheres in 10-Dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero (green curve).
• Further iterations continue to improve test error in many examples (red curve).

[Figure: training and test error versus number of terms (up to 600).]

248
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Noisy Problems
• Nested Gaussians in 10-Dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.

[Figure: training and test error versus number of terms (up to 600), with the Bayes error of 25% marked.]

249
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Stagewise Additive Modeling
Boosting builds an additive model
F (x) = Σ_{m=1}^{M} βm b(x; γm ).

Here b(x, γm ) is a tree, and γm parametrizes the splits.
We do things like that in statistics all the time!
• GAMs: F (x) = Σ_j fj (xj )
• Basis expansions: F (x) = Σ_{m=1}^{M} θm hm (x)

Traditionally the parameters fj , θm are fit jointly (e.g. by least
squares or maximum likelihood).
With boosting, the parameters (βm , γm ) are fit in a stagewise
fashion. This slows the process down, and overfits less quickly.

250
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Example: Least Squares Boosting
1. Start with the function F0 (x) = 0 and residual r = y; set m = 0.
2. m ← m + 1
3. Fit a CART regression tree to r, giving g(x).
4. Set fm (x) ← ε · g(x).
5. Set Fm (x) ← Fm−1 (x) + fm (x), r ← r − fm (x), and repeat
steps 2–5 many times.
• g(x) is typically a shallow tree, e.g. constrained to have only k = 3 splits (depth).
• Stagewise fitting (no backfitting!) slows down the (over-)fitting process.
• 0 < ε ≤ 1 is a shrinkage parameter; it slows overfitting even further.

251
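A minimal sketch of this loop, with scikit-learn regression trees standing in for CART; max_depth=3 is only an approximation to the "k = 3 splits" constraint, and eps plays the role of ε.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ls_boost_fit(X, y, n_steps=500, eps=0.01, max_depth=3):
    r = y.astype(float).copy()        # 1. F_0(x) = 0, residual r = y
    trees = []
    for _ in range(n_steps):          # 2.-5. repeated
        g = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # 3. tree fit to r
        f_m = eps * g.predict(X)      # 4. f_m(x) = eps * g(x)
        r -= f_m                      # 5. r <- r - f_m(x)
        trees.append(g)
    return trees

def ls_boost_predict(trees, X, eps=0.01):
    # F_M(x) = sum_m f_m(x) = eps * sum_m g_m(x)
    return eps * sum(t.predict(X) for t in trees)
```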
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Least Squares Boosting and ε

[Figure: prediction error versus the number of steps m, for a single tree and for boosting with ε = 1 and ε = .01.]

• If instead of trees we use linear regression terms, boosting with
small ε is very similar to the lasso!

252
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Adaboost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model
F (x) = log [ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^{M} αm fm (x)
by stagewise fitting using the loss function
L(y, F (x)) = exp(−y F (x)).
• Instead of fitting trees to residuals, the special form of the
exponential loss leads to fitting trees to weighted versions of
the original data. See pp 343.

253
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Why Exponential Loss?
• e−yF (x) is a monotone, smooth upper bound on misclassification loss at x.
• Leads to simple reweighting scheme.
• Has logit transform as population minimizer:

F ∗ (x) = (1/2) log [ Pr(Y = 1|x) / Pr(Y = −1|x) ]

• y · F is the margin; positive margin means correct classification.
• Other more robust loss functions, like binomial deviance.

[Figure: misclassification, exponential, binomial deviance, squared error, and support vector losses plotted as functions of the margin y · F.]

254
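For reference, one common parameterization of the losses in the figure, evaluated as functions of the margin y·F; the exact scaling conventions (e.g. the factor 2 in the deviance) are an assumption here.

```python
import numpy as np

yF = np.linspace(-2, 2, 9)                    # margin y * F
misclassification = (yF < 0).astype(float)    # I(y F < 0)
exponential       = np.exp(-yF)               # e^{-yF}
binomial_deviance = np.log1p(np.exp(-2 * yF))
squared_error     = (1 - yF) ** 2             # (y - F)^2 with y in {-1, +1}
support_vector    = np.maximum(0.0, 1 - yF)   # hinge loss

for name, loss in [("misclassification", misclassification),
                   ("exponential", exponential),
                   ("binomial deviance", binomial_deviance),
                   ("squared error", squared_error),
                   ("support vector", support_vector)]:
    print(f"{name:18s}", np.round(loss, 2))
```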
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

General Boosting Algorithms
The boosting paradigm can be extended to general loss functions
— not only squared error or exponential.
1. Initialize F0 (x) = 0.
2. For m = 1 to M :
(a) Compute
(βm , γm ) = arg min_{β,γ} Σ_{i=1}^{N} L(yi , Fm−1 (xi ) + β b(xi ; γ)).

(b) Set Fm (x) = Fm−1 (x) + βm b(x; γm ).
Sometimes we replace step (b) in item 2 by
(b∗ ) Set Fm (x) = Fm−1 (x) + ε βm b(x; γm ).
Here ε is a shrinkage factor; in the gbm package in R, the default is
ε = 0.001. Shrinkage slows the stagewise model-building even more,
and typically leads to better performance.

255
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Gradient Boosting
• General boosting algorithm that works with a variety of
different loss functions. Models include regression, resistant
regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for
representing the logits in logistic regression.
• Tree size is a parameter that determines the order of
interaction (next slide).
• Gradient Boosting inherits all the good features of trees
(variable selection, missing data, mixed predictors), and
improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in section 10.10, and is
implemented in Greg Ridgeway's gbm package in R.

256
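A sketch of a gradient-boosted classifier in scikit-learn, standing in for the R gbm package named above; the synthetic data, 1000 trees, shrinkage 0.01, and 6 terminal nodes are placeholder settings rather than the slides' exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4500, n_features=57, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=1000,    # number of trees M
    learning_rate=0.01,   # shrinkage epsilon
    max_leaf_nodes=6,     # terminal nodes J, controlling the interaction order
    random_state=0,
).fit(Xtr, ytr)

print("test error:", 1 - gb.score(Xte, yte))
```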
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam Data

[Figure: test error versus number of trees (up to 2500) on the SPAM data for Bagging, Random Forest, and Gradient Boosting with 5-node trees.]
257
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam Example Results
With 3000 training and 1500 test observations, Gradient Boosting
fits an additive logistic model
f (x) = log [ Pr(spam|x) / Pr(email|x) ]

using trees with J = 6 terminal nodes.
Gradient Boosting achieves a test error of 4.5%, compared to 5.3%
for an additive GAM, 4.75% for Random Forests, and 8.7% for
CART.

258
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam: Variable Importance
[Figure: relative importance (0–100) of the predictors in the boosted model. The most important variables are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, and CAPTOT; the least important include 3d, addresses, labs, telnet, 857, and 415.]

259
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Spam: Partial Dependence

[Figure: partial dependence of the log-odds of spam on four predictors: !, remove, edu, and hp.]
260
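Partial dependence plots like these can be produced from any fitted boosting model; below is a sketch using scikit-learn on synthetic data, where the feature indices [0, 1, 2, 3] merely stand in for the named spam predictors (!, remove, edu, hp).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = make_classification(n_samples=2000, n_features=57, random_state=0)
model = GradientBoostingClassifier(max_leaf_nodes=6, random_state=0).fit(X, y)

# Partial dependence of the fitted model on four chosen features
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1, 2, 3])
plt.show()
```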
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

California Housing Data

[Figure: test average absolute error versus number of trees (up to 1000) for random forests with m = 2 and m = 6, and gradient boosting (GBM) with depth 4 and depth 6.]

261
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Tree Size

The tree size J determines the interaction order of the model:

η(X) = Σ_j ηj (Xj ) + Σ_{jk} ηjk (Xj , Xk ) + Σ_{jkl} ηjkl (Xj , Xk , Xl ) + · · ·

[Figure: test error versus number of terms (up to 400) for boosting with stumps, 10-node trees, 100-node trees, and AdaBoost.]

262
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Stumps win!
Since the true decision boundary is the surface of a sphere, the
function that describes it has the form
f (X) = X1² + X2² + . . . + Xp² − c = 0.

Boosted stumps via Gradient Boosting returns reasonable
approximations to these quadratic functions.
[Figure: Coordinate Functions for Additive Logistic Trees: the fitted functions f1 (x1 ), . . . , f10 (x10 ), each approximately quadratic.]

263
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Learning Ensembles
Ensemble Learning consists of two steps:
1. Construction of a dictionary D = {T1 (X), T2 (X), . . . , TM (X)}
of basis elements (weak learners) Tm (X).
2. Fitting a model f (X) = Σ_{m∈D} αm Tm (X).

Simple examples of ensembles
• Linear regression: The ensemble consists of the coordinate
functions Tm (X) = Xm . The fitting is done by least squares.
• Random Forests: The ensemble consists of trees grown to
bootstrapped versions of the data, with additional
randomization at each split. The fitting simply averages.
• Gradient Boosting: The ensemble is grown in an adaptive
fashion, but then simply averaged at the end.

264
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Learning Ensembles and the Lasso
Proposed by Popescu and Friedman (2004):
Fit the model f (X) = Σ_{m∈D} αm Tm (X) in step (2) using Lasso
regularization:

α(λ) = arg min_α Σ_{i=1}^{N} L[ yi , α0 + Σ_{m=1}^{M} αm Tm (xi ) ] + λ Σ_{m=1}^{M} |αm |.

Advantages:
• Often ensembles are large, involving thousands of small trees.
The post-processing selects a smaller subset of these and
combines them efficiently.
• Often the post-processing improves the process that generated
the ensemble.

265
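A rough illustration of this two-step recipe in Python: grow a random forest as the dictionary of trees, then refit the coefficients αm with an L1 penalty so that most trees drop out. The synthetic data, 300 trees, and penalty strength C = 0.1 are arbitrary choices; this sketches the idea, not the authors' implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=57, random_state=0)

# Step 1: construct the dictionary D = {T_1(X), ..., T_M(X)} with a random forest
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
T = np.column_stack([t.predict(X) for t in forest.estimators_])   # N x M tree outputs

# Step 2: refit the coefficients alpha_m under an L1 penalty; most are shrunk to
# zero, selecting a smaller subset of the trees
post = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(T, y)
print("trees kept:", int(np.sum(post.coef_ != 0)), "of", T.shape[1])
```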
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Post-Processing Random Forest Ensembles

Spam Data

[Figure: test error versus number of trees (up to 500) for Random Forest, the lasso post-processed ensemble (legend: “Random Forest (5%, 6)”), and Gradient Boosting with 5-node trees.]

266
SLDM III c Hastie & Tibshirani - October 9, 2013

Random Forests and Boosting

Software
• R: free GPL statistical computing environment available from
CRAN, implements the S language. Includes:
– randomForest: implementation of Leo Breiman's algorithms.
– rpart: Terry Therneau’s implementation of classification and
regression trees.
– gbm: Greg Ridgeway’s implementation of Friedman’s
gradient boosting algorithm.
• Salford Systems: Commercial implementation of trees, random
forests and gradient boosting.
• Splus (Insightful): Commercial version of S.
• Weka: GPL software from University of Waikato, New Zealand.
Includes Trees, Random Forests and many other procedures.

267

Más contenido relacionado

La actualidad más candente

Spectral clustering
Spectral clusteringSpectral clustering
Spectral clusteringSOYEON KIM
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Xgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedXgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedSimon Lia-Jonassen
 
Introduction to Hamiltonian Neural Networks
Introduction to Hamiltonian Neural NetworksIntroduction to Hamiltonian Neural Networks
Introduction to Hamiltonian Neural NetworksMiles Cranmer
 
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...Jihun Yun
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboostShuai Zhang
 
Evolutionary Game Theory
Evolutionary Game TheoryEvolutionary Game Theory
Evolutionary Game TheoryAli Abbasi
 
Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep LearningYan Xu
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CARTXueping Peng
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeGilles Louppe
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forestsViet-Trung TRAN
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
 

La actualidad más candente (20)

Spectral clustering
Spectral clusteringSpectral clustering
Spectral clustering
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra03 Machine Learning Linear Algebra
03 Machine Learning Linear Algebra
 
Xgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - ExplainedXgboost: A Scalable Tree Boosting System - Explained
Xgboost: A Scalable Tree Boosting System - Explained
 
Doe16 nested
Doe16 nestedDoe16 nested
Doe16 nested
 
Introduction to Hamiltonian Neural Networks
Introduction to Hamiltonian Neural NetworksIntroduction to Hamiltonian Neural Networks
Introduction to Hamiltonian Neural Networks
 
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
ProxGen: Adaptive Proximal Gradient Methods for Structured Neural Networks (N...
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
 
Evolutionary Game Theory
Evolutionary Game TheoryEvolutionary Game Theory
Evolutionary Game Theory
 
simplex method
simplex methodsimplex method
simplex method
 
Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep Learning
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
08 clustering
08 clustering08 clustering
08 clustering
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
 
MCDM PPT.pptx
MCDM PPT.pptxMCDM PPT.pptx
MCDM PPT.pptx
 
PCA
PCAPCA
PCA
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forests
 
boosting algorithm
boosting algorithmboosting algorithm
boosting algorithm
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 

Destacado

Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learningSri Ambati
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsGilles Louppe
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2OSri Ambati
 
Q trade presentation
Q trade presentationQ trade presentation
Q trade presentationewig123
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3arogozhnikov
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionSeonho Park
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodHonglin Yu
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217Sri Ambati
 
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsGilles Louppe
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewAdam Pah
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesPier Luca Lanzi
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandrySri Ambati
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleSajith Edirisinghe
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning Shiraz316
 
Cybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith BarthurCybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith BarthurSri Ambati
 

Destacado (20)

Interpretable machine learning
Interpretable machine learningInterpretable machine learning
Interpretable machine learning
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Applying Machine Learning using H2O
Applying Machine Learning using H2OApplying Machine Learning using H2O
Applying Machine Learning using H2O
 
Q trade presentation
Q trade presentationQ trade presentation
Q trade presentation
 
Tree advanced
Tree advancedTree advanced
Tree advanced
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3
 
Comparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for RegressionComparison Study of Decision Tree Ensembles for Regression
Comparison Study of Decision Tree Ensembles for Regression
 
Introduction to Some Tree based Learning Method
Introduction to Some Tree based Learning MethodIntroduction to Some Tree based Learning Method
Introduction to Some Tree based Learning Method
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217
 
L4. Ensembles of Decision Trees
L4. Ensembles of Decision TreesL4. Ensembles of Decision Trees
L4. Ensembles of Decision Trees
 
Bias-variance decomposition in Random Forests
Bias-variance decomposition in Random ForestsBias-variance decomposition in Random Forests
Bias-variance decomposition in Random Forests
 
Kaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overviewKaggle "Give me some credit" challenge overview
Kaggle "Give me some credit" challenge overview
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Machine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers EnsemblesMachine Learning and Data Mining: 16 Classifiers Ensembles
Machine Learning and Data Mining: 16 Classifiers Ensembles
 
H2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark LandryH2O World - GBM and Random Forest in H2O- Mark Landry
H2O World - GBM and Random Forest in H2O- Mark Landry
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - Kaggle
 
classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning classification_methods-logistic regression Machine Learning
classification_methods-logistic regression Machine Learning
 
Cybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith BarthurCybersecurity with AI - Ashrith Barthur
Cybersecurity with AI - Ashrith Barthur
 
ISAX
ISAXISAX
ISAX
 

Más de Sri Ambati

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxSri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thSri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the WaySri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OSri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersSri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email AgainSri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneySri Ambati
 

Más de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Último (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Dr. Trevor Hastie: Data Science of GBM (October 10, 2013: Presented With H2O)

  • 1. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Ensemble Learners • Classification Trees (Done) • Bagging: Averaging Trees • Random Forests: Cleverer Averaging of Trees • Boosting: Cleverest Averaging of Trees • Details in chaps 10, 15 and 16 Methods for dramatically improving the performance of weak learners such as Trees. Classification trees are adaptive and robust, but do not make good prediction models. The techniques discussed here enhance their performance considerably. 227
  • 2. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Properties of Trees • [ ]Can handle huge datasets • [ ]Can handle mixed predictors—quantitative and qualitative • [ ]Easily ignore redundant variables • [ ]Handle missing data elegantly through surrogate splits • [ ]Small trees are easy to interpret • ["]Large trees are hard to interpret • ["]Prediction performance is often poor 228
  • 3. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting email 600/1536 ch$<0.0555 ch$>0.0555 email 280/1177 spam 48/359 remove<0.06 hp<0.405 hp>0.405 remove>0.06 email 180/1065 spam ch!<0.191 spam george<0.005 george>0.005 email email spam 80/652 0/209 36/123 16/81 free<0.065 free>0.065 email email email spam 77/423 3/229 16/94 9/29 CAPMAX<10.5 business<0.145 CAPMAX>10.5 business>0.145 email email email spam 20/238 57/185 14/89 3/5 receive<0.125 edu<0.045 receive>0.125 edu>0.045 email spam email email 19/236 1/2 48/113 9/72 our<1.2 our>1.2 email spam 37/101 1/12 0/3 CAPAVE<2.7505 CAPAVE>2.7505 email hp<0.03 hp>0.03 email 6/109 email 100/204 80/861 0/22 george<0.15 CAPAVE<2.907 george>0.15 CAPAVE>2.907 ch!>0.191 email email spam 26/337 9/112 spam 19/110 spam 7/227 1999<0.58 1999>0.58 spam 18/109 email 0/1 229
  • 4. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting ROC curve for pruned tree on SPAM data 0.8 1.0 SPAM Data TREE − Error: 8.7% 0.0 0.2 0.4 Sensitivity 0.6 o 0.0 0.2 0.4 0.6 Specificity 0.8 1.0 Overall error rate on test data: 8.7%. ROC curve obtained by varying the threshold c0 of the classifier: C(X) = +1 if Pr(+1|X) > c0 . Sensitivity: proportion of true spam identified Specificity: proportion of true email identified. We may want specificity to be high, and suffer some spam: Specificity : 95% =⇒ Sensitivity : 79% 230
  • 5. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Toy Classification Problem Bayes Error Rate: 0.25 6 4 1 2 0 -2 -4 -6 X2 1 1 1 1 1 1 111 11 1 1 1 1 1 1 0 110111 1 0 1 1 1 1 1 11 0 1 1 11 1 1 1 1 1 1 1 1 1 1 1 11 1 0 1 11 0 1 1 1 0 1 0 1 00 1 11 1 1 1 1 010 010 11 010 1 00 01 1 1 1 1 1 1 1 01 0 1 0 00 00 0 1 1 11 1 0 0 00 0 01011 0 0111 101 1 01 0 0 0 0 0 1 0011 1 11 0 1 1 1 101 1010 0 0 0000010 0 0 1 0 00 1 0 100 0 1 0 0 0 1 0 0 00 01 100 000 011 0 00 0 0 0 1 00 0 001000 000 1 1 1 000 00 1 000 0 11 1 0 1 0 0 1 0 0 0101 0 1 11 1 1 001 01001 000110 000 0111 0 1 1 11 0 00 0100000 000 10 001 0 1 1 1 0 1 1 00 1 0 1 0 1 0001000 1 01 0 00100000011 0 1 00000000001000010 0 1 0 1 0 1 00 00000000000000 00001101 1 00 000 10 1 1 1 1 1 0 0 0 10 0100 0 0 1 00 0 11 0 111 0 00000010010111 101010 01 1 1 0 0 11 10 0010 100 1 0 0 1 0 1 0 01100 0 0 0 101 0 01 1 00 0 0 01 000 01 0 0 1 1 1000000 0 000 1 1 1 00 1 1 0 10 11 111 0000001 1 111 11 1 1 0001 01000000 1 11 1 1 0 0 0 0 011 0 00 0 0 01 1 1 1 0 1 1 1 1 0 1 11 11 1 1111 0 1 1 10 01000 0 1 1 1111 1 1 1 00 1 0 11 0 11 0 0 0 1 0 0 111 11 1 11 1001 0 1 10 01 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 11 1 1 1 10 1 1 1 1 1 1 11 1 1 1 1 1 11 1 11 1 1 1 1 1 1 1 1 1 11 1 11 11 1 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1 1 1 1 1 1 1 -6 -4 -2 • Data X and Y , with Y taking values +1 or −1. 1 1 1 1 1 0 X1 2 4 • Here X =∈ IR2 1 1 1 6 • The black boundary is the Bayes Decision Boundary - the best one can do. Here 25% • Goal: given N training pairs (Xi , Yi ) produce ˆ a classifier C(X) ∈ {−1, 1} • Also estimate the probability of the class labels Pr(Y = +1|X). 231
  • 6. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Toy Example — No Noise, Bayes Error Rate: 0. [Figure: scatterplot of the two classes in the (X1, X2) plane; class 0 occupies an inner disc and class 1 the surrounding region.] • Deterministic problem; noise comes from the sampling distribution of X. • Use a training sample of size 200. • Here the Bayes error is 0%. 232
  • 7. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Classification Tree [Figure: classification tree grown on the toy data, root node 94/200, with splits on x.2 < −1.06711, x.2 < 1.14988, x.1 < 1.13632, x.1 < −0.900735 and further splits; terminal nodes labeled 0 or 1 with misclassification counts.] 233
  • 8. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 234 Decision Boundary: Tree, Error Rate: 0.073. [Figure: the tree's boxy decision boundary overlaid on the training data in the (X1, X2) plane.] • The eight-node tree is boxy, and has a test-error rate of 7.3%. • When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
  • 9. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Model Averaging Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers. • Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote. • Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote. • Random Forests (Breiman 1999): Fancier version of bagging. In general Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree. 235
  • 10. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Bagging • Bagging or bootstrap aggregation averages a noisy fitted function refit to many bootstrap samples, to reduce its variance. Bagging is a poor man's Bayes; see pp 282. • Suppose C(S, x) is a fitted classifier, such as a tree, fit to our training data S, producing a predicted class label at input point x. • To bag C, we draw bootstrap samples S∗1, . . . , S∗B, each of size N, with replacement from the training data. Then Cbag(x) = Majority Vote {C(S∗b, x), b = 1, . . . , B}. • Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However any simple structure in C (e.g. a tree) is lost. 236
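The bagging recipe above is easy to write down directly. A minimal sketch in R, assuming the rpart package and illustrative data frames train and test with a factor response y (not code from the slides):

    library(rpart)

    # Bagging by hand: fit B trees to bootstrap samples and classify by majority vote.
    bag_trees <- function(train, test, B = 200) {
      votes <- matrix(NA_character_, nrow = nrow(test), ncol = B)
      for (b in 1:B) {
        idx  <- sample(nrow(train), replace = TRUE)              # bootstrap sample S*b
        tree <- rpart(y ~ ., data = train[idx, ], method = "class",
                      control = rpart.control(cp = 0, xval = 0)) # grow a large tree, no pruning
        votes[, b] <- as.character(predict(tree, test, type = "class"))
      }
      # majority vote over the B bootstrap trees, one prediction per test row
      apply(votes, 1, function(v) names(which.max(table(v))))
    }

Growing each tree large (cp = 0, no pruning) matches the description on the surrounding slides; keeping the per-tree votes also lets one watch the vote stabilize as B grows.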
  • 11. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting [Figure: the original tree and trees fit to eleven bootstrap samples (b = 1, . . . , 11); the bootstrap trees differ in their splits (e.g. x.1 < 0.395, x.1 < 0.555, x.2 < 0.205, x.2 < 0.285, x.3 < 0.985, x.4 < −1.36), illustrating the variability that bagging averages over.] 237
  • 12. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Decision Boundary: Bagging, Error Rate: 0.032. [Figure: the bagged decision boundary overlaid on the training data in the (X1, X2) plane.] • Bagging reduces error by variance reduction. • Bagging averages many trees, and produces smoother decision boundaries. • Bias is slightly increased because bagged trees are slightly shallower. 238
  • 13. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Random forests • refinement of bagged trees; very popular—chapter 15. • at each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2 p, where p is the number of features. • For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" or OOB error rate. • random forests try to improve on bagging by "de-correlating" the trees, reducing the variance. • Each tree has the same expectation, so increasing the number of trees does not alter the bias of bagging or random forests. 239
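A minimal usage sketch with the randomForest package mentioned later in the slides; train, test and the response y are illustrative names, and mtry is the m discussed above:

    library(randomForest)

    p  <- ncol(train) - 1                      # number of features
    rf <- randomForest(y ~ ., data = train,
                       ntree = 500,            # number of bootstrap trees
                       mtry  = floor(sqrt(p))) # m = sqrt(p), the usual classification default

    rf$err.rate[rf$ntree, "OOB"]               # out-of-bag error after all 500 trees
    pred <- predict(rf, newdata = test)        # majority-vote class predictions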
  • 14. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Wisdom of Crowds [Figure: expected number correct out of 10 versus p, the probability of an informed person being correct; the consensus curve (orange) lies above the individual curve (teal).] • 10 movie categories, 4 choices in each category. • 50 people; for each question, 15 are informed and 35 are guessers. • For informed people, Pr(correct) = p and each incorrect choice has probability (1 − p)/3. • For guessers, all four choices have probability 1/4. Consensus (orange) improves over individuals (teal) even for moderate p. 240
  • 15. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 241 ROC curve for TREE, SVM and Random Forest on SPAM data [Figure: ROC curves (Sensitivity vs. Specificity) for the three classifiers: Random Forest − Error: 4.75%, SVM − Error: 6.7%, TREE − Error: 8.7%.] Random Forest dominates both other methods on the SPAM data — 4.75% error. Used 500 trees with default settings for the randomForest package in R.
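To trace an ROC curve like these, one can vary the threshold c0 on the forest's predicted probabilities; a sketch under the same illustrative names (rf and test as in the earlier sketch, class labels "email"/"spam"):

    # predicted Pr(spam | x) for the test set
    prob <- predict(rf, newdata = test, type = "prob")[, "spam"]

    roc <- t(sapply(seq(0, 1, by = 0.01), function(c0) {
      pred <- ifelse(prob > c0, "spam", "email")
      c(specificity = mean(pred[test$y == "email"] == "email"),
        sensitivity = mean(pred[test$y == "spam"]  == "spam"))
    }))
    plot(roc[, "specificity"], roc[, "sensitivity"], type = "l",
         xlab = "Specificity", ylab = "Sensitivity")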
  • 16. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 242 RF: decorrelation [Figure: correlation between trees versus the number m of randomly selected splitting variables (m = 1 to 46); the correlation increases with m. Based on a simple experiment with random forest regression models.] • Var f̂RF(x) = ρ(x) σ2(x). • The correlation is because we are growing trees to the same data. • The correlation decreases as m decreases—m is the number of variables sampled at each split.
  • 17. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Random Forest Ensemble [Figure: mean squared error, squared bias and variance of the random forest ensemble as functions of m.] • The smaller m, the lower the variance of the random forest ensemble. • Small m is also associated with higher bias, because important variables can be missed by the sampling. 243
  • 18. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 244 Boosting [Schematic: a sequence of classifiers C1(x), C2(x), C3(x), . . . , CM(x), the first fit to the training sample and each subsequent one fit to a reweighted sample.] • Breakthrough invention of Freund & Schapire (1997). • Average many trees, each grown to reweighted versions of the training data. • Weighting decorrelates the trees, by focussing on regions missed by past trees. • Final classifier is a weighted average of classifiers: C(x) = sign[ Σ_{m=1}^M αm Cm(x) ]. Note: Cm(x) ∈ {−1, +1}.
  • 19. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 245 Boosting vs Bagging [Figure: test error versus number of terms (up to 400) for Bagging and AdaBoost, using 100-node trees.] • 2000 points from Nested Spheres in R10. • Bayes error rate is 0%. • Trees are grown best-first without pruning. • Leftmost term is a single tree.
  • 20. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting AdaBoost (Freund & Schapire, 1996) 1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N. 2. For m = 1 to M repeat steps (a)–(d): (a) Fit a classifier Cm(x) to the training data using weights wi. (b) Compute the weighted error of the newest tree: errm = Σ_{i=1}^N wi I(yi ≠ Cm(xi)) / Σ_{i=1}^N wi. (c) Compute αm = log[(1 − errm)/errm]. (d) Update the weights for i = 1, . . . , N: wi ← wi · exp[αm · I(yi ≠ Cm(xi))], and renormalize the wi to sum to 1. 3. Output C(x) = sign[ Σ_{m=1}^M αm Cm(x) ]. 246
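A direct transcription of these steps into R, using rpart stumps as the weak learner; the coding of y as ±1 and all names are illustrative, not the implementation behind the slides:

    library(rpart)

    # AdaBoost with stumps: X a data frame of predictors, y a vector in {-1, +1}
    adaboost <- function(X, y, M = 400) {
      df <- data.frame(X, .y = factor(y))
      N  <- nrow(df); w <- rep(1 / N, N)                       # step 1
      trees <- vector("list", M); alpha <- numeric(M)
      for (m in 1:M) {
        fit  <- rpart(.y ~ ., data = df, weights = w,          # step (a): weighted stump
                      control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
        pred <- as.numeric(as.character(predict(fit, df, type = "class")))
        err  <- sum(w * (pred != y)) / sum(w)                  # step (b): weighted error
        alpha[m] <- log((1 - err) / err)                       # step (c)
        w <- w * exp(alpha[m] * (pred != y)); w <- w / sum(w)  # step (d): up-weight mistakes
        trees[[m]] <- fit
      }
      list(trees = trees, alpha = alpha)
    }

    # step 3: C(x) = sign( sum_m alpha_m * C_m(x) )
    predict_adaboost <- function(fit, newdata) {
      scores <- Reduce(`+`, Map(function(tr, a)
          a * as.numeric(as.character(predict(tr, newdata, type = "class"))),
        fit$trees, fit$alpha))
      sign(scores)
    }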
  • 21. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 247 Boosting Stumps [Figure: test error versus boosting iterations, with a single stump and a 400-node tree shown for reference.] • Boosting stumps works remarkably well on the nested-spheres problem. • A stump is a two-node tree, after a single split.
  • 22. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 248 [Figure: training error (green) and test error (red) versus number of terms, up to 600.] • Nested spheres in 10-Dimensions. • Bayes error is 0%. • Boosting drives the training error to zero (green curve). • Further iterations continue to improve the test error in many examples (red curve).
  • 23. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting 249 Noisy Problems [Figure: training and test error versus number of terms, with the Bayes error marked at 0.25.] • Nested Gaussians in 10-Dimensions. • Bayes error is 25%. • Boosting with stumps. • Here the test error does increase, but quite slowly.
  • 24. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Stagewise Additive Modeling Boosting builds an additive model F(x) = Σ_{m=1}^M βm b(x; γm). Here b(x; γm) is a tree, and γm parametrizes the splits. We do things like that in statistics all the time! • GAMs: F(x) = Σ_j fj(xj). • Basis expansions: F(x) = Σ_{m=1}^M θm hm(x). Traditionally the parameters fj, θm are fit jointly (i.e. by least squares or maximum likelihood). With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and overfits less quickly. 250
  • 25. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Example: Least Squares Boosting 1. Start with the function F0(x) = 0 and residual r = y; set m = 0. 2. m ← m + 1. 3. Fit a CART regression tree to r, giving g(x). 4. Set fm(x) ← ε · g(x). 5. Set Fm(x) ← Fm−1(x) + fm(x) and r ← r − fm(x), and repeat steps 2–5 many times. • g(x) is typically a shallow tree, e.g. constrained to have only k = 3 splits (depth). • Stagewise fitting (no backfitting!) slows down the (over-)fitting process. • 0 < ε ≤ 1 is a shrinkage parameter; it slows overfitting even further. 251
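The same five steps in R, again with rpart as the tree fitter; eps is the shrinkage parameter ε and depth plays the role of the number of splits k, all names illustrative:

    library(rpart)

    # Least squares boosting: fit shallow trees to the current residual, shrunk by eps.
    ls_boost <- function(X, y, M = 500, eps = 0.01, depth = 3) {
      r <- y                                             # F_0(x) = 0, so the residual starts at y
      trees <- vector("list", M)
      for (m in 1:M) {
        g <- rpart(.r ~ ., data = data.frame(X, .r = r), # step 3: tree fit to the residual
                   control = rpart.control(maxdepth = depth, cp = 0, xval = 0))
        f_m <- eps * predict(g, data.frame(X))           # step 4: f_m(x) = eps * g(x)
        r   <- r - f_m                                   # step 5: update the residual
        trees[[m]] <- g
      }
      trees
    }

    # F_M(x) = sum_m f_m(x) = eps * sum_m g_m(x)
    predict_ls_boost <- function(trees, newX, eps = 0.01) {
      Reduce(`+`, lapply(trees, function(g) eps * predict(g, data.frame(newX))))
    }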
  • 26. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Least Squares Boosting and ε [Figure: prediction error versus number of steps m, for a single tree and for boosting with ε = 1 and ε = 0.01.] • If instead of trees we use linear regression terms, boosting with small ε is very similar to the lasso! 252
  • 27. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting AdaBoost: Stagewise Modeling • AdaBoost builds an additive logistic regression model F(x) = log[ Pr(Y = 1|x) / Pr(Y = −1|x) ] = Σ_{m=1}^M αm fm(x) by stagewise fitting using the loss function L(y, F(x)) = exp(−y F(x)). • Instead of fitting trees to residuals, the special form of the exponential loss leads to fitting trees to weighted versions of the original data. See pp 343. 253
  • 28. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Why Exponential Loss? [Figure: loss as a function of the margin y · F for misclassification, exponential, binomial deviance, squared error and support vector losses.] • e−yF(x) is a monotone, smooth upper bound on misclassification loss at x. • Leads to a simple reweighting scheme. • Has the logit transform as population minimizer: F∗(x) = ½ log[ Pr(Y = 1|x) / Pr(Y = −1|x) ]. • y · F is the margin; a positive margin means correct classification. • Other more robust loss functions, like binomial deviance, can be used. 254
  • 29. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting General Boosting Algorithms The boosting paradigm can be extended to general loss functions — not only squared error or exponential. 1. Initialize F0(x) = 0. 2. For m = 1 to M: (a) Compute (βm, γm) = argmin_{β,γ} Σ_{i=1}^N L(yi, Fm−1(xi) + β b(xi; γ)). (b) Set Fm(x) = Fm−1(x) + βm b(x; γm). Sometimes we replace step (b) in item 2 by (b∗) Set Fm(x) = Fm−1(x) + ε βm b(x; γm). Here ε is a shrinkage factor; in the gbm package in R, the default is ε = 0.001. Shrinkage slows the stagewise model-building even more, and typically leads to better performance. 255
  • 30. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Gradient Boosting • General boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling. • Gradient Boosting builds additive tree models, for example for representing the logits in logistic regression. • Tree size is a parameter that determines the order of interaction (next slide). • Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance. • Gradient Boosting is described in detail in section 10.10, and is implemented in Greg Ridgeway's gbm package in R. 256
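A minimal usage sketch for the gbm package on a binary outcome; the data frame names and tuning values are illustrative, not the exact settings behind the spam results that follow:

    library(gbm)

    fit <- gbm(y ~ ., data = train,
               distribution = "bernoulli",   # binomial deviance for a 0/1 response
               n.trees = 2500,               # number of boosting iterations M
               interaction.depth = 5,        # tree size, i.e. the interaction order
               shrinkage = 0.01)             # the shrinkage factor eps

    # predicted probabilities Pr(y = 1 | x) using all 2500 trees
    phat <- predict(fit, newdata = test, n.trees = 2500, type = "response")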
  • 31. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Spam Data [Figure: test error versus number of trees (up to 2500) for Bagging, Random Forest and Gradient Boosting (5 Node).] 257
  • 32. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Spam Example Results With 3000 training and 1500 test observations, Gradient Boosting fits an additive logistic model f(x) = log[ Pr(spam|x) / Pr(email|x) ] using trees with J = 6 terminal nodes. Gradient Boosting achieves a test error of 4.5%, compared to 5.3% for an additive GAM, 4.75% for Random Forests, and 8.7% for CART. 258
  • 33. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Spam: Variable Importance [Figure: relative importance (0–100) of the predictors in the boosted spam model; the most important variables are !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, our, while variables such as 3d, addresses, labs, telnet, 857, 415, direct, cs and table matter least.] 259
  • 34. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Spam: Partial Dependence [Figure: partial dependence of the fitted log-odds of spam on four of the most important predictors: !, remove, edu and hp.] 260
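Plots like the last two slides come almost for free from a fitted gbm object; a sketch assuming the illustrative fit from the earlier gbm example and spam-style variable names:

    # relative variable importance, the basis of the importance bar chart
    imp <- summary(fit, n.trees = 2500, plotit = FALSE)
    head(imp)

    # partial dependence of the fitted log-odds on single predictors
    plot(fit, i.var = "remove", n.trees = 2500)
    plot(fit, i.var = "hp",     n.trees = 2500)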
  • 35. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting California Housing Data [Figure: test average absolute error versus number of trees (up to 1000) for RF with m = 2, RF with m = 6, GBM with depth 4 and GBM with depth 6.] 261
  • 36. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Tree Size [Figure: test error versus number of terms for AdaBoost with stumps, 10-node trees and 100-node trees.] The tree size J determines the interaction order of the model: η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · · 262
  • 37. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Stumps win! Since the true decision boundary is the surface of a sphere, the function that describes it has the form f(X) = X1² + X2² + . . . + Xp² − c, with the boundary at f(X) = 0. Boosted stumps via Gradient Boosting return reasonable approximations to these quadratic functions. [Figure: the fitted coordinate functions f1(x1), . . . , f10(x10) for the additive logistic trees.] 263
  • 38. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Learning Ensembles Ensemble learning consists of two steps: 1. Construction of a dictionary D = {T1(X), T2(X), . . . , TM(X)} of basis elements (weak learners) Tm(X). 2. Fitting a model f(X) = Σ_{m∈D} αm Tm(X). Simple examples of ensembles • Linear regression: The ensemble consists of the coordinate functions Tm(X) = Xm. The fitting is done by least squares. • Random Forests: The ensemble consists of trees grown to bootstrapped versions of the data, with additional randomization at each split. The fitting simply averages. • Gradient Boosting: The ensemble is grown in an adaptive fashion, but then simply averaged at the end. 264
  • 39. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Learning Ensembles and the Lasso Proposed by Popescu and Friedman (2004): Fit the model f(X) = Σ_{m∈D} αm Tm(X) in step (2) using lasso regularization: α̂(λ) = argmin_α Σ_{i=1}^N L[yi, α0 + Σ_{m=1}^M αm Tm(xi)] + λ Σ_{m=1}^M |αm|. Advantages: • Often ensembles are large, involving thousands of small trees. The post-processing selects a smaller subset of these and combines them efficiently. • Often the post-processing improves the process that generated the ensemble. 265
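A minimal sketch of the post-processing idea, assuming the randomForest and glmnet packages and a two-class spam-style problem (here with binomial deviance as the loss L); the dictionary is the set of individual forest trees, and the lasso picks a sparse weighted combination of them:

    library(randomForest)
    library(glmnet)

    # Step 1: the dictionary -- per-tree predictions of a fitted random forest
    rf <- randomForest(y ~ ., data = train, ntree = 500)
    Ttrain <- predict(rf, newdata = train, predict.all = TRUE)$individual  # N x 500 labels
    Ttest  <- predict(rf, newdata = test,  predict.all = TRUE)$individual

    # recode each tree's class labels to +/-1 so column m is the basis function T_m(x)
    Ttrain <- ifelse(Ttrain == "spam", 1, -1)
    Ttest  <- ifelse(Ttest  == "spam", 1, -1)

    # Step 2: lasso-regularized logistic regression chooses and weights a subset of trees
    cvfit <- cv.glmnet(Ttrain, train$y, family = "binomial", alpha = 1)
    phat  <- predict(cvfit, newx = Ttest, s = "lambda.min", type = "response")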
  • 40. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Post-Processing Random Forest Ensembles [Figure: Spam data test error versus number of trees (up to 500) for Random Forest, the post-processed Random Forest (5%, 6), and Gradient Boosting with 5-node trees.] 266
  • 41. SLDM III c Hastie & Tibshirani - October 9, 2013 Random Forests and Boosting Software • R: free GPL statistical computing environment available from CRAN; implements the S language. Includes: – randomForest: implementation of Leo Breiman's algorithms. – rpart: Terry Therneau's implementation of classification and regression trees. – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm. • Salford Systems: Commercial implementation of trees, random forests and gradient boosting. • Splus (Insightful): Commercial version of S. • Weka: GPL software from the University of Waikato, New Zealand. Includes Trees, Random Forests and many other procedures. 267