Why Ensembles Win Data Mining Competitions

A Predictive Analytics Center of Excellence (PACE) Tech Talk
November 14, 2012

Dean Abbott
Abbott Analytics, Inc.
Blog: http://abbottanalytics.blogspot.com
URL: http://www.abbottanalytics.com
Twitter: @deanabb
Email: dean@abbottanalytics.com
Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved.
Outline

  Motivation for Ensembles
  How Ensembles are Built
  Do Ensembles Violate Occam's Razor?
  Why Do Ensembles Win?
PAKDD Cup 2007 Results: Score Metric Changes Winner

[Slide annotation: an "Ensembles" callout points to the combined-technique entries near the top of the table.]

| Modeling Technique | Implementation | Participant Location | Affiliation Type | AUCROC (Trapezoidal Rule) | AUCROC Rank | Top Decile Response Rate | Top Decile Response Rate Rank |
|---|---|---|---|---|---|---|---|
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |
Netflix Prize

  2006 Netflix state of the art (Cinematch): RMSE = 0.9525
  Prize: reduce this RMSE by 10% => 0.8572
  2007: Korbell team, Progress Prize winner
    –  107-algorithm ensemble
    –  Top algorithm: SVD, with RMSE = 0.8914
    –  2nd algorithm: Restricted Boltzmann Machine, with RMSE = 0.8990
    –  Mini-ensemble (SVD + RBM) has RMSE = 0.88

  http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
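To make the blending idea concrete, here is a minimal Python sketch of combining two models' predictions with a single blend weight chosen on held-out data. The arrays are simulated stand-ins (the Korbell team's actual blending was far more elaborate); only the mechanics of weighted averaging and RMSE scoring are illustrated.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error, the Netflix Prize metric."""
    return np.sqrt(np.mean((pred - actual) ** 2))

# Simulated held-out data: true ratings plus each model's (noisy) predictions.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=10_000).astype(float)
svd_pred = ratings + rng.normal(0, 0.9, size=ratings.size)   # stand-in for SVD output
rbm_pred = ratings + rng.normal(0, 0.9, size=ratings.size)   # stand-in for RBM output

# Grid-search a single blend weight w on the held-out data.
weights = np.linspace(0, 1, 101)
scores = [rmse(w * svd_pred + (1 - w) * rbm_pred, ratings) for w in weights]
best_w = weights[int(np.argmin(scores))]
print(f"best weight = {best_w:.2f}, blended RMSE = {min(scores):.4f}")
```

Because the two simulated models make independent errors, even a naive 50/50 blend scores noticeably better than either model alone, which is the effect the SVD+RBM mini-ensemble exploits.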
Common Kinds of Ensembles vs. Single Models

  [Diagram: taxonomy contrasting ensemble methods with single classifiers]

  From Zhuowen Tu, "Ensemble Classification Methods: Bagging, Boosting, and Random Forests"
What are Model Ensembles?

  Combining outputs from multiple models into a single decision
  Models can be created using the same algorithm, or several different algorithms

  [Diagram: several models feed their predictions into decision logic, which produces the ensemble prediction]
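As a minimal illustration of "several different algorithms" feeding one decision, here is a sketch using scikit-learn's VotingClassifier on synthetic data; the three component algorithms and all parameter choices are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Three different algorithms; "soft" voting averages their predicted probabilities,
# and the decision logic is simply argmax over the averaged probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=15)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```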
Creating Model Ensembles Step 1: Generate Component Models

  Start from a single data set; vary the data or the model parameters (sketch below):
    Case (record) weights: bootstrapping, sampling
    Data values: add noise, recode data
    Learning parameters: vary learning rates, pruning severity, random seeds
    Variable subsets: vary candidate inputs, features
  The result is multiple models and predictions.
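A minimal sketch of Step 1, assuming scikit-learn and synthetic data: each component tree sees a bootstrap resample (varied case weights), a random subset of inputs (varied variable subsets), and different depth and seed settings (varied learning parameters).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
n, rng = len(y), np.random.default_rng(0)

models = []
for b in range(25):
    idx = rng.integers(0, n, n)                            # vary case weights: bootstrap resample
    keep = rng.choice(X.shape[1], size=12, replace=False)  # vary variable subsets
    tree = DecisionTreeClassifier(
        max_depth=int(rng.integers(3, 8)),                 # vary learning parameters
        random_state=b,                                    # vary random seeds
    ).fit(X[idx][:, keep], y[idx])
    models.append((keep, tree))

# Each component model now produces its own prediction for the same records.
preds = np.array([t.predict(X[:, keep]) for keep, t in models])
print(preds.shape)   # (25 models, 2000 predictions)
```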
Creating Model Ensembles Step 2: Combining Models

  Combining methods (sketch below)
    –  Estimation: average outputs
    –  Classification: average probabilities or vote (best M of N)
  Variance reduction
    –  Build complex, overfit models
    –  All models built in the same manner
  Bias reduction
    –  Build simple models
    –  Subsequent models weight records with errors more (or model the actual errors)

  [Diagram: the multiple models and predictions are combined into a single decision or prediction value]
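A minimal numpy sketch of the combining methods, assuming a hypothetical matrix P of each model's predicted probabilities on the same cases:

```python
import numpy as np

# P: hypothetical (n_models, n_cases) array of each model's predicted P(target = 1)
rng = np.random.default_rng(0)
P = rng.uniform(0, 1, size=(7, 10))

# Estimation / soft classification: average the model outputs
avg_prob = P.mean(axis=0)
avg_decision = (avg_prob >= 0.5).astype(int)

# Voting: each model casts a 0/1 vote; require at least M of the N models to agree
votes = (P >= 0.5).sum(axis=0)
M = 4                                  # the "best M of N" threshold, here a simple majority of 7
vote_decision = (votes >= M).astype(int)
print(avg_decision, vote_decision)
```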
How Model Complexity Affects Errors

  [Figure: error vs. model complexity, from the reference below]

  Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
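For reference, the standard bias-variance decomposition that underlies charts like this one: with $y = f(x) + \varepsilon$ and $\operatorname{Var}(\varepsilon) = \sigma^2$, the expected squared error of a fitted model $\hat{f}$ splits into three parts,

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

Growing model complexity typically trades bias down for variance up; ensembles attack one term or the other (variance for bagging-style methods, bias for boosting-style methods), as the later slides describe.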
Commonly Used Information-Theoretic Complexity Penalties

  BIC: Bayesian Information Criterion
  AIC: Akaike Information Criterion
  MDL: Minimum Description Length

  For a nice summary:
  http://en.wikipedia.org/wiki/Regularization_(mathematics)


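For reference, the first two penalties have simple closed forms (with $k$ fitted parameters, $n$ training cases, and maximized likelihood $\hat{L}$); both are minimized over candidate models:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

MDL instead selects the model minimizing the total description length: the bits needed to encode the model plus the bits needed to encode the data given the model.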
Four Keys to Effective Ensembling

  Diversity of opinion
  Independence
  Decentralization
  Aggregation

  From The Wisdom of Crowds, James Surowiecki
Bagging

  Bagging method (sketch below)
    –  Create many data sets by bootstrapping (can also do this with cross-validation)
    –  Create one decision tree for each data set
    –  Combine the decision trees by averaging (or voting) their final decisions
    –  Primarily reduces model variance rather than bias
  Results
    –  On average, better than any individual tree
    –  Final answer is the average of the trees
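A minimal sketch of the bagging recipe above, assuming scikit-learn and synthetic data (scikit-learn's BaggingClassifier packages the same loop):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
n, rng = len(y_tr), np.random.default_rng(0)

# Bagging: one deep (high-variance) tree per bootstrap sample
trees = []
for b in range(100):
    idx = rng.integers(0, n, n)          # bootstrap: sample n rows with replacement
    trees.append(DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx]))

# Final answer: average the trees' class-1 probabilities (voting works similarly)
avg_prob = np.mean([t.predict_proba(X_te)[:, 1] for t in trees], axis=0)
bag_acc = np.mean((avg_prob >= 0.5) == y_te)
one_tree_acc = np.mean(trees[0].predict(X_te) == y_te)
print(f"single tree: {one_tree_acc:.3f}  bagged: {bag_acc:.3f}")
```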
Boosting (AdaBoost)

  Boosting method (sketch below)
    –  Create a tree using the training data set
    –  Score each data point, indicating where an incorrect decision is made (errors)
    –  Retrain, giving rows with incorrect decisions more weight; repeat
    –  The final prediction is a weighted average of all the models, a form of model regularization
    –  Best to create weak models: simple models (just a few splits for a decision tree), and let the boosting iterations find the complexity
    –  Often used with trees or Naïve Bayes
  Results
    –  Usually better than an individual tree or Bagging

  [Diagram: reweight examples where classification is incorrect; combine models via a weighted sum]
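A minimal sketch of discrete AdaBoost with decision stumps, assuming scikit-learn and synthetic data; labels are recoded to ±1 so the weight updates take their usual form:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=2000, n_features=20, random_state=0)
y = 2 * y01 - 1                      # AdaBoost math is cleanest with labels in {-1, +1}
n = len(y)

w = np.full(n, 1.0 / n)              # start with uniform case weights
stumps, alphas = [], []
for m in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # weak model
    pred = stump.predict(X)
    err = max(w[pred != y].sum(), 1e-10)       # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)      # model weight: accurate models count more
    w *= np.exp(-alpha * y * pred)             # up-weight the misclassified rows
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of the weak models
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```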
Random Forest Ensembles

  Random Forest (RF) method (sketch below)
    –  Exactly the same methodology as Bagging, but with a twist
    –  At each split, rather than using the entire set of candidate inputs, use a random subset of the candidate inputs
    –  Generates diversity in both samples and inputs (splits)
  Results
    –  On average, better than any individual tree, Bagging, or even Boosting
    –  Final answer is the average of the trees

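A minimal scikit-learn sketch; max_features is the parameter that implements the random-subset-per-split twist (all other settings are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,       # many bagged trees
    max_features="sqrt",    # random subset of candidate inputs at each split
    oob_score=True,         # out-of-bag accuracy estimate from the bootstrap samples
    random_state=0,
).fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)
```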
Stochastic Gradient Boosting

  Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)
  Algorithm (sketch below)
    –  Begin with a simple model: a constant value
    –  Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible levels, whereas before there was one
    –  Score the model and compute errors; the score is the sum of all previous trees, weighted by a learning rate
    –  Build a new tree with the errors as the target variable, and repeat
  Results
    –  TreeNet has won 2 KDD-Cup competitions and numerous others
    –  It is less prone to outliers and overfit than AdaBoost

  [Diagram: each new tree predicts the errors of the ensemble so far; the models combine via a weighted sum into the final additive model]
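A minimal sketch of the algorithm for squared-error loss, assuming scikit-learn and synthetic regression data; the random half-sample per iteration is the "stochastic" part (TreeNet/MART add further refinements):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
n, rng, lr = len(y), np.random.default_rng(0), 0.1

F = np.full(n, y.mean())             # begin with a constant model
trees = []
for m in range(200):
    idx = rng.choice(n, n // 2, replace=False)   # "stochastic": fit on a random half
    tree = DecisionTreeRegressor(max_leaf_nodes=6, random_state=m)
    tree.fit(X[idx], (y - F)[idx])               # errors so far are the target variable
    F += lr * tree.predict(X)                    # score = learning-rate-weighted sum of trees
    trees.append(tree)

def predict(X_new):
    """Additive model: the constant plus the weighted sum of all the small trees."""
    return y.mean() + lr * sum(t.predict(X_new) for t in trees)

print("training RMSE:", np.sqrt(np.mean((predict(X) - y) ** 2)))
```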
Ensembles of Trees: Smoothers

  Ensembles smooth jagged decision boundaries

  [Figures: single-tree vs. ensemble decision boundaries]
  Pictures from T.G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems, Cagliari, Italy, 2000.

Heterogeneous Model Ensembles on Glass Data

  [Chart: percent classification error (max, min, and average) versus the number of models combined, from 1 to 6]

  Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, Regression, k-NN
  Combining 3-5 models is on average better than the best single model
  Combining all 6 models is not best (the best is a 3- or 4-model combination), but it is close
  This is an example of reducing model variance through ensembles, but not model bias



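A small simulation in the spirit of this experiment, assuming six hypothetical classifiers with independent errors; it reproduces the max/min/average error curves by exhaustively voting over every k-model subset:

```python
import numpy as np
from itertools import combinations

def vote_error(pred_list, y_true):
    """Classification error of a majority vote over 0/1 prediction arrays."""
    return np.mean((np.mean(pred_list, axis=0) >= 0.5) != y_true)

# Hypothetical: preds maps each algorithm's name to its 0/1 predictions on a held-out set.
# Each simulated model is ~80% accurate, with errors independent of the others.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
preds = {name: np.where(rng.random(500) < 0.8, y_true, 1 - y_true)
         for name in ["tree", "NN", "RBF", "Gaussian", "Regression", "k-NN"]}

for k in range(1, len(preds) + 1):
    errs = [vote_error([preds[m] for m in s], y_true)
            for s in combinations(preds, k)]
    print(f"{k} models: min={min(errs):.3f}  avg={np.mean(errs):.3f}  max={max(errs):.3f}")
```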
Direct Marketing Example: Considerations for I-Miner

  From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.

  [Screenshot: I-Miner stream that combines the individual model scores]

  Steps (sketch below):
  1.  Join by record: all models applied to the same data, in the same row order
  2.  Change the probability names
  3.  Average the probabilities; the decision is avg_prob > threshold
  4.  Decile the probability ranks
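A minimal pandas sketch of the four steps, assuming hypothetical per-model probability columns already joined by record (the column names, threshold, and data are illustrative, not from the original stream):

```python
import numpy as np
import pandas as pd

# Hypothetical per-model scores, already joined by record (same data, same row order),
# with the probability columns renamed to a common pattern (steps 1 and 2).
rng = np.random.default_rng(0)
scores = pd.DataFrame({f"prob_model_{i}": rng.uniform(0, 1, 1000) for i in range(1, 11)})

threshold = 0.5
scores["avg_prob"] = scores.mean(axis=1)               # step 3: average the probabilities
scores["decision"] = scores["avg_prob"] > threshold    # decision is avg_prob > threshold
scores["decile_rank"] = pd.qcut(scores["avg_prob"],    # step 4: decile the probabilities
                                10, labels=False) + 1
print(scores.head())
```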
Direct Marketing Example: Variable Inclusion in Model Ensembles

  Twenty-five different variables were represented in the ten models
  Only five were represented in seven or more models
  Twelve were represented in only one or two models

  [Table: number of models sharing common variables (# models vs. # variables)]

  From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Fraud Detection Example: Deployment Stream

  [Screenshot: deployment stream]

  Model scoring picks up scores from each model, combines them in an ensemble, and pushes the scores back to the database.




Fraud Detection Example: Overall Model Score on Validation Data

  [Bar chart: normalized total score (from the validation population) for ten individual models and for several combinations, including the best and worst 5 on testing, the average of all models, and the ensemble]

  The "Score" weights false alarms and sensitivity.
  Overall, the ensemble is clearly best, and much better than the best single model on testing data.

  From Abbott, D., and Tom Konchan, "Advanced Fraud Detection Techniques for Vendor Payments", Predictive Analytics Summit, San Diego, CA, February 24, 2011.
Are Ensembles Better?

  Accuracy? Yes
  Interpretability? No
  Do ensembles contradict Occam's Razor?
    –  The principle: simpler models generalize better; avoid overfit!
    –  Ensembles are more complex than single models (an RF may have hundreds of trees in the ensemble)
    –  Yet these more complex models perform better on held-out data
    –  But... are they really more complex?
Generalized Degrees of Freedom

  Linear regression: a degree of freedom in the model is simply a parameter
    –  This does not extrapolate to non-linear methods
    –  The number of "parameters" in non-linear methods can produce more complexity, or less
  Enter... Generalized Degrees of Freedom (GDF)
    –  GDF (Ye 1998) "randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates" (for the same number of parameters)
The Math of GDF

  [Equations reproduced as an image on the original slide; a sketch follows]

  From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)


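As a hedged sketch of Ye's (1998) definition (the original slide's equations were an image): GDF sums the sensitivity of each fitted value $\hat{y}_i$ to its own target $y_i$,

```latex
\mathrm{GDF} \;=\; \sum_{i=1}^{n} \frac{\partial\, \mathbb{E}[\hat{y}_i]}{\partial y_i}
```

and it is estimated by Monte Carlo: perturb the targets with small random noise, re-run the entire modeling procedure, and take the slope of each refitted value against its perturbation. For linear regression this recovers the usual parameter count; for flexible procedures it can be far larger or, surprisingly, smaller.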
The Effect of GDF

  [Figure: GDF results from the reference below]

  From Elder, J.F. IV, "The Generalization Paradox of Ensembles", Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853-864
Why Ensembles Win

  Performance, performance, performance
  Single models sometimes provide insufficient accuracy
    –  Neural networks become stuck in local minima
    –  Decision trees run out of data, and are greedy: they can get fooled early
    –  Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of thinking outside of their box
  Different algorithms, or algorithms built using resampled data, achieve the same level of accuracy but on different cases: they identify different ways to get the same level of accuracy

Conclusion

  Ensembles can achieve significant model performance improvements
  The key to good ensembles is diversity in sampling and variable selection
  Ensembling can be applied to a single algorithm, or across multiple algorithms
  Just do it!

References

  Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)

  Elder, J.F. IV, "The Generalization Paradox of Ensembles", Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853-864. DOI: 10.1198/1061860032733

  Abbott, D.W., "The Benefits of Creating Ensembles of Classifiers", Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php

  Abbott, D.W., "A Comparison of Algorithms at PAKDD2007", blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html


References (continued)

  Tu, Zhuowen, "Ensemble Classification Methods: Bagging, Boosting, and Random Forests", http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf

  Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120-131.





Más contenido relacionado

Similar a PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions

Barnan Das PhD Preliminary Exam
Barnan Das PhD Preliminary ExamBarnan Das PhD Preliminary Exam
Barnan Das PhD Preliminary ExamBarnan Das
 
CDC Factory Overview
CDC Factory OverviewCDC Factory Overview
CDC Factory OverviewSachin Jain
 
monsanto 08-23-05a
monsanto 08-23-05amonsanto 08-23-05a
monsanto 08-23-05afinance28
 
Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...
Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...
Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...Brian Bissett
 
Pp 5.1 standohyd competition comparison incl drf
Pp 5.1 standohyd competition comparison incl drfPp 5.1 standohyd competition comparison incl drf
Pp 5.1 standohyd competition comparison incl drfSabet Milhaeil
 
Btf exhibitors presentation, atlantis, 5 25-12
Btf exhibitors presentation, atlantis, 5 25-12Btf exhibitors presentation, atlantis, 5 25-12
Btf exhibitors presentation, atlantis, 5 25-12pmcurran1
 
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...Fwdays
 
Fusesource camel-persistence-part1-webinar-charles-moulliard
Fusesource camel-persistence-part1-webinar-charles-moulliardFusesource camel-persistence-part1-webinar-charles-moulliard
Fusesource camel-persistence-part1-webinar-charles-moulliardCharles Moulliard
 
Deutsche Bank Investor Tour Presentation
	 Deutsche Bank Investor Tour Presentation	 Deutsche Bank Investor Tour Presentation
Deutsche Bank Investor Tour Presentationfinance2
 
A fast implementation of matrix-matrix product in double-double precision on ...
A fast implementation of matrix-matrix product in double-double precision on ...A fast implementation of matrix-matrix product in double-double precision on ...
A fast implementation of matrix-matrix product in double-double precision on ...Maho Nakata
 
Scorm標準介紹
Scorm標準介紹Scorm標準介紹
Scorm標準介紹guestd2f047
 
26 a6 emc europe - arnaud christoffel
26 a6   emc europe - arnaud christoffel26 a6   emc europe - arnaud christoffel
26 a6 emc europe - arnaud christoffelScott Adams
 
Linux Power Management Slideshare
Linux Power Management SlideshareLinux Power Management Slideshare
Linux Power Management SlidesharePatrick Bellasi
 

Similar a PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions (14)

Barnan Das PhD Preliminary Exam
Barnan Das PhD Preliminary ExamBarnan Das PhD Preliminary Exam
Barnan Das PhD Preliminary Exam
 
CDC Factory Overview
CDC Factory OverviewCDC Factory Overview
CDC Factory Overview
 
monsanto 08-23-05a
monsanto 08-23-05amonsanto 08-23-05a
monsanto 08-23-05a
 
Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...
Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...
Bio-IT World 2009: Adjusting Information Flow from In-house HTS to Global Out...
 
Pp 5.1 standohyd competition comparison incl drf
Pp 5.1 standohyd competition comparison incl drfPp 5.1 standohyd competition comparison incl drf
Pp 5.1 standohyd competition comparison incl drf
 
Btf exhibitors presentation, atlantis, 5 25-12
Btf exhibitors presentation, atlantis, 5 25-12Btf exhibitors presentation, atlantis, 5 25-12
Btf exhibitors presentation, atlantis, 5 25-12
 
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
Anna Vergeles, Nataliia Manakova "Unsupervised Real-Time Stream-Based Novelty...
 
Fusesource camel-persistence-part1-webinar-charles-moulliard
Fusesource camel-persistence-part1-webinar-charles-moulliardFusesource camel-persistence-part1-webinar-charles-moulliard
Fusesource camel-persistence-part1-webinar-charles-moulliard
 
Deutsche Bank Investor Tour Presentation
	 Deutsche Bank Investor Tour Presentation	 Deutsche Bank Investor Tour Presentation
Deutsche Bank Investor Tour Presentation
 
A fast implementation of matrix-matrix product in double-double precision on ...
A fast implementation of matrix-matrix product in double-double precision on ...A fast implementation of matrix-matrix product in double-double precision on ...
A fast implementation of matrix-matrix product in double-double precision on ...
 
Scorm標準介紹
Scorm標準介紹Scorm標準介紹
Scorm標準介紹
 
Improvement e13 link
Improvement e13 linkImprovement e13 link
Improvement e13 link
 
26 a6 emc europe - arnaud christoffel
26 a6   emc europe - arnaud christoffel26 a6   emc europe - arnaud christoffel
26 a6 emc europe - arnaud christoffel
 
Linux Power Management Slideshare
Linux Power Management SlideshareLinux Power Management Slideshare
Linux Power Management Slideshare
 

Último

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Último (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions

  • 1. Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL: http://www.abbottanalytics.com Twitter: @deanabb Email: dean@abbottanalytics.com Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved. 1
  • 2. Outline   Motivation for Ensembles   How Ensembles are Built   Do Ensembles Violate Occams Razor?   Why Do Ensembles Win? Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved. 2
  • 3. PAKDD Cup 2007 Results: Score Metric Changes Winner Par4cipant   AUCROC   AUCROC   Top  Decile   Top  Decile   Modeling   Par4cipant  Affilia4on   Modeling  Technique   Affilia4on  Type  -­‐ (Trapezoid (Trapezoidal  Rule)   Response  Rate   Response   Implementa4on  -­‐>   Loca4on  -­‐>   >   al  Rule)-­‐>   Rank  -­‐>   -­‐>   Rate  Rank  -­‐>   Ensembles TreeNet  +  Logis-c  Regression   Salford  Systems   Mainland  China   Prac--oner   70.01%   1   13.00%   7   Probit  Regression   SAS   USA   Prac--oner   69.99%   2   13.13%   6   MLP  +  n-­‐Tuple  Classifier   Brazil   Prac--oner   69.62%   3   13.88%   1   TreeNet   Salford  Systems   USA   Prac--oner   69.61%   4   13.25%   4   TreeNet   Salford  Systems   Mainland  China   Prac--oner   69.42%   5   13.50%   2   Ridge  Regression   Rank   Belgium   Prac--oner   69.28%   6   12.88%   9   2-­‐Layer  Linear  Regression   USA   Prac--oner   69.14%   7   12.88%   9   Logis-c  Regression  +  Decision  Stump  +  AdaBoost  +  VFI   Mainland  China   Academia   69.10%   8   13.25%   4   Logis-c  Average  of  Single  Decision  Func-ons   Australia   Prac--oner   68.85%   9   12.13%   17   Logis-c  Regression   Weka   Singapore   Academia   68.69%   10   12.38%   16   Logis-c  Regression   Mainland  China   Prac--oner   68.58%   11   12.88%   9   Decision  Tree  +  Neural  Network  +  Logis-c  Regression   Singapore   68.54%   12   13.00%   7   Scorecard  Linear  Addi-ve  Model   Xeno   USA   Prac--oner   68.28%   13   11.75%   20   Random  Forest   Weka   USA   68.04%   14   12.50%   14   Expanding  Regression  Tree  +  RankBoost  +  Bagging   Weka   Mainland  China   Academia   68.02%   15   12.50%   14   SAS  +  Salford   Logis-c  Regression   Systems   India   Prac--oner   67.58%   16   12.00%   19   J48  +  BayesNet   Weka   Mainland  China   Academia   67.56%   17   11.63%   21   Neural  Network  +  General  Addi-ve  Model   Tiberius   USA   Prac--oner   67.54%   18   11.63%   21   Decision  Tree  +  Neural  Network   Mainland  China   Academia   67.50%   19   12.88%   9   Decision  Tree  +  Neural  Network  +  Logis-c  Regression   SAS   USA   Academia   66.71%   20   13.50%   2   Neural  Network   SAS   USA   Academia   66.36%   21   12.13%   17   Decision  Tree  +  Neural  Network  +  Logis-c  Regression   SAS   USA   Academia   65.95%   22   11.63%   21   Neural  Network   SAS   USA   Academia   65.69%   23   9.25%   32   Mul--­‐dimension  Balanced  Random  Forest   Mainland  China   Academia   65.42%   24   12.63%   13   Neural  Network   SAS   USA   Academia   65.28%   25   11.00%   26   CHAID  Decision  Tree   SPSS   Argen-na   Academia   64.53%   26   11.25%   24   Under-­‐Sampling  Based  on  Clustering  +  CART  Decision  Tree   Taiwan   Academia   64.45%   27   11.13%   25   Decision  Tree  +  Neural  Network  +  Polynomial  Regression  SAS   USA   Academia   64.26%   28   9.38%   30   Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved. 3
  • 4. Netflix Prize
    – 2006: Netflix's state-of-the-art recommender (Cinematch) had RMSE = 0.9525
    – Prize: reduce this RMSE by 10%, to 0.8572
    – 2007: the Korbell team won the Progress Prize with a 107-algorithm ensemble
       – Top algorithm: SVD, with RMSE = 0.8914
       – 2nd algorithm: Restricted Boltzmann Machine, with RMSE = 0.8990
       – Mini-ensemble (SVD + RBM): RMSE = 0.88
    http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
  • 5. Common Kinds of Ensembles vs. Single Models
    [Diagram: classification methods grouped into ensembles vs. single classifiers]
    From Zhuowen Tu, "Ensemble Classification Methods: Bagging, Boosting, and Random Forests"
  • 6. What are Model Ensembles?
    – Combining the outputs from multiple models into a single decision
    – The models can be created using the same algorithm, or several different algorithms
    [Diagram: component model predictions feed decision logic, which produces the ensemble prediction]
  • 7. Creating Model Ensembles, Step 1: Generate Component Models
    Starting from a single data set, vary the data or the model parameters to produce multiple models and predictions:
    – Case (Record) Weights: bootstrapping, sampling
    – Data Values: add noise, recode data
    – Learning Parameters: vary learning rates, pruning severity, random seeds
    – Variable Subsets: vary the candidate inputs and features
    (A minimal bootstrap sketch follows below.)
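To make the most common of these variations concrete, here is a minimal sketch of bootstrap resampling; the function name and the tiny dataset are illustrative, not from the talk.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw n rows with replacement, so each component model sees a
    different weighting of the original cases."""
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)
X_boot, y_boot = bootstrap_sample(X, y, rng)  # one resampled training set
```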
  • 8. Creating Model Ensembles, Step 2: Combining Models
    – Combining Methods
       – Estimation: average the model outputs
       – Classification: average probabilities, or vote (best M of N)
    – Variance Reduction
       – Build complex, overfit models
       – All models built in the same manner
    – Bias Reduction
       – Build simple models
       – Subsequent models weight records with errors more heavily (or model the actual errors)
    [Diagram: multiple models and predictions are combined into a single decision or prediction value]
    (A minimal combining sketch follows below.)
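A minimal, illustrative sketch of the two combining methods named above (not code from the deck):

```python
import numpy as np

def average_outputs(predictions):
    """Estimation: average the component models' numeric outputs."""
    return np.mean(predictions, axis=0)

def majority_vote(class_predictions):
    """Classification: each model casts one vote per record; the most
    frequent class label wins."""
    votes = np.asarray(class_predictions)  # shape (n_models, n_records)
    return np.array([np.bincount(col).argmax() for col in votes.T])

print(average_outputs([[1.0, 2.0], [3.0, 4.0]]))  # [2. 3.]
print(majority_vote([[0, 1], [0, 1], [1, 1]]))    # [0 1]
```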
  • 9. How Model Complexity Affects Errors
    [Figure from] Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
  • 10. Commonly Used Information-Theoretic Complexity Penalties
    – BIC: Bayesian Information Criterion
    – AIC: Akaike Information Criterion
    – MDL: Minimum Description Length
    For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
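For reference (the formulas are standard, though not spelled out on the slide), both AIC and BIC penalize the maximized likelihood \(\hat{L}\) by the number of model parameters \(k\); BIC's penalty also grows with the number of records \(n\), and lower values indicate a better complexity/fit tradeoff:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

In its common two-part coding form, MDL selection yields a criterion closely related to BIC.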
  • 11. Four Keys to Effective Ensembling
    – Diversity of opinion
    – Independence
    – Decentralization
    – Aggregation
    From The Wisdom of Crowds, James Surowiecki
  • 12. Bagging
    – Bagging Method
       – Create many data sets by bootstrapping (can also be done with cross-validation)
       – Create one decision tree for each data set
       – Combine the decision trees by averaging (or voting on) their final decisions
       – Primarily reduces model variance rather than bias
    – Results
       – On average, better than any individual tree
    [Diagram: trees built on bootstrap samples are averaged into a final answer]
    (A hedged scikit-learn sketch follows below.)
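A minimal sketch of the method; scikit-learn is an assumption here (the deck does not name a library), and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 unpruned (deliberately overfit, high-variance) trees, each grown on
# a bootstrap sample; class decisions are combined by voting
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=0)
bagging.fit(X_tr, y_tr)
print("bagged trees, held-out accuracy:", bagging.score(X_te, y_te))
```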
  • 13. Boosting (AdaBoost)
    – Boosting Method
       – Create a tree using the training data set
       – Score each data point, indicating where an incorrect decision is made (errors)
       – Retrain, giving rows with incorrect decisions more weight; repeat
       – The final prediction is a weighted average of all models (a form of model regularization)
       – Best to create weak models: simple models (just a few splits for a decision tree), letting the boosting iterations find the complexity
       – Often used with trees or Naïve Bayes
    – Results
       – Usually better than an individual tree or Bagging
    [Diagram: examples are reweighted wherever the classification is incorrect; models are combined via a weighted sum]
    (A hedged scikit-learn sketch follows below.)
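The same idea sketched with scikit-learn (again an assumption): decision stumps as the weak models, reweighted each round and combined by weighted vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# weak model: a decision stump (a single split); boosting reweights the
# misclassified rows each round and combines the stumps by weighted vote
stump = DecisionTreeClassifier(max_depth=1)
adaboost = AdaBoostClassifier(stump, n_estimators=200, random_state=0)
adaboost.fit(X_tr, y_tr)
print("AdaBoost, held-out accuracy:", adaboost.score(X_te, y_te))
```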
  • 14. Random Forest Ensembles
    – Random Forest (RF) Method
       – Exactly the same methodology as Bagging, but with a twist
       – At each split, rather than using the entire set of candidate inputs, use a random subset of the candidate inputs
       – Generates diversity in both the samples and the inputs (splits)
    – Results
       – On average, better than any individual tree, Bagging, or even Boosting
    [Diagram: randomized trees are averaged into a final answer]
    (A hedged scikit-learn sketch follows below.)
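A minimal scikit-learn sketch (an assumption, as before), where max_features is the "twist" described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bagging's bootstrap sampling plus the "twist": only a random subset of the
# candidate inputs (sqrt of the total, here) is considered at each split
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_tr, y_tr)
print("random forest, held-out accuracy:", forest.score(X_te, y_te))
```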
  • 15. Stochastic Gradient Boosting
    – Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)
    – Algorithm
       – Begin with a simple model: a constant value
       – Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible levels, whereas before there was one
       – Score the model and compute the errors; the score is the sum of all previous trees, weighted by a learning rate
       – Build a new tree with the errors as the target variable, and repeat
    – Results
       – TreeNet has won 2 KDD-Cup competitions and numerous others
       – It is less prone to outliers and overfitting than AdaBoost
    [Diagram: each new tree predicts the current ensemble's errors; models are combined via a weighted sum into an additive final model]
    (A scikit-learn analogue is sketched below.)
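MART and TreeNet are standalone tools; the sketch below is a scikit-learn analogue of the same algorithm (an assumption), with illustrative parameter values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# start from a constant model, then repeatedly fit a small tree to the current
# errors (the gradient of the loss); each tree's contribution is shrunk by the
# learning rate, and subsample < 1.0 is what makes the procedure "stochastic"
gbm = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                 max_leaf_nodes=6,  # simple trees, per the slide
                                 subsample=0.5,     # random half of rows per tree
                                 random_state=0)
gbm.fit(X_tr, y_tr)
print("gradient boosting, held-out accuracy:", gbm.score(X_te, y_te))
```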
  • 16. Ensembles of Trees: Smoothers
    – Ensembles smooth jagged decision boundaries
    Pictures from T.G. Dietterich, "Ensemble Methods in Machine Learning," in Multiple Classifier Systems, Cagliari, Italy, 2000.
  • 17. Heterogeneous Model Ensembles on Glass Data
    – Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN
    – Combining 3-5 models is on average better than the best single model
    – Combining all 6 models is not best (best is the 3- and 4-model combination), but it is close
    – This is an example of reducing model variance through ensembles, but not model bias
    [Chart: max, min, and average percent classification error (0-40%) versus number of models combined (1-6)]
    (A hedged combining sketch follows below.)
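One way to build this kind of heterogeneous ensemble is soft voting over dissimilar algorithms. The sketch below assumes scikit-learn and only approximates the six algorithms used on the glass data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "soft" voting averages the class probabilities predicted by each algorithm
hetero = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(max_depth=5)),
    ("nn", MLPClassifier(max_iter=2000, random_state=0)),
    ("logit", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
], voting="soft")
hetero.fit(X_tr, y_tr)
print("heterogeneous ensemble, held-out accuracy:", hetero.score(X_te, y_te))
```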
  • 18. Direct Marketing Example: Considerations for I-Miner
    From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
    Steps:
    1. Join by record: all models are applied to the same data, in the same row order
    2. Change the probability names
    3. Average the probabilities; the decision is avg_prob > threshold
    4. Decile the probability ranks
    (A pandas sketch of these steps follows below.)
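A minimal pandas sketch of the four steps; the frame and the column names (prob_model1 and so on) are hypothetical, not from the talk:

```python
import numpy as np
import pandas as pd

# hypothetical per-model probabilities, already joined by record
# (same data, same row order) and renamed so the columns are distinct
scores = pd.DataFrame({
    "prob_model1": np.random.default_rng(0).random(1000),
    "prob_model2": np.random.default_rng(1).random(1000),
    "prob_model3": np.random.default_rng(2).random(1000),
})
scores["avg_prob"] = scores.mean(axis=1)            # step 3: average probabilities
scores["decision"] = scores["avg_prob"] > 0.5       # decision: avg_prob > threshold
scores["decile"] = pd.qcut(scores["avg_prob"], 10,  # step 4: decile ranks,
                           labels=False) + 1        # 10 = highest-scoring tenth
```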
  • 19. Direct Marketing Example: Variable Inclusion in Model Ensembles
    – Twenty-five different variables were represented in the ten models
    – Only five were represented in seven or more models
    – Twelve were represented in only one or two models
    [Chart: number of models with common variables (# models vs. # variables)]
    From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles," presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
  • 20. Fraud Detection Example: Deployment Stream
    – Model scoring picks up the scores from each model, combines them in an ensemble, and pushes the scores back to the database
  • 21. Fraud Detection Example: Overall Model Score on Validation Data
    – The total "Score" (from the validation population) weights false alarms and sensitivity
    – Overall, the ensemble is clearly best, and much better than the best single model on the testing data
    [Chart: normalized total score for individual models 1-10, best/worst/average combinations, and the ensemble]
    From Abbott, D., and Tom Konchan, "Advanced Fraud Detection Techniques for Vendor Payments," Predictive Analytics Summit, San Diego, CA, February 24, 2011.
  • 22. Are Ensembles Better?
    – Accuracy? Yes
    – Interpretability? No
    – Do ensembles contradict Occam's Razor?
       – Principle: simpler models generalize better; avoid overfitting!
       – Ensembles are more complex than single models (an RF may have hundreds of trees in the ensemble)
       – Yet these more complex models perform better on held-out data
       – But... are they really more complex?
  • 23. Generalized Degrees of Freedom
    – Linear Regression: a degree of freedom in the model is simply a parameter
       – This does not extrapolate to non-linear methods
       – The number of "parameters" in non-linear methods can produce more complexity, or less
    – Enter... Generalized Degrees of Freedom (GDF)
       – GDF (Ye 1998) "randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates" (for the same number of parameters)
    (A rough Monte Carlo sketch of the procedure follows below.)
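A rough sketch of that perturb-and-refit procedure as I read it (an illustration under my own assumptions, not code from the talk or from Ye's paper):

```python
import numpy as np

def estimate_gdf(fit_predict, X, y, sigma=0.1, n_reps=50, seed=0):
    """Perturb the target with small Gaussian noise, refit, and measure how
    much the fitted values move with the perturbations; the summed
    sensitivity approximates Ye's Generalized Degrees of Freedom."""
    rng = np.random.default_rng(seed)
    baseline = fit_predict(X, y)
    estimates = []
    for _ in range(n_reps):
        noise = rng.normal(0.0, sigma, size=len(y))
        shifted = fit_predict(X, y + noise)
        # cov(noise_i, yhat_i) / sigma^2 estimates d(yhat_i)/d(y_i); sum over i
        estimates.append(np.dot(shifted - baseline, noise) / sigma**2)
    return float(np.mean(estimates))

# sanity check: for ordinary least squares the estimate should land near
# the number of fitted parameters (the columns of X)
def ols_fit_predict(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(size=200)
print(estimate_gdf(ols_fit_predict, X, y))  # roughly 5
```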
  • 24. The Math of GDF
    [Equations from] Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
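A hedged reconstruction of the key definition, following Ye (1998): GDF sums the sensitivity of each fitted value to its own observation, and for a linear regression it reduces to the number of parameters k:

```latex
\mathrm{GDF} = \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
```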
  • 25. The Effect of GDF
    [Figure from] Elder, J.F. IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, pp. 853-864
  • 26. Why Ensembles Win
    – Performance, performance, performance
    – Single models sometimes provide insufficient accuracy
       – Neural networks become stuck in local minima
       – Decision trees run out of data, and are greedy: they can get fooled early
       – Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of thinking outside their box
    – Different algorithms, or algorithms built on resampled data, achieve the same level of accuracy but on different cases: they identify different ways to reach the same level of accuracy
  • 27. Conclusion
    – Ensembles can achieve significant model performance improvements
    – The key to good ensembles is diversity in sampling and variable selection
    – Can be applied to a single algorithm, or across multiple algorithms
    – Just do it!
  • 28. References
    – Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan & Claypool Publishers, 2010 (ISBN: 978-1608452842)
    – Elder, J.F. IV, "The Generalization Paradox of Ensembles," Journal of Computational and Graphical Statistics, Volume 12, Number 4, pp. 853-864. DOI: 10.1198/1061860032733
    – Abbott, D.W., "The Benefits of Creating Ensembles of Classifiers," Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
    – Abbott, D.W., "A Comparison of Algorithms at PAKDD2007," blog post, http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
  • 29. References
    – Tu, Zhuowen, "Ensemble Classification Methods: Bagging, Boosting, and Random Forests," http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
    – Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120-131.