Introduction to RandomForests 2004

An Introduction to RandomForests™
Salford Systems
http://www.salford-systems.com
golomi@salford-systems.com
Dan Steinberg, Mikhail Golovnya, N. Scott Cardell

 New approach for many data analytical tasks developed by
Leo Breiman of University of California, Berkeley
◦ Co-author of CART® with Friedman, Olshen, and Stone
◦ Author of Bagging and Arcing approaches to combining trees
 Good for classification and regression problems
◦ Also for clustering, density estimation
◦ Outlier and anomaly detection
◦ Explicit missing value imputation
 Builds on the notions of committees of experts but is
substantially different in key implementation details

 The term “data mining” usually refers to pattern discovery in large databases
 Initially appeared in the late twentieth century and directly
associated with the PC boom
◦ Spread of data collection devices
◦ Dramatically increased data storage capacity
◦ Exponential growth in computational power of CPUs
 The necessity to go way beyond standard statistical techniques
in data analysis
◦ Dealing with extremely large numbers of variables
◦ Dealing with highly non-linear dependency structures
◦ Dealing with missing values and dirty data

 The following major classes of problems are
usually considered:
◦ Supervised Learning (interested in predicting some
outcome variable based on observed predictors)
 Regression (quantitative outcome)
 Classification (nominal or categorical outcome)
◦ Unsupervised Learning (no single target variable
available; interested in partitioning data into clusters,
finding association rules, etc.)

 Relating gene expressions to the presence of a
certain disease based upon microarray data
 Identifying potential fraud cases in credit card
transactions (binary target)
 Predicting level of user satisfaction as poor, average,
good, excellent (4-level target)
 Optical Digit Recognition (10-level target)
 Predicting consumer preferences towards different
kinds of vehicles (could be as many as several
hundred level target)

 Predicting efficacy of a drug based upon demographic factors
 Predicting the amount of sales (target) based on current
observed conditions
 Predicting user energy consumption (target) depending on
the season, business type, location, etc.
 Predicting median house value (target) based on the crime
rate, pollution level, proximity, age, industrialization level,
etc.

 DNA Microarray Data- which samples cluster together? Which
genes cluster together?
 Market Basket Analysis- which products do customers tend to
buy together?
 Clustering For Classification- Handwritten zip code problem:
can we find prototype digits for 1,2, etc. to use for
classification?

 The answer usually has two sides:
◦ Understanding the relationship
◦ Predictive accuracy
 Some algorithms dominate one side (understanding)
◦ Classical methods
◦ Single trees
◦ Nearest neighbor
◦ MARS
 Others dominate the other side (predicting)
◦ Neural nets
◦ TreeNet
◦ Random Forests

 Leo Breiman says:
◦ Framing the question as the choice between accuracy
and interpretability is an incorrect interpretation of what
the goal of a statistical analysis is
 The goal is NOT interpretability, but accurate information
 Nature’s mechanisms are generally complex and cannot be
summarized by a relatively simple stochastic model, even as
a first approximation
 The better the model fits the data, the more sound the
inferences about the phenomenon are

 The only way to attain the best predictive accuracy on
real life data is to build a complex model
 Analyzing this model will also provide the most
accurate insight!
 At the same time, the model complexity makes it far
more difficult to analyze it
◦ A random forest may contain 3,000 trees jointly
contributing to the overall prediction
◦ There could be 5,000 association rules found in a typical
unsupervised learning algorithm

 (Insert table)
 Example of a classification tree for the UCSD
heart disease study

 Relatively fast
 Requires minimal supervision by analyst
 Produces easy to understand models
 Conducts automatic variable selection
 Handles missing values via surrogate splits
 Invariant to monotonic transformations of predictors
 Impervious to outliers

 Piece-wise constant models
 “Sharp” decision boundaries
 Exponential data exhaustion
 Difficulties capturing global linear patterns
 Models tend to evolve around the strongest effects
 Not the best predictive accuracy

 A random forest is a collection of single trees grown in a
special way
 The overall prediction is determined by voting (in
classification) or averaging (in regression)
 The Law of Large Numbers ensures convergence
 The key to accuracy is low correlation between trees and low bias
 To keep bias low, trees are grown to maximum depth

 Each tree is grown on a bootstrap sample from the learning
set
 A number R is specified (square root of the number of predictors
by default) such that it is noticeably smaller than the total number
of available predictors
 During tree growing phase, at each node only R predictors are
randomly selected and tried
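
The two sources of randomness described above can be sketched in a few lines of Python. This is a minimal illustration, not the RandomForests implementation: the function names, the seed, and the sizes (1,000 records, 100 predictors) are assumptions chosen for the example.

```python
import math
import random


def bootstrap_sample(n_records, rng):
    """Draw n_records row indices with replacement (one tree's learning sample)."""
    return [rng.randrange(n_records) for _ in range(n_records)]


def candidate_predictors(n_predictors, rng, r=None):
    """Pick R distinct predictor indices to try at one node.

    Assumption: when R is not given, use the rounded square root of the
    number of predictors, with a floor of 1.
    """
    if r is None:
        r = max(1, round(math.sqrt(n_predictors)))
    return rng.sample(range(n_predictors), r)


rng = random.Random(0)                      # hypothetical seed
rows = bootstrap_sample(1000, rng)          # rows used to grow one tree
oob_rows = set(range(1000)) - set(rows)     # rows this tree never saw
node_candidates = candidate_predictors(100, rng)  # redrawn at every node
```

A new call to candidate_predictors would be made at each node, while bootstrap_sample is called once per tree.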

 All major advantages of a single tree are automatically
preserved
 Since each tree is grown on a bootstrap sample, one can
◦ Use out of bag samples to compute an unbiased estimate of
the accuracy
◦ Use out of bag samples to determine variable importances
 There is no overfitting as the number of trees increases

 It is possible to compute generalized proximity between any pair
of cases
 Based on proximities one can
◦ Proceed with a well-defined clustering solution
◦ Detect outliers
◦ Generate informative data views/projections using scaling
coordinates
◦ Do missing value imputation
 Easy expansion into the unsupervised learning domain

 High levels of predictive accuracy delivered automatically
◦ Only a few control parameters to experiment with
◦ Strong for both regression and classification
 Resistant to overtraining (overfitting)- generalizes well to new data
 Trains rapidly even with thousands of potential predictors
◦ No need for prior feature (variable) selection
 Diagnostics pinpoint multivariate outliers
 Offers a revolutionary new approach to clustering using tree-based
between-record distance measures
 Built on CART® inspired trees and thus
◦ Results invariant to monotone transformations of variables

 Method intended to generate a large number of substantially
different models
◦ Randomness introduced in two simultaneous ways
◦ By row: records selected for training at random with replacement (as in
bootstrap resampling of the bagger)
◦ By column: candidate predictors at any node are chosen at random and
best splitter selected from the random subset
 Each tree is grown out to maximal size and left unpruned
◦ Trees are deliberately overfit, becoming a form of nearest neighbor
predictor
◦ Experiments convincingly show that pruning these trees hurts performance
◦ Overfit individual trees combine to yield properly fit ensembles

 Self-testing possible even if all data is used for training
◦ Only about 63% of available training data will be used to grow any one
tree
◦ About 37% of the training data is left unused by each tree
 The unused portion of the training data is known as Out-Of-Bag (OOB)
data and can be used to provide an ongoing dynamic assessment of
model performance
◦ Allows fitting to small data sets without explicitly holding back
any data for testing
◦ All training data is used cumulatively in training, but only a 63%
portion used at any one time
 Similar to cross-validation but unstructured

 Intensive post processing of data to extract more
insight into data
◦ Most important is introduction of distance metric
between any two data records
◦ The more similar two records are the more often they
will land in same terminal node of a tree
◦ With a large number of different trees simply count the
number of times they co-locate in same leaf nodes
◦ Distance metric can be used to construct dissimilarity
matrix input into hierarchical clustering

 Ultimately in modeling our goal is to produce a single
score, prediction, forecast, or class assignment
 The motivation for generating multiple models is the
hope that by somehow combining models results will
be better than if we relied on a single model
 When multiple models are generated they are
normally combined by
◦ Voting in classification problems, perhaps weighted
◦ Averaging in regression problems, perhaps weighted

 Combining trees via averaging or voting will only be
beneficial if the trees are different from each other
 In original bootstrap aggregation paper Breiman noted
bagging worked best for high variance (unstable)
techniques
◦ If the results of each model are nearly identical, there is
little to be gained by averaging
 Resampling of the bagger from the training data
intended to induce differences in trees
◦ Accomplished essentially by varying the weight on any
data record

 Bootstrap sample is fairly similar to taking a 65% sample from
the original training data
 If you grow many trees each based on a different 65% random
sample of your data you expect some variation in the trees
produced
 Bootstrap sample goes a bit further in ensuring that the new
sample is of the same size as the original by allowing some
records to be selected multiple times
 In practice the different samples induce different trees but
trees are not that different

 The bagger was limited by the fact that even with resampling
trees are likely to be somewhat similar to each other,
particularly with strong data structure
 Random Forests induces vastly more between tree differences
by forcing splits to be based on different predictors
◦ Accomplished by introducing randomness into split
selection

 Breiman points out tradeoff:
◦ As R increases strength of individual tree should increase
◦ However, correlation between trees also increases reducing advantage of
combining
 Want to select R to optimally balance the two effects
◦ Can only be determined via experimentation
 Breiman has suggested three values to test:
◦ R = (1/2)·sqrt(M)
◦ R = sqrt(M)
◦ R = 2·sqrt(M)
◦ For M= 100 test values for R: 5,10,20
◦ For M= 400 test values for R: 10, 20, 40
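
The three suggested test values can be computed directly from the number of predictors M. The helper below is a small sketch; its name, the rounding, and the floor of 1 are assumptions for cases where M is not a perfect square.

```python
import math


def candidate_r_values(n_predictors):
    """Return the three values of R to test: sqrt(M)/2, sqrt(M), 2*sqrt(M)."""
    root = math.sqrt(n_predictors)
    # Assumption: round to the nearest integer and never go below 1.
    return [max(1, round(factor * root)) for factor in (0.5, 1.0, 2.0)]


r_for_100 = candidate_r_values(100)
r_for_400 = candidate_r_values(400)
```

For M = 100 and M = 400 this reproduces the values listed above.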

 Random Forests machinery unlike CART in that
◦ Only one splitting rule: Gini
◦ Class weight concept but no explicit priors or costs
◦ No surrogates: missing values are first imputed automatically
 Default fast imputation just uses means
 Compute-intensive method uses tree-based nearest neighbors to base
imputation on (discussed later)
◦ None of the display and reporting machinery or tree refinement
services of CART
 Does follow CART in that all splits are binary

 Trees combined via voting (classification) or averaging
(regression)
 Classification trees “vote”
◦ Recall that classification trees classify
 Assign each case to ONE class only
◦ With 50 trees, 50 class assignments for each case
◦ Winner is the class with the most votes
◦ Votes could be weighted- say by accuracy of individual trees
 Regression trees assign a real predicted value for each case
◦ Predictions are combined via averaging
◦ Results will be much smoother than from a single tree
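
Combining the trees is simple once each tree has produced its output. The sketch below assumes each tree's class assignment or predicted value is already available in a list; the function names and the tie-breaking rule are illustrative choices, not part of the method as described.

```python
from collections import Counter


def vote(class_assignments, weights=None):
    """Return the class with the most (optionally weighted) votes."""
    if weights is None:
        weights = [1.0] * len(class_assignments)
    totals = Counter()
    for label, weight in zip(class_assignments, weights):
        totals[label] += weight
    # Assumption: ties go to the class that was seen first.
    return max(totals, key=totals.get)


def average(predictions):
    """Combine regression tree predictions by simple averaging."""
    return sum(predictions) / len(predictions)


winner = vote(["a", "b", "a"])                           # unweighted vote
weighted_winner = vote(["a", "b", "a"], [1.0, 5.0, 1.0])  # e.g. accuracy weights
smoothed = average([1.0, 2.0, 3.0])
```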

 Probability of being omitted in a single draw is (1-1/n)
 Probability of being omitted in all n draws is (1-1/n)^n
 Limit of the sequence as n increases is 1/e ≈ 0.368
◦ Approximately 36.8% of the sample is excluded: 0% of the resample
◦ 36.8% of the sample is included once: 36.8% of the resample
◦ 18.4% of the sample is included twice: 36.8% of the resample
◦ 6.1% of the sample is included three times: 18.4% of the resample
◦ 1.9% of the sample is included four or more times: 8% of the resample
◦ Total: 100% of the resample
◦ Example: distribution of weights in a 2,000 record resample:
◦ (insert table)
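
The figures above can be checked numerically. The sketch below evaluates (1-1/n)^n for growing n and uses the standard Poisson limit with mean 1 for the number of times a record is drawn; the function names are assumptions made for this example.

```python
import math


def prob_never_drawn(n):
    """Probability that a given record is omitted from all n draws."""
    return (1.0 - 1.0 / n) ** n


def poisson_share(k):
    """Limiting fraction of records drawn exactly k times (Poisson, mean 1)."""
    return math.exp(-1.0) / math.factorial(k)


for n in (10, 100, 1000, 100000):
    print(n, round(prob_never_drawn(n), 4))

for k in range(4):
    # Fraction of original records drawn k times, and their share of the resample.
    print(k, round(poisson_share(k), 3), round(k * poisson_share(k), 3))
```

The share of the resample is k times the fraction of records drawn k times, which is why records drawn once and records drawn twice each account for the same 36.8%.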

 Want to use mass spectrometer data to classify
different types of prostate cancer
◦ 772 observations available
 398- healthy samples
 178- 1st type of cancer samples
 196- 2nd type of cancer samples
◦ 111 mass spectra measurements are recorded for each
sample

 (insert table)
 The above table shows cross-validated prediction success
results of a single CART tree for the prostate data
 The run was conducted under PRIORS DATA to facilitate
comparisons with subsequent RF run
◦ The relative error corresponds to the absolute error of
30.4%

 Topic discussed by several Machine Learning researchers
 Possibilities:
◦ Select splitter, split point, or both at random
◦ Choose splitter at random from the top K splitters
 Random Forests: Suppose we have M available predictors
◦ Select R eligible splitters at random and let the best one split the node
◦ If R=1 this is just random splitter selection
◦ If R=M this becomes Breiman’s bagger
◦ If R<< M then we get Breiman’s Random Forests
 Breiman suggests R=sqrt(M) as a good rule of thumb

 The performance of a single tree will be somewhat driven by the
number of candidate predictors allowed at each node
 Consider R=1: the splitter is always chosen at random and
performance could be quite weak
 As relevant splitters get into tree and tree is allowed to grow
massively, single tree can be predictive even if R=1
 As R is allowed to increase quality of splits can improve as
there will be better (and more relevant) splitters

 (insert graph)
 In this experiment, we ran RF with 100 trees on the
prostate data using different values for the number
of variables Nvars searched at each split

 RF clearly outperforms a single tree for any value of Nvars
◦ We saw above that a properly pruned tree gives cross-validated absolute
error of 30.4% (the very right end of the red curve)
 The performance of a single tree tends to vary substantially
with the number of predictors allowed to be searched (a single
tree is a high variance object)
 The RF reaches the nearly stable error rate of about 20% when
only 10 variables are searched in each node (marked by the blue
color)
 Discounting the minor fluctuations, the error rate also remains
stable for Nvars above 10
◦ This generally agrees with Breiman’s suggestion to use the square root of
the number of predictors (sqrt(111) ≈ 10.5) as a rough estimate of the
optimal value for Nvars
 The performance for small Nvars can be usually further improved
by increasing the number of runs

 (insert table)
 The above results correspond to a standard RF run
with 500 trees, Nvars=15, and unit class weights
 Note that the overall error rate is 19.4%, which is
about 2/3 of the baseline CART error of 30.4%

 RF does not use a test dataset to report accuracy
 For every tree grown, about 37% of data are left out-of-bag
(OOB)
 This means that these cases can be safely used in place of the
test data to evaluate the performance of the current tree
 For any tree in RF, its own OOB sample is used- hence no bias is
ever introduced into the estimates
 The final OOB estimate for the entire RF can be simply obtained
by averaging individual OOB estimates
 Consequently, this estimate is unbiased and behaves as if we had
an independent test sample of the same size as the learn sample

 The prostate dataset is somewhat unbalanced: class 1
contains fewer records than the other classes
 Under the default RF settings, the minority classes will have
higher misclassification rates than the dominant classes
 Imbalance in the individual class error rates may also be caused
by other data specific issues
 Class weights are used in RF to boost the accuracy of the
specified classes
 General Rule of Thumb: to increase accuracy in the given class,
one should increase the corresponding class weight
 In many ways this is similar to the PRIORS control used in CART
for the same purpose

 Our next run sets the weight for class 1 to 2
 As a result, class 1 is classified with a much
better accuracy at the cost of slightly reduced
accuracy in the remaining classes

 At the end of an RF run, the proportion of votes for
each class is recorded
 We can define Margin of a case simply as the
proportion of votes for the true class minus the
maximum proportion of votes for the other classes
 The larger the margin, the higher the confidence of
classification
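
The margin definition above translates directly into code. In this sketch the vote proportions for one case are assumed to be stored in a dictionary keyed by class label; the function name and the example numbers are hypothetical.

```python
def margin(vote_proportions, true_class):
    """Proportion of votes for the true class minus the largest
    proportion of votes for any other class."""
    true_share = vote_proportions[true_class]
    best_other = max(
        share for label, share in vote_proportions.items() if label != true_class
    )
    return true_share - best_other


confident = margin({0: 0.7, 1: 0.2, 2: 0.1}, true_class=0)
misclassified = margin({0: 0.3, 1: 0.6, 2: 0.1}, true_class=0)
```

A negative margin means another class received more votes than the true class, so the case is misclassified.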

 (insert table)
 This extract shows percent votes for the top 30
records in the dataset along with the
corresponding margins
 The green lines have high margins and therefore
high confidence of predictions
 The pink lines have negative margins, which means
that these observations are not classified correctly

 The concept of margin allows new “unbiased” definition of variable
importance
 To estimate the importance of the m-th variable:
◦ Take the OOB cases for the k-th tree; assume that we already know the margin M for
those cases
◦ Randomly permute all values of variable m
◦ Apply the k-th tree to the OOB cases with the permuted values
◦ Compute the new margin M′
◦ Compute the difference M − M′
 The variable importance is defined as the average lowering of the margin
across all OOB cases and all trees in the RF
 This procedure is fundamentally different from the intrinsic variable
importance scores computed by CART: the latter are always based on
the LEARN data and are subject to the overfitting issues
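
The permutation procedure can be sketched as follows. This is one possible reading of the steps above, under stated assumptions: each tree is a plain Python callable that maps a row to a class label, the OOB sets are given, and margins are computed from pooled OOB votes. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def oob_votes(trees, oob_sets, rows, permute_col=None, rng=None):
    """Collect each case's votes from the trees for which it is out-of-bag.
    If permute_col is given, that column is shuffled among each tree's OOB cases."""
    votes = [Counter() for _ in rows]
    for tree, oob in zip(trees, oob_sets):
        idx = sorted(oob)
        shuffled = None
        if permute_col is not None:
            shuffled = [rows[i][permute_col] for i in idx]
            rng.shuffle(shuffled)
        for pos, i in enumerate(idx):
            row = list(rows[i])
            if shuffled is not None:
                row[permute_col] = shuffled[pos]
            votes[i][tree(row)] += 1
    return votes


def mean_margin(votes, labels, classes):
    """Average margin over all cases that received at least one OOB vote."""
    margins = []
    for case_votes, y in zip(votes, labels):
        total = sum(case_votes.values())
        if total == 0:
            continue
        others = max(case_votes[c] / total for c in classes if c != y)
        margins.append(case_votes[y] / total - others)
    return sum(margins) / len(margins)


def importance(trees, oob_sets, rows, labels, classes, col, rng):
    """Average lowering of the margin when column col is permuted."""
    base = mean_margin(oob_votes(trees, oob_sets, rows), labels, classes)
    permuted = mean_margin(
        oob_votes(trees, oob_sets, rows, col, rng), labels, classes
    )
    return base - permuted


# Toy data: column 0 determines the class, column 1 is noise.
rows = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.3, 0.7],
        [0.8, 0.2], [0.9, 0.8], [0.7, 0.4], [0.6, 0.6]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
stump = lambda r: int(r[0] > 0.5)        # stand-in for a grown tree
trees = [stump, stump]
oob_sets = [{0, 1, 4, 5}, {2, 3, 6, 7}]  # hypothetical OOB cases per tree
rng = random.Random(0)
imp_signal = importance(trees, oob_sets, rows, labels, [0, 1], 0, rng)
imp_noise = importance(trees, oob_sets, rows, labels, [0, 1], 1, rng)
```

Because the stand-in trees never look at column 1, permuting it leaves every vote unchanged and its importance is zero, while permuting column 0 can only lower the margin.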

 The top portion of the variable importance list for the
data is shown here
 Analysis of the complete list reveals that all 111
variables are nearly equally strongly contributing to
the model predictions
 This is in a striking contrast with the single CART tree
that has no choice but to use a limited subset of
variables by tree’s construction
 The above explains why the RF model has a
significantly lower error rate (20%) when compared to
a single CART tree (30%)

 RF introduces a novel way to define proximity between two
observations
◦ Initialize proximities to zeroes
◦ For any given tree, apply the tree to all cases
◦ If cases i and j both end up in the same node, increase proximity prox(ij)
between i and j by one
◦ Accumulate over all trees in RF and normalize by twice the number of trees
in RF
 The resulting matrix of size NxN provides intrinsic measure of
proximity
◦ The measure is invariant to monotone transformations
◦ The measure is clearly defined for any type of independent variables,
including categorical
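
The counting scheme above can be sketched from the terminal node assignments alone. The input layout (leaf_ids[t][i] is the terminal node reached by case i in tree t) and the function name are assumptions. This sketch also divides by the number of trees, so that a case has proximity 1 to itself; that is an assumed normalization and may differ from the counting convention used on the slide.

```python
def proximity_matrix(leaf_ids):
    """Build an N x N proximity matrix from per-tree terminal node ids."""
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]  # initialize to zeroes
    for leaves in leaf_ids:                           # one pass per tree
        for i in range(n_cases):
            for j in range(n_cases):
                if leaves[i] == leaves[j]:            # same terminal node
                    prox[i][j] += 1.0
    # Assumption: normalize by the number of trees (self-proximity becomes 1).
    return [[value / n_trees for value in row] for row in prox]


# Two hypothetical trees, three cases.
example_prox = proximity_matrix([[1, 1, 2], [3, 4, 4]])
```

Only node membership is used, which is why the measure does not depend on the scale or type of the predictors.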

 (insert graph)
 The above extract shows the proximity matrix for the
top 10 records of the prostate dataset
◦ Note ones on the main diagonal- any case has
“perfect” proximity to itself
◦ Observations that are “alike” will have proximities
close to one
 These cells have a green background
◦ The closer the proximity is to 0, the more dissimilar cases i
and j are
 These cells have a pink background

 Having the full intrinsic proximity matrix opens new horizons
◦ Informative data views using metric scaling
◦ Missing value imputation
◦ Outlier detection
 Unfortunately, things get out of control when dataset size
exceeds 5,000 observations (25,000,000+ cells are needed)
 RF switches to “compressed” form of the proximity matrix to
handle large datasets- for any case, only M closest cases are
recorded. M is usually less than 100.

 The values 1-prox(ij) can be treated as Euclidean distances
in a high dimensional space
 The theory of metric scaling solves the problem of finding
the most representative projections of the underlying data
“cloud” onto low dimensional space using the data
proximities
◦ The theory is similar in spirit to the principal components analysis
and discriminant analysis
 The solution is given in the form of ordered “scaling
coordinates”
 Looking at the scatter plots of the top scaling coordinates
provides informative views of the data

 (insert graph)
 This extract shows five initial scaling coordinates for
the top 30 records of the prostate data
 We will look at the scatter plots among the first,
second, and third scaling coordinates
 The following color codes will be used for the target
classes:
◦ Green- class 0
◦ Red- class 1
◦ Blue- class 2

 (insert graphs)
 A nearly perfect separation of all three classes is clearly seen
 From this we conclude that the outcome variable admits clear
prediction using RF model which utilizes 111 original
predictors
 The residual error is mostly due to the presence of the “focal”
point where all the three rays meet

 (insert graphs)
 Again, three distinct target classes show up as
separate clusters
 The “focal” point represents a cluster of records
that can’t be distinguished from each other

 Outliers are defined as cases having small proximities to
all other cases belonging to the same target class
 The following algorithm is used:
◦ For a case n, compute the sum of the squares of prox(nk) for all k
in the same class as n
◦ Take the inverse- it will be large if the case is “far away” from the
rest
◦ Standardize using the median and standard deviation
◦ Look at the cases with the largest values- those are potential
outliers
 Generally, a value above 10 is reason to suspect the case
of being an outlier
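
The outlier measure can be sketched as below. Several details are assumptions not stated above: the case itself is left out of the sum, standardization is done over all cases at once, the population standard deviation is used, and a small floor avoids division by zero. The function name and toy matrix are hypothetical.

```python
import statistics


def outlier_scores(prox, labels, eps=1e-12):
    """Inverse of the summed squared within-class proximities, standardized
    with the median and the standard deviation."""
    n = len(labels)
    raw = []
    for i in range(n):
        same_class = [k for k in range(n) if k != i and labels[k] == labels[i]]
        total = sum(prox[i][k] ** 2 for k in same_class)
        raw.append(1.0 / max(total, eps))  # large when case i is "far away"
    center = statistics.median(raw)
    spread = statistics.pstdev(raw)
    if spread == 0:
        return [0.0] * n
    return [(value - center) / spread for value in raw]


# Cases 0-2 are close to each other; case 3 is far from everyone.
prox = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
scores = outlier_scores(prox, labels=[0, 0, 0, 0])
```

In this toy example the largest score belongs to case 3, the one with small proximities to the rest of its class.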

 This extract shows top 30 records of the prostate
dataset sorted descending by the outlier measure
 Clearly the top 6 cases (class 2 with IDs: 771, 683,
539, and class 0 with IDs 127, 281, 282) are
suspicious
 All of these seem to be located at the “focal point”
on the corresponding scaling coordinate plots

 RF offers two ways of missing value imputation
 The Cheap Way- conventional median imputation for continuous
variables and mode imputation for categorical variables
 The Right Way:
◦ Suppose case n has x coordinate missing
◦ Do the Cheap Way imputation for starters
◦ Grow a full size RF
◦ We can now re-estimate the missing value by a weighted average
over all cases k with non-missing x using weights prox(nk)
◦ Repeat the previous two steps (grow the forest, re-estimate) several times
to ensure convergence
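
The re-estimation step for a continuous variable can be sketched as a proximity-weighted average. The function name and the use of None to mark missing values are assumptions; growing the forest and recomputing the proximities between iterations are outside this sketch.

```python
def impute_by_proximity(values, prox_row):
    """Re-estimate a missing value for case n.

    values[k]   : value of x for case k, or None if missing
    prox_row[k] : proximity prox(nk) between case n and case k
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for value, weight in zip(values, prox_row):
        if value is not None:          # only cases with non-missing x contribute
            weighted_sum += weight * value
            weight_total += weight
    if weight_total == 0:
        return None                    # no usable neighbors; keep the old fill
    return weighted_sum / weight_total


# Case 0 is missing x; cases 1 and 2 have proximities 0.75 and 0.25 to it.
estimate = impute_by_proximity([None, 2.0, 4.0], [1.0, 0.75, 0.25])
```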

 An alternative display to view how the target classes are
different with respect to the individual predictors
◦ Recall that at the end of an RF run every case in the dataset obtains K
separate vote counts for class membership (assuming K target
classes)
◦ Take any target class and sort all observations by the count of
votes for this class descending
◦ Take the top 50 observations and the bottom 50 observations,
those are correspondingly the most likely and the least likely
members of the given target class
◦ Parallel coordinate plots report uniformly (0,1) scaled values of all
predictors for the top 50 and bottom 50 sorted records, along
with the 25th, 50th, and 75th percentiles within each predictor

 (insert graph)
 This is a detailed display of the normalized values
of the initial 20 predictors for the top voted 50
records in each target class (this gives 50x3=150
graphs)
 Class 0 generally has normalized values of the
initial 20 predictors close to 0 (left side of the plot),
except perhaps M9X11

 (insert graph)
 It is easier to see this when looking at the quartile
plots only
 Note that class 2 tends to have the largest values
of the corresponding predictors
 The graph can be scrolled forward to view all of the
111 predictors

 (insert graph)
 The least likely plots lead to roughly similar
conclusions: small predictor values are the least
likely for class 2, etc.

 RF admits an interesting possibility to solve unsupervised learning
problems, in particular, clustering problems and missing value
imputation in the general sense
 Recall that in the unsupervised learning the concept of target is not
defined
 RF generates a synthetic target variable in order to proceed with a
regular run:
◦ Give class label 1 to the original data
◦ Create a copy of the data such that each variable is sampled independently from the
values available in the original dataset
◦ Give class label 2 to the copy of the data
◦ Note that the second copy has marginal distributions identical to the first copy,
whereas the possible dependency among predictors is completely destroyed
◦ A necessary drawback is that the resulting dataset is twice as large as the original
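
The construction of the synthetic target can be sketched as follows. The function name is hypothetical, and drawing each value with replacement from its column is one way to read "sampled independently"; it preserves the marginal distributions in expectation while breaking dependencies between predictors.

```python
import random


def make_synthetic_problem(rows, rng):
    """Stack the original data (class 1) on a column-wise resampled copy (class 2)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    # Each synthetic value is drawn independently from its own column.
    synthetic = [
        [rng.choice(columns[j]) for j in range(n_cols)] for _ in range(n_rows)
    ]
    data = [list(row) for row in rows] + synthetic
    target = [1] * n_rows + [2] * n_rows
    return data, target


original = [[1, "a"], [2, "b"], [3, "c"]]
data, target = make_synthetic_problem(original, random.Random(0))
```

The returned data and target can then be passed to any two-class learner.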

 We now have a clear binary supervised learning problem
 Running an RF on this dataset may provide the following
insights:
◦ When the resulting misclassification error is high (approaching 50%, i.e., no
better than chance), the variables are basically independent and no
interesting structure exists
◦ Otherwise, the dependency structure can be further studied by looking at
the scaling coordinates and exploiting the proximity matrix in other ways
◦ For instance, the resulting proximity matrix can be used as an important
starting point for the subsequent hierarchical clustering analysis
 Recall that the proximity measures are invariant to monotone
transformations and naturally support categorical variables
 The same missing value imputation procedure as before can now
be employed
 These techniques work extremely well for small datasets

 We generated a synthetic dataset based on the
prostate data
 The resulting dataset still has 111 predictors but
twice the number of records- the first half being
the exact replica of the original data
 The final error is only 0.2% which is an indication of
a very strong dependency among the predictors

 (insert graph)
 The resulting plots resemble what we had before
 However, this distance is in terms of how
dependent the predictors are, whereas previously it
was in terms of having the same target class
 In view of this, the non-cancerous tissue (green)
appears to stand apart from the cancerous

 + Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
 + Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics
Department, University of California.
 + Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial
Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.
 + Dietterich, T. (1998). An experimental comparison of three methods for
constructing ensembles of decision trees: Bagging, Boosting, and Randomization.
Machine Learning, 40, 139-158.
 + Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm.
In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth National
Conference, Morgan Kaufmann, pp. 148-156.
 + Friedman, J.H. (1999). RandomForests. Stanford: Statistics Department, Stanford
University.
 + Friedman, J.H. (1999). Greedy function approximation: a gradient boosting
machine. Stanford: Statistics Department, Stanford University.
 + Heath, D., Kasif, S., and Salzberg, S. (1993) k-dt: A multi-tree learning method.
Proceedings of the Second International Workshop on Multistrategy Learning,
1002-1007, Morgan Kaufman: Chambery, France.
 + Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt,
T., Kanal, L., and Lemmer, J., eds. Uncertainty in Artificial Intelligence 4, North-
Holland, 327-335.
