The document provides an introduction to TreeNet, a machine learning algorithm developed by Jerome Friedman. TreeNet builds regression and classification models in a stagewise fashion, using small regression trees at each stage to model residuals from the previous stage. It employs techniques like learning small trees, subsampling data, and using a small learning rate to minimize overfitting. TreeNet models can be very accurate while remaining resistant to overfitting.
Introduction to TreeNet (2004)
1. An Introduction to TreeNet™
Salford Systems
http://www.salford-systems.com
golomi@salford-systems.com
Mikhail Golovnya, Dan Steinberg, Scott Cardell
2. New approaches to machine learning/function approximation developed
by Jerome H. Friedman at Stanford University
◦ Co-author of CART® with Breiman, Olshen and Stone
◦ Author of MARS™, PRIM, Projection Pursuit
Good for classification and regression problems
Builds on the notions of committees of experts and boosting but is
substantially different in implementation details
3. Stagewise function approximation in which each stage models the
residuals from the previous stage's model
◦ Conventional boosting models the original target at each stage
Each stage uses a very small tree, as small as two nodes and
typically in the range of 4-8 nodes
◦ Conventional bagging and boosting use full-size trees and even
massively large trees
Each stage learns from a fraction of the available training data,
typically less than 50% to start and falling to 20% or less by the
last stage
Each stage learns only a little: the contribution of each new tree is
severely down-weighted (the learning rate is typically 0.10 or less)
Focus in classification is on points near the decision boundary;
points far from the boundary are ignored even if they are on the
wrong side
4. Built on CART trees and thus
◦ Immune to outliers
◦ Handles missing values automatically
◦ Selects variables
◦ Results are invariant with respect to monotone transformations of the variables
Trains very rapidly: many small trees do not take much longer to run
than one large tree
Resistant to overtraining - generalizes very well
Can be remarkably accurate with little effort
BUT the resulting model may be very complex
5. An intuitive introduction
TreeNet Mathematical Basics
◦ Specification of the TreeNet model as a series expansion
◦ Non-parametric approach to steepest descent optimization
TreeNet at work
◦ Small trees, learning rates, sub-sample fractions, regression types
◦ Reading the output: reports and diagnostics
Comparison with AdaBoost and other methods
6. Consider the basic problem of estimating a continuous
outcome y based on a vector of predictors X
Running a step-wise multiple linear regression produces an estimate
f1(X) and the associated residuals r1 = y - f1
A simple intuitive idea: run a second-stage regression model on the
residuals to produce an estimate f2(X) and the associated updated
residuals r2 = y - f1 - f2
Repeating this process multiple times results in the following
series expansion: y = f1 + f2 + f3 + …
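A minimal sketch of this intuition, assuming scikit-learn regression
trees and a small synthetic one-predictor dataset (both assumptions made
for illustration; this is not TreeNet itself):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(1000, 1))                # single predictor, 1000 cases
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)    # synthetic target

f = np.full(len(y), y.mean())        # stage 1: a constant "model"
stages = []
for k in range(100):                 # stages 2, 3, ...: model the residuals
    r = y - f                        # residuals left by the model so far
    tree = DecisionTreeRegressor(max_leaf_nodes=6, random_state=0).fit(X, r)
    stages.append(tree)
    f = f + tree.predict(X)          # y is approximated by f1 + f2 + f3 + ...

print("training MSE:", np.mean((y - f) ** 2))

Left to run long enough with no safeguards, this chase-the-residuals loop
will eventually fit noise, which is exactly the overfitting concern raised
on the next slide.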
7. The above idea can be easily implemented
Unfortunately, the direct implementation suffers from overfitting
The residuals from the previous model essentially communicate
information about where that model fails the most - hence, the
next-stage model effectively tries to improve the previous model
where it failed
This is generally known as boosting
We may want to replace the individual regressions with something
simpler - regression trees, for example
It is not yet known whether this simple idea actually works, nor is it
clear how to generalize it to various types of loss functions or to
classification
8. For any given set of inputs X we want to predict some
outcome y
Thus we want to construct a “nice” function f(X) which in turn
can be used to express an estimate of y
We need to define how “nice” can be measured
9. In regression, when y is continuous, the easiest approach is to assume
that f(X) itself is the estimate of y
We may then define the loss function as the loss incurred
when y is estimated by f(X)
For example, the least-squares (LS) loss is defined as L(y, f) = (y - f)^2
Formally, a “nicely” defined f(X) will have the smallest
expected loss (over the entire population) within the
boundaries of its construction (for example, in multiple linear
regression, f(X) belongs to the class of linear functions)
10. In reality, we have a set of N observed pairs (x, y) from the
population, not the entire population
Hence, the expected loss E[L(y, f(X))] can be replaced with a sample
estimate R
Here fi = f(xi)
The problem thus reduces to finding a function f(X) that minimizes R
Unfortunately, classification will demand additional treatment
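The estimate R referred to just above presumably takes the standard
empirical-risk form (the slide's own equation did not survive extraction):

\[
E\bigl[L(y, f(X))\bigr] \;\approx\; R \;=\; \frac{1}{N}\sum_{i=1}^{N} L(y_i, f_i),
\qquad f_i = f(x_i).
\]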
11. Consider binary classification and assume that y is coded as +1
or -1
The most detailed solution would then give us the associated
probabilities p(y)
Since probabilities are naturally constrained to the [0, 1] interval,
we assume that the function f(X) is transformed:
p(y) = 1/(1 + exp(-2fy))
Note that p(+1) + p(-1) = 1
The “trick” here is finding an unconstrained estimate f instead of the
constrained estimate p
Also note that f is simply half the log-odds of y = +1
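Writing the link out makes the half log-odds remark concrete (a
restatement of the formula above, not new material):

\[
p(+1\mid X) = \frac{1}{1+e^{-2f(X)}}, \qquad
p(-1\mid X) = \frac{1}{1+e^{+2f(X)}}, \qquad
p(+1\mid X)+p(-1\mid X)=1,
\]
\[
f(X) \;=\; \tfrac{1}{2}\,\log\frac{p(+1\mid X)}{p(-1\mid X)} .
\]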
12. (insert graph)
This graph shows the one-to-one correspondence between f
and p for y=+1
Note that the most significant probability change occurs when
f is between -3 and +3
13. Again, the main question is what a “nice” f means given that we have
observed N pairs (x, y) from the population
Approaching this problem from the maximum likelihood point of view,
one may show that the negative log-likelihood takes the form shown
below
The problem once again reduces to finding f that minimizes R
We could obtain the same result formally by introducing a special loss
function for classification (also shown below)
The above likelihood considerations show a “natural” way to arrive at
such a peculiar loss function
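The omitted equations can be reconstructed from the link above: each
observation contributes -log p(yi | xi) to the negative log-likelihood,
which gives the logistic loss directly:

\[
-\sum_{i=1}^{N}\log p(y_i\mid x_i) \;=\; \sum_{i=1}^{N}\log\!\bigl(1+e^{-2y_i f_i}\bigr),
\qquad\text{so}\qquad
L(y,f) \;=\; \log\!\bigl(1+e^{-2yf}\bigr)
\]

is the classification loss whose sample average is the R to be minimized.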
14. Other approaches to defining the loss function for binary
classification are possible
For example, by dropping the log term in the previous
equation one arrives at the loss L = exp(-2yf)
It is possible to show that this loss function is effectively the one
used in the “classical” AdaBoost algorithm
AdaBoost can be considered a predecessor of gradient
boosting; we defer the comparison until later
15. To summarize, we are looking for a function
f(X) that minimizes the estimate of the loss
The typical loss functions are listed below
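Presumably the losses that belong here are the three used in the
remainder of the deck (slides 21-23):

\[
\text{LS:}\;\; L(y,f)=(y-f)^2, \qquad
\text{LAD:}\;\; L(y,f)=|y-f|, \qquad
\text{Logistic:}\;\; L(y,f)=\log\!\bigl(1+e^{-2yf}\bigr).
\]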
16. The function f(X) is introduced as a known function of a
fixed set of unknown parameters
The problem then reduces to finding a set of optimal
parameter estimates using non-linear optimization techniques
Multiple linear regression and logistic regression: f(X) is a
linear combination of fixed predictors, the parameters being
the intercept term and the slope coefficients
Major problem: the function and the predictors need to be
specified beforehand - this usually results in a lengthy
trial-and-error process
17. Construct f(X) using a stage-wise approach
Start with a constant, then at each stage adjust the values of
f(X) in various regions of the data
It is important to keep the adjustment rate low - the resulting
model will be smoother and usually less subject to overfitting
Note that we are effectively treating the values fi = f(xi) at all
individual observed data points as separate parameters
18. More specifically, assume that we have gone through k-1
stages and obtained the current version fk-1(X)
We want to construct an updated version fk(X) resulting in
a smaller value of R
Treating the individual values fi = fk-1(xi) as parameters, we
proceed by computing the anti-gradient of R with respect to them
(spelled out below)
The individual components gk,i mark the “directions” in which the
individual fk-1(xi) must be changed to obtain a smaller R
To induce smoothness, let's limit our “freedom” by allowing
only M (a small number, say between 2 and 10) distinct
constant adjustments at any given stage
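Spelled out, the anti-gradient components referred to above take the
following form (a reconstruction of the omitted equation):

\[
g_{k,i} \;=\; -\left.\frac{\partial L(y_i, f)}{\partial f}\right|_{f\,=\,f_{k-1}(x_i)},
\qquad i = 1,\dots,N .
\]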
19. The optimal strategy is then to group the individual
components gk,i into M mutually exclusive groups, such
that the variance within each group is minimized
But this is equivalent to growing a fixed-size (M terminal
nodes) regression tree using gk,i as the target
Suppose this tree gives us M subsets Sk1, ..., SkM of cases
The constant adjustments akj are computed to minimize the loss
within each node
Finally, the updated f(X) adds these piecewise-constant adjustments
to fk-1(X) (formulas below)
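A reconstruction of the omitted node-update and stage-update formulas,
with S_{k1}, ..., S_{kM} denoting the terminal-node subsets of the
stage-k tree:

\[
a_{kj} \;=\; \arg\min_{a}\;\sum_{x_i\in S_{kj}} L\bigl(y_i,\; f_{k-1}(x_i)+a\bigr),
\qquad j=1,\dots,M,
\]
\[
f_k(X) \;=\; f_{k-1}(X) \;+\; \sum_{j=1}^{M} a_{kj}\, I\bigl(X\in S_{kj}\bigr).
\]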
20. For a given loss function L[y, f(X)], M, and MaxTrees
◦ Make an initial guess f(X) = f0
◦ For k = 0 to MaxTrees-1
◦ Compute the anti-gradient gk by taking the derivative of the loss with
respect to f(X) and substituting y and the current fk(X)
◦ Fit an M-node regression tree to the components of the anti-gradient -
this will partition observations into M mutually exclusive groups
◦ Find the within-node updates akj by performing M univariate optimizations
of the node contributions to the estimated loss
◦ Do the update: fk+1(X) = fk(X) + Σ_j akj I(X in node j)
◦ End for
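A compact sketch of this loop in Python, specialized to the least-squares
case of the next slide (anti-gradient = ordinary residual; within-node
optimum = node mean, which is exactly what a regression tree predicts).
The function name, defaults, and use of scikit-learn trees are
illustrative assumptions, not TreeNet's internals:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_ls(X, y, M=6, max_trees=200, learn_rate=1.0):
    """Stagewise boosting of M-terminal-node trees under LS loss.
    learn_rate < 1 gives the shrinkage discussed on a later slide;
    1.0 reproduces the plain update above."""
    f0 = float(np.mean(y))                     # initial guess f(X) = mean(y)
    f = np.full(len(y), f0)
    trees = []
    for k in range(max_trees):
        g = y - f                              # anti-gradient = current residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X, g)
        f = f + learn_rate * tree.predict(X)   # add the M piecewise-constant updates
        trees.append(tree)

    def score(X_new):
        """Deployed model: f0 plus the accumulated stage adjustments."""
        out = np.full(X_new.shape[0], f0)
        for tree in trees:
            out = out + learn_rate * tree.predict(X_new)
        return out

    return score

Switching to another loss only changes two lines: the anti-gradient
computed at each stage and the within-node update, as slides 22 and 23
spell out.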
21. For L[y, f(X)] = (y - f)^2, M, and MaxTrees
Initial guess f(X) = f0 = mean(y)
For k = 0 to MaxTrees-1
The anti-gradient component is gk,i = yi - fk(xi), which is the traditional
definition of the current residual
Fit an M-node regression tree to the current residuals - this will partition
observations into M mutually exclusive groups
The within-node updates akj simply become the node averages of the current
residuals
Do the update: fk+1(X) = fk(X) + Σ_j akj I(X in node j)
End for
22. For L[y, f(X)] = |y - f|, M, and MaxTrees
Initial guess f(X) = f0 = median(y)
For k = 0 to MaxTrees-1
The anti-gradient component is gk,i = sign(yi - fk(xi)), the sign of the
current residual
Fit an M-node regression tree to the signs of the current residuals - this
will partition observations into M mutually exclusive groups
The within-node updates akj now become the node medians of the current
residuals
Do the update: fk+1(X) = fk(X) + Σ_j akj I(X in node j)
End for
23. For L[y, f(X)] = log[1 + exp(-2yf)], M, and MaxTrees
Initial guess f(X) = f0 = half the log-odds of y = +1
For k = 0 to MaxTrees-1
Compute the anti-gradient components gk,i - we call these the generalized
residuals
Fit an M-node regression tree to the generalized residuals - this will
partition observations into M mutually exclusive groups
The within-node updates akj are somewhat more complicated, with all
measures taken with respect to the node (see below)
Do the update: fk+1(X) = fk(X) + Σ_j akj I(X in node j)
End for
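The omitted formulas match Friedman's published two-class gradient
boosting algorithm; the node update below is that published
single-Newton-step form, offered as a reconstruction rather than the
slide's exact expression:

\[
g_{k,i} \;=\; -\left.\frac{\partial}{\partial f}\log\!\bigl(1+e^{-2y_i f}\bigr)\right|_{f=f_{k-1}(x_i)}
\;=\; \frac{2y_i}{1+e^{\,2y_i f_{k-1}(x_i)}},
\]
\[
a_{kj} \;=\; \frac{\sum_{x_i\in S_{kj}} g_{k,i}}
                  {\sum_{x_i\in S_{kj}} |g_{k,i}|\,\bigl(2-|g_{k,i}|\bigr)} .
\]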
24. Consider the following simple dataset with a
single predictor X and 1000 observations
Here and in the following slides negative
response observations are marked in blue
whereas positive response observations are
marked in red
The general tendency is to have positive
response in the middle of the range of X
(insert table)
25. The dataset was generated using the
following model described by f(X) and the
corresponding p(X) for y=+1
(insert graphs)
26. (insert graph)
TreeNet fits a constant probability of 0.55
The residuals are positive for y=+1 and
negative for y=-1
27. (insert graph)
The dataset was partitioned into 3 regions:
low X (negative adjustment), middle X
(positive), and large X (negative)
The residuals “reflect” the directions of the
adjustments
28. (insert graph)
This graph shows the predicted f(X) after 1000
iterations and a very small learning rate of
0.002
Note how the true shape was nearly perfectly
recovered
29. The purpose of running a regression tree is to group observations into
homogeneous subsets
Once we have the right partition, the adjustments for each terminal node
are computed separately to optimize the given loss function - these are
generally different from the predictions generated by the regression tree
itself (they are the same only for the LS loss)
Thus, the procedure is no longer as simple as the initial intuitive
recursive regression approach we started with
Nonetheless, the tree is used to define the actual form of f(X) over the
whole range of X and not only at the individual observed data points
This becomes important in the final model deployment and scoring
30. Up to this point we guarded against overfitting only by allowing a small
number of adjustments at each stage
We may guard further by forcing the adjustments themselves to be
smaller
This is done by introducing a new parameter called “shrinkage” (the
learning rate) that is set to a constant value between 0 and 1
Small learning rates result in smoother models: a rate of 0.1 means that
TreeNet will take 10 times more iterations to extract the same signal -
more variables will be tried, finer partitions will result, and smaller
boundary jumps will take place
Ideally, one might ultimately want to keep the learning rate close to zero
and the number of stages (trees) close to infinity
However, rates below 0.001 usually become impractical
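In symbols, shrinkage simply scales each stage's adjustment before it is
added (a reconstruction using the notation of the earlier stage-update
formula, with ν denoting the learning rate):

\[
f_k(X) \;=\; f_{k-1}(X) \;+\; \nu\sum_{j=1}^{M} a_{kj}\, I\bigl(X\in S_{kj}\bigr),
\qquad 0<\nu\le 1 .
\]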
31. (insert graph)
This graph shows the predicted f(X) after 100
iterations and a learning rate of 1
Note the roughness of the shape and the
presence of abrupt strong jumps
32. (insert graph)
This graph shows predicted f(X) after 1000
iterations and a very small learning rate of
0.0002
Note how the true shape was nearly perfectly
recovered
It may be further improved
33. At each stage, instead of working with the entire learn dataset,
consider taking a random sample of a fixed size
Typical sampling rates are set to 50% of the learn data (the
default), and even smaller rates are used for very large datasets
In the long run, the entire learn dataset is exploited, but the
running time is reduced by a factor of two with the 50%
sampling rate
Sampling forces TreeNet to “rethink” the optimal partition points
from stage to stage due to random fluctuations of the residuals
This, combined with shrinkage and a large number of
iterations, results in an overall improvement of the captured
signal shape
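TreeNet itself is a commercial product, but the knobs discussed on the
last few slides have rough open-source analogues. A sketch using
scikit-learn's GradientBoostingRegressor, offered as an approximate
mapping rather than an exact TreeNet equivalent:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data stands in for a real learn dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,      # number of stages (trees)
    learning_rate=0.002,    # shrinkage: small rates give smoother models
    subsample=0.5,          # fraction of the learn data sampled at each stage
    max_leaf_nodes=6,       # small trees: M terminal nodes per stage
    min_samples_leaf=10,    # minimum cases per node; a later slide discusses raising this
    random_state=0,
).fit(X, y)

# Training loss at each stage (in-bag loss when subsample < 1)
print(model.train_score_[-1])

With subsample below 1, the model's oob_improvement_ attribute and a
held-out test set are the usual ways to decide how many stages to keep.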
34. (insert graph)
This graph shows predicted f(X) after 1000
stages, learning rate of 0.002, and 50%
sampling
Note the minor fluctuations in the average
loss
The resulting model is nice and smooth but
there is still room for improvement
35. (insert graph)
All previous runs allowed as few as 10 cases per
individual region/node (the default)
Here we have increased this limit to 50
This immediately results in an even smoother shape
In practice, various node-size limits should be tried
36. In classification problems, it is possible to further
reduce the amount of data processed at each stage
We ignore data points “too far” from the decision
boundary to be usefully considered
◦ Well-classified points are ignored (just as in
conventional boosting)
◦ Badly misclassified data points are also ignored (very
different from conventional boosting)
◦ The focus is on the cases most difficult to classify correctly:
those near the decision boundary
37. (insert graph)
2-dimensional predictor space
Red dots represent cases with +1 target
Green dots represent cases with -1 target
Black curve represents the decision boundary
38. The remaining slides present TreeNet runs on real data as
well as give examples of GUI controls
We start with the Boston Housing dataset to illustrate
regression
Then we proceed with the Cell Phone dataset to illustrate
classification