An Introduction to TreeNet™

              Salford Systems
      http://www.salford-systems.com
        golomi@salford-systems.com
Mikhail Golovnya, Dan Steinberg, Scott Cardell
   New approaches to machine learning/function
    approximation, developed by Jerome H. Friedman at
    Stanford University

    ◦ Co-author of CART® with Breiman, Olshen and Stone
    ◦ Author of MARS™, PRIM, Projection Pursuit

   Good for classification and regression problems

   Builds on the notions of committees of experts
    and boosting but is substantially different in
    implementation details
   Stagewise function approximation in which each stage models
    the residuals from the previous stage's model
     ◦ Conventional boosting models the original target at each stage

   Each stage uses a very small tree, as small as two nodes and
    typically in the range of 4-8 nodes
    ◦ Conventional bagging and boosting use full size trees and even
      massively large trees

   Each stage learns from a fraction of the available training data:
    typically less than 50% to start, falling to 20% or less by the
    last stage

   Each stage learns only a little: the contribution of each new tree is
    severely down-weighted (learning rate is typically 0.10 or less)

   Focus in classification is on points near the decision boundary;
    ignore points far from the boundary even if they are on the wrong
    side
   Built on CART trees and thus
    ◦ Immune to outliers

    ◦ Handles missing values automatically

    ◦ Selects variables

    ◦ Results invariant with respect to monotone transformations of variables

   Trains very rapidly: many small trees do not take much
    longer to run than one large tree

   Resistant to overtraining - generalizes very well

   Can be remarkably accurate with little effort

   BUT resulting model may be very complex
   An intuitive introduction

   TreeNet Mathematical Basics

    ◦ Specifications of the TreeNet model as a series expansion

    ◦ Non-parametric approach to steepest descent optimization

   TreeNet at work

    ◦ Small trees, learning rates, sub-sample fractions, regression types

    ◦ Reading the output: reports and diagnostics

   Comparing to AdaBoost and other methods
   Consider the basic problem of estimating continuous
    outcome y based on a vector of predictors X

   Running a step-wise multiple linear regression will produce
    an estimate f1 (X) and associated residuals

   A simple intuitive idea: run a second-stage regression model
    to produce an estimate of the residuals f2(X) and the
    associated updated residuals r2 = y - f1 - f2

   Repeating this process multiple times results in the following
    series expansion: y = f1 + f2 + f3 + …
   The above idea can be easily implemented

   Unfortunately, the direct implementation suffers from
    overfitting

   The residuals from the previous model essentially communicate
    information about where this model fails the most- hence, the
    next stage model effectively tries to improve the previous model
    where it failed

   This is generally known as boosting

   We may want to replace individual regressions with something
    simpler- regression trees, for example

   It is not yet clear whether this simple idea actually works, nor
    how to generalize it to various types of loss functions or to
    classification
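
   As a concrete illustration, here is a minimal Python sketch of the
   residual-fitting idea, using small regression trees as the stage models
   (as suggested above); the names and settings are illustrative, not
   TreeNet's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_on_residuals(X, y, n_stages=100, max_leaf_nodes=6):
    """Stagewise approximation: each stage fits a small regression
    tree to the residuals left over by the stages before it."""
    stages = []
    residual = np.asarray(y, dtype=float)
    for _ in range(n_stages):
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, residual)
        stages.append(tree)
        residual = residual - tree.predict(X)  # r_k = y - (f_1 + ... + f_k)
    return stages

def predict(stages, X):
    # The final model is the series expansion y ≈ f_1 + f_2 + f_3 + ...
    return sum(tree.predict(X) for tree in stages)
```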
   For any given set of inputs X we want to predict some
    outcome y

   Thus we want to construct a “nice” function f(X) which in turn
    can be used to express an estimate of y

   We need to define how “nice” can be measured
   In regression, when y is continuous, the easiest approach is to
    assume that f(X) itself is the estimate of y

   We may then define the loss function as the loss incurred
    when y is estimated by f(X)

   For example, least squares loss (LS) is defined as L(y,f) = (y - f)^2

   Formally, a “nicely” defined f(X) will have the smallest
    expected loss (over the entire population) within the
    boundaries of its construction (for example, in multiple linear
    regressions, f(X) belongs to the class of linear functions)
   In reality, we have a set of N observed pairs (x,y) from the
    population, not the entire population

   Hence, the expected loss E[L(y, f(X))] can be replaced with the
    estimate R = (1/N) Σ L(y_i, f_i)

   Here f_i = f(x_i)

   The problem thus reduces to finding a function f(X) that
    minimizes R

   Unfortunately, classification will demand additional treatment
   Consider binary classification and assume that y is coded as +1
    or -1

   The most detailed solution would then give us the associated
    probabilities p(y)

   Since probabilities are naturally constrained to the [0,1] interval,
    we assume that the function f(X) is transformed:
        p(y) = 1/(1 + exp(-2yf))

   Note that p(+1)+p(-1)=1

   The “trick” here is finding an unconstrained estimate f instead of
    a constrained estimate p

   Also note that f is simply half the log-odds of y = +1
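
   In code, the link between the unconstrained score f and the constrained
   probability p is just the stated transform and its inverse (a small
   sketch of the formulas above):

```python
import numpy as np

def prob_from_score(f, y=1):
    """p(y) = 1 / (1 + exp(-2*y*f)) for y in {+1, -1}; note p(+1) + p(-1) = 1."""
    return 1.0 / (1.0 + np.exp(-2.0 * y * f))

def score_from_prob(p):
    """Inverse link: f = 0.5 * log(p / (1 - p)), half the log-odds of y = +1."""
    return 0.5 * np.log(p / (1.0 - p))
```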
   (insert graph)

   This graph shows the one-to-one correspondence between f
    and p for y=+1

   Note that the most significant probability change occurs when
    f is between -3 and +3
   Again, the main question is what “nice” f means given that we observed
    N pairs (x,y) from the population

   Approaching this problem from the maximum likelihood point of view,
    one may show that the negative log-likelihood in this case becomes

   R = (1/N) Σ log(1 + exp(-2 y_i f_i))

   The problem once again reduces to finding f that minimizes R above

   We could obtain the same result formally by introducing a special loss
    function for classification: L(y,f) = log(1 + exp(-2yf))

   The above likelihood considerations show a “natural” way to arrive at
    such a peculiar loss function
   Other approaches to defining the loss functions for binary
    classification are possible

   For example, by throwing away the log term in the previous
    equation one arrives at the following loss: L = exp(-2yf)

   It is possible to show that this loss function is effectively used
    in the “classical” AdaBoost algorithm

   AdaBoost could be considered a predecessor of gradient
    boosting; we will defer the comparison until later
   To summarize we are looking for a function
    f(X) that minimizes the estimate of loss

   The typical loss functions are

    ◦ LS regression: L(y,f) = (y - f)^2
    ◦ LAD regression: L(y,f) = |y - f|
    ◦ Binary classification: L(y,f) = log(1 + exp(-2yf))
   The function f(X) is introduced as a known function of a
    fixed set of unknown parameters

   The problem then reduces to finding a set of optimal
    parameter estimates using non-linear optimization
    techniques

   Multiple linear regression and logistic regression: f(X) is a
    linear combination of fixed predictors; parameters being
    the intercept term and the slope coefficients

   Major problem: the function and predictors need to be
    specified beforehand - this usually results in a lengthy
    trial-and-error process
   Construct f(X) using a stage-wise approach

   Start with a constant, then at each stage adjust the values of
    f(X) in various regions of data

   It is important to keep the adjustment rate low- the resulting
    model will become smoother and usually less subject to
    overfitting

   Note that we are effectively treating the values f_i = f(x_i) at all
    individual observed data points as separate parameters
   More specifically, assume that we have gone through k-1
    stages and obtained the current version f_{k-1}(X)

   We want to construct an updated version f_k(X) resulting in
    a smaller value of R

   Treating the individual values f_i = f_{k-1}(x_i) as parameters, we
    proceed by computing the anti-gradient g_k = -∇R, with
    components g_ki = -∂R/∂f_i

   The individual components mark the “directions” in which the
    individual f_i must be changed to obtain a smaller R

   To induce smoothness, let's limit our “freedom” by allowing
    only M (a small number, say between 2 and 10) distinct
    constant adjustments at any given stage
   The optimal strategy is then to group the individual
    components g_ki into M mutually exclusive groups, such
    that the variance within each group is minimized

   But this is equivalent to growing a fixed-size (M terminal
    nodes) regression tree using g_k as the target

   Suppose we found M mutually exclusive subsets S_kj of cases,
    j = 1, …, M

   The constant adjustments a_kj are computed to minimize the node
    contributions to the loss: Σ over i ∈ S_kj of L(y_i, f_{k-1}(x_i) + a_kj)

   Finally, the updated f(X) is f_k(X) = f_{k-1}(X) + Σ_j a_kj I(X ∈ S_kj)
   For the given loss function L[y, f(X)], M, and MaxTrees
     ◦ Make an initial guess f(X) = f_0

     ◦ For k = 0 to MaxTrees-1

     ◦ Compute the anti-gradient G_k by taking the derivative of the loss with
       respect to f(X) and substituting y and the current f_k(X)

     ◦ Fit an M-node regression tree to the components of the negative gradient
       - this will partition observations into M mutually exclusive groups

     ◦ Find the within-node updates a_kj by performing M univariate optimizations
       of the node contributions to the estimated loss

     ◦ Do the update f_{k+1}(X) = f_k(X) + Σ_j a_kj I(X ∈ S_kj)

    ◦ End for
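
   A sketch of this generic loop in Python (a loose reading of the algorithm
   above, not Salford's implementation: the loss enters only through the two
   supplied functions, and the least-squares instance is shown at the bottom):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_gradient, node_update, f0, M=6, max_trees=200):
    """Generic gradient-boosting loop: fit an M-node tree to the
    anti-gradient, then optimize one constant update per node."""
    f = np.full(len(y), f0, dtype=float)
    stages = []
    for _ in range(max_trees):
        g = neg_gradient(y, f)                 # components of -dR/df_i
        tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X, g)
        leaf = tree.apply(X)                   # node membership per case
        updates = {j: node_update(y, f, leaf == j) for j in np.unique(leaf)}
        f = f + np.array([updates[j] for j in leaf])
        stages.append((tree, updates))
    return stages

# Least-squares instance: anti-gradient = residual, node update = node mean
ls_boost = lambda X, y: gradient_boost(
    X, y,
    neg_gradient=lambda y, f: y - f,
    node_update=lambda y, f, in_node: np.mean((y - f)[in_node]),
    f0=np.mean(y),
)
```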
   For L[y, f(X)] = (y - f)^2, M, and MaxTrees

   Initial guess f(X) = f_0 = mean(y)

   For k = 0 to MaxTrees-1

   The anti-gradient component is g_i = y_i - f(x_i), which is the traditional
    definition of the current residual

   Fit an M-node regression tree to the current residuals - this will partition
    observations into M mutually exclusive groups

   The within-node updates a_kj simply become node averages of the current
    residuals

   Do the update: f(X) ← f(X) + Σ_j a_kj I(X ∈ S_kj)
   End for
   For L[y, f(X)] = |y - f|, M, and MaxTrees

   Initial guess f(X) = f_0 = median(y)

   For k = 0 to MaxTrees-1

   The anti-gradient component is g_i = sign(y_i - f(x_i)), which is the sign
    of the current residual

   Fit an M-node regression tree to the signs of the current residuals - this
    will partition observations into M mutually exclusive groups

   The within-node updates a_kj now become node medians of the current
    residuals

   Do the update: f(X) ← f(X) + Σ_j a_kj I(X ∈ S_kj)
   End for
   For L[y, f(X)] = log[1 + exp(-2yf)], M, and MaxTrees

   Initial guess f(X) = f_0 = half the log-odds of y = +1

   For k = 0 to MaxTrees-1

   Recall that g_i = 2 y_i / (1 + exp(2 y_i f(x_i))); we call these generalized
    residuals

   Fit an M-node regression tree to the generalized residuals - this will
    partition observations into M mutually exclusive groups

   The within-node updates a_kj are somewhat more complicated; in Friedman's
    derivation they are one-step Newton estimates,
    a_kj = Σ g_i / Σ |g_i| (2 - |g_i|), where all sums are taken over the
    cases in the node

   Do the update: f(X) ← f(X) + Σ_j a_kj I(X ∈ S_kj)
   End for
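
   These two functions plug the binary-logistic case into the generic sketch
   above; the node formula is the one-step Newton estimate attributed to
   Friedman, restated here as an assumption since the slide's equation did
   not survive extraction:

```python
import numpy as np

def logistic_neg_gradient(y, f):
    """Generalized residuals g_i = 2*y_i / (1 + exp(2*y_i*f_i)), y in {+1,-1}."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * f))

def logistic_node_update(y, f, in_node):
    """One-step Newton update within a node:
    a = sum(g) / sum(|g| * (2 - |g|)) over the node's cases."""
    g = logistic_neg_gradient(y, f)[in_node]
    return g.sum() / (np.abs(g) * (2.0 - np.abs(g))).sum()

# Usage with the earlier sketch, starting from half the log-odds of y = +1:
# p = np.mean(y == 1)
# stages = gradient_boost(X, y, logistic_neg_gradient, logistic_node_update,
#                         f0=0.5 * np.log(p / (1 - p)))
```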
   Consider the following simple data set with
    single predictor X and 1000 observations
   Here and in the following slides negative
    response observations are marked in blue
    whereas positive response observations are
    marked in red
   The general tendency is to have positive
    response in the middle of the range of X
   (insert table)
   The dataset was generated using the
    following model described by f(X) and the
    corresponding p(X) for y=+1
   (insert graphs)
   (insert graph)

   TreeNet fits a constant probability of 0.55

   The residuals are positive for y=+1 and
    negative for y=-1
   (insert graph)

   The dataset was partitioned into 3 regions:
    low X (negative adjustment), middle X
    (positive), and large X (negative)

   The residuals “reflect” the directions of the
    adjustments
   (insert graph)

   This graph shows the predicted f(X) after 1000
    iterations and a very small learning rate of
    0.002

   Note how the true shape was nearly perfectly
    recovered
   The purpose of running a regression tree is to group observations into
    homogeneous subsets

   Once we have the right partition, the adjustments for each terminal node
    are computed separately to optimize the given loss function - these are
    generally different from the predictions generated by the regression tree
    itself (they are the same only for the LS loss)

   Thus, the procedure is no longer as simple as the initial intuitive
    recursive regression approach we started with

   Nonetheless, the tree is used to define the actual form of f(X) over the
    range of X and not only for the individual data points observed

   This becomes important in the final model deployment and scoring
   Up to this point we guarded against overfitting only by allowing a small
    number of adjustments at each stage

   We may further guard against overfitting by forcing the adjustments to be
    smaller

   This is done by introducing a new parameter called “shrinkage” (learning
    rate) that is set to a constant value between 0 and 1

   Small learning rates result in smoother models: a rate of 0.1 means that
    TreeNet will take 10 times more iterations to extract the same signal -
    more variables will be tried, finer partitions will result, smaller boundary
    jumps will take place

   Ideally, one might ultimately want to keep the learning rate close to zero
    and the number of stages (trees) close to infinity

   However, rates below 0.001 usually become impractical
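
   Relative to the loop sketched earlier, shrinkage is a one-line change to
   the stage update (a sketch; the function and parameter names are
   illustrative):

```python
import numpy as np

def shrunken_update(f, node_updates, leaf, learn_rate=0.1):
    """Apply the stage's node adjustments damped by the learning rate
    (0 < learn_rate <= 1), instead of applying them at full strength."""
    return f + learn_rate * np.array([node_updates[j] for j in leaf])
```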
   (insert graph)

   This graph shows the predicted f(X) after 100
    iterations and a learning rate of 1

   Note the roughness of the shape and the
    presence of abrupt strong jumps
   (insert graph)

   This graph shows predicted f(X) after 1000
    iterations and a very small learning rate of
    0.0002

   Note how the true shape was nearly perfectly
    recovered

   It may be further improved
   At each stage, instead of working with the entire learn dataset,
    consider taking a random sample of a fixed size

   Typical sampling rates are set to 50% of the learn data (the
    default) and even smaller for very large datasets

   In the long run, the entire learn dataset is exploited, but the
    running time is reduced by a factor of two with the 50%
    sampling rate

   Sampling forces TreeNet to “rethink” optimal partition points
    from run to run due to random fluctuations of the residuals

   This, combined with the shrinkage and a large number of
    iterations, results in an overall improvement of the captured
    signal shape
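
   Again relative to the earlier sketch, per-stage sampling means fitting each
   stage's tree on a random fraction of the learn data while still assigning
   nodes and updating f for every case (a sketch; the 50% rate mirrors the
   stated default):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_stage_tree_on_sample(X, g, M=6, rate=0.5, rng=None):
    """Fit this stage's M-node tree to the anti-gradient g using only a
    random fraction of the learn data; tree.apply is still run on ALL cases."""
    rng = rng or np.random.default_rng()
    in_sample = rng.random(len(g)) < rate
    return DecisionTreeRegressor(max_leaf_nodes=M).fit(X[in_sample], g[in_sample])
```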
   (insert graph)

   This graph shows predicted f(X) after 1000
    stages, learning rate of 0.002, and 50%
    sampling

   Note the minor fluctuations in the average
    loss

   The resulting model is nice and smooth but
    there is still room for improvement
   (insert graph)

   All previous runs allowed as few as 10 cases per
    individual region/node (the default)

   Here we have increased this limit to 50

   This immediately resulted in an even smoother
    shape

   In practice, various node size limits should be
    tried
   In classification problems, it is possible to further
    reduce the amount of data processed at each stage

   We ignore data points “too far” from the decision
    boundary to be usefully considered

    ◦ Points classified correctly by a wide margin are ignored (just like
      in conventional boosting)

    ◦ Badly misclassified data points are also ignored (very
      different from conventional boosting)

    ◦ The focus is on the cases most difficult to classify correctly:
      those near the decision boundary
   (insert graph)

   2-dimensional predictor space

   Red dots represent cases with +1 target

   Green dots represent cases with -1 target

   Black curve represents the decision boundary
   The remaining slides present TreeNet runs on real data as
    well as give examples of GUI controls

   We start with the Boston Housing dataset to illustrate
    regression

   Then we proceed with the Cell Phone dataset to illustrate
    classification
   (insert graph)
   (insert graph)
   (insert graph)
   Essentially a regression tree with 2 terminal
    nodes
   (insert table)
   CART run with TARGET=MV

   PREDICTORS= LSTAT

   LIMIT DEPTH= 1

   Save residuals as RESI
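
   A hedged Python equivalent of this first-stage stump run (the file name is
   hypothetical; any copy of the Boston Housing data with columns MV and
   LSTAT will do):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("boston.csv")          # hypothetical path to the data

X = df[["LSTAT"]].values
y = df["MV"].values

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)  # LIMIT DEPTH = 1
df["RESI"] = y - stump.predict(X)                     # save residuals as RESI
```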
