Pruning The Tree
Growing a tree this way may stop too early: splitting a particular leaf may not lead to an improvement in the sum of squared errors, but if we split anyway, we may find subsequently profitable splits.
For example, suppose that there are two binary covariates,
Xi,1, Xi,2 ∈ {0, 1}, and that
E[Yi | Xi,1 = 0, Xi,2 = 0] = E[Yi | Xi,1 = 1, Xi,2 = 1] = −1
E[Yi | Xi,1 = 0, Xi,2 = 1] = E[Yi | Xi,1 = 1, Xi,2 = 0] = 1
Then splitting on Xi,1 or Xi,2 does not improve the objective function, but once one splits on either of them, the subsequent splits lead to an improvement.
36
This motivates pruning the tree:
• First grow a big tree by using a deliberately small value of the
penalty term, or simply growing the tree till the leaves have a
preset small number of observations.
• Then go back and prune branches or leaves that do not
collectively improve the objective function sufficiently.
37
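As a concrete sketch of the grow-then-prune idea, one can use scikit-learn's cost-complexity pruning: grow a deep tree, then choose the pruning penalty on a held-out sample. The data arrays (X_train, y_train, X_test, y_test) and the leaf-size and penalty settings below are placeholders, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# 1. Grow a deliberately big tree (tiny leaves, no penalty).
big_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
big_tree.fit(X_train, y_train)

# 2. Candidate pruning penalties from the cost-complexity path of the big tree.
path = big_tree.cost_complexity_pruning_path(X_train, y_train)

# 3. Refit at each penalty and keep the one with the smallest test-sample error.
test_mse = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=alpha,
                                   random_state=0).fit(X_train, y_train)
    test_mse.append(np.mean((y_test - pruned.predict(X_test)) ** 2))
best_alpha = path.ccp_alphas[int(np.argmin(test_mse))]
```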
Tree leads to 37 splits
Test sample root-mean-squared error
OLS: 1.4093
LASSO: 0.7379
Tree: 0.7865
38
2.c Boosting
Suppose we have a simple, possibly naive, but easy to compute, way of estimating a regression function, a so-called weak learner.
Boosting is a general approach to repeatedly use the weak learner to get a good predictor for both classification and regression problems. It can be used with many different weak learners: trees, kernels, support vector machines, neural networks, etc.
Here I illustrate it using regression trees.
39
Suppose g(x|X, Y) is based on a very simple regression tree, using only a single split. So, the algorithm selects a covariate k(X, Y) and a threshold t(X, Y) and then estimates the regression function as
$$
g_1(x \mid X, Y) =
\begin{cases}
\bar{Y}_{\text{left}} & \text{if } x_{k(X,Y)} \le t(X,Y) \\
\bar{Y}_{\text{right}} & \text{if } x_{k(X,Y)} > t(X,Y)
\end{cases}
$$
where
$$
\bar{Y}_{\text{left}} = \frac{\sum_{i: X_{i,k} \le t} Y_i}{\sum_{i: X_{i,k} \le t} 1},
\qquad
\bar{Y}_{\text{right}} = \frac{\sum_{i: X_{i,k} > t} Y_i}{\sum_{i: X_{i,k} > t} 1}.
$$
Not a very good predictor by itself.
40
Define the residual relative to this weak learner:
ε1i = Yi − g1(Xi|X, Y)
Now apply the same weak learner to the new data set (X, ε1).
Grow a second tree g2(X, ε1) based on this data set (with single
split), and define the new residuals as
ε2i = Yi − g1(Xi|X, Y) − g2(Xi|X, ε1)
Re-apply the weak learner to the data set (X, ε2).
After doing this many times you get an additive approximation
to the regression function:
$$
\sum_{m=1}^{M} g_m(x \mid X, \varepsilon_{m-1}) = \sum_{k=1}^{K} h_k(x_k),
\qquad \text{where } \varepsilon_0 = Y.
$$
41
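A minimal numpy sketch of this loop, with a single-split tree (stump) as the weak learner; the exhaustive threshold search and the number of rounds M are illustrative choices.

```python
import numpy as np

def fit_stump(X, r):
    """Weak learner: single-split regression tree fit to the current residuals r."""
    best, best_sse = None, np.inf
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k])[:-1]:           # keep both sides of the split non-empty
            left = X[:, k] <= t
            yl, yr = r[left].mean(), r[~left].mean()
            sse = ((r[left] - yl) ** 2).sum() + ((r[~left] - yr) ** 2).sum()
            if sse < best_sse:
                best, best_sse = (k, t, yl, yr), sse
    return best

def predict_stump(stump, X):
    k, t, yl, yr = stump
    return np.where(X[:, k] <= t, yl, yr)

def boost(X, Y, M=400):
    """Fit M stumps to successive residuals; the predictor is their sum."""
    stumps, resid = [], Y.astype(float)
    for _ in range(M):
        s = fit_stump(X, resid)                     # g_m fit to (X, eps_{m-1})
        stumps.append(s)
        resid = resid - predict_stump(s, X)         # eps_m
    return stumps

def predict_boosted(stumps, X):
    return sum(predict_stump(s, X) for s in stumps)
```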
Had we used a weak learner with two splits, we would have
allowed for second order effects h(xk, xl).
In practice researchers use shallow trees, say with six splits (implicitly allowing for 6th-order interactions), and grow many trees, e.g., 400-500.
Often the depth of the initial trees is fixed in advance in an ad hoc manner (it is difficult to choose many tuning parameters optimally), and the number of trees is based on prediction errors in a test sample (similar in spirit to cross-validation).
42
2.d Bagging (Bootstrap AGGregatING)
Applicable to many ways of estimating a regression function; here applied to trees.
1. Draw a bootstrap sample of size N from the data.
2. Construct a tree gb(x), possibly with pruning, possibly with
data-dependent penalty term.
Estimate the regression function by averaging over bootstrap
estimates:
$$
\frac{1}{B} \sum_{b=1}^{B} g_b(x)
$$
If the basic learner were linear, then bagging would be ineffective. For nonlinear learners, however, it smooths things out and can lead to improvements.
43
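A minimal sketch of bagging with trees; scikit-learn's DecisionTreeRegressor stands in for any tree-growing routine, and the number of bootstrap samples B and the leaf size are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, Y, B=200, seed=0):
    """Fit one tree per bootstrap sample; predict by averaging the B trees."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)            # bootstrap sample of size N
        trees.append(DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], Y[idx]))
    return trees

def predict_bagged(trees, X_new):
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```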
2.e Random Forests (Great general purpose method)
Given data (X, Y), with the dimension of X equal to N ×K, do
the same as bagging, but with a different way of constructing
the tree given the bootstrap sample: Start with a tree with a
single leaf.
1. Randomly select L regressors out of the set of K regressors
2. Select the optimal covariate and threshold among the L regressors
3. If some leaves have more than Nmin units, go back to (1)
4. Otherwise, stop
Average trees over bootstrap samples.
44
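Rather than coding this by hand, one can map the tuning parameters above onto scikit-learn's RandomForestRegressor: L corresponds to max_features, Nmin to min_samples_leaf, and the number of bootstrap trees to n_estimators. The specific values and data arrays below are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# n_estimators: number of bootstrap trees; max_features: the L regressors tried at
# each split; min_samples_leaf: the Nmin stopping rule. Data arrays are placeholders.
rf = RandomForestRegressor(n_estimators=500, max_features=3,
                           min_samples_leaf=5, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
test_rmse = np.sqrt(np.mean((y_test - rf.predict(X_test)) ** 2))
```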
• For bagging and random forests there is recent research suggesting asymptotic normality may hold. It would require using small training samples relative to the overall sample (on the order of N/(ln N)^K, where K is the number of features).
See Wager, Efron, and Hastie (2014) and Wager (2015).
45
2.f Neural Networks / Deep Learning
Goes back to 1990’s, work by Hal White. Recent resurgence.
Model the relation between Xi and Yi through hidden layer(s)
of Zi, with M elements Zi,m:
Zi,m = σ(α0m + α1mXi), for m = 1, . . . , M
Yi = β0 + β1Zi + εi
So, the Yi are linear in a number of transformations of the
original covariates Xi. Often the transformations are sigmoid
functions σ(a) = (1 + exp(−a))^{−1}. We fit the parameters αm
and β by minimizing
$$
\sum_{i=1}^{N} \left( Y_i - g(X_i, \alpha, \beta) \right)^2
$$
46
• Estimation can be hard. Start with α random but close to zero, so that the model is close to linear, and use gradient descent methods.
• Difficulty is to avoid overfitting. We can add a penalty term
$$
\sum_{i=1}^{N} \left( Y_i - g(X_i, \alpha, \beta) \right)^2
+ \lambda \cdot \left( \sum_{k,m} \alpha_{k,m}^2 + \sum_{k} \beta_k^2 \right)
$$
Find optimal penalty term λ by monitoring sum of squared
prediction errors on test sample.
47
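A minimal numpy sketch of fitting a single hidden layer by gradient descent on this penalized objective; the assumption that intercepts are not penalized, and the choices of M, λ, the step size, and the number of steps, are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_single_layer(X, Y, M=10, lam=0.01, lr=0.01, steps=5000, seed=0):
    """One hidden layer of M sigmoid units, fit by gradient descent on the
    penalized sum of squared errors (intercepts are not penalized here)."""
    rng = np.random.default_rng(seed)
    N, K = X.shape
    A = rng.normal(scale=0.1, size=(K + 1, M))      # alphas: intercept + K slopes per unit
    b = rng.normal(scale=0.1, size=M + 1)           # betas: intercept + M slopes
    Xc = np.column_stack([np.ones(N), X])
    for _ in range(steps):
        Z = sigmoid(Xc @ A)                         # N x M hidden layer
        Zc = np.column_stack([np.ones(N), Z])
        resid = Y - Zc @ b
        grad_b = -2 * Zc.T @ resid + 2 * lam * np.r_[0.0, b[1:]]
        dZ = -2 * np.outer(resid, b[1:]) * Z * (1 - Z)
        grad_A = Xc.T @ dZ + 2 * lam * np.vstack([np.zeros(M), A[1:]])
        b -= lr * grad_b / N                        # scaled steps for stability
        A -= lr * grad_A / N
    return A, b
```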
2.g Ensemble Methods, Model Averaging, and Super Learners
Suppose we have M candidate estimators gm(·|X, Y). They can be similar, e.g., all trees, or they can be qualitatively different: some trees, some regression models, some support vector machines, some neural networks. We can try to combine them to get a better estimator, often better than any single algorithm.
Note that we do not attempt to select a single method; rather, we look for weights that may be non-zero for multiple methods.
Most competitions for supervised learning methods have been
won by algorithms that combine more basic methods, often
many different methods.
48
One question is how exactly to combine methods.
One approach is to construct weights α1, . . . , αM by solving,
using a test sample
$$
\min_{\alpha_1,\ldots,\alpha_M} \; \sum_{i=1}^{N} \left( Y_i - \sum_{m=1}^{M} \alpha_m \cdot g_m(X_i) \right)^2
$$
If we have many algorithms to choose from we may wish to
regularize this problem by adding a LASSO-type penalty term:
$$
\lambda \cdot \sum_{m=1}^{M} |\alpha_m|
$$
That is, we restrict the ensemble estimator to be a weighted
average of the original estimators where we shrink the weights
using an L1 norm. The result will be a weighted average that
puts non-zero weights on only a few models.
49
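A sketch of computing the weights on a test sample with the LASSO-type penalty, using scikit-learn's Lasso; here preds is assumed to hold the test-sample predictions of the M candidate estimators, and the penalty level is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# preds: N_test x M matrix; column m holds g_m(X_i) evaluated on the test sample.
stack = Lasso(alpha=0.01, fit_intercept=False)
stack.fit(preds, y_test)
weights = stack.coef_                  # non-zero only for the models that matter
ensemble_pred = preds @ weights
ensemble_rmse = np.sqrt(np.mean((y_test - ensemble_pred) ** 2))
```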
Test sample root-mean-squared error
OLS: 1.4093
LASSO: 0.7379
Tree: 0.7865
Ensemble: 0.7375
Weights ensemble:
OLS 0.0130
LASSO 0.7249
Tree 0.2621
50
3. Unsupervised Learning: Clustering
We have N observations on an M-component vector of features,
Xi, i = 1, . . . , N. We want to find patterns in these data.
Note: there is no outcome Yi here, which gave rise to the term
“unsupervised learning.”
• One approach is to reduce the dimension of the Xi using
principal components. We can then fit models using those
principal components rather than the full set of features.
• A second approach is to partition the space into a finite number of sets.
We can then fit models to the subpopulations in each of those
sets.
51
3.a Principal Components
• Old method, e.g., in Theil (Principles of Econometrics)
We want to find a set of K N-vectors Y1, . . . , YK so that
$$
X_i \approx \sum_{k=1}^{K} \gamma_{ik} Y_k
$$
or, collectively, we want to find an approximation
X = ΓY, where Γ is M × K, M > K
This is useful in cases where we have many features and we want to reduce the dimension without giving up a lot of information.
52
First normalize the components of Xi, so that the average $\sum_{i=1}^{N} X_i / N = 0$ and the components have unit variance.
First principal component: find the N-vector Y1 and the M-vector Γ that solve
$$
(Y_1, \Gamma) = \arg\min_{Y_1, \Gamma} \; \operatorname{trace}\bigl( (X - Y_1 \Gamma')'(X - Y_1 \Gamma') \bigr)
$$
This leads to Y1 being the eigenvector of the N × N matrix XX' corresponding to the largest eigenvalue. Given Y1, Γ is easy to find.
Subsequent Yk correspond to the subsequent eigenvectors.
53
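A numpy sketch of this construction; X and K are assumed given, and for large N one would use an SVD of X rather than forming the N × N matrix XX' explicitly.

```python
import numpy as np

# X: N x M feature matrix, K: number of components (both assumed given).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # mean zero, unit variance
vals, vecs = np.linalg.eigh(Xs @ Xs.T)         # eigenvectors of the N x N matrix XX'
order = np.argsort(vals)[::-1]                 # sort eigenvalues from largest to smallest
Y_pc = vecs[:, order[:K]]                      # first K principal components (N-vectors)
Gamma = Xs.T @ Y_pc                            # loadings, M x K
X_approx = Y_pc @ Gamma.T                      # rank-K approximation of the normalized X
```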
3.b Mixture Models and the EM Algorithm
Model the joint distribution of the L-component vector Xi as
a mixture of parametric distributions:
$$
f(x) = \sum_{k=1}^{K} \pi_k \cdot f_k(x; \theta_k), \qquad f_k(\cdot\,;\cdot) \text{ known}
$$
We want to estimate the parameters of the mixture components, θk, and the mixture probabilities πk.
The mixture components can be multivariate normal, or any other parametric distribution. Straight maximum likelihood estimation is very difficult because the likelihood function is multi-modal.
54
The EM algorithm (Dempster, Laird, Rubin, 1977) makes this
easy as long as it is easy to estimate the θk given data from
the k-th mixture component. Start with πk = 1/K for all k.
Create starting values for θk, all different.
Update weights, the conditional probability of belonging to
cluster k given parameter values (E-step in EM):
$$
w_{ik} = \frac{\pi_k \cdot f_k(X_i; \theta_k)}{\sum_{m=1}^{K} \pi_m \cdot f_m(X_i; \theta_m)}
$$
Update θk (M-step in EM)
$$
\theta_k = \arg\max_{\theta} \; \sum_{i=1}^{N} w_{ik} \cdot \ln f_k(X_i; \theta)
$$
55
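A compact numpy/scipy sketch of these two steps for a Gaussian mixture; K, the number of iterations, and the small ridge added to the covariance updates are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def em_mixture(X, K=3, iters=200, seed=0):
    """EM for a K-component Gaussian mixture; X is N x L."""
    rng = np.random.default_rng(seed)
    N, L = X.shape
    pi = np.full(K, 1.0 / K)                   # start with equal mixture probabilities
    mu = X[rng.choice(N, K, replace=False)]    # different starting means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(L) for _ in range(K)])
    for _ in range(iters):
        # E-step: conditional probabilities of belonging to each cluster
        dens = np.column_stack([pi[k] * mvn.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        Nk = w.sum(axis=0)
        pi = Nk / N
        mu = (w.T @ X) / Nk[:, None]
        Sigma = np.array([(w[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / Nk[k]
                          + 1e-6 * np.eye(L) for k in range(K)])
    return pi, mu, Sigma, w
```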
Algorithm can be slow but is very reliable. It gives probabilities
for each cluster. Then we can use those to assign new units
to clusters, using the highest probability.
Used in duration models in Gamma-Weibull mixtures (Lan-
caster, 1979), non-parametric Weibull mixtures (Heckman &
Singer, 1984).
56
3.c The k-means Algorithm
1. Start with k arbitrary centroids cm for the k clusters.
2. Assign each observation to the nearest centroid:
$$
I_i = m \quad \text{if} \quad \|X_i - c_m\| = \min_{m'} \|X_i - c_{m'}\|
$$
3. Re-calculate the centroids as
$$
c_m = \frac{\sum_{i: I_i = m} X_i}{\sum_{i: I_i = m} 1}
$$
4. Go back to (2) if centroids have changed, otherwise stop.
57
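A numpy sketch of the four steps above; the starting centroids are drawn at random from the data, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Assign to nearest centroid, recompute centroids, repeat until they stop moving."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), k, replace=False)]     # arbitrary starting centroids
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)   # N x k distances
        labels = dist.argmin(axis=1)
        new_c = np.array([X[labels == m].mean(axis=0) if (labels == m).any() else c[m]
                          for m in range(k)])
        if np.allclose(new_c, c):                   # centroids unchanged: stop
            break
        c = new_c
    return c, labels
```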
• k-means is fast
• Results can be sensitive to starting values for centroids.
58
3.d. Mining for Association Rules: The Apriori Algorithm
Suppose we have a set of N customers, each buying arbitrary subsets of a set F0 containing M items. We want to find subsets of k items that are bought together by at least L customers, for different values of k.
This is very much a data mining exercise. There is no model,
simply a search for items that go together. Of course this
may suggest causal relationships, and suggest that discounting
some items may increase sales of other items.
It is potentially difficult, because there are M choose k subsets
of k items that could be elements of Fk. The solution is to do
this sequentially.
59
You start with k = 1, by selecting all items that are bought
by at least L customers. This gives a set F1 ⊂ F0 of the M
original items.
Now, for k ≥ 2, given Fk−1, find Fk.
First construct the set of possible elements of Fk. For a set of items F to be in Fk, every set obtained by dropping one of the k items in F, say the m-th item, leading to the set F(m) = F \ {m}, must be an element of Fk−1.
The reason is that for F to be in Fk, there must be at least L customers buying that set of items. Hence there must be at least L customers buying the set F(m), and so F(m) is a (k − 1)-item set that must be an element of Fk−1.
60
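A plain-Python sketch of this sequential construction; baskets is assumed to be a list of item sets, one per customer, and L is the support threshold.

```python
from itertools import combinations

def apriori(baskets, L):
    """baskets: list of sets of items (one per customer). Returns, for each k,
    the k-item sets bought together by at least L customers."""
    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)

    items = set().union(*baskets)
    F = {1: {frozenset([i]) for i in items if support(frozenset([i])) >= L}}
    k = 2
    while F[k - 1]:
        # candidates: k-item sets all of whose (k-1)-item subsets are in F_{k-1}
        cands = {a | b for a in F[k - 1] for b in F[k - 1] if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in F[k - 1] for s in combinations(c, k - 1))}
        F[k] = {c for c in cands if support(c) >= L}
        k += 1
    return {size: sets for size, sets in F.items() if sets}
```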
5. Support Vector Machines and Classification
Suppose we have a sample (Yi, Xi), i = 1, . . . , N, with Yi ∈
{−1, 1}.
We are trying to come up with a classification rule that assigns
units to one of the two classes −1 or 1.
One conventional econometric approach is to estimate a logistic regression model and assign units to the groups based on the estimated probability.
Support vector machines look for a boundary h(x), such that
units on one side of the boundary are assigned to one group,
and units on the other side are assigned to the other group.
61
Suppose we limit ourselves to linear rules,
h(x) = β0 + xβ1,
where we assign a unit with features x to class 1 if h(x) ≥ 0
and to -1 otherwise.
We want to choose β0 and β1 to optimize the classification.
The question is how to quantify the quality of the classification.
62
Suppose there is a hyperplane that completely separates the Yi = −1 and Yi = 1 groups. In that case we can look for the β, with ‖β‖ = 1, that maximizes the margin M:
$$
\max_{\beta_0, \beta_1} \; M \quad \text{s.t.} \quad Y_i \cdot (\beta_0 + X_i \beta_1) \ge M \;\; \forall i, \qquad \|\beta\| = 1
$$
The restriction implies that each point is at least a distance M
away from the boundary.
We can rewrite this as
$$
\min_{\beta_0, \beta_1} \; \|\beta\| \quad \text{s.t.} \quad Y_i \cdot (\beta_0 + X_i \beta_1) \ge 1 \;\; \forall i.
$$
63
Often there is no such hyperplane. In that case we define a
penalty we pay for observations that are not at least a distance
M away from the boundary.
$$
\min_{\beta_0, \beta_1} \; \|\beta\| \quad \text{s.t.} \quad Y_i \cdot (\beta_0 + X_i \beta_1) \ge 1 - \varepsilon_i, \;\; \varepsilon_i \ge 0 \;\; \forall i, \qquad \sum_i \varepsilon_i \le C.
$$
Alternatively
$$
\min_{\beta} \; \tfrac{1}{2} \|\beta\|^2 + C \cdot \sum_{i=1}^{N} \varepsilon_i
\qquad \text{subject to} \qquad \varepsilon_i \ge 0, \;\; Y_i \cdot (\beta_0 + X_i \beta_1) \ge 1 - \varepsilon_i
$$
64
With the linear specification h(x) = β0+xβ1 the support vector
machine leads to
$$
\min_{\beta_0, \beta_1} \; \sum_{i=1}^{N} \bigl[ 1 - Y_i \cdot (\beta_0 + X_i \beta_1) \bigr]_+ \; + \; \lambda \cdot \|\beta\|^2
$$
where $[a]_+$ equals a if a > 0 and zero otherwise, and λ = 1/C.
Note that we do not get probabilities here as in a logistic regression model, only assignments to one of the two groups.
65
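A numpy sketch that minimizes this hinge-loss objective by subgradient descent; λ, the step size, and the number of steps are illustrative.

```python
import numpy as np

def linear_svm(X, Y, lam=0.1, lr=0.01, steps=2000):
    """Subgradient descent on sum_i [1 - Y_i (b0 + X_i b1)]_+ + lam * ||b1||^2,
    with Y taking values in {-1, +1}."""
    N, K = X.shape
    b0, b1 = 0.0, np.zeros(K)
    for _ in range(steps):
        margins = Y * (b0 + X @ b1)
        active = margins < 1                    # observations inside the margin
        g0 = -np.sum(Y[active])
        g1 = -(X[active].T @ Y[active]) + 2 * lam * b1
        b0 -= lr * g0 / N
        b1 -= lr * g1 / N
    return b0, b1

# Classification rule: assign x to class 1 if b0 + x @ b1 >= 0, and to -1 otherwise.
```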
Thank You!
66