Predicting Real-valued Outputs: An introduction to regression
1.
Predicting Real-valued Outputs: An Introduction to Regression
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm   awm@cs.cmu.edu   412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Sidebar: This lecture material is reordered from the "Favorite Regression Algorithms" lecture and the Neural Nets lecture.

Copyright © 2001, 2003, Andrew W. Moore

Single-Parameter Linear Regression
2.
Linear Regression

DATASET: inputs x1 = 1, x2 = 3, x3 = 2, x4 = 1.5, x5 = 4; outputs y1 = 1, y2 = 2.2, y3 = 2, y4 = 1.9, y5 = 3.1.

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w.

1-parameter linear regression
Assume that the data is formed by y_i = w x_i + noise_i, where the noise signals are independent and the noise has a normal distribution with mean 0 and unknown variance σ². Then p(y|w,x) has a normal distribution with mean wx and variance σ².
3.
Bayesian Linear Regression

p(y|w,x) = Normal(mean wx, variance σ²). We have a set of datapoints (x1,y1), (x2,y2), ..., (xn,yn) which are EVIDENCE about w. We want to infer w from the data: p(w | x1, ..., xn, y1, ..., yn). You can use Bayes' rule to work out a posterior distribution for w given the data, or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w
Asks the question: "For which value of w is this data most likely to have happened?"
<=> For what w is p(y1, ..., yn | x1, ..., xn, w) maximized?
<=> For what w is ∏_{i=1}^n p(y_i | w, x_i) maximized?
4.
For what w is ∏_{i=1}^n p(y_i | w, x_i) maximized?
<=> For what w is ∏_{i=1}^n exp(-(1/2)((y_i - w x_i)/σ)²) maximized?
<=> For what w is Σ_{i=1}^n -(1/2)((y_i - w x_i)/σ)² maximized?
<=> For what w is Σ_{i=1}^n (y_i - w x_i)² minimized?

Linear Regression
The maximum likelihood w is the one that minimizes the sum of squares of residuals:
E(w) = Σ_i (y_i - w x_i)² = Σ_i y_i² - 2(Σ_i x_i y_i) w + (Σ_i x_i²) w².
We want to minimize a quadratic function of w.
5.
Linear Regression
It is easy to show the sum of squares is minimized when
w = Σ_i x_i y_i / Σ_i x_i².
The maximum likelihood model is Out(x) = wx. We can use it for prediction.

Note: in Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution of expected output. It is often useful to know your confidence; maximum likelihood can give some kinds of confidence too.
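As an illustration (not from the original slides), a minimal NumPy sketch of this estimator on the toy dataset above:

import numpy as np

# Toy dataset from the "Linear Regression" slide
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum likelihood estimate for Out(x) = w*x:  w = sum(x*y) / sum(x^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(w)          # estimated slope
print(w * 2.5)    # prediction for a new input x = 2.5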
6.
Multivariate Linear Regression

Multivariate Regression: what if the inputs are vectors? For example, 2-d inputs. The dataset has the form (x_1, y_1), (x_2, y_2), ..., (x_R, y_R), where each input x_k is a vector.
7.
Multivariate Regression
Write matrices X and Y thus: X is an R x m matrix whose k'th row is the input vector x_k (so X_kj is the j'th component of x_k), and Y = (y_1, ..., y_R)^T. (There are R datapoints; each input has m components.)
The linear regression model assumes a vector w such that
Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m].
The maximum likelihood w is w = (X^T X)^{-1} (X^T Y).
IMPORTANT EXERCISE: PROVE IT!
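A hedged sketch of the closed-form solution in NumPy (the synthetic data is my own; np.linalg.lstsq is the numerically safer route than forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
R, m = 100, 3                        # R datapoints, m input components
X = rng.normal(size=(R, m))          # each row is one input vector x_k
w_true = np.array([2.0, -1.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.1, size=R)

# Maximum likelihood weights: w = (X^T X)^{-1} (X^T Y)
w_ml = np.linalg.solve(X.T @ X, X.T @ Y)
# Equivalent, but more numerically stable:
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_ml, w_lstsq)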
8.
Multivariate Regression (con't)
The maximum likelihood w is w = (X^T X)^{-1} (X^T Y).
X^T X is an m x m matrix whose (i,j)'th element is Σ_{k=1}^R x_ki x_kj.
X^T Y is an m-element vector whose i'th element is Σ_{k=1}^R x_ki y_k.

Constant Term in Linear Regression
9.
What about a constant term?
We may expect linear data that does not go through the origin. Statisticians and neural net folks all agree on a simple, obvious hack. Can you guess?

The constant term
The trick is to create a fake input "X0" that always takes the value 1.

Before:            After:
X1  X2  Y          X0  X1  X2  Y
 2   4  16          1   2   4  16
 3   4  17          1   3   4  17
 5   5  20          1   5   5  20

Before: Y = w1 X1 + w2 X2 has to be a poor model. After: Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 has a fine constant term. In this example you should be able to see the MLE w0, w1 and w2 by inspection.
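A small sketch of the "fake input X0 = 1" trick on the three-row table above (my code, not from the slides):

import numpy as np

# Table from the slide: columns X1, X2 and target Y
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# Prepend the fake constant input X0 = 1 to every row
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary least squares on the augmented inputs gives (w0, w1, w2)
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
print(w)   # roughly (10, 1, 1): Y = 10 + X1 + X2 fits this table exactly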
10.
Heteroscedasticity: Linear Regression with varying noise

Regression with varying noise
Suppose you know the variance of the noise that was added to each datapoint:

x_i   y_i   σ_i²
 ½     ½     4
 1     1     1
 2     1     ¼
 2     3     4
 3     2     ¼

Assume y_i ~ N(w x_i, σ_i²). What's the MLE estimate of w?
11.
MLE estimation with varying noise
argmax_w log p(y_1, ..., y_R | x_1, ..., x_R, σ_1², ..., σ_R², w)
= argmin_w Σ_{i=1}^R (y_i - w x_i)² / σ_i²   (assuming independence among the noise terms, then plugging in the Gaussian and simplifying)
= the w such that Σ_{i=1}^R x_i (y_i - w x_i) / σ_i² = 0   (setting dLL/dw equal to zero)
= (Σ_{i=1}^R x_i y_i / σ_i²) / (Σ_{i=1}^R x_i² / σ_i²)   (trivial algebra)

This is Weighted Regression
We are asking to minimize the weighted sum of squares argmin_w Σ_{i=1}^R (y_i - w x_i)² / σ_i², where the weight for the i'th datapoint is 1/σ_i².
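A quick sketch of the weighted (heteroscedastic) estimator on the slide's five datapoints (my code):

import numpy as np

# Datapoints (x_i, y_i) and their known noise variances sigma_i^2
x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])

# MLE for y_i ~ N(w*x_i, sigma_i^2):
#   w = sum(x_i*y_i/sigma_i^2) / sum(x_i^2/sigma_i^2)
w = np.sum(x * y / var) / np.sum(x ** 2 / var)
print(w)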
12.
Non-linear Regression

Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g. assume y_i ~ N(√(w + x_i), σ²):

x_i   y_i
 ½     ½
 1     2.5
 2     3
 3     2
 3     3

What's the MLE estimate of w?
13.
Non-linear MLE estimation
argmax_w log p(y_1, ..., y_R | x_1, ..., x_R, σ, w)
= argmin_w Σ_{i=1}^R (y_i - √(w + x_i))²   (assuming i.i.d. noise, then plugging in the equation for the Gaussian and simplifying)
= the w such that Σ_{i=1}^R (y_i - √(w + x_i)) / √(w + x_i) = 0   (setting dLL/dw equal to zero)
= ... we're down the algebraic toilet. So guess what we do?
14.
Non-linear MLE estimation (continued)
Common (but not the only) approach: numerical solutions, e.g.
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method
Also, special-purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

Polynomial Regression
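For the toy model y_i ~ N(√(w + x_i), σ²), here is a gradient-descent sketch of the MLE, one of the numerical options listed above (the step size, iteration count and starting point are arbitrary choices of mine):

import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    return np.sum((y - np.sqrt(w + x)) ** 2)

def grad(w):
    # d/dw sum (y_i - sqrt(w+x_i))^2 = -sum (y_i - sqrt(w+x_i)) / sqrt(w+x_i)
    r = np.sqrt(w + x)
    return -np.sum((y - r) / r)

w = 1.0                      # initial guess (must keep w + x_i > 0)
for _ in range(2000):
    w -= 0.01 * grad(w)
print(w, sse(w))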
15.
Polynomial Regression
So far we've mainly been dealing with linear regression. With inputs X1, X2 and output Y (e.g. x_1 = (3,2), y_1 = 7; x_2 = (1,1), y_2 = 3; ...), form Z by prepending a 1 to each input: z_k = (1, x_k1, x_k2). Then β = (Z^T Z)^{-1} (Z^T y) and y_est = β0 + β1 x1 + β2 x2.

Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions. With the same data, take z = (1, x1, x2, x1², x1 x2, x2²), β = (Z^T Z)^{-1} (Z^T y), and
y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2².
16.
Quadratic Regression (continued)
Each component of a z vector is called a term. Each column of the Z matrix is called a term column. How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms
(m+2)-choose-2 terms in total = O(m²). Note that solving β = (Z^T Z)^{-1} (Z^T y) is thus O(m⁶).

Qth-degree polynomial Regression
z = (all products of powers of inputs in which the sum of powers is Q or less), β = (Z^T Z)^{-1} (Z^T y), y_est = β0 + β1 x1 + ...
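A sketch of quadratic regression as a linear fit on expanded terms, the z-vector from the slide (the toy data here is mine, not the slide's):

import numpy as np

def quadratic_terms(x1, x2):
    # z = (1, x1, x2, x1^2, x1*x2, x2^2) -- the six terms for m = 2 inputs
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

X = np.array([[3.0, 2.0], [1.0, 1.0], [2.0, 4.0], [0.0, 1.0],
              [4.0, 3.0], [2.0, 2.0], [5.0, 1.0]])
y = np.array([7.0, 3.0, 9.0, 2.0, 11.0, 6.0, 10.0])

Z = np.array([quadratic_terms(a, b) for a, b in X])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # beta = (Z^T Z)^{-1} Z^T y
y_est = Z @ beta
print(beta)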
17.
m inputs, degree Q: how many terms?
= the number of unique terms of the form x1^q1 x2^q2 ... xm^qm where Σ_{i=1}^m q_i ≤ Q
= the number of unique terms of the form 1^q0 x1^q1 x2^q2 ... xm^qm where Σ_{i=0}^m q_i = Q
= the number of lists of non-negative integers [q0, q1, q2, ..., qm] in which Σ q_i = Q
= the number of ways of placing Q red disks on a row of squares of length Q+m
= (Q+m)-choose-Q.
Example: Q = 11, m = 4: q0 = 2, q1 = 2, q2 = 0, q3 = 4, q4 = 3.

Radial Basis Functions
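A tiny check of the (Q+m)-choose-Q count against brute-force enumeration (verification code of my own, not from the slides):

from itertools import product
from math import comb

def count_terms(m, Q):
    # Number of monomials x1^q1 ... xm^qm with q1 + ... + qm <= Q,
    # enumerated directly (slow, for checking only).
    return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

m, Q = 4, 5
print(count_terms(m, Q), comb(Q + m, Q))   # both should print 126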
18.
Radial Basis Functions (RBFs)
Same setup as before, but now z = (list of radial basis function evaluations), β = (Z^T Z)^{-1} (Z^T y), and y_est is the corresponding linear combination of basis-function evaluations.

1-d RBFs
y_est = β1 φ1(x) + β2 φ2(x) + β3 φ3(x), where φ_i(x) = KernelFunction(|x - c_i| / KW) and the c_i are the basis-function centers.
19.
Example
y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x), where φ_i(x) = KernelFunction(|x - c_i| / KW).

RBFs with Linear Regression
All the c_i's are held constant (initialized randomly or on a grid in m-dimensional input space). KW is also held constant (initialized to be large enough that there's decent overlap between basis functions; usually much better than the crappy overlap on my diagram).
20.
RBFs with Linear Regression (continued)
Then, given Q basis functions, define the matrix Z such that Z_kj = KernelFunction(|x_k - c_j| / KW), where x_k is the k'th vector of inputs. And as before, β = (Z^T Z)^{-1} (Z^T y).

RBFs with Non-Linear Regression
Now allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data too. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.) But how do we now find all the β_j's, c_i's and KW?
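Before the adaptive case, a sketch of the fixed-center, fixed-KW linear-regression version just described, in 1-d; the slides don't pin down the KernelFunction, so the Gaussian kernel (and the data) here are my assumptions:

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, size=80))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

centers = np.linspace(0.0, 10.0, 12)     # c_j on a grid, held constant
KW = 1.5                                 # kernel width, held constant

def kernel(u):
    return np.exp(-0.5 * u ** 2)         # assumed Gaussian KernelFunction

# Z_kj = KernelFunction(|x_k - c_j| / KW)
Z = kernel(np.abs(x[:, None] - centers[None, :]) / KW)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # beta = (Z^T Z)^{-1} Z^T y

x_new = np.array([2.5])
Z_new = kernel(np.abs(x_new[:, None] - centers[None, :]) / KW)
print(Z_new @ beta)                      # prediction at x = 2.5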
21.
RBFs with Non-Linear Regression (continued)
How do we find all the β_j's, c_i's and KW? Answer: gradient descent. (But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)
22.
Radial Basis Functions in 2-d
Two inputs; outputs (heights sticking out of the page) not shown. [Figure: each center with its sphere of significant influence in the (x1, x2) plane.]

Happy RBFs in 2-d
Blue dots denote coordinates of input vectors. [Figure: centers and spheres of influence covering the data.]
23.
Crabby RBFs in 2-d
What's the problem in this example? Blue dots denote coordinates of input vectors. [Figure: centers and their spheres of significant influence versus the data.]

More crabby RBFs
And what's the problem in this example?
24.
Hopeless!
Even before seeing the data, you should understand that this is a disaster! [Figure: an arrangement of centers and their spheres of influence.]

Unhappy
Even before seeing the data, you should understand that this isn't good either. [Figure: another arrangement of centers.]
25.
Robust Regression

[Figure: a scatter plot of y against x, used throughout the following slides.]
26.
Robust Regression
This is the best fit that Quadratic Regression can manage...

...but this is what we'd probably prefer.
27.
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted: "You are a very good datapoint."

"You are not too shabby."
28.
LOESS-based Robust Regression (continued)
"But you are pathetic."

Robust Regression
For k = 1 to R:
• Let (x_k, y_k) be the k'th datapoint.
• Let yest_k be the predicted value of y_k.
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly: w_k = KernelFn([y_k - yest_k]²).
29.
Robust Regression (continued)
Then redo the regression using weighted datapoints. Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" lecture. Guess what happens next?
(I taught you how to do this in the "Instance-based" lecture; only then the weights depended on distance in input space.)
Repeat the whole thing until converged!
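A hedged sketch of the iterated-reweighting recipe for a quadratic fit; the Gaussian-shaped KernelFn over squared residuals, its bandwidth, the fixed iteration count and the toy data are all my assumptions, not specified on the slides:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 4, 40)
y = 1.0 + 0.5 * x + 0.25 * x ** 2 + rng.normal(scale=0.1, size=x.size)
y[::10] += 4.0                                      # inject a few outliers

Z = np.column_stack([np.ones_like(x), x, x ** 2])   # quadratic terms
w = np.ones_like(x)                                 # datapoint weights

for _ in range(10):                                 # "repeat until converged"
    # Weighted least squares: minimize sum_k w_k (y_k - z_k beta)^2
    W = np.diag(w)
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    resid2 = (y - Z @ beta) ** 2
    # KernelFn([y_k - yest_k]^2): big weight for small residuals
    w = np.exp(-resid2 / (2 * np.median(resid2) + 1e-12))

print(beta)    # close to (1.0, 0.5, 0.25) despite the outliers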
30.
Robust Regression---what we're doing
What regular regression does: assume y_k was originally generated using the following recipe:
y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²).
The computational task is to find the maximum likelihood β0, β1 and β2.

What LOESS robust regression does: assume y_k was originally generated using the following recipe:
With probability p: y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²); but otherwise y_k ~ N(µ, σ_huge²).
The computational task is to find the maximum likelihood β0, β1, β2, p, µ and σ_huge.
31.
Robust Regression---what we're doing (continued)
Mysteriously, the reweighting procedure does this computation for us. Your first glimpse of two spectacular letters: E.M.

Regression Trees
32.
Regression Trees
• "Decision trees for regression"

A regression tree leaf: Predict age = 47 (the mean age of records matching this leaf node).
33.
A one-split regression tree
Gender? Female -> Predict age = 39. Male -> Predict age = 36.

Choosing the attribute to split on

Gender  Rich?  Num. Children  Num. Beany Babies  Age
Female  No     2              1                  38
Male    No     0              0                  24
Male    Yes    0              5+                 72
:       :      :              :                  :

• We can't use information gain.
• What should we use?
34.
Choosing the attribute to split on (continued)
MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value. If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}. Then

MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} (y_k - µ_{y|x=j})².

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X). Guess what we do about real-valued inputs? Guess how we prevent overfitting.
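A small sketch of the MSE(Y|X) split score for categorical attributes (plain Python/NumPy; the rows beyond the three shown in the table are values I made up):

import numpy as np

def mse_given_x(x_values, y_values):
    # MSE(Y|X) = (1/R) * sum over categories j of sum_{k: x_k=j} (y_k - mean_j)^2
    y = np.asarray(y_values, dtype=float)
    total = 0.0
    for j in set(x_values):
        yj = y[[xv == j for xv in x_values]]
        total += np.sum((yj - yj.mean()) ** 2)
    return total / len(y)

gender = ["F", "M", "M", "F", "M", "F"]
rich   = ["No", "No", "Yes", "Yes", "No", "No"]
age    = np.array([38, 24, 72, 55, 30, 41])

# Greedily pick the attribute with the smallest MSE(Y|X)
scores = {"Gender": mse_given_x(gender, age), "Rich?": mse_given_x(rich, age)}
print(scores, min(scores, key=scores.get))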
35.
Pruning Decision Trees
...property-owner = Yes. Do I deserve to live? Gender? Female -> Predict age = 39. Male -> Predict age = 36.
# property-owning females = 56712; # property-owning males = 55800. Mean age among POFs = 39; mean age among POMs = 36. Age std dev among POFs = 12; among POMs = 11.5.
Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.

Linear Regression Trees (also known as "Model Trees")
...property-owner = Yes. Gender? Female -> Predict age = 26 + 6*NumChildren - 2*YearsEducation. Male -> Predict age = 24 + 7*NumChildren - 2.5*YearsEducation.
Leaves contain linear functions (trained using linear regression on all records matching that leaf). The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.
36.
Linear Regression Trees (continued)
Detail: you typically ignore any categorical attribute that has already been tested higher up in the tree, but use all real-valued attributes in the regression, even if they've been tested higher up.

Test your understanding
Assuming regular regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?
37.
Test your understanding
Assuming linear regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?

Multilinear Interpolation
38.
Multilinear Interpolation
Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data, q1, q2, q3, q4, q5.
39.
Multilinear Interpolation (continued)
We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are three examples (none fits the data well).

How to find the best fit?
Idea 1: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea?
40.
How to find the best fit?
Let's look at what goes on in the red segment (between knots q2 and q3, with knot heights h2 and h3):
yest(x) = ((q3 - x)/w) h2 + ((x - q2)/w) h3, where w = q3 - q2.

Equivalently, in the red segment:
yest(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 - (x - q2)/w and φ3(x) = 1 - (q3 - x)/w.
41.
How to find the best fit?
In the red segment: yest(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 - (x - q2)/w and φ3(x) = 1 - (q3 - x)/w.
The same functions can be written with absolute values: φ2(x) = 1 - |x - q2|/w and φ3(x) = 1 - |x - q3|/w.
42.
How to find the best fit?
In the red segment: yest(x) = h2 φ2(x) + h3 φ3(x), with φ2(x) = 1 - |x - q2|/w and φ3(x) = 1 - |x - q3|/w.

In general:
yest(x) = Σ_{i=1}^{N_K} h_i φ_i(x), where φ_i(x) = 1 - |x - q_i|/w if |x - q_i| < w, and 0 otherwise.
43.
How to find the best fit?
And this is simply a basis function regression problem! We know how to find the least-squares h_i's.

In two dimensions...
Blue dots show locations of input vectors (outputs not depicted).
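A sketch of the 1-d case as plain basis-function regression with triangular "hat" functions at equally spaced knots (the toy data and knot count are my own choices):

import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 8.0, size=60))
y = np.sin(x) + 0.3 * x + rng.normal(scale=0.1, size=x.size)

knots = np.linspace(0.0, 8.0, 5)        # q_1 ... q_NK, equally spaced
w = knots[1] - knots[0]                 # knot spacing

def hat(x, q):
    # phi_i(x) = 1 - |x - q_i|/w if |x - q_i| < w, else 0
    d = np.abs(x - q) / w
    return np.where(d < 1.0, 1.0 - d, 0.0)

Phi = np.column_stack([hat(x, q) for q in knots])
h, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least-squares knot heights
print(h)                                         # fitted heights at the knots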
44.
In two dimensions...
Each purple dot is a knot point; it will contain the height of the estimated surface. (Blue dots show locations of input vectors; outputs not depicted.) But how do we do the interpolation to ensure that the surface is continuous?
45.
In two dimensions...
To predict the value at a query point, first interpolate its value on two opposite edges of the cell containing it (7.33 on one edge in the example)...
46.
In two dimensions...
...then interpolate between those two edge values (7.05 in the example).
Notes: this can easily be generalized to m dimensions. It should be easy to see that it ensures continuity. The patches are not linear.
47.
Doing the regression
Given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.) What's the problem in higher dimensions?

MARS: Multivariate Adaptive Regression Splines
48.
MARS
• Multivariate Adaptive Regression Splines
• Invented by Jerry Friedman (one of Andrew's heroes)
• Simplest version: let's assume the function we are learning is of the following form:
yest(x) = Σ_{k=1}^m g_k(x_k)
Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

Idea: each g_k is one of the piecewise-linear functions that bends only at the knots q1, ..., q5, as in the multilinear interpolation slides.
49.
MARS (continued)
yest(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h_j^k φ_j^k(x_k), where φ_j^k(x_k) = 1 - |x_k - q_j^k| / w_k if |x_k - q_j^k| < w_k, and 0 otherwise.
Here q_j^k is the location of the j'th knot in the k'th dimension, h_j^k is the regressed height of the j'th knot in the k'th dimension, and w_k is the spacing between knots in the k'th dimension.

That's not complicated enough!
Okay, now let's get serious. We'll allow arbitrary "two-way interactions":
yest(x) = Σ_{k=1}^m g_k(x_k) + Σ_{k=1}^m Σ_{t=k+1}^m g_kt(x_k, x_t)
The function we're learning is allowed to be a sum of non-linear functions over all one-d and 2-d subsets of attributes. This can still be expressed as a linear combination of basis functions, and is thus learnable by linear regression. Full MARS uses cross-validation to choose a subset of subspaces, knot resolution and other parameters.
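A sketch of the "simplest version" additive model: one set of hat-function knots per input dimension, all heights regressed jointly (the data, knot counts and reuse of the hat basis from the interpolation sketch are my assumptions):

import numpy as np

rng = np.random.default_rng(4)
R, m, NK = 200, 2, 6
X = rng.uniform(0.0, 5.0, size=(R, m))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=R)

def hat_design(col, knots, w):
    # Columns phi_j^k(x_k) = 1 - |x_k - q_j^k|/w_k if within w_k, else 0
    d = np.abs(col[:, None] - knots[None, :]) / w
    return np.where(d < 1.0, 1.0 - d, 0.0)

# One block of NK hat-function columns per input dimension k
blocks = []
for k in range(m):
    knots = np.linspace(X[:, k].min(), X[:, k].max(), NK)
    blocks.append(hat_design(X[:, k], knots, knots[1] - knots[0]))
Z = np.hstack(blocks)                       # R x (m*NK) design matrix

h, *_ = np.linalg.lstsq(Z, y, rcond=None)   # heights h_j^k for every knot
print(h.reshape(m, NK))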
50.
If you like MARS...
...see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).
• Many of the same gut-level intuitions
• But entirely in a neural-network, biologically plausible way
• (All the low-dimensional functions are by means of lookup tables, trained with a delta rule and using a clever blurred update and hash-tables.)

Where are we now?
• Inputs -> Inference Engine -> p(E1|E2): Joint DE
• Inputs -> Classifier -> Predict category: Dec Tree, Gauss/Joint BC, Gauss Naïve BC
• Inputs -> Density Estimator -> Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
• Inputs -> Regressor -> Predict real no.: Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS
51.
Citations
Radial Basis Functions: T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.
LOESS: W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), 829-836, December 1979.
Regression Trees etc.: L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984; J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.
MARS: J. H. Friedman, Multivariate Adaptive Regression Splines, Department of Statistics, Stanford University, Technical Report No. 102, 1988.