Predicting Real-valued Outputs: An introduction to regression
1.
Predicting Real-valued Outputs: An Introduction to Regression
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm   awm@cs.cmu.edu   412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Sidebar: This lecture material is reordered from the "Favorite Regression Algorithms" lecture and the Neural Nets lecture.

Copyright © 2001, 2003, Andrew W. Moore

Single-Parameter Linear Regression
2.
Linear Regression

DATASET: inputs x1 = 1, x2 = 3, x3 = 2, x4 = 1.5, x5 = 4; outputs y1 = 1, y2 = 2.2, y3 = 2, y4 = 1.9, y5 = 3.1.

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w.

1-parameter linear regression
Assume that the data is formed by y_i = w x_i + noise_i, where the noise signals are independent and the noise has a normal distribution with mean 0 and unknown variance σ². Then p(y|w,x) has a normal distribution with mean wx and variance σ².
3.
Bayesian Linear Regression

p(y|w,x) = Normal(mean wx, variance σ²). We have a set of datapoints (x1,y1), (x2,y2), ..., (xn,yn) which are EVIDENCE about w. We want to infer w from the data: p(w | x1, ..., xn, y1, ..., yn). You can use Bayes' rule to work out a posterior distribution for w given the data, or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w
Asks the question: "For which value of w is this data most likely to have happened?"
<=> For what w is p(y1, ..., yn | x1, ..., xn, w) maximized?
<=> For what w is ∏_{i=1}^n p(y_i | w, x_i) maximized?
4.
For what w is ∏_{i=1}^n p(y_i | w, x_i) maximized?
<=> For what w is ∏_{i=1}^n exp(-(1/2)((y_i - w x_i)/σ)²) maximized?
<=> For what w is Σ_{i=1}^n -(1/2)((y_i - w x_i)/σ)² maximized?
<=> For what w is Σ_{i=1}^n (y_i - w x_i)² minimized?

Linear Regression
The maximum likelihood w is the one that minimizes the sum of squares of residuals:
E(w) = Σ_i (y_i - w x_i)² = Σ_i y_i² - 2(Σ_i x_i y_i) w + (Σ_i x_i²) w².
We want to minimize a quadratic function of w.
5.
Linear Regression
It is easy to show the sum of squares is minimized when
w = Σ_i x_i y_i / Σ_i x_i².
The maximum likelihood model is Out(x) = wx. We can use it for prediction.

Note: in Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution of expected output. It is often useful to know your confidence; maximum likelihood can give some kinds of confidence too.
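As an illustration (not from the original slides), a minimal NumPy sketch of this estimator on the toy dataset above:

import numpy as np

# Toy dataset from the "Linear Regression" slide
x = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

# Maximum likelihood estimate for Out(x) = w*x:  w = sum(x*y) / sum(x^2)
w = np.sum(x * y) / np.sum(x ** 2)
print(w)          # estimated slope
print(w * 2.5)    # prediction for a new input x = 2.5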
6.
Multivariate Linear Regression

Multivariate Regression: what if the inputs are vectors? For example, 2-d inputs. The dataset has the form (x_1, y_1), (x_2, y_2), ..., (x_R, y_R), where each input x_k is a vector.
7.
Multivariate Regression
Write matrices X and Y thus: X is an R x m matrix whose k'th row is the input vector x_k (so X_kj is the j'th component of x_k), and Y = (y_1, ..., y_R)^T. (There are R datapoints; each input has m components.)
The linear regression model assumes a vector w such that
Out(x) = w^T x = w_1 x[1] + w_2 x[2] + ... + w_m x[m].
The maximum likelihood w is w = (X^T X)^{-1} (X^T Y).
IMPORTANT EXERCISE: PROVE IT!
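A hedged sketch of the closed-form solution in NumPy (the synthetic data is my own; np.linalg.lstsq is the numerically safer route than forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
R, m = 100, 3                        # R datapoints, m input components
X = rng.normal(size=(R, m))          # each row is one input vector x_k
w_true = np.array([2.0, -1.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.1, size=R)

# Maximum likelihood weights: w = (X^T X)^{-1} (X^T Y)
w_ml = np.linalg.solve(X.T @ X, X.T @ Y)
# Equivalent, but more numerically stable:
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_ml, w_lstsq)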
8.
Multivariate Regression (con't)
The maximum likelihood w is w = (X^T X)^{-1} (X^T Y).
X^T X is an m x m matrix whose (i,j)'th element is Σ_{k=1}^R x_ki x_kj.
X^T Y is an m-element vector whose i'th element is Σ_{k=1}^R x_ki y_k.

Constant Term in Linear Regression
9.
What about a constant term?
We may expect linear data that does not go through the origin. Statisticians and neural net folks all agree on a simple, obvious hack. Can you guess?

The constant term
The trick is to create a fake input "X0" that always takes the value 1.

Before:            After:
X1  X2  Y          X0  X1  X2  Y
 2   4  16          1   2   4  16
 3   4  17          1   3   4  17
 5   5  20          1   5   5  20

Before: Y = w1 X1 + w2 X2 has to be a poor model. After: Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 has a fine constant term. In this example you should be able to see the MLE w0, w1 and w2 by inspection.
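A small sketch of the "fake input X0 = 1" trick on the three-row table above (my code, not from the slides):

import numpy as np

# Table from the slide: columns X1, X2 and target Y
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# Prepend the fake constant input X0 = 1 to every row
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary least squares on the augmented inputs gives (w0, w1, w2)
w, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
print(w)   # roughly (10, 1, 1): Y = 10 + X1 + X2 fits this table exactly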
10.
Heteroscedasticity: Linear Regression with varying noise

Regression with varying noise
Suppose you know the variance of the noise that was added to each datapoint:

x_i   y_i   σ_i²
 ½     ½     4
 1     1     1
 2     1     ¼
 2     3     4
 3     2     ¼

Assume y_i ~ N(w x_i, σ_i²). What's the MLE estimate of w?
11.
MLE estimation with varying noise
argmax_w log p(y_1, ..., y_R | x_1, ..., x_R, σ_1², ..., σ_R², w)
= argmin_w Σ_{i=1}^R (y_i - w x_i)² / σ_i²   (assuming independence among the noise terms, then plugging in the Gaussian and simplifying)
= the w such that Σ_{i=1}^R x_i (y_i - w x_i) / σ_i² = 0   (setting dLL/dw equal to zero)
= (Σ_{i=1}^R x_i y_i / σ_i²) / (Σ_{i=1}^R x_i² / σ_i²)   (trivial algebra)

This is Weighted Regression
We are asking to minimize the weighted sum of squares argmin_w Σ_{i=1}^R (y_i - w x_i)² / σ_i², where the weight for the i'th datapoint is 1/σ_i².
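A quick sketch of the weighted (heteroscedastic) estimator on the slide's five datapoints (my code):

import numpy as np

# Datapoints (x_i, y_i) and their known noise variances sigma_i^2
x   = np.array([0.5, 1.0, 2.0, 2.0, 3.0])
y   = np.array([0.5, 1.0, 1.0, 3.0, 2.0])
var = np.array([4.0, 1.0, 0.25, 4.0, 0.25])

# MLE for y_i ~ N(w*x_i, sigma_i^2):
#   w = sum(x_i*y_i/sigma_i^2) / sum(x_i^2/sigma_i^2)
w = np.sum(x * y / var) / np.sum(x ** 2 / var)
print(w)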
12.
Non-linear Regression

Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g. assume y_i ~ N(√(w + x_i), σ²):

x_i   y_i
 ½     ½
 1     2.5
 2     3
 3     2
 3     3

What's the MLE estimate of w?
13.
Non-linear MLE estimation
argmax_w log p(y_1, ..., y_R | x_1, ..., x_R, σ, w)
= argmin_w Σ_{i=1}^R (y_i - √(w + x_i))²   (assuming i.i.d. noise, then plugging in the equation for the Gaussian and simplifying)
= the w such that Σ_{i=1}^R (y_i - √(w + x_i)) / √(w + x_i) = 0   (setting dLL/dw equal to zero)
= ... we're down the algebraic toilet. So guess what we do?
14.
Non-linear MLE estimation (continued)
Common (but not the only) approach: numerical solutions, e.g.
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton's Method
Also, special-purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

Polynomial Regression
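For the toy model y_i ~ N(√(w + x_i), σ²), here is a gradient-descent sketch of the MLE, one of the numerical options listed above (the step size, iteration count and starting point are arbitrary choices of mine):

import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.5, 2.5, 3.0, 2.0, 3.0])

def sse(w):
    return np.sum((y - np.sqrt(w + x)) ** 2)

def grad(w):
    # d/dw sum (y_i - sqrt(w+x_i))^2 = -sum (y_i - sqrt(w+x_i)) / sqrt(w+x_i)
    r = np.sqrt(w + x)
    return -np.sum((y - r) / r)

w = 1.0                      # initial guess (must keep w + x_i > 0)
for _ in range(2000):
    w -= 0.01 * grad(w)
print(w, sse(w))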
15.
Polynomial Regression
So far we've mainly been dealing with linear regression. With inputs X1, X2 and output Y (e.g. x_1 = (3,2), y_1 = 7; x_2 = (1,1), y_2 = 3; ...), form Z by prepending a 1 to each input: z_k = (1, x_k1, x_k2). Then β = (Z^T Z)^{-1} (Z^T y) and y_est = β0 + β1 x1 + β2 x2.

Quadratic Regression
It's trivial to do linear fits of fixed nonlinear basis functions. With the same data, take z = (1, x1, x2, x1², x1 x2, x2²), β = (Z^T Z)^{-1} (Z^T y), and
y_est = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x1 x2 + β5 x2².
16.
Quadratic Regression (continued)
Each component of a z vector is called a term. Each column of the Z matrix is called a term column. How many terms in a quadratic regression with m inputs?
• 1 constant term
• m linear terms
• (m+1)-choose-2 = m(m+1)/2 quadratic terms
(m+2)-choose-2 terms in total = O(m²). Note that solving β = (Z^T Z)^{-1} (Z^T y) is thus O(m⁶).

Qth-degree polynomial Regression
z = (all products of powers of inputs in which the sum of powers is Q or less), β = (Z^T Z)^{-1} (Z^T y), y_est = β0 + β1 x1 + ...
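A sketch of quadratic regression as a linear fit on expanded terms, the z-vector from the slide (the toy data here is mine, not the slide's):

import numpy as np

def quadratic_terms(x1, x2):
    # z = (1, x1, x2, x1^2, x1*x2, x2^2) -- the six terms for m = 2 inputs
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

X = np.array([[3.0, 2.0], [1.0, 1.0], [2.0, 4.0], [0.0, 1.0],
              [4.0, 3.0], [2.0, 2.0], [5.0, 1.0]])
y = np.array([7.0, 3.0, 9.0, 2.0, 11.0, 6.0, 10.0])

Z = np.array([quadratic_terms(a, b) for a, b in X])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # beta = (Z^T Z)^{-1} Z^T y
y_est = Z @ beta
print(beta)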
17.
m inputs, degree Q: how many terms?
= the number of unique terms of the form x1^q1 x2^q2 ... xm^qm where Σ_{i=1}^m q_i ≤ Q
= the number of unique terms of the form 1^q0 x1^q1 x2^q2 ... xm^qm where Σ_{i=0}^m q_i = Q
= the number of lists of non-negative integers [q0, q1, q2, ..., qm] in which Σ q_i = Q
= the number of ways of placing Q red disks on a row of squares of length Q+m
= (Q+m)-choose-Q.
Example: Q = 11, m = 4: q0 = 2, q1 = 2, q2 = 0, q3 = 4, q4 = 3.

Radial Basis Functions
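A tiny check of the (Q+m)-choose-Q count against brute-force enumeration (verification code of my own, not from the slides):

from itertools import product
from math import comb

def count_terms(m, Q):
    # Number of monomials x1^q1 ... xm^qm with q1 + ... + qm <= Q,
    # enumerated directly (slow, for checking only).
    return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

m, Q = 4, 5
print(count_terms(m, Q), comb(Q + m, Q))   # both should print 126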
18.
Radial Basis Functions (RBFs)
Same setup as before, but now z = (list of radial basis function evaluations), β = (Z^T Z)^{-1} (Z^T y), and y_est is the corresponding linear combination of basis-function evaluations.

1-d RBFs
y_est = β1 φ1(x) + β2 φ2(x) + β3 φ3(x), where φ_i(x) = KernelFunction(|x - c_i| / KW) and the c_i are the basis-function centers.
19.
Example
y_est = 2 φ1(x) + 0.05 φ2(x) + 0.5 φ3(x), where φ_i(x) = KernelFunction(|x - c_i| / KW).

RBFs with Linear Regression
All the c_i's are held constant (initialized randomly or on a grid in m-dimensional input space). KW is also held constant (initialized to be large enough that there's decent overlap between basis functions; usually much better than the crappy overlap on my diagram).
20.
RBFs with Linear Regression (continued)
Then, given Q basis functions, define the matrix Z such that Z_kj = KernelFunction(|x_k - c_j| / KW), where x_k is the k'th vector of inputs. And as before, β = (Z^T Z)^{-1} (Z^T y).

RBFs with Non-Linear Regression
Now allow the c_i's to adapt to the data (initialized randomly or on a grid in m-dimensional input space), and allow KW to adapt to the data too. (Some folks even let each basis function have its own KW_j, permitting fine detail in dense regions of input space.) But how do we now find all the β_j's, c_i's and KW?
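Before the adaptive case, a sketch of the fixed-center, fixed-KW linear-regression version just described, in 1-d; the slides don't pin down the KernelFunction, so the Gaussian kernel (and the data) here are my assumptions:

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, size=80))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

centers = np.linspace(0.0, 10.0, 12)     # c_j on a grid, held constant
KW = 1.5                                 # kernel width, held constant

def kernel(u):
    return np.exp(-0.5 * u ** 2)         # assumed Gaussian KernelFunction

# Z_kj = KernelFunction(|x_k - c_j| / KW)
Z = kernel(np.abs(x[:, None] - centers[None, :]) / KW)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # beta = (Z^T Z)^{-1} Z^T y

x_new = np.array([2.5])
Z_new = kernel(np.abs(x_new[:, None] - centers[None, :]) / KW)
print(Z_new @ beta)                      # prediction at x = 2.5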
21.
RBFs with Non-Linear Regression (continued)
How do we find all the β_j's, c_i's and KW? Answer: gradient descent. (But I'd like to see, or hope someone's already done, a hybrid, where the c_i's and KW are updated with gradient descent while the β_j's use matrix inversion.)
22.
Radial Basis Functions in 2-d
Two inputs; outputs (heights sticking out of the page) not shown. [Figure: each center with its sphere of significant influence in the (x1, x2) plane.]

Happy RBFs in 2-d
Blue dots denote coordinates of input vectors. [Figure: centers and spheres of influence covering the data.]
23.
Crabby RBFs in 2-d
What's the problem in this example? Blue dots denote coordinates of input vectors. [Figure: centers and their spheres of significant influence versus the data.]

More crabby RBFs
And what's the problem in this example?
24.
Hopeless!
Even before seeing the data, you should understand that this is a disaster! [Figure: an arrangement of centers and their spheres of influence.]

Unhappy
Even before seeing the data, you should understand that this isn't good either. [Figure: another arrangement of centers.]
25.
Robust Regression

[Figure: a scatter plot of y against x, used throughout the following slides.]
26.
Robust Regression
This is the best fit that Quadratic Regression can manage...

...but this is what we'd probably prefer.
27.
LOESS-based Robust Regression
After the initial fit, score each datapoint according to how well it's fitted: "You are a very good datapoint."

"You are not too shabby."
28.
LOESS-based Robust Regression (continued)
"But you are pathetic."

Robust Regression
For k = 1 to R:
• Let (x_k, y_k) be the k'th datapoint.
• Let yest_k be the predicted value of y_k.
• Let w_k be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly: w_k = KernelFn([y_k - yest_k]²).
29.
Robust Regression (continued)
Then redo the regression using weighted datapoints. Weighted regression was described earlier in the "varying noise" section, and is also discussed in the "Memory-based Learning" lecture. Guess what happens next?
(I taught you how to do this in the "Instance-based" lecture; only then the weights depended on distance in input space.)
Repeat the whole thing until converged!
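A hedged sketch of the iterated-reweighting recipe for a quadratic fit; the Gaussian-shaped KernelFn over squared residuals, its bandwidth, the fixed iteration count and the toy data are all my assumptions, not specified on the slides:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 4, 40)
y = 1.0 + 0.5 * x + 0.25 * x ** 2 + rng.normal(scale=0.1, size=x.size)
y[::10] += 4.0                                      # inject a few outliers

Z = np.column_stack([np.ones_like(x), x, x ** 2])   # quadratic terms
w = np.ones_like(x)                                 # datapoint weights

for _ in range(10):                                 # "repeat until converged"
    # Weighted least squares: minimize sum_k w_k (y_k - z_k beta)^2
    W = np.diag(w)
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    resid2 = (y - Z @ beta) ** 2
    # KernelFn([y_k - yest_k]^2): big weight for small residuals
    w = np.exp(-resid2 / (2 * np.median(resid2) + 1e-12))

print(beta)    # close to (1.0, 0.5, 0.25) despite the outliers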
30.
Robust Regression---what we're doing
What regular regression does: assume y_k was originally generated using the following recipe:
y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²).
The computational task is to find the maximum likelihood β0, β1 and β2.

What LOESS robust regression does: assume y_k was originally generated using the following recipe:
With probability p: y_k = β0 + β1 x_k + β2 x_k² + N(0, σ²); but otherwise y_k ~ N(µ, σ_huge²).
The computational task is to find the maximum likelihood β0, β1, β2, p, µ and σ_huge.
31.
Robust Regression---what we're doing (continued)
Mysteriously, the reweighting procedure does this computation for us. Your first glimpse of two spectacular letters: E.M.

Regression Trees
32.
Regression Trees
• "Decision trees for regression"

A regression tree leaf: Predict age = 47 (the mean age of records matching this leaf node).
33.
A one-split regression tree
Gender? Female -> Predict age = 39. Male -> Predict age = 36.

Choosing the attribute to split on

Gender  Rich?  Num. Children  Num. Beany Babies  Age
Female  No     2              1                  38
Male    No     0              0                  24
Male    Yes    0              5+                 72
:       :      :              :                  :

• We can't use information gain.
• What should we use?
34.
Choosing the attribute to split on (continued)
MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value. If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity µ_{y|x=j}. Then

MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} (y_k - µ_{y|x=j})².

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X). Guess what we do about real-valued inputs? Guess how we prevent overfitting.
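A small sketch of the MSE(Y|X) split score for categorical attributes (plain Python/NumPy; the rows beyond the three shown in the table are values I made up):

import numpy as np

def mse_given_x(x_values, y_values):
    # MSE(Y|X) = (1/R) * sum over categories j of sum_{k: x_k=j} (y_k - mean_j)^2
    y = np.asarray(y_values, dtype=float)
    total = 0.0
    for j in set(x_values):
        yj = y[[xv == j for xv in x_values]]
        total += np.sum((yj - yj.mean()) ** 2)
    return total / len(y)

gender = ["F", "M", "M", "F", "M", "F"]
rich   = ["No", "No", "Yes", "Yes", "No", "No"]
age    = np.array([38, 24, 72, 55, 30, 41])

# Greedily pick the attribute with the smallest MSE(Y|X)
scores = {"Gender": mse_given_x(gender, age), "Rich?": mse_given_x(rich, age)}
print(scores, min(scores, key=scores.get))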
35.
Pruning Decision Trees
...property-owner = Yes. Do I deserve to live? Gender? Female -> Predict age = 39. Male -> Predict age = 36.
# property-owning females = 56712; # property-owning males = 55800. Mean age among POFs = 39; mean age among POMs = 36. Age std dev among POFs = 12; among POMs = 11.5.
Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.

Linear Regression Trees (also known as "Model Trees")
...property-owner = Yes. Gender? Female -> Predict age = 26 + 6*NumChildren - 2*YearsEducation. Male -> Predict age = 24 + 7*NumChildren - 2.5*YearsEducation.
Leaves contain linear functions (trained using linear regression on all records matching that leaf). The split attribute is chosen to minimize the MSE of the regressed children. Pruning uses a different Chi-squared test.
36.
Linear Regression Trees (continued)
Detail: you typically ignore any categorical attribute that has already been tested higher up in the tree, but use all real-valued attributes in the regression, even if they've been tested higher up.

Test your understanding
Assuming regular regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?
37.
Test your understanding
Assuming linear regression trees, can you sketch a graph of the fitted function yest(x) over this diagram?

Multilinear Interpolation
38.
Multilinear Interpolation
Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data, q1, q2, q3, q4, q5.
39.
Multilinear Interpolation (continued)
We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are three examples (none fits the data well).

How to find the best fit?
Idea 1: simply perform a separate regression in each segment, for each part of the curve. What's the problem with this idea?
40.
How to find the best fit?
Let's look at what goes on in the red segment (between knots q2 and q3, with knot heights h2 and h3):
yest(x) = ((q3 - x)/w) h2 + ((x - q2)/w) h3, where w = q3 - q2.

Equivalently, in the red segment:
yest(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 - (x - q2)/w and φ3(x) = 1 - (q3 - x)/w.
41.
How to find the best fit?
In the red segment: yest(x) = h2 φ2(x) + h3 φ3(x), where φ2(x) = 1 - (x - q2)/w and φ3(x) = 1 - (q3 - x)/w.
The same functions can be written with absolute values: φ2(x) = 1 - |x - q2|/w and φ3(x) = 1 - |x - q3|/w.
42.
How to find the best fit?
In the red segment: yest(x) = h2 φ2(x) + h3 φ3(x), with φ2(x) = 1 - |x - q2|/w and φ3(x) = 1 - |x - q3|/w.

In general:
yest(x) = Σ_{i=1}^{N_K} h_i φ_i(x), where φ_i(x) = 1 - |x - q_i|/w if |x - q_i| < w, and 0 otherwise.
43.
How to find the best fit?
And this is simply a basis function regression problem! We know how to find the least-squares h_i's.

In two dimensions...
Blue dots show locations of input vectors (outputs not depicted).
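A sketch of the 1-d case as plain basis-function regression with triangular "hat" functions at equally spaced knots (the toy data and knot count are my own choices):

import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 8.0, size=60))
y = np.sin(x) + 0.3 * x + rng.normal(scale=0.1, size=x.size)

knots = np.linspace(0.0, 8.0, 5)        # q_1 ... q_NK, equally spaced
w = knots[1] - knots[0]                 # knot spacing

def hat(x, q):
    # phi_i(x) = 1 - |x - q_i|/w if |x - q_i| < w, else 0
    d = np.abs(x - q) / w
    return np.where(d < 1.0, 1.0 - d, 0.0)

Phi = np.column_stack([hat(x, q) for q in knots])
h, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # least-squares knot heights
print(h)                                         # fitted heights at the knots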
44.
In two dimensions...
Each purple dot is a knot point; it will contain the height of the estimated surface. (Blue dots show locations of input vectors; outputs not depicted.) But how do we do the interpolation to ensure that the surface is continuous?
45.
In two dimensions...
To predict the value at a query point, first interpolate its value on two opposite edges of the cell containing it (7.33 on one edge in the example)...
46.
In two dimensions...
...then interpolate between those two edge values (7.05 in the example).
Notes: this can easily be generalized to m dimensions. It should be easy to see that it ensures continuity. The patches are not linear.
47.
Doing the regression
Given data, how do we find the optimal knot heights? Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.) What's the problem in higher dimensions?

MARS: Multivariate Adaptive Regression Splines
48.
MARS
• Multivariate Adaptive Regression Splines
• Invented by Jerry Friedman (one of Andrew's heroes)
• Simplest version: let's assume the function we are learning is of the following form:
yest(x) = Σ_{k=1}^m g_k(x_k)
Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

Idea: each g_k is one of the piecewise-linear functions that bends only at the knots q1, ..., q5, as in the multilinear interpolation slides.
49.
MARS (continued)
yest(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h_j^k φ_j^k(x_k), where φ_j^k(x_k) = 1 - |x_k - q_j^k| / w_k if |x_k - q_j^k| < w_k, and 0 otherwise.
Here q_j^k is the location of the j'th knot in the k'th dimension, h_j^k is the regressed height of the j'th knot in the k'th dimension, and w_k is the spacing between knots in the k'th dimension.

That's not complicated enough!
Okay, now let's get serious. We'll allow arbitrary "two-way interactions":
yest(x) = Σ_{k=1}^m g_k(x_k) + Σ_{k=1}^m Σ_{t=k+1}^m g_kt(x_k, x_t)
The function we're learning is allowed to be a sum of non-linear functions over all one-d and 2-d subsets of attributes. This can still be expressed as a linear combination of basis functions, and is thus learnable by linear regression. Full MARS uses cross-validation to choose a subset of subspaces, knot resolution and other parameters.
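A sketch of the "simplest version" additive model: one set of hat-function knots per input dimension, all heights regressed jointly (the data, knot counts and reuse of the hat basis from the interpolation sketch are my assumptions):

import numpy as np

rng = np.random.default_rng(4)
R, m, NK = 200, 2, 6
X = rng.uniform(0.0, 5.0, size=(R, m))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=R)

def hat_design(col, knots, w):
    # Columns phi_j^k(x_k) = 1 - |x_k - q_j^k|/w_k if within w_k, else 0
    d = np.abs(col[:, None] - knots[None, :]) / w
    return np.where(d < 1.0, 1.0 - d, 0.0)

# One block of NK hat-function columns per input dimension k
blocks = []
for k in range(m):
    knots = np.linspace(X[:, k].min(), X[:, k].max(), NK)
    blocks.append(hat_design(X[:, k], knots, knots[1] - knots[0]))
Z = np.hstack(blocks)                       # R x (m*NK) design matrix

h, *_ = np.linalg.lstsq(Z, y, rcond=None)   # heights h_j^k for every knot
print(h.reshape(m, NK))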
50.
If you like MARS...
...see also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).
• Many of the same gut-level intuitions
• But entirely in a neural-network, biologically plausible way
• (All the low-dimensional functions are by means of lookup tables, trained with a delta rule and using a clever blurred update and hash-tables.)

Where are we now?
• Inputs -> Inference Engine -> p(E1|E2): Joint DE
• Inputs -> Classifier -> Predict category: Dec Tree, Gauss/Joint BC, Gauss Naïve BC
• Inputs -> Density Estimator -> Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE
• Inputs -> Regressor -> Predict real no.: Linear Regression, Polynomial Regression, RBFs, Robust Regression, Regression Trees, Multilinear Interp, MARS
51.
Citations
Radial Basis Functions: T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.
LOESS: W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), 829-836, December 1979.
Regression Trees etc.: L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984; J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.
MARS: J. H. Friedman, Multivariate Adaptive Regression Splines, Department of Statistics, Stanford University, Technical Report No. 102, 1988.