Neural Networks
Radial Basis Function Networks - Forward Heuristics
Andres Mendez-Vazquez
December 10, 2015
Outline
1 Predicting Variance of w and the output d
The Variance Matrix
Selecting Regularization Parameter
2 How many dimensions?
How many dimensions?
3 Forward Selection Algorithms
Introduction
Incremental Operations
Complexity Comparison
Adding Basis Function Under Regularization
Removing an Old Basis Function under Regularization
A Possible Forward Algorithm
What is the variance of the weight vector w?
The meaning
If the weights have been calculated on the basis of an estimate of a
stochastic variable d:
What is the corresponding uncertainty in the estimation of w?
Assume that the noise affecting d is normal and independent and
identically distributed
E_D [(d − d̄)(d − d̄)^T] = σ² I   (1)
Where σ is the standard deviation of the noise and d̄ is the mean value of d.
Thus, we have that the noise is distributed as N(d̄, σ² I).
Remember
We are using a linear model
f(x_i) = Σ_{j=1}^{d1} w_j φ_j(x_i)   (2)
Thus, solving the error under regularization
w = (Φ^T Φ + Λ)^{-1} Φ^T d   (3)
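As a concrete illustration of (2) and (3), here is a minimal numpy sketch (not part of the original slides): the toy data, the Gaussian basis functions, their centers, the width and the regularization matrix Λ are all assumptions chosen only for the example.

import numpy as np

# Toy data: N noisy samples of a one-dimensional target (assumed example data).
rng = np.random.default_rng(0)
N = 50
x = np.linspace(-3.0, 3.0, N)
d = np.sin(x) + 0.1 * rng.standard_normal(N)          # noisy targets d

# Gaussian radial basis functions with d1 = 10 fixed centers (an assumption).
centers = np.linspace(-3.0, 3.0, 10)
width = 0.75
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2.0 * width**2))   # Phi[i, j] = φ_j(x_i), Eq. (2)

# Eq. (3): w = (Φ^T Φ + Λ)^{-1} Φ^T d, solved without forming the inverse explicitly.
Lambda = 0.01 * np.eye(Phi.shape[1])
w = np.linalg.solve(Phi.T @ Phi + Lambda, Phi.T @ d)
f = Phi @ w                                            # fitted outputs f(x_i)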
Thus
Getting the Expected Value
w̄ = E_D(w) = E_D[(Φ^T Φ + Λ)^{-1} Φ^T d]
           = (Φ^T Φ + Λ)^{-1} Φ^T E_D[d]
           = (Φ^T Φ + Λ)^{-1} Φ^T d̄
Thus, we have
The Variance of w is
W = E_D[(w − w̄)(w − w̄)^T]
  = E_D[((Φ^T Φ + Λ)^{-1} Φ^T d − (Φ^T Φ + Λ)^{-1} Φ^T d̄)((Φ^T Φ + Λ)^{-1} Φ^T d − (Φ^T Φ + Λ)^{-1} Φ^T d̄)^T]
  = E_D[(Φ^T Φ + Λ)^{-1} Φ^T (d − d̄)(d − d̄)^T Φ (Φ^T Φ + Λ)^{-1}]
  = (Φ^T Φ + Λ)^{-1} Φ^T E_D[(d − d̄)(d − d̄)^T] Φ (Φ^T Φ + Λ)^{-1}
  = (Φ^T Φ + Λ)^{-1} Φ^T σ² I Φ (Φ^T Φ + Λ)^{-1}
  = σ² (Φ^T Φ + Λ)^{-1} Φ^T Φ (Φ^T Φ + Λ)^{-1}
The Least Squared Error Case
We have
Λ = 0 ⟹ W = σ² (Φ^T Φ)^{-1}   (4)
The following matrix is known as the variance matrix
A^{-1} = (Φ^T Φ + Λ)^{-1}   (5)
For the standard Ridge Regression, where Φ^T Φ = A − λ I_{d1},
W = σ² A^{-1} [A − λ I_{d1}] A^{-1} = σ² (A^{-1} − λ A^{-2})
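A short numpy check of (5) and of the ridge form of W (a sketch only; it reuses the toy Phi and noise level from the previous example, and the scalar ridge parameter lam is an assumption):

import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
Phi = np.exp(-(x[:, None] - np.linspace(-3.0, 3.0, 10)[None, :])**2 / (2.0 * 0.75**2))

lam = 0.01
d1 = Phi.shape[1]
A = Phi.T @ Phi + lam * np.eye(d1)                 # A = Φ^T Φ + λ I_{d1}
A_inv = np.linalg.inv(A)                           # the variance matrix A^{-1}, Eq. (5)

sigma2 = 0.1**2                                    # assumed noise variance σ²
W = sigma2 * A_inv @ (Phi.T @ Phi) @ A_inv         # general form σ² A^{-1} Φ^T Φ A^{-1}
W_ridge = sigma2 * (A_inv - lam * A_inv @ A_inv)   # ridge form σ² (A^{-1} − λ A^{-2})
assert np.allclose(W, W_ridge)                     # the two expressions agree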
How to select the regularization parameter λ?
We know that
f = Φ (Φ^T Φ + λ I_{d1})^{-1} Φ^T d   (6)
Thus, we have the following for the difference between d and f
d − f = d − Φ (Φ^T Φ + λ I_{d1})^{-1} Φ^T d   (7)
Thus, we have the Projection Matrix P
d − f = [I_N − Φ (Φ^T Φ + λ I_{d1})^{-1} Φ^T] d = P d   (8)
We can use this to rewrite the cost function
Cost function
C(w, λ) = Σ_{i=1}^{N} (d_i − f(x_i))² + Σ_{j=1}^{d1} λ_j w_j²   (9)
We have then, evaluated at the optimal weights w = A^{-1} Φ^T d,
C(w, λ) = (Φw − d)^T (Φw − d) + w^T Λ w
        = d^T (Φ A^{-1} Φ^T − I_N)(Φ A^{-1} Φ^T − I_N) d + d^T Φ A^{-1} Λ A^{-1} Φ^T d
However
We have
Φ A^{-1} Λ A^{-1} Φ^T = Φ A^{-1} (A − Φ^T Φ) A^{-1} Φ^T
                      = Φ A^{-1} Φ^T − (Φ A^{-1} Φ^T)²
                      = P − P²
Simplifying the minimum cost
C(w, λ) = d^T P² d + d^T (P − P²) d = d^T P d   (10)
In summary, we have for the Ridge Regression
Something Notable
A = Φ^T Φ + λ I_{d1}
w = A^{-1} Φ^T d
P = I_N − Φ A^{-1} Φ^T
Important Observation
Some sort of model selection must be used to choose a value for the
regularization parameter λ.
The value chosen is the one associated with the lowest prediction error.
Question
Which method should be used to predict the error, and how is the optimal
value found?
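The summary above translates directly into a few numpy lines; this sketch (using the same toy Phi, d and lam as in the earlier examples, all of them assumptions) also checks the identity C(w, λ) = d^T P d from (10):

import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
d = np.sin(x) + 0.1 * rng.standard_normal(50)
Phi = np.exp(-(x[:, None] - np.linspace(-3.0, 3.0, 10)[None, :])**2 / (2.0 * 0.75**2))

lam = 0.01
N, d1 = Phi.shape
A = Phi.T @ Phi + lam * np.eye(d1)                # A = Φ^T Φ + λ I_{d1}
w = np.linalg.solve(A, Phi.T @ d)                 # w = A^{-1} Φ^T d
P = np.eye(N) - Phi @ np.linalg.solve(A, Phi.T)   # P = I_N − Φ A^{-1} Φ^T

residual = d - Phi @ w                            # equals P d, Eq. (8)
cost = residual @ residual + lam * (w @ w)        # Eq. (9) with Λ = λ I
assert np.allclose(residual, P @ d)
assert np.isclose(cost, d @ P @ d)                # Eq. (10)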
Answer
Something Notable
The answer to the first question is that nobody knows for sure.
There are many methods that can be used to obtain that value
Leave-one-out cross-validation.
Generalized cross-validation.
Final prediction error.
Bayesian information criterion.
Bootstrap methods.
We will use an iterative method
We have the following iterative process from Generalized Cross-Validation
λ = [d^T P² d · trace(A^{-1} − λ A^{-2})] / [w^T A^{-1} w · trace(P)]   (11)
To see the development, please take a look at Appendix A.10 of
“Introduction to Radial Basis Function Networks” by Mark J.L. Orr
The iterative process is started with an initial λ, and the value is updated
until convergence.
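A sketch of that re-estimation loop in numpy (the toy data and basis, the initial λ, the iteration cap and the tolerance are all assumptions of the example, not prescriptions from the slides):

import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
d = np.sin(x) + 0.1 * rng.standard_normal(50)
Phi = np.exp(-(x[:, None] - np.linspace(-3.0, 3.0, 10)[None, :])**2 / (2.0 * 0.75**2))

N, d1 = Phi.shape
lam = 1.0                                      # initial guess for λ
for _ in range(100):
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))
    P = np.eye(N) - Phi @ A_inv @ Phi.T
    w = A_inv @ Phi.T @ d
    # Eq. (11): GCV-based re-estimation of λ
    num = (d @ P @ P @ d) * np.trace(A_inv - lam * A_inv @ A_inv)
    den = (w @ A_inv @ w) * np.trace(P)
    lam_new = num / den
    if abs(lam_new - lam) < 1e-10:             # stop once the update no longer moves λ
        lam = lam_new
        break
    lam = lam_new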
How many dimensions for the mapping to high dimensions?
We have the following for ordinary least squares - no regularization
A^{-1} = (Φ^T Φ)^{-1}   (12)
Now, suppose that
You are given a set of numbers {x_i}_{i=1}^{N} randomly drawn from a Gaussian
distribution and you are asked to estimate the variance without being told the
mean.
We can calculate the sample mean
x̄ = (1/N) Σ_{i=1}^{N} x_i   (13)
Thus
This allows us to calculate the sample variance
σ̂² = (1/(N − 1)) Σ_{i=1}^{N} (x_i − x̄)²   (14)
Problem: where does the factor N − 1 come from?
It comes from the fact that the parameter x̄ is fitting part of the noise.
The system has N degrees of freedom
Thus, the underestimation of the variance is corrected by reducing the
remaining degrees of freedom by one.
In Supervised Learning
Similarly
It would be a mistake to divide the sum-squared training error by the
number of patterns in order to estimate the noise variance, since some
degrees of freedom will have been used up in fitting the model.
In our linear model there are d1 weights and N patterns in the
training set
This leaves N − d1 degrees of freedom.
The estimation of the variance is then
σ̂² = Ŝ / (N − d1)   (15)
Remark: Ŝ is the sum-squared error over the training set at the
optimal weight vector, and σ̂² is called the unbiased estimate
of the variance.
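As a small numerical illustration of (15) (a sketch on the same toy problem as before; without regularization all d1 weights count as spent degrees of freedom):

import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
d = np.sin(x) + 0.1 * rng.standard_normal(50)
Phi = np.exp(-(x[:, None] - np.linspace(-3.0, 3.0, 10)[None, :])**2 / (2.0 * 0.75**2))

N, d1 = Phi.shape
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)   # ordinary least squares fit
S_hat = np.sum((d - Phi @ w)**2)              # sum-squared training error Ŝ
sigma2_hat = S_hat / (N - d1)                 # unbiased estimate, Eq. (15)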
First, standard least squared error
Although there are still d1 weights in the model
The effective number of parameters (John Moody), γ, is less than d1
and it depends on the size of the regularization parameters.
We have the following (Moody and MacKay)
γ = N − trace(P)   (16)
First, standard least squared error
In the standard least squared error without regularization, A^{-1} = (Φ^T Φ)^{-1}
γ = N − trace(I_N − Φ A^{-1} Φ^T)
  = trace(Φ A^{-1} Φ^T)
  = trace(A^{-1} Φ^T Φ)
  = trace(I_{d1})
  = d1
Now, with the regularization term
We have A^{-1} = (Φ^T Φ + λ I_{d1})^{-1}
γ = trace(A^{-1} Φ^T Φ)
  = trace(A^{-1} (A − λ I_{d1}))
  = trace(I_{d1} − λ A^{-1})
  = d1 − λ trace(A^{-1})
Now
If the eigenvalues of the matrix Φ^T Φ are {μ_j}_{j=1}^{d1}
γ = d1 − λ trace(A^{-1})
  = d1 − λ Σ_{j=1}^{d1} 1/(λ + μ_j)
  = Σ_{j=1}^{d1} μ_j/(λ + μ_j)
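A numpy check that the trace form and the eigenvalue form of γ agree (a sketch on the toy basis; lam is again an assumed example value):

import numpy as np
x = np.linspace(-3.0, 3.0, 50)
Phi = np.exp(-(x[:, None] - np.linspace(-3.0, 3.0, 10)[None, :])**2 / (2.0 * 0.75**2))

lam = 0.01
d1 = Phi.shape[1]
A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))

gamma_trace = d1 - lam * np.trace(A_inv)   # γ = d1 − λ trace(A^{-1})
mu = np.linalg.eigvalsh(Phi.T @ Phi)       # eigenvalues μ_j of Φ^T Φ
gamma_eig = np.sum(mu / (lam + mu))        # γ = Σ μ_j / (λ + μ_j)
assert np.isclose(gamma_trace, gamma_eig)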
About Ridge Regression
Remark
Ridge regression is used as a way to balance bias and variance by varying
the effective number of parameters in a linear model.
An alternative strategy
It is to compare models made up of different subsets of basis functions
drawn from the same fixed set of candidates.
This is called
Subset selection in statistics and machine learning.
Problem
This is normally intractable: with N candidate basis functions you have
2^N − 1 subsets to test   (17)
We could use different methods
1 K-means, which is explained in the book.
2 Forward Selection heuristics, which we will explain here.
Forward Selection
It starts with an empty subset to which one basis function is added at a time:
The one that reduces the sum-squared error the most.
Until a chosen criterion, such as GCV, stops decreasing.
Subset Selection Vs. Optimization
Classic Neural Network Optimization
It involves the optimization, by gradient descent, of a nonlinear
sum-squared-error surface in a high-dimensional space defined by the
network parameters.
Specifically, in RBF networks
The network parameters are the centers, the sizes and the hidden-to-output
weights.
Subset Selection Vs. Optimization
In Subset Selection
The heuristic searches in a discrete space of subsets of a set of hidden
units with fixed centers and sizes while finding a subset with the
lowest prediction error.
It uses a minimization criterion such as the variance estimate of GCV:
σ̂²_GCV = N d̂^T P² d̂ / (trace(P))²   (18)
In addition
Hidden-to-Output Weights
They are not selected, they are slaved to the centers and sizes of the
chosen subset.
Forward selection is a non-linear type of heuristic with the following
advantages
There is no need to fix the number of hidden units in advance.
The model selection criteria are tractable.
The computational requirements are relatively low.
Thus, under the classic least squared error
Something Notable
In forward selection each step involves growing the network by one basis
function.
Therefore
Adding a new basis function is one of the incremental operations, done by
using the equation
P_{d1+1} = P_{d1} − (P_{d1} φ_j φ_j^T P_{d1}) / (φ_j^T P_{d1} φ_j)   (19)
Thus
Where
P_{d1+1} is the succeeding projection matrix if the j-th member of the
set is added.
P_{d1} is the projection matrix for the d1 hidden units.
The vectors {φ_j}_{j=1}^{N} are the column vectors of the matrix Φ with
N ≥ d1.
Thus
We have that
Φ_N = [φ_1 φ_2 ... φ_N]   (20)
if we take into account all the possible centers given by all the basis
functions.
What are we going to do?
This is what we want to do
1 Adding a new basis function
2 Removing an old basis function
Given a matrix
Given a squared matrix B of size d1, we have the following
B^{-1} B = I_{d1},  B B^{-1} = I_{d1}
Inverse of a matrix with a small-rank adjustment
Suppose that a d1 × d1 matrix B1 is obtained by adding a small-rank
adjustment X R Y^T to the matrix B0,
B1 = B0 + X R Y^T   (21)
Where
B0 ∈ R^{d1×d1} has a known inverse, X, Y ∈ R^{d1×r} are known with d1 > r,
R ∈ R^{r×r}, and the inverse of B1 is sought.
We can do the following
We have the following formula
B1^{-1} = B0^{-1} − B0^{-1} X (Y^T B0^{-1} X + R^{-1})^{-1} Y^T B0^{-1}   (22)
Something Notable
This is much more efficient because it only involves inverting the r × r matrix
(Y^T B0^{-1} X + R^{-1}).
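A quick numpy check of the small-rank update (22); the matrices below are random toy values (assumptions of the example), diagonally shifted so that the needed inverses exist:

import numpy as np
rng = np.random.default_rng(1)

d1, r = 8, 2
B0 = rng.standard_normal((d1, d1)) + d1 * np.eye(d1)   # base matrix, made well conditioned
B0_inv = np.linalg.inv(B0)                              # assumed to be already known
X = rng.standard_normal((d1, r))
Y = rng.standard_normal((d1, r))
R = rng.standard_normal((r, r)) + r * np.eye(r)

B1 = B0 + X @ R @ Y.T                                   # Eq. (21): small-rank adjustment
# Eq. (22): only an r × r matrix has to be inverted.
core = np.linalg.inv(Y.T @ B0_inv @ X + np.linalg.inv(R))
B1_inv = B0_inv - B0_inv @ X @ core @ Y.T @ B0_inv
assert np.allclose(B1_inv, np.linalg.inv(B1))           # agrees with the direct inverse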
Thus, we can then partition the matrix A
We have the following partition
B = [B11  B12;  B21  B22]   (23)
We have that
B^{-1} = [(B11 − B12 B22^{-1} B21)^{-1},  B11^{-1} B12 (B21 B11^{-1} B12 − B22)^{-1};
          (B21 B11^{-1} B12 − B22)^{-1} B21 B11^{-1},  (B22 − B21 B11^{-1} B12)^{-1}]   (24)
Finally, we get, using ∆ = B22 − B21 B11^{-1} B12
We have
B^{-1} = [B11^{-1} + B11^{-1} B12 ∆^{-1} B21 B11^{-1},  −B11^{-1} B12 ∆^{-1};
          −∆^{-1} B21 B11^{-1},  ∆^{-1}]   (25)
Using this equation we obtain the following improvements
Because if we retrain the network from scratch, we need to do the following:
Constructing the new design matrix.
Multiplying it with itself.
Adding the regularizer (if there is one).
Taking the inverse to obtain the variance matrix.
Recomputing the projection matrix.
Complexity of calculation of P
We have the following approximate number of multiplications
Operation              | Completely Retrain      | Using Incremental Operation
Add a new basis        | d1³ + N d1² + N² d1     | N²
Remove an old basis    | d1³ + N d1² + N² d1     | N²
Add a new pattern      | d1³ + N d1² + N² d1     | 2 d1² + d1 N + N²
Remove an old pattern  | d1³ + N d1² + N² d1     | 2 d1² + d1 N + N²
Adding a Basis Function
We do the following
If the j-th basis function is chosen, then φ_j is appended as the last
column of Φ_{d1} and renamed d1 + 1.
Thus, incrementing to the new matrix
Φ_{d1+1} = [Φ_{d1}  φ_{d1+1}]   (26)
Where
We have that
φ_{d1+1} = [φ_{d1+1}(x_1), φ_{d1+1}(x_2), ..., φ_{d1+1}(x_N)]^T   (27)
Using our variance matrix
We have the following variance matrix for the general case
A_{d1+1} = Φ_{d1+1}^T Φ_{d1+1} + Λ_{d1+1}   (28)
We have
A_{d1+1} = [Φ_{d1}  φ_{d1+1}]^T [Φ_{d1}  φ_{d1+1}] + [Λ_{d1},  0;  0^T,  λ_{d1+1}]
Thus
We have that
A_{d1+1} = [Φ_{d1}^T Φ_{d1},  Φ_{d1}^T φ_{d1+1};  φ_{d1+1}^T Φ_{d1},  φ_{d1+1}^T φ_{d1+1}] + [Λ_{d1},  0;  0^T,  λ_{d1+1}]
         = [Φ_{d1}^T Φ_{d1} + Λ_{d1},  Φ_{d1}^T φ_{d1+1};  φ_{d1+1}^T Φ_{d1},  λ_{d1+1} + φ_{d1+1}^T φ_{d1+1}]
Then
We have
A_{d1+1} = [A_{d1},  Φ_{d1}^T φ_{d1+1};  φ_{d1+1}^T Φ_{d1},  λ_{d1+1} + φ_{d1+1}^T φ_{d1+1}]
Using our partition
A_{d1+1}^{-1} = [A_{d1}^{-1},  0;  0^T,  0] +
  (1 / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})) [A_{d1}^{-1} Φ_{d1}^T φ_{d1+1};  −1] [φ_{d1+1}^T Φ_{d1} A_{d1}^{-1},  −1]
Where P_{d1} = I_N − Φ_{d1} A_{d1}^{-1} Φ_{d1}^T
Finally
Then, expanding the outer product,
A_{d1+1}^{-1} = [A_{d1}^{-1},  0;  0^T,  0] +
  (1 / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})) ×
  [A_{d1}^{-1} Φ_{d1}^T φ_{d1+1} φ_{d1+1}^T Φ_{d1} A_{d1}^{-1},  −A_{d1}^{-1} Φ_{d1}^T φ_{d1+1};
   −φ_{d1+1}^T Φ_{d1} A_{d1}^{-1},  1]
We have then that
We can use the previous result for A_{d1+1}^{-1}
P_{d1+1} = I_N − Φ_{d1+1} A_{d1+1}^{-1} Φ_{d1+1}^T
         = P_{d1} − (P_{d1} φ_{d1+1} φ_{d1+1}^T P_{d1}) / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})
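A numpy sketch of this incremental update, checked against recomputing P from scratch (the toy data, the Gaussian candidate column, and the choice λ_{d1+1} equal to the shared ridge parameter are assumptions of the example):

import numpy as np

# Toy setup: current model with d1 basis functions and one candidate column.
x = np.linspace(-3.0, 3.0, 50)
centers = np.linspace(-3.0, 3.0, 11)
Phi_all = np.exp(-(x[:, None] - centers[None, :])**2 / (2.0 * 0.75**2))
Phi_d1, phi_new = Phi_all[:, :10], Phi_all[:, 10]     # current Φ_{d1} and candidate φ_{d1+1}

lam = 0.01                                            # λ_{d1+1}, also used as the shared ridge parameter
N, d1 = Phi_d1.shape
A_d1 = Phi_d1.T @ Phi_d1 + lam * np.eye(d1)
P_d1 = np.eye(N) - Phi_d1 @ np.linalg.solve(A_d1, Phi_d1.T)

# Incremental update of the projection matrix when the candidate is added.
Pphi = P_d1 @ phi_new
P_d1p1 = P_d1 - np.outer(Pphi, Pphi) / (lam + phi_new @ Pphi)

# Check against recomputing P directly from the enlarged design matrix.
Phi_big = np.column_stack([Phi_d1, phi_new])
A_big = Phi_big.T @ Phi_big + lam * np.eye(d1 + 1)
P_direct = np.eye(N) - Phi_big @ np.linalg.solve(A_big, Phi_big.T)
assert np.allclose(P_d1p1, P_direct)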
How do we select the new basis
We can use the greatest decrease in the sum-squared error
Ŝ_{d1} = ŷ^T P_{d1}² ŷ   (29)
In addition, we have
Ŝ_{d1+1} = ŷ^T P_{d1+1}² ŷ   (30)
Thus, we do the difference
We want to maximize the decrease
Ŝ_{d1} − Ŝ_{d1+1} = ŷ^T P_{d1}² ŷ − ŷ^T P_{d1+1}² ŷ
                  = ŷ^T (P_{d1}² − P_{d1+1}²) ŷ
                  = ŷ^T [P_{d1}² − (P_{d1} − (P_{d1} φ_{d1+1} φ_{d1+1}^T P_{d1}) / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1}))²] ŷ
                  = 2 (ŷ^T P_{d1}² φ_{d1+1})(ŷ^T P_{d1} φ_{d1+1}) / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})
                    − (ŷ^T P_{d1} φ_{d1+1})² (φ_{d1+1}^T P_{d1}² φ_{d1+1}) / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})²
An alternative is to seek to maximize the decrease in the cost function
We have
Ĉ_{d1} − Ĉ_{d1+1} = ŷ^T P_{d1} ŷ − ŷ^T P_{d1+1} ŷ
                  = ŷ^T [(P_{d1} φ_{d1+1} φ_{d1+1}^T P_{d1}) / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})] ŷ
                  = (ŷ^T P_{d1} φ_{d1+1})² / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1})
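In code, scoring every candidate with this cost decrease and picking the best one looks roughly as follows (a sketch; the empty starting model with P = I_N, the candidate pool, and λ are example assumptions):

import numpy as np
rng = np.random.default_rng(0)

# Toy setup: a pool of candidate basis functions and an empty current model.
x = np.linspace(-3.0, 3.0, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)          # targets ŷ
candidates = np.exp(-(x[:, None] - np.linspace(-3.0, 3.0, 20)[None, :])**2 / (2.0 * 0.75**2))

N = x.size
lam = 0.01
P = np.eye(N)                                          # P for the model with no basis functions yet

# Cost decrease for each candidate j: (ŷ^T P φ_j)² / (λ + φ_j^T P φ_j)
scores = [(y @ P @ phi)**2 / (lam + phi @ P @ phi) for phi in candidates.T]
best_j = int(np.argmax(scores))                        # the candidate that lowers the cost most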
Removing an Old Basis Function under Regularization
Here, we can remove any column
Process:
1 Move the selected j-th column to the end (permutation).
2 Apply our well known equation with P_{d1} in place of P_{d1+1} and
P_{d1−1} in place of P_{d1}.
3 In addition, φ_j in place of φ_{d1+1}.
4 And λ_j in place of λ_{d1+1}.
Thus, we have
P_{d1} = P_{d1−1} − (P_{d1−1} φ_j φ_j^T P_{d1−1}) / (λ_j + φ_j^T P_{d1−1} φ_j)   (31)
Thus
If λ_j ≠ 0
We can first post- and then pre-multiply (31) by φ_j to obtain expressions for
P_{d1−1} φ_j and φ_j^T P_{d1−1} φ_j in terms of P_{d1}.
Thus, we have
P_{d1−1} = P_{d1} + (P_{d1} φ_j φ_j^T P_{d1}) / (λ_j − φ_j^T P_{d1} φ_j)   (32)
However
For small λ_j, the round-off error can be problematic!!!
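A numpy check of (32) on the toy basis (a sketch; proj is a helper of the example, and the removed column is assumed to already be the last one):

import numpy as np

x = np.linspace(-3.0, 3.0, 50)
centers = np.linspace(-3.0, 3.0, 10)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2.0 * 0.75**2))
lam = 0.01                                             # λ_j ≠ 0 is required here
N, d1 = Phi.shape

def proj(Phi, lam):
    # P = I_N − Φ (Φ^T Φ + λ I)^{-1} Φ^T for the given design matrix.
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.eye(Phi.shape[0]) - Phi @ np.linalg.solve(A, Phi.T)

P_d1 = proj(Phi, lam)
phi_j = Phi[:, -1]                                     # the column being removed
Pphi = P_d1 @ phi_j
P_d1m1 = P_d1 + np.outer(Pphi, Pphi) / (lam - phi_j @ Pphi)   # Eq. (32)
assert np.allclose(P_d1m1, proj(Phi[:, :-1], lam))     # matches the smaller model's P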
Based on the previous ideas
We are ready for a basic algorithm
However, this can be improved.
We have the following pseudocode
Forward-Regularization(D)
1 Select the d1 functions to be used as basis, based on the data D
  This can be done randomly or using the clustering method described in
  Haykin
2 Select an ε > 0 stopping criterion
3 Ĉ_{d1} = ŷ^T P_{d1} ŷ
4 Do
5     Ĉ_{d1} = Ĉ_{d1+1}
6     d1 = d1 + 1
7     Do
8         Select a new basis element and generate φ_{d1+1}. Several strategies exist
9         Generate A_{d1+1}^{-1} and P_{d1+1}
10        Calculate Ĉ_{d1+1}
11    Until (ŷ^T P_{d1} φ_{d1+1})² / (λ_{d1+1} + φ_{d1+1}^T P_{d1} φ_{d1+1}) > 0
12 Until (Ĉ_{d1} − Ĉ_{d1+1})² < ε
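Below is a minimal Python sketch in the spirit of this pseudocode, not the author's exact algorithm: it starts from an empty model, greedily adds the candidate with the largest cost decrease using the incremental P update, and stops when the best decrease falls below a tolerance eps (a simpler stopping rule than the one on the slide). Function and variable names are the example's own.

import numpy as np

def forward_regularization(Phi_all, y, lam=0.01, eps=1e-6, max_bases=None):
    # Phi_all: N x M matrix whose columns are all candidate basis functions φ_j(x_i).
    # Returns the indices of the selected columns and the fitted weights.
    N, M = Phi_all.shape
    max_bases = M if max_bases is None else max_bases
    P = np.eye(N)                          # projection matrix of the empty model
    selected = []
    while len(selected) < max_bases:
        best_j, best_dec, best_Pphi = None, 0.0, None
        for j in range(M):
            if j in selected:
                continue
            phi = Phi_all[:, j]
            Pphi = P @ phi
            dec = (y @ Pphi)**2 / (lam + phi @ Pphi)   # decrease in the cost ŷ^T P ŷ
            if dec > best_dec:
                best_j, best_dec, best_Pphi = j, dec, Pphi
        if best_j is None or best_dec < eps:
            break                                      # no candidate helps enough: stop
        # Incremental update of P when the chosen basis function is added.
        phi = Phi_all[:, best_j]
        P = P - np.outer(best_Pphi, best_Pphi) / (lam + phi @ best_Pphi)
        selected.append(best_j)
    Phi = Phi_all[:, selected]
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(selected)), Phi.T @ y)
    return selected, w

# Usage on the toy problem used throughout these sketches: one Gaussian candidate per data point.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
candidates = np.exp(-(x[:, None] - x[None, :])**2 / (2.0 * 0.5**2))
idx, w = forward_regularization(candidates, y, lam=0.01, eps=1e-4)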
For more on this
Please Read the following
Introduction to Radial Basis Function Networks by Mark J. L. Orr
And there is much more
Look at the book Bootstrap Methods and their Application by A. C.
Davison and D. V. Hinkley
18 Machine Learning Radial Basis Function Networks Forward Heuristics

  • 1. Neural Networks Radial Basis Function Networks - Forward Heuristics Andres Mendez-Vazquez December 10, 2015 1 / 58
  • 2. Outline 1 Predicting Variance of w and the output d The Variance Matrix Selecting Regularization Parameter 2 How many dimensions? How many dimensions? 3 Forward Selection Algorithms Introduction Incremental Operations Complexity Comparison Adding Basis Function Under Regularization Removing an Old Basis Function under Regularization A Possible Forward Algorithm 2 / 58
  • 4. What is the variance of the weight vector w? The meaning: if the weights have been calculated from an estimate of a stochastic variable d, what is the corresponding uncertainty in the estimate of w? Assume that the noise affecting d is normal and independently, identically distributed, E_D[(d − \bar{d})(d − \bar{d})^T] = σ^2 I (1), where σ is the standard deviation of the noise and \bar{d} is the mean value of d. Thus the noise is distributed as N(\bar{d}, σ^2 I). 4 / 58
  • 8. Remember: we are using a linear model f(x_i) = Σ_{j=1}^{d_1} w_j φ_j(x_i) (2). Thus, minimizing the error under regularization gives w = (Φ^T Φ + Λ)^{-1} Φ^T d (3). 5 / 58
  • 10. Thus, taking the expected value: \bar{w} = E_D[w] = E_D[(Φ^T Φ + Λ)^{-1} Φ^T d] = (Φ^T Φ + Λ)^{-1} Φ^T E_D[d] = (Φ^T Φ + Λ)^{-1} Φ^T \bar{d}. 6 / 58
  • 13. Thus, we have the variance of w: W = E_D[(w − \bar{w})(w − \bar{w})^T] = E_D[((Φ^T Φ + Λ)^{-1} Φ^T (d − \bar{d})) ((Φ^T Φ + Λ)^{-1} Φ^T (d − \bar{d}))^T] = (Φ^T Φ + Λ)^{-1} Φ^T E_D[(d − \bar{d})(d − \bar{d})^T] Φ (Φ^T Φ + Λ)^{-1} = (Φ^T Φ + Λ)^{-1} Φ^T σ^2 I Φ (Φ^T Φ + Λ)^{-1} = σ^2 (Φ^T Φ + Λ)^{-1} Φ^T Φ (Φ^T Φ + Λ)^{-1}. 7 / 58
  • 19. The least squared error case: with Λ = 0, W = σ^2 (Φ^T Φ)^{-1} (4). The matrix A^{-1} = (Φ^T Φ + Λ)^{-1} (5) is known as the variance matrix. For standard ridge regression, Λ = λ I_{d_1}, so Φ^T Φ = A − λ I_{d_1} and W = σ^2 A^{-1} [A − λ I_{d_1}] A^{-1} = σ^2 (A^{-1} − λ A^{-2}); a small numerical check of this identity follows below. 8 / 58
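A minimal numerical sketch of the identity above, using NumPy. The design matrix Phi, the noise level sigma and the ridge parameter lam are illustrative assumptions, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
N, d1 = 50, 5
Phi = rng.normal(size=(N, d1))        # design matrix of basis-function outputs (assumed)
lam = 0.1                             # ridge parameter, Lambda = lam * I (assumed)
sigma = 0.3                           # noise standard deviation (assumed)

A = Phi.T @ Phi + lam * np.eye(d1)    # A = Phi^T Phi + Lambda
A_inv = np.linalg.inv(A)

# General form: W = sigma^2 A^{-1} Phi^T Phi A^{-1}
W_general = sigma**2 * A_inv @ (Phi.T @ Phi) @ A_inv
# Ridge form:   W = sigma^2 (A^{-1} - lam * A^{-2})
W_ridge = sigma**2 * (A_inv - lam * A_inv @ A_inv)

print(np.allclose(W_general, W_ridge))   # True: both expressions agree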
  • 24. How do we select the regularization parameter λ? We know that \hat{f} = Φ (Φ^T Φ + λ I_{d_1})^{-1} Φ^T d (6). Thus, the difference between d and \hat{f} is d − \hat{f} = d − Φ (Φ^T Φ + λ I_{d_1})^{-1} Φ^T d (7). This defines the projection matrix P: d − \hat{f} = (I_N − Φ (Φ^T Φ + λ I_{d_1})^{-1} Φ^T) d = P d (8). 10 / 58
  • 27. We can use this to rewrite the cost function C(w, λ) = Σ_{i=1}^{N} (d_i − f(x_i))^2 + Σ_{j=1}^{d_1} λ_j w_j^2 (9). Substituting the optimal w = A^{-1} Φ^T d, we have C(w, λ) = (Φw − d)^T (Φw − d) + w^T Λ w = d^T (Φ A^{-1} Φ^T − I_N)(Φ A^{-1} Φ^T − I_N) d + d^T Φ A^{-1} Λ A^{-1} Φ^T d. 11 / 58
  • 30. However, since Λ = A − Φ^T Φ, we have Φ A^{-1} Λ A^{-1} Φ^T = Φ A^{-1} (A − Φ^T Φ) A^{-1} Φ^T = Φ A^{-1} Φ^T − (Φ A^{-1} Φ^T)^2 = P − P^2, because Φ A^{-1} Φ^T = I_N − P. Simplifying the minimum cost: C(w, λ) = d^T P^2 d + d^T (P − P^2) d = d^T P d (10). 12 / 58
  • 34. In summary, for ridge regression we have: A = Φ^T Φ + λ I_{d_1}, w = A^{-1} Φ^T d, and P = I_N − Φ A^{-1} Φ^T (a numerical sketch of these quantities follows below). Important observation: some form of model selection must be used to choose a value for the regularization parameter λ; the value chosen is the one associated with the lowest prediction error. Question: which method should be used to predict the error, and how is the optimal value found? 13 / 58
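As a sanity check of the three quantities above and of equation (10), here is a small sketch; the synthetic Phi and d are assumptions for illustration, not the lecture's data.

import numpy as np

rng = np.random.default_rng(1)
N, d1 = 40, 6
Phi = rng.normal(size=(N, d1))
d = rng.normal(size=N)
lam = 0.5

A = Phi.T @ Phi + lam * np.eye(d1)                 # A = Phi^T Phi + lambda I
w = np.linalg.solve(A, Phi.T @ d)                  # w = A^{-1} Phi^T d
P = np.eye(N) - Phi @ np.linalg.solve(A, Phi.T)    # P = I_N - Phi A^{-1} Phi^T

# At the optimal w the regularized cost collapses to d^T P d (equation (10))
cost = np.sum((d - Phi @ w) ** 2) + lam * np.sum(w ** 2)
print(np.allclose(cost, d @ P @ d))                # True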
  • 43. Answer: to the first question, nobody knows for sure. There are many methods that can be used to obtain that value: leave-one-out cross-validation, generalized cross-validation (GCV), the final prediction error, the Bayesian information criterion, and bootstrap methods. 14 / 58
  • 49. We will use an iterative method. Generalized cross-validation gives the following re-estimation formula: λ = (d^T P^2 d · trace(A^{-1} − λ A^{-2})) / (w^T A^{-1} w · trace(P)) (11). For the derivation, see Appendix A.10 of “Introduction to Radial Basis Function Networks” by Mark J. L. Orr. The process is started with an initial λ, and the value is updated until convergence; a sketch of this loop follows below. 15 / 58
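A sketch of that re-estimation loop, assuming a single global λ and synthetic inputs; the initial λ and the stopping tolerance are illustrative choices, not prescribed by the lecture.

import numpy as np

def gcv_lambda(Phi, d, lam=1.0, tol=1e-6, max_iter=100):
    # Iterate equation (11): lambda = d^T P^2 d * tr(A^{-1} - lam A^{-2}) / (w^T A^{-1} w * tr(P))
    N, d1 = Phi.shape
    for _ in range(max_iter):
        A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))
        w = A_inv @ Phi.T @ d
        P = np.eye(N) - Phi @ A_inv @ Phi.T
        num = (d @ P @ P @ d) * np.trace(A_inv - lam * A_inv @ A_inv)
        den = (w @ A_inv @ w) * np.trace(P)
        new_lam = num / den
        if abs(new_lam - lam) < tol * lam:         # stop once lambda has converged
            return new_lam
        lam = new_lam
    return lam

rng = np.random.default_rng(2)
Phi = rng.normal(size=(40, 6))
d = Phi @ rng.normal(size=6) + 0.1 * rng.normal(size=40)
print(gcv_lambda(Phi, d))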
  • 53. How many dimensions for the mapping to high dimensions? For ordinary least squares (no regularization), A^{-1} = (Φ^T Φ)^{-1} (12). Now suppose you are given a set of numbers {x_i}_{i=1}^{N}, randomly drawn from a Gaussian distribution, and you are asked to estimate the variance without being told the mean. We can calculate the sample mean \bar{x} = (1/N) Σ_{i=1}^{N} x_i (13). 17 / 58
  • 56. This allows us to calculate the sample variance \hat{σ}^2 = (1/(N − 1)) Σ_{i=1}^{N} (x_i − \bar{x})^2 (14). Where does the N − 1 come from? It comes from the fact that the fitted parameter \bar{x} also fits part of the noise. The system has N degrees of freedom, so the underestimation of the variance is corrected by reducing the remaining degrees of freedom by one. 18 / 58
  • 59. Similarly, in supervised learning it would be a mistake to divide the sum-squared training error by the number of patterns in order to estimate the noise variance, since some degrees of freedom have been used up in fitting the model. In our linear model there are d_1 weights and N patterns in the training set, which leaves N − d_1 degrees of freedom. The estimate of the variance is then \hat{σ}^2 = \hat{S} / (N − d_1) (15), where \hat{S} is the sum-squared error over the training set at the optimal weight vector; \hat{σ}^2 is called the unbiased estimate of the variance (a small numerical sketch follows below). 19 / 58
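A small sketch of that estimate: fit a linear model by ordinary least squares on synthetic data (an assumption for illustration) and recover the noise variance with the N − d_1 correction.

import numpy as np

rng = np.random.default_rng(3)
N, d1 = 200, 5
Phi = rng.normal(size=(N, d1))
true_w = rng.normal(size=d1)
sigma = 0.5
d = Phi @ true_w + sigma * rng.normal(size=N)      # targets with Gaussian noise

w_hat, *_ = np.linalg.lstsq(Phi, d, rcond=None)    # ordinary least squares fit
S_hat = np.sum((d - Phi @ w_hat) ** 2)             # sum-squared error at the optimum
sigma2_hat = S_hat / (N - d1)                      # unbiased estimate, equation (15)
print(sigma2_hat, sigma**2)                        # close to the true noise variance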
  • 62. First, the standard least squared error case: although there are still d_1 weights in the model, the effective number of parameters (John Moody), γ, is less than d_1, and it depends on the size of the regularization parameters. We have (Moody and MacKay) γ = N − trace(P) (16). 20 / 58
  • 64. In the standard least squared error case without regularization, A^{-1} = (Φ^T Φ)^{-1}, so γ = N − trace(I_N − Φ A^{-1} Φ^T) = trace(Φ A^{-1} Φ^T) = trace(A^{-1} Φ^T Φ) = trace(I_{d_1}) = d_1. 21 / 58
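A quick numerical confirmation of this result; the synthetic Phi is an assumed example.

import numpy as np

rng = np.random.default_rng(4)
N, d1 = 30, 5
Phi = rng.normal(size=(N, d1))
P = np.eye(N) - Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)   # lambda = 0 case
print(np.isclose(N - np.trace(P), d1))                      # True: gamma = d_1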
  • 70. Now, with the regularization term, A^{-1} = (Φ^T Φ + λ I_{d_1})^{-1}, so γ = trace(A^{-1} Φ^T Φ) = trace(A^{-1} (A − λ I_{d_1})) = trace(I_{d_1} − λ A^{-1}) = d_1 − λ trace(A^{-1}). 22 / 58
  • 74. If the eigenvalues of the matrix Φ^T Φ are {μ_j}_{j=1}^{d_1}, then γ = d_1 − λ trace(A^{-1}) = d_1 − λ Σ_{j=1}^{d_1} 1/(λ + μ_j) = Σ_{j=1}^{d_1} μ_j/(λ + μ_j). 23 / 58
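The same quantity can be checked two ways, via trace(P) and via the eigenvalues of Φ^T Φ; a sketch with assumed synthetic data:

import numpy as np

rng = np.random.default_rng(5)
N, d1 = 40, 6
Phi = rng.normal(size=(N, d1))
lam = 0.5

A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))
P = np.eye(N) - Phi @ A_inv @ Phi.T

gamma_trace = N - np.trace(P)                    # gamma = N - tr(P), equation (16)
mu = np.linalg.eigvalsh(Phi.T @ Phi)             # eigenvalues of Phi^T Phi
gamma_eig = np.sum(mu / (lam + mu))              # gamma = sum_j mu_j / (lambda + mu_j)
print(np.isclose(gamma_trace, gamma_eig))        # True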
  • 78. About ridge regression. Remark: ridge regression is used as a way to balance bias and variance by varying the effective number of parameters in a linear model. An alternative strategy is to compare models made up of different subsets of basis functions drawn from the same fixed set of candidates; this is called subset selection in statistics and machine learning. 25 / 58
  • 81. Problem: this is normally intractable, since with N candidate basis functions there are 2^N − 1 subsets to test (17). We could use different methods: 1) k-means, which is explained in the book; 2) forward selection heuristics, which we explain here. Forward selection starts with an empty subset to which one basis function is added at a time (the one that reduces the sum-squared error the most), until a chosen criterion, such as the GCV, stops decreasing. 26 / 58
  • 87. Subset selection vs. optimization. Classic neural network optimization involves the minimization, by gradient descent, of a nonlinear sum-squared-error surface in a high-dimensional space defined by the network parameters. Specifically, in an RBF network the parameters are the centers, sizes and hidden-to-output weights. 27 / 58
  • 89. Subset Selection Vs. Optimization In Subset Selection The heuristic searches in a discrete space of subsets of a set of hidden units with fixed centers and sizes while finding a subset with the lowest prediction error. It uses, a minimization criteria as the variance of the GVC: ˆσ2 GCV = N ˆd T P2 ˆd trace (P)2 (18) 28 / 58
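A minimal NumPy sketch of the GCV variance estimate in equation (18); the function name is mine, and it assumes P is symmetric (as it is for the projection matrix P = I_N − ΦA⁻¹Φᵀ used later in these slides).

```python
import numpy as np

def gcv_variance(P, d_hat):
    """GCV variance estimate of eq. (18): N * d^T P^2 d / trace(P)^2.

    P     : (N, N) projection matrix of the current model (symmetric).
    d_hat : (N,) vector of observed targets.
    """
    N = d_hat.shape[0]
    Pd = P @ d_hat                      # P d
    numerator = N * (Pd @ Pd)           # N * d^T P^2 d, using the symmetry of P
    return numerator / np.trace(P) ** 2
```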
  • 92. In addition Hidden-to-Output Weights They are not selected; they are slaved to the centers and sizes of the chosen subset. Forward selection is a non-linear type of heuristic with the following advantages: There is no need to fix the number of hidden units in advance. The model selection criteria are tractable. The computational requirements are relatively low. 29 / 58
  • 96. Thus, under the classic least-squares error Something Notable In forward selection each step involves growing the network by one basis function. Therefore Adding a new basis function is one of the incremental operations, using the equation $$P_{d_1+1} = P_{d_1} - \frac{P_{d_1}\phi_j\phi_j^T P_{d_1}}{\phi_j^T P_{d_1}\phi_j} \quad (19)$$ 30 / 58
  • 98. Thus Where $P_{d_1+1}$ is the succeeding projection matrix if the $j$-th member of the set is added. $P_{d_1}$ is the projection matrix for the current $d_1$ hidden units. The vectors $\{\phi_j\}_{j=1}^{N}$ are the column vectors of the matrix $\Phi$, with $N > d_1$. 31 / 58
  • 101. Thus We have that $$\Phi_N = \left[\phi_1\ \ \phi_2\ \ \dots\ \ \phi_N\right] \quad (20)$$ if we take into account all the possible centers given by all the basis functions. 32 / 58
  • 102. Outline: Forward Selection Algorithms, Incremental Operations 33 / 58
  • 103. What are we going to do? This is what we want to do 1 Adding a new basis function 2 Removing an old basis function 34 / 58
  • 104. Given a matrix Given a square matrix $B$ of size $d_1$, we have the following: $B^{-1}B = I_{d_1}$ and $BB^{-1} = I_{d_1}$. Inverse of a matrix with a small-rank adjustment Suppose that a $d_1 \times d_1$ matrix $B_1$ is obtained by adding a small-rank adjustment $XRY^T$ to a matrix $B_0$, $$B_1 = B_0 + XRY^T \quad (21)$$ where $B_0 \in \mathbb{R}^{d_1\times d_1}$ has a known inverse, $X, Y \in \mathbb{R}^{d_1\times r}$ are known with $d_1 > r$, $R \in \mathbb{R}^{r\times r}$, and the inverse of $B_1$ is sought. 35 / 58
  • 107. We can do the following We have the following formula $$B_1^{-1} = B_0^{-1} - B_0^{-1}X\left(Y^T B_0^{-1}X + R^{-1}\right)^{-1} Y^T B_0^{-1} \quad (22)$$ Something Notable This is much more efficient because it only requires inverting the $r \times r$ matrix $Y^T B_0^{-1}X + R^{-1}$. 36 / 58
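As a sanity check on equation (22), here is a small NumPy sketch; the dimensions, the random test matrices, and the diagonal shifts used to keep everything invertible are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, r = 6, 2

B0 = rng.normal(size=(d1, d1)) + d1 * np.eye(d1)   # base matrix, kept well conditioned
X = rng.normal(size=(d1, r))
Y = rng.normal(size=(d1, r))
R = rng.normal(size=(r, r)) + r * np.eye(r)

B0_inv = np.linalg.inv(B0)                         # assumed already known
B1 = B0 + X @ R @ Y.T                              # small-rank adjustment, eq. (21)

# Eq. (22): only an r x r matrix has to be inverted.
core = np.linalg.inv(Y.T @ B0_inv @ X + np.linalg.inv(R))
B1_inv = B0_inv - B0_inv @ X @ core @ Y.T @ B0_inv

print(np.allclose(B1_inv, np.linalg.inv(B1)))      # True: matches the direct inverse
```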
  • 109. Thus, we can then partition the matrix B We have the following partition $$B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} \quad (23)$$ We have that $$B^{-1} = \begin{pmatrix} \left(B_{11} - B_{12}B_{22}^{-1}B_{21}\right)^{-1} & B_{11}^{-1}B_{12}\left(B_{21}B_{11}^{-1}B_{12} - B_{22}\right)^{-1} \\ \left(B_{21}B_{11}^{-1}B_{12} - B_{22}\right)^{-1}B_{21}B_{11}^{-1} & \left(B_{22} - B_{21}B_{11}^{-1}B_{12}\right)^{-1} \end{pmatrix} \quad (24)$$ 37 / 58
  • 111. Finally, we get, using $\Delta = B_{22} - B_{21}B_{11}^{-1}B_{12}$ We have $$B^{-1} = \begin{pmatrix} B_{11}^{-1} + B_{11}^{-1}B_{12}\Delta^{-1}B_{21}B_{11}^{-1} & -B_{11}^{-1}B_{12}\Delta^{-1} \\ -\Delta^{-1}B_{21}B_{11}^{-1} & \Delta^{-1} \end{pmatrix} \quad (25)$$ Using this equation we obtain an improvement, because retraining the network from scratch would require: Constructing the new design matrix. Multiplying it with itself. Adding the regularizer (if there is one). Taking the inverse to obtain the variance matrix. Recomputing the projection matrix. 38 / 58
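A short numerical check of the partitioned inverse (25), assuming a symmetric positive-definite B so that B11 and the Schur complement Delta are safely invertible; the block sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 4, 2
M = rng.normal(size=(n1 + n2, n1 + n2))
B = M @ M.T + (n1 + n2) * np.eye(n1 + n2)        # SPD, so every block inverse exists

B11, B12 = B[:n1, :n1], B[:n1, n1:]
B21, B22 = B[n1:, :n1], B[n1:, n1:]

B11_inv = np.linalg.inv(B11)
Delta = B22 - B21 @ B11_inv @ B12                # Schur complement of B11
Delta_inv = np.linalg.inv(Delta)

# Eq. (25), assembled block by block.
B_inv = np.block([
    [B11_inv + B11_inv @ B12 @ Delta_inv @ B21 @ B11_inv, -B11_inv @ B12 @ Delta_inv],
    [-Delta_inv @ B21 @ B11_inv, Delta_inv],
])

print(np.allclose(B_inv, np.linalg.inv(B)))      # True
```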
  • 118. Outline: Forward Selection Algorithms, Complexity Comparison 39 / 58
  • 119. Complexity of calculation of P We have the following approximate number of multiplications:

  Operation              | Completely retrain              | Using incremental operation
  Add a new basis        | d_1^3 + N d_1^2 + N^2 d_1       | N^2
  Remove an old basis    | d_1^3 + N d_1^2 + N^2 d_1       | N^2
  Add a new pattern      | d_1^3 + N d_1^2 + N^2 d_1       | 2 d_1^2 + d_1 N + N^2
  Remove an old pattern  | d_1^3 + N d_1^2 + N^2 d_1       | 2 d_1^2 + d_1 N + N^2

  40 / 58
  • 120. Outline: Forward Selection Algorithms, Adding Basis Function Under Regularization 41 / 58
  • 121. Adding a Basis Function We do the following If the $j$-th basis function is chosen, then $\phi_j$ is appended as the last column of $\Phi_{d_1}$ and renamed $d_1 + 1$. Thus, we increment to the new matrix $$\Phi_{d_1+1} = \left[\Phi_{d_1}\ \ \phi_{d_1+1}\right] \quad (26)$$ 42 / 58
  • 123. Where We have that $$\phi_{d_1+1} = \left(\phi_{d_1+1}(x_1),\ \phi_{d_1+1}(x_2),\ \dots,\ \phi_{d_1+1}(x_N)\right)^T \quad (27)$$ 43 / 58
  • 124. Using our variance matrix We have the following variance matrix for the general case $$A_{d_1+1} = \Phi_{d_1+1}^T\Phi_{d_1+1} + \Lambda_{d_1+1} \quad (28)$$ We have $$A_{d_1+1} = \begin{pmatrix}\Phi_{d_1}^T \\ \phi_{d_1+1}^T\end{pmatrix}\begin{pmatrix}\Phi_{d_1} & \phi_{d_1+1}\end{pmatrix} + \begin{pmatrix}\Lambda_{d_1} & 0 \\ 0^T & \lambda_{d_1+1}\end{pmatrix}$$ 44 / 58
  • 126. Thus We have that $$A_{d_1+1} = \begin{pmatrix}\Phi_{d_1}^T\Phi_{d_1} & \Phi_{d_1}^T\phi_{d_1+1} \\ \phi_{d_1+1}^T\Phi_{d_1} & \phi_{d_1+1}^T\phi_{d_1+1}\end{pmatrix} + \begin{pmatrix}\Lambda_{d_1} & 0 \\ 0^T & \lambda_{d_1+1}\end{pmatrix} = \begin{pmatrix}\Phi_{d_1}^T\Phi_{d_1}+\Lambda_{d_1} & \Phi_{d_1}^T\phi_{d_1+1} \\ \phi_{d_1+1}^T\Phi_{d_1} & \lambda_{d_1+1}+\phi_{d_1+1}^T\phi_{d_1+1}\end{pmatrix}$$ 45 / 58
  • 128. Then We have $$A_{d_1+1} = \begin{pmatrix}A_{d_1} & \Phi_{d_1}^T\phi_{d_1+1} \\ \phi_{d_1+1}^T\Phi_{d_1} & \lambda_{d_1+1}+\phi_{d_1+1}^T\phi_{d_1+1}\end{pmatrix}$$ Using our partition $$A_{d_1+1}^{-1} = \begin{pmatrix}A_{d_1}^{-1} & 0 \\ 0^T & 0\end{pmatrix} + \frac{1}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}\begin{pmatrix}A_{d_1}^{-1}\Phi_{d_1}^T\phi_{d_1+1} \\ -1\end{pmatrix}\begin{pmatrix}\phi_{d_1+1}^T\Phi_{d_1}A_{d_1}^{-1} & -1\end{pmatrix}$$ Where $P_{d_1} = I_N - \Phi_{d_1}A_{d_1}^{-1}\Phi_{d_1}^T$ 46 / 58
  • 131. Finally Then $$A_{d_1+1}^{-1} = \begin{pmatrix}A_{d_1}^{-1} & 0 \\ 0^T & 0\end{pmatrix} + \frac{1}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}\begin{pmatrix}A_{d_1}^{-1}\Phi_{d_1}^T\phi_{d_1+1}\phi_{d_1+1}^T\Phi_{d_1}A_{d_1}^{-1} & -A_{d_1}^{-1}\Phi_{d_1}^T\phi_{d_1+1} \\ -\phi_{d_1+1}^T\Phi_{d_1}A_{d_1}^{-1} & 1\end{pmatrix}$$ 47 / 58
  • 132. We have then that We can use the previous result for $A_{d_1+1}^{-1}$: $$P_{d_1+1} = I_N - \Phi_{d_1+1}A_{d_1+1}^{-1}\Phi_{d_1+1}^T = P_{d_1} - \frac{P_{d_1}\phi_{d_1+1}\phi_{d_1+1}^T P_{d_1}}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}$$ 48 / 58
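A minimal NumPy sketch of the incremental update of P above, checked against a full recomputation from the definition P = I_N − ΦA⁻¹Φᵀ; the toy data, the shared regularization parameter lam, and the function name are assumptions for illustration.

```python
import numpy as np

def add_basis_to_P(P, phi_new, lam_new):
    """Rank-one update of the projection matrix when a basis function is added."""
    v = P @ phi_new
    return P - np.outer(v, v) / (lam_new + phi_new @ v)

# Sanity check against the full recomputation on toy data.
rng = np.random.default_rng(2)
N, d1, lam = 20, 3, 0.1
Phi = rng.normal(size=(N, d1))
A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))
P = np.eye(N) - Phi @ A_inv @ Phi.T              # current P_{d1}

phi_new = rng.normal(size=N)
P_inc = add_basis_to_P(P, phi_new, lam)          # incremental P_{d1+1}

Phi_big = np.column_stack([Phi, phi_new])
A_big_inv = np.linalg.inv(Phi_big.T @ Phi_big + lam * np.eye(d1 + 1))
P_full = np.eye(N) - Phi_big @ A_big_inv @ Phi_big.T

print(np.allclose(P_inc, P_full))                # True
```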
  • 134. How do we select the new basis We can use the greatest decrease in the sum-squared error $$\hat{S}_{d_1} = \hat{y}^T P_{d_1}^2\,\hat{y} \quad (29)$$ In addition, we have $$\hat{S}_{d_1+1} = \hat{y}^T P_{d_1+1}^2\,\hat{y} \quad (30)$$ 49 / 58
  • 136. Thus, we take the difference We want to maximize the decrease $$\hat{S}_{d_1} - \hat{S}_{d_1+1} = \hat{y}^T P_{d_1}^2\hat{y} - \hat{y}^T P_{d_1+1}^2\hat{y} = \hat{y}^T\left(P_{d_1}^2 - P_{d_1+1}^2\right)\hat{y} = \hat{y}^T\left(P_{d_1}^2 - \left(P_{d_1} - \frac{P_{d_1}\phi_{d_1+1}\phi_{d_1+1}^T P_{d_1}}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}\right)^2\right)\hat{y} = \frac{2\,\hat{y}^T P_{d_1}^2\phi_{d_1+1}\,\hat{y}^T P_{d_1}\phi_{d_1+1}}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}} - \frac{\left(\hat{y}^T P_{d_1}\phi_{d_1+1}\right)^2\,\phi_{d_1+1}^T P_{d_1}^2\phi_{d_1+1}}{\left(\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}\right)^2}$$ 50 / 58
  • 140. An alternative is to seek to maximize the decrease in the cost function We have $$\hat{C}_{d_1} - \hat{C}_{d_1+1} = \hat{y}^T P_{d_1}\hat{y} - \hat{y}^T P_{d_1+1}\hat{y} = \hat{y}^T\frac{P_{d_1}\phi_{d_1+1}\phi_{d_1+1}^T P_{d_1}}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}\hat{y} = \frac{\left(\hat{y}^T P_{d_1}\phi_{d_1+1}\right)^2}{\lambda_{d_1+1}+\phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}$$ 51 / 58
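Assuming the unused candidate vectors are stored as the columns of a matrix, here is a short sketch of greedy selection using the cost decrease above; the function name and the single shared lam are illustrative choices, not part of the slides.

```python
import numpy as np

def best_candidate(P, y_hat, candidates, lam):
    """Pick the candidate maximizing (y^T P phi)^2 / (lam + phi^T P phi).

    P          : (N, N) current projection matrix.
    y_hat      : (N,) training targets.
    candidates : (N, K) matrix whose k-th column is a candidate phi vector.
    lam        : regularization parameter assumed for the new weight.
    """
    Pc = P @ candidates                                  # P phi_k for every candidate
    num = (y_hat @ Pc) ** 2                              # (y^T P phi_k)^2
    den = lam + np.einsum('nk,nk->k', candidates, Pc)    # lam + phi_k^T P phi_k
    scores = num / den
    k = int(np.argmax(scores))
    return k, scores[k]
```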
  • 141. Outline: Forward Selection Algorithms, Removing an Old Basis Function under Regularization 52 / 58
  • 142. Removing an Old Basis Function under Regularization Here, we can remove any column Process: 1 Move the selected $j$-th column to the end (a permutation). 2 Apply our well-known equation with $P_{d_1}$ in place of $P_{d_1+1}$ and $P_{d_1-1}$ in place of $P_{d_1}$. 3 In addition, $\phi_j$ in place of $\phi_{d_1+1}$. 4 And $\lambda_j$ in place of $\lambda_{d_1+1}$. Thus, we have $$P_{d_1} = P_{d_1-1} - \frac{P_{d_1-1}\phi_j\phi_j^T P_{d_1-1}}{\lambda_j + \phi_j^T P_{d_1-1}\phi_j} \quad (31)$$ 53 / 58
  • 147. Thus If $\lambda_j \neq 0$ We can first post- and then pre-multiply (31) by $\phi_j$ to obtain expressions for $P_{d_1-1}\phi_j$ and $\phi_j^T P_{d_1-1}\phi_j$ in terms of $P_{d_1}$. Thus, we have $$P_{d_1-1} = P_{d_1} + \frac{P_{d_1}\phi_j\phi_j^T P_{d_1}}{\lambda_j - \phi_j^T P_{d_1}\phi_j} \quad (32)$$ However For small $\lambda_j$, the round-off error can be problematic!!! 54 / 58
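A minimal sketch of the downdate in equation (32); as the slide warns, it assumes lam_j is nonzero and can be numerically fragile when lam_j is small. The function name is mine.

```python
import numpy as np

def remove_basis_from_P(P, phi_j, lam_j):
    """Downdate of eq. (32): recover P_{d1-1} from P_{d1} after dropping phi_j.

    Requires lam_j != 0; beware of round-off error when lam_j is small.
    """
    v = P @ phi_j
    return P + np.outer(v, v) / (lam_j - phi_j @ v)
```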
  • 150. Outline: Forward Selection Algorithms, A Possible Forward Algorithm 55 / 58
  • 151. Based on the previous ideas We are ready for a basic algorithm However, this can be improved. 56 / 58
  • 152. We have the following pseudocode
  Forward-Regularization(D)
   1  Select the d_1 functions to use as the initial basis, based on the data D.
      This can be done randomly or using the clustering method described in Haykin.
   2  Select a stopping threshold ε > 0.
   3  Ĉ_{d_1} = ŷ^T P_{d_1} ŷ
   4  Do
   5      Ĉ_{d_1} = Ĉ_{d_1+1}
   6      d_1 = d_1 + 1
   7      Do
   8          Select a new basis element and generate φ_{d_1+1}. Several strategies exist.
   9          Generate A^{-1}_{d_1+1} and P_{d_1+1}.
  10          Calculate Ĉ_{d_1+1}.
  11      Until (ŷ^T P_{d_1} φ_{d_1+1})^2 / (λ_{d_1+1} + φ_{d_1+1}^T P_{d_1} φ_{d_1+1}) > 0
  12  Until (Ĉ_{d_1} − Ĉ_{d_1+1})^2 < ε
  57 / 58
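Putting the pieces together, here is a compact NumPy sketch of a forward-selection loop in the spirit of the pseudocode above; the Gaussian basis, its width, the single shared lam, and the simplified stopping rules are all assumptions for illustration, not the algorithm as given in the slides.

```python
import numpy as np

def forward_selection(X, y, centers, width=1.0, lam=1e-3, eps=1e-6):
    """Greedy forward selection of RBF centers under regularization.

    X       : (N, p) training inputs,  y : (N,) targets.
    centers : (M, p) candidate centers (for example, the training inputs themselves).
    Returns the indices of the selected centers and the fitted output weights.
    """
    N = X.shape[0]
    # Candidate design matrix: one Gaussian basis function per candidate center.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    F = np.exp(-sq_dist / (2.0 * width ** 2))              # shape (N, M)

    P = np.eye(N)                                          # empty model: P_0 = I_N
    selected = []
    cost = y @ P @ y                                       # C_0 = y^T P_0 y
    while True:
        mask = np.ones(F.shape[1], dtype=bool)
        mask[selected] = False
        if not mask.any():
            break
        # Score every unused candidate by the decrease in the cost function.
        Pc = P @ F[:, mask]
        scores = (y @ Pc) ** 2 / (lam + np.einsum('nk,nk->k', F[:, mask], Pc))
        if scores.max() <= 0.0:
            break
        j = np.flatnonzero(mask)[int(np.argmax(scores))]
        v = P @ F[:, j]
        P = P - np.outer(v, v) / (lam + F[:, j] @ v)       # incremental update of P
        selected.append(j)
        new_cost = y @ P @ y
        if (cost - new_cost) ** 2 < eps:                   # stop on a small decrease
            break
        cost = new_cost

    Phi = F[:, selected]
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(selected)), Phi.T @ y)
    return selected, w
```

With centers taken to be the training inputs themselves, this matches the idea above of considering all N possible centers as candidates.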
  • 166. For more on this Please read the following: Introduction to Radial Basis Function Networks by Mark J. L. Orr. And there is much more Look at the book Bootstrap Methods and their Application by A. C. Davison and D. V. Hinkley. 58 / 58