Learning & Adaptive Systems
Thorsteinn S. Rögnvaldsson, February 18, 2000


7     The Multilayer Perceptron

7.1   The multilayer perceptron – general

The “multilayer perceptron” (MLP) is a design that overcomes the shortcomings of the simple perceptron. The MLP can solve general nonlinear classification problems. An MLP is a hierarchical structure of several “simple” perceptrons (with smooth transfer functions). For instance, a “one hidden layer” MLP with a logistic output unit looks like this (see figures in e.g. (Bishop 1995) or (Haykin 1999)):
$$\hat{y}(\mathbf{x}) = \frac{1}{1 + \exp[-a(\mathbf{x})]} \qquad (7.1)$$

$$a(\mathbf{x}) = \sum_{j=0}^{M} v_j\, h_j(\mathbf{x}) = \mathbf{v}^T \mathbf{h}(\mathbf{x}) \qquad (7.2)$$

$$h_j(\mathbf{x}) = \phi\!\left(\sum_{k=0}^{D} w_{jk}\, x_k\right) = \phi(\mathbf{w}_j^T \mathbf{x}) \qquad (7.3)$$

where the transfer function, or activation function, φ(z) typically is a sigmoid
of the form

$$\phi(z) = \tanh(z), \qquad (7.4)$$

$$\phi(z) = \frac{1}{1 + e^{-z}}. \qquad (7.5)$$
The former type, the hyperbolic tangent, is the more common one and it
makes the training a little easier than if you use a logistic function.

 The logistic output unit (7.1) is the correct one to use for a classification
problem.

 If the idea is to model a function (i.e. nonlinear regression) then it is
common to use a linear output unit

$$\hat{y}(\mathbf{x}) = a(\mathbf{x}). \qquad (7.6)$$
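As a concrete illustration, a minimal Python/NumPy sketch of the forward pass (7.1)–(7.6) might look as follows. The layer sizes, random seed, and initialization scale are arbitrary choices for the example, not anything prescribed by the text.

```python
import numpy as np

def forward(x, W, v, output="logistic"):
    """Forward pass of a one-hidden-layer MLP, following (7.1)-(7.3).

    x : input vector of length D
    W : (M, D+1) hidden-layer weights, column 0 holds the biases w_j0
    v : length M+1 output weights, v[0] is the bias v_0
    """
    x1 = np.concatenate(([1.0], x))     # prepend x_0 = 1
    h = np.tanh(W @ x1)                 # h_j = phi(b_j), phi = tanh, eq. (7.4)
    a = v[0] + v[1:] @ h                # a(x), eq. (7.2)
    if output == "logistic":            # classification output, eq. (7.1)
        return 1.0 / (1.0 + np.exp(-a))
    return a                            # linear output for regression, eq. (7.6)

# Example with D = 3 inputs and M = 5 hidden units (arbitrary sizes):
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((5, 4))
v = 0.1 * rng.standard_normal(6)
print(forward(np.array([0.2, -1.0, 0.5]), W, v))
```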


7.2   Training an MLP – Backpropagation

Perhaps the most straightforward way to design a training algorithm for the MLP is to use the gradient descent algorithm. What we need is for the model output $\hat{y}$ to be differentiable with respect to all the parameters $w_{jk}$ and $v_j$. We have a training data set $X = \{\mathbf{x}(n), y(n)\}_{n=1,\dots,N}$ with $N$ observations, and we denote all the weights in the network by $\mathbf{W} = \{\mathbf{w}_j, \mathbf{v}\}$.
The batch form of gradient descent then goes as follows:


  1. Initialize $\mathbf{W}$ with e.g. small random values.

  2. Repeat until convergence (either when the error $E$ is below some preset value or until the gradient norm $\|\nabla_{\mathbf{W}} E\|$ is smaller than a preset value); $t$ is the iteration number.

     2.1 Compute the update
         $$\Delta\mathbf{W}(t) = -\eta\, \nabla_{\mathbf{W}} E(t) = \eta\, \frac{1}{N} \sum_{n=1}^{N} e(n,t)\, \nabla_{\mathbf{W}}\, \hat{y}(n,t),$$
         where $e(n,t) = y(n) - \hat{y}(n,t)$.
     2.2 Update the weights $\mathbf{W}(t+1) = \mathbf{W}(t) + \Delta\mathbf{W}(t)$.
     2.3 Compute the error $E(t+1)$.
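In code, the loop with the two stopping criteria might look as follows. The error function here is a toy quadratic stand-in just to make the sketch runnable; for the MLP, the gradient comes from the backpropagation formulas derived next, and $\eta$ and the thresholds are arbitrary example values.

```python
import numpy as np

# Toy quadratic error E(W) = ||W - W*||^2 / 2, just to exercise the loop.
W_star = np.array([1.0, -2.0])
E = lambda W: 0.5 * np.sum((W - W_star) ** 2)
grad_E = lambda W: W - W_star

eta, E_min, g_min = 0.1, 1e-8, 1e-6                     # preset values (arbitrary)
W = 0.01 * np.random.default_rng(1).standard_normal(2)  # step 1: small random init
for t in range(10_000):                                 # step 2
    dW = -eta * grad_E(W)                               # step 2.1
    W = W + dW                                          # step 2.2
    if E(W) < E_min or np.linalg.norm(grad_E(W)) < g_min:  # step 2.3 + stop tests
        break
print(t, W, E(W))
```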


 As an example, we compute the weight updates for the special case of
a multilayer perceptron with one hidden layer, using the transfer function
φ(z) (e.g. tanh(z)), and one output unit with the transfer function θ(z) (e.g.
logistic or linear). We use half the mean square error
$$E = \frac{1}{2N} \sum_{n=1}^{N} \left[y(n) - \hat{y}(n)\right]^2 = \frac{1}{2N} \sum_{n=1}^{N} e^2(n), \qquad (7.7)$$

and the following notation

$$\hat{y}(\mathbf{x}) = \theta[a(\mathbf{x})], \qquad (7.8)$$

$$a(\mathbf{x}) = v_0 + \sum_{j=1}^{M} v_j\, h_j(\mathbf{x}), \qquad (7.9)$$

$$h_j(\mathbf{x}) = \phi[b_j(\mathbf{x})], \qquad (7.10)$$

$$b_j(\mathbf{x}) = w_{j0} + \sum_{k=1}^{D} w_{jk}\, x_k. \qquad (7.11)$$

Here, vj are the weights between the hidden layer and the output layer, and
wjk are the weights between the input and the hidden layer.

For weight $v_i$ we get

$$\frac{\partial E}{\partial v_i} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \frac{\partial \hat{y}(n)}{\partial v_i} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, \frac{\partial a(n)}{\partial v_i} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, h_i(n) \qquad (7.12)$$

$$\Rightarrow \quad \Delta v_i = -\eta\, \frac{\partial E}{\partial v_i} = \eta\, \frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, h_i(n) \qquad (7.13)$$

with the definition $h_0(n) \equiv 1$. If the output transfer function is linear, i.e. $\theta(z) = z$, then $\theta'(z) = 1$. If the output function is logistic, i.e. $\theta(z) = [1 + \exp(-z)]^{-1}$, then $\theta'(z) = \theta(z)[1 - \theta(z)]$.

For weight $w_{il}$ we get

$$\frac{\partial E}{\partial w_{il}} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \frac{\partial \hat{y}(n)}{\partial w_{il}} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, \frac{\partial a(n)}{\partial w_{il}} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, v_i\, \frac{\partial h_i(n)}{\partial w_{il}}$$

$$= -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, v_i\, \phi'[b_i(n)]\, \frac{\partial b_i(n)}{\partial w_{il}} = -\frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, v_i\, \phi'[b_i(n)]\, x_l(n) \qquad (7.14)$$

$$\Rightarrow \quad \Delta w_{il} = -\eta\, \frac{\partial E}{\partial w_{il}} = \eta\, \frac{1}{N} \sum_{n=1}^{N} e(n)\, \theta'[a(n)]\, v_i\, \phi'[b_i(n)]\, x_l(n) \qquad (7.15)$$

with the definition $x_0(n) \equiv 1$. If the hidden unit transfer function is the hyperbolic tangent function, i.e. $\phi(z) = \tanh(z)$, then $\phi'(z) = 1 - \phi^2(z)$.

 This gradient descent method for updating the weights has become known
as the “backpropagation” training algorithm. The motivation for the name
becomes clear if we introduce the notation

$$\delta(n) = e(n)\, \theta'[a(n)], \qquad (7.16)$$

$$\delta_i(n) = \delta(n)\, v_i\, \phi'[b_i(n)], \qquad (7.17)$$

which enables us to write
$$\Delta v_i = \eta\, \frac{1}{N} \sum_{n=1}^{N} \delta(n)\, h_i(n), \qquad (7.18)$$

$$\Delta w_{il} = \eta\, \frac{1}{N} \sum_{n=1}^{N} \delta_i(n)\, x_l(n), \qquad (7.19)$$


which is very similar to the good old LMS algorithm. Expression (7.17)
corresponds to a propagation of δ(n) backwards through the network.
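Putting (7.16)–(7.19) together, one batch update for the one-hidden-layer network might be sketched as follows in NumPy, here assuming a linear output unit ($\theta(z) = z$, so $\theta' = 1$) and tanh hidden units; the data, layer sizes, and learning rate are arbitrary illustration choices.

```python
import numpy as np

def backprop_update(X, y, W, v, eta):
    """One batch update of (7.18)-(7.19), tanh hidden layer, linear output.
    X : (N, D) inputs, y : (N,) targets.
    W : (M, D+1) hidden weights (bias in column 0), v : (M+1,) output weights."""
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])     # x_0 = 1
    b = X1 @ W.T                             # b_j(x(n)), eq. (7.11)
    h = np.tanh(b)                           # h_j = phi(b_j)
    h1 = np.hstack([np.ones((N, 1)), h])     # h_0 = 1
    a = h1 @ v                               # a(x(n)), eq. (7.9)
    e = y - a                                # e(n) = y(n) - yhat(n)
    delta = e * 1.0                          # delta(n) = e(n) theta'[a(n)], (7.16)
    delta_i = (delta[:, None] * v[1:]) * (1 - h**2)  # (7.17), phi' = 1 - phi^2
    dv = eta / N * (h1.T @ delta)            # eq. (7.18)
    dW = eta / N * (delta_i.T @ X1)          # eq. (7.19)
    return W + dW, v + dv

# Tiny regression example (made-up data), a few thousand batch updates:
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1))
y = np.sin(np.pi * X[:, 0])
W, v = 0.5 * rng.standard_normal((5, 2)), 0.5 * rng.standard_normal(6)
for _ in range(2000):
    W, v = backprop_update(X, y, W, v, eta=0.5)
```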

 The gradient descent learning algorithm corresponds to backprop in its
batch form, where the update is computed using all the available training
data. There is also an “on-line” version where the updates are done after
each pattern x(n) without averaging over all patterns.

 Bishop (1995) and Haykin (1999) discuss variants of backprop, e.g. using a
“momentum” term and adaptive learning rate η. Backprop with “momen-
tum” amounts to using an update equal to

$$\Delta\mathbf{W}(t) = -\eta\, \nabla_{\mathbf{W}} E(t) + \alpha\, \Delta\mathbf{W}(t-1), \qquad (7.20)$$

where 0 < α < 1. This enables the learning to gain momentum (hence the
name). If there are several updates in the same direction, then the effective
learning rate is increased. If there are several updates in opposite directions
then the effective learning rate is decreased.
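In code, the momentum update (7.20) only requires keeping the previous step around; the toy error, $\eta$, and $\alpha$ below are example values.

```python
import numpy as np

W_star = np.array([1.0, -2.0])                 # same toy quadratic error as before
grad_E = lambda W: W - W_star

eta, alpha = 0.1, 0.9                          # example values, 0 < alpha < 1
W, dW_prev = np.zeros(2), np.zeros(2)
for t in range(200):
    dW = -eta * grad_E(W) + alpha * dW_prev    # eq. (7.20)
    W, dW_prev = W + dW, dW
print(W)
```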

Backpropagation is, in general, a very slow learning algorithm – even with momentum – and there are many better algorithms, which we discuss below. However, backpropagation was very important in the beginning of the 1980s because it was used to demonstrate that multilayer perceptrons can learn things.



7.3   RPROP

A very useful gradient-based learning algorithm, not discussed in Bishop (Bishop 1995), is the “resilient backpropagation” (RPROP) algorithm (Riedmiller & Braun 1993). It uses individual adaptive learning rates combined with the so-called “Manhattan” update step.

 The standard backpropagation updates the weights according to

$$\Delta w_{il} = -\eta\, \frac{\partial E}{\partial w_{il}}. \qquad (7.21)$$

The “Manhattan” update step, on the other hand, uses only the sign of the
derivative (the reason for the name should be obvious to anyone who has
seen a map of Manhattan), i.e.

$$\Delta w_{il} = -\eta\, \mathrm{sign}\!\left(\frac{\partial E}{\partial w_{il}}\right). \qquad (7.22)$$


 The RPROP algorithm combines this Manhattan step with individual
learning rates for each weight, and the algorithm goes as follows

$$\Delta w_{il}(t) = -\eta_{il}(t)\, \mathrm{sign}\!\left(\frac{\partial E}{\partial w_{il}}\right), \qquad (7.23)$$

where wil denotes any weight in the network (e.g. also hidden to output
weights).

 The learning rate ηil (t) is adjusted according to

$$\eta_{il}(t) = \begin{cases} \gamma^{+}\, \eta_{il}(t-1) & \text{if } \partial_{il}E(t) \cdot \partial_{il}E(t-1) > 0 \\ \gamma^{-}\, \eta_{il}(t-1) & \text{if } \partial_{il}E(t) \cdot \partial_{il}E(t-1) < 0 \end{cases} \qquad (7.24)$$

where $\gamma^{+}$ and $\gamma^{-}$ are different growth/shrinking factors ($0 < \gamma^{-} < 1 < \gamma^{+}$). Values that have worked well for me are $\gamma^{-} = 0.5$ and $\gamma^{+} = 1.2$, with limits such that $10^{-6} \le \eta_{il}(t) \le 50$. I have used the short notation $\partial_{il}E(t) \equiv \partial E(t)/\partial w_{il}$.
The RPROP algorithm is a batch algorithm, since the learning rate update
becomes noisy and uncertain if the error E is evaluated over only a single
pattern.
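A sketch of this bookkeeping, using the $\gamma$ values and limits quoted above and a toy gradient as a stand-in for backpropagation. (This codes the simplified rule (7.23)–(7.24) as given here; the published algorithm adds some extra handling of sign changes.)

```python
import numpy as np

W_star = np.array([1.0, -2.0])
grad_E = lambda W: W - W_star            # toy gradient, stands in for backprop

g_plus, g_minus = 1.2, 0.5               # gamma+ and gamma- from the text
eta = np.full(2, 0.1)                    # one learning rate per weight
W = np.zeros(2)
g_prev = grad_E(W)
for t in range(200):
    g = grad_E(W)
    eta[g * g_prev > 0] *= g_plus        # same sign as last step: grow, (7.24)
    eta[g * g_prev < 0] *= g_minus       # sign flipped: shrink
    eta = np.clip(eta, 1e-6, 50.0)       # limits from the text
    W = W - eta * np.sign(g)             # Manhattan step, eq. (7.23)
    g_prev = g
print(W)
```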

 The RPROP algorithm is implemented in the MATLAB neural network
toolbox.



7.4   Second order learning algorithms

Backpropagation, i.e. gradient descent, is a first order learning algorithm.
This means that it only uses information about the first order derivative
when it minimizes the error. The idea behind a first order algorithm can be
illustrated by expanding the error E in a Taylor series around the current
weight position W

$$E(\mathbf{W} + \Delta\mathbf{W}) = E(\mathbf{W}) + \nabla_{\mathbf{W}} E(\mathbf{W})^T \Delta\mathbf{W} + O\!\left(\|\Delta\mathbf{W}\|^2\right). \qquad (7.25)$$

The vector $\mathbf{W}$ contains all the weights $w_{jk}$ and $v_j$ (and others if we are considering other network architectures), and we require that $\Delta\mathbf{W}$ is small. The notation $O(\|\Delta\mathbf{W}\|^2)$ denotes all the terms that contain the small weight step $\Delta\mathbf{W}$ multiplied by itself at least once, and by “small” we mean that $\Delta\mathbf{W}$ is so small that the gradient term $\nabla_{\mathbf{W}} E(\mathbf{W})^T \Delta\mathbf{W}$ is larger than the sum of the higher order terms. In that case we can ignore the higher order terms and write

$$E(\mathbf{W} + \Delta\mathbf{W}) \approx E(\mathbf{W}) + \nabla_{\mathbf{W}} E(\mathbf{W})^T \Delta\mathbf{W}. \qquad (7.26)$$


Now, we want to change the weights so that the new error $E(\mathbf{W} + \Delta\mathbf{W})$ is smaller than the current error $E(\mathbf{W})$. One way to guarantee this is to set the weight update $\Delta\mathbf{W}$ proportional to the negative gradient, i.e. $\Delta\mathbf{W} = -\eta\, \nabla_{\mathbf{W}} E(\mathbf{W})$, in which case we have

$$E(\mathbf{W} + \Delta\mathbf{W}) \approx E(\mathbf{W}) - \eta\, \|\nabla_{\mathbf{W}} E(\mathbf{W})\|^2 \le E(\mathbf{W}). \qquad (7.27)$$

However, this of course requires that $\Delta\mathbf{W}$ is so small that we can motivate (7.26).

We can extend this and also consider the second order term in the Taylor expansion. That is,

$$E(\mathbf{W} + \Delta\mathbf{W}) = E(\mathbf{W}) + \nabla_{\mathbf{W}} E(\mathbf{W})^T \Delta\mathbf{W} + \frac{1}{2}\, \Delta\mathbf{W}^T \mathbf{H}(\mathbf{W})\, \Delta\mathbf{W} + O\!\left(\|\Delta\mathbf{W}\|^3\right), \qquad (7.28)$$

where

$$\mathbf{H}(\mathbf{W}) = \nabla_{\mathbf{W}} \nabla_{\mathbf{W}}^T E(\mathbf{W}) \qquad (7.29)$$

is the Hessian matrix with elements $H_{ij}(\mathbf{W}) = \partial^2 E(\mathbf{W}) / \partial w_i \partial w_j$. The Hessian is symmetric (all eigenvalues are consequently real and we can diagonalize $\mathbf{H}$ with an orthogonal transformation).

 If we can ignore the higher order terms in (7.28) then we have

$$E(\mathbf{W} + \Delta\mathbf{W}) \approx E(\mathbf{W}) + \nabla_{\mathbf{W}} E(\mathbf{W})^T \Delta\mathbf{W} + \frac{1}{2}\, \Delta\mathbf{W}^T \mathbf{H}(\mathbf{W})\, \Delta\mathbf{W}. \qquad (7.30)$$
We want to change the weights so that the new error E(W+∆W) is smaller
than the current error E(W). Furthermore, we want it to be as small as
possible. That is, we want to minimize E(W + ∆W) by choosing ∆W
appropriately. The requirement that we end up at an extremum point is

$$\nabla_{\mathbf{W}} E(\mathbf{W} + \Delta\mathbf{W}) = 0 \;\;\Rightarrow\;\; \nabla_{\mathbf{W}} E(\mathbf{W}) + \mathbf{H}(\mathbf{W})\, \Delta\mathbf{W} = 0, \qquad (7.31)$$

which yields the optimum weight update as

$$\Delta\mathbf{W} = -\mathbf{H}^{-1}(\mathbf{W})\, \nabla_{\mathbf{W}} E(\mathbf{W}). \qquad (7.32)$$

To guarantee that this is a minimum point we must also require that the Hessian matrix is positive definite. This means that all the eigenvalues of the Hessian matrix must be positive. If any of the eigenvalues are zero then we have a saddle point and $\mathbf{H}(\mathbf{W})$ is not invertible. If any of the eigenvalues of $\mathbf{H}(\mathbf{W})$ are negative then we have a maximum point for at least one of the weights $w_j$, and (7.32) will actually move away from the minimum!


 The update step (7.32) is usually referred to as a “Newton-step”, and the
minimization method that uses this update step is the Newton algorithm.

 Some problems with “vanilla” Newton learning (7.32) are:


   • The Hessian matrix may not be invertible, i.e. some of the eigenvalues
     are zero.

   • The Hessian matrix may have negative eigenvalues.

   • The Hessian matrix is expensive to compute and also expensive to
     invert. The learning may therefore be slower than a first order method.


The first two problems are handled by regularizing the Hessian, i.e. by replacing $\mathbf{H}(\mathbf{W})$ by $\mathbf{H}(\mathbf{W}) + \lambda\mathbf{I}$. This effectively filters out all eigenvalues that are smaller than $\lambda$. The third problem is handled by “Quasi-Newton” methods that iteratively try to estimate the inverse Hessian using expressions of the form $\mathbf{H}^{-1}(\mathbf{W} + \Delta\mathbf{W}) \approx \mathbf{H}^{-1}(\mathbf{W}) + \text{correction}$.
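A regularized Newton step, i.e. (7.32) with $\mathbf{H}$ replaced by $\mathbf{H} + \lambda\mathbf{I}$, is a one-liner given the gradient and Hessian; the quadratic toy error below just makes the sketch runnable.

```python
import numpy as np

# Toy quadratic error with a known Hessian, to illustrate the step itself.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of E (positive definite)
W_star = np.array([1.0, -2.0])
grad_E = lambda W: A @ (W - W_star)

lam = 1e-3                               # regularization strength (example value)
W = np.zeros(2)
H = A                                    # would be computed/estimated in practice
dW = -np.linalg.solve(H + lam * np.eye(2), grad_E(W))  # regularized Newton step
W = W + dW
print(W)                                 # one step lands (almost) on W_star
```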



7.4.1   The Levenberg-Marquardt algorithm


The Levenberg-Marquardt algorithm, see e.g. (Bishop 1995), is a very efficient second order learning algorithm that builds on the assumption that the error $E$ is a quadratic error (which it usually is), like half the mean square error. In this case we have

$$H_{ij} = \frac{1}{2}\, \frac{\partial^2 \mathrm{MSE}}{\partial w_i \partial w_j} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \hat{y}(n)}{\partial w_i}\, \frac{\partial \hat{y}(n)}{\partial w_j} - \frac{1}{N} \sum_{n=1}^{N} e(n)\, \frac{\partial^2 \hat{y}(n)}{\partial w_i \partial w_j}. \qquad (7.33)$$

If the residual $e(n)$ is symmetrically distributed around zero and small, then we can assume that the second term in (7.33) is very small compared to the first term. If so, then we can approximate

$$H_{ij} \approx \frac{1}{N} \sum_{n=1}^{N} \frac{\partial \hat{y}(n)}{\partial w_i}\, \frac{\partial \hat{y}(n)}{\partial w_j}, \qquad \mathbf{H}(\mathbf{W}) \approx \frac{1}{N} \sum_{n=1}^{N} \mathbf{J}(n)\, \mathbf{J}^T(n), \qquad (7.34)$$

where we have used the notation

$$\mathbf{J}(n) = \nabla_{\mathbf{W}}\, \hat{y}(n) \qquad (7.35)$$


and we will refer to J as the “Jacobian”. This approximation is not as
costly to compute as the exact Hessian, since no second order derivatives
are needed.

The fact that the Hessian is approximated by a sum of outer products $\mathbf{J}(n)\mathbf{J}^T(n)$ means that the rank of $\mathbf{H}$ is at most $N$. That is, there must be at least as many observations as there are weights in the network (which intuitively makes sense).

This approximation of the Hessian is used in a Newton step together with a regularization term, so that the Levenberg-Marquardt update is

$$\Delta\mathbf{W} = -\left[\frac{1}{N} \sum_{n=1}^{N} \mathbf{J}(n)\, \mathbf{J}^T(n) + \lambda\mathbf{I}\right]^{-1} \nabla_{\mathbf{W}} E(\mathbf{W}). \qquad (7.36)$$

The Levenberg-Marquardt update is a very useful learning algorithm, and
it represents a combination of gradient descent and a Newton step search.
We have that

$$\Delta\mathbf{W} \to \begin{cases} -\dfrac{1}{\lambda}\, \nabla_{\mathbf{W}} E(\mathbf{W}) & \text{when } \lambda \to \infty \\[6pt] -\left[\dfrac{1}{N} \sum_n \mathbf{J}(n)\, \mathbf{J}^T(n)\right]^{-1} \nabla_{\mathbf{W}} E(\mathbf{W}) & \text{when } \lambda \to 0 \end{cases} \qquad (7.37)$$

which corresponds to gradient descent, with $\eta = 1/\lambda$, when $\lambda$ is large, and to Newton learning when $\lambda$ is small.
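A minimal sketch of the update (7.36), using a linear model so that the per-pattern Jacobian $\mathbf{J}(n)$ is simply the input vector; the data and $\lambda$ are arbitrary example choices.

```python
import numpy as np

# Levenberg-Marquardt step (7.36) for a linear model yhat = X @ W, where
# J(n) = x(n). Data are made up for the example.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(50)

W = np.zeros(3)
lam = 1e-3
for _ in range(5):
    e = y - X @ W                            # residuals e(n)
    grad = -(X.T @ e) / len(y)               # gradient of E = (1/2N) sum e^2
    H = (X.T @ X) / len(y)                   # (1/N) sum J(n) J(n)^T, eq. (7.34)
    W = W - np.linalg.solve(H + lam * np.eye(3), grad)   # eq. (7.36)
print(W)
```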


7.5   Rules of thumb when choosing a learning algorithm

These rules of thumb are based on my own personal experience.

If the problem is small, with few observations and few weights, then Levenberg-Marquardt is the preferred learning algorithm. This is because it is very fast if the Hessian is small. Furthermore, the MATLAB implementation of Levenberg-Marquardt stores a matrix which is (number of weights) × (number of observations). Thus, if either of these numbers is big then the machine you are working on may run out of memory and start swapping – and your learning will take “forever”.

 If the problem is large, i.e. many observations and few/many weights, then
RPROP is a fast and reliable algorithm.

 If you have a real-time problem, where you need to adapt the network
on-line, then you should use on-line backpropagation.


7.6   How to estimate the generalization error

The real goal when modeling is to generalize to new data, not just perform
well on the training data set that is presented during training.

In general, the error on the training data will be a biased estimate of the generalization error. To be specific, it will tend to be smaller than the generalization error if we select our model such that it minimizes the training error. We can therefore not use the training error as our selection criterion.

 One way to estimate the generalization error is to do cross-validation. This
means using a test data set, which is a subset of the available data (typically
25-35%) that is removed before any training is done, and which is not used
again until all training is done. The performance on this test data will be
an unbiased estimate of the generalization error, provided that the data has
not been used in any way during the modeling process. If it has been used,
e.g. for model validation when selecting hyperparameter values, then it will
be a biased estimate.

 If there is lots of data available then it may be sufficient to use one test
set for estimating the generalization error. However, if data is scarce then
it is necessary to use more data-efficient methods. One such method is the
K-fold cross-validation method.

 The central idea in K-fold cross-validation is to repeat the cross-validation
test K times. That is, divide the available data into K subsets, here denoted
by Dk , where each subset contains a sample of the data that reflects the data
distribution (i.e. you must make sure that one subset does not contain e.g.
only one category in a classification task). The procedure then goes like
this:


  1. Repeat $K$ times, i.e. until all data subsets have been used for testing once.

     1.1 Set aside one of the subsets, $D_k$, for testing, and use the remaining data subsets $D_{i \neq k}$ for training.
     1.2 Train your model using the training data.
     1.3 Test your model on the data subset $D_k$. This gives you a test data error $E_{\text{test},k}$.

  2. The estimate of the generalization error is the mean of the $K$ individual test errors: $E_{\text{gen.}} = \frac{1}{K} \sum_k E_{\text{test},k}$.
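A sketch of this procedure follows. The splitting here is a plain random partition, so for a classification task one would add stratification to keep the class proportions per subset, as required above; the least-squares “model” is a placeholder standing in for an MLP.

```python
import numpy as np

def kfold_generalization_error(X, y, K, train, test_error, seed=0):
    """K-fold cross-validation following the procedure above.
    train(X, y) -> model;  test_error(model, X, y) -> scalar error."""
    idx = np.random.default_rng(seed).permutation(len(y))  # shuffle, then split
    folds = np.array_split(idx, K)                         # the subsets D_k
    errors = []
    for k in range(K):                                     # step 1
        train_idx = np.concatenate([folds[i] for i in range(K) if i != k])
        model = train(X[train_idx], y[train_idx])          # step 1.2
        errors.append(test_error(model, X[folds[k]], y[folds[k]]))  # step 1.3
    return np.mean(errors), errors                         # step 2

# Example with a least-squares placeholder model:
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(120)
train = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
test_error = lambda w, X, y: np.mean((y - X @ w) ** 2) / 2
print(kfold_generalization_error(X, y, K=5, train=train, test_error=test_error))
```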


One benefit with K-fold cross-validation is that you can estimate an error
bar for the generalization error by computing the standard deviation of the
Etest,k values.

Note: The errors $E_{\text{test},k}$ are often approximately log-normally distributed. At least, $\log E_{\text{test},k}$ tends to be more normally distributed than $E_{\text{test},k}$. It is therefore more appropriate to use the mean of the logs as an estimate for the log generalization error. That is,

$$\log E_{\text{gen.}} = \frac{1}{K} \sum_{k=1}^{K} \log E_{\text{test},k} \qquad (7.38)$$

$$\Delta \log E_{\text{gen.}} = 1.96\, \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left[\log E_{\text{test},k} - \log E_{\text{gen.}}\right]^2} \qquad (7.39)$$

where the latter is a 95% confidence band for the log generalization error.
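A small sketch of (7.38)–(7.39); the fold error values are made up for the example.

```python
import numpy as np

def log_error_bar(test_errors):
    """Log-domain estimate (7.38) and 95% band (7.39) from the K fold errors."""
    logs = np.log(np.asarray(test_errors))
    log_gen = logs.mean()                                          # eq. (7.38)
    band = 1.96 * np.sqrt(np.sum((logs - log_gen) ** 2) / (len(logs) - 1))  # (7.39)
    return log_gen, band

errs = [0.12, 0.09, 0.15, 0.11, 0.10]           # example E_test,k values
m, b = log_error_bar(errs)
print(np.exp(m), np.exp(m - b), np.exp(m + b))  # error estimate and its 95% band
```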

 An error bar from cross-validation includes the model variation due to both
different training sets and different initial conditions.


References

Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford
    University Press, Oxford.

Haykin, S. (1999), Neural Networks – A Comprehensive Foundation, second
    edn, Prentice Hall Inc., Upper Saddle River, New Jersey.

Riedmiller, M. & Braun, H. (1993), A direct adaptive method for faster
    backpropagation learning: The RPROP algorithm, in H. Ruspini, ed.,
    ‘Proc. of the IEEE Intl. Conference on Neural Networks’, San Francisco,
    California, pp. 586–591.
