Christof Monz
Informatics Institute
University of Amsterdam
Data Mining
Part 3: Neural Networks
Overview
Perceptrons
Gradient descent search
Multi-layer neural networks
The backpropagation algorithm
Neural Networks
Analogy to biological neural systems, the most
robust learning systems we know
Attempt to understand natural biological
systems through computational modeling
Massive parallelism allows for computational
efficiency
Help understand ‘distributed’ nature of neural
representations
Intelligent behavior as an ‘emergent’ property of a large number of simple units rather than of explicitly encoded symbolic rules and algorithms
Neural Network Learning
Learning approach based on modeling
adaptation in biological neural systems
Perceptron: Initial algorithm for learning
simple neural networks (single layer) developed
in the 1950s
Backpropagation: More complex algorithm for
learning multi-layer neural networks developed
in the 1980s.
Real Neurons
Human Neural Network
Modeling Neural Networks
Perceptrons
A perceptron is a single-layer neural network with one output unit
The output of a perceptron is computed as follows:
o(x1, ..., xn) = 1 if w0 + w1x1 + ... + wnxn > 0, and −1 otherwise
Assuming a ‘dummy’ input x0 = 1, we can write:
o(x1, ..., xn) = 1 if ∑_{i=0}^{n} wixi > 0, and −1 otherwise
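As an illustration (not part of the original handout), a minimal Python sketch of this computation; the function name and the hand-picked AND weights are our own:

```python
# Minimal sketch of a perceptron's output (illustrative, not from the handout).
# `weights` includes the bias w0; a dummy input x0 = 1 is prepended to match it.
def perceptron_output(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    x = [1.0] + list(inputs)                  # dummy input x0 = 1
    activation = sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else -1

# Example: weights chosen by hand so the unit computes boolean AND
print(perceptron_output([-1.5, 1.0, 1.0], [1, 1]))  # 1
print(perceptron_output([-1.5, 1.0, 1.0], [0, 1]))  # -1
```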
Learning a perceptron involves choosing the ‘right’ values for the weights w0, ..., wn
The set of candidate hypotheses is H = {w | w ∈ ℝ^{n+1}}
Representational Power
A single perceptron can represent many Boolean functions, e.g. AND, OR, NAND (¬AND), ..., but not all (e.g., XOR)
Perceptron Training Rule
The perceptron training rule can be defined for each weight as:
wi ← wi + ∆wi, where ∆wi = η(t − o)xi
where t is the target output, o is the output of the perceptron, and η is the learning rate
This scenario assumes that we know what the target outputs are supposed to be
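A minimal sketch of one such update in Python (our own, with illustrative names); the learning rate η appears as `eta`:

```python
# One application of the perceptron training rule (illustrative sketch).
def training_rule_update(weights, inputs, target, eta=0.1):
    x = [1.0] + list(inputs)                  # dummy input x0 = 1
    output = 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1
    # wi <- wi + eta * (t - o) * xi for every weight
    return [w + eta * (target - output) * xi for w, xi in zip(weights, x)]
```

With η = 0.1 and an input value of 0.8, a misclassified example changes the corresponding weight by ±0.16, matching the example below.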
Perceptron Training Rule Example
If t = o then η(t − o)xi = 0 and ∆wi = 0, i.e. the weight wi remains unchanged, regardless of the learning rate and the input value xi
Let’s assume a learning rate of η = 0.1 and an input value of xi = 0.8
• If t = +1 and o = −1, then ∆wi = 0.1 · (1 − (−1)) · 0.8 = 0.16
• If t = −1 and o = +1, then ∆wi = 0.1 · (−1 − 1) · 0.8 = −0.16
Perceptron Training Rule
The perceptron training rule converges after a finite number of iterations, but only if the training examples are linearly separable
The stopping criterion holds if the amount of change falls below a pre-defined threshold θ, e.g., if |∆w|_{L1} < θ
The Delta Rule
The delta rule overcomes the shortcoming of the perceptron training rule of not being guaranteed to converge if the examples are not linearly separable
The delta rule is based on gradient descent search
Let’s assume we have an unthresholded perceptron: o(x) = w · x
We can define the training error as:
E(w) = ½ ∑_{d∈D} (td − od)²
where D is the set of training examples
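A sketch of this error computation (our own; each x is assumed to already contain the dummy input x0 = 1):

```python
# Squared training error E(w) = 1/2 * sum_d (td - od)^2 for an unthresholded
# perceptron o(x) = w . x over a dataset of (x, t) pairs (illustrative sketch).
def training_error(weights, dataset):
    total = 0.0
    for x, t in dataset:
        o = sum(w * xi for w, xi in zip(weights, x))  # unthresholded output
        total += (t - o) ** 2
    return 0.5 * total
```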
Error Surface
Gradient Descent
The gradient of E is the vector pointing in the direction of the steepest increase for any point on the error surface:
∇E(w) = (∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn)
Since we are interested in minimizing the error, we consider negative gradients: −∇E(w)
The training rule for gradient descent is:
w ← w + ∆w, where ∆w = −η ∇E(w)
The training rule for individual weights is defined as wi ← wi + ∆wi, where ∆wi = −η ∂E/∂wi
Instantiating E with the error function we use gives:
∂E/∂wi = ∂/∂wi ½ ∑_{d∈D} (td − od)²
How do we use partial derivatives to actually compute updates to weights at each step?
∂E/∂wi = ∂/∂wi ½ ∑_{d∈D} (td − od)²
= ½ ∑_{d∈D} ∂/∂wi (td − od)²
= ½ ∑_{d∈D} 2 (td − od) · ∂/∂wi (td − od)
= ∑_{d∈D} (td − od) · ∂/∂wi (td − od)
= ∑_{d∈D} (td − od) · (−xid)
where the last step uses od = w · xd, so that ∂(td − od)/∂wi = −xid
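As a sanity check of this result, a small finite-difference comparison (our own sketch; the toy data and names are made up):

```python
# Compare the derived gradient dE/dwi = sum_d (td - od) * (-xid)
# with a numerical finite-difference approximation.
def error(w, data):
    return 0.5 * sum((t - sum(wi * xi for wi, xi in zip(w, x))) ** 2
                     for x, t in data)

def analytic_grad(w, data, i):
    return sum((t - sum(wi * xi for wi, xi in zip(w, x))) * (-x[i])
               for x, t in data)

data = [([1.0, 0.5], 1.0), ([1.0, -0.3], -1.0)]  # each x includes x0 = 1
w, eps = [0.2, -0.1], 1e-6
numeric = (error([w[0] + eps, w[1]], data)
           - error([w[0] - eps, w[1]], data)) / (2 * eps)
print(analytic_grad(w, data, 0), numeric)  # the two values should agree closely
```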
The delta rule for individual weights can now be written as wi ← wi + ∆wi, where
∆wi = η ∑_{d∈D} (td − od) xid
The gradient descent algorithm
• picks initial random weights
• computes the outputs
• updates each weight by adding ∆wi
• repeats until convergence
The Gradient Descent Algorithm
Each training example is a pair ⟨x, t⟩
1. Initialize each wi to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each ∆wi to 0
   2.2 For each ⟨x, t⟩ ∈ D, do:
       2.2.1 Compute o(x)
       2.2.2 For each weight wi, do: ∆wi ← ∆wi + η(t − o)xi
   2.3 For each weight wi, do: wi ← wi + ∆wi
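A runnable Python sketch of this algorithm (our own; it terminates after a fixed number of epochs, a choice the handout leaves open):

```python
import random

# Batch gradient descent for a linear unit; step numbers refer to the
# algorithm above. Each x in `dataset` is assumed to include x0 = 1.
def gradient_descent(dataset, n_weights, eta=0.05, epochs=100):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]   # step 1
    for _ in range(epochs):                                       # step 2
        delta = [0.0] * n_weights                                 # step 2.1
        for x, t in dataset:                                      # step 2.2
            o = sum(wi * xi for wi, xi in zip(w, x))              # step 2.2.1
            for i in range(n_weights):                            # step 2.2.2
                delta[i] += eta * (t - o) * x[i]
        for i in range(n_weights):                                # step 2.3
            w[i] += delta[i]
    return w
```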
The gradient descent algorithm will find the
global minimum, provided that the learning rate
is small enough
If the learning rate is too large, this algorithm runs the risk of overstepping the global minimum
It is a common strategy to gradually decrease the learning rate
This algorithm also works when the training examples are not linearly separable
Shortcomings of Gradient Descent
Converging to a minimum can be quite slow (it can take thousands of steps). Increasing the learning rate, on the other hand, can lead to overstepping minima
If there are multiple local minima in the error
surface, gradient descent can get stuck in one
of them and not find the global minimum
Stochastic gradient descent alleviates these
difficulties
Stochastic Gradient Descent
Gradient descent updates the weights after
summing over all training examples
Stochastic (or incremental) gradient descent
updates weights incrementally after calculating
the error for each individual training example
To this end, step 2.3 is deleted and step 2.2.2 is modified, as shown below
Each training example is a pair ⟨x, t⟩
1. Initialize each wi to some small random value
2. Until the termination condition is met, do:
   2.1 Initialize each ∆wi to 0
   2.2 For each ⟨x, t⟩ ∈ D, do:
       2.2.1 Compute o(x)
       2.2.2 For each weight wi, do: wi ← wi + η(t − o)xi
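The corresponding sketch (same assumptions as the batch version above); note that the weight update now happens inside the loop over examples:

```python
import random

# Stochastic (incremental) gradient descent: w is updated per example.
def stochastic_gradient_descent(dataset, n_weights, eta=0.05, epochs=100):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]
    for _ in range(epochs):
        for x, t in dataset:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_weights):
                w[i] += eta * (t - o) * x[i]   # immediate update, no accumulation
    return w
```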
Comparison
In standard gradient descent, summing over all training examples requires more computation per weight-update step
As a consequence, standard gradient descent often uses larger learning rates than stochastic gradient descent
Stochastic gradient descent can avoid falling into local minima because it uses the varying per-example gradients ∇Ed(w) rather than the overall ∇E(w) to guide its search
Multi-Layer Neural Networks
Perceptrons only have two layers: the input
layer and the output layer
Perceptrons only have one output unit
Perceptrons are limited in their expressiveness
Multi-layer neural networks consist of an input
layer, a hidden layer, and an output layer
Multi-layer neural networks can have several
output units
The units of the hidden layer function as input
units to the next layer
However, multiple layers of linear units still
produce only linear functions
The step function in perceptrons is another
choice, but it is not differentiable, and therefore
not suitable for gradient descent search
Solution: the sigmoid function, a non-linear,
differentiable threshold function
Sigmoid Unit
The Sigmoid Function
The output is computed as o = σ(w · x), where σ(y) = 1 / (1 + e^(−y)), i.e. o = σ(w · x) = 1 / (1 + e^(−w·x))
Another nice property of the sigmoid function is that its derivative is easily expressed:
dσ(y)/dy = σ(y) · (1 − σ(y))
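A minimal sketch of the sigmoid and its derivative (our own):

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y))"""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_deriv(y):
    """dsigma(y)/dy = sigma(y) * (1 - sigma(y))"""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_deriv(0.0))  # 0.25
```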
Learning with Multiple Layers
The gradient descent search can be used to
train multi-layer neural networks, but the
algorithm has to be adapted
Firstly, there can be multiple output units, and therefore the error function has to be generalized:
E(w) = ½ ∑_{d∈D} ∑_{k∈outputs} (tkd − okd)²
Secondly, the error ‘feedback’ has to be fed
through multiple layers
Backpropagation Algorithm
For each training example ⟨x, t⟩ do:
1. Input x to the network and compute the output ou for every unit u in the network
2. For each output unit k, calculate its error δk: δk ← ok(1 − ok)(tk − ok)
3. For each hidden unit h, calculate its error δh: δh ← oh(1 − oh) ∑_{k∈outputs} wkh δk
4. Update each network weight wji: wji ← wji + ∆wji, where ∆wji = η δj xji
Note: xji is the input value from unit i to unit j, and wji is the weight on the connection from unit i to unit j
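A sketch of one backpropagation update for a network with a single hidden layer of sigmoid units (our own illustrative structure: weight matrices as nested lists, biases folded in via dummy inputs of 1):

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(w_hidden, w_out, x, t, eta=0.1):
    x = [1.0] + list(x)                                   # dummy input x0 = 1
    # Step 1: forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    h_in = [1.0] + h                                      # dummy input for output layer
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h_in))) for row in w_out]
    # Step 2: output errors, delta_k = ok (1 - ok)(tk - ok)
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Step 3: hidden errors, delta_h = oh (1 - oh) * sum_k w_kh delta_k
    d_hid = [h[j] * (1 - h[j]) *
             sum(w_out[k][j + 1] * d_out[k] for k in range(len(o)))
             for j in range(len(h))]                      # j + 1 skips the bias weight
    # Step 4: weight updates, w_ji <- w_ji + eta * delta_j * x_ji
    for k, row in enumerate(w_out):
        for i in range(len(row)):
            row[i] += eta * d_out[k] * h_in[i]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * x[i]
    return w_hidden, w_out
```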
Step 1 propagates the input forward through
the network
Steps 2–4 propagate the errors backward
through the network
Step 2 is similar to the delta rule in gradient
descent (step 2.3)
Step 3 sums over the errors of all output units influenced by a given hidden unit (this is because the training data only provides direct feedback for the output units)
Applications of Neural Networks
Text to speech
Fraud detection
Automated vehicles
Game playing
Handwriting recognition
Summary
Perceptrons, simple single-layer neural networks
Perceptron training rule
Gradient descent search
Multi-layer neural networks
Backpropagation algorithm
