https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course covers the basic principles of deep learning from both algorithmic and computational perspectives.
4. Multilayer perceptrons
When each node in each layer computes a non-linearity applied to a linear combination of all the outputs of the previous layer, the network is called a multilayer perceptron (MLP).
Weights can be organized into matrices.
The forward pass computes, layer by layer:
a^(k+1) = W^(k) h^(k) + b^(k),  h^(k+1) = g(a^(k+1))
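As an illustration, a minimal Python/NumPy sketch of this forward pass (the layer sizes and the sigmoid choice for g are assumptions made for the example):

import numpy as np

def g(a):
    # Illustrative non-linearity: element-wise sigmoid.
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, params):
    # params is a list of (W, b) pairs, one per layer.
    # Each layer computes a^(k+1) = W^(k) h^(k) + b^(k), h^(k+1) = g(a^(k+1)).
    h = x
    for W, b in params:
        a = W @ h + b
        h = g(a)
    return h

# Toy example: 3 inputs -> 4 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y_hat = mlp_forward(rng.normal(size=3), params)
print(y_hat.shape)  # (2,)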
5. Training MLPs
With multilayer perceptrons we need to find the gradient of the loss function with respect to all the parameters of the model (W^(k), b^(k)).
These can be found using the chain rule of differentiation.
The calculations reveal that the gradient w.r.t. the parameters in layer k only depends on the error from the layer above and the output from the layer below.
This means that the gradients for each layer can be computed iteratively, starting at the last layer and
propagating the error back through the network. This is known as the backpropagation algorithm.
6. Backpropagation algorithm
• Computational graphs
• Examples applying the chain rule in simple graphs
• Backpropagation applied to the multilayer perceptron
• Another perspective: modular backprop
7. Computational graphs
[Figure: example computational graphs from the Deep Learning Book: a product node z = x·y; logistic regression ŷ = σ(xᵀw + b); a ReLU layer H = max(0, XW + b); and a linear prediction ŷ = xᵀw with a weight-decay penalty λ Σᵢ wᵢ².]
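To make the chain rule on such a graph concrete, here is a small Python/NumPy sketch that differentiates the logistic-regression graph ŷ = σ(xᵀw + b) node by node; all numerical values are made up for illustration:

import numpy as np

x = np.array([1.0, -2.0])
w = np.array([0.5, 0.3])
b = 0.1

# Forward pass through the graph: u1 = x.w, u2 = u1 + b, y_hat = sigmoid(u2)
u1 = x @ w
u2 = u1 + b
y_hat = 1.0 / (1.0 + np.exp(-u2))

# Backward pass: apply the chain rule node by node, from the output to the inputs.
dy_du2 = y_hat * (1.0 - y_hat)   # d sigmoid(u2) / d u2
du2_du1 = 1.0                    # d (u1 + b) / d u1
dy_dw = dy_du2 * du2_du1 * x     # d y_hat / d w
dy_db = dy_du2 * du2_du1         # d y_hat / d b

print(dy_dw, dy_db)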
13. Backpropagation applied to an element of the MLP
For a single neuron with its linear and non-linear part
[Figure: a single neuron taking h^(k) as input, with a linear part producing a^(k+1) followed by the non-linearity g(·) producing h^(k+1).]

h^(k+1) = g(W^(k) h^(k) + b^(k)) = g(a^(k+1))

∂a^(k+1) / ∂h^(k) = W^(k)

∂h^(k+1) / ∂a^(k+1) = g'(a^(k+1))
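A possible Python/NumPy rendering of these local derivatives for a single layer, assuming a sigmoid non-linearity g (the shapes are illustrative):

import numpy as np

def layer_forward(h_k, W_k, b_k):
    # a^(k+1) = W^(k) h^(k) + b^(k);  h^(k+1) = g(a^(k+1)), with g = sigmoid here.
    a_next = W_k @ h_k + b_k
    h_next = 1.0 / (1.0 + np.exp(-a_next))
    return a_next, h_next

def layer_backward(delta_next, a_next, W_k):
    # delta_next = dLoss/dh^(k+1).
    # Multiply by g'(a^(k+1)) = dh^(k+1)/da^(k+1), then by W^(k) = da^(k+1)/dh^(k).
    s = 1.0 / (1.0 + np.exp(-a_next))
    g_prime = s * (1.0 - s)
    delta_a = delta_next * g_prime   # dLoss/da^(k+1)
    delta_prev = W_k.T @ delta_a     # dLoss/dh^(k), passed to the layer below
    return delta_a, delta_prev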
15. Forward and backward pass
[Figure: an MLP with input x, pre-activations a2, a3, a4, hidden activations h2, h3 and output h4, connected by weights W1 (x → a2), W2 (h2 → a3) and W3 (h3 → a4), followed by a loss L; the network is drawn once for the forward pass and once for the backward pass.]
Backpropagation is applied during the backward pass.
16. Forward Pass
Output: the class probability given the input (softmax).
[Figure: the same MLP, with the softmax output annotated. Figure credit: Kevin McGuinness.]
17. Forward Pass
Output: the class probability given the input (softmax).
Loss function: e.g., negative log-likelihood (well suited to classification).
Regularization term: L2 norm, also known as weight decay.
[Figure: the same MLP, with the softmax output, loss and regularization term annotated. Figure credit: Kevin McGuinness.]
18. Forward Pass
Output: the class probability given the input (softmax).
Loss function: e.g., negative log-likelihood (well suited to classification).
Regularization term: L2 norm, also known as weight decay.
Minimize the loss (plus the regularization term) w.r.t. the parameters over the whole training set.
[Figure: the same MLP, annotated as above. Figure credit: Kevin McGuinness.]
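A small Python/NumPy sketch of the quantities named on this slide (softmax class probabilities, negative log-likelihood loss, and the L2 weight-decay term); the variable names and the value of λ are illustrative assumptions:

import numpy as np

def softmax(a):
    # Numerically stable softmax.
    e = np.exp(a - a.max())
    return e / e.sum()

a4 = np.array([2.0, -1.0, 0.5])           # pre-activations of the output layer
h4 = softmax(a4)                          # class probabilities given the input
target = 0                                # index of the true class

nll = -np.log(h4[target])                 # negative log-likelihood loss

weights = [np.random.randn(4, 3), np.random.randn(3, 4)]
lam = 1e-4                                # weight-decay strength (illustrative)
l2 = lam * sum((W ** 2).sum() for W in weights)

loss = nll + l2                           # quantity minimized over the training set
print(loss)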
19. Backward Pass
1. Find the error in the top layer (the gradient of the loss with respect to the top layer's output).
[Figure: the same MLP, with the error flowing backward from the loss L. Figure credit: Kevin McGuinness.]
20. Backward Pass
1. Find the error in the top layer.
2. Compute the weight updates (to simplify, biases are not considered).
[Figure: the same MLP during the backward pass. Figure credit: Kevin McGuinness.]
21. Backward Pass
1. Find the error in the top layer.
2. Compute the weight updates (to simplify, biases are not considered).
3. Backpropagate the error to the layer below and repeat.
[Figure: the same MLP during the backward pass. Figure credit: Kevin McGuinness.]
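The three steps written out for the small network in the figure, as a Python/NumPy sketch. Assumptions made for the example: sigmoid hidden units, a softmax output with negative log-likelihood loss (so the top-layer error is taken directly with respect to the pre-activations a4), biases omitted as on the slide, and made-up sizes:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                    # input
y = np.array([0.0, 1.0])                  # one-hot target (2 classes)
W1 = rng.normal(size=(4, 3))              # biases omitted, as on the slide
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(2, 4))

# Forward pass
a2 = W1 @ x;  h2 = sigmoid(a2)
a3 = W2 @ h2; h3 = sigmoid(a3)
a4 = W3 @ h3
h4 = np.exp(a4 - a4.max()); h4 /= h4.sum()   # softmax output

# 1. Error in the top layer (softmax + negative log-likelihood): dLoss/da4
delta4 = h4 - y

# 2. Weight gradients: error from the layer above times output from the layer below
dW3 = np.outer(delta4, h3)

# 3. Backpropagate the error to the layer below, then repeat steps 2-3
delta3 = (W3.T @ delta4) * h3 * (1 - h3)
dW2 = np.outer(delta3, h2)
delta2 = (W2.T @ delta3) * h2 * (1 - h2)
dW1 = np.outer(delta2, x)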
22. Another perspective: modular backprop
You could use the chain rule on all the individual neurons to compute the gradients with respect to the parameters and backpropagate the error signal. Instead, it is useful to work with a layer abstraction and to define the backpropagation algorithm in terms of three operations that every layer needs to support. This is called modular backpropagation.
26. Modular backprop
Using this idea, it is possible to create
many types of layers
● Linear (fully connected layers)
● Activation functions (sigmoid, ReLU)
● Convolutions
● Pooling
● Dropout
Once layers support the backward and
forward operations, they can be plugged
together to create more complex functions
[Figure: a stack of Convolution, ReLU and Linear layers; the output error (L+1) is propagated backward through the stack, each parameterized layer producing its gradients, until it emerges as the input error (L).]
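A minimal Python sketch of this layer abstraction (class and method names are illustrative, not the API of any particular library; here the parameter-gradient computation is folded into each layer's backward operation):

import numpy as np

class Linear:
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_out, n_in) * 0.01
        self.dW = None
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, grad_out):
        # Gradient w.r.t. the parameters, and error passed on to the layer below.
        self.dW = np.outer(grad_out, self.x)
        return self.W.T @ grad_out

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask

# Layers that support forward/backward can be plugged together freely.
net = [Linear(3, 4), ReLU(), Linear(4, 2)]
h = np.random.randn(3)
for layer in net:
    h = layer.forward(h)
grad = np.ones(2)                 # stand-in for the error from the loss
for layer in reversed(net):
    grad = layer.backward(grad)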
27. Implementation notes
Caffe and Torch
Libraries like Caffe and Torch implement
backpropagation this way.
To define a new layer, you need to create a class and implement its forward and backward operations.
Theano and TensorFlow
Libraries like Theano and TensorFlow
operate on a computational graph.
To define a new layer, you only need to specify the forward operation; automatic differentiation (autodiff) is used to derive the backward one.
You also don't need to implement backprop manually in Theano or TensorFlow: these libraries apply computational-graph optimizations to automatically factor out common computations.
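For instance, with the TensorFlow 1.x-style graph API (shown here as an assumption about the version in use at the time), only the forward computation is written and the gradients are derived automatically:

import tensorflow as tf   # 1.x-style graph API

x = tf.placeholder(tf.float32, shape=[None, 3])
y = tf.placeholder(tf.float32, shape=[None, 2])
W = tf.Variable(tf.random_normal([3, 2]))
b = tf.Variable(tf.zeros([2]))

pred = tf.matmul(x, W) + b                 # forward pass only
loss = tf.reduce_mean(tf.square(pred - y))

grads = tf.gradients(loss, [W, b])         # backward pass derived by autodiff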
28. Issues on Backpropagation and Training
Gradient descent: move each parameter θ_j in small steps in the direction opposite to the sign of the derivative of the loss with respect to θ_j.
θ^(t) = θ^(t−1) − α^(t−1) ∇_θ ℒ(y, ŷ) − λ θ^(t−1)
Stochastic gradient descent (SGD): estimate the gradient with one sample, or better, with a
minibatch of examples.
Weight decay: a regularization term that penalizes large weights and distributes the values among all the parameters.
Momentum: the parameter-update direction averages the current gradient estimate with the previous ones.
Several strategies have been proposed to update the weights; they are known as optimizers.
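A compact Python/NumPy sketch of these update rules, combining gradient descent, weight decay and momentum (the hyperparameter values are illustrative):

import numpy as np

def sgd_step(theta, grad, velocity, lr=0.01, weight_decay=1e-4, momentum=0.9):
    # Weight decay: penalize large weights by shrinking them at every step.
    grad = grad + weight_decay * theta
    # Momentum: average the current gradient estimate with the previous ones.
    velocity = momentum * velocity - lr * grad
    # Gradient descent: move the parameters against the (estimated) gradient.
    return theta + velocity, velocity

theta = np.zeros(5)
velocity = np.zeros_like(theta)
for minibatch_grad in np.random.randn(100, 5):   # stand-in for SGD minibatch gradients
    theta, velocity = sgd_step(theta, minibatch_grad, velocity)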
29. Note on hyperparameters
So far we have lots of hyperparameters to choose:
1. Learning rate (α)
2. Weight decay (λ)
3. Number of epochs
4. Number of hidden layers
5. Nodes in each hidden layer
6. Weight initialization strategy
7. Loss function
8. Activation functions
9. ...
… more in the next class
30. Summary
• Backpropagation is applied during the Backward pass while training
• Computational graphs help to understand the chain rule of differentiation
• The gradients of the parameters in layer k only depend on the error from the layer above and the output from the layer below. This means that the gradients for each layer can be computed iteratively, starting at the last layer and propagating the error back through the network.
• Hyperparameters have to be chosen, and the choice is not obvious
• For a “deeper” study: http://www.deeplearningbook.org/