1. DEEP LEARNING
MULTILAYERED / HIERARCHICAL NEURAL NETWORK BASED INFORMATION PROCESSING
ROBUST, GENERALIZABLE, AND SCALABLE
2. AGENDA
• WHAT DOES IT MEAN FOR A MACHINE TO LEARN?
• ML VS DL
• APPLICATIONS OF DL
• DL CONCEPTS
• PROMINENT DL ARCHITECTURES
• DL BASED SAMPLE PROJECTS
3. HOW DOES MACHINE LEARNING WORK?
• Consider an equation Ax = b, where b is the actual right-hand-side (RHS) value
• The machine predicts the RHS as b’
• Machine learning is a set of algorithmic techniques for minimizing
the error (b - b’) in this equation through optimization.
• This is done by changing the weights in the column vector x (the
parameter vector) until we find a set of values that gives outcomes
closest to the actual values of b.
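A minimal sketch of this idea in Python with NumPy: adjust the weight vector x by gradient descent until Ax is as close as possible to b. The matrix A, targets b, learning rate, and step count below are illustrative values, not from the slides.

import numpy as np

# Illustrative data: A holds the input features, b the actual RHS values.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([5.0, 11.0, 17.0])

x = np.zeros(2)            # parameter (weight) vector, initialized arbitrarily
learning_rate = 0.01

for step in range(5000):
    b_pred = A @ x                        # the machine's prediction b'
    error = b - b_pred                    # the error (b - b') we want to minimize
    gradient = -2 * A.T @ error / len(b)  # gradient of the mean squared error w.r.t. x
    x -= learning_rate * gradient         # adjust the weights to reduce the error

print("learned x:", x)                    # approaches the x for which Ax is closest to b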
8. MORE APPLICATIONS OF DL
• Real-time image recognition
• Sentiment analysis
• Search ranking
• Personalization
• Speaker identification
• Text prediction
• Handwriting recognition
• Machine translation
• Face detection
• Music tagging
• Entity recognition
• Style transfer
• Image captioning
• Emotion detection
• Text summarization
18. Training data
Fields              class
1.4  2.7  1.9       0
3.8  3.4  3.2       0
6.4  2.8  1.7       1
4.1  0.1  0.2       0
etc …
Present a training pattern: feed the input fields (1.4, 2.7, 1.9) into the network.
19. Training data (same table as above)
Feed it through to get output: for inputs (1.4, 2.7, 1.9) the network outputs 0.8.
20. Training data (same table as above)
Compare with target output: the output is 0.8, the target class is 0, so the error is 0.8.
21. Training data (same table as above)
Adjust weights based on error: nudge the weights so that the error of 0.8 shrinks.
22. Training data (same table as above)
Present a training pattern: feed the input fields (6.4, 2.8, 1.7) into the network.
23. Training data (same table as above)
Feed it through to get output: for inputs (6.4, 2.8, 1.7) the network outputs 0.9.
24. Training data (same table as above)
Compare with target output: the output is 0.9, the target class is 1, so the error is -0.1.
25. Training data (same table as above)
Adjust weights based on error: nudge the weights so that the error of -0.1 shrinks.
26. Training data (same table as above)
And so on …
Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments.
Algorithms for weight adjustment are designed to make changes that will reduce the error.
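A minimal sketch of this loop in Python for a single sigmoid neuron trained on the table above. It uses a simple error-driven (delta-rule) update; the learning rate, random seed, and number of passes are illustrative choices, not from the slides.

import numpy as np

# Training data from the table: three fields per row, plus the class label.
X = np.array([[1.4, 2.7, 1.9],
              [3.8, 3.4, 3.2],
              [6.4, 2.8, 1.7],
              [4.1, 0.1, 0.2]])
y = np.array([0, 0, 1, 0])

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=3)   # initialize the weights randomly
bias = 0.0
learning_rate = 0.1

for step in range(10_000):
    i = rng.integers(len(X))                              # take a random training instance
    output = 1 / (1 + np.exp(-(X[i] @ weights + bias)))   # feed it through to get output
    error = y[i] - output                                 # compare with the target output
    weights += learning_rate * error * X[i]               # adjust weights based on error
    bias += learning_rate * error

print(np.round(1 / (1 + np.exp(-(X @ weights + bias))), 2))   # outputs move toward the targets 0, 0, 1, 0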
27. A SAMPLE NEURAL NETWORK AND
CORRESPONDING COMPUTATION
The Forward Propagation
28. THE ENTIRE PROCESS BEHIND
OPTIMIZATION
• Select a network architecture, i.e., the number of hidden layers, the number of
neurons in each layer, and the activation function
• Initialize the weights randomly
• Then, in each iteration:
• Use forward propagation to compute the network's output
• Find the error of the model using the known labels
• Back-propagate the error into the network, determine the error contribution of
each node, and update the weights along the negative gradient to reduce the error
29. THE PSEUDOCODE FOR CALCULATING
OUTPUT OF FORWARD-PROPAGATING
NEURAL NETWORK
• # node[] := array of topologically sorted nodes; an edge from a to b means a is to the left of b
• # If the neural network has R inputs and S outputs, then the first R nodes are input nodes and the last S nodes are output nodes
• # incoming[x] := nodes with an edge into node x
• # weights[x] := weights of the incoming edges to x
30. • For each node x, from left to right:
•   if x <= R: output[x] = value of input x   # input node: nothing to compute
•   else:
•     inputs[x] = [output[i] for i in incoming[x]]
•     weighted_sum = dot_product(weights[x], inputs[x])
•     output[x] = activation_function(weighted_sum)
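A runnable Python version of this pseudocode. The three-input, two-hidden, one-output network, its weights, and the sigmoid activation are illustrative assumptions.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, incoming, weights, R):
    """Nodes 0..R-1 are inputs; the remaining nodes are computed left to right."""
    output = {}
    n_nodes = R + len(incoming)
    for x in range(n_nodes):
        if x < R:
            output[x] = inputs[x]                   # input node: output is the input value
        else:
            node_inputs = [output[i] for i in incoming[x]]
            weighted_sum = sum(w * v for w, v in zip(weights[x], node_inputs))
            output[x] = sigmoid(weighted_sum)       # apply the activation function
    return output

# Illustrative network: inputs 0-2, hidden nodes 3-4, output node 5.
incoming = {3: [0, 1, 2], 4: [0, 1, 2], 5: [3, 4]}
weights  = {3: [0.2, -0.5, 0.1], 4: [0.4, 0.3, -0.2], 5: [0.7, -0.6]}
print(forward([1.4, 2.7, 1.9], incoming, weights, R=3))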
To train a neural network, we use the iterative gradient descent method. After
random initialization, we make predictions on some subset of the data with the
forward-propagation process, compute the corresponding cost function C,
and update each weight w by an amount proportional to dC/dw, i.e., the
derivative of the cost function w.r.t. that weight. The proportionality
constant is known as the learning rate.
We calculate the gradients backwards: first the gradients of the output layer,
then the top-most hidden layer, followed by the preceding hidden layer, and so
on, ending at the input layer.
31. GRADIENT DESCENT OPTIMIZATION
TECHNIQUE
TO FIND WHICH WEIGHT PRODUCES THE LEAST ERROR
• The derivative of the network error with respect to each weight, dE/dw,
measures the extent to which a slight change in that weight causes a slight
change in the error
• Use the chain rule of calculus to work back through the network's
activations and outputs. This leads us to the weight in question, and its
relationship to the overall error.
• We can calculate how a change in a weight affects the error by first
calculating how a change in the activation affects the error, and then how a
change in the weight affects the activation.
32. GRADIENT DESCENT
• Objective/cost function: J(θ)
• Update each element of θ: θ_j(new) = θ_j(old) − α · ∂J(θ)/∂θ_j(old), where α is the learning rate
• Matrix notation for all parameters: θ(new) = θ(old) − α · ∇_θ J(θ)
• Recursively apply the chain rule through each node
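A minimal sketch of this update rule in Python with NumPy, applied to an illustrative quadratic cost J(θ) = ||θ − θ*||²; the minimizer θ*, learning rate, and step count are assumptions for the example.

import numpy as np

theta_star = np.array([3.0, -1.0])              # illustrative minimizer of the cost

def cost(theta):
    return np.sum((theta - theta_star) ** 2)    # J(theta)

def grad(theta):
    return 2 * (theta - theta_star)             # gradient of J w.r.t. theta

theta = np.zeros(2)     # theta(old)
alpha = 0.1             # learning rate

for _ in range(100):
    theta = theta - alpha * grad(theta)         # theta(new) = theta(old) - alpha * gradient

print(theta, cost(theta))                       # theta approaches theta_star, the cost approaches 0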
33. OVERFITTING
• Overfitting occurs when the network learns rare, idiosyncratic dependencies
in the training data but cannot produce the correct output for test data.
• Overfitting is combated by regularization.
• Regularization methods: dropout, early stopping, data augmentation,
transfer learning
34. DROPOUT
• Dropout is a technique where, during each iteration of gradient
descent, we drop/ignore a set of randomly selected nodes.
• Each neuron is kept with probability q and dropped randomly with
probability 1 - q. The drop probability may differ per layer; dropping
hidden-layer neurons with probability 0.5 and input neurons with
probability 0 (i.e., keeping all inputs) works well on a wide range of tasks.
• During evaluation and prediction, no dropout is used. The output
of each neuron is multiplied by q so that the input to the next layer
has the same expected value.
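A minimal NumPy sketch of this scheme: keep each unit with probability q during training and multiply by q at prediction time, as described above. The hidden-layer activations and the random seed are illustrative.

import numpy as np

rng = np.random.default_rng(0)
q = 0.5                                         # keep probability for a hidden layer

def dropout_train(activations, q):
    mask = rng.random(activations.shape) < q    # keep each neuron with probability q
    return activations * mask                   # dropped neurons output 0 this iteration

def dropout_predict(activations, q):
    return activations * q                      # scale so the expected input to the next layer matches training

hidden = np.array([0.8, 0.3, 1.2, 0.5])         # illustrative hidden-layer activations
print(dropout_train(hidden, q))                 # a random subset is zeroed out
print(dropout_predict(hidden, q))               # all kept, scaled by q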
35. EARLY STOPPING
• We stop training when the error starts to increase.
• Here, by error, we mean the error measured on validation data, the part
of the training data held out for tuning hyper-parameters.
• In this case, the hyper-parameter being tuned is the stopping criterion.
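A small Python sketch of this stopping rule, operating on a list of per-epoch validation errors. The patience value and the error curve are illustrative assumptions.

def early_stopping(validation_errors, patience=3):
    """Return the epoch at which training should stop, given per-epoch validation errors."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch            # the error has stopped improving: stop here
    return len(validation_errors) - 1

# Illustrative validation-error curve: improves, then starts to rise (overfitting).
print(early_stopping([0.9, 0.7, 0.5, 0.4, 0.42, 0.45, 0.5]))   # stops at epoch 6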
36. DATA AUGMENTATION
• We increase the amount of data we have, or augment it, by taking
existing data and applying transformations to it.
• For instance, in many computer-vision tasks such as object
classification, an effective data augmentation technique is
adding new data points that are cropped or translated versions
of the original data.
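A NumPy sketch of two such transformations (a horizontal flip and a random crop) applied to an illustrative image array; the image size, crop size, and seed are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def horizontal_flip(image):
    return image[:, ::-1]                       # mirror the image left-to-right

def random_crop(image, crop_h, crop_w):
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)       # random crop position
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

image = rng.random((32, 32, 3))                 # illustrative 32x32 RGB image
augmented = [horizontal_flip(image), random_crop(image, 28, 28)]
print([a.shape for a in augmented])             # new data points derived from the original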
37. TRANSFER LEARNING
• The process of taking a pre-trained model and “fine-tuning”
it with our own dataset is called transfer learning.
• The model is first trained on a large dataset. Then, we
remove the last layer of the network and replace it with a new
layer with random weights.
• We then freeze the weights of all the other layers and train the
network normally. The pre-trained model acts as a feature
extractor, and only the last layer is trained on the current
task.
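A sketch of this recipe using PyTorch and torchvision as one possible stack (not prescribed by the slides). It assumes torchvision ≥ 0.13 for the weights argument; the number of classes and optimizer settings are illustrative.

import torch
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on a large dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the weights of all existing layers: the model acts as a feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer with a new, randomly initialized layer for our own task.
num_classes = 5                                   # illustrative number of classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer's parameters are handed to the optimizer and trained.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)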
38. CORE COMPONENTS OF A DEEP NET
• Parameters: the weights on the connections in the network
• Layers: the fundamental architectural unit of deep networks
• Activation functions: a nonlinear transform applied to the output of
the previous layer, e.g., sigmoid, tanh, hard tanh, rectified linear unit (ReLU)
• Loss functions: determine the penalty for an incorrect classification of an
input, e.g., squared loss, logistic loss, hinge loss, negative log likelihood
• Optimization methods: gradient descent, genetic algorithms, simulated
annealing, PSO, ACO
• Hyper-parameters: layer size, learning rate, regularization
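NumPy versions of the four activation functions listed above, evaluated on a few illustrative inputs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # squashes the input into (0, 1)

def tanh(z):
    return np.tanh(z)                   # squashes the input into (-1, 1)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)        # linear in [-1, 1], clipped outside

def relu(z):
    return np.maximum(0.0, z)           # rectified linear unit: zero for negative inputs

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, hard_tanh, relu):
    print(f.__name__, np.round(f(z), 3))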
41. AUTO-ENCODERS
TO LEARN COMPRESSED REPRESENTATIONS OF DATASETS
Auto-encoders differ from the multilayer perceptron in two ways:
• They use unlabeled data in unsupervised learning.
• They build a compressed representation of the input data.
Auto-encoders use backpropagation to update their weights. The main
difference between RBMs and auto-encoders is in how they calculate the
gradients.
42. RESTRICTED BOLTZMANN MACHINES(RBM)
With RBMs, every visible unit is connected to
every hidden unit, yet no units from the same
layer are connected. Pre-training using RBMs
means teaching the network to reconstruct the
original data from a limited sample of that data.
Contrastive Divergence
RBMs calculate gradients using an algorithm
called contrastive divergence. It minimizes the
KL divergence (the difference between the real
distribution of the data and the model's guess)
by sampling k steps of a Markov chain to
compute the guess.
44. DEEP BELIEF NETWORKS
DBNS ARE COMPOSED OF LAYERS OF RESTRICTED BOLTZMANN MACHINES (RBMS) FOR THE
PRE-TRAIN PHASE AND THEN A FEED-FORWARD NETWORK FOR THE FINE-TUNE PHASE.
45. GENERATIVE ADVERSARIAL
NETWORKS(GANS)
• GANs use unsupervised learning to train two adversarial models
in parallel.
• The generative network: generates data (or images) with a special
kind of layer called a de-convolutional layer
• The discriminator network: the secondary network, generally
a CNN for image classification tasks
47. • Convolutional layers transform the input data by using a patch
of locally connected neurons from the previous layer. The layer
computes a dot product between the region of neurons in the input
layer and the weights to which they are locally connected in the
output layer.
• A convolution is a mathematical operation describing a rule for
how to merge two sets of information. The convolution operation is
known as the feature detector of a CNN.
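A NumPy sketch of this operation: slide a small filter over the input and take a dot product with each local patch. The input values and filter are illustrative; as in most deep-learning libraries, this is technically cross-correlation (the filter is not flipped).

import numpy as np

def conv2d(image, kernel):
    """Valid convolution: dot product of the kernel with every local patch of the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]     # locally connected region of the input
            out[i, j] = np.sum(patch * kernel)    # dot product with the shared weights
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # illustrative 5x5 input
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)    # illustrative vertical-edge detector
print(conv2d(image, edge_filter))                 # 3x3 feature map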
49. POOLING LAYERS
• They progressively reduce the data representation over the
network and help control overfitting. The pooling layer
operates independently on every depth slice of the input.
• The pooling layer uses the max() operation to resize the input
data spatially (width, height). This operation is referred to as
max pooling. With a 2 × 2 filter size, the max() operation takes
the largest of the four numbers in the filter area. This
operation does not affect the depth dimension.
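A NumPy sketch of 2 × 2 max pooling on a single depth slice; the feature map below is illustrative.

import numpy as np

def max_pool_2x2(feature_map):
    """Downsample width and height by 2, taking the largest of each 2x2 block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd edge rows/columns if any
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)  # group the values into 2x2 blocks
    return blocks.max(axis=(1, 3))                  # largest of the four numbers per block

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [4, 8, 3, 1]], dtype=float)
print(max_pool_2x2(x))                              # [[6. 4.] [8. 9.]]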
50. FULLY CONNECTED LAYERS
• Fully connected layers compute the class scores used as the output of
the network (e.g., the output layer at the end of the network). The
dimensions of the output volume are [ 1 × 1 × N ], where N is the number
of output classes we’re evaluating.
• Fully connected layers perform transformations on the input data volume
that are a function of the activations in the input volume and of the
parameters (weights and biases).
51. RECURRENT NEURAL NETWORKS
• Recurrent Neural Networks take each vector from a sequence of
input vectors and model them one at a time. This allows the
network to retain state while modeling each input vector across
the window of input vectors. Recurrent Neural Networks are the
standard choice for modeling the time dimension.
52. • A Recurrent Neural Network includes a feedback loop that lets it
learn from sequences of varying lengths.
• It has an extra parameter matrix for the connections between
time-steps, used to capture the temporal relationships in the data.
• RNNs are trained to generate sequences in which the output at each
time-step is based on both the current input and the inputs at all
previous time-steps.
• Normal Recurrent Neural Networks compute the gradient with an
algorithm called backpropagation through time (BPTT).
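A NumPy sketch of the recurrence: the extra matrix W_hh carries the hidden state from one time-step to the next. The layer sizes, weights, and input sequence below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # extra matrix linking time-steps
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """New hidden state depends on the current input and the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

sequence = rng.normal(size=(5, input_size))   # illustrative sequence of 5 input vectors
h = np.zeros(hidden_size)                     # initial state
for x_t in sequence:
    h = rnn_step(x_t, h)                      # state is retained across the sequence
print(h)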
53. VANISHING GRADIENT PROBLEM
• Recurrent Neural Networks are known to have issues with
the “vanishing gradient problem.”
• The issue occurs when the gradients shrink (or, in the related
exploding-gradient problem, grow) too much as they are propagated
back through time, making it difficult to model long-range
dependencies (10 time-steps or more) in the structure of
the input dataset.
• The most effective way to get around this issue is to use
the LSTM variant of Recurrent Neural Networks.
54. LSTM NETWORKS
• The critical component of the LSTM is the memory cell and the gates
(the forget gate, the input gate). The contents of the memory cell are
modulated by the input gates and forget gates.
• Assuming that both of these gates are closed, the contents of the
memory cell will remain unmodified between one time-step and the
next.
• The gating structure allows information to be retained across many
time-steps, and consequently also allows gradients to flow across
many time-steps.
• This allows the LSTM model to overcome the vanishing gradient
problem that occurs with most Recurrent Neural Network models.
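A NumPy sketch of one LSTM time-step with the gated memory cell described above; the layer sizes, the single stacked weight matrix W, and the input sequence are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time-step; W maps [h_prev; x_t] to the pre-activations of the gates."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:H])              # input gate: how much new content enters the cell
    f = sigmoid(z[H:2 * H])          # forget gate: how much of the old cell content is kept
    o = sigmoid(z[2 * H:3 * H])      # output gate
    g = np.tanh(z[3 * H:4 * H])      # candidate cell contents
    c_t = f * c_prev + i * g         # memory cell modulated by the input and forget gates
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 4, 3                                        # illustrative hidden and input sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):                # illustrative input sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)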
55. USE CASES OF LSTMS
• Generating sentences (e.g.,
character-level language
models)
• Classifying time-series
• Speech recognition
• Handwriting recognition
• Polyphonic music modeling
56. RECURSIVE NEURAL NETWORKS
• Recursive Neural Networks, like Recurrent Neural Networks, can deal
with variable-length input and can model hierarchical structures in the
training dataset.
• Applications: deconstructing scenes to identify not only the objects
in the scene but also how the objects relate to form the scene; scene
and sentence parsing.
• Architecture: a shared-weight matrix and a binary tree structure
that allow the recursive network to learn varying sequences of words
or parts of an image.
• Recursive Neural Networks use a variation of backpropagation called
backpropagation through structure (BPTS). The feed-forward pass
happens bottom-up, and backpropagation is top-down.
58. CHOOSING A DEEP NET FOR YOUR
RESEARCH
• To extract patterns from a set of unlabelled data, we use an RBM or an
autoencoder.
• For text processing, sentiment analysis, parsing, and named entity
recognition, we use an RNN or a Recursive Neural Tensor Network (RNTN).
• For any language model that operates at the character level, we use an RNN.
• For image recognition, we use a deep belief network (DBN) or a convolutional
neural network (CNN).
• For object recognition, we use an RNTN or a CNN.
• For speech recognition, we use an RNN.
• In general, DBNs and MLPs with ReLU are both good choices for classification.
• For time-series analysis, it is generally recommended to use an RNN.
59. WHEN TO USE DEEP LEARNING
• Simpler models (e.g., logistic regression) don’t achieve the accuracy
level your use case needs
• You have complex pattern matching in images, NLP, or audio to
deal with
• You have high-dimensionality data
• You have the dimension of time in your vectors (sequences)
60. WHEN TO STICK WITH TRADITIONAL MACHINE
LEARNING
• You have high-quality, low-dimensional data; for example,
columnar data from a database export
• You’re not trying to find complex patterns in image data
• You’ll achieve poor results from both methods when the data is
incomplete and/or of poor quality.
61. DEEP LEARNING FOR DETECTING CANCER
• https://youtu.be/9Mz84cwVmS0
• Examples of adaptive gradient-descent optimizers:
• Adaptive Gradient Algorithm (AdaGrad): maintains a per-parameter learning rate, which improves performance on problems with sparse gradients (e.g., natural language and computer vision problems).
• Root Mean Square Propagation (RMSProp): also maintains per-parameter learning rates, adapted based on the average of recent magnitudes of the gradients for each weight (i.e., how quickly it is changing). This means the algorithm does well on online and non-stationary (e.g., noisy) problems.
• Adam realizes the benefits of both AdaGrad and RMSProp. Instead of adapting the parameter learning rates based on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance).
• Specifically, the algorithm calculates an exponential moving average of the gradient and of the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages.
• The moving averages are initialized to zero, and with beta1 and beta2 values close to 1.0 (as recommended) this biases the moment estimates towards zero. The bias is overcome by first calculating the biased estimates and then computing bias-corrected estimates.
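A NumPy sketch of the Adam update described above, run on an illustrative quadratic cost. The values of beta1, beta2, and epsilon follow the commonly used defaults; the step size alpha and the cost function are assumptions for this toy run.

import numpy as np

def grad(theta):
    return 2 * (theta - np.array([3.0, -1.0]))    # gradient of an illustrative quadratic cost

theta = np.zeros(2)
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = np.zeros_like(theta)    # exponential moving average of the gradient (first moment)
v = np.zeros_like(theta)    # exponential moving average of the squared gradient (second moment)

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g         # decay rate controlled by beta1
    v = beta2 * v + (1 - beta2) * g * g     # decay rate controlled by beta2
    m_hat = m / (1 - beta1 ** t)            # bias-corrected first-moment estimate
    v_hat = v / (1 - beta2 ** t)            # bias-corrected second-moment estimate
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)                                # ends up close to the minimizer [3, -1]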