Sequence Modelling
with Deep Learning
ODSC London 2019 Tutorial
Natasha Latysheva
Overview
I. Introduction to sequence modelling
II. Quick neural network review
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Speaker Intro
• Welocalize
• We provide language services
• Fairly large: 8th largest globally by revenue, 4th largest in the US;
1,500+ employees
• Lots of localisation (translation)
• International marketing, site optimisation
• NLP engineering team
• 14 people remote across US, Ireland, UK,
Germany, China
• Various NLP things: machine translation,
text-to-speech, NER, sentiment, topics,
classification, etc.
I. Introduction to Sequence Modelling
Other sequence problems
Less conventional sequence data
• Activity on a website:
• [click_button, move_cursor, wait,
wait, click_subscribe, close_tab]
• Customer history:
• [inactive -> mildly_active ->
payment_made -> complaint_filed
-> inactive -> account_closed]
• Code (constrained language) is
sequential data – can learn the
structure
II. Quick Neural Network Review
Feed-forward networks
Simplifying the notation
• Single neurons
• Weight matrices, bias vectors
• Fully-connected layer
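For concreteness, a minimal numpy sketch (not from the slides) of what a single fully-connected layer computes with a weight matrix and bias vector; the sizes here are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))        # input vector with 4 features (assumed size)
W = rng.normal(size=(3, 4))      # weight matrix of a fully-connected layer with 3 neurons
b = np.zeros(3)                  # bias vector

h = np.maximum(0, W @ x + b)     # layer output: ReLU(Wx + b)
print(h.shape)                   # (3,)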
III. Recurrent Neural Networks
Why do we need fancy methods to
model sequences?
• Say we are training a translation
model, English->French
• “The cat is black” to “Le chat est noir”
• Could in theory use a feed-
forward network to translate
word-by-word
Why do we need fancy methods?
• A feed-forward network treats
time steps as completely
independent
• Even in this simple 1-to-1
correspondence example, things
are broken
• How you translate “black” depends
on noun gender (“noir” vs. “noire”)
• How you translate “The” also
depends on gender (“Le” vs. “La”)
• More generally, getting the
translation right requires context
Why do we need fancy methods?
• We need a way for the network
to remember information from
previous time steps
Recurrent neural networks
• Extremely popular way of modelling
sequential data
• Process data one time step at a
time, while updating a running
internal hidden state
Standard FF network to RNN
• At each time step, RNN
passes on its activations
from previous time step
• In theory all the way back
to the first time step
Standard FF network to RNN
*Activation function probably tanh or ReLU
Standard FF network to RNN
• So you can say this is a
form of memory
• Cell hidden state
transferred
• Basis for RNNs
remembering context
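A minimal numpy sketch of that recurrence, with illustrative sizes: at each time step the hidden state is computed from the current input and the previous hidden state, so information can flow forward through the sequence.

import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 8, 16                 # sequence length, input dim, hidden dim (assumed)
xs = rng.normal(size=(T, d_in))         # one input sequence

W_xh = rng.normal(size=(d_h, d_in)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1
b_h = np.zeros(d_h)

h = np.zeros(d_h)                       # initial hidden state
for x_t in xs:                          # process one time step at a time
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # running internal hidden state
print(h.shape)                          # (16,)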
Memory problems
• Basic RNNs not great at
long-term dependencies
but plenty of ways to
improve this
• Information gating
mechanisms
• Condensing input using
encoders
Gating mechanisms
• Gates regulate the flow of
information
• Very helpful - basic RNN cells not really
used anymore. Responsible for recent
RNN popularity.
• Add explicit mechanisms to remember
information and forget information
• Why use gates?
• Helps you learn long-term
dependencies
• Not all time points are equally relevant
– not everything has to be remembered
• Speeds up training/convergence
Gated recurrent
units (GRUs)
• GRUs were developed later
than LSTMs but are simpler
• Motivation is to get the main
benefits of LSTMs but with less
computation
• Reset gate: Mechanism to
decide when to remember vs.
forget/reset previous
information (hidden state)
• Update gate: Mechanism to
decide when to update
hidden state
GRU mechanics
• Reset gate controls how
much past info we use
• Rt = 0 means we are resetting
our RNN, not using any
previous information
• Rt = 1 means we use all of
previous information (back to
our normal vanilla RNN)
GRU mechanics
• Update gate controls whether
we bother updating our
hidden state using new
information
• Zt = 1 means you’re not
updating, you’re just using
previous hidden state
• Zt = 0 means you’re updating as
much as possible
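A minimal numpy sketch of one GRU step, following the gate conventions above (R_t = 0 resets the past, Z_t = 1 keeps the previous hidden state). The weight shapes and values are illustrative assumptions, not the tutorial's code.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev + b_r)             # reset gate
    z_t = sigmoid(W_xz @ x_t + W_hz @ h_prev + b_z)             # update gate
    h_cand = np.tanh(W_xh @ x_t + W_hh @ (r_t * h_prev) + b_h)  # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_cand                  # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = [rng.normal(size=s) * 0.1 for s in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), params)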
LSTM mechanics
• LSTMs add a memory unit to
further control the flow of
information through the cell
• Also whereas GRUs have 2
gates, an LSTM cell has 3
gates:
• An input gate – should I ignore
or consider the input?
• A forget gate – should I keep
or throw away the information
in memory?
• An output gate – how should I
use input, hidden state and
memory to output my next
hidden state?
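Correspondingly, a minimal numpy sketch of one LSTM step with its three gates and the extra memory cell; shapes and initial values are again illustrative.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c = params
    i_t = sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)      # input gate
    f_t = sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)      # forget gate
    o_t = sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)      # output gate
    c_cand = np.tanh(W_xc @ x_t + W_hc @ h_prev + b_c)   # candidate memory
    c_t = f_t * c_prev + i_t * c_cand                    # updated memory cell
    h_t = o_t * np.tanh(c_t)                             # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = [rng.normal(size=s) * 0.1 for s in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 4]
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)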
GRUs vs. LSTMs
• GRUs are simpler + train
faster
• LSTMs more popular – can
give slightly better
performance, but GRU
performance often on par
• LSTMs would in theory
outperform GRUs in tasks
requiring very long-range
modelling
IV. Game of Thrones Language Model
Notebook
• ~30 mins
• Jupyter
notebook on
building an RNN-
based language
model
• Python 3 + Keras
for neural
networks
tinyurl.com/wbay5o3
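The notebook itself is at the link above; as a rough idea of the kind of model it builds, here is a minimal Keras sketch (not the workshop code) of an RNN language model that predicts the next token from a window of previous tokens. The vocabulary size and window length are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, seq_len = 5000, 30                         # assumed sizes

model = keras.Sequential([
    layers.Input(shape=(seq_len,), dtype="int32"),     # token ids for the context window
    layers.Embedding(vocab_size, 128),                 # learn word embeddings
    layers.LSTM(256),                                  # summarise the context
    layers.Dense(vocab_size, activation="softmax"),    # distribution over the next token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()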
V. Components of SOTA RNN models
Encoder-Decoder architectures
• Problem: being forced to immediately output a French word for every
English word
Encoder-Decoder architectures
Encoder-Decoder architectures
• Tends to work a lot better than using a single sequence-to-sequence
RNN that produces an output for each input step
• You often need to
see the whole
sequence before
knowing what to
output
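A minimal Keras sketch of such an RNN encoder-decoder for translation, assuming token-id inputs and teacher forcing at training time; the vocabulary sizes and dimensions are made up for illustration.

from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab, d = 8000, 8000, 256              # assumed sizes

# Encoder: read the whole source sentence, keep only the final states.
enc_in = layers.Input(shape=(None,))
enc_emb = layers.Embedding(src_vocab, d)(enc_in)
_, state_h, state_c = layers.LSTM(d, return_state=True)(enc_emb)

# Decoder: generate the target sentence conditioned on the encoder states.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(tgt_vocab, d)(dec_in)
dec_out, _, _ = layers.LSTM(d, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_pred = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = keras.Model([enc_in, dec_in], dec_pred)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")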
Bidirectionality in RNN encoder-decoders
• For the encoder, bidirectional RNNs (BRNNs) are often used
• BRNNs read the
input sequences
forwards and
backwards
Bidirectional
RNNs
• Process input
sequences in both
directions
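In Keras this is typically a one-line change: wrapping the recurrent layer in Bidirectional. A minimal sketch, with assumed feature sizes:

from tensorflow import keras
from tensorflow.keras import layers

encoder = keras.Sequential([
    layers.Input(shape=(None, 64)),   # sequence of 64-d feature vectors (assumed)
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),
])
encoder.summary()   # each time step now has a 256-d (forward + backward) state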
The problem with RNN encoder-decoders
• Serious information
bottleneck
• Condense input
sequence down to a
small vector?!
• Memorise long
sequence + regurgitate
• Not how humans work
• Long computation
paths
Attention concept
• Has been very influential in
deep learning
• Originally developed for
MT (Bahdanau, 2014)
• As you’re producing your output sequence, maybe not every part of
your input is equally relevant
• Image captioning example
Lu et al. 2017. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
Attention intuition
• Attention allows the
network to refer back
to the input
sequence, instead of
forcing it to encode
all information into
one fixed-length
vector
• Encoder: a BRNN computes a rich set of features about the source
words and their surrounding words
• Decoder is asked to
choose which hidden
states to use and
ignore
• Weighted sum of
hidden states used to
predict the next word
Attention intuition
• Decoder RNN uses
attention parameters
to decide how much
to pay attention to
different parts of the
input
• Allows the model to
amplify the signal
from relevant parts of
the input sequence
• This improves
modelling
Attention intuition
Main benefits
• Encoder passes a lot
more data to
the decoder
• Not just last hidden
state
• Passes all hidden states
at every time step
• Computation path
problem: relevant
information is now
closer by
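A minimal numpy sketch of the idea: score every encoder hidden state against the current decoder state, softmax the scores into attention weights, and use the weighted sum (the context vector) for the next prediction. The dot-product scoring function here is one illustrative choice.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 6, 32
encoder_states = rng.normal(size=(T, d))   # one hidden state per source time step
decoder_state = rng.normal(size=(d,))      # current decoder hidden state

scores = encoder_states @ decoder_state    # how relevant is each source position?
weights = softmax(scores)                  # attention weights: positive, sum to 1
context = weights @ encoder_states         # weighted sum of encoder hidden states
print(weights.round(2), context.shape)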
Summary so far
• Sequence modelling
• Recurrent neural
networks
• Some key components
of SOTA RNN-based
models:
• Gating mechanisms
(GRUs and LSTMs)
• Encoder-decoders
• Bidirectional encoding
• Attention
VI. Transformers and self-attention
Transformers are taking over NLP
• Translation, language
models, question
answering, summarisation,
etc.
• Some of the best word
embeddings are based on
Transformers
• BERT, ELMo, OpenAI GPT-2
models
A single Transformer encoder block
• No recurrence, no convolutions
• “Attention is all you need” paper
• The core concept is the self-
attention mechanism
• Much more parallelisable than
RNN-based models, which
means faster training
Self-attention is a
sequence-to-sequence
operation
• At the highest level – self-
attention takes t input
vectors and outputs t
output vectors
• Take the input embedding for
“the” and update it by
incorporating information
from its context
How is the vector for “the” updated?
• Each output vector
is a weighted sum
of the input vectors
• But all of these
weights are
different
These are not learned weights in the
traditional neural network sense
• The weights are
calculated by taking
dot products
• Can use different
functions over input
Example calculation of a single weight
Example calculation of a single weight
Calculating a weight matrix row
Attention weight matrix
• The dot product can be
anything (negative infinity to
positive infinity)
• We normalise by length
• We softmax this so that the
weights are positive values
summing to 1
• Attention weight matrix
summarises relationship
between words
• Because dot products capture
similarity between vectors
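A minimal numpy sketch of these raw self-attention weights: pairwise dot products between the input vectors, scaled by length and softmaxed so each row is a set of positive weights summing to 1. Sizes are assumptions.

import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
t, d = 4, 8                                 # 4 tokens, 8-d embeddings (assumed sizes)
X = rng.normal(size=(t, d))                 # input vectors, one row per token

raw = X @ X.T / np.sqrt(d)                  # pairwise dot products, length-normalised
W = softmax(raw, axis=-1)                   # attention weight matrix (t x t)
Y = W @ X                                   # each output vector is a weighted sum of inputs
print(W.round(2))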
Multi-headed attention
• Attention weight matrix
captures relationship
between words
• But there are many
different ways words can
be related
• And which ones you want
to capture depends on
your task
• Different attention heads
learn different relations
between word pairs
Difference to RNNs
• Whereas RNNs update context token by token by updating an internal
hidden state, self-attention captures context by updating all word
representations simultaneously
• Lower computational complexity,
scales better with more data
• More parallelisable = faster
training
Connecting all
these concepts
• “Useful” input representations are
learned
• “Useful” weights for transforming
input vectors are learned
• These quantities should produce
“useful” dot products
• That lead to “useful” updated input
vectors
• That lead to “useful” input to the
feed-forward network layer
• … etc. … that eventually lead to
lower overall loss on the training set
Summary
I. Introduction to sequence modelling
II. Quick neural network review
• How a single neuron functions
• Feed-forward networks
III. Recurrent neural networks
• From feed-forward networks to recurrence
• RNNs with gating mechanisms
IV. Practical: Building a language model for Game of Thrones
V. Components of state-of-the-art RNN models
• Encoder-decoder models
• Bidirectionality
• Attention
VI. Transformers and self-attention
Further Reading
• More accessible: Andrew Ng
Sequence Course on Coursera
• https://www.coursera.org/learn/nlp-sequence-models
• More technical: Deep Learning book
by Goodfellow et al.
• https://www.deeplearningbook.org/contents/rnn.html
• Also: Alex Smola Berkeley Lectures
• https://www.youtube.com/user/smolix/videos
Just for fun
• Talk to transformer
• https://talktotransformer.com/
• Using OpenAI’s “too dangerous to release” GPT-2 language model
Thanks, questions?
Extra slides
Sequences in natural language
• Sequence modelling very popular in
NLP because language is sequential by
nature
• Text
• Sequences of words
• Sequences of characters
• We process text sequentially, though in
principle could see all words at once
• Speech
• Sequence of amplitudes over time
• Frequency spectrogram over time
• Extracted frequency features over time
Sequences in biology
• Genomics, DNA and
RNA sequences
• Proteomics, protein
sequences,
structural biology
• Trying to represent
sequences in some
way, or predict some
function or
association of the
sequence
Sequences in finance
• Lots of time series data
• Numerical sequences (stocks,
indices)
• Lots of forecasting work –
predicting the future (trading
strategies)
• Deep learning for these
sequences perhaps not as
popular as you might think
• Quite well-developed methods
based on classical statistics,
interpretability important
Single neuron computation
• What computation is
happening inside 1
neuron?
• If you understand how 1
neuron computes output
given input, it’s a small
step to understand how an
entire network computes
output given input
Perceptrons
• Modelling a binary outcome using
binary input features
• Should I have a cup of tea?
• 0 = no
• 1 = yes
• Three features with 1 weight each:
• Do they have Earl Grey?
• earl_grey, w₁ = 3
• Have I just had a cup of tea?
• already_had, w₂ = -1
• Can I get it to go?
• to_go, w₃ = 2
Perceptrons
• Here weights are
cherry-picked, but
perceptrons learn these
weights automatically
from training data by
shifting parameters to
minimise error
Perceptrons
• Formalising the perceptron
calculation
• Instead of a threshold, more
common to see a bias term
• Instead of writing out the
sums using sigma notation,
more common to see dot
products.
• Vectorisation for efficiency
• Here, I manually chose these
values – but given a dataset of
past inputs/outputs, you could
learn the optimal parameter
values
Perceptrons
• Formalising the
perceptron calculation
• Instead of a threshold,
more common to see a
bias term
• Instead of writing out
the sums using sigma
notation, more common
to see dot products.
• Vectorisation for
efficiency
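A tiny Python sketch of the tea perceptron written with a dot product and a bias term; the weights come from the slides, while the bias value is an assumption standing in for the threshold.

import numpy as np

w = np.array([3.0, -1.0, 2.0])      # earl_grey, already_had, to_go (from the slides)
b = -2.0                            # assumed bias (negative threshold)

def perceptron(x):
    return int(np.dot(w, x) + b > 0)   # 1 = have tea, 0 = don't

print(perceptron(np.array([1, 0, 0])))  # Earl Grey available, haven't had one -> 1
print(perceptron(np.array([0, 1, 0])))  # just had a cup, nothing else -> 0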
Sigmoid neurons
• Want to handle continuous
values
• Where input can be
something other than just 0 or
1
• Where output can be
something other than just 0 or
1
• We put the weighted sum of
inputs through an activation
function
• Sigmoid or logistic function
Sigmoid neurons
• The sigmoid function is
basically a smoothed out
perceptron!
• Output no longer a
sudden jump
• It’s the smoothness of the
function that we care
about
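The same tea example as a sigmoid neuron, a minimal sketch with the same assumed weights and bias: the hard 0/1 jump becomes a smooth value between 0 and 1.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([3.0, -1.0, 2.0])
b = -2.0
x = np.array([1.0, 0.0, 1.0])        # Earl Grey available and to go

print(sigmoid(np.dot(w, x) + b))     # ~0.95: a soft "yes" rather than a hard 1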
Activation functions
• Which activation function
to use?
• Heuristics based on
experiments, not proof-
based
More layers!
• Increase
number of
layers to
increase
capacity for
abstraction,
hierarchical
processing of
input
Training on big window sizes
• How big a window size can we train on? On a very long sequence, the
unrolled RNN becomes a very deep network
• Same problems with vanishing/exploding gradients as normal deep networks
• And it takes longer to train
• The normal tricks can help – good initialisation of parameters,
non-saturating activation functions, gradient clipping, batch norm
• Training over a limited number of steps – truncated
backpropagation through time
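As one concrete example of the tricks above, Keras optimizers accept a clipnorm argument for gradient clipping; the threshold of 1.0 here is an arbitrary illustrative choice.

from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)  # clip gradient norm
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")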
LSTM mechanics
• Input, forget, output gates are
little neural networks within the
cell
• Memory being updated via
forget gate and candidate
memory
• Hidden state being updated by
output gate, which weighs up all
information
Query, Key, and Value transformations
• Notice that we are using
each input vector on 3
separate occasions
• E.g. vector x2
1. To take dot products
with each other input
vector when calculating
y2
2. In the dot products taken when
the other output vectors (y1,
y3, y4) are calculated
3. And in the weighted
sum to produce output
vector y2
Query, Key, and Value transformations
• To model these 3
different functions for
each input vector, and
give the model extra
expressivity and
flexibility, we are going
to modify the input
vectors
• Apply simple linear
transformations
Input transformation
matrices
• These weight matrices
are learnable
parameters
• Gives something else
to learn by gradient
descent
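A minimal numpy sketch of the query/key/value idea: three learned linear maps give each input vector its three separate roles, and attention weights come from query-key dot products. Matrix sizes and initialisation are assumptions.

import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
t, d = 4, 8
X = rng.normal(size=(t, d))                 # input vectors, one row per token

W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))  # learnable in a real model
Q, K, V = X @ W_q, X @ W_k, X @ W_v         # queries, keys, values

A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # attention weights from query-key dot products
Y = A @ V                                   # outputs are weighted sums of the values
print(Y.shape)                              # (4, 8)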