Introduction to Transformers for NLP
where we are and how we got here
Olga Petrova
AI Product Manager
DataTalks.Club
Preliminaries
● Who I am:
○ Product Manager for AI PaaS at Scaleway
  (Scaleway: European cloud provider, originating from France)
○ Machine Learning engineer at Scaleway
○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University
www.olgapaints.net
Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
Representing words
I play with my dog.
● Machine learning: inputs are vectors / matrices

ID      Age  Gender  Weight  Diagnosis
264264  27   0       72      1
908696  61   1       80      0

● How do we convert “I play with my dog” into numerical values?
● Tokenize the sentence: I | play | with | my | dog | .
● One hot encoding: each token becomes a vector with a single 1 at that token’s index
  (vocabulary indices: 1 I, 2 play, 3 with, 4 my, 5 dog, 6 ., 7 cat; e.g. play = (0, 1, 0, 0, 0, 0, 0))
● Dimensionality = vocabulary size
● A second sentence, “I play with my cat.”, reuses the same vectors for known tokens; the new word “cat” gets index 7, so every one-hot vector grows to 7 dimensions
NLP 101 RNNs Attention Transformers BERT
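Below is a minimal Python sketch of the one-hot encoding described above; the vocabulary and its index order are illustrative, taken from this example.

```python
# Minimal sketch: one-hot encoding of "I play with my dog."
# The vocabulary and its ordering are illustrative, following the example above.
import numpy as np

vocab = ["I", "play", "with", "my", "dog", ".", "cat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))        # dimensionality = vocabulary size
    vec[index[word]] = 1.0            # single 1 at the token's index
    return vec

sentence = ["I", "play", "with", "my", "dog", "."]
encoded = np.stack([one_hot(w) for w in sentence])
print(encoded.shape)                  # (6, 7): six tokens, 7-word vocabulary
```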
Tokenization
I play / played / was playing with my dog.
Word = token: play, played, playing
Subword = token: play, #ed, #ing
E.g.: played = play + #ed
● In practice, subword tokenization is used most often
● For simplicity, we’ll go with word = token for the rest of the talk
NLP 101 RNNs Attention Transformers BERT
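For a concrete look at subword tokenization, here is a minimal sketch using the HuggingFace tokenizers; the bert-base-uncased checkpoint is just an example, and the exact splits depend on its learned vocabulary.

```python
# Minimal sketch of subword tokenization with a pre-trained WordPiece tokenizer.
# "bert-base-uncased" is an example checkpoint; the exact splits depend on its vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("I was playing with my dog."))
# Words missing from the vocabulary are split into pieces marked with "##" (WordPiece convention).
```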
Word embeddings
One hot encodings (N = vocabulary size): cat, dog, kitten, puppy are each N-dimensional vectors with a single 1
● dog : puppy = cat : kitten. How can we record this?
● Goal: a lower-dimensional representation of a word (M << N)
● Words are forced to “group together” in a meaningful way
What do you mean, “meaningful way”?
An illustrative M = 3 embedding, with one dimension for feline, one for canine, and one for adult (0 for baby):
cat = (1, 0, 1), dog = (0, 1, 1), kitten = (1, 0, 0), puppy = (0, 1, 0)
so that dog - puppy = cat - kitten
NLP 101 RNNs Attention Transformers BERT
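A quick check of the analogy with the illustrative 3-dimensional embeddings above (toy values for this example, not learned embeddings):

```python
# Toy 3-dim embeddings (feline, canine, adult) from the example above; not learned values.
import numpy as np

emb = {
    "cat":    np.array([1.0, 0.0, 1.0]),
    "dog":    np.array([0.0, 1.0, 1.0]),
    "kitten": np.array([1.0, 0.0, 0.0]),
    "puppy":  np.array([0.0, 1.0, 0.0]),
}

# dog : puppy = cat : kitten  <=>  dog - puppy == cat - kitten (the "adult" direction)
print(emb["dog"] - emb["puppy"])                                           # [0. 0. 1.]
print(np.allclose(emb["dog"] - emb["puppy"], emb["cat"] - emb["kitten"]))  # True
```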
Word embeddings
How do we arrive at this representation?
● Pass the one-hot vector through a network with a bottleneck layer
  = multiply the N-dim vector by an N x M matrix to get an M-dim vector
● Some information gets lost in the process
● Where does the embedding matrix come from? Through machine learning!
[Diagram: one-hot vector x embedding matrix W = embedded vector]
In practice:
● Embedding can be part of the model that you train for an ML task
● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc.)
NLP 101 RNNs Attention Transformers BERT
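To make the bottleneck concrete, a minimal numpy sketch (N, M and the matrix values are arbitrary placeholders): multiplying a one-hot vector by the N x M embedding matrix W simply selects one row of W.

```python
# Minimal sketch: embedding = one-hot vector times an N x M embedding matrix.
# N, M and the matrix values are arbitrary here; in practice W is learned.
import numpy as np

N, M = 7, 3                         # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(N, M))         # embedding matrix

one_hot = np.zeros(N)
one_hot[4] = 1.0                    # token with index 4 ("dog" in the earlier example)

embedded = one_hot @ W              # M-dimensional embedded vector
print(np.allclose(embedded, W[4]))  # True: equivalent to a simple row lookup
```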
Tokenization:
blog.floydhub.com/tokenization-nlp/
Word embeddings:
jalammar.github.io/illustrated-word2vec/
Further reading
NLP 101 RNNs Attention Transformers BERT
Sequence to Sequence models
● Not all seq2seq models are NLP
● Not all NLP models are seq2seq!
● But: they are common enough
● Language modeling, question answering, machine translation, etc.
[Diagram: Sequence 1 → ML model → Sequence 2. E.g. Machine Translation: “I am a student” → “je suis étudiant”]
Takeaways:
1. Multiple elements in a sequence
2. Order of the elements matters
NLP 101 RNNs Attention Transformers BERT
RNNs: Recurrent Neural Networks
[Slide shows the Wikipedia definition of recurrent neural networks, referenced again later]
What does this mean???
NLP 101 RNNs Attention Transformers BERT
RNNs: the seq2seq example
[Diagram, built up over several slides: a recurrent encoder-decoder network. Labels: Input sequence / Output sequence; Inputs (word embeddings); Outputs (one-hot); Context vector: contains the meaning of the whole input sequence; Temporal direction (remember the Wikipedia definition?)]
NLP 101 RNNs Attention Transformers BERT
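A minimal PyTorch sketch of such a recurrent encoder-decoder; the module structure and all dimensions are illustrative, not the exact model from the slides.

```python
# Minimal sketch of an RNN encoder-decoder (seq2seq), assuming PyTorch.
# Dimension names (vocab_size, emb_dim, hidden_dim) and values are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # word embeddings as inputs
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                              # src: (batch, src_len) token ids
        outputs, hidden = self.rnn(self.embed(src))
        return hidden                                    # final hidden state = context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)     # scores over the target vocabulary

    def forward(self, tgt, hidden):                      # tgt: (batch, tgt_len) token ids
        outputs, hidden = self.rnn(self.embed(tgt), hidden)  # conditioned on the context vector
        return self.out(outputs), hidden                 # softmax/argmax gives "one-hot" outputs

# Usage: encode the source sentence, then decode conditioned on the context vector.
enc, dec = Encoder(1000, 32, 64), Decoder(1000, 32, 64)
src = torch.randint(0, 1000, (1, 5))                     # e.g. "I am a student ."
context = enc(src)
logits, _ = dec(torch.randint(0, 1000, (1, 4)), context)
```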
Does this really work?
Yes! (For the most part) Google Translate has been using these since ~ 2016
What is the catch?
● Catch # 1: Recurrent = Non-parallelizable 🐌
● Catch # 2: Relying on a fixed-length context vector to capture the full input
sequence. What can go wrong?
“I am a student.” ✅
“It was a wrong number that started it, the telephone ringing three times in the dead
of night, and the voice on the other end asking for someone he was not.” 😱
NLP 101 RNNs Attention Transformers BERT
The Attention Mechanism
[Diagram: the recurrent NN from the previous section, now with Attention]
● Send all of the hidden states to the Decoder, not just the last one
● The Decoder determines which input elements get attended to via a softmax layer
NLP 101 RNNs Attention Transformers BERT
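A minimal numpy sketch of that attention step; dot-product scoring is assumed here for simplicity (other scoring functions are used in practice), and all sizes and values are arbitrary.

```python
# Minimal sketch of the attention step: score every encoder hidden state against the
# current decoder state, softmax the scores, and take a weighted sum of hidden states.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

src_len, hidden_dim = 3, 4
encoder_states = np.random.randn(src_len, hidden_dim)  # all hidden states, one per input element
decoder_state = np.random.randn(hidden_dim)            # current decoder hidden state

scores = encoder_states @ decoder_state                # one score per input element
weights = softmax(scores)                              # which input elements get attended to
context = weights @ encoder_states                     # attention-weighted context for this step
```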
Further reading
NLP 101 RNNs Attention Transformers BERT
jalammar.github.io/illustrated-transformer/
My blog post that this talk is based on + the linear algebra perspective on
Attention:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
Attention is All You Need © Google, 2017
● Keep the Attention mechanism, throw away the sequential RNN
● Reminder: the basic idea of Attention was to pass all of the hidden states as inputs
● Transformer:
○ Decoder and Encoder both have multiple layers
○ The output hidden states of the Encoder are inputs to all the layers of the Decoder
○ The intermediate hidden states of the Encoder are also inputs… to other sequence elements’
Encoder layers! Self-Attention
○ Also Self-Attention in the Decoder
NLP 101 RNNs Attention Transformers BERT
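A minimal numpy sketch of single-head scaled dot-product Self-Attention, following the formulation in the paper; the projection matrices are randomly initialized here, whereas in a real Transformer they are learned and combined with multiple heads and feed-forward layers.

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# W_q, W_k, W_v are illustrative (random here); in a Transformer they are learned.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 3, 8                      # three sequence elements (w1, w2, w3)
x = np.random.randn(seq_len, d_model)        # word embeddings + position encodings

W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)          # every element attends to every other element
weights = softmax(scores, axis=-1)           # each row sums to 1
output = weights @ V                         # contextualized representation of each element
```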
Attention all around
[Diagram: Self-Attention across the sequence elements; each input is a word embedding + position encoding]
● All of the sequence elements (w1, w2, w3) are being processed in parallel
● Each element’s encoding gets information (through hidden states) from other elements
● Bidirectional by design 😍
NLP 101 RNNs Attention Transformers BERT
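Because the elements are processed in parallel, word order has to be injected through the position encoding. A minimal sketch of the sinusoidal position encoding from the original paper (learned position embeddings are another common choice):

```python
# Minimal sketch of sinusoidal position encodings (Vaswani et al., 2017):
# PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
# PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
import numpy as np

def position_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the word embeddings before the first Encoder layer.
print(position_encoding(seq_len=3, d_model=8).shape)  # (3, 8)
```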
Why Self-Attention?
● “The animal did not cross the road
because it was too tired.”
vs.
● “The animal did not cross the road
because it was too wide.”
A uni-directional RNN would have missed this! Bi-directional is nice.
NLP 101 RNNs Attention Transformers BERT
(Contextual) word embeddings
● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words
● Meaning often depends on the context (surrounding words)
● Self-Attention in Transformers processes context when encoding a token
● Can we just take the Transformer Encoder and use it to embed words?
NLP 101 RNNs Attention Transformers BERT
(Contextual) word embeddings
BERT: Bidirectional Encoder Representations from Transformers
Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;)
NLP 101 RNNs Attention Transformers BERT
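A minimal sketch of doing exactly that with the HuggingFace Transformers library; the bert-base-uncased checkpoint and the example sentences are illustrative.

```python
# Minimal sketch: contextual word embeddings from pre-trained BERT via HuggingFace Transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I play with my dog.", "I play with my cat."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token, per sentence: (batch, sequence_length, hidden_size).
# Unlike Word2Vec, the same word gets a different vector in a different context.
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```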
What was BERT pre-trained on?
Task # 1: Masked Language Modeling
● Take a text corpus, randomly mask 15% of words
● Put a softmax layer (dim = vocab size) on top of BERT encodings for the words
that are present to predict which word is missing
Task # 2: Next Sentence Prediction
● Input sequence: two sentences separated by a special token
● Binary classification: are these sentences successive or not?
NLP 101 RNNs Attention Transformers BERT
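A minimal sketch of Task #1 in action, using the HuggingFace fill-mask pipeline with a pre-trained BERT checkpoint (the example sentence is illustrative):

```python
# Minimal sketch of masked language modeling with pre-trained BERT,
# using the HuggingFace fill-mask pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("I play with my [MASK]."):
    # Each prediction is a dict with the proposed token and its probability.
    print(prediction["token_str"], round(prediction["score"], 3))
```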
Further reading and coding
My blog posts:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
blog.scaleway.com/understanding-text-with-bert/
The HuggingFace 🤗 Transformers library: huggingface.co/
NLP 101 RNNs Attention Transformers BERT
