5. Preliminaries
● Who I am:
○ Product Manager for AI PaaS at Scaleway (Scaleway: European cloud provider, originating from France)
○ Machine Learning engineer at Scaleway
○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University
www.olgapaints.net
6. Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
12. Representing words
I play with my dog.
● Machine learning: inputs are vectors / matrices
  ID      Age  Gender  Weight  Diagnosis
  264264  27   0       72      1
  908696  61   1       80      0
● How do we convert “I play with my dog” into numerical values?
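For comparison, the patient table above is already numeric: each row is a feature vector. A minimal numpy sketch (toy values from the slide, column meanings assumed):

    import numpy as np

    # Each patient is already a vector of numbers: [age, gender, weight, diagnosis]
    patients = np.array([
        [27, 0, 72, 1],
        [61, 1, 80, 0],
    ])
    print(patients.shape)  # (2, 4) -- ready for a model; but what is the vector for "dog"?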
14. Representing words
I play with my dog.
● One-hot encoding
● Dimensionality = vocabulary size
Vocabulary: 1 = I, 2 = play, 3 = with, 4 = my, 5 = dog, 6 = .
I    = [1 0 0 0 0 0]
play = [0 1 0 0 0 0]
with = [0 0 1 0 0 0]
my   = [0 0 0 1 0 0]
dog  = [0 0 0 0 1 0]
.    = [0 0 0 0 0 1]
15. Representing words
I play with my dog.
I play with my cat.
● One-hot encoding
● Dimensionality = vocabulary size
[Figure: the same six one-hot vectors as on the previous slide; “cat” has no vector yet]
16. Representing words
I play with my dog.
I play with my cat.
● One-hot encoding
● Dimensionality = vocabulary size
Vocabulary: 1 = I, 2 = play, 3 = with, 4 = my, 5 = dog, 6 = ., 7 = cat
I    = [1 0 0 0 0 0 0]
play = [0 1 0 0 0 0 0]
with = [0 0 1 0 0 0 0]
my   = [0 0 0 1 0 0 0]
dog  = [0 0 0 0 1 0 0]
.    = [0 0 0 0 0 1 0]
cat  = [0 0 0 0 0 0 1]
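A minimal sketch of one-hot encoding in Python (toy vocabulary from the slides; the helper names are made up for illustration):

    # One-hot encoding: one dimension per vocabulary entry
    vocab = ["I", "play", "with", "my", "dog", ".", "cat"]          # N = 7
    word_to_index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        vec = [0] * len(vocab)            # N-dimensional vector of zeros
        vec[word_to_index[word]] = 1      # a single 1 at the word's index
        return vec

    print(one_hot("dog"))   # [0, 0, 0, 0, 1, 0, 0]
    print(one_hot("cat"))   # [0, 0, 0, 0, 0, 0, 1] -- every new word adds a dimension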
17. Tokenization
I play / played / was playing with my dog.
● Word = token: play, played, playing
● Subword = token: play, #ed, #ing
  E.g.: played = play + #ed
● In practice, subword tokenization is used most often
● For simplicity, we’ll go with word = token for the rest of the talk
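For the curious, here is roughly what subword tokenization looks like in code, using the HuggingFace WordPiece tokenizer that BERT-style models ship with (the exact splits depend on the model’s learned vocabulary, so the comments are only indicative):

    from transformers import AutoTokenizer

    # WordPiece tokenizer of a BERT-style model (subword tokenization)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    print("I played with my dog.".split())            # naive word-level tokens
    print(tokenizer.tokenize("I played with my dog."))
    # Rare or inflected words get split into pieces, roughly "play" + "##ed"
    # (BERT marks continuation pieces with "##"; the slide writes them as "#ed");
    # frequent words stay whole.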
23. Word embeddings
One-hot encodings (N = vocabulary size): cat = [… 0 1 0 …], dog = [… 0 1 0 …], kitten = [… 0 1 0 …], puppy = [… 0 1 0 …]
● dog : puppy = cat : kitten. How can we record this?
● Goal: a lower-dimensional representation of a word (M << N)
● Words are forced to “group together” in a meaningful way
What do you mean, “meaningful way”? With M = 3:
cat    = [1 0 1]
dog    = [0 1 1]
kitten = [1 0 0]
puppy  = [0 1 0]
(dimension 1: 1 for feline; dimension 2: 1 for canine; dimension 3: 1 for adult / 0 for baby)
dog - puppy = cat - kitten
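The analogy can be checked with plain vector arithmetic on the toy M = 3 embeddings above (numpy sketch):

    import numpy as np

    # Toy M = 3 embeddings from the slide: [feline, canine, adult]
    cat    = np.array([1, 0, 1])
    dog    = np.array([0, 1, 1])
    kitten = np.array([1, 0, 0])
    puppy  = np.array([0, 1, 0])

    # Both differences isolate the "adult" dimension:
    print(dog - puppy)    # [0 0 1]
    print(cat - kitten)   # [0 0 1]  ->  dog - puppy = cat - kitten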
28. How do we arrive at this representation?
● Pass the one-hot vector through a network with a bottleneck layer
  = multiply the N-dim vector by an N×M matrix to get an M-dim vector
● Some information gets lost in the process
● Embedding matrix: learned through machine learning!
[Figure: one-hot vector × embedding matrix W → embedded vector]
In practice:
● Embedding can be part of your model that you train for an ML task
● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc.)
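A sketch of why “multiply the one-hot vector by the embedding matrix” is just a row lookup (numpy; the matrix here is random, whereas in practice it is learned, e.g. as a torch.nn.Embedding layer, or loaded from pre-trained Word2Vec/GloVe vectors):

    import numpy as np

    N, M = 7, 3                         # vocabulary size, embedding dimension
    W = np.random.rand(N, M)            # embedding matrix (learned in practice)

    one_hot_dog = np.zeros(N)
    one_hot_dog[4] = 1                  # index of "dog" in our toy vocabulary

    embedded = one_hot_dog @ W          # N-dim one-hot times N x M matrix -> M-dim vector
    assert np.allclose(embedded, W[4])  # ...which is simply row 4 of W
    print(embedded)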
30. Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
34. Sequence to Sequence models
● Not all seq2seq models are NLP
● Not all NLP models are seq2seq!
● But: they are common enough
● Language modeling, question answering, machine translation, etc.
[Figure: Sequence 1 → ML model → Sequence 2; e.g. Machine Translation: “I am a student” → “je suis étudiant”]
Takeaways:
1. Multiple elements in a sequence
2. Order of the elements matters
57. RNNs: the seq2seq example
[Figure, built up step by step across these slides: an encoder-decoder RNN unrolled over time, reading the input sequence and then generating the output sequence]
● Inputs: word embeddings
● Outputs: one-hot vectors
● Context vector: contains the meaning of the whole input sequence
● Temporal direction (remember the Wikipedia definition?)
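To make the picture concrete, here is a minimal encoder-decoder RNN sketch in PyTorch (dimensions are arbitrary and the GRU cell is just one possible choice; this illustrates the idea, not the exact model in the figure):

    import torch
    import torch.nn as nn

    emb_dim, hidden_dim, vocab_size = 32, 64, 1000    # arbitrary toy sizes

    encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
    decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
    to_vocab = nn.Linear(hidden_dim, vocab_size)      # scores over the output vocabulary

    src = torch.randn(1, 5, emb_dim)   # embeddings of 5 input words ("I am a student .")
    tgt = torch.randn(1, 4, emb_dim)   # embeddings fed to the decoder

    # The encoder reads the input one step at a time; its final hidden state
    # is the fixed-length context vector.
    _, context = encoder(src)

    # The decoder starts from the context vector and produces the output sequence;
    # a softmax over to_vocab(...) gives the one-hot-style word predictions.
    dec_states, _ = decoder(tgt, context)
    logits = to_vocab(dec_states)      # shape: (1, 4, vocab_size)
    print(logits.shape)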
66. Does this really work?
Yes! (For the most part.) Google Translate has been using these since ~2016.
What is the catch?
● Catch #1: Recurrent = non-parallelizable 🐌
● Catch #2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong?
“I am a student.” ✅
“It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” 😱
67. Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
72. The Attention Mechanism
● Recurrent NN from the previous section
● Send all of the hidden states to the Decoder, not just the last one
● The Decoder determines which input elements get attended to via a softmax layer
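A sketch of that weighting in code, using dot-product scoring between the decoder state and each encoder hidden state (the slides only say the weights come from a softmax; the particular scoring function here is an assumption, as several variants exist):

    import torch

    hidden_dim = 64
    encoder_states = torch.randn(5, hidden_dim)   # one hidden state per input word
    decoder_state  = torch.randn(hidden_dim)      # current decoder hidden state

    scores  = encoder_states @ decoder_state      # one score per input position
    weights = torch.softmax(scores, dim=0)        # attention weights, sum to 1

    # The decoder now sees a weighted mix of ALL encoder hidden states,
    # not just the last one.
    context = weights @ encoder_states            # (hidden_dim,)
    print(weights, context.shape)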
73. Further reading
jalammar.github.io/illustrated-transformer/
My blog post that this talk is based on + the linear algebra perspective on Attention:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
74. Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
83. Attention all around
● Word embedding + position encoding
● All of the sequence elements (w1, w2, w3) are being processed in parallel
● Each element’s encoding gets information (through hidden states) from other elements
● Bidirectional by design 😍
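The linear-algebra view of self-attention over the three elements w1, w2, w3 can be sketched as scaled dot-product attention, as in “Attention Is All You Need” (the projection sizes below are arbitrary and the matrices are random stand-ins for learned weights):

    import torch
    import torch.nn.functional as F

    d_model, d_k = 16, 8
    x = torch.randn(3, d_model)          # embeddings (+ position encodings) of w1, w2, w3

    W_q = torch.randn(d_model, d_k)      # learned projections (random placeholders here)
    W_k = torch.randn(d_model, d_k)
    W_v = torch.randn(d_model, d_k)

    Q, K, V = x @ W_q, x @ W_k, x @ W_v  # queries, keys, values: one row per element

    # Every element attends to every element -- in parallel, and in both directions.
    attn = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # (3, 3) attention weights
    z = attn @ V                                      # context-aware encodings
    print(attn)
    print(z.shape)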
85. Why Self-Attention?
● “The animal did not cross the road because it was too tired.”
vs.
● “The animal did not cross the road because it was too wide.”
A uni-directional RNN would have missed this! Bi-directional is nice.
87. Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models
🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
92. (Contextual) word embeddings
● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words
● Meaning often depends on the context (surrounding words)
● Self-Attention in Transformers processes context when encoding a token
● Can we just take the Transformer Encoder and use it to embed words?
94. (Contextual) word embeddings
BERT: Bidirectional Encoder Representations from Transformers
Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;)
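In practice this can be as simple as pulling contextual embeddings from a pre-trained checkpoint with the HuggingFace Transformers library (a minimal sketch; weights are downloaded on first use):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("I play with my dog.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One contextual embedding per (sub)word token, 768-dimensional for BERT-base
    print(outputs.last_hidden_state.shape)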
96. What was BERT pre-trained on?
Task #1: Masked Language Modeling
● Take a text corpus, randomly mask 15% of words
● Put a softmax layer (dim = vocab size) on top of the BERT encodings for the words that are present, to predict which word is missing
Task #2: Next Sentence Prediction
● Input sequence: two sentences separated by a special token
● Binary classification: are these sentences successive or not?
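Because of the masked-language-modeling objective, a pre-trained BERT can fill in masked words out of the box; a quick sketch with the HuggingFace pipeline API:

    from transformers import pipeline

    # Masked-language-model head on top of pre-trained BERT
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    for candidate in unmasker("I play with my [MASK]."):
        print(candidate["token_str"], candidate["score"])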
97. Further reading and coding
My blog posts:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
blog.scaleway.com/understanding-text-with-bert/
The HuggingFace 🤗 Transformers library: huggingface.co/