The NLP muppets revolution! @ Data Science London 2019
video: https://skillsmatter.com/skillscasts/13940-a-deep-dive-into-contextual-word-embeddings-and-understanding-what-nlp-models-learn
event: https://www.meetup.com/Data-Science-London/events/261483332/
3. Outline
1. NLP pre-2018 (15 mins)
2. The Revolution (30 mins)
3. Related research activities in FAIR London (15 mins)
4. Natural Language Processing
Goal: for computers to process or “understand” natural language in order to perform tasks that are useful. (Christopher Manning)
• Fully understanding and representing the meaning of language (or even defining it) is an AI-complete problem.
5. Discrete representation of word meaning
• WordNet: lexical database for the English language
• “one-hot” encoding of words
• Problem: no notion of relationships (e.g., similarity) between words
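To make that limitation concrete, here is a minimal sketch (the three-word vocabulary is an illustrative assumption) showing that one-hot vectors carry no similarity signal:

```python
import numpy as np

# Minimal sketch of "one-hot" word vectors; the tiny vocabulary is an illustrative assumption.
vocab = ["hotel", "motel", "cat"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words is orthogonal (dot product 0), so the encoding
# carries no notion of similarity: "hotel" is as far from "motel" as from "cat".
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
print(one_hot["hotel"] @ one_hot["cat"])    # 0.0
```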
6. Word meaning as a neural word vector
• Representing a word by means of its neighbors
• Unsupervised! Trained on large corpora.
[Figure: word2vec skip-gram training pipeline; source: https://becominghuman.ai/how-does-word2vecs-skip-gram-work-f92e0525def4]
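As a rough sketch of how "representing a word by means of its neighbors" turns raw text into training signal, the snippet below (sentence and window size are illustrative assumptions) enumerates the (center, context) pairs skip-gram trains on:

```python
# Enumerate skip-gram (center, context) training pairs from raw text.
# The sentence and window size (2) are illustrative assumptions.
sentence = "the cat sat on the mat".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# Each pair is one unsupervised example: predict the context word from the center word.
print(pairs[:6])  # e.g. [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ...]
```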
7. Word meaning as a neural word vector
Place words in high dimensional vector spaces
Source: https://www.tensorflow.org/tutorials/representation/word2vec
• Representing a word by means of its neighbors
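A small sketch of what "neighbors in a high-dimensional vector space" means in practice; the 3-dimensional vectors below are made-up stand-ins for learned 100-300 dimensional embeddings:

```python
import numpy as np

# Made-up dense vectors standing in for learned word embeddings (normally 100-300 dims).
vectors = {
    "cat":   np.array([0.8, 0.1, 0.3]),
    "dog":   np.array([0.7, 0.2, 0.4]),
    "table": np.array([0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))    # high: nearby in vector space
print(cosine(vectors["cat"], vectors["table"]))  # lower: unrelated words
```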
8. NLU Models, pre 2018
Architectures Galore!
• Design a custom model for each task
• Train on as much labeled data as you have
• Often use pretrained word embeddings, but not always
E.g. BiDAF (Seo et al, 2017) for Reading Comprehension: only the first layer is pre-trained
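A sketch of this pre-2018 pattern, where only the embedding layer is pre-trained and everything above it is a custom, task-specific architecture trained from scratch (the shapes and the random "pretrained" matrix are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for pretrained word vectors (e.g. loaded from GloVe); random here for illustration.
pretrained = torch.randn(10_000, 300)

class CustomTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Only this first layer is pre-trained...
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        # ...the rest is a task-specific architecture trained on labeled data only.
        self.encoder = nn.LSTM(300, 128, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 128, 2)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.classifier(states[:, -1])  # use the last state for a 2-way decision

model = CustomTaskModel()
logits = model(torch.randint(0, 10_000, (4, 12)))  # a batch of 4 sentences, 12 tokens each
print(logits.shape)  # torch.Size([4, 2])
```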
9. Limitations of word embeddings
• Word2vec, GloVe and related methods learn only shallow representations of language (analogous to edges in an image) and fail to capture its high-level structure
• A model that uses such word embeddings needs to learn complex language phenomena by itself:
• word-sense disambiguation
• compositionality
• long-term dependencies
• negation
• etc.
• As a result, a huge number of labeled examples is needed to achieve good performance
12. Language Modelling (LM)
Aim: to predict the next word given the previous words.
p(w_1, …, w_n) = ∏_{i=1..n} p(w_i | w_1, …, w_{i-1})
e.g.: p(The cat is on the table) =
p(The) x p(cat | The) x p(is | cat, The) x p(on | is, cat, The) ...
p(w_i | w_1, …, w_{i-1}) is often a (recurrent) neural network
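A toy worked example of the factorization above; the conditional probabilities are made up purely for illustration:

```python
import math

# Made-up conditional probabilities p(w_i | w_1, ..., w_{i-1}) for one sentence.
conditionals = {
    ("The",): 0.2,
    ("cat", "The"): 0.05,
    ("is", "The", "cat"): 0.3,
    ("on", "The", "cat", "is"): 0.25,
    ("the", "The", "cat", "is", "on"): 0.4,
    ("table", "The", "cat", "is", "on", "the"): 0.1,
}

def sentence_log_prob(words):
    # log p(w_1, ..., w_n) = sum_i log p(w_i | w_1, ..., w_{i-1})
    return sum(math.log(conditionals[(w, *words[:i])]) for i, w in enumerate(words))

print(sentence_log_prob(["The", "cat", "is", "on", "the", "table"]))
# In a modern LM, each conditional comes from a (recurrent or Transformer) network
# instead of a lookup table.
```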
13. What are LMs used for?
Until ~2018
• Used in text generation, e.g. machine translation or speech recognition
• A long-standing proof of concept for language understanding, largely ignored
As of this year
• Use for everything!
• Pretrain on lots of data
• Finetune for specific tasks
E.g. some of the embeddings in the NLP from scratch papers (Collobert et al., 2009-2011) were extracted from language models:
“Following our NLP from scratch philosophy, we now describe how to dramatically improve these embeddings using large unlabeled data sets.”
“Very long training times make such strategies necessary for the foreseeable future: if we had been given computers ten times faster, we probably would have found uses for data sets ten times bigger.”
14. Are LMs NLP-complete?
AI-complete?
• Good LMs definitely won’t solve vision or robots or grounded language understanding… :)
But, for NLP…
• How much signal is there in raw text alone?
• The signal is weak, and it is unclear how to best learn from it…
World knowledge:
The Mona Lisa is a portrait painting by [ Leonardo | Obama ]
Coreference resolution:
John robbed the bank. He was [ arrested | terrified | bored ]
Machine translation:
Belka is the Russian word for [ cat | squirrel ]
15. Language Model Pretraining *
• ELMo: contextual word embeddings [Best paper, NAACL 2018]
• GPT: no embeddings, finetune LM
• BERT: bidirectional LM finetuning [Best paper, NAACL 2019]
• GPT2: even larger scale, zero shot
*See also Dai & Le (2015), Peters et al (2017), Howard & Ruder (2018), and many others
16. Contextual word embeddings:
f(w_k | w_1, …, w_n) ∈ ℝ^N
f(play | Elmo and Cookie Monster play a game .)
≠
f(play | The Broadway play premiered yesterday .)
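A sketch of the inequality above, using BERT via HuggingFace Transformers as one convenient instantiation of f (the slide's own example model is ELMo; the checkpoint name is an assumption):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint is an assumption; any contextual encoder would illustrate the point.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed_word(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    return hidden[tokens.index(word)]  # contextual vector for that occurrence of `word`

a = embed_word("Elmo and Cookie Monster play a game.", "play")
b = embed_word("The Broadway play premiered yesterday.", "play")
print(torch.cosine_similarity(a, b, dim=0).item())  # < 1: two different vectors for "play"
```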
17. Contextual word embeddings
AI2 ELMo [Peters et al, 2018]
• Train two LMs: left-to-right and right-to-left
[Diagram: a forward and a backward LSTM language model over “The Broadway play premiered yesterday”]
19. Embeddings from Language Models
AI2 ELMo [Peters et al, 2018]
• Train two LMs: left-to-right and right-to-left
• Extract contextualized vectors from the networks
• Use instead of word embeddings (e.g. FastText)
• Still use custom task architectures…
[Diagram: the biLM layer outputs for “play” in “The Broadway play premiered yesterday” are combined into ELMo: f(play | context)]
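A sketch of the extraction step using the ElmoEmbedder from older allennlp releases (pre-1.0); the API details and output shape are stated from memory and should be treated as assumptions:

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder  # available in allennlp <= 0.9

elmo = ElmoEmbedder()  # downloads the pretrained biLM weights on first use
sentence = "The Broadway play premiered yesterday".split()

# One set of vectors per biLM layer: roughly (3 layers, 5 tokens, 1024 dims).
layers = np.asarray(elmo.embed_sentence(sentence))
print(layers.shape)

# ELMo proper uses a learned, task-specific weighted average of the layers;
# a plain mean is used here as a stand-in.
vectors = layers.mean(axis=0)
print(vectors[sentence.index("play")][:5])  # contextual vector for "play" (first 5 dims)
```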
22. AI2 ELMo
[Peters et al, 2018]
• Simple and general, building on recent ideas [text cat. (Dai & Le 2015); tagging (Peters et al, 2017)]
• Showed self-pre-training working for lots of problems!
• Still assumed custom task-specific architectures…
26. Pretrain, then fine-tune
1. Pretrain a Transformer LM on a huge corpus (billions of words): unsupervised; takes days/weeks on several GPUs / TPUs. Pre-trained models are publicly available on the web!
2. Fine-tune: add a simple component on top and fine-tune for a specific task on as much labeled data as you have: supervised, and fast!
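A minimal sketch of the two-step recipe with HuggingFace Transformers, assuming a generic pre-trained checkpoint and a toy two-example dataset (all names and hyper-parameters are illustrative, not those of any particular paper):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 1 (pretraining) has already been done for us: we just download the weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 2: fine-tune the whole network plus a small classification head on labeled data.
texts, labels = ["Kids love gelato!", "This queue is endless."], [1, 0]
batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the (tiny, illustrative) dataset
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(float(out.loss))
```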
27. Why embeddings?
OpenAI GPT [Radford et al, 2018]
• Train transformer language model
E.g., can we pre-train a common architecture?
[Diagram: stacked Transformer layers predicting the next word from the left context, e.g. “… The submarine” → “is”]
28. Why embeddings?
OpenAI GPT (cont.) [Radford et al, 2018]
• Encode multiple sentences at once
Textual Classification (e.g. sentiment)
<S> Kids love gelato! <E>
Textual Entailment
<S> Kids love gelato. <SEP> No one hates gelato. <E>
29. Why embeddings?
OpenAI GPT (cont.) [Radford et al, 2018]
• Add small layer on top
• Fine-tune for each end task
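A sketch of the "encode multiple sentences at once" idea: different end tasks are mapped to a single token sequence with delimiters, and only a small layer is added on top of the transformer's final state. The <S>/<SEP>/<E> strings mirror the slide and are placeholders for GPT's actual special tokens:

```python
# Placeholder delimiters mirroring the slide; GPT's real vocabulary defines its own symbols.
START, SEP, END = "<S>", "<SEP>", "<E>"

def format_classification(text: str) -> str:
    # Single-sentence tasks (e.g. sentiment): wrap the text in start/end tokens.
    return f"{START} {text} {END}"

def format_entailment(premise: str, hypothesis: str) -> str:
    # Sentence-pair tasks: join the two inputs with a delimiter; the transformer state
    # at the final token then feeds a small task-specific linear layer.
    return f"{START} {premise} {SEP} {hypothesis} {END}"

print(format_classification("Kids love gelato!"))
print(format_entailment("Kids love gelato.", "No one hates gelato."))
```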
31. Without ensemble for SNLI: ESIM+ELMo (88.7) vs GPT (89.9)
• Train on longer texts, with self attention
• Idea of a unified architecture is currently winning!
• Will scale up really nicely (e.g. GPT-2)
• Isn’t bidirectional…
33. Google’s BERT
• Train masked language model
• Otherwise, largely adopt OpenAI approach
New task: predict the missing/masked word(s) [Devlin et al, 2018]
Idea: jointly model left and right context
[Diagram: stacked Transformer layers predicting the masked word, e.g. “… The [MASK] is yellow …”]
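A sketch of the masked-LM objective in action via the HuggingFace fill-mask pipeline (the checkpoint name is an assumption; any BERT-style masked LM would do):

```python
from transformers import pipeline

# Predict the masked word from both left and right context.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The [MASK] is yellow."):
    print(pred["token_str"], round(pred["score"], 3))
```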
34. Google’s BERT (cont.)
• Generalize beyond text classification
[Devlin et al, 2018]
36. Google’s BERT (cont.)
E.g. for reading comprehension
[Figure: BERT fine-tuning vs. a task-specific architecture for reading comprehension]
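A sketch of the reading-comprehension case: after fine-tuning, the model predicts an answer span in the passage. The SQuAD-fine-tuned checkpoint below is an assumption used only for illustration:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="When did the play premiere?",
            context="The Broadway play premiered yesterday to strong reviews.")
print(result["answer"], round(result["score"], 3))  # expected answer span: "yesterday"
```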
37. Google’s BERT [Devlin et al, 2018]
• Bidirectional reasoning is important!
• Much better comparison, clearly best approach
• Arguably killing off “architecture hacking” research
• Currently the focus of intense study in the NLP community
39. What happens with more data?
OpenAI GPT-2 [Radford et al, 2019]
• Same model as GPT
• Still left-to-right
• Add more parameters and train on cleaner data!
• Adopt a new evaluation scheme… Also, don’t fine-tune for different end tasks…
[Diagram: a stacked Transformer language model generating a continuation, e.g. “… premiered yesterday” → “It was ??”]
41. Very good at generating text! Too good to release to the public…
42. Evaluation: zero shot
OpenAI GPT-2 [Radford et al, 2019]
• Adopt a new evaluation scheme: assume no labeled data…
Question Answering: Q: Who wrote the origin of species? A: ???
Summarization: [input document, e.g. news article]. TL;DR: ???
Machine Translation: [English sentence]_1 = [French sentence]_1, … [English sentence]_n = ???
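A sketch of this zero-shot evaluation idea using the publicly released (small) GPT-2 checkpoint via HuggingFace; the prompts follow the templates above, and the outputs are not expected to be as strong as the full model's:

```python
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

prompts = [
    "Q: Who wrote the origin of species? A:",                          # question answering
    "The play premiered yesterday to strong reviews. TL;DR:",          # summarization
    "english: cheese = french: fromage\nenglish: squirrel = french:",  # translation
]
for p in prompts:
    print(generate(p, max_new_tokens=10)[0]["generated_text"])
```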
44. Zero shot QA results… Model is learning lots of facts!
• Left-to-right again... but at least easy to sample from
• Zero shot results are still at baseline levels…
• LMs clearly improve with more parameters, more data
• Very nice job selling the work to the press
49. Methodology
• Single-token objects and answers
• 50k facts, unified vocabulary
• Language models rank every word in the vocabulary by its probability
Example: “The theory of relativity was developed by ___ .”
Predictions (log probability): Einstein -1.143, him -2.994, Newton -3.758, Aristotle -4.477, Maxwell -4.486
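A sketch of the ranking step: score candidate single-token answers for the cloze statement by their log probability under a masked LM (the checkpoint and the candidate list are assumptions; the actual setup probes several LMs over a unified vocabulary):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = f"The theory of relativity was developed by {tok.mask_token} ."
inputs = tok(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()

with torch.no_grad():
    log_probs = torch.log_softmax(mlm(**inputs).logits[0, mask_pos], dim=-1)

# Rank a few candidate fillers by log probability (less negative is better).
for word in ["einstein", "him", "newton", "aristotle", "maxwell"]:
    print(word, round(log_probs[tok.convert_tokens_to_ids(word)].item(), 3))
```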
53. Examples of Generations
BERT-large. The last column reports the top-5 tokens generated, together with the associated log probability (in square brackets).