The NLP muppets revolution! @ Data Science London 2019
video: https://skillsmatter.com/skillscasts/13940-a-deep-dive-into-contextual-word-embeddings-and-understanding-what-nlp-models-learn
event: https://www.meetup.com/Data-Science-London/events/261483332/
3. Outline
1. NLP pre-2018 (15 mins)
2. The Revolution (30 mins)
3. Related research activities in FAIR London (15 mins)
4. Natural Language Processing
Goal: for computers to process or “understand” natural language in order to perform tasks that are useful. (Christopher Manning)
• Fully understanding and representing the meaning of language (or even defining it) is an AI-complete problem.
5. Discrete representation of word meaning
• WordNet: lexical database for the English language
• “one-hot” encoding of words
• Problem: no notion of relationships (e.g., similarity) between words
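To make that limitation concrete, here is a minimal sketch (the three-word vocabulary is an illustrative assumption) showing that one-hot vectors carry no similarity signal:

```python
import numpy as np

# Minimal sketch of "one-hot" word vectors; the tiny vocabulary is an illustrative assumption.
vocab = ["hotel", "motel", "cat"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every pair of distinct words is orthogonal (dot product 0), so the encoding
# carries no notion of similarity: "hotel" is as far from "motel" as from "cat".
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
print(one_hot["hotel"] @ one_hot["cat"])    # 0.0
```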
6. Word meaning as a neural word vector
• Representing a word by means of its neighbors
• Unsupervised! Trained on large corpora.
[Figure: word2vec skip-gram training pipeline; source: https://becominghuman.ai/how-does-word2vecs-skip-gram-work-f92e0525def4]
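As a rough sketch of how "representing a word by means of its neighbors" turns raw text into training signal, the snippet below (sentence and window size are illustrative assumptions) enumerates the (center, context) pairs skip-gram trains on:

```python
# Enumerate skip-gram (center, context) training pairs from raw text.
# The sentence and window size (2) are illustrative assumptions.
sentence = "the cat sat on the mat".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# Each pair is one unsupervised example: predict the context word from the center word.
print(pairs[:6])  # e.g. [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ...]
```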
7. Word meaning as a neural word vector
Place words in high dimensional vector spaces
Source: https://www.tensorflow.org/tutorials/representation/word2vec
• Representing a word by means of its neighbors
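A small sketch of what "neighbors in a high-dimensional vector space" means in practice; the 3-dimensional vectors below are made-up stand-ins for learned 100-300 dimensional embeddings:

```python
import numpy as np

# Made-up dense vectors standing in for learned word embeddings (normally 100-300 dims).
vectors = {
    "cat":   np.array([0.8, 0.1, 0.3]),
    "dog":   np.array([0.7, 0.2, 0.4]),
    "table": np.array([0.1, 0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))    # high: nearby in vector space
print(cosine(vectors["cat"], vectors["table"]))  # lower: unrelated words
```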
8. NLU Models, pre 2018
Architectures Galore!
• Design a custom model for each task
• Train on as much labeled data as you have
• Often use pretrained word embeddings, but not always
E.g. BiDAF (Seo et al, 2017) for Reading Comprehension: only the first layer is pre-trained
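A sketch of this pre-2018 pattern, where only the embedding layer is pre-trained and everything above it is a custom, task-specific architecture trained from scratch (the shapes and the random "pretrained" matrix are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Stand-in for pretrained word vectors (e.g. loaded from GloVe); random here for illustration.
pretrained = torch.randn(10_000, 300)

class CustomTaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Only this first layer is pre-trained...
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=False)
        # ...the rest is a task-specific architecture trained on labeled data only.
        self.encoder = nn.LSTM(300, 128, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 128, 2)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.classifier(states[:, -1])  # use the last state for a 2-way decision

model = CustomTaskModel()
logits = model(torch.randint(0, 10_000, (4, 12)))  # a batch of 4 sentences, 12 tokens each
print(logits.shape)  # torch.Size([4, 2])
```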
9. Limitations of word embeddings
• Word2vec, GloVe and related methods learn only shallow representations of language (analogous to edges in an image) and fail to capture its high-level structure
• A model that uses such word embeddings needs to learn complex language phenomena by itself:
• word-sense disambiguation
• compositionality
• long-term dependencies
• negation
• etc.
• As a result, a huge number of labeled examples is needed to achieve good performance
12. Language Modelling (LM)
Aim: to predict the next word given the previous words.
p(w_1, …, w_n) = ∏_{i=1..n} p(w_i | w_1, …, w_{i-1})
e.g.: p(The cat is on the table) =
p(The) x p(cat | The) x p(is | cat, The) x p(on | is, cat, The) ...
p(w_i | w_1, …, w_{i-1}) is often a (recurrent) neural network
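A toy worked example of the factorization above; the conditional probabilities are made up purely for illustration:

```python
import math

# Made-up conditional probabilities p(w_i | w_1, ..., w_{i-1}) for one sentence.
conditionals = {
    ("The",): 0.2,
    ("cat", "The"): 0.05,
    ("is", "The", "cat"): 0.3,
    ("on", "The", "cat", "is"): 0.25,
    ("the", "The", "cat", "is", "on"): 0.4,
    ("table", "The", "cat", "is", "on", "the"): 0.1,
}

def sentence_log_prob(words):
    # log p(w_1, ..., w_n) = sum_i log p(w_i | w_1, ..., w_{i-1})
    return sum(math.log(conditionals[(w, *words[:i])]) for i, w in enumerate(words))

print(sentence_log_prob(["The", "cat", "is", "on", "the", "table"]))
# In a modern LM, each conditional comes from a (recurrent or Transformer) network
# instead of a lookup table.
```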
13. What are LMs used for?
Until ~2018
• Used in text generation, e.g. machine translation or speech recognition
• A long-standing proof of concept for language understanding, largely ignored
As of this year
• Use for everything!
• Pretrain on lots of data
• Finetune for specific tasks
E.g. some of the embeddings in the NLP from scratch papers (Collobert et al., 2009-2011) were extracted from language models:
“Following our NLP from scratch philosophy, we now describe how to dramatically improve these embeddings using large unlabeled data sets.”
“Very long training times make such strategies necessary for the foreseeable future: if we had been given computers ten times faster, we probably would have found uses for data sets ten times bigger.”
14. Are LMs NLP-complete?
AI-complete?
• Good LMs definitely won’t solve vision or robots or grounded language understanding… :)
But, for NLP…
• How much signal is there in raw text alone?
• The signal is weak, and it is unclear how to best learn from it…
World knowledge:
The Mona Lisa is a portrait painting by [ Leonardo | Obama ]
Coreference resolution:
John robbed the bank. He was [ arrested | terrified | bored ]
Machine translation:
Belka is the Russian word for [ cat | squirrel ]
15. Language Model Pretraining *
• ELMo: contextual word embeddings [Best paper, NAACL 2018]
• GPT: no embeddings, finetune LM
• BERT: bidirectional LM finetuning [Best paper, NAACL 2019]
• GPT2: even larger scale, zero shot
*See also Dai & Le (2015), Peters et al (2017), Howard & Ruder (2018), and many others
16. Contextual word embeddings:
f(w_k | w_1, …, w_n) ∈ ℝ^N
f(play | Elmo and Cookie Monster play a game .)
≠
f(play | The Broadway play premiered yesterday .)
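A sketch of the inequality above, using BERT via HuggingFace Transformers as one convenient instantiation of f (the slide's own example model is ELMo; the checkpoint name is an assumption):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint is an assumption; any contextual encoder would illustrate the point.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed_word(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    return hidden[tokens.index(word)]  # contextual vector for that occurrence of `word`

a = embed_word("Elmo and Cookie Monster play a game.", "play")
b = embed_word("The Broadway play premiered yesterday.", "play")
print(torch.cosine_similarity(a, b, dim=0).item())  # < 1: two different vectors for "play"
```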
17. Contextual word embeddings
AI2 ELMo [Peters et al, 2018]
• Train two LMs: left-to-right and right-to-left
[Diagram: a forward and a backward LSTM language model over “The Broadway play premiered yesterday”]
19. Embeddings from Language Models
AI2 ELMo [Peters et al, 2018]
• Train two LMs: left-to-right and right-to-left
• Extract contextualized vectors from the networks
• Use instead of word embeddings (e.g. FastText)
• Still use custom task architectures…
[Diagram: the biLM layer outputs for “play” in “The Broadway play premiered yesterday” are combined into ELMo: f(play | context)]
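A sketch of the extraction step using the ElmoEmbedder from older allennlp releases (pre-1.0); the API details and output shape are stated from memory and should be treated as assumptions:

```python
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder  # available in allennlp <= 0.9

elmo = ElmoEmbedder()  # downloads the pretrained biLM weights on first use
sentence = "The Broadway play premiered yesterday".split()

# One set of vectors per biLM layer: roughly (3 layers, 5 tokens, 1024 dims).
layers = np.asarray(elmo.embed_sentence(sentence))
print(layers.shape)

# ELMo proper uses a learned, task-specific weighted average of the layers;
# a plain mean is used here as a stand-in.
vectors = layers.mean(axis=0)
print(vectors[sentence.index("play")][:5])  # contextual vector for "play" (first 5 dims)
```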
22. AI2 ELMo
[Peters et al, 2018]
• Simple and general, building on recent ideas [text cat. (Dai & Le 2015); tagging (Peters et al, 2017)]
• Showed self-pre-training working for lots of problems!
• Still assumed custom task-specific architectures…
26. Pretrain, then fine-tune
1. Pretrain a Transformer LM on a huge corpus (billions of words): unsupervised; takes days/weeks on several GPUs / TPUs. Pre-trained models are publicly available on the web!
2. Fine-tune: add a simple component on top and fine-tune for a specific task on as much labeled data as you have: supervised, and fast!
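A minimal sketch of the two-step recipe with HuggingFace Transformers, assuming a generic pre-trained checkpoint and a toy two-example dataset (all names and hyper-parameters are illustrative, not those of any particular paper):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Step 1 (pretraining) has already been done for us: we just download the weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 2: fine-tune the whole network plus a small classification head on labeled data.
texts, labels = ["Kids love gelato!", "This queue is endless."], [1, 0]
batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the (tiny, illustrative) dataset
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(float(out.loss))
```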
27. Why embeddings?
OpenAI GPT [Radford et al, 2018]
• Train transformer language model
E.g., can we pre-train a common architecture?
[Diagram: stacked Transformer layers predicting the next word from the left context, e.g. “… The submarine” → “is”]
28. Why embeddings?
OpenAI GPT (cont.) [Radford et al, 2018]
• Encode multiple sentences at once
Textual Classification (e.g. sentiment)
<S> Kids love gelato! <E>
Textual Entailment
<S> Kids love gelato. <SEP> No one hates gelato. <E>
29. Why embeddings?
OpenAI GPT (cont.) [Radford et al, 2018]
• Add small layer on top
• Fine-tune for each end task
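A sketch of the "encode multiple sentences at once" idea: different end tasks are mapped to a single token sequence with delimiters, and only a small layer is added on top of the transformer's final state. The <S>/<SEP>/<E> strings mirror the slide and are placeholders for GPT's actual special tokens:

```python
# Placeholder delimiters mirroring the slide; GPT's real vocabulary defines its own symbols.
START, SEP, END = "<S>", "<SEP>", "<E>"

def format_classification(text: str) -> str:
    # Single-sentence tasks (e.g. sentiment): wrap the text in start/end tokens.
    return f"{START} {text} {END}"

def format_entailment(premise: str, hypothesis: str) -> str:
    # Sentence-pair tasks: join the two inputs with a delimiter; the transformer state
    # at the final token then feeds a small task-specific linear layer.
    return f"{START} {premise} {SEP} {hypothesis} {END}"

print(format_classification("Kids love gelato!"))
print(format_entailment("Kids love gelato.", "No one hates gelato."))
```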
31. Without ensemble for SNLI: ESIM+ELMo (88.7) vs GPT (89.9)
• Train on longer texts, with self attention
• Idea of a unified architecture is currently winning!
• Will scale up really nicely (e.g. GPT-2)
• Isn’t bidirectional…
33. Google’s BERT
• Train masked language model
• Otherwise, largely adopt OpenAI approach
New task: predict the missing/masked word(s) [Devlin et al, 2018]
Idea: jointly model left and right context
[Diagram: stacked Transformer layers predicting the masked word, e.g. “… The [MASK] is yellow …”]
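A sketch of the masked-LM objective in action via the HuggingFace fill-mask pipeline (the checkpoint name is an assumption; any BERT-style masked LM would do):

```python
from transformers import pipeline

# Predict the masked word from both left and right context.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The [MASK] is yellow."):
    print(pred["token_str"], round(pred["score"], 3))
```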
34. Google’s BERT (cont.)
• Generalize beyond text classification
[Devlin et al, 2018]
36. Google’s BERT (cont.)
E.g. for reading comprehension
[Figure: BERT fine-tuning vs. a task-specific architecture for reading comprehension]
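A sketch of the reading-comprehension case: after fine-tuning, the model predicts an answer span in the passage. The SQuAD-fine-tuned checkpoint below is an assumption used only for illustration:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="When did the play premiere?",
            context="The Broadway play premiered yesterday to strong reviews.")
print(result["answer"], round(result["score"], 3))  # expected answer span: "yesterday"
```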
37. Google’s BERT [Devlin et al, 2018]
• Bidirectional reasoning is important!
• Much better comparison, clearly best approach
• Arguably killing off “architecture hacking” research
• Currently the focus of intense study in the NLP community
39. What happens with more data?
OpenAI GPT-2 [Radford et al, 2019]
• Same model as GPT
• Still left-to-right
• Add more parameters and train on cleaner data!
• Adopt a new evaluation scheme… Also, don’t fine-tune for different end tasks…
[Diagram: a stacked Transformer language model generating a continuation, e.g. “… premiered yesterday” → “It was ??”]
41. Very good at generating text! Too good to release to the public…
42. Evaluation: zero shot
OpenAI GPT-2 [Radford et al, 2019]
• Adopt a new evaluation scheme: assume no labeled data…
Question Answering: Q: Who wrote the origin of species? A: ???
Summarization: [input document, e.g. news article]. TL;DR: ???
Machine Translation: [English sentence]_1 = [French sentence]_1, … [English sentence]_n = ???
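A sketch of this zero-shot evaluation idea using the publicly released (small) GPT-2 checkpoint via HuggingFace; the prompts follow the templates above, and the outputs are not expected to be as strong as the full model's:

```python
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

prompts = [
    "Q: Who wrote the origin of species? A:",                          # question answering
    "The play premiered yesterday to strong reviews. TL;DR:",          # summarization
    "english: cheese = french: fromage\nenglish: squirrel = french:",  # translation
]
for p in prompts:
    print(generate(p, max_new_tokens=10)[0]["generated_text"])
```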
44. Zero shot QA results… Model is learning lots of facts!
• Left-to-right again... but at least easy to sample from
• Zero shot results are still at baseline levels…
• LMs clearly improve with more parameters, more data
• Very nice job selling the work to the press
49. Methodology
• Single-token objects and answers
• 50k facts, unified vocabulary
• Language models rank every word in the vocabulary by its probability
Example: “The theory of relativity was developed by ___ .”
Predictions (log probability): Einstein -1.143, him -2.994, Newton -3.758, Aristotle -4.477, Maxwell -4.486
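A sketch of the ranking step: score candidate single-token answers for the cloze statement by their log probability under a masked LM (the checkpoint and the candidate list are assumptions; the actual setup probes several LMs over a unified vocabulary):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = f"The theory of relativity was developed by {tok.mask_token} ."
inputs = tok(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()

with torch.no_grad():
    log_probs = torch.log_softmax(mlm(**inputs).logits[0, mask_pos], dim=-1)

# Rank a few candidate fillers by log probability (less negative is better).
for word in ["einstein", "him", "newton", "aristotle", "maxwell"]:
    print(word, round(log_probs[tok.convert_tokens_to_ids(word)].item(), 3))
```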
53. Examples of Generations
BERT-large. The last column reports the top-5 tokens generated, together with the associated log probability (in square brackets).