The NLP Muppets revolution!
Fabio Petroni
Data Science London
28 May 2019
Disclaimer
2
This is not a Data Science talk
Outline
1. NLP pre-2018 (15 mins)
2. The Revolution (30 mins)
3. Related research activities in FAIR London (15 mins)
3
Natural Language Processing
Goal: for computers to process or “understand”
natural language in order to perform tasks that are
useful.
Christopher Manning
4
• Fully understanding and representing the meaning of language
(or even defining it) is an AI-complete problem.
Discrete representation of word meaning
5
• WordNet: lexical database for the English language
• “one-hot” encoding of words
• Problem: no notion of relationships (e.g., similarity) between words
Word meaning as a neural word vector
• Representing a word by means of its neighbors
6
large corpora
Source: https://becominghuman.ai/how-does-word2vecs-skip-gram-work-f92e0525def4
Unsupervised !
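Below is a minimal sketch of this idea, assuming the gensim library (not part of the slides): skip-gram word vectors trained from raw, unlabeled sentences, so each word is represented via the neighbours it co-occurs with.

```python
# Sketch: unsupervised skip-gram word vectors learned from raw text.
# Assumes the gensim library; the tiny corpus stands in for "large corpora".
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "is", "on", "the", "table"],
    ["the", "dog", "is", "on", "the", "sofa"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 selects the skip-gram objective: predict neighbours from the center word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])                    # a dense vector for "cat"
print(model.wv.most_similar("cat", topn=3))   # nearest neighbours in vector space
```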
Word meaning as a neural word vector
Place words in high dimensional vector spaces
Source: https://www.tensorflow.org/tutorials/representation/word2vec
• Representing a word by means of its neighbors
7
NLU Models, pre 2018
Architectures Galore!
• Design a custom model for
each task
• Train on as much labeled
data as you have
• Often use pretrained word
embeddings, but not always
E.g. BiDAF (Seo et al, 2017) for Reading Comprehension
only first layer is
pre-trained
8
Limitations of word embeddings
• Word2vec, GloVe and related methods provide only shallow representations of language
(analogous to edges in an image) and fail to capture its high-level structure
• A model that uses such word embeddings needs to learn complex language
phenomena by itself:
• word-sense disambiguation
• compositionality
• long-term dependencies
• negation
• etc.
• A huge number of examples is therefore needed to achieve good performance
9
Paradigm Shift
initializing the first
layer of our models
pretraining the
entire model
10
Paradigm Shift
initializing the first
layer of our models
pretraining the
entire model
11
Which task to use?
Aim: to predict the next word given the previous words.
p(w_1, …, w_n) = ∏_{i=1..n} p(w_i | w_1, …, w_{i-1})
e.g.: p(The cat is on the table) =
p(The) x p(cat | The) x p(is | cat, The) x p(on | is, cat, The) ...
p(w_i | w_1, …, w_{i-1}) is often a (recurrent) neural network
Language Modelling (LM)
12
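To make the chain-rule factorisation concrete, here is a minimal sketch that scores a sentence with a pretrained neural LM. It assumes the Hugging Face transformers library and the public "gpt2" checkpoint, neither of which is part of the original slides.

```python
# Sketch: score a sentence with a pretrained neural language model
# via the chain-rule factorisation above. Assumes Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The cat is on the table", return_tensors="pt")

with torch.no_grad():
    # With labels == input_ids the model returns the mean negative
    # log-likelihood of each token given its left context.
    out = model(**inputs, labels=inputs["input_ids"])

n = inputs["input_ids"].shape[1]
log_p = -out.loss.item() * (n - 1)   # sum of log p(w_i | w_1..w_{i-1}), i >= 2
print(f"log p(sentence) ≈ {log_p:.2f}")
```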
What are LMs used for?
Until ~2018
• Used in text generation, e.g.
machine translation or speech
recognition
• Long-standing proof of concept
for language understanding,
largely ignored
As of this year
• Use for everything!
• Pretrain on lots of data
• Finetune for specific tasks
E.g. some of the embeddings in NLP from scratch
papers (Collobert et al., 2009-2011) were extracted
from language models:
“Following our NLP from scratch philosophy, we
now describe how to dramatically improve these
embeddings using large unlabeled data sets.”
“Very long training times make such strategies
necessary for the foreseeable future: if we had been
given computers ten times faster, we probably would
have found uses for data sets ten times bigger.”
13
Are LMs NLP-complete?
AI-complete?
• Good LMs definitely won’t
solve vision or robotics or
grounded language
understanding… :)
But, for NLP…
• How much signal is there
in raw text alone?
• The signal is weak, and
unclear how to best learn
from it…
World knowledge:
The Mona Lisa is a portrait painting by [ Leonardo | Obama ]
Coreference resolution:
John robbed the bank. He was [ arrested | terrified | bored ]
Machine translation:
Belka is the Russian word for [ cat | squirrel ]
14
Language Model Pretraining *
• ELMo: contextual word embeddings [Best paper, NAACL 2018]
• GPT: no embeddings, finetune LM
• BERT: bidirectional LM finetuning [Best paper, NAACL 2019]
• GPT-2: even larger scale, zero-shot
*See also Dai & Le (2015), Peters et al. (2017),
Howard & Ruder (2018), and many others
15
Contextual word embeddings:
f(w_k | w_1, …, w_n) ∈ ℝ^N
f(play | Elmo and Cookie Monster play a game .)
≠
f(play | The Broadway play premiered yesterday .)
16
The Allen Institute for Artificial Intelligence
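As a concrete illustration of context-dependent vectors, here is a sketch assuming the Hugging Face transformers library; it uses BERT rather than ELMo for brevity, so it is not how the slide's example was produced.

```python
# Sketch: the same word gets a different vector in different contexts.
# Assumes Hugging Face transformers; uses BERT here rather than ELMo for brevity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def vector_for(word, sentence):
    """Contextual vector of `word` inside `sentence` (first occurrence)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v1 = vector_for("play", "Elmo and Cookie Monster play a game .")
v2 = vector_for("play", "The Broadway play premiered yesterday .")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())  # well below 1
```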
Contextual word embeddings
AI2 ELMo
• Train two LMs: left-to-
right and right-to-left
[Peters et al, 2018]
[Diagram: a left-to-right LSTM language model over "The Broadway play premiered yesterday", predicting the next word]
17
Contextual word embeddings
AI2 ELMo
• Train two LMs: left-to-
right and right-to-left
[Peters et al, 2018]
[Diagram: left-to-right and right-to-left LSTM language models over "The Broadway play premiered yesterday"]
18
Embeddings from Language Models
AI2 ELMo
• Train two LMs: left-to-
right and right-to-left
• Extract contextualized
vectors from networks
• Use instead of word
embeddings (e.g.
FastText)
• Still use custom task
architectures…
[Peters et al, 2018]
[Diagram: left-to-right and right-to-left LSTM language models over "The Broadway play premiered yesterday"; the ELMo representation f(·) is built from their hidden states]
19
AI2 ELMo
• Train two LMs: left-to-
right and right-to-left
• Extract contextualized
vectors from networks
• Use instead of word
embeddings (e.g.
FastText)
• Still use custom task
architectures…
[Peters et al, 2018]
20
AI2 ELMo
• Train two LMs: left-to-
right and right-to-left
• Extract contextualized
vectors from networks
• Use instead of word
embeddings (e.g.
FastText)
• Still use custom task
architectures…
[Peters et al, 2018]
21
AI2 ELMo
• Train two LMs: left-to-
right and right-to-left
• Extract contextualized
vectors from networks
• Use instead of word
embeddings (e.g.
FastText)
• Still use custom task
architectures…
[Peters et al, 2018]
• Simple and general, building on recent ideas
[text cat. (Dai and Le, 2015); tagging (Peters et al., 2017)]
• Showed self-supervised pre-training working for lots of problems!
• Still assumed custom task-specific architectures…
22
Question: Why Embeddings At All?
23
The Transformer
(Vaswani et al., 2017)
24
Google AI
The Transformer
(Vaswani et al., 2017)
25
1. Pretrain (unsupervised): train a Transformer language model (LM) on huge corpora (billions of words).
This takes days/weeks on several GPUs / TPUs, but pre-trained models are publicly available on the web!
2. Fine-tune (supervised): add a simple component on top and fine-tune for a specific task on
as much labeled data as you have. Fast!
Why embeddings?
OpenAI GPT
• Train transformer
language model
E.g., can we pre-train a common architecture?
[Radford et al, 2018]
[Diagram: a stack of Transformer layers predicting the next word from its left context ("… The submarine is …")]
27
OpenAI
Why embeddings?
OpenAI GPT
• Train transformer
language model
• Encode multiple
sentences at once
E.g., can we pre-train a common architecture?
[Radford et al, 2018]
Textual Classification (e.g. sentiment)
<S> Kids love gelato! <E>
Textual Entailment
<S> Kids love gelato. <SEP> No one hates gelato. <E>
28
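A toy sketch of the input formatting the slide describes; the delimiter strings below are placeholders mirroring the slide, not the actual byte-pair tokens GPT uses.

```python
# Sketch: pack one or two sentences into a single LM input with delimiters,
# as in the GPT fine-tuning setup. The token strings are illustrative placeholders.
START, SEP, END = "<S>", "<SEP>", "<E>"

def format_classification(text):
    return f"{START} {text} {END}"

def format_entailment(premise, hypothesis):
    return f"{START} {premise} {SEP} {hypothesis} {END}"

print(format_classification("Kids love gelato!"))
print(format_entailment("Kids love gelato.", "No one hates gelato."))
```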
Why embeddings?
OpenAI GPT
• Train transformer
language model
• Encode multiple
sentences at once
• Add small layer on
top
• Fine tune for each
end task
E.g., can we pre-train a common architecture?
[Radford et al, 2018]
29
Without ensemble for SNLI: ESIM+ELMo (88.7) vs GPT(89.9)
Without ensemble for SNLI: ESIM+ELMo (88.7) vs GPT(89.9)
• Train on longer texts, with self attention
• Idea of a unified architecture is currently winning!
• Will scale up really nicely (e.g. GPT-2)
• Isn’t bidirectional…
31
Question: Why not bidirectional?
32
Google’s BERT
• Train masked
language model
• Otherwise, largely
adopt OpenAI
approach
New task: predict the missing/masked word(s)
[Devlin et al, 2018]
Idea: jointly model left and right context
[Diagram: a stack of Transformer layers over "… The [MASK] is yellow …", predicting the masked word from both its left and right context]
33
Google AI
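A minimal sketch of the masked-word prediction task, assuming the Hugging Face transformers library (which post-dates the diagrams in the original slides).

```python
# Sketch: BERT's pretraining task — fill in a masked word using both
# left and right context. Assumes Hugging Face transformers.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The [MASK] is yellow."):
    print(f'{pred["token_str"]:>10}  p = {pred["score"]:.3f}')
```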
Google’s BERT
• Train masked
language model
• Otherwise, largely
adopt OpenAI
approach
• Generalize beyond
text classification
New task: predict the missing/masked word(s)
[Devlin et al, 2018]
Idea: jointly model left and right context
34
35
Google’s BERT
• Train masked
language model
• Otherwise, largely
adopt OpenAI
approach
• Generalize beyond
text classification
[Devlin et al, 2018]
Idea: jointly model left and right context
vs.
E.g. for reading comprehension
36
Google’s BERT
• Train masked
language model
• Otherwise, largely
adopt OpenAI
approach
• Generalize beyond
text classification
New task: predict the missing/masked word(s)
[Devlin et al, 2018]
Idea: jointly model left and right context
• Bidirectional reasoning is important!
• Much better comparison, clearly best approach
• Arguably killing off “architecture hacking” research
• Currently the focus of intense study in the NLP community
37
Question: Why fine-tune?
38
What happens with more data?
OpenAI GPT-2
• Same model as GPT
• Still left-to-right
• Add more
parameters and train
on cleaner data!
• Adopt a new
evaluation scheme…
[Radford et al, 2019]
Also, don’t fine tune for different end tasks…
[Diagram: a stack of Transformer layers predicting the next word from its left context ("… premiered yesterday … It was")]
39
Very good language models! Caveat: training sets differ…
40
Very good at generating text! Too good to release to the public…
41
Evaluation: zero shot
OpenAI GPT-2
• Same model as GPT
• Still left-to-right
• Add more
parameters and train
on cleaner data!
• Adopt a new
evaluation scheme…
[Radford et al, 2019]
Assume no labeled data…
Question Answering: Q: Who wrote the origin of species? A: ???
Summarization: [input document, e.g. news article]. TL;DR: ???
Machine Translation: [English sentence]1 = [French sentence]1, … [English sentence]n = ???
42
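A sketch of the zero-shot idea with a public GPT-2 checkpoint, assuming the Hugging Face transformers library; the small public "gpt2" model is much weaker than the full model used in the paper, so the answer may well be wrong.

```python
# Sketch: zero-shot question answering by prompting a plain language model.
# Assumes Hugging Face transformers; the public "gpt2" checkpoint is much
# smaller than the model evaluated in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Q: Who wrote the origin of species? A:"
print(generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"])
```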
Zero shot results… Performs near baseline, with lots of data…
43
Zero shot QA results… Model is learning lots of facts!
44
Zero shot QA results… Model is learning lots of facts!
• Left-to-right again... but at least easy to sample from
• Zero shot results are still at baseline levels…
• LMs clearly improve with more parameters, more data
• Very nice job selling the work to the press
FAIR research activities
46
• What amount of knowledge do LMs store?
• How does their performance compare to
automatically constructed KBs?
Research Question
48
What does BERT know?
[Diagram: a pretrained language model compared against a knowledge base populated with an off-the-shelf relation extraction system (Sorokin and Gurevych, 2017), given oracle entity linking]
Methodology
• Single-token objects and answers; ~50k facts; unified vocabulary.
• Language models rank every word in the vocabulary by its probability.
Example query: “The theory of relativity was developed by ___ .”
Predictions (log probability): Einstein -1.143, him -2.994, Newton -3.758, Aristotle -4.477, Maxwell -4.486
50
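A sketch of this ranking procedure, assuming the Hugging Face transformers library and a BERT checkpoint; the original experiments use the code at https://github.com/facebookresearch/LAMA, not this snippet.

```python
# Sketch: rank every vocabulary word by its (log) probability for a cloze
# query, as in the LAMA probe. Assumes Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-cased")
model.eval()

query = "The theory of relativity was developed by [MASK] ."
inputs = tokenizer(query, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
log_probs = torch.log_softmax(logits, dim=-1)

for lp, idx in zip(*log_probs.topk(5)):
    print(f"{tokenizer.convert_ids_to_tokens(int(idx)):>12}  {lp.item():.3f}")
```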
Factual Knowledge (P@1)
Google-RE: Frequency 4.4 | KBP 1.2 | KBP Oracle 7.6 | ELMo 2.9 | BERT 10.0
T-REx: Frequency 16.7 | KBP 6.1 | KBP Oracle 33.8 | ELMo 7.1 | BERT 32.3
Mean P@k curve for T-REx (k from 1 to 100 on the x-axis, mean P@k on the y-axis)
FS: Fairseq-fconv, Txl: Transformer-XL, Eb: ELMo base, E5B: ELMo 5.5B, Bb: BERT-base, Bl: BERT-large
Common Sense Knowledge
[Bar chart: Precision at 1 on ConceptNet for Frequency, ELMo and BERT]
54
Examples of Generations
[Table: BERT-large; the last column reports the top-5 tokens generated together with the associated log probability (in square brackets)]
56
Question Answering
[Bar chart: P@1 and P@10 for DrQA- and BERT-based systems]
https://github.com/facebookresearch/LAMA
58
download pretrained
language models rather
than pretrained word
embeddings!
59
THANK YOU
