Transfer Learning in NLP:
Concepts and Tools
Thomas Wolf
HuggingFace Inc.
● Concepts and History
● Anatomy of a State-of-the-art Model
● Open source tools
● Current Trends
● Limits and Open Questions
Some slides are adapted from our
NAACL 2019 Tutorial on Transfer
Learning in NLP with my collaborators
Concepts & History
What is Transfer Learning?
Pan and Yang (2010)
Why Transfer Learning in NLP? (intuitively)
Why should Transfer Learning work in NLP?
● Many NLP tasks share common knowledge about language (linguistic
representations, structural similarities...)
● Tasks can inform each other—e.g. syntax and semantics
● Annotated data is rare, so make use of as much supervision as is available.
● Unlabelled data is abundant (e.g. on the internet), so we should try to use it.
Why Transfer Learning in NLP? (empirically)
Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks
(e.g. classification, information extraction, Q&A, etc.)
Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time
Ruder (2019)
Types of transfer learning in NLP
We will focus on sequential transfer learning.
Training: Sequential Transfer Learning
Learn on one task / dataset, then transfer to another task / dataset
[Figure: Pretraining yields word vectors (e.g. cats = [0.2, -0.3, …], dogs = [0.4, -0.5, …])
or sentence/doc vectors (e.g. for “We have two cats.” and “It’s raining cats and dogs.”).
Adaptation then transfers them to a target task such as sequence labeling.]
History: The rise of language modeling
Many currently successful pretraining approaches are based on language
modeling, i.e. learning to predict:
● empirical probability of text: Pϴ(text)
● empirical conditional probability of text (e.g. translation): Pϴ(text | other text)
● Doesn’t require human annotation
● Many languages have enough text to learn a high-capacity model
● Versatile: can learn both sentence and word representations with a variety of
objective functions (autoregressive language modeling, masked language
modeling, span prediction, skip-thoughts, cross-view training...)
● Language modeling is a very difficult task, even for humans.
● Language models are expected to compress any possible context into a
vector that generalizes over possible completions.
○ E.g. “I think this is the beginning of a beautiful ???”
● To have any chance at solving this task, a model is forced to learn syntax,
semantics, encode facts about the world, etc.
● Given enough data and compute, a big model can do a reasonable job!
Anatomy of a State-of-the-Art
Transfer Learning Model
A State-of-the-Art Transfer Learning Model
Two essential components: model & training
● The model: pre-training architecture and adaptations for fine-tuning
○ Current large architectures are mostly based on Transformers (ULMFiT is an exception)
○ Unclear advantages of smart architecture (XLNet) versus more data (RoBERTa)
○ Trend toward larger models: XLM (664M), GPT-2 (1.5B), Megatron-LM (8.3B)
● The training: pre-training and adaptation phases
○ Learning long-term dependencies => long streams of continuous text (books, wiki)
○ Toward using more data in both phases: RoBERTa (160GB), MT-DNN (WNLI)
○ Quality of the data is important
Model: Using a typical Transfer Learning model
[Figure: the input sentence “Jim Henson was a puppeteer” is converted to vectors by the
pretrained model, and a classification head on top outputs one score per class,
e.g. True: 0.7886, False: -0.223.]
Model: From shallow to deep
1 layer: Bengio et al. 2003: A Neural Probabilistic Language Model
24 layers: Devlin et al. 2019: BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
Model: the example of BERT
BERT is pretrained for both sentence and contextual word representations, using masked language
modeling and next sentence prediction.
● Pretrained model: BERT-large has 340M parameters, 24 layers
● Adaptation head: just a linear layer on top of the representation output by the pretrained model.
See also: Logeswaran and Lee, ICLR 2018; Devlin et al., 2019
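The adaptation head described for BERT is small enough to sketch in a few lines. The example below is a minimal illustration in plain Python: the 4-dimensional hidden state, the class name `LinearHead` and the initialisation scale are assumptions made for readability, not values taken from BERT.

```python
import math
import random

def softmax(scores):
    """Turn raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LinearHead:
    """A linear layer mapping one hidden-state vector to one score per class."""

    def __init__(self, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        # Newly added parameters: small random weights and zero biases (assumed scheme).
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def __call__(self, hidden_state):
        return [sum(w * h for w, h in zip(row, hidden_state)) + b
                for row, b in zip(self.weight, self.bias)]

# Stand-in for the representation the pretrained model outputs for one token.
hidden_state = [0.5, -1.0, 0.25, 2.0]
head = LinearHead(hidden_size=4, num_labels=2)
logits = head(hidden_state)
probs = softmax(logits)
```

During fine-tuning, both this head and the pretrained model underneath it would normally be updated by gradient descent; the sketch only shows the forward pass.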
Model: Inside BERT, GPT-2, XLNet, RoBERTa
Large-scale transformer architectures (GPT-2, BERT, XLM...) are very similar to each other and consist of:
● summing word and position embeddings
● applying a succession of transformer blocks with:
○ layer normalisation
○ a self-attention module
○ dropout and a residual connection
○ another layer normalisation
○ a feed-forward module with one hidden layer and a non-linearity:
Linear ⇨ ReLU/GELU ⇨ Linear
○ dropout and a residual connection
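The components listed on this slide can be sketched as a single transformer block in plain Python. This is a simplified illustration under several assumptions: one attention head, no biases, no learned layer-norm parameters, ReLU as the non-linearity, and dropout omitted (as at inference time). It follows the order of the list above; published implementations differ in where exactly layer normalisation is placed.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight):
    """Multiply vector x (length n) by an n x m weight matrix (bias omitted)."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, wq, wk, wv):
    """Single-head scaled dot-product attention over a list of token vectors."""
    qs = [linear(x, wq) for x in xs]
    ks = [linear(x, wk) for x in xs]
    vs = [linear(x, wv) for x in xs]
    scale = math.sqrt(len(qs[0]))
    out = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        out.append([sum(w * v[j] for w, v in zip(weights, vs))
                    for j in range(len(vs[0]))])
    return out

def feed_forward(x, w1, w2):
    """Linear -> ReLU -> Linear."""
    hidden = [max(0.0, h) for h in linear(x, w1)]
    return linear(hidden, w2)

def transformer_block(xs, wq, wk, wv, w1, w2):
    # Layer norm, self-attention, then a residual connection.
    attended = self_attention([layer_norm(x) for x in xs], wq, wk, wv)
    xs = [[a + b for a, b in zip(x, att)] for x, att in zip(xs, attended)]
    # Second layer norm, feed-forward module, then another residual connection.
    ff = [feed_forward(layer_norm(x), w1, w2) for x in xs]
    return [[a + b for a, b in zip(x, f)] for x, f in zip(xs, ff)]

def embed(token_ids, word_emb, pos_emb):
    """Sum the word embedding and the position embedding of each token."""
    return [[w + p for w, p in zip(word_emb[t], pos_emb[i])]
            for i, t in enumerate(token_ids)]

# Tiny hand-written weights (illustrative only): model size 2, hidden size 4.
identity = [[1.0, 0.0], [0.0, 1.0]]
w1 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
word_emb = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
pos_emb = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05]]

xs = embed([2, 0, 1], word_emb, pos_emb)
out = transformer_block(xs, identity, identity, identity, w1, w2)
```

A full model stacks many such blocks (24 in BERT-large) and uses several attention heads per block.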
Main differences between BERT, GPT-2, XLNet: the pretraining objective
● causal language modeling for GPT
● masked language modeling for BERT (+ next sentence prediction)
● permutation language modeling for XLNet
(Child et al, 2019)
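The difference between the causal and masked objectives shows up in how training examples are built from the same sentence. The sketch below is a simplified illustration: the function names and the masking probability used in the example are assumptions, and BERT's actual procedure has extra details (for instance, selected tokens are not always replaced by the mask token).

```python
import random

def causal_lm_examples(tokens):
    """Causal LM: predict each token from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Masked LM: hide some tokens and predict them using context on both sides."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets[i] = tok  # position -> original token to predict
        else:
            inputs.append(tok)
    return inputs, targets

tokens = "Jim Henson was a puppeteer".split()
causal = causal_lm_examples(tokens)
# A high masking probability is used here only so the short example masks something.
masked_inputs, masked_targets = masked_lm_example(tokens, mask_prob=0.5)
```

Each causal example pairs a left context with the next token, while the masked example keeps the full sentence and records which positions must be recovered.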
Model: Adapting for target task
General workflow:
1. Remove pretraining task head if not useful for target task
E.g. remove softmax classifier
2. Add target task-specific layers on top/bottom of
pretrained model
Simple: adding linear layer(s) on top of the pretrained model
More complex: model output as input for a separate model
Sometimes more complex: Adapting to a structurally different task
Ex: Pretraining with a single input sequence and adapting to a task with
several input sequences (ex: translation, conditional generation...)
➯ Use pretrained model to initialize as much as possible of target model
➯ Ramachandran et al., EMNLP 2017; Lample & Conneau, 2019
Training: Adaptation on a text classification task
Keep our pretrained model unchanged as the backbone.
Replace the pretraining head with a classification head: a linear layer, which takes as
input the hidden-state of a token.
Initialization of the model:
● Initialize the weights of the model (in particular the added parameters)
● Reload common weights from the pretrained model.
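The two initialization steps can be sketched with plain dictionaries standing in for model weights. The parameter names (`embeddings`, `block0.attention`, `lm_head`, `classifier`), the sizes and the random scale are hypothetical; real libraries store tensors, but the matching-by-name logic is the same idea.

```python
import random

def init_weights(shapes, seed=0):
    """Step 1: randomly initialise every parameter (here: flat lists of floats)."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(size)]
            for name, size in shapes.items()}

def load_common_weights(model_weights, pretrained_weights):
    """Step 2: copy pretrained values for parameters that exist in both models."""
    loaded, kept_random = [], []
    for name, values in model_weights.items():
        source = pretrained_weights.get(name)
        if source is not None and len(source) == len(values):
            model_weights[name] = list(source)
            loaded.append(name)
        else:
            # Newly added parameters (e.g. the classification head) stay random.
            kept_random.append(name)
    return loaded, kept_random

# Hypothetical pretrained checkpoint: backbone plus a language-modeling head.
pretrained = {
    "embeddings": [0.1] * 6,
    "block0.attention": [0.2] * 4,
    "lm_head": [0.3] * 6,
}
# Hypothetical target model: same backbone, new classification head.
target_shapes = {"embeddings": 6, "block0.attention": 4, "classifier": 2}

model = init_weights(target_shapes)
loaded, kept_random = load_common_weights(model, pretrained)
```

The pretraining head (`lm_head` here) is simply never copied, which corresponds to removing it from the target model.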
Training: Adaptation on a text classification task
We are at the state-of-the-art
❏ The error rate goes down quickly! After one epoch we already have >90% accuracy.
⇨ Fine-tuning is highly data efficient in Transfer Learning
❏ We took our pre-training & fine-tuning hyper-parameters straight from the literature on related models.
⇨ Fine-tuning is often robust to the exact choice of hyper-parameters
Training: Adaptation on a text classification task
A few words on robustness & variance.
❏ Large pretrained models (e.g. BERT large) are
prone to degenerate performance when fine-tuned
on tasks with small training sets.
❏ Observed behavior is often “on-off”: it either works
very well or doesn’t work at all.
❏ Understanding the conditions and causes of this
behavior (models, adaptation schemes) is an
open research question.
Phang et al., 2018
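One practical response to this “on-off” behavior is to fine-tune with several random seeds and compare the runs. The sketch below assumes a user-supplied `fine_tune(seed)` callable that returns a dev-set score; that callable, the helper name and the scores in the stand-in function are all hypothetical and only serve to show the bookkeeping.

```python
import statistics

def fine_tune_with_restarts(fine_tune, seeds):
    """Run `fine_tune(seed)` once per seed and summarise the dev-set scores."""
    scores = {seed: fine_tune(seed) for seed in seeds}
    values = list(scores.values())
    best_seed = max(scores, key=scores.get)
    return {
        "scores": scores,
        "best_seed": best_seed,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

def fake_fine_tune(seed):
    # Placeholder for a real fine-tuning run; these scores are made up
    # to mimic runs that either work well or degenerate.
    return 0.5 if seed % 3 == 0 else 0.9

summary = fine_tune_with_restarts(fake_fine_tune, seeds=[0, 1, 2, 3])
```

Reporting the mean and spread across seeds, rather than only the best run, gives a more honest picture of how stable the adaptation is.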
Open-source tools
Hubs and Libraries
Open-sourcing: practical considerations
● Pretraining large-scale models is costly
Use open-source models
Share your pretrained models
“Energy and Policy Considerations for Deep Learning in NLP” - Strubell, Ganesh, McCallum - ACL 2019
● Sharing/accessing pretrained models
○ Hubs: Tensorflow Hub, PyTorch Hub
○ Author released checkpoints: ex BERT, GPT...
○ Third-party libraries: AllenNLP, HuggingFace
● Design considerations
○ Hubs/libraries:
■ Simple to use but can be difficult to modify model internal architecture
○ Author released checkpoints:
■ More difficult to use but you have full control over the model internals
PyTorch Hub
● Based on GitHub repositories: a model is shared by adding a file to the GitHub repository.
● PyTorch Hub can fetch the model from the master branch on GitHub. This means that you
don’t need to package your model (pip) & users can always access the most recent version.
● Both model definitions and pre-trained weights can be shared
TensorFlow Hub
● TensorFlow Hub is a library for sharing machine learning models as self-contained pieces of
TensorFlow graph with their weights and assets.
● Modules are automatically downloaded and cached when instantiated.
● Each time a module is called, it adds operations to the current TensorFlow graph.
Main limitations of Hubs
● No access to the source code of the model (black-box)
● Not possible to modify the internals of the model (e.g. to add Adapters)
HuggingFace library with Transformers 👾
We’ve built an opinionated library of pretrained models (pytorch-transformers) for
NLP researchers and practitioners seeking to use/study/modify large-scale
pretrained transformer models such as BERT, GPT, GPT-2, XLNet, RoBERTa...
The library was designed with two strong principles in mind:
● be as easy to use and fast to on-board as possible:
○ almost no abstractions to learn: just models, tokenizers and configurations,
○ a common from_pretrained() method takes care of downloading/caching/loading
classes from pretrained instances supplied in the library or the user’s saved instances,
○ to build upon the library, the user can use regular PyTorch modules.
● provide state-of-the-art models identical to the original models:
○ examples reproducing official results,
○ carefully drafted code as close as possible to the original computation graph.
Current Trends
Larger models
Larger models on larger datasets
A minimum amount of data is required to unlock the potential of Transfer Learning
Question: Did no one think of this before? Why did it only start in ‘18 (ELMo)?
J. Devlin’s Answer: Getting good results from pre-training is 1,000x to 100,000x more
expensive than supervised training.
○ E.g., 10x-100x bigger model trained for 100x-1,000x as many steps.
○ Imagine in 2013: a well-tuned 2-layer LSTM gets 80% accuracy on sentiment
analysis, training for 8 hours.
○ Pre-train large-scale language model on same architecture for a week, get +0.5%.
○ Reviewers: “Who would do something so expensive for such a small gain?”
Devlin et al.
Larger models on larger datasets
Diminishing returns of using more data/bigger models:
➭ For a linear gain in performance, an exponentially larger model is required.
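One way to read this claim is to assume, purely for illustration, that performance grows logarithmically with model size N (the constants a and b are hypothetical, not fitted to any of the cited results):

```latex
\mathrm{perf}(N) \approx a + b \log N
\quad\Longrightarrow\quad
\mathrm{perf}(kN) - \mathrm{perf}(N) \approx b \log k
```

Under that assumption, every fixed gain Δ in performance requires multiplying the model size by the same factor k = exp(Δ / b), so a linear sequence of gains corresponds to an exponential sequence of model sizes.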
Radford and Wu et al. Devlin et al. Hancock @ Fwdays’19
Transfer Learning in NLP: Concepts and Tools - Thomas Wolf - Slide 34
Smaller models: Distilling large models
Training costs make headlines, but as large-scale models reach production,
inference time will likely account for most of a model's total environmental cost.
Distilling larger models into smaller ones:
● reduces inference cost
● capitalizes on the inductive biases learned by a large model.
95% of the performance of a model like BERT can be preserved in a distilled
model 40% smaller and 60% faster (our team's work on DistilBERT, open-sourced
in our pytorch-transformers library)
And a lot of very fresh work which will be published around the end of the year:
Tsai et al., Turc et al., Tang et al., ...
Smaller models: Distillation from large models
Distillation: two main tricks to train a student model from a teacher model:
1. Starting from a high-quality weight initialization derived from the teacher
2. Training the student to mimic the full output distribution of the teacher
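Both tricks can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the temperature value, the choice of keeping every second teacher layer, the function names and the example logits are all hypothetical, and a real training loop would combine this loss with other terms and compute gradients.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def init_student_from_teacher(teacher_layers, keep_every=2):
    """Trick 1: initialise the student from a subset of the teacher's layers."""
    return [list(layer) for layer in teacher_layers[::keep_every]]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Trick 2: cross-entropy between the teacher's and student's full output distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

# Made-up logits over three classes, for illustration only.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher_logits)
loss_far = distillation_loss(far_student, teacher_logits)
student_layers = init_student_from_teacher([[1.0], [2.0], [3.0], [4.0]])
```

A student whose output distribution is close to the teacher's gets a lower loss than one that ranks the classes differently, which is what pushes the student to mimic the teacher beyond its top prediction.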
Limits, Open Questions
Shortcomings of pretrained language models
Large, pretrained language models can be difficult to optimize.
● Fine-tuning is often unstable and has a high variance, particularly if the
target datasets are very small. BERT large is prone to degenerate
performance; multiple random restarts can be necessary (Phang et al., 2018)
● Do we really need all these parameters?
● Recent work shows that only a few of the attention heads in BERT are
required (Voita et al., ACL 2019, Michel et al.).
● More work needed to understand model parameters.
● Pruning and distillation are two ways to deal with this.
● See also: the lottery ticket hypothesis (Frankle et al., ICLR 2019).
Shortcoming of language modeling in general
The most successful current pretraining methods are based on variants of
language modeling. But this has many shortcomings:
● Not appropriate for all models
○ If we condition on more inputs (video/sound), need to pretrain those parts
● Weak signal for semantics and long-term context vs. strong signal for
syntax and short-term word co-occurrences
● Pretrained language models are bad at
○ fine-grained linguistic tasks (Liu et al., NAACL 2019)
○ common sense (when you actually make it difficult; Zellers et al., ACL 2019);
coherent natural language generation
○ tend to overfit to surface form information when fine-tuned; ‘rapid surface
Shortcoming of language modeling in general
Need for grounded representations
● Limits of distributional hypothesis—difficult to learn certain types of
information from raw text
○ Human reporting bias: not stating the obvious (Gordon and Van Durme, 2013)
○ Common sense isn’t written down
○ Facts about named entities
○ No grounding to other modalities
● Possible solutions:
○ Incorporate structured knowledge (e.g. databases - ERNIE: Zhang et al 2019)
○ Multimodal learning (e.g. visual representations - VideoBERT: Sun et al. 2019)
○ Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al. 2018)
That’s all folks!
