XLNet is a generalized autoregressive pretraining model for natural language understanding. It leverages all possible permutations of the factorization order to capture bidirectional contexts, unlike previous autoregressive models, which only learn information in one direction. This allows XLNet to better model relationships between non-consecutive words. XLNet also does not rely on data corruption; instead, it directly models the probability distribution over the input text. In experiments, XLNet achieves state-of-the-art results on 18 natural language processing tasks.
3. It’s all about Pretraining + Fine Tuning
(Figure: one pretrained model (a language model) is fine-tuned separately for downstream tasks such as Machine Reading Comprehension.)
One pretrained model achieves state-of-the-art results on a wide range of NLP tasks (18).
We assume that NLU (Natural Language Understanding) can be estimated by those tasks.
4. How can we generate a pretrained model that understands natural language?
5. Language Model
• A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability to the whole sequence.
Example: judge how plausible the sentence "I love Natural Language Processing" is by estimating P(I, love, Natural, Language, Processing).
7. It is called an Autoregressive (AR) Language Model
An example of the autoregressive view: the token at step t depends on its '<t' context, and the model predicts recursively, factorizing $p(x) = \prod_t p(x_t \mid x_{<t})$ (a minimal sketch follows below).
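A minimal Python sketch of this chain-rule factorization; cond_logprob below is a hypothetical stand-in for any real left-to-right model (an n-gram model, an RNN, a Transformer decoder, ...), not part of the original slides.

import math

VOCAB = ["I", "love", "Natural", "Language", "Processing"]

def cond_logprob(token, context):
    # Toy uniform distribution; a real model would condition on `context`.
    return math.log(1.0 / len(VOCAB))

def sequence_logprob(tokens):
    # log p(x) = sum_t log p(x_t | x_<t)
    total = 0.0
    for t in range(len(tokens)):
        # the t-th token depends only on its '<t' context
        total += cond_logprob(tokens[t], tokens[:t])
    return total

print(sequence_logprob("I love Natural Language Processing".split()))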
8. Limitations
• AR-based language models only learn uni-directional information.
Missing bidirectional contexts.
9. Previous Work
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
10. Reconstruct the missing part (Denoising Autoencoder based approach)
(Figure: reconstructing a missing region in an image vs. a masked token in text.)
"I love Natural [MASK] Processing"
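A minimal Python sketch of this corruption step, assuming a plain whitespace tokenizer and a 15% masking rate (both assumptions, not the exact BERT recipe).

import random

def corrupt(tokens, mask_rate=0.15, mask_token="[MASK]"):
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            corrupted.append(mask_token)   # hide the token in the input
            targets[i] = tok               # the model must reconstruct it at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

print(corrupt("I love Natural Language Processing".split()))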
11. Limitations
• 1. BERT uses [MASK] tokens only in the pretraining step, not in the fine-tuning step, which creates a pretrain-finetune discrepancy.
• 2. BERT assumes the predicted tokens are independent of each other given the unmasked tokens (see the toy example after this list).
Dependencies inside important units such as the noun phrase ('New', 'York') are therefore missed.
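A toy illustration of the independence assumption, with assumed probabilities (not measured values). Suppose both "New" and "York" are masked in "I love [MASK] [MASK]".

p_new = 0.20               # p(New  | I love _ _)
p_york_indep = 0.05        # p(York | I love _ _)     BERT: cannot see "New"
p_york_given_new = 0.90    # p(York | I love New _)   AR: conditions on "New"

bert_joint = p_new * p_york_indep       # independent predictions
ar_joint = p_new * p_york_given_new     # chained predictions
print(round(bert_joint, 2), round(ar_joint, 2))   # 0.01 vs 0.18: BERT misses the dependency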
12. XLNet – Permutation Language Model
• 1. XLNet leverages all possible permutations of the factorization order (see the sketch after this list).
Enables capturing contexts from both directions.
• 2. As a generalized AR language model, XLNet does not rely on data
corruption.
Considers the entire probability distribution of a text sequence
• 3. XLNet achieves SOTA on 18 NLP tasks.
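A minimal Python sketch of the permutation language-modeling idea: sample one factorization order z, then score each token given the tokens that come earlier in that order (not earlier in the sequence). cond_logprob is a hypothetical, position-aware model stand-in introduced here for illustration.

import math
import random

def cond_logprob(target_pos, target_token, visible):
    # `visible` maps already-factorized positions to their tokens.
    # A real model would attend to them through masking; here it is uniform.
    return math.log(1.0 / 5)

def permutation_logprob(tokens):
    order = list(range(len(tokens)))
    random.shuffle(order)                 # one sampled factorization order z
    visible, total = {}, 0.0
    for pos in order:
        total += cond_logprob(pos, tokens[pos], dict(visible))
        visible[pos] = tokens[pos]        # becomes context for later steps
    return total

print(permutation_logprob("I love Natural Language Processing".split()))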
13. Model Overview
Previous context is carried over from *Transformer-XL.
Note: the sequence order does not change; with a different factorization order, different parts of the sequence are used as input.
*Dai, Zihang, et al. "Transformer-XL: Attentive language models beyond a fixed-length context." arXiv preprint arXiv:1901.02860 (2019).
(Figure: predicting the word "Natural" in "I love Natural Language" under different factorization orders.)
Key idea: when predicting the word at step t, the model only sees the words at steps before t in the factorization order. Let's mix the order index! Even when the same word "Natural" is predicted, the visible context differs with each permutation (this is implemented with an attention mask, not by reordering the input). In effect, the model looks backward and forward over many combinations of the context.
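A minimal sketch, not the actual XLNet code, of how one factorization order becomes an attention mask while the input order stays fixed.

import numpy as np

def factorization_mask(order):
    rank = {pos: step for step, pos in enumerate(order)}   # step at which each position is predicted
    T = len(order)
    mask = np.zeros((T, T), dtype=bool)                     # mask[i, j]: may position i attend to j?
    for i in range(T):
        for j in range(T):
            mask[i, j] = rank[j] < rank[i]                  # only earlier-in-order positions
    return mask

# Permutation "4 1 2 5 3" from the next slide, written 0-indexed.
print(factorization_mask([3, 0, 1, 4, 2]).astype(int))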
14. We have a problem in the target function
• Input words (x): I love Natural Language Processing
• Permutation 1 ($z^{(1)}$): 4 1 2 5 3, e.g. P(Processing | I, love, Language)
• Permutation 2 ($z^{(2)}$): 4 1 2 3 5, e.g. P(Natural | I, love, Language)
Target function: $\max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_T}\Big[\sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}})\Big]$, where $z_t$ is the position predicted at step $t$ and $z_{<t}$ are the positions factorized before it.
15. We have a problem in the target function
• Input words (x): I love Natural Language Processing
• Permutation 1 ($z^{(1)}$): 4 1 2 5 3, e.g. P(Processing | I, love, Language)
• Permutation 2 ($z^{(2)}$): 4 1 2 3 5, e.g. P(Natural | I, love, Language)
We didn't consider the target position $z_t$! The parameterization $p_\theta(x_{z_t} \mid x_{z_{<t}})$ depends only on the visible context, so predicting "Natural" and "Processing" from the same context {I, love, Language} gives the same probability. Shouldn't we tell the targets apart by where they are located?
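A toy stand-in showing the problem concretely: without the target position, the two predictions are indistinguishable (the distribution below uses assumed numbers).

def naive_predict(visible_tokens):
    # distribution over the vocabulary, computed from the context alone
    return {"Natural": 0.5, "Processing": 0.5}

context = ["Language", "I", "love"]   # positions 4, 1, 2 in factorization order
print(naive_predict(context))         # used when the target position is 5 ("Processing")
print(naive_predict(context))         # ... and when the target position is 3 ("Natural")
# Both calls return the exact same distribution, so the prediction must also be
# conditioned on the target position z_t (XLNet's position-aware representation g).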
18. Why two-stream? We have to get the content-stream representations h for the positions $z_{<t}$, which are then used as context when computing the query stream g.
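A minimal sketch, assuming the mask-based view above (not the actual implementation), of the two attention masks behind the content stream h and the query stream g for one factorization order.

import numpy as np

def two_stream_masks(order):
    rank = {pos: step for step, pos in enumerate(order)}
    T = len(order)
    content = np.zeros((T, T), dtype=bool)
    query = np.zeros((T, T), dtype=bool)
    for i in range(T):
        for j in range(T):
            query[i, j] = rank[j] < rank[i]     # g: strictly earlier in the order, never its own content
            content[i, j] = rank[j] <= rank[i]  # h: earlier in the order, plus itself (encodes own content)
    return content, query

content, query = two_stream_masks([3, 0, 1, 4, 2])
print(content.astype(int))
print(query.astype(int))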
21. Conclusion
• XLNet is a generalized AR pretraining model that combines the advantages of both the conventional autoregressive language model (AR) and the autoencoder model (AE).
• By leveraging Transformer-XL and designing the two-stream mechanism, XLNet is trained to estimate the probability distribution autoregressively.
• XLNet achieves state-of-the-art results on a wide range of tasks (18 NLP tasks).