SlideShare una empresa de Scribd logo
1 de 42
Descargar para leer sin conexión
BREAKING THE SOFTMAX BOTTLENECK:
A HIGH-RANK RNN LANGUAGE MODEL
Zhilin Yang∗, Zihang Dai∗, Ruslan Salakhutdinov, William W. Cohen
School of Computer Science
Carnegie Mellon University
{zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu
Publish as a conference paper at ICLR 2018
∗Equal contribution. Ordering determined by dice rolling. Ray
Outline
• Introduction

• Language Modeling as Matrix Factorization

• Experiments

• Related work

• Conclusions
!2
Introduction
Section 1
• A corpus of tokens:

• Language model:
Language Model
!4
“Sharon likes to play _______”
P(“EverWing” | “Sharon likes to play”) = 0.01
P(“boyfriends”| “Sharon likes to play”) = 0.666
P(“Instagram” | “Sharon likes to play”) = 0.1018
…
context next token
corpus tokens
LM
contextnext token
← Sharon’s diary
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
Language Model
!5
<s> I am Sharon </s>
<s> Sharon I am </s>
<s> I love group meeting </s>
2-grams example:
{ (<s>, I), (I, am), (am, Sharon) …, (meeting, </s>) }

P( Sharon | am ) = C( Sharon, am ) / C(am) = 1 / 2

P( am | I ) = 2 / 3

P( I | am ) = 2 / 2

….
window size
Ngram Language Model
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
store
RNN Language Model
!6
(concept)
RNN
xt
gt
P(“EverWing” | context)
P(“boyfriends”| context)
…
Expected
Output:
(lots of weights)
RNN
x1
g1
RNN
x2
g2
RNN
x3
g3
RNN
xn
gn
…
=
(unroll)
Sharon likes to playInput:
(Sharon’s diary)
context
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
probability vector
fit
(approximate)
[vector]
RNN Language Model
!7
RNN
x1
g1
RNN
x2
g2
RNN
x3
g3
RNN
xn
gn
…
(practical)
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
Softmax(hT w)
context Sharon likes to play
[context vector] (h)
hT w = [hT w1, hT w2, … hT wM]
[word vector] [word vector] [word vector] [word vector]w
X1
(w1, w2, …, wM)
X2
1. context encode
2. relation to each word
setting

# of words: 10,000 

word vector dim: 400
(dim: 400)
(dim: 400)
(dim: 10,000)
(dim: [10,000, 400])
(dim: 10,000)
Softmax function
!8
[ 1, 12, 7, 11 ]
[hT w1, hT w2, hT w3, hT w4]Ct = “Sharon likes to play”
⇒ context vector (h)
X1 = “EverWing” ⇒ w1
X2 = “boyfriends” ⇒ w2
X3 = “Instagram” ⇒ w3
X4 = “MaLuLian” ⇒ w4
[ 0.0, 0.73, 0.0, 0.27 ][P(X1|Ct), P(X2|Ct), P(X3|Ct), P(X4|Ct)]
sum = e1
+ e12
+ e7
+ e11
[ ]
e1
sum
e12
sum
e7
sum
e11
sum
Softmax(hT w)
=
(word vector)
dot product
Goal:
back to RNN LM
!9
RNN
gn
[P(X1|Ct) P(X2|Ct) … P(XT|Ct)]
Softmax(hT w)
[context vector] (h)
hT w = [hT w1, hT w2, … hT wT]
(dim: 400)
(dim: [10,000, 400])
(dim: 10,000)
(dim: 10,000)
(w1, w2, …, wT) Is it capable of modeling the
Natural Language?
Softmax bottleneck
If NO!
the performance of our model is limited
setting

# of words: 10,000 

word vector dim: 400
Contribution
!10
• Identify this Softmax bottleneck.

• Propose a new method to address this problem 

called Mixture of Softmaxes (MoS)
by matrix factorization (in Section 2) and experiments (in Section 3)
(M = LU)
why Softmax
!11
[ 1, 12, 7, 11 ]
[ 0.0, 0.73, 0.01, 0.26 ]
(usually `argmax` or `category label` would be like: [0, 1, 0, 0])
Softmax() simple normalize
[ 0.03, 0.38, 0.22, 0.36 ]
It’s more like `argmax`!
too close
Language Modeling
as
Matrix Factorization
Section 2
• Rank & matrix factorization

• Softmax-based RNN LM & matrix form

• Identify the Softmax bottleneck

• Propose hypothesis: Natural Language is high-rank

• Easy fixes?

• Mixture of Softmaxes (MoS) & Mixture of Contexts (MoC)
!13
Flow of Section 2
• We can consider rank is the number of i.i.d. features (dimensions)

• If A is a m x n matrix: rank(A) ≤ min(m, n)

• example:
!14
Rank
Age Height Weight H-W Gender
18 183 90 93 M
22 158 48 110 F
28 159 52 107 F
24 175 73 102 M
Age Height Weight Gender
18 183 90 M
22 158 48 F
28 159 52 F
24 175 73 M
dim = 4x3 → rank = 3 dim = 4x4 → rank = 4?
(in matrix)
NO! → rank = 3
rank(V) = rank(WH) ≤ min(rank(W), rank(H))
!15
Matrix Factorization
WH = V
Property:
4 x 62 x 64 x 2
If A is a m x n matrix: rank(A) ≤ min(m, n)
= 2
• hc: context vector

• wx: word embedding vector

• θ: all the parameters (weights) in our model

• Pθ: our model

• P*: true data distribution
!16
Sb RNN LM
(dim: 400)
(dim: 400)
by θ
logit
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
P(X1|CN) P(X2|CN) … P(XM|CN)
(Softmax-based RNN Language Model)
!17
Matrix form
(N: # of context) (M: # of token)
star (*) means: true data distribution
our model
our model our model
ground truth
Softmax(HθWθ) =
Softmax(A) =
!18
Concept of derivation
Softmax SoftmaxSoftmax
if we want
=
F(A) is an infinite set that
contain all possible logits.
matrix factorization
!19
Matrix form
(N: # of context) (M: # of token)
star (*) means: true data distribution
our model
our model our model
ground truth
Softmax(HθWθ) =
Softmax(A) =
!20
F(A): row-wise shift operation
JN,M is a all-one matrixF is an infinite set.
all possible logits are in F(A)
max rank difference = 1
!21
Concept of derivation
Softmax SoftmaxSoftmax
if we want
=
F(A) is an infinite set that
contain all possible logits.
matrix factorization
!22
Matrix Factorization
our model a matrix related to
the ground truth
rank(A′) = rank(HθWθ) = min(rank(Hθ), rank(Wθ)) ≤ d
rank(A′) ≤ d
(is upper bounded by d)
Softmax bottleneck
rank(V) = rank(WH) ≤ min(rank(W), rank(H))
WH = V
!23
Softmax bottleneck
If d is too small, our model (Softmax) does not have the capacity 

to express the true data distribution.
If rank(A) − 1 > d,
there must exists a context c, such that
if we want
rank(A′) ≤ d
(is upper bounded by d)
Hypothesis: NL (A) is High-Rank
1. Natural language is highly context-dependent.

2. If A is low-rank

3. (Experiment) The high-rank LM outperforms many low-rank LMs. (in Section 3)
!24
token: “north” in
international news & geo. textbook
They hypothesize it should result in a high-rank matrix A.
We can create all semantic meanings by a few hundred bases.
Softmax bottleneck limits the expressiveness of the model.
learn a low-rank model rank( Softmax-based model ) ≤ d << rank(A)
3 reasons
If true
• Increasing the dimension d ?
!25
Easy Fixes?
Higher rank
Increasing the number of parameters, too.
Curse of dimensionality
Empirical conclusion: increasing the d beyond several hundred is not helpful
But !
(Merity et al., 2017; Melis et al., 2017; Krause et al., 2017)
→ Overfitting
Mixture of Softmaxes (MoS) high-rank, but do not increase the dimension d
rank( Softmax-based model ) ≤ d << rank(A)
!26
Mixture of Softmaxes
π: mixture weight, or call “prior”

wπ,k & Wh,k are new model parameters

K is a hyper-parameter
(high-rank Language Model)
!27
Mixture of Softmaxes
context vector
RNN
gn
RNN output (gt)
priorprior logit
logit
h = tanh(W · gt)
w · gt π
h · wx
weighted average
( ×→∑ )
Softmax
Softmax
K
[π1, π2, …, πk]
[h1w, h2w, …, hkw]
!28
Mixture of Contexts (MoC)
(low-rank baseline)
It’s very like Softmax model
Experiments
Section 3
• Evaluate the MoS model base on perplexity (PP)

- Penn Treebank (PTB)

- WikiText-2 (WT2)

- 1B Word dataset (a larger dataset)

• MoS is a generic model: experiment on Seq2Seq model (PP & BLEU)

- Switch-board dataset (a dialog dataset)
!30
Main Results
(smaller value is better)
(larger value is better)
using MoS on state-of-the-art model
!31
Main results
on Penn Treebank (PTB)
(PP: smaller value is better)
!32
Main results
on WikiText-2 (WT2)
on 1B Word dataset
!33
Main results
on Seq2Seq model
Switch-board dataset (a dialog dataset)
(BLEU: larger value is better)
!34
Ablation Study
compare MoS & MoC
MoC do not perform better
!35
Verify the Role of rank
Rank compassion on PTB.
Test different # of Softmax (K)
estimate the rank of A →
The embedding sizes (d) of
Softmax, MoC and MoS are 400, 280, 280 respectively.
The vocabulary size (M) is 10,000 for all models.
!36
Inverse experiment
on character-level language model
Softmax does not suffer from a rank limitation
→ MoS does not improve the performance
(low-rank problem)
→ MoS’ performance is from it’s high-rank
!37
MoS computational time
relation between K & time
→ it’s `sub-linear` relation

related to the size of dataset
!38
Qualitative analysis
to see how MoS improves the next-token prediction in detail
!39
Qualitative analysis
to see how MoS improves the next-token prediction in detail
Conclusions
Section 5
• Softmax-based language models 

is limited by the dimension of the word embedding

→ Softmax bottleneck

• MoS

improves the performance over Softmax

avoid overfitting & naively increasing dimension

• MoS improves the state-of-the-art results by a large margin

→ It’s importance to have a high-rank model for natural language.
!41
Conclusions
Thanks ✄

Más contenido relacionado

La actualidad más candente

再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会
再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会
再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会
Shotaro Sano
 
ICML2017 best paper (Understanding black box predictions via influence functi...
ICML2017 best paper (Understanding black box predictions via influence functi...ICML2017 best paper (Understanding black box predictions via influence functi...
ICML2017 best paper (Understanding black box predictions via influence functi...
Antosny
 
A Review of the Split Bregman Method for L1 Regularized Problems
A Review of the Split Bregman Method for L1 Regularized ProblemsA Review of the Split Bregman Method for L1 Regularized Problems
A Review of the Split Bregman Method for L1 Regularized Problems
Pardis N
 

La actualidad más candente (20)

Preparing your data for Machine Learning with Feature Scaling
Preparing your data for  Machine Learning with Feature ScalingPreparing your data for  Machine Learning with Feature Scaling
Preparing your data for Machine Learning with Feature Scaling
 
Spatial operation.ppt
Spatial operation.pptSpatial operation.ppt
Spatial operation.ppt
 
[PR12] image super resolution using deep convolutional networks
[PR12] image super resolution using deep convolutional networks[PR12] image super resolution using deep convolutional networks
[PR12] image super resolution using deep convolutional networks
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
深層学習による非滑らかな関数の推定
深層学習による非滑らかな関数の推定深層学習による非滑らかな関数の推定
深層学習による非滑らかな関数の推定
 
Contrastive learning 20200607
Contrastive learning 20200607Contrastive learning 20200607
Contrastive learning 20200607
 
【DL輪読会】“Gestalt Principles Emerge When Learning Universal Sound Source Separa...
【DL輪読会】“Gestalt Principles Emerge When Learning Universal Sound Source Separa...【DL輪読会】“Gestalt Principles Emerge When Learning Universal Sound Source Separa...
【DL輪読会】“Gestalt Principles Emerge When Learning Universal Sound Source Separa...
 
Person Re-Identification におけるRe-ranking のための K reciprocal-encoding
Person Re-Identification におけるRe-ranking のための K reciprocal-encodingPerson Re-Identification におけるRe-ranking のための K reciprocal-encoding
Person Re-Identification におけるRe-ranking のための K reciprocal-encoding
 
ドメイン適応の原理と応用
ドメイン適応の原理と応用ドメイン適応の原理と応用
ドメイン適応の原理と応用
 
再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会
再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会
再帰型ニューラルネット in 機械学習プロフェッショナルシリーズ輪読会
 
KDDCUP2020 RL Track : 強化学習部門入賞の手法紹介
KDDCUP2020 RL Track : 強化学習部門入賞の手法紹介KDDCUP2020 RL Track : 強化学習部門入賞の手法紹介
KDDCUP2020 RL Track : 強化学習部門入賞の手法紹介
 
SimGAN 輪講資料
SimGAN 輪講資料SimGAN 輪講資料
SimGAN 輪講資料
 
Deep Generative Models
Deep Generative ModelsDeep Generative Models
Deep Generative Models
 
統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展統計的音声合成変換と近年の発展
統計的音声合成変換と近年の発展
 
Priorに基づく画像/テンソルの復元
Priorに基づく画像/テンソルの復元Priorに基づく画像/テンソルの復元
Priorに基づく画像/テンソルの復元
 
Deep Recurrent Q-Learning(DRQN) for Partially Observable MDPs
Deep Recurrent Q-Learning(DRQN) for Partially Observable MDPsDeep Recurrent Q-Learning(DRQN) for Partially Observable MDPs
Deep Recurrent Q-Learning(DRQN) for Partially Observable MDPs
 
内発的動機づけの計算モデル, 岡夏樹
内発的動機づけの計算モデル, 岡夏樹内発的動機づけの計算モデル, 岡夏樹
内発的動機づけの計算モデル, 岡夏樹
 
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
Deep learning を用いた画像から説明文の自動生成に関する研究の紹介
 
ICML2017 best paper (Understanding black box predictions via influence functi...
ICML2017 best paper (Understanding black box predictions via influence functi...ICML2017 best paper (Understanding black box predictions via influence functi...
ICML2017 best paper (Understanding black box predictions via influence functi...
 
A Review of the Split Bregman Method for L1 Regularized Problems
A Review of the Split Bregman Method for L1 Regularized ProblemsA Review of the Split Bregman Method for L1 Regularized Problems
A Review of the Split Bregman Method for L1 Regularized Problems
 

Similar a Breaking the Softmax Bottleneck: a high-rank RNN Language Model

[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
npinto
 
Parallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsParallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-Joins
Jonny Daenen
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
butest
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
butest
 
NTCIR11-Math2-PattaniyilN_slides
NTCIR11-Math2-PattaniyilN_slidesNTCIR11-Math2-PattaniyilN_slides
NTCIR11-Math2-PattaniyilN_slides
Nidhin Pattaniyil
 

Similar a Breaking the Softmax Bottleneck: a high-rank RNN Language Model (20)

CS571: Distributional semantics
CS571: Distributional semanticsCS571: Distributional semantics
CS571: Distributional semantics
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
 
Finding similar items in high dimensional spaces locality sensitive hashing
Finding similar items in high dimensional spaces  locality sensitive hashingFinding similar items in high dimensional spaces  locality sensitive hashing
Finding similar items in high dimensional spaces locality sensitive hashing
 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
 
Unit 3
Unit 3Unit 3
Unit 3
 
Unit 3
Unit 3Unit 3
Unit 3
 
leanCoR: lean Connection-based DL Reasoner
leanCoR: lean Connection-based DL ReasonerleanCoR: lean Connection-based DL Reasoner
leanCoR: lean Connection-based DL Reasoner
 
Basics of Dynamic programming
Basics of Dynamic programming Basics of Dynamic programming
Basics of Dynamic programming
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Parallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsParallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-Joins
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
 
NTCIR11-Math2-PattaniyilN_slides
NTCIR11-Math2-PattaniyilN_slidesNTCIR11-Math2-PattaniyilN_slides
NTCIR11-Math2-PattaniyilN_slides
 
Slides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processingSlides: A glance at information-geometric signal processing
Slides: A glance at information-geometric signal processing
 

Último

Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 

Último (20)

Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 

Breaking the Softmax Bottleneck: a high-rank RNN Language Model

  • 1. BREAKING THE SOFTMAX BOTTLENECK: A HIGH-RANK RNN LANGUAGE MODEL Zhilin Yang∗, Zihang Dai∗, Ruslan Salakhutdinov, William W. Cohen School of Computer Science Carnegie Mellon University {zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu Publish as a conference paper at ICLR 2018 ∗Equal contribution. Ordering determined by dice rolling. Ray
  • 2. Outline • Introduction • Language Modeling as Matrix Factorization • Experiments • Related work • Conclusions !2
  • 4. • A corpus of tokens: • Language model: Language Model !4 “Sharon likes to play _______” P(“EverWing” | “Sharon likes to play”) = 0.01 P(“boyfriends”| “Sharon likes to play”) = 0.666 P(“Instagram” | “Sharon likes to play”) = 0.1018 … context next token corpus tokens LM contextnext token ← Sharon’s diary P(X1|C1) P(X2|C1) … P(XM|C1) P(X1|C2) P(X2|C2) … P(XM|C2) … P(X1|CN) P(X2|CN) … P(XM|CN) Language Model
  • 5. !5 <s> I am Sharon </s> <s> Sharon I am </s> <s> I love group meeting </s> 2-grams example: { (<s>, I), (I, am), (am, Sharon) …, (meeting, </s>) } P( Sharon | am ) = C( Sharon, am ) / C(am) = 1 / 2 P( am | I ) = 2 / 3 P( I | am ) = 2 / 2 …. window size Ngram Language Model P(X1|C1) P(X2|C1) … P(XM|C1) P(X1|C2) P(X2|C2) … P(XM|C2) … P(X1|CN) P(X2|CN) … P(XM|CN) store
  • 6. RNN Language Model !6 (concept) RNN xt gt P(“EverWing” | context) P(“boyfriends”| context) … Expected Output: (lots of weights) RNN x1 g1 RNN x2 g2 RNN x3 g3 RNN xn gn … = (unroll) Sharon likes to playInput: (Sharon’s diary) context [P(X1|Ct) P(X2|Ct) … P(XM|Ct)] probability vector fit (approximate) [vector]
  • 7. RNN Language Model !7 RNN x1 g1 RNN x2 g2 RNN x3 g3 RNN xn gn … (practical) [P(X1|Ct) P(X2|Ct) … P(XM|Ct)] Softmax(hT w) context Sharon likes to play [context vector] (h) hT w = [hT w1, hT w2, … hT wM] [word vector] [word vector] [word vector] [word vector]w X1 (w1, w2, …, wM) X2 1. context encode 2. relation to each word setting # of words: 10,000 word vector dim: 400 (dim: 400) (dim: 400) (dim: 10,000) (dim: [10,000, 400]) (dim: 10,000)
  • 8. Softmax function !8 [ 1, 12, 7, 11 ] [hT w1, hT w2, hT w3, hT w4]Ct = “Sharon likes to play” ⇒ context vector (h) X1 = “EverWing” ⇒ w1 X2 = “boyfriends” ⇒ w2 X3 = “Instagram” ⇒ w3 X4 = “MaLuLian” ⇒ w4 [ 0.0, 0.73, 0.0, 0.27 ][P(X1|Ct), P(X2|Ct), P(X3|Ct), P(X4|Ct)] sum = e1 + e12 + e7 + e11 [ ] e1 sum e12 sum e7 sum e11 sum Softmax(hT w) = (word vector) dot product Goal:
  • 9. back to RNN LM !9 RNN gn [P(X1|Ct) P(X2|Ct) … P(XT|Ct)] Softmax(hT w) [context vector] (h) hT w = [hT w1, hT w2, … hT wT] (dim: 400) (dim: [10,000, 400]) (dim: 10,000) (dim: 10,000) (w1, w2, …, wT) Is it capable of modeling the Natural Language? Softmax bottleneck If NO! the performance of our model is limited setting # of words: 10,000 word vector dim: 400
  • 10. Contribution !10 • Identify this Softmax bottleneck. • Propose a new method to address this problem called Mixture of Softmaxes (MoS) by matrix factorization (in Section 2) and experiments (in Section 3) (M = LU)
  • 11. why Softmax !11 [ 1, 12, 7, 11 ] [ 0.0, 0.73, 0.01, 0.26 ] (usually `argmax` or `category label` would be like: [0, 1, 0, 0]) Softmax() simple normalize [ 0.03, 0.38, 0.22, 0.36 ] It’s more like `argmax`! too close
  • 13. • Rank & matrix factorization • Softmax-based RNN LM & matrix form • Identify the Softmax bottleneck • Propose hypothesis: Natural Language is high-rank • Easy fixes? • Mixture of Softmaxes (MoS) & Mixture of Contexts (MoC) !13 Flow of Section 2
  • 14. • We can consider rank is the number of i.i.d. features (dimensions) • If A is a m x n matrix: rank(A) ≤ min(m, n) • example: !14 Rank Age Height Weight H-W Gender 18 183 90 93 M 22 158 48 110 F 28 159 52 107 F 24 175 73 102 M Age Height Weight Gender 18 183 90 M 22 158 48 F 28 159 52 F 24 175 73 M dim = 4x3 → rank = 3 dim = 4x4 → rank = 4? (in matrix) NO! → rank = 3
  • 15. rank(V) = rank(WH) ≤ min(rank(W), rank(H)) !15 Matrix Factorization WH = V Property: 4 x 62 x 64 x 2 If A is a m x n matrix: rank(A) ≤ min(m, n) = 2
  • 16. • hc: context vector • wx: word embedding vector • θ: all the parameters (weights) in our model • Pθ: our model • P*: true data distribution !16 Sb RNN LM (dim: 400) (dim: 400) by θ logit P(X1|C1) P(X2|C1) … P(XM|C1) P(X1|C2) P(X2|C2) … P(XM|C2) … P(X1|CN) P(X2|CN) … P(XM|CN) (Softmax-based RNN Language Model)
  • 17. !17 Matrix form (N: # of context) (M: # of token) star (*) means: true data distribution our model our model our model ground truth Softmax(HθWθ) = Softmax(A) =
  • 18. !18 Concept of derivation Softmax SoftmaxSoftmax if we want = F(A) is an infinite set that contain all possible logits. matrix factorization
  • 19. !19 Matrix form (N: # of context) (M: # of token) star (*) means: true data distribution our model our model our model ground truth Softmax(HθWθ) = Softmax(A) =
  • 20. !20 F(A): row-wise shift operation JN,M is a all-one matrixF is an infinite set. all possible logits are in F(A) max rank difference = 1
  • 21. !21 Concept of derivation Softmax SoftmaxSoftmax if we want = F(A) is an infinite set that contain all possible logits. matrix factorization
  • 22. !22 Matrix Factorization our model a matrix related to the ground truth rank(A′) = rank(HθWθ) = min(rank(Hθ), rank(Wθ)) ≤ d rank(A′) ≤ d (is upper bounded by d) Softmax bottleneck rank(V) = rank(WH) ≤ min(rank(W), rank(H)) WH = V
  • 23. !23 Softmax bottleneck If d is too small, our model (Softmax) does not have the capacity to express the true data distribution. If rank(A) − 1 > d, there must exists a context c, such that if we want rank(A′) ≤ d (is upper bounded by d)
  • 24. Hypothesis: NL (A) is High-Rank 1. Natural language is highly context-dependent. 2. If A is low-rank 3. (Experiment) The high-rank LM outperforms many low-rank LMs. (in Section 3) !24 token: “north” in international news & geo. textbook They hypothesize it should result in a high-rank matrix A. We can create all semantic meanings by a few hundred bases. Softmax bottleneck limits the expressiveness of the model. learn a low-rank model rank( Softmax-based model ) ≤ d << rank(A) 3 reasons If true
  • 25. • Increasing the dimension d ? !25 Easy Fixes? Higher rank Increasing the number of parameters, too. Curse of dimensionality Empirical conclusion: increasing the d beyond several hundred is not helpful But ! (Merity et al., 2017; Melis et al., 2017; Krause et al., 2017) → Overfitting Mixture of Softmaxes (MoS) high-rank, but do not increase the dimension d rank( Softmax-based model ) ≤ d << rank(A)
  • 26. !26 Mixture of Softmaxes π: mixture weight, or call “prior” wπ,k & Wh,k are new model parameters K is a hyper-parameter (high-rank Language Model)
  • 27. !27 Mixture of Softmaxes context vector RNN gn RNN output (gt) priorprior logit logit h = tanh(W · gt) w · gt π h · wx weighted average ( ×→∑ ) Softmax Softmax K [π1, π2, …, πk] [h1w, h2w, …, hkw]
  • 28. !28 Mixture of Contexts (MoC) (low-rank baseline) It’s very like Softmax model
  • 30. • Evaluate the MoS model base on perplexity (PP) - Penn Treebank (PTB) - WikiText-2 (WT2) - 1B Word dataset (a larger dataset) • MoS is a generic model: experiment on Seq2Seq model (PP & BLEU) - Switch-board dataset (a dialog dataset) !30 Main Results (smaller value is better) (larger value is better) using MoS on state-of-the-art model
  • 31. !31 Main results on Penn Treebank (PTB) (PP: smaller value is better)
  • 32. !32 Main results on WikiText-2 (WT2) on 1B Word dataset
  • 33. !33 Main results on Seq2Seq model Switch-board dataset (a dialog dataset) (BLEU: larger value is better)
  • 34. !34 Ablation Study compare MoS & MoC MoC do not perform better
  • 35. !35 Verify the Role of rank Rank compassion on PTB. Test different # of Softmax (K) estimate the rank of A → The embedding sizes (d) of Softmax, MoC and MoS are 400, 280, 280 respectively. The vocabulary size (M) is 10,000 for all models.
  • 36. !36 Inverse experiment on character-level language model Softmax does not suffer from a rank limitation → MoS does not improve the performance (low-rank problem) → MoS’ performance is from it’s high-rank
  • 37. !37 MoS computational time relation between K & time → it’s `sub-linear` relation related to the size of dataset
  • 38. !38 Qualitative analysis to see how MoS improves the next-token prediction in detail
  • 39. !39 Qualitative analysis to see how MoS improves the next-token prediction in detail
  • 41. • Softmax-based language models is limited by the dimension of the word embedding → Softmax bottleneck • MoS improves the performance over Softmax avoid overfitting & naively increasing dimension • MoS improves the state-of-the-art results by a large margin → It’s importance to have a high-rank model for natural language. !41 Conclusions