Breaking the Softmax Bottleneck: a high-rank RNN Language Model

BREAKING THE SOFTMAX BOTTLENECK:
A HIGH-RANK RNN LANGUAGE MODEL
Zhilin Yang∗, Zihang Dai∗, Ruslan Salakhutdinov, William W. Cohen
School of Computer Science
Carnegie Mellon University
{zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu
Publish as a conference paper at ICLR 2018
∗Equal contribution. Ordering determined by dice rolling. Ray

Outline
• Introduction

• Language Modeling as Matrix Factorization

• Experiments

• Related work

• Conclusions
!2

!5
<s> I am Sharon </s>
<s> Sharon I am </s>
<s> I love group meeting </s>
2-grams example:
{ (<s>, I), (I, am), (am, Sharon) …, (meeting, </s>) }

P( Sharon | am ) = C( Sharon, am ) / C(am) = 1 / 2

P( am | I ) = 2 / 3

P( I | am ) = 2 / 2

….
window size
Ngram Language Model
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
store

RNN Language Model
!6
(concept)
RNN
xt
gt
P(“EverWing” | context)
P(“boyfriends”| context)
…
Expected
Output:
(lots of weights)
RNN
x1
g1
RNN
x2
g2
RNN
x3
g3
RNN
xn
gn
…
=
(unroll)
Sharon likes to playInput:
(Sharon’s diary)
context
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
probability vector
ﬁt
(approximate)
[vector]

RNN Language Model
!7
RNN
x1
g1
RNN
x2
g2
RNN
x3
g3
RNN
xn
gn
…
(practical)
[P(X1|Ct) P(X2|Ct) … P(XM|Ct)]
Softmax(hT w)
context Sharon likes to play
[context vector] (h)
hT w = [hT w1, hT w2, … hT wM]
[word vector] [word vector] [word vector] [word vector]w
X1
(w1, w2, …, wM)
X2
1. context encode
2. relation to each word
setting

# of words: 10,000

word vector dim: 400
(dim: 400)
(dim: 400)
(dim: 10,000)
(dim: [10,000, 400])
(dim: 10,000)

Softmax function
!8
[ 1, 12, 7, 11 ]
[hT w1, hT w2, hT w3, hT w4]Ct = “Sharon likes to play”
⇒ context vector (h)
X1 = “EverWing” ⇒ w1
X2 = “boyfriends” ⇒ w2
X3 = “Instagram” ⇒ w3
X4 = “MaLuLian” ⇒ w4
[ 0.0, 0.73, 0.0, 0.27 ][P(X1|Ct), P(X2|Ct), P(X3|Ct), P(X4|Ct)]
sum = e1
+ e12
+ e7
+ e11
[ ]
e1
sum
e12
sum
e7
sum
e11
sum
Softmax(hT w)
=
(word vector)
dot product
Goal:

back to RNN LM
!9
RNN
gn
[P(X1|Ct) P(X2|Ct) … P(XT|Ct)]
Softmax(hT w)
[context vector] (h)
hT w = [hT w1, hT w2, … hT wT]
(dim: 400)
(dim: [10,000, 400])
(dim: 10,000)
(dim: 10,000)
(w1, w2, …, wT) Is it capable of modeling the
Natural Language?
Softmax bottleneck
If NO!
the performance of our model is limited
setting

# of words: 10,000

word vector dim: 400

Contribution
!10
• Identify this Softmax bottleneck.

• Propose a new method to address this problem

called Mixture of Softmaxes (MoS)
by matrix factorization (in Section 2) and experiments (in Section 3)
(M = LU)

why Softmax
!11
[ 1, 12, 7, 11 ]
[ 0.0, 0.73, 0.01, 0.26 ]
(usually `argmax` or `category label` would be like: [0, 1, 0, 0])
Softmax() simple normalize
[ 0.03, 0.38, 0.22, 0.36 ]
It’s more like `argmax`!
too close

Language Modeling
as
Matrix Factorization
Section 2

• Rank & matrix factorization

• Softmax-based RNN LM & matrix form

• Identify the Softmax bottleneck

• Propose hypothesis: Natural Language is high-rank

• Easy ﬁxes?

• Mixture of Softmaxes (MoS) & Mixture of Contexts (MoC)
!13
Flow of Section 2

• We can consider rank is the number of i.i.d. features (dimensions)

• If A is a m x n matrix: rank(A) ≤ min(m, n)

• example:
!14
Rank
Age Height Weight H-W Gender
18 183 90 93 M
22 158 48 110 F
28 159 52 107 F
24 175 73 102 M
Age Height Weight Gender
18 183 90 M
22 158 48 F
28 159 52 F
24 175 73 M
dim = 4x3 → rank = 3 dim = 4x4 → rank = 4?
(in matrix)
NO! → rank = 3

rank(V) = rank(WH) ≤ min(rank(W), rank(H))
!15
WH = V
Property:
4 x 62 x 64 x 2
If A is a m x n matrix: rank(A) ≤ min(m, n)
= 2

• hc: context vector

• wx: word embedding vector

• θ: all the parameters (weights) in our model

• Pθ: our model

• P*: true data distribution
!16
Sb RNN LM
(dim: 400)
(dim: 400)
by θ
logit
P(X1|C1) P(X2|C1) … P(XM|C1)
P(X1|C2) P(X2|C2) … P(XM|C2)
…
(Softmax-based RNN Language Model)

!17
Matrix form
(N: # of context) (M: # of token)
star (*) means: true data distribution
our model
our model our model
ground truth
Softmax(HθWθ) =
Softmax(A) =

!18
Concept of derivation
Softmax SoftmaxSoftmax
if we want
=
F(A) is an inﬁnite set that
contain all possible logits.
matrix factorization

!19
Matrix form
(N: # of context) (M: # of token)
star (*) means: true data distribution
our model
our model our model
ground truth
Softmax(HθWθ) =
Softmax(A) =

!20
F(A): row-wise shift operation
JN,M is a all-one matrixF is an inﬁnite set.
all possible logits are in F(A)
max rank diﬀerence = 1

!21
Concept of derivation
Softmax SoftmaxSoftmax
if we want
=
F(A) is an inﬁnite set that
contain all possible logits.
matrix factorization

!22
our model a matrix related to
the ground truth
rank(A′) = rank(HθWθ) = min(rank(Hθ), rank(Wθ)) ≤ d
rank(A′) ≤ d
(is upper bounded by d)
Softmax bottleneck
rank(V) = rank(WH) ≤ min(rank(W), rank(H))
WH = V

!23
Softmax bottleneck
If d is too small, our model (Softmax) does not have the capacity

to express the true data distribution.
If rank(A) − 1 > d,
there must exists a context c, such that
if we want
rank(A′) ≤ d
(is upper bounded by d)

Hypothesis: NL (A) is High-Rank
1. Natural language is highly context-dependent.

2. If A is low-rank

3. (Experiment) The high-rank LM outperforms many low-rank LMs. (in Section 3)
!24
token: “north” in
international news & geo. textbook
They hypothesize it should result in a high-rank matrix A.
We can create all semantic meanings by a few hundred bases.
Softmax bottleneck limits the expressiveness of the model.
learn a low-rank model rank( Softmax-based model ) ≤ d << rank(A)
3 reasons
If true

• Increasing the dimension d ?
!25
Easy Fixes?
Higher rank
Increasing the number of parameters, too.
Curse of dimensionality
Empirical conclusion: increasing the d beyond several hundred is not helpful
But !
(Merity et al., 2017; Melis et al., 2017; Krause et al., 2017)
→ Overﬁtting
Mixture of Softmaxes (MoS) high-rank, but do not increase the dimension d
rank( Softmax-based model ) ≤ d << rank(A)

!26
Mixture of Softmaxes
π: mixture weight, or call “prior”

wπ,k & Wh,k are new model parameters

K is a hyper-parameter
(high-rank Language Model)

!27
Mixture of Softmaxes
context vector
RNN
gn
RNN output (gt)
priorprior logit
logit
h = tanh(W · gt)
w · gt π
h · wx
weighted average
( ×→∑ )
Softmax
Softmax
K
[π1, π2, …, πk]
[h1w, h2w, …, hkw]

!28
Mixture of Contexts (MoC)
(low-rank baseline)
It’s very like Softmax model

• Evaluate the MoS model base on perplexity (PP)

- Penn Treebank (PTB)

- WikiText-2 (WT2)

- 1B Word dataset (a larger dataset)

• MoS is a generic model: experiment on Seq2Seq model (PP & BLEU)

- Switch-board dataset (a dialog dataset)
!30
Main Results
(smaller value is better)
(larger value is better)
using MoS on state-of-the-art model

!31
Main results
on Penn Treebank (PTB)
(PP: smaller value is better)

!32
Main results
on WikiText-2 (WT2)
on 1B Word dataset

!33
Main results
on Seq2Seq model
Switch-board dataset (a dialog dataset)
(BLEU: larger value is better)

!34
Ablation Study
compare MoS & MoC
MoC do not perform better

!35
Verify the Role of rank
Rank compassion on PTB.
Test diﬀerent # of Softmax (K)
estimate the rank of A →
The embedding sizes (d) of
Softmax, MoC and MoS are 400, 280, 280 respectively.
The vocabulary size (M) is 10,000 for all models.

!36
Inverse experiment
on character-level language model
Softmax does not suﬀer from a rank limitation
→ MoS does not improve the performance
(low-rank problem)
→ MoS’ performance is from it’s high-rank

!37
MoS computational time
relation between K & time
→ it’s `sub-linear` relation

related to the size of dataset

!38
Qualitative analysis
to see how MoS improves the next-token prediction in detail

!39
Qualitative analysis
to see how MoS improves the next-token prediction in detail

• Softmax-based language models

is limited by the dimension of the word embedding

→ Softmax bottleneck

• MoS

improves the performance over Softmax

avoid overﬁtting & naively increasing dimension

• MoS improves the state-of-the-art results by a large margin

→ It’s importance to have a high-rank model for natural language.
!41
Conclusions

Breaking the Softmax Bottleneck: a high-rank RNN Language Model

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a Breaking the Softmax Bottleneck: a high-rank RNN Language Model

Similar a Breaking the Softmax Bottleneck: a high-rank RNN Language Model (20)

Último

Último (20)

Breaking the Softmax Bottleneck: a high-rank RNN Language Model