1. Don’t count, predict!
A systematic comparison of context-counting vs.
context-predicting semantic vectors
Marco Baroni, Georgiana Dinu,
Germán Kruszewski
ACL 2014
Presented by: Sho Takase (D1, Tohoku University)
7. Aggregating co-occurrence information
• Count co-occurrences with words within a window of size n
• Pointwise Mutual Information (PMI)
– Measures the strength of association between two words (wi, wj)
– Maximized when one word always appears whenever the other does
• Local Mutual Information (LMI) [Evert+ 05]
PMI(w_i, w_j) = log( p(w_i, w_j) / (p(w_i) · p(w_j)) )

LMI(w_i, w_j) = p(w_i, w_j) · log( p(w_i, w_j) / (p(w_i) · p(w_j)) )
(Figure: the example sentence "In 1912, Kafka wrote the story…" with a window of size n on each side of the target word)
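As a sketch of these two weighting schemes, a minimal numpy implementation (the `pmi_lmi` helper name and the toy count matrix are assumptions, not from the paper):

```python
import numpy as np

def pmi_lmi(counts):
    """PMI and LMI matrices from a word-word co-occurrence count matrix.

    Cells with zero co-occurrence count are left at 0.
    """
    total = counts.sum()
    p_ij = counts / total                  # joint probability p(wi, wj)
    p_i = p_ij.sum(axis=1, keepdims=True)  # marginal p(wi)
    p_j = p_ij.sum(axis=0, keepdims=True)  # marginal p(wj)
    pmi = np.zeros_like(p_ij)
    mask = p_ij > 0
    pmi[mask] = np.log(p_ij[mask] / (p_i * p_j)[mask])
    lmi = p_ij * pmi                       # LMI weights PMI by the joint probability
    return pmi, lmi
```

Multiplying PMI by p(wi, wj) is what keeps LMI from overweighting very rare pairs, which is the usual motivation for preferring it over raw PMI.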
8. Dimensionality reduction
• Compress each word's context vector to k dimensions
– Using matrix compression techniques (SVD, NMF)
• SVD (singular value decomposition)
– Build an m × k matrix from Uk and Σk
• NMF (Nonnegative Matrix Factorization) [Lee+ 00]
– Approximates the matrix as the product of two nonnegative matrices
(Figure: full SVD, matrix (m × n) = U (m × m) · Σ (m × n) · VT (n × n); keeping only the top k singular values, matrix (m × n) ≅ Uk Σk VkT, and the m × k matrix Uk Σk is taken as the compressed result)
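The truncation step above can be sketched in a few lines of numpy (a random matrix stands in for a real co-occurrence matrix; sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 8))             # toy context matrix: 6 words x 8 context features
k = 3                              # target dimensionality

# full_matrices=False gives the thin SVD: U (6x6), s (6,), Vt (6x8)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * s[:k]    # the m x k matrix Uk @ diag(s_k), the compressed result
X_k = word_vectors @ Vt[:k, :]     # best rank-k approximation of X
```

By the Eckart-Young theorem, `X_k` is the best rank-k approximation of `X` in Frobenius norm, which is why truncated SVD is the standard choice here.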
10. word2vec (CBOW)
• Predicts the center word from the surrounding words (window of size n)
• Faster than skip-gram (which predicts the surrounding words), and performs better on large data
• Various tricks for speed and accuracy
– Hierarchical softmax, negative sampling, subsampling
– (not covered today for lack of time…)
…training time. The basic Skip-gram formulation defines p(wt+j | wt) using the softmax function:

p(wO | wI) = exp(v'wO⊤ vwI) / Σw=1..W exp(v'w⊤ vwI)

where vw and v'w are the "input" and "output" vector representations of w, and W is the number of words in the vocabulary. This formulation is impractical because the cost of computing ∇ log p(wO | wI) is proportional to W, which is often large (10^5–10^7 terms).
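A toy numpy version of this full softmax makes the O(W) cost concrete (vocabulary size, dimension, and random vectors are all assumptions for illustration):

```python
import numpy as np

W, d = 1000, 50                    # toy vocabulary size and vector dimension
rng = np.random.default_rng(0)
v_in = rng.normal(size=(W, d))     # "input" vectors v_w
v_out = rng.normal(size=(W, d))    # "output" vectors v'_w

def p_full_softmax(wO, wI):
    scores = v_out @ v_in[wI]      # one dot product per vocabulary word: cost O(W)
    scores -= scores.max()         # subtract the max for numerical stability
    e = np.exp(scores)
    return e[wO] / e.sum()
```

The denominator touches every one of the W output vectors, which is exactly the term that hierarchical softmax and negative sampling are designed to avoid.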
2.1 Hierarchical Softmax
A computationally efficient approximation of the full softmax is the hierarchical softmax. In the context of neural network language models, it was first introduced by Morin and Bengio. The main advantage is that instead of evaluating W output nodes in the neural network to obtain the probability distribution, it is needed to evaluate only about log2(W) nodes.
The hierarchical softmax uses a binary tree representation of the output layer with the W words as its leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These define a random walk that assigns probabilities to words.
More precisely, each word w can be reached by an appropriate path from the root of the tree. Let n(w, j) be the j-th node on the path from the root to w, and let L(w) be the length of this path.
(Annotation: wI → wO — this computation is expensive because it ranges over the whole vocabulary, so various tricks are applied)
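The log2(W) claim in the excerpt can be illustrated with a toy sketch (this is not word2vec's actual tree construction or training code, and all node scores are made-up values): each word's probability is the product of binary left/right decisions along its root-to-leaf path, so only about log2(W) sigmoids are evaluated instead of W output nodes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(decisions, scores):
    """Probability of reaching one leaf: decisions[j] is +1 (go left) or -1
    (go right) at the j-th node on the path, scores[j] is that node's score.
    Only len(decisions) ~ log2(W) sigmoids are evaluated, not W."""
    p = 1.0
    for d, s in zip(decisions, scores):
        p *= sigmoid(d * s)        # sigmoid(-s) = 1 - sigmoid(s) covers "go right"
    return p

# Toy complete tree with 4 leaves; s0 is the root score, s1 and s2 its children.
s0, s1, s2 = 0.3, -1.2, 0.7
leaf_probs = [
    path_probability([+1, +1], [s0, s1]),
    path_probability([+1, -1], [s0, s1]),
    path_probability([-1, +1], [s0, s2]),
    path_probability([-1, -1], [s0, s2]),
]
```

Because sigmoid(s) and sigmoid(-s) sum to 1 at every node, the leaf probabilities form a proper distribution over the vocabulary by construction, with no O(W) normalization term.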