Machine learning using Microsoft Azure at the Microsoft Research - Moscow State University Joint Research Centre. Specifically, this presentation describes the Adaptive Skip-gram (AdaGram) model, a nonparametric extension of the well-known Skip-gram model implemented in the word2vec software, which is able to learn multiple representations per word, capturing different word meanings. The project implements AdaGram in the Julia language.
The team uses Microsoft Azure to run at scale, allowing analysis of the Wikipedia corpus on large virtual machines.
MSU Bayes Group homepage - http://bayesgroup.ru/
GitHub project - https://github.com/sbos/AdaGram.jl
Research paper preprint - http://arxiv.org/abs/1502.07257
4. Learning as optimization
$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_{\theta}$$
where $r(\theta)$ is the regularizer, $f_i(x_i; \theta)$ is the loss on object $x_i$, and $\theta$ are the parameters.
• N can be huge
• the regularizer and loss can be complex
• the parameters' dimensionality can be very large
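To make the stochastic approach concrete, here is a minimal Julia sketch of stochastic gradient descent on this objective; `sgd!`, `grad_r`, and `grad_f` are hypothetical names introduced for illustration, not part of any real library.

```julia
using Random

# A minimal sketch of SGD on F(θ) = r(θ) + Σᵢ fᵢ(xᵢ; θ).
# `grad_r` and `grad_f` are hypothetical callbacks returning ∇r(θ) and ∇fᵢ(xᵢ; θ).
function sgd!(θ, xs, grad_r, grad_f; γ = 0.01, epochs = 5)
    N = length(xs)
    for _ in 1:epochs, i in shuffle(1:N)
        # unbiased stochastic gradient of F/N computed from a single object
        θ .-= γ .* (grad_r(θ) ./ N .+ grad_f(xs[i], θ))
    end
    return θ
end
```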
5. Learning as optimization
Commodity PC is not enough!
6. Learning word embeddings
For each word, find an embedding such that similar words have close embeddings.
[Figure: word clusters in the embedding space: {Java, Platform, .NET, Mono}; {Railways, Ticket, Train}; {Politics, Party, Socialism}]
9. Learning word embeddings
…compiled for a specific hardware platform, since different central processor…
object: a word and its context
loss: $\log p(v \mid w)$, where
$$p(v \mid w) = \frac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$$
10. Learning word embeddings
object: a word and its context
loss: $\log p(v \mid w)$, with $p(v \mid w)$ as defined above
parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$
This is the Skip-gram model (Mikolov et al., 2013).
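As a concrete illustration, a minimal Julia sketch of this softmax log-probability might look as follows; `skipgram_logprob` is a hypothetical helper, not the AdaGram.jl API.

```julia
# A minimal sketch of the Skip-gram softmax above (illustrative only).
# A and B are D×V matrices holding the input and output embeddings column-wise.
function skipgram_logprob(A::AbstractMatrix, B::AbstractMatrix, w::Integer, v::Integer)
    scores = B' * A[:, w]                  # A_wᵀ B_v′ for every v′ = 1..V
    m = maximum(scores)                    # log-sum-exp trick for numerical stability
    return scores[v] - (m + log(sum(exp.(scores .- m))))   # log p(v | w)
end
```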
23. Stochastic parallel optimization
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
My laptop (2 cores, 8 GB RAM): 22 hours
Large Azure virtual machine: 2 hours
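The speedup comes from running stochastic updates concurrently. A Hogwild-style lock-free scheme is standard for word2vec-style training; the sketch below assumes that scheme as an illustration and is not necessarily how AdaGram.jl actually parallelizes.

```julia
using Base.Threads: @threads

# A Hogwild-style lock-free parallel SGD sketch (an assumed scheme, for illustration).
function parallel_sgd!(θ, xs, grad_f; γ = 0.025)
    @threads for i in eachindex(xs)
        θ .-= γ .* grad_f(xs[i], θ)   # racy by design; tolerable when gradients are sparse
    end
    return θ
end
```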
24. Learning polysemic word embeddings
[Figure: three senses of "platform" in the embedding space: Platform (1) near {Java, .NET, Mono}; Platform (2) near {Railways, Ticket, Train}; Platform (3) near {Politics, Party, Socialism}]
25. Learning polysemic word embeddings
…compiled for a specific hardware platform, since different central processor… (computer meaning)
26. Learning polysemic word embeddings
…as the safe distance from the platform edge increases with the speed… (railway meaning)
27. Learning polysemic word embeddings
… Socialist Party; the Socialist Workers Platform and the Committee for a… (political meaning)
28. Learning polysemic word embeddings
Each of the three contexts above is scored under its own meaning:
loss: $\log p(v \mid w, z = 1)$
loss: $\log p(v \mid w, z = 2)$
loss: $\log p(v \mid w, z = 3)$
29. Learning polysemic word embeddings
$$p(v \mid w, z = k) = \frac{\exp(A_{wk}^\top B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^\top B_{v'})}$$
word meanings are unobserved
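The per-sense softmax only changes the input embedding lookup. A minimal Julia sketch, extending the earlier hypothetical helper by adding a sense axis to A (again illustrative, not the AdaGram.jl API):

```julia
# A per-sense variant of skipgram_logprob: A is now a D×K×V array,
# holding one input embedding per meaning k of each word w.
function sense_logprob(A::AbstractArray{<:Real,3}, B::AbstractMatrix,
                       w::Integer, k::Integer, v::Integer)
    scores = B' * A[:, k, w]               # A_wkᵀ B_v′ for every v′ = 1..V
    m = maximum(scores)
    return scores[v] - (m + log(sum(exp.(scores .- m))))   # log p(v | w, z = k)
end
```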
30. Learning polysemic word embeddings
$$\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A,B}$$
Word meanings are unobserved, hence an EM algorithm must be employed.
31. Learning polysemic word embeddings
Two questions remain:
• How to choose a prior that automatically increases the number of word meanings when necessary?
• How to put the EM procedure into a stochastic optimization framework?
32. Learning polysemic word embeddings
First question, choosing the prior: Bayesian nonparametrics (Orbanz, 2014)
33. Learning polysemic word embeddings
Second question, stochastic EM: stochastic variational inference (Blei et al., 2012)
35. EM algorithm
E-step: disambiguate the word given its context
… Socialist Party; the Socialist Workers Platform and the Committee for a…
p(z = politics) = 0.96
p(z = transport) = 0.01
p(z = computer) = 0.03
36. EM algorithm
M-step: update the word embeddings by a weighted gradient step
$$\theta^{t+1} = \theta^t + \gamma_t \nabla \left[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta^t) \right]$$
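Putting the two steps together, here is a minimal sketch of one stochastic EM update in Julia, reusing the hypothetical `sense_logprob` helper from above. It is illustrative only, not the AdaGram.jl implementation: `prior` stands in for the sense prior p(z | α), and for brevity only the input embeddings A are updated.

```julia
# One stochastic EM update for word w and its context words (a sketch).
function em_step!(A, B, w::Integer, context::Vector{Int},
                  prior::Vector{Float64}, γ::Float64)
    K = size(A, 2)
    # E-step: responsibilities p(z = k | w, context) ∝ p(z = k) ∏ⱼ p(vⱼ | w, z = k)
    logr = [log(prior[k]) + sum(sense_logprob(A, B, w, k, v) for v in context)
            for k in 1:K]
    r = exp.(logr .- maximum(logr)); r ./= sum(r)
    # M-step: weighted gradient ascent on Σₖ rₖ Σⱼ log p(vⱼ | w, z = k, θ)
    for k in 1:K, v in context
        scores = B' * A[:, k, w]
        p = exp.(scores .- maximum(scores)); p ./= sum(p)   # softmax over the vocabulary
        A[:, k, w] .+= γ * r[k] * (B[:, v] - B * p)         # ∇_{A_wk} log p(v | w, z = k)
    end
    return A
end
```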
37. Learning polysemic word embeddings
• 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word
• 7.2 billion parameters to train!
• 18 GB memory snapshot
38. Learning polysemic word embeddings
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
My laptop (2 cores, 8 GB RAM): 6 days!
Large Azure virtual machine: 16 hours
46. …and thanks to Microsoft Research and the Microsoft Azure team!
Dmitry Kondrashkin, Anton Osokin, Dmitry P. Vetrov
project page: bayesgroup.ru/adagram
sources: github.com/sbos/AdaGram.jl