Machine learning using Microsoft Azure at the Microsoft Research - Moscow State University Joint Research Centre. Specifically, this presentation describes the Adaptive Skip-gram (AdaGram) model, a nonparametric extension of the well-known Skip-gram model implemented in the word2vec software, which is able to learn multiple representations per word, capturing different word meanings. The project implements AdaGram in the Julia language.
The team uses Microsoft Azure to run at scale, allowing analysis of the Wikipedia corpus on large virtual machines.
MSU Bayes Group homepage - http://bayesgroup.ru/
GitHub project - https://github.com/sbos/AdaGram.jl
Research paper preprint - http://arxiv.org/abs/1502.07257
4. Learning as optimization
$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_{\theta}$$
where $r(\theta)$ is the regularizer, $f_i(x_i; \theta)$ is the loss on object $x_i$, and $\theta$ are the parameters.
• N can be huge
• the regularizer and loss can be complex
• the parameters' dimensionality can be very large
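To make the stochastic approach concrete, here is a minimal Julia sketch of stochastic gradient descent on this objective; `sgd!`, `grad_r`, and `grad_f` are hypothetical names introduced for illustration, not part of any real library.

```julia
using Random

# A minimal sketch of SGD on F(θ) = r(θ) + Σᵢ fᵢ(xᵢ; θ).
# `grad_r` and `grad_f` are hypothetical callbacks returning ∇r(θ) and ∇fᵢ(xᵢ; θ).
function sgd!(θ, xs, grad_r, grad_f; γ = 0.01, epochs = 5)
    N = length(xs)
    for _ in 1:epochs, i in shuffle(1:N)
        # unbiased stochastic gradient of F/N computed from a single object
        θ .-= γ .* (grad_r(θ) ./ N .+ grad_f(xs[i], θ))
    end
    return θ
end
```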
5. Learning as optimization
Commodity PC is not enough!
6. Learning word embeddings
For each word, find an embedding such that similar words have close embeddings.
[Figure: word clusters in the embedding space: {Java, Platform, .NET, Mono}; {Railways, Ticket, Train}; {Politics, Party, Socialism}]
9. Learning word embeddings
…compiled for a specific hardware platform, since different central processor…
object: a word and its context
loss: $\log p(v \mid w)$, where
$$p(v \mid w) = \frac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$$
10. Learning word embeddings
object: a word and its context
loss: $\log p(v \mid w)$, with $p(v \mid w)$ as defined above
parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$
This is the Skip-gram model (Mikolov et al., 2013).
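As a concrete illustration, a minimal Julia sketch of this softmax log-probability might look as follows; `skipgram_logprob` is a hypothetical helper, not the AdaGram.jl API.

```julia
# A minimal sketch of the Skip-gram softmax above (illustrative only).
# A and B are D×V matrices holding the input and output embeddings column-wise.
function skipgram_logprob(A::AbstractMatrix, B::AbstractMatrix, w::Integer, v::Integer)
    scores = B' * A[:, w]                  # A_wᵀ B_v′ for every v′ = 1..V
    m = maximum(scores)                    # log-sum-exp trick for numerical stability
    return scores[v] - (m + log(sum(exp.(scores .- m))))   # log p(v | w)
end
```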
23. Stochastic parallel optimization
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
My laptop (2 cores, 8 GB RAM): 22 hours
Large Azure virtual machine: 2 hours
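The speedup comes from running stochastic updates concurrently. A Hogwild-style lock-free scheme is standard for word2vec-style training; the sketch below assumes that scheme as an illustration and is not necessarily how AdaGram.jl actually parallelizes.

```julia
using Base.Threads: @threads

# A Hogwild-style lock-free parallel SGD sketch (an assumed scheme, for illustration).
function parallel_sgd!(θ, xs, grad_f; γ = 0.025)
    @threads for i in eachindex(xs)
        θ .-= γ .* grad_f(xs[i], θ)   # racy by design; tolerable when gradients are sparse
    end
    return θ
end
```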
24. Learning polysemic word embeddings
[Figure: three senses of "platform" in the embedding space: Platform (1) near {Java, .NET, Mono}; Platform (2) near {Railways, Ticket, Train}; Platform (3) near {Politics, Party, Socialism}]
25. Learning polysemic word embeddings
…compiled for a specific hardware platform, since different central processor… (computer meaning)
26. Learning polysemic word embeddings
…as the safe distance from the platform edge increases with the speed… (railway meaning)
27. Learning polysemic word embeddings
… Socialist Party; the Socialist Workers Platform and the Committee for a… (political meaning)
28. Learning polysemic word embeddings
Each of the three contexts above is scored under its own meaning:
loss: $\log p(v \mid w, z = 1)$
loss: $\log p(v \mid w, z = 2)$
loss: $\log p(v \mid w, z = 3)$
29. Learning polysemic word embeddings
$$p(v \mid w, z = k) = \frac{\exp(A_{wk}^\top B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^\top B_{v'})}$$
word meanings are unobserved
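The per-sense softmax only changes the input embedding lookup. A minimal Julia sketch, extending the earlier hypothetical helper by adding a sense axis to A (again illustrative, not the AdaGram.jl API):

```julia
# A per-sense variant of skipgram_logprob: A is now a D×K×V array,
# holding one input embedding per meaning k of each word w.
function sense_logprob(A::AbstractArray{<:Real,3}, B::AbstractMatrix,
                       w::Integer, k::Integer, v::Integer)
    scores = B' * A[:, k, w]               # A_wkᵀ B_v′ for every v′ = 1..V
    m = maximum(scores)
    return scores[v] - (m + log(sum(exp.(scores .- m))))   # log p(v | w, z = k)
end
```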
30. Learning polysemic word embeddings
$$\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A,B}$$
Word meanings are unobserved, hence an EM algorithm must be employed.
31. Learning polysemic word embeddings
Two questions remain:
• How to choose a prior that automatically increases the number of word meanings when necessary?
• How to put the EM procedure into a stochastic optimization framework?
32. Learning polysemic word embeddings
First question, choosing the prior: Bayesian nonparametrics (Orbanz, 2014)
33. Learning polysemic word embeddings
Second question, stochastic EM: stochastic variational inference (Blei et al., 2012)
35. EM algorithm
E-step: disambiguate the word given its context
… Socialist Party; the Socialist Workers Platform and the Committee for a…
p(z = politics) = 0.96
p(z = transport) = 0.01
p(z = computer) = 0.03
36. EM algorithm
M-step: update the word embeddings by a weighted gradient step
$$\theta^{t+1} = \theta^t + \gamma_t \nabla \left[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta^t) \right]$$
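Putting the two steps together, here is a minimal sketch of one stochastic EM update in Julia, reusing the hypothetical `sense_logprob` helper from above. It is illustrative only, not the AdaGram.jl implementation: `prior` stands in for the sense prior p(z | α), and for brevity only the input embeddings A are updated.

```julia
# One stochastic EM update for word w and its context words (a sketch).
function em_step!(A, B, w::Integer, context::Vector{Int},
                  prior::Vector{Float64}, γ::Float64)
    K = size(A, 2)
    # E-step: responsibilities p(z = k | w, context) ∝ p(z = k) ∏ⱼ p(vⱼ | w, z = k)
    logr = [log(prior[k]) + sum(sense_logprob(A, B, w, k, v) for v in context)
            for k in 1:K]
    r = exp.(logr .- maximum(logr)); r ./= sum(r)
    # M-step: weighted gradient ascent on Σₖ rₖ Σⱼ log p(vⱼ | w, z = k, θ)
    for k in 1:K, v in context
        scores = B' * A[:, k, w]
        p = exp.(scores .- maximum(scores)); p ./= sum(p)   # softmax over the vocabulary
        A[:, k, w] .+= γ * r[k] * (B[:, v] - B * p)         # ∇_{A_wk} log p(v | w, z = k)
    end
    return A
end
```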
37. Learning polysemic word embeddings
• 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word
• 7.2 billion parameters to train!
• 18 GB memory snapshot
38. Learning polysemic word embeddings
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
My laptop (2 cores, 8 GB RAM): 6 days!
Large Azure virtual machine: 16 hours
46. …and thanks to Microsoft Research and the Microsoft Azure team!
Dmitry Kondrashkin, Anton Osokin, Dmitry P. Vetrov
project page: bayesgroup.ru/adagram
sources: github.com/sbos/AdaGram.jl