Parallel asynchronous inference of word senses with Azure
Sergey Bartunov, MSU
Learning as optimization
$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_{\theta}$$
Here $r$ is the regularizer, $f_i$ the loss, $x_i$ the object and $\theta$ the parameters.
• $N$ can be huge
• regularizer and loss can be complex
• parameters’ dimensionality can be very large
Commodity PC is not enough!
Learning word embeddings
For each word, find its embedding such that similar words have close embeddings.
[Figure: words in embedding space: Java, Platform, .NET, Mono; Railways, Ticket, Train; Politics, Party, Socialism]
Learning word embeddings
…compiled for a specific hardware platform, since different central processor…
object: word and its context
loss: $-\log p(v \mid w)$, where
$$p(v \mid w) = \frac{\exp(A_w^T B_v)}{\sum_{v'=1}^{V} \exp(A_w^T B_{v'})}$$
parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$
Skip-gram (Mikolov et al., 2013)
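The softmax above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the function name `softmax_context_probs`, the toy embeddings and the vocabulary size are all made up for the example, and the loss is taken to be the negative log-likelihood of the observed context word.

```python
import math

def softmax_context_probs(A_w, B):
    """Return p(v | w) for every candidate context word v.

    A_w : input embedding of the center word w (list of D floats)
    B   : output embeddings, one list of D floats per vocabulary word
    """
    scores = [sum(a * b for a, b in zip(A_w, B_v)) for B_v in B]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of V = 3 words with D = 2 dimensional embeddings (made-up numbers).
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
A_w = [2.0, 0.0]
probs = softmax_context_probs(A_w, B)
loss = -math.log(probs[0])  # negative log-likelihood of observing context word 0
```

The normalizing sum runs over the whole vocabulary, which is why evaluating this loss exactly becomes expensive when V is in the hundreds of thousands.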
Gradient optimization
$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_{\theta}$$
gradient descent:
$$\theta^{t+1} = \theta^t - \lambda_t \nabla F(\theta^t)$$
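As a small worked case of the gradient descent update, the sketch below minimizes a toy one-dimensional objective with a quadratic regularizer and squared-error losses. The objective, the constant step size and the data are assumptions chosen so that the exact minimizer is known in closed form.

```python
def grad_F(theta, data, reg_strength):
    """Gradient of F(theta) = 0.5*reg*theta^2 + sum_i 0.5*(theta - x_i)^2."""
    return reg_strength * theta + sum(theta - x for x in data)

def gradient_descent(data, reg_strength=1.0, step=0.1, iterations=200):
    theta = 0.0
    for _ in range(iterations):
        # Every step touches all N objects, which is the cost SGD avoids.
        theta -= step * grad_F(theta, data, reg_strength)
    return theta

data = [1.0, 2.0, 3.0, 6.0]
theta_hat = gradient_descent(data)
# Closed-form minimizer for this toy problem: sum(x) / (N + reg_strength).
theta_star = sum(data) / (len(data) + 1.0)
```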
Stochastic optimization
$$F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_{\theta}$$
stochastic gradient descent:
$$\theta^{t+1} = \theta^t - \lambda_t G(\theta^t), \qquad \mathbb{E}\, G(\theta) = \nabla F(\theta)$$
for example: $G(\theta) = \nabla [r(\theta) + N f_j(x_j; \theta)]$, $j \sim \mathrm{Uniform}(1, N)$
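The unbiasedness condition can be checked directly on a toy problem. The sketch below reuses an assumed one-dimensional objective (quadratic regularizer plus squared-error losses); the function names, the decaying step-size schedule and the data are illustrative choices, not taken from the talk.

```python
import random

def full_gradient(theta, data, reg_strength):
    """Exact gradient of F(theta) = 0.5*reg*theta^2 + sum_i 0.5*(theta - x_i)^2."""
    return reg_strength * theta + sum(theta - x for x in data)

def stochastic_gradient(theta, data, reg_strength, rng):
    """G(theta) = grad[r(theta) + N * f_j(x_j; theta)], j ~ Uniform(1, N)."""
    x_j = rng.choice(data)
    return reg_strength * theta + len(data) * (theta - x_j)

def expected_stochastic_gradient(theta, data, reg_strength):
    """Average G over all N equally likely choices of j (exact expectation)."""
    n = len(data)
    return sum(reg_strength * theta + n * (theta - x) for x in data) / n

def sgd(data, reg_strength=1.0, iterations=20000, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for t in range(1, iterations + 1):
        step = 1.0 / (5.0 * t)  # decaying step size (assumed schedule)
        theta -= step * stochastic_gradient(theta, data, reg_strength, rng)
    return theta

data = [1.0, 2.0, 3.0, 6.0]
theta_sgd = sgd(data)
```

Each SGD step looks at a single object, so its cost does not depend on N; the price is noise in the iterates, which the decaying step size averages out.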
Learning word embeddings
• 400k word vocabulary, 300-dimensional embeddings
• 240 million parameters to train!
• 1 GB memory snapshot
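The figures on this slide follow from simple arithmetic, sketched below. The factor of two counts the two embedding matrices A and B; storing each value as a 32-bit float is an assumption.

```python
vocabulary_size = 400_000
embedding_dim = 300
matrices = 2              # input embeddings A and output embeddings B
bytes_per_value = 4       # assumption: 32-bit floats

parameters = vocabulary_size * embedding_dim * matrices
snapshot_gb = parameters * bytes_per_value / 1e9
```

This gives 240 million parameters and a snapshot of roughly 0.96 GB, in line with the rounded values above.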
Stochastic parallel optimization
[Figure: core 1, core 2, …, core K connected to the shared parameters, with the data flow marked]
no synchronization!! (see e.g. Hogwild paper)
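The access pattern can be sketched with plain threads: several workers run SGD on one shared parameter and never take a lock. This is only an illustration of the idea under stated assumptions. The objective, step size and worker count are made up, and CPython threads interleave rather than run truly in parallel, so the sketch shows the unsynchronized reads and writes, not a speed-up.

```python
import random
import threading

def hogwild_sgd(data, num_workers=4, steps_per_worker=5000, step=0.001):
    """Minimize the mean of 0.5*(theta - x_i)^2 with lock-free SGD workers."""
    theta = [0.0]  # shared parameters, updated by all workers without a lock

    def worker(seed):
        rng = random.Random(seed)
        for _ in range(steps_per_worker):
            x = rng.choice(data)
            gradient = theta[0] - x       # may read a stale value
            theta[0] -= step * gradient   # unsynchronized write

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return theta[0]

theta_hat = hogwild_sgd([1.0, 2.0, 3.0, 6.0])  # should end up near the mean, 3.0
```

An occasional lost update only drops one small step, which is why the scheme tolerates the missing synchronization on this kind of problem.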
Stochastic parallel optimization
My laptop (2 cores, 8 GB RAM): 22 hours
[second configuration, label lost with the slide image]: 2 hours
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
Learning polysemic word embeddings
[Figure: words in embedding space, with one embedding per meaning of "Platform": Java, Platform (1), .NET, Mono; Railways, Ticket, Platform (2), Train; Platform (3), Politics, Party, Socialism]
Learning polysemic word embeddings
…compiled for a specific hardware platform, since different central processor… (computer meaning), loss: $-\log p(v \mid w, z = 1)$
…as the safe distance from the platform edge increases with the speed… (railway meaning), loss: $-\log p(v \mid w, z = 2)$
… Socialist Party; the Socialist Workers Platform and the Committee for a… (political meaning), loss: $-\log p(v \mid w, z = 3)$
$$p(v \mid w, z = k) = \frac{\exp(A_{wk}^T B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^T B_{v'})}$$
word meanings are unobserved
Learning polysemic word embeddings
$$\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A, B}$$
word meanings are unobserved, hence the EM algorithm must be employed
• How to choose a prior that lets the number of word meanings grow automatically when necessary? Bayesian nonparametrics (Orbanz, 2014)
• How to put the EM procedure into a stochastic optimization framework? Stochastic variational inference (Blei et al., 2012)
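The slides do not spell out which nonparametric prior over meanings is used. One standard construction that lets the number of active meanings grow with the data is the stick-breaking representation of the Dirichlet process; the sketch below assumes that construction, truncated at a maximum number of senses, and all names and the value of `alpha` are illustrative.

```python
import random

def sample_stick_breaking(alpha, max_senses, rng):
    """Draw prior sense probabilities from a truncated stick-breaking process."""
    remaining = 1.0
    weights = []
    for _ in range(max_senses):
        beta = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * beta)
        remaining *= 1.0 - beta
    return weights

def expected_stick_breaking(alpha, max_senses):
    """Prior mean of each weight: E[pi_k] = (1/(1+alpha)) * (alpha/(1+alpha))**(k-1)."""
    first = 1.0 / (1.0 + alpha)
    ratio = alpha / (1.0 + alpha)
    return [first * ratio ** k for k in range(max_senses)]

sampled = sample_stick_breaking(0.1, 30, random.Random(0))
expected = expected_stick_breaking(0.1, 30)
```

Under such a prior the weights of unused senses decay geometrically, so a word only acquires an additional meaning when the data supports it.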
EM algorithm
… Socialist Party; the Socialist Workers Platform and the Committee for a…
E-step: disambiguate the word given its context
p(z = politics) = 0.96
p(z = transport) = 0.01
p(z = computer) = 0.03
M-step: update word embeddings by weighted gradient
$$\theta^{t+1} = \theta^t + \lambda_t \nabla \left[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta^t) \right]$$
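The two steps can be sketched for a single word occurrence. This is a toy illustration under assumptions: the sense prior is fixed, only the sense embeddings of the center word are updated, the full softmax is evaluated exactly, and every name and number is made up for the example.

```python
import math

def log_softmax_prob(a, B, v):
    """log p(v | w, z=k) for sense embedding a and output embeddings B."""
    scores = [sum(x * y for x, y in zip(a, b)) for b in B]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[v] - log_norm

def e_step(prior, A_w, B, context):
    """Posterior p(z = k | w, context) over the senses of one word occurrence."""
    log_post = [
        math.log(p) + sum(log_softmax_prob(a_k, B, v) for v in context)
        for p, a_k in zip(prior, A_w)
    ]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def m_step(posterior, A_w, B, context, step):
    """One gradient-ascent update of the sense embeddings, weighted by the posterior."""
    updated = []
    for weight, a_k in zip(posterior, A_w):
        grad = [0.0] * len(a_k)
        for v in context:
            scores = [sum(x * y for x, y in zip(a_k, b)) for b in B]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            # d/da log p(v | a) = B_v - sum_u p(u | a) B_u
            for d in range(len(a_k)):
                expected = sum(e / total * b[d] for e, b in zip(exps, B))
                grad[d] += B[v][d] - expected
        updated.append([x + step * weight * g for x, g in zip(a_k, grad)])
    return updated

# Toy setup (made-up numbers): one word with 2 senses, vocabulary of 3 context words.
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
A_w = [[2.0, 0.0], [0.0, 2.0]]   # sense 0 favors context word 0, sense 1 favors word 1
prior = [0.5, 0.5]
context = [1, 1]                  # both observed context words are word 1

posterior = e_step(prior, A_w, B, context)
A_w_new = m_step(posterior, A_w, B, context, step=0.1)
```

Because the update of each sense is scaled by its posterior weight, the sense that explains the context moves the most, while unlikely senses are left almost untouched.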
Learning polysemic word embeddings
• 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word
• 7.2 billion parameters to train!
• 18 GB memory snapshot
Learning polysemic word embeddings
My laptop (2 cores, 8 GB RAM): 6 days!
[second configuration, label lost with the slide image]: 16 hours
Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
Results
julia> expected_pi(vm, dict.word2id["cloud"])
30-element Array{Float64,1}:
0.404964
0.134444
0.0987207
0.361865
5.70338e-6
5.18419e-7
4.7129e-8
4.28446e-9
3.89496e-10
3.54087e-11
⋮
Results
julia> nearest_neighbors(vm, dict, "cloud", 1)
10-element Array{(Any,Any,Any),1}:
("clouds",1,0.791538f0)
("haze",2,0.6702103f0)
("nimbostratus",1,0.653774f0)
("altostratus",1,0.6300289f0)
("noctilucent",1,0.6294991f0)
("cumulonimbus",1,0.6289225f0)
("stratocumulus",1,0.6274564f0)
("cumulus",2,0.6273055f0)
("clouds",2,0.6201524f0)
("cirrostratus",1,0.6146165f0)
Results
julia> nearest_neighbors(vm, dict, "cloud", 2)
10-element Array{(Any,Any,Any),1}:
("louis",5,0.5705162f0)
("vrain",1,0.55054826f0)
("lucie",1,0.52579653f0)
("clair",1,0.52284604f0)
("johns",2,0.5215208f0)
("marys",1,0.5036709f0)
("nazianz",1,0.4979607f0)
("lawrence",2,0.49513188f0)
("missouri",3,0.49284995f0)
("joseph",2,0.4928328f0)
Results
julia> nearest_neighbors(vm, dict, "cloud", 3)
10-element Array{(Any,Any,Any),1}:
("computing",1,0.7052178f0)
("middleware",1,0.68975633f0)
("cloud-based",1,0.6546666f0)
("context-aware",1,0.6417114f0)
("enterprise",1,0.63958025f0)
("virtualization",1,0.6359488f0)
("soa",1,0.6349716f0)
("distributed",1,0.6310058f0)
("unicore",1,0.62737936f0)
("client-server",1,0.6239226f0)
Results
julia> nearest_neighbors(vm, dict, "cloud", 4)
10-element Array{(Any,Any,Any),1}:
("mist",1,0.56100917f0)
("clouds",3,0.54695433f0)
("fire",5,0.53125167f0)
("flame",3,0.52561617f0)
("dragon",1,0.5224602f0)
("sorceror",1,0.5199405f0)
("shining",2,0.5165066f0)
("shadow",1,0.516233f0)
("mysterious",2,0.5153119f0)
("smoke",3,0.51471066f0)
Results
julia> disambiguate(vm, dict, "cloud",
split("weather forecast cold rainy"))
30-element Array{Float64,1}:
0.999278
9.49993e-7
1.52921e-8
0.000720983
0.0
0.0
0.0
0.0
0.0
0.0
⋮
Results
julia> disambiguate(vm, dict, "cloud",
split("multi-core virtual machine"))
30-element Array{Float64,1}:
0.000243637
6.16926e-5
0.998918
0.000776869
0.0
0.0
0.0
0.0
0.0
0.0
⋮
Dmitry Kondrashkin, Anton Osokin, Dmitry P. Vetrov
and thanks to Microsoft Research and Microsoft Azure team!
project page: bayesgroup.ru/adagram
sources: github.com/sbos/AdaGram.jl

Parallel asynchronous inference of word senses with Microsoft Azure

  • 1. Parallel asynchronous inference of word senses with Azure Sergey Bartunov, MSU
  • 2. Learning as optimization F(✓) = r(✓) + NX i=1 fi(xi; ✓) ! min ✓
  • 3. Learning as optimization loss F(✓) = r(✓) + NX i=1 fi(xi; ✓) ! min ✓ regularizer objectparameters
  • 4. Learning as optimization • can be huge • regularizer and loss can be complex • parameters’ dimensionality can be very large N loss F(✓) = r(✓) + NX i=1 fi(xi; ✓) ! min ✓ regularizer objectparameters
  • 5. Learning as optimization • can be huge • regularizer and loss can be complex • parameters’ dimensionality can be very large N loss F(✓) = r(✓) + NX i=1 fi(xi; ✓) ! min ✓ regularizer objectparameters Commodity PC is not enough!
  • 6. Learning word embeddings: for each word, find its embedding such that similar words have close embeddings. (Figure: words plotted in embedding space: Java, Platform, .NET, Mono, Railways, Ticket, Train, Politics, Party, Socialism)
  • 7. Learning word embeddings: "…compiled for a specific hardware platform, since different central processor…"
  • 8. Learning word embeddings: "…compiled for a specific hardware platform, since different central processor…" object: word and its context
  • 9. Learning word embeddings: "…compiled for a specific hardware platform, since different central processor…" object: word and its context; loss: $-\log p(v \mid w)$, where $p(v \mid w) = \frac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$
  • 10. Learning word embeddings: "…compiled for a specific hardware platform, since different central processor…" object: word and its context; loss: $-\log p(v \mid w)$, where $p(v \mid w) = \frac{\exp(A_w^\top B_v)}{\sum_{v'=1}^{V} \exp(A_w^\top B_{v'})}$; parameters: word embeddings $A_w, B_w \in \mathbb{R}^D$, $w \in \{1, \dots, V\}$. Skip-gram (Mikolov et al, 2013)
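
To make the skip-gram softmax on these slides concrete, here is a minimal NumPy sketch that evaluates p(v|w) for a toy vocabulary. The toy sizes, the random initialization and the helper name `context_probs` are assumptions made for illustration only; they do not come from the talk or from AdaGram.jl.

```python
import numpy as np

def context_probs(A, B, w):
    """Return p(v | w) for every context word v under the full softmax."""
    scores = B @ A[w]          # A_w^T B_v for all v, shape (V,)
    scores -= scores.max()     # shift for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

rng = np.random.default_rng(0)
V, D = 5, 3                    # toy sizes (the talk uses V = 400k, D = 300)
A = rng.normal(size=(V, D))    # input embeddings A_w (hypothetical values)
B = rng.normal(size=(V, D))    # output embeddings B_v (hypothetical values)

p = context_probs(A, B, w=2)
loss = -np.log(p[4])           # loss for observing context word v = 4 next to w = 2
```

The denominator sums over the whole vocabulary, which is why a single evaluation already costs O(V·D) at full scale.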
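  • 11. Gradient optimization: $F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_\theta$; gradient descent: $\theta^{t+1} = \theta^t - \eta_t \nabla F(\theta^t)$ (step size $\eta_t$)
  • 12. Stochastic optimization: $F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_\theta$; stochastic gradient descent: $\theta^{t+1} = \theta^t - \eta_t G(\theta^t)$
  • 13. Stochastic optimization: $F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_\theta$; stochastic gradient descent: $\theta^{t+1} = \theta^t - \eta_t G(\theta^t)$, with $\mathbb{E}\, G(\theta) = \nabla F(\theta)$
  • 14. Stochastic optimization: $F(\theta) = r(\theta) + \sum_{i=1}^{N} f_i(x_i; \theta) \to \min_\theta$; stochastic gradient descent: $\theta^{t+1} = \theta^t - \eta_t G(\theta^t)$, with $\mathbb{E}\, G(\theta) = \nabla F(\theta)$; for example: $G(\theta) = \nabla \left[ r(\theta) + N f_j(x_j; \theta) \right]$, $j \sim \mathrm{Uniform}(1, N)$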
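
The unbiasedness condition on slides 13 and 14 can be checked numerically. The sketch below assumes a ridge-regularized least-squares problem (the quadratic loss, the regularizer and all sizes are illustrative choices, not the talk's model) and confirms that averaging the single-sample estimator over a uniformly chosen index j recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 4
X = rng.normal(size=(N, D))   # objects x_i (synthetic)
y = rng.normal(size=N)        # targets (synthetic)
lam = 0.1                     # regularization strength (assumed)

def full_gradient(theta):
    # Gradient of r(theta) + sum_i f_i(x_i; theta), with
    # r = lam/2 * ||theta||^2 and f_i = 0.5 * (x_i . theta - y_i)^2
    return lam * theta + X.T @ (X @ theta - y)

def stochastic_gradient(theta, j):
    # G(theta) = grad[ r(theta) + N * f_j(x_j; theta) ]
    return lam * theta + N * (X[j] @ theta - y[j]) * X[j]

theta = rng.normal(size=D)
# Expectation over j ~ Uniform(1, N) is the plain average over all indices.
mean_G = np.mean([stochastic_gradient(theta, j) for j in range(N)], axis=0)
```

The factor N in front of the sampled loss term is what makes the estimator unbiased: each of the N terms is picked with probability 1/N.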
  • 15. Learning word embeddings • 400k word vocabulary, 300-dimensional embeddings • 240 million parameters to train! • 1 GB memory snapshot
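
The figures on slide 15 can be reproduced with a few lines of arithmetic. This sketch assumes that the count covers the two embedding matrices A and B and that each parameter is stored as a 4-byte float; neither assumption is stated on the slide, although the results later in the deck print `Float32` similarities.

```python
V = 400_000                   # vocabulary size from the slide
D = 300                       # embedding dimensionality from the slide

n_params = 2 * V * D          # two matrices, A and B (assumption)
bytes_per_param = 4           # float32 storage (assumption)
size_gb = n_params * bytes_per_param / 1e9
```

Under these assumptions `n_params` is 240 million and `size_gb` is 0.96, consistent with the "1 GB memory snapshot" on the slide.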
  • 16. Stochastic parallel optimization (diagram): core 1, core 2, …, core K; shared parameters
  • 17. Stochastic parallel optimization (diagram): core 1, core 2, …, core K; shared parameters; data flow
  • 18. Stochastic parallel optimization (diagram): core 1, core 2, …, core K; shared parameters; data flow
  • 19. Stochastic parallel optimization (diagram): core 1, core 2, …, core K; shared parameters; data flow. No synchronization! (see e.g. the Hogwild paper)
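
A minimal sketch of the lock-free update pattern described on these slides, assuming a noiseless least-squares problem in place of the word-embedding objective. The problem, the sizes, the step size and the `worker` function are hypothetical. Standard CPython threads do not run this numeric code in parallel because of the global interpreter lock, so the example only illustrates that workers write to shared parameters without any lock; it is not a speed demonstration.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 2000, 10, 4                 # samples, dimensions, workers (toy sizes)
true_theta = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ true_theta                    # noiseless targets

theta = np.zeros(D)                   # shared parameters, never locked

def worker(theta, indices, step=0.01, epochs=5):
    for _ in range(epochs):
        for i in indices:
            # Gradient of one squared-error term, read from possibly stale theta.
            grad = (X[i] @ theta - y[i]) * X[i]
            theta -= step * grad      # in-place write to the shared array

# Each worker processes its own shard of the data.
shards = np.array_split(rng.permutation(N), K)
threads = [threading.Thread(target=worker, args=(theta, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

error = np.linalg.norm(theta - true_theta)
```

Workers may overwrite each other's updates, but each update touches the parameters only slightly, so the iterates still approach the solution.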
  • 20. Stochastic parallel optimization My laptop: 2 cores, 8 GB RAM
  • 21. Stochastic parallel optimization My laptop: 2 cores, 8 GB RAM
  • 22. Stochastic parallel optimization My laptop: 2 cores, 8 GB RAM
  • 23. Stochastic parallel optimization My laptop: 2 cores, 8 GB RAM (times shown on the slide: 22 hours and 2 hours). Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
  • 24. Learning polysemic word embeddings Java Platform (1) .NET Mono Railways Ticket Platform (2) Train Platform (3) Politics Party Socialism
  • 25. Learning polysemic word embeddings: "…compiled for a specific hardware platform, since different central processor…" (computer meaning)
  • 26. Learning polysemic word embeddings: "…compiled for a specific hardware platform, since different central processor…" (computer meaning); "…as the safe distance from the platform edge increases with the speed…" (railway meaning)
  • 27. Learning polysemic word embeddings: "…compiled for a specific hardware platform, since different central processor…" (computer meaning); "…as the safe distance from the platform edge increases with the speed…" (railway meaning); "… Socialist Party; the Socialist Workers Platform and the Committee for a…" (political meaning)
  • 28. Learning polysemic word embeddings: "…compiled for a specific hardware platform, since different central processor…" (computer meaning), loss: $-\log p(v \mid w, z = 1)$; "…as the safe distance from the platform edge increases with the speed…" (railway meaning), loss: $-\log p(v \mid w, z = 2)$; "… Socialist Party; the Socialist Workers Platform and the Committee for a…" (political meaning), loss: $-\log p(v \mid w, z = 3)$
  • 29. Learning polysemic word embeddings: "…compiled for a specific hardware platform, since different central processor…" (computer meaning), loss: $-\log p(v \mid w, z = 1)$; "…as the safe distance from the platform edge increases with the speed…" (railway meaning), loss: $-\log p(v \mid w, z = 2)$; "… Socialist Party; the Socialist Workers Platform and the Committee for a…" (political meaning), loss: $-\log p(v \mid w, z = 3)$; where $p(v \mid w, z = k) = \frac{\exp(A_{wk}^\top B_v)}{\sum_{v'=1}^{V} \exp(A_{wk}^\top B_{v'})}$. Word meanings are unobserved
  • 30. Learning polysemic word embeddings: $\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A,B}$. Word meanings are unobserved, hence the EM algorithm must be employed
  • 31. Learning polysemic word embeddings: $\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A,B}$. Word meanings are unobserved, hence the EM algorithm must be employed • How to choose a prior that allows the number of word meanings to grow automatically when necessary? • How to fit the EM procedure into the stochastic optimization framework?
  • 32. Learning polysemic word embeddings: $\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A,B}$. Word meanings are unobserved, hence the EM algorithm must be employed • How to choose a prior that allows the number of word meanings to grow automatically when necessary? Bayesian nonparametrics (Orbanz, 2014) • How to fit the EM procedure into the stochastic optimization framework?
  • 33. Learning polysemic word embeddings: $\log p(W, V \mid A, B, \alpha) = \log \int p(z \mid \alpha) \prod_i \prod_j p(v_{ij} \mid w_i, z_i, A, B)\, dz \to \max_{A,B}$. Word meanings are unobserved, hence the EM algorithm must be employed • How to choose a prior that allows the number of word meanings to grow automatically when necessary? Bayesian nonparametrics (Orbanz, 2014) • How to fit the EM procedure into the stochastic optimization framework? Stochastic variational inference (Blei et al, 2012)
  • 34. EM algorithm: "… Socialist Party; the Socialist Workers Platform and the Committee for a…"
  • 35. EM algorithm. E-step: disambiguate the word given its context: "… Socialist Party; the Socialist Workers Platform and the Committee for a…" gives p(z = politics) = 0.96, p(z = transport) = 0.01, p(z = computer) = 0.03
  • 36. EM algorithm. E-step: disambiguate the word given its context: "… Socialist Party; the Socialist Workers Platform and the Committee for a…" gives p(z = politics) = 0.96, p(z = transport) = 0.01, p(z = computer) = 0.03. M-step: update word embeddings by weighted gradient: $\theta^{t+1} = \theta^t + \eta_t \nabla \left[ \sum_k p(z_i = k) \log p(v_{ij} \mid w_i, z_i = k, \theta^t) \right]$ (step size $\eta_t$)
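
The two steps on these slides can be sketched for a single word occurrence. This is a simplified illustration under stated assumptions: a fixed uniform prior over a fixed number of senses (the talk instead points to a nonparametric prior), a full softmax, and an update of the sense vectors only, leaving the context vectors untouched. The names `e_step` and `m_step` and all sizes are hypothetical and are not the AdaGram.jl API.

```python
import numpy as np

def log_softmax(scores):
    scores = scores - scores.max()
    return scores - np.log(np.exp(scores).sum())

def e_step(A_w, B, prior, context):
    """Posterior over the senses of one word given its context word ids."""
    # log p(v | w, z = k) for every sense k and word v, shape (K, V)
    logp = np.array([log_softmax(B @ a) for a in A_w])
    log_post = np.log(prior) + logp[:, context].sum(axis=1)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

def m_step(A_w, B, post, context, step=0.1):
    """One gradient-ascent step on the sense vectors, weighted by the posterior."""
    new_A = A_w.copy()
    for k, a in enumerate(A_w):
        p = np.exp(log_softmax(B @ a))   # p(v | w, z = k), shape (V,)
        # Gradient of sum_j log p(v_j | w, z = k) with respect to A_wk
        grad = B[context].sum(axis=0) - len(context) * (p @ B)
        new_A[k] += step * post[k] * grad
    return new_A

rng = np.random.default_rng(0)
K, V, D = 3, 8, 4                        # senses, vocabulary, dimensions (toy sizes)
A_w = rng.normal(size=(K, D))            # one vector per sense of the word w
B = rng.normal(size=(V, D))              # context vectors
prior = np.full(K, 1.0 / K)              # uniform prior over senses (assumption)
context = [1, 5, 6]                      # ids of the context words

post = e_step(A_w, B, prior, context)
A_w_new = m_step(A_w, B, post, context)
```

Senses with a small posterior weight barely move, so each occurrence mostly updates the meaning it was assigned to.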
  • 37. Learning polysemic word embeddings • 400k word vocabulary, 300-dimensional embeddings, max. 30 meanings per word • 7.2 billion parameters to train! • 18 GB memory snapshot
  • 38. Learning polysemic word embeddings My laptop: 2 cores, 8 GB RAM (times shown on the slide: 6 days! and 16 hours). Dataset: English Wikipedia 2012 (5.7 GB raw text, 1 billion words)
  • 39. Results
      julia> expected_pi(vm, dict.word2id["cloud"])
      30-element Array{Float64,1}:
       0.404964
       0.134444
       0.0987207
       0.361865
       5.70338e-6
       5.18419e-7
       4.7129e-8
       4.28446e-9
       3.89496e-10
       3.54087e-11
       ⋮
  • 40. Results
      julia> nearest_neighbors(vm, dict, "cloud", 1)
      10-element Array{(Any,Any,Any),1}:
       ("clouds",1,0.791538f0)
       ("haze",2,0.6702103f0)
       ("nimbostratus",1,0.653774f0)
       ("altostratus",1,0.6300289f0)
       ("noctilucent",1,0.6294991f0)
       ("cumulonimbus",1,0.6289225f0)
       ("stratocumulus",1,0.6274564f0)
       ("cumulus",2,0.6273055f0)
       ("clouds",2,0.6201524f0)
       ("cirrostratus",1,0.6146165f0)
  • 41. Results
      julia> nearest_neighbors(vm, dict, "cloud", 2)
      10-element Array{(Any,Any,Any),1}:
       ("louis",5,0.5705162f0)
       ("vrain",1,0.55054826f0)
       ("lucie",1,0.52579653f0)
       ("clair",1,0.52284604f0)
       ("johns",2,0.5215208f0)
       ("marys",1,0.5036709f0)
       ("nazianz",1,0.4979607f0)
       ("lawrence",2,0.49513188f0)
       ("missouri",3,0.49284995f0)
       ("joseph",2,0.4928328f0)
  • 42. Results
      julia> nearest_neighbors(vm, dict, "cloud", 3)
      10-element Array{(Any,Any,Any),1}:
       ("computing",1,0.7052178f0)
       ("middleware",1,0.68975633f0)
       ("cloud-based",1,0.6546666f0)
       ("context-aware",1,0.6417114f0)
       ("enterprise",1,0.63958025f0)
       ("virtualization",1,0.6359488f0)
       ("soa",1,0.6349716f0)
       ("distributed",1,0.6310058f0)
       ("unicore",1,0.62737936f0)
       ("client-server",1,0.6239226f0)
  • 43. Results
      julia> nearest_neighbors(vm, dict, "cloud", 4)
      10-element Array{(Any,Any,Any),1}:
       ("mist",1,0.56100917f0)
       ("clouds",3,0.54695433f0)
       ("fire",5,0.53125167f0)
       ("flame",3,0.52561617f0)
       ("dragon",1,0.5224602f0)
       ("sorceror",1,0.5199405f0)
       ("shining",2,0.5165066f0)
       ("shadow",1,0.516233f0)
       ("mysterious",2,0.5153119f0)
       ("smoke",3,0.51471066f0)
  • 44. Results
      julia> disambiguate(vm, dict, "cloud", split("weather forecast cold rainy"))
      30-element Array{Float64,1}:
       0.999278
       9.49993e-7
       1.52921e-8
       0.000720983
       0.0
       0.0
       0.0
       0.0
       0.0
       0.0
       ⋮
  • 45. Results
      julia> disambiguate(vm, dict, "cloud", split("multi-core virtual machine"))
      30-element Array{Float64,1}:
       0.000243637
       6.16926e-5
       0.998918
       0.000776869
       0.0
       0.0
       0.0
       0.0
       0.0
       0.0
       ⋮
  • 46. Dmitry Kondrashkin, Anton Osokin, Dmitry P. Vetrov, and thanks to Microsoft Research and the Microsoft Azure team! Project page: bayesgroup.ru/adagram. Sources: github.com/sbos/AdaGram.jl