lda2vec
(word2vec, and lda)
Christopher Moody
@ Stitch Fix
About
@chrisemoody
Caltech Physics
PhD. in astrostats supercomputing
sklearn t-SNE contributor
Data Labs at Stitch Fix
github.com/cemoody
Gaussian Processes t-SNE
chainer
deep learning
Tensor Decomposition
1 word2vec
2 lda
3 lda2vec
1. king - man + woman = queen
2. Huge splash in NLP world
3. Learns from raw text
4. Pretty simple algorithm
5. Comes pretrained
word2vec
1. Set up an objective function
2. Randomly initialize vectors
3. Do gradient descent
word2vec
word2vec: learn word vector w
from its surrounding context
word2vec
“The fox jumped over the lazy dog”
Maximize the likelihood of seeing the words given the word over.
P(the|over)
P(fox|over)
P(jumped|over)
P(the|over)
P(lazy|over)
P(dog|over)
…instead of maximizing the likelihood of co-occurrence counts.
word2vec
P(fox|over)
What should this be?
word2vec
P(v_fox|v_over)
Should depend on the word vectors.
P(fox|over)
word2vec
“The fox jumped over the lazy dog”
P(w|c)
Extract pairs from context window around every input word.
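The pair extraction described above can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `skipgram_pairs` and the window size of 2 are assumptions chosen for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (w, c) pairs: each word w found within `window` positions of c."""
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((tokens[j], context))
    return pairs


tokens = "the fox jumped over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=2)

# Pairs whose context word c is "over"
over_pairs = [w for w, c in pairs if c == "over"]
print(over_pairs)
```

Each pair then becomes one training example for P(w|c).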
objective
Measure loss between
w and c?
How should we define P(w|c)?
objective
w . c
How should we define P(w|c)?
Measure loss between
w and c?
word2vec
objective
w · c ~ 1
v_canada · v_snow ~ 1
word2vec
objective
w · c ~ 0
v_canada · v_desert ~ 0
word2vec
objective
w · c ~ -1
word2vec
objective
w · c ∈ [-1,1] (for unit-length w and c)
But we’d like to measure a probability.
word2vec
objective
σ(c·w) ∈ [0,1]
Similar: σ(c·w) near 1. Dissimilar: σ(c·w) near 0.
word2vec
objective
Loss function:
L = σ(c·w)
Logistic (binary) choice.
Is the (context, word) combination from our dataset?
word2vec
objective
The skip-gram negative-sampling model
L = σ(c·w)
Trivial solution is that context = word for all vectors
word2vec
objective
The skip-gram negative-sampling model
L = σ(c·w) + σ(-c·w_neg)
Draw random words in vocabulary.
word2vec
objective
The skip-gram negative-sampling model
L = σ(c·w) + σ(-c·w_neg) + … + σ(-c·w_neg)
Multiple negative samples
Discriminate positive from negative samples
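A small numeric sketch of this objective, under stated assumptions: the slides write the terms as plain sigmoids, while the commonly published form of skip-gram negative sampling takes the log of each sigmoid, and that log form is what is computed here. The function name `sgns_objective` and the toy vectors are illustrative only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_objective(c, w, w_negs):
    """log σ(c·w) + Σ log σ(-c·w_neg); larger means a better fit."""
    positive = np.log(sigmoid(c @ w))
    negative = np.sum(np.log(sigmoid(-(w_negs @ c))))
    return positive + negative


c = np.array([1.0, 0.0])

# Observed word aligned with the context, negatives pointing away: good fit.
good = sgns_objective(c, np.array([1.0, 0.0]),
                      np.array([[-1.0, 0.0], [-1.0, 0.0]]))

# Observed word pointing away, negatives aligned: poor fit.
bad = sgns_objective(c, np.array([-1.0, 0.0]),
                     np.array([[1.0, 0.0], [1.0, 0.0]]))

print(good, bad)
```

Training would adjust the vectors by gradient ascent on this quantity, pulling observed (context, word) pairs together and pushing randomly drawn negatives apart.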
word2vec
PMI
The SGNS Model
L = σ(c·w) + σ(-c·w_neg)
c_i·w_j = PMI(c_i, w_j) - log k
…is extremely similar to matrix factorization!
Levy & Goldberg 2014
‘traditional’ NLP
word2vec
PMI
The SGNS Model
L = σ(c·w) + Σ σ(-c·w_neg)
c_i·w_j = log [ (#(c_i, w_j)/n) / (k · (#(w_j)/n) · (#(c_i)/n)) ]
Levy & Goldberg 2014
‘traditional’ NLP
word2vec
PMI
The SGNS Model
L = σ(c·w) + Σ σ(-c·w_neg)
c_i·w_j = log [ (popularity of c,w) / (k · (popularity of c) · (popularity of w)) ]
Levy & Goldberg 2014
‘traditional’ NLP
word2vec
PMI
99% of word2vec
is counting.
And you can count
words in SQL
word2vec
PMI
Count how many times
you saw c and w together
Count how many times
you saw c
Count how many times
you saw w
word2vec
PMI
…and this takes ~5 minutes to compute on a single core.
Computing the SVD is a standard math-library operation.
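The counting recipe above can be sketched with NumPy. This is an illustration under assumptions, not the talk's code: the toy pair list is made up, the function name `shifted_ppmi` is hypothetical, and unseen or negative entries are clipped to zero (a positive-PMI variant) so that the matrix stays finite.

```python
import numpy as np


def shifted_ppmi(pairs, k=1):
    """Build the matrix log[(#(c,w)/n) / (k * #(w)/n * #(c)/n)], clipped at 0."""
    vocab = sorted({token for pair in pairs for token in pair})
    index = {token: i for i, token in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for w, c in pairs:
        counts[index[w], index[c]] += 1
    n = counts.sum()
    w_freq = counts.sum(axis=1, keepdims=True) / n
    c_freq = counts.sum(axis=0, keepdims=True) / n
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / n) / (k * w_freq * c_freq))
    pmi = np.where(counts > 0, pmi, 0.0)  # unseen pairs contribute nothing
    return vocab, index, np.maximum(pmi, 0.0)


# Toy (w, c) pairs; in practice these come from the context windows.
pairs = [("fox", "over"), ("over", "fox"), ("lazy", "dog"),
         ("dog", "lazy"), ("fox", "jumped"), ("jumped", "fox")]
vocab, index, matrix = shifted_ppmi(pairs, k=1)

# Factorize with a standard SVD and keep the top 2 dimensions as word vectors.
u, s, vt = np.linalg.svd(matrix)
vectors = u[:, :2] * np.sqrt(s[:2])
print(vocab)
print(vectors.shape)
```

Setting `k` above 1 subtracts log k from every entry, which mirrors the number of negative samples in the SGNS objective.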
word2vec
ITEM_3469 + ‘Pregnant’
+ ‘Pregnant’
= ITEM_701333
= ITEM_901004
= ITEM_800456
What about LDA?
LDA
on Client Item
Descriptions
LDA
on Item
Descriptions
(with Jay)
lda vs word2vec
Bayesian Graphical Model | ML Neural Model
word2vec is local:
one word predicts a nearby word
“I love finding new designer brands for jeans”
But text is usually organized.
In LDA, documents globally predict words.
doc 7681
typical word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]
All real values
typical LDA document vector
[ 0%, 9%, 78%, 11%]
All sum to 100%
5D word2vec vector
[ -0.75, -1.25, -0.55, -0.12, +2.2]
Dense
All real values
Dimensions relative
5D LDA document vector
[ 0%, 9%, 78%, 11%]
Sparse
All sum to 100%
Dimensions are absolute
100D word2vec vector
[ -0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]
Dense
All real values
Dimensions relative
Similar in 100D ways
(very flexible)
100D LDA document vector
[ 0%, 0%, 0%, 0%, 0% … 0%, 9%, 78%, 11%]
Sparse
All sum to 100%
Dimensions are absolute
Similar in fewer ways
(more interpretable)
+mixture
+sparse
can we do both? lda2vec
[Diagram: skip grams from sentences (“Lufthansa is a German airline and when”, highlighted word “German”) feed a word vector of size #hidden units (-1.9 0.85 -0.6 -0.3 -0.5), trained with a negative sampling loss.]
word2vec predicts locally:
one word predicts a nearby word
[Diagram: lda2vec architecture for the sentence “Lufthansa is a German airline and when” (highlighted word “German”). Document weight (#topics): 0.34 -0.1 0.17. Document proportion (#topics): 41% 26% 34%. Document proportion x topic matrix (#topics by #hidden units) = document vector (#hidden units): -0.7 -0.4 -0.7 -0.3 -0.3. Skip grams from sentences give the word vector (#hidden units): -1.9 0.85 -0.6 -0.3 -0.5. Document vector + word vector = context vector: -2.6 0.45 -1.3 -0.6 -0.8, which feeds the negative sampling loss.]
Document vector
predicts a word from
a global context
We’re missing
mixtures & sparsity!
Now it’s a mixture.
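The mixture can be sketched numerically. Two things here are assumptions rather than statements from the slides: the document proportion is taken to be the softmax of the document weight (this roughly reproduces the 41% 26% 34% shown for the weight 0.34 -0.1 0.17), and the topic matrix is filled with random placeholder values because the slide's matrix cannot be recovered from the extraction.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


n_topics, n_hidden = 3, 5

doc_weight = np.array([0.34, -0.1, 0.17])               # from the slide
word_vector = np.array([-1.9, 0.85, -0.6, -0.3, -0.5])  # from the slide

# Hypothetical topic matrix: one row of #hidden units per topic.
rng = np.random.default_rng(0)
topic_matrix = rng.normal(size=(n_topics, n_hidden))

doc_proportion = softmax(doc_weight)          # sums to 100%
doc_vector = doc_proportion @ topic_matrix    # weighted mixture of topic vectors
context_vector = word_vector + doc_vector     # fed to the negative sampling loss

print(np.round(doc_proportion, 2))
print(context_vector.shape)
```

Because the proportions sum to one, the document vector always lies inside the span of the topic vectors, which is what makes each document readable as a blend of topics.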
Trinitarian
baptismal
Pentecostals
Bede
schismatics
excommunication
topic 1 = “religion”
Milosevic
absentee
Indonesia
Lebanese
Israelis
Karadzic
topic 2 = “politics”
Sparsity!
Document proportions over time:
t=0: 34% 32% 34%
t=10: 41% 26% 34%
t=∞: 99% 1% 0%
@chrisemoody
lda2vec.com
+ API docs
+ Examples
+ GPU
+ Tests
Example Hacker News comments
Topics:
http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/lda2vec.ipynb
Word vectors:
https://github.com/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/word_vectors.ipynb
If you want…
human-interpretable doc topics, use LDA.
machine-useable word-level features, use word2vec.
if you like to experiment a lot, and have topics over user / doc / region / etc. features, use lda2vec.
(and you have a GPU)
?@chrisemoody
Multithreaded
Stitch Fix
Credit
Large swathes of this talk are from
previous presentations by:
• Tomas Mikolov
• David Blei
• Christopher Olah
• Radim Rehurek
• Omer Levy & Yoav Goldberg
• Richard Socher
• Xin Rong
• Tim Hopper
Can we model topics to sentences?
lda2lstm
“PS! Thank you for such an awesome idea”
doc_id=1846
Can we model topics to images?
lda2ae
TJ Torres
and now for something completely crazy
4
Fun
Stuff
translation
(using just a rotation matrix)
Mikolov 2013
English
Spanish
Matrix Rotation
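One way to fit such a rotation from paired word vectors is the orthogonal Procrustes solution, sketched below. This is an assumption about how a rotation could be obtained, not necessarily the procedure of the cited work, and the "English" and "Spanish" vectors are synthetic stand-ins generated for the example.

```python
import numpy as np


def fit_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target|| (Frobenius norm)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)

# Synthetic data: 20 "English" word vectors and a hidden orthogonal map.
english = rng.normal(size=(20, 4))
true_map, _ = np.linalg.qr(rng.normal(size=(4, 4)))
spanish = english @ true_map

rotation = fit_rotation(english, spanish)

# Translate by rotating an English vector into the Spanish space.
translated = english[0] @ rotation
print(np.allclose(translated, spanish[0]))
```

With real embeddings the rows would be vectors for a seed dictionary of known translation pairs, and a new word would be translated by rotating its vector and taking the nearest neighbor in the target space.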
deepwalk
Perozzi et al. 2014
learn word vectors from sentences
“The fox jumped over the lazy dog”
‘words’ are graph vertices
‘sentences’ are random walks on the graph
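A minimal sketch of turning a graph into ‘sentences’: uniform random walks over an adjacency list. The toy graph, the function name `random_walks`, and the walk parameters are illustrative assumptions.

```python
import random


def random_walks(graph, walk_length=5, walks_per_node=2, seed=0):
    """Generate uniform random walks; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy undirected graph as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walks = random_walks(graph)
for walk in walks[:3]:
    print(" ".join(walk))
```

The resulting walks can be passed to any word2vec trainer in place of tokenized sentences, so that vertices visited together end up with similar vectors.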
word2vec
Playlists at
Spotify
context
sequence
learning
‘words’ are song indices
‘sentences’ are playlists
Playlists at
Spotify
context
Erik
Bernhardsson
Great performance on ‘related artists’
Fixes at
Stitch Fix
sequence
learning
Let’s try:
‘words’ are items
‘sentences’ are fixes
Fixes at
Stitch Fix
context
Learn similarity between styles
because they co-occur
Learn ‘coherent’ styles
sequence
learning
Fixes at
Stitch Fix?
context
sequence
learning
Got lots of structure!
Nearby regions are
consistent ‘closets’
context
dependent
Levy & Goldberg 2014
Australian scientist discovers star with telescope
context +/- 2 words
BoW DEPS
topically-similar vs ‘functionally’ similar
Crazy
Approaches
Paragraph Vectors
(Just extend the context window)
Content dependency
(Change the window grammatically)
Social word2vec (deepwalk)
(Sentence is a walk on the graph)
Spotify
(Sentence is a playlist of song_ids)
Stitch Fix
(Sentence is a shipment of five items)
CBOW
“The fox jumped over the lazy dog”
Guess the word
given the context
~20x faster.
(this is the alternative.)
[Diagram: six context vectors vIN predict one word vector vOUT]
SkipGram
“The fox jumped over the lazy dog”
Guess the context
given the word
Better at syntax.
(this is the one we went over)
[Diagram: one word vector vIN predicts six context vectors vOUT]
lda2vec
vDOC = a vtopic1 + b vtopic2 +…
Let’s make vDOC sparse
lda2vec
This works! 😀 But vDOC isn’t as
interpretable as the topic vectors. 😔
vDOC = topic0 + topic1
Let’s say that vDOC adds
lda2vec
softmax(vOUT * (vIN + vDOC))
theory of lda2vec
lda2vec
pyLDAvis of lda2vec
lda2vec
LDA
Results
context
History
Great Stylist
I loved every choice in this fix!! Great job!
Perfect
Body Fit
My measurements are 36-28-32. If that helps.
I like wearing some clothing that is fitted.
Very hard for me to find pants that fit right.
Sizing
Really enjoyed the experience and the
pieces, sizing for tops was too big.
Looking forward to my next box!
Excited for next
Almost Bought
It was a great fix. Loved the two items I
kept and the three I sent back were close!
Perfect
All of the following ideas will change what
‘words’ and ‘context’ represent.
paragraph
vector
What about summarizing documents?
On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph
vector
Normal skipgram extends C words before, and C words after.
IN
OUT OUT
On the day he took office, President Obama reached out to America’s enemies,
offering in his first inaugural address to extend a hand if you are willing to unclench
your fist. More than six years later, he has arrived at a moment of truth in testing that
The framework nuclear agreement he reached with Iran on Thursday did not provide
the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist
Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed.
paragraph
vector
A document vector simply extends the context to the whole document.
IN
OUT OUT
OUT OUTdoc_1347
from gensim.models import Doc2Vec

# Load a previously trained model (file name as given in the talk).
fn = "item_document_vectors"
model = Doc2Vec.load(fn)
# Nearest neighbors of the token 'pregnant'. Newer gensim releases
# expose this lookup as model.wv.most_similar.
matches = model.most_similar('pregnant')
# Keep only sentence labels, which carry a 'SENT_' prefix.
matches = list(filter(lambda x: 'SENT_' in x[0], matches))
# ['...I am currently 23 weeks pregnant...',
#  '...I'm now 10 weeks pregnant...',
#  '...not showing too much yet...',
#  '...15 weeks now. Baby bump...',
#  '...6 weeks post partum!...',
#  '...12 weeks postpartum and am nursing...',
#  '...I have my baby shower that...',
#  '...am still breastfeeding...',
#  '...I would love an outfit for a baby shower...']
sentence
search
lda2vec Text by the Bay 2016

Más contenido relacionado

Destacado

Discussion on the Distributed Search Engine
Discussion on the Distributed Search EngineDiscussion on the Distributed Search Engine
Discussion on the Distributed Search EngineYusuke Fujisaka
 
Journal club: Meta-Prod2Vec
Journal club: Meta-Prod2Vec Journal club: Meta-Prod2Vec
Journal club: Meta-Prod2Vec Yuya Kanemoto
 
What do we get from Twitter - and what not?
What do we get from Twitter - and what not?What do we get from Twitter - and what not?
What do we get from Twitter - and what not?Katrin Weller
 
Fabrikatyr lda topic modelling practical application
Fabrikatyr lda topic modelling practical applicationFabrikatyr lda topic modelling practical application
Fabrikatyr lda topic modelling practical applicationTim Carnus
 
Topic Modelling to identify behavioral trends in online communities
Topic Modelling to identify behavioral trends in online communities Topic Modelling to identify behavioral trends in online communities
Topic Modelling to identify behavioral trends in online communities Conor Duke
 
Distributed representation of sentences and documents
Distributed representation of sentences and documentsDistributed representation of sentences and documents
Distributed representation of sentences and documentsAbdullah Khan Zehady
 
Drawing word2vec
Drawing word2vecDrawing word2vec
Drawing word2vecKai Sasaki
 
EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...
EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...
EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...Yuya Unno
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector spaceAbdullah Khan Zehady
 
[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...
[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...
[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...FrenchWeb.fr
 
LDA Beginner's Tutorial
LDA Beginner's TutorialLDA Beginner's Tutorial
LDA Beginner's TutorialWayne Lee
 

Destacado (13)

Discussion on the Distributed Search Engine
Discussion on the Distributed Search EngineDiscussion on the Distributed Search Engine
Discussion on the Distributed Search Engine
 
P2p search engine
P2p search engineP2p search engine
P2p search engine
 
Journal club: Meta-Prod2Vec
Journal club: Meta-Prod2Vec Journal club: Meta-Prod2Vec
Journal club: Meta-Prod2Vec
 
What do we get from Twitter - and what not?
What do we get from Twitter - and what not?What do we get from Twitter - and what not?
What do we get from Twitter - and what not?
 
Fabrikatyr lda topic modelling practical application
Fabrikatyr lda topic modelling practical applicationFabrikatyr lda topic modelling practical application
Fabrikatyr lda topic modelling practical application
 
Topic Modelling to identify behavioral trends in online communities
Topic Modelling to identify behavioral trends in online communities Topic Modelling to identify behavioral trends in online communities
Topic Modelling to identify behavioral trends in online communities
 
Distributed representation of sentences and documents
Distributed representation of sentences and documentsDistributed representation of sentences and documents
Distributed representation of sentences and documents
 
Drawing word2vec
Drawing word2vecDrawing word2vec
Drawing word2vec
 
EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...
EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...
EMNLP2014読み会 "Efficient Non-parametric Estimation of Multiple Embeddings per ...
 
Emnlp読み会資料
Emnlp読み会資料Emnlp読み会資料
Emnlp読み会資料
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...
[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...
[FW Invest] Près de 2,3 milliards d’euros investis dans la Tech française en ...
 
LDA Beginner's Tutorial
LDA Beginner's TutorialLDA Beginner's Tutorial
LDA Beginner's Tutorial
 

Similar a lda2vec Text by the Bay 2016

Yoav Goldberg: Word Embeddings What, How and Whither
lda2vec Text by the Bay 2016

  • 2. About @chrisemoody: Caltech Physics; PhD in astrostats supercomputing; sklearn t-SNE contributor; Data Labs at Stitch Fix; github.com/cemoody. Gaussian Processes, t-SNE, chainer deep learning, Tensor Decomposition.
  • 4. word2vec: 1. king - man + woman = queen 2. Huge splash in NLP world 3. Learns from raw text 4. Pretty simple algorithm 5. Comes pretrained
  • 5. word2vec: 1. Set up an objective function 2. Randomly initialize vectors 3. Do gradient descent
  • 6. word2vec: learn word vector w from its surrounding context.
  • 7. word2vec: “The fox jumped over the lazy dog”. Maximize the likelihood of seeing the surrounding words given the word “over”: P(the|over) P(fox|over) P(jumped|over) P(the|over) P(lazy|over) P(dog|over) …instead of maximizing the likelihood of co-occurrence counts.
  • 9. word2vec: P(fox|over) becomes P(v_fox|v_over). Should depend on the word vectors.
  • 10-23. word2vec: “The fox jumped over the lazy dog”. P(w|c): extract (w, c) pairs from the context window around every input word. (Slides 10 to 23 are animation frames that move w and c across the sentence.)
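
The pair extraction described in slides 10 to 23 can be sketched roughly as follows. This is a minimal illustration, not the talk's code: the function name `skipgram_pairs` and the window size of 3 are assumptions, with the window chosen so that the word “over” is paired with the same six words listed on slide 7.

```python
def skipgram_pairs(tokens, window=3):
    """Return (c, w) pairs: every input word c is paired with each
    word w that lies at most `window` positions away from it."""
    pairs = []
    for i, c in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((c, tokens[j]))
    return pairs


# Hypothetical preprocessing: lowercase and split on whitespace.
sentence = "The fox jumped over the lazy dog".lower().split()
pairs = skipgram_pairs(sentence, window=3)

# Words seen in the window around the input word "over".
around_over = [w for c, w in pairs if c == "over"]
print(around_over)
```

Each pair then contributes one P(w|c) term to the objective that the following slides define.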
  • 24. Objective: how should we define P(w|c)? Measure loss between w and c?
  • 25. Objective: w · c. How should we define P(w|c)? Measure loss between w and c?
  • 26. word2vec objective: w · c ~ 1, e.g. v_canada · v_snow ~ 1
  • 27. word2vec objective: w · c ~ 0, e.g. v_canada · v_desert ~ 0
  • 28. word2vec objective: w · c ~ -1
  • 29. word2vec objective: w · c ∈ [-1, 1]
  • 30. word2vec objective: w · c ∈ [-1, 1]. But we’d like to measure a probability.
  • 31. word2vec objective: but we’d like to measure a probability. σ(c·w) ∈ [0, 1]
  • 32. word2vec objective: but we’d like to measure a probability. σ(c·w) ∈ [0, 1]. Diagram labels for w and c: Similar, Dissimilar.
  • 33. word2vec objective, loss function: L = σ(c·w). Logistic (binary) choice: is the (context, word) combination from our dataset?
  • 34. word2vec objective, the skip-gram negative-sampling model: L = σ(c·w). Trivial solution is that context = word for all vectors.
  • 35. word2vec objective, the skip-gram negative-sampling model: L = σ(c·w) + σ(-c·w_neg). Draw random words in vocabulary.
  • 36. word2vec objective, the skip-gram negative-sampling model: L = σ(c·w) + σ(-c·w_neg) + … + σ(-c·w_neg). Multiple negative samples: discriminate positive from negative samples.
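
As a rough sketch of the expression on slides 35 and 36, the snippet below evaluates σ(c·w) plus one σ(-c·w_neg) term per negative sample. It follows the slides' simplified form; the original word2vec formulation takes the log of each sigmoid term. The vectors are made-up toy values and the helper names are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_objective(c, w, negatives):
    """Slide form: sigma(c.w) + sum over negatives of sigma(-c.w_neg).
    Larger is better: c agrees with w and disagrees with the negatives."""
    positive_term = sigmoid(dot(c, w))
    negative_terms = sum(sigmoid(-dot(c, w_neg)) for w_neg in negatives)
    return positive_term + negative_terms


# Toy 2-D vectors (hypothetical values, not trained embeddings).
c = [1.0, 0.0]
w_observed = [2.0, 0.0]   # points the same way as c
w_random = [-2.0, 0.0]    # points the opposite way

good = sgns_objective(c, w_observed, [w_random])
bad = sgns_objective(c, w_random, [w_observed])
print(good > bad)
```

Because every pair must also score low against randomly drawn words, the trivial solution from slide 34 (all vectors identical) no longer maximizes the objective.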
  • 37. word2vec, the SGNS model: L = σ(c·w) + σ(-c·w_neg) …is extremely similar to matrix factorization! PMI: c_i·w_j = PMI(M_ij) - log k (Levy & Goldberg 2014)
  • 38. word2vec, the SGNS model: L = σ(c·w) + σ(-c·w_neg) …is extremely similar to matrix factorization! PMI (‘traditional’ NLP): c_i·w_j = PMI(M_ij) - log k (Levy & Goldberg 2014)
  • 39. word2vec, the SGNS model: L = σ(c·w) + Σ σ(-c·w). PMI (‘traditional’ NLP): c_i·w_j = log[ (#(c_i, w_j)/n) / (k · (#(w_j)/n) · (#(c_i)/n)) ] (Levy & Goldberg 2014)
  • 40. word2vec, the SGNS model: L = σ(c·w) + Σ σ(-c·w). PMI (‘traditional’ NLP): c_i·w_j = log[ (popularity of c, w) / (k · (popularity of c) · (popularity of w)) ] (Levy & Goldberg 2014)
  • 41. word2vec, PMI: 99% of word2vec is counting. And you can count words in SQL.
  • 42. word2vec, PMI: count how many times you saw c and w together. Count how many times you saw c. Count how many times you saw w.
  • 43. word2vec, PMI: …and this takes ~5 minutes to compute on a single core. Computing the SVD is a completely standard math-library operation.
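
The counting recipe on slides 39 to 42 can be sketched in a few lines. This is an illustrative toy, assuming the pairs are already extracted; the function name `shifted_pmi` and the example pairs are hypothetical, and the SVD step from slide 43 is left out.

```python
import math
from collections import Counter


def shifted_pmi(pairs, k=1):
    """For every observed (c, w) pair compute
    log[ (#(c,w)/n) / (k * (#(c)/n) * (#(w)/n)) ], i.e. PMI - log k."""
    pair_counts = Counter(pairs)                 # times c and w were seen together
    c_counts = Counter(c for c, _ in pairs)      # times c was seen
    w_counts = Counter(w for _, w in pairs)      # times w was seen
    n = len(pairs)
    table = {}
    for (c, w), n_cw in pair_counts.items():
        p_cw = n_cw / n
        p_c = c_counts[c] / n
        p_w = w_counts[w] / n
        table[(c, w)] = math.log(p_cw / (k * p_c * p_w))
    return table


# Toy (c, w) pairs, made up for illustration.
toy_pairs = [("canada", "snow"), ("canada", "snow"),
             ("sahara", "sand"), ("sahara", "sand")]

pmi = shifted_pmi(toy_pairs, k=1)
pmi_k2 = shifted_pmi(toy_pairs, k=2)
print(pmi[("canada", "snow")], pmi_k2[("canada", "snow")])
```

Here k plays the role of the number of negative samples: raising it shifts every entry down by log k, matching the "- log k" term on slide 37.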
  • 72. word2vec is local: one word predicts a nearby word. “I love finding new designer brands for jeans”
  • 73-74. “I love finding new designer brands for jeans” But text is usually organized.
  • 75. “I love finding new designer brands for jeans” (doc 7681). In LDA, documents globally predict words.
  • 76. Typical word2vec vector: [-0.75, -1.25, -0.55, -0.12, +2.2] (all real values). Typical LDA document vector: [0%, 9%, 78%, 11%] (all sum to 100%).
  • 77. 5D word2vec vector: [-0.75, -1.25, -0.55, -0.12, +2.2]. Dense; all real values; dimensions relative. 5D LDA document vector: [0%, 9%, 78%, 11%]. Sparse; all sum to 100%; dimensions are absolute.
  • 78. 100D word2vec vector: [-0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]. Dense; all real values; dimensions relative. 100D LDA document vector: [0%, 0%, 0%, 0%, 0% … 0%, 9%, 78%, 11%]. Sparse; all sum to 100%; dimensions are absolute.
  • 79. 100D word2vec vector: [-0.75, -1.25, -0.55, -0.27, -0.94, 0.44, 0.05, 0.31 … -0.12, +2.2]. Similar in 100D ways (very flexible). 100D LDA document vector: [0%, 0%, 0%, 0%, 0% … 0%, 9%, 78%, 11%]. Similar in fewer ways (more interpretable); +mixture, +sparse.
  • 80. can we do both? lda2vec
  • 81. -1.9 0.85 -0.6 -0.3 -0.5 Lufthansa is a German airline and when fox #hidden units Skip grams from sentences Word vector Negative sampling loss Lufthansa is a German airline and when German word2vec predicts locally: one word predicts a nearby word
  • 82. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when German Document vector predicts a word from a global context 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when
  • 83. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when We’re missing mixtures & sparsity! German
  • 84. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when We’re missing mixtures & sparsity!
  • 85. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when Now it’s a mixture. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Topic matrix Document proportion Document weight Document vector Context vector x +
  • 86. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when Trinitarian baptismal Pentecostals Bede schismatics excommunication 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 #topics Document weight
  • 87. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when topic 1 = “religion” Trinitarian baptismal Pentecostals Bede schismatics excommunication 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 #topics Document weight
  • 88. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when Milosevic absentee Indonesia Lebanese Isrealis Karadzic 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 #topics Document weight
  • 89. 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when topic 2 = “politics” Milosevic absentee Indonesia Lebanese Isrealis Karadzic 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 41% 26% 34% -1.4 -0.5 -1.4 -1.9-1.7 0.75 0.96-0.7 -1.9 -0.2-1.1 0.6 -0.7 -0.4 -0.7 -0.3 -0.3-1.9 0.85 -0.6 -0.3 -0.5 -2.6 0.45 -1.3 -0.6 -0.8 Lufthansa is a German airline and when #topics #topics fox #hiddenunits #topics #hidden units#hidden units #hidden units Skip grams from sentences Word vector Negative sampling loss Topic matrix Document proportion Document weight Document vector Context vector x + Lufthansa is a German airline and when 0.34 -0.1 0.17 #topics Document weight
  • 90. [Diagram: lda2vec architecture, using the example text “Lufthansa is a German airline and when”. Labels: Skip grams from sentences; Word vector (#hidden units); Document weight (#topics, e.g. 0.34 -0.1 0.17); Document proportion (#topics, e.g. 41% 26% 34%); Topic matrix (#topics × #hidden units); Document vector (#hidden units); Context vector (word vector + document vector); Negative sampling loss.]
  • 91. [Diagram: same lda2vec architecture as slide 90.]
  • 92. [Diagram: same lda2vec architecture as slide 90.]
  • 93. [Diagram: same lda2vec architecture as slide 90, annotated “Sparsity!”. Document proportions over time: t=0: 34% 32% 34%; t=10: 41% 26% 34%; t=∞: 99% 1% 0%.]
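The labels in the diagram on slides 90 to 93 can be read as a small forward pass. The block below is a minimal sketch of that reading in NumPy, not the lda2vec implementation: the sizes, the random vectors, the softmax used to turn document weights into proportions, and the function names are all assumptions made for illustration. Only the three document weights (0.34, -0.1, 0.17) come from the slide, and their softmax lands close to the 41% 26% 34% shown there.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_hidden, n_vocab = 3, 5, 10  # illustrative sizes (assumption)

topic_matrix = rng.normal(size=(n_topics, n_hidden))  # one vector per topic
word_vectors = rng.normal(size=(n_vocab, n_hidden))   # input (pivot) word vectors
out_vectors = rng.normal(size=(n_vocab, n_hidden))    # output (target) word vectors
doc_weight = np.array([0.34, -0.1, 0.17])             # document weights from the slide


def softmax(x):
    """Turn unnormalized weights into proportions that sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()


def context_vector(pivot_id, doc_weight):
    """Context vector = word vector + document vector."""
    doc_proportion = softmax(doc_weight)         # length n_topics
    doc_vector = doc_proportion @ topic_matrix   # length n_hidden
    return word_vectors[pivot_id] + doc_vector


def negative_sampling_loss(context, target_id, negative_ids):
    """Skip-gram negative sampling loss for one (context, target) pair."""
    def log_sigmoid(z):
        return -np.log1p(np.exp(-z))

    positive = log_sigmoid(out_vectors[target_id] @ context)
    negative = log_sigmoid(-(out_vectors[negative_ids] @ context)).sum()
    return -(positive + negative)


ctx = context_vector(pivot_id=2, doc_weight=doc_weight)
loss = negative_sampling_loss(ctx, target_id=3, negative_ids=np.array([4, 5, 6]))
```

In a real model the word vectors, topic matrix, and document weights would all be trained by gradient descent on this loss; here they are fixed random values so the sketch runs on its own.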
  • 95. + API docs + Examples + GPU + Tests @chrisemoody lda2vec.com
  • 96. @chrisemoody Example: Hacker News comments. Topics: http://nbviewer.jupyter.org/github/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/lda2vec.ipynb Word vectors: https://github.com/cemoody/lda2vec/blob/master/examples/hacker_news/lda2vec/word_vectors.ipynb
  • 97. @chrisemoody lda2vec.com If you want… human-interpretable doc topics, use LDA. If you want machine-useable word-level features, use word2vec. If you like to experiment a lot, and have topics over user / doc / region / etc. features, use lda2vec (and you have a GPU).
  • 100. Credit Large swathes of this talk are from previous presentations by: • Tomas Mikolov • David Blei • Christopher Olah • Radim Rehurek • Omer Levy & Yoav Goldberg • Richard Socher • Xin Rong • Tim Hopper
  • 101. Can we model topics to sentences? lda2lstm. Example: “PS! Thank you for such an awesome idea” (doc_id=1846) @chrisemoody
  • 102. Can we model topics to sentences? lda2lstm. Example: “PS! Thank you for such an awesome idea” (doc_id=1846) @chrisemoody Can we model topics to images? lda2ae (TJ Torres)
  • 103. and now for something completely crazy 4 Fun Stuff
  • 104. translation (using just a rotation matrix): English → Spanish via a rotation matrix. Mikolov 2013
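To make the rotation idea concrete, here is a minimal sketch that recovers a rotation between two vector spaces from paired vectors. It uses the orthogonal Procrustes solution, which is one standard way to fit such a map and is an assumption here rather than the method from the slide. The "English" and "Spanish" vectors are synthetic stand-ins, and `fit_rotation` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 4, 50  # illustrative sizes (assumption)

# Synthetic stand-ins: "Spanish" is an exact rotation of "English".
english = rng.normal(size=(n_pairs, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
spanish = english @ true_rotation


def fit_rotation(source, target):
    """Orthogonal Procrustes: the orthogonal R minimizing ||source @ R - target||."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rotation = fit_rotation(english, spanish)
translated = english @ rotation  # map English vectors into the Spanish space
```

With real embeddings the fit would be made on a small dictionary of known word pairs, and a new word would be translated by rotating its vector and taking the nearest neighbor in the target space.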
  • 105. deepwalk (Perozzi et al. 2014). word2vec: learn word vectors from sentences, e.g. “The fox jumped over the lazy dog”. deepwalk: ‘words’ are graph vertices, ‘sentences’ are random walks on the graph.
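The ‘sentences are random walks’ idea can be sketched with the standard library alone. The toy graph, the walk length, and the helper names below are assumptions for illustration; the resulting lists of vertices would then be fed to any word2vec trainer in place of tokenized text.

```python
import random

# Toy undirected graph as an adjacency list (assumption: not from the talk).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}


def random_walk(graph, start, length, rng):
    """One 'sentence': a uniform random walk of `length` vertices."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk


def build_corpus(graph, walks_per_vertex, length, seed=0):
    """Start several walks from every vertex; each walk is one sentence."""
    rng = random.Random(seed)
    return [
        random_walk(graph, vertex, length, rng)
        for _ in range(walks_per_vertex)
        for vertex in graph
    ]


corpus = build_corpus(graph, walks_per_vertex=2, length=5)
```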
  • 106. Playlists at Spotify context sequence learning ‘words’ are song indices ‘sentences’ are playlists
  • 108. Fixes at Stitch Fix sequence learning Let’s try: ‘words’ are items ‘sentences’ are fixes
  • 109. Fixes at Stitch Fix context Learn similarity between styles because they co-occur Learn ‘coherent’ styles sequence learning
  • 112. Fixes at Stitch Fix? context sequence learning Nearby regions are consistent ‘closets’
  • 115. context dependent context: “Australian scientist discovers star with telescope” (Levy & Goldberg 2014)
  • 116. context dependent context: “Australian scientist discovers star with telescope” (Levy & Goldberg 2014)
  • 117. context dependent context: BoW vs DEPS, topically-similar vs ‘functionally’ similar (Levy & Goldberg 2014)
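The difference between window (BoW) contexts and dependency (DEPS) contexts can be shown on the example sentence from slides 115 and 116. This is a sketch under assumptions: the dependency arcs are written by hand rather than produced by a parser, the preposition “with” is collapsed into the relation label, and the function names are hypothetical.

```python
sentence = ["australian", "scientist", "discovers", "star", "with", "telescope"]

# Hand-written (head, relation, modifier) arcs; a real pipeline would get
# these from a dependency parser (assumption: no parser is run here).
arcs = [
    ("scientist", "amod", "australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
    ("discovers", "prep_with", "telescope"),
]


def bow_contexts(tokens, target, window=2):
    """Contexts are the words within `window` positions of the target."""
    i = tokens.index(target)
    return [t for j, t in enumerate(tokens) if j != i and abs(j - i) <= window]


def dep_contexts(arcs, target):
    """Contexts are the target's syntactic neighbors, tagged with the relation."""
    contexts = []
    for head, relation, modifier in arcs:
        if head == target:
            contexts.append(f"{modifier}/{relation}")
        if modifier == target:
            contexts.append(f"{head}/{relation}-1")  # -1 marks the inverse arc
    return contexts


bow = bow_contexts(sentence, "discovers")  # includes "australian", misses "telescope"
deps = dep_contexts(arcs, "discovers")     # includes "telescope", skips "australian"
```

With a window of 2, “discovers” picks up “australian” but never sees “telescope”; the dependency contexts do the opposite, which is one way to read the topical versus ‘functional’ contrast on slide 117.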
  • 120. Crazy Approaches • Paragraph Vectors (just extend the context window) • Context dependency (change the window grammatically) • Social word2vec, i.e. deepwalk (sentence is a walk on the graph) • Spotify (sentence is a playlist of song_ids) • Stitch Fix (sentence is a shipment of five items)
  • 122. CBOW: “The fox jumped over the lazy dog”. Guess the word given the context (inputs: the context words, vIN; output: the word, vOUT). ~20x faster. (This is the alternative.) SkipGram: “The fox jumped over the lazy dog”. Guess the context given the word (input: the word, vIN; outputs: the context words, vOUT). Better at syntax. (This is the one we went over.)
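The two training setups on slide 122 differ only in how examples are cut from the same window. The sketch below builds both kinds of examples for the fox sentence; the window size of 2 and the function names are assumptions for illustration.

```python
def window(tokens, i, size):
    """Words within `size` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - size), min(len(tokens), i + size + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_examples(tokens, size=2):
    """SkipGram: one (input word, output context word) pair per neighbor."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in window(tokens, i, size)
    ]


def cbow_examples(tokens, size=2):
    """CBOW: one (input context words, output word) example per position."""
    return [(window(tokens, i, size), tokens[i]) for i in range(len(tokens))]


tokens = "the fox jumped over the lazy dog".split()
skipgram = skipgram_examples(tokens)
cbow = cbow_examples(tokens)
```

For this sentence SkipGram yields 22 pairs and CBOW yields 7 examples, since CBOW makes one prediction per position instead of one per neighbor.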
  • 123. lda2vec: vDOC = a·vtopic1 + b·vtopic2 + … Let’s make vDOC sparse.
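A small numeric sketch of the mixture on slide 123 and of what “sparse” can mean for it. The Dirichlet-style penalty, the value alpha=0.5, the clipping epsilon, and the random topic vectors are assumptions for illustration and are not taken from the slides; the three sets of proportions are the t=0, t=10, and t=∞ values shown on slide 93.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_hidden = 3, 5  # illustrative sizes (assumption)
topic_vectors = rng.normal(size=(n_topics, n_hidden))


def document_vector(proportions, topic_vectors):
    """vDOC = a * vtopic1 + b * vtopic2 + ... with weights given by `proportions`."""
    return np.asarray(proportions) @ topic_vectors


def dirichlet_penalty(proportions, alpha=0.5, eps=1e-12):
    """Negative Dirichlet log-likelihood up to a constant.

    With alpha < 1 the penalty is lower when mass concentrates on few topics.
    """
    p = np.clip(np.asarray(proportions, dtype=float), eps, None)
    return -np.sum((alpha - 1.0) * np.log(p))


# Document proportions as shown on slide 93.
at_start = [0.34, 0.32, 0.34]   # t=0
midway = [0.41, 0.26, 0.34]     # t=10
converged = [0.99, 0.01, 0.00]  # t=∞

penalties = [dirichlet_penalty(p) for p in (at_start, midway, converged)]
v_doc = document_vector(converged, topic_vectors)  # dominated by the first topic
```

The penalty falls as the proportions move from the even split toward a single dominant topic, so adding such a term to the loss pushes each document toward a few topics.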
  • 124. lda2vec: Let’s say that vDOC adds: vDOC = topic0 + topic1. This works! 😀 But vDOC isn’t as interpretable as the topic vectors. 😔
  • 129. LDA Results context History I loved every choice in this fix!! Great job! Great Stylist Perfect
  • 130. LDA Results context History Body Fit My measurements are 36-28-32. If that helps. I like wearing some clothing that is fitted. Very hard for me to find pants that fit right.
  • 131. LDA Results context History Sizing Really enjoyed the experience and the pieces, sizing for tops was too big. Looking forward to my next box! Excited for next
  • 132. LDA Results context History Almost Bought It was a great fix. Loved the two items I kept and the three I sent back were close! Perfect
  • 133. All of the following ideas will change what ‘words’ and ‘context’ represent.
  • 134. paragraph vector What about summarizing documents? On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that
  • 135. On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed. paragraph vector Normal skipgram extends C words before, and C words after. IN OUT OUT
  • 136. On the day he took office, President Obama reached out to America’s enemies, offering in his first inaugural address to extend a hand if you are willing to unclench your fist. More than six years later, he has arrived at a moment of truth in testing that The framework nuclear agreement he reached with Iran on Thursday did not provide the definitive answer to whether Mr. Obama’s audacious gamble will pay off. The fist Iran has shaken at the so-called Great Satan since 1979 has not completely relaxed. paragraph vector A document vector simply extends the context to the whole document. IN OUT OUT OUT OUT doc_1347
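The contrast between slides 135 and 136 can be sketched as two ways of generating (IN, OUT) training pairs. The helper names and the short token list are assumptions for illustration; `doc_1347` is the document id shown on the slide.

```python
def skipgram_pairs(tokens, c=2):
    """Normal skipgram: each word predicts the C words before and C words after."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


def paragraph_pairs(doc_id, tokens):
    """Paragraph vector: the document token predicts every word in the document."""
    return [(doc_id, word) for word in tokens]


tokens = "on the day he took office".split()
word_pairs = skipgram_pairs(tokens, c=2)
doc_pairs = paragraph_pairs("doc_1347", tokens)
```

Training on both sets of pairs gives the document id its own vector in the same space as the words, which is the sense in which the context is extended to the whole document.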