[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
1. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features
2018/07/12 M1 Hiroki Shimanaka
Matteo Pagliardini, Prakhar Gupta and Martin Jaggi
NAACL 2018
2. Abstract & Introduction (1)
• Unsupervised training of word representations, such as Word2Vec [Mikolov et al., 2013], is now routinely performed on very large amounts of raw text data, and the resulting embeddings have become ubiquitous building blocks of a majority of current state-of-the-art NLP applications.
• While very useful semantic representations are available for words, obtaining comparable representations for whole sentences remains challenging.
3. Abstract & Introduction (2)
• A strong trend in deep learning for NLP leads towards increasingly powerful and complex models.
  ➢ While extremely strong in expressiveness, the increased model complexity makes such models much slower to train on larger datasets.
• On the other hand, simpler "shallow" models such as matrix factorizations (or bilinear models) can benefit from training on much larger sets of data, which can be a key advantage, especially in the unsupervised setting.
4. Abstract & Introduction (3)
• The authors present a simple but efficient unsupervised objective to train distributed representations of sentences.
• Their method outperforms the state-of-the-art unsupervised models on most benchmark tasks, highlighting the robustness of the produced general-purpose sentence embeddings.
5. Approach
• Their proposed model (Sent2Vec) can be seen as an extension of the C-BOW [Mikolov et al., 2013] training objective to train sentence instead of word embeddings.
• Sent2Vec is a simple unsupervised model that composes sentence embeddings from word vectors along with n-gram embeddings, simultaneously training the composition and the embedding vectors themselves.
6. Model (1)
• Their model is inspired by simple matrix factorization models (bilinear models) such as those recently used with great success in unsupervised learning of word embeddings, built on the product U V ι_S with the following notation:
U ∈ ℝ^{k×h}: target word vectors
V ∈ ℝ^{h×|𝒱|}: learnt source word vectors
𝒱: vocabulary
h: embedding (hidden) dimension
ι_S ∈ {0, 1}^{|𝒱|}: binary vector encoding the sentence S
S: sentence
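As a toy numerical sketch of this bilinear form (all sizes, values, and variable names below are invented for illustration), the binary sentence vector ι_S sums the corresponding columns of V, and U maps the result to one score per target word:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, h = 10, 4                 # |V| and embedding dimension (toy sizes)
V = rng.normal(size=(h, vocab_size))  # source vectors: one column per word
U = rng.normal(size=(vocab_size, h))  # target vectors: one row per word (k = |V|)

iota_S = np.zeros(vocab_size)
iota_S[[1, 3, 7]] = 1.0               # binary encoding of a 3-word sentence S

scores = U @ (V @ iota_S)             # U V iota_S: the quantity the objective scores
```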
7. Model (2)
• Conceptually, Sent2Vec can be interpreted as a natural extension of the word contexts from C-BOW [Mikolov et al., 2013] to a larger sentence context.
v_S := (1/|R(S)|) V ι_{R(S)} = (1/|R(S)|) ∑_{w∈R(S)} v_w
R(S): the list of n-grams (including unigrams) present in sentence S
ι_{R(S)} ∈ {0, 1}^{|𝒱|}: binary vector encoding R(S)
S: sentence
v_w: source (or context) embedding of w
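A minimal sketch of this composition, assuming a plain Python dictionary stands in for the learnt source vectors (the helper names and toy vocabulary are invented, not from the authors' code): the sentence embedding is the average of the source vectors of all unigrams and bigrams found in the sentence.

```python
import numpy as np

def ngrams(tokens, n_max=2):
    """R(S): the list of n-grams (including unigrams) of a tokenized sentence."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def sentence_embedding(tokens, src_emb, dim, n_max=2):
    """v_S = (1/|R(S)|) * sum of source vectors v_w for w in R(S).

    Out-of-vocabulary n-grams are simply skipped in this toy version."""
    vecs = [src_emb[g] for g in ngrams(tokens, n_max) if g in src_emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# toy usage with random vectors and a made-up vocabulary
rng = np.random.default_rng(0)
src_emb = {w: rng.normal(size=100)
           for w in ["the", "cat", "sat", "the cat", "cat sat"]}
v_S = sentence_embedding(["the", "cat", "sat"], src_emb, dim=100)
```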
8. Model (3)
• Negative sampling: each target word w_t is contrasted with a set of sampled negative words under the binary logistic loss ℓ(x) := log(1 + e^{-x}):
min_{U,V} ∑_{S∈C} ∑_{w_t∈S} [ ℓ(u_{w_t}^⊤ v_{S∖{w_t}}) + ∑_{w'∈N_{w_t}} ℓ(-u_{w'}^⊤ v_{S∖{w_t}}) ]
C: the training corpus of sentences
N_{w_t}: the set of words sampled negatively for the word w_t ∈ S
S: sentence
v_w: source (or context) embedding
u_w: target embedding
f_w: the normalized frequency of w in the corpus
9. Model (4)
• Subsampling
  • To select the possible target unigrams (positives), they use subsampling as in [Joulin et al., 2017; Bojanowski et al., 2017], each word w being discarded with probability 1 - q_p(w).
N_{w_t}: the set of words sampled negatively for the word w_t ∈ S
S: sentence
v_w: source (or context) embedding
u_w: target embedding
q_n(w) := √f_w / ∑_{w_i∈𝒱} √f_{w_i}: the distribution from which negatives are drawn
f_w: the normalized frequency of w in the corpus
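A sketch of the two sampling distributions: q_n(w) follows the formula above, while the slide does not show the exact form of q_p(w), so the fastText-style keep-probability used below is an assumption.

```python
import numpy as np

def keep_prob(f_w, t=1e-5):
    """Subsampling keep-probability q_p(w): a word is discarded with
    probability 1 - q_p(w). The fastText-style form used here,
    min{1, sqrt(t/f_w) + t/f_w}, is an assumption."""
    return min(1.0, np.sqrt(t / f_w) + t / f_w)

def negative_distribution(freqs):
    """q_n(w) = sqrt(f_w) / sum_i sqrt(f_wi): sampling weights for negatives."""
    words = list(freqs)
    weights = np.sqrt([freqs[w] for w in words])
    return words, weights / weights.sum()

# toy usage with made-up normalized frequencies f_w
freqs = {"the": 0.9, "cat": 0.06, "sat": 0.04}
words, q_n = negative_distribution(freqs)
negatives = np.random.default_rng(0).choice(words, size=5, p=q_n)
```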
10. Experimental Setting (1)
• Dataset:
  • the Toronto Book Corpus (70M sentences)
  • Wikipedia sentences and tweets
• Dropout:
  • For each sentence they use dropout on its list of n-grams R(S) ∖ {U(S)}, where U(S) is the set of all unigrams contained in sentence S.
  • They find that dropping K n-grams (n > 1) for each sentence gives superior results compared to dropping each token with some fixed probability (a sketch follows this slide).
• Regularization:
  • L1 regularization applied to the word vectors.
• Optimizer:
  • SGD
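A minimal sketch of this per-sentence n-gram dropout, assuming the n-grams have already been split into unigrams and higher-order n-grams (the helper name and the value of K are illustrative):

```python
import random

def dropout_ngrams(unigrams, higher_ngrams, K=2, rng=random):
    """Drop K randomly chosen n-grams with n > 1; the unigrams U(S) are
    always kept, so every word still contributes to the context vector."""
    kept = list(higher_ngrams)
    for _ in range(min(K, len(kept))):
        kept.pop(rng.randrange(len(kept)))
    return unigrams + kept

# toy usage
print(dropout_ngrams(["the", "cat", "sat"], ["the cat", "cat sat"], K=1))
```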
16. Conclusion
• In this paper, they introduce a novel, computationally efficient, unsupervised, C-BOW-inspired method to train and infer sentence embeddings.
• On supervised evaluations, their method on average achieves better performance than all other unsupervised competitors, with the exception of Skip-Thought [Kiros et al., 2015].
• However, their model is generalizable, extremely fast to train, simple to understand and easily interpretable, showing the relevance of simple and well-grounded representation models in contrast to models using deep architectures.
• Future work could focus on augmenting the model to exploit data with ordered sentences.