Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks, has led to new approaches and new state-of-the-art results in many natural language processing tasks. One exciting and very recent trend is multimodal approaches that fuse techniques and models from natural language processing (NLP) with those from computer vision.
The talk aims to give an overview of the NLP side of this trend. It starts with a short overview of the challenges in building deep networks for language, what makes for a "good" language model, and the specific requirements semantic word spaces must meet for multi-modal embeddings.
3. Can we understand Language?
Some of the challenges in Language Understanding:
1. Language is ambiguous: every sentence has many possible interpretations.
2. Language is productive: we will always encounter new words or new constructions.
3. Language is culturally specific.
4. Can we understand Language?
Some of the challenges in Language Understanding:
1. Language is ambiguous: every sentence has many possible interpretations.
2. Language is productive: we will always encounter new words or new constructions.
Examples (candidate part-of-speech taggings):
• plays well with others
  VB ADV P NN
  NN NN P DT
• fruit flies like a banana
  NN NN VB DT NN
  NN VB P DT NN
  NN NN P DT NN
  NN VB VB DT NN
• the students went to class
  DT NN VB P NN
8. ML: Traditional Approach
For each new problem/question:
1. Gather as much LABELED data as you can get.
2. Throw some algorithms at it (often just an SVM, and leave it at that).
3. If you actually tried more algorithms: pick the best.
4. Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.).
5. Repeat…
9. Machine Learning for NLP
Classic approach: data is fed into a learning algorithm:
[Diagram: Data → Learning Algorithm]
10. Machine Learning for NLP
[Figure: some of the (many) treebank datasets]
source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks
12. Penn Treebank: with a lot of issues
• the students went to class
  DT NN VB P NN
• plays well with others
  VB ADV P NN
  NN NN P DT
• fruit flies like a banana
  NN NN VB DT NN
  NN VB P DT NN
  NN NN P DT NN
  NN VB VB DT NN
13. Machine Learning for NLP
[Diagram: train set → "Features" → Learning Algorithm → Classifier;
test set → "Features" → Classifier → Prediction]
15. Deep Learning: Why for NLP?
Beat state of the art in:
• Language Modeling (Mikolov et al. 2011) [WSJ AR task]
• Speech Recognition (Dahl et al. 2012; Seide et al. 2011; following Mohamed et al. 2011)
• Sentiment Classification (Socher et al. 2011)
And we have known for a longer time that it works well for Computer Vision, of course:
• MNIST hand-written digit recognition (Ciresan et al. 2010)
• Image Recognition (Krizhevsky et al. 2012) [ImageNet]
16. Deep Learning: Why for NLP? One Model rules them all?
DL approaches have been successfully applied to:
automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition (NER), natural language generation, natural language understanding, optical character recognition (OCR), part-of-speech tagging, parsing, question answering, relationship extraction, sentence boundary disambiguation, sentiment analysis, speech recognition, speech segmentation, topic segmentation and recognition, word segmentation, word sense disambiguation, information retrieval (IR), information extraction (IE), and speech processing.
17. Semantics: Meaning
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
18. Word Representation
• NLP (at least in rule-based/statistical approaches) treats words mainly as atomic symbols: Love, Candy, Store
• or, in vector space, as "one-hot" vectors:
  Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]
• The problem? Any two distinct one-hot vectors are orthogonal, so no similarity is captured:
  Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND
  Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
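To make the problem concrete, here is a minimal numpy sketch (the toy vocabulary and index positions are made up for illustration): the dot product of any two distinct one-hot vectors is zero, so the representation carries no notion of word similarity.

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary assumptions.
vocab = {"candy": 5, "store": 15}
V = 20  # vocabulary size

def one_hot(index, size):
    """Return a one-hot row vector with a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

candy = one_hot(vocab["candy"], V)
store = one_hot(vocab["store"], V)

# Any two distinct one-hot vectors are orthogonal:
print(candy @ store)  # 0.0 -- no notion of similarity between words
```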
21. Language Model
• Language models define probability distributions over (natural language) strings or sentences.
• Built from joint and conditional probabilities.
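As a brief formal reminder (the standard chain-rule factorization, not specific to this talk): a language model assigns a probability to a whole word sequence by factoring the joint probability into per-word conditional probabilities; n-gram and neural LMs differ only in how they estimate those conditionals.

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```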
24. Word senses
What is the meaning of words?
• Most words have many different senses: dog = animal or sausage?
How are the meanings of different words related?
• Specific relations between senses: Animal is more general than dog.
• Semantic fields: money is related to bank.
25. Word senses
Polysemy:
• A lexeme is polysemous if it has different related senses.
• bank = financial institution or building
Homonyms:
• Two lexemes are homonyms if their senses are unrelated, but they happen to have the same spelling and pronunciation.
• bank = (financial) bank or (river) bank
26. Word senses: relations
Symmetric relations:
• Synonyms: couch/sofa (two lemmas with the same sense)
• Antonyms: cold/hot, rise/fall, in/out (two lemmas with the opposite sense)
Hierarchical relations:
• Hypernyms and hyponyms: pet/dog (the hyponym, dog, is more specific than the hypernym, pet)
• Holonyms and meronyms: car/wheel (the meronym, wheel, is a part of the holonym, car)
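These relations can also be explored programmatically, for instance with WordNet. A small sketch, assuming nltk and its WordNet corpus are installed (nltk.download('wordnet')); the chosen words and the use of the first-listed sense are illustrative assumptions:

```python
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]                   # first sense of "dog"
print(dog.definition())                      # gloss of that sense
print(dog.hypernyms())                       # more general synsets (e.g. canine)
print(dog.hyponyms()[:3])                    # more specific synsets
print(wn.synsets("car")[0].part_meronyms())  # parts of a car (wheel, window, ...)
```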
27. Distributional representations
"You shall know a word by the company it keeps" (J. R. Firth 1957)
One of the most successful ideas of modern statistical NLP!
[Figure: the context words around occurrences of "banking" come to represent it]
• Hard (class-based) clustering models
• Soft clustering models
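A minimal numpy sketch of the distributional idea (the corpus, the sentence-level context window, and all values are toy assumptions): count which words co-occur, then compare count vectors; words that keep similar company end up with similar vectors.

```python
import numpy as np
from itertools import combinations

# Tiny made-up corpus for illustration.
sentences = [
    "the bank approved my money loan".split(),
    "money is related to the bank".split(),
    "the dog slept behind the tree".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrence within each sentence (a crude context window).
M = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for a, b in combinations(s, 2):
        M[idx[a], idx[b]] += 1
        M[idx[b], idx[a]] += 1

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Words that keep similar company get similar vectors:
print(cos(M[idx["bank"]], M[idx["money"]]))  # relatively high
print(cos(M[idx["bank"]], M[idx["dog"]]))    # relatively low
```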
28. Distributional hypothesis
"He filled the wampimuk, passed it around and we all drank some."
"We found a little, hairy wampimuk sleeping behind the tree."
(McDonald & Ramscar 2001)
31. Distributional representations
• Taking it further: continuous word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as dense vectors:
  Candy = [Figure: a dense real-valued vector]
33. Word Embeddings: Vector Space Model (Socher)
adapted from Bengio, "Representation Learning and Deep Learning", July 2012, UCLA
In a perfect world:
the country of my birth
the place where I was born
34. Why Neural Networks for NLP?
• Can theoretically (given enough units) approximate "any" function
• and fit to "any" kind of data
• Efficient for NLP: hidden layers can be used as word lookup tables
• Dense distributed word vectors + efficient NN training algorithms: can scale to billions of words!
35. How?
• Representation of words as continuous vectors has a long history (Hinton et al. 1986; Rumelhart et al. 1986; Elman 1990)
• First neural network language model: NNLM (Bengio et al. 2001; Bengio et al. 2003), based on earlier ideas of distributed representations for symbols (Hinton 1986)
37. Compositionality
Principle of compositionality: the "meaning (vector) of a complex expression (sentence) is determined by:
- the meanings of its constituent expressions (words) and
- the rules (grammar) used to combine them"
(Gottlob Frege, 1848-1925)
39. Compositionality
• How do we handle the compositionality of language in our models?
• Recursion: the same operator (same parameters) is applied repeatedly on different components.
40. RNN 1: Recurrent Neural Networks
• How do we handle the compositionality of language in our models?
• Option 1: Recurrent Neural Networks (RNN)
41. RNN 2: Recursive Neural Networks
• How do we handle the compositionality of language in our models?
• Option 2: Recursive Neural Networks (also sometimes called RNN)
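To make the recursive idea concrete, here is a toy numpy sketch of the composition step used in recursive networks of the Socher et al. flavor, p = tanh(W[c1; c2] + b), with the same parameters reused at every tree node. The dimension, parse tree, and all vectors are assumptions for illustration.

```python
import numpy as np

d = 4  # assumed embedding dimension for this toy sketch
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1  # shared composition weights
b = np.zeros(d)

def compose(left, right):
    """One recursive step: parent = tanh(W [left; right] + b).
    The SAME W and b are applied at every node of the parse tree."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy word vectors for "the students went to class" (assumed values).
the, students, went, to, cls = (rng.standard_normal(d) for _ in range(5))

# Compose bottom-up along an assumed parse: ((the students) (went (to class)))
np_vec = compose(the, students)
pp_vec = compose(to, cls)
vp_vec = compose(went, pp_vec)
sentence_vec = compose(np_vec, vp_vec)
print(sentence_vec)  # one fixed-size vector for the whole sentence
```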
42. Recurrent Neural Networks
• Achieved SOTA in 2011 on Language Modeling (WSJ AR task) (Mikolov et al., INTERSPEECH 2011):
"Comparison to other LMs shows that RNN LMs are state of the art by a large margin. Improvements increase with more training data."
• and again at ASRU 2011:
"[RNN LM trained on a] single core on 400M words in a few days, with 1% absolute improvement in WER on state of the art setup"
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011). Recurrent neural network based language model
43. Recurrent Neural Networks
[Figure: simple recurrent neural network for LM: input layer → hidden layer(s) with sigmoid activation → output layer with softmax]
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J.H., Khudanpur, S. (2011). Recurrent neural network based language model
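A toy numpy sketch of one step of such a simple (Elman-style) recurrent LM, with the sigmoid hidden layer and softmax output named on the slide; the sizes and weights are arbitrary assumptions, and training (backpropagation through time) is omitted.

```python
import numpy as np

V, H = 10, 8  # assumed vocabulary size and hidden size
rng = np.random.default_rng(1)
U = rng.standard_normal((H, V)) * 0.1     # input -> hidden
W = rng.standard_normal((H, H)) * 0.1     # hidden -> hidden (recurrence)
Vout = rng.standard_normal((V, H)) * 0.1  # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(word_index, h_prev):
    """One RNN-LM step: consume one word, return the new hidden
    state and a distribution over the next word."""
    x = np.zeros(V)
    x[word_index] = 1.0                # one-hot input word
    h = sigmoid(U @ x + W @ h_prev)    # hidden state carries the history
    y = softmax(Vout @ h)              # P(next word | history)
    return h, y

h = np.zeros(H)
for w in [3, 1, 4]:  # toy word-index sequence
    h, y = step(w, h)
print(y.sum())  # 1.0 -- a proper distribution over the vocabulary
```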
46. Recursive Neural Network
• Recursive Neural Network for LM (Socher et al. 2011; Socher 2014)
• Achieved SOTA on the new Stanford Sentiment Treebank dataset (compared against many other models):
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
47. Recursive Neural Tensor Network
code & info: http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks
Socher, R., Liu, C.C., Ng, A.Y., Manning, C.D. (2011). Parsing Natural Scenes and Natural Language with Recursive Neural Networks
51. Recursive Neural Network
• RNN (Socher et al. 2011a)
• Matrix-Vector RNN (MV-RNN) (Socher et al. 2012)
• Recursive Neural Tensor Network (RNTN) (Socher et al. 2013)
52. Recursive Neural Network
• negation detection:
Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
info & code: http://nlp.stanford.edu/sentiment/
56. Recurrent NN for Vector Space: Compositionality
[Figure: parse tree (ROOT/S with NP, PP, DT, NN, PRP, IN nodes) pairing the "rules" (grammar) with the "meanings" (word vectors)]
57. Recurrent NN for Vector Space: Vector Space + Word Embeddings (Socher)
[Figure: word vectors composed recursively in a shared vector space]
59. Word Embeddings: Turian (2010)
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning
code & info: http://metaoptimize.com/projects/wordreprs/
61. Word Embeddings: Collobert & Weston (2011)
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (almost) from Scratch
62. Multi-embeddings: Stanford (2012)
Huang, E.H., Socher, R., Manning, C.D., Ng, A.Y. (2012). Improving Word Representations via Global Context and Multiple Word Prototypes
63. Linguistic Regularities: Mikolov (2013)
Mikolov, T., Yih, W., Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations
code & info: https://code.google.com/p/word2vec/
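The well-known analogy demo can be reproduced with the gensim Python port; a hedged sketch, assuming gensim is installed and pretrained vectors in word2vec's binary format are available locally (the file name here is a placeholder):

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (path/file name is an assumption).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# vector('king') - vector('man') + vector('woman') ~ vector('queen')
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```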
64. Word Embeddings for MT: Mikolov (2013)
Mikolov, T., Le, Q.V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation
65. Recursive Deep Models & Sentiment: Socher (2013)
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
code & demo: http://nlp.stanford.edu/sentiment/index.html
66. Paragraph Vectors: Le & Mikolov (2014)
• add context (sentence, paragraph, document) to word vectors during training
[Figure: results on the Stanford Sentiment Treebank dataset]
Le, Q., Mikolov, T. (2014). Distributed Representations of Sentences and Documents
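gensim also ships an independent Doc2Vec implementation of this method; a minimal sketch, where the toy documents and all parameters are assumptions (model.dv is the gensim 4.x name; older versions called it docvecs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words="a great and moving film".split(), tags=["pos_1"]),
    TaggedDocument(words="dull and predictable plot".split(), tags=["neg_1"]),
]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Each tagged document now has a dense vector, usable as features
# for a downstream sentiment classifier:
print(model.dv["pos_1"][:5])
# Infer a vector for an unseen piece of text:
print(model.infer_vector("a moving film".split())[:5])
```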
67. Global Vectors, GloVe: Stanford (2014)
Pennington, J., Socher, R., Manning, C.D. (2014). GloVe: Global Vectors for Word Representation
code & demo: http://nlp.stanford.edu/projects/glove/
[Figure: GloVe vs word2vec results on the word analogy task: "similar accuracy"]
68. Dependency-based Embeddings: Levy & Goldberg (2014)
Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings
code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
- Syntactic Dependency Context vs. Bag of Words (BoW) Context (example: "Australian scientist discovers star with telescope")
[Figure: precision-recall curves comparing the two context types; "Dependency-based embeddings have more functional similarities"]
70. Some Current Approaches
1. Multimodal representation learning
2. Generating descriptions of images
3. Ranking images and captions ("image-sentence ranking")
4. Evaluation methods?
71. 1. Multimodal representation learning
• Learning joint image-word embeddings in a low-dimensional embedding space (Weston et al. 2010; Frome et al. 2013)
• Embedding images and sentences into a common space (Socher et al. TACL 2014; Karpathy et al. NIPS 2014)
• also: deep Boltzmann machines (Srivastava & Salakhutdinov NIPS 2012), log-bilinear neural language models (Kiros et al. ICML 2014), autoencoders (Ngiam et al. ICML 2011), recurrent neural networks (m-RNN: Mao et al. 2014) and topic models (Jia et al. ICCV 2011)
73. 3. Ranking images and captions
• Bi-directional approaches (Hockenmaier group): kernel CCA (Hodosh et al. JAIR 2013), normalized CCA (Gong et al. ECCV 2014)
• Dependency tree recursive networks (Stanford) (Socher et al. TACL 2014)
74. 4. Evaluation?
• Current evaluation methods are unreliable and do not match human judgements:
• BLEU
• ROUGE
• Perplexity??
• Ranking as a proxy for generation / scoring function
75. Zero-shot Learning
• unsupervised pre-training on many images
• in parallel, train a neural network Language Model
• train a linear mapping between image representations and word embeddings, representing the different "classes"
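A toy numpy sketch of that third step (all shapes and data are stand-ins, not the actual pipeline): fit a least-squares linear map from image features into word space, then classify by nearest label embedding, which is what lets unseen classes be predicted.

```python
import numpy as np

D_img, D_word = 6, 4  # assumed feature and embedding dimensions
rng = np.random.default_rng(2)

# Pretrained word embeddings for class labels (random stand-ins here).
label_emb = {"cat": rng.standard_normal(D_word),
             "dog": rng.standard_normal(D_word),
             "horse": rng.standard_normal(D_word)}  # unseen during training

# Training pairs: image features labeled only "cat" or "dog".
X = rng.standard_normal((20, D_img))
Y = np.array([label_emb["cat" if i % 2 == 0 else "dog"] for i in range(20)])

# Least-squares linear map from image space into word space.
M, *_ = np.linalg.lstsq(X, Y, rcond=None)

def classify(img_feat):
    """Map an image into word space, pick the nearest label embedding.
    Because 'horse' has a word vector, it can be predicted without any
    labeled horse images -- the zero-shot idea."""
    z = img_feat @ M
    return max(label_emb, key=lambda w: z @ label_emb[w] /
               (np.linalg.norm(z) * np.linalg.norm(label_emb[w])))

print(classify(rng.standard_normal(D_img)))
```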
76. DeViSE model (Frome et al. 2013)
• skip-gram text model on a Wikipedia corpus of 5.7 million documents (5.4 billion words); approach from (Mikolov et al. ICLR 2013)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013). DeViSE: A Deep Visual-Semantic Embedding Model
77. Encoder-Decoder pipeline (Kiros et al. 2014)
Encoder: a deep convolutional network (CNN) and a long short-term memory recurrent network (LSTM) for learning a joint image-sentence embedding.
Decoder: a new neural language model that combines structure and content vectors for generating words one at a time in sequence.
Kiros, R., Salakhutdinov, R., Zemel, R.S. (2014). Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
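Joint embeddings of this kind are typically trained with a pairwise hinge ranking loss; a toy numpy sketch of that objective (the margin value and vectors are assumptions, and this is a generic formulation rather than the paper's exact setup):

```python
import numpy as np

def ranking_loss(img, cap, negatives, margin=0.2):
    """Hinge ranking loss for a joint image-sentence embedding:
    the matching caption must score higher (by `margin`) than
    non-matching ones. Vectors are assumed L2-normalized, so the
    dot product is cosine similarity."""
    s_pos = img @ cap
    return sum(max(0.0, margin - s_pos + img @ neg) for neg in negatives)

def unit(v):
    return v / np.linalg.norm(v)

# Toy normalized embeddings (assumed values):
rng = np.random.default_rng(3)
img = unit(rng.standard_normal(8))
cap = unit(img + 0.1 * rng.standard_normal(8))      # close to its image
negs = [unit(rng.standard_normal(8)) for _ in range(5)]
print(ranking_loss(img, cap, negs))
```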
78. Encoder-Decoder pipeline (Kiros et al. 2014)
• matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections
• new best results when using the 19-layer Oxford convolutional network
• linear encoders: the learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars)
Kiros, R., Salakhutdinov, R., Zemel, R.S. (2014). Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
80. Wanna Play? General Deep Learning
• Theano: CPU/GPU symbolic expression compiler in Python (from the LISA lab at University of Montreal). http://deeplearning.net/software/theano/
• Pylearn2: library designed to make machine learning research easy. http://deeplearning.net/software/pylearn2/
• Torch: Matlab-like environment for state-of-the-art machine learning algorithms in Lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu). http://torch.ch/
• more info: http://deeplearning.net/software_links/
81. Wanna Play? NLP
• RNNLM (Mikolov): http://rnnlm.org
• NB-SVM: https://github.com/mesnilgr/nbsvm
• Word2Vec (skip-grams/CBOW):
  https://code.google.com/p/word2vec/ (original)
  http://radimrehurek.com/gensim/models/word2vec.html (python)
• GloVe:
  http://nlp.stanford.edu/projects/glove/ (original)
  https://github.com/maciejkula/glove-python (python)
• Socher et al. / Stanford RNN sentiment code: http://nlp.stanford.edu/sentiment/code.html
• Deep Learning without Magic tutorial: http://nlp.stanford.edu/courses/NAACL2013/
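For a first experiment with these tools, a minimal skip-gram training run with gensim's word2vec port (toy corpus and parameters; with this little data the nearest neighbors are not meaningful):

```python
from gensim.models import Word2Vec

sentences = [
    "the students went to class".split(),
    "fruit flies like a banana".split(),
    "the candy store sells candy".split(),
]
# sg=1 selects skip-gram; all parameters here are toy assumptions.
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("candy", topn=3))
```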
83. In touch!
Academic/Research, as PhD candidate KTH/CSC:
"Always interested in discussing Machine Learning, Deep Architectures, Graphs, and Language Technology"
roelof@kth.se
www.csc.kth.se/~roelof/
Internship / Entrepreneurship, as CIO/CTO Feeda:
"Always looking for additions to our brand new R&D team"
[Internships upcoming on KTH exjobb website…]
roelof@feeda.com
www.feeda.com