2. Agenda
Describe the problem (terminology and dataset)
Describe the framework for the solution
Various models
Results
Demo
TensorFlow
3. Problem Description
NLI is an important problem in NLP: summarization, information retrieval, and question answering all rely on high-quality NLI.
Problem set-up:
The SNLI dataset is a collection of about half a million natural language inference (NLI) problems.
Given sentence A (the premise) and sentence B (the hypothesis), classify their relationship as entailment, contradiction, or neutral.
4. Problem Description
Training set: 550,000 sentence pairs
Test set: 10,000 sentence pairs
Dev set: 10,000 sentence pairs
Vocabulary: 33,000 words
Original paper from EMNLP 2015 (http://stanford.edu/~angeli/papers/2015-emnlp-snli.pdf)
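A minimal sketch of reading the data; it assumes the public SNLI 1.0 JSONL release, in which each line carries sentence1, sentence2, and gold_label fields (the file name here is illustrative).

import json

# Collect (premise, hypothesis, label) triples from the SNLI training file.
# File and field names follow the public SNLI 1.0 release; adjust if your
# copy is packaged differently.
examples = []
with open("snli_1.0_train.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        if ex["gold_label"] == "-":   # skip pairs with no annotator consensus
            continue
        examples.append((ex["sentence1"], ex["sentence2"], ex["gold_label"]))

print(len(examples), examples[0])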
6. Few More Examples
Premise: a statue at a museum that no one seems to be looking at .
Hypothesis: there is a statue that not many people seem to be interested in .
Label: Entailment

Premise: a couple walk hand in hand down a street .
Hypothesis: the couple is married .
Label: Neutral
7. Model Architecture
[Architecture figure: the premise and the hypothesis are each mapped to 300d word embeddings (GloVe, CBOW, or Skip-Gram), encoded by a sentence model (bag of words, LSTM/GRU, or CNN) into 100d sentence vectors, passed through three 200d tanh layers (W1, W2, W3) with dropout and L2 regularization, and classified by a softmax layer trained with cross-entropy loss. A sketch of this pipeline follows.]
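A minimal sketch of the pipeline above, written with the current tf.keras API rather than the low-level graph code used for the talk. The layer sizes come from the slide; the dropout rate, L2 strength, and the choice of a bag-of-words encoder (with shared weights for both sentences) are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Sizes from the slide: 300d word embeddings, 100d sentence vectors,
# three 200d tanh layers, 3-way softmax.
VOCAB, EMBED, SENT, HIDDEN = 33000, 300, 100, 200

premise = tf.keras.Input(shape=(None,), dtype="int32", name="premise")
hypothesis = tf.keras.Input(shape=(None,), dtype="int32", name="hypothesis")

embed = layers.Embedding(VOCAB, EMBED)            # GloVe vectors could initialize this
sum_words = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))  # bag-of-words pooling
project = layers.Dense(SENT, activation="tanh")   # 300d -> 100d sentence vector

def encode(tokens):
    # Encoder weights are shared between premise and hypothesis for simplicity.
    return project(sum_words(embed(tokens)))

x = layers.Concatenate()([encode(premise), encode(hypothesis)])
for _ in range(3):                                 # the W1, W2, W3 tanh layers
    x = layers.Dense(HIDDEN, activation="tanh",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dropout(0.2)(x)                     # dropout for regularization
out = layers.Dense(3, activation="softmax")(x)     # entailment / contradiction / neutral

model = tf.keras.Model([premise, hypothesis], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])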
9. Distributed Word Representation
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size ("continuous space").
Different algorithms, similar ideas (see the sketch below):
CBOW
Skip-gram
GloVe
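All of these algorithms produce the same kind of object: a matrix of real-valued vectors indexed by word. A minimal sketch of how such vectors are used, with a hypothetical toy embedding matrix (pretrained GloVe or word2vec vectors would normally be loaded instead):

import numpy as np

# Hypothetical toy setup: a 5-word vocabulary embedded in 4 dimensions.
vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3, "kitten": 4}
embeddings = np.random.randn(len(vocab), 4)   # rows are word vectors

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# After training (CBOW, Skip-gram, or GloVe), semantically similar words
# should have a higher cosine similarity than unrelated ones.
print(cosine(embeddings[vocab["cat"]], embeddings[vocab["kitten"]]))
print(cosine(embeddings[vocab["cat"]], embeddings[vocab["truck"]]))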
16. def negSamplingGradient(center, neighbour, outputVectors, dataset, K=10):
    # Gradient of the negative-sampling objective with respect to the center
    # vector v_c. Assumes numpy is imported as np, a sigmoid() helper is in
    # scope, and `dataset` exposes sampleTokenIdx() to draw negative samples.
    word_dim = outputVectors.shape[1]
    word_samples = np.zeros((K, word_dim))
    for i in range(K):
        sample = dataset.sampleTokenIdx()        # random negative-sample index
        word_samples[i, :] = outputVectors[sample, :]

    u_o = outputVectors[neighbour, :]
    # First part of the gradient w.r.t. v_c: -(1 - sigmoid(u_o . v_c)) * u_o
    cost_tmp = sigmoid(np.dot(u_o, center))
    grad_part_1 = -(1 - cost_tmp) * u_o

    # Second part: each negative sample u_j contributes (1 - sigmoid(-u_j . v_c)) * u_j
    cost_tmp = sigmoid(np.dot(-word_samples, center))
    tmp_2 = (1 - cost_tmp)[:, np.newaxis] * word_samples
    grad_part_2 = np.sum(tmp_2, axis=0)

    grad_vc = grad_part_1 + grad_part_2
    return grad_vc
17. Bag of Words Approach
Premise sentence: simply add the word vectors of all the words in the sentence.
Do the same for the hypothesis sentence.
Now we have a feature vector for each sentence (see the sketch below).
Train the neural network; it is not a very deep network.
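A minimal sketch of building those feature vectors with NumPy; the embedding matrix and token indices here are stand-ins (a trained GloVe matrix and a real tokenizer would be used in practice).

import numpy as np

def bow_vector(word_ids, embeddings):
    # Sum the word vectors of all words in the sentence.
    return embeddings[word_ids].sum(axis=0)

embeddings = np.random.randn(33000, 300).astype(np.float32)  # stand-in for GloVe
premise_ids = [12, 873, 44, 9]        # hypothetical token indices
hypothesis_ids = [12, 873, 1502]

# Concatenate the two sentence vectors into one feature vector for the classifier.
features = np.concatenate([bow_vector(premise_ids, embeddings),
                           bow_vector(hypothesis_ids, embeddings)])
print(features.shape)                 # (600,)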
34. Results
Since the weight matrices are randomly initialized, the starting accuracy of almost all models is about 33% (3 classes).
With a high learning rate and a high dropout keep probability, very large gradients flow through the network and the loss keeps oscillating.
Bag of words without word-vector updates gives about 62-64% accuracy.
Bag of words with word-vector updates gives about 70% accuracy.
CNN accuracy is about 75%.
LSTM accuracy is 78-79%. State of the art is 86% (neural attention models).
35. Hyper-parameters
Network dimensionality
Optimization routine (Adam, stochastic gradient descent)
Learning rate, regularization parameters (L2, dropout)
Learning rate: determines how much the parameters change on each update. Larger values cause the parameters to swing, leading to sub-optimal results.
Dropout: the probability that an individual neuron fires or is shut off. It prevents over-fitting by randomly withholding information, which forces the network to learn the data patterns without relying too heavily on any single neuron.
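A minimal sketch of where these knobs appear in code, using the tf.keras API; the values shown are illustrative, not the settings tuned for the results above.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(200, activation="tanh", input_shape=(600,),
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 regularization
    layers.Dropout(0.2),             # drop 20% of activations (keep probability 0.8)
    layers.Dense(3, activation="softmax"),
])

# Optimization routine and learning rate; swapping in SGD is a one-line change.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])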
39. TensorFlow
TensorFlow is a deep learning library recently open-sourced by Google.
TensorFlow provides primitives for defining functions on tensors and automatically computing their derivatives.
It has various utilities for solving stochastic minimization problems.
Similar to NumPy; in fact, NumPy arrays can be converted into tensors, though this isn't the most efficient approach (see the sketch below).
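A minimal sketch of both points, using the current TensorFlow 2.x eager API (the talk predates it and used the original graph-based API):

import numpy as np
import tensorflow as tf

# A NumPy array can be converted to a tensor (this copies the data, which is
# why it isn't the most efficient path for large inputs).
x = tf.convert_to_tensor(np.array([1.0, 2.0, 3.0]), dtype=tf.float32)

# TensorFlow automatically computes derivatives of functions on tensors.
with tf.GradientTape() as tape:
    tape.watch(x)                     # x is a constant, so watch it explicitly
    y = tf.reduce_sum(x ** 2)         # y = sum(x_i^2)

grad = tape.gradient(y, x)            # dy/dx = 2x
print(grad.numpy())                   # [2. 4. 6.]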