2. Agenda
Describe the problem (terminology and dataset)
Describe the framework for the solution
Various models
Results
Demo
TensorFlow
3. Problem Description
NLI is an important problem in NLP: summarization, information retrieval, and question answering all rely on high-quality NLI.
Problem set-up:
The SNLI dataset is a collection of about half a million natural language inference (NLI) problems.
Given sentence A (the premise) and sentence B (the hypothesis), classify their relationship as entailment, contradiction, or neutral.
4. Problem Description
Training set: 550,000 sentence pairs
Test set: 10,000 sentence pairs
Dev set: 10,000 sentence pairs
Vocabulary: 33,000 words
Original paper from EMNLP 2015 (http://stanford.edu/~angeli/papers/2015-emnlp-snli.pdf)
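A minimal sketch of reading the data; it assumes the public SNLI 1.0 JSONL release, in which each line carries sentence1, sentence2, and gold_label fields (the file name here is illustrative).

import json

# Collect (premise, hypothesis, label) triples from the SNLI training file.
# File and field names follow the public SNLI 1.0 release; adjust if your
# copy is packaged differently.
examples = []
with open("snli_1.0_train.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        if ex["gold_label"] == "-":   # skip pairs with no annotator consensus
            continue
        examples.append((ex["sentence1"], ex["sentence2"], ex["gold_label"]))

print(len(examples), examples[0])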
6. Few More Examples
Premise: a statue at a museum that no one seems to be looking at .
Hypothesis: there is a statue that not many people seem to be interested in .
Label: Entailment

Premise: a couple walk hand in hand down a street .
Hypothesis: the couple is married .
Label: Neutral
7. Model Architecture
[Architecture figure: the premise and the hypothesis are each mapped to 300d word embeddings (GloVe, CBOW, or Skip-Gram), encoded by a sentence model (bag of words, LSTM/GRU, or CNN) into 100d sentence vectors, passed through three 200d tanh layers (W1, W2, W3) with dropout and L2 regularization, and classified by a softmax layer trained with cross-entropy loss. A sketch of this pipeline follows.]
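A minimal sketch of the pipeline above, written with the current tf.keras API rather than the low-level graph code used for the talk. The layer sizes come from the slide; the dropout rate, L2 strength, and the choice of a bag-of-words encoder (with shared weights for both sentences) are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Sizes from the slide: 300d word embeddings, 100d sentence vectors,
# three 200d tanh layers, 3-way softmax.
VOCAB, EMBED, SENT, HIDDEN = 33000, 300, 100, 200

premise = tf.keras.Input(shape=(None,), dtype="int32", name="premise")
hypothesis = tf.keras.Input(shape=(None,), dtype="int32", name="hypothesis")

embed = layers.Embedding(VOCAB, EMBED)            # GloVe vectors could initialize this
sum_words = layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))  # bag-of-words pooling
project = layers.Dense(SENT, activation="tanh")   # 300d -> 100d sentence vector

def encode(tokens):
    # Encoder weights are shared between premise and hypothesis for simplicity.
    return project(sum_words(embed(tokens)))

x = layers.Concatenate()([encode(premise), encode(hypothesis)])
for _ in range(3):                                 # the W1, W2, W3 tanh layers
    x = layers.Dense(HIDDEN, activation="tanh",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dropout(0.2)(x)                     # dropout for regularization
out = layers.Dense(3, activation="softmax")(x)     # entailment / contradiction / neutral

model = tf.keras.Model([premise, hypothesis], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])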
9. Distributed Word Representation
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers in a space of low dimension relative to the vocabulary size ("continuous space").
Different algorithms, similar ideas (see the sketch below):
CBOW
Skip-gram
GloVe
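All of these algorithms produce the same kind of object: a matrix of real-valued vectors indexed by word. A minimal sketch of how such vectors are used, with a hypothetical toy embedding matrix (pretrained GloVe or word2vec vectors would normally be loaded instead):

import numpy as np

# Hypothetical toy setup: a 5-word vocabulary embedded in 4 dimensions.
vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3, "kitten": 4}
embeddings = np.random.randn(len(vocab), 4)   # rows are word vectors

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# After training (CBOW, Skip-gram, or GloVe), semantically similar words
# should have a higher cosine similarity than unrelated ones.
print(cosine(embeddings[vocab["cat"]], embeddings[vocab["kitten"]]))
print(cosine(embeddings[vocab["cat"]], embeddings[vocab["truck"]]))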
16. def negSamplingGradient(center, neighbour, outputVectors, dataset, K=10):
    # Gradient of the negative-sampling objective with respect to the center
    # vector v_c. Assumes numpy is imported as np, a sigmoid() helper is in
    # scope, and `dataset` exposes sampleTokenIdx() to draw negative samples.
    word_dim = outputVectors.shape[1]
    word_samples = np.zeros((K, word_dim))
    for i in range(K):
        sample = dataset.sampleTokenIdx()        # random negative-sample index
        word_samples[i, :] = outputVectors[sample, :]

    u_o = outputVectors[neighbour, :]
    # First part of the gradient w.r.t. v_c: -(1 - sigmoid(u_o . v_c)) * u_o
    cost_tmp = sigmoid(np.dot(u_o, center))
    grad_part_1 = -(1 - cost_tmp) * u_o

    # Second part: each negative sample u_j contributes (1 - sigmoid(-u_j . v_c)) * u_j
    cost_tmp = sigmoid(np.dot(-word_samples, center))
    tmp_2 = (1 - cost_tmp)[:, np.newaxis] * word_samples
    grad_part_2 = np.sum(tmp_2, axis=0)

    grad_vc = grad_part_1 + grad_part_2
    return grad_vc
17. Bag of Words Approach
Premise sentence: simply add the word vectors of all the words in the sentence.
Do the same for the hypothesis sentence.
Now we have a feature vector for each sentence (see the sketch below).
Train the neural network; it is not a very deep network.
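A minimal sketch of building those feature vectors with NumPy; the embedding matrix and token indices here are stand-ins (a trained GloVe matrix and a real tokenizer would be used in practice).

import numpy as np

def bow_vector(word_ids, embeddings):
    # Sum the word vectors of all words in the sentence.
    return embeddings[word_ids].sum(axis=0)

embeddings = np.random.randn(33000, 300).astype(np.float32)  # stand-in for GloVe
premise_ids = [12, 873, 44, 9]        # hypothetical token indices
hypothesis_ids = [12, 873, 1502]

# Concatenate the two sentence vectors into one feature vector for the classifier.
features = np.concatenate([bow_vector(premise_ids, embeddings),
                           bow_vector(hypothesis_ids, embeddings)])
print(features.shape)                 # (600,)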
34. Results
Since the weight matrices are randomly initialized, the starting accuracy of almost all models is about 33% (3 classes).
With a high learning rate and a high dropout keep probability, very large gradients flow through the network and the loss keeps oscillating.
Bag of words without word-vector updates gives about 62-64% accuracy.
Bag of words with word-vector updates gives about 70% accuracy.
CNN accuracy is about 75%.
LSTM accuracy is 78-79%. State of the art is 86% (neural attention models).
35. Hyper-parameters
Network dimensionality
Optimization routine (Adam, stochastic gradient descent)
Learning rate, regularization parameters (L2, dropout)
Learning rate: determines how much the parameters change on each update. Larger values cause the parameters to swing, leading to sub-optimal results.
Dropout: the probability that an individual neuron fires or is shut off. It prevents over-fitting by randomly withholding information, which forces the network to learn the data patterns without relying too heavily on any single neuron.
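A minimal sketch of where these knobs appear in code, using the tf.keras API; the values shown are illustrative, not the settings tuned for the results above.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(200, activation="tanh", input_shape=(600,),
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 regularization
    layers.Dropout(0.2),             # drop 20% of activations (keep probability 0.8)
    layers.Dense(3, activation="softmax"),
])

# Optimization routine and learning rate; swapping in SGD is a one-line change.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])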
39. TensorFlow
TensorFlow is a deep learning library recently open-sourced by Google.
TensorFlow provides primitives for defining functions on tensors and automatically computing their derivatives.
It has various utilities for solving stochastic minimization problems.
Similar to NumPy; in fact, NumPy arrays can be converted into tensors, though this isn't the most efficient approach (see the sketch below).
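A minimal sketch of both points, using the current TensorFlow 2.x eager API (the talk predates it and used the original graph-based API):

import numpy as np
import tensorflow as tf

# A NumPy array can be converted to a tensor (this copies the data, which is
# why it isn't the most efficient path for large inputs).
x = tf.convert_to_tensor(np.array([1.0, 2.0, 3.0]), dtype=tf.float32)

# TensorFlow automatically computes derivatives of functions on tensors.
with tf.GradientTape() as tape:
    tape.watch(x)                     # x is a constant, so watch it explicitly
    y = tf.reduce_sum(x ** 2)         # y = sum(x_i^2)

grad = tape.gradient(y, x)            # dy/dx = 2x
print(grad.numpy())                   # [2. 4. 6.]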