#4 Convolutional Neural Networks for Natural Language Processing

Convolutional Neural Networks for
Natural Language Processing
Adriaan Schakel
November 26, 2015

Google Trends
Query: convolutional neural networks

arXiv
http://export.arxiv.org/api/query?search_query=
abs:convolutional+AND+neural+AND+network&
start=0&max_results=10000

ILSVRC2012
Challenge: identify main objects present in images (from 1000 object
categories)
Training data: 1.2 million labelled images
October 13, 2012: results released
Winner: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
(University of Toronto)
Score: top-5 test error rate of 15.3%, compared to 26.2% achieved
by second-best entry

ILSVRC2012
Krizhevsky, Sutskever, and Hinton (2012)

ILSVRC2012
AlexNet:
deep ConvNet trained on raw RGB pixel values
60 million parameters and 650,000 neurons
5 convolutional layers (some followed by max-pooling layers) and
3 fully-connected layers
a final 1000-way softmax
trained on two NVIDIA GPUs for about a week
use of dropout in the fully-connected layers

Radical Change
Letter from Yann LeCun to editor of CVPR 2012:
Getting papers about feature learning accepted at vision
conference has always been a struggle, and I’ve had more than
my share of bad reviews over the years. I was very sure that
this paper was going to get good reviews because:
it uses no hand-crafted features (it’s all learned all the way
through. Incredibly, this was seen as a negative point by
the reviewers!);
it beats all published results on 3 standard datasets for
scene parsing;
it’s an order of magnitude faster than the competing
methods.
If that is not enough to get good reviews, I just don’t know
what is. So, I’m giving up on submitting to computer vision
conferences altogether. (...) Submitting our papers is just a
waste of everyone’s time (and incredibly demoralizing to my lab
members).

Revolution?
History:
1980: introduction of ConvNets by Fukushima
late 1980s: further development by LeCun and
collaborators @ Bell Labs
late 1990s: LeNet-5 was reading about 20% of
written checks in U.S.
Breakthrough due to:
persistence of academic researchers
improved algorithms
increase in computing power
increase in amount of data
dissemination of knowledge
http://www.elitestreetsmagazine.com/magazine/2008/jan-mar/art.php

Neural Networks
1943 McCulloch and Pitts proposed first artificial neuron:
computes weighted sum of its binary input signals, $x_i \in \{0, 1\}$:
$y = \theta\left( \sum_{i=1}^{n} w_i x_i - u \right)$
1957 Rosenblatt developed a learning algorithm: the perceptron
(for linearly separable data only)
A. K. Jain, J. Mao, K. M. Mohiuddin, IEEE Computer, 1996
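A minimal sketch of the threshold unit and a Rosenblatt-style learning rule, assuming NumPy and taking θ to be the unit step function. The function names, the learning rate `eta`, and the toy AND dataset are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def step(z):
    """Unit step theta: 1 where z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

def neuron(x, w, u):
    """Threshold unit: y = theta(sum_i w_i x_i - u)."""
    return step(x @ w - u)

def train_perceptron(X, t, epochs=20, eta=1.0):
    """Perceptron updates; these settle only for linearly separable data."""
    w = np.zeros(X.shape[1])
    u = 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):
            err = target - neuron(x, w, u)   # -1, 0 or +1
            w += eta * err * x               # move weights towards the target
            u -= eta * err                   # threshold moves the opposite way
    return w, u

# Toy example (assumption): the logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t_and = np.array([0, 0, 0, 1])
w, u = train_perceptron(X, t_and)
pred = neuron(X, w, u)
```

AND is linearly separable, so the updates stop changing after a few passes. A target such as XOR would not settle, which is the limitation noted on the slide.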

Perceptron
The New York Times July 7, 1958:

Feed-Forward Neural Networks
neurons arranged in layers
neurons propagate signals only forward
input of jth neuron in layer l:
$x_j^l = \sum_i w_{ji}^l \, y_i^{l-1}$
output:
$y_j^l = h\left( x_j^l \right)$
A. K. Jain, J. Mao, K. M. Mohiuddin, IEEE Computer, 1996; commons.wikimedia.org
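The layer-by-layer rule above can be sketched in a few lines of NumPy. The choice of tanh for the activation h, the layer sizes, and the random weights are assumptions made only for illustration.

```python
import numpy as np

def forward(y0, weights, h=np.tanh):
    """Propagate a signal forward through the layers.

    weights[l] has shape (units in next layer, units in current layer),
    so x_j = sum_i w_ji * y_i and y_j = h(x_j).
    """
    y = y0
    for W in weights:
        x = W @ y      # input of each neuron in the next layer
        y = h(x)       # output of each neuron in the next layer
    return y

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 3)),   # 3 inputs -> 4 hidden units
           rng.normal(scale=0.1, size=(2, 4))]   # 4 hidden -> 2 output units
out = forward(np.array([1.0, 0.5, -0.5]), weights)
```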

Backpropagation
Paul Werbos (1974):
1. initialize weights to small random values
2. choose input pattern
3. propagate signal forward through network
4. determine error (E) and propagate it backwards through network to
assign credit/blame to each unit
5. update weights by means of gradient descent:
$\Delta w_{ji} = -\eta \, \frac{\partial E}{\partial w_{ji}}$
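The five steps above can be sketched for a tiny two-layer network. This is an illustration under stated assumptions: a tanh hidden layer, a linear output, a squared error E, a single training pattern, and a learning rate `eta` of 0.1. None of these choices come from the slides.

```python
import numpy as np

def forward(W1, W2, x):
    y1 = np.tanh(W1 @ x)     # hidden layer
    y2 = W2 @ y1             # linear output layer
    return y1, y2

def loss(W1, W2, x, t):
    _, y2 = forward(W1, W2, x)
    return 0.5 * np.sum((y2 - t) ** 2)

def gradients(W1, W2, x, t):
    y1, y2 = forward(W1, W2, x)
    delta2 = y2 - t                              # error at the output
    delta1 = (W2.T @ delta2) * (1.0 - y1 ** 2)   # error propagated backwards
    return np.outer(delta1, x), np.outer(delta2, y1)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 2))   # step 1: small random weights
W2 = rng.normal(scale=0.1, size=(1, 3))
x, t = np.array([1.0, -1.0]), np.array([0.5])   # step 2: one input pattern
eta = 0.1
losses = []
for _ in range(200):
    losses.append(loss(W1, W2, x, t))     # steps 3-4: forward pass and error
    g1, g2 = gradients(W1, W2, x, t)
    W1 -= eta * g1                        # step 5: delta w = -eta * dE/dw
    W2 -= eta * g2
```

Comparing one entry of `g1` against a finite-difference estimate of the loss is a quick way to check that the backward pass matches the forward pass.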

ConvNets
Feed-forward nets w/:
local receptive field
shared weights
Applications:
character recognition
face recognition
medical diagnosis
self-driving cars
object recognition
(e.g. birds)
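The two defining properties, local receptive field and shared weights, can be illustrated with a one-dimensional convolution. This is a sketch with NumPy; the signal and kernel values are made up for illustration.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the input.

    Each output value depends only on a local window of the input
    (local receptive field), and every window is multiplied by the
    same kernel (shared weights).
    """
    width = len(kernel)
    return np.array([signal[i:i + width] @ kernel
                     for i in range(len(signal) - width + 1)])

signal = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])   # three weights, reused at every position
out = conv1d_valid(signal, kernel)
```

A fully connected layer mapping these 5 inputs to 3 outputs would need 15 weights; the shared kernel needs 3.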

Race to bring Deep Learning to the Masses
Major players:
Google
Facebook
Baidu
Microsoft
Nvidia
Apple
Amazon
LeCun @ Facebook
http://www.popsci.com/facebook-ai

Fooling ConvNets
fix trained network
carry out backprop using wrong class label
update input pixels:
Goodfellow, Shlens, and Szegedy, ICLR 2015
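The update formula itself is not recoverable from the slide. As a hedged sketch of the recipe above, the following uses a sign-of-gradient step on the input, in the spirit of the cited paper, for a small linear softmax classifier. The random "trained" weights, the step size of 0.05, and all names are assumptions for illustration only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss for `label` w.r.t. the input x."""
    p = softmax(W @ x + b)
    p[label] -= 1.0          # dLoss/dlogits = p - onehot(label)
    return W.T @ p           # chain rule back to the input

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))  # stand-in for fixed, already trained weights
b = np.zeros(3)
x = rng.normal(size=8)       # stand-in for the input pixels

# Pick a class the model does not currently predict.
wrong = (int(np.argmax(W @ x + b)) + 1) % 3
p_before = softmax(W @ x + b)[wrong]

# Update the input, not the weights, to lower the loss for the wrong label.
x_adv = x - 0.05 * np.sign(input_gradient(W, b, x, wrong))
p_after = softmax(W @ x_adv + b)[wrong]
```

Each input component changes by at most 0.05, yet the probability assigned to the wrong class goes up.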

Dreaming ConvNets
fix trained network
initialize input by
average image of
some class
carry out backprop
using that class’
label
update input pixels:
Simonyan, Vedaldi, and Zisserman, arXiv:1312.6034

ConvNets for NLP Tasks
2008:
Case study: sentiment analysis (classification)
Rationale: key phrases that are indicative of class membership can
appear anywhere in a document

Applications
almost every image posted by Mrs
Merkel’s office of her in meetings
and summits has attracted
comments in Russian criticising
her and her policies.
Staff in Mrs Merkel’s office have
been deleting comments but some
remain despite the purge.
FAZ 07.06.2015:
Merkel's social media team, whose headcount is not
disclosed, was hopelessly overwhelmed.

Pre-trained Word Vectors
Word embeddings:
dense vectors (w/ dimension d of order 100)
derived from word co-occurrences: a word is characterized by the
company it keeps (Firth, 1957)
GloVe [Pennington, Socher, and Manning (2014)]:
log-bilinear regression model
learns word vectors, such that:
$\log(X_{ij}) = w_i^T \tilde{w}_j + b_i + \tilde{b}_j$
$X_{ij}$ the number of times word j occurs in the context of word i
$w_i \in \mathbb{R}^d$ word vectors
$\tilde{w}_j \in \mathbb{R}^d$ context word vectors
Word2vec (skip-gram algorithm) [Mikolov et al. (2013)]:
shallow feed-forward neural network
learns word vectors, such that:
$\frac{X_{ij}}{\sum_k X_{ik}} = \frac{e^{w_i^T \tilde{w}_j}}{\sum_k e^{w_i^T \tilde{w}_k}}$
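Both models express co-occurrence statistics through dot products of word and context vectors. The sketch below evaluates the right-hand sides of the two relations for random vectors. The toy vocabulary size, the dimension, and the variable names are assumptions, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                        # toy vocabulary size and vector dimension
w = rng.normal(size=(V, d))        # word vectors w_i
w_ctx = rng.normal(size=(V, d))    # context word vectors
b, b_ctx = np.zeros(V), np.zeros(V)

# GloVe: model for log X_ij, one entry per (word i, context word j) pair.
log_X_model = w @ w_ctx.T + b[:, None] + b_ctx[None, :]

# Skip-gram: model for X_ij / sum_k X_ik, a softmax over context words.
scores = w @ w_ctx.T
scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
P = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Training would adjust the vectors so that `log_X_model` matches the observed log counts (GloVe) or `P` matches the observed conditional frequencies (skip-gram).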

Pre-trained Word Vectors
Kim (2014): Sentence classification
Hyperparameters:
filters of width (region
size) 3, 4, and 5
100 feature maps each
max-pooling layer
penultimate layer: 300
units
Datasets (average sentence length ∼ 20):
movie reviews w/ one sentence per
review (pos/neg?)
electronic product reviews (pos/neg?)
TREC question dataset. Is question
about a person, a location, numeric
information, etc.? (6 categories).
arXiv:1408.5882
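The hyperparameters above explain the 300 units: three filter widths times 100 feature maps, each reduced to a single number by max-pooling over the sentence. A rough NumPy sketch of that forward pass follows. The embedding dimension of 50, the random weights, the ReLU nonlinearity, and the omission of biases are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def sentence_features(E, filters):
    """E: (n_words, d) embedded sentence.
    filters: {width: array of shape (n_maps, width * d)}."""
    pooled = []
    for width, F in sorted(filters.items()):
        # One row per window of `width` consecutive word vectors.
        windows = np.stack([E[i:i + width].ravel()
                            for i in range(E.shape[0] - width + 1)])
        maps = np.maximum(windows @ F.T, 0.0)   # (n_windows, n_maps)
        pooled.append(maps.max(axis=0))         # max over all positions
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
d = 50                                   # assumed embedding dimension
E = rng.normal(size=(20, d))             # a sentence of 20 words
filters = {width: rng.normal(scale=0.1, size=(100, width * d))
           for width in (3, 4, 5)}
features = sentence_features(E, filters)
```

Because of the pooling step, the feature vector has the same length for any sentence with at least 5 words, which is what lets a fixed-size classifier sit on top.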

One-Hot Vectors
Johnson and Zhang (2015): Classification of larger pieces of text
(average size ∼ 200)
$\text{aardvark}: (1, 0, \ldots, 0)^T, \quad \text{zwieback}: (0, \ldots, 0, 1)^T$
Hyperparameters:
filter width (region size) 3
stack words in region
1000 feature maps
max-pooling
penultimate layer: 1000
units
Performance:
IMDB (|V | = 30k): error rate 8.74%
Amazon Elec: error rate 7.74%
arXiv:1412.1058
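"Stack words in region" can be read as concatenating the one-hot vectors of the words in each region, so a region of 3 words becomes a vector of length 3|V|. A small sketch under that reading, with a made-up three-word vocabulary:

```python
import numpy as np

vocab = ["aardvark", "likes", "zwieback"]        # toy vocabulary (assumption)
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def region_vectors(tokens, width=3):
    """Concatenate the one-hot vectors of `width` consecutive words."""
    return np.stack([np.concatenate([one_hot(t) for t in tokens[i:i + width]])
                     for i in range(len(tokens) - width + 1)])

R = region_vectors(["aardvark", "likes", "zwieback", "likes"])
```

With a realistic vocabulary such as |V| = 30k, each region vector has 90k entries of which only 3 are non-zero, so a practical implementation would use sparse representations.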

Character Input
Zhang, Zhao, and LeCun (2015): Large datasets
Hyperparameters:
alphabet of size 70
6 convolutional layers (some followed by max-pooling layers) and
3 fully-connected layers
filter width (region size) 7 or 3
1024 feature maps
Performance:
(test error rates in %)
Model         AG     Sogou  DBP.   Yelp P.  Yelp F.  Yah. A.  Amz. F.  Amz. P.
BoW           11.19  7.15   3.39   7.76     42.01    31.11    45.36    9.60
BoW TFIDF     10.36  6.55   2.63   6.34     40.14    28.96    44.74    9.00
ngrams        7.96   2.92   1.37   4.36     43.74    31.53    45.73    7.98
ngrams TFIDF  7.64   2.81   1.31   4.56     45.20    31.49    47.56    8.46
ConvNet       12.82  4.88   1.73   5.89     39.62    29.55    41.31    5.51
arXiv:1509.01626

Outlook
Convenient and powerful libraries:
Theano (Lasagne, Keras) developed at LISA, University of Montreal
Torch primarily developed by Ronan Collobert (now @ Facebook),
used within Facebook, Google, Twitter, and NYU
TensorFlow by Google
The new iPhone 6S shows great GPU performance. So, expect
(more) deep learning coming to your phone.
Embedded devices like Nvidia’s TX1, a tiny
supercomputer w/ 256 CUDA cores and 4GB
memory, for driver-assistance systems and the like.
http://technonewschannel.com/tips-trick/5-hidden-features-of-android-camera-which-you-should-know/
