Deep learning Malaysia presentation 12/4/2017
1. Commercially available tools
Open source tools
Word2vec word embedding model
RNN for generating Bahasa text
LSTM in theory
Integrating word embedding model into LSTM
Automatic question and reply engine
Overview of NLP using Deep Learning Part 1
2. Open Source Tools
• Python: NLTK and TextBlob (a short TextBlob sketch follows the feature list below)
– POS Tagging: part-of-speech tagging
– Named Entity Recognition:
• "Steve Jobs" -> person
• "Apple" -> organization
– Semantic identification
– Lemmatization and stemming: reduce words to their base form
• R: TM
http://textminingonline.com/getting-started-with-
- Word Tokenization
- Sentence Tokenization
- Part-of-speech tagging
- Noun phrase extraction
- Sentiment analysis
- Word Pluralization
- Word Singularization
- Spelling correction
- Parsing
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Word and phrase frequencies
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- JSON serialization
- Add new models or languages through extensions
- WordNet integration
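Many of the items in this list map directly onto TextBlob's API. A minimal sketch, assuming textblob and its corpora are installed (python -m textblob.download_corpora); the example sentences are only illustrative:

from textblob import TextBlob, Word

blob = TextBlob("The dogs chased the cats up the trees. Apple hired Steve Jobs.")

print(blob.sentences)                 # sentence tokenization
print(blob.words)                     # word tokenization
print(blob.tags)                      # part-of-speech tagging
print(blob.noun_phrases)              # noun phrase extraction
print(blob.sentiment)                 # sentiment (polarity, subjectivity)
print(blob.ngrams(n=2))               # bigrams
print(blob.word_counts["the"])        # word and phrase frequencies

print(Word("cats").singularize())     # cats -> cat
print(Word("dog").pluralize())        # dog -> dogs
print(Word("chased").lemmatize("v"))  # chased -> chase (WordNet lemmatization)

print(TextBlob("I havv a speling eror").correct())  # spelling correction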
5. Using Word2vec
• Consider the training corpus having the following sentences:
– “the dog saw a cat”
– “the dog chased the cat”
– “the cat climbed a tree”
• The corpus vocabulary has eight words. Once ordered alphabetically, each word can be referenced by its index, i.e. {a, cat, chased, climbed, dog, saw, the, tree}. For this example, the neural network will have eight input neurons and eight output neurons. Let us assume that we decide to use three neurons in the hidden layer. This means that Winput and Woutput will be 8×3 and 3×8 matrices, respectively. Before training begins, these matrices are initialized to small random values, as is usual in neural network training. Just for illustration's sake, let us assume Winput and Woutput are initialized to the following values (a short numpy sketch of this setup follows the matrices below):
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
Winput =
            [,1]          [,2]       [,3]
[1,] -0.92513658 -0.743787260  1.6273785
[2,]  0.08458616 -1.258307794  0.4852640
[3,]  0.83675919 -0.001426922 -0.1703800
[4,]  0.94409916  0.018061199 -0.6304152
[5,]  0.04691568 -1.599246381 -0.8439630
[6,] -0.82112415 -1.084833252  0.1231866
[7,] -0.93035265  0.003375370 -1.3572083
[8,] -1.31701003  0.659632590  0.1134216

Woutput =
           [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]      [,8]
[1,] -1.3907119  0.5259696  1.0829041  1.9993983  0.3370346  1.4518856  0.4802576 0.3931751
[2,] -0.5347725  0.6776164 -0.3288658 -0.1287490 -0.5626609  0.6886097 -1.3618653 1.0593093
[3,] -0.9392639 -0.3924568  0.6765556  0.5703951  0.6843841 -0.9567421 -0.3512964 1.8581440
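As a concrete companion to the description above, here is a minimal numpy sketch of the same setup: build the alphabetically ordered vocabulary, form 1-out-of-V (one-hot) input vectors, and initialize an 8×3 Winput and a 3×8 Woutput with small random values (the random numbers will of course differ from those on the slide):

import numpy as np

corpus = ["the dog saw a cat",
          "the dog chased the cat",
          "the cat climbed a tree"]

# Alphabetically ordered vocabulary: {a, cat, chased, climbed, dog, saw, the, tree}
vocab = sorted(set(" ".join(corpus).split()))
V = len(vocab)          # 8 input and 8 output neurons
N = 3                   # 3 hidden neurons

word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """1-out-of-V representation of a word."""
    x = np.zeros(V)
    x[word_to_index[word]] = 1.0
    return x

# Small random initial weights: Winput is 8x3, Woutput is 3x8
rng = np.random.default_rng(0)
W_input = rng.normal(scale=0.1, size=(V, N))
W_output = rng.normal(scale=0.1, size=(N, V))

print(vocab)            # ['a', 'cat', 'chased', 'climbed', 'dog', 'saw', 'the', 'tree']
print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0. 0. 0.]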
6. Using Word2vec
• Suppose we want the network to learn the relationship between the words “cat” and “climbed”. That is, the network should show a high probability for “climbed” when “cat” is input to the network. In word-embedding terminology, the word “cat” is referred to as the context word and the word “climbed” is referred to as the target word.
• cat --> climbed
• In this case, the input vector X will be [0 1 0 0 0 0 0 0]. Notice that only the second component of the vector is 1; this is because the input word “cat” holds the second position in the sorted list of corpus words. Given that the target word is “climbed”, the target vector will be [0 0 0 1 0 0 0 0].
• With the input vector representing “cat”, the output at the hidden layer neurons can be computed as follows (a small numerical check follows the matrix below):
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
Ht = X · Winput, where Winput =
            [,1]          [,2]       [,3]
[1,] -0.92513658 -0.743787260  1.6273785
[2,]  0.08458616 -1.258307794  0.4852640
[3,]  0.83675919 -0.001426922 -0.1703800
[4,]  0.94409916  0.018061199 -0.6304152
[5,]  0.04691568 -1.599246381 -0.8439630
[6,] -0.82112415 -1.084833252  0.1231866
[7,] -0.93035265  0.003375370 -1.3572083
[8,] -1.31701003  0.659632590  0.1134216
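A quick numpy check of this matrix product, with the Winput values copied from the slide; the one-hot input simply selects the second row:

import numpy as np

# Winput exactly as shown on the slide (8 x 3)
W_input = np.array([
    [-0.92513658, -0.743787260,  1.6273785],
    [ 0.08458616, -1.258307794,  0.4852640],
    [ 0.83675919, -0.001426922, -0.1703800],
    [ 0.94409916,  0.018061199, -0.6304152],
    [ 0.04691568, -1.599246381, -0.8439630],
    [-0.82112415, -1.084833252,  0.1231866],
    [-0.93035265,  0.003375370, -1.3572083],
    [-1.31701003,  0.659632590,  0.1134216],
])

X = np.array([0, 1, 0, 0, 0, 0, 0, 0])     # one-hot vector for "cat"

H_t = X @ W_input
print(H_t)                                  # [ 0.08458616 -1.25830779  0.4852640 ]
print(np.allclose(H_t, W_input[1]))         # True: Ht is just the second row of Winput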
7. Using Word2vec
• It should not surprise us that the vector Ht of hidden neuron outputs mimics the second row of the Winput matrix, because of the 1-out-of-V (one-hot) representation. So the function of the input-to-hidden-layer connections is basically to copy the input word vector to the hidden layer. Carrying out similar manipulations for the hidden-to-output layer, the activation vector for the output layer neurons can be written as shown below.
• Since the goal is to produce probabilities for words in the output layer, Pr(word_k | word_context) for k = 1, …, V, to reflect their next-word relationship with the context word at the input, we need the sum of neuron outputs in the output layer to add to one. Word2vec achieves this by converting the activation values of the output layer neurons to probabilities using the softmax function. Thus, the output of the k-th neuron is computed by the following expression, where activation(n) represents the activation value of the n-th output layer neuron:
Pr(word_k | word_context) = exp(activation(k)) / Σ_n exp(activation(n)), summing over n = 1, …, V
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
Ht · Woutput =
           [,1]       [,2]      [,3]      [,4]     [,5]      [,6]     [,7]       [,8]
[1,] 0.09948251 -0.9986054 0.8337211 0.6079195 1.068616 -1.207946 1.583797 -0.3979896
Target vector = [0 0 0 1 0 0 0 0 ].
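Applying the softmax expression above to this activation row reproduces the probabilities shown on the next slide; a quick numpy check:

import numpy as np

# Activation values Ht · Woutput from the slide
activation = np.array([0.09948251, -0.9986054, 0.8337211, 0.6079195,
                       1.068616, -1.207946, 1.583797, -0.3979896])

def softmax(a):
    """Convert activations to probabilities that sum to one."""
    e = np.exp(a - a.max())      # subtract max for numerical stability
    return e / e.sum()

probs = softmax(activation)
print(probs.round(4))   # ~ [0.0768 0.0256 0.1602 0.127 0.2026 0.0207 0.339 0.0467]
print(probs.sum())      # 1.0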
8. Using Word2vec
• Thus, the probabilities for eight words in the corpus are:
• The probability for the chosen target word “climbed” (shown in bold on the slide) is 0.127. Given the target vector [0 0 0 1 0 0 0 0], the error vector for the output layer is easily computed by subtracting the probability vector from the target vector. Once the error is known, the weights in the matrices Winput and Woutput can be updated using backpropagation (a small sketch follows the table below). Thus, the training can proceed by presenting different context–target word pairs from the corpus. In essence, this is how Word2vec learns relationships between words and, in the process, develops vector representations for the words in the corpus.
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
     a    cat chased climbed    dog    saw    the   tree
0.0768 0.0256 0.1602   0.127 0.2026 0.0207  0.339 0.0467
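A minimal sketch of the error vector and a plain gradient-descent weight update for this example. Note that real word2vec implementations add optimizations such as hierarchical softmax or negative sampling, which these slides do not cover, and the learning rate here is an arbitrary illustrative value:

import numpy as np

probs  = np.array([0.0768, 0.0256, 0.1602, 0.127, 0.2026, 0.0207, 0.339, 0.0467])
target = np.array([0, 0, 0, 1, 0, 0, 0, 0])        # one-hot vector for "climbed"

error = probs - target                              # error vector for the output layer
print(error.round(4))

# Woutput exactly as shown earlier (3 x 8)
W_output = np.array([
    [-1.3907119,  0.5259696,  1.0829041,  1.9993983,  0.3370346,  1.4518856,  0.4802576, 0.3931751],
    [-0.5347725,  0.6776164, -0.3288658, -0.1287490, -0.5626609,  0.6886097, -1.3618653, 1.0593093],
    [-0.9392639, -0.3924568,  0.6765556,  0.5703951,  0.6843841, -0.9567421, -0.3512964, 1.8581440],
])
H_t = np.array([0.08458616, -1.25830779, 0.48526400])   # hidden output for "cat"

# Softmax + cross-entropy gradients
grad_W_output = np.outer(H_t, error)    # 3 x 8, updates Woutput
grad_H = W_output @ error               # length 3, updates only the "cat" row of Winput

learning_rate = 0.025
W_output -= learning_rate * grad_W_output
# W_input[1] -= learning_rate * grad_H  # row 1 (0-based) corresponds to "cat"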
9. Word2vec
• CBOW:
– Predict the current word based on the context
– Order of words in the history does not influence the projection
– Faster & more appropriate for larger corpora
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
10. Word2vec
• Continuous Skip-Gram Model:
– Maximize classification of a word based on another word in the same sentence
– Better word vectors for frequent words, but slower to train
– A short sketch of training both variants follows below
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
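Neither slide shows code; as an added illustration, the CBOW/skip-gram choice described above is exposed as a single flag in the gensim toolkit (gensim is not mentioned in the slides, and the parameter names assume gensim 4.x):

from gensim.models import Word2Vec

sentences = [["the", "dog", "saw", "a", "cat"],
             ["the", "dog", "chased", "the", "cat"],
             ["the", "cat", "climbed", "a", "tree"]]

# sg=0 -> CBOW (the default)
cbow = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=0, epochs=50)

# sg=1 -> continuous skip-gram
skipgram = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["cat"])                     # 3-dimensional word vector
print(skipgram.wv.most_similar("cat"))    # nearest words by cosine similarity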
11. Using Word2vec - CBOW
• The above description and architecture are meant for learning the relationship between a pair of words. In the continuous bag-of-words model, the context is represented by multiple words for a given target word. For example, we could use “cat” and “tree” as context words for “climbed” as the target word. This calls for a modification of the neural network architecture. The modification, shown below, consists of replicating the input-to-hidden-layer connections C times (C being the number of context words) and adding a divide-by-C operation in the hidden layer neurons.
• With the above configuration of C context words, each word being coded using the 1-out-of-V representation, the hidden layer output is the average of the word vectors corresponding to the context words at the input. The output layer remains the same (a small sketch of this averaging step follows the figure below).
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
[Figure: CBOW network architecture; example words in the diagram: “cat”, “the”, “mat”, “sat”]
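A minimal numpy sketch of the divide-by-C averaging described above; the weight values here are random placeholders rather than the slide's:

import numpy as np

rng = np.random.default_rng(0)
V, N = 8, 3
W_input = rng.normal(scale=0.1, size=(V, N))   # 8x3 input weight matrix, as before

def one_hot(i, size=V):
    x = np.zeros(size)
    x[i] = 1.0
    return x

# CBOW with C = 2 context words, e.g. "cat" (index 1) and "tree" (index 7)
context_indices = [1, 7]
C = len(context_indices)

# Hidden layer output = average of the context words' input vectors (the divide-by-C step)
H_t = sum(one_hot(i) @ W_input for i in context_indices) / C
print(np.allclose(H_t, W_input[context_indices].mean(axis=0)))   # True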
12. Using Word2vec - SGM
• The skip-gram model reverses the use of target and context words. In this case, the target word is fed at the input, the hidden layer remains the same, and the output layer of the neural network is replicated multiple times to accommodate the chosen number of context words. Taking the example of “cat” and “tree” as context words and “climbed” as the target word, the input vector in the skip-gram model would be [0 0 0 1 0 0 0 0], while the two output layers would have [0 1 0 0 0 0 0 0] and [0 0 0 0 0 0 0 1] as target vectors, respectively.
• In place of producing one vector of probabilities, two such vectors would be produced for the current example. The error vector for each output layer is produced in the manner discussed above. However, the error vectors from all output layers are summed up to adjust the weights via backpropagation. This ensures that the weight matrix Woutput remains identical for each output layer all through training (a small sketch of this summed-error update follows the figure below).
https://iksinc.wordpress.com/tag/word2vec/
https://www.slideshare.net/hen_drik/word2vec-from-theory-to-practice
[Figure: skip-gram network architecture; example words in the diagram: “cat”, “the”, “mat”, “sat”]
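A minimal numpy sketch of the summed-error update just described; weights are random placeholders and the learning rate is an arbitrary illustrative value:

import numpy as np

rng = np.random.default_rng(0)
V, N = 8, 3
W_input = rng.normal(scale=0.1, size=(V, N))
W_output = rng.normal(scale=0.1, size=(N, V))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Skip-gram: target word "climbed" (index 3) at the input,
# context words "cat" (index 1) and "tree" (index 7) at the two output layers.
input_index = 3
context_indices = [1, 7]

H_t = W_input[input_index]          # copy of the input word's row
probs = softmax(H_t @ W_output)     # the same Woutput is shared by every output layer

# One error vector per context word, summed before backpropagation
total_error = np.zeros(V)
for c in context_indices:
    target = np.zeros(V)
    target[c] = 1.0
    total_error += probs - target

grad_W_output = np.outer(H_t, total_error)
grad_H = W_output @ total_error

learning_rate = 0.025
W_output -= learning_rate * grad_W_output
W_input[input_index] -= learning_rate * grad_H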
17. RNN for generating Bahasa text
• A simple TensorFlow implementation of an RNN for a word-level language model
• https://github.com/hunkim/word-rnn-tensorflow/blob/master/model.py
• Small training dataset (a 3.1 MB text file) containing cleaned Malay text with complete sentences.
• The number of hidden states is 50, the number of RNN layers is 2, the RNN sequence length is 20, and the number of epochs is 10 (a rough sketch of this configuration follows below).
• When you read, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence. Unlike traditional neural networks, RNNs have loops in them, allowing information to persist. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They are the natural neural network architecture to use for such data (sequence and list data: speech recognition, language modeling, translation, image captioning, etc.).
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
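The linked repository uses the legacy TensorFlow 1.x RNN API; purely to make the stated hyperparameters concrete, here is a rough tf.keras equivalent. The layer choices, vocabulary size, and optimizer are assumptions for illustration, not taken from the repository:

import tensorflow as tf

vocab_size = 10000      # assumption: depends on the Malay training text
seq_length = 20         # RNN sequence length
rnn_size   = 50         # number of hidden states
num_layers = 2          # number of RNN layers
num_epochs = 10

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, rnn_size),
    tf.keras.layers.SimpleRNN(rnn_size, return_sequences=True),   # layer 1
    tf.keras.layers.SimpleRNN(rnn_size, return_sequences=True),   # layer 2
    tf.keras.layers.Dense(vocab_size, activation="softmax"),      # next-word probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_batches, y_batches, epochs=num_epochs)
#   x: word-id sequences of length seq_length; y: the same sequences shifted by one word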
18. RNN for generating Bahasa text
• The hidden state at time step t is h_t. It is a function of the input at the same time step, x_t, modified by a weight matrix W (like the one we used for feedforward nets), added to the hidden state of the previous time step, h_(t-1), multiplied by its own hidden-state-to-hidden-state matrix U, otherwise known as a transition matrix and similar to a Markov chain: h_t = φ(W·x_t + U·h_(t-1)). The weight matrices are filters that determine how much importance to accord to both the present input and the past hidden state. The error they generate will return via backpropagation and be used to adjust their weights until the error can’t go any lower.
• The sum of the weighted input and the weighted previous hidden state is squashed by the function φ – either a logistic sigmoid function or tanh, depending – which is a standard tool for condensing very large or very small values into a logistic space, as well as making gradients workable for backpropagation.
• Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded h_(t-1) for as long as memory can persist.
• Given a series of letters, a recurrent net will use the first character to help determine its perception of the second character, such that an initial q might lead it to infer that the next letter will be u, while an initial t might lead it to infer that the next letter will be h.
• Since recurrent nets span time, they are probably best illustrated with animation (the first vertical line of nodes to appear can be thought of as a feedforward network, which becomes recurrent as it unfurls over time). A small numpy sketch of the recurrence follows below.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
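A minimal numpy sketch of the vanilla RNN recurrence h_t = φ(W·x_t + U·h_(t-1)) described above, with φ = tanh and small illustrative sizes rather than the presentation's actual model:

import numpy as np

input_size, hidden_size = 8, 4          # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (transition) matrix

def rnn_step(x_t, h_prev):
    """One time step: h_t = phi(W·x_t + U·h_(t-1)), with phi = tanh."""
    return np.tanh(W @ x_t + U @ h_prev)

# Run the recurrence over a short sequence of one-hot word inputs
h = np.zeros(hidden_size)
for t, word_index in enumerate([6, 4, 5, 0, 1]):   # e.g. "the dog saw a cat"
    x = np.zeros(input_size)
    x[word_index] = 1.0
    h = rnn_step(x, h)
    print(f"h_{t + 1} =", h.round(3))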
19. Commercially available tools
Open source tools
Word2vec word embedding model
RNN for generating Bahasa text
LSTM in theory
Integrating word embedding model into LSTM
Automatic question and reply engine
Overview of NLP using Deep Learning Part 2
20. Personal Profile
• Advisory Data Scientist (IBM Malaysia)
• Linkedin: https://www.linkedin.com/in/brian-ho-34068a36/
• Github Blog: https://kimusu2008.github.io/
• Github Account: https://github.com/kimusu2008