2. • Machine learning algorithms are incapable of processing raw strings.
• They require numbers as inputs.
• Huge amounts of text data therefore have to be converted into numbers.
• A word embedding format generally tries to map a word, using a dictionary, to a vector.
sentence = "Word Embeddings are Word converted into numbers"
• Words: "Embeddings", "numbers", etc.
• Dictionary: list of all unique words in the sentence:
['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']
• A vector representation of a word may be a one-hot encoded vector where 1 stands for the position where the word
exists and 0 everywhere else.
"numbers" - [0, 0, 0, 0, 0, 1]
"converted" - [0, 0, 0, 1, 0, 0]
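The one-hot scheme above can be sketched in a few lines of Python, using the sentence and dictionary from this slide:

```python
sentence = "Word Embeddings are Word converted into numbers"
dictionary = ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']

def one_hot(word, dictionary):
    """Return a vector with 1 at the word's dictionary position, 0 elsewhere."""
    return [1 if w == word else 0 for w in dictionary]

print(one_hot('numbers', dictionary))    # [0, 0, 0, 0, 0, 1]
print(one_hot('converted', dictionary))  # [0, 0, 0, 1, 0, 0]
```

Each vector has one dimension per dictionary entry, which is why this representation grows with the vocabulary.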
Why do we need WORD EMBEDDINGS?
3. CONTEXT CLUES: the meaning of an unknown word can be inferred from the words that surround it in some CONTEXT.
The CONTEXT may be:
• a "WINDOW" of surrounding words
• the "SENTENCE" it occurs in
• the "PARAGRAPH" that contains it
• the "ENTIRE DOCUMENT"
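The "window" notion of context can be sketched as follows; the symmetric window of size 2 and the example sentence are illustrative choices, not fixed by the slide:

```python
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

def context(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

print(context(tokens, 3, window))  # ['quick', 'brown', 'jumps', 'over']
```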
BIGGER GOAL: how can we generalize the knowledge obtained from one particular word to other words that are somehow similar?
• Semantic space -> latent semantic space.
• Map individual words into the latent semantic space.
• Obtain a vector representation for each word.
• Dimension -> much lower than the size of the vocabulary of words.
• Semantically similar words are closer together in the vector space: a "WORD EMBEDDING".
WORD EMBEDDING: Embed the words into some vector space.
WORD EMBEDDINGS DEFINITION
4. 1. Frequency-based Embedding
• Count Vectors
• TF-IDF
• Co-Occurrence
• Matrix Factorization
• Co-occurrence probability ratio
WORD EMBEDDINGS TYPES:
Count vector representation: each document is represented by the raw counts of the vocabulary terms it contains.
TF-IDF representation:
TF = (number of times term t appears in a document) / (number of terms in the document)
IDF = log(N/n), where N is the number of documents and n is the number of documents that term t has appeared in.
TF(This, Document1) = 1/8
TF(This, Document2) = 1/5
IDF(This) = log(2/2) = 0
IDF(Messi) = log(2/1) = 0.301
TF-IDF(This, Document1) = (1/8) * 0 = 0
TF-IDF(This, Document2) = (1/5) * 0 = 0
TF-IDF(Messi, Document1) = (4/8) * 0.301 = 0.15
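The arithmetic above can be reproduced directly. The two toy documents below are an assumption, chosen only so that the counts match the slide's numbers (Document1 has 8 terms with "Messi" appearing 4 times; Document2 has 5 terms); the slide uses base-10 logarithms:

```python
import math

# Hypothetical documents matching the slide's counts.
doc1 = "This is about Messi Messi Messi Messi football".split()
doc2 = "This is about TF IDF".split()
docs = [doc1, doc2]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf('This', doc1))                       # 0.125
print(round(idf('Messi', docs), 3))           # 0.301
print(round(tf_idf('Messi', doc1, docs), 2))  # 0.15
```

A term appearing in every document (like "This") gets IDF 0, so its TF-IDF vanishes no matter how often it occurs.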
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
5. Matrix Factorization of Word Embeddings (co-occurrence matrix)
Similar words tend to occur together and to have similar contexts. For example: "Apple is a fruit. Mango is a fruit." Apple and mango tend to have a similar context, i.e., fruit.
Co-occurrence: for a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they have appeared together in a Context Window.
Context Window: a context window is specified by a number (its size) and a direction.
Corpus = "The quick brown fox jumps over the lazy dog."
Corpus = "He is not lazy. He is intelligent. He is smart."
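A minimal co-occurrence count over the second corpus, assuming a symmetric context window of size 2 (the window size is an assumption; the slide leaves it unspecified):

```python
from collections import Counter

tokens = "He is not lazy He is intelligent He is smart".lower().split()
window = 2

# Count how often each ordered (word, neighbor) pair falls in the same window.
cooc = Counter()
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            cooc[(w, tokens[j])] += 1

print(cooc[('he', 'is')])  # 4
```

Note the matrix is symmetric: ('he', 'is') and ('is', 'he') get the same count.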
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
6. Matrix Factorization of Word Embeddings (co-occurrence matrix)
PMI = Pointwise Mutual Information: PMI(w, c) = log( P(w, c) / (P(w) P(c)) ), where w = word and c = context word.
Larger PMI means higher correlation.
ISSUE: many entries have PMI(w, c) = log 0 = -infinity for unobserved pairs.
SOLUTIONS:
• Set PMI(w, c) = 0 for all unobserved pairs.
• Drop all entries with PMI < 0 [POSITIVE POINTWISE MUTUAL INFORMATION, PPMI].
This produces 2 different vectors for each word:
• one describing the word when it is the 'target word' in the window
• one describing the word when it is the 'context word' in the window
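A sketch of PPMI over raw co-occurrence counts; the tiny counts table is a hypothetical example, not data from the slides:

```python
import math

# Hypothetical co-occurrence counts: (word, context word) -> count.
counts = {
    ('ice', 'solid'): 8, ('ice', 'gas'): 1,
    ('steam', 'solid'): 1, ('steam', 'gas'): 8,
}
total = sum(counts.values())

def p_word(w):
    return sum(v for (a, _), v in counts.items() if a == w) / total

def p_ctx(c):
    return sum(v for (_, b), v in counts.items() if b == c) / total

def ppmi(w, c):
    """Positive PMI: max(0, log P(w,c)/(P(w)P(c))), with 0 for unobserved pairs."""
    joint = counts.get((w, c), 0) / total
    if joint == 0:
        return 0.0  # avoids PMI = log 0 = -inf for unobserved pairs
    return max(0.0, math.log(joint / (p_word(w) * p_ctx(c))))

print(round(ppmi('ice', 'solid'), 3))  # 0.575
```

Negative and undefined entries both map to 0, which is exactly the PPMI fix described above.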
7. GloVe captures the relationship between two words through the CO-OCCURRENCE PROBABILITY RATIO.
Co-occurrence probability ratio:
P(k | i) = probability of observing word 'k' in the context of word 'i'.
Example: P(solid | ice) / P(solid | steam) is large,
while P(gas | ice) / P(gas | steam) is small.
It produces 2 different vectors for each word:
• w_i: target vector of word i
• w̃_i: context vector of word i
It learns the function (after the addition of bias terms b_i and b̃_k):
w_i^T w̃_k + b_i + b̃_k = log(X_ik), where X_ik is the co-occurrence count of words i and k.
Weighting for rare and frequent occurrences: f(x) = (x / x_max)^α if x < x_max, else 1.
Final loss: J = Σ_{i,k} f(X_ik) (w_i^T w̃_k + b_i + b̃_k − log X_ik)²
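The weighting function and one term of the loss can be sketched as follows; the constants x_max = 100 and α = 0.75 are the standard values from the GloVe paper, not from this slide:

```python
import math

def glove_weight(x, x_max=100, alpha=0.75):
    """f(x): down-weights rare co-occurrences and caps very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss_term(w_i, w_k, b_i, b_k, x_ik):
    """One term of J: f(X_ik) * (w_i . w~_k + b_i + b~_k - log X_ik)^2."""
    dot = sum(a * b for a, b in zip(w_i, w_k))
    return glove_weight(x_ik) * (dot + b_i + b_k - math.log(x_ik)) ** 2

print(glove_weight(100))  # 1.0
```

Summing `glove_loss_term` over all observed word pairs gives the final loss J above.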
8. 2. Prediction-based Embedding
• CBOW
• Skip-Gram
The CBOW and Skip-Gram neural-network models differ in the input and the output of the network.
• CBOW: the input to the neural network is the set of context words within a certain window surrounding a 'target' word, and the output predicts the 'target' word, i.e., which word should occupy the target position.
• Skip-Gram: the input is the 'target' word at the center of the window, and the output predicts each 'context' word in the window.
In both cases we learn a target vector w_i and a context vector w̃_i for each word in the vocabulary.
10. Prediction-based Embedding:
• The goal is to supply training samples and learn the weights.
• The learned weights are then used to predict probabilities for a new input word.
The network tells us, for every word in the vocabulary, the probability that it appears in the context (window) of the word we choose.
Example: "Soviet"-"Union" vs "Soviet"-"Russia" vs "Soviet"-"Kangaroo"
Corpus = "The quick brown fox jumps over the lazy dog." Context window = 2
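With the corpus and context window above, the skip-gram (target, context) training pairs can be generated like this:

```python
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

# One (target, context) training pair per word within the window.
pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print(pairs[:2])  # [('the', 'quick'), ('the', 'brown')]
```

Each word thus contributes up to 2 × window training pairs, which is what the network is trained on.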
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
12. Prediction-based Embedding:
Issue: it is expensive to compute the normalizing constant of the softmax output layer, which involves a sum over the entire vocabulary.
Mikolov introduced NEGATIVE SAMPLING for the skip-gram model, together with:
• Treating common word pairs or phrases as single "words" in the model.
Example: "Boston" "Globe" vs "Boston Globe"
• Subsampling frequent words to decrease the number of training examples.
• Modifying the optimization objective with a technique called "Negative Sampling", which causes each training sample to update only a small percentage of the model's weights.
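The subsampling step can be sketched with the word2vec paper's keep probability P(keep w) = min(1, sqrt(t / f(w))), where f(w) is the word's corpus frequency; the threshold t = 1e-5 is the paper's customary default, not a value stated on the slide:

```python
import math

def keep_prob(freq, t=1e-5):
    """Probability of keeping a word whose corpus frequency is `freq`."""
    return min(1.0, math.sqrt(t / freq))

# Very frequent words (like "the") are mostly dropped; rare words are kept.
print(round(keep_prob(0.05), 4))  # 0.0141
print(keep_prob(1e-6))            # 1.0
```

Dropping most occurrences of very frequent words both shrinks the training set and removes pairs that carry little information.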
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
13. Document Representation from Word Embeddings
• What can the vector-space representation be for "documents" rather than "words"?
• Embed a document at the centroid of all its word vectors by taking their average.
• Alternatively, take the minimum value in each vector dimension,
• or take the maximum value in each vector dimension.
• This works well for small documents, e.g., tweets.
• For longer documents, represent the document as a bag of word vectors.
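The centroid, minimum, and maximum pooling options above can be sketched directly; the toy 3-dimensional word vectors are hypothetical:

```python
# Hypothetical word vectors for a short document (3 words, 3 dimensions).
vectors = [
    [0.2, 0.1, 0.5],
    [0.4, 0.3, 0.1],
    [0.0, 0.8, 0.3],
]

dims = len(vectors[0])
centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
min_pool = [min(v[d] for v in vectors) for d in range(dims)]
max_pool = [max(v[d] for v in vectors) for d in range(dims)]

print([round(x, 2) for x in centroid])  # [0.2, 0.4, 0.3]
print(min_pool)                         # [0.0, 0.1, 0.1]
print(max_pool)                         # [0.4, 0.8, 0.5]
```

All three collapse the document into a single fixed-length vector regardless of document length, which is why they degrade on long documents.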
Such bags of vectors can be compared with the Word Mover's Distance.
But this still gives no proper single representation of a document. Le & Mikolov: "Paragraph Vector"
• Paragraph Vectors with Distributed Memory modify CBOW.
• Paragraph Vectors with a Distributed Bag of Words modify skip-gram.