How do I embed a sequence of words?
From context-free to contextual embeddings
Fabio Fumarola
Milan, July 18 2019
Language and Machine Learning
Distributional Semantics
Context-free word embeddings:
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
Language and Machine Learning
Distributional Semantics
Context-free word embeddings:
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
4
allrightsreserved
- It is distinct from other known forms of communication because it is compositional
- It allows speakers to express and reference ideas and opinions in sentences by combining subjects, verbs
and objects
- Compositionality gives human language an endless capacity for generating new sentences, as
speakers recombine sets of words taken from a vocabulary
- For instance, with just 25 words it is possible to generate over 15K distinct sentences
Human language: what is special
Pagel M. Q&A: What is human language, when did it evolve and why should we care?. BMC Biol. 2017 Jul 24
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5525259/
5
allrightsreserved
- A machine learning model that understands and
processes natural language needs to transform it into
numeric values
- Simple solution: one-hot encode each word in the
vocabulary
- Complexity: one-hot encoding is computationally
impractical for a large vocabulary (a toy sketch follows below)
- And …
Human language in Machine Learning 1/2
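A minimal numpy sketch of the one-hot idea, using a hypothetical five-word vocabulary (real vocabularies have 10^5-10^6 entries, which is exactly what makes the encoding impractical):

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have 1e5-1e6 entries.
vocab = ["the", "man", "sword", "apple", "phone"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("sword"))  # [0. 0. 1. 0. 0.] -- one dimension per vocabulary word
```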
6
allrightsreserved
- Word embeddings represent text as vectors with
much lower and denser dimensions
- They are based on the idea that contextual information
alone constitutes a viable representation of
linguistic items (in contrast with Chomsky)
- Open questions: when, why, how, and with what limitations?
Human language in Machine Learning 2/2
Language and Machine Learning
Distributional Semantics
Context-free word embeddings
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
8
allrightsreserved
Theoretical roots are in structural linguistics and the philosophy of language:
- In the 1950s: Harris Z., Firth J. and Wittgenstein L. published works that used feature representations to
quantify semantic similarity (hand-crafted features)
- In 1957: Osgood C., Suci G. and Tannenbaum P., in “The Measurement of Meaning” (University of Illinois
Press, Urbana, IL), found three recurring attitudes people use to evaluate words and phrases: ‘good-bad’,
‘strong-weak’, ‘active-passive’
- From the 1990s: many methods to generate contextual features were developed
A brief History of Word Embeddings 1/3
9
allrightsreserved
All these works are based on the idea that:
”Semantically similar words tend to have similar contextual distributions” (Miller and Charles, 1991)
A brief History of Word Embeddings 2/3
Miller, G. A., & Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1-28.
Geometric techniques are used to measure the similarity between context vectors (a cosine-similarity sketch follows)
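As a hedged illustration of the geometric view, a cosine-similarity sketch over hypothetical context-count vectors (the vectors below are invented for illustration only):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two context vectors (1 = same direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical context-count vectors for three words
boat = np.array([9.0, 8.0, 1.0])
ship = np.array([8.5, 7.5, 2.0])
banana = np.array([1.0, 0.0, 9.0])

print(cosine_similarity(boat, ship))    # high: similar contextual distributions
print(cosine_similarity(boat, banana))  # low: dissimilar distributions
```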
10
allrightsreserved
There are also many other alternative terms in use
- from the very general distributed representation
- to the more specific semantic vector space or simply word space.
- For consistency, I will adhere to current practice and use the term word embedding in this
presentation.
Word Embeddings: a note
11
allrightsreserved
1. Count-based: based on matrix factorization of a word co-occurrence matrix (unsupervised)
2. Context-based (or predictive) models: given a local context, we want to predict the target word
and, in parallel, learn an efficient word vector representation (supervised)
Distributional Semantic Model: Two Main Approaches
Baroni, Dinu and Kruszewski, Don’t count, predict! A systematic comparison of context-counting vs. context-
predicting semantic vectors, ACL 2014
(https://www.aclweb.org/anthology/P14-1023)
Language and Machine Learning
Distributional Semantics
Context-free word embeddings:
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
13
allrightsreserved
Starting from raw co-occurrence counts, these methods apply a function (matrix factorization) to map them down
to small, dense word vectors:
- Latent Semantic Analysis/Indexing (LSA/LSI), developed in the context of IR
- Topic models, Probabilistic LSA (PLSA) and Latent Dirichlet Allocation (LDA)
Count-based methods
14
allrightsreserved
- A: m x n (documents, terms)
- U: m x r (documents, concepts)
- 𝚺: r x r diagonal matrix (strength of
each ‘concept’)
- V: n x r (terms, concepts)
Matrix Factorization: SVD
A (m x n, documents x terms) ≈ U 𝚺 Vᵀ
15
allrightsreserved
SVD: Users to Movies
A ≈ U 𝚺 Vᵀ. The rows of A are users, the columns are the movies Matrix, Alien, Serenity, Casablanca, Amelie.
The latent dimensions are the “concepts”, AKA latent factors.

A =
1 1 1 0 0
3 3 3 0 0
4 4 4 0 0
5 5 5 0 0
0 2 0 4 4
0 0 0 5 5
0 1 0 2 2
16
allrightsreserved
SVD: users to concept similarity matrix
U is the user-to-concept similarity matrix: the first column is the SciFi-concept, the second the Romance-concept.
For the same rating matrix A as above:

U =
0.13 0.02 -0.01
0.41 0.07 -0.03
0.55 0.09 -0.04
0.68 0.11 -0.05
0.15 -0.59 0.65
0.07 -0.73 -0.67
0.07 -0.29 0.32
17
allrightsreserved
SVD: Concepts Strength
The diagonal of 𝚺 gives the “strength” of each concept; 12.4 is the strength of the SciFi-concept.
For the same rating matrix A:

𝚺 =
12.4 0 0
0 9.5 0
0 0 1.3
18
allrightsreserved
SVD: Movie to Concept similarity
Vᵀ is the movie-to-concept similarity matrix; its first row is the SciFi-concept
(columns: Matrix, Alien, Serenity, Casablanca, Amelie). A numpy sketch of the full decomposition follows.

Vᵀ =
0.56 0.59 0.56 0.09 0.09
0.12 -0.02 0.12 -0.69 -0.69
0.40 -0.80 0.40 0.09 0.09
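A minimal numpy sketch of the same decomposition on the rating matrix above (the signs of U and Vᵀ may differ from the slides, since the SVD is only defined up to sign):

```python
import numpy as np

# User-movie rating matrix from the slides
# (columns: Matrix, Alien, Serenity, Casablanca, Amelie).
A = np.array([
    [1, 1, 1, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 2, 0, 4, 4],
    [0, 0, 0, 5, 5],
    [0, 1, 0, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 1))        # singular values: ~[12.4, 9.5, 1.3, 0, 0]

k = 2                        # keep only the two strongest concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 1))      # low-rank reconstruction of the ratings
```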
19
allrightsreserved
Language and Machine Learning
Distributional Semantics
Context-free word embeddings:
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
20
allrightsreserved
Build models to predict a word given its neighbors
- In 2003: Bengio et al. [1] coined the term word embeddings and trained them in a neural language model
- In 2008: Collobert and Weston [2] showed the utility of pre-trained word embeddings for downstream
tasks
- In 2013: Mikolov et al. [3] created the word2vec toolkit, which allows seamless training and use of pre-
trained embeddings
History
1. Yoshua Bengio et al. A neural probabilistic language model. 2003 Journal of Machine Learning Research
2. Collobert and Weston, A unified architecture for natural language processing, ICML 2008
3. Mikolov et al., Efficient Estimation of Word Representations in Vector Space, ICLR 2013
21
allrightsreserved
The construction of context vectors is turned:
- From: co-occurrence statistics + matrix factorization (heuristic stacking of vectors)
- To: vectors learned by training a supervised model (single supervised model)
Idea
22
allrightsreserved
- We assume a corpus containing a sequence of T training words w1, w2, w3, ..., wT
- The words belong to a vocabulary V of size |V|
- The models consider a context of n words
- We associate to every word an input embedding vw with d dimensions and an output embedding v′w
Notation
23
allrightsreserved
- We use a sliding window of a fixed size moving along the given text
- The word in the middle is the target and the left-right neighborhood in the window is the context
Example: “The man who passes the sentence should swing the sword” (pair generation sketched below)
Data Preparation
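A minimal sketch of the pair generation on the example sentence, assuming a context radius c = 2 (the window size actually used in training is a hyperparameter):

```python
# Generate (target, context) pairs with a fixed-size sliding window.
sentence = "The man who passes the sentence should swing the sword".lower().split()
c = 2  # context radius (assumption for illustration)

pairs = []
for t, target in enumerate(sentence):
    for j in range(max(0, t - c), min(len(sentence), t + c + 1)):
        if j != t:
            pairs.append((target, sentence[j]))

print(pairs[:6])
# [('the', 'man'), ('the', 'who'), ('man', 'the'),
#  ('man', 'who'), ('man', 'passes'), ('who', 'the')]
```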
24
allrightsreserved
- Given a target word wt, the model is trained to predict the words in the window [wt – c, ..., _ , ..., wt + c] (Mikolov et
al., 2013)
- Context words closer to the target are sampled more often (the effective window size is drawn at random)
Skip-Gram Model
• The input word wi and the output word wy are one-hot
encoded
• The multiplication of x and W gives
the embedding vector of the input word wi
• The multiplication of the hidden layer
and W’ produces the output vector y
(scores over the vocabulary)
• W’ encodes the meaning of words as
context
25
allrightsreserved
- The Continuous Bag-of-Words (Mikolov et al., 2013) predicts the target word from the
surrounding window
Continuous Bag-of-Words (CBOW)
The CBOW model works better on small datasets (a gensim usage sketch follows below)
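A hedged usage sketch with gensim (assuming gensim ≥ 4, where the dimensionality argument is `vector_size`); `sg=0` selects CBOW and `sg=1` selects skip-gram, and the toy corpus is for illustration only:

```python
from gensim.models import Word2Vec

sentences = [
    "the man who passes the sentence should swing the sword".split(),
    "i am eating an apple".split(),
]

# sg=0 -> CBOW (often preferable on small corpora), sg=1 -> skip-gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skip = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["apple"].shape)        # (50,): dense vector instead of a |V|-dim one-hot
print(skip.wv.most_similar("sword")) # nearest neighbours in the toy corpus
```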
26
allrightsreserved
• Consider an input word wI and the correct target word w
• We sample N other words from a noise distribution: ŵ1, ŵ2, ... , ŵN ~ Q
The loss:
1. gets close to 0 when the input and target embeddings are close
2. gets close to 0 when the input embedding is far from the noise word embeddings
Loss function: Negative Sampling
The first term measures the distance between the input and target embeddings, the second the distance between
the input and noise embeddings (a numpy sketch of the objective follows)
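A minimal numpy sketch of the skip-gram negative-sampling objective of Mikolov et al. (2013); the variable names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_input, v_target, v_noise):
    """Negative-sampling loss for one (input, target) pair.

    v_input  : (d,)   input embedding of w_I
    v_target : (d,)   output embedding of the correct word w
    v_noise  : (N, d) output embeddings of the N sampled noise words
    """
    positive = -np.log(sigmoid(v_target @ v_input))            # pulls the target close
    negative = -np.sum(np.log(sigmoid(-(v_noise @ v_input))))  # pushes the noise away
    return positive + negative

# Toy check: the loss is small when the target aligns with the input
# and the noise embeddings are unrelated.
rng = np.random.default_rng(0)
d, N = 8, 5
v_in = rng.normal(size=d)
print(sgns_loss(v_in, v_in, rng.normal(size=(N, d))))
```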
27
allrightsreserved
- Proposed by Pennington et al. in 2014, it combines count-based matrix factorization with skip-gram-style training
- Word meanings are captured by ratios of co-occurrence probabilities: for a probe word k,
large values of P(k | ice) / P(k | steam) correlate well with ice, and small values correlate well with steam
- The ratio of probabilities encodes a crude form of the meaning associated with the abstract
concept (a toy illustration follows below)
Global Vectors (GloVe)
https://nlp.stanford.edu/pubs/glove.pdf
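A toy illustration of the co-occurrence ratio GloVe builds on; the counts below are invented for illustration and are not the paper's actual figures:

```python
import numpy as np

probe_words = ["solid", "gas", "water", "fashion"]
# Hypothetical co-occurrence counts of each probe word with 'ice' and 'steam'.
counts_ice   = np.array([190.0, 7.0, 850.0, 4.0])
counts_steam = np.array([2.0, 220.0, 780.0, 5.0])

p_ice = counts_ice / counts_ice.sum()        # P(k | ice)
p_steam = counts_steam / counts_steam.sum()  # P(k | steam)

for k, ratio in zip(probe_words, p_ice / p_steam):
    print(f"P({k}|ice) / P({k}|steam) = {ratio:.2f}")
# >> 1 for 'solid' (ice-like), << 1 for 'gas' (steam-like), ~1 for neutral words
```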
28
allrightsreserved
- Word2Vec has several drawbacks:
• No sentence representations: taking the average of pre-trained word vectors is popular but does not
always work well
• No use of morphology: words with the same root are learned as different vectors
FastText is an extension of word2vec that takes the morphology of words into account
FastText: Enriching Word Vectors with Sub-word Information
https://arxiv.org/pdf/1607.04606v1.pdf
Examples of related forms: disastrous / disaster, mangera / mangerai
Architectures: CBOW (many words → word), Skip-gram (word → word), text classification (many words → label)
29
allrightsreserved
Represent each word as a bag of character n-grams
1. A vector representation is associated to each character n-gram
2. Words are represented as the sum of these representations
Example: “matter”, with n=3 is represented as
^ma, mat, att, tte, ter, er$
Where ^ and $ are special symbols to distinguish the n-grams from words in the vocabulary
- In this way, compound names and out-of-vocabulary words are easy to model (an n-gram extraction sketch follows below)
FastText: Idea
https://arxiv.org/pdf/1607.04606v1.pdf
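A minimal sketch of the n-gram extraction with the boundary markers ^ and $ used in the slide (n = 3):

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """Character n-grams of a word padded with ^ and $ boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("matter"))
# ['^ma', 'mat', 'att', 'tte', 'ter', 'er$']
# The word vector is then the sum of the vectors of these n-grams
# (plus the vector of the word itself, when it is in the vocabulary).
```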
30
allrightsreserved
- Features of a word are computed using character n-grams. For “mangerai” (n = 3 and 4):
man, ang, nge, ger, era, rai, mang, ange, nger, gera, erai, plus the word itself: mangerai
- Out-Of-Vocabulary (OOV) words are represented the same way, from their character n-grams alone
FastText: Features
31
allrightsreserved
- The preprocessed features can be used to train a CBOW or a Skip-Gram model (the latter works better); a gensim sketch follows below
FastText: Model
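A hedged usage sketch with gensim's FastText implementation (assuming gensim ≥ 4); `min_n`/`max_n` bound the character n-gram lengths and `sg=1` selects skip-gram; the toy French corpus is for illustration only:

```python
from gensim.models import FastText

sentences = [
    "je mangerai une pomme demain".split(),
    "il mangera une pomme aussi".split(),
]

model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 sg=1, min_n=3, max_n=6, epochs=50)

# An out-of-vocabulary form still gets a vector, built from shared n-grams.
print(model.wv["mangerais"].shape)               # (50,)
print(model.wv.similarity("mangera", "mangerai"))
```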
32
allrightsreserved
It can be used for:
1. Classification: A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text
Classification, https://arxiv.org/abs/1607.01759
2. Translation: A. Joulin, P. Bojanowski, T. Mikolov, H. Jegou, E. Grave, Loss in Translation: Learning
Bilingual Word Mapping with a Retrieval Criterion, https://aclweb.org/anthology/D18-1330
FastText: Extensions
33
allrightsreserved
1. fastText with n-grams does better on syntactic tasks
2. Gensim word2vec and fastText with no n-grams do
slightly better on semantic tasks (due to the test dataset)
3. In general, the performance of the models gets
closer as the corpus size increases
4. Semantic accuracy increases with the corpus size
5. Syntactic accuracy increases slowly with the corpus
size
Word2Vec vs. FastText: Results Comparison
34
allrightsreserved
The main difference between the approaches is:
- Documents as context: LSA and topic models (a legacy of Information Retrieval)
• Document-based contexts capture semantic relatedness (e.g. ‘boat’ – ‘water’)
- Words as context: skip-gram, CBOW, GloVe and FastText (a linguistic and cognitive
perspective)
• Word-based contexts capture semantic similarity (e.g. ‘boat’ – ‘ship’)
Recap: Count vs. Context
35
allrightsreserved
- Poincaré Embeddings
(https://arxiv.org/pdf/1705.08039.pdf)
Interesting extensions not covered
36
allrightsreserved
- Sent2Vec Embeddings NAACL 2018 (https://aclweb.org/anthology/N18-1049)
- An extension of FastText and CBOW to sentences
Interesting extensions not covered
37
allrightsreserved
Language and Machine Learning
Distributional Semantics
Context-free word embeddings:
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
38
allrightsreserved
- The embeddings introduced so far suffer from a major problem: they cannot model polysemy
- In the sentences:
• “I have an Apple phone”, and
• “I am eating an apple”
the two “apple” words refer to very different things but share the same embedding vector
- Solution: contextual embeddings
Contextualized Embeddings
39
allrightsreserved
- CoVe
- ELMo
- ULMFiT
- OpenAI GPT
- BERT
- OpenAI GPT-2
Contextualized Embeddings
40
allrightsreserved
- Learns contextual word embeddings by using the encoder of an attentional sequence-to-sequence model
trained for machine translation
- CoVe word representations are a function of the entire input sequence
Contextual Word Vectors (CoVe)
McCann B., Bradbury J., Xiong C., Socher R., Learned in Translation: Contextualized Word Vectors
NIPS 2017 https://arxiv.org/pdf/1708.00107.pdf
41
allrightsreserved
Contextual Word Vectors (CoVe)
McCann B., Bradbury J., Xiong C., Socher R., Learned in Translation: Contextualized Word Vectors
NIPS 2017 https://arxiv.org/pdf/1708.00107.pdf
42
allrightsreserved
1. Pre-training is bounded by the availability of datasets for supervised translation
2. CoVe gives a limited contribution when used as a layer for other tasks
CoVe: Limitations
43
allrightsreserved
- Learns contextualized embeddings by pre-training a language model that learns to predict the next
token given the history
Embeddings from Language Model (ELMo)
Peters M. E., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L.,
Deep Contextualized Word Representations, NAACL 2018
44
allrightsreserved
- Contextual: the representation of each word depends on the sentence in which it is used
- Deep: the word representation is computed by stacking the internal states of all the layers of a
bidirectional language model
Different layers encode different kinds of information (e.g. PoS tagging from lower layers and word-
sense disambiguation in higher layers)
- Character based: the representation is based on characters rather than words (can take
advantage of sub-words to compute ...)
ELMo Representations
45
allrightsreserved
ELMo: Key Results
46
allrightsreserved
- SQuAD: Stanford Question Answering Dataset
- SNLI: Stanford Natural Language Inference (entailment, contradiction, neutral)
- SRL: Semantic Role Labeling
- Coref: clustering mentions in text that refer to the same underlying real-world entities
- NER: Named Entity Recognition
- Sentiment: The fine-grained sentiment classification task in the Stanford Sentiment Treebank
Tasks and Datasets
47
allrightsreserved
Fine-tuning: (i) a different learning rate per layer, (ii) a schedule that first increases and then decreases the learning rate
Universal Language Model Fine-tuning (ULMFiT)
Jeremy Howard, Sebastian Ruder Universal Language Model Fine-tuning for Text Classification, ACL 2018
https://arxiv.org/abs/1801.06146
48
allrightsreserved
GPT has two major differences from ELMo:
OpenAI Generative Pre-training Transformer
Radford A., et al. Improving Language Understanding by Generative Pre-Training
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Model: ELMo vs. GPT
- Architecture: BiLSTM vs. Transformer
- Type of embeddings: customized per task vs. the same embeddings fine-tuned for all tasks
49
allrightsreserved
GPT: Transformer Architecture
Radford A., et al. Improving Language Understanding by Generative Pre-Training
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
50
allrightsreserved
- BPE is a data compression technique that iteratively replaces the most frequent pair of symbols
(originally bytes) in a given dataset with a single unused symbol
- In each iteration, the algorithm finds the most frequent adjacent pair of symbols, each of which can be
a single character or a sequence of characters, and merges them to create a new
symbol
- It is adopted to solve the open-vocabulary issue (a minimal merge-loop sketch follows below)
GPT: Byte Pair Encoding
https://github.com/soaxelbrooke/python-bpe
https://arxiv.org/abs/1508.07909
https://github.com/google/sentencepiece
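A minimal sketch of the BPE merge loop in the style of Sennrich et al. (2015), on a toy word-frequency dictionary; `</w>` marks the end of a word:

```python
import re
from collections import Counter

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for _ in range(5):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
# ('e', 's') ('es', 't') ('est', '</w>') ('l', 'o') ('lo', 'w')
```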
51
allrightsreserved
- Compared to GPT, the largest difference is that BERT is
bidirectional
Contextual embedding example: “I made a bank deposit”
1. Unidirectional: “bank” is represented based only on “I made”, not on
“deposit”
2. Bidirectional: “bank” is represented using both the left and the right context
Moreover, BERT is available in a multilingual version, where sentences from
different languages are encoded in the same latent space
Bidirectional Encoder Representations from Transformer (BERT)
52
allrightsreserved
- The input embedding is the sum of three parts:
BERT input representation
Token embeddings (from BPE tokenization)
Sentence (segment) embeddings
Position embeddings
(a numpy sketch of the sum follows below)
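A hedged numpy sketch of the sum, with illustrative sizes and random embedding tables (a real BERT uses learned tables and a sub-word tokenizer; the tokenization of "I made a bank deposit" below is an assumption):

```python
import numpy as np

tokens   = ["[CLS]", "i", "made", "a", "bank", "deposit", "[SEP]"]
segments = [0, 0, 0, 0, 0, 0, 0]              # all tokens belong to sentence A
vocab_size, max_len, hidden = 30000, 512, 768

rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(vocab_size, hidden))   # learned in a real model
segment_emb  = rng.normal(size=(2, hidden))            # sentence A / sentence B
position_emb = rng.normal(size=(max_len, hidden))

token_ids = [abs(hash(t)) % vocab_size for t in tokens]  # stand-in for a real tokenizer
x = (token_emb[token_ids]
     + segment_emb[segments]
     + position_emb[np.arange(len(tokens))])
print(x.shape)   # (7, 768): one summed input vector per (sub-word) token
```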
53
allrightsreserved
Extension of GPT (1.5B parameters, 10x more) such that:
1. Zero-shot transfer: no need for transfer learning; all the downstream tasks are modeled as
predicting conditional probabilities
2. Model modifications:
• Layer normalization was moved to the input of each sub-block
• An additional layer normalization was added after the final self-attention block
• The weights of residual layers are scaled by 1/sqrt(N), where N is the number of residual layers
• A larger vocabulary size and context size are used
OpenAI GPT-2
54
allrightsreserved
Language and Machine Learning
Distributional Semantics
Context-free word embeddings:
Count based
Context based
Contextual embeddings
Final Considerations & Conclusions
55
allrightsreserved
- We reviewed the evolution of context-free word embeddings to deal with:
• Sub-word information
• OOV handling
- Then we considered contextual embeddings to address:
• Polysemy and multi-sense word representation
• Sentence embeddings
• Embeddings for multiple languages
- What we did not cover:
• The temporal dimension
• Bias
Final considerations & Conclusions
56
allrightsreserved
Confidentiality
57
allrightsreserved
Contacts
Bologna
Via Guglielmo Marconi, 43
+39 051 6480911
italy@prometeia.com
Milan
Via Brera, 18
Viale Monza, 265
+39 02 80505845
italy@prometeia.com
Cairo
Smart Village - Concordia Building, B2111
Km 28 Cairo Alex Desert Road
6 of October City, Giza
egypt@prometeia.com
Istanbul
River Plaza, Kat 19
Büyükdere Caddesi Bahar Sokak
No. 13, 34394
| Levent | Istanbul | Turkey
+ 90 212 709 02 80 – 81 – 82
turkey@prometeia.com
London
Dashwood House 69 Old Broad Street
EC2M 1QS
+44 (0) 207 786 3525
uk@prometeia.com
Moscow
ul. Ilyinka, 4
Capital Business Center Office 308
+7 (916) 215 0692
russia@prometeia.com
Rome
Viale Regina Margherita, 279
italy@prometeia.com
www.prometeia.com
Prometeiagroup
Prometeia
@PrometeiaGroup
Prometeia