Idiom Token Classification using Sentential Distributed Semantics
Giancarlo D. Salton, Robert J. Ross, John D. Kelleher
Applied Intelligence Research Centre, School of Computing
NLP Dublin Meetup

Outline
Idioms
Distributed Representations
“Per-expression” classification
“General” classification
Conclusions
Future Work on Idiom Token Classification
Idiom Classification on Machine Translation Pipeline
Idioms
Idioms are multiword expressions (MWEs)
Their meaning is non-compositional
There is no agreement among linguists on the set of characteristics that defines an idiom
Idiomatic and Literal Usages
(Example images: a literal usage vs. an idiomatic usage of the same expression.)
How do we distinguish between a literal and an idiomatic usage?
Idiom token classification
Previous Work
Previous work used “per-expression” models
– a different set of features for each expression; in general, these features are not reusable
– i.e., a model is trained for each particular expression
In our opinion, the state of the art is Peng et al. (2014)
– Also “per-expression” classification
– Topic models
– Up to 5 paragraphs of context!
Per-expression classifiers
– Expensive: idiom samples are rare
– Time-consuming: feature engineering
General Classifiers?
Can we find a common set of features?
Can we train a general classifier?
hold+horses vs. break+ice vs. spill+beans
Distributed Representations of Words
Word2vec (Mikolov et al., 2013)
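
The slides reference word2vec only at a high level; as a hedged illustration, the sketch below trains a toy word2vec model with gensim. The corpus and hyperparameters are placeholders, not the settings used in this work.

```python
# Toy word2vec sketch using gensim (an assumption; the slides do not
# specify an implementation). Corpus and hyperparameters are placeholders.
from gensim.models import Word2Vec

corpus = [
    ["he", "spilled", "the", "beans", "about", "the", "plan"],
    ["she", "spilled", "the", "beans", "onto", "the", "floor"],
    ["they", "finally", "broke", "the", "ice", "at", "the", "party"],
]
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)

vector = model.wv["beans"]                  # 50-dimensional word vector
print(model.wv.most_similar("beans", topn=2))
```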
Skip-thought Vectors (or Sent2Vec)
(Kiros et al., 2015)
Encoder/Decoder Framework
– Encoder learns to encode information about the context of an input sentence
Distributed representations = features!
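
To make the “representations = features” step concrete, here is a hedged sketch that encodes sentences into skip-thought vectors. It assumes the interface of the original reference implementation (github.com/ryankiros/skip-thoughts), so treat the module and function names as assumptions.

```python
# Hedged sketch: sentence vectors as classifier features. The
# `skipthoughts` module with load_model()/encode() follows the original
# reference implementation (github.com/ryankiros/skip-thoughts) and is
# an assumption, not part of the slides.
import skipthoughts

model = skipthoughts.load_model()
sentences = [
    "He finally spilled the beans about the merger.",    # idiomatic usage
    "He spilled the beans all over the kitchen floor.",  # literal usage
]
features = skipthoughts.encode(model, sentences)
print(features.shape)  # one fixed-length vector per input sentence
```
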
Distributed Representations vs Idioms
Distributed representations cluster words (word2vec) or sentences (sent2vec) with similar semantics
– Empirical results have shown this clustering behaviour
Idiomatic vs. literal usages
– Idiomatic usages should also lie in a different part of the space than literal usages (at least when considering the same expression)
“Per-expression” settings
Following the baseline evaluation of Peng et al. (2014)
4 expressions from the VNC-Tokens dataset:
– blow+whistle, lose+head, make+scene and take+heart
Balanced training sets
Imbalanced test sets
“Per-expression” classifiers
K-Nearest Neighbours
– 2, 3, 5 and 10 neighbours
Support Vector Machines
– Linear SVM: linear kernel and grid search for best parameters
– Grid SVM: grid search for best kernel/parameters
– SGD SVM: linear kernel trained with Stochastic Gradient Descent
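
A minimal sketch of this classifier suite with scikit-learn, assuming the sentence vectors have already been computed; the data and parameter grids below are illustrative placeholders, not the grids used in the experiments.

```python
# Sketch of the per-expression classifier suite (scikit-learn assumed).
# X, y stand in for one expression's sentence vectors and labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4800))   # placeholder sentence vectors
y = rng.integers(0, 2, size=60)   # 1 = idiomatic, 0 = literal

knns = {k: KNeighborsClassifier(n_neighbors=k) for k in (2, 3, 5, 10)}
linear_svm = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10, 100]})
grid_svm = GridSearchCV(SVC(), {"kernel": ["linear", "rbf", "poly"],
                                "C": [0.1, 1, 10, 100]})
sgd_svm = SGDClassifier(loss="hinge")  # linear SVM trained with SGD

for clf in list(knns.values()) + [linear_svm, grid_svm, sgd_svm]:
    clf.fit(X, y)
```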
blow+whistle results
Models Precision Recall F1-Score
Peng et al. (2014)
FDA-Topics 0.62 0.60 0.61
FDA-Topics+A 0.47 0.44 0.45
FDA-Text 0.65 0.43 0.52
FDA-Text+A 0.45 0.49 0.47
SVMs-Topics 0.07 0.40 0.12
SVMs-Topics+A 0.21 0.54 0.30
SVMs-Text 0.17 0.90 0.29
SVMs-Text+A 0.24 0.87 0.38
Distributed Representations
KNN-2 0.61 0.41 0.49
KNN-3 0.84 0.32 0.46
KNN-5 0.79 0.28 0.41
KNN-10 0.83 0.30 0.44
Linear SVM 0.77 0.50 0.60
Grid SVM 0.80 0.51 0.62
SGD SVM 0.70 0.40 0.51
lose+head results
Models Precision Recall F1-Score
Peng et al. (2014)
FDA-Topics 0.76 0.97 0.85
FDA-Topics+A 0.74 0.93 0.82
FDA-Text 0.72 0.73 0.72
FDA-Text+A 0.67 0.88 0.76
SVMs-Topics 0.60 0.83 0.70
SVMs-Topics+A 0.66 0.77 0.71
SVMs-Text 0.30 0.50 0.38
SVMs-Text+A 0.66 0.85 0.74
Distributed Representations
KNN-2 0.30 0.64 0.41
KNN-3 0.58 0.65 0.61
KNN-5 0.57 0.65 0.61
KNN-10 0.28 0.68 0.40
Linear SVM 0.72 0.84 0.77
Grid SVM 0.83 0.89 0.85
SGD SVM 0.73 0.79 0.76
make+scene results
Models Precision Recall F1-Score
Peng et al. (2014)
FDA-Topics 0.79 0.95 0.86
FDA-Topics+A 0.82 0.69 0.75
FDA-Text 0.79 0.95 0.86
FDA-Text+A 0.80 0.99 0.88
SVMs-Topics 0.46 0.57 0.51
SVMs-Topics+A 0.42 0.29 0.34
SVMs-Text 0.10 0.01 0.02
SVMs-Text+A 0.07 0.01 0.02
Distributed Representations
KNN-2 0.55 0.89 0.68
KNN-3 0.88 0.88 0.88
KNN-5 0.87 0.83 0.85
KNN-10 0.85 0.83 0.84
Linear SVM 0.81 0.91 0.86
Grid SVM 0.80 0.91 0.85
SGD SVM 0.85 0.91 0.88
take+heart results
Models Precision Recall F1-Score
Peng et al. (2014)
FDA-Topics 0.93 0.99 0.96
FDA-Topics+A 0.92 0.98 0.95
FDA-Text 0.46 0.40 0.43
FDA-Text+A 0.47 0.29 0.36
SVMs-Topics 0.90 1.00 0.95
SVMs-Topics+A 0.91 1.00 0.95
SVMs-Text 0.65 0.21 0.32
SVMs-Text+A 0.74 0.13 0.22
Distributed Representations
KNN-2 0.46 0.96 0.62
KNN-3 0.72 0.94 0.81
KNN-5 0.73 0.94 0.82
KNN-10 0.78 0.94 0.85
Linear SVM 0.73 0.96 0.83
Grid SVM 0.72 0.96 0.82
SGD SVM 0.61 0.95 0.74
“Per-expression” evaluation
No single model performed best for all expressions
The SVMs consistently outperformed the KNNs
The Peng et al. (2014) features may capture a different set of dimensions
Combining them with the baseline model may therefore yield a stronger classifier
“General classifier” settings
Simulation of the expected behaviour on real data
27 expressions from the “balanced” part of the VNC-Tokens dataset
Imbalanced training set
Imbalanced test set
“General classifier” classifiers
SVMs only
– Linear SVM: linear kernel and grid search for best parameters
– Grid SVM: grid search for best kernel/parameters
– SGD SVM: linear kernel trained with Stochastic Gradient Descent
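
As a hedged sketch of this setting, the snippet below pools data from all expressions into one imbalanced training set, trains a single SVM, and scores it per expression; the vectors and labels are synthetic placeholders, not the VNC-Tokens data.

```python
# Sketch of the "general" setting (scikit-learn assumed): one SVM over a
# pooled, imbalanced training set, scored per expression on held-out data.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(1)
expressions = np.array(["blow+whistle", "lose+head", "make+scene", "take+heart"])

X_train = rng.normal(size=(200, 4800))   # pooled placeholder sentence vectors
y_train = rng.integers(0, 2, size=200)   # 1 = idiomatic, 0 = literal
clf = SVC(kernel="linear").fit(X_train, y_train)

X_test = rng.normal(size=(80, 4800))
y_test = rng.integers(0, 2, size=80)
expr_test = rng.choice(expressions, size=80)  # which expression each test item uses

for expr in expressions:
    mask = expr_test == expr
    p, r, f1, _ = precision_recall_fscore_support(
        y_test[mask], clf.predict(X_test[mask]),
        average="binary", zero_division=0)
    print(f"{expr}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```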
“General classifier” results
Linear SVM Grid SVM SGD SVM
Expressions Pr. Rec. F1 Pr. Rec. F1 Pr. Rec. F1
blow+whistle 0.84 0.67 0.75 0.84 0.68 0.75 0.67 0.59 0.63
lose+head 0.78 0.66 0.72 0.75 0.64 0.69 0.75 0.67 0.71
make+scene 0.92 0.84 0.88 0.92 0.81 0.86 0.78 0.81 0.79
take+heart 0.94 0.79 0.86 0.94 0.80 0.86 0.86 0.80 0.83
Total 0.84 0.80 0.83 0.84 0.80 0.83 0.79 0.79 0.78
“General classifier” evaluation
Expected behaviour in the “real world”
– Considers the imbalances of real data
2 classifiers had high performance
– Same overall precision, recall and F1
– Deviations occurred across individual expressions
Performance is still not consistent over all classifiers and across expressions
PCA Analysis of Distributed Representations on the “General” Classifier
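
The PCA figure did not survive extraction; as a hedged stand-in, the sketch below shows how such a two-dimensional projection can be produced with scikit-learn and matplotlib, using placeholder vectors and labels.

```python
# Hedged sketch of the PCA view: project sentence vectors to 2-D and
# colour them by label. Vectors and labels are synthetic placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
vectors = rng.normal(size=(100, 4800))   # placeholder sentence vectors
labels = rng.integers(0, 2, size=100)    # 1 = idiomatic, 0 = literal

coords = PCA(n_components=2).fit_transform(vectors)
for value, colour, name in [(1, "tab:red", "idiomatic"),
                            (0, "tab:blue", "literal")]:
    pts = coords[labels == value]
    plt.scatter(pts[:, 0], pts[:, 1], c=colour, label=name, s=12)
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend()
plt.show()
```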
Conclusions
Our approach requires fewer resources to achieve roughly the same performance
SVMs generally perform better than KNNs
A “general classifier” is feasible
“Per-expression” classification does achieve better results in some cases
Future Work on Idiom Token Classification
Apply the approach to languages other than English
Apply it to other datasets
– e.g., the IDX Corpus
What are the main sources of error for the “general classifier”?
– A better understanding of the representations is needed
Idiom Token Classification on Machine Translation Pipeline
(Salton et al., 2014b)
(A sequence of diagram slides builds up the machine translation pipeline with an idiom token classification stage.)
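
To give a feel for where the classifier sits in the pipeline, here is a hedged sketch of a substitution-style approach in the spirit of Salton et al. (2014b); the dictionary, classifier, and translate() stub are placeholder assumptions, not the authors' implementation.

```python
# Hedged sketch of a substitution-based pipeline in the spirit of
# Salton et al. (2014b): classify each sentence and replace idiomatic
# expressions with literal paraphrases before translation. The
# dictionary, classifier and translate() stub are placeholders.
idiom_dictionary = {"spilled the beans": "revealed the secret"}

def classify(sentence):
    # Stand-in for the idiom token classifier (e.g., an SVM over sentence vectors).
    return "idiomatic"

def translate(sentence):
    # Stand-in for an SMT/NMT system.
    return sentence

def pipeline(sentence):
    for idiom, paraphrase in idiom_dictionary.items():
        if idiom in sentence and classify(sentence) == "idiomatic":
            sentence = sentence.replace(idiom, paraphrase)
    return translate(sentence)

print(pipeline("He spilled the beans about the merger."))
```
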
References
Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28, pages 3276–3284.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
Jing Peng, Anna Feldman, and Ekaterina Vylomova. 2014. Classifying idiomatic and literal expressions using topic models and intensity of emotions. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2019–2027, October.
Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014a. An Empirical Study of the Impact of Idioms on Phrase Based Statistical Machine Translation of English to Brazilian-Portuguese. In Third Workshop on Hybrid Approaches to Translation (HyTra), pages 36–41.
Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014b. Evaluation of a substitution method for idiom transformation in statistical machine translation. In The 10th Workshop on Multiword Expressions (MWE 2014), pages 38–42.

Thank you!
Giancarlo D. Salton would like to thank CAPES (“Coordenação de Aperfeiçoamento de Pessoal de Nível Superior”) for his Science Without Borders scholarship, proc. n. 9050-13-2.
