Topic Modeling and WSD on the Ancora Corpus
Ruben Izquierdo
Marten Postma
Piek Vossen
Ruben Izquierdo. LDA & WSD. SEPLN2015, Alicante.
Outline
1. Starting Point
2. Motivation
3. Our Approach
4. Evaluation Framework
5. Experiments and Results
6. Conclusions
Starting point
 “Understanding languages by machines” project
 Starts from the results of DutchSemCor (WSD)
 Analyse the real problems of WSD
 Understand the WSD task
 Word
 Meaning
 Context
Still WSD?
 Word Sense Disambiguation (WSD) is still unsolved
 Used in high-level applications
 Recently, some unsupervised approaches and SemEval tasks
 BabelNet, Babelfy…
 Several reasons and problems
WSD problems I
 Context is not considered properly
 Most are/were supervised approaches
 Moving to unsupervised, graph-based…
 WSD as a black box
 The larger the number of features, the better the performance?
 The best and newest machine learning algorithm
 WSD is seen as only one problem
 All words and cases are treated in the same way
WSD problems II
 Error analysis of Senseval/SemEval systems [Postma et al., 2014]
 Propagation errors (monosemous)
 Most Frequent Sense (MFS) bias
 Supervised systems are skewed towards the MFS
 Error analysis on WSD and Senseval/SemEval
 Performance on MFS cases is good
 Very poor performance on non-MFS cases
 Systems assign the MFS in almost every case
 Senseval-2
 799 cases where the correct sense is not the MFS
 84% of the systems still assign the MFS
Main idea
 WSD considered as two different problems
 When the MFS applies
 More general usages
 Larger contexts?
 Rest of the senses
 More concrete usages
 Shorter contexts?
 Specialized classifiers for each case
 Different features, parameters, contexts…
 Evaluation for Spanish
 Sense-annotated Ancora corpus
Our approach
 TRAINING. Use Topic Modeling (LDA) to induce word-expert classifiers
 For the Most Frequent Sense (BINARY)
 Topics for the MFS case
 Topics for non-MFS cases
 For the rest of the senses, non-MFS (MULTICLASS)
 Topics for every sense
 CLASSIFICATION. Apply the 2 classifiers in cascade to decide the sense in every case
Training
Classification
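The classification diagram did not survive extraction, so the following is only a minimal sketch of the two-step cascade described under "Our approach": a binary MFS / non-MFS decision first, then a multiclass sense decision. The keyword-overlap classifiers, the sense identifiers (`partido.n.1`, `partido.n.2`) and all function names are hypothetical stand-ins for the LDA-based word experts, not the authors' code.

```python
from typing import Callable, Dict, List, Set

Classifier = Callable[[List[str]], str]

MFS_LABEL = "MFS"
NON_MFS_LABEL = "non-MFS"


def make_keyword_classifier(keywords_by_label: Dict[str, Set[str]], default: str) -> Classifier:
    """Toy stand-in for a topic-based classifier: pick the label whose
    keyword set overlaps most with the context, or `default` if none does."""
    def classify(tokens: List[str]) -> str:
        scores = {label: len(set(tokens) & kws) for label, kws in keywords_by_label.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else default
    return classify


def cascade_classify(tokens: List[str], mfs_sense: str,
                     binary_clf: Classifier, sense_clf: Classifier) -> str:
    """Step 1: binary MFS check. Step 2: multiclass sense classifier,
    applied only when step 1 says the instance is not the MFS."""
    if binary_clf(tokens) == MFS_LABEL:
        return mfs_sense
    return sense_clf(tokens)


# Hypothetical word expert for the lemma "partido" (sense ids are invented).
binary = make_keyword_classifier(
    {MFS_LABEL: {"gobierno", "elecciones", "votos"},
     NON_MFS_LABEL: {"gol", "equipo", "estadio"}},
    default=MFS_LABEL,
)
senses = make_keyword_classifier(
    {"partido.n.1": {"gobierno", "elecciones", "votos"},
     "partido.n.2": {"gol", "equipo", "estadio"}},
    default="partido.n.1",
)

political = cascade_classify(["el", "partido", "ganó", "las", "elecciones"],
                             "partido.n.1", binary, senses)
football = cascade_classify(["el", "equipo", "marcó", "un", "gol"],
                            "partido.n.1", binary, senses)
```

The second step is kept as a classifier over all senses, matching the "Sense LDA (all senses)" label on the results slides; restricting it to non-MFS senses would be an alternative design.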
Evaluation framework
 Ancora corpus
 News articles, Spanish part, 500K words, sense-annotated (nouns)
 Converted to NAF format
 3-fold cross-validation
 Keeping the sense distribution
 7119 unique lemmas annotated with nominal senses
 4907 are monosemous (69%)
 2212 are polysemous (31%)
 589 with at least 3 instances per sense (among the annotated lemmas)
[Figure: Number of lemmas vs. polysemy. Horizontal axis: polysemy, 2 to 12. Vertical axis: number of lemmas, 0 to 1400.]
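A cross-validation split that keeps the sense distribution can be sketched as below. This is an illustration under assumptions, not the authors' procedure: instances are hypothetical `(instance_id, sense)` pairs, and each sense is dealt round-robin into the folds. With at least 3 instances per sense, every one of the 3 folds then contains every sense, which is presumably why the 589 lemmas were selected with that threshold.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Instance = Tuple[int, str]  # (instance id, sense label); hypothetical layout


def stratified_folds(instances: List[Instance], n_folds: int = 3, seed: int = 0) -> List[List[Instance]]:
    """Split instances into n_folds so each fold mirrors the sense distribution."""
    by_sense = defaultdict(list)
    for inst in instances:
        by_sense[inst[1]].append(inst)

    rng = random.Random(seed)
    folds: List[List[Instance]] = [[] for _ in range(n_folds)]
    for sense in sorted(by_sense):
        group = by_sense[sense]
        rng.shuffle(group)
        # Deal the instances of this sense round-robin over the folds.
        for i, inst in enumerate(group):
            folds[i % n_folds].append(inst)
    return folds


# Toy lemma with 6 instances of sense "A" and 3 of sense "B".
toy = [(i, "A") for i in range(6)] + [(i, "B") for i in range(6, 9)]
toy_folds = stratified_folds(toy)
```

Each fold serves once as the test set while the other two are used for training.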
Baseline Results
 For the 589 selected lemmas
Baseline Accuracy
Random 40.10
MFS overall 67.68
MFS folded 68.63
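The slide does not define how the baselines were computed. One plausible reading, sketched here as an assumption, is that "MFS folded" takes the most frequent sense from the training folds and scores it on the held-out fold, while the random baseline is the chance of picking the right sense uniformly among the senses of a lemma. All function names are illustrative.

```python
from collections import Counter
from typing import List


def most_frequent_sense(senses: List[str]) -> str:
    """Return the sense label that occurs most often."""
    return Counter(senses).most_common(1)[0][0]


def mfs_accuracy(train_senses: List[str], test_senses: List[str]) -> float:
    """Accuracy (in %) of always predicting the MFS seen in training."""
    mfs = most_frequent_sense(train_senses)
    hits = sum(1 for s in test_senses if s == mfs)
    return 100.0 * hits / len(test_senses)


def random_accuracy(n_senses: int) -> float:
    """Expected accuracy (in %) of a uniform random choice among n_senses."""
    return 100.0 / n_senses


# Toy lemma with three senses.
train = ["s1", "s1", "s2", "s1", "s3", "s2"]
test = ["s1", "s2", "s1", "s1"]
toy_mfs = mfs_accuracy(train, test)   # MFS is "s1"; 3 of 4 test instances match
toy_random = random_accuracy(3)
```

Under this reading, "MFS overall" would use the whole corpus both to pick the MFS and to score it.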
Experimentation
 Configuration of our cascade classifiers
 Only one step with the senseLDA classifier
 2 steps, mfsLDA with perfect performance + senseLDA
 2 steps, mfsLDA and senseLDA both induced automatically
 LDA parameters (Python gensim library)
 Context size (number of sentences)
 Number of topics for LDA
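The context-size parameter can be illustrated with a small helper. The slides only say that the context is measured in sentences and that 0 corresponds to the local sentence; the symmetric window (the same number of sentences before and after the target) is an assumption of this sketch, and `context_window` is a hypothetical name.

```python
from typing import List


def context_window(sentences: List[List[str]], target_index: int, size: int) -> List[str]:
    """Return the tokens of the target sentence plus `size` sentences on
    each side (size 0 means only the sentence containing the target word)."""
    start = max(0, target_index - size)
    end = min(len(sentences), target_index + size + 1)
    return [token for sentence in sentences[start:end] for token in sentence]


# Toy document of five tokenized sentences.
doc = [["a"], ["b", "c"], ["d"], ["e"], ["f"]]
local_only = context_window(doc, 2, 0)
one_around = context_window(doc, 2, 1)
whole_doc = context_window(doc, 2, 50)
```

A window larger than the document simply returns every sentence, which is how a setting such as 50 sentences would behave on short news articles.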
Results I
One-step classification
[Diagram: instance example → Sense LDA (all senses) → word sense]
Sentences   Topics   Accuracy
MFS baseline         68.63
0           3        67.54
0           10       65.56
0           100      58.34
3           3        66.30
3           10       64.62
3           100      60.07
50          3        66.04
50          10       63.42
50          100      59.06
• MFS baseline not reached
• Most informative clues are in small contexts
• More topics → lower performance
Results II
Two steps, MFS classifier with 100% performance
[Diagram: instance example → MFS (100% accuracy) → Sense LDA (all senses) → word sense]
Sentences   Topics   Accuracy
MFS baseline         68.63
0           3        92.48
0           10       92.12
0           100      90.50
3           3        92.45
3           10       92.11
3           100      91.60
50          3        92.41
50          10       92.12
50          100      91.43
• Extremely high figures
• Good performance of the senseLDA classifier (on non-MFS cases)
• Similar behaviour w.r.t. number of sentences and number of topics
Results III
Two steps, MFS classifier with #S=5
[Diagram: instance example → MFS (s5) → Sense LDA (all senses) → word sense]
Sents   Topics   Acc. MFS T100   Acc. MFS T1000
MFS baseline     68.63
0       3        74.53           66.73
0       10       74.00           66.41
0       100      72.61           64.91
3       3        74.30           66.61
3       10       73.87           66.36
3       100      73.39           65.76
50      3        74.26           66.48
50      10       73.90           66.24
50      100      73.53           65.75
• Best MFS classifier: 5 sentences, 100 topics (MFS s5 t100)
• Smaller contexts for non-MFS cases (3, 50 included by 0)
• 3 topics is the best setting
Results IV
Two steps, MFS classifier with #S=50
[Diagram: instance example → MFS (s50) → Sense LDA (all senses) → word sense]
Sents   Topics   Acc. MFS T100   Acc. MFS T1000
MFS baseline     68.63
0       3        73.34           67.15
0       10       72.92           66.76
0       100      71.43           65.13
3       3        73.21           67.02
3       10       72.88           66.60
3       100      72.40           66.24
50      3        73.21           66.95
50      10       72.83           66.58
50      100      72.15           66.20
• Similar behaviour compared to MFS_s5
• Slightly lower results
Lemma comparison (most frequent lemmas)
Lemma        MFS (68.63)   LDA (74.53)   Variation   Annotations
año          89.15         91.19          2.04       1275
país         72.29         83.55         11.26        695
presidente   70.31         73.94          3.63        690
partido      55.87         64.48          8.61        641
equipo       98.32         98.88          0.56        539
mes          54.29         80.00         25.71        315
hora         61.39         56.11         -5.28        305
caso         61.05         91.58         30.53        286
mundo        47.31         40.14         -7.17        279
semana       85.06         92.34          7.28        263
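The Variation column is the LDA accuracy minus the MFS accuracy for each lemma. The short check below recomputes it from the figures in the table; the variable and function names are illustrative only.

```python
# (lemma, MFS accuracy, LDA accuracy), copied from the lemma comparison table.
rows = [
    ("año", 89.15, 91.19), ("país", 72.29, 83.55), ("presidente", 70.31, 73.94),
    ("partido", 55.87, 64.48), ("equipo", 98.32, 98.88), ("mes", 54.29, 80.00),
    ("hora", 61.39, 56.11), ("caso", 61.05, 91.58), ("mundo", 47.31, 40.14),
    ("semana", 85.06, 92.34),
]


def variation(mfs: float, lda: float) -> float:
    """Difference in accuracy points between the LDA cascade and the MFS baseline."""
    return round(lda - mfs, 2)


variations = {lemma: variation(mfs, lda) for lemma, mfs, lda in rows}
improved = [lemma for lemma, delta in variations.items() if delta > 0]
worsened = [lemma for lemma, delta in variations.items() if delta < 0]
overall_gain = variation(68.63, 74.53)  # overall figures from the table header
```

For these ten lemmas the cascade helps in eight cases and hurts in two (hora and mundo), and the overall gain of 5.9 points is the "6 points" quoted in the conclusions.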
Conclusions
 Simple approach based on LDA for WSD in Spanish
 The two-step classification approach improves the WSD results for Spanish (about 6 points over the MFS baseline)
 Different nature of both cases
 MFS: contexts of 5 sentences, 100 topics
 Non-MFS: context limited to the local sentence, 3 topics
 All code and data are publicly available on GitHub (group policy)
http://github.com/rubenIzquierdo/lda_wsd
Ruben Izquierdo
Marten Postma
Piek Vossen
email: ruben.izquierdobevia@vu.nl
http://github.com/rubenIzquierdo/lda_wsd
http://rubenizquierdobevia.com


Editor's notes

  1. Lemmas by number of senses: 4019 with 1 sense, 1318 with 2, 449 with 3, 227 with 4, 110 with 5, 41 with 6, 38 with 7, 11 with 8, 10 with 9, 5 with 10, 2 with 11, 1 with 12