Topic modeling and WSD on the Ancora corpus

Ruben Izquierdo
Marten Postma
Piek Vossen
Ruben Izquierdo. LDA & WSD. SEPLN2015, Alicante.

Outline
1. Starting Point
2. Motivation
3. Our Approach
4. Evaluation Framework
5. Experiments and Results
6. Conclusions

Starting point
 “Understanding languages by machines” project
 Starts from the results of DutchSemCor (WSD)
 Analyse the real problems of WSD
 Understand the WSD task
 Word
 Meaning
 Context

Still WSD?
 Word Sense Disambiguation is still unsolved
 Used in high level applications
 Recently some unsupervised approaches and SemEval tasks
 BabelNet, Babelfy…
 Several reasons and problems

WSD problems I
 Context is not considered properly
 Most are/were supervised approaches
 Moving to unsupervised, graph-based…
 WSD as a black box
 The larger the number of features, the better the performance?
 The best and newest machine learning algorithm
 WSD is seen as only one problem
 All words and cases treated in the same way

WSD problems II
 Error analysis of Senseval/SemEval systems [Postma et al., 2014]
 Propagation errors (monosemous)
 Most Frequent Sense bias
 Supervised systems are skewed towards MFS
 Error analysis on WSD and Senseval/SemEval
 Performance on MFS cases is good
 Very poor performance on non MFS cases
 Systems assign the MFS in almost every case
 Senseval-2:
 799 cases where the correct sense is not the MFS
 In 84% of them the systems still assign the MFS

Main idea
 WSD considered as two different problems
 When the MFS applies
 More general usages
 Larger contexts ??
 Rest of the senses
 More concrete usages
 Shorter contexts ??
 Specialized classifiers for each case
 Different features, parameters, contexts…
 Evaluation for Spanish
 Sense annotated corpus Ancora

Our approach
 TRAINING. Use Topic Modeling (LDA) to induce word expert classifiers
 For the Most Frequent Sense (BINARY classifier)
 Topics for the MFS case
 Topics for non MFS cases
 For the rest of senses, non MFS (MULTICLASS classifier)
 Topics for every sense
 CLASSIFICATION. Apply the 2 classifiers in cascade to decide the sense in every case
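
The cascade can be sketched as below. This is only an illustration of the two-step decision, not the authors' implementation: the function names, the toy lemma "banco" and its sense labels are all hypothetical, and the two stand-in classifiers use keyword checks instead of LDA topics.

```python
from typing import Callable, Sequence

Context = Sequence[str]  # tokens around the target word


def cascade_classify(
    context: Context,
    mfs_sense: str,
    is_mfs: Callable[[Context], bool],
    pick_sense: Callable[[Context], str],
) -> str:
    """Two-step decision: binary MFS check first, multiclass fallback second."""
    if is_mfs(context):
        return mfs_sense
    return pick_sense(context)


# Hypothetical stand-ins for the two word experts of a toy lemma "banco".
def toy_is_mfs(context: Context) -> bool:
    # Binary step: does the context look like the most frequent sense?
    return "dinero" in context


def toy_pick_sense(context: Context) -> str:
    # Multiclass step: choose among the senses of the lemma.
    return "banco.seat" if "parque" in context else "banco.shoal"


first = cascade_classify(
    ["ingresar", "dinero"], "banco.institution", toy_is_mfs, toy_pick_sense
)
second = cascade_classify(
    ["sentarse", "parque"], "banco.institution", toy_is_mfs, toy_pick_sense
)
print(first, second)
```

Keeping the two steps as separate callables reflects the idea that each one can use its own features, parameters and context size.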

Training

Classification

Evaluation framework
 Ancora corpus
 News articles, Spanish part, 500K words, sense annotated (nouns)
 Converted to NAF format
 3-fold cross-validation
 Keeping the sense distribution in each fold
 7119 unique lemmas annotated with nominal senses
 4907 are monosemous (69%)
 2212 are polysemous (31%)
 589 polysemous lemmas have at least 3 instances for each annotated sense
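
A 3-fold split that keeps the sense distribution could look like the sketch below. It is an assumption about how such a split can be done with the standard library, not the procedure used for the talk; `stratified_folds` and the toy instances are hypothetical. With at least 3 instances per sense, every fold receives at least one instance of each sense.

```python
import random
from collections import defaultdict


def stratified_folds(instances, n_folds=3, seed=0):
    """Split (instance_id, sense) pairs into folds, sense by sense."""
    by_sense = defaultdict(list)
    for inst in instances:
        by_sense[inst[1]].append(inst)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for sense in sorted(by_sense):
        group = by_sense[sense]
        rng.shuffle(group)
        # Deal the instances of this sense round-robin over the folds.
        for i, inst in enumerate(group):
            folds[i % n_folds].append(inst)
    return folds


# Toy lemma: 6 instances of sense s1 and 3 instances of sense s2.
toy = [(i, "s1") for i in range(6)] + [(i, "s2") for i in range(6, 9)]
folds = stratified_folds(toy)
for fold in folds:
    print(sorted(sense for _, sense in fold))
```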

[Figure: Number of lemmas vs. polysemy. x-axis: polysemy (2 to 12 senses); y-axis: number of lemmas (0 to 1400)]

Baseline Results
 For the 589 selected lemmas
Baseline      Accuracy
Random        40.10
MFS overall   67.68
MFS folded    68.63
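
How such baselines are typically computed is sketched below on toy data. The reading of "MFS folded" as "MFS learned on the training folds and applied to the test fold" is an assumption, as are the function names and the toy instances; the random baseline is taken as the expected accuracy of a uniform guess, 1 / (number of senses of the lemma).

```python
from collections import Counter


def mfs_per_lemma(instances):
    """Most frequent sense of each lemma in a list of (lemma, sense) pairs."""
    counts = {}
    for lemma, sense in instances:
        counts.setdefault(lemma, Counter())[sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def mfs_accuracy(train, test):
    """Accuracy of always answering the MFS seen in the training data."""
    mfs = mfs_per_lemma(train)
    hits = sum(1 for lemma, sense in test if mfs.get(lemma) == sense)
    return hits / len(test)


def random_accuracy(test, senses_per_lemma):
    """Expected accuracy of picking one of the lemma's senses uniformly."""
    return sum(1 / senses_per_lemma[lemma] for lemma, _ in test) / len(test)


# Toy data, not taken from Ancora.
train = [("mes", "s1"), ("mes", "s1"), ("mes", "s2"),
         ("caso", "s1"), ("caso", "s2"), ("caso", "s2"), ("caso", "s3")]
test = [("mes", "s1"), ("mes", "s2"), ("caso", "s2"), ("caso", "s3")]

mfs_acc = mfs_accuracy(train, test)
rnd_acc = random_accuracy(test, {"mes": 2, "caso": 3})
print(mfs_acc, round(rnd_acc, 4))
```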

Experimentation
 Configuration of our cascade classifiers
 Only one step with the senseLDA classifier
 2 steps, mfsLDA with perfect performance + senseLDA
 2 steps, mfsLDA and senseLDA both induced automatically
 LDA parameters (Python gensim library)
 Context size (number of sentences)
 Number of topics for LDA
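
The two parameters can be illustrated with a small sketch. The values (0, 3, 50 sentences; 3, 10, 100 topics) are the ones that appear in the results tables; the interpretation that a context size of N means N sentences on each side of the target sentence is an assumption, and `context_window` is a hypothetical helper. No LDA model is trained here.

```python
from itertools import product


def context_window(sentences, target_idx, size):
    """Tokens of the target sentence plus `size` sentences on each side.

    size=0 keeps only the sentence that contains the target word.
    """
    start = max(0, target_idx - size)
    end = min(len(sentences), target_idx + size + 1)
    return [tok for sent in sentences[start:end] for tok in sent]


# Settings explored in the experiments (taken from the results tables).
SENTENCE_SIZES = (0, 3, 50)
TOPIC_COUNTS = (3, 10, 100)
grid = list(product(SENTENCE_SIZES, TOPIC_COUNTS))

doc = [["a"], ["b", "c"], ["d"], ["e"]]
print(context_window(doc, 1, 0))
print(context_window(doc, 1, 1))
print(len(grid), "configurations")
```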

Results I
One step classification
Pipeline: Instance Example -> Sense LDA (all senses) -> Word Sense

Sentences  Topics  Accuracy
MFS baseline       68.63
0          3       67.54
0          10      65.56
0          100     58.34
3          3       66.30
3          10      64.62
3          100     60.07
50         3       66.04
50         10      63.42
50         100     59.06

• MFS not reached
• Most informative clues in small contexts
• More topics, lower performance

Results II
Two steps, MFS classifier with 100% performance
Pipeline: Instance Example -> MFS (100% accuracy) -> Sense LDA (all senses) -> Word Sense

Sentences  Topics  Accuracy
MFS baseline       68.63
0          3       92.48
0          10      92.12
0          100     90.50
3          3       92.45
3          10      92.11
3          100     91.60
50         3       92.41
50         10      92.12
50         100     91.43

• Extremely high figures
• Good performance of the senseLDA classifier (on non MFS cases)
• Similar behaviour w.r.t. #sents and #topics
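
A rough back-of-the-envelope reading of the best oracle figure (92.48): if the perfect MFS step resolves every MFS instance, the remaining gain must come from senseLDA on the non MFS instances. The sketch below assumes that the share of MFS instances equals the folded MFS baseline (68.63), which is an approximation and not a number reported in the slides; `implied_non_mfs_accuracy` is a hypothetical helper.

```python
def implied_non_mfs_accuracy(overall_acc, mfs_share):
    """Accuracy on non MFS instances implied by an oracle MFS step.

    overall_acc = mfs_share + (100 - mfs_share) * non_mfs_acc / 100
    """
    return 100.0 * (overall_acc - mfs_share) / (100.0 - mfs_share)


# Assumption: 68.63% of the instances carry the MFS (folded MFS baseline).
implied = implied_non_mfs_accuracy(overall_acc=92.48, mfs_share=68.63)
print(round(implied, 1))
```

Under that assumption the senseLDA step would be right on roughly three quarters of the non MFS instances.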

Results III
Two steps, MFS classifier #S=5
Pipeline: Instance Example -> MFS (s5) -> Sense LDA (all senses) -> Word Sense

Sents  Topics  Acc. (MFS T100)  Acc. (MFS T1000)
MFS baseline   68.63
0      3       74.53            66.73
0      10      74.00            66.41
0      100     72.61            64.91
3      3       74.30            66.61
3      10      73.87            66.36
3      100     73.39            65.76
50     3       74.26            66.48
50     10      73.90            66.24
50     100     73.53            65.75

• Best MFS classifier setting: s5, T100
• Smaller contexts for non MFS cases (3, 50 included by 0)
• 3 topics is the best

Results IV
Two steps, MFS classifier #S=50
Pipeline: Instance Example -> MFS (s50) -> Sense LDA (all senses) -> Word Sense

Sents  Topics  Acc. (MFS T100)  Acc. (MFS T1000)
MFS baseline   68.63
0      3       73.34            67.15
0      10      72.92            66.76
0      100     71.43            65.13
3      3       73.21            67.02
3      10      72.88            66.60
3      100     72.40            66.24
50     3       73.21            66.95
50     10      72.83            66.58
50     100     72.15            66.20

• Similar behaviour compared to MFS_s5
• Slightly lower results

Lemma comparison
Most frequent lemmas

Lemma       MFS (68.63)  LDA (74.53)  Variation  Annotations
año         89.15        91.19          2.04     1275
país        72.29        83.55         11.26      695
presidente  70.31        73.94          3.63      690
partido     55.87        64.48          8.61      641
equipo      98.32        98.88          0.56      539
mes         54.29        80.00         25.71      315
hora        61.39        56.11         -5.28      305
caso        61.05        91.58         30.53      286
mundo       47.31        40.14         -7.17      279
semana      85.06        92.34          7.28      263
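
The Variation column is the LDA accuracy minus the MFS accuracy for each lemma. A short check over the figures of the table, which also lists the lemmas where the two-step LDA approach falls below the MFS baseline (variable names are illustrative):

```python
# (lemma, MFS accuracy, LDA accuracy, annotations), copied from the table.
rows = [
    ("año", 89.15, 91.19, 1275),
    ("país", 72.29, 83.55, 695),
    ("presidente", 70.31, 73.94, 690),
    ("partido", 55.87, 64.48, 641),
    ("equipo", 98.32, 98.88, 539),
    ("mes", 54.29, 80.00, 315),
    ("hora", 61.39, 56.11, 305),
    ("caso", 61.05, 91.58, 286),
    ("mundo", 47.31, 40.14, 279),
    ("semana", 85.06, 92.34, 263),
]

# Variation = LDA - MFS, rounded to two decimals as in the table.
variation = {lemma: round(lda - mfs, 2) for lemma, mfs, lda, _ in rows}

# Lemmas where the LDA cascade is worse than the MFS baseline.
worse = [lemma for lemma, v in variation.items() if v < 0]
print(variation)
print(worse)
```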

Conclusions
 Simple approach based on LDA for WSD in Spanish
 Two-step classification for WSD improves the results for Spanish by about 6 points over the MFS baseline (68.63 to 74.53)
 Different nature of both cases
 MFS: contexts of 5 sentences, 100 topics
 Non MFS: context limited to the local sentence, 3 topics
 All code and data publicly available on GitHub (group policy)
http://github.com/rubenIzquierdo/lda_wsd

Ruben Izquierdo
Marten Postma
Piek Vossen
email: ruben.izquierdobevia@vu.nl
http://github.com/rubenIzquierdo/lda_wsd
http://rubenizquierdobevia.com
