1. DutchSemCor
Building a semantically annotated corpus for Dutch
Piek Vossen, Attila Görög, VU University Amsterdam
Fons Laan, ISLA, University of Amsterdam
Rubén Izquierdo, Tilburg University
Antal van den Bosch, Maarten van Gompel, Radboud University Nijmegen
CLIN 22, Tilburg University, 20/01/2012
2. Overview
Project goals and planning
Current progress
Word-sense-disambiguation results
Active learning phase
3. Goals and planning
Funded by NWO, 2009-2012
Create a large semantically tagged corpus for
Dutch:
− Sense-tags from the Cornetto database
(includes Dutch wordnet)
− Domain labels from Wordnet Domains
− Named entities mapped to Wikipedia
4. Global procedure
Phase-1:
− 25 examples per meaning for 3,000 most polysemous and frequent
nouns, verbs and adjectives (average nr. of meanings = 3)
− Annotated by two student assistants
− Minimum inter-annotator agreement (IAA): 80%
Phase-2:
− Word-sense-disambiguation (WSD) systems trained with the data
of phase-1
− Active learning: add examples for low-performing words and
meanings until we reach an accuracy of 80% or see no further progress
Phase-3:
− Apply WSD to rest of the full corpus
5. Corpora
SoNaR: 500M tokens written Dutch
CGN: 1M tokens spoken Dutch
Web snippets retrieved through WebCorp
(http://www.webcorp.org.uk/)
− In case no or insufficient examples are found for
particular senses in SoNaR and CGN
− Students select snippets (target word and
context) which are added to the corpus in the
SoNaR annotation format
7. Current results Phase-1
PoS: nouns, verbs and adjectives
Number of annotated lemmas: 2,870
Number of word senses: 11,982
Number of overlapping annotations: 282,503
(67% SoNaR, 5% CGN, 28% Snippets)
Inter Annotator Agreement: 92%
Coverage of senses with 25 examples: 70%
Coverage of annotations for words: 79%
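The slides do not say how agreement and coverage were computed. As a minimal sketch, assuming agreement is the simple fraction of tokens on which both annotators chose the same sense, and coverage is the fraction of senses that reach the target number of agreed examples, the two figures could be derived as follows. The sense identifiers and the toy data are hypothetical.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Fraction of tokens on which two annotators chose the same sense."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    if not labels_a:
        return 0.0
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

def sense_coverage(agreed_labels, all_senses, target=25):
    """Fraction of senses that have at least `target` agreed examples."""
    counts = Counter(agreed_labels)
    covered = sum(1 for sense in all_senses if counts[sense] >= target)
    return covered / len(all_senses)

# Toy data: hypothetical sense identifiers for the lemma "bank".
ann_a = ["bank:1", "bank:1", "bank:2", "bank:3", "bank:1"]
ann_b = ["bank:1", "bank:2", "bank:2", "bank:3", "bank:1"]

agreement = observed_agreement(ann_a, ann_b)
agreed = [a for a, b in zip(ann_a, ann_b) if a == b]
# A small target is used here only so the toy data can reach it.
coverage = sense_coverage(agreed, ["bank:1", "bank:2", "bank:3"], target=2)
print(f"agreement={agreement:.2f} coverage={coverage:.2f}")
```

A chance-corrected measure such as Cohen's kappa would be a stricter alternative; the slides only report a percentage.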
8. WSD Systems
UKB --> Knowledge-based WSD system that
employs semantic relations
Tilburg WSD --> Supervised machine-learning
based WSD system
9. UKB. Description
Knowledge based (Agirre and Soroa, 2009)
WordNet considered as a graph
− Senses -> nodes
− Relations -> edges
Personalized PageRank algorithm
− Modification of traditional PageRank
− Context words act as source nodes injecting
mass into word senses
− Assign stronger probabilities to certain nodes
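To make the idea concrete, here is a small sketch of personalized PageRank by power iteration on a toy sense graph. It is not the UKB implementation: the graph, the node names, the damping factor and the iteration count are all assumptions chosen for illustration. The only difference from standard PageRank is that the teleport mass goes to the context-word nodes instead of being spread uniformly.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with a non-uniform teleport vector.

    graph: dict mapping node -> list of nodes it links to.
    seeds: dict mapping node -> initial mass (normalised to sum to 1).
    """
    nodes = list(graph)
    total = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: hand its mass back to the seed nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
                continue
            share = damping * rank[node] / len(neighbours)
            for nb in neighbours:
                new_rank[nb] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: the context word "geld" (money) is linked to its
# sense, which is related to the financial sense of "bank" but not to the
# bench sense.
graph = {
    "ctx:geld": ["geld:1"],
    "geld:1": ["bank:financial"],
    "bank:financial": ["geld:1"],
    "bank:bench": ["meubel:1"],
    "meubel:1": ["bank:bench"],
}
rank = personalized_pagerank(graph, seeds={"ctx:geld": 1.0})
best = max(["bank:financial", "bank:bench"], key=rank.get)
print(best)
```

Because all mass is injected at the context word, only senses reachable from it accumulate rank, which is what lets the context select among the senses of the target word.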
11. UKB. Graph relations
Relation                                    Number
Dutch synset – Dutch synset                 140,219
Domain – Domain                             125
Dutch synset – Domain                       86,798
Dutch synset – English synset               73,935
English synset – English synset             252,392
English synset – English gloss synset       419,387
Annotation co-occurrences, polysemous       17,152
Annotation co-occurrences, monosemous       151,598
TOTAL                                       1,266,481
Graph configurations: UKB-1, UKB-2, UKB-3
AC = annotation co-occurrences
UKB-4 = UKB-1 + AC
UKB-5 = UKB-3 + AC
12. UKB. Evaluation
Precision Recall F-measure
UKB-1 0.4557 0.4491 0.4523
UKB-2 0.4557 0.4491 0.4524
UKB-3 0.4560 0.4493 0.4526
UKB-4 0.6360 0.6272 0.6316
UKB-5 0.6411 0.6322 0.6366
For comparison, SemEval-2010 task on WSD in a specific domain (all-words task):
UKB3 52.6 precision
English UKB 48.1 precision
UKB5 & UKB4 gained 9 points on UKB3 due to co-occurrence relations
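As a quick consistency check, the F-measure column can be recomputed from precision and recall, assuming the usual balanced F-measure F = 2PR / (P + R). The slides do not state the formula, so this is an assumption, but the reported values for UKB-3, UKB-4 and UKB-5 are reproduced to four decimals.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) as given in the table above.
reported = {
    "UKB-3": (0.4560, 0.4493, 0.4526),
    "UKB-4": (0.6360, 0.6272, 0.6316),
    "UKB-5": (0.6411, 0.6322, 0.6366),
}

for name, (p, r, f_reported) in reported.items():
    f = f_measure(p, r)
    print(f"{name}: recomputed {f:.4f}, reported {f_reported:.4f}")
```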
13. Tilburg WSD System
Based on TiMBL, a k-nearest-neighbour classifier
(Daelemans et al., 2007)
Features:
− Local context (words in window around target)
− Global context (binary Bag of Words)
− Sonar category (domain label)
Parameter Search:
− Using TiMBL leave-one-out feature
Evaluation:
− Test: 10 examples per sense
− Train: at least 15 examples per sense
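The feature set and classifier described above can be sketched in a few lines. This is a plain-Python stand-in, not TiMBL: it uses an unweighted overlap distance, whereas TiMBL offers feature weighting and other metrics that are not reproduced here. The function names, the vocabulary and the example sentences are hypothetical, and the SoNaR category feature is left out.

```python
from collections import Counter

def extract_features(tokens, target_index, vocabulary, window=1):
    """Local context words around the target plus a binary bag of words."""
    local = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        # "_" pads positions that fall outside the sentence.
        local.append(tokens[i] if 0 <= i < len(tokens) else "_")
    present = set(tokens)
    bag = [1 if word in present else 0 for word in vocabulary]
    return local + bag

def overlap_distance(a, b):
    """Number of feature positions whose values differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(instance, training, k=1):
    """Majority sense among the k training instances closest to `instance`."""
    ranked = sorted(training, key=lambda item: overlap_distance(instance, item[0]))
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

vocabulary = ["geld", "zitten", "rente", "tuin"]
examples = [
    ("ik breng geld naar de bank", 5, "bank:financial"),
    ("de bank verhoogt de rente", 1, "bank:financial"),
    ("wij zitten op de bank in de tuin", 4, "bank:bench"),
]
training = [
    (extract_features(text.split(), idx, vocabulary), sense)
    for text, idx, sense in examples
]
query = extract_features("zij zitten op de bank in de kamer".split(), 4, vocabulary)
predicted = knn_classify(query, training, k=1)
print(predicted)
```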
14. Tilburg WSD System. First results
Feature set                          Token accuracy
Words1                               0.6462
Words1 + Bag-of-words                0.7259
Words1 + PoS1 + Bag-of-words         0.7226
Words1 + Bag-of-words + PS           0.7931
Bag-of-words: improvement of 8 percentage points
Parameter search (PS): improvement of another 7 percentage points
Previous experiments suggest that the best size for the context window is 1
15. Tilburg WSD System. TiMBL Confidence
TiMBL confidence threshold 0.55:
Precision 0.84 (+0.44 compared to no filtering)
F-score 0.78 (only 0.03 lower than with no filtering)
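The trade-off behind confidence filtering can be illustrated with a toy example. This sketch assumes each prediction carries a confidence value between 0 and 1 and that predictions below the threshold are simply left unanswered; how TiMBL derives its confidence is not described on the slide, and the data below is invented. Dropping uncertain predictions raises precision, while recall falls because fewer tokens receive an answer.

```python
def filter_by_confidence(predictions, threshold=0.55):
    """Keep only predictions whose confidence reaches the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]

def precision_recall_f(all_predictions, kept):
    """Precision over answered tokens, recall over all tokens, and F."""
    correct = sum(1 for p in kept if p["predicted"] == p["gold"])
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(all_predictions) if all_predictions else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Invented predictions for five tokens of one word.
predictions = [
    {"predicted": "s1", "gold": "s1", "confidence": 0.9},
    {"predicted": "s1", "gold": "s1", "confidence": 0.8},
    {"predicted": "s2", "gold": "s2", "confidence": 0.6},
    {"predicted": "s2", "gold": "s1", "confidence": 0.5},
    {"predicted": "s1", "gold": "s1", "confidence": 0.4},
]

unfiltered = precision_recall_f(predictions, predictions)
filtered = precision_recall_f(predictions, filter_by_confidence(predictions, 0.55))
print("no filtering:", unfiltered)
print("threshold 0.55:", filtered)
```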
16. Active Learning
1. Obtain annotated data
2. Train and evaluate the system
3. Select words with accuracy < 80%
4. Apply WSD to all tokens of the selected words that are not yet
annotated
5. Select tokens of meanings with F-score <
80%
17. Active Learning
6. For each word meaning, rank all the tokens according to the
combination (F-score) of:
− TiMBL confidence
− Distance to the nearest neighbor
7. Select the 50 top-ranked tokens per meaning to be manually
reviewed in 2 weeks
8. Go to 1
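The selection steps of the loop can be sketched as follows. Several details are assumptions, since the slides do not spell them out: the "combination (F-score)" is read here as the harmonic mean of the two signals, the distance is turned into a closeness score via 1 / (1 + distance), and all names and toy values are hypothetical.

```python
def below_threshold(scores, threshold=0.80):
    """Names (words or meanings) whose score is under the threshold."""
    return sorted(name for name, value in scores.items() if value < threshold)

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def select_for_review(candidates, weak_senses, per_sense=50):
    """Rank automatically tagged tokens per weak meaning, keep the top ones.

    candidates: dicts with token_id, sense, confidence and distance keys.
    """
    ranked = {}
    for cand in candidates:
        if cand["sense"] not in weak_senses:
            continue
        # Assumed mapping from distance to a 0..1 closeness score.
        closeness = 1.0 / (1.0 + cand["distance"])
        score = harmonic_mean(cand["confidence"], closeness)
        ranked.setdefault(cand["sense"], []).append((score, cand["token_id"]))
    return {
        sense: [tid for _, tid in sorted(pairs, reverse=True)[:per_sense]]
        for sense, pairs in ranked.items()
    }

# Invented F-scores per meaning and invented WSD output for unannotated tokens.
sense_f = {"bank:financial": 0.85, "bank:bench": 0.60}
weak = below_threshold(sense_f)
candidates = [
    {"token_id": "t1", "sense": "bank:bench", "confidence": 0.9, "distance": 0.0},
    {"token_id": "t2", "sense": "bank:bench", "confidence": 0.6, "distance": 1.0},
    {"token_id": "t3", "sense": "bank:bench", "confidence": 0.8, "distance": 3.0},
    {"token_id": "t4", "sense": "bank:financial", "confidence": 0.7, "distance": 0.0},
]
# per_sense=2 keeps the toy output short; the slides use 50.
review = select_for_review(candidates, weak, per_sense=2)
print(review)
```

The tokens returned for review would then be checked by the annotators and added to the training data before the next iteration.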
18. Future Work
Fine tune the active learning
Optimize the WSD systems
Combine different WSD systems
Test on independent texts in all-words task
Apply optimal system to full corpora (over 500M
tokens)
19. Thanks to
Anneleen Schoen
Charlotte van Tongeren
Daphne van Kessel
Dieke Janssen
Elizabeth van Zutphen
Gratia Bruining
Jonica Kaagman
Laura Kipp
Lisanne Ranzijn
Marlisa Hommel
Wilma van Velzen
Milou Kerkhof
Sam Vossen
Niqee Vossen
Rosa Scheffer
Chantal van Son