DutchSemCor
Building a semantically annotated corpus for Dutch
Piek Vossen, Attila Görög, VU University Amsterdam
Fons Laan, ISLA, University of Amsterdam
Rubén Izquierdo, Tilburg University
Antal van den Bosch, Maarten van Gompel, Radboud University Nijmegen
CLIN 22, Tilburg University, 20/01/2012
Overview

Project goals and planning

Current progress

Word-sense-disambiguation results

Active learning phase
Goals and planning

Funded by NWO, 2009-2012

Create a large semantically tagged corpus for
Dutch:
− Sense-tags from the Cornetto database
(includes Dutch wordnet)
− Domain labels from Wordnet Domains
− Named entities mapped to Wikipedia
Global procedure

Phase-1:
− 25 examples per meaning for the 3,000 most polysemous and frequent
nouns, verbs and adjectives (average nr. of meanings = 3)
− Annotated by two student assistants
− Minimum inter-annotator agreement (IAA): 80%

Phase-2:
− Word-sense-disambiguation (WSD) systems trained with the data
of phase-1
− Active learning: add examples for low-performing words and
meanings until we reach an accuracy of 80% or see no further progress

Phase-3:
− Apply WSD to rest of the full corpus
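The 80% minimum IAA in Phase-1 can be checked per lemma with a simple observed-agreement calculation. The sketch below is only an illustration: it assumes agreement is the fraction of tokens that both annotators tagged with the same sense (the slides do not state the exact measure), and the function name and sense labels are hypothetical.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens that two annotators tagged with the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no annotations to compare")
    same = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return same / len(tags_a)


# Hypothetical sense tags for five tokens of one lemma.
annotator_1 = ["bank-1", "bank-1", "bank-2", "bank-2", "bank-1"]
annotator_2 = ["bank-1", "bank-1", "bank-2", "bank-1", "bank-1"]

iaa = observed_agreement(annotator_1, annotator_2)
meets_threshold = iaa >= 0.80  # Phase-1 minimum from the slide
```

Here the annotators agree on four of five tokens, so the lemma just reaches the threshold.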
Corpora

SoNaR: 500M tokens written Dutch

CGN: 1M tokens spoken Dutch

Web snippets retrieved through WebCorp
(http://www.webcorp.org.uk/)
− In case no or insufficient examples are found for
particular senses in SoNaR and CGN
− Students select snippets (target word and
context) which are added to the corpus in the
SoNaR annotation format
Annotation tool
Current results Phase-1

PoS: nouns, verbs and adjectives

Number of annotated lemmas: 2,870

Number of word senses: 11,982

Number of overlapping annotations: 282,503
(67% SoNaR, 5% CGN, 28% Snippets)

Inter-annotator agreement: 92%

Coverage of senses with 25 examples: 70%

Coverage of annotations for words: 79%
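The sense-coverage figure can be read as the share of senses that reached the Phase-1 target of 25 annotated examples. A minimal sketch under that assumed definition follows; the counts and sense identifiers are invented for illustration and are not project data.

```python
from collections import Counter

TARGET = 25  # examples per sense, the Phase-1 target


def sense_coverage(annotations, all_senses, target=TARGET):
    """Share of senses in all_senses with at least `target` annotated tokens."""
    counts = Counter(annotations)
    covered = [s for s in all_senses if counts[s] >= target]
    return len(covered) / len(all_senses)


# Hypothetical data: three senses, two of which reach the target.
annotations = ["huis-1"] * 30 + ["huis-2"] * 25 + ["huis-3"] * 4
coverage = sense_coverage(annotations, ["huis-1", "huis-2", "huis-3"])
```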
WSD Systems

UKB --> Knowledge-based WSD system that
employs semantic relations

Tilburg WSD --> Supervised machine-learning
based WSD system
UKB. Description

Knowledge-based (Agirre and Soroa, 2009)

WordNet considered as a graph
− Senses -> nodes
− Relations -> edges

Personalized PageRank algorithm
− Modification of traditional PageRank
− Context words act as source nodes injecting
mass into word senses
− Assign stronger probabilities to certain nodes
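The idea of context words injecting mass into the graph can be sketched with a small power iteration. This is not the UKB implementation: the graph, the node names, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration. Context words are added as nodes linked to their senses, and the teleport vector puts all its mass on those context-word nodes.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration on an undirected graph with a seed-based teleport vector."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    nodes = sorted(neighbours)
    # Teleport mass goes only to the seed nodes (the context words).
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour passes on an equal share of its current rank.
            incoming = sum(rank[m] / len(neighbours[m]) for m in neighbours[n])
            new_rank[n] = (1 - damping) * teleport[n] + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy graph: context words "money" and "loan" around target "bank".
edges = [
    ("money", "money#1"), ("loan", "loan#1"),
    ("money#1", "bank#finance"), ("loan#1", "bank#finance"),
    ("bank#finance", "institution#1"), ("institution#1", "slope#1"),
    ("slope#1", "bank#river"), ("bank#river", "water#1"),
]
rank = personalized_pagerank(edges, seeds={"money", "loan"})
best_sense = max(["bank#finance", "bank#river"], key=rank.get)
```

The sense closer to the context words in the graph ends up with the higher rank, which is the behaviour the slide describes.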
UKB. Semantic relations

Dutch WordNet

English WordNet

Dutch WordNet ==> English WordNet

WordNet Domain
− tennis player, tennis ball => tennis => SPORT
− football player, football => soccer => SPORT

Annotation co-occurrence relations
− Polysemous => monosemous
− Polysemous => polysemous
UKB. Graph relations
Relation                                   Number
Dutch synset - Dutch synset               140,219
Domain - Domain                               125
Dutch synset - Domain                      86,798
Dutch synset - English synset              73,935
English synset - English synset           252,392
English synset - English gloss synset     419,387
Annotation co-occurrences, polysemous      17,152
Annotation co-occurrences, monosemous     151,598
TOTAL                                   1,266,481

Graph configurations: UKB-1, UKB-2, UKB-3
AC = annotation co-occurrences
UKB-4 = UKB-1 + AC
UKB-5 = UKB-3 + AC
UKB. Evaluation
         Precision  Recall  F-measure
UKB-1    0.4557     0.4491  0.4523
UKB-2    0.4557     0.4491  0.4524
UKB-3    0.4560     0.4493  0.4526
UKB-4    0.6360     0.6272  0.6316
UKB-5    0.6411     0.6322  0.6366
For comparison, SemEval-2010 task on WSD in a specific domain (all-words task):
− UKB3: 52.6 precision
− English UKB: 48.1 precision
− UKB5 & UKB4 gained 9 points on UKB3 due to co-occurrence relations
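As a consistency check on the evaluation table, the F-measure column can be recomputed from precision and recall. The sketch assumes the balanced F-measure (harmonic mean of precision and recall), which the slide does not state explicitly, and reads the UKB-1 precision as 0.4557.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the evaluation table.
reported = {
    "UKB-1": (0.4557, 0.4491, 0.4523),
    "UKB-2": (0.4557, 0.4491, 0.4524),
    "UKB-3": (0.4560, 0.4493, 0.4526),
    "UKB-4": (0.6360, 0.6272, 0.6316),
    "UKB-5": (0.6411, 0.6322, 0.6366),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}
# Largest difference between recomputed and reported values.
max_gap = max(abs(recomputed[name] - reported[name][2]) for name in reported)
```

Under that assumption the recomputed values agree with the reported column up to rounding in the fourth decimal.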
Tilburg WSD System

Based on TiMBL, a k-nearest-neighbour classifier
(Daelemans et al., 2007)

Features:
− Local context (words in window around target)
− Global context (binary Bag of Words)
− Sonar category (domain label)

Parameter Search:
− Using TiMBL leave-one-out feature

Evaluation:
− Test: 10 examples per sense
− Train: >= 15 examples per sense
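The local-context and bag-of-words features can be illustrated with a tiny nearest-neighbour classifier. This sketch does not use TiMBL or its API: the distance function (mismatching context slots plus unshared bag words), the sense labels and the example sentences are assumptions, and the Sonar category feature is left out.

```python
from collections import Counter


def extract_features(tokens, target_index, window=1):
    """Local-context words around the target plus a binary bag of words."""
    local = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        # Pad with "_" when the window runs past the sentence boundary.
        local.append(tokens[i] if 0 <= i < len(tokens) else "_")
    bag = frozenset(t for j, t in enumerate(tokens) if j != target_index)
    return tuple(local), bag


def distance(a, b):
    """Mismatching local-context slots plus bag words not shared by both."""
    local_a, bag_a = a
    local_b, bag_b = b
    mismatches = sum(1 for x, y in zip(local_a, local_b) if x != y)
    return mismatches + len(bag_a ^ bag_b)


def knn_classify(train, features, k=1):
    """Majority vote over the k nearest training instances."""
    nearest = sorted(train, key=lambda item: distance(item[0], features))[:k]
    votes = Counter(sense for _, sense in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training instances for the Dutch word "bank".
train = [
    (extract_features("ik zit op de bank thuis".split(), 4), "bank-meubel"),
    (extract_features("geld staat op de bank".split(), 4), "bank-instelling"),
]
test_instance = extract_features("mijn geld staat op de bank".split(), 5)
predicted = knn_classify(train, test_instance, k=1)
```

The test sentence shares its local context and most of its bag of words with the second training instance, so that instance is the nearest neighbour.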
Tilburg WSD System. First results
Feature set                        Token accuracy
Words1                             0.6462
Words1 + Bag-of-words              0.7259
Words1 + PoS1 + Bag-of-words       0.7226
Words1 + Bag-of-words + PS         0.7931
− Bag-of-words: improvement of 8 percentage points
− Parameter search (PS): improvement of another 7 percentage points
Previous experiments suggest that the best size for the context window is 1
Tilburg WSD System. TiMBL Confidence
Filtering at TiMBL confidence 0.55:
− Precision: 0.84 (+0.44 compared to no filtering)
− F-score: 0.78 (only 0.03 lower than with no filtering)
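Confidence filtering trades coverage for precision: only predictions at or above the threshold are kept. The sketch below assumes each prediction carries a confidence score between 0 and 1; the records are invented and the function names are hypothetical, so the numbers do not reproduce the slide.

```python
def filter_by_confidence(predictions, threshold=0.55):
    """Keep only predictions whose confidence reaches the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]


def precision(predictions):
    """Share of the given predictions that match the gold sense."""
    if not predictions:
        return 0.0
    correct = sum(1 for p in predictions if p["predicted"] == p["gold"])
    return correct / len(predictions)


# Hypothetical classifier output for five tokens.
predictions = [
    {"predicted": "s1", "gold": "s1", "confidence": 0.90},
    {"predicted": "s2", "gold": "s1", "confidence": 0.40},
    {"predicted": "s2", "gold": "s2", "confidence": 0.70},
    {"predicted": "s1", "gold": "s2", "confidence": 0.60},
    {"predicted": "s1", "gold": "s1", "confidence": 0.30},
]

kept = filter_by_confidence(predictions)
precision_all = precision(predictions)    # before filtering
precision_kept = precision(kept)          # after filtering
coverage = len(kept) / len(predictions)   # share of tokens still tagged
```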
Active Learning
1. Obtain annotated data
2. Train and evaluate the system
3. Select words with accuracy < 80%
4. Apply WSD to all tokens of the selected words that are not yet annotated
5. Select tokens of meanings with F-score < 80%
Active Learning (continued)
6. For each word meaning, rank all the tokens according to the combination (F-score) of:
   a) TiMBL confidence
   b) Distance to the nearest neighbor
7. Select the 50 top-ranked tokens per meaning to be manually reviewed within 2 weeks
8. Go to 1
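The ranking and selection steps above can be sketched as follows. The slides do not define the combination precisely, so this sketch assumes a harmonic mean of the confidence and a closeness value 1 / (1 + distance), ranked from high to low; the field names, F-scores and tokens are hypothetical.

```python
def combined_score(confidence, distance):
    """Harmonic mean of confidence and closeness = 1 / (1 + distance)."""
    closeness = 1.0 / (1.0 + distance)
    if confidence + closeness == 0:
        return 0.0
    return 2 * confidence * closeness / (confidence + closeness)


def select_for_review(tagged_tokens, weak_meanings, per_meaning=50):
    """Return, per weak meaning, the top-ranked automatically tagged tokens."""
    by_meaning = {}
    for token in tagged_tokens:
        if token["meaning"] in weak_meanings:
            by_meaning.setdefault(token["meaning"], []).append(token)
    selection = {}
    for meaning, tokens in by_meaning.items():
        ranked = sorted(
            tokens,
            key=lambda t: combined_score(t["confidence"], t["distance"]),
            reverse=True,
        )
        selection[meaning] = ranked[:per_meaning]
    return selection


# Hypothetical per-meaning F-scores; meanings below 0.80 need more examples.
f_scores = {"bank-1": 0.91, "bank-2": 0.62}
weak = {m for m, f in f_scores.items() if f < 0.80}

# Hypothetical WSD output on tokens that are not yet annotated.
tagged = [
    {"id": "t1", "meaning": "bank-2", "confidence": 0.9, "distance": 0.0},
    {"id": "t2", "meaning": "bank-2", "confidence": 0.5, "distance": 1.0},
    {"id": "t3", "meaning": "bank-1", "confidence": 0.8, "distance": 0.0},
    {"id": "t4", "meaning": "bank-2", "confidence": 0.7, "distance": 0.0},
]
selection = select_for_review(tagged, weak, per_meaning=2)
selected_ids = [t["id"] for t in selection["bank-2"]]
```

In the project the per-meaning cut-off is 50 tokens; the example uses 2 only to keep the data small.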
Future Work

Fine tune the active learning

Optimize the WSD systems

Combine different WSD systems

Test on independent texts in all-words task

Apply the optimal system to the full corpora (over 500M
tokens)
Thanks to

Anneleen Schoen

Charlotte van Tongeren

Daphne van Kessel

Dieke Janssen

Elizabeth van Zutphen

Gratia Bruining

Jonica Kaagman

Laura Kipp

Lisanne Ranzijn

Marlisa Hommel

Wilma van Velzen
Milou Kerkhof
Sam Vossen
Niqee Vossen
Rosa Scheffer
Chantal van Son
CLIN 22,Tilburg University, 20/01/2012

Más contenido relacionado

Similar a DutchSemCor Building Semantically Annotated Corpus Dutch

Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1
Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1
Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1Muhammad Chaudry
 
This is a iot pdf slayyus this help to students make new slybs oke
This is a iot pdf slayyus this help to students make new slybs okeThis is a iot pdf slayyus this help to students make new slybs oke
This is a iot pdf slayyus this help to students make new slybs okeakkujain2003
 
The case for yet another digital preservation evaluation tool
The case for yet another digital preservation evaluation toolThe case for yet another digital preservation evaluation tool
The case for yet another digital preservation evaluation toolbertwerk
 
Eunis Workshop Hoel 2008 06
Eunis Workshop Hoel 2008 06Eunis Workshop Hoel 2008 06
Eunis Workshop Hoel 2008 06Tore Hoel
 
Massive Open Online Course on Open Data. The TODO Online Training Programme
Massive Open Online Course on Open Data. The TODO Online Training ProgrammeMassive Open Online Course on Open Data. The TODO Online Training Programme
Massive Open Online Course on Open Data. The TODO Online Training Programmesamossummit
 
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...eMadrid network
 
Introduction to the ASPECT project
Introduction to the ASPECT projectIntroduction to the ASPECT project
Introduction to the ASPECT projectDavid Massart
 
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...Nawanan Theera-Ampornpunt
 
Compliance driven process development with DCR graphs
Compliance driven process development with DCR graphsCompliance driven process development with DCR graphs
Compliance driven process development with DCR graphsHugo Andrés López
 
FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...
FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...
FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...Joseph Alaimo Jr
 
What are other universities doing to support RDM?
What are other universities doing to support RDM?What are other universities doing to support RDM?
What are other universities doing to support RDM?Sarah Jones
 
European Green IT Webinar 2014 - Kaliterre (France)
European Green IT Webinar 2014 - Kaliterre (France)European Green IT Webinar 2014 - Kaliterre (France)
European Green IT Webinar 2014 - Kaliterre (France)GreenLabCenter
 
[DOLAP2023] The Whys and Wherefores of Cubes
[DOLAP2023] The Whys and Wherefores of Cubes[DOLAP2023] The Whys and Wherefores of Cubes
[DOLAP2023] The Whys and Wherefores of CubesUniversity of Bologna
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...KozoChikai
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information RetrievalShadi Saleh
 
Learning Design research topics
Learning Design research topicsLearning Design research topics
Learning Design research topicsJuan Manuel Dodero
 
111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx
111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx
111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docxSusanaFurman449
 
Jisc learning analytics overview Oct2017
Jisc learning analytics overview Oct2017Jisc learning analytics overview Oct2017
Jisc learning analytics overview Oct2017Paul Bailey
 

Similar a DutchSemCor Building Semantically Annotated Corpus Dutch (20)

Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1
Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1
Orientation slides : M1 CCS (Cloud Computing and Services) : Univ de Rennes 1
 
This is a iot pdf slayyus this help to students make new slybs oke
This is a iot pdf slayyus this help to students make new slybs okeThis is a iot pdf slayyus this help to students make new slybs oke
This is a iot pdf slayyus this help to students make new slybs oke
 
The case for yet another digital preservation evaluation tool
The case for yet another digital preservation evaluation toolThe case for yet another digital preservation evaluation tool
The case for yet another digital preservation evaluation tool
 
Eunis Workshop Hoel 2008 06
Eunis Workshop Hoel 2008 06Eunis Workshop Hoel 2008 06
Eunis Workshop Hoel 2008 06
 
Massive Open Online Course on Open Data. The TODO Online Training Programme
Massive Open Online Course on Open Data. The TODO Online Training ProgrammeMassive Open Online Course on Open Data. The TODO Online Training Programme
Massive Open Online Course on Open Data. The TODO Online Training Programme
 
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
«Ejemplos de herramientas que nos facilitan las analíticas de aprendizaje en ...
 
Introduction to the ASPECT project
Introduction to the ASPECT projectIntroduction to the ASPECT project
Introduction to the ASPECT project
 
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
HL7 & HL7 CDA: The Implementation of Thailand's Healthcare Messaging Exchange...
 
Compliance driven process development with DCR graphs
Compliance driven process development with DCR graphsCompliance driven process development with DCR graphs
Compliance driven process development with DCR graphs
 
FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...
FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...
FDMEE Scripting - Cloud and On-Premises - It Ain't Groovy, But It's My Bread ...
 
What are other universities doing to support RDM?
What are other universities doing to support RDM?What are other universities doing to support RDM?
What are other universities doing to support RDM?
 
European Green IT Webinar 2014 - Kaliterre (France)
European Green IT Webinar 2014 - Kaliterre (France)European Green IT Webinar 2014 - Kaliterre (France)
European Green IT Webinar 2014 - Kaliterre (France)
 
[DOLAP2023] The Whys and Wherefores of Cubes
[DOLAP2023] The Whys and Wherefores of Cubes[DOLAP2023] The Whys and Wherefores of Cubes
[DOLAP2023] The Whys and Wherefores of Cubes
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
 
ASL®2 - Application Services Library - Foundation
ASL®2 - Application Services Library - FoundationASL®2 - Application Services Library - Foundation
ASL®2 - Application Services Library - Foundation
 
Icsm12.ppt
Icsm12.pptIcsm12.ppt
Icsm12.ppt
 
Cross-lingual Information Retrieval
Cross-lingual Information RetrievalCross-lingual Information Retrieval
Cross-lingual Information Retrieval
 
Learning Design research topics
Learning Design research topicsLearning Design research topics
Learning Design research topics
 
111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx
111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx
111722, 1220 AM Assignment Paper C - IFSM 304 7383 Ethics i.docx
 
Jisc learning analytics overview Oct2017
Jisc learning analytics overview Oct2017Jisc learning analytics overview Oct2017
Jisc learning analytics overview Oct2017
 

Más de Rubén Izquierdo Beviá

ULM-1 Understanding Languages by Machines: The borders of Ambiguity
ULM-1 Understanding Languages by Machines: The borders of AmbiguityULM-1 Understanding Languages by Machines: The borders of Ambiguity
ULM-1 Understanding Languages by Machines: The borders of AmbiguityRubén Izquierdo Beviá
 
DutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systemsDutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systemsRubén Izquierdo Beviá
 
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRubén Izquierdo Beviá
 
Topic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpusTopic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpusRubén Izquierdo Beviá
 
Error analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationError analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationRubén Izquierdo Beviá
 
KafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF filesKafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF filesRubén Izquierdo Beviá
 
CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)Rubén Izquierdo Beviá
 
CLTL python course: Object Oriented Programming (2/3)
CLTL python course: Object Oriented Programming (2/3)CLTL python course: Object Oriented Programming (2/3)
CLTL python course: Object Oriented Programming (2/3)Rubén Izquierdo Beviá
 
CLTL python course: Object Oriented Programming (1/3)
CLTL python course: Object Oriented Programming (1/3)CLTL python course: Object Oriented Programming (1/3)
CLTL python course: Object Oriented Programming (1/3)Rubén Izquierdo Beviá
 
Thesis presentation (WSD and Semantic Classes)
Thesis presentation (WSD and Semantic Classes)Thesis presentation (WSD and Semantic Classes)
Thesis presentation (WSD and Semantic Classes)Rubén Izquierdo Beviá
 
CLTL: Description of web services and sofware. Nijmegen 2013
CLTL: Description of web services and sofware. Nijmegen 2013CLTL: Description of web services and sofware. Nijmegen 2013
CLTL: Description of web services and sofware. Nijmegen 2013Rubén Izquierdo Beviá
 
CLTL presentation: training an opinion mining system from KAF files using CRF
CLTL presentation: training an opinion mining system from KAF files using CRFCLTL presentation: training an opinion mining system from KAF files using CRF
CLTL presentation: training an opinion mining system from KAF files using CRFRubén Izquierdo Beviá
 
RANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpusRANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpusRubén Izquierdo Beviá
 

Más de Rubén Izquierdo Beviá (17)

ULM-1 Understanding Languages by Machines: The borders of Ambiguity
ULM-1 Understanding Languages by Machines: The borders of AmbiguityULM-1 Understanding Languages by Machines: The borders of Ambiguity
ULM-1 Understanding Languages by Machines: The borders of Ambiguity
 
DutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systemsDutchSemCor workshop: Domain classification and WSD systems
DutchSemCor workshop: Domain classification and WSD systems
 
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged CorpusRANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
RANLP2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
 
Topic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpusTopic modeling and WSD on the Ancora corpus
Topic modeling and WSD on the Ancora corpus
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Error analysis of Word Sense Disambiguation
Error analysis of Word Sense DisambiguationError analysis of Word Sense Disambiguation
Error analysis of Word Sense Disambiguation
 
Juan Calvino y el Calvinismo
Juan Calvino y el CalvinismoJuan Calvino y el Calvinismo
Juan Calvino y el Calvinismo
 
KafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF filesKafNafParserPy: a python library for parsing/creating KAF and NAF files
KafNafParserPy: a python library for parsing/creating KAF and NAF files
 
CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)CLTL python course: Object Oriented Programming (3/3)
CLTL python course: Object Oriented Programming (3/3)
 
CLTL python course: Object Oriented Programming (2/3)
CLTL python course: Object Oriented Programming (2/3)CLTL python course: Object Oriented Programming (2/3)
CLTL python course: Object Oriented Programming (2/3)
 
CLTL python course: Object Oriented Programming (1/3)
CLTL python course: Object Oriented Programming (1/3)CLTL python course: Object Oriented Programming (1/3)
CLTL python course: Object Oriented Programming (1/3)
 
CLTL Software and Web Services
CLTL Software and Web Services CLTL Software and Web Services
CLTL Software and Web Services
 
Thesis presentation (WSD and Semantic Classes)
Thesis presentation (WSD and Semantic Classes)Thesis presentation (WSD and Semantic Classes)
Thesis presentation (WSD and Semantic Classes)
 
ULM1 - The borders of Ambiguity
ULM1 - The borders of AmbiguityULM1 - The borders of Ambiguity
ULM1 - The borders of Ambiguity
 
CLTL: Description of web services and sofware. Nijmegen 2013
CLTL: Description of web services and sofware. Nijmegen 2013CLTL: Description of web services and sofware. Nijmegen 2013
CLTL: Description of web services and sofware. Nijmegen 2013
 
CLTL presentation: training an opinion mining system from KAF files using CRF
CLTL presentation: training an opinion mining system from KAF files using CRFCLTL presentation: training an opinion mining system from KAF files using CRF
CLTL presentation: training an opinion mining system from KAF files using CRF
 
RANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpusRANLP 2013: DutchSemcor in quest of the ideal corpus
RANLP 2013: DutchSemcor in quest of the ideal corpus
 

Último

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 

Último (20)

Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 

DutchSemCor Building Semantically Annotated Corpus Dutch

  • 1. DutchSemCor Building a semantically annotated corpus for Dutch Piek Vossen, Attila Görög, VU University Amsterdam Fons Laan, ISLA, University of Amsterdam Rubén Izquierdo, Tilburg University Antal van den Bosch, Maarten van Gompel, Radboud University Nijmegen 1CLIN 22,Tilburg University, 20/01/2012
  • 2. 2 Overview  Project goals and planning  Current progress  Word-sense-disambiguation results  Active learning phase CLIN 22,Tilburg University, 20/01/2012
  • 3. 3 Goals and planning  Funded by NWO, 2009-2012  Create a large semantically tagged corpus for Dutch: − Sense-tags from the Cornetto database (includes Dutch wordnet) − Domain labels from Wordnet Domains − Named entities mapped to Wikipedia CLIN 22,Tilburg University, 20/01/2012
  • 4. 4 Global procedure  Phase-1: − 25 examples per meaning for 3,000 most polysemous and frequent nouns, verbs and adjectives (average nr. of meanings = 3) − Annotated by two student assistents − Minimal IAA 80%  Phase-2: − Word-sense-disambiguation (WSD) systems trained with the data of phase-1 − Active learning: add examples for low performing words and meanings untill we reach accuracy of 80% or no progress  Phase-3: − Apply WSD to rest of the full corpus CLIN 22,Tilburg University, 20/01/2012
  • 5. 5 Corpora  SoNaR: 500M tokens written Dutch  CGN: 1M tokens spoken Dutch  Web snippets mediated through WebCorp.co.uk ( http://www.webcorp.org.uk/) − In case no or insufficient examples are found for particular senses in SoNaR and CGN − Students select snippets (target word and context) which are added to the corpus in the SoNaR annotation format CLIN 22,Tilburg University, 20/01/2012
  • 6. Annotation tool
  • 7. Current results Phase-1
      − PoS: nouns, verbs and adjectives
      − Number of annotated lemmas: 2,870
      − Number of word senses: 11,982
      − Number of overlapping annotations: 282,503 (67% SoNaR, 5% CGN, 28% snippets)
      − Inter-annotator agreement: 92%
      − Coverage of senses with 25 examples: 70%
      − Coverage of annotations for words: 79%
  • 8. WSD Systems
      − UKB: knowledge-based WSD system that employs semantic relations
      − Tilburg WSD: supervised, machine-learning-based WSD system
  • 9. UKB. Description
      − Knowledge-based (Agirre and Soroa, 2009)
      − WordNet is treated as a graph
          − Senses -> nodes
          − Relations -> edges
      − Personalized PageRank algorithm
          − Modification of traditional PageRank
          − Context words act as source nodes injecting mass into word senses
          − Assigns stronger probabilities to certain nodes
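The idea on this slide can be illustrated with a small power-iteration sketch. It is not UKB's implementation: the graph is treated as undirected, and the toy edges, the sense labels (`bank#financial`, `bank#furniture`, and so on) and the damping value of 0.85 are all assumptions made up for the example. Context words are the only nodes that receive teleport mass, so senses close to them end up with a higher rank.

```python
def personalized_pagerank(edges, sources, damping=0.85, iterations=50):
    """Power iteration for PageRank with a personalised teleport vector.

    edges   : iterable of undirected (node, node) pairs
    sources : set of nodes that receive the teleport mass (the context words)
    """
    # Build an undirected adjacency list.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    nodes = sorted(neighbours)
    # Teleport vector: uniform over the source nodes, zero elsewhere.
    teleport = {n: (1.0 / len(sources) if n in sources else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Mass flowing in from each neighbour, split evenly over its edges.
            incoming = sum(rank[m] / len(neighbours[m]) for m in neighbours[n])
            new_rank[n] = (1 - damping) * teleport[n] + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy graph: two context words and two senses of Dutch "bank".
edges = [
    ("geld", "geld#1"),
    ("rekening", "rekening#1"),
    ("geld#1", "bank#financial"),
    ("rekening#1", "bank#financial"),
    ("bank#furniture", "meubel#1"),
    ("meubel#1", "ding#1"),
    ("ding#1", "geld#1"),
]
ranks = personalized_pagerank(edges, sources={"geld", "rekening"})
best_sense = max(["bank#financial", "bank#furniture"], key=ranks.get)
print(best_sense)
```

With the context words "geld" and "rekening" as sources, the financial sense sits one step away from their senses and collects more mass than the furniture sense.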
  • 10. UKB. Semantic relations
      − Dutch WordNet
      − English WordNet
      − Dutch WordNet ==> English WordNet
      − WordNet Domains
          − tennis player, tennis ball => tennis => SPORT
          − football player, football => soccer => SPORT
      − Annotation co-occurrence relations
          − Polysemous => monosemous
          − Polysemous => polysemous
  • 11. UKB. Graph relations
      Relation                                    Number
      Dutch synset – Dutch synset                140,219
      Domain – Domain                                125
      Dutch synset – Domain                       86,798
      Dutch synset – English synset               73,935
      English synset – English synset            252,392
      English synset – English gloss synset      419,387
      Annotation co-occurrences, polysemous       17,152
      Annotation co-occurrences, monosemous      151,598
      TOTAL                                    1,266,481
      Configurations: UKB-1, UKB-2, UKB-3; AC = annotation co-occurrences; UKB-4 = UKB-1 + AC; UKB-5 = UKB-3 + AC
  • 12. UKB. Evaluation
      System   Precision   Recall   F-measure
      UKB-1    0.4557      0.4491   0.4523
      UKB-2    0.4557      0.4491   0.4524
      UKB-3    0.4560      0.4493   0.4526
      UKB-4    0.6360      0.6272   0.6316
      UKB-5    0.6411      0.6322   0.6366
      For comparison, SemEval-2010 task on WSD in a specific domain, all-words task:
      − UKB3: 52.6 precision
      − English UKB: 48.1 precision
      − UKB5 and UKB4 gained 9 points on UKB3 due to co-occurrence relations
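The F-measure values reported for UKB are consistent with the balanced harmonic mean of precision and recall. The short check below assumes that definition (the slide does not spell it out) and recomputes three rows from their precision and recall figures.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the UKB evaluation slide.
rows = {
    "UKB-3": (0.4560, 0.4493),
    "UKB-4": (0.6360, 0.6272),
    "UKB-5": (0.6411, 0.6322),
}
scores = {name: round(f_measure(p, r), 4) for name, (p, r) in rows.items()}
print(scores)
```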
  • 13. Tilburg WSD System
      − Based on TiMBL, a k-nearest-neighbour classifier (Daelemans et al., 2007)
      − Features:
          − Local context (words in a window around the target)
          − Global context (binary bag of words)
          − SoNaR category (domain label)
      − Parameter search:
          − Using TiMBL's leave-one-out feature
      − Evaluation:
          − 10 examples per sense for testing
          − >= 15 examples per sense for training
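To make the feature set concrete, here is a minimal sketch of a k-nearest-neighbour sense classifier over local-context and binary bag-of-words features. It does not use TiMBL: the similarity function is a plain unweighted overlap count, and the training sentences, the sense labels and the value k = 3 are hypothetical. The SoNaR category feature is left out for brevity.

```python
from collections import Counter


def extract_features(tokens, target_index, window=1):
    """Local context (words in a window around the target) plus a binary bag of words."""
    local = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        # Pad with "_" when the window runs past the sentence boundary.
        local.append(tokens[i] if 0 <= i < len(tokens) else "_")
    bag = frozenset(t for i, t in enumerate(tokens) if i != target_index)
    return tuple(local), bag


def similarity(a, b):
    """Number of matching local-context slots plus number of shared bag-of-words items."""
    (local_a, bag_a), (local_b, bag_b) = a, b
    matches = sum(1 for x, y in zip(local_a, local_b) if x == y)
    return matches + len(bag_a & bag_b)


def knn_classify(train, features, k=3):
    """Majority sense among the k most similar training instances."""
    ranked = sorted(train, key=lambda ex: similarity(ex[0], features), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


def instance(sentence, target="bank", window=1):
    tokens = sentence.split()
    return extract_features(tokens, tokens.index(target), window)


# Hypothetical annotated examples for two senses of Dutch "bank".
train = [
    (instance("ik stort geld op de bank"), "financial"),
    (instance("de bank verhoogt de rente"), "financial"),
    (instance("de bank geeft een lening"), "financial"),
    (instance("zij zit op de bank"), "furniture"),
    (instance("de kat slaapt op de bank"), "furniture"),
    (instance("een nieuwe bank voor de woonkamer"), "furniture"),
]

sense_a = knn_classify(train, instance("de bank verhoogt de rente op de lening"))
sense_b = knn_classify(train, instance("de hond slaapt op de bank"))
print(sense_a, sense_b)
```

The window size of 1 follows the slides' remark that a context window of 1 worked best in earlier experiments.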
  • 14. Tilburg WSD System. First results
      Feature set                     Token accuracy
      Words1                          0.6462
      Words1 + Bag-of-words           0.7259
      Words1 + PoS1 + Bag-of-words    0.7226
      Words1 + Bag-of-words + PS      0.7931
      − Bag-of-words gives an improvement of 8 percentage points
      − Parameter search (PS) gives an improvement of another 7 percentage points
      − Previous experiments suggest that the best size for the context window is 1
  • 15. Tilburg WSD System. TiMBL Confidence
      − Filtering at a TiMBL confidence of 0.55:
          − Precision 0.84 (+0.44 compared to no filtering)
          − F-score 0.78 (only 0.03 lower than with no filtering)
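The trade-off behind confidence filtering can be sketched as follows. The example assumes that tokens below the threshold are left unanswered and count against recall; the slide does not state how recall was computed, and the prediction triples are invented. Only the threshold of 0.55 is taken from the slide.

```python
def filter_by_confidence(predictions, threshold):
    """Score only predictions whose confidence reaches the threshold.

    predictions: list of (predicted_sense, gold_sense, confidence) triples
    Returns (precision, recall, f_score); filtered-out tokens count against recall.
    """
    kept = [(pred, gold) for pred, gold, conf in predictions if conf >= threshold]
    correct = sum(1 for pred, gold in kept if pred == gold)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical classifier output: (predicted, gold, confidence).
predictions = [
    ("a", "a", 0.9),
    ("a", "a", 0.8),
    ("b", "b", 0.7),
    ("b", "b", 0.6),
    ("a", "b", 0.5),
    ("b", "a", 0.4),
]
unfiltered = filter_by_confidence(predictions, threshold=0.0)
filtered = filter_by_confidence(predictions, threshold=0.55)
print(unfiltered, filtered)
```

In this toy data the two wrong predictions are also the least confident ones, so filtering raises precision without removing any correct answers; on real data some correct low-confidence predictions are lost as well.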
  • 16. Active Learning
      1. Obtain annotated data
      2. Train and evaluate the system
      3. Select words with accuracy < 80%
      4. Apply WSD to all not-yet-annotated tokens of the selected words
      5. Select tokens of meanings with F-score < 80%
  • 17. Active Learning
      6. For each word meaning, rank all tokens according to the combination (F-score) of:
          1) TiMBL confidence
          2) Distance to the nearest neighbour
      7. Select the 50 top-ranked tokens per meaning, to be manually reviewed in 2 weeks
      8. Go to step 1
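The ranking and selection steps of the loop could look roughly like the sketch below. Several details are assumptions, since the slides do not give them: the "combination (F-score)" is read as the harmonic mean of the two scores, the distance is assumed to be normalised to the range 0 to 1 and is converted to a closeness value, and tokens are ranked from highest to lowest combined score. The field names and candidate values are hypothetical.

```python
from collections import defaultdict


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0


def select_for_review(candidates, per_meaning=50):
    """Rank candidate tokens per predicted meaning and keep the top ones.

    candidates: list of dicts with keys 'token_id', 'meaning',
                'confidence' (0..1) and 'distance' (0..1, smaller is closer).
    """
    by_meaning = defaultdict(list)
    for c in candidates:
        # Convert distance to closeness so that higher is better for both scores.
        score = harmonic_mean(c["confidence"], 1.0 - c["distance"])
        by_meaning[c["meaning"]].append((score, c["token_id"]))

    selection = {}
    for meaning, scored in by_meaning.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selection[meaning] = [token_id for _, token_id in scored[:per_meaning]]
    return selection


# Hypothetical WSD output for not-yet-annotated tokens.
candidates = [
    {"token_id": "t1", "meaning": "m1", "confidence": 0.9, "distance": 0.1},
    {"token_id": "t2", "meaning": "m1", "confidence": 0.6, "distance": 0.5},
    {"token_id": "t3", "meaning": "m1", "confidence": 0.8, "distance": 0.3},
    {"token_id": "t4", "meaning": "m2", "confidence": 0.4, "distance": 0.2},
]
selection = select_for_review(candidates, per_meaning=2)
print(selection)
```

The default of 50 tokens per meaning matches the number given on the slide; the example uses 2 only to keep the toy output short.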
  • 18. Future Work
      − Fine-tune the active learning
      − Optimize the WSD systems
      − Combine different WSD systems
      − Test on independent texts in an all-words task
      − Apply the optimal system to the full corpora (over 500M tokens)
  • 19. Thanks to
      − Anneleen Schoen
      − Charlotte van Tongeren
      − Daphne van Kessel
      − Dieke Janssen
      − Elizabeth van Zutphen
      − Gratia Bruining
      − Jonica Kaagman
      − Laura Kipp
      − Lisanne Ranzijn
      − Marlisa Hommel
      − Wilma van Velzen
      − Milou Kerkhof
      − Sam Vossen
      − Niqee Vossen
      − Rosa Scheffer
      − Chantal van Son