Resources for linguistically motivated
                  Multilingual Anaphora Resolution

                                           Kepa Joseba Rodríguez


                                         Advisor: Massimo Poesio
                                            18 January 2011



Outline

           1    Motivation of the research
           2    Contributions of this dissertation
           3    Limitations of previous annotation schemes
           4    Annotation scheme proposal
           5    Annotated data
           6    Usability of the data for anaphora resolution
           7    Use of the data
           8    Conclusions


Motivation
                Linguistic research: cross-linguistic studies of
                anaphora (Poesio et al., 2004)
                Applications: summarization (Steinberger et al., 2007)
                Applications: machine translation
                    1    German: Peter hat Maria seine Blumen zum Gießen
                         gegeben. Sie hat sie vertrocknen lassen.
                    2    English (Babelfish): Peter gave Maria his flowers for
                         pouring. Then it left it to dry.
                    3    English (Google translate): Peter gave Mary flowers
                         to his casting. Then she let them dry up.
                    4    English (intended): Peter gave Maria his flowers to
                         water. Then she let them dry out.

Contributions


                Development of a linguistically motivated annotation
                scheme for anaphoric relations.
                Implementation of the scheme for manual annotation of
                English and Italian data.
                Creation of annotated data for English and Italian.
                Use of the corpora for feature extraction and development
                of anaphora resolution systems in English and Italian.
                Participation of the systems in SemEval 2010.



Limitations of previous schemes (1)



                Coverage of the annotation.
                Annotation of reference.
                Identification and annotation of discontinuity of semantic
                material.
                Problem of multiple interpretations: ambiguity.




Limitations of previous schemes (2)


       Coverage of the annotation:
           Annotated relations: only identity.
           ACE-like annotation schemes restrict the annotation to
           noun phrases belonging to a fixed list of semantic types.
           Genres: most annotation schemes cover only
           a few genres.




Limitations of previous schemes (3)
       Annotation of reference
           Expletives: they are not considered.
           There are two people waiting for the interview.
           Predication:
                         MUC, ACE: No distinction between predication and
                         identity relation.
                         OntoNotes: no semantic criteria to decide which noun
                         phrase is referring and which is a predicate.
                                 [The president of the bank] is [John Smith].
                                 [John Smith] is [the president of the bank].
                Coordination: coordinated items are considered referring
                expressions in corpora like MUC or OntoNotes.
                         [Milosevic or anyone else]
                Nominals and proper names in premodifier position.
Limitations of previous schemes (4)


       Identification of discontinuous semantic material.
            Bill and Hillary Clinton
            black cars and bikes


       Multiple interpretations are not captured
           [The house] is on [a long street]. [It] is very dirty.




Annotation scheme

                Annotation of all noun phrases
                Distinction between referring and non-referring
                expressions
                Annotation of clitics attached to the verb and empty
                pronouns
                Introduction of ambiguity
                Introduction of discontinuous markables
                Annotation of different kinds of relations: identity,
                discourse deixis and bridging.


Reference


                Markables are classified as referring or non-referring
                Non-referring markables are annotated with the type of
                non-referring expression
                Referring markables are annotated with:
                         Information status: New or old.
                         Semantic type




Reference
       Types of non-referring expressions
           Expletives
                         [There] are two people waiting for the interview
                         The new car is [there]
                Predicate: semantic criteria to distinguish predicate and
                referring expression.
                         [Il presidente della Repubblica, [Giorgio Napolitano]]
                         [The president of the bank] is [John Smith].
                         [John Smith] is [the president of the bank].
                Quantifiers:
                         [All of [the box cars]]
                Coordination.
                Idiomatic expressions
                         by [the nape of [the neck]]
Semantic types
           1    Person
           2    Animate
           3    Organization
           4    Facility
           5    Geopolitical entity (GPE)
           6    Location
           7    Temporal
           8    Numerical
           9    Concrete
           10   Abstract
           11   Event
           12   Other
           13   Unknown
Annotation of ambiguity

                A markable does not always have a unique interpretation.
                    1 Be careful hooking up [the engine] to [the boxcar]
                      because [it] is faulty.
                    2 [The house] is on [a long street]. [It] is very dirty.
                In case of ambiguity, we tag the markable as ambiguous
                and we annotate the possible interpretations.
                Other possible ambiguities are:
                         Information status: between new and old.
                         Reference: between old and non-referring.



List of annotated features
                Agreement features
                         Gender
                         Number
                         Person
                Grammatical function
                Reference and information status
                Semantic type
                Type of non-referring expression
                Link to antecedent
                Ambiguity
                Bridging
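
The feature list above can be read as a record attached to each markable. The following is a minimal sketch of such a record, not the storage format actually used for ARRAU or LMC. All class and field names are hypothetical, token offsets are assumed to be end-exclusive, and ambiguity is simplified to "more than one candidate antecedent".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class SemanticType(Enum):
    # The thirteen semantic types listed on the "Semantic types" slide.
    PERSON = "person"
    ANIMATE = "animate"
    ORGANIZATION = "organization"
    FACILITY = "facility"
    GPE = "gpe"
    LOCATION = "location"
    TEMPORAL = "temporal"
    NUMERICAL = "numerical"
    CONCRETE = "concrete"
    ABSTRACT = "abstract"
    EVENT = "event"
    OTHER = "other"
    UNKNOWN = "unknown"


class NonReferringType(Enum):
    # Types of non-referring expressions named in the slides.
    EXPLETIVE = "expletive"
    PREDICATE = "predicate"
    QUANTIFIER = "quantifier"
    COORDINATION = "coordination"
    IDIOM = "idiom"


@dataclass
class Markable:
    """Hypothetical record for one annotated noun phrase."""
    markable_id: str
    spans: List[Tuple[int, int]]            # (start, end) token offsets
    gender: Optional[str] = None            # agreement features
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    info_status: Optional[str] = None       # "new" or "old"
    semantic_type: Optional[SemanticType] = None
    non_referring_type: Optional[NonReferringType] = None
    antecedents: List[str] = field(default_factory=list)
    bridging_anchor: Optional[str] = None

    @property
    def is_referring(self) -> bool:
        return self.non_referring_type is None

    @property
    def is_discontinuous(self) -> bool:
        # More than one span means the semantic material is split.
        return len(self.spans) > 1

    @property
    def is_ambiguous(self) -> bool:
        # Simplification: ambiguous = several candidate antecedents.
        return len(self.antecedents) > 1


# "[The house] is on [a long street]. [It] is very dirty."
house = Markable("m1", [(0, 2)], info_status="new")
street = Markable("m2", [(4, 7)], info_status="new")
it = Markable("m3", [(8, 9)], info_status="old", antecedents=["m1", "m2"])

# "[There] are two people waiting for the interview" (separate sentence)
there = Markable("m4", [(0, 1)], non_referring_type=NonReferringType.EXPLETIVE)

# "Bill and Hillary Clinton": the markable "Bill ... Clinton" has two spans
bill = Markable("m5", [(0, 1), (3, 4)])
```

Keeping the spans as a list means continuous and discontinuous markables share one representation, and an ambiguous pronoun such as "It" simply carries both candidate antecedents.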
Description of the annotated data


                ARRAU (English)
                         Wall Street Journal texts
                         Trains dialogues
                         Gnome corpus
                         Pear stories
                Live Memories Corpus for Italian (LMC)
                         Wikipedia sites
                         Blog sites
                         VENEX dataset



Description: English corpus
       WSJ dataset
            205 files
            147,600 words in 5,585 sentences. 47,900 markables.
            1% discontinuous markables, 12.6% non-referring.
       Trains dialogues
            35 files
            26,000 words in 4,600 sentences. 5,200 markables.
       GNOME corpus
            5 files
            21,600 words in 1,000 sentences. 6,100 markables.
       PEAR stories
            20 files
            14,000 words in 2,000 sentences. 3,900 markables.
Description: Italian corpus
       Wikipedia dataset:
           144 files.
           140,000 words in 4,700 sentences. 44,500 markables.
           0.5% discontinuous markables, 0.5% clitics attached to
           the verb, 4.5% empty subjects, 13.7% non-referring.
       Blogs dataset:
           75 files.
           53,000 words in 2,230 sentences.
           16,000 markables.
       VENEX corpus:
           30 files.
           20,300 words in 720 sentences.
           6,220 markables.
Reliability of the annotation – ARRAU



                Previous study of the annotation of anaphoric links
                published by Poesio and Artstein (2008)
                Metric: Krippendorff's α
                α = 0.6-0.7
                The agreement values reflect the complexity of the task.




Reliability of the annotation – LMC

                Metric: Siegel and Castellan's κ
                Information status and reference: old, new and
                non-referring
                         κ = 0.80
                Basic annotation of the markable: new, phrase
                antecedent, segment antecedent, predicate, quantifier,
                expletive, coordination and idiom.
                         κ = 0.79
                         Main disagreement: between discourse-new and predicate
                Semantic type
                         κ = 0.85
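
For readers unfamiliar with the coefficient, the sketch below shows how a κ of this family can be computed for two annotators. It assumes the two-coder case, where chance agreement is estimated from the category distribution pooled over both annotators (as in Scott's π). The function name and the toy labels are illustrative only and are not taken from the LMC annotation study.

```python
from collections import Counter


def pooled_kappa(labels_a, labels_b):
    """Chance-corrected agreement for two coders, with expected agreement
    computed from the label distribution pooled over both coders."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of items on which the coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: sum of squared pooled category proportions.
    pooled = Counter(labels_a) + Counter(labels_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())

    if expected == 1.0:
        # Both coders used a single category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example with the information-status categories from the slide.
coder_1 = ["old", "old", "new", "new"]
coder_2 = ["old", "old", "new", "old"]
kappa = pooled_kappa(coder_1, coder_2)  # (0.75 - 0.53125) / (1 - 0.53125)
```

In the toy example the coders agree on 3 of 4 items, but because "old" dominates the pooled distribution, the expected agreement is already 0.53, so the coefficient is far lower than the raw agreement.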

Reliability of the annotation – LMC


                Link to the antecedent
                         κ = 0.88
                Antecedent of clitics
                         κ = 0.84
                Antecedent of empty pronouns
                         κ = 0.93




Use of the corpus for anaphora resolution (1)


                Baseline proposed by Soon et al. (2001)
                Classifier: MaxEnt
                English data: ACE02, MUC-7 and ARRAU
                Italian data: ICAB and LMC
                Evaluation metrics:
                         MUC (Vilain et al. 1995)
                         CEAF (Luo, 2005)
                         Link based evaluation
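
The slides do not say how training instances for the classifier were built. The sketch below assumes the instance-creation scheme usually attributed to Soon et al. (2001): each anaphor is paired positively with its closest preceding antecedent and negatively with every markable in between. Function and variable names are hypothetical, and the feature extraction and MaxEnt training steps are left out.

```python
def soon_training_pairs(mentions, chain_of):
    """Build (candidate, anaphor, label) training pairs.

    mentions: markable ids in document order.
    chain_of: dict mapping a markable id to its coreference chain id,
              or to None for markables that corefer with nothing.
    """
    pairs = []
    for j, anaphor in enumerate(mentions):
        chain = chain_of.get(anaphor)
        if chain is None:
            continue
        # Walk backwards to the closest preceding mention of the same chain.
        for i in range(j - 1, -1, -1):
            if chain_of.get(mentions[i]) == chain:
                pairs.append((mentions[i], anaphor, True))
                # Every markable between antecedent and anaphor is a negative.
                for k in range(i + 1, j):
                    pairs.append((mentions[k], anaphor, False))
                break
    return pairs


# "Peter gave Maria his flowers to water. Then she let them dry out."
mentions = ["Peter", "Maria", "his flowers", "she", "them"]
chain_of = {
    "Peter": None,
    "Maria": "c1",
    "his flowers": "c2",
    "she": "c1",
    "them": "c2",
}
pairs = soon_training_pairs(mentions, chain_of)
```

For this sentence pair the scheme yields two positive pairs (Maria/she, his flowers/them) and two negative pairs (his flowers/she, she/them). This keeps the training set small compared with pairing every anaphor with all preceding markables.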



Use of the corpus for anaphora resolution (2)

       English corpora: ARRAU, ACE, MUC
                            ACE Carafe    MUC-7    ACE02    ARRAU
        MUC                      0.618    0.585    0.590    0.557
        CEAF-AGGR Φ-3            0.537    0.379    0.393    0.683
        CEAF-AGGR Φ-4            0.506    0.206    0.309    0.717
        Link-based               0.638    0.594    0.532    0.540
        Pronouns                 0.686    0.492    0.597    0.558
        Nominals                 0.355    0.455    0.239    0.352
        Names                    0.638    0.817    0.784    0.763
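
The MUC row above refers to the link-based measure of Vilain et al. (1995). As a reading aid, the sketch below implements that measure under the assumption that key and response are given as lists of mention sets and that each mention belongs to at most one chain per side. It is not the scorer used to produce the figures in the table, and the function names are hypothetical.

```python
def _muc_side(reference, other):
    """Sum of (|chain| - |partition|) over (|chain| - 1) for the reference
    chains, where each chain is partitioned by the chains in `other`."""
    chain_index = {m: idx for idx, chain in enumerate(other) for m in chain}
    numerator = 0
    denominator = 0
    for chain in reference:
        parts = set()
        unaligned = 0
        for mention in chain:
            if mention in chain_index:
                parts.add(chain_index[mention])
            else:
                # A mention missing from `other` forms its own part.
                unaligned += 1
        numerator += len(chain) - (len(parts) + unaligned)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc(key, response):
    """Return (precision, recall, F1) of the MUC coreference measure."""
    recall = _muc_side(key, response)
    precision = _muc_side(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The key has one chain of four mentions; the system split it in two.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
precision, recall, f1 = muc(key, response)
```

In the example, one link is missing to join the two response chains, so recall is 2/3, while every link the system proposed is correct, so precision is 1. Because the measure counts links only, singleton chains contribute nothing to it.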



Use of the corpus for anaphora resolution (3)

       Italian corpora: LMC, ICAB
                                 ICAB  LMC-Sys LMC-Gold
        MUC                     0.494    0.456    0.619
        CEAF-AGGR Φ-3           0.557    0.622    0.798
        CEAF-AGGR Φ-4           0.560    0.671    0.869
        Link-based              0.556    0.470    0.580
        Pronouns                0.452    0.520    0.521
        Nominals                0.421    0.303    0.522
        Names                   0.741    0.642    0.752



Use of the corpus for anaphora resolution (4)


                Use of C4 decision trees to compare the impact of
                individual features.
                The impact of the baseline features is similar for English
                and Italian with two exceptions:
                         The impact of gender matching is high in English, but
                         it has no effect in Italian.
                         The use of automatically computed aliases has a high
                         impact for Italian and a low impact for English.




Use of the data


                5th International Workshop on Semantic Evaluation
                (SemEval 2010), task: Coreference Resolution in
                Multiple Languages.
                Comparative research on zero anaphora in Italian and
                Japanese.
                Training and evaluation of content extraction models in
                the Live Memories project.




Conclusions


                Linguistically motivated annotation scheme applicable to
                English and Italian.
                Scheme used to annotate different genres: newspapers,
                encyclopedic text, dialogue, narrative and weblogs.
                Corpora are usable to build anaphora resolution models.
                Datasets have been used for international competitions
                and for linguistic research.





 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Último (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Resources for linguistically motivated Multilingual Anaphora Resolution

  • 1. Resources for linguistically motivated Multilingual Anaphora Resolution
        Kepa Joseba Rodríguez
        Advisor: Massimo Poesio
        18 January 2011
  • 2. Outline
        1 Motivation of the research
        2 Contributions of this dissertation
        3 Limitations of previous annotation schemes
        4 Annotation scheme proposal
        5 Annotated data
        6 Usability of the data for anaphora resolution
        7 Use of the data
        8 Conclusions
  • 3. Motivation
        Linguistic research: cross-linguistic studies about anaphora (Poesio et al. 2004)
        Applications: summarization (Steinberger et al. 2007)
        Applications: machine translation
          1 German: Peter hat Maria seine Blumen zum Gießen gegeben. Sie hat sie vertrocknen lassen.
          2 English (Babelfish): Peter gave Maria his flowers for pouring. Then it left it to dry.
          3 English (Google translate): Peter gave Mary flowers to his casting. Then she let them dry up.
          4 English (wanted): Peter gave Maria his flowers to water. Then she let them dry out.
  • 4. Contributions
        Development of a linguistically motivated annotation scheme for anaphoric relations.
        Implementation of the scheme for manual annotation of English and Italian data.
        Creation of annotated data for English and Italian.
        Use of the corpora for feature extraction and development of anaphora resolution systems in English and Italian.
        Participation of the systems in SemEval 2010.
  • 5. Limitations of previous schemes (1)
        Coverage of the annotation.
        Annotation of reference.
        Identification and annotation of discontinuity of semantic material.
        Problem of multiple interpretations: ambiguity.
  • 6. Limitations of previous schemes (2)
        Coverage of the annotation:
          Annotated relations: only identity.
          ACE-like annotation schemes constrain the annotation to noun phrases from a list of semantic types.
          Genres: most annotation schemes focus the annotation on a few genres.
  • 7. Limitations of previous schemes (3)
        Annotation of reference
          Expletives: they are not considered.
            There are two people waiting for the interview.
          Predication:
            MUC, ACE: no distinction between predication and identity relation.
            OntoNotes: no semantic criteria to decide which noun phrase is referring and which is a predicate.
            [The president of the bank] is [John Smith].
            [John Smith] is [the president of the bank].
          Coordination: coordinated items are considered referring expressions in corpora like MUC or OntoNotes.
            [Milosevic or anyone else]
          Nominals and proper names in premodifier position.
  • 8. Limitations of previous schemes (4)
        Identification of discontinuous semantic material.
          Bill and Hillary Clinton
          black cars and bikes
        Multiple interpretations are not captured.
          [The house] is on [a long street]. [It] is very dirty.
  • 9. Annotation scheme
        Annotation of all noun phrases
        Distinction between referring and non-referring expressions
        Annotation of clitics attached to the verb and empty pronouns
        Introduction of ambiguity
        Introduction of discontinuous markables
        Annotation of different kinds of relations: identity, discourse deixis and bridging.
  • 10. Reference
        Markables are classified as referring or non-referring.
        Non-referring markables are annotated with the type of non-referring expression.
        Referring markables are annotated with:
          Information status: new or old.
          Semantic type
  • 11. Reference
        Types of non-referring expressions
          Expletives
            [There] are two people waiting for the interview
            The new car is [there]
          Predicate: semantic criteria to distinguish predicate and referring expression.
            [Il presidente della Repubblica, [Giorgio Napolitano]]
            [The president of the bank] is [John Smith].
            [John Smith] is [the president of the bank].
          Quantifiers: [All of [the box cars]]
          Coordination.
          Idiomatic expressions: by [the nape of [the neck]]
  • 12. Semantic types
        1 Person
        2 Animate
        3 Organization
        4 Facility
        5 Geopolitical entity (GPE)
        6 Location
        7 Temporal
        8 Numerical
        9 Concrete
        10 Abstract
        11 Event
        12 Other
        13 Unknown
  • 13. Annotation of ambiguity
        A markable does not always have a unique interpretation.
          1 Be careful hooking up [the engine] to [the boxcar] because [it] is faulty.
          2 [The house] is on [a long street]. [It] is very dirty.
        In case of ambiguity, we tag the markable as ambiguous and annotate the possible interpretations.
        Other possible ambiguities are:
          Information status: between new and old.
          Between old and non-referring.
  • 14. List of annotated features
        Agreement features
          Gender
          Number
          Person
        Grammatical function
        Reference and information status
        Semantic type
        Type of non-referring
        Link to antecedent
        Ambiguity
        Bridging
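The feature list on slide 14 can be pictured as one record per markable. The following is only a minimal sketch: the class name, field names, value strings and the consistency rules are illustrative assumptions, not the storage format actually used for ARRAU or LMC.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Markable:
    """One annotated noun phrase (hypothetical layout, not the corpus format)."""
    markable_id: str
    text: str
    gender: Optional[str] = None                # agreement features
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    reference: str = "new"                      # "new", "old" or "non_referring"
    semantic_type: Optional[str] = None         # e.g. "person", "concrete"
    non_referring_type: Optional[str] = None    # e.g. "expletive", "predicate"
    antecedents: List[str] = field(default_factory=list)  # ids of candidate antecedents
    ambiguous: bool = False
    bridging: List[str] = field(default_factory=list)     # ids of bridging anchors

    def is_consistent(self) -> bool:
        """Simplified sanity rules (an assumption of this sketch).

        They ignore the case where a markable is ambiguous between
        'old' and 'non-referring'.
        """
        if self.reference == "non_referring":
            return self.non_referring_type is not None and not self.antecedents
        if self.reference == "old" and not self.antecedents:
            return False
        if len(self.antecedents) > 1 and not self.ambiguous:
            return False
        return self.non_referring_type is None


# "[The house] is on [a long street]. [It] is very dirty."
house = Markable("m1", "The house", reference="new", semantic_type="concrete")
street = Markable("m2", "a long street", reference="new", semantic_type="concrete")
it = Markable("m3", "It", reference="old", semantic_type="concrete",
              antecedents=["m1", "m2"], ambiguous=True)
```

Storing every candidate antecedent on the ambiguous pronoun, rather than forcing a single link, mirrors the idea of tagging the markable as ambiguous and recording each interpretation.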
  • 15. Description of the annotated data
        ARRAU (English)
          Wall Street Journal texts
          Trains dialogues
          Gnome corpus
          Pear stories
        Live Memories Corpus for Italian (LMC)
          Wikipedia sites
          Blog sites
          VENEX dataset
  • 16. Description: English corpus
        WSJ dataset: 205 files. 147,600 words in 5,585 sentences. 47,900 markables. 1% discontinuous markables, 12.6% non-referring.
        Trains dialogues: 35 files. 26,000 words in 4,600 sentences. 5,200 markables.
        GNOME corpus: 5 files. 21,600 words in 1,000 sentences. 6,100 markables.
        PEAR stories: 20 files. 14,000 words in 2,000 sentences. 3,900 markables.
  • 17. Description: Italian corpus
        Wikipedia dataset: 144 files. 140,000 words in 4,700 sentences. 44,500 markables. 0.5% discontinuous markables, 0.5% clitics attached to the verb, 4.5% empty subjects, 13.7% non-referring.
        Blogs dataset: 75 files. 53,000 words in 2,230 sentences. 16,000 markables.
        VENEX corpus: 30 files. 20,300 words in 720 sentences. 6,220 markables.
  • 18. Reliability of the annotation – ARRAU
        Previous study of the annotation of anaphoric links published by Poesio and Artstein (2008)
        Metric: Krippendorff's α
        α = 0.6-0.7
        Statistics reflect the complexity of the task.
  • 19. Reliability of the annotation – LMC
        Metric: Siegel and Castellan's κ
        Information status and reference (old, new and non-referring): κ = 0.80
        Basic annotation of the markable (new, phrase antecedent, segment antecedent, predicate, quantifier, expletive, coordination and idiom): κ = 0.79
          Main disagreement is between discourse new and predicate.
        Semantic type: κ = 0.85
  • 20. Reliability of the annotation – LMC
        Link to the antecedent: κ = 0.88
        Antecedent of clitics: κ = 0.84
        Antecedent of empty pronouns: κ = 0.93
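For readers unfamiliar with chance-corrected agreement, the sketch below shows how a multi-coder κ of the kind reported for LMC could be computed. It assumes the coefficient is the pooled-distribution variant usually equated with Fleiss's κ, and that every item is labelled by the same number of coders; the function name and the toy labels are hypothetical and do not come from the corpus.

```python
from collections import Counter
from typing import Hashable, Sequence


def multi_coder_kappa(labels: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss-style kappa; labels[i] holds the category each coder gave item i."""
    if not labels:
        raise ValueError("no items to compare")
    n_coders = len(labels[0])
    if n_coders < 2 or any(len(row) != n_coders for row in labels):
        raise ValueError("every item needs the same number (>= 2) of coders")

    totals = Counter()      # pooled category counts over all coders and items
    observed = 0.0
    for row in labels:
        counts = Counter(row)
        totals.update(counts)
        # proportion of coder pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_coders * (n_coders - 1))
    observed /= len(labels)

    n_judgements = len(labels) * n_coders
    expected = sum((c / n_judgements) ** 2 for c in totals.values())
    if expected == 1.0:
        return 1.0          # only one category was ever used
    return (observed - expected) / (1.0 - expected)


# Toy example: two coders, four markables (invented labels).
toy = [("old", "old"), ("new", "new"), ("old", "new"),
       ("non_referring", "non_referring")]
kappa = multi_coder_kappa(toy)
```

In the toy data the coders agree on three of four items, but part of that agreement is expected by chance, so the coefficient comes out well below 0.75.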
  • 21. Use of the corpus for anaphora resolution (1)
        Baseline proposed by Soon et al. (2001)
        Classifier: MaxEnt
        English data: ACE02, MUC-7 and ARRAU
        Italian data: ICAB and LMC
        Evaluation metrics:
          MUC (Vilain et al. 1995)
          CEAF (Luo, 2005)
          Link-based evaluation
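Of the metrics listed on slide 21, the MUC score is the simplest to illustrate. The sketch below follows the standard link-based definition: for each key chain, count how many links the response preserves. It assumes chains are disjoint sets of mention ids; the function names and the example are hypothetical, and this is not the scorer used for the reported results.

```python
from typing import List, Set, Tuple

Chain = Set[str]


def _link_score(reference: List[Chain], system: List[Chain]) -> float:
    """Share of the reference chains' links that survive in the system output."""
    found = possible = 0
    for chain in reference:
        if len(chain) < 2:
            continue  # singletons contribute no links
        overlapping = [chain & other for other in system if chain & other]
        covered = set().union(*overlapping)
        # mentions missing from the system output count as their own partition
        partitions = len(overlapping) + len(chain - covered)
        found += len(chain) - partitions
        possible += len(chain) - 1
    return found / possible if possible else 0.0


def muc_score(key: List[Chain], response: List[Chain]) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) under the MUC link-based definition."""
    recall = _link_score(key, response)
    precision = _link_score(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# One key chain split into two response chains: one of three links is lost.
recall, precision, f1 = muc_score(
    key=[{"A", "B", "C", "D"}],
    response=[{"A", "B"}, {"C", "D"}],
)
```

Because the metric only counts links, it gives no credit for correctly identified singletons, which is one reason entity-based metrics such as CEAF are reported alongside it.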
  • 22. Use of the corpus for anaphora resolution (2)
        English corpora: ARRAU, ACE, MUC

                         ACE Carafe   MUC-7   ACE02   ARRAU
        MUC                   0.618   0.585   0.590   0.557
        CEAF-AGGR Φ-3         0.537   0.379   0.393   0.683
        CEAF-AGGR Φ-4         0.506   0.206   0.309   0.717
        Link-based            0.638   0.594   0.532   0.540
        Pronouns              0.686   0.492   0.597   0.558
        Nominals              0.355   0.455   0.239   0.352
        Names                 0.638   0.817   0.784   0.763
  • 23. Use of the corpus for anaphora resolution (3)
        Italian corpora: LMC, ICAB

                          ICAB   LMC-Sys   LMC-Gold
        MUC              0.494     0.456      0.619
        CEAF-AGGR Φ-3    0.557     0.622      0.798
        CEAF-AGGR Φ-4    0.560     0.671      0.869
        Link-based       0.556     0.470      0.580
        Pronouns         0.452     0.520      0.521
        Nominals         0.421     0.303      0.522
        Names            0.741     0.642      0.752
  • 24. Use of the corpus for anaphora resolution (4)
        Use of C4 decision trees to compare the impact of individual features.
        The impact of the baseline features is similar for English and Italian, with two exceptions:
          The impact of gender matching is high in English, but it has no effect for Italian.
          The use of automatically computed aliases has a high impact for Italian and a low impact for English.
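The two features singled out on slide 24 can be illustrated with a small sketch of pairwise feature functions in the style of the Soon et al. baseline. Everything here is an assumption made for illustration: the function names, the three-valued gender feature, and the acronym heuristic standing in for alias detection. The slides do not say how aliases were actually computed.

```python
from typing import Optional


def gender_agreement(anaphor: Optional[str], antecedent: Optional[str]) -> str:
    """Three-valued gender feature: 'true', 'false' or 'unknown'."""
    if anaphor is None or antecedent is None:
        return "unknown"
    return "true" if anaphor == antecedent else "false"


def acronym(name: str) -> str:
    """Initials of the capitalised words, e.g. a multi-word organisation name."""
    return "".join(word[0] for word in name.split() if word[0].isupper())


def is_alias(first: str, second: str) -> bool:
    """Simplified alias test: one mention is the acronym of the other."""
    for short, full in ((first, second), (second, first)):
        if len(full.split()) > 1 and len(short) > 1 and short == acronym(full):
            return True
    return False


features = {
    "gender": gender_agreement("fem", "fem"),
    "alias": is_alias("IBM", "International Business Machines"),
}
```

A decision-tree learner trained with and without one such feature gives a rough picture of that feature's contribution, which is the kind of comparison the slide describes.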
  • 25. Use of the data
        5th International Workshop on Semantic Evaluations (SemEval 2010)
          Task: Coreference Resolution in Multiple Languages.
        Comparative research about zero-anaphora in Italian and Japanese
        Training and evaluation of content extraction models in the Live Memories project.
  • 26. Conclusions
        Linguistically motivated annotation scheme applicable to English and Italian.
        Scheme used to annotate different genres: newspapers, encyclopedic text, dialogue, narrative and weblogs.
        Corpora are usable to build anaphora resolution models.
        Datasets have been used for international competitions and for linguistic research.