Resources for linguistically motivated Multilingual Anaphora Resolution
1. Resources for linguistically motivated
Multilingual Anaphora Resolution
Kepa Joseba Rodríguez
Advisor: Massimo Poesio
18 January 2011
2. Outline
1 Motivation of the research
2 Contributions of this dissertation
3 Limitations of previous annotation schemes
4 Annotation scheme proposal
5 Annotated data
6 Usability of the data for anaphora resolution
7 Use of the data
8 Conclusions
3. Motivation
Linguistic research: cross-linguistic studies of
anaphora (Poesio et al. 2004)
Applications: summarization (Steinberger et al 2007)
Applications: machine translation
1 German: Peter hat Maria seine Blumen zum Gießen
gegeben. Sie hat sie vertrocknen lassen.
2 English (Babelfish): Peter gave Maria his flowers for
pouring. Then it left it to dry.
3 English (Google translate): Peter gave Mary flowers
to his casting. Then she let them dry up.
4 English (wanted): Peter gave Maria his flowers to
water. Then she let them dry out.
4. Contributions
Development of a linguistically motivated annotation
scheme for anaphoric relations.
Implementation of the scheme for manual annotation of
English and Italian data.
Creation of annotated data for English and Italian.
Use of the corpora for feature extraction and development
of anaphora resolution systems in English and Italian.
Participation of the systems in SemEval 2010.
5. Limitations of previous schemes (1)
Coverage of the annotation.
Annotation of reference.
Identification and annotation of discontinuity of semantic
material.
Problem of multiple interpretations: ambiguity.
6. Limitations of previous schemes (2)
Coverage of the annotation:
Annotated relations: only identity
ACE-like annotation schemes constrain the annotation to
noun phrases from a list of semantic types.
Genres: Most annotation schemes focus the annotation
on a few genres.
7. Limitations of previous schemes (3)
Annotation of reference
Expletives: they are not considered.
There are two people waiting for the interview.
Predication:
MUC, ACE: No distinction between predication and
identity relation.
OntoNotes: no semantic criteria to decide which noun
phrase is referring and which is a predicate.
[The president of the bank] is [John Smith].
[John Smith] is [the president of the bank].
Coordination: coordinated items are considered referring
expressions in corpora like MUC or OntoNotes.
[Milosevic or anyone else]
Nominals and proper names in premodifier position.
8. Limitations of previous schemes (4)
Identification of discontinuous semantic material.
Bill and Hillary Clinton
black cars and bikes
Multiple interpretations are not captured
[The house] is on [a long street]. [It] is very dirty.
9. Annotation scheme
Annotation of all noun phrases
Distinction between referring and non-referring
expressions
Annotation of clitics attached to the verb and empty
pronouns
Introduction of ambiguity
Introduction of discontinuous markables
Annotation of different kinds of relations: identity,
discourse deixis and bridging.
10. Reference
Markables are classified as referring or non-referring
Non-referring markables are annotated with type of
non-referring expression
Referring markables are annotated with:
Information status: New or old.
Semantic type
11. Reference
Types of non-referring expressions
Expletives
[There] are two people waiting for the interview
The new car is [there]
Predicate: semantic criteria to distinguish predicate and
referring expression.
[Il presidente della Repubblica, [Giorgio Napolitano]]
[The president of the bank] is [John Smith].
[John Smith] is [the president of the bank].
Quantifiers:
[All of [the box cars]]
Coordination.
Idiomatic expressions
by [the nape of [the neck]]
13. Annotation of ambiguity
A markable does not always have a unique interpretation.
1 Be careful hooking up [the engine] to [the boxcar]
because [it] is faulty.
2 [The house] is on [a long street]. [It] is very dirty.
In case of ambiguity, we tag the markable as ambiguous
and we annotate the possible interpretations.
Other possible ambiguities are:
Information status: between new and old.
Old and not referring.
14. List of annotated features
Agreement features
Gender
Number
Person
Grammatical function
Reference and information status
Semantic type
Type of non-referring
Link to antecedent
Ambiguity
Bridging
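The feature list above can be pictured as a single record per markable. The following is a minimal sketch only: the class name, field names and value strings are hypothetical and do not reproduce the actual ARRAU or LMC file format. It assumes that a discontinuous markable is stored as several token spans and that an ambiguous markable is one with more than one candidate antecedent.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Markable:
    """One annotated noun phrase (illustrative layout, not the corpus format)."""
    markable_id: str
    # Token spans as (start, end) with end exclusive; several spans = discontinuous.
    spans: List[Tuple[int, int]]
    # Agreement features
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    # Reference and information status
    referring: bool = True
    information_status: Optional[str] = None   # "new" or "old"
    semantic_type: Optional[str] = None
    non_referring_type: Optional[str] = None   # e.g. "expletive", "predicate"
    # Links: ids of candidate antecedents and of bridging anchors
    antecedents: List[str] = field(default_factory=list)
    bridging: List[str] = field(default_factory=list)

    @property
    def is_discontinuous(self) -> bool:
        return len(self.spans) > 1

    @property
    def is_ambiguous(self) -> bool:
        # Assumption: ambiguity = more than one possible antecedent.
        return len(self.antecedents) > 1


# "The house is on a long street. It is very dirty."
house = Markable("m1", [(0, 2)], information_status="new")
street = Markable("m2", [(4, 7)], information_status="new")
it = Markable("m3", [(8, 9)], information_status="old",
              antecedents=["m1", "m2"])

# "Bill and Hillary Clinton": the markable "Bill ... Clinton" skips two tokens.
bill_clinton = Markable("m4", [(0, 1), (3, 4)])
```

Deriving ambiguity and discontinuity from the stored links and spans, rather than storing separate flags, is a design choice of this sketch; the scheme itself tags the markable as ambiguous explicitly.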
15. Description of the annotated data
ARRAU (English)
Wall Street Journal texts
Trains dialogues
Gnome corpus
Pear stories
Live Memories Corpus for Italian (LMC)
Wikipedia sites
Blog sites
VENEX dataset
16. Description: English corpus
WSJ dataset
205 files
147,600 words in 5585 sentences. 47,900 markables.
1% of discontinuous markables, 12.6% non-referring.
Trains dialogues
35 files
26,000 words in 4600 sentences. 5200 markables.
GNOME corpus
5 files
21,600 words in 1000 sentences. 6100 markables
PEAR stories
20 files
14,000 words in 2,000 sentences. 3,900 markables.
17. Description: Italian corpus
Wikipedia dataset:
144 files.
140,000 words in 4700 sentences. 44,500 markables.
0.5% discontinuous markables, 0.5% clitics attached to
the verb, 4.5% empty subjects, 13.7% non-referring.
Blogs dataset:
75 files.
53,000 words in 2230 sentences.
16,000 markables.
VENEX corpus:
30 files
20,300 words in 720 sentences
6,220 markables
18. Reliability of the annotation – ARRAU
Previous study on the annotation of anaphoric links published
by Poesio and Artstein (2008)
Metric: Krippendorff’s α
α = 0.6-0.7
Statistics reflect the complexity of the task.
19. Reliability of the annotation – LMC
Metric: Siegel and Castellan’s κ
Information status and reference: old, new and
non-referring
κ = 0.80
Basic annotation of the markable: new, phrase
antecedent, segment antecedent, predicate, quantifier,
expletive, coordination and idiom.
κ = 0.79
Main disagreement between discourse new and predicate
Semantic type
κ = 0.85
20. Reliability of the annotation – LMC
Link to the antecedent
κ = 0.88
Antecedent of clitics
κ = 0.84
Antecedent of empty pronouns
κ = 0.93
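For readers unfamiliar with the agreement coefficient reported above, here is a minimal sketch of a κ in which expected agreement is computed from the category distribution pooled over coders, which is how Siegel and Castellan's coefficient is usually described. It is restricted to two coders, which is an assumption: the slides do not state how many annotators took part. The function name and the toy labels are hypothetical and are not LMC data.

```python
from collections import Counter
from typing import Hashable, Sequence


def pooled_kappa(coder_a: Sequence[Hashable], coder_b: Sequence[Hashable]) -> float:
    """Two-coder kappa with expected agreement from the pooled label distribution."""
    if len(coder_a) != len(coder_b) or len(coder_a) == 0:
        raise ValueError("both coders must label the same non-empty item list")
    n_items = len(coder_a)

    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n_items

    # Expected agreement: sum of squared pooled category proportions.
    pooled = Counter(coder_a) + Counter(coder_b)
    expected = sum((count / (2 * n_items)) ** 2 for count in pooled.values())

    if expected == 1.0:
        # Only one category was ever used; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with the three information-status categories (invented labels).
labels_a = ["old", "new", "old", "non-referring"]
labels_b = ["old", "new", "new", "non-referring"]
kappa = pooled_kappa(labels_a, labels_b)
```

In the toy example the coders agree on 3 of 4 items (observed agreement 0.75) and the pooled expected agreement is 22/64, so κ = 13/21, roughly 0.62.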
21. Use of the corpus for anaphora resolution (1)
Baseline proposed by Soon et al. (2001)
Classifier: MaxEnt
English data: ACE02, MUC-7 and ARRAU
Italian data: ICAB and LMC
Evaluation metrics:
MUC (Vilain et al. 1995)
CEAF (Luo, 2005)
Link based evaluation
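Of the metrics listed above, the MUC score is the easiest to state compactly. The sketch below follows the link-based definition of Vilain et al. (1995): recall counts, for each key chain, how many links survive once the chain is partitioned by the response chains, and precision swaps the roles of key and response. Function names are hypothetical, and the sketch assumes each mention belongs to at most one chain on each side. It is not the scorer used in the experiments.

```python
from typing import Hashable, List, Set, Tuple

Chain = Set[Hashable]


def _muc_recall(key: List[Chain], response: List[Chain]) -> float:
    """Recall of `response` against `key`; call with swapped arguments for precision."""
    chain_of = {m: i for i, chain in enumerate(response) for m in chain}
    numerator = 0
    denominator = 0
    for chain in key:
        if len(chain) < 2:
            continue  # singletons contribute no links
        # Partition of the key chain induced by the response: one part per
        # response chain it touches, plus one part per unresolved mention.
        touched = {chain_of[m] for m in chain if m in chain_of}
        unresolved = sum(1 for m in chain if m not in chain_of)
        parts = len(touched) + unresolved
        numerator += len(chain) - parts
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_score(key: List[Chain], response: List[Chain]) -> Tuple[float, float, float]:
    """Return (recall, precision, F1) under the MUC link-based metric."""
    recall = _muc_recall(key, response)
    precision = _muc_recall(response, key)
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return recall, precision, f1


# One gold chain split into two system chains: one of three links is missing.
gold = [{"a", "b", "c", "d"}]
system = [{"a", "b"}, {"c", "d"}]
recall, precision, f1 = muc_score(gold, system)
```

Here recall is 2/3, precision is 1 and F1 is 0.8: the system proposes only correct links but misses the one joining its two chains.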
22. Use of the corpus for anaphora resolution (2)
English corpora: ARRAU, ACE, MUC
                 ACE Carafe   MUC-7   ACE02   ARRAU
MUC                   0.618   0.585   0.590   0.557
CEAF-AGGR Φ-3         0.537   0.379   0.393   0.683
CEAF-AGGR Φ-4         0.506   0.206   0.309   0.717
Link-based            0.638   0.594   0.532   0.540
Pronouns              0.686   0.492   0.597   0.558
Nominals              0.355   0.455   0.239   0.352
Names                 0.638   0.817   0.784   0.763
23. Use of the corpus for anaphora resolution (3)
Italian corpora: LMC, ICAB
                 ICAB    LMC-Sys   LMC-Gold
MUC              0.494   0.456     0.619
CEAF-AGGR Φ-3    0.557   0.622     0.798
CEAF-AGGR Φ-4    0.560   0.671     0.869
Link-based       0.556   0.470     0.580
Pronouns         0.452   0.520     0.521
Nominals         0.421   0.303     0.522
Names            0.741   0.642     0.752
24. Use of the corpus for anaphora resolution (4)
Use of C4 decision trees to compare the impact of
individual features.
The impact of the baseline features is similar for English
and Italian with two exceptions:
The impact of gender matching is high in English, but
has no effect for Italian.
The use of automatically computed aliases has a high
impact for Italian and a low impact for English.
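To make the two features singled out above concrete, here is a minimal sketch of how a gender-match feature and an alias feature might be computed for a pair of mentions. Both definitions are assumptions made for illustration: the three-valued gender feature follows the general style of mention-pair baselines, and the alias test is a simple acronym check. The slides do not say how aliases were actually computed, and all names below are hypothetical.

```python
from typing import Dict, Optional


def gender_agreement(gender_a: Optional[str], gender_b: Optional[str]) -> str:
    """Three-valued feature: 'unknown' when either gender is missing."""
    if gender_a is None or gender_b is None:
        return "unknown"
    return "true" if gender_a == gender_b else "false"


def is_acronym_alias(name_a: str, name_b: str) -> bool:
    """True if the shorter name spells the capitalised initials of the longer one."""
    short, long_ = sorted((name_a.strip(), name_b.strip()), key=len)
    if short == long_:
        return False
    # Initials of capitalised words only, so "Bank of America" gives "BA".
    initials = "".join(word[0] for word in long_.split() if word[0].isupper())
    return len(initials) > 1 and short.replace(".", "").upper() == initials


def pair_features(anaphor: Dict[str, Optional[str]],
                  antecedent: Dict[str, Optional[str]]) -> Dict[str, object]:
    """Feature dictionary for one candidate (antecedent, anaphor) pair."""
    return {
        "gender_match": gender_agreement(anaphor.get("gender"),
                                         antecedent.get("gender")),
        "alias": is_acronym_alias(anaphor["text"] or "", antecedent["text"] or ""),
    }


features = pair_features(
    {"text": "IBM", "gender": None},
    {"text": "International Business Machines", "gender": "neut"},
)
```

A dictionary of this shape can be fed to a decision-tree or maximum-entropy learner after the usual one-hot encoding of the string-valued features.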
25. Use of the data
5th International Workshop on Semantic Evaluations
(SemEval 2010)
Task: Coreference Resolution in Multiple Languages.
Comparative research on zero anaphora in Italian and
Japanese
Training and evaluation of content extraction models in
the Live Memories project.
26. Conclusions
Linguistically motivated annotation scheme applicable to
English and Italian.
Scheme used to annotate different genres: newspapers,
encyclopedic text, dialogue, narrative and weblogs.
Corpora are usable to build anaphora resolution models.
Datasets have been used for international competitions
and for linguistic research.