Resources for linguistically motivated Multilingual Anaphora Resolution

Kepa Joseba Rodríguez

Advisor: Massimo Poesio
18 January 2011

Outline

1 Motivation of the research
2 Contributions of this dissertation
3 Limitations of previous annotation schemes
4 Annotation scheme proposal
5 Annotated data
6 Usability of the data for anaphora resolution
7 Use of the data
8 Conclusions

Motivation
Linguistic research: cross-linguistic studies of anaphora (Poesio et al. 2004)
Applications: summarization (Steinberger et al. 2007)
Applications: machine translation
1 German: Peter hat Maria seine Blumen zum Gießen gegeben. Sie hat sie vertrocknen lassen.
2 English (Babelfish): Peter gave Maria his flowers for pouring. Then it left it to dry.
3 English (Google Translate): Peter gave Mary flowers to his casting. Then she let them dry up.
4 English (wanted): Peter gave Maria his flowers to water. Then she let them dry out.

Contributions

Development of a linguistically motivated annotation
scheme for anaphoric relations.
Implementation of the scheme for manual annotation of
English and Italian data.
Creation of annotated data for English and Italian.
Use of the corpora for feature extraction and development
of anaphora resolution systems in English and Italian.
Participation of the systems in SemEval 2010.

Limitations of previous schemes (1)

Coverage of the annotation.
Annotation of reference.
Identification and annotation of discontinuity of semantic
material.
Problem of multiple interpretations: ambiguity.


Coverage of the annotation:
Annotated relations: only identity
ACE-like annotation schemes constrain the annotation to
noun phrases from a list of semantic types.
Genres: most annotation schemes focus on only a few
genres.

Annotation of reference
Expletives: they are not considered.
There are two people waiting for the interview.
Predication:
MUC, ACE: No distinction between predication and
identity relation.
OntoNotes: no semantic criteria to decide which noun
phrase is referring and which is a predicate.
[The president of the bank] is [John Smith].
[John Smith] is [the president of the bank].
Coordination: coordinated items are considered referring
expressions in corpora like MUC or OntoNotes.
[Milosevic or anyone else]
Nominals and proper names in premodifier position.

Identification of discontinuous semantic material.
Bill and Hillary Clinton
black cars and bikes

Multiple interpretations are not captured
[The house] is on [a long street]. [It] is very dirty.

Annotation scheme

Annotation of all noun phrases
Distinction between referring and non-referring
expressions
Annotation of clitics attached to the verb and empty
pronouns
Introduction of ambiguity
Introduction of discontinuous markables
Annotation of different kinds of relations: identity,
discourse deixis and bridging.

Reference

Markables are classified as referring or non-referring
Non-referring markables are annotated with the type of
non-referring expression
Referring markables are annotated with:
Information status: New or old.
Semantic type

Reference
Types of non-referring expressions
Expletives
[There] are two people waiting for the interview
The new car is [there]
Predicate: semantic criteria to distinguish predicate and
referring expression.
[Il presidente della Repubblica, [Giorgio Napolitano]]
[The president of the bank] is [John Smith].
[John Smith] is [the president of the bank].
Quantifiers:
[All of [the box cars]]
Coordination.
Idiomatic expressions
by [the nape of [the neck]]

Semantic types
1 Person
2 Animate
3 Organization
4 Facility
5 Geopolitical entity (GPE)
6 Location
7 Temporal
8 Numerical
9 Concrete
10 Abstract
11 Event
12 Other
13 Unknown

Annotation of ambiguity

A markable does not always have a unique interpretation.
1 Be careful hooking up [the engine] to [the boxcar]
because [it] is faulty.
2 [The house] is on [a long street]. [It] is very dirty.
In case of ambiguity, we tag the markable as ambiguous
and we annotate the possible interpretations.
Other possible ambiguities are:
Information status: between new and old.
Old and not referring.

List of annotated features
Agreement features
Gender
Number
Person
Grammatical function
Reference and information status
Semantic type
Type of non-referring
Link to antecedent
Ambiguity
Bridging
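
The feature list above could be held as one record per markable. The following is a minimal Python sketch under stated assumptions: the class name, field names and value strings are illustrative and do not reproduce the actual ARRAU or LMC file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Markable:
    """One annotated noun phrase (hypothetical layout, not the corpus format)."""
    text: str
    gender: Optional[str] = None
    number: Optional[str] = None
    person: Optional[str] = None
    grammatical_function: Optional[str] = None
    referring: bool = True
    information_status: Optional[str] = None  # "new" or "old"
    semantic_type: Optional[str] = None       # e.g. "person", "concrete"
    non_referring_type: Optional[str] = None  # e.g. "expletive", "predicate"
    antecedents: List[str] = field(default_factory=list)
    ambiguous: bool = False
    bridging: Optional[str] = None


# "[The house] is on [a long street]. [It] is very dirty."
house = Markable("The house", number="sing", information_status="new")
street = Markable("a long street", number="sing", information_status="new")
# The pronoun has two possible antecedents, so both are recorded
# and the markable is flagged as ambiguous.
it = Markable("It", number="sing", information_status="old",
              antecedents=[house.text, street.text], ambiguous=True)
```

Storing the antecedents as a list rather than a single link is one way to keep every possible interpretation of an ambiguous markable.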

Description of the annotated data

ARRAU (English)
Wall Street Journal texts
Trains dialogues
Gnome corpus
Pear stories
Live Memories Corpus for Italian (LMC)
Wikipedia sites
Blog sites
VENEX dataset

Description: English corpus
WSJ dataset
205 files
147,600 words in 5,585 sentences. 47,900 markables.
1% discontinuous markables, 12.6% non-referring.
Trains dialogues
35 files
26,000 words in 4,600 sentences. 5,200 markables.
GNOME corpus
5 files
21,600 words in 1,000 sentences. 6,100 markables.
Pear stories
20 files
14,000 words in 2,000 sentences. 3,900 markables.

Description: Italian corpus
Wikipedia dataset:
144 files
140,000 words in 4,700 sentences. 44,500 markables.
0.5% discontinuous markables, 0.5% clitics attached to
the verb, 4.5% empty subjects, 13.7% non-referring.
Blogs dataset:
75 files
53,000 words in 2,230 sentences. 16,000 markables.
VENEX corpus:
30 files
20,300 words in 720 sentences. 6,220 markables.

Reliability of the annotation – ARRAU

A previous reliability study on the annotation of anaphoric
links was published by Poesio and Artstein (2008)
Metric: Krippendorff's α
α = 0.6-0.7
These values reflect the complexity of the task.

Reliability of the annotation – LMC

Metric: Siegel and Castellan's κ
Information status and reference: old, new and
non-referring
κ = 0.80
Basic annotation of the markable: new, phrase
antecedent, segment antecedent, predicate, quantifier,
expletive, coordination and idiom.
κ = 0.79
Main source of disagreement: discourse-new vs. predicate
Semantic type
κ = 0.85

Reliability of the annotation – LMC

Link to the antecedent
κ = 0.88
Antecedent of clitics
κ = 0.84
Antecedent of empty pronouns
κ = 0.93
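
For reference, the κ coefficient reported above can be sketched as follows. This assumes the usual formulation of Siegel and Castellan's κ, in which expected agreement is computed from the category distribution pooled over all coders; the function name and the toy labels are illustrative and are not taken from the LMC study.

```python
from collections import Counter


def siegel_castellan_kappa(annotations):
    """Chance-corrected agreement with pooled category proportions.

    `annotations` is a list of items; each item is a list with one
    label per coder (the same number of coders for every item).
    """
    n_items = len(annotations)
    n_coders = len(annotations[0])
    pooled = Counter()
    agreeing_pairs = 0
    for labels in annotations:
        counts = Counter(labels)
        pooled.update(counts)
        # Ordered pairs of coders that chose the same label for this item.
        agreeing_pairs += sum(n * (n - 1) for n in counts.values())
    observed = agreeing_pairs / (n_items * n_coders * (n_coders - 1))
    total = n_items * n_coders
    expected = sum((n / total) ** 2 for n in pooled.values())
    return (observed - expected) / (1 - expected)


# Toy data: two coders, six markables, three categories.
toy = [
    ["old", "old"], ["old", "old"],
    ["new", "new"], ["new", "new"],
    ["new", "old"],
    ["non-referring", "non-referring"],
]
kappa = siegel_castellan_kappa(toy)
```

The sketch does not handle the degenerate case where every coder uses a single category for all items, since expected agreement is then 1 and κ is undefined.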

Use of the corpus for anaphora resolution (1)

Baseline proposed by Soon et al. (2001)
Classifier: MaxEnt
English data: ACE02, MUC-7 and ARRAU
Italian data: ICAB and LMC
Evaluation metrics:
MUC (Vilain et al. 1995)
CEAF (Luo, 2005)
Link based evaluation
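
As an illustration of the first metric listed above, the MUC score of Vilain et al. (1995) counts how many coreference links are missing when each key chain is split by the response chains. The sketch below is a minimal version under that definition; the function names and the example chains are assumptions for illustration and are not the scorer used in the experiments.

```python
def muc_recall(key, response):
    """MUC recall: `key` and `response` are lists of sets of mention ids."""
    chain_of = {m: i for i, chain in enumerate(response) for m in chain}
    numerator = 0
    denominator = 0
    for chain in key:
        # Response chains that overlap this key chain.
        linked = {chain_of[m] for m in chain if m in chain_of}
        # Mentions missing from the response each form their own part.
        unlinked = sum(1 for m in chain if m not in chain_of)
        parts = len(linked) + unlinked
        numerator += len(chain) - parts
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc_scores(key, response):
    """Return (recall, precision, F1); precision swaps key and response."""
    recall = muc_recall(key, response)
    precision = muc_recall(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# One key chain of four mentions, split into two chains by the system.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
recall, precision, f1 = muc_scores(key, response)
```

Because the metric only counts links, singleton chains contribute nothing to either the numerator or the denominator.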


English corpora: ARRAU, ACE, MUC
                 ACE Carafe   MUC-7   ACE02   ARRAU
MUC              0.618        0.585   0.590   0.557
CEAF-AGGR Φ-3    0.537        0.379   0.393   0.683
CEAF-AGGR Φ-4    0.506        0.206   0.309   0.717
Link-based       0.638        0.594   0.532   0.540
Pronouns         0.686        0.492   0.597   0.558
Nominals         0.355        0.455   0.239   0.352
Names            0.638        0.817   0.784   0.763


Italian corpora: LMC, ICAB
                 ICAB    LMC-Sys   LMC-Gold
MUC              0.494   0.456     0.619
CEAF-AGGR Φ-3    0.557   0.622     0.798
CEAF-AGGR Φ-4    0.560   0.671     0.869
Link-based       0.556   0.470     0.580
Pronouns         0.452   0.520     0.521
Nominals         0.421   0.303     0.522
Names            0.741   0.642     0.752


Use of C4 decision trees to compare the impact of
individual features.
The impact of the baseline features is similar for English
and Italian with two exceptions:
The impact of gender matching is high in English, but
has no effect for Italian.
The use of automatically computed aliases has a high
impact for Italian and a low impact for English.
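
The two features singled out above are computed over pairs of mentions. The sketch below shows one plausible way to compute them; the function names, the gender labels and the acronym heuristic for aliases are assumptions made for illustration and are not the feature extractors used in the experiments.

```python
def gender_match(anaphor_gender, antecedent_gender):
    """True or False when both genders are known, None otherwise."""
    if anaphor_gender is None or antecedent_gender is None:
        return None
    return anaphor_gender == antecedent_gender


def is_acronym_alias(short, long):
    """True if `short` spells the initials of the capitalised words in `long`."""
    initials = "".join(w[0] for w in long.split() if w[0].isupper())
    candidate = short.replace(".", "").upper()
    return len(candidate) > 1 and candidate == initials.upper()


def pair_features(anaphor, antecedent):
    """Feature dictionary for one (anaphor, antecedent) candidate pair.

    Each mention is a dict with a "text" key and an optional "gender" key.
    """
    return {
        "gender_match": gender_match(anaphor.get("gender"),
                                     antecedent.get("gender")),
        "alias": is_acronym_alias(anaphor["text"], antecedent["text"])
                 or is_acronym_alias(antecedent["text"], anaphor["text"]),
    }


features = pair_features(
    {"text": "WSJ"},
    {"text": "Wall Street Journal"},
)
```

Returning None for unknown gender keeps "no information" distinct from a real mismatch, which a tree learner can treat as a third value.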

Use of the data

5th International Workshop on Semantic Evaluations
(SemEval 2010)
Task: Coreference Resolution in Multiple Languages.
Comparative research on zero anaphora in Italian and
Japanese
Training and evaluation of content extraction models in
the Live Memories project.

Conclusions

Linguistically motivated annotation scheme applicable to
English and Italian.
Scheme used to annotate different genres: newspapers,
encyclopedic text, dialogue, narrative and weblogs.
Corpora are usable to build anaphora resolution models.
Datasets have been used for international competitions
and for linguistic research.
