Mihăilă, C., Ilisei, I. & Inkpen, D. To Be or Not to Be a Zero Pronoun: A Machine Learning Approach for Romanian. In Proceedings of PROMISE joint with CICLing 2010, Iaşi, Romania
To Be or Not To Be a Zero Pronoun?
A Machine Learning Approach For Romanian
Claudiu Mihăilă1 Iustina Ilisei2 Diana Inkpen3
1 Faculty of Computer Science,
“Alexandru Ioan Cuza” University of Iaşi
2 Research Institute in Information and Language Processing,
University of Wolverhampton
3 School of Information Technology and Engineering,
University of Ottawa
PROMISE, 29 March 2010, Iaşi, Romania
Mihăilă, Ilisei & Inkpen
Identifying Romanian Zero Pronouns
Outline
1 Introduction
Motivation
Zero Subjects vs. Zero Pronouns
Previous Work
2 Corpus
Annotation
Statistics
3 Identification
Features
Algorithms
Results
4 Conclusions
Motivation
The problem
Invisible anaphors
Lack of morphological information
Utility
Information extraction/retrieval
Automatic summarisation
Machine translation
Multiple-choice test items generation
etc.
Zero Subjects vs. Zero Pronouns
Zero subjects
The verb does not need a subject
Plouă. ("It is raining.")
Îmi pare rău de voi. ("I feel sorry for you.")
Azi nu-mi arde de glumă. ("Today I am in no mood for jokes.")
Zero pronouns
Lexically retrievable from the inflection of the verb
Coreferring with an overt noun, noun phrase, or clause
zp[Eu] Merg la şcoală. ("[I] go to school.")
Cine a auzit s-a întors şi zp[acela] a plecat. ("Whoever heard turned back and [that one] left.")
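As a rough illustration of "lexically retrievable from the inflection of the verb", the sketch below maps the person and number marked on a finite verb to the standard Romanian subject pronouns. The table and the function name `recover_subject` are assumptions made for this example and are not part of the original work; real forms can be ambiguous between several person/number readings, which the sketch ignores.

```python
# Standard Romanian nominative personal pronouns, indexed by (person, number).
# This lookup is an illustrative assumption, not the authors' method.
SUBJECT_PRONOUNS = {
    (1, "singular"): "eu",
    (2, "singular"): "tu",
    (3, "singular"): "el/ea",
    (1, "plural"): "noi",
    (2, "plural"): "voi",
    (3, "plural"): "ei/ele",
}


def recover_subject(person, number):
    """Return the pronoun a zero pronoun would stand for, given verb inflection."""
    return SUBJECT_PRONOUNS[(person, number)]


# A first person singular verb such as the one in "Merg la şcoală."
print(recover_subject(1, "singular"))
```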
Previous Work
For other languages
Spanish: Ferrández & Peral (2000), Rello & Ilisei (2009)
Chinese: Converse (2006), Zhao & Ng (2007)
Japanese, Korean, Portuguese, etc.
For Romanian
Harabagiu & Maiorano (2000)
Pavel et al. (2006)
Annotation
Empty XML tag with attributes
id
antecedent – the reference id, ‘non-nominal’, or ‘elliptic’
dependent verb – the reference id
clause type – main, coordinated, juxtaposed, or subordinated
annotator confidence – regarding the position, high or low
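To make the attribute list concrete, here is a minimal sketch of how such an empty tag could look and be read back. The element name `zp`, the attribute names, and the ids are all hypothetical: the slides name the attributes only informally, so the actual RoZP markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element name, attribute names and ids are illustrative only.
SAMPLE = (
    '<s>'
    '<zp id="zp7" antecedent="w12" depverb="w15" '
    'clause="coordinated" confidence="high"/>'
    '</s>'
)


def read_zero_pronouns(xml_text):
    """Return the attributes of every (hypothetical) <zp/> element as a dict."""
    root = ET.fromstring(xml_text)
    return [dict(zp.attrib) for zp in root.iter("zp")]


for zp in read_zero_pronouns(SAMPLE):
    print(zp["id"], zp["antecedent"], zp["depverb"], zp["clause"], zp["confidence"])
```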
Inter-annotator agreement
Agreement on ZP’s dependent verb: ≈ 98%
Cohen’s Kappa Coefficient: κ ≈ 90%
Agreement on ZP’s position in text: ≈ 90%
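For readers unfamiliar with the kappa figure, the sketch below computes Cohen's kappa for two annotators from first principles: observed agreement corrected by the agreement expected by chance. The label lists are toy data invented for the example, not the RoZP annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal proportions.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy data: the annotators disagree on one item out of four.
annotator_1 = ["zp", "zp", "no", "no"]
annotator_2 = ["zp", "no", "no", "no"]
print(cohen_kappa(annotator_1, annotator_2))
```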
Statistics
Corpus size
Overview               NT      ET      LT      ST   Overall
No. of tokens       18690   12963   13739    3391     48783
No. of sentences      816     574     790     253      2433
No. of ZPs            245     172     113     251       781
Avg. tokens/sent.   22.90   22.58   17.39   13.40     20.05
Avg. ZP/sent.        0.30    0.30    0.14    0.99      0.32
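The derived rows of the table follow directly from the raw counts. The short sketch below recomputes the totals and the two averages from the token, sentence and ZP counts given above; the dictionary layout and variable names are choices made for this example.

```python
# (tokens, sentences, zero pronouns) per sub-corpus, copied from the table above.
COUNTS = {
    "NT": (18690, 816, 245),
    "ET": (12963, 574, 172),
    "LT": (13739, 790, 113),
    "ST": (3391, 253, 251),
}

totals = tuple(sum(column) for column in zip(*COUNTS.values()))
rows = dict(COUNTS, Overall=totals)

for name, (tokens, sentences, zps) in rows.items():
    avg_tokens = tokens / sentences
    avg_zps = zps / sentences
    print(f"{name:8s} {avg_tokens:6.2f} tokens/sent. {avg_zps:5.2f} ZP/sent.")
```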
Features
10 features
From RACAI’s parser
type – main, auxiliary, copulative, or modal
mood – indicative, subjunctive, etc.
tense – present, imperfect, past, or pluperfect
person – first, second, or third
number – singular or plural
gender – masculine, feminine, or neuter
clitic – whether clitic form or not
Dynamically computed
impersonality – whether strictly impersonal or not
‘se’ – verb preceded by the reflexive pronoun ‘se’
The verb class from the manual annotation
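A minimal sketch of how one verb instance could be represented with these ten features, and how the ‘se’ feature might be computed. The class and field names are hypothetical, and the check assumes that "preceded" means the immediately preceding token, which the slides do not specify; contracted forms such as "s-a" are not handled.

```python
from dataclasses import dataclass


@dataclass
class VerbInstance:
    """One training example; field names are illustrative, not the authors'."""
    verb_type: str      # main, auxiliary, copulative, or modal
    mood: str           # indicative, subjunctive, etc.
    tense: str          # present, imperfect, past, or pluperfect
    person: int         # 1, 2, or 3
    number: str         # singular or plural
    gender: str         # masculine, feminine, or neuter
    clitic: bool        # whether the verb is in clitic form
    impersonal: bool    # whether the verb is strictly impersonal
    has_se: bool        # whether the verb is preceded by 'se'
    has_zp: bool        # class label from the manual annotation


def preceded_by_se(tokens, verb_index):
    """Assumption: 'preceded' means the token right before the verb is 'se'."""
    return verb_index > 0 and tokens[verb_index - 1].lower() == "se"


tokens = ["Se", "întunecă", "la", "ora", "cinci"]
print(preceded_by_se(tokens, 1))
```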
Algorithms
Weka classifiers
SMO – implementation of SVM
JRip – implementation of decision rules (RIPPER)
J48 – implementation of decision trees (C4.5)
Vote – majority-voting meta-classifier over the previous three
Data set
781 verbs with a ZP
781 randomly selected verbs without a ZP
10-fold cross-validation
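The experimental setup can be sketched in plain Python, independently of Weka: undersample the negative class to match the positives, split the balanced set into ten folds, and combine three predictions by majority vote. All function names are hypothetical, and the fold split here is a simple shuffled one rather than whatever stratification Weka applies.

```python
import random
from collections import Counter


def balanced_sample(positives, negatives, seed=0):
    """Keep all positives and an equally sized random sample of negatives."""
    rng = random.Random(seed)
    return list(positives) + rng.sample(list(negatives), len(positives))


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def majority_vote(predictions):
    """Return the label predicted by most of the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Placeholder ids standing in for verbs with and without a ZP.
data = balanced_sample(range(781), range(1000, 6000))
for train, test in k_fold_indices(len(data)):
    pass  # train the three base classifiers on `train`, evaluate on `test`
print(len(data), majority_vote(["zp", "zp", "no"]))
```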
Results
Classifier results
                     has ZP                  not ZP
Class.   Acc.     P      R      F1       P      R      F1
SMO      0.739    0.684  0.889  0.773    0.841  0.590  0.694
JRip     0.733    0.709  0.793  0.748    0.765  0.675  0.717
J48      0.720    0.698  0.777  0.735    0.749  0.663  0.703
Vote     0.733    0.705  0.802  0.750    0.770  0.665  0.713
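The columns of this table are related: F1 is the harmonic mean of precision and recall, and because the data set holds equal numbers of verbs with and without a ZP, accuracy equals the mean of the two recalls. The sketch below recomputes both from the rounded figures above; small differences in the third decimal are expected, since the published values were presumably derived from unrounded counts.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(recall_pos, recall_neg):
    """Accuracy on a data set with equally sized classes."""
    return (recall_pos + recall_neg) / 2


# name: (accuracy, (P, R, F1) for "has ZP", (P, R, F1) for "not ZP")
TABLE = {
    "SMO": (0.739, (0.684, 0.889, 0.773), (0.841, 0.590, 0.694)),
    "JRip": (0.733, (0.709, 0.793, 0.748), (0.765, 0.675, 0.717)),
    "J48": (0.720, (0.698, 0.777, 0.735), (0.749, 0.663, 0.703)),
    "Vote": (0.733, (0.705, 0.802, 0.750), (0.770, 0.665, 0.713)),
}

for name, (acc, pos, neg) in TABLE.items():
    print(
        name,
        round(f1_score(pos[0], pos[1]), 3),
        round(f1_score(neg[0], neg[1]), 3),
        round(balanced_accuracy(pos[1], neg[1]), 3),
    )
```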
Results
Attribute evaluation
Attribute        ChiSquare   InfoGain
Mood               402.546      0.206
‘Se’                25.719      0.012
Person              21.217      0.010
Impersonality       12.092      0.007
Tense                9.371      0.004
Type                 2.577      0.001
Number               0.354       1E-4
Gender                7E-4       3E-7
Clitic                   0          0
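Both scores measure how strongly a nominal attribute is associated with the class. The sketch below implements the textbook definitions: Pearson's chi-square over the attribute-by-class contingency table, and information gain as the reduction in class entropy (in bits, an assumption about the unit). The toy data are invented; an attribute whose values are distributed identically in both classes scores zero on both measures, as Clitic does in the table.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the entropy remaining after splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def chi_square(values, labels):
    """Pearson chi-square statistic of the attribute-by-class contingency table."""
    n = len(labels)
    value_counts, label_counts = Counter(values), Counter(labels)
    joint = Counter(zip(values, labels))
    stat = 0.0
    for v, nv in value_counts.items():
        for lab, nl in label_counts.items():
            expected = nv * nl / n
            stat += (joint[(v, lab)] - expected) ** 2 / expected
    return stat


# Toy data: the attribute separates the two classes perfectly.
mood = ["indicative", "indicative", "subjunctive", "subjunctive"]
has_zp = [1, 1, 0, 0]
print(chi_square(mood, has_zp), info_gain(mood, has_zp))
```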
Results
Error analysis
Ambiguity:
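E greu fără bani. ("It is hard without money.")
E greu de scris o carte. ("It is hard to write a book.")
Se întunecă la ora cinci. ("It gets dark at five o'clock.")
El se întunecă la faţă. ("His face darkens.")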
Parser errors
Conclusions
Summary
RoZP, a corpus with manually annotated ZPs
Identification of over 70% of ZPs using ML methods
Outlook
Improve the identification accuracy
other features – no. of verbs in sentence
syntactic information?
Resolve the identified ZPs
Thank you!
Questions?