Introduction
                              Corpus
                        Identification
                          Conclusions




To Be or Not To Be a Zero Pronoun?
  A Machine Learning Approach For Romanian


Claudiu Mihăilă1                Iustina Ilisei2        Diana Inkpen3

                   1 Faculty of Computer Science,

              "Alexandru Ioan Cuza" University of Iași
 2 Research   Institute in Information and Language Processing,
                    University of Wolverhampton
     3 School   of Information Technology and Engineering,
                        University of Ottawa


    PROMISE, 29 March 2010, Iași, Romania

              Mihăilă, Ilisei & Inkpen     Identifying Romanian Zero Pronouns


Outline

  1   Introduction
         Motivation
         Zero Subjects vs. Zero Pronouns
         Previous Work
  2   Corpus
        Annotation
        Statistics
  3   Identification
        Features
        Algorithms
        Results
  4   Conclusions



Motivation

  The problem
      Invisible anaphors
       Lack of morphological information

  Utility
       Information extraction/retrieval
       Automatic summarisation
       Machine translation
       Multiple-choice test items generation
       etc.




Zero Subjects vs. Zero Pronouns


  Zero subjects
      The verb does not need a subject
           Plouă. ("It is raining.")
           Îmi pare rău de voi. ("I feel sorry for you.")
           Azi nu-mi arde de glumă. ("Today I am in no mood for jokes.")

  Zero pronouns
      Lexically retrievable from the inflection of the verb
      Coreferring with an overt noun, noun phrase, or clause
           zp[Eu] Merg la școală. ("[I] go to school.")
           Cine a auzit s-a întors și zp[acela] a plecat. ("Whoever heard came back, and [that one] left.")






Previous Work


  For other languages
      Spanish: Ferrández & Peral (2000), Rello & Ilisei (2009)
      Chinese: Converse (2006), Zhao & Ng (2007)
      Japanese, Korean, Portuguese, etc.

  For Romanian
      Harabagiu & Maiorano (2000)
      Pavel et al. (2006)






Annotation

  Empty XML tag with attributes
      id
      antecedent – the reference id, 'non-nominal', or 'elliptic'
      dependent verb – the reference id
      clause type – main, coordinated, juxtaposed, or subordinated
      annotator confidence – regarding the position, high or low
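
As a rough illustration of the annotation scheme listed above, the Python sketch below reads one empty ZP tag from a sentence. The slide names the attributes but not their XML spelling, so the tag name `zp`, the attribute names `antecedent`, `depverb`, `type` and `conf`, the `s`/`w` wrapper elements, and all id values are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence: tag and attribute spellings are assumptions,
# not the actual RoZP markup.
SAMPLE = (
    '<s id="s1">'
    '<zp id="zp1" antecedent="elliptic" depverb="w1" type="main" conf="high"/>'
    '<w id="w1">Merg</w> <w id="w2">la</w> <w id="w3">școală</w>.'
    '</s>'
)

def read_zero_pronouns(xml_text):
    """Return one attribute dict per empty <zp/> tag found in xml_text."""
    root = ET.fromstring(xml_text)
    return [dict(zp.attrib) for zp in root.iter("zp")]

zps = read_zero_pronouns(SAMPLE)
for zp in zps:
    print(zp["id"], "depends on verb", zp["depverb"], "in a", zp["type"], "clause")
```

Keeping the tag empty means the original token stream is unchanged, so the annotation can be stripped or ignored without affecting the text.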

  Inter-annotator agreement
      Agreement on ZP’s dependent verb: ≈ 98%
           Cohen’s Kappa Coefficient: κ ≈ 90%
      Agreement on ZP’s position in text: ≈ 90%
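
For readers unfamiliar with the agreement measures quoted above, here is a minimal sketch of how raw agreement and Cohen's kappa are computed from two annotators' decisions. The label lists are toy data invented for this example and are unrelated to the RoZP annotations.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for the agreement expected by chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = observed_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy decisions (1 = "has ZP", 0 = "no ZP"), hypothetical.
annotator_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(observed_agreement(annotator_a, annotator_b), cohen_kappa(annotator_a, annotator_b))
```

Kappa is lower than raw agreement here because part of the raw agreement would be expected even from random labelling.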



Statistics



  Corpus size
     Overview               NT        ET        LT        ST    Overall
     No. of tokens       18690     12963     13739      3391      48783
     No. of sentences      816       574       790       253       2433
     No. of ZPs            245       172       113       251        781
     Avg. tokens/sent.   22.90     22.58     17.39     13.40      20.05
     Avg. ZP/sent.        0.30      0.30      0.14      0.99       0.32
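
The derived rows of the table can be recomputed from the raw counts. The sketch below does so in Python; the counts are copied from the table, while the dictionary layout and function name are illustrative choices.

```python
# Raw counts per sub-corpus, copied from the table above.
CORPUS = {
    "NT": {"tokens": 18690, "sentences": 816, "zps": 245},
    "ET": {"tokens": 12963, "sentences": 574, "zps": 172},
    "LT": {"tokens": 13739, "sentences": 790, "zps": 113},
    "ST": {"tokens": 3391, "sentences": 253, "zps": 251},
}

def summarise(corpus):
    """Add an 'Overall' row and the two per-sentence averages to every row."""
    totals = {key: sum(part[key] for part in corpus.values())
              for key in ("tokens", "sentences", "zps")}
    rows = dict(corpus, Overall=totals)
    return {
        name: {
            **counts,
            "tokens_per_sentence": round(counts["tokens"] / counts["sentences"], 2),
            "zps_per_sentence": round(counts["zps"] / counts["sentences"], 2),
        }
        for name, counts in rows.items()
    }

summary = summarise(CORPUS)
for name, row in summary.items():
    print(name, row)
```

The ST part stands out: it has close to one ZP per sentence, roughly three times the rate of NT and ET.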






Features

  10 features
      From RACAI's parser
           type – main, auxiliary, copulative, or modal
           mood – indicative, subjunctive, etc.
           tense – present, imperfect, past, or pluperfect
           person – first, second, or third
           number – singular or plural
           gender – masculine, feminine, or neuter
           clitic – whether clitic form or not
      Dynamically computed
           impersonality – whether strictly impersonal or not
           'se' – verb preceded by reflexive pronoun 'se'
      The verb class from the manual annotation
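
One possible way to hold a single verb instance with these ten features is sketched below. The field names, the example values, and the reading of "verb class" as a has-ZP / no-ZP label are assumptions for illustration; the slides do not specify the data layout.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class VerbInstance:
    # Seven features taken from the parser output.
    verb_type: str      # main, auxiliary, copulative, or modal
    mood: str           # indicative, subjunctive, ...
    tense: str          # present, imperfect, past, or pluperfect
    person: str         # first, second, or third
    number: str         # singular or plural
    gender: str         # masculine, feminine, or neuter
    clitic: bool        # whether the verb is in clitic form
    # Two features computed on the fly.
    impersonal: bool    # strictly impersonal verb or not
    preceded_by_se: bool
    # Class label from the manual annotation (assumed: has a ZP or not).
    has_zp: bool

def to_row(instance):
    """Flatten an instance into strings, e.g. for one CSV or ARFF data line."""
    return [str(v).lower() if isinstance(v, bool) else v
            for v in asdict(instance).values()]

# Made-up example values.
example = VerbInstance("main", "indicative", "present", "first", "singular",
                       "masculine", False, False, False, True)
print(",".join(to_row(example)))
```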



Algorithms


  Weka classifiers
      SMO – implementation of SVM
      JRip – implementation of decision rules
      J48 – implementation of decision trees
      Vote – majority-voting meta-classifier over the previous three
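
The idea behind the voting meta-classifier can be sketched in a few lines of Python. This is a plain majority vote over per-instance predictions, written for illustration only; it is not Weka's Vote implementation, and the label strings are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label among one instance's predictions."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    label, _count = Counter(predictions).most_common(1)[0]
    return label

def vote_all(per_classifier):
    """per_classifier: equally long prediction lists, one list per classifier."""
    return [majority_vote(list(column)) for column in zip(*per_classifier)]

# Hypothetical outputs of three base classifiers on three verbs.
svm_out = ["zp", "no", "zp"]
rules_out = ["zp", "zp", "no"]
tree_out = ["no", "no", "no"]
print(vote_all([svm_out, rules_out, tree_out]))
```

With three base classifiers and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.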

  Data set
      781 verbs with a ZP
      781 randomly selected verbs without a ZP
      10-fold cross validation
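
The data set construction described above (all positive verbs plus an equal number of randomly chosen negatives, evaluated with 10-fold cross-validation) might be sketched as follows. The size of the negative pool, the random seed, and the unstratified fold assignment are assumptions of this sketch.

```python
import random

def balanced_sample(positives, negatives, seed=0):
    """Pair every positive with one randomly drawn negative and shuffle."""
    rng = random.Random(seed)
    sampled = rng.sample(negatives, len(positives))
    data = [(item, 1) for item in positives] + [(item, 0) for item in sampled]
    rng.shuffle(data)
    return data

def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 781 verbs with a ZP; the pool of 5000 verbs without a ZP is a placeholder.
data = balanced_sample(list(range(781)), list(range(1000, 6000)))
folds = list(k_fold_indices(len(data), k=10))
print(len(data), [len(test) for _train, test in folds])
```

Balancing the classes makes 50% the accuracy of a trivial classifier, which gives a simple reference point for the reported scores.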




Results



  Classifier results
                              has ZP                     not ZP
    Class.   Acc.       P       R       F1         P       R       F1
    SMO      0.739    0.684   0.889   0.773      0.841   0.590   0.694
    JRip     0.733    0.709   0.793   0.748      0.765   0.675   0.717
    J48      0.720    0.698   0.777   0.735      0.749   0.663   0.703
    Vote     0.733    0.705   0.802   0.750      0.770   0.665   0.713
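
The columns of the table are related in two simple ways, shown in the sketch below: F1 is the harmonic mean of precision and recall, and on a data set with equally many instances per class, accuracy equals the mean of the two per-class recalls. The function names are illustrative. Because the inputs are the rounded table values, recomputed figures may differ from the table in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def accuracy_balanced(recall_pos, recall_neg):
    """Accuracy when both classes have the same number of instances."""
    return (recall_pos + recall_neg) / 2

# SMO row of the table: precision and recall for "has ZP", recall for "not ZP".
smo_f1 = f1_score(0.684, 0.889)
smo_acc = accuracy_balanced(0.889, 0.590)
print(f"SMO F1 (has ZP) ~ {smo_f1:.3f}, accuracy ~ {smo_acc:.4f}")
```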






Results

  Attribute evaluation
                 Attribute                ChiSquare         InfoGain
                 Mood                       402.546            0.206
                 ’Se’                        25.719            0.012
                 Person                      21.217            0.010
                 Impersonality               12.092            0.007
                 Tense                        9.371            0.004
                 Type                         2.577            0.001
                 Number                       0.354             1E-4
                 Gender                        7E-4             3E-7
                 Clitic                           0                0
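
To make the InfoGain column easier to interpret, here is a minimal sketch of information gain for a nominal attribute, measured in bits. The toy values are invented: one attribute that separates the classes perfectly and one constant attribute, which, like Clitic in the table, carries no information.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus their expected entropy given the attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: four verbs, two with a ZP (1) and two without (0).
labels = [1, 1, 0, 0]
informative = ["indicative", "indicative", "subjunctive", "subjunctive"]
constant = ["no", "no", "no", "no"]
print(information_gain(informative, labels), information_gain(constant, labels))
```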





Results



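  Error analysis
      Ambiguity:
           E greu fără bani. ("It is hard without money.")
           E greu de scris o carte. ("It is hard to write a book.")
           Se întunecă la ora cinci. ("It gets dark at five o'clock.")
           El se întunecă la față. ("His face darkens.")
      Parser errors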






Conclusions


  Summary
     RoZP, a corpus with manually annotated ZPs
      Identification of over 70% of ZPs using ML methods

  Outlook
      Improve the identification accuracy
           other features – no. of verbs in sentence
           syntactic information?
      Resolve the identified ZPs







Thank you!
Questions?




New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

To Be or Not to be a Zero Pronoun: A Machine Learning Approach for Romanian

  • 1. To Be or Not To Be a Zero Pronoun? A Machine Learning Approach For Romanian
      Introduction · Corpus · Identification · Conclusions
      Claudiu Mihăilă (1), Iustina Ilisei (2), Diana Inkpen (3)
      (1) Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iași
      (2) Research Institute in Information and Language Processing, University of Wolverhampton
      (3) School of Information Technology and Engineering, University of Ottawa
      PROMISE, 29 March 2010, Iași, Romania
      Mihăilă, Ilisei & Inkpen: Identifying Romanian Zero Pronouns
  • 2. Outline
      1. Introduction: Motivation; Zero Subjects vs. Zero Pronouns; Previous Work
      2. Corpus: Annotation; Statistics
      3. Identification: Features; Algorithms; Results
      4. Conclusions
  • 3. Motivation
      The problem: invisible anaphors; lack of morphological information.
      Utility: information extraction/retrieval, automatic summarisation, machine translation, multiple-choice test item generation, etc.
  • 5. Zero Subjects vs. Zero Pronouns
      Zero subjects: the verb does not need a subject.
        Plouă. ("It is raining.")
        Îmi pare rău de voi. ("I feel sorry for you.")
        Azi nu-mi arde de glumă. ("I am in no mood for jokes today.")
      Zero pronouns: lexically retrievable from the inflection of the verb; coreferring with an overt noun, noun phrase, or clause.
        zp[Eu] Merg la școală. ("[I] go to school.")
        Cine a auzit s-a întors și zp[acela] a plecat. ("Whoever heard turned around and [that one] left.")
  • 7. Previous Work
      For other languages
        Spanish: Ferrández & Peral (2000), Rello & Ilisei (2009)
        Chinese: Converse (2006), Zhao & Ng (2007)
        Japanese, Korean, Portuguese, etc.
      For Romanian
        Harabagiu & Maiorano (2000)
        Pavel et al. (2006)
  • 9. Annotation
      Empty XML tag with attributes
        id
        antecedent: the reference id, 'non-nominal', or 'elliptic'
        dependent verb: the reference id
        clause type: main, coordinated, juxtaposed, or subordinated
        annotator confidence: regarding the position, high or low
      Inter-annotator agreement
        Agreement on ZP's dependent verb: ≈ 98%
        Cohen's Kappa Coefficient: κ ≈ 90%
        Agreement on ZP's position in text: ≈ 90%
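The annotation scheme and the agreement measure on this slide can be illustrated with a small sketch. The slide only names the attributes, so the element name `zp`, the attribute spellings (`depend_verb`, `clause_type`, `confidence`), the sample sentence markup and its attribute values are all invented for illustration. `cohen_kappa` is the textbook formula, not necessarily the exact procedure the authors used.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element and attribute names are assumptions,
# chosen only to mirror the attribute list on the slide.
SAMPLE = (
    '<s id="s1">'
    '<zp id="zp1" antecedent="elliptic" depend_verb="w1" '
    'clause_type="main" confidence="high"/>'
    '<w id="w1">Merg</w> <w id="w2">la</w> <w id="w3">școală</w>.'
    '</s>'
)


def read_zero_pronouns(xml_text):
    """Return the attribute dictionary of every empty zp element."""
    root = ET.fromstring(xml_text)
    return [dict(zp.attrib) for zp in root.iter("zp")]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


zero_pronouns = read_zero_pronouns(SAMPLE)
kappa = cohen_kappa(["zp", "zp", "no", "no"], ["zp", "no", "no", "no"])
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is reported separately from the percentage agreement figures.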
  • 11. Statistics: Corpus size overview
                                NT      ET      LT      ST   Overall
      No. of tokens          18690   12963   13739    3391     48783
      No. of sentences         816     574     790     253      2433
      No. of ZPs               245     172     113     251       781
      Avg. tokens/sent.      22.90   22.58   17.39   13.40     20.05
      Avg. ZP/sent.           0.30    0.30    0.14    0.99      0.32
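The two average rows and the Overall column follow directly from the raw counts. A minimal sketch that recomputes them is below; the counts are copied from the table, while the dictionary layout and function names are illustrative.

```python
# Raw counts from the corpus table; NT, ET, LT and ST are the slide's labels.
CORPUS = {
    "NT": {"tokens": 18690, "sentences": 816, "zps": 245},
    "ET": {"tokens": 12963, "sentences": 574, "zps": 172},
    "LT": {"tokens": 13739, "sentences": 790, "zps": 113},
    "ST": {"tokens": 3391, "sentences": 253, "zps": 251},
}


def overall(parts):
    """Sum each count over all sub-corpora."""
    keys = ("tokens", "sentences", "zps")
    return {k: sum(part[k] for part in parts.values()) for k in keys}


def averages(row):
    """Return (tokens per sentence, ZPs per sentence), rounded to 2 places."""
    return (
        round(row["tokens"] / row["sentences"], 2),
        round(row["zps"] / row["sentences"], 2),
    )


total = overall(CORPUS)
per_part = {name: averages(row) for name, row in CORPUS.items()}
total_avg = averages(total)
```

The ST part stands out: it has by far the fewest tokens yet almost one zero pronoun per sentence.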
  • 12. Features
      10 features
      From RACAI's parser
        type: main, auxiliary, copulative, or modal
        mood: indicative, subjunctive, etc.
        tense: present, imperfect, past, or pluperfect
        person: first, second, or third
        number: singular or plural
        gender: masculine, feminine, or neuter
        clitic: whether clitic form or not
      Dynamically computed
        impersonality: whether strictly impersonal or not
        'se': verb preceded by reflexive pronoun 'se'
      The verb class from the manual annotation
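One way to picture a training instance built from these ten features is sketched below. The slides do not say how the two dynamically computed features were implemented, so the `IMPERSONAL_LEMMAS` list and the "immediately preceding token" test for 'se' are simplifying assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: a tiny hand-made list of strictly impersonal verb lemmas
# ("a ploua" = to rain, "a ninge" = to snow). The real list is not given.
IMPERSONAL_LEMMAS = {"ploua", "ninge"}


@dataclass
class VerbInstance:
    # Seven features taken from the parser output.
    verb_type: str
    mood: str
    tense: str
    person: str
    number: str
    gender: str
    clitic: bool
    # Two features computed on the fly.
    impersonal: bool
    preceded_by_se: bool
    # Class label from the manual annotation.
    has_zp: bool


def is_impersonal(lemma):
    """Assumed lookup in a list of strictly impersonal verbs."""
    return lemma.lower() in IMPERSONAL_LEMMAS


def preceded_by_se(tokens, verb_index):
    """Assumed test: the token right before the verb is the pronoun 'se'."""
    return verb_index > 0 and tokens[verb_index - 1].lower() == "se"


example = VerbInstance(
    verb_type="main", mood="indicative", tense="present", person="third",
    number="singular", gender="neuter", clitic=False,
    impersonal=is_impersonal("întuneca"),
    preceded_by_se=preceded_by_se(["Se", "întunecă", "la", "ora", "cinci"], 1),
    has_zp=False,
)
```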
  • 15. Algorithms
      Weka classifiers
        SMO: implementation of SVM
        JRip: implementation of decision rules
        J48: implementation of decision trees
        Vote: majority-voting meta-classifier on previous three
      Data set
        781 verbs with a ZP
        781 randomly selected verbs without a ZP
        10-fold cross validation
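The experimental setup on this slide (a balanced sample, 10 folds, and a majority vote over three classifiers) can be sketched in plain Python. This is not Weka code: the function names, the seeds and the unstratified fold assignment are assumptions made to keep the illustration short.

```python
import random
from collections import Counter


def balanced_sample(positives, negatives, seed=0):
    """Keep all positives and draw an equally large random set of negatives."""
    rng = random.Random(seed)
    return list(positives) + rng.sample(list(negatives), len(positives))


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for plain, unstratified k-fold CV."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def majority_vote(predictions):
    """Return the label predicted by most of the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# 781 verbs with a ZP plus 781 sampled verbs without one gives 1562 instances.
data = balanced_sample(range(781), range(1000, 3000))
splits = list(k_fold_indices(len(data), k=10))
vote = majority_vote(["zp", "no_zp", "zp"])
```

With three base classifiers and two classes a vote can never tie, so the meta-classifier always has a well-defined answer.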
  • 17. Results: Classifier results
                                   has ZP                 not ZP
      Classifier     Acc.      P      R     F1      P      R     F1
      SMO           0.739  0.684  0.889  0.773  0.841  0.590  0.694
      JRip          0.733  0.709  0.793  0.748  0.765  0.675  0.717
      J48           0.720  0.698  0.777  0.735  0.749  0.663  0.703
      Vote          0.733  0.705  0.802  0.750  0.770  0.665  0.713
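The columns of this table are related to one another, which makes for a quick sanity check. F1 is the harmonic mean of precision and recall, and on a data set with equally many instances per class, accuracy is the mean of the two per-class recalls. The sketch below recomputes both from the reported figures. The values are copied from the table, and the 0.002 tolerance is an assumption that allows for the inputs being rounded to three decimals.

```python
# classifier: (accuracy, (P, R, F1) for "has ZP", (P, R, F1) for "not ZP")
RESULTS = {
    "SMO": (0.739, (0.684, 0.889, 0.773), (0.841, 0.590, 0.694)),
    "JRip": (0.733, (0.709, 0.793, 0.748), (0.765, 0.675, 0.717)),
    "J48": (0.720, (0.698, 0.777, 0.735), (0.749, 0.663, 0.703)),
    "Vote": (0.733, (0.705, 0.802, 0.750), (0.770, 0.665, 0.713)),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def accuracy_from_recalls(recall_pos, recall_neg):
    """Accuracy when both classes have the same number of instances."""
    return (recall_pos + recall_neg) / 2


def max_deviation(results):
    """Largest gap between a reported value and its recomputed counterpart."""
    gaps = []
    for acc, pos, neg in results.values():
        gaps.append(abs(f1(pos[0], pos[1]) - pos[2]))
        gaps.append(abs(f1(neg[0], neg[1]) - neg[2]))
        gaps.append(abs(accuracy_from_recalls(pos[1], neg[1]) - acc))
    return max(gaps)


deviation = max_deviation(RESULTS)
```

SMO has the highest accuracy of the four and the highest recall on the "has ZP" class (0.889), at the cost of the lowest recall on "not ZP" (0.590).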
  • 18. Results: Attribute evaluation
      Attribute        ChiSquare  InfoGain
      Mood               402.546     0.206
      'Se'                25.719     0.012
      Person              21.217     0.010
      Impersonality       12.092     0.007
      Tense                9.371     0.004
      Type                 2.577     0.001
      Number               0.354      1E-4
      Gender                7E-4      3E-7
      Clitic                   0         0
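Both scores in this table measure how strongly an attribute is associated with the class. The sketch below uses the textbook definitions of information gain (with base-2 logarithms) and the chi-square statistic, applied to a contingency table of attribute value against class. The toy counts are invented, and the exact settings of the evaluators behind the table are not stated on the slide.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def info_gain(table):
    """table maps attribute value -> (count with ZP, count without ZP)."""
    rows = list(table.values())
    class_totals = [sum(col) for col in zip(*rows)]
    n = sum(class_totals)
    remainder = sum(sum(row) / n * entropy(row) for row in rows)
    return entropy(class_totals) - remainder


def chi_square(table):
    """Pearson chi-square statistic of the same contingency table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected:
                stat += (observed - expected) ** 2 / expected
    return stat


# Invented toy tables: one perfectly predictive, one uninformative attribute.
predictive = {"a": (10, 0), "b": (0, 10)}
uninformative = {"a": (5, 5), "b": (5, 5)}
scores = {
    "predictive": (chi_square(predictive), info_gain(predictive)),
    "uninformative": (chi_square(uninformative), info_gain(uninformative)),
}
```

An attribute whose values are distributed identically in both classes scores 0 on both measures, as Clitic does in the table.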
  • 19. Results: Error analysis
      Ambiguity:
        E greu fără bani. ("It is hard without money.")
        E greu de scris o carte. ("It is hard to write a book.")
        Se întunecă la ora cinci. ("It gets dark at five o'clock.")
        El se întunecă la față. ("His face darkens.")
      Parser errors
  • 21. Conclusions
      Summary
        RoZP, a corpus with manually annotated ZPs
        Identification of over 70% of ZPs using ML methods
      Outlook
        Improve the identification accuracy
          other features: no. of verbs in sentence
          syntactic information?
        Resolve the identified ZPs
  • 23. Thank you! Questions?