Investigating the Possibilities of Using SMT for Text Annotation

László J. Laki (1, 2)
laki.laszlo@itk.ppke.hu

(1) Pázmány Péter Catholic University, Faculty of Information Technology
(2) MTA-PPKE Language Technology Research Group

This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002
OUTLINE
•   SMT as POS tagger
•   Baseline system
•   Decreasing the size of target vocabulary
•   Handling OOV words
•   Evaluation
•   Conclusion
STATISTICAL MACHINE TRANSLATION
• Frameworks
  – MOSES (Koehn et al., 2007)
  – JOSHUA (Li et al., 2009)
  – SRILM (Stolcke, 2002)
• Corpus
  – Szeged Korpusz 2 (Csendes et al., 2003)
  – 1.2 million words
  – MSD coding system
THE BASELINE SYSTEM
Plain text:
  a konszolidációra való törekvés találkozott a budapest#bank igényeivel is -
  tudjuk meg garadnai#róbert adattárház-menedzsertől .
Reference annotation:
  a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn]
  találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3]
  is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn]
  adattárház-menedzser_[Nc-sb] ._[Punct]
System’s annotation:
  a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn]
  találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3]
  is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn]
  adattárház-menedzsertől ._[Punct]


• Correct annotation: 24557
• Incorrect annotation: 646
• No annotation: 1697

System    BLEU score    Accuracy
MOSES     98.49%        91.29%
JOSHUA    97.31%        91.07%
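
The three counts above add up to 26900 tokens, and 24557 / 26900 is roughly 91.29%, which matches the MOSES accuracy. A minimal Python sketch of how such counts could be collected is given below. It assumes the system output is aligned token by token with the reference and that an annotated token has the form `lemma_[tag]`; the function names are hypothetical and are not taken from the original system.

```python
import re

# An annotated token looks like "lemma_[tag]"; anything else counts as "no annotation".
TOKEN_RE = re.compile(r"^(?P<lemma>.+)_\[(?P<tag>[^\]]+)\]$")


def parse_token(token):
    """Return (lemma, tag), or None when the token carries no annotation."""
    match = TOKEN_RE.match(token)
    return (match.group("lemma"), match.group("tag")) if match else None


def score(reference, hypothesis):
    """Count correct, incorrect and unannotated tokens.

    Assumption: both strings are whitespace-tokenized and aligned one to one.
    """
    counts = {"correct": 0, "incorrect": 0, "none": 0}
    for ref, hyp in zip(reference.split(), hypothesis.split()):
        if parse_token(hyp) is None:
            counts["none"] += 1
        elif hyp == ref:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


def accuracy(counts):
    """Share of correctly annotated tokens among all tokens."""
    total = sum(counts.values())
    return counts["correct"] / total if total else 0.0
```

Applied to the example sentence above, this scoring treats the untagged word `adattárház-menedzsertől` as "no annotation" and the remaining 14 tokens as correct.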
DECREASING THE SIZE OF TARGET VOCABULARY
• With only POS disambiguation
   – Annotate with POS tags only, without lemmatization
      • (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
   – Target vocabulary size: 152694 -> 1128 tokens
   – Accuracy: 91.46% (+0.17%)
• With simplified POS tags
   – Annotate with main POS tags only
      • (e.g. [Vmis3s---n] -> V)
   – Target vocabulary size: 1128 -> 14 tokens
   – Accuracy: 92.20% (+0.91%)
• Conclusion
   – None of the OOV words were tagged (1698 words)
   – Quality increased slightly, at the cost of significant
     information loss
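
The two reductions above can be sketched in Python as simple string transformations on the target side. This is only an illustration: the function names are hypothetical, and keeping non-MSD labels such as `Punct` whole (rather than cutting them to their first letter) is an assumption, not something the slides state.

```python
import re

# Matches "lemma_[tag]" and captures the bracketed tag.
ANNOTATED = re.compile(r"^.+_(\[[^\]]+\])$")

# Assumption: labels that are not MSD codes are kept whole, so that
# "Punct" does not collapse into "P".
NON_MSD_LABELS = {"Punct"}


def to_full_tag(token):
    """Drop the lemma: 'találkozik_[Vmis3s---n]' becomes '[Vmis3s---n]'."""
    match = ANNOTATED.match(token)
    if match is None:
        raise ValueError(f"not an annotated token: {token!r}")
    return match.group(1)


def to_main_pos(full_tag):
    """Keep only the main category: '[Vmis3s---n]' becomes 'V'."""
    code = full_tag.strip("[]")
    return code if code in NON_MSD_LABELS else code[0]


def vocabulary_size(tokens):
    """Number of distinct target-side tokens."""
    return len(set(tokens))
```

Running `vocabulary_size` on the target side of the training data before and after each transformation is one way to obtain figures of the kind reported as target vocabulary size.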
HANDLING OOV WORDS
• OOV words belong to just a few word classes
• Analyze the context of the OOV words
• Create a dictionary based on word frequencies calculated from the
  training set
• Words not included in this dictionary are replaced with the string
  „unk”
• Tested with different thresholds

Token              #
ezt                120
a                  100
kívül              6
diplomáciai        4
magyarországi      4
képességet         2
erőfeszítéseken    2
adhatnák           1

Plain text:
  ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül
  mindenekelőtt a magyarországi multinacionálisok adhatnák .
Modified text:
  ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi
  unk unk .
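
The replacement step above can be sketched in Python as follows. The sketch assumes whitespace-tokenized text and that a word is kept only if it occurs at least `threshold` times in the training set; the function names and the exact cut-off rule are assumptions made for illustration.

```python
from collections import Counter


def count_words(training_sentences):
    """Word frequencies over whitespace-tokenized training sentences."""
    return Counter(word for sentence in training_sentences for word in sentence.split())


def frequent_words(counts, threshold):
    """The dictionary: words seen at least `threshold` times in training."""
    return {word for word, count in counts.items() if count >= threshold}


def replace_rare(sentence, vocabulary, unk="unk"):
    """Replace every word that is missing from the dictionary with `unk`."""
    return " ".join(word if word in vocabulary else unk for word in sentence.split())
```

The same `replace_rare` call would be applied to both the training data and the text to be annotated, so that rare training words and unseen test words end up sharing the single `unk` token.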
HANDLING OOV WORDS
Threshold         Accuracy
                  Original text   Lemmatized text   Multiple thresholds
X<1 (Baseline)    91.46%
                  93.13%          92.57%            93.28%
                  90.40%          92.25%            90.65%
                  88.41%          91.81%            88.62%
                  87.07%          91.48%            87.40%
                  85.97%          91.10%            86.15%

• In the original text
  – Best accuracy: 93.13%
• In case of lemmas
  – Best accuracy: 92.57%
• Multiple thresholds
  – Best accuracy: 93.28%
INTRODUCING POSTFIXES
• Goal: distinguish nouns, verbs, adjectives, etc. among the OOV words
• Different POS types have characteristic postfixes (word endings)
• Use the last characters of the OOV words
  – Last 2, 3, or 4 characters
  – e.g. noun: házból -> unk_ból
         verb: megállítottuk -> unk_tuk
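
A minimal Python sketch of the postfix marking described above, assuming the last `n` characters are appended to the `unk` token with an underscore, as in the two examples. The function names are hypothetical, and the Unicode normalization step is an added precaution so that accented letters count as single characters.

```python
import unicodedata


def unk_with_postfix(word, n, unk="unk"):
    """Replace a word with `unk` plus its last `n` characters."""
    # NFC normalization keeps letters such as "ó" as one code point,
    # so slicing does not split a letter from its accent.
    word = unicodedata.normalize("NFC", word)
    return f"{unk}_{word[-n:]}"


def mark_oov(sentence, vocabulary, n):
    """Keep in-vocabulary words; rewrite all others as unk_<postfix>."""
    return " ".join(
        word if word in vocabulary else unk_with_postfix(word, n)
        for word in sentence.split()
    )
```

With `n = 3`, `unk_with_postfix("házból", 3)` gives `unk_ból` and `unk_with_postfix("megállítottuk", 3)` gives `unk_tuk`, as in the examples above.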
INTRODUCING POSTFIXES
Threshold         Accuracy by number of characters kept
                  2         3         4
X<1 (Baseline)    91.46%    91.46%    91.46%
                  95.17%    95.83%    95.96%
                  94.17%    95.32%    95.90%
                  93.48%    94.97%    95.73%
                  92.94%    94.70%    95.60%
                  92.61%    94.55%    95.55%
EVALUATION
• Baseline:
  – Choose the best
• PurePos:
  – Maxent and HMM based
  – Includes morphological disambiguation
• OpenNLP:
  – Maxent based
  – Perceptron based

Only POS tagging
System                      Token accuracy   Sentence accuracy
Baseline (BL)               89.66%           25.27%
SMT-Baseline2               91.46%           34.53%
SMT-OOV-postfix             95.96%           56.47%
PurePos                     96.03%           55.87%
PurePos-MorphTable          97.29%           66.40%
OpenNLP Maxent (ONM)        95.28%           26.00%
OpenNLP Perceptron (ONP)    94.98%           26.67%

POS tagging + lemmatization
System                      Token accuracy   Sentence accuracy
SMT-Baseline1               91.29%           33.73%
PurePos                     83.92%           10.00%
PurePos-MorphTable          84.89%           11.60%
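
The slides do not define the two metrics reported in this evaluation. The Python sketch below assumes the usual reading: token accuracy is the share of correctly tagged tokens, and sentence accuracy is the share of sentences in which every token is tagged correctly. The function name is hypothetical.

```python
def token_and_sentence_accuracy(reference, predicted):
    """Return (token accuracy, sentence accuracy).

    Both arguments are lists of sentences, each sentence a list of tags.
    Assumption: a sentence counts as correct only if all of its tags match.
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("need the same non-zero number of sentences")
    tokens = correct_tokens = correct_sentences = 0
    for ref_tags, hyp_tags in zip(reference, predicted):
        if len(ref_tags) != len(hyp_tags):
            raise ValueError("sentence lengths differ")
        hits = sum(r == h for r, h in zip(ref_tags, hyp_tags))
        tokens += len(ref_tags)
        correct_tokens += hits
        correct_sentences += hits == len(ref_tags)
    if tokens == 0:
        raise ValueError("no tokens to evaluate")
    return correct_tokens / tokens, correct_sentences / len(reference)
```

Under this definition a single wrong tag makes the whole sentence count as wrong, which is why sentence accuracy can be far lower than token accuracy.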
CONCLUSION
• An SMT system was examined for part-of-speech disambiguation and
  lemmatization in Hungarian
• Fully automatic system
• Best accuracy: about 96%
• Decreasing the size of the target vocabulary
• Handling OOV words
THANK YOU FOR YOUR ATTENTION

     laki.laszlo@itk.ppke.hu
