Grammatical Agreement in SMT
Seminar Sprach-zu-Sprach-Übersetzung (Speech-to-Speech Translation), SS 2013 (summer semester)
Simon Hummel, Institut für Anthropomatik, Lehrstuhl Prof. Waibel, 24.06.13
Inflection and Agreement
Inflection
– Modification of a word
– Signals grammatical variants (tense, gender, case, …)
– e.g. walk vs. walked
Agreement
– Inflections of related words in a sentence have to agree
– e.g. das Haus (correct) vs. *die Haus (article has the wrong gender)
Some languages are weakly inflected (e.g. English)
Some are highly inflected (e.g. German, Arabic, …)
Agreement Errors
Local Agreement Errors
– Ref (gloss): the-car_F go_F with-speed
– Hypo (gloss): the-car_F go_M with-speed
Long-distance Agreement Errors
– Ref: celle qui parle , c’est ma femme
  (gloss: one_F who speak , is my wife_F)
– Hypo: celui qui parle est ma femme
  (gloss: one_M who speak is my spouse_F)
Approaches for SMT
Morphological Generation
– Produce uninflected stems first, then modify each with a predicted inflection
Agreement Constraints
– Use a synchronous context-free grammar (SCFG) and add constraints on its target side
Class-based Agreement Model
– Use morphological word classes such as “Noun+Def+Sg+Fem”
Morphological Generation: Idea
“Generating Complex Morphology for Machine Translation” (Minkov and Toutanova, 2007)
Convert the MT output to a stem sequence
Predict an inflection for every stem
The predicted inflections should reflect the meaning and comply with agreement rules
Morphological Generation: Lexicons
Lexicons support morphological analysis and generation
Operations:
– Stemming
– Inflection
– Morphological analysis
Lexicons can be created manually or automatically from data
Here: assumed as given
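The three lexicon operations can be pictured as lookups in a single table of (surface form, stem, analysis) entries. The sketch below is only an illustration under that assumption: the class name `ToyLexicon`, the English entries and the tag strings are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Entry = Tuple[str, str, str]  # (surface form, stem, morphological analysis)


class ToyLexicon:
    """Hypothetical in-memory lexicon offering the three operations."""

    def __init__(self, entries: Iterable[Entry]) -> None:
        self._stems: Dict[str, Set[str]] = defaultdict(set)
        self._forms: Dict[str, Set[str]] = defaultdict(set)
        self._analyses: Dict[str, Set[str]] = defaultdict(set)
        for form, stem, analysis in entries:
            self._stems[form].add(stem)
            self._forms[stem].add(form)
            self._analyses[form].add(analysis)

    def stems(self, word: str) -> Set[str]:
        """Stemming: all stems the word may belong to."""
        return set(self._stems.get(word, set()))

    def inflections(self, stem: str) -> Set[str]:
        """Inflection: all surface forms that share this stem."""
        return set(self._forms.get(stem, set()))

    def analyses(self, word: str) -> Set[str]:
        """Morphological analysis: all analyses of the word."""
        return set(self._analyses.get(word, set()))


# Illustrative entries only; the tag names are made up for this sketch.
lexicon = ToyLexicon([
    ("walk", "walk", "Verb+Pres"),
    ("walks", "walk", "Verb+Pres+3Sg"),
    ("walked", "walk", "Verb+Past"),
    ("walked", "walk", "Verb+PastPart"),
])

print(lexicon.stems("walked"))
print(lexicon.inflections("walk"))
print(lexicon.analyses("walked"))
```

An ambiguous form such as "walked" returns several analyses, which is why the operations return sets rather than single values.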
Morphological Generation: Inflection Prediction
Maximum Entropy Markov model (2nd order)
Features:
– Monolingual
– Bilingual
– Lexical
– Morphological
– Syntactic
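$p(\bar{y} \mid \bar{x}) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, y_{t-2}, x_t), \quad y_t \in I_t$
($I_t$: set of candidate inflections for the stem at position $t$)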
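A second-order model of this form can be searched exactly with Viterbi decoding over pairs of previous inflections. The following is a minimal sketch, not the authors' implementation: `decode`, `toy_local_prob`, the `GENDER` table and the Spanish example are all assumptions made for illustration, and the local model uses one hand-set agreement feature instead of trained weights.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

START = "<s>"  # padding for the two positions before the sentence

LocalProb = Callable[[str, str, str, str, Sequence[str]], float]


def decode(stems: Sequence[str],
           candidates: Sequence[Sequence[str]],
           local_prob: LocalProb) -> List[str]:
    """Return the inflection sequence maximising prod_t p(y_t | y_{t-1}, y_{t-2}, x_t)."""
    # Viterbi state = the last two chosen inflections (y_{t-2}, y_{t-1}).
    best: Dict[Tuple[str, str], Tuple[float, List[str]]] = {(START, START): (0.0, [])}
    for stem, options in zip(stems, candidates):
        new_best: Dict[Tuple[str, str], Tuple[float, List[str]]] = {}
        for (y_prev2, y_prev1), (score, seq) in best.items():
            for y in options:  # y ranges over the candidate set I_t
                p = local_prob(y, y_prev1, y_prev2, stem, options)
                if p <= 0.0:
                    continue
                total = score + math.log(p)
                key = (y_prev1, y)
                if key not in new_best or total > new_best[key][0]:
                    new_best[key] = (total, seq + [y])
        best = new_best
    if not best:
        raise ValueError("no inflection sequence has non-zero probability")
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical gender table standing in for lexicon-derived features.
GENDER = {"casa": "F", "blanco": "M", "blanca": "F", "bonito": "M", "bonita": "F"}


def toy_local_prob(y: str, y_prev1: str, y_prev2: str,
                   stem: str, options: Sequence[str]) -> float:
    """Locally normalised toy model: reward gender agreement with the two
    previous inflections. The stem is ignored here; a real model would
    draw lexical, morphological and syntactic features from it."""
    def score(option: str) -> float:
        return sum(2.0 for prev in (y_prev1, y_prev2)
                   if prev in GENDER and GENDER[prev] == GENDER.get(option))
    z = sum(math.exp(score(o)) for o in options)
    return math.exp(score(y)) / z


result = decode(
    ["casa", "blanco", "bonito"],
    [["casa"], ["blanco", "blanca"], ["bonito", "bonita"]],
    toy_local_prob,
)
print(result)
```

Because each local distribution is normalised over the candidates at its own position, a word with a single candidate cannot influence earlier choices; that information has to come from features on the input.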
Morphological Generation: Evaluation
English-Russian and English-Arabic
Technical (software manual) domain
Input: aligned sentence pairs from reference translations rather than MT system output, to reduce noise
Accuracy (%) results
Morphological Generation: Conclusion
Needed resources:
– Large corpus of aligned sentence pairs
– Lexicons (source and target) with the three operations
+ Better accuracy than simple LM (even with small training data)
+ Easy to add to existing MT system
- Expensive creation of lexicons
Constraints: Idea
“Agreement Constraints for Statistical Machine Translation into German” (Williams and Koehn, 2011)
String-to-tree model
Synchronous grammar for target language
Adding learned constraints and probabilities
Evaluation of constraints during decoding
Constraints: Feature Structure
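Feature structure: set of attribute-value pairs describing a word or phrase (e.g. case, gender, number)
Unification: merges two feature structures into one that carries the information of both; fails if their values conflict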
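Unification can be sketched with nested dictionaries as feature structures. This is a simplified illustration rather than the representation used in the paper: the function names `unify` and `unify_all`, the attribute names, and the small entries for "die", "Katze" and "Haus" are assumptions, and ambiguity is modelled as a plain list of alternative structures.

```python
from typing import Any, Dict, List, Optional

FeatureStructure = Dict[str, Any]  # values are atoms (str) or nested structures


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Merge two feature structures; return None if any value conflicts."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val  # information only b carries
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            sub = unify(a_val, b_val)  # recurse into nested structures
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            return None  # conflicting atomic values
    return result


def unify_all(alts_a: List[FeatureStructure],
              alts_b: List[FeatureStructure]) -> List[FeatureStructure]:
    """Unify every pair of alternatives and keep the ones that succeed."""
    unified = []
    for a in alts_a:
        for b in alts_b:
            merged = unify(a, b)
            if merged is not None:
                unified.append(merged)
    return unified


# Hypothetical, heavily simplified entries (not the full German paradigm).
DIE = [
    {"case": "nom", "gender": "f", "num": "sg"},
    {"case": "acc", "gender": "f", "num": "sg"},
    {"case": "nom", "num": "pl"},
    {"case": "acc", "num": "pl"},
]
KATZE = [{"gender": "f", "num": "sg"}]
HAUS = [{"gender": "n", "num": "sg"}]

print(unify_all(DIE, KATZE))  # two readings survive
print(unify_all(DIE, HAUS))   # no reading survives: agreement violation
```

An empty result is what lets a decoder reject or penalise a hypothesis such as "die Haus".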
Constraints: Grammar
Synchronous grammar learned from parallel corpus
Extended by constraints at target-side
Sample rule/constraint:
NP-SB → the X_1 cat | die AP_1 Katze
Constraints: Training
Propagation rules to capture NP/PP agreement:
Applied bottom-up
Constraints: Decoding
Model:
$\hat{t} = \arg\max_{t} p(t \mid s)$
$p(t \mid s) = \frac{1}{Z} \sum_{i=1}^{n} \lambda_i h_i(s, t)$
Every element of a rule/constraint has a feature structure
Constraint evaluation: each hypothesis stores the set of feature structures corresponding to its root rule element
Recombination of hypotheses is possible
Constraints: Evaluation
English-German
Europarl and News Commentary
Parsing: BitPar; Alignment: GIZA++; SCFG rules: Moses toolkit
Treebank for target
Grammar: ~140 m rules
BLEU scores and p-values for three test sets
Constraints: Conclusion
Needed resources:
– Parallel corpus
– Heuristics for constraint extraction
+ Improvement in translation accuracy
- Improvement is quite small
Class-Based: Idea
“A Class-Based Agreement Model for Generating Accurately Inflected Translations” (Green and DeNero, 2012)
Target-side model, applied during decoding
Three steps:
1. Segmentation
2. Tagging
3. Scoring
Class-Based: Segmentation
Train a conditional random field (CRF) over characters
Features:
– Centered 5-character window
Applied during decoding, not as a preprocessing step
Labels:
– I: Continuation (Inside)
– O: Outside (whitespace)
– B: Beginning
– F: Non-native chars
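Once every character has a label, the segments can be read off directly. The sketch below assumes the labels are already predicted (no CRF is trained here); the function name `labels_to_segments` is hypothetical, and treating a run of F characters as one segment of its own is an assumption of this sketch. The example word is written in a Latin transliteration for readability.

```python
from typing import List, Sequence

VALID_LABELS = {"I", "O", "B", "F"}


def labels_to_segments(chars: str, labels: Sequence[str]) -> List[str]:
    """Turn per-character labels into a list of segments."""
    if len(chars) != len(labels):
        raise ValueError("need exactly one label per character")
    segments: List[str] = []
    current = ""
    prev_label = "O"
    for ch, label in zip(chars, labels):
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label: {label}")
        # A segment ends at whitespace (O), at a new beginning (B), or when
        # the text switches into or out of a run of non-native characters (F).
        boundary = label in {"O", "B"} or (label == "F") != (prev_label == "F")
        if boundary and current:
            segments.append(current)
            current = ""
        if label != "O":  # whitespace itself is never part of a segment
            current += ch
        prev_label = label
    if current:
        segments.append(current)
    return segments


# "wktAbhm" (roughly "and their book") split into conjunction, noun, pronoun.
segments = labels_to_segments("wktAbhm", ["B", "B", "I", "I", "I", "B", "I"])
print(segments)
```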
Class-Based: Tagging
Train CRF on full sentences with gold classes
Features:
– Current and previous words, affixes, etc.
Labels:
– Morphological classes (gender, number, person, definiteness)
– e.g. 89 classes for Arabic
Example:
– 'the car' is tagged “Noun+Def+Sg+Fem”
Class-Based: Scoring
Tagger scores are conditioned on each hypothesis's own word sequence, so they are not comparable across hypotheses
→ Score the predicted class sequence with a generative model instead
Simple bigram LM over gold class sequences (add-1 smoothed)
$\tau' = \arg\max_{\tau} p(\tau \mid \hat{s})$
$q(e) = p(\tau') = \prod_{i=1}^{I} p(\tau'_i \mid \tau'_{i-1})$
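The scoring step amounts to an add-one smoothed bigram model over class sequences. The sketch below is a minimal version under stated assumptions: the class `BigramClassLM`, the start symbol and the two training sequences are hypothetical, and only “Noun+Def+Sg+Fem” is a class name from the slides.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple

START = "<s>"  # stands in for the class before the first position


class BigramClassLM:
    """Add-one smoothed bigram model over morphological class sequences."""

    def __init__(self, gold_sequences: Iterable[Sequence[str]]) -> None:
        self.history: Counter = Counter()   # how often each class was a history
        self.bigrams: Counter = Counter()   # counts of (previous, current) pairs
        self.vocab: Set[str] = set()
        for seq in gold_sequences:
            self.vocab.update(seq)
            padded: List[str] = [START] + list(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.history[prev] += 1
                self.bigrams[(prev, cur)] += 1

    def prob(self, cur: str, prev: str) -> float:
        """p(cur | prev) with add-one smoothing over the class vocabulary."""
        return (self.bigrams[(prev, cur)] + 1) / (self.history[prev] + len(self.vocab))

    def log_score(self, seq: Sequence[str]) -> float:
        """log q(e) = sum_i log p(tau_i | tau_{i-1})."""
        padded = [START] + list(seq)
        return sum(math.log(self.prob(cur, prev))
                   for prev, cur in zip(padded, padded[1:]))


# Hypothetical gold class sequences; a real model is trained on a treebank.
lm = BigramClassLM([
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Masc", "Verb+Sg+Masc"],
])

agree = ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"]
disagree = ["Noun+Def+Sg+Fem", "Verb+Sg+Masc"]
print(lm.log_score(agree) > lm.log_score(disagree))
```

Since the model sees only classes and not words, its scores can be compared across hypotheses that use different words.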
Class-Based: Evaluation
English-Arabic
Training data: variety of sources (e.g. web)
Development and Test: NIST sets (newswire and mixed genre [broadcast news, newsgroups, weblog])
Phrase-based decoder
BLEU score for newswire sets
BLEU score for mixed genre sets
Class-Based: Conclusion
Needed resources:
– Treebank for target (existing for many languages)
– Large target corpus
+ Improves translation quality
+ Easy to integrate in existing MT system
- Increases decoding time
- Not very good for mixed genres
References
Green, S. and DeNero, J. (2012). “A Class-Based Agreement Model for Generating Accurately Inflected Translations”. In: ACL.
Williams, P. and Koehn, P. (2011). “Agreement Constraints for Statistical Machine Translation into German”. In: Sixth Workshop on Statistical Machine Translation.
Minkov, E. and Toutanova, K. (2007). “Generating Complex Morphology for Machine Translation”. In: ACL.
