WHEN MULTIWORDS GO BAD
IN MACHINE TRANSLATION
Anabela Barreiro, L2F - INESC-ID, Portugal
Johanna Monti, University of Sassari, Italy
Brigitte Orliac, Logos Institute, USA
Fernando Batista, L2F - INESC-ID and ISCTE, Portugal
OUTLINE
• Introduction
• Multiwords in NLP and MT
• RBMT and SMT approaches to multiword processing
• Multiword Assessment
– Corpus
– Multiword taxonomy
– Error categorization for multiword translations
– Quantitative results
– Analysis of relevant problems
• OpenLogos (OL) semantico-syntactic rules for multiword translation precision
• Main conclusions and future work
INTRODUCTION
• MT has become popular, widespread and useful to society
• Every internet user is now using MT (with or without knowing it!)
• However, linguistic quality is still a serious problem, as translations contain morphological, syntactic and semantic errors
• Successful processing of multiword units (MWU) still represents one of the most significant linguistic challenges for MT systems
MULTIWORDS IN NLP AND MT
• Crucial role in NLP
• Essential in MT
– Infeasible to include all MWU in dictionaries
– Poor syntactic and semantic analysis reduces the performance of NLP systems
– Fragmentation of any part of an MWU leads to generation errors
– Incorrect MWU generation has a negative impact on the understandability and quality of the translated text
CRITICAL PROBLEMS FOR MULTIWORD PROCESSING
• Lack/degree of compositionality
• Constituent dependencies
– contiguous (adjacent): no inserts
– non-contiguous (remote): inserts
• Morphosyntactic variations
Freely available RBMT and SMT systems fail at translating MWU:
– RBMT systems fail for lack of MWU coverage
– SMT systems fail for lack of the linguistic (semantico-syntactic) knowledge needed to process them, which leads to structural problems
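The difference between contiguous and non-contiguous MWU can be illustrated with a small pattern matcher. This is only a sketch: the function name `mwu_pattern`, the limit of three inserted words, and the two test sentences are assumptions made for illustration, and real MWU detection needs far more linguistic knowledge than a regular expression.

```python
import re

def mwu_pattern(first: str, second: str, max_inserts: int = 3) -> re.Pattern:
    """Build a pattern for `first ... second` allowing up to `max_inserts` words in between."""
    gap = r"(?:\s+\w+){0,%d}" % max_inserts  # the optional inserts
    return re.compile(
        rf"\b{re.escape(first)}{gap}\s+{re.escape(second)}\b",
        re.IGNORECASE,
    )

# "have [already] shown" is one of the compound verb examples in the taxonomy.
have_shown = mwu_pattern("have", "shown")

# Hypothetical sentences: one contiguous occurrence, one with an insert.
contiguous_match = have_shown.search("They have shown the results.").group(0)
remote_match = have_shown.search("They have already shown the results.").group(0)

print(contiguous_match)
print(remote_match)
```

A matcher that only looks for the adjacent string "have shown" would miss the second sentence, which is the fragmentation risk that non-contiguous MWU create.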
MT APPROACHES TO MULTIWORD PROCESSING
• Importance of a correct processing of MWU so that they can be translated correctly by MT systems
– Sag et al., 2001; Thurmair, 2004; Rayson et al., 2010; Monti, 2013
• Solutions to resolve MWU translation problems:
– Use of generative dependency grammars with features (Diaconescu, 2004)
– Grouping bilingual MWU before performing statistical alignment (Lambert and Banchs, 2006)
– Paraphrasing MWU (Barreiro, 2010)
MULTIWORD PROCESSING IN RBMT
• Lexical approach
– MWU as single lemmata in dictionaries
– Suitable for contiguous compounds
• Compositional approach
– MWU processing by means of POS tagging and syntactic analysis of
its different components
– Suitable for compounds not coded in the dictionary and for verbal
constructions
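As a rough illustration of the lexical approach, the sketch below stores MWU as single dictionary entries and finds them with a greedy longest-match lookup. The dictionary contents reuse examples from the taxonomy in this deck, but the data structure, the function `tag_mwu` and the test sentence are hypothetical and are not taken from any of the systems discussed.

```python
# Hypothetical toy dictionary: each MWU is coded as one entry (the lexical approach).
MWU_DICTIONARY = {
    ("air", "conditioning"): "COMPN",
    ("union", "spokesman"): "COMPN",
    ("in", "front", "of"): "PREPADV",
}
MAX_LEN = max(len(key) for key in MWU_DICTIONARY)

def tag_mwu(tokens):
    """Greedy longest-match lookup; tokens outside any entry are returned with tag None."""
    result, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word candidates.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWU_DICTIONARY:
                result.append((" ".join(tokens[i:i + n]), MWU_DICTIONARY[candidate]))
                i += n
                break
        else:
            # No multiword entry starts here: keep the single token.
            result.append((tokens[i], None))
            i += 1
    return result

tagged = tag_mwu("The union spokesman stood in front of the building".split())
print(tagged)
```

The lookup only works for contiguous sequences that are already in the dictionary, which matches the slide's point that uncoded compounds and verbal constructions call for the compositional approach.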
MULTIWORD PROCESSING IN SMT
• Traditional approach to word alignment
• Brown et al., 1993
– Inability to handle many-to-many correspondences
• Current state-of-the-art phrase-based SMT systems
• Koehn et al., 2003
– The correct translation of an MWU occurs if its constituents are
marked and aligned as parts of consecutive phrases in the training set
– Phrases are defined as sequences of contiguous words (n-grams) with
limited linguistic information (mostly syntactic)
• will stay - linguistically meaningful
• that he - no linguistic significance
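To make the notion of a "phrase" concrete, here is a minimal sketch that enumerates contiguous word sequences from a sentence. The function `contiguous_phrases` and the sample sentence are assumptions for illustration; actual phrase-based systems extract phrase pairs from word-aligned parallel text, which this sketch does not attempt.

```python
def contiguous_phrases(tokens, max_len=2):
    """Return every contiguous word sequence (n-gram) of up to `max_len` words."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical sentence containing both example phrases from the slide.
sentence = "she said that he will stay".split()
bigrams = [p for p in contiguous_phrases(sentence) if " " in p]
print(bigrams)
```

Both "will stay" and "that he" come out of the same enumeration, since contiguity alone does not separate linguistically meaningful sequences from arbitrary ones.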
MULTIWORD PROCESSING IN SMT
• MWU processing and translation in SMT as a problem of:
– Automatically learning and integrating translations
– Word sense disambiguation
– Word alignment
• Barreiro et al., 2013
• Integration of phrase-based models with linguistic knowledge:
– Identification of possible monolingual MWU
• Wu et al., 2008, Okita et al., 2010
– Integration of bilingual domain MWU in SMT
• Ren et al., 2009
– Incorporation of machine-readable dictionaries and glossaries, treating
these resources as phrases in the phrase table
• Okuma et al., 2008
– Identification and grouping of MWU prior to statistical alignment
• Lambert and Banchs, 2006
OUR WORK
• Linguistic analysis and error categorization of the MWU translations
• 2 MT systems:
– OpenLogos (OL) RBMT
– Google Translate (GT) SMT
• 3 language pairs: EN-FR, EN-IT and EN-PT
• Analysis of MWU translations by 3 MT expert linguists
• MWU taxonomy to evaluate MWU (in any system, independently of the approach)
• OpenLogos solution to MWU processing in MT
METHODOLOGY
• Created a corpus of MWU
• Translated sentences with the MWU into FR, IT and PT using the OL and GT systems
The purpose of our work WAS NOT to compare and evaluate systems
The purpose of our work WAS to assess and measure the quality of MWU translation independently of the two systems considered
• Developed an empirically-driven taxonomy for MWU
• Analysed the MWU translation errors based on this taxonomy
– The different errors were categorized by MT expert linguists of the respective target languages
CORPUS
• 150 English sentences - news and internet
• Average of ~4-5 MWU per sentence
• The corpus was divided into 3 sets of 50 sentences translated for each
language pair by the 2 systems
• 3 native-speaker linguists, one per target language, each reviewed 50
sentences and evaluated the MWU translations for their language,
classifying the translations according to a binary evaluation metric:
– OK for correct translations
– ERR for incorrect translations
• None of the systems was specifically trained for the task - texts were
not domain specific
MULTIWORD TAXONOMY
Type | Subtype | Acronym | Example
VERB | Compound Verb | COMPV | may have been done, have [already] shown
VERB | Support Verb Construction | SVC | make a presentation, be meaningful, have [particularly good] links, give an illustration of, be the ADV cause of, fall [so far] short of, take a seat; to play a [very important] role
VERB | Prepositional Verb | PREPV | deal with, give N to
VERB | Phrasal Verb | PHRV | closing down, make N up, slow down to; stand up to, mix N up with
VERB | Other Verbal Expression | VEXPR | in trying to, hold N in place
NOUN | Compound Noun | COMPN | union spokesman, constraint-based grammar, air conditioning
NOUN | Prepositional Noun | PREPN | interest in, right side of
ADJECTIVE | Compound Adjective | COMPADJ | cost-cutting
ADJECTIVE | Prepositional Adjective | PREPADJ | famous for; similar to
ADVERB | Compound Adverb | COMPADV | in a fast way, most notably, last time
ADVERB | Prepositional Adverb | PREPADV | in front of
DETERMINER | Compound Determiner | COMPDET | certain of these
DETERMINER | Prepositional Determiner | PREPDET | most of
CONJUNCTION | Compound Conjunction | COMPCONJ | in order to, as a result of, rather than
PREPOSITION | Compound Preposition | COMPPREP | as part of
OTHER EXPRESSION | Named Entity | NE | Economic Council
OTHER EXPRESSION | Idiomatic Expression | IDIOM | get to the bottom of the situation, purr like a cat; for goodness’ sake
OTHER EXPRESSION | Lexical Bundle | BUNDLE | I believe that, as much if not more than, if I were you
RESULTS
• The results shed some light on the demand for higher precision MWU translation
• MWU occur frequently in our corpus, often several times within the same sentence:
Witnesses said the speeding car may have been playing tag with another vehicle when it veered into the southbound lane occupied by Lopez' truck shortly before 8 p.m. Sunday
– may have been playing tag with - COMPV - idiomatic PREPSVC
– veered into - PREPV
– southbound lane - COMPN
– 8 p.m. Sunday - double temporal expression (time + date)
QUANTITATIVE RESULTS
Correct and incorrect MWU translations
System | Lang pair | OK | ERR | Total
OL | EN-FR | 40 | 48 | 88
OL | EN-IT | 36 | 83 | 119
OL | EN-PT | 60 | 96 | 156
OL | Total | 136 | 227 | 363
GT | EN-FR | 70 | 38 | 108
GT | EN-IT | 59 | 47 | 106
GT | EN-PT | 67 | 47 | 114
GT | Total | 196 | 132 | 328
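The OK/ERR counts above can be turned into percentages with a few lines of arithmetic. The sketch below copies the counts from the table; the variable and function names are illustrative only, and the output is meant to describe each row rather than to rank the systems, in line with the stated purpose of the study.

```python
# (OK, ERR) counts copied from the table of correct and incorrect MWU translations.
results = {
    "OL": {"EN-FR": (40, 48), "EN-IT": (36, 83), "EN-PT": (60, 96)},
    "GT": {"EN-FR": (70, 38), "EN-IT": (59, 47), "EN-PT": (67, 47)},
}

def ok_rate(ok, err):
    """Share of MWU translations judged OK, as a percentage."""
    return 100.0 * ok / (ok + err)

totals = {}
for system, pairs in results.items():
    for pair, (ok, err) in pairs.items():
        print(f"{system} {pair}: {ok_rate(ok, err):.1f}% OK out of {ok + err}")
    total_ok = sum(ok for ok, _ in pairs.values())
    total_err = sum(err for _, err in pairs.values())
    totals[system] = (total_ok, total_err)
    print(f"{system} total: {ok_rate(total_ok, total_err):.1f}% OK out of {total_ok + total_err}")
```

The recomputed totals agree with the Total rows of the table (136/227 for OL and 196/132 for GT).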
QUANTITATIVE RESULTS
Performance for the 3 most frequent MWU
Lang pair | Type | OL Ok | OL Error | GT Ok | GT Error
EN-FR | VERB | 17 | 21 | 27 | 12
EN-FR | COMPN | 8 | 10 | 13 | 18
EN-FR | NE | 6 | 4 | 16 | 4
EN-IT | COMPN | 14 | 39 | 26 | 21
EN-IT | VERB | 10 | 12 | 6 | 15
EN-IT | NE | 2 | 8 | 14 | 2
EN-PT | VERB | 30 | 21 | 11 | 23
EN-PT | COMPN | 28 | 12 | 18 | 17
EN-PT | NE | 11 | 26 | 9 | 9
MULTIWORDS “GOING BAD” IN FRENCH
General language or domain-specific COMPN - 32.5%
• hit-run driver
– pilot hit run
– chauffeur/conducteur ayant commis un délit de fuite
• nuclear fuel cycle
– cycle de combustible nucléaire
– cycle de combustion nucléaire
SVC - 18.6%
• is a bit misleading (adjectival)
– est un égarement de morceau
– est quelque peu trompeur
• it has [wide] applicability (nominal)
– il a l’applicabilité large
– il a de nombreuses possibilités d’application
LOGOS APPROACH TO MULTIWORD TRANSLATION
Main linguistic knowledge bases of the LOGOS system:
• Dictionaries
• Semantico-syntactic rules - analysis, transfer and generation
• Semantic Table SEMTAB - language-pair specific rules
– Analysis and translation of words in their context
– invoked after dictionary look-up and during the execution of target
transfer rules to solve analysis and lexical ambiguity problems
• verb dependencies - different verb argument structures
– speak to
– speak against
– speak of
– speak on N (radio, TV, television, etc.)
• MWU of different nature
LOGOS APPROACH TO MULTIWORD TRANSLATION
SAL - Semantico-syntactic Abstraction Language
– Taxonomy: 3 levels organized hierarchically:
• Supersets / Sets / Subsets
– Semantico-syntactic continuum from natural language (NL) word to Word Class
• Literal word: airport
• Head morph: port
• SAL Subset: Agfunc (agentive functional location)
• SAL Set: func (functional location)
• SAL Superset: PL (place)
• Word Class: N
– SAL combines both the lexical and the compositional approaches in order to process different types of MWU
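The continuum from literal word to word class for "airport" can be pictured as a small record with one field per level. This is a sketch under stated assumptions: the class `SalEntry` and the helper `most_specific_match` are hypothetical, and the idea of picking the most specific level that a rule mentions is an illustration of how such a hierarchy could be used, not a description of OpenLogos internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SalEntry:
    """One word described at every level of the continuum shown on the slide."""
    literal: str
    head_morph: str
    subset: str
    sal_set: str
    superset: str
    word_class: str

    def levels(self):
        """Levels ordered from most specific (literal word) to most abstract (word class)."""
        return [self.literal, self.head_morph, self.subset,
                self.sal_set, self.superset, self.word_class]

# Values copied from the "airport" example.
airport = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")

def most_specific_match(entry, rule_keys):
    """Return the most specific level of `entry` that appears in `rule_keys`, if any."""
    for level in entry.levels():
        if level in rule_keys:
            return level
    return None

matched = most_specific_match(airport, {"PL", "N"})
print(matched)
```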
RESOLUTION OF POLYSEMY
NL String | SEMTAB Rule | Portuguese Transfer
raise a child | V(‘raise’) N(ANdes) | criar...
raise corn | V(‘raise’) N(MAedib) | cultivar...
raise the rent | V(‘raise’) N(MEabs) | aumentar...
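The three rules above can be mimicked with a lookup keyed on the verb and the semantic code of its object. This is a deliberately simplified stand-in, not the actual SEMTAB rule format: the noun lexicon `NOUN_SAL`, the table `SEMTAB_RAISE` and the function `transfer_verb` are hypothetical names, and only the codes and Portuguese verbs come from the slide.

```python
# Hypothetical noun lexicon: each noun mapped to the SAL code used in the slide's rules.
NOUN_SAL = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}

# Simplified stand-in for the rules: (verb, SAL code of the object) -> Portuguese verb.
SEMTAB_RAISE = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

def transfer_verb(verb, obj):
    """Pick the Portuguese transfer for `verb` from the semantic code of its object."""
    sal_code = NOUN_SAL.get(obj)
    return SEMTAB_RAISE.get((verb, sal_code))  # None when no rule applies

for noun in ("child", "corn", "rent"):
    print(f"raise {noun} -> {transfer_verb('raise', noun)}")
```

Keying the rule on a semantic code rather than on the literal noun is what lets one rule cover every noun that shares that code.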
DEEP STRUCTURE RULES OF SEMTAB
A single deep-structure rule matches multiple surface structures
and produces correct target transfers
Source (EN) | Target (PT) | Surface structure
he raised the rent | ele aumentou a renda | V+Object
the raising of the rent | o aumento da renda | Gerund
the rent, raised by … | a renda, aumentada por… | Part. ADJ
a rent raise | um aumento de renda | Noun
CONCLUSIONS
• MWU are problematic for MT systems independently of the approach
• Literal translations lead to unclear/incorrect translations or loss of meaning
• Correct identification and analysis of source language MWU is a challenging task, but it is the starting point for higher quality MT
– Linguistic quality evaluation metrics
– Systematic categorization of errors by MT expert linguists
– Specific corpora for MWU evaluation
• The OpenLogos approach to MWU processing uses semantico-syntactic rules, which can contribute to MWU translation quality for any language pair
FUTURE WORK
• Research on how OpenLogos linguistic knowledge (SEMTAB) can be applied to an SMT system to correct MWU errors, and vice versa
• Successful combination of linguistic PRECISION (OL approach) and COVERAGE (GT approach) in resolving the MWU problem
– evolution in the MT field
• Successful integration of semantico-syntactic knowledge in SMT
– solution for achieving high quality MT
– Accomplishing this task requires a combination of expertise in MT technology and deep linguistic knowledge, which is also needed to address the reverse research avenue: integrating SMT technology/processes into RBMT to advance MT
Thank you!
This work was supported by Fundação para a Ciência e Tecnologia (Portugal) through
Anabela Barreiro’s post-doctoral grant SFRH/BPD/91446/2012 and project PEst-OE/EEI/LA0021/2013.

Más contenido relacionado

Similar a When Multiwords Go Bad in Machine Translation

LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
Lifeng (Aaron) Han
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
Lifeng (Aaron) Han
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 

Similar a When Multiwords Go Bad in Machine Translation (20)

Barreiro-Batista-LR4NLP@Coling2018-presentation
Barreiro-Batista-LR4NLP@Coling2018-presentationBarreiro-Batista-LR4NLP@Coling2018-presentation
Barreiro-Batista-LR4NLP@Coling2018-presentation
 
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Machine Translation of Discontinuous Multiword Units
Machine Translation of Discontinuous Multiword UnitsMachine Translation of Discontinuous Multiword Units
Machine Translation of Discontinuous Multiword Units
 
Make it simple with paraphrases: Automated paraphrasing for authoring aids an...
Make it simple with paraphrases: Automated paraphrasing for authoring aids an...Make it simple with paraphrases: Automated paraphrasing for authoring aids an...
Make it simple with paraphrases: Automated paraphrasing for authoring aids an...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
Anabela Barreiro - Alinhamentos
Anabela Barreiro - AlinhamentosAnabela Barreiro - Alinhamentos
Anabela Barreiro - Alinhamentos
 
Cross language alignments - challenges guidelines and gold sets
Cross language alignments - challenges guidelines and gold setsCross language alignments - challenges guidelines and gold sets
Cross language alignments - challenges guidelines and gold sets
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel CorporaMultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
 
Effectof morphologicalsegmentation&de segmentationonmachinetranslation
Effectof morphologicalsegmentation&de segmentationonmachinetranslationEffectof morphologicalsegmentation&de segmentationonmachinetranslation
Effectof morphologicalsegmentation&de segmentationonmachinetranslation
 
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual DictionariesOpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
2211 APSIPA
2211 APSIPA2211 APSIPA
2211 APSIPA
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Introduction
IntroductionIntroduction
Introduction
 

Más de INESC-ID (Spoken Language Systems Laboratory - L2F)

Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
Análise comparativa das edições portuguesa e brasileira de  Os livros que dev...Análise comparativa das edições portuguesa e brasileira de  Os livros que dev...
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
INESC-ID (Spoken Language Systems Laboratory - L2F)
 
Barreiro et al POP@PROPOR2018-informal2formal-language
Barreiro et al POP@PROPOR2018-informal2formal-languageBarreiro et al POP@PROPOR2018-informal2formal-language
Barreiro et al POP@PROPOR2018-informal2formal-language
INESC-ID (Spoken Language Systems Laboratory - L2F)
 
Barreiro-Mota-VarDial@Coling2018-poster
Barreiro-Mota-VarDial@Coling2018-posterBarreiro-Mota-VarDial@Coling2018-poster
Barreiro-Mota-VarDial@Coling2018-poster
INESC-ID (Spoken Language Systems Laboratory - L2F)
 

Más de INESC-ID (Spoken Language Systems Laboratory - L2F) (20)

Multi3Generation@INGL2020
Multi3Generation@INGL2020Multi3Generation@INGL2020
Multi3Generation@INGL2020
 
NooJ 2020 presentation
NooJ 2020 presentationNooJ 2020 presentation
NooJ 2020 presentation
 
PROPOR2020_Barreiroetal
PROPOR2020_BarreiroetalPROPOR2020_Barreiroetal
PROPOR2020_Barreiroetal
 
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
Análise comparativa das edições portuguesa e brasileira de  Os livros que dev...Análise comparativa das edições portuguesa e brasileira de  Os livros que dev...
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
 
Welcome session 3rd Annual MC Meeting - enetCollect COST Action
Welcome session 3rd Annual MC Meeting - enetCollect COST ActionWelcome session 3rd Annual MC Meeting - enetCollect COST Action
Welcome session 3rd Annual MC Meeting - enetCollect COST Action
 
Syntactic-semantic analysis for information extraction in biomedicine
Syntactic-semantic analysis for information extraction in biomedicineSyntactic-semantic analysis for information extraction in biomedicine
Syntactic-semantic analysis for information extraction in biomedicine
 
Cross language semantic relations between English and Portuguese
Cross language semantic relations between English and PortugueseCross language semantic relations between English and Portuguese
Cross language semantic relations between English and Portuguese
 
Paraphrasing biomedical support verb constructions for machine translation
Paraphrasing biomedical support verb constructions for machine translationParaphrasing biomedical support verb constructions for machine translation
Paraphrasing biomedical support verb constructions for machine translation
 
ReWriter for legal text
ReWriter for legal textReWriter for legal text
ReWriter for legal text
 
Chatbots for Language Learning
Chatbots for Language LearningChatbots for Language Learning
Chatbots for Language Learning
 
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and SummarizationeSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
 
Barreiro et al POP@PROPOR2018-informal2formal-language
Barreiro et al POP@PROPOR2018-informal2formal-languageBarreiro et al POP@PROPOR2018-informal2formal-language
Barreiro et al POP@PROPOR2018-informal2formal-language
 
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignmentsRebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
 
Barreiro-Mota-VarDial@Coling2018-poster
Barreiro-Mota-VarDial@Coling2018-posterBarreiro-Mota-VarDial@Coling2018-poster
Barreiro-Mota-VarDial@Coling2018-poster
 
NooJ-2018-Palermo
NooJ-2018-PalermoNooJ-2018-Palermo
NooJ-2018-Palermo
 
Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania Poster @ enetCollect CA MC meeting in Iasi, Romania
Poster @ enetCollect CA MC meeting in Iasi, Romania
 
projeto-eSPERTo
projeto-eSPERToprojeto-eSPERTo
projeto-eSPERTo
 
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software ToolReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
 
Poster l2f 2017
Poster l2f 2017Poster l2f 2017
Poster l2f 2017
 
Nooj2017 cmota-etal
Nooj2017 cmota-etalNooj2017 cmota-etal
Nooj2017 cmota-etal
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

When Multiwords Go Bad in Machine Translation

  • 1. technology from seed WHEN MULTIWORDS GO BAD IN MACHINE TRANSLATION Anabela Barreiro, L2F - INESC-ID, Portugal Johanna Monti, University of Sassari, Italy Brigitte Orliac, Logos Institute, USA Fernando Batista, L2F - INESC-ID and ISCTE, Portugal
  • 2. • Introduction • Multiwords in NLP and MT • RBMT and SMT approaches to multiword processing • Multiword Assessment – Corpus – Multiword taxonomy – Error Categorization for multiword translations – Quantitative results – Analysis of relevant problems • OL semantico-syntactic rules for multiword translation precision • Main conclusions and future work OUTLINE
  • 3. • MT has become popular, widespread and useful to society • Every internet user is now using MT (with or without knowing it!) • HOWEVER, linguistic quality is still a serious problem, as translations contain morphological, syntactic and semantic errors • Successful MWU processing still represents one of the most significant linguistic challenges for MT systems INTRODUCTION
  • 4. • Crucial role in NLP • Essential in MT – Unfeasible to include all MWU in dictionaries – Poor syntactic and semantic analysis – reduces the performance of NLP systems – Fragmentation of any part of a MWU leads to generation errors – Incorrect MWU generation has a negative impact on the understandability and quality of the translated text MULTIWORDS IN NLP AND MT
  • 5. • Lack/degree of compositionality • Constituent dependencies • contiguous (adjacent) - no inserts • non-contiguous (remote) – inserts • Morphosyntactic variations Freely available RBMT and SMT fail at translating MWU: – RBMT systems fail for lack of MWU coverage – SMT systems fail for lack of linguistic (semantico-syntactic) knowledge to process them, leading to structural problems CRITICAL PROBLEMS FOR MULTIWORD PROCESSING
  • 6. MT APPROACHES TO MULTIWORD PROCESSING • Importance of a correct processing of MWU so that they can be translated correctly by MT systems • Sag et al., 2001, Thurmair, 2004, Rayson et al., 2010, Monti, 2013 • Solutions to resolve MWU translation problems: – Use of generative dependency grammars with features • Diaconescu, 2004 – grouping bilingual MWU before performing statistical alignment • Lambert and Banchs, 2006 – paraphrasing MWU • Barreiro, 2010
  • 7. MULTIWORD PROCESSING IN RBMT • Lexical approach – MWU as single lemmata in dictionaries – Suitable for contiguous compounds • Compositional approach – MWU processing by means of POS tagging and syntactic analysis of its different components – Suitable for compounds not coded in the dictionary and for verbal constructions
  • 8. MULTIWORD PROCESSING IN SMT • Traditional approach to word alignment • Brown et al., 1993 – Inability to handle many-to-many correspondences • Current state-of-the-art phrase-based SMT systems • Koehn et al., 2003 – The correct translation of MWU occurs if the its constituents are marked and aligned as parts of consecutive phrases in the training set – Phrases are defined as sequences of contiguous words (n-grams) with limited linguistic information (mostly syntactic) • will stay - linguistically meaningful • that he - no linguistic significance
  • 9. MULTIWORD PROCESSING IN SMT • MWU processing and translation in SMT as a problem of: – Automatically learning and integrating translations – Word sense disambiguation – Word alignment • Barreiro et al., 2013 • Integration of phrase-based models with linguistic knowledge: – Identification of possible monolingual MWU • Wu et al., 2008, Okita et al., 2010 – Integration of bilingual domain MWU in SMT • Ren et al., 2009 – Incorporation of machine-readable dictionaries and glossaries, treating these resources as phrases in the phrase-based table • Okuma et al., 2008 – Identification and grouping of MWU prior to statistical alignment • Lambert and Banchs, 2006
  • 10. • Linguistic analysis and error categorization of the MWU translations • 2 MT systems: • OpenLogos RBMT • Google Translate SMT • 3 language pairs: EN-FR, EN-IT and EN-PT • Analysis of MWU translations by 3 MT expert linguists • MWU taxonomy to evaluate MWU (in any system independently of the approach) • OpenLogos solution to MWU processing in MT OUR WORK
  • 11. • Created a Corpus of MWU • Translated sentences with the MWU into EN, IT and PT using the OL and GT systems The purpose of our work WAS NOT to compare and evaluate systems The purpose of our work WAS to assess and measure the quality of MWU translation independently of the two systems considered • Developed an empirically-driven taxonomy for MWU • Analysed the MWU translation errors based on this taxonomy – The different errors were categorized by MT expert linguists of the respective target languages METHODOLOGY
  • 12. CORPUS • 150 English sentences - news and internet • Average of ~4-5 MWU per sentence • The corpus was divided into 3 sets of 50 sentences translated for each language pair by the 2 systems • 3 native linguists reviewed 50 sentences each for the 3 target languages, and evaluated the MWU translations for each of these languages (1 evaluator for each language), classifying the translations according to a binary evaluation metrics: • OK for correct translations • ERR for incorrect translations • None of the systems was specifically trained for the task - texts were not domain specific
  • 13. MULTIWORD TAXONOMY Type Subtype Acronym Example VERB Compound Verb COMPV may have been done, have [already] shown Support Verb Construction SVC make a presentation, be meaningful, have [particularly good] links, give an illustration of, be the ADV cause of, fall [so far] short of, take a seat; to play a [very important] role Prepositional Verb PREPV deal with, give N to Phrasal Verb PHRV closing down, make N up, slow down to; stand up to, mix N up with Other Verbal Expression VEXPR in trying to, hold N in place NOUN Compound Noun COMPN union spokesman, constraint-based grammar, air conditioning Prepositional Noun PREPN interest in, right side of ADJECTIVE Compound Adjective COMPADJ cost-cutting Prepositional Adjective PREPADJ famous for; similar to ADVERB Compound Adverb COMPADV in a fast way, most notably, last time Prepositional Adverb PREPADV in front of DETERMINER Compound Determiner COMPDET certain of these Prepositional Determiner PREPDET most of CONJUNCTION Compound Conjunction COMPCONJ in order to, as a result of, rather than PREPOSITION Compound Preposition COMPPREP as part of OTHER EXPRESSION Named Entity NE Economic Council Idiomatic Expression IDIOM get to the bottom of the situation, purr like a cat; for goodness’ sake Lexical Bundle BUNDLE I believe that, as much if not more than, if I were you
  • 14. RESULTS
      • The results shed some light on the demand for higher precision MWU translation
      • MWU occur frequently in our corpus - several times within the same sentence:
        Witnesses said the speeding car may have been playing tag with another vehicle when it veered into the southbound lane occupied by Lopez' truck shortly before 8 p.m. Sunday
        – may have been playing tag with - COMPV - idiomatic PREPSVC
        – veered into - PREPV
        – southbound lane - COMPN
        – 8 p.m. Sunday - double temporal expression (time + date)
  • 15. QUANTITATIVE RESULTS
      Correct and incorrect MWU translations

      System  Lang pair   OK  ERR  Total
      OL      EN-FR       40   48     88
      OL      EN-IT       36   83    119
      OL      EN-PT       60   96    156
      OL      Total      136  227    363
      GT      EN-FR       70   38    108
      GT      EN-IT       59   47    106
      GT      EN-PT       67   47    114
      GT      Total      196  132    328
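
The counts on this slide can be turned into OK rates with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the slide, while the names `COUNTS`, `ok_rate` and `system_totals` are hypothetical helpers that do not come from the presentation. The rates describe MWU translation quality per system and language pair; the authors state that a system comparison was not the goal of the study.

```python
# OK/ERR counts copied from the slide; keys are (system, language pair).
COUNTS = {
    ("OL", "EN-FR"): (40, 48),
    ("OL", "EN-IT"): (36, 83),
    ("OL", "EN-PT"): (60, 96),
    ("GT", "EN-FR"): (70, 38),
    ("GT", "EN-IT"): (59, 47),
    ("GT", "EN-PT"): (67, 47),
}


def ok_rate(ok, err):
    """Percentage of MWU translations judged OK under the binary metric."""
    return 100.0 * ok / (ok + err)


def system_totals(counts):
    """Sum OK and ERR counts per system across all language pairs."""
    totals = {}
    for (system, _pair), (ok, err) in counts.items():
        total_ok, total_err = totals.get(system, (0, 0))
        totals[system] = (total_ok + ok, total_err + err)
    return totals


if __name__ == "__main__":
    for (system, pair), (ok, err) in COUNTS.items():
        print(f"{system} {pair}: {ok_rate(ok, err):.1f}% OK of {ok + err} MWU")
    for system, (ok, err) in system_totals(COUNTS).items():
        print(f"{system} total: {ok_rate(ok, err):.1f}% OK of {ok + err} MWU")
```

The per-system sums reproduce the Total rows of the slide (136/227 for OL, 196/132 for GT), which is a quick consistency check on the table.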
  • 16. QUANTITATIVE RESULTS
      Performance for the 3 most frequent MWU

      Lang pair  Type   OL Ok  OL Error  GT Ok  GT Error
      EN-FR      VERB      17        21     27        12
      EN-FR      COMPN      8        10     13        18
      EN-FR      NE         6         4     16         4
      EN-IT      COMPN     14        39     26        21
      EN-IT      VERB      10        12      6        15
      EN-IT      NE         2         8     14         2
      EN-PT      VERB      30        21     11        23
      EN-PT      COMPN     28        12     18        17
      EN-PT      NE        11        26      9         9
  • 17. MULTIWORDS “GOING BAD” IN FRENCH
      • General language or domain-specific COMPN - 32.5%
        – hit-run driver
            pilot hit run
            chauffeur/conducteur ayant commis un délit de fuite
        – nuclear fuel cycle
            cycle de combustible nucléaire
            cycle de combustion nucléaire
      • SVC - 18.6%
        – is a bit misleading (adjectival)
            est un égarement de morceau
            est quelque peu trompeur
        – it has [wide] applicability (nominal)
            il a l’applicabilité large
            il a de nombreuses possibilités d’application
  • 18. LOGOS APPROACH TO MULTIWORD TRANSLATION
      Main linguistic knowledge bases of the LOGOS system:
      • Dictionaries
      • Semantico-syntactic rules - analysis, transfer and generation
      • Semantic Table SEMTAB - language-pair specific rules
        – Analysis and translation of words in their context
        – Invoked after dictionary look-up and during the execution of target transfer rules to solve analysis and lexical ambiguity problems
          • verb dependencies - different verb argument structures
            – speak to
            – speak against
            – speak of
            – speak on N (radio, TV, television, etc.)
          • MWU of different nature
  • 19. LOGOS APPROACH TO MULTIWORD TRANSLATION
      SAL - Semantico-syntactic Abstraction Language
      – Taxonomy: 3 levels organized hierarchically:
        • Supersets / Sets / Subsets
      – Semantico-syntactic continuum from NL word to Word Class
        • Literal word: airport
        • Head morph: port
        • SAL Subset: Agfunc (agentive functional location)
        • SAL Set: func (functional location)
        • SAL Superset: PL (place)
        • Word Class: N
      – SAL combines both the lexical and the compositional approaches in order to process different types of MWU
  • 20. RESOLUTION OF POLYSEMY
      NL String       →  SEMTAB Rule            →  Portuguese Transfer
      raise a child   →  V(‘raise’) N(ANdes)    →  criar...
      raise corn      →  V(‘raise’) N(MAedib)   →  cultivar...
      raise the rent  →  V(‘raise’) N(MEabs)    →  aumentar...
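
The three rules on this slide can be read as a lookup keyed on the verb and the semantic class of its object noun. The following is a minimal sketch of that idea, not the Logos implementation: the SAL codes and Portuguese transfers are taken from the slide, while `NOUN_SAL`, `SEMTAB_RULES` and `transfer_verb` are hypothetical names, and the toy lexicon covers only the three nouns shown.

```python
# Toy lexicon (assumption): noun lemma -> SAL code, limited to the slide's examples.
NOUN_SAL = {
    "child": "ANdes",
    "corn": "MAedib",
    "rent": "MEabs",
}

# Toy rule table (assumption): (verb lemma, SAL code of the object) -> Portuguese verb.
SEMTAB_RULES = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}


def transfer_verb(verb, object_noun):
    """Return the Portuguese transfer for a verb given its object noun.

    Returns None when the noun is not in the toy lexicon or no rule applies.
    """
    sal_code = NOUN_SAL.get(object_noun)
    return SEMTAB_RULES.get((verb, sal_code))


if __name__ == "__main__":
    for noun in ("child", "corn", "rent", "flag"):
        print(f"raise + {noun}: {transfer_verb('raise', noun)}")
```

Because the rule is keyed on the semantic class rather than on the literal noun, any other noun assigned the same SAL code would select the same transfer, which is the point of resolving polysemy at the semantico-syntactic level.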
  • 21. DEEP STRUCTURE RULES OF SEMTAB
      A single deep-structure rule matches multiple surface structures and produces correct target transfers

      he raised the rent       →  ele aumentou a renda       (V+Object)
      the raising of the rent  →  o aumento da renda         (Gerund)
      the rent, raised by …    →  a renda, aumentada por…    (Part. ADJ)
      a rent raise             →  um aumento de renda        (Noun)
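
One way to picture this slide: once each surface string is reduced to the same verb and object, a single rule entry covers all four. The sketch below assumes that reduction has already been done by hand (a real system would derive it by parsing), and it assumes the applicable rule is the V(‘raise’) N(MEabs) rule with transfer "aumentar". All names (`SURFACE_ANALYSES`, `DEEP_RULES`, `matching_transfer`) are hypothetical, and generating the inflected Portuguese forms is left out.

```python
# Hand-written analyses (assumption): surface string -> (structure, verb, object).
SURFACE_ANALYSES = [
    ("he raised the rent", "V+Object", "raise", "rent"),
    ("the raising of the rent", "Gerund", "raise", "rent"),
    ("the rent, raised by …", "Part. ADJ", "raise", "rent"),
    ("a rent raise", "Noun", "raise", "rent"),
]

# Toy lexicon and a single deep-structure rule (both assumptions for illustration).
NOUN_SAL = {"rent": "MEabs"}
DEEP_RULES = {("raise", "MEabs"): "aumentar"}


def matching_transfer(verb, object_noun):
    """Look up the transfer lemma from the deep (verb, object class) pair only."""
    return DEEP_RULES.get((verb, NOUN_SAL.get(object_noun)))


# The surface structure label plays no part in rule selection.
matched = {
    surface: matching_transfer(verb, obj)
    for surface, _structure, verb, obj in SURFACE_ANALYSES
}

if __name__ == "__main__":
    for surface, lemma in matched.items():
        print(f"{surface!r} -> rule transfer: {lemma}")
```

All four surface structures select the same entry in `DEEP_RULES`; only the later generation step, not shown here, would differ between "aumentou", "aumento" and "aumentada".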
  • 22. CONCLUSIONS
      • MWU – problematic for MT systems independently of the approach
      • Literal translations lead to unclear/incorrect translations or loss of meaning
      • Correct identification and analysis of source language MWU is a challenging task, but the starting point for higher quality MT
        – Linguistic quality evaluation metrics
        – Systematic categorization of errors by MT expert linguists
        – Specific corpora for MWU evaluation
      • OpenLogos approach to MWU processing uses semantico-syntactic rules, which can contribute to MWU translation quality with reference to any language pair
  • 23. FUTURE WORK
      • Research on how OpenLogos linguistic knowledge – SEMTAB – can be applied to an SMT system to correct MWU errors … and vice-versa
      • Successful combination of linguistic PRECISION (OL approach) and COVERAGE (GT approach) in resolving the MWU problem – evolution in the MT field
      • Successful integration of semantico-syntactic knowledge in SMT – solution for achieving high quality MT
        – The accomplishment of this task requires a combination of expertise in MT technology and deep linguistic knowledge to address the reverse research avenue: integration of SMT technology/processes in RBMT to advance MT
  • 24. Thank you!
      This work was supported by Fundação para a Ciência e Tecnologia (Portugal) through Anabela Barreiro’s post-doctoral grant SFRH/BPD/91446/2012 and project PEst-OE/EEI/LA0021/2013.