This presentation addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos
rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct reasons, multiwords remain a problematic area for MT independently of the approach, and require adequate linguistic quality evaluation metrics founded on a systematic categorization of errors by MT expert linguists. We propose an empirically-driven taxonomy for multiwords, and highlight the need for the development of specific corpora for multiword evaluation. Finally, the paper presents the Logos approach to multiword processing, illustrating how semantico-syntactic rules contribute to multiword translation quality.
1. WHEN MULTIWORDS GO BAD
IN MACHINE TRANSLATION
Anabela Barreiro, L2F - INESC-ID, Portugal
Johanna Monti, University of Sassari, Italy
Brigitte Orliac, Logos Institute, USA
Fernando Batista, L2F - INESC-ID and ISCTE, Portugal
2. • Introduction
• Multiwords in NLP and MT
• RBMT and SMT approaches to multiword processing
• Multiword Assessment
– Corpus
– Multiword taxonomy
– Error Categorization for multiword translations
– Quantitative results
– Analysis of relevant problems
• OL semantico-syntactic rules for multiword translation precision
• Main conclusions and future work
OUTLINE
3. • MT has become popular, widespread and useful to society
• Every internet user is now using MT
(with or without knowing it!)
• HOWEVER, linguistic quality is still a serious problem,
as translations contain morphological, syntactic and
semantic errors
• Successful MWU processing still represents
one of the most significant linguistic challenges
for MT systems
INTRODUCTION
4. • Crucial role in NLP
• Essential in MT
– Unfeasible to include all MWU in dictionaries
– Poor syntactic and semantic analysis – reduces the performance
of NLP systems
– Fragmentation of any part of a MWU leads to generation errors
– Incorrect MWU generation has a negative impact on the
understandability and quality of the translated text
MULTIWORDS IN NLP AND MT
5. • Lack/degree of compositionality
• Constituent dependencies
• contiguous (adjacent) - no inserts
• non-contiguous (remote) – inserts
• Morphosyntactic variations
Freely available RBMT and SMT systems fail at translating MWU:
– RBMT systems fail for lack of MWU coverage
– SMT systems fail for lack of linguistic (semantico-syntactic)
knowledge to process them, leading to structural problems
CRITICAL PROBLEMS FOR MULTIWORD
PROCESSING
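The contiguous vs. non-contiguous distinction can be made concrete with a small matcher sketch (hypothetical code, not from either system under study): a contiguous MWU such as "closing down" must match adjacent tokens only, while a non-contiguous one such as "mix N up" must tolerate inserted words.

```python
import re

def match_mwu(parts, sentence, max_inserts=0):
    """Return True if the MWU parts occur in order in the sentence.

    max_inserts=0 matches only contiguous (adjacent) MWU;
    a higher value tolerates that many inserted words between parts
    (non-contiguous, remote constituents).
    """
    # between consecutive parts: up to max_inserts whole words, then whitespace
    sep = r"(?:\s+\S+){0,%d}\s+" % max_inserts
    pattern = r"\b" + sep.join(re.escape(p) for p in parts) + r"\b"
    return re.search(pattern, sentence) is not None
```

With the default, "mix ... up" is found only when adjacent; allowing inserts recovers the remote variant. Real systems of course need morphosyntactic variation (inflection, passivization) on top of this, which is precisely where pattern-based matching starts to break down.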
6. MT APPROACHES TO MULTIWORD
PROCESSING
• Importance of correct MWU processing so that MT systems can translate
them correctly
• Sag et al., 2001, Thurmair, 2004, Rayson et al., 2010, Monti,
2013
• Solutions to resolve MWU translation problems:
– Use of generative dependency grammars with features
• Diaconescu, 2004
– grouping bilingual MWU before performing statistical alignment
• Lambert and Banchs, 2006
– paraphrasing MWU
• Barreiro, 2010
7. MULTIWORD PROCESSING IN RBMT
• Lexical approach
– MWU as single lemmata in dictionaries
– Suitable for contiguous compounds
• Compositional approach
– MWU processing by means of POS tagging and syntactic analysis of
their different components
– Suitable for compounds not coded in the dictionary and for verbal
constructions
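The two RBMT strategies combine naturally: try the lexical approach first, by longest match against MWU coded as single lemmata, and fall back to compositional word-by-word analysis. A rough sketch (the mini-dictionaries are illustrative, not OpenLogos data):

```python
# Hypothetical mini-dictionaries for illustration only (EN-FR)
MWU_DICT = {
    "air conditioning": "climatisation",
    "union spokesman": "porte-parole syndical",
}
WORD_DICT = {"air": "air", "conditioning": "conditionnement", "union": "union"}

def translate(tokens):
    """Lexical approach first (longest MWU match), then compositional fallback."""
    out, i = [], 0
    while i < len(tokens):
        # lexical approach: longest span starting at i coded as a single lemma
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in MWU_DICT:
                out.append(MWU_DICT[span])
                i = j
                break
        else:
            # compositional fallback: word-by-word analysis
            out.append(WORD_DICT.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)
```

The fallback is what handles compounds not coded in the dictionary, at the cost of risking a literal, fragmented translation such as "conditionnement d'air" for "air conditioning".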
8. MULTIWORD PROCESSING IN SMT
• Traditional approach to word alignment
• Brown et al., 1993
– Inability to handle many-to-many correspondences
• Current state-of-the-art phrase-based SMT systems
• Koehn et al., 2003
– The correct translation of an MWU occurs if its constituents are
marked and aligned as parts of consecutive phrases in the training set
– Phrases are defined as sequences of contiguous words (n-grams) with
limited linguistic information (mostly syntactic)
• will stay - linguistically meaningful
• that he - no linguistic significance
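The extraction constraint behind phrase-based SMT can be sketched with the standard consistency criterion (simplified here; real systems add alignment heuristics and probability scores). A source span and target span form a phrase pair only if no alignment link crosses the span boundary:

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Simplified consistent-phrase extraction.

    src, tgt: token lists; alignment: list of (src_index, tgt_index) links.
    A (source span, target span) pair is extracted only when every link
    touching the target span originates inside the source span.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to this source span
            tps = [t for s, t in alignment if i1 <= s <= i2]
            if not tps:
                continue
            j1, j2 = min(tps), max(tps)
            # consistency: no link from outside the span into [j1, j2]
            if all(i1 <= s <= i2 for s, t in alignment if j1 <= t <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs
```

For "he will stay" / "il restera" with "will" and "stay" both aligned to "restera", the linguistically meaningful unit "will stay" is extracted as one phrase, while "will" or "stay" alone cannot be, since either would split a many-to-one link.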
9. MULTIWORD PROCESSING IN SMT
• MWU processing and translation in SMT as a problem of:
– Automatically learning and integrating translations
– Word sense disambiguation
– Word alignment
• Barreiro et al., 2013
• Integration of phrase-based models with linguistic knowledge:
– Identification of possible monolingual MWU
• Wu et al., 2008, Okita et al., 2010
– Integration of bilingual domain MWU in SMT
• Ren et al., 2009
– Incorporation of machine-readable dictionaries and glossaries, treating
these resources as phrases in the phrase-based table
• Okuma et al., 2008
– Identification and grouping of MWU prior to statistical alignment
• Lambert and Banchs, 2006
10. • Linguistic analysis and error categorization of the MWU translations
• 2 MT systems:
• OpenLogos RBMT
• Google Translate SMT
• 3 language pairs: EN-FR, EN-IT and EN-PT
• Analysis of MWU translations by 3 MT expert linguists
• MWU taxonomy to evaluate MWU (in any system independently of the
approach)
• OpenLogos solution to MWU processing in MT
OUR WORK
11. • Created a Corpus of MWU
• Translated sentences with the MWU into EN, IT and PT using the OL
and GT systems
The purpose of our work WAS NOT to compare and evaluate
systems
The purpose of our work WAS to assess and measure the quality of
MWU translation independently of the two systems considered
• Developed an empirically-driven taxonomy for MWU
• Analysed the MWU translation errors based on this taxonomy
– The different errors were categorized by MT expert linguists of the
respective target languages
METHODOLOGY
12. CORPUS
• 150 English sentences - news and internet
• Average of ~4-5 MWU per sentence
• The corpus was divided into 3 sets of 50 sentences translated for each
language pair by the 2 systems
• 3 native linguists reviewed 50 sentences each for the 3 target
languages, and evaluated the MWU translations for each of these
languages (1 evaluator for each language), classifying the translations
according to a binary evaluation metric:
• OK for correct translations
• ERR for incorrect translations
• None of the systems was specifically trained for the task - texts were
not domain specific
13. MULTIWORD TAXONOMY
• VERB
– Compound Verb (COMPV): may have been done, have [already] shown
– Support Verb Construction (SVC): make a presentation, be meaningful, have [particularly good] links, give an illustration of, be the ADV cause of, fall [so far] short of, take a seat; to play a [very important] role
– Prepositional Verb (PREPV): deal with, give N to
– Phrasal Verb (PHRV): closing down, make N up, slow down to; stand up to, mix N up with
– Other Verbal Expression (VEXPR): in trying to, hold N in place
• NOUN
– Compound Noun (COMPN): union spokesman, constraint-based grammar, air conditioning
– Prepositional Noun (PREPN): interest in, right side of
• ADJECTIVE
– Compound Adjective (COMPADJ): cost-cutting
– Prepositional Adjective (PREPADJ): famous for; similar to
• ADVERB
– Compound Adverb (COMPADV): in a fast way, most notably, last time
– Prepositional Adverb (PREPADV): in front of
• DETERMINER
– Compound Determiner (COMPDET): certain of these
– Prepositional Determiner (PREPDET): most of
• CONJUNCTION
– Compound Conjunction (COMPCONJ): in order to, as a result of, rather than
• PREPOSITION
– Compound Preposition (COMPPREP): as part of
• OTHER EXPRESSION
– Named Entity (NE): Economic Council
– Idiomatic Expression (IDIOM): get to the bottom of the situation, purr like a cat; for goodness’ sake
– Lexical Bundle (BUNDLE): I believe that, as much if not more than, if I were you
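For bookkeeping during evaluation, the taxonomy can be encoded as a simple lookup table; the sketch below is a condensed restatement of the table above (not code from the study), mapping each acronym to its coarse type:

```python
# Condensed restatement of the MWU taxonomy: acronym -> (type, subtype)
MWU_TAXONOMY = {
    "COMPV": ("VERB", "Compound Verb"),
    "SVC": ("VERB", "Support Verb Construction"),
    "PREPV": ("VERB", "Prepositional Verb"),
    "PHRV": ("VERB", "Phrasal Verb"),
    "VEXPR": ("VERB", "Other Verbal Expression"),
    "COMPN": ("NOUN", "Compound Noun"),
    "PREPN": ("NOUN", "Prepositional Noun"),
    "COMPADJ": ("ADJECTIVE", "Compound Adjective"),
    "PREPADJ": ("ADJECTIVE", "Prepositional Adjective"),
    "COMPADV": ("ADVERB", "Compound Adverb"),
    "PREPADV": ("ADVERB", "Prepositional Adverb"),
    "COMPDET": ("DETERMINER", "Compound Determiner"),
    "PREPDET": ("DETERMINER", "Prepositional Determiner"),
    "COMPCONJ": ("CONJUNCTION", "Compound Conjunction"),
    "COMPPREP": ("PREPOSITION", "Compound Preposition"),
    "NE": ("OTHER EXPRESSION", "Named Entity"),
    "IDIOM": ("OTHER EXPRESSION", "Idiomatic Expression"),
    "BUNDLE": ("OTHER EXPRESSION", "Lexical Bundle"),
}

def coarse_type(acronym):
    """Map a fine-grained label to its coarse type (e.g. PHRV -> VERB)."""
    return MWU_TAXONOMY[acronym][0]
```

Grouping fine-grained labels by coarse type is what allows per-category error counts such as those reported for VERB, COMPN, and NE in the quantitative results.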
14. • The results shed some light on the demand for higher precision MWU
translation
• MWU occur frequently in our corpus - several times within
the same sentence:
Witnesses said the speeding car may have been playing tag with
another vehicle when it veered into the southbound lane occupied
by Lopez' truck shortly before 8 p.m. Sunday
• may have been playing tag with - COMPV - idiomatic PREPSVC
• veered into - PREPV
• southbound lane - COMPN
• 8 p.m. Sunday - double temporal expression (time + date)
RESULTS
15. QUANTITATIVE RESULTS
Correct and incorrect MWU translations

System   Lang pair    OK   ERR   Total
OL       EN-FR        40    48      88
OL       EN-IT        36    83     119
OL       EN-PT        60    96     156
OL       Total       136   227     363
GT       EN-FR        70    38     108
GT       EN-IT        59    47     106
GT       EN-PT        67    47     114
GT       Total       196   132     328
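The per-system OK rates implied by these counts can be computed directly: OL translates about 37% of the MWU correctly overall, GT about 60%. A minimal sketch over the table's numbers:

```python
# (OK, ERR) counts per system and language pair, from the table above
counts = {
    ("OL", "EN-FR"): (40, 48), ("OL", "EN-IT"): (36, 83), ("OL", "EN-PT"): (60, 96),
    ("GT", "EN-FR"): (70, 38), ("GT", "EN-IT"): (59, 47), ("GT", "EN-PT"): (67, 47),
}

def ok_rate(system):
    """Fraction of MWU translated correctly by a system, over all language pairs."""
    ok = sum(o for (s, _), (o, e) in counts.items() if s == system)
    total = sum(o + e for (s, _), (o, e) in counts.items() if s == system)
    return ok / total
```

Note the denominators differ (363 MWU evaluated for OL vs. 328 for GT), so the rates, not the raw OK counts, are the comparable figures.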
16. QUANTITATIVE RESULTS
Performance for the 3 most frequent MWU

EN-FR     OL Ok   OL Error   GT Ok   GT Error
VERB         17         21      27         12
COMPN         8         10      13         18
NE            6          4      16          4

EN-IT     OL Ok   OL Error   GT Ok   GT Error
COMPN        14         39      26         21
VERB         10         12       6         15
NE            2          8      14          2

EN-PT     OL Ok   OL Error   GT Ok   GT Error
VERB         30         21      11         23
COMPN        28         12      18         17
NE           11         26       9          9
17. General language or domain-specific COMPN - 32.5%
• hit-run driver
pilot hit run
chauffeur/conducteur ayant commis un délit de fuite
• nuclear fuel cycle
cycle de combustible nucléaire
cycle de combustion nucléaire
SVC - 18.6%
• is a bit misleading (adjectival)
est un égarement de morceau
est quelque peu trompeur
• it has [wide] applicability (nominal)
il a l’applicabilité large
il a de nombreuses possibilités d’application
MULTIWORDS “GOING BAD” IN FRENCH
18. LOGOS APPROACH TO MULTIWORD
TRANSLATION
Main linguistic knowledge bases of the LOGOS system:
• Dictionaries
• Semantico-syntactic rules - analysis, transfer and generation
• Semantic Table SEMTAB - language-pair specific rules
– Analysis and translation of words in their context
– invoked after dictionary look-up and during the execution of target
transfer rules to solve analysis and lexical ambiguity problems
• verb dependencies - different verb argument structures
– speak to
– speak against
– speak of
– speak on N (radio, TV, television, etc.)
• MWU of different nature
19. SAL - Semantico-syntactic Abstraction Language
– Taxonomy: 3 levels organized hierarchically:
• Supersets / Sets / Subsets
– Semantico-Syntactic continuum from NL word to Word Class
• Literal word: airport
• Head morph: port
• SAL Subset: Agfunc (agentive functional location)
• SAL Set: func (functional location)
• SAL Superset: PL (place)
• Word Class: N
– SAL combines both the lexical and the compositional approaches in
order to process different types of MWU
LOGOS APPROACH TO MULTIWORD
TRANSLATION
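The SAL continuum for "airport" given above can be sketched as a layered entry, walked from the most specific to the most abstract representation (a hypothetical data structure; SAL itself is considerably richer than this):

```python
# Sketch of the SAL continuum for "airport", following the slide:
# literal word -> head morph -> Subset -> Set -> Superset -> word class
SAL_ENTRY = {
    "word": "airport",
    "head_morph": "port",
    "subset": "Agfunc",    # agentive functional location
    "set": "func",         # functional location
    "superset": "PL",      # place
    "word_class": "N",
}

def abstraction_chain(entry):
    """Walk an entry from the most specific to the most abstract level."""
    order = ["word", "head_morph", "subset", "set", "superset", "word_class"]
    return [entry[k] for k in order]
```

Rules can then be written at whichever level of abstraction fits: a rule over "Agfunc" generalizes across all agentive functional locations without listing each word, which is what lets SAL combine the lexical and compositional approaches.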
20. RESOLUTION OF POLYSEMY
NL String SEMTAB Rule Portuguese Transfer
raise a child V(‘raise’) N(ANdes) criar. . .
raise corn V(‘raise’) N(MAedib) cultivar. . .
raise the rent V(‘raise’) N(MEabs) aumentar. . .
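These SEMTAB rules amount to a lookup keyed on the verb and the SAL class of its object. A minimal sketch (a hypothetical encoding for illustration, not the actual SEMTAB rule format):

```python
# Hypothetical encoding of the three SEMTAB rules above: the verb 'raise'
# receives a different Portuguese transfer depending on its object's SAL class
SEMTAB_RULES = {
    ("raise", "ANdes"): "criar",      # animate designation:  raise a child
    ("raise", "MAedib"): "cultivar",  # edible mass:          raise corn
    ("raise", "MEabs"): "aumentar",   # abstract measure:     raise the rent
}

def transfer_verb(verb, object_sal_class):
    """Select the Portuguese transfer for a verb given its object's SAL class."""
    return SEMTAB_RULES[(verb, object_sal_class)]
```

Because the key is a semantic class rather than a literal noun, one rule also covers "raise a son", "raise wheat", or "raise the price" without further entries.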
21. DEEP STRUCTURE RULES OF SEMTAB
A single deep-structure rule matches multiple surface-structures
and produces correct target transfers
he raised the rent ele aumentou a renda V+Object
the raising of the rent o aumento da renda Gerund
the rent, raised by … a renda, aumentada por… Part. ADJ
a rent raise um aumento de renda Noun
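The economy of deep-structure rules can be sketched as follows (the analyses are hand-written for illustration, not OpenLogos output): every surface variant is analysed to the same (verb, object) deep structure, so a single transfer rule covers them all.

```python
# Each surface structure, once analysed, yields the same deep structure;
# one SEMTAB-style rule then provides the Portuguese transfer for all of them.
ANALYSES = {
    "he raised the rent": ("raise", "rent"),       # V+Object
    "the raising of the rent": ("raise", "rent"),  # Gerund
    "a rent raise": ("raise", "rent"),             # Noun
}
DEEP_RULE = {("raise", "rent"): "aumentar"}  # single deep-structure transfer

def transfer(surface):
    """Translate via the shared deep structure, not the surface string."""
    return DEEP_RULE[ANALYSES[surface]]
```

The hard part, of course, is the analysis step that produces the shared deep structure; once it exists, transfer coverage grows with deep rules rather than with surface patterns.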
22. • MWU – problematic for MT systems independently of the approach
• Literal translations lead to unclear/incorrect translations or loss of meaning
• Correct identification and analysis of source language MWU is a challenging
task, but the starting point for higher quality MT
– Linguistic quality evaluation metrics
– Systematic categorization of errors by MT expert linguists
– Specific corpora for MWU evaluation
• OpenLogos approach to MWU processing uses semantico-syntactic rules,
which can contribute to MWU translation quality with reference to any
language pair
CONCLUSIONS
23. FUTURE WORK
• Research on how OpenLogos linguistic knowledge – SEMTAB - can be
applied to an SMT system to correct MWU errors … and vice-versa
• Successful combination of linguistic PRECISION (OL approach) and
COVERAGE (GT approach) in resolving the MWU problem
– evolution in the MT field
• Successful integration of semantico-syntactic knowledge in SMT
– solution for achieving high quality MT
– Accomplishing this task requires combining expertise in MT
technology with deep linguistic knowledge, and addressing the reverse
research avenue as well: integrating SMT technology/processes into
RBMT to advance MT
24. Thank you!
This work was supported by Fundação para a Ciência e Tecnologia (Portugal) through
Anabela Barreiro’s post-doctoral grant SFRH/BPD/91446/2012 and project PEst-OE/EEI/LA0021/2013.