In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post editing and annotation of target MWEs. Strict quality control was applied for error limitation, i.e., each MT output sentence received first manual post editing and annotation plus second manual quality rechecking. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparisons namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations (ppt4WS)
1. *ADAPT Research Centre
^ Insight Centre for Data Analytics
School of Computing, Dublin City University, Ireland
AlphaMWE: Construction of Multilingual
Parallel Corpora with MWE Annotations
Lifeng Han*, Gareth J. F. Jones*, and Alan F. Smeaton^
research paper (long track) @ MWE-LEX2020, affiliated with COLING2020
1
2. www.adaptcentre.ie
Content
• Background (motivation of this work)
• Related work (MWE corpus construction)
• AlphaMWE (Experimental work procedure)
• Discussions (MT issues coming across MWEs)
Cartoon, https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm
Assume
you have
AlphaMWE
paper
in hand :)
2
3. www.adaptcentre.ie
Background
Definition - Multi-Word Expressions (MWEs)
• Various definitions include syntactic structure / semantic viewpoints (Constant
et al., 2017):
• covering syntactic anomalies,
• non-compositionality,
• non-substitutability
• ambiguity .
• Baldwin and Kim (2010) define MWEs as “lexical items that:
• (i) can be decomposed into multiple lexemes; and
• (ii) display lexical, syntactic, semantic, pragmatic and/or statistical
idiomaticity”.
3
4. www.adaptcentre.ie
Background
MWE with NLP
• MWEs are an interesting topic to both NLP researchers and linguists (Sag et al.,
2002; Constant et al., 2017; Pulcini, 2020).
• Automatic processing of MWEs poses significant challenges for computational
linguistics (CL):
• e.g. word sense disambiguation (WSD), parsing and MT (Lambert and Banchs,
2005; Bouamor et al., 2012; Skadina, 2016; Li et al., 2019; Han et al., 2020).
• caused by variety/richness of MWEs as they are used in language.
• However, very few bilingual/multilingual parallel corpora with MWE annotations
available for cross-lingual NLP research and for downstream applications (e.g. MT,
Johnson et al., 2016).
Background
4
5. www.adaptcentre.ie
Background
This paper’s view/focus
• Verbal MWEs (vMWEs): a mature category, received attention from many
researchers (e.g. Maldonado et al., 2017):
• Have a verb as the head of the studied term / function as verbal phrases, e.g.
“kick the bucket”, “cutting capers” and “(beer) gone to one’s head”.
• AlphaMWE: construction of a multilingual corpus with vMWEs annotation,
including English-Chinese/German/Polish
• A deeper insight into the difficulties of processing MWEs: a categorisation of
the errors made by MT models
• State-of-the-art MT models are far from reaching parity with humans in terms
of translation performance (Wu et al., 2016; Hassan et al., 2018).
Alfredo Maldonado, Lifeng Han, Erwan Moreau, et al. (2017) Detection of verbal multi-word expressions via conditional random fields with syntactic dependency
features and semantic re-ranking. In: 13th Workshop on Multiword Expressions (MWE 2017), Apr 2017, Valencia, Spain.
Background
5
7. www.adaptcentre.ie
Related work
Monolingual
• monolingual corpora with vMWE annotations
• PARSEME shared task corpora (Savary et al., 2017; Ramisch et al., 2018):
• 2020 edition covers 14 languages including Chinese, Hindi, Turkish.
• monolingual English - focused corpora includes
• MWE aware “English Dependency Corpus” from the LDC (LDC2017T01)1: covers
compound words used to train parsing models.
• English MWEs from “web reviews data” by Schneider et al. (2014) : covers noun, verb
and preposition super-senses
• English verbal MWEs from Walsh et al. (2018) and Kato et al. (2018) : covers PARSEME
shared task defined vMWE categories, or partly.
7
8. www.adaptcentre.ie
Related work
Monolingual
• All monolingual settings, independently by different language speakers without
bilingual alignment.
• Some non-native speakers, automatic creation, e.g. Kato et al. (2018):
• combination of automatic annotations and crowdsourcing.
• first construct a vMWE dictionary based on the English-language Wiktionary
• …
• formalise vMWE annotations as a multiword sense disambiguation problem to
exploit crowdsourcing (VP, LVC, semi-fixed vMWEs)
• These corpora => helpful for monolingual MWE research - discovery or
identification, however, difficult to bilingual or multilingual research - MT or cross-
lingual IE
Related work
Monolingual
8
9. www.adaptcentre.ie
Related work
Bilingual
• Vincze (2012), an English-Hungarian parallel corpus with annotations for light
verb constructions (LVCs)
• 703 LVCs for Hungarian and 727 for English were annotated
• a comparison between English and Hungarian data
• however, did not cover other types of vMWEs, e.g. verbal idioms, or verb-
particle constructions
• not extended to any other language pairs.
Related work
9
10. www.adaptcentre.ie
Related work
Bilingual
• Han et al. (2020) : an automatic construction of bilingual MWE terms from
parallel corpus, English-Chinese/German.
• automated extraction of monolingual MWEs based on POS patterns,
MWEtoolkit (Ramisch et al. 2010)
• align two side monolingual MWEs into bilingual terms based on statistical
lexical translation probability
• due to the automated procedure, the extracted bilingual “MWE terms”
contain also normal phrases
• POS pattern design needs to be further refined (Skadina, 2016; Rikters and
Bojar, 2017; Han et al., 2020)
Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings
of The 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France.
Related work
10
13. www.adaptcentre.ie
AlphaMWE
Source corpus + MT model
• Source corpus selection: PARSEME-2018 English
• - (rational: annotators / application)
• MT model comparison, the errors by: google/bing/baidu/DeepL
• - Figure 2 in paper (Sample comparison of outputs from four MT models).
• - Blue: redundancy; green: adding error; pink: reordering error; yellow:
dropping error.
AlphaMWE
Check the paper for detail explanations
13
14. www.adaptcentre.ie
AlphaMWEAlphaMWE
14
Two sample sentences’ MT outputs comparison from head of test file
Source # text = SQL Server verifies that the account name and password were validated when the user logged on to the
system and grants access to the database, without requiring a separate logon name or password.
DeepL # text = SQL Server
Google text = SQL Server
Bing [text] SQL Server
Baidu #
Ref. # = SQL Server
Source # text = See the http://officeupdate.microsoft.com/, Microsoft Developer Network Web site for more information on TSQL.
DeepL # text = TSQL http://officeupdate.microsoft.com/ Microsoft Developer Network Web
Google text = Microsoft SQL http://officeupdate.microsoft.com/ Microsoft
Bing [ ] TSQL http://officeupdate.microsoft.com/ TSQL Microsoft
Baidu #text= http://officeupdate.microsoft.com/ TSQL
Ref. # text = TSQL http://officeupdate.microsoft.com/
Blue: redundancy; green: adding error; pink: reordering error; yellow: dropping error.
15. www.adaptcentre.ie
AlphaMWE
During annotation: several situations and decisions
• a) (vMWEs, into a general phrase) -> offer two different references, one being
revised in a vMWE/MWE presentation in the target;
• b) (source, target) in a different register (thx, for instance) -> offer two
reference sentences, with one of them using the same register;
• c) (normal_phrase, into v/MWE), or (v/MWE, v/MWE) but the source vMWE
was not annotated -> additions to include such vMWE (pairs);
• d) English source: wrong/incorrect annotation, or some mis-spelling -> we
corrected them.
AlphaMWE
Check the paper for interesting examples
15
16. www.adaptcentre.ie
AlphaMWE
Size and possible usage
• Extracted all 750 English sentences which have vMWE tags included.
• - target covered so far: Chinese, German, Polish with Spanish/French/
Italian/Irish in future.
• Its comparable to some standard shared task usage.
• - development and test data sets from the annual WMT (Bojar et al., 2017)
and also from the NIST MT challenges -> approximately 2,000 sentences
for Dev/testing over some years (https://www.nist.gov/programs-projects/
machine-translation)
AlphaMWE
16
19. www.adaptcentre.ie
MT issues coming across MWEs (or related)
English->Chinese/German/Polish
• When MT produces incorrect / awkward translations, we categorise en->zh:
• - common sense, super sense, abstract phrase, idiom, metaphor and
ambiguity, with ambiguity further sub-divided
• en->de (paper Appendix B examples):
• (1) (English MWEs, one_word) partially because that German has separable verbs
• (2) (normal phrase, polite/formal DE), biased MT, generally fine but depends on the context to
decide which form is more suitable, and
• (3) (vMWEs, normal phrase DE) often happened
• en->pl (paper Appendix B examples):
• coherence-unaware errors, literal translation errors, context unaware situation errors.
See paper for interesting examples
19
24. www.adaptcentre.ie
Summary and Conclusions
• Construction of multilingual parallel corpora, AlphaMWE,
• vMWEs annotated by native speakers.
• Described the procedure for MT model selection, human post editing and
annotation, compared different state-of-the-art MT models and classified MT
errors from vMWEs related to sentence/context translations.
• Characterised errors into categories to support MT research.
• English→Chinese, English→German and English→Polish and
• Similarly categorised MT issues when handling MWEs.
• AlphaMWE will be maintained as a publicly available corpora, and extended to
include other language pairs, e.g. Spanish, French and Italian.
• We also plan to extend the annotated MWE genres beyond the vMWEs defined in
the PARSEME shared task.
24
25. www.adaptcentre.ie
References
• Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural LanguageProcessing, Second Edition, pages 267–292.
Chapman and Hall.
• Ondřej Bojar, et al. 2017. Findings of the 2017 conference on machine translation (WMT17). In WMT.
• Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In
LREC.
• Vishal Chowdhary and Scott Greenwood. 2017. Emt: End to end model training for msr machine translation. In Proceedings of the 1st Workshop on
Data Management for End-to-End ML.
• Mathieu Constant, et al. 2017. Survey: Multiword expression processing: A Survey. Computational Linguistics, 43(4):837–892.
• Lifeng Han, Gareth J. F. Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multiword expression (MWE) parallel corpora. In LREC.
• Hany Hassan, et al. 2018. Achieving human parity on automatic chinese to english news translation. ArXiv, abs/1803.05567.
• Melvin Johnson, et al. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.
• Akihiko Kato, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Construction of Large-scale English Verbal Multiword Expression Annotated Corpus. In
LREC.
• Patrik Lambert and Rafael E. Banchs. 2005. Data Inferred Multi-word Expressions for Statistical Machine Translation. In Proceedings of Machine
Translation Summit X, pages 396–403, Thailand.
• Xiaoqing Li, Jinghui Yan, Jiajun Zhang, and Chengqing Zong. 2019. Neural name translation improves neural machine translation. In Machine
Translation, pages 93–100, Singapore. Springer.
• Alfredo Maldonado, Lifeng Han, Erwan Moreau, et al. 2017. Detection of verbal multi-word expressions via conditional random fields with syntactic
dependency features and semantic re-ranking. In The 13th MWE.
25
26. www.adaptcentre.ie
References
• Virginia Pulcini. 2020. English-derived multi-word and phraseological units across languages in the global anglicism database. Textus,
English Studies in Italy, (1/2020):127–143.
• Carlos Ramisch et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In LAW-
MWE-CxG-2018)
• Matīss Rikters and Ondřej Bojar. 2017. Paying Attention to Multi-Word Expressions in Neural MachineTranslation. In Proceedings of the 16th
Machine Translation Summit.
• Ivan A. Sag, et al. 2002. Multiword expressions: A pain in the neck for nlp. In Alexander Gelbukh, editor, Computational Linguistics and
Intelligent Text Processing.
• Agata Savary, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In MWE2017.
• Nathan Schneider, et al. 2014. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the LREC.
• Inguna Skadina. 2016. Multi-word expressions in english-latvian machine translation. Baltic J. Modern Computing, 4:811–825.
• Meng Sun, Bojian Jiang, Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Baidu neural machine translation systems for WMT19.
In WMT.
• Ashish Vaswani, et al. 2017. Attention is all you need. In Conference on Neural Information Processing System, pages 6000–6010.
• Veronika Vincze. 2012. Light verb constructions in the SzegedParalellFX English–Hungarian parallel corpus. In LREC.
• Abigail Walsh, et al. 2018. Constructing an annotated corpus of verbal MWEs for English. In (LAW-MWE-CxG2018), pages 193–200.
• Yonghui Wu, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/
1609.08144.
References
26
28. www.adaptcentre.ie
• The ADAPT Centre for Digital Content Technology is funded under the SFI Research
Centres Programme (Grant 13/RC/2106) and is co-funded under the European
Regional Development Fund. The input of Alan Smeaton is part-funded by Science
Foundation Ireland under grant number SFI/12/RC/2289 (Insight Centre).
• The authors are very grateful to their colleagues who helped to create the AlphaMWE
corpora by post editing and annotation work across all language pairs, to Yi Lu and
Dr. Paolo Bolzoni for helping with the experiments, to Lorin Sweeney, Roise McGagh,
and Eoin Treacy for discussions about English MWEs and terms, Yandy Wong for
discussion of Cantonese examples, and Hailin Hao and Yao Tong for joining the first
discussion.
• The authors also thank the anonymous reviewers for their thorough reviews and
insightful comments.
Acknowledgement
Funding and helping colleagues
Acknowledgement
28
29. www.adaptcentre.ie
Appendix
Error Examples from English Corpus Fixed in AlphaMWE
• error annotations of vMWEs in source monolingual corpus surly have some
impact on the accuracy level of the vMWE discovery and identification shared
task, but also affect the bilingual usage of AlphaMWE, so we tried to address
all these cases. For instance, the example sentence in ‘Abstract Phrase’,
English corpus annotated wrongly the sequence “had to go on” as a verbal
idioms (VIDs) which is not accurate. The verb “had” here is affiliated with “all
he had” instead of “to go on”. So either we shall annotate “go on” as vMWE
in the sentence or the overall clause “all he had to go on” as a studied term.
29
30. www.adaptcentre.ie
• Another example with a different type of vMWE is the sentence “He put them
on in a kind of trance.” where the source English corpus tagged “put” and
“trance” as Light-verb construction (VLC.cause). However, the phrase is with
“put...on” instead of “put...trance”. “put someone into a trance” is a phrase to
express “make someone into a half-conscious state”. However, for this
sentence, if we check back a bit further of the context, it means “he put on
his cloth in a kind of trance”. The word “trance” is affiliated with the phrase “in
a kind of trance” instead of “put”.
Error Examples from English Corpus Fixed in AlphaMWE
AppendixAppendix
30
31. www.adaptcentre.ie
English→German MT Samples Reflecting Afore Mentioned MWE Related Issues.
• Firstly, for the English vMWE translates into single German word, let’s see the
vMWE “woke up” the sentence “An old woman with crinkly grey hair woke up
at her post outside the lavatory and opened the door, smiling and grasping a
filthy cleaning rag.” has corresponding German aligned word “erwachte” with
a suitable translation “Eine alte Frau mit krausem, grauem Haar erwachte auf
ihrem Posten vor der Toilette und öffnete die Tür, lächelte und griff nach einem
schmutzigen Putzlappen.”
• Secondly, for the automatic translation to German that is very biased towards
choosing the polite or formal form, see the examples such as “Sie”instead of
the second form singular “du”for “you”, “auf Basis von” instead of “basierend
auf” for “based on”. To achieve a higher accuracy level of MT, it shall depend
on the context of usage to decide which form is more suitable.
AppendixAppendix
31
32. www.adaptcentre.ie
English→Polish MT Samples Reflecting Afore Mentioned MWE Related Issues.
• Regarding the MT output issues on English to Polish that fall into coherence-unaware error, for instance, the vMWE “write
off” in sentence “Then someone says that it can’t be long now before the Russians write Arafat off.” was translated as
“Wypiszą” (Potem ktoś mówi, że już niedługo Rosjanie wypiszą Arafata.) which means “prescribe”, instead of correct one
“spiszą na straty (Arafata)”. This error shall be able to avoid by the coherence of the sentence itself in meaning
preservation models.
• For the literal translation, we can see the example vMWE “gave (him) a look” in the sentence “She ruffled her feathers and
gave him a look of deep disgust.” which was literally translated as “dała mu spojrzenie”, however, in Polish, people use
“throw a look” as “rzuciła (mu) spojrzenie” instead of “gave (dała, a female form)”. Another example of literal translation
leading to errors is the vMWE “turn the tables” from sentence “Now Iran wants to turn the tables and is inviting
cartoonists to do their best by depicting the Holocaust.” which is translated as “odwrócić stoliki (turn tables)”, however, it
shall be “odwrócić sytuację (turn the situation)” or “odwrócić rolę (turn role)” with a proper translation “Teraz Iran chce
odwrócić sytuację i zachęca rysowników, by zrobili wszystko, co w ich mocy, przedstawiając Holocaust.” These two
examples present the localization issue in the target language.
• For the context unaware issue, we can look back to the example sentence “But it did not give me the time of day.” from
Fig. 8. It was literally translated word by word into “Ale nie dało mi to pory dnia.” which is in the sense of hour/time.
However, it shall be “Nie sądzę aby to było coś wyjątkowo/szczególnie dla mnie. (I do not think this is special to me.)”
based on the context, or “Ale to nie moja bajka” as an idiomatic expression which means “not my fairy tale” (indicating
not my cup of tea).
AppendixAppendix
32