SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
*ADAPT Research Centre

^ Insight Centre for Data Analytics 

School of Computing, Dublin City University, Ireland
AlphaMWE: Construction of Multilingual
Parallel Corpora with MWE Annotations
Lifeng Han*, Gareth J. F. Jones*, and Alan F. Smeaton^
research paper (long track) @ MWE-LEX2020, affiliated with COLING2020
1
www.adaptcentre.ie
Content
• Background (motivation of this work)

• Related work (MWE corpus construction)

• AlphaMWE (Experimental work procedure)

• Discussions (MT issues coming across MWEs)
Cartoon, https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm
Assume 

you have 

AlphaMWE 

paper 

in hand :)
2
www.adaptcentre.ie
Background
Definition - Multi-Word Expressions (MWEs)
• Various definitions include syntactic structure / semantic viewpoints (Constant
et al., 2017): 

• covering syntactic anomalies, 

• non-compositionality, 

• non-substitutability 

• ambiguity . 

• Baldwin and Kim (2010) define MWEs as “lexical items that: 

• (i) can be decomposed into multiple lexemes; and 

• (ii) display lexical, syntactic, semantic, pragmatic and/or statistical
idiomaticity”.
3
www.adaptcentre.ie
Background
MWE with NLP
• MWEs are an interesting topic to both NLP researchers and linguists (Sag et al.,
2002; Constant et al., 2017; Pulcini, 2020). 

• Automatic processing of MWEs poses significant challenges for computational
linguistics (CL): 

• e.g. word sense disambiguation (WSD), parsing and MT (Lambert and Banchs,
2005; Bouamor et al., 2012; Skadina, 2016; Li et al., 2019; Han et al., 2020). 

• caused by variety/richness of MWEs as they are used in language.

• However, very few bilingual/multilingual parallel corpora with MWE annotations
available for cross-lingual NLP research and for downstream applications (e.g. MT,
Johnson et al., 2016).
Background
4
www.adaptcentre.ie
Background
This paper’s view/focus
• Verbal MWEs (vMWEs): a mature category, received attention from many
researchers (e.g. Maldonado et al., 2017):

• Have a verb as the head of the studied term / function as verbal phrases, e.g.
“kick the bucket”, “cutting capers” and “(beer) gone to one’s head”. 

• AlphaMWE: construction of a multilingual corpus with vMWEs annotation,
including English-Chinese/German/Polish

• A deeper insight into the difficulties of processing MWEs: a categorisation of
the errors made by MT models

• State-of-the-art MT models are far from reaching parity with humans in terms
of translation performance (Wu et al., 2016; Hassan et al., 2018).
Alfredo Maldonado, Lifeng Han, Erwan Moreau, et al. (2017) Detection of verbal multi-word expressions via conditional random fields with syntactic dependency
features and semantic re-ranking. In: 13th Workshop on Multiword Expressions (MWE 2017), Apr 2017, Valencia, Spain.
Background
5
www.adaptcentre.ie
Content
• Background

• Related work

• AlphaMWE

• Discussions
cartoon, https://www.amazon.ca/Tweety-Bird-not-happy-Decal/dp/
Related work

Corpus
6
www.adaptcentre.ie
Related work
Monolingual
• monolingual corpora with vMWE annotations

• PARSEME shared task corpora (Savary et al., 2017; Ramisch et al., 2018):

• 2020 edition covers 14 languages including Chinese, Hindi, Turkish. 

• monolingual English - focused corpora includes

• MWE aware “English Dependency Corpus” from the LDC (LDC2017T01)1: covers
compound words used to train parsing models. 

• English MWEs from “web reviews data” by Schneider et al. (2014) : covers noun, verb
and preposition super-senses 

• English verbal MWEs from Walsh et al. (2018) and Kato et al. (2018) : covers PARSEME
shared task defined vMWE categories, or partly.
7
www.adaptcentre.ie
Related work
Monolingual
• All monolingual settings, independently by different language speakers without
bilingual alignment.

• Some non-native speakers, automatic creation, e.g. Kato et al. (2018): 

• combination of automatic annotations and crowdsourcing. 

• first construct a vMWE dictionary based on the English-language Wiktionary

• …

• formalise vMWE annotations as a multiword sense disambiguation problem to
exploit crowdsourcing (VP, LVC, semi-fixed vMWEs)

• These corpora => helpful for monolingual MWE research - discovery or
identification, however, difficult to bilingual or multilingual research - MT or cross-
lingual IE
Related work
Monolingual
8
www.adaptcentre.ie
Related work
Bilingual
• Vincze (2012), an English-Hungarian parallel corpus with annotations for light
verb constructions (LVCs)

• 703 LVCs for Hungarian and 727 for English were annotated

• a comparison between English and Hungarian data

• however, did not cover other types of vMWEs, e.g. verbal idioms, or verb-
particle constructions

• not extended to any other language pairs.
Related work
9
www.adaptcentre.ie
Related work
Bilingual
• Han et al. (2020) : an automatic construction of bilingual MWE terms from
parallel corpus, English-Chinese/German. 

• automated extraction of monolingual MWEs based on POS patterns,
MWEtoolkit (Ramisch et al. 2010)

• align two side monolingual MWEs into bilingual terms based on statistical
lexical translation probability

• due to the automated procedure, the extracted bilingual “MWE terms”
contain also normal phrases

• POS pattern design needs to be further refined (Skadina, 2016; Rikters and
Bojar, 2017; Han et al., 2020)
Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings
of The 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France.
Related work
10
www.adaptcentre.ie
Content
• Background

• Related work

• AlphaMWE

• Discussions
cartoon, https://www.bbc.com/news/
AlphaMWE
11
www.adaptcentre.ie
AlphaMWE
Procedure for constructing AlphaMWE
AlphaMWE
12
www.adaptcentre.ie
AlphaMWE
Source corpus + MT model
• Source corpus selection: PARSEME-2018 English 

• - (rational: annotators / application)

• MT model comparison, the errors by: google/bing/baidu/DeepL

• - Figure 2 in paper (Sample comparison of outputs from four MT models).

• - Blue: redundancy; green: adding error; pink: reordering error; yellow:
dropping error.
AlphaMWE
Check the paper for detail explanations
13
www.adaptcentre.ie
AlphaMWEAlphaMWE
14
Two sample sentences’ MT outputs comparison from head of test file
Source # text = SQL Server verifies that the account name and password were validated when the user logged on to the
system and grants access to the database, without requiring a separate logon name or password.
DeepL # text = SQL Server
Google text = SQL Server
Bing [text] SQL Server
Baidu #
Ref. # = SQL Server
Source # text = See the http://officeupdate.microsoft.com/, Microsoft Developer Network Web site for more information on TSQL.
DeepL # text = TSQL http://officeupdate.microsoft.com/ Microsoft Developer Network Web
Google text = Microsoft SQL http://officeupdate.microsoft.com/ Microsoft
Bing [ ] TSQL http://officeupdate.microsoft.com/ TSQL Microsoft
Baidu #text= http://officeupdate.microsoft.com/ TSQL
Ref. # text = TSQL http://officeupdate.microsoft.com/
Blue: redundancy; green: adding error; pink: reordering error; yellow: dropping error.
www.adaptcentre.ie
AlphaMWE
During annotation: several situations and decisions
• a) (vMWEs, into a general phrase) -> offer two different references, one being
revised in a vMWE/MWE presentation in the target; 

• b) (source, target) in a different register (thx, for instance) -> offer two
reference sentences, with one of them using the same register;

• c) (normal_phrase, into v/MWE), or (v/MWE, v/MWE) but the source vMWE
was not annotated -> additions to include such vMWE (pairs); 

• d) English source: wrong/incorrect annotation, or some mis-spelling -> we
corrected them.
AlphaMWE
Check the paper for interesting examples
15
www.adaptcentre.ie
AlphaMWE
Size and possible usage
• Extracted all 750 English sentences which have vMWE tags included. 

• - target covered so far: Chinese, German, Polish with Spanish/French/
Italian/Irish in future.

• Its comparable to some standard shared task usage. 

• - development and test data sets from the annual WMT (Bojar et al., 2017)
and also from the NIST MT challenges -> approximately 2,000 sentences
for Dev/testing over some years (https://www.nist.gov/programs-projects/
machine-translation)
AlphaMWE
16
www.adaptcentre.ie
Examples
of
AlphaMWE
sentences
17
AlphaMWE
www.adaptcentre.ie
Content
• Background

• Related work

• AlphaMWE

• Discussions (refer paper for more interesting ones)
cartoon, https://images.app.goo.gl/Y6bAfr9oFsswWjUY7
Selected
Cases
18
www.adaptcentre.ie
MT issues coming across MWEs (or related)
English->Chinese/German/Polish
• When MT produces incorrect / awkward translations, we categorise en->zh:

• - common sense, super sense, abstract phrase, idiom, metaphor and
ambiguity, with ambiguity further sub-divided 

• en->de (paper Appendix B examples):

• (1) (English MWEs, one_word) partially because that German has separable verbs

• (2) (normal phrase, polite/formal DE), biased MT, generally fine but depends on the context to
decide which form is more suitable, and 

• (3) (vMWEs, normal phrase DE) often happened

• en->pl (paper Appendix B examples):

• coherence-unaware errors, literal translation errors, context unaware situation errors.
See paper for interesting examples
19
www.adaptcentre.ie
MT issues coming across MWEs (or related)
Idiom
MT issues coming across MWEs (or related)
20
www.adaptcentre.ie
MT issues coming across MWEs (or related)
metaphor
MT issues coming across MWEs (or related)
21
www.adaptcentre.ie
MT issues coming across MWEs (or related)
Ambiguity- Context unaware ambiguity
MT issues coming across MWEs (or related)
22
www.adaptcentre.ie
MT issues coming across MWEs (or related)
Ambiguity- social/literature unaware ambiguity
MT issues coming across MWEs (or related)
23
www.adaptcentre.ie
Summary and Conclusions
• Construction of multilingual parallel corpora, AlphaMWE, 

• vMWEs annotated by native speakers. 

• Described the procedure for MT model selection, human post editing and
annotation, compared different state-of-the-art MT models and classified MT
errors from vMWEs related to sentence/context translations. 

• Characterised errors into categories to support MT research. 

• English→Chinese, English→German and English→Polish and 

• Similarly categorised MT issues when handling MWEs. 

• AlphaMWE will be maintained as a publicly available corpora, and extended to
include other language pairs, e.g. Spanish, French and Italian. 

• We also plan to extend the annotated MWE genres beyond the vMWEs defined in
the PARSEME shared task.
24
www.adaptcentre.ie
References
• Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural LanguageProcessing, Second Edition, pages 267–292.
Chapman and Hall.

• Ondřej Bojar, et al. 2017. Findings of the 2017 conference on machine translation (WMT17). In WMT.

• Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In
LREC.

• Vishal Chowdhary and Scott Greenwood. 2017. Emt: End to end model training for msr machine translation. In Proceedings of the 1st Workshop on
Data Management for End-to-End ML.

• Mathieu Constant, et al. 2017. Survey: Multiword expression processing: A Survey. Computational Linguistics, 43(4):837–892.

• Lifeng Han, Gareth J. F. Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multiword expression (MWE) parallel corpora. In LREC.

• Hany Hassan, et al. 2018. Achieving human parity on automatic chinese to english news translation. ArXiv, abs/1803.05567.

• Melvin Johnson, et al. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.

• Akihiko Kato, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Construction of Large-scale English Verbal Multiword Expression Annotated Corpus. In
LREC.

• Patrik Lambert and Rafael E. Banchs. 2005. Data Inferred Multi-word Expressions for Statistical Machine Translation. In Proceedings of Machine
Translation Summit X, pages 396–403, Thailand.

• Xiaoqing Li, Jinghui Yan, Jiajun Zhang, and Chengqing Zong. 2019. Neural name translation improves neural machine translation. In Machine
Translation, pages 93–100, Singapore. Springer.

• Alfredo Maldonado, Lifeng Han, Erwan Moreau, et al. 2017. Detection of verbal multi-word expressions via conditional random fields with syntactic
dependency features and semantic re-ranking. In The 13th MWE.
25
www.adaptcentre.ie
References
• Virginia Pulcini. 2020. English-derived multi-word and phraseological units across languages in the global anglicism database. Textus,
English Studies in Italy, (1/2020):127–143. 

• Carlos Ramisch et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In LAW-
MWE-CxG-2018)

• Matīss Rikters and Ondřej Bojar. 2017. Paying Attention to Multi-Word Expressions in Neural MachineTranslation. In Proceedings of the 16th
Machine Translation Summit. 

• Ivan A. Sag, et al. 2002. Multiword expressions: A pain in the neck for nlp. In Alexander Gelbukh, editor, Computational Linguistics and
Intelligent Text Processing.

• Agata Savary, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In MWE2017. 

• Nathan Schneider, et al. 2014. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the LREC. 

• Inguna Skadina. 2016. Multi-word expressions in english-latvian machine translation. Baltic J. Modern Computing, 4:811–825. 

• Meng Sun, Bojian Jiang, Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Baidu neural machine translation systems for WMT19.
In WMT.

• Ashish Vaswani, et al. 2017. Attention is all you need. In Conference on Neural Information Processing System, pages 6000–6010. 

• Veronika Vincze. 2012. Light verb constructions in the SzegedParalellFX English–Hungarian parallel corpus. In LREC. 

• Abigail Walsh, et al. 2018. Constructing an annotated corpus of verbal MWEs for English. In (LAW-MWE-CxG2018), pages 193–200. 

• Yonghui Wu, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/
1609.08144.
References
26
www.adaptcentre.ie
Acknowledgement
Contact list (alphabetic order)
27
www.adaptcentre.ie
• The ADAPT Centre for Digital Content Technology is funded under the SFI Research
Centres Programme (Grant 13/RC/2106) and is co-funded under the European
Regional Development Fund. The input of Alan Smeaton is part-funded by Science
Foundation Ireland under grant number SFI/12/RC/2289 (Insight Centre). 

• The authors are very grateful to their colleagues who helped to create the AlphaMWE
corpora by post editing and annotation work across all language pairs, to Yi Lu and
Dr. Paolo Bolzoni for helping with the experiments, to Lorin Sweeney, Roise McGagh,
and Eoin Treacy for discussions about English MWEs and terms, Yandy Wong for
discussion of Cantonese examples, and Hailin Hao and Yao Tong for joining the first
discussion. 

• The authors also thank the anonymous reviewers for their thorough reviews and
insightful comments.
Acknowledgement
Funding and helping colleagues
Acknowledgement
28
www.adaptcentre.ie
Appendix
Error Examples from English Corpus Fixed in AlphaMWE
• error annotations of vMWEs in source monolingual corpus surly have some
impact on the accuracy level of the vMWE discovery and identification shared
task, but also affect the bilingual usage of AlphaMWE, so we tried to address
all these cases. For instance, the example sentence in ‘Abstract Phrase’,
English corpus annotated wrongly the sequence “had to go on” as a verbal
idioms (VIDs) which is not accurate. The verb “had” here is affiliated with “all
he had” instead of “to go on”. So either we shall annotate “go on” as vMWE
in the sentence or the overall clause “all he had to go on” as a studied term.
29
www.adaptcentre.ie
• Another example with a different type of vMWE is the sentence “He put them
on in a kind of trance.” where the source English corpus tagged “put” and
“trance” as Light-verb construction (VLC.cause). However, the phrase is with
“put...on” instead of “put...trance”. “put someone into a trance” is a phrase to
express “make someone into a half-conscious state”. However, for this
sentence, if we check back a bit further of the context, it means “he put on
his cloth in a kind of trance”. The word “trance” is affiliated with the phrase “in
a kind of trance” instead of “put”.
Error Examples from English Corpus Fixed in AlphaMWE
AppendixAppendix
30
www.adaptcentre.ie
English→German MT Samples Reflecting Afore Mentioned MWE Related Issues.
• Firstly, for the English vMWE translates into single German word, let’s see the
vMWE “woke up” the sentence “An old woman with crinkly grey hair woke up
at her post outside the lavatory and opened the door, smiling and grasping a
filthy cleaning rag.” has corresponding German aligned word “erwachte” with
a suitable translation “Eine alte Frau mit krausem, grauem Haar erwachte auf
ihrem Posten vor der Toilette und öffnete die Tür, lächelte und griff nach einem
schmutzigen Putzlappen.”

• Secondly, for the automatic translation to German that is very biased towards
choosing the polite or formal form, see the examples such as “Sie”instead of
the second form singular “du”for “you”, “auf Basis von” instead of “basierend
auf” for “based on”. To achieve a higher accuracy level of MT, it shall depend
on the context of usage to decide which form is more suitable.
AppendixAppendix
31
www.adaptcentre.ie
English→Polish MT Samples Reflecting Afore Mentioned MWE Related Issues.
• Regarding the MT output issues on English to Polish that fall into coherence-unaware error, for instance, the vMWE “write
off” in sentence “Then someone says that it can’t be long now before the Russians write Arafat off.” was translated as
“Wypiszą” (Potem ktoś mówi, że już niedługo Rosjanie wypiszą Arafata.) which means “prescribe”, instead of correct one
“spiszą na straty (Arafata)”. This error shall be able to avoid by the coherence of the sentence itself in meaning
preservation models. 

• For the literal translation, we can see the example vMWE “gave (him) a look” in the sentence “She ruffled her feathers and
gave him a look of deep disgust.” which was literally translated as “dała mu spojrzenie”, however, in Polish, people use
“throw a look” as “rzuciła (mu) spojrzenie” instead of “gave (dała, a female form)”. Another example of literal translation
leading to errors is the vMWE “turn the tables” from sentence “Now Iran wants to turn the tables and is inviting
cartoonists to do their best by depicting the Holocaust.” which is translated as “odwrócić stoliki (turn tables)”, however, it
shall be “odwrócić sytuację (turn the situation)” or “odwrócić rolę (turn role)” with a proper translation “Teraz Iran chce
odwrócić sytuację i zachęca rysowników, by zrobili wszystko, co w ich mocy, przedstawiając Holocaust.” These two
examples present the localization issue in the target language. 

• For the context unaware issue, we can look back to the example sentence “But it did not give me the time of day.” from
Fig. 8. It was literally translated word by word into “Ale nie dało mi to pory dnia.” which is in the sense of hour/time.
However, it shall be “Nie sądzę aby to było coś wyjątkowo/szczególnie dla mnie. (I do not think this is special to me.)”
based on the context, or “Ale to nie moja bajka” as an idiomatic expression which means “not my fairy tale” (indicating
not my cup of tea).
AppendixAppendix
32

Más contenido relacionado

Similar a AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations (ppt4WS)

Poster: Using Open Source Tools to Improve Access to Oral History Collections
Poster: Using Open Source Tools to Improve Access to Oral History CollectionsPoster: Using Open Source Tools to Improve Access to Oral History Collections
Poster: Using Open Source Tools to Improve Access to Oral History CollectionsBecky Yoose
 
Directions This assignment is for a Reading Course. The cross-dis
Directions This assignment is for a Reading Course. The cross-disDirections This assignment is for a Reading Course. The cross-dis
Directions This assignment is for a Reading Course. The cross-disAlyciaGold776
 
How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningLena Shakurova
 
Tech the Common Core!
Tech the Common Core!Tech the Common Core!
Tech the Common Core!aj6785
 
Learning Usage of English KWICly with WebLEAP/DSR
Learning Usage of English KWICly with WebLEAP/DSRLearning Usage of English KWICly with WebLEAP/DSR
Learning Usage of English KWICly with WebLEAP/DSRTakashi Yamanoue
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needsIvan Berlocher
 
Gaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive TechnologyGaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive TechnologyOntotext
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlpLaraOlmosCamarena
 
CS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: IntroductionCS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: IntroductionEelco Visser
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answeringAli Kabbadj
 
Web-based lessons and e-portfolios
Web-based lessons and e-portfoliosWeb-based lessons and e-portfolios
Web-based lessons and e-portfoliosEvelyn Izquierdo
 
Successful Single-Source Content Development
Successful Single-Source Content Development Successful Single-Source Content Development
Successful Single-Source Content Development Xyleme
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLLawrie Hunter
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communicationSören Auer
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to HindiRajat Jain
 

Similar a AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations (ppt4WS) (20)

Poster: Using Open Source Tools to Improve Access to Oral History Collections
Poster: Using Open Source Tools to Improve Access to Oral History CollectionsPoster: Using Open Source Tools to Improve Access to Oral History Collections
Poster: Using Open Source Tools to Improve Access to Oral History Collections
 
Directions This assignment is for a Reading Course. The cross-dis
Directions This assignment is for a Reading Course. The cross-disDirections This assignment is for a Reading Course. The cross-dis
Directions This assignment is for a Reading Course. The cross-dis
 
How to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learningHow to expand your nlp solution to new languages using transfer learning
How to expand your nlp solution to new languages using transfer learning
 
Tech the Common Core!
Tech the Common Core!Tech the Common Core!
Tech the Common Core!
 
Learning Usage of English KWICly with WebLEAP/DSR
Learning Usage of English KWICly with WebLEAP/DSRLearning Usage of English KWICly with WebLEAP/DSR
Learning Usage of English KWICly with WebLEAP/DSR
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
 
ISE - TAUS Tokyo Forum 2015
ISE - TAUS Tokyo Forum 2015ISE - TAUS Tokyo Forum 2015
ISE - TAUS Tokyo Forum 2015
 
FinalReport
FinalReportFinalReport
FinalReport
 
Gaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive TechnologyGaining Advantage in e-Learning with Semantic Adaptive Technology
Gaining Advantage in e-Learning with Semantic Adaptive Technology
 
Illustrated Code (ASE 2021)
Illustrated Code (ASE 2021)Illustrated Code (ASE 2021)
Illustrated Code (ASE 2021)
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
20110728 datalift-rpi-troy
20110728 datalift-rpi-troy20110728 datalift-rpi-troy
20110728 datalift-rpi-troy
 
CS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: IntroductionCS4200 2019 Lecture 1: Introduction
CS4200 2019 Lecture 1: Introduction
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
Web-based lessons and e-portfolios
Web-based lessons and e-portfoliosWeb-based lessons and e-portfolios
Web-based lessons and e-portfolios
 
Successful Single-Source Content Development
Successful Single-Source Content Development Successful Single-Source Content Development
Successful Single-Source Content Development
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
 
Tsl641
Tsl641Tsl641
Tsl641
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to Hindi
 

Más de Lifeng (Aaron) Han

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterLifeng (Aaron) Han
 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Lifeng (Aaron) Han
 
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Lifeng (Aaron) Han
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...Lifeng (Aaron) Han
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...Lifeng (Aaron) Han
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Lifeng (Aaron) Han
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Lifeng (Aaron) Han
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...Lifeng (Aaron) Han
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word ExpressionsLifeng (Aaron) Han
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerLifeng (Aaron) Han
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Lifeng (Aaron) Han
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveyLifeng (Aaron) Han
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Lifeng (Aaron) Han
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelLifeng (Aaron) Han
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Lifeng (Aaron) Han
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the societyLifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 

Más de Lifeng (Aaron) Han (20)

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)
 
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methods
 
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
Monte Carlo Modelling of Confidence Intervals in Translation Quality Evaluati...
 
Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...Apply chinese radicals into neural machine translation: deeper than character...
Apply chinese radicals into neural machine translation: deeper than character...
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
 
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
Chinese Character Decomposition for  Neural MT with Multi-Word ExpressionsChinese Character Decomposition for  Neural MT with Multi-Word Expressions
Chinese Character Decomposition for Neural MT with Multi-Word Expressions
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine Translation
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a survey
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the society
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Thesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronThesis-Master-MTE-Aaron
Thesis-Master-MTE-Aaron
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 

Último (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 

AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations (ppt4WS)

  • 1. *ADAPT Research Centre ^ Insight Centre for Data Analytics School of Computing, Dublin City University, Ireland AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations Lifeng Han*, Gareth J. F. Jones*, and Alan F. Smeaton^ research paper (long track) @ MWE-LEX2020, affiliated with COLING2020 1
  • 2. www.adaptcentre.ie Content • Background (motivation of this work) • Related work (MWE corpus construction) • AlphaMWE (Experimental work procedure) • Discussions (MT issues coming across MWEs) Cartoon, https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm Assume you have AlphaMWE paper in hand :) 2
  • 3. www.adaptcentre.ie Background Definition - Multi-Word Expressions (MWEs) • Various definitions include syntactic structure / semantic viewpoints (Constant et al., 2017): • covering syntactic anomalies, • non-compositionality, • non-substitutability • ambiguity . • Baldwin and Kim (2010) define MWEs as “lexical items that: • (i) can be decomposed into multiple lexemes; and • (ii) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity”. 3
  • 4. www.adaptcentre.ie Background MWE with NLP • MWEs are an interesting topic to both NLP researchers and linguists (Sag et al., 2002; Constant et al., 2017; Pulcini, 2020). • Automatic processing of MWEs poses significant challenges for computational linguistics (CL): • e.g. word sense disambiguation (WSD), parsing and MT (Lambert and Banchs, 2005; Bouamor et al., 2012; Skadina, 2016; Li et al., 2019; Han et al., 2020). • caused by variety/richness of MWEs as they are used in language. • However, very few bilingual/multilingual parallel corpora with MWE annotations available for cross-lingual NLP research and for downstream applications (e.g. MT, Johnson et al., 2016). Background 4
  • 5. www.adaptcentre.ie Background This paper’s view/focus • Verbal MWEs (vMWEs): a mature category, received attention from many researchers (e.g. Maldonado et al., 2017): • Have a verb as the head of the studied term / function as verbal phrases, e.g. “kick the bucket”, “cutting capers” and “(beer) gone to one’s head”. • AlphaMWE: construction of a multilingual corpus with vMWEs annotation, including English-Chinese/German/Polish • A deeper insight into the difficulties of processing MWEs: a categorisation of the errors made by MT models • State-of-the-art MT models are far from reaching parity with humans in terms of translation performance (Wu et al., 2016; Hassan et al., 2018). Alfredo Maldonado, Lifeng Han, Erwan Moreau, et al. (2017) Detection of verbal multi-word expressions via conditional random fields with syntactic dependency features and semantic re-ranking. In: 13th Workshop on Multiword Expressions (MWE 2017), Apr 2017, Valencia, Spain. Background 5
  • 6. www.adaptcentre.ie Content • Background • Related work • AlphaMWE • Discussions cartoon, https://www.amazon.ca/Tweety-Bird-not-happy-Decal/dp/ Related work Corpus 6
  • 7. www.adaptcentre.ie Related work Monolingual • monolingual corpora with vMWE annotations • PARSEME shared task corpora (Savary et al., 2017; Ramisch et al., 2018): • 2020 edition covers 14 languages including Chinese, Hindi, Turkish. • monolingual English - focused corpora includes • MWE aware “English Dependency Corpus” from the LDC (LDC2017T01)1: covers compound words used to train parsing models. • English MWEs from “web reviews data” by Schneider et al. (2014) : covers noun, verb and preposition super-senses • English verbal MWEs from Walsh et al. (2018) and Kato et al. (2018) : covers PARSEME shared task defined vMWE categories, or partly. 7
  • 8. www.adaptcentre.ie Related work Monolingual • All monolingual settings, independently by different language speakers without bilingual alignment. • Some non-native speakers, automatic creation, e.g. Kato et al. (2018): • combination of automatic annotations and crowdsourcing. • first construct a vMWE dictionary based on the English-language Wiktionary • … • formalise vMWE annotations as a multiword sense disambiguation problem to exploit crowdsourcing (VP, LVC, semi-fixed vMWEs) • These corpora => helpful for monolingual MWE research - discovery or identification, however, difficult to bilingual or multilingual research - MT or cross- lingual IE Related work Monolingual 8
  • 9. www.adaptcentre.ie Related work Bilingual • Vincze (2012), an English-Hungarian parallel corpus with annotations for light verb constructions (LVCs) • 703 LVCs for Hungarian and 727 for English were annotated • a comparison between English and Hungarian data • however, did not cover other types of vMWEs, e.g. verbal idioms, or verb- particle constructions • not extended to any other language pairs. Related work 9
  • 10. www.adaptcentre.ie Related work Bilingual • Han et al. (2020) : an automatic construction of bilingual MWE terms from parallel corpus, English-Chinese/German. • automated extraction of monolingual MWEs based on POS patterns, MWEtoolkit (Ramisch et al. 2010) • align two side monolingual MWEs into bilingual terms based on statistical lexical translation probability • due to the automated procedure, the extracted bilingual “MWE terms” contain also normal phrases • POS pattern design needs to be further refined (Skadina, 2016; Rikters and Bojar, 2017; Han et al., 2020) Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France. Related work 10
  • 11. www.adaptcentre.ie Content • Background • Related work • AlphaMWE • Discussions cartoon, https://www.bbc.com/news/ AlphaMWE 11
  • 13. www.adaptcentre.ie AlphaMWE Source corpus + MT model • Source corpus selection: PARSEME-2018 English • - (rational: annotators / application) • MT model comparison, the errors by: google/bing/baidu/DeepL • - Figure 2 in paper (Sample comparison of outputs from four MT models). • - Blue: redundancy; green: adding error; pink: reordering error; yellow: dropping error. AlphaMWE Check the paper for detail explanations 13
  • 14. www.adaptcentre.ie AlphaMWEAlphaMWE 14 Two sample sentences’ MT outputs comparison from head of test file Source # text = SQL Server verifies that the account name and password were validated when the user logged on to the system and grants access to the database, without requiring a separate logon name or password. DeepL # text = SQL Server Google text = SQL Server Bing [text] SQL Server Baidu # Ref. # = SQL Server Source # text = See the http://officeupdate.microsoft.com/, Microsoft Developer Network Web site for more information on TSQL. DeepL # text = TSQL http://officeupdate.microsoft.com/ Microsoft Developer Network Web Google text = Microsoft SQL http://officeupdate.microsoft.com/ Microsoft Bing [ ] TSQL http://officeupdate.microsoft.com/ TSQL Microsoft Baidu #text= http://officeupdate.microsoft.com/ TSQL Ref. # text = TSQL http://officeupdate.microsoft.com/ Blue: redundancy; green: adding error; pink: reordering error; yellow: dropping error.
  • 15. www.adaptcentre.ie AlphaMWE During annotation: several situations and decisions • a) (vMWEs, into a general phrase) -> offer two different references, one being revised in a vMWE/MWE presentation in the target; • b) (source, target) in a different register (thx, for instance) -> offer two reference sentences, with one of them using the same register; • c) (normal_phrase, into v/MWE), or (v/MWE, v/MWE) but the source vMWE was not annotated -> additions to include such vMWE (pairs); • d) English source: wrong/incorrect annotation, or some mis-spelling -> we corrected them. AlphaMWE Check the paper for interesting examples 15
  • 16. www.adaptcentre.ie AlphaMWE Size and possible usage • Extracted all 750 English sentences which have vMWE tags included. • - target covered so far: Chinese, German, Polish with Spanish/French/ Italian/Irish in future. • Its comparable to some standard shared task usage. • - development and test data sets from the annual WMT (Bojar et al., 2017) and also from the NIST MT challenges -> approximately 2,000 sentences for Dev/testing over some years (https://www.nist.gov/programs-projects/ machine-translation) AlphaMWE 16
  • 18. www.adaptcentre.ie Content • Background • Related work • AlphaMWE • Discussions (refer paper for more interesting ones) cartoon, https://images.app.goo.gl/Y6bAfr9oFsswWjUY7 Selected Cases 18
  • 19. www.adaptcentre.ie MT issues coming across MWEs (or related) English->Chinese/German/Polish • When MT produces incorrect / awkward translations, we categorise en->zh: • - common sense, super sense, abstract phrase, idiom, metaphor and ambiguity, with ambiguity further sub-divided • en->de (paper Appendix B examples): • (1) (English MWEs, one_word) partially because that German has separable verbs • (2) (normal phrase, polite/formal DE), biased MT, generally fine but depends on the context to decide which form is more suitable, and • (3) (vMWEs, normal phrase DE) often happened • en->pl (paper Appendix B examples): • coherence-unaware errors, literal translation errors, context unaware situation errors. See paper for interesting examples 19
  • 20. www.adaptcentre.ie MT issues coming across MWEs (or related) Idiom MT issues coming across MWEs (or related) 20
  • 21. www.adaptcentre.ie MT issues coming across MWEs (or related) metaphor MT issues coming across MWEs (or related) 21
  • 22. www.adaptcentre.ie MT issues coming across MWEs (or related) Ambiguity- Context unaware ambiguity MT issues coming across MWEs (or related) 22
  • 23. www.adaptcentre.ie MT issues coming across MWEs (or related) Ambiguity- social/literature unaware ambiguity MT issues coming across MWEs (or related) 23
  • 24. www.adaptcentre.ie Summary and Conclusions • Construction of multilingual parallel corpora, AlphaMWE, • vMWEs annotated by native speakers. • Described the procedure for MT model selection, human post editing and annotation, compared different state-of-the-art MT models and classified MT errors from vMWEs related to sentence/context translations. • Characterised errors into categories to support MT research. • English→Chinese, English→German and English→Polish and • Similarly categorised MT issues when handling MWEs. • AlphaMWE will be maintained as a publicly available corpora, and extended to include other language pairs, e.g. Spanish, French and Italian. • We also plan to extend the annotated MWE genres beyond the vMWEs defined in the PARSEME shared task. 24
  • 25. www.adaptcentre.ie References • Timothy Baldwin and Su Nam Kim. 2010. Multiword expressions. In Handbook of Natural LanguageProcessing, Second Edition, pages 267–292. Chapman and Hall. • Ondřej Bojar, et al. 2017. Findings of the 2017 conference on machine translation (WMT17). In WMT. • Dhouha Bouamor, Nasredine Semmar, and Pierre Zweigenbaum. 2012. Identifying bilingual multi-word expressions for statistical machine translation. In LREC. • Vishal Chowdhary and Scott Greenwood. 2017. Emt: End to end model training for msr machine translation. In Proceedings of the 1st Workshop on Data Management for End-to-End ML. • Mathieu Constant, et al. 2017. Survey: Multiword expression processing: A Survey. Computational Linguistics, 43(4):837–892. • Lifeng Han, Gareth J. F. Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multiword expression (MWE) parallel corpora. In LREC. • Hany Hassan, et al. 2018. Achieving human parity on automatic chinese to english news translation. ArXiv, abs/1803.05567. • Melvin Johnson, et al. 2016. Google’s multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558. • Akihiko Kato, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Construction of Large-scale English Verbal Multiword Expression Annotated Corpus. In LREC. • Patrik Lambert and Rafael E. Banchs. 2005. Data Inferred Multi-word Expressions for Statistical Machine Translation. In Proceedings of Machine Translation Summit X, pages 396–403, Thailand. • Xiaoqing Li, Jinghui Yan, Jiajun Zhang, and Chengqing Zong. 2019. Neural name translation improves neural machine translation. In Machine Translation, pages 93–100, Singapore. Springer. • Alfredo Maldonado, Lifeng Han, Erwan Moreau, et al. 2017. Detection of verbal multi-word expressions via conditional random fields with syntactic dependency features and semantic re-ranking. In The 13th MWE. 25
  • 26. www.adaptcentre.ie References • Virginia Pulcini. 2020. English-derived multi-word and phraseological units across languages in the global anglicism database. Textus, English Studies in Italy, (1/2020):127–143. • Carlos Ramisch et al. 2018. Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In LAW- MWE-CxG-2018) • Matīss Rikters and Ondřej Bojar. 2017. Paying Attention to Multi-Word Expressions in Neural MachineTranslation. In Proceedings of the 16th Machine Translation Summit. • Ivan A. Sag, et al. 2002. Multiword expressions: A pain in the neck for nlp. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing. • Agata Savary, et al. 2017. The PARSEME shared task on automatic identification of verbal multiword expressions. In MWE2017. • Nathan Schneider, et al. 2014. Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the LREC. • Inguna Skadina. 2016. Multi-word expressions in english-latvian machine translation. Baltic J. Modern Computing, 4:811–825. • Meng Sun, Bojian Jiang, Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Baidu neural machine translation systems for WMT19. In WMT. • Ashish Vaswani, et al. 2017. Attention is all you need. In Conference on Neural Information Processing System, pages 6000–6010. • Veronika Vincze. 2012. Light verb constructions in the SzegedParalellFX English–Hungarian parallel corpus. In LREC. • Abigail Walsh, et al. 2018. Constructing an annotated corpus of verbal MWEs for English. In (LAW-MWE-CxG2018), pages 193–200. • Yonghui Wu, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/ 1609.08144. References 26
  • 28. www.adaptcentre.ie • The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. The input of Alan Smeaton is part-funded by Science Foundation Ireland under grant number SFI/12/RC/2289 (Insight Centre). • The authors are very grateful to their colleagues who helped to create the AlphaMWE corpora by post editing and annotation work across all language pairs, to Yi Lu and Dr. Paolo Bolzoni for helping with the experiments, to Lorin Sweeney, Roise McGagh, and Eoin Treacy for discussions about English MWEs and terms, Yandy Wong for discussion of Cantonese examples, and Hailin Hao and Yao Tong for joining the first discussion. • The authors also thank the anonymous reviewers for their thorough reviews and insightful comments. Acknowledgement Funding and helping colleagues Acknowledgement 28
  • 29. www.adaptcentre.ie Appendix Error Examples from English Corpus Fixed in AlphaMWE • error annotations of vMWEs in source monolingual corpus surly have some impact on the accuracy level of the vMWE discovery and identification shared task, but also affect the bilingual usage of AlphaMWE, so we tried to address all these cases. For instance, the example sentence in ‘Abstract Phrase’, English corpus annotated wrongly the sequence “had to go on” as a verbal idioms (VIDs) which is not accurate. The verb “had” here is affiliated with “all he had” instead of “to go on”. So either we shall annotate “go on” as vMWE in the sentence or the overall clause “all he had to go on” as a studied term. 29
  • 30. www.adaptcentre.ie • Another example with a different type of vMWE is the sentence “He put them on in a kind of trance.” where the source English corpus tagged “put” and “trance” as Light-verb construction (VLC.cause). However, the phrase is with “put...on” instead of “put...trance”. “put someone into a trance” is a phrase to express “make someone into a half-conscious state”. However, for this sentence, if we check back a bit further of the context, it means “he put on his cloth in a kind of trance”. The word “trance” is affiliated with the phrase “in a kind of trance” instead of “put”. Error Examples from English Corpus Fixed in AlphaMWE AppendixAppendix 30
  • 31. www.adaptcentre.ie English→German MT Samples Reflecting Afore Mentioned MWE Related Issues. • Firstly, for the English vMWE translates into single German word, let’s see the vMWE “woke up” the sentence “An old woman with crinkly grey hair woke up at her post outside the lavatory and opened the door, smiling and grasping a filthy cleaning rag.” has corresponding German aligned word “erwachte” with a suitable translation “Eine alte Frau mit krausem, grauem Haar erwachte auf ihrem Posten vor der Toilette und öffnete die Tür, lächelte und griff nach einem schmutzigen Putzlappen.” • Secondly, for the automatic translation to German that is very biased towards choosing the polite or formal form, see the examples such as “Sie”instead of the second form singular “du”for “you”, “auf Basis von” instead of “basierend auf” for “based on”. To achieve a higher accuracy level of MT, it shall depend on the context of usage to decide which form is more suitable. AppendixAppendix 31
  • 32. www.adaptcentre.ie English→Polish MT Samples Reflecting Afore Mentioned MWE Related Issues. • Regarding the MT output issues on English to Polish that fall into coherence-unaware error, for instance, the vMWE “write off” in sentence “Then someone says that it can’t be long now before the Russians write Arafat off.” was translated as “Wypiszą” (Potem ktoś mówi, że już niedługo Rosjanie wypiszą Arafata.) which means “prescribe”, instead of correct one “spiszą na straty (Arafata)”. This error shall be able to avoid by the coherence of the sentence itself in meaning preservation models. • For the literal translation, we can see the example vMWE “gave (him) a look” in the sentence “She ruffled her feathers and gave him a look of deep disgust.” which was literally translated as “dała mu spojrzenie”, however, in Polish, people use “throw a look” as “rzuciła (mu) spojrzenie” instead of “gave (dała, a female form)”. Another example of literal translation leading to errors is the vMWE “turn the tables” from sentence “Now Iran wants to turn the tables and is inviting cartoonists to do their best by depicting the Holocaust.” which is translated as “odwrócić stoliki (turn tables)”, however, it shall be “odwrócić sytuację (turn the situation)” or “odwrócić rolę (turn role)” with a proper translation “Teraz Iran chce odwrócić sytuację i zachęca rysowników, by zrobili wszystko, co w ich mocy, przedstawiając Holocaust.” These two examples present the localization issue in the target language. • For the context unaware issue, we can look back to the example sentence “But it did not give me the time of day.” from Fig. 8. It was literally translated word by word into “Ale nie dało mi to pory dnia.” which is in the sense of hour/time. However, it shall be “Nie sądzę aby to było coś wyjątkowo/szczególnie dla mnie. (I do not think this is special to me.)” based on the context, or “Ale to nie moja bajka” as an idiomatic expression which means “not my fairy tale” (indicating not my cup of tea). AppendixAppendix 32