ADAPT Research Centre

School of Computing, Dublin City University, Ireland

lifeng.han@adaptcentre.ie
Meta-evaluation of
machine translation evaluation methods
Lifeng Han
Metrics2021, tutorial track: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T.
October 23–24.
1
Content
• Background (motivation of this work)

• Related work (earlier surveys)

• Machine Translation Evaluation (MTE) & Meta-eval

• Discussion and Perspectives

• Conclusions
Cartoon, https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm
2
Background
Driven by two factors:
• MTE as a key point in Translation Assessment & MT model development

• Human evaluation as the gold-standard criterion of assessment -> high cost?

• Automatic metrics for MT system tuning, parameter optimisation -> reliability?

• New MTE methods are much needed for

• i) distinguishing SOTA high-performance MT systems (ref. ACL21 outstanding
paper by Marie et al.) 

• ii) helping MT research on low-resource language pairs / domains

• Related earlier surveys: 

• not covering recent developments (EuroMatrix 2007, GALE 2009)

• specialised in/focused on a certain domain (Secară 2005)

• or very preliminary and brief (Goutte 2006)
3
Work based on an earlier pre-print: L. Han (2016)
“Machine Translation Evaluation Resources and Methods: A Survey”. arXiv:1605.04515 [cs.CL], updated Sep. 2018
Background
Goals:
• a concise but structured classification of MTE

• An overview

• extending the related literature, e.g. MTE work published in recent years

• + Quality Estimation (QE), + evaluating methods of MTE itself (meta-eval)

• making it easy for users/researchers to grasp what they need

• so they can efficiently find a suitable assessment method among existing ones, or
design their own, revised from / inspired by this overview

• We expect this survey to be helpful for more NLP tasks (model evaluations)

• NLP tasks share similarities, and Translation is a relatively large task that
covers/impacts/interacts with others, e.g.:

• text summarisation (Bhandari et al 2021), 

• language generation (Novikova et al. 2017emnlp), 

• searching (Liu et al. 2021acm), code generation (Liguori et al 2021)
4
5
[Diagram: PB-SMT pipeline and where MTE sits. Parallel and monolingual corpora are split and read; probability calculation produces word alignments, a phrase table, and a language model; a test sentence is chunked, translated, and reordered; parameter optimisation yields candidate MTs, which are scored by MTE against a reference translation, or by quality estimation / pattern finding.]
Topic origin: manual assessment began much earlier; automatic MTE (metrics, QE) arose together with SMT.
Han, Lifeng, Gareth Jones and Alan F. Smeaton (2020) MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora. In:
12th International Conference on Language Resources and Evaluation (LREC), 11-16 May, 2020, Marseille, France. (Virtual).
Background
example: MTE
(Han et al. 2020LREC)

MT output / candidate T
MTE1: src + ref + output

=>(HE)

MTE2: src + output

=>(HE, QE)

MTE3: ref + output

=>(HE, Metrics)
6
ZH source: 年年岁岁花相似，岁岁年年人不同。
ZH pinyin:
Nián nián suì suì huā xiāng sì, suì suì nián
nián rén bù tóng.
EN reference:
The flowers are similar each year, while
people are changing every year.
EN MT output:
One year spent similar, each year is
different
Content
cartoon, https://www.amazon.ca/Tweety-Bird-not-happy-Decal/
Related work

7
• Background (motivation of this work)

• Related work (earlier surveys)
• Machine Translation Evaluation (MTE) & Meta-eval

• Discussion and Perspectives

• Conclusions
Related work
- earlier surveys on MTE
• Alina Secară (2005): in Proceedings of the eCoLoRe/MeLLANGE workshop, pages 39–44.

• A special focus on error classification schemes

• from translation industry and translation teaching institutions

• a consistent and systematic error-classification survey up to the early 2000s

• Cyril Goutte (2006): Automatic Evaluation of Machine Translation Quality. 5 pages.

- surveyed the two kinds of metrics existing to date (2006): string match, and IR-inspired.

- String match: WER, PER; 

- IR-style techniques (n-gram precision): BLEU, NIST, F-measure, METEOR

• EuroMatrix (2007): 1.3: Survey of Machine Translation Evaluation. In EuroMatrix Project Report,
Statistical and Hybrid MT between All European Languages, co-ordinator: Prof. Hans Uszkoreit

• Human evaluation and automatic metrics for MT, to date (2007).

• An extensive list of referenced metrics, compared to other earlier peer work.
8
Related work
- earlier surveys on MTE
9
• Bonnie Dorr (2009) edited DARPA GALE program report. Chap 5: Machine Translation Evaluation.

• Global Autonomous Language Exploitation programme from NIST.

• Covers HE (introduction) and metrics (in more depth)

• Finishing up with METEOR, TER-P, SEPIA (a syntax-aware metric, 2007), etc.

• Concluding with the limitations of these metrics: high resource requirements, several seconds to score
one segment pair, etc.; calls for better automatic metrics

• Màrquez, L. (2013): Automatic evaluation of machine translation quality. Dialogue 2013 invited talk,
extended. @EAMT

• An emphasis on linguistically motivated measures

• introducing the Asiya online interface, by UPC, for MT output error analysis

• introducing QE (then a fresh TQA task in the WMT community)

• pre-print: L. Han (2016): Machine Translation Evaluation Resources and Methods: A Survey
https://arxiv.org/abs/1605.04515
We acknowledge and endorse the EuroMatrix (2007) and NIST's GALE (2009) program reports.
Content
cartoon, https://www.bbc.com/news/
10
data: https://github.com/poethan/MWE4MT
• Background (motivation of this work)

• Related work (earlier surveys)

• Machine Translation Evaluation (MTE) & Meta-eval
• Discussion and Perspectives

• Conclusions
11
Machine Translation Evaluation: Meta-eval of MTE paradigm
[Diagram: within Natural Language Processing, Machine Translation is assessed by three method families (human evaluation, automatic metrics, quality estimation), bridged by human-in-the-loop metrics and by metrics used as QE, with meta-evaluation assessing all of them.]
Manual methods

- started ever since Translation was deployed

- a boost in development together with MT technology

- traditional criteria (from the 1960s) vs advanced (further developed)


Automatic methods - from the 2000s on, when MT started making sense & becoming popular

- Metrics (traditional): word n-gram matching, linguistic features,
statistical models, then ML/DL based.

- QEs (later developed): without using reference translations.

Evaluating MTE methods (meta-evaluation/assessment)

- statistical significance, agreement, correlation methodologies, etc.
12
Machine Translation Evaluation: Meta-eval of MTE paradigm
13
Human Assessment Methods
- Traditional: intelligibility and fidelity; further developed into fluency, adequacy, comprehension
- Advanced: revisiting traditional criteria; crowd-source intelligence; segment ranking; utilising post-editing; extended criteria; task-oriented
selected definitions:

- Intelligible: as far as possible, the translation should read like normal, well-edited prose and
be readily understandable in the same way that such a sentence would be understandable if
originally composed in the translation language. 

- Fidelity or accuracy: the translation should, as little as possible, twist, distort, or controvert
the meaning intended by the original.

- Comprehension relates to “Informativeness”: the objective is to measure a system's ability to
produce a translation that conveys sufficient information, such that people can gain the
necessary information from it.

- Fluency: whether a sentence is well-formed and grammatically fluent in context

- Adequacy: how much of the information present in the original text the translation
contains (recall-oriented)
14
Meta-eval of MTE paradigm: Human Eval
- Task-oriented: evaluating output in light of the tasks for which it might be used, e.g. hotel booking

- Extended Criteria: suitability, interoperability, reliability, usability, efficiency, maintainability,
portability (focused on MT)

- Utilising Post-editing: compare the post-edited, corrected translation to the original MT output and
count the number of editing steps (a sketch follows this slide)

- Segment Ranking: assessors are asked to provide a complete ranking over all the
candidate translations of the same source segment, as in the WMT campaigns

- Crowd-source Intelligence: with Amazon MTurk, using fluency criteria, different scale set-ups,
with some quality-control solutions

- Revisiting Traditional Criteria: ask the assessors to mark any errors in the candidate
translations, with questions on adequacy and comprehension
15
Meta-eval of MTE paradigm: Human Eval
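Hedged illustration of "counting editing steps": the effort above can be approximated with a word-level Levenshtein distance between the MT output and its post-edited version. This is a minimal Python sketch of that idea only (TER-style but simplified: no shift operation and none of the normalisation choices of the official tools); the example sentences reuse the slide-6 example.

```python
# Minimal sketch: word-level edit distance (insert/delete/substitute) between
# an MT output and its post-edited version, as a simplified proxy for the
# "number of editing steps". Real TER/HTER also allows block shifts.

def edit_distance(hyp, ref):
    """Levenshtein distance over token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining hypothesis tokens
    for j in range(n + 1):
        d[0][j] = j          # insert all remaining reference tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

mt_output = "One year spent similar , each year is different".split()
post_edit = "The flowers are similar each year , while people are changing every year".split()
steps = edit_distance(mt_output, post_edit)
print(steps, round(steps / len(post_edit), 3))  # edit count and a TER-like rate
```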
Automatic Quality Assessment Methods
- Traditional:
  - N-gram surface matching: edit distance; precision and recall; word order
  - Deeper linguistic features:
    - Syntax: POS, phrases, sentence structure
    - Semantics: named entities, MWEs, synonyms, textual entailment, paraphrase, semantic roles, language models
- Advanced: deep learning models
16
- n-gram string matching: carried out at the word-surface level (see the sketch after this slide)

- edit distance

- precision, recall, F-score, word order

- augmented/hybrid

- linguistic features: to capture syntax and meaning

- syntactic (POS, chunks, sentence structure)

- semantic (dictionaries, paraphrase, NEs, MWEs, entailment, language
models)

- deep learning / ML models

- LSTMs, NNs, RNNs: modelling syntactic and semantic features
17
Meta-eval of MTE paradigm: Automatic Metrics
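To make the "n-gram string matching" family concrete, here is a minimal Python sketch of clipped n-gram precision, the core idea behind BLEU-style surface matching. It is illustrative only (single reference, no brevity penalty, no smoothing); standard tools such as sacreBLEU should be used in practice.

```python
# Minimal sketch: clipped n-gram precision of a hypothesis against one
# reference -- the word-surface matching idea behind BLEU-style metrics.
# Omits BLEU's brevity penalty, multi-reference support, and smoothing.

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp, ref, n):
    hyp_c, ref_c = ngram_counts(hyp, n), ngram_counts(ref, n)
    overlap = sum(min(c, ref_c[g]) for g, c in hyp_c.items())  # clip by reference
    return overlap / max(sum(hyp_c.values()), 1)

hyp = "the flowers are similar each year".split()
ref = "the flowers are similar every year".split()
for n in (1, 2, 3):
    print(f"{n}-gram precision:", round(clipped_precision(hyp, ref, n), 3))
```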
Our work:

N-gram string matching, augmented/hybrid:

LEPOR, nLEPOR: Length Penalty, Precision, n-gram Position difference Penalty and
Recall (with n-gram P and R: nLEPOR). https://github.com/poethan/LEPOR/
(code in Perl) - best score on the English-to-other WMT2013 shared task.
(A sketch of the scoring formula follows this slide.)

Deeper linguistic features:

hLEPOR: word level + PoS (harmonic mean of factors)

HPPR: word level + phrase structure

News: customised hLEPOR, using a pre-trained language model and human-labelled
data, from this year at: https://github.com/poethan/cushLEPOR

- (presented at MT Summit 2021: https://aclanthology.org/2021.mtsummit-up.28/)

18
Meta-eval of MTE paradigm: Automatic Metrics
Han et al. (2012) "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors" in COLING.
hLEPOR: Language-independent Model for Machine Translation Evaluation with Reinforced Factors, in MT summit 2013
HPPR: Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation. In GSCL2013
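For readers who want the shape of the metric, the LaTeX sketch below reconstructs the LEPOR composition as we recall it from Han et al. (2012); exact parameterisation and tunable weights may differ from the paper and the Perl release, so treat it as an approximation rather than the authoritative definition.

```latex
% Sketch of the LEPOR composition (a reconstruction; see Han et al. 2012 for
% the authoritative definition). c, r: candidate and reference lengths;
% PD_i: position difference of the i-th matched token; P, R: precision, recall.
\mathrm{LEPOR} = LP \times NPosPenal \times \mathrm{Harmonic}(\alpha R, \beta P)

LP =
\begin{cases}
e^{1 - r/c} & c < r \\
1           & c = r \\
e^{1 - c/r} & c > r
\end{cases}
\qquad
NPosPenal = e^{-NPD}, \quad
NPD = \frac{1}{\mathrm{Len}_{\mathrm{output}}} \sum_{i=1}^{\mathrm{Len}_{\mathrm{output}}} |PD_i|

\mathrm{Harmonic}(\alpha R, \beta P) = \frac{\alpha + \beta}{\tfrac{\alpha}{R} + \tfrac{\beta}{P}}
```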
How to estimate translation quality without depending on a prepared reference translation?

- e.g. based on features that can be extracted from the MT output and the source text

- estimation spans: word, sentence, system level.

Evaluation methods (a sketch of the error measures follows this slide):

- word level: Mean Absolute Error (MAE) as a primary metric, and Root Mean Squared
Error (RMSE) as a secondary metric, borrowed from regression tasks.

- sentence level: DA, HTER

- document level: MQM (multi-dimensional quality metrics)

Recent trends:

- Metrics and QE integration / cross-performance: how to utilise and communicate knowledge
between the metrics task and the QE task

- WMT19: 10 reference-less evaluation metrics were used for the QE task,
“QE as a Metric”

- challenges: predicting the source words that lead to errors, multilingual or language-independent
models, low-resource languages, etc.
19
Meta-eval of MTE paradigm: Metrics / QE
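As referenced in the list above, the word-level QE measures are plain regression errors. A minimal Python sketch, with hypothetical prediction and gold values:

```python
# Minimal sketch: MAE (primary) and RMSE (secondary) between predicted QE
# scores and gold labels. All numbers below are hypothetical.

import math

def mae(pred, gold):
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

pred = [0.31, 0.72, 0.15, 0.58]  # hypothetical QE predictions (e.g. HTER-like)
gold = [0.25, 0.80, 0.10, 0.55]  # hypothetical gold annotations
print("MAE:", round(mae(pred, gold), 4), "RMSE:", round(rmse(pred, gold), 4))
```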
Our QE models: statistical modelling with metrics criteria for QE (an illustrative
sketch of point II follows this slide)

I. - sentence-level quality estimation: EBLEU (an enhanced BLEU with POS)

II. - system selection: the probability model Naïve Bayes (NB) and SVM as classification
algorithms, with features from evaluation metrics (Length penalty, Precision, Recall
and Rank values)

III. - word-level quality estimation: to take contextual information into account, we
employed a discriminative undirected probabilistic graphical model, conditional
random fields (CRF), in addition to the NB algorithm.
20
Meta-eval of MTE paradigm: Automatic Eval/QE
Han et al. (2013) Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical
Modeling. In WMT13. https://www.aclweb.org/anthology/W13-2245/
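The sketch below illustrates point II (system selection framed as classification over metric-derived features) in Python with scikit-learn. It is not the WMT13 code: the feature values and labels are invented, and only the Naive Bayes branch is shown; the paper also used SVMs.

```python
# Illustrative only (not the WMT13 implementation): system selection as
# classification. Each row holds metric-derived features for one candidate
# translation: [length_penalty, precision, recall, rank]. Values are invented.

from sklearn.naive_bayes import GaussianNB

X_train = [
    [0.95, 0.62, 0.58, 1.0],
    [0.80, 0.41, 0.45, 3.0],
    [0.99, 0.70, 0.66, 1.0],
    [0.75, 0.35, 0.40, 4.0],
]
y_train = [1, 0, 1, 0]  # 1 = candidate comes from the best system, 0 = not

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.90, 0.55, 0.60, 2.0]]))  # classify a new candidate
```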
Statistical significance: via bootstrap re-sampling methods; do different systems
indeed have different quality? Statistical significance intervals for evaluation
metrics on small test sets.

Human judgement agreement (kappa levels): how often assessors agree with each
other on the same segments, or with themselves on the same segment.

Correlation methods: how well do the candidate rankings and scores (metrics, QE)
correlate with the reference ranking (e.g. human)?
Spearman, Pearson, Kendall's tau. (A sketch of these tools follows this slide.)

Metrics comparisons: compare and combine different metrics.
21
MT evaluation paradigm: Meta-eval
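A minimal Python sketch of the three meta-evaluation tools above, with invented numbers throughout: paired bootstrap re-sampling for significance, correlation of metric scores against human scores, and Cohen's kappa for annotator agreement (scipy and scikit-learn are assumed available).

```python
# Minimal sketch of meta-evaluation tools; all numbers below are invented.

import random
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

# 1) Bootstrap re-sampling: segment scores of two systems on one test set.
sys_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
sys_b = [0.58, 0.57, 0.65, 0.47, 0.60, 0.61]

def bootstrap_win_rate(a, b, trials=1000):
    """Fraction of re-sampled test sets on which system A outscores B."""
    n, wins = len(a), 0
    for _ in range(trials):
        idx = [random.randrange(n) for _ in range(n)]
        wins += sum(a[i] for i in idx) > sum(b[i] for i in idx)
    return wins / trials

print("A beats B on", bootstrap_win_rate(sys_a, sys_b), "of resamples")

# 2) Correlation: system-level metric scores vs. human assessment scores.
metric = [0.61, 0.58, 0.70, 0.45, 0.66, 0.52]
human = [72.0, 70.5, 78.0, 60.2, 74.1, 65.0]
print("Pearson:", pearsonr(metric, human)[0])
print("Spearman:", spearmanr(metric, human).correlation)
print("Kendall tau:", kendalltau(metric, human).correlation)

# 3) Agreement: two annotators' labels on the same segments.
annot1 = [1, 2, 1, 3, 2, 1]
annot2 = [1, 2, 2, 3, 2, 1]
print("Cohen kappa:", cohen_kappa_score(annot1, annot2))
```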
Content
22
data: https://github.com/poethan/MWE4MT
cartoon, https://www.bbc.com/news/
Discussion
• Background (motivation of this work)

• Related work (earlier surveys)

• Machine Translation Evaluation (MTE) & Meta-eval

• Discussion and Perspectives
• Conclusions
Discussion and Perspectives
23
• MQM: getting more attention recently, applied to WMT tasks (QE 2020); it has many criteria and
needs to be adapted to the specific evaluation task. (www.qt21.eu/mqm-definition)

• QE: more attention; more realistic in practical situations where no reference is available, e.g.
emergency instruction/translation (war, natural disaster, rescue),
and in low-resource scenarios. As SAE stated in 2001: poor MT can lead to
death, ….

• Human assessment agreement has always been an issue, and recent Google HE
data on WMT shows that crowd-sourced human assessment disagrees largely with
professional translators, i.e. crowd-sourced assessors cannot distinguish genuinely
improved MT systems whose syntactic or semantic quality has improved.

• How to improve test-set diversity (domains) and the coverage of semantic equivalence
in reference translations, e.g. 口水战 (kǒushuǐ zhàn: war of words vs. oral combat).
Discussion and Perspectives
24
• Metrics still face challenges in segment-level performance/correlation.

• Metric score interpretation: automatic evaluation metrics usually yield a score that is meaningless
in isolation: very test-set specific, with an absolute value that is not informative:

• — what is the meaning of a -16094 score from the MTeRater metric, or a 1.98 score from ROSE?

• — similarly for 19.07 from BEER / 28.47 from BLEU / 33.03 from METEOR for a mostly good translation

• — we suggest our own LEPOR and hLEPOR/nLEPOR, where relatively good/better MT system translations
were reported to score 0.60 to 0.80 on the overall scale (0 ~ 1)

• Low-resource MT development & evaluation: can we avoid the long road that high-resource language pairs took?

• — semantic evaluation metrics

• — human-in-the-loop evaluation

• — semantically interpreted test suites
Content
cartoon, https://images.app.goo.gl/Y6bAfr9oFsswWjUY7
What have we done?
25
• Background (motivation of this work)

• Related work (earlier surveys)

• Machine Translation Evaluation (MTE) & Meta-eval

• Discussion and Perspectives

• Conclusions
Conclusions
26
• A brief overview of both human/manual and automatic MTE methods for
translation quality, and of the meta-eval: evaluating the assessment methods themselves.

• Automated models cover both reference-dependent metrics and
quality estimation without references. There is a new trend of reference-less
metrics.

• MT is still far from reaching human parity, especially in broader domains and topics
and for low-resource language pairs. MTE will continue to play a key role in MT
development.

• MTE methods will surely influence other NLP tasks, and be influenced by them
in turn.
References
• Our earlier work related to this tutorial:

• Lifeng Han. (2016) Machine Translation Evaluation Resources and Methods: A Survey https://arxiv.org/abs/1605.04515 updated 2018.

• Lifeng Han, Gareth J. F. Jones, and Alan Smeaton. (2020) MultiMWE: Building a multi-lingual multiword expression (MWE) parallel corpora. In LREC.

• Lifeng Han. (2014) LEPOR: An Augmented Machine Translation Evaluation Metric. MSc thesis, University of Macau. https://library2.um.edu.mo/etheses/b33358400_ft.pdf

• Lifeng Han, Irina Sorokina, Gleb Erofeev, and Serge Gladkoff. (2021) CushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge
Model to Improve Agreement with Human Judgements. WMT2021 (in press). http://doras.dcu.ie/26182/ 

• Lifeng Han, Gareth Jones, Alan Smeaton. (2021) Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods.
MoTra2021@NoDaLiDa2021. https://aclanthology.org/2021.motra-1.3/ 

• WMT findings:

• WMT and Metrics task: Koehn and Monz (2006), …, Ondřej Bojar et al. (2013), … on to Barrault et al. (2020)

• QE task findings: from Callison-Burch et al. (2012), … on to Specia et al. (2020)

• MTE interaction with other NLP task evaluations:

• Liu et al. (2021) Meta-evaluation of Conversational Search Evaluation Metrics. ACM Transactions on Information Systems. https://arxiv.org/pdf/2104.13453.pdf

• Pietro Liguori et al. 2021. Shellcode_IA32: A Dataset for Automatic Shellcode Generation. https://arxiv.org/abs/2104.13100 

• Jekaterina Novikova et al. (2017) Why We Need New Evaluation Metrics for NLG. EMNLP. https://www.aclweb.org/anthology/D17-1238/

• A. Celikyilmaz, E. Clark, J. Gao. (2020) Evaluation of Text Generation: A Survey. arXiv:2006.14799

• D. Qiu, B. Rothrock, T. Islam, A. K. Didier, V. Z. Sun, et al. (2020) SCOTI: Science Captioning of Terrain Images for data prioritization and local image search. Planetary and Space …, Elsevier
27
28
• Go raibh maith agat!
• 谢谢!
• Dankeschön
• Thank you!
• Gracias!
• Grazie!
• Dziękuję Ci!
• Merci!
• Dank je!
• спасибі!
• धन्यवाद!
• Благодаря ти!
quiz: which language do you recognise? 🙃
謝謝!
tak skal
du have
Takk skal du ha
tack
Kiitos
Þakka þér fyrir
Qujan Qujanaq Qujanarsuaq
29
• Go raibh maith agat!
• 谢谢!
• Dankeschön
• Thank you!
• Gracias!
• Grazie!
• Dziękuję Ci!
• Merci!
• Dank je!
• спасибі!
• धन्यवाद!
• Благодаря ти!
How many can you speak? 🙃
Dankeschön
Danish:
tak skal
du have
Norwegian: Takk skal du ha
Swedish: tack
Finnish: Kiitos
Icelandic: Þakka þér fyrir
Kalaallisut/Greenlandic: Qujan Qujanaq Qujanarsuaq
謝謝!

Más contenido relacionado

La actualidad más candente

Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentationSurya Sg
 
Nlp presentation
Nlp presentationNlp presentation
Nlp presentationSurya Sg
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022NU_I_TODALAB
 
Practical machine learning - Part 1
Practical machine learning - Part 1Practical machine learning - Part 1
Practical machine learning - Part 1Traian Rebedea
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language ProcessingSebastian Ruder
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categoriesWarNik Chow
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT IntroductionRIILP
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Lifeng (Aaron) Han
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringTraian Rebedea
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Ahmed Magdy Ezzeldin, MSc.
 
Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vecananth
 
SemEval - Aspect Based Sentiment Analysis
SemEval - Aspect Based Sentiment AnalysisSemEval - Aspect Based Sentiment Analysis
SemEval - Aspect Based Sentiment AnalysisAditya Joshi
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoSebastian Ruder
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...Lifeng (Aaron) Han
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingYoung Seok Kim
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH WarNik Chow
 
Answer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic QuestionsAnswer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic QuestionsAhmed Magdy Ezzeldin, MSc.
 

La actualidad más candente (20)

Searching for the Best Machine Translation Combination
Searching for the Best Machine Translation CombinationSearching for the Best Machine Translation Combination
Searching for the Best Machine Translation Combination
 
Nlp research presentation
Nlp research presentationNlp research presentation
Nlp research presentation
 
Nlp presentation
Nlp presentationNlp presentation
Nlp presentation
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
Practical machine learning - Part 1
Practical machine learning - Part 1Practical machine learning - Part 1
Practical machine learning - Part 1
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
 
2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction2. Constantin Orasan (UoW) EXPERT Introduction
2. Constantin Orasan (UoW) EXPERT Introduction
 
1909 paclic
1909 paclic1909 paclic
1909 paclic
 
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
Incorporating Chinese Radicals Into Neural Machine Translation: Deeper Than C...
 
Intro to Deep Learning for Question Answering
Intro to Deep Learning for Question AnsweringIntro to Deep Learning for Question Answering
Intro to Deep Learning for Question Answering
 
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
Arabic Question Answering: Challenges, Tasks, Approaches, Test-sets, Tools, A...
 
Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
 
SemEval - Aspect Based Sentiment Analysis
SemEval - Aspect Based Sentiment AnalysisSemEval - Aspect Based Sentiment Analysis
SemEval - Aspect Based Sentiment Analysis
 
Multi-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer CalixtoMulti-modal Neural Machine Translation - Iacer Calixto
Multi-modal Neural Machine Translation - Iacer Calixto
 
Arabic question answering ‫‬
Arabic question answering ‫‬Arabic question answering ‫‬
Arabic question answering ‫‬
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
Answer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic QuestionsAnswer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic Questions
 

Similar a Meta-evaluation of machine translation evaluation methods

Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Lifeng (Aaron) Han
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...Lifeng (Aaron) Han
 
Text Analytics applied to HR
Text Analytics applied to HRText Analytics applied to HR
Text Analytics applied to HRHendrik Feddersen
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Tech capabilities with_sa
Tech capabilities with_saTech capabilities with_sa
Tech capabilities with_saRobert Martin
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answeringAli Kabbadj
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)Dmitry Kan
 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
 
Work in progress: ChatGPT as an Assistant in Paper Writing
Work in progress: ChatGPT as an Assistant in Paper WritingWork in progress: ChatGPT as an Assistant in Paper Writing
Work in progress: ChatGPT as an Assistant in Paper WritingManuel Castro
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...Lifeng (Aaron) Han
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoSebastian Ruder
 
Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Lifeng (Aaron) Han
 
MT(1).pdf
MT(1).pdfMT(1).pdf
MT(1).pdfs n
 
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017Manuel Herranz
 

Similar a Meta-evaluation of machine translation evaluation methods (20)

Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date ov...
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 
Text Analytics applied to HR
Text Analytics applied to HRText Analytics applied to HR
Text Analytics applied to HR
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Tech capabilities with_sa
Tech capabilities with_saTech capabilities with_sa
Tech capabilities with_sa
 
TAUS QE Summit 2017 eBay EN-DE MT Pilot
TAUS QE Summit 2017   eBay EN-DE MT PilotTAUS QE Summit 2017   eBay EN-DE MT Pilot
TAUS QE Summit 2017 eBay EN-DE MT Pilot
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine Translation
 
Machine translation course program (in English)
Machine translation course program (in English)Machine translation course program (in English)
Machine translation course program (in English)
 
A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
Work in progress: ChatGPT as an Assistant in Paper Writing
Work in progress: ChatGPT as an Assistant in Paper WritingWork in progress: ChatGPT as an Assistant in Paper Writing
Work in progress: ChatGPT as an Assistant in Paper Writing
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Profession...
 
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila CastilhoHuman Evaluation: Why do we need it? - Dr. Sheila Castilho
Human Evaluation: Why do we need it? - Dr. Sheila Castilho
 
Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...Unsupervised Quality Estimation Model for English to German Translation and I...
Unsupervised Quality Estimation Model for English to German Translation and I...
 
MT(1).pdf
MT(1).pdfMT(1).pdf
MT(1).pdf
 
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017
Pangeanic Cor-ActivaTM-Neural machine translation Taus Tokyo 2017
 

Más de Lifeng (Aaron) Han

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterLifeng (Aaron) Han
 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Lifeng (Aaron) Han
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...Lifeng (Aaron) Han
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerLifeng (Aaron) Han
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Lifeng (Aaron) Han
 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...Lifeng (Aaron) Han
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveyLifeng (Aaron) Han
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelLifeng (Aaron) Han
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Lifeng (Aaron) Han
 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a surveyLifeng (Aaron) Han
 
LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual TreebanksPPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual TreebanksLifeng (Aaron) Han
 

Más de Lifeng (Aaron) Han (16)

WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni ManchesterWMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
WMT2022 Biomedical MT PPT: Logrus Global and Uni Manchester
 
Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)Measuring Uncertainty in Translation Quality Evaluation (TQE)
Measuring Uncertainty in Translation Quality Evaluation (TQE)
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
 
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longerBuild moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
Build moses on ubuntu (64 bit) system in virtubox recorded by aaron _v2longer
 
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
Detection of Verbal Multi-Word Expressions via Conditional Random Fields with...
 
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations ...
 
machine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a surveymachine translation evaluation resources and methods: a survey
machine translation evaluation resources and methods: a survey
 
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning ModelChinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model
 
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
Quality Estimation for Machine Translation Using the Joint Method of Evaluati...
 
Thesis-Master-MTE-Aaron
Thesis-Master-MTE-AaronThesis-Master-MTE-Aaron
Thesis-Master-MTE-Aaron
 
Machine translation evaluation: a survey
Machine translation evaluation: a surveyMachine translation evaluation: a survey
Machine translation evaluation: a survey
 
LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric LEPOR: an augmented machine translation evaluation metric
LEPOR: an augmented machine translation evaluation metric
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual TreebanksPPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
PPT-CCL: A Universal Phrase Tagset for Multilingual Treebanks
 

Último

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Último (20)

The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Meta-evaluation of machine translation evaluation methods

  • 1. ADAPT Research Centre School of Computing, Dublin City University, Ireland lifeng.han@adaptcentre.ie Meta-evaluation of machine translation evaluation methods Lifeng Han Metrics2021 Tutorial Track/type: Workshop on Informetric and Scientometric Research (SIG-MET), ASIS&T. October 23–24. 1
  • 2. www.adaptcentre.ie Content • Background (motivation of this work) • Related work (earlier surveys) • Machine Translation Evaluation (MTE) & Meta-eval • Discussion and Perspectives • Conclusions Cartoon, https://www.freepik.com/premium-vector/boy-with-lot-books_5762027.htm 2
  • 3. www.adaptcentre.ie Background Driven: - two factors • MTE as a key point in Translation Assessment & MT model development • Human evaluations as the golden criteria of assessment-> high-cost? • Automatic metrics for MT system tuning, parameter optimisation-> reliability? • New MTE methods very needed for • i) distinguishing SOTA high-performance MT systems (ref. ACL21 outstanding paper by Marie et al.) • ii) helping low-resource language pairs / domains MT research • Related earlier surveys: • not covering recent years development (EuroMatrix07, GALE09) • Specialised/focused to certain domain (Secară 05) • or very initial, brief (Goutte 06) 3 Work based on an earlier pre-print: L. Han (2016) “Machine Translation Evaluation Resources and Methods: A Survey”. arXiv:1605.04515 [cs.CL] updated-Sep.2018
  • 4. www.adaptcentre.ie Background Goals: • a concise but structured classification of MTE • An overview • extending related literature, e.g. recent years published MTE work • + Quality Estimation (QE), + evaluating methods of MTE itself (meta-eval) • easy to grasp what users/researchers need • find their corresponding assessment methods efficiently from existing ones or make their own that can be revised/inspired by this overview • We expect this survey to be helpful for more NLP tasks (model evaluations) • NLP tasks share similarities and Translation is a relatively larger task that covers/impacts/interacts with others - e.g.: • text summarisation (Bhandari et al 2021), • language generation (Novikova et al. 2017emnlp), • searching (Liu et al. 2021acm), code generation (Liguori et al 2021) 4
  • 5. 5 Split&read Parallel corpora Monolingual corpora Prob. Calculation Word alignment Phrase table lang. Model Test sentence Chunk MT Reordered MT Param. Optimize Candidate MTs Quality Est. Find pattern Reference T MTE Topic Origin: Manual assessment began much earlier, automatic MTE with SMT, metrics, QEs PB-SMT
  • 6. www.adaptcentre.ie Han, Lifeng, Gareth, Jones and Alan F., Smeaton (2020) MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora. In: 12th International Conference on Language Resources and Evaluation (LREC), 11-16 May, 2020, Marseille, France. (Virtual). Background example: MTE (Han et al. 2020LREC) MT output / candidat T MTE1: src + ref + output =>(HE) MTE2: src + output =>(HE, QE) MTE3: ref + output =>(HE, Metrics) 6 ZH source: ZH pinyin: Nián nián suì suì huā xiāng sì, suì suì nián nián rén bù tóng. EN reference: The flowers are similar each year, while people are changing every year. EN MT output: One year spent similar, each year is different
  • 7. www.adaptcentre.ie Content cartoon, https://www.amazon.ca/Tweety-Bird-not-happy-Decal/ Related work 7 • Background (motivation of this work) • Related work (earlier surveys) • Machine Translation Evaluation (MTE) & Meta-eval • Discussion and Perspectives • Conclusions
  • 8. www.adaptcentre.ie Related work - earlier surveys on MTE • Alina Secară (2005): in Proceedings of the eCoLoRe/MeLLANGE workshop. page 39-44. • A special focus on error classification schemes • from translation industry and translation teaching institutions • a consistent and systematic error classification survey till early 2000s • Cyril Goutte (2006): Automatic Evaluation of Machine Translation Quality. 5 pages. - surveyed two kind of metrics to date (2006): string match, and IR inspired. - String match: WER, PER; - IR style technique (n-gram precision): BLEU, NIST, F, METEOR • EuroMatrix (2007): 1.3: Survey of Machine Translation Evaluation. In EuroMatrix Project Report, Statistical and Hybrid MT between All European Languages, co-ordinator: Prof. Hans Uszkoreit • Human evaluation and automatic metrics for MT, to date (2007). • An extensive referenced metric lists, compared to other earlier peer work. 8
• 9. www.adaptcentre.ie Related work - earlier surveys on MTE
• Bonnie Dorr (2009), editor: DARPA GALE program report, Chapter 5: Machine Translation Evaluation.
• Global Autonomous Language Exploitation programme from NIST.
• Covers human evaluation (briefly) and automatic metrics (in more depth)
• Finishing with METEOR, TERp, SEPIA (a syntax-aware metric, 2007), etc.
• Concludes with the limitations of these metrics (high resource requirements, several seconds to score a single segment pair, etc.) and calls for better automatic metrics
• L. Màrquez (2013): invited talk at Dialogue 2013, extended @EAMT: automatic evaluation of machine translation quality.
• An emphasis on linguistically motivated measures
• introducing Asiya, an online interface by UPC for MT output error analysis
• introducing QE (then a new translation quality assessment task in the WMT community)
• pre-print: L. Han (2016): Machine Translation Evaluation Resources and Methods: A Survey. https://arxiv.org/abs/1605.04515
We acknowledge and endorse the EuroMatrix (2007) and NIST's GALE (2009) program reports.
9
• 10. www.adaptcentre.ie Content
cartoon, https://www.bbc.com/news/
data: https://github.com/poethan/MWE4MT
• Background (motivation of this work)
• Related work (earlier surveys)
• Machine Translation Evaluation (MTE) & Meta-eval
• Discussion and Perspectives
• Conclusions
10
• 11. www.adaptcentre.ie Machine Translation Evaluation: Meta-eval of MTE paradigm
[Diagram: the MTE paradigm within Natural Language Processing. Machine Translation output is assessed by human evaluation, automatic metrics, and quality estimation; meta-evaluation assesses all three; human-in-the-loop metrics and "metrics as QE" bridge the categories.]
11
• 12. www.adaptcentre.ie Machine Translation Evaluation: Meta-eval of MTE paradigm
Manual methods
- in use ever since translation itself was first practised
- developed rapidly alongside MT technology
- traditional criteria (from the 1960s) vs advanced (further developed) ones
Automatic methods
- from the 2000s on, as MT became usable and popular
- Metrics (traditional): word n-gram matching, linguistic features, statistical models, then ML/DL based.
- QE (developed later): estimates quality without using reference translations.
Evaluating MTE methods (meta-evaluation/assessment)
- statistical significance, agreement, correlation methodologies, etc.
12
• 13. Human Assessment Methods
[Taxonomy: Traditional: intelligibility and fidelity; fluency, adequacy, comprehension. Advanced (further development): revisiting traditional criteria; crowd-source intelligence; segment ranking; utilizing post-editing; extended criteria; task-oriented.]
13
• 14. www.adaptcentre.ie Meta-eval of MTE paradigm: Human Eval
selected definitions:
- Intelligible: as far as possible, the translation should read like normal, well-edited prose and be readily understandable in the same way that such a sentence would be understandable if originally composed in the translation language.
- Fidelity or accuracy: the translation should, as little as possible, twist, distort, or controvert the meaning intended by the original.
- Comprehension relates to "informativeness": the objective is to measure a system's ability to produce a translation that conveys sufficient information, such that people can gain the necessary information from it.
- Fluency: whether a sentence is well-formed and grammatically fluent in context.
- Adequacy: how much of the information present in the original text the translation contains (a recall-oriented notion).
14
• 15. www.adaptcentre.ie Meta-eval of MTE paradigm: Human Eval
- task oriented: evaluate output in light of the tasks for which it might be used, e.g. hotel booking.
- Extended Criteria: suitability, interoperability, reliability, usability, efficiency, maintainability, portability (with a focus on MT).
- Utilising Post-editing: compare the post-edited, corrected translation to the original MT output and count the number of editing steps (see the edit-distance sketch after this list).
- Segment Ranking: assessors are asked to provide a complete ranking over all the candidate translations of the same source segment, as in the WMT campaigns.
- Crowd-source intelligence: via Amazon MTurk, using fluency criteria and different scale set-ups, with quality-control mechanisms.
- Revisiting Traditional Criteria: ask assessors to mark any errors they find in the candidate translations, alongside questions on adequacy and comprehension.
15
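The post-editing measure above counts editing steps between the MT output and its post-edited version, in the spirit of (H)TER. Below is a minimal sketch assuming word-level Levenshtein operations (insertions, deletions, substitutions) without TER's block-shift operation; the function name and normalisation are illustrative, not the official TER implementation.

```python
def edit_steps(hypothesis: str, post_edited: str) -> float:
    """Word-level Levenshtein distance between an MT output and its
    post-edited version, normalised by post-edit length (HTER-style).
    A sketch only: real TER/HTER also models block shifts."""
    hyp, ref = hypothesis.split(), post_edited.split()
    # dp[i][j] = min edits to turn hyp[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i            # delete all remaining hyp words
    for j in range(len(ref) + 1):
        dp[0][j] = j            # insert all remaining ref words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / max(len(ref), 1)

# Example from the earlier MTE slide: fewer edit steps means a better output.
print(edit_steps("One year spent similar each year is different",
                 "The flowers are similar each year while people are changing every year"))
```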
• 16. Automatic Quality Assessment Methods
[Taxonomy: Traditional: n-gram surface matching (edit distance; precision and recall; word order). Advanced: deeper linguistic features, namely syntax (POS, phrase, sentence structure) and semantics (named entities, MWEs, synonyms, textual entailment, paraphrase, semantic roles, language models). Deep Learning Models.]
16
• 17. www.adaptcentre.ie Meta-eval of MTE paradigm: Automatic Metrics
- n-gram string matching: carried out at the word surface level
- edit distance
- precision, recall, F-score, word order
- augmented/hybrid variants
- linguistic features: to capture syntax and meaning
- syntactic (POS, chunks, sentence structure)
- semantic (dictionaries, paraphrase, NEs, MWEs, entailment, language models)
- deep learning / ML models
- LSTMs, NNs, RNNs: modelling syntactic and semantic features
(A small sketch of the surface-level precision/recall family follows.)
17
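As a concrete illustration of the surface-matching family, here is a minimal sketch of clipped n-gram precision, recall, and F-score against a single reference. It is not any specific published metric; real metrics such as BLEU and METEOR combine multiple n-gram orders, brevity penalties, stemming, and synonym matching.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_prf(candidate: str, reference: str, n: int = 1):
    """Clipped n-gram precision, recall and F1 against one reference:
    each reference n-gram can be matched at most once."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # clipped match counts
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

print(ngram_prf("One year spent similar each year is different",
                "The flowers are similar each year while people are changing every year"))
```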
• 18. www.adaptcentre.ie Meta-eval of MTE paradigm: Automatic Metrics
Our work:
N-gram string matching, augmented/hybrid:
- LEPOR, nLEPOR: Length Penalty, Precision, n-gram Position difference Penalty and Recall (with n-gram P and R in nLEPOR). https://github.com/poethan/LEPOR/ (code in Perl)
- best score on the English-to-other language pairs in the WMT2013 metrics shared task.
Deeper linguistic features:
- hLEPOR: word level + PoS, combined via a harmonic mean of factors
- HPPR: word level + phrase structure
News: cushLEPOR, an hLEPOR customised with a pre-trained language model and human-labelled data, released this year at https://github.com/poethan/cushLEPOR (presented at MT Summit 2021: https://aclanthology.org/2021.mtsummit-up.28/)
(A simplified LEPOR sketch follows.)
Han et al. (2012) "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors", in COLING.
hLEPOR: "Language-independent Model for Machine Translation Evaluation with Reinforced Factors", in MT Summit 2013.
HPPR: "Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation", in GSCL 2013.
18
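To make the factor combination concrete, here is a simplified Python sketch of sentence-level LEPOR following the factor definitions in Han et al. (2012). The released implementation is in Perl and uses a richer, context-dependent n-gram alignment, so the greedy alignment and the handling of unmatched tokens below are illustrative assumptions, not the official metric.

```python
import math

def lepor(candidate: str, reference: str, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Simplified sentence-level LEPOR:
    Length Penalty x Position-difference Penalty x Harmonic(alpha*R, beta*P)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    c, r = len(cand), len(ref)
    if c == 0 or r == 0:
        return 0.0

    # Length penalty: penalise candidates shorter or longer than the reference.
    lp = 1.0 if c == r else math.exp(1 - (r / c if c < r else c / r))

    # Greedy alignment: match each candidate token to the nearest unused
    # occurrence of the same token in the reference.
    used = set()
    pos_diffs, matches = [], 0
    for i, tok in enumerate(cand):
        slots = [j for j, t in enumerate(ref) if t == tok and j not in used]
        if slots:
            j = min(slots, key=lambda s: abs(s / r - i / c))
            used.add(j)
            matches += 1
            # Position difference between length-normalised positions.
            pos_diffs.append(abs((i + 1) / c - (j + 1) / r))

    # Position-difference penalty (unmatched tokens contribute 0 in this sketch).
    npos_penal = math.exp(-sum(pos_diffs) / c)

    precision, recall = matches / c, matches / r
    if precision == 0 or recall == 0:
        return 0.0
    harmonic = (alpha + beta) / (alpha / recall + beta / precision)
    return lp * npos_penal * harmonic

# The MTE example from an earlier slide:
print(round(lepor("One year spent similar , each year is different",
                  "The flowers are similar each year , while people are changing every year"), 4))
```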
• 19. www.adaptcentre.ie Meta-eval of MTE paradigm: Metrics / QE
How can we estimate translation quality without depending on a prepared reference translation?
- e.g. based on features that can be extracted from the MT output and the source text
- estimation spans the word, sentence, and system levels.
Evaluation methods:
- word level: Mean Absolute Error (MAE) as a primary metric and Root Mean Squared Error (RMSE) as a secondary metric, borrowed from regression tasks (a sketch follows this slide).
- sentence level: DA, HTER
- document level: MQM (multi-dimensional quality metrics)
Recent trends:
- Metrics and QE integration/cross-performance: how to share knowledge between the metrics task and the QE task
- WMT19: 10 reference-less evaluation metrics were used for the QE task, "QE as a Metric"
- challenges: predicting which source words lead to errors, multilingual or language-independent models, low-resource languages, etc.
19
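A minimal sketch of the two error measures named above, applied to predicted vs gold quality scores; plain Python with no external dependencies, and the score values are made up purely for illustration.

```python
import math

def mae(pred, gold):
    """Mean Absolute Error: primary measure for word-level QE."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    """Root Mean Squared Error: secondary measure; punishes outliers more."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

# Hypothetical QE predictions vs gold quality scores.
predicted = [0.2, 0.5, 0.9, 0.4]
gold      = [0.1, 0.6, 0.7, 0.4]
print(f"MAE={mae(predicted, gold):.3f}  RMSE={rmse(predicted, gold):.3f}")
```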
• 20. www.adaptcentre.ie Meta-eval of MTE paradigm: Automatic Eval/QE
Our QE models: statistical modelling with metric criteria for QE
I. sentence-level quality estimation: EBLEU (an enhanced BLEU with POS)
II. system selection: a probability model with Naïve Bayes (NB) and SVM as classification algorithms, using features from evaluation metrics (length penalty, precision, recall and rank values)
III. word-level quality estimation: to take contextual information into account, we employed a discriminative undirected probabilistic graphical model, the conditional random field (CRF), in addition to the NB algorithm.
(A toy system-selection sketch follows.)
Han et al. (2013) Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling. In WMT13. https://www.aclweb.org/anthology/W13-2245/
20
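To illustrate the system-selection setup (II) above, here is a toy sketch using scikit-learn's GaussianNB over metric-derived features (length penalty, precision, recall, rank). The feature values, labels, and test point are invented for illustration and do not reproduce the WMT13 system.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each row: [length_penalty, precision, recall, rank] computed for one
# candidate translation by evaluation metrics (invented toy values).
X_train = np.array([
    [0.95, 0.62, 0.58, 1.0],   # outputs from a strong system
    [0.99, 0.70, 0.66, 1.0],
    [0.60, 0.31, 0.28, 3.0],   # outputs from a weak system
    [0.55, 0.25, 0.30, 3.0],
])
y_train = np.array([1, 1, 0, 0])  # 1 = select this system, 0 = reject

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[0.90, 0.55, 0.50, 2.0]]))        # predicted class
print(clf.predict_proba([[0.90, 0.55, 0.50, 2.0]]))  # class probabilities
```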
• 21. www.adaptcentre.ie MT evaluation paradigm: Meta-eval
Statistical significance: via bootstrap resampling, do different systems indeed have different quality? Yields statistical significance intervals for evaluation metrics on small test sets.
Human judgement agreement: kappa levels measuring how often assessors agree with each other on the same segments, or with themselves on a repeated segment.
Correlation methods: how well do the candidate rankings and scores (metrics, QE) correlate with the reference ranking (e.g. human)? Spearman, Pearson, Kendall tau. (A sketch of significance testing and correlation follows.)
Metrics comparisons: compare different metrics against each other.
21
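Two of the meta-evaluation tools above in one minimal sketch: paired bootstrap resampling over per-segment scores of two systems, and the three correlation coefficients via scipy. All segment scores and human judgements here are invented toy data.

```python
import random
from scipy.stats import pearsonr, spearmanr, kendalltau

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=7):
    """Fraction of resampled test sets on which system A beats system B.
    Values near 1.0 (or 0.0) suggest a significant difference."""
    rng = random.Random(seed)
    idx = range(len(scores_a))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(idx) for _ in idx]  # resample with replacement
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_samples

# Toy per-segment metric scores for two MT systems on the same test set.
sys_a = [0.61, 0.72, 0.55, 0.68, 0.70, 0.64]
sys_b = [0.58, 0.66, 0.52, 0.69, 0.63, 0.60]
print("P(A > B) =", paired_bootstrap(sys_a, sys_b))

# Correlation of a metric with human judgements (toy data).
metric = [0.61, 0.72, 0.55, 0.68, 0.70, 0.64]
human  = [3.0, 4.5, 2.5, 4.0, 4.0, 3.5]
for name, fn in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    stat, p = fn(metric, human)
    print(f"{name}: {stat:.3f} (p={p:.3f})")
```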
• 22. www.adaptcentre.ie Content
data: https://github.com/poethan/MWE4MT
cartoon, https://www.bbc.com/news/
Discussion
• Background (motivation of this work)
• Related work (earlier surveys)
• Machine Translation Evaluation (MTE) & Meta-eval
• Discussion and Perspectives
• Conclusions
22
• 23. www.adaptcentre.ie Discussion and Perspectives
• MQM is getting more attention recently and has been applied to WMT tasks (QE 2020); it has many criteria and needs to be adapted to the specific evaluation task. (www.qt21.eu/mqm-definition)
• QE draws more attention since it is more realistic in practical situations where no reference is available, e.g. emergency instruction/translation in war, natural disaster, or rescue settings, often in low-resource scenarios. As SAE stated in 2001: poor MT can lead to death.
• Human assessment agreement has always been an issue; recent Google human-evaluation work on WMT data shows that crowd-sourced assessments disagree largely with professional translators, i.e. crowd-source assessors cannot distinguish genuinely improved MT systems whose syntactic or semantic quality has improved.
• How to improve test-set diversity (domains) and reference-translation coverage of semantic equivalence, e.g. 口水战 (kǒushuǐ zhàn: "war of words" vs "oral combat").
23
• 24. www.adaptcentre.ie Discussion and Perspectives
• Metrics still face challenges in segment-level performance/correlation.
• Metric score interpretation: automatic evaluation metrics often yield scores that are meaningless in isolation, highly test-set specific, with uninformative absolute values:
- what is the meaning of a -16094 score from the MTeRater metric, or 1.98 from ROSE?
- similarly for 19.07 from BEER, 28.47 from BLEU, or 33.03 from METEOR for a mostly good translation
- by contrast, in our own LEPOR and hLEPOR/nLEPOR, relatively good MT system translations were reported to score 0.60 to 0.80 on an overall 0-1 scale
• Low-resource MT development & evaluation: can we avoid the long road that high-resource language pairs took?
- semantic evaluation metrics
- human-in-the-loop evaluation
- semantically interpreted test suites
24
• 25. www.adaptcentre.ie Content
cartoon, https://images.app.goo.gl/Y6bAfr9oFsswWjUY7
What have we done?
• Background (motivation of this work)
• Related work (earlier surveys)
• Machine Translation Evaluation (MTE) & Meta-eval
• Discussion and Perspectives
• Conclusions
25
• 26. www.adaptcentre.ie Conclusions
• A brief overview of both human/manual and automatic MTE methods for translation quality, plus the meta-evaluation of the assessment methods themselves.
• Automated models cover both reference-dependent metrics and quality estimation without references; there is a new trend of reference-less metrics.
• MT is still far from reaching human parity, especially across broader domains, topics, and low-resource language pairs; MTE will continue to play a key role in MT development.
• MTE methods will surely influence evaluation in other NLP tasks, and be influenced by them in turn.
26
• 27. www.adaptcentre.ie References
Our earlier work related to this tutorial:
• Lifeng Han. (2016) Machine Translation Evaluation Resources and Methods: A Survey. https://arxiv.org/abs/1605.04515 (updated 2018).
• Lifeng Han, Gareth J. F. Jones, and Alan Smeaton. (2020) MultiMWE: Building a Multi-lingual Multi-word Expression (MWE) Parallel Corpora. In LREC.
• Lifeng Han. (2014) LEPOR: An Augmented Machine Translation Evaluation Metric. MSc thesis, University of Macau. https://library2.um.edu.mo/etheses/b33358400_ft.pdf
• Lifeng Han, I. Sorokina, Gleb Erofeev, and S. Gladkoff. (2021) cushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements. WMT2021 (in press). http://doras.dcu.ie/26182/
• Lifeng Han, Gareth Jones, and Alan Smeaton. (2021) Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods. MoTra2021@NoDaLiDa2021. https://aclanthology.org/2021.motra-1.3/
WMT findings:
• WMT and Metrics task: Koehn and Monz (2006), ..., Ondřej Bojar et al. (2013), ..., Barrault et al. (2020)
• QE task findings: from Callison-Burch et al. (2012), ..., Specia et al. (2020)
MTE interaction with other NLP task evaluations:
• Liu et al. (2021) Meta-evaluation of Conversational Search Evaluation Metrics. ACM Transactions on Information Systems. https://arxiv.org/pdf/2104.13453.pdf
• Pietro Liguori et al. (2021) Shellcode_IA32: A Dataset for Automatic Shellcode Generation. https://arxiv.org/abs/2104.13100
• Jekaterina Novikova et al. (2017) Why We Need New Evaluation Metrics for NLG. EMNLP. https://www.aclweb.org/anthology/D17-1238/
• A. Celikyilmaz, E. Clark, and J. Gao. (2020) Evaluation of Text Generation: A Survey. arXiv:2006.14799.
• D. Qiu, B. Rothrock, T. Islam, A. K. Didier, V. Z. Sun, et al. (2020) SCOTI: Science Captioning of Terrain Images for Data Prioritization and Local Image Search. Planetary and Space Science, Elsevier.
27
• 28. www.adaptcentre.ie
• Go raibh maith agat! • 谢谢! • 謝謝! • Dankeschön • Thank you! • Gracias! • Grazie! • Dziękuję Ci! • Merci! • Dank je! • спасибі! • धन्यवाद! • Благодаря ти!
quiz: which language do you recognise? 🙃
tak skal du have / Takk skal du ha / tack / Kiitos / Þakka þér fyrir / Qujan Qujanaq Qujanarsuaq
28
• 29. www.adaptcentre.ie
How many can you speak? 🙃
• Go raibh maith agat! • 谢谢! • 謝謝! • Dankeschön • Thank you! • Gracias! • Grazie! • Dziękuję Ci! • Merci! • Dank je! • спасибі! • धन्यवाद! • Благодаря ти!
Danish: tak skal du have / Norwegian: Takk skal du ha / Swedish: tack / Finnish: Kiitos / Icelandic: Þakka þér fyrir / Kalaallisut (Greenlandic): Qujan Qujanaq Qujanarsuaq
29