2. MT at Hermes
Pure RBMT engines with pre- and post-processing macros.
Texts from technical domains.
The applied-technology department has been working on MT engines for over a year.
Over 250,000 words post-edited with internal engines in the last year.
Average new-word count for projects post-edited with internal engines: 9,000 words.
3. Our purpose with MT evals
Automated metrics might help us:
predict PE time and productivity gains;
negotiate reasonable discounts;
evaluate quality of engines;
measure performance of applied-technology department;
not depend on human-reported data.
4. What we hoped to find
We hoped some metric would correlate with productivity gain
data provided by post-editors.
We gathered BLEU, F-Measure, METEOR and TER
values.
Ideally, we would end up relying on automated metrics rather than time and productivity measurements reported by post-editors.
5. What we hoped to find
[Scatter plot: metric scores vs. productivity gain %]
7. What we actually found: No correlation
[Scatter plot: BLEU, F-Measure, TER and METEOR scores plotted against productivity gain % — no visible correlation]
9. Reasons for the variability
Different CAT environments (Trados Studio, memoQ,
Idiom, TagEditor, etc.).
Different engines (per domain, per client, etc.).
Different clients, different needs.
Different post-editors.
Or, if same post-editor, different post-editing skills over time.
Different word volumes.
Specific productivity- or consistency-enhancement processing can affect metrics negatively.
10. Productivity-enhancement example
Source: Add events as described in Adding Events to a Model.
PE: Agregue los eventos como se describe en Adición de eventos a un
modelo.
Raw 1: Agregue los eventos como se describe en la adición de los eventos a
un modelo.
Raw 2: Agregue los eventos como se describe en Adding Events to a Model.
Scores:

        Raw 1   Raw 2
BLEU    68.59   53.33
TER     17.65   29.41

Metrics for Raw 1 are significantly better, but Raw 2 is faster to post-edit thanks to automatic terminology-insertion tools (such as Xbench).
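The TER figures above can be illustrated with a simplified word-level edit distance against the post-edited reference. This sketch omits the shift operations of full TER, so its numbers will not match the slide's values, but it does reproduce the ordering: Raw 1 scores closer to the reference than Raw 2.

```python
def word_ter(hyp, ref):
    # Word-level edit distance (insert/delete/substitute) divided by
    # reference length -- a simplified TER without shift operations.
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100 * d[len(h)][len(r)] / len(r)

pe = "Agregue los eventos como se describe en Adición de eventos a un modelo."
raw1 = "Agregue los eventos como se describe en la adición de los eventos a un modelo."
raw2 = "Agregue los eventos como se describe en Adding Events to a Model."
```

Here word_ter(raw1, pe) is lower than word_ter(raw2, pe), mirroring the slide's TER ranking, even though Raw 2 was the faster sentence to post-edit.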
11. Human evaluation
Adequacy: How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?
4. Everything
3. Most
2. Little
1. None
Fluency: To what extent is a target-side translation grammatically well formed, without spelling errors, and experienced as using natural/intuitive language by a native speaker?
4. Flawless
3. Good
2. Disfluent
1. Incomprehensible
Source: TAUS MT evaluation guidelines
https://evaluation.taus.net/resources/adequacy-fluency-guidelines
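Ratings collected on these 1–4 scales have to be aggregated somehow; a minimal sketch is to average adequacy and fluency across evaluators per segment. The segment IDs and judgment values below are hypothetical, not actual evaluation data.

```python
from statistics import mean

# Hypothetical ratings: per segment, (adequacy, fluency) judgments
# on the TAUS 1-4 scales from several evaluators.
ratings = {
    "seg-001": [(4, 3), (3, 3), (4, 4)],
    "seg-002": [(2, 2), (3, 2), (2, 1)],
}

def segment_scores(judgments):
    # Average adequacy and fluency across evaluators for one segment.
    adequacy = mean(a for a, _ in judgments)
    fluency = mean(f for _, f in judgments)
    return adequacy, fluency
```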
12. Conclusions
We combine automated metrics with time/productivity data reported by post-editors for the final evaluation of internal MT performance.
Poor post-editing skills or any project-specific contingency can be counterbalanced by good automated metrics.
We look for qualitative information in automated metrics, not
quantitative.
BLEU values of 65 and 70 for two different engines tell us both
are good engines, not that one will render 5% better results than
the other.
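The qualitative reading of BLEU described above could be expressed as a coarse banding function. The thresholds here are illustrative assumptions for the sketch, not Hermes policy or any standard.

```python
def bleu_band(score):
    # Map a BLEU score (0-100) to a coarse qualitative band.
    # Thresholds are illustrative assumptions, not a standard.
    if score >= 60:
        return "good"
    if score >= 40:
        return "usable"
    return "poor"

# Per the conclusion above: 65 and 70 land in the same band, rather
# than one engine being read as 5% better than the other.
```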