2. MT at Hermes
Pure RBMT engines with pre- and post-processing macros.
Texts from technical domains.
The applied-technology department has been working on MT engines for over a year.
Over 250,000 words post-edited with internal engines in the last year.
Average new-word count for projects post-edited with internal engines: 9,000 words.
3. Our purpose with MT evals
Automated metrics might help us:
predict PE time and productivity gains;
negotiate reasonable discounts;
evaluate quality of engines;
measure performance of applied-technology department;
not depend on human-reported data.
4. What we hoped to find
We hoped some metric would correlate with productivity gain
data provided by post-editors.
We gathered BLEU, F-Measure, METEOR and TER
values.
Ideally, we would end up relying on automated metrics rather than time and productivity measurements reported by post-editors.
5. What we hoped to find
[Scatter plot: metric scores vs. productivity gain %]
7. What we actually found: No correlation
[Scatter plot: BLEU, F-Measure, TER and METEOR scores plotted against productivity gain % — no visible correlation]
9. Reasons for the variability
Different CAT environments (Trados Studio, memoQ,
Idiom, TagEditor, etc.).
Different engines (per domain, per client, etc.).
Different clients, different needs.
Different post-editors.
Or, if same post-editor, different post-editing skills over time.
Different word volumes.
Specific productivity- or consistency-enhancement processing can affect metrics negatively.
10. Productivity-enhancement example
Source: Add events as described in Adding Events to a Model.
PE: Agregue los eventos como se describe en Adición de eventos a un
modelo.
Raw 1: Agregue los eventos como se describe en la adición de los eventos a
un modelo.
Raw 2: Agregue los eventos como se describe en Adding Events to a Model.
Scores:

        Raw 1   Raw 2
BLEU    68.59   53.33
TER     17.65   29.41

Metrics for Raw 1 are significantly better, but Raw 2 is faster to post-edit thanks to automatic terminology-insertion tools (such as Xbench).
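The TER figures above can be illustrated with a simplified word-level edit distance against the post-edited reference. This sketch omits the shift operations of full TER, so its numbers will not match the slide's values, but it does reproduce the ordering: Raw 1 scores closer to the reference than Raw 2.

```python
def word_ter(hyp, ref):
    # Word-level edit distance (insert/delete/substitute) divided by
    # reference length -- a simplified TER without shift operations.
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100 * d[len(h)][len(r)] / len(r)

pe = "Agregue los eventos como se describe en Adición de eventos a un modelo."
raw1 = "Agregue los eventos como se describe en la adición de los eventos a un modelo."
raw2 = "Agregue los eventos como se describe en Adding Events to a Model."
```

Here word_ter(raw1, pe) is lower than word_ter(raw2, pe), mirroring the slide's TER ranking, even though Raw 2 was the faster sentence to post-edit.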
11. Human evaluation
Adequacy: How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?
4. Everything
3. Most
2. Little
1. None
Fluency: To what extent is a target-side translation grammatically well formed, without spelling errors, and experienced as using natural/intuitive language by a native speaker?
4. Flawless
3. Good
2. Disfluent
1. Incomprehensible
Source: TAUS MT evaluation guidelines
https://evaluation.taus.net/resources/adequacy-fluency-guidelines
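Ratings collected on these 1–4 scales have to be aggregated somehow; a minimal sketch is to average adequacy and fluency across evaluators per segment. The segment IDs and judgment values below are hypothetical, not actual evaluation data.

```python
from statistics import mean

# Hypothetical ratings: per segment, (adequacy, fluency) judgments
# on the TAUS 1-4 scales from several evaluators.
ratings = {
    "seg-001": [(4, 3), (3, 3), (4, 4)],
    "seg-002": [(2, 2), (3, 2), (2, 1)],
}

def segment_scores(judgments):
    # Average adequacy and fluency across evaluators for one segment.
    adequacy = mean(a for a, _ in judgments)
    fluency = mean(f for _, f in judgments)
    return adequacy, fluency
```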
12. Conclusions
We combine automated metrics with time/productivity data reported by post-editors for the final evaluation of internal MT performance.
Poor post-editing skills or any project-specific contingency can be counterbalanced by good automated metrics.
We look for qualitative information in automated metrics, not
quantitative.
BLEU values of 65 and 70 for two different engines tell us both
are good engines, not that one will render 5% better results than
the other.
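The qualitative reading of BLEU described above could be expressed as a coarse banding function. The thresholds here are illustrative assumptions for the sketch, not Hermes policy or any standard.

```python
def bleu_band(score):
    # Map a BLEU score (0-100) to a coarse qualitative band.
    # Thresholds are illustrative assumptions, not a standard.
    if score >= 60:
        return "good"
    if score >= 40:
        return "usable"
    return "poor"

# Per the conclusion above: 65 and 70 land in the same band, rather
# than one engine being read as 5% better than the other.
```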