Presentation PPT at MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu, Junwen Xing, Xiaodong Zeng
Proceedings of the 14th biennial Machine Translation Summit (MT Summit 2013), Nice, France, 2-6 September 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
1. MT SUMMIT 2013
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu,
Junwen Xing and Xiaodong Zeng
September 2nd-6th, 2013, Nice, France
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau
2. The importance of machine translation (MT) evaluation
Introduction to automatic MT evaluation metrics
1. Lexical similarity
2. Linguistic features
3. Metrics combination
Designed metric: LEPOR Series
1. Motivation
2. LEPOR Metrics Description
3. Performances on international ACL-WMT corpora
4. Publications and Open source tools
Further information
3. • Eager communication among people of different
nationalities
– Promotes the development of translation technology
• Rapid development of machine translation
– Machine translation (MT) began as early as the 1950s
(Weaver, 1955)
– Big progress since the 1990s due to the development of
computers (storage capacity and computational power)
and enlarged bilingual corpora (Marino et al., 2006)
4. • Some recent works of MT research:
– Och (2003) presents MERT (Minimum Error Rate Training)
for log-linear SMT
– Su et al. (2009) use the Thematic Role Templates model to
improve the translation
– Xiong et al. (2011) employ the maximum-entropy model,
etc.
– Data-driven methods, including example-based MT
(Carl and Way, 2003) and statistical MT (Koehn, 2010),
became the main approaches in the MT literature.
5. • How well do MT systems perform, and are they
making progress?
• Difficulties of MT evaluation
– language variability results in no single correct translation
– the natural languages are highly ambiguous and different
languages do not always express the same content in the
same way (Arnold, 2003)
6. • Traditional manual evaluation criteria:
– intelligibility (measuring how understandable the
sentence is)
– fidelity (measuring how much information the translated
sentence retains as compared to the original) by the
Automatic Language Processing Advisory Committee
(ALPAC) around 1966 (Carroll, 1966)
– adequacy (similar to fidelity), fluency (whether the
sentence is well-formed and fluent) and comprehension
(improved intelligibility) by Defense Advanced Research
Projects Agency (DARPA) of US (White et al., 1994)
7. • Problems of manual evaluation:
– Time-consuming
– Expensive
– Unrepeatable
– Low agreement (Callison-Burch, et al., 2011)
9. • Precision-based
BLEU (Papineni et al., 2002 ACL)
• Recall-based
ROUGE (Lin, 2004 WAS)
• Precision and Recall
Meteor (Banerjee and Lavie, 2005 ACL)
10. • Word-order based
NKT_NSR (Isozaki et al., 2010 EMNLP), PORT (Chen
et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)
• Word-alignment based
AER (Och and Ney, 2003 J.CL)
• Edit distance-based
WER (Su et al., 1992 Coling), PER (Tillmann et al.,
1997 EUROSPEECH), TER (Snover et al., 2006
AMTA)
11. • Language model
LM-SVM (Gamon et al., 2005 EAMT)
• Shallow parsing
GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel
et al., 2012 WMT)
• Semantic roles
Named entity, morphological, synonymy,
paraphrasing, discourse representation, etc.
12. • MTeRater-Plus (Parton et al., 2011 WMT)
– Combines BLEU, TERp (Snover et al., 2009) and Meteor
(Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)
• MPF & WMPBleu (Popovic, 2011 WMT)
– Arithmetic mean of F score and BLEU score
• SIA (Liu and Gildea, 2006 ACL)
– Combines the advantages of n-gram-based metrics and
loose-sequence-based metrics
13. • hLEPOR: harmonic mean of enhanced Length Penalty,
Precision, n-gram Position difference Penalty and
Recall
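A minimal sketch of the weighted harmonic-mean combination named on this slide. The factor values and weights below are placeholders, not the paper's tuned settings; in the real metric each factor is computed from the hypothesis/reference pair, and the weights are tuned per language pair.

```python
def weighted_harmonic_mean(factors, weights):
    """Weighted harmonic mean: sum(w_i) / sum(w_i / f_i)."""
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

# Hypothetical factor values for one hypothesis sentence, each in (0, 1]:
# enhanced length penalty, precision, n-gram position difference penalty,
# and recall, as named on this slide.
lp, precision, npos_penal, recall = 0.95, 0.70, 0.88, 0.74
# Placeholder (equal) weights; hLEPOR's weights are tunable parameters.
print(round(weighted_harmonic_mean([lp, precision, npos_penal, recall],
                                   [1.0, 1.0, 1.0, 1.0]), 4))  # -> 0.805
```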
14. • Weaknesses in existing metrics:
– they perform well on certain language pairs but weakly on
others, which we call the language-bias problem;
– they use no linguistic information (leading to low
correlation with human judgments) or too many linguistic
features (making them hard to replicate), which we call
the extremism problem;
– they rely on incomprehensive factors (e.g. BLEU focuses
on precision only).
– What to do?
15. • To address some of the existing problems:
– Design tunable parameters to address the language-bias
problem;
– Use concise or optimized linguistic features for the
linguistic extremism problem;
– Design augmented factors.
23. • Example: employing linguistic features
Fig. 4. Example of n-gram POS alignment
Fig. 5. Example of NPD calculation
24. • Enhanced version with linguistic features:
• $hLEPOR_E = \frac{1}{w_{hw} + w_{hp}} \left( w_{hw} \cdot hLEPOR_{word} + w_{hp} \cdot hLEPOR_{POS} \right)$   (10)
• The system-level scores $hLEPOR_{word}$ and $hLEPOR_{POS}$ are
computed with the same algorithm on the word sequence and the
POS sequence, respectively.
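Equation (10) is a weighted average of the two system-level scores; a one-line sketch (the scores and weights below are invented for illustration):

```python
def hlepor_enhanced(hlepor_word, hlepor_pos, w_hw=1.0, w_hp=1.0):
    """Eq. (10): weighted combination of word-level and POS-level hLEPOR."""
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)

print(hlepor_enhanced(0.74, 0.81))  # equal weights -> 0.775
```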
25. • With multiple references:
• Select the alignment that results in the minimum NPD
score.
Fig. 6. N-gram alignment with multiple references
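A minimal sketch of this selection rule, assuming (as in the LEPOR family) that NPD averages absolute differences of length-normalized match positions. The alignments and lengths below are hypothetical, and the full metric also handles unmatched words and the n-gram context rules.

```python
def npd(matches, hyp_len, ref_len):
    """Average |normalized position difference| over matched word pairs.
    `matches` holds (hyp_pos, ref_pos) pairs with 1-based positions."""
    return sum(abs(h / hyp_len - r / ref_len) for h, r in matches) / hyp_len

def min_npd(alignments, hyp_len, ref_lens):
    """With multiple references, keep the alignment with the minimum NPD."""
    return min(npd(m, hyp_len, rl) for m, rl in zip(alignments, ref_lens))

# Two candidate alignments against two references of lengths 6 and 5.
alignments = [[(1, 2), (2, 4), (4, 5)], [(1, 1), (2, 3), (4, 4)]]
print(round(min_npd(alignments, hyp_len=5, ref_lens=[6, 5]), 4))  # -> 0.04
```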
26. • How reliable is the automatic metric?
• Evaluation criteria for evaluation metrics:
– Human judgments are currently the gold standard to approximate.
• Correlation with human judgments:
• System-level Spearman rank correlation coefficient:
– $\rho_{XY} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$   (11)
– $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$, where $d_i$ is the difference
between the ranks of $x_i$ and $y_i$.
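A small self-contained sketch of equation (11), assuming no tied scores (ties would require average ranks); the system scores below are invented for illustration:

```python
def spearman_rho(x, y):
    """Eq. (11): rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); assumes no ties."""
    n = len(x)
    def ranks(v):
        # Map each score to its rank, 1 = highest score.
        return {s: i + 1 for i, s in enumerate(sorted(v, reverse=True))}
    rx, ry = ranks(x), ranks(y)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical metric scores vs. human scores for five MT systems.
metric = [0.42, 0.51, 0.38, 0.60, 0.47]
human = [3.1, 3.6, 2.9, 3.9, 3.2]
print(spearman_rho(metric, human))  # identical rankings -> 1.0
```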
27. • Training data (WMT08)
– 2,028 sentences for each document
– English vs Spanish/German/French/Czech
• Testing data (WMT11)
– 3,003 sentences for each document
– English vs Spanish/German/French/Czech
30. • Language-independent Model for Machine
Translation Evaluation with Reinforced Factors
– Aaron L.-F. Han, Derek Wong, Lidia S. Chao, Liangye He, Yi
Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT
Summit 2013. Nice, France.
• Machine translation evaluation tool, hLEPOR:
https://github.com/aaronlifenghan/aaron-project-
hlepor
31. • Ongoing and future work:
– The combination of translation and evaluation, tuning the
translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– The exploration of unsupervised evaluation models,
extracting features from source and target languages
32. • The evaluation methods presented here are essentially
similarity measures; we have applied them to MT
evaluation, but they can be further developed for other
areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
33. MT SUMMIT 2013, September 2nd-6th, 2013, Nice, France
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Yi Lu,
Junwen Xing and Xiaodong Zeng
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau