1. Phrase Tagset Mapping for French and English
Treebanks and Its Application in Machine Translation Evaluation
25th International Conference, GSCL 2013
Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye He, Shuo Li,
and Ling Zhu
September 25th-27th, 2013, Darmstadt, Germany
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau
2. Contents
● Background of language treebanks
● Motivation
● Designed phrase tagset mapping
● Application in MT evaluation
1. Manual evaluations
2. Traditional automatic MT evaluation methods
3. Designed unsupervised MT evaluation
4. Evaluating the evaluation method
5. Experiments
6. Open source code
● Discussion
● Further information
3. 1. Background of language treebanks
• To promote the development of syntactic analysis
• Many language treebanks have been developed
– English Penn Treebank (Marcus et al., 1993; Mitchell et al.,
1994)
– German Negra Treebank (Skut et al., 1997)
– French Treebank (Abeillé et al., 2003)
– Chinese Sinica Treebank (Chen et al., 2003)
– Etc.
4. 1. Background of language treebanks
• Problems
– Different treebanks use their own syntactic tagsets
– The number of tags ranges from tens (e.g. the English
Penn Treebank) to hundreds (e.g. the Chinese Sinica Treebank)
– Inconvenient for multilingual or cross-lingual
research
5. 2. Motivation
• To bridge the gap between these treebanks and
facilitate future research
– E.g. the unsupervised induction of syntactic structure
• Petrov et al. (2012) developed a universal POS tagset
• What about phrase-level tags?
• The discrepancy problem in the phrase-level tags
remains unsolved
– Let’s try to solve it
6. 3. Designed phrase tagset mapping
• Tentative design of phrase tagset mapping
– On English Penn Treebank I, II & French Treebank
• 9 universal phrasal categories covering
– 14 phrase tags in English Penn Treebank I
– 26 phrase tags in English Penn Treebank II
– 14 phrase tags in French Treebank
7. 3. Designed phrase tagset mapping
Table 1: phrase tagset mapping for French and English treebanks
8. 3. Designed phrase tagset mapping
• Universal phrasal categories: NP (noun phrase), VP
(verb phrase), AJP (adjective phrase), AVP (adverbial
phrase), PP (prepositional phrase), S (sentence/sub-sentence),
CONJP (conjunction phrase), COP (coordinated
phrase), X (other phrases or unknown)
• NP covering
– French tags: NP
– English tags: NP, NAC (the scope of certain prenominal
modifiers within an NP), NX (within certain complex NPs
to mark the head of NP), WHNP (wh-noun phrase), QP
(quantifier phrase)
9. 3. Designed phrase tagset mapping
• VP covering
– French tags: VN (verbal nucleus), VP (infinitives and
nonfinite clauses)
– English tags: VP (verb phrase)
• AJP covering
– French tags: AP (adjectival phrase)
– English tags: ADJP (adjective phrase), WHADJP (wh-adjective
phrase)
10. 3. Designed phrase tagset mapping
• AVP covering
– French tags: AdP (adverbial phrases)
– English tags: ADVP (adverb phrase), WHADVP (wh-adverb
phrase), PRT (particle)
• PP covering
– French tags: PP
– English tags: PP, WHPP (wh-prepositional phrase)
11. 3. Designed phrase tagset mapping
• S covering
– French tags: SENT (sentence), S (finite clause)
– English tags: S (simple declarative clause), SBAR (clause
introduced by a subordinating conjunction), SBARQ (direct
question introduced by a wh-phrase), SINV (declarative
sentence with subject-aux inversion), SQ (sub-constituent
of SBARQ), PRN (parenthetical), FRAG (fragment), RRC
(reduced relative clause).
• CONJP covering
– French tags: N/A
– English tags: CONJP
12. 3. Designed phrase tagset mapping
• COP covering
– French tags: COORD (coordinated phrase)
– English tags: UCP (coordinated phrases belonging to
different categories)
• X covering
– French tags: unknown
– English tags: X (unknown or uncertain), INTJ
(interjection), LST (list marker)
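The mapping above can be written down directly as a lookup table. A minimal Python sketch, with the tag lists taken from the preceding slides (the function name to_universal is illustrative, not from the released tool):

    # Universal phrase categories for French Treebank (FTB) and
    # English Penn Treebank (PTB) phrase tags, as listed above.
    FTB_TO_UNIVERSAL = {
        "NP": "NP",
        "VN": "VP", "VP": "VP",
        "AP": "AJP",
        "AdP": "AVP",
        "PP": "PP",
        "SENT": "S", "S": "S",
        "COORD": "COP",
    }

    PTB_TO_UNIVERSAL = {
        "NP": "NP", "NAC": "NP", "NX": "NP", "WHNP": "NP", "QP": "NP",
        "VP": "VP",
        "ADJP": "AJP", "WHADJP": "AJP",
        "ADVP": "AVP", "WHADVP": "AVP", "PRT": "AVP",
        "PP": "PP", "WHPP": "PP",
        "S": "S", "SBAR": "S", "SBARQ": "S", "SINV": "S", "SQ": "S",
        "PRN": "S", "FRAG": "S", "RRC": "S",
        "CONJP": "CONJP",
        "UCP": "COP",
        "X": "X", "INTJ": "X", "LST": "X",
    }

    def to_universal(tag, mapping):
        # Unknown or uncovered tags fall back to the X category.
        return mapping.get(tag, "X")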
14. 4.1 Manual evaluations
• Rapid development of Machine Translation
– MT began as early as the 1950s (Weaver, 1955)
– Big progress since the 1990s due to the development of
computers (storage capacity and computational power) and
enlarged bilingual corpora (Marino et al., 2006)
• Difficulties of MT evaluation
– language variability results in no single correct translation
– natural languages are highly ambiguous and different
languages do not always express the same content in the
same way (Arnold, 2003)
15. 4.1 Manual evaluations
• Traditional manual evaluation criteria:
– intelligibility (measuring how understandable the sentence
is)
– fidelity (measuring how much information the translated
sentence retains as compared to the original) by the
Automatic Language Processing Advisory Committee
(ALPAC) around 1966 (Carroll, 1966)
– adequacy (similar to fidelity), fluency (whether the
sentence is well-formed and fluent) and comprehension
(improved intelligibility) by Defense Advanced Research
Projects Agency (DARPA) of the US (White et al., 1994)
17. 4.2 Traditional automatic MT evaluations
• Measuring the similarity between the automatic translation
and the reference translation
– Automatic translation (or hypothesis translation, target
translation): by automatic MT system
– Reference translation: by professional translators
– Source language and source document: not used
• Traditional automatic evaluation:
– BLEU: n-gram precisions (Papineni et al., 2002)
– TER: edit distances (Snover et al., 2006)
– METEOR: precision and recall (Banerjee and Lavie, 2005)
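As a reference point for these metrics, the core of BLEU is a clipped (modified) n-gram precision between the hypothesis and the reference. A minimal illustrative sketch, not the official implementation (which also adds a brevity penalty and combines several n-gram orders):

    from collections import Counter

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_ngram_precision(hypothesis, reference, n=1):
        # Clipped n-gram precision as used inside BLEU (Papineni et al., 2002):
        # each hypothesis n-gram is credited at most as often as it occurs
        # in the reference.
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        return overlap / total if total else 0.0

    # Toy example: unigram precision of a hypothesis against one reference.
    hyp = "the cat sat on the mat".split()
    ref = "the cat is on the mat".split()
    print(modified_ngram_precision(hyp, ref, n=1))  # 5/6, about 0.83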
18. 4.3 Designed unsupervised MT evaluation
• Problems in supervised MT evaluation
– Reference translations are expensive
– Reference translations are not available in some cases
• Could we get rid of the reference translation?
– Unsupervised MT evaluation method
– Extract information from source and target language
– How can we use the designed universal phrase tagset?
19. 4.3 Designed unsupervised MT evaluation
• Assume that the translated sentence should have a
similar set of phrase categories to the source
sentence.
– This design is inspired by the synonymous relation between
source and target sentence.
• Two sentences that have a similar set of phrases may
still talk about different things.
– However, this evaluation approach is not designed for
the general case
– We assume that the target sentences are indeed
translations of the source document
20. 4.3 Designed unsupervised MT evaluation
• First, we parse the source and target languages
respectively
• Then we extract the phrase set from the source and
target sentences
• Third, we convert the phrases into the developed
universal phrase categories
• Last, we measure the similarity of source and target
language on the universal phrase sequences
22. 4.3 Designed unsupervised MT evaluation
The level of the extracted phrase tags: just the level above the POS tags, bottom-up
Figure 2: converting the extracted phrases into universal phrase tags
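As a rough illustration of the extraction and conversion steps, here is a minimal sketch assuming bracketed parses are available via NLTK's Tree class; the helper phrase_above_pos is hypothetical, and the mapping table is the one sketched earlier:

    from nltk import Tree

    def phrase_above_pos(tree):
        # For each word, keep the phrase label immediately above its
        # POS tag (the lowest phrase level, found bottom-up).
        tags = []
        for leaf_pos in tree.treepositions('leaves'):
            # word -> POS tag -> phrase: strip the last two indices.
            tags.append(tree[leaf_pos[:-2]].label())
        return tags

    parse = Tree.fromstring(
        "(S (NP (DT the) (NN cat)) (VP (VBD sat)"
        " (PP (IN on) (NP (DT the) (NN mat)))))")
    tags = phrase_above_pos(parse)  # ['NP', 'NP', 'VP', 'PP', 'NP', 'NP']
    # Convert to the universal categories (here each tag maps to itself).
    universal = [PTB_TO_UNIVERSAL.get(t, "X") for t in tags]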
23. 4.3 Designed unsupervised MT evaluation
• What similarity metric do we employ?
• Designed similarity metric: HPPR
– N1-gram position order difference penalty
– Weighted N2-gram precision
– Weighted N3-gram recall
– Weighted geometric mean of n-gram precision & recall
– Weighted harmonic mean to combine the sub-factors
– The parameters are tunable according to different language
pairs
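The exact HPPR formula is given in the paper and the released code; below is only a loose Python sketch of its general shape over universal phrase tag sequences. The weights and defaults are placeholders (not the tuned values from Table 2), and the real metric also combines the n-gram precision and recall with a weighted geometric mean, which this sketch folds into the harmonic mean for brevity:

    import math
    from collections import Counter

    def ngrams(seq, n):
        return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

    def position_penalty(src, tgt, n):
        # exp(-NPD): NPD averages the normalized position distance between
        # each target n-gram chunk and its nearest match in the source;
        # unmatched chunks get the maximum distance 1.
        s, t = ngrams(src, n), ngrams(tgt, n)
        if not t:
            return 0.0
        diffs = []
        for i, g in enumerate(t):
            matches = [abs(i / len(t) - j / len(s))
                       for j, h in enumerate(s) if h == g]
            diffs.append(min(matches) if matches else 1.0)
        return math.exp(-sum(diffs) / len(diffs))

    def clipped_overlap(src, tgt, n):
        s, t = Counter(ngrams(src, n)), Counter(ngrams(tgt, n))
        return sum(min(c, s[g]) for g, c in t.items())

    def hppr_like(src, tgt, n1=2, n2=3, n3=3, w=(0.5, 0.25, 0.25)):
        pos = position_penalty(src, tgt, n1)
        prec = clipped_overlap(src, tgt, n2) / max(1, len(ngrams(tgt, n2)))
        rec = clipped_overlap(src, tgt, n3) / max(1, len(ngrams(src, n3)))
        factors = (pos, prec, rec)
        if min(factors) == 0:
            return 0.0
        # Weighted harmonic mean of the three sub-factors.
        return sum(w) / sum(wi / f for wi, f in zip(w, factors))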
31. 4.5 Experiments
• Corpus from WMT
– Workshop on Statistical Machine Translation
– Organized by SIGMT, the ACL’s special interest group on
machine translation
• Training data (WMT11), used to tune the parameters
– 3,003 sentences per document
– 18 automatic French-to-English MT systems
• Testing data (WMT12)
– 3,003 sentences per document
– 15 automatic French-to-English MT systems
32. 4.5 Experiments
• Training: tuning the parameters
– N1, N2 and N3 are tuned to 2, 3 and 3, because a 4-gram
chunk match usually results in a score of 0.
– The tuned values of the factor weights are shown in Table 2
Table 2: tuned parameter values
33. 4.5 Experiments
• Comparisons with:
– BLEU, which measures the closeness of the hypothesis and
reference translations via n-gram precision
– TER, which measures the edit distance from the hypothesis to
the reference translations
34. 4.5 Experiments
Table 3: training (development) scores on WMT11 corpus
Table 4: testing scores on WMT12 corpus
35. 4.5 Experiments
Table 5: interpretation of correlation scores (Cohen, 1988)
● The experimental results on the development and testing corpora show that
HPPR, without using reference translations, yields promising correlation
scores (0.63 and 0.59, respectively).
● There is still potential to improve the performance of all three metrics,
even though correlation scores above 0.5 are already considered a strong
correlation, as shown in Table 5.
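Meta-evaluation of this kind is typically a rank correlation between the metric's system-level scores and human judgments, as in the WMT metrics shared tasks. A minimal sketch with SciPy; the score lists below are invented placeholders, not the WMT numbers:

    from scipy.stats import spearmanr

    # One score per MT system, e.g. from HPPR and from human judgment.
    metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29]  # hypothetical
    human_scores = [0.58, 0.49, 0.61, 0.40, 0.55]   # hypothetical

    rho, _ = spearmanr(metric_scores, human_scores)
    print("Spearman rho = %.2f" % rho)  # 1.00 here: the rankings agree exactly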
36. 4.6 Open source code
• Phrase Tagset Mapping for French and English Treebanks
and Its Application in Machine Translation Evaluation
– Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, Liangye
He, Shuo Li, and Ling Zhu. GSCL 2013, Darmstadt,
Germany. LNCS Vol. 8105, pp. 119-131, Volume Editors:
Iryna Gurevych, Chris Biemann and Torsten Zesch.
• Open source tool for phrase tagset mapping and HPPR
similarity measuring algorithms:
https://github.com/aaronlifenghan/aaron-project-hppr
37. 5. Discussion
• To facilitate future research in the multilingual or
cross-lingual literature, this paper designs a phrase tagset
mapping between the French Treebank and the
English Penn Treebank using 9 phrase categories.
• One of the potential applications of the designed
universal phrase tagset is shown in the unsupervised
MT evaluation task in the experiment section.
38. 5. Discussion
• There are still some limitations in this work to be
addressed in the future.
– The designed universal phrase categories may not be able
to cover all the phrase tags of other language treebanks,
so this tagset could be expanded when necessary.
– The designed HPPR formula contains the n-gram factors
of position difference, precision and recall, which may not
be sufficient or suitable for some of the other language
pairs, so different measuring factors should be added or
switched when facing new tasks.
39. 5. Discussion
• The designed models are closely related to similarity
measurement; here we have applied them to MT
evaluation. These works may be further extended to
other areas:
– information retrieval
– question answering
– search
– text analysis
– etc.
40. 6. Further information
• Ongoing and further works:
– The combination of translation and evaluation, tuning the
translation model using evaluation metrics
– Evaluation models from the perspective of semantics
– The further explorations of unsupervised evaluation
models, extracting other features from source and target
languages
• Aaron’s open source tools: https://github.com/aaronlifenghan
• Aaron’s homepage: http://www.linkedin.com/in/aaronhan
41. Phrase Tagset Mapping for French and English
Treebanks and Its Application in Machine
Translation Evaluation
GSCL 2013, Darmstadt, Germany
Q and A
Aaron L.-F. Han
email: hanlifengaaron AT gmail DOT com
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau