1. EMNLP 2016 reading
Incorporating Discrete Translation Lexicons
into Neural Machine Translation
authors : Philip Arthur, Graham Neubig, Satoshi Nakamura
presentation : Sekizawa Yuuki
Komachi lab M1
2. Incorporating Discrete Translation Lexicons
into Neural Machine Translation
• NMT often mistranslates low-frequency content words
• this can lose the meaning of the sentence
• proposed method
• encode low-frequency words with lexicon probabilities
• 2 methods : 1) use them as a bias, 2) linear interpolation
• results (En-Ja translation on two corpora : KFTT, BTEC)
• improvements of 2.0-2.3 BLEU, 0.13-0.44 NIST
• faster convergence
3. NMT feature
• NMT system
• treat each word in the vocabulary as a vector of continuous-valued numbers
• share statistical power between similar words
(“dog” and “cat”) or contexts (“this is” and “that is”)
• drawback : often mistranslates into words that seem natural in the context
but do not reflect the content of the source sentence
• PBMT / SMT systems rarely make this kind of mistake
• base their translations on discrete phrase mappings
• ensure that source words will be translated into a target word that
has been observed as a translation at least once in the training data
4. NMT
• source words
• target words
• translation probability
(equation figure : the softmax output over the target vocabulary, built from a weight matrix, a bias vector, and a fixed-width vector)
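As a reading aid, a sketch of the output distribution these labels refer to, in standard attentional-NMT notation (my reconstruction, not taken from the slide): \eta_i is the fixed-width decoder hidden vector, W_s the weight matrix, b_s the bias vector.

p_m(e_i \mid F, e_1^{i-1}) = \mathrm{softmax}(W_s \eta_i + b_s)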
5. Integrating Lexicons into NMT
• Lexicon probability
(figure : the lexical probability matrix for the input sentence, with vocabulary rows and input-sentence-word columns, combined with the alignment probabilities)
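A sketch of how the lexicon probability is computed from that matrix and the alignment (attention) probabilities, assuming standard notation where a_{i,j} is the attention weight on source word f_j when predicting e_i (a reconstruction, not copied from the slide):

p_{lex}(e_i \mid F) = \sum_j p_{lex}(e_i \mid f_j) \, a_{i,j}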
6. combine lexicon probability
1. model bias
• a small constant (here : 0.001) prevents zero probability
2. linear interpolation
• the interpolation coefficient is a learnable parameter (initialized to 0.5)
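A sketch of the two combinations under the annotations above, with \epsilon the small constant that prevents zero probability and \lambda the learnable interpolation coefficient; p_m is the NMT softmax from slide 4 and p_lex the lexicon probability from slide 5 (notation is my reconstruction, not copied from the slide):

bias : p_b(e_i \mid F, e_1^{i-1}) = \mathrm{softmax}\big(W_s \eta_i + b_s + \log(p_{lex}(e_i \mid F) + \epsilon)\big)
linear interpolation : p_o(e_i \mid F, e_1^{i-1}) = \lambda \, p_{lex}(e_i \mid F) + (1 - \lambda) \, p_m(e_i \mid F, e_1^{i-1})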
7. Constructing Lexicon Probability
1. automatic learning
• use the EM algorithm
• E-step : compute the expected counts
• M-step : re-estimate the lexicon probability
2. manual
• use dictionary entries as translations
3. hybrid
(equation annotations : the counts are normalized over all possible counts for the translation set of source word f)
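A sketch of the automatic (EM) lexicon estimation the slide outlines, written as IBM-Model-1-style updates over each sentence pair (F, E); the exact form is my assumption based on the E-step / M-step labels above:

E-step : c(e, f) \leftarrow c(e, f) + \frac{p_{lex}(e \mid f)}{\sum_{f' \in F} p_{lex}(e \mid f')} \quad \text{for each } e \in E,\ f \in F
M-step : p_{lex}(e \mid f) = \frac{c(e, f)}{\sum_{e'} c(e', f)}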
8. Experiment
• Dataset : KFTT, BTEC
• English to Japanese
• tokenize, lowercase
• length <= 50
• low-frequency words are replaced with <unk> and restored at test time (Luong et al. (2015))
• frequency threshold : BTEC : less than 1, KFTT : less than 3
• Evaluation
• BLEU, NIST, recall of rare words from the references
(rare words : appear less than 8 times in the target training corpus or references)
Data    Corpus   Sentences   Tokens (En)   Tokens (Ja)
Train   BTEC     464K        3.60M         4.97M
        KFTT     377K        7.77M         8.04M
Dev     BTEC     510         3.8K          5.3K
        KFTT     1,160       24.3K         26.8K
Test    BTEC     508         3.8K          5.5K
        KFTT     1,169       26.0K         28.4K

vocab size   source   target
BTEC         17.8k    21.8k
KFTT         48.2k    49.1k
10. compare with related work
† : p < 0.05, * : p < 0.10
• largest gains : +2.3 BLEU, +0.44 NIST, +30% recall
11. compare with related work
† : p < 0.05, * : p < 0.10
• KFTT : BLEU is higher but NIST is lower (compared with SMT)
• traditional SMT systems still have a small advantage
in translating low-frequency words
13. Training curves
• in KFTT
• blue : attn
• orange : auto-bias
• green : hyb-bias
• after the first iteration, the proposed methods' BLEU is already higher than attn
• time per iteration : 167 minutes (attn) vs. 275 minutes (auto-bias)
• the extra time comes from computing and using the lexical probability matrix
14. Attention matrices
• the proposed method (bias)
• produces more correct attention
• lighter color : stronger attention on the word
• red box : correct alignment
15. proposed method result
• first column : NMT without a lexicon
• bias
・the manual lexicon (man) is less effective,
due to its limited coverage of target-domain words
• linear
・the trend is the reverse of bias
・worse than bias,
due to the constant interpolation coefficient
16. Incorporating Discrete Translation Lexicons
into Neural Machine Translation
• NMT often mistranslates low-frequency content words
• proposed method
• encode low-frequency words with lexicon probabilities
• 2 methods : 1) use them as a bias, 2) linear interpolation
• improvements of 2.0-2.3 BLEU, 0.13-0.44 NIST
• faster convergence