Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic Constituents
Slide 1
Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic Constituents
Yusuke Oda
Graham Neubig
Sakriani Sakti
Tomoki Toda
Satoshi Nakamura
ACL, July 27, 2015
Slide 2
Two Features of This Study
● Syntax-based Machine Translation
– State-of-the-art SMT method for distant language pairs
[Figure: the input "This is (NP)" is parsed into a tree (S (NP (DT This)) (VP (VBZ is) (NP))) and translated by MT into これ は (NP) で す]
● Simultaneous Translation
– Prevent translation delay when translating continuous speech
[Figure: the input "In the next 18 minutes I'm going to take you on a journey." is split into segments, and each segment is translated as it arrives]
Slide 3
Speech Translation - Standard Setting
[Figure: pipeline: Speech Recognition ("in the next 18 minutes / I 'm going to take you on a journey") → Machine Translation (今から18分で / 皆様を旅にお連れします) → Speech Synthesis; the delay depends on the input length]
● Problem: long delay (when there are few explicit sentence boundaries)
Slide 4
Simultaneous Translation with Segmentation
● Separate the input at good positions
● The system can generate output without waiting for end-of-speech
[Figure: ASR output ("in the next / 18 minutes / I 'm going / to take you / on a journey") is segmented, and each segment is translated and synthesized as it arrives (今から / 18分で / 皆様を / 旅に / お連れします), giving a shorter delay]
● Trade-off: translation quality vs. segmentation frequency
Slide 5
Syntactic Problems in Segmentation
● Segmentation allows us to translate each part separately
● But it often breaks the syntax
[Figure: for "In the next 18 minutes I 'm going to ...", a predicted boundary after "I" leaves the fragment "in the next 18 minutes I", which parses into an incorrect tree; the correct tree would contain an unseen VP: (S (PP (IN in) (NP the next 18 minutes)) (NP (PRP I)) (VP))]
● Bad effect on syntax-based machine translation
Slide 6
Motivation of This Study
● Predict unseen syntactic constituents
[Figure: the incorrect tree for "in the next 18 minutes I" is repaired by predicting the unseen (VP), yielding (S (PP in the next 18 minutes) (NP I) (VP))]
● Translate from the corrected tree: 今 から 18 分 で 私 は (VP)
Slide 7
Summary of Proposed Methods
● Proposed 1: Predicting and using unseen constituents
● Proposed 2: Waiting for translation
[Figure: pipeline: ASR → Segmentation → Parsing → Syntax Prediction (Proposed 1) → Translation → Output; e.g. "this is" is translated as これは NP です using the predicted NP, or the system waits (Proposed 2) until "this is a pen" is available and outputs これはペンです]
Slide 8
What is Required?
● To use predicted constituents in translation, we need:
1. Making training data
2. Deciding a prediction strategy
3. Using results for translation
Slide 9
Making Training Data for Syntax Prediction
● Decompose gold trees in the treebank
[Figure: gold tree for "This is a pen": (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN pen))))]
1. Select any leaf span in the tree
2. Find the path between the leftmost/rightmost leaves
3. Delete the outside subtrees
4. Replace inside subtrees with their topmost phrase labels
5. Finally we obtain: nil | is a | NN nil (left syntax | leaf span | right syntax)
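To make the construction concrete, here is a minimal sketch of extracting the right-side syntax labels from a gold tree, assuming NLTK-style trees; treebank I/O, the left-side case, and corner cases are omitted, and this is an illustration rather than the authors' exact tool.

```python
from nltk.tree import Tree

def right_syntax(tree, end):
    """Labels of the unseen constituents to the right of the leaf span
    ending at leaf index `end` (exclusive): the right siblings of every
    node on the path from the span's last leaf up to the root."""
    labels = []
    pos = tree.leaf_treeposition(end - 1)  # tree position of the last leaf
    for depth in range(len(pos) - 1, 0, -1):
        parent = tree[pos[:depth - 1]]
        for sibling in parent[pos[depth - 1] + 1:]:  # right siblings
            labels.append(sibling.label())
    return labels + ["nil"]  # 'nil' marks "no more syntax"

gold = Tree.fromstring(
    "(S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN pen))))")
# For the leaf span covering leaves 0-2, the unseen right-side
# constituent is the NN ("pen"):
print(right_syntax(gold, 3))  # -> ['NN', 'nil']
```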
Slide 10
Syntax Prediction from Incorrect Trees
1. Parse the input translation unit as-is
[Figure: the fragment "in the next 18 minutes I" is forcibly parsed, yielding an incorrect tree rooted at PP]
2. Extract features from the input and the tree, e.g.:
   Word:R1=I, POS:R1=NN, Word:R1-2=I,minutes, POS:R1-2=NN,NNS, ..., ROOT=PP, ROOT-L=IN, ROOT-R=NP, ...
3. Classify the next constituent (e.g. VP ... 0.65, NP ... 0.28, nil ... 0.04, ...) and repeat until nil; here the prediction is: VP nil
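As a hedged illustration of this loop, the sketch below uses scikit-learn's LinearSVC as the multi-class classifier (the authors used simple linear SVMs; the feature names mirror the slide, while the `vec`/`clf` objects and the exact feature set are assumptions).

```python
# Minimal sketch of the repeated next-constituent prediction.
# `vec` is a fitted DictVectorizer and `clf` a LinearSVC trained on
# examples derived from the treebank (see the earlier sketch).
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def features(words, tags, root, predicted_so_far):
    f = {
        "Word:R1": words[-1],               # rightmost word
        "POS:R1": tags[-1],
        "Word:R1-2": ",".join(words[-2:]),  # rightmost word bigram
        "POS:R1-2": ",".join(tags[-2:]),
        "ROOT": root,                       # label of the (possibly wrong) root
    }
    for i, label in enumerate(predicted_so_far):
        f["PRED:%d" % i] = label            # condition on earlier predictions
    return f

def predict_right_syntax(words, tags, root, vec, clf, max_len=5):
    predicted = []
    while len(predicted) < max_len:
        x = vec.transform([features(words, tags, root, predicted)])
        label = clf.predict(x)[0]
        predicted.append(label)
        if label == "nil":                  # end marker: no more syntax
            break
    return predicted                        # e.g. ['VP', 'nil']
```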
Slide 11
Syntax-based MT with Additional Constituents
● Use tree-to-string (T2S) MT framework
[Figure: "This is NP" is parsed into (S (NP (DT This)) (VP (VBZ is) NP)) and translated into これ は NP で す, carrying the predicted NP through as a placeholder]
– Obtains state-of-the-art results
on syntactically distant language pairs
(e.g. English→Japanese)
– Possible to use additional syntactic constituents explicitly
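The sketch below is a toy illustration (a hypothetical rule format, not the authors' decoder) of why tree-to-string translation can use a predicted constituent explicitly: a substitution site on the target side can be filled with a placeholder instead of translated text.

```python
def apply_rule(target, bindings):
    # Substitution sites (x0, x1, ...) are filled from `bindings`; a
    # predicted-but-unseen constituent is bound to a placeholder label.
    return " ".join(bindings.get(tok, tok) for tok in target)

# Target side of a rule like (S (NP:x0) (VP (VBZ is) (NP:x1))) -> "x0 は x1 です"
target = ["x0", "は", "x1", "です"]
print(apply_rule(target, {"x0": "これ", "x1": "(NP)"}))  # これ は (NP) です
```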
Slide 12
Translation Waiting (1)
● Reordering problem
– Right syntax sometimes moves to the left in the translation
    in the next 18 minutes I (VP) → 今から18分で私は(VP)   ((VP) stays on the right: OK)
    'm going to take (NP) → (NP)を行っています   ((NP) is reordered to the left)
    you on a journey → 旅の途中で
– Considering the output language, we would have to output future input before the current input
Slide 13
Waiting for Translation
● Heuristic: wait for the next input
    in the next 18 minutes I (VP) → 今から18分で私は(VP)   (output immediately)
    'm going to take (NP) → (NP)を行っています   (reordered: wait)
    'm going to take you on a journey → 貴方を旅にお連れします   (concatenated with the next input, then translated)
● Expected to avoid syntactically strange segmentations
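A minimal sketch of this waiting loop, under one simplifying assumption: the hypothetical `translate()` interface reports whether the predicted right-side constituent surfaces at the right edge of the output (anywhere else means reordering has occurred).

```python
def translate_incrementally(segments, translate):
    """`translate(text)` is assumed to return (output, right_label,
    right_is_rightmost): the translation, the predicted right-side
    constituent (or None), and whether it stayed at the right edge."""
    pending = ""
    for seg in segments:
        pending = (pending + " " + seg).strip()
        out, right_label, right_is_rightmost = translate(pending)
        if right_label is None or right_is_rightmost:
            yield out        # safe: emit the translation now
            pending = ""     # start a fresh translation unit
        # otherwise: reordering detected -> ignore this result and
        # wait, concatenating the next input segment
```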
Slide 15
Results: Prediction Accuracies
● Actual performance: Precision = 52.77%, Recall = 34.87%
● Precision around one half
– Not a trivial problem
● Low recall
– Caused by redundant constituents in the gold syntax
● E.g. for the input "I 'm a":
    Our predictor: NN nil
    Gold syntax: JJ NN PP nil
    Precision = 1/1 (NN matches), Recall = 1/3 (NN out of JJ, NN, PP)
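A worked check of the example above, assuming matches are counted as the multiset intersection of predicted and gold constituent labels with the end marker excluded (the paper's exact matching criterion may differ).

```python
from collections import Counter

def precision_recall(predicted, gold):
    p = Counter(c for c in predicted if c != "nil")  # drop the end marker
    g = Counter(c for c in gold if c != "nil")
    matched = sum((p & g).values())                  # multiset intersection
    return matched / max(sum(p.values()), 1), matched / max(sum(g.values()), 1)

# Predictor: "NN nil" vs. gold: "JJ NN PP nil"
print(precision_recall(["NN", "nil"], ["JJ", "NN", "PP", "nil"]))
# -> (1.0, 0.3333...)
```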
Slide 16
Results: Translation Trade-off (1)
[Figure: BLEU (left, 0.07-0.15) and RIBES (right, 0.42-0.60) vs. mean #words per input (∝ delay), for the PBMT baseline]
● Short inputs reduce translation accuracy
● Using N-word segmentation (not optimized)
Slide 17
Results: Translation Trade-off (2)
[Figure: BLEU and RIBES vs. mean #words per input (∝ delay), for the T2S and PBMT baselines]
● Long phrases: T2S > PBMT
● Short phrases: T2S < PBMT
Slide 18
Results: Translation Trade-off (3)
[Figure: BLEU and RIBES vs. mean #words per input (∝ delay), for T2S, PBMT, and the proposed method]
● Prevents the accuracy decrease on short phrases
● More robust to reordering
Slide 19
Results: Using an Optimized Segmentation
[Figure: BLEU and RIBES vs. mean #words per input (∝ delay), for T2S, PBMT, and the proposed method, using the optimized segmentation of [Oda+ 2014]]
● BLEU shows segmentation overfitting (boundaries are optimized without syntax features)
● But reordering (RIBES) is better than the other methods
Slide 20
Summary
● Combining two frameworks
– Syntax-based machine translation
– Simultaneous translation
● Methods
– Unseen syntax prediction
– Waiting for translation
● Experimental results
– Prevents the accuracy decrease on short phrases
– More robust reordering
● Future work
– Improving prediction accuracy
– Using other context features
Editor's notes
Hello. My name is Yusuke Oda, from the Nara Institute of Science and Technology in Japan. I'll talk about syntax-based simultaneous translation. This study mainly targets speech translation, so some of its assumptions differ from standard machine translation; please keep this in mind during the presentation.
Our study combines two aspects of machine translation. The first is syntax-based translation, which uses syntactic information such as parse trees to improve translation accuracy. This is a state-of-the-art method for translating distant language pairs, for example English to Japanese, which we are working on. The second is simultaneous translation, which prevents delay when translating long speech. We want to combine these methods, but doing so is not straightforward.
First, I'd like to describe the standard setting of speech translation. Suppose we obtain a waveform of English speech. We perform speech recognition to turn the speech into English text. Next, the machine translation system converts the text into Japanese text, and finally the speech synthesizer produces the corresponding Japanese speech and plays it through a speaker. In this process there is a delay from when we obtain the input speech to when we can generate the output speech, and this delay depends on the length of the translated segment. If we wait until the speaker finishes a full sentence, we may have to wait a long time before the output can start.
Simultaneous translation avoids this problem by separating the input into shorter phrases. Given the same input speech, the difference from normal speech translation is that simultaneous translation uses a segmentation strategy to divide the input words. Each shorter phrase generated by the segmentation is translated and synthesized, so the system can generate results without waiting for the end of the speech. The important point is the trade-off between translation quality and segmentation frequency: if we divide the input many times, the output becomes faster but the translation accuracy drops. Conventional segmentation strategies mainly aim to maintain translation accuracy while selecting as many segmentation boundaries as possible. Segmentation is thus a simple method for simultaneous translation.
But this approach often breaks the syntax. For example, consider this input sentence with its remaining unseen words. Here the segmentation boundary is placed as shown, and we have to translate using only the left side, "In the next eighteen minutes". This example may look strange, but segmentation algorithms based on machine learning often generate such phrases, because future input cannot be predicted completely. If we try to parse this phrase, we obtain a parse tree, but it is wrong because the input is not a well-formed phrase. A human can easily see that the following verb phrase is omitted from this fragment, and using this unseen information we could generate a correct parse tree. Using incorrect syntactic information harms machine translation, so we want to use the correct information when possible.
Our approach to this problem is easy to understand. First, we predict the unseen syntactic information from the current phrase using a machine learning method; then we use this information to build correct parse trees and produce correct translation results.
This is the overall summary of our proposed methods. We propose two steps for simultaneous translation. One is unseen syntax prediction from the current input phrase, using the predicted syntax for translation. The other is a waiting heuristic for translation, used when the predicted constituent would be placed in an undesirable position in the output.
So, we need methods to predict and use unseen syntactic information. First, we need to make training data for the syntax prediction. Second, we need a strategy to predict the additional syntactic information. Last, we need a strategy to use the prediction results for machine translation.
First, the training data. We can generate training data from the gold parse trees in a treebank. We select a span of leaf nodes in the gold parse tree; we can assume this span is the input phrase to be translated. Next, we search for the path between the leftmost and rightmost nodes of the leaf span, and delete the outside subtrees to make a minimal tree. Then we replace each inside subtree not on the selected path with its topmost phrase label. In this case we finally obtain this sequence, which means the left side of the phrase "is a" could have no more syntax, and the right side has one syntactic constituent, "NN".
Now I'd like to explain our algorithm for predicting unseen syntactic information from the input phrase; in this slide we run the prediction for this phrase. First, we forcibly generate a parse tree from the input. Note that this tree may be incorrect, because the input is not a complete phrase. Next, we extract features from the input and the parse tree. Then we perform multi-class classification with these features to decide what kind of syntax should be appended next. Any classifier can be used; in this study we used simple linear SVMs. We repeat this prediction until the end marker is obtained.
After predicting the additional syntax, we want to use this information for machine translation. In this study we use the tree-to-string machine translation framework, which usually uses the parse tree of the input sentence. Tree-to-string translation also has the nice property that it can easily use the predicted syntactic information explicitly, because it operates on sub-trees of the input tree, and we can ignore the details of the additional syntax.
However, simply combining segmentation and tree-to-string translation still has a problem, mainly caused by reordering. Here we show some inputs and outputs. The first input and its translation are fine, because the additional constituent VP is placed at the same position in both the input and the output. In the next input and translation, the additional constituent NP, originally on the right side of the input, ends up on the left side of the translation. This is a problem because the speaker has not yet spoken the content that should fill this constituent NP, so we cannot produce the translation until we obtain the next input.
To avoid this problem, we use one heuristic: waiting for translation. If we detect that the right-side syntax of the input phrase is placed anywhere except the right side of the output, we recognize that reordering has occurred; we then ignore the current result and wait for the next input. When the next input arrives, we concatenate the previous and new inputs and run the same prediction and translation again. Segmentation strategies are not perfect, because future inputs cannot be predicted completely, so we expect this approach to avoid segmentation errors.
These are the experimental settings. We used the Penn Treebank to train the syntax prediction, and an English-Japanese parallel corpus extracted from the WIT3 dataset to train the machine translation system. We compare three settings. One baseline is PBMT, which uses Moses for simultaneous translation, as in conventional studies. The other baseline is T2S, which replaces PBMT with a tree-to-string translation model. Our proposed method uses the same decoder as T2S, but adds our syntax prediction and waiting heuristic.
First, I'd like to explain how we evaluate the accuracy of the syntax prediction. For a given input, we compare the result from our predictor with the corresponding gold syntax, and calculate precision and recall from the number of constituents and the number of matches. In this example the precision is 1 and the recall is one third. These are the accuracies of our predictor: the precision is about one half, which is not bad considering that the predictor is a simple linear SVM, but it also shows that this prediction is not a trivial problem. The recall is low, but this is caused by redundant constituents in the gold syntax. For example, the constituents JJ and PP in the gold syntax are both possible but not necessary, so the low recall matters comparatively less than the precision.
Next we examine the translation trade-off, an important factor in simultaneous translation. The green lines show the translation accuracy of the PBMT system. The horizontal axis is the mean number of words in the input phrase, which is proportional to the output delay. The vertical axis shows each evaluation measure: BLEU is based on n-gram precision, and RIBES is based on reordering. Translating shorter phrases yields lower accuracy than longer phrases, which shows the trade-off between delay and translation accuracy.
Next, the T2S baseline, which uses tree-to-string translation instead of PBMT. We see the same trade-off tendency. The translation accuracy of T2S on longer phrases is higher than PBMT, but its accuracy on shorter phrases drops. As explained earlier, shorter inputs cause many instances of broken syntax, which explains the decrease.
The red lines are the results of our proposed method, which includes the syntax prediction and the waiting heuristic. There are two points. First, the actual delay becomes slightly longer than T2S; the red lines sit farther right than the other methods, which is the effect of the translation waiting. Second, and more importantly, the translation accuracy of our proposed method is higher than the baselines; in particular the RIBES score is higher even for shorter inputs. RIBES is sensitive to word order, so our method is more robust to the reordering that frequently occurs when translating between syntactically distant languages.
This is another evaluation, using a state-of-the-art segmentation strategy that directly optimizes translation accuracy but does not explicitly consider syntactic information. The BLEU score is mostly the same as the T2S baseline; this is caused by overfitting of the segmentation, because this strategy optimizes segmentation boundaries without syntax features. But the RIBES score is higher than the other baselines, so our methods retain their advantage in reordering.
This concludes my presentation. We proposed two methods for applying syntax-based translation to simultaneous translation, and we showed an improvement in translation accuracy, especially in reordering between distant languages. One piece of future work is improving the precision of the prediction. Also, our methods do not yet use any context information outside the current phrase, so using such context is another direction for future work. Thank you very much.