3. Intro
Brief history
• “1-2 million words an hour”
• “quite adequate speed to
cope with the whole output
of the Soviet Union in a
week… a few hours computer
time a week”
• [full scale production] “if our
experiments go well, within 5
years or so”
http://youtu.be/K-HfpsHPmvw
4. What is PangeaMT?
The first commercial application of Open Source Moses (AMTA 2010,
http://euromatrixplus.net/moses)
A development that overcomes Moses limitations for the localization
industry, presented at the Association for MT in the Americas:
“PangeaMT: putting open standards to work... well” (AMTA 2010)
http://bit.ly/uM8x6V
06/2011 PangeaMT launches the DIY Solution to Machine Translate
independently and flexibly like never before http://bit.ly/kSd3wC
07/2011 MT experiences at Sony Europe http://slidesha.re/oxZmBS
07/2011 A harness that eases re-training and updating DIY SMT
as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU
02/2012 API for hosted solutions
5. What is PangeaMT?
2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
2007/08
• V1: Small data sets (2-5M words),
automotive & electronics (ES),
then FR/IT/DE in other fields
• EU Post-Editing Award
2009/10
• Division born
• Hundreds of engine trials and
language combinations
• Open-Source to commercial
2011/12
• DIY SMT
• Automated retraining
• API v1
• Glossary
• Automated re-training
• Transfer architecture
and know-how to users
• Compatibility with
commercial formats
(ttx, sdlxliff, docx, odt)
• TMX / XLIFF workflows
• Powerful API v2 for live translation
• Confidence scores
• Compatibility with more commercial formats
2013
6. SMT at work
Unrest is continuing in Cairo as protesters step up their demand for Egypt’s
military rulers to resign
+ specific language rules
+ job or client glossary
+ hybrid technologies
7. Data? best clean, thank you
Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
Cleaning
<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid="pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–
<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>
More cleaning
<tuv xml:lang="EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang="DE-DE">
<seg>Am 22. </seg>
8. Data? best clean, thank you
Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang="ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
Cleaning
<tuv xml:lang="JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅
力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道
すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
<tuv xml:lang="EN-US">
<seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today's translation industry.
</seg>
More cleaning
9. Data? best clean, thank you
Parallel text extraction / Translation
input / Post-edited material
Cleaning
This often comes from CAT tools, document
alignments, or crawling
Engine training with
clean data
Having approved,
terminologically sound,
clean data improves engine
accuracy and performance
with even small sets of
data.
Data Cleaning (in-lines)
Remove all non-translation
data.
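A minimal sketch of in-line cleaning, assuming TMX-style paired codes such as the `<bpt>`/`<ept>` elements in the sample segments above. The function name `strip_inlines` and both regular expressions are illustrative assumptions, not the actual PangeaMT cleaning module.

```python
import re

# Paired in-line codes together with their formatting payload,
# e.g. <bpt i="1">{f43 </bpt> or <ept i="1">}</ept>
PAIRED_CODE = re.compile(r"<(bpt|ept|ph|it)\b[^>]*>.*?</\1>", re.DOTALL)
# Any tag left over after the paired codes are gone
ANY_TAG = re.compile(r"<[^>]+>")


def strip_inlines(segment: str) -> str:
    """Remove in-line formatting codes, then collapse runs of whitespace."""
    text = PAIRED_CODE.sub(" ", segment)
    text = ANY_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = ('Overall height –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; '
           'width –<bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1".')
    print(strip_inlines(raw))  # Overall height – 25"; width – 20.1".
```

The sketch treats anything between angle brackets as markup, so a real pipeline would need to protect literal "<" and ">" characters in the text first.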
Data cleaning modules
TMX Human approval
Some of this material may
actually be OK for training. It
is then input in the training
set.
Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc.
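The “suspect” checks can be sketched as a simple filter over segment pairs. This is only an illustration under stated assumptions: the names `is_suspect` and `clean`, the 80-word limit, and the 2.0 length ratio are hypothetical choices, and whitespace word counts only make sense for languages written with spaces.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def is_suspect(src: str, tgt: str,
               max_words: int = 80, max_ratio: float = 2.0) -> bool:
    """Flag empty, overlong, or badly length-mismatched segment pairs."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return True  # one side is empty: nothing to learn from
    if src_len > max_words or tgt_len > max_words:
        return True  # sentence too long
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio  # length mismatch


def clean(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (kept for training, suspects for human approval)."""
    kept, suspects = [], []
    for src, tgt in pairs:
        (suspects if is_suspect(src, tgt) else kept).append((src, tgt))
    return kept, suspects


if __name__ == "__main__":
    kept, suspects = clean([
        ("The cat sleeps.", "El gato duerme."),
        ("On 22nd May we decided not to join the group.", "Am 22."),
    ])
    print(len(kept), len(suspects))  # 1 1
```

Suspects are set aside rather than deleted, so that material which turns out to be fine after human approval can go back into the training set.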
14. System features – For EXPERT
Typically a 5-gram language model, distortion limit (DL), phrase table
Unrest is continuing in Cairo as protesters step up their demand for Egypt’s
military rulers to resign
• Specific language rules
• Job / client glossary
• Hybrid technologies
• Good BLEU tracking, ideal for experimentation
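To make the “5-gram” setting concrete, the sketch below collects the raw n-gram counts up to order 5 that a language model is estimated from. The function names and the `<s>` / `</s>` boundary markers are assumptions for illustration; production toolkits such as KenLM or SRILM add smoothing and back-off on top of counts like these.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences: Iterable[str], order: int = 5) -> Counter:
    """Count every n-gram of length 1..order, with sentence boundaries."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    counts = count_ngrams(["the company plans to offer the following"])
    print(counts[("the",)])                    # 2
    print(counts[("plans", "to", "offer")])    # 1
```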
15. Different MT Systems for Different
Lang Pairs?
Related languages
SMT, with accurate n-gram training and in-domain data (typically 5-grams,
distortion limit, weights and fine-tuning)
Morphology-rich languages
Data alone is not enough and the number of word forms is too large (Baltic languages like Latvian are
extreme; Turkish is regular but has too many suffixes). SMT cannot cope: rule-based or hybrid
Syntactically distant languages
These need additional information; this is where different HYBRID TECHNIQUES
come into play. NO “ONE SIZE FITS ALL”
16. Hybridation Experiences at Pangeanic
Rationale: when the syntactic distance between languages is very large
(unrelated languages), patterns are lost (or not found)
monotone TR
-
Linguistic
Information
Language
Knowledge
Data
Output Translation
17. Hybridation Experiences at Pangeanic
TWO OPTIONS
SYNTAX-BASED HYBRID SMT
Altaic languages English
Arabic European languages
Agglutinative Non-agglutinative
Linguistic
Information
Language
Knowledge
Data
RE-ORDERING
Toshiba / Mecab benchmarking
EN JP
Output Translation
18. Hybridation Experiences at Pangeanic
TWO METHODS
CHALLENGES
SVO vs SOV
Tokenization: no spaces between words. MeCab/KyTea for JP,
Peterson Segmenter for ZH
RBMT systems have traditionally worked with linguistic &
morphological analyzers, so “units” were already segmented.
SMT has no such analyzers, so we need to tokenize to leave a similar number of
“words” on both sides. GIZA++ can then align words and groups.
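As a rough illustration of why segmentation matters for alignment, the sketch below splits unspaced Japanese text with greedy longest-match lookup. This is a deliberately simple stand-in, not how MeCab or KyTea work (they use trained statistical models), and the tiny `LEXICON` is a hypothetical example built for one sentence.

```python
from typing import List, Set

# Toy lexicon, hand-built for the example sentence only (an assumption).
LEXICON: Set[str] = {
    "発売", "時", "には", "同社", "は", "次", "の",
    "バージョン", "を", "提供", "する", "予定", "です",
}


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown characters become tokens."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    ja = segment("発売時には、同社は次のバージョンを提供する予定です。", LEXICON)
    en = "When available , the company plans to offer the following :".split()
    print(len(ja), len(en))  # 15 11
```

Once both sides are token sequences of comparable length, a word aligner has units it can relate to each other.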
20. Hybridation Experiences at Pangeanic
TWO METHODS
CHALLENGES
SVO vs SOV
Re-ordering?
Phrase-based or hierarchical models (syntactical)?
Continue to press the button to scroll through the components of the program until
the display shows the desired current selection.
Proper Japanese word order would be:
the display the desired current selection shows until the components the program of
through to scroll the button to press continue.
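The SVO-to-SOV idea can be sketched as a head-final re-ordering over a parse tree: inside a verb phrase the verb moves last, and inside a prepositional phrase the preposition moves last. The `Node` class, the `HEAD_OF` table, and the hand-built tree are all assumptions for illustration; a real system would take its trees from a parser and use a far richer rule set.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


# Phrase label -> label of the head child that must come last (assumed rules).
HEAD_OF = {"VP": "V", "PP": "P"}


def head_final(node: Union[Node, str]) -> List[str]:
    """Linearize a tree, moving verb and preposition heads to the end."""
    if isinstance(node, str):
        return [node]
    kids = list(node.children)
    head = HEAD_OF.get(node.label)
    if head is not None:
        heads = [k for k in kids if isinstance(k, Node) and k.label == head]
        others = [k for k in kids if not (isinstance(k, Node) and k.label == head)]
        kids = others + heads
    words: List[str] = []
    for kid in kids:
        words.extend(head_final(kid))
    return words


tree = Node("S", [
    Node("NP", ["the", "display"]),
    Node("VP", [
        Node("V", ["shows"]),
        Node("NP", ["the", "desired", "current", "selection"]),
    ]),
])

if __name__ == "__main__":
    print(" ".join(head_final(tree)))
    # the display the desired current selection shows
```

Applying the re-ordering to the source side before training makes the two languages closer to monotone, which is easier for phrase-based alignment.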
21. Hybridation Experiences at Pangeanic
Syntax-based analysis & re-ordering rules
SYNTAX-BASED (TREE) FOR HYBRID SMT
Tree depth: 10
Calc time +59% !!
22. Hybridation Experiences at Pangeanic
Syntax-based analysis & re-ordering rules
SYNTAX-BASED RULES FOR HYBRID SMT
発売 時 には、 同社は 次の バージョンを 提供する 予定 です 。
Translation & Cleaning
available When , the company the following : plans to offer :
Nipponization module
(Cond clause),
(Subject)
(VBPt) (to)
(Predicate)
(ADV) (ADJ) (Punct) (DET) (NNSing) (VBPt3) (to) (VBinf) (DET) (NN)
When available, the company plans to offer the following:
23. Hybridation Experiences at Pangeanic
TWO OPTIONS
TOSHIBA vs MECAB
Toshiba’s The Honyaku is an established RB system (30+ years)
Lacks flexibility, rules contradict each other
Proposal: re-order the whole EN corpus into JP word order with Toshiba’s
rules, but this meant dependency on a proprietary system for
future inputs.
24. Hybridation Experiences at Pangeanic
TWO OPTIONS
TOSHIBA vs MECAB – LESSONS LEARNT
Mecab re-ordering produced higher BLEU than Toshiba’s
5-fold structure
25. Hybridation Experiences at Pangeanic
TWO OPTIONS
TOSHIBA vs MECAB – LESSONS LEARNT
Mecab re-ordering produced higher BLEU than Toshiba’s
Paper published December 2011 (AAMT): “Going Hybrid: Pangeanic’s and Toshiba’s
First Steps Toward ENJP MT Hybridation”
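For readers unfamiliar with the metric behind this comparison, here is a simplified single-sentence, single-reference BLEU sketch: clipped n-gram precisions for n = 1 to 4, their geometric mean, and a brevity penalty. It is not the scoring setup used in the experiments (which the slides do not describe), it applies no smoothing, and real evaluations are run at corpus level with an established tool.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    ref = "the display the desired current selection shows"
    print(bleu(ref, ref))
    print(bleu("the display shows the desired current selection", ref))
```

The second call scores below the first because the candidate keeps English word order, which illustrates how re-ordering quality shows up in the score.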
27. Future (current) Work on Hybrids
Morphology-rich langs: RU in particular.
Improve DE
Distant languages: re-ordering for AR?
Agglutinative langs: TK – new paradigm