9. Manuel Herranz (Pangeanic): Hybrid Solutions for Translation

PangeaMT
Sharing Experiences on MT System,
Data management,
Hybridation
Alex Helle / Manuel Herranz

Intro
Brief history
Pangea system introduction /
features for EXPERT

Hybridation experiences at
Pangeanic (+future work)

Intro
Brief history
• “1-2 million words an hour”
• “quite adequate speed to
cope with the whole output
of the Soviet Union in a
week… a few hours computer
time a week”
• [full scale production] “if our
experiments go well, within 5
years or so”

http://youtu.be/K-HfpsHPmvw

What is PangeaMT?
• The first commercial application of Open Source Moses (AMTA 2010, http://euromatrixplus.net/moses)
• A development overcoming Moses limitations for the localization industry, presented at the Association for MT in the Americas: “PangeaMT putting open standards to work... well”, AMTA 2010, http://bit.ly/uM8x6V
• 06/2011: PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before, http://bit.ly/kSd3wC
• 07/2011: MT experiences, Sony Europe, http://slidesha.re/oxZmBS
• 07/2011: A harness that eases re-training and updating → DIY SMT, as presented at TAUS Barcelona 2011, http://slidesha.re/nEe5mU
• 02/2012: API for hosted solutions

What is PangeaMT?

2007 and before

• RB tests with commercial software
• Insufficiently good output
• Only internal production

2007/08
• V1: Small data sets (2-5M words),
automotive & electronics
• (ES), then Fr/It/De in other fields

• EU Post-Editing Award
2009/10
• Division born
• Hundreds of engine trials and language combinations
• Open-Source to commercial

2011/12
• DIY SMT
• Automated retraining
• API v1
• Glossary
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, docx, odt)

2013
• TMX / XLIFF workflows
• Powerful API v2 for live translation
• Confidence scores
• Compatibility with more commercial formats

SMT at work
Unrest is continuing in Cairo as protesters set up their demand for Egypt’s
military rulers to resign

+ specific language rules
+ job or client glossary
+ hybrid technologies

Data? best clean, thank you
Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>

Cleaning

<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid=“pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–
<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>

More cleaning

<tuv xml:lang=“EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang=“DE-DE">
<seg>Am 22. </seg>

Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang=“ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>

Cleaning

<tuv xml:lang=“JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅
力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道
すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
<tuv xml:lang=“EN-US">
<seg>It is a journalistic point of view and strengths of the Englishlanguage newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>

More cleaning

Parallel text extraction / Translation
input / Post-edited material

Cleaning

This often comes from CAT tools, document alignments, or crawling

Engine training with
clean data
Having approved,
terminologically sound,
clean data improves engine
accuracy and performance
with even small sets of
data.

Data Cleaning (in-lines)
Remove all non-translation
data.
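A minimal sketch of in-line cleaning, assuming segments arrive as raw TMX `<seg>` strings like the ones in the samples above. The helper name `strip_inlines` and the regex approach are illustrative assumptions, not PangeaMT's actual module. It drops the TMX inline elements `bpt`, `ept`, `ph`, `it` and `ut`, whose content is formatting code rather than translatable text, and then normalizes whitespace.

```python
import re

# TMX inline elements whose content is native formatting code, not text.
# Self-closing forms are tried first so they are never paired with a
# closing tag that belongs to a later element.
_INLINE = re.compile(
    r"<(?:bpt|ept|ph|it|ut)\b[^>]*/>"
    r"|<(bpt|ept|ph|it|ut)\b[^>]*>.*?</\1>",
    re.DOTALL,
)


def strip_inlines(seg: str) -> str:
    """Remove inline formatting elements and collapse runs of whitespace."""
    text = _INLINE.sub("", seg)
    return re.sub(r"\s+", " ", text).strip()


print(strip_inlines('Height <bpt i="1">{f43 </bpt>25<ept i="1">}</ept> cm'))
```

A regex is enough for a sketch; a production cleaner would more likely parse the TMX with an XML parser so that malformed segments are detected rather than silently passed through.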

Data cleaning modules

TMX Human approval
Some of this material may
actually be OK for training. It
is then input in the training
set.

Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc.
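Two of the "suspect" checks listed above (over-long sentences and length mismatches) can be sketched as a simple filter. The function name `is_suspect` and both thresholds are hypothetical values chosen for illustration, not figures from the talk. Whitespace token counts are also an assumption that only suits languages written with spaces; JP or ZH would need segmenting first.

```python
def is_suspect(src: str, tgt: str, max_words: int = 80, max_ratio: float = 2.0) -> bool:
    """Flag a segment pair that is empty, too long, or badly length-mismatched."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return True  # one side is empty
    if len(s) > max_words or len(t) > max_words:
        return True  # sentence too long to align reliably
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    return ratio > max_ratio  # e.g. a truncated or padded translation


# The truncated German segment from the cleaning examples is flagged:
print(is_suspect("On 22nd May we decided not to join the group.", "Am 22."))
```

Flagged pairs need not be discarded outright; as the slide on TMX human approval notes, some of them may turn out to be fine for training after review.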

System features – For EXPERT
Cleaning

Domain

Engine Creation

Engine Training

Typically 5-grams, distortion limit (DL), table
Unrest is continuing in Cairo as protesters set up their demand for Egypt’s
military rulers to resign

• specific language rules
• job / client glossary
• hybrid technologies
• good BLEU tracking, ideal for experimentation
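To make the BLEU tracking mentioned above concrete, here is a simplified sketch of the metric: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It assumes a single reference, whitespace tokens and no smoothing; the function name `bleu_sketch` is hypothetical. Real experiment tracking would normally use corpus-level BLEU from an established scoring tool.

```python
import math
from collections import Counter


def _ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_sketch(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU for one sentence (illustration only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c, r = _ngrams(cand, n), _ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in c.items())
        total = sum(c.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_sum += math.log(overlap / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum)


print(bleu_sketch("the company plans to offer it",
                  "the company plans to offer the following"))
```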

Different MT Systems for Different
Lang Pairs?
Related languages →
SMT, with accurate n-gram training and in-domain data (typically 5-grams, distortion limit, weights and fine-tuning)
Morphology-rich languages →
Data is not enough and the casuistry is too large (Baltic languages like Latvian are extreme; Turkish is regular but has too many suffixes). SMT cannot cope: rule-based or hybrid
Syntactically distant languages →
Need additional information; this is where different HYBRID TECHNIQUES come into play. NO “ONE SIZE FITS ALL”

Hybridation Experiences at Pangeanic
Rationale
When the syntactic distance between languages is very large (unrelated languages), patterns are lost (or not found) → monotone TR

[Diagram labels: Linguistic Information, Language Knowledge, Data, Output Translation]

TWO OPTIONS

SYNTAX-BASED HYBRID SMT
Altaic languages ↔ English
Arabic ↔ European languages
Agglutinative ↔ Non-agglutinative

RE-ORDERING
Toshiba / Mecab benchmarking
EN ↔ JP

TWO METHODS

CHALLENGES
• SVO vs SOV
• Tokenization: no spaces between words; Mecab/KyTea for JP, Peterson Segmentor for ZH
• RBMT systems have traditionally worked with linguistic & morphological analyzers, so “units” were already segmented.
• SMT cannot do this, so we need to tokenize to leave a similar number of “words” on both sides → Giza++ can then relate words and groups.
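The tokenization point above can be illustrated with a toy segmenter. This is not how Mecab or KyTea work (they are trained morphological analyzers); it is a greedy longest-match sketch over a hypothetical hand-made lexicon, meant only to show why an unspaced sentence must be split into units before word alignment. The names `segment` and `LEXICON` are assumptions made for this example.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy forward longest-match; unknown characters become single tokens."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, just large enough for the sample sentence.
LEXICON = {"同社", "は", "次", "の", "バージョン", "を", "提供", "する", "予定", "です", "。"}

ja_tokens = segment("同社は次のバージョンを提供する予定です。", LEXICON)
print(" ".join(ja_tokens))
print(len(ja_tokens), "tokens")
```

Once the Japanese side is split into units like these, both sides of a segment pair have comparable token counts, which is the precondition the slide gives for word alignment.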

TWO METHODS

CHALLENGES
• SVO vs SOV
• Re-ordering?
• Phrase-based or hierarchical models (syntactical)?
Continue to press the button to scroll through the components of the program until
the display shows the desired current selection.
Japanese proper word order would be

the display the desired current selection shows until the components the program of
through to scroll the button to press continue.
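The kind of re-ordering shown above can be sketched as a head-final rewrite over a dependency-style tree. Everything here is an assumption made for illustration: the `Node` structure, the hand-built tree (a real system would obtain it from a parser), and the single rule that dependents following the head are moved in front of it in reverse order while preceding dependents such as the subject stay put. It is not the rule set used by Pangeanic or Toshiba.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    head: str                                          # head word(s) of the phrase
    pre: List["Tree"] = field(default_factory=list)    # dependents before the head
    post: List["Tree"] = field(default_factory=list)   # dependents after the head


Tree = Union[str, Node]


def head_final(tree: Tree) -> List[str]:
    """Emit words so that every head comes after all of its dependents."""
    if isinstance(tree, str):
        return tree.split()
    words: List[str] = []
    for dep in tree.pre:             # e.g. the subject keeps its position
        words += head_final(dep)
    for dep in reversed(tree.post):  # objects and complements move before the head
        words += head_final(dep)
    return words + tree.head.split()


# Hand-built tree for a shortened version of the example sentence:
# "continue to press the button until the display shows the desired current selection"
clause = Node("shows", pre=["the display"], post=["the desired current selection"])
sentence = Node("continue", post=[
    Node("to press", post=["the button"]),
    Node("until", post=[clause]),
])

print(" ".join(head_final(sentence)))
```

On this shortened sentence the sketch yields the same clause order as the slide: the until-clause first, then the object and its verb, with the main verb last.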

Syntax-based analysis & re-ordering rules

SYNTAX-BASED (TREE) FOR HYBRID SMT

Tree depth: 10
Calc time +59% !!

Syntax-based analysis & re-ordering rules

SYNTAX-BASED RULES FOR HYBRID SMT
発売時には、同社は次のバージョンを提供する予定です。

Translation & Cleaning
available When , the company the following : plans to offer :

Nipponization module
(Cond clause),

(Subject)

(VBPt) (to)

(Predicate)

(ADV) (ADJ) (Punct) (DET) (NNSing) (VBPt3) (to) (VBinf) (DET) (NN)
When available, the company plans to offer the following:

TWO OPTIONS

TOSHIBA vs MECAB
Toshiba’s The Honyaku is an established RB system (30+ years)
Lacks flexibility; rules contradict each other
Proposal: re-arrange the whole EN corpus for JP with Toshiba’s rules, but this meant dependency on a proprietary system for future inputs.

TWO OPTIONS

TOSHIBA vs MECAB – LESSONS LEARNT
Mecab re-ordering produced higher BLEU than Toshiba’s
5-fold structure

Paper published December 2011 (AAMT): “Going Hybrid: Pangeanic’s and Toshiba’s First Steps Toward ENJP MT Hybridation”

Future (current) Work on Hybrids
• Morphology-rich langs: RU in particular. Improve DE
• Distant languages: re-ordering for AR?
• Agglutinative langs: TK – new paradigm

Questions?
m.herranz@pangeanic.com
#manuelhrrnz #pangeanic
