Presentation at CEF-EU-Luxembourg

Manuel Herranz
Pangeanic
PangeaMT System, Cleaning,
Automation on Retraining, Data
Management, Hybridization

What is PangeaMT?
• The first commercial application of Open Source Moses (AMTA
2010, http://euromatrixplus.net/moses). Trados compatibility.
Automated (re)training modules by folder.
• A development overcoming Moses limitations, reported to the
Association for MT in the Americas: "PangeaMT putting open
standards to work... well", AMTA 2010 http://bit.ly/uM8x6V
• 2011 PangeaMT launches the DIY Solution to Machine Translate
independently and flexibly like never before http://bit.ly/kSd3wC
• 2011 MT experiences Sony Europe http://slidesha.re/oxZmBS
• 2011 A harness that eases re-training and updating → DIY SMT as
presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU

What is PangeaMT?
• 2012: Collaboration with Toshiba for Japanese hybridization: 2 articles
at the Asia-Pacific Association for Machine Translation (2012-2013).
Automated (re)training modules by domain tag.
• 2012: Compatibility with SDLXLIFF, MemoQ and all Tikal formats.
• 2013: API for hosted solutions.
• 2014: Compatibility with MemSource.
• 2015: Pangea v3

Partners in Research
ITI: 8 FT researchers, 85 staff
PRHLT: 15 researchers led by
Prof. F. Casacuberta

T/products/services/processes that can be
offered to CEF.AT
copyright status/IPR. How can EC do business with you?
- The Pangea platform is the property of Pangeanic S.L, Valencia, Spain.
- Built on Open Source (some GNL). Copyright/IPR: 80% Pangeanic, 20% ITI (Univ. –
underlying code)
- Pangeanic is free to commercialize, hire or install a full PangeaMT platform, customize it
to specific user needs, design and implement new features together with its technological
partners ITI (Computer Science Institute) and PRHLT.
- Full ownership of the platform includes engine creation by domain, data cleaning
processes, engine retraining/update plus
- modular pre- and post-processing per language pair (rules, re-ordering, etc)
- BLEU scores and translation statistics.
- API for CAT-tool integration: Trados ttx, Studio sdlxliff, MemoQ, Memsource,
etc.
- Training on language /field customization.
- Data services (including cleaning)
- Training, consultancy and support.
- Pangeanic is free to provide full system services to any interested party, via RFP or direct
purchasing request (PO).

Can the technology be licensed to be integrated
in the CEF platform/services?
Yes, to CEF and any other services requiring fast MT + retrainings.
PangeaMT is not just an “I sell MT engines” business. It is a full machine translation
environment with plug-ins and API calls. It is Moses-based but it can also change paradigm
to Apertium, Thot or other systems whilst keeping powerful language-dependent
pre-processing and post-processing modules, which are key to hybridization.
The technology is designed more for large MT users than for small LSPs.
What are the licensing conditions? Consultancy
services
• 1-3 year license (renewable), which provides unlimited use for translation purposes
• domain customization
• [practically] unlimited engine creation within the contracted language pairs (typically
this is limited only by the client’s data availability)
• engine retrainings.
• Use of the platform requires one week of training. Consultancy services are available.

Languages/language pairs and the achieved
level of quality/evaluation results
BLEU results following typical academic standards:
2000 sentences taken out of main body.
These results are for general purpose baseline
engines after data cleaning and little customization.
Typically based on EU, TAUS and our own data.
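The hold-out step mentioned on this slide (2000 sentences taken out of the main body) can be sketched as follows. This is a minimal illustration, assuming the corpus is a list of (source, target) pairs; the function name `split_holdout`, the fixed seed and the default size are illustrative choices, not part of PangeaMT.

```python
import random

def split_holdout(pairs, test_size=2000, seed=13):
    """Split sentence pairs into (train, test), holding out `test_size` pairs.

    Hypothetical helper: the held-out pairs are removed from the training
    data so that the BLEU test set is never seen during training.
    """
    if test_size >= len(pairs):
        raise ValueError("corpus too small for the requested test set")
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    test_idx = set(indices[:test_size])
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    test = [p for i, p in enumerate(pairs) if i in test_idx]
    return train, test
```

Keeping the seed fixed means a retrained engine is scored on the same held-out sentences, so BLEU scores stay comparable between runs.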
• Translator’s typical output without MT: 2,300 words/day (baseline).
• Productivity with MT: around 4-5k words/day/translator
with customized engines.
• Peaked at 9k words/day/person.
Automotive client: 8M words/year → 1.5 translators.
This means 23,188 words/translator/day over
230 working days (including repetitions).
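The automotive figure follows from dividing the yearly volume by the number of translators and working days. A small check, using only the numbers on the slide (the function name is illustrative):

```python
def words_per_translator_day(words_per_year, translators, working_days):
    """Average daily throughput per translator, repetitions included."""
    return words_per_year / (translators * working_days)

daily = words_per_translator_day(8_000_000, 1.5, 230)
print(int(daily))  # 23188
```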

Languages/language pairs and the achieved level
of quality/evaluation results
Evaluation results: Clients –
• Use case Sybase 2011 – 2012 MosesCore:
http://es.slideshare.net/TAUS/4-june-2012-taus-moses-open-source-mt-showcase-paris-kerstin-bier-sybase
PE productivity >70%, cost savings 20%. 5M words, 49% BLEU EN-DE.
Deliveries 50% faster. BLEU not a good metric; METEOR preferred.
• Use case Sony: F/I/G/S, marketing and technical tests. Reported at
TAUS / Localization World Barcelona -> +50% productivity increase
& time to market. Language Project Managers updated and
created engines
http://www.slideshare.net/manuelherranz/mtexperiences-sony-europe-pangeamt-fprastarosonyeyustepangeamt
From EN and into EN : FR, IT, ES, ES-MX, DE, PT, PTBR, DA, SV, NO, NL, GK, PL, BG, RO, ZH, JP,
KR, RU (19).
From ES and into ES : FR, IT, PT. Under development: DE, SV, RU, ZH. 2016: ES <-> AR

Resources needed to make each tool/service/process
work and/or to adapt it to a specific domain (LR & of
human manual/preparatory work)
PangeaMT is a machine translation environment and as such, it can work with any language pair. A
sufficiently big and representative training corpus is required for statistical modelling (FastAlign,
KenLM, Moses) and model learning.
“Basic” or baseline models can be trained by default by linguists with sufficient knowledge of TM
cleaning. However, the platform allows big improvements on these models in several ways:
• Language-specific and domain-specific modular pre-processing and post-processing: linguistic
information, re-ordering, etc., help to normalize the text internally for better SMT model
learning. The platform includes some general pre-processing modules and others that are
language-dependent. New pre-processing modules can be easily added.
• Table combination. This is a Moses functionality that combines several phrase tables. When we
do domain adaptation, the general table is used as a fallback when a phrase is out of vocabulary
in the domain table or its score does not reach a good enough threshold.
• Monolingual data for language model enhancement.
• Hybridization. Language rules can be included in pre- or post-processing and users can opt for
hierarchical models (slower results). WIP: Morphology-rich languages and Japanese with POS.
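The fallback idea behind table combination can be illustrated with a toy lookup. This is a sketch under stated assumptions, not the Moses implementation: the tables are plain dictionaries mapping a source phrase to (target, score) options, and the names `lookup`, `best` and the 0.5 threshold are hypothetical.

```python
def best(options):
    """Return the highest-scoring (target, score) option, or None if empty."""
    return max(options, key=lambda o: o[1]) if options else None

def lookup(phrase, domain_table, general_table, threshold=0.5):
    """Prefer the domain table; back off to the general table.

    Returns (target, score, table_name), or None if neither table knows
    the phrase. Toy model only: real phrase tables carry several scores.
    """
    dom = best(domain_table.get(phrase, []))
    if dom and dom[1] >= threshold:
        return dom[0], dom[1], "domain"
    gen = best(general_table.get(phrase, []))
    if gen:
        return gen[0], gen[1], "general"
    if dom:  # weak domain entry is still better than nothing
        return dom[0], dom[1], "domain"
    return None

domain = {"brake pad": [("pastilla de freno", 0.9)], "manual": [("manual", 0.2)]}
general = {"manual": [("manual", 0.8)], "house": [("casa", 0.7)]}
print(lookup("brake pad", domain, general))  # domain entry is used
print(lookup("house", domain, general))      # out of vocabulary in the domain table
```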

Resources needed to make each tool/service/process
work and/or to adapt it to a specific domain (LR & of
human manual/preparatory work)
• Minimum 5M words for meaningful domains (as reported by Sybase), customization
time: 1 person, 1-2 weeks including data cleaning.
• General models can be trained with minimum human intervention. All users need to
do is to upload the cleanest possible TMX bitexts. The system will automatically train
the model applying general or language-specific pre-processing routines.
• Pre-processing routines can be adapted with use (computational
linguist/programmer)
• Specialized models (or domain-specific) can be trained and set with specific language
features: language / domain. These specific models can be combined with general
models to obtain a wider coverage.
• Retrainings are always automated: Sybase, Sony, Honda, Subaru, Hioki…
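As an illustration of what uploading TMX bitexts implies on the processing side, the sketch below extracts sentence pairs from a TMX string with Python's standard library. It is not PangeaMT code; the function name `read_tmx_pairs` and the case-insensitive language matching are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) segment pairs from a TMX string.

    Hypothetical helper: translation units missing either language are skipped.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens inline elements, keeping their text content
                segs[lang] = "".join(seg.itertext()).strip()
        if segs.get(src_lang) and segs.get(tgt_lang):
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

A strict XML parser like this one rejects malformed files outright, which is one reason the cleanest possible TMX matters.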

Data? best clean, thank you
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid=“pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–
<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>
<tuv xml:lang=“EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang=“DE-DE">
<seg>Am 22. </seg>

More cleaning
Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang=“ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
<tuv xml:lang=“JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅
力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道
すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
<tuv xml:lang=“EN-US">
<seg>It is a journalistic point of view and strengths of the English-
language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>

Engine training with
clean data
Having approved,
terminologically sound,
clean data improves engine
accuracy and performance
with even small sets of
data.
Data cleaning modules
• Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many
kinds!)
• Terminological inaccuracies
• Non-useful segments, etc
Parallel text extraction / Translation
input / Post-edited material
This often comes from CAT tools, document
alignments or crawling.
Data Cleaning (in-lines)
Remove all non-translation
data.
TMX Human approval
Some of this material may
actually be OK for training. It
is then input in the training
set.
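The kind of "suspect" filtering listed on this slide can be sketched with a few simple heuristics. This is an illustrative example, not the PangeaMT cleaning modules: the function name `is_suspect`, the thresholds and the choice of checks are assumptions, and whitespace tokenization does not suit languages such as Japanese.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def digits_in(text):
    """Sorted digit strings found in text, ignoring decimal separators."""
    return sorted(re.sub(r"\D", "", n) for n in NUMBER.findall(text))

def is_suspect(source, target, max_words=80, max_ratio=3.0):
    """Return a reason string if the pair looks suspect, else None."""
    src_words, tgt_words = source.split(), target.split()
    if not src_words or not tgt_words:
        return "empty segment"
    if len(src_words) > max_words or len(tgt_words) > max_words:
        return "too long"
    longer = max(len(src_words), len(tgt_words))
    shorter = min(len(src_words), len(tgt_words))
    if longer / shorter > max_ratio:
        return "length mismatch"
    if digits_in(source) != digits_in(target):
        return "number mismatch"
    return None

# The truncated German segment from the earlier slide is caught by the ratio check.
print(is_suspect("On 22nd May we decided not to join the group.", "Am 22."))
```

Pairs flagged this way would go to human approval rather than being discarded, since some of them may still be usable for training.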

Programme Committee comment
Concentrate on the processes and the automation of
workflows and how well these have been validated and tested
in real production of MT systems.
We are interested in the "factory" concept, automating
training data domain selection, tuning and data cleaning.

System features – For EXPERT
Domain

Engine Creation

Engine Training

Hybridization Experiences at Pangeanic
Rationale
- When the syntactic distance between languages is very large
(unrelated languages), patterns are lost (or not found) → monotone translation.
Diagram labels: Data, Linguistic Information, Language Knowledge, Output Translation

SYNTAX-BASED (TREE) FOR HYBRID SMT
Syntax-based analysis & re-ordering rules
Tree depth: 10
Calc time +59% !!

TOSHIBA vs MECAB – LESSONS LEARNT
TWO OPTIONS
Mecab re-ordering produced higher BLEU than Toshiba’s
Paper published December 2011 AAMT: Going Hybrid: Pangeanic’s and Toshiba’s
First Steps Toward ENJP MT Hybridation

Installing PangeaMT
Software        Package         Version
MySQL Server    mysql-server    5.5
Open JDK        openjdk-8-jdk   8
Apache Tomcat   tomcat8         8
Moses
Giza++
Fast_align
KenLM
Tikal (Okapi)
IRSTLM
mkcls
mgiza
tercpp
multi-bleu.perl
mecab
timeout
python
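Before installing, it can help to check which of these dependencies are already on the PATH. The sketch below is an assumption-laden convenience, not part of the PangeaMT installer: the executable names in `TOOLS` are guesses (for example `lmplz` for KenLM) and depend on how each tool was built or packaged.

```python
import shutil

# Assumed executable names; adjust to the local build or packaging.
TOOLS = ["mysql", "java", "moses", "fast_align", "lmplz",
         "mkcls", "mgiza", "mecab", "timeout", "python"]

def missing_tools(names, which=shutil.which):
    """Return the names that `which` cannot resolve on the PATH."""
    return [name for name in names if which(name) is None]

if __name__ == "__main__":
    for name in missing_tools(TOOLS):
        print(f"not found on PATH: {name}")
```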

Questions?
m.herranz@pangeanic.com
#manuelhrrnz #pangeanic pangeanic
