Our statistical machine translation platform and its hybrid features were presented at the European Commission offices in Luxembourg last Tuesday, 22nd September. It is one of the tools, alongside other commercial machine translation solutions, that the European Union will consider to support its mandate for CEF (Connecting Europe Facility). Pangeanic’s CEO, Manuel Herranz, presented the current state of the art that PangeaMT version 3 represents. Representatives from the EU were particularly interested in the solid data management features, the machine translation engine retraining routines, data cleaning, and the automated engine training and creation features. One of the key features of the new PangeaMT version is the possibility to change translation algorithms and use other systems such as the rule-based Apertium or Thot instead of the default Moses. It is also compatible with 3rd-party calls from other systems. Its powerful API can also provide machine-translated output to requests from anywhere in the world, although the platform is designed for onsite use at translation companies and organizations. PangeaMT is also compatible with several popular translation formats such as ttx, sdlxliff, memoq, memsource, and most XML-based formats handled by Tikal.
2. What is PangeaMT?
The first commercial application of Open Source Moses (AMTA 2010, http://euromatrixplus.net/moses). Trados compatibility.
Automated (re)training modules by folder.
A development overcoming Moses limitations, reported to the Association for MT in the Americas: “PangeaMT putting open standards to work... well”, AMTA 2010 http://bit.ly/uM8x6V
2011: PangeaMT launches the DIY Solution to Machine Translate independently and flexibly like never before http://bit.ly/kSd3wC
2011: MT experiences at Sony Europe http://slidesha.re/oxZmBS
2011: A harness that eases re-training and updating DIY SMT, as presented at TAUS Barcelona 2011 http://slidesha.re/nEe5mU
3. What is PangeaMT?
2012: Collaboration with Toshiba on Japanese hybridization: 2 articles at the Asian Association for Machine Translation (2012-2013).
Automated (re)training modules by domain tag.
2012: Compatibility with SDLXLIFF, MemoQ and all Tikal formats.
2013: API for hosted solutions.
2014: Compatibility with MemSource.
2015: PangeaMT v3
4. Partners in Research
ITI: 8 full-time researchers, 85 staff
PRHLT: 15 researchers led by Prof. F. Casacuberta
5. T/products/services/processes that can be
offered to CEF.AT
copyright status/IPR. How can EC do business with you?
- The PangeaMT platform is the property of Pangeanic S.L., Valencia, Spain.
- Built on Open Source (some GNL). Copyright/IPR: 80% Pangeanic, 20% ITI (Univ. –
underlying code)
- Pangeanic is free to commercialize, lease or install a full PangeaMT platform, customize it
to specific user needs, and design and implement new features together with its technological
partners ITI (Computer Science Institute) and PRHLT.
- Full ownership of the platform includes engine creation by domain, data cleaning
processes, engine retraining/update plus
- modular pre- and post-processing per language pair (rules, re-ordering, etc)
- BLEU scores and translation statistics.
- API for CAT-tool integration: Trados ttx, Studio sdlxliff, MemoQ, Memsource,
etc.
- Training on language/field customization.
- Data services (including cleaning)
- Training, consultancy and support.
- Pangeanic is free to provide full system services to any interested party, via RFP or direct
purchasing request (PO).
6. Can the technology be licensed to be integrated
in the CEF platform/services?
Yes, to CEF and any other services requiring fast MT + retrainings.
PangeaMT is not just an “I sell MT engines” business. It is a full machine translation
environment with plug-ins and API calls. It is Moses-based, but it can also change paradigm
to Apertium, Thot or other systems whilst keeping powerful language-dependent
pre-processing and post-processing modules, which are key to hybridization.
The technology is designed more for large MT users than for small LSPs.
What are the licensing conditions? Consultancy services
• 1-3 year license (renewable), which provides unlimited use for translation purposes and
domain customization
• [practically] unlimited engine creation within the contracted language pairs (typically
limited only by the client’s data availability)
• engine retrainings.
• Using the platform requires one week of training. Consultancy services are available.
7. Languages/language pairs and the achieved
level of quality/evaluation results
BLEU results following typical academic standards:
2,000 sentences held out from the main body of data.
These results are for general-purpose baseline
engines after data cleaning and little customization.
Typically based on EU, TAUS and our own data.
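The held-out evaluation set described above can be sketched as follows. This is a minimal illustration, assuming the corpus is already loaded as a list of (source, target) sentence pairs; the function name `hold_out` and the fixed random seed are assumptions made for the example, not part of PangeaMT.

```python
import random


def hold_out(pairs, test_size=2000, seed=0):
    """Split sentence pairs into (train, test), holding out test_size pairs.

    The held-out pairs are removed from the training data so that BLEU is
    measured on sentences the engine has never seen.
    """
    if test_size > len(pairs):
        raise ValueError("corpus is smaller than the requested test set")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    test_idx = set(rng.sample(range(len(pairs)), test_size))
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    test = [p for i, p in enumerate(pairs) if i in test_idx]
    return train, test
```

Keeping the seed fixed means the same 2,000 sentences are scored after every retraining, so BLEU figures stay comparable between engine versions.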
• Translator’s typical output without MT: from 2,300 words per day upwards.
• Productivity with MT: around 4-5k words/day/translator
with customized engines.
• Peaked at 9k words/day/person.
Automotive client: 8M words/year with 1.5 translators.
This means 23,188 words/translator/day over
230 working days (including repetitions).
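The automotive figure can be checked with a few lines of arithmetic. The sketch below only restates the numbers on the slide (8M words per year, 1.5 translators, 230 working days); the function name is an assumption for the example.

```python
def words_per_translator_per_day(words_per_year, translators, working_days):
    """Average daily throughput per translator, repetitions included."""
    return words_per_year / translators / working_days


daily = words_per_translator_per_day(8_000_000, 1.5, 230)
print(round(daily))  # 23188
```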
8. Languages/language pairs and the achieved level
of quality/evaluation results
Evaluation results: Clients
Use case Sybase 2011-2012, MosesCore:
http://es.slideshare.net/TAUS/4-june-2012-taus-moses-open-source-mt-showcase-paris-kerstin-bier-sybase
PE productivity >70%, cost savings 20%. 5M words, 49% BLEU EN-DE.
Deliveries 50% faster. BLEU was not considered a good metric; METEOR was preferred.
Use case Sony: F/I/G/S, marketing and technical tests. Reported at
TAUS / Localization World Barcelona -> +50% productivity increase
& time to market. Language Project Managers updated and
created engines.
http://www.slideshare.net/manuelherranz/mtexperiences-sony-europe-pangeamt-fprastarosonyeyustepangeamt
From EN and into EN : FR, IT, ES, ES-MX, DE, PT, PTBR, DA, SV, NO, NL, GK, PL, BG, RO, ZH, JP,
KR, RU (19).
From ES and into ES : FR, IT, PT. Under development: DE, SV, RU, ZH. 2016: ES <-> AR
9. Resources needed to make each tool/service/process
work and/or to adapt it to a specific domain (LR & of
human manual/preparatory work)
PangeaMT is a machine translation environment and, as such, it can work with any language pair. A
sufficiently large and representative training corpus is required for statistical modelling (FastAlign,
KenLM, Moses) and model learning.
“Basic” or baseline models can be trained by default by linguists with sufficient knowledge of TM
cleaning. However, the platform allows big improvements on these models in several ways:
• Language-specific and domain-specific modular pre-processing and post-processing: linguistic
information, re-ordering, etc., help to normalize the text internally for better SMT model
learning. The platform includes some general pre-processing modules and others that are
language-dependent. New pre-processing modules can be easily added.
• Table combination. This is a Moses functionality that combines several segment tables. When we
do domain adaptation, a general table is used when a term is out of vocabulary or the quality
threshold is not reached.
• Monolingual data for language model enhancement.
• Hybridization. Language rules can be included in pre- or post-processing and users can opt for
hierarchical models (slower results). Work in progress: morphology-rich languages and Japanese with POS.
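The back-off idea behind table combination can be illustrated with a toy lookup. This is only a sketch of the principle, not how Moses implements multiple tables: the dictionary layout, the `lookup_with_backoff` name and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Toy table: source phrase -> (target phrase, score between 0 and 1).
Table = Dict[str, Tuple[str, float]]


def lookup_with_backoff(phrase: str, domain: Table, general: Table,
                        threshold: float = 0.5) -> Optional[str]:
    """Prefer the domain table; fall back to the general table when the
    phrase is out of vocabulary there or its score is below the threshold."""
    entry = domain.get(phrase)
    if entry is not None and entry[1] >= threshold:
        return entry[0]
    fallback = general.get(phrase)
    if fallback is not None:
        return fallback[0]
    # Neither table qualifies: keep a weak domain entry if one exists.
    return entry[0] if entry is not None else None


domain_table = {"brake pad": ("pastilla de freno", 0.9),
                "manual": ("manual", 0.2)}
general_table = {"manual": ("manual de instrucciones", 0.7),
                 "house": ("casa", 0.8)}

print(lookup_with_backoff("brake pad", domain_table, general_table))
print(lookup_with_backoff("manual", domain_table, general_table))
print(lookup_with_backoff("house", domain_table, general_table))
```

The first lookup is answered by the domain table, the second falls back because the domain score is too low, and the third falls back because the phrase is missing from the domain table.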
10. Resources needed to make each tool/service/process
work and/or to adapt it to a specific domain (LR & of
human manual/preparatory work)
• Minimum 5M words for meaningful domains (as reported by Sybase). Customization
time: 1 person, 1-2 weeks including data cleaning.
• General models can be trained with minimum human intervention. All users need to
do is upload the cleanest possible TMX bitexts. The system will automatically train
the model, applying general or language-specific pre-processing routines.
• Pre-processing routines can be adapted with use (by a computational
linguist/programmer).
• Specialized (or domain-specific) models can be trained and set with specific language
features: language / domain. These specific models can be combined with general
models to obtain wider coverage.
• Retrainings are always automated: Sybase, Sony, Honda, Subaru, Hioki…
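As an illustration of what uploading TMX bitexts involves, the sketch below reads sentence pairs out of a TMX string with Python's standard library. It is a minimal example, not PangeaMT's own importer; the function name `read_bitext` and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_bitext(tmx_string, src_lang, tgt_lang):
    """Return (source, target) pairs for every <tu> that has both languages."""
    root = ET.fromstring(tmx_string)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens the segment, inline-tag content included.
                segs[lang] = "".join(seg.itertext()).strip()
        src, tgt = segs.get(src_lang.lower()), segs.get(tgt_lang.lower())
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="EN-GB"><seg>Hello world.</seg></tuv>
<tuv xml:lang="FR-FR"><seg>Bonjour le monde.</seg></tuv></tu>
<tu><tuv xml:lang="EN-GB"><seg>No translation here.</seg></tuv></tu>
</body></tmx>"""

print(read_bitext(SAMPLE, "en-gb", "fr-fr"))
```

Units with only one language are skipped, since a segment without its counterpart cannot be used for training.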
11. Data? best clean, thank you
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid=“pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–
<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>
<tuv xml:lang=“EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang=“DE-DE">
<seg>Am 22. </seg>
12. More cleaning
Cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang=“ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
<tuv xml:lang=“JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅
力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道
すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
<tuv xml:lang=“EN-US">
<seg>It is a journalistic point of view and strengths of the English-
language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>
13. Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data.
Data cleaning modules
• Remove any “suspects”:
• Sentences that are too long
• Mismatches (of many kinds!)
• Terminological inaccuracies
• Non-useful segments, etc.
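A few of the “suspect” checks listed above can be sketched as a small filter. This is an illustrative example only, not the PangeaMT cleaning modules: the function name, the 80-word limit and the 2.0 length ratio are assumed values, and whitespace-based word counts do not apply to languages written without spaces, such as Japanese or Chinese.

```python
import re


def suspect_reasons(src, tgt, max_words=80, max_ratio=2.0):
    """Return the reasons a segment pair looks suspect (empty list = keep)."""
    src_words, tgt_words = src.split(), tgt.split()
    if not src_words or not tgt_words:
        return ["empty side"]
    reasons = []
    longest = max(len(src_words), len(tgt_words))
    shortest = min(len(src_words), len(tgt_words))
    if longest > max_words:
        reasons.append("too long")
    if longest / shortest > max_ratio:
        reasons.append("length ratio")
    # Compare digit groups so that 20.1 and 20,1 count as the same number.
    if sorted(re.findall(r"\d+", src)) != sorted(re.findall(r"\d+", tgt)):
        reasons.append("number mismatch")
    return reasons


print(suspect_reasons("On 22nd May we decided not to join the group.", "Am 22. "))
print(suspect_reasons("Width 20.1 cm.", "Anchura 20,1 cm."))
```

The truncated German segment from the earlier slide is flagged by the length ratio, while the pair that differs only in its decimal separator is kept.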
Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments or crawling.
Data Cleaning (in-lines)
Remove all non-translation data.
TMX Human approval
Some of this material may actually be OK for training. It is then input into the training set.
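Removing in-line, non-translation data can be sketched with a regular expression over the raw segment text. This is a simplified illustration rather than the PangeaMT routine: it assumes the TMX inline elements bpt, ept, it, ph and ut carry only formatting codes and are not nested, and the function name is hypothetical.

```python
import re

# TMX inline elements whose content is native formatting code, not text.
INLINE = re.compile(r"<(bpt|ept|it|ph|ut)\b[^>]*>.*?</\1>", re.DOTALL)


def strip_inlines(segment):
    """Drop inline formatting elements and tidy the remaining whitespace."""
    text = INLINE.sub("", segment)
    return re.sub(r"\s+", " ", text).strip()


raw = 'Overall height –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; width'
print(strip_inlines(raw))  # Overall height – 25"; width
```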
14. Programme Committee comment
Concentrate on the processes and the automation of
workflows and how well these have been validated and tested
in real production of MT systems.
We are interested in the "factory" concept, automating
training data domain selection, tuning and data cleaning.
19. Hybridization Experiences at Pangeanic: Rationale
- when the syntactic distance between languages is very large
(unrelated languages). Patterns are lost (or not found) in monotone translation.
[Diagram labels: Data; Linguistic Information; Language Knowledge; Output Translation]
20. SYNTAX-BASED (TREE) FOR HYBRID SMT
Syntax-based analysis & re-ordering rules
Tree depth: 10
Calc time +59% !!
21. TOSHIBA vs MECAB – LESSONS LEARNT
Mecab re-ordering produced higher BLEU than Toshiba’s
TWO OPTIONS
22. TOSHIBA vs MECAB – LESSONS LEARNT
Mecab re-ordering produced higher BLEU than Toshiba’s
Paper published December 2011 (AAMT): “Going Hybrid: Pangeanic’s and Toshiba’s
First Steps Toward ENJP MT Hybridation”
23. Installing PangeaMT
MySQL Server mysql-server 5.5
Open JDK openjdk-8-jdk 8
Apache Tomcat tomcat8 8
Moses
Giza++
Fast_align
KenLM
Tikal (Okapi)
IRSTLM
mkcls
mgiza
tercpp
multi-bleu.perl
mecab
timeout
python
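Before installing, it can help to check which of the command-line tools are already on the system. The sketch below uses only the Python standard library; the executable names passed in are assumptions for illustration, since the binaries behind the components listed above may be named differently on a given installation.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


# Hypothetical executable names; adjust to the actual installation.
print(missing_tools(["python", "mecab", "timeout", "mkcls", "mgiza"]))
```

An empty list means every tool was found; any name printed still needs to be installed or added to the PATH.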