Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste and Andi Frank showcasing the best of automated cleaning cycles, automated engine retraining and machine translation engine creation.
GALA Webinar, September 2013
1. PangeaMT
Manuel Herranz – Elia Yuste – Alex Helle – Andi Frank
User-Empowering
Data-Driven, In-Domain
Machine Translation
#pangeanic E: central@pangea.com.mt pangeanic
2. AGENDA
• Industry reflections
• Pangeanic PangeaMT
• Customization as Key Initial Servicing Step of our MT Offering
• All about the PangeaMT Platform
– Featuring Highlights and Demo
– API: CAT Environment Integration (Demo)
• Q&A Round
GALA Marketplace Offer
3. COST OF TRANSLATION (price/w) vs DEMAND
[Chart: cost of translation (price per word) vs demand, in 10-year steps: 1995, 2005, 2015]
• Price per word a valid model?
• Is there an explanation?
• What can we do about it? Is there a future for the Language Industry?
• Unique to this industry?
4. MASSIVE AMOUNTS OF DATA – IS LANGUAGE BUSINESS MANAGEABLE?
[Chart: world’s data (TB / EB) vs typical translation volume, 1990-2015]
5. Why Machine Translation?
As of May 2009: 487 billion gigabytes, or
1,000,000,000 * 487,000,000,000 = 4.87 x 10^20 bytes
Estimates
Up 50% a year (Oracle)
Doubles every 11 hours (IBM)
Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986
ComputerWorld - 2011
Researchers at the University of California, Berkeley, found that the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
6. Why Machine Translation?
The Data Deluge
As Content Volume Explodes,
Machine Translation Becomes an
Inevitable Part of Global Content
Strategy
http://ow.ly/jVuhZ
In 2011, it took about two days for the
world to create the same 5 exabytes of
data that it took human eons to
generate.
In 2013, it took the world just 10
minutes to create 5 exabytes.
Eric Schmidt: Every 2 Days We Create
As Much Information As We Did Up To
2003
TechCrunch, 2010
The sixth power of 1,000 = 10^18
1 EB = 1,000,000,000,000,000,000 B = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes.
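The unit arithmetic on these slides can be checked with a few lines of Python. This is only a worked example: the constant names are illustrative, and decimal (SI) units are assumed, as in the exabyte definition above.

```python
# Decimal (SI) byte units, matching the slide: 1 EB = 1000**6 bytes.
GB = 1000 ** 3
PB = 1000 ** 5
EB = 1000 ** 6

# The May 2009 estimate quoted earlier: 487 billion gigabytes, in bytes.
digital_universe_2009 = 487 * 10**9 * GB

print(EB == 10**18)                 # 1 EB is 10^18 bytes
print(EB // PB)                     # petabytes per exabyte
print(EB // GB)                     # gigabytes per exabyte
print(digital_universe_2009 // EB)  # the 2009 estimate in exabytes
```

In these units the 2009 estimate is 4.87 x 10^20 bytes, that is, 487 exabytes.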
8. What can I do with MT?
Machine Translation application, NEW usage and success depend on the purpose, the domain (Sports, Politics, Social etc) and the output format:
• MT for assimilation: “gisting” or “understanding”
– Practically unlimited demand; but free web-based services reduce incentive to improve technology
– Coverage more important. Instant quality
• MT for dissemination: “publication”
– Publishable quality that can only be achieved by humans. MT & tools a productivity booster
• MT for direct communication
– Current R&D, Military uses systems for spoken MT, first applications for smartphones, online help, multilingual chat systems
9. Short history
Pangeanic: LSP. Major clients in Asia, European localization, increasing number of languages
Need to produce translation faster, cheaper…
Experimenting with some RBMT systems
TAUS & TDA founding members
Partnering with Valencia's Computer Science Institute & Prof. F. Casacuberta / E. Vidal Research Team
Commercial implementations of PangeaMT systems at client side: SONY EUROPE, SYBASE, LSPs…
10. Milestones
EU Post-editing contract 2007 (RBMT output)
Euromatrix mention
AMTA 2010
AAMT 2011/12 (JP Hybridization and MT DIY)
1st commercial platform 2010
DIY 2011 (automated re-training cycles)
SaaS Power, LocWorld Paris 2012
Improved automated cleaning cycles, online automated training
Regional EU R&D Funds (“Feder” x 3: 2009-2011) & Marie Curie EXPERT Project
11. Customization by the PangeaMT Team
Key to achieving better-quality results later
• Top-notch human and automated service
• Focused on the Client from day one!
• Prior to 1st-time Engine Delivery and prior to Platform Deployment (production)
• Customization concentrates on data and best engine consultancy
• Data cleaning and enhancement
• The impact of glossaries (in-domain, client-/product-specific…)
• Reporting (your data was like this… now let’s do this)
• Training
Pangeanic tests all development features in-house, in its own TRANSLATION DEPARTMENT, BEFORE RELEASE.
13. Don’t forget data cleaning!!!
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that it
does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
<tu creationdate="20090817T114430Z" creationid="APIACCESS"
changedate="20110617T141159Z" changeid=“pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25"; width –
<bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1".</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{f2 </bpt>Altura total - 25"; anchura <ept i="1">}</ept>–<bpt
i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1".<ept
i="3">}</ept></seg>
</tuv>
</tu>
<tuv xml:lang=“EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang=“DE-DE">
<seg>Am 22. </seg>
More cleaning
Cleaning
14. Don’t forget data cleaning!!!
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang=“ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
<tuv xml:lang=“JP">
<seg> 同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅
力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道
すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg>
<tuv xml:lang=“EN-US">
<seg>It is a journalistic point of view and strengths of the English-
language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>
More cleaning
Cleaning
15. More cleaning
Cleaning
Engine training with clean data
Having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data.
Data cleaning modules
• Remove any “suspects”:
– Sentences that are too long
– Mismatches (of many kinds!)
– Terminological inaccuracies
– Non-useful segments, etc
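The “suspect” checks listed above can be sketched in Python as follows. This is a minimal illustration and not PangeaMT's actual modules: the `Pair` class, the `suspect_reasons` function, the thresholds and the glossary format are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One aligned source/target segment, e.g. read from a TMX <tu>."""
    source: str
    target: str


# Hypothetical thresholds, chosen only for illustration.
MAX_WORDS = 80
MAX_LENGTH_RATIO = 3.0


def suspect_reasons(pair, glossary=None):
    """Return the reasons a segment pair looks suspect (empty list if clean)."""
    src, tgt = pair.source.strip(), pair.target.strip()
    if not src or not tgt:
        return ["empty segment"]  # non-useful segment
    reasons = []
    if len(src.split()) > MAX_WORDS or len(tgt.split()) > MAX_WORDS:
        reasons.append("too long")
    # Crude mismatch test: character lengths differ by a large factor.
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > MAX_LENGTH_RATIO:
        reasons.append("length mismatch")
    if src == tgt:
        reasons.append("untranslated")
    # Terminology check: a glossary source term must appear as its target term.
    for term, expected in (glossary or {}).items():
        if term.lower() in src.lower() and expected.lower() not in tgt.lower():
            reasons.append(f"glossary term not rendered: {term}")
    return reasons


truncated = Pair("On 22nd May we decided not to join the group.", "Am 22. ")
print(suspect_reasons(truncated))
```

A character-length ratio is a rough signal and would need per-language tuning (for example for Japanese), so flagged pairs are better routed to human approval than silently discarded.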
Parallel text extraction / Translation input / Post-edited material
This often comes from CAT tools, document alignments or crawling.
Data Cleaning (in-lines)
Remove all non-translation data.
TMX Human approval
Some of this material may actually be OK for training. It is then added to the training set.
DATA CLEANING CYCLE (AUTOMATED)
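The “Data Cleaning (in-lines)” step, removing non-translation data such as formatting codes from a segment, might look like the sketch below. It assumes well-formed TMX `<seg>` elements and treats the TMX inline elements `bpt`, `ept`, `it`, `ph` and `ut` as carriers of native formatting codes. The function name `seg_plain_text` is hypothetical, and this is not necessarily how the PangeaMT platform does it.

```python
import xml.etree.ElementTree as ET

# TMX inline elements that wrap native formatting codes, not translatable text.
CODE_TAGS = {"bpt", "ept", "it", "ph", "ut"}


def seg_plain_text(seg_xml):
    """Return the text of a TMX <seg> with in-line formatting codes removed."""
    seg = ET.fromstring(seg_xml)
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in CODE_TAGS:
            # Elements such as <hi> wrap real text, so keep their content.
            parts.append("".join(child.itertext()))
        # Text that follows the inline element always belongs to the segment.
        parts.append(child.tail or "")
    # Collapse the whitespace left behind by the removed codes.
    return " ".join("".join(parts).split())


print(seg_plain_text('<seg>Height <bpt i="1">{f43 </bpt>25<ept i="1">}</ept> cm</seg>'))
```

Real-world exports are often not well-formed (curly quotes in attributes, unclosed elements), so a production cleaner would need a more tolerant parser or a repair pass before this step.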
16. A Success Story
Sony Professional Europe, Salomé Lopez-Lavado
Needs
- Improve publication: French, Italian, Spanish
- 8M words training set
- Time to market: from 3 days down to 1.5 days (html, InDesign)
- Outsourcing cost: -20%
- Volume: 1.5M words/year
Japanese automotive manufacturer
- Spanish
- 8M words/year
- Time to market reduced by 2-3 weeks: from 8 weeks to 6 or 5 weeks
- Team of 17 freelancers down to 4-7 post-editors
- Outsourcing cost: -30%
Spanish LSP working for banking sector
- Spanish
- 1-2M words/year
- Time to market: from 1 week to 2 days!
- Docx, html, tmx
- Down from 2-3 in-house staff and 2-3 freelancers to 2 in-house!
http://ow.ly/peuFD
Successfully applied (3rd-party applications/beneficiaries)
18. Potential Uses of Machine Translation
• PangeaMT can be self-hosted when data security is critical (all processes internal to the organization)
- commercially sensitive data,
- financial, legal, institutional,
- intelligence, knowledge-gathering,
- product pre-release, etc
• Control Panel + full system statistics
• Re-trainings and updates by the client for data privacy / more accuracy
19. Other Potential Uses of Machine Translation
• Information discovery: patents, unknown documents
• Automatic, on-demand creation of foreign language versions / web apps – keyword testing
• Multilingual crawling, data discovery
• Pre-translation
21. Platform overview
• 24/7 control over your data and engines
• secure, robust and scalable
• user focused (permissions and empowering capabilities)
• API linked, if need be
• enabled us to offer an extraordinarily flexible business model
- SaaS
- SaaS Power (online DIY, re-trainings included)
- Full Power (PLATFORM OWNERSHIP)
27. Myth: MT will never be as good as humans
“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein
uhmmm, it is going to get really good...
1st stage: We are creating usable engines, first PE experiences 2009-2015 or 2020
2nd stage: PE material and more data make engines even more predictable. More specialist engines
3rd stage: Beyond 2030... no predictions
Technology tools developed by the industry for the industry. A very “applied”, “practical” philosophy.