PangeaMT
Manuel Herranz – Elia Yuste – Alex Helle – Andi Frank
User-Empowering, Data-Driven, In-Domain Machine Translation
#pangeanic E: central@pangea.com.mt pangeanic
AGENDA
• Industry reflections
• Pangeanic → PangeaMT
• Customization as Key Initial Servicing Step of our MT Offering
• All about the PangeaMT Platform
– Featuring Highlights and Demo
– API : CAT Environment Integration (Demo)
• Q&A Round
GALA Marketplace Offer
COST OF TRANSLATION (price per word) vs DEMAND, 10-YEAR STEPS
[Chart: price per word against demand, with time marks at 1995, 2005 and 2015]
• Is price per word a valid model?
• Is there an explanation?
• What can we do about it? Is there a future for the Language Industry?
• Is this unique to this industry?
MASSIVE AMOUNTS OF DATA – IS LANGUAGE BUSINESS MANAGEABLE?
[Chart: world's data (terabytes / exabytes) against typical translation volume, 1990 to 2015]
Why Machine Translation?
• As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes
• Estimates
– Up 50% a year (Oracle)
– Doubles every 11 hours (IBM)
• Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986 (ComputerWorld, 2011)
• Researchers at the University of California, Berkeley, found that the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
Why Machine Translation?
The Data Deluge
As Content Volume Explodes, Machine Translation Becomes an Inevitable Part of Global Content Strategy
http://ow.ly/jVuhZ
• In 2011, it took about two days for the world to create the same 5 exabytes of data that it took human eons to generate.
• In 2013, it took the world just 10 minutes to create 5 exabytes.
• Eric Schmidt: Every 2 Days We Create As Much Information As We Did Up To 2003 (TechCrunch, 2010)
The sixth power of 1,000 = 10^18
1 EB = 1,000,000,000,000,000,000 B = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes.
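The unit arithmetic on these two slides can be checked with a few lines of Python. This is only an illustrative sketch: the constant and function names are made up for the example, and it assumes decimal (SI) units, as the slide does.

```python
# Decimal (SI) storage units, as used on the slide: 1 EB = 1000**6 bytes.
GB = 1000 ** 3
PB = 1000 ** 5
EB = 1000 ** 6


def gigabytes_to_exabytes(gigabytes: int) -> float:
    """Convert a size in gigabytes to exabytes (decimal units)."""
    return gigabytes * GB / EB


# The May 2009 figure quoted earlier: 487 billion gigabytes.
may_2009_gb = 487_000_000_000
may_2009_bytes = may_2009_gb * GB

print(may_2009_bytes == 487 * 10 ** 18)    # True, i.e. 4.87 x 10^20 bytes
print(gigabytes_to_exabytes(may_2009_gb))  # 487.0
print(EB // PB)                            # 1000 petabytes in one exabyte
```

Read this way, the May 2009 estimate is 487 exabytes, which puts it on the same scale as the 5-exabyte and 295-exabyte figures quoted on the neighbouring slides.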
Where is data stored?
What can I do with MT?
Machine Translation application, NEW usage and success depend on domain (Sports, Politics, Social, etc.) and output format
• MT for assimilation: “gisting” or “understanding”
– Practically unlimited demand; but free web-based services reduce incentive to improve technology
– Coverage + important. Instant quality
• MT for dissemination: “publication”
– Publishable quality that can only be achieved by humans. MT & tools a productivity booster
• MT for direct communication
– Current R&D. Military uses systems for spoken MT; first applications for smartphones, online help, multilingual chat systems
Short history
• Pangeanic: LSP. Major clients in Asia, European localization, increasing number of languages
• Need to produce translation faster, cheaper…
• Experimenting with some RBMT systems
• TAUS & TDA founding members
• Partnering with Valencia's Computer Science Institute & Prof. F. Casacuberta / E. Vidal Research Team
• Commercial implementations of PangeaMT systems at client side: SONY EUROPE, SYBASE, LSPs…
Milestones
• EU Post-editing contract 2007 (RBMT output)
• Euromatrix mention
• AMTA 2010
• AAMT 2011/12 (JP Hybridization and MT DIY)
• 1st commercial platform 2010
• DIY 2011 (automated re-training cycles)
• SaaS Power, LocWorld Paris 2012
• Improved automated cleaning cycles
• Online automated training
• Regional EU R&D Funds (“Feder” x 3: 2009-2011) & Marie Curie EXPERT Project
Customization by the PangeaMT Team
Key to achieving better qualitative results later
• Top-notch human and automated service
• Focused on the Client from day one!
• Prior to 1st-time Engine Delivery → prior to Platform Deployment (production)
• Customization concentrates on data and best engine consultancy
• Data cleaning and enhancement
• The impact of glossaries (in-domain, client-/product-specific…)
• Reporting (your data was like this… now let’s do this)
• Training
• Pangeanic tests all the development features in-house at a TRANSLATION DEPARTMENT BEFORE RELEASE.
Getting the data right:
Automated cleaning and preparation
Don’t forget data cleaning!!!
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
</tuv>
</tu>
<tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid="pat">
<tuv xml:lang="EN-US">
<seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25&quot;; width – <bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg>
</tuv>
<tuv xml:lang="ES-EM">
<seg><bpt i="1">{f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1&quot;.<ept i="3">}</ept></seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
</tuv>
<tuv xml:lang="DE-DE">
<seg>Am 22. </seg>
</tuv>
</tu>
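The `<bpt>` / `<ept>` codes in the sample above are formatting residue, not translation. As a rough illustration of in-line cleaning, the sketch below pulls the plain text out of a `<seg>` with Python's standard `xml.etree.ElementTree`. The function name and the set of code-carrying tags are assumptions made for this example; it is not a description of how PangeaMT itself does it.

```python
import xml.etree.ElementTree as ET

# TMX inline elements that carry native formatting codes rather than text.
# Assumption for this sketch: anything else (e.g. <hi>) wraps real text and is kept.
CODE_TAGS = {"bpt", "ept", "it", "ph", "ut"}


def plain_text(element: ET.Element) -> str:
    """Return the text of a <seg>, dropping the content of inline code elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in CODE_TAGS:
            parts.append(plain_text(child))
        parts.append(child.tail or "")
    # Collapse the whitespace left behind by the removed codes.
    return " ".join("".join(parts).split())


seg = ET.fromstring(
    '<seg>Overall height - <bpt i="1">{f43 </bpt><ept i="1">}</ept>25&quot;; '
    'width - <bpt i="2">{f43 </bpt><ept i="2">}</ept>20.1&quot;.</seg>'
)
print(plain_text(seg))  # Overall height - 25"; width - 20.1".
```

Keeping the tail text after each removed element is what stops the sentence from losing words that sit between two codes.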
Don’t forget data cleaning!!!
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang="ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
</tu>
<tu>
<tuv xml:lang="JP">
<seg>同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
</tuv>
<tuv xml:lang="EN-US">
<seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today's translation industry.</seg>
</tuv>
</tu>
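The `xml:lang` values in these samples are not always standard language tags: `JP`, for instance, is a country code, while the ISO 639-1 language code for Japanese is `ja`. One simple check a cleaning step could run is sketched below. The allow-list and function names are hypothetical and would be set per project.

```python
# Hypothetical allow-list: the language subtags a given project expects to see.
EXPECTED_LANGUAGES = {"en", "es", "fr", "de", "it", "ja"}


def language_subtag(xml_lang: str) -> str:
    """Return the lower-cased primary language subtag of an xml:lang value."""
    return xml_lang.replace("_", "-").split("-")[0].lower()


def unexpected_languages(xml_lang_values):
    """Return the sorted xml:lang values whose language subtag is not expected."""
    return sorted(
        {value for value in xml_lang_values
         if language_subtag(value) not in EXPECTED_LANGUAGES}
    )


print(unexpected_languages(["EN-GB", "ES-ES", "JP", "EN-US"]))  # ['JP']
```

A check like this only looks at the language part of the tag; validating region subtags would need a lookup against the relevant registry.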
DATA CLEANING CYCLE (AUTOMATED)
• Parallel text extraction / Translation input / Post-edited material: this often comes from CAT tools, document alignments or crawling.
• Data Cleaning (in-lines): remove all non-translation data.
• Data cleaning modules: remove any “suspects”:
– Sentences that are too long
– Mismatches (of many kinds!)
– Terminological inaccuracies
– Non-useful segments, etc.
• TMX Human approval: some of this material may actually be OK for training. It is then input in the training set.
• Engine training with clean data: having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data.
(Diagram arrows: “Cleaning”, “More cleaning”)
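A minimal sketch of how "suspect" segments might be separated from the training set, covering two of the checks named on the slide (over-long sentences and source/target mismatches) plus empty segments. The thresholds, class and function names are assumptions for illustration. Whitespace word counts are used for simplicity, which would not suit languages written without spaces, such as Japanese.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentPair:
    source: str
    target: str


def suspect_reason(pair: SegmentPair, max_words: int = 80,
                   max_ratio: float = 2.0) -> Optional[str]:
    """Return why a pair looks suspect, or None if it passes every check."""
    source_words = pair.source.split()
    target_words = pair.target.split()
    if not source_words or not target_words:
        return "empty segment"
    longest = max(len(source_words), len(target_words))
    shortest = min(len(source_words), len(target_words))
    if longest > max_words:
        return "too long"
    if longest / shortest > max_ratio:
        return "length mismatch"
    return None


def split_corpus(pairs, **thresholds):
    """Separate pairs into a clean set and a list of (pair, reason) suspects."""
    clean, suspects = [], []
    for pair in pairs:
        reason = suspect_reason(pair, **thresholds)
        if reason is None:
            clean.append(pair)
        else:
            suspects.append((pair, reason))
    return clean, suspects


corpus = [
    SegmentPair("Overall height: 25 inches.", "Altura total: 25 pulgadas."),
    SegmentPair("On 22nd May we decided not to join the group.", "Am 22."),
]
clean, suspects = split_corpus(corpus)
print(len(clean), [reason for _, reason in suspects])  # 1 ['length mismatch']
```

Returning the suspects with a reason, instead of discarding them, matches the human-approval step on the slide: a reviewer can return acceptable segments to the training set.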
A Success Story

Sony Professional Europe, Salomé Lopez-Lavado
Needs
- Improve publication: French, Italian, Spanish
- 8M words training set
- Time-to-market: from 3 days down to 1.5 days (html, InDesign)
- Outsourcing cost: -20%
- Volume: 1.5M words/year

Japanese Automotive manufacturer
- Spanish
- 8M words/year
- Time to market reduced by 2-3 weeks, from 8 to 6 or 5 weeks
- Team of 17 freelancers down to 4-7 post-editors
- Outsourcing cost: -30%

Spanish LSP working for banking sector
- Spanish
- 1-2M words/year
- Time to market: 1 week to 2 days!!!!
- Docx, html, tmx
- Down from 2-3 in-house staff and 2-3 freelancers to 2 in-house!!!

http://ow.ly/peuFD
Successfully applied (3rd-party applications/beneficiaries)
Use Case
✔ Even with small data sets!!
Potential Uses of Machine Translation
• PangeaMT can be self-hosted when data security is critical (all processes internal to the organization)
- commercially sensitive data,
- financial, legal, institutional,
- intelligence, knowledge-gathering,
- product pre-release, etc.
• Control Panel + full system statistics
• Re-trainings and updates by the client for data privacy / more accuracy
Other Potential Uses of Machine Translation
• Information discovery: patent, unknown documents
• Automatic, on-demand creation of foreign language versions / web apps – keyword testing
• Multilingual crawling, data discovery
• Pre-translation
Polling Questions to Audience
Platform overview
• 24/7 control over your data and engines
• secure, robust and scalable
• user focused (permissions and empowering capabilities)
• API linked, if need be
• enabled us to offer an extraordinarily flexible business model
- SaaS
- SaaS Power (online DIY, re-trainings included)
- Full Power (PLATFORM OWNERSHIP)
PangeaMT System – Domain Creation
PangeaMT System – Data Cleaning
PangeaMT System – Engine Creation
PangeaMT System – Engine Training
PangeaMT API – SDL Plugin
Demo Time
(Video file)
Myth: MT will never be as good as humans
“We cannot solve the problem using the same tools and the way of thinking that created it” A. Einstein
uhmmm, it is going to get really good...
1st stage: We are creating usable engines, first PE experiences. 2009-2015 or 2020
2nd stage: PE material and more data make engines even more predictable. More specialist engines
3rd stage: Beyond 2030... no predictions
GALA Marketplace Offer
central@pangea.com.mt
Free Consultancy and Custom Engine Piloting Period
October-November 2013
Q&A
Thank you!!
central@pangea.com.mt


Gala Webminar September 2013

  • 1. PangeaMT Manuel Herranz – Elia Yuste – Alex Helle – Andi Frank User-Empowering Data-Driven, In-Domain Machine Translation #pangeanic E: central@pangea.com.mtpangeanic
  • 2. AGENDA • Industry reflections • Pangeanic  PangeaMT • Customization as Key Initial Servicing Step of our MT Offering • All about the PangeaMT Platform – Featuring Highlights and Demo – API : CAT Environment Integration (Demo) • Q&A Round GALA Marketplace Offer
  • 3. ´1 ´2 1.This is an example text. Go ahead and replace it with your own text. 2.This is an example text. Go ahead and replace it with your own text. 19951995 20052005 20152015 3.This is an example text. Go ahead and replace it with your own text. 4.This is an example text. Go ahead and replace it with your own text. COST OF TRANSLATION (price/w) vs DEMAND 10-YEAR STEPS DEMAND • Price per word a valid model? • Is there an explanation? • What can we do about it? Is there a future for the Language Industry? • Unique to this industry?
  • 4. MASSIVE AMOUNTS OF DATA – IS LANGUAGE BUSINESS MANAGEABLE? World’s data in Tb / Exa TypicalTranslationVlume 1990 1995 2000 2005 2010 2015
  • 5. Why Machine Translation?  As of May 2009: 487 Billion gigabytes or 1,000,000,000 * 487,000,000,000 = 4,87 x 1020  Estimates  Up 50% a year (Oracle)  Doubles every 11 hours (IBM)  Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986 ComputerWorld - 2011  Researchers at the University of California, Berkeley, that found the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
  • 6. Why Machine Translation? The Data Deluge As Content Volume Explodes, Machine Translation Becomes an Inevitable Part of Global Content Strategy http://ow.ly/jVuhZ  In 2011, it took about two days for the world to create the same 5 exabytes of data that it took human eons to generate.  In 2013, it took the world just 10 minutes to create 5 exabytes.  Eric Schmidt: Every 2 Days We Create As Much Information As We Did Up To 2003 TechCrunch, 2010 The sixth power of 1,000 = 1018 1 EB = 1000000000000000000B = 1018 bytes = 1000petabytes = 1 billion gigabytes.
  • 7. Where is data stored?
  • 8. What can I do with MT? Machine Translation application, NEW usage and success depend on  MT for assimilation: “gisting” or “understanding“ Sports Politics Social etc Output format • Practically unlimited demand; but free web-based services reduce incentive to improve technology • Coverage + important. Instant quality  MT for dissemination: “publication“  MT for direct communication Output format Sports Politics Social etc • Publishable quality that can only be achieved by humans. MT & tools a productivity booster Output format Output format Sports Politics Social etc • Current R&D, Military uses systems for spoken MT, first applications for smartphones, online help, multilingual chat systems Output format Output format
  • 9. 9 Short history  Pangeanic: LSP. Major clients in Asia, European localization, increasing number of languages  Need to produce translation faster, cheaper…  Experimenting with some RB MT systems  TAUS & TDA founding members  Partnering with Valencia's Computer Science Institute & Prof. F. Casacuberta / E. Vidal Research Team  Commercial implementations of PangeaMT systems at client side: SONY EUROPE, SYBASE, LSPs….
  • 10. 10 Milestones  EU Post-editing contract 2007 (RBMT output)  Euromatrix mention  AMTA 2010  AAMT 2011/12 (JP Hybridization and MT DIY)  1st commercial platform 2010  DIY 2011 (automated re-training cycles)  SaaS Power, LocWorld Paris 2012  Improved automated cleaning cycles,  Online automated training  Regional EU R&D Funds (“Feder” x 3: 2009-2011) & Marie Curie EXPERT Project
  • 11. Customization by the PangeaMT Team Key to achieve better qualitative results later • Top-notch human and automated service • Focused on the Client from day one! • Prior to 1st-time Engine Delivery  prior to Platform Deployment (production) • Customization concentrates on data and best engine consultancy • Data cleaning and enhancement • The impact of glossaries (in-domain, client-/product- specific…) • Reporting (your data was like this…..now let’s do this) • Training  Pangeanic tests all the development features in-house at a TRANSLATION DEPARTMENT BEFORE RELEASE.
  • 12. Getting the data right: Automated cleaning and preparation
  • 13. Don’t forget data cleaning!!! <tu srclang="en-GB"> <tuv xml:lang="EN-GB"> <seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg> </tuv> <tuv xml:lang="FR-FR"> <seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg> </tuv> <tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat"> <tuv xml:lang="EN-US"> <seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25&quot;; width – <bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg> </tuv> <tuv xml:lang="ES-EM"> <seg><bpt i="1">{f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1&quot;.<ept i="3">}</ept></seg> </tuv> </tu> <tuv xml:lang=“EN-US"> <seg>On 22nd May we decided not to join the group.</seg> <tuv xml:lang=“DE-DE"> <seg>Am 22. </seg> More cleaning Cleaning
  • 14. Don’t forget data cleaning!!! <tu srclang="en-GB"> <tuv xml:lang="EN-GB"> <seg>The President of the United States visited Costa Rica.</seg> </tuv> <tuv xml:lang=“ES-ES"> <seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg> </tuv> <tuv xml:lang=“JP"> <seg> 同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。 英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅 力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道 すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg> <tuv xml:lang=“EN-US"> <seg>It is a journalistic point of view and strengths of the English- language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today'stranslation industry. </seg> More cleaning Cleaning
  • 15. More cleaning Cleaning Engine training with clean data Having approved, terminologically sound, clean data improves engine accuracy and performance with even small sets of data. Data cleaning modules •Remove any “suspects”: •Sentences that are too long •Mismatches (of many kinds!) •Terminological inaccuracies •Non-useful segments, etc Parallel text extraction / Translation input / Post-edited material This is often comes from CAT tools or document alignments, crawling Data Cleaning (in-lines) Remove all non-translation data. TMX Human approval Some of this material may actually be OK for training. It is then input in the training set. DATA CLEANING CYCLE (AUTOMATED)DATA CLEANING CYCLE (AUTOMATED)
  • 16. A Success Story Sony Professional Europe, Salomé Lopez-Lavado Needs -Improve publication French, Italian, Spanish -8M words training set -time-to-market: from 3 days down to 1,5 days: html, InDesign, -Outsourcing cost: -20% -Volume: 1,5M words/year Japanese Automotive manufacturer -Spanish -8M words/year -Time to market reduced by 2 week – 3 weeks from 8 to 6 or 5 weeks -Team of 17 freelancers down to 4-7 post-editors -Outsourcing cost: -30% Spanish LSP working for banking sector -Spanish -1-2M words/year -Time to market: 1- week to 2 days!!!! -Docx, html, tmx -Down from 2-3 in- house staff and 2-3 freelancers to 2 in- house!!! http://ow.ly/peuFD Successfully applied (3d- party applications/ beneficiaries)
  • 17. Use Case - ✔ Even with small data sets!!
  • 18. • PangeaMT can be self-hosted when data security is critical (all processes internal to the organization) - commercially sensitive data, - financial, legal, institutional, - intelligence, knowledge-gathering, - product pre-release, etc • Control Panel + full system statistics • Re-trainings and updates by the client for data privacy / more accuracy Potential Uses of Machine Translation
  • 19. • Information discovery: patents, unknown documents • Automatic, on-demand creation of foreign language versions / web apps – keyword testing • Multilingual crawling, data discovery • Pre-translation Other Potential Uses of Machine Translation
  • 21. Platform overview • 24/7 control over your data and engines • secure, robust and scalable • user focused (permissions and empowering capabilities) • API linked, if need be • enabled us to offer an extraordinarily flexible business model - SaaS - SaaS Power (online DIY, re-trainings included) - Full Power (PLATFORM OWNERSHIP)
  • 22. PangeaMT System – Domain Creation
  • 23. PangeaMT System – Data Cleaning
  • 24. PangeaMT System – Engine Creation
  • 25. PangeaMT System – Engine Training
  • 26. PangeaMT API – SDL Plugin Demo Time (Video file)
  • 27. Myth: MT will never be as good as humans. “We cannot solve the problem using the same tools and the way of thinking that created it” (A. Einstein). Uhmmm, it is going to get really good... 1st stage (2009-2015 or 2020): we are creating usable engines, first PE experiences. 2nd stage: PE material and more data make engines even more predictable; more specialist engines. 3rd stage (beyond 2030...): no predictions.
  • 28. GALA Marketplace Offer central@pangea.com.mt Free Consultancy and Custom Engine Piloting Period October-November 2013

Editor's notes

  1. * If any of these problems caused a delay in the schedule or need to be analyzed in depth, put the details on the next slide. PangeaMT
  5. Technology tools developed by the industry for the industry. Very “applied” “practical” philosophy