PRESENTATION
AMTA
ALEX YANISHEVSKY
Welocalize
November 2015
How Much Cake is Enough: The Case for Domain-Specific Engines
What is the Tipping Point?
How Much Cake Is Too Much Cake?
I LOVE CAKE…
TOO MUCH CAKE! ugh!
AGENDA
• How many engines
• How to split domains
• How to measure success
• How to improve
HOW MANY ENGINES: CRITERIA
• Environment: elegant deployment?
• Cost
• How different are they from each other?
• Maintenance: engineering + linguistic feedback implementation
HOW TO SPLIT DOMAINS: CRITERIA
• Content owner feedback
• Historical experience based on business unit or portfolio
• Naming convention
• Style analysis: difference in characteristics based on lexical diversity,
sentence length + syntactic complexity
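The style-analysis criterion can be made concrete with a small sketch. It covers only two of the characteristics named above, lexical diversity (as a type-token ratio) and mean sentence length; syntactic complexity would need a parser and is left out. The function name `style_profile`, the naive sentence splitter, and the two sample texts are assumptions made for illustration, not part of the presented tooling.

```python
import re

def style_profile(text):
    """Return simple style statistics for a block of text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {"type_token_ratio": 0.0, "mean_sentence_length": 0.0}
    return {
        # Lexical diversity: distinct word forms divided by total words.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Average number of word tokens per sentence.
        "mean_sentence_length": len(tokens) / len(sentences),
    }

# Hypothetical sample texts, invented for this sketch.
legal = ("The licensee shall not assign this agreement. "
         "The licensee shall not sublicense the software.")
ui = "Click Save. Click Cancel."

print(style_profile(legal))
print(style_profile(ui))
```

Large gaps between two content sets on these measures would be one signal, among the others listed, that they may belong in separate domains.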
HOW TO SPLIT DOMAINS: TOOLS
HOLISTIC APPROACH BASED ON SEVERAL TOOLS:
• Build domain-specific language models (LMs) + select translation units (TUs)
for each domain by perplexity (PPL)
• Source Content Profiler (SCP) – helps identify the domain based on language
models, as well as other stylistic characteristics
• Style Scorer – a higher score indicates a better match to the style
established by the client’s documents
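Selecting TUs by perplexity can be sketched as follows. This is a toy illustration only: it uses a unigram model with add-one smoothing, whereas production language models are typically higher-order n-gram models built with a dedicated toolkit. The function names and the tiny training sentences are hypothetical.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count word frequencies for one domain; returns (counts, total)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return counts, sum(counts.values())

def perplexity(model, sentence):
    """Per-word perplexity of a sentence under an add-one smoothed unigram LM."""
    counts, total = model
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    words = sentence.lower().split()
    if not words:
        raise ValueError("cannot score an empty sentence")
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

def best_domain(models, sentence):
    """Return the domain whose LM gives the sentence the lowest perplexity."""
    return min(models, key=lambda d: perplexity(models[d], sentence))

# Hypothetical training data, invented for this sketch.
models = {
    "legal": train_unigram(["the licensee shall not assign this agreement",
                            "this agreement shall terminate"]),
    "support": train_unigram(["restart the device and try again",
                              "click the reset button"]),
}
print(best_domain(models, "the agreement shall terminate"))
print(best_domain(models, "click the button"))
```

Lower perplexity means the sentence is less surprising to that domain's model, which is why the selection takes the minimum.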
TOOLS: PERPLEXITY EVALUATOR
TU LEVEL
<tu srclang="EN-US" tuid="75438">
  <prop type="x-ppl:train2">208</prop>
  <prop type="x-ppl:techdoc6">191.025</prop>
  <prop type="x-ppl:support2">325.983</prop>
  <prop type="x-ppl:sales1">97.0736</prop>
  <prop type="x-ppl:productLoc1">396.398</prop>
  <prop type="x-ppl:legal1">617.876</prop>
  <tuv xml:lang="EN-US">
    <seg>Consistent feature set across multiple platforms (Windows, Mac, iOS, Android).</seg>
  </tuv>
  <tuv changedate="20140325T122530Z" changeid="serviceaaa" creationdate="20140325T122530Z" creationid="serviceaaa" lastusagedate="20140325T122530Z" usagecount="0" xml:lang="ES-XL">
    <prop type="x-ALS:Context">TEXT</prop>
    <prop type="x-ALS:Source File">DATATC39720SRCEN-USco-02__battle-card_enco-02__battle-card_en.inx</prop>
    <seg>Conjunto de características coherente en varias plataformas (Windows, Mac, iOS, Android)</seg>
  </tuv>
</tu>
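Reading the per-domain perplexity values back out of such a TU might look like the sketch below. It parses a trimmed copy of the TU above with Python's standard `xml.etree.ElementTree` and picks the language model with the lowest perplexity. The helper name `ppl_scores` is hypothetical, and treating the suffix after `x-ppl:` as a language-model identifier is an assumption drawn from the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the TU shown above (target-language tuv omitted).
TU_XML = """<tu srclang="EN-US" tuid="75438">
  <prop type="x-ppl:train2">208</prop>
  <prop type="x-ppl:techdoc6">191.025</prop>
  <prop type="x-ppl:support2">325.983</prop>
  <prop type="x-ppl:sales1">97.0736</prop>
  <prop type="x-ppl:productLoc1">396.398</prop>
  <prop type="x-ppl:legal1">617.876</prop>
  <tuv xml:lang="EN-US"><seg>Consistent feature set across multiple platforms (Windows, Mac, iOS, Android).</seg></tuv>
</tu>"""

PPL_PREFIX = "x-ppl:"

def ppl_scores(tu):
    """Map LM identifier -> perplexity from the TU-level x-ppl props."""
    scores = {}
    # findall("prop") matches direct children only, so props nested
    # inside <tuv> elements are ignored.
    for prop in tu.findall("prop"):
        prop_type = prop.get("type", "")
        if prop_type.startswith(PPL_PREFIX):
            scores[prop_type[len(PPL_PREFIX):]] = float(prop.text)
    return scores

tu = ET.fromstring(TU_XML)
scores = ppl_scores(tu)
closest = min(scores, key=scores.get)
print(closest, scores[closest])  # prints: sales1 97.0736
```

For this TU the lowest value belongs to `sales1`, which is consistent with the segment coming from a sales battle card.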
TOOLS: SOURCE CONTENT PROFILER
IN CONJUNCTION WITH DCU/ADAPT
TOOLS: STYLE SCORER
COMBINES PPL RATIOS, DISSIMILARITY SCORE + CLASSIFICATION SCORE
TEST CATEGORY    TRAINING CATEGORY    SCORE
SUPPORT          TECH DOC             3.16
TECH DOC         TECH DOC             2.94
TECH DOC         LEGAL                0.02
WHY USE STYLE SCORER?
• Identify similarity of source document to “gold standard” documents from that
domain and other domains
• Identify similarity of target document to “gold standard” documents from that
domain and other domains
• Example: Is this really a support document? To what degree is it similar to
other support documents, tech doc documents, etc.?
• Dissimilarity can point to worse quality for raw MT and/or reduced post-editing
productivity
STYLE SCORER + SCP
• SCP helps classify a document
• Style Scorer tells you how good a match a document is to a profile
• SCP only works on English source
• Style Scorer works on English source + non-English target
Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them
In the Land of Mordor where the Shadows lie.
CASE STUDY: ONE DOMAIN?
One ring to rule them all
CASE STUDY: HOW MANY DOMAINS?
• Started with 6 domains: Technical Documentation, Legal, Support, Training,
Product UI, Sales/Marketing
• Found that Technical Documentation, Support + Training were very similar,
based on LM scores against each other, Style Scorer results, sentence length +
grammatical structures
• Found that Product UI was close enough to the 3 domains above that a separate
engine was not warranted
• Found that Legal + Sales/Marketing were different enough from the domains
above and from each other, based on LM scores against each other, Style Scorer
results + sentence length
CASE STUDY: GATHERING ASSETS
• TMs: old, somewhat recent + current
• Termbases in MultiTerm
• Existing user dictionaries + normalization dictionaries
• New user dictionaries based on term extractions + auto-import for some
languages
CASE STUDY: CURATING ASSETS
• Cleaned TMs
• Introduced metadata for PPL into TUs based on LM perplexity
• Kept the UDs + normalization dictionaries as is
• Additional term extraction for weak languages or languages with
insufficient assets
CASE STUDY: ENGINE ITERATIONS
Based on options in Systran:
• RBMT only
• Hybrid with Stemming, LM Order, Distortion, etc.
• SMT only
HOW TO MEASURE SUCCESS
• Automatic scores
• Human evaluations
• Some reduction in linguistic issues
Forthcoming
• Informed PE Distance
• Productivity
• Systematic reduction of linguistic issues
CASE STUDY: AUTOMATIC SCORES FOR GENERAL/TECH DOC
CASE STUDY: HUMAN EVALUATIONS FOR GENERAL/TECH DOC
CASE STUDY: AUTOMATIC SCORES – SALES/MARKETING1
CASE STUDY: AUTOMATIC SCORES – SALES/MARKETING2
CASE STUDY: AUTOMATIC SCORES – LEGAL1
CASE STUDY: AUTOMATIC SCORES – LEGAL2
CASE STUDY: AUTOMATIC SCORES – TECH DOC
HOW TO IMPROVE
Scoring
• Linguistically-informed automatic scores: PE distance with different
weights for POS
• Productivity
Engine
• Eradicate high-frequency inconsistencies between TMs, Termbases +
User Dictionaries (UDs)
• Create domain-specific UDs
• Send best reply: TMT Prime, send best translation irrespective of domain
Source
• Pre-MT source check: was this content properly categorized?
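The idea under Scoring of a PE distance with different weights per part of speech can be sketched as a weighted edit distance between raw MT and the post-edited text. This is one possible reading, not the method used in the case study: the weight values, the rule for substitution cost, and the hand-assigned POS tags are all assumptions made for illustration. In practice the tags would come from a POS tagger.

```python
def weighted_pe_distance(mt, pe, weights, default=1.0):
    """Edit distance over (token, POS) pairs with POS-dependent edit costs."""
    def cost(item):
        # item is a (token, pos) pair; unknown tags fall back to `default`.
        return weights.get(item[1], default)

    rows, cols = len(mt) + 1, len(pe) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + cost(mt[i - 1])  # delete MT token
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + cost(pe[j - 1])  # insert PE token
    for i in range(1, rows):
        for j in range(1, cols):
            if mt[i - 1][0] == pe[j - 1][0]:
                sub = 0.0  # identical tokens cost nothing
            else:
                # Assumed rule: a substitution costs the heavier of the two tags.
                sub = max(cost(mt[i - 1]), cost(pe[j - 1]))
            dist[i][j] = min(
                dist[i - 1][j] + cost(mt[i - 1]),
                dist[i][j - 1] + cost(pe[j - 1]),
                dist[i - 1][j - 1] + sub,
            )
    return dist[-1][-1]

# Hypothetical weights: content-word edits count more than function-word edits.
weights = {"NOUN": 2.0, "VERB": 2.0, "DET": 0.5, "PUNCT": 0.25}
mt = [("the", "DET"), ("file", "NOUN"), ("open", "VERB")]
pe = [("a", "DET"), ("file", "NOUN"), ("opens", "VERB")]
print(weighted_pe_distance(mt, pe, weights))  # prints: 2.5
```

An unweighted distance would count both edits equally (2 edits); with these weights the verb correction dominates the score and the determiner change contributes little.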
SUMMARY
• Domain-specific engines yield better results, as evidenced by automatic scores
and human evaluations. There is also some evidence of a reduction in linguistic issues.
• Group closely related content into one domain
• Determine how many engines your infrastructure can support
THANK YOU

How Much Cake to Eat: The Case for Targeted MT Engines

  • 1. PRESENTATION AMTA ALEXYANISHEVSKY Welocalize November 2015 How Much Cake is Enough: The Case for Domain-Specific Engines
  • 2. What is the Tipping Point? How Much Cake Is Too Much Cake? I LOVE CAKE… TOO MUCH CAKE! ugh!
  • 3. •How many engines •How to split domains •How to measure success •How to improve AGENDA
  • 4. HOW MANY ENGINES: CRITERIA • Environment: elegant deployment? • Cost • How different are they from each other? • Maintenance: engineering + linguistic feedback implementation
  • 5. HOW TO SPLIT DOMAINS: CRITERIA • Content owner feedback • Historical experience based on business unit or portfolio • Naming convention • Style analysis: difference in characteristics based on lexical diversity, sentence length + syntactic complexity
  • 6. HOW TO SPLIT DOMAINS: TOOLS • Build domain-specific language models + select TUs for domain by PPL • Source Content Profiler – helps identify domain based on language models, as well as other stylistic characteristics • Style Scorer – higher score indicates better match to style established by client’s documents HOLISTIC APPROACH BASED ON SEVERAL TOOLS:
  • 7. TOOLS: PERPLEXITY EVALUATOR TU LEVEL <tu srclang="EN-US" tuid="75438"> <prop type="x-ppl:train2">208</prop><prop type="x-ppl:techdoc6">191.025</prop><prop type="x-ppl:support2">325.983</prop><prop type="x-ppl:sales1">97.0736</prop><prop type="x-ppl:productLoc1">396.398</prop><prop type="x-ppl:legal1">617.876</prop><tuv xml:lang="EN-US"> <seg>Consistent feature set across multiple platforms (Windows, Mac, iOS, Android).</seg> </tuv> <tuv changedate="20140325T122530Z" changeid="serviceaaa" creationdate="20140325T122530Z" creationid="serviceaaa" lastusagedate="20140325T122530Z" usagecount="0" xml:lang="ES-XL"> <prop type="x-ALS:Context">TEXT</prop> <prop type="x-ALS:Source File">DATATC39720SRCEN-USco-02__battle-card_enco-02__battle-card_en.inx</prop> <seg>Conjunto de características coherente en varias plataformas (Windows, Mac, iOS, Android)</seg> </tuv> </tu>
  • 8. TOOLS: SOURCE CONTENT PROFILER IN CONJUNCTION WITH DCU/ADAPT
  • 9. TOOLS: STYLE SCORER COMBINES PPL RATIOS, DISSIMILARITY SCORE + CLASSIFICATION SCORE
       TEST CATEGORY | TRAINING CATEGORY | SCORE
       SUPPORT | TECH DOC | 3.16
       TECH DOC | TECH DOC | 2.94
       TECH DOC | LEGAL | 0.02
  • 10. WHY USE STYLE SCORER? • Identify similarity of source document to “gold standard” documents from that domain and other domains • Identify similarity of target document to “gold standard” documents from that domain and other domains • Example: Is this really a support document? To what degree is it similar to other support documents, tech doc documents, etc.? • Dissimilarity can point to worse quality for raw MT and/or reduced post-editing productivity
  • 11. STYLE SCORER + SCP • SCP helps classify a document • Style Scorer tells you how good a match a document is to a profile • SCP only works on English source • Style Scorer works on English source + non-English target
  • 12. Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone, Nine for Mortal Men doomed to die, One for the Dark Lord on his dark throne In the Land of Mordor where the Shadows lie. One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them In the Land of Mordor where the Shadows lie. CASE STUDY: ONE DOMAIN? One ring to rule them all
  • 13. CASE STUDY: HOW MANY DOMAINS? • Started with 6 Domains: Technical Documentation, Legal, Support, Training, Product UI, Sales/Marketing • Found that Technical Documentation, Support + Training were very similar based on LM scores against each other, Style Scorer, length of sentences, similar grammatical structures • Found that Product UI was close enough to above 3 that making a separate engine was not warranted • Found that Legal + Sales/Marketing were different enough from above domains and from each other based on LM scores against each other, Style Scorer + length of sentences
  • 14. CASE STUDY: GATHERING ASSETS • TMs: old, somewhat recent, current • Termbases in MultiTerm • Existing user dictionaries + normalization dictionaries • New user dictionaries based on term extractions + auto-import for some languages
  • 15. CASE STUDY: CURATING ASSETS • Cleaned TMs • Introduced metadata for PPL into TUs based on LM perplexity • Kept the UDs + normalization dictionaries as is • Additional term extraction for weak languages or languages with insufficient assets
  • 16. CASE STUDY: ENGINE ITERATIONS Based on options in Systran: • RBMT only • Hybrid with Stemming, LM Order, Distortion, etc. • SMT only
  • 17. HOW TO MEASURE SUCCESS • Automatic scores • Human evaluations • Some reduction in linguistic issues FORTHCOMING: • Informed PE Distance • Productivity • Systematic reduction of linguistic issues
  • 18. CASE STUDY: AUTOMATIC SCORES FOR GENERAL/TECH DOC
  • 19. CASE STUDY: HUMAN EVALUATIONS FOR GENERAL/TECH DOC
  • 20. CASE STUDY: AUTOMATIC SCORES SALES/MARKETING1
  • 21. CASE STUDY: AUTOMATIC SCORES SALES/MARKETING2
  • 22. CASE STUDY: AUTOMATIC SCORES LEGAL1
  • 23. CASE STUDY: AUTOMATIC SCORES LEGAL2
  • 24. CASE STUDY: AUTOMATIC SCORES TECH DOC
  • 25. HOW TO IMPROVE SCORING: • Linguistically-informed automatic scores: PE distance with different weights for POS • Productivity ENGINE: • Eradicate high-frequency inconsistencies between TMs, Termbases + User Dictionaries (UDs) • Create domain-specific UDs • Send best reply: TMT Prime, send best translation irrespective of domain SOURCE: • Pre-MT source check: was this content properly categorized?
  • 26. SUMMARY • Domain-specific engines yield better results, as evidenced by automatic scores and human evaluations. There is also some evidence of a reduction in linguistic issues. • Group closely related content into one domain • Determine how many engines your infrastructure can support
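
Slides 6 and 7 describe selecting TUs for a domain by perplexity (PPL), with one `x-ppl:<model>` property per domain language model stored on each TU. A minimal sketch of that selection step follows, assuming that the lowest perplexity marks the best-matching domain. The function names and the optional `max_ppl` cut-off are hypothetical and not taken from the deck; the TU is a trimmed copy of the one on slide 7.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide 7 TU: only the x-ppl props and the source segment.
TU_XML = """
<tu srclang="EN-US" tuid="75438">
  <prop type="x-ppl:train2">208</prop>
  <prop type="x-ppl:techdoc6">191.025</prop>
  <prop type="x-ppl:support2">325.983</prop>
  <prop type="x-ppl:sales1">97.0736</prop>
  <prop type="x-ppl:productLoc1">396.398</prop>
  <prop type="x-ppl:legal1">617.876</prop>
  <tuv xml:lang="EN-US">
    <seg>Consistent feature set across multiple platforms (Windows, Mac, iOS, Android).</seg>
  </tuv>
</tu>
""".strip()

PPL_PREFIX = "x-ppl:"


def domain_perplexities(tu):
    """Map each domain LM name to the perplexity recorded on the TU."""
    scores = {}
    # findall("prop") only sees direct children of <tu>, so props nested
    # inside a <tuv> are ignored.
    for prop in tu.findall("prop"):
        prop_type = prop.get("type", "")
        if prop_type.startswith(PPL_PREFIX) and prop.text:
            scores[prop_type[len(PPL_PREFIX):]] = float(prop.text)
    return scores


def best_domain(tu, max_ppl=None):
    """Return the lowest-perplexity domain, or None if nothing qualifies.

    max_ppl is a hypothetical cut-off (not from the slides) for rejecting
    TUs that fit no domain well.
    """
    scores = domain_perplexities(tu)
    if not scores:
        return None
    domain = min(scores, key=scores.get)
    if max_ppl is not None and scores[domain] > max_ppl:
        return None
    return domain


tu = ET.fromstring(TU_XML)
print(best_domain(tu))  # prints "sales1"
```

For this TU the sales model scores lowest (97.0736), which fits the source file name on slide 7 (a battle card).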

Editor's Notes

  1. LM built using KenLM with ngram=5
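
As a rough illustration of where a per-TU perplexity figure comes from: KenLM reports sentence scores as log10 probabilities, and perplexity is the inverse probability normalized by token count. The helper below is a sketch under that assumption. Its name is hypothetical, and whether the deck's tool counted the end-of-sentence token is not stated in the source.

```python
def perplexity_from_log10(total_log10_prob, num_words, count_eos=True):
    """Convert a sentence's total log10 probability into perplexity.

    count_eos adds one token for the end-of-sentence marker; this is an
    assumption about the scoring setup, not something the slides specify.
    """
    tokens = num_words + (1 if count_eos else 0)
    if tokens <= 0:
        raise ValueError("need at least one token")
    return 10.0 ** (-total_log10_prob / tokens)


# A 2-word sentence with total log10 probability -6.0, plus the
# end-of-sentence token, gives 10 ** (6 / 3).
print(perplexity_from_log10(-6.0, 2))  # prints 100.0
```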