4. Introduction, background
Why a Catalan version?
Celebration of LV’s 130th anniversary
Normalization of the use of Catalan
Investment to face the crisis
Opportunity to consolidate LV’s hegemony
5. [2] Customer goals
• To publish two language editions of the same newspaper daily (supplements incl.).
• Journalists should be able to write in either of the two languages.
• Neither quality nor distribution timeframes should be affected.
6. Customer requirements
• Tailor-made system
• Complying with LV’s style guide
• Seamless integration into journalists’ workflow
• Translation of Hermes XML and InDesign formats
• Reliability, high availability
• High performance
7. [3] Ramp-up phase
Work areas
• Project set-up
• MT linguistic improvement/tuning
• Post-editing preparation
• MT system set-up and integration
• MT lexicon training
Duration: 8 months (+ 3 months)
Staff
• LV: 10-12 in-house journalists
• Lucy: 3 computational linguists / lexicographers, 1 software developer
• Incyta: 2 professional post-editors
Important! On-site support
8. Subphases
TASKS Phase 1 Phase 2 Phase 3 Phase 4
Linguistic improvement/tuning
- Language-type definition x
- Creation of a corpus of real texts x x x x
- Analysis of the translation quality x x x x
- Error reporting (lexicon and grammar errors) x x x x
- Linguistic implementation (lex and grammar) x x x x
- Pre and post-editing filters x x x x
Post-editing preparation
- Gathering of MT post-editing guidelines x
- Evaluation of post-editing effort x x
- Creation and training of the post-editing team x
Technical set-up
- System set-up and integration x
- Preparation of XML converters x
Maintenance
- Lexicon maintenance training x
Duration 2 mo 3 mo 3 mo 3 mo
9. [a] Linguistic tuning
• Language model
• Corpus
• Translation quality (TQ)
• Analysis and error-reporting
• Implementation
• Accomplished improvement data
10. Linguistic tuning
Catalan language model
• no exclusion
• compliant with standards
• innovative in terminology
• dynamic in syntactical structures
Corpus
• ES: 500,000 transl. units – 8,300,000 words
• CA: 250,000 transl. units – 3,000,000 words
11. Linguistic tuning
Translation Quality
• Perfect: 74%
• Minimal post-editing: 24%
• Medium post-editing: 2%
Conclusions
• No specific domains (except Sports)
• Culture: proper names
• Opinion: idioms, plays on words
• Errors not repetitive
• % style to be post-edited
12. Linguistic tuning
Analysis and error reporting
• Semi-automatic detection of missing words
• Terminology lists
• New and different translations, error reporting
Implementation
• Proper names [44.5% of the TUs]
• Idioms
• Alternatives
13. Linguistic tuning
Accomplished improvement data
• Work in figures
40,000 lexicon entries (20,000 for each transl. direction)
Around 440 grammar rules
Around 7,200 words in the proper names files (each transl. dir)
• Non-measurable work
Understanding of the MT system
Understanding of the newspaper specificities
Support in adapting the style guide to take MT into account
• Improvement
ES>CA: 41% diff => 35% better, 4% similar, 2% worse
CA>ES: 36% diff => 32% better, 3% similar, 1% worse
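The improvement figures can be cross-checked with a short calculation. This is only an illustrative sketch: it assumes the percentages are shares of all evaluated translation units, and the names `RESULTS` and `summarize` are hypothetical, not part of the project.

```python
# Figures copied from the slide. Assumption: each value is a percentage
# of all evaluated translation units.
RESULTS = {
    "ES>CA": {"diff": 41, "better": 35, "similar": 4, "worse": 2},
    "CA>ES": {"diff": 36, "better": 32, "similar": 3, "worse": 1},
}


def summarize(row):
    """Check that the breakdown adds up and derive two simple ratios."""
    parts = row["better"] + row["similar"] + row["worse"]
    return {
        # better + similar + worse should equal the share of changed units
        "consistent": parts == row["diff"],
        # percentage points gained once regressions are subtracted
        "net_gain_points": row["better"] - row["worse"],
        # share of the changed units that were judged better
        "better_share_of_changed": round(100 * row["better"] / row["diff"], 1),
    }


for direction, row in RESULTS.items():
    print(direction, summarize(row))
```

Under that assumption, both breakdowns are internally consistent, and most of the changed units in each direction were judged better.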
15. Post-editing
• Metrics on translation volume
• Metrics on post-editing effort
• Specificities of the text
• Post-editors workspace
• Post-editing resources
• Error reporting process and tools
• Post-editing team and profile
16. Post-editing: metrics
File            Total translation units    Lex/gram post-editing    Style post-editing
LV_2010-10-27   2,474 (= 42,512 words)     464 (18.79%)             394 (15.96%)
Conclusions
• Different sections had different levels of post-editing
• What style corrections could be avoided?
• Post-editing speed: 1,000-1,500 words/h
• Daily volume: 75,000 words
• New post-editing team: 20 post-editors/12 editors
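The figures above imply a rough workload per post-editor. The following is a minimal sketch of that arithmetic, assuming the whole daily volume is post-edited and spread evenly across the team; the function names are hypothetical.

```python
def person_hours(daily_words, words_per_hour):
    """Total post-editing hours needed for one day's volume."""
    return daily_words / words_per_hour


def hours_per_post_editor(daily_words, words_per_hour, team_size):
    """Hours each post-editor works if the load is shared evenly (assumption)."""
    return person_hours(daily_words, words_per_hour) / team_size


DAILY_WORDS = 75_000   # daily volume from the slide
TEAM_SIZE = 20         # post-editors in the new team

# Speed range from the slide: 1,000-1,500 words per hour.
for speed in (1_000, 1_500):
    total = person_hours(DAILY_WORDS, speed)
    each = hours_per_post_editor(DAILY_WORDS, speed, TEAM_SIZE)
    print(f"{speed} words/h: {total:.0f} person-hours, {each:.2f} h per post-editor")
```

With these inputs the daily load is between 50 and 75 person-hours, or 2.5 to 3.75 hours per post-editor.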
17. Post-editing: resources, workspace
Post-editors should have proficiency in their skills BUT also
• Be trained on MT post-editing
• Have an integrated workspace
• Have training resources at a click
Resources on Intranet portal
• Post-editing guide
• Bilingual style guide
• Classified frequent MT errors
• Links to all reference dictionaries
• Reference document for MT portal
Adapt CMS to new language workflow
• New processing status
• New mark-ups for any journalist
19. Post-editing: error reporting, team
Error reporting
• Crucial for continuous improvement
• Not automated (yet)
• Provide better support to error reporting
Definition of post-editing profile and team
• Proficient in Catalan
• Journalist background
20. [c] System integration
During phase 1: pre-production
• Pre-production set-up and installation
• Hermes XML converter
• Changes in the LT engine to translate InDesign files
During phase 3: production
• Production installation
• Test (load, performance and stress)
• Performance: 500-1,200 words/sec
• Definition of the final installation size
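To put the measured throughput in perspective, the sketch below estimates the raw machine translation time for one day's content. It assumes a daily volume of 75,000 words (the figure quoted in the post-editing metrics) and ignores queuing, file conversion and network overhead; `translation_seconds` is a hypothetical helper.

```python
def translation_seconds(words, words_per_second):
    """Raw MT time for a given volume at a constant throughput (assumption)."""
    return words / words_per_second


DAILY_WORDS = 75_000  # assumed daily volume, taken from the post-editing metrics

# Throughput range reported for the production tests: 500-1,200 words/sec.
for throughput in (500, 1_200):
    secs = translation_seconds(DAILY_WORDS, throughput)
    print(f"{throughput} words/sec: {secs:.1f} s ({secs / 60:.1f} min)")
```

Under these assumptions, the raw translation time for a full day's volume is between about one and two and a half minutes.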
21. System integration
[Architecture diagram. Labels: Language portal, Hermes, InDesign, Web Service; installations: Production, Pre-production, Maintenance]
• Production: balanced high performance (HP) and high availability (HA) configuration
• System requirements: normal Windows Server -> low HW footprint (e.g. Dual Core/Quad 2.5-3 GHz, 2-4 GB RAM running Win Server 2003/2008)
22. [4] Operation: production process
Staff
• 20 post-editors
• 12 editors
Effort
• 30’ linguistic review
• 10’ journalistic review
• 70,000 words/day + suppl.
Timeline
• Start 5 p.m.
• First edition 11.30 p.m.
• Second edition 2.30 a.m.
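The timeline fixes how long the team has before each edition closes. A small sketch of that calculation follows, using only the clock times given above; `window_hours` is a hypothetical helper, and it assumes the second edition closes on the following calendar day.

```python
from datetime import datetime, timedelta


def window_hours(start, end):
    """Hours from start to end (HH:MM, 24 h clock), rolling past midnight if needed."""
    fmt = "%H:%M"
    s = datetime.strptime(start, fmt)
    e = datetime.strptime(end, fmt)
    if e <= s:
        # The end time is on the following day (assumption for the 2.30 a.m. edition).
        e += timedelta(days=1)
    return (e - s).total_seconds() / 3600


START = "17:00"  # start, 5 p.m.
print("First edition window:", window_hours(START, "23:30"), "h")
print("Second edition window:", window_hours(START, "02:30"), "h")
```

This gives a window of 6.5 hours for the first edition and 9.5 hours for the second.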
24. [5] Next goals
Success! Yes. Thanks to
• Close work and cooperation
• Three parties involved
• Time and effort investment
• Customisation
Next!
• How to reduce post-editing effort
• How to re-use post-edited text
25. Thank you for your attention
Magí Camps, La Vanguardia, mcamps@lavanguardia.es, www.lavanguardia.es
Blanca Vidal, Lucy Software Ibérica, blanca.vidal@lucysoftware.com, www.lucysoftware.com
Ignasi Navarro, Incyta, Ignasi_navarro@incyta.com, www.incyta.com