TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE


Integration of Advanced Language Processing Techniques into
Statistical Machine Translation

11:10-11:30
Wednesday, 17 October

Diego Bartolome
Tauyou
Language Processing Techniques for Statistical Machine Translation



      Contact: Diego Bartolome – dbc@tauyou.com
      C/ Les Planes 39, 1o 2a – 08201 Sabadell – Spain
      Tel. +34 93 711 29 96
To start ...




… you choose Moses ...

Translation memories + linguistic assets

Cleaning and training following tutorials

BLEU score seems ok in training

                                  … but ...

the results are awful!


Why?

Not enough data
Unclean translation memories
Misalignments
Spelling and grammar errors
Difficult language pairs
Selection of wrong parameters
Application of suboptimal techniques

So many things … what can you do?
Some steps

Maximum exploitation of existing assets

Source content optimization

Data selection and cleaning

Improvement of the models

Linguistic processing

...


Existing assets: increase TM leverage

Translation memory sharing

   Clients, Partners, Competitors, EU, UN, TAUS

Relevant on-line data retrieval

Advanced TM techniques

   Sub-segment matching

   Part-of-speech replacement
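
Sub-segment matching can be pictured as looking for the longest run of words that a new sentence shares with a stored TM source segment. The sketch below is only an illustration under that assumption, not the tool described in the talk; the function name `best_subsegment`, the `min_words` threshold, and the sample `tm` list are hypothetical.

```python
from difflib import SequenceMatcher

def best_subsegment(query, tm_sources, min_words=3):
    """Return (shared_words, tm_source) for the longest word run shared with any TM source."""
    q = query.lower().split()
    best = ([], None)
    for src in tm_sources:
        s = src.lower().split()
        matcher = SequenceMatcher(None, q, s, autojunk=False)
        m = matcher.find_longest_match(0, len(q), 0, len(s))
        # Keep the match only if it is long enough and beats the current best.
        if m.size >= min_words and m.size > len(best[0]):
            best = (q[m.a:m.a + m.size], src)
    return best

# Hypothetical translation memory sources.
tm = [
    "Press the red button to stop the engine",
    "Open the cover before cleaning the filter",
]
words, source = best_subsegment("To stop the engine turn the key", tm)
```

Here `words` holds the shared run "to stop the engine" and `source` is the first TM entry, whose stored translation could then be reused for that fragment.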


Source optimization (I): Pre-editing

[Diagram: new doc -> proposed doc + HTML report]

        Spell check
        Grammar check
        Style check
        Terminology check
        Client checklist
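
As a rough idea of what one of these checks might look like, here is a minimal terminology check that scans a document and collects findings for a report. It is a sketch under assumptions: the glossary format (discouraged term mapped to preferred term), the function name `terminology_check`, and the sample text are all hypothetical.

```python
import re

def terminology_check(text, glossary):
    """Return a list of (line_number, found_term, preferred_term) findings."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for discouraged, preferred in glossary.items():
            # Whole-term, case-insensitive match.
            pattern = r"\b" + re.escape(discouraged) + r"\b"
            if re.search(pattern, line, flags=re.IGNORECASE):
                findings.append((line_no, discouraged, preferred))
    return findings

# Hypothetical client glossary: discouraged term -> preferred term.
glossary = {"e-mail": "email", "log-in": "log in"}
doc = "Send an E-mail to support.\nThen log-in again."
report = terminology_check(doc, glossary)
```

Each finding carries the line number, so it could be rendered into a report alongside the proposed document.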

Source optimization (II): Summarization

[Diagram: new doc -> proposed doc + HTML report]

       % to reduce
       Use translation memories
             Project
             Client
             All
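
The "% to reduce" setting can be illustrated with a simple frequency-based extractive summarizer. This is a sketch, assuming sentences are scored by the average corpus frequency of their words; it does not cover the translation memory options listed above, and the names `summarize` and `reduce_pct` are hypothetical.

```python
import re
from collections import Counter

def summarize(text, reduce_pct):
    """Drop roughly reduce_pct percent of sentences, keeping the top-scoring ones in order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    keep = max(1, round(len(sentences) * (1 - reduce_pct / 100)))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:keep])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)

text = ("Moses is an open source toolkit. Moses trains translation models. "
        "The weather was nice. Translation models need clean data.")
summary = summarize(text, 50)
```

With four input sentences and a 50% reduction, two sentences are kept in their original order.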

Summarization example

                  http://www.translationautomation.com/press-releases/free-open-source-machine-translation-tutorial-is-made-available-by-taus




Data selection and cleaning – a sample

Clean translation memories

   Length, punctuation, terminology, repetitions …

   Segment splitting

Optimize weight of most frequent n-grams in corpus

   Validate their translations

Add out-of-domain data for irrelevant n-grams
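
A few of the cleaning checks listed above (length, punctuation, repetitions) could look like the following. This is a minimal sketch; the thresholds `max_ratio` and `max_len` and the sample pairs are assumptions chosen for illustration, not values from the talk.

```python
def clean_tm(pairs, max_ratio=3.0, max_len=100):
    """Filter (source, target) pairs with simple length, punctuation, and repetition checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # one side is empty
        if n_src > max_len or n_tgt > max_len:
            continue  # segment too long
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # suspicious length ratio, likely a misalignment
        if (src[-1] in ".?!") != (tgt[-1] in ".?!"):
            continue  # final punctuation does not agree
        if (src, tgt) in seen:
            continue  # exact repetition
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

# Hypothetical English-Spanish TM entries.
pairs = [
    ("Press the button.", "Pulse el botón."),
    ("Press the button.", "Pulse el botón."),
    ("Stop", "Detenga el motor inmediatamente y llame al servicio técnico"),
    ("Open the cover.", "Abra la tapa"),
    ("", "Texto sin fuente"),
]
cleaned = clean_tm(pairs)
```

Only the first pair survives: the others are dropped as a repetition, a length-ratio outlier, a punctuation mismatch, and an empty source.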


Model optimization

Filter the translation tables
   Remove the garbage + tune the weights if necessary
Optimize language models
   Adapt them to the translation purpose
Tune parameters correctly
   Tune set, test set, optimization parameters …

Improve recasing
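
To make "filter the translation tables" concrete, the sketch below prunes a phrase table in the Moses text format (`source ||| target ||| scores`). It assumes the four standard scores, with the third being the direct translation probability p(target | source); the `min_prob` and `top_n` values and the sample entries are hypothetical, and this is not the filtering used by the speaker.

```python
from collections import defaultdict

def filter_phrase_table(lines, min_prob=0.01, top_n=2):
    """Keep at most top_n targets per source phrase with direct probability >= min_prob."""
    by_source = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # malformed entry
        scores = [float(x) for x in fields[2].split()]
        if len(scores) < 3:
            continue  # not enough scores to read p(target | source)
        direct_prob = scores[2]  # assumption: third score is p(target | source)
        if direct_prob >= min_prob:
            by_source[fields[0]].append((direct_prob, line))
    kept = []
    for src in by_source:
        best = sorted(by_source[src], key=lambda item: item[0], reverse=True)[:top_n]
        kept.extend(line for _, line in best)
    return kept

# Hypothetical phrase-table entries.
table = [
    "the house ||| la casa ||| 0.8 0.7 0.9 0.6",
    "the house ||| el hogar ||| 0.1 0.2 0.05 0.1",
    "the house ||| casa la ||| 0.01 0.01 0.001 0.01",
    "the house ||| la vivienda ||| 0.2 0.2 0.04 0.1",
]
filtered = filter_phrase_table(table)
```

The low-probability entry is discarded first, then only the two best remaining options for "the house" are kept.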
Linguistic processing

In the source and/or target language

   Grammar checking

   Entity detection

      proper nouns, alphanumeric words, numbers, ...

   Compound word splitting

   Sentence reordering
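
One common way to handle numbers and alphanumeric words is to mask them with placeholders before translation and restore them afterwards, so the engine cannot alter them. The sketch below assumes a very simple rule (any token containing a digit is an entity); the placeholder format `__ENTn__` and both function names are hypothetical.

```python
import re

# Assumption: any token containing at least one digit counts as an entity.
ENTITY = re.compile(r"\b\w*\d\w*\b")

def mask_entities(sentence):
    """Replace digit-bearing tokens with numbered placeholders; return text and mapping."""
    mapping = {}

    def repl(match):
        key = f"__ENT{len(mapping)}__"
        mapping[key] = match.group(0)
        return key

    return ENTITY.sub(repl, sentence), mapping

def unmask_entities(sentence, mapping):
    """Put the original tokens back in place of their placeholders."""
    for key, value in mapping.items():
        sentence = sentence.replace(key, value)
    return sentence

original = "XXX 335102 doses at 444mg per day"
masked, mapping = mask_entities(original)
restored = unmask_entities(masked, mapping)
```

In this example `masked` is "XXX __ENT0__ doses at __ENT1__ per day", and `restored` equals the original sentence. In a real pipeline the placeholders would also need to survive tokenization and translation unchanged.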


An example from
Source
XXX 335102 doses are calculated as a free acid of the sodium salt (NA).
The potential toxicity of XXX 335102 was studied in a number of acute toxicity studies in mouse and rat
and repeat dose toxicity studies of 8 and 32 weeks each in rat and monkeys.
XXX 335102 was negative in a panel of in vivo and in vitro tests to assess mutagenicity and
clastogenicity identifying no genotoxic risks for human subjects.
An in vitro assay for phototoxic potential suggested that XXX 335102 is photoxic/photosensitive.
In the 8-week studies in monkeys, increases in unconjugated bilirubin were noted at the doses tested
(33, 88, 192 and 444mg/kg/day); the greatest increases occurring at Week 4 and declining or returning
to control levels by Week 8.

Reference
Las dosis de XXX 335102 se calculan como la sal sódica sin ácido (AS).
La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y
ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.
Se obtuvieron resultados negativos en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia
y clastogenia, sin identificarse riesgos genotóxicos para el ser humano.
En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o
fotosensibilizador.
En los estudios de 8 semanas en monos se apreció el aumento de la bilirrubina no conjugada con las
dosis estudiadas (33, 88, 192 y 444 mg/kg/día), produciéndose el mayor incremento en la semana 4 y
disminuyendo o volviendo a los niveles de control en la semana 8.
Generic engine
XXX 335102 se calculan en forma de dosis de ácido libre del sodio sal (NA).
La Toxicidad potencial de XXX 335102 fue estudiado en una serie de estudios de toxicidad aguda en
ratón y rata y vuelva a dosis estudios de toxicidad, de 8 y de 32 semanas en rata y cada uno de los
monos.
XXX 335102 era negativo en un grupo de in vivo y pruebas in vitro para evaluar mutagenicidad y
genotóxicas clastogenicity no identificar los riesgos para los participantes humanos.
Un para fines de ensayo in vitro phototoxic potencial se sugirió que XXX 335102
photoxic/Photosensitive.
En Los 8 -week estudios en los monos, aumentos en unconjugated bilirrubina salieron a las dosis
analizada (33, 88, 192 y 444 mg/kg/día); los mayores incrementos habidos En la semana 4 y la
reducción o devolver a nivel de control de 8 Por semana.

Medical engine with improvements
Las dosis XXX 335102 se calculan como ácido libre de la sal sódica (AS).
La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y
ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos.
XXX 335102 dio negativo en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y
clastogenia, sin identificarse riesgos genotóxicos para el ser humano.
En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o
fotosensibilizador.
En los estudios de 8 semanas en monos, el aumento de la bilirrubina no conjugada con las dosis
estudiadas (33, 88, 192 y 444 mg/kg/día); el mayor incremento en la semana 4 y disminuyendo o
volviendo a los niveles de control en la semana 8.
Conclusions

MT can be combined with other advanced techniques

Creating and improving an engine requires time

   You can also be lucky at the first try!

The optimum results require translators

   Implementation of the linguistic knowledge

   Continuous improvement



TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Seattle, Language Processing Techniques for Statistical Machine Translation, Diego Bartolome, tauyou, 17 October 2012

  • 1. TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE
       Integration of Advanced Language Processing Techniques into Statistical Machine Translation
       11:10-11:30, Wednesday, 17 October
       Diego Bartolome, Tauyou
  • 2. Language Processing Techniques for Statistical Machine Translation
       Contact: Diego Bartolome – dbc@tauyou.com
       C/ Les Planes 39, 1o 2a – 08201 Sabadell – Spain
       Tel. +34 93 711 29 96
  • 3. To start ...
  • 4. … you choose Moses ...
       Translation memories + linguistic assets
       Cleaning and training following tutorials
       BLEU score seems ok in training
       … but ...
       the results are awful!
  • 5. Why?
       Not enough data
       Unclean translation memories
       Misalignments
       Spelling and grammar errors
       Difficult language pairs
       Selection of wrong parameters
       Application of suboptimal techniques
       So many things … what can you do?
  • 7. Some steps
       Maximum exploitation of existing assets
       Source content optimization
       Data selection and cleaning
       Improvement of the models
       Linguistic processing
       ...
  • 8. Existing assets: increase TM leverage
       Translation memory sharing
          Clients, Partners, Competitors, EU, UN, TAUS
       Relevant on-line data retrieval
       Advanced TM techniques
          Sub-segment matching
          Parts of Speech replacement
  • 9. Source optimization (I): Pre-editing
       new doc → proposed doc + HTML report
       Spell check
       Grammar check
       Style check
       Terminology check
       Client checklist
  • 10. Source optimization (II): Summarization
        new doc → proposed doc + HTML report
        % to reduce
        Use translation memories
           Project, Client, All
  • 11. Summarization example
        http://www.translationautomation.com/press-releases/free-open-source-machine-translation-tutorial-is-made-available-by-taus
  • 12. Data selection and cleaning – a sample
        Clean translation memories
           Length, punctuation, terminology, repetitions …
        Segment splitting
        Optimize weight of most frequent n-grams in corpus
           Validate their translations
        Add out-of-domain data for irrelevant n-grams
  • 13. Model optimization
        Filter the translation tables
           Remove the garbage + tune the weights if necessary
        Optimize language models
           Adapt them to the translation purpose
        Tune parameters correctly
           Tuning set, test set, optimization parameters …
        Improve recasing
  • 14. Linguistic processing
        In the source and/or target language
        Grammar checking
        Entity detection
           proper nouns, alphanumeric words, numbers, ...
        Compound word splitting
        Sentence reordering
  • 15. An example
        Source
        XXX 335102 doses are calculated as a free acid of the sodium salt (NA). The potential toxicity of XXX 335102 was studied in a number of acute toxicity studies in mouse and rat and repeat dose toxicity studies of 8 and 32 weeks each in rat and monkeys. XXX 335102 was negative in a panel of in vivo and in vitro tests to assess mutagenicity and clastogenicity identifying no genotoxic risks for human subjects. An in vitro assay for phototoxic potential suggested that XXX 335102 is photoxic/photosensitive. In the 8-week studies in monkeys, increases in unconjugated bilirubin were noted at the doses tested (33, 88, 192 and 444mg/kg/day); the greatest increases occurring at Week 4 and declining or returning to control levels by Week 8.
        Reference
        Las dosis de XXX 335102 se calculan como la sal sódica sin ácido (AS). La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos. Se obtuvieron resultados negativos en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano. En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador. En los estudios de 8 semanas en monos se apreció el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día), produciéndose el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.
  • 16. Generic engine
        XXX 335102 se calculan en forma de dosis de ácido libre del sodio sal (NA). La Toxicidad potencial de XXX 335102 fue estudiado en una serie de estudios de toxicidad aguda en ratón y rata y vuelva a dosis estudios de toxicidad, de 8 y de 32 semanas en rata y cada uno de los monos. XXX 335102 era negativo en un grupo de in vivo y pruebas in vitro para evaluar mutagenicidad y genotóxicas clastogenicity no identificar los riesgos para los participantes humanos. Un para fines de ensayo in vitro phototoxic potencial se sugirió que XXX 335102 photoxic/Photosensitive. En Los 8 -week estudios en los monos, aumentos en unconjugated bilirrubina salieron a las dosis analizada (33, 88, 192 y 444 mg/kg/día); los mayores incrementos habidos En la semana 4 y la reducción o devolver a nivel de control de 8 Por semana.
        Medical engine with improvements
        Las dosis XXX 335102 se calculan como ácido libre de la sal sódica (AS). La toxicidad potencial de XXX 335102 se estudió en varios estudios de toxicidad aguda en ratones y ratas y en estudios de toxicidad con administración repetida de 8 y 32 semanas en ratas y monos. XXX 335102 dio negativo en un grupo de pruebas in vivo e in vitro para evaluar su mutagenia y clastogenia, sin identificarse riesgos genotóxicos para el ser humano. En un estudio in vitro de su potencial fototóxico se sugirió que XXX 335102 es fototóxico o fotosensibilizador. En los estudios de 8 semanas en monos, el aumento de la bilirrubina no conjugada con las dosis estudiadas (33, 88, 192 y 444 mg/kg/día); el mayor incremento en la semana 4 y disminuyendo o volviendo a los niveles de control en la semana 8.
  • 17. Conclusions
        MT can be combined with other advanced techniques
        Creating and improving an engine requires time
           You can also be lucky on the first try!
        The optimum results require translators
           Implementation of the linguistic knowledge
           Continuous improvement
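
The translation memory cleaning checks named on slide 12 (length, punctuation, repetitions) can be illustrated with a minimal Python sketch. This is not the pipeline used in the talk: the function name `clean_pairs`, the thresholds `max_ratio` and `max_tokens`, and the final-punctuation heuristic are all assumptions chosen for illustration, and the terminology check is left out because it needs a client glossary.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def clean_pairs(pairs: Iterable[Pair],
                max_ratio: float = 3.0,   # hypothetical threshold
                max_tokens: int = 100     # hypothetical threshold
                ) -> List[Pair]:
    """Keep segment pairs that pass simple length, punctuation and repetition checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Length: drop empty or overlong segments.
        if n_src == 0 or n_tgt == 0 or max(n_src, n_tgt) > max_tokens:
            continue
        # Length: drop pairs whose token counts differ implausibly.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Punctuation: either both sides end in punctuation or neither does.
        if (src[-1] in ".?!:;") != (tgt[-1] in ".?!:;"):
            continue
        # Repetitions: keep only the first copy of each pair (case-insensitive).
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("The dose was reduced.", "La dosis se redujo."),
        ("The dose was reduced.", "La dosis se redujo."),
        ("See the report.", "Véase el informe"),
    ]
    for src, tgt in clean_pairs(sample):
        print(src, "|||", tgt)
```

Filters like these are deliberately conservative: they discard pairs that are likely misaligned or noisy rather than trying to repair them, so the thresholds would need tuning per language pair and domain.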