Artificial Intelligence TM and Terminology Onboarding

We're building a tool that automatically scrapes a multilingual website and generates a translation memory from it. From there it extracts the terminology.

Published in: Technology

  1. AI TM and Terminology Onboarding. Rémy Blättler, Chief of the System @ Supertext
  2. Pipeline: Webspider (with PhantomJS) => structured text in multiple languages; NMT-based alignment => translation memory; term extraction => terminology database
  3. Webspider: PhantomJS (based on WebKit) can spider JavaScript-based pages, which makes it a good fit for modern dynamic sites. Result: structured (HTML) multilingual data
  4. Problems: speed; multitasking / scaling
  5. NMT Alignment: 1. Translate: "How can I create my own coupon template?" 2. Align: "Wie kann ich eine eigene Gutscheinvorlage erstellen?" with "Wie kann ich eine eigene Coupon-Vorlage erstellen?"
  6. Problems: false positives; choosing the best matching algorithm (Gale-Church, Levenshtein, BLEU score)
  7. GENSIM word2vec: "breakfast cereal dinner lunch" => cereal; man => boy, woman => girl; Sweden: Norway 0.76, Finland 0.71, Estonia 0.54
  8. GENSIM Phrases: "New York City"; "Terms and Conditions"; "Never Follow" (Audi); "Just do it" (Nike)
  9. Term extraction: 1. Average occurrence of a term over all corpora 2. Average occurrence of a term for this scan 3. Same for the other language
  10. Term extraction 2: Detect the specific phrase in the source & target: "Geänderte Segmentberichterstattung erhöht Aussagekraft." / "Improvements gained from changes in segment reporting."
  11. Example term pairs (German => English): Lindschulte-Gruppe => Lindschulte Group; Anleihen => bonds; BKW Energie AG => BKW Energie AG; Revisionsstelle => external auditors; Vizepräsident des Verwaltungsrats => Deputy Chair; staatlichen Fonds => state funds; Die beizulegenden Zeitwerte => The fair values; Stilllegungs- und Entsorgungsfonds => disposal funds; Gebäudetechnik => building technologies; Wertänderung => Value adjustment; Bewertungsverfahren => Level valuations
  12. Problems: speed (tests take multiple hours); insufficient data (>50k TM units helps); bad source data (HTML, JavaScript, etc.)
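
The translate-then-align step from the NMT Alignment slide can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation. It uses Levenshtein distance, one of the matching options the slides name, and it makes several assumptions: the machine translation is passed in as a precomputed string instead of calling a real NMT system, the "Coupon-Vorlage" sentence is taken to be the machine output and the "Gutscheinvorlage" sentence the text found on the site, the sentence "Kontaktieren Sie uns." is a made-up distractor, and the 0.7 threshold and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the edit distance to a score in 0..1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def align(machine_translations, target_segments, threshold=0.7):
    """Pair each source segment with the target segment closest to its MT.

    machine_translations: list of (source_text, mt_text) tuples
    target_segments: sentences scraped from the target-language page
    threshold: minimum similarity; lower-scoring matches are dropped to
               limit false positives (0.7 is an arbitrary assumption)
    """
    pairs = []
    for source, mt in machine_translations:
        best = max(target_segments,
                   key=lambda seg: similarity(mt, seg),
                   default=None)
        if best is not None and similarity(mt, best) >= threshold:
            pairs.append((source, best))
    return pairs


# Assumed inputs: MT output is hard-coded, the second target is a distractor.
machine_translations = [
    ("How can I create my own coupon template?",
     "Wie kann ich eine eigene Coupon-Vorlage erstellen?"),
]
target_segments = [
    "Kontaktieren Sie uns.",
    "Wie kann ich eine eigene Gutscheinvorlage erstellen?",
]

tm_units = align(machine_translations, target_segments)
for source, target in tm_units:
    print(source, "=>", target)
```

Each resulting pair is a candidate translation memory unit. Raising the threshold trades recall for fewer false positives, which is the tension the Problems slide on alignment points at; a length-based method such as Gale-Church or a BLEU-based score could be swapped in for `similarity` without changing the rest of the sketch.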