We're building a tool that automatically scrapes a multilingual website and generates a translation memory from it. From there it extracts the terminology.
5. PhantomJS (a headless WebKit browser) can spider JavaScript-based pages.
Great for modern dynamic sites.
Result: Structured (HTML) multilingual data
Webspider
7. NMT Alignment
How can I create my own
coupon template?
Wie kann ich eine eigene
Gutscheinvorlage erstellen?
Wie kann ich eine eigene
Coupon-Vorlage erstellen?
2. Align
1. Translate
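The translate-then-align idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the NMT step is assumed to have already produced the machine translation, and the names `align`, `mt_outputs`, `target_segments` and the `threshold` value are hypothetical. String similarity from Python's `difflib` stands in for whatever matching the real aligner uses.

```python
from difflib import SequenceMatcher

def align(mt_outputs, target_segments, threshold=0.6):
    """Pair each source segment with the target segment closest to its machine translation.

    mt_outputs: list of (source_text, machine_translation) tuples (step 1, assumed done).
    target_segments: segments scraped from the target-language page.
    """
    pairs = []
    for src, mt in mt_outputs:
        best, best_score = None, 0.0
        for tgt in target_segments:
            # Character-level similarity between the MT output and a real target segment.
            score = SequenceMatcher(None, mt.lower(), tgt.lower()).ratio()
            if score > best_score:
                best, best_score = tgt, score
        # Keep only confident matches (threshold is an assumed value).
        if best is not None and best_score >= threshold:
            pairs.append((src, best, best_score))
    return pairs

# Example from the slide: the MT output differs from the site's wording,
# but is close enough to find the right target segment.
mt_outputs = [("How can I create my own coupon template?",
               "Wie kann ich eine eigene Coupon-Vorlage erstellen?")]
target_segments = ["Wie kann ich eine eigene Gutscheinvorlage erstellen?",
                   "Allgemeine Geschäftsbedingungen"]
pairs = align(mt_outputs, target_segments)
```

The resulting (source, target) pairs are what would go into the translation memory; the machine translation itself is only used as a bridge and is discarded.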
9. breakfast cereal dinner lunch => Cereal
GENSIM word2vec
man => boy woman => girl
Sweden
Norway 0.76
Finland 0.71
Estonia 0.54
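The two word2vec queries on this slide (nearest neighbours and "which word does not fit") both reduce to cosine similarity between word vectors. The sketch below shows that mechanism with tiny hand-made vectors; the vectors and the helper names `most_similar` and `odd_one_out` are made up for illustration, so the scores will not match the ones on the slide. In Gensim the corresponding queries are available on a trained model's keyed vectors.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`, best first."""
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda p: p[1], reverse=True)

def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the normalized vectors."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return min(words, key=lambda w: sum(x * y for x, y in zip(unit(vectors[w]), mean)))

# Toy 3-dimensional vectors (assumed values, not from a trained model).
countries = {
    "sweden": [0.9, 0.4, 0.1],
    "norway": [0.8, 0.5, 0.2],
    "finland": [0.7, 0.3, 0.6],
    "estonia": [0.3, 0.2, 0.9],
}
meals = {
    "breakfast": [0.9, 0.1, 0.0],
    "lunch": [0.8, 0.2, 0.1],
    "dinner": [0.85, 0.15, 0.05],
    "cereal": [0.3, 0.9, 0.1],
}

ranking = most_similar("sweden", countries)
outlier = odd_one_out(["breakfast", "cereal", "dinner", "lunch"], meals)
```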
10. New York City
Terms and Conditions
GENSIM Phrases
Never Follow (Audi)
Just do it (Nike)
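Phrase detection of this kind is usually driven by how often two words occur together compared with how often they occur at all. The sketch below is a simplified stand-in written with the standard library, not Gensim's own API; the function name `find_phrases`, the toy sentences, and the `min_count` and `threshold` values are assumptions chosen for the example. The score used is (count(a b) - min_count) * vocabulary size / (count(a) * count(b)).

```python
from collections import Counter

def find_phrases(sentences, min_count=2, threshold=1.0):
    """Return bigrams whose collocation score exceeds `threshold`."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        # Bigrams seen min_count times or fewer score zero or below and are dropped.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            phrases[(a, b)] = score
    return phrases

# Toy corpus (made up for illustration).
sentences = [
    "we flew to new york".split(),
    "new york is big".split(),
    "she moved to new york".split(),
    "the city is big".split(),
]
phrases = find_phrases(sentences)
```

On this toy corpus only ("new", "york") passes the threshold. On real scraped data the same mechanism can surface multi-word terms and slogans, provided they recur often enough.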
11. 1. Average occurrence of a term over all corpora
2. Average occurrence of a term for this scan
3. Same for the other language
Term extraction
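One way to combine the two averages on this slide is to compare a term's relative frequency in the current scan with its relative frequency across all corpora. The slide does not say how the figures are combined, so the ratio used here, the `floor` value, and the names `relative_freq` and `term_scores` are assumptions for illustration.

```python
from collections import Counter

def relative_freq(tokens):
    """Share of all tokens that each distinct token accounts for."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def term_scores(scan_tokens, background_tokens, floor=1e-6):
    """Score = frequency in this scan / frequency over all corpora.

    Scores well above 1 mark words that are unusually common on this site.
    `floor` avoids division by zero for words unseen in the background data.
    """
    scan = relative_freq(scan_tokens)
    background = relative_freq(background_tokens)
    return {t: f / max(background.get(t, 0.0), floor) for t, f in scan.items()}

# Toy token counts (made up for illustration).
background = ["the"] * 50 + ["of"] * 30 + ["and"] * 19 + ["bonds"] * 1
scan = ["the"] * 5 + ["bonds"] * 3 + ["of"] * 2

scores = term_scores(scan, background)
top_term = max(scores, key=scores.get)
```

Running the same scoring on the target-language side (point 3 on the slide) gives a second ranked list, so candidate terms can be checked in both languages.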
12. Detect the specific phrase in the source & target:
Geänderte Segmentberichterstattung erhöht
Aussagekraft.
Improvements gained from changes
in segment reporting.
Term extraction 2
13. Lindschulte-Gruppe => Lindschulte Group
Anleihen => bonds
BKW Energie AG => BKW Energie AG
Revisionsstelle => external auditors
Vizepräsident des Verwaltungsrats => Deputy Chair
staatlichen Fonds => state funds
Die beizulegenden Zeitwerte => The fair values
Stilllegungs- und Entsorgungsfonds => disposal funds
Gebäudetechnik => building technologies
Wertänderung => Value adjustment
Bewertungsverfahren => Level valuations
14. • Speed (tests take multiple hours)
• Insufficient data (>50k TM units helps)
• Bad source data (HTML, JavaScript, etc.)
Problems
Editor's Notes
Cluster documents and classify them by topic
Sentiment Analysis
Recommendations, e.g. CRM
word2vec (via Gensim): shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words
It gets complicated if a segment has multiple extracted phrases.
=> Find other segments with the same phrases
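One possible reading of this note is co-occurrence counting: when a segment contains several extracted phrases, look at every other aligned segment that contains the same source phrase and see which target phrase shows up with it most often. The sketch below assumes phrases have already been extracted per segment; the function name `pair_phrases` and the sample segments are hypothetical, and only the term pairs themselves come from the slides.

```python
from collections import Counter, defaultdict

def pair_phrases(segments):
    """Map each source phrase to the target phrase it co-occurs with most often.

    segments: list of (source_phrases, target_phrases), one entry per aligned TM unit.
    """
    counts = defaultdict(Counter)
    for src_phrases, tgt_phrases in segments:
        for s in set(src_phrases):
            for t in set(tgt_phrases):
                counts[s][t] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical aligned segments; two of them contain more than one phrase.
segments = [
    (["Anleihen", "Revisionsstelle"], ["bonds", "external auditors"]),
    (["Anleihen"], ["bonds"]),
    (["Revisionsstelle", "Gebäudetechnik"], ["external auditors", "building technologies"]),
    (["Gebäudetechnik"], ["building technologies"]),
]
pairing = pair_phrases(segments)
```

A single ambiguous segment cannot be resolved on its own; the counts only become informative once the same phrase appears in several segments, which fits the note that more TM units help.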