This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
For the latest updates go to http://www.statmt.org/mosescore/
or follow us on Twitter - #MosesCore
4. Current MT Programs
• Dell – 27 languages
• Autodesk – 11 languages
• PayPal – 8 languages
• Cisco – 17 languages between 3 tiers
• Intuit – 20+ languages
• Microsoft (pre-project support)
• McAfee (pilot)
… many more in pilot stage
5. MT Program: Path-to-Success
Components
• A set of MT engines – "mix and match"
• TMT Selection Mechanisms
• Post-editing Environment
• Processes and metrics
• Data gathering and reporting tool – what, how much, how fast and at what effort
EDUCATION, EDUCATION, EDUCATION – and CHANGE: the recipe for success
6. Process and Workflow
All aspects of the localization ecosystem are
taken into consideration
• Selecting the right MT engine – by using our MT engine selection Scorecard we make sure all important KPIs are taken into consideration at selection time.
• Empowerment through education – internal, by the use of customized Toolkits; external, through specialised Trainings.
• The feedback loop – constructive communication from post-editor to MT provider.

MT KPIs:
✓ Productivity: Throughputs
✓ Productivity: Delta
✓ Quality: LQA
✓ Quality: Automatic Scores
✓ Cost
✓ GlobalSight: Connectivity
✓ GlobalSight: Tagging
✓ Human Evaluation
✓ Customization: Internal/External
✓ Customization: Time
7. MT Program Design - Source
• Source content classification (i.e. marketing/UI/UA/UGC)
• Length of the source segment
• Source segment morpho-syntactic complexity
• Presence/absence of pre-defined glossary terms or multi-word glossary elements, UI elements, numeric variables, product lists, 'do-not-translate' and transliteration lists
• Tag density – metadata attributes and their representation in localization industry standard formats ("tags")
• ROC – quality levels based on content use ("impact")
3D Model: expected productivity mapped to desired quality levels and source content complexity
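Two of the design inputs above (segment length and tag density) are easy to compute automatically. A minimal sketch, assuming segments with inline XML-style markup; the function name and tag pattern are illustrative, not Welocalize's actual tooling:

```python
import re

TAG = re.compile(r"<[^>]+>")  # inline XML-style markup ("tags")

def profile_segment(segment: str) -> dict:
    """Compute simple source-profiling features for one segment."""
    tags = TAG.findall(segment)
    words = re.findall(r"\w+", TAG.sub(" ", segment))
    tokens = len(words) + len(tags)
    return {
        "word_count": len(words),  # length of the source segment
        "tag_count": len(tags),
        # Tag density: share of tokens that are markup rather than text
        "tag_density": len(tags) / tokens if tokens else 0.0,
    }

print(profile_segment("Click <b>Save</b> to store the file."))
# {'word_count': 6, 'tag_count': 2, 'tag_density': 0.25}
```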
8. MT Engine Selection Scorecard
• Productivity – Throughputs: number of post-edited words per hour
• Productivity – Delta: percentage difference between translation and post-editing time
• Cost: extrapolation, cost per word
• CMS – Connectivity: is there a connector in place?
• Quality/Nature of source
• Quality (Final) – LQA: internal quality verification
• Quality (MT) – Automatic Scores: a set of automatic scoring systems is used

"We have tested and used different engines so we've seen the good, the bad and the ugly; now we can better appreciate what we have."
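The two productivity KPIs on the scorecard reduce to simple formulas. A sketch with illustrative numbers, not real project data:

```python
def throughput(words_post_edited: int, hours: float) -> float:
    """Productivity – Throughputs: post-edited words per hour."""
    return words_post_edited / hours

def productivity_delta(translation_hours: float, post_editing_hours: float) -> float:
    """Productivity – Delta: percentage difference between translation
    and post-editing time for the same volume of work."""
    return 100.0 * (translation_hours - post_editing_hours) / translation_hours

print(throughput(3000, 6.0))          # 500.0 words per hour
print(productivity_delta(10.0, 7.0))  # 30.0 (post-editing is 30% faster)
```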
10. Toolkits and Trainings
Our experience:
✓ Most translators know and have experienced post-editing, but they have limited knowledge of any other related aspect (automatic scoring, output differences between RBMT and SMT...)
✓ The majority of people who work in localization have heard about MT, but most of them still find it a daunting subject.
Our answer:
✓ Continuous MT- and PE-related trainings and documentation for language providers
✓ Customized Toolkits for different internal departments (Production, Quality, Sales, Vendor Management)
11. Transparency and Ownership
• Theory – knowledge foundations
• Practice – customized PE sessions for different client accounts
• Transparency – process, engine selection/customization, evaluations
• Responsibility – valid evaluations, constructive feedback, quality ownership

"Training helps a lot. After I was told some of the background information and tips and tricks for certain engines/outputs, I was much more relaxed and happy to give MT a go."
12. Legacy data – best prediction tool
> Statistics from legacy knowledge base
13. The feedback loop
"For me the biggest advantage would be the possibility to implement a client terminology list [in SMT]."

"I wish we could easily fix the corpus for outdated terminology and characters."

"Teach the engine to properly cope with sentences containing more than one verb and/or verbs in progressive form."

"Engine retraining significantly improved the handling of tags and spaces around tags; this is a productive achievement, as it saves us a lot of manual corrections."
15. “Beyond the Engine” Tools
• Teaminology – crowdsourcing platform for centralized term governance; simultaneous concordance search of TMs and term bases => clean training data
• Dispatcher – a global community content translation application that connects user-generated content (UGC), including live chats, social media, forums, comments and knowledge bases, to customized machine translation (MT) engines for real-time translation
• Source Candidate Scorer – scoring of candidate sentences against historically good and bad sentences based on POS and perplexity
• Corpus Preparation Toolkit – set of applications to maximize data preparation for MT engine training
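As a toy illustration of the perplexity half of the Source Candidate Scorer: an add-one-smoothed unigram model scores a candidate against a reference corpus. The real tool also uses POS features, which are not shown here:

```python
import math
from collections import Counter

def unigram_perplexity(sentence: str, corpus: list) -> float:
    """Perplexity of a sentence under an add-one-smoothed unigram model
    estimated from a reference corpus: lower means more similar."""
    counts = Counter(w for s in corpus for w in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    words = sentence.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

good = ["click save to store the file", "open the file menu"]
bad = ["asdf qwerty zxcv", "lorem ipsum dolor"]
# A candidate resembling the "good" history gets lower (better) perplexity
print(unigram_perplexity("save the file", good) < unigram_perplexity("save the file", bad))  # True
```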
19. Corpus Preparation Suite
Variety of tools to prepare a corpus for training MT engines, such as:
• Deleting formatting tags from TMX
• Removing double spaces
• Removing duplicated punctuation (e.g. commas)
• Deleting segments where source = target
• Deleting segments containing only URLs
• Escaping characters
• Removing duplicate sentences
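A few of the rules above can be sketched in Python; this is a simplified illustration, not the suite's actual implementation:

```python
import re

def clean_corpus(pairs):
    """Apply sample cleaning rules to (source, target) segment pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src = re.sub(r"<[^>]+>", "", src)          # delete formatting tags
        tgt = re.sub(r"<[^>]+>", "", tgt)
        src = re.sub(r"\s{2,}", " ", src).strip()  # remove double spaces
        tgt = re.sub(r"\s{2,}", " ", tgt).strip()
        if src == tgt:                             # drop source = target
            continue
        if re.fullmatch(r"https?://\S+", src):     # drop URL-only segments
            continue
        if src in seen:                            # drop duplicate sentences
            continue
        seen.add(src)
        cleaned.append((src, tgt))
    return cleaned

pairs = [
    ("Click  <b>Save</b>.", "Klicken Sie auf  Speichern."),
    ("http://example.com", "http://example.com"),          # source = target
    ("Click <b>Save</b>.", "Klicken Sie auf Speichern."),  # duplicate after cleanup
]
print(clean_corpus(pairs))  # [('Click Save.', 'Klicken Sie auf Speichern.')]
```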
20. Corpus Preparation: TM Creator
TM Creator – aggregates training data from various relevant sources.
21. Corpus Preparation: TMX Splitter
Extracts the relevant training corpus based on the TMX metadata.
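A metadata-based split can be sketched with Python's standard library; the `x-domain` property below is a hypothetical example of the kind of TMX metadata one might filter on:

```python
import xml.etree.ElementTree as ET

def split_tmx(tmx_text: str, prop_type: str, prop_value: str):
    """Return the <tu> elements whose <prop> metadata matches the filter."""
    root = ET.fromstring(tmx_text)
    selected = []
    for tu in root.iter("tu"):
        for prop in tu.findall("prop"):
            if prop.get("type") == prop_type and (prop.text or "").strip() == prop_value:
                selected.append(tu)
                break
    return selected

TMX = """<tmx version="1.4"><header/><body>
  <tu><prop type="x-domain">UI</prop>
      <tuv xml:lang="en"><seg>Save</seg></tuv></tu>
  <tu><prop type="x-domain">Marketing</prop>
      <tuv xml:lang="en"><seg>Buy now!</seg></tuv></tu>
</body></tmx>"""

ui_units = split_tmx(TMX, "x-domain", "UI")
print(len(ui_units), ui_units[0].find("tuv/seg").text)  # 1 Save
```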
22. Welocalize Moses Implementation
• Why? Far more control over engine quality, since we can control corpus preparation and output post-processing
• Control over metadata handling
• Ties into our company's open-source philosophy
• Experienced personnel in-house
• Can extend and customize Moses functionality as necessary
• Connector to TMS (GlobalSight)

RESULTS: in our internal tests with Moses/DoMT, we are getting automated scores similar to those of commercial engines for the languages into which we localize most. The same feedback was received from human evaluators.
23. … And it works!
We are in a position to offer realistic discounts and aggressive timelines while providing quality levels appropriate for the content.
24. “Work-in-progress” Projects
• Ongoing improvements to our adaptation of iOmegaT tool
(Welocalize/CNGL)
• Industry Partner in CNGL “Source Content Profiler” project
• Adoption of TMTPrime (CNGL) - MT vs. Fuzzy Match selection
mechanism
• Language- and content-specific pre-processing for the in-house Moses deployment
• Teaminology – adding linguistic intelligence