This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
For the latest updates go to http://www.statmt.org/mosescore/
or follow us on Twitter - #MosesCore
4. Current MT Programs
• Dell – 27 languages
• Autodesk – 11 languages
• PayPal – 8 languages
• Cisco – 17 languages between 3 tiers
• Intuit – 20+ languages
• Microsoft (pre-project support)
• McAfee (pilot)
… many more in pilot stage
5. MT Program: Path-to-Success
Components
• A set of MT engines – "mix and match"
• TMT Selection Mechanisms
• Post-editing Environment
• Processes and metrics
• Data gathering and reporting tool – what, how much, how fast and at what effort
EDUCATION, EDUCATION, EDUCATION – and CHANGE: the recipe for success
6. Process and Workflow
All aspects of the localization ecosystem are
taken into consideration
• Selecting the right MT engine – by using our MT engine selection Scorecard we make sure all important KPIs are taken into consideration at selection time.
• Empowerment through education – internal, by the use of customized Toolkits; external, through specialised Trainings.
• The feedback loop – constructive communication from post-editor to MT provider.

MT KPIs:
✓ Productivity: Throughputs
✓ Productivity: Delta
✓ Quality: LQA
✓ Quality: Automatic Scores
✓ Cost
✓ GlobalSight: Connectivity
✓ GlobalSight: Tagging
✓ Human Evaluation
✓ Customization: Internal/External
✓ Customization: Time
7. MT Program Design - Source
• Source content classification (i.e. marketing/UI/UA/UGC)
• Length of the source segment
• Source segment morpho-syntactic complexity
• Presence/absence of pre-defined glossary terms or multi-word glossary elements, UI elements, numeric variables, product lists, 'do-not-translate' and transliteration lists
• Tag density – metadata attributes and their representation in localization industry standard formats ("tags")
• ROC – quality levels based on content use ("impact")
3D Model: expected productivity mapped to desired quality levels and source content complexity
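Two of the design inputs above (segment length and tag density) are easy to compute automatically. A minimal sketch, assuming segments with inline XML-style markup; the function name and tag pattern are illustrative, not Welocalize's actual tooling:

```python
import re

TAG = re.compile(r"<[^>]+>")  # inline XML-style markup ("tags")

def profile_segment(segment: str) -> dict:
    """Compute simple source-profiling features for one segment."""
    tags = TAG.findall(segment)
    words = re.findall(r"\w+", TAG.sub(" ", segment))
    tokens = len(words) + len(tags)
    return {
        "word_count": len(words),  # length of the source segment
        "tag_count": len(tags),
        # Tag density: share of tokens that are markup rather than text
        "tag_density": len(tags) / tokens if tokens else 0.0,
    }

print(profile_segment("Click <b>Save</b> to store the file."))
# {'word_count': 6, 'tag_count': 2, 'tag_density': 0.25}
```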
8. MT Engine Selection Scorecard
• Productivity – Throughputs: number of post-edited words per hour
• Productivity – Delta: percentage difference between translation and post-editing time
• Cost: extrapolation, cost per word
• CMS – Connectivity: is there a connector in place?
• Quality/Nature of source
• Quality (Final) – LQA: internal quality verification
• Quality (MT) – Automatic Scores: a set of automatic scoring systems is used

"We have tested and used different engines so we've seen the good, the bad and the ugly; now we can better appreciate what we have."
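The two productivity KPIs on the scorecard reduce to simple formulas. A sketch with illustrative numbers, not real project data:

```python
def throughput(words_post_edited: int, hours: float) -> float:
    """Productivity – Throughputs: post-edited words per hour."""
    return words_post_edited / hours

def productivity_delta(translation_hours: float, post_editing_hours: float) -> float:
    """Productivity – Delta: percentage difference between translation
    and post-editing time for the same volume of work."""
    return 100.0 * (translation_hours - post_editing_hours) / translation_hours

print(throughput(3000, 6.0))          # 500.0 words per hour
print(productivity_delta(10.0, 7.0))  # 30.0 (post-editing is 30% faster)
```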
10. Toolkits and Trainings
Our experience:
✓ Most translators know and have experienced post-editing, but they have limited knowledge of any other related aspect (automatic scoring, output differences between RBMT and SMT...)
✓ The majority of people who work in localization have heard about MT, but most of them still find it a daunting subject.
Our answer:
✓ Continuous MT- and PE-related trainings and documentation for language providers
✓ Customized Toolkits for different internal departments (Production, Quality, Sales, Vendor Management)
11. Transparency and Ownership
• Theory – knowledge foundations
• Practice – customized PE sessions for different client accounts
• Transparency – process, engine selection/customization, evaluations
• Responsibility – valid evaluations, constructive feedback, quality ownership

"Training helps a lot. After I was told some of the background information and tips and tricks for certain engines/outputs, I was much more relaxed and happy to give MT a go."
12. Legacy data – best prediction tool
> Statistics from legacy knowledge base
13. The feedback loop
"For me the biggest advantage would be the possibility to implement a client terminology list [in SMT]."

"I wish we could easily fix the corpus for outdated terminology and characters."

"Teach the engine to properly cope with sentences containing more than one verb and/or verbs in progressive form."

"Engine retraining significantly improved the handling of tags and spaces around tags; this is a productive achievement, as it saves us a lot of manual corrections."
15. “Beyond the Engine” Tools
• Teaminology – crowdsourcing platform for centralized term governance; simultaneous concordance search of TMs and term bases => clean training data
• Dispatcher – a global community content translation application that connects user-generated content (UGC), including live chats, social media, forums, comments and knowledge bases, to customized machine translation (MT) engines for real-time translation
• Source Candidate Scorer – scoring of candidate sentences against historically good and bad sentences based on POS and perplexity
• Corpus Preparation Toolkit – set of applications to maximize data preparation for MT engine training
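As a toy illustration of the perplexity half of the Source Candidate Scorer: an add-one-smoothed unigram model scores a candidate against a reference corpus. The real tool also uses POS features, which are not shown here:

```python
import math
from collections import Counter

def unigram_perplexity(sentence: str, corpus: list) -> float:
    """Perplexity of a sentence under an add-one-smoothed unigram model
    estimated from a reference corpus: lower means more similar."""
    counts = Counter(w for s in corpus for w in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    words = sentence.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

good = ["click save to store the file", "open the file menu"]
bad = ["asdf qwerty zxcv", "lorem ipsum dolor"]
# A candidate resembling the "good" history gets lower (better) perplexity
print(unigram_perplexity("save the file", good) < unigram_perplexity("save the file", bad))  # True
```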
19. Corpus Preparation Suite
Variety of tools to prepare a corpus for training MT engines, such as:
• Deleting formatting tags from TMX
• Removing double spaces
• Removing duplicated punctuation (e.g. commas)
• Deleting segments where source = target
• Deleting segments containing only URLs
• Escaping characters
• Removing duplicate sentences
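A few of the rules above can be sketched in Python; this is a simplified illustration, not the suite's actual implementation:

```python
import re

def clean_corpus(pairs):
    """Apply sample cleaning rules to (source, target) segment pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src = re.sub(r"<[^>]+>", "", src)          # delete formatting tags
        tgt = re.sub(r"<[^>]+>", "", tgt)
        src = re.sub(r"\s{2,}", " ", src).strip()  # remove double spaces
        tgt = re.sub(r"\s{2,}", " ", tgt).strip()
        if src == tgt:                             # drop source = target
            continue
        if re.fullmatch(r"https?://\S+", src):     # drop URL-only segments
            continue
        if src in seen:                            # drop duplicate sentences
            continue
        seen.add(src)
        cleaned.append((src, tgt))
    return cleaned

pairs = [
    ("Click  <b>Save</b>.", "Klicken Sie auf  Speichern."),
    ("http://example.com", "http://example.com"),          # source = target
    ("Click <b>Save</b>.", "Klicken Sie auf Speichern."),  # duplicate after cleanup
]
print(clean_corpus(pairs))  # [('Click Save.', 'Klicken Sie auf Speichern.')]
```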
20. Corpus Preparation: TM Creator
TM Creator – aggregates training data from various relevant sources.
21. Corpus Preparation: TMX Splitter
Extracts the relevant training corpus based on the TMX metadata.
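A metadata-based split can be sketched with Python's standard library; the `x-domain` property below is a hypothetical example of the kind of TMX metadata one might filter on:

```python
import xml.etree.ElementTree as ET

def split_tmx(tmx_text: str, prop_type: str, prop_value: str):
    """Return the <tu> elements whose <prop> metadata matches the filter."""
    root = ET.fromstring(tmx_text)
    selected = []
    for tu in root.iter("tu"):
        for prop in tu.findall("prop"):
            if prop.get("type") == prop_type and (prop.text or "").strip() == prop_value:
                selected.append(tu)
                break
    return selected

TMX = """<tmx version="1.4"><header/><body>
  <tu><prop type="x-domain">UI</prop>
      <tuv xml:lang="en"><seg>Save</seg></tuv></tu>
  <tu><prop type="x-domain">Marketing</prop>
      <tuv xml:lang="en"><seg>Buy now!</seg></tuv></tu>
</body></tmx>"""

ui_units = split_tmx(TMX, "x-domain", "UI")
print(len(ui_units), ui_units[0].find("tuv/seg").text)  # 1 Save
```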
22. Welocalize Moses Implementation
• Why? Far more control over engine quality, since we can control corpus preparation and output post-processing
• Control over metadata handling
• Ties into our company's open-source philosophy
• Experienced personnel in-house
• Can extend and customize Moses functionality as necessary
• Connector to TMS (GlobalSight)

RESULTS: in our internal tests with Moses/DoMT, we are getting automated scores similar to those of commercial engines for the languages into which we localize most. The same feedback was received from human evaluators.
23. … And it works!
We are in a position to offer realistic discounts and aggressive timelines while providing quality levels appropriate for the content.
24. “Work-in-progress” Projects
• Ongoing improvements to our adaptation of iOmegaT tool
(Welocalize/CNGL)
• Industry Partner in CNGL “Source Content Profiler” project
• Adoption of TMTPrime (CNGL) - MT vs. Fuzzy Match selection
mechanism
• Language- and content-specific pre-processing for the in-house Moses deployment
• Teaminology – adding linguistic intelligence