Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, CrowdTruth, based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.
Truth is a Lie: 7 Myths about Human Annotation @CogComputing Forum 2014
1. Truth is a Lie – CrowdTruth: The 7 Myths of Human Annotation
Lora Aroyo
6. Take Home Message
Human annotation of semantic interpretation tasks is a critical part of cognitive systems engineering:
– standard practice is based on an antiquated ideal of a single correct truth
– 7 myths of human annotation
– a new theory of truth: CrowdTruth
7. I amar prestar aen... ("The world is changed")
• the amount of data & scale of computation available have increased by a previously inconceivable amount
• CS & AI moved out of thought problems to empirical science
• current methods pre-date this fundamental shift
• the ideal of "one truth" is a lie
• crowdsourcing & semantics together correct the fallacy and improve analytic systems
The world has changed: there is a need to form a new theory of truth – appropriate to cognitive systems.
8. Semantic Interpretation
Semantic interpretation is needed in all sciences:
– Data abstracted into categories
– Patterns, correlations, associations & implications are extracted
Cognitive Computing: providing some way of scalable semantic interpretation.
9. Traditional Human Annotation
• Humans analyze examples: annotations for ground truth = the correct output for each example
• Machines learn from the examples
• Ground Truth Quality:
– measured by inter-annotator agreement
– founded on the ideal of a single, universally constant truth
– high agreement = high quality
– disagreement must be eliminated
Current gold standard acquisition & quality evaluation are outdated.
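The conventional quality measure named above, inter-annotator agreement, is commonly computed as pairwise percent agreement over the labels collected for one example. A minimal sketch (the label names and vote list are hypothetical):

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of annotator pairs that assigned the same label
    to a single example (conventional ground-truth quality metric)."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single annotator trivially agrees with itself
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical votes from five annotators on one sentence:
votes = ["treats", "treats", "treats", "other", "treats"]
print(pairwise_agreement(votes))  # 6 of 10 pairs agree -> 0.6
```

Under the "one truth" ideal a low score like this marks the example as low quality; the deck's argument is that it is instead signal about the example itself.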
10. Need for Change
• Cognitive Computing increases the need for machines to handle the scale of data
• Results in an increasing need for new gold standards able to measure machine performance on tasks that require semantic interpretation
The New Ground Truth is CrowdTruth.
11. 7 Myths
• One truth: data collection efforts assume one correct interpretation for every example
• All examples are created equal: ground truth treats all examples the same – either match the correct result or not
• Detailed guidelines help: if examples cause disagreement – add instructions to limit interpretations
• Disagreement is bad: increase quality of annotation data by reducing disagreement among the annotators
• One is enough: most of the annotated examples are evaluated by one person
• Experts are better: annotators with domain knowledge provide better annotations
• Once done, forever valid: annotations are not updated; new data not aligned with previous
These myths directly influence the practice of collecting human-annotated data; they need to be revisited in the context of a changing world & in the face of a new theory of truth (CrowdTruth).
12. 1. One Truth – What if there are MORE?
Current ground truth collection efforts assume one correct interpretation for every example.
The ideal of one truth is a fallacy for semantic interpretation and needs to be changed.
13. Which is the mood most appropriate for each song? Choose one: one truth?
Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable, good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense, anxious, intense, volatile, visceral
Other: does not fit into any of the 5 clusters
(Lee and Hu 2012)
14. 2. All Examples Are Created Equal – What if they are DIFFERENT?
• typically annotators are asked whether a binary property holds for each example
• often not given a chance to say that the property may partially hold, or holds but is not clearly expressed
• the mathematics of using ground truth treats every example the same – either match the correct result or not
• poor quality examples tend to generate high disagreement
Disagreement allows us to weight sentences = the ability to train & evaluate a machine more flexibly.
15. Is the TREAT relation expressed between the highlighted terms?
ANTIBIOTICS are the first line treatment for indications of TYPHUS. → clearly treats
With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS. → less clearly treats
Equal training data? Disagreement can indicate vagueness & ambiguity of sentences.
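The sentence weighting suggested above can be sketched as the fraction of crowd workers who marked the relation in a sentence: clear sentences get a weight near 1, ambiguous ones near 0.5. The vote counts below are hypothetical:

```python
def sentence_weight(votes, relation):
    """Weight a training sentence by the fraction of workers
    who judged that the relation is expressed in it."""
    return votes.count(relation) / len(votes)

# Hypothetical crowd votes for the two TYPHUS sentences:
clear_votes = ["treats"] * 19 + ["other"]       # clearly expresses TREAT
vague_votes = ["treats"] * 10 + ["other"] * 10  # ambiguous sentence
print(sentence_weight(clear_votes, "treats"))   # 0.95
print(sentence_weight(vague_votes, "treats"))   # 0.5
```

A learner can then scale each example's loss by this weight, training and evaluating more flexibly instead of discarding disputed sentences.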
16. 3. Detailed Guidelines Help – What if they HURT?
• Low annotator agreement is addressed by detailed guidelines for annotators to consistently handle the cases that generate disagreement
• Perfuming agreement scores by forcing annotators to make choices they may think are not valid
• Removes potential signal on examples that are ambiguous
Precise annotation guidelines do eliminate disagreement, but do not increase quality.
17. Which mood cluster is most appropriate for a song?
Instructions: Your task is to listen to the following 30-second music clips and select the most appropriate mood cluster that represents the mood of the music. Try to think about the mood carried by the music and please try to ignore any lyrics. If you feel the music does not fit into any of the 5 clusters please select "Other". The descriptions of the clusters are provided in the panel at the top of the page for your reference. Answer the questions carefully. Your work will not be accepted if your answers are inconsistent and/or incomplete.
Restricting guidelines help? Disagreement can indicate problems with the task. (Lee and Hu 2012)
18. 4. Disagreement is Bad – What if it is GOOD?
• traditionally, disagreement is considered a measure of poor quality because:
– the task is poorly defined, or
– annotators lack training
• rather than being accepted as a natural property of semantic interpretation
This makes the elimination of disagreement the GOAL.
19. Does each sentence express the TREAT relation?
ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 95%
Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects. → agreement 80%
With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS. → agreement 50%
Disagreement bad? Disagreement can reflect the degree of clarity in a sentence.
20. 5. One is Enough – What if it is NOT ENOUGH?
• over 90% of annotated examples are seen by 1-2 annotators
• a small number overlap – to measure agreement
Five or six popular interpretations can't be captured by one or two people.
21. One Quality?
Accumulated results for each relation across all the sentences:
20 workers/sentence (and higher) yields the same relative disagreement.
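With many workers per sentence, votes can be aggregated into a per-sentence vector of label counts, and a disagreement-aware score for one relation taken as the cosine between that vector and the relation's unit vector, in the spirit of the CrowdTruth metrics. A minimal sketch (the counts and label names are hypothetical):

```python
import math

def relation_score(sentence_vector, relation):
    """Cosine between a sentence's aggregate annotation vector
    and the unit vector for one relation: 1.0 means unanimous,
    lower values quantify disagreement rather than discarding it."""
    norm = math.sqrt(sum(v * v for v in sentence_vector.values()))
    return sentence_vector.get(relation, 0) / norm if norm else 0.0

# Hypothetical counts from 20 workers on one sentence:
counts = {"treats": 12, "causes": 4, "none": 4}
print(round(relation_score(counts, "treats"), 2))  # 0.9
```

Because the score is a ratio of counts, adding workers beyond ~20 leaves it roughly stable, matching the slide's observation about relative disagreement.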
22. 6. Experts Are Better – What if the CROWD IS BETTER?
• conventional wisdom: human annotators with domain knowledge provide better annotated data, e.g.
– medical texts should be annotated by medical experts
• but experts are expensive & don't scale
Multiple perspectives on data can be useful, beyond what experts believe is salient or correct.
23. What is the (medical) relation between the highlighted (medical) terms?
• 91% of expert annotations covered by the crowd
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this expert annotation agreement
Experts better than crowd?
24. 7. Once Done, Forever Valid – What if VALIDITY CHANGES?
• perspectives change over time – old training data might contain examples that are not valid, or only partially valid, later
• continuous collection of training data over time allows the adaptation of gold standards to changing times:
– popularity of music
– levels of education
25. Which are mentions of terrorists in this sentence?
OSAMA BIN LADEN used money from his own construction company to support the MUHAJADEEN in Afghanistan against Soviet forces.
Forever valid? 1990: hero; 2011: terrorist.
Both types should be valid – two roles for the same entity – adaptation of gold standards to changing times.
27. • annotator disagreement is signal, not noise
• it is indicative of the variation in human semantic interpretation of signs
• it can indicate ambiguity, vagueness, similarity, over-generality, as well as quality
crowdtruth.org