CrowdTruth Measures for Language Ambiguity: The Case of Medical Relation Extraction. Anca Dumitrache, Lora Aroyo and Chris Welty (http://oak.dcs.shef.ac.uk/ld4ie2015/LD4IE2015/Program.html)
1. CrowdTruth Measures for Language Ambiguity: Medical Relation Extraction
Anca Dumitrache, Lora Aroyo, Chris Welty
http://CrowdTruth.org
Linked Data for Information Extraction @ ISWC2015
#CrowdTruth @anouk_anca @laroyo @cawelty #LD4IE2015
2. Background
• Most knowledge is in text, but it's not structured
• Linked Data sources are a good start, but incomplete
• Goal (Distant Supervision):
– extract LD triples from text
– given existing tuples, find sentences that mention both args
– use the resulting sentences as true positives (TP) to train a classifier (see the sketch after this slide)
• But this can sometimes be wrong:
– <PALPATION> location <CHEST>
– "feeling the way the CHEST expands (PALPATION) can identify areas of lung that are full of fluid"
• Standard approach: Expert Annotation
http://CrowdTruth.org
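Below is a minimal sketch of the distant-supervision labeling step described above, assuming a toy list of existing (arg1, relation, arg2) tuples and plain substring matching over example sentences; the triples, sentences, and matching heuristic are illustrative assumptions, not the actual pipeline used in the paper.

```python
# Toy distant supervision: sentences mentioning both arguments of a known
# tuple are taken as (possibly noisy) positive training examples.

known_tuples = [
    ("PALPATION", "location", "CHEST"),        # illustrative existing LD tuple
    ("ANTIBIOTICS", "treat", "TYPHUS"),
]

sentences = [
    "Feeling the way the CHEST expands (PALPATION) can identify areas of lung full of fluid.",
    "ANTIBIOTICS are the first line treatment for indications of TYPHUS.",
    "With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS.",
]

def distant_supervision(tuples, sentences):
    """Return (sentence, relation) pairs where both arguments co-occur."""
    labeled = []
    for sent in sentences:
        for arg1, rel, arg2 in tuples:
            if arg1 in sent and arg2 in sent:
                labeled.append((sent, rel))  # may be a false positive!
    return labeled

for sent, rel in distant_supervision(known_tuples, sentences):
    print(rel, "->", sent)
```

The third sentence above shows the failure mode named on the slide: both arguments co-occur, yet the sentence does not actually express the relation.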
3. Human Annotation Myth: Experts Know Best
• Human annotators with domain knowledge provide better annotated data, e.g.:
– if you want medical texts annotated for medical relations, you need medical experts
• But experts are expensive & don't scale
• Multiple perspectives on data can be useful, beyond what experts believe is salient or correct
What if the CROWD IS BETTER?
http://CrowdTruth.org
4. Experts Know Best?
What is the relation between the highlighted terms?
"He was the first physician to identify the relationship between HEMOPHILIA and HEMOPHILIC ARTHROPATHY."
experts: cause
crowd: no relation
The crowd reads the text literally, providing better examples to the machine
http://CrowdTruth.org
5. Experts Know Best? Experts vs. crowd
What is the (medical) relation between the highlighted (medical) terms?
• 91% of expert annotations are covered by the crowd
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this expert annotation agreement
http://CrowdTruth.org
6. Human Annotation Myth: Disagreement is Bad
• traditionally, disagreement is considered a measure of poor quality in the annotation task, rather than accepted as a natural property of semantic interpretation, because:
– the task is poorly defined, or
– annotators lack training
• This makes the elimination of disagreement a goal
What if it is GOOD?
http://CrowdTruth.org
7. Disagreement Bad?
Does each sentence express the TREAT relation?
"ANTIBIOTICS are the first line treatment for indications of TYPHUS." → agreement 95%
"Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects." → agreement 80%
"With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS." → agreement 50%
Disagreement can reflect the degree of clarity in a sentence
http://CrowdTruth.org
8. CrowdTruth
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
http://CrowdTruth.org
9. CrowdTruth for medical relation extraction
• Goal:
– collect a Medical RelEx Gold Standard
– improve the performance of a RelEx Classifier
• Approach:
– crowdsource 900 medical sentences
– measure disagreement with the CrowdTruth Metrics
– train & evaluate the classifier with the CrowdTruth SRS Score
http://CrowdTruth.org
10. RelEx TASK in CrowdFlower
"Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1"
Is ACUTE FEVER – related to → INFLUENZA AH1N1?
http://CrowdTruth.org
13. Sentence Clarity
An unclear relationship between the two arguments is reflected in the disagreement
http://CrowdTruth.org
14. Sentence Clarity
A clearly expressed relation between the two arguments is reflected in the agreement
http://CrowdTruth.org
15. Sentence-Relation Score (SRS)
Measures how clearly a sentence expresses a relation
Sentence vector: [0 1 1 0 0 4 3 0 0 5 1 0]
Unit vector for relation R6
Cosine = 0.55 (see the sketch below)
http://CrowdTruth.org
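A small sketch of the SRS computation as described on this slide: the sentence vector collects the crowd's annotation counts per relation, and the SRS for a relation is the cosine between that vector and the relation's unit vector. The vector below is the one from the slide; indexing the relations R1..R12 left to right is an assumption, and the code is an illustrative reconstruction rather than the CrowdTruth implementation.

```python
import numpy as np

def sentence_relation_score(sentence_vector, relation_index):
    """Cosine between the sentence vector and the unit vector of one relation."""
    v = np.asarray(sentence_vector, dtype=float)
    unit = np.zeros_like(v)
    unit[relation_index] = 1.0
    return float(v @ unit / (np.linalg.norm(v) * np.linalg.norm(unit)))

# Sentence vector from the slide: crowd annotation counts over 12 relations.
sent_vec = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
print(round(sentence_relation_score(sent_vec, relation_index=5), 2))  # R6 -> 0.55
```

With the slide's vector this reproduces the value shown: 4 / sqrt(53) ≈ 0.55.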
16. Annotation Quality of Expert vs. Crowd Annotations
[Figure: crowd 0.907 (p = 0.007) vs. expert 0.844]
http://CrowdTruth.org
17. Annotation Quality of Expert vs. Crowd Annotations
[Figure: crowd 0.907 (p = 0.007) vs. expert 0.844]
In the [0.6 - 0.8] threshold range the crowd significantly out-performs the expert, with a max F1 of 0.907 at the 0.7 threshold
http://CrowdTruth.org
18. Weighted Precision*
• Normally, P = TP / (TP + FP)
• Intuition:
– some sentences make better examples
– it is more important to get the clear cases right
– but P normally treats all examples as equal
• We propose: weight P with the sentence-relation score (SRS), as sketched below:
P_W = Σᵢ (TPᵢ × SRSᵢ) / [ Σᵢ (TPᵢ × SRSᵢ) + Σᵢ (FPᵢ × SRSᵢ) ]
*and similarly for F1, Recall, and Accuracy
http://CrowdTruth.org
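A short sketch of the weighted precision above, assuming each predicted-positive example carries a boolean true-positive flag and its SRS; this data layout is hypothetical, chosen only to illustrate the formula.

```python
def weighted_precision(examples):
    """examples: list of (is_true_positive, srs) pairs for predicted positives.
    Each true/false positive contributes its sentence-relation score (SRS)
    instead of a count of 1, so unclear sentences weigh less."""
    tp_w = sum(srs for is_tp, srs in examples if is_tp)
    fp_w = sum(srs for is_tp, srs in examples if not is_tp)
    return tp_w / (tp_w + fp_w)

# Two clear true positives and one ambiguous false positive:
print(weighted_precision([(True, 0.9), (True, 0.8), (False, 0.2)]))  # ~0.89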
19. CrowdTruth SRS Score as a Weight for Annotation Quality
F1          Unweighted   Weighted
Crowd@.5    0.8382       0.9329
Crowd@.7    0.9074       0.9626
Expert      0.8444       0.8611
Single      0.6637       0.7344
Baseline    0.6559       0.6891
Sentences with a lot of disagreement weigh less
http://CrowdTruth.org
20. RelEx CAUSE Classifier for Crowd & Expert: Weighted vs. Unweighted F1 Score
[Figure: Crowd 0.658, Expert 0.638]
Weighted F1 scores are higher at any given threshold
http://CrowdTruth.org
21. RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations
[Figure: crowd F1 = 0.642 (p = 0.016) vs. expert F1 = 0.638]
http://CrowdTruth.org
22. RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations
[Figure: crowd F1 = 0.642 (p = 0.016) vs. expert F1 = 0.638]
The crowd provides training data that is at least as good as, if not better than, the experts'
http://CrowdTruth.org
24. Experiments showed:
• the crowd can build a ground truth
• it performs just as well as medical experts
• the crowd is also cheaper
• the crowd is always available
• the CrowdTruth SRS score can be used as a weight
• improved F1 scores for both crowd and expert ground truths
• CrowdTruth = a solution to a Clinical NLP Challenge: the lack of ground truth for training & benchmarking
http://CrowdTruth.org