CrowdTruth Measures for Language Ambiguity: The Case of Medical Relation Extraction. Anca Dumitrache, Lora Aroyo and Chris Welty (http://oak.dcs.shef.ac.uk/ld4ie2015/LD4IE2015/Program.html)
1. CrowdTruth Measures for Language Ambiguity: Medical Relation Extraction
Anca Dumitrache, Lora Aroyo, Chris Welty
http://CrowdTruth.org
Linked Data for Information Extraction @ ISWC2015
#CrowdTruth @anouk_anca @laroyo @cawelty #LD4IE2015
2. Background
• Most knowledge is in text, but it's not structured
• Linked Data sources are a good start, but incomplete
• Goal (Distant Supervision):
– extract LD triples from text
– given existing tuples, find sentences that mention both args
– use the resulting sentences as true positives (TP) to train a classifier (see the sketch after this slide)
• But this can sometimes be wrong:
– <PALPATION> location <CHEST>
– "feeling the way the CHEST expands (PALPATION) can identify areas of lung that are full of fluid"
• Standard approach: Expert Annotation
http://CrowdTruth.org
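Below is a minimal sketch of the distant-supervision labeling step described above, assuming a toy list of existing (arg1, relation, arg2) tuples and plain substring matching over example sentences; the triples, sentences, and matching heuristic are illustrative assumptions, not the actual pipeline used in the paper.

```python
# Toy distant supervision: sentences mentioning both arguments of a known
# tuple are taken as (possibly noisy) positive training examples.

known_tuples = [
    ("PALPATION", "location", "CHEST"),        # illustrative existing LD tuple
    ("ANTIBIOTICS", "treat", "TYPHUS"),
]

sentences = [
    "Feeling the way the CHEST expands (PALPATION) can identify areas of lung full of fluid.",
    "ANTIBIOTICS are the first line treatment for indications of TYPHUS.",
    "With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS.",
]

def distant_supervision(tuples, sentences):
    """Return (sentence, relation) pairs where both arguments co-occur."""
    labeled = []
    for sent in sentences:
        for arg1, rel, arg2 in tuples:
            if arg1 in sent and arg2 in sent:
                labeled.append((sent, rel))  # may be a false positive!
    return labeled

for sent, rel in distant_supervision(known_tuples, sentences):
    print(rel, "->", sent)
```

The third sentence above shows the failure mode named on the slide: both arguments co-occur, yet the sentence does not actually express the relation.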
3. Human Annotation Myth: Experts Know Best
• Human annotators with domain knowledge provide better annotated data, e.g.:
– if you want medical texts annotated for medical relations, you need medical experts
• But experts are expensive & don't scale
• Multiple perspectives on data can be useful, beyond what experts believe is salient or correct
What if the CROWD IS BETTER?
http://CrowdTruth.org
4. Experts Know Best?
What is the relation between the highlighted terms?
"He was the first physician to identify the relationship between HEMOPHILIA and HEMOPHILIC ARTHROPATHY."
experts: cause
crowd: no relation
The crowd reads the text literally, providing better examples to the machine
http://CrowdTruth.org
5. Experts Know Best? Experts vs. crowd
What is the (medical) relation between the highlighted (medical) terms?
• 91% of expert annotations are covered by the crowd
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this expert annotation agreement
http://CrowdTruth.org
6. Human Annotation Myth: Disagreement is Bad
• traditionally, disagreement is considered a measure of poor quality in the annotation task, rather than accepted as a natural property of semantic interpretation, because:
– the task is poorly defined, or
– annotators lack training
• This makes the elimination of disagreement a goal
What if it is GOOD?
http://CrowdTruth.org
7. Disagreement Bad?
Does each sentence express the TREAT relation?
"ANTIBIOTICS are the first line treatment for indications of TYPHUS." → agreement 95%
"Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects." → agreement 80%
"With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS." → agreement 50%
Disagreement can reflect the degree of clarity in a sentence
http://CrowdTruth.org
8. CrowdTruth
• Annotator disagreement is signal, not noise.
• It is indicative of the variation in human semantic interpretation of signs.
• It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality.
http://CrowdTruth.org
9. CrowdTruth for medical relation extraction
• Goal:
– collect a Medical RelEx Gold Standard
– improve the performance of a RelEx Classifier
• Approach:
– crowdsource 900 medical sentences
– measure disagreement with the CrowdTruth Metrics
– train & evaluate the classifier with the CrowdTruth SRS Score
http://CrowdTruth.org
10. RelEx TASK in CrowdFlower
"Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1"
Is ACUTE FEVER – related to → INFLUENZA AH1N1?
http://CrowdTruth.org
13. Sentence Clarity
An unclear relationship between the two arguments is reflected in the disagreement
http://CrowdTruth.org
14. Sentence Clarity
A clearly expressed relation between the two arguments is reflected in the agreement
http://CrowdTruth.org
15. Sentence-Relation Score (SRS)
Measures how clearly a sentence expresses a relation
Sentence vector: [0 1 1 0 0 4 3 0 0 5 1 0]
Unit vector for relation R6
Cosine = 0.55 (see the sketch below)
http://CrowdTruth.org
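A small sketch of the SRS computation as described on this slide: the sentence vector collects the crowd's annotation counts per relation, and the SRS for a relation is the cosine between that vector and the relation's unit vector. The vector below is the one from the slide; indexing the relations R1..R12 left to right is an assumption, and the code is an illustrative reconstruction rather than the CrowdTruth implementation.

```python
import numpy as np

def sentence_relation_score(sentence_vector, relation_index):
    """Cosine between the sentence vector and the unit vector of one relation."""
    v = np.asarray(sentence_vector, dtype=float)
    unit = np.zeros_like(v)
    unit[relation_index] = 1.0
    return float(v @ unit / (np.linalg.norm(v) * np.linalg.norm(unit)))

# Sentence vector from the slide: crowd annotation counts over 12 relations.
sent_vec = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
print(round(sentence_relation_score(sent_vec, relation_index=5), 2))  # R6 -> 0.55
```

With the slide's vector this reproduces the value shown: 4 / sqrt(53) ≈ 0.55.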
16. Annotation Quality of Expert vs. Crowd Annotations
[Figure: crowd 0.907 (p = 0.007) vs. expert 0.844]
http://CrowdTruth.org
17. Annotation Quality of Expert vs. Crowd Annotations
[Figure: crowd 0.907 (p = 0.007) vs. expert 0.844]
In the [0.6 - 0.8] threshold range the crowd significantly out-performs the expert, with a max F1 of 0.907 at the 0.7 threshold
http://CrowdTruth.org
18. Weighted Precision*
• Normally, P = TP / (TP + FP)
• Intuition:
– some sentences make better examples
– it is more important to get the clear cases right
– but P normally treats all examples as equal
• We propose: weight P with the sentence-relation score (SRS), as sketched below:
P_W = Σᵢ (TPᵢ × SRSᵢ) / [ Σᵢ (TPᵢ × SRSᵢ) + Σᵢ (FPᵢ × SRSᵢ) ]
*and similarly for F1, Recall, and Accuracy
http://CrowdTruth.org
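A short sketch of the weighted precision above, assuming each predicted-positive example carries a boolean true-positive flag and its SRS; this data layout is hypothetical, chosen only to illustrate the formula.

```python
def weighted_precision(examples):
    """examples: list of (is_true_positive, srs) pairs for predicted positives.
    Each true/false positive contributes its sentence-relation score (SRS)
    instead of a count of 1, so unclear sentences weigh less."""
    tp_w = sum(srs for is_tp, srs in examples if is_tp)
    fp_w = sum(srs for is_tp, srs in examples if not is_tp)
    return tp_w / (tp_w + fp_w)

# Two clear true positives and one ambiguous false positive:
print(weighted_precision([(True, 0.9), (True, 0.8), (False, 0.2)]))  # ~0.89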
19. CrowdTruth SRS Score as a Weight for Annotation Quality
F1          Unweighted   Weighted
Crowd@.5    0.8382       0.9329
Crowd@.7    0.9074       0.9626
Expert      0.8444       0.8611
Single      0.6637       0.7344
Baseline    0.6559       0.6891
Sentences with a lot of disagreement weigh less
http://CrowdTruth.org
20. RelEx CAUSE Classifier for Crowd & Expert: Weighted vs. Unweighted F1 Score
[Figure: Crowd 0.658, Expert 0.638]
Weighted F1 scores are higher at any given threshold
http://CrowdTruth.org
21. RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations
[Figure: crowd F1 = 0.642 (p = 0.016) vs. expert F1 = 0.638]
http://CrowdTruth.org
22. RelEx CAUSE Classifier F1 for Crowd vs. Expert Annotations
[Figure: crowd F1 = 0.642 (p = 0.016) vs. expert F1 = 0.638]
The crowd provides training data that is at least as good as, if not better than, the experts'
http://CrowdTruth.org
24. Experiments showed:
• the crowd can build a ground truth
• it performs just as well as medical experts
• the crowd is also cheaper
• the crowd is always available
• the CrowdTruth SRS score can be used as a weight
• improved F1 scores for both crowd and expert ground truths
• CrowdTruth = a solution to a Clinical NLP Challenge: the lack of ground truth for training & benchmarking
http://CrowdTruth.org