Cheap, Fast, and Good? Evaluating Nonexpert Annotations for NLP Tasks

Cheap and Fast - But is it Good?
Evaluating Nonexpert Annotations
for Natural Language Tasks

Rion Snow Brendan O’Connor Daniel Jurafsky Andrew Y. Ng

The primacy of data

(Banko and Brill, 2001):
Scaling to Very Very Large Corpora
for Natural Language Disambiguation

Datasets drive research
• statistical parsing: Penn Treebank
• semantic role labeling: PropBank
• word sense disambiguation: WordNet, SemCor
• speech recognition: Switchboard
• textual entailment: Pascal RTE
• statistical machine translation: UN Parallel Text

The advent of human
computation

• Open Mind Common Sense (Singh et al., 2002)
• Games with a Purpose (von Ahn and Dabbish, 2004)
• Online Word Games (Vickrey et al., 2008)

Amazon Mechanical Turk
But what if your task isn’t “fun”?

mturk.com

Using AMT for dataset
creation
• Su et al. (2007): name resolution, attribute extraction

• Nakov (2008): paraphrasing noun compounds

• Kaisser and Lowe (2008): sentence-level QA annotation

• Kaisser et al. (2008): customizing QA summary length

• Zaenen (2008): evaluating RTE agreement

Using AMT is cheap
Paper                    Labels  Cents/Label
Su et al. (2007)         10,500  1.5
Nakov (2008)             19,018  unreported
Kaisser and Lowe (2008)  24,321  2.0
Kaisser et al. (2008)    45,300  3.7
Zaenen (2008)             4,000  2.0

And it’s fast...

blog.doloreslabs.com

But is it good?
• Objective: compare nonexpert annotation
quality on NLP tasks with gold standard,
expert-annotated data
• Method: pick 5 standard datasets, and
relabel each point with 10 new annotations
• Compare Turk agreement to dataset with
reported expert interannotator agreement

Tasks
• Affect Recognition: fear(“Tropical storm forms in Atlantic”) > fear(“Goal delight for Sheva”)
  • Strapparava and Mihalcea (2007)
• Word Similarity: sim(boy, lad) > sim(rooster, noon)
  • Miller and Charles (1991)
• Textual Entailment: if “Microsoft was established in Italy in 1985”, then “Microsoft was established in 1985”?
  • Dagan et al. (2006)
• WSD: “a bass on the line” vs. “a funky bass line”
  • Pradhan et al. (2007)
• Temporal Annotation: ran happens before fell in “The horse ran past the barn fell.”
  • Pustejovsky et al. (2003)

Tasks
Task                 Expert Labelers  Unique Examples  Interannotator Agreement  Answer Type
Affect Recognition   6                700              0.603                     numeric
Word Similarity      1                30               0.958                     numeric
Textual Entailment   1                800              0.91                      binary
Temporal Annotation  1                462              Unknown                   binary
WSD                  1                177              Unknown                   ternary

Interannotator Agreement
• 6 total experts.
• One expert’s ITA is calculated as the average of Pearson correlations from each annotator to the avg. of the other 5 annotators.

Emotion   1-E ITA
Anger     0.459
Disgust   0.583
Fear      0.711
Joy       0.596
Sadness   0.645
Surprise  0.464
Valence   0.844
All       0.603
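
The leave-one-out ITA described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy scores are assumptions made up for the example, and only the Python standard library is used.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def leave_one_out_ita(ratings):
    """ratings[a][i] is annotator a's score for item i.

    Returns the mean, over annotators, of the correlation between that
    annotator and the item-wise average of all the other annotators.
    """
    correlations = []
    for a, own in enumerate(ratings):
        others = [r for b, r in enumerate(ratings) if b != a]
        others_avg = [mean(col) for col in zip(*others)]
        correlations.append(pearson(own, others_avg))
    return mean(correlations)

# Made-up scores for 3 annotators on 4 headlines (not data from the study).
toy = [
    [10, 40, 70, 90],
    [20, 30, 80, 85],
    [5, 50, 60, 95],
]
ita = leave_one_out_ita(toy)
```

With 6 experts, each annotator is compared against the average of the other 5, which is the setup the slide describes.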

Nonexpert ITA
We average over k annotations to create a single “proto-labeler”.

We plot the ITA of this proto-labeler for up to 10 annotations and compare to the average single expert ITA.

[Figure: correlation vs. number of annotators (2 to 10), one panel each for anger, disgust, fear, joy, sadness, and surprise]

Emotion   1-E ITA  10-N ITA
Anger     0.459    0.675
Disgust   0.583    0.746
Fear      0.711    0.689
Joy       0.596    0.632
Sadness   0.645    0.776
Surprise  0.464    0.496
Valence   0.844    0.669
All       0.603    0.694

Number of nonexpert annotators required to match expert ITA, on average: 4
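
One way to compute a proto-labeler curve like the one plotted here is sketched below. It assumes that, for each k, the scores of every size-k subset of annotators are averaged and correlated with a reference labeling, and that those correlations are then averaged. The subset enumeration, the function names, and the toy data are assumptions for illustration, not details taken from the slides.

```python
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def proto_labeler_curve(nonexpert, reference, max_k):
    """nonexpert[a][i]: annotator a's score on item i.
    reference[i]: the labels the proto-labeler is compared against.

    Returns {k: mean correlation of k-annotator averages with reference}.
    """
    curve = {}
    for k in range(1, max_k + 1):
        scores = []
        for subset in combinations(nonexpert, k):
            # The proto-labeler is the item-wise mean of the chosen annotators.
            proto = [mean(col) for col in zip(*subset)]
            scores.append(pearson(proto, reference))
        curve[k] = mean(scores)
    return curve

# Made-up reference labels and 4 noisy annotators (not data from the study).
reference = [1, 2, 3, 4, 5]
nonexpert = [
    [2, 2, 3, 4, 4],
    [1, 3, 2, 4, 5],
    [1, 2, 4, 3, 5],
    [0, 2, 3, 5, 5],
]
curve = proto_labeler_curve(nonexpert, reference, max_k=4)
```

On this toy data the averaged labeler agrees with the reference more closely as k grows, which is the effect the plots illustrate. Enumerating all subsets is only practical for small annotator pools; sampling subsets would be the obvious substitute for larger ones.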

[Figure: agreement vs. number of annotators (2 to 10); panels: word similarity (correlation), RTE (accuracy), before/after (accuracy), WSD (accuracy)]

Task                 1-E ITA  10-N ITA
Affect Recognition   0.603    0.694
Word Similarity      0.958    0.952
Textual Entailment   0.91     0.897
Temporal Annotation  -        0.940
WSD                  -        0.994

Error Analysis: WSD
only 1 “mistake” out of 177 labels:

“The Egyptian president said
he would visit Libya today...”

Semeval Task 17 marks this as “executive officer of a firm” sense,
while Turkers voted for “head of a country” sense.

Error Analysis: RTE
~10 disagreements out of 100:
• Bob Carpenter: “Over half of the residual
disagreements between the Turker annotations and
the gold standard were of this highly suspect
nature and some were just wrong.”

• Bob Carpenter’s full analysis available at “Fool’s
Gold Standard”, http://lingpipe-blog.com/

Close Examples
T: A car bomb that exploded outside a U.S. military base near Beiji, killed 11 Iraqis.
H: A car bomb exploded outside a U.S. base in the northern town of Beiji, killing 11 Iraqis.
Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.

T: “Google files for its long awaited IPO.”
H: “Google goes public.”
Labeled “TRUE” in PASCAL RTE-1, Turkers vote 6-4 “FALSE”.

Weighting Annotators
• There are a small number of very prolific, very noisy annotators. If we plot each annotator:

[Figure (task: RTE): accuracy (0.4 to 1.0) vs. number of annotations (0 to 800), one point per annotator]
• We should be able to do better than majority voting.

• To infer the true value x_i, we weight each response y_i from annotator w using a small gold standard training set:

[equation]

• We estimate annotator response from 5% of the gold
standard test set, and evaluate with 20-fold CV.
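
The contrast between naive voting and gold-calibrated voting can be sketched as below. This is a simplified illustration under stated assumptions, not the weighting formula from the slides: it estimates a single smoothed accuracy per annotator from the gold items and uses its log-odds as the vote weight. The function names, the add-one smoothing, and the toy responses are all made up for the example.

```python
from math import log

def annotator_weights(gold, responses):
    """gold: {item: bool}; responses: list of (annotator, item, bool).

    Returns {annotator: log-odds of a smoothed accuracy estimate},
    computed only from responses to items that appear in gold.
    """
    right, seen = {}, {}
    for annotator, item, answer in responses:
        if item in gold:
            seen[annotator] = seen.get(annotator, 0) + 1
            right[annotator] = right.get(annotator, 0) + (answer == gold[item])
    weights = {}
    for annotator, n in seen.items():
        p = (right[annotator] + 1) / (n + 2)  # add-one smoothing keeps p in (0, 1)
        weights[annotator] = log(p / (1 - p))
    return weights

def weighted_vote(item, responses, weights):
    """True if the weighted votes for True outweigh those for False.
    Annotators with no gold responses get weight 0 (assumption)."""
    score = 0.0
    for annotator, it, answer in responses:
        if it == item:
            w = weights.get(annotator, 0.0)
            score += w if answer else -w
    return score > 0

def majority_vote(item, responses):
    """Naive voting: True if strictly more than half the votes are True."""
    votes = [answer for _, it, answer in responses if it == item]
    return sum(votes) * 2 > len(votes)

# Made-up data: one reliable annotator and two at chance on the gold items.
gold = {"g1": True, "g2": False, "g3": True, "g4": False}
responses = [
    ("careful", "g1", True), ("careful", "g2", False),
    ("careful", "g3", True), ("careful", "g4", False),
    ("noisy1", "g1", True), ("noisy1", "g2", True),
    ("noisy1", "g3", False), ("noisy1", "g4", False),
    ("noisy2", "g1", False), ("noisy2", "g2", False),
    ("noisy2", "g3", True), ("noisy2", "g4", True),
    ("careful", "x", True), ("noisy1", "x", False), ("noisy2", "x", False),
]
weights = annotator_weights(gold, responses)
naive = majority_vote("x", responses)
calibrated = weighted_vote("x", responses, weights)
```

In this toy case the two chance-level annotators outvote the reliable one under naive voting, while the calibrated vote follows the reliable annotator because the other two receive zero weight.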

[Figure: accuracy vs. number of annotators for RTE and before/after (temporal), comparing gold-calibrated and naive voting]

• RTE: 4.0% avg. accuracy increase
• Temporal: 3.4% avg. accuracy increase

• Several follow-up posts at http://lingpipe-blog.com

Cost Summary
Task                 Total Labels  Cost in USD  Time in hours  Labels / USD  Labels / Hour
Affect Recognition   7000          $2.00        5.93           3500          1180.4
Word Similarity      300           $0.20        0.17           1500          1724.1
Textual Entailment   8000          $8.00        89.3           1000          89.59
Temporal Annotation  4620          $13.86       39.9           333.3         115.85
WSD                  1770          $1.76        8.59           1005.7        206.1
All                  21690         $25.82       143.9          840.0         150.7
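
The two derived columns follow directly from the first three. A short sketch of the arithmetic, using the figures from the table (the variable and function names are assumptions for illustration):

```python
# (total labels, cost in USD, time in hours) per task, copied from the table.
tasks = {
    "Affect Recognition": (7000, 2.00, 5.93),
    "Word Similarity": (300, 0.20, 0.17),
    "Textual Entailment": (8000, 8.00, 89.3),
    "Temporal Annotation": (4620, 13.86, 39.9),
    "WSD": (1770, 1.76, 8.59),
}

def throughput(labels, cost_usd, hours):
    """Return (labels per USD, labels per hour)."""
    return labels / cost_usd, labels / hours

per_task = {name: throughput(*row) for name, row in tasks.items()}

# Totals across all five tasks.
total_labels = sum(row[0] for row in tasks.values())
total_cost = sum(row[1] for row in tasks.values())
total_hours = sum(row[2] for row in tasks.values())
overall = throughput(total_labels, total_cost, total_hours)
```

Small differences from the table are expected where the reported hours were rounded, as with the 0.17 hours for word similarity.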

In Summary
• All collected data and annotator
instructions are available at:
http://ai.stanford.edu/~rion/annotations

• Summary blog post and comments on
the Dolores Labs blog:
http://blog.doloreslabs.com

nlp.stanford.edu doloreslabs.com ai.stanford.edu

Training systems on
nonexpert annotations
• A simple affect recognition classifier trained
on the averaged nonexpert votes
outperforms one trained on a single expert
annotation

Where are Turkers?
United States 77.1%
India 5.3%
Philippines 2.8%
Canada 2.8%
UK 1.9%
Germany 0.8%
Italy 0.5%
Netherlands 0.5%
Portugal 0.5%
Australia 0.4%

Remaining 7.3% divided among 78 countries / territories

Analysis by Dolores Labs

Who are Turkers?

[Charts: gender, age, education, annual income]
“Mechanical Turk: The Demographics”, Panos Ipeirotis, NYU
behind-the-enemy-lines.blogspot.com

Why are Turkers?

A. To Kill Time
B. Fruitful way to spend free time
C. Income purposes
D. Pocket change/extra cash
E. For entertainment
F. Challenge, self-competition
G. Unemployed, no regular job, part-time job
H. To sharpen/ To keep mind sharp
I. Learn English

“Why People Participate on Mechanical Turk, Now Tabulated”, Panos Ipeirotis, NYU

How much does AMT pay?

“How Much Turking Pays?”, Panos Ipeirotis, NYU

Annotation Guidelines:
Affective Text

Word Similarity

Textual Entailment

Temporal Ordering

Word Sense Disambiguation

Affect Recognition

We label 100 headlines
for each of 7 emotions
We pay 4 cents for 20
headlines (140 total
labels)
Total Cost: $2.00
Time to complete: 5.93 hrs

Example Task: Word Similarity
30 word pairs
(Rubenstein and
Goodenough, xxxx)

We pay 10 Turkers 2
cents apiece to score
all 30 word pairs

Total cost: $0.20
Time to complete:
10.4 minutes

Word Similarity ITA
[Figure: correlation (0.84 to 0.96) vs. number of annotations (2 to 10)]

• Comparison against multiple annotators
• (graphs)
• avg. number of nonexperts : expert = 4

Datasets lead the way
WSJ + syntactic annotation = Penn TreeBank => statistical parsing

Brown corpus + sense labeling = Semcor => WSD

TreeBank + role labels = PropBank => SRL

political speeches + translations = United Nations parallel
corpora => statistical machine translation

more: RTE, Timebank, ACE/MUC, etc...

Datasets drive research
• statistical parsing: Penn Treebank
• semantic role labeling: PropBank
• word sense disambiguation: WordNet, SemCor
• speech recognition: Switchboard
• social network analysis: Enron E-mail Corpus
• statistical MT: UN Parallel Text
• textual entailment: Pascal RTE
