2. Agenda
• Why
• What
• How
• Cost
• Quality
• Measurement
3. Why Ask Real Humans
• They’re our customers
– Sometimes asking is the best way to find out what you want to know
– Provide ground truth for automated metrics
• Provide data for
– Experimental Evaluation
• complements A/B testing, surveys
– Query Diagnosis
– Judged Test Corpus
• Machine Learning
• Offline evaluation
– Production Quality Control
4. Why Crowdsourcing
• Fast
– 1-3 days
• Low Cost
– pennies per judgment
• High Quality
– Multiple workers
– Worker evaluation (test questions & inter-worker agreement)
• Flexible
– Ask anything
9. Crowdsourced Search Relevance Evaluation
• What are we measuring
– Relevance
• What are we not measuring
– Value
– Purchase metrics
– Revenue
10. Industry Standard Sample
• As in the original DCG formulation, we’ll be using a four-point scale for relevance assessment:
• Irrelevant document (0)
• Marginally relevant document (1)
• Fairly relevant document (2)
• Highly relevant document (3)
http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf
15. Quality
• Testing
– Train/test workers before they start
– Mix test questions into the work mix
– Discard data from unreliable workers
• Redundancy (see the sketch after this list)
– Cost is low → Ask multiple workers
– Monitor inter-worker agreement
– Have trusted workers monitor new workers
– Track worker “feedback” over time
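A minimal sketch of how the testing and redundancy ideas above can combine: score each worker on the mixed-in test questions, drop unreliable workers, and majority-vote the rest. The data layout ((worker, item, label) tuples plus a gold-answer dict) and the 0.8 threshold are illustrative assumptions, not details from the talk.

```python
from collections import Counter, defaultdict

def filter_and_aggregate(judgments, gold, min_accuracy=0.8):
    """judgments: iterable of (worker_id, item_id, label) tuples;
    gold: {item_id: correct_label} for the mixed-in test questions."""
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:  # this was a test question
            seen[worker] += 1
            hits[worker] += (label == gold[item])
    # Discard data from workers who fail the test-question threshold
    # (workers who never saw a test question are not trusted either).
    trusted = {w for w in seen if hits[w] / seen[w] >= min_accuracy}
    # Redundancy: aggregate the surviving judgments by majority vote.
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in trusted and item not in gold:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}
```

Inter-worker agreement monitoring would sit on top of this, e.g. flagging items or workers whose labels diverge sharply from the majority.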
16. eBay @ SIGIR ’10
Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald

The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers’ results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.

SIGIR ’10, July 19-23, 2010, Geneva, Switzerland
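One way to act on that conclusion, sketched under assumptions (the helper and its data layout are my invention, not code from the paper): draw the embedded training/test questions so that every relevance label is equally represented.

```python
import random
from collections import defaultdict

def uniform_gold_sample(gold_pool, per_label):
    """gold_pool: iterable of (item_id, label) gold judgments.
    Returns per_label test questions for each distinct label, giving the
    training data a uniform label distribution; assumes the pool holds
    at least per_label items of every label."""
    by_label = defaultdict(list)
    for item, label in gold_pool:
        by_label[label].append(item)
    return {label: random.sample(items, per_label)
            for label, items in by_label.items()}
```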
17. Metrics
• There are standard industry metrics
• Designed to measure value to the end user
• Older metrics
– Precision & recall (binary relevance, no notion of position)
• Current metrics (see the sketch below)
– Cumulative Gain (overall value of results on a non-binary relevance scale)
– Discounted (adjusted for position value)
– Normalized (common 0-1 scale)
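A minimal sketch of those metrics, using the common formulation that discounts the result at rank i by log2(i + 1); the judged list in the example is invented.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: sum graded relevance labels in
    ranked order, each discounted by its position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the ideal (descending) ordering -> common 0-1 scale."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Six results judged on the 0-3 scale from slide 10 (invented labels).
print(round(ndcg([3, 2, 3, 0, 1, 2]), 2))  # 0.96
```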
18. Judgment Scale Granularity

| Binary       | Web Search | SigIR               | 3 Point        | 4 Point         |
|--------------|------------|---------------------|----------------|-----------------|
|              | Offensive  |                     | -1 Spam        | -2 Spam         |
|              | Spam       |                     | -1 Off Topic   | -2 Off Topic    |
| 0 Irrelevant | Off Topic  | Irrelevant          | 0 Not Matching | -1 Not Matching |
|              | Relevant   | Marginally Relevant |                |                 |
| 1 Relevant   | Useful     | Fairly Relevant     | 1 Matching     | 1 Good Match    |
|              | Vital      | Highly Relevant     |                | 2 Great Match   |
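A practical note that is my addition, not from the slides: DCG assumes non-negative gains, so judgments on a scale with penalty levels, like the 4 Point column's {-2, -1, 1, 2}, need a mapping before NDCG is computed. One hypothetical choice:

```python
# Hypothetical gain mapping: clip the penalty levels to zero gain.
GAIN = {-2: 0, -1: 0, 1: 1, 2: 2}

labels = [2, -1, 1, -2]            # judgments on the 4-point scale
gains = [GAIN[l] for l in labels]  # [2, 0, 1, 0], ready for dcg() above
```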
21. Continuous Production Evaluation
• Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis (a possible implementation is sketched below)
[Chart: NDCG over time, by site, category, query, …]
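A minimal sketch of what that daily tracking could look like; the record layout and the pandas grouping are assumptions for illustration, not eBay's actual pipeline.

```python
import pandas as pd

# Assumed layout: one row per sampled query per day, with its judged NDCG.
df = pd.DataFrame([
    {"date": "2010-07-01", "site": "US", "category": "Shoes", "ndcg": 0.91},
    {"date": "2010-07-01", "site": "DE", "category": "Shoes", "ndcg": 0.78},
    {"date": "2010-07-02", "site": "US", "category": "Shoes", "ndcg": 0.88},
    {"date": "2010-07-02", "site": "DE", "category": "Shoes", "ndcg": 0.81},
])

# NDCG over time, by site (swap in "category" or a query column for other cuts).
trend = df.groupby(["date", "site"])["ndcg"].mean().unstack("site")
print(trend)
```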
25. Measuring a Ranked List
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom’09), 2009.
http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
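The figure this slide showed is not in the extracted text; the definitions it presumably illustrated, in the common formulation (my reconstruction, not copied from the tutorial), are:

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where rel_i is the graded judgment of the result at rank i and IDCG@k is the DCG@k of the ideal (relevance-sorted) ranking.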
27. NDCG - Example
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom’09), 2009.
http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
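The worked example from this slide is likewise missing; here is one consistent with the definitions above, for an invented three-result list judged [2, 3, 0] on the 0-3 scale:

```latex
\mathrm{DCG} = \frac{2}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} \approx 3.89,
\qquad
\mathrm{IDCG} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} \approx 4.26,
\qquad
\mathrm{NDCG} = \frac{3.89}{4.26} \approx 0.91
```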
28. Open Questions
• Discrete vs. continuous relevance scale
• # of workers
• Distribution of test questions
• Generation of test questions
• Qualification (demographics, interests, region)
• Dynamic worker assignment based on qualification
• Mobile workers (untapped pool)