2. Agenda
• Why
• What
• How
• Cost
• Quality
• Measurement
3. Why Ask Real Humans
• They’re our customers
– Sometimes asking is the best way to find out what you want to know
– Provide ground truth for automated metrics
• Provide data for
– Experimental Evaluation
• complements A/B testing, surveys
– Query Diagnosis
– Judged Test Corpus
• Machine Learning
• Offline evaluation
– Production Quality Control
4. Why Crowdsourcing
• Fast
– 1-3 days
• Low Cost
– pennies per judgment
• High Quality
– Multiple workers
– Worker evaluation (test questions & inter-worker agreement)
• Flexible
– Ask anything
9. Crowdsourced Search Relevance Evaluation
• What are we measuring
– Relevance
• What are we not measuring
– Value
– Purchase metrics
– Revenue
10. Industry Standard Sample
• As in the original DCG formulation, we’ll be using a four-point scale for relevance assessment:
• Irrelevant document (0)
• Marginally relevant document (1)
• Fairly relevant document (2)
• Highly relevant document (3)
http://www.sigir.org/forum/2008D/papers/2008d_sigirforum_alonso.pdf
15. Quality
• Testing
– Train/test workers before they start
– Mix test questions into the work mix
– Discard data from unreliable workers
• Redundancy (see the sketch after this list)
– Cost is low → Ask multiple workers
– Monitor inter-worker agreement
– Have trusted workers monitor new workers
– Track worker “feedback” over time
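A minimal sketch of how the testing and redundancy ideas above can combine: score each worker on the mixed-in test questions, drop unreliable workers, and majority-vote the rest. The data layout ((worker, item, label) tuples plus a gold-answer dict) and the 0.8 threshold are illustrative assumptions, not details from the talk.

```python
from collections import Counter, defaultdict

def filter_and_aggregate(judgments, gold, min_accuracy=0.8):
    """judgments: iterable of (worker_id, item_id, label) tuples;
    gold: {item_id: correct_label} for the mixed-in test questions."""
    hits, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:  # this was a test question
            seen[worker] += 1
            hits[worker] += (label == gold[item])
    # Discard data from workers who fail the test-question threshold
    # (workers who never saw a test question are not trusted either).
    trusted = {w for w in seen if hits[w] / seen[w] >= min_accuracy}
    # Redundancy: aggregate the surviving judgments by majority vote.
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in trusted and item not in gold:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}
```

Inter-worker agreement monitoring would sit on top of this, e.g. flagging items or workers whose labels diverge sharply from the majority.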
16. eBay @ SIGIR ’10
Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution
John Le, Andy Edmonds, Vaughn Hester, Lukas Biewald

The use of crowdsourcing platforms like Amazon Mechanical Turk for evaluating the relevance of search results has become an effective strategy that yields results quickly and inexpensively. One approach to ensure quality of worker judgments is to include an initial training period and subsequent sporadic insertion of predefined gold standard data (training data). Workers are notified or rejected when they err on the training data, and trust and quality ratings are adjusted accordingly. In this paper, we assess how this type of dynamic learning environment can affect the workers’ results in a search relevance evaluation task completed on Amazon Mechanical Turk. Specifically, we show how the distribution of training set answers impacts training of workers and aggregate quality of worker results. We conclude that in a relevance categorization task, a uniform distribution of labels across training data labels produces optimal peaks in 1) individual worker precision and 2) majority voting aggregate result accuracy.

SIGIR ’10, July 19-23, 2010, Geneva, Switzerland
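One way to act on that conclusion, sketched under assumptions (the helper and its data layout are my invention, not code from the paper): draw the embedded training/test questions so that every relevance label is equally represented.

```python
import random
from collections import defaultdict

def uniform_gold_sample(gold_pool, per_label):
    """gold_pool: iterable of (item_id, label) gold judgments.
    Returns per_label test questions for each distinct label, giving the
    training data a uniform label distribution; assumes the pool holds
    at least per_label items of every label."""
    by_label = defaultdict(list)
    for item, label in gold_pool:
        by_label[label].append(item)
    return {label: random.sample(items, per_label)
            for label, items in by_label.items()}
```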
17. Metrics
• There are standard industry metrics
• Designed to measure value to the end user
• Older metrics
– Precision & recall (binary relevance, no notion of position)
• Current metrics (see the sketch below)
– Cumulative Gain (overall value of results on a non-binary relevance scale)
– Discounted (adjusted for position value)
– Normalized (common 0-1 scale)
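A minimal sketch of those metrics, using the common formulation that discounts the result at rank i by log2(i + 1); the judged list in the example is invented.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: sum graded relevance labels in
    ranked order, each discounted by its position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize by the ideal (descending) ordering -> common 0-1 scale."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Six results judged on the 0-3 scale from slide 10 (invented labels).
print(round(ndcg([3, 2, 3, 0, 1, 2]), 2))  # 0.96
```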
18. Judgment Scale Granularity

| Binary       | Web Search | SigIR               | 3 Point        | 4 Point         |
|--------------|------------|---------------------|----------------|-----------------|
|              | Offensive  |                     | -1 Spam        | -2 Spam         |
|              | Spam       |                     | -1 Off Topic   | -2 Off Topic    |
| 0 Irrelevant | Off Topic  | Irrelevant          | 0 Not Matching | -1 Not Matching |
|              | Relevant   | Marginally Relevant |                |                 |
| 1 Relevant   | Useful     | Fairly Relevant     | 1 Matching     | 1 Good Match    |
|              | Vital      | Highly Relevant     |                | 2 Great Match   |
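A practical note that is my addition, not from the slides: DCG assumes non-negative gains, so judgments on a scale with penalty levels, like the 4 Point column's {-2, -1, 1, 2}, need a mapping before NDCG is computed. One hypothetical choice:

```python
# Hypothetical gain mapping: clip the penalty levels to zero gain.
GAIN = {-2: 0, -1: 0, 1: 1, 2: 2}

labels = [2, -1, 1, -2]            # judgments on the 4-point scale
gains = [GAIN[l] for l in labels]  # [2, 0, 1, 0], ready for dcg() above
```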
21. Continuous Production Evaluation
• Daily query sampling/scraping to facilitate ongoing monitoring, QA, triage, and post-hoc business analysis (a possible implementation is sketched below)
[Chart: NDCG over time, by site, category, query, …]
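A minimal sketch of what that daily tracking could look like; the record layout and the pandas grouping are assumptions for illustration, not eBay's actual pipeline.

```python
import pandas as pd

# Assumed layout: one row per sampled query per day, with its judged NDCG.
df = pd.DataFrame([
    {"date": "2010-07-01", "site": "US", "category": "Shoes", "ndcg": 0.91},
    {"date": "2010-07-01", "site": "DE", "category": "Shoes", "ndcg": 0.78},
    {"date": "2010-07-02", "site": "US", "category": "Shoes", "ndcg": 0.88},
    {"date": "2010-07-02", "site": "DE", "category": "Shoes", "ndcg": 0.81},
])

# NDCG over time, by site (swap in "category" or a query column for other cuts).
trend = df.groupby(["date", "site"])["ndcg"].mean().unstack("site")
print(trend)
```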
25. Measuring a Ranked List
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom’09), 2009.
http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
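The figure this slide showed is not in the extracted text; the definitions it presumably illustrated, in the common formulation (my reconstruction, not copied from the tutorial), are:

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where rel_i is the graded judgment of the result at rank i and IDCG@k is the DCG@k of the ideal (relevance-sorted) ranking.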
27. NDCG - Example
Huan Liu, Lei Tang and Nitin Agarwal. Tutorial on Community Detection and Behavior Study for Social Computing. Presented at the 1st IEEE International Conference on Social Computing (SocialCom’09), 2009.
http://www.iisocialcom.org/conference/socialcom2009/download/SocialCom09-tutorial.pdf
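The worked example from this slide is likewise missing; here is one consistent with the definitions above, for an invented three-result list judged [2, 3, 0] on the 0-3 scale:

```latex
\mathrm{DCG} = \frac{2}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} \approx 3.89,
\qquad
\mathrm{IDCG} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} \approx 4.26,
\qquad
\mathrm{NDCG} = \frac{3.89}{4.26} \approx 0.91
```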
28. Open Questions
• Discrete vs. continuous relevance scale
• # of workers
• Distribution of test questions
• Generation of test questions
• Qualification (demographics, interests, region)
• Dynamic worker assignment based on qualification
• Mobile workers (untapped pool)