Evangelos Kanoulas — Advances in Information Retrieval Evaluation

Evaluating Multi-Query Sessions

Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$

* University of Sheffield, UK
+ University of Delaware, USA
$ RMIT University, Australia

Why sessions?

•  Current evaluation framework
–  Assesses the effectiveness of systems over one-shot queries
•  Users reformulate their initial query
•  Still fine if …
–  optimizing a system for one-shot queries led to optimal performance over an entire session

Why sessions?

When was the DuPont Science Essay Contest created?

Initial Query: DuPont Science Essay Contest
Reformulation: When was the DSEC created?

•  e.g. retrieval systems should accumulate information along a session

Extend the evaluation framework

From one-query evaluation
To multi-query session evaluation

Construct appropriate test collections
Rethink evaluation measures

What is the appropriate collection?

Test collections we built…

•  Text REtrieval Conference (TREC)
–  sponsored by NIST
–  many competitions; among them the Session Track 2010, 2011, …

Test collection we built in 2010…

•  Corpus: ClueWeb09
–  1 billion web pages (5TB compressed)
•  Queries and Reformulations
–  150 query pairs: initial query, reformulation
–  3 types of reformulations (not disclosed to participants)
•  Specification (52 query pairs)
•  Generalization (48 query pairs)
•  Drifting / Parallel Reformulation (50 query pairs)

Some Criticism…

•  Artificial reformulations
•  Short reformulations
–  just 2 queries
•  No other user interaction data
–  clicks, dwell times, etc.
•  Reformulations are static (do not depend on the search engine's (SE's) response)
–  The collection does not allow early abandonment
–  The reformulation itself does not change based on the SE's response

Test Collection in 2011

•  Corpus: ClueWeb09
–  1 billion web pages (5TB compressed)
•  Queries and Reformulations
–  Real users searching ClueWeb09
–  76 sessions of 2 up to 10 reformulations
•  Other interactions
–  Clicks, dwell times, mouse movements, relevance judgments
•  But… reformulations are still static

Basic test collection

•  A set of information needs

What do we know about black powder ammunition?

–  A static sequence of m queries

Initial Query: black powder ammunition
1st Reformulation: black powder wiki
2nd Reformulation: gun powder wiki
…
(m-1)th Reformulation: history of gunpowder

Experiment

[Figure: three ranked lists (ranks 1 to 10, …), one each for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]

Evaluation over a single ranked list

Experiment

[Figure: ranked list, ranks 1 to 10, …]

What is a good system?
How can we measure "goodness"?

Measuring "goodness"

The user steps down a ranked list of documents and observes each one of them until a decision point and either
a) abandons the search, or
b) reformulates

While stepping down or sideways, the user accumulates utility

What are the challenges?

Evaluation over a single list
Evaluation over multiple ranked lists

[Figure: ranked list, ranks 1 to 10, …]

Existing measures

•  Session DCG [Järvelin et al., ECIR 2008]
The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]
•  Expected session utility [Yang and Lad, ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]

Evaluating over paths

Optimize: Model-free measures
Integrate out: Model-based measures

Evaluation measures

•  Evaluating over paths
•  Model-free measures
•  Model-based measures

Model-free measures

The user is an oracle that knows when to reformulate

Ω(k,j): paths of length k, ending at reformulation j

Count the number of relevant docs on the optimal path ω of length k ending at query j
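
The oracle count above can be sketched with a small dynamic program. This is an illustration only, not the authors' implementation: the function name `optimal_path_relevant` is hypothetical, and it assumes that a path reads a non-empty top-down prefix of each of the first j ranked lists, k documents in total.

```python
from typing import List


def optimal_path_relevant(rels: List[List[int]], k: int, j: int) -> int:
    """Max number of relevant docs on any path of length k ending at query j.

    rels[q][r] is 1 if the document at rank r+1 of query q+1 is relevant.
    Assumption (not from the slides): the user views at least the top
    document of every query on the path before moving on.
    """
    neg = float("-inf")
    # best[n]: best relevant count using exactly n documents so far
    best = [0] + [neg] * k
    for q in range(j):
        # prefix[t]: relevant documents among the top t results of query q
        prefix = [0]
        for r in rels[q]:
            prefix.append(prefix[-1] + r)
        new = [neg] * (k + 1)
        for n in range(1, k + 1):
            for t in range(1, min(n, len(rels[q])) + 1):
                if best[n - t] != neg:
                    new[n] = max(new[n], best[n - t] + prefix[t])
        best = new
    if best[k] == neg:
        raise ValueError("no path of that length ends at that query")
    return int(best[k])


# Three lists: Q1 has no relevant results, Q2 is relevant at ranks 1 to 5,
# Q3 is relevant at every rank.
q1, q2, q3 = [0] * 10, [1] * 5 + [0] * 5, [1] * 10
print(optimal_path_relevant([q1, q2, q3], k=10, j=3))  # 9
```

Dividing the returned count by k gives a precision-style value for the path; under the stated assumption the example yields 9/10.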

Model-free measures

Rank  Q1  Q2  Q3
1     N   R   R
2     N   R   R
3     N   R   R
4     N   R   R
5     N   R   R
6     N   N   R
7     N   N   R
8     N   N   R
9     N   N   R
10    N   N   R
…     …   …   …

ω(10,3): length 10, ending at 3rd query

Define:
Precision@k,j
Recall@k,j
Precision@recall,j

Model-free measures

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R), next to a 3-D plot with axes precision, recall, and reformulation]

Model-free measures

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R), next to three precision-recall plots titled ranking 1, ranking 2, and ranking 3; both axes run from 0.0 to 1.0]

Model-based measures

Probabilistic space of users following different paths

•  Ω is the space of all paths
•  P(ω) is the probability of a user following a path ω in Ω
•  M_ω is a measure over a path ω

$esM = \sum_{\omega \in \Omega} P(\omega) \, M_\omega$

[Yang and Lad, ICTIR 2009]
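
The expectation above is a probability-weighted sum over paths, which can be sketched as follows. The function name `expected_session_measure` and the representation of a path as a tuple of relevance labels are assumptions made for the example; the path probabilities are made-up numbers.

```python
from typing import Callable, Dict, Hashable


def expected_session_measure(
    path_probs: Dict[Hashable, float],
    measure: Callable[[Hashable], float],
) -> float:
    """Return sum over paths of P(path) * M(path)."""
    total = sum(path_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("path probabilities must sum to 1")
    return sum(p * measure(path) for path, p in path_probs.items())


def precision(path) -> float:
    """Fraction of relevant documents on a path of 0/1 labels."""
    return sum(path) / len(path)


# Hypothetical paths (tuples of relevance labels) and probabilities.
paths = {(1, 0): 0.5, (1, 1): 0.25, (0, 0): 0.25}
print(expected_session_measure(paths, precision))  # 0.5
```

Any per-path measure can be plugged in for `precision`; only the path distribution P(ω) changes between browsing models.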

Model Browsing Behavior

Position-based models

The chance of observing a document depends on the position of the document in the ranked list.

[Figure: ranked list for the query "black powder ammunition", ranks 1 to 10, …]

Rank Biased Precision

[Moffat and Zobel, TOIS08]

[Figure: ranked list for the query "black powder ammunition" (ranks 1 to 10, …) with a flowchart: Query, View Next Item, Stop]
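
As a concrete reading of the view-next-or-stop loop, here is a minimal sketch of Rank Biased Precision in its standard form, RBP = (1 - p) Σ_i rel_i p^(i-1), where p is the probability that the user moves on to the next item. The default p = 0.8 is an arbitrary choice for the example, not a value from the slides.

```python
from typing import Sequence


def rank_biased_precision(rels: Sequence[float], p: float = 0.8) -> float:
    """RBP over a ranked list of relevance values in [0, 1].

    p is the persistence: the probability of viewing the next item
    rather than stopping.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # enumerate starts at 0, so p**i is the discount p^(rank-1)
    return (1.0 - p) * sum(rel * p**i for i, rel in enumerate(rels))


print(rank_biased_precision([1, 0, 1], p=0.5))  # 0.625
```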

Model Browsing Behavior

Cascade-based models

The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.

[Figure: ranked list for the query "black powder ammunition", ranks 1 to 10, …]

Expected Reciprocal Rank

[Chapelle et al., CIKM09]

[Figure: ranked list for the query "black powder ammunition" (ranks 1 to 10, …) with a flowchart: Query, View Next Item, Relevant? (highly / somewhat / no), Stop]
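
The cascade loop in the flowchart can be sketched with the standard Expected Reciprocal Rank formulation, in which a document with grade g satisfies the user with probability (2^g - 1) / 2^g_max. Mapping "no / somewhat / highly" to grades 0, 1, 2 is an assumption made for this example.

```python
from typing import Sequence


def expected_reciprocal_rank(grades: Sequence[int], max_grade: int = 2) -> float:
    """ERR for a ranked list of graded relevance labels (0..max_grade)."""
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        # probability that the document at this rank satisfies the user
        p_stop = (2**grade - 1) / 2**max_grade
        err += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return err


# Assumed grades: 0 = not relevant, 1 = somewhat, 2 = highly relevant.
print(expected_reciprocal_rank([0, 2]))  # 0.375
```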

Expected Browsing Utility

[Yilmaz et al., CIKM10]

$D_{EBU}(r) = P(E_r) \cdot P(C \mid R_r)$

$EBU = \sum_{r=1}^{n} D_{EBU}(r) \cdot R_r$

Probability of a path

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R)]

Joint probability of
(1) abandoning at reformulation 2
(2) reformulating at rank 3 of first query

Probability of a path

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R)]

(1) Probability of abandoning at reformulation 2
×
(2) Probability of reformulating at rank 3 of first query

Geometric with parameter p_reform

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R)]

(1) Probability of abandoning the session at reformulation i

Truncated Geometric with parameter p_reform

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R)]

(1) Probability of abandoning the session at reformulation i

Truncated Geometric with parameter p_reform
Geometric with parameter p_down

[Figure: relevance of ranks 1 to 10 for Q1, Q2, Q3 (ranks 1 to 5: N R R; ranks 6 to 10: N N R)]

(2) Probability of reformulating at rank j (for each of reformulations 1 to i-1)
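
The two factors can be combined into a path probability as sketched below. The slides name the distributions but not their exact parametrization, so everything here is an assumption: the truncated geometric is renormalized over the m queries, p_reform is read as the probability of issuing another reformulation, p_down as the probability of stepping down one more rank, and all function names are hypothetical.

```python
from typing import Sequence


def p_abandon_at(i: int, m: int, p_reform: float) -> float:
    """Truncated geometric: probability that the session ends at query i of m.

    Assumed form: p_reform^(i-1) * (1 - p_reform), renormalized over 1..m.
    """
    if not 1 <= i <= m:
        raise ValueError("i must be in 1..m")
    if not 0.0 < p_reform < 1.0:
        raise ValueError("p_reform must be in (0, 1)")
    return p_reform ** (i - 1) * (1.0 - p_reform) / (1.0 - p_reform**m)


def p_reformulate_at_rank(k: int, p_down: float) -> float:
    """Geometric: probability of leaving a ranked list right after rank k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return p_down ** (k - 1) * (1.0 - p_down)


def path_probability(
    reform_ranks: Sequence[int], m: int, p_reform: float, p_down: float
) -> float:
    """P(path) = P(abandon at query i) * product of P(reformulate at rank).

    reform_ranks[q] is the rank at which the user left query q+1, so the
    session is abandoned at query i = len(reform_ranks) + 1.
    """
    prob = p_abandon_at(len(reform_ranks) + 1, m, p_reform)
    for k in reform_ranks:
        prob *= p_reformulate_at_rank(k, p_down)
    return prob


# Abandon at reformulation 2 after reformulating at rank 3 of the first query.
print(path_probability([3], m=3, p_reform=0.5, p_down=0.5))
```

With both parameters at 0.5 and m = 3, the first factor is 0.25 / 0.875 and the second is 0.125, so the path has probability 1/28 under these assumptions.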

Model-based measures

Probabilistic space of users following different paths

•  Ω is the space of all paths
•  P(ω) is the probability of a user following a path ω in Ω
•  M_ω is a measure over a path ω

$esM = \sum_{\omega \in \Omega} P(\omega) \, M_\omega$

Evaluation measures

•  Evaluating over paths
•  Model-free measures
•  Model-based measures

Evaluation measures

•  Properties
–  How do the new measures correlate with previously introduced ones?
–  Do they behave as expected, i.e. do they reward early retrieval of relevant documents?

Correlations

•  TREC 2010 Session track

nsDCG vs. esNDCG: Kendall's tau = 0.7972
nsDCG vs. esAP: Kendall's tau = 0.5247

[Figure: two scatter plots, esNDCG against nsDCG and esAP against nsDCG]
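
For readers unfamiliar with the statistic, Kendall's tau compares how two measures order the same set of systems. The sketch below implements the simple tau-a variant (no tie correction); the scores are made-up numbers, not the TREC 2010 results, and which tau variant the slides used is not stated.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of four systems under two measures.
measure_a = [0.10, 0.20, 0.30, 0.40]
measure_b = [0.05, 0.04, 0.07, 0.08]
print(kendall_tau(measure_a, measure_b))
```

Here one of the six system pairs is ordered differently by the two measures, so tau is (5 - 1) / 6.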

Reward early retrieval

•  TREC9 Query track
–  50 topics and 23 query sets (formulations)
•  Simulate sessions

Session         esMPC@20  esMRC@20  esMAP
"good"-"good"   0.378     0.036     0.122
"good"-"bad"    0.363     0.034     0.112
"bad"-"good"    0.271     0.023     0.083
"bad"-"bad"     0.254     0.022     0.073

Conclusions

•  Extend the evaluation framework to sessions
–  Built the appropriate test collection
–  Rethought evaluation measures
•  Basic test collection
•  Model-free and model-based measures
•  Did not talk about:
–  Duplicate documents
–  Efficient computation of the measures
