Evangelos Kanoulas — Advances in Information Retrieval Evaluation
1. Evaluating Multi-Query Sessions
Evangelos Kanoulas*, Ben Carterette+, Paul Clough*, Mark Sanderson$
* University of Sheffield, UK
+ University of Delaware, USA
$ RMIT University, Australia
2. Why sessions?
• Current evaluation framework
– Assesses the effectiveness of systems over one-shot queries
• Users reformulate their initial query
• Still fine if …
– optimizing system for one-shot queries led to optimal performance over an entire session
3. Why sessions?
When was the DuPont Science Essay Contest created?
Initial Query: DuPont Science Essay Contest
Reformulation: When was the DSEC created?
• e.g. retrieval systems should accumulate information along a session
4. Extend the evaluation framework
From one query evaluation
To multi-query sessions evaluation
7. Test collections we built…
• Text REtrieval Conference (TREC)
– sponsored by NIST
– many competitions; among them Session Track 2010, 2011, …
8. Test collection we built in 2010…
• Corpus: ClueWeb09
– 1 billion web pages (5TB compressed)
• Queries and Reformulations
– 150 query pairs: initial query, reformulation
– 3 types of reformulations (not disclosed to participants)
• Specification (52 query pairs)
• Generalization (48 query pairs)
• Drifting / Parallel Reformulation (50 query pairs)
9. Some Criticism…
• Artificial reformulations
• Short reformulations
– just 2 queries
• No other user interaction data
– clicks, dwell times, etc.
• Reformulations are static (do not depend on the SE’s response)
– The collection does not allow early abandonment
– The reformulation itself does not change upon SE’s response
10. Test Collection in 2011
• Corpus: ClueWeb09
– 1 billion web pages (5TB compressed)
• Queries and Reformulations
– Real users searching ClueWeb09
– 76 sessions of 2 up to 10 reformulations
• Other interactions
– Clicks, dwell times, mouse movements, relevance judgments
• But… reformulations are still static
11. Basic test collection
• A set of information needs
What do we know about black powder ammunition?
– A static sequence of m queries
Initial Query: black powder ammunition
1st Reformulation: black powder wiki
2nd Reformulation: gun powder wiki
…
(m-1)th Reformulation: history of gunpowder
12. Experiment
[Figure: three ranked lists (ranks 1 to 10, …), one each for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]
13. Evaluation over a single ranked list
[Figure: three ranked lists (ranks 1 to 10, …), one each for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]
17. Measuring “goodness”
The user steps down a ranked list of documents and observes each one of them until a decision point and either
a) abandons the search, or
b) reformulates
While stepping down or sideways, the user accumulates utility
19. Evaluation over multiple ranked lists
[Figure: three ranked lists (ranks 1 to 10, …), one each for the queries "black powder ammunition", "black powder wiki", and "gun powder wiki"]
21. Existing measures
• Session DCG [Järvelin et al ECIR 2008]
The user steps down the ranked list until rank k and reformulates
[Deterministic; no early abandonment]
• Expected session utility [Yang and Lad ICTIR 2009]
The user steps down a ranked list of documents until a decision point and reformulates
[Stochastic; no early abandonment]
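The deterministic user behind session DCG can be illustrated with a minimal Python sketch. It assumes the commonly cited formulation in which the DCG of the q-th query, cut at rank k, is divided by 1 + log_bq(q) and the results are summed; the function names, the log bases b=2 and bq=4, and the toy gains are illustrative assumptions, not values taken from the slides.

```python
import math


def dcg(gains, b=2):
    """DCG with a log-base-b rank discount; ranks below b are not discounted."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain / max(1.0, math.log(rank, b))
    return total


def session_dcg(session_gains, k, b=2, bq=4):
    """Sum of per-query DCG@k, each discounted by the query's position in the session.

    session_gains: one list of gain values per query, in session order.
    The user is assumed to read exactly k documents per query, then reformulate.
    """
    total = 0.0
    for q, gains in enumerate(session_gains, start=1):
        query_discount = 1.0 + math.log(q, bq)  # equals 1 for the first query
        total += dcg(gains[:k], b) / query_discount
    return total


# Hypothetical two-query session with binary gains.
toy_session = [[1, 0], [1, 1]]
print(session_dcg(toy_session, k=2))
```

Later queries contribute less because the user had to spend effort reformulating before reaching them; the fixed cut-off k is what makes this user model deterministic.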
22. Evaluating over paths
Optimize: Model-free measures
Integrate out: Model-based measures
23. Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
24. Model-free measures
The user is an oracle that knows when to reformulate
Ω(k,j): paths of length k, ending at reformulation j
Count number of relevant docs on the optimal path ω of length k ending at query j
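The oracle count can be sketched with a small dynamic program. This is only an illustration under stated assumptions: relevance is binary, a path reads a top-ranked prefix of each of the first j ranked lists, at least one document is read per query, and the prefix lengths sum to k. The function names and the toy relevance lists are hypothetical.

```python
from itertools import accumulate

NEG_INF = float("-inf")


def optimal_path_relevant(rankings, k, j):
    """Maximum number of relevant documents on any path of length k ending at query j.

    rankings: one list of 0/1 relevance labels per query, in session order.
    """
    # best[n] = most relevant documents seen after reading n documents in total
    best = [0] + [NEG_INF] * k
    for ranking in rankings[:j]:
        prefix = [0] + list(accumulate(ranking))  # prefix[d] = relevant docs in top d
        new_best = [NEG_INF] * (k + 1)
        for n in range(k + 1):
            if best[n] == NEG_INF:
                continue
            for d in range(1, min(len(ranking), k - n) + 1):
                new_best[n + d] = max(new_best[n + d], best[n] + prefix[d])
        best = new_best
    if best[k] == NEG_INF:
        raise ValueError("no path of length k ends at query j")
    return best[k]


def precision_at_k_j(rankings, k, j):
    """Fraction of the k documents on the optimal path that are relevant."""
    return optimal_path_relevant(rankings, k, j) / k


# Toy session: Q1 has no relevant documents, Q2 has five at the top, Q3 is all relevant.
toy_rankings = [[0] * 10, [1] * 5 + [0] * 5, [1] * 10]
print(optimal_path_relevant(toy_rankings, k=10, j=3))  # 9
print(precision_at_k_j(toy_rankings, k=10, j=3))       # 0.9
```

In the toy session the oracle reads one document from Q1 before moving on, so at most 9 of the 10 documents on the path can be relevant.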
25. Model-free measures
ω(10,3): length 10, ending at 3rd query
Define: Precision@k,j, Recall@k,j, Precision@recall,j
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant)]
26. Model-free measures
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant) beside a plot with axes precision, recall, and reformulation]
27. Model-free measures
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant) beside three precision-recall plots titled ranking 1, ranking 2, and ranking 3; both axes run from 0.0 to 1.0]
28. Model-free measures
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant) beside a plot with axes precision, recall, and reformulation]
29. Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
30. Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the prob of a user following a path ω in Ω
• M_ω is a measure over a path ω
$esM = \sum_{\omega \in \Omega} P(\omega) M_\omega$
[Yang and Lad ICTIR 2009]
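The expectation over paths is simple once the paths are enumerated. The sketch below assumes that each path is supplied as a (probability, measure value) pair and that the probabilities sum to one; the function name and the three toy paths are hypothetical.

```python
def expected_session_measure(paths, tol=1e-9):
    """Probability-weighted average of a per-path measure.

    paths: iterable of (probability, measure_value) pairs covering the path space.
    """
    paths = list(paths)
    total_prob = sum(prob for prob, _ in paths)
    if abs(total_prob - 1.0) > tol:
        raise ValueError("path probabilities must sum to 1")
    return sum(prob * value for prob, value in paths)


# Hypothetical path space with three paths.
toy_paths = [(0.5, 1.0), (0.25, 0.5), (0.25, 0.0)]
print(expected_session_measure(toy_paths))  # 0.625
```

In practice the path space grows quickly with the number of queries and the list depth, so enumerating it directly is only practical for small examples like this one.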
31. Model Browsing Behavior
Position-based models
The chance of observing a document depends on the position of the document in the ranked list.
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
32. Rank Biased Precision [Moffat and Zobel, TOIS08]
[Figure: user browsing flowchart (Query, View Next Item, Stop) beside the ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
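The browsing loop in the flowchart corresponds to the standard rank-biased precision formula, RBP = (1 - p) · Σ_i r_i · p^(i-1), where p is the probability that the user moves on to the next item. A minimal Python sketch follows; the function name, the default p, and the toy relevance list are illustrative choices.

```python
def rank_biased_precision(relevance, p=0.8):
    """Rank-biased precision of one ranked list.

    relevance: relevance values in [0, 1], in rank order.
    p: persistence, the probability of viewing the next item rather than stopping.
    """
    return (1.0 - p) * sum(rel * p ** i for i, rel in enumerate(relevance))


# Hypothetical list: relevant documents at ranks 1 and 3, impatient user (p = 0.5).
print(rank_biased_precision([1, 0, 1], p=0.5))  # 0.625
```

Because the view probability depends only on the rank, this is a position-based model: what the user has already seen does not change the chance of continuing.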
33. Model Browsing Behavior
Cascade-based models
The chance of observing a document depends on the position of the document in the ranked list and the relevance of documents/snippets already viewed.
[Figure: ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
34. Expected Reciprocal Rank [Chapelle et al CIKM09]
[Figure: user browsing flowchart (Query, View Next Item, Relevant? with outcomes highly / somewhat / no, Stop) beside the ranked list (ranks 1 to 10, …) for the query "black powder ammunition"]
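The cascade in the flowchart can be written out as the usual expected reciprocal rank computation: the user stops at rank r with probability R_r = (2^g_r - 1) / 2^g_max, provided no earlier document satisfied them, and the stop is credited with 1/r. A minimal Python sketch, with a hypothetical function name and toy grades:

```python
def expected_reciprocal_rank(grades, max_grade):
    """Expected reciprocal rank of one ranked list.

    grades: integer relevance grades in rank order (0 = not relevant).
    max_grade: highest grade on the judgment scale.
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / 2 ** max_grade  # chance this document satisfies
        err += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return err


# Hypothetical lists on a 0..2 scale: the same relevant document at rank 1 or rank 2.
print(expected_reciprocal_rank([2, 0], max_grade=2))  # 0.75
print(expected_reciprocal_rank([0, 2], max_grade=2))  # 0.375
```

Unlike the position-based model, the chance of reaching a rank here shrinks faster when highly relevant documents have already been viewed.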
35. Expected Browsing Utility [Yilmaz et al CIKM10]
$D_{EBU}(r) = P(E_r) \cdot P(C \mid R_r)$
$EBU = \sum_{r=1}^{n} D_{EBU}(r) \cdot R_r$
36. Probability of a path
Joint probability of
(1) abandoning at reform 2
(2) reformulating at rank 3 of first query
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant)]
37. Probability of a path
(1) Probability of abandoning at reform 2
×
(2) Probability of reformulating at rank 3 of first query
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant)]
38. Geometric w/ parameter p_reform
(1) Probability of abandoning the session at reformulation i
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant)]
39. Truncated Geometric w/ parameter p_reform
(1) Probability of abandoning the session at reformulation i
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant)]
40. Truncated Geometric w/ parameter p_reform
Geometric w/ parameter p_down
(2) Probability of reformulating at rank j (of 1 to i-1 reform)
[Figure: relevance grid for the ranked lists of Q1, Q2, Q3 (N = non-relevant, R = relevant)]
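The two factors can be multiplied as in the sketch below. The parameterization is an assumption, since the slides name the distributions but not their exact form: p_reform is read as the probability of issuing one more query, so abandoning at query i has probability p_reform^(i-1) (1 - p_reform), renormalized over the n queries of the session; p_down is read as the probability of stepping down one more rank, so reformulating at rank j has probability p_down^(j-1) (1 - p_down). The stopping rank in the final query is not modeled, and all function names are hypothetical.

```python
def truncated_geometric(i, n, p_reform):
    """P(session is abandoned at query i), truncated to queries 1..n."""
    if not 1 <= i <= n:
        return 0.0
    return p_reform ** (i - 1) * (1.0 - p_reform) / (1.0 - p_reform ** n)


def geometric(j, p_down):
    """P(user reformulates after viewing exactly j documents of a ranked list)."""
    return p_down ** (j - 1) * (1.0 - p_down)


def path_probability(reform_ranks, n_queries, p_reform, p_down):
    """Probability of a path, as the product of factor (1) and the factors (2).

    reform_ranks: rank at which the user left each query before the last one,
    so the session is abandoned at query len(reform_ranks) + 1.
    """
    abandon_at = len(reform_ranks) + 1
    prob = truncated_geometric(abandon_at, n_queries, p_reform)
    for rank in reform_ranks:
        prob *= geometric(rank, p_down)
    return prob


# The path from the slides: reformulate at rank 3 of the first query, abandon at query 2.
print(path_probability([3], n_queries=3, p_reform=0.5, p_down=0.5))
```

Truncating the first distribution reflects that a static test collection contains only a fixed number of queries per session, so no probability mass should fall beyond the last one.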
41. Model-based measures
Probabilistic space of users following different paths
• Ω is the space of all paths
• P(ω) is the prob of a user following a path ω in Ω
• M_ω is a measure over a path ω
$esM = \sum_{\omega \in \Omega} P(\omega) M_\omega$
42. Evaluation measures
• Evaluating over paths
• Model-free measures
• Model-based measures
43. Evaluation measures
• Properties
– How do the new measures correlate with previously introduced ones?
– Do they behave as expected, i.e. do they reward early retrieval of relevant documents?
44. Correlations
• TREC 2010 Session track
– nsDCG vs. esNDCG: Kendall's tau = 0.7972
– nsDCG vs. esAP: Kendall's tau = 0.5247
[Figure: scatter plots of esNDCG against nsDCG and of esAP against nsDCG]
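Kendall's tau compares how two measures order the same set of systems. The sketch below computes the tau-a variant, (concordant pairs - discordant pairs) / (n(n-1)/2); the slides do not say which variant was used, and the scores here are made-up placeholders, not TREC results.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equally long score lists (ties count as neither)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores of four systems under two measures.
measure_a = [0.10, 0.20, 0.30, 0.40]
measure_b = [0.05, 0.07, 0.06, 0.09]
print(kendall_tau_a(measure_a, measure_b))
```

A value of 1 means both measures rank the systems identically and -1 means the rankings are reversed; here one of the six system pairs is swapped.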
46. Conclusions
• Extend the evaluation framework to sessions
– Built the appropriate test collection
– Rethought the evaluation measures
• Basic test collection
• Model-free and model-based measures
• Did not talk about:
– Duplicate documents
– Efficient computation of the measures