Part of Information Retrieval evaluation research is limited by the fact that we do not know the distributions of system effectiveness over the populations of topics and, by extension, their true mean scores. The usual workaround is to resample topics from an existing collection and approximate the statistics of interest with the observations made between random subsamples, as if one represented the population and the other a random sample. However, this methodology is clearly limited by the availability of data, the inability to control the properties of these data, and the fact that we do not really measure what we intend to measure. To overcome these limitations, we propose a method based on vine copulas for stochastic simulation of evaluation results where the true system distributions are known upfront. In the basic use case, it takes the scores from an existing collection to build a semi-parametric model representing the set of systems and the population of topics, which can then be used to make realistic simulations of the scores by the same systems but on random new topics. Our ability to simulate this kind of data not only removes the current limitations, but also offers new opportunities for research. As an example, we show the benefits of this approach in two sample applications replicating typical experiments found in the literature. We provide a full R package to simulate new data following the proposed method, which can also be used to fully reproduce the results in this paper.
3. Experiments in IR
[Diagram: Core research (“how well?”) with IR systems, test collection, AP and P@10; Evaluation research (“what if?”) with no. topics, stat. signif. tests, AP, P@10, p-values and Kendall τ; elements are labeled as input, output or conditions]
7. Current Limitations
1. Finite data
2. Unknown Truth
3. Lack of Control
• Finite data: limited to dozens of systems and topics from past evaluations like TREC
• How we Make Do: artificially create other collections by resampling from the existing data
8. Current Limitations
• Unknown Truth: don’t know true properties of systems, such as mean or variance over topics
• How we Make Do: split in two topic sets and consider results with one subset as the truth
9. Current Limitations
• Lack of Control: can’t control properties of systems, such as true mean. Systems are how they are
• How we Make Do: artificial modifications of effectiveness scores that lead to invalid data
11. Stochastic Simulation
[Diagram: Core research (“how well?”) with IR systems, test collection, AP and P@10; Evaluation research (“what if?”) with no. topics, stat. signif. tests, AP, P@10, p-values and Kendall τ; plus a Model with AP and P@10]
13. Stochastic Simulation
• Build a generative model of the joint distribution of system scores
• Simulate scores on new, random topics
• Lack of data
• Unknown truth
• Lack of control
• Fit the model to existing data to make it realistic
• Needs to be flexible to model real data
14. Model
• …of the joint distribution of system scores
• We use copula models, which separate:
1. Marginal distributions, of individual systems
2. Dependence structure, among systems
• Easy to customize: plug and play simulate
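The separation of margins and dependence can be written down with Sklar's theorem. In the notation assumed here (not taken from the slides), $F$ is the joint distribution of the scores of $d$ systems, $F_i$ is the marginal distribution of system $i$, and $C$ is the copula:

```latex
F(x_1, \dots, x_d) = C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr)
```

Changing any $F_i$ while keeping $C$ fixed alters one system's score distribution without touching the dependence among systems, which is what makes this kind of model easy to customize.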
18. Model
• Fit the model:
1. Fit the margins
2. Turn to pseudo-observations: $U_i = F_X(X_i)$, $V_i = F_Y(Y_i)$
3. Fit the copula
Scores of two systems*: $X_1, \dots, X_n$ and $Y_1, \dots, Y_n$
* Generalizes to several systems
21. Model
• Fit the model:
1. Fit the margins
2. Turn to pseudo-observations
3. Fit the copula
• Or instantiate at will
• Simulate from the model:
1. Generate pseudo-observations $U$, $V$
2. Turn into effectiveness scores: $X = F_X^{-1}(U)$, $Y = F_Y^{-1}(V)$
* Generalizes to several systems
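The fit and simulate steps can be sketched in base R for two systems. This is only an illustration under stated assumptions: a Gaussian copula and empirical margins stand in for the vine copulas and fitted margins of the talk, and the input scores are made up rather than taken from a real collection.

```r
# Sketch only: Gaussian copula + empirical margins, NOT the vine copula model of the talk.
set.seed(1)

# Hypothetical scores of two systems on the same 50 topics (correlated by construction).
n <- 50
z <- rnorm(n)
x <- pnorm(0.8 * z + 0.6 * rnorm(n))  # system X
y <- pnorm(0.8 * z + 0.6 * rnorm(n))  # system Y

# Fit 1-2: empirical margins, then pseudo-observations in (0, 1).
u <- rank(x) / (n + 1)
v <- rank(y) / (n + 1)

# Fit 3: the copula, here a single correlation of the normal scores.
rho <- cor(qnorm(u), qnorm(v))

# Simulate 1: pseudo-observations for 1000 new topics.
m <- 1000
z1 <- rnorm(m)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(m)
u_new <- pnorm(z1)
v_new <- pnorm(z2)

# Simulate 2: turn them into scores with the inverse (empirical) margins.
x_new <- quantile(x, u_new, type = 1, names = FALSE)
y_new <- quantile(y, v_new, type = 1, names = FALSE)
```

With empirical margins the simulated scores can only take values already observed; fitted parametric or kernel-smoothed margins, as in the talk, do not have that restriction.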
22. Modeling Dependences
• Gaussian copulas
– Only correlation
– Only symmetric
• R-Vine copulas
– Allows tail dependence
– Allows asymmetry
– Built from pair-copulas (bivariate)
– E.g.: F(S1,S2), F(S4,S2|S1), F(S4,S3|S1,S2), …
– ~40 alternatives based on 12 different families
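To make the pair-copula idea concrete, a joint density of three systems can be decomposed into margins and bivariate copulas only. The notation below is assumed for illustration ($f_i$ and $F_i$ are the marginal density and distribution of system $i$, $c$ are pair-copula densities), and the usual simplifying assumption is made that the conditional pair-copula does not change with the value of $s_1$:

```latex
f(s_1, s_2, s_3) = f_1(s_1)\, f_2(s_2)\, f_3(s_3)
  \cdot c_{12}\bigl(F_1(s_1), F_2(s_2)\bigr)
  \cdot c_{13}\bigl(F_1(s_1), F_3(s_3)\bigr)
  \cdot c_{23|1}\bigl(F_{2|1}(s_2 \mid s_1), F_{3|1}(s_3 \mid s_1)\bigr)
```

Each of the three pair-copulas can come from a different family, which is how a vine can capture tail dependence or asymmetry for some pairs of systems and not for others.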
23. Modeling Margins
• All effectiveness measures have discrete distributions, but for some we can fairly assume they’re continuous
– AP, nDCG
• For some others, this assumption is clearly wrong, so we must preserve the support
– P@10: 1, 0.9, 0.8, …
– RR: 1, 1/2, 1/3, …
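Preserving the support can be illustrated with a small base R sketch. It assumes an empirical margin on the P@10 support and made-up scores; the names `support_p10` and `q_p10` are hypothetical, and the talk's actual discrete margins (for example Beta-Binomial or smoothed ones) are not reproduced here.

```r
# Sketch: a discrete margin whose simulated values never leave the P@10 support.
set.seed(2)
support_p10 <- seq(0, 1, by = 0.1)

# Hypothetical P@10 scores of one system over 50 topics.
scores <- sample(support_p10, 50, replace = TRUE)

# Probability mass on every support point, including unobserved ones.
counts <- table(factor(round(scores * 10), levels = 0:10))
pmf <- as.numeric(counts) / length(scores)
cdf <- cumsum(as.numeric(counts)) / length(scores)

# Inverse CDF: smallest support value whose CDF is >= u.
q_p10 <- function(u) support_p10[findInterval(u, cdf, left.open = TRUE) + 1]

# Turning pseudo-observations into scores keeps them on the support.
simulated <- q_p10(runif(1000))
```

A continuous margin applied to the same pseudo-observations would produce values such as 0.37, which are impossible for P@10.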
25. Transform to Predefined Mean
Problem: given $F$, transform it to $\tilde{F}$ such that $\tilde{\mu} = \mu^*$, preserving the support
Solution: transform with a specific Beta: find $\alpha, \beta > 1$ such that $\tilde{\mu} = \mu^*$, where $\tilde{F}(x) = F_{Beta}(F(x); \alpha, \beta)$
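The Beta transform can be sketched in base R for a discrete margin. Everything specific here is an assumption made for illustration: the original margin is a rescaled Binomial(10, 0.3), the target mean is 0.4, and the search runs over a hypothetical one-parameter family $\alpha = 1 + km$, $\beta = 1 + k(1 - m)$, chosen only because it keeps $\alpha, \beta > 1$. The slides do not say how the pair is actually selected.

```r
# Sketch of F~(x) = F_Beta(F(x); alpha, beta) on the P@10 support.
support_p10 <- seq(0, 1, by = 0.1)

# Hypothetical original margin F: rescaled Binomial(10, 0.3), so its mean is 0.3.
cdf <- pbinom(0:10, size = 10, prob = 0.3)

# PMF of the transformed margin for given alpha and beta.
transformed_pmf <- function(alpha, beta_par) diff(c(0, pbeta(cdf, alpha, beta_par)))

# Assumption (not from the slides): alpha = 1 + k * m, beta = 1 + k * (1 - m),
# so both parameters stay above 1 for any m in (0, 1).
k <- 4
mean_gap <- function(m, target) {
  sum(support_p10 * transformed_pmf(1 + k * m, 1 + k * (1 - m))) - target
}

# Find the m whose transformed margin has the predefined mean.
target_mean <- 0.4
m_hat <- uniroot(mean_gap, c(1e-6, 1 - 1e-6), target = target_mean, tol = 1e-10)$root

alpha <- 1 + k * m_hat
beta_par <- 1 + k * (1 - m_hat)
new_pmf <- transformed_pmf(alpha, beta_par)
new_mean <- sum(support_p10 * new_pmf)
```

Because the transform only reweights the existing support points, the transformed margin is still a valid P@10 distribution. `uniroot` stops with an error if the target mean cannot be reached within the chosen family.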
27. Data
• TREC Web Ad hoc runs 2010-2014
– 50 topics and 30-88 systems each
– 12924 total system-topic pairs
• Continuous measures: AP, nDCG@20, ERR@20
• Discrete measures: P@10, P@20, RR
• Points of Interest
1. Margins
2. Copulas
3. Simulated scores
28. 1. Margins
• 1572 system-measure pairs
• 5425 models successfully fitted
• Log-Likelihood:
– Kernel Smoothing (esp. discrete)
– Normal & Beta 25% of cases
• AIC and BIC:
– Normal & Beta 67% of cases
– Beta-Binomial 50% of P@k
• Transform all to the mean in the given data and select again:
– Kernel Smoothing nearly always
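Selecting a margin by log-likelihood or AIC can be sketched in base R. The scores below are made up, only two candidate families are compared, and a plain Normal is used as a simplification; how the talk's candidates are actually parameterized is not shown in the slides.

```r
# Sketch: compare candidate margins for one system by log-likelihood and AIC.
set.seed(5)
scores <- rbeta(50, 2, 5)   # hypothetical AP-like scores in (0, 1)

# Normal margin: closed-form maximum likelihood estimates.
mu_hat <- mean(scores)
sd_hat <- sqrt(mean((scores - mu_hat)^2))
loglik_normal <- sum(dnorm(scores, mu_hat, sd_hat, log = TRUE))

# Beta margin: numerical maximum likelihood, parameters on the log scale.
neg_loglik_beta <- function(par) {
  -sum(dbeta(scores, exp(par[1]), exp(par[2]), log = TRUE))
}
fit_beta <- optim(c(0, 0), neg_loglik_beta)
loglik_beta <- -fit_beta$value

# Both candidates have two parameters, so AIC = 2 * 2 - 2 * logLik.
aic <- c(normal = 4 - 2 * loglik_normal, beta = 4 - 2 * loglik_beta)
selected <- names(which.min(aic))
```

Pure log-likelihood tends to favor flexible candidates such as kernel smoothing, whereas AIC and BIC penalize the number of parameters, which fits the shift in selected families reported on this slide.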
29. 2. Dependence
• 39627 system pairs
• Fit pair-copulas and select according to Log-Likelihood
• Wide diversity
• Gaussian copulas rarely selected; correlation is not enough
• Complex models are preferred
30. 3. Simulation: Scores
• Simulate 1000 new topics and record deviations from the model
– $\mu - \bar{X}$ and $\sigma^2 - s^2$
• Repeat 1000 times
• Full knowledge of truth encoded in the model
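The check described on this slide can be sketched in base R. A Beta(2, 5) margin is assumed here as a stand-in for a fitted model, so that the true mean and variance are known in closed form; the talk's experiment uses the margins fitted to TREC data instead.

```r
# Sketch: with a known model, deviations from the truth can be measured directly.
set.seed(3)
shape1 <- 2
shape2 <- 5   # hypothetical Beta margin standing in for a fitted model

# True mean and variance, known because the model is known.
mu <- shape1 / (shape1 + shape2)
sigma2 <- shape1 * shape2 / ((shape1 + shape2)^2 * (shape1 + shape2 + 1))

# Repeat 1000 times: simulate 1000 new topics and record the deviations.
deviations <- t(replicate(1000, {
  x <- rbeta(1000, shape1, shape2)
  c(mean = mu - mean(x), var = sigma2 - var(x))
}))

# Average deviations should be close to zero if simulation is faithful to the model.
colMeans(deviations)
```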
31. 3. Simulation: Dependencies
• Web 2010, nDCG@20
• Simulate 500 new topics
• Dependence captured in the model
45. 1. Type I Errors
[With simulation]
• Same margins: Type I errors at α=5% and 1%:
– AP: 4.9% and 0.9%
– P@10: 4.9% and 1%
• Transformed margins: Type I errors at α=5% and 1%:
– AP: 4.9% and 0.9%
– P@10: 5% and 1%
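A Type I error experiment of this kind can be sketched in base R. Several choices are assumptions made for illustration and are not stated in the slides: the paired t-test as the significance test, a Gaussian copula with correlation 0.6 as the dependence, a Beta(2, 5) margin shared by both systems, and 50 topics per trial.

```r
# Sketch: Type I error rate when both systems share the same margin (null is true).
set.seed(4)
n_topics <- 50
n_trials <- 2000
rho <- 0.6   # hypothetical dependence between the two systems

p_values <- replicate(n_trials, {
  # Dependent pseudo-observations from a Gaussian copula.
  z1 <- rnorm(n_topics)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n_topics)
  # Same Beta(2, 5) margin for both systems, so their true means are equal.
  x <- qbeta(pnorm(z1), 2, 5)
  y <- qbeta(pnorm(z2), 2, 5)
  t.test(x, y, paired = TRUE)$p.value
})

# Every significant result is a Type I error because the null holds by construction.
type1_rate <- mean(p_values < 0.05)
```

The point of simulating is that the null hypothesis is known to be true, so the observed rate can be compared directly with the nominal α.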
46. 2. Sneak Peek: Statistical Power and σ
[Webber et al, 2008]
• Show empirical evidence of the problem of sequential testing
• Limited data
• Unknown truth (true σ)
48. Today
• Part of Evaluation Research has data-related limitations
– Lack of data, no knowledge of truth, no control
– How valid are our results?
• We propose a methodology for stochastic simulation to eliminate these limitations
– Flexible, realistic, highly customizable
– Allows us to study new problems, directly
49. Tomorrow
• Even more flexibility
• Simulate new systems for given topics
• Add third factors
– Fixed: already possible
– Random: we’ll see
• Simulate full runs (doc scores & relevance)
50. simIReff
• All results fully reproducible
• Developed a full R package for simulation
https://github.com/julian-urbano/simIReff
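library(simIReff)
effs <- effDiscFitAndSelect(data, support("p20"))  # fit and select discrete margins for P@20
cop <- effcopFit(data, effs)                       # fit the copula on top of those margins
y <- reffcop(1000, cop)                            # simulate scores on 1000 new topics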