The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts have set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has gained enormously from this evaluation forum, but very little attention has been paid to issues of reliability and correctness. From the standpoint of experimental validity, this paper surveys past meta-evaluation work in Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR research cycle, and it points out directions for further research and proposals along these lines.
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
1. Information Retrieval Meta-Evaluation:
Challenges and Opportunities
in the Music Domain
Julián Urbano @julian_urbano
University Carlos III of Madrid
ISMIR 2011
Miami, USA · October 26th
Picture by Daniel Ray
7. ISMIR 2001 resolution on the need to create
standardized MIR test collections, tasks, and
evaluation metrics for MIR research and development
[Timeline 1960-2011: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995),
TREC (1992-today), NTCIR (1999-today), CLEF (2000-today), ISMIR (2000-today)]
3 workshops (2002-2003):
The MIR/MDL Evaluation Project
8. ISMIR 2001 resolution on the need to create
standardized MIR test collections, tasks, and
evaluation metrics for MIR research and development
[Same timeline 1960-2011, now adding MIREX (2005-today)]
follow the steps of the Text IR folks
but carefully: not everything applies to music
3 workshops (2002-2003):
The MIR/MDL Evaluation Project
MIREX (2005-today): >1200 runs!
9. are we done already?
Evaluation is not easy
nearly 2 decades of Meta-Evaluation in Text IR
[Timeline 1960-2011: Cranfield 2, MEDLARS, SMART, TREC, NTCIR, CLEF, ISMIR, MIREX]
positive impact on MIR
10. are we done already?
Evaluation is not easy
nearly 2 decades of Meta-Evaluation in Text IR
[Same timeline; annotations point at the Text IR forums:]
a lot of things have happened here!
some good practices inherited from here
"not everything applies", but much of it does!
positive impact on MIR
14. Experimental Validity
how well an experiment meets the well-grounded
requirements of the scientific method
do the results fairly and actually assess
what was intended?
Meta-Evaluation
analyze the validity of IR Evaluation experiments
15. [Table: which components of the evaluation experiment each type of validity involves.
Components: Ground truth, User model, Documents, Measures, Systems, Queries, Task.
Validity types: Construct, Content, Convergent, Criterion, Internal, External, Conclusion.]
17. construct validity
#fail
measure quality of a Web search engine
by the number of visits
what?
do the variables of the experiment correspond
to the theoretical meaning of the concept
they purport to measure?
how?
thorough selection and justification
of the variables used
18. construct validity in IR
effectiveness measures and their user model
[Carterette, SIGIR2011]
set-based measures do not resemble real users
[Sanderson et al., SIGIR2010]
rank-based measures are better
[Järvelin et al., TOIS2002]
graded relevance is better
[Voorhees, SIGIR2001][Kekäläinen, IP&M2005]
other forms of ground truth are better
[Bennett et al., SIGIRForum2008]
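To make the contrast above concrete, here is a minimal Python sketch (hypothetical document ids and judgments, not MIREX code) of a set-based measure next to a rank-based, graded-relevance measure with a logarithmic discount, in the spirit of the discounted cumulated gain family cited above.

    # Minimal sketch: set-based vs rank-based, graded-relevance effectiveness.
    import math

    def precision_at_k(ranking, relevant, k):
        # Set-based view: fraction of the top-k documents that are relevant,
        # ignoring the order within the top k and any relevance grades.
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    def ndcg_at_k(ranking, grades, k):
        # Rank-based view with graded relevance and a logarithmic discount
        # (the discounted cumulated gain family of Järvelin & Kekäläinen).
        dcg = sum(grades.get(doc, 0) / math.log2(i + 2)
                  for i, doc in enumerate(ranking[:k]))
        ideal = sorted(grades.values(), reverse=True)[:k]
        idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
        return dcg / idcg if idcg > 0 else 0.0

    # Hypothetical data: one ranked list, binary relevant set, graded judgments.
    ranking = ["d3", "d7", "d1", "d9", "d2"]
    relevant = {"d1", "d3", "d9"}
    grades = {"d3": 2, "d1": 1, "d9": 2}

    print(precision_at_k(ranking, relevant, 5))   # order-blind
    print(ndcg_at_k(ranking, grades, 5))          # rewards early, highly graded hits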
19. content validity
#fail
measure reading comprehension
only with sci-fi books
what?
do the experimental units reflect and represent
the elements of the domain under study?
how?
careful selection of the experimental units
20. content validity in IR
tasks closely resembling real-world settings
systems completely fulfilling real-user needs
heavy user component, difficult to control
evaluate the system component instead
[Cleverdon, SIGIR2001][Voorhees, CLEF2002]
actual value of systems is really unknown
[Marchionini, CACM2006]
sometimes they just do not work with real users
[Turpin et al., SIGIR2001]
21. content validity in IR
documents resembling real-world settings
large and representative samples
especially for Machine Learning
careful selection of queries, diverse but reasonable
[Voorhees, CLEF2002][Carterette et al., ECIR2009]
random selection is not good
some queries are better at differentiating bad systems
[Guiver et al., TOIS2009][Robertson, ECIR2011]
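As a sketch of what careful query selection can mean in practice, the snippet below (assumed data layout, illustrative only) greedily picks a small query subset whose induced system ranking best agrees with the ranking obtained from all queries, in the spirit of Guiver et al. (TOIS 2009).

    # Minimal sketch: which few queries rank systems like the full query set does?
    import numpy as np
    from scipy.stats import kendalltau

    def mean_scores(scores, queries=None):
        # scores: (n_systems, n_queries) effectiveness matrix; average over queries.
        return scores.mean(axis=1) if queries is None else scores[:, queries].mean(axis=1)

    def greedy_query_subset(scores, subset_size):
        full = mean_scores(scores)
        chosen, best_tau = [], None
        for _ in range(subset_size):
            best_q, best_tau = None, -2.0
            for q in range(scores.shape[1]):
                if q in chosen:
                    continue
                # Agreement between the subset-based and full-set system orderings.
                tau, _ = kendalltau(full, mean_scores(scores, chosen + [q]))
                if tau > best_tau:
                    best_q, best_tau = q, tau
            chosen.append(best_q)
        return chosen, best_tau

    # Hypothetical per-query scores for 10 systems over 25 queries.
    rng = np.random.default_rng(0)
    scores = rng.random((10, 25))
    subset, tau = greedy_query_subset(scores, 5)
    print(subset, tau)  # 5 queries that order systems much like all 25 do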
22. convergent validity
#fail
measures of math skills not correlated
with abstract thinking
what?
do the results agree with other results, theoretical or
experimental, that they should be related to?
how?
careful examination and confirmation
of the relationship between the results
and others supposedly related
23. convergent validity in IR
ground truth data is subjective
differences across groups and over time
different results depending on who evaluates
absolute numbers change
relative differences remain stable for the most part
[Voorhees, IP&M2000]
for large-scale evaluations or varying experience
of assessors, differences do exist
[Carterette et al., 2010]
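One way to quantify that subjectivity, sketched below with hypothetical binary judgments, is to compute raw agreement and Cohen's kappa between two assessors over the same pool of documents.

    # Minimal sketch: how much do two assessors agree on the same documents?
    def assessor_agreement(judg_a, judg_b):
        # judg_a, judg_b: dict mapping document id -> 0/1 relevance by each assessor.
        docs = sorted(set(judg_a) & set(judg_b))
        a = [judg_a[d] for d in docs]
        b = [judg_b[d] for d in docs]
        observed = sum(x == y for x, y in zip(a, b)) / len(docs)
        # Chance agreement from each assessor's marginal rate of "relevant" labels.
        pa, pb = sum(a) / len(a), sum(b) / len(b)
        expected = pa * pb + (1 - pa) * (1 - pb)
        kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
        return observed, kappa

    # Hypothetical judgments by two assessors over five documents.
    judg_a = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}
    judg_b = {"d1": 1, "d2": 1, "d3": 1, "d4": 0, "d5": 0}
    print(assessor_agreement(judg_a, judg_b))  # raw agreement 0.6, kappa ≈ 0.17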
24. convergent validity in IR
measures are precision- or recall-oriented
they should therefore be correlated with each other
but they actually are not (reliability?)
[Kekäläinen, IP&M2005][Sakai, IP&M2007]
better correlated with others than with themselves!
[Webber et al., SIGIR2008]
correlation with user satisfaction in the task
[Sanderson et al., SIGIR2010]
ranks, unconventional judgments, discounted gain…
[Bennett et al., SIGIRForum2008][Järvelin et al., TOIS2002]
25. criterion validity
#fail
ask if the new drink is good
instead of better than the old one
what?
are the results correlated with those of
other experiments already known to be valid?
how?
careful examination and confirmation of the
correlation between our results and previous ones
26. criterion validity in IR
practical large-scale methodologies: pooling
[Buckley et al., SIGIR2004]
(less effort, but same results?)
judgments by non-experts
[Bailey et al., SIGIR2008]
crowdsourcing for low-cost judgments
[Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]
estimate measures with fewer judgments
[Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]
select what documents to judge, by informativeness
[Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]
use no relevance judgments at all
[Soboroff et al., SIGIR2001]
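For illustration, here is a minimal sketch of depth-k pooling over hypothetical runs: only the pooled documents are sent to the assessors, and everything outside the pool is treated as non-relevant by the measures.

    # Minimal sketch: depth-k pooling, the practical alternative to judging
    # every document for every query.
    def depth_k_pool(runs, k):
        # runs: dict mapping system id -> ranked list of document ids for one query.
        pool = set()
        for ranking in runs.values():
            pool.update(ranking[:k])
        return pool

    # Hypothetical runs for a single query.
    runs = {
        "sysA": ["d1", "d2", "d3", "d4"],
        "sysB": ["d3", "d5", "d1", "d6"],
        "sysC": ["d7", "d2", "d8", "d9"],
    }
    print(sorted(depth_k_pool(runs, 2)))  # judge only d1, d2, d3, d5, d7 instead of all nine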
27. internal validity
#fail
measure usefulness of Windows vs Linux vs iOS
only with Apple employees
what?
can the conclusions be rigorously drawn
from the experiment alone
and not other overlooked factors?
how?
careful identification and control of possible
confounding variables and selection of design
28. internal validity in IR
inconsistency: performance depends on assessors
[Voorhees, IP&M2000][Carterette et al., SIGIR2010]
incompleteness: performance depends on pools
system reinforcement
[Zobel, SIGIR1998]
affects reliability of measures and overall results
[Sakai, JIR2008][Buckley et al., SIGIR2007]
especially for Machine Learning
train-test: same characteristics in queries and docs
improvements on the same collections: overfitting
[Voorhees, CLEF2002]
measures must be fair to all systems
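One well-known response to incompleteness is the bpref measure of Buckley & Voorhees (SIGIR 2004), which ignores unjudged documents instead of counting them as non-relevant; a minimal sketch with hypothetical data follows.

    # Minimal sketch: bpref, a measure designed to be more robust than MAP
    # when relevance judgments are incomplete.
    def bpref(ranking, relevant, judged_nonrelevant):
        R, N = len(relevant), len(judged_nonrelevant)
        if R == 0:
            return 0.0
        score, nonrel_seen = 0.0, 0
        for doc in ranking:
            if doc in relevant:
                # Penalize relevant documents by the judged non-relevant ones above them.
                penalty = (min(nonrel_seen, R) / min(R, N)) if N > 0 else 0.0
                score += 1.0 - penalty
            elif doc in judged_nonrelevant:
                nonrel_seen += 1
            # documents that were never judged contribute nothing either way
        return score / R

    # Hypothetical data: dX was never judged, so it is simply skipped.
    ranking = ["d1", "dX", "d2", "d3", "d4"]
    relevant = {"d1", "d3"}
    judged_nonrelevant = {"d2", "d4"}
    print(bpref(ranking, relevant, judged_nonrelevant))  # 0.75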
29. external validity
#fail
study cancer treatment mostly with teenage males
what?
can the results be generalized
to other populations and experimental settings?
how?
careful design and justification
of sampling and selection methods
30. external validity in IR
weakest point of IR Evaluation
[Voorhees, CLEF2002]
large-scale is always incomplete
[Zobel, SIGIR1998][Buckley et al., SIGIR2004]
test collections are themselves an evaluation result
but they become hardly reusable
[Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
31. external validity in IR
systems perform differently with different collections
cross-collection comparisons are unjustified
performance depends highly on test collection characteristics
[Bodoff et al., SIGIR2007][Voorhees, CLEF2002]
interpretation of results must be in terms of
pairwise comparisons, not absolute numbers
[Voorhees, CLEF2002]
do not claim anything about state of the art
based on a handful of experiments
baselines can be used to compare across collections
[Armstrong et al., CIKM2009]
(meaningful baselines, not random!)
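A small illustration of that point, with made-up numbers: report improvements relative to a shared, meaningful baseline rather than comparing absolute scores across collections.

    # Minimal sketch: relative improvement over a common baseline.
    def relative_improvement(system_score, baseline_score):
        return (system_score - baseline_score) / baseline_score

    # Hypothetical MAP of the same two systems on two different test collections.
    collections = {
        "collectionA": {"baseline": 0.22, "new_system": 0.26},
        "collectionB": {"baseline": 0.35, "new_system": 0.38},
    }
    for name, s in collections.items():
        imp = relative_improvement(s["new_system"], s["baseline"])
        print(f"{name}: {imp:+.1%} over the baseline")
    # Absolute MAPs (0.26 vs 0.38) are not comparable across collections,
    # but improvements over the same baseline are at least interpretable.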
32. conclusion validity
#fail
more access to the Internet in China than in the US
because of the larger total number of users
what?
are the conclusions justified based on the results?
how?
careful selection of the measuring instruments and
statistical methods used to draw grand conclusions
33. conclusion validity in IR
measures should be sensitive and stable
[Buckley et al., SIGIR2000]
and also powerful
[Voorhees et al., SIGIR2002][Sakai, IP&M2007]
with little effort
[Sanderson et al., SIGIR2005]
always bearing in mind
the user model and the task
34. conclusion validity in IR
statistical methods to compare score distributions
[Smucker et al., CIKM2007][Webber et al., CIKM2008]
correct interpretation of the statistics
hypothesis testing is troublesome
statistical significance ≠ practical significance
increasing #queries (sample size) increases power
to detect ever smaller differences (effect size)
eventually, everything is statistically significant
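A minimal sketch of this, with synthetic per-query scores: a paired randomization test in the spirit of Smucker et al. (CIKM 2007) that reports the effect size alongside the p-value. With enough queries, even a tiny and practically meaningless difference comes out statistically significant.

    # Minimal sketch: paired randomization test on per-query score differences.
    import random

    def randomization_test(scores_a, scores_b, trials=10000, seed=0):
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = sum(diffs) / len(diffs)      # effect size: mean per-query difference
        count = 0
        for _ in range(trials):
            # Under the null hypothesis the sign of each per-query difference is arbitrary.
            permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
            if abs(permuted) >= abs(observed):
                count += 1
        return observed, count / trials         # (effect size, p-value)

    # Two systems whose per-query scores differ by a tiny but consistent amount.
    random.seed(1)
    scores_b = [random.random() for _ in range(1000)]
    scores_a = [s + 0.005 + random.gauss(0, 0.01) for s in scores_b]
    print(randomization_test(scores_a, scores_b))
    # With 1000 queries the ~0.005 difference is statistically significant,
    # which says nothing about whether users would ever notice it.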
45. IR Research & Development Cycle
[Diagram annotations:]
collections are too small and/or biased
lack of realistic, controlled public collections
standard formats and evaluation software to minimize bugs
can't replicate results, often leading to wrong conclusions
private, undescribed and unanalyzed collections emerge
47. IR Research & Development Cycle
[Diagram annotations:]
lack of baselines as lower bound (random is not a baseline!)
undocumented measures, no accepted evaluation software
proper statistics
correct interpretation of statistics
49. IR Research & Development Cycle
[Diagram annotations:]
unknown raw musical material
undocumented queries and/or documents
go back to private collections: overfitting!
53. collections
large, heterogeneous and controlled
not a hard endeavour, except for the damn copyright
Million Song Dataset!
still problematic (new features? actual music?)
standardize collections across tasks
better understanding and use of improvements
54. raw music data
essential for Learning and Improvement phases
use copyright-free data
Jamendo!
study possible biases
reconsider artificial material
55. evaluation model
let teams run their own algorithms
(needs public collections)
relieve IMIRSEL and promote wider participation
successfully used for 20 years in Text IR venues
adopted by MusiCLEF
only viable alternative in the long run
MIREX-DIY platforms still don’t allow full completion
of the IR Research & Development Cycle
56. organization
IMIRSEL plans, schedules and runs everything
add a 2nd tier of organizers, task-specific
logistics, planning, evaluation, troubleshooting…
format of large forums like TREC and CLEF
smooth the process and develop tasks that really
push the limits of the state of the art
57. overview papers
every year, by task organizers
detail the evaluation process, data, results
discussion to boost Interpretation and Learning
perfect wrap-up for team papers,
which rarely discuss results, and many are not even drafted
58. specific methodologies
MIR has unique methodologies and measures
meta-evaluate: analyze and improve
human effects on the evaluation
user satisfaction
59. standard evaluation software
bugs are inevitable
open evaluation software to everybody
gain reliability
speed up the development process
serve as documentation for newcomers
promote standardization of formats
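As a sketch of what shared evaluation code buys, the snippet below (assuming TREC-style plain-text formats, not an actual MIREX format: qrels lines "topic 0 docid relevance" and run lines "topic Q0 docid rank score tag") reads both files and computes mean average precision, so every team can inspect and rerun exactly the same code.

    # Minimal sketch: open, shared evaluation code over standardized file formats.
    from collections import defaultdict

    def read_qrels(path):
        rels = defaultdict(set)
        with open(path) as f:
            for line in f:
                topic, _, doc, rel = line.split()
                if int(rel) > 0:
                    rels[topic].add(doc)
        return rels

    def read_run(path):
        run = defaultdict(list)
        with open(path) as f:
            for line in f:
                topic, _, doc, rank, score, _tag = line.split()
                run[topic].append((int(rank), doc))
        # Sort each topic's documents by their reported rank.
        return {t: [d for _, d in sorted(docs)] for t, docs in run.items()}

    def average_precision(ranking, relevant):
        hits, total = 0, 0.0
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / i
        return total / len(relevant) if relevant else 0.0

    def mean_average_precision(run, rels):
        topics = sorted(rels)
        return sum(average_precision(run.get(t, []), rels[t]) for t in topics) / len(topics)

    # Hypothetical file names:
    # print(mean_average_precision(read_run("run.txt"), read_qrels("qrels.txt")))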
60. baselines
help measure the overall progress of the field
standard formats + standard software +
public controlled collections + raw music +
task-specific organization
measure the state of the art
61. commitment
we need to acknowledge the current problems
MIREX should not only be a place to
evaluate and improve systems
but also a place to
meta-evaluate and improve how we evaluate
and a place to
design tasks that challenge researchers
analyze our evaluation methodologies
62. we all need to start
questioning
evaluation practices