Evaluation in (Music) Information Retrieval
through the Audio Music Similarity task
Julián Urbano

Barcelona, Spain · January 16th 2014
Spam
• @julian_urbano
• Postdoctoral researcher
– Music Technology Group, Universitat Pompeu Fabra

• Recently: PhD, Computer Science
– (Evaluation in) (Music) Information Retrieval

2
Information Retrieval
• Automatic representation, storage and search of
unstructured information
– Traditionally textual information
– Lately multimedia too: images, video, music

• A user has an information need and uses an IR
system that retrieves the relevant or significant
information from a collection of documents
3
Information Retrieval Evaluation
• IR systems are based on models to estimate
relevance, implementing different techniques
• How good is my system? Which system is better?
– Answered with IR Evaluation experiments
– “if you can’t measure it, you can’t improve it”
– But we need to be able to trust our measurements

• Research on IR Evaluation
– Improve our methods to evaluate systems
– Critical for the correct development of the field
4
Disclaimer
• If you see…
A system is evaluated with a test collection
containing queries, documents and judgments
telling how relevant a document is to a query

• …you can think of
An algorithm is evaluated with a dataset
containing queries, songs and annotations
telling how similar a song is to a query
5
Talk outline
• Why we want to Evaluate…
• …and what we do with Cranfield
• Validity: users versus systems
• Reliability: estimating from samples
• Efficiency: reducing annotations

6
Introduction:
Why we want to Evaluate…
The two questions
• How good is my system?
– What does good mean?
– What is good enough?

• Is system A better than system B?
– What does better mean?
– How much better?

• Efficiency? Effectiveness? Ease?
8
Measure user experience
• We are interested in user-measures
– Time to complete task, idle time, success rate, failure
rate, frustration, ease to learn, ease to use …

• Their distributions describe user experience, fully
– For an arbitrary user, query and document collection,
what is the distribution of…

[Two sketched distributions: time to complete task (increasing from 0) and frustration (none, some, much)]
9
The big(ger) picture
• Different user-measures attempting to assess the
same thing: user satisfaction
– How likely is it that an arbitrary user, with an arbitrary
query (and with an arbitrary document collection) will
be satisfied by the system?

• This is the ultimate goal: the good, the better

10
The big(ger) question
• User satisfaction…as Bernoulli trial

[Bernoulli sketch: satisfaction ∈ {no, yes}]

• Probability of satisfaction P(Sat = yes)?
• Probability that k in n users are satisfied?
• Probability of >80% users satisfied?
11
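A minimal sketch of these binomial computations (assuming scipy), with a single-user satisfaction probability of 0.7 — the value used in a worked example later in the deck, where P(Sat15 = 10) ≈ 0.21:

```python
from scipy.stats import binom

p_sat = 0.7   # assumed P(Sat = yes) for one arbitrary user
n = 15        # number of users

# Probability that exactly k = 10 of the n users are satisfied
print(binom.pmf(10, n, p_sat))             # ~0.21

# Probability that more than 80% of the n users are satisfied
k_min = int(0.8 * n) + 1                   # smallest k with k/n > 0.8
print(1 - binom.cdf(k_min - 1, n, p_sat))
```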
Introduction:
…what we do with Cranfield
Sources of variability
user-measure = f(documents, query, user, system)
• Our goal is the distribution of the user-measure
for our system, which is impossible to calculate
– (Possibly?) infinite population

• The best we can do is estimate it
– Sample documents, queries and users
– Measure user experience, implicitly or explicitly
– Representativeness, cost, ethics, privacy…
13
Fix samples
• Hard to replicate experiment and repeat results
• Just plain impossible to reproduce results
• Get a (hopefully) good sample and fix it
– Documents and queries

• But we can’t fix the users!

14
Simulate users…and fix them
• Cranfield paradigm: remove users, but include a
user-abstraction, fixed across experiments
– Static user component: judgments in the ground truth
– Dynamic user component: effectiveness measures

• Remove all sources of variability, except systems
user-measure = f(documents, query, user, system)
user-measure = f(system)
15
Test collections
• Controlled set of documents, queries and
judgments, shared across researchers
• (Most?) important resource for IR research
– Experiments are inexpensive (collections are not!)
– Research becomes systematic
– Reproducibility becomes possible and easy

16
Wait a minute
• Are we estimating distributions about users or
distributions about systems?
system-effectiveness = f(system, scale, measure)
• We come up with different distributions of
system-effectiveness, depending on how we
abstract users from the experiment
– Different scales to assess relevance
– Different measures to model user behavior
17
Assumption
• System-measures correspond to user-measures

Users: time to complete task, idle time, success rate,
failure rate, frustration, ease to learn, ease to use,
satisfaction…

Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q…

18-19
Experiments with Test Collections
• Our goal is the users
user-measure = f(system)
• but Cranfield tells us about systems
system-effectiveness = f(system, scale, measure)
• This poses several problems
– That we have been dealing with for over 50 years
– But hey, they’re extremely interesting!
20
Validity, Reliability and Efficiency
• Validity: are we measuring what we want to?
– Internal: are observed effects due to hidden factors?
– External: are queries, documents and users representative?
– Construct: do system-measures match user-measures?
– Conclusion: how good is good and how better is better?

• Reliability: how repeatable are the results?
– How large do collections need to be?
– What statistical methods should be used?

• Efficiency: how inexpensive is it to get valid and
reliable results? (i.e. to build a test collection)
– Can we estimate results with fewer judgments?
21
In this talk
How to study and improve
the validity, reliability and efficiency
of the methods used to evaluate IR systems
• Audio Music Similarity task as example
– Song as query input to system, audio signal
– Retrieve songs musically similar to it, by content
– Resembles traditional Ad Hoc retrieval in Text IR
– Important task in Music IR
• Music recommendation
• Playlist generation
• Plagiarism detection
22
Validity:
Effectiveness and Satisfaction
Assumption of Cranfield
• Systems with better effectiveness are perceived
by users as more useful, more satisfactory
• Tricky: different effectiveness measures and
relevance scales produce different distributions
– Which one is better to predict satisfaction?

• Map system effectiveness onto user satisfaction,
experimentally
– If P@10 = 0.2, how likely is it that an arbitrary user will
find the results satisfactory?
– What is P(Sat | P@10 = 0.2)?
24
User-oriented System-measures
• Effectiveness measures are generally not
formulated to correlate with user-satisfaction
– If effectiveness is λ = 0, we expect P(Sat) = 0
– If effectiveness is λ = 1, we expect P(Sat) = 1
– In general, we expect P(Sat | λ) = λ

• But this is not what we have
– Effectiveness measures need to be reformulated
– Upper bounds, recall components, ideal rankings
– Many mathematical details omitted in this talk

25
User Components: Measures and Scales
• How is relevance measured in the judgments?
– Nominal, ordinal, interval, ratio

• How are results consumed?
– Set, list

• What determines document utility?
– Positional, cascade
– Linear, exponential

• What determines user persistence?
– Navigational, informational
26
Measures and Scales
[Table: 17 effectiveness measures at cutoff 5 (P@5, AP@5, RR@5, CGl@5, CGe@5,
DCGl@5, DCGe@5, EDCGl@5, EDCGe@5, Ql@5, Qe@5, RBPl@5, RBPe@5, ERRl@5, ERRe@5,
GAP@5, ADR@5) crossed with nine relevance scales: the original Broad and Fine
scales used in MIREX, artificial graded scales with nℒ=3, 4 and 5 levels, and
artificial binary scales with thresholds ℓmin=20, 40, 60 and 80. An X marks
each measure-scale combination studied; a cell naming another measure means
the combination reduces to that measure (e.g., with binary judgments CGl@5
and CGe@5 reduce to P@5, DCGe@5 to DCGl@5, and Ql@5, Qe@5 and GAP@5 to AP@5).]
27
Experimental design

28
What can we infer?
• Preference: difference noticed by user
– Positive: user agrees with evaluation
– Negative: user disagrees with evaluation

• Non-preference: difference not noticed by user
– Good: both systems are satisfactory
– Bad: both systems are not satisfactory

29
Data
• Queries, documents and judgments from MIREX
• 4115 unique, artificially constructed examples
– At least 200 examples per (measure, scale, λ) combination

• 432 unique queries, 5636 unique documents
• Answers collected via Crowdsourcing
– Quality control with trap questions

• 113 unique subjects
30
Single system: how good is it?
• For 2045 examples (49%) users could not decide
which system was better

What do we expect?

31
Single system: how good is it?
• Large ℓmin thresholds underestimate satisfaction

32
Single system: how good is it?
• Users don’t pay attention to ranking?

33
Single system: how good is it?
• Exponential gain underestimates satisfaction

34
Single system: how good is it?
• Document utility independent of others

35
Two systems: which one is better?
• For 2090 examples (51%) users did prefer one
system over the other one

What do we expect?

36
Two systems: which one is better?
• Large differences needed for users to note them

37
Two systems: which one is better?
• More relevance levels are better to discriminate

38
Two systems: which one is better?
• Cascade and navigational user models are not
appropriate

39
Two systems: which one is better?
• Users do prefer the (supposedly) worse system

40
Summary
• Effectiveness and satisfaction are clearly correlated
– There is a 20% bias: P(Sat | 0) > 0 and P(Sat | 1) < 1
– Room to improve: personalization, better user abstraction

• Magnitude of differences does matter
– Just looking at rankings is very naive: be careful with statistical significance
– Need Δλ ≈ 0.4 for users to agree with effectiveness
(historically, only 20% of the time in MIREX)

• Differences among measures and scales
– Linear gain slightly better than exponential gain
– Informational and positional user models better than
navigational and cascade
– The more relevance levels, the better
41
Measures and scales
[Recap of the measure × scale table from slide 27, shown against the
summary findings above.]
42-43
Validity:
Satisfaction over Samples
Evaluate in terms of user satisfaction
• So far, arbitrary users for a single query
– P(Sat | Ql@5 = 0.61) = 0.7

• Easily for n users and a single query
– P(Sat15 = 10 | Ql@5 = 0.61) = 0.21

• What about a sample of queries 𝒬?
– Map queries separately for the distribution of P(Sat)
– For easier mappings, P(Sat | λ) functions are
interpolated with simple polynomials
45
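A minimal sketch of such a polynomial mapping, with made-up (λ, P(Sat)) pairs standing in for the crowdsourced mapping data:

```python
import numpy as np

# Hypothetical mapping data: effectiveness values and the observed
# fraction of satisfied users at each value (made-up numbers)
lam = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
p_sat = np.array([0.18, 0.35, 0.52, 0.66, 0.78, 0.85])

# Low-degree polynomial interpolation of P(Sat | lambda)
p_sat_fn = np.poly1d(np.polyfit(lam, p_sat, deg=3))

print(p_sat_fn(0.61))   # mapped satisfaction for a query with lambda = 0.61
```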
Expected probability of satisfaction
• Now we can compute point and interval estimates
of the expected probability of satisfaction
• Intuition fails when interpreting effectiveness

46
System success
• If P(Sat) ≥ threshold the system is successful
– Setting the threshold was rather arbitrary before
– Now it is meaningful, in terms of user satisfaction

• Intuitively, we want the majority of users to find
the system satisfactory
– P(Succ) = P(P(Sat) > 0.5) = 1 − F_P(Sat)(0.5)

• Improving queries for which we do badly is worth more
than further improving those for which we are already good
47
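A minimal sketch, assuming each query's effectiveness has already been mapped to a satisfaction probability; with the ecdf as estimator, P(Succ) is simply the fraction of queries whose P(Sat) exceeds 0.5:

```python
import numpy as np

# Hypothetical per-query satisfaction probabilities P(Sat)
p_sat_q = np.array([0.91, 0.84, 0.40, 0.77, 0.33, 0.95, 0.62, 0.58])

# P(Succ) = P(P(Sat) > 0.5) = 1 - F_P(Sat)(0.5), estimated with the ecdf
p_succ = np.mean(p_sat_q > 0.5)
print(p_succ)
```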
Distribution of P(Sat)
• But we (will) only have a handful of queries, so
estimates will probably be poor
– Need to estimate the cumulative distribution function of
user satisfaction: FP(Sat)
– Not described by any typical distribution family

• More than ≈25 queries in the collection
– ecdf approximates better

• Less than ≈25 queries in the collection

– Normal for graded scales, ecdf for binary scales

• Beta is always the best with the Fine scale
– Which turns out to be the best scale, overall
48
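And a minimal sketch of the parametric alternative: fitting a Beta distribution by the method of moments to the same hypothetical P(Sat) sample and reading P(Succ) off its CDF:

```python
import numpy as np
from scipy.stats import beta

p_sat_q = np.array([0.91, 0.84, 0.40, 0.77, 0.33, 0.95, 0.62, 0.58])

# Method-of-moments estimates of the Beta parameters
m, v = p_sat_q.mean(), p_sat_q.var(ddof=1)
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common

print(1 - beta.cdf(0.5, a, b))   # P(Succ) under the fitted Beta
```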
Intuition fails, again
Intuitive conclusions based on effectiveness alone
contradict those based on user satisfaction
– E[Δλ] = −0.002
– E[ΔP(Sat)] = 0.001
– E[ΔP(Succ)] = 0.07
49
Historically, in MIREX
• Systems are not as satisfactory as we thought
• But they are more successful
– Good (or bad) for some kinds of queries

50
Measures and scales
[Reduced measure × scale table: 11 measures (P@5, AP@5, CGl@5, CGe@5, DCGl@5,
DCGe@5, Ql@5, Qe@5, RBPl@5, RBPe@5, GAP@5) crossed with the Broad and Fine
scales, artificial graded scales with nℒ=4 and 5 levels, and artificial
binary scales with ℓmin=20 and 40; binary cells again name the measure each
combination reduces to.]
51-52
Reliability:
Optimal Statistical Significance Tests
Random error
• Test collections are just samples from larger,
possibly infinite, populations
• If we conclude system A is better than B, how
confident can we be?
– The average Δλ over the query sample 𝒬 is just an estimate of the population mean μΔλ

• Usually employ some statistical significance test
for differences in location
• If it is statistically significant, we have confidence
that the true difference is at least that large
54
Statistical hypothesis testing
• Set two mutually exclusive hypotheses
– H0 : μΔλ = 0
– H1 : μΔλ ≠ 0

• Run test, obtain p-value = P(a Δλ over 𝒬 at least this large | H0)
– p ≤ α: statistically significant, high confidence
– p > α: statistically non-significant, low confidence

• Possible errors in the binary decision
– Type I: incorrectly reject H0
– Type II: incorrectly accept H0
55
Statistical significance tests
• (Non-)parametric tests
– t-test, Wilcoxon test, Sign test

• Based on resampling
– Bootstrap test, permutation/randomization test

• They make certain assumptions about
distributions and sampling methods
– Often violated in IR evaluation experiments
– Which test behaves better, in practice, knowing that
assumptions are violated?
56
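For instance, a minimal sketch of a paired randomization (permutation) test on the per-query score differences of two systems:

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired permutation test for the mean difference between
    per-query effectiveness scores of systems A and B."""
    rng = np.random.default_rng(seed)
    d = np.asarray(a) - np.asarray(b)
    observed = d.mean()
    # Under H0 the sign of each per-query difference is exchangeable
    signs = rng.choice([-1, 1], size=(n_perm, d.size))
    perm_means = (signs * d).mean(axis=1)
    return float(np.mean(np.abs(perm_means) >= abs(observed)))
```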
Optimality criteria
• Power
– Achieve significance as often as possible (low Type II)
– Usually increases Type I error rates

• Safety
– Minimize Type I error rates
– Usually decreases power

• Exactness
– Maintain Type I error rate at α level
– Permutation test is theoretically exact
57
Experimental design
• Randomly split query set in two
• Evaluate all systems with both subsets
– Simulating two different test collections

• Compare p-values with both subsets
– How well do statistical tests agree with themselves?
– At different α levels

• All systems and queries from MIREX 2007-2011
– >15M p-values
58
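A minimal sketch of the split-half design, reusing the permutation test above; the scores are per-query effectiveness arrays for the two systems being compared:

```python
import numpy as np

def split_half_pvalues(scores_a, scores_b, test, seed=0):
    """Randomly split the query set in two halves and run the same
    significance test on each, simulating two different collections."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    rng = np.random.default_rng(seed)
    q = rng.permutation(a.size)
    h1, h2 = q[: a.size // 2], q[a.size // 2:]
    return test(a[h1], b[h1]), test(a[h2], b[h2])

# p1, p2 = split_half_pvalues(a, b, permutation_test)
# agreement: both p-values fall on the same side of a given alpha level
```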
Power and success
• Bootstrap test is the most powerful
• Wilcoxon, bootstrap and permutation are the
most successful, depending on α level

59
Conflicts
• Wilcoxon and t-test are the safest at low α levels
• Wilcoxon is the most exact at low α levels, but
bootstrap is for usual levels

60
Summary
• Bootstrap test is the most powerful, and still it
has smaller Type I error rates, so we are safe
• Power and success:
– CGl@5 > GAP@5 > DCGl@5 > RBPl@5
– Fine > Broad > binary

• Conflicts:
– Very similar across measures and scales
– Corrections for multiple comparisons (e.g. Tukey) do
not seem necessary
61
Reliability:
Test Collection Size
Acceptable sample size
• Reliability is higher with larger sample sizes
– But it is also more expensive
– What is an acceptable test collection size?

• Answer with Generalizability Theory
– G-Study: estimate variance components
– D-Study: estimate reliability of different sample sizes
and experimental designs

63
G-study: variance components
Fully crossed experimental design: s × q
λq,A = λ + λA + λq + εqA
σ² = σ²s + σ²q + σ²sq
• Estimated with Analysis of Variance
64
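A minimal sketch of the G-study computation, assuming a systems × queries matrix of effectiveness scores; the variance components come from the usual two-way ANOVA mean squares (the system-query interaction is confounded with error in this design):

```python
import numpy as np

def variance_components(X):
    """G-study for a fully crossed s x q design.
    X[i, j] = effectiveness of system i on query j."""
    n_s, n_q = X.shape
    grand = X.mean()
    ms_s = n_q * ((X.mean(axis=1) - grand) ** 2).sum() / (n_s - 1)
    ms_q = n_s * ((X.mean(axis=0) - grand) ** 2).sum() / (n_q - 1)
    resid = (X - X.mean(axis=1, keepdims=True)
               - X.mean(axis=0, keepdims=True) + grand)
    ms_sq = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    var_sq = ms_sq                          # interaction + error
    var_s = max((ms_s - ms_sq) / n_q, 0.0)  # system component
    var_q = max((ms_q - ms_sq) / n_s, 0.0)  # query component
    return var_s, var_q, var_sq
```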
Intuition
[Sketch: heavily overlapping per-query score distributions λA…λF of six systems]
• If σ²s is small or σ²q is large, we need more queries
[Sketch: systems spread further apart — larger σ²s]
[Sketch: narrower per-system distributions — smaller σ²q, or more queries]
65
D-study: variance ratios
• Stability of absolute scores
Φ(nq) = σ²s / (σ²s + (σ²q + σ²e)/nq)

• Stability of relative scores
Eρ²(nq) = σ²s / (σ²s + σ²e/nq)

• We can easily estimate how many queries are
needed to reach some level of stability (reliability)
66
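A minimal sketch of the corresponding D-study, reusing the components estimated in the G-study sketch above (var_e stands for the residual σ²sq component):

```python
def phi(var_s, var_q, var_e, n_q):
    """Stability of absolute scores with n_q queries."""
    return var_s / (var_s + (var_q + var_e) / n_q)

def erho2(var_s, var_e, n_q):
    """Stability of relative scores with n_q queries."""
    return var_s / (var_s + var_e / n_q)

def queries_needed(var_s, var_q, var_e, target=0.95):
    """Smallest query set size reaching the target absolute stability."""
    n_q = 1
    while phi(var_s, var_q, var_e, n_q) < target:
        n_q += 1
    return n_q
```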
Effect of query set size
• Average absolute stability Φ = 0.97
• ≈65 queries needed for Φ = 0.95, ≈100 in worst cases
• Fine scale slightly better than Broad and binary scales
• RBPl@5 and nDCGl@5 are the most stable
67
Effect of query set size
• Average relative stability Eρ² = 0.98
• ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases
• Fine scale better than Broad and binary scales
• CGl@5 and RBPl@5 are the most stable
68
Effect of cutoff k
• What if we use a deeper cutoff, k=10?
– From 100 queries and k=5 to 50 queries and k=10
– Should still have stable scores
– Judging effort should decrease
– Rank-based measures should become more stable

• Tested in MIREX 2012
– In 2013 too, but not analyzed here

69
Effect of cutoff k
• Judging effort reduced to 72% of the usual
• Generally stable
– From Φ = 0.81 to Φ = 0.83
– From Eρ2 = 0.93 to Eρ2 = 0.95

70
Effect of cutoff k
• Reliability given a fixed budget for judging?
– k=10 allows us to use fewer queries, about 70%
– Slightly reduced relative stability

71
Effect of assessor set size
• More assessors or simply more queries?
– Judging effort is multiplied

• Can be studied with MIREX 2006 data
– 3 different assessors per query
– Nested experimental design: s × h: q

72
Effect of assessor set size
• Broad scale: σ²s ≈ σ²h:q
• Fine scale: σ²s ≫ σ²h:q
• Always better to spend resources on queries

73
Summary
• MIREX collections generally larger than necessary
• For fixed budget
– More queries better than more assessors
– More queries slightly better than deeper cutoff
• Worth studying alternative user model?

• Employ G-Theory while building the collection
• Fine better than Broad, better than binary
• CGl@5 and DCGl@5 best for relative stability
• RBPl@5 and nDCGl@5 best for absolute stability
74
Implications
• Fixing the number of queries across years is
unrealistic
– Especially because they are not intended for reuse

• Fixing the number of queries across tasks is simply
nonsense
• Need to analyze on a case-by-case basis, while
building the collections
– GT4IReval, R package online
– https://github.com/julian-urbano/GT4IREval
75
Efficiency:
Learning Relevance Distributions
Probabilistic evaluation
• The MIREX setting is still expensive
– Need to judge all top k documents from all systems
– Takes days, even weeks sometimes

• Model relevance probabilistically
– Relevance judgments are random variables over the space
of possible assignments of relevance
E[Rd] = Σℓ∈ℒ P(Rd = ℓ) · ℓ
Var[Rd] = Σℓ∈ℒ P(Rd = ℓ) · ℓ² − E[Rd]²

• Effectiveness measures are also probabilistic
77
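For instance, a minimal sketch of these two moments for a single document, with a hypothetical distribution over Fine-like relevance levels:

```python
# Hypothetical P(R_d = l) over a few relevance levels
dist = {0: 0.1, 25: 0.2, 50: 0.3, 75: 0.3, 100: 0.1}

e_r = sum(p * l for l, p in dist.items())                # E[R_d]
var_r = sum(p * l**2 for l, p in dist.items()) - e_r**2  # Var[R_d]
print(e_r, var_r)
```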
Probabilistic evaluation
• Accuracy increases as we make judgments
– E[Rd] ← rd

• Reliability increases too (confidence)
– Var[Rd] ← 0

• Iteratively estimate relevance and effectiveness
– If confidence is low, make judgments
– If confidence is high, stop

• Judge as few documents as possible
78
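A minimal skeleton of this loop; select, judge, update and confidence are hypothetical stand-ins for the measure-dependent steps described in the following slides:

```python
def evaluate_incrementally(dists, judge, select, update, confidence,
                           threshold=0.95):
    """Iterative evaluation loop. All arguments are caller-supplied:
    judge(d) asks an assessor for the true label of document d,
    select(dists) picks the most informative unjudged document,
    update(dists, d) re-estimates the remaining distributions (M_jud),
    and confidence(dists) scores the current effectiveness estimates."""
    while confidence(dists) < threshold:
        d = select(dists)
        r = judge(d)
        dists[d] = {r: 1.0}    # E[R_d] <- r_d and Var[R_d] <- 0
        dists = update(dists, d)
    return dists
```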
Learning distributions of relevance
• Uniform distribution is very uninformative
• Historical distribution in MIREX has high variance
• Estimate from a set of features: P(Rd = ℓ | θd)
– For each document separately
– Ordinal Logistic Regression

• Three sets of features
– Output-based, can always be used
– Judgment-based, to exploit known judgments
– Audio-based, to exploit musical similarity
79
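A minimal sketch of this estimation step, assuming statsmodels' OrderedModel for ordinal logistic regression; the features and labels below are random placeholders for the three feature sets described above:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)

# Hypothetical features (output-, judgment- and audio-based) and ordinal
# relevance labels for documents that have already been judged
X = pd.DataFrame(rng.random((200, 3)),
                 columns=["output_sim", "same_artist_rel", "audio_sim"])
y = pd.Series(pd.Categorical(rng.integers(0, 3, 200),
                             categories=[0, 1, 2], ordered=True))

model = OrderedModel(y, X, distr="logit")   # proportional-odds model
res = model.fit(method="bfgs", disp=False)

# P(R_d = l | theta_d): one probability per relevance level, per document
print(res.predict(X.iloc[:5]))
```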
Learned models
• Mout : can be used even without judgments
– Similarity between systems’ outputs
– Genre and artist metadata
• Genre is highly correlated to similarity

– Decent fit, R2 ≈ 0.35

• Mjud : can be used when there are judgments
– Similarity between systems’ outputs
– Known relevance of same system and same artist
• Artist is extremely correlated to similarity

– Excellent fit, R2 ≈ 0.91
80
Estimation errors
• Actual vs. predicted by Mout
– 0.36 with Broad and 0.34 with Fine

• Actual vs. predicted by Mjud
– 0.14 with Broad and 0.09 with Fine

• Among assessors in MIREX 2006
– 0.39 with Broad and 0.31 with Fine

• Negligible under the current MIREX setting
81
Efficiency:
Probabilistic Evaluation
Probabilistic effectiveness measures
• Effectiveness scores become random variables too
• Example: DCGl@k
– (Usual) deterministic formulation:
DCGl@k = [Σi=1..k ri / log2(i+1)] / [Σi=1..k (nℒ−1) / log2(i+1)]
– (New) probabilistic formulation:
E[DCGl@k] = (1/ηDCGl) · Σi=1..k E[Ri] / log2(i+1)
Var[DCGl@k] = (1/η²DCGl) · Σi=1..k Var[Ri] / log2(i+1)²
where ηDCGl = Σi=1..k (nℒ−1) / log2(i+1) is the normalization constant
83
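A minimal sketch of these formulas, assuming each retrieved document's relevance distribution is given as a level → probability mapping:

```python
import numpy as np

def prob_dcg_l(dists, k, n_levels):
    """E[DCGl@k] and Var[DCGl@k] from per-rank relevance distributions.
    dists[i]: dict mapping relevance level -> probability, rank i+1."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(i+1), i=1..k
    eta = (n_levels - 1) * discounts.sum()          # normalization constant
    e_r = np.array([sum(p * l for l, p in d.items()) for d in dists[:k]])
    e_r2 = np.array([sum(p * l**2 for l, p in d.items()) for d in dists[:k]])
    var_r = e_r2 - e_r**2
    e_dcg = (discounts * e_r).sum() / eta
    var_dcg = (discounts**2 * var_r).sum() / eta**2
    return e_dcg, var_dcg
```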
Probabilistic effectiveness measures
• From there we can compute ΔDCGl@k between systems A and B
• And averages over a sample of queries 𝒬
• Different approaches to compute estimates
– Deal with dependence of random variables
– Different definitions of confidence

• For measures based on ideal ranking (nDCGl@k
and RBPl@k) we do not have a closed form
– Approximated with Delta method and Taylor series
84
Ranking without judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
• Average confidence in the rankings is 94%
• Average accuracy of the ranking is 92%

85
Ranking without judgments
• Can we trust individual estimates?
– Ideally, we want X% accuracy when X% confidence
– Confidence slightly overestimated in [0.9, 0.99)
DCGl@5 — accuracy of individual estimates per confidence bin:

Confidence   | Broad: in bin | Accuracy | Fine: in bin | Accuracy
[0.5, 0.6)   | 23 (6.5%)     | 0.826    | 22 (6.2%)    | 0.636
[0.6, 0.7)   | 14 (4%)       | 0.786    | 16 (4.5%)    | 0.812
[0.7, 0.8)   | 14 (4%)       | 0.571    | 11 (3.1%)    | 0.364
[0.8, 0.9)   | 22 (6.2%)     | 0.864    | 21 (6%)      | 0.762
[0.9, 0.95)  | 23 (6.5%)     | 0.87     | 19 (5.4%)    | 0.895
[0.95, 0.99) | 24 (6.8%)     | 0.917    | 27 (7.7%)    | 0.926
[0.99, 1)    | 232 (65.9%)   | 0.996    | 236 (67%)    | 0.996
E[Accuracy]  |               | 0.938    |              | 0.921

86
Relative estimates with judgments
1. Estimate relevance with Mout
2. Estimate relative differences and rank systems
3. While confidence is low (<95%)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of differences and rank systems

• What documents should we judge?
– Those that are the most informative
– Measure-dependent
87
Relative estimates with judgments
• Judging effort dramatically reduced
– 1.3% with CGl@5, 9.7% with RBPl@5

• Average accuracy still 92%, but improved individually
– 74% of estimates with >99% confidence, 99.9% accurate
– Expected accuracy improves slightly from 0.927 to 0.931

88
Absolute estimates with judgments
1. Estimate relevance with Mout
2. Estimate absolute effectiveness scores
3. While confidence is low (expected error >±0.05)
1. Select a document and judge it
2. Update relevance estimates with Mjud when possible
3. Update estimates of absolute effectiveness scores

• What documents should we judge?
– Those that reduce variance the most
– Measure-dependent
89
Absolute estimates with judgments
• The stopping condition is overly confident
– Virtually no judgments are even needed (supposedly)

• But effectiveness is highly overestimated

– Especially with nDCGl@5 and RBPl@5
– Mjud, and especially Mout, tend to overestimate relevance

90
Absolute estimates with judgments
• Practical fix: correct variance
• Estimates are better, but at the cost of judging
– Need between 15% and 35% of judgments

91
Summary
• Estimate ranking of systems with no judgments
– 92% accuracy on average, trustworthy individually
– Statistically significant differences are always correct

• If we want more confidence, judge documents
– As few as 2% needed to reach 95% confidence
– 74% of estimates have >99% confidence and accuracy

• Estimate absolute scores, judging as necessary
– Around 25% needed to ensure error <0.05

92
Implications
• We do not need dozens of volunteers to make
thousands of judgments over several days
• Just one person spending a couple hours is fine

• The spare manpower can be put to better use
– Redundant judgments to have better estimates
– Make annotations for other tasks

• It naturally promotes collaborative creation of
test collections by iteratively adding the
judgments needed in each experiment (if any)
93
Future Work
Validity
• User studies to understand user behavior
• What information to include in test collections
• Other forms of relevance judgment to better
capture document utility
• Explicitly define judging guidelines
• Similar mapping for Text IR
– Different user models within the same task
95
Reliability
• Corrections for Multiple Comparisons
• Methods to reliably estimate reliability while
building test collections

96
Efficiency
• Better models to estimate document relevance
• Correct variance when having just a few
relevance judgments available
• Estimate relevance beyond k=5
• Other stopping conditions and document weights

97
Conduct similar studies
for the wealth of tasks in
Music Information Retrieval

 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 

Evaluation in (Music) Information Retrieval through the Audio Music Similarity task

  • 1. Evaluation in (Music) Information Retrieval through the Audio Music Similarity task Julián Urbano Barcelona, Spain · January 16th 2014
  • 2. Spam • @julian_urbano • Postdoctoral researcher – Music Technology Group, Universitat Pompeu Fabra • Recently: PhD, Computer Science – (Evaluation in) (Music) Information Retrieval 2
  • 3. Information Retrieval • Automatic representation, storage and search of unstructured information – Traditionally textual information – Lately multimedia too: images, video, music • A user has an information need and uses an IR system that retrieves the relevant or significant information from a collection of documents 3
  • 4. Information Retrieval Evaluation • IR systems are based on models to estimate relevance, implementing different techniques • How good is my system? What system is better? – Answered with IR Evaluation experiments – “if you can’t measure it, you can’t improve it” – But we need to be able to trust our measurements • Research on IR Evaluation – Improve our methods to evaluate systems – Critical for the correct development of the field 4
  • 5. Disclaimer • If you see… A system is evaluated with a test collection containing queries, documents and judgments telling how relevant a document is to a query • …you can think of An algorithm is evaluated with a dataset containing queries, songs and annotations telling how similar a song is to a query 5
  • 6. Talk outline • • • • • Why we want to Evaluate… …and what we do with Cranfield Validity: users versus systems Reliability: estimating from samples Efficiency: reducing annotations 6
  • 7. Introduction: Why we want to Evaluate…
  • 8. The two questions • How good is my system? – What does good mean? – What is good enough? • Is system A better than system B? – What does better mean? – How much better? • Efficiency? Effectiveness? Ease? 8
  • 9. Measure user experience • We are interested in user-measures – Time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use … • Their distributions describe user experience, fully – For an arbitrary user, query and document collection, what is the distribution of… 0 none time to complete task some frustration much 9
  • 10. The big(ger) picture • Different user-measures attempting to assess the same thing: user satisfaction – How likely is it that an arbitrary user, with an arbitrary query (and with an arbitrary document collection) will be satisfied by the system? • This is the ultimate goal: the good, the better 10
  • 11. The big(ger) question • User satisfaction…as Bernoulli trial no yes satisfaction • Probability of satisfaction P(Sat = yes)? • Probability that k in n users are satisfied? • Probability of >80% users satisfied? 11
  • 12. Introduction: …what we do with Cranfield
  • 13. Sources of variability user-measure = f(documents, query, user, system) • Our goal is the distribution of the user-measure for our system, which is impossible to calculate – (Possibly?) infinite population • The best we can do is estimate it – Sample documents, queries and users – Measure user experience, implicitly or explicitly – Representativeness, cost, ethics, privacy… 13
  • 14. Fix samples • Hard to replicate experiment and repeat results • Just plain impossible to reproduce results • Get a (hopefully) good sample and fix it – Documents and queries • But we can’t fix the users! 14
  • 15. Simulate users…and fix them • Cranfield paradigm: remove users, but include a user-abstraction, fixed across experiments – Static user component: judgments in the ground truth – Dynamic user component: effectiveness measures • Remove all sources of variability, except systems: user-measure = f(documents, query, user, system) → user-measure = f(system) 15
  • 17. Test collections • Controlled set of documents, queries and judgments, shared across researchers • (Most?) important resource for IR research – Experiments are inexpensive (collections are not!) – Research becomes systematic – Reproducibility becomes possible and easy 16
  • 18. Wait a minute • Are we estimating distributions about users or distributions about systems? system-effectiveness = f(system, scale, measure) • We come up with different distributions of system-effectiveness, depending on how we abstract users from the experiment – Different scales to assess relevance – Different measures to model user behavior 17
  • 19. Assumption • System-measures correspond to user-measures – Users: time to complete task, idle time, success rate, failure rate, frustration, ease to learn, ease to use, satisfaction… – Systems: P, AP, RR, DCG, nDCG, ERR, GAP, Q… 18
  • 24. Experiments with Test Collections • Our goal is the users: user-measure = f(system) • but Cranfield tells us about systems: system-effectiveness = f(system, scale, measure) • This poses several problems – That we have been dealing with for over 50 years – But hey, they’re extremely interesting! 20
  • 27. Validity, Reliability and Efficiency • Validity: are we measuring what we want to? – Internal: are observed effects due to hidden factors? – External: are queries, documents and users representative? – Construct: do system-measures match user-measures? – Conclusion: how good is good and how better is better? • Reliability: how repeatable are the results? – How large do collections need to be? – What statistical methods should be used? • Efficiency: how inexpensive is it to get valid and reliable results? (i.e. to build a test collection) – Can we estimate results with fewer judgments? 21
  • 28. In this talk How to study and improve the validity, reliability and efficiency of the methods used to evaluate IR systems • Audio Music Similarity task as example – Song as query input to system, audio signal – Retrieve songs musically similar to it, by content – Resembles traditional Ad Hoc retrieval in Text IR – Important task in Music IR • Music recommendation • Playlist generation • Plagiarism detection 22
  • 30. Assumption of Cranfield • Systems with better effectiveness are perceived by users as more useful, more satisfactory • Tricky: different effectiveness measures and relevance scales produce different distributions – Which one is best at predicting satisfaction? • Map system effectiveness onto user satisfaction, experimentally – If P@10 = 0.2, how likely is it that an arbitrary user will find the results satisfactory? – What is P(Sat | P@10 = 0.2)? 24
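The thesis interpolates these P(Sat | λ) mappings with simple polynomials (slide 64); as a rough alternative sketch, one could fit a logistic regression of crowdsourced satisfaction labels on effectiveness scores. The data below is illustrative, not from the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (effectiveness, satisfied?) pairs collected from users
lam = np.array([[0.0], [0.1], [0.2], [0.4], [0.5], [0.6], [0.8], [0.9], [1.0]])
sat = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(lam, sat)
print(model.predict_proba([[0.2]])[0, 1])  # estimate of P(Sat | P@10 = 0.2)
```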
  • 31. User-oriented System-measures • Effectiveness measures are generally not formulated to correlate with user-satisfaction – If effectiveness is λ = 0, we expect P(Sat) = 0 – If effectiveness is λ = 1, we expect P(Sat) = 1 – In general, we expect P(Sat | λ) = λ • But this is not what we have – Effectiveness measures need to be reformulated – Upper bounds, recall components, ideal rankings – Many mathematical details omitted in this talk 25
  • 32. User Components: Measures and Scales • How is relevance measured in the judgments? – Nominal, ordinal, interval, ratio • How are results consumed? – Set, list • What determines document utility? – Positional, cascade – Linear, exponential • What determines user persistence? – Navigational, informational 26
  • 33. Measures and Scales [Table: 17 measures (P@5, AP@5, RR@5 and the linear/exponential-gain variants of CG@5, DCG@5, EDCG@5, Q@5, RBP@5, ERR@5, plus GAP@5 and ADR@5) crossed with relevance scales — Original (Broad, Fine), Artificial Graded (nℒ = 3, 4, 5) and Artificial Binary (ℓmin = 20, 40, 60, 80). Under the binary scales several measures become equivalent: CGl@5 and CGe@5 reduce to P@5; Ql@5, Qe@5 and GAP@5 to AP@5; and each exponential-gain variant to its linear-gain counterpart. One build marks the combination used at MIREX.] 27
  • 46. What can we infer? • Preference: difference noticed by user – Positive: user agrees with evaluation – Negative: user disagrees with evaluation • Non-preference: difference not noticed by user – Good: both systems are satisfactory – Bad: both systems are not satisfactory 29
  • 47. Data • Queries, documents and judgments from MIREX • 4115 unique and artificial examples – At least 200 examples per (measure-scale-λ) • 432 unique queries, 5636 unique documents • Answers collected via Crowdsourcing – Quality control with trap questions • 113 unique subjects 30
  • 48. Single system: how good is it? • For 2045 examples (49%) users could not decide which system was better • What do we expect? 31
  • 50. Single system: how good is it? • Large ℓmin thresholds underestimate satisfaction 32
  • 51. Single system: how good is it? • Users don’t pay attention to ranking? 33
  • 52. Single system: how good is it? • Exponential gain underestimates satisfaction 34
  • 53. Single system: how good is it? • Document utility independent of others 35
  • 54. Two systems: which one is better? • For 2090 examples (51%) users did prefer one system over the other one • What do we expect? 36
  • 56. Two systems: which one is better? • Large differences needed for users to note them 37
  • 57. Two systems: which one is better? • More relevance levels are better to discriminate 38
  • 58. Two systems: which one is better? • Cascade and navigational user models are not appropriate 39
  • 59. Two systems: which one is better? • Users do prefer the (supposedly) worse system 40
  • 60. Summary • Effectiveness and satisfaction are clearly correlated – There is a 20% bias: P(Sat | 0) > 0 and P(Sat | 1) < 1 – Room to improve: personalization, better user abstraction • Magnitude of differences does matter – Just looking at rankings is very naive • Be careful with statistical significance – Need Δλ≈0.4 for users to agree with effectiveness • Historically, only 20% of times in MIREX • Differences among measures and scales – Linear gain slightly better than exponential gain – Informational and positional user models better than navigational and cascade – The more relevance levels, the better 41
  • 61. Measures and scales [Table: the same 17 measures × scales grid as before, revisited in light of the satisfaction results.] 42
  • 64. Evaluate in terms of user satisfaction • So far, arbitrary users for a single query – P(Sat | Ql@5 = 0.61) = 0.7 • Easily for n users and a single query – P(Sat_15 = 10 | Ql@5 = 0.61) = 0.21 • What about a sample of queries 𝒬? – Map queries separately for the distribution of P(Sat) – For easier mappings, P(Sat | λ) functions are interpolated with simple polynomials 45
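The binomial step above is easy to reproduce; a minimal sketch with scipy, taking the mapped P(Sat | Ql@5 = 0.61) = 0.7 as given:

```python
from scipy.stats import binom

p_sat = 0.7  # P(Sat | Ql@5 = 0.61), from the fitted mapping

# Probability that exactly 10 of 15 users are satisfied
print(binom.pmf(10, 15, p_sat))  # ~0.21, as on the slide

# Probability that more than 80% of the 15 users are satisfied (13 or more)
print(binom.sf(12, 15, p_sat))
```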
  • 65. Expected probability of satisfaction • Now we can compute point and interval estimates of the expected probability of satisfaction • Intuition fails when interpreting effectiveness 46
  • 66. System success • If P(Sat) ≥ threshold, the system is successful – Setting the threshold was rather arbitrary before – Now it is meaningful, in terms of user satisfaction • Intuitively, we want the majority of users to find the system satisfactory – P(Succ) = P(P(Sat) > 0.5) = 1 − F_P(Sat)(0.5) • Improving queries for which we are bad is more worthwhile than further improving those for which we are already good 47
  • 67. Distribution of P(Sat) • But we (will) only have a handful of queries, so estimates will probably be bad – Need to estimate the cumulative distribution function of user satisfaction: F_P(Sat) – Not described by any typical distribution family • More than ≈25 queries in the collection – ecdf approximates better • Fewer than ≈25 queries in the collection – Normal for graded scales, ecdf for binary scales • Beta is always the best with the Fine scale – Which turns out to be the best scale, overall 48
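A minimal sketch of both routes to P(Succ) = 1 − F_P(Sat)(0.5): the ecdf and a Beta fit. The per-query P(Sat) values below are made up for illustration:

```python
import numpy as np
from scipy.stats import beta

# Hypothetical per-query probabilities of satisfaction for one system
p_sat = np.array([0.81, 0.35, 0.67, 0.92, 0.55, 0.48, 0.73, 0.88, 0.21, 0.66])

# ecdf estimate: fraction of queries with P(Sat) above 0.5
p_succ_ecdf = np.mean(p_sat > 0.5)

# Beta estimate: fit a Beta on [0, 1] and take its survival function at 0.5
a, b, _, _ = beta.fit(p_sat, floc=0, fscale=1)
p_succ_beta = beta.sf(0.5, a, b)

print(p_succ_ecdf, p_succ_beta)
```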
  • 68. Intuition fails, again • Intuitive conclusions based on effectiveness alone contradict those based on user satisfaction – E[Δλ] = −0.002 – E[ΔP(Sat)] = 0.001 – E[ΔP(Succ)] = 0.07 49
  • 71. Historically, in MIREX • Systems are not as satisfactory as we thought • But they are more successful – Good (or bad) for some kinds of queries 50
  • 72. Measures and scales [Table: reduced grid of 11 measures (P@5, AP@5, GAP@5 and the linear/exponential-gain variants of CG@5, DCG@5, Q@5, RBP@5) × scales — Original (Broad, Fine), Artificial Graded (nℒ = 4, 5) and Artificial Binary (ℓmin = 20, 40) — with the same binary-scale equivalences as before.] 51
  • 75. Random error • Test collections are just samples from larger, possibly infinite, populations • If we conclude system A is better than B, how confident can we be? – Δλ_𝒬 is just an estimate of the population mean μ_Δλ • Usually employ some statistical significance test for differences in location • If it is statistically significant, we have confidence that the true difference is at least that large 54
  • 76. Statistical hypothesis testing • Set two mutually exclusive hypotheses – H0: μ_Δλ = 0 – H1: μ_Δλ ≠ 0 • Run test, obtain p-value = P(Δλ ≥ Δλ_𝒬 | H0) – p ≤ α: statistically significant, high confidence – p > α: statistically non-significant, low confidence • Possible errors in the binary decision – Type I: incorrectly reject H0 – Type II: incorrectly accept H0 55
  • 77. Statistical significance tests • (Non-)parametric tests – t-test, Wilcoxon test, Sign test • Based on resampling – Bootstrap test, permutation/randomization test • They make certain assumptions about distributions and sampling methods – Often violated in IR evaluation experiments – Which test behaves better, in practice, knowing that assumptions are violated? 56
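As an illustration of the resampling family, a paired randomization test over per-query score differences might look like the sketch below (not the exact implementation evaluated here; the scores are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

def randomization_test(scores_a, scores_b, trials=10000):
    """Two-sided paired randomization test on per-query effectiveness."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(d.mean())
    # Under H0, the sign of each per-query difference is arbitrary
    signs = rng.choice([-1, 1], size=(trials, d.size))
    null = np.abs((signs * d).mean(axis=1))
    return (null >= observed).mean()  # p-value

# Hypothetical per-query scores of two systems over the same 10 queries
a = [0.61, 0.45, 0.72, 0.33, 0.58, 0.69, 0.51, 0.44, 0.80, 0.62]
b = [0.55, 0.47, 0.60, 0.30, 0.52, 0.65, 0.50, 0.40, 0.71, 0.57]
print(randomization_test(a, b))
```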
  • 78. Optimality criteria • Power – Achieve significance as often as possible (low Type II) – Usually increases Type I error rates • Safety – Minimize Type I error rates – Usually decreases power • Exactness – Maintain Type I error rate at α level – Permutation test is theoretically exact 57
  • 79. Experimental design • Randomly split query set in two • Evaluate all systems with both subsets – Simulating two different test collections • Compare p-values with both subsets – How well do statistical tests agree with themselves? – At different α levels • All systems and queries from MIREX 2007-2011 – >15M p-values 58
  • 80. Power and success • Bootstrap test is the most powerful • Wilcoxon, bootstrap and permutation are the most successful, depending on α level 59
  • 81. Conflicts • Wilcoxon and t-test are the safest at low α levels • Wilcoxon is the most exact at low α levels, but bootstrap is for usual levels 60
  • 82. Summary • Bootstrap test is the most powerful, and still it has smaller Type I error rates, so we are safe • Power and success: – CGl@5 > GAP@5 > DCGl@5 > RBPl@5 – Fine > Broad > binary • Conflicts: – Very similar across measures and scales – Corrections for multiple comparisons (e.g. Tukey) do not seem necessary 61
  • 84. Acceptable sample size • Reliability is higher with larger sample sizes – But it is also more expensive – What is an acceptable test collection size? • Answer with Generalizability Theory – G-Study: estimate variance components – D-Study: estimate reliability of different sample sizes and experimental designs 63
  • 85. G-study: variance components • Fully crossed experimental design: s × q – Model: λ_qA = λ + λ_A + λ_q + ε_qA – Variance decomposition: σ² = σ²_s + σ²_q + σ²_sq • Estimated with Analysis of Variance 64
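For a fully crossed systems × queries matrix with one score per cell, the components can be estimated from the ANOVA mean squares; a sketch using the standard random-effects estimators (not code from the talk):

```python
import numpy as np

def variance_components(scores):
    """scores: (n_systems, n_queries) matrix of effectiveness scores."""
    ns, nq = scores.shape
    grand = scores.mean()
    ms_s = nq * np.sum((scores.mean(axis=1) - grand) ** 2) / (ns - 1)
    ms_q = ns * np.sum((scores.mean(axis=0) - grand) ** 2) / (nq - 1)
    resid = (scores - scores.mean(axis=1, keepdims=True)
                    - scores.mean(axis=0, keepdims=True) + grand)
    ms_sq = np.sum(resid ** 2) / ((ns - 1) * (nq - 1))
    var_s = max((ms_s - ms_sq) / nq, 0.0)   # system component
    var_q = max((ms_q - ms_sq) / ns, 0.0)   # query component
    var_e = ms_sq                           # interaction confounded with error
    return var_s, var_q, var_e
```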
  • 95. Intuition [Figure: mean scores λ_A … λ_F of six systems on a line, shown for a baseline, for “Larger σ²_s”, and for “Smaller σ²_q or more queries”] • If σ²_s is small or σ²_q is large, we need more queries 65
  • 98. D-study: variance ratios • Stability of absolute scores: Φ(n_q) = σ²_s / (σ²_s + (σ²_q + σ²_e)/n_q) • Stability of relative scores: Eρ²(n_q) = σ²_s / (σ²_s + σ²_e/n_q) • We can easily estimate how many queries are needed to reach some level of stability (reliability) 66
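Continuing the sketch, the D-study ratios follow directly from the estimated components, and the Eρ² formula can be solved for the query-set size needed to reach a target reliability:

```python
import math

def d_study(var_s, var_q, var_e, n_q):
    phi = var_s / (var_s + (var_q + var_e) / n_q)   # absolute stability
    erho2 = var_s / (var_s + var_e / n_q)           # relative stability
    return phi, erho2

def queries_needed(var_s, var_e, target=0.95):
    # Solve Erho2(n_q) >= target for n_q
    return math.ceil(target * var_e / ((1 - target) * var_s))

# e.g. var_s = 0.01, var_e = 0.03 -> about 57 queries for Erho2 = 0.95
print(queries_needed(0.01, 0.03))
```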
  • 100. Effect of query set size • Average absolute stability Φ = 0.97 • ≈65 queries needed for Φ = 0.95, ≈100 in worst cases • Fine scale slightly better than Broad and binary scales • RBPl@5 and nDCGl@5 are the most stable 67
  • 101. Effect of query set size • Average relative stability Eρ² = 0.98 • ≈35 queries needed for Eρ² = 0.95, ≈60 in worst cases • Fine scale better than Broad and binary scales • CGl@5 and RBPl@5 are the most stable 68
  • 102. Effect of cutoff k • What if we use a deeper cutoff, k=10? – From 100 queries and k=5 to 50 queries and k=10 – Should still have stable scores – Judging effort should decrease – Rank-based measures should become more stable • Tested in MIREX 2012 – In 2013 too, but not analyzed here 69
  • 103. Effect of cutoff k • Judging effort reduced to 72% of the usual • Generally stable – From Φ = 0.81 to Φ = 0.83 – From Eρ² = 0.93 to Eρ² = 0.95 70
  • 104. Effect of cutoff k • Reliability given a fixed budget for judging? – k=10 allows us to use fewer queries, about 70% – Slightly reduced relative stability 71
  • 105. Effect of assessor set size • More assessors or simply more queries? – Judging effort is multiplied • Can be studied with MIREX 2006 data – 3 different assessors per query – Nested experimental design: s × (h:q) 72
  • 106. Effect of assessor set size • Broad scale: σ²_s ≈ σ²_h:q • Fine scale: σ²_s ≫ σ²_h:q • Always better to spend resources on queries 73
  • 107. Summary • MIREX collections generally larger than necessary • For fixed budget – More queries better than more assessors – More queries slightly better than deeper cutoff • Worth studying alternative user model? • Employ G-Theory while building the collection • Fine better than Broad, better than binary • CGl@5 and DCGl@5 best for relative stability • RBPl@5 and nDCGl@5 best for absolute stability 74
  • 108. Implications • Fixing the number of queries across years is unrealistic – Especially because they are not intended for reuse • Fixing the number of queries across tasks is simply nonsense • Need to analyze on a case-by-case basis, while building the collections – GT4IReval, R package online – https://github.com/julian-urbano/GT4IREval 75
  • 110. Probabilistic evaluation • The MIREX setting is still expensive – Need to judge all top k documents from all systems – Takes days, even weeks sometimes • Model relevance probabilistically – Relevance judgments are random variables over the space of possible assignments of relevance: E[R_d] = Σ_{ℓ∈ℒ} P(R_d = ℓ)·ℓ and Var[R_d] = Σ_{ℓ∈ℒ} P(R_d = ℓ)·ℓ² − E[R_d]² • Effectiveness measures are also probabilistic 77
  • 111. Probabilistic evaluation • Accuracy increases as we make judgments – E[R_d] ← r_d • Reliability increases too (confidence) – Var[R_d] ← 0 • Iteratively estimate relevance and effectiveness – If confidence is low, make judgments – If confidence is high, stop • Judge as few documents as possible 78
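For a single document, the expectation and variance over the label distribution are straightforward; the probabilities below are placeholders for what the learned models would output:

```python
import numpy as np

labels = np.array([0, 1, 2])    # e.g. not / somewhat / very similar
p = np.array([0.2, 0.5, 0.3])   # hypothetical P(R_d = l) from a model

e_r = np.sum(p * labels)                    # E[R_d]
var_r = np.sum(p * labels ** 2) - e_r ** 2  # Var[R_d]
print(e_r, var_r)

# Once the document is actually judged as l = 2, the distribution collapses:
# E[R_d] <- 2 and Var[R_d] <- 0
```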
  • 112. Learning distributions of relevance • Uniform distribution is very uninformative • Historical distribution in MIREX has high variance • Estimate from a set of features: P(R_d = ℓ | θ_d) – For each document separately – Ordinal Logistic Regression • Three sets of features – Output-based, can always be used – Judgment-based, to exploit known judgments – Audio-based, to exploit musical similarity 79
  • 113. Learned models • Mout: can be used even without judgments – Similarity between systems’ outputs – Genre and artist metadata • Genre is highly correlated to similarity – Decent fit, R² ≈ 0.35 • Mjud: can be used when there are judgments – Similarity between systems’ outputs – Known relevance of same system and same artist • Artist is extremely correlated to similarity – Excellent fit, R² ≈ 0.91 80
  • 114. Estimation errors • Actual vs. predicted by Mout – 0.36 with Broad and 0.34 with Fine • Actual vs. predicted by Mjud – 0.14 with Broad and 0.09 with Fine • Among assessors in MIREX 2006 – 0.39 with Broad and 0.31 with Fine • Negligible under the current MIREX setting 81
  • 116. Probabilistic effectiveness measures • Effectiveness scores become random variables too • Example: DCGl@k – (Usual) deterministic formulation: DCGl@k = (1/η_DCGl) Σ_{i=1..k} r_i / log₂(i+1), with η_DCGl = Σ_{i=1..k} (nℒ − 1) / log₂(i+1) – (New) probabilistic formulation: E[DCGl@k] = (1/η_DCGl) Σ_{i=1..k} E[R_i] / log₂(i+1) and Var[DCGl@k] = (1/η²_DCGl) Σ_{i=1..k} Var[R_i] / log₂(i+1)² 83
  • 118. Probabilistic effectiveness measures • From there we can compute ΔDCGl@k between two systems A and B • And averages over a sample of queries 𝒬 • Different approaches to compute estimates – Deal with dependence of random variables – Different definitions of confidence • For measures based on an ideal ranking (nDCGl@k and RBPl@k) we do not have a closed form – Approximated with the Delta method and Taylor series 84
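A sketch of the probabilistic DCGl@k, assuming independence across ranks as in the closed forms above; the per-rank expectations and variances are made up, standing in for the relevance models’ output:

```python
import numpy as np

def prob_dcg(e_r, var_r, n_levels):
    """E and Var of normalized linear-gain DCG@k from per-rank E[R_i], Var[R_i]."""
    k = len(e_r)
    discount = 1.0 / np.log2(np.arange(2, k + 2))  # 1/log2(i+1) for i = 1..k
    eta = (n_levels - 1) * discount.sum()          # normalization constant
    e_dcg = np.dot(discount, e_r) / eta
    var_dcg = np.dot(discount ** 2, var_r) / eta ** 2
    return e_dcg, var_dcg

# Top 5 documents on a 3-level scale (labels 0..2)
print(prob_dcg([1.6, 1.2, 0.9, 0.7, 0.4], [0.3, 0.4, 0.5, 0.4, 0.3], 3))
```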
  • 119. Ranking without judgments 1. Estimate relevance with Mout 2. Estimate relative differences and rank systems • Average confidence in the rankings is 94% • Average accuracy of the ranking is 92% 85
  • 120. Ranking without judgments • Can we trust individual estimates? – Ideally, we want X% accuracy when X% confidence – Confidence slightly overestimated in [0.9, 0.99)
DCGl@5, by confidence bin:
Confidence   | Broad: in bin | Broad: accuracy | Fine: in bin | Fine: accuracy
[0.5, 0.6)   | 23 (6.5%)     | 0.826           | 22 (6.2%)    | 0.636
[0.6, 0.7)   | 14 (4%)       | 0.786           | 16 (4.5%)    | 0.812
[0.7, 0.8)   | 14 (4%)       | 0.571           | 11 (3.1%)    | 0.364
[0.8, 0.9)   | 22 (6.2%)     | 0.864           | 21 (6%)      | 0.762
[0.9, 0.95)  | 23 (6.5%)     | 0.87            | 19 (5.4%)    | 0.895
[0.95, 0.99) | 24 (6.8%)     | 0.917           | 27 (7.7%)    | 0.926
[0.99, 1)    | 232 (65.9%)   | 0.996           | 236 (67%)    | 0.996
E[Accuracy]  |               | 0.938           |              | 0.921
86
  • 121. Relative estimates with judgments 1. Estimate relevance with Mout 2. Estimate relative differences and rank systems 3. While confidence is low (<95%) 1. Select a document and judge it 2. Update relevance estimates with Mjud when possible 3. Update estimates of differences and rank systems • What documents should we judge? – Those that are the most informative – Measure-dependent 87
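The loop on this slide as a Python skeleton; every callable passed in (estimate, most_informative, confidence, judge) is a placeholder for the models and measure-specific machinery described above, so this sketches the control flow rather than a working implementation:

```python
def incremental_evaluation(systems, judge, estimate, most_informative,
                           confidence, target=0.95):
    """Judge documents one at a time until the ranking estimate is confident."""
    judgments = {}
    estimates = estimate(systems, judgments)      # start from M_out alone
    while confidence(estimates) < target:
        doc = most_informative(systems, estimates, judgments)
        judgments[doc] = judge(doc)               # ask an assessor
        estimates = estimate(systems, judgments)  # M_jud now usable where possible
    return estimates
```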
  • 122. Relative estimates with judgments • Judging effort dramatically reduced – 1.3% with CGl@5, 9.7% with RBPl@5 • Average accuracy still 92%, but improved individually – 74% of estimates with >99% confidence, 99.9% accurate – Expected accuracy improves slightly from 0.927 to 0.931 88
  • 123. Absolute estimates with judgments 1. Estimate relevance with Mout 2. Estimate absolute effectiveness scores 3. While confidence is low (expected error >±0.05) 1. Select a document and judge it 2. Update relevance estimates with Mjud when possible 3. Update estimates of absolute effectiveness scores • What documents should we judge? – Those that reduce variance the most – Measure-dependent 89
  • 124. Absolute estimates with judgments • The stopping condition is overly confident – Virtually no judgments are even needed (supposedly) • But effectiveness is highly overestimated – Especially with nDCGl@5 and RBPl@5 – Mjud, and especially Mout, tend to overestimate relevance 90
  • 125. Absolute estimates with judgments • Practical fix: correct variance • Estimates are better, but at the cost of judging – Need between 15% and 35% of judgments 91
  • 126. Summary • Estimate ranking of systems with no judgments – 92% accuracy on average, trustworthy individually – Statistically significant differences are always correct • If we want more confidence, judge documents – As few as 2% needed to reach 95% confidence – 74% of estimates have >99% confidence and accuracy • Estimate absolute scores, judging as necessary – Around 25% needed to ensure error <0.05 92
  • 127. Implications • We do not need dozens of volunteers to make thousands of judgments over several days • Just one person spending a couple of hours is fine • The spare manpower can be put to better use – Redundant judgments to have better estimates – Make annotations for other tasks • It naturally promotes collaborative creation of test collections by iteratively adding the judgments needed in each experiment (if any) 93
  • 129. Validity • User studies to understand user behavior • What information to include in test collections • Other forms of relevance judgment to better capture document utility • Explicitly define judging guidelines • Similar mapping for Text IR – Different user models within the same task 95
  • 130. Reliability • Corrections for Multiple Comparisons • Methods to reliably estimate reliability while building test collections 96
  • 131. Efficiency • Better models to estimate document relevance • Correct variance when having just a few relevance judgments available • Estimate relevance beyond k=5 • Other stopping conditions and document weights 97
  • 132. Conduct similar studies for the wealth of tasks in Music Information Retrieval