IR Evaluation
Mihai Lupu
lupu@ifs.tuwien.ac.at
Chapter 8 of the Introduction to IR book
M. Sanderson. Test Collection Based Evaluation of Information
Retrieval Systems Foundations and Trends in IR, 2010
1
Outline
 Introduction
– Introduction to IR
 Kinds of evaluation
 Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
 User-based evaluation
 Discussion on Evaluation
 Conclusion
2
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
3
Objective
measurements
Information Retrieval
 “Information retrieval is a field concerned with the structure,
analysis, organization, storage, searching, and retrieval of
information.” (Salton, 1968)
 General definition that can be applied to many types of
information and search applications
 Primary focus of IR since the 50s has been on text and
documents
Information Retrieval
 Key insights of/for information retrieval
– text has no meaning
 ฉันมีรถสีแดง
– but it is still the most informative source
 ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า
– text is not random
 I drive a red car is more probable than
– I drive a red horse
– A red car I drive
– Car red a drive I
– meaning is defined by usage
 I drive a truck / I drive a car / I drive the bus → truck / car / bus
are similar in meaning
Information Retrieval
 Key insights of/for information retrieval
– text has no meaning
 ฉันมีรถสีแดง
– but it is still the most informative source
 ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า
– text is not random
 I drive a red car is more probable than
– I drive a red horse
– A red car I drive
– Car red a drive I
– meaning is defined by usage
 I drive a truck / I drive a car / I drive the bus → truck / car / bus
are similar in meaning
term frequency (TF), document frequency (DF)
TF-IDF, BM25 (Best match 25)
language models (uni-gram, bi-gram, n-gram)
statistical semantics (latent semantic analysis,
random indexing, deep learning)
Big Issues in IR
 Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains
the information that a person was looking for when they submitted
a query to the search engine
– Many factors influence a person’s decision about what is relevant:
e.g., task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything
else)
 Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based
on retrieval models
– Most models describe statistical properties of text rather
than linguistic
 i.e. counting simple text features such as words
instead of parsing and analyzing the sentences
 Statistical approach to text processing started with
Luhn in the 50s
 Linguistic features can be part of a statistical model
Big Issues in IR
 Evaluation
– Experimental procedures and measures for comparing system
output with user expectations
 Originated in Cranfield experiments in the 60s
– IR evaluation methods now used in many fields
– Typically use test collection of documents, queries, and relevance
judgments
 Most commonly used are TREC collections
– Recall and precision are two examples of effectiveness measures
Big Issues in IR
 Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information
needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query
suggestion, relevance feedback improve ranking
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Why is this a research field in itself?
– Because there are many kinds of IR
• With different evaluation criteria
– Because it’s difficult
• Why?
– Because it involves human subjectivity (document relevance)
– Because of the amount of data involved (who can sit down
and evaluate 1,750,000 documents returned by Google for
‘university vienna’?)
13
Kinds of evaluation
14
Kinds of evaluation
• “Efficient and effective system”
• Time and space: efficiency
– Generally constrained by pre-development specification
• E.g. real-time answers vs. batch jobs
• E.g. index-size constraints
– Easy to measure
• Good results: effectiveness
– Harder to define --> more research into it
• And…
15
Kinds of evaluation (cont.)
• User studies
– Does a 2% increase in some retrieval performance measure actually
make a user happier?
– Does displaying a text snippet improve usability even if the
underlying method is 10% weaker than some other method?
– Hard to do
– Mostly anecdotal examples
– Many IR people don’t like to do it (though it’s starting to change)
16
Kinds of evaluation (cont.)
 Intrinsic
– “internal”
– ultimate goal is the retrieved set
 Extrinsic
– “external”
– in the context of the usage of the retrieval tool
17
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the
system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information
need
5. recall
6. precision
18
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the
system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information
need
5. recall
6. precision
Effectiveness
19
A desirable measure of retrieval performance would have the following
properties: 1, it would be a measure of effectiveness. 2, it would not be
confounded by the relative willingness of the system to emit items. 3, it would
be a single number – in preference, for example, to a pair of numbers which
may co-vary in a loosely specified way, or a curve representing a table of
several pairs of numbers. 4, it would allow complete ordering of different
performances, and assess the performance of any one system in absolute
terms. Given a measure with these properties, we could be confident of
having a pure and valid index of how well a retrieval system (or method) were
performing the function it was primarily designed to accomplish, and we could
reasonably ask questions of the form “Shall we pay X dollars for Y units of
effectiveness?” (Swets, 1967)
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures
– Test Collections
 User-based evaluation
• Discussion on Evaluation
• Conclusion
20
Efficiency Metrics
21
Retrieval Effectiveness
 Precision
– How happy are we with what we’ve got
 Recall
– How much more we could have had
Precision = (number of relevant documents retrieved) / (number of documents retrieved)
Recall = (number of relevant documents retrieved) / (number of relevant documents)
22
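To make the two definitions concrete, here is a minimal Python sketch (the function and document ids are illustrative, not from the slides) that computes both values for one query from a ranked result list and the set of judged-relevant documents:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a single query.

    retrieved: iterable of document ids returned by the system
    relevant:  set of document ids judged relevant for the query
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for d in retrieved if d in relevant)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       {"d1", "d3", "d5", "d7", "d8", "d9"}))  # (0.6, 0.5)
```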
Retrieval Effectiveness
Retrieved
documents
Relevant documents
Universe of documents
23
Precision and Recall
24
Retrieval effectiveness
 What if we don’t like this twin-measure approach?
 A solution:
– Van Rijsbergen’s E-Measure:
– With a special case: Harmonic mean
E = 1 - 1 / ( α · (1/precision) + (1 - α) · (1/recall) )
F = (2 × precision × recall) / (precision + recall)
25
Retrieval effectiveness
 What if we don’t like this twin-measure approach?
 A solution:
– Van Rijsbergen’s E-Measure:
– With a special case: Harmonic mean
E = 1 - 1 / ( α · (1/precision) + (1 - α) · (1/recall) )
F = (2 × precision × recall) / (precision + recall)
26
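As an illustrative sketch (not part of the slides), the two formulas can be computed directly; with α = 0.5 the E-measure reduces to 1 − F:

```python
def e_measure(precision, recall, alpha=0.5):
    """Van Rijsbergen's E-measure; lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # worst case when either component is zero
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 0.6, 0.5
print(f_measure(p, r))       # 0.545...
print(1 - e_measure(p, r))   # same value when alpha = 0.5
```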
Retrieval effectiveness
 Tools we need:
– A set of documents (the “dataset”)
– A set of questions/queries/topics
– For each topic, and for each document, a decision: relevant or not
relevant
 Let’s assume for the moment that’s all we need and that
we have it
27
Retrieval Effectiveness
• Precision and Recall generally plotted as a “Precision-Recall
curve”
[Precision-recall curve: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1); precision tends to drop as the size of the retrieved set increases]
• They do not play well together
28
Precision-Recall Curves
 How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
29
Precision-Recall Curves
 How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
30
Precision-Recall Curves
• How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
– Repeat for all queries
31
Precision-Recall Curves
• And the average is the system’s P-R curve
[Averaged precision-recall curve: precision drops as the number of retrieved documents increases]
• We can compare systems by comparing the
curves
32
Precision-Recall Graph
--reality check--
33
Interpolation
 To average graphs, calculate precision at standard recall
levels:
– where S is the set of observed (R,P) points
 Defines precision at any recall level as the maximum
precision observed in any recall-precision point at a
higher recall level
– produces a step function
– defines precision at recall 0.0
34
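A minimal sketch of the interpolation rule described above, assuming the observed (recall, precision) points for one query are available as a list (names are illustrative): precision at a standard recall level is the maximum precision over all observed points with recall at or above that level.

```python
def interpolate(observed, levels=None):
    """Interpolated precision at standard recall levels.

    observed: list of (recall, precision) points for one query
    levels:   recall checkpoints, default 0.0, 0.1, ..., 1.0
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]
    result = {}
    for level in levels:
        # maximum precision over all points with recall >= level (0 if none)
        result[level] = max((p for r, p in observed if r >= level), default=0.0)
    return result

points = [(0.17, 1.0), (0.33, 0.67), (0.5, 0.6), (0.67, 0.5), (0.83, 0.45), (1.0, 0.4)]
print(interpolate(points))
```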
Interpolation
35
Average Precision at
Standard Recall Levels
• Recall-precision graph plotted by simply
joining the average precision points at
the standard recall levels
36
Average Recall-Precision Graph
37
Graph for 50 Queries
38
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
39
Single-value measures
• Fix a “reasonable” cutoff
– R-precision
 Precision at R, where R is the number of relevant documents.
 Fix the number of desired documents
– Reciprocal rank (RR)
 1/rank of first relevant document in the ranked list returned
 Make it less sensitive to the cutoff
• Average precision
– For each query:
 R= # relevant documents
 i = rank
 k = # retrieved documents
 P(i) precision at rank i
• rel(i)=1 if document at rank i relevant, 0 otherwise
– For each system:
• Compute the mean of these averages: Mean Average
Precision (MAP) – one of the most used measures
AP = ( Σ_{i=1..k} P(i) × rel(i) ) / R
40
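The single-value measures above can be sketched in a few lines of Python (illustrative helpers, not from the lecture; rel is a 0/1 relevance list in rank order and R the number of relevant documents for the query):

```python
def average_precision(rel, R):
    """rel: list of 0/1 relevance flags in rank order; R: total relevant docs."""
    score, hits = 0.0, 0
    for i, r in enumerate(rel, start=1):
        if r:
            hits += 1
            score += hits / i          # P(i) at each relevant rank
    return score / R if R else 0.0

def reciprocal_rank(rel):
    """1 / rank of the first relevant document, 0 if none is retrieved."""
    for i, r in enumerate(rel, start=1):
        if r:
            return 1.0 / i
    return 0.0

def mean_average_precision(queries):
    """queries: list of (rel, R) pairs, one per topic."""
    return sum(average_precision(rel, R) for rel, R in queries) / len(queries)

print(average_precision([1, 1, 0, 1, 0, 1], R=6))  # (1 + 1 + 0.75 + 0.667) / 6
```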
R- Precision
 Precision at the R-th position in the ranking of results for
a query that has R relevant documents.
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
41
Averaging Across Queries
42
Average Precision
43
MAP
44
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
45
Cumulative Gain
• For each document d, and query q, define
rel(d,q) >= 0
• The higher the value, the more relevant the document is to
the query
• Pitfalls:
– Graded relevance introduces even more ambiguity in practice
With great flexibility comes great
responsibility to justify parameter values
46
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
47
Discounted Cumulative Gain
 Popular measure for evaluating web search and related
tasks
 Two assumptions:
– Highly relevant documents are more useful than marginally relevant
document
– the lower the ranked position of a relevant document, the less useful
it is for the user, since it is less likely to be examined
48
Discounted Cumulative Gain
 Uses graded relevance as a measure of the usefulness, or
gain, from examining a document
 Gain is accumulated starting at the top of the ranking and
may be reduced, or discounted, at lower ranks
 Typical discount is 1/log (rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
49
Discounted Cumulative Gain
 DCG is the total gain accumulated at a particular rank p:
 Alternative formulation:
– used by some web search companies
– emphasis on retrieving highly relevant documents
[Jarvelin:2000]
[Borges:2005]
50
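The two DCG formulations referred to above were images on the original slide; the standard definitions they presumably correspond to are, for cut-off p and graded relevance rel_i at rank i:

```latex
% Standard DCG at cut-off p (Jarvelin & Kekalainen):
\mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}

% Alternative formulation emphasising highly relevant documents:
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2 (i + 1)}
```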
Discounted Cumulative Gain
• Neither CG, nor DCG can be used for comparison
across topics!
depends on the # relevant documents per topic
51
Normalised Discounted Cumulative Gain
 Compute CG / DCG for the optimal return set
Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..)
has the Ideal Discounted Cumulative Gain: IDCG
 Normalise:
NDCG(n) = DCG(n) / IDCG(n)
52
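A minimal Python sketch of the normalisation step, using the log-discounted gain definition from the previous slides and the example gain vectors shown on this and the next slide (function names are illustrative):

```python
import math

def dcg(gains, n):
    """DCG at cut-off n; gains[i-1] is the graded relevance at rank i.
    Discount is 1/log2(rank); ranks 1 and 2 are not discounted (log2(2) = 1)."""
    return sum(g / max(1.0, math.log2(i)) for i, g in enumerate(gains[:n], start=1))

def ndcg(gains, judged_gains, n):
    """Normalise by the DCG of the ideal (descending) ordering of the judged gains."""
    idcg = dcg(sorted(judged_gains, reverse=True), n)
    return dcg(gains, n) / idcg if idcg > 0 else 0.0

run_gains = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4]              # "our rank" from the next slide
judged = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1]  # graded judgments for the topic
print(ndcg(run_gains, judged, n=10))
```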
some more variations
Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..)
has the Ideal Discounted Cumulative Gain: IDCG
“our rank”: (5,2,0,0,5,2,4,0,0,1,4,…)
 two ranked lists
– rank correlation measures
 Kendall tau (similarity of orderings)
 Pearson r (linear correlation between variables)
 Spearman rho (Pearson applied to ranks)
53
some more variations
 rank biased precision (RBP)
– “log-based discount is not a good model of users’ behaviour”
– imagine the probability p of the user moving on to the next document
RBP(n) = (1 - p) × Σ_{i=1..n} rel(i) × p^(i-1)
p ~ 0.95 (persistent user) … p ~ 0.0 (user looks only at the very top of the ranking)
54
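A sketch of RBP following the formula above (rel values may be binary or graded in [0,1]; p is the probability that the user continues to the next result):

```python
def rbp(rel, p=0.95):
    """Rank-biased precision: (1 - p) * sum_i rel(i) * p^(i-1)."""
    return (1.0 - p) * sum(r * p ** (i - 1) for i, r in enumerate(rel, start=1))

ranked_rel = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(rbp(ranked_rel, p=0.95))  # patient user: deep ranks still contribute
print(rbp(ranked_rel, p=0.5))   # impatient user: dominated by the top ranks
```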
Time-based calibration
 Assumption
– The objective of the search engine is to improve the efficiency of an
information seeking task
 Extend nDCG to replace discount with a time-based
function
(Smucker and Clarke:2011)
[Formula annotations: a normalization term, the gain, and a decay factor expressed as a function of the time needed to reach item k in the ranked list]
55
The water filling model (Luo et al, 2013)
 and the corresponding Cube
Test (CT)
 also for professional search
– to capture embedded subtopics
 no assumption of linear
traversal of documents
– takes into account time
 potential cap on the amount of
information taken into account
 high discriminative power
56
Other diversity metrics
 several aspects of the topic might [need to] be covered
– Aspectual recall/precision
 discount may take into account previously seen aspects
– α-NDCG = NDCG where
rel(i) = J(di,k)(1-a)
rk,i-1
k=1
m
å
rk,i-1 = J(dj,k)
j=1
i-1
å J(dj,k) =
1 dj relevant to nk
0 otherwise
ì
í
ï
îï
57
Other measures
• There are many IR measures!
• trec_eval is a little program that computes many of them
– 37 in v9.0, many of which are multi-point (e.g. Precision @10,
@20…)
• http://trec.nist.gov/trec_eval/
• “there is a measure to make anyone a winner”
– Not really true, but still…
58
Other measures
• How about correlations between measures?
• Kendall tau values
• From Voorhees and Harman,2004
• Overall they correlate
           P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)      0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)             0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                    0.93  0.87     0.83       0.83     0.67
MAP                             0.88     0.85       0.85     0.64
.5 prec                                  0.77       0.78     0.63
R(1,1000)                                           0.92     0.67
Rel ret                                                      0.66
59
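Correlations of this kind can be reproduced with a standard rank-correlation routine; a sketch using scipy (the scores below are made up for illustration, they are not the TREC data behind the table):

```python
from scipy.stats import kendalltau

# Scores of the same six runs under two different measures (illustrative numbers)
map_scores = [0.31, 0.28, 0.25, 0.22, 0.19, 0.15]
p10_scores = [0.52, 0.55, 0.47, 0.40, 0.33, 0.30]

tau, p_value = kendalltau(map_scores, p10_scores)
print(tau)  # close to 1.0 when the two measures rank the runs similarly
```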
Topic sets
 Topic selection
– In early TREC candidates rejected if ambiguous
 Are all topics equal?
– Mean Average Precision uses arithmetic mean
– Classical Test Theory experiments (Bodoff and Li,2007) identified
outliers that could change the rankings
MAP: a change in AP from 0.05 to 0.1 has the same effect as a
change from 0.25 to 0.3
GMAP: a change in AP from 0.05 to 0.1 has the same effect as a
change from 0.25 to 0.5
60
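The contrast between the two lines above comes from the arithmetic vs. geometric mean; a minimal sketch (the small epsilon guarding log(0) is an assumption of this sketch, mirroring common practice):

```python
import math

def gmap(ap_values, eps=1e-5):
    """Geometric mean of per-topic AP; emphasises gains on poorly served topics."""
    return math.exp(sum(math.log(ap + eps) for ap in ap_values) / len(ap_values))

aps = [0.05, 0.25, 0.40, 0.60]
print(sum(aps) / len(aps))  # MAP  (arithmetic mean)
print(gmap(aps))            # GMAP (geometric mean, pulled down by the 0.05 topic)
```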
Measure measures
 What is the best measure?
– What makes a measure better?
 Match to task
– E.g.
 Known item search: MRR
 Something more quantitative?
– Correlations between measures
 Does the system ranking change when using different measures
 Useful to group measures
– Ability to distinguish between runs
– Measure stability
61
Ad-hoc quiz
 It was necessary to normalize the discounted cumulative
gain (NDCG) because…
 of the assumption for normal probability distribution
 to be able to compare across topics
 normalization is always better
 to be able to average across topics
62
Ad-hoc quiz
 It was necessary to normalize the discounted cumulative
gain (NDCG) because…
 of the assumption for normal probability distribution
 to be able to compare across topics
 normalization is always better
 to be able to average across topics
63
Measure stability
 Success criteria:
– A measure is good if it is able to predict differences between
systems (on the average of future queries)
 Method
– Split collection in 2
1. Use as train collection to rank runs
2. Use as test collection to compute how many pair-wise
comparisons hold
 Observations
– Cut-off measures less stable than MAP
64
Measure stability
 Success criteria:
– A measure is good if it is able to predict differences between
systems (on the average of future queries)
 Method
– Split collection in 2
1. Use as train collection to rank runs
2. Use as test collection to compute how many pair-wise
comparisons hold
 Observations
– Cut-off measures less stable than MAP
Any other criteria for measure
quality?
65
Measure measures
 started with opinions from ’60s, seen some measures –
have the targets changed?
 7 numeric properties of effectiveness metrics (Moffat 2013)
66
7 properties of effectiveness metrics
 Boundedness – the set of scores attainable by the metric is bounded,
usually in [0,1]
 Monotonicity – if a ranking of length k is extended so that k+1 elements
are included, the score never decreases
 Convergence – if a document outside the top k is swapped with a less
relevant document inside the top k, the score strictly increases
 Top-weightedness – if a document within the top k is swapped with a
less relevant one higher in the ranking, the score strictly increases
 Localization – a score at depth k can be computed based solely on
knowledge of the documents that appear in the top k
 Completeness – a score can be calculated even if the query has no
relevant documents
 Realizability – provided that the collection has at least one relevant
document, it is possible for the score at depth k to be maximal.
68
So far
 introduction
 metrics
we are now able to say
“System A is better than System B”
or are we?
Remember
- we only have limited data
- potential future applications unbounded
a very strong
statement!
69
Statistical validity
 Whatever evaluation metric used, all experiments must be
statistically valid
– i.e. differences must not be the result of chance
[Bar chart of per-system MAP scores, roughly between 0 and 0.2]
70
Statistical validity
• Ingredients of a significance test
– A test statistic (e.g. the differences between AP values)
– A null hypothesis (e.g. “there is no difference between the two
systems”)
 This gives us a particular distribution of the test statistic
– An alternative hypothesis (one or two-tailed tests)
 don’t change it after the test
– A significance level computed by taking the actual value of the test
statistic and determining how likely it is to see this value given the
distribution implied by the null hypothesis
• P-value
• If the p-value is low, we can feel confident that we can reject
the null hypothesis → the systems are different
71
Statistical validity
 Common practice is to declare systems different when the
p-value <= 0.05
 A few tests
– Randomization tests
 Wilcoxon Signed Rank test
 Sign test
– Bootstrap test
– Student’s Paired t-test
 See recent discussion in SIGIR Forum
– T. Sakai - Statistical Reform in Information Retrieval?
 effect sizes
 confidence intervals
72
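As an illustration of the first family listed above, a minimal paired randomization (permutation) test on per-topic score differences (a sketch under the usual sign-flipping scheme; the scores are made up, not from the slides):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on the mean per-topic difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    count = 0
    for _ in range(trials):
        # under the null hypothesis the sign of each per-topic difference is arbitrary
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            count += 1
    return count / trials  # estimated p-value

a = [0.30, 0.42, 0.25, 0.51, 0.19, 0.33, 0.40, 0.28]
b = [0.25, 0.38, 0.26, 0.40, 0.15, 0.30, 0.35, 0.22]
print(randomization_test(a, b))
```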
Statistical validity
 How do we increase the statistical validity of an
experiment?
 By increasing the number of topics
– The more topics, the more confident we are that the difference
between average scores will be significant
 What’s the minimum number of topics?
42
• Depends, but
• TREC started with 50
• Below 25 is generally considered
not significant
73
Example Experimental Results
B- A = 21.4
74
t-Test
 Assumption is that the difference between the effectiveness
values is a sample from a normal distribution
 Null hypothesis is that the mean of the distribution of
differences is zero
 Test statistic
– for the example,
75
t-Test: t = 2.33
76
t-Test: t = 2.33
77
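The paired t-test itself is a single call in scipy; the test statistic is the mean per-topic difference divided by its standard error. The per-topic AP values below are illustrative only (the slides' own example yields t = 2.33):

```python
from scipy.stats import ttest_rel

# Per-topic AP for systems A and B (made-up values, not the slide's data)
ap_a = [0.30, 0.42, 0.25, 0.51, 0.19, 0.33, 0.40, 0.28, 0.37, 0.22]
ap_b = [0.25, 0.38, 0.26, 0.40, 0.15, 0.30, 0.35, 0.22, 0.31, 0.20]

# t = mean(diff) / (std(diff) / sqrt(n)), computed here by scipy
t_statistic, p_value = ttest_rel(ap_a, ap_b)
print(t_statistic, p_value)  # reject H0 ("no difference") if p_value <= 0.05
```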
Statistical Validity - example
78
79
80
81
82
Summary
 so far
– introduction
– metrics
 next
– where to get ground truth
 some more metrics
– discussion
83
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
84
Relevance assessments
• Ideally
– Sit down and look at all documents
• Practically
– The ClueWeb09 collection has
• 1,040,809,705 web pages, in 10 languages
• 5 TB, compressed. (25 TB, uncompressed.)
– No way to do this exhaustively
– Look only at the set of returned documents
• Assumption: if there are enough systems being tested and not
one of them returned a document – the document is not relevant
85
Relevance assessments - Pooling
 Combine the results retrieved by all systems
 Choose a parameter k (typically 100)
 Choose the top k documents as ranked in each submitted
run
 The pool is the union of these sets of docs
– Between k and (# submitted runs) × k documents in pool
– (k+1)st document returned in one run either irrelevant or ranked
higher in another run
 Give pool to judges for relevance assessments
86
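A sketch of the pooling procedure (runs are ranked lists of document ids; k is the pool depth; names are illustrative):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents of every submitted run, to be judged by assessors."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])
    return pool

runs = [
    ["d3", "d1", "d7", "d9", "d2"],
    ["d1", "d4", "d3", "d8", "d6"],
    ["d5", "d1", "d2", "d7", "d4"],
]
print(sorted(build_pool(runs, k=3)))  # between k and (#runs * k) documents
```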
From Donna Harman
87
Relevance assessments - Pooling
 Conditions under which pooling works [Robertson]
– Range of different kinds of systems, including manual systems
– Reasonably deep pools (100+ from each system)
 But depends on collection size
– The collections cannot be too big.
 Big is so relative…
88
Relevance assessments - Pooling
 Advantage of pooling:
– Fewer documents must be manually assessed for relevance
 Disadvantages of pooling:
– Can’t be certain that all documents satisfying the query are found
(recall values may not be accurate)
– Runs that did not participate in the pooling may be disadvantaged
– If only one run finds certain relevant documents, but ranked lower
than 100, it will not get credit for these.
89
Relevance assessments
 Pooling with randomized sampling
 As the data collection grows, the top 100 may not be
representative of the entire result set
– (i.e. the assumption that everything after is not relevant does not
hold anymore)
 Add, to the pool, a set of documents randomly sampled
from the entire retrieved set
– If the sampling is uniform → easy to reason about, but may be too
sparse as the collection grows
– Stratified sampling: get more from the top of the ranked list
[Yilmaz et al.:2008]
90
Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle
incomplete relevance assessments
– Consider unjudged = non relevant
– Do not consider unjudged at all (i.e. compress the ranked lists)
• A new measure:
– BPref (binary preference)
 r = a relevant returned document
 R = # documents judged relevant
 N = # documents judged non-relevant
 n = a non-relevant document
BPref = (1/R) · Σ_r ( 1 - |{n : rank(n) < rank(r)}| / min(R, N) )
(each relevant retrieved document r is penalised by the fraction of judged non-relevant documents ranked above it)
91
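A sketch of BPref following the commonly used definition above (ranked is the run's result list; judgments maps judged document ids to True/False; unjudged documents are simply skipped; names are illustrative):

```python
def bpref(ranked, judgments):
    """Binary preference: penalise relevant docs ranked below judged non-relevant ones."""
    R = sum(1 for rel in judgments.values() if rel)
    N = sum(1 for rel in judgments.values() if not rel)
    if R == 0:
        return 0.0
    score, nonrel_seen = 0.0, 0
    for doc in ranked:
        if doc not in judgments:
            continue                  # unjudged documents are ignored
        if judgments[doc]:
            penalty = min(nonrel_seen, R) / min(R, N) if N else 0.0
            score += 1.0 - penalty
        else:
            nonrel_seen += 1
    return score / R

judg = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True, "d6": False}
print(bpref(["d1", "d2", "d7", "d3", "d4", "d5"], judg))  # 0.667
```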
Relevance assessments - incomplete
• BPref was designed to mimic MAP
• soon after, induced AP and inferred AP were proposed
• if data complete – equal to MAP
indAP = (1/R) · Σ_r ( 1 - |{n : rank(n) < rank(r)}| / rank(r) )

infAP(k) = (1/R) · Σ_r [ 1/k + ((k-1)/k) · ( |d100| / (k-1) ) · ( (rel + ε) / (rel + nonrel + ε) ) ]

(the bracketed term is the expectation of precision at rank k, with k = rank(r); d100 denotes the depth-100 pooled documents ranked above r, and rel / nonrel the judged relevant / non-relevant documents among them)
92
 not only are we incomplete, but we might also be
inconsistent in our judgments
93
Relevance assessment - subjectivity
 In TREC-CHEM’09 we had each topic evaluated by two
students
– “conflicts” ranged between 2% and 33% (excluding a topic with 60%
conflict)
– This all increased if we considered “strict disagreement”
 In general, inter-evaluator agreement is rarely above 80%
 There is little one can do about it
94
Relevance assessment - subjectivity
 Good news:
– “idiosyncratic nature of relevance judgments does not affect
comparative results” (E. Voorhees)
– Mean Kendall Tau between system rankings produced from
different query relevance sets: 0.938
– Similar results held for:
 Different query sets
 Different evaluation measures
 Different assessor types
 Single opinion vs. group opinion judgments
95
No assessors
 Pooling assumes all relevant documents found by systems
– Take this assumption further
 Voting-based relevance assessments
– Consider top K only
Soboroff et al:2001
96
Test Collections
 Generally created as the result of an evaluation campaign
– TREC – Text Retrieval Conference (USA)
– CLEF – Cross Language Evaluation Forum (EU)
– NTCIR - NII Test Collection for IR Systems (JP)
– INEX – Initiative for evaluation of XML Retrieval
– …
 First one and paradigm definer:
– The Cranfield Collection
 In the 1950s
 Aeronautics
 1400 queries, about 6000 documents
 Fully evaluated
97
TREC
 Started in 1992
 Always organised in the States, on the NIST campus
 As leader, introduced most of the jargon used in IR
Evaluation:
– Topic = query / request for information
– Run = a ranked list of results
– Qrel = relevance judgements
98
TREC
 Organised as a set of tracks that focus on a particular sub-
problem of IR
– E.g.
 Patient records, Session, Chemical, Genome, Legal, Blog,
Spam,Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech,
OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million
Query, Ad-Hoc, Robust
– Set of tracks in a year depends on
 Interest of participants
 Fit to TREC
 Needs of sponsors
 Resource constraints
99
TREC
Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Results → Results analysis → TREC conference → Proceedings publication
100
TREC – Task definition
 Each Track has a set of Tasks:
 Examples of tasks from the Blog track:
– 1. Finding blog posts that contain opinions about the topic
– 2. Ranking positive and negative blog posts
– 3. (A separate baseline task to just find blog posts relevant to the
topic)
– 4. Finding blogs that have a principal, recurring interest in the
topic
101
TREC - Topics
 For TREC, topics generally have a specific format (not
always though)
– <ID>
– <title>
 Very short
– <description>
 A brief statement of what would be a relevant document
– <narrative>
 A long description, meant also for the evaluator to understand
how to judge the topic
102
TREC - Topics
 Example:
– <ID>
 312
– <title>
 Hydroponics
– <description>
 Document will discuss the science of growing plants in water or
some substance other than soil
– <narrative>
 A relevant document will contain specific information on the
necessary nutrients, experiments, types of substrates, and/or
any other pertinent facts related to the science of hydroponics.
Related information includes, but is not limited to, the history
of hydro- …
103
CLEF
 Cross Language Evaluation Forum
– From 2010: Conference on Multilingual and Multimodal
Information Access Evaluation
– Supported by the PROMISE Network of Excellence
 Started in 2000
 Grand challenge:
– Fully multilingual, multimodal IR systems
 Capable of processing a query in any medium and any
language
 Finding relevant information from a multilingual multimedia
collection
 And presenting it in the style most likely to be useful for the
user
104
CLEF
• Previous tracks:
• Mono-, bi- multilingual text retrieval
• Interactive cross language retrieval
• Cross language spoken document retrieval
• QA in multiple languages
• Cross language retrieval in image collections
• CL geographical retrieval
• CL Video retrieval
• Multilingual information filtering
• Intellectual property
• Log file analysis
• Large scale grid experiments
• From 2010
– Organised as a series of “labs”
105
MediaEval
 dedicated to evaluating new algorithms for multimedia
access and retrieval.
 emphasizes the 'multi' in multimedia
 focuses on human and social aspects of multimedia tasks
– speech recognition, multimedia content analysis, music and audio
analysis, user-contributed information (tags, tweets), viewer
affective response, social networks, temporal and geo-
coordinates.
http://www.multimediaeval.org/
106
Test collections - summary
 it is important to design the right experiment for the right
IR task
– Web retrieval is very different from legal retrieval
 The example of Patent retrieval
– High Recall: a single missed document can invalidate a patent
– Session based: single searches may involve days of cycles of results
review and query reformulation
– Defendable: Process and results may need to be defended in court
107
Outline
 Introduction
 Kinds of evaluation
 Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
 User-based evaluation
 Discussion on Evaluation
 Conclusion
108
User-based evaluation
 Different levels of user involvement
– Based on subjectivity levels
1. Relevant/non-relevant assessments
 Used largely in lab-like evaluation as described before
2. User satisfaction evaluation
 Some work on 1., very little on 2.
– User satisfaction is very subjective
 UIs play a major role
 Search dissatisfaction can be a result of the non-existence of
relevant documents
109
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
110
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
111
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
112
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
Relative judgements of documents
“Is document X more relevant than document Y for the
given query?”
- Many more assessments needed
- Better inter-annotator agreement [Rees and Schultz,
1967]
113
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
114
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006
115
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
 Some issues, alternatives
– Control for all sorts of user-based biases
116
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on lists of results
– Focus the user on query-document-document
 Some issues, alternatives
– Control for all sorts of user-based biases
Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007
117
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
 Some issues, alternatives
– Control for all sorts of user-based biases
– Two-panel evaluation
– limits the number of systems which can be evaluated
– Is unusable in real-life contexts
– Interspersed ranked list with click monitoring
118
Effectiveness evaluation
lab-like vs. user-focused
 Results are mixed: some experiments show correlations,
some not
 Do user preferences and Evaluation Measures Line up?
SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas
– shows the existence of correlations
 User preferences are inherently user-dependent
 Domain specific IR will be different
 The relationship between IR effectiveness measures and
user satisfaction, SIGIR 2007, Al-Maskari, Sanderson,
Clough
– strong correlation between user satisfaction and DCG, which
disappeared when normalized to NDCG.
119
Predicting performance
Future data and queries
 not absolute, but relative performance
– ad-hoc evaluations suffer in particular
– no comparison between lab and operational settings
 for justified reasons, but still none
– how much better must a system be?
 generally, require statistical significance
[Trippe:2011]
120
Predictive performance
 Future systems
 Test collections are often used to prove we have a better
system than the state of the art
– not all documents were evaluated
121
Predictive performance
 Future systems
 Test collections are often used to prove we have a better
system than the state of the art
– not all documents were evaluated
– “retrofit” metrics that are not considered resilient to such evolution
 RBP [Webber:2009]
 Precision@n [Lipani:2014], Recall@n […]
122
Why do this?
- Precision@n and Recall@n are loved in industry
- Also in industry, technology migration steps are high (i.e. hold on to a
system that ‘works’ until it is patently obvious it affects business
performance)
Are Lab evals sufficient?
 Patent search is an active process where the end-user
engages in a process of understanding and interacting with
the information
 evaluation needs a definition of success
– success ~ lower risk
 partly precision and recall
 partly (some argue the most important part) the intellectual and
interactive role of the patent search system as a whole
 series of evaluation layers
– lab evals are now the lowest level
– to elevate them, they must measure risk and incentivize systems to
provide estimates of confidence in the results they provide
[Trippe:2011]
123
Outline
 Introduction
 Kinds of evaluation
 Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
 User-based evaluation
 Discussion on Evaluation
 Conclusion
124
Discussion on evaluation
 Laboratory evaluation – good or bad?
– Rigorous testing
– Over-constrained
 I usually make the comparison to a tennis
racket:
– No evaluation of the device will tell you how well it
will perform in real life – that largely depends on the
user
– But the user will choose the device based on the lab
evaluation
125
Discussion on evaluation
 There is bias to account for
– E.g. number of relevant documents per topic
126
Discussion on evaluation
 Recall and recall-related measures are often contested
 [cooper:73,p95]
– “The involvement of unexamined documents in a performance
formula has long been taken for granted as a perfectly natural thing,
but if one stops to ponder the situation, it begins to appear most
peculiar. … Surely a document which the system user has not been
shown in any form, to which he has not devoted the slightest particle
of time or attention during his use of the system output, and of
whose very existence he is unaware, does that user neither harm
nor good in his search”
 Clearly not true in the legal & patent domains
127
Discussion on Evaluation
 Realistic tasks and user models
– Evaluation has to be based on the available data sets.
 This creates the user model
 Tasks need to correspond to available techniques
 Much literature on generating tasks
– Experts describe typical tasks
– Use of log files of various sorts
 IR research is decades behind sociology in terms of user
modeling – there is much to learn from there
128
Discussion on Evaluation
 Competitiveness
– Most campaigns take pains in explaining “This is not a competition –
this is an evaluation”
 Competitions are stimulating, but
– Participants wary of participating if they are not sure to win
 Particularly commercial vendors
– Without special care from organizers, it stifles creativity:
 Best way to win is to take last year’s method and improve a bit
 Original approaches are risky
129
Discussion on Evaluation
 Topical Relevance
 What other kinds of relevance factors are there?
– diversity of information
– quality
– credibility
– ease of reading
130
Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
– IR Evaluation research included
• → statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
– As such, care must be taken to match, to the extent possible, real
needs of the users
• Experiments in the wild are rare, small and domain specific:
– VideOlympics (2007-2009)
– PatOlympics (2010-2012)
131
Bibliography
 Test Collection Based Evaluation of Information Retrieval Systems
– M. Sanderson 2010
 TREC – Experiment and Evaluation in Information Retrieval
– E. Voorhees, D. Harman (eds.)
 On the history of evaluation in IR
– S. Robertson, 2008, Journal of Information Science
 A Comparison of Statistical Significance Tests for Information Retrieval
Evaluation
– M. Smucker, J. Allan, B. Carterette (CIKM’07)
 A Simple and Efficient Sampling Method for Estimating AP and NDCG
– E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR’08)
132
Bibliography
 Do User Preferences and Evaluation Measures Line Up?, M. Sanderson and M. L. Paramita and P. Clough and E.
Kanoulas 2010
 A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari and M. Sanderson 2010
 Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking and T.
Tang and R. Sankaranarayana and K. Griffiths and N. Craswell and P. Bailey 2007
 Evaluating Sampling Methods for Uncooperative Collections, P. Thomas and D. Hawking 2007
 Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski and N. Craswell 2010
 Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski and P. Bennett and B. Carterette
and T. Joachims 2009
 Does Brandname influence perceived search result quality? Yahoo!, Google, and WebKumara, P. Bailey and P.
Thomas and D. Hawking 2007
 Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly 2009
 C-TEST: Supporting Novelty and Diversity in TestFiles for Search Tuning, D. Hawking and T. Rowlands and P.
Thomas 2009
 Live Web Search Experiments for the Rest of Us, T. Jones and D. Hawking and R. Sankaranarayana 2010
 Quality and relevance of domain-specific search: A case study in mental health, T. Tang and N. Craswell and D.
Hawking and K. Griffiths and H. Christensen 2006
 New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking and P. Thomas and T.
Gedeon and T. Jones and T. Rowlands 2006
 A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M.
Rees and D. G. Schultz, Final Report to the National Science Foundation. Volume II, Appendices. Clearing- house
for Federal Scientific and Technical Information, October 1967
 The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search , J. Luo, C. Wing,
H. Yang and M. Hearst, CIKM 2013
 On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management
 Lipani A, Lupu M, Hanbury A, Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, SIGIR 2015
 W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009
133
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Último (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

Information Retrieval Evaluation

  • 1. IR Evaluation Mihai Lupu lupu@ifs.tuwien.ac.at Chapter 8 of the Introduction to IR book M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems Foundations and Trends in IR, 2010 1
  • 2. Outline  Introduction – Introduction to IR  Kinds of evaluation  Retrieval Effectiveness evaluation – Measures, Experimentation – Test Collections  User-based evaluation  Discussion on Evaluation  Conclusion 2
  • 3. Introduction • Why? – Put a figure on the benefit we get from a system – Because without evaluation, there is no research 3 Objective measurements
  • 4. Information Retrieval  “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)  General definition that can be applied to many types of information and search applications  Primary focus of IR since the 50s has been on text and documents
  • 7. Information Retrieval  Key insights of/for information retrieval – text has no meaning  ฉันมีรถสีแดง – but it is still the most informative source  ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า – text is not random  I drive a red car is more probable than – I drive a red horse – A red car I drive – Car red a drive I – meaning is defined by usage  I drive a truck / I drive a car / I drive the bus  truck / car / bus are similar in meaning
  • 8. Information Retrieval  Key insights of/for information retrieval – text has no meaning  ฉันมีรถสีแดง – but it is still the most informative source  ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า – text is not random  I drive a red car is more probable than – I drive a red horse – A red car I drive – Car red a drive I – meaning is defined by usage  I drive a truck / I drive a car / I drive the bus  truck / car / bus are similar in meaning term frequency (TF), document frequency (DF) TF-IDF, BM25 (Best match 25) language models (uni-gram, bi-gram, n-gram) statistical semantics (latent semantic analysis, random indexing, deep learning)
  • 9. Big Issues in IR  Relevance – What is it? – Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine – Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style – Topical relevance (same topic) vs. user relevance (everything else)
  • 10.  Relevance – Retrieval models define a view of relevance – Ranking algorithms used in search engines are based on retrieval models – Most models describe statistical properties of text rather than linguistic  i.e. counting simple text features such as words instead of parsing and analyzing the sentences  Statistical approach to text processing started with Luhn in the 50s  Linguistic features can be part of a statistical model Big Issues in IR
  • 11. Big Issues in IR  Evaluation – Experimental procedures and measures for comparing system output with user expectations  Originated in Cranfield experiments in the 60s – IR evaluation methods now used in many fields – Typically use test collection of documents, queries, and relevance judgments  Most commonly used are TREC collections – Recall and precision are two examples of effectiveness measures
  • 12. Big Issues in IR  Users and Information Needs – Search evaluation is user-centered – Keyword queries are often poor descriptions of actual information needs – Interaction and context are important for understanding user intent – Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
  • 13. Introduction • Why? – Put a figure on the benefit we get from a system – Because without evaluation, there is no research • Why is this a research field in itself? – Because there are many kinds of IR • With different evaluation criteria – Because it’s difficult • Why? – Because it involves human subjectivity (document relevance) – Because of the amount of data involved (who can sit down and evaluate 1,750,000 documents returned by Google for ‘university vienna’?) 13
  • 15. Kinds of evaluation • “Efficient and effective system” • Time and space: efficiency – Generally constrained by pre-development specification • E.g. real-time answers vs. batch jobs • E.g. index-size constraints – Easy to measure • Good results: effectiveness – Harder to define --> more research into it • And… 15
  • 16. Kinds of evaluation (cont.) • User studies – Does a 2% increase in some retrieval performance measure actually make a user happier? – Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method? – Hard to do – Mostly anecdotal examples – Many IR people don’t like to do it (though it’s starting to change) 16
  • 17. Kinds of evaluation (cont.)  Intrinsic – “internal” – ultimate goal is the retrieved set  Extrinsic – “external” – in the context of the usage of the retrieval tool 17
  • 18. What to measure in an IR system? 1966, Cleverdon: 1. coverage – the extent to which relevant matter exists in the system 2. time lag ~ efficiency 3. presentation 4. effort on the part of the user to answer his information need 5. recall 6. precision 18
  • 19. What to measure in an IR system? 1966, Cleverdon: 1. coverage – the extent to which relevant matter exists in the system 2. time lag ~ efficiency 3. presentation 4. effort on the part of the user to answer his information need 5. recall 6. precision Effectiveness 19 A desirable measure of retrieval performance would have the following properties: 1, it would be a measure of effectiveness. 2, it would not be confounded by the relative willingness of the system to emit items. 3, it would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers 4, it would allow complete ordering of different performances, and assess the performance of any one system in absolute terms. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967)
  • 20. Outline • Introduction • Kinds of evaluation • Retrieval Effectiveness evaluation – Measures – Test Collections  User-based evaluation • Discussion on Evaluation • Conclusion 20
  • 22. Retrieval Effectiveness  Precision – How happy are we with what we’ve got  Recall – How much more we could have had. Precision = (number of relevant documents retrieved) / (number of documents retrieved); Recall = (number of relevant documents retrieved) / (number of relevant documents) 22
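A minimal sketch (not part of the slides) of these two set-based definitions in Python; `retrieved` and `relevant` are hypothetical lists of document ids for a single query.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant; 6 documents are relevant overall
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       ["d1", "d3", "d5", "d7", "d8", "d9"]))   # (0.6, 0.5)
```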
  • 25. Retrieval effectiveness  What if we don’t like this twin-measure approach?  A solution: – Van Rijsbergen’s E-Measure: E = 1 - 1 / (α×(1/precision) + (1-α)×(1/recall)) – With a special case, the harmonic mean: F = 2×precision×recall / (precision + recall) 25
  • 26. Retrieval effectiveness  What if we don’t like this twin-measure approach?  A solution: – Van Rijsbergen’s E-Measure: E = 1 - 1 / (α×(1/precision) + (1-α)×(1/recall)) – With a special case, the harmonic mean: F = 2×precision×recall / (precision + recall) 26
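To make the “special case” concrete, a small illustrative computation (values assumed for illustration; the same 0.1/0.9 example appears in the editor’s notes at the end of this page): the harmonic mean F punishes an imbalanced precision/recall pair far more than the arithmetic mean, and equals 1 - E when α = 0.5.

```python
def e_measure(p, r, alpha=0.5):
    """Van Rijsbergen's E = 1 - 1 / (alpha*(1/p) + (1-alpha)*(1/r))."""
    return 1.0 - 1.0 / (alpha / p + (1.0 - alpha) / r)

def f_measure(p, r):
    """Harmonic mean of precision and recall; equals 1 - E when alpha = 0.5."""
    return 2.0 * p * r / (p + r)

p, r = 0.1, 0.9
print((p + r) / 2)           # arithmetic mean: 0.5 -- looks deceptively good
print(f_measure(p, r))       # harmonic mean:   0.18
print(1 - e_measure(p, r))   # same value, obtained via the E-measure
```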
  • 27. Retrieval effectiveness  Tools we need: – A set of documents (the “dataset”) – A set of questions/queries/topics – For each topic, and for each document, a decision: relevant or not relevant  Let’s assume for the moment that’s all we need and that we have it 27
  • 28. Retrieval Effectiveness • Precision and Recall generally plotted as a “Precision-Recall curve” [figure: precision (0–1) vs. recall (0–1); precision drops as the size of the retrieved set increases] • They do not play well together 28
  • 29. Precision-Recall Curves  How to build a Precision-Recall Curve? – For one query at a time – Make checkpoints on the recall-axis [figure: precision vs. recall axes, both 0–1] 29
  • 30. Precision-Recall Curves  How to build a Precision-Recall Curve? – For one query at a time – Make checkpoints on the recall-axis [figure: precision vs. recall axes, both 0–1] 30
  • 31. Precision-Recall Curves • How to build a Precision-Recall Curve? – For one query at a time – Make checkpoints on the recall-axis – Repeat for all queries [figure: precision vs. recall axes, both 0–1] 31
  • 32. Precision-Recall Curves • And the average is the system’s P-R curve [figure: averaged precision-recall curve; precision falls as the number of retrieved documents increases] • We can compare systems by comparing the curves 32
  • 34. Interpolation  To average graphs, calculate precision at standard recall levels: P(R) = max{ P' : (R', P') ∈ S, R' ≥ R } – where S is the set of observed (R,P) points  Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level – produces a step function – defines precision at recall 0.0 34
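A sketch of this interpolation rule, assuming `observed` holds the (recall, precision) points of one query; the 11 standard recall levels 0.0–1.0 are used.

```python
def interpolate(observed, levels=None):
    """observed: list of (recall, precision) points for one query.
    Interpolated precision: P(R) = max{ P' : (R', P') observed with R' >= R }."""
    if levels is None:
        levels = [i / 10.0 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in observed if r >= level), default=0.0)
            for level in levels]

# Example: three relevant documents (R = 3) found at ranks 1, 4 and 10
observed = [(1/3, 1.0), (2/3, 0.5), (1.0, 0.3)]
print(interpolate(observed))   # [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.3, 0.3, 0.3, 0.3]
```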
  • 36. Average Precision at Standard Recall Levels • Recall-precision graph plotted by simply joining the average precision points at the standard recall levels 36
  • 38. Graph for 50 Queries 38
  • 39. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 39
  • 40. Single-value measures • Fix a “reasonable” cutoff – R-precision  Precision at R, where R is the number of relevant documents.  Fix the number of desired documents – Reciprocal rank (RR)  1/rank of first relevant document in the ranked list returned  Make it less sensitive to the cutoff • Average precision – For each query:  R= # relevant documents  i = rank  k = # retrieved documents  P(i) precision at rank i • rel(i)=1 if document at rank i relevant, 0 otherwise – For each system: • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures AP = (1/R) × Σ_{i=1..k} P(i)×rel(i) 40
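A compact sketch of these single-value measures for one ranked list of binary judgments; the example run mirrors the R-Precision slide that follows (relevant documents at ranks 1, 2, 4, 6 and 13, with R = 6).

```python
def average_precision(ranking, R):
    """ranking: 0/1 relevance flags in rank order; R: total relevant documents for the query."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / i                 # P(i) * rel(i)
    return total / R if R else 0.0

def reciprocal_rank(ranking):
    """1 / rank of the first relevant document (0 if none)."""
    return next((1.0 / i for i, rel in enumerate(ranking, start=1) if rel), 0.0)

def r_precision(ranking, R):
    """Precision at rank R."""
    return sum(ranking[:R]) / R if R else 0.0

# One query: relevant documents at ranks 1, 2, 4, 6 and 13; R = 6 relevant in the collection
run = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(average_precision(run, R=6))   # ~0.63
print(reciprocal_rank(run))          # 1.0
print(r_precision(run, R=6))         # 4/6 = 0.67, as on the next slide

# MAP is the arithmetic mean of AP over all queries of the topic set
queries = [(run, 6), ([0, 1, 0, 0, 1], 2)]
print(sum(average_precision(r, n) for r, n in queries) / len(queries))
```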
  • 41. R-Precision  Precision at the R-th position in the ranking of results for a query that has R relevant documents. Ranked list (n, doc, x = relevant): 1 588 x | 2 589 x | 3 576 | 4 590 x | 5 986 | 6 592 x | 7 984 | 8 988 | 9 578 | 10 985 | 11 103 | 12 591 | 13 772 x | 14 990. R = # of relevant docs = 6; 4 of the top R = 6 documents are relevant, so R-Precision = 4/6 = 0.67 41
  • 45. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 45
  • 46. Cumulative Gain • For each document d, and query q, define rel(d,q) >= 0 • The higher the value, the more relevant the document is to the query • Pitfalls: – Graded relevance introduces even more ambiguity in practice With great flexibility comes great responsibility to justify parameter values 46
  • 47. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 47
  • 48. Discounted Cumulative Gain  Popular measure for evaluating web search and related tasks  Two assumptions: – Highly relevant documents are more useful than marginally relevant documents – the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined 48
  • 49. Discounted Cumulative Gain  Uses graded relevance as a measure of the usefulness, or gain, from examining a document  Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks  Typical discount is 1/log (rank) – With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3 49
  • 50. Discounted Cumulative Gain  DCG is the total gain accumulated at a particular rank p: DCG(p) = rel(1) + Σ_{i=2..p} rel(i) / log2(i)  Alternative formulation: DCG(p) = Σ_{i=1..p} (2^rel(i) - 1) / log2(1 + i) – used by some web search companies – emphasis on retrieving highly relevant documents [Jarvelin:2000] [Burges:2005] 50
  • 51. Discounted Cumulative Gain • Neither CG, nor DCG can be used for comparison across topics! depends on the # relevant documents per topic 51
  • 52. Normalised Discounted Cumulative Gain  Compute CG / DCG for the optimal return set Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..) has the Ideal Discounted Cumulative Gain: IDCG  Normalise: NDCG(n) = DCG(n) / IDCG(n) 52
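A sketch of DCG and nDCG under the definitions above (1/log2(rank) discount, no discount at rank 1); the gain vectors reuse the graded-relevance examples from this and the next slide.

```python
import math

def dcg(gains, n):
    """DCG at depth n with the 1/log2(rank) discount (no discount at rank 1)."""
    return sum(g / (math.log2(i) if i > 1 else 1.0)
               for i, g in enumerate(gains[:n], start=1))

def ndcg(gains, judged_gains, n):
    """NDCG(n) = DCG(n) / IDCG(n), IDCG computed on the ideally ordered judged gains."""
    idcg = dcg(sorted(judged_gains, reverse=True), n)
    return dcg(gains, n) / idcg if idcg > 0 else 0.0

system = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1]           # "our rank" from the next slide
judged = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1]  # all judged grades for the topic
print(dcg(system, 10), ndcg(system, judged, 10))
```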
  • 53. some more variations Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..) has the Ideal Discounted Cumulative Gain: IDCG “our rank”: (5,2,0,0,5,2,4,0,0,1,4,…)  two ranked lists – rank correlation measures  Kendall tau (similarity of orderings)  Pearson r (linear correlation between variables)  Spearman rho (Pearson computed on ranks) 53
  • 54. some more variations  rank biased precision (RBP) – “log-based discount is not a good model of users’ behaviour” – imagine the probability p of the user moving on to the next document: RBP(n) = (1-p) × Σ_{i=1..n} rel(i) × p^(i-1) (illustrated for p~0.95 and p~0.0) 54
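A sketch of RBP as defined above, with binary rel(i) for simplicity; the two calls contrast a patient user (p ≈ 0.95) with an impatient one.

```python
def rbp(rels, p=0.95):
    """Rank-biased precision: RBP = (1 - p) * sum_i rel(i) * p**(i-1)."""
    return (1.0 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, start=1))

run = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(rbp(run, p=0.95))   # patient user: deep ranks still contribute
print(rbp(run, p=0.5))    # impatient user: dominated by the first few ranks
```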
  • 55. Time-based calibration  Assumption – The objective of the search engine is to improve the efficiency of an information seeking task  Extend nDCG to replace the discount with a time-based function (Smucker and Clarke:2011) [formula: a normalization term, a gain term, and a decay term, the latter a function of the time needed to reach item k in the ranked list] 55
  • 56. The water filling model (Luo et al, 2013)  and the corresponding Cube Test (CT)  also for professional search – to capture embedded subtopics  no assumption of linear traversal of documents – takes into account time  potential cap on the amount of information taken into account  high discriminative power 56
  • 57. Other diversity metrics  several aspects of the topic might [need to] be covered – Aspectual recall/precision  discount may take into account previously seen aspects – α-NDCG = NDCG where rel(i) = Σ_{k=1..m} J(d_i, k) × (1-α)^r_{k,i-1}, with r_{k,i-1} = Σ_{j=1..i-1} J(d_j, k) and J(d_j, k) = 1 if d_j is relevant to aspect n_k, 0 otherwise 57
  • 58. Other measures • There are many IR measures! • trec_eval is a little program that computes many of them – 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20…) • http://trec.nist.gov/trec_eval/ • “there is a measure to make anyone a winner” – Not really true, but still… 58
  • 59. Other measures • How about correlations between measures? • Kendall tau values • From Voorhees and Harman, 2004 • Overall they correlate. Pairwise Kendall tau between measures:
  P(10): P(30) 0.88, R-Prec 0.81, MAP 0.79, .5 prec 0.78, R(1,1000) 0.78, Rel ret 0.77, MRR 0.77
  P(30): R-Prec 0.87, MAP 0.84, .5 prec 0.82, R(1,1000) 0.80, Rel ret 0.79, MRR 0.72
  R-Prec: MAP 0.93, .5 prec 0.87, R(1,1000) 0.83, Rel ret 0.83, MRR 0.67
  MAP: .5 prec 0.88, R(1,1000) 0.85, Rel ret 0.85, MRR 0.64
  .5 prec: R(1,1000) 0.77, Rel ret 0.78, MRR 0.63
  R(1,1000): Rel ret 0.92, MRR 0.67
  Rel ret: MRR 0.66
  59
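Such correlations are computed by scoring the same set of runs with two measures and correlating the induced system orderings; a sketch using scipy.stats.kendalltau with invented scores for six runs:

```python
from scipy.stats import kendalltau

# Hypothetical scores for six runs under two measures
map_scores = [0.31, 0.28, 0.25, 0.22, 0.19, 0.15]
p10_scores = [0.52, 0.55, 0.40, 0.43, 0.30, 0.28]

tau, p_value = kendalltau(map_scores, p10_scores)
print(tau, p_value)   # tau close to 1 means the two measures rank the runs similarly
```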
  • 60. Topic sets  Topic selection – In early TREC candidates rejected if ambiguous  Are all topics equal? – Mean Average Precision uses arithmetic mean – Classical Test Theory experiments (Bodoff and Li,2007) identified outliers that could change the rankings MAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.3 GMAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.5 60
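A quick numeric check (AP values invented) of the MAP vs. GMAP claim above: the arithmetic mean reacts identically to any +0.05 change, while the geometric mean reacts identically to any doubling.

```python
import math

def mean_ap(aps):
    return sum(aps) / len(aps)

def gmap(aps, eps=1e-5):
    """Geometric mean of per-topic AP (a small epsilon guards against AP = 0)."""
    return math.exp(sum(math.log(a + eps) for a in aps) / len(aps))

base = [0.05, 0.25, 0.40]
print(mean_ap([0.10, 0.25, 0.40]) - mean_ap(base))   # +0.0167
print(mean_ap([0.05, 0.30, 0.40]) - mean_ap(base))   # +0.0167 -- same change for MAP
print(gmap([0.10, 0.25, 0.40]) / gmap(base))          # ~1.26
print(gmap([0.05, 0.50, 0.40]) / gmap(base))          # ~1.26 -- same ratio for GMAP
```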
  • 61. Measure measures  What is the best measure? – What makes a measure better?  Match to task – E.g.  Known item search: MRR  Something more quantitative? – Correlations between measures  Does the system ranking change when using different measures  Useful to group measures – Ability to distinguish between runs – Measure stability 61
  • 62. Ad-hoc quiz  It was necessary to normalize the discounted cumulative gain (NDCG) because…  of the assumption for normal probability distribution  to be able to compare across topics  normalization is always better  to be able to average across topics 62
  • 63. Ad-hoc quiz  It was necessary to normalize the discounted cumulative gain (NDCG) because…  of the assumption for normal probability distribution  to be able to compare across topics  normalization is always better  to be able to average across topics 63
  • 64. Measure stability  Success criteria: – A measure is good if it is able to predict differences between systems (on the average of future queries)  Method – Split collection in 2 1. Use as train collection to rank runs 2. Use as test collection to compute how many pair-wise comparisons hold  Observations – Cut-off measures less stable than MAP 64
  • 65. Measure stability  Success criteria: – A measure is good if it is able to predict differences between systems (on the average of future queries)  Method – Split collection in 2 1. Use as train collection to rank runs 2. Use as test collection to compute how many pair-wise comparisons hold  Observations – Cut-off measures less stable than MAP Any other criteria for measure quality? 65
  • 66. Measure measures  started with opinions from ’60s, seen some measures – have the targets changed?  7 numeric properties of effectiveness metrics (Moffat 2013) 66
  • 67. 7 properties of effectiveness metrics  Boundedness – the set of scores attainable by the metric is bounded, usually in [0,1]  Monotonicity – if a ranking of length k is extended so that k+1 elements are included, the score never decreases  Convergence – if a document outside the top k is swapped with a less relevant document inside the top k, the score strictly increases  Top-weightedness – if a document within the top k is swapped with a less relevant one higher in the ranking, the score strictly increases  Localization – a score at depth k can be computed based solely on knowledge of the documents that appear in top k  Completeness – a score can be calculated even if the query has no relevant documents  Realizability – provided that the collection has at least one relevant document, it is possible for the score at depth k to be maximal. 68
  • 68. So far  introduction  metrics we are now able to say “System A is better than System B” or are we? Remember - we only have limited data - potential future applications unbounded a very strong statement! 69
  • 69. Statistical validity  Whatever evaluation metric is used, all experiments must be statistically valid – i.e. differences must not be the result of chance [figure: bar chart of MAP scores for two systems, axis from 0 to 0.2] 70
  • 70. Statistical validity • Ingredients of a significance test – A test statistic (e.g. the differences between AP values) – A null hypothesis (e.g. “there is no difference between the two systems”)  This gives us a particular distribution of the test statistic – An alternative hypothesis (one- or two-tailed tests)  don’t change it after the test – A significance level computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis • P-value • If the p-value is low, we can feel confident that we can reject the null hypothesis  the systems are different 71
  • 71. Statistical validity  Common practice is to declare systems different when the p-value <= 0.05  A few tests – Randomization tests  Wilcoxon Signed Rank test  Sign test – Bootstrap test – Student’s Paired t-test  See recent discussion in SIGIR Forum – T. Sakai - Statistical Reform in Information Retrieval?  effect sizes  confidence intervals 72
  • 72. Statistical validity  How do we increase the statistical validity of an experiment?  By increasing the number of topics – The more topics, the more confident we are that the difference between average scores will be significant  What’s the minimum number of topics? 42 • Depends, but • TREC started with 50 • Below 25 is generally considered not significant 73
  • 74. t-Test  Assumption is that the difference between the effectiveness values is a sample from a normal distribution  Null hypothesis is that the mean of the distribution of differences is zero  Test statistic: t = mean(d) / (s_d / √N), where d is the vector of per-topic differences, s_d its standard deviation and N the number of topics 75
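A sketch of how these tests are typically run on paired per-topic scores (synthetic AP values for two hypothetical systems `a` and `b`), using scipy for the paired t-test and Wilcoxon signed-rank test plus a simple sign-flip randomization test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-topic AP scores for two systems over the same 50 topics
a = rng.uniform(0.1, 0.5, size=50)
b = np.clip(a + rng.normal(0.02, 0.05, size=50), 0, 1)   # system B ~2 points better
d = b - a

print(stats.ttest_rel(b, a))    # Student's paired t-test
print(stats.wilcoxon(b, a))     # Wilcoxon signed-rank test

# Randomization (sign-flip) test on the mean per-topic difference
observed = d.mean()
flips = rng.choice([-1, 1], size=(10000, d.size))         # random label swaps
null = (flips * d).mean(axis=1)                           # null distribution
p_value = (np.abs(null) >= abs(observed)).mean()
print(p_value)
```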
  • 82. Summary  so far – introduction – metrics  next – where to get ground truth  some more metrics – discussion 83
  • 83. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 84
  • 84. Relevance assessments • Ideally – Sit down and look at all documents • Practically – The ClueWeb09 collection has • 1,040,809,705 web pages, in 10 languages • 5 TB, compressed. (25 TB, uncompressed.) – No way to do this exhaustively – Look only at the set of returned documents • Assumption: if there are enough systems being tested and not one of them returned a document – the document is not relevant 85
  • 85. Relevance assessments - Pooling  Combine the results retrieved by all systems  Choose a parameter k (typically 100)  Choose the top k documents as ranked in each submitted run  The pool is the union of these sets of docs – Between k and (# submitted runs) × k documents in pool – (k+1)st document returned in one run either irrelevant or ranked higher in another run  Give pool to judges for relevance assessments 86
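A minimal sketch of depth-k pooling; `runs` is a hypothetical mapping from run name to the ranked document ids it returned for one topic.

```python
def build_pool(runs, k=100):
    """Union of the top-k documents of every submitted run (for one topic)."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

runs = {
    "runA": ["d3", "d1", "d7", "d9"],
    "runB": ["d1", "d2", "d7", "d4"],
}
print(sorted(build_pool(runs, k=3)))   # between k and (#runs * k) documents
```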
  • 87. Relevance assessments - Pooling  Conditions under which pooling works [Robertson] – Range of different kinds of systems, including manual systems – Reasonably deep pools (100+ from each system)  But depends on collection size – The collections cannot be too big.  Big is so relative… 88
  • 88. Relevance assessments - Pooling  Advantage of pooling: – Fewer documents must be manually assessed for relevance  Disadvantages of pooling: – Can’t be certain that all documents satisfying the query are found (recall values may not be accurate) – Runs that did not participate in the pooling may be disadvantaged – If only one run finds certain relevant documents, but ranked lower than 100, it will not get credit for these. 89
  • 89. Relevance assessments  Pooling with randomized sampling  As the data collection grows, the top 100 may not be representative of the entire result set – (i.e. the assumption that everything after is not relevant does not hold anymore)  Add, to the pool, a set of documents randomly sampled from the entire retrieved set – If the sampling is uniform  easy to reason about, but may be too sparse as the collection grows – Stratified sampling: get more from the top of the ranked list [Yilmaz et al.:2008] 90
  • 90. Relevance assessments - incomplete • The unavoidable conclusion is that we have to handle incomplete relevance assessments – Consider unjudged = non-relevant – Do not consider unjudged at all (i.e. compress the ranked lists) • A new measure: – BPref (binary preference)  r = a relevant returned document  R = # documents judged relevant  N = # documents judged non-relevant  n = a judged non-relevant document BPref = (1/R) × Σ_r ( 1 - |{n : rank(n) < rank(r)}| / min(R, N) ), i.e. each relevant document is penalised by the fraction of judged non-relevant documents ranked above it 91
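A sketch of bpref as defined above; the ranking is a list of judgments in rank order (1 = judged relevant, 0 = judged non-relevant, None = unjudged), and for simplicity R is taken as the judged relevant documents that appear in the list.

```python
def bpref(ranking):
    """ranking: 1 = judged relevant, 0 = judged non-relevant, None = unjudged (ignored)."""
    R = sum(1 for j in ranking if j == 1)   # R taken as judged relevant in the ranking (simplification)
    N = sum(1 for j in ranking if j == 0)   # judged non-relevant
    if R == 0:
        return 0.0
    if N == 0:
        return 1.0                          # nothing judged non-relevant to penalise with
    bound = min(R, N)
    score, nonrel_above = 0.0, 0
    for j in ranking:
        if j == 0:
            nonrel_above += 1
        elif j == 1:
            score += 1.0 - min(nonrel_above, bound) / bound
    return score / R

# 1 = relevant, 0 = non-relevant, None = unjudged
print(bpref([1, None, 0, 1, None, 0, 0, 1, None, 0]))   # ~0.56
```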
  • 91. Relevance assessments - incomplete • BPref was designed to mimic MAP • soon after, induced AP and inferred AP were proposed • if data is complete – both are equal to MAP. indAP = (1/R) × Σ_r ( 1 - |{n : rank(n) < rank(r)}| / rank(r) ); infAP = (1/R) × Σ_r [ 1/k + ((k-1)/k) × ( d100(k-1) / (k-1) ) × ( (rel + ε) / (rel + nonrel + ε) ) ], with k = rank(r), d100(k-1) the number of pooled (judged or sampled) documents above rank k, and rel / nonrel the sampled relevant / non-relevant judgments above rank k – the bracketed term is the expectation of the precision at rank k 92
  • 92.  not only are we incomplete, but we might also be inconsistent in our judgments 93
  • 93. Relevance assessment - subjectivity  In TREC-CHEM’09 we had each topic evaluated by two students – “conflicts” ranged between 2% and 33% (excluding a topic with 60% conflict) – This all increased if we considered “strict disagreement”  In general, inter-evaluator agreement is rarely above 80%  There is little one can do about it 94
  • 94. Relevance assessment - subjectivity  Good news: – “idiosyncratic nature of relevance judgments does not affect comparative results” (E. Voorhees) – Mean Kendall Tau between system rankings produced from different query relevance sets: 0.938 – Similar results held for:  Different query sets  Different evaluation measures  Different assessor types  Single opinion vs. group opinion judgments 95
  • 95. No assessors  Pooling assumes all relevant documents found by systems – Take this assumption further  Voting-based relevance assessments – Consider top K only (Soboroff et al., 2001) 96
  • 96. Test Collections  Generally created as the result of an evaluation campaign – TREC – Text Retrieval Conference (USA) – CLEF – Cross Language Evaluation Forum (EU) – NTCIR - NII Test Collection for IR Systems (JP) – INEX – Initiative for evaluation of XML Retrieval – …  First one and paradigm definer: – The Cranfield Collection  In the 1950s  Aeronautics  1400 queries, about 6000 documents  Fully evaluated 97
  • 97. TREC  Started in 1992  Always organised in the States, on the NIST campus  As leader, introduced most of the jargon used in IR Evaluation: – Topic = query / request for information – Run = a ranked list of results – Qrel = relevance judgements 98
  • 98. TREC  Organised as a set of tracks that focus on a particular sub- problem of IR – E.g.  Patient records, Session, Chemical, Genome, Legal, Blog, Spam,Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust – Set of tracks in a year depends on  Interest of participants  Fit to TREC  Needs of sponsors  Resource constraints 99
  • 100. TREC – Task definition  Each Track has a set of Tasks:  Examples of tasks from the Blog track: – 1. Finding blog posts that contain opinions about the topic – 2. Ranking positive and negative blog posts – 3. (A separate baseline task to just find blog posts relevant to the topic) – 4. Finding blogs that have a principal, recurring interest in the topic 101
  • 101. TREC - Topics  For TREC, topics generally have a specific format (not always though) – <ID> – <title>  Very short – <description>  A brief statement of what would be a relevant document – <narrative>  A long description, meant also for the evaluator to understand how to judge the topic 102
  • 102. TREC - Topics  Example: – <ID>  312 – <title>  Hydroponics – <description>  Document will discuss the science of growing plants in water or some substance other than soil – <narrative>  A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- … 103
  • 103. CLEF  Cross Language Evaluation Forum – From 2010: Conference on Multilingual and Multimodal Information Access Evaluation – Supported by the PROMISE Network of Excellence  Started in 2000  Grand challenge: – Fully multilingual, multimodal IR systems  Capable of processing a query in any medium and any language  Finding relevant information from a multilingual multimedia collection  And presenting it in the style most likely to be useful for the user 104
  • 104. CLEF • Previous tracks: • Mono-, bi- multilingual text retrieval • Interactive cross language retrieval • Cross language spoken document retrieval • QA in multiple languages • Cross language retrieval in image collections • CL geographical retrieval • CL Video retrieval • Multilingual information filtering • Intellectual property • Log file analysis • Large scale grid experiments • From 2010 – Organised as a series of “labs” 105
  • 105. MediaEval  dedicated to evaluating new algorithms for multimedia access and retrieval.  emphasizes the 'multi' in multimedia  focuses on human and social aspects of multimedia tasks – speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates. http://www.multimediaeval.org/ 106
  • 106. Test collections - summary  it is important to design the right experiment for the right IR task – Web retrieval is very different from legal retrieval  The example of Patent retrieval – High Recall: a single missed document can invalidate a patent – Session based: single searches may involve days of cycles of results review and query reformulation – Defendable: Process and results may need to be defended in court 107
  • 107. Outline  Introduction  Kinds of evaluation  Retrieval Effectiveness evaluation – Measures, Experimentation – Test Collections  User-based evaluation  Discussion on Evaluation  Conclusion 108
  • 108. User-based evaluation  Different levels of user involvement – Based on subjectivity levels 1. Relevant/non-relevant assessments  Used largely in lab-like evaluation as described before 2. User satisfaction evaluation  Some work on 1., very little on 2. – User satisfaction is very subjective  UIs play a major role  Search dissatisfaction can be a result of the non-existence of relevant documents 109
  • 109. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair 110
  • 110. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair 111
  • 111. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document 112
  • 112. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document Relative judgements of documents “Is document X more relevant than document Y for the given query?” - Many more assessments needed - Better inter-annotator agreement [Rees and Schultz, 1967] 113
  • 113. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results 114
  • 114. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006 115
  • 115. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results  Some issues, alternatives – Control for all sorts of user-based biases 116
  • 116. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on lists of results – Focus the user on query-document-document  Some issues, alternatives – Control for all sorts of user-based biases Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007 117
  • 117. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results  Some issues, alternatives – Control for all sorts of user-based biases – Two-panel evaluation – limits the number of systems which can be evaluated – Is unusable in real-life contexts – Interspersed ranked list with click monitoring 118
  • 118. Effectiveness evaluation lab-like vs. user-focused  Results are mixed: some experiments show correlations, some not  Do user preferences and Evaluation Measures Line up? SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas – shows the existence of correlations  User preference is inherently user-dependent  Domain-specific IR will be different  The relationship between IR effectiveness measures and user satisfaction, SIGIR 2007, Al-Maskari, Sanderson, Clough – strong correlation between user satisfaction and DCG, which disappeared when normalized to NDCG. 119
  • 119. Predicting performance Future data and queries  not absolute, but relative performance – ad-hoc evaluations suffer in particular – no comparison between lab and operational settings  for justified reasons, but still none – how much better must a system be?  generally, require statistical significance [Trippe:2011] 120
  • 120. Predictive performance  Future systems  Test collections are often used to prove we have a better system than the state of the art – not all documents were evaluated 121
  • 121. Predictive performance  Future systems  Test collections are often used to prove we have a better system than the state of the art – not all documents were evaluated – “retrofit” metrics that are not considered resilient to such evolution  RBP [Webber:2009]  Precision@n [Lipani:2014], Recall@n […] 122 Why do this? - Precision@n and Recall@n are loved in industry - Also in industry, technology migration steps are high (i.e. hold on to a system that ‘works’ until it is patently obvious it affects business performance)
  • 122. Are Lab evals sufficient?  Patent search is an active process where the end-user engages in a process of understanding and interacting with the information  evaluation needs a definition of success – success ~ lower risk  partly precision and recall  partly (some argue the most important part) the intellectual and interactive role of the patent search system as a whole  series of evaluation layers – lab evals are now the lowest level – to elevate them, they must measure risk and incentivize systems to provide estimates of confidence in the results they provide [Trippe:2011] 123
  • 123. Outline  Introduction  Kinds of evaluation  Retrieval Effectiveness evaluation – Measures, Experimentation – Test Collections  User-based evaluation  Discussion on Evaluation  Conclusion 124
  • 124. Discussion on evaluation  Laboratory evaluation – good or bad? – Rigorous testing – Over-constrained  I usually make the comparison to a tennis racket: – No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user – But the user will choose the device based on the lab evaluation 125
  • 125. Discussion on evaluation  There is bias to account for – E.g. number of relevant documents per topic 126
  • 126. Discussion on evaluation  Recall and recall-related measures are often contested  [cooper:73,p95] – “The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search”  Clearly not true in the legal & patent domains 127
  • 127. Discussion on Evaluation  Realistic tasks and user models – Evaluation has to be based on the available data sets.  This creates the user model  Tasks need to correspond to available techniques  Much literature on generating tasks – Experts describe typical tasks – Use of log files of various sorts  IR Research decades behind sociology in terms of user modeling – there is a place to learn from 128
  • 128. Discussion on Evaluation  Competitiveness – Most campaigns take pains to explain “This is not a competition – this is an evaluation”  Competitions are stimulating, but – Participants are wary of participating if they are not sure to win  Particularly commercial vendors – Without special care from organizers, it stifles creativity:  Best way to win is to take last year’s method and improve a bit  Original approaches are risky 129
  • 129. Discussion on Evaluation  Topical Relevance  What other kinds of relevance factors are there? – diversity of information – quality – credibility – ease of reading 130
  • 130. Conclusion • IR Evaluation is a research field in itself • Without evaluation, research is pointless – IR Evaluation research included •  statistical significance testing is a must to validate results • Most IR Evaluation exercises are laboratory experiments – As such, care must be taken to match, to the extent possible, real needs of the users • Experiments in the wild are rare, small and domain specific: – VideOlympics (2007-2009) – PatOlympics (2010-2012) 131
  • 131. Bibliography  Test Collection Based Evaluation of Information Retrieval Systems – M. Sanderson 2010  TREC – Experiment and Evaluation in Information Retrieval – E. Voorhees, D. Harman (eds.)  On the history of evaluation in IR – S. Robertson, 2008, Journal of Information Science  A Comparison of Statistical Significance Tests for Information Retrieval Evaluation – M. Smucker, J. Allan, B. Carterette (CIKM’07)  A Simple and Efficient Sampling Method for Estimating AP and NDCG – E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR’08) 132
  • 132. Bibliography  Do User Preferences and Evaluation Measures Line Up?, M. Sanderson and M. L. Paramita and P. Clough and E. Kanoulas 2010  A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari and M. Sanderson 2010  Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking and T. Tang and R. Sankaranarayana and K. Griffiths and N. Craswell and P. Bailey 2007  Evaluating Sampling Methods for Uncooperative Collections, P. Thomas and D. Hawking 2007  Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski and N. Craswell 2010  Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski and P. Bennett and B. Carterette and T. Joachims 2009  Does Brandname influence perceived search result quality? Yahoo!, Google, and WebKumara, P. Bailey and P. Thomas and D. Hawking 2007  Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly 2009  C-TEST: Supporting Novelty and Diversity in Test Files for Search Tuning, D. Hawking and T. Rowlands and P. Thomas 2009  Live Web Search Experiments for the Rest of Us, T. Jones and D. Hawking and R. Sankaranarayana 2010  Quality and relevance of domain-specific search: A case study in mental health, T. Tang and N. Craswell and D. Hawking and K. Griffiths and H. Christensen 2006  New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking and P. Thomas and T. Gedeon and T. Jones and T. Rowlands 2006  A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M. Rees and D. G. Schultz, Final Report to the National Science Foundation. Volume II, Appendices. Clearinghouse for Federal Scientific and Technical Information, October 1967  The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search, J. Luo, C. Wing, H. Yang and M. Hearst, CIKM 2013  On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management  Lipani A, Lupu M, Hanbury A, Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, SIGIR 2015  W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009 133

Editor’s notes

  1. Thai text for “I have a red car”
  2. some terms you will be hearing us talking about
  3. In this lecture we will focus on the first, intrinsic, evaluation, and only mention the second part, as it will be discussed in much more detail in K. Jarvelin’s lecture.
  4. A desirable measure of retrieval performance would have the following properties: First, it would express solely the ability of a retrieval system to distinguish between wanted and unwanted items – that is, it would be a measure of effectiveness. Second, the desired measure would not be confounded by the relative willingness of the system to emit items – it would express discrimination power independent of any “acceptance criterion” employed, whether the criterion is characteristic of the system or adjusted by the user. Third, the measure would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers – so that it could be transmitted simply and immediately comprehended. Fourth, and finally, the measure would allow complete ordering of different performances, and assess the performance of any one system in absolute terms – that is, the metric would be a scale with a unit, a true zero, and a maximum value. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967, http://onlinelibrary.wiley.com/doi/10.1002/asi.4630200110/)
  5. For the E measure, Beta indicates what the user prefers (precision: beta>1, recall: beta<1). These methods clearly depend on cut-off values, which make them unusable for meaningful comparison between topics (a topic may have very few relevant documents, another may have many more). The harmonic mean is considered better for averaging ratios. Example: precision 0.1 and recall 0.9, the arithmetic average is 0.5 – quite high, while the harmonic mean is 0.18. Even more extreme case: think precision 0.01 and recall 0.99
  6. For the E measure, Beta indicates what the user prefers (precision: beta>1, recall: beta<1). These methods clearly depend on cut-off values, which make them unusable for meaningful comparison between topics (a topic may have very few relevant documents, another may have many more). The harmonic mean is considered better for averaging ratios. Example: precision 0.1 and recall 0.9, the arithmetic average is 0.5 – quite high, while the harmonic mean is 0.18. Even more extreme case: think precision 0.01 and recall 0.99
  7. this interpolation is actually not obvious because we might not always have the same values for recall (remember that that depends on the number of relevant documents per topic). The common way is to consider as precision at recall_i the highest precision measure at any level greater or equal than recall_i
  8. Cut-off based measures also have the significant disadvantage that they are unstable with respect to the size of the collection. They are also unfair between topics: the number of relevant documents for each topic in the collection generally differs, but improvements are considered the same by these measures. Across all seven participating groups, P(20) was higher for searches on the 20 GB collection than on the subset; on average 39% higher. Note also that all forms of AP and R-precision approximate the area under a recall-precision graph (Sanderson, Aslam)
  9. Here is where genre may come into play, as well as difficulty. This time function needs to be calibrated to the user.
  10. .5 prec is recall obtained by the system when precision first dips below 0.5 and at least ten documents have been retrieved (heuristic that users will look at the result set as long as there are more relevant than non-relevant documents) R(1,1000) is a weighted rel ret , such that the topics with most relevant documents do not dominate the measure
  11. Even now, topics are rejected (removed) if no relevant documents have been identified
  12. Cost of evaluation. e.g P@5 is very cheap, while MAP is much more expensive
  13. “What are the required conditions? Well, the evidence suggests that we need to start with a good range of different kinds of systems – preferably, in particular, including some manual systems involving human-designed search strategies and (preferably again) some degree of interaction in the search. Second, we need reasonably deep pools (preferably 100+ from each system, not 10). Third, the collections themselves cannot be too big. “ (Robertson:2008)
  14. Or at least very little on 2. which can be published in IR journals and conferences
  15. Re-quoted from [Moffat&Zobel:2008]