IR Evaluation
Mihai Lupu
lupu@ifs.tuwien.ac.at
Chapter 8 of the Introduction to IR book
M. Sanderson. Test Collection Based Evaluation of Information
Retrieval Systems Foundations and Trends in IR, 2010
1
Outline
 Introduction
– Introduction to IR
 Kinds of evaluation
 Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
 User-based evaluation
 Discussion on Evaluation
 Conclusion
2
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
3
Objective
measurements
Information Retrieval
 “Information retrieval is a field concerned with the structure,
analysis, organization, storage, searching, and retrieval of
information.” (Salton, 1968)
 General definition that can be applied to many types of
information and search applications
 Primary focus of IR since the 50s has been on text and
documents
Information Retrieval
 Key insights of/for information retrieval
– text has no meaning
 ฉันมีรถสีแดง
– but it is still the most informative source
 ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า
– text is not random
 I drive a red car is more probable than
– I drive a red horse
– A red car I drive
– Car red a drive I
– meaning is defined by usage
 I drive a truck / I drive a car / I drive the bus → truck / car / bus
are similar in meaning
Information Retrieval
 Key insights of/for information retrieval
– text has no meaning
 ฉันมีรถสีแดง
– but it is still the most informative source
 ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า
– text is not random
 I drive a red car is more probable than
– I drive a red horse
– A red car I drive
– Car red a drive I
– meaning is defined by usage
 I drive a truck / I drive a car / I drive the bus → truck / car / bus
are similar in meaning
term frequency (TF), document frequency (DF)
TF-IDF, BM25 (Best match 25)
language models (uni-gram, bi-gram, n-gram)
statistical semantics (latent semantic analysis,
random indexing, deep learning)
Big Issues in IR
 Relevance
– What is it?
– Simple (and simplistic) definition: A relevant document contains
the information that a person was looking for when they submitted
a query to the search engine
– Many factors influence a person’s decision about what is relevant:
e.g., task, context, novelty, style
– Topical relevance (same topic) vs. user relevance (everything
else)
 Relevance
– Retrieval models define a view of relevance
– Ranking algorithms used in search engines are based
on retrieval models
– Most models describe statistical properties of text rather
than linguistic
 i.e. counting simple text features such as words
instead of parsing and analyzing the sentences
 Statistical approach to text processing started with
Luhn in the 50s
 Linguistic features can be part of a statistical model
Big Issues in IR
 Evaluation
– Experimental procedures and measures for comparing system
output with user expectations
 Originated in Cranfield experiments in the 60s
– IR evaluation methods now used in many fields
– Typically use test collection of documents, queries, and relevance
judgments
 Most commonly used are TREC collections
– Recall and precision are two examples of effectiveness measures
Big Issues in IR
 Users and Information Needs
– Search evaluation is user-centered
– Keyword queries are often poor descriptions of actual information
needs
– Interaction and context are important for understanding user intent
– Query refinement techniques such as query expansion, query
suggestion, relevance feedback improve ranking
Introduction
• Why?
– Put a figure on the benefit we get from a system
– Because without evaluation, there is no research
• Why is this a research field in itself?
– Because there are many kinds of IR
• With different evaluation criteria
– Because it’s difficult
• Why?
– Because it involves human subjectivity (document relevance)
– Because of the amount of data involved (who can sit down
and evaluate 1,750,000 documents returned by Google for
‘university vienna’?)
13
Kinds of evaluation
14
Kinds of evaluation
• “Efficient and effective system”
• Time and space: efficiency
– Generally constrained by pre-development specification
• E.g. real-time answers vs. batch jobs
• E.g. index-size constraints
– Easy to measure
• Good results: effectiveness
– Harder to define --> more research into it
• And…
15
Kinds of evaluation (cont.)
• User studies
– Does a 2% increase in some retrieval performance measure actually
make a user happier?
– Does displaying a text snippet improve usability even if the
underlying method is 10% weaker than some other method?
– Hard to do
– Mostly anecdotal examples
– Many IR people don’t like to do it (though it’s starting to change)
16
Kinds of evaluation (cont.)
 Intrinsic
– “internal”
– ultimate goal is the retrieved set
 Extrinsic
– “external”
– in the context of the usage of the retrieval tool
17
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the
system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information
need
5. recall
6. precision
18
What to measure in an IR system?
1966, Cleverdon:
1. coverage – the extent to which relevant matter exists in the
system
2. time lag ~ efficiency
3. presentation
4. effort on the part of the user to answer his information
need
5. recall
6. precision
Effectiveness
19
A desirable measure of retrieval performance would have the following
properties: 1, it would be a measure of effectiveness. 2, it would not be
confounded by the relative willingness of the system to emit items. 3, it would
be a single number – in preference, for example, to a pair of numbers which
may co-vary in a loosely specified way, or a curve representing a table of
several pairs of numbers. 4, it would allow complete ordering of different
performances, and assess the performance of any one system in absolute
terms. Given a measure with these properties, we could be confident of
having a pure and valid index of how well a retrieval system (or method) were
performing the function it was primarily designed to accomplish, and we could
reasonably ask questions of the form “Shall we pay X dollars for Y units of
effectiveness?” (Swets, 1967)
Outline
• Introduction
• Kinds of evaluation
• Retrieval Effectiveness evaluation
– Measures
– Test Collections
 User-based evaluation
• Discussion on Evaluation
• Conclusion
20
Efficiency Metrics
21
Retrieval Effectiveness
 Precision
– How happy are we with what we’ve got
 Recall
– How much more we could have had
Precision = (number of relevant documents retrieved) / (number of documents retrieved)
Recall = (number of relevant documents retrieved) / (number of relevant documents)
22
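To make the two definitions concrete, here is a minimal Python sketch (the function and document ids are illustrative, not from the slides) that computes both values for one query from a ranked result list and the set of judged-relevant documents:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a single query.

    retrieved: iterable of document ids returned by the system
    relevant:  set of document ids judged relevant for the query
    """
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for d in retrieved if d in relevant)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       {"d1", "d3", "d5", "d7", "d8", "d9"}))  # (0.6, 0.5)
```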
Retrieval Effectiveness
Retrieved
documents
Relevant documents
Universe of documents
23
Precision and Recall
24
Retrieval effectiveness
 What if we don’t like this twin-measure approach?
 A solution:
– Van Rijsbergen’s E-Measure:
– With a special case: Harmonic mean
E = 1 - 1 / ( α · (1/precision) + (1 - α) · (1/recall) )
F = (2 × precision × recall) / (precision + recall)
25
Retrieval effectiveness
 What if we don’t like this twin-measure approach?
 A solution:
– Van Rijsbergen’s E-Measure:
– With a special case: Harmonic mean
E = 1 - 1 / ( α · (1/precision) + (1 - α) · (1/recall) )
F = (2 × precision × recall) / (precision + recall)
26
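As an illustrative sketch (not part of the slides), the two formulas can be computed directly; with α = 0.5 the E-measure reduces to 1 − F:

```python
def e_measure(precision, recall, alpha=0.5):
    """Van Rijsbergen's E-measure; lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # worst case when either component is zero
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 0.6, 0.5
print(f_measure(p, r))       # 0.545...
print(1 - e_measure(p, r))   # same value when alpha = 0.5
```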
Retrieval effectiveness
 Tools we need:
– A set of documents (the “dataset”)
– A set of questions/queries/topics
– For each topic, and for each document, a decision: relevant or not
relevant
 Let’s assume for the moment that’s all we need and that
we have it
27
Retrieval Effectiveness
• Precision and Recall generally plotted as a “Precision-Recall
curve”
[Precision-recall curve: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1); precision tends to drop as the size of the retrieved set increases]
• They do not play well together
28
Precision-Recall Curves
 How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
29
Precision-Recall Curves
 How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
30
Precision-Recall Curves
• How to build a Precision-Recall Curve?
– For one query at a time
– Make checkpoints on the recall-axis
– Repeat for all queries
31
Precision-Recall Curves
• And the average is the system’s P-R curve
[Averaged precision-recall curve: precision drops as the number of retrieved documents increases]
• We can compare systems by comparing the
curves
32
Precision-Recall Graph
--reality check--
33
Interpolation
 To average graphs, calculate precision at standard recall
levels:
– where S is the set of observed (R,P) points
 Defines precision at any recall level as the maximum
precision observed in any recall-precision point at a
higher recall level
– produces a step function
– defines precision at recall 0.0
34
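A minimal sketch of the interpolation rule described above, assuming the observed (recall, precision) points for one query are available as a list (names are illustrative): precision at a standard recall level is the maximum precision over all observed points with recall at or above that level.

```python
def interpolate(observed, levels=None):
    """Interpolated precision at standard recall levels.

    observed: list of (recall, precision) points for one query
    levels:   recall checkpoints, default 0.0, 0.1, ..., 1.0
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]
    result = {}
    for level in levels:
        # maximum precision over all points with recall >= level (0 if none)
        result[level] = max((p for r, p in observed if r >= level), default=0.0)
    return result

points = [(0.17, 1.0), (0.33, 0.67), (0.5, 0.6), (0.67, 0.5), (0.83, 0.45), (1.0, 0.4)]
print(interpolate(points))
```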
Interpolation
35
Average Precision at
Standard Recall Levels
• Recall-precision graph plotted by simply
joining the average precision points at
the standard recall levels
36
Average Recall-Precision Graph
37
Graph for 50 Queries
38
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
39
Single-value measures
• Fix a “reasonable” cutoff
– R-precision
 Precision at R, where R is the number of relevant documents.
 Fix the number of desired documents
– Reciprocal rank (RR)
 1/rank of first relevant document in the ranked list returned
 Make it less sensitive to the cutoff
• Average precision
– For each query:
 R= # relevant documents
 i = rank
 k = # retrieved documents
 P(i) precision at rank i
• rel(i)=1 if document at rank i relevant, 0 otherwise
– For each system:
• Compute the mean of these averages: Mean Average
Precision (MAP) – one of the most used measures
AP = ( Σ_{i=1..k} P(i) × rel(i) ) / R
40
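The single-value measures above can be sketched in a few lines of Python (illustrative helpers, not from the lecture; rel is a 0/1 relevance list in rank order and R the number of relevant documents for the query):

```python
def average_precision(rel, R):
    """rel: list of 0/1 relevance flags in rank order; R: total relevant docs."""
    score, hits = 0.0, 0
    for i, r in enumerate(rel, start=1):
        if r:
            hits += 1
            score += hits / i          # P(i) at each relevant rank
    return score / R if R else 0.0

def reciprocal_rank(rel):
    """1 / rank of the first relevant document, 0 if none is retrieved."""
    for i, r in enumerate(rel, start=1):
        if r:
            return 1.0 / i
    return 0.0

def mean_average_precision(queries):
    """queries: list of (rel, R) pairs, one per topic."""
    return sum(average_precision(rel, R) for rel, R in queries) / len(queries)

print(average_precision([1, 1, 0, 1, 0, 1], R=6))  # (1 + 1 + 0.75 + 0.667) / 6
```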
R- Precision
 Precision at the R-th position in the ranking of results for
a query that has R relevant documents.
n doc # relevant
1 588 x
2 589 x
3 576
4 590 x
5 986
6 592 x
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x
14 990
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
41
Averaging Across Queries
42
Average Precision
43
MAP
44
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
45
Cumulative Gain
• For each document d, and query q, define
rel(d,q) >= 0
• The higher the value, the more relevant the document is to
the query
• Pitfalls:
– Graded relevance introduces even more ambiguity in practice
With great flexibility comes great
responsibility to justify parameter values
46
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
47
Discounted Cumulative Gain
 Popular measure for evaluating web search and related
tasks
 Two assumptions:
– Highly relevant documents are more useful than marginally relevant
document
– the lower the ranked position of a relevant document, the less useful
it is for the user, since it is less likely to be examined
48
Discounted Cumulative Gain
 Uses graded relevance as a measure of the usefulness, or
gain, from examining a document
 Gain is accumulated starting at the top of the ranking and
may be reduced, or discounted, at lower ranks
 Typical discount is 1/log (rank)
– With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
49
Discounted Cumulative Gain
 DCG is the total gain accumulated at a particular rank p:
 Alternative formulation:
– used by some web search companies
– emphasis on retrieving highly relevant documents
[Jarvelin:2000]
[Borges:2005]
50
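The two DCG formulations referred to above were images on the original slide; the standard definitions they presumably correspond to are, for cut-off p and graded relevance rel_i at rank i:

```latex
% Standard DCG at cut-off p (Jarvelin & Kekalainen):
\mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}

% Alternative formulation emphasising highly relevant documents:
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2 (i + 1)}
```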
Discounted Cumulative Gain
• Neither CG, nor DCG can be used for comparison
across topics!
depends on the # relevant documents per topic
51
Normalised Discounted Cumulative Gain
 Compute CG / DCG for the optimal return set
Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..)
has the Ideal Discounted Cumulative Gain: IDCG
 Normalise:
NDCG(n) = DCG(n) / IDCG(n)
52
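A minimal Python sketch of the normalisation step, using the log-discounted gain definition from the previous slides and the example gain vectors shown on this and the next slide (function names are illustrative):

```python
import math

def dcg(gains, n):
    """DCG at cut-off n; gains[i-1] is the graded relevance at rank i.
    Discount is 1/log2(rank); ranks 1 and 2 are not discounted (log2(2) = 1)."""
    return sum(g / max(1.0, math.log2(i)) for i, g in enumerate(gains[:n], start=1))

def ndcg(gains, judged_gains, n):
    """Normalise by the DCG of the ideal (descending) ordering of the judged gains."""
    idcg = dcg(sorted(judged_gains, reverse=True), n)
    return dcg(gains, n) / idcg if idcg > 0 else 0.0

run_gains = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1, 4]              # "our rank" from the next slide
judged = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1]  # graded judgments for the topic
print(ndcg(run_gains, judged, n=10))
```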
some more variations
Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..)
has the Ideal Discounted Cumulative Gain: IDCG
“our rank”: (5,2,0,0,5,2,4,0,0,1,4,…)
 two ranked lists
– rank correlation measures
 Kendall tau (similarity of orderings)
 Pearson r (linear correlation between variables)
 Spearman rho (Pearson applied to ranks)
53
some more variations
 rank biased precision (RBP)
– “log-based discount is not a good model of users’ behaviour”
– imagine the probability p of the user moving on to the next document
RBP(n) = (1 - p) × Σ_{i=1..n} rel(i) × p^(i-1)
p ~ 0.95 (persistent user) … p ~ 0.0 (user looks only at the very top of the ranking)
54
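A sketch of RBP following the formula above (rel values may be binary or graded in [0,1]; p is the probability that the user continues to the next result):

```python
def rbp(rel, p=0.95):
    """Rank-biased precision: (1 - p) * sum_i rel(i) * p^(i-1)."""
    return (1.0 - p) * sum(r * p ** (i - 1) for i, r in enumerate(rel, start=1))

ranked_rel = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(rbp(ranked_rel, p=0.95))  # patient user: deep ranks still contribute
print(rbp(ranked_rel, p=0.5))   # impatient user: dominated by the top ranks
```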
Time-based calibration
 Assumption
– The objective of the search engine is to improve the efficiency of an
information seeking task
 Extend nDCG to replace discount with a time-based
function
(Smucker and Clarke:2011)
[Formula annotations: a normalization term, the gain, and a decay factor expressed as a function of the time needed to reach item k in the ranked list]
55
The water filling model (Luo et al, 2013)
 and the corresponding Cube
Test (CT)
 also for professional search
– to capture embedded subtopics
 no assumption of linear
traversal of documents
– takes into account time
 potential cap on the amount of
information taken into account
 high discriminative power
56
Other diversity metrics
 several aspects of the topic might [need to] be covered
– Aspectual recall/precision
 discount may take into account previously seen aspects
– α-NDCG = NDCG where
rel(i) = J(di,k)(1-a)
rk,i-1
k=1
m
å
rk,i-1 = J(dj,k)
j=1
i-1
å J(dj,k) =
1 dj relevant to nk
0 otherwise
ì
í
ï
îï
57
Other measures
• There are many IR measures!
• trec_eval is a little program that computes many of them
– 37 in v9.0, many of which are multi-point (e.g. Precision @10,
@20…)
• http://trec.nist.gov/trec_eval/
• “there is a measure to make anyone a winner”
– Not really true, but still…
58
Other measures
• How about correlations between measures?
• Kendall tau values
• From Voorhees and Harman,2004
• Overall they correlate
           P(30)  R-Prec  MAP   .5 prec  R(1,1000)  Rel Ret  MRR
P(10)      0.88   0.81    0.79  0.78     0.78       0.77     0.77
P(30)             0.87    0.84  0.82     0.80       0.79     0.72
R-Prec                    0.93  0.87     0.83       0.83     0.67
MAP                             0.88     0.85       0.85     0.64
.5 prec                                  0.77       0.78     0.63
R(1,1000)                                           0.92     0.67
Rel ret                                                      0.66
59
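Correlations of this kind can be reproduced with a standard rank-correlation routine; a sketch using scipy (the scores below are made up for illustration, they are not the TREC data behind the table):

```python
from scipy.stats import kendalltau

# Scores of the same six runs under two different measures (illustrative numbers)
map_scores = [0.31, 0.28, 0.25, 0.22, 0.19, 0.15]
p10_scores = [0.52, 0.55, 0.47, 0.40, 0.33, 0.30]

tau, p_value = kendalltau(map_scores, p10_scores)
print(tau)  # close to 1.0 when the two measures rank the runs similarly
```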
Topic sets
 Topic selection
– In early TREC candidates rejected if ambiguous
 Are all topics equal?
– Mean Average Precision uses arithmetic mean
– Classical Test Theory experiments (Bodoff and Li,2007) identified
outliers that could change the rankings
MAP: a change in AP from 0.05 to 0.1 has the same effect as a
change from 0.25 to 0.3
GMAP: a change in AP from 0.05 to 0.1 has the same effect as a
change from 0.25 to 0.5
60
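The contrast between the two lines above comes from the arithmetic vs. geometric mean; a minimal sketch (the small epsilon guarding log(0) is an assumption of this sketch, mirroring common practice):

```python
import math

def gmap(ap_values, eps=1e-5):
    """Geometric mean of per-topic AP; emphasises gains on poorly served topics."""
    return math.exp(sum(math.log(ap + eps) for ap in ap_values) / len(ap_values))

aps = [0.05, 0.25, 0.40, 0.60]
print(sum(aps) / len(aps))  # MAP  (arithmetic mean)
print(gmap(aps))            # GMAP (geometric mean, pulled down by the 0.05 topic)
```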
Measure measures
 What is the best measure?
– What makes a measure better?
 Match to task
– E.g.
 Known item search: MRR
 Something more quantitative?
– Correlations between measures
 Does the system ranking change when using different measures
 Useful to group measures
– Ability to distinguish between runs
– Measure stability
61
Ad-hoc quiz
 It was necessary to normalize the discounted cumulative
gain (NDCG) because…
 of the assumption for normal probability distribution
 to be able to compare across topics
 normalization is always better
 to be able to average across topics
62
Ad-hoc quiz
 It was necessary to normalize the discounted cumulative
gain (NDCG) because…
 of the assumption for normal probability distribution
 to be able to compare across topics
 normalization is always better
 to be able to average across topics
63
Measure stability
 Success criteria:
– A measure is good if it is able to predict differences between
systems (on the average of future queries)
 Method
– Split collection in 2
1. Use as train collection to rank runs
2. Use as test collection to compute how many pair-wise
comparisons hold
 Observations
– Cut-off measures less stable than MAP
64
Measure stability
 Success criteria:
– A measure is good if it is able to predict differences between
systems (on the average of future queries)
 Method
– Split collection in 2
1. Use as train collection to rank runs
2. Use as test collection to compute how many pair-wise
comparisons hold
 Observations
– Cut-off measures less stable than MAP
Any other criteria for measure
quality?
65
Measure measures
 started with opinions from ’60s, seen some measures –
have the targets changed?
 7 numeric properties of effectiveness metrics (Moffat 2013)
66
7 properties of effectiveness metrics
 Boundedness – the set of scores attainable by the metric is bounded,
usually in [0,1]
 Monotonicity – if a ranking of length k is extended so that k+1 elements
are included, the score never decreases
 Convergence – if a document outside the top k is swapped with a less
relevant document inside the top k, the score strictly increases
 Top-weightedness – if a document within the top k is swapped with a
less relevant one higher in the ranking, the score strictly increases
 Localization – a score at depth k can be computed based solely on
knowledge of the documents that appear in the top k
 Completeness – a score can be calculated even if the query has no
relevant documents
 Realizability – provided that the collection has at least one relevant
document, it is possible for the score at depth k to be maximal.
68
So far
 introduction
 metrics
we are now able to say
“System A is better than System B”
or are we?
Remember
- we only have limited data
- potential future applications unbounded
a very strong
statement!
69
Statistical validity
 Whatever evaluation metric used, all experiments must be
statistically valid
– i.e. differences must not be the result of chance
[Bar chart of per-system MAP scores, roughly between 0 and 0.2]
70
Statistical validity
• Ingredients of a significance test
– A test statistic (e.g. the differences between AP values)
– A null hypothesis (e.g. “there is no difference between the two
systems”)
 This gives us a particular distribution of the test statistic
– An alternative hypothesis (one or two-tailed tests)
 don’t change it after the test
– A significance level computed by taking the actual value of the test
statistic and determining how likely it is to see this value given the
distribution implied by the null hypothesis
• P-value
• If the p-value is low, we can feel confident that we can reject
the null hypothesis → the systems are different
71
Statistical validity
 Common practice is to declare systems different when the
p-value <= 0.05
 A few tests
– Randomization tests
 Wilcoxon Signed Rank test
 Sign test
– Bootstrap test
– Student’s Paired t-test
 See recent discussion in SIGIR Forum
– T. Sakai - Statistical Reform in Information Retrieval?
 effect sizes
 confidence intervals
72
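As an illustration of the first family listed above, a minimal paired randomization (permutation) test on per-topic score differences (a sketch under the usual sign-flipping scheme; the scores are made up, not from the slides):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on the mean per-topic difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    count = 0
    for _ in range(trials):
        # under the null hypothesis the sign of each per-topic difference is arbitrary
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            count += 1
    return count / trials  # estimated p-value

a = [0.30, 0.42, 0.25, 0.51, 0.19, 0.33, 0.40, 0.28]
b = [0.25, 0.38, 0.26, 0.40, 0.15, 0.30, 0.35, 0.22]
print(randomization_test(a, b))
```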
Statistical validity
 How do we increase the statistical validity of an
experiment?
 By increasing the number of topics
– The more topics, the more confident we are that the difference
between average scores will be significant
 What’s the minimum number of topics?
42
• Depends, but
• TREC started with 50
• Below 25 is generally considered
not significant
73
Example Experimental Results
B- A = 21.4
74
t-Test
 Assumption is that the difference between the effectiveness
values is a sample from a normal distribution
 Null hypothesis is that the mean of the distribution of
differences is zero
 Test statistic
– for the example,
75
t-Test: t = 2.33
76
t-Test: t = 2.33
77
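The paired t-test itself is a single call in scipy; the test statistic is the mean per-topic difference divided by its standard error. The per-topic AP values below are illustrative only (the slides' own example yields t = 2.33):

```python
from scipy.stats import ttest_rel

# Per-topic AP for systems A and B (made-up values, not the slide's data)
ap_a = [0.30, 0.42, 0.25, 0.51, 0.19, 0.33, 0.40, 0.28, 0.37, 0.22]
ap_b = [0.25, 0.38, 0.26, 0.40, 0.15, 0.30, 0.35, 0.22, 0.31, 0.20]

# t = mean(diff) / (std(diff) / sqrt(n)), computed here by scipy
t_statistic, p_value = ttest_rel(ap_a, ap_b)
print(t_statistic, p_value)  # reject H0 ("no difference") if p_value <= 0.05
```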
Statistical Validity - example
78
79
80
81
82
Summary
 so far
– introduction
– metrics
 next
– where to get ground truth
 some more metrics
– discussion
83
Retrieval Effectiveness
• Not quite done yet…
– When to stop retrieving?
• Both P and R imply a cut-off value
– How about graded relevance
• Some documents may be more relevant to the question than
others
– How about ranking?
• A document retrieved at position 1,234,567 can still be
considered useful?
– Who says which documents are relevant and which not?
84
Relevance assessments
• Ideally
– Sit down and look at all documents
• Practically
– The ClueWeb09 collection has
• 1,040,809,705 web pages, in 10 languages
• 5 TB, compressed. (25 TB, uncompressed.)
– No way to do this exhaustively
– Look only at the set of returned documents
• Assumption: if there are enough systems being tested and not
one of them returned a document – the document is not relevant
85
Relevance assessments - Pooling
 Combine the results retrieved by all systems
 Choose a parameter k (typically 100)
 Choose the top k documents as ranked in each submitted
run
 The pool is the union of these sets of docs
– Between k and (# submitted runs) × k documents in pool
– (k+1)st document returned in one run either irrelevant or ranked
higher in another run
 Give pool to judges for relevance assessments
86
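A sketch of the pooling procedure (runs are ranked lists of document ids; k is the pool depth; names are illustrative):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents of every submitted run, to be judged by assessors."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])
    return pool

runs = [
    ["d3", "d1", "d7", "d9", "d2"],
    ["d1", "d4", "d3", "d8", "d6"],
    ["d5", "d1", "d2", "d7", "d4"],
]
print(sorted(build_pool(runs, k=3)))  # between k and (#runs * k) documents
```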
From Donna Harman
87
Relevance assessments - Pooling
 Conditions under which pooling works [Robertson]
– Range of different kinds of systems, including manual systems
– Reasonably deep pools (100+ from each system)
 But depends on collection size
– The collections cannot be too big.
 Big is so relative…
88
Relevance assessments - Pooling
 Advantage of pooling:
– Fewer documents must be manually assessed for relevance
 Disadvantages of pooling:
– Can’t be certain that all documents satisfying the query are found
(recall values may not be accurate)
– Runs that did not participate in the pooling may be disadvantaged
– If only one run finds certain relevant documents, but ranked lower
than 100, it will not get credit for these.
89
Relevance assessments
 Pooling with randomized sampling
 As the data collection grows, the top 100 may not be
representative of the entire result set
– (i.e. the assumption that everything after is not relevant does not
hold anymore)
 Add, to the pool, a set of documents randomly sampled
from the entire retrieved set
– If the sampling is uniform → easy to reason about, but may be too
sparse as the collection grows
– Stratified sampling: get more from the top of the ranked list
[Yilmaz et al.:2008]
90
Relevance assessments - incomplete
• The unavoidable conclusion is that we have to handle
incomplete relevance assessments
– Consider unjudged = non relevant
– Do not consider unjudged at all (i.e. compress the ranked lists)
• A new measure:
– BPref (binary preference)
 r = a relevant returned document
 R = # documents judged relevant
 N = # documents judged non-relevant
 n = a non-relevant document
BPref = (1/R) · Σ_r ( 1 - |{n : rank(n) < rank(r)}| / min(R, N) )
(each relevant retrieved document r is penalised by the fraction of judged non-relevant documents ranked above it)
91
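A sketch of BPref following the commonly used definition above (ranked is the run's result list; judgments maps judged document ids to True/False; unjudged documents are simply skipped; names are illustrative):

```python
def bpref(ranked, judgments):
    """Binary preference: penalise relevant docs ranked below judged non-relevant ones."""
    R = sum(1 for rel in judgments.values() if rel)
    N = sum(1 for rel in judgments.values() if not rel)
    if R == 0:
        return 0.0
    score, nonrel_seen = 0.0, 0
    for doc in ranked:
        if doc not in judgments:
            continue                  # unjudged documents are ignored
        if judgments[doc]:
            penalty = min(nonrel_seen, R) / min(R, N) if N else 0.0
            score += 1.0 - penalty
        else:
            nonrel_seen += 1
    return score / R

judg = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True, "d6": False}
print(bpref(["d1", "d2", "d7", "d3", "d4", "d5"], judg))  # 0.667
```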
Relevance assessments - incomplete
• BPref was designed to mimic MAP
• soon after, induced AP and inferred AP were proposed
• if data complete – equal to MAP
indAP = (1/R) · Σ_r ( 1 - |{n : rank(n) < rank(r)}| / rank(r) )

infAP(k) = (1/R) · Σ_r [ 1/k + ((k-1)/k) · ( |d100| / (k-1) ) · ( (rel + ε) / (rel + nonrel + ε) ) ]

(the bracketed term is the expectation of precision at rank k, with k = rank(r); d100 denotes the depth-100 pooled documents ranked above r, and rel / nonrel the judged relevant / non-relevant documents among them)
92
 not only are we incomplete, but we might also be
inconsistent in our judgments
93
Relevance assessment - subjectivity
 In TREC-CHEM’09 we had each topic evaluated by two
students
– “conflicts” ranged between 2% and 33% (excluding a topic with 60%
conflict)
– This all increased if we considered “strict disagreement”
 In general, inter-evaluator agreement is rarely above 80%
 There is little one can do about it
94
Relevance assessment - subjectivity
 Good news:
– “idiosyncratic nature of relevance judgments does not affect
comparative results” (E. Voorhees)
– Mean Kendall Tau between system rankings produced from
different query relevance sets: 0.938
– Similar results held for:
 Different query sets
 Different evaluation measures
 Different assessor types
 Single opinion vs. group opinion judgments
95
No assessors
 Pooling assumes all relevant documents found by systems
– Take this assumption further
 Voting-based relevance assessments
– Consider top K only
Soboroff et al:2001
96
Test Collections
 Generally created as the result of an evaluation campaign
– TREC – Text Retrieval Conference (USA)
– CLEF – Cross Language Evaluation Forum (EU)
– NTCIR - NII Test Collection for IR Systems (JP)
– INEX – Initiative for evaluation of XML Retrieval
– …
 First one and paradigm definer:
– The Cranfield Collection
 In the 1950s
 Aeronautics
 1400 queries, about 6000 documents
 Fully evaluated
97
TREC
 Started in 1992
 Always organised in the States, on the NIST campus
 As leader, introduced most of the jargon used in IR
Evaluation:
– Topic = query / request for information
– Run = a ranked list of results
– Qrel = relevance judgements
98
TREC
 Organised as a set of tracks that focus on a particular sub-
problem of IR
– E.g.
 Patient records, Session, Chemical, Genome, Legal, Blog,
Spam,Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech,
OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million
Query, Ad-Hoc, Robust
– Set of tracks in a year depends on
 Interest of participants
 Fit to TREC
 Needs of sponsors
 Resource constraints
99
TREC
Call for participation → Task definition → Document procurement → Topic definition → IR experiments → Results → Results analysis → TREC conference → Proceedings publication
100
TREC – Task definition
 Each Track has a set of Tasks:
 Examples of tasks from the Blog track:
– 1. Finding blog posts that contain opinions about the topic
– 2. Ranking positive and negative blog posts
– 3. (A separate baseline task to just find blog posts relevant to the
topic)
– 4. Finding blogs that have a principal, recurring interest in the
topic
101
TREC - Topics
 For TREC, topics generally have a specific format (not
always though)
– <ID>
– <title>
 Very short
– <description>
 A brief statement of what would be a relevant document
– <narrative>
 A long description, meant also for the evaluator to understand
how to judge the topic
102
TREC - Topics
 Example:
– <ID>
 312
– <title>
 Hydroponics
– <description>
 Document will discuss the science of growing plants in water or
some substance other than soil
– <narrative>
 A relevant document will contain specific information on the
necessary nutrients, experiments, types of substrates, and/or
any other pertinent facts related to the science of hydroponics.
Related information includes, but is not limited to, the history
of hydro- …
103
CLEF
 Cross Language Evaluation Forum
– From 2010: Conference on Multilingual and Multimodal
Information Access Evaluation
– Supported by the PROMISE Network of Excellence
 Started in 2000
 Grand challenge:
– Fully multilingual, multimodal IR systems
 Capable of processing a query in any medium and any
language
 Finding relevant information from a multilingual multimedia
collection
 And presenting it in the style most likely to be useful for the
user
104
CLEF
• Previous tracks:
• Mono-, bi- multilingual text retrieval
• Interactive cross language retrieval
• Cross language spoken document retrieval
• QA in multiple languages
• Cross language retrieval in image collections
• CL geographical retrieval
• CL Video retrieval
• Multilingual information filtering
• Intellectual property
• Log file analysis
• Large scale grid experiments
• From 2010
– Organised as a series of “labs”
105
MediaEval
 dedicated to evaluating new algorithms for multimedia
access and retrieval.
 emphasizes the 'multi' in multimedia
 focuses on human and social aspects of multimedia tasks
– speech recognition, multimedia content analysis, music and audio
analysis, user-contributed information (tags, tweets), viewer
affective response, social networks, temporal and geo-
coordinates.
http://www.multimediaeval.org/
106
Test collections - summary
 it is important to design the right experiment for the right
IR task
– Web retrieval is very different from legal retrieval
 The example of Patent retrieval
– High Recall: a single missed document can invalidate a patent
– Session based: single searches may involve days of cycles of results
review and query reformulation
– Defendable: Process and results may need to be defended in court
107
Outline
 Introduction
 Kinds of evaluation
 Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
 User-based evaluation
 Discussion on Evaluation
 Conclusion
108
User-based evaluation
 Different levels of user involvement
– Based on subjectivity levels
1. Relevant/non-relevant assessments
 Used largely in lab-like evaluation as described before
2. User satisfaction evaluation
 Some work on 1., very little on 2.
– User satisfaction is very subjective
 UIs play a major role
 Search dissatisfaction can be a result of the non-existence of
relevant documents
109
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
110
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
111
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
112
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
Relative judgements of documents
“Is document X more relevant than document Y for the
given query?”
- Many more assessments needed
- Better inter-annotator agreement [Rees and Schultz,
1967]
113
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
114
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006
115
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
 Some issues, alternatives
– Control for all sorts of user-based biases
116
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on lists of results
– Focus the user on query-document-document
 Some issues, alternatives
– Control for all sorts of user-based biases
Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007
117
User-based evaluation
 User-based relevance assessments
– Focus the user on each query-document pair
– Focus the user on query-document-document
– Focus the user on lists of results
 Some issues, alternatives
– Control for all sorts of user-based biases
– Two-panel evaluation
– limits the number of systems which can be evaluated
– Is unusable in real-life contexts
– Interspersed ranked list with click monitoring
118
Effectiveness evaluation
lab-like vs. user-focused
 Results are mixed: some experiments show correlations,
some not
 Do user preferences and Evaluation Measures Line up?
SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas
– shows the existence of correlations
 User preferences are inherently user-dependent
 Domain specific IR will be different
 The relationship between IR effectiveness measures and
user satisfaction, SIGIR 2007, Al-Maskari, Sanderson,
Clough
– strong correlation between user satisfaction and DCG, which
disappeared when normalized to NDCG.
119
Predicting performance
Future data and queries
 not absolute, but relative performance
– ad-hoc evaluations suffer in particular
– no comparison between lab and operational settings
 for justified reasons, but still none
– how much better must a system be?
 generally, require statistical significance
[Trippe:2011]
120
Predictive performance
 Future systems
 Test collections are often used to prove we have a better
system than the state of the art
– not all documents were evaluated
121
Predictive performance
 Future systems
 Test collections are often used to prove we have a better
system than the state of the art
– not all documents were evaluated
– “retrofit” metrics that are not considered resilient to such evolution
 RBP [Webber:2009]
 Precision@n [Lipani:2014], Recall@n […]
122
Why do this?
- Precision@n and Recall@n are loved in industry
- Also in industry, technology migration steps are high (i.e. hold on to a
system that ‘works’ until it is patently obvious it affects business
performance)
Are Lab evals sufficient?
 Patent search is an active process where the end-user
engages in a process of understanding and interacting with
the information
 evaluation needs a definition of success
– success ~ lower risk
 partly precision and recall
 partly (some argue the most important part) the intellectual and
interactive role of the patent search system as a whole
 series of evaluation layers
– lab evals are now the lowest level
– to elevate them, they must measure risk and incentivize systems to
provide estimates of confidence in the results they provide
[Trippe:2011]
123
Outline
 Introduction
 Kinds of evaluation
 Retrieval Effectiveness evaluation
– Measures, Experimentation
– Test Collections
 User-based evaluation
 Discussion on Evaluation
 Conclusion
124
Discussion on evaluation
 Laboratory evaluation – good or bad?
– Rigorous testing
– Over-constrained
 I usually make the comparison to a tennis
racket:
– No evaluation of the device will tell you how well it
will perform in real life – that largely depends on the
user
– But the user will choose the device based on the lab
evaluation
125
Discussion on evaluation
 There is bias to account for
– E.g. number of relevant documents per topic
126
Discussion on evaluation
 Recall and recall-related measures are often contested
 [cooper:73,p95]
– “The involvement of unexamined documents in a performance
formula has long been taken for granted as a perfectly natural thing,
but if one stops to ponder the situation, it begins to appear most
peculiar. … Surely a document which the system user has not been
shown in any form, to which he has not devoted the slightest particle
of time or attention during his use of the system output, and of
whose very existence he is unaware, does that user neither harm
nor good in his search”
 Clearly not true in the legal & patent domains
127
Discussion on Evaluation
 Realistic tasks and user models
– Evaluation has to be based on the available data sets.
 This creates the user model
 Tasks need to correspond to available techniques
 Much literature on generating tasks
– Experts describe typical tasks
– Use of log files of various sorts
 IR research is decades behind sociology in terms of user
modeling – there is much to learn from there
128
Discussion on Evaluation
 Competitiveness
– Most campaigns take pains in explaining “This is not a competition –
this is an evaluation”
 Competitions are stimulating, but
– Participants wary of participating if they are not sure to win
 Particularly commercial vendors
– Without special care from organizers, it stifles creativity:
 Best way to win is to take last year’s method and improve a bit
 Original approaches are risky
129
Discussion on Evaluation
 Topical Relevance
 What other kinds of relevance factors are there?
– diversity of information
– quality
– credibility
– ease of reading
130
Conclusion
• IR Evaluation is a research field in itself
• Without evaluation, research is pointless
– IR Evaluation research included
• → statistical significance testing is a must to validate results
• Most IR Evaluation exercises are laboratory experiments
– As such, care must be taken to match, to the extent possible, real
needs of the users
• Experiments in the wild are rare, small and domain specific:
– VideOlympics (2007-2009)
– PatOlympics (2010-2012)
131
Bibliography
 Test Collection Based Evaluation of Information Retrieval Systems
– M. Sanderson 2010
 TREC – Experiment and Evaluation in Information Retrieval
– E. Voorhees, D. Harman (eds.)
 On the history of evaluation in IR
– S. Robertson, 2008, Journal of Information Science
 A Comparison of Statistical Significance Tests for Information Retrieval
Evaluation
– M. Smucker, J. Allan, B. Carterette (CIKM’07)
 A Simple and Efficient Sampling Method for Estimating AP and NDCG
– E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR’08)
132
Bibliography
 Do User Preferences and Evaluation Measures Line Up?, M. Sanderson and M. L. Paramita and P. Clough and E.
Kanoulas 2010
 A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari and M. Sanderson 2010
 Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking and T.
Tang and R. Sankaranarayana and K. Griffiths and N. Craswell and P. Bailey 2007
 Evaluating Sampling Methods for Uncooperative Collections, P. Thomas and D. Hawking 2007
 Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski and N. Craswell 2010
 Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski and P. Bennett and B. Carterette
and T. Joachims 2009
 Does Brandname influence perceived search result quality? Yahoo!, Google, and WebKumara, P. Bailey and P.
Thomas and D. Hawking 2007
 Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly 2009
 C-TEST: Supporting Novelty and Diversity in TestFiles for Search Tuning, D. Hawking and T. Rowlands and P.
Thomas 2009
 Live Web Search Experiments for the Rest of Us, T. Jones and D. Hawking and R. Sankaranarayana 2010
 Quality and relevance of domain-specific search: A case study in mental health, T. Tang and N. Craswell and D.
Hawking and K. Griffiths and H. Christensen 2006
 New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking and P. Thomas and T.
Gedeon and T. Jones and T. Rowlands 2006
 A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M.
Rees and D. G. Schultz, Final Report to the National Science Foundation. Volume II, Appendices. Clearing- house
for Federal Scientific and Technical Information, October 1967
 The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search , J. Luo, C. Wing,
H. Yang and M. Hearst, CIKM 2013
 On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management
 Lipani A, Lupu M, Hanbury A, Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, SIGIR 2015
 W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009
133
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Último (20)

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

Information Retrieval Evaluation

  • 1. IR Evaluation Mihai Lupu lupu@ifs.tuwien.ac.at Chapter 8 of the Introduction to IR book M. Sanderson. Test Collection Based Evaluation of Information Retrieval Systems Foundations and Trends in IR, 2010 1
  • 2. Outline  Introduction – Introduction to IR  Kinds of evaluation  Retrieval Effectiveness evaluation – Measures, Experimentation – Test Collections  User-based evaluation  Discussion on Evaluation  Conclusion 2
  • 3. Introduction • Why? – Put a figure on the benefit we get from a system – Because without evaluation, there is no research 3 Objective measurements
  • 4. Information Retrieval  “Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)  General definition that can be applied to many types of information and search applications  Primary focus of IR since the 50s has been on text and documents
  • 7. Information Retrieval  Key insights of/for information retrieval – text has no meaning  ฉันมีรถสีแดง – but it is still the most informative source  ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า – text is not random  I drive a red car is more probable than – I drive a red horse – A red car I drive – Car red a drive I – meaning is defined by usage  I drive a truck / I drive a car / I drive the bus  truck / car / bus are similar in meaning
  • 8. Information Retrieval  Key insights of/for information retrieval – text has no meaning  ฉันมีรถสีแดง – but it is still the most informative source  ฉันมีรถสีฟ้า is more similar to the above than คุณมีรถไฟฟ้า – text is not random  I drive a red car is more probable than – I drive a red horse – A red car I drive – Car red a drive I – meaning is defined by usage  I drive a truck / I drive a car / I drive the bus  truck / car / bus are similar in meaning term frequency (TF), document frequency (DF) TF-IDF, BM25 (Best match 25) language models (uni-gram, bi-gram, n-gram) statistical semantics (latent semantic analysis, random indexing, deep learning)
  • 9. Big Issues in IR  Relevance – What is it? – Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine – Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style – Topical relevance (same topic) vs. user relevance (everything else)
  • 10.  Relevance – Retrieval models define a view of relevance – Ranking algorithms used in search engines are based on retrieval models – Most models describe statistical properties of text rather than linguistic  i.e. counting simple text features such as words instead of parsing and analyzing the sentences  Statistical approach to text processing started with Luhn in the 50s  Linguistic features can be part of a statistical model Big Issues in IR
  • 11. Big Issues in IR  Evaluation – Experimental procedures and measures for comparing system output with user expectations  Originated in Cranfield experiments in the 60s – IR evaluation methods now used in many fields – Typically use test collection of documents, queries, and relevance judgments  Most commonly used are TREC collections – Recall and precision are two examples of effectiveness measures
  • 12. Big Issues in IR  Users and Information Needs – Search evaluation is user-centered – Keyword queries are often poor descriptions of actual information needs – Interaction and context are important for understanding user intent – Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
  • 13. Introduction • Why? – Put a figure on the benefit we get from a system – Because without evaluation, there is no research • Why is this a research field in itself? – Because there are many kinds of IR • With different evaluation criteria – Because it’s difficult • Why? – Because it involves human subjectivity (document relevance) – Because of the amount of data involved (who can sit down and evaluate 1,750,000 documents returned by Google for ‘university vienna’?) 13
  • 15. Kinds of evaluation • “Efficient and effective system” • Time and space: efficiency – Generally constrained by pre-development specification • E.g. real-time answers vs. batch jobs • E.g. index-size constraints – Easy to measure • Good results: effectiveness – Harder to define --> more research into it • And… 15
  • 16. Kinds of evaluation (cont.) • User studies – Does a 2% increase in some retrieval performance measure actually make a user happier? – Does displaying a text snippet improve usability even if the underlying method is 10% weaker than some other method? – Hard to do – Mostly anecdotal examples – Many IR people don’t like to do it (though it’s starting to change) 16
  • 17. Kinds of evaluation (cont.)  Intrinsic – “internal” – ultimate goal is the retrieved set  Extrinsic – “external” – in the context of the usage of the retrieval tool 17
  • 18. What to measure in an IR system? 1966, Cleverdon: 1. coverage – the extent to which relevant matter exists in the system 2. time lag ~ efficiency 3. presentation 4. effort on the part of the user to answer his information need 5. recall 6. precision 18
  • 19. What to measure in an IR system? 1966, Cleverdon: 1. coverage – the extent to which relevant matter exists in the system 2. time lag ~ efficiency 3. presentation 4. effort on the part of the user to answer his information need 5. recall 6. precision Effectiveness 19 A desirable measure of retrieval performance would have the following properties: 1, it would be a measure of effectiveness. 2, it would not be confounded by the relative willingness of the system to emit items. 3, it would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers 4, it would allow complete ordering of different performances, and assess the performance of any one system in absolute terms. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967)
  • 20. Outline • Introduction • Kinds of evaluation • Retrieval Effectiveness evaluation – Measures – Test Collections  User-based evaluation • Discussion on Evaluation • Conclusion 20
  • 22. Retrieval Effectiveness  Precision – How happy are we with what we’ve got  Recall – How much more we could have had. Precision = (number of relevant documents retrieved) / (number of documents retrieved); Recall = (number of relevant documents retrieved) / (number of relevant documents) 22
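A minimal sketch (not part of the slides) of these two set-based definitions in Python; `retrieved` and `relevant` are hypothetical lists of document ids for a single query.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                 # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant; 6 documents are relevant overall
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       ["d1", "d3", "d5", "d7", "d8", "d9"]))   # (0.6, 0.5)
```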
  • 25. Retrieval effectiveness  What if we don’t like this twin-measure approach?  A solution: – Van Rijsbergen’s E-Measure: E = 1 - 1 / (α×(1/precision) + (1-α)×(1/recall)) – With a special case, the harmonic mean: F = 2×precision×recall / (precision + recall) 25
  • 26. Retrieval effectiveness  What if we don’t like this twin-measure approach?  A solution: – Van Rijsbergen’s E-Measure: E = 1 - 1 / (α×(1/precision) + (1-α)×(1/recall)) – With a special case, the harmonic mean: F = 2×precision×recall / (precision + recall) 26
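To make the “special case” concrete, a small illustrative computation (values assumed for illustration; the same 0.1/0.9 example appears in the editor’s notes at the end of this page): the harmonic mean F punishes an imbalanced precision/recall pair far more than the arithmetic mean, and equals 1 - E when α = 0.5.

```python
def e_measure(p, r, alpha=0.5):
    """Van Rijsbergen's E = 1 - 1 / (alpha*(1/p) + (1-alpha)*(1/r))."""
    return 1.0 - 1.0 / (alpha / p + (1.0 - alpha) / r)

def f_measure(p, r):
    """Harmonic mean of precision and recall; equals 1 - E when alpha = 0.5."""
    return 2.0 * p * r / (p + r)

p, r = 0.1, 0.9
print((p + r) / 2)           # arithmetic mean: 0.5 -- looks deceptively good
print(f_measure(p, r))       # harmonic mean:   0.18
print(1 - e_measure(p, r))   # same value, obtained via the E-measure
```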
  • 27. Retrieval effectiveness  Tools we need: – A set of documents (the “dataset”) – A set of questions/queries/topics – For each topic, and for each document, a decision: relevant or not relevant  Let’s assume for the moment that’s all we need and that we have it 27
  • 28. Retrieval Effectiveness • Precision and Recall generally plotted as a “Precision-Recall curve” [figure: precision (0–1) vs. recall (0–1); precision drops as the size of the retrieved set increases] • They do not play well together 28
  • 29. Precision-Recall Curves  How to build a Precision-Recall Curve? – For one query at a time – Make checkpoints on the recall-axis [figure: precision vs. recall axes, both 0–1] 29
  • 30. Precision-Recall Curves  How to build a Precision-Recall Curve? – For one query at a time – Make checkpoints on the recall-axis [figure: precision vs. recall axes, both 0–1] 30
  • 31. Precision-Recall Curves • How to build a Precision-Recall Curve? – For one query at a time – Make checkpoints on the recall-axis – Repeat for all queries [figure: precision vs. recall axes, both 0–1] 31
  • 32. Precision-Recall Curves • And the average is the system’s P-R curve [figure: averaged precision-recall curve; precision falls as the number of retrieved documents increases] • We can compare systems by comparing the curves 32
  • 34. Interpolation  To average graphs, calculate precision at standard recall levels: P(R) = max{ P' : (R', P') ∈ S, R' ≥ R } – where S is the set of observed (R,P) points  Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level – produces a step function – defines precision at recall 0.0 34
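A sketch of this interpolation rule, assuming `observed` holds the (recall, precision) points of one query; the 11 standard recall levels 0.0–1.0 are used.

```python
def interpolate(observed, levels=None):
    """observed: list of (recall, precision) points for one query.
    Interpolated precision: P(R) = max{ P' : (R', P') observed with R' >= R }."""
    if levels is None:
        levels = [i / 10.0 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in observed if r >= level), default=0.0)
            for level in levels]

# Example: three relevant documents (R = 3) found at ranks 1, 4 and 10
observed = [(1/3, 1.0), (2/3, 0.5), (1.0, 0.3)]
print(interpolate(observed))   # [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.3, 0.3, 0.3, 0.3]
```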
  • 36. Average Precision at Standard Recall Levels • Recall-precision graph plotted by simply joining the average precision points at the standard recall levels 36
  • 38. Graph for 50 Queries 38
  • 39. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 39
  • 40. Single-value measures • Fix a “reasonable” cutoff – R-precision  Precision at R, where R is the number of relevant documents.  Fix the number of desired documents – Reciprocal rank (RR)  1/rank of first relevant document in the ranked list returned  Make it less sensitive to the cutoff • Average precision – For each query:  R= # relevant documents  i = rank  k = # retrieved documents  P(i) precision at rank i • rel(i)=1 if document at rank i relevant, 0 otherwise – For each system: • Compute the mean of these averages: Mean Average Precision (MAP) – one of the most used measures AP = (1/R) × Σ_{i=1..k} P(i)×rel(i) 40
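A compact sketch of these single-value measures for one ranked list of binary judgments; the example run mirrors the R-Precision slide that follows (relevant documents at ranks 1, 2, 4, 6 and 13, with R = 6).

```python
def average_precision(ranking, R):
    """ranking: 0/1 relevance flags in rank order; R: total relevant documents for the query."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / i                 # P(i) * rel(i)
    return total / R if R else 0.0

def reciprocal_rank(ranking):
    """1 / rank of the first relevant document (0 if none)."""
    return next((1.0 / i for i, rel in enumerate(ranking, start=1) if rel), 0.0)

def r_precision(ranking, R):
    """Precision at rank R."""
    return sum(ranking[:R]) / R if R else 0.0

# One query: relevant documents at ranks 1, 2, 4, 6 and 13; R = 6 relevant in the collection
run = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(average_precision(run, R=6))   # ~0.63
print(reciprocal_rank(run))          # 1.0
print(r_precision(run, R=6))         # 4/6 = 0.67, as on the next slide

# MAP is the arithmetic mean of AP over all queries of the topic set
queries = [(run, 6), ([0, 1, 0, 0, 1], 2)]
print(sum(average_precision(r, n) for r, n in queries) / len(queries))
```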
  • 41. R-Precision  Precision at the R-th position in the ranking of results for a query that has R relevant documents. Ranked list (n, doc, x = relevant): 1 588 x | 2 589 x | 3 576 | 4 590 x | 5 986 | 6 592 x | 7 984 | 8 988 | 9 578 | 10 985 | 11 103 | 12 591 | 13 772 x | 14 990. R = # of relevant docs = 6; 4 of the top R = 6 documents are relevant, so R-Precision = 4/6 = 0.67 41
  • 45. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 45
  • 46. Cumulative Gain • For each document d, and query q, define rel(d,q) >= 0 • The higher the value, the more relevant the document is to the query • Pitfalls: – Graded relevance introduces even more ambiguity in practice With great flexibility comes great responsibility to justify parameter values 46
  • 47. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 47
  • 48. Discounted Cumulative Gain  Popular measure for evaluating web search and related tasks  Two assumptions: – Highly relevant documents are more useful than marginally relevant documents – the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined 48
  • 49. Discounted Cumulative Gain  Uses graded relevance as a measure of the usefulness, or gain, from examining a document  Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks  Typical discount is 1/log (rank) – With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3 49
  • 50. Discounted Cumulative Gain  DCG is the total gain accumulated at a particular rank p: DCG(p) = rel(1) + Σ_{i=2..p} rel(i) / log2(i)  Alternative formulation: DCG(p) = Σ_{i=1..p} (2^rel(i) - 1) / log2(1 + i) – used by some web search companies – emphasis on retrieving highly relevant documents [Jarvelin:2000] [Burges:2005] 50
  • 51. Discounted Cumulative Gain • Neither CG, nor DCG can be used for comparison across topics! depends on the # relevant documents per topic 51
  • 52. Normalised Discounted Cumulative Gain  Compute CG / DCG for the optimal return set Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..) has the Ideal Discounted Cumulative Gain: IDCG  Normalise: NDCG(n) = DCG(n) / IDCG(n) 52
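A sketch of DCG and nDCG under the definitions above (1/log2(rank) discount, no discount at rank 1); the gain vectors reuse the graded-relevance examples from this and the next slide.

```python
import math

def dcg(gains, n):
    """DCG at depth n with the 1/log2(rank) discount (no discount at rank 1)."""
    return sum(g / (math.log2(i) if i > 1 else 1.0)
               for i, g in enumerate(gains[:n], start=1))

def ndcg(gains, judged_gains, n):
    """NDCG(n) = DCG(n) / IDCG(n), IDCG computed on the ideally ordered judged gains."""
    idcg = dcg(sorted(judged_gains, reverse=True), n)
    return dcg(gains, n) / idcg if idcg > 0 else 0.0

system = [5, 2, 0, 0, 5, 2, 4, 0, 0, 1]           # "our rank" from the next slide
judged = [5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1]  # all judged grades for the topic
print(dcg(system, 10), ndcg(system, judged, 10))
```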
  • 53. some more variations Eg: (5,5,5,4,4,3,3,3,3,2,2,2,1,1,1,1,1,1,0,0,0,0..) has the Ideal Discounted Cumulative Gain: IDCG “our rank”: (5,2,0,0,5,2,4,0,0,1,4,…)  two ranked lists – rank correlation measures  Kendall tau (similarity of orderings)  Pearson r (linear correlation between variables)  Spearman rho (Pearson computed on ranks) 53
  • 54. some more variations  rank biased precision (RBP) – “log-based discount is not a good model of users’ behaviour” – imagine the probability p of the user moving on to the next document: RBP(n) = (1-p) × Σ_{i=1..n} rel(i) × p^(i-1) (illustrated for p~0.95 and p~0.0) 54
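A sketch of RBP as defined above, with binary rel(i) for simplicity; the two calls contrast a patient user (p ≈ 0.95) with an impatient one.

```python
def rbp(rels, p=0.95):
    """Rank-biased precision: RBP = (1 - p) * sum_i rel(i) * p**(i-1)."""
    return (1.0 - p) * sum(rel * p ** (i - 1) for i, rel in enumerate(rels, start=1))

run = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(rbp(run, p=0.95))   # patient user: deep ranks still contribute
print(rbp(run, p=0.5))    # impatient user: dominated by the first few ranks
```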
  • 55. Time-based calibration  Assumption – The objective of the search engine is to improve the efficiency of an information seeking task  Extend nDCG to replace the discount with a time-based function (Smucker and Clarke:2011) [formula: a normalization term, a gain term, and a decay term, the latter a function of the time needed to reach item k in the ranked list] 55
  • 56. The water filling model (Luo et al, 2013)  and the corresponding Cube Test (CT)  also for professional search – to capture embedded subtopics  no assumption of linear traversal of documents – takes into account time  potential cap on the amount of information taken into account  high discriminative power 56
  • 57. Other diversity metrics  several aspects of the topic might [need to] be covered – Aspectual recall/precision  discount may take into account previously seen aspects – α-NDCG = NDCG where rel(i) = Σ_{k=1..m} J(d_i, k) × (1-α)^r_{k,i-1}, with r_{k,i-1} = Σ_{j=1..i-1} J(d_j, k) and J(d_j, k) = 1 if d_j is relevant to aspect n_k, 0 otherwise 57
  • 58. Other measures • There are many IR measures! • trec_eval is a little program that computes many of them – 37 in v9.0, many of which are multi-point (e.g. Precision @10, @20…) • http://trec.nist.gov/trec_eval/ • “there is a measure to make anyone a winner” – Not really true, but still… 58
  • 59. Other measures • How about correlations between measures? • Kendall tau values • From Voorhees and Harman, 2004 • Overall they correlate. Pairwise Kendall tau between measures:
  P(10): P(30) 0.88, R-Prec 0.81, MAP 0.79, .5 prec 0.78, R(1,1000) 0.78, Rel ret 0.77, MRR 0.77
  P(30): R-Prec 0.87, MAP 0.84, .5 prec 0.82, R(1,1000) 0.80, Rel ret 0.79, MRR 0.72
  R-Prec: MAP 0.93, .5 prec 0.87, R(1,1000) 0.83, Rel ret 0.83, MRR 0.67
  MAP: .5 prec 0.88, R(1,1000) 0.85, Rel ret 0.85, MRR 0.64
  .5 prec: R(1,1000) 0.77, Rel ret 0.78, MRR 0.63
  R(1,1000): Rel ret 0.92, MRR 0.67
  Rel ret: MRR 0.66
  59
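Such correlations are computed by scoring the same set of runs with two measures and correlating the induced system orderings; a sketch using scipy.stats.kendalltau with invented scores for six runs:

```python
from scipy.stats import kendalltau

# Hypothetical scores for six runs under two measures
map_scores = [0.31, 0.28, 0.25, 0.22, 0.19, 0.15]
p10_scores = [0.52, 0.55, 0.40, 0.43, 0.30, 0.28]

tau, p_value = kendalltau(map_scores, p10_scores)
print(tau, p_value)   # tau close to 1 means the two measures rank the runs similarly
```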
  • 60. Topic sets  Topic selection – In early TREC candidates rejected if ambiguous  Are all topics equal? – Mean Average Precision uses arithmetic mean – Classical Test Theory experiments (Bodoff and Li,2007) identified outliers that could change the rankings MAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.3 GMAP: a change in AP from 0.05 to 0.1 has the same effect as a change from 0.25 to 0.5 60
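A quick numeric check (AP values invented) of the MAP vs. GMAP claim above: the arithmetic mean reacts identically to any +0.05 change, while the geometric mean reacts identically to any doubling.

```python
import math

def mean_ap(aps):
    return sum(aps) / len(aps)

def gmap(aps, eps=1e-5):
    """Geometric mean of per-topic AP (a small epsilon guards against AP = 0)."""
    return math.exp(sum(math.log(a + eps) for a in aps) / len(aps))

base = [0.05, 0.25, 0.40]
print(mean_ap([0.10, 0.25, 0.40]) - mean_ap(base))   # +0.0167
print(mean_ap([0.05, 0.30, 0.40]) - mean_ap(base))   # +0.0167 -- same change for MAP
print(gmap([0.10, 0.25, 0.40]) / gmap(base))          # ~1.26
print(gmap([0.05, 0.50, 0.40]) / gmap(base))          # ~1.26 -- same ratio for GMAP
```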
  • 61. Measure measures  What is the best measure? – What makes a measure better?  Match to task – E.g.  Known item search: MRR  Something more quantitative? – Correlations between measures  Does the system ranking change when using different measures  Useful to group measures – Ability to distinguish between runs – Measure stability 61
  • 62. Ad-hoc quiz  It was necessary to normalize the discounted cumulative gain (NDCG) because…  of the assumption for normal probability distribution  to be able to compare across topics  normalization is always better  to be able to average across topics 62
  • 63. Ad-hoc quiz  It was necessary to normalize the discounted cumulative gain (NDCG) because…  of the assumption for normal probability distribution  to be able to compare across topics  normalization is always better  to be able to average across topics 63
  • 64. Measure stability  Success criteria: – A measure is good if it is able to predict differences between systems (on the average of future queries)  Method – Split collection in 2 1. Use as train collection to rank runs 2. Use as test collection to compute how many pair-wise comparisons hold  Observations – Cut-off measures less stable than MAP 64
  • 65. Measure stability  Success criteria: – A measure is good if it is able to predict differences between systems (on the average of future queries)  Method – Split collection in 2 1. Use as train collection to rank runs 2. Use as test collection to compute how many pair-wise comparisons hold  Observations – Cut-off measures less stable than MAP Any other criteria for measure quality? 65
  • 66. Measure measures  started with opinions from ’60s, seen some measures – have the targets changed?  7 numeric properties of effectiveness metrics (Moffat 2013) 66
  • 67. 7 properties of effectiveness metrics  Boundedness – the set of scores attainable by the metric is bounded, usually in [0,1]  Monotonicity – if a ranking of length k is extended so that k+1 elements are included, the score never decreases  Convergence – if a document outside the top k is swapped with a less relevant document inside the top k, the score strictly increases  Top-weightedness – if a document within the top k is swapped with a less relevant one higher in the ranking, the score strictly increases  Localization – a score at depth k can be computed based solely on knowledge of the documents that appear in top k  Completeness – a score can be calculated even if the query has no relevant documents  Realizability – provided that the collection has at least one relevant document, it is possible for the score at depth k to be maximal. 68
  • 68. So far  introduction  metrics we are now able to say “System A is better than System B” or are we? Remember - we only have limited data - potential future applications unbounded a very strong statement! 69
  • 69. Statistical validity  Whatever evaluation metric is used, all experiments must be statistically valid – i.e. differences must not be the result of chance [figure: bar chart of MAP scores for two systems, axis from 0 to 0.2] 70
  • 70. Statistical validity • Ingredients of a significance test – A test statistic (e.g. the differences between AP values) – A null hypothesis (e.g. “there is no difference between the two systems”)  This gives us a particular distribution of the test statistic – An alternative hypothesis (one- or two-tailed tests)  don’t change it after the test – A significance level computed by taking the actual value of the test statistic and determining how likely it is to see this value given the distribution implied by the null hypothesis • P-value • If the p-value is low, we can feel confident that we can reject the null hypothesis  the systems are different 71
  • 71. Statistical validity  Common practice is to declare systems different when the p-value <= 0.05  A few tests – Randomization tests  Wilcoxon Signed Rank test  Sign test – Bootstrap test – Student’s Paired t-test  See recent discussion in SIGIR Forum – T. Sakai - Statistical Reform in Information Retrieval?  effect sizes  confidence intervals 72
  • 72. Statistical validity  How do we increase the statistical validity of an experiment?  By increasing the number of topics – The more topics, the more confident we are that the difference between average scores will be significant  What’s the minimum number of topics? 42 • Depends, but • TREC started with 50 • Below 25 is generally considered not significant 73
  • 74. t-Test  Assumption is that the difference between the effectiveness values is a sample from a normal distribution  Null hypothesis is that the mean of the distribution of differences is zero  Test statistic: t = mean(d) / (s_d / √N), where d is the vector of per-topic differences, s_d its standard deviation and N the number of topics 75
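A sketch of how these tests are typically run on paired per-topic scores (synthetic AP values for two hypothetical systems `a` and `b`), using scipy for the paired t-test and Wilcoxon signed-rank test plus a simple sign-flip randomization test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-topic AP scores for two systems over the same 50 topics
a = rng.uniform(0.1, 0.5, size=50)
b = np.clip(a + rng.normal(0.02, 0.05, size=50), 0, 1)   # system B ~2 points better
d = b - a

print(stats.ttest_rel(b, a))    # Student's paired t-test
print(stats.wilcoxon(b, a))     # Wilcoxon signed-rank test

# Randomization (sign-flip) test on the mean per-topic difference
observed = d.mean()
flips = rng.choice([-1, 1], size=(10000, d.size))         # random label swaps
null = (flips * d).mean(axis=1)                           # null distribution
p_value = (np.abs(null) >= abs(observed)).mean()
print(p_value)
```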
  • 82. Summary  so far – introduction – metrics  next – where to get ground truth  some more metrics – discussion 83
  • 83. Retrieval Effectiveness • Not quite done yet… – When to stop retrieving? • Both P and R imply a cut-off value – How about graded relevance • Some documents may be more relevant to the question than others – How about ranking? • A document retrieved at position 1,234,567 can still be considered useful? – Who says which documents are relevant and which not? 84
  • 84. Relevance assessments • Ideally – Sit down and look at all documents • Practically – The ClueWeb09 collection has • 1,040,809,705 web pages, in 10 languages • 5 TB, compressed. (25 TB, uncompressed.) – No way to do this exhaustively – Look only at the set of returned documents • Assumption: if there are enough systems being tested and not one of them returned a document – the document is not relevant 85
  • 85. Relevance assessments - Pooling  Combine the results retrieved by all systems  Choose a parameter k (typically 100)  Choose the top k documents as ranked in each submitted run  The pool is the union of these sets of docs – Between k and (# submitted runs) × k documents in pool – (k+1)st document returned in one run either irrelevant or ranked higher in another run  Give pool to judges for relevance assessments 86
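A minimal sketch of depth-k pooling; `runs` is a hypothetical mapping from run name to the ranked document ids it returned for one topic.

```python
def build_pool(runs, k=100):
    """Union of the top-k documents of every submitted run (for one topic)."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

runs = {
    "runA": ["d3", "d1", "d7", "d9"],
    "runB": ["d1", "d2", "d7", "d4"],
}
print(sorted(build_pool(runs, k=3)))   # between k and (#runs * k) documents
```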
  • 87. Relevance assessments - Pooling  Conditions under which pooling works [Robertson] – Range of different kinds of systems, including manual systems – Reasonably deep pools (100+ from each system)  But depends on collection size – The collections cannot be too big.  Big is so relative… 88
  • 88. Relevance assessments - Pooling  Advantage of pooling: – Fewer documents must be manually assessed for relevance  Disadvantages of pooling: – Can’t be certain that all documents satisfying the query are found (recall values may not be accurate) – Runs that did not participate in the pooling may be disadvantaged – If only one run finds certain relevant documents, but ranked lower than 100, it will not get credit for these. 89
  • 89. Relevance assessments  Pooling with randomized sampling  As the data collection grows, the top 100 may not be representative of the entire result set – (i.e. the assumption that everything after is not relevant does not hold anymore)  Add, to the pool, a set of documents randomly sampled from the entire retrieved set – If the sampling is uniform  easy to reason about, but may be too sparse as the collection grows – Stratified sampling: get more from the top of the ranked list [Yilmaz et al.:2008] 90
  • 90. Relevance assessments - incomplete • The unavoidable conclusion is that we have to handle incomplete relevance assessments – Consider unjudged = non-relevant – Do not consider unjudged at all (i.e. compress the ranked lists) • A new measure: – BPref (binary preference)  r = a relevant returned document  R = # documents judged relevant  N = # documents judged non-relevant  n = a judged non-relevant document BPref = (1/R) × Σ_r ( 1 - |{n : rank(n) < rank(r)}| / min(R, N) ), i.e. each relevant document is penalised by the fraction of judged non-relevant documents ranked above it 91
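A sketch of bpref as defined above; the ranking is a list of judgments in rank order (1 = judged relevant, 0 = judged non-relevant, None = unjudged), and for simplicity R is taken as the judged relevant documents that appear in the list.

```python
def bpref(ranking):
    """ranking: 1 = judged relevant, 0 = judged non-relevant, None = unjudged (ignored)."""
    R = sum(1 for j in ranking if j == 1)   # R taken as judged relevant in the ranking (simplification)
    N = sum(1 for j in ranking if j == 0)   # judged non-relevant
    if R == 0:
        return 0.0
    if N == 0:
        return 1.0                          # nothing judged non-relevant to penalise with
    bound = min(R, N)
    score, nonrel_above = 0.0, 0
    for j in ranking:
        if j == 0:
            nonrel_above += 1
        elif j == 1:
            score += 1.0 - min(nonrel_above, bound) / bound
    return score / R

# 1 = relevant, 0 = non-relevant, None = unjudged
print(bpref([1, None, 0, 1, None, 0, 0, 1, None, 0]))   # ~0.56
```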
  • 91. Relevance assessments - incomplete • BPref was designed to mimic MAP • soon after, induced AP and inferred AP were proposed • if data is complete – both are equal to MAP. indAP = (1/R) × Σ_r ( 1 - |{n : rank(n) < rank(r)}| / rank(r) ); infAP = (1/R) × Σ_r [ 1/k + ((k-1)/k) × ( d100(k-1) / (k-1) ) × ( (rel + ε) / (rel + nonrel + ε) ) ], with k = rank(r), d100(k-1) the number of pooled (judged or sampled) documents above rank k, and rel / nonrel the sampled relevant / non-relevant judgments above rank k – the bracketed term is the expectation of the precision at rank k 92
  • 92.  not only are we incomplete, but we might also be inconsistent in our judgments 93
  • 93. Relevance assessment - subjectivity  In TREC-CHEM’09 we had each topic evaluated by two students – “conflicts” ranged between 2% and 33% (excluding a topic with 60% conflict) – This all increased if we considered “strict disagreement”  In general, inter-evaluator agreement is rarely above 80%  There is little one can do about it 94
  • 94. Relevance assessment - subjectivity  Good news: – “idiosyncratic nature of relevance judgments does not affect comparative results” (E. Voorhees) – Mean Kendall Tau between system rankings produced from different query relevance sets: 0.938 – Similar results held for:  Different query sets  Different evaluation measures  Different assessor types  Single opinion vs. group opinion judgments 95
  • 95. No assessors  Pooling assumes all relevant documents found by systems – Take this assumption further  Voting-based relevance assessments – Consider top K only (Soboroff et al., 2001) 96
  • 96. Test Collections  Generally created as the result of an evaluation campaign – TREC – Text Retrieval Conference (USA) – CLEF – Cross Language Evaluation Forum (EU) – NTCIR - NII Test Collection for IR Systems (JP) – INEX – Initiative for evaluation of XML Retrieval – …  First one and paradigm definer: – The Cranfield Collection  In the 1950s  Aeronautics  1400 queries, about 6000 documents  Fully evaluated 97
  • 97. TREC  Started in 1992  Always organised in the States, on the NIST campus  As leader, introduced most of the jargon used in IR Evaluation: – Topic = query / request for information – Run = a ranked list of results – Qrel = relevance judgements 98
  • 98. TREC  Organised as a set of tracks that focus on a particular sub- problem of IR – E.g.  Patient records, Session, Chemical, Genome, Legal, Blog, Spam,Q&A, Novelty, Enterprise, Terabyte, Web, Video, Speech, OCR, Chinese, Spanish, Interactive, Filtering, Routing, Million Query, Ad-Hoc, Robust – Set of tracks in a year depends on  Interest of participants  Fit to TREC  Needs of sponsors  Resource constraints 99
  • 100. TREC – Task definition  Each Track has a set of Tasks:  Examples of tasks from the Blog track: – 1. Finding blog posts that contain opinions about the topic – 2. Ranking positive and negative blog posts – 3. (A separate baseline task to just find blog posts relevant to the topic) – 4. Finding blogs that have a principal, recurring interest in the topic 101
  • 101. TREC - Topics  For TREC, topics generally have a specific format (not always though) – <ID> – <title>  Very short – <description>  A brief statement of what would be a relevant document – <narrative>  A long description, meant also for the evaluator to understand how to judge the topic 102
  • 102. TREC - Topics  Example: – <ID>  312 – <title>  Hydroponics – <description>  Document will discuss the science of growing plants in water or some substance other than soil – <narrative>  A relevant document will contain specific information on the necessary nutrients, experiments, types of substrates, and/or any other pertinent facts related to the science of hydroponics. Related information includes, but is not limited to, the history of hydro- … 103
  • 103. CLEF  Cross Language Evaluation Forum – From 2010: Conference on Multilingual and Multimodal Information Access Evaluation – Supported by the PROMISE Network of Excellence  Started in 2000  Grand challenge: – Fully multilingual, multimodal IR systems  Capable of processing a query in any medium and any language  Finding relevant information from a multilingual multimedia collection  And presenting it in the style most likely to be useful for the user 104
  • 104. CLEF • Previous tracks: • Mono-, bi- multilingual text retrieval • Interactive cross language retrieval • Cross language spoken document retrieval • QA in multiple languages • Cross language retrieval in image collections • CL geographical retrieval • CL Video retrieval • Multilingual information filtering • Intellectual property • Log file analysis • Large scale grid experiments • From 2010 – Organised as a series of “labs” 105
  • 105. MediaEval  dedicated to evaluating new algorithms for multimedia access and retrieval.  emphasizes the 'multi' in multimedia  focuses on human and social aspects of multimedia tasks – speech recognition, multimedia content analysis, music and audio analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates. http://www.multimediaeval.org/ 106
  • 106. Test collections - summary  it is important to design the right experiment for the right IR task – Web retrieval is very different from legal retrieval  The example of Patent retrieval – High Recall: a single missed document can invalidate a patent – Session based: single searches may involve days of cycles of results review and query reformulation – Defendable: Process and results may need to be defended in court 107
  • 107. Outline  Introduction  Kinds of evaluation  Retrieval Effectiveness evaluation – Measures, Experimentation – Test Collections  User-based evaluation  Discussion on Evaluation  Conclusion 108
  • 108. User-based evaluation  Different levels of user involvement – Based on subjectivity levels 1. Relevant/non-relevant assessments  Used largely in lab-like evaluation as described before 2. User satisfaction evaluation  Some work on 1., very little on 2. – User satisfaction is very subjective  UIs play a major role  Search dissatisfaction can be a result of the non-existence of relevant documents 109
  • 109. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair 110
  • 110. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair 111
  • 111. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document 112
  • 112. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document Relative judgements of documents “Is document X more relevant than document Y for the given query?” - Many more assessments needed - Better inter-annotator agreement [Rees and Schultz, 1967] 113
  • 113. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results 114
  • 114. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results Image from Thomas and Hawking, Evaluation by comparing result sets in context, CIKM2006 115
  • 115. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results  Some issues, alternatives – Control for all sorts of user-based biases 116
  • 116. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on lists of results – Focus the user on query-document-document  Some issues, alternatives – Control for all sorts of user-based biases Image from Bailey, Thomas and Hawking, Does brandname influence perceived search result quality?, ADCS2007 117
  • 117. User-based evaluation  User-based relevance assessments – Focus the user on each query-document pair – Focus the user on query-document-document – Focus the user on lists of results  Some issues, alternatives – Control for all sorts of user-based biases – Two-panel evaluation – limits the number of systems which can be evaluated – Is unusable in real-life contexts – Interspersed ranked list with click monitoring 118
  • 118. Effectiveness evaluation lab-like vs. user-focused  Results are mixed: some experiments show correlations, some not  Do user preferences and Evaluation Measures Line up? SIGIR 2010: Sanderson, Paramita, Clough, Kanoulas – shows the existence of correlations  User preference is inherently user-dependent  Domain-specific IR will be different  The relationship between IR effectiveness measures and user satisfaction, SIGIR 2007, Al-Maskari, Sanderson, Clough – strong correlation between user satisfaction and DCG, which disappeared when normalized to NDCG. 119
  • 119. Predicting performance Future data and queries  not absolute, but relative performance – ad-hoc evaluations suffer in particular – no comparison between lab and operational settings  for justified reasons, but still none – how much better must a system be?  generally, require statistical significance [Trippe:2011] 120
  • 120. Predictive performance  Future systems  Test collections are often used to prove we have a better system than the state of the art – not all documents were evaluated 121
  • 121. Predictive performance  Future systems  Test collections are often used to prove we have a better system than the state of the art – not all documents were evaluated – “retrofit” metrics that are not considered resilient to such evolution  RBP [Webber:2009]  Precision@n [Lipani:2014], Recall@n […] 122 Why do this? - Precision@n and Recall@n are loved in industry - Also in industry, technology migration steps are high (i.e. hold on to a system that ‘works’ until it is patently obvious it affects business performance)
  • 122. Are Lab evals sufficient?  Patent search is an active process where the end-user engages in a process of understanding and interacting with the information  evaluation needs a definition of success – success ~ lower risk  partly precision and recall  partly (some argue the most important part) the intellectual and interactive role of the patent search system as a whole  series of evaluation layers – lab evals are now the lowest level – to elevate them, they must measure risk and incentivize systems to provide estimates of confidence in the results they provide [Trippe:2011] 123
  • 123. Outline  Introduction  Kinds of evaluation  Retrieval Effectiveness evaluation – Measures, Experimentation – Test Collections  User-based evaluation  Discussion on Evaluation  Conclusion 124
  • 124. Discussion on evaluation  Laboratory evaluation – good or bad? – Rigorous testing – Over-constrained  I usually make the comparison to a tennis racket: – No evaluation of the device will tell you how well it will perform in real life – that largely depends on the user – But the user will choose the device based on the lab evaluation 125
  • 125. Discussion on evaluation  There is bias to account for – E.g. number of relevant documents per topic 126
  • 126. Discussion on evaluation  Recall and recall-related measures are often contested  [cooper:73,p95] – “The involvement of unexamined documents in a performance formula has long been taken for granted as a perfectly natural thing, but if one stops to ponder the situation, it begins to appear most peculiar. … Surely a document which the system user has not been shown in any form, to which he has not devoted the slightest particle of time or attention during his use of the system output, and of whose very existence he is unaware, does that user neither harm nor good in his search”  Clearly not true in the legal & patent domains 127
  • 127. Discussion on Evaluation  Realistic tasks and user models – Evaluation has to be based on the available data sets.  This creates the user model  Tasks need to correspond to available techniques  Much literature on generating tasks – Experts describe typical tasks – Use of log files of various sorts  IR Research decades behind sociology in terms of user modeling – there is a place to learn from 128
  • 128. Discussion on Evaluation  Competitiveness – Most campaigns take pains to explain “This is not a competition – this is an evaluation”  Competitions are stimulating, but – Participants are wary of participating if they are not sure to win  Particularly commercial vendors – Without special care from organizers, it stifles creativity:  Best way to win is to take last year’s method and improve a bit  Original approaches are risky 129
  • 129. Discussion on Evaluation  Topical Relevance  What other kinds of relevance factors are there? – diversity of information – quality – credibility – ease of reading 130
  • 130. Conclusion • IR Evaluation is a research field in itself • Without evaluation, research is pointless – IR Evaluation research included •  statistical significance testing is a must to validate results • Most IR Evaluation exercises are laboratory experiments – As such, care must be taken to match, to the extent possible, real needs of the users • Experiments in the wild are rare, small and domain specific: – VideOlympics (2007-2009) – PatOlympics (2010-2012) 131
  • 131. Bibliography  Test Collection Based Evaluation of Information Retrieval Systems – M. Sanderson 2010  TREC – Experiment and Evaluation in Information Retrieval – E. Voorhees, D. Harman (eds.)  On the history of evaluation in IR – S. Robertson, 2008, Journal of Information Science  A Comparison of Statistical Significance Tests for Information Retrieval Evaluation – M. Smucker, J. Allan, B. Carterette (CIKM’07)  A Simple and Efficient Sampling Method for Estimating AP and NDCG – E. Yilmaz, E. Kanoulas, J. Aslam (SIGIR’08) 132
  • 132. Bibliography  Do User Preferences and Evaluation Measures Line Up?, M. Sanderson and M. L. Paramita and P. Clough and E. Kanoulas 2010  A Review of Factors Influencing User Satisfaction in Information Retrieval, A. Al-Maskari and M. Sanderson 2010  Towards higher quality health search results: Automated quality rating of depression websites, D. Hawking and T. Tang and R. Sankaranarayana and K. Griffiths and N. Craswell and P. Bailey 2007  Evaluating Sampling Methods for Uncooperative Collections, P. Thomas and D. Hawking 2007  Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski and N. Craswell 2010  Redundancy, Diversity and Interdependent Document Relevance, F. Radlinski and P. Bennett and B. Carterette and T. Joachims 2009  Does Brandname influence perceived search result quality? Yahoo!, Google, and WebKumara, P. Bailey and P. Thomas and D. Hawking 2007  Methods for Evaluating Interactive Information Retrieval Systems with Users, D. Kelly 2009  C-TEST: Supporting Novelty and Diversity in Test Files for Search Tuning, D. Hawking and T. Rowlands and P. Thomas 2009  Live Web Search Experiments for the Rest of Us, T. Jones and D. Hawking and R. Sankaranarayana 2010  Quality and relevance of domain-specific search: A case study in mental health, T. Tang and N. Craswell and D. Hawking and K. Griffiths and H. Christensen 2006  New methods for creating testfiles: Tuning enterprise search with C-TEST, D. Hawking and P. Thomas and T. Gedeon and T. Jones and T. Rowlands 2006  A Field Experimental Approach to the Study of Relevance Assessments in Relation to Document Searching, A. M. Rees and D. G. Schultz, Final Report to the National Science Foundation. Volume II, Appendices. Clearinghouse for Federal Scientific and Technical Information, October 1967  The Water Filling Model and the Cube Test: Multi-dimensional Evaluation for Professional Search, J. Luo, C. Wing, H. Yang and M. Hearst, CIKM 2013  On sample sizes for non-matched-pair IR experiments, S. Robertson, 1990, Information Processing & Management  Lipani A, Lupu M, Hanbury A, Splitting Water: Precision and Anti-Precision to Reduce Pool Bias, SIGIR 2015  W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In Proc. of SIGIR, 2009 133

Editor’s notes

  1. Thai text for “I have a red car”
  2. some terms you will be hearing us talking about
  3. In this lecture we will focus on the first, intrinsic, evaluation, and only mention the second part, as it will be discussed in much more detail in K. Jarvelin’s lecture.
  4. A desirable measure of retrieval performance would have the following properties: First, it would express solely the ability of a retrieval system to distinguish between wanted and unwanted items – that is, it would be a measure of effectiveness. Second, the desired measure would not be confounded by the relative willingness of the system to emit items – it would express discrimination power independent of any “acceptance criterion” employed, whether the criterion is characteristic of the system or adjusted by the user. Third, the measure would be a single number – in preference, for example, to a pair of numbers which may co-vary in a loosely specified way, or a curve representing a table of several pairs of numbers – so that it could be transmitted simply and immediately comprehended. Fourth, and finally, the measure would allow complete ordering of different performances, and assess the performance of any one system in absolute terms – that is, the metric would be a scale with a unit, a true zero, and a maximum value. Given a measure with these properties, we could be confident of having a pure and valid index of how well a retrieval system (or method) were performing the function it was primarily designed to accomplish, and we could reasonably ask questions of the form “Shall we pay X dollars for Y units of effectiveness?” (Swets, 1967, http://onlinelibrary.wiley.com/doi/10.1002/asi.4630200110/)
  5. For the E measure, Beta indicates what the user prefers (precision: beta>1, recall: beta<1). These methods clearly depend on cut-off values, which make them unusable for meaningful comparison between topics (a topic may have very few relevant documents, another may have many more). The harmonic mean is considered better for averaging ratios. Example: precision 0.1 and recall 0.9, the arithmetic average is 0.5 – quite high, while the harmonic mean is 0.18. Even more extreme case: think precision 0.01 and recall 0.99
  6. For the E measure, Beta indicates what the user prefers (precision: beta>1, recall: beta<1). These methods clearly depend on cut-off values, which make them unusable for meaningful comparison between topics (a topic may have very few relevant documents, another may have many more). The harmonic mean is considered better for averaging ratios. Example: precision 0.1 and recall 0.9, the arithmetic average is 0.5 – quite high, while the harmonic mean is 0.18. Even more extreme case: think precision 0.01 and recall 0.99
  7. this interpolation is actually not obvious because we might not always have the same values for recall (remember that that depends on the number of relevant documents per topic). The common way is to consider as precision at recall_i the highest precision measure at any level greater or equal than recall_i
  8. Cut-off based measures also have the significant disadvantage that they are unstable with respect to the size of the collection. They are also unfair between topics: the number of relevant documents for each topic in the collection generally differs, but improvements are considered the same by these measures. Across all seven participating groups, P(20) was higher for searches on the 20 GB collection than on the subset; on average 39% higher. Note also that all forms of AP and R-precision approximate the area under a recall-precision graph (Sanderson, Aslam)
  9. Here is where genre may come into play, as well as difficulty. This time function needs to be calibrated to the user.
  10. .5 prec is recall obtained by the system when precision first dips below 0.5 and at least ten documents have been retrieved (heuristic that users will look at the result set as long as there are more relevant than non-relevant documents) R(1,1000) is a weighted rel ret , such that the topics with most relevant documents do not dominate the measure
  11. Even now, topics are rejected (removed) if no relevant documents have been identified
  12. Cost of evaluation. e.g P@5 is very cheap, while MAP is much more expensive
  13. “What are the required conditions? Well, the evidence suggests that we need to start with a good range of different kinds of systems – preferably, in particular, including some manual systems involving human-designed search strategies and (preferably again) some degree of interaction in the search. Second, we need reasonably deep pools (preferably 100+ from each system, not 10). Third, the collections themselves cannot be too big. “ (Robertson:2008)
  14. Or at least very little on 2. which can be published in IR journals and conferences
  15. Re-quoted from [Moffat&Zobel:2008]