1. Evaluation in Information Retrieval
Ruihua Song
Web Search and Mining Group
Email: rsong@microsoft.com
2. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
3. How to evaluate?
• How well does the system meet the information need?
– System evaluation: how good are the document rankings?
– User-based evaluation: how satisfied is the user?
14. Evaluation Challenges on the Web
• The collection is dynamic
– 10-20% of URLs change every month
• Queries are time sensitive
– Topics are hot, and then they are not
• Spam methods evolve
– Algorithms evaluated against last month's web may not work today
• But we have a lot of users… you can use clicks as supervision
(SIGIR'05 keynote given by Amit Singhal from Google)
15. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
17. P-R curve
• Precision and recall
• Precision-recall curve
• Average precision-recall curve
18. P-R curve (cont.)
• For a query there is a result list (answer set)
• Figure: Venn diagram of the relevant documents R, the answer set A, and their intersection Ra (the relevant documents that were retrieved)
19. P-R curve (cont.)
• Recall is the fraction of the relevant documents that have been retrieved: recall = |Ra| / |R|
• Precision is the fraction of the retrieved documents that are relevant: precision = |Ra| / |A|
20. P-R curve (cont.)
• E.g.
– For some query, |Total Docs| = 200, |R| = 20
– r: relevant, n: non-relevant
– Ranked list: d123(r), d84(n), d5(n), d87(r), d80(r), d59(n), d90(r), d8(n), d89(r), d55(r), ...
– At rank 10, recall = 6/20, precision = 6/10
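A minimal sketch of this computation (not part of the slides), assuming Python and binary relevance labels matching the example above:

# Precision and recall at a rank cutoff k, given binary relevance labels
# for the ranked list (the answer set A) and the total number of relevant docs |R|.
def precision_recall_at_k(ranked_rels, num_relevant, k):
    retrieved = ranked_rels[:k]            # top-k results
    ra = sum(retrieved)                    # |Ra|: relevant docs that were retrieved
    return ra / k, ra / num_relevant       # precision = |Ra|/|A|, recall = |Ra|/|R|

labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]        # r n n r r n r n r r
print(precision_recall_at_k(labels, 20, 10))   # (0.6, 0.3): precision 6/10, recall 6/20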
23. MAP
• Mean Average Precision
• Defined as the mean of the precision obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved.
24. MAP (cont.)
• E.g.
– |Total Docs| = 200, |R| = 20
– The whole result list consists of 10 docs, as follows (r: relevant, n: non-relevant):
– d123(r), d84(n), d5(n), d87(r), d80(r), d59(n), d90(r), d8(n), d89(r), d55(r)
– Relevant documents appear at ranks 1, 4, 5, 7, 9 and 10; the other 14 relevant documents are not retrieved and contribute zero precision
– Average precision = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20 ≈ 0.19; MAP is the mean of this value over all queries
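A minimal sketch of average precision for one query (not part of the slides), following the definition on slide 23 where relevant documents that are never retrieved contribute zero:

def average_precision(ranked_rels, num_relevant):
    # Sum the precision at each rank where a relevant document appears,
    # then divide by |R|; unretrieved relevant documents add nothing.
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]      # r n n r r n r n r r
print(average_precision(labels, 20))         # (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20 ≈ 0.19

MAP is then the mean of this per-query value over all queries in the test collection.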
25. Precision at 10
• P@10 is the number of relevant documents in the top 10 documents of the ranked list returned for a topic, divided by 10
• E.g.
– 3 of the top 10 documents are relevant
– P@10 = 3/10 = 0.3
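A minimal sketch (not part of the slides; the labels are made up to match the slide's count of 3 relevant documents):

def precision_at_k(ranked_rels, k=10):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(ranked_rels[:k]) / k

print(precision_at_k([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]))   # 3 relevant in the top 10 -> 0.3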
26. Mean Reciprocal Rank
• The reciprocal rank for a topic is the reciprocal of the rank of the first relevant document in the returned list; MRR is its mean over topics
• E.g.
– the first relevant document is ranked 4th
– reciprocal rank = 1/4 = 0.25
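A minimal sketch (not part of the slides) of the per-topic reciprocal rank, which MRR averages over all topics:

def reciprocal_rank(ranked_rels):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

print(reciprocal_rank([0, 0, 0, 1, 0, 1]))   # first relevant doc at rank 4 -> 0.25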
27. bpref
• Bpref stands for Binary Preference
• Considers only judged docs in the result list
• The basic idea is to count the number of times judged non-relevant docs are retrieved before judged relevant docs
29. bpref (cont.)
• E.g.
– |Total Docs| = 200, |R| = 20
– r: judged relevant
– n: judged non-relevant
– u: not judged, unknown whether relevant or not
– Ranked list: d123(r), d84(n), d5(n), d87(u), d80(r), d59(n), d90(r), d8(u), d89(u), d55(r), ...
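A minimal sketch of the bpref computation (not part of the slides; the formula slide is not included in this text), following the Buckley & Voorhees (2004) definition bpref = (1/R) * sum over judged relevant retrieved docs r of (1 - |n ranked above r| / R), counting only the first R judged non-relevant documents:

def bpref(ranked_judgments, num_relevant):
    # ranked_judgments: 'r' (judged relevant), 'n' (judged non-relevant),
    # 'u' (unjudged) for each retrieved document; unjudged docs are ignored.
    R = num_relevant
    n_above = 0       # judged non-relevant docs seen so far
    score = 0.0
    for judgment in ranked_judgments:
        if judgment == 'n':
            n_above += 1
        elif judgment == 'r':
            score += 1.0 - min(n_above, R) / R
    return score / R

print(bpref(list("rnnurnruur"), 20))   # the slide's example list, |R| = 20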
30. References
• Baeza-Yates, R. & Ribeiro-Neto, B.
Modern Information Retrieval
Addison Wesley, 1999, 73-96
• Buckley, C. & Voorhees, E.M.
Retrieval Evaluation with Incomplete Information
Proceedings of SIGIR 2004
31. NDCG
• Two assumptions about the ranked result list
– Highly relevant documents are more valuable
– The greater the ranked position of a relevant document (i.e. the lower it appears in the list), the less valuable it is for the user
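The slides with the exact formulas (32-37) are not included in this text; as a minimal sketch (an assumption, not the slides' own definition), a commonly used form discounts the graded gain of each document by log2 of its rank and normalizes by the ideal ranking:

import math

def dcg(gains, k):
    # Discounted cumulated gain with a log2 rank discount.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k):
    # Normalize by the DCG of the ideal (descending) ordering of the same gains.
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Graded judgments (0 = non-relevant ... 3 = highly relevant) for a ranked list
print(ndcg([3, 2, 0, 1, 2, 0, 0, 1, 0, 0], k=10))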
38. NDCG (cont.)
• Pros
– Graded relevance, more precise than P-R
– Reflects more of user behavior (e.g. user persistence)
– CG and DCG graphs are intuitive to interpret
• Cons
– Disagreements in rating
– How to set the parameters
39. Reference
• Järvelin, K. & Kekäläinen, J.
Cumulated Gain-based Evaluation of IR Techniques
ACM Transactions on Information Systems, 2002, 20, 422-446
40. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
41. Significance Test
• Significance Test
– Why is it necessary?
– T-Test is chosen in IR experiments
• Paired
• Two-tailed / One-tailed
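A minimal sketch of the paired, two-tailed t-test used to compare two systems on the same topics (not part of the slides; the per-topic scores below are made up):

from scipy import stats

# Per-topic scores (e.g. average precision) for the same topics under two systems.
system_a = [0.32, 0.45, 0.18, 0.50, 0.27, 0.61, 0.39, 0.22]
system_b = [0.35, 0.49, 0.17, 0.58, 0.30, 0.66, 0.41, 0.25]

# Paired because both systems are evaluated on the same topics; two-tailed by default.
t_stat, p_value = stats.ttest_rel(system_b, system_a)
print(t_stat, p_value)   # reject "no real difference" if p_value is below, e.g., 0.05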
42. Is the difference significant?
• Two almost identical systems
• Figure: two overlapping score distributions p(score), one green and one yellow; is Green < Yellow? Is the difference significant, or just caused by chance?
50. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
– T. Joachims, L. Granka, B. Pang, H. Hembrooke,
and G. Gay, Accurately Interpreting
Clickthrough Data as Implicit Feedback,
Proceedings of the Conference on Research and
Development in Information Retrieval (SIGIR),
2005.
52. Introduction
• The user study is different in at least two respects from previous work
– The study provides detailed insight into the users' decision-making process through the use of eyetracking
– It evaluates relative preference signals derived from user behavior
• Clicking decisions are biased in at least two ways: trust bias and quality bias
• Clicks have to be interpreted relative to the order of presentation and relative to the other abstracts
53. User Study
• The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action
• This is achieved by recording users' eye movements with an eye tracker
55. Two Phases of the Study
• Phase I
– 34 participants
– Start search with Google query, search for answers
• Phase II
– Investigate how users react to manipulations of search results
– Same instructions as phase I
– Each subject assigned to one of three experimental conditions
• Normal
• Swapped
• Reversed
56. Explicit Relevance Judgments
• Collected explicit relevance judgments for all queries and results pages
– Phase I
• Randomized the order of abstracts and asked judges to (weakly) order the abstracts
– Phase II
• The set for judging includes more: both abstracts and Web pages
• Inter-judge agreement
– Phase I: 89.5%
– Phase II: abstracts 82.5%, pages 86.4%
57. Eyetracking
• Fixations
– 200-300 milliseconds
– Used in this paper
• Saccades
– 40-50 milliseconds
• Pupil dilation
58. Analysis of User Behavior
• Which links do users view and click?
• Do users scan links from top to bottom?
• Which links do users evaluate before clicking?
59. Which links do users view and click?
• Almost equal viewing frequency for the 1st and 2nd link, but more clicks on the 1st link
• Once the user has started scrolling, rank appears to become less of an influence
60. Do users scan links from top to bottom?
• Big gap before viewing the 3rd ranked abstract
• Users scan viewable results thoroughly before scrolling
61. Which links do users evaluate before clicking?
• Abstracts closer above the clicked link are more likely to be viewed
• The abstract right below the clicked link is viewed roughly 50% of the time
62. Analysis of Implicit Feedback
• Does relevance influence user decisions?
• Are clicks absolute relevance judgments?
• Are clicks relative relevance judgments?
63. Does relevance influence user decisions?
• Yes
• Use the "reversed" condition
– It controllably decreases the quality of the retrieval function and the relevance of highly ranked abstracts
• Users react in two ways
– They view lower ranked links more frequently and scan significantly more abstracts
– Subjects are much less likely to click on the first link and more likely to click on a lower ranked link
64. Are clicks absolute relevance judgments?
• Interpretation is problematic
• Trust Bias
– The abstract ranked first receives more clicks than the second, either because
• the first link is more relevant (not influenced by order of presentation), or
• users prefer the first link due to some level of trust in the search engine (influenced by order of presentation)
65. Trust Bias
• Hypothesis that users are not influenced by
presentation order can be rejected
• Users have substantial trust in search engine’s ability
to estimate relevance
66. Quality Bias
• The quality of the ranking influences the user's clicking behavior
– If the relevance of the retrieved results decreases, users click on abstracts that are on average less relevant
– Confirmed by the "reversed" condition
67. Are clicks relative relevance judgments?
• An accurate interpretation of clicks needs to take two biases into consideration, but they are difficult to measure explicitly
– The user's trust in the quality of the search engine
– The quality of the retrieval function itself
• How about interpreting clicks as pairwise preference statements?
• An example (see the sketch below)
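The example figure itself is not included in this text. As an illustration of one such strategy, the paper's "Click > Skip Above" rule prefers a clicked result over every non-clicked result ranked above it; the sketch below is not from the slides and the clicked ranks are made up:

def click_gt_skip_above(clicked_ranks):
    # "Click > Skip Above": a clicked result is preferred over every
    # higher-ranked result that was not clicked.
    clicked = set(clicked_ranks)
    prefs = []
    for c in clicked_ranks:
        for above in range(1, c):
            if above not in clicked:
                prefs.append((c, above))   # (preferred rank, less preferred rank)
    return prefs

# The user clicked the results at ranks 3 and 5.
print(click_gt_skip_above([3, 5]))   # [(3, 1), (3, 2), (5, 1), (5, 2), (5, 4)]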
68. In the example,
Comments:
• Takes trust and quality bias into consideration
• Substantially and significantly better than random
• Close in accuracy to the inter-judge agreement
76. In the example,
Rel(l1) > Rel(l2), Rel(l3) > Rel(l4), Rel(l5) > Rel(l6)
Comments:
• Highly accurate in the "normal" condition
• Misleading
– Preferences aligned with the presented ranking are probably less valuable for learning
– It would give good-looking results even if the user behaved randomly
• Less accurate than Strategy 1 in the "reversed" condition
78. Conclusion
• Users' clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute feedback
• The paper proposes strategies for generating relative relevance feedback signals, which are shown to correspond well with explicit judgments
• While the implicit relevance signals are less consistent with the explicit judgments than the explicit judgments are with each other, the difference is encouragingly small
79. Summary
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
– T. Joachims, L. Granka, B. Pang, H. Hembrooke,
and G. Gay, Accurately Interpreting
Clickthrough Data as Implicit Feedback,
Proceedings of the Conference on Research and
Development in Information Retrieval (SIGIR),
2005.