1. Evaluation in Information Retrieval
Ruihua Song
Web Search and Mining Group
Email: rsong@microsoft.com
2. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
3. How to evaluate?
• How well does the system meet the information need?
– System evaluation: how good are the document rankings?
– User-based evaluation: how satisfied is the user?
14. Evaluation Challenges on the Web
• The collection is dynamic
– 10-20% of URLs change every month
• Queries are time sensitive
– Topics are hot, and then they are not
• Spam methods evolve
– Algorithms evaluated against last month's web may not work today
• But we have a lot of users… you can use clicks as supervision
(SIGIR'05 keynote given by Amit Singhal from Google)
15. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
17. P-R curve
• Precision and recall
• Precision-recall curve
• Average precision-recall curve
18. P-R curve (cont.)
• For a query there is a result list (answer set)
• Figure: Venn diagram of the relevant documents R, the answer set A, and their intersection Ra (the relevant documents that were retrieved)
19. P-R curve (cont.)
• Recall is the fraction of the relevant documents that have been retrieved: recall = |Ra| / |R|
• Precision is the fraction of the retrieved documents that are relevant: precision = |Ra| / |A|
20. P-R curve (cont.)
• E.g.
– For some query, |Total Docs| = 200, |R| = 20
– r: relevant, n: non-relevant
– Ranked list: d123(r), d84(n), d5(n), d87(r), d80(r), d59(n), d90(r), d8(n), d89(r), d55(r), ...
– At rank 10, recall = 6/20, precision = 6/10
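A minimal sketch of this computation (not part of the slides), assuming Python and binary relevance labels matching the example above:

# Precision and recall at a rank cutoff k, given binary relevance labels
# for the ranked list (the answer set A) and the total number of relevant docs |R|.
def precision_recall_at_k(ranked_rels, num_relevant, k):
    retrieved = ranked_rels[:k]            # top-k results
    ra = sum(retrieved)                    # |Ra|: relevant docs that were retrieved
    return ra / k, ra / num_relevant       # precision = |Ra|/|A|, recall = |Ra|/|R|

labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]        # r n n r r n r n r r
print(precision_recall_at_k(labels, 20, 10))   # (0.6, 0.3): precision 6/10, recall 6/20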
23. MAP
• Mean Average Precision
• Defined as the mean of the precision obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved.
24. MAP (cont.)
• E.g.
– |Total Docs| = 200, |R| = 20
– The whole result list consists of 10 docs, as follows (r: relevant, n: non-relevant):
– d123(r), d84(n), d5(n), d87(r), d80(r), d59(n), d90(r), d8(n), d89(r), d55(r)
– Relevant documents appear at ranks 1, 4, 5, 7, 9 and 10; the other 14 relevant documents are not retrieved and contribute zero precision
– Average precision = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20 ≈ 0.19; MAP is the mean of this value over all queries
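A minimal sketch of average precision for one query (not part of the slides), following the definition on slide 23 where relevant documents that are never retrieved contribute zero:

def average_precision(ranked_rels, num_relevant):
    # Sum the precision at each rank where a relevant document appears,
    # then divide by |R|; unretrieved relevant documents add nothing.
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

labels = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]      # r n n r r n r n r r
print(average_precision(labels, 20))         # (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 20 ≈ 0.19

MAP is then the mean of this per-query value over all queries in the test collection.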
25. Precision at 10
• P@10 is the number of relevant documents in the top 10 documents of the ranked list returned for a topic, divided by 10
• E.g.
– 3 of the top 10 documents are relevant
– P@10 = 3/10 = 0.3
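A minimal sketch (not part of the slides; the labels are made up to match the slide's count of 3 relevant documents):

def precision_at_k(ranked_rels, k=10):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(ranked_rels[:k]) / k

print(precision_at_k([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]))   # 3 relevant in the top 10 -> 0.3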
26. Mean Reciprocal Rank
• The reciprocal rank for a topic is the reciprocal of the rank of the first relevant document in the returned list; MRR is its mean over topics
• E.g.
– the first relevant document is ranked 4th
– reciprocal rank = 1/4 = 0.25
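A minimal sketch (not part of the slides) of the per-topic reciprocal rank, which MRR averages over all topics:

def reciprocal_rank(ranked_rels):
    # 1 / rank of the first relevant document; 0 if none is retrieved.
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

print(reciprocal_rank([0, 0, 0, 1, 0, 1]))   # first relevant doc at rank 4 -> 0.25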
27. bpref
• Bpref stands for Binary Preference
• Considers only judged docs in the result list
• The basic idea is to count the number of times judged non-relevant docs are retrieved before judged relevant docs
29. bpref (cont.)
• E.g.
– |Total Docs| = 200, |R| = 20
– r: judged relevant
– n: judged non-relevant
– u: not judged, unknown whether relevant or not
– Ranked list: d123(r), d84(n), d5(n), d87(u), d80(r), d59(n), d90(r), d8(u), d89(u), d55(r), ...
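A minimal sketch of the bpref computation (not part of the slides; the formula slide is not included in this text), following the Buckley & Voorhees (2004) definition bpref = (1/R) * sum over judged relevant retrieved docs r of (1 - |n ranked above r| / R), counting only the first R judged non-relevant documents:

def bpref(ranked_judgments, num_relevant):
    # ranked_judgments: 'r' (judged relevant), 'n' (judged non-relevant),
    # 'u' (unjudged) for each retrieved document; unjudged docs are ignored.
    R = num_relevant
    n_above = 0       # judged non-relevant docs seen so far
    score = 0.0
    for judgment in ranked_judgments:
        if judgment == 'n':
            n_above += 1
        elif judgment == 'r':
            score += 1.0 - min(n_above, R) / R
    return score / R

print(bpref(list("rnnurnruur"), 20))   # the slide's example list, |R| = 20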
30. References
• Baeza-Yates, R. & Ribeiro-Neto, B.
Modern Information Retrieval
Addison Wesley, 1999, 73-96
• Buckley, C. & Voorhees, E.M.
Retrieval Evaluation with Incomplete Information
Proceedings of SIGIR 2004
31. NDCG
• Two assumptions about the ranked result list
– Highly relevant documents are more valuable
– The greater the ranked position of a relevant document (i.e. the lower it appears in the list), the less valuable it is for the user
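The slides with the exact formulas (32-37) are not included in this text; as a minimal sketch (an assumption, not the slides' own definition), a commonly used form discounts the graded gain of each document by log2 of its rank and normalizes by the ideal ranking:

import math

def dcg(gains, k):
    # Discounted cumulated gain with a log2 rank discount.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg(gains, k):
    # Normalize by the DCG of the ideal (descending) ordering of the same gains.
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Graded judgments (0 = non-relevant ... 3 = highly relevant) for a ranked list
print(ndcg([3, 2, 0, 1, 2, 0, 0, 1, 0, 0], k=10))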
38. NDCG (cont.)
• Pros
– Graded relevance, more precise than P-R
– Reflects more of user behavior (e.g. user persistence)
– CG and DCG graphs are intuitive to interpret
• Cons
– Disagreements in rating
– How to set the parameters
39. Reference
• Järvelin, K. & Kekäläinen, J.
Cumulated Gain-based Evaluation of IR Techniques
ACM Transactions on Information Systems, 2002, 20, 422-446
40. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
41. Significance Test
• Significance Test
– Why is it necessary?
– T-Test is chosen in IR experiments
• Paired
• Two-tailed / One-tailed
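A minimal sketch of the paired, two-tailed t-test used to compare two systems on the same topics (not part of the slides; the per-topic scores below are made up):

from scipy import stats

# Per-topic scores (e.g. average precision) for the same topics under two systems.
system_a = [0.32, 0.45, 0.18, 0.50, 0.27, 0.61, 0.39, 0.22]
system_b = [0.35, 0.49, 0.17, 0.58, 0.30, 0.66, 0.41, 0.25]

# Paired because both systems are evaluated on the same topics; two-tailed by default.
t_stat, p_value = stats.ttest_rel(system_b, system_a)
print(t_stat, p_value)   # reject "no real difference" if p_value is below, e.g., 0.05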
42. Is the difference significant?
• Two almost identical systems
• Figure: two overlapping score distributions p(score), one green and one yellow; is Green < Yellow? Is the difference significant, or just caused by chance?
50. Overview
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
– T. Joachims, L. Granka, B. Pang, H. Hembrooke,
and G. Gay, Accurately Interpreting
Clickthrough Data as Implicit Feedback,
Proceedings of the Conference on Research and
Development in Information Retrieval (SIGIR),
2005.
52. Introduction
• The user study is different in at least two respects from previous work
– The study provides detailed insight into the users' decision-making process through the use of eyetracking
– It evaluates relative preference signals derived from user behavior
• Clicking decisions are biased in at least two ways: trust bias and quality bias
• Clicks have to be interpreted relative to the order of presentation and relative to the other abstracts
53. User Study
• The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action
• This is achieved by recording users' eye movements with an eye tracker
55. Two Phases of the Study
• Phase I
– 34 participants
– Start search with Google query, search for answers
• Phase II
– Investigate how users react to manipulations of search results
– Same instructions as phase I
– Each subject assigned to one of three experimental conditions
• Normal
• Swapped
• Reversed
56. Explicit Relevance Judgments
• Collected explicit relevance judgments for all queries and results pages
– Phase I
• Randomized the order of abstracts and asked judges to (weakly) order the abstracts
– Phase II
• The set for judging includes more: both abstracts and Web pages
• Inter-judge agreement
– Phase I: 89.5%
– Phase II: abstracts 82.5%, pages 86.4%
57. Eyetracking
• Fixations
– 200-300 milliseconds
– Used in this paper
• Saccades
– 40-50 milliseconds
• Pupil dilation
58. Analysis of User Behavior
• Which links do users view and click?
• Do users scan links from top to bottom?
• Which links do users evaluate before clicking?
59. Which links do users view and click?
• Almost equal viewing frequency for the 1st and 2nd link, but more clicks on the 1st link
• Once the user has started scrolling, rank appears to become less of an influence
60. Do users scan links from top to bottom?
• Big gap before viewing the 3rd ranked abstract
• Users scan viewable results thoroughly before scrolling
61. Which links do users evaluate before clicking?
• Abstracts closer above the clicked link are more likely to be viewed
• The abstract right below the clicked link is viewed roughly 50% of the time
62. Analysis of Implicit Feedback
• Does relevance influence user decisions?
• Are clicks absolute relevance judgments?
• Are clicks relative relevance judgments?
63. Does relevance influence user decisions?
• Yes
• Use the "reversed" condition
– It controllably decreases the quality of the retrieval function and the relevance of highly ranked abstracts
• Users react in two ways
– They view lower ranked links more frequently and scan significantly more abstracts
– Subjects are much less likely to click on the first link and more likely to click on a lower ranked link
64. Are clicks absolute relevance judgments?
• Interpretation is problematic
• Trust Bias
– The abstract ranked first receives more clicks than the second, either because
• the first link is more relevant (not influenced by order of presentation), or
• users prefer the first link due to some level of trust in the search engine (influenced by order of presentation)
65. Trust Bias
• Hypothesis that users are not influenced by
presentation order can be rejected
• Users have substantial trust in search engine’s ability
to estimate relevance
66. Quality Bias
• The quality of the ranking influences the user's clicking behavior
– If the relevance of the retrieved results decreases, users click on abstracts that are on average less relevant
– Confirmed by the "reversed" condition
67. Are clicks relative relevance judgments?
• An accurate interpretation of clicks needs to take two biases into consideration, but they are difficult to measure explicitly
– The user's trust in the quality of the search engine
– The quality of the retrieval function itself
• How about interpreting clicks as pairwise preference statements?
• An example (see the sketch below)
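The example figure itself is not included in this text. As an illustration of one such strategy, the paper's "Click > Skip Above" rule prefers a clicked result over every non-clicked result ranked above it; the sketch below is not from the slides and the clicked ranks are made up:

def click_gt_skip_above(clicked_ranks):
    # "Click > Skip Above": a clicked result is preferred over every
    # higher-ranked result that was not clicked.
    clicked = set(clicked_ranks)
    prefs = []
    for c in clicked_ranks:
        for above in range(1, c):
            if above not in clicked:
                prefs.append((c, above))   # (preferred rank, less preferred rank)
    return prefs

# The user clicked the results at ranks 3 and 5.
print(click_gt_skip_above([3, 5]))   # [(3, 1), (3, 2), (5, 1), (5, 2), (5, 4)]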
68. In the example,
Comments:
• Takes trust and quality bias into consideration
• Substantially and significantly better than random
• Close in accuracy to the inter-judge agreement
76. In the example,
Rel(l1) > Rel(l2), Rel(l3) > Rel(l4), Rel(l5) > Rel(l6)
Comments:
• Highly accurate in the "normal" condition
• Misleading
– Preferences aligned with the presented ranking are probably less valuable for learning
– It would give good-looking results even if the user behaved randomly
• Less accurate than Strategy 1 in the "reversed" condition
78. Conclusion
• Users' clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute feedback
• The paper proposes strategies for generating relative relevance feedback signals, which are shown to correspond well with explicit judgments
• While the implicit relevance signals are less consistent with the explicit judgments than the explicit judgments are with each other, the difference is encouragingly small
79. Summary
• Retrieval Effectiveness Evaluation
• Evaluation Measures
• Significance Test
• One Selected SIGIR Paper
– T. Joachims, L. Granka, B. Pang, H. Hembrooke,
and G. Gay, Accurately Interpreting
Clickthrough Data as Implicit Feedback,
Proceedings of the Conference on Research and
Development in Information Retrieval (SIGIR),
2005.