Evaluation in Information Retrieval

   Ruihua Song
   Web Search and Mining Group
   Email: rsong@microsoft.com
Overview
•   Retrieval Effectiveness Evaluation
•   Evaluation Measures
•   Significance Test
•   One Selected SIGIR Paper
How to evaluate?
• How well does system meet information
  need?
   ̵ System evaluation: how good are
      document rankings?
    ̵ User-based evaluation: how satisfied
      is user?
Ellen Voorhees, The TREC
Conference: An Introduction
Evaluation Challenges On The Web
• Collection is dynamic
   ̵ 10-20% of URLs change every month
• Queries are time sensitive
   ̵ Topics are hot, then they are not
• Spam methods evolve
   ̵ Algorithms evaluated against last month’s web may not work today
• But we have a lot of users… you can use clicks as supervision

                SIGIR'05 Keynote given by Amit
                     Singhal from Google
Overview
•   Retrieval Effectiveness Evaluation
•   Evaluation Measures
•   Significance Test
•   One Selected SIGIR Paper
P-R curve
• Precision and recall
• Precision-recall curve
• Average precision-recall curve
P-R curve (cont.)

• For a query there is a result list (answer set)
   ̵ R: the set of relevant documents
   ̵ A: the answer set (retrieved documents)
   ̵ Ra = R ∩ A: the relevant documents that were retrieved
P-R curve (cont.)
• Recall is the fraction of the relevant documents that has been retrieved:

      recall = |Ra| / |R|

• Precision is the fraction of the retrieved documents that is relevant:

      precision = |Ra| / |A|
P-R curve (cont.)
• E.g.
   ̵ For some query, |Total Docs| = 200, |R| = 20
   ̵ r: relevant
   ̵ n: non-relevant
   ̵ At rank 10, recall = 6/20, precision = 6/10

      d123 (r), d84 (n), d5 (n), d87 (r), d80 (r), d59 (n), d90 (r), d8 (n), d89 (r), d55 (r), ...
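The computation above can be checked with a short sketch (doc ids follow the slide's hypothetical example):

```python
def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall over the top-k of a ranked result list.

    ranking: doc ids in rank order; relevant: set of ALL relevant ids,
    so |relevant| may be larger than the number retrieved.
    """
    hits = sum(1 for d in ranking[:k] if d in relevant)
    return hits / k, hits / len(relevant)

# The slide's example: 6 relevant docs in the top 10, |R| = 20 in total
# (the 14 unretrieved relevant docs get placeholder ids).
ranking = ["d123", "d84", "d5", "d87", "d80", "d59", "d90", "d8", "d89", "d55"]
relevant = {"d123", "d87", "d80", "d90", "d89", "d55"} | {f"x{i}" for i in range(14)}
p, r = precision_recall_at_k(ranking, relevant, 10)  # p = 0.6, r = 0.3
```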
Individual query P-R curve
P-R curve (cont.)
MAP
• Mean Average Precision
• Defined as the mean of the precision values obtained
  after each relevant document is retrieved,
  using zero as the precision for relevant documents
  that are not retrieved.
MAP (cont.)
• E.g.
   ̵ |Total Docs| = 200, |R| = 20
   ̵ The result list consists of the 10 docs below
   ̵ r: relevant
   ̵ n: non-relevant
   ̵ MAP = (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10)/6

      d123 (r), d84 (n), d5 (n), d87 (r), d80 (r), d59 (n), d90 (r), d8 (n), d89 (r), d55 (r)
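The single-query average precision in the example can be reproduced as follows; note that, following the slide's worked example, the sum is divided by the number of retrieved relevant documents (6) rather than by |R| = 20:

```python
def average_precision(ranking, relevant):
    """Sum of precision at each relevant hit, divided by the number of
    relevant hits in the list (as in the slide's worked example)."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

ranking = ["d123", "d84", "d5", "d87", "d80", "d59", "d90", "d8", "d89", "d55"]
relevant = {"d123", "d87", "d80", "d90", "d89", "d55"}
ap = average_precision(ranking, relevant)
# (1 + 2/4 + 3/5 + 4/7 + 5/9 + 6/10) / 6 ≈ 0.638
```

MAP is then the mean of this value over all queries in the test set.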
Precision at 10
• P@10 is the fraction of the top 10 documents in the
  ranked list returned for a topic that are relevant

• E.g.
   ̵ 3 of the top 10 documents are relevant
   ̵ P@10 = 3/10 = 0.3
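P@k is the simplest of these measures; a minimal sketch:

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

# 3 relevant documents in the top 10 -> P@10 = 0.3
p10 = precision_at_k(list("abcdefghij"), {"a", "e", "j"})
```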
Mean Reciprocal Rank
• MRR is the mean, over topics, of the reciprocal of the
  rank of the first relevant document in the ranked list
  returned for a topic

• E.g.
   ̵ the first relevant document is ranked No. 4
   ̵ MRR = 1/4 = 0.25
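A minimal sketch of the reciprocal rank and its mean (doc ids are illustrative):

```python
def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant document; 0 if none is retrieved."""
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (ranking, relevant-set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# First relevant document ranked No. 4 -> RR = 1/4 = 0.25
rr = reciprocal_rank(["a", "b", "c", "d", "e"], {"d", "e"})
```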
bpref
• bpref stands for Binary Preference
• Considers only judged docs in the result list
• The basic idea is to count the number of times judged
  non-relevant docs are retrieved before judged
  relevant docs
bpref (cont.)
bpref (cont.)

• E.g.
   ̵ |Total Docs| = 200, |R| = 20
   ̵ r: judged relevant
   ̵ n: judged non-relevant
   ̵ u: not judged, unknown whether relevant or not

      d123 (r), d84 (n), d5 (n), d87 (u), d80 (r), d59 (n), d90 (r), d8 (u), d89 (u), d55 (r), ...
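A sketch of one common bpref formulation (penalty capped at R, unjudged documents skipped; trec_eval's exact variant differs in some details), reproducing the slide's example with R = 20:

```python
def bpref(ranking, judged_rel, judged_nonrel):
    """For each retrieved judged-relevant doc, its contribution is
    1 - (number of judged non-relevant docs ranked above it, capped
    at R) / R. Unjudged documents are skipped entirely."""
    R = len(judged_rel)
    nonrel_above, total = 0, 0.0
    for d in ranking:
        if d in judged_nonrel:
            nonrel_above += 1
        elif d in judged_rel:
            total += 1 - min(nonrel_above, R) / R
    return total / R

# The slide's example: R = 20 (16 judged-relevant docs are unretrieved
# and get placeholder ids).
ranking = ["d123", "d84", "d5", "d87", "d80", "d59", "d90", "d8", "d89", "d55"]
judged_rel = {"d123", "d80", "d90", "d55"} | {f"r{i}" for i in range(16)}
judged_nonrel = {"d84", "d5", "d59"}
score = bpref(ranking, judged_rel, judged_nonrel)
# (1 + 0.9 + 0.85 + 0.85) / 20 = 0.18
```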
References
• Baeza-Yates, R. & Ribeiro-Neto, B.
  Modern Information Retrieval
  Addison Wesley, 1999 , 73-96

• Buckley, C. & Voorhees, E.M.
  Retrieval Evaluation with Incomplete
  Information
  Proceedings of SIGIR 2004
NDCG
• Two assumptions about a ranked result list
   ̵ Highly relevant documents are more valuable
   ̵ The lower a relevant document appears in the
     ranking, the less valuable it is for the user
NDCG (cont.)
• Graded judgment -> gain vector
• Cumulated Gain
NDCG (cont.)
• Discounted CG
• Discounting function
NDCG (cont.)
• Ideal (D)CG vector
NDCG (cont.)
NDCG (cont.)
• Normalized (D)CG
NDCG (cont.)
NDCG (cont.)
• Pros
   ̵ Graded, more precise than P-R
   ̵ Reflects more of user behavior (e.g. user
     persistence)
   ̵ CG and DCG graphs are intuitive to interpret

• Cons
   ̵ Disagreements in rating
   ̵ How to set parameters
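A sketch of (N)DCG with log_b discounting as described above (ranks below b are not discounted); the gain values are illustrative graded judgments:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain: gains at ranks below b are kept as-is;
    from rank b on, the gain at rank i is divided by log_b(i)."""
    return sum(g if i < b else g / math.log(i, b)
               for i, g in enumerate(gains, start=1))

def ndcg(gains, b=2):
    """DCG normalized by the DCG of the ideal (sorted) gain vector."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal else 0.0

# Graded judgments (0-3) used directly as gains:
score = ndcg([3, 2, 3, 0, 1, 2])  # between 0 and 1; 1.0 for an ideal ranking
```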
Reference
• Järvelin, K. & Kekäläinen, J.
  Cumulated Gain-based Evaluation of IR Techniques
  ACM Transactions on Information Systems, 2002,
  20, 422-446
Overview
•   Retrieval Effectiveness Evaluation
•   Evaluation Measures
•   Significance Test
•   One Selected SIGIR Paper
Significance Test
• Significance Test
  ̵ Why is it necessary?
   ̵ The t-test is commonly chosen in IR experiments
     • Paired
     • Two-tailed / One-tailed
Is the difference significant?
• Two systems with almost the same score distribution p(.)
   ̵ Is one ("Green") really worse than the other ("Yellow")?
   ̵ Is the difference significant, or just caused by chance?
T-Test
• Comparing a sample mean with a population mean
   ̵ To judge whether an observed set of measurements is
     close to the population mean: is the difference just the
     sampling error expected between a sample and its
     population, or does it exceed the allowable range of
     sampling error and thus indicate a significant difference?
• Comparing paired sample means
   ̵ Sometimes the population mean is unknown and the data
     come in pairs. We can first examine the difference within
     each pair, take the mean difference as a sample mean, and
     then compare it with the hypothesized population mean to
     see whether the difference is significant.


           Medical Theory, Chapter 7, from
            www.37c.com.cn
T-Test (cont.)
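The paired t-test described above can be sketched with the standard library only; the per-query scores below are made-up numbers:

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for two systems'
    per-query scores (e.g. average precision per topic)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    sd = statistics.stdev(diffs)            # sample std dev (n - 1)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1

# Per-topic scores for systems A and B:
t, df = paired_t([0.42, 0.31, 0.55, 0.48], [0.32, 0.11, 0.35, 0.28])
# t = 7.0, df = 3
```

The resulting |t| is then compared against a (two-tailed or one-tailed) t distribution with df degrees of freedom to obtain the p-value.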
Overview
•   Retrieval Effectiveness Evaluation
•   Evaluation Measures
•   Significance Test
•   One Selected SIGIR Paper
    ̵ T. Joachims, L. Granka, B. Pang, H. Hembrooke,
      and G. Gay, Accurately Interpreting
      Clickthrough Data as Implicit Feedback,
      Proceedings of the Conference on Research and
      Development in Information Retrieval (SIGIR),
      2005.
First Author
Introduction
• The user study differs from previous work in at least two
  respects
   ̵ The study provides detailed insight into the users’
     decision-making process through the use of
     eyetracking
   ̵ It evaluates relative preference signals derived from
     user behavior
• Clicking decisions are biased in at least two ways: trust
  bias and quality bias
• Clicks have to be interpreted relative to the order of
  presentation and relative to the other abstracts
User Study
• Designed these studies not only to record and
  evaluate user actions, but also to give insight into
  the decision process that led the user to the action

• This is achieved by recording users’ eye
  movements with an eye tracker
Questions Used
Two Phases of the Study
• Phase I
   ̵ 34 participants
    ̵ Start search with Google query, search for
         answers
• Phase II
     ̵ Investigate how users react to manipulations of
         search results
      ̵ Same instructions as phase I
       ̵ Each subject assigned to one of three
         experimental conditions
      • Normal
      • Swapped
      • Reversed
Explicit Relevance Judgments
• Collected explicit relevance judgments for all queries
  and results pages
   ̵ Phase I
      • Randomized the order of abstracts and asked judges to
        (weakly) order the abstracts
   ̵ Phase II
      • The set for judging is larger: both abstracts
        and Web pages
• Inter-judge agreements
   ̵ Phase I: 89.5%
   ̵ Phase II: abstracts 82.5%, pages 86.4%
Eyetracking
• Fixations
   ̵ 200-300 milliseconds
    ̵ Used in this paper
• Saccades
     ̵ 40-50 milliseconds
• Pupil dilation
Analysis of User Behavior
• Which links do users view and click?

• Do users scan links from top to bottom?

• Which links do users evaluate before clicking?
Which links do users view and
click?




•   The 1st and 2nd links are viewed with almost equal frequency,
    but the 1st link receives far more clicks
•   Once the user has started scrolling, rank appears to become
    less of an influence
Do users scan links from top to
bottom?




• Big gap before viewing 3rd ranked abstract
• Users scan viewable results thoroughly before
  scrolling
Which links do users evaluate
before clicking?




• Abstracts closer above the clicked link are more likely
  to be viewed
• Abstract right below a link is viewed roughly 50% of
  the time
Analysis of Implicit Feedback
• Does relevance influence user decisions?

• Are clicks absolute relevance judgments?

• Are clicks relative relevance judgments?
Does relevance influence user
decisions?
• Yes
• Use the “reversed” condition
   ̵ Controllably decreases the quality of the retrieval
       function and relevance of highly ranked abstracts
• Users react in two ways
    ̵ View lower ranked links more frequently, scan
       significantly more abstracts
     ̵ Subjects are much less likely to click on the first
       link, more likely to click on a lower ranked link
Are clicks absolute relevance
judgments?
• Interpretation is problematic
• Trust Bias
   ̵ Abstract ranked first receives more clicks
     than the second
     • First link is more relevant (not influenced by
       order of presentation) or
     • Users prefer the first link due to some level of
       trust in the search engine (influenced by order
       of presentation)
Trust Bias




• Hypothesis that users are not influenced by
  presentation order can be rejected
• Users have substantial trust in search engine’s ability
  to estimate relevance
Quality Bias
• Quality of the ranking influences the user’s
  clicking behavior
   ̵ If relevance of retrieved results decreases,
      users click on abstracts that are on
      average less relevant
    ̵ Confirmed by the “reversed” condition
Are clicks relative relevance
judgments?
• An accurate interpretation of clicks needs to
  take two biases into consideration, but they
  are difficult to measure explicitly
   ̵ User’s trust in the quality of the search engine
   ̵ Quality of the retrieval function itself
• How about interpreting clicks as pairwise
  preference statements?
• An example
In the example,



Comments:
• Takes trust and quality bias into consideration
• Substantially and significantly better than random
• Close in accuracy to inter-judge agreement
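Strategy 1 (the "Click > Skip Above" heuristic) can be sketched as follows; this is an illustrative reconstruction, not the paper's code. Each clicked link is preferred over every link ranked above it that was not clicked:

```python
def click_gt_skip_above(ranking, clicked):
    """Emit preferences rel(clicked link) > rel(each not-clicked link
    ranked above it), per the 'Click > Skip Above' idea."""
    prefs = []
    for i, d in enumerate(ranking):
        if d in clicked:
            prefs.extend((d, above) for above in ranking[:i]
                         if above not in clicked)
    return prefs

# Links l1..l5 shown in this order; the user clicked l3 and l5:
prefs = click_gt_skip_above(["l1", "l2", "l3", "l4", "l5"], {"l3", "l5"})
# [('l3', 'l1'), ('l3', 'l2'), ('l5', 'l1'), ('l5', 'l2'), ('l5', 'l4')]
```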
Experimental Results
In the example,



Comments:
• Slightly more accurate than Strategy 1
• Not a significant difference in Phase II
Experimental Results
In the example,



Comments:
• Accuracy worse than Strategy 1
• Ranking quality has an effect on the accuracy
Experimental Results
In the example,

      Rel(l5) > Rel(l4)

Comments:
• No significant differences compared to Strategy 1
Experimental Results
In the example,

Rel(l1) > Rel(l2),      Rel(l3) > Rel(l4),    Rel(l5) > Rel(l6)

Comments:
• Highly accurate in the “normal” condition
• But misleading
    ̵ Preferences aligned with the presented ranking are
      probably less valuable for learning
    ̵ The strategy looks accurate even if the user behaves randomly
• Less accurate than Strategy 1 in the “reversed” condition
Experimental Results
Conclusion
• Users’ clicking decisions are influenced by trust bias
  and quality bias, so it is difficult to interpret clicks as
  absolute feedback

• Strategies for generating relative relevance feedback
  signals are shown to correspond well with explicit
  judgments

• While implicit relevance signals are less consistent
  with explicit judgments than the explicit judgments
  are with each other, the difference is
  encouragingly small
Summary
•   Retrieval Effectiveness Evaluation
•   Evaluation Measures
•   Significance Test
•   One Selected SIGIR Paper
    ̵ T. Joachims, L. Granka, B. Pang, H. Hembrooke,
      and G. Gay, Accurately Interpreting
      Clickthrough Data as Implicit Feedback,
      Proceedings of the Conference on Research and
      Development in Information Retrieval (SIGIR),
      2005.
Thanks!

Más contenido relacionado

La actualidad más candente

Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
Dishant Ailawadi
 
2012 data analysis
2012 data analysis2012 data analysis
2012 data analysis
cherylyap61
 
Business Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisBusiness Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysis
Ahsan Khan Eco (Superior College)
 
Basics of data_interpretation
Basics of data_interpretationBasics of data_interpretation
Basics of data_interpretation
Vasista Vinuthan
 
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
Manoj Sharma
 
Fundamentals of data analysis
Fundamentals of data analysisFundamentals of data analysis
Fundamentals of data analysis
Shameem Ali
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Julián Urbano
 

La actualidad más candente (20)

Classification
ClassificationClassification
Classification
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
 
sigir2020
sigir2020sigir2020
sigir2020
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 
Evaluation in Information Retrieval
Evaluation in Information RetrievalEvaluation in Information Retrieval
Evaluation in Information Retrieval
 
2012 data analysis
2012 data analysis2012 data analysis
2012 data analysis
 
Business Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisBusiness Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysis
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
evaluation and credibility-Part 1
evaluation and credibility-Part 1evaluation and credibility-Part 1
evaluation and credibility-Part 1
 
Ankit presentation
Ankit presentationAnkit presentation
Ankit presentation
 
Basics of data_interpretation
Basics of data_interpretationBasics of data_interpretation
Basics of data_interpretation
 
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
Sampling Techniques, Data Collection and tabulation in the field of Social Sc...
 
Statistical Methods in Research
Statistical Methods in ResearchStatistical Methods in Research
Statistical Methods in Research
 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
 
Fundamentals of data analysis
Fundamentals of data analysisFundamentals of data analysis
Fundamentals of data analysis
 
data analysis techniques and statistical softwares
data analysis techniques and statistical softwaresdata analysis techniques and statistical softwares
data analysis techniques and statistical softwares
 
Business statistics
Business statisticsBusiness statistics
Business statistics
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 

Similar a evaluation in infomation retrival

Robert.webster
Robert.websterRobert.webster
Robert.webster
NASAPMC
 

Similar a evaluation in infomation retrival (20)

IR Evaluation using Rank-Biased Precision
IR Evaluation using Rank-Biased PrecisionIR Evaluation using Rank-Biased Precision
IR Evaluation using Rank-Biased Precision
 
An introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information RetrievalAn introduction to system-oriented evaluation in Information Retrieval
An introduction to system-oriented evaluation in Information Retrieval
 
Peer Review in the LiquidPub project
Peer Review in the LiquidPub projectPeer Review in the LiquidPub project
Peer Review in the LiquidPub project
 
Chapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdfChapter 5 Query Evaluation.pdf
Chapter 5 Query Evaluation.pdf
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
sigir2019
sigir2019sigir2019
sigir2019
 
1.measurement&scaling b.com
1.measurement&scaling b.com1.measurement&scaling b.com
1.measurement&scaling b.com
 
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
[IUI 2017] Criteria Chains: A Novel Multi-Criteria Recommendation Approach
 
Case Study in BPM Dashboards
Case Study in BPM DashboardsCase Study in BPM Dashboards
Case Study in BPM Dashboards
 
Introduction to Core Assessments
Introduction to Core AssessmentsIntroduction to Core Assessments
Introduction to Core Assessments
 
Spc
SpcSpc
Spc
 
Iir 08 ver.1.0
Iir 08 ver.1.0Iir 08 ver.1.0
Iir 08 ver.1.0
 
information technology materrailas paper
information technology materrailas paperinformation technology materrailas paper
information technology materrailas paper
 
Project Quality management
Project Quality managementProject Quality management
Project Quality management
 
Multi-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender SystemsMulti-method Evaluation in Scientific Paper Recommender Systems
Multi-method Evaluation in Scientific Paper Recommender Systems
 
Analytics Boot Camp - Slides
Analytics Boot Camp - SlidesAnalytics Boot Camp - Slides
Analytics Boot Camp - Slides
 
Scor model
Scor modelScor model
Scor model
 
Tech Prep Review and Improvement Process
Tech Prep Review and Improvement ProcessTech Prep Review and Improvement Process
Tech Prep Review and Improvement Process
 
Evaluating Search Performance
Evaluating Search PerformanceEvaluating Search Performance
Evaluating Search Performance
 
Robert.webster
Robert.websterRobert.webster
Robert.webster
 

Último

20240429 Calibre April 2024 Investor Presentation.pdf
20240429 Calibre April 2024 Investor Presentation.pdf20240429 Calibre April 2024 Investor Presentation.pdf
20240429 Calibre April 2024 Investor Presentation.pdf
Adnet Communications
 
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
dipikadinghjn ( Why You Choose Us? ) Escorts
 
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
dipikadinghjn ( Why You Choose Us? ) Escorts
 
VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...
VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...
VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...
dipikadinghjn ( Why You Choose Us? ) Escorts
 

Último (20)

Top Rated Pune Call Girls Viman Nagar ⟟ 6297143586 ⟟ Call Me For Genuine Sex...
Top Rated  Pune Call Girls Viman Nagar ⟟ 6297143586 ⟟ Call Me For Genuine Sex...Top Rated  Pune Call Girls Viman Nagar ⟟ 6297143586 ⟟ Call Me For Genuine Sex...
Top Rated Pune Call Girls Viman Nagar ⟟ 6297143586 ⟟ Call Me For Genuine Sex...
 
Indore Real Estate Market Trends Report.pdf
Indore Real Estate Market Trends Report.pdfIndore Real Estate Market Trends Report.pdf
Indore Real Estate Market Trends Report.pdf
 
20240429 Calibre April 2024 Investor Presentation.pdf
20240429 Calibre April 2024 Investor Presentation.pdf20240429 Calibre April 2024 Investor Presentation.pdf
20240429 Calibre April 2024 Investor Presentation.pdf
 
Kharghar Blowjob Housewife Call Girls NUmber-9833754194-CBD Belapur Internati...
Kharghar Blowjob Housewife Call Girls NUmber-9833754194-CBD Belapur Internati...Kharghar Blowjob Housewife Call Girls NUmber-9833754194-CBD Belapur Internati...
Kharghar Blowjob Housewife Call Girls NUmber-9833754194-CBD Belapur Internati...
 
Call Girls in New Friends Colony Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escort...
Call Girls in New Friends Colony Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escort...Call Girls in New Friends Colony Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escort...
Call Girls in New Friends Colony Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escort...
 
Booking open Available Pune Call Girls Wadgaon Sheri 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Wadgaon Sheri  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Wadgaon Sheri  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Wadgaon Sheri 6297143586 Call Hot Ind...
 
The Economic History of the U.S. Lecture 19.pdf
The Economic History of the U.S. Lecture 19.pdfThe Economic History of the U.S. Lecture 19.pdf
The Economic History of the U.S. Lecture 19.pdf
 
(INDIRA) Call Girl Mumbai Call Now 8250077686 Mumbai Escorts 24x7
(INDIRA) Call Girl Mumbai Call Now 8250077686 Mumbai Escorts 24x7(INDIRA) Call Girl Mumbai Call Now 8250077686 Mumbai Escorts 24x7
(INDIRA) Call Girl Mumbai Call Now 8250077686 Mumbai Escorts 24x7
 
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
VIP Independent Call Girls in Mumbai 🌹 9920725232 ( Call Me ) Mumbai Escorts ...
 
WhatsApp 📞 Call : 9892124323 ✅Call Girls In Chembur ( Mumbai ) secure service
WhatsApp 📞 Call : 9892124323  ✅Call Girls In Chembur ( Mumbai ) secure serviceWhatsApp 📞 Call : 9892124323  ✅Call Girls In Chembur ( Mumbai ) secure service
WhatsApp 📞 Call : 9892124323 ✅Call Girls In Chembur ( Mumbai ) secure service
 
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
05_Annelore Lenoir_Docbyte_MeetupDora&Cybersecurity.pptx
 
The Economic History of the U.S. Lecture 23.pdf
The Economic History of the U.S. Lecture 23.pdfThe Economic History of the U.S. Lecture 23.pdf
The Economic History of the U.S. Lecture 23.pdf
 
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
VIP Call Girl in Mumbai 💧 9920725232 ( Call Me ) Get A New Crush Everyday Wit...
 
Solution Manual for Principles of Corporate Finance 14th Edition by Richard B...
Solution Manual for Principles of Corporate Finance 14th Edition by Richard B...Solution Manual for Principles of Corporate Finance 14th Edition by Richard B...
Solution Manual for Principles of Corporate Finance 14th Edition by Richard B...
 
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
Mira Road Memorable Call Grls Number-9833754194-Bhayandar Speciallty Call Gir...
 
Stock Market Brief Deck (Under Pressure).pdf
Stock Market Brief Deck (Under Pressure).pdfStock Market Brief Deck (Under Pressure).pdf
Stock Market Brief Deck (Under Pressure).pdf
 
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
TEST BANK For Corporate Finance, 13th Edition By Stephen Ross, Randolph Weste...
 
The Economic History of the U.S. Lecture 25.pdf
The Economic History of the U.S. Lecture 25.pdfThe Economic History of the U.S. Lecture 25.pdf
The Economic History of the U.S. Lecture 25.pdf
 
Mira Road Awesome 100% Independent Call Girls NUmber-9833754194-Dahisar Inter...
Mira Road Awesome 100% Independent Call Girls NUmber-9833754194-Dahisar Inter...Mira Road Awesome 100% Independent Call Girls NUmber-9833754194-Dahisar Inter...
Mira Road Awesome 100% Independent Call Girls NUmber-9833754194-Dahisar Inter...
 
VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...
VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...
VIP Call Girl in Mira Road 💧 9920725232 ( Call Me ) Get A New Crush Everyday ...
 

evaluation in infomation retrival

  • 1. Evaluation in Information Retrieval Ruihua Song Web Search and Mining Group Email: rsong@microsoft.com
  • 2. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  • 3. How to evaluate? • How well does system meet information need? ̵ System evaluation: how good are document rankings? ̵ User-based evaluation: how satisfied is user?
  • 4. Ellen Voorhees, The TREC Conference: An Introduction
  • 5. Ellen Voorhees, The TREC Conference: An Introduction
  • 6. Ellen Voorhees, The TREC Conference: An Introduction
  • 7. Ellen Voorhees, The TREC Conference: An Introduction
  • 8. Ellen Voorhees, The TREC Conference: An Introduction
  • 9. Ellen Voorhees, The TREC Conference: An Introduction
  • 10. Ellen Voorhees, The TREC Conference: An Introduction
  • 11. Ellen Voorhees, The TREC Conference: An Introduction
  • 12. Ellen Voorhees, The TREC Conference: An Introduction
  • 13. Ellen Voorhees, The TREC Conference: An Introduction
  • 14. Evaluation Challenges On The Web • Collection is dynamic ̵ 10-20% urls change every month • Queries are time sensitive ̵ Topics are hot then they ae not • Spam methods evolve ̵ Algorithms evaluated against last month’s web may not work today • But we have a lot of users… you can use clicks as supervision SIGIR'05 Keynote given by Amit Singhal from Google
  • 15. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  • 16. Ellen Voorhees, The TREC Conference: An Introduction
  • 17. P-R curve • Precision and recall • Precision-recall curve • Average precision-recall curve
  • 18. P-R curve (cont.) • For a query there is a result list (answer set) R A (Relevant Docs) Ra (Answer Set)
  • 19. P-R curve (cont.) • Recall is fraction of the relevant | Ra | document which has been retrieved recall = |R| | Ra | • Precision is fraction of the retrieved precision = | A| document which is relevant
  • 20. P-R curve (cont.) • E.g. ̵ For some query, |Total Docs|=200,|R|=20 ̵ r: relevant ̵ n: non-relevant ̵ At rank 10,recall=6/20,precision=6/10 r n n r r n r n r r d , d , d , d , d , d , d , d , d , d ,... 123 84 5 87 80 59 90 8 89 55
  • 23. MAP • Mean Average Precision • Defined as mean of the precision obtained after each relevant document is retrieved, using zero as the precision for document that are not retrieved.
  • 24. MAP (cont.) • E.g. ̵ |Total Docs|=200, |R|=20 ̵ The whole result list consist of 10 docs is as follow ̵ r-rel ̵ n-nonrelevant ̵ MAP = (1+2/4+3/5+4/7+5/9+6/10)/6 r n n r r n r n r r d ,d ,d ,d ,d ,d ,d ,d ,d ,d 123 84 5 87 80 59 90 8 89 55
  • 25. Precision at 10 • P@10 is the number of relevant documents in the top 10 documents in the ranked list returned for a topic • E.g. ̵ there is 3 documents in the top 10 documents that is relevant ̵ P@10=0.3
  • 26. Mean Reciprocal Rank • MRR is the reciprocal of the first relevant document’s rank in the ranked list returned for a topic • E.g. ̵ the first relevant document is ranked as No.4 ̵ MRR = ¼ = 0.25
  • 27. bpref • Bpref stands for Binary Preference • Consider only judged docs in result list • The basic idea is to count number of time judged non-relevant docs retrieval before judged relevant docs
  • 29. bpref (cont.) • E.g. ̵ |Total Docs| =200, |R|=20 ̵ r: judged relevant ̵ n: judged non-relevant ̵ u: not judged, unknown whether relevant or not r n n u r n r u u r d , d , d , d , d , d , d , d , d , d ,... 123 84 5 87 80 59 90 8 89 55
  • 30. References • Baeza-Yates, R. & Ribeiro-Neto, B. Modern Information Retrieval Addison Wesley, 1999 , 73-96 • Buckley, C. & Voorhees, E.M. Retrieval Evaluation with Incomplete Information Proceedings of SIGIR 2004
  • 31. NDCG • Two assumptions about ranked result list ̵ Highly relevant document are more valuable ̵ The greater the ranked position of a relevant document , the less valuable it is for the user
  • 32. NDCG (cont.) • Graded judgment -> gain vector • Cumulated Gain
  • 33. NDCG (cont.) • Discounted CG • Discounting function
  • 34. NDCG (cont.) • Ideal (D)CG vector
  • 38. NDCG (cont.) • Pros. ̵ Graded, more precise than R-P ̵ Reflect more user behavior (e.g. user persistence) ̵ CG and DCG graphs are intuitive to interpret • Cons. ̵ Disagreements in rating ̵ How to set parameters
  • 39. Reference • Jarvelin, K. & Kekalainen, J. Cumulated Gain-based Evaluation of IR Techniques ACM Transactions on Information Systems, 2002 , 20 , 422-446
  • 40. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper
  • 41. Significance Test • Significance Test ̵ Why is it necessary? ̵ T-Test is chosen in IR experiments • Paired • Two-tailed / One-tailed
  • 42. Is the difference significant? • Two almost same systems p(.) Green < Yellow ? p(.) score The difference is significant or just caused by chance score
  • 43. T-Test • 样本均值和总体均值的比较 ̵ 为了判断观察出的一组计量数据是否与其总体均值 接近,两者的相差是同一总体样本与总体之间的误 差,还是已超出抽样误差的允许范围而存在显著差 别? • 成对资料样本均值的比较 ̵ 有时我们并不知道总体均值,且数据成对关联。我 们可以先初步观察每对数据的差别情况,进一步算 出平均相差为样本均值,再与假设的总体均值比较 看相差是否显著 医学理论第七章 摘自 www.37c.com.cn
• 44-49. T-Test (cont.) ̵ [Slides 44-49 present the t-test formulas and worked examples as images, excerpted from Medical Theory, Chapter 7, www.37c.com.cn]
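For the paired case used in IR experiments (per-topic scores of two systems over the same topic set), the t statistic can be computed with the standard library alone; this is a generic sketch, not the textbook's exact worked example:

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Paired t-test statistic: test whether the mean per-topic
    difference between systems a and b differs from zero.
    Returns (t, degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n-1 denominator)
    return mean / (sd / math.sqrt(n)), n - 1
```

For a two-tailed test at significance level 0.05, |t| is then compared against the critical value t(0.025, n-1); for a one-tailed test, against t(0.05, n-1).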
  • 50. Overview • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper ̵ T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.
• 52. Introduction • The user study differs in at least two respects from previous work ̵ The study provides detailed insight into the users’ decision-making process through the use of eyetracking ̵ It evaluates relative preference signals derived from user behavior • Clicking decisions are biased in at least two ways, trust bias and quality bias • Clicks have to be interpreted relative to the order of presentation and relative to the other abstracts
• 53. User Study • The studies were designed not only to record and evaluate user actions, but also to give insight into the decision process that led the user to the action • This is achieved by recording users’ eye movements with an eye tracker
  • 55. Two Phases of the Study • Phase I ̵ 34 participants ̵ Start search with Google query, search for answers • Phase II ̵ Investigate how users react to manipulations of search results ̵ Same instructions as phase I ̵ Each subject assigned to one of three experimental conditions • Normal • Swapped • Reversed
• 56. Explicit Relevance Judgments • Collected explicit relevance judgments for all queries and results pages ̵ Phase I • Randomized the order of abstracts and asked judges to (weakly) order the abstracts ̵ Phase II • The set for judging includes more • Abstracts and Web pages • Inter-judge agreements ̵ Phase I: 89.5% ̵ Phase II: abstract 82.5%, page 86.4%
  • 57. Eyetracking • Fixations ̵ 200-300 milliseconds ̵ Used in this paper • Saccades ̵ 40-50 milliseconds • Pupil dilation
  • 58. Analysis of User Behavior • Which links do users view and click? • Do users scan links from top to bottom? • Which links do users evaluate before clicking?
  • 59. Which links do users view and click? • Almost equal frequency of 1st and 2nd link, but more clicks on 1st link • Once the user has started scrolling, rank appears to become less of an influence
  • 60. Do users scan links from top to bottom? • Big gap before viewing 3rd ranked abstract • Users scan viewable results thoroughly before scrolling
  • 61. Which links do users evaluate before clicking? • Abstracts closer above the clicked link are more likely to be viewed • Abstract right below a link is viewed roughly 50% of the time
  • 62. Analysis of Implicit Feedback • Does relevance influence user decisions? • Are clicks absolute relevance judgments? • Are clicks relative relevance judgments?
  • 63. Does relevance influence user decisions? • Yes • Use the “reversed” condition ̵ Controllably decreases the quality of the retrieval function and relevance of highly ranked abstracts • Users react in two ways ̵ View lower ranked links more frequently, scan significantly more abstracts ̵ Subjects are much less likely to click on the first link, more likely to click on a lower ranked link
  • 64. Are clicks absolute relevance judgments? • Interpretation is problematic • Trust Bias ̵ Abstract ranked first receives more clicks than the second • First link is more relevant (not influenced by order of presentation) or • Users prefer the first link due to some level of trust in the search engine (influenced by order of presentation)
  • 65. Trust Bias • Hypothesis that users are not influenced by presentation order can be rejected • Users have substantial trust in search engine’s ability to estimate relevance
  • 66. Quality Bias • Quality of the ranking influences the user’s clicking behavior ̵ If relevance of retrieved results decreases, users click on abstracts that are on average less relevant ̵ Confirmed by the “reversed” condition
• 67. Are clicks relative relevance judgments? • An accurate interpretation of clicks needs to take two biases into consideration, but they are difficult to measure explicitly ̵ User’s trust in the quality of the search engine ̵ Quality of the retrieval function itself • How about interpreting clicks as pairwise preference statements? • An example
  • 68. In the example, Comments: • Takes trust and quality bias into consideration • Substantially and significantly better than random • Close in accuracy to inter judge agreement
  • 70. In the example, Comments: • Slightly more accurate than Strategy 1 • Not a significant difference in Phase II
  • 72. In the example, Comments: • Accuracy worse than Strategy 1 • Ranking quality has an effect on the accuracy
  • 74. In the example, Rel(l5) > Rel(l4) Comments: • No significant differences compared to Strategy 1
• 76. In the example, Rel(l1) > Rel(l2), Rel(l3) > Rel(l4), Rel(l5) > Rel(l6) Comments: • Highly accurate in the “normal” condition • Misleading ̵ Aligned preferences probably less valuable for learning ̵ Better results even if user behaves randomly • Less accurate than Strategy 1 in the “reversed” condition
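The paper's Strategy 1, "Click > Skip Above", can be sketched as follows (function and variable names are illustrative): each clicked result is preferred over every unclicked result ranked above it, which is how the relative preference statements discussed on these slides are extracted from a clickthrough log.

```python
def click_gt_skip_above(ranking, clicked):
    """'Click > Skip Above' (Joachims et al., SIGIR 2005): a clicked
    result is preferred over every result ranked above it that the
    user scanned past without clicking.
    Returns (preferred, less-preferred) pairs."""
    clicked = set(clicked)
    prefs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            prefs.extend((doc, above) for above in ranking[:i]
                         if above not in clicked)
    return prefs
```

E.g. a click on l3 in the ranking l1, l2, l3 yields Rel(l3) > Rel(l1) and Rel(l3) > Rel(l2); because both preferences compare against results presented above the click, the strategy compensates for trust bias rather than treating the click as an absolute judgment.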
• 78. Conclusion • Users’ clicking decisions are influenced by trust bias and quality bias, so it is difficult to interpret clicks as absolute feedback • Strategies for generating relative relevance feedback signals are shown to correspond well with explicit judgments • While implicit relevance signals are less consistent with explicit judgments than the explicit judgments are with each other, the difference is encouragingly small
  • 79. Summary • Retrieval Effectiveness Evaluation • Evaluation Measures • Significance Test • One Selected SIGIR Paper ̵ T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), 2005.