The Philosophy of Information Retrieval Evaluation (2001)
by Ellen Voorhees
The Author

• Computer scientist, Retrieval Group,
  NIST (15 years)
    o   TREC, TRECVid, and TAC - large-scale evaluation of
        technologies for processing natural language text and
        searching diverse media types
•   Research focus: "developing and validating
    appropriate evaluation schemes to measure system
    effectiveness in these areas"

• Siemens Corporate Research (9 years)
    o   factory automation, intelligent agents, agents
        applied to information access




                  http://www.linkedin.com/pub/ellen-voorhees/6/115/3b8
NIST (National Institute of Standards and
Technology)
• Non-regulatory agency of U.S. Dept of Commerce

• "Promote U.S. innovation and industrial competitiveness [...]
  enhance economic security and improve our quality of life"

• Estimated 2011 budget: $722 million

• Standard Reference Materials (experimental control samples,
  quality control benchmarks), election technology, ID cards

• 3 Nobel Prize Winners




          http://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology
Premises

• User-based evaluation (p.1)

  o   better, more direct measure of user needs
  o   BUT very expensive and difficult to execute properly

• System evaluation (p.1)

  o   less expensive
  o   abstraction of retrieval process
  o   can control variables
         increases power of comparative experiments
  o   diagnostic information about system behavior
The Cranfield Paradigm
• Dominant model for 4 decades (p.1)

• Cranfield 2 experiment (1960s) - first laboratory testing of an
  IR system (p.2)

   o   investigated which indexing language is best
   o   design: measure the performance of index languages
       free from contamination by operational variables
   o   aeronautics experts, aeronautics collection
   o   test collection: documents, information needs/topics,
       relevance judgment set
   o   assumptions:
          relevance approximated by topical similarity
          single judgment set representative of user population
          lists of relevant documents for each topic complete
Modern Adaptations to Cranfield
Assumptions of the paradigm are not strictly true; need to
decrease noise (p.3)
• Why the assumptions fail
   o   modern collections are larger and more diverse
   o   relevance judgments are less complete

• Adaptations:
   o Ranked list of documents for each topic
        ordered by decreasing retrieval likelihood
   o Effectiveness as a whole computed as average across
     topics
   o Large number of topics
   o Use pooling (judge only a subset of documents per topic)
     instead of complete judgments (p.4)
   o Assumptions don't need to be strictly true for test
     collection to be viable
        evaluation is comparative: scores of different retrieval
        runs are compared on the same test collection
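
The adaptations above (one ranked list per topic, effectiveness averaged across topics, runs compared on the same collection) can be sketched as follows. This is a minimal illustration, not the paper's procedure: the slides do not name a measure, so average precision is an assumed choice, and the topic ids, document ids, and run names are hypothetical toy data.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a binary relevant set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(run: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average the per-topic scores so that every topic counts equally."""
    scores = [average_precision(run.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy collection: two topics, two runs scored on the same qrels.
qrels = {"t1": {"d1", "d3"}, "t2": {"d7"}}
run_a = {"t1": ["d1", "d2", "d3"], "t2": ["d8", "d7"]}
run_b = {"t1": ["d2", "d3", "d1"], "t2": ["d7", "d8"]}

map_a = mean_average_precision(run_a, qrels)
map_b = mean_average_precision(run_b, qrels)
print(f"run_a: {map_a:.4f}  run_b: {map_b:.4f}")
```

Only the comparison between the two runs is meaningful here; the absolute scores depend on how complete the judgments are.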
How to Build a Test Collection
(TREC example)
• Set of documents and topics (reflective of operational setting
  and real tasks) (p.4)
   o e.g. law articles for law library

• Participants run topics against documents
   o return top documents per topic

• Pool formed, then judged by relevance assessors
   o evaluated using relevance judgments (binary)

• Results returned to participant

• Relevance judgments turn documents and topics into test
  collection (p.5)
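
The pooling step above can be sketched as taking the union of the top-ranked documents from every participant run, per topic. This is a minimal sketch under stated assumptions: the function name `build_pools`, the `depth` parameter, and the toy runs are hypothetical, and the slides do not say how many documents per run go into the pool.

```python
from typing import Dict, List, Set


def build_pools(runs: List[Dict[str, List[str]]], depth: int) -> Dict[str, Set[str]]:
    """Union of the top `depth` documents of every run, per topic."""
    pools: Dict[str, Set[str]] = {}
    for run in runs:
        for topic, ranking in run.items():
            pools.setdefault(topic, set()).update(ranking[:depth])
    return pools


# Hypothetical runs from two participants for a single topic.
runs = [
    {"t1": ["d1", "d2", "d3"]},
    {"t1": ["d2", "d4", "d1"]},
]
pools = build_pools(runs, depth=2)
print(sorted(pools["t1"]))  # only pooled documents go to the assessors
```

Document d3 is ranked by the first run but falls below the pool depth, so it is never judged. This is the source of the incomplete judgments discussed next.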
Effects of Pooling and Incomplete Judgments
• Pooling doesn't produce complete judgments (p.5)
   o Some relevant documents are never judged
   o Relevant documents found later come from lower in the
     system rankings

• Unjudged relevant documents skewed across topics (p.6)
   o topics with many relevant documents initially tend to
     gain more relevant documents later on

• What to do?
  o deep and diverse pool (p.9)
  o recall-oriented manual runs to supplement
  o opt for smaller, fair judgment set rather than larger biased
    set
Assessor Relevance Judgments

• Different judges, different time settings (p.9)

• Different assessors produce different relevance sets for the
  same topics (subjectivity of relevance)

• TREC: 3 judges (p.10)

• Overlap < 50%: assessors disagreed substantially
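
Overlap between assessors can be computed as the size of the intersection of their relevant sets divided by the size of their union. The sketch below assumes that definition; the three judgment sets are hypothetical, and returning 0.0 when nobody marked anything relevant is a convention chosen here.

```python
from typing import Iterable, Set


def overlap(relevant_sets: Iterable[Set[str]]) -> float:
    """Size of the intersection of the relevant sets over the size of their union."""
    sets = list(relevant_sets)
    if not sets:
        return 0.0
    union = set().union(*sets)
    if not union:
        return 0.0  # assumed convention: no relevant documents at all
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


# Hypothetical judgments by three assessors for one topic.
judge_1 = {"d1", "d2", "d3"}
judge_2 = {"d2", "d3", "d4"}
judge_3 = {"d3", "d5"}

three_way = overlap([judge_1, judge_2, judge_3])
pairwise = overlap([judge_1, judge_2])
print(three_way, pairwise)
```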
Evaluating with Assessor Inconsistency
• Rank systems by the score each system obtains (p.10)

• Query-relevance sets (qrels): built from different
  combinations of assessor judgments per topic

• Repeat experiments several times: (p.13)
  o different measures
  o different topic sets
  o different systems
  o different assessor groups

• Result: comparative evaluation of ranked retrieval results is
  stable despite assessor disagreement
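
One way to quantify how stable a system ranking is across two query-relevance sets is a rank correlation such as Kendall's tau. The slides do not name a measure, so this is an assumed choice. The system names and scores below are hypothetical, and the function implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau-a between the system orderings induced by two score tables."""
    systems = sorted(set(scores_a) & set(scores_b))
    pairs = len(systems) * (len(systems) - 1) // 2
    if pairs == 0:
        raise ValueError("need at least two systems scored in both tables")
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        product = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if product > 0:
            concordant += 1   # both qrels order this pair the same way
        elif product < 0:
            discordant += 1   # the two qrels swap this pair
    return (concordant - discordant) / pairs


# Hypothetical mean scores for four systems under two assessor combinations.
under_qrels_1 = {"sysA": 0.30, "sysB": 0.25, "sysC": 0.20, "sysD": 0.10}
under_qrels_2 = {"sysA": 0.28, "sysB": 0.29, "sysC": 0.18, "sysD": 0.09}

tau = kendall_tau(under_qrels_1, under_qrels_2)
print(f"tau = {tau:.3f}")
```

A value near 1 means the two judgment sets rank the systems almost identically, even if the absolute scores differ.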
Cross-Language Collections

• More difficult to build than monolingual collections (p.13)
  o separate set of assessors for each language
  o multiple assessors for 1 topic
  o need diverse pools for all languages
      minority language pools smaller and less diverse (p.14)

• What to do?
  o close coordination for consistency (p.13)
  o proceed with care
Discussion

• Do laboratory experiments translate to operational settings?

• Which metrics or evaluation scores are more meaningful to
  you?

• Are there other ways to reduce noise and error?
