Copulas for
Information Retrieval
Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson
Copulas – What is it all about?
• Assume two sufficiently different
commodities
• Rare elemental metals
• Pork bellies
• No apparent correlations
[Figure: chart comparing Rare Earths and Pork Bellies]
Copulas – What is it all about?
• Two seemingly independent variables
• Yet, for rare extreme cases, there are
co-movements
• “Tail dependencies”
• Copulas decouple observations and
dependencies
• IR models are good at estimating marginals
• Copulas are good at combining them
Overview
1. Non-linear Dependency Structures in IR
2. Copulas – Intuition & Background
3. Multivariate Relevance Estimation
4. When to use them?
5. Score Fusion
6. Conclusion & Future Directions
1. Non-Linear Dependency Structures in IR
Multivariate Relevance Modelling
• IR Systems index and retrieve a growing variety of document types
• Many structured, or at least “complex”
• Single-criteria relevance frameworks do not perform well
• Multi-criteria models tend to be either:
a) Naïve (e.g., independence assumption), or,
b) Hard to qualitatively interpret for humans (e.g., L2R)
Non-Linear Dependencies
• Non-linear dependency structures are still a challenge
• TREC 2010 Faceted Blog Distillation Task, Topic 1171, “mysql”
• Relevance Criteria:
• Topicality
• Subjectivity
Non-Linear Dependencies
• Pearson’s ρ = 0.18
• So, there is no real dependency
• …right?
• In the lower third of the scale, we note ρ = 0.37
• And in the upper third, it turns to ρ = -0.4
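The effect described on this slide can be checked with a short sketch that computes Pearson's ρ over the whole sample and again over the lower and upper thirds. The slide does not say how "the scale" is partitioned; splitting by rank on the first criterion is an assumption made here, and the function names are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(vx * vy)

def pearson_by_thirds(xs, ys):
    """Return (overall, lower-third, upper-third) correlations.

    Assumption: thirds are taken by rank on xs (the first criterion).
    """
    pairs = sorted(zip(xs, ys))
    k = len(pairs) // 3
    if k < 2:
        raise ValueError("need at least 6 observations")

    def rho(chunk):
        return pearson([x for x, _ in chunk], [y for _, y in chunk])

    return rho(pairs), rho(pairs[:k]), rho(pairs[-k:])
```

A weak overall coefficient next to clearly different coefficients in the two tails is the pattern that motivates modelling the dependency structure explicitly.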
2. Copulas – Intuition & Background
Copulas (from copulare, to join)
• Copulas model complex non-linear dependencies between variables
that simple correlations can't capture
• Decouple marginal distributions from dependency structure
• Approximate joint multivariate distributions
• Applied previously in portfolio and risk management, meteorology,
river flooding predictions, …
Formal Basics
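• Given a k-dimensional random vector
• Map each component to the unit cube through its marginal cdf
• Describe the joint cdf with a copula
• Isolation of a component: with all other arguments set to 1, the copula returns that component
• Copula’s zero: the copula is 0 whenever any argument is 0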
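The equations that accompanied these bullets did not survive in the transcript. The standard textbook definitions they correspond to can be written as follows; the symbols X, F_i, u_i and C are chosen here for illustration and may differ from the notation on the original slide.

```latex
% Random vector with continuous marginal cdfs F_1, ..., F_k
X = (x_1, \dots, x_k)

% Map to the unit cube via the marginals
U = (u_1, \dots, u_k) = \bigl(F_1(x_1), \dots, F_k(x_k)\bigr) \in [0,1]^k

% Joint cdf expressed through a copula C (Sklar's theorem)
F(x_1, \dots, x_k) = C\bigl(F_1(x_1), \dots, F_k(x_k)\bigr)

% Isolation of a component
C(1, \dots, 1, u_i, 1, \dots, 1) = u_i

% Copula's zero
C(u_1, \dots, u_k) = 0 \quad \text{if } u_i = 0 \text{ for any } i
```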
Closing the circle
• Recall the example TREC topic 1171
• Linear combination: AP = 0.14,
below collection average (0.25)
• Fit Clayton copula to model joint
relevance distribution
• AP rises to 0.22
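As a rough illustration of the Clayton copula named above, the sketch below evaluates the bivariate Clayton cdf on two relevance scores that have already been mapped to [0, 1] by their marginal cdfs. The value of theta and the function names are placeholders; the slides do not report the fitted parameter.

```python
def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v) for theta > 0 and u, v in [0, 1]."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if u == 0.0 or v == 0.0:
        return 0.0  # copula's zero
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def joint_relevance(topicality_u, subjectivity_u, theta=2.0):
    """Joint score from two marginal cdf values (theta=2.0 is a hypothetical value)."""
    return clayton_cdf(topicality_u, subjectivity_u, theta)
```

Documents would then be ranked by `joint_relevance` instead of by a weighted sum of the two criteria.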
3. Multivariate Relevance Estimation
Joint Relevance Estimation
• Estimate marginal distributions from data
• Estimate copula fitting parameters to maximize posterior probability of
observing data
• Use copula to represent joint probability of relevance
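The three steps above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes empirical (rank-based) marginals, a bivariate Clayton copula, and a plain grid search that maximizes the pseudo-log-likelihood of the data. All function names and the grid are hypothetical.

```python
import math

def pseudo_observations(values):
    """Map raw scores into (0, 1) using scaled ranks (empirical cdf)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)  # n + 1 keeps values strictly inside (0, 1)
    return u

def clayton_log_density(u, v, theta):
    """Log of the bivariate Clayton copula density for theta > 0."""
    s = u ** -theta + v ** -theta - 1.0
    return (math.log1p(theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))

def fit_clayton(us, vs, grid=None):
    """Pick the theta on a grid that maximizes the pseudo-log-likelihood."""
    grid = grid or [0.05 * k for k in range(1, 201)]  # theta in (0, 10]

    def loglik(theta):
        return sum(clayton_log_density(u, v, theta) for u, v in zip(us, vs))

    return max(grid, key=loglik)
```

A grid search keeps the sketch dependency-free; a numerical optimizer would be the more usual choice in practice.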
Joint Relevance Estimation
• We study three different scenarios:
• Opinionated blog posts
• Personalized bookmarks
• Child-friendly websites
• Use original training portion of the corpora where available
• A 90/10 split otherwise
Results I – Opinionated Blog Posts
• TREC Blogs08 dataset
• 1.3 M documents
• Relevance dimensions: Topicality & Subjectivity
• Significantly higher performance than linear combination model
Results II – Personalized Bookmarks
• Dataset by Vallet & Castells
• 339k documents
• Relevance Dimensions: Topicality & Personal relevance
• Significant performance gains in some metrics
Results III – Child-friendly Websites
• Dataset from the PuppyIR project (http://puppyir.eu)
• 22k documents
• Relevance Dimensions: Topicality & Child-suitability
• Worse-than-baseline performance
4. Copulas – When to use them?
When to use them?
• Previously: Strongly varying performance for different settings
• Is there a way of predicting the merit?
• Recall: copulas model tail dependencies between dimensions
Types of Tail Dependencies
Measuring Tail Dependencies
• According to Frees and Valdez 1998: IL and IU measure strength of
lower and upper tail dependencies
• Anderson-Darling test of goodness-of-fit between copula and
observed data
Domain                    Frees tail index   Anderson-Darling   Actual retrieval performance
Opinionated Blogs         IL = 0.07          0.67               Copulas > linear
Personalized Bookmarks    IU = 0.49          0.47               Copulas = linear
Child-friendly Websites   IL = IU = 0        0.046              Copulas < linear
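To make the notion of lower and upper tail dependence concrete, the sketch below computes simple empirical estimates from pseudo-observations in (0, 1). This is a generic conditional-frequency estimator chosen for illustration; it is not claimed to be the Frees and Valdez index used in the table, and the threshold q is an arbitrary assumption.

```python
def tail_dependence(us, vs, q=0.1):
    """Return (lower, upper) empirical tail-dependence estimates at level q.

    lower: share of points with v <= q among those with u <= q
    upper: share of points with v >= 1 - q among those with u >= 1 - q
    """
    pairs = list(zip(us, vs))
    low_u = [v for u, v in pairs if u <= q]
    high_u = [v for u, v in pairs if u >= 1.0 - q]
    lower = sum(v <= q for v in low_u) / len(low_u) if low_u else 0.0
    upper = sum(v >= 1.0 - q for v in high_u) / len(high_u) if high_u else 0.0
    return lower, upper
```

Values near 0 in both tails would suggest that there is little tail dependence for a copula to exploit.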
5. Copulas for Score Fusion
Score Fusion
• A different angle on relevance estimation
• Combine individual retrieval system scores instead of modelling relevance
from content criteria
• In this setting, submissions to historic TRECs serve as criteria
• We randomly draw k individual runs and combine them using copulas
Fusion Methods
• Established: • Copula-based:
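The formulas on this slide were not preserved in the transcript. For reference, the two established baselines named on the later robustness slides, CombSUM and CombMNZ, can be sketched as below. Representing each run as a dict from document id to normalized score is an assumption of this sketch, and the copula-based variants are not reproduced because their definitions are missing here.

```python
def comb_sum(runs, doc):
    """CombSUM: sum of the scores a document receives across runs."""
    return sum(run.get(doc, 0.0) for run in runs)

def comb_mnz(runs, doc):
    """CombMNZ: CombSUM multiplied by the number of runs with a non-zero score."""
    hits = sum(1 for run in runs if run.get(doc, 0.0) > 0.0)
    return hits * comb_sum(runs, doc)

def fuse(runs, scorer):
    """Rank all documents seen in any run by the given fusion scorer."""
    docs = set().union(*runs)
    return sorted(docs, key=lambda d: scorer(runs, d), reverse=True)
```

For example, `fuse(runs, comb_mnz)` returns document ids ordered by their CombMNZ score.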
Results – TREC 4
• Results are averaged across 200 randomizations per setting of k
• Relative improvements over the best, worst and median fused run in
terms of percentages of MAP
• Small but consistent improvements over non-copula fusion baselines
Robustness - CombSUM
• Fusion approaches are often
sensitive to weak contributions
• We control the number of weak
submissions added to the fusion
• Copulas’ explicit modeling of
dependency structure is more
robust
Robustness - CombMNZ
• Fusion approaches are often
sensitive to weak contributions
• We control the number of weak
submissions added to the fusion
• Copulas’ explicit modeling of
dependency structure is more
robust
6. Conclusion and Future Directions
Conclusion
• Copulas decouple observations and dependencies
• IR models are good at estimating marginals
• Copulas are good at combining them
• We use them for multivariate relevance estimation
• Strongly scenario-dependent performance
• Tail indices & goodness of fit tests as estimators of expected performance
• Copulas for score fusion
• Robust to outliers
The Road Ahead
• Currently, we use single copulas for relevance modelling
• Copula mixtures and composite Archimedean copulas for higher accuracy
• Here, we use pre-existing copula families and fit them to data
• Instead, can we formalize copulas from scratch to include domain knowledge?
• So far, we explored two-dimensional relevance spaces
• What happens as we move into higher-order systems?
Thank You!

Copulas for Information Retrieval (SIGIR'13)
