Copulas for Information Retrieval (SIGIR'13)

Copulas for
Information Retrieval
Carsten Eickhoff, Arjen P. de Vries, Kevyn Collins-Thompson

Copulas – What is it all about?
• Assume two sufficiently different
commodities
• Rare elemental metals
• Pork bellies
• No apparent correlations
[Figure: Rare Earths and Pork Bellies series, vertical axis from 0 to 6]

Copulas – What is it all about?
• Two seemingly independent variables
• Yet, for rare extreme cases, there are
co-movements
• “Tail dependencies”
• Copulas decouple observations and
dependencies
• IR models are good at estimating marginals
• Copulas are good at combining them

Overview
1. Non-linear Dependency Structures in IR
2. Copulas – Intuition & Background
3. Multivariate Relevance Estimation
4. When to use them?
5. Score Fusion
6. Conclusion & Future Directions

1. Non-Linear Dependency Structures in IR

Multivariate Relevance Modelling
• IR Systems index and retrieve a growing variety of document types
• Many structured, or at least “complex”
• Single-criteria relevance frameworks do not perform well
• Multi-criteria models tend to be either:
a) Naïve (e.g., independence assumption), or,
b) Hard to qualitatively interpret for humans (e.g., L2R)

Non-Linear Dependencies
• Non-linear dependency structures are still a challenge
• TREC 2010 Faceted Blog Distillation Task, Topic 1171, “mysql”
• Relevance Criteria:
• Topicality
• Subjectivity

• Pearson’s ρ = 0.18
• So, there is no real dependency
• …right?
• In the lower third of the scale, we note ρ = 0.37
• And in the upper third, it turns to ρ = -0.4
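
A minimal sketch of this segment-wise check, assuming "third of the scale" means the lowest and highest third of documents when ranked by the first criterion (the slides do not specify this). The data are synthetic, not taken from TREC, and the function names are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def pearson_by_thirds(xs, ys):
    """Correlation overall, in the lowest third and in the highest third (by x)."""
    pairs = sorted(zip(xs, ys))      # rank pairs by the first criterion
    k = len(pairs) // 3              # size of one third
    lower_x, lower_y = zip(*pairs[:k])
    upper_x, upper_y = zip(*pairs[-k:])
    return pearson(xs, ys), pearson(lower_x, lower_y), pearson(upper_x, upper_y)

# Synthetic scores: y rises with x at the low end and falls at the high end.
topicality = [1, 2, 3, 4, 5, 6, 7, 8, 9]
subjectivity = [1, 2, 3, 4, 3, 4, 3, 2, 1]
overall, lower, upper = pearson_by_thirds(topicality, subjectivity)
```

On this toy data the overall coefficient is close to zero while the two segments have opposite signs, which is the kind of pattern a single global correlation hides.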

2. Copulas – Intuition & Background

Copulas (from copulare, to join)
• Copulas model complex non-linear dependencies between variables
that simple correlations can't capture
• Decouple marginal distributions from dependency structure
• Approximate joint multivariate distributions
• Applied previously in portfolio and risk management, meteorology,
river flooding predictions, …

Formal Basics
• Given a k-dimensional rv
• Map to unit cube
• Describe joint cdf with copula
• Isolation of a component
• Copula’s zero

Closing the circle
• Recall the example TREC topic 1171
• Linear combination: AP = 0.14,
below collection average (0.25)
• Fit Clayton copula to model joint
relevance distribution
• AP rises to 0.22
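
For reference, a minimal sketch of the bivariate Clayton copula in its standard form, C(u, v) = (u^-θ + v^-θ - 1)^(-1/θ) with θ > 0. The function name is illustrative, and how the slides turn the copula value into a ranking score is not specified, so that step is left out.

```python
def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v) for theta > 0 and u, v in [0, 1]."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if u == 0.0 or v == 0.0:
        return 0.0  # the copula's zero: any component at 0 gives 0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

# Joint probability that both uniform-scaled criteria fall at or below 0.5.
joint_half = clayton_cdf(0.5, 0.5, 2.0)
```

With θ = 2 the value at (0.5, 0.5) exceeds the 0.25 that independent criteria would give, reflecting the positive dependence the Clayton family models, which is strongest in the lower tail.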

3. Multivariate Relevance Estimation

Joint Relevance Estimation
• Estimate marginal distributions from data
• Estimate copula fitting parameters to maximize posterior probability of
observing data
• Use copula to represent joint probability of relevance
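
The three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: marginals are estimated with the empirical cdf (rank / (n + 1)), the copula family is assumed to be Clayton, and the parameter is chosen by a plain maximum-likelihood grid search rather than the posterior maximization mentioned on the slide. All function names are hypothetical.

```python
import math

def pseudo_observations(values):
    """Map raw scores into (0, 1) via their empirical cdf: rank / (n + 1).
    Ties are broken by input order in this sketch."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return [r / (n + 1) for r in ranks]

def clayton_loglik(us, vs, theta):
    """Log-likelihood of the bivariate Clayton copula density."""
    total = 0.0
    for u, v in zip(us, vs):
        s = u ** -theta + v ** -theta - 1.0
        total += (math.log(1.0 + theta)
                  - (theta + 1.0) * (math.log(u) + math.log(v))
                  - (2.0 + 1.0 / theta) * math.log(s))
    return total

def fit_clayton(us, vs, grid=None):
    """Pick the theta on a grid that maximizes the log-likelihood."""
    grid = grid or [0.1 * i for i in range(1, 101)]
    return max(grid, key=lambda t: clayton_loglik(us, vs, t))

# Toy scores for two relevance criteria (synthetic, perfectly co-monotone).
us = pseudo_observations([10, 30, 20])
vs = pseudo_observations([0.1, 0.9, 0.4])
theta = fit_clayton(us, vs, grid=[0.5, 1.0, 2.0, 5.0])
```

A grid search keeps the sketch dependency-free; a numerical optimizer would be the more usual choice for real data.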

Joint Relevance Estimation
• We study three different scenarios:
• Opinionated blog posts
• Personalized bookmarks
• Child-friendly websites
• Use original training portion of the corpora where available
• A 90/10 split otherwise

Results I – Opinionated Blog Posts
• TREC Blogs08 dataset
• 1.3 M documents
• Relevance dimensions: Topicality & Subjectivity
• Significantly higher performance than linear combination model

Results II – Personalized Bookmarks
• Dataset by Vallet & Castells
• 339k documents
• Relevance Dimensions: Topicality & Personal relevance
• Significant performance gains in some metrics

Results III – Child-friendly Websites
• Dataset from the PuppyIR project (http://puppyir.eu)
• 22k documents
• Relevance Dimensions: Topicality & Child-suitability
• Worse-than-baseline performance

4. Copulas – When to use them?

When to use them?
• Previously: Strongly varying performance for different settings
• Is there a way of predicting the merit?
• Recall: copulas model tail dependencies between dimensions

Measuring Tail Dependencies
• According to Frees and Valdez 1998: IL and IU measure strength of
lower and upper tail dependencies
• Anderson-Darling test of goodness-of-fit between copula and
observed data
Domain                   Frees tail index  Anderson-Darling  Actual retrieval performance
Opinionated Blogs        IL = 0.07         0.67              Copulas > linear
Personalized Bookmarks   IU = 0.49         0.47              Copulas = linear
Child-friendly Websites  IL = IU = 0       0.046             Copulas < linear
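
To make the idea of a tail index concrete, here is a generic empirical tail co-occurrence rate. It is an assumption-laden stand-in, not the Frees and Valdez index reported in the table: it counts how often both criteria fall into the same tail of width q and divides by q, so values near q suggest independence and values near 1 suggest strong tail dependence. Inputs are assumed to be already scaled to (0, 1).

```python
def lower_tail_rate(us, vs, q):
    """Share of pairs with both values in the lowest q-fraction, divided by q."""
    joint = sum(1 for u, v in zip(us, vs) if u <= q and v <= q)
    return joint / (len(us) * q)

def upper_tail_rate(us, vs, q):
    """Share of pairs with both values in the highest q-fraction, divided by q."""
    joint = sum(1 for u, v in zip(us, vs) if u > 1 - q and v > 1 - q)
    return joint / (len(us) * q)

# Synthetic, perfectly co-monotone pseudo-observations versus a reversed copy.
u_scores = [i / 10 for i in range(1, 10)]
same = lower_tail_rate(u_scores, u_scores, 0.25)
reversed_rate = lower_tail_rate(u_scores, u_scores[::-1], 0.25)
upper_same = upper_tail_rate(u_scores, u_scores, 0.25)
```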

Score Fusion
• A different angle on relevance estimation
• Combine individual retrieval system scores instead of modelling relevance
from content criteria
• In this setting, submissions to historic TRECs serve as criteria
• We randomly draw k individual runs and combine them using copulas

Fusion Methods
• Established:
• Copula-based:
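
The method formulas on this slide did not survive extraction. The robustness slides name CombSUM and CombMNZ as baselines, so their standard definitions are sketched here; the copula-based variants cannot be recovered from the text and are not shown. Representing each run as a dictionary from document id to normalized score is an assumption of this sketch.

```python
def comb_sum(runs, doc):
    """CombSUM: sum of the document's scores over all runs (0 if not retrieved)."""
    return sum(run.get(doc, 0.0) for run in runs)

def comb_mnz(runs, doc):
    """CombMNZ: CombSUM times the number of runs giving the document a non-zero score."""
    hits = sum(1 for run in runs if run.get(doc, 0.0) > 0.0)
    return hits * comb_sum(runs, doc)

# Two hypothetical runs with already normalized scores.
runs = [{"d1": 0.5, "d2": 0.2}, {"d1": 0.25}]
fused_sum = {d: comb_sum(runs, d) for d in ("d1", "d2")}
fused_mnz = {d: comb_mnz(runs, d) for d in ("d1", "d2")}
```

CombMNZ rewards documents retrieved by several runs, which is also why both methods can be pulled down by weak runs that retrieve many non-relevant documents.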

Results – TREC 4
• Results are averaged across 200 randomizations per setting of k
• Relative improvements over the best, worst and median fused run in
terms of percentages of MAP
• Small but consistent improvements over non-copula fusion baselines

Robustness - CombSUM
• Fusion approaches are often
sensitive to weak contributions
• We control the number of weak
submissions added to the fusion
• Copulas’ explicit modeling of
dependency structure is more
robust

Robustness - CombMNZ
• Fusion approaches are often
sensitive to weak contributions
• We control the number of weak
submissions added to the fusion
• Copulas’ explicit modeling of
dependency structure is more
robust

6. Conclusion and Future Directions

Conclusion
• Copulas decouple observations and dependencies
• IR models are good at estimating marginals
• Copulas are good at combining them
• We use them for multivariate relevance estimation
• Strongly scenario-dependent performance
• Tail indices & goodness of fit tests as estimators of expected performance
• Copulas for score fusion
• Robust to outliers

The Road Ahead
• Currently, we use single copulas for relevance modelling
• Copula mixtures and composite Archimedean copulas for higher accuracy
• Here, we use pre-existing copula families and fit them to data
• Instead, can we formalize copulas from scratch to include domain knowledge?
• So far, we explored two-dimensional relevance spaces
• What happens as we move into higher-order systems?
