6. Problem definition
Search result diversification is an optimization
problem: select k items from the set of all relevant
results so that the chosen subset is both highly
relevant and highly diverse.
10. How can items be diverse?
Word sense diversity, from ambiguous queries
Information source diversity, from unambiguous queries
11. Measures of diversity
Diversity is tightly coupled with the concept of
similarity
To address the different aspects of the problem
several measures emerged:
Semantic distance
Categorical distance
Novel information
12. Semantic distance
Diversifies on content dissimilarity
Uses the min-hashing scheme to get the sketch of a document:
$S_d = \{MH_{h_1}(d), \ldots, MH_{h_n}(d)\}$
Distance is computed from Jaccard similarity:
$sim(u, v) = \frac{|S_u \cap S_v|}{|S_u \cup S_v|}$
$d(u, v) = 1 - sim(u, v)$
Does not work well when document lengths differ
greatly or when the sketch size is small
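The sketch-based distance above can be illustrated with a minimal Python sketch. The shingle width, the sketch size n = 64, and the use of salted BLAKE2b as the family of hash functions are assumptions made for illustration; the slides do not specify them. Each sketch entry is stored as a (hash index, minimum value) pair so that values from different hash functions are never confused.

```python
import hashlib


def shingles(text, w=2):
    """Return the set of w-word shingles of a text (w is an assumed choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}


def seeded_hash(seed, item):
    """64-bit hash of item under hash function number `seed` (salted BLAKE2b)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(text, n=64):
    """Min-hash sketch S_d: one minimum per hash function h_1..h_n."""
    items = shingles(text)
    return {(i, min(seeded_hash(i, s) for s in items)) for i in range(n)}


def jaccard_distance(s_u, s_v):
    """d(u, v) = 1 - |S_u ∩ S_v| / |S_u ∪ S_v| computed on two sketches."""
    return 1.0 - len(s_u & s_v) / len(s_u | s_v)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy cat"
print(jaccard_distance(sketch(doc_a), sketch(doc_b)))
```

With a small n the estimate is noisy, which is one way to see the sketch-size limitation mentioned above.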
13. Categorical distance
Emphasizes word sense diversification
It is based on metadata (Taxonomy)
The measure is a weighted tree distance
$d(u, v) = \sum_{i=lca(u,v)}^{l(u)} \frac{1}{2^{e(i-1)}} + \sum_{i=lca(u,v)}^{l(v)} \frac{1}{2^{e(i-1)}}$
Examples of taxonomies:
/Top/Health vs /Top/Finance
/Top/Sport/Racing vs /Top/Sport/Football
14. Novel information
Diversifies in a general sense, based on
content dissimilarity. Good for subtopics
Results are represented with unigram language
models (used in natural language processing)
For each document, the Kullback-Leibler divergence
measures how much novel information it brings
into the set:
how many extra bits are needed to describe
the new document using only the documents
already selected in the set
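As a rough illustration of the novelty measure, the following Python sketch builds unigram language models and computes the Kullback-Leibler divergence in bits between a candidate document and the documents already selected. Whitespace tokenization and additive smoothing with mu = 0.5 are assumptions made here to keep the divergence finite; the slides do not say which smoothing is used.

```python
import math
from collections import Counter


def unigram_model(texts, vocab, mu=0.5):
    """Additively smoothed unigram distribution over vocab (mu is assumed)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    total = sum(counts[w] for w in vocab) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def novelty(candidate, selected):
    """KL(candidate || selected) in bits: extra bits needed to describe the
    candidate using a model built only from the already selected documents."""
    vocab = set(candidate.lower().split())
    for text in selected:
        vocab |= set(text.lower().split())
    p = unigram_model([candidate], vocab)
    q = unigram_model(selected, vocab)
    return sum(p[w] * math.log2(p[w] / q[w]) for w in vocab)


selected = ["jaguar cat jungle predator"]
print(novelty("jaguar car engine speed", selected))     # new sense, higher value
print(novelty("jaguar cat jungle predator", selected))  # repeat, zero novelty
```

A document that repeats what is already in the set scores zero, while a document about a different subtopic scores higher.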
15. Diversity measures: open issues
Some aspects not taken into account:
intrinsic properties of the document
genre of the document
sentiment regarding the topic
17. Diversification objectives
It has been proved impossible to find a function
that has all the required properties:
scale invariance
consistency
richness
stability
independence of irrelevant attributes
monotonicity
strength of relevance
strength of similarity
18. Diversification objectives
Several functions proposed:
Max sum (No stability)
Max min (No consistency nor stability)
Mono objective (No consistency)
Max sum of max score (Maximizes relevancy and then diversity)
Categorical (Results have to cover a set of categories)
Max product (It is based on the already chosen results)
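To make two of these objectives concrete, here is a small Python sketch of max sum and max min. The weighting (k − 1 on relevance, 2λ on pairwise distance for max sum) follows the bi-criteria form given by Gollapudi and Sharma (WWW '09); the trade-off parameter `lam`, the relevance table `w`, and the toy distances are assumptions for illustration only.

```python
from itertools import combinations


def max_sum(S, w, d, lam=1.0):
    """(k - 1) * sum of relevance + 2 * lam * sum of pairwise distances."""
    k = len(S)
    relevance = sum(w[u] for u in S)
    diversity = sum(d(u, v) for u, v in combinations(S, 2))
    return (k - 1) * relevance + 2 * lam * diversity


def max_min(S, w, d, lam=1.0):
    """Minimum relevance + lam * minimum pairwise distance (needs |S| >= 2)."""
    return min(w[u] for u in S) + lam * min(d(u, v) for u, v in combinations(S, 2))


# Hypothetical toy data: relevance scores and symmetric pairwise distances.
w = {"a": 1.0, "b": 0.8, "c": 0.5}
dist = {frozenset("ab"): 0.1, frozenset("ac"): 0.9, frozenset("bc"): 0.8}


def d(u, v):
    return dist[frozenset((u, v))]


print(max_sum(["a", "b"], w, d), max_sum(["a", "c"], w, d))
print(max_min(["a", "b"], w, d), max_min(["a", "c"], w, d))
```

On this toy data both objectives prefer the pair {a, c} over the more relevant but nearly identical pair {a, b}.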
19. Diversification algorithms
Finding the best solution is an NP-hard problem
Algorithm depends on the objective function
Approximation
Greedy
Open issues:
Is off-line pre-computation applicable?
Are there efficient data structures?
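A greedy strategy can be sketched in a few lines of Python. This is one simple heuristic, not necessarily the approximation algorithm used in the cited papers: at each step it adds the item with the best relevance plus weighted distance to the items already chosen. The function name `greedy_diversify`, the parameter `lam`, and the toy data are hypothetical.

```python
def greedy_diversify(items, w, d, k, lam=1.0):
    """Pick k items, each time taking the one that maximizes
    relevance + lam * (sum of distances to the already selected items)."""
    selected = []
    remaining = list(items)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda u: w[u] + lam * sum(d(u, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: relevance scores and symmetric pairwise distances.
w = {"a": 1.0, "b": 0.9, "c": 0.5}
dist = {frozenset("ab"): 0.1, frozenset("ac"): 0.9, frozenset("bc"): 0.8}


def d(u, v):
    return dist[frozenset((u, v))]


print(greedy_diversify(["a", "b", "c"], w, d, k=2))           # ['a', 'c']
print(greedy_diversify(["a", "b", "c"], w, d, k=2, lam=0.0))  # ['a', 'b']
```

Each step scans all remaining items and all selected items, which is where pre-computed distances or suitable data structures could help.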
21. Data set for the evaluation
Full text
TREC Interactive
Top results from commercial search engine
Structured data
Taxonomies (Open Directory Project)
Ground truth
Wikipedia disambiguation pages
Judgements from Amazon Mechanical Turk
There is a need for task-specific standard datasets
22. Benchmarks
Adaptation from existing metrics:
Alpha-nDCG: Normalized discounted cumulative gain
Subtopic recall and precision: Number of subtopics covered
User intent: Results distribution should reflect what the user is asking for
Comparison against the optimum
23. Alpha-nDCG
Based on information nuggets (Answer to a
question)
A document is relevant when it contains a nugget
needed by the user
Quality of results graded by human assessors
The more nuggets the result set contains, the better
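The nugget-based idea can be sketched in Python as follows. The gain formula (each repeated nugget is discounted by a factor 1 − α), the log2(1 + rank) discount, and the greedy construction of the ideal ranking follow the usual description of alpha-nDCG by Clarke et al. (SIGIR '08); the data layout (a dict from document id to its set of nuggets) and the function names are assumptions for illustration.

```python
import math


def alpha_dcg(ranking, nuggets, alpha=0.5, k=None):
    """Discounted cumulative gain where a nugget seen m times before
    contributes (1 - alpha) ** m instead of 1."""
    seen = {}
    score = 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        gain = 0.0
        for n in nuggets.get(doc, ()):
            gain += (1 - alpha) ** seen.get(n, 0)
            seen[n] = seen.get(n, 0) + 1
        score += gain / math.log2(1 + rank)
    return score


def greedy_ideal(nuggets, alpha=0.5, k=None):
    """Greedy approximation of the ideal ranking (the exact one is hard to find)."""
    remaining = list(nuggets)
    limit = len(remaining) if k is None else min(k, len(remaining))
    seen, ideal = {}, []
    while len(ideal) < limit:
        best = max(
            remaining,
            key=lambda doc: sum((1 - alpha) ** seen.get(n, 0) for n in nuggets[doc]),
        )
        for n in nuggets[best]:
            seen[n] = seen.get(n, 0) + 1
        ideal.append(best)
        remaining.remove(best)
    return ideal


def alpha_ndcg(ranking, nuggets, alpha=0.5, k=None):
    """alpha-DCG of the ranking divided by alpha-DCG of the greedy ideal."""
    ideal_score = alpha_dcg(greedy_ideal(nuggets, alpha, k), nuggets, alpha, k)
    return alpha_dcg(ranking, nuggets, alpha, k) / ideal_score if ideal_score else 0.0


# Hypothetical judgements: d1 and d2 carry the same nugget, d3 a different one.
judged = {"d1": {"a"}, "d2": {"a"}, "d3": {"b"}}
print(alpha_ndcg(["d1", "d3"], judged, k=2))  # diverse ranking
print(alpha_ndcg(["d1", "d2"], judged, k=2))  # redundant ranking scores lower
```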
24. Subtopic recall and precision
Is the result set exhaustive?
$\text{s-recall at } k = \frac{\text{number of subtopics covered by the first } k \text{ documents}}{\text{total number of subtopics}}$
Is the result set efficient?
$\text{s-precision at } r = \frac{minRank(S_{opt}, r)}{minRank(S, r)}$
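Both measures can be sketched in Python under a simple assumed data layout: a dict mapping each document id to the set of subtopics it covers. Here minRank(S, r) is taken to be the smallest rank at which the ranking S reaches subtopic recall r. The optimal ranking is passed in by the caller rather than computed, since finding it amounts to a minimum set cover problem; all names and data below are hypothetical.

```python
def s_recall(ranking, subtopics, total, k):
    """Fraction of all subtopics covered by the first k documents."""
    covered = set()
    for doc in ranking[:k]:
        covered |= subtopics.get(doc, set())
    return len(covered) / total


def min_rank(ranking, subtopics, total, r):
    """Smallest rank k at which s-recall reaches r, or None if it never does."""
    covered = set()
    for k, doc in enumerate(ranking, start=1):
        covered |= subtopics.get(doc, set())
        if len(covered) / total >= r:
            return k
    return None


def s_precision(ranking, optimal_ranking, subtopics, total, r):
    """minRank of the optimal ranking divided by minRank of the given ranking."""
    actual = min_rank(ranking, subtopics, total, r)
    best = min_rank(optimal_ranking, subtopics, total, r)
    if actual is None or best is None:
        return 0.0
    return best / actual


# Hypothetical judgements: three subtopics in total.
subtopics = {"d1": {"s1"}, "d2": {"s1"}, "d3": {"s2", "s3"}}
system = ["d1", "d2", "d3"]
optimal = ["d3", "d1", "d2"]
print(s_recall(system, subtopics, total=3, k=2))
print(s_precision(system, optimal, subtopics, total=3, r=1.0))
```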
25. Conclusions
Diversification can really improve the quality of
search results
More work is still needed to achieve good results
in all possible scenarios
26. Open issues
There is room for improvement in defining new
diversity types and metrics
Ranking functions should take diversity into account
from the beginning, in an integrated process
Datasets to evaluate each notion of diversity
should be built
27. References
Minack, E., Demartini, G., Nejdl, W.: Current Approaches to Search
Result Diversification. In: Proceedings of ISWC '09
Gollapudi, S., Sharma, A.: An Axiomatic Approach for Result
Diversification. In: Proceedings of WWW '09
Zhai, C.X., Cohen, W.W., Lafferty, J.: Beyond Independent Relevance:
Methods and Evaluation Metrics for Subtopic Retrieval. In: Proceedings
of SIGIR '03
Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying Search
Results. In: Proceedings of WSDM '09
Clough, P., Sanderson, M., Abouammoh, M., Navarro, S., Paramita, M.:
Multiple Approaches to Analysing Query Diversity. In: Proceedings of
SIGIR '09
Clarke, C.L., Kolla, M., Cormack, G.V., Vechtomova, O., Ashkan, A.,
Büttcher, S., MacKinnon, I.: Novelty and Diversity in Information
Retrieval Evaluation. In: Proceedings of SIGIR '08