The document discusses efficient diversity-aware search. It begins by explaining how traditional search focuses only on relevance but ignores users' own perspectives, which can lead to redundant results. It then introduces diversity-aware search as a solution. The key aspects covered include a vector space data model, a user behavior model that considers relevance and novelty, and quantifying answer quality. The document primarily focuses on the DivGen approach, which incrementally computes candidate documents while maintaining usefulness bounds to enable early pruning. Novel data access primitives like bound access, batch sequential access, and document random access are introduced to improve efficiency.
2. Background & Motivation
What is search?
1. A user U initiates a query Q
2. A list of documents D, sorted by relevance R w.r.t. Q, is returned
CSE 788, Dacong (Tony) Yan Efficient Diversity-Aware Search 2/20
User Satisfaction sat(U, Q)
It’s all about relevance between D and Q!
User U has their own perspective on relevance, R_U
Roughly speaking, $\mathrm{sat}(U, Q) \propto 1 / \mathrm{diff}(R_U, R)$
Problem: R_U is difficult to capture, and is usually ignored!
Symptoms of ignoring R_U
Redundant documents are included in the result set
The most relevant documents in terms of R_U are excluded from the result set
Solution: diversity-aware search!
9. Diversity-Aware Search
Intuitively, relevance + dissimilarity
Formally, a content-based diversification perspective:
Data Model
User Behavior Model
Answer Quality
11. Data Model
Vector Space Model: documents as weighted sets of features
Each document d is represented as a vector
d = (d_1, d_2, ...),
where d_i ≥ 0 is the weight of feature i in document d
Examples
textual documents: features can be keywords weighted in a tf.idf manner
graph “documents”: features can be paths in the corpus graph
in a recsys scenario: features can be the set of users who recommend a document
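As a rough illustration of the textual-document case above, the sketch below builds sparse feature vectors as Python dictionaries. The function name `tfidf_vectors`, the toy corpus, and the smoothed idf formula `log(1 + n/df)` are assumptions made for this example; the slides do not say which tf.idf variant is used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {feature: weight} dict."""
    n = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed idf variant: log(1 + n/df) keeps every weight positive.
        vectors.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vectors


# Toy corpus (made up for illustration).
corpus = [
    "diverse search results".split(),
    "search engine ranking".split(),
    "diverse ranking".split(),
]
vectors = tfidf_vectors(corpus)
```

Only features with nonzero weight are stored, which matches the view of a document as a weighted set of features.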
17. User Behavior Model
Assumption: the user examines the results in their order of presentation.
Usefulness of a document d: the probability that d is useful
Relevance: the probability that d is relevant
Novelty: the probability that d’s content is not redundant
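Consider a document d preceded by d_1, d_2, ..., d_m w.r.t. a query q. Its usefulness is defined as:
$$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{rel}(d \mid q) \cdot \bigl(1 - \mathrm{red}(d \mid \{d_1, \ldots, d_m\}, q)\bigr)$$
Estimating relevance by sim(d, q) and treating redundancy with respect to each preceding document as independent, this becomes:
$$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{sim}(d, q) \cdot \prod_{i=1}^{m} \bigl(1 - \mathrm{red}(d \mid d_i, q)\bigr)$$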
red(d | d_i, q) can be decomposed further:
sim(d, d_i): the probability that the content of d is similar to, or contained in, that of d_i;
f_q: the estimated probability that, given a query q, a document with similar content to, or content contained in, a document previously emitted, is redundant.
$$\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$$
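The usefulness formula above can be sketched in a few lines of Python. This is only an illustration: the slides do not specify how sim is computed, so cosine similarity over sparse vectors is assumed here, and the names `cosine` and `usefulness` as well as the example vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {feature: weight} vectors (assumed sim)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def usefulness(d, preceding, q, f_q, sim=cosine):
    """use(d | preceding, q) = sim(d, q) * prod_i (1 - sim(d, d_i) * f_q)."""
    novelty = 1.0
    for d_i in preceding:
        # red(d | d_i, q) = sim(d, d_i) * f_q
        novelty *= 1.0 - sim(d, d_i) * f_q
    return sim(d, q) * novelty


# Made-up vectors: d2 is nearly a duplicate of d1.
q = {"jaguar": 1.0}
d1 = {"jaguar": 1.0, "car": 1.0}
d2 = {"jaguar": 1.0, "car": 0.9}
first = usefulness(d2, [], q, f_q=0.9)     # nothing shown before d2
second = usefulness(d2, [d1], q, f_q=0.9)  # d1 was already shown
```

With nonnegative weights, cosine similarity stays in [0, 1], so each factor can still be read as a probability. Once d1 has been shown, the near-duplicate d2 is much less useful (`second` is far below `first`).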
19. User Behavior Model (Cont.)
Focus Parameter f_q
f_q is the main tunable parameter in red(d | d_i, q) = sim(d, d_i) · f_q
It is defined on a per-query basis and denotes the amount of desired diversification
Smaller f_q favors relevance over diversity
Larger f_q favors diversity over relevance
Probabilistic interpretation:
f_q answers “how likely is a relevant document to be redundant for the user, given that they have already examined a document with similar content?”
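A small worked example may help show how the focus parameter shifts the ranking. All numbers are made up for illustration: one document d_1 has already been shown, d is relevant but similar to it (sim(d, q) = 0.8, sim(d, d_1) = 0.5), and d' is less relevant but entirely novel (sim(d', q) = 0.6, sim(d', d_1) = 0).

```latex
\begin{align*}
f_q = 0.2:\quad & \mathrm{use}(d \mid \{d_1\}, q) = 0.8 \cdot (1 - 0.5 \cdot 0.2) = 0.72,
  & \mathrm{use}(d' \mid \{d_1\}, q) &= 0.6 \cdot (1 - 0 \cdot 0.2) = 0.60 \\
f_q = 0.9:\quad & \mathrm{use}(d \mid \{d_1\}, q) = 0.8 \cdot (1 - 0.5 \cdot 0.9) = 0.44,
  & \mathrm{use}(d' \mid \{d_1\}, q) &= 0.6 \cdot (1 - 0 \cdot 0.9) = 0.60
\end{align*}
```

With the smaller f_q the more relevant d is ranked next; with the larger f_q the novel d' overtakes it.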
22. Answer Quality
Quantification properties
Tractable instantiation
An optimal answer for strict order dominance semantics can be found by greedily identifying the best result at position 1, 2, ..., k
25. A First Stab at Diversity-Aware Search (DAS)
Steps:
1. Compute the relevance of each document to the query;
2. Identify the highest-scoring document d, and update the usefulness of all other documents based on their similarity to d;
3. Repeat step 2 until k documents have been selected.
Problems:
It requires access to the entire corpus.
It is too inefficient even for a moderately large set of documents.
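The steps above can be sketched as a short greedy loop. This is a minimal illustration, not the presenter's code: cosine similarity is assumed for sim, and the names `greedy_diverse_top_k` and `cosine` and the toy corpus are hypothetical. The first pass over `corpus` and the inner update loop make the two listed problems visible, since both touch every document.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse {feature: weight} vectors (assumed sim)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def greedy_diverse_top_k(corpus, q, k, f_q, sim=cosine):
    """Greedy baseline: pick the most useful remaining document k times."""
    # Step 1: relevance of every document to the query (full corpus scan).
    use = {doc_id: sim(vec, q) for doc_id, vec in corpus.items()}
    result = []
    for _ in range(min(k, len(corpus))):
        # Step 2: emit the document with the highest current usefulness ...
        best = max(use, key=use.get)
        result.append(best)
        del use[best]
        # ... and discount every remaining document by its similarity to it.
        for doc_id in use:
            use[doc_id] *= 1.0 - sim(corpus[doc_id], corpus[best]) * f_q
    return result


# Toy corpus: "a" and "b" are near-duplicates, "c" covers another sense.
corpus = {
    "a": {"jaguar": 1.0, "car": 1.0},
    "b": {"jaguar": 1.0, "car": 0.9},
    "c": {"jaguar": 0.6, "cat": 1.0},
}
q = {"jaguar": 1.0}
relevance_only = greedy_diverse_top_k(corpus, q, k=2, f_q=0.0)
diversified = greedy_diverse_top_k(corpus, q, k=2, f_q=0.9)
```

On this toy corpus, `f_q=0.0` returns the two near-duplicates `b` and `a`, while `f_q=0.9` returns `b` followed by `c`.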
29. A Threshold Algorithm for DAS
Generate-Filter Idea:
Incrementally compute documents in descending order of relevance;
Maintain upper and lower bounds on the relevance of every encountered document;
Rerank the documents with diversity taken into account.
Data Access Primitives
Sequential Access (SA): retrieve the id of the document with the next highest weight for a specified feature i
Random Access (RA): retrieve the exact weight of feature i in document d
Drawbacks
The relevance of each candidate must be fully computed, and its entire content retrieved;
Much of the I/O effort is wasted, and a lot of it is not sequential in nature;
Hardly any early pruning is possible.
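To make the SA and RA primitives concrete, here is a sketch of a Generate step in the style of the classic threshold algorithm. Several things are assumptions rather than details from the slides: relevance is taken to be a dot product with nonnegative query weights, the stopping rule is the textbook one (stop once the k-th best score reaches the threshold), and the names `Index`, `sequential_access`, `random_access`, and `top_k_relevant` are hypothetical. The Filter step would then rerank the generated candidates with diversity in mind.

```python
import heapq


class Index:
    """Toy in-memory index exposing SA and RA (hypothetical API)."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> {feature: weight}
        self.lists = {}   # feature -> doc ids in descending weight order
        for doc_id, vec in docs.items():
            for f in vec:
                self.lists.setdefault(f, []).append(doc_id)
        for f, ids in self.lists.items():
            ids.sort(key=lambda d: -docs[d][f])
        self.reset()

    def reset(self):
        """Rewind every sequential cursor to the top of its list."""
        self.cursor = {f: 0 for f in self.lists}

    def sequential_access(self, f):
        """SA: id of the document with the next highest weight for feature f."""
        ids = self.lists.get(f, [])
        pos = self.cursor.get(f, 0)
        if pos >= len(ids):
            return None
        self.cursor[f] = pos + 1
        return ids[pos]

    def random_access(self, doc_id, f):
        """RA: exact weight of feature f in document doc_id."""
        return self.docs[doc_id].get(f, 0.0)


def top_k_relevant(index, q, k):
    """Top-k (doc_id, score) pairs by dot-product relevance to query q."""
    index.reset()
    last = {f: float("inf") for f in q}  # last weight seen under SA
    scores, active = {}, set(q)
    while active:
        for f in list(active):
            doc_id = index.sequential_access(f)
            if doc_id is None:  # list exhausted: unseen weights are 0
                last[f] = 0.0
                active.discard(f)
                continue
            last[f] = index.random_access(doc_id, f)
            if doc_id not in scores:
                # Full relevance requires one RA per query feature.
                scores[doc_id] = sum(
                    w * index.random_access(doc_id, g) for g, w in q.items()
                )
        # No unseen document can score above this threshold.
        threshold = sum(q[f] * last[f] for f in q)
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


docs = {
    "a": {"x": 0.9, "y": 0.1},
    "b": {"x": 0.5, "y": 0.8},
    "c": {"x": 0.1, "y": 0.2},
    "d": {"x": 0.05, "y": 0.05},
}
q = {"x": 1.0, "y": 1.0}
index = Index(docs)
top2 = top_k_relevant(index, q, k=2)
```

Even in this toy form, every newly seen document costs one random access per query feature, which is the kind of non-sequential I/O the drawbacks above refer to.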
DivGen: making Generate aware of diversity!
32. The DivGen Algorithm
Idea: maintain a set of candidate documents with bounds on usefulness
Novel Data Access Primitives
Bound Access (BA): retrieve the features with the highest weight in d, as well as an upper bound w on the weight of any other features of d
Batch Sequential Access (BSA): retrieve the documents with the highest weight of non-query feature i, as well as an upper bound w on the weight of i in any other document
Document Random Access (DocRA): retrieve all the features with nonzero weight in d, along with their exact weights
Advantages of BA, BSA, DocRA
Existing index techniques can be easily leveraged to enable these primitives.
These primitives can enable a set of early prunings to make the algorithm more efficient.
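One possible reading of the three primitives is sketched below over an in-memory dictionary. The class `ToyStore`, the method names, the batch-size parameter `n`, and the helper `dot_upper_bound` are all hypothetical; the slides do not give signatures, and a real implementation would sit on top of an index rather than sort on every call. The helper is not DivGen's actual bound; it only illustrates how BA's partial view can bound a dot product without fetching the whole document.

```python
class ToyStore:
    """In-memory stand-in for BA, BSA and DocRA (hypothetical signatures)."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> {feature: weight}, weights >= 0

    def bound_access(self, doc_id, n):
        """BA: the n heaviest features of a document, plus an upper bound w
        on the weight of every feature that was not returned."""
        ranked = sorted(self.docs[doc_id].items(), key=lambda fw: -fw[1])
        top, rest = ranked[:n], ranked[n:]
        w = rest[0][1] if rest else 0.0
        return dict(top), w

    def batch_sequential_access(self, feature, n):
        """BSA: the n documents with the heaviest weight for a feature, plus
        an upper bound w on that feature's weight in any other document."""
        ranked = sorted(
            ((d, vec[feature]) for d, vec in self.docs.items() if feature in vec),
            key=lambda dw: -dw[1],
        )
        top, rest = ranked[:n], ranked[n:]
        w = rest[0][1] if rest else 0.0
        return dict(top), w

    def doc_random_access(self, doc_id):
        """DocRA: every feature with nonzero weight, with exact weights."""
        return {f: w for f, w in self.docs[doc_id].items() if w != 0.0}


def dot_upper_bound(partial, w, other):
    """Upper bound on dot(d, other) given only BA's partial view of d."""
    known = sum(x * other.get(f, 0.0) for f, x in partial.items())
    # Every feature of d outside `partial` weighs at most w.
    unknown = sum(x for f, x in other.items() if f not in partial)
    return known + w * unknown


docs = {
    "a": {"x": 0.9, "y": 0.4, "z": 0.1},
    "b": {"x": 0.2, "z": 0.7},
}
store = ToyStore(docs)
partial_a, w_a = store.bound_access("a", n=1)
bound = dot_upper_bound(partial_a, w_a, docs["b"])
exact = sum(x * docs["b"].get(f, 0.0) for f, x in docs["a"].items())
```

If such a bound already shows that a candidate cannot beat the current best, the candidate can be pruned before a DocRA call fetches its full content.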
39. Evaluation
Experimental Setup
Java 6, Oracle BerkeleyDB Java Edition v3.3.74
Ubuntu Linux 8.04, Intel Core2 X6800 2.93GHz CPU, 1GB Memory
ext3fs filesystem with a page size of 4KB
Datasets
Real data: taken from Grapevine, a tool for distilling knowledge from social media
Synthetic data: Zipfian distribution across documents, and normal distribution in each document. How to synthesize?
42. Conclusion
This paper
formally studied the diversity-aware search problem;
proposed a set of novel data access primitives to efficiently solve DAS;
performed experimental studies demonstrating the usefulness of DivGen.