Efficient Diversity-Aware Search Techniques

Efficient Diversity-Aware Search

Dacong (Tony) Yan
May 4, 2011

Background & Motivation

What is search?
1. A user U initiates a query Q
2. A list of documents D, sorted by relevance R w.r.t. Q, is returned

CSE 788, Dacong (Tony) Yan, Efficient Diversity-Aware Search


What is search?

User Satisfaction sat(U, Q)
It’s all about relevance between D and Q!
User U has their own perspective on relevance, $R_U$
Roughly speaking, $\mathrm{sat}(U, Q) \propto \dfrac{1}{\mathrm{diff}(R_U, R)}$



Problem: $R_U$ is difficult to capture, and is usually ignored!



Symptoms of ignoring $R_U$
Redundant documents are included in the result set
The most relevant documents in terms of $R_U$ are excluded from the result set




Solution: diversity-aware search!

Agenda

Diversity-Aware Search
DivGen Approach
Evaluation
Conclusion



Intuitively, relevance + dissimilarity
Formally, a content-based diversiﬁcation perspective:
Data Model
User Behavior Model
Answer Quality


Data Model

Vector Space Model: documents as weighted sets of features
Each document d is represented as a vector
$d = (d^1, d^2, \ldots)$,
where $d^i \ge 0$ is the weight of feature $i$ in document $d$
Examples
textual documents: features can be keywords weighted in a tf-idf manner
graph “documents”: features can be paths in the corpus graph
recommender-system scenario: features can be the set of users who recommend a document
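As a minimal sketch of the data model above, a document can be held as a sparse mapping from feature to non-negative weight. The function name, the toy documents, and the choice of cosine similarity are assumptions made for illustration; the slides do not fix a particular similarity function.

```python
import math

def cosine_sim(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse, non-negative feature vectors (in [0, 1])."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical documents: feature -> weight (e.g. a tf-idf weight per keyword)
d1 = {"jaguar": 0.8, "car": 0.6}
d2 = {"jaguar": 0.8, "cat": 0.6}
d3 = {"python": 1.0}

print(cosine_sim(d1, d2))  # share one feature, so partially similar
print(cosine_sim(d1, d3))  # share no feature, so similarity is zero
```

Because all weights are non-negative, the result stays between 0 and 1, which lets it be read as the "probability of similar content" used later in the user behavior model.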


User Behavior Model

Assumption: the user examines the results in their order of
presentation.


Usefulness of a document d: the probability that d is useful
Relevance: the probability that d is relevant
Novelty: the probability that d’s content is not redundant


Consider a document $d$ preceded by $d_1, d_2, \ldots, d_m$ w.r.t. a query $q$; its
usefulness is defined below:
$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{rel}(d \mid q) \cdot \bigl(1 - \mathrm{red}(d \mid \{d_1, \ldots, d_m\}, q)\bigr)$


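Taking $\mathrm{rel}(d \mid q) = \mathrm{sim}(d, q)$ and treating redundancy against each earlier document independently, this becomes:
$\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{sim}(d, q) \cdot \prod_{i=1}^{m} \bigl(1 - \mathrm{red}(d \mid d_i, q)\bigr)$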


$\mathrm{red}(d \mid d_i, q)$ can be decomposed further:
$\mathrm{sim}(d, d_i)$: the probability that the content of $d$ is similar to, or
contained in, that of $d_i$;
$f_q$: the estimated probability that, given a query $q$, a document with
similar content to, or content contained in, a document previously
emitted, is redundant.
$\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
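The usefulness definition can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's implementation: the similarity function is passed in as a parameter, and the Jaccard function over feature sets is a hypothetical stand-in chosen only because it is short and stays in [0, 1].

```python
from typing import Callable, FrozenSet, Sequence

Doc = FrozenSet[str]  # assumption: a document is just its set of features

def jaccard(a: Doc, b: Doc) -> float:
    """Stand-in similarity in [0, 1]; any sim(., .) with that range would do."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def redundancy(d: Doc, d_i: Doc, f_q: float,
               sim: Callable[[Doc, Doc], float]) -> float:
    """red(d | d_i, q) = sim(d, d_i) * f_q"""
    return sim(d, d_i) * f_q

def usefulness(d: Doc, previous: Sequence[Doc], q: Doc, f_q: float,
               sim: Callable[[Doc, Doc], float]) -> float:
    """use(d | previous, q) = sim(d, q) * prod_i (1 - red(d | d_i, q))"""
    use = sim(d, q)
    for d_i in previous:
        use *= 1.0 - redundancy(d, d_i, f_q, sim)
    return use

q = frozenset({"jaguar", "speed"})
d = frozenset({"jaguar", "speed"})
other = frozenset({"recipe"})

print(usefulness(d, [], q, 0.5, jaccard))       # nothing shown yet: pure relevance
print(usefulness(d, [d], q, 0.5, jaccard))      # an identical document was already shown
print(usefulness(d, [other], q, 0.5, jaccard))  # an unrelated document was already shown
```

Only earlier documents that overlap with d reduce its usefulness; unrelated earlier results leave it unchanged.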

User Behavior Model (Cont.)

Focus Parameter $f_q$
$f_q$ is the main tunable parameter in $\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
It is defined on a per-query basis and denotes the amount of desired
diversification
A smaller $f_q$ favors relevance over diversity
A larger $f_q$ favors diversity over relevance
Probabilistic interpretation:
“how likely is a relevant document to be useful to the user, given
that they have already examined a document with similar content?”
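To see the effect of the focus parameter, here is a small worked example. The numbers are made up for illustration: a document $d$ with $\mathrm{sim}(d, q) = 0.9$ is preceded by a single document $d_1$ with $\mathrm{sim}(d, d_1) = 0.8$.

```latex
\begin{align*}
\mathrm{use}(d \mid \{d_1\}, q) &= \mathrm{sim}(d, q) \cdot \bigl(1 - \mathrm{sim}(d, d_1) \cdot f_q\bigr) \\
f_q = 0:   \quad & 0.9 \cdot (1 - 0.8 \cdot 0)   = 0.9 \\
f_q = 0.5: \quad & 0.9 \cdot (1 - 0.8 \cdot 0.5) = 0.54 \\
f_q = 1:   \quad & 0.9 \cdot (1 - 0.8 \cdot 1)   = 0.18
\end{align*}
```

At $f_q = 0$ the ranking is driven by relevance alone; as $f_q$ grows, a near-duplicate of an earlier result loses most of its usefulness.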


Answer Quality

Quantification properties


Answer Quality


Tractable instantiation

An optimal answer for strict order dominance semantics can be
found by greedily identifying the best result at position 1, 2, ..., k


A First Stab at DAS (Diversity-Aware Search)

Steps:
1. Compute the relevance of each document to the query;
2. Identify the highest-scoring document d, and update the usefulness of
all other documents, based on their similarity to d;
3. Repeat the procedure k times.
Problems:
It requires access to the entire corpus.
It is too inefficient even for a moderately large set of documents.
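The three steps above can be sketched as follows. This is a minimal in-memory illustration, assuming cosine similarity over sparse feature vectors and a hypothetical function name `naive_das`; it is not the presenter's code. The loop over every remaining document in each round shows why the approach needs the whole corpus.

```python
import math

def cosine_sim(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse, non-negative feature vectors."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_das(corpus: dict, q: dict, k: int, f_q: float) -> list:
    """Greedy diversity-aware top-k over a fully materialized corpus."""
    # Step 1: relevance of every document to the query.
    use = {doc_id: cosine_sim(vec, q) for doc_id, vec in corpus.items()}
    result = []
    for _ in range(min(k, len(corpus))):  # Step 3: repeat k times
        # Step 2a: emit the document with the highest current usefulness.
        best = max(use, key=use.get)
        result.append(best)
        del use[best]
        # Step 2b: discount all remaining documents by similarity to it.
        for doc_id in use:
            use[doc_id] *= 1.0 - cosine_sim(corpus[doc_id], corpus[best]) * f_q
    return result

# Hypothetical toy corpus: two near-duplicate car pages and one cat page.
corpus = {
    "car1": {"jaguar": 0.9, "car": 0.4},
    "car2": {"jaguar": 0.9, "car": 0.5},
    "cat1": {"jaguar": 0.7, "cat": 0.7},
}
query = {"jaguar": 1.0}

print(naive_das(corpus, query, k=2, f_q=0.0))
print(naive_das(corpus, query, k=2, f_q=1.0))
```

With these toy vectors, f_q = 0 returns the two car pages, while f_q = 1 replaces the second car page with the cat page. Each round costs one similarity computation per remaining document, so the total work grows with k times the corpus size.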


A Threshold Algorithm for DAS

Generate-Filter Idea:
Incrementally compute documents in descending order of relevance;
Maintain upper and lower bounds on the relevance of every
encountered document;
Rerank the documents with diversity taken into account.



Data Access Primitives
Sequential Access (SA): retrieve the id of the document with the
next highest weight for a specified feature i
Random Access (RA): retrieve the exact weight of feature i in
document d
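To make the two primitives concrete, here is a sketch of the classic threshold algorithm for the relevance-only "generate" phase, built on SA and RA. Several things are assumptions: the `Index` class and its method names are hypothetical, the index lives in memory, and relevance is taken to be the dot product of query and document weights (a monotone score, which the threshold argument needs). The slides do not specify these details.

```python
import heapq

class Index:
    """In-memory stand-in for an inverted index offering SA and RA."""

    def __init__(self, corpus: dict):
        self.corpus = corpus
        self.postings = {}
        for doc_id, vec in corpus.items():
            for f, w in vec.items():
                self.postings.setdefault(f, []).append((w, doc_id))
        for plist in self.postings.values():
            plist.sort(key=lambda e: (-e[0], e[1]))  # descending weight

    def sequential_access(self, feature, pos):
        """SA: (weight, doc_id) at rank pos for feature, or None if exhausted."""
        plist = self.postings.get(feature, [])
        return plist[pos] if pos < len(plist) else None

    def random_access(self, feature, doc_id):
        """RA: exact weight of feature in doc_id (0.0 if absent)."""
        return self.corpus[doc_id].get(feature, 0.0)

def top_k_relevant(index: Index, q: dict, k: int) -> list:
    """Return up to k (doc_id, relevance) pairs, most relevant first."""
    seen = {}                 # doc_id -> exact relevance, filled via RA
    last = {f: 0.0 for f in q}
    pos = 0
    while True:
        progressed = False
        for f in q:
            entry = index.sequential_access(f, pos)
            if entry is None:
                last[f] = 0.0  # list exhausted: unseen docs have weight 0 here
                continue
            progressed = True
            w, doc_id = entry
            last[f] = w
            if doc_id not in seen:
                seen[doc_id] = sum(qw * index.random_access(g, doc_id)
                                   for g, qw in q.items())
        # No unseen document can score above this threshold.
        threshold = sum(q[f] * last[f] for f in q)
        top = heapq.nlargest(k, seen.items(), key=lambda e: e[1])
        if not progressed or (len(top) == k and top[-1][1] >= threshold):
            return top
        pos += 1

corpus = {
    "car1": {"jaguar": 0.9, "car": 0.4},
    "car2": {"jaguar": 0.9, "car": 0.5},
    "cat1": {"jaguar": 0.7, "cat": 0.7},
}
index = Index(corpus)
print(top_k_relevant(index, {"jaguar": 1.0, "car": 0.5}, k=2))
```

The algorithm stops as soon as the k-th best exact score reaches the threshold, so low-ranked postings may never be read. Diversity is not considered at this stage; it only enters in the later rerank step.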



Drawbacks
The relevance must be fully computed, and the entire content retrieved;
I/O effort is wasted, and much of this I/O is not sequential in nature;
Hardly any early pruning is possible.



DivGen: making Generate aware of diversity!

The DivGen Algorithm

Idea: maintain a set of candidate documents with bounds on
usefulness



Novel Data Access Primitives
Bound Access (BA): retrieve the features with the highest weight in
d, as well as an upper bound w on the weight of any other feature
of d
Batch Sequential Access (BSA): retrieve the documents with the
highest weight of non-query feature i, as well as an upper bound w
on the weight of i in any other document
Document Random Access (DocRA): retrieve all the features with
nonzero weight in d, along with their exact weights
Advantages of BA, BSA, DocRA
Existing index techniques can be easily leveraged to enable these
primitives.
These primitives enable a set of early prunings that make the
algorithm more efficient.
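The three primitives can be pictured with a small in-memory stand-in. Everything here is hypothetical: the class name `DivIndex`, the method signatures, and the batch-size parameter `n` are assumptions made for illustration, and a real system would serve these calls from on-disk index structures rather than sorted Python lists.

```python
class DivIndex:
    """Toy index exposing BA, BSA and DocRA over an in-memory corpus."""

    def __init__(self, corpus: dict):
        self.corpus = corpus
        # Per-document feature lists, heaviest feature first (used by BA).
        self.by_doc = {
            d: sorted(vec.items(), key=lambda e: (-e[1], e[0]))
            for d, vec in corpus.items()
        }
        # Per-feature document lists, heaviest document first (used by BSA).
        self.by_feature = {}
        for d, vec in corpus.items():
            for f, w in vec.items():
                self.by_feature.setdefault(f, []).append((d, w))
        for plist in self.by_feature.values():
            plist.sort(key=lambda e: (-e[1], e[0]))

    def bound_access(self, doc_id, n):
        """BA: top-n features of doc_id, plus a bound on every other feature."""
        feats = self.by_doc[doc_id]
        bound = feats[n][1] if n < len(feats) else 0.0
        return feats[:n], bound

    def batch_sequential_access(self, feature, n):
        """BSA: top-n documents for feature, plus a bound for all other docs."""
        plist = self.by_feature.get(feature, [])
        bound = plist[n][1] if n < len(plist) else 0.0
        return plist[:n], bound

    def doc_random_access(self, doc_id):
        """DocRA: every nonzero feature of doc_id with its exact weight."""
        return dict(self.corpus[doc_id])

corpus = {
    "car1": {"jaguar": 0.9, "car": 0.4},
    "car2": {"jaguar": 0.9, "car": 0.5},
    "cat1": {"jaguar": 0.7, "cat": 0.7},
}
idx = DivIndex(corpus)
print(idx.bound_access("car1", 1))
print(idx.batch_sequential_access("car", 1))
print(idx.doc_random_access("cat1"))
```

In this sketch the bound is simply the weight of the first entry not returned. Bounds on unread weights translate into bounds on similarity, and therefore on usefulness, which is what allows a candidate to be kept or pruned before its full content is fetched with DocRA.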


Algorithm Pseudo-code


Revisit Data Access Primitives


An Execution Example


Evaluation

Experimental Setup
Java 6, Oracle BerkeleyDB Java Edition v3.3.74
Ubuntu Linux 8.04, Intel Core2 X6800 2.93 GHz CPU, 1 GB memory
ext3 filesystem with a page size of 4 KB


Evaluation

Experimental Setup
Datasets
Real data: taken from Grapevine, a tool for distilling knowledge from
social media
Synthetic data: Zipfian distribution across documents, and normal
distribution in each document. How to synthesize?


Evaluation (Cont. I)


Evaluation (Cont. II)


Conclusion

This paper
formally studied the diversity-aware search problem;
proposed a set of novel data access primitives to efficiently solve
DAS;
performed experimental studies demonstrating the usefulness of
DivGen.

