Efficient Diversity-Aware Search




       Dacong (Tony) Yan
          May 4, 2011
CSE 788, Dacong (Tony) Yan                 Efficient Diversity-Aware Search           2/20
Background & Motivation


            What is search?
              1. A user U initiates a query Q
              2. A list of documents D, sorted by relevance R w.r.t. Q, is returned


            User Satisfaction sat(U, Q)
                   It’s all about relevance between D and Q!
                   User U has their own perspective on relevance, $R_U$
                   Roughly speaking, $\mathrm{sat}(U, Q) \propto \dfrac{1}{\mathrm{diff}(R_U, R)}$
                   Problem: $R_U$ is difficult to capture, and usually ignored!


            Symptoms of ignoring $R_U$
                   Redundant documents included in the result set
                   Most relevant documents in terms of $R_U$ excluded from the result set


                   Solution: diversity-aware search!
Agenda



            Background & Motivation
            Diversity-Aware Search
            DivGen Approach
            Evaluation
            Conclusion




Diversity-Aware Search




            Intuitively, relevance + dissimilarity
            Formally, a content-based diversification perspective:
                   Data Model
                   User Behavior Model
                   Answer Quality




Data Model




            Vector Space Model: documents as weighted sets of features
            Each document d is represented as a vector
                                         $d = (d^1, d^2, \ldots)$,
            where $d^i \ge 0$ is the weight of feature $i$ in document $d$
            Examples
                   textual documents: features can be keywords with tf-idf weights
                   graph “documents”: features can be paths in the corpus graph
                   recommender systems: features can be the users who recommend a
                   document
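
A minimal sketch of this data model, assuming each document is stored as a sparse dictionary from feature name to non-negative weight. The slides do not fix a particular similarity function, so cosine similarity is used here as an assumption, and the feature names and weights are made up for illustration.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two sparse vectors (dict: feature -> weight >= 0)."""
    # Iterate over the shorter vector; only shared features contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf style weights for two textual documents.
d1 = {"search": 0.8, "diversity": 0.6}
d2 = {"search": 0.6, "ranking": 0.8}
print(cosine_sim(d1, d2))
```

Because all weights are non-negative, this similarity stays in [0, 1], which is convenient when it is later read as a probability.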




User Behavior Model

            Assumption: the user examines the results in their order of
            presentation.
            Usefulness of a document d: the probability that d is useful
                   Relevance: the probability that d is relevant
                   Novelty: the probability that d’s content is not redundant
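            For a document d preceded by $d_1, d_2, \ldots, d_m$ in the results for a
            query q, usefulness is defined as:
                $\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{rel}(d \mid q) \cdot \bigl(1 - \mathrm{red}(d \mid \{d_1, \ldots, d_m\}, q)\bigr)$
                                                 ⇓
                $\mathrm{use}(d \mid \{d_1, \ldots, d_m\}, q) = \mathrm{sim}(d, q) \cdot \prod_{i=1}^{m} \bigl(1 - \mathrm{red}(d \mid d_i, q)\bigr)$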
            $\mathrm{red}(d \mid d_i, q)$ can be decomposed further:
                   $\mathrm{sim}(d, d_i)$: the probability that the content of d is similar to, or
                   contained in, that of $d_i$;
                   $f_q$: the estimated probability that, given a query q, a document with
                   similar content to, or content contained in, a document previously
                   emitted, is redundant.
                                         $\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
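
The usefulness definition can be sketched as a small function. This assumes the relevance sim(d, q) and the similarities sim(d, d_i) to the earlier documents have already been computed and lie in [0, 1]; the function and parameter names are illustrative and do not come from the slides.

```python
def redundancy(sim_d_di, f_q):
    """red(d | d_i, q) = sim(d, d_i) * f_q."""
    return sim_d_di * f_q

def usefulness(rel, sims_to_previous, f_q):
    """use(d | {d_1..d_m}, q) = sim(d, q) * prod_i (1 - red(d | d_i, q)).

    rel:              sim(d, q), the relevance of d to the query
    sims_to_previous: sim(d, d_i) for each of the m documents shown before d
    f_q:              focus parameter for this query
    """
    novelty = 1.0
    for s in sims_to_previous:
        novelty *= 1.0 - redundancy(s, f_q)
    return rel * novelty

# With no predecessors the usefulness is just the relevance.
print(usefulness(0.9, [], 0.5))
# Two predecessors: 0.9 * (1 - 1.0 * 0.5) * (1 - 0.5 * 0.5)
print(usefulness(0.9, [1.0, 0.5], 0.5))
```

Each earlier document can only lower the score, so a document's usefulness never exceeds its relevance.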
User Behavior Model (Cont.)




     Focus Parameter $f_q$
            $f_q$ is the main tunable parameter in $\mathrm{red}(d \mid d_i, q) = \mathrm{sim}(d, d_i) \cdot f_q$
            It is defined on a per-query basis, and denotes the amount of desired
            diversification
                   Smaller $f_q$ favors relevance over diversity
                   Larger $f_q$ favors diversity over relevance
            Probabilistic interpretation:
              “How likely is a relevant document to be useful to the user, given
            that they have already examined a document with similar content?”
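
A small worked example of how the focus parameter shifts the choice of the next result. The two candidates and their scores are hypothetical: A is highly relevant but nearly duplicates the document already shown, while B is less relevant but novel.

```python
def use_after_one(rel, sim_to_shown, f_q):
    # use(d | {d_1}, q) = sim(d, q) * (1 - sim(d, d_1) * f_q)
    return rel * (1.0 - sim_to_shown * f_q)

# Hypothetical candidates: (relevance to q, similarity to the document already shown)
candidates = {
    "A": (0.9, 0.9),  # very relevant, but nearly a duplicate
    "B": (0.7, 0.1),  # less relevant, but novel
}

def best(f_q):
    """Name of the candidate with the highest usefulness for a given f_q."""
    return max(candidates, key=lambda name: use_after_one(*candidates[name], f_q))

print(best(0.1))  # small f_q: relevance dominates
print(best(0.9))  # large f_q: novelty dominates
```

With f_q = 0.1 the scores are 0.819 for A and 0.693 for B, so A is chosen; with f_q = 0.9 they become 0.171 and 0.637, so B is chosen.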




Answer Quality


            Quantification properties




            Tractable instantiation




            An optimal answer for strict order dominance semantics can be
            found by greedily identifying the best result at position 1, 2, ..., k

The DivGen Approach
A First Stab at DAS (Diversity-Aware Search)




            Steps:
              1. Compute the relevance of each document to the query;
              2. Identify the highest-scoring document d, and update the usefulness of
                 all other documents, based on their similarity to d;
              3. Repeat step 2 until k documents have been selected.
            Problems:
                   It requires access to the entire corpus.
                   It is too inefficient even for a moderately large set of documents.
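
The steps above can be sketched as a naive greedy loop. This is an illustration only: it assumes an in-memory corpus of sparse vectors, cosine similarity for both relevance and document similarity, and a focus parameter `f_q` applied as in the user behavior model. The toy corpus is made up.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between sparse vectors (dict: feature -> weight)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_diversify(corpus, query, k, f_q):
    """Naive DAS: score every document, then pick and discount k times."""
    # Step 1: relevance of every document to the query (touches the whole corpus).
    useful = {doc_id: cosine_sim(vec, query) for doc_id, vec in corpus.items()}
    result = []
    for _ in range(min(k, len(corpus))):
        # Step 2a: pick the remaining document with the highest usefulness.
        best = max(useful, key=useful.get)
        result.append(best)
        del useful[best]
        # Step 2b: discount all remaining documents by their similarity to the pick.
        for doc_id in useful:
            useful[doc_id] *= 1.0 - cosine_sim(corpus[doc_id], corpus[best]) * f_q
    return result

# Hypothetical corpus: "b" is almost a duplicate of "a"; "c" is less relevant but different.
corpus = {"a": {"x": 1.0}, "b": {"x": 0.9, "y": 0.1}, "c": {"x": 0.6, "z": 0.8}}
query = {"x": 1.0}
print(greedy_diversify(corpus, query, 2, 0.0))  # no diversification
print(greedy_diversify(corpus, query, 2, 1.0))  # strong diversification
```

Every round recomputes a similarity against each remaining document, which makes the two problems listed above concrete: the loop needs the full corpus and does work proportional to k times the corpus size.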




A Threshold Algorithm for DAS


            Generate-Filter Idea:
                   Incrementally compute documents in descending order of relevance;
                   Maintain upper and lower bounds on the relevance of every
                   encountered document;
                   Rerank the documents with diversity taken into account.
            Data Access Primitives
                   Sequential Access (SA): retrieve the id of the document with the
                   next highest weight for a specified feature i
                   Random Access (RA): retrieve the exact weight of feature i in
                   document d
            Drawbacks
                   Relevance must be fully computed and the entire content retrieved;
                   Much of the I/O is wasted, and a lot of it is not sequential;
                   Hardly any early pruning is possible.
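
To make the two access primitives concrete, here is a simplified Fagin-style threshold algorithm for the relevance-only "generate" phase. It is a sketch under assumptions, not the algorithm from the slides: relevance is taken to be a dot product over the query features, random access is used to compute exact scores instead of maintaining upper and lower bounds, and the diversity-aware reranking step is omitted. All names and the toy data are hypothetical.

```python
import heapq

def ta_top_k(postings, weights, query, k):
    """Top-k documents by dot-product relevance, using SA and RA.

    postings: feature -> list of (doc_id, weight), sorted by descending weight (SA)
    weights:  doc_id -> {feature: weight}, used for exact lookups (RA)
    query:    feature -> query weight
    """
    if k <= 0:
        return []
    pos = {f: 0 for f in query}  # next position to read in each posting list
    seen = {}                    # doc_id -> exact relevance score
    while True:
        progressed = False
        for f in query:
            plist = postings.get(f, [])
            if pos[f] < len(plist):
                doc_id, _ = plist[pos[f]]  # sequential access on feature f
                pos[f] += 1
                progressed = True
                if doc_id not in seen:
                    # Random access: exact weight of every query feature in doc_id.
                    seen[doc_id] = sum(qw * weights[doc_id].get(g, 0.0)
                                       for g, qw in query.items())
        # Threshold: the best score any document not yet seen could still reach.
        threshold = 0.0
        for f, qw in query.items():
            plist = postings.get(f, [])
            if pos[f] < len(plist):
                threshold += qw * plist[pos[f] - 1][1]
            # An exhausted list contributes 0: unseen documents lack feature f.
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if not progressed or (len(top) == k and top[-1][1] >= threshold):
            return top

# Hypothetical data: build the sorted posting lists from the document vectors.
weights = {"d1": {"x": 0.9, "y": 0.1}, "d2": {"x": 0.8, "y": 0.7},
           "d3": {"x": 0.1, "y": 0.9}, "d4": {"x": 0.05, "y": 0.05}}
postings = {}
for doc_id, vec in weights.items():
    for f, w in vec.items():
        postings.setdefault(f, []).append((doc_id, w))
for plist in postings.values():
    plist.sort(key=lambda e: -e[1])

print(ta_top_k(postings, weights, {"x": 1.0, "y": 1.0}, 1))
```

In this example the top-1 answer is found after two rounds of sequential access without ever touching `d4`. Early termination of this kind relies on relevance alone; once usefulness also depends on the documents already selected, the same stopping rule no longer applies directly, which is consistent with the drawbacks listed above.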


        DivGen: making Generate aware of diversity!
The DivGen Algorithm


            Idea: maintain a set of candidate documents with bounds on
            usefulness
            Novel Data Access Primitives
                   Bound Access (BA): retrieve the features with the highest weight in
                   d, as well as an upper bound w on the weight of any other feature
                   of d
                   Batch Sequential Access (BSA): retrieve the documents with the
                   highest weight of non-query feature i, as well as an upper bound w
                   on the weight of i in any other document
                   Document Random Access (DocRA): retrieve all the features with
                   nonzero weight in d, along with their exact weights
            Advantages of BA, BSA, DocRA
                   Existing index techniques can be easily leveraged to enable these
                   primitives.
                   These primitives can enable a set of early prunings to make the
                   algorithm more efficient.
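
One way to picture the three primitives is as methods on a toy in-memory index. This is only a sketch: the class, method names, and the batch-size parameter `n` are assumptions (the slides do not say how many features or documents a call returns), and the bound returned here is simply the next-largest weight that was not returned.

```python
class InMemoryIndex:
    """Toy index over sparse documents (doc_id -> {feature: weight})."""

    def __init__(self, docs):
        self.docs = docs
        # Per-feature posting lists, sorted by descending weight.
        self.postings = {}
        for doc_id, vec in docs.items():
            for f, w in vec.items():
                self.postings.setdefault(f, []).append((doc_id, w))
        for plist in self.postings.values():
            plist.sort(key=lambda e: -e[1])

    def bound_access(self, doc_id, n):
        """BA: the n heaviest features of doc_id, plus an upper bound on the rest."""
        ranked = sorted(self.docs[doc_id].items(), key=lambda e: -e[1])
        top, rest = ranked[:n], ranked[n:]
        bound = rest[0][1] if rest else 0.0
        return dict(top), bound

    def batch_sequential_access(self, feature, n):
        """BSA: the n documents heaviest in `feature`, plus a bound on all others."""
        plist = self.postings.get(feature, [])
        top, rest = plist[:n], plist[n:]
        bound = rest[0][1] if rest else 0.0
        return top, bound

    def doc_random_access(self, doc_id):
        """DocRA: every feature of doc_id with nonzero weight, with exact weights."""
        return {f: w for f, w in self.docs[doc_id].items() if w > 0}

# Hypothetical documents.
idx = InMemoryIndex({"d1": {"x": 0.9, "y": 0.3, "z": 0.1},
                     "d2": {"x": 0.2, "y": 0.8}})
print(idx.bound_access("d1", 2))
print(idx.batch_sequential_access("y", 1))
print(idx.doc_random_access("d2"))
```

The returned bounds are what allow pruning: a candidate whose remaining features are all bounded by a small w can have its similarity to other documents bounded without fetching its full content.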

Algorithm Pseudo-code




Revisit Data Access Primitives




An Execution Example




Evaluation




            Experimental Setup
                   Java 6, Oracle BerkeleyDB Java Edition v3.3.74
                   Ubuntu Linux 8.04, Intel Core2 X6800 2.93GHz CPU, 1GB Memory
                   ext3 filesystem with a page size of 4KB
            Datasets
                   Real data: taken from Grapevine, a tool for distilling knowledge from
                   social media
                   Synthetic data: Zipfian distribution across documents, and normal
                   distribution in each document. How to synthesize?




Evaluation (Cont. I)




Evaluation (Cont. II)




Conclusion




     This paper
            formally studied the diversity-aware search problem;
            proposed a set of novel data access primitives to efficiently solve
            DAS;
            performed experimental studies demonstrating the usefulness of
            DivGen.




Thank you!

 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Último (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Efficient Diversity-Aware Search Techniques

  • 1. Efficient Diversity-Aware Search Dacong (Tony) Yan May 4, 2011
  • 6. Background & Motivation
      What is search?
        1. A user U issues a query Q.
        2. A list of documents D, sorted by relevance R with respect to Q, is returned.
      User satisfaction sat(U, Q)
        It’s all about the relevance between D and Q!
        Each user U has their own perspective on relevance, R_U.
        Roughly speaking, $sat(U, Q) \propto 1 / diff(R_U, R)$.
        Problem: R_U is difficult to capture, and is usually ignored!
      Symptoms of ignoring R_U
        Redundant documents are included in the result set.
        The documents most relevant in terms of R_U are excluded from the result set.
      Solution: diversity-aware search!
      CSE 788, Dacong (Tony) Yan, Efficient Diversity-Aware Search
  • 7. Agenda
      Background & Motivation
      Diversity-Aware Search
      DivGen Approach
      Evaluation
      Conclusion
  • 9. Diversity-Aware Search
      Intuitively: relevance + dissimilarity.
      Formally, from a content-based diversification perspective:
        Data Model
        User Behavior Model
        Answer Quality
  • 11. Data Model
      Vector space model: documents are weighted sets of features.
      Each document d is represented as a vector $d = (d^1, d^2, \ldots)$, where feature i has weight $d^i \ge 0$ in document d.
      Examples
        Textual documents: features can be keywords, weighted in a tf-idf manner.
        Graph “documents”: features can be paths in the corpus graph.
        Recommender-system scenario: features can be the set of users who recommend a document.
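To make the vector space model concrete, here is a minimal sketch for the textual-documents case. The smoothed idf formula, the L2 normalization, and the use of a dot product as `sim` are assumptions made for illustration; the slides only say that keywords are weighted "in a tf-idf manner". The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {feature: weight} vector.

    Assumed weighting (not from the slides): tf * log(1 + N / df),
    which keeps every weight strictly positive.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(1 + n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def normalize(vec):
    """Scale a sparse vector to unit L2 length (empty stays empty)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm > 0 else {}


def sim(a, b):
    """Dot product of two sparse vectors; equals cosine similarity
    when both inputs are unit length, so the result lies in [0, 1]."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(f, 0.0) for f, w in a.items())


docs = [
    ["jaguar", "car", "speed"],
    ["jaguar", "cat", "jungle"],
    ["jaguar", "car", "speed"],
]
vectors = [normalize(v) for v in tfidf_vectors(docs)]
print(sim(vectors[0], vectors[2]))  # identical documents: close to 1
print(sim(vectors[0], vectors[1]))  # only "jaguar" is shared: between 0 and 1
```

Keeping similarities in [0, 1] matters later, because the user behavior model treats them as probabilities.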
  • 17. User Behavior Model
      Assumption: the user examines the results in their order of presentation.
      Usefulness of a document d: the probability that d is useful.
        Relevance: the probability that d is relevant.
        Novelty: the probability that d’s content is not redundant.
      Consider a document d preceded by d_1, d_2, ..., d_m in the answer to a query q. Its usefulness is defined as:
        $use(d \mid \{d_1, \ldots, d_m\}, q) = rel(d \mid q) \cdot (1 - red(d \mid \{d_1, \ldots, d_m\}, q))$
        ⇓
        $use(d \mid \{d_1, \ldots, d_m\}, q) = sim(d, q) \cdot \prod_{i=1}^{m} (1 - red(d \mid d_i, q))$
      red(d | d_i, q) can be decomposed further:
        sim(d, d_i): the probability that the content of d is similar to, or contained in, that of d_i;
        f_q: the estimated probability that, given a query q, a document whose content is similar to, or contained in, that of a previously emitted document is redundant.
        $red(d \mid d_i, q) = sim(d, d_i) \cdot f_q$
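The usefulness definition can be sketched directly in code. This is only an illustration under the assumption that the dropped operator in the second formula is a product over the preceding documents, which matches the probabilistic reading (d is novel only if it is not redundant with respect to any earlier result). The function name and argument layout are hypothetical.

```python
def usefulness(rel, sims_to_previous, f_q):
    """use(d | {d_1..d_m}, q) = rel * prod_i (1 - sim(d, d_i) * f_q).

    rel               -- sim(d, q), the relevance of d to the query
    sims_to_previous  -- iterable of sim(d, d_i) for the preceding results
    f_q               -- focus parameter for this query, in [0, 1]
    """
    novelty = 1.0
    for s in sims_to_previous:
        novelty *= 1.0 - s * f_q  # 1 - red(d | d_i, q)
    return rel * novelty


# With no preceding documents, usefulness equals relevance.
print(usefulness(0.8, [], 0.5))
# Two preceding documents, one half-similar and one fully similar to d.
print(usefulness(0.8, [0.5, 1.0], 0.5))
```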
  • 19. User Behavior Model (Cont.)
      Focus parameter f_q
        f_q is the main tunable parameter in $red(d \mid d_i, q) = sim(d, d_i) \cdot f_q$.
        It is defined on a per-query basis and denotes the amount of desired diversification.
        Smaller f_q favors relevance over diversity.
        Larger f_q favors diversity over relevance.
      Probabilistic interpretation: “How likely is a relevant document to be useful to the user, given that they have already examined a document with similar content?”
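A small worked example may help show how f_q trades relevance against diversity. The numbers are made up for illustration and do not come from the slides: suppose sim(d, q) = 0.9 and there is a single preceding document d_1 with sim(d, d_1) = 0.8.

```latex
\begin{align*}
f_q = 0:   &\quad \mathrm{use} = 0.9 \cdot (1 - 0.8 \cdot 0)   = 0.9 \\
f_q = 0.1: &\quad \mathrm{use} = 0.9 \cdot (1 - 0.8 \cdot 0.1) = 0.9 \cdot 0.92 = 0.828 \\
f_q = 0.9: &\quad \mathrm{use} = 0.9 \cdot (1 - 0.8 \cdot 0.9) = 0.9 \cdot 0.28 = 0.252
\end{align*}
```

At f_q = 0 the ranking reduces to pure relevance, while at f_q = 0.9 a highly relevant but near-duplicate document loses most of its usefulness.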
  • 22. Answer Quality
      Quantification properties
      Tractable instantiation
        An optimal answer under strict order dominance semantics can be found by greedily identifying the best result at positions 1, 2, ..., k.
  • 25. A First Stab at DAS (Diversity-Aware Search)
      Steps:
        1. Compute the relevance of each document to the query.
        2. Identify the highest-scoring document d, and update the usefulness of all other documents based on their similarity to d.
        3. Repeat step 2 until k documents have been selected.
      Problems:
        It requires access to the entire corpus.
        It is too inefficient even for a moderately large set of documents.
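The three steps above can be sketched as a short greedy loop. This is a naive in-memory illustration, not the author's implementation: `relevance` is assumed to be a precomputed mapping from document id to sim(d, q), and `sim` is any pairwise similarity function returning values in [0, 1]. Both names are hypothetical.

```python
def greedy_diversify(relevance, sim, f_q, k):
    """Return up to k document ids in greedy order of usefulness.

    relevance -- dict: doc_id -> sim(d, q), computed for the whole corpus
    sim       -- callable(doc_a, doc_b) -> similarity in [0, 1]
    f_q       -- focus parameter in [0, 1]
    """
    use = dict(relevance)  # step 1: usefulness starts out as relevance
    answer = []
    while use and len(answer) < k:
        best = max(use, key=use.get)  # step 2: highest current usefulness
        answer.append(best)
        del use[best]
        for d in use:  # discount every remaining document
            use[d] *= 1.0 - sim(d, best) * f_q
    return answer


# Toy corpus: "a" and "b" are near-duplicates, "c" is different.
pairwise = {frozenset(("a", "b")): 0.9}
toy_sim = lambda x, y: pairwise.get(frozenset((x, y)), 0.1)
rel = {"a": 0.9, "b": 0.85, "c": 0.6}

print(greedy_diversify(rel, toy_sim, f_q=0.0, k=2))  # ['a', 'b']
print(greedy_diversify(rel, toy_sim, f_q=0.9, k=2))  # ['a', 'c']
```

Each round touches every remaining document, so the cost is on the order of k times the corpus size in similarity computations, on top of scoring the whole corpus once. That is the inefficiency the slide points out.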
  • 29. A Threshold Algorithm for DAS
      Generate-Filter idea:
        Incrementally compute documents in descending order of relevance.
        Maintain upper and lower bounds on the relevance of every encountered document.
        Rerank the documents with diversity taken into account.
      Data access primitives
        Sequential Access (SA): retrieve the id of the document with the next-highest weight for a specified feature i.
        Random Access (RA): retrieve the exact weight of feature i in document d.
      Drawbacks
        The relevance of each document must be fully computed, and its entire content retrieved.
        I/O effort is wasted, and much of this I/O is not sequential in nature.
        Hardly any early pruning is possible.
      DivGen: making Generate aware of diversity!
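As a rough illustration of the Generate step, the sketch below implements the classic threshold algorithm for top-k retrieval on top of the SA and RA primitives. It is a simplified variant: it resolves exact scores with RA as soon as a document is seen, rather than maintaining upper and lower bounds as the slide describes, and it assumes relevance is a weighted sum of query-feature weights. The class and function names are hypothetical, and the index is an in-memory stand-in for a disk-based one.

```python
import heapq


class InvertedIndex:
    """In-memory stand-in for an index supporting SA and RA."""

    def __init__(self, docs):
        # docs: {doc_id: {feature: weight}}
        self.docs = docs
        self.postings = {}
        for doc_id, vec in docs.items():
            for feature, weight in vec.items():
                self.postings.setdefault(feature, []).append((doc_id, weight))
        for plist in self.postings.values():
            plist.sort(key=lambda entry: -entry[1])  # descending weight

    def sequential_access(self, feature, position):
        """SA: (doc_id, weight) at the given rank for `feature`, or None."""
        plist = self.postings.get(feature, [])
        return plist[position] if position < len(plist) else None

    def random_access(self, feature, doc_id):
        """RA: exact weight of `feature` in `doc_id` (0.0 if absent)."""
        return self.docs[doc_id].get(feature, 0.0)


def top_k_by_relevance(index, query, k):
    """Return [(doc_id, score)] for the k highest-scoring documents.

    query -- {feature: weight} with non-negative weights
    """
    scores = {}
    position = 0
    while True:
        threshold = 0.0  # best score any still-unseen document could have
        progressed = False
        for feature, q_weight in query.items():
            hit = index.sequential_access(feature, position)
            if hit is None:
                continue  # list exhausted: unseen documents have weight 0
            progressed = True
            doc_id, weight = hit
            threshold += q_weight * weight
            if doc_id not in scores:
                scores[doc_id] = sum(
                    qw * index.random_access(f, doc_id)
                    for f, qw in query.items()
                )
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if not progressed or (len(top) == k and top[-1][1] >= threshold):
            return top
        position += 1


corpus = {
    "d1": {"x": 0.9, "y": 0.1},
    "d2": {"x": 0.5, "y": 0.8},
    "d3": {"x": 0.2, "y": 0.3},
    "d4": {"y": 0.9},
}
print(top_k_by_relevance(InvertedIndex(corpus), {"x": 1.0, "y": 1.0}, k=2))
```

A Filter stage would then rerank the generated documents by usefulness. Because Generate knows nothing about diversity, it has to fetch full content for documents that the reranking may later discard.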
  • 32. The DivGen Algorithm
      Idea: maintain a set of candidate documents with bounds on their usefulness.
      Novel data access primitives
        Bound Access (BA): retrieve the features with the highest weights in d, as well as an upper bound w on the weight of any other feature of d.
        Batch Sequential Access (BSA): retrieve the documents with the highest weights for a non-query feature i, as well as an upper bound w on the weight of i in any other document.
        Document Random Access (DocRA): retrieve all the features with nonzero weight in d, along with their exact weights.
      Advantages of BA, BSA, and DocRA
        Existing indexing techniques can easily be leveraged to support these primitives.
        These primitives enable a set of early-pruning steps that make the algorithm more efficient.
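To show what the three primitives return, here is a minimal in-memory sketch. It is an assumption-laden illustration, not the DivGen implementation: the class name, the batch-size parameter `n`, the per-feature cursor, and the choice of the next unread weight as the upper bound are all hypothetical.

```python
class DivGenIndex:
    """In-memory stand-in exposing BA, BSA, and DocRA."""

    def __init__(self, docs):
        # docs: {doc_id: {feature: weight}}
        self.docs = docs
        # Per-document feature lists, highest weight first (used by BA).
        self.by_doc = {
            doc_id: sorted(vec.items(), key=lambda fw: -fw[1])
            for doc_id, vec in docs.items()
        }
        # Per-feature document lists, highest weight first (used by BSA).
        self.by_feature = {}
        for doc_id, vec in docs.items():
            for feature, weight in vec.items():
                self.by_feature.setdefault(feature, []).append((doc_id, weight))
        for plist in self.by_feature.values():
            plist.sort(key=lambda dw: -dw[1])
        self.cursor = {}  # feature -> number of postings already returned

    def bound_access(self, doc_id, n):
        """BA: the n heaviest features of doc_id, plus an upper bound on
        the weight of every feature of doc_id that was not returned."""
        ranked = self.by_doc[doc_id]
        bound = ranked[n][1] if len(ranked) > n else 0.0
        return ranked[:n], bound

    def batch_sequential_access(self, feature, n):
        """BSA: the next n documents by weight of `feature`, plus an upper
        bound on that weight in every document not yet returned."""
        plist = self.by_feature.get(feature, [])
        start = self.cursor.get(feature, 0)
        batch = plist[start:start + n]
        end = start + len(batch)
        self.cursor[feature] = end
        bound = plist[end][1] if end < len(plist) else 0.0
        return batch, bound

    def doc_random_access(self, doc_id):
        """DocRA: every nonzero feature of doc_id with its exact weight."""
        return dict(self.docs[doc_id])


index = DivGenIndex({
    "d1": {"x": 0.9, "y": 0.4, "z": 0.1},
    "d2": {"x": 0.5, "z": 0.7},
    "d3": {"z": 0.2},
})
print(index.bound_access("d1", 2))
print(index.batch_sequential_access("z", 2))
print(index.doc_random_access("d3"))
```

The returned bounds are what make pruning possible: an upper bound on the unread weights gives an upper bound on a candidate's similarity to other documents, and therefore a range for its usefulness, without reading its full content.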
  • 33. Algorithm Pseudo-code
  • 34. Revisit Data Access Primitives
  • 36. An Execution Example
  • 39. Evaluation
      Experimental setup
        Java 6, Oracle Berkeley DB Java Edition v3.3.74
        Ubuntu Linux 8.04, Intel Core 2 X6800 2.93 GHz CPU, 1 GB memory
        ext3 filesystem with a page size of 4 KB
      Datasets
        Real data: taken from Grapevine, a tool for distilling knowledge from social media.
        Synthetic data: Zipfian distribution across documents, and normal distribution within each document.
        How to synthesize?
  • 40. Evaluation (Cont. I)
  • 41. Evaluation (Cont. II)
  • 42. Conclusion
      This paper
        formally studied the diversity-aware search (DAS) problem;
        proposed a set of novel data access primitives to solve DAS efficiently;
        performed experimental studies demonstrating the usefulness of DivGen.