SlideShare una empresa de Scribd logo
1 de 36
Retrieval and Feedback Models
      for Blog Distillation


  CMU at the TREC 2007 Blog Track

    Jonathan Elsas, Jaime Arguello,
     Jamie Callan, Jaime Carbonell
CMU’s Blog Distillation Focus
• Two Research Questions:
  What is the appropriate unit of retrieval?
   Feeds or Entries?
  How can we effectively do pseudo-relevance
   feedback for Blog Distillation?


• Our four submissions investigate these
  two dimensions.
Outline
• Corpus Preprocessing Tasks
• Two Feedback Models
• Two Retrieval Models
• Results
• Discussion
Corpus Preprocessing
• Used only FEED documents (vs. PERMALINK
  or HOMEPAGE documents).
• For each FEEDNO, extracted each new entry
  across all feed fetches
• Aggregated into 1-document-per-FEEDNO,
  retaining structural elements from the FEED
  documents:
  Title, description, entry, entry title, etc…
• Very helpful: Python Universal Feed Parser
Two Pseudo-Relevance
        Feedback Models
• Indri’s Built in Pseudo-Relevance Feedback
  (Lavrenko’s Relevance Model)
  – Using Metzler’s Dependence Model query on the
    full feed documents
  – Produces weighted unigram PRF query,
             QRM = #weight( w1 t1 w2 t2 … w50 t50)


• Wikipedia-based Pseudo-Relevance Feedback
  – Focus on anchor text linking to highly ranked
    documents wrt baseline BOW query
Wikipedia-based PRF
            Wikipedia




BOW query




                  …
Wikipedia-based PRF
                Wikipedia



Relevance Set
(top T)


BOW query


                            Working Set
                            (top N)



                      …
Wikipedia-based PRF
                     Wikipedia



Relevance Set
(top T)




                                 Working Set
                                 (top N)



                           …
Wikipedia-based PRF
                              Wikipedia



Relevance Set                                       Extract anchor text from
                                                    Working Set that point to
(top T)                                             the Relevance Set.




                                                         Working Set
                                                         (top N)
       Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20)

                where wi ~ sum( T …
                                  - rank(target(ai)) )
Wikipedia-based PRF




         Query: “Apple iPod”
Two Pseudo-Relevance
  Feedback Models
    Q 983 “Photography”
  Wikipedia PRF         Indri’s Relevance Model
        photography     photography
       photographer     aerial
       depth of field    digital
              camera    full
         photograph     resource
       pulitzer prize   stock
      digital camera    free
   photographic film     information
    photojournalism     art
    cinematography      wedding
       shutter speed    great
Two Pseudo-Relevance
      Feedback Models
           Q 995 “Ruby on Rails”
        Wikipedia PRF         Indri’s Relevance Model
                   pokmon     rail
                       ruby   ruby
  pokmon ruby and sapphire    kakutani
            pokmon emerald    que
 ruby programming language    article
                        php   weblog
pokmon firered and leafgreen   activestate
         pokmon adventures    rubyonrail
       pokmon video games     new
             standard gauge   develop
                       tram   dontstopmusic
Two Retrieval Models
• Large Document model
  – Entire Feed is the unit of retrieval
• Small Document model
  – Individual entry is the unit of retrieval
  – Ranked Entries are aggregated into a Feed
    Ranking
Large Document model
Feed Document
Collection



     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Large Document model
Feed Document
Collection

                         Q
     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Large Document model
Feed Document
Collection                   Ranked Feeds

                         Q
     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Small Document model
  Entry Document
     Collection



             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document             Q
     Collection



             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document                 Q
     Collection
                             Ranked Entries

             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document                 Q
     Collection                               Ranked Feeds
                             Ranked Entries

             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Large Document Model
• Language Modeling retrieval model using
  the feed document structure.
• Features used for Large Document
  model:
  – P(Feed | Qfeed-title)
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Large Document Model
• Language Modeling retrieval model using
  the feed document structure.
• Features used for Large Document
Large Document Indri Query:
  model:
  – P(Feed | Qfeed-title)
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Small Document Model
• Feed ranking in blog distillation is analogous to
  resource ranking in federated search
                     Feed ~ Resource
                    Entry ~ Document


• Created sampled collection
   – Sampled (with replacement) 100 entries from each
     feed
   – Control for dramatically different feed lengths
Small Document Model
• Feed ranking in blog distillation is analogous to
  resource ranking in federated search
                     Feed ~ Resource
                    Entry ~ Document


• Created sampled collection
   – Sampled (with replacement) 100 entries from each
     feed
   – Control for dramatically different feed lengths
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art
 federated search algorithm
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art length:
Assuming uniform prior, equal feed
 federated search algorithm
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art length:
Assuming uniform prior, equal feed
 federated search algorithm
Small Document Model
• Features used in the small document
  model :
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Small Document Model
• Features used in the small document
  model :
Small Document Indri Query:
  – P(Feed | Q        )
               entry-text

  – P(Feed | QRM)
  – P(Feed | Qwiki)
Parameter Setting
• Selecting feature weights (λ’s) required
  training data
• Relevance judgments produced for a
  small subset of the queries (6+2)
  BOW title query, 50 docs judged/query
• Simple grid search to choose parameters
  that maximized MAP
Results




Our best run:
 Wikipedia expansion + Large Doc.
Results




Queries (ordered by CMUfeedW performance)
Feature Elimination
CMUfeedW




                    CMUfeed
Conclusions
• What worked
  –   Preprocessing the corpus & using only feed XML
  –   Simple retrieval model with appropriate features
  –   Wikipedia expansion
  –   (small amount of) training data
• What didn’t       (… or what we tried without success)
  – Small Document Model (but we think it can)
  – Spam/Splog detection, anchor text, URL
    segmentation
Thank You
Feature Elimination
                       +Training   -Training
+All (CMUfeedW)         0.3695      0.3536

-Dependence             0.3676      0.3457
Model
-Wiki (CMUfeed)         0.3385      0.3152

-Wiki, -Indri’s PRF                 0.2989

-Wiki, -Indri’s PRF,                0.2907
-Dep Model

Más contenido relacionado

La actualidad más candente

Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleSean Cribbs
 
Java EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontJava EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontDavid Delabassee
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
Configure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaConfigure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaAnatole Tresch
 

La actualidad más candente (6)

JSON-B for CZJUG
JSON-B for CZJUGJSON-B for CZJUG
JSON-B for CZJUG
 
Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with Ripple
 
Java EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontJava EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web front
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
Message passing
Message passingMessage passing
Message passing
 
Configure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaConfigure Your Projects with Apache Tamaya
Configure Your Projects with Apache Tamaya
 

Similar a CMU TREC 2007 Blog Track

Decoupling Official Statistics
Decoupling Official StatisticsDecoupling Official Statistics
Decoupling Official StatisticsXavier Badosa
 
Modern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextModern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextAlexander Mostovenko
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaFernando Rodriguez
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseBrendan Tierney
 
Deploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeDeploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeTwilio Inc
 
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...Mark Wong
 
Smart Client Development
Smart Client DevelopmentSmart Client Development
Smart Client DevelopmentTamir Khason
 
Fisl - Deployment
Fisl - DeploymentFisl - Deployment
Fisl - DeploymentFabio Akita
 
WebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryWebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryC4Media
 
2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIAguestfdcb8a
 
ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoNikolay Stoitsev
 
Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Adam Tomat
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Mickaël Rémond
 

Similar a CMU TREC 2007 Blog Track (20)

Jstl Guide
Jstl GuideJstl Guide
Jstl Guide
 
Decoupling Official Statistics
Decoupling Official StatisticsDecoupling Official Statistics
Decoupling Official Statistics
 
Modern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextModern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettext
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and Scala
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
Deploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeDeploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero Downtime
 
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
 
Smart Client Development
Smart Client DevelopmentSmart Client Development
Smart Client Development
 
Qtp online training
Qtp online trainingQtp online training
Qtp online training
 
Fisl - Deployment
Fisl - DeploymentFisl - Deployment
Fisl - Deployment
 
WebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryWebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All Revolutionary
 
Java Annotations and Pre-processing
Java  Annotations and Pre-processingJava  Annotations and Pre-processing
Java Annotations and Pre-processing
 
2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA
 
ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to Go
 
Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019
 
Nuxeo JavaOne 2007
Nuxeo JavaOne 2007Nuxeo JavaOne 2007
Nuxeo JavaOne 2007
 
Rack Middleware
Rack MiddlewareRack Middleware
Rack Middleware
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
 
Broadleaf Presents Thymeleaf
Broadleaf Presents ThymeleafBroadleaf Presents Thymeleaf
Broadleaf Presents Thymeleaf
 

Último

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Último (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

CMU TREC 2007 Blog Track

  • 1. Retrieval and Feedback Models for Blog Distillation CMU at the TREC 2007 Blog Track Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell
  • 2. CMU’s Blog Distillation Focus • Two Research Questions: What is the appropriate unit of retrieval? Feeds or Entries? How can we effectively do pseudo-relevance feedback for Blog Distillation? • Our four submissions investigate these two dimensions.
  • 3. Outline • Corpus Preprocessing Tasks • Two Feedback Models • Two Retrieval Models • Results • Discussion
  • 4. Corpus Preprocessing • Used only FEED documents (vs. PERMALINK or HOMEPAGE documents). • For each FEEDNO, extracted each new entry across all feed fetches • Aggregated into 1-document-per-FEEDNO, retaining structural elements from the FEED documents: Title, description, entry, entry title, etc… • Very helpful: Python Universal Feed Parser
  • 5. Two Pseudo-Relevance Feedback Models • Indri’s Built in Pseudo-Relevance Feedback (Lavrenko’s Relevance Model) – Using Metzler’s Dependence Model query on the full feed documents – Produces weighted unigram PRF query, QRM = #weight( w1 t1 w2 t2 … w50 t50) • Wikipedia-based Pseudo-Relevance Feedback – Focus on anchor text linking to highly ranked documents wrt baseline BOW query
  • 6. Wikipedia-based PRF Wikipedia BOW query …
  • 7. Wikipedia-based PRF Wikipedia Relevance Set (top T) BOW query Working Set (top N) …
  • 8. Wikipedia-based PRF Wikipedia Relevance Set (top T) Working Set (top N) …
  • 9. Wikipedia-based PRF Wikipedia Relevance Set Extract anchor text from Working Set that point to (top T) the Relevance Set. Working Set (top N) Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20) where wi ~ sum( T … - rank(target(ai)) )
  • 10. Wikipedia-based PRF Query: “Apple iPod”
  • 11. Two Pseudo-Relevance Feedback Models Q 983 “Photography” Wikipedia PRF Indri’s Relevance Model photography photography photographer aerial depth of field digital camera full photograph resource pulitzer prize stock digital camera free photographic film information photojournalism art cinematography wedding shutter speed great
  • 12. Two Pseudo-Relevance Feedback Models Q 995 “Ruby on Rails” Wikipedia PRF Indri’s Relevance Model pokmon rail ruby ruby pokmon ruby and sapphire kakutani pokmon emerald que ruby programming language article php weblog pokmon firered and leafgreen activestate pokmon adventures rubyonrail pokmon video games new standard gauge develop tram dontstopmusic
  • 13. Two Retrieval Models • Large Document model – Entire Feed is the unit of retrieval • Small Document model – Individual entry is the unit of retrieval – Ranked Entries are aggregated into a Feed Ranking
  • 14. Large Document model Feed Document Collection <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 15. Large Document model Feed Document Collection Q <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 16. Large Document model Feed Document Collection Ranked Feeds Q <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 17. Small Document model Entry Document Collection <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 18. Small Document model Entry Document Q Collection <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 19. Small Document model Entry Document Q Collection Ranked Entries <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 20. Small Document model Entry Document Q Collection Ranked Feeds Ranked Entries <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 21. Large Document Model • Language Modeling retrieval model using the feed document structure. • Features used for Large Document model: – P(Feed | Qfeed-title) – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 22. Large Document Model • Language Modeling retrieval model using the feed document structure. • Features used for Large Document Large Document Indri Query: model: – P(Feed | Qfeed-title) – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 23. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search Feed ~ Resource Entry ~ Document • Created sampled collection – Sampled (with replacement) 100 entries from each feed – Control for dramatically different feed lengths
  • 24. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search Feed ~ Resource Entry ~ Document • Created sampled collection – Sampled (with replacement) 100 entries from each feed – Control for dramatically different feed lengths
  • 25. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art federated search algorithm
  • 26. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art length: Assuming uniform prior, equal feed federated search algorithm
  • 27. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art length: Assuming uniform prior, equal feed federated search algorithm
  • 28. Small Document Model • Features used in the small document model : – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 29. Small Document Model • Features used in the small document model : Small Document Indri Query: – P(Feed | Q ) entry-text – P(Feed | QRM) – P(Feed | Qwiki)
  • 30. Parameter Setting • Selecting feature weights (λ’s) required training data • Relevance judgments produced for a small subset of the queries (6+2) BOW title query, 50 docs judged/query • Simple grid search to choose parameters that maximized MAP
  • 31. Results Our best run: Wikipedia expansion + Large Doc.
  • 32. Results Queries (ordered by CMUfeedW performance)
  • 34. Conclusions • What worked – Preprocessing the corpus & using only feed XML – Simple retrieval model with appropriate features – Wikipedia expansion – (small amount of) training data • What didn’t (… or what we tried without success) – Small Document Model (but we think it can) – Spam/Splog detection, anchor text, URL segmentation
  • 36. Feature Elimination +Training -Training +All (CMUfeedW) 0.3695 0.3536 -Dependence 0.3676 0.3457 Model -Wiki (CMUfeed) 0.3385 0.3152 -Wiki, -Indri’s PRF 0.2989 -Wiki, -Indri’s PRF, 0.2907 -Dep Model