SlideShare una empresa de Scribd logo
1 de 36
Retrieval and Feedback Models
      for Blog Distillation


  CMU at the TREC 2007 Blog Track

    Jonathan Elsas, Jaime Arguello,
     Jamie Callan, Jaime Carbonell
CMU’s Blog Distillation Focus
• Two Research Questions:
  What is the appropriate unit of retrieval?
   Feeds or Entries?
  How can we effectively do pseudo-relevance
   feedback for Blog Distillation?


• Our four submissions investigate these
  two dimensions.
Outline
• Corpus Preprocessing Tasks
• Two Feedback Models
• Two Retrieval Models
• Results
• Discussion
Corpus Preprocessing
• Used only FEED documents (vs. PERMALINK
  or HOMEPAGE documents).
• For each FEEDNO, extracted each new entry
  across all feed fetches
• Aggregated into 1-document-per-FEEDNO,
  retaining structural elements from the FEED
  documents:
  Title, description, entry, entry title, etc…
• Very helpful: Python Universal Feed Parser
Two Pseudo-Relevance
        Feedback Models
• Indri’s Built in Pseudo-Relevance Feedback
  (Lavrenko’s Relevance Model)
  – Using Metzler’s Dependence Model query on the
    full feed documents
  – Produces weighted unigram PRF query,
             QRM = #weight( w1 t1 w2 t2 … w50 t50)


• Wikipedia-based Pseudo-Relevance Feedback
  – Focus on anchor text linking to highly ranked
    documents wrt baseline BOW query
Wikipedia-based PRF
            Wikipedia




BOW query




                  …
Wikipedia-based PRF
                Wikipedia



Relevance Set
(top T)


BOW query


                            Working Set
                            (top N)



                      …
Wikipedia-based PRF
                     Wikipedia



Relevance Set
(top T)




                                 Working Set
                                 (top N)



                           …
Wikipedia-based PRF
                              Wikipedia



Relevance Set                                       Extract anchor text from
                                                    Working Set that point to
(top T)                                             the Relevance Set.




                                                         Working Set
                                                         (top N)
       Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20)

                where wi ~ sum( T …
                                  - rank(target(ai)) )
Wikipedia-based PRF




         Query: “Apple iPod”
Two Pseudo-Relevance
  Feedback Models
    Q 983 “Photography”
  Wikipedia PRF         Indri’s Relevance Model
        photography     photography
       photographer     aerial
       depth of field    digital
              camera    full
         photograph     resource
       pulitzer prize   stock
      digital camera    free
   photographic film     information
    photojournalism     art
    cinematography      wedding
       shutter speed    great
Two Pseudo-Relevance
      Feedback Models
           Q 995 “Ruby on Rails”
        Wikipedia PRF         Indri’s Relevance Model
                   pokmon     rail
                       ruby   ruby
  pokmon ruby and sapphire    kakutani
            pokmon emerald    que
 ruby programming language    article
                        php   weblog
pokmon firered and leafgreen   activestate
         pokmon adventures    rubyonrail
       pokmon video games     new
             standard gauge   develop
                       tram   dontstopmusic
Two Retrieval Models
• Large Document model
  – Entire Feed is the unit of retrieval
• Small Document model
  – Individual entry is the unit of retrieval
  – Ranked Entries are aggregated into a Feed
    Ranking
Large Document model
Feed Document
Collection



     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Large Document model
Feed Document
Collection

                         Q
     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Large Document model
Feed Document
Collection                   Ranked Feeds

                         Q
     <?xml… <?xml…
     <feed> <feed>
<?xml… <?xml…
     <entry> <entry>
 `<?xml… <?xml…
   …
   <?xml… …<entry>
     <entry> <?xml…
     …         …
 </…> … </…>    …
     <entry> </…>
   </…>        <entry>
    </…>
     <entry> </…>
               <entry>
     <entry> <entry>
        …         …
      </…>      </…>
Small Document model
  Entry Document
     Collection



             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document             Q
     Collection



             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document                 Q
     Collection
                             Ranked Entries

             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Small Document model
  Entry Document                 Q
     Collection                               Ranked Feeds
                             Ranked Entries

             <entry>
      <entry> <entry>
        <entry> <entry>
<entry><entry> <entry>
              <entry>
  <entry><entry> <?xml…
    <entry> <entry>
 <entry>
            <?xml…
                   <entry>
      <entry> <entry>
       <?xml… <entry>
   <entry> <entry><?xml…
           <entry>
     <entry>
        <entry>
            <entry><entry>
       <entry>
        <?xml…
<entry>       <entry>
  <entry>      <entry>
         <entry> <?xml…
    <entry>
      <entry> <entry>
       <?xml…
        <entry>
Large Document Model
• Language Modeling retrieval model using
  the feed document structure.
• Features used for Large Document
  model:
  – P(Feed | Qfeed-title)
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Large Document Model
• Language Modeling retrieval model using
  the feed document structure.
• Features used for Large Document
Large Document Indri Query:
  model:
  – P(Feed | Qfeed-title)
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Small Document Model
• Feed ranking in blog distillation is analogous to
  resource ranking in federated search
                     Feed ~ Resource
                    Entry ~ Document


• Created sampled collection
   – Sampled (with replacement) 100 entries from each
     feed
   – Control for dramatically different feed lengths
Small Document Model
• Feed ranking in blog distillation is analogous to
  resource ranking in federated search
                     Feed ~ Resource
                    Entry ~ Document


• Created sampled collection
   – Sampled (with replacement) 100 entries from each
     feed
   – Control for dramatically different feed lengths
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art
 federated search algorithm
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art length:
Assuming uniform prior, equal feed
 federated search algorithm
Small Document Model
Adapted Relevant Document Distribution
 Estimation (ReDDE) resource ranking.

ReDDE: well-known state-of-the-art length:
Assuming uniform prior, equal feed
 federated search algorithm
Small Document Model
• Features used in the small document
  model :
  – P(Feed | Qentry-text)
  – P(Feed | QRM)
  – P(Feed | Qwiki)
Small Document Model
• Features used in the small document
  model :
Small Document Indri Query:
  – P(Feed | Q        )
               entry-text

  – P(Feed | QRM)
  – P(Feed | Qwiki)
Parameter Setting
• Selecting feature weights (λ’s) required
  training data
• Relevance judgments produced for a
  small subset of the queries (6+2)
  BOW title query, 50 docs judged/query
• Simple grid search to choose parameters
  that maximized MAP
Results




Our best run:
 Wikipedia expansion + Large Doc.
Results




Queries (ordered by CMUfeedW performance)
Feature Elimination
CMUfeedW




                    CMUfeed
Conclusions
• What worked
  –   Preprocessing the corpus & using only feed XML
  –   Simple retrieval model with appropriate features
  –   Wikipedia expansion
  –   (small amount of) training data
• What didn’t       (… or what we tried without success)
  – Small Document Model (but we think it can)
  – Spam/Splog detection, anchor text, URL
    segmentation
Thank You
Feature Elimination
                       +Training   -Training
+All (CMUfeedW)         0.3695      0.3536

-Dependence             0.3676      0.3457
Model
-Wiki (CMUfeed)         0.3385      0.3152

-Wiki, -Indri’s PRF                 0.2989

-Wiki, -Indri’s PRF,                0.2907
-Dep Model

Más contenido relacionado

La actualidad más candente

Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleSean Cribbs
 
Java EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontJava EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontDavid Delabassee
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studyCharlie Hull
 
Configure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaConfigure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaAnatole Tresch
 

La actualidad más candente (6)

JSON-B for CZJUG
JSON-B for CZJUGJSON-B for CZJUG
JSON-B for CZJUG
 
Embrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with RippleEmbrace NoSQL and Eventual Consistency with Ripple
Embrace NoSQL and Eventual Consistency with Ripple
 
Java EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web frontJava EE 8 - What’s new on the Web front
Java EE 8 - What’s new on the Web front
 
Solr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance studySolr and Elasticsearch, a performance study
Solr and Elasticsearch, a performance study
 
Message passing
Message passingMessage passing
Message passing
 
Configure Your Projects with Apache Tamaya
Configure Your Projects with Apache TamayaConfigure Your Projects with Apache Tamaya
Configure Your Projects with Apache Tamaya
 

Similar a CMU TREC 2007 Blog Track

Decoupling Official Statistics
Decoupling Official StatisticsDecoupling Official Statistics
Decoupling Official StatisticsXavier Badosa
 
Modern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextModern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextAlexander Mostovenko
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaFernando Rodriguez
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseBrendan Tierney
 
Deploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeDeploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeTwilio Inc
 
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...Mark Wong
 
Smart Client Development
Smart Client DevelopmentSmart Client Development
Smart Client DevelopmentTamir Khason
 
Fisl - Deployment
Fisl - DeploymentFisl - Deployment
Fisl - DeploymentFabio Akita
 
WebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryWebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryC4Media
 
2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIAguestfdcb8a
 
ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoNikolay Stoitsev
 
Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Adam Tomat
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryTatsuhiko Miyagawa
 
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Mickaël Rémond
 

Similar a CMU TREC 2007 Blog Track (20)

Jstl Guide
Jstl GuideJstl Guide
Jstl Guide
 
Decoupling Official Statistics
Decoupling Official StatisticsDecoupling Official Statistics
Decoupling Official Statistics
 
Modern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettextModern javascript localization with c-3po and the good old gettext
Modern javascript localization with c-3po and the good old gettext
 
Concurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and ScalaConcurrent and Distributed Applications with Akka, Java and Scala
Concurrent and Distributed Applications with Akka, Java and Scala
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
Deploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero DowntimeDeploying Next Gen Systems with Zero Downtime
Deploying Next Gen Systems with Zero Downtime
 
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
PostgreSQL Portland Performance Practice Project - Database Test 2 Series Ove...
 
Smart Client Development
Smart Client DevelopmentSmart Client Development
Smart Client Development
 
Qtp online training
Qtp online trainingQtp online training
Qtp online training
 
Fisl - Deployment
Fisl - DeploymentFisl - Deployment
Fisl - Deployment
 
WebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All RevolutionaryWebAssembly. Neither Web Nor Assembly, All Revolutionary
WebAssembly. Neither Web Nor Assembly, All Revolutionary
 
Java Annotations and Pre-processing
Java  Annotations and Pre-processingJava  Annotations and Pre-processing
Java Annotations and Pre-processing
 
2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA2 Roads to Redemption - Thoughts on XSS and SQLIA
2 Roads to Redemption - Thoughts on XSS and SQLIA
 
ISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to GoISTA 2019 - Migrating data-intensive microservices from Python to Go
ISTA 2019 - Migrating data-intensive microservices from Python to Go
 
Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019Supercharging WordPress Development - Wordcamp Brighton 2019
Supercharging WordPress Development - Wordcamp Brighton 2019
 
Nuxeo JavaOne 2007
Nuxeo JavaOne 2007Nuxeo JavaOne 2007
Nuxeo JavaOne 2007
 
Rack Middleware
Rack MiddlewareRack Middleware
Rack Middleware
 
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQueryRemedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
Remedie: Building a desktop app with HTTP::Engine, SQLite and jQuery
 
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...Property-based testing of XMPP: generate your tests automatically - ejabberd ...
Property-based testing of XMPP: generate your tests automatically - ejabberd ...
 
Broadleaf Presents Thymeleaf
Broadleaf Presents ThymeleafBroadleaf Presents Thymeleaf
Broadleaf Presents Thymeleaf
 

Último

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Último (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

CMU TREC 2007 Blog Track

  • 1. Retrieval and Feedback Models for Blog Distillation CMU at the TREC 2007 Blog Track Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell
  • 2. CMU’s Blog Distillation Focus • Two Research Questions: What is the appropriate unit of retrieval? Feeds or Entries? How can we effectively do pseudo-relevance feedback for Blog Distillation? • Our four submissions investigate these two dimensions.
  • 3. Outline • Corpus Preprocessing Tasks • Two Feedback Models • Two Retrieval Models • Results • Discussion
  • 4. Corpus Preprocessing • Used only FEED documents (vs. PERMALINK or HOMEPAGE documents). • For each FEEDNO, extracted each new entry across all feed fetches • Aggregated into 1-document-per-FEEDNO, retaining structural elements from the FEED documents: Title, description, entry, entry title, etc… • Very helpful: Python Universal Feed Parser
  • 5. Two Pseudo-Relevance Feedback Models • Indri’s Built in Pseudo-Relevance Feedback (Lavrenko’s Relevance Model) – Using Metzler’s Dependence Model query on the full feed documents – Produces weighted unigram PRF query, QRM = #weight( w1 t1 w2 t2 … w50 t50) • Wikipedia-based Pseudo-Relevance Feedback – Focus on anchor text linking to highly ranked documents wrt baseline BOW query
  • 6. Wikipedia-based PRF Wikipedia BOW query …
  • 7. Wikipedia-based PRF Wikipedia Relevance Set (top T) BOW query Working Set (top N) …
  • 8. Wikipedia-based PRF Wikipedia Relevance Set (top T) Working Set (top N) …
  • 9. Wikipedia-based PRF Wikipedia Relevance Set Extract anchor text from Working Set that point to (top T) the Relevance Set. Working Set (top N) Qwiki = #weight( w1 a-text1 w2 a-text2 … w20 a-text20) where wi ~ sum( T … - rank(target(ai)) )
  • 10. Wikipedia-based PRF Query: “Apple iPod”
  • 11. Two Pseudo-Relevance Feedback Models Q 983 “Photography” Wikipedia PRF Indri’s Relevance Model photography photography photographer aerial depth of field digital camera full photograph resource pulitzer prize stock digital camera free photographic film information photojournalism art cinematography wedding shutter speed great
  • 12. Two Pseudo-Relevance Feedback Models Q 995 “Ruby on Rails” Wikipedia PRF Indri’s Relevance Model pokmon rail ruby ruby pokmon ruby and sapphire kakutani pokmon emerald que ruby programming language article php weblog pokmon firered and leafgreen activestate pokmon adventures rubyonrail pokmon video games new standard gauge develop tram dontstopmusic
  • 13. Two Retrieval Models • Large Document model – Entire Feed is the unit of retrieval • Small Document model – Individual entry is the unit of retrieval – Ranked Entries are aggregated into a Feed Ranking
  • 14. Large Document model Feed Document Collection <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 15. Large Document model Feed Document Collection Q <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 16. Large Document model Feed Document Collection Ranked Feeds Q <?xml… <?xml… <feed> <feed> <?xml… <?xml… <entry> <entry> `<?xml… <?xml… … <?xml… …<entry> <entry> <?xml… … … </…> … </…> … <entry> </…> </…> <entry> </…> <entry> </…> <entry> <entry> <entry> … … </…> </…>
  • 17. Small Document model Entry Document Collection <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 18. Small Document model Entry Document Q Collection <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 19. Small Document model Entry Document Q Collection Ranked Entries <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 20. Small Document model Entry Document Q Collection Ranked Feeds Ranked Entries <entry> <entry> <entry> <entry> <entry> <entry><entry> <entry> <entry> <entry><entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry> <entry> <entry><?xml… <entry> <entry> <entry> <entry><entry> <entry> <?xml… <entry> <entry> <entry> <entry> <entry> <?xml… <entry> <entry> <entry> <?xml… <entry>
  • 21. Large Document Model • Language Modeling retrieval model using the feed document structure. • Features used for Large Document model: – P(Feed | Qfeed-title) – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 22. Large Document Model • Language Modeling retrieval model using the feed document structure. • Features used for Large Document Large Document Indri Query: model: – P(Feed | Qfeed-title) – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 23. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search Feed ~ Resource Entry ~ Document • Created sampled collection – Sampled (with replacement) 100 entries from each feed – Control for dramatically different feed lengths
  • 24. Small Document Model • Feed ranking in blog distillation is analogous to resource ranking in federated search Feed ~ Resource Entry ~ Document • Created sampled collection – Sampled (with replacement) 100 entries from each feed – Control for dramatically different feed lengths
  • 25. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art federated search algorithm
  • 26. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art length: Assuming uniform prior, equal feed federated search algorithm
  • 27. Small Document Model Adapted Relevant Document Distribution Estimation (ReDDE) resource ranking. ReDDE: well-known state-of-the-art length: Assuming uniform prior, equal feed federated search algorithm
  • 28. Small Document Model • Features used in the small document model : – P(Feed | Qentry-text) – P(Feed | QRM) – P(Feed | Qwiki)
  • 29. Small Document Model • Features used in the small document model : Small Document Indri Query: – P(Feed | Q ) entry-text – P(Feed | QRM) – P(Feed | Qwiki)
  • 30. Parameter Setting • Selecting feature weights (λ’s) required training data • Relevance judgments produced for a small subset of the queries (6+2) BOW title query, 50 docs judged/query • Simple grid search to choose parameters that maximized MAP
  • 31. Results Our best run: Wikipedia expansion + Large Doc.
  • 32. Results Queries (ordered by CMUfeedW performance)
  • 34. Conclusions • What worked – Preprocessing the corpus & using only feed XML – Simple retrieval model with appropriate features – Wikipedia expansion – (small amount of) training data • What didn’t (… or what we tried without success) – Small Document Model (but we think it can) – Spam/Splog detection, anchor text, URL segmentation
  • 36. Feature Elimination +Training -Training +All (CMUfeedW) 0.3695 0.3536 -Dependence 0.3676 0.3457 Model -Wiki (CMUfeed) 0.3385 0.3152 -Wiki, -Indri’s PRF 0.2989 -Wiki, -Indri’s PRF, 0.2907 -Dep Model