University of Waterloo
MultiText for Genomics

Task-Specific Query Expansion for Genomics
(MultiText Experiments for TREC 2003)

David L. Yeung
University of Waterloo, Waterloo, Ontario, Canada

Nov. 20, 2003




              TREC 2003 Genomics Track: University of Waterloo MultiText Project
The MultiText Project

•   What is MultiText?
 •   A collection of IR tools developed at U of Waterloo.

•   What is MultiText for Genomics?
 •   Based on MultiText.
 •   No external databases or domain-specific knowledge.
 •   A combination of techniques...


MultiText for Genomics
•   What is MultiText for Genomics?

[Figure: system overview with five components: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), and Documents.]



Query Formulation (Okapi)
•   Two interesting facts:
 •   Gene name type didn't matter
 •   Spacing and punctuation affected performance
•   Example (training topic 5):
     •   glycine receptor, alpha 1
     •   Glycine-receptor, alpha1
     •   Alpha 1 Glycine Receptor
     •   glycine receptors... alpha receptor... alpha 1
         •   And so on...




Okapi Search Term Sets
•   Generate multiple search term sets:
 •   Okapi 1 (higher precision, lower recall)
     •   Treat gene names as phrases, except for punctuation.
         •   “glycine_receptor_alpha_1”
 •   Okapi 2
     •   Heuristics for guessing role of punctuation; also guess plurals.
 •   Okapi 3 (lower precision, higher recall)
     •   All pairs of tokens from gene names (bigrams).
         •   “glycine[_]receptor”, “receptor[_]alpha”, “alpha[_]1”, etc.
 •   Okapi Fusion
     •   Take the product of the 3 scores.
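
The term sets and the fusion step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the optional separator written as "[_]" on the slide is rendered as a plain underscore, "pairs of tokens" is read as adjacent pairs (as in the slide's examples), and a document missing from any run is dropped before the scores are multiplied. The Okapi 2 heuristics are not specified on the slide and are left out.

```python
import re


def tokenize(gene_name):
    """Lowercase the name and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", gene_name.lower()) if t]


def okapi1_terms(gene_name):
    """Okapi 1 style: the whole gene name as one phrase, punctuation dropped."""
    return ["_".join(tokenize(gene_name))]


def okapi3_terms(gene_name):
    """Okapi 3 style: one term per pair of adjacent tokens (bigrams)."""
    tokens = tokenize(gene_name)
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def product_fusion(*score_maps):
    """Fuse runs by multiplying each document's scores across all runs.

    Assumption (not from the slides): a document must appear in every run.
    """
    shared = set(score_maps[0]).intersection(*score_maps[1:])
    fused = {}
    for doc in shared:
        product = 1.0
        for scores in score_maps:
            product *= scores[doc]
        fused[doc] = product
    return fused
```

For example, okapi3_terms("glycine receptor, alpha 1") yields the three bigrams shown on the slide, and product_fusion takes one {document: score} mapping per Okapi run.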
Results of Okapi Experiments
[Figure: bar chart of Mean Average Precision (MAP) on the Training and Test topics for Okapi 1, Okapi 2, Okapi 3, and Okapi Fusion; horizontal axis from 0 to 0.35.]


• Two interesting points:
   • The trend in MAP is reversed between the training and test data.
   • Recall (from most to least): Okapi Fusion/Okapi 3, Okapi 2, Okapi 1.




MultiText for Genomics
•   Next: Query Tiering



Query Tiering (metadata)
•   Use metadata tags in data:
   (“<TagName>”..“</TagName>”) > “search_terms”

•   Order them by correlation to relevance:
   [Figure: the fields below arranged along a "Relevance" axis.]
 •   chemical list (RN)
 •   title (TI)
 •   abstract (AB)
 •   MeSH headings (MH)
 •   PubMed ID (PMID)...
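
The tag query above can be read as "the span from the opening tag to the closing tag that contains the search terms". A rough Python equivalent is sketched below. The function name, the regular-expression approach, and the case-insensitive substring match are assumptions made for illustration; the MultiText query engine itself is not shown on the slides.

```python
import re


def fields_containing(document, tag, phrase):
    """Return the text of every <tag>...</tag> span that contains `phrase`."""
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL
    )
    needle = phrase.lower()
    return [
        match.group(1).strip()
        for match in pattern.finditer(document)
        if needle in match.group(1).lower()
    ]
```

Restricting the match to one field at a time is what lets a hit in the chemical list be treated differently from a hit in the abstract.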
The Query Tiers
•   6 Query Tiers:
 •   Tier 1:
     •   Almost exact match in the “chemical list” metadata field.
     •   “glycine receptor, alpha 1” → “glycine receptor alpha1”
 •   Tier 2:
     •   As above, but allow for additional terms.
     •   “RAC1” → “rac1 GTP-Binding Protein”
 •   Tier 3:
     •   Gene name is weakened until a match is made.
     •   “estrogen receptor 1” → “Receptors, Estrogen”

The Query Tiers
•   6 Query Tiers (continued):
 •   Tier 4:
     •   Boolean expression in the “title” metadata field.
     •   “tyrosyl-tRNA synthetase” → “tyrosyl”^“trna”^“synthetase”
 •   Tier 5:
     •   Boolean expression in the “chemical list” metadata field.
 •   Tier 6:
     •   Boolean expression in the “abstract” metadata field.
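
Tiers 4 to 6 apply the same Boolean AND of gene-name tokens to different fields. The sketch below illustrates that idea. It assumes a record is a plain dictionary keyed by field name and that tokens are lowercased and split on punctuation; both choices, and all names in the code, are hypothetical.

```python
import re

# Field consulted by each Boolean tier, as listed on the slides.
BOOLEAN_TIER_FIELDS = {4: "title", 5: "chemical list", 6: "abstract"}


def tokens(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def boolean_and_match(gene_name, field_text):
    """True when every token of the gene name occurs somewhere in the field."""
    field_tokens = set(tokens(field_text))
    return all(t in field_tokens for t in tokens(gene_name))


def matching_boolean_tiers(gene_name, record):
    """Return the Boolean tiers (4, 5, 6) whose field matches the gene name."""
    return [
        tier
        for tier, field in sorted(BOOLEAN_TIER_FIELDS.items())
        if boolean_and_match(gene_name, record.get(field, ""))
    ]
```

With this tokenization, "tyrosyl-tRNA synthetase" becomes the three required tokens tyrosyl, trna, and synthetase, matching the slide's example.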




Using the Query Tiers

•   Can retrieve documents using:
 •   All Tiers (AT)
     •   The tiers are executed in order.
 •   Best Tier (BT)
     •   Once a tier has retrieved non-zero documents, ignore the rest.



               ... then fuse with results of Okapi experiment.
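
The difference between the two modes can be shown with a small sketch. It assumes each tier has already produced a list of document IDs, that documents from earlier tiers rank above those from later tiers, and that duplicates are kept only at their first position. The function name and the mode strings are hypothetical.

```python
def run_tiers(tier_results, mode="AT"):
    """Combine per-tier result lists, given in tier order.

    mode="AT" (All Tiers): every tier contributes, earlier tiers first.
    mode="BT" (Best Tier): stop after the first tier that retrieved anything.
    """
    ranked = []
    seen = set()
    for docs in tier_results:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                ranked.append(doc)
        if mode == "BT" and docs:
            break
    return ranked
```

For example, with tier results [[], ["d1", "d2"], ["d2", "d3"]], All Tiers returns d1, d2, d3, while Best Tier stops at the second tier and returns d1, d2.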


Using the Query Tiers
•   Fusing with Okapi:
 •   Rank Fusion (-R)
     •   Document's score based on weighted sum of (reverse) rank.
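
A minimal sketch of rank fusion by weighted reverse rank follows. The slides do not give the weights or the exact definition of reverse rank, so the code assumes that the top document of a list of length n receives n points and the last receives 1, and that ties are broken by document ID. All names are hypothetical.

```python
def rank_fusion(ranked_lists, weights=None):
    """Fuse ranked lists by a weighted sum of reverse ranks.

    Assumption: in a list of length n, position 0 scores n and the last scores 1.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        n = len(ranking)
        for position, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + weight * (n - position)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))
```

Because only ranks are used, the tier results and the Okapi scores do not need to be on a comparable scale before they are combined.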

MultiText for Genomics
•   Next: Feedback



Feedback (Query expansion)
•   Learn “most relevant” chemical:
 •   Using pseudo-relevance feedback

 •   Only if document not matched in Tier 1
 •   Assign score to chemicals using Tf-Idf scoring scheme
     •   $w_i = R_i \times \left( \log \frac{N}{f_i} \right)^{\alpha}$
•   Example (training topic 27):
 •   cholinergic receptor, muscarinic 3
     •   Receptors, Muscarinic (29880.980020675546)
     •   Muscarinic Antagonists (20430.84754342255)
     •   muscarinic receptor M2 (13976.522895229124)
     •   muscarinic receptor M3 (11159.997636110056)
     •   Carbachol (11101.760218985524)
         •   ... etc.
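
The scoring step can be sketched as below. The slide does not define the symbols in the weight formula, so this sketch assumes a conventional reading: R_i is the number of top-ranked (pseudo-relevant) documents whose chemical list contains chemical i, N is the number of documents in the collection, f_i is the number of documents in the collection that list chemical i, and α is a tunable exponent. The function and parameter names are hypothetical, and the toy numbers in the usage note are not taken from the experiments.

```python
import math
from collections import Counter


def score_chemicals(top_docs_chemicals, collection_freq, n_docs, alpha=1.0):
    """Score each chemical as R_i * (log(N / f_i)) ** alpha.

    top_docs_chemicals: one list of chemical names per top-ranked document.
    collection_freq:    chemical name -> number of documents listing it (f_i).
    n_docs:             number of documents in the collection (N).
    """
    # R_i: in how many of the top-ranked documents does each chemical appear?
    r = Counter(chem for chems in top_docs_chemicals for chem in set(chems))
    return {
        chem: count * math.log(n_docs / collection_freq[chem]) ** alpha
        for chem, count in r.items()
    }
```

Sorting the returned dictionary by value, highest first, gives a ranked list of candidate chemicals in the same form as the example above.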
Complete MTG System - Runs
•   Complete runs: Okapi Fusion, ATR, BTR, ATRF, BTRF
 •   Fusion with Okapi: Rank Fusion (-R)
 •   Query Tiering: All Tiers (AT), Best Tier (BT)
 •   Feedback: (-F)

Complete MTG System - Results
[Figure: bar chart of Mean Average Precision (MAP) on the Training and Test topics for Okapi Fusion, ATR, BTR, ATRF*, and BTRF*; horizontal axis from 0 to 0.5.]

•   Complete runs: Okapi Fusion, ATR, BTR, ATRF*, BTRF*
 •   Fusion with Okapi: Rank Fusion (-R)
 •   Query Tiering: All Tiers (AT), Best Tier (BT)
 •   Feedback: (-F)
 •   * denotes an official submission

Conclusions
•   MultiText supports a variety of standard and non-standard techniques:
 •   Okapi BM25 implementation
 •   Query Tiering and Fusion
 •   Pseudo-relevance Feedback

•   Possible to improve performance in genomics domain even without domain-specific knowledge:
 •   Characteristics of corpus (SSR, metadata)
 •   Merging results of multiple independent methods

•   For more information, please see our paper!