A presentation given at TREC (Text REtrieval Conference) 2003, based on the paper "Task-Specific Query Expansion (MultiText Experiments for TREC 2003)" by myself, Charles Clarke, Gordon Cormack, Thomas Lynam, and Egidio Terra.
The research presented in this talk formed the basis of my Master's (MMath) thesis in computer science.
Task-Specific Query Expansion for Genomics (MultiText Experiments for TREC 2003)
1. Task-Specific Query Expansion for Genomics
(MultiText Experiments for TREC 2003)
David L. Yeung
University of Waterloo, Waterloo, Ontario, Canada
Nov. 20, 2003
TREC 2003 Genomics Track: University of Waterloo MultiText Project
2. The MultiText Project
• What is MultiText?
  • A collection of IR tools developed at U of Waterloo.
• What is MultiText for Genomics?
  • Based on MultiText.
  • No external databases or domain-specific knowledge.
  • A combination of techniques...
3. MultiText for Genomics
• What is MultiText for Genomics?
[System diagram: Topic Documents feed three components — Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
4. Query Formulation (Okapi)
• Two interesting facts:
  • Gene name type didn't matter.
  • Spacing and punctuation affected performance.
• Example (training topic 5):
  • glycine receptor, alpha 1
  • Glycine-receptor, alpha1
  • Alpha 1 Glycine Receptor
  • glycine receptors... alpha receptor... alpha 1
  • And so on...
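Because spacing and punctuation affected performance, the surface forms above should all map onto the same gene. A minimal sketch in Python (the normalization rules here are illustrative, not the exact heuristics used in the system):

```python
import re

def spacing_variants(name):
    """Generate spacing/punctuation variants of a gene name.

    Lowercase the name, turn punctuation into spaces, and optionally
    fuse a trailing numeral onto the preceding token, so that
    'Glycine-receptor, alpha 1' and 'glycine receptor alpha1' meet
    in the middle.
    """
    base = re.sub(r"[-,.]", " ", name.lower())  # punctuation -> space
    tokens = base.split()
    variants = {" ".join(tokens)}
    # fuse a trailing numeral: "alpha 1" -> "alpha1"
    if len(tokens) >= 2 and tokens[-1].isdigit():
        variants.add(" ".join(tokens[:-2] + [tokens[-2] + tokens[-1]]))
    return sorted(variants)

print(spacing_variants("Glycine-receptor, alpha 1"))
# → ['glycine receptor alpha 1', 'glycine receptor alpha1']
```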
5. Okapi Search Term Sets
• Generate multiple search term sets:
  • Okapi 1 (higher precision, lower recall)
    • Treat gene names as phrases, except for punctuation.
    • “glycine_receptor_alpha_1”
  • Okapi 2
    • Heuristics for guessing the role of punctuation; also guess plurals.
  • Okapi 3 (lower precision, higher recall)
    • All pairs of adjacent tokens from gene names (bigrams).
    • “glycine[_]receptor”, “receptor[_]alpha”, “alpha[_]1”, etc.
  • Okapi Fusion
    • Take the product of the 3 scores.
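The term-set generation and fusion steps can be sketched as follows (a simplified illustration: Okapi 2's punctuation/plural heuristics are omitted, and the floor score for documents missing from a run is an assumption, not the system's exact treatment):

```python
import math

def okapi_term_sets(gene_name):
    """Build two of the three term sets described above."""
    tokens = gene_name.lower().replace(",", " ").split()
    okapi1 = ["_".join(tokens)]                              # whole name as one phrase
    okapi3 = ["_".join(p) for p in zip(tokens, tokens[1:])]  # token bigrams
    return okapi1, okapi3

def fuse_scores(runs):
    """Okapi Fusion: rank documents by the product of their per-run
    scores; documents absent from a run get a small floor score."""
    docs = set().union(*(run.keys() for run in runs))
    fused = {d: math.prod(run.get(d, 1e-9) for run in runs) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

o1, o3 = okapi_term_sets("glycine receptor, alpha 1")
print(o1)  # → ['glycine_receptor_alpha_1']
print(o3)  # → ['glycine_receptor', 'receptor_alpha', 'alpha_1']
```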
6. Results of Okapi Experiments
[Bar chart: Mean Average Precision (MAP) of Okapi 1, Okapi 2, Okapi 3, and Okapi Fusion on the training and test data; x-axis 0–0.35]
• Two interesting points:
  • The trend in MAP is reversed between the training and test data.
  • Recall (from most to least): Okapi Fusion/Okapi 3, Okapi 2, Okapi 1.
7. MultiText for Genomics
• Next: Query Tiering
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
8. Query Tiering (metadata)
• Use metadata tags in the data:
  (“<TagName>”..“</TagName>”) > “search_terms”
• Order the fields by correlation to relevance (most to least):
  • chemical list (RN)
  • title (TI)
  • abstract (AB)
  • MeSH headings (MH)
  • PubMed ID (PMID)...
9. The Query Tiers
• 6 Query Tiers:
  • Tier 1:
    • Almost exact match in the “chemical list” metadata field.
    • “glycine receptor, alpha 1” → “glycine receptor alpha1”
  • Tier 2:
    • As above, but allow for additional terms.
    • “RAC1” → “rac1 GTP-Binding Protein”
  • Tier 3:
    • Gene name is weakened until a match is made.
    • “estrogen receptor 1” → “Receptors, Estrogen”
10. The Query Tiers
• 6 Query Tiers (continued):
  • Tier 4:
    • Boolean expression in the “title” metadata field.
    • “tyrosyl-tRNA synthetase” → “tyrosyl”^“trna”^“synthetase”
  • Tier 5:
    • Boolean expression in the “chemical list” metadata field.
  • Tier 6:
    • Boolean expression in the “abstract” metadata field.
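The six tiers can be summarized as data, as in this sketch (the field names RN/TI/AB follow the slides; the tuple representation and the `AND` syntax are illustrative, not the actual MultiText query language):

```python
def build_tiers(gene_name):
    """Return the six query tiers as (field, match_type, query) tuples."""
    tokens = gene_name.lower().replace(",", " ").replace("-", " ").split()
    phrase = " ".join(tokens)
    conj = " AND ".join(tokens)
    return [
        ("RN", "exact",    phrase),   # Tier 1: near-exact chemical-list match
        ("RN", "subset",   phrase),   # Tier 2: allow additional terms
        ("RN", "weakened", tokens),   # Tier 3: drop terms until a match
        ("TI", "boolean",  conj),     # Tier 4: conjunction in the title
        ("RN", "boolean",  conj),     # Tier 5: conjunction in the chemical list
        ("AB", "boolean",  conj),     # Tier 6: conjunction in the abstract
    ]

print(build_tiers("tyrosyl-tRNA synthetase")[3])
# → ('TI', 'boolean', 'tyrosyl AND trna AND synthetase')
```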
11. Using the Query Tiers
• Can retrieve documents using:
  • All Tiers (AT)
    • The tiers are executed in order.
  • Best Tier (BT)
    • Once a tier has retrieved at least one document, ignore the rest.
• ...then fuse with the results of the Okapi experiment.
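The two retrieval modes can be sketched like this, where `run_query` stands in for the actual MultiText retrieval call. The AT merge shown (deduplicating by the earliest tier that found a document) is an assumption; the slides only say the tiers are executed in order.

```python
def best_tier(tiers, run_query):
    """Best Tier (BT): run tiers in order; stop at the first tier that
    retrieves at least one document and ignore the rest."""
    for tier in tiers:
        docs = run_query(tier)
        if docs:
            return docs
    return []

def all_tiers(tiers, run_query):
    """All Tiers (AT): run every tier in order; each document keeps the
    position given by the earliest tier that retrieved it."""
    seen, merged = set(), []
    for tier in tiers:
        for doc in run_query(tier):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```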
12. Using the Query Tiers
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
• Fusing with Okapi:
  • Rank Fusion (-R)
    • A document's score is a weighted sum of its (reverse) ranks across the runs.
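A sketch of the rank-fusion step (the `depth` cutoff and the per-run `weights` are parameters the slides do not specify; the values used here are illustrative):

```python
def rank_fusion(runs, weights, depth=1000):
    """Score each document by a weighted sum of its reverse rank in
    each run: rank 1 contributes depth, rank 2 contributes depth - 1,
    and so on. Documents absent from a run contribute nothing."""
    scores = {}
    for run, w in zip(runs, weights):
        for rank, doc in enumerate(run[:depth], start=1):
            scores[doc] = scores.get(doc, 0.0) + w * (depth - rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing `["a", "b", "c"]` with `["b", "a"]` at weights 2:1 and depth 3 gives "a" a score of 2·3 + 1·2 = 8 and "b" a score of 2·2 + 1·3 = 7, so "a" stays on top.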
13. MultiText for Genomics
• Next: Feedback
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
14. Feedback (Query expansion)
• Learn the “most relevant” chemical:
  • Using pseudo-relevance feedback.
  • Only if the document was not matched in Tier 1.
  • Assign a score to chemicals using a tf-idf-style scheme:
    w_i = R_i^α × log(N / f_i)
• Example (training topic 27): cholinergic receptor, muscarinic 3
  • Receptors, Muscarinic (29880.980020675546)
  • Muscarinic Antagonists (20430.84754342255)
  • muscarinic receptor M2 (13976.522895229124)
  • muscarinic receptor M3 (11159.997636110056)
  • Carbachol (11101.760218985524)
  • ... etc.
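The chemical-scoring step might look like the following sketch. Note that the weight formula w_i = R_i^α × log(N / f_i) is reconstructed from a garbled slide, so the placement of α in particular is an assumption, and the toy inputs below are invented for illustration:

```python
import math

def score_chemicals(feedback_docs, doc_freq, n_docs, alpha=1.0):
    """Score each chemical-list entry seen in the top-retrieved
    (pseudo-relevant) documents with a tf-idf-style weight:
        w_i = R_i**alpha * log(N / f_i)
    R_i: occurrences of chemical i across the feedback documents
    f_i: document frequency of chemical i in the whole collection
    N:   number of documents in the collection."""
    r = {}
    for chemicals in feedback_docs:   # each doc: its chemical-list entries
        for chem in chemicals:
            r[chem] = r.get(chem, 0) + 1
    weights = {c: (ri ** alpha) * math.log(n_docs / doc_freq[c])
               for c, ri in r.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```

A chemical that is both frequent in the feedback documents and rare in the collection (high R_i, low f_i) rises to the top, which is the behaviour the topic-27 example illustrates.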
15. Complete MTG System - Runs
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
• Complete runs: Okapi Fusion, ATR, BTR, ATRF, BTRF
  • Fusion with Okapi: Rank Fusion (-R)
  • Query Tiering: All Tiers (AT), Best Tier (BT)
  • Feedback: (-F)
16. Complete MTG System - Results
[Bar chart: Mean Average Precision (MAP) of Okapi Fusion, ATR, BTR, ATRF*, and BTRF* on the training and test data; x-axis 0–0.5]
• Complete runs: Okapi Fusion, ATR, BTR, ATRF*, BTRF*
  • Fusion with Okapi: Rank Fusion (-R)
  • Query Tiering: All Tiers (AT), Best Tier (BT)
  • Feedback: (-F)
  • * denotes an official submission
17. Conclusions
• MultiText supports a variety of standard and non-standard techniques:
  • Okapi BM25 implementation
  • Query Tiering and Fusion
  • Pseudo-relevance Feedback
• It is possible to improve performance in the genomics domain even without domain-specific knowledge, by exploiting:
  • Characteristics of the corpus (SSR, metadata)
  • Merging the results of multiple independent methods
• For more information, please see our paper!