A presentation given at TREC (Text REtrieval Conference) 2003, based on the paper "Task-Specific Query Expansion (MultiText Experiments for TREC 2003)" by myself, Charles Clarke, Gordon Cormack, Thomas Lynam, and Egidio Terra.
The research presented in this talk formed the basis of my Master's (MMath) thesis in computer science.
Task-Specific Query Expansion for Genomics (MultiText Experiments for TREC 2003)
1. Task-Specific Query Expansion for Genomics
(MultiText Experiments for TREC 2003)
David L. Yeung
University of Waterloo, Waterloo, Ontario, Canada
Nov. 20, 2003
TREC 2003 Genomics Track: University of Waterloo MultiText Project
2. The MultiText Project
• What is MultiText?
  • A collection of IR tools developed at U of Waterloo.
• What is MultiText for Genomics?
  • Based on MultiText.
  • No external databases or domain-specific knowledge.
  • A combination of techniques...
3. MultiText for Genomics
• What is MultiText for Genomics?
[System diagram: Topic Documents feed three components — Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
4. Query Formulation (Okapi)
• Two interesting facts:
  • Gene name type didn't matter.
  • Spacing and punctuation affected performance.
• Example (training topic 5):
  • glycine receptor, alpha 1
  • Glycine-receptor, alpha1
  • Alpha 1 Glycine Receptor
  • glycine receptors... alpha receptor... alpha 1
  • And so on...
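Because spacing and punctuation affected performance, the surface forms above should all map onto the same gene. A minimal sketch in Python (the normalization rules here are illustrative, not the exact heuristics used in the system):

```python
import re

def spacing_variants(name):
    """Generate spacing/punctuation variants of a gene name.

    Lowercase the name, turn punctuation into spaces, and optionally
    fuse a trailing numeral onto the preceding token, so that
    'Glycine-receptor, alpha 1' and 'glycine receptor alpha1' meet
    in the middle.
    """
    base = re.sub(r"[-,.]", " ", name.lower())  # punctuation -> space
    tokens = base.split()
    variants = {" ".join(tokens)}
    # fuse a trailing numeral: "alpha 1" -> "alpha1"
    if len(tokens) >= 2 and tokens[-1].isdigit():
        variants.add(" ".join(tokens[:-2] + [tokens[-2] + tokens[-1]]))
    return sorted(variants)

print(spacing_variants("Glycine-receptor, alpha 1"))
# → ['glycine receptor alpha 1', 'glycine receptor alpha1']
```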
5. Okapi Search Term Sets
• Generate multiple search term sets:
  • Okapi 1 (higher precision, lower recall)
    • Treat gene names as phrases, except for punctuation.
    • “glycine_receptor_alpha_1”
  • Okapi 2
    • Heuristics for guessing the role of punctuation; also guess plurals.
  • Okapi 3 (lower precision, higher recall)
    • All pairs of adjacent tokens from gene names (bigrams).
    • “glycine[_]receptor”, “receptor[_]alpha”, “alpha[_]1”, etc.
  • Okapi Fusion
    • Take the product of the 3 scores.
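The term-set generation and fusion steps can be sketched as follows (a simplified illustration: Okapi 2's punctuation/plural heuristics are omitted, and the floor score for documents missing from a run is an assumption, not the system's exact treatment):

```python
import math

def okapi_term_sets(gene_name):
    """Build two of the three term sets described above."""
    tokens = gene_name.lower().replace(",", " ").split()
    okapi1 = ["_".join(tokens)]                              # whole name as one phrase
    okapi3 = ["_".join(p) for p in zip(tokens, tokens[1:])]  # token bigrams
    return okapi1, okapi3

def fuse_scores(runs):
    """Okapi Fusion: rank documents by the product of their per-run
    scores; documents absent from a run get a small floor score."""
    docs = set().union(*(run.keys() for run in runs))
    fused = {d: math.prod(run.get(d, 1e-9) for run in runs) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

o1, o3 = okapi_term_sets("glycine receptor, alpha 1")
print(o1)  # → ['glycine_receptor_alpha_1']
print(o3)  # → ['glycine_receptor', 'receptor_alpha', 'alpha_1']
```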
6. Results of Okapi Experiments
[Bar chart: Mean Average Precision (MAP) of Okapi 1, Okapi 2, Okapi 3, and Okapi Fusion on the training and test data; x-axis 0–0.35]
• Two interesting points:
  • The trend in MAP is reversed between the training and test data.
  • Recall (from most to least): Okapi Fusion/Okapi 3, Okapi 2, Okapi 1.
7. MultiText for Genomics
• Next: Query Tiering
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
8. Query Tiering (metadata)
• Use metadata tags in the data:
  (“<TagName>”..“</TagName>”) > “search_terms”
• Order the fields by correlation to relevance (most to least):
  • chemical list (RN)
  • title (TI)
  • abstract (AB)
  • MeSH headings (MH)
  • PubMed ID (PMID)...
9. The Query Tiers
• 6 Query Tiers:
  • Tier 1:
    • Almost exact match in the “chemical list” metadata field.
    • “glycine receptor, alpha 1” → “glycine receptor alpha1”
  • Tier 2:
    • As above, but allow for additional terms.
    • “RAC1” → “rac1 GTP-Binding Protein”
  • Tier 3:
    • Gene name is weakened until a match is made.
    • “estrogen receptor 1” → “Receptors, Estrogen”
10. The Query Tiers
• 6 Query Tiers (continued):
  • Tier 4:
    • Boolean expression in the “title” metadata field.
    • “tyrosyl-tRNA synthetase” → “tyrosyl”^“trna”^“synthetase”
  • Tier 5:
    • Boolean expression in the “chemical list” metadata field.
  • Tier 6:
    • Boolean expression in the “abstract” metadata field.
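The six tiers can be summarized as data, as in this sketch (the field names RN/TI/AB follow the slides; the tuple representation and the `AND` syntax are illustrative, not the actual MultiText query language):

```python
def build_tiers(gene_name):
    """Return the six query tiers as (field, match_type, query) tuples."""
    tokens = gene_name.lower().replace(",", " ").replace("-", " ").split()
    phrase = " ".join(tokens)
    conj = " AND ".join(tokens)
    return [
        ("RN", "exact",    phrase),   # Tier 1: near-exact chemical-list match
        ("RN", "subset",   phrase),   # Tier 2: allow additional terms
        ("RN", "weakened", tokens),   # Tier 3: drop terms until a match
        ("TI", "boolean",  conj),     # Tier 4: conjunction in the title
        ("RN", "boolean",  conj),     # Tier 5: conjunction in the chemical list
        ("AB", "boolean",  conj),     # Tier 6: conjunction in the abstract
    ]

print(build_tiers("tyrosyl-tRNA synthetase")[3])
# → ('TI', 'boolean', 'tyrosyl AND trna AND synthetase')
```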
11. Using the Query Tiers
• Can retrieve documents using:
  • All Tiers (AT)
    • The tiers are executed in order.
  • Best Tier (BT)
    • Once a tier has retrieved at least one document, ignore the rest.
• ...then fuse with the results of the Okapi experiment.
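The two retrieval modes can be sketched like this, where `run_query` stands in for the actual MultiText retrieval call. The AT merge shown (deduplicating by the earliest tier that found a document) is an assumption; the slides only say the tiers are executed in order.

```python
def best_tier(tiers, run_query):
    """Best Tier (BT): run tiers in order; stop at the first tier that
    retrieves at least one document and ignore the rest."""
    for tier in tiers:
        docs = run_query(tier)
        if docs:
            return docs
    return []

def all_tiers(tiers, run_query):
    """All Tiers (AT): run every tier in order; each document keeps the
    position given by the earliest tier that retrieved it."""
    seen, merged = set(), []
    for tier in tiers:
        for doc in run_query(tier):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```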
12. Using the Query Tiers
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
• Fusing with Okapi:
  • Rank Fusion (-R)
    • A document's score is a weighted sum of its (reverse) ranks across the runs.
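A sketch of the rank-fusion step (the `depth` cutoff and the per-run `weights` are parameters the slides do not specify; the values used here are illustrative):

```python
def rank_fusion(runs, weights, depth=1000):
    """Score each document by a weighted sum of its reverse rank in
    each run: rank 1 contributes depth, rank 2 contributes depth - 1,
    and so on. Documents absent from a run contribute nothing."""
    scores = {}
    for run, w in zip(runs, weights):
        for rank, doc in enumerate(run[:depth], start=1):
            scores[doc] = scores.get(doc, 0.0) + w * (depth - rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing `["a", "b", "c"]` with `["b", "a"]` at weights 2:1 and depth 3 gives "a" a score of 2·3 + 1·2 = 8 and "b" a score of 2·2 + 1·3 = 7, so "a" stays on top.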
13. MultiText for Genomics
• Next: Feedback
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
14. Feedback (Query expansion)
• Learn the “most relevant” chemical:
  • Using pseudo-relevance feedback.
  • Only if the document was not matched in Tier 1.
  • Assign a score to chemicals using a tf-idf-style scheme:
    w_i = R_i^α × log(N / f_i)
• Example (training topic 27): cholinergic receptor, muscarinic 3
  • Receptors, Muscarinic (29880.980020675546)
  • Muscarinic Antagonists (20430.84754342255)
  • muscarinic receptor M2 (13976.522895229124)
  • muscarinic receptor M3 (11159.997636110056)
  • Carbachol (11101.760218985524)
  • ... etc.
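The chemical-scoring step might look like the following sketch. Note that the weight formula w_i = R_i^α × log(N / f_i) is reconstructed from a garbled slide, so the placement of α in particular is an assumption, and the toy inputs below are invented for illustration:

```python
import math

def score_chemicals(feedback_docs, doc_freq, n_docs, alpha=1.0):
    """Score each chemical-list entry seen in the top-retrieved
    (pseudo-relevant) documents with a tf-idf-style weight:
        w_i = R_i**alpha * log(N / f_i)
    R_i: occurrences of chemical i across the feedback documents
    f_i: document frequency of chemical i in the whole collection
    N:   number of documents in the collection."""
    r = {}
    for chemicals in feedback_docs:   # each doc: its chemical-list entries
        for chem in chemicals:
            r[chem] = r.get(chem, 0) + 1
    weights = {c: (ri ** alpha) * math.log(n_docs / doc_freq[c])
               for c, ri in r.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```

A chemical that is both frequent in the feedback documents and rare in the collection (high R_i, low f_i) rises to the top, which is the behaviour the topic-27 example illustrates.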
15. Complete MTG System - Runs
[System diagram: Topic Documents feed Query Formulation (Okapi), Query Tiering (metadata), and Feedback (Query expansion)]
• Complete runs: Okapi Fusion, ATR, BTR, ATRF, BTRF
  • Fusion with Okapi: Rank Fusion (-R)
  • Query Tiering: All Tiers (AT), Best Tier (BT)
  • Feedback: (-F)
16. Complete MTG System - Results
[Bar chart: Mean Average Precision (MAP) of Okapi Fusion, ATR, BTR, ATRF*, and BTRF* on the training and test data; x-axis 0–0.5]
• Complete runs: Okapi Fusion, ATR, BTR, ATRF*, BTRF*
  • Fusion with Okapi: Rank Fusion (-R)
  • Query Tiering: All Tiers (AT), Best Tier (BT)
  • Feedback: (-F)
  • * denotes an official submission
17. Conclusions
• MultiText supports a variety of standard and non-standard techniques:
  • Okapi BM25 implementation
  • Query Tiering and Fusion
  • Pseudo-relevance Feedback
• It is possible to improve performance in the genomics domain even without domain-specific knowledge, by exploiting:
  • Characteristics of the corpus (SSR, metadata)
  • Merging the results of multiple independent methods
• For more information, please see our paper!