SlideShare una empresa de Scribd logo
1 de 49
Descargar para leer sin conexión
Validation and Analysis of Hypothesis
Generation Systems
Justin Sybrandt
1
Talk Outline
Warning: is actually two talks
● Overview + Background
● Large-Scale Validation of
Hypothesis Generation Systems via
Candidate Ranking
● Are Abstracts Enough for
Hypothesis Generation?
2
Overview
3
Problem Overview
● Medical research is expensive and risky
● Text mining can identify fruitful research directions before expensive experiments
Pfizer Ends Hunt for Drugs to Treat
Alzheimer’s and Parkinson's
WSJ
4
● NIH provides 27 million abstracts
● 2-4 thousand added daily
● Lack of communication leads to undiscovered connections
● Hypothesis generation finds implicitly published relationships
Hypothesis Generation
Fish Oil Blood Viscosity
Raynaud’s
Syndrome
?
5
● Presented at KDD’17
● Validated against small number of
historical examples
● Relied on expert input to interpret
results
● Original Pipeline
○ Data Collection
○ Network Construction
○ Relevant Abstract
Identification
○ Topic Modeling
Automatic Biomedical Hypothesis
Generation System
6
Neoplasms
Tumor - Tumour
Oncological abnormality
Data Collection
Abstracts &
n-grams
Predicates
Codified Terms
Tumours evade immune control by
creating hostile microenvironments
that perturb T cell metabolism and
effector function.
Tumours Immune Control
evade
7
Network Construction
8
● Select two query nodes
● Find shortest path
● Locate nearby abstracts
● Collect sub-corpus
Relevant Abstract Identification
9
Extract Information
● Apply LDA topic modeling
● Explore patterns in fuzzy clusters
● Original limitations:
○ Expert analysis
○ No numerical results
○ Lots of data, time consuming
10
Justin Sybrandt1
, Michael Shtutman2
, Ilya Safro1
1
Clemson U. - School of Computing
2
U. of S. Carolina - Drug Discovery and Biomedical Sciences
Large-Scale Validation
of Hypothesis Generation Systems
via Candidate Ranking
Validation
● Challenges
○ Lack of datasets
○ Problematic false positive /
negative
● We propose a scalable approach
● Verification through lab studies
Does it work?
12
Existing Validation
● Existing validation methods [1]
○ Replicate Swanson’s experiments
○ Statistical evaluation
○ Incorporate expert opinion
○ Publish in medicine
● Complications
○ Human in the loop
○ Consumes expert time
○ Small validation sets
[1] M. Yetisgen-Yildiz and W. Pratt, “Evaluation of literature-based discovery systems,”
in Literature-based discovery. Springer, 2008, pp. 101–113. 13
Drug Discovery and Candidate Selection
● Drug companies must prioritize investments
● Thousands of targets narrow to handful of candidates
● Drug discovery is a ranking problem
Hit Identification
Lead Discovery
Lead Optimization
Drug Candidates
14
● New validation approach inspired by drug discovery
● Rank recent hypotheses by plausibility
● Requires
○ Positive & negative samples
○ Ranking criteria
● Produces area under ROC curve
Validation through Candidate Ranking
False Positive Rate
TruePositiveRate
Random
G
ood
Better
15
Collecting Recent Hypotheses
● Assume abstracts are a reasonable summary
● Identify original term-pairs from each year
● Select cut year for validation (2010)
● Record pairs newer than cut year
● Published Set
Fish Oil Blood Viscosity
Raynaud’s
Syndrome
?
Year
#NewTerm-Pairs
Prevalence of New Hypotheses in
Medical Literature
16
Collecting Negative Samples
● Select term subset present at cut year
● Randomly pair terms
● Record sampled pairs that do not occur in literature
● Generate samples equal to published set
● Noise Set
17
● Extract numeric features from topic model results
● Learn correlation between features and plausibility
● Generate a collection of measurements
○ Embedding based
○ Topic network based
Creating a Ranking Criteria
18
Clusters of Words in
Embedding Space
Embedding Measurements
● Connected terms should...
○ Be similarly embedded
○ Share nearby topics
● Topic embeddings from centroids
● Measure L2
distances and cosine similarity
19
Topic Network Measurements
● Place terms and topics in network
● Edges formed by nearest-neighbors in embedding
● Add edges until path between terms appears
● Observed different network properties
20
Polynomial Combination
● Each previous metric is heuristically backed
● Polynomial combination provides
○ Interpretable results
○ Improved performance
○ Easy fitting
21
Results
● Represents 8,638 queries
● Cut year 2010
● Polynomial is top performer
● L2
shows strength of embedding
● Topic information adds signal
22
Results Highly Cited
● Represents 2,896 queries
● Subset to papers with 100 citations
● Performance improved
● Similar order of metric performance
23
Verification in Lab Experiments
● Want to show that ranking method extends beyond validation experiment
● Focus on HIV-associated Neurodegenerative Disease (HAND)
○ ~30% of HIV patients over 60 have dementia
○ ~7% is typical rate
● Ran over 30k queries
24
New HAND-Gene Connection
● DDX3 identified in top 10% of genes
● Previously studied in relation to cancer
● Unexpected in this context
● Support from wet lab experiments
○ Rapidly age HIV+ neurons with cocaine
○ Cells with DDX3 inhibited survive
○ Cells with DDX3 active die
25
Summary : Validation
● Introduces a new validation method based on candidate ranking
○ Does not rely on expert input
○ Scales to large validation sets
● Proposed ranking metrics
○ Embedding based
○ Topic network based
● Validated our system, Moliere
○ Published vs. Noise
○ Highly Cited vs. Noise
● Applied ranking to real-world application
○ HIV associated dementia
See more online at:
sybrandt.com/2018/validation
26
27
Moliere Validation ? ? ?
Justin Sybrandt, Angelo Carrabba, Alexander Herzog, Ilya Safro
Clemson U. - School of Computing
Are Abstracts Enough for Hypothesis
Generation?
Motivation
● We now have a method to evaluate overall system performance
● Interesting questions:
○ What effect does corpus size and document length have on results?
○ How sensitive is a hypothesis generation system to input qualities?
○ How many papers does a hypothesis generation system need?
○ Are abstracts enough?
29
Challenges with Full Text
● Larger documents
○ ~15.6x more words
● Expensive to acquire
○ Abstracts are free
● Harder to parse
○ Figures, tables, references
○ Often must parse PDFs
30
Input Data from Other Systems
● Titles Only
○ ARROWSMITH - 1986
● Titles and Abstracts (+ external sources)
○ Moliere - 2017
○ Disease-Connect - 2015
○ BrainSCANr - 2010
○ ...
● Full Text
○ Watson for Drug Discovery - 2014
31
● Titles Only
○ ARROWSMITH - 1986
● Titles and Abstracts (+ external sources)
○ Moliere - 2017
○ Disease-Connect - 2015
○ BrainSCANr - 2010
○ ...
● Full Text
○ Watson for Drug Discovery - 2014
Input Data from Other Systems
32
● Proprietary system
● Designed after
recommender systems [2]
● Most inference on
term-document matrix [3]
[2] Spangler, Scott. Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation.
Chapman and Hall/CRC, 2015.
[3] He, Qi, Ming Ji, and W. Scott Spangler. "Mining strong relevance between heterogeneous entities
from their co-occurrences." U.S. Patent Application No. 14/279,617.
Methodology
● Create datasets of variable corpus size and document size
○ Free abstracts from PubMed
○ Free full texts from PubMed Central
● Construct multiple “instances” of Moliere
○ Rebuild embedding, network, and queries
● Use previously discussed validation and ranking
○ Cut year 2015
33
● From PubMed
○ Entire dataset
○ Randomly sampled 1 / 2
○ Randomly sampled 1 / 4
○ Randomly sampled 1 / 8
○ Randomly sampled 1 / 16
● From PubMed Central*
○ Full Texts
○ Abstracts
Considered Corpora
* We restrict PMC to only papers released
in plain text that contain abstracts.
34
Input Dataset Comparisons
35
All of PubMed PMC Full Text
# Documents
(Millions)
24 1
Median Words
Per Document
71 1,594
Unique Words
(Millions)
2.4 6.5
Total Words
(Billions)
1.85 1.86
Input Dataset Comparisons
36
PMC Abstracts PMC Full Text
# Documents
(Millions)
1 1
Median Words
Per Document
102 1,594
Unique Words
(Millions)
0.67 6.5
Total Words
(Billions)
0.1 1.86
Input Dataset Comparisons
37
PMC Abstracts 1 / 16 PubMed
# Documents
(Millions)
1 1.5
Median Words
Per Document
102 71
Unique Words
(Millions)
0.67 0.35
Total Words
(Billions)
0.1 0.1
● PubMed contains some questionable “abstracts”
○ Translated
○ Incomplete records
○ Scanned from older documents
● PubMed Central
○ Much more recent
○ Authors submit their own full-text papers
○ Conform better to modern publication standards
PMC vs. PubMed Quality Comparison
38
Experiments
● Collected 2,000 validation pairs
○ Cut year 2015
○ Term pairs shared across all corpora
● Trained entire Moliere system per corpus
○ Embedding
○ Phrase Mining
○ Network Construction
○ Queries
○ Training Polynomial
39
Results overall
● We present full results in paper
● Focus here on L2
and Polynomial
○ L2
evaluates embedding quality
○ Polynomial evaluates max performance
* Lower performance than previously
discussed. This work embeds text in
R100
while the previous embeds in R500
.
40
Findings
● Embedding
○ Full Text > Abstracts
41
0.678
0.777
Findings
● Embedding
○ Full Text > Abstracts
○ Clean >> Many
42
0.678
0.651
0.612
Findings
● Embedding
○ Full Text > Abstracts
○ Clean >> Many
● Max Performance
○ Clean = Many
43
0.718
Findings
● Embedding
○ Full Text > Abstracts
○ Clean >> Many
● Max Performance
○ Clean = Many
○ Full Text > Abstracts
44
0.718
0.795
The Downside to Full Text
● Increased single query runtime from 100s to ~4,500s
○ May be reasonable for specific searches
○ Does not scale to large candidate experiments
○ Most runtime during LDA topic models
● Full text topics are less interpretable
45
Answers
● What effect does corpus size and document length have on results?
○ Increasing either helps
○ Document length has more effect than corpus size
○ Documents that are too long negatively affect topic interpretability
● Effect
○ Removing very short documents likely to boost overall performance
○ Using automatically generated summaries may balance performance
46
Answers
● How sensitive is a hypothesis generation system to input qualities?
○ Significantly sensitive to short noisy documents
○ Performance predicated upon embedding
● Effect
○ Pre-trained word embeddings may boost performance
47
Answers
● How many papers does a hypothesis generation system need?
○ ~1 million perform well
○ Quality > Quantity
● Effect
○ Lower barrier to entry for cross-domain applications
48
Summary : Are Abstracts Enough?
● Explored multiple input corpora
○ PubMed vs. PubMed Central
● Found that longer documents increase performance
○ PMC abstracts are longer than Medline
○ Longer documents > larger quantity
● Significant runtime tradeoff
○ 45x runtime for 10% improvement
● Answer depends on the application
See more online at:
sybrandt.com/2018/abstracts
49

Más contenido relacionado

Similar a BigData'18: Validation and Analysis of Hypothesis Generation Systems

DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...
DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...
DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...
Anjani Dhrangadhariya
 

Similar a BigData'18: Validation and Analysis of Hypothesis Generation Systems (20)

Introduction to meta analysis
Introduction to meta analysisIntroduction to meta analysis
Introduction to meta analysis
 
Amia tb-review-15
Amia tb-review-15Amia tb-review-15
Amia tb-review-15
 
Sharing and standards christopher hart - clinical innovation and partnering...
Sharing and standards   christopher hart - clinical innovation and partnering...Sharing and standards   christopher hart - clinical innovation and partnering...
Sharing and standards christopher hart - clinical innovation and partnering...
 
NCATS CTSA N3C
NCATS CTSA N3C NCATS CTSA N3C
NCATS CTSA N3C
 
Zen and the Art of Data Science Maintenance
Zen and the Art of Data Science MaintenanceZen and the Art of Data Science Maintenance
Zen and the Art of Data Science Maintenance
 
Embi cri yir-2017-final
Embi cri yir-2017-finalEmbi cri yir-2017-final
Embi cri yir-2017-final
 
Pistoia alliance debates analytics 15-09-2015 16.00
Pistoia alliance debates   analytics 15-09-2015 16.00Pistoia alliance debates   analytics 15-09-2015 16.00
Pistoia alliance debates analytics 15-09-2015 16.00
 
Clinical Research Informatics Year-in-Review 2024
Clinical Research Informatics Year-in-Review 2024Clinical Research Informatics Year-in-Review 2024
Clinical Research Informatics Year-in-Review 2024
 
DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...
DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...
DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour...
 
Enhancing the quality and transparency of health research: Introducing the PR...
Enhancing the quality and transparency of health research: Introducing the PR...Enhancing the quality and transparency of health research: Introducing the PR...
Enhancing the quality and transparency of health research: Introducing the PR...
 
SPIN Workshop Microbial Genomics @NIST
SPIN Workshop Microbial Genomics @NISTSPIN Workshop Microbial Genomics @NIST
SPIN Workshop Microbial Genomics @NIST
 
Sudhakar singh meta analysis
Sudhakar singh meta analysisSudhakar singh meta analysis
Sudhakar singh meta analysis
 
Amia tbi-14-final
Amia tbi-14-finalAmia tbi-14-final
Amia tbi-14-final
 
Pathway studio into webinar 052715v1
Pathway studio into webinar 052715v1Pathway studio into webinar 052715v1
Pathway studio into webinar 052715v1
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
Öppen data och forskningens genomslag
Öppen data och forskningens genomslagÖppen data och forskningens genomslag
Öppen data och forskningens genomslag
 
SPIN Workshop Microbial Genomics @NIST
SPIN Workshop Microbial Genomics @NISTSPIN Workshop Microbial Genomics @NIST
SPIN Workshop Microbial Genomics @NIST
 
Informatics and the merging of research and quality measures with bedside care
Informatics and the merging of research and quality measures with bedside careInformatics and the merging of research and quality measures with bedside care
Informatics and the merging of research and quality measures with bedside care
 
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
A Data Mining Framework for the Analysis of Patient Arrivals into Healthcare ...
 
AI for COVID-19 - Q42020 update
AI for COVID-19 - Q42020 updateAI for COVID-19 - Q42020 update
AI for COVID-19 - Q42020 update
 

Último

Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Silpa
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 

Último (20)

Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Genetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditionsGenetics and epigenetics of ADHD and comorbid conditions
Genetics and epigenetics of ADHD and comorbid conditions
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 

BigData'18: Validation and Analysis of Hypothesis Generation Systems

  • 1. Validation and Analysis of Hypothesis Generation Systems Justin Sybrandt 1
  • 2. Talk Outline Warning: is actually two talks ● Overview + Background ● Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking ● Are Abstracts Enough for Hypothesis Generation? 2
  • 4. Problem Overview ● Medical research is expensive and risky ● Text mining can identify fruitful research directions before expensive experiments Pfizer Ends Hunt for Drugs to Treat Alzheimer’s and Parkinson's WSJ 4
  • 5. ● NIH provides 27 million abstracts ● 2-4 thousand added daily ● Lack of communication leads to undiscovered connections ● Hypothesis generation finds implicitly published relationships Hypothesis Generation Fish Oil Blood Viscosity Raynaud’s Syndrome ? 5
  • 6. ● Presented at KDD’17 ● Validated against small number of historical examples ● Relied on expert input to interpret results ● Original Pipeline ○ Data Collection ○ Network Construction ○ Relevant Abstract Identification ○ Topic Modeling Automatic Biomedical Hypothesis Generation System 6
  • 7. Neoplasms Tumor - Tumour Oncological abnormality Data Collection Abstracts & n-grams Predicates Codified Terms Tumours evade immune control by creating hostile microenvironments that perturb T cell metabolism and effector function. Tumours Immune Control evade 7
  • 9. ● Select two query nodes ● Find shortest path ● Locate nearby abstracts ● Collect sub-corpus Relevant Abstract Identification 9
  • 10. Extract Information ● Apply LDA topic modeling ● Explore patterns in fuzzy clusters ● Original limitations: ○ Expert analysis ○ No numerical results ○ Lots of data, time consuming 10
  • 11. Justin Sybrandt1 , Michael Shtutman2 , Ilya Safro1 1 Clemson U. - School of Computing 2 U. of S. Carolina - Drug Discovery and Biomedical Sciences Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking
  • 12. Validation ● Challenges ○ Lack of datasets ○ Problematic false positive / negative ● We propose a scalable approach ● Verification through lab studies Does it work? 12
  • 13. Existing Validation ● Existing validation methods [1] ○ Replicate Swanson’s experiments ○ Statistical evaluation ○ Incorporate expert opinion ○ Publish in medicine ● Complications ○ Human in the loop ○ Consumes expert time ○ Small validation sets [1] M. Yetisgen-Yildiz and W. Pratt, “Evaluation of literature-based discovery systems,” in Literature-based discovery. Springer, 2008, pp. 101–113. 13
  • 14. Drug Discovery and Candidate Selection ● Drug companies must prioritize investments ● Thousands of targets narrow to handful of candidates ● Drug discovery is a ranking problem Hit Identification Lead Discovery Lead Optimization Drug Candidates 14
  • 15. ● New validation approach inspired by drug discovery ● Rank recent hypotheses by plausibility ● Requires ○ Positive & negative samples ○ Ranking criteria ● Produces area under ROC curve Validation through Candidate Ranking False Positive Rate TruePositiveRate Random G ood Better 15
  • 16. Collecting Recent Hypotheses ● Assume abstracts are a reasonable summary ● Identify original term-pairs from each year ● Select cut year for validation (2010) ● Record pairs newer than cut year ● Published Set Fish Oil Blood Viscosity Raynaud’s Syndrome ? Year #NewTerm-Pairs Prevalence of New Hypotheses in Medical Literature 16
  • 17. Collecting Negative Samples ● Select term subset present at cut year ● Randomly pair terms ● Record sampled pairs that do not occur in literature ● Generate samples equal to published set ● Noise Set 17
  • 18. ● Extract numeric features from topic model results ● Learn correlation between features and plausibility ● Generate a collection of measurements ○ Embedding based ○ Topic network based Creating a Ranking Criteria 18 Clusters of Words in Embedding Space
  • 19. Embedding Measurements ● Connected terms should... ○ Be similarly embedded ○ Share nearby topics ● Topic embeddings from centroids ● Measure L2 distances and cosine similarity 19
  • 20. Topic Network Measurements ● Place terms and topics in network ● Edges formed by nearest-neighbors in embedding ● Add edges until path between terms appears ● Observed different network properties 20
  • 21. Polynomial Combination ● Each previous metric is heuristically backed ● Polynomial combination provides ○ Interpretable results ○ Improved performance ○ Easy fitting 21
  • 22. Results ● Represents 8,638 queries ● Cut year 2010 ● Polynomial is top performer ● L2 shows strength of embedding ● Topic information adds signal 22
  • 23. Results Highly Cited ● Represents 2,896 queries ● Subset to papers with 100 citations ● Performance improved ● Similar order of metric performance 23
  • 24. Verification in Lab Experiments ● Want to show that ranking method extends beyond validation experiment ● Focus on HIV-associated Neurodegenerative Disease (HAND) ○ ~30% of HIV patients over 60 have dementia ○ ~7% is typical rate ● Ran over 30k queries 24
  • 25. New HAND-Gene Connection ● DDX3 identified in top 10% of genes ● Previously studied in relation to cancer ● Unexpected in this context ● Support from wet lab experiments ○ Rapidly age HIV+ neurons with cocaine ○ Cells with DDX3 inhibited survive ○ Cells with DDX3 active die 25
  • 26. Summary : Validation ● Introduces a new validation method based on candidate ranking ○ Does not rely on expert input ○ Scales to large validation sets ● Proposed ranking metrics ○ Embedding based ○ Topic network based ● Validated our system, Moliere ○ Published vs. Noise ○ Highly Cited vs. Noise ● Applied ranking to real-world application ○ HIV associated dementia See more online at: sybrandt.com/2018/validation 26
  • 28. Justin Sybrandt, Angelo Carrabba, Alexander Herzog, Ilya Safro Clemson U. - School of Computing Are Abstracts Enough for Hypothesis Generation?
  • 29. Motivation ● We now have a method to evaluate overall system performance ● Interesting questions: ○ What effect does corpus size and document length have on results? ○ How sensitive is a hypothesis generation system to input qualities? ○ How many papers does a hypothesis generation system need? ○ Are abstracts enough? 29
  • 30. Challenges with Full Text ● Larger documents ○ ~15.6x more words ● Expensive to acquire ○ Abstracts are free ● Harder to parse ○ Figures, tables, references ○ Often must parse PDFs 30
  • 31. Input Data from Other Systems ● Titles Only ○ ARROWSMITH - 1986 ● Titles and Abstracts (+ external sources) ○ Moliere - 2017 ○ Disease-Connect - 2015 ○ BrainSCANr - 2010 ○ ... ● Full Text ○ Watson for Drug Discovery - 2014 31
  • 32. ● Titles Only ○ ARROWSMITH - 1986 ● Titles and Abstracts (+ external sources) ○ Moliere - 2017 ○ Disease-Connect - 2015 ○ BrainSCANr - 2010 ○ ... ● Full Text ○ Watson for Drug Discovery - 2014 Input Data from Other Systems 32 ● Proprietary system ● Designed after recommender systems [2] ● Most inference on term-document matrix [3] [2] Spangler, Scott. Accelerating Discovery: Mining Unstructured Information for Hypothesis Generation. Chapman and Hall/CRC, 2015. [3] He, Qi, Ming Ji, and W. Scott Spangler. "Mining strong relevance between heterogeneous entities from their co-occurrences." U.S. Patent Application No. 14/279,617.
  • 33. Methodology ● Create datasets of variable corpus size and document size ○ Free abstracts from PubMed ○ Free full texts from PubMed Central ● Construct multiple “instances” of Moliere ○ Rebuild embedding, network, and queries ● Use previously discussed validation and ranking ○ Cut year 2015 33
  • 34. ● From PubMed ○ Entire dataset ○ Randomly sampled 1 / 2 ○ Randomly sampled 1 / 4 ○ Randomly sampled 1 / 8 ○ Randomly sampled 1 / 16 ● From PubMed Central* ○ Full Texts ○ Abstracts Considered Corpora * We restrict PMC to only papers released in plain text that contain abstracts. 34
  • 35. Input Dataset Comparisons 35 All of PubMed PMC Full Text # Documents (Millions) 24 1 Median Words Per Document 71 1,594 Unique Words (Millions) 2.4 6.5 Total Words (Billions) 1.85 1.86
  • 36. Input Dataset Comparisons 36 PMC Abstracts PMC Full Text # Documents (Millions) 1 1 Median Words Per Document 102 1,594 Unique Words (Millions) 0.67 6.5 Total Words (Billions) 0.1 1.86
  • 37. Input Dataset Comparisons 37 PMC Abstracts 1 / 16 PubMed # Documents (Millions) 1 1.5 Median Words Per Document 102 71 Unique Words (Millions) 0.67 0.35 Total Words (Billions) 0.1 0.1
  • 38. ● PubMed contains some questionable “abstracts” ○ Translated ○ Incomplete records ○ Scanned from older documents ● PubMed Central ○ Much more recent ○ Authors submit their own full-text papers ○ Conform better to modern publication standards PMC vs. PubMed Quality Comparison 38
  • 39. Experiments ● Collected 2,000 validation pairs ○ Cut year 2015 ○ Term pairs shared across all corpora ● Trained entire Moliere system per corpus ○ Embedding ○ Phrase Mining ○ Network Construction ○ Queries ○ Training Polynomial 39
  • 40. Results overall ● We present full results in paper ● Focus here on L2 and Polynomial ○ L2 evaluates embedding quality ○ Polynomial evaluates max performance * Lower performance than previously discussed. This work embeds text in R100 while the previous embeds in R500 . 40
  • 41. Findings ● Embedding ○ Full Text > Abstracts 41 0.678 0.777
  • 42. Findings ● Embedding ○ Full Text > Abstracts ○ Clean >> Many 42 0.678 0.651 0.612
  • 43. Findings ● Embedding ○ Full Text > Abstracts ○ Clean >> Many ● Max Performance ○ Clean = Many 43 0.718
  • 44. Findings ● Embedding ○ Full Text > Abstracts ○ Clean >> Many ● Max Performance ○ Clean = Many ○ Full Text > Abstracts 44 0.718 0.795
  • 45. The Downside to Full Text ● Increased single query runtime from 100s to ~4,500s ○ May be reasonable for specific searches ○ Does not scale to large candidate experiments ○ Most runtime during LDA topic models ● Full text topics are less interpretable 45
  • 46. Answers ● What effect does corpus size and document length have on results? ○ Increasing either helps ○ Document length has more effect than corpus size ○ Documents that are too long negatively affect topic interpretability ● Effect ○ Removing very short documents likely to boost overall performance ○ Using automatically generated summaries may balance performance 46
  • 47. Answers ● How sensitive is a hypothesis generation system to input qualities? ○ Significantly sensitive to short noisy documents ○ Performance predicated upon embedding ● Effect ○ Pre-trained word embeddings may boost performance 47
  • 48. Answers ● How many papers does a hypothesis generation system need? ○ ~1 million perform well ○ Quality > Quantity ● Effect ○ Lower barrier to entry for cross-domain applications 48
  • 49. Summary : Are Abstracts Enough? ● Explored multiple input corpora ○ PubMed vs. PubMed Central ● Found that longer documents increase performance ○ PMC abstracts are longer than Medline ○ Longer documents > larger quantity ● Significant runtime tradeoff ○ 45x runtime for 10% improvement ● Answer depends on the application See more online at: sybrandt.com/2018/abstracts 49