In silico prediction of novel therapeutic targets using gene – disease association data
Enrico Ferrero, PhD, Associate GSK Fellow
Scientific Leader, Computational Biology and Stats, Target Sciences
GSK
Big Data in Medicine
04.07.2017
Challenges in pharma R&D
Time and costs are increasing but success rate is declining
Why focus on targets?
Late phase failures cost (a lot) more
[Figure: relative cost (per molecule) and number of molecules at each stage: Lead discovery, Lead optimization, Pre-clinical, FTIH, Phase 2, Phase 3. Source: Manhattan Institute, 2012]
Rethink the drug discovery pipeline
Spend more time and resources on target validation to reduce attrition in later phases
[Figure: drug discovery pipeline stages: Potential targets, Target validation, Lead discovery, Lead optimisation, Pre-clinical, FTIH, Phase 2, Phase 3, Launch]
Cook et al., 2014; Nelson et al., 2015
Target discovery and genetics evidence
• 40% of efficacy failures are due to poor linkage between target and disease.
• The proportion of drug mechanisms with direct genetic support increases significantly across the drug development pipeline.
• Selecting genetically supported targets could double the success rate in clinical development.
Open Targets
A platform for therapeutic target identification and validation
Could it be as easy as spotting spam emails?
• Is it possible to predict novel therapeutic targets using available gene – disease association data?
Predicting therapeutic targets
A simple machine learning workflow
• Generate input data matrix
• Assign labels and split into training, test and prediction sets
• Exploratory data analysis
• Tune, train and test classifiers using nested cross-validation
• Evaluate best classifier performance on test set
• Explore predicted targets across the drug discovery pipeline
• Make predictions using best performing classifier
• Validate with literature text mining
Predict therapeutic targets using only gene – disease association data
Data sources and data processing
• Obtain all gene – disease associations and supporting evidence from the Open Targets platform.
• For all genes, create numeric features by taking the mean score across all diseases:
– Genetic associations (germline)
– Somatic mutations
– Significant gene expression changes
– Disease-relevant phenotype in animal model
– Pathway-level evidence
• Gather positive labels from Pharmaprojects: only consider targets with drugs currently on the market, in clinical trials or in preclinical studies. Exclude targets with drugs withdrawn from the market or whose development has been discontinued.
Input data matrix generation
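The feature construction described above can be sketched as follows. This is a minimal illustration, not the code used for the talk: the record layout, the short evidence-type keys and the rule that a missing evidence type counts as 0.0 are all assumptions, and the toy records are made up.

```python
from collections import defaultdict
from statistics import mean

# Evidence types listed on the slide; the short keys are hypothetical labels.
EVIDENCE_TYPES = [
    "genetic_association",  # germline
    "somatic_mutation",
    "rna_expression",
    "animal_model",
    "affected_pathway",
]

def build_feature_matrix(associations):
    """Map each gene to its mean score per evidence type across diseases.

    associations: iterable of (gene, disease, scores), where scores is a dict
    from evidence type to a score in [0, 1]. Assumption: an evidence type
    missing from the dict counts as 0.0 for that gene-disease pair.
    """
    per_gene = defaultdict(list)
    for gene, _disease, scores in associations:
        per_gene[gene].append([scores.get(t, 0.0) for t in EVIDENCE_TYPES])
    # Average each evidence-type column over the gene's associated diseases.
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in per_gene.items()
    }

# Toy records, not real Open Targets data.
toy = [
    ("GENE_A", "disease_1", {"genetic_association": 0.8, "animal_model": 0.4}),
    ("GENE_A", "disease_2", {"genetic_association": 0.2}),
    ("GENE_B", "disease_1", {"rna_expression": 0.5}),
]
matrix = build_feature_matrix(toy)
```

Each gene ends up with one row of five numeric features, which is the shape the classifiers later in the deck would consume.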
A positive – unlabelled (PU) semi-supervised learning approach
• A semi-supervised framework with only positive labels is used: targets according to Pharmaprojects constitute the positive class (P), while the rest of the proteome is used as the unlabelled class (U), containing both negatives and yet-to-be-discovered positives.
• All positive cases (1421) and an equal number of randomly selected unlabelled cases (2842 in total) are set apart for training (80%) and testing (20%).
• The remainder is kept as a prediction set, on which predictions from the final model will be made.
Split data into training, test and prediction set
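A minimal sketch of this split, assuming genes are plain identifiers held in two lists. The function name, the fixed seed and the unlabelled count of 16000 are hypothetical, and the split here is a simple shuffle rather than a stratified one.

```python
import random

def pu_split(positives, unlabelled, train_fraction=0.8, seed=0):
    """Split into (train, test, prediction_set) for PU learning.

    All positives plus an equal-sized random sample of unlabelled genes are
    labelled 1 and 0 respectively, then divided into training and test sets.
    Unlabelled genes that were not sampled form the prediction set.
    """
    rng = random.Random(seed)
    sampled_u = rng.sample(unlabelled, len(positives))
    sampled_set = set(sampled_u)
    prediction_set = [g for g in unlabelled if g not in sampled_set]
    labelled = [(g, 1) for g in positives] + [(g, 0) for g in sampled_u]
    rng.shuffle(labelled)
    cut = round(train_fraction * len(labelled))
    return labelled[:cut], labelled[cut:], prediction_set

# 1421 positives as on the slide; the unlabelled count is a placeholder.
positives = [f"P{i}" for i in range(1421)]
unlabelled = [f"U{i}" for i in range(16000)]
train, test, prediction = pu_split(positives, unlabelled)
```

With 2842 labelled cases, an 80/20 split gives 2274 training and 568 test cases.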
Dimensionality reduction reveals structure in the data
t-Distributed Stochastic Neighbour Embedding (t-SNE)
What are the most “important” features?
Chi-squared test + information gain
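As a rough illustration of the two measures named here, the sketch below computes a chi-squared statistic and the information gain from a 2x2 table. Reducing a feature to present/absent is an assumption made for this example only; the slide does not say how the scores were discretised, and all function names are hypothetical.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def contingency(feature_present, labels):
    """2x2 counts: rows = feature absent/present, columns = label 0/1."""
    table = [[0, 0], [0, 0]]
    for f, y in zip(feature_present, labels):
        table[int(f)][int(y)] += 1
    return table

def information_gain(table):
    """H(label) minus H(label | feature)."""
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(col_totals)
    conditional = sum(sum(row) / n * entropy(row) for row in table if sum(row))
    return entropy(col_totals) - conditional

def chi_squared(table):
    """Pearson chi-squared statistic; assumes no empty row or column."""
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Toy data: 1 = gene has this evidence type / gene is a known target.
feature = [1, 1, 1, 0, 0, 0, 1, 0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
table = contingency(feature, labels)
gain = information_gain(table)
chi2 = chi_squared(table)
```

Ranking features by either value gives one notion of "importance"; the two need not agree on the order.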
Nested cross-validation and bagging for tuning and model selection
• Four classifiers are independently tuned, trained and tested on the training set using a nested cross-validation strategy (4 inner rounds for parameter tuning and 4 outer rounds to assess performance):
– Random forest (tuned parameters: number of trees and number of features);
– Feed-forward neural network with a single hidden layer (tuned parameters: size and decay);
– Support vector machine with radial kernel (tuned parameters: gamma and cost);
– Gradient boosting machine with AdaBoost exponential loss function (tuned parameters: number of trees and interaction depth).
• In PU learning, U contains both positive and negative cases, which results in classifier instability. Bagging (bootstrap aggregating) can improve the performance of unstable classifiers by randomly resampling P and U with replacement (bootstrap) and then aggregating the results by majority voting:
– Bagging with 100 iterations was applied to the neural network, the support vector machine and the gradient boosting machine.
– Random forests are already a special case of bagging.
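The bagging step can be sketched as below. A nearest-centroid rule stands in for the base learner purely to keep the example self-contained; it is not one of the classifiers used in the talk, and the function names and toy feature vectors are hypothetical.

```python
import random

def train_centroid(rows):
    """Toy base learner: per-class feature means (nearest-centroid rule)."""
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in rows if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def bagged_models(positives, unlabelled, n_iter=100, seed=0):
    """Train one model per bootstrap resample of P and of U."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_iter):
        boot_p = [(rng.choice(positives), 1) for _ in positives]
        boot_u = [(rng.choice(unlabelled), 0) for _ in unlabelled]
        models.append(train_centroid(boot_p + boot_u))
    return models

def predict_majority(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0

# Toy feature vectors; the last unlabelled point mimics a hidden positive.
P = [[0.8, 0.7], [0.9, 0.6], [0.7, 0.9]]
U = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.8, 0.8]]
models = bagged_models(P, U)
high = predict_majority(models, [0.85, 0.75])
low = predict_majority(models, [0.1, 0.1])
```

Because each resample draws a different mix from U, a hidden positive in U distorts only some of the models, and the vote smooths that out.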
Tuning, training and testing four classifiers
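A nested cross-validation loop with 4 inner and 4 outer rounds might look like the sketch below. A k-nearest-neighbour classifier with a tunable k stands in for the four classifiers of the talk, and the data are synthetic; both are assumptions for illustration only.

```python
import random

def k_folds(n_items, k, rng):
    """Shuffle indices and deal them into k folds."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority label among the k training rows closest to x."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    return 1 if 2 * sum(y for _, y in nearest) > k else 0

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def cross_validate(rows, k, n_folds, rng):
    """Mean held-out accuracy of k-NN with parameter k over n_folds folds."""
    scores = []
    for held_out in k_folds(len(rows), n_folds, rng):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        scores.append(accuracy(train, [rows[i] for i in held_out], k))
    return sum(scores) / len(scores)

def nested_cv(rows, grid, outer=4, inner=4, seed=0):
    """Inner loop picks the parameter; outer loop estimates performance."""
    rng = random.Random(seed)
    outer_scores = []
    for held_out in k_folds(len(rows), outer, rng):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        # Tuning only ever sees the outer training portion.
        best = max(grid, key=lambda k: cross_validate(train, k, inner, rng))
        outer_scores.append(accuracy(train, test, best))
    return outer_scores

# Synthetic two-class data: two well-separated Gaussian clusters.
data_rng = random.Random(1)
rows = [([data_rng.gauss(0.2, 0.1), data_rng.gauss(0.2, 0.1)], 0) for _ in range(40)]
rows += [([data_rng.gauss(0.8, 0.1), data_rng.gauss(0.8, 0.1)], 1) for _ in range(40)]
scores = nested_cv(rows, grid=[1, 3, 5])
```

The outer test fold never takes part in parameter selection, which keeps the outer scores from being optimistically biased by the tuning.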
Evaluating classifier performance
Receiver operating characteristic curves
AUC 0.76
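For reference, the area under a ROC curve equals the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it that way on made-up scores; the function name and data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy classifier scores and true labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.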
Disease association evidence higher for more advanced targets
Model predicts late-stage targets more easily than early-stage ones
Literature text mining validation of predictions
Highly significant overlap between predictions and text mining results
Conclusions
• The gene – disease association data from Open Targets contains enough information to predict whether a protein can make a therapeutic target or not with decent accuracy (71%).
• Aside from standard cross-validation and testing, prediction results were also validated by mining the scientific literature for therapeutic targets and assessing the significance of the overlap.
• The ability of the neural network model to predict late-stage targets with greater accuracy confirms that clear linkage between target and disease is essential to maximise chances of success in the clinic.
• Of the evidence types tested, animal models showing disease-relevant phenotypes, dysregulated gene expression in disease tissue and genetic associations between gene and disease appear to be the most informative.
In silico predictions of novel therapeutic targets using gene – disease association data
Acknowledgements
• Ian Dunham
• Philippe Sanseau
• Gautier Koscielny
• Giovanni Dall’Olio
• Pankaj Agarwal
• Mark Hurle
• Steven Barrett
• Nicola Richmond
• Jin Yao
Thank you
Pharmaprojects
An industry-wide drug development database
Exploratory data analysis reveals sparse data with little structure
Hierarchical clustering + principal component analysis
Tune, train and test classifiers using cross-validation
Decision tree classification criteria
Evaluating classifier performance
Performance measures for supervised learning
Neural network performance on independent test set
Selected classifier with most balanced overall performance for further analyses
Measure                   Cross-validation   Test
Misclassification error   0.303              0.287
Accuracy                  0.697              0.713
AUC                       0.758              0.763
Recall/Sensitivity        0.610              0.638
Specificity               0.785              0.784
Precision                 0.742              0.736
F1 score                  0.670              0.683
Tune, train and test classifiers using cross-validation
Misclassification error
Evaluate best classifier performance on test set
Confusion matrices (rows: actual value; columns: prediction outcome)

Cross-validation
                  Predicted Unknown   Predicted Target
Actual Unknown    912                 217
Actual Target     445                 700

Test
                  Predicted Unknown   Predicted Target
Actual Unknown    225                 67
Actual Target     99                  177
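The usual performance measures follow directly from the four cells of a confusion matrix. The sketch below applies the standard definitions to the test-set counts shown above, treating "Target" as the positive class; the function name is hypothetical. Values derived from this single matrix need not coincide exactly with those in the summary table for the neural network.

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard binary classification measures from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_error": (fp + fn) / total,
        "recall": recall,  # also called sensitivity
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Test-set counts: rows are actual values, columns are predictions.
test_metrics = classification_metrics(tn=225, fp=67, fn=99, tp=177)
```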
Split into training, test and prediction sets
Assess the effect of randomly sampling from unlabelled class: Monte Carlo simulation
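One way to set up such a simulation is to repeat the random draw from U many times and look at the spread of whatever statistic is of interest. In the sketch below, `evaluate` is a hypothetical callback; in practice it would train and score a classifier on the sampled genes, whereas here it is just the mean of toy values.

```python
import random
from statistics import mean, stdev

def monte_carlo_sampling(unlabelled, n_sample, evaluate, n_runs=200, seed=0):
    """Repeat the random draw from U and summarise the resulting statistic.

    Returns the mean and standard deviation of evaluate(sample) over n_runs
    independent samples of size n_sample drawn without replacement.
    """
    rng = random.Random(seed)
    results = [evaluate(rng.sample(unlabelled, n_sample)) for _ in range(n_runs)]
    return mean(results), stdev(results)

# Toy stand-in: each unlabelled "gene" is a single number in [0, 1).
toy_unlabelled = [i / 1000 for i in range(1000)]
avg, spread = monte_carlo_sampling(toy_unlabelled, n_sample=100, evaluate=mean)
```

A small spread relative to the mean would suggest that the particular random sample of unlabelled cases has little influence on the result.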
Tune, train and test classifiers using cross-validation
Precision recall curves
Tune, train and test classifiers using cross-validation
[Figure panels: Predicted targets; Predicted non-targets]
Overlap between predictions on training set
Majority of targets with discontinued programmes not predicted as targets
Targets with lower disease association fail more often
Generating predictions on remaining 15K genes
Run model on prediction set (not used for training/testing)
Validate with literature text mining
Assess the significance of the literature-based validation: permutation test
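A permutation test of this kind could be sketched as follows: compare the observed overlap between predicted and text-mined targets with the overlaps obtained when the predicted set is replaced by random gene sets of the same size. The exact permutation scheme used in the talk is not stated, so this null model, the function name and the toy gene sets are assumptions.

```python
import random

def overlap_permutation_test(predicted, text_mined, universe, n_perm=2000, seed=0):
    """Return (observed overlap, empirical one-sided p-value).

    Null model (an assumption): the predicted set is a uniformly random
    subset of the gene universe with the same size as the real one.
    """
    rng = random.Random(seed)
    text_mined = set(text_mined)
    predicted = set(predicted)
    universe = list(universe)
    observed = len(predicted & text_mined)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(universe, len(predicted))
        if len(text_mined.intersection(sample)) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy gene sets: 100 predicted, 100 text-mined, 50 shared.
genes = [f"G{i}" for i in range(1000)]
observed, p_value = overlap_permutation_test(genes[:100], genes[50:150], genes)
```

With the add-one correction, the smallest reportable p-value is 1 / (n_perm + 1), so more permutations are needed to resolve very small values.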
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 

In silico prediction of novel therapeutic targets using gene - disease association data

  • 1. In silico prediction of novel therapeutic targets using gene – disease association data. Enrico Ferrero, PhD, Associate GSK Fellow; Scientific Leader, Computational Biology and Stats, Target Sciences, GSK. Big Data in Medicine, 04.07.2017.
  • 2. Challenges in pharma R&D: time and costs are increasing but the success rate is declining.
  • 3. Why focus on targets? Late-phase failures cost (a lot) more. [Figure: relative cost (per molecule) and number of molecules by stage: Lead discovery, Lead optimization, Pre-clinical, FTIH, Phase 2, Phase 3. Source: Manhattan Institute, 2012]
  • 4. Rethink the drug discovery pipeline: spend more time and resources on target validation to reduce attrition in later phases. [Figure: pipeline diagrams with the stages Potential targets, Target validation, Lead discovery, Lead optimisation, Pre-clinical, FTIH, Phase 2, Phase 3, Launch]
  • 5. Target discovery and genetics evidence (Cook et al., 2014; Nelson et al., 2015): – 40% of efficacy failures are due to poor linkage between target and disease. – The proportion of drug mechanisms with direct genetic support increases significantly across the drug development pipeline. – Selecting genetically supported targets could double the success rate in clinical development.
  • 6. Open Targets: a platform for therapeutic target identification and validation.
  • 7. Predicting therapeutic targets: could it be as easy as spotting spam emails? Is it possible to predict novel therapeutic targets using available gene – disease association data?
  • 8. A simple machine learning workflow: predict therapeutic targets using only gene – disease association data. Steps as listed on the slide: generate input data matrix; assign labels and split into training, test and prediction sets; exploratory data analysis; tune, train and test classifiers using nested cross-validation; evaluate best classifier performance on test set; explore predicted targets across the drug discovery pipeline; make predictions using best performing classifier; validate with literature text mining.
  • 9. Data sources and data processing (input data matrix generation): – Obtain all gene – disease associations and supporting evidence from the Open Targets platform. – For all genes, create numeric features by taking the mean score across all diseases for each evidence type: genetic associations (germline); somatic mutations; significant gene expression changes; disease-relevant phenotype in animal model; pathway-level evidence. – Gather positive labels from Pharmaprojects: only consider targets with drugs currently on the market, in clinical trials or in preclinical studies. Exclude targets with drugs withdrawn from the market or whose development has been discontinued.
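The feature-generation step on slide 9 can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the evidence-type names, the tuple layout of `associations`, the toy gene and disease identifiers, and the choice to fill missing evidence with 0.0 are all assumptions, and the mean is taken over the diseases for which a score exists.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical names for the five evidence types listed on the slide.
EVIDENCE_TYPES = [
    "genetic_association",
    "somatic_mutation",
    "rna_expression",
    "animal_model",
    "affected_pathway",
]


def build_feature_matrix(associations):
    """Return {gene: [mean score per evidence type]}.

    `associations` is an iterable of (gene, disease, evidence_type, score).
    A gene with no evidence of a given type gets 0.0 for that feature
    (an assumption; the slides do not say how missing values are handled).
    """
    scores = defaultdict(lambda: defaultdict(list))
    for gene, _disease, evidence_type, score in associations:
        scores[gene][evidence_type].append(score)
    return {
        gene: [mean(by_type[t]) if by_type[t] else 0.0 for t in EVIDENCE_TYPES]
        for gene, by_type in scores.items()
    }


# Toy input with made-up genes, diseases and scores.
toy = [
    ("GENE_A", "disease_1", "genetic_association", 0.8),
    ("GENE_A", "disease_2", "genetic_association", 0.4),
    ("GENE_A", "disease_1", "animal_model", 0.5),
    ("GENE_B", "disease_3", "rna_expression", 0.2),
]
features = build_feature_matrix(toy)
```

Each gene ends up as one row of five numeric features, which is the shape a classifier expects regardless of how many diseases the gene is linked to.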
  • 10. A positive – unlabelled (PU) semi-supervised learning approach (split data into training, test and prediction sets): – A semi-supervised framework with only positive labels is used: targets according to Pharmaprojects constitute the positive class (P), while the rest of the proteome is used as the unlabelled class (U), which contains both negatives and yet-to-be-discovered positives. – All positive cases (1421) and an equal number of randomly selected unlabelled cases (2842 cases in total) are set apart for training (80%) and testing (20%). – The remainder is kept as a prediction set, on which predictions from the final model will be made.
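The split on slide 10 could look something like the sketch below. The function name `pu_split`, the identifiers, the size of the unlabelled pool (10,000) and the use of a simple, non-stratified shuffle are assumptions made for illustration; only the 1421 positives and the 80/20 proportion come from the slide.

```python
import random


def pu_split(positives, unlabelled, train_fraction=0.8, seed=0):
    """Split PU data into training, test and prediction sets.

    All positives are paired with an equal number of randomly sampled
    unlabelled cases; the unsampled unlabelled cases form the prediction set.
    """
    rng = random.Random(seed)
    sampled_u = rng.sample(unlabelled, len(positives))
    sampled_set = set(sampled_u)
    # Label 1 = known target (P), label 0 = unlabelled (U), not "negative".
    labelled = [(g, 1) for g in positives] + [(g, 0) for g in sampled_u]
    rng.shuffle(labelled)
    cut = round(train_fraction * len(labelled))
    train, test = labelled[:cut], labelled[cut:]
    prediction = [g for g in unlabelled if g not in sampled_set]
    return train, test, prediction


# Hypothetical identifiers; 1421 positives as on the slide,
# 10,000 unlabelled genes chosen arbitrarily for the example.
positives = [f"P{i}" for i in range(1421)]
unlabelled = [f"U{i}" for i in range(10000)]
train, test, prediction = pu_split(positives, unlabelled)
```

Because the unlabelled half is a random draw, a different seed gives a different training set, which is the sampling effect the Monte Carlo simulation on slide 27 is described as assessing.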
  • 11. Dimensionality reduction reveals structure in the data: t-Distributed Stochastic Neighbour Embedding (t-SNE).
  • 12. What are the most “important” features? Chi-squared test + information gain.
  • 13. Nested cross-validation and bagging for tuning and model selection (tuning, training and testing four classifiers): – Four classifiers are independently tuned, trained and tested on the training set using a nested cross-validation strategy (4 inner rounds for parameter tuning and 4 outer rounds to assess performance): random forest (tuned parameters: number of trees and number of features); feed-forward neural network with a single hidden layer (tuned parameters: size and decay); support vector machine with radial kernel (tuned parameters: gamma and cost); gradient boosting machine with AdaBoost exponential loss function (tuned parameters: number of trees and interaction depth). – In PU learning, U contains both positive and negative cases, which results in classifier instability. Bagging (bootstrap aggregating) can improve the performance of unstable classifiers by randomly resampling P and U with replacement (bootstrap) and then aggregating the results by majority voting. Bagging with 100 iterations was applied to the neural network, the support vector machine and the gradient boosting machine; random forests are already a special case of bagging.
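The bagging idea on slide 13 (resample P and U with replacement, fit a model per resample, aggregate by majority vote) can be illustrated with the following sketch. The base learner here is a toy nearest-centroid rule swapped in to keep the example dependency-free; it is not one of the four classifiers used in the talk, and the data points are made up.

```python
import random


def fit_nearest_centroid(samples):
    """Toy base learner: predict the class whose centroid is closest."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return 1 if dist(centroids[1]) < dist(centroids[0]) else 0

    return predict


def bagged_predict(positives, unlabelled, queries, n_iter=100, seed=0):
    """Bootstrap P and U separately, then aggregate by majority vote."""
    rng = random.Random(seed)
    votes = [0] * len(queries)
    for _ in range(n_iter):
        boot_p = rng.choices(positives, k=len(positives))    # with replacement
        boot_u = rng.choices(unlabelled, k=len(unlabelled))  # with replacement
        model = fit_nearest_centroid(
            [(x, 1) for x in boot_p] + [(x, 0) for x in boot_u]
        )
        for i, q in enumerate(queries):
            votes[i] += model(q)
    # A query is called a target only if more than half of the models agree.
    return [1 if 2 * v > n_iter else 0 for v in votes]


# Made-up two-feature data; the last unlabelled point mimics a hidden positive.
P = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9]]
U = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.8, 0.8]]
predictions = bagged_predict(P, U, queries=[[0.85, 0.85], [0.05, 0.10]])
```

Resampling U means the hidden positive is absent from some bootstrap rounds and over-represented in others, so no single contaminated draw decides the final call.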
  • 14. Evaluating classifier performance: receiver operating characteristic curves (AUC 0.76).
  • 15. Disease association evidence is higher for more advanced targets: the model predicts late-stage targets more easily than early-stage ones.
  • 16. Literature text mining validation of predictions: highly significant overlap between predictions and text mining results.
  • 17. Conclusions: – The gene – disease association data from Open Targets contains enough information to predict whether or not a protein can make a therapeutic target with decent accuracy (71%). – Aside from standard cross-validation and testing, prediction results were also validated by mining the scientific literature for therapeutic targets and assessing the significance of the overlap. – The ability of the neural network model to predict late-stage targets with greater accuracy confirms that clear linkage between target and disease is essential to maximise the chances of success in the clinic. – Of the evidence types tested, animal models showing disease-relevant phenotypes, dysregulated gene expression in disease tissue and genetic associations between gene and disease appear to be the most informative.
  • 18. Acknowledgements: Ian Dunham, Philippe Sanseau, Gautier Koscielny, Giovanni Dall’Olio, Pankaj Agarwal, Mark Hurle, Steven Barrett, Nicola Richmond, Jin Yao.
  • 20. Pharmaprojects: an industry-wide drug development database.
  • 21. Exploratory data analysis reveals sparse data with little structure: hierarchical clustering + principal component analysis.
  • 22. Tune, train and test classifiers using cross-validation: decision tree classification criteria.
  • 23. Evaluating classifier performance: performance measures for supervised learning.
  • 24. Neural network performance on independent test set: selected as the classifier with the most balanced overall performance for further analyses.
      Measure                    Cross-validation   Test
      Misclassification error    0.303              0.287
      Accuracy                   0.697              0.713
      AUC                        0.758              0.763
      Recall/Sensitivity         0.610              0.638
      Specificity                0.785              0.784
      Precision                  0.742              0.736
      F1 score                   0.670              0.683
  • 25. Tune, train and test classifiers using cross-validation: misclassification error.
  • 26. Evaluate best classifier performance on test set: confusion matrices (rows: actual value; columns: prediction outcome).
      Cross-validation    Predicted Unknown   Predicted Target
      Actual Unknown      912                 217
      Actual Target       445                 700
      Test                Predicted Unknown   Predicted Target
      Actual Unknown      225                 67
      Actual Target       99                  177
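As a worked example of the standard performance measures, the sketch below derives them from the test-set confusion matrix on slide 26, reading it as 225 true negatives, 67 false positives, 99 false negatives and 177 true positives. Treating "Unknown" as the negative class is an assumption, and values computed this way need not match the reported table exactly, since the slides do not state how the reported figures were aggregated.

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary classification measures from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # sensitivity: share of actual targets found
    specificity = tn / (tn + fp)   # share of actual unknowns kept as unknown
    precision = tp / (tp + fp)     # share of predicted targets that are targets
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "misclassification_error": 1 - accuracy,
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Counts read from the test-set matrix on the slide.
test_metrics = binary_metrics(tn=225, fp=67, fn=99, tp=177)
```

In a PU setting the "false positives" include unlabelled genes that may be real but undiscovered targets, so precision and specificity computed this way are best read as lower bounds.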
  • 27. Split into training, test and prediction sets: assess the effect of randomly sampling from the unlabelled class with a Monte Carlo simulation.
  • 28. Tune, train and test classifiers using cross-validation: precision – recall curves.
  • 29. Tune, train and test classifiers using cross-validation: overlap between predictions on training set. [Figure panels: predicted targets; predicted non-targets]
  • 30. Targets with lower disease association fail more often: the majority of targets with discontinued programmes are not predicted as targets.
  • 31. Generating predictions on remaining 15K genes: run model on prediction set (not used for training/testing).
  • 32. Validate with literature text mining: assess the significance of the literature-based validation with a permutation test.
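A permutation test for the overlap between predicted targets and text-mined targets might be set up along these lines. The slides do not describe the exact procedure, so the null model used here (drawing random gene sets of the same size as the prediction set from the gene universe) is an assumption, as are the function name and the toy gene identifiers.

```python
import random


def overlap_permutation_test(predicted, text_mined, universe, n_perm=1000, seed=0):
    """Estimate how often a random gene set overlaps text mining as much as
    the predictions do. Returns (observed overlap, empirical p-value)."""
    rng = random.Random(seed)
    text_mined = set(text_mined)
    predicted = set(predicted)
    observed = len(predicted & text_mined)
    universe = list(universe)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(universe, len(predicted))
        if sum(g in text_mined for g in random_set) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: 1000 hypothetical genes, 100 predicted, 100 text-mined, 80 shared.
universe = [f"G{i}" for i in range(1000)]
predicted = universe[:100]
text_mined = universe[20:120]
observed, p_value = overlap_permutation_test(predicted, text_mined, universe)
```

With only about 10 shared genes expected by chance in this toy setup, an observed overlap of 80 is never matched by the random draws, so the p-value sits at its floor of 1 / (n_perm + 1).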