In silico prediction of novel therapeutic targets using gene - disease association data
1. In silico prediction of novel therapeutic targets using gene – disease association data
Enrico Ferrero, PhD, Associate GSK Fellow
Scientific Leader, Computational Biology and Stats, Target Sciences
GSK
Big Data in Medicine
04.07.2017
2. Challenges in pharma R&D
Time and costs are increasing but success rate is declining
3. Why focus on targets?
Late phase failures cost (a lot) more
[Figure: relative cost (per molecule) and number of molecules at each stage: lead discovery, lead optimization, pre-clinical, FTIH, Phase 2, Phase 3]
Manhattan Institute, 2012
4. Rethink the drug discovery pipeline
Spend more time and resources in target validation to reduce attrition in later phases
[Figure: two pipeline diagrams with the stages Potential targets, Target validation, Lead discovery, Lead optimisation, Pre-clinical, FTIH, Phase 2, Phase 3 and Launch]
5. Target discovery and genetics evidence
Cook et al., 2014; Nelson et al., 2015
40% of efficacy failures are due to poor linkage between target and disease.
The proportion of drug mechanisms with direct genetic support increases significantly across the
drug development pipeline.
Selecting genetically supported targets could double the success rate in clinical development.
6. Open Targets
A platform for therapeutic target identification and validation
7. Could it be as easy as spotting spam emails?
Is it possible to predict novel therapeutic targets using
available gene – disease association data?
Predicting therapeutic targets
8. A simple machine learning workflow
– Generate input data matrix
– Assign labels and split into training, test and prediction sets
– Exploratory data analysis
– Tune, train and test classifiers using nested cross-validation
– Evaluate best classifier performance on test set
– Explore predicted targets across the drug discovery pipeline
– Make predictions using best performing classifier
– Validate with literature text mining
Predict therapeutic targets only using gene – disease association data
9. Data sources and data processing
Obtain all gene – disease associations and
supporting evidence from the Open Targets
platform.
For all genes, create numeric features by taking
the mean score across all diseases:
– Genetic associations (germline)
– Somatic mutations
– Significant gene expression changes
– Disease-relevant phenotype in animal model
– Pathway-level evidence
Gather positive labels from Pharmaprojects:
only consider targets with drugs currently on
the market, in clinical trials or preclinical
studies. Exclude targets with drugs withdrawn
from market or whose development has been
discontinued.
Input data matrix generation
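The feature construction described on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the analysis: the record layout, the evidence-type keys and the example scores are hypothetical, the mean is taken over the diseases for which a gene has a score of that type, and a gene with no evidence of a given type is assumed to score 0.0 (the slides do not say how missing values were handled).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical keys, one per evidence type listed on the slide.
EVIDENCE_TYPES = [
    "genetic_association",
    "somatic_mutation",
    "rna_expression",
    "animal_model",
    "affected_pathway",
]


def build_feature_matrix(records):
    """Average each evidence score per gene over the diseases it is linked to.

    `records` is an iterable of (gene, disease, evidence_type, score) tuples.
    Returns {gene: [mean score per evidence type]}.
    Assumption: an evidence type with no records for a gene scores 0.0.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for gene, _disease, evidence_type, score in records:
        scores[gene][evidence_type].append(score)
    return {
        gene: [mean(by_type[t]) if by_type[t] else 0.0 for t in EVIDENCE_TYPES]
        for gene, by_type in scores.items()
    }


# Made-up records, for illustration only.
records = [
    ("GENE_A", "disease_1", "genetic_association", 0.75),
    ("GENE_A", "disease_2", "genetic_association", 0.25),
    ("GENE_A", "disease_1", "animal_model", 0.5),
    ("GENE_B", "disease_3", "somatic_mutation", 1.0),
]
matrix = build_feature_matrix(records)
```

Each gene ends up as one row with five numeric features, which is the shape a classifier expects regardless of how many diseases the gene is associated with.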
10. A positive – unlabelled (PU) semi-supervised learning approach
A semi-supervised framework with only
positive labels is used: targets according to
Pharmaprojects constitute the positive class
(P), while the rest of the proteome is used as
the unlabelled class (U), containing both
negatives and yet-to-be-discovered positives.
All positive cases (1421) and an equal number
of randomly selected unlabelled cases (2842 in
total) are set apart for training (80%) and
testing (20%).
The remainder is kept as a prediction set,
on which the final model makes its
predictions.
Split data into training, test and prediction sets
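The split described on this slide might look like the sketch below. The function name, the gene identifiers and the size of the unlabelled pool are hypothetical, and the 80/20 cut is a plain shuffle rather than a stratified split, since the slides do not say which was used.

```python
import random


def pu_split(positives, unlabelled, train_fraction=0.8, seed=0):
    """Build training, test and prediction sets for positive-unlabelled learning."""
    rng = random.Random(seed)
    # Draw as many unlabelled cases as there are positives.
    sampled = rng.sample(unlabelled, len(positives))
    sampled_set = set(sampled)
    # Everything not sampled is held back for the final predictions.
    prediction = [g for g in unlabelled if g not in sampled_set]

    # Label 1 = known target (P), label 0 = sampled unlabelled case (U).
    labelled = [(g, 1) for g in positives] + [(g, 0) for g in sampled]
    rng.shuffle(labelled)
    cut = round(train_fraction * len(labelled))
    return labelled[:cut], labelled[cut:], prediction


# 1421 positives as on the slide; the unlabelled pool size is made up.
positives = [f"P{i}" for i in range(1421)]
unlabelled = [f"U{i}" for i in range(16421)]
train, test, prediction = pu_split(positives, unlabelled)
```

With 1421 positives this gives 2842 labelled cases, of which 2274 go to training and 568 to testing.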
11. Dimensionality reduction reveals structure in the data
t-Distributed Stochastic Neighbour Embedding (t-SNE)
12. What are the most “important” features?
Chi-squared test + information gain
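As a reminder of what information gain measures, the sketch below computes it for a discrete feature: the entropy of the class labels minus the entropy that remains after splitting on the feature. It assumes the feature has already been discretised (for example into "has evidence" and "no evidence"); how the scores were binned for this analysis is not stated on the slide, and the chi-squared test is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy left after the split, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: one feature that separates the classes, one that does not.
labels = [1, 1, 0, 0]
informative = information_gain([1, 1, 0, 0], labels)
uninformative = information_gain([1, 0, 1, 0], labels)
```

A feature that separates the two classes perfectly recovers the full label entropy, while a feature that is independent of the labels gains nothing.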
13. Nested cross-validation and bagging for tuning and model selection
Four classifiers are independently tuned, trained and tested on the
training set using a nested cross-validation strategy (4 inner rounds
for parameter tuning and 4 outer rounds to assess performance):
– Random forest (tuned parameters: number of trees and
number of features);
– Feed-forward neural network with single hidden layer (tuned
parameters: size and decay);
– Support vector machine with radial kernel (tuned parameters:
gamma and cost);
– Gradient boosting machine with AdaBoost exponential loss
function (tuned parameters: number of trees and interaction
depth).
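The nested strategy can be illustrated with a small, dependency-free sketch. A k-nearest-neighbour classifier stands in for the four classifiers listed above, with k as the only tuned parameter, and the data are synthetic; all function names are hypothetical. The point is the structure: the inner folds choose the parameter using the outer training data only, and the outer folds score the tuned model on data that played no part in tuning.

```python
import random


def knn_predict(train, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )[:k]
    return int(2 * sum(label for _, label in nearest) > k)


def accuracy(train, test, k):
    """Fraction of test rows that a k-NN model built on `train` gets right."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(rows, n_folds, rng):
    """Shuffle a copy of `rows` and return `n_folds` (train, held_out) pairs."""
    rows = rows[:]
    rng.shuffle(rows)
    parts = [rows[i::n_folds] for i in range(n_folds)]
    return [
        ([r for j, part in enumerate(parts) if j != i for r in part], parts[i])
        for i in range(n_folds)
    ]


def nested_cv(rows, grid, n_outer=4, n_inner=4, seed=0):
    """Return (chosen k, held-out accuracy) for each outer fold."""
    rng = random.Random(seed)
    results = []
    for outer_train, outer_test in folds(rows, n_outer, rng):
        inner = folds(outer_train, n_inner, rng)

        # Inner loop: mean validation accuracy for one candidate k.
        def inner_score(k):
            return sum(accuracy(tr, va, k) for tr, va in inner) / n_inner

        best_k = max(grid, key=inner_score)
        # Outer loop: score the tuned model on the untouched outer fold.
        results.append((best_k, accuracy(outer_train, outer_test, best_k)))
    return results


# Synthetic one-feature data: class 1 is centred on 1.0, class 0 on 0.0.
data_rng = random.Random(1)
rows = [((data_rng.gauss(label, 0.5),), label) for label in (0, 1) for _ in range(40)]
results = nested_cv(rows, grid=[1, 3, 5])
```

The spread of the outer-fold accuracies gives a less optimistic performance estimate than reusing the same folds for both tuning and assessment.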
In PU learning, U contains both positive and negative cases, which
results in classifier instability. Bagging (bootstrap aggregating) can
improve the performance of unstable classifiers by randomly
resampling P and U with replacement (bootstrap) and then
aggregating the results by majority voting:
– Bagging with 100 iterations was applied to the neural network,
the support vector machine and the gradient boosting machine.
– Random forests are already a special case of bagging.
Tuning, training and testing four classifiers
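A minimal sketch of bagging in the PU setting is given below. The base learner is a toy nearest-centroid rule that stands in for the neural network, support vector machine or gradient boosting machine, and the data points are made up; only the resampling of P and U with replacement and the majority vote reflect the procedure described on the slide.

```python
import random


def centroid_classifier(pos, unl):
    """Toy base learner: assign a point to whichever class mean is closer."""
    dim = len(pos[0])
    mean_p = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mean_u = [sum(x[d] for x in unl) / len(unl) for d in range(dim)]

    def predict(x):
        dist_p = sum((a - b) ** 2 for a, b in zip(x, mean_p))
        dist_u = sum((a - b) ** 2 for a, b in zip(x, mean_u))
        return int(dist_p < dist_u)

    return predict


def bagged_pu_classifier(pos, unl, n_iterations=100, seed=0):
    """Train one base learner per bootstrap resample of P and U, then vote."""
    rng = random.Random(seed)
    models = [
        centroid_classifier(
            rng.choices(pos, k=len(pos)),  # bootstrap: sample with replacement
            rng.choices(unl, k=len(unl)),
        )
        for _ in range(n_iterations)
    ]

    def predict(x):
        votes = sum(model(x) for model in models)
        return int(2 * votes > len(models))  # majority vote; ties go to class 0

    return predict


# Made-up data: the last unlabelled point behaves like a hidden positive.
pos = [(0.9, 0.8), (0.8, 0.7), (1.0, 0.9)]
unl = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2), (0.9, 0.9)]
predict = bagged_pu_classifier(pos, unl)
```

Because each resample contains a different mix of hidden positives from U, the individual models disagree on borderline cases, and the vote averages that variation out.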
14. Evaluating classifier performance
Receiver operating characteristic curves
AUC 0.76
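For reference, the area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting as one half. The sketch below computes it directly from that definition; the scores and labels are made up, and in the PU setting the "negative" class is the sampled unlabelled set.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive-negative pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores and true labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 1, 0]
auc = roc_auc(scores, labels)
```

This pairwise form is quadratic in the number of cases, which is acceptable for a few thousand labelled genes; a sort-based version would be preferable for larger sets.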
15. Disease association evidence higher for more advanced targets
Model predicts late-stage targets more easily than early-stage ones
16. Literature text mining validation of predictions
Highly significant overlap between predictions and text mining results
17. Conclusions
The gene – disease association data from Open Targets contains enough information to predict
whether a protein can make a therapeutic target with reasonable accuracy (71% on the test set).
Aside from standard cross-validation and testing, prediction results were also validated by mining
the scientific literature for therapeutic targets and assessing the significance of the overlap.
The ability of the neural network model to predict late stage targets with greater accuracy
confirms that clear linkage between target and disease is essential to maximise chances of success
in the clinic.
Of the evidence types tested, animal models showing disease-relevant phenotypes, dysregulated
gene expression in disease tissue and genetic associations between gene and disease appear as
the most informative ones.
In silico predictions of novel therapeutic targets using gene – disease association data
18. Acknowledgements
Ian Dunham
Philippe Sanseau
Gautier Koscielny
Giovanni Dall’Olio
Pankaj Agarwal
Mark Hurle
Steven Barrett
Nicola Richmond
Jin Yao
20. Pharmaprojects
An industry-wide drug development database
21. Exploratory data analysis reveals sparse data with little structure
Hierarchical clustering + principal component analysis
22. Tune, train and test classifiers using cross-validation
Decision tree classification criteria
24. Neural network performance on independent test set
Selected classifier with most balanced overall performance for further analyses
Metric                    Cross-validation   Test
Misclassification error   0.303              0.287
Accuracy                  0.697              0.713
AUC                       0.758              0.763
Recall/Sensitivity        0.610              0.638
Specificity               0.785              0.784
Precision                 0.742              0.736
F1 Score                  0.670              0.683
25. Tune, train and test classifiers using cross-validation
Misclassification error
26. Evaluate best classifier performance on test set
Confusion matrices (rows: actual value; columns: prediction outcome)

Cross-validation
                  Predicted Unknown   Predicted Target
Actual Unknown    912                 217
Actual Target     445                 700

Test
                  Predicted Unknown   Predicted Target
Actual Unknown    225                 67
Actual Target     99                  177
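The standard summary metrics can be derived from confusion-matrix counts as sketched below, treating "Target" as the positive class. The counts are transcribed from the matrices above; the function name and dictionary keys are hypothetical. Values recomputed this way may differ slightly from separately reported summary metrics, for example if those were averaged over folds or resamples; the slides do not say how they were obtained.

```python
def binary_metrics(tn, fp, fn, tp):
    """Summary metrics from the four cells of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_error": (fp + fn) / total,
        "recall": recall,                # sensitivity
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Rows are actual values, columns are predictions ("Target" = positive).
cv_metrics = binary_metrics(tn=912, fp=217, fn=445, tp=700)
test_metrics = binary_metrics(tn=225, fp=67, fn=99, tp=177)
```

The row totals also serve as a sanity check on the split: the cross-validation matrix sums to 2274 cases and the test matrix to 568.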
27. Split into training, test and prediction sets
Assess the effect of randomly sampling from unlabelled class: Monte Carlo simulation
28. Tune, train and test classifiers using cross-validation
Precision recall curves
29. Tune, train and test classifiers using cross-validation
Overlap between predictions on training set
[Figure panels: Predicted targets; Predicted non-targets]
30. Targets with lower disease association fail more often
Majority of targets with discontinued programmes not predicted as targets
31. Generating predictions on remaining 15K genes
Run model on prediction set (not used for training/testing)
32. Validate with literature text mining
Assess the significance of the literature-based validation: permutation test
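One common way to set up such a permutation test is sketched below. It is an assumption about the design, not a description of the analysis actually run: the null distribution is built by drawing random gene sets of the same size as the predicted set from the prediction universe and counting how often their overlap with the text-mined targets is at least as large as the observed one. Gene identifiers and set sizes are made up.

```python
import random


def overlap_permutation_test(predicted, text_mined, universe,
                             n_permutations=1000, seed=0):
    """Empirical p-value for the overlap between predictions and text-mined targets."""
    rng = random.Random(seed)
    text_mined = set(text_mined)
    observed = len(set(predicted) & text_mined)
    universe = list(universe)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Null model: a random gene set of the same size as the predictions.
        random_set = rng.sample(universe, len(predicted))
        if len(text_mined.intersection(random_set)) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Made-up example: 1000 genes, 50 text-mined targets, 100 predictions,
# 40 of which are also text-mined targets.
universe = list(range(1000))
text_mined = list(range(50))
predicted = list(range(40)) + list(range(900, 960))
observed, p_value = overlap_permutation_test(predicted, text_mined, universe)
```

With the add-one correction the smallest reportable p-value is 1 / (n_permutations + 1), so the number of permutations limits how significant a result can appear.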