SlideShare una empresa de Scribd logo
1 de 1
Created by: Digital Media Services
Computational prediction of novel drug targets
using gene – disease association data
Background
Target identification and validation is a pressing challenge in the
pharmaceutical industry, with many of the programmes that fail for
efficacy reasons showing poor association between the drug target
and the disease [1].
Hence, the selection of the right target for the right disease in early
discovery phases is a key decision to maximise the chances of success
in the clinic and ensure a sustainable business in the longer term.
The Open Targets Target Validation Platform [2] aims to bridge the
gap between targets and diseases by collecting different types of
evidence that can link the two entities, such as genetic or expression
data.
In this study, we take advantage of the Open Targets platform using a
semi-supervised approach on positive and unlabelled data to assess
whether the disease association evidence that it contains can be used
to make de novo predictions of potential therapeutic targets (Fig. 1).
We test the information content of five types of evidence connecting
genes with diseases (pathways, animal models, somatic and germline
genetics and RNA expression) and evaluate different classification
algorithms for predicting new drug targets.
Enrico Ferrero1, Ian Dunham2,3 and Philippe Sanseau1,3
1 Computational Biology, Target Sciences, GSK, Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, UK
2 European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
3 Open Targets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Fig. 1. Framework for prediction of
therapeutic targets based on
disease association data.
Observations and features were
gathered from Open Targets while
the labels were collected from
Citeline PharmaProjects. The
resulting data matrix was split into
a prediction set and a working
dataset. The former was kept aside
and used to perform predictions
once the model was established.
The latter, containing both positive
and unlabelled observations, was
further split into training and test
sets to train the classifiers and
evaluate their performance.
Fig. 2. Exploratory data analysis
using dimensionality reduction.
The t-SNE algorithm for non-linear
dimensionality reduction was run
with default parameters. Each dot
in the two-dimensional space
represents a gene and is coloured
according to its label (green: target,
purple: non-target).
Assessing feature importance and classification criteria
We set out to investigate the contributions of all five individual data
types and their relative importance in the dataset by using two
methods that are normally used to rank predictors in order of
importance in feature selection workflows: the Chi-squared test and
the information gain method. Regardless of the method utilised, we
observed animal_model, rna_expression and genetic_association
showing the highest values and association with the outcome variable
(Fig.3).
Fig. 3. Feature importance. Feature
importance according to three
independent feature selection
methods (left to right): Kruskal-
Wallis test, chi-squared test and
information gain.
To test our hypothesis that the features described above would be
prioritised by a learning algorithm, we trained a classification tree on
a subset of the data – the training set – corresponding to 80% of the
observations. We then explored the classification criteria of the model
to get insights into the features the algorithm uses to make
classification decisions.
In line with the feature importance results described above, we found
that animal_model was the first node in the tree, with rna_expression
and genetic_association also required for target classification (Fig. 4).
Fig. 4. Decision tree classification
criteria. Colours represent
predicted outcome (purple: non-
target, green: target). In each node,
numbers represent (from top to
bottom): outcome (0: non-target, 1:
target), number of observations in
node per class (left: non-target,
right: target), percentage of
observations in node.
Different learning algorithms can predict therapeutic targets with
good accuracy
We explored four algorithms that are generally known to achieve
good performance across a number of settings: a random forest, a
support vector machine (SVM) with a radial kernel, a feed-forward
neural network and a gradient boosting machine using the adaboost
algorithm.
The performance of the four algorithms was benchmarked using 10-
fold cross-validation on the training set. Interestingly, we observed
the four methods to have broadly similar performance, with no
classifier clearly outperforming the others. The receiver operating
characteristic (ROC) curves showed substantial similarity at different
thresholds of the true positive and false positive rates (Fig. 5).
Fig. 5. Receiver operating
characteristic curves. The
performance of four classifiers
algorithms was estimated using 10-
fold cross-validation on the training
set.
We then calculated standard performance measures including the
area under the curve (AUC), accuracy, precision, recall/sensitivity,
specificity and F1 score. Overall, we observed that all algorithms had
good accuracy, AUC, precision and specificity. The recall/sensitivity
was found to be somewhat lower, possibly highlighting a limitation in
identifying true positives or simply reflecting the fact that a number
of the unlabelled observations are actually positives (Fig 6).
Fig. 6 Performance measures.
Standard performance measures
for the four classifiers were
estimated using 10-fold cross-
validation on the training set.
Literature text mining validates predictions of novel targets
To independently test the validity of our predictions, we searched for
occurrences of a gene or protein being flagged as a therapeutic target in
titles and abstracts on MEDLINE and calculated the overlap with the
random forest predictions (Fig. 6). We found 551 genes in common
between the two sets, a highly significant proportion as assessed by
Fisher’s exact test (p-value = 1.21e-133, odds ratio = 4.68).
These results serve as an external source of validation of the approach
described here and demonstrate that the types of disease-association
data used in this study can be predictive of therapeutic targets with good
accuracy.
Using a random forest classifier to predict drug targets based on disease
association data
Based on the benchmark results, we selected the random forest classifier
as the algorithm with the most balanced performance overall and
evaluated its performance on an independent test set that was not
previously fed to the learning algorithm.
We found the performance to be very consistent with the training set
cross-validation results, indicating that our model did not overfit the
training data.
The model achieved an AUC of 0.76 and accuracy above 70%, meaning
that less than 30% of the observations were misclassified. The model
featured good precision (0.73) and excellent specificity (0.77); on the
other hand, the recall/sensitivity and the F1 score were lower, but still
well above 60% (0.64 and 0.68, respectively).
Fig. 8. Validation of target
predictions using the scientific
literature. The Venn diagram shows
the overlap between the predicted
targets from the random forest
classifier and those retrieved from
the literature through text mining.
Results
Targets and non-targets appear as distinct on a two-dimensional
space
We carried out an exploratory analysis on the working dataset using
unsupervised methods to understand whether targets and non-
targets had different characteristics.
Using the t-distributed stochastic neighbour embedding method (t-
SNE) resulted in a rather clear separation between targets and non-
targets on a two-dimensional space: most of the data points labelled
as targets lie on a curved line and a gradient of targets and non-
targets is present along the vertical axis (Fig. 2).
The known targets utilised in the training and test set were further
explored at this stage. We set out to understand whether there was
any difference between those that were predicted as targets and
those predicted as non-targets based on how advanced they were in
the drug discovery pipeline (Fig. 7).
We observed that many of the targets correctly identified as such
belonged to drugs currently on the market (35.8%), while the
proportion of launched drugs for predicted non-targets was much
lower (20.4%). A similar trend was observed for programmes
currently in phase II and III clinical trials. We confirmed this by using
logistic regression and found significant differences for the following
stages: launched (p-value = 4.65e-21), pre-registration (p-value =
0.01), phase III clinical trial (p-value = 1.09e-5) and phase II clinical
trial (p-value = 9.38e-5).
Fig. 7. Distribution of targets across
different stages of the drug
discovery pipeline. Known drug
targets predicted as non-targets
(purple) or targets (green) during
training and testing.
Conclusions
We have presented a machine learning approach that is able to make
accurate predictions of therapeutic targets based on the gene –
disease association data present in the Open Targets Target Validation
Platform, demonstrating that disease association is predictive of the
ability of a gene or a protein to work as a drug target.
These findings provide the first formal proof that drug targets can be
predicted using solely disease association data and strengthen the
hypothesis that establishing unambiguous causative links between
putative targets and diseases is of paramount importance to maximise
the chances of success of drug discovery programmes, as previously
reported [1,3].
We believe that, as an industry, we need to focus on clear-cut and
unambiguous evidence linking genes and diseases to maximise our
chances of success; in particular, animal model, genetic and gene
expression evidence should be among the data types driving the
target discovery process.
Finally, we welcome initiatives aimed at the comprehensive
annotation of gene – disease relationships such as Open Targets that
have a real potential to catalyse a more forward-looking and data-
driven target discovery process in the years ahead.
References
[1] Cook D, Brown D, Alexander R, March R, Morgan P, Satterthwaite
G, et al. Lessons learned from the fate of AstraZeneca’s drug pipeline:
a five-dimensional framework. Nat. Rev. Drug Discov.
[2] Open Targets Target Validation Platform
https://www.targetvalidation.org
[3] Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, et al.
The support of human genetic evidence for approved drug
indications. Nat. Genet.
Acknowledgements
We would like to thank Gautier Koscielny and Giovanni Dall’Olio for
enabling or facilitating access to data and software. We are grateful to
Jin Yao and Marc Claesen for useful discussions on the topic of
machine learning. Finally, we thank Pankaj Agarwal, Mark Hurle,
Steven Barrett and Nicola Richmond for their helpful comments.

Más contenido relacionado

La actualidad más candente

Artificial Intelligence for Discovery
Artificial Intelligence for DiscoveryArtificial Intelligence for Discovery
Artificial Intelligence for DiscoveryDayOne
 
AI applications in life sciences - drug development
AI applications in life sciences - drug developmentAI applications in life sciences - drug development
AI applications in life sciences - drug developmentJayanthi Repalli, PhD
 
SMi Group's AI in Drug Discovery 2020 conference
SMi Group's AI in Drug Discovery 2020 conferenceSMi Group's AI in Drug Discovery 2020 conference
SMi Group's AI in Drug Discovery 2020 conferenceDale Butler
 
Ai in drug discovery and drug development
Ai in drug discovery and drug developmentAi in drug discovery and drug development
Ai in drug discovery and drug developmentSRUTHI N
 
Very brief overview of AI in drug discovery
Very brief overview of AI in drug discoveryVery brief overview of AI in drug discovery
Very brief overview of AI in drug discoveryDr. Gerry Higgins
 
Artificial intelligence in drug discovery
Artificial intelligence in drug discoveryArtificial intelligence in drug discovery
Artificial intelligence in drug discoveryRAVINDRABABUKOPPERA
 
Overcoming obstacles to repurposing for neurodegenerative disease
Overcoming obstacles to repurposing for neurodegenerative diseaseOvercoming obstacles to repurposing for neurodegenerative disease
Overcoming obstacles to repurposing for neurodegenerative diseaseLona Vincent
 
Rmc phenotypic screening
Rmc phenotypic screeningRmc phenotypic screening
Rmc phenotypic screeningAnn-Marie Roche
 
Role of AI in Drug Discovery and Development
Role of AI in  Drug Discovery and DevelopmentRole of AI in  Drug Discovery and Development
Role of AI in Drug Discovery and DevelopmentDr. Manu Kumar Shetty
 
Bayesian estimations of strong toxic signals [compatibility mode]
Bayesian estimations of strong toxic signals [compatibility mode]Bayesian estimations of strong toxic signals [compatibility mode]
Bayesian estimations of strong toxic signals [compatibility mode]Bhaswat Chakraborty
 
Combination of informative biomarkers in small pilot studies and estimation ...
Combination of informative  biomarkers in small pilot studies and estimation ...Combination of informative  biomarkers in small pilot studies and estimation ...
Combination of informative biomarkers in small pilot studies and estimation ...LEGATO project
 
Sample size & meta analysis
Sample size & meta analysisSample size & meta analysis
Sample size & meta analysisdrsrb
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesJosef Scheiber
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkStats Statswork
 
BioVariance - Pediatric Pharmacogenomics in Drug Discovery
BioVariance - Pediatric Pharmacogenomics in Drug DiscoveryBioVariance - Pediatric Pharmacogenomics in Drug Discovery
BioVariance - Pediatric Pharmacogenomics in Drug DiscoveryJosef Scheiber
 
Computational prediction of antimicrobial peptide activity
Computational prediction of antimicrobial peptide activityComputational prediction of antimicrobial peptide activity
Computational prediction of antimicrobial peptide activityThet Su Win
 
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cPredicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cDamian R. Mingle, MBA
 
Data mining methodologies for pharmacovigilance
Data mining methodologies for pharmacovigilanceData mining methodologies for pharmacovigilance
Data mining methodologies for pharmacovigilanceAbdelfattah Al Zaqqa
 

La actualidad más candente (20)

Artificial Intelligence for Discovery
Artificial Intelligence for DiscoveryArtificial Intelligence for Discovery
Artificial Intelligence for Discovery
 
Discovery_Schreiner
Discovery_SchreinerDiscovery_Schreiner
Discovery_Schreiner
 
AI applications in life sciences - drug development
AI applications in life sciences - drug developmentAI applications in life sciences - drug development
AI applications in life sciences - drug development
 
SMi Group's AI in Drug Discovery 2020 conference
SMi Group's AI in Drug Discovery 2020 conferenceSMi Group's AI in Drug Discovery 2020 conference
SMi Group's AI in Drug Discovery 2020 conference
 
Ai in drug discovery and drug development
Ai in drug discovery and drug developmentAi in drug discovery and drug development
Ai in drug discovery and drug development
 
Very brief overview of AI in drug discovery
Very brief overview of AI in drug discoveryVery brief overview of AI in drug discovery
Very brief overview of AI in drug discovery
 
Artificial intelligence in drug discovery
Artificial intelligence in drug discoveryArtificial intelligence in drug discovery
Artificial intelligence in drug discovery
 
Overcoming obstacles to repurposing for neurodegenerative disease
Overcoming obstacles to repurposing for neurodegenerative diseaseOvercoming obstacles to repurposing for neurodegenerative disease
Overcoming obstacles to repurposing for neurodegenerative disease
 
Rmc phenotypic screening
Rmc phenotypic screeningRmc phenotypic screening
Rmc phenotypic screening
 
Role of AI in Drug Discovery and Development
Role of AI in  Drug Discovery and DevelopmentRole of AI in  Drug Discovery and Development
Role of AI in Drug Discovery and Development
 
Bayesian estimations of strong toxic signals [compatibility mode]
Bayesian estimations of strong toxic signals [compatibility mode]Bayesian estimations of strong toxic signals [compatibility mode]
Bayesian estimations of strong toxic signals [compatibility mode]
 
Combination of informative biomarkers in small pilot studies and estimation ...
Combination of informative  biomarkers in small pilot studies and estimation ...Combination of informative  biomarkers in small pilot studies and estimation ...
Combination of informative biomarkers in small pilot studies and estimation ...
 
Sample size & meta analysis
Sample size & meta analysisSample size & meta analysis
Sample size & meta analysis
 
Paras new CV`..
Paras new CV`..Paras new CV`..
Paras new CV`..
 
Big Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use CasesBig Data in Pharma - Overview and Use Cases
Big Data in Pharma - Overview and Use Cases
 
How to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - StatsworkHow to establish and evaluate clinical prediction models - Statswork
How to establish and evaluate clinical prediction models - Statswork
 
BioVariance - Pediatric Pharmacogenomics in Drug Discovery
BioVariance - Pediatric Pharmacogenomics in Drug DiscoveryBioVariance - Pediatric Pharmacogenomics in Drug Discovery
BioVariance - Pediatric Pharmacogenomics in Drug Discovery
 
Computational prediction of antimicrobial peptide activity
Computational prediction of antimicrobial peptide activityComputational prediction of antimicrobial peptide activity
Computational prediction of antimicrobial peptide activity
 
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1cPredicting Diabetic Readmission Rates: Moving Beyond HbA1c
Predicting Diabetic Readmission Rates: Moving Beyond HbA1c
 
Data mining methodologies for pharmacovigilance
Data mining methodologies for pharmacovigilanceData mining methodologies for pharmacovigilance
Data mining methodologies for pharmacovigilance
 

Similar a Computational prediction of novel drug targets using gene disease association data

AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISijcsit
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAIRCC Publishing Corporation
 
IRJET - E-Health Chain and Anticipation of Future Disease
IRJET - E-Health Chain and Anticipation of Future DiseaseIRJET - E-Health Chain and Anticipation of Future Disease
IRJET - E-Health Chain and Anticipation of Future DiseaseIRJET Journal
 
Statistics in meta analysis
Statistics in meta analysisStatistics in meta analysis
Statistics in meta analysisDr Shri Sangle
 
METHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm D
METHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm DMETHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm D
METHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm DDrpradeepthi
 
Biostatistics clinical research & trials
Biostatistics clinical research & trialsBiostatistics clinical research & trials
Biostatistics clinical research & trialseclinicaltools
 
A Research Study On Alcohol Abuse
A Research Study On Alcohol AbuseA Research Study On Alcohol Abuse
A Research Study On Alcohol AbuseLynn Holkesvik
 
Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
 
Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...Jitender Grover
 
A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...IEEEFINALYEARPROJECTS
 
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET Journal
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine LearningIRJET Journal
 
A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...JPINFOTECH JAYAPRAKASH
 
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...IRJET Journal
 
Cancer prognosis prediction using balanced stratified sampling
Cancer prognosis prediction using balanced stratified samplingCancer prognosis prediction using balanced stratified sampling
Cancer prognosis prediction using balanced stratified samplingijscai
 
Chronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningChronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningIJCSIS Research Publications
 
Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Dimitris Papadopoulos
 

Similar a Computational prediction of novel drug targets using gene disease association data (20)

AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
 
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSISAN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
AN ALGORITHM FOR PREDICTIVE DATA MINING APPROACH IN MEDICAL DIAGNOSIS
 
Propensity Scores in Medical Device Trials
Propensity Scores in Medical Device TrialsPropensity Scores in Medical Device Trials
Propensity Scores in Medical Device Trials
 
IRJET - E-Health Chain and Anticipation of Future Disease
IRJET - E-Health Chain and Anticipation of Future DiseaseIRJET - E-Health Chain and Anticipation of Future Disease
IRJET - E-Health Chain and Anticipation of Future Disease
 
Statistics in meta analysis
Statistics in meta analysisStatistics in meta analysis
Statistics in meta analysis
 
METHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm D
METHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm DMETHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm D
METHODS OF CAUSALITY ASSESMENT.@ Clinical Pharmacy 4th Pharm D
 
Biostatistics clinical research & trials
Biostatistics clinical research & trialsBiostatistics clinical research & trials
Biostatistics clinical research & trials
 
The Lachman Test
The Lachman TestThe Lachman Test
The Lachman Test
 
A Research Study On Alcohol Abuse
A Research Study On Alcohol AbuseA Research Study On Alcohol Abuse
A Research Study On Alcohol Abuse
 
Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...Multivariate sample similarity measure for feature selection with a resemblan...
Multivariate sample similarity measure for feature selection with a resemblan...
 
Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...Hybrid prediction model with missing value imputation for medical data 2015-g...
Hybrid prediction model with missing value imputation for medical data 2015-g...
 
A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...
 
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes ClassifierIRJET- The Prediction of Heart Disease using Naive Bayes Classifier
IRJET- The Prediction of Heart Disease using Naive Bayes Classifier
 
IRJET- Disease Prediction using Machine Learning
IRJET-  Disease Prediction using Machine LearningIRJET-  Disease Prediction using Machine Learning
IRJET- Disease Prediction using Machine Learning
 
A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...A method for mining infrequent causal associations and its application in fin...
A method for mining infrequent causal associations and its application in fin...
 
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
IRJET- Genetic Algorithm for Feature Selection to Improve Heart Disease Predi...
 
Cancer prognosis prediction using balanced stratified sampling
Cancer prognosis prediction using balanced stratified samplingCancer prognosis prediction using balanced stratified sampling
Cancer prognosis prediction using balanced stratified sampling
 
Advance concepts in screening
Advance concepts in screeningAdvance concepts in screening
Advance concepts in screening
 
Chronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine LearningChronic Kidney Disease Prediction Using Machine Learning
Chronic Kidney Disease Prediction Using Machine Learning
 
Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis) Drug Target Interaction (DTI) prediction (MSc. thesis)
Drug Target Interaction (DTI) prediction (MSc. thesis)
 

Último

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 

Último (20)

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 

Computational prediction of novel drug targets using gene disease association data

  • 1. Created by: Digital Media Services Computational prediction of novel drug targets using gene – disease association data Background Target identification and validation is a pressing challenge in the pharmaceutical industry, with many of the programmes that fail for efficacy reasons showing poor association between the drug target and the disease [1]. Hence, the selection of the right target for the right disease in early discovery phases is a key decision to maximise the chances of success in the clinic and ensure a sustainable business in the longer term. The Open Targets Target Validation Platform [2] aims to bridge the gap between targets and diseases by collecting different types of evidence that can link the two entities, such as genetic or expression data. In this study, we take advantage of the Open Targets platform using a semi-supervised approach on positive and unlabelled data to assess whether the disease association evidence that it contains can be used to make de novo predictions of potential therapeutic targets (Fig. 1). We test the information content of five types of evidence connecting genes with diseases (pathways, animal models, somatic and germline genetics and RNA expression) and evaluate different classification algorithms for predicting new drug targets. Enrico Ferrero1, Ian Dunham2,3 and Philippe Sanseau1,3 1 Computational Biology, Target Sciences, GSK, Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, UK 2 European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK 3 Open Targets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK Fig. 1. Framework for prediction of therapeutic targets based on disease association data. Observations and features were gathered from Open Targets while the labels were collected from Citeline PharmaProjects. The resulting data matrix was split into a prediction set and a working dataset. The former was kept aside and used to perform predictions once the model was established. The latter, containing both positive and unlabelled observations, was further split into training and test sets to train the classifiers and evaluate their performance. Fig. 2. Exploratory data analysis using dimensionality reduction. The t-SNE algorithm for non-linear dimensionality reduction was run with default parameters. Each dot in the two-dimensional space represents a gene and is coloured according to its label (green: target, purple: non-target). Assessing feature importance and classification criteria We set out to investigate the contributions of all five individual data types and their relative importance in the dataset by using two methods that are normally used to rank predictors in order of importance in feature selection workflows: the Chi-squared test and the information gain method. Regardless of the method utilised, we observed animal_model, rna_expression and genetic_association showing the highest values and association with the outcome variable (Fig.3). Fig. 3. Feature importance. Feature importance according to three independent feature selection methods (left to right): Kruskal- Wallis test, chi-squared test and information gain. To test our hypothesis that the features described above would be prioritised by a learning algorithm, we trained a classification tree on a subset of the data – the training set – corresponding to 80% of the observations. We then explored the classification criteria of the model to get insights into the features the algorithm uses to make classification decisions. In line with the feature importance results described above, we found that animal_model was the first node in the tree, with rna_expression and genetic_association also required for target classification (Fig. 4). Fig. 4. Decision tree classification criteria. Colours represent predicted outcome (purple: non- target, green: target). In each node, numbers represent (from top to bottom): outcome (0: non-target, 1: target), number of observations in node per class (left: non-target, right: target), percentage of observations in node. Different learning algorithms can predict therapeutic targets with good accuracy We explored four algorithms that are generally known to achieve good performance across a number of settings: a random forest, a support vector machine (SVM) with a radial kernel, a feed-forward neural network and a gradient boosting machine using the adaboost algorithm. The performance of the four algorithms was benchmarked using 10- fold cross-validation on the training set. Interestingly, we observed the four methods to have broadly similar performance, with no classifier clearly outperforming the others. The receiver operating characteristic (ROC) curves showed substantial similarity at different thresholds of the true positive and false positive rates (Fig. 5). Fig. 5. Receiver operating characteristic curves. The performance of four classifiers algorithms was estimated using 10- fold cross-validation on the training set. We then calculated standard performance measures including the area under the curve (AUC), accuracy, precision, recall/sensitivity, specificity and F1 score. Overall, we observed that all algorithms had good accuracy, AUC, precision and specificity. The recall/sensitivity was found to be somewhat lower, possibly highlighting a limitation in identifying true positives or simply reflecting the fact that a number of the unlabelled observations are actually positives (Fig 6). Fig. 6 Performance measures. Standard performance measures for the four classifiers were estimated using 10-fold cross- validation on the training set. Literature text mining validates predictions of novel targets To independently test the validity of our predictions, we searched for occurrences of a gene or protein being flagged as a therapeutic target in titles and abstracts on MEDLINE and calculated the overlap with the random forest predictions (Fig. 6). We found 551 genes in common between the two sets, a highly significant proportion as assessed by Fisher’s exact test (p-value = 1.21e-133, odds ratio = 4.68). These results serve as an external source of validation of the approach described here and demonstrate that the types of disease-association data used in this study can be predictive of therapeutic targets with good accuracy. Using a random forest classifier to predict drug targets based on disease association data Based on the benchmark results, we selected the random forest classifier as the algorithm with the most balanced performance overall and evaluated its performance on an independent test set that was not previously fed to the learning algorithm. We found the performance to be very consistent with the training set cross-validation results, indicating that our model did not overfit the training data. The model achieved an AUC of 0.76 and accuracy above 70%, meaning that less than 30% of the observations were misclassified. The model featured good precision (0.73) and excellent specificity (0.77); on the other hand, the recall/sensitivity and the F1 score were lower, but still well above 60% (0.64 and 0.68, respectively). Fig. 8. Validation of target predictions using the scientific literature. The Venn diagram shows the overlap between the predicted targets from the random forest classifier and those retrieved from the literature through text mining. Results Targets and non-targets appear as distinct on a two-dimensional space We carried out an exploratory analysis on the working dataset using unsupervised methods to understand whether targets and non- targets had different characteristics. Using the t-distributed stochastic neighbour embedding method (t- SNE) resulted in a rather clear separation between targets and non- targets on a two-dimensional space: most of the data points labelled as targets lie on a curved line and a gradient of targets and non- targets is present along the vertical axis (Fig. 2). The known targets utilised in the training and test set were further explored at this stage. We set out to understand whether there was any difference between those that were predicted as targets and those predicted as non-targets based on how advanced they were in the drug discovery pipeline (Fig. 7). We observed that many of the targets correctly identified as such belonged to drugs currently on the market (35.8%), while the proportion of launched drugs for predicted non-targets was much lower (20.4%). A similar trend was observed for programmes currently in phase II and III clinical trials. We confirmed this by using logistic regression and found significant differences for the following stages: launched (p-value = 4.65e-21), pre-registration (p-value = 0.01), phase III clinical trial (p-value = 1.09e-5) and phase II clinical trial (p-value = 9.38e-5). Fig. 7. Distribution of targets across different stages of the drug discovery pipeline. Known drug targets predicted as non-targets (purple) or targets (green) during training and testing. Conclusions We have presented a machine learning approach that is able to make accurate predictions of therapeutic targets based on the gene – disease association data present in the Open Targets Target Validation Platform, demonstrating that disease association is predictive of the ability of a gene or a protein to work as a drug target. These findings provide the first formal proof that drug targets can be predicted using solely disease association data and strengthen the hypothesis that establishing unambiguous causative links between putative targets and diseases is of paramount importance to maximise the chances of success of drug discovery programmes, as previously reported [1,3]. We believe that, as an industry, we need to focus on clear-cut and unambiguous evidence linking genes and diseases to maximise our chances of success; in particular, animal model, genetic and gene expression evidence should be among the data types driving the target discovery process. Finally, we welcome initiatives aimed at the comprehensive annotation of gene – disease relationships such as Open Targets that have a real potential to catalyse a more forward-looking and data- driven target discovery process in the years ahead. References [1] Cook D, Brown D, Alexander R, March R, Morgan P, Satterthwaite G, et al. Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat. Rev. Drug Discov. [2] Open Targets Target Validation Platform https://www.targetvalidation.org [3] Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, et al. The support of human genetic evidence for approved drug indications. Nat. Genet. Acknowledgements We would like to thank Gautier Koscielny and Giovanni Dall’Olio for enabling or facilitating access to data and software. We are grateful to Jin Yao and Marc Claesen for useful discussions on the topic of machine learning. Finally, we thank Pankaj Agarwal, Mark Hurle, Steven Barrett and Nicola Richmond for their helpful comments.