All scientific disciplines, including medicinal chemistry, are generating data at unprecedented rates, and the analysis and exploitation of this data is increasingly fundamental to innovation. Using data to design better compounds is a challenge for medicinal and computational chemists.
The design of small-molecule drug candidates, encompassing characteristics such as potency, selectivity and ADMET (absorption, distribution, metabolism, excretion and toxicity), is a key factor in the success of clinical trials. Computer-aided drug discovery/design methods have played a major role in the development of therapeutically important small molecules for over three decades. These methods are broadly classified as either structure-based or ligand-based.
In this webinar our expert Dr. Olivier Barberan will discuss ligand-based methods, covering the following:
- How to use only ligand information to predict activity, based on a compound's similarity/dissimilarity to previously known active ligands.
- Ligand-based pharmacophores, molecular descriptors, and quantitative structure-activity relationships, as well as important tools such as the target/ligand databases needed to implement computer-aided drug discovery/design methods successfully in a drug discovery campaign.
How predictive models help medicinal chemists design better drugs (webinar)
1. Olivier Barberan
Senior Product Manager
Data-driven drug discovery: how predictive models help medicinal chemists design better drugs.
2. Putting Data to Work |
Today’s discussion
• Elsevier is a partner and vendor to 100% of the world’s top
pharmaceutical companies
• It provides leading data solutions in chemistry, biology, drug safety
and biomedical review
• Our workflow solutions are based on clearly defined use cases
BUT as with many data-driven companies, our data is used for specific
purposes via different user interfaces.
Our Challenge: How to externalize data for:
• Complementary data integration
• Bioactivity prediction
• Use in additional use cases and workflows
3. The challenge pharma faces: investing in earlier development stages builds up the pipeline and reduces attrition from foreseeable causes
Pipeline stages:
• Target ID & Validation: characterize & understand disease
• Lead ID & Validation: identify, design & validate leads
• Pre-clinical: cull/prioritize leads
• Clinical (Phase I to III): determine safety and efficacy profile
• Post-Launch: manage risk & compliance; improve patient care
Cost (/NME), in slide order: $125 M, $773 M, $200 M, $1,460 M, $3–5 B
Source: Tufts Center for the Study of Drug Development, Nov 2014
“We cannot fail for reasons we could have predicted. We
should fail only for reasons we could not predict.”
—Dr Moncef Slaoui
Head of Global R&D, GSK
• Low margin of safety is a major
cause of attrition in Phase I and II
• Lack of efficacy is a major cause of
attrition in Phase II and III
Better-informed decisions at the Lead ID & Validation stage generate more optimized leads and mitigate failures and misinvestments
Success rate, in slide order: 80%, 65%, 69%, 12%
“Usability & Re-usability of data is critical to our success. We are experts at capturing data for purpose, and poor at making that data available for any purpose”
—Dr M. Ramsey,
Chief Data Officer, GSK
4. Excerption of facts – how Reaxys Medicinal Chemistry handles the literature
Excerption of facts and indexing of literature and patents are done by pharmacologists for medicinal chemists.
Record types: substance record, bibliographic record, bioactivity record
Processes: excerption, indexing
6. Hit Identification Lead gen. and optimization
Drug discovery involves different personas and needs
Medicinal chemists: “I need to optimize chemical structures in order to improve affinity, selectivity and ADMET properties and to decrease side effects”
Computational chemists:
“I need to find hits for druggable targets”
(virtual screening)
Integration /
Professional
Services
In house data
9. Ligand-based virtual screening – using Reaxys Medicinal Chemistry
Objective
• Describe an In Silico Screening approach
using Reaxys Medicinal Chemistry
Case Study on T-Type calcium channels
10. Ligand-based in silico screening
• Query: 730 query structures
• Screened library: 28 M substances (chemical space based on synthesized substances)
• Representation & chemical space: molecular descriptors & fingerprints
• Virtual screening: pharmacophoric similarity
• Tool: Pipeline Pilot
• Result: 314 hits
• "Drug-like" filtering, then selection on:
1. Molecular diversity and chemical originality
2. Compound availability
• 39 compounds ordered for testing
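The screening step above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide names pharmacophoric similarity in Pipeline Pilot, whereas the sketch swaps in a plain Tanimoto coefficient on hypothetical fingerprint bit sets. The names `tanimoto` and `screen`, the example fingerprints and the 0.7 threshold are all invented for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def screen(queries, library, threshold):
    """Keep library compounds whose best similarity to any query meets the threshold."""
    hits = []
    for name, fingerprint in library.items():
        best = max(tanimoto(fingerprint, query) for query in queries)
        if best >= threshold:
            hits.append((name, best))
    # Most similar compounds first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: each set holds the indices of the bits that are set.
queries = [frozenset({1, 4, 7, 9}), frozenset({2, 4, 7})]
library = {
    "cand_A": frozenset({1, 4, 7, 9, 12}),
    "cand_B": frozenset({3, 5, 8}),
    "cand_C": frozenset({2, 4, 7, 11}),
}

hits = screen(queries, library, threshold=0.7)  # assumed cut-off
for name, score in hits:
    print(f"{name}: best similarity {score:.2f}")
```

In a real campaign the fingerprints would come from a cheminformatics toolkit, and the hit list would then pass through the drug-likeness, diversity and availability filters listed on the slide.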
11. Biological testing of the 39 hits on Cav3.2
Electrophysiology experiments: Screening @10 µM
on Cav3.2 T-Type channels
[Bar chart: peak current inhibition (%) for each compound # (1–39), screened at 10 µM]
• 9 compounds with inhibition > 75%
• 15 compounds with inhibition > 50%
Reaxys Medicinal Chemistry supports Ligand-based virtual
screening approaches
12. Lead optimization: predictive modeling of on- and off-target bioactivities
13. Challenge: Make best decisions based on data
Issue
• Need to make the best possible decisions to rank/decide on drug candidates
• Use all available information to help make decisions about potential therapeutics
Method
• Build hundreds of models for on- and off-target activities
• Create a model-building engine in KNIME that can make models for large lists of targets using data from Reaxys Medicinal Chemistry.
• Selectivity panels.
• Toxicology pathways (from pathway analysis).
• PK/ADME models
14. Elements of the model making workflow
• Issue: some targets have thousands of measurements
• Solution: randomly select N examples from each decade of pX activity
• Selecting from each decade provides a more balanced data set across the activity range
• In this study only IC50 measurements are used, not % inhibition or Kd
• N is from 50 to 200 in these examples.
• Compounds are selected so that training and test sets have no overlap
• Both sets are selected at random
• For each compound the median pX value is chosen when multiple measurements are reported
• pX = -log10(molar activity)
• Partial-least-squares correlation is used, selecting the optimal number of descriptor components
• Model quality is assessed
• I like RMS error of prediction because it most directly answers the question “how accurate do I expect the prediction to be?”
• r2, F and other statistics are also computed.
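Two of the data-preparation steps above, the median pX per compound and the non-overlapping random split, can be sketched as follows. This is a minimal illustration and not the KNIME workflow itself: the function names, the example IC50 values, the 25% test fraction and the fixed seed are all assumptions made for the example.

```python
import math
import random
from statistics import median


def to_px(activity_molar: float) -> float:
    """pX = -log10(activity in mol/L); an IC50 of 1e-7 M gives pX = 7."""
    return -math.log10(activity_molar)


def median_px_per_compound(measurements):
    """Collapse repeated (compound_id, IC50 in M) reports to one median pX each."""
    by_compound = {}
    for compound_id, ic50_molar in measurements:
        by_compound.setdefault(compound_id, []).append(to_px(ic50_molar))
    return {cid: median(values) for cid, values in by_compound.items()}


def split_train_test(compound_ids, test_fraction=0.25, seed=0):
    """Randomly split compound ids so that no compound is in both sets."""
    ids = sorted(compound_ids)          # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]   # (training ids, test ids)


# Hypothetical IC50 reports in mol/L; cpd1 was measured three times.
measurements = [
    ("cpd1", 1e-6), ("cpd1", 1e-8), ("cpd1", 1e-7),
    ("cpd2", 1e-5), ("cpd3", 2e-7), ("cpd4", 5e-9),
]
px = median_px_per_compound(measurements)
train_ids, test_ids = split_train_test(px)
print(px["cpd1"], len(train_ids), len(test_ids))
```

Splitting on compound ids after the median step, rather than on individual measurements, is what keeps a compound with several reports from leaking into both sets.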
15. Random Selection to ‘Flatten’ Activity Distribution
[Two histograms: # of compounds per pX decade, -log(M) from 1 to 10, before and after selection]
• Before: raw reports – may be thousands for a popular target. Activities and structure distribution may skew statistics
• Selection: randomly selected ‘N’ compounds per decade
• Result: diverse set of structures and more uniform distribution of activities
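The flattening step can be sketched as a per-decade random draw. This is a minimal sketch under assumptions: a decade is taken to be the integer part of pX (so pX 6.0 to 6.99 is one bin), and the function name, seed and synthetic data are illustrative only.

```python
import math
import random


def sample_per_decade(px_by_compound, n_per_decade, seed=0):
    """Pick at most n_per_decade compounds at random from each pX decade."""
    rng = random.Random(seed)
    bins = {}
    for compound_id, px in px_by_compound.items():
        bins.setdefault(math.floor(px), []).append(compound_id)
    selected = []
    for decade in sorted(bins):
        members = sorted(bins[decade])
        # Sparse decades contribute everything they have.
        k = min(n_per_decade, len(members))
        selected.extend(rng.sample(members, k))
    return selected


# Synthetic, skewed data: 100 compounds in the pX 6 decade, only 3 in the pX 8 decade.
px_by_compound = {f"mid_{i}": 6.0 + i / 200 for i in range(100)}
px_by_compound.update({"hi_0": 8.1, "hi_1": 8.5, "hi_2": 8.9})

chosen = sample_per_decade(px_by_compound, n_per_decade=5)
print(len(chosen))
```

With N between 50 and 200, as on the previous slide, heavily populated decades are cut down while sparsely populated ones are kept whole, which is what evens out the activity distribution.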
16. High-level workflow for creating models
• Uses a list of UniProt target names and NCBI synonyms
• Attempts to create a model for each target
• Rejects models for insufficient data or low quality
17. The Importance of the Test Set
• Powerful statistical methods can select variables to create a model that looks great on the training set
• However, the real test is how well the model can predict activity for a compound not in the training set.
• In these examples the training and test sets do not overlap; no test compound is in the training set.
• In this study the best metric for the model is the RMS Error of Prediction – the errors seen when predicting activities for compounds not in the training set.
• This is useful to provide the “error bar” for the predictions.
• The required accuracy depends on what you are using the prediction for
• The Pearson r2 provides a metric of “fit” but is dependent on the range of the data and does not directly tell you how much error to expect.
• Other metrics for prediction might include how similar the compound is to any compounds in the training set. One might expect the predictions to be less reliable if the compound is very different.
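The difference between the two metrics can be shown with a small calculation. In the sketch below (made-up numbers, illustrative function names), both data sets have exactly the same prediction errors of ±0.5 pX units, yet the Pearson r2 differs because the measured values span different ranges.

```python
import math


def rmsep(measured, predicted):
    """Root-mean-square error of prediction, in the same units as pX."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r2(measured, predicted):
    """Squared Pearson correlation between measured and predicted values."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_m * var_p)


# Narrow activity range, every prediction off by 0.5.
narrow_measured = [5.0, 6.0, 7.0, 8.0]
narrow_predicted = [5.5, 5.5, 7.5, 7.5]

# Wide activity range, the same +/-0.5 errors.
wide_measured = [2.0, 5.0, 8.0, 11.0]
wide_predicted = [2.5, 4.5, 8.5, 10.5]

print(rmsep(narrow_measured, narrow_predicted), pearson_r2(narrow_measured, narrow_predicted))
print(rmsep(wide_measured, wide_predicted), pearson_r2(wide_measured, wide_predicted))
```

Both cases give an RMS error of 0.5, while r2 is 0.80 for the narrow range and 0.98 for the wide one, which is why the RMS error is the more direct “error bar”.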
18. Model Engine
• Creates series of Reaxys queries for each decade of activity
• Uses PLS/R to create and evaluate models
• Creates spreadsheet with RMS error of prediction and plots
19. Descriptors
• The descriptors are properties computed for the compounds that are related to the activities
• They encode elements of the structure and other properties of the compounds.
• In this study we combined 2 sets of computed descriptors
• RDKit
• CDK (Chemistry Development Kit)
20. Q: Since random sets are selected, what happens if you repeat trials?
A: The results are nearly identical in a statistical sense
• Results are similar for this ABL example.
• For ‘final’ models, larger selections or all data points can be used to increase robustness
• The automation aspect makes repeated trials easy to do
RMS error of prediction for the three repeated models (in slide order):
Training Set   Test Set
0.63           1.6
0.62           1.0
0.66           1.0
[Plots: pX predicted vs. pX measured, both axes 0–16, one training-set and one test-set panel per trial. Fitted lines in slide order:]
Training Set: y = 0.9924x, R² = 0.8803; Test Set: y = 0.9964x, R² = 0.7402
Training Set: y = 0.9924x, R² = 0.8804; Test Set: y = 1.0049x, R² = 0.4629
Training Set: y = 0.9914x, R² = 0.8597; Test Set: y = 0.9959x, R² = 0.7053
21. Spreadsheet created for each model
• Training set, test set
• Statistics for training and test sets.
22. EGFR Model Example
RMSEP   Type           target   protein   trainingSetSize   testSetSize
0.97    Training Set   P00533   EGFR      1583
1.2     Test Set       P00533   EGFR                        521
[Plots: pX predicted vs. pX measured, both axes 0–16]
EGFR Training Set: y = 0.9829x, R² = 0.8205
EGFR Test Set: y = 0.9765x, R² = 0.7151
23. JAK2 Model Example
RMSEP   Type           target   protein   trainingSetSize   testSetSize
0.76    Training Set   O60674   JAK2      1372
0.89    Test Set       O60674   JAK2                        446
[Plots: pX predicted vs. pX measured, both axes 0–16]
JAK2 Training Set: y = 0.9889x, R² = 0.7638
JAK2 Test Set: y = 0.9956x, R² = 0.6749
24. GSK3 Model Example
RMSEP   Type           target   protein   trainingSetSize   testSetSize
0.65    Training Set   P49841   GSK3β     1271
0.97    Test Set       P49841   GSK3β                       427
[Plots: pX predicted vs. pX measured, both axes 0–16]
GSK3β Training Set: y = 0.9911x, R² = 0.813
GSK3β Test Set: y = 1.0008x, R² = 0.5678
25. Example of Poor Model
RMSEP   Type           target   protein   trainingSetSize   testSetSize
0.90    Training Set   P98161   PKD1      60
1.05    Test Set       P98161   PKD1                        26
[Plots: pX predicted vs. pX measured, both axes 0–16]
PDK1 Training Set: y = 0.9795x, R² = -3.788
PDK1 Test Set: y = 0.9867x, R² = -15.88
26. Prediction of Activities
• Reads in all models
• Reads in a set of molecules in SDF format
• Predicts activity for each model
• Computes a Z value for classifying each compound as active/inactive at a pX > 6 threshold
• Classifies as active/inactive based on the desired confidence
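The slide does not spell out how the Z value is computed, so the following is only one plausible reading, stated as an assumption: Z is the distance of the predicted pX from the pX = 6 threshold in units of the model's RMS error of prediction, and the errors are treated as normally distributed. The function names, the 0.9 confidence level and the extra "uncertain" class are illustrative and do not come from the source.

```python
import math


def z_value(predicted_px, rmsep, threshold=6.0):
    """Distance of the prediction from the activity threshold, in RMSEP units (assumed definition)."""
    return (predicted_px - threshold) / rmsep


def probability_active(z):
    """Standard normal CDF: chance that the true pX exceeds the threshold, assuming normal errors."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def classify(predicted_px, rmsep, confidence=0.9, threshold=6.0):
    """Label a compound only when the assumed probability meets the desired confidence."""
    p = probability_active(z_value(predicted_px, rmsep, threshold))
    if p >= confidence:
        return "active"
    if p <= 1.0 - confidence:
        return "inactive"
    return "uncertain"  # illustrative extra class for borderline predictions


# Hypothetical predictions from a model with an RMS error of prediction of 1.0.
for predicted in (7.5, 6.5, 4.0):
    print(predicted, round(z_value(predicted, 1.0), 2), classify(predicted, 1.0))
```

Under these assumptions a model with a larger RMS error of prediction needs a prediction further from pX = 6 before a compound is called active or inactive at the same confidence.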
28. Fedratinib – trials stopped due to cerebral edema in patients.
29. IRAK3 identified by pathway analysis as significant for brain toxicity
Rinker, Predicting adverse drug events using literature-based pathway analysis 251st National Meeting of the
American Chemical Society, San Diego CA March 13-17, 2016
32. Conclusions
• This tool makes it very easy to answer many questions with minimal time and effort, as shown.
One can add the ability to search for specific scaffolds and combine it with a target query.
• It provides an unprecedented tool to explore QSAR by target, class and other criteria.
• One can assess the likelihood of having sufficient data to make a model for a given target.
The automatic evaluation using a test set provides nearly instant feedback on the prospect of making a predictive model.
• Because it uses the KNIME system, nearly any descriptor set can be used for making the models.
There are additional open-source and commercial descriptor tools available.
One can probably get better results with better descriptors.
34. Prediction of ADMET properties
Influencing medicinal chemistry design
Step 0
• logD7.4
• Solubility
• Protein Binding
• hERG
• Rat PPB
• Metabolic Stability
• CYP inhib
• Caco2
• etc.
Step 1
• logD7.4
• Protein Binding
• Solubility
• Metabolic Stability
• hERG
• etc.
Step 2
• Rat PPB
• Hu heps
• CYP inhib
• Caco2
• NaV1.5
• etc.
AstraZeneca’s global hERG QSAR model has contributed to the reduction in the synthesis of ‘red flag’ compounds (compounds that are measured to have a hERG potency of <1 μM) from 25.8% of all compounds tested in 2003 to only 6% in 2010.
Cumming, J.G., Davis, A.M. et al. Nat. Rev. Drug Disc. (2013) 12, 948–962
35. Prediction of cardiotoxic drugs related to hERG blockade
36. How to avoid hERG inhibition?
QT interval prolongation can lead to serious arrhythmias, which can become fatal.
A large number of cases of QT prolongation induced by non-cardiac drugs have been reported, and the number continues to increase, with the withdrawal of some blockbuster medicines.
hERG seems to be the main target of this adverse side effect.
In silico models could be rapid and powerful tools to screen out potential hERG blockers as early as possible during the discovery process
37. Extraction & Methodology
• hERG data set: 640 molecules tested on hERG (Kv11.1)
• Subsets of molecules defined according to detailed biological protocols
• 2D molecular descriptor sets
• Predictive models: Recursive Partitioning (RP), MOE QuaSAR Classify
• Validation: cross-validation and external validation
[Figure: representation of hERG ligands within the chemical space of the NCI database, according to the first two PCA axes (NCI database vs. hERG data set)]
38. Model 3
Classes: HIGH (1 µM; 50 training / 46 test molecules) and WEAK (10 µM; 50 training / 9 test molecules)
Correct classifications determined by 5-fold cross-validation (Training) and by external validation (Test) for each descriptor set:
Class   Training (Relevant)   Training (P_VSA)   Test (Relevant)   Test (P_VSA)
High    48/50                 47/50              45/46             42/46
Weak    49/50                 49/50              8/9               9/9
All     97%                   96%                96%               93%
[Figure: recursive-partitioning decision trees for the two descriptor sets (Relevant and P_VSA), with +/- branches. Descriptors appearing at the nodes: SlogP_VSA7, SlogP_VSA2, SMR_VSA6, PEOE_VSA+1, SMR_VSA5, SMR_VSA4, PEOE_VSA+0, SlogP, PEOE_VSA_FHYD, SMR_VSA1, vsa_pol]
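The "All" percentages reported for Model 3 follow directly from the per-class counts. The short check below recomputes them; the counts are copied from the slide, while the dictionary layout and function name are only illustrative.

```python
# (correct, total) per class, copied from the Model 3 results on the slide.
counts = {
    ("Training", "Relevant"): {"High": (48, 50), "Weak": (49, 50)},
    ("Training", "P_VSA"): {"High": (47, 50), "Weak": (49, 50)},
    ("Test", "Relevant"): {"High": (45, 46), "Weak": (8, 9)},
    ("Test", "P_VSA"): {"High": (42, 46), "Weak": (9, 9)},
}


def overall_percent(per_class):
    """Overall accuracy: all correct classifications over all molecules, as a rounded percent."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    return round(100 * correct / total)


overall = {key: overall_percent(per_class) for key, per_class in counts.items()}
for (data_set, descriptors), percent in overall.items():
    print(f"{data_set:8s} {descriptors:8s} {percent}%")
```

Because the test set holds 46 high-potency but only 9 weak molecules, the overall test figure is dominated by the High class, so the per-class counts remain worth reading alongside it.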
39. Summary
Reaxys Medicinal Chemistry enables the retrieval of high-quality datasets with chemical diversity and homogeneous biological activities.
Pertinent predictive models of hERG activity have been designed using recursive partitioning analysis with 2D molecular descriptors.
Using data from Reaxys Medicinal Chemistry, a fast virtual screening approach could be used as an early tool in the drug discovery process to avoid cardiotoxic side effects related to hERG blockade.
Even if models perform well, designers need interpretable models.
40. Wrap Up
“They care mostly about accessing our data through API, KNIME, Pipeline Pilot”
“They want a product they can use right out of the box”
The new Reaxys Medicinal Chemistry supports the hit-to-lead and lead optimization processes by providing relevant, high-quality data to scientists.
Computational Chemists
High-quality data on many different topics (efficacy, ADMET, animal models)
Large amount of data to build models
Medicinal Chemists
Accessing the data through third-party tools
Reaxys Medicinal Chemistry is able to support both computational and medicinal chemists.
Editor's notes
Challenges:
- Big data and increasing complexity
- Multiple sources of information
- A wide variety of tools and processes
- End users from different departments, wide diversity of domain expertise
- Finding the right information
Needs:
- All relevant data must be accessible and actionable
- Information must be flexibly delivered
- All content and solutions must integrate into the existing environment of tools and processes, and with other data sources
- Large-scale data mining and processing must be possible
- Comprehensive tools that can integrate and/or replace other tools.
Medium-sized pharma companies sit in between; consequently, they are potentially interested in both approaches: the out-of-the-box product and content integration.
The strategy to adopt depends on the personas involved in the deal:
- Computational chemists, chemoinformaticians, etc. (HT data users) => Data + KNIME/API
- Chemists, medicinal chemists, etc. (LT data users) => UI