Lecture given by Ed Griffen at the UKQSAR meeting, Sept 2017. Covers material from the work in our paper http://pubs.acs.org/doi/10.1021/acs.jmedchem.7b00935 with background discussed in https://www.linkedin.com/pulse/first-draft-medicinal-chemistry-admet-encyclopedia-ed-griffen/
3. MedChemica | 2017
‘Big Data’ analysis for Medicinal Chemistry
• No new compounds to make
• No new testing to do
• Exploit the compounds and data you’ve already paid for
• Accelerate all new projects
• Augment the skills and experience of your chemists
• Mythbusting…
All very cost effective
4. MedChemica | 2017
Make a real textbook of Medicinal Chemistry
Multiple Pharma ADMET data → MMPA (×3) → Combine and Extract Rules → >437000 rules
Better Project decisions
Increased Medicinal Chemistry learning
http://pubs.acs.org/doi/10.1021/acs.jmedchem.7b00935
5. MedChemica | 2017
Pillars of Knowledge Mining
Data
Cheminformatics
Statistics
Engineering
Interface Design
Better Decisions
?
6. MedChemica | 2017
Where to get data?
• Public data is unrepresentative
• Censored by publication bias
• Pharma data – can’t share structures due to IP.
• Use chemical transformations to encode knowledge from matched molecular pair (MMP) analysis, which is now sharable
Novartis: Kramer, C.; Kalliokoski, T. et al. The Experimental Uncertainty of Heterogeneous Public Ki Data. J. Med. Chem. 2012, 55, 5165
If project data really looked like that, there would be no problem in the Pharma industry.
7. MedChemica | 2017
Data Sources
[Diagram]
• AZ Data → MMP finder → AZ Database
• Roche Data → MMP finder → Roche Database
• Genentech Data → MMP finder → Genentech Database
• Each company database sits behind an individual company firewall
• MedChemica Aggregation: >500 million pairs → Grand Rule Database, 0.5 million rules
• Grand Rule Database → AZ Exploitation, Roche Exploitation, Genentech Exploitation
8. MedChemica | 2017
Pillars of Knowledge Mining
Data
Cheminformatics
Statistics
Engineering
Interface Design
Better Decisions
?
9. MedChemica | 2017
Advanced MMPA with MCPairs
• Matched Molecular Pairs – molecules that differ only by a particular, well-defined structural transformation
• Transformation with environment capture – MMPs can be recorded as transformations from A → B
• Environment is essential to understand chemistry
Griffen, E. et al. Matched Molecular Pairs as a Medicinal Chemistry Tool. Journal of Medicinal Chemistry. 2011, 54(22), pp.7739-7750.
[Figure: matched pair A → B with atoms labelled 1 to 4 and the associated Δ Data A-B]
Environment is key and we need to capture it in our chemical encoding…
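The point about environment can be illustrated with a minimal sketch. The pair records, environment labels and numbers below are invented purely for illustration (they are not MedChemica data), and the grouping functions are hypothetical helpers rather than the MCPairs encoding. The sketch only shows why pooling a transformation across environments can hide opposite effects.

```python
from collections import defaultdict
from statistics import median

# Hypothetical pair records: (transformation, environment label, delta in log units).
# All labels and values are made up for illustration only.
pairs = [
    ("H>>Me", "aromatic CH", 0.30),
    ("H>>Me", "aromatic CH", 0.20),
    ("H>>Me", "aromatic CH", 0.40),
    ("H>>Me", "amide NH", -0.60),
    ("H>>Me", "amide NH", -0.40),
    ("H>>Me", "amide NH", -0.50),
]


def median_delta_by_environment(records):
    """Group pair deltas by (transformation, environment) and take the median."""
    grouped = defaultdict(list)
    for transformation, environment, delta in records:
        grouped[(transformation, environment)].append(delta)
    return {key: median(values) for key, values in grouped.items()}


def median_delta_ignoring_environment(records):
    """Same grouping, but keyed on the transformation alone."""
    grouped = defaultdict(list)
    for transformation, _environment, delta in records:
        grouped[transformation].append(delta)
    return {key: median(values) for key, values in grouped.items()}


by_env = median_delta_by_environment(pairs)
pooled = median_delta_ignoring_environment(pairs)
print(by_env)
print(pooled)
```

In this toy data the two environments move in opposite directions, while the pooled median sits close to zero and would suggest the change does very little.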
10. MedChemica | 2017
Environment really matters
H → Me:
• Median Δlog(Solubility)
• 225 different environments
[Plot annotations: 2.5 log, 1.5 log]
H → Me:
• Median Δlog(Clint), human microsomal clearance
• 278 different environments
11. MedChemica | 2017
H → F: What effect on Clearance?
• Median Δlog(Clint), human microsomal clearance
• 37 different environments
[Plot labels: 2 fold improvement, 2 fold worse; increase clearance, decrease clearance]
12. MedChemica | 2017
MMPA: Engineering challenges
• Quick to implement on a small scale
• Always becomes an n² problem…
• ‘Challenging’ at enterprise scales 100,000+
– Cheminformatics ‘gotchas’
• Tautomers, charge states
• Unusual aromatic systems
• Highly symmetric molecules
• Capturing and coding environments accurately
– Structure and data integrity
– Assay ontologies
– Database schema optimized for cluster I/O
• Speed at scale essential
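To put a number on the n² point: an all-against-all comparison of n compounds has n(n-1)/2 candidate pairs. The short sketch below just evaluates that count; the function name is a hypothetical helper, and real MMP finders typically avoid the full comparison (for example by indexing fragments), which is not shown here.

```python
from math import comb


def candidate_pairs(n_compounds: int) -> int:
    """Number of distinct compound pairs in an all-against-all comparison."""
    return comb(n_compounds, 2)


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} compounds -> {candidate_pairs(n):,} candidate pairs")
```

At the 100,000-compound scale mentioned above this is roughly five billion candidate pairs, which is why speed at scale matters.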
13. MedChemica | 2017
Rule selection
• Identify and group matching SMIRKS
• Calculate statistical parameters for each unique SMIRKS (n, median, sd, se, n_up/n_down)
• Is n ≥ 6?
– no: not enough data, ignore transformation
• Is the |median| ≤ 0.05 and the intercentile range (10-90%) ≤ 0.3?
– yes: transformation is classified as ‘neutral’
• Perform two-tailed binomial test on the transformation to determine the significance of the up/down frequency
– pass: transformation classified as ‘increase’ or ‘decrease’ depending on which direction the property is changing
– fail: transformation classified as ‘NED’ (No Effect Determined)
[Figure: median data difference axis (-ve, 0, +ve) with Decrease, Neutral and Increase regions, plus NED]
• No assumption of normal distribution
• Manages ‘censored’ = qualified / out-of-range data
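The rule-selection flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MCPairs implementation: the thresholds n ≥ 6, |median| ≤ 0.05 and intercentile range ≤ 0.3 come from the slide, while the significance level `ALPHA`, the way zero differences are dropped from the up/down counts, the percentile method and the order of the branches are assumptions. Censored (qualified) data handling is not shown.

```python
from math import comb
from statistics import median, quantiles

MIN_PAIRS = 6          # from the slide: n >= 6
NEUTRAL_MEDIAN = 0.05  # from the slide: |median| <= 0.05
NEUTRAL_SPREAD = 0.3   # from the slide: 10-90% intercentile range <= 0.3
ALPHA = 0.05           # ASSUMPTION: the significance level is not given on the slide


def binomial_two_tailed_p(n_up: int, n_down: int) -> float:
    """Exact two-tailed binomial p-value against a 50:50 up/down null."""
    n = n_up + n_down
    if n == 0:
        return 1.0
    k = min(n_up, n_down)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def classify(deltas):
    """Classify one transformation from the property differences of its pairs."""
    if len(deltas) < MIN_PAIRS:
        return "ignored"  # not enough data
    med = median(deltas)
    deciles = quantiles(deltas, n=10, method="inclusive")
    spread = deciles[-1] - deciles[0]  # 10th to 90th percentile range
    if abs(med) <= NEUTRAL_MEDIAN and spread <= NEUTRAL_SPREAD:
        return "neutral"
    # ASSUMPTION: pairs with exactly zero difference count as neither up nor down.
    n_up = sum(1 for d in deltas if d > 0)
    n_down = sum(1 for d in deltas if d < 0)
    if binomial_two_tailed_p(n_up, n_down) < ALPHA:
        return "increase" if med > 0 else "decrease"
    return "NED"  # No Effect Determined


print(classify([0.4, 0.5, 0.6, 0.3, 0.7, 0.5]))     # all six pairs go up
print(classify([0.5, -0.4, 0.6, -0.5, 0.4, -0.2]))  # three up, three down
```

Under these assumptions, six pairs that all move the same way give a two-tailed p-value of 2/64 ≈ 0.031, whereas five such pairs give 2/32 ≈ 0.063.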
14. MedChemica | 2017
Merging knowledge
• Use the transforms that are robust in both companies to calibrate assays.
• Once the assays are calibrated against each other, the transform data can be combined to build support in poorly exemplified transforms
• Methodology precedented in other fields
[Diagram: Pharma 1 vs Pharma 2 transforms, labelled Robust/Robust (Calibrate), Weak/Weak (Discover) and Novel]
15. MedChemica | 2017
Merging Assays
[Diagram: Compounds A, B, C and D; Transformation 1 and Transformation 2; measurements (pIC50, log(Clint), pSol etc.) in Assay 1 and Assay 2 give ΔT1, ΔT2 and ΔT1’, ΔT2’]
[Plot: Assay 2 Δ against Assay 1 Δ for T1 and T2, with the line ΔT1’ = ΔT1, ΔT2’ = ΔT2 and regions marked “Assay 2 more sensitive than Assay 1” and “Assay 2 less sensitive than Assay 1”]
• Sets of transformations can be calibrated against each other as we are comparing Δ values in assays, not absolute values
16. MedChemica | 2017
Merging Details
• Datasets are standardized by comparison of transformations shared by contributing companies
• Transformations are examined at the “pair example” level
• Minimum of 6 transformations, each with a minimum of 6 pairs (42 compounds bare minimum) required to standardize
• “Calibration factors” extracted to standardize the datasets to a common value – mean of calibration factors 0.94, typical range 0.8-1.2.
• Datasets with too few common transformations have standard compound measurements shared for calibration.
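One way to picture a calibration factor is as the slope relating the Δ values seen for shared transformations in two assays. The slides do not say how the factor is actually fitted, so the zero-intercept least-squares slope below is an assumption, as are the function name, the input layout and the example numbers. Only the two minimum counts come from the slide.

```python
MIN_SHARED_TRANSFORMATIONS = 6    # from the slide
MIN_PAIRS_PER_TRANSFORMATION = 6  # from the slide


def calibration_factor(shared):
    """Zero-intercept least-squares slope of assay-2 deltas against assay-1 deltas.

    `shared` maps a transformation label to
    (n_pairs_assay1, median_delta_assay1, n_pairs_assay2, median_delta_assay2).
    ASSUMPTION: the fitting method is illustrative, not the published procedure.
    """
    usable = [
        (d1, d2)
        for (n1, d1, n2, d2) in shared.values()
        if n1 >= MIN_PAIRS_PER_TRANSFORMATION and n2 >= MIN_PAIRS_PER_TRANSFORMATION
    ]
    if len(usable) < MIN_SHARED_TRANSFORMATIONS:
        raise ValueError("too few shared transformations to calibrate")
    denominator = sum(d1 * d1 for d1, _ in usable)
    if denominator == 0:
        raise ValueError("assay-1 deltas are all zero")
    return sum(d1 * d2 for d1, d2 in usable) / denominator


# Made-up example in which assay 2 responds about 10% more strongly than assay 1.
example = {
    "T1": (8, 0.50, 7, 0.55),
    "T2": (6, -0.30, 9, -0.33),
    "T3": (12, 0.80, 6, 0.88),
    "T4": (7, -0.60, 6, -0.66),
    "T5": (9, 0.20, 10, 0.22),
    "T6": (6, -0.10, 8, -0.11),
}
factor = calibration_factor(example)
print(round(factor, 3))
```

In this sketch a factor above 1 means assay 2 shows larger differences than assay 1 for the same transformations, and a factor below 1 means it shows smaller ones.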
17. MedChemica | 2017
Merging knowledge – GRDv1
Combining data yields brand new rules
Pharma 1: 100k rules
Pharma 2: 92k rules
Pharma 3: 37k rules
5.8k rules in common (pre-merge), ~2%
Merge → New Rules: 88k, ~26% of total
Gains: 300 - 900%
18. MedChemica | 2017
Exploiting Knowledge for Compound Optimization
[Diagram labels: Measured Data, rule finder, Exploitable Knowledge, MCExpert System, Problem molecule, New molecule suggestions, MCPairs]
“..it’s like asking 150 of your peers for ideas in just a few seconds” – AZ Principal Scientist
19. MedChemica | 2017
Build Interfaces to many tools
[Diagram labels: Pair & Rule Database; Compounds from Rules; API server; RESTful API; Chemistry Shape and electrostatics; MCPairs; MCRules; Corporate structures and measurements]
20. MedChemica | 2017
Knowledge Extracted
Grand Rule Database v3: numbers of statistically valid transforms
Grouped Datasets | Number of Rules
logD7.4 | 153449
Merged solubility | 46655
In vitro microsomal clearance: human, rat, mouse, cyno, dog | 88423
In vitro hepatocyte clearance: human, rat, mouse, cyno, dog | 26627
MDCK permeability A-B / B-A efflux | 1852
Cytochrome P450 inhibition: 2C9, 2D6, 3A4, 2C19, 1A2 | 40605
Cardiac ion channels: NaV 1.5, hERG ion channel inhibition | 15636
Glutathione stability | 116
Plasma protein or albumin binding: human, rat, mouse, cyno, dog | 64622
21. MedChemica | 2017
Single company vs merged
Comparison between Roche-only and GRD rules for human microsomal clearance. Overall R² is 0.76 and RMSE 0.11.
22. MedChemica | 2017
There is no “logD receptor”…
• We often use lipophilicity as a design surrogate
• Provides a context for changes
• Key multi-objective design issues are centred around conflicting logD correlations:
• Solubility & metabolic stability vs potency & permeability
• Particularly useful to look at chemical transformations that ‘break the dogma’ of logD correlation
23. MedChemica | 2017
Solubility : logD – trends & exceptions
>=20 examples per rule, n=13,453
R² = 0.66, slope = -0.57, intercept = 0.
Magenta line: line of slope -1, intercept 0; dark blue line: linear best fit; pale blue density ellipse contains 99% and the mid blue ellipse contains 50% of the transformations.
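A minimal sketch of how exceptions to the trend might be picked out: compare each rule's observed change in log solubility with the change expected from its change in logD, and flag large residuals. The slope and intercept are the fitted values quoted above. The residual cut-off, the rule names and the example values are assumptions made for illustration.

```python
SLOPE = -0.57     # from the slide: best-fit slope
INTERCEPT = 0.0   # from the slide
THRESHOLD = 0.5   # ASSUMPTION: illustrative residual cut-off in log units


def trend_exceptions(rules, slope=SLOPE, intercept=INTERCEPT, threshold=THRESHOLD):
    """Return (rule name, residual) for rules far from the logD trend line.

    `rules` is an iterable of (name, delta_logD, delta_log_solubility).
    """
    flagged = []
    for name, d_logd, d_logsol in rules:
        expected = slope * d_logd + intercept
        residual = d_logsol - expected
        if abs(residual) > threshold:
            flagged.append((name, residual))
    return flagged


# Hypothetical rules: names and numbers are invented.
rules = [
    ("rule_a", 0.5, -0.3),  # close to the trend
    ("rule_b", 1.0, 0.5),   # more lipophilic yet more soluble
    ("rule_c", -1.0, 0.6),  # close to the trend
]
exceptions = trend_exceptions(rules)
print([name for name, _ in exceptions])
```

In this toy data only `rule_b` is flagged: logD goes up, but solubility also goes up, which runs against the overall correlation.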
25. MedChemica | 2017
Clearance : logD – trends & exceptions
>=20 examples per rule, n=11,572
R² = 0.40, slope = 0.23, intercept = 0.
Magenta line: line of slope 1, intercept 0; dark blue line: linear best fit; pale blue density ellipse contains 99% and the mid blue ellipse contains 50% of the transformations.
27. MedChemica | 2017
Pillars of Knowledge Mining
Data
Cheminformatics
Statistics
Engineering
Interface Design
Better Decisions
?
28. MedChemica | 2017
Influencing Chemists
“In the choice between changing one's mind and proving there's no need to do so, most people get busy on the proof.”
John Kenneth Galbraith
“For the great enemy of truth is very often not the lie--deliberate, contrived and dishonest--but the myth--persistent, persuasive, and unrealistic. Too often we hold fast to the clichés of our forebears. We subject all facts to a prefabricated set of interpretations. We enjoy the comfort of opinion without the discomfort of thought.”
Address by President John F. Kennedy, Yale University Commencement
29. MedChemica | 2017
Better Human-Machine interactions
All software is mediated through people
• We want to augment medicinal chemists’ skills and experience
• Chemists need to discover knowledge themselves
• Intuitive ( = fast & familiar)
• Summary data + option to drill into the detail
• What are the two interfaces chemists feel most comfortable with?
• Web browsers
• Excel
31. MedChemica | 2017
Pillars of Knowledge Mining
Data
Cheminformatics
Statistics
Engineering
Interface Design
Better Decisions
?
32. MedChemica | 2017
More examples of Success
Thompson, M. J. et al. J. Med. Chem. 2015, 58 (23), pp 9309–9333
DOI: 10.1021/acs.jmedchem.5b01312
33. MedChemica | 2017
“Me-Betters” on a Massive scale
[Diagram: 1162 Marketed Drugs + Grand Rule Database v3 → Enumerator System → Wealth of Follow-on opportunities]
Improve solubility & metabolism
= lower dose
= uid from bid/tid
Safer, better compliance
~425 improvement suggestions / drug
34. MedChemica | 2017
To the Future
• Exploiting MMPs
– Matched molecular series
– MMP based clustering
– QSAR from MMPA
• Interface design is key
?
35. MedChemica | 2017
Conclusions
• We have to accelerate projects
– Exploiting existing data is highly efficient
• High quality medchem knowledge can be mined and exchanged on a large scale
– There is a huge amount of medicinal chemistry knowledge
– Right science, statistics, engineering
• Human - machine interfaces are critical
37. MedChemica | 2017
About Us
Passionate about generating better decisions from data
Dr Andrew G. Leach, Technical Director
Liverpool John Moores
12 years experience large Pharma
Applied computational and medicinal chemistry
Dr Ed Griffen, Technical Director
21 years experience large Pharma, biotech
Medicinal chemistry and large scale statistical analysis methods
Dr Al Dossetter, Managing Director
17 years Medicinal chemistry large Pharma and extensive cloud computing experience
Dr Ali Griffen, Business Analyst
21 years experience Team leader bioscientist and biological data curation large Pharma
Dr Shane Montague, Lead Data Scientist
PhD Computer Science
13 years experience Microsoft, University of Salford
Data science and information security
Editor's Notes
Lots of people come forward with ideas to ‘revolutionise drug discovery’, but being more data driven is surprisingly cheap compared to most of them, e.g. ‘new modalities’ like therapeutic RNAs or chimeric antigen receptors, or even large ring macrocycles.