Presented at the 15th GCC - German Conference on Cheminformatics November 2019
We combine regression forest machine learning with our MMPA based generative methods to deliver an active learning system to accelerate lead optimisation. In the process we identify permutative MMPA as a method to leverage SAR information from small data sets.
Published by MedChemica Ltd
Accelerating lead optimisation with active learning by exploiting MMPA based ADMET knowledge with regression forest potency models
• Features are acid, base, hydrogen bond donor, acceptor, hydrophobe, aromatic attachment, aliphatic attachment and halogen. Definitions are highly engineered.†
• Feature 1 – topological distance – Feature 2
• Engineered for chemical relevance – features can be superimposed or directly linked, e.g. enables a group to be both a hydrogen bond acceptor and a base
• A bit identifies a pharmacophore pair, e.g. Aromatic - 3 bonds - Base
• Used as unfolded 280 bit fingerprints
• Regression Forest as ML method
• Build models with 10 fold CV – report CV-Pearson’s R² and CV RMSE
• Build RF error model to generate predicted error for each compound using the same descriptors
†Taylor, R.; Cole, J. C.; Cosgrove, D. A.; Gardiner, E. J.; Gillet, V. J.; Korb, O. J Comput Aided Mol Des 2012, 26 (4), 451–472.
†Acid and base definitions are SMARTS patterns (MedChemica definitions): acids include C, N and heteroaromatic acids; bases exclude weak aniline bases and include amidines and guanidines.
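To make the "Feature 1 – topological distance – Feature 2" bit concrete, here is a minimal sketch of an unfolded pharmacophore-pair fingerprint. The bit layout is an assumption: 28 unordered pairs of distinct feature types times 10 distance bins (0 to 9 bonds) happens to give 280 bits, but the poster does not describe the real layout. The feature names, `BIT_OF`, `LABEL_OF` and `fingerprint` are hypothetical, and feature perception (the engineered SMARTS definitions) is replaced by hand-written input.

```python
from itertools import combinations

# The eight feature types named on the poster (short names are illustrative).
FEATURES = ["acid", "base", "donor", "acceptor", "hydrophobe",
            "aromatic_attachment", "aliphatic_attachment", "halogen"]
MAX_DISTANCE = 9  # ASSUMPTION: distance bins 0..9 bonds; 0 means superimposed

# ASSUMPTION: one bit per (pair of distinct feature types, distance).
PAIRS = list(combinations(FEATURES, 2))  # 28 unordered pairs
BIT_OF = {(a, d, b): i * (MAX_DISTANCE + 1) + d
          for i, (a, b) in enumerate(PAIRS)
          for d in range(MAX_DISTANCE + 1)}
# Unfolded: every bit maps back to exactly one readable pharmacophore pair.
LABEL_OF = {bit: f"{a} - {d} bonds - {b}" for (a, d, b), bit in BIT_OF.items()}


def fingerprint(atom_features, distance):
    """Return the set of bits that are on.

    atom_features: {atom index: set of feature names on that atom}
    distance: {(i, j): topological distance in bonds}, with i < j
    """
    typed = [(atom, f) for atom, feats in atom_features.items() for f in feats]
    bits = set()
    for (i, fi), (j, fj) in combinations(typed, 2):
        if fi == fj:
            continue  # same-type pairs are not encoded in this assumed layout
        d = 0 if i == j else distance[(min(i, j), max(i, j))]
        if d > MAX_DISTANCE:
            continue
        a, b = sorted((fi, fj), key=FEATURES.index)
        bits.add(BIT_OF[(a, d, b)])
    return bits


# Toy input: atom 0 is an aromatic attachment; atom 3 is both a base and an
# acceptor (superimposed features), three bonds away from atom 0.
bits = fingerprint({0: {"aromatic_attachment"}, 3: {"base", "acceptor"}},
                   {(0, 3): 3})
labels = sorted(LABEL_OF[b] for b in bits)
```

Because the fingerprint is unfolded, a high-importance bit in a model can be read back directly as a pharmacophore pair through `LABEL_OF`.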
Regression forest models
Strategy                     Number of compounds  Number of matches  Maximum pIC50  Maximum pIC50
                             generated            to D2 known set    (actual)       (predicted[error])
Hit-to-Lead                  682                  10                 7.8            5.5[0.21]
Dopamine class               469                  8                  7.9            5.5[0.23]
Solubility                   10148                10                 7.8            5.5[0.21]
Metabolism                   12729                19                 7.9            5.5[0.21]
Permutative MMPA (env = 4)   5                    3                  7.9            6.1[?]
Accelerating lead optimisation with active learning by exploiting MMPA based
ADMET knowledge with regression forest potency models
A. G. Dossetter•, E. Griffen•, A. Leach•+, P. de Sousa•.
•MedChemica Ltd, Macclesfield, UK, + Pharmacy and Biomolecular Sciences, Liverpool John Moores University,
Problem
How can we reduce the number of compounds made in going from a small set of confirmed hits to
compounds we can test in vivo? For example: can we go from 30 hits to potent in vivo available leads in 10
rounds of synthesizing 30 compounds?
Learning
Combining focused generative approaches with explainable QSAR models shows initial promise.
The pinch point is the second set of compounds.
MedChemica
contact@medchemica.com
Approach Case Study
Dopamine D2 dataset
• Well studied target, ligand based design,
• >5200 measured compounds known
• Simulate hit optimization process
• Use known compounds as validation
The Startpoints
30 compounds: 5 <= pIC50 <= 6, -1 < AlogP < 3.5, selected by LLE sort
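As an illustration of the start-point selection, the sketch below filters on the stated pIC50 and AlogP windows and sorts by LLE, taken here in its usual sense of pIC50 minus logP. Sorting in descending order and the record layout (`id`, `pIC50`, `alogp`) are assumptions; `select_start_points` is a hypothetical helper.

```python
def select_start_points(compounds, n=30):
    """Pick n start points from a list of dicts with 'id', 'pIC50', 'alogp'."""
    # Keep only compounds inside the potency and lipophilicity windows.
    window = [c for c in compounds
              if 5 <= c["pIC50"] <= 6 and -1 < c["alogp"] < 3.5]
    # LLE = pIC50 - AlogP; ASSUMPTION: highest LLE is preferred.
    return sorted(window, key=lambda c: c["pIC50"] - c["alogp"], reverse=True)[:n]


# Toy data, not from the D2 set.
compounds = [
    {"id": "a", "pIC50": 5.5, "alogp": 1.0},   # LLE 4.5
    {"id": "b", "pIC50": 5.9, "alogp": 3.0},   # LLE 2.9
    {"id": "c", "pIC50": 6.5, "alogp": 0.5},   # outside the pIC50 window
    {"id": "d", "pIC50": 5.1, "alogp": -0.5},  # LLE 5.6
    {"id": "e", "pIC50": 5.4, "alogp": 4.0},   # outside the AlogP window
]
picked = [c["id"] for c in select_start_points(compounds, n=2)]
```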
Generate virtual compounds from MedChemica Knowledge database
• Hit-to-Lead transformations – the most used medicinal chemistry
• ADMET transformations for metabolism and solubility
• Target class transformations learning from target analogues
Permutative MMPA
• generate compounds from data already gained
Regression forest models
• Accurate pharmacophore features with topological distance
• Unfolded fingerprints connect feature importance to pharmacophores
• Error models give accuracy of prediction for each compound
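A minimal sketch of how a potency model and a companion error model could be built, assuming scikit-learn's `RandomForestRegressor` as the regression forest and the absolute out-of-fold residual as the error model's target. Neither choice is confirmed by the poster; the random bit matrix stands in for real fingerprints, and `fit_potency_and_error_models` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def fit_potency_and_error_models(X, y, n_splits=10, seed=0):
    """Fit a potency forest, report 10-fold CV statistics, and fit an error forest."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    potency = RandomForestRegressor(n_estimators=100, random_state=seed)
    # Out-of-fold predictions: each compound is predicted by a model that never saw it.
    oof = cross_val_predict(potency, X, y, cv=cv)
    stats = {
        "cv_pearson_r2": float(np.corrcoef(y, oof)[0, 1] ** 2),
        "cv_rmse": float(np.sqrt(np.mean((y - oof) ** 2))),
    }
    potency.fit(X, y)
    # ASSUMPTION: the error model learns |measured - out-of-fold prediction|
    # from the same descriptors.
    error = RandomForestRegressor(n_estimators=100, random_state=seed)
    error.fit(X, np.abs(y - oof))
    return potency, error, stats


# Synthetic stand-in data: 60 compounds, 280 fingerprint bits.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 280))
y = 5.0 + 0.3 * X[:, :5].sum(axis=1) + rng.normal(0.0, 0.1, size=60)

potency, error, stats = fit_potency_and_error_models(X, y)
predicted_pic50 = potency.predict(X[:3])
predicted_error = error.predict(X[:3])
```

Each new compound then carries two numbers, a predicted pIC50 and a predicted error, which is the form shown in the "predicted[error]" column of the results table.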
Active Learning
• Explore from predicted high potency, high error
• Exploit from predicted high potency, low error
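One way to turn the explore and exploit rules into a selection step is sketched below. The potency and error cut-offs, the batch sizes and the function `select_batch` are hypothetical; the poster does not state how "high" and "low" are defined.

```python
def select_batch(candidates, n_explore, n_exploit, min_potency=6.0, max_error=0.3):
    """Split a synthesis batch into explore and exploit picks.

    candidates: list of (compound_id, predicted_pIC50, predicted_error)
    min_potency, max_error: ASSUMED thresholds for "high potency" and "low error"
    """
    potent = [c for c in candidates if c[1] >= min_potency]
    # Exploit: predicted high potency, low error, most potent first.
    exploit = sorted((c for c in potent if c[2] <= max_error),
                     key=lambda c: c[1], reverse=True)[:n_exploit]
    # Explore: predicted high potency, high error, most uncertain first.
    explore = sorted((c for c in potent if c[2] > max_error),
                     key=lambda c: c[2], reverse=True)[:n_explore]
    return explore, exploit


# Toy predictions, not from the poster.
candidates = [("A", 6.8, 0.15), ("B", 6.5, 0.60), ("C", 6.2, 0.20),
              ("D", 5.4, 0.90), ("E", 6.1, 0.45)]
explore, exploit = select_batch(candidates, n_explore=1, n_exploit=1)
```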
• Take all compounds in a data set
• Find all matched pairs; extract ΔpIC50 and the transforms between them
• Aggregate transformations with median ΔpIC50 and count of pairs
• Apply all transformations back to the initial data set (at what environment level?)
• Predicted pIC50 = substrate pIC50 + median ΔpIC50
• Remove existing compounds
• Prioritise new compounds by pIC50 estimate
Permutative MMPA
[Figure: matched pairs M1 → M2 and M3 → M4 share transform t1; applying t1 to M5 generates M*]
• M1 → M2 transform t1
• M3 → M4 transform t1
• M5 matches t1 and generates M*
• Predict pIC50: pIC50(M*) = pIC50(M5) + median ΔpIC50(t1)
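The permutative MMPA steps and the M1 to M5 example can be sketched on a toy representation. This is not the MedChemica implementation: molecules are reduced to a hypothetical `(core, substituent)` tuple, a transform is a `(from, to)` substituent swap, and the environment level is ignored. A real workflow would need fragmentation of actual structures.

```python
from collections import defaultdict
from itertools import permutations
from statistics import median


def permutative_mmpa(data):
    """Rank new compounds suggested by the SAR already in the data set.

    data: {(core, substituent): measured pIC50}
    Returns [((core, substituent), (estimated pIC50, pair count)), ...],
    best estimate first.
    """
    # Group by core: two molecules sharing a core form a matched pair.
    by_core = defaultdict(list)
    for (core, sub), pic50 in data.items():
        by_core[core].append((sub, pic50))
    # Collect delta pIC50 for every transform, in both directions.
    deltas = defaultdict(list)
    for members in by_core.values():
        for (s1, p1), (s2, p2) in permutations(members, 2):
            deltas[(s1, s2)].append(p2 - p1)
    # Aggregate each transform to (median delta pIC50, number of pairs).
    stats = {t: (median(d), len(d)) for t, d in deltas.items()}
    # Apply every transform back to every matching substrate.
    candidates = {}
    for (core, sub), pic50 in data.items():
        for (s_from, s_to), (med, n_pairs) in stats.items():
            product = (core, s_to)
            if s_from != sub or product in data:
                continue  # transform does not match, or compound already exists
            estimate = pic50 + med
            # If several routes reach the same product, keep the highest
            # estimate (an arbitrary choice made for this sketch).
            if product not in candidates or estimate > candidates[product][0]:
                candidates[product] = (estimate, n_pairs)
    return sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)


# Toy values mirroring the example: t1 is the swap H -> F.
data = {
    ("coreA", "H"): 5.2, ("coreA", "F"): 5.9,  # M1 -> M2
    ("coreB", "H"): 5.0, ("coreB", "F"): 5.5,  # M3 -> M4
    ("coreC", "H"): 5.6,                       # M5
}
ranked = permutative_mmpa(data)  # only M* = ("coreC", "F") is new
```

With these toy numbers, M* is estimated at pIC50(M5) plus the median of the two observed ΔpIC50 values for t1.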
[Figure: substrate molecules and the MedChemica Transformation Database feed the Generator, which outputs virtual molecules]
Generate molecules from Knowledge Database
• Hit-to-Lead transformations: 689 transformations with >=250 example pairs
• Dopamine receptor transformations (not D2!): 1027 transformations
• Solubility: 6320 transformations
• Metabolism: 12719 transformations
Generating new structures is not an issue…
Conclusions
• Good starting points are key(!)
• There is no free lunch – good models need data
• Make best use of the data you already have – focused permutative MMPA finds SAR you may have missed by eye
• Target class based enumeration is most efficient, but still need a better method for round 2 synthesis
• The first set of compounds after the hits are critical if you want to move fast…
Experiment: Fully automated active learning
• Build RF model CV-R2 -0.26, small data set, is it useful?
• Enumerate from all compounds:
• what’s the best enumeration strategy?
• how to pick the (few) compounds to make from the enumerated set?
90% of predictions within 0.5 log of measured
• Enumeration generates high potency compounds, but early models are too coarse to correctly prioritize the best small set for synthesis either by high error or high potency
7.9!
• Permutative MMPA with tight definition of MMPA environment generates an excellent first set of follow-up compounds learning from the SAR within the hits
• The second batch of compounds is more of a challenge…
Most potent compound (measured) from HtL enumeration
Active Learning
[Flow chart: Hits → Build model with error estimates → Enumerate → Select for Explore and Exploit → Synthesise & Test → Compounds with data → Compounds meet criteria? (Yes / No)]
Explore: prioritize high error
Exploit: prioritize high potency & low error
Ratio of explore to exploit varies with stage
Select enumeration strategy by stage: Hit-to-lead, target class, solubility, metabolism
For in silico simulation match to known and measured compounds
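The cycle can be sketched as a loop over pluggable steps. It assumes that a "No" at the criteria check returns to model building, and that in the in silico simulation "synthesise and test" is a lookup in the set of known, measured compounds. All function names and the toy stand-ins below are hypothetical.

```python
def active_learning(hits, build_model, enumerate_candidates, select,
                    synthesise_and_test, meets_criteria, max_rounds=10):
    """Run build -> enumerate -> select -> test until the criteria are met.

    hits: {compound: measured pIC50}; returns (all data, rounds used).
    """
    data = dict(hits)
    rounds = 0
    while rounds < max_rounds and not meets_criteria(data):
        model = build_model(data)
        # Only compounds without data are worth making.
        candidates = [c for c in enumerate_candidates(data) if c not in data]
        batch = select(model, candidates, rounds + 1)
        if not batch:
            break  # nothing left to make
        data.update(synthesise_and_test(batch))
        rounds += 1
    return data, rounds


# Toy stand-ins: compounds are integers and potency rises by 0.1 per step.
known = {n: 5 + n / 10 for n in range(20)}  # plays the role of the known set

data, rounds = active_learning(
    hits={0: known[0], 1: known[1]},
    build_model=lambda d: None,                     # no real model in the toy
    enumerate_candidates=lambda d: {n + 1 for n in d},
    select=lambda model, cands, rnd: sorted(cands, reverse=True)[:2],
    synthesise_and_test=lambda batch: {c: known[c] for c in batch if c in known},
    meets_criteria=lambda d: max(d.values()) >= 6.0,
)
```

Passing the round number to `select` leaves room for an explore to exploit ratio and an enumeration strategy that change with stage.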