MedChemica Active Learning - Combining MMPA and ML

Exploiting medicinal chemistry knowledge to accelerate projects October 2020
Not for Circulation
Accelerating lead optimisation with Active Learning -
joining MMPA ADMET knowledge with Regression Forest
machine learning models
Dr Alexander G. Dossetter
Managing Director, MedChemica Ltd
Available on Slideshare - search for Dossetter
Twitter @MedChemica
Twitter @covid_moonshot
Twitter #BucketListPapers
https://www.medchemica.com/bucket-list/

Agenda
• Problem statement
• What is Active Learning?
– How can it be applied to LI and LO?
• Generating new ideas with MMPA
– Enumeration with MMPA (RuleDesign™)
• “hit-to-lead” / “AllRules” / 3pairtrans
• Protein class Rule sets
– Permutative-MMPA (Free Wilson ++)
• Getting the best ideas from small data sets
• Regression Forest models for ‘potency’ prediction
– QSAR revisited with transparent descriptors
- Analysis of Error
• Learnings so far
– The system can ‘get stuck’ at the start…
• “It’s like the first 8 moves in chess”

Problem Statement
…8 Years of working with pharma companies
“Our median number of compounds per LO project is 3000 - this is
unsustainable… [it should be] 300”
– Director of Chemistry (large pharma)
“Can we define the text book of medicinal chemistry?”
– Director of Comp Chem (large pharma)
“We are aiming at 300 compounds per project. Currently we are at about 400; we will
get better”
– Exscientia scientist at SCI “What can Big Data do for chemistry”
“Can you find us hits [leads] and predict potency on this [brand] new
protein?”
– Many, many people…
MedChemica: using knowledge extraction techniques to build Artificial
Intelligence (AI) systems to reduce the time and cost to critical
compounds and candidate drugs.

Problem Statement
“Can you find us hits [leads] and predict potency on this
[brand] new protein?”
Can we automate Lead compound design?
The algorithm will:
- design compounds and explore SAR
- ‘actively’ select compounds to improve properties
- AND improve the machine learning models
[Diagram: Small amount of data → Matched Molecular Pair Analysis + Explainable QSAR → Awesome leads (pIC50 > 7, good in-vitro PK, SAR, novelty)]

Augmenting the Medicinal Chemist
[Diagram labels: Sets goals; Prioritizes options; Makes decisions; Data is organized and summarized]

Augmented Chemists
[Diagram labels: proposals; RuleDesign™; Permutative MMPA; Missing features; Explainable QSAR models; Alerts; ideas; Score and store; Make & test; SpotDesign™]

Augmenting the Chemist: Lessons so far…
Develop AI constructively
• Use methods that can be directly connected to
chemical structures and data
– SpotDesign™, RuleDesign™, Permutative
MMPA, Explainable QSAR
• Ensure that all methods are auditable
– See the transformations and underlying data,
see the pharmacophore pairs on molecules
• Automate updates and track metrics
– All systems are automated from the start,
logging is built in
• Integrate automated systems and chemists’ ideas
Principles for Positive Engagement
• Define common goals
• Evaluate with directly observable
data
• Expose conflicting views
• Continuous learning and
improvement
• Place in context
Chemists: AI Is Here; Unite To Get the Benefits,
Griffen, E. J.; Dossetter, A. G.; Leach, A. G. J. Med. Chem. 2020, 63 (16), 8695–8704.
https://doi.org/10.1021/acs.jmedchem.0c00163

Explainable AI for Medicinal Chemistry Design
[Diagram labels: Data Warehouse; Automated loader; Clean Structures & Data; MMPA; rule finder; Exploitable Knowledge; Molecule problem solving; Explainable QSAR; Property Prediction; Idea ranking; Instant SAR analysis; REST API & GUI]

Griffen, E. et al. J. Med. Chem. 2011, 54(22), pp.7739 - 7750.
Leach et al. J. Chem. Inf. Model. 2017, 57, 2424 - 2436
Fully Automated Matched Molecular Pair Analysis (MMPA)
What is this form of Artificial Intelligence?
[Figure: matched pair A → B with atoms labelled 1 to 4 (environment levels around the point of change); Δ Data A–B]
• Matched Molecular Pairs – Molecules that differ only by a
particular, well-defined structural transformation
• Capture the change and environment – MMPs can be recorded as
transformations from A → B
• Statistical analysis to define “medicinal chemistry rules”
Defined transformations with high probability of improving
properties of molecules
• Store in a high performance database and provide an intuitive user
interface
Level 4 and higher very
important to P-MMPA

From SAR to MMPA…
Pair (A → B)   pSol A (μM)      pSol B (μM)      ∆pSol
1              -4.3 (48 μM)     -3.2 (700 μM)    1.1
2              -6.0 (1.0 μM)    -3.7 (178 μM)    2.3
3              -5.7 (2.0 μM)    -4.1 (82 μM)     1.6
3 pairs, all +ve Sol; median ∆pSol 1.6
Compounds shown: CHEMBL1949790, CHEMBL1949786; CHEMBL3356658, CHEMBL218767; CHEMBL456322, CHEMBL456802
MCPairs Rule finder required 6 matched pairs for 95% confidence
(Al)(Al)
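The arithmetic behind the three pairs listed above can be checked in a few lines. This is only an illustrative sketch: the values are copied from the slide, and reading pSol as log10 of the molar solubility is an assumption made here to explain the μM figures in brackets.

```python
from statistics import median

# (pSol A, pSol B) for the three matched pairs listed above
pairs = [(-4.3, -3.2), (-6.0, -3.7), (-5.7, -4.1)]

# Change in pSol for each A -> B pair, rounded to the precision of the slide
deltas = [round(b - a, 1) for a, b in pairs]
n_up = sum(d > 0 for d in deltas)      # pairs where solubility went up
median_delta = median(deltas)


def to_micromolar(log10_molar):
    """Convert log10(molar solubility) to micromolar (assumed meaning of pSol)."""
    return 10 ** log10_molar * 1e6


print(deltas, n_up, median_delta)
print(round(to_micromolar(-6.0), 1))
```

All three pairs move in the same direction, but three pairs are fewer than the six the Rule finder requires, so on their own they would not yet count as a Rule.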

The Matched Pairs leading to Rule…..
Actual Rule from MCPairs
Endpoint: Aqueous Solubility at pH 7.4 [CHEMBL2362975]
n-qual: 69
n-qual-up: 47
n-qual-down: 21
median ∆pSol: 0.26
std dev: ± 0.636
(Al)(Al)
Explainable
• Drill back to real world
examples and measured data
Actionable
• Clear decision to make the
compound

Rule selection
• Identify and group matching SMIRKS
• Calculate statistical parameters for each unique SMIRKS (n, median, sd, se, n_up/n_down)
• Is n ≥ 6?
– no: not enough data, ignore transformation
– yes: continue
• Is the |median| ≤ 0.05 and the intercentile range (10-90%) ≤ 0.3?
– yes: transformation is classified as ‘neutral’
– no: continue
• Perform two-tailed binomial test on the transformation to determine the significance of the up/down frequency
– pass: transformation classified as ‘increase’ or ‘decrease’ depending on which direction the property is changing
– fail: transformation classified as ‘NED’ (No Effect Determined)
[Figure: median data difference axis (-ve, 0, +ve) showing the Decrease, Neutral, Increase and NED regions]
• No assumption of normal distribution
• Manages ‘censored’ = qualified / out-of-range data
Leach et al. J. Chem. Inf. Model. 2017, 57, 2424 - 2436
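The decision flow above can be sketched in plain Python. This is a minimal illustration, not the MCPairs implementation: the function names, the 0.05 significance level (chosen to mirror the "95% confidence" mentioned earlier), the linear-interpolation percentile and the choice to leave zero-change pairs out of the binomial test are all assumptions.

```python
from math import comb
from statistics import median


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..1) of a pre-sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def binomial_two_tailed_p(k, n, p=0.5):
    """Exact two-tailed p-value: total probability of outcomes no likelier than k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(pr for pr in probs if pr <= probs[k] * (1 + 1e-9)))


def classify(deltas, min_pairs=6, alpha=0.05):
    """Classify one transformation from its list of property changes."""
    if len(deltas) < min_pairs:
        return "ignore"                      # not enough data
    vals = sorted(deltas)
    spread = percentile(vals, 0.9) - percentile(vals, 0.1)
    if abs(median(vals)) <= 0.05 and spread <= 0.3:
        return "neutral"
    n_up = sum(d > 0 for d in vals)
    n_down = sum(d < 0 for d in vals)
    if n_up + n_down == 0:
        return "NED"
    if binomial_two_tailed_p(n_up, n_up + n_down) < alpha:
        return "increase" if n_up > n_down else "decrease"
    return "NED"                             # No Effect Determined


print(classify([1.1, 2.3, 1.6]))             # only three pairs
print(classify([0.5] * 6))                   # six pairs, all up
print(binomial_two_tailed_p(47, 68))         # 47 up / 21 down, as in the rule shown earlier
```

Under these assumptions six pairs is the smallest count that can pass: six changes in the same direction give p = 2 × 0.5^6 ≈ 0.031, whereas five give 0.0625.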

Molecule Problem Solving - RuleDesign™
RuleDesign™ (formerly “Compounds From Rules”)
• Exploitable Knowledge is a Rule database derived from MMPA
• User puts in a problem molecule with a property they wish to improve
– e.g. solubility, metabolism, hERG…
• System generates potentially improved molecules based on data
[Diagram: Problem molecule + property to improve → Enumerator System (using Exploitable Knowledge) → Solution molecules]
Watch RuleDesignTM on YouTube https://www.youtube.com/watch?v=nQxXddJDTfc
“..it’s like asking 150 of your peers for ideas in just a few seconds”
- Principal Scientist (large pharma)

Looking at the results
• Results sorted in increasing RMM (Mol Weight)
• Yellow highlight is the overlap with the input compound
• One column per assay – colour and direction
– LogD decrease, Sol increase
• Hyperlink to “Drill back” to the original data

“Multi-Step” transformations
Shibuya Crossing, Tokyo
[Diagram: points A, B, C, D, E and F marked on the crossing]
Would you go in steps via A -> B -> C?
How would you know to go E -> F?
Or go straight there via D, if the data said it was good?
A Turing test for molecular generators
Green, D.; et al. J. Med. Chem. 2020
https://doi.org/10.1021/acs.jmedchem.0c01148

How many pairs? – deeper Goal setting
[Screen shot labels: Specific Goal settings; Non-rules transformations from pair counts]
‘All Rules’
– all of the Increase and Decrease Rules for all datasets
– warning: output can be large
– not suitable for an Excel spreadsheet
‘Hit to Lead’
– most frequent transformations chemists perform
‘Min 3 pair Trans’
– all transformations with 3 OR MORE matched pairs
‘Min 6 pair Trans’
– all transformations with 6 OR MORE matched pairs
– actually Increase, Decrease, Neutral and NED

Broad Rule Sets
• “Rules” for increasing
“potency” are gathered by
MMPA
• Individual assay Rules
(numbers in brackets) are
grouped as a “Broad” Goal
• Example Dopamine Rules
number 3548 (screen shot)
• Therefore new hits for a new
Dopamine target can have
these Rules applied [What
worked in the past?]

Permutative MMPA
• Take all compounds in a data set
• Find all matched pairs & extract ΔpIC50 and the transforms between them
• Aggregate transformations with median ΔpIC50 and count of pairs
• Apply all transformations back to the initial data set (at the most specific environment level). NO R GROUP MAPPING REQUIRED!
• Predicted pIC50 = substrate pIC50 + median ΔpIC50
• Remove existing compounds
• Prioritize new compounds by pIC50 estimate
[Diagram: molecules M1 to M5 related by transform t1, giving new molecule M*. Workflow labels: Internal structures & data; Extract transforms; Apply transforms; New structures & estimated data; Remove existing compounds; Filter and prioritize]
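The Permutative MMPA steps above can be sketched with a deliberately simplified data model. Everything in this block is hypothetical: molecules are represented as frozensets of fragment labels standing in for real structures, a "transform" is a single fragment swap rather than a SMIRKS with an environment level, and keeping the highest estimate when several routes reach the same new molecule is a choice made only for this example.

```python
from collections import defaultdict
from itertools import permutations
from statistics import median


def find_transforms(data):
    """data: {frozenset of fragment labels: pIC50} -> {(frag_out, frag_in): [delta pIC50]}."""
    deltas = defaultdict(list)
    for a, b in permutations(data, 2):
        out, inn = a - b, b - a
        if len(out) == 1 and len(inn) == 1:          # molecules differ by one swap
            deltas[(next(iter(out)), next(iter(inn)))].append(data[b] - data[a])
    return deltas


def permutative_mmpa(data):
    """Apply every extracted transform back to every compound and rank the new ones."""
    aggregated = {t: (median(d), len(d)) for t, d in find_transforms(data).items()}
    suggestions = {}
    for mol, pic50 in data.items():
        for (out, inn), (median_delta, n_pairs) in aggregated.items():
            if out not in mol:
                continue
            new_mol = (mol - {out}) | {inn}
            if new_mol in data:                      # remove existing compounds
                continue
            estimate = pic50 + median_delta          # substrate pIC50 + median delta
            if new_mol not in suggestions or estimate > suggestions[new_mol][0]:
                suggestions[new_mol] = (estimate, n_pairs)
    return sorted(suggestions.items(), key=lambda kv: kv[1][0], reverse=True)


toy_data = {
    frozenset({"scaffold1", "H"}): 5.0,
    frozenset({"scaffold1", "F"}): 5.8,
    frozenset({"scaffold2", "H"}): 6.0,
}
ranked = permutative_mmpa(toy_data)
for mol, (estimate, n_pairs) in ranked:
    print(sorted(mol), round(estimate, 2), n_pairs)
```

In the toy data the only unmade combination is scaffold2 with F, and both routes to it (H to F on scaffold2, or scaffold1 to scaffold2 on the F compound) give the same additive estimate.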

Exploit Own or Patent Data
[Workflow, external route: External patents & data; Extract transforms; Apply transforms; Filter and prioritize]
[Workflow, internal route: Internal structures & data; Extract transforms; Apply transforms; New structures & estimated data; Remove existing compounds; Filter and prioritize]

Client Oncology PPI project example
• 386 patent compounds analyzed
• 6024 pair relationships found (39% - good number of MMPs)
• Permutative MMPA process:
• Apply to own series,
• Then filter:
• remove undesirable substructure
• Estimated potency >= 6.5,
• clogP <= 2.5
• 52 suggestions
Measurement = p(TR-FRET nucleotide exchange assay pIC50) or
estimated pIC50 from seed value + ΔpIC50
Explainable
• Visible, original real world compounds and
measurement
Actionable
• Prioritises ‘realistic’ next step compounds.
[Plot: PPI pIC50 vs cLogP; points marked as molecule suggestions yes / no]

Regression Forest Models
• Features are acid, base, hydrogen bond
donor, acceptor, hydrophobe, aromatic
attachment, aliphatic attachment and
halogen. Definitions are highly engineered
[SMARTS]
• Feature 1 – topological dist - Feature 2
• Engineered for chemical relevance –
features can be superimposed or directly
linked, e.g. enables a group to be both a
hydrogen bond acceptor and a base
• A bit identifies a pharmacophore pair
e.g. : Aromatic - 3 bonds - Base
• Used as unfolded 360 bit fingerprints
• Regression Forest as ML method
• Build models with 10 fold CV – report
CV Pearson’s R² and CV RMSE
• Build RF error model to generate
predicted error for each compound
using the same descriptors

MedChemica Advanced Pharmacophore Pairs
Feature: Definition
Basic Group: Atom or group most likely protonated at pH 7.4
Acidic Group: Atom or group most likely deprotonated at pH 7.4, includes N and C acids
Acceptor: Definitions derived from Taylor, Cosgrove et al
Donor: Definitions derived from Taylor, Cosgrove et al
Hydrophobic: C4 or greater cyclic or acyclic alkyl group
Aromatic Attachment: connection of any group to an aromatic atom, excluding connections within rings
Aliphatic Attachment: connection of any atom to an aliphatic group not in a ring
Halo: F, Cl, Br, I
Reference for donor / acceptor feature definitions:
Taylor, R.; Cole, J. C.; Cosgrove, D. A.; Gardiner, E. J.; Gillet, V. J.; Korb, O. J Comput Aided Mol Des 2012, 26 (4), 451–472.
Acid & Base definitions are SMARTS including C, N and heteroaromatic acids, and bases excluding weak aniline bases but including
amidines and guanidines - MedChemica definitions.
Gobbi, A.; Poppinger, D. Biotechnology and Bioengineering 1998, 61 (1), 47–54.
Reutlinger, M.; Koch, C. P.; Reker, D.; Todoroff, N.; Schneider, P.; Rodrigues, T.; Schneider, G. Mol. Inf. 2013, 32 (2), 133–138.

Regression Forest & Pharmacophore understanding
• hERG – auditable models
• Identify important chemical features driving potency
• Predict hERG potency from RF model [10 fold CV]
Pharmacophore fp length: 280
Cross-validation: 10 fold
Compounds in training: 6196
RMSE: 0.37
CV R²: 0.51

Examples of exact Pharmacophore Pairs
HBA-same_group-Base; HBA-1_atom-HBD; Base-2_atom-Ar
Topological distances are precisely specified and can be exactly visualized on the
molecules – no ambiguity over which features are correlated with activity
Critically – enables interrogation and validation of SAR understanding
Record as an unfolded fingerprint of 360 bits, 1 or 0 for presence or absence of a
feature-distance-feature pair
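To make the idea of an unfolded feature-distance-feature fingerprint concrete, here is one hypothetical bit layout. The slides give the eight feature types and the 360-bit length but not the layout itself; the version below assumes unordered feature pairs (36 of them, including a feature paired with its own type) and ten distance bins (0 for "same group", then 1 to 9), which happens to give 36 × 10 = 360 bits. Feature perception from SMARTS is not shown: the observed pairs are supplied by hand.

```python
from itertools import combinations_with_replacement

FEATURES = [
    "acid", "base", "donor", "acceptor",
    "hydrophobe", "aromatic_attachment", "aliphatic_attachment", "halogen",
]
N_DISTANCES = 10  # assumed bins: 0 = same group, 1..9 = topological distance

# Every unordered feature pair gets a block of N_DISTANCES consecutive bits
PAIRS = list(combinations_with_replacement(FEATURES, 2))
BIT_INDEX = {
    (a, b, d): i * N_DISTANCES + d
    for i, (a, b) in enumerate(PAIRS)
    for d in range(N_DISTANCES)
}
BIT_LABEL = {index: key for key, index in BIT_INDEX.items()}


def fingerprint(observed_pairs):
    """Set one bit per observed (feature, feature, distance) triple; order-insensitive."""
    bits = [0] * len(BIT_INDEX)
    for a, b, d in observed_pairs:
        key = (a, b, d) if (a, b, d) in BIT_INDEX else (b, a, d)
        bits[BIT_INDEX[key]] = 1
    return bits


# "Aromatic - 3 bonds - Base" and "HBA-same_group-Base" from the slides
fp = fingerprint([("aromatic_attachment", "base", 3), ("acceptor", "base", 0)])
on_bits = [BIT_LABEL[i] for i, bit in enumerate(fp) if bit]
print(len(fp), on_bits)
```

Because no hashing or folding is involved, every bit maps back to exactly one named pair, which is what lets feature importances be drawn onto the molecule.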

Regression Forest & Pharmacophore understanding
• hERG – auditable models
• Predict hERG potency from RF model [10 fold CV]
• Example CHEMBL12713 sertindole: RF prediction pIC50 7.8
• Colour structure by feature importance
weighted sum of pharmacophore pair
fingerprints – show the chemists where the
hotspots are.
• Drill deeper to show the most important positive
and negative features.
Positive feature example:
median_with: 5.1
median_without: 4.7
median_diff: 0.4
n_examples_with: 4585
n_examples_without: 1383
Negative feature example:
median_with: 5.1
median_without: 5.3
median_diff: -0.2
n_examples_with: 3106
n_examples_without: 2862

Explainable – chemists can see the parts of the molecule that count
Explainable
• Highlighted features show the chemist the contribution to the
prediction
Actionable
• Which parts should be optimized to achieve the Goal
Explainable
• Nearest Neighbours show original data on which model is built
Actionable
• What weight do I put on this result? How likely is it? Do we test?

RF and kNN are good but……
• The models are good but could be great or even superb..
• Analysis of error identifies the exact “functional groups” that are less accurately
predicted
• A feedback loop could design compounds to improve the models → testing
• “Either not enough or the wrong sort of data – the downfall of AI in Life Science?” – Dossetter, A.G.
https://www.linkedin.com/pulse/either-enough-wrong-sort-data-downfall-ai-life-al-dossetter/
Using the model RMSE to estimate error:
78% of measured values fall in the range prediction ± RMSE
Overview
Generate virtual compounds from MCPairs MMPA
• Hit-to-Lead transformations – the most used medicinal chemistry
• ADMET transformations for metabolism and solubility
• Target class transformations learning from target analogues
• E.g. Dopamine Rule
Regression forest models
• Accurate pharmacophore features with topological distance
• Unfolded fingerprints connect feature importance to
pharmacophores
• Error models give accuracy of prediction for each compound
Active Learning
• Explore Strategy - predicted high potency, high error
• Exploit Strategy - predicted high potency, low error

Active Learning
[Cycle diagram: Hits → Build model with error estimates → Enumerate → Select for Explore and Exploit → Synthesise & Test → Compounds with data → Compounds meet criteria? (Yes / No)]
STRATEGIES
Explore: prioritize high error
Exploit: prioritize high potency & low error
Ratio of explore to exploit varies with stage
Select enumeration strategy by stage: hit-to-lead, target class, solubility, metabolism
For in silico simulation, match to known and measured compounds
System operational
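The two selection strategies above could be combined in a single picking step along the following lines. This is a sketch under stated assumptions, not the system described in the slides: the `Candidate` class, the `select` function, the use of "predicted pIC50 minus predicted error" as the exploit score and the fixed explore fraction are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    pred_pic50: float   # model prediction
    pred_error: float   # error-model estimate for this compound


def select(candidates, n_total, explore_fraction):
    """Pick n_total compounds: some for exploration, the rest for exploitation."""
    n_explore = round(n_total * explore_fraction)
    # Explore: prioritize high predicted error
    explore = sorted(candidates, key=lambda c: c.pred_error, reverse=True)[:n_explore]
    taken = {c.name for c in explore}
    remaining = [c for c in candidates if c.name not in taken]
    # Exploit: high potency and low error, scored here as prediction minus error
    exploit = sorted(
        remaining, key=lambda c: c.pred_pic50 - c.pred_error, reverse=True
    )[: n_total - n_explore]
    return explore, exploit


pool = [
    Candidate("A", 7.0, 0.2),
    Candidate("B", 6.5, 0.9),
    Candidate("C", 6.8, 0.3),
    Candidate("D", 5.5, 1.2),
    Candidate("E", 6.0, 0.1),
]
explore, exploit = select(pool, n_total=3, explore_fraction=1 / 3)
print([c.name for c in explore], [c.name for c in exploit])
```

Raising `explore_fraction` early in a project and lowering it later is one way to express the idea that the explore/exploit ratio varies with stage.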

Active Learning – V1
Challenges:
• How to get started when you only have a few
compounds to model build from
• limited synthesis resource
D2 Case study
• Start with 30 literature compounds:
5 <= pIC50 <= 6, -1 < AlogP < 3.5, selected by
LLE sort (literature contains 5200 compounds)
• Build RF model CV-R2 -0.26, small data set
• Enumerate from all compounds:
• What is the best enumeration strategy?
– how to pick the (few)compounds to make from the
enumerated set?
– Enumeration is a success if we match literature
compounds (very stringent test)
– Have we learnt all that the initial set of compounds
can teach us?
Round 1
Strategy (MMPA)              Number of compounds generated   Number of matches to D2 known set   Maximum pIC50 (actual)   Maximum pIC50 (predicted [error])
Hit-to-Lead                  682                             10                                  7.8                      5.5 [0.21]
Dopamine class               469                             8                                   7.9                      5.5 [0.23]
Solubility                   10148                           10                                  7.8                      5.5 [0.21]
Metabolism                   12729                           19                                  7.9                      5.5 [0.21]
Permutative MMPA (env = 4)   5                               3                                   7.9                      6.1 [?]
[Plot: D2 pIC50 vs cLogP]
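The way the 30 starting compounds were chosen for the D2 case study (potency and AlogP windows, then an LLE sort) can be written as a short filter. The sketch assumes the usual definition LLE = pIC50 − logP with AlogP used as the logP value, and assumes the sort keeps the highest-LLE compounds; the record layout and the example compounds are invented.

```python
def lle(compound):
    """Ligand lipophilicity efficiency, taken here as pIC50 minus AlogP."""
    return compound["pIC50"] - compound["AlogP"]


def select_seed_set(compounds, n=30):
    """Keep compounds inside the potency and AlogP windows, then take the top n by LLE."""
    window = [
        c for c in compounds
        if 5 <= c["pIC50"] <= 6 and -1 < c["AlogP"] < 3.5
    ]
    return sorted(window, key=lle, reverse=True)[:n]


# Invented records standing in for the literature set
literature = [
    {"id": "a", "pIC50": 5.5, "AlogP": 2.0},
    {"id": "b", "pIC50": 7.9, "AlogP": 3.0},   # too potent for a 'hit' start point
    {"id": "c", "pIC50": 5.2, "AlogP": 0.5},
    {"id": "d", "pIC50": 5.9, "AlogP": 4.1},   # outside the AlogP window
    {"id": "e", "pIC50": 6.0, "AlogP": 3.0},
]
seed = select_seed_set(literature, n=2)
print([c["id"] for c in seed])
```

Excluding the already-potent compounds is what makes the in silico simulation meaningful: the test is whether enumeration from weak hits can rediscover the potent literature compounds.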

D2 worked example – The p-MMPA
Predicted: pIC50 6.1, actual pIC50 7.9
Finding all the MMP SAR that is present and
applying it exhaustively including behind the
Pareto frontier.
[Plot: D2 pIC50 vs cLogP]

Active Learning v2
System under development
[Cycle diagram: Hits → Compounds with data → P-MMPA (under development) → Compounds with data → Build model with error estimates → Enumerate → Select for Explore and Exploit → Synthesise & Test → Compounds with data → Compounds meet criteria? (Yes / No)]
Explore: prioritize high error
Exploit: prioritize high potency & low error
Ratio of explore to exploit varies with stage
Enumerate by: target class, solubility, metabolism
Need initial “induction phase” before cyclic
automated active learning can be applied

Like the opening in a chess game
• “The first moves of a chess game are
termed the "opening" or "opening
moves". A good opening will provide
better protection of the King, control
over an area of the board (particularly
the centre), greater mobility for pieces,
and possibly opportunities to capture
opposing pawns and pieces.” A Beginner's
Garden of Chess Openings - David A. Wheeler
• Success or failure of an
automated active learning
system could be like the first few
moves of a chess game – they shape
the game…
• Will it always need a human
intervention (or ten…)?
…set up for either Queen’s Gambit, King’s Indian Defense,
Nimzo-Indian, Bogo-Indian, Queen’s Indian Defense, and
Dutch Defense.

Learning from First Experiments….
• MMPA and RF work together to suggest and rank compound designs
• Strategies explored
– Explore: prioritize high error
– Exploit : prioritize high potency & low error
• Ratio of explore / exploit varies with stage
• The initial phase from a small number of hits is a challenge
– Hit-to-Lead / ADMET Rules did not match compounds in literature
– Victims of what is published
– Requires full datasets
– Process can get “stuck”
• Human intervention may always be required
• Both MMPA and RF can select compounds to make to improve models –
analysis of error.
• Permutative-MMPA works very well (of course)
• Where AI could help is as a compound selector, depending on strategy

• Dr Alexander G. Dossetter
• Managing Director, MedChemica Ltd
• al.dossetter@medchemica.com
• MedChemica
• Lauren Reid
• Jessica Stacey
• Phil De. Sousa
• Shane Montague
• Edward J. Griffen
• Andrew G. Leach
• Available on SlideShare - search for Dossetter
• Twitter @MedChemica
• Twitter #BucketListPapers
• https://www.medchemica.com/bucket-list/
Thank you

About MedChemica
>10 years’ experience in building A.I. systems for drug discovery

• Founded in 2012 by AZ AP Medicinal / Computational chemists
to accelerate drug hunting by exploiting data driven knowledge
• Domain leaders in SAR knowledge extraction and knowledge
based design
• > 11 years’ experience of building AI systems that suggest
actions to chemists (7 years as MedChemica)
• Creators of largest ever documented database of medicinal
chemistry ADMET knowledge
MedChemica Publications

AI Software Platforms
– Complete In-house platform
– Analysis of own data and automated
updating
– Design tool access to all chemists
– Custom fitting (Software-as-a-Service)
One stop GUI
Design tool
Biotech, Universities and
Foundations
Medium to large pharma,
agrochemical and materials
research
– Secure web-based AI design platform
– CHEMBL, Patent data analysed
– Merged into one knowledgebase

Science As A Service (SaaS)
Research phases: Target ID; Hit Screening; Lead Identification; Lead Optimisation; Pre-Clinical
Bespoke Advanced Analytics and Computational Chemistry services throughout the research phase
• AI H2L design sets
• Compound design to solve ADMET and potency issues
• Third party compound assessment
• Directed virtual screening for hit matter
• Library design for novel protein targets
• AI Toxophore assessment
• Patent analysis
• Pharmacophore profiling
• Generating IP for clients [Scaffold hops]
• Collection evaluation and enhancement

Panel Discussion:
What should the Medicinal Chemistry Discipline be
like in 10 years?
SlideShare - search for Dossetter
Twitter @MedChemica
Twitter @covid_moonshot
Twitter #BucketListPapers
https://www.medchemica.com/bucket-list/
