3. Structure Activity Landscapes
• Rugged gorges or rolling hills?
– Small structural changes associated with large
activity changes represent steep slopes in the
landscape
– But traditionally, QSAR assumes gentle slopes
– Machine learning is not very good for special
cases
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
4. Characterizing the Landscape
• A cliff can be numerically characterized
• Structure Activity Landscape Index (SALI)
$$\mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)}$$
• Cliffs are characterized by elements of the
matrix with very large values
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
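A minimal sketch of how the SALI matrix could be computed, assuming the activities are given as a vector and the pairwise similarities (e.g. Tanimoto on fingerprints) as a symmetric matrix. The function name `sali_matrix` and the toy numbers are illustrative assumptions, not taken from the slides. Pairs with similarity 1 have an undefined SALI and are set to NaN here, which is one possible convention.

```python
import numpy as np

def sali_matrix(activities, similarity):
    """Pairwise SALI values; undefined entries (similarity of 1) become NaN."""
    a = np.asarray(activities, dtype=float)
    sim = np.asarray(similarity, dtype=float)
    diff = np.abs(a[:, None] - a[None, :])   # |A_i - A_j| for every pair
    denom = 1.0 - sim                         # 1 - sim(i, j)
    sali = np.full(sim.shape, np.nan)
    ok = denom > 0                            # skip identical structures and the diagonal
    sali[ok] = diff[ok] / denom[ok]
    return sali

# Hypothetical toy data: three molecules with made-up activities and similarities
activities = [5.0, 7.0, 5.5]
similarity = np.array([[1.0, 0.9, 0.2],
                       [0.9, 1.0, 0.3],
                       [0.2, 0.3, 1.0]])
sali = sali_matrix(activities, similarity)
```

In this toy case the pair (0, 1) is very similar but differs by two activity units, so it gets by far the largest SALI value and would be flagged as a cliff.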
7. Descriptor Space Smoothness
[Figure: number of edges in the SALI graph (0 to 400) against SALI cutoff (0.0 to 1.0). Insets show the SALI graph at selected cutoffs, with nodes labeled by compound name, e.g. astemizole, cisapride, E-4031, pimozide, sertindole, dofetilide, terfenadine.]
• Edge count of the SALI graph for varying cutoffs
• Measures smoothness of the descriptor space
• Can reduce this to a single number (AUC)
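The edge count curve and its AUC could be sketched as follows. This assumes that the cutoff is a fraction of the range between the smallest and largest SALI value and that a pair becomes an edge when its SALI value is at or above the threshold. Both conventions, and the names `edge_count_curve` and `curve_auc`, are assumptions made for illustration.

```python
import numpy as np

def edge_count_curve(sali, n_cutoffs=101):
    """Number of SALI graph edges retained at each fractional cutoff in [0, 1]."""
    s = np.asarray(sali, dtype=float)
    vals = s[np.triu_indices_from(s, k=1)]    # each unordered pair once
    vals = vals[np.isfinite(vals)]            # drop undefined SALI values
    lo, hi = vals.min(), vals.max()
    cutoffs = np.linspace(0.0, 1.0, n_cutoffs)
    thresholds = lo + cutoffs * (hi - lo)     # assumed scaling of the cutoff
    counts = np.array([(vals >= t).sum() for t in thresholds])
    return cutoffs, counts

def curve_auc(cutoffs, counts):
    """Area under the edge count curve by the trapezoid rule."""
    widths = np.diff(cutoffs)
    heights = (counts[:-1] + counts[1:]) / 2.0
    return float(np.sum(widths * heights))

# Hypothetical symmetric SALI matrix for four molecules
toy = np.array([[np.nan, 2.0, 0.0, 4.0],
                [2.0, np.nan, 6.0, 8.0],
                [0.0, 6.0, np.nan, 10.0],
                [4.0, 8.0, 10.0, np.nan]])
cutoffs, counts = edge_count_curve(toy, n_cutoffs=11)
auc = curve_auc(cutoffs, counts)
```

A curve that drops quickly (small AUC) means few pairs have large SALI values relative to the maximum, while a slowly decaying curve suggests a rougher landscape under this descriptor space.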
8. Other Examples
• Instead of fingerprints, we use molecular
descriptors
• SALI denominator now uses Euclidean distance
• 2D & 3D random descriptor sets
– None are really good
– Too rough, or
– Too flat
[Figure: two panels, 2D and 3D, each showing number of edges in the SALI graph (0 to 400) against SALI cutoff (0.0 to 1.0).]
12. Model Search Using the SCI
• We’ve used the SALI to retrospectively analyze
models
• Can we use SALI to develop models?
– Identify a model that captures the cliffs
• Tricky
– Cliffs are fundamentally outliers
– Optimizing for good SALI values implies overfitting
– Need to trade off between SALI & generalizability
13. The Objective Function
• S0 is a measure of the model’s ability to
summarize the dataset (analogous to RMSE)
• S100 measures the model’s ability to capture cliffs
• Ideally, the curve starts high and stays high
[Figure: S(X) (0.6 to 1.0) against SALI cutoff (0.0 to 1.0), with the points S0 and S100 marked on the curve.]
$$F = \frac{1}{S_{100}}, \qquad F = \frac{1}{S_0} + \frac{S_{100} - S_0}{2}, \qquad F = \frac{1}{\mathrm{SCI}}$$
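The three objective functions on this slide, F = 1/S100, F = 1/S0 + (S100 − S0)/2 and F = 1/SCI, can be written out as a short sketch. It assumes S0, S100 and SCI have already been computed from a model's SALI curve and that the search minimizes F; the function names are hypothetical. Because the reciprocals are only meaningful for positive values, the sketch rejects non-positive inputs, which is a choice made here rather than something stated on the slide.

```python
def _check_positive(value, name):
    # The reciprocal terms are undefined at 0 and change sign below it
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")

def f_s100(s100):
    """F = 1 / S100: rewards models that capture the largest cliffs."""
    _check_positive(s100, "S100")
    return 1.0 / s100

def f_s0_d(s0, s100):
    """F = 1 / S0 + D / 2, with D = S100 - S0."""
    _check_positive(s0, "S0")
    return 1.0 / s0 + (s100 - s0) / 2.0

def f_sci(sci):
    """F = 1 / SCI: rewards a curve that is high over the whole cutoff range."""
    _check_positive(sci, "SCI")
    return 1.0 / sci

# Hypothetical values read off a SALI curve
scores = (f_s100(0.8), f_s0_d(0.5, 0.9), f_sci(0.5))
```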
14. SALI Based Model Selection
• Considered the BZR dataset from Sutherland et al
• Identified “best” models using a GA to select
from a pool of 2D descriptors
• While SALI based optimization can lead to a
“better” curve, it doesn’t give the best model
[Figure: S(X) (-0.5 to 0.5) against SALI cutoff for models selected with the RMSE, SCI and S(100) objective functions. Two panels: full cutoff range (0.0 to 1.0) and a zoom on 0.00 to 0.08.]
Sutherland, J et al, J. Chem. Inf. Comput. Sci., 2003, 43, 1906‐1915
15. SALI Based Model Selection
• 107 aryl azoles as ER‐β agonists
• Used a GA and 2D descriptors to identify models
• In this case, a SALI based objective function
was able to identify the best model
• Interestingly, SCI does not seem to perform
very well
[Figure: S(X) (-0.5 to 0.5) against SALI cutoff for models selected with the RMSE, SCI and S(0) + D/2 objective functions. Two panels: full cutoff range (0.0 to 1.0) and a zoom on 0.00 to 0.08.]
Malamas, M.S. et al, J Med Chem, 2004, 47, 5021‐5040
16. SALI Based Model Selection
• The size of the solution space explored
depends on the SALI objective function
[Figure: two panels, BZR and ER‐β, showing the RMSE of the models found under each objective function. Objective functions on the x axes: RMSE, S(100), SCI and 1/S(0) + D/2.]
17. Predicting the Landscape
• Rather than predicting activity directly, we can
try to predict the SAR landscape
• Implies that we attempt to directly predict cliffs
– Observations are now pairs of molecules
• A more complex problem
– Choice of features is trickier
– Still face the problem of cliffs as outliers
– Somewhat similar to predicting activity differences
Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115‐122
18. Predicting Cliffs
• Dependent variables are pairwise SALI values,
calculated using fingerprints
• Independent variables are molecular
descriptors – but considered pairwise
– Absolute difference of descriptor pairs, or
– Geometric mean of descriptor pairs
– …
• Develop a model to correlate pairwise
descriptors to pairwise SALI values
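The two pairwise descriptor constructions named above can be sketched with NumPy. The descriptor matrix here is random placeholder data, and `pairwise_features` is a hypothetical helper name. The geometric mean as written assumes non-negative descriptor values; how negative descriptors were handled is not stated on the slide.

```python
import numpy as np

def pairwise_features(descriptors, how="absdiff"):
    """Combine per-molecule descriptors into one feature row per molecule pair."""
    x = np.asarray(descriptors, dtype=float)
    i, j = np.triu_indices(x.shape[0], k=1)   # all unordered pairs (i < j)
    if how == "absdiff":
        feats = np.abs(x[i] - x[j])           # absolute difference per descriptor
    elif how == "geomean":
        feats = np.sqrt(x[i] * x[j])          # geometric mean; needs values >= 0
    else:
        raise ValueError(f"unknown combination: {how}")
    return i, j, feats

# Placeholder data: 30 molecules, 3 made-up non-negative descriptors
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=(30, 3))
i, j, absdiff = pairwise_features(x, "absdiff")
_, _, geomean = pairwise_features(x, "geomean")
```

Each row of `absdiff` or `geomean` lines up with the SALI value of the same pair (i, j), so the pairwise SALI values can be used directly as the response for a regression model. With 30 molecules this gives 30 × 29 / 2 = 435 rows.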
19. A Test Case
• We first consider the Cavalli CoMFA dataset of 30
molecules with pIC50’s
• Evaluate topological and physicochemical
descriptors
• Developed random forest
models
– On the original observed
values (30 obs)
– On the SALI values
(435 observations, one per pair of molecules)
Cavalli, A. et al, J Med Chem, 2002, 45, 3844‐3853
20. Double Counting Structures?
• The dependent and independent variables both
encode structure.
• But pretty low correlations between individual
pairwise descriptors and the SALI values
[Figure: histograms of R2 (0.00 to 0.15) between individual pairwise descriptors and SALI values, as percent of total, in two panels: GeoMean and AbsDiff.]
22. Test Case 2
• Considered the Holloway docking dataset, 32
molecules with pIC50’s and Einter
• Similar strategy as before
• Need to transform SALI values
• Descriptors show minimal correlation
[Figure: histograms (percent of total) of SALI (0 to 120) and log10(SALI) (-1 to 2).]
Holloway, M.K. et al, J Med Chem, 1995, 38, 305‐317
25. Model Caveats
• Models based on SALI values are dependent
on there being an SAR in the original activity
data
• Scrambling results for these models are
poorer than the original models but aren’t as
random as expected
[Figure: predicted SALI against observed SALI, both axes 0 to 6.]
26. SALI in Bulk
• Much of this material is exploratory
• So we’re interested in trends across many assays
• ChEMBL is an excellent source for activity cliffs
• Assay selection
– Human target, binding assay
– High confidence (score = 9)
– Number of compounds between 75 & 300
– Only consider non‐NA activity values
– Censored data is considered the same as exact data
– 31 assays
• We identify datasets with activity cliffs by the skewness
of the dependent variable
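A minimal sketch of the skewness screen, using the standard moment-based sample skewness (third central moment divided by the 1.5 power of the second). Which estimator was actually used is not stated, so this choice and the name `sample_skewness` are assumptions. A strongly right-skewed dependent variable, with a few very large values in a mass of small ones, is what a dataset containing cliffs would be expected to show.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5, ignoring NA values."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]                  # only consider non-NA values
    centered = v - v.mean()
    m2 = np.mean(centered ** 2)          # second central moment
    m3 = np.mean(centered ** 3)          # third central moment
    return float(m3 / m2 ** 1.5)         # undefined if all values are equal

# Made-up examples: a symmetric vector and one with a single large value
symmetric = sample_skewness([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = sample_skewness([1.0, 1.0, 1.0, 1.0, 10.0, np.nan])
```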
29. Conclusions
• SALI is the first step in characterizing the SAR
landscape
• Allows us to directly analyze the landscape, as
opposed to individual molecules
• Being able to predict the landscape could
serve as a useful way to extend an SAR
landscape
30. Acknowledgements
• John Van Drie
• Gerry Maggiora
• Mic Lajiness
• Jurgen Bajorath
33. ER‐β Dataset
• 107 molecules, censored data
taken as exact
• A few big cliffs
• The best linear model performs
decently
[Figure: two panels. Histogram of pIC50 (frequency, -4.0 to -0.5), and predicted pIC50 against observed pIC50 (both -4 to -1).]
34. Different Activity Representations
• Using the Hill parameters from a dose‐response
curve represents richer data than a single IC50
[Figure: dose‐response curve of activity against concentration, annotated with the Hill parameters S0, Sinf, AC50 and H.]
$$\mathrm{SALI}_{i,j} = \frac{d(P_i, P_j)}{1 - \mathrm{sim}(i,j)}$$