What Makes a Good Structure Activity Landscape?

Rajarshi Guha
NIH Chemical Genomics Center

August 25, 2010
National ACS Meeting, Boston

Outline

•  Selection with SALI
•  Predicting the landscape
•  SALI in bulk

Structure Activity Landscapes

•  Rugged gorges or rolling hills?
–  Small structural changes associated with large activity changes represent steep slopes in the landscape
–  But traditionally, QSAR assumes gentle slopes
–  Machine learning is not very good for special cases
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535

Characterizing the Landscape

•  A cliff can be numerically characterized
•  Structure Activity Landscape Index (SALI)

$$\mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)}$$

•  Cliffs are characterized by elements of the matrix with very large values

Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
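
A minimal sketch of the SALI matrix may make the definition concrete. It assumes activities are plain numbers (for example pIC50 values) and fingerprints are sets of on-bit indices compared with Tanimoto similarity. The function names, the toy data and the handling of identical fingerprints are illustrative choices, not taken from the cited paper.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali_matrix(activities, fingerprints):
    """Symmetric matrix of SALI(i, j) = |A_i - A_j| / (1 - sim(i, j))."""
    n = len(activities)
    sali = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = abs(activities[i] - activities[j])
            sim = tanimoto(fingerprints[i], fingerprints[j])
            if sim < 1.0:
                value = diff / (1.0 - sim)
            else:
                # Assumed convention: identical fingerprints give an infinite
                # cliff unless the activities are also identical.
                value = math.inf if diff > 0 else 0.0
            sali[i][j] = sali[j][i] = value
    return sali


# Toy data (hypothetical): compounds 0 and 1 are similar but differ in activity.
activities = [5.0, 7.0, 5.5]
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8}]
sali = sali_matrix(activities, fingerprints)
print(sali[0][1])  # the similar pair with the large activity gap
```

In this toy set the pair (0, 1) gets the largest value because its similarity is high while its activity gap is large, which is the cliff pattern the index is meant to flag.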

Visualizing SALI Values

•  The SALI graph
–  Compounds are nodes
–  Nodes i,j are connected if SALI(i,j) > X
–  Only display connected nodes
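
The construction in the list above can be sketched in a few lines, assuming the pairwise SALI values are already available as a symmetric matrix. The matrix below is hypothetical, and edges are kept undirected for simplicity.

```python
def sali_graph(sali, cutoff):
    """Return (nodes, edges) of the SALI graph at the given cutoff."""
    n = len(sali)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if sali[i][j] > cutoff]
    # Only compounds that take part in at least one edge are displayed.
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges


# Hypothetical symmetric SALI matrix for four compounds.
sali = [
    [0.0, 5.0, 0.5, 0.2],
    [5.0, 0.0, 1.5, 0.1],
    [0.5, 1.5, 0.0, 0.3],
    [0.2, 0.1, 0.3, 0.0],
]
nodes, edges = sali_graph(sali, cutoff=1.0)
print(nodes, edges)
```

Compound 3 has no SALI value above the cutoff, so it drops out of the displayed graph.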
[Figure: SALI graph; numbered nodes are compounds and edges join pairs whose SALI value exceeds the cutoff]

What Can We Do With SALI’s?

•  SALI characterizes cliffs & non-cliffs
•  For a given molecular representation, SALI values give us an idea of the smoothness of the SAR landscape
•  Models try to encode this landscape
•  Use the landscape to guide descriptor or model selection

Descriptor Space Smoothness
[Figure: number of edges in the SALI graph plotted against the SALI cutoff (0.0 to 1.0); graph nodes are labeled with compound names such as astemizole, cisapride and terfenadine]

•  Edge count of the SALI graph for varying cutoffs
•  Measures smoothness of the descriptor space
•  Can reduce this to a single number (AUC)
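
A rough sketch of the edge-count curve and its area follows. It assumes the 0 to 1 cutoff axis means a fraction of the largest SALI value, which the slides do not state, and it uses the trapezoidal rule for the area. The function names and the input values are hypothetical.

```python
def edge_count_curve(sali_values, n_points=11):
    """Number of SALI graph edges at evenly spaced cutoffs between 0 and 1.

    sali_values is a flat list of finite pairwise SALI values; a cutoff X is
    read here as the fraction X of the largest value (an assumption).
    """
    top = max(sali_values)
    curve = []
    for k in range(n_points):
        cutoff = k / (n_points - 1)
        n_edges = sum(1 for v in sali_values if v > cutoff * top)
        curve.append((cutoff, n_edges))
    return curve


def area_under_curve(curve):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Hypothetical pairwise SALI values.
curve = edge_count_curve([1.0, 2.0, 3.0, 4.0], n_points=5)
print(curve, area_under_curve(curve))
```

A curve that drops quickly gives a small area, while a curve that keeps many edges at high cutoffs gives a large one, so the single number summarizes how the edges are spread over the cutoff range.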

Other Examples

•  Instead of fingerprints, we use molecular descriptors
•  SALI denominator now uses Euclidean distance
•  2D & 3D random descriptor sets
–  None are really good
–  Too rough, or
–  Too flat

[Figure: number of edges in the SALI graph versus SALI cutoff (0.0 to 1.0); one panel for 2D descriptors and one for 3D descriptors]

Feature Selection Using SALI

•  Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves
•  Not entirely clear what type of curve is desirable

Measuring Model Quality

•  A QSAR model should easily encode the “rolling hills”
•  A good model captures the most significant cliffs
•  Can be formalized as

How many of the edge orderings of a SALI graph
  does the model predict correctly?

•  Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
•  Repeat for varying X and obtain the SALI curve
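
One way to score edge orderings is sketched below. It assumes an edge counts as correctly predicted when the model ranks the two compounds in the same order as the observed activities, and it normalizes the score as (correct - incorrect) / edges so that it runs from -1 to 1. Both choices are assumptions made for illustration, and the function name and data are hypothetical.

```python
def s_of_x(edges, observed, predicted):
    """Score how well predictions preserve the activity order along each edge.

    edges is the edge list of the SALI graph at a given threshold X.
    Returns (correct - incorrect) / number of edges, an assumed normalization.
    """
    if not edges:
        return 0.0
    score = 0
    for i, j in edges:
        obs = observed[j] - observed[i]
        pred = predicted[j] - predicted[i]
        if obs * pred > 0:
            score += 1  # same ordering in observed and predicted activities
        elif obs * pred < 0:
            score -= 1  # ordering reversed by the model
    return score / len(edges)


# Hypothetical activities and predictions for three compounds.
observed = [5.0, 7.0, 6.0]
predicted = [5.2, 6.5, 6.8]
edges = [(0, 1), (1, 2), (0, 2)]
print(s_of_x(edges, observed, predicted))
```

Evaluating the function on the edge list obtained at each threshold X gives the points of the curve.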

SALI Curves

[Figure: two panels of S(X) versus X, with S(X) from -1.0 to 1.0 and X from 0.0 to 1.0; legend: 3-descriptor, 5-descriptor and scrambled 3-descriptor models; annotation SCI = 0.12]

Model Search Using the SCI

•  We've used the SALI to retrospectively analyze models
•  Can we use SALI to develop models?
–  Identify a model that captures the cliffs
•  Tricky
–  Cliffs are fundamentally outliers
–  Optimizing for good SALI values implies overfitting
–  Need to trade-off between SALI & generalizability

The Objective Function

•  S0 is a measure of the model's ability to summarize the dataset (analogous to RMSE)
•  S100 measures the model's ability to capture cliffs
•  Ideally, the curve starts high and stays high

[Figure: S(X) versus SALI cutoff, with S0 and S100 marked on the curve]

$$F = \frac{1}{S_{100}}, \qquad F = \frac{1}{S_0} + \frac{S_{100} - S_0}{2}, \qquad F = \frac{1}{\mathrm{SCI}}$$
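
The three candidate objective functions are simple enough to write down directly. This sketch takes S0, S100 and SCI as already computed numbers; the function names and the input values are illustrative. Each formula is undefined when its denominator is zero, so the inputs are assumed to be nonzero.

```python
def objective_s100(s100):
    """F = 1 / S100."""
    return 1.0 / s100


def objective_s0_d(s0, s100):
    """F = 1 / S0 + (S100 - S0) / 2."""
    return 1.0 / s0 + (s100 - s0) / 2.0


def objective_sci(sci):
    """F = 1 / SCI."""
    return 1.0 / sci


# Illustrative inputs only.
print(objective_s100(0.8), objective_s0_d(0.5, 0.9), objective_sci(0.12))
```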

SALI Based Model Selection

•  Considered the BZR dataset from Sutherland et al
•  Identified "best" models using a GA to select from a pool of 2D descriptors
•  While SALI based optimization can lead to a "better" curve, it doesn't give the best model

[Figure: S(X) versus SALI cutoff for models selected with the RMSE, SCI and S(100) objective functions; one panel covers cutoffs 0.0 to 1.0, the other 0.00 to 0.08]

Sutherland, J et al, J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915

•  107 aryl azoles as ER-β agonists
•  Used a GA and 2D descriptors to identify models
•  In this case, a SALI based objective function was able to identify the best model
•  Interestingly, SCI does not seem to perform very well

[Figure: S(X) versus SALI cutoff for models selected with the RMSE, SCI and S(0) + D/2 objective functions; one panel covers cutoffs 0.0 to 1.0, the other 0.00 to 0.08]

Malamas, M.S. et al, J Med Chem, 2004, 47, 5021-5040

•  The size of the solution space explored depends on the SALI objective function

[Figure: RMSE by objective function for the BZR and ER-β datasets; objective functions shown: RMSE, S(100), SCI, 1/S(0) + D/2]

Predicting the Landscape

•  Rather than predicting activity directly, we can try to predict the SAR landscape
•  Implies that we attempt to directly predict cliffs
–  Observations are now pairs of molecules
•  A more complex problem
–  Choice of features is trickier
–  Still face the problem of cliffs as outliers
–  Somewhat similar to predicting activity differences

Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122

Predicting Cliffs

•  Dependent variables are pairwise SALI values, calculated using fingerprints
•  Independent variables are molecular descriptors, but considered pairwise
–  Absolute difference of descriptor pairs, or
–  Geometric mean of descriptor pairs
–  …
•  Develop a model to correlate pairwise descriptors to pairwise SALI values
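
The pairwise descriptor construction can be sketched as follows, assuming each molecule is described by a list of numeric descriptors. The function name and the toy values are hypothetical, and the geometric mean is only meaningful here when the paired descriptor values are non-negative.

```python
import math
from itertools import combinations


def pairwise_features(descriptors, how="absdiff"):
    """Build one feature row per unordered pair of molecules.

    descriptors is a list of per-molecule descriptor lists. The geometric
    mean assumes non-negative descriptor values.
    """
    rows = []
    for i, j in combinations(range(len(descriptors)), 2):
        a, b = descriptors[i], descriptors[j]
        if how == "absdiff":
            rows.append([abs(x - y) for x, y in zip(a, b)])
        elif how == "geomean":
            rows.append([math.sqrt(x * y) for x, y in zip(a, b)])
        else:
            raise ValueError("how must be 'absdiff' or 'geomean'")
    return rows


# Toy descriptor table (hypothetical): three molecules, two descriptors each.
descriptors = [[1.0, 4.0], [4.0, 9.0], [9.0, 1.0]]
print(pairwise_features(descriptors, "absdiff"))
print(pairwise_features(descriptors, "geomean"))
```

For n molecules this produces n(n - 1)/2 rows, so 30 molecules yield 435 pairwise observations. The rows are emitted in the same pair order as `itertools.combinations`, which makes it straightforward to line them up with a list of pairwise SALI values built in the same order.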

A Test Case

•  We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50 values
•  Evaluate topological and physicochemical descriptors
•  Developed random forest models
–  On the original observed values (30 obs)
–  On the SALI values (435 observations)

Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853

Double Counting Structures?

•  The dependent and independent variables both encode structure.
•  But pretty low correlations between individual pairwise descriptors and the SALI values

[Figure: histograms (percent of total) of R2 between individual pairwise descriptors and SALI values; panels GeoMean and AbsDiff, R2 axis from 0.00 to 0.15]

Model Summaries

[Figure: predicted versus observed values for three models. Original pIC50: RMSE = 0.97; SALI, AbsDiff: RMSE = 1.10; SALI, GeoMean: RMSE = 1.04]

•  All models explain similar % of variance of their respective datasets
•  Using geometric mean as the descriptor aggregation function seems to perform best
•  SALI models are more robust due to larger size of the dataset

Test Case 2

•  Considered the Holloway docking dataset, 32 molecules with pIC50 values and Einter
•  Similar strategy as before
•  Need to transform SALI values
•  Descriptors show minimal correlation

[Figure: histograms (percent of total) of SALI, ranging from 0 to 120, and of log10(SALI), ranging from -1 to 2]

Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317

Model Summaries

[Figure: predicted versus observed values for three models. pIC50: RMSE = 1.05; log10(SALI): RMSE = 0.48 for both SALI models]

•  The SALI models perform much more poorly in terms of % of variance explained
•  Descriptor aggregation method does not seem to have much effect
•  The SALI models appear to perform decently on the cliffs, but miss the most significant ones

Model Summaries

[Figure: predicted versus observed values with untransformed SALI. pIC50: RMSE = 1.05; SALI: RMSE = 9.76 and RMSE = 10.01 for the two SALI models]

•  With untransformed SALI values, models perform similarly in terms of % of variance explained
•  The most significant cliffs correspond to stereoisomers

Model Caveats

•  Models based on SALI values are dependent on there being an SAR in the original activity data
•  Scrambling results for these models are poorer than the original models but aren't as random as expected

[Figure: predicted versus observed SALI, both axes from 0 to 6]

SALI in Bulk

•  Much of this material is exploratory
•  So we’re interested in trends across many assays
•  ChEMBL is an excellent source for activity cliffs
•  Assay selection
–  Human target, binding assay
–  High confidence (score = 9)
–  Number of compounds between 75 & 300
–  Only consider non-NA activity values
–  Censored data is considered the same as exact data
–  31 assays
•  We identify datasets with activity cliffs by the skewness of the dependent variable
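
A skewness check of this kind could look like the sketch below. It uses the moment-based sample skewness m3 / m2^1.5, which is one common definition; the slides do not say which estimator or cutoff was used, and the function name and data are hypothetical.

```python
def skewness(values):
    """Moment-based sample skewness, m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data has no skew (convention chosen here)
    return m3 / m2 ** 1.5


# Hypothetical dependent-variable values for two assays.
symmetric = [1.0, 2.0, 3.0]
long_tail = [1.0, 1.0, 1.0, 10.0]
print(skewness(symmetric), skewness(long_tail))
```

A long right tail gives a positive value, so ranking assays by this number is one way to shortlist those where a few values sit far from the bulk.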

SALI in Bulk

•  Used pIC50 values and CDK hashed fingerprints
•  Lots of material to explore here

SALI in Bulk

•  But fingerprint quality is important
•  Assay 379744 has 83 cliffs of infinite height
since 83 pairs of molecules have Tc = 1.0
•  Probably should simply ignore such "identical" molecules
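
Setting aside such pairs before computing SALI is a small filtering step. The sketch below assumes the pairwise Tanimoto coefficients are held in a dictionary keyed by compound index pairs; the function name, the tolerance and the data are hypothetical.

```python
def split_identical_pairs(similarities, tol=1e-12):
    """Split {(i, j): Tc} into usable pairs and pairs with Tc equal to 1."""
    usable, identical = {}, []
    for pair, tc in similarities.items():
        if 1.0 - tc <= tol:
            identical.append(pair)  # 1 - Tc is zero, so SALI would be infinite
        else:
            usable[pair] = tc
    return usable, identical


# Hypothetical similarities: compounds 0 and 1 share a fingerprint.
similarities = {(0, 1): 1.0, (0, 2): 0.4, (1, 2): 0.7}
usable, identical = split_identical_pairs(similarities)
print(usable, identical)
```

Keeping the list of skipped pairs makes it possible to report how many "identical" pairs an assay contains.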

Conclusions

•  SALI is the first step in characterizing the SAR landscape
•  Allows us to directly analyze the landscape, as
opposed to individual molecules
•  Being able to predict the landscape could
serve as a useful way to extend an SAR
landscape

Acknowledgements

•  John Van Drie
•  Gerry Maggiora
•  Mic Lajiness
•  Jurgen Bajorath

Job Openings at NCGC/NCTT

•  Software development (focusing on Tripod)
–  Java, Swing UI, algorithms
•  Research Informatics Scientist
–  Generalist, cheminformatics, comp chem, med chem
•  Collaborate with chemists, biologists
•  Cutting edge problems
•  Lots of fresh data
•  Fun!

ER‐β Dataset
•  107 molecules, censored data taken as exact
•  A few big cliffs
•  The best linear model performs decently

[Figure: predicted versus observed pIC50, and a histogram (frequency) of pIC50]

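Different Activity Representations

•  Using the Hill parameters from a dose-response curve represents richer data than a single IC50

$$P = \begin{pmatrix} S_0 \\ S_{\mathrm{inf}} \\ AC_{50} \\ H \end{pmatrix}, \qquad \mathrm{SALI}_{i,j} = \frac{d(P_i, P_j)}{1 - \mathrm{sim}(i,j)}$$

[Figure: dose-response curve of activity versus concentration, annotated with S0, Sinf, the 50% level and AC50]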
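
A sketch of this variant is given below, assuming d(P_i, P_j) is the Euclidean distance between the two Hill parameter vectors. The slide does not specify the distance, so that choice is an assumption, as are the function name and the parameter values.

```python
import math


def drc_sali(params_i, params_j, sim):
    """SALI with d(P_i, P_j) taken as the Euclidean distance (an assumption)."""
    if sim >= 1.0:
        raise ValueError("similarity of 1.0 gives an undefined or infinite SALI")
    return math.dist(params_i, params_j) / (1.0 - sim)


# Hypothetical Hill parameters in the order (S0, Sinf, log AC50, H).
p_i = (0.0, 100.0, -6.0, 1.0)
p_j = (0.0, 100.0, -9.0, 5.0)
print(drc_sali(p_i, p_j, sim=0.5))
```

Because the four parameters live on different scales, some scaling before taking the distance would probably be needed in practice; that step is left out here.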

SALI Curves from DRCs

•  No difference in major cliffs
•  Some of the minor cliffs are highlighted using
the DRC instead of IC50

Clustering in the Holloway Dataset

[Figure: dendrogram of the numbered compounds from hclust with complete linkage; height axis from 0.5 to 1.0]
