ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
1. Characterizing the Density of Chemical Spaces and
its Use in Outlier Analysis and Clustering
Rajarshi Guha
School of Informatics
Indiana University
17th August, 2007
Cambridge, MA
2. Outline
R-NN curves for diversity analysis
Counting clusters with R-NN curves
Density and domain applicability
3. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Nearest Neighbor Methods
Traditional kNN methods are q
simple, fast, intuitive q q
q
q
q
q
q
Applications in q
q
q
q
q q q
regression & classification q
q
q
q
diversity analysis q
q
q q q
q q
q
Can be misleading if the nearest
q q q
q
q q
q
q q
q q
neighbor is far away q
q
q
q
q
q
q
R-NN methods may be more q
q
q
q
q
suitable
4. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Diversity Analysis
Why is it Important?
Compound acquisition
Lead hopping
Knowledge of the distribution of compounds in a descriptor
space may improve predictive models
5. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Approaches to Diversity Analysis
Cell based q q
q
Divide space into bins q
q q
q
qq
q q
qq
q
q
q
qq
q
q
q
q
q q
q
qq
q
q
qq qq q
q
qq q
qq q q
q q
qq q q q
Compounds are mapped to bins
q q q q
q qqq q qq q
q qqq q qq q
q q q q
q q q q q q qqq q
q q
q q qqq q q q
q q q q q
q q
qq q q
q
q q q q q
qq q q q
q q q qq q
q q
q q
q q q
q
q
q
qq q q
q q q q
Disadvantages q
qq q
q
q
q
Not useful for high dimensional
data
Choosing the bin size can be tricky
Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45
Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
6. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Approaches to Diversity Analysis
q
Distance based q q
q
q
q
q
Considers distance between q
q
q
q
q
compounds in a space q q
q
q
q
q q
q q q q
Generally requires pairwise distance q
q
q
q
calculation q q q
q
q q
q
q q
q q
q q
Can be sped up by kD trees, MVP q
q
q
q
q
trees etc. q
q
q
q
q
Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
7. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
Observations
Consider a query point with a hypersphere, of radius R,
centered on it
For small R, the hypersphere will contain very few or no
neighbors
For larger R, the number of neighbors will increase
When R ≥ Dmax , the neighbor set is the whole dataset
The question is . . .
Does the variation of nearest neighbor count with radius allow us
to characterize the location of a query point in a dataset?
8. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
Observations
Consider a query point with a hypersphere, of radius R,
centered on it
For small R, the hypersphere will contain very few or no
neighbors
For larger R, the number of neighbors will increase
When R ≥ Dmax , the neighbor set is the whole dataset
The question is . . .
Does the variation of nearest neighbor count with radius allow us
to characterize the location of a query point in a dataset?
9. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
Algorithm q
q 50%
q q
Dmax ← max pairwise distance q q
q
q
qq
30%
for molecule in dataset do q q
q q
q
q q
q
q q
q
q
q
q
q
q
q q q
R ← 0.01 × Dmax
q q q
q
q q
q qq q q q q
q q q q
qq q
q
q q q
q 10%
q q q q q q
q
q q q
q 5%
while R ≤ Dmax do
q q q
q qq qq
q q q q q
q qq
q q q q
qq q
q q q
q q q
q q q
q
q qq q q q
q q
q q q
Find NN’s within radius R
q q
q q
q
q
q
q
q q q q
q q q
q
q q q
Increment R
q q
q q q
q q q
q
q q q q q
q
q
q
end while q
end for q
Plot NN count vs. R
Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
10. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Generating an R-NN Curve
qq
qq
qqqq
qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqq
q qqqqqq
qqqqq
q
qqqq
qqqq
250
q
250
qq
qqq
qq
Number of Neighbors
Number of Neighbors
q q
200
200
qq
q
q
q
150
150
qq
q
q
q
q
100
100
q
qq
q
q
50
q
50
qq
q
qq
qq q
qqq
qq
qqqqqqqqqq
qqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqq
0
qqq
qq
0
0 20 40 60 80 100 0 20 40 60 80 100
R R
Sparse Dense
11. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to Numbers
Since R-NN curves are sigmoidal, fit them to the logistic
equation
1 + me −R/τ
NN = a ·
1 + ne −R/τ
m, n should characterize the curve
Problems
Two parameters
Non-linear fitting is dependent on the starting point
For some starting points, the fir does not converge and
requires repetition
12. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to a Number
Determine the value of R where the
lower tail transitions to the linear
portion of the curve S=0
S’’
Solution
Determine the slope at various
points on the curve S’
S=0
Find R for the first occurence of
the maximal slope (Rmax(S) )
Can be achieved using a finite
difference approach
13. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to a Number
Determine the value of R where the
lower tail transitions to the linear
portion of the curve S=0
S’’
Solution
Determine the slope at various
points on the curve S’
S=0
Find R for the first occurence of
the maximal slope (Rmax(S) )
Can be achieved using a finite
difference approach
14. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Characterizing Multiple R-NN Curves
Problem
Visual inspection of curves is useful
3904
for a few molecules
For larger datasets we need to
80
1861
summarize R-NN curves
60
Rmax S
Solution
40
Plot Rmax(S) values for each
molecule in the dataset
20
Points at the top of the plot are
0 1000 2000 3000 4000
located in the sparsest regions Serial Number
Points at the bottom are located in
the densest regions
Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
15. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Outliers
qq
q
q
4000
q H2N
HN N•
Number of Neighbors
O
q
HO
O OH
H2N
O
3000
O O
O HN H
N N
q H
O
H HN O
N N
H O
O O O
2000
NH
HN HO N
q H
N
H HO
O NH
O H2N
N
H
q O HN O
•N N O
O H
1000
NH NH
q
H2N S
q
S •N
q HN
q O NH2
O
q
qq
OH NH
qq
qq O HN
qq HN OH
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq HN
0
NH
H
O N O
H
S N O
N NH
0 20 40 60 80 100 H
O OH O
O
O
O
R HN
NH
NH
O
H2N
N• NH2
HN
3904 •N
NH
NH2
O
HO
q
qqqqqqqqqqqqqq
qqqqqqqqqqqqqq
3904
4000
q
q
Number of Neighbors
q
3000
q H
HO
O N S
H2N
N O
O
O OH N
2000
q NH O H
OH N O
HO O
S H
N N
O
NH NH2
O O
q O N
HO
OH O N
O O
N
1000
N
N
q
HO NH NH2
q OH HN
O
q
H2N O
q
q
q
q
qq O
qqq
qqq
qqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqq H2N
0
0 20 40 60 80 100
R
1861
1861
16. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
How Can We Use It For Large Datasets?
Breaking the O(n2 ) barrier
Traditional NN detection has a time complexity of O(n2 )
Modern NN algorithms such as kD-trees
have lower time complexity
restricted to the exact NN problem
Solution is to use approximate NN algorithms such as Locality
Sensitive Hashing (LSH)
Bentley, J.; Commun. ACM 1980, 23, 214–229
Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004
Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
17. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
How Can We Use It For Large Datasets?
−1.0
−1.5
log [Mean Query Time (sec)]
Why LSH?
−2.0
Theoretically sublinear
−2.5
Shown to be 3 orders of
−3.0
LSH (142 descriptors)
LSH (20 descriptors)
magnitude faster than kNN (142 descriptors)
kNN (20 descriptors)
−3.5
traditional kNN
0.02 0.03 0.04 0.05 0.06 0.07 0.08
Very accurate (> 94%) Radius
Comparison of NN detection speed
on a 42,000 compound dataset
using a 200 point query set
Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
18. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Alternatives?
Why not use PCA?
R-NN curves are fundamentally a
form of dimension reduction
4
Principal Components Analysis is
2
also a form of dimension reduction
Principal Component 2
0
Disadvantages
−2
Eigendecomposition via SVD is
−4
O(n3 )
−6
Difficult to visualize more than 2 or 0 5 10
Principal Component 1
3 PC’s at the same time
We are no longer in the original
descriptor space
20. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Clusters
Counting the steps Approaches
Essentially a curve matching Hausdorff / Fr´chet distance
e
problem - requires canonical curves
All points will not be RMSE from distance matrix
indicative of the number of Slope analysis
clusters
Not applicable for radially
distributed clusters
21. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Their Slopes
251 84 251 84
300
300
14
12
20
250
250
Slope of the RNN Curve
Slope of the RNN Curve
10
200
200
15
8
150
150
10
6
100
100
4
5
2
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
88 137 88 137
300
300
15
250
250
15
Slope of the RNN Curve
Slope of the RNN Curve
200
200
10
10
150
150
100
100
5
5
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
Smoothed first derivative of the R-NN Curves
Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
22. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
R-NN Curves and Their Slopes
251 84 251 84
300
300
14
12
20
250
250
Slope of the RNN Curve
Slope of the RNN Curve
10
200
200
15
8
150
150
10
6
100
100
4
5
2
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
88 137 88 137
300
300
15
250
250
15
Slope of the RNN Curve
Slope of the RNN Curve
200
200
10
10
150
150
100
100
5
5
50
50
0
0
0
0
0 20 40 60 80 100 0 20 40 60 80 100
0 20 40 60 80 100 0 20 40 60 80 100
Smoothed first derivative of the R-NN Curves
Identifying peaks identifies the number of clusters
Automated picking can identify spurious peaks
Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
24. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
Simulated Data
Simulated 2D data using a Thomas point process
Predicted k, followed by kmeans clustering using k
Investigated similar values of k
k ASW k ASW k ASW
4 0.61 3 0.44 4 0.64
3 0.74 2 0.65 3 0.48
5 0.70 4 0.47 5 0.56
ASW - average silhouette width, higher is better; k - number of clusters
25. R-NN Curves for Diversity Analysis
Linking R-NN Curves to Cluster Counts
Density & Domain Applicability
A Mixed Dataset
Dataset composition
277 DHFR inhibitors based on
substituteed pyrimidinediamine and
diaminopteridine scaffolds
277 molecules from the DIPPR project
mainly simple hydrocarbons
boiling point was modeled
Evaluated 147 Molconn-Z descriptors,
reduced to 24
We expect at least 3 clusters
Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915
Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983