SlideShare una empresa de Scribd logo
1 de 37
Descargar para leer sin conexión
Characterizing the Density of Chemical Spaces and
    its Use in Outlier Analysis and Clustering

                   Rajarshi Guha

                  School of Informatics
                   Indiana University


                 17th August, 2007
                  Cambridge, MA
Outline




      R-NN curves for diversity analysis
      Counting clusters with R-NN curves
      Density and domain applicability
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
                                       Density & Domain Applicability


Nearest Neighbor Methods



    Traditional kNN methods are                                                         q




    simple, fast, intuitive                                                         q                   q
                                                                                                                                                        q

                                                                                                                        q
                                                                                                q
                                                                                                                                                                    q
                                                                            q


    Applications in                                                                                         q


                                                                                                                            q
                                                                                                                                q


                                                                                                                q
                                                            q       q                                                               q

        regression & classification                                                  q
                                                                                            q
                                                                                                                                                                q


                                                                                                                                    q


        diversity analysis                                          q
                                                                        q
                                                                                            q       q                                   q



                                                            q                                                                       q
                                                                            q


    Can be misleading if the nearest
                                                        q               q                               q
                                                                q
                                                                                                q                                       q

                                                                                                            q
                                                                                    q                               q
                                                                                q                   q

    neighbor is far away                                                                                q


                                                                                                                q
                                                                                                                            q
                                                                                                                                q

                                                                                                                                            q
                                                                                                                                                    q




                                                                                                                                                            q


    R-NN methods may be more                        q
                                                                                                q
                                                                                                        q
                                                                                                            q
                                                                                                                                                q




    suitable
R-NN Curves for Diversity Analysis
                                   Linking R-NN Curves to Cluster Counts
                                   Density & Domain Applicability


Diversity Analysis




  Why is it Important?
     Compound acquisition
      Lead hopping
      Knowledge of the distribution of compounds in a descriptor
      space may improve predictive models
R-NN Curves for Diversity Analysis
                                                            Linking R-NN Curves to Cluster Counts
                                                            Density & Domain Applicability


Approaches to Diversity Analysis



Cell based                                                                                                       q             q

                                                                                                                                                   q


     Divide space into bins                                                    q



                                                                                   q               q
                                                                                                           q


                                                                                                           qq
                                                                                                              q q
                                                                                                                 qq
                                                                                                                  q
                                                                                                                      q
                                                                                                                      q
                                                                                                                      qq
                                                                                                                           q
                                                                                                                               q

                                                                                                                                   q
                                                                                                                                        q

                                                                                                                                      q q
                                                                                                                                       q
                                                                                                                                          qq
                                                                                                                                                   q
                                                                                                                q
                                                                                                             qq                      qq q
                                                                                                                                      q
                                                                                                                    qq q
                                                                                                                       qq                 q        q
                                                                                                        q q
                                                                                                   qq                                q q           q


      Compounds are mapped to bins
                                                                                                   q           q     q                       q
                                                                                   q                    qqq q qq q
                                                                                                         q         qqq              q qq  q
                                                                                                             q                     q      q    q
                                                                                               q         q q  q       q    q             qqq       q
                                                                                                                  q      q
                                                                               q                              q qqq           q q q
                                                                                                               q q               q    q                q
                                                                                       q                                                   q
                                                                                                                     qq q      q
                                                                                                                               q
                                                                                           q             q      q     q q
                                                                                                                     qq         q q      q
                                                                                                             q q        q     qq     q
                                                                                                            q                          q
                                                                                                            q               q
                                                                                       q                   q                 q
                                                                                                                             q
                                                                                                                    q
                                                                                               q
                                                                                                            qq           q q
                                                                                                              q      q               q q

Disadvantages                                                                                          q
                                                                                                           qq   q
                                                                                                                 q
                                                                                                                             q
                                                                                                                                   q




    Not useful for high dimensional
    data
      Choosing the bin size can be tricky




  Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45
  Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
R-NN Curves for Diversity Analysis
                                                            Linking R-NN Curves to Cluster Counts
                                                            Density & Domain Applicability


Approaches to Diversity Analysis


                                                                                                              q




Distance based                                                                                            q                   q


                                                                                                                                              q
                                                                                                                                                                              q

                                                                                                                      q
                                                                                                                                                                                          q



     Considers distance between                                                                   q
                                                                                                                                  q


                                                                                                                                                  q
                                                                                                                                                      q


                                                                                                                                      q

     compounds in a space                                                         q       q

                                                                                                                  q
                                                                                                                                                          q
                                                                                                                                                                                      q


                                                                                                          q                                               q
                                                                                          q                       q       q                                   q


      Generally requires pairwise distance                                        q
                                                                                              q



                                                                                                                                                          q
                                                                                                  q


      calculation                                                             q               q                               q
                                                                                      q
                                                                                                                      q                                       q

                                                                                                                                  q
                                                                                                          q                               q
                                                                                                      q                   q
                                                                                                                              q                   q

      Can be sped up by kD trees, MVP                                                                                                 q
                                                                                                                                                      q

                                                                                                                                                                  q
                                                                                                                                                                          q




                                                                                                                                                                                  q


      trees etc.                                                          q
                                                                                                                      q
                                                                                                                              q
                                                                                                                                  q
                                                                                                                                                                      q




  Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
R-NN Curves for Diversity Analysis
                                     Linking R-NN Curves to Cluster Counts
                                     Density & Domain Applicability


Generating an R-NN Curve


  Observations
      Consider a query point with a hypersphere, of radius R,
      centered on it
      For small R, the hypersphere will contain very few or no
      neighbors
      For larger R, the number of neighbors will increase
      When R ≥ Dmax , the neighbor set is the whole dataset

  The question is . . .
  Does the variation of nearest neighbor count with radius allow us
  to characterize the location of a query point in a dataset?
R-NN Curves for Diversity Analysis
                                     Linking R-NN Curves to Cluster Counts
                                     Density & Domain Applicability


Generating an R-NN Curve


  Observations
      Consider a query point with a hypersphere, of radius R,
      centered on it
      For small R, the hypersphere will contain very few or no
      neighbors
      For larger R, the number of neighbors will increase
      When R ≥ Dmax , the neighbor set is the whole dataset

  The question is . . .
  Does the variation of nearest neighbor count with radius allow us
  to characterize the location of a query point in a dataset?
R-NN Curves for Diversity Analysis
                                                               Linking R-NN Curves to Cluster Counts
                                                               Density & Domain Applicability


Generating an R-NN Curve


Algorithm                                                                                                                      q
                                                                                                                                           q            50%
                                                                                                             q                                                                q


  Dmax ← max pairwise distance                                                                                                 q           q
                                                                                                                                           q
                                                                                                                                                q
                                                                                                                                               qq
                                                                                                                                                                      30%
  for molecule in dataset do                                                         q                       q



                                                                                                             q       q
                                                                                                                     q
                                                                                                                         q q

                                                                                                                           q
                                                                                                                          q q
                                                                                                                               q
                                                                                                                                       q
                                                                                                                                       q
                                                                                                                                           q

                                                                                                                                                    q
                                                                                                                                                                                       q

                                                                                                                                   q           q q


    R ← 0.01 × Dmax
                                                                                                     q                                                  q q
                                                                                                                           q
                                                                                                                                           q                                  q
                                                                                         q                           qq                    q            q             q                    q
                                                                                                         q             q q q
                                                                                                         qq                                                   q
                                                                                                                     q
                                                                                                                         q q     q
                                                                                                                                         q                                        10%
                                                                                                                          q q q       q    q                  q
                                                                                                                                                                          q
                                                                                                                  q                  q     q
                                                                                                                            q                                                      5%
    while R ≤ Dmax do
                                                                                                                     q                     q                      q
                                                                                 q                               qq        qq
                                                                                                         q    q  q      q                 q
                                                                                                 q              qq
                                                                                             q           q     q q
                                                                                                               qq            q
                                                                                                                  q         q q
                                                                                                                            q              q                                       q
                                                                                                                      q                    q                  q
                                                                                                                                                              q
                                                                                                          q qq q               q                              q
                                                                                                              q                                                                            q
                                                                                                                        q          q  q


       Find NN’s within radius R
                                                                                                      q                          q
                                                                                         q                                                                    q
                                                                                                                                         q
                                                                                         q
                                                                                                                                    q
                                                                                                                                    q
                                                                            q                    q         q        q
                                                                                                                    q                 q                            q
                                                                                                                                                                  q
                                                                                                                                q q                           q


       Increment R
                                                                                                                                q                                             q
                                                                                                        q                         q     q
                                                                                                    q           q     q
                                                                                                                                   q
                                                                                                   q         q q    q                  q
                                                                                                                                             q
                                                                                                                                                                                   q
                                                                                                                                                                                               q


    end while                                                                                                                          q




  end for                                                                                                                                                         q




  Plot NN count vs. R



  Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
R-NN Curves for Diversity Analysis
                                                                                                   Linking R-NN Curves to Cluster Counts
                                                                                                   Density & Domain Applicability


Generating an R-NN Curve

                                                                                              qq
                                                                                              qq
                                                                                         qqqq
                                                                                          qqq                                                                             qqqqqqqqqqqqqqqqqqqqqqqqqqq
                                                                                                                                                                          qqqqqqqqqqqqqqqqqqqqqqqqqqq
                                                                                        q                                                                          qqqqqq
                                                                                                                                                                    qqqqq
                                                                                       q
                                                                                                                                                              qqqq
                                                                                                                                                               qqqq
                       250




                                                                                      q




                                                                                                                          250
                                                                                                                                                            qq
                                                                                                                                                        qqq
                                                                                                                                                         qq
 Number of Neighbors




                                                                                                    Number of Neighbors
                                                                                  q                                                                   q
                       200




                                                                                                                          200
                                                                                                                                                   qq
                                                                                  q


                                                                              q
                                                                                                                                                  q
                       150




                                                                                                                          150
                                                                                                                                              qq
                                                                                                                                               q

                                                                                                                                              q
                                                                                                                                             q
                                                                                                                                          q
                       100




                                                                                                                          100
                                                                              q

                                                                                                                                        qq
                                                                                                                                         q



                                                                          q
                       50




                                                                                                                                      q




                                                                                                                          50
                                                                         qq
                                                                          q
                                                                      qq
                                                                       qq                                                           q
                                                                  qqq
                                                                   qq
                                                       qqqqqqqqqq
                                                       qqqqqqqqqq
                             qqqqqqqqqqqqqqqqqqqqqqqqq
                              qqqqqqqqqqqqqqqqqqqqqqqq
                       0




                                                                                                                                qqq
                                                                                                                                 qq




                                                                                                                          0
                             0        20          40         60         80                   100                                0                  20            40         60          80        100
                                                       R                                                                                                               R

                                           Sparse                                                                                                          Dense
R-NN Curves for Diversity Analysis
                                      Linking R-NN Curves to Cluster Counts
                                      Density & Domain Applicability


Characterizing an R-NN Curve


  Converting the Plot to Numbers
      Since R-NN curves are sigmoidal, fit them to the logistic
      equation
                                    1 + me −R/τ
                          NN = a ·
                                    1 + ne −R/τ
      m, n should characterize the curve
      Problems
           Two parameters
           Non-linear fitting is dependent on the starting point
           For some starting points, the fir does not converge and
           requires repetition
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
                                       Density & Domain Applicability


Characterizing an R-NN Curve


Converting the Plot to a Number
Determine the value of R where the
lower tail transitions to the linear
portion of the curve                                                            S=0

                                                                          S’’
Solution
     Determine the slope at various
     points on the curve                                             S’
                                                  S=0
    Find R for the first occurence of
    the maximal slope (Rmax(S) )
    Can be achieved using a finite
    difference approach
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
                                       Density & Domain Applicability


Characterizing an R-NN Curve


Converting the Plot to a Number
Determine the value of R where the
lower tail transitions to the linear
portion of the curve                                                            S=0

                                                                          S’’
Solution
     Determine the slope at various
     points on the curve                                             S’
                                                  S=0
    Find R for the first occurence of
    the maximal slope (Rmax(S) )
    Can be achieved using a finite
    difference approach
R-NN Curves for Diversity Analysis
                                                       Linking R-NN Curves to Cluster Counts
                                                       Density & Domain Applicability


Characterizing Multiple R-NN Curves

Problem
    Visual inspection of curves is useful
                                                                                                            3904
    for a few molecules
      For larger datasets we need to




                                                                          80
                                                                                             1861

      summarize R-NN curves




                                                                          60
                                                                 Rmax S
Solution




                                                                          40
     Plot Rmax(S) values for each
     molecule in the dataset



                                                                          20
      Points at the top of the plot are
                                                                               0   1000       2000   3000          4000
      located in the sparsest regions                                                     Serial Number

      Points at the bottom are located in
      the densest regions

  Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
R-NN Curves for Diversity Analysis
                                                                                                                      Linking R-NN Curves to Cluster Counts
                                                                                                                      Density & Domain Applicability


R-NN Curves and Outliers

                                                                                                                 qq
                                                                                                                  q
                                                                                                                q




                              4000
                                                                                                               q                                                                                                                                                                                                    H2N



                                                                                                                                                                                                                                                                                                              HN            N•
        Number of Neighbors
                                                                                                                                                                                                 O
                                                                                                            q
                                                                                                                                                                                                                            HO
                                                                                                                                                                                                                                                                                                                            O                 OH
                                                                                                                                                                              H2N
                                                                                                                                                                                                                                                                                                                       O
                              3000


                                                                                                                                                                                                              O                          O
                                                                                                                                                                                                         O            HN                      H
                                                                                                                                                                                                                                              N                                                                                 N
                                                                                                           q                                                                                                                                                                                                                    H
                                                                                                                                                                                                                                                         O
                                                                                                                                                                                                                                                              H                                                HN               O
                                                                                                                                                                                                     N                                                        N
                                                                                                                                                                                                     H                                                                                                    O

                                                                                                                                                                 O                      O                                                                                      O
                              2000




                                                                                                                                                                                                     NH
                                                                                                                                                                                                                                                                      HN           HO                           N
                                                                                                        q                                                                                                                                                                                                       H
                                                                                                                                                                         N
                                                                                                                                                                         H                                                                                                                                                                     HO
                                                                                                                                                                                                                                                                                    O            NH
                                                                                                                                                                                                                                                                               O                      H2N
                                                                                                                                                                         N
                                                                                                                                                                         H
                                                                                                       q                                                                  O                                                                                           HN           O

                                                                                                                                   •N                                                                                                                                               N                           O
                                                                                                                                                                     O                                                                                                              H
                              1000




                                                                                                                                                  NH                              NH
                                                                                                    q
                                                                                                                                   H2N                                                                                                                                             S
                                                                                                   q
                                                                                                                                                                                                                                                                                             S                                                                  •N
                                                                                      q                                                                                   HN
                                                                                     q                                                                                                                                                                                                                O                                                                       NH2
                                                                                                                                                  O
                                                                                    q
                                                                                  qq
                                                                                                                                                                 OH                                                                                                                                                                                                 NH
                                                                                qq
                                                                               qq                                                                                             O                                                                                                         HN
                                                                             qq                                                                                                                  HN                                                                                                                                      OH
                                      qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
                                     qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq                                                                                                                                                                                                   HN
                              0




                                                                                                                                                                                                                                                                                                                NH
                                                                                                                                                                                                                            H
                                                                                                                                                                                                      O                     N                                                               O
                                                                                                                                                                                                                                                                      H
                                                                                                                                                                                   S                                                                                  N                                                                   O
                                                                                                                                                                                                                                                    N                                                                               NH
                                     0         20         40          60                     80                 100                                                                                                                                 H
                                                                                                                                                                                                                                                        O                                                 OH       O
                                                                                                                                                                                                                                                                           O
                                                                                                                                                                                                                                 O
                                                                                                                                                                                                                                                                                                                       O
                                                                R                                                                                                                                HN
                                                                                                                                                                                                                                                                                                                                              NH
                                                                                                                                                                                                                                                                                                                                                               NH


                                                                                                                                                                                                                                                                                                                                                   O
                                                                                                                                                                               H2N

                                                                                                                                                                                                     N•                                                                                                                                                                   NH2
                                                                                                                                                                                                                                                                                                                                                           HN



                                                    3904                                                                                                                                                                                                                                                                   •N
                                                                                                                                                                                                                                                                                                                                         NH




                                                                                                                                                                                                                                                                                                                                         NH2
                                                                                                                                                                                                                                                                                                                                                               O

                                                                                                                                                                                                                                                                                                                                                                         HO




                                                                                          q
                                                                                           qqqqqqqqqqqqqq
                                                                                            qqqqqqqqqqqqqq


                                                                                                                                                                                                                                             3904
                              4000




                                                                                         q

                                                                                       q
        Number of Neighbors




                                                                                     q
                              3000




                                                                                   q                                                                                                                                                          H
                                                                                                                                                                                                         HO
                                                                                                                                                                                                                                 O            N                       S
                                                                                                                                   H2N

                                                                                                                                                                                                                                                                  N                                                                            O
                                                                                                                                              O
                                                                                                                                   O                            OH                                                               N
                              2000




                                                                                 q                                                                                                                           NH   O              H
                                                                                                                                                                                                                                                   OH                      N                                                                               O
                                                                                                                                   HO                                                        O
                                                                                                                                                                                                                                                                      S                 H
                                                                                                                                                                                                                                                                                        N                       N
                                                                                                                                                                     O
                                                                                                                                                                                                                  NH                         NH2
                                                                                                                                                      O                             O
                                                                               q                                                                                                                                                                                                   O                                                N
                                                                                                                                                      HO
                                                                                                                                         OH                                                                   O                          N
                                                                                                                                                                               O                                                               O
                                                                                                                                                                                                                            N
                              1000




                                                                                                                                                                                                                                                                                                                            N
                                                                                                                                                                                         N
                                                                             q
                                                                                                                                                           HO                                            NH                                             NH2

                                                                           q                                                                                                        OH                                          HN
                                                                                                                                                                                                                                                                                                                                                       O
                                                                         q
                                                                                                                                                                                                                  H2N                                                                                                                    O
                                                                        q
                                                                       q
                                                                      q
                                                                     q
                                                                   qq                                                                                                                                                                O
                                                                qqq
                                                               qqq
                                      qqqqqqqqqqqqqqqqqqqqqqqqq
                                     qqqqqqqqqqqqqqqqqqqqqqqqq                                                                                                                                                        H2N
                              0




                                     0         20         40          60                     80                 100

                                                                R
                                                                                                                                                                                                                                             1861
                                                    1861
R-NN Curves for Diversity Analysis
                                                             Linking R-NN Curves to Cluster Counts
                                                             Density & Domain Applicability


How Can We Use It For Large Datasets?



  Breaking the O(n2 ) barrier
         Traditional NN detection has a time complexity of O(n2 )
         Modern NN algorithms such as kD-trees
                 have lower time complexity
                 restricted to the exact NN problem
         Solution is to use approximate NN algorithms such as Locality
         Sensitive Hashing (LSH)




  Bentley, J.; Commun. ACM 1980, 23, 214–229
  Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004
  Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
R-NN Curves for Diversity Analysis
                                                             Linking R-NN Curves to Cluster Counts
                                                             Density & Domain Applicability


How Can We Use It For Large Datasets?




                                                                                              −1.0
                                                                                              −1.5
                                                                log [Mean Query Time (sec)]
Why LSH?




                                                                                              −2.0
   Theoretically sublinear




                                                                                              −2.5
      Shown to be 3 orders of




                                                                                              −3.0
                                                                                                                      LSH (142 descriptors)
                                                                                                                      LSH (20 descriptors)
      magnitude faster than                                                                                           kNN (142 descriptors)
                                                                                                                      kNN (20 descriptors)




                                                                                              −3.5
      traditional kNN
                                                                                                     0.02 0.03 0.04 0.05 0.06 0.07 0.08
      Very accurate (> 94%)                                                                                       Radius



                                                             Comparison of NN detection speed
                                                             on a 42,000 compound dataset
                                                             using a 200 point query set
  Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
R-NN Curves for Diversity Analysis
                                       Linking R-NN Curves to Cluster Counts
                                       Density & Domain Applicability


Alternatives?

Why not use PCA?
   R-NN curves are fundamentally a
   form of dimension reduction




                                                                          4
    Principal Components Analysis is




                                                                          2
    also a form of dimension reduction




                                                  Principal Component 2
                                                                          0
Disadvantages




                                                                          −2
    Eigendecomposition via SVD is




                                                                          −4
    O(n3 )



                                                                          −6
    Difficult to visualize more than 2 or                                        0            5              10

                                                                                   Principal Component 1
    3 PC’s at the same time
    We are no longer in the original
    descriptor space
R-NN Curves for Diversity Analysis
                                                     Linking R-NN Curves to Cluster Counts
                                                     Density & Domain Applicability


R-NN Curves and Clusters

                                                                          251                                       84




                                                      300




                                                                                                300
                                                      250




                                                                                                250
                                                      200




                                                                                                200
 15




                                                      150




                                                                                                150
                                                      100




                                                                                                100
                               251




                                                      50




                                                                                                50
 10




                                                      0




                                                                                                0
                                                            0   20   40         60   80   100         0   20   40         60   80   100

                                                                          88                                        137
           88




                                                      300




                                                                                                300
                                                      250




                                                                                                250
 5




                                     84




                                                      200




                                                                                                200
                                                      150




                                                                                                150
                137



                                                      100




                                                                                                100
 0




                                                      50




                                                                                                50
                                                      0




                                                                                                0
      −5   0          5   10          15   20   25          0   20   40         60   80   100         0   20   40         60   80   100


                                                                               Smoothed R-NN Curves


           R-NN curves are indicative of the number of clusters
R-NN Curves for Diversity Analysis
                                   Linking R-NN Curves to Cluster Counts
                                   Density & Domain Applicability


R-NN Curves and Clusters



Counting the steps                 Approaches
    Essentially a curve matching          Hausdorff / Fr´chet distance
                                                         e
    problem                               - requires canonical curves
    All points will not be                RMSE from distance matrix
    indicative of the number of           Slope analysis
    clusters
    Not applicable for radially
    distributed clusters
R-NN Curves for Diversity Analysis
                                                                                     Linking R-NN Curves to Cluster Counts
                                                                                     Density & Domain Applicability


R-NN Curves and Their Slopes
                     251                                       84                                                            251                                                                  84
 300




                                           300




                                                                                                              14
                                                                                                              12




                                                                                                                                                                               20
 250




                                           250




                                                                                     Slope of the RNN Curve




                                                                                                                                                      Slope of the RNN Curve
                                                                                                              10
 200




                                           200




                                                                                                                                                                               15
                                                                                                              8
 150




                                           150




                                                                                                                                                                               10
                                                                                                              6
 100




                                           100




                                                                                                              4




                                                                                                                                                                               5
                                                                                                              2
 50




                                           50




                                                                                                              0




                                                                                                                                                                               0
 0




                                           0




                                                                                                                   0   20   40        60   80   100                                 0   20   40        60   80   100
       0   20   40         60   80   100         0   20   40         60   80   100

                     88                                        137                                                               88                                                           137
 300




                                           300




                                                                                                              15
 250




                                           250




                                                                                                                                                                               15
                                                                                     Slope of the RNN Curve




                                                                                                                                                      Slope of the RNN Curve
 200




                                           200




                                                                                                              10




                                                                                                                                                                               10
 150




                                           150
 100




                                           100




                                                                                                              5




                                                                                                                                                                               5
 50




                                           50




                                                                                                              0




                                                                                                                                                                               0
 0




                                           0




                                                                                                                   0   20   40        60   80   100                                 0   20   40        60   80   100
       0   20   40         60   80   100         0   20   40         60   80   100


                                                                                                                   Smoothed first derivative of the R-NN Curves




       Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
R-NN Curves for Diversity Analysis
                                                                                     Linking R-NN Curves to Cluster Counts
                                                                                     Density & Domain Applicability


R-NN Curves and Their Slopes
                     251                                       84                                                            251                                                                  84
 300




                                           300




                                                                                                              14
                                                                                                              12




                                                                                                                                                                               20
 250




                                           250




                                                                                     Slope of the RNN Curve




                                                                                                                                                      Slope of the RNN Curve
                                                                                                              10
 200




                                           200




                                                                                                                                                                               15
                                                                                                              8
 150




                                           150




                                                                                                                                                                               10
                                                                                                              6
 100




                                           100




                                                                                                              4




                                                                                                                                                                               5
                                                                                                              2
 50




                                           50




                                                                                                              0




                                                                                                                                                                               0
 0




                                           0




                                                                                                                   0   20   40        60   80   100                                 0   20   40        60   80   100
       0   20   40         60   80   100         0   20   40         60   80   100

                     88                                        137                                                               88                                                           137
 300




                                           300




                                                                                                              15
 250




                                           250




                                                                                                                                                                               15
                                                                                     Slope of the RNN Curve




                                                                                                                                                      Slope of the RNN Curve
 200




                                           200




                                                                                                              10




                                                                                                                                                                               10
 150




                                           150
 100




                                           100




                                                                                                              5




                                                                                                                                                                               5
 50




                                           50




                                                                                                              0




                                                                                                                                                                               0
 0




                                           0




                                                                                                                   0   20   40        60   80   100                                 0   20   40        60   80   100
       0   20   40         60   80   100         0   20   40         60   80   100


                                                                                                                   Smoothed first derivative of the R-NN Curves


                Identifying peaks identifies the number of clusters
                Automated picking can identify spurious peaks
       Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
R-NN Curves for Diversity Analysis
                                   Linking R-NN Curves to Cluster Counts
                                   Density & Domain Applicability


Slope Analysis of R-NN Curves
                                                                                                         RNN Curve




                                                             300
                                                             250
                                                             200
                                                             150
                                                             100
Procedure




                                                             50
  for i in molecules do




                                                             0
                                                                    0                      20                40                60                80               100



                                                                                                                         D'
    Evaluate R-NN curve




                                                             8
    F ← smoothed R-NN curve




                                                             6
    Evaluate F




                                                             4
                                                             2
    Smooth F
    Nroot,i ← no. of roots of F




                                                             0
                                                                    0                      20                40                60                80               100




  end for                                                                                                               D''



  Ncluster = [max(Nroot ) + 1]/2



                                                             0.5
                                                                                                                     qq
                                                                                                                    q q
                                                                                                                   q q
                                                                                                                  q     q
                                                                    qqqq
                                                                     qqq
                                                                                                                 q
                                                                         q                                               q
                                                                          q                                     q
                                                                           q                                   q          q               qq
                                                                           q                                  q                          q q
                                                                                                             q             q            q q
                                                                               q                            q                          q     q
                                                                                                           q                q         q       q




                                                             0.0
                                                                               q                          q                                                          qqqqqqqq
                                                                                                                                                                      qqqqqqq
                                                                                                         q                           q                              q
                                                                                                        q                    q                 q                   q
                                                                                   q                   q                                                          q
                                                                                                                                    q                            q
                                                                                                      q                       q
                                                                                   q                 q                             q            q               q
                                                                                                                               q q                             q
                                                                                                    q                           qq
                                                                                       q           q                            q                q            q
                                                                                                   q                                                         q
                                                                                       q                                                         q
                                                                                                  q                                                         q
                                                                                           q     q                                                   q     q
                                                                                            q q                                                       q q
                                                                                              qq
                                                                                             qq




                                                             −0.5
                                                                                                                                                       qqq
                                                                                                                                                       q




                                                                    0                      20                40                60                80               100
R-NN Curves for Diversity Analysis
                                                              Linking R-NN Curves to Cluster Counts
                                                              Density & Domain Applicability


Simulated Data

          Simulated 2D data using a Thomas point process
          Predicted k, followed by kmeans clustering using k
          Investigated similar values of k




        k     ASW                                  k      ASW                               k     ASW
        4     0.61                                 3      0.44                              4     0.64
        3     0.74                                 2      0.65                              3     0.48
        5     0.70                                 4      0.47                              5     0.56
  ASW - average silhouette width, higher is better; k - number of clusters
R-NN Curves for Diversity Analysis
                                                              Linking R-NN Curves to Cluster Counts
                                                              Density & Domain Applicability


A Mixed Dataset



Dataset composition
    277 DHFR inhibitors based on
              substituteed pyrimidinediamine and
              diaminopteridine scaffolds
      277 molecules from the DIPPR project
              mainly simple hydrocarbons
              boiling point was modeled
      Evaluated 147 Molconn-Z descriptors,
      reduced to 24
      We expect at least 3 clusters

  Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915
  Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

Más contenido relacionado

Destacado

Realizing Service Finder at ESTC 2008
Realizing Service Finder at ESTC 2008Realizing Service Finder at ESTC 2008
Realizing Service Finder at ESTC 2008Emanuele Della Valle
 
容易忽略的住家隱形殺手:低頻幅射
容易忽略的住家隱形殺手:低頻幅射容易忽略的住家隱形殺手:低頻幅射
容易忽略的住家隱形殺手:低頻幅射janlai
 
080924 Ext Js入門者の最初の壁(第4回勉強会)
080924 Ext Js入門者の最初の壁(第4回勉強会)080924 Ext Js入門者の最初の壁(第4回勉強会)
080924 Ext Js入門者の最初の壁(第4回勉強会)Yuki Naotori
 
Presentazione La Fenice Per Mediglia
Presentazione La Fenice Per MedigliaPresentazione La Fenice Per Mediglia
Presentazione La Fenice Per Medigliaguest8ee283
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformRajarshi Guha
 

Destacado (6)

Realizing Service Finder at ESTC 2008
Realizing Service Finder at ESTC 2008Realizing Service Finder at ESTC 2008
Realizing Service Finder at ESTC 2008
 
容易忽略的住家隱形殺手:低頻幅射
容易忽略的住家隱形殺手:低頻幅射容易忽略的住家隱形殺手:低頻幅射
容易忽略的住家隱形殺手:低頻幅射
 
080924 Ext Js入門者の最初の壁(第4回勉強会)
080924 Ext Js入門者の最初の壁(第4回勉強会)080924 Ext Js入門者の最初の壁(第4回勉強会)
080924 Ext Js入門者の最初の壁(第4回勉強会)
 
Presentazione La Fenice Per Mediglia
Presentazione La Fenice Per MedigliaPresentazione La Fenice Per Mediglia
Presentazione La Fenice Per Mediglia
 
This is me
This is meThis is me
This is me
 
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS PlatformEnhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
Enhancing Prioritization & Discovery of Novel Combinations using an HTS Platform
 

Similar a Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

Weak Lensing Simulator
Weak Lensing SimulatorWeak Lensing Simulator
Weak Lensing SimulatorCorey Chivers
 
boston scientific1996_annual
boston scientific1996_annualboston scientific1996_annual
boston scientific1996_annualfinance28
 
boston scientific1996_annual
boston scientific1996_annualboston scientific1996_annual
boston scientific1996_annualfinance28
 
Living In The Cloud
Living In The CloudLiving In The Cloud
Living In The Cloudacme
 
Survey of Esoko Users
Survey of Esoko UsersSurvey of Esoko Users
Survey of Esoko UsersEsoko
 
Regression Modelling Overview
Regression Modelling OverviewRegression Modelling Overview
Regression Modelling OverviewPaul Hewson
 
Introduction to power laws
Introduction to power lawsIntroduction to power laws
Introduction to power lawsColin Gillespie
 
Liquid Measurement Story
Liquid Measurement StoryLiquid Measurement Story
Liquid Measurement Storycheggi1
 
Example sweavefunnelplot
Example sweavefunnelplotExample sweavefunnelplot
Example sweavefunnelplotTony Hirst
 
Robust parametric classification and variable selection with minimum distance...
Robust parametric classification and variable selection with minimum distance...Robust parametric classification and variable selection with minimum distance...
Robust parametric classification and variable selection with minimum distance...echi99
 
F1 2011 Korea Race Report
F1 2011 Korea Race ReportF1 2011 Korea Race Report
F1 2011 Korea Race ReportTony Hirst
 
Portuguese Market and On-board Sampling Effort Review
Portuguese Market and On-board Sampling Effort ReviewPortuguese Market and On-board Sampling Effort Review
Portuguese Market and On-board Sampling Effort ReviewErnesto Jardim
 

Similar a Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering (16)

Weak Lensing Simulator
Weak Lensing SimulatorWeak Lensing Simulator
Weak Lensing Simulator
 
boston scientific1996_annual
boston scientific1996_annualboston scientific1996_annual
boston scientific1996_annual
 
boston scientific1996_annual
boston scientific1996_annualboston scientific1996_annual
boston scientific1996_annual
 
Slides mcneil
Slides mcneilSlides mcneil
Slides mcneil
 
Slides alexander-mcneil
Slides alexander-mcneilSlides alexander-mcneil
Slides alexander-mcneil
 
Slides p6
Slides p6Slides p6
Slides p6
 
Living In The Cloud
Living In The CloudLiving In The Cloud
Living In The Cloud
 
Survey of Esoko Users
Survey of Esoko UsersSurvey of Esoko Users
Survey of Esoko Users
 
Regression Modelling Overview
Regression Modelling OverviewRegression Modelling Overview
Regression Modelling Overview
 
Clustering Plot
Clustering PlotClustering Plot
Clustering Plot
 
Introduction to power laws
Introduction to power lawsIntroduction to power laws
Introduction to power laws
 
Liquid Measurement Story
Liquid Measurement StoryLiquid Measurement Story
Liquid Measurement Story
 
Example sweavefunnelplot
Example sweavefunnelplotExample sweavefunnelplot
Example sweavefunnelplot
 
Robust parametric classification and variable selection with minimum distance...
Robust parametric classification and variable selection with minimum distance...Robust parametric classification and variable selection with minimum distance...
Robust parametric classification and variable selection with minimum distance...
 
F1 2011 Korea Race Report
F1 2011 Korea Race ReportF1 2011 Korea Race Report
F1 2011 Korea Race Report
 
Portuguese Market and On-board Sampling Effort Review
Portuguese Market and On-board Sampling Effort ReviewPortuguese Market and On-board Sampling Effort Review
Portuguese Market and On-board Sampling Effort Review
 

Más de Rajarshi Guha

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomeRajarshi Guha
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in contextRajarshi Guha
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomeRajarshi Guha
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMCRajarshi Guha
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?Rajarshi Guha
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?Rajarshi Guha
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsRajarshi Guha
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATSRajarshi Guha
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & RRajarshi Guha
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical StructuresRajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Rajarshi Guha
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the partsRajarshi Guha
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Rajarshi Guha
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesRajarshi Guha
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Rajarshi Guha
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research DatabaseRajarshi Guha
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsRajarshi Guha
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleRajarshi Guha
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Rajarshi Guha
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in RRajarshi Guha
 

Más de Rajarshi Guha (20)

Pharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark GenomePharos: A Torch to Use in Your Journey in the Dark Genome
Pharos: A Torch to Use in Your Journey in the Dark Genome
 
Pharos: Putting targets in context
Pharos: Putting targets in contextPharos: Putting targets in context
Pharos: Putting targets in context
 
Pharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark GenomePharos – A Torch to Use in Your Journey In the Dark Genome
Pharos – A Torch to Use in Your Journey In the Dark Genome
 
Pharos - Face of the KMC
Pharos - Face of the KMCPharos - Face of the KMC
Pharos - Face of the KMC
 
What can your library do for you?
What can your library do for you?What can your library do for you?
What can your library do for you?
 
So I have an SD File … What do I do next?
So I have an SD File … What do I do next?So I have an SD File … What do I do next?
So I have an SD File … What do I do next?
 
Characterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network ModelsCharacterization of Chemical Libraries Using Scaffolds and Network Models
Characterization of Chemical Libraries Using Scaffolds and Network Models
 
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action: Bridging Chemistry and Biology with Informatics at NCATSFrom Data to Action: Bridging Chemistry and Biology with Informatics at NCATS
From Data to Action : Bridging Chemistry and Biology with Informatics at NCATS
 
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
 
Fingerprinting Chemical Structures
Fingerprinting Chemical StructuresFingerprinting Chemical Structures
Fingerprinting Chemical Structures
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D...
 
When the whole is better than the parts
When the whole is better than the partsWhen the whole is better than the parts
When the whole is better than the parts
 
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
Exploring Compound Combinations in High Throughput Settings: Going Beyond 1D ...
 
Pushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the PipesPushing Chemical Biology Through the Pipes
Pushing Chemical Biology Through the Pipes
 
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
 
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
 
Cloudy with a Touch of Cheminformatics
Cloudy with a Touch of CheminformaticsCloudy with a Touch of Cheminformatics
Cloudy with a Touch of Cheminformatics
 
Chemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & ReproducibleChemical Data Mining: Open Source & Reproducible
Chemical Data Mining: Open Source & Reproducible
 
Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?Chemogenomics in the cloud: Is the sky the limit?
Chemogenomics in the cloud: Is the sky the limit?
 
Quantifying Text Sentiment in R
Quantifying Text Sentiment in RQuantifying Text Sentiment in R
Quantifying Text Sentiment in R
 

Último

Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxPoojaSen20
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 

Último (20)

Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptxCulture Uniformity or Diversity IN SOCIOLOGY.pptx
Culture Uniformity or Diversity IN SOCIOLOGY.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 

Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

  • 1. Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering Rajarshi Guha School of Informatics Indiana University 17th August, 2007 Cambridge, MA
  • 2. Outline R-NN curves for diversity analysis Counting clusters with R-NN curves Density and domain applicability
  • 3. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Nearest Neighbor Methods Traditional kNN methods are q simple, fast, intuitive q q q q q q q Applications in q q q q q q q regression & classification q q q q diversity analysis q q q q q q q q Can be misleading if the nearest q q q q q q q q q q q neighbor is far away q q q q q q q R-NN methods may be more q q q q q suitable
  • 4. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Diversity Analysis Why is it Important? Compound acquisition Lead hopping Knowledge of the distribution of compounds in a descriptor space may improve predictive models
  • 5. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Approaches to Diversity Analysis Cell based q q q Divide space into bins q q q q qq q q qq q q q qq q q q q q q q qq q q qq qq q q qq q qq q q q q qq q q q Compounds are mapped to bins q q q q q qqq q qq q q qqq q qq q q q q q q q q q q q qqq q q q q q qqq q q q q q q q q q q qq q q q q q q q q qq q q q q q q qq q q q q q q q q q q q qq q q q q q q Disadvantages q qq q q q q Not useful for high dimensional data Choosing the bin size can be tricky Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45 Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
  • 6. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Approaches to Diversity Analysis q Distance based q q q q q q Considers distance between q q q q q compounds in a space q q q q q q q q q q q Generally requires pairwise distance q q q q calculation q q q q q q q q q q q q q Can be sped up by kD trees, MVP q q q q q trees etc. q q q q q Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
  • 7. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  • 8. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Observations Consider a query point with a hypersphere, of radius R, centered on it For small R, the hypersphere will contain very few or no neighbors For larger R, the number of neighbors will increase When R ≥ Dmax , the neighbor set is the whole dataset The question is . . . Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?
  • 9. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve Algorithm q q 50% q q Dmax ← max pairwise distance q q q q qq 30% for molecule in dataset do q q q q q q q q q q q q q q q q q q q R ← 0.01 × Dmax q q q q q q q qq q q q q q q q q qq q q q q q q 10% q q q q q q q q q q q 5% while R ≤ Dmax do q q q q qq qq q q q q q q qq q q q q qq q q q q q q q q q q q q qq q q q q q q q q Find NN’s within radius R q q q q q q q q q q q q q q q q q q q Increment R q q q q q q q q q q q q q q q q q end while q end for q Plot NN count vs. R Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
  • 10. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Generating an R-NN Curve qq qq qqqq qqq qqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqq q qqqqqq qqqqq q qqqq qqqq 250 q 250 qq qqq qq Number of Neighbors Number of Neighbors q q 200 200 qq q q q 150 150 qq q q q q 100 100 q qq q q 50 q 50 qq q qq qq q qqq qq qqqqqqqqqq qqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqq 0 qqq qq 0 0 20 40 60 80 100 0 20 40 60 80 100 R R Sparse Dense
  • 11. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to Numbers Since R-NN curves are sigmoidal, fit them to the logistic equation 1 + me −R/τ NN = a · 1 + ne −R/τ m, n should characterize the curve Problems Two parameters Non-linear fitting is dependent on the starting point For some starting points, the fir does not converge and requires repetition
  • 12. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  • 13. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing an R-NN Curve Converting the Plot to a Number Determine the value of R where the lower tail transitions to the linear portion of the curve S=0 S’’ Solution Determine the slope at various points on the curve S’ S=0 Find R for the first occurence of the maximal slope (Rmax(S) ) Can be achieved using a finite difference approach
  • 14. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Characterizing Multiple R-NN Curves Problem Visual inspection of curves is useful 3904 for a few molecules For larger datasets we need to 80 1861 summarize R-NN curves 60 Rmax S Solution 40 Plot Rmax(S) values for each molecule in the dataset 20 Points at the top of the plot are 0 1000 2000 3000 4000 located in the sparsest regions Serial Number Points at the bottom are located in the densest regions Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
  • 15. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Outliers qq q q 4000 q H2N HN N• Number of Neighbors O q HO O OH H2N O 3000 O O O HN H N N q H O H HN O N N H O O O O 2000 NH HN HO N q H N H HO O NH O H2N N H q O HN O •N N O O H 1000 NH NH q H2N S q S •N q HN q O NH2 O q qq OH NH qq qq O HN qq HN OH qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq HN 0 NH H O N O H S N O N NH 0 20 40 60 80 100 H O OH O O O O R HN NH NH O H2N N• NH2 HN 3904 •N NH NH2 O HO q qqqqqqqqqqqqqq qqqqqqqqqqqqqq 3904 4000 q q Number of Neighbors q 3000 q H HO O N S H2N N O O O OH N 2000 q NH O H OH N O HO O S H N N O NH NH2 O O q O N HO OH O N O O N 1000 N N q HO NH NH2 q OH HN O q H2N O q q q q qq O qqq qqq qqqqqqqqqqqqqqqqqqqqqqqqq qqqqqqqqqqqqqqqqqqqqqqqqq H2N 0 0 20 40 60 80 100 R 1861 1861
  • 16. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability How Can We Use It For Large Datasets? Breaking the O(n2 ) barrier Traditional NN detection has a time complexity of O(n2 ) Modern NN algorithms such as kD-trees have lower time complexity restricted to the exact NN problem Solution is to use approximate NN algorithms such as Locality Sensitive Hashing (LSH) Bentley, J.; Commun. ACM 1980, 23, 214–229 Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004 Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  • 17. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability How Can We Use It For Large Datasets? −1.0 −1.5 log [Mean Query Time (sec)] Why LSH? −2.0 Theoretically sublinear −2.5 Shown to be 3 orders of −3.0 LSH (142 descriptors) LSH (20 descriptors) magnitude faster than kNN (142 descriptors) kNN (20 descriptors) −3.5 traditional kNN 0.02 0.03 0.04 0.05 0.06 0.07 0.08 Very accurate (> 94%) Radius Comparison of NN detection speed on a 42,000 compound dataset using a 200 point query set Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
  • 18. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Alternatives? Why not use PCA? R-NN curves are fundamentally a form of dimension reduction 4 Principal Components Analysis is 2 also a form of dimension reduction Principal Component 2 0 Disadvantages −2 Eigendecomposition via SVD is −4 O(n3 ) −6 Difficult to visualize more than 2 or 0 5 10 Principal Component 1 3 PC’s at the same time We are no longer in the original descriptor space
  • 19. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Clusters 251 84 300 300 250 250 200 200 15 150 150 100 100 251 50 50 10 0 0 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 300 300 250 250 5 84 200 200 150 150 137 100 100 0 50 50 0 0 −5 0 5 10 15 20 25 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed R-NN Curves R-NN curves are indicative of the number of clusters
  • 20. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Clusters Counting the steps Approaches Essentially a curve matching Hausdorff / Fr´chet distance e problem - requires canonical curves All points will not be RMSE from distance matrix indicative of the number of Slope analysis clusters Not applicable for radially distributed clusters
  • 21. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Their Slopes 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed first derivative of the R-NN Curves Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
  • 22. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability R-NN Curves and Their Slopes 251 84 251 84 300 300 14 12 20 250 250 Slope of the RNN Curve Slope of the RNN Curve 10 200 200 15 8 150 150 10 6 100 100 4 5 2 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 88 137 88 137 300 300 15 250 250 15 Slope of the RNN Curve Slope of the RNN Curve 200 200 10 10 150 150 100 100 5 5 50 50 0 0 0 0 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 Smoothed first derivative of the R-NN Curves Identifying peaks identifies the number of clusters Automated picking can identify spurious peaks Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
  • 23. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Slope Analysis of R-NN Curves RNN Curve 300 250 200 150 100 Procedure 50 for i in molecules do 0 0 20 40 60 80 100 D' Evaluate R-NN curve 8 F ← smoothed R-NN curve 6 Evaluate F 4 2 Smooth F Nroot,i ← no. of roots of F 0 0 20 40 60 80 100 end for D'' Ncluster = [max(Nroot ) + 1]/2 0.5 qq q q q q q q qqqq qqq q q q q q q q q qq q q q q q q q q q q q q q q q q 0.0 q q qqqqqqqq qqqqqqq q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q qq qq −0.5 qqq q 0 20 40 60 80 100
  • 24. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability Simulated Data Simulated 2D data using a Thomas point process Predicted k, followed by kmeans clustering using k Investigated similar values of k k ASW k ASW k ASW 4 0.61 3 0.44 4 0.64 3 0.74 2 0.65 3 0.48 5 0.70 4 0.47 5 0.56 ASW - average silhouette width, higher is better; k - number of clusters
  • 25. R-NN Curves for Diversity Analysis Linking R-NN Curves to Cluster Counts Density & Domain Applicability A Mixed Dataset Dataset composition 277 DHFR inhibitors based on substituteed pyrimidinediamine and diaminopteridine scaffolds 277 molecules from the DIPPR project mainly simple hydrocarbons boiling point was modeled Evaluated 147 Molconn-Z descriptors, reduced to 24 We expect at least 3 clusters Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915 Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983