Benchmarking and Validation of JChem ECFP and FCFP Fingerprints
Roger Sayle, NextMove Software Ltd, Cambridge, UK
roger@nextmovesoftware.co.uk


1. Overview
The cornerstone of pharmaceutical chemistry is Crum Brown’s observation that
similar compounds have similar therapeutic benefits. Cheminformatics tries to
capture this insight by defining measures of similarity between the computer
representations of two molecules, in the hope of capturing a medicinal
chemist’s intuitive sense of “likeness” and thereby correlating with
bioactivity. This poster evaluates the chemical similarity measures offered by
ChemAxon on a standard reference benchmark. Any such benchmark is necessarily
flawed; the similarity between two molecules is influenced by the framework in
which they are compared [2]. However, a robust similarity measure should
typically perform better on such benchmarks, whilst a weaker model of chemical
similarity would be expected to perform worse (on average).

6. Fingerprint Saturation
A common failing of binary fingerprints is their inability to represent the
number of times a feature (such as a path or substructure) occurs. The
fingerprints for decane (C10), undecane (C11) and dodecane (C12) are typically
identical, as are those for many protein and DNA sequences. A more powerful
representation that solves these issues replaces occurrence bits with counts,
turning binary fingerprints into occurrence histograms.
LINGOs similarity achieves better results on the Briem & Lessel benchmark by
using counts instead of bits. However, as described in the Continuous Tanimoto
section (section 7), care has to be taken to use a suitable similarity measure
for comparing histograms.
ChemAxon have announced that the upcoming release of JChem, version 5.5, will
support ECFP and FCFP fingerprints with counts.
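
The saturation effect described in the Fingerprint Saturation section can be
illustrated with a toy model. The sketch below is not ChemAxon's
implementation: it assumes an unhashed path fingerprint in which every linear
path of up to 8 atoms in an unbranched alkane is keyed by a string of "C"
characters, and the helper names (alkane_path_counts, binary_tanimoto,
count_tanimoto) are hypothetical. The count-based comparison uses the min/max
form that the Continuous Tanimoto section calls T1.

```python
from collections import Counter

MAX_ATOMS = 8  # assumed limit: paths of up to 7 bonds (8 atoms)


def alkane_path_counts(n_carbons, max_atoms=MAX_ATOMS):
    """Count linear paths in an unbranched alkane, keyed by path string.

    In a chain of identical carbons every path of k atoms looks the same
    ("C" * k), and a chain of n atoms contains n - k + 1 such paths.
    """
    counts = Counter()
    for k in range(1, min(n_carbons, max_atoms) + 1):
        counts["C" * k] = n_carbons - k + 1
    return counts


def binary_tanimoto(a, b):
    """Tanimoto on feature presence only: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


def count_tanimoto(a, b):
    """Tanimoto on occurrence counts: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a[k], b[k]) for k in keys)
    denominator = sum(max(a[k], b[k]) for k in keys)
    return numerator / denominator


decane, undecane, dodecane = (alkane_path_counts(n) for n in (10, 11, 12))

# All three alkanes set exactly the same features, so the binary view
# cannot tell them apart.
print(binary_tanimoto(decane, dodecane))           # 1.0
# With counts, the histograms differ and the similarity drops below 1.
print(round(count_tanimoto(decane, dodecane), 3))  # 0.765
```

In this toy model the three alkanes share one feature set once the chain is
longer than the maximum path length, which is the saturation the section
describes; only the histogram view separates them.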
2. Briem & Lessel Benchmark
The benchmark employed in this evaluation is the commonly used Briem and
Lessel benchmark [1]. This test assesses a method’s ability to distinguish near
neighbours with the same biological activity from decoy molecules and molecules
with different biological activities. Five classes of active compounds are
used: 40 ACE inhibitors, 49 TXA antagonists, 110 HMG-CoA reductase inhibitors,
133 PAF antagonists and 48 5HT3 antagonists. In addition to these 380 active
compounds, the data set contains 573 “random” MDDR compounds, for a total of
953 molecules. The benchmark proceeds by determining the 10 nearest neighbours
for each of the 380 active compounds. The query is not considered a neighbour
of itself. The score for each activity class is the fraction of these
neighbours that have the same activity as the query. Finally, the overall score
is the average of the scores of the five activity classes.

7. Continuous Tanimoto
Although there is universal agreement on how the Tanimoto coefficient should be
interpreted for binary values, its application to continuous values, such as
histogram counts, has been implemented differently by different authors [4,5].
Consider the two alternative definitions T0 and T1 given below.

$$T_0(x, y) = \frac{\sum_{i}^{N} x_i y_i}{\sum_{i}^{N} \left( x_i^2 + y_i^2 - x_i y_i \right)} \qquad T_1(x, y) = \frac{\sum_{i}^{N} \min(x_i, y_i)}{\sum_{i}^{N} \max(x_i, y_i)}$$

Both definitions agree for binary-valued vectors, and are guaranteed to return
increasing fractional values between zero and one. Notice, however, that for
x = { 3 } and y = { 4 }, T1 = 3/4 = 0.75 but T0 = 12/13 ≈ 0.923.
In experiments with LINGOs histograms, T1 was found to be superior (producing
an improvement of ~0.9%) whereas T0 actually made the results worse (by ~3%).
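
The two definitions can be checked numerically. The following is a minimal
sketch that assumes fingerprints are equal-length sequences of non-negative
counts; the function names tanimoto_t0 and tanimoto_t1 are hypothetical, and
returning 1.0 when both vectors are all zero is a convention chosen here, not
something the poster specifies.

```python
def tanimoto_t0(x, y):
    """Dot-product form: sum(x*y) / sum(x^2 + y^2 - x*y)."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a + b * b - a * b for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


def tanimoto_t1(x, y):
    """Min/max form: sum(min(x, y)) / sum(max(x, y))."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Assumed convention: two all-zero vectors are treated as identical.
    return numerator / denominator if denominator else 1.0


# The single-feature example from the text: x = {3}, y = {4}.
print(round(tanimoto_t0([3], [4]), 3))  # 0.923
print(tanimoto_t1([3], [4]))            # 0.75

# On binary vectors both forms reduce to |X & Y| / |X | Y|.
bits_x = [1, 1, 0, 1, 0]
bits_y = [1, 0, 0, 1, 1]
print(tanimoto_t0(bits_x, bits_y), tanimoto_t1(bits_x, bits_y))  # 0.5 0.5
```

The example shows why the choice matters for histograms: T0 rates counts of 3
and 4 as considerably more alike than T1 does, even though the two forms are
indistinguishable on bit vectors.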
3. Fingerprint Methods
Historically, the similarity method underlying ChemAxon’s JChem search engine
relied upon Chemical Fingerprints (“CF”). These are path-based fingerprints
similar to Daylight fingerprints, which allow a number of variants depending
upon parameters for the number of bits in the fingerprint, the longest bond
path to encode and the number of bits set by each path. The “Marvin CF” in the
Evaluation Results uses the generatemd defaults of 1024 bits, paths of up to 7
bonds, and 3 bits per path. The “JChem CF” uses the JChemManager defaults of
512 bits, paths up to length 6 and 2 bits per path.
Recently, in v5.4, ChemAxon has added support for ECFP and FCFP fingerprints
originally introduced by Scitegic, now Accelrys [8]. These are termed “ECFP_4”
and “FCFP_8” in the Evaluation Results, indicating the ChemAxon implementation
with diameter parameters of 4 bonds and 8 bonds respectively.
For reference comparison to other methods, also shown are LINGOs similarity [4],
MACCS 166-bit keys [3], Daylight fingerprints and IBM’s patented InChI-based
chemical similarity (US20080004810A1) as used in their SIMPLE product [7].

8. Conclusions
• ChemAxon’s Chemical Fingerprints perform comparably with other path and
  feature-based fingerprints (including MACCS 166-bit keys, Daylight
  fingerprints and PubChem/CACTVS fingerprints). All these methods perform
  equivalently.
• ECFP fingerprints, originally developed by Scitegic/Accelrys and as recently
  implemented by ChemAxon, perform exceptionally well on the standard Briem
  and Lessel benchmark.
• The announced ECFP histograms would be anticipated to set new records in 2D
  chemical similarity.

9. Acknowledgements
To Miklos Vargyas and Alex Allardyce for the invitation to present a poster at
the ChemAxon UGM, to Peter Kovacs for JChem ECFP support and rapid bug fixing,
and to AstraZeneca and Vertex Pharmaceuticals for their interest in 2D
similarity.
4. Tanimoto Coefficient
Many ways of comparing similarity between binary fingerprints have been
discussed in the literature [9]; generally the best performing of these is the
Tanimoto coefficient, $T = |X \cap Y| / |X \cup Y|$. This definition has almost
magical properties, normalizing the differences between two feature sets by
their sizes, intuitively “the fraction in common”. Experimentally this
correlates well with the chemical and biological notion of what makes two
molecules similar.

5. Evaluation Results
Overall Briem & Lessel benchmark score by method (bar chart, 0% to 90% axis):

Method       Overall score
ECFP_4       79.4%
FCFP_8       76.0%
LINGOs       75.6%
MACCS        66.2%
Marvin CF    65.2%
JChem CF     63.7%
Daylight     64.0%
IBM          42.0%

10. Bibliography
1. Hans Briem and Uta F. Lessel, “In vitro and in silico Affinity Fingerprints:
   Finding Similarities beyond Structural Classes”, Perspectives in Drug
   Discovery and Design, Vol. 20, pp. 231-244, 2000.
2. Robert D. Brown and Yvonne C. Martin, “Use of Structure-Activity Data to
   Compare Structure-based Clustering Methods and Descriptors for Use in
   Compound Selection”, JCICS, Vol. 36, No. 3, pp. 572-582, 1996.
3. Joseph L. Durant, Burton A. Leland, Douglas R. Henry and James G. Nourse,
   “Reoptimization of MDL Keys for Use in Drug Discovery”, JCICS, Vol. 42,
   pp. 1273-1280, 2002.
4. J. Andrew Grant, James A. Haigh, Barry T. Pickup, Anthony Nicholls and
   Roger A. Sayle, “Lingos, Finite State Machines and Fast Similarity
   Searching”, JCIM, Vol. 46, No. 5, pp. 1912-1918, 2006.
5. Thierry Kogej, Ola Engkvist, Niklas Blomberg and Sorel Muresan,
   “Multifingerprint Based Similarity Searches for Targeted Class Compound
   Selection”, JCIM, Vol. 46, No. 3, pp. 1201-1213, 2006.
6. Steven W. Muchmore, Derek A. Debe, James T. Metz, Scott P. Brown, Yvonne C.
   Martin and Philip J. Hajduk, “Application of Belief Theory to Similarity
   Data Fusion for Use in Analog Searching and Lead Hopping”, JCIM, Vol. 48,
   No. 5, pp. 941-948, 2008.
7. James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen and Patricia
   Ordonez, “Mining Patents using Molecular Similarity Search”, Pacific
   Symposium on Biocomputing, Vol. 12, pp. 304-315, 2007.
8. David Rogers and Mathew Hahn, “Extended Connectivity Fingerprints”, JCIM,
   Vol. 50, No. 5, pp. 742-754, 2010.
9. Peter Willett, John M. Barnard and Geoffrey M. Downs, “Chemical Similarity
   Searching”, JCICS, Vol. 38, No. 6, pp. 983-996, 1998.



NextMove Software Limited
Innovation Centre (Unit 23)
Cambridge Science Park
Milton Road, Cambridge
England CB4 0EY
www.nextmovesoftware.co.uk
www.nextmovesoftware.com


David Rogers and Mathew Hahn, “Extended Connectivity Fingerprints”, JCIM, Vol. 50, No. 10% 5, pp. 742-754, 2010. 0% 9. Peter Willet, John M. Barnard and Geoffrey M. Downs, “Chemical Similarity Searching”, ECFP_4 FCFP_8 LINGOs MACCS Marvin CF JChem CF Daylight IBM JCICS, Vol. 38, No. 6, pp. 893-996, 1998. NextMove Software Limited Innovation Centre (Unit 23) www.nextmovesoftware.co.uk Cambridge Science Park www.nextmovesoftware.com Milton Road, Cambridge England CB4 0EY
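
Worked example. The similarity measures of sections 4 and 7 and the scoring procedure of section 2 can be illustrated with a short Python sketch. This is not the code used for the poster and not part of JChem: the function names, the dictionary-of-counts fingerprint representation, the value returned for two empty fingerprints and the tie-breaking among equally similar neighbours are all assumptions made for illustration.

```python
def tanimoto_binary(x, y):
    """Binary Tanimoto |X n Y| / |X u Y| on two collections of feature ids."""
    x, y = set(x), set(y)
    union = len(x | y)
    # Assumption: two empty fingerprints are treated as identical.
    return len(x & y) / union if union else 1.0


def tanimoto_t0(x, y):
    """T0 = sum(x_i * y_i) / sum(x_i^2 + y_i^2 - x_i * y_i) on count dicts."""
    keys = set(x) | set(y)
    num = sum(x.get(k, 0) * y.get(k, 0) for k in keys)
    den = sum(x.get(k, 0) ** 2 + y.get(k, 0) ** 2 - x.get(k, 0) * y.get(k, 0)
              for k in keys)
    return num / den if den else 1.0


def tanimoto_t1(x, y):
    """T1 = sum(min(x_i, y_i)) / sum(max(x_i, y_i)) on count dicts."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 1.0


def benchmark_score(fingerprints, labels, similarity, k=10):
    """Nearest-neighbour score in the style described in section 2.

    labels[i] is an activity class name, or None for a decoy molecule.
    Each active is used as a query; the query itself is excluded.
    Ties are broken by input order (an assumption of this sketch).
    """
    per_class = {}
    for q, q_label in enumerate(labels):
        if q_label is None:
            continue  # decoys are never used as queries
        others = [i for i in range(len(labels)) if i != q]
        ranked = sorted(
            others,
            key=lambda i: similarity(fingerprints[q], fingerprints[i]),
            reverse=True,
        )
        hits = sum(labels[i] == q_label for i in ranked[:k])
        per_class.setdefault(q_label, []).append(hits / k)
    # Score per class, then the unweighted average over classes.
    class_scores = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_scores) / len(class_scores)


if __name__ == "__main__":
    # The single-feature example from section 7: x = {3}, y = {4}.
    x, y = {"f": 3}, {"f": 4}
    print(tanimoto_t1(x, y), tanimoto_t0(x, y))
```

On the example from section 7, `tanimoto_t1` gives 3/4 and `tanimoto_t0` gives 12/13, matching the values quoted there. Passing `tanimoto_t1` or `tanimoto_t0` as the `similarity` argument of `benchmark_score` is one way the two definitions could be compared on count fingerprints.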