Cheminformatics



                Noel M. O'Boyle




                   July 2012
EMBL-EBI/Wellcome Trust Course: Resources for
       Computational Drug Discovery
Cheminformatics
• Hard to define in words:
   – David Wild: “The field that studies all aspects of the representation and use
     of chemical and related biological information on computers”
   – Design, creation, organization, management, retrieval, analysis,
     dissemination, visualization and use of chemical information
• Hard to agree on spelling:
   – Sometimes chemoinformatics
• More easily thought of as encompassing a range of concepts
  and techniques
   –   Molecular similarity
   –   Quantitative structure-activity relationships (QSAR)
   –   Substructure search
   –   (Automated) Molecular depiction
   –   Encoding/decoding of molecular structures
   –   3D structure generation from a 2D or 0D structure
   –   Conformer generation
   –   Algorithms: ring perception, aromaticity, isomers
References
• An introduction to cheminformatics, A. R.
  Leach, V. J. Gillet
• Cheminformatics, Johann Gasteiger and
  Thomas Engel (Eds)
• Molecular modelling – Principles and
  Applications, A. R. Leach

• I571 Chemical Information Technology, David
  Wild, Indiana University
     http://i571.wikispaces.com/
Molecular representation




Mike Hann (GSK): “Ceci n'est pas une molecule serves
to remind us that all of the graphics images presented
here are not molecules, not even pictures of molecules,
but pictures of icons which we believe represent some
aspects of the molecule's properties.”
            http://mgl.scripps.edu/people/goodsell/mgs_art/hann.html
Computer representations of molecules
• How can a molecular structure be stored on
  a computer?
   –   Common names: aspirin
   –   IUPAC name: 2-acetoxybenzoic acid
   –   Formula: C9H8O4
   –   As an image (PNG, GIF, etc.)
   –   CAS number: 50-78-2
   –   File format: ChemDraw file, MOL file, etc.
   –   SMILES string: O=C(Oc1ccccc1C(=O)O)C
   – Binary Fingerprint:
       10000100000001100000100100000001
                                                    http://en.wikipedia.org/wiki/Aspirin


• How should it be stored?
   – …if I want to use it for computation
   – ...if I want a unique identifier
   – …if I want to retain stereochemical information
Computer representations of molecules
    • The structure of a molecule can be represented by
      a graph
          – Graph = collection of nodes and edges, nodes and
            edges have properties (atomic number, bond order)
    • Represent the molecular graph somehow
          – Connection table (which nodes are connected to which
            other nodes)
          – Line notation (e.g. SMILES)
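The graph view above can be sketched in a few lines of Python. This is a minimal illustration, not a toolkit data structure: the names `atoms`, `bonds`, `VALENCE`, `adjacency`, `implicit_hydrogens` and `formula` are all made up for this example, and the valence table only covers neutral C, N and O.

```python
from collections import Counter

# Acetic acid, CH3COOH, heavy atoms only (hydrogens are left implicit).
# Node property: element symbol. Edge property: bond order.
atoms = ["C", "C", "O", "O"]               # atom indices 0..3
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]  # (atom_a, atom_b, bond_order)

# Usual neutral valences; an assumption that is enough for this toy example.
VALENCE = {"C": 4, "N": 3, "O": 2}

def adjacency(atoms, bonds):
    """Connection table: for each atom, its (neighbour, bond_order) pairs."""
    table = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        table[a].append((b, order))
        table[b].append((a, order))
    return table

def implicit_hydrogens(atoms, bonds):
    """Fill each atom's unused valence with hydrogens."""
    table = adjacency(atoms, bonds)
    return [VALENCE[el] - sum(order for _, order in table[i])
            for i, el in enumerate(atoms)]

def formula(atoms, bonds):
    """Molecular formula with C first, then H, then the other elements."""
    counts = Counter(atoms)
    counts["H"] = sum(implicit_hydrogens(atoms, bonds))
    order = ["C", "H"] + sorted(k for k in counts if k not in ("C", "H"))
    return "".join(f"{el}{counts[el] if counts[el] > 1 else ''}"
                   for el in order if counts[el])

print(adjacency(atoms, bonds))
print(formula(atoms, bonds))
```

The connection table and the formula are both derived from the same two lists, which is the point of storing the graph rather than a name or an image.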




Fig 12.2: Molecular modelling – principles and applications, Andrew R Leach, Pearson, 2nd edn.
Chemical file formats




•   A large number of file formats have been developed, but there are certain de-facto
    standards
•   2D/3D structures:
     –   MOL file for small-molecule structures
     –   PDB files for protein structures from crystallography
     –   MOL2 files for protein structures from modelling software (e.g. after manipulation of the PDB
         file)
•   Line notations:
     –   SMILES format, InChI format
A chemical file format: MOL file
Fig 12.3: Molecular modelling – principles and applications, Andrew R Leach, Pearson, 2nd edn.




   • This file format can represent 0D, 2D information (a
     depiction) as well as 3D
SMILES format
• Simplified Molecular Input Line Entry System
   – Weininger, J Chem Inf Comput Sci, 1988, 28, 31
   – More recently, a community developed description:
     http://opensmiles.org
   – Linear format (“line notation”) that describes the connection table
     and stereochemistry of a molecule (i.e. 0D)
   – Convenient to enter as a query on-line, store in a spreadsheet,
     pass by email, etc.
• Examples:
   – CC represents CH3CH3 (ethane)
   – CC(=O)O represents CH3COOH (acetic acid)
• Basic guidelines:
   – Hydrogens are implicit
   – Parentheses indicate branches
   – Each atom is connected to the preceding atom to its left (excluding
     branches in-between)
   – Single bonds are implicit, = for double, # for triple
• What does the SMILES string OCC represent?
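The basic guidelines can be turned into a very small parser. The sketch below handles only a deliberately tiny subset (atoms C, N and O, the bond symbols = and #, and parentheses); rings, aromatic atoms, charges and stereochemistry are left out. `parse_smiles` is a hypothetical helper written for this example, not a function from any toolkit.

```python
def parse_smiles(smiles):
    """Parse a tiny SMILES subset: atoms C, N, O; bonds '=', '#'; branches '( )'.

    Returns (atoms, bonds) with bonds as (atom_a, atom_b, bond_order).
    """
    atoms, bonds = [], []
    previous = None      # index of the atom the next atom will bond to
    branch_stack = []    # atoms to return to when a branch closes
    order = 1            # pending bond order (single unless stated)
    for ch in smiles:
        if ch in "CNO":
            atoms.append(ch)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, order))
            previous, order = current, 1
        elif ch == "=":
            order = 2
        elif ch == "#":
            order = 3
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds

print(parse_smiles("CC(=O)O"))   # acetic acid
print(parse_smiles("OCC"))       # two carbons and an oxygen in a chain
```

Reading the second result back, OCC is an oxygen bonded to a two-carbon chain, i.e. ethanol written starting from the oxygen.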
SMILES format II
 • To represent rings, you need to break a ring bond and replace it
   by a ring opening symbol and a corresponding ring closure
   symbol
    • C1CCC=CC1 (cyclohexene): the paired 1s mark the two atoms joined by the broken ring bond
• To represent double bond stereochemistry you use / and \
   • Cl/C=C/Br (trans), Cl/C=C\Br (cis)
• To represent tetrahedral stereochemistry you use @ or @@
   • Br[C@](Cl)(I)F means that looking from the Br, the Cl, I, and
      F are arranged anticlockwise
• To represent aromaticity, use lower case
   • C1CCCCC1 (cyclohexane)
   • c1ccccc1 (benzene)
Canonical SMILES
• In general, many different SMILES strings can be written
  for the same molecule
   – Not a unique identifier (one-to-many)
   – Ethanol: CCO, OCC, C(O)C
• Algorithms for producing “canonical SMILES” have been
  developed
   – The same unique SMILES string is always created for a
     particular molecule
   – One-to-one relationship between structure and
     representation
   – Note however, that different software implement different
     canonicalisation algorithms
• Uses:
   – Can be used to remove duplicate molecules from a database
       • Generate the canonical SMILES for each molecule and ensure that
         they are unique
   – Check identity (compare two molecules)
       • Did this software change the structure? Or get the stereochemistry
         confused?
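The idea behind canonicalisation can be shown with a brute-force sketch: try every atom numbering of a small graph and keep the smallest description. This is only an illustration of the concept; real toolkits use far more efficient algorithms, and `canonical_key` is a name invented here. The molecules are entered by hand as atom and bond lists matching the SMILES strings above.

```python
from itertools import permutations

def canonical_key(atoms, bonds):
    """Return an ordering-independent key for a small molecular graph.

    Brute force: try every atom numbering and keep the smallest
    (atom labels, sorted bond list) description. The cost grows
    factorially, so this is only usable for a handful of atoms.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):   # perm[old_index] = new_index
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        renumbered = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds
        )
        candidate = (tuple(labels), tuple(renumbered))
        if best is None or candidate < best:
            best = candidate
    return best

# Ethanol written three ways, as in CCO, OCC and C(O)C.
cco = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
occ = (["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
c_o_c = (["C", "O", "C"], [(0, 1, 1), (0, 2, 1)])
# Dimethyl ether (COC): same formula, different graph.
coc = (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

# Duplicate removal: a set of canonical keys keeps one entry per structure.
unique = {canonical_key(*mol) for mol in (cco, occ, c_o_c, coc)}
print(len(unique))
```

The three ethanol inputs collapse to one key while dimethyl ether keeps its own, which is exactly the duplicate-removal use described above.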
SMILES format III
• There are a couple of nice features of the SMILES format that
  can come in handy when manipulating structures

• Concatenating SMILES strings creates a bond between
  fragments
   – CC and CO gives CCCO
   – Can be used for combinatorial chemistry, e.g. generating all
     possible products from a 4-component Ugi reaction
   – Can be used to prepare polymers by concatenating
     monomers
   – Open Babel can be used to prepare suitable SMILES strings

• In file format conversion, the atom order in a SMILES
  string is usually preserved in the output format
   – Sometimes you need a particular atom to be atom#1 in the
     file format (e.g. for covalent docking in GOLD)
       • Write the corresponding SMILES and convert to a 3D format
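Because concatenation is plain string handling, enumerating products or oligomers needs nothing more than the standard library. The helper names below (`join_fragments`, `polymer`, `enumerate_library`) are hypothetical, and the sketch assumes each fragment is written so that its attachment atoms come first and last and that any ring closure digits are closed within the fragment.

```python
from itertools import product

def join_fragments(*fragments):
    """Concatenate SMILES fragments: the last atom of one bonds to the first of the next."""
    return "".join(fragments)

def polymer(monomer, n):
    """Linear oligomer built from n copies of a repeat unit."""
    return monomer * n

def enumerate_library(*fragment_lists):
    """All products formed by taking one fragment from each list, in order."""
    return [join_fragments(*combo) for combo in product(*fragment_lists)]

print(join_fragments("CC", "CO"))                      # the example above
print(polymer("OCC", 3))                               # three ethylene glycol units
print(enumerate_library(["CC", "CCC"], ["O", "N"]))    # 2 x 2 products
```

For a four-component reaction the same `enumerate_library` call would simply take four lists.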
InChI
• International Chemical Identifier
    – Line notation developed by NIST and IUPAC
    – Goal: An index for uniquely identifying a molecule

Aspirin
InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H

• Features
    – Derived from the structure (unlike CAS number)
    – One-to-one relationship between InChI and structure (“canonical”)
    – Layers (of specificity)
         • Can distinguish between stereoisomers, isotopes, or can leave out those layers
    – Different tautomeric forms give rise to the same InChI (unlike SMILES)
• Notes
    – Not human readable or writeable
    – All implementations use the same (open source) code which is provided
      by the InChI Trust
         • “The Trust's goal is to enable the interlinking and combining of chemical, biological
           and related information, using unique machine-readable chemical structure
           representations to facilitate and expedite new scientific discoveries.”
• For more info, see http://www.inchi-trust.org under Downloads
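The layered structure is visible in the string itself: layers are separated by "/" and most start with a one-letter prefix (c for connections, h for hydrogens, f for the fixed-hydrogen layer in the aspirin example). The sketch below is plain string handling with made-up helper names; it does not generate or validate InChIs, which should always be done with the official InChI software.

```python
ASPIRIN = "InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H"

def inchi_layers(inchi):
    """Split an InChI string into its version, formula and remaining layers."""
    prefix, _, rest = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    parts = rest.split("/")
    return parts[0], parts[1], parts[2:]

def drop_from_layer(inchi, prefix):
    """Remove the first layer starting with `prefix` and everything after it.

    A crude way to compare at lower specificity; the result is not
    guaranteed to be a valid InChI for every input.
    """
    parts = inchi.split("/")
    for i, part in enumerate(parts[2:], start=2):
        if part.startswith(prefix):
            return "/".join(parts[:i])
    return inchi

version, formula, layers = inchi_layers(ASPIRIN)
print(version, formula)
for layer in layers:
    print(layer)
print(drop_from_layer(ASPIRIN, "f"))
```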
A unique identifier makes it easy to link databases
   [Figure: database entries from ChEBI and DrugBank]
US Generic Legislation
•   Comprehensive Drug Abuse Prevention and Control Act, 1970
•   Controlled Substances Act, 1970
•   Federal Analog Act, 1986

•   The term “controlled substance analog” means a substance
     –    The chemical structure of which is substantially similar to the chemical structure of a
          controlled substance in schedule I or II




         Slide courtesy Dr. J.J. Keating, School of Pharmacy, University College Cork
Molecular similarity
• Similarity principle:
   – Structurally similar molecules tend to have similar properties
       • Properties: biological activity, solubility, color and so on


• If we can measure similarity somehow…
   – Can construct a distance matrix
       • Distance = inverse of similarity (for a similarity s between 0 and 1, a common choice is 1 - s)
       • Such matrices can be used to cluster compounds, to create a 2D
         depiction showing the spread of molecular structures in a dataset,
         to select a diverse subset
   – Can use to find molecules in a database similar to a
     particular query
   – Can use to see whether a particular property is correlated
     with molecular similarity

• ...But how to measure similarity?
   – One way is using molecular fingerprints
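Once pairwise similarities exist, the uses listed above are short pieces of code. The sketch below assumes similarities between 0 and 1 and takes distance as 1 - similarity, which is one common choice rather than the only one. The similarity values and compound names are invented for illustration.

```python
# Hypothetical pairwise similarities for four compounds; the values are made up.
names = ["A", "B", "C", "D"]
similarity = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.6],
    [0.2, 0.1, 0.6, 1.0],
]

def distance_matrix(sim):
    """Convert similarities to distances using distance = 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]

def most_similar(sim, names, query):
    """Rank the other compounds by decreasing similarity to the query."""
    q = names.index(query)
    others = [(sim[q][j], names[j]) for j in range(len(names)) if j != q]
    return [name for _, name in sorted(others, reverse=True)]

def most_distant_pair(dist, names):
    """The pair with the largest distance: a natural seed for a diverse subset."""
    pairs = [(dist[i][j], names[i], names[j])
             for i in range(len(names)) for j in range(i + 1, len(names))]
    _, a, b = max(pairs)
    return a, b

dist = distance_matrix(similarity)
print(most_similar(similarity, names, "A"))
print(most_distant_pair(dist, names))
```

The same distance matrix could be handed to a clustering routine or a 2D projection method; those steps are outside this sketch.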
Molecular fingerprints
• A molecular fingerprint is an encoding of the molecular structure
  onto a (long) binary string
    – 100100010000001011000000000001...

• Path-based fingerprints (e.g. Daylight fingerprint)
    – Break the molecule up into all possible linear fragments (paths) of
      length 1, 2, 3, ..., 7
    – Create a string representing each fragment
    – Hash each string onto a number between 1 and 1024 (for example)
        • Wikipedia: “A hash function is any well-defined procedure or mathematical
          function that converts a large, possibly variable-sized amount of data into a
          small datum, usually a single integer that may serve as an index to an array”
    – Set the corresponding bit of the fingerprint to 1 (bits that no fragment hashes to stay 0)

• Key-based fingerprints (e.g. MACCS keys)
    – A (long) list of pre-generated questions about a chemical structure
        • “Are there fewer than 3 oxygens?”
        • “Is there an S-S bond?”
        • “Is there a ring of size 4?”
    – Each answer, true or false, corresponds to a 1 or 0 in the binary
      fingerprint
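The path-based recipe above can be sketched as follows. This is a simplified illustration and not the Daylight algorithm: path length is counted in atoms (an assumption), atoms are labelled by element only, and `zlib.crc32` stands in for whatever hash function a real implementation uses. Bit positions run from 0 to 1023 here.

```python
import zlib

def linear_paths(atoms, bonds, max_atoms=7):
    """Enumerate all linear paths of 1..max_atoms atoms as strings."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))
    symbols = {1: "-", 2: "=", 3: "#"}
    found = set()

    def extend(path, tokens):
        # A path and its reverse are the same fragment: keep one spelling.
        found.add("".join(min(tokens, tokens[::-1])))
        if len(path) == max_atoms:
            return
        for nxt, order in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt], tokens + (symbols[order], atoms[nxt]))

    for start in range(len(atoms)):
        extend([start], (atoms[start],))
    return found

def fingerprint(atoms, bonds, n_bits=1024):
    """Hash each path onto a bit position; return the set of 'on' bits."""
    return {zlib.crc32(p.encode()) % n_bits for p in linear_paths(atoms, bonds)}

ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
propanol = (["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)])

print(sorted(linear_paths(*ethanol)))
print(fingerprint(*ethanol) <= fingerprint(*propanol))
```

Every path in ethanol also occurs in propan-1-ol, so every bit set for ethanol is also set for propan-1-ol. This subset property is what makes such fingerprints useful as a fast pre-filter for substructure search.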
Similarity of molecular fingerprints
• Molecules with the same bits set will be more similar than
  molecules with different bits set

• To quantify this, we can use the Tanimoto coefficient
     – Tanimoto Similarity = Intersection/Union
     – Bounded by 0 and 1 (no similarity to perfect similarity)
     – A value greater than 0.7 or 0.8 is commonly taken to indicate structural similarity

• How similar are aspirin (A) and salicylic acid (B)?




•   Using a path-based fingerprint, 64 bits are set for A, 38 for B
     • Intersection is 38 (Note: B is a substructure of A)
     • Union is 64
     • Similarity = 0.59
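The arithmetic above is easy to reproduce if each fingerprint is held as the set of its "on" bit positions. The bit positions below are stand-ins chosen only to reproduce the counts quoted above (64 bits for A, 38 for B, all of B's bits also in A); they are not real fingerprints of aspirin or salicylic acid.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0   # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union

aspirin = set(range(64))          # 64 bits set (stand-in positions)
salicylic_acid = set(range(38))   # 38 bits set, all shared with aspirin

similarity = tanimoto(aspirin, salicylic_acid)   # 38 / 64
print(round(similarity, 2))
```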
Similarity of atom environments
•   Fingerprints can also be used to measure similarity
    of atom environments
•   Circular fingerprints (HOSE codes)
     – Bremser, W., HOSE – a novel substructure code. Anal.
       Chim. Acta 1978, 103, 355.
     – Describe atom environment in terms of atom types at
       various bond distances from a particular atom
•   Can be used for proton NMR prediction
     – Hydrogens attached to similar atoms tend to have
       similar NMR shifts
     – Given a database of molecules with assigned NMR
       spectra, try to find Hs in the same environment up to as
       many levels as possible and use their NMR shifts to
       predict the shift for your proton
     Image: T. Davies, W. Robien, J. Seymour. Spectroscopy Europe, 2006, 18, 22
       (http://www.modgraph.co.uk/Downloads/TD_18_1.pdf)
•   The same database can be used for structure
    identification
     – Given a proton NMR spectrum, what chemical structures
       are consistent with the NMR
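The "atom types at various bond distances" idea can be sketched as a breadth-first walk outwards from the atom of interest. This is a much simplified stand-in for a real HOSE code: it ignores bond orders, ring closures and the ranking rules of the original scheme, and it works on heavy atoms only. The function names are invented for this example.

```python
def environment(atoms, bonds, centre, max_spheres=3):
    """Describe an atom by the sorted element symbols found 1, 2, ... bonds away."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {centre}
    frontier = {centre}
    spheres = []
    for _ in range(max_spheres):
        frontier = {n for atom in frontier for n in neighbours[atom]} - seen
        if not frontier:
            break
        seen |= frontier
        spheres.append("".join(sorted(atoms[i] for i in frontier)))
    return atoms[centre] + "/" + "/".join(spheres)

def shared_spheres(code_a, code_b):
    """Number of leading spheres (after the centre atom) on which two codes agree."""
    a, b = code_a.split("/"), code_b.split("/")
    if a[0] != b[0]:
        return 0
    count = 0
    for x, y in zip(a[1:], b[1:]):
        if x != y:
            break
        count += 1
    return count

ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

code_a = environment(*ethanol, centre=1)    # the carbon bearing the OH
code_b = environment(*propanol, centre=2)   # the carbon bearing the OH
print(code_a, code_b, shared_spheres(code_a, code_b))
```

A shift predictor along these lines would look up database atoms whose codes agree with the query to as many spheres as possible and average their assigned shifts.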

•   NMRShiftDB (http://nmrshiftdb.org)
     – Freely available Open database of NMR spectra – add your own spectra (with assigned
       peaks) – predict assignments
     –   Tutorial: http://nmrshiftdb.sourceforge.net/nmrshiftdbebitraining.pdf
Substructure search using SMARTS
• SMARTS – an extension of SMILES for substructure searching
   – Can be used to find molecules with a particular substructure
   – Can be used to filter out molecules with a particular substructure

• Simple example
   – Ether: [OD2]([#6])[#6]
       • Any oxygen with exactly two bonds each to a carbon
• Can get (a lot) more complicated
   – Carbonic Acid or Carbonic Acid-Ester:
     [CX3](=[OX1])([OX2])[OX2H,OX1H0-1]
       • Hits acid and conjugate base. Won't hit carbonic acid diester
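To make the ether pattern concrete, the check it encodes can be written out by hand on a simple atom and bond list. This is not a SMARTS engine; `find_ethers` is a hypothetical function that hard-codes the single pattern [OD2]([#6])[#6]. Hydrogens are implicit, so D (the number of explicit connections) is just the number of heavy-atom neighbours.

```python
def find_ethers(atoms, bonds):
    """Indices of oxygens with exactly two explicit connections, both to carbon.

    Hand-written equivalent of the SMARTS [OD2]([#6])[#6].
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b, _order in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [i for i, el in enumerate(atoms)
            if el == "O"
            and len(neighbours[i]) == 2
            and all(atoms[n] == "C" for n in neighbours[i])]

diethyl_ether = (["C", "C", "O", "C", "C"], [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1)])
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
methyl_acetate = (["C", "C", "O", "O", "C"], [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1)])

library = {"diethyl ether": diethyl_ether, "ethanol": ethanol, "methyl acetate": methyl_acetate}

# Finding: which molecules contain the pattern. Filtering: which do not.
hits = [name for name, mol in library.items() if find_ethers(*mol)]
kept = [name for name, mol in library.items() if not find_ethers(*mol)]
print(hits, kept)
```

Note that the ester oxygen of methyl acetate also satisfies the pattern as written, which is why practical SMARTS queries tend to grow more complicated.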
SMARTSviewer
http://smartsview.zbh.uni-hamburg.de/
K. Schomburg, H.-C. Ehrlich, K. Stierand, M. Rarey. "From Structure Diagrams
to Visual Chemical Patterns", J. Chem. Inf. Model., 2010, 50, 1529.

[CX3](=[OX1])([OX2])[OX2H,OX1H0-1]
Substructure search using SMARTS
• Examples of use
   – Filtering structures
   – Identify substructures that are associated with toxicological
     problems
   – Develop or use a group contribution descriptor such as TPSA
FAF-Drugs2: Free ADME/tox filtering tool to assist drug discovery
and chemical biology projects, Lagorce et al, BMC Bioinf, 2008, 9, 396.
Calculation of Topological Polar Surface Area


• TPSA
• Ertl, Rohde, Selzer, J. Med.
  Chem., 2000, 43, 3714.
• A fragment-based method
  for calculating the polar
  surface area
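A fragment-based calculation is just a table lookup and a sum. In the sketch below the three oxygen contributions are quoted from memory of the published table and should be treated as assumptions to be checked against the paper. The fragment labels are invented, and the fragments of aspirin are assigned by hand rather than perceived from the structure (a real implementation would typically use SMARTS patterns for that step).

```python
# Contributions in square angstroms for a few oxygen fragment types.
# Assumption: values recalled from the Ertl et al. table; verify before use.
CONTRIBUTION = {
    "carbonyl O (=O)": 17.07,
    "hydroxyl O (-OH)": 20.23,
    "ether/ester O (-O-)": 9.23,
}

def tpsa(fragment_counts):
    """Sum the contributions of the polar fragments present in a molecule."""
    return sum(CONTRIBUTION[name] * n for name, n in fragment_counts.items())

# Aspirin: two C=O, one acid OH, one ester oxygen (fragments assigned by hand).
aspirin = {"carbonyl O (=O)": 2, "hydroxyl O (-OH)": 1, "ether/ester O (-O-)": 1}

print(f"{tpsa(aspirin):.2f}")
```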
Quantitative Structure-Activity Relationships (QSAR)

•   Also QSPR (Structure-Property)
     – Exactly the same idea but with some physical property
•   Create a mathematical model that links a molecule's structure to a
    particular property or biological activity
     – Could be used to perceive the link between structure and function/property
     – Could be used to propose changes to a structure to increase activity
     – Could be used to predict the activity/property for an unknown molecule

•   Problem: Activity = 2.4 * [drawing of a molecular structure]. Does not compute!




• Need to replace the actual structure by some values that are a
  proxy for the structure - “Molecular descriptors”
•   Numerical values that represent in some way some physico-chemical
    properties of the molecule
     •   We saw one already, the Topological Polar Surface Area
     •   Others: molecular weight, number of hydrogen bond donors, LogP
         (octanol/water partition coefficient)
     •   It is usual to calculate 100 or more of these
Building and testing a predictive QSAR model

• Need dataset with known values for the property of
  interest
   – Divide into 2/3 training set and 1/3 test set
• Choose a regression model
   – Linear regression, artificial neural network, support vector
     machine, random forest, etc.
• Train the model to predict the property values for the
  training set based on their descriptors
• Apply the model to the test set
   – Find the RMSEP and R2
       • Root-mean-square error of prediction and squared correlation coefficient
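The workflow above can be sketched end to end with a single descriptor and ordinary least squares. Everything here is illustrative: the data are synthetic (generated from a known line plus noise, not measured), the function names are invented, and R2 is computed as the squared Pearson correlation between observed and predicted values. In practice the descriptors would come from a toolkit and the model from a statistics package.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one descriptor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmsep(observed, predicted):
    """Root-mean-square error of prediction."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Synthetic (descriptor, property) pairs: not real measurements.
rng = random.Random(0)
data = [(x, 0.5 * x + 1.0 + rng.gauss(0, 0.1)) for x in [i / 3 for i in range(30)]]
rng.shuffle(data)

split = (2 * len(data)) // 3               # 2/3 training set, 1/3 test set
train, test = data[:split], data[split:]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
predicted = [slope * x + intercept for x, _ in test]
observed = [y for _, y in test]
print(f"RMSEP = {rmsep(observed, predicted):.3f}, R2 = {r_squared(observed, predicted):.3f}")
```

The key design point is that the test set plays no part in fitting, so RMSEP and R2 estimate how the model behaves on molecules it has not seen.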


• Practical Notes:
   – Descriptors can be calculated with the CDK or RDKit
   – Models can be built using R (r-project.org)
   – For a combination of the two, see rcdk
Lipinski's Rule of Five
  [Photo caption: Chris Lipinski]
  •   Note: the Rule of Five is a rule of thumb for oral bioavailability
  •   Lipinski took a dataset of drug candidates that made it to Phase II
  •   He examined the distribution of particular descriptor values related to
      ADME
  •   An orally active drug should not fail more than one of the following
      'rules':
       –   Molecular weight <= 500
       –   Number of H-bond donors <= 5
       –   Number of H-bond acceptors <= 10
       –   LogP <= 5
  •   These rules are often applied as a pre-screening filter
                   Image: http://collaborativedrug.com/blog/blog/2009/10/07/cdd-community-meeting/
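As a pre-screening filter the rules reduce to four comparisons and a count. The sketch below assumes the four descriptor values have already been calculated elsewhere (for example with one of the toolkits mentioned earlier); the function names and the example values are hypothetical.

```python
def lipinski_violations(mol_weight, h_donors, h_acceptors, logp):
    """Return the names of the rules a molecule fails."""
    checks = {
        "molecular weight <= 500": mol_weight <= 500,
        "H-bond donors <= 5": h_donors <= 5,
        "H-bond acceptors <= 10": h_acceptors <= 10,
        "LogP <= 5": logp <= 5,
    }
    return [rule for rule, passed in checks.items() if not passed]

def passes_filter(mol_weight, h_donors, h_acceptors, logp):
    """Pre-screening filter: allow at most one violation."""
    return len(lipinski_violations(mol_weight, h_donors, h_acceptors, logp)) <= 1

# Made-up descriptor values for three hypothetical compounds.
print(passes_filter(180.2, 1, 4, 1.2))          # no violations
print(passes_filter(520.0, 2, 5, 3.0))          # one violation is tolerated
print(lipinski_violations(650.0, 6, 12, 3.0))   # three violations
```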
Open Source cheminformatics software resources

• GUI:
    – Open Babel
    – LICSS – Excel-CDK interface

• Command-line interface:
    – Open Babel (“babel”)
    – MayaChemTools

• Programming toolkits:
    – Open Babel (C++, Perl, Python, .NET, Java), RDKit (C++, Python),
      Chemistry Development Kit [CDK] (Java, Jython, ...), PerlMol (Perl),
      MayaChemTools (Perl)
    – Cinfony (by me!) presents a simplified interface to some of these

• Specialized toolkits:
    – OSRA: image to structure
    – OPSIN: name to structure
    – OSCAR: Identify chemical terms in text

 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Cheminformatics

  • 1. Cheminformatics Noel M. O'Boyle July 2012 EMBL-EBI/Wellcome Trust Course: Resources for Computational Drug Discovery
  • 2. Cheminformatics • Hard to define in words: – David Wild: “The field that studies all aspects of the representation and use of chemical and related biological information on computers” – Design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information • Hard to agree on spelling: – Sometimes chemoinformatics • More easily thought of as encompassing a range of concepts and techniques – Molecular similarity – Quantitative structure-activity relationships (QSAR) – Substructure search – (Automated) Molecular depiction – Encoding/decoding of molecular structures – 3D structure generation from a 2D or 0D structure – Conformer generation – Algorithms: ring perception, aromaticity, isomers
  • 3. References • An introduction to cheminformatics, A. R. Leach, V. J. Gillet • Cheminformatics, Johann Gasteiger and Thomas Engel (Eds) • Molecular modelling – Principles and Applications, A. R. Leach • I571 Chemical Information Technology, David Wild, Indiana University http://i571.wikispaces.com/
  • 4. Molecular representation Mike Hann (GSK): “Ceci n'est pas une molecule serves to remind us that all of the graphics images presented here are not molecules, not even pictures of molecules, but pictures of icons which we believe represent some aspects of the molecule's properties.” http://mgl.scripps.edu/people/goodsell/mgs_art/hann.html
  • 5. Computer representations of molecules • How can a molecular structure be stored on a computer? – Common names: aspirin – IUPAC name: 2-acetoxybenzoic acid – Formula: C9H8O4 – As an image (PNG, GIF, etc.) – CAS number: 50-78-2 – File format: ChemDraw file, MOL file, etc. – SMILES string: O=C(Oc1ccccc1C(=O)O)C – Binary Fingerprint: 10000100000001100000100100000001 http://en.wikipedia.org/wiki/Aspirin • How should it be stored? – …if I want to use it for computation – ...if I want a unique identifier – …if I want to retain stereochemical information
  • 6. Computer representations of molecules • The structure of a molecule can be represented by a graph – Graph = collection of nodes and edges, nodes and edges have properties (atomic number, bond order) • Represent the molecular graph somehow – Connection table (which nodes are connected to which other nodes) – Line notation (e.g. SMILES) Fig 12.2: Molecular modelling – principles and applications, Andrew R Leach, Pearson, 2nd edn.
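As a rough illustration of the connection table idea on this slide, the sketch below stores acetic acid as a graph in plain Python. The dictionary and list layout and the `neighbours` helper are illustrative choices, not the layout of any real file format or toolkit.

```python
# Acetic acid, CH3COOH, heavy atoms only (hydrogens are left implicit).
# Nodes carry a property (atomic number); edges carry a property (bond order).
atoms = {1: 6, 2: 6, 3: 8, 4: 8}            # atom index -> atomic number
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]   # (atom, atom, bond order)


def neighbours(atom, bonds):
    """Return {neighbour index: bond order} for one atom of the graph."""
    result = {}
    for a, b, order in bonds:
        if a == atom:
            result[b] = order
        elif b == atom:
            result[a] = order
    return result


# The carboxyl carbon (atom 2) is bonded to the methyl carbon and two oxygens.
print(neighbours(2, bonds))
```

The same information could equally be held as an adjacency matrix; a bond list tends to be more compact for molecules, where each atom has only a few neighbours.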
  • 7. Chemical file formats • A large number of file formats have been developed, but there are certain de-facto standards • 2D/3D structures: – MOL file for small-molecule structures – PDB files for protein structures from crystallography – MOL2 files for protein structures from modelling software (e.g. after manipulation of the PDB file) • Line notations: – SMILES format, InChI format
  • 8. A chemical file format: MOL file Fig 12.3: Molecular modelling – principles and applications, Andrew R Leach, Pearson, 2nd edn. • This file format can represent 0D, 2D information (a depiction) as well as 3D
  • 9. SMILES format • Simplified Molecular Input Line Entry System – Weininger, J Chem Inf Comput Sci, 1988, 28, 31 – More recently, a community developed description: http://opensmiles.org – Linear format (“line notation”) that describes the connection table and stereochemistry of a molecule (i.e. 0D) – Convenient to enter as a query on-line, store in a spreadsheet, pass by email, etc. • Examples: – CC represents CH3CH3 (ethane) – CC(=O)O represents CH3COOH (acetic acid) • Basic guidelines: – Hydrogens are implicit – Parentheses indicate branches – Each atom is connected to the preceding atom to its left (excluding branches in-between) – Single bonds are implicit, = for double, # for triple • What does the SMILES string OCC represent?
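The basic guidelines above can be sketched as a toy reader. This is a minimal sketch, assuming a tiny SMILES subset (only C, N and O atoms, `=` and `#` bonds, and parenthesised branches); rings, aromatic atoms, bracket atoms and stereochemistry are deliberately left out, and the function names are made up for illustration.

```python
VALENCE = {"C": 4, "N": 3, "O": 2}   # normal valences, used to fill in implicit hydrogens
BOND_ORDER = {"=": 2, "#": 3}        # single bonds are implicit


def parse_smiles(smiles):
    """Parse the toy SMILES subset into (atoms, bonds).

    atoms is a list of element symbols; bonds is a list of
    (atom index, atom index, bond order) tuples.
    """
    atoms, bonds = [], []
    previous = None       # the atom that the next atom will be bonded to
    branch_points = []    # atoms to return to when a branch closes
    order = 1
    for char in smiles:
        if char in VALENCE:
            atoms.append(char)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, order))
            previous, order = current, 1
        elif char in BOND_ORDER:
            order = BOND_ORDER[char]
        elif char == "(":
            branch_points.append(previous)
        elif char == ")":
            previous = branch_points.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {char!r}")
    return atoms, bonds


def implicit_hydrogens(atoms, bonds):
    """Return the number of implicit hydrogens on each atom."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [VALENCE[symbol] - used[i] for i, symbol in enumerate(atoms)]


for smiles in ("CC", "CC(=O)O", "OCC"):
    atoms, bonds = parse_smiles(smiles)
    print(smiles, atoms, implicit_hydrogens(atoms, bonds))
```

For `OCC` the hydrogen counts come out as 1, 2 and 3, i.e. HO-CH2-CH3, which is ethanol.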
  • 10. SMILES format II • To represent rings, you need to break a ring bond and replace it by a ring opening symbol and a corresponding ring closure symbol • C1CCC=CC1 (cyclohexene) • To represent double bond stereochemistry you use / and \ • Cl/C=C/Br (trans), Cl/C=C\Br (cis) • To represent tetrahedral stereochemistry you use @ or @@ • Br[C@](Cl)(I)F means that looking from the Br, the Cl, I, and F are arranged anticlockwise • To represent aromaticity, use lower case • C1CCCCC1 (cyclohexane) • c1ccccc1 (benzene)
  • 11. Canonical SMILES • In general, many different SMILES strings can be written for the same molecule – Not a unique identifier (one-to-many) – Ethanol: CCO, OCC, C(O)C • Algorithms for producing “canonical SMILES” have been developed – The same unique SMILES string is always created for a particular molecule – One-to-one relationship between structure and representation – Note however, that different software packages implement different canonicalisation algorithms • Uses: – Can be used to remove duplicate molecules from a database • Generate the canonical SMILES for each molecule and ensure that they are unique – Check identity (compare two molecules) • Did this software change the structure? Or get the stereochemistry confused?
  • 12. SMILES format III • There are a couple of nice features of the SMILES format that can come in handy when manipulating structures • Concatenating SMILES strings creates a bond between fragments – CC and CO gives CCCO – Can be used for combinatorial chemistry, e.g. generating all possible products from a 4-component Ugi reaction – Can be used to prepare polymers by concatenating monomers – Open Babel can be used to prepare suitable SMILES strings • In file format conversion, the atom order in a SMILES string is usually preserved in the output format – Sometimes you need a particular atom to be atom#1 in the file format (e.g. for covalent docking in GOLD) • Write the corresponding SMILES and convert to a 3D format
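The concatenation trick needs nothing more than string handling. The sketch below is an illustration only: `combine` and `polymer` are hypothetical helper names, and it assumes simple fragments written so that the last atom of one fragment is meant to bond to the first atom of the next (no dangling ring-closure digits).

```python
from itertools import product


def combine(*fragment_lists):
    """Enumerate every product obtained by joining one fragment from each list.

    Concatenating two SMILES strings bonds the end of the first fragment
    to the first atom of the second.
    """
    return ["".join(parts) for parts in product(*fragment_lists)]


def polymer(monomer, n):
    """Build an n-mer by repeating the monomer SMILES n times."""
    return monomer * n


print(combine(["CC"], ["CO"]))                   # the example from the slide
print(combine(["C", "CC"], ["O", "N"], ["C"]))   # a small 2 x 2 x 1 library
print(polymer("OCC", 3))                         # three ethylene glycol units
```

The enumerated strings are generally not canonical, so a real workflow would pass them through a toolkit afterwards to check validity and remove duplicates.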
  • 13. InChI • International Chemical Identifier – Line notation developed by NIST and IUPAC – Goal: An index for uniquely identifying a molecule Aspirin InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)/f/h11H • Features – Derived from the structure (unlike CAS number) – One-to-one relationship between InChI and structure (“canonical”) – Layers (of specificity) • Can distinguish between stereoisomers, isotopes, or can leave out those layers – Different tautomeric forms give rise to the same InChI (unlike SMILES) • Notes – Not human readable or writeable – All implementations use the same (open source) code which is provided by the InChI Trust • “The Trust's goal is to enable the interlinking and combining of chemical, biological and related information, using unique machine-readable chemical structure representations to facilitate and expedite new scientific discoveries.” • For more info, see http://www.inchi-trust.org under Downloads
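The layered structure of an InChI can be seen by splitting the string on `/`. This is a minimal sketch using the aspirin InChI from the slide; `inchi_layers` is a made-up helper that assumes a formula layer is present, and it is not part of the official InChI software.

```python
def inchi_layers(inchi):
    """Split an InChI string into (version, formula, layers).

    layers is a list of (prefix letter, content) pairs kept in order,
    because a prefix such as 'h' can occur more than once.
    """
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI":
        raise ValueError("not an InChI string")
    version, formula, *rest = body.split("/")
    layers = [(layer[:1], layer[1:]) for layer in rest]
    return version, formula, layers


aspirin = ("InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12"
           "/h2-5H,1H3,(H,11,12)/f/h11H")
version, formula, layers = inchi_layers(aspirin)
print(version, formula)
for letter, content in layers:
    print(letter, content)
```

Here the `c` layer holds the connectivity, the first `h` layer the hydrogens, and the `f` layer introduces the fixed-hydrogen information that follows it.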
  • 14. A unique identifier makes it easy to link databases DrugBank ChEBI
  • 15. US Generic Legislation • Comprehensive Drug Abuse and Control Act, 1970 • Controlled Substances Act, 1970 • Federal Analog Act, 1986 • The term “controlled substance analog” means a substance – The chemical structure of which is substantially similar to the chemical structure of a controlled substance in schedule I or II Slide courtesy Dr. J.J. Keating, School of Pharmacy, University College Cork
  • 16. Molecular similarity • Similarity principle: – Structurally similar molecules tend to have similar properties • Properties: biological activity, solubility, color and so on • If we can measure similarity somehow… – Can construct a distance matrix • Distance = inverse of similarity (e.g. 1 − similarity for a measure bounded by 0 and 1) • Such matrices can be used to cluster compounds, to create a 2D depiction showing the spread of molecular structures in a dataset, to select a diverse subset – Can use to find molecules in a database similar to a particular query – Can use to see whether a particular property is correlated with molecular similarity • ...But how to measure similarity? – One way is using molecular fingerprints
  • 17. Molecular fingerprints • A molecular fingerprint is an encoding of the molecular structure onto a (long) binary string – 100100010000001011000000000001... • Path-based fingerprints (e.g. Daylight fingerprint) – Break the molecule up into all possible fragments of length 1, 2, 3...7 – Create a string representing each fragment – Hash each string onto a number between 1 and 1024 (for example) • Wikipedia: “A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array” – Set the corresponding bit of the fingerprint to 1 (all others will be 0) • Key-based fingerprints (e.g. MACCS keys) – A (long) list of pre-generated questions about a chemical structure • “Are there fewer than 3 oxygens?” • “Is there an S-S bond?” • “Is there a ring of size 4?” – Each answer, true or false, corresponds to a 1 or 0 in the binary fingerprint
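The path-based recipe above can be sketched in a few lines. This is a simplified illustration, not the Daylight algorithm: it ignores bond orders, labels each path by its element symbols only, uses `zlib.crc32` as a stand-in hash, and numbers bits from 0 to 1023 rather than 1 to 1024.

```python
import zlib


def linear_paths(neighbours, max_atoms=7):
    """Yield every simple path of 1..max_atoms atoms as a tuple of atom indices."""
    def extend(path):
        yield path
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:
                    yield from extend(path + (nxt,))
    for start in neighbours:
        yield from extend((start,))


def path_fingerprint(atoms, neighbours, n_bits=1024, max_atoms=7):
    """Return the set of bit positions switched on by the molecule's paths."""
    bits = set()
    for path in linear_paths(neighbours, max_atoms):
        symbols = [atoms[i] for i in path]
        # A path read forwards or backwards must give the same string.
        label = "-".join(min(symbols, symbols[::-1]))
        bits.add(zlib.crc32(label.encode()) % n_bits)
    return bits


# Ethanol as a heavy-atom graph: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_neighbours = {0: [1], 1: [0, 2], 2: [1]}
print(sorted(path_fingerprint(ethanol_atoms, ethanol_neighbours)))
```

Because only the path labels are hashed, the fingerprint does not depend on the order in which the atoms happen to be numbered.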
  • 18. Similarity of molecular fingerprints • Molecules with the same bits set will be more similar than molecules with different bits set • To quantify this, we can use the Tanimoto coefficient – Tanimoto Similarity = Intersection/Union – Bounded by 0 and 1 (no similarity to perfect similarity) – A value of greater than 0.7 or 0.8 indicates structural similarity • How similar are aspirin (A) and salicylic acid (B)? • Using a path-based fingerprint, 64 bits are set for A, 38 for B • Intersection is 38 (Note: B is a substructure of A) • Union is 64 • Similarity = 0.59
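The aspirin and salicylic acid numbers above can be reproduced with Python sets. Only the bit counts (64 and 38, with B's bits a subset of A's) come from the slide; the actual bit positions below are placeholders chosen to give those counts.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity = |intersection| / |union| of the bits set."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0   # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Placeholder bit positions: 64 bits set for aspirin (A), 38 for salicylic
# acid (B), and every bit of B also set in A.
aspirin_bits = set(range(64))
salicylic_acid_bits = set(range(38))

similarity = tanimoto(aspirin_bits, salicylic_acid_bits)
print(round(similarity, 2))   # 38 / 64, i.e. 0.59
```

A corresponding distance for clustering or diversity selection could be taken as `1 - similarity`.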
  • 19. Similarity of atom environments • Fingerprints can also be used to measure similarity of atom environments • Circular fingerprints (HOSE codes) – Bremser, W., HOSE – a novel substructure code. Anal. Chim. Acta 1978, 103, 355. – Describe atom environment in terms of atom types at various bond distances from a particular atom • Can be used for proton NMR prediction – Hydrogens attached to similar atoms tend to have similar NMR shifts – Given a database of molecules with assigned NMR spectra, try to find Hs in the same environment up to as many levels as possible and use their NMR shifts to predict the shift for your proton • Image: T. Davies, W. Robien, J. Seymour. Spectroscopy Europe, 2006, 18, 22 (http://www.modgraph.co.uk/Downloads/TD_18_1.pdf) • The same database can be used for structure identification – Given a proton NMR spectrum, what chemical structures are consistent with the NMR • NMRShiftDB (http://nmrshiftdb.org) – Freely available Open database of NMR spectra – add your own spectra (with assigned peaks) – predict assignments – Tutorial: http://nmrshiftdb.sourceforge.net/nmrshiftdbebitraining.pdf
  • 20. Substructure search using SMARTS • SMARTS – an extension of SMILES for substructure searching – Can be used to find molecules with a particular substructure – Can be used to filter out molecules with a particular substructure • Simple example – Ether: [OD2]([#6])[#6] • Any oxygen with exactly two bonds each to a carbon • Can get (a lot) more complicated – Carbonic Acid or Carbonic Acid-Ester: [CX3](=[OX1])([OX2])[OX2H,OX1H0-1] • Hits acid and conjugate base. Won't hit carbonic acid diester
  • 21. SMARTSviewer http://smartsview.zbh.uni-hamburg.de/ K. Schomburg, H.-C. Ehrlich, K. Stierand, M.Rarey. “From Structure Diagrams to Visual Chemical Patterns” J. Chem. Inf. Model., 2010, 50, 1529. [CX3](=[OX1])([OX2])[OX2H,OX1H0-1]
  • 22. Substructure search using SMARTS • SMARTS – an extension of SMILES for substructure searching – Can be used to find molecules with a particular substructure – Can be used to filter out molecules with a particular substructure • Simple example – Ether: [OD2]([#6])[#6] • Any oxygen with exactly two bonds each to a carbon • Can get (a lot) more complicated – Carbonic Acid or Carbonic Acid-Ester: [CX3](=[OX1])([OX2])[OX2H,OX1H0-1] • Hits acid and conjugate base. Won't hit carbonic acid diester • Examples of use – Filtering structures – Identify substructures that are associated with toxicological problems – Develop or use a group contribution descriptor such as TPSA
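To make the ether pattern concrete, the sketch below hand-codes the test that `[OD2]([#6])[#6]` expresses, on a heavy-atom graph. It is an illustration only: a real SMARTS matcher in a toolkit handles arbitrary patterns, whereas `find_ether_oxygens` is a made-up function that checks this one pattern.

```python
def find_ether_oxygens(atoms, neighbours):
    """Return indices of oxygens with exactly two connections, both to carbon.

    atoms is a list of element symbols (hydrogens implicit); neighbours maps
    each atom index to the list of atoms it is bonded to.
    """
    hits = []
    for index, symbol in enumerate(atoms):
        bonded = neighbours[index]
        if (symbol == "O"                                  # an oxygen ...
                and len(bonded) == 2                       # ... with degree 2 (D2)
                and all(atoms[j] == "C" for j in bonded)): # ... both carbon (#6)
            hits.append(index)
    return hits


# Diethyl ether, CCOCC
diethyl_ether = (["C", "C", "O", "C", "C"],
                 {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]})
# Ethanol, CCO: the oxygen has only one heavy-atom neighbour
ethanol = (["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})

print(find_ether_oxygens(*diethyl_ether))
print(find_ether_oxygens(*ethanol))
```

Used as a filter, a molecule would be kept or rejected depending on whether the list of hits is empty. Note that, like the SMARTS pattern itself, this test also matches the alkoxy oxygen of an ester.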
  • 23. FAF-Drugs2: Free ADME/tox filtering tool to assist drug discovery and chemical biology projects, Lagorce et al, BMC Bioinf, 2008, 9, 396.
  • 24. Calculation of Topological Polar Surface Area • TPSA • Ertl, Rohde, Selzer, J. Med. Chem., 2000, 43, 3714. • A fragment-based method for calculating the polar surface area
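A fragment-based TPSA is simply a sum of tabulated contributions over the polar fragments found in the molecule. The sketch below works this through for aspirin. The three contribution values are quoted from memory of the Ertl et al. table and should be checked against the paper before use; the fragment names are informal labels invented here, and in practice the fragments would be found by substructure matching rather than listed by hand.

```python
# Polar surface contributions in square angstroms (assumed values, quoted
# from memory of Ertl et al. 2000; verify against the published table).
CONTRIBUTION = {
    "O, two single bonds (ether/ester)": 9.23,
    "O, double bonded (carbonyl)": 17.07,
    "O-H (hydroxyl)": 20.23,
}

# Polar fragments in aspirin, O=C(Oc1ccccc1C(=O)O)C, listed by hand.
aspirin_fragments = [
    "O, double bonded (carbonyl)",        # ester C=O
    "O, two single bonds (ether/ester)",  # ester oxygen attached to the ring
    "O, double bonded (carbonyl)",        # acid C=O
    "O-H (hydroxyl)",                     # acid OH
]

tpsa = sum(CONTRIBUTION[fragment] for fragment in aspirin_fragments)
print(f"TPSA of aspirin: {tpsa:.2f}")
```

Carbon atoms contribute nothing, which is why only the four oxygens appear in the sum.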
  • 25. Quantitative Structure-Activity Relationships (QSAR) • Also QSPR (Structure-Property) – Exactly the same idea but with some physical property • Create a mathematical model that links a molecule's structure to a particular property or biological activity – Could be used to perceive the link between structure and function/property – Could be used to propose changes to a structure to increase activity – Could be used to predict the activity/property for an unknown molecule • Problem: Activity = 2.4 * [molecular structure] does not compute! • Need to replace the actual structure by some values that are a proxy for the structure - “Molecular descriptors” • Numerical values that represent in some way some physico-chemical properties of the molecule • We saw one already, the Topological Polar Surface Area • Others: molecular weight, number of hydrogen bond donors, LogP (octanol/water partition coefficient) • It is usual to calculate 100 or more of these
  • 26. Building and testing a predictive QSAR model • Need dataset with known values for the property of interest – Divide into 2/3 training set and 1/3 test set • Choose a regression model – Linear regression, artificial neural network, support vector machine, random forest, etc. • Train the model to predict the property values for the training set based on their descriptors • Apply the model to the test set – Find the RMSEP and R² • Root-mean-squared error of prediction and squared correlation coefficient • Practical Notes: – Descriptors can be calculated with the CDK or RDKit – Models can be built using R (r-project.org) – For a combination of the two, see rcdk
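The train/test workflow above can be sketched with a one-descriptor linear regression in plain Python. Everything numerical here is synthetic: the descriptor values and activities are generated from a made-up linear relationship plus noise, purely to show the steps. A real study would use many descriptors and a statistics package such as R.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmsep(observed, predicted):
    """Root-mean-squared error of prediction."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Synthetic dataset: 30 "molecules", one descriptor each.
rng = random.Random(0)
descriptor = [rng.uniform(0, 5) for _ in range(30)]
activity = [2.4 * x + 1.0 + rng.gauss(0, 0.3) for x in descriptor]

# Divide into a 2/3 training set and a 1/3 test set.
indices = list(range(30))
rng.shuffle(indices)
train, test = indices[:20], indices[20:]

# Train on the training set only, then predict the held-out test set.
slope, intercept = fit_line([descriptor[i] for i in train],
                            [activity[i] for i in train])
observed = [activity[i] for i in test]
predicted = [slope * descriptor[i] + intercept for i in test]

print(f"RMSEP = {rmsep(observed, predicted):.2f}")
print(f"R2    = {r_squared(observed, predicted):.2f}")
```

The key design point is that the test set plays no part in fitting, so RMSEP and R² estimate how the model performs on molecules it has not seen.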
  • 27. Lipinski's Rule of Five • Note: a rule of thumb for oral bioavailability • Lipinski took a dataset of drug candidates that made it to Phase II • He examined the distribution of particular descriptor values related to ADME • An orally active drug should not fail more than one of the following 'rules': – Molecular weight <= 500 – Number of H-bond donors <= 5 – Number of H-bond acceptors <= 10 – LogP <= 5 • These rules are often applied as a pre-screening filter Image (Chris Lipinski): http://collaborativedrug.com/blog/blog/2009/10/07/cdd-community-meeting/
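As a pre-screening filter, the rules above reduce to counting violations. This is a minimal sketch that assumes the four descriptors have already been calculated elsewhere (for example with the CDK or RDKit); the function names are invented, the aspirin values are approximate, and the second compound is hypothetical.

```python
def lipinski_violations(mol_weight, h_donors, h_acceptors, logp):
    """Count how many of the four rules a compound fails."""
    rules_passed = [
        mol_weight <= 500,
        h_donors <= 5,
        h_acceptors <= 10,
        logp <= 5,
    ]
    return sum(not passed for passed in rules_passed)


def passes_rule_of_five(mol_weight, h_donors, h_acceptors, logp):
    """An orally active drug should not fail more than one rule."""
    return lipinski_violations(mol_weight, h_donors, h_acceptors, logp) <= 1


# Aspirin, with approximate descriptor values.
print(passes_rule_of_five(mol_weight=180.16, h_donors=1, h_acceptors=4, logp=1.2))

# A hypothetical large compound that fails on weight and donors.
print(passes_rule_of_five(mol_weight=650.0, h_donors=6, h_acceptors=8, logp=4.0))
```

Allowing a single violation matters in practice: a strict all-four filter would reject compounds that the rule as stated lets through.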
  • 28. Open Source cheminformatics software resources • GUI: – Open Babel – LICSS – Excel-CDK interface • Command-line interface: – Open Babel (“babel”) – MayaChemTools • Programming toolkits: – Open Babel (C++, Perl, Python, .NET, Java), RDKit (C++, Python), Chemistry Development Kit [CDK] (Java, Jython, ...), PerlMol (Perl), MayaChemTools (Perl) – Cinfony (by me!) presents a simplified interface to some of these • Specialized toolkits: – OSRA: image to structure – OPSIN: name to structure – OSCAR: Identify chemical terms in text

Editor's notes

  1. Next time: More on Magritte
  2. Acetic acid
  3. Acetic acid
  4. Add year
  5. Next time: Add some pictures
  6. Next time: Add example of what intersection and union mean graphically