NCI/CADD Chemical Structure Web Services
Markus Sitzmann
Computer-Aided Drug Design Group, Chemical Biology Laboratory,
Frederick National Laboratory for Cancer Research, NIH, DHHS
http://cactus.nci.nih.gov
Chemical Structure Web API

[Figure: architecture diagram]
  top row:    Chemical Identifier Resolver, NCI/CADD web service,
              NCI/CADD web service, external web services
  middle:     Chemical Structure Web API (accessed via http)
  bottom row: NCI/CADD Chemical Structure DataBase (CSDB), CACTVS,
              OPSIN, other software packages
Chemical Structures

[Figure: a chemical structure surrounded by its identifiers and representations]
  • SMILES
  • SYBYL Line Notation
  • CAS Registry Number
  • chemical names
  • GIF image
  • ChemNavigator SID
  • SD File
  • CML
  • FDA UNII
  • NCI/CADD Identifiers
  • NSC number
  • MRV
  • InChI/InChIKey
  • PubChem SID/CID
  • ChemSpider ID
  • ChEBI ID
  • Chemical Formula
  • PDB Ligand ID
Chemical Identifier Resolver (CIR)



                          CIR works as a resolver for different
                          chemical structure identifiers or
                          representations.
                          It allows one to convert a given
                          structure identifier into another
                          representation or structure
                          identifier.




   http://cactus.nci.nih.gov/chemical/structure
Chemical Identifier Resolver (CIR)

                         • officially released in June 2009
                         • since then four beta versions
                           (for testing, learning, and gaining experience)
                         • one larger database update in March 2010
                         • since early 2012: major internal rewrite
                           (which will allow us to add new services
                           and API functionality while not breaking
                           the existing API)
                         • major database update and services
                           planned for 2013


   http://cactus.nci.nih.gov/chemical/structure
CIR Usage Statistics

[Figure: chart of requests per month since June 2009; y-axis from 0 to 12,000,000 requests]

Typical number of unique IP addresses per month: 4,000 – 8,000
Top Users (US)

Academic/Hospitals
• St. Olaf College
• Carnegie Mellon
• Drexel University
• Princeton
• Mayo

Pharma/Chemical Industry
• Eli Lilly
• Dow Chemical
• Intermune
• Procter & Gamble
• Vertex

U.S. Government
• EPA
• NIH (NIEHS, NCI, NLM...)
• Lawrence Livermore Natl. Lab.
• CDC
• DoD

Other
• Google
• Amazon
• HP
• Agilent
• Symyx
External web services and applications
 •   CIR node for KNIME, by Talete s.r.l.
 •   Lab Helper app for Windows Phone
 •   Avogadro molecule editor
 •   Jmol/JSmol open-source viewer for chemical structures in 3D
 •   GChem for Google Spreadsheet
 •   Bioclipse (CIR plugin)
 •   Macs in Chemistry
 •   Accelrys Draw
 ...and educational tools/sites such as:
 • Jmol/JSmol Virtual Molecular Model Kit
 • ISU CheMagic
 • Caltech Library
Examples using CIR
Chemical Identifier Resolver (CIR)

ChemWriter Editor (input), SD file (output):

 C7H6O2
 APtclcactv03051222202D 0 0.00000   0.00000

 15 15 0 0 0 0 0 0 0 0999 V2000
   2.8660 -2.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   3.7321 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   3.7321 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   2.8660 -0.0600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   2.0000 -0.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   2.0000 -1.5600 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   2.8660 0.9400 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
   3.7321 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
   2.0000 1.4400 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
   2.8660 -2.6800 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
   4.2690 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
   4.2690 -0.2500 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
   1.4631 -0.2500 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
   1.4631 -1.8700 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
   3.7321 2.0600 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
  1 2 2 0 0 0 0
  2 3 1 0 0 0 0
  3 4 2 0 0 0 0
  4 5 1 0 0 0 0
  5 6 2 0 0 0 0
  1 6 1 0 0 0 0
  4 7 1 0 0 0 0
  7 8 1 0 0 0 0
  7 9 2 0 0 0 0
  1 10 1 0 0 0 0
  2 11 1 0 0 0 0
  3 12 1 0 0 0 0
  5 13 1 0 0 0 0
  6 14 1 0 0 0 0
  8 15 1 0 0 0 0
 M END
 $$$$
Chemical Identifier Resolver (CIR)

ChemWriter Editor (input), names (output):

 benzoic acid
 65-85-0
 WLN: QVR
 Unisept BZA
 AIDS018010
 Salvo liquid
 Benzoic acid-ring-UL-14C
 ST5213864
 Benzoesaeure
 CHEBI:30746
 NSC 149
 benzenecarboxylic acid
 phenylformic acid
 Benzoic acid (JP15/USP)
 Benzoic acid (TN)
 18102_RIEDEL
 Aromatic hydroxy acid
 Benzoic acid (7CI,8CI,9CI)
 Benzoic acid [USAN:JAN]
 W213128_ALDRICH
 47849_SUPELCO
 Acide benzoique [French]
 Acido benzoico [Italian]
 Benzoate (VAN)
 Benzoesaeure [German]
 Benzoic acid (natural)
 Acide benzoique
 Benzeneformic acid
 Benzenemethanoic acid
 Benzoesaeure GK
 Benzoesaeure GV
 Benzoic acid, tech.
 Carboxybenzene
 Kyselina benzoova
 Phenylcarboxylic acid
Chemical Identifier Resolver (CIR)

ChemWriter Editor (input), InChIKey, InChI and SMILES (output):

 InChIKey:  InChIKey=WPYMKLBDIGXBTP-UHFFFAOYSA-N
 InChI:     InChI=1S/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1-5H,(H,8,9)
 SMILES:    C1=CC=C(C=C1)C(O)=O
Chemical Identifier Resolver (CIR)

programmatic URL API:




 http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"

if a request is not successful, the service answers with an HTTP 404 status
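
The URL scheme above can be wrapped in a small helper. This is a minimal sketch, not part of the original slides: the function names `cir_url` and `resolve` are made up for illustration, and it assumes the server decodes percent-encoded identifiers. A 404 response is mapped to `None`, following the error behaviour described above.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation):
    """Build a CIR request URL: <base>/<identifier>/<representation>."""
    # Percent-encode the identifier so that spaces, '#', '+' and '/' stay
    # inside a single path segment (assumption: the server decodes them).
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier, representation):
    """Return the response text, or None if CIR answers with HTTP 404."""
    try:
        with urlopen(cir_url(identifier, representation)) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:  # identifier could not be resolved
            return None
        raise  # any other HTTP error is passed on to the caller


print(cir_url("tamiflu", "cas"))
```

Returning `None` for an unresolved identifier keeps "not found" separate from real failures such as a server error, which still raise.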
Chemical Identifier Resolver (CIR)

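• access by programming libraries/languages (e.g. Python):
 from urllib.error import HTTPError
 from urllib.request import urlopen

 url = "http://cactus.nci.nih.gov/chemical/structure/tamiflu/cas"
 try:
      # urlopen raises HTTPError (e.g. 404) if the request is not successful
      with urlopen(url) as resolver:
           response = resolver.read().decode("utf-8")
 except HTTPError as error:
      raise RuntimeError("your own error handling") from error
 print(response)
 204255-11-8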



• access from Unix shell level (e.g., via wget):
 shell > wget -qO - http://cactus.nci.nih.gov/chemical/structure/tamiflu/cas
 204255-11-8
Chemical Identifier Resolver (CIR)

examples:

http://cactus.nci.nih.gov/chemical/structure/PGZUMBJQJWIWGJ-ONAKXNSWSA-N/cas


 204255-11-8           MIME type: text/plain




http://cactus.nci.nih.gov/chemical/structure/tamiflu/image




                       MIME type: image/gif
Chemical Identifier Resolver (CIR)

http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"

"identifier" (input):
  • chemical names
  • IUPAC names (OPSIN)
  • CAS numbers
  • SMILES strings
  • IUPAC InChI/InChIKeys
  • NCI/CADD Identifiers
  • CACTVS HASHISY
  • NSC number
  • PubChem SID
  • ZINC Code
  • ChemSpider ID
  • ChemNavigator SID
  • eMolecule VID
  • UNII

"representation" (output):
  • /smiles
  • /names, /iupac_name
  • /cas
  • /inchi, /stdinchi
  • /inchikey, /stdinchikey
  • /ficts, /ficus, /uuuuu
  • /image
  • /file, /sdf
  • /mw, /monoisotopic_mass
  • /formula
  • /twirl
  • /urls
  • /chemspider_id
  • /pubchem_sid
  • /chemnavigator_sid
(Partial) InChIKey Lookup

• resolve Standard InChIKey into full structure representation (example: ethanol):

http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles

 CCO


http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles

 CCO
 CC[OH2+]


http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles

 C(C(O)([2H])[2H])[2H]
 CC(O)([2H])[2H]
 C(CO)([2H])([2H])[2H]
 CC[17OH]
 C(CO)[2H]
 [14CH3]CO
 CCO
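
The three lookups above differ only in how much of the InChIKey is sent. The sketch below derives the shortened forms from a full Standard InChIKey. The function name and dictionary keys are illustrative assumptions. The block lengths (14, 10 and 1 characters) follow the Standard InChIKey layout.

```python
def inchikey_lookup_forms(inchikey):
    """Return the full key and the two truncated forms used for partial lookup."""
    prefix = "InChIKey="
    key = inchikey[len(prefix):] if inchikey.startswith(prefix) else inchikey
    blocks = key.split("-")
    # A Standard InChIKey has three blocks of 14, 10 and 1 characters.
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError(f"not a full InChIKey: {inchikey!r}")
    return {
        "full": key,                                    # exact structure
        "without_protonation": "-".join(blocks[:2]),    # also matches CC[OH2+]
        "connectivity_only": blocks[0],                 # also matches isotopologues
    }


forms = inchikey_lookup_forms("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
for label, value in forms.items():
    print(label, value)
```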
Chemical File Representation

• available file format representations:
http://cactus.nci.nih.gov/chemical/structure/Aspirin/file?format=sdf

  alc          Alchemy format
  cdxml        CambridgeSoft ChemDraw XML format
  cerius       MSI Cerius II format
  charmm       Chemistry at HARvard Macromolecular Mechanics file format
  cif          Crystallographic Information File
  cml          Chemical Markup Language
  gjf          Gaussian input data file
  gromacs      GROMACS file format
  hyperchem    HyperChem file format
  jme          Java Molecule Editor format
  maestro      Schroedinger MacroModel structure file format
  mol          Symyx molecule file
  sybyl2/mol2  Tripos Sybyl MOL2 format
  mrv          ChemAxon MRV format
  pdb          Protein Data Bank
  sdf          Symyx Structure Data Format
  sdf3000      Symyx Structure Data Format 3000
  sln          SYBYL Line Notation
  smiles       SMILES
  xyz          xyz file format
Chemical Structure Images (GIF, PNG)

Buckyball:
http://cactus.nci.nih.gov/chemical/structure/XMWRBQBLMFGWIX-UHFFFAOYSA-N/image?height=300&width=300&bgcolor=black&bondcolor=white

Aspirin:
http://cactus.nci.nih.gov/chemical/structure/Aspirin/image?height=200&width=200&symbolfontsize=7&footer="Aspirin"
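
Options such as `height`, `width`, `bgcolor` and `bondcolor` (and `format` for the file representation) are ordinary query parameters. This sketch builds such URLs with the standard library. The helper name `cir_url_with_options` is an assumption for illustration; the option names are the ones shown on the slides.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url_with_options(identifier, representation, **options):
    """Build a CIR URL and append the options as a query string."""
    url = f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"
    if options:
        # urlencode keeps the keyword order and escapes the values.
        url += "?" + urlencode(options)
    return url


print(cir_url_with_options(
    "XMWRBQBLMFGWIX-UHFFFAOYSA-N", "image",
    height=300, width=300, bgcolor="black", bondcolor="white",
))
print(cir_url_with_options("Aspirin", "file", format="sdf"))
```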
Chemical Properties

• request molecular weight (example: aspirin):
http://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/weight

 180.1598                                                               MIME type: text/plain

• available property representations:
  /mw                         molecular weight
  /formula                    formula
  /monoisotopic_mass          monoisotopic mass
  /h_bond_donor_count         H bond donor count
  /h_bond_acceptor_count      H bond acceptor count
  /h_bond_center_count        H bond center count
  /rotor_count                number of rotatable bonds
  /effective_rotor_count      number of effectively rotatable bonds
  /rule_of_5_violation_count  number of Rule-of-5 violations
  /xlogp2                     octanol−water partition coefficient XLOGP2
  /aromatic                   compound is aromatic
  /macrocyclic                compound is macrocyclic
  /heteroatom_count           heteroatom count
  /hydrogen_atom_count        H atom count
  /heavy_atom_count           heavy atom count
  /deprotonable_group_count   number of deprotonable groups
  /protonable_group_count     number of protonable groups
  /ring_count                 number of rings
  /ringsys_count              number of ring systems
Chemical Name Lookup

• request (alternative) names:

http://cactus.nci.nih.gov/chemical/structure/Aspirin/names/xml

 <?xml version="1.0" encoding="UTF-8" ?>
 <request string="Aspirin" representation="names">
    <data id="1" resolver="name" string_class="Name">
       <item id="1" classification="pubchem_iupac_name">2-acetyloxybenzoic acid</item>
       <item id="2" classification="pubchem_iupac_openeye_name">2-Acetoxybenzoic acid</item>
       <item id="3" classification="pubchem_generic_registry_name">50-78-2</item>
       <item id="4" classification="pubchem_generic_registry_name">11126-35-5</item>
       <item id="5" classification="pubchem_generic_registry_name">11126-37-7</item>
       <item id="6" classification="pubchem_generic_registry_name">2349-94-2</item>
       <item id="7" classification="pubchem_generic_registry_name">26914-13-6</item>
       <item id="8" classification="pubchem_substance_synonym">NCGC00090977-04</item>
       <item id="9" classification="pubchem_substance_synonym">KBioSS_002272</item>
       <item id="10" classification="pubchem_substance_synonym">SBB015069</item>
       <item id="11" classification="pubchem_substance_synonym">Aspirin</item>
       <item id="12" classification="pubchem_substance_synonym">D00109</item>
 […]
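
A response like the one above can be grouped by its `classification` attribute with the standard-library XML parser. This is a sketch under stated assumptions: the sample is an abridged copy of the listing, and the closing `</data>` and `</request>` tags are added here because the slide truncates the output. The function name is illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged sample; the closing tags are an assumption (the slide cuts off).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<request string="Aspirin" representation="names">
  <data id="1" resolver="name" string_class="Name">
    <item id="1" classification="pubchem_iupac_name">2-acetyloxybenzoic acid</item>
    <item id="3" classification="pubchem_generic_registry_name">50-78-2</item>
    <item id="4" classification="pubchem_generic_registry_name">11126-35-5</item>
    <item id="11" classification="pubchem_substance_synonym">Aspirin</item>
  </data>
</request>"""


def names_by_classification(xml_bytes):
    """Map each classification to the list of names carrying it."""
    root = ET.fromstring(xml_bytes)
    grouped = defaultdict(list)
    for item in root.iter("item"):
        grouped[item.get("classification")].append(item.text)
    return dict(grouped)


for classification, names in names_by_classification(SAMPLE).items():
    print(classification, names)
```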
Chemical Name Pattern Search

• Google-like searches on CIR’s name index (approx. 70 million names)


 example: all chemical names that contain the words “morphine” and “methyl”
 (name pattern: ‘+morphine +methyl‘):

 http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/stdinchikey/xml?resolver=name_pattern




                                                  based on the open source
                                                  full text search server Sphinx
                                                  (http://sphinxsearch.com)
Search name pattern ‘+morphine +methyl’: 7 matching names
<request string="+morphine +methyl" representation="stdinchikey">
   <data id="1" resolver="name_pattern" notation="Morphine 3-methyl ether">
      <item id="1">InChIKey=OROGSEYTTFOCAN-DNJOTXNNSA-N</item>
   </data>
   <data id="2" resolver="name_pattern" notation="6-Methyl-delta(sup 6)-deoxy-morphine">
      <item id="1">InChIKey=CUFWYVOFDYVCPM-GGNLRSJOSA-N</item>
   </data>
   <data id="3" resolver="name_pattern" notation="Morphine, dihydro-6-methyl-">
      <item id="1">InChIKey=NBKVWIJQJMEQLE-NGTWOADLSA-N</item>
   </data>
   <data id="4" resolver="name_pattern" notation="6-METHYL-MORPHINE ETHER">
      <item id="1">InChIKey=FNAHUZTWOVOCTL-UHFFFAOYSA-N</item>
   </data>
   <data id="5" resolver="name_pattern" notation="Morphine alcoholic methyl ether">
      <item id="1">InChIKey=FNAHUZTWOVOCTL-XSSYPUMDSA-N</item>
   </data>
   <data id="6" resolver="name_pattern" notation="N-Methyl morphine chloride">
      <item id="1">InChIKey=MJNCZWBHCFTYFU-SCLAZZCHSA-N</item>
   </data>
   <data id="7" resolver="name_pattern" notation="Morphine, 7-hydroxy-6,6-dimethoxy-3-O-methyl-">
      <item id="1">InChIKey=URFKRBIESURBKC-UHFFFAOYSA-N</item>
   </data>
</request>
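
Two of the matches above (ids 4 and 5) share the first InChIKey block `FNAHUZTWOVOCTL`, so they have the same connectivity and differ only in the second block. The sketch below groups matches by that first block to make such cases visible. The sample is an abridged copy of the listing and the function name is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<request string="+morphine +methyl" representation="stdinchikey">
   <data id="1" resolver="name_pattern" notation="Morphine 3-methyl ether">
      <item id="1">InChIKey=OROGSEYTTFOCAN-DNJOTXNNSA-N</item>
   </data>
   <data id="4" resolver="name_pattern" notation="6-METHYL-MORPHINE ETHER">
      <item id="1">InChIKey=FNAHUZTWOVOCTL-UHFFFAOYSA-N</item>
   </data>
   <data id="5" resolver="name_pattern" notation="Morphine alcoholic methyl ether">
      <item id="1">InChIKey=FNAHUZTWOVOCTL-XSSYPUMDSA-N</item>
   </data>
</request>"""


def matches_by_connectivity(xml_text):
    """Group (name, InChIKey) pairs by the first InChIKey block."""
    groups = defaultdict(list)
    for data in ET.fromstring(xml_text).findall("data"):
        for item in data.findall("item"):
            key = item.text.strip()
            if key.startswith("InChIKey="):
                key = key[len("InChIKey="):]
            groups[key.split("-")[0]].append((data.get("notation"), key))
    return dict(groups)


for block, matches in matches_by_connectivity(SAMPLE).items():
    print(block, [name for name, _ in matches])
```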
Chemical Name Pattern Search

example: chemical names that contain the words “morphine” and “methyl”
but not “hydroxyl” (name pattern: ‘+morphine +methyl -hydroxyl‘):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -hydroxyl/stdinchikey/xml?resolver=name_pattern

                                                                                    6 matching names

example: chemical names that contain the substring “morphine”
somewhere in the name (name pattern: ‘*morphine*‘)
http://cactus.nci.nih.gov/chemical/structure/*morphine*/stdinchikey/xml?resolver=name_pattern


                                                                                  45 matching names

example: chemical names that contain a single character “m” and the word
“benzene” in a maximum distance of 3 words (finds meta-substituted aromatic
compounds, name pattern: ‘“m benzene”~3‘):
http://cactus.nci.nih.gov/chemical/structure/(m benzene)~3/stdinchikey/xml?resolver=name_pattern

                                                                                  22 matching names
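
Name patterns contain spaces and characters such as `+` that are safer to percent-encode when a request is built in code. The sketch below does this for the patterns shown above. The helper name is an assumption, as is the server accepting percent-encoded patterns (a browser encodes the spaces in the same way).

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def name_pattern_url(pattern, representation="stdinchikey"):
    """Build an XML request URL for a name pattern search."""
    # safe='' also encodes '+', which some servers would otherwise read as a space.
    path = quote(pattern, safe="")
    query = urlencode({"resolver": "name_pattern"})
    return f"{CIR_BASE}/{path}/{representation}/xml?{query}"


for pattern in ("+morphine +methyl", "+morphine +methyl -hydroxyl", "*morphine*"):
    print(name_pattern_url(pattern))
```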
NCI/CADD Chemical Structure DataBase
            CSDB 2010
Chemical Structure Normalization/Identifier

• stepwise process:

  original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB)
    → structure normalization
    → parent structure (SDF, SMILES, database)
    → hashcode calculation (E_HASHISY)
    → NCI/CADD Identifier

  original structure records, parent structures and identifiers
  are stored in the database
Chemical Structure Normalization/Identifier

• calculation of a set of parent structures with different
  sensitivity to chemical features:

  original structure record
    → structure normalization
    → parent structures: FICTS, FICuS, uuuuu
    → hashcode calculation (E_HASHISY)
    → NCI/CADD Identifiers: FICTS, FICuS, uuuuu

  all steps are performed using CACTVS
NCI/CADD Identifiers (FICTS, FICuS, uuuuu)

• based on CACTVS hashcodes (HASHISY)
• 16-digit hexadecimal number (64-bit unsigned)
  example (histidine): 9850FD9F9E2B4E25

structure normalization - histidine:

 structure     FICTS                     FICuS                     uuuuu
 tautomer 1    9850FD9F9E2B4E25-FICTS    9850FD9F9E2B4E25-FICuS    9850FD9F9E2B4E25-uuuuu
 tautomer 2    6C16DE2351F9FF50-FICTS    9850FD9F9E2B4E25-FICuS    9850FD9F9E2B4E25-uuuuu
 salt (Na+)    E5F83F10C5DB080A-FICTS    E5F83F10C5DB080A-FICuS    9850FD9F9E2B4E25-uuuuu
 R             E92E4BA2869F3611-FICTS    E92E4BA2869F3611-FICuS    9850FD9F9E2B4E25-uuuuu
 S             8A7AD1EB498CC76A-FICTS    8A7AD1EB498CC76A-FICuS    9850FD9F9E2B4E25-uuuuu
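
The histidine identifiers above show how the three levels differ in sensitivity: FICTS tells all five forms apart, FICuS merges the two tautomers, and uuuuu merges all five. The sketch below checks this by counting distinct hashcodes per level. The identifier strings are copied from the slide. The function names and the format check are assumptions for illustration, and the hashcodes themselves are not recomputed (that requires CACTVS).

```python
from collections import defaultdict

HISTIDINE = {
    "tautomer 1": ("9850FD9F9E2B4E25-FICTS", "9850FD9F9E2B4E25-FICuS", "9850FD9F9E2B4E25-uuuuu"),
    "tautomer 2": ("6C16DE2351F9FF50-FICTS", "9850FD9F9E2B4E25-FICuS", "9850FD9F9E2B4E25-uuuuu"),
    "salt": ("E5F83F10C5DB080A-FICTS", "E5F83F10C5DB080A-FICuS", "9850FD9F9E2B4E25-uuuuu"),
    "R": ("E92E4BA2869F3611-FICTS", "E92E4BA2869F3611-FICuS", "9850FD9F9E2B4E25-uuuuu"),
    "S": ("8A7AD1EB498CC76A-FICTS", "8A7AD1EB498CC76A-FICuS", "9850FD9F9E2B4E25-uuuuu"),
}
LEVELS = ("FICTS", "FICuS", "uuuuu")


def parse_identifier(identifier):
    """Split '<16 hex digits>-<level>' into (64-bit integer, level)."""
    hashcode, level = identifier.split("-")
    if len(hashcode) != 16 or level not in LEVELS:
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return int(hashcode, 16), level


def group_by_level(records):
    """For each level, map hashcode -> labels of the structures sharing it."""
    groups = {level: defaultdict(list) for level in LEVELS}
    for label, identifiers in records.items():
        for identifier in identifiers:
            value, level = parse_identifier(identifier)
            groups[level][value].append(label)
    return groups


for level, by_hash in group_by_level(HISTIDINE).items():
    print(level, len(by_hash), "distinct identifier(s)")
```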
Chemical Structure Normalization/Identifier

• calculation of Standard InChIKey from the union set of
  parent structures

  original structure record
    → structure normalization
    → parent structures: FICTS, FICuS, uuuuu
    → hashcode calculation (E_HASHISY)
    → NCI/CADD Identifiers

  union set of the FICTS, FICuS and uuuuu parent structures
    → Standard InChIKey (InChI version 1.03)
Chemical Structure Database (CSDB)

• ChemNavigator iResearch Library (~56%)
  compilation of commercially available screening
  compounds from ~300 international chemistry
  suppliers
• PubChem Substance Database (~38%)
  including Open NCI database, EPA DSSTox
  databases, NIAID HIV database, NIST Webbook,
  NLM ChemIDplus, ChemSpider, …
• Commercial Sources / others (~6%)
  Asinex, Comgenex, eMolecules, …

  current status (released March 2010):
    • 140 chemical structure databases
    • 120 million structure records
    • 84.6 million unique structures by FICuS
    • 110 million Standard InChIKeys for lookup
NCI/CADD Chemical Structure DataBase
            CSDB 2013
Chemical Structure Database 2013

• >270 small-molecule databases
• >600 database releases (full, incremental, “historic versions”)
• 385 million original database records




unique structure count:

 FICTS   ~125.0 million   FICuS   ~121.4 million   uuuuu   ~109.0 million


              union set: 141.7 million unique structures
Chemical Structure Database 2013

InChI/InChIKey (Version 1.04) calculated with four InChI flag sets:

Standard Set, Set 1 & Set 2: addition of hydrogen atoms by CACTVS
Set 3: addition of hydrogen atoms by the InChI library



                 CACTVS
  Standard   :    Add H   Standard InChIKey

   Set 1     :    Add H   DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T

   Set 2     :    Add H   DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T

   Set 3     :    Add H   DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T
Chemical Structure Database 2013

• calculation of Standard InChIKey

  original structure record
    → structure normalization
    → parent structures: FICTS, FICuS, uuuuu
    → hashcode calculation (E_HASHISY)
    → NCI/CADD Identifiers

  union set of the FICTS, FICuS and uuuuu parent structures
    → InChIKeys (InChI version 1.04) for four flag sets:
      Standard (Standard InChIKey), Set 1, Set 2, Set 3
Chemical Structure Database 2013

• database schema is entirely implemented in python/


  • supports many different database engines: Oracle, PostgreSQL, MySQL
  • SQLAlchemy provides:
     • the communication layer with the database engine
     • an object-oriented data model representation of the database on
       the “python” side

  • table relationships:
     • either defined by Foreign Key relationships in the database or specified
       on python level
     • SQLAlchemy creates table joins on the SQL level
Chemical Structure Database 2013

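  • SQLAlchemy table definition

# classical mapping API as available at the time (SQLAlchemy < 2.0)
from sqlalchemy import CHAR, Column, Integer, Table, Text
from sqlalchemy.orm import backref, mapper, relationship

structure_table = Table('structure', metadata,
     Column('id', Integer, primary_key=True, autoincrement=True),
     Column('hash', CHAR(16), unique=True),
     Column('smiles', Text()),
     schema=schema
)
class Structure(TableRepr, TableInit):
     __table__ = structure_table
# Name and name_table are defined in the same way (not shown)
mapper(Structure, structure_table, properties={
   'name': relationship(Name, backref=backref('structure'),
   primaryjoin=structure_table.c.id==name_table.c.structure_id)
})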
Chemical Structure Database 2013

• Query the database

 > s = db.session.query(Structure).filter(Structure.id==1234).one()
 <object "Structure">
 > s.smiles
 CCO

• if the object-oriented data model representation creates too
  much overhead, SQLAlchemy supports writing “almost bare”
  SQL while still following Python paradigms

 > q = select([structure_table.c.smiles]).where(structure_table.c.id==1234)
 > s = q.execute().fetchone()
 (CCO,)
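
To show what these queries amount to at the SQL level, here is a sketch that uses the standard-library `sqlite3` module instead of SQLAlchemy, so it runs without extra dependencies. The `structure` columns follow the table definition above. The `name` table, its columns and the inserted rows (including the all-zero hash) are made-up placeholders, not CSDB data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE structure (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        hash CHAR(16) UNIQUE,
        smiles TEXT
    );
    -- hypothetical name table, linked by a foreign key
    CREATE TABLE name (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        structure_id INTEGER REFERENCES structure(id),
        name TEXT
    );
""")
# Placeholder rows for illustration only.
conn.execute(
    "INSERT INTO structure (id, hash, smiles) VALUES (?, ?, ?)",
    (1234, "0000000000000000", "CCO"),
)
conn.execute("INSERT INTO name (structure_id, name) VALUES (?, ?)", (1234, "ethanol"))

# Equivalent of the "almost bare" select on the structure table.
row = conn.execute("SELECT smiles FROM structure WHERE id = ?", (1234,)).fetchone()
print(row)

# The kind of join an ORM relationship generates behind the scenes.
joined = conn.execute(
    "SELECT s.smiles, n.name FROM structure AS s "
    "JOIN name AS n ON s.id = n.structure_id WHERE s.id = ?",
    (1234,),
).fetchall()
print(joined)
```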
Chemical Structure Database

• Goals


  • index any chemical structure that can be referenced in some way or
    has a known source
  • may also include virtual chemistry or generic structure collections
  • collect public datasets/databases/structure collections
  • normalize them to our standards
  • make them available in our public web interfaces and APIs
    (if we are allowed to)
  • no refusal/deletion of structures: curation is performed by “keep the
    bad and tag it as bad”

                                              track chemical space
NCI/CADD Chemical Web Apps



• implemented with jQuery Mobile (1.3.0)
  • HTML5
  • supports web browsers on major mobile platforms: iOS, Android,
    BlackBerry, Windows Phone, Windows 8, Palm, Symbian
  • supports major desktop web browsers: Google Chrome, Firefox, IE9/10
  • WAI-ARIA compliant (W3C specification draft describing accessibility
    standards of dynamic Web content for people with disabilities)

• services will be optimized for use on tablet-sized touch-screen
  devices, but not (yet) for smartphone-sized devices
  (current development is done on an iPad3)
• all services work on a common platform
Chemical Activity Predictor - GUSAR




            chemical structure


             prediction of
 physicochemical properties and activities
Chemical Activity Predictor - GUSAR

GUSAR Software (main developer: Alexey Zakharov)

characteristics:
• chemical structures are represented by QNA descriptors and
  MNA descriptors
• mathematical algorithm: a unique self-consistent regression
  algorithm selects the best set of descriptors for a robust
  and reliable QSAR model
Chemical Activity Predictor - GUSAR

GUSAR Software

[Figure: chart of accuracy (R2 test, scale 0.00 to 1.00) for CoMFA, CoMSIA, HQSAR, EVA, 2D Cerius2, 3D Cerius2, GOLPE and GUSAR]

comparison was performed on the following data sets:
• ligand–enzyme interactions
• ligand–receptor interactions
• acute toxicity
• interaction with drug-metabolism enzymes
Chemical Activity Predictor - GUSAR

• QSAR-based models created by GUSAR can be used separately
  from the application
• broad spectra of chemical/biological activity and property
  prediction models for small molecules in development:
  • physicochemical properties
  • assessment of toxicity, metabolism and antineoplastic activities
  • HIV-1-related models
• will be available as Web App and programmatic URL API:


http://cactus.nci.nih.gov/chemical/activity/CCOCC/boiling_point

{in_applicability_domain: True, datatype: ‘float’, value: 42.660}
Chemical Activities

Category: Physicochemical Properties
Models:   Physicochemical Models
Endpoints:
  • Boiling point
  • Density
  • Flash point
  • Melting point
  • Surface tension
  • Thermal conductivity
  • Vapor pressure
  • Viscosity
  • Water solubility

Category: Biological Activities
Models:   HIV-Models
Endpoints:
  • HIV-1 Integrase (Strand Transfer) Inhibitor
  • HIV-1 Reverse Transcriptase Inhibitor
Activity Endpoints
Prediction Results
GUSAR
• value
• unit
• in applicability domain
• quantitative and
  qualitative models
Chemical Activity Predictor – GUSAR beta

          http://cactus.nci.nih.gov/chemical/apps
Chemical Structure Lookup Service (CSLS)




• first version was released in 2006, development stalled in 2008
• new version will be based on CSDB
• new release planned for 2013
• allows easy lookup of chemical structures within the constituent
  databases of CSDB
InChI/InChIKey Resolver


                          “loose coupling”
                          of InChI resolvers
                          provided by
                          different
                          organizations

                          central list of
                          resolvers

                          each resolver
                          must provide a
                          specific protocol.
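The loose-coupling idea can be sketched as a client that walks a central list of independent resolvers, each exposing the same call signature. Everything here is hypothetical: the slide does not define the protocol, so the `Resolver` signature, the registry names, and the stand-in resolvers are illustrative assumptions only.

```python
from typing import Callable, Dict, Optional

# Assumed common protocol: a resolver takes an InChIKey and returns
# a structure string, or None if it does not know the key.
Resolver = Callable[[str], Optional[str]]


def resolve(inchikey: str, resolvers: Dict[str, Resolver]) -> Dict[str, str]:
    """Ask every resolver in the central list and collect the answers."""
    results: Dict[str, str] = {}
    for name, resolver in resolvers.items():
        try:
            answer = resolver(inchikey)
        except Exception:
            # Loose coupling: one failing resolver must not break the others.
            continue
        if answer is not None:
            results[name] = answer
    return results


# Stand-in resolvers for two imaginary organizations.
def resolver_a(key: str) -> Optional[str]:
    return "CCO" if key == "LFQSCWFLJHTTHZ-UHFFFAOYSA-N" else None


def resolver_b(key: str) -> Optional[str]:
    raise ConnectionError("service unavailable")


central_list = {"org_a": resolver_a, "org_b": resolver_b}
print(resolve("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", central_list))
```

Because the only shared element is the call signature, an organization can join or leave the central list without any change to the other resolvers.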
InChI/InChIKey Resolver


• Evan Bolton (NCBI, NLM, NIH)
• Valery Tkachenko (RSC/ChemSpider)
• Marc Nicklaus (CADD Group, NCI, NIH)
• Steven Bachrach (Trinity University)
• Antony Williams (RSC/ChemSpider)
• Markus Sitzmann (CADD Group, NCI, NIH)
Chemical Structure Web API

[Diagram: the Chemical Identifier Resolver, NCI/CADD web services, and external web services connect via http to the Chemical Structure Web API, which builds on the NCI/CADD Chemical Structure DataBase (CSDB), CACTVS, OPSIN, and other software packages.]
Chemical Structure Web API

[Diagram: the Chemical Identifier Resolver, NCI/CADD web services, and external web services connect via http to the Chemical Structure Web API, which builds on the NCI/CADD Chemical Structure DataBase (CSDB), CACTVS, OPSIN, GUSAR, and other software packages.]
http://cactus.nci.nih.gov/blog
Acknowledgements

NCI/CADD Team
• Alexey Zakharov
• Laura Guasch Pàmies
• Megan Peach
• Marc Nicklaus

ChemNavigator
• Scott Hutton
• Tad Hurst

Xemistry GmbH, Germany
• Wolf-Dietrich Ihlenfeldt

PubChem

InChI Team

All other database providers
Acknowledgments - Software

• CACTVS
• ChemWriter
• Python Web Framework
• Python SQL Library
• Peter Ertl (Novartis)
• Javascript library
• Fulltext Search Engine
http://cactus.nci.nih.gov


ACS Meeting New Orleans 2013 (CINF)


Editor's notes

  1. Usage of CIR hovered around a few hundred thousand requests per month for the first two and a half years or so after its unofficial announcement in 2009. It took off about a year ago and crossed the mark of 10 million requests per month early this year, [and has been at or above 2 million per month for every month for a year now.]
  2. This slide shows some of the top users by request count. There are some very well-known names and to-be-expected users, and also maybe some more surprising ones... ...and yes, Merck is also among the users from the pharma sector.
  3. As it was designed to be, CIR is used .... ...as well as in educational tools such as the CheMagic educational web site put together by Otis Rothenberger at Illinois State University [based on the Jmol Virtual Molecular Model Kit], which centrally depends on our CIR.
  4. … but you can also do things independently of InChI. This is the general scheme: almost every identifier or representation can be converted to any other representation.
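The general scheme in note 4 can be sketched in Python, assuming the URL pattern given on the CIR slides (`/chemical/structure/<identifier>/<representation>`) and the HTTP 404 status they describe for unsuccessful requests. The helper names `cir_url` and `convert` are hypothetical, and the sketch assumes a plain-text response body.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier: str, representation: str) -> str:
    """Build a CIR request URL (hypothetical helper)."""
    # Percent-encode the identifier so names with spaces or SMILES
    # characters such as '#' stay inside the path segment.
    return f"{CIR_BASE}/{quote(identifier, safe='')}/{representation}"


def convert(identifier: str, representation: str) -> Optional[str]:
    """Fetch one representation; return None if CIR cannot resolve the identifier."""
    try:
        with urlopen(cir_url(identifier, representation)) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:  # CIR signals an unsuccessful request with HTTP 404
            return None
        raise


# Only the URL is built here; calling convert() needs network access.
print(cir_url("tamiflu", "cas"))
```

Returning `None` on a 404 keeps "identifier not resolvable" separate from genuine transport errors, which are still raised.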