ChemValidator – an online service for validating and standardizing chemical s...
ACS Meeting New Orleans 2013 (CINF)
1. NCI/CADD Chemical Structure Web Services
Markus Sitzmann
Computer-Aided Drug Design Group, Chemical Biology Laboratory,
Frederick National Laboratory for Cancer Research, NIH, DHHS
3. Chemical Structure Web API
[Diagram. Labels: external web services; NCI/CADD web service (twice); Chemical Identifier Resolver; http; Chemical Structure Web API; CACTVS; OPSIN; other software packages; NCI/CADD Chemical Structure DataBase (CSDB)]
4. Chemical Structures
[Diagram: representations and identifiers arranged around a central "chemical structure"]
• SYBYL Line Notation
• SMILES
• CAS Registry Number
• chemical names
• GIF image
• ChemNavigator SID
• SD File
• CML
• FDA UNII
• NCI/CADD Identifiers
• NSC number
• MRV
• InChI/InChIKey
• PubChem SID/CID
• ChemSpider ID
• ChEBI ID
• Chemical Formula
• PDB Ligand ID
5. Chemical Identifier Resolver (CIR)
CIR works as a resolver for different
chemical structure identifiers or
representations.
It allows one to convert a given
structure identifier into another
representation or structure
identifier.
http://cactus.nci.nih.gov/chemical/structure
6. Chemical Identifier Resolver (CIR)
• officially released in June 2009
• since then four beta versions
(for testing, learning, and gaining experience)
• one larger database update March 2010
• since early 2012: major internal rewrite
(which will allow us to add new services
and API functionality while not breaking
the existing API)
• major database update and services
planned for 2013
http://cactus.nci.nih.gov/chemical/structure
7. CIR Usage Statistics
[Chart: requests per month since June 2009; vertical axis from 0 to 12,000,000]
Typical number of unique IP addresses per month: 4,000 – 8,000
8. Top Users (US)
Academic/Hospitals
• St. Olaf College
• Carnegie Mellon
• Drexel University
• Princeton
• Mayo
Pharma/Chemical Industry
• Eli Lilly
• Dow Chemical
• Intermune
• Procter & Gamble
• Vertex
U.S. Government
• EPA
• NIH (NIEHS, NCI, NLM...)
• Lawrence Livermore Natl. Lab.
• CDC
• DoD
Other
• Google
• Amazon
• HP
• Agilent
• Symyx
9. External web services and applications
• CIR node for KNIME, by Talete s.r.l.
• Lab Helper app for Windows Phone
• Avogadro molecule editor
• Jmol/JSmol open-source viewer for chemical structures in 3D
• GChem for Google Spreadsheet
• Bioclipse (CIR plugin)
• Macs in Chemistry
• Accelrys Draw
...and educational tools/sites such as:
• Jmol/JSmol Virtual Molecular Model Kit
• ISU CheMagic
• Caltech Library
14. Chemical Identifier Resolver (CIR)
programmatic URL API:
http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
if a request is not successful: HTTP 404 status message
15. Chemical Identifier Resolver (CIR)
• access by programming libraries/languages (e.g. Python):
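from urllib.request import urlopen
from urllib.error import HTTPError

url = "http://cactus.nci.nih.gov/chemical/structure/tamiflu/cas"
try:
    # urlopen() itself raises HTTPError, e.g. for an HTTP 404 response
    response = urlopen(url).read().decode()
except HTTPError:
    raise  # replace with your own error handling
print(response)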
204255-11-8
• access from Unix shell level (e.g., via wget):
shell > wget -qO - http://cactus.nci.nih.gov/chemical/structure/tamiflu/cas
204255-11-8
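The URL scheme "identifier"/"representation" and the HTTP 404 behaviour can be wrapped in a small helper. The following is a minimal Python 3 sketch and not part of the original slides: the names `cir_url` and `resolve` are made up for illustration, percent-encoding the identifier is an assumption about how the service treats special characters, and returning `None` on a 404 is only one possible way to handle an unsuccessful request.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation):
    """Build a CIR request URL (hypothetical helper, not part of CIR)."""
    # Assumption: percent-encoding keeps identifiers with spaces or slashes
    # inside a single path segment.
    return "{}/{}/{}".format(CIR_BASE, quote(identifier, safe=""), representation)


def resolve(identifier, representation):
    """Return the response text, or None if CIR answers with HTTP 404."""
    try:
        with urlopen(cir_url(identifier, representation)) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:
            return None  # request was not successful
        raise  # any other HTTP error is passed on to the caller


if __name__ == "__main__":
    # Only builds the URL; call resolve("tamiflu", "cas") to query the service.
    print(cir_url("tamiflu", "cas"))
```

Calling `resolve("tamiflu", "cas")` performs the same request as the urlopen and wget examples above.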
17. Chemical Identifier Resolver (CIR)
CIR: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"
"identifier" (input):
• chemical names
• IUPAC names (OPSIN)
• CAS numbers
• SMILES strings
• IUPAC InChI/InChIKeys
• NCI/CADD Identifiers
• CACTVS HASHISY
• NSC number
• PubChem SID
• ZINC Code
• ChemSpider ID
• ChemNavigator SID
• eMolecule VID
• UNII
"representation" (output):
• /smiles
• /names, /iupac_name
• /cas
• /inchi, /stdinchi
• /inchikey, /stdinchikey
• /ficts, /ficus, /uuuuu
• /image
• /file, /sdf
• /mw, /monoisotopic_mass
• /formula
• /twirl
• /urls
• /chemspider_id
• /pubchem_sid
• /chemnavigator_sid
18. (Partial) InChIKey Lookup
• resolve Standard InChIKey into full structure representation:
Ethanol
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles
CCO
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ-UHFFFAOYSA/smiles
CCO
CC[OH2+]
http://cactus.nci.nih.gov/chemical/structure/LFQSCWFLJHTTHZ/smiles
C(C(O)([2H])[2H])[2H]
CC(O)([2H])[2H]
C(CO)([2H])([2H])[2H]
CC[17OH]
C(CO)[2H]
[14CH3]CO
CCO
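The three requests above differ only in how many hyphen-separated blocks of the Standard InChIKey are sent: all three blocks, the first two, or only the first. The shorter the key, the more structures are returned. As a minimal sketch (the function name `inchikey_lookup_keys` is hypothetical and only illustrates the string handling), the partial keys can be derived like this:

```python
def inchikey_lookup_keys(inchikey):
    """Return the full InChIKey followed by its two shorter prefixes."""
    blocks = inchikey.split("-")
    # A full InChIKey has three blocks of 14, 10 and 1 characters.
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError("not a full InChIKey: %r" % inchikey)
    # Keep the first 3, 2 and 1 blocks, in that order.
    return ["-".join(blocks[:count]) for count in (3, 2, 1)]


if __name__ == "__main__":
    for key in inchikey_lookup_keys("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"):
        print(key)
```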
19. Chemical File Representation
• available file format representations:
http://cactus.nci.nih.gov/chemical/structure/Aspirin/file?format=sdf
alc          Alchemy format
cdxml        CambridgeSoft ChemDraw XML format
cerius       MSI Cerius II format
charmm       Chemistry at HARvard Macromolecular Mechanics file format
cif          Crystallographic Information File
cml          Chemical Markup Language
gjf          Gaussian input data file
gromacs      GROMACS file format
hyperchem    HyperChem file format
jme          Java Molecule Editor format
maestro      Schroedinger MacroModel structure file format
mol          Symyx molecule file
sybyl2/mol2  Tripos Sybyl MOL2 format
mrv          ChemAxon MRV format
pdb          Protein Data Bank
sdf          Symyx Structure Data Format
sdf3000      Symyx Structure Data Format 3000
sln          SYBYL Line Notation
smiles       SMILES
xyz          xyz file format
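A file request combines the /file representation with a `format` query parameter, as in the Aspirin example above. The sketch below builds such URLs and rejects keywords that are not in the list on this slide. The helper `file_url` is hypothetical, and treating "sybyl2" and "mol2" as two alternative keywords is an assumption based on how the slide prints them.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"

# Format keywords as listed on the slide; "sybyl2" and "mol2" are assumed
# to be alternative spellings of the same format.
FILE_FORMATS = {
    "alc", "cdxml", "cerius", "charmm", "cif", "cml", "gjf", "gromacs",
    "hyperchem", "jme", "maestro", "mol", "sybyl2", "mol2", "mrv", "pdb",
    "sdf", "sdf3000", "sln", "smiles", "xyz",
}


def file_url(identifier, file_format="sdf"):
    """Build a CIR file request URL (hypothetical helper)."""
    if file_format not in FILE_FORMATS:
        raise ValueError("unknown file format: %r" % file_format)
    query = urlencode({"format": file_format})
    return "{}/{}/file?{}".format(CIR_BASE, quote(identifier, safe=""), query)


if __name__ == "__main__":
    print(file_url("Aspirin", "sdf"))
```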
21. Chemical Properties
• request molecular weight:
Aspirin
http://cactus.nci.nih.gov/chemical/structure/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/weight
180.1598 (MIME type: text/plain)
/mw                         molecular weight
/formula                    formula
/monoisotopic_mass          monoisotopic mass
/h_bond_donor_count         H bond donor count
/h_bond_acceptor_count      H bond acceptor count
/h_bond_center_count        H bond center count
/rotor_count                number of rotatable bonds
/effective_rotor_count      number of effectively rotatable bonds
/rule_of_5_violation_count  number of Rule-of-5 violations
/xlogp2                     octanol−water partition coefficient XLOGP2
/aromatic                   compound is aromatic
/macrocyclic                compound is macrocyclic
/heteroatom_count           heteroatom count
/hydrogen_atom_count        H atom count
/heavy_atom_count           heavy atom count
/deprotonable_group_count   number of deprotonable groups
/protonable_group_count     number of protonable groups
/ring_count                 number of rings
/ringsys_count              number of ring systems
23. Chemical Name Pattern Search
• Google-like searches on CIR’s name index (approx. 70 million names)
example: all chemical names that contain the words “morphine” and “methyl”
(name pattern: '+morphine +methyl'):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl/stdinchikey/xml?resolver=name_pattern
based on the open source full text search server Sphinx (http://sphinxsearch.com)
25. Chemical Name Pattern Search
example: chemical names that contain the words “morphine” and “methyl”
but not “hydroxyl” (name pattern: '+morphine +methyl -hydroxyl'):
http://cactus.nci.nih.gov/chemical/structure/+morphine +methyl -hydroxyl/stdinchikey/xml?resolver=name_pattern
6 matching names
example: chemical names that contain the substring “morphine”
somewhere in the name (name pattern: '*morphine*'):
http://cactus.nci.nih.gov/chemical/structure/*morphine*/stdinchikey/xml?resolver=name_pattern
45 matching names
example: chemical names that contain the single character “m” and the word
“benzene” within a maximum distance of 3 words (finds meta-substituted aromatic
compounds, name pattern: '“m benzene”~3'):
http://cactus.nci.nih.gov/chemical/structure/(m benzene)~3/stdinchikey/xml?resolver=name_pattern
22 matching names
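The name patterns above contain spaces, plus signs and asterisks, which are awkward to send from a script as they stand. The sketch below percent-encodes the pattern before placing it in the URL path. It is not taken from the slides: the helper `name_pattern_url` is hypothetical, and it is an assumption that the service decodes percent-encoded patterns in the same way as the raw URLs shown above.

```python
from urllib.parse import quote, urlencode

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def name_pattern_url(pattern, representation="stdinchikey", xml=True):
    """Build a name pattern search URL (hypothetical helper)."""
    # Assumption: the server accepts the percent-encoded pattern.
    parts = [CIR_BASE, quote(pattern, safe=""), representation]
    if xml:
        parts.append("xml")
    return "/".join(parts) + "?" + urlencode({"resolver": "name_pattern"})


if __name__ == "__main__":
    for pattern in ("+morphine +methyl", "+morphine +methyl -hydroxyl", "*morphine*"):
        print(name_pattern_url(pattern))
```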
27. Chemical Structure Normalization/Identifier
• stepwise process:
[Diagram: original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure → hashcode calculation (E_HASHISY) → NCI/CADD Identifier; further labels: SDF, SMILES, database]
original structure records, parent structures and identifiers
are stored in the database
28. Chemical Structure Normalization/Identifier
• calculation of a set of parent structures with different
sensitivity to chemical features:
[Diagram: original structure record → structure normalization → parent structures (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifiers (FICTS, FICuS, uuuuu)]
all steps are performed using CACTVS
29. NCI/CADD Identifiers (FICTS, FICuS, uuuuu)
• based on CACTVS hashcodes (HASHISY)
• 16-digit hexadecimal number (64-bit unsigned), e.g. 9850FD9F9E2B4E25
structure normalization - histidine:
[Structure drawings: tautomer 1, tautomer 2, salt (Na+), R, S]
tautomer 1   9850FD9F9E2B4E25-FICTS   9850FD9F9E2B4E25-FICuS   9850FD9F9E2B4E25-uuuuu
tautomer 2   6C16DE2351F9FF50-FICTS   9850FD9F9E2B4E25-FICuS   9850FD9F9E2B4E25-uuuuu
salt         E5F83F10C5DB080A-FICTS   E5F83F10C5DB080A-FICuS   9850FD9F9E2B4E25-uuuuu
R            E92E4BA2869F3611-FICTS   E92E4BA2869F3611-FICuS   9850FD9F9E2B4E25-uuuuu
S            8A7AD1EB498CC76A-FICTS   8A7AD1EB498CC76A-FICuS   9850FD9F9E2B4E25-uuuuu
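An identifier therefore consists of a 16-digit hexadecimal hashcode and one of the suffixes FICTS, FICuS or uuuuu. The sketch below splits such a string and checks that the hashcode fits into 64 bits. It only illustrates the string format shown on this slide; the names `parse_identifier` and `format_identifier` are hypothetical, and the hashcode calculation itself (done by CACTVS) is not reproduced.

```python
import re
from collections import namedtuple

NciCaddIdentifier = namedtuple("NciCaddIdentifier", ["hashcode", "kind"])

# 16 uppercase hexadecimal digits, a hyphen, then the identifier type.
_IDENTIFIER = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)$")


def parse_identifier(text):
    """Split e.g. '9850FD9F9E2B4E25-FICTS' into (integer hashcode, type)."""
    match = _IDENTIFIER.match(text)
    if match is None:
        raise ValueError("not an NCI/CADD identifier: %r" % text)
    return NciCaddIdentifier(int(match.group(1), 16), match.group(2))


def format_identifier(identifier):
    """Inverse of parse_identifier: zero-padded uppercase hex plus suffix."""
    return "{:016X}-{}".format(identifier.hashcode, identifier.kind)


if __name__ == "__main__":
    parsed = parse_identifier("9850FD9F9E2B4E25-FICTS")
    print(parsed.kind, parsed.hashcode < 2 ** 64)
```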
30. Chemical Structure Normalization/Identifier
• calculation of Standard InChIKey from the union set of
parent structures
[Diagram: original structure record → structure normalization → parent structures (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier; union set of the parent structures → Standard InChIKey 1.03]
31. Chemical Structure Database (CSDB)
• ChemNavigator iResearch Library
compilation of commercially available screening compounds from ~300 international chemistry suppliers
• PubChem Substance Database
including Open NCI database, EPA DSSTox databases, NIAID HIV database, NIST Webbook, NLM ChemIDplus, ChemSpider, …
• Commercial Sources / others
Asinex, Comgenex, eMolecules, …
[Pie chart of record sources. Labels: PubChem, ChemNav. iResearch Lib., others; shares: ~38%, ~56%, ~6%]
current status (released March 2010):
• 140 chemical structure databases
• 120 million structure records
• 84.6 million unique structures by FICuS
• 110 million Standard InChIKeys for lookup
33. Chemical Structure Database 2013
• >270 small-molecule databases
• >600 database releases (full, incremental, “historic versions”)
• 385 million original database records
unique structure count:
FICTS ~125.0 million FICuS ~121.4 million uuuuu ~109.0 million
union set: 141.7 million unique structures
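The union set can be larger than each of the three individual counts because a FICuS or uuuuu parent structure may not occur as a FICTS parent of any record. The toy calculation below illustrates this with two of the histidine hashcodes from the NCI/CADD Identifiers slide. It assumes that the union set is the set of distinct parent structure hashcodes over all three identifier types, which is an interpretation of the slides rather than something they state, and the function name `unique_counts` is hypothetical.

```python
# Hashcodes of two histidine forms, copied from the NCI/CADD Identifiers slide.
RECORDS = {
    "tautomer 2": {
        "FICTS": "6C16DE2351F9FF50",
        "FICuS": "9850FD9F9E2B4E25",
        "uuuuu": "9850FD9F9E2B4E25",
    },
    "salt": {
        "FICTS": "E5F83F10C5DB080A",
        "FICuS": "E5F83F10C5DB080A",
        "uuuuu": "9850FD9F9E2B4E25",
    },
}


def unique_counts(records):
    """Count distinct hashcodes per identifier type and over all types."""
    per_kind = {}
    for hashcodes in records.values():
        for kind, hashcode in hashcodes.items():
            per_kind.setdefault(kind, set()).add(hashcode)
    counts = {kind: len(values) for kind, values in per_kind.items()}
    # Assumption: the union set is the union of all parent hashcodes.
    counts["union"] = len(set().union(*per_kind.values()))
    return counts


if __name__ == "__main__":
    print(unique_counts(RECORDS))
```

In this toy data set there are two unique FICTS structures, two unique FICuS structures and one unique uuuuu structure, but three distinct parent structures in the union.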
34. Chemical Structure Database 2013
InChI/InChIKey (Version 1.04) calculated with four InChI flag sets:
Standard Set, Set 1 & Set 2: addition of hydrogen atoms by CACTVS
Set 3: addition of hydrogen atoms by the InChI library
CACTVS
Standard : Add H Standard InChIKey
Set 1 : Add H DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T
Set 2 : Add H DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T
Set 3 : Add H DONOTADDH W0 FIXEDH RECMET NEWPS SPXYZ SAsXYZ Fb Fnud KET 15T
35. Chemical Structure Database 2013
• calculation of Standard InChIKey
[Diagram: original structure record → structure normalization → parent structures (FICTS, FICuS, uuuuu) → hashcode calculation (E_HASHISY) → NCI/CADD Identifier; union set of the parent structures → Standard InChIKey 1.04 (Standard, Set 1, Set 2, Set 3)]
36. Chemical Structure Database 2013
• database schema is entirely implemented in python/
• supports many different database engines: Oracle, PostgreSQL, MySQL
• SQLAlchemy provides:
• the communication layer with the database engine
• an object-oriented data model representation of the database on the
Python side
• table relationships:
• either defined by Foreign Key relationships in the database or specified
on the Python level
• SQLAlchemy creates table joins on the SQL level
38. Chemical Structure Database 2013
• Query the database
> s = db.session.query(Structure).filter(Structure.id == 1234).one()
<object "Structure">
> s.smiles
CCO
• if the object-oriented data model representation creates too
much overhead, SQLAlchemy supports writing “almost bare”
SQL that still follows Python paradigms
> q = select([structure_table.c.smiles]).where(structure_table.c.id == 1234)
> s = q.execute().fetchone()
(CCO,)
39. Chemical Structure Database
• Goals
• index any chemical structure that can be referenced in some way or
has a known source
• may also include virtual chemistry or generic structure collections
• collect public datasets/databases/structure collections
• normalize them to our standards
• make them available in our public web interfaces and APIs
(if we are allowed to)
• no refusal/deletion of structures – curation follows the principle “keep the
bad and tag it as bad”
track chemical space
42. NCI/CADD Chemical Web Apps
• implemented with jQuery Mobile (1.3.0)
• HTML5
• supports web browsers on major mobile platforms: iOS, Android,
BlackBerry, Windows Phone, Windows 8, Palm, Symbian
• supports major desktop web browsers: Google Chrome, Firefox, IE9/10
• WAI-ARIA compliant (W3C specification draft describing accessibility
standards of dynamic Web content for people with disabilities)
• services will be optimized for use on tablet-sized touch-screen
devices, but not (yet) for smartphone-sized
devices (current development is done on an iPad3)
• all services work on a common platform
49. Chemical Activity Predictor - GUSAR
chemical structure
prediction of
physicochemical properties and activities
50. Chemical Activity Predictor - GUSAR
GUSAR Software
characteristics:
• chemical structures are represented by QNA descriptors and MNA descriptors
• mathematical algorithm: a unique self-consistent regression algorithm
selects the best set of descriptors for a robust and reliable QSAR model
main developer: Alexey Zakharov
GUSAR Software
[Bar chart: accuracy (R2 test), axis from 0.00 to 1.00, for CoMFA, CoMSIA, HQSAR, EVA, 2D Cerius2, 3D Cerius2, GOLPE, GUSAR]
comparison was performed on the following data sets:
• ligand–enzyme interactions
• ligand–receptor interactions
• acute toxicity
• interaction with drug-metabolism enzymes
52. Chemical Activity Predictor - GUSAR
• QSAR-based models created by GUSAR can be used separately
from the application
• broad spectra of chemical/biological activity and property
prediction models for small molecules in development:
• physicochemical properties
• assessment of toxicity, metabolism and antineoplastic activities
• HIV-1-related models
• will be available as Web App and programmatic URL API:
http://cactus.nci.nih.gov/chemical/activity/CCOCC/boiling_point
{in_applicability_domain: True, datatype: 'float', value: 42.660}
53. Chemical Activities
Category: Physicochemical Properties
Models: Physicochemical Models
Endpoints:
• Boiling point
• Density
• Flash point
• Melting point
• Surface tension
• Thermal conductivity
• Vapor pressure
• Viscosity
• Water solubility
Category: Biological Activities
Models: HIV-Models
Endpoints:
• HIV-1 Integrase (Strand Transfer) Inhibitor
• HIV-1 Reverse Transcriptase Inhibitor
66. Chemical Structure Lookup Service (CSLS)
• first version was released in 2006, development stalled in 2008
• new version will be based on CSDB
• new release planned for 2013
• allows easy lookup of chemical structures within the constituent
databases of CSDB
69. InChI/InChIKey Resolver
• “loose coupling” of InChI resolvers provided by different organizations
• central list of resolvers
• each resolver must provide a specific protocol
70. InChI/InChIKey Resolver
• Evan Bolton (NCBI, NLM, NIH)
• Valery Tkachenko (RSC/ChemSpider)
• Marc Nicklaus (CADD Group, NCI, NIH)
• Steven Bachrach (Trinity University)
• Antony Williams (RSC/ChemSpider)
• Markus Sitzmann (CADD Group, NCI, NIH)
72. Chemical Structure Web API
[Diagram. Labels: external web services; NCI/CADD web service (twice); Chemical Identifier Resolver; http; Chemical Structure Web API; GUSAR; CACTVS; OPSIN; other software packages; NCI/CADD Chemical Structure DataBase (CSDB)]
74. Acknowledgements
NCI/CADD Team
• Alexey Zakharov
• Laura Guasch Pàmies
• Megan Peach
• Marc Nicklaus
ChemNavigator
• Scott Hutton
• Tad Hurst
PubChem
All other database providers
Xemistry GmbH, Germany
• Wolf-Dietrich Ihlenfeldt
InChI Team
75. Acknowledgments - Software
CACTVS
Python Web Framework
ChemWriter
Python SQL Library
Peter Ertl (Novartis)
Javascript library
Fulltext Search Engine
While usage of CIR since its unofficial announcement in 2009 hovered around a few hundred thousand requests per month for the first two and a half years or so, it took off about a year ago and crossed the mark of 10 million requests per month early this year, [and has been at or above 2 million per month every month for a year now.]
This slide shows some of the top users by request count. There are some very well-known names and to-be-expected users. And also maybe some more surprising ones... ...and yes, Merck is also among the users from the pharma sector.
As it was designed to be, CIR is used .... ...as well as in educational tools such as the CheMagic educational web site put together by Otis Rothenberger at Illinois State University [based on the Jmol Virtual Molecular Model Kit], which centrally depends on our CIR.
… but you can also do things independently of InChI – this is the general scheme. Almost every identifier or representation can be converted to any other representation.