The Royal Society of Chemistry provides a variety of databases and services covering multiple domains of chemistry. These include our electronic publishing platform, ChemSpider and its related databases, the National Chemistry Database, and digital access to the RSC archive, which spans over 170 years. To support the rising tide of semantic web technologies, we are now working on exposing our data in a form that conforms to the linked data paradigm. This presentation will provide an overview of our work to introduce semantic structure to all RSC electronic resources and will outline ways to access this information using standard formats and various APIs.
Building support for the semantic web for chemistry at the Royal Society of Chemistry
1. Presented by Karen Karapetyan, Colin Batchelor, Jonathan Steele, David Sharpe, Valery Tkachenko, Antony Williams
ACS Indianapolis September 2013
3. http://www.openphacts.org
Open PHACTS is an Innovative Medicines Initiative (IMI) project that aims to reduce the barriers to drug discovery in industry, academia, and small businesses.
The semantic web is one of its cornerstones.
9. Data set examples
• ChemSpider (passed 100K records)
  • All records are planned to pass through CVSP
• DrugBank (~6.5K records)
• ChEMBL (~1.2 million records)
13. 2 records where SMILES, InChI, and name did not match the structure
DrugBank IDs: DB00611, DB01547
14. ~40 records where InChIs did not match the structure
DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+
DrugBank ID: DB00614
15. 7 records with 2 stereo bonds at chiral atoms
DrugBank IDs: DB08128, DB06287
Reference: J. Brecher, IUPAC, Graphical Representation of Stereochemical Configuration, Section ST-1.1.10
16. CVSP validation of ChEMBL 16 (~1.3 million records)
• Overall 0.7% of records had validation issues
• Stereo problems (~82%)
  • Directions of bonds do not make sense (~63%)
  • Ambiguous stereo: 2 stereo bonds at chiral center (~19%)
20. “Atom not recognized”: 3% are isotopes
Should be an atom from the periodic table
In molfile:
• No mass difference in atom line
• No “M ISO” in connection table
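The isotope case on this slide can be illustrated with a minimal Python sketch. It assumes a V2000 molfile with the usual fixed-column atom block (symbol in columns 32-34, mass difference in columns 35-36). The names `unrecognized_atoms` and `atom_line`, and the two-atom sample record, are hypothetical and are not taken from CVSP.

```python
# Element symbols of the periodic table; anything else (e.g. "D", "T") is flagged.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())


def unrecognized_atoms(molfile):
    """Return (atom number, symbol, mass difference, has 'M  ISO' line) for
    every atom whose symbol is not a periodic-table element."""
    lines = molfile.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: first 3-character field
    has_iso = any(line.startswith("M  ISO") for line in lines)
    findings = []
    for number, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()
        mass_diff = int(line[34:36].strip() or 0)
        if symbol not in ELEMENTS:
            findings.append((number, symbol, mass_diff, has_iso))
    return findings


def atom_line(x, y, z, symbol, mass_diff=0):
    """Format one V2000 atom line (helper for building sample data only)."""
    return f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}{mass_diff:>2}" + "  0" * 11


# Hypothetical sample: deuterium written as the pseudo-symbol "D".
sample = "\n".join([
    "", "  sketch", "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    atom_line(0.0, 0.0, 0.0, "D"),
    atom_line(1.0, 0.0, 0.0, "O"),
    "  1  2  1  0",
    "M  END",
])
print(unrecognized_atoms(sample))
```

In this sample the atom "D" is reported with a mass difference of 0 and no "M  ISO" line, which is the pattern described above. Writing the same atom as "H" with a mass difference of 1 would no longer be flagged.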
21. CVSP: standardization
• Standardization workflow was developed for the Open PHACTS registration system
• Workflow includes modules like
  • SMIRKS rules derived from FDA SRS manual
  • Resetting symmetric stereo
  • Dearomatize
  • Layout
  • Fix “fixable” stereo issues
  • Disconnect all metals from N, O, F
  • Fold non-stereo hydrogens
  • Handle partial ionization of acid-base
  • etc.
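As an illustration of one module from the list above, "Disconnect all metals from N, O, F", here is a minimal sketch on a toy molecule representation. Everything in it is an assumption made for illustration: the `Mol` class, the short `METALS` set, and the charge bookkeeping (the metal gains and the heteroatom loses one unit of charge per bond order removed) are not taken from CVSP.

```python
from dataclasses import dataclass

# Illustrative, non-exhaustive metal list (an assumption, not CVSP's definition).
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Pt"}
HETERO = {"N", "O", "F"}


@dataclass
class Mol:
    symbols: list  # element symbol per atom
    charges: list  # formal charge per atom
    bonds: list    # (atom index a, atom index b, bond order)


def disconnect_metals(mol):
    """Remove metal-N/O/F bonds and move the bond electrons to the heteroatom."""
    kept = []
    charges = list(mol.charges)
    for a, b, order in mol.bonds:
        sa, sb = mol.symbols[a], mol.symbols[b]
        if sa in METALS and sb in HETERO:
            metal, hetero = a, b
        elif sb in METALS and sa in HETERO:
            metal, hetero = b, a
        else:
            kept.append((a, b, order))  # unrelated bond: keep unchanged
            continue
        charges[metal] += order   # assumed charge adjustment
        charges[hetero] -= order
    return Mol(list(mol.symbols), charges, kept)


# Sodium acetate drawn with a covalent Na-O bond (hypothetical input).
acetate = Mol(
    symbols=["C", "C", "O", "O", "Na"],
    charges=[0, 0, 0, 0, 0],
    bonds=[(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1)],
)
result = disconnect_metals(acetate)
print(result.charges, result.bonds)
```

The sketch returns a new molecule rather than modifying the input, so that each standardization module can be chained and its effect compared with the previous step.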
22. Open PHACTS chemical registry system: what do we use as chemical identity?
•Standard InChI/InChIKey (currently used by ChemSpider)
•Absolute SMILES (isomeric canonical)
Drawbacks
•SMILES: many flavors
•Standard InChI
  • does not include unknown/undefined stereo unless at least one defined stereo is present
  • does not distinguish between undefined and unknown stereo (always “?”)
  • does some basic tautomer canonicalization, which we wanted to prevent in order to distinguish between all tautomers (sometimes useful for linking spectral data to a specific tautomer)
  • assumes absolute stereo or no stereo at all
Path we took:
Non-standard InChI with options: SUU SLUUD FixedH SUCF
•Always include unknown/undefined stereo (‘u’, ‘?’)
•Add Fixed H layer (to distinguish between tautomers)
•Use chiral flag in MOL/SD record (ON: absolute stereo, OFF: relative)
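The difference between standard and non-standard identifiers is visible in the InChI string itself: a standard InChI starts with "InChI=1S/", a fixed-H layer is introduced by "/f", and unknown or undefined stereo appears as "u" or "?" in the stereo layers. The sketch below only inspects an existing string and does not generate InChIs; the function names `inchi_layers` and `describe` are hypothetical.

```python
def inchi_layers(inchi):
    """Split an InChI string into (is_standard, [(prefix, body), ...])."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len("InChI="):].split("/")
    layers = []
    for segment in segments:
        if not segment:
            continue
        # Layer prefixes are lowercase letters; a formula starts with an
        # element symbol or a multiplier digit.
        if segment[0].isupper() or segment[0].isdigit():
            layers.append(("formula", segment))
        else:
            layers.append((segment[0], segment[1:]))
    return version.endswith("S"), layers


def describe(inchi):
    """Summarize the features discussed on this slide for one InChI."""
    standard, layers = inchi_layers(inchi)
    prefixes = [prefix for prefix, _ in layers]
    stereo = "".join(body for prefix, body in layers if prefix in ("b", "t"))
    return {
        "standard": standard,
        "fixed_h_layer": "f" in prefixes,
        "unknown_or_undefined_stereo": "u" in stereo or "?" in stereo,
    }


# The InChI quoted earlier in this presentation.
example = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-"
    "20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
    "/b9-6+,12-11+,15-8+,16-14+"
)
print(describe(example))
```

For this example the summary reports a standard InChI with no fixed-H layer and fully defined double-bond stereo.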
23. For each compound (CSID), parent generation is attempted
Reference: “Tautomerism in large databases”, Sitzmann et al., J. Comput. Aided Mol. Des. (2010)
Parent                  Description
Charge-Insensitive      An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear.
Isotope-Insensitive     Isotopes are replaced by the common weight
Stereo-Insensitive      Stereo is stripped
Tautomer-Insensitive    Tautomer canonicalization attempts to generate a “reasonable” tautomer
Super-Insensitive       This parent is all of the above
No fragment-insensitive parent: we treat all fragments as equal entities
25. Chemistry Validation and Standardization Platform (CVSP) at cvsp.chemspider.com
•Validation
•Standardization
•Parent generation
[Diagram labels: Data, RDF Export]
26. Data is being imported from ChemSpider to Open PHACTS in RDF/Turtle
27. RDF/VoID
– VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data. http://www.w3.org/TR/void
  • skos:exactMatch (SKOS: Simple Knowledge Organization System), e.g. to link compounds in OPS with compounds in ChEBI.
  • skos:closeMatch, e.g. to link Stereo-Insensitive Parents to their children within OPS.
  • skos:relatedMatch, e.g. to link parent compounds that contain others as fragments.
– Recommendations on how to create the VoID have been specified by Manchester here: http://www.cs.man.ac.uk/~graya/ops/2012/ED-datadesc/
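To make the linkset idea concrete, the following sketch writes a small VoID linkset description and its SKOS mapping triples as Turtle text. It uses plain string formatting rather than an RDF library. The function name `linkset_turtle` and all example.org URIs are hypothetical placeholders, not real Open PHACTS or ChEBI identifiers. The `skos:` and `void:` terms used are from the published vocabularies.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
VOID = "http://rdfs.org/ns/void#"
ALLOWED = {"exactMatch", "closeMatch", "relatedMatch"}


def linkset_turtle(linkset, subjects_dataset, objects_dataset, predicate, links):
    """Return Turtle text for one void:Linkset and its mapping triples."""
    if predicate not in ALLOWED:
        raise ValueError(f"unsupported SKOS mapping predicate: {predicate}")
    lines = [
        f"@prefix skos: <{SKOS}> .",
        f"@prefix void: <{VOID}> .",
        "",
        f"<{linkset}> a void:Linkset ;",
        f"    void:subjectsTarget <{subjects_dataset}> ;",
        f"    void:objectsTarget <{objects_dataset}> ;",
        f"    void:linkPredicate skos:{predicate} .",
        "",
    ]
    # One mapping triple per (subject, object) pair.
    lines += [f"<{s}> skos:{predicate} <{o}> ." for s, o in links]
    return "\n".join(lines) + "\n"


# Placeholder URIs for illustration only.
turtle = linkset_turtle(
    linkset="http://example.org/linkset/ops-chebi",
    subjects_dataset="http://example.org/dataset/ops",
    objects_dataset="http://example.org/dataset/chebi",
    predicate="exactMatch",
    links=[("http://example.org/ops/compound/1",
            "http://example.org/chebi/compound/A")],
)
print(turtle)
```

The same function covers the other two cases on this slide by passing `closeMatch` (parent to child) or `relatedMatch` (parent to fragment) as the predicate.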
30. Future work
Enabling full semantic web capabilities:
•Establish an RDF server with all relationships (including parent-child relationships)
•Develop SPARQL capability for querying RDF
Validate all records in ChemSpider by passing them through CVSP