2. Chemistry on the Internet
Hundreds of websites serve up chemistry data and SDF
files of structures and data
Some primary resources: PubChem, ChEBI,
DrugBank, ChemIDPlus, Wikipedia
ChemSpider “links” chemistry on the internet
Almost 25 million compounds, 400 data sources
Allows community deposition, curation, annotation
Integrating properties, publications, patents, media
Searching by text, structure and substructure (in testing)
6. We Have Delivered the Vision
“Build a Structure Centric Community to
Serve Chemists”
Integrate chemical structure data on the web
Create a “structure-based hub” to information,
data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data
7. How Did We Build It?
We deal in Molfiles or SDF files – including
coordinates
We do rudimentary filtering – valence checking,
charge imbalance – prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one
record
Link out to external sites where possible using IDs
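The filtering and aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not ChemSpider's actual business logic: the `Atom` and `Record` types, the tiny valence table, and the reading of "charge imbalance" as a nonzero net charge are all assumptions made for the example. The InChI strings are supplied as input (a real pipeline would compute them from the Molfile with an InChI toolkit).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List

# Assumed, deliberately tiny table of allowed valences for neutral atoms.
ALLOWED_VALENCES = {"H": {1}, "C": {4}, "N": {3}, "O": {2}}


@dataclass
class Atom:
    element: str
    bond_order_sum: int  # total bond order, implicit hydrogens included
    charge: int = 0


@dataclass
class Record:  # hypothetical stand-in for one parsed Molfile/SDF entry
    source_id: str
    inchi: str
    atoms: List[Atom] = field(default_factory=list)


def valence_errors(record):
    """List neutral atoms whose valence is not in the assumed table."""
    errors = []
    for index, atom in enumerate(record.atoms):
        allowed = ALLOWED_VALENCES.get(atom.element)
        if allowed is None or atom.charge != 0:
            continue  # outside the scope of this sketch
        if atom.bond_order_sum not in allowed:
            errors.append(f"atom {index} ({atom.element}): valence {atom.bond_order_sum}")
    return errors


def is_charge_balanced(record):
    """Assumption: 'balanced' means the formal charges sum to zero."""
    return sum(atom.charge for atom in record.atoms) == 0


def aggregate(records):
    """Filter records, then group the survivors by InChI string."""
    accepted, rejected = defaultdict(list), []
    for record in records:
        if valence_errors(record) or not is_charge_balanced(record):
            rejected.append(record.source_id)
        else:
            accepted[record.inchi].append(record.source_id)
    return dict(accepted), rejected


# Both tautomers of C5H5NO have ordinary valences and share one InChI string.
c5h5no = [Atom("C", 4)] * 5 + [Atom("N", 3), Atom("O", 2)] + [Atom("H", 1)] * 5
PYRIDONE_INCHI = "InChI=1S/C5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7)"  # input, not computed here

records = [
    Record("dep-1 (2-pyridone)", PYRIDONE_INCHI, c5h5no),
    Record("dep-2 (2-hydroxypyridine)", PYRIDONE_INCHI, c5h5no),
    Record("dep-3 (five-valent carbon)", "InChI=1S/CH4/h1H4",
           [Atom("C", 5)] + [Atom("H", 1)] * 5),
    Record("dep-4 (lone cation)", "InChI=1S/Na/q+1", [Atom("Na", 0, charge=1)]),
]
accepted, rejected = aggregate(records)
```

Here the two tautomer depositions land in a single record because they carry the same InChI, while the five-valent carbon and the unbalanced cation are held back before deposition.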
8. Inherited Errors
We have inherited errors from every database…
all public compound databases, including ours,
have errors
“Incorrect” structures – assertions, timelines, etc.
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
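One way to surface the “incorrect names” category listed above is to look for names that different sources attach to different structures. The sketch below is only an illustration: it assumes each record is a `(source, name, structure_id)` triple, and the source labels and structure identifiers are hypothetical placeholders rather than real database entries.

```python
from collections import defaultdict


def normalize_name(name):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def find_name_conflicts(records):
    """Return names that are linked to more than one structure identifier.

    records: iterable of (source, name, structure_id) triples.
    Result maps each conflicting name to {structure_id: [sources...]}.
    """
    by_name = defaultdict(lambda: defaultdict(set))
    for source, name, structure_id in records:
        by_name[normalize_name(name)][structure_id].add(source)
    return {
        name: {sid: sorted(sources) for sid, sources in structures.items()}
        for name, structures in by_name.items()
        if len(structures) > 1
    }


# Hypothetical records; STRUCT-n stands in for a real structure identifier.
records = [
    ("source_a", "Vitamin K", "STRUCT-1"),
    ("source_b", "vitamin  k", "STRUCT-2"),
    ("source_c", "Menadione", "STRUCT-3"),
    ("source_a", "menadione", "STRUCT-3"),
]
conflicts = find_name_conflicts(records)
```

A conflict flagged this way is not automatically an error (a name can legitimately cover a family of compounds), so the output is better treated as a worklist for curators than as a verdict.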
10. MeSH
A lipid cofactor that is required for normal blood
clotting. Several forms of vitamin K have been
identified: VITAMIN K 1 (phytomenadione) derived
from plants, VITAMIN K 2 (menaquinone) from
bacteria, and synthetic naphthoquinone provitamins,
VITAMIN K 3 (menadione). Vitamin K 3 provitamins,
after being alkylated in vivo, exhibit the
antifibrinolytic activity of vitamin K. Green leafy
vegetables, liver, cheese, butter, and egg yolk are
good sources of vitamin K
23. Public Domain Chemistry Databases
Our databases are a mess…
Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming, challenging and
exacting
An examination of quality in databases – inter/intra
lab comparison of processes for 150 drugs
41. How are data handled in Pharma?
Algorithms for “collapsing” data? Skeletons only?
Processing structure-name pairs?
Manual curation?
Does it matter relative to the noise in the
measurements?
Do correct structure representations matter, and
to whom?
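As one possible reading of “skeletons only” collapsing, measurements can be grouped on the first block of the InChIKey, which hashes the connectivity and ignores stereochemistry and isotopes. This is a sketch under that assumption, not a description of any particular company's practice. The InChIKeys are supplied as illustrative input and the numeric values are made up.

```python
from collections import defaultdict


def skeleton_key(inchikey):
    """Return the 14-character first block of a standard InChIKey."""
    parts = inchikey.split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return parts[0]


def collapse_by_skeleton(measurements):
    """Group measured values by skeleton, merging stereoisomers.

    measurements: iterable of (inchikey, value) pairs.
    """
    groups = defaultdict(list)
    for inchikey, value in measurements:
        groups[skeleton_key(inchikey)].append(value)
    return dict(groups)


# Illustrative inputs: three alanine records (L, D, unspecified) and water.
measurements = [
    ("QNAYBMKLOCPYGJ-REOHSAFNSA-N", 1.0),
    ("QNAYBMKLOCPYGJ-UWTATZPHSA-N", 2.0),
    ("QNAYBMKLOCPYGJ-UHFFFAOYSA-N", 3.0),
    ("XLYOFNOQVPJJNP-UHFFFAOYSA-N", 4.0),
]
collapsed = collapse_by_skeleton(measurements)
```

Collapsing this way throws away exactly the information (stereochemistry) that is most often wrong or missing in deposited structures, which is why the question of whether it matters relative to measurement noise is a fair one.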
45. Consider searching each of these
chemical databases by chemical
name (systematic name, trade
name or synonym). Please mark
each online resource according to
how much you generally trust the
results.
47. Columns: Drug Name, Generic Name, ChEBI, ChemSpider, CAS Com. Chem, ChemIDPlus, DailyMed, DrugBank, PubChem, Wikipedia
Spiriva (Tiotropium Bromide): No Hits, No Hits, 4/0
Depakote (Valproate semisodium): No Structure
Basen (Voglibose): No Hits, No Hits, 2/1
Symbicort, 1) Budesonide: 8/1
Symbicort, 2) Formoterol: WRONG, No Hits, 6/1
Vytorin, 1) Ezetimibe: No Hits
Vytorin, 2) Simvastatin: 2/1
Taxol (Paclitaxel): 44/1
Thalidomid (Thalidomide): No Hits
Zocor (Simvastatin): 2/1
Crestor (Rosuvastatin): No Hits, 2/1
50. Online Curation
Online databases generally do NOT allow
curation or annotation
If you find errors they stay there!
ChemSpider allows immediate curation
51. Crowdsourcing Works
Over 100 people have deposited data (structures,
spectra, etc) and participated in data curation
Curators at different levels check each other's work
Wikipedia is the modern primary example
Some curators are “madmen”…
The Oxford English Dictionary
53. Collaborative Data Curation
How can we COLLECTIVELY clean online data?
ChemSpider has inherited junk from >400 data
sources. Some of this has proliferated into
PubChem. We should deprecate it.
We need to develop a way to share curation actions
back to original data sources
A “bigger is better” mindset is problematic. How
many “real chemicals” are in the public databases?
54. ChemSpider
ChemSpider is free to use.
Multiple web services are available.
New data added daily.
Curation and data validation are ongoing every day.
Provided by the RSC.
www.chemspider.com