Williams and Ekins summarize efforts to improve the quality of public domain chemistry databases. They found hundreds of structure errors in the NPC Browser, released by the NIH Chemical Genomics Center (NCGC), within days of its release. Through blog posts and published papers, they aimed to alert the community to these issues and to the need for database curation, validation, and standardization. They propose solutions such as structure filtering, provenance tracking, and collaboration between database owners to reduce errors that can inhibit scientific progress when data is reused.
Improving Public Domain Chemistry Database Quality
1. Towards a Gold Standard: Improving The
Quality of Public Domain Chemistry
Databases
Antony J. Williams (1), Sean Ekins (2)
(1) Royal Society of Chemistry, Wake Forest, NC 27587
(2) Collaborations in Chemistry, Fuquay Varina, NC 27526
3. Chemistry structures are proliferating
on the web
Safety data
Toxicity data
Blogs and Wikis
Property databases
Experimental results
Scientific publications
Compound aggregators
Open Notebook Science
Metabolic pathway databases
Encyclopedic articles (Wikipedia)
Users take them at face value
They SHOULD NOT!!!
Immense quantities of scientific information are contained in the
thousands of databases
Progress can, however, be inhibited by errors in these databases,
which have downstream effects when the data is reused.
http://bit.ly/zWGaps
5. What Mechanisms Do We Have to Alert the Community?
Email database owner and hope for a response
Blog it
Tony has been blogging about database quality for years and nobody
was listening – other than the people at PubChem
For some databases, when he blogged they listened and would edit!
Tweet it
Dec 2010 - We felt something had to be said definitively about structure
quality
Publish it – wrote to Science, Nature and then PLoS Computational Biology
http://bit.ly/qtJF2f
Perhaps the phone?
6. April 27, 2011 - Then came:
The NPC Browser
Science Translational Medicine 2011
7. But wait, hold on – did anyone peer review the
database??
Database released, and within days...
A quick analysis of structure quality revealed...
Hundreds of errors found in structures
Williams and Ekins,
DDT, 16: 747-750 (2011)
11. How many contribute to
clean-up?
Less than a dozen contributors to data
The majority are project members
The crowd is small…
This is the same for all cheminformatics crowd-based efforts
12. What Mechanisms Do We Have to Alert the Community?
Publishing is too slow
Tony blogged on April 28th, 1 day after release: http://bit.ly/jn8wLC
I blogged on April 29th (http://bit.ly/lXHInG), suggesting the need for a gold standard database
After more extensive analysis we sent a manuscript to Science Translational Medicine: rejected
Drug Discovery Today: accepted, 8 months after we first pointed out the issue, which was even before the NPC Browser release
Williams and Ekins,
DDT, 16: 747-750 (2011)
13. Responses from Community and NCGC
Comments on initial blog
NCGC added a disclaimer which I blogged about May 23rd
http://bit.ly/m4Tx2b
Sept 8th 2011: email from Tudor Oprea (cc'ed to 60 others)
He has also been pointing out database errors for years
Followed by one from Chris Austin offering to meet us
Several individuals thanked us for the alert
14. More Extensive Analysis and solutions
More analysis of NPC browser errors
“analysis of the NPC browser ‘HTS amenable compounds’ subset of
data for 7600 compounds identified fundamental errors in
stereochemistry, valency issues and charge imbalances in a few
minutes work using a rudimentary software tool”
Analysis of other chemistry databases and errors
Other types of databases and errors
Offered solutions
Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving
the Situation Antony J. Williams, Sean Ekins and Valery Tkachenko, Drug Discovery Today, In Press 2012
15. Data Errors in the NPC Browser: Analysis of Steroids
Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
Gonane        | 34        | 5                 | 8                  | 21                         | 0
Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7
Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10
Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving
the Situation Antony J. Williams, Sean Ekins and Valery Tkachenko, Drug Discovery Today, In Press 2012
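The counts in the steroid table can be turned into a percentage of correct structures per substructure search. The following is a minimal sketch, not part of the original slides; the variable and function names are illustrative, and the numbers are copied directly from the table.

```python
# Counts copied from the steroid table above.
# Column order: hits, correct, no stereo, incomplete stereo, complete but incorrect.
rows = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}

def percent_correct(rows):
    """Return the percentage of correct hits for each substructure."""
    summary = {}
    for name, (hits, correct, none, incomplete, wrong) in rows.items():
        # Sanity check: the four outcome columns should add up to the hit count.
        assert correct + none + incomplete + wrong == hits
        summary[name] = round(100 * correct / hits, 1)
    return summary

summary = percent_correct(rows)
for name, pct in summary.items():
    print(f"{name}: {pct}% of hits have correct structures")
```

In each of the three searches the four outcome columns add up to the hit count, and fewer than a third of the hits are fully correct.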
17. What You Might Not Know About
Chemistry Databases On The Internet
Data-sharing between open databases is cyclic
This can proliferate errors in the “Linked Data”
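One way to picture how cyclic data-sharing spreads an error is to treat the databases as a directed graph and ask which databases a single bad record can reach. This is a minimal sketch under stated assumptions: the database names and sharing links below are hypothetical and do not describe any real database.

```python
from collections import deque

# Hypothetical sharing links: each database deposits records into the listed ones.
feeds = {
    "DB_A": ["DB_B"],
    "DB_B": ["DB_C"],
    "DB_C": ["DB_A", "DB_D"],  # DB_C feeds back into DB_A, closing a cycle
    "DB_D": [],
    "DB_E": ["DB_D"],
}

def databases_reached(feeds, source):
    """Return every database that an error introduced at `source` can reach."""
    seen = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for downstream in feeds.get(current, []):
            if downstream not in seen:  # the visited set stops the cycle looping forever
                seen.add(downstream)
                queue.append(downstream)
    return seen

print(sorted(databases_reached(feeds, "DB_B")))
```

Because of the cycle, an error that starts in any of DB_A, DB_B, or DB_C ends up in all three plus DB_D, which also illustrates why the original source of an error becomes hard to determine.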
18. Public Domain Databases
Our databases are a mess…
Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming and challenging
19. Molecule Data Quality Impacts
in silico drug discovery
vast ligand and protein–protein interaction databases
develop computational models
global mapping of pharmacological space
drug-target networks of approved drugs
prediction of off-target effects
20. Different types of
databases and errors
Bayer paper on target validation: two-thirds of papers did not live up to their claims
Errors in the MDL Drug Data Report (MDDR)
Errors in clinical research databases vary from 2.3% to 26.9%
A multicenter analysis by MS-based proteomics identified generic problems in
databases when characterizing proteins: search engines could not distinguish
different identifiers, and many algorithms calculated molecular weight incorrectly
One database had between 2.1% and 13.6% of annotated Pfam hits unjustified
Ligand–protein X-ray structures can also have errors with far-reaching
consequences
21. Solutions
Structure Validation and Standardization
Curation
Annotation
Structure filters
Incorrect valency, atom labels, aromatic bonds, stereochemistry, salts,
duplication
Structure standardization guidelines
Provided by the FDA (Substance Registration System, Unique Ingredient
Identifier (UNII)):
http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/default.htm
Need a record of molecule provenance
Can we track databases and their quality? See www.scidbs.com
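Two of the structure filters listed above, valency and charge balance, can be sketched in a few lines. This is a toy illustration, not the tool used in the analysis: the valence table is a simplified assumption covering only a handful of neutral elements, and the atom/bond tuple format and function name are hypothetical.

```python
# Simplified assumption: typical valences for a few neutral elements only.
ALLOWED_VALENCE = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "Cl": {1}}

def check_structure(atoms, bonds):
    """atoms: list of (symbol, formal_charge, attached_hydrogen_count)
    bonds: list of (atom_index_a, atom_index_b, bond_order)
    Returns a list of human-readable problems (empty if none were found)."""
    problems = []
    # Start each atom's valence at its hydrogen count, then add bond orders.
    valence = [h_count for (_, _, h_count) in atoms]
    for a, b, order in bonds:
        valence[a] += order
        valence[b] += order
    for i, (symbol, charge, _) in enumerate(atoms):
        if symbol not in ALLOWED_VALENCE:
            problems.append(f"atom {i}: unknown label {symbol!r}")
        elif charge == 0 and valence[i] not in ALLOWED_VALENCE[symbol]:
            # Charged atoms are skipped here; this toy table only covers neutral atoms.
            problems.append(f"atom {i}: {symbol} has valence {valence[i]}")
    net_charge = sum(charge for (_, charge, _) in atoms)
    if net_charge != 0:
        problems.append(f"net charge {net_charge:+d} is not balanced")
    return problems

# Ethanol, CH3-CH2-OH: passes both checks.
ethanol = ([("C", 0, 3), ("C", 0, 2), ("O", 0, 1)], [(0, 1, 1), (1, 2, 1)])
# Same skeleton with one hydrogen too many on the middle carbon.
bad_carbon = ([("C", 0, 3), ("C", 0, 3), ("O", 0, 1)], [(0, 1, 1), (1, 2, 1)])
# An ammonium cation drawn without its counter-ion.
lone_cation = ([("N", 1, 4)], [])

for label, (atoms, bonds) in [("ethanol", ethanol), ("bad carbon", bad_carbon), ("lone cation", lone_cation)]:
    print(label, check_structure(atoms, bonds))
```

A production filter would rely on a cheminformatics toolkit and would also cover stereochemistry, aromatic bonds, salts, and duplicate detection, none of which this sketch attempts.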
24. Scidbs.com
DB logo
Type of DB
Contact
Owner
Default Body Website
License
Curation etc
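The fields listed on this slide suggest a simple record per database. The sketch below is an assumption about how such a record could look, not the actual scidbs.com schema: the class name, field names, the added `name` field, and the example values are all hypothetical, and the unclear "Default Body" item is left out.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class DatabaseEntry:
    """Hypothetical catalog record built from the fields named on the slide."""
    name: str                        # added for illustration; not on the slide
    logo: Optional[str] = None       # "DB logo"
    db_type: Optional[str] = None    # "Type of DB"
    contact: Optional[str] = None
    owner: Optional[str] = None
    website: Optional[str] = None
    license: Optional[str] = None
    curation: Optional[str] = None   # "Curation etc"

    def missing_fields(self):
        """List the fields that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

# Example values are invented for illustration only.
entry = DatabaseEntry(name="ExampleDB", db_type="compound aggregator", owner="Example Lab")
print(entry.missing_fields())
```

Listing the unfilled fields makes gaps such as an unstated license or curation policy visible at a glance.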
25. Data should be:
Free from structure errors
Free from data errors
Free from experimental errors
Are we asking too much? Is it even possible??
Yet when we alert others:
When we raise our hands we are ignored
Our scientific community needs to wake up
26. Today
NPC Browser has fewer errors... so do ALL databases!
More people are aware of molecule quality online. Trust is
earned, not just granted!
The future database user is more informed
Tomorrow
Peer reviewers test the databases that are in manuscripts
NIH checks databases before release!
COLLABORATION between government DBs. PLEASE!!!
We need minimal compound database standards
(MCDS)
27. Acknowledgement
We thank the paper reviewers
and blog commenters
for their constructive comments
Chris Lipinski
This work was unfunded
(but was the right thing to do!)
www.scidbs.com