This is a presentation given to a group of students at the UNC Eshelman School of Pharmacy.
As chemists, many of us want to find information that is high quality, accurate and relevant to our query. With the proliferation of online chemistry resources, it is very common for us to turn to them to source data. However, are resources such as Wikipedia, PubChem and the plethora of databases delivering information for metabolism, medicinal chemistry and synthetic chemistry trustworthy? Which of these resources, if any, should be treated as authorities? What is the most integrated approach to sourcing chemistry-related data online? What approaches can be taken to validate the data that is available, and how can individual scientists help to improve the content and quality of chemistry-related data on the web?
Antony Williams is ChemSpiderman. He started the ChemSpider database (www.chemspider.com) as a hobby to deliver a free platform for the community to source chemistry-related data. Within three years the system was acquired by the Royal Society of Chemistry. It now serves up close to 25 million chemical structures linked to over 400 data sources across the internet, and offers individual scientists the opportunity to host and share their data with the community and to participate in data curation and annotation. Tony will share his experiences of building this chemistry database, with a focus on data validation, curation and sourcing high-quality data. During the presentation he will discuss ways to check chemical structure representations before submission to public systems for searching, and provide an overview of chemical identifiers such as SMILES strings and the International Chemical Identifier (InChI), which allow for the interlinking of resources. Attendees can expect to leave the session with a deeper understanding of using the internet to source chemistry-related data.
2. Chemistry on the Internet
Where do you source chemistry information?
What can you trust online?
How can you recognize potential issues?
Cross-referencing and curating data
4. MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
21. Molfiles
Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
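To make the "carries coordinates" point concrete, here is a minimal sketch of reading the counts line, atom block and bond block of a V2000 Molfile using fixed-width columns. The ethanol record and the function name `parse_molfile` are assumptions made up for illustration; real files from different drawing packages carry more fields and variations than this handles.

```python
# Hypothetical, hand-written V2000 record for ethanol (heavy atoms only).
MOLFILE = "\n".join([
    "ethanol",                                   # header line 1: name
    "  hypothetical-sketch",                     # header line 2: program info
    "",                                          # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",   # counts line
    "    0.0000    0.0000    0.0000 C   0  0",
    "    1.5000    0.0000    0.0000 C   0  0",
    "    2.2500    1.3000    0.0000 O   0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

def parse_molfile(text):
    """Return (atoms, bonds) from a simple V2000 Molfile string."""
    lines = text.split("\n")
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character fields; the symbol follows.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # first atom, second atom (1-based), bond order
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds

atoms, bonds = parse_molfile(MOLFILE)
# All z values zero suggests a flat 2D layout rather than a 3D geometry.
is_2d_layout = all(z == 0.0 for _symbol, _x, _y, z in atoms)
```

Checking whether every z coordinate is zero is one quick way to tell a 2D depiction from a record that carries 3D coordinates.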
22. SMILES (http://en.wikipedia.org/wiki/SMILES)
SMILES is a common format
Can support polymers, organometallics, etc.
Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms: can be problematic!
Generally different between drawing packages
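The variability between packages can be seen with plain strings. The sketch below takes three valid SMILES for ethanol, shows that they differ as text, and compares their heavy-atom counts with a deliberately tiny tokenizer. The regular expression and the function name `heavy_atom_counts` are assumptions for illustration and cover only unbracketed organic-subset atoms; this is not a SMILES parser.

```python
import re
from collections import Counter

# Organic-subset atom symbols only; bracket atoms such as [Na+] are not handled.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles):
    """Count heavy atoms in a simple SMILES string (no bracket atoms)."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside this sketch")
    # Aromatic lowercase symbols (c, n, ...) are folded into their element.
    return Counter(sym.capitalize() for sym in ATOM.findall(smiles))

ethanol_variants = ["CCO", "OCC", "C(O)C"]   # three spellings, one molecule
counts = [heavy_atom_counts(s) for s in ethanol_variants]

all_strings_differ = len(set(ethanol_variants)) == len(ethanol_variants)
same_composition = all(c == counts[0] for c in counts)

# Dimethyl ether has the same heavy atoms but is a different molecule,
# so composition alone cannot establish identity.
ether_matches_too = heavy_atom_counts("COC") == counts[0]
```

Because string equality fails for identical molecules and composition succeeds for different ones, matching records across databases needs a canonical form from a cheminformatics toolkit or an identifier such as InChI.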
27. InChI
SINGLE code base managed by IUPAC, integrated into drawing packages. No variability as with SMILES
InChI Strings can be reversed to structures, with the same problem as SMILES: no layout
Well adopted by the community (databases, publishers, blogs, Wikipedia), so good for searching the internet
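An InChI string is a sequence of slash-separated layers, none of which holds coordinates. As a minimal sketch, the function below (the name `split_inchi` is an assumption for illustration) splits a string into its version, formula and prefixed layers, using the standard InChI for ethanol as input. It does no validation beyond checking the `InChI=` prefix.

```python
def split_inchi(inchi):
    """Split an InChI string into a dict of its layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *layers = inchi[len(prefix):].split("/")
    result = {"version": version, "formula": layers[0] if layers else ""}
    for layer in layers[1:]:
        # Each later layer starts with a one-letter prefix:
        # c = connectivity, h = hydrogens, t/m/s = stereo, and so on.
        result[layer[0]] = layer[1:]
    return result

ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,1-2H3")
# A version ending in "S" marks a standard InChI.
is_standard = ethanol["version"].endswith("S")
```

The connectivity and hydrogen layers are enough to rebuild the molecular graph, which is why a layout algorithm is still needed to draw it.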
43. InChI
No support for polymers, organometallics
Many option settings can lead to variability and make integration across databases difficult. FixedH option especially problematic
“Slight” chance of collisions of InChIKeys
VERY USEFUL FOR INTEGRATING THE WEB
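The InChIKey is a fixed-length hash of the InChI string, which is what makes it convenient for web searching and also why distinct structures can in principle share a key. The sketch below splits a key into its blocks; the pattern and the name `parse_inchikey` are assumptions for illustration, and the example key is the standard InChIKey for ethanol.

```python
import re

# 14-letter connectivity hash, 8-letter hash of the remaining layers,
# a standard/non-standard flag (S or N), a version letter, a protonation letter.
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

def parse_inchikey(key):
    """Split an InChIKey into its component blocks."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, other_layers, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,          # hash of the connectivity
        "other_layers": other_layers,  # hash of stereo, isotopes, etc.
        "standard": flag == "S",       # False when non-standard options were used
        "version": version,
        "protonation": protonation,
    }

ethanol_key = parse_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
```

Checking the `standard` flag before merging records is one simple guard against the option-setting variability noted above, since keys generated with non-standard options such as FixedH are not comparable with standard ones.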
48. Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
53. How do we build it?
25 million chemicals from 400 data sources
We deal in Molfiles or SDF files, including coordinates
We do rudimentary filtering (valence checking, charge imbalance) prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
We link out to external sites where possible using their IDs
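As a rough illustration of what rudimentary filtering might look like, the sketch below flags atoms whose bond orders exceed a simple maximum valence and records whose formal charges do not sum to zero. This is not ChemSpider's actual logic: the valence table, the record layout and the name `check_record` are all assumptions, and real checks must handle charged atoms, hypervalent elements and salts more carefully.

```python
# Simplistic neutral-atom valence limits (assumption for illustration only).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}

def check_record(atoms, bonds):
    """atoms: list of (symbol, formal_charge); bonds: list of (i, j, order),
    with 0-based atom indices. Returns a list of problem descriptions."""
    problems = []
    bond_order_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_order_sum[i] += order
        bond_order_sum[j] += order
    for idx, (symbol, _charge) in enumerate(atoms):
        limit = MAX_VALENCE.get(symbol)  # elements not in the table are skipped
        if limit is not None and bond_order_sum[idx] > limit:
            problems.append(
                f"atom {idx} ({symbol}): valence {bond_order_sum[idx]} exceeds {limit}"
            )
    net_charge = sum(charge for _symbol, charge in atoms)
    if net_charge != 0:
        problems.append(f"net charge {net_charge:+d} (charge imbalance)")
    return problems

ethanol_ok = check_record([("C", 0), ("C", 0), ("O", 0)], [(0, 1, 1), (1, 2, 1)])
five_bonds = check_record([("C", 0)] + [("F", 0)] * 5, [(0, k, 1) for k in range(1, 6)])
lone_cation = check_record([("Na", 1)], [])
```

A pentavalent carbon or a cation with no counterion is reported, while the ethanol skeleton passes with an empty list.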
54. Inherited Errors
We have inherited errors from every database… all public compound databases, including ours, have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
56. Be careful searching by Name!
Determining the correct structure by name searching is difficult online!
Good, but not perfect:
Wikipedia
ChEBI/ChEMBL
ChemIDplus
ChemSpider
Be VERY careful with MOST databases
57. Validating structures
Check for “full stereo” and use stereo descriptors especially for checking!
Check for quality of associated data sources
Check against reference literature when available, but it can be wrong
Question EVERYTHING!
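One way to put cross-checking into practice is to compare the InChIKeys that different sources give for the same name. The sketch below groups keys by their first block (the connectivity hash) and flags sources whose second block is the value produced when no further layers are present, or whose second blocks disagree. The source names and the function name `cross_check` are hypothetical, and the keys are illustrative inputs for alanine with L, unspecified and D stereo. Since the second block also covers isotopic layers, treat a flag as a prompt to look at the structure, not as proof of an error.

```python
from collections import defaultdict

# Second-block hash seen when an InChI has no stereo or isotopic layers.
NO_EXTRA_LAYERS = "UHFFFAOY"

def cross_check(keys_by_source):
    """keys_by_source: mapping of source name -> InChIKey. Returns findings."""
    by_skeleton = defaultdict(dict)
    for source, key in keys_by_source.items():
        first, second, _protonation = key.split("-")
        by_skeleton[first][source] = second[:8]
    findings = []
    for skeleton, hashes in sorted(by_skeleton.items()):
        defined = {h for h in hashes.values() if h != NO_EXTRA_LAYERS}
        missing = sorted(s for s, h in hashes.items() if h == NO_EXTRA_LAYERS)
        if defined and missing:
            findings.append(f"{skeleton}: no stereo layer from {', '.join(missing)}")
        if len(defined) > 1:
            findings.append(f"{skeleton}: sources disagree on stereo")
    if len(by_skeleton) > 1:
        findings.append("sources disagree on the connectivity skeleton")
    return findings

records = {
    "source_a": "QNAYBMKLOCPYGJ-REOHJMSCSA-N",  # hypothetical source, L form
    "source_b": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # hypothetical source, no stereo
    "source_c": "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # hypothetical source, D form
}
findings = cross_check(records)
```

Here all three sources agree on connectivity, one lacks stereo and the other two contradict each other, which is the kind of case that calls for checking against the reference literature.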
58. Online Curation
Online databases generally do NOT allow curation or annotation
If you find errors they stay there!
ChemSpider is unique…immediate curation
ChemSpider live demo following this lecture
Searching
Deposition and Curation
ChemSpider SyntheticPages