RSC|ChemSpider is one of the world’s largest online resources for chemistry-related data and services. Developed to deliver access to structure-based chemistry data via the internet, the ChemSpider platform hosts over 26 million unique chemical compounds aggregated from over 400 data sources. It provides an environment in which the community can annotate and curate these existing data and deposit new data. The search system delivers flexible querying capabilities together with links to external sites for publication and patent data. This presentation will review the present capabilities of the ChemSpider system, with direct examples of how to use it to source high-quality data of value to chemists. We will discuss some of the challenges associated with validating data quality and examine how ChemSpider is a part of the new “semantic web for chemistry”. ChemSpider has also spawned a number of additional projects, including ChemSpider SyntheticPages for hosting openly peer-reviewed chemical synthesis articles, the Learn Chemistry Wiki for students learning chemistry, and SpectraSchool for learning spectroscopy.
Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider
1. Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider
Antony Williams
University of Illinois at Chicago, January 27th 2012
2. The World of Online Chemistry
Property databases
Compound aggregators
Screening assay results
Scientific publications
Encyclopedic articles (Wikipedia)
Metabolic pathway databases
ADME/Tox data – eTOX for example
Blogs/Wikis and Open Notebook Science
Contributing Open Source code to projects
6. e-Science and Primary Data
How much data generated in a lab, that COULD go public, is lost forever?
Public Domain reference databases of value?
Syntheses
Properties
Spectra
CIFs
Images
Much of chemistry is chemical structure-based – where and how could we host these data?
19. Chemistry Data online is messy
We have inherited errors
All public compound databases, including ours, have errors
“Incorrect” structures – assertions, timelines, etc.
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
21. MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
50. Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
51. Public Domain Databases
Our databases are a mess…
Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming and challenging
69. Crowdsourcing Works
>130 people have deposited data and participated in data curation
Curators at different levels check each other's work
More curators and depositors are encouraged!
70. What needs to happen?
Standards
Standardization of structures
ChEBI/PubChem sharing
InChI adoption
Collaboration
Stop reinventing the wheel
Share data, share efforts and speed the process
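As a rough illustration of why a shared identifier such as the InChIKey helps databases share data, the sketch below links records from two databases by key. The database names, local record ids, and function names are hypothetical. The regular expression checks only the published layout of a Standard InChIKey (14 letters, 10 letters, 1 letter, separated by hyphens); it does not verify that a key was actually generated from the structure it is attached to.

```python
import re
from collections import defaultdict

# Layout of an InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_wellformed_inchikey(key):
    """Format check only; says nothing about whether the key is correct."""
    return INCHIKEY_RE.fullmatch(key) is not None


def link_by_inchikey(*databases):
    """Map each well-formed InChIKey to {database name: local record id}."""
    linked = defaultdict(dict)
    for db_name, records in databases:
        for local_id, key in records.items():
            if is_wellformed_inchikey(key):
                linked[key][db_name] = local_id
    return dict(linked)


# Hypothetical databases with their own local identifiers.
db_one = ("db_one", {
    "rec-001": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
    "rec-002": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",  # caffeine
})
db_two = ("db_two", {
    "X17": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",      # aspirin
    "X18": "not-a-key",                        # malformed, skipped
})

links = link_by_inchikey(db_one, db_two)
# Keep only the structures that appear in more than one database.
shared = {key: ids for key, ids in links.items() if len(ids) > 1}
print(shared)
```

Without a common identifier the same join would have to rely on names or on each database's own structure representation, which is where many of the inherited errors come from.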
71. Antony Williams vs Identifiers
Passport ID
Dad, Tony, others
5 email addresses
License
ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
SSN
OpenID
….
Green Card
72. Aspirin names and synonyms
• Text searches depend on correct association
• 335 suggested identifiers for Aspirin just on PubChem!
• Disambiguation dictionaries are necessary, not just for authors!
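The dependence of text search on correct association can be sketched as a simple check. The example assumes hypothetical (source, name, InChIKey) records pooled from several databases and flags any name that points at more than one structure. The record layout, source names, and function names are illustrative assumptions, not the data model of ChemSpider or PubChem, and one record is deliberately wrong to show what gets flagged.

```python
from collections import defaultdict

ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
PARACETAMOL = "RZVAJINKPMORJF-UHFFFAOYSA-N"

# Hypothetical pooled records: (source, name, InChIKey).
RECORDS = [
    ("source_a", "Aspirin", ASPIRIN),
    ("source_a", "acetylsalicylic acid", ASPIRIN),
    ("source_b", "ASPIRIN ", ASPIRIN),
    ("source_b", "Paracetamol", PARACETAMOL),
    ("source_c", "aspirin", PARACETAMOL),  # deliberately wrong association
]


def normalize(name):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())


def ambiguous_names(records):
    """Return {normalized name: sorted InChIKeys} for names with >1 structure."""
    structures = defaultdict(set)
    for _source, name, key in records:
        structures[normalize(name)].add(key)
    return {
        name: sorted(keys)
        for name, keys in structures.items()
        if len(keys) > 1
    }


conflicts = ambiguous_names(RECORDS)
for name, keys in conflicts.items():
    print(f"{name!r} maps to {len(keys)} structures: {keys}")
```

A check like this only finds the conflict; deciding which association is correct still needs a curator.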
80. Validated Name-Structure Dictionaries
Chemical name dictionaries are used for:
Text-mining (publications, patents)
Used to index PubMed and link to Google Patents
Linking to other databases – think Biology!
When structures are not available, drug names provide the link
Searching the web
Names link to structures link to InChIs
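The "names link to structures link to InChIs" chain can be sketched as a dictionary lookup applied to free text. The dictionary below is a hypothetical stand-in for a validated name-structure dictionary, and the function name is an assumption; real text-mining pipelines handle far more name variation than this case-insensitive exact match.

```python
import re

# Hypothetical validated dictionary: lowercase name -> Standard InChIKey.
NAME_TO_INCHIKEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "caffeine": "RYYVLZVUVIJVGH-UHFFFAOYSA-N",
    "paracetamol": "RZVAJINKPMORJF-UHFFFAOYSA-N",
    "acetaminophen": "RZVAJINKPMORJF-UHFFFAOYSA-N",
}

# Longest names first, so a multi-word name wins over a shorter one inside it.
_names = sorted(NAME_TO_INCHIKEY, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in _names) + r")\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return (matched text, InChIKey) pairs for dictionary names in text."""
    return [
        (m.group(0), NAME_TO_INCHIKEY[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


hits = annotate(
    "Acetaminophen (paracetamol) and aspirin were compared with caffeine."
)
for name, key in hits:
    print(f"{name} -> {key}")
```

Two different names in the sentence resolve to the same key, which is what lets a structure-based index link documents that never use the same synonym.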
81. I want to know about “Vincristine”
If all algorithms work then everything on the page is correct by default except the name-structure relationship!
88. Pharma Information Tombs
Internal and external content
Built to meet primary use-case
Tailored indexes and GUIs
Internal unique language & metadata
Poor interoperability/integration
PowerPoint, documents, Excel
Many suppliers of systems and content in a single workflow
Content types: in vivo, pipeline, literature, patents, news, SAR, CSRs, safety, etc.
89. What could create change?
Harvard Business Review (2010)
“One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.”
90. It is so difficult to navigate…
IP?
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Competitors?
Working On Now?
Connections to disease?
Expressed in right cell type?
91. Open PHACTS Project
Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36-month project
Guiding principle is open access, open usage, open source
- Key to standards adoption -
94. The Future
Internet Data
Small organic molecules    Commercial Software
Undefined materials        Pre-competitive Data
Organometallics            Open Science
Nanomaterials              Open Data
Polymers                   Publishers
Minerals                   Educators
Particle bound             Open Databases
Links to Biologicals       Chemical Vendors
95. The Future of Chemistry on the Web?
Public compound databases federate & build a linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
The “Semantic Web” in action
96. Acknowledgments
The ChemSpider team
Our data providers, depositors, collaborators and curators
Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
Sean Ekins @collabchem