Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider

Antony Williams
University of Illinois at Chicago, January 27th 2012

The World of Online Chemistry
 Property databases
 Compound aggregators
 Screening assay results
 Scientific publications
 Encyclopedic articles (Wikipedia)
 Metabolic pathway databases
 ADME/Tox data – eTOX for example
 Blogs/Wikis and Open Notebook Science
 Contributing Open Source code to projects

e-Science and Primary Data
 How much data generated in a lab, that COULD
go public, is lost forever?

 Public Domain reference databases of value?
 Syntheses
 Properties
 Spectra
 CIFs
 Images

Collaborative Knowledge Management

 Much of chemistry is chemical structure-based –
where and how could we host these data?

Available Information…
 Linked to vendors, safety data, toxicity, metabolism

Crowdsourced “Annotations”
 Users can add
 Descriptions/Syntheses/Commentaries
 Links to PubMed articles
 Links to articles via DOIs
 Add spectral data
 Add Crystallographic Information Files
 Add photos
 Add MP3 files
 Add Videos

Chemistry Data online is messy
 We have inherited errors
 All public compound databases, including ours,
have errors
 “Incorrect” structures – assertions, timelines, etc.
 “Incorrect” names associated with structures
 Properties
 Links
 Publications
 ENORMOUS CHALLENGE

MeSH
 A lipid cofactor that is required for normal blood
clotting. Several forms of vitamin K have been
identified: VITAMIN K 1 (phytomenadione)
derived from plants, VITAMIN K 2 (menaquinone)
from bacteria, and synthetic naphthoquinone
provitamins, VITAMIN K 3 (menadione). Vitamin K 3
provitamins, after being alkylated in vivo, exhibit the
antifibrinolytic activity of vitamin K. Green leafy
vegetables, liver, cheese, butter, and egg yolk are
good sources of vitamin K

What is the Structure of Vitamin K1?

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
 Variants of systematic names on PubChem

 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E)-3,7,11,15-tetramethyl
 2-methyl-3-(3,7,11,15-tetramethyl
 2-methyl-3-[(E)-3,7,11,15-tetramethyl
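The name variants listed above differ only in which stereo descriptors are stated. As a minimal sketch, assuming the side chain has exactly three stereo elements (the 2-ene double bond and the centres at C7 and C11) and that each may be given either way or left out, the following counts how many descriptor prefixes one skeleton can carry. The function name is illustrative only.

```python
from itertools import product

# Each stereo element may be specified one way, the other way, or left out.
# Assumption: the 2-ene double bond and the C7 and C11 centres are the only
# stereo elements in the side chain of the name quoted above.
DOUBLE_BOND = ("E", "Z", None)
CENTRE_7 = ("7R", "7S", None)
CENTRE_11 = ("11R", "11S", None)


def stereo_prefixes():
    """Return every descriptor prefix, from fully specified to empty."""
    prefixes = []
    for combo in product(DOUBLE_BOND, CENTRE_7, CENTRE_11):
        stated = [d for d in combo if d is not None]
        prefixes.append("(" + ",".join(stated) + ")" if stated else "")
    return prefixes


prefixes = stereo_prefixes()
print(len(prefixes))  # 3 * 3 * 3 = 27 possible labelings
```

Under these assumptions a single skeleton can appear under 27 differently labelled systematic names, which is why a text match on the name alone cannot settle which structure is meant.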

Question Everything online: www.dhmo.org

Chemistry on The Internet Is Messy

PHYSPROP Database
The freely downloadable
database under the EPI
Suite prediction software

Very Basic filters suggest
data quality issues
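The slide does not say which filters were applied. As one example of what a very basic filter can look like, the sketch below validates the check digit of a CAS Registry Number, using the standard rule (digits before the check digit are weighted 1, 2, 3, ... from the right and summed modulo 10). The function name is hypothetical and this is not claimed to be the filter used on PHYSPROP.

```python
import re


def cas_checksum_ok(cas_rn: str) -> bool:
    """Return True if a CAS Registry Number has a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn.strip())
    if match is None:
        return False  # wrong layout, e.g. missing hyphens
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(cas_checksum_ok("50-78-2"))  # aspirin
print(cas_checksum_ok("50-78-3"))  # same number with a corrupted check digit
```

A filter like this catches transcription errors in identifiers only; it says nothing about whether the structure attached to the number is correct.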

The Stereochemistry challenge.
12500 chemicals with “missed” stereo

NPC Browser http://tripod.nih.gov/npc/

Openness and Quality Issues
Williams and Ekins, Drug Discovery Today (DDT), 16: 747-750 (2011)

Science Translational Medicine 2011

Public Domain Databases
 Our databases are a mess…
 Non-curated databases are proliferating errors

 We source and deposit data between databases

 Original sources of errors hard to determine

 Curation is time-consuming and challenging

Crowdsourced Curation
 Crowd-sourced curation: identify/tag errors, edit
names, synonyms, identify records to deprecate

Standards: Structure Standardization

What needs to happen?
 Standards
 Standardization of structures
 ChEBI/PubChem sharing
 InChI adoption

InChI Strings Hash to InChIKeys
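To illustrate the idea of hashing a layered InChI string into a fixed-length, two-block key, here is a toy sketch. It is NOT the real InChIKey algorithm and will not reproduce real InChIKeys; the layer split, the letter encoding and the function names are all assumptions made for illustration. The point it shows is that stereoisomers share a first block while differing in the second.

```python
import hashlib

# Layer prefixes treated here as "skeleton" layers (connectivity, hydrogens).
# Assumption for illustration: every other layer (charge, stereo, isotopes)
# goes into the second block.
SKELETON_PREFIXES = ("c", "h")


def _letters(text: str, length: int) -> str:
    """Map a SHA-256 digest of text onto `length` uppercase letters."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(chr(ord("A") + byte % 26) for byte in digest[:length])


def toy_key(inchi: str) -> str:
    """Build a two-block, fixed-length key from an InChI string.

    Toy scheme only: it does not follow the real InChIKey specification.
    """
    _version, formula, *layers = inchi.split("/")
    skeleton = [formula] + [l for l in layers if l[:1] in SKELETON_PREFIXES]
    detail = [l for l in layers if l[:1] not in SKELETON_PREFIXES]
    return _letters("/".join(skeleton), 14) + "-" + _letters("/".join(detail), 8)


# L- and D-alanine differ only in their stereo layers.
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(toy_key(l_ala))
print(toy_key(d_ala))
```

Because the key has a fixed length and contains only letters, it can be indexed and found by ordinary web search engines in a way that the full InChI string cannot.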

Vancomycin – Search the Internet

Vancomycin

Search Molecular Skeleton
Search Full Molecule

Full Skeleton Search: 104 Hits
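A skeleton search of this kind can be sketched as grouping records that share the first block of their InChIKey, which encodes connectivity, while the remaining blocks carry stereochemistry and other detail. The keys below are placeholders invented for illustration, not real compounds, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_skeleton(inchikeys):
    """Group InChIKeys by their first block (the text before the first hyphen)."""
    groups = defaultdict(list)
    for key in inchikeys:
        skeleton_block = key.split("-", 1)[0]
        groups[skeleton_block].append(key)
    return dict(groups)


# Placeholder keys, invented for illustration; they are not real compounds.
records = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "AAAAAAAAAAAAAA-DDDDDDDDSA-N",
    "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N",
]
groups = group_by_skeleton(records)
print(len(groups["AAAAAAAAAAAAAA"]))  # 3 records share this skeleton
```

A full-molecule search would compare the whole key instead, so it returns only the exact stereoisomer.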

Crowdsourcing Works
 >130 people have deposited data and
participated in data curation

 Curators at different levels check each other

 More curators and depositors are
encouraged!

What needs to happen?
 Standards
 Standardization of structures
 ChEBI/PubChem sharing
 InChI adoption

 Collaboration
 Stop reinventing the wheel
 Share data, share efforts and speed the process

Antony Williams vs Identifiers

 Passport ID
 Dad, Tony, others
 5 email addresses
 License
 SSN
 ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
 OpenID
 ….
 Green Card

Aspirin names and synonyms

• Text searches depend on
correct association

• 335 suggested identifiers for
Aspirin just on PubChem!

• Disambiguation dictionaries
are necessary, not just for
authors!

All Those Names, One Structure

Success Depends on Dictionaries

Validated Name-Structure Dictionaries

 Chemical name dictionaries are used for:
 Text-mining (publications, patents)
 Used to index PubMed and link to Google Patents

 Linking to other databases – think Biology!
 When structures are not available drug names link

 Searching the web
 Names link to structures link to InChIs
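The chain "names link to structures link to InChIs" can be sketched as a small lookup table. This is a minimal illustration, assuming a dictionary keyed on normalized names; the helper names are hypothetical, and the three synonyms are a tiny sample of the hundreds of identifiers associated with aspirin.

```python
# Standard InChIKey for aspirin (acetylsalicylic acid).
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"


def normalize(name: str) -> str:
    """Lowercase a name and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


# A validated dictionary maps every accepted synonym to one structure key.
NAME_TO_INCHIKEY = {
    normalize(synonym): ASPIRIN_KEY
    for synonym in ("Aspirin", "Acetylsalicylic acid", "2-Acetoxybenzoic acid")
}


def lookup(name: str):
    """Return the InChIKey for a name, or None if the name is not validated."""
    return NAME_TO_INCHIKEY.get(normalize(name))


print(lookup("  ASPIRIN "))
print(lookup("Vincristine"))  # None: not in this toy dictionary
```

The lookup is only as good as the curation behind it: one wrong name-to-structure entry silently misdirects every text-mined link that relies on it.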

I want to know about “Vincristine”

If all algorithms work then
everything on the page is
correct by default except the
name-structure relationship!

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources
Linked by Structure

Vincristine: Patents
Linked by Name

Vincristine: Articles
Linked by Name

Challenges of Complex Molecules
Yohimbine

Originally 15 compounds “called” Yohimbine
54 Skeletons for Yohimbine

Pharma Information Tombs
 Internal and external content
 Built to meet primary use-case
 Tailored indexes and GUIs
 Internal unique language & metadata
 Poor interoperability/integration
 PowerPoint, Documents, Excel
 Many suppliers of systems and content in
a single workflow

In vivo, Pipeline, Literature, Patents, News, SAR, CSRs, Safety, etc.

What could create change?
 Harvard Business Review (2010)

“One change would make a substantial
difference [to drug R&D]: the creation of
agreed-upon standards for digitally
representing drug assets.”

It is so difficult to navigate…
 What’s the structure?
 Are they in our file?
 What’s similar?
 What’s the target?
 Pharmacology data?
 Known Pathways?
 Competitors?
 Working On Now?
 Connections to disease?
 Expressed in right cell type?
 IP?

Open PHACTS Project
 Develop a set of robust standards…
 Implement the standards in a semantic integration hub
 Deliver services to support drug discovery programs in
pharma and public domain
 22 partners, 8 pharmaceutical companies, 3 biotechs
 36-month project

Guiding principle is open access, open usage, open source
- Key to standards adoption -

ChemSpider Resources for Chemistry

The Future
Internet Data

 Small organic molecules
 Undefined materials
 Organometallics
 Nanomaterials
 Polymers
 Minerals
 Particle bound
 Links to Biologicals

 Commercial Software
 Pre-competitive Data
 Open Science
 Open Data
 Publishers
 Educators
 Open Databases
 Chemical Vendors

The Future of Chemistry on the Web?
 Public compound databases federate & build
a linked environment of validated data!
 Data validation needs are not ignored
 Publishers layer on information to make
publications discoverable
 Public-Private databases can be linked
 Open Data proliferate
 The “Semantic Web” in action

Acknowledgments
 The ChemSpider team

 Our data providers, depositors, collaborators and
curators

 Software providers – OpenEye, ChemDoodle,
ACD/Labs, GGA Software, Open Source (Jmol,
JSpecView, OpenBabel)

 Sean Ekins @collabchem

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
