eScience Resources for the Chemistry Community from the Royal Society of Chemistry

Antony Williams
NCSU, College of Textiles
October 2nd, 2013

The World of Online Chemistry
• Property databases
• Compound aggregators
• Screening assay results
• Scientific publications
• Encyclopedic articles (Wikipedia)
• Metabolic pathway databases
• ADME/Tox data – eTOX for example
• Blogs/Wikis and Open Notebook Science

e-Science and Primary Data
• How much data generated in a lab, that COULD go
public, is lost forever?

• Public Domain reference databases of value?
– Syntheses
– Properties
– Spectra
– CIFs
– Images
• Much of chemistry is chemical structure-based – where
and how could we host these data?

ChemSpider
• >29 million unique chemicals from >500
data sources
• Focus on improving data quality,
enhancing functionality, integrating and
enabling

Crowdsourced “Annotations”
• Users can add:
– Descriptions/Syntheses/Commentaries
– Links to PubMed articles
– Links to articles via DOIs
– Spectral data
– Crystallographic Information Files
– Photos
– MP3 files
– Videos

Chemistry Data online are messy
• We have inherited errors
• All public compound databases have errors
• “Incorrect” structures – assertions, timelines etc
• “Incorrect” names associated with structures
• Properties
• Links
• Publications
• ENORMOUS CHALLENGE

Crowdsourced Curation
• Crowd-sourced curation: identify/tag
errors, edit names, synonyms, identify
records to deprecate

Validated Name-Structure Dictionaries
• Chemical name dictionaries are used for:
• Text-mining (publications, patents)
– Used to index PubMed and link to Google Patents
• Linking to other databases – think Biology!
– When structures are not available, drug names provide the link
• Searching the web
– Names link to structures link to InChIs

I want to know about “Vincristine”

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources
Linked by Structure

Vincristine: Patents
Linked by Name

Vincristine: Articles
Linked by Name

InChIStrings Hash to InChIKeys
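The idea of hashing a variable-length InChI string to a fixed-length key can be illustrated with a toy function. The official InChIKey also starts from a SHA-256 digest, but it uses its own encoding and splits the InChI into layers, so the hypothetical `toy_structure_key` below does not reproduce real InChIKeys; it only shows why a hash makes a convenient, web-searchable identifier.

```python
import hashlib


def toy_structure_key(inchi, length=14):
    """Hash an InChI string to a fixed-length uppercase key.

    Illustration only: this is NOT the official InChIKey algorithm.
    """
    digest = hashlib.sha256(inchi.encode("ascii")).digest()
    # Map each digest byte onto A-Z as a simple stand-in for the real encoding.
    letters = [chr(ord("A") + byte % 26) for byte in digest]
    return "".join(letters[:length])


if __name__ == "__main__":
    for inchi in ("InChI=1S/H2O/h1H2", "InChI=1S/CH4/h1H4"):
        print(toy_structure_key(inchi), inchi)
```

The same input always yields the same key, and the key has a fixed length regardless of how long the InChI string is.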

Vancomycin – Search the Internet

Vancomycin
Search Molecular
SKELETON
Search Full Molecule

Full Skeleton Search: 104 Hits
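One way to picture the two searches, assuming they are driven by InChIKeys: the first 14-letter block of a standard InChIKey encodes the connectivity (the skeleton), while the full key also covers stereochemistry and other layers. The sketch below uses hypothetical function names; the ethanol key is real, and the two keys in the grouping example are synthetic placeholders with the right shape, not real compounds.

```python
import re

# Standard InChIKey shape: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def search_terms(inchikey):
    """Return the two query strings: the full key and the skeleton block."""
    if not INCHIKEY_RE.match(inchikey):
        raise ValueError("not a standard-shaped InChIKey: %r" % inchikey)
    return {"full_molecule": inchikey, "skeleton": inchikey.split("-")[0]}


def group_by_skeleton(inchikeys):
    """Group keys that share the same first block (same connectivity)."""
    groups = {}
    for key in inchikeys:
        groups.setdefault(search_terms(key)["skeleton"], []).append(key)
    return groups


if __name__ == "__main__":
    print(search_terms("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))  # ethanol
    # Synthetic placeholder keys: same skeleton block, different second block.
    print(group_by_skeleton([
        "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
        "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    ]))
```

Searching on the skeleton block returns every record sharing that connectivity, which is why it yields more hits than the full-molecule search.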

ChemSpider Resources for Chemistry

Some usage statistics
• ca. 200 visitors at any one time, ~30,000 visits
per day
• Mar 4-Apr 3, 2013
– Visits = 731,656
– Unique Visitors = 527,008
• Independent servers to support other projects

Publications - a summary of work
• Scientific publications are a summary of
work
– Is all work reported?
– How much science is lost to pruning?
– What of value sits in notebooks and is lost?
• How much data is lost?
– How many compounds never reported?
– How many syntheses fail or succeed?
– How many characterization measurements?

About Me…as a Chemist
• I’ve performed a few dozen chemical syntheses
• I’ve run thousands of analytical spectra
• I’ve generated thousands of NMR assignments
• I’ve probably published <5% of all work
• Most of it has been lost
• But things can be different today….
• But it still needs to be associated with me…

Rewards and Recognition
Congratulations! Your 1st CSSP article has been published. Philosopher Lao Tzu said “A journey of a thousand miles begins with a single step”. In the same way we hope that this will be the first of many submissions that you make to CSSP.

The First Step badge is awarded when a user submits (& has published) their 1st CSSP article.

Integrate to instruments and software
• Integration to analytical instrumentation
vendors already in place
– Agilent, Bruker, Thermo, Waters
• Also, Cheminformatics vendors link to
ChemSpider
– Accelrys, ACD/Labs, ChemAxon, iChemLabs, and…

PharmaSea
• Dereplication via ChemSpider
• Segregation of natural products datasets
• Analytical data algorithms & integration
– Mass spec searching – predicted fragmentation
– NMR feature searching – NMR prediction
– Computer-assisted structure elucidation

It is so difficult to navigate…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?

Open PHACTS
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using semantic
web technologies
• Open source code, open data and open standards
• Academics, Pharma companies, Publishers….

ChemSpider Contributions
• The host of the chemistry services
– Supplier of “standardized” chemical data files
– Chemistry searching (structure, substructure, etc.)
– Curation and data quality checking
• We built the Open PHACTS chemical
registration system

Chemical Database Service
• National Chemical Database Service
for UK Academics
• Integrating Commercial Databases
and Services
• Chemicals, analytical data,
prediction algorithms
• Development of data repository

Community Repository for Data
• Funding agencies encourage sharing of data
• Increasing availability of “Open Data”
• Institutional repositories offer no domain-specific support
• Develop a community repository for
chemistry data – private, public, embargoed
• Provides data to develop models and
algorithms

Community Repository for Data
• Automated depositions of data
• DOI’ed data objects for citation purposes
• A database of reference data, but validated by
the community
• National services feeding the repository –
crystallography, mass spectrometry
• Integrate to blogging tools for chemistry
• Integrate to Electronic Lab Notebooks as
feeds
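A deposited data object with the private, public and embargoed states and a DOI could be modelled along the following lines. This is a sketch under stated assumptions: every class and field name is hypothetical and nothing here reflects the actual repository design.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class Deposition:
    """Hypothetical record for one deposited data object."""
    title: str
    depositor: str          # keeps the data associated with its author
    data_type: str          # e.g. "spectrum", "CIF", "synthesis"
    visibility: Visibility
    embargo_until: Optional[date] = None
    doi: Optional[str] = None  # assigned once the object is citable

    def is_visible_to_public(self, today):
        """Public data is always visible; embargoed data once the date passes."""
        if self.visibility is Visibility.PUBLIC:
            return True
        if self.visibility is Visibility.EMBARGOED and self.embargo_until is not None:
            return today >= self.embargo_until
        return False


if __name__ == "__main__":
    record = Deposition(
        title="1H NMR spectrum",
        depositor="A. Chemist",
        data_type="spectrum",
        visibility=Visibility.EMBARGOED,
        embargo_until=date(2014, 1, 1),
    )
    print(record.is_visible_to_public(date(2013, 10, 2)))
```

An automated feed from an instrument or an Electronic Lab Notebook would create such records without manual entry, leaving the depositor to choose only the visibility.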

Model Building with Community Data
• Community data as a basis of model building
– Consume data from available databases,
community data, new publications and build
predictive algorithms for the community
– How many algorithms are reported and lost?
How much repeat work is done in the domain of
algorithmic development?

Inside our Publication Archive
• How much data is in the archive, in the
publications and in the supplementary
info?
– How many compounds for ChemSpider?
– How many syntheses for ChemSpider
reactions?
– How many characterization measurements?
• Property Data
• Spectral Data
• Graphs and charts to be used for modeling?

What if we could capture it all?
Digitally Enhancing the RSC Archive

Start with data in publications

ChemSpider Reactions
• Starting with data from CSSP, MOS and CCR
• Will cover reactions extracted from:
• Patents
• RSC journal articles and ESI

E-Lab Notebooks
• Integration between ELNs
and:
• ChemSpider
• ChemSpider Reactions
• Chemistry Data Repository

The Future
• Internet Data
• Commercial Software
• Pre-competitive Data
• Open Science
• Open Data
• Publishers
• Educators
• Open Databases
• Chemical Vendors
• Small organic molecules
• Undefined materials
• Organometallics
• Nanomaterials
• Polymers
• Minerals
• Particle bound
• Links to Biologicals

The Future of Chemistry on the Web?
• Public compound databases federate & build
a linked environment of validated data!
• Data validation needs are not ignored
• Publishers layer on information to make
publications discoverable
• Open Data proliferate
• The “Semantic Web” will continue to
develop…

Thank you
Email: williamsa@rsc.org
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
