The Royal Society of Chemistry (RSC) hosts large-scale data collections and provides the chemistry community with access to them. The largest RSC data set of broad interest to the community offers access to tens of millions of compounds. The host platform, ChemSpider, is limited because it is only a structure-centric hub. A new architecture, the RSC data repository, has been developed that extends support to reactions, spectral data, crystallography data and related property data. It is also the architecture underlying a series of exemplar projects for managing data for a number of diverse laboratories. The adoption of data standards for the integration and distribution of data has been essential. Specific standards include molecular structure formats such as molfiles and InChIs, and spectral data formats such as JCAMP. This presentation will report on our development of the data repository, the importance of utilizing standards for data integration, the flexibility of the architecture in delivering solutions for various laboratories, and our efforts to develop new large data collections. These efforts include text mining to extract large spectrum-structure collections from large corpora.
Importance of data standards for large scale data integration in chemistry
1. Importance of data standards for
large scale data integration in
chemistry
Antony Williams, Valery Tkachenko, Alexey
Pshenichnov, Ken Karapetyan, Stuart Chalk,
Daniel Lowe and Carlos Cobas
ACS Denver, March 2015
2. Free and Easy
• To make it easy to “take notes” these slides
will be available at:
www.slideshare.net/AntonyWilliams/
6. Antony John Williams (et al)
• “We don’t need more
standards!”
• “Of COURSE we can build
a spectral database!”
• “The standards we have
are good enough”
7. A Pragmatic View to Progress
• Let’s consider progressing an NMR Spectral
database for the community!
• MUST HAVES– spectra (1D/2D), associated
structures, assignments
• WANTS – predict NMR spectra, spectral
searching, privacy/embargos
• What would we need in terms of standards?
• Molfiles and JCAMP
13. Standards without adoption
are limited in value
• If the instrument vendors don’t support or
adopt the standards success is limited
• YESTERDAY discussion about publishing
NMR – JCAMP
• But what is already available will work – Jeol,
Bruker, Thermo, Anasazi, Agilent/Varian -
imperfect but useful
18. JCAMP file downloads
• When NMR spectra are stored as JCAMP
then downloads into offline packages are
feasible – MestreLabs, ACD/Labs etc
• Open Data – download versus view
• Store spectra locally and reuse
• Java is increasingly a pain!
• Need to move to HTML5 viewing on
ChemSpider, especially for Mobile Viewing
19. Challenges with Spectra
• JCAMP is good for a lot of spectral data – IR,
Raman, 1D NMR
• MS data is rarely made available in JCAMP
• We would love a ratified JCAMP 6.0 for 2D
data exchange – allows third parties to build
support for download
• ASSIGNED JCAMP spectra supported
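To illustrate why JCAMP is practical for third parties to support, here is a minimal sketch of reading a JCAMP-DX file in Python. It relies only on the basic layout of the format: labeled data records of the form ##LABEL=value, "$$" comments, and a plain-number peak table. The sample content and the function name parse_jcamp are made up for this example, and the sketch does not handle the compressed (ASDF) data forms that real instrument exports often use.

```python
import re

# Made-up sample file, for illustration only (not data from the talk).
SAMPLE = """##TITLE=Example 13C peak list (made-up data)
##JCAMP-DX=4.24
##DATA TYPE=NMR PEAK TABLE
##XUNITS=PPM
##YUNITS=ARBITRARY UNITS
##NPOINTS=3
##PEAK TABLE=(XY..XY)
14.12, 1.0
66.12, 0.8   $$ comments start with two dollar signs
130.53, 0.6
##END=
"""

NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_jcamp(text):
    """Return (headers, points) for a simple, uncompressed JCAMP-DX text."""
    headers = {}
    points = []
    in_table = False
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("##"):
            # A labeled data record: ##LABEL=value
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            headers[label] = value.strip()
            in_table = label in ("PEAK TABLE", "XYPOINTS")
            continue
        if in_table:
            # Pair up consecutive numbers on the line as (x, y).
            nums = NUMBER.findall(line)
            for i in range(0, len(nums) - 1, 2):
                points.append((float(nums[i]), float(nums[i + 1])))
    return headers, points


headers, points = parse_jcamp(SAMPLE)
print(headers["TITLE"], headers["XUNITS"], points)
```

Because the header is plain labeled text, metadata such as units and point counts can be checked before any spectral data is interpreted.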
31. Developing Proof-of-Concept
• Extract from 1976-2014 USPTO applications
Nucleus    Count
H          975543
C          56536
unknown*   44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8
*unknown – text starts off with “NMR:” followed by a peak list (no nucleus stated)
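The per-nucleus counts on this slide can be turned into shares with a few lines of Python. This is only arithmetic on the numbers shown, assuming each figure is a count of extracted NMR text spectra for that nucleus; the variable names are illustrative.

```python
# Counts copied from the slide; "unknown" means no nucleus was stated.
counts = {
    "H": 975543,
    "C": 56536,
    "unknown": 44306,
    "F": 9429,
    "P": 3241,
    "B": 91,
    "Si": 62,
    "Sn": 22,
    "Se": 11,
    "N": 8,
}

total = sum(counts.values())
shares = {nucleus: n / total for nucleus, n in counts.items()}

print(f"total: {total}")
for nucleus, share in shares.items():
    print(f"{nucleus:8s} {share:8.3%}")
```

Proton spectra dominate the extraction by a wide margin, and the "unknown" group is close in size to the carbon group, so resolving the unstated nuclei would noticeably enlarge the usable set.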
32. We want to find text spectra?
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
• What would be better are spectral figures – and
include assignments where possible!
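As a rough sketch of what "finding and indexing" a text spectrum can involve, the Python below pulls the nucleus, solvent, frequency and shift list out of the example string on this slide using regular expressions. It is not the text-mining pipeline used in the project; the patterns and the function name parse_text_spectrum are assumptions made for illustration, and the sample text is copied from the slide as-is.

```python
import re

# Text spectrum copied verbatim from the slide.
TEXT = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, "
    "benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), "
    "68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, "
    "129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, "
    "147.61, 149.41, 152.62, 154.88 (ArC)"
)

# Header such as "13C NMR (CDCl3, 100 MHz): δ = ..."
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)\s+NMR\s*"
    r"\((?P<solvent>[^,]+),\s*(?P<mhz>\d+(?:\.\d+)?)\s*MHz\)\s*:"
    r"\s*δ\s*=\s*(?P<body>.+)",
    re.DOTALL,
)
# A shift with an optional parenthesised annotation right after it.
PEAK = re.compile(r"(\d+\.\d+)(?:\s*\(([^)]*)\))?")


def parse_text_spectrum(text):
    """Parse one text spectrum into a dict; raises ValueError if no header."""
    match = HEADER.search(" ".join(text.split()))  # normalise whitespace
    if match is None:
        raise ValueError("no recognisable NMR header")
    peaks = [
        (float(shift), note or None)
        for shift, note in PEAK.findall(match.group("body"))
    ]
    return {
        "nucleus": match.group("nucleus"),
        "solvent": match.group("solvent").strip(),
        "mhz": float(match.group("mhz")),
        "peaks": peaks,
    }


result = parse_text_spectrum(TEXT)
print(result["nucleus"], result["solvent"], result["mhz"], len(result["peaks"]))
```

Note that the sketch attaches an annotation only to the shift immediately before it, so a grouped label such as (ArCH) is not spread over the whole run of shifts it describes. Ambiguities like this are one reason digital spectra with explicit assignments are preferable to text.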
37. Publications & “Real Spectra”
• We are turning text into spectra
• We are turning figures into spectra
38. Early Test Experiments
Input
74 supplementary data documents. 3444 pages
Output
Plot2Txt extracted content from 1069 pages
1151 spectra total - >80% of peaks extracted to
within 1-2 decimal places (ppm)
40. Manual Curation Layer
• ALL SPECTRA WILL BE STORED AS JCAMP
• ChemSpider has had a manual curation layer
for >8 years
• Users can annotate data on ChemSpider
• We do receive useful feedback from the
community on the data and are optimistic!
41. Extraction is the WRONG WAY
• We should NOT have to mine data out – it should already be in digital form!
• Structures should be submitted “correctly”
• Spectra should be digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with meta
data and provenance
42. We can solve for Authors here
Will it be used though??? YES!
48. What should we be doing?
• Settle on a short-term format – JCAMP-JMOL?
• Convince the instrument vendors to export in
this format
• Push button depositions into “containers” –
ChemSpider, NMRShiftDB, Institutional
Repositories
• Encourage format support in software (read
and write) – Mestre, ACD/Labs, Bruker
TopSpin, etc.
50. Standards in Large Scale
Data Integration
• ALL of these are imperfect standards
• Molfiles
• SDF
• InChI
• JCAMP
• But what can be done with them?
51. Compound Data
• The standards of chemical structure handling
are primarily molfile, SDfile, SMILES, InChI
• We primarily depend on molfiles and SDF
files for data deposition and interchange
• We use InChI a lot – especially for integrated
searching across the web
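To make the molfile/SDF interchange concrete, here is a minimal sketch of reading an SD file in Python without a cheminformatics toolkit. It assumes V2000 molfiles with the usual fixed-width layout (counts on the fourth line, atom symbol in columns 32-34, records ending in "$$$$"). The two sample records and the function names are made up for illustration; a production system would use an established toolkit rather than hand-written parsing.

```python
import re

# Two made-up heavy-atom records (ethanol, methanol), for illustration only.
SAMPLE = "\n".join([
    "ethanol",
    "  example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
    "> <NAME>",
    "ethanol",
    "",
    "$$$$",
    "methanol",
    "  example",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "> <NAME>",
    "methanol",
    "",
    "$$$$",
])


def split_sdf(text):
    """Yield each SDF record as a list of lines (records end with '$$$$')."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def parse_record(lines):
    """Read the name, atoms, bonds and data items of one V2000 record."""
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    bonds = [(int(b[0:3]), int(b[3:6]), int(b[6:9])) for b in bond_lines]
    data, tag = {}, None
    for line in lines[4 + n_atoms + n_bonds:]:
        header = re.match(r">\s*<([^>]+)>", line)
        if header:
            tag = header.group(1)
            data[tag] = []
        elif tag and line.strip():
            data[tag].append(line.strip())
        else:
            tag = None  # blank line (or 'M  END') closes the current item
    return {"name": lines[0].strip(), "atoms": atoms, "bonds": bonds, "data": data}


records = [parse_record(r) for r in split_sdf(SAMPLE)]
print([(r["name"], r["atoms"], r["bonds"]) for r in records])
```

The fixed-width columns are part of why molfiles are easy to exchange and also easy to corrupt: a single shifted character changes what a reader sees.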
54. Compound Data
• There ARE data interchange problems
associated with structures….
55. USE and TEACH Standards
• Too few people are aware of the existing
standards and their capabilities
• Part of the CINF mission activities should be
to teach standards and this is being done
• Still too few people have heard of InChI and
JCAMP for example
• Still little known about the importance of
correct structure representations – kudos to
people like Leah et al who TEACH THIS!
65. Contribute to PUBLIC
Ontologies
• Yes there are “company” ontologies – but for
the good of the community contribute to
public ontologies and standards
• For data interchange and meshing this is
soooooo beneficial!
69. Actions
• Support and encourage new standards
• In the meantime, reawaken and modernize the
JCAMP standard
• Show up and listen to Bob Hanson today
• Encourage scientists to provide data
70. Charles Holland Duell in 1902
“…all previous advances in the
various lines of invention will
appear totally insignificant when
compared with those which the
present century will witness.
I almost wish that I might live my
life over again to see the wonders
which are at the threshold”
72. Acknowledgments
• Daniel Lowe – NextMove, Reactions and Spectra
• Bill Brouwer – Plot2Txt Development
• Carlos Cobas and Stan Sykora – MestreLabs
• The ChemSpider team – led by Richard Kidd
• The RSC Data Repository team
73. Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams