Digital extraction of chemical data

Approaches for extraction and “digital
chromatography” of chemical data:
A perspective from the RSC

Overview
• Introduction
– What data can we consider?
– What are the challenges?
– What data and sources does the
RSC have?
– Experimental Data Checker

• Case Studies:
– Project Prospect
– Chair forms of
Sugars/cyclohexanes

Traditional Chromatography

Images taken from:
http://www.sciencemadness.org/talk/viewthread.php?tid=3960&page=3
http://en.wikipedia.org/wiki/Column_chromatography

Why Digital Chromatography?
• Usable information is mixed in with description
and analysis, which makes it difficult to find
• Despite our best efforts – still lots of ambiguous
or plain wrong/unusable chemical information
• Why?
– Human error
– Processing errors
– Incorrect use of data generation/extraction tools
– Style over meaning
– Data not generated with reuse in mind
– Data generated for humans

Style/Layout Vs Meaning
• Structures drawn to illustrate more than
just the identity

• Data not generated with reuse in mind

• Author practices

• Mixed 2D and perspective representations

• Unintentional definition of stereochemistry

Data generated for humans
• Separated/orphaned information, including
Markush structures and information passed by
reference

What chemical data can we consider?
• Chemistry is especially challenging: it covers a wide
range of data types
– Numeric data
– Names
– Structures
– Terminology

• These span very different fields (organic, inorganic,
physical), where meanings and interpretations are not
perfectly aligned
• Application of standards can be challenging
• Drawing conventions are documented but not used

What chemical data and sources does the
RSC have?

A beginning: helping chemists review
their own work
Amphidinoketide I

To a solution of…….
…. Amphidinoketide I was isolated as a ……..
[α]D25 −17.6 (c 0.085, CH2Cl2); Rf = 0.61 (1:1 hexane:ethyl acetate); νmax(CHCl3)/cm−1 1707.2 (CO), 1686.9
(CO), 1632.4 (CO), 1618.9 (CC), 1458.1; 1H NMR (CD2Cl2, 500 MHz) δH 6.08 (1H, t, J = 1.3 Hz, 3-CHC),
5.82 (1H, ddt, J = 16.9, 10.2, 6.7 Hz, 19-CHCH2), 4.99 (1H, m (17.1 Hz), 20-CHA), 4.92 (1H, m (10.2
Hz), 20-CHB), 3.05 (1H, dd, J = 17.9, 9.3 Hz, 8-CHA), 3.00–2.90 (3H, m, 9-CHCH3, 11-CHA, 12-CHCH3),
2.72–2.64 (2H, m, 5-CHA, 6-CHA), 2.62–2.55 (2H, m, 5-CHB, 6-CHB), 2.51–2.45 (3H, m, 8-CHB, 11-CHB,
14-CHA), 2.33 (1H, dd, J = 16.9, 7.4 Hz, 14-CHB), 2.09 (3H, s, 21-CH3), 2.05–1.99 (2H, m, 18-CH2),
1.99–1.96 (1H, m, 15-CHCH3), 1.88 (3H, s, 1-CH3), 1.39–1.25 (3H, 17-CH2, 16-CHA), 1.14–1.10 (1H,
m, 16-CHB), 1.07 (3H, d, J = 7.0 Hz, 22-CH3), 1.05 (3H, d, J = 7.2 Hz, 23-CH3), 0.87 (3H, d, J = 6.7 Hz,
24-CH3); 13C NMR (CD2Cl2, 125 MHz) δC 213.15 (13-CO), 212.08 (10-CO), 208.40 (7-CO), 198.76 (4-
CO), 155.40 (2-CCH), 138.41 (19-CHCH2), 123.54 (3-CHC), 114.19 (20-CH2C), 48.81 (14-CH2), 45.93
(11-CH2), 44.50 (8-CH2), 41.43 (9-CHCH3), 41.01 (12-CHCH3), 37.74 (5-CH2), 36.55 (16-CH2), 36.27 (6-
CH2), 34.18 (18-CH2), 28.74 (15-CHCH3), 27.57 (1-CH3), 26.60 (17-CH2), 20.63 (21-CH3), 19.77 (24-
CH3), 16.65 (22 or 23-CH3), 16.62 (22 or 23-CH3); HRMS (ESI) Calculated for C24H38O4 413.2668,
found 413.26600 (MNa+). (9R, 12R, 15S)-1 had [α]D25 +11 (c 0.245, CH2Cl2).

• http://www.rsc.org/is/journals/checker/run.htm
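
One kind of consistency check that an experimental section like the one above lends itself to is comparing the 1H NMR integrations with the hydrogen count of the reported formula. The sketch below is only an illustration of that idea, not the Experimental Data Checker's actual implementation; the function names and regular expressions are assumptions made for this example.

```python
import re


def protons_in_formula(formula: str) -> int:
    """Hydrogen count from a simple formula string such as 'C24H38O4'."""
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            total += int(count) if count else 1
    return total


def protons_in_peak_list(nmr_text: str) -> int:
    """Sum integrations written as '(1H, ...' or '(3H, ...' in a 1H NMR listing."""
    return sum(int(n) for n in re.findall(r"\((\d+)H\b", nmr_text))


# Short excerpt of the 1H NMR listing above (three of the reported peaks).
excerpt = (
    "1H NMR (CD2Cl2, 500 MHz) δH 6.08 (1H, t, J = 1.3 Hz, 3-CHC), "
    "2.09 (3H, s, 21-CH3), 1.88 (3H, s, 1-CH3)"
)

print(protons_in_peak_list(excerpt))   # protons accounted for in the excerpt
print(protons_in_formula("C24H38O4"))  # protons expected from the formula
```

Run over the complete 1H listing rather than the excerpt, the integrations sum to 38, which agrees with the 38 hydrogens in C24H38O4. A mismatch would flag the paragraph for the author to review.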

Case study 1: Project Prospect

What is Prospect?
Visible output

• Output layer: Enhanced HTML, RSS, Prospect database,
InChI–name pairs (in ChemSpider), better ontologies
• Information layer: Enhanced RSC XML
• Tool layer: OSCAR
• Input layer: InChI–name pairs (from ChemSpider),
ontologies, RSC XML, author CDX files

People and machines
People
• Can understand narratives.
• Can interpret pictures.
• Can reason about three-dimensional objects.
• Can do a high-quality job.

Machines
• Can’t understand narratives.
• Can’t interpret pictures.
• Not able to infer 3D structure from 2D without cues.
• Can do a lower-quality, but still useful job.

Case study 2: The chair representation
issue

InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2
WQZGKKKJIJFFOK-UHFFFAOYSA-N
• 5 stereocentres → 2^5 = 32 possible stereoisomers, all
matching this one stereo-free InChI
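
The arithmetic on this slide can be made concrete with a small sketch. It checks whether an InChI string carries any stereo layers (the /b, /t, /m and /s layers) and enumerates the 2^5 assignments for five stereocentres. The helper name and the R/S labels are assumptions for illustration only, and 2^n is an upper bound that symmetry can reduce for other molecules.

```python
from itertools import product

# InChI layers that encode stereochemistry: double bond (b), tetrahedral (t),
# and the two accompanying layers (m, s).
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def has_stereo_layers(inchi: str) -> bool:
    """True if any layer after the formula is a stereo layer."""
    layers = inchi.split("/")[2:]  # skip 'InChI=1S' and the formula layer
    return any(layer[:1] in STEREO_LAYER_PREFIXES for layer in layers)


# The InChI shown on the slide: connectivity and hydrogens only.
flat_inchi = "InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2"

# Each of the five stereocentres can take one of two configurations.
assignments = list(product("RS", repeat=5))

print(has_stereo_layers(flat_inchi))  # no /t, /m or /s layer present
print(len(assignments))               # 2**5 candidate stereoisomers
```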

Case study 2: Chair forms of
six-membered rings: what could go wrong?

How do we “fix” chair representations?

How we normalize them:
1. Identify 6-membered rings (Indigo)
2. Identify what sort of ring it is
3. Map atoms onto a standard structure (e.g.
beta-D-glucopyranose)
4. Tidy
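
Steps 1 and 2 above can be sketched in a few lines. The slides name Indigo for ring perception; the version below is a pure-Python stand-in on a plain bond list so that it runs without a toolkit, and the function names and ring categories are assumptions for illustration. The atom numbering follows the connectivity layer of the InChI shown earlier (atoms 1-6 are carbon, 7-12 are oxygen).

```python
from typing import Dict, List, Set, Tuple


def six_membered_rings(bonds: List[Tuple[int, int]]) -> List[Tuple[int, ...]]:
    """Return each six-membered ring once, as a sorted tuple of atom indices.

    Rings are keyed by their atom set, which is enough for this sketch.
    """
    adjacency: Dict[int, Set[int]] = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    found: Set[Tuple[int, ...]] = set()

    def walk(path: List[int]) -> None:
        if len(path) == 6:
            # A ring closes if the last atom is bonded back to the first.
            if path[0] in adjacency[path[-1]]:
                found.add(tuple(sorted(path)))
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in adjacency:
        walk([start])
    return sorted(found)


def classify_ring(ring: Tuple[int, ...], elements: Dict[int, str]) -> str:
    """Step 2: decide what sort of ring it is from its element composition."""
    symbols = sorted(elements[i] for i in ring)
    if symbols == ["C"] * 6:
        return "cyclohexane-like"
    if symbols == ["C"] * 5 + ["O"]:
        return "pyranose-like"
    return "other"


# Heavy-atom bonds read from c7-1-2-3(8)4(9)5(10)6(11)12-2.
bonds = [(7, 1), (1, 2), (2, 3), (3, 8), (3, 4), (4, 9),
         (4, 5), (5, 10), (5, 6), (6, 11), (6, 12), (12, 2)]
elements = {i: "C" for i in range(1, 7)}
elements.update({i: "O" for i in range(7, 13)})

rings = six_membered_rings(bonds)
print(rings)
print([classify_ring(r, elements) for r in rings])
```

Mapping onto a standard structure and tidying (steps 3 and 4) depend on 2D coordinates and a template, so they are left out of this sketch.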

The future: “The digester”
• Ability to:
– Reconnect R-groups
– Expand abbreviations
– Expand brackets
– Link structures with reference IDs

Other examples that we didn’t
mention in case studies
• CIF data importer
• Structure Validation and Standardisation
– (Thurs Aug 23, 9:15 am, Marriott Downtown,
Franklin Hall 6)
• Work on creating ontologies: RXNO, CMO
– Also collaborating on: ChEBI ontology, GO, SO
• Collaboration with Utopia to enable Prospect
mark-up of PDFs

Summary
• Many data sharing practices are based on:
– Traditional print articles
– Consumption of data by humans only
• This poses issues for publishers and users alike
• The RSC is developing innovative solutions to
address some of these problems
– Chemical structures are challenging
– There are limits to what machine methods can achieve
– Need to educate authors to think differently

Acknowledgements
• Colin Batchelor - Development and Technical
work
• Jeff White & Aileen Day
• Richard Kidd, Graham McCann and Will
Russell
• RSC ICT staff

Thank you

Email: chemspider@rsc.org
Twitter: @ChemSpider
http://www.chemspider.com
