Progress in drug discovery and chemical biology is hugely enabled by curated document-assay-result-compound-target relationships (D-A-R-C-P) in open databases from resources such as the Guide to Pharmacology and ChEMBL. These are synergistically integrated into PubChem, which pre-computes chemical similarity and connectivity between over 95 million structures and 5.6 million BioAssay results. It also links chemistry to documents via various additional routes, including MeSH and large-scale submissions from publishers. However, these efforts are patchy and very few journals facilitate such connectivity. There thus remains a massive shortfall in public D-A-R-C-P capture from decades of papers and patents. This presentation will cover these aspects and discuss their partial amelioration by options such as author-driven depositions and open lab-book approaches as used by Open Source Malaria.
1. Why is connecting chemistry-to-biology in open sources more difficult than it should be?
Presented at UCL School of Pharmacy, London, 13 June 2019
Hosted by Professor Mathew Todd
Christopher Southan
4. The core of the problem
“We have spent millions putting chemistry into PDFs but now we are spending more millions taking it back out” (Anon)
5. The chemistry < - > biology join
• Chemistry that does something significant in vitro, in cellulo, in vivo or in the clinic
• Major bioactivity domains from drug discovery, chemical biology and ecology
• Some cases are not adequately covered by this simple relationship chain (e.g. heparin as an indirect inhibitor of thrombin, or where P could be a bacterium or protozoan)
• The majority of data are still primarily archived in papers and patent documents
• Upper limit statistics for quality publications are essentially unknown
D – A – R – C – P
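As a rough illustration only, the relationship chain could be held as one structured record per result. The class name, field names and all values in this sketch are hypothetical and are not taken from any of the databases mentioned; P is assumed here to be a target identifier.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class DarcpRecord:
    """One D-A-R-C-P relationship (hypothetical layout, for illustration)."""
    document: str  # D: identifier of the paper or patent, e.g. a DOI
    assay: str     # A: short description of the assay
    result: str    # R: activity parameter with value and units
    compound: str  # C: chemical structure, e.g. as a SMILES string
    target: str    # P: target identifier (assumed to be an accession)


# All values below are placeholders, not real data.
record = DarcpRecord(
    document="doi:10.0000/placeholder",
    assay="enzyme inhibition, in vitro",
    result="IC50 = 50 nM",
    compound="CCO",  # placeholder structure only
    target="PLACEHOLDER_ACCESSION",
)

# Serialising to JSON gives a structured, machine-readable form of the chain.
print(json.dumps(asdict(record), indent=2))
```

Keeping each link of the chain as a separate field is what makes the record searchable from either the chemistry or the biology side.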
6. So how much disinterred chemistry is out there?
8. Unsung Heroes
Expert extraction of D-A-R-C-P by biocurators is hard for many reasons, including:
• Poor continuity of funding and career support
• Entity disambiguation challenges
• Unintentional obfuscation, ambiguity and errors by authors (and occasionally deliberate obfuscation by patent applicants)
• Difficulty in capturing the nuances and complexities of molecular mechanisms of action (e.g. prodrugs or no molecular target)
• Even primary activity parameters (IC50, Ki, Kd) show ~10-fold variation between publications for nominally the same assays
• Judging the quality and potential reproducibility of the publications selected for extraction
• Publisher guidelines are only slowly beginning to address the above
• Author engagement with assay and target ontologies is limited
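To put the ~10-fold variation in activity parameters into the log units commonly used for comparison, the short sketch below converts two values to their negative log10 molar form. The two IC50 values and the function name are made up for illustration.

```python
import math


def p_activity(value_nm: float) -> float:
    """Convert an activity value in nM (IC50, Ki or Kd) to -log10 of its molar value."""
    return -math.log10(value_nm * 1e-9)


# Hypothetical values for nominally the same assay in two publications.
ic50_a_nm = 10.0
ic50_b_nm = 100.0

fold = ic50_b_nm / ic50_a_nm
delta = p_activity(ic50_a_nm) - p_activity(ic50_b_nm)

print(f"fold difference: {fold:.0f}")
print(f"pIC50 difference: {delta:.1f}")
```

A 10-fold difference always corresponds to one log unit, whatever the absolute potency.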
9. Disinterment from the PDF tomb (I)
Image extraction > structure
• Real chemists sketch images in a jiffy
• The rest of us can use OSRA (Optical Structure Recognition Application)
11. Commercial biocuration of D-A-R-C-P
Excelra (formerly GVKBIO)
GOSTAR stats from 2015
• 1.3 million compounds from 112K papers (~15 per paper)
• 3.5 million compounds from 70K patents (~50 per patent)
• 3,882 human targets
17. Recent large-scale chem < > doc PubChem submissions
• Generally a good thing but with caveats
• Difficult to automate filtration to identify the “aboutness” of key compounds
• Issues with indexing of non-PubMed, DOI-only journal papers
• Quality of chemical named entity recognition (CNER) extraction
• Introduces a parallel document < > structure mapping system into PubChem
18. Reciprocal links > virtuous circles (I)
• GtoPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
19. Reciprocal links > virtuous circles (II)
• GtoMPdb users can navigate “out” via PubChem or PubMed
• NCBI users can navigate “in” via PubChem or PubMed
29. Conclusions
• The bioscience community (including big data miners) still has its collective feet nailed to the floor by the 5-decade backlog of scientifically valuable bioactive chemistry relationships entombed in PDF papers and patents
• Biocuration of D-A-R-C-P makes a crucial contribution but at limited scale
• Automated entity extraction is advancing but remains well behind the specificity of mechanistic biocuration and is publisher-constrained
• The existence of several parallel document <> chemistry systems (e.g. MeSH, IBM, ChEMBL, EPMC, Springer Nature, Thieme, Wikidata) is enabling but also confusing
• The spread of Open Science ELNs is good to see but findability, searchability and database submissions still need to be optimised
• The need remains to facilitate a flow of published (including preprint) author-specified bioactive chemistry direct to databases (even if the papers are FAIR)
30. Proposed core of the solution
“Mandating authors to explicitly connect chemical structures to
their experimental bioactivity results in a form (extrinsic to PDF)
that is FAIR, structured, includes metadata, machine readable,
ontologised, transferable to open database records and
reciprocally linked to their publications” (Southan 2019)
• This is, of course, a counsel of perfection
• In essence, authors should become biocurators
• Currently only a few papers, those with data sets submitted to PubChem BioAssay by authors, would conform
• This has been technically feasible for at least a decade
• The impediments are thus sociological and rooted in publishing models
The simplest of starting points, at least the press release had a structure diagram
OSRA provides good starting points to edit and get SMILES
The structure does not have to be exactly right because a database similarity match is OK to see what it should have been
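The point that an imperfect structure can still be resolved by a similarity match can be sketched as follows. This is a toy illustration only: character trigrams of the SMILES string stand in for real chemical fingerprints, the three-entry dictionary stands in for a real database, and all function names are hypothetical. Real resources such as PubChem use their own fingerprints and search services.

```python
def trigram_fingerprint(smiles: str) -> set[str]:
    """Toy fingerprint: the set of 3-character substrings of a SMILES string."""
    return {smiles[i:i + 3] for i in range(len(smiles) - 2)}


def tanimoto(a: set[str], b: set[str]) -> float:
    """Tanimoto coefficient: shared features divided by all distinct features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query: str, database: dict[str, str]) -> tuple[str, float]:
    """Return the name and score of the database entry most similar to the query."""
    query_fp = trigram_fingerprint(query)
    scores = {
        name: tanimoto(query_fp, trigram_fingerprint(smiles))
        for name, smiles in database.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


# A tiny stand-in database of well-known structures.
database = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "paracetamol": "CC(=O)Nc1ccc(O)cc1",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

# A query with one deliberate error, of the kind image recognition might make:
# the ester oxygen of aspirin is written as nitrogen.
query = "CC(=O)Nc1ccccc1C(=O)O"

print(best_match(query, database))
```

Despite the single-atom error, the closest entry is still the intended structure, which is the sense in which an approximate OSRA result can be good enough as a starting point.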