Closing the gap between chemistry and biology: Joining between text tombs and databases
1. [1]
Closing the gap between chemistry and
biology: Joining between text tombs and
databases
Presentation for Uppsala University Department of Neuroscience, Sept 2013
By Christopher Southan
Curator for IUPHARdb, http://www.guidetopharmacology.org/
Queen's Medical Research Institute, University of Edinburgh
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Presentations: http://www.slideshare.net/cdsouthan
2. [2]
Abstract
• Progress in the biomedical sciences is critically dependent on explicit chemical structures
and bioactivity results described in text. This applies across drug discovery, pharmacology,
chemical biology, and metabolomics. However, the entombing of the majority of these
structures and associated data within patents, papers, abstracts and web pages has been a
major barrier to progress. This presentation introduces the current public information flow
from documents and its associated barriers, such as inadequate author specification of
structures, journal paywalls precluding text mining and the patchiness of MeSH chemistry
annotation for PubMed-to-PubChem connectivity. It then reviews trends that are lowering
these barriers. These include the Google merge of over 50 million InChIKeys from
PubChem, ChemSpider and UniChem, ChEMBL containing SAR for 0.8 million structures
from 50K medicinal chemistry papers, over 20 million abstracts in PubMed, and full-text
open patent chemistry in SureChemOpen bringing PubChem patent-extracted structures to
15 million. In addition, options such as Open Lab Books and figshare are expanding the
choices for surfacing new structures. Methods will be outlined for establishing
document-to-document and document-to-database links via chemical structures. These include the
PubChem toolbox, protein targets in UniProt, PubChem BioAssay, ChEMBL indexing in UK
PMC, SureChemOpen, chemicalize.org for text name-to-structure conversion, OSRA for
image-to-structure conversion, Venny for set comparisons and InChIKey searching in
Google [1]. Combined use of these approaches to make joins between patents, papers,
abstracts, chemical database entries, SAR data and drug target protein sequences will be
illustrated with recent novel antimalarial lead compounds, patent-only BACE2 inhibitors and
company code numbers in the NCATS repurposing list.
3. [3]
The Chem <-> Bio Join
• Chemistry that does something: drug discovery, drug development,
toxicology, pharmacology, systems chemical biology (probes), structural
biology, metabolomics, chemical ecology, etc.
• With the exception of some PubChem BioAssays, the majority of data is still
primarily archived in documents
7. [7]
A recent NRDD article
• Just images and code numbers
• No PubChem or ChemSpider IDs
• No SMILES or InChIs
• No molfiles for download
• No links in or out
• No MeSH > PubChem substances
• Some cited sources might have IUPAC names
8. [8]
You can dig out structures from text for free:
- but it's hard work
9. [9]
What’s out there for free
• InChIKey in Google ~ 50 million
• PubChem = 48 million
• PubChem ROF + 250-800 Mw (lead-like) = 31 million
• ChemSpider = 28 million
• PubChem all docs (papers & patents) = 16 million
• PubChem patents = 15 million
• SureChemOpen = 14.5 million
• PubChem journal sources (PubMed + ChEMBL) = 1 million
10. [10]
Medicinal chemistry patents (tombs with lids off)
• WO, C07D = 72,737 (assignee vs. year plots below)
• ~ 50 novel structures with SAR per patent = ~ 3.5 million bioactives
• Paradoxically now completely open for chemistry or any mining
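The bioactive estimate on this slide is a simple product. A minimal Python check, using only the slide's own figures (the per-patent average of ~50 is itself an approximation, so the result is an order-of-magnitude guide rather than a count):

```python
# Figures taken from the slide; the per-patent average is approximate.
wo_c07d_patents = 72_737       # WO documents classified under C07D
structures_per_patent = 50     # assumed average of novel structures with SAR

estimated_bioactives = wo_c07d_patents * structures_per_patent
print(f"{estimated_bioactives:,}")  # 3,636,850
```

This lands close to the ~ 3.5 million quoted above.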
11. [11]
PubMed: ~ 10% with chemistry (guarded tombs)
“Free full text” = 575,513 (24%)
12. [12]
Growth: (escaping the tombs)
• Patent “big bang” (SureChem & SCRIPDB in 2012)
• Literature “slow burn” (ChEMBL 2009 jump)
• Paradox - patents:papers 15:1
(both sets of CIDs cumulative)
14. [14]
Triaging document or webpage chemistry
• Identify the structure specification types, e.g.
– Semantic names (all sources)
– Code names (press releases, papers and abstracts)
– IUPAC names (papers, patents and abstracts)
– Images (papers, patents, & Google images)
– SMILES (open lab books)
– InChI strings (open lab books)
– SDF files (open lab books, & GitHub)
• Convert these to a structure (e.g. SDF, SMILES, InChI) then:
– Search InChIKey in Google
– Search major databases
– Search SureChemOpen
– Compare extracted sets for intersects and diffs
– Extend exact match connectivity with similarity searching
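The first triage steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classifier is a crude heuristic, the function names are hypothetical, and the example key (aspirin) is not from the talk. The InChIKey pattern is the standard 14-10-1 uppercase layout, and the Google URL is just an exact-phrase query.

```python
import re
from urllib.parse import quote_plus

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classify_identifier(text: str) -> str:
    """Crude first-pass guess at the specification type of a string."""
    s = text.strip()
    if s.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(s):
        return "inchikey"
    if s.lower().endswith((".sdf", ".mol")):
        return "structure file"
    # Names, code numbers and SMILES need a real parser to tell apart.
    return "name, code or SMILES"


def google_inchikey_url(inchikey: str) -> str:
    """Build an exact-phrase Google query URL for an InChIKey."""
    if not INCHIKEY_RE.match(inchikey):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


# Illustrative key only (aspirin); it does not come from the presentation.
example_key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
print(classify_identifier(example_key))
print(google_inchikey_url(example_key))
```

Anything that falls into the last category still has to go through a name-to-structure or image-to-structure tool before it can be searched.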
15. [15]
Triage example: a new antimalarial
The MMV390048 code name is linked to an image in press reports
but is PubChem- and PubMed-negative
16. [16]
Images: convert and search
Real chemists sketch them in a jiffy;
the rest of us can use OSRA: Optical Structure Recognition Application
(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
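Once an edited SMILES is in hand, the database search can also be scripted. The sketch below only builds a lookup URL; it assumes the present-day PubChem PUG REST service and its query-parameter form for SMILES input, which should be checked against the current PubChem documentation. The function name is hypothetical, and no network call is made here.

```python
from urllib.parse import urlencode

# Assumed base URL of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_cid_lookup_url(smiles: str) -> str:
    """URL asking PubChem for the CIDs matching a SMILES string."""
    # The SMILES goes in the query string so that brackets and '=' are
    # percent-encoded rather than placed raw in the URL path.
    return f"{PUG_REST}/compound/smiles/cids/TXT?" + urlencode({"smiles": smiles})


# SMILES as shown on the slide (OSRA output after manual editing).
osra_smiles = "CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3"
print(pubchem_cid_lookup_url(osra_smiles))
```

An exact lookup like this only succeeds if the edited structure is right; a similarity search is the fallback when it is nearly right.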
27. [27]
PubChem -> ChEMBL -> PMID -> assay -> structures
• CHEMBL2041980 (structure)
• PMID 22390538 (paper)
• CHEMBL2045642 (assay for 32 structures from paper)
• The 32 CIDs all have patent matches
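The assay-to-structures step of this chain could be scripted roughly as follows. This is a sketch, assuming the present-day ChEMBL REST interface (an `activity` resource filterable by `assay_chembl_id`, returning an `activities` list whose items carry `molecule_chembl_id`); those names should be verified against the current ChEMBL web services documentation. The function names are hypothetical and the code only builds the URL and parses an already-downloaded response.

```python
from urllib.parse import urlencode

# Assumed base URL of the ChEMBL data web services.
CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def assay_activities_url(assay_chembl_id: str, limit: int = 100) -> str:
    """URL listing the activity records recorded under one assay ID."""
    query = urlencode({"assay_chembl_id": assay_chembl_id, "limit": limit})
    return f"{CHEMBL_API}/activity.json?{query}"


def molecule_ids(activity_response: dict) -> list:
    """Distinct molecule ChEMBL IDs from a parsed response, in first-seen order."""
    ids = (act["molecule_chembl_id"] for act in activity_response.get("activities", []))
    return list(dict.fromkeys(ids))


# Assay ID from the slide.
print(assay_activities_url("CHEMBL2045642"))
```

Selecting by assay ID rather than by PMID works because the assay record is normally specific to one paper.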
28. [28]
Venny: intersects, diffs, de-dupes and merges
1) WO2011086531 matches in PubChem
2) CheS-Mapper cluster 8 from WO2011086532
3) ChEMBL assayed cpds from PMID 22390538
(handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey)
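The same set operations are easy to reproduce locally with Python sets. This is a minimal sketch, not a description of how Venny works; the function names are hypothetical and the identifiers are placeholders, not real extraction results.

```python
def dedupe(ids):
    """Strip whitespace, drop blanks and duplicates, keep first-seen order."""
    return list(dict.fromkeys(s.strip() for s in ids if s.strip()))


def compare(a, b):
    """Intersect, diff and merge two lists of identifier strings."""
    set_a, set_b = set(dedupe(a)), set(dedupe(b))
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        "merged": sorted(set_a | set_b),
    }


# Hypothetical placeholder IDs standing in for two extraction sets.
patent_hits = ["CID_1", "CID_2", "CID_2", "CID_3 "]
paper_hits = ["CID_2", "CID_3", "CID_4"]

result = compare(patent_hits, paper_hits)
print(result["both"])    # ['CID_2', 'CID_3']
print(result["only_a"])  # ['CID_1']
```

Because the comparison is on plain strings, it only finds matches when both sets use the same identifier type, for example InChIKeys throughout.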
39. [39]
Conclusions
• The ability to extract chemical structures from text and web sources
has been transformed by an expansion of the public toolbox
• The PubChem big-bang increases probability of extraction having
database exact or similarity matches
• Paradoxically, the patent corpus is now completely open while access
to journal text is still restricted
• However, ChEMBL has extracted ~ 0.8 million SAR-linked and target-mapped
structures from ~ 50K papers
• The submission of ~15 mill. patent structures to PubChem ensures at
least representation from the majority of medicinal chemistry patents
(many of which spawned the subsequent ChEMBL papers)
• Those who want to share their structures globally (e.g. OSDD) have an
expanding set of options for surfacing their results.
InChIKeys - estimate of PubChem + ChemSpider in Google, but PubChem currently has a backlog for Key scraping. The ROF + 250-800 is a very approximate circumscription of the property space that has some possibility of bioactivity. Probably a proportion of vendor structures may have never been committed to text. There are some virtuals “out there”, including some patent extractions, but these are difficult to estimate.
Note the WO/PCT queries are non-redundant in the patent family sense. The medicinal chemistry corpus is actually quite small. Note big pharma patent decline post-2008. Average exemplified cpds with activity data per patent (family) is unknown but GVK's curation average is ~ 50.
Using the top-level MeSH term as a filter for “PubMeds with some chemistry”. Free full text is ~ ¼, but there are a lot of biological journals in this set.
Note that cumulative plots include an element of back-mapping, i.e. the 2005 matches are to the 2013 total, not just the 2005 documents.
Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology the major patent offices could put links in the PDFs but are unlikely to do so.
Need to assess what representational types are being used in the document. E.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
Self-explanatory. Note my blog post was indexed.
The simplest of starting points: at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right because a database similarity match is OK to see what it should have been.
SMILES from the image hits the CID in PubChem. This links to patents via SureChem and chemicalize.org. ChEMBL provides a link to the paper. Note none of these sources have MMV390048 as a synonym, so all the connections are via structure.
We can start off with patent links. Note in this case numbered image capture, as opposed to the IUPAC listing, was important to manually collate the structure against the correct IC50.
From manual cross-checking between the individual example structures and the IC50 table, the Excel sheet can be populated.
Useful way to share results that is citable. Indexed in Google but no live links in Excel sheet (yet).
Can upload CID lists and download as a saved and public collection
This is the Pistoia / Alex Clark SAR Table app. Dropped the CIDs out of PubChem into Dropbox and picked them up on the iPad. Nice, but it would be good to automate the decomposition.
InChIKey search picks up instantly. This was just a choice of one of the actives. So this connects PubChem and figshare.
The CID links straight through to chemicalize and will just re-extract the whole patent in a few seconds. The 413 gave 358 hits in PubChem.
IUPAC names have a lot of usage variants and OCR mistakes. Typically gaps, line breaks, l instead of 1 and missing brackets. OPSIN is good for indicating where the break is. This can then be fixed for a series in chemicalize.org.
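Some of these OCR faults can be caught with a simple pre-check before a name goes to a name-to-structure tool. The sketch below is an assumption-laden heuristic, not what OPSIN or chemicalize.org do internally: it rejoins text broken around hyphens and commas and reports the first bracket that fails to pair up. The function names and the example name are hypothetical.

```python
import re

CLOSERS = {")": "(", "]": "[", "}": "{"}


def rejoin_name(raw: str) -> str:
    """Collapse line breaks and gaps, then close up around hyphens and commas."""
    name = re.sub(r"\s+", " ", raw).strip()
    return re.sub(r" ?([-,]) ?", r"\1", name)


def first_bracket_problem(name: str):
    """Index of the first bracket that cannot be paired, or None if all pair up."""
    stack = []
    for i, ch in enumerate(name):
        if ch in "([{":
            stack.append((ch, i))
        elif ch in CLOSERS:
            if not stack or stack[-1][0] != CLOSERS[ch]:
                return i  # closer with no matching opener
            stack.pop()
    # Any opener left on the stack was never closed.
    return stack[-1][1] if stack else None


# Hypothetical OCR output: a line break, a stray gap, and a lost ')'.
raw = "4-[2-(dimethylaminoethoxy]-N,\nN- diethylbenzamide"
name = rejoin_name(raw)
pos = first_bracket_problem(name)
print(name)
print(None if pos is None else name[pos])
```

The reported position only says where the pairing breaks down; deciding where the missing bracket belongs still needs a chemist or a parser such as OPSIN.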
Total extractions from patents can include a lot of low Mw common reagent chemistry. CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
ChEMBL extracts structure and data. Can't actually select a set of cpds via the PubMed ID but can via the assay ID that is usually unique to that paper. In this case we got 32 structures, all of which came from that patent.
Very useful utility for any kind of set operations, e.g. sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.