Christopher Southan (The IUPHAR/BPS Guide to PHARMACOLOGY, UK)
While the raison d'être of patents is Intellectual Property (IP), there is a growing awareness of the scientific value of their data content. This is particularly so in medicinal chemistry and associated bioactivity domains, where the disclosed compounds and associated data not only exceed those published in papers by several-fold and surface years earlier but are also, paradoxically, completely open (i.e. no paywalls). Scientists have traditionally extracted their own relationships or used commercial sources, but the last few years have seen a “big bang” in patent extractions submitted to open databases, including nearly 20 million structures now in PubChem.
This tutorial will:
Outline the statistics of patent chemistry in various open sources
Introduce a spectrum of open resources and tools
Enable an understanding of target identification, bioactivity and SAR extraction from patents and connecting these relationships to papers
Cover aspects of medicinal chemistry patent mining
Include hands on exercises using open source antimalarial research as examples
The focus will be on public databases and patent office portals, since these can be transparently demonstrated. However, the essential complementarity with commercial resources will be touched on. Those engaged in Competitive Intelligence will also find the material relevant.
ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open resources
1. Digging bioactive chemistry out of
patents using open resources
• While the raison d'être of patents is Intellectual Property (IP), there is a
growing awareness of the scientific value of their data content. This is
particularly so in medicinal chemistry and associated bioactivity domains,
where the disclosed compounds and associated data not only exceed those
published in papers by several-fold and surface years earlier but are also,
paradoxically, completely open (i.e. no paywalls). Scientists have
traditionally extracted their own relationships or used commercial sources,
but the last few years have seen a “big bang” in patent extractions
submitted to open databases, including over 20 million structures now in
PubChem.
2. Outline
• Statistics of patent chemistry in various sources
• Open resources, databases and tools
• Target identification
• Bioactivity and SAR extraction
• Connecting these relationships to papers
• Medicinal chemistry patent mining
• Exercises using antimalarial research as examples
• Complementarity with commercial resources.
• Competitive Intelligence
N.B. Not in scope just now: web services, APIs, RDF or SAR modelling per se
This is a suggested list that can be extended to related topics attendees
would like to cover (at least if within the cognisance of the presenter!)
4. Biog
Chris Southan joined the IUPHAR/BPS Guide to Pharmacology database
curation team as Senior Cheminformatician in 2013. Previously he was a
Drug Discovery Consultant at TW2Informatics in Göteborg, Sweden, working
on patent informatics. Prior to this he was a contractor for AstraZeneca
Knowledge Engineering (2009-2011), working on Chemistry Connect and
Pharma Connect. Earlier positions include the ELIXIR Database Provider
Survey for the EBI (2008-9), Principal Scientist and Bioinformatics Team
Leader at AstraZeneca (2004-7) and senior bioinformatics positions at
Oxford Glycosciences (2002-3), Gemini Genomics (2001) and SmithKline
Beecham (1987-2000). He has a PhD from the University of Munich, an M.Sc.
in Virology from Reading University and a B.Sc. Hons. in Biochemistry from
Dundee University. Further information on LinkedIn
IUPHAR/BPS Guide to PHARMACOLOGY
Publications: PubMed ORCID iD 0000-0001-9580-0446
Blog: Bio < > Chem
Presentations: Slideshare
Twitter: https://twitter.com/cdsouthan
TW2Informatics: https://sites.google.com/view/tw2informatics/home
5. Audience assumptions
• Some familiarity with SAR distillation from the literature
• Many of you could extract examples from a patent by hand
• Database cognisance, including PubMed and PubChem (SID, CID)
• More interest in recent than historical SAR
• Not obsessively concerned with false-negatives (i.e. missed data)
• Not greatly perturbed by the fuzziness of public sources (that you might
grumble about for commercial ones)
• Familiar with the mess of patent families and Kind codes
• Familiar with protein names and identifiers
• Familiar with obfuscation that can confound SAR extraction
• Focused on Med Chem for human diseases
• Most of this tutorial could apply across other domains (e.g. IPC code
A01N for pesticides and herbicides)
• No boundaries between Drug Discovery and Chemical Biology
• Aware academic Drug Discovery is accelerating relative to commercial
6. References (I)
Chapter in: Samuel Chackalamannil, Rotella and Ward, (eds.) Comprehensive
Medicinal Chemistry III vol. 3, pp. 464–487. Oxford: Elsevier.
http://dx.doi.org/10.1016/B978-0-12-409547-2.13814-4, ISBN: 9780128032008
https://www.ncbi.nlm.nih.gov/pubmed/26194581
8. Core assumption: can we believe patent SAR results?
• We know the data has value but it is difficult to assess quality extrinsically
• As for other domains, Med Chem has an experimental reproducibility crisis
• This reflects equivocality w.r.t. antibodies, cell lines and chemistry (e.g.
supplier purity and probes vs PAINS)
• For patents high-replicate error ranges are rarely included
• Re-synthesis fidelity also rarely reported (ever?)
• Cf. “Dispensing processes impact apparent biological activity as determined
by computational and statistical analyses” (PMID 23658723)
• We could hope that internal relative SAR across a series is more consistent
than externally comparative absolute numbers
• We know some inventor teams are world-class, well cited medicinal chemists
but can we assess the less famous?
• The same QC considerations apply to papers
• ChEMBL surfaces the worryingly wide IC50/Ki/Kd ranges on nominally same
assays from different papers
• We can also intersect some patent and paper values
• Is the internal consistency of patent-derived SAR models a useful QC?
9. Introductory example
• 138 detailed descriptions of the series
• WO2013083991 SureChEMBL- PubChem
• IC50 cross-reactivity data from no less
than five cell-based enzyme assays
• Human NMT1 (P30419), human NMT2
(O60551) Plasmodium vivax (A5K1A2)
Plasmodium falciparum (Q8ILW6) and
Leishmania donovani (Q8ILW6)
• https://cdsouthan.blogspot.se/2013/07/n-myristoyltransferase-patent-and-pdb.html
11. So how much useful SAR is in the patent corpus?
• Definition for SAR: Bioactivity assay "A" (e.g. for an enzyme) with a
quantitative result "R" (e.g. an IC50) for a compound "C" (defined
chemical structure) as an activity modulator (e.g. inhibition) of protein
target "P" (also for cellular targets, e.g. anti-infectives)
• A useful shorthand for this mapping is “D-A-R-C-P”, where "D" is the
source document
• Excelra (ex GVKBIO) provides a good statistical starting point
• https://www.slideshare.net/cdsouthan/largescale-curation-of-bioactive-chemistry-from-patents-and-papers
• April 2017 numbers were 1.34 mill cpds from 112K papers and 3.35
mill from 71K patents, 0.18 million overlap
• From the earlier PMID 24204758, 12 cpds/paper and 46/patent
• Human protein targets 3383 in former 2431 in latter, 3882 combined
with 546 patent-only
• The Excelra absolute activity numbers depend on their capping
rules for binned data (e.g. IC50 between 10 and 100 nM)
• Binned data still useful for modelling
• Where are the enzyme activators?
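The D-A-R-C-P mapping and the binned-data caveat can be made concrete with a small record type. This is a minimal sketch, not a published schema: the class name, field names and the "use the upper bound of the bin" capping rule are all assumptions for illustration, and the bin shown for the example record is hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DarcpRecord:
    """One D-A-R-C-P row. Field names are illustrative assumptions."""
    document: str                      # D: patent or paper identifier
    assay: str                         # A: short assay description
    result_type: str                   # R: e.g. "IC50"
    compound: str                      # C: structure identifier (SMILES, InChIKey, CID)
    protein: str                       # P: target, ideally a UniProt accession
    value_nm: Optional[float] = None   # exact value, if the document gives one
    low_nm: Optional[float] = None     # lower bound of a bin, if binned
    high_nm: Optional[float] = None    # upper bound of a bin, if binned

    def is_binned(self) -> bool:
        """True when only a range (bin) was disclosed, not an exact value."""
        return self.value_nm is None and (
            self.low_nm is not None or self.high_nm is not None
        )

    def capped_value_nm(self) -> Optional[float]:
        """Exact value if present, else the bin's upper bound (assumed capping rule)."""
        if self.value_nm is not None:
            return self.value_nm
        return self.high_nm if self.high_nm is not None else self.low_nm


# Hypothetical binned result: "IC50 between 10 and 100 nM"
example = DarcpRecord(
    document="WO2013083991",
    assay="enzyme inhibition (placeholder description)",
    result_type="IC50",
    compound="example 1 (placeholder)",
    protein="P30419",
    low_nm=10.0,
    high_nm=100.0,
)
print(example.is_binned(), example.capped_value_nm())
```

A different capping rule (e.g. the geometric mean of the bin) would change the absolute numbers but not the rank order within a series, which is one reason binned data can still be useful for modelling.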
12. Independent estimates of SAR total
• WIPO PATENTSCOPE A61 and C07 PCTs = 93,253
• Not all have SAR data from novel composition of matter first-filings
• Many will be “secondary” filings (e.g. synthesis and/or crystallisation)
• Generic companies file many of these for de-risked cpds
• Some first-filings for a chemotype series may not have any activity
data disclosed (stats unknown)
• We can thus assume that extractable SAR from med chem patents in
the last five years may be in only 30,000-50,000 documents
• Guesstimate: ~50K patents ~ 3.50 million bioactive structures (cf.
Excelra 3.35 million)
• Asian patents under-represented? (i.e. are we missing unique
structures & SAR)
13. BindingDB public SAR curation:
useful benchmark extraction stats
• Patents: 1,879
• Binding measurements: 199,588
• Compounds: 132,170
• Target proteins: 1,225
• Assays: 2,668
• Average Number of Targets per Patent: 1.95
• Usually primary plus a specificity paralogue cross-screen
• ~70 compounds/patent
• ~100 affinity measurements/patent
Data courtesy of Tiqing Liu and Michael Gilson, Oct 2017
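The per-patent averages follow directly from the totals above, and the same ratio reproduces the earlier ~3.5 million guesstimate. The sketch below is only arithmetic on the slide's numbers; the 50,000-document figure is the guesstimate from the previous slide, and the projection assumes (unrealistically) no compound overlap between patents.

```python
# BindingDB patent curation totals (Oct 2017), as quoted on the slide
patents = 1_879
compounds = 132_170
measurements = 199_588

compounds_per_patent = compounds / patents
measurements_per_patent = measurements / patents

# Assumption: ~50K SAR-containing patents, the guesstimate from the previous slide
assumed_sar_patents = 50_000
projected_structures = assumed_sar_patents * round(compounds_per_patent)

print(f"{compounds_per_patent:.1f} compounds/patent")
print(f"{measurements_per_patent:.1f} measurements/patent")
print(f"{projected_structures:,} projected structures")
```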
15. The three major CNER sources inside PubChem
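[Venn diagram of patent-extracted CIDs in PubChem; counts (Oct 2017) are CIDs in millions]
• IBM = 10.7
• SCRIPDB = 4.0
• SureChEMBL = 17.6
• Segment values shown in the diagram: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
• Union = 21.7
• 3-way = 2.4
• 3 + 2-way = 8.1
• Unique = 13.5
• Raises questions about corroboration vs divergence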
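The Venn counts can be cross-checked with simple arithmetic. In the sketch below, which segment value belongs to which region is an assumption inferred from the stated totals, not something read off the figure. Under that assignment the unique and shared totals match exactly, and the union and per-source totals agree to within 0.1 million, which looks like rounding.

```python
# Segment values in millions of CIDs. The region assignment is an assumption.
segments = {
    frozenset({"SureChEMBL"}): 10.1,
    frozenset({"IBM"}): 2.9,
    frozenset({"SCRIPDB"}): 0.5,
    frozenset({"SureChEMBL", "IBM"}): 4.7,
    frozenset({"IBM", "SCRIPDB"}): 0.6,
    frozenset({"SureChEMBL", "SCRIPDB"}): 0.4,
    frozenset({"SureChEMBL", "IBM", "SCRIPDB"}): 2.4,
}


def total(predicate):
    """Sum the segments whose source set satisfies the predicate."""
    return round(sum(v for k, v in segments.items() if predicate(k)), 1)


union = total(lambda k: True)             # everything
unique = total(lambda k: len(k) == 1)     # found by one source only
shared = total(lambda k: len(k) >= 2)     # 3-way plus 2-way
per_source = {
    s: total(lambda k, s=s: s in k) for s in ("IBM", "SCRIPDB", "SureChEMBL")
}

print(union, unique, shared, per_source)
```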
16. The chemistry stats: wheat vs chaff
• If we accept a certain proportion of binned data as useful, the max
SAR we could expect to align is ~ 3 to 4 mill structures
• But how can we select these from the 22 million (and climbing) in
PubChem?
• The easiest way is to come in from the literature with clean structures
• This can expand the SAR around a target anywhere from 2 to 10-fold
• But we have an unknown statistic: what proportion of good patent SAR
sets, including for novel targets, never get into a paper? (examples
anyone?)
• Excelra have some relevant stats on this – does anyone else?
18. Sources offer a broad spectrum of utilities
• Connecting to patents via structures from papers
• Connecting via targets and/or diseases from papers
• Proximity “Walking” doc <>doc, target <> target, struc <> struc
• Finding patents via metadata (e.g. assignee, target and date)
• Viewing chemistry content in the document
• Establishing if the document has useful SAR
• Finding which sources have extracted chemistry
• Mapping the structures to the activity values
• DIY extraction of structures not yet in a source (e.g. images and/or
IUPAC strings)
• Collating an SAR table
• Best to get familiar with in-depth functionality of a few sources
• Many roads lead to Rome so difficult to know which is most efficient
• I certainly have not tried all those of probable utility
20. BindingDB
• Pre-cooked expert curation
• Modest but steady growth
• Easy to browse list
• Structures > PubChem and subsumed > ChEMBL
• Targets mapped to UniProt even for titles with no target
• Many search features, some unique (i.e. different to ChEMBL)
• Novel targets from patents and unique journal selection
• Download full SAR sets example no. > structure > activity > target
• Lag time in PubChem indexing
• No anti-infective whole organism targets
• US publications some years behind the WO first pubs
• Dependent on CWU structures that are not all correct
22. WIPO PATENTSCOPE
• Comprehensive and up to date
• Instant metrics (yellow highlight) as you toggle search parameters
• Sign in for saved searches
• Useful instant graphics on result lists
• Search reports can “walk” you to other relevant filings
• Pithy examiner comments (almost) amusing
• Limited text search fields
• In-line table images, pros and cons
• Slow image loading
• Inventor/applicant conflation
23. The WIPO “gift horse”
• ~ 7 million strucs, WO and US from 1978
• WIPO collab w. InfoChem and NextMove
• False-negatives (i.e. examples missed)
• Not yet in PubChem
• Limited utility for SAR mining so far
24. Source: EPO Espacenet
I prefer WIPO as a search portal but Espacenet is useful for INPADOC families
26. SureChEMBL
• For SAR extraction the best first-stop-shop (after BindingDB)
• Chemistry indexed a week or less from publication date
• Family-wide structure downloads
• Powerful combination of filters and search functionality
• Multiple source x-refs including PubChem and ChEMBL
• Can correct IUPAC failures and paste out example blocks
• Usual caveats of CNER (but hey, 18 mill structures for free)
• Extraction confounded by dense image tables
• WIPOs less well extracted than USPTOs (but OCR not their fault)
• Overhead of futile common chemistry extraction
• Slow image load times and structure step-through
• Need to watch PubChem load dates (via SIDs)
• The feature that never appeared :(
28. PubChem
• Mother of all searchable portals with 22 mill patent compounds
• SureChEMBL, ChEMBL, BindingDB and IBM are in it
• Massive feature set including Entrez
• Patent and PubMed connectivity via structure
• Very useful Identifier Exchange Service for set mapping
• Can upload SD files (e.g. from Chemicalize or SciFinder)
• Transparent and navigable chemistry rules (e.g. “same connectivity”)
• Slice ‘n dice full Boolean search history
• Extensive filter options
• Direct Venn from CID lists < 10,000
• Similar compound clustering > isolate an SAR series
• Can “walk” through chemical neighbourhood > cluster > cluster patent hop
by chemotype (target neutral)
• Navigation can be daunting
• Some large sources should be kicked out IMHO
• Interface queries often time out
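Outside the PubChem interface, a rough stand-in for the “same connectivity” idea is to group structures by the first block of their standard InChIKey, which hashes the connectivity while the second block covers stereochemistry and isotopes. The sketch below is an approximation under that assumption, not PubChem's own rule set; the keys are examples (alanine stereo forms and aspirin) and should be verified before reuse.

```python
from collections import defaultdict


def connectivity_block(inchikey: str) -> str:
    """Return the first (14-character) block of a standard InChIKey."""
    return inchikey.split("-")[0]


def group_by_connectivity(inchikeys):
    """Group InChIKeys that share the same connectivity block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


# Example keys only; verify against PubChem before relying on them
keys = [
    "QNAYBMKLOCPYGJ-REOHZSCMSA-N",  # L-alanine
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # alanine, stereo unspecified
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin
]

for block, members in group_by_connectivity(keys).items():
    print(block, len(members))
```

Collapsing patent-extracted structures this way can help when the same compound appears with and without defined stereochemistry in different sources.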
31. ChEMBL
• Gateway to chemistry manually extracted from journals
• 0.39 mill structures mapped across to SureChEMBL
• This gives direct journal < > patent connectivity
• Powerful query, filtration, browsing and target indexing
• Release 23 has 1.02 mill structures and assay data from 67,722 papers
• Circular subsumption of 0.5 mill structures from confirmed PubChem
Bioassays
• Integrates the BindingDB patent curation (but sync lag)
• Indexed in PubChem BioAssay
• Target-linked entries subsumed into BindingDB
• Linked to EPMC
• Good for paper <> patent
• Not linked to PubMed
• Up to 2 year lag for papers
• Selective journal capture
[Venn diagram: ChEMBL 1.34 mill, SureChEMBL 17.23 mill, overlap 0.39 mill]
33. Europe PubMed Central
• Fully featured literature search functionality
• Big plus is the (HAS_CHEMBL:y) select for chemistry
• Gives query > paper > ChEMBL chemistry > SureChEMBL and/or >
PubChem
• Bioentity mark up from other sources
• De facto two-stop shop with PubMed which has different functionality
• Warning, their patent abstracts not updated since 2012
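The (HAS_CHEMBL:y) filter can also be used programmatically. The following is a minimal sketch that only builds a search URL for the Europe PMC REST service; the search term is an example taken from the introductory slide, the helper name is hypothetical, and the request itself is left to the reader.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def chembl_filtered_search_url(term: str, page_size: int = 25) -> str:
    """Build a search URL restricted to papers with ChEMBL-curated chemistry."""
    params = {
        "query": f"{term} AND (HAS_CHEMBL:y)",
        "format": "json",
        "pageSize": page_size,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = chembl_filtered_search_url("N-myristoyltransferase")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching papers as JSON.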
35. PubMed
• Largest entry point to connect Med Chem papers < > patents
• Entities disclosed, i.e. target protein IDs, affiliations, chemical structure
• Power of Entrez, including MeSH
• PubMed > PDB (via MMDB) good for CID of ligands > patent
• Can connect inventors with unusual names
• However, papers typically ~ 2 years behind “fresh” patents
• May find enough SAR for popular targets not to bother with patents
• But (unless paper in ChEMBL) you may have to DIY extract entities
including chemistry
• Patents good at citing papers (US mandated to be thorough)
• However, many authors avoid citing their patents
• Connect into literature via targets and diseases and thence > patents
• JFTR disease searching in patent text largely useless (titles maybe)
• Patent reviews valuable but tend to be in hard-to-get journals
42. Status of human targets from open sources
(as UniProt x-refs)
[Figure: comparison for Oct 2016 and Oct 2017]
• Most have chemistry > target via papers (thus can search patents)
• Outer limit of data-supported druggable proteome
• Some patent only in BindingDB
44. Patent retrieval by target names
In: Lecture Notes in Bioinformatics (ISBN 978-3-642-15119-4), P Lambrix and
G Kemp (Eds.), Springer Verlag, pp 106-121, 2010
45. Classification of target names in titles
AWK Gene and protein names can be noisy and inconsistently used
by applicants but HGNC-approved symbol usage seems to be improving
48. Utility of tools
• Can re-run IUPACs and images where automated
conversion failed
• Synergies of gap filling from working between the original
document, the SureChEMBL output and the OPSIN and
OSRA tools
• Can run on PubMed abstracts, individually or bulk
• Can isolate the example series of structures that have the SAR
• Useful for extraction from papers not in ChEMBL
• May be necessary to convert between formats e.g. for
uploading to PubChem
50. Venny
• Excellent for set comparisons of any strings < over 10,000
• E.g. CIDs, InChIKeys or UniProt IDs
• It automatically de-duplicates
• Download complete intersects and diffs from any segment of the Venn
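The same kind of comparison can be sketched with plain Python sets, which also de-duplicate automatically. The identifier lists below are hypothetical placeholders standing in for CIDs, InChIKeys or UniProt IDs pasted from two sources.

```python
def venn2(list_a, list_b):
    """Return the three segments of a two-set Venn as sorted lists."""
    set_a, set_b = set(list_a), set(list_b)   # sets drop duplicates
    return {
        "both": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Hypothetical identifier lists (note the duplicate in the first one)
from_patent = ["1001", "1002", "1003", "1003", "1004"]
from_paper = ["1003", "1004", "1005"]

segments = venn2(from_patent, from_paper)
for name, members in segments.items():
    print(name, len(members), members)
```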
57. OPSIN for IUPAC names
• Conversion of compound 19 from WO2016096979 after fixing OCR errors
• N-r3-r(26',3i?)-5-amino-2-methyl-3-(trifluoromethyl)-3,4- dihvdropyrrol-2-yl1-
4-fluoro-phenyl1-5-chloro-pyridine-2-carboxamide.
• Good for iterative correction via error flagging that Chemicalize will not
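The kind of OCR repair needed before a name will parse can be sketched as an ordered list of substitutions. The rules below are specific to this one garbled string and reflect one reading of the misrecognised characters (for example "r" for "[", "1" for "]", "6'" for "S" and "i?" for "R"); the result should be checked against the original PDF before being submitted to OPSIN.

```python
import re

# Name as extracted, with OCR errors and a line-wrap space
raw = (
    "N-r3-r(26',3i?)-5-amino-2-methyl-3-(trifluoromethyl)-3,4- dihvdropyrrol-2-yl1-"
    "4-fluoro-phenyl1-5-chloro-pyridine-2-carboxamide."
)

# Ordered (pattern, replacement) rules; each is an assumption about one misread
FIXES = [
    (r"\s+", ""),              # remove whitespace left by line wrapping
    (r"\.$", ""),              # drop the trailing full stop
    (r"-r(?=[\d(])", "-["),    # 'r' misread for '[' before a locant or '('
    (r"(?<=yl)1(?=-)", "]"),   # '1' misread for ']' after a 'yl' suffix
    (r"6'", "S"),              # italic S stereodescriptor
    (r"i\?", "R"),             # italic R stereodescriptor
    (r"hvdro", "hydro"),       # 'y' misread as 'v'
]


def clean_name(name: str) -> str:
    """Apply each substitution in turn and return the repaired name."""
    for pattern, replacement in FIXES:
        name = re.sub(pattern, replacement, name)
    return name


cleaned = clean_name(raw)
print(cleaned)
```

Keeping the rules as data makes it easy to add one more fix each time the name-to-structure tool flags a new error position.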
60. Collating the hard way
• Three versions of the SAR table from WO2016096979
• On the left is the original from page 64 of the PDF
• In the centre is the corresponding section of the SureChEMBL mark-up
• The right hand panel is an Excel paste-across of the centre section
• But you have to complete by pasting SMILES of structures on previous page
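The final pasting step amounts to joining two lookups on the example number. The sketch below uses hypothetical placeholder data (neither the SMILES nor the activity values come from WO2016096979) and leaves the SMILES blank where a structure is missing, so that gaps are visible in the collated table.

```python
import csv
import io

# Hypothetical placeholders: example number -> IC50 (nM) from the activity table
activities = {"1": "12", "2": "250", "3": "8"}
# Hypothetical placeholders: example number -> SMILES from the structure pages
smiles = {"1": "CCO", "3": "CC(=O)O"}   # example 2 not yet extracted

FIELDS = ["example", "smiles", "ic50_nm"]


def collate(activity_by_example, smiles_by_example):
    """Join activity and structure on example number, keeping gaps visible."""
    return [
        {
            "example": ex,
            "smiles": smiles_by_example.get(ex, ""),
            "ic50_nm": activity_by_example[ex],
        }
        for ex in sorted(activity_by_example, key=int)
    ]


def to_tsv(rows):
    """Serialise rows as tab-separated text for pasting into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = collate(activities, smiles)
print(to_tsv(rows))
```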
63. Wish list
Yup, we can dig a lot of SAR out of patents
But wouldn’t it be nice if…..
• Clarivate re-instated the Derwent patent chemistry feed to PubChem
• Open standard SAR modelling tools (with AI natch’) maybe KNIME?
(table in > model out)
• These might show large patent SAR sets better than from papers
• Someone indexed full text patents by gene name counts inside the
description section (SureChEMBL for Open PHACTS?)
• SureChEMBL would finally bring in their document section stats
• Run the SureChEMBL engine on full-text papers and PubMed abstracts
• European PubMed Central updated their EPO C07/A61 patent abstracts
from 2012
• We could paste large text chunks > Chemicalize but not run out of points
• Patents could be more like good papers….
64. Could the future be automatic?
https://www.slideshare.net/NextMoveSoftware