ICIC 2017: Tutorial - Digging bioactive chemistry out of patents using open resources

Digging bioactive chemistry out of
patents using open resources
• While the raison d'être of patents is Intellectual Property (IP), there is a
growing awareness of the scientific value of their data content. This is
particularly so in medicinal chemistry and associated bioactivity domains,
where disclosed compounds and associated data not only exceed those
published in papers by several-fold and surface years earlier, but are also,
paradoxically, completely open (i.e. no paywalls). Scientists have
traditionally extracted their own relationships or used commercial sources,
but the last few years have seen a “big bang” in patent extractions
submitted to open databases, including over 20 million structures now in
PubChem.

Outline
• Statistics of patent chemistry in various sources
• Open resources, databases and tools
• Target identification
• Bioactivity and SAR extraction
• Connecting these relationships to papers
• Medicinal chemistry patent mining
• Exercises using antimalarial research as examples
• Complementarity with commercial resources
• Competitive Intelligence
N.B. Not in scope just now: web services, APIs, RDF or SAR modelling per se
This is a suggested list that can be extended to related topics attendees
would like to cover (at least if within the cognisance of the presenter!)

Biog
Chris Southan joined the IUPHAR/BPS Guide to Pharmacology database
curation team as Senior Cheminformatician in 2013. Previously he was a
Drug Discovery Consultant at TW2Informatics in Göteborg, Sweden, working
on patent informatics. Prior to this he was a contractor for AstraZeneca
Knowledge Engineering (2009-2011), working on Chemistry Connect and
Pharma Connect. Earlier positions include the ELIXIR Database Provider
Survey for the EBI (2008-9), Principal Scientist and Bioinformatics Team
Leader at AstraZeneca (2004-7) and senior bioinformatics positions in
Oxford Glycosciences (2002-3), Gemini Genomics (2001) and SmithKline
Beecham (1987–2000). He has a PhD from the University of Munich, an M.Sc.
in Virology from Reading University and a B.Sc. Hons. in Biochemistry from
Dundee University. Further information on LinkedIn
IUPHAR/BPS Guide to PHARMACOLOGY
Publications: PubMed, ORCID iD 0000-0001-9580-0446
Blog: Bio < > Chem
Presentations: Slideshare
Twitter: https://twitter.com/cdsouthan
TW2Informatics: https://sites.google.com/view/tw2informatics/home

Audience assumptions
• Some familiarity with SAR distillation from the literature
• Many of you could extract examples from a patent by hand
• Database cognisance, including PubMed and PubChem (SID, CID)
• More interest in recent than historical SAR
• Not obsessively concerned with false-negatives (i.e. missed data)
• Not greatly perturbed by the fuzziness of public sources (that you might
grumble about for commercial ones)
• Familiar with the mess of patent families and Kind codes
• Familiar with protein names and identifiers
• Familiar with obfuscation that can confound SAR extraction
• Focused on Med Chem for human diseases
• Most of this tutorial could apply across other domains (e.g. IPC code
A01N for pesticides and herbicides)
• No boundaries between Drug Discovery and Chemical Biology
• Aware academic Drug Discovery is accelerating relative to commercial

References (I)
Chapter in: Samuel Chackalamannil, Rotella and Ward, (eds.) Comprehensive
Medicinal Chemistry III vol. 3, pp. 464–487. Oxford: Elsevier.
http://dx.doi.org/10.1016/B978-0-12-409547-2.13814-4, ISBN: 9780128032008
https://www.ncbi.nlm.nih.gov/pubmed/26194581

References (II)

Core assumption: can we believe patent SAR results?
• We know the data have value but it is difficult to extrinsically assess quality
• As for other domains, Med Chem has an experimental reproducibility crisis
• This reflects equivocality w.r.t. antibodies, cell lines and chemistry (e.g.
supplier purity and probes vs PAINS)
• For patents high-replicate error ranges are rarely included
• Re-synthesis fidelity also rarely reported (ever?)
• Cf. “Dispensing processes impact apparent biological activity as determined
by computational and statistical analyses” (PMID 23658723)
• We could hope that internal relative SAR across a series is more consistent
than externally comparative absolute numbers
• We know some inventor teams are world-class, well cited medicinal chemists
but can we assess the less famous?
• The same QC considerations apply to papers
• ChEMBL surfaces the worryingly wide IC50/Ki/Kd ranges on nominally same
assays from different papers
• We can also intersect some patent and paper values
• Is the internal consistency of patent-derived SAR models a useful QC?

Introductory example
• 138 detailed descriptions of the series
• WO2013083991: SureChEMBL - PubChem
• IC50 cross-reactivity data from no less than five cell-based enzyme assays
• Human NMT1 (P30419), human NMT2 (O60551), Plasmodium vivax (A5K1A2),
Plasmodium falciparum (Q8ILW6) and Leishmania donovani (Q8ILW6)
• https://cdsouthan.blogspot.se/2013/07/n-myristoyltransferase-patent-and-pdb.html

So how much useful SAR is in the patent corpus?
• Definition for SAR: Bioactivity assay "A" (e.g. for an enzyme) with a
quantitative result "R" (e.g. an IC50) for a compound "C" (defined
chemical structure) as an activity modulator (e.g. inhibition) of protein
target "P" (also for cellular targets, e.g. anti-infectives)
• A useful shorthand for this mapping, with "D" for the source document,
is “D-A-R-C-P”
• Excelra (ex GVKBIO) provides a good statistical starting point
• https://www.slideshare.net/cdsouthan/largescale-curation-of-bioactive-chemistry-from-patents-and-papers
• April 2017 numbers were 1.34 mill cpds from 112K papers and 3.35
mill from 71K patents, with 0.18 million overlap
• From the earlier PMID 24204758, 12 cpds/paper and 46/patent
• Human protein targets: 3383 in the former, 2431 in the latter, 3882
combined, with 546 patent-only
• The Excelra absolute activity numbers are dependent on their capping
rules for binned data (e.g. IC50 between 10 and 100 nM)
• Binned data still useful for modelling
• Where are the enzyme activators?
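As a rough cross-check of the Excelra figures above, here is a minimal Python sketch of the arithmetic. The variable names are mine and the inputs are only the rounded counts quoted on the slide, so the outputs are approximate.

```python
# Rounded counts quoted on the slide (Excelra, April 2017).
paper_cpds, papers = 1_340_000, 112_000
patent_cpds, patents = 3_350_000, 71_000
overlap = 180_000  # compounds found in both papers and patents


def per_document(compounds: int, documents: int) -> float:
    """Average number of compounds extracted per document."""
    return compounds / documents


cpds_per_paper = per_document(paper_cpds, papers)
cpds_per_patent = per_document(patent_cpds, patents)

# Distinct compounds across both document types (inclusion-exclusion).
union = paper_cpds + patent_cpds - overlap

# Share of patent compounds that never appear in the paper set.
patent_only_fraction = (patent_cpds - overlap) / patent_cpds

print(f"{cpds_per_paper:.1f} cpds/paper, {cpds_per_patent:.1f} cpds/patent")
print(f"union = {union:,}; patent-only share = {patent_only_fraction:.0%}")
```

The ratios come out at roughly 12 cpds/paper and 47 cpds/patent, in line with the 12 and 46 quoted from PMID 24204758.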

Independent estimates of SAR total
• WIPO PATENTSCOPE A61 and C07 PCTs = 93,253
• Not all have SAR data from novel composition of matter first-filings
• Many will be “secondary” filings (e.g. synthesis and/or crystallisation)
• Generic companies file many of these for de-risked cpds
• Some first-filings for a chemotype series may not have any activity
data disclosed (stats unknown)
• We can thus assume that extractable SAR from med chem patents in
the last five years may be in only 30,000-50,000 documents
• Guesstimate: ~50K patents ~ 3.50 million bioactive structures (cf.
Excelra 3.35 million)
• Asian patents under-represented? (i.e. are we missing unique
structures & SAR?)

BindingDB public SAR curation:
useful benchmark extraction stats
• Patents: 1,879
• Binding measurements: 199,588
• Compounds: 132,170
• Target proteins: 1,225
• Assays: 2,668
• Average Number of Targets per Patent: 1.95
• Usually primary plus a specificity paralogue cross-screen
• ~70 compounds/patent
• ~100 affinity measurements/patent
Data courtesy of Tiqing Liu and Michael Gilson, Oct 2017
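The per-patent averages above follow directly from the totals. A minimal sketch of that arithmetic is below. The final extrapolation to ~50K patents is my assumption about one way to reach a figure of the same size as the earlier 3.5 million guesstimate, not a statement of how that number was derived.

```python
# BindingDB patent curation totals quoted on the slide (Oct 2017).
patents = 1_879
measurements = 199_588
compounds = 132_170

cpds_per_patent = compounds / patents
measurements_per_patent = measurements / patents

# ASSUMPTION: the curated BindingDB ratio holds across ~50K SAR-bearing patents.
assumed_sar_patents = 50_000
extrapolated_structures = assumed_sar_patents * round(cpds_per_patent)

print(f"{cpds_per_patent:.1f} compounds/patent")
print(f"{measurements_per_patent:.1f} measurements/patent")
print(f"extrapolated structures: {extrapolated_structures:,}")
```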

Patent chemistry stats inside PubChem

The three major CNER sources inside PubChem
[Venn diagram of CIDs; counts (Oct 2017) are in millions]
IBM = 10.7
SCRIPDB = 4.0
SureChEMBL = 17.6
Segment values: 2.9, 2.4, 4.7, 10.1, 0.6, 0.4, 0.50
Union = 21.7
3-way = 2.4
3 + 2-way = 8.1
Unique = 13.5
Raises questions about corroboration vs divergence
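The summary numbers can be reproduced from the seven segment values. In the sketch below, the split of segments into single-source and two-source groups is my inference from the stated totals (it is the only grouping that matches them), not something labelled in the extracted text.

```python
# Segment values from the Venn, in millions of CIDs.
# ASSUMPTION: this grouping is inferred from the stated Unique and 3 + 2-way totals.
single_source = [10.1, 2.9, 0.5]   # found in one source only
two_source = [4.7, 0.6, 0.4]       # found in exactly two sources
three_source = 2.4                 # found in all three sources

unique = round(sum(single_source), 1)
two_way = round(sum(two_source), 1)
three_plus_two = round(three_source + two_way, 1)
union = round(unique + two_way + three_source, 1)

print(f"Unique = {unique}, 2-way = {two_way}")
print(f"3 + 2-way = {three_plus_two}, Union = {union}")
```

The segments sum to 21.6 against the stated union of 21.7; the difference is presumably rounding in the displayed values.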

The chemistry stats: wheat vs chaff
• If we accept a certain proportion of binned data as useful, the max
SAR we could expect to align is ~ 3 to 4 mill structures
• But how can we select these from the 22 million (and climbing) in
PubChem?
• The easiest way is to come in from the literature with clean structures
• This can expand the SAR around a target anywhere from 2 to 10-fold
• But we have an unknown statistic: what proportion of good patent SAR
sets, including for novel targets, never get into a paper? (examples
anyone?)
• Excelra have some relevant stats on this – does anyone else?

Sources offer a broad spectrum of utilities
• Connecting to patents via structures from papers
• Connecting via targets and/or diseases from papers
• Proximity “walking”: doc <> doc, target <> target, struc <> struc
• Finding patents via metadata (e.g. assignee, target and date)
• Viewing chemistry content in the document
• Establishing if the document has useful SAR
• Finding which sources have extracted chemistry
• Mapping the structures to the activity values
• DIY extraction of structures not yet in a source (e.g. images and/or
IUPAC strings)
• Collating an SAR table
• Best to get familiar with in-depth functionality of a few sources
• Many roads lead to Rome so difficult to know which is most efficient
• I certainly have not tried all those of probable utility

Source : BindingDB target-mapped SAR extraction

BindingDB
• Pre-cooked expert curation
• Modest but steady growth
• Easy to browse list
• Structures > PubChem and subsumed > ChEMBL
• Targets mapped to UniProt even for titles with no target
• Many search features, some unique (i.e. different to ChEMBL)
• Novel targets from patents and unique journal selection
• Download full SAR sets example no. > structure > activity > target
• Lag time in PubChem indexing
• No anti-infective whole organism targets
• US publications some years behind the WO first pubs
• Dependent on CWU structures that are not all correct

WIPO PATENTSCOPE
• Comprehensive and up to date
• Instant metrics (yellow highlight) as you toggle search parameters
• Sign in for saved searches
• Useful instant graphics on result lists
• Search reports can “walk” you to other relevant filings
• Pithy examiner comments (almost) amusing
• Limited text search fields
• In-line table images, pros and cons
• Slow image loading
• Inventor/applicant conflation

The WIPO “gift horse”
• ~ 7 million strucs, WO and US from 1978
• WIPO collab w. InfoChem and NextMove
• False-negatives (i.e. examples missed)
• Not yet in PubChem
• Limited utility for SAR mining so far

Source: EPO Espacenet
I prefer WIPO as a search portal but Espacenet is useful for INPADOC families

SureChEMBL
• For SAR extraction the best first-stop-shop (after BindingDB)
• Chemistry indexed a week or less from publication date
• Family-wide structure downloads
• Powerful combination of filters and search functionality
• Multiple source x-refs including PubChem and ChEMBL
• Can correct IUPAC failures and paste out example blocks
• Usual caveats of CNER (but hey, 18 mill structures for free)
• Extraction confounded by dense image tables
• WIPOs less well extracted than USPTOs (but OCR not their fault)
• Overhead of futile common chemistry extraction
• Slow image load times and structure step-through
• Need to watch PubChem load dates (via SIDs)
• The feature that never appeared :(

PubChem
• Mother of all searchable portals with 22 mill patent compounds
• SureChEMBL, ChEMBL, BindingDB and IBM are in it
• Massive feature set including Entrez
• Patent and PubMed connectivity via structure
• Very useful Identifier Exchange Service for set mapping
• Can upload SD files (e.g. from Chemicalize or SciFinder)
• Transparent and navigable chemistry rules (e.g. “same connectivity”)
• Slice ‘n dice full Boolean search history
• Extensive filter options
• Direct Venn from CID lists < 10,000
• Similar compound clustering > isolate an SAR series
• Can “walk” through chemical neighbourhood > cluster > cluster patent hop
by chemotype (target neutral)
• Navigation can be daunting
• Some large sources should be kicked out IMHO
• Interface queries often time out
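Outside the PubChem interface, a rough analogue of the “same connectivity” rule is to group InChIKeys by their first 14-character block, which hashes the connectivity layers while the second block covers stereo and isotopes. The sketch below assumes standard InChIKeys as input. The keys shown are made-up placeholders, not real compounds, and the function name is mine.

```python
from collections import defaultdict


def group_by_connectivity(inchikeys):
    """Group InChIKeys that share the same first (connectivity) block."""
    groups = defaultdict(set)  # sets de-duplicate repeated keys
    for key in inchikeys:
        key = key.strip().upper()
        skeleton = key.split("-")[0]  # first 14-character block
        groups[skeleton].add(key)
    return dict(groups)


# Made-up placeholder keys for illustration only.
keys = [
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-JTQLQIEISA-N",  # same skeleton, different stereo block
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",  # exact duplicate
]

groups = group_by_connectivity(keys)
for skeleton, members in sorted(groups.items()):
    print(skeleton, len(members))
```

This can be handy for collapsing stereo variants of the same example compound before comparing extraction sets from different sources.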

New search interface includes patents

ChEMBL
• Gateway to chemistry manually extracted from journals
• 0.39 mill structures mapped across to SureChEMBL
• This gives direct journal < > patent connectivity
• Powerful query, filtration, browsing and target indexing
• Release 23 has 1.02 mill structures and assay data from 67,722 papers
• Circular subsumption of 0.5 mill structures from confirmed PubChem
Bioassays
• Integrates the BindingDB patent curation (but sync lag)
• Indexed in PubChem BioAssay
• Target-linked entries subsumed into BindingDB
• Linked to EPMC
• Good for paper <> patent
• Not linked to PubMed
• Up to 2 year lag for papers
• Selective journal capture
[Venn diagram: ChEMBL 1.34 mill, SureChEMBL 17.23 mill, overlap 0.39 mill]

Source: Europe PubMed Central

Europe PubMed Central
• Fully featured literature search functionality
• Big plus is the (HAS_CHEMBL:y) select for chemistry
• Gives query > paper > ChEMBL chemistry > SureChEMBL and/or >
PubChem
• Bioentity mark up from other sources
• De facto two-stop shop with PubMed which has different functionality
• Warning: their patent abstracts have not been updated since 2012
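For repeat searches it can help to build the (HAS_CHEMBL:y) filter into the query string programmatically. This is a minimal sketch only. The base URL and the "query" parameter name are my assumptions about the public search page, the search terms are illustrative, and nothing here calls a web service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# ASSUMPTION: the Europe PMC search page takes its query in a "query" parameter.
EUROPE_PMC_SEARCH = "https://europepmc.org/search"


def chembl_filtered_query(terms: str) -> str:
    """Restrict free-text terms to papers with ChEMBL-extracted chemistry."""
    return f"({terms}) AND (HAS_CHEMBL:y)"


def search_url(terms: str) -> str:
    """Return a URL-encoded search link for pasting into a browser."""
    return EUROPE_PMC_SEARCH + "?" + urlencode({"query": chembl_filtered_query(terms)})


url = search_url('"N-myristoyltransferase" AND malaria')
# Round-trip check: decoding the URL should give back the query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded)
```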

PubMed
• Largest entry point to connect Med Chem papers < > patents
• Entities disclosed, i.e. target protein IDs, affiliations, chemical structure
• Power of Entrez, including MeSH
• PubMed > PDB (via MMDB) good for CID of ligands > patent
• Can connect inventors with unusual names
• However, papers typically ~ 2 years behind “fresh” patents
• May find enough SAR for popular targets not to bother with patents
• But (unless paper in ChEMBL) you may have to DIY extract entities
including chemistry
• Patents good at citing papers (US mandated to be thorough)
• However, many authors avoid citing their patents
• Connect into literature via targets and diseases and thence > patents
• JFTR disease searching in patent text largely useless (titles maybe)
• Patent reviews valuable but tend to be in hard-to-get journals

Patent review articles: doing the groundwork for you

PubMed > PubChem > Guide to Pharmacology BACE2 page

Source: Open
Google
Date cutting, e.g. by one year, actually works

Status of human targets from open sources
(as UniProt x-refs)
[Figure panels: Oct 2016 and Oct 2017]
• Most have chemistry > target via papers (thus can search patents)
• Outer limit of data-supported druggable proteome
• Some patent only in BindingDB

Patent retrieval by target names: not so easy

Patent retrieval by target names
In: Lecture Notes in Bioinformatics (ISBN 978-3-642-15119-4), P Lambrix and
G Kemp (Eds.), Springer Verlag, pp 106-121, 2010

Classification of target names in titles
AWK Gene and protein names can be noisy and inconsistently used
by applicants but HGNC approved symbol usage seems to be improving
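Where approved symbols are used, even a simple exact-match count over titles can help triage documents. The sketch below is illustrative only. The symbol list is user-supplied, the title is invented, and exact case-sensitive matching will miss synonyms and full protein names, which is much of the noise problem noted above.

```python
import re


def count_symbols(text, symbols):
    """Count exact, case-sensitive occurrences of each symbol in the text."""
    counts = {}
    for symbol in symbols:
        # Lookarounds stop BACE1 from matching inside e.g. BACE10 or XBACE1.
        pattern = re.compile(
            rf"(?<![A-Za-z0-9]){re.escape(symbol)}(?![A-Za-z0-9])"
        )
        hits = len(pattern.findall(text))
        if hits:
            counts[symbol] = hits
    return counts


# Invented title for illustration; not taken from a real patent.
title = "Inhibitors of BACE1 and BACE2; BACE2-selective compounds and uses"
counts = count_symbols(title, ["BACE1", "BACE2", "NMT1"])
print(counts)
```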

Utility of tools
• Can re-run IUPACs and images where automated
conversion failed
• Synergies of gap filling from working between the original
document, the SureChEMBL output and the OPSIN and
OSRA tools
• Can run on PubMed abstracts, individually or bulk
• Can isolate example series of structures that have the SAR
• Useful for extraction from papers not in ChEMBL
• May be necessary to convert between formats e.g. for
uploading to PubChem

Venny
• Excellent for set comparisons of any strings < over 10,000
• E.g. CIDs, InChIKeys or UniProt IDs
• It automatically de-duplicates
• Download complete intersects and diffs from any segment of the Venn
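The same seven-segment comparison can be done locally with set operations when lists get too large for a web tool. This is a minimal sketch: the function name and segment labels are mine, and the identifiers are toy integers rather than real CIDs.

```python
def venn3(a, b, c):
    """Return the seven segments of a three-way Venn as sets."""
    a, b, c = set(a), set(b), set(c)  # set() de-duplicates the inputs
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b_only": (a & b) - c,
        "a_c_only": (a & c) - b,
        "b_c_only": (b & c) - a,
        "all_three": a & b & c,
    }


# Toy identifier lists (not real CIDs); note the duplicate in the first list.
source_a = [1, 2, 3, 3]
source_b = [2, 3, 4]
source_c = [3, 4, 5, 6]

segments = venn3(source_a, source_b, source_c)
for name, members in segments.items():
    print(name, sorted(members))
```

Each segment is returned as a set, so the intersects and diffs can be written straight out to a file for follow-up.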

PubChem Identifier Exchange

OpenBabel
Format conversions e.g. SciFinder SDF to InChIKey

Example of coverage from US9181236
• 173 BindingDB CIDs curated from PubChem via US9181236
• 405 substances SDF from SciFinder > OpenBabel > 391 IK > 362 CIDs
• 1657 rows > 834 SureChEMBL IDs > 664 CIDs
• 3-way Venn of CIDs
Chemicalize.org from ChemAxon

Chemicalize Google patent webpage result

OPSIN for IUPAC names
• Conversion of compound 19 from WO2016096979 after fixing OCR errors
• N-r3-r(26',3i?)-5-amino-2-methyl-3-(trifluoromethyl)-3,4- dihvdropyrrol-2-yl1-4-fluoro-phenyl1-5-chloro-pyridine-2-carboxamide.
• Good for iterative correction via error flagging, which Chemicalize does not provide
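The kind of manual OCR clean-up involved can be sketched as a short list of substitutions. The rules below are hypothetical: they were picked by eye for this one garbled name and are not a general OCR fixer. The corrected name is my reading of the string and is not confirmed by the patent text, so the output would still need checking in OPSIN.

```python
import re

# Raw OCR string for compound 19 as shown above.
RAW = ("N-r3-r(26',3i?)-5-amino-2-methyl-3-(trifluoromethyl)-3,4- "
       "dihvdropyrrol-2-yl1-4-fluoro-phenyl1-5-chloro-pyridine-2-carboxamide")

# Hypothetical, hand-picked rules for this name only (pattern, replacement).
OCR_RULES = [
    (r"-\s+", "-"),               # stray space after a hyphen
    (r"6'", "S"),                 # S stereodescriptor misread as 6'
    (r"i\?", "R"),                # R stereodescriptor misread as i?
    (r"hv", "hy"),                # y misread as v in "dihydro"
    (r"(?<=-)r(?=[\d(])", "["),   # opening square bracket misread as r
    (r"(?<=yl)1(?=-)", "]"),      # closing square bracket misread as 1
]


def fix_ocr_name(name: str) -> str:
    """Apply each substitution rule in order."""
    for pattern, replacement in OCR_RULES:
        name = re.sub(pattern, replacement, name)
    return name


def brackets_balanced(name: str) -> bool:
    """Cheap sanity check before pasting a name into a converter."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in name:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


fixed = fix_ocr_name(RAW)
print(fixed)
print("balanced:", brackets_balanced(fixed))
```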

Getting SAR out the hard way

Collating the hard way
• Three versions of the SAR table from WO2016096979
• On the left is the original from page 64 of the PDF
• In the centre is the corresponding section of the SureChEMBL mark-up
• The right hand panel is an Excel paste-across of the centre section
• But you have to complete it by pasting in SMILES for the structures from the previous page
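The final joining step, example number > structure > activity, can be sketched as a keyed merge written out in D-A-R-C-P order. All values below are invented placeholders (SMILES, IC50s and the target label are not from WO2016096979), and the column names are my own choice.

```python
import csv
import io

FIELDS = ["document", "example", "smiles", "ic50_nM", "target"]


def collate_sar(smiles_by_example, activity_by_example, document, target):
    """Join structures and activities on example number, skipping gaps."""
    rows = []
    for example, smiles in sorted(smiles_by_example.items()):
        if example in activity_by_example:
            rows.append({
                "document": document,
                "example": example,
                "smiles": smiles,
                "ic50_nM": activity_by_example[example],
                "target": target,
            })
    return rows


def to_csv(rows) -> str:
    """Serialise the collated rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder inputs: example 3 has a structure but no activity value.
smiles = {1: "CCO", 2: "c1ccccc1", 3: "CC(=O)O"}
activity = {1: 12.0, 2: 340.0}

rows = collate_sar(smiles, activity, "WO2016096979", "TARGET_X")
print(to_csv(rows))
```

Examples with a structure but no activity value are dropped rather than padded, so any gaps stay visible as missing example numbers.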

Getting SAR out the easy way, via BindingDB

Wish list
Yup, we can dig a lot of SAR out of patents
But wouldn’t it be nice if…..
• Clarivate re-instated the Derwent patent chemistry feed to PubChem
• Open standard SAR modelling tools (with AI natch’) maybe Knime?
(table in > model out)
• These might show large patent SAR sets better than from papers
• Someone indexed full text patents by gene name counts inside the
description section (SureChEMBL for OpenPhacts?)
• SureChEMBL would finally bring in their document section stats
• Run the SureChEMBL engine on full-text papers and PubMed abstracts
• European PubMed Central updated their EPO C07/A61 patent abstracts
from 2012
• We could paste large text chunks > Chemicalize but not run out of points
• Patents could be more like good papers….

Could the future be automatic?
https://www.slideshare.net/NextMoveSoftware
