www.guidetopharmacology.org
The Open Patent Chemistry “Big Bang”:
Implications, Opportunities and Caveats
Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY,
Centre for Integrative Physiology, University of Edinburgh
http://www.slideshare.net/cdsouthan/the-open-patent-chemistry-big-bang-implications-opportunities-and-caveats
Prepared for
Outline
• Big Bang in PubChem
• Balancing IP against bioactivity mining
• Relative source coverage
• Comparing Mwts
• Activity gap
• Unique content
• Mixtures
• CWUs
• Virtuals of various types
• Orthogonal paper
• Conclusions
• References
History of patent chemistry feeds into PubChem
• 2006 - Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
• 2011 - IBM phase 1 chemical named entity recognition (CNER), 2.5 mil
  - SLING Consortium EPO extraction, 0.1 mil
• 2012 - SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 mil
• 2013 - SureChem, CNER + image, 9.0 mil
• 2014 - BindingDB USPTO assay extraction (CWU), 0.07 mil
• 2015 - (CNER + images + CWU)
  • SureChEMBL, 13.0 mil
  • IBM phase 2, 7.0 mil
  • NextMove Software, 1.4 mil, synthesis mapping
“Big Bang” of CNER PubChem source submissions (SIDs)
Chart annotations: SCRIPDB; IBM I; IBM II + SureChEMBL + NM
Current PubChem patent chemistry
• 31.7 mil patent-extracted structures (Oct 2015)
• = 20% of 158 mil total Substance Identifiers (SIDs)
• CIDs with patent SIDs = 17.8 mil from a total of 60.8 mil = ~30%
• 2.8 million patent document numbers indexed
• * TRP estimated and “half-open” (i.e. structures and dates are open, but document links require a Cortellis subscription)
SID counts in mil
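The two percentages quoted for patent chemistry in PubChem follow directly from the counts given. A minimal Python check, using only those figures (the variable and function names are illustrative, not from the slides):

```python
# Counts quoted on the slide, in millions (Oct 2015).
patent_sids = 31.7   # patent-extracted Substance Identifiers (SIDs)
total_sids = 158.0   # all SIDs in PubChem
patent_cids = 17.8   # Compound Identifiers (CIDs) with at least one patent SID
total_cids = 60.8    # all CIDs in PubChem


def share(part: float, whole: float) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


sid_share = share(patent_sids, total_sids)
cid_share = share(patent_cids, total_cids)
print(f"Patent SIDs: {sid_share:.1f}% of all SIDs")
print(f"Patent CIDs: {cid_share:.1f}% of all CIDs")
```

The CID share comes out slightly below the rounded 30% on the slide, which is within the rounding of the quoted counts.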
Opportunities from the Big Bang:
balancing the IP vs SAR utility split
IP assessment
• De facto crucial prior art
• Differential coverage as an adjunct to
commercial sources
• Facilitates IP mining for those who
cannot afford commercial offerings
• PubChem content is chemistry from
patents, not patented chemistry
• CNER is brainless compared to expert
IP-relevance selection
• Claim extraction generally poor
• CNER-extracted chemistry artefacts can
confound assessments (e.g. virtuals)
• Dense image tables still a coverage gap
• Major sources currently static in
PubChem (except SureChEMBL & TRP)
• Asian chemistry shortfall
• The “common chemistry” problem
Bioactivity data-mining
• Circa 5x more SAR than literature
• Chemistry > data via PubChem patent number indexing > free full-text
• Patent families collapse to < 100K
C07D primary documents
• Advanced query options in
SureChEMBL including SciBite
bioentity mark-up
• Challenge of judging scientific quality
• Synthesis extraction (NextMove)
• Valuable intersects with papers and
targets via ChEMBL
• Easy intersecting with DIY chemistry
extraction from any document
• Only ~ 5 mil structures potentially
linkable to bioactivity data
• Thus ~ 12 million have marginal utility
• Drug structure multiplexing problem
Major PubChem CNER patent sources at the compound level:
structural corroboration but also divergence
SCRIPDB = 4.0 (SID:CID 1.5)
IBM = 7.9 (SID:CID 1.2)
SureChEMBL = 14.6 (SID:CID 1.0)
Venn segments: 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95
Counts are Compound Identifiers (CIDs) in millions, with a union of 17.8
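As a sanity check on the three-source comparison, the seven Venn segment counts should add up to the stated union. The sketch below sums them. Which segment belongs to which region is not stated in the slide text, so only the total is checked.

```python
# Venn segment counts from the slide, in millions of CIDs.
segments = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]
stated_union = 17.8

# Disjoint segments partition the union, so their sum should match it
# to within the rounding of the slide's figures.
computed_union = sum(segments)
difference = abs(computed_union - stated_union)
print(f"sum of segments = {computed_union:.2f} mil, stated union = {stated_union} mil")
```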
Patent CNER vs manual bioactivity sources in PubChem:
structural corroboration but also divergence
SCRIPDB + IBM + SureChEMBL = 17.8
Thomson (Reuters) Pharma = 4.3
ChEMBL = 1.4
Venn segments: 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55
Counts are CIDs in millions
Mw plots indicate the CNER fragmentation problem
The bioactivity-gap:
majority of patent chemistry has no linked data
1.8 mil CNER CIDs
Compare with a bioactivity-focussed source, e.g. Guide to PHARMACOLOGY (GtoPdb): 6037 CIDs
Patent-unique structures : a mixed blessing
Patent-picking: vendors listing probable non-stock structures
Has been reduced since the recent deprecation of 20 million Angene SIDs
CNER whitespace problem: mixtures from WO2010053438
US6589997: missing punctuation > CNER fails and mixtures
NextMove
SureChEMBL (have now fixed this document)
Mixture extractions: more problematic than useful
N.B. PubChem ameliorates the issue by splitting all SID/CID mixtures into component CIDs while maintaining the back-mapping
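The splitting idea can be illustrated on SMILES strings, where a "." separates the disconnected components of a mixture. This is a minimal sketch only: the function name and the record identifiers are hypothetical, and PubChem's own processing operates on its standardised records rather than on raw strings like these.

```python
from collections import defaultdict


def split_mixtures(records: dict[str, str]) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Split dot-disconnected SMILES into components.

    Returns (forward, back): forward maps each record ID to its component
    SMILES; back maps each component to the record IDs that contain it.
    """
    forward: dict[str, list[str]] = {}
    back: dict[str, list[str]] = defaultdict(list)
    for record_id, smiles in records.items():
        # "." is the SMILES separator between disconnected components.
        components = [part for part in smiles.split(".") if part]
        forward[record_id] = components
        for component in components:
            if record_id not in back[component]:
                back[component].append(record_id)
    return forward, dict(back)


# Hypothetical records: one mixture (ethanol + acetic acid) and one single compound.
records = {"rec1": "CCO.CC(=O)O", "rec2": "CCO"}
forward, back = split_mixtures(records)
print(forward)
print(back)
```

Keeping both mappings means a component can be searched on its own while the mixture it came from stays traceable.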
CWU chemistry: from the sublime…
To the ridiculous…. “Chessbordane” CWU virtuals
C362H422
Virtuals II: stereo enumerations from US20080085923
260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and DiscoveryGate
Virtuals III: deuterated enumerations from US20080045558
986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
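The SID:CID ratio used on the earlier source-comparison slide can also be computed for the two enumerated sets, as a rough measure of how often the same virtual structure was submitted by more than one source. A small sketch using only the counts quoted on these slides (the dictionary labels are illustrative):

```python
# (CIDs, SIDs) quoted on the two "virtuals" slides.
enumerations = {
    "US20080085923 stereo enumerations": (260, 581),
    "US20080045558 deuterated enumerations": (986, 2818),
}

ratios = {}
for label, (cids, sids) in enumerations.items():
    # SIDs per CID: average number of submitted records per unique structure.
    ratios[label] = sids / cids
    print(f"{label}: SID:CID = {ratios[label]:.2f}")
```

Both ratios are well above the 1.0 to 1.5 range quoted for the individual CNER sources, consistent with several sources extracting the same enumerated structures.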
Very virtual: d100 dalbavancin
Submitted to PubChem by Thomson Pharma (only) on 16 March 2009
Recent orthogonal analysis of Big Bang impact
• Compares SureChEMBL and IBM with SciFinder and Reaxys for a small
patent set (i.e. open vs commercial)
• Concludes: “50–66 % of the relevant content from the latter was also found in the former”
• Equivalent comparisons executed in PubChem, along the lines presented
here, would record a higher overlap
• This would be via contributions from the other three open sources and
mixture splitting
• Note that the update schedule for SureChEMBL in PubChem will be quarterly, but new patent chemistry surfaces in SureChEMBL at the EBI within 2-4 days and is refreshed in the EBI UniChem resource ~monthly
Managing expectations: assessment of chemistry databases generated by
automated extraction of chemical structures from patents, Senger, et al. J.
Cheminf. 2015, 7:49 doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
Conclusions
• The “Big Bang” value massively outweighs the caveats
• All sources contributing to open patent chemistry are to be congratulated, as is PubChem for wrangling them
• PubChem slice-and-dice functionality is informative for comparing sources
• Bioactivity mining is extensively enabled but still challenging
• IP assessment also not straightforward but playing field has levelled
• But we do need to look the gift horse in the mouth
• Important to resolve and understand quirks, artefacts and pitfalls
• PubChem filters can partially ameliorate some of these
• Between open and commercial we are approaching the best of both worlds
• It will be interesting to see where we go from here
References and questions please
http://cdsouthan.blogspot.com/ 19 posts have the tag “patents”
http://www.ncbi.nlm.nih.gov/pubmed/26194581 http://www.ncbi.nlm.nih.gov/pubmed/23506624
(with PubMed Commons data link)
N.b. from the aspect of reproducibility, anyone needing technical tips to reproduce or
extend the PubChem queries used for these slides is welcome to contact me
www.ncbi.nlm.nih.gov/pubmed/25415348
ACS “Deuterogate” slides: http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
//nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037

AI-SDV 2022: New Insights from Trademarks with Natural Language Processing Al...
 
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
AI-SDV 2022: Extracting information from tables in documents Holger Keibel (K...
 
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
AI-SDV 2022: Scientific publishing in the age of data mining and artificial i...
 
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
AI-SDV 2022: AI developments and usability Linus Wretblad (IPscreener / Uppdr...
 
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
AI-SDV 2022: Where’s the one about…? Looney Tunes® Revisited Jay Ven Eman (CE...
 
AI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance CenterAI-SDV 2022: Copyright Clearance Center
AI-SDV 2022: Copyright Clearance Center
 
AI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IPAI-SDV 2022: Lighthouse IP
AI-SDV 2022: Lighthouse IP
 
AI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOCAI-SDV 2022: New Product Introductions: CENTREDOC
AI-SDV 2022: New Product Introductions: CENTREDOC
 
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
AI-SDV 2022: Possibilities and limitations of AI-boosted multi-categorization...
 
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
AI-SDV 2022: Big data analytics platform at Bayer – Turning bits into insight...
 

The open patent chemistry “big bang”: Implications, opportunities and caveats

  • 1. The Open Patent Chemistry “Big Bang”: Implications, Opportunities and Caveats
      • Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Centre for Integrative Physiology, University of Edinburgh
      • www.guidetopharmacology.org
      • http://www.slideshare.net/cdsouthan/the-open-patent-chemistry-big-bang-implications-opportunities-and-caveats
  • 2. Outline
      • Big Bang in PubChem
      • Balancing IP against bioactivity mining
      • Relative source coverage
      • Comparing Mwts
      • Activity gap
      • Unique content
      • Mixtures
      • CWUs
      • Virtuals of various types
      • Orthogonal paper
      • Conclusions
      • References
  • 3. History of patent chemistry feeds into PubChem
      • 2006: Thomson (Reuters) Pharma (TRP) manual extractions from patents and papers (now 4.3 mil, ~40% patents)
      • 2011: IBM phase 1 chemical named entity recognition (CNER), 2.5 mil; SLING Consortium EPO extraction, 0.1 mil
      • 2012: SCRIPDB, CNER plus Complex Work Units (CWU), 4.0 mil
      • 2013: SureChem, CNER + image, 9.0 mil
      • 2014: BindingDB USPTO assay extraction (CWU), 0.07 mil
      • 2015 (CNER + images + CWU): SureChEMBL, 13.0 mil; IBM phase 2, 7.0 mil; NextMove Software, 1.4 mil synthesis mapping
  • 4. “Big Bang” of CNER PubChem source submissions (SIDs)
      • [Figure: chart of SID submissions; labels: IBM II + SureChEMBL + NM, IBM I, SCRIPDB]
  • 5. Current PubChem patent chemistry
      • 31.7 mil patent-extracted structures (Oct 2015)
      • = 20% of 158 mil total Substance Identifiers (SIDs)
      • CIDs with patent SIDs = 17.8 mil from a total of 60.8 mil = 30%
      • 2.8 million patent document numbers indexed
      • * TRP is estimated and “half-open” (i.e. structures and dates are open, but document links require a Cortellis subscription)
      • [Figure: SID counts in mil]
  • 6. Opportunities from the Big Bang: balancing the IP vs SAR utility split
      IP assessment
      • De facto crucial prior art
      • Differential coverage as an adjunct to commercial sources
      • Facilitates IP mining for those who cannot afford commercial offerings
      • PubChem content is chemistry from patents, not patented chemistry
      • CNER is brainless compared to expert IP-relevance selection
      • Claim extraction generally poor
      • CNER-extracted chemistry artefacts can confound assessments (e.g. virtuals)
      • Dense image tables still a coverage gap
      • Major sources currently static in PubChem (except SureChEMBL & TRP)
      • Asian chemistry shortfall
      • The “common chemistry” problem
      Bioactivity data-mining
      • Circa 5x more SAR than literature
      • Chemistry > data via PubChem patent number indexing > free full-text
      • Patent families collapse to < 100K C07D primary documents
      • Advanced query options in SureChEMBL including SciBite bioentity mark-up
      • Challenge of judging scientific quality
      • Synthesis extraction (NextMove)
      • Valuable intersects with papers and targets via ChEMBL
      • Easy intersecting with DIY chemistry extraction from any document
      • Only ~ 5 mil structures potentially linkable to bioactivity data
      • Thus ~ 12 million have marginal utility
      • Drug structure multiplexing problem
  • 7. Major PubChem CNER patent sources at the compound level: structural corroboration but also divergence
      • SCRIPDB = 4.0 (SID:CID 1.5)
      • IBM = 7.9 (SID:CID 1.2)
      • SureChEMBL = 14.6 (SID:CID 1.0)
      • [Figure: three-way Venn diagram; segment counts 0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]
      • Counts are Compound Identifiers (CIDs) in millions with a union of 17.8
  • 8. Patent CNER vs manual bioactivity sources in PubChem: structural corroboration but also divergence
      • SCRIPDB + IBM + SureChEMBL = 17.8
      • Thomson (Reuters) Pharma = 4.3
      • ChEMBL = 1.4
      • [Figure: three-way Venn diagram; segment counts 16.13, 0.18, 0.12, 0.90, 1.35, 0.26, 2.55]
      • Counts are CIDs in millions
  • 9. Mw plots indicate the CNER fragmentation problem
  • 10. The bioactivity gap: majority of patent chemistry has no linked data
      • 1.8 mil CNER CIDs
      • Compare with a bioactivity-focussed source, e.g. Guide to PHARMACOLOGY (GtoPdb): 6037 CIDs
  • 11. Patent-unique structures: a mixed blessing
  • 12. Patent-picking: vendors listing probable non-stock structures
      • Has been reduced since the recent deprecation of 20 million Angene SIDs
  • 13. CNER whitespace problem: mixtures from WO2010053438
  • 14. US6589997: missing punctuation > CNER fails and mixtures
      • [Figure labels: NextMove; SureChEMBL (have now fixed this document)]
  • 15. Mixture extractions: more problematic than useful
      • N.b. PubChem ameliorates the issue by splitting all SID/CID mixtures to component CIDs while maintaining the back-mapping
  • 16. CWU chemistry: from the sublime…
  • 17. To the ridiculous… “Chessbordane” CWU virtuals
      • C362H422
  • 18. Virtuals II: stereo enumerations from US 20080085923
      • 260 CIDs > 581 SIDs from IBM, SureChEMBL, SCRIPDB, Thomson Pharma and Discovery Gate
  • 19. Virtuals III: deuterated enumerations from US20080045558
      • 986 deuterated CIDs > 2818 SIDs from IBM, SureChEMBL and SCRIPDB
  • 20. Very virtual: d100 dalbavancin
      • Submitted to PubChem by Thomson Pharma (only) on 16 March 2009
  • 21. Recent orthogonal analysis of Big Bang impact
      • Compares SureChEMBL and IBM with SciFinder and Reaxys for a small patent set (i.e. open vs commercial)
      • Concludes: “50–66 % of the relevant content from the latter was also found in the former”
      • Equivalent comparisons executed in PubChem, along the lines presented here, would record a higher overlap
      • This would be via contributions from the other three open sources and mixture splitting
      • Note that the update schedule for SureChEMBL in PubChem will be quarterly, but new patent chemistry surfaces in SureChEMBL at the EBI within 2-4 days and is refreshed in the EBI UniChem resource ~ monthly
      • Reference: Senger, et al. Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents. J. Cheminf. 2015, 7:49, doi:10.1186/s13321-015-0097-z (GSK and SureChEMBL)
  • 22. Conclusions
      • The “Big Bang” value massively outweighs the caveats
      • All sources contributing to open patent chemistry are to be congratulated, and PubChem for wrangling them
      • PubChem slice-and-dice functionality is informative for comparing sources
      • Bioactivity mining is extensively enabled but still challenging
      • IP assessment also not straightforward but playing field has levelled
      • But we do need to look the gift horse in the mouth
      • Important to resolve and understand quirks, artefacts and pitfalls
      • PubChem filters can partially ameliorate some of these
      • Between open and commercial we are approaching the best of both worlds
      • It will be interesting to see where we go from here
  • 23. References and questions please
      • http://cdsouthan.blogspot.com/ (19 posts have the tag “patents”)
      • http://www.ncbi.nlm.nih.gov/pubmed/26194581
      • http://www.ncbi.nlm.nih.gov/pubmed/23506624 (with PubMed Commons data link)
      • www.ncbi.nlm.nih.gov/pubmed/25415348
      • ACS “Deuterogate” slides: http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs
      • //nar.oxfordjournals.org/content/early/2015/10/11/nar.gkv1037
      • N.b. from the aspect of reproducibility, anyone needing technical tips to reproduce or extend the PubChem queries used for these slides is welcome to contact me
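
The headline figures on slides 5 and 7 can be cross-checked with simple arithmetic. The sketch below is not part of the original slides: the variable names are illustrative, and it assumes that the seven numbers listed on slide 7 are the disjoint segments of the three-way Venn diagram, so that their sum should approximate the stated union of 17.8 mil CIDs.

```python
# Figures quoted on slide 5 (all in millions, Oct 2015).
patent_sids = 31.7   # patent-extracted SIDs
total_sids = 158.0   # all SIDs in PubChem
patent_cids = 17.8   # CIDs with at least one patent SID
total_cids = 60.8    # all CIDs in PubChem


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


sid_share = percent(patent_sids, total_sids)
cid_share = percent(patent_cids, total_cids)

# Slide 7: seven Venn segment counts (millions of CIDs).
# Assumption: the segments are disjoint, so the union is their sum.
# Which segment belongs to which source overlap is not recoverable
# from the transcript, and is not needed for the union.
venn_segments = [0.66, 2.12, 0.67, 8.56, 0.53, 3.26, 1.95]
union_cids = sum(venn_segments)

print(f"Patent share of SIDs: {sid_share:.1f}%")
print(f"Patent share of CIDs: {cid_share:.1f}%")
print(f"Union of CNER source CIDs: {union_cids:.2f} mil")
```

Under these assumptions the SID share comes out at about 20%, the CID share at a little over 29% (quoted as 30% on the slide), and the segment sum lands within rounding of the stated 17.8 mil union.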