Open, Collaborative, and Transformative:

  Exploring and Connecting Bioactive
Chemistry Across Biomedical Documents
   and Databases with Public Tools


               Christopher Southan
         TW2Informatics, Göteborg, Sweden,

          BioIT Track 11, Boston, April 2013




                                               [1]
Dr Christopher Southan, Ph.D., M.Sc., B.Sc.
TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm
Mobile: +46(0)702-530710
Skype: cdsouthan
Email: cdsouthan@hotmail.com
Twitter: http://twitter.com/#!/cdsouthan
Blog: http://cdsouthan.blogspot.com/
LinkedIn: http://www.linkedin.com/in/cdsouthan
Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications
Presentations: http://www.slideshare.net/cdsouthan




                                                                                  [2]
Abstract
Although there are ~50 million chemical structures in public databases, many
millions of bioactive compounds are still entombed in documents. In addition,
linking chemistry between patents, papers, abstracts and databases has been
patchy. However, new tools such as chemicalize.org, OPSIN, OSRA, Venny,
CheS-Mapper and InChIKey indexing by Google have transformed the
extraction, analysis and connectivity of structures from text. Extractions can also
be triaged against PubChem, which now contains 14.5 million patent-extracted
compounds from SureChemOpen, SCRIPDB, Thomson and IBM, as well as 1
million from journals via ChEMBL and PubMed. These advances present new
collaborative options, such as sharing extracted neglected disease patents with
SAR annotations on figshare.




                                                                                      [3]
Getting chemistry out of text and linking to data:
  some is done but we have to dig for the rest




                                                 [4]
Estimates for chemical text tombs


• Journal chemistry public extraction, ~10 to 20 million entombed?
• Majority of useful patent chemistry already publicly extracted, but ~5
  to 10 million still to go?
• PubMed abstracts and MeSH chemistry ~0.5 million still entombed?
• Other unique, useful, text-only (i.e. no database cross-references)
  chemistry on the web ~0.1 to 0.5 million entombed?




                                                                          [5]
What’s out there: publicly disinterred structures

    •   InChIKey in Google ~ 50 million
    •   PubChem = 48 million
    •   PubChem ROF + 250-800 Mw (lead-like) = 31 million
    •   ChemSpider = 28 million
    •   PubChem all docs (papers & patents) = 16 million
    •   PubChem patents = 15 million
    •   SureChemOpen = 13 million
    •   PubChem journal sources (PubMed + ChEMBL) = 1 million




~90% of all structures in databases have their primary origin in text sources



                                                                                [6]
Medicinal chemistry patents (tombs with lids off)

 • 18,777,229 patents, 2,208,422 WOs (i.e. ~9 per family)
 • WO, C07 or A61 = 469,856
 • WO, C07D or A61K = 235,854
 • WO, C07D = 72,737 (assignee vs. year plots below)




                                                              [7]
PubMed at 22 mill:
~ 10% with chemistry (guarded tombs)




      “Free full text” = 575,513 (24%)




                                         [8]
Top-5 Med Chem journals (4% lids off tombs)




             “Free full text” = 2671 (4.3%)
                                              [9]
Growth: (escaping the tombs)

• Patent “big bang” (SureChem & SCRIPDB in 2012)
• Literature “slow burn” (ChEMBL 2009 jump)
• Paradox - patents:papers 15:1

(both sets of CIDs cumulative)
                     [10]
Patents in PubChem:
         post-bang total vs. unique content




PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only
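
As a quick consistency check, the percentages on this slide can be converted to absolute counts. The sketch below uses only the figures quoted above; the function and variable names are illustrative, and reading "include patents" as "at least one patent source" is an assumption.

```python
# Figures quoted on the slide (April 2013 snapshot).
TOTAL_CIDS = 47_300_000   # PubChem compound identifiers (CIDs)
PCT_WITH_PATENT = 32      # % of CIDs that include a patent source (assumed meaning)
PCT_PATENT_ONLY = 20      # % of CIDs found in patent sources only


def share(total: int, percent: int) -> int:
    """Return `percent` of `total` as a whole count, using integer arithmetic."""
    return total * percent // 100


with_patent = share(TOTAL_CIDS, PCT_WITH_PATENT)
patent_only = share(TOTAL_CIDS, PCT_PATENT_ONLY)
# Patent-linked CIDs that also appear in at least one non-patent source.
also_elsewhere = with_patent - patent_only

print(f"CIDs including patents:       {with_patent:,}")
print(f"CIDs from patents only:       {patent_only:,}")
print(f"Patent CIDs also found elsewhere: {also_elsewhere:,}")
```

The first count comes out close to the "PubChem patents = 15 million" figure given earlier in the deck.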
                                                                     [11]
Citations: connections between tombs
     but still need to disinter structures

[Diagram: Papers, Abstracts and Patents, annotated with PubMed "relatedness" heuristics]

                                              [12]
Databases <> structures < > documents:
        links, but few reciprocal

[Diagram labels: Papers; Abstracts; Patents; 0.8 mill (ChEMBL); 0.2 mill (mainly MeSH); 12K; 15 mill]





                                                     [13]
Post-document retrieval: basic questions

1.    What is the name:IUPAC:image:other ratio in the document?
2.    Which tools might be appropriate for first-pass extractions?
3.    How many and what proportion of strucs can be extracted?
4.    Which SAR/in vivo/clinical data is linked to strucs?
5.    Which document sections include the key strucs?
6.    Which database entries have links (back) to this document?
7.    Which strucs have InChIKey matches in Google, & database entries?
8.    Which strucs have synthesis data?
9.    What other documents specify and/or cite this struc?
10.   Which database records for this struc have links to other documents?
11.   What relationship connections can be made using similarity searches?
12.   What intersects and differences are discernible within a document set?



                                                                                [14]
Triaging document or webpage chemistry
• Identify the structure specification types, e.g.
   – Semantic names (all sources)
   – Code names (press releases, papers and abstracts)
   – IUPAC names (papers, patents and abstracts)
   – Images (papers, patents, & Google images)
   – SMILES (open lab books)
   – InChI strings (open lab books)
   – SDF files (open lab books, & github)

Convert these to a structure (e.g. SDF, SMILES, InChI) then:
   – Search InChIKey in Google
   – Search major databases
   – Search SureChemOpen
   – Compare extracted sets for intersects and diffs
   – Extend exact match connectivity with similarity searching
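
The first two search steps above can be sketched as simple URL builders. This is a minimal illustration, not the workflow used in the talk: the helper names are hypothetical, the PubChem address follows the PUG REST `compound/inchikey/.../cids/JSON` pattern, and the example key is aspirin's well-known InChIKey rather than a compound from this deck. No network request is made.

```python
import re
from urllib.parse import quote, quote_plus

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text: str) -> bool:
    """Check only the layout of the string, not whether any database knows it."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def google_query_url(inchikey: str) -> str:
    """Exact-phrase web search for the key (double quotes force a literal match)."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{inchikey}"')


def pubchem_cid_url(inchikey: str) -> str:
    """PubChem PUG REST address that returns the CIDs matching the key."""
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/"
        + quote(inchikey)
        + "/cids/JSON"
    )


key = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # aspirin, used only as a familiar example
if is_inchikey(key):
    print(google_query_url(key))
    print(pubchem_cid_url(key))
```

Validating the layout first avoids sending code names or SMILES to an InChIKey lookup by mistake.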
                                                                 [15]
Triage example: antimalarial starting point

The MMV390048 code name is linked to an image in press reports
but is PubChem- and PubMed-negative (no matches in either)




                         [16]
Images: convert and search

                      Real chemists sketch them in a jiffy;

   the rest of us can use OSRA: Optical Structure Recognition Application




(after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3)
                                                                            [17]
Making connections:
image > structure > database > documents




                 CID 53311393 > ChEMBL > PubMed
                 SureChem or chemicalize.org > patent


                                                        [18]
Patent SAR from WO2011086531:
Collating activities via SureChemOpen

     CID 53311393 >




                                        [19]
Patent SAR results: top-20 from 39 IC50s
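
A top-N selection of this kind can be sketched as follows, assuming the ranking is by ascending IC50 (most potent first). The record type, field names and every value below are placeholders for illustration; they are not data from the patent.

```python
from typing import NamedTuple


class Activity(NamedTuple):
    compound_id: str   # e.g. a patent example number or a PubChem CID
    ic50_nm: float     # IC50 in nanomolar; lower means more potent


def top_n_by_potency(rows, n=20):
    """Return the n most potent rows, lowest IC50 first."""
    return sorted(rows, key=lambda row: row.ic50_nm)[:n]


# Placeholder rows only; NOT values from WO2011086531.
rows = [
    Activity("example-A", 250.0),
    Activity("example-B", 12.0),
    Activity("example-C", 1800.0),
    Activity("example-D", 45.0),
]

for row in top_n_by_potency(rows, n=3):
    print(row.compound_id, row.ic50_nm)
```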




                                           [20]
Results > figshare




http://figshare.com/articles/Patent_SAR_for_MMV390048/657979
                                                               [21]
Structures > MyNCBI




http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1zWhcobieZbIouGfUdsdbHek5/
                                                                         [22]
SAR Table: iOS app from Molecular Materials Informatics

SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF ->
Dropbox -> SAR Table -> edit in data, R-group decompose -> share


                           [23]
InChIKey in Google: instant orthogonal joining




                                                 [24]
Chemicalize.org: 413 strucs from WO2011086532



CID 53311393 ->




                                            [25]
Using OPSIN and chemicalize.org to fix
     recalcitrant IUPACs from WO2011086532




Can quasi-manually extract ~ 10 more “split IUPAC” examples
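
One way to semi-automate this is sketched below, under the assumption that a "split IUPAC" is a name broken by stray spaces or line breaks in the extracted text. The generator lists every way of closing or keeping each gap; each candidate would then be passed to a name-to-structure converter such as OPSIN, a step that is left out here. The function name and the `max_gaps` limit are hypothetical.

```python
from itertools import product


def rejoin_candidates(fragmented_name: str, max_gaps: int = 8):
    """Yield every variant of the name with each whitespace gap closed or kept."""
    parts = fragmented_name.split()
    if not parts:
        return
    gaps = len(parts) - 1
    if gaps > max_gaps:
        # 2**gaps variants would be produced; refuse very long inputs.
        raise ValueError("too many gaps to enumerate")
    for separators in product(("", " "), repeat=gaps):
        name = parts[0]
        for separator, part in zip(separators, parts[1:]):
            name += separator + part
        yield name


# A generic name fragment with one stray space after the locant.
for candidate in rejoin_candidates("pyridin-2- amine"):
    print(repr(candidate))
```

Keeping the unjoined variant in the output matters because some valid names contain genuine spaces.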
                                                              [26]
Clustering document extraction sets: CheS-Mapper




  WO2011086531 -> chemicalize.org -> 413 cpds download ->
  CheS-Mapper -> cluster 8 -> export 53 cpds

                                                            [27]
PubChem -> ChEMBL -> PMID -> assay -> strucs
                   • CHEMBL2041980 (structure)
                   • PMID 22390538 (paper)
                   • CHEMBL2045642 (assay for 32 strucs
                     from paper)
                   • The 32 CIDs all have patent matches
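
The identifiers on this slide can be turned into web addresses for follow-up. This is a hedged sketch: the helper names are hypothetical, and the ChEMBL paths assume the current public REST layout (`/chembl/api/data/...`), which may differ from what was available in 2013. Nothing is fetched.

```python
from urllib.parse import urlencode

CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"  # assumed current base URL


def molecule_url(chembl_id: str) -> str:
    """Address of the ChEMBL molecule record in JSON form."""
    return f"{CHEMBL_API}/molecule/{chembl_id}.json"


def assay_activities_url(assay_id: str, limit: int = 100) -> str:
    """Address listing the activity rows recorded for one ChEMBL assay."""
    query = urlencode({"assay_chembl_id": assay_id, "limit": limit})
    return f"{CHEMBL_API}/activity.json?{query}"


def pubmed_url(pmid: int) -> str:
    """Address of the PubMed abstract page for a PMID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


print(molecule_url("CHEMBL2041980"))
print(assay_activities_url("CHEMBL2045642"))
print(pubmed_url(22390538))
```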




                                                       [28]
Venny: intersects, diffs, de-dupes and merges


                                   1) WO2011086531 matches in PubChem
                                   2) CheS-Mapper cluster 8 from WO2011086532
                                   3) ChEMBL assayed cpds from PMID 22390538

                                   (handles any regular strings e.g. db IDs,
                                   SMILES, InChI or InChIKey)
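
The same intersect, diff, de-dupe and merge operations can be sketched with Python sets for any three lists of identifier strings. This is an illustration of the idea rather than how Venny is implemented; the function name and the `ID-n` values are placeholders, not identifiers from the three sets above.

```python
def venn3(a, b, c):
    """Return the seven regions of a three-set Venn diagram plus the merged set."""
    a, b, c = set(a), set(b), set(c)  # converting to sets also de-duplicates
    return {
        "all_three": a & b & c,
        "a_and_b_only": (a & b) - c,
        "a_and_c_only": (a & c) - b,
        "b_and_c_only": (b & c) - a,
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "merged": a | b | c,
    }


# Placeholder identifiers; any regular strings (db IDs, SMILES, InChIKeys) work.
patent_matches = ["ID-1", "ID-2", "ID-3", "ID-3"]  # note the duplicate entry
cluster_members = ["ID-2", "ID-3", "ID-4"]
assayed_compounds = ["ID-3", "ID-5"]

regions = venn3(patent_matches, cluster_members, assayed_compounds)
for name, members in regions.items():
    print(name, sorted(members))
```

Exact string matching only works if all three lists use the same identifier type and the same canonical form.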

                                                     [29]
OSDDMalaria: global sharing test-bed




•   Different options being explored
•   Team or personal URLs > chemicalize.org
•   Github for SD files
•   PubChem public collections
•   Direct feed to ChEMBL malaria
•   G+ for real-time exchange and feedback
                                                  [30]
The open toolbox facilitates extraction and
  collation of 10 to 30 million structures
             entombed in text




                                              [31]
Conclusions

• The ability to extract chemical structures from text and web sources
  has been transformed by an expansion of the public toolbox
• The PubChem big-bang increases the probability of extractions having
  exact or similarity matches in databases
• Paradoxically, the patent corpus is now completely open while access
  to journal text is still restricted
• However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target
  mapped structures from ~ 50K papers
• The submission of ~15 mill. patent structures to PubChem ensures at
  least representation from the majority of medicinal chemistry patents
  (many of which spawned the subsequent ChEMBL papers)
• Those who want to share their structures globally (e.g. OSDD) have an
  expanding set of options for surfacing their results.



                                                                          [32]

Connecting Bioactive Chemistry Across Documents and Databases

  • 1. Open, Collaborative, and Transformative: Exploring and Connecting Bioactive Chemistry Across Biomedical Documents and Databases with Public Tools Christopher Southan TW2Informatics, Göteborg, Sweden, BioIT Track 11, Boston, April 2013 [1]
  • 2. Dr Christopher Southan, Ph.D., M.Sc.,B.Sc. TW2Informatics: http://www.cdsouthan.info/Consult/CDS_cons.htm Mobile: +46(0)702-530710 Skype: cdsouthan Email: cdsouthan@hotmail.com Twitter: http://twitter.com/#!/cdsouthan Blog: http://cdsouthan.blogspot.com/ LinkedIN: http://www.linkedin.com/in/cdsouthan Publications: http://www.citeulike.org/user/cdsouthan/order/year,,/publications Presentations: http://www.slideshare.net/cdsouthan [2]
  • 3. Abstract Although there are ~ 50 million chemical structure in public databases, many millions of bioactive compounds are still entoombed in documents. In addition linking chemistry between patents, papers, abstracts and databases has been patchy. However, new tools such as chemicalize.org, OPSIN, OSCA, Venny, CheS-Mapper and InChIKey indexing by Google, have transformed the extraction, analysis and connectivity of structures from text. Extractions can also be triaged against PubChem that now contains 14.5 million patent-extracted compounds from SureChemOpen, SCRIPDB, Thomson and IBM as well as 1 million from journals via ChEMBL and PubMed. These advances present new collaborative options such as sharing extracted neglected disease patents with SAR annotations on figshare. [3]
  • 4. Getting chemistry out of text and linking to data: some is done but we have to dig for the rest [4]
  • 5. Estimates for chemical text tombs • Journal chemistry public extraction, ~10 to 20 million entombed? • Majority of useful patent chemistry already publicly extracted, but ~5 to 10 million still to go? • PubMed abstracts and MeSH chemistry, ~0.5 million still entombed? • Other unique, useful, text-only (i.e. no database cross-references) chemistry on the web, ~0.1 to 0.5 million entombed? [5]
  • 6. What’s out there: publicly disinterred structures • InChIKey in Google ~50 million • PubChem = 48 million • PubChem ROF + 250-800 Mw (lead-like) = 31 million • ChemSpider = 28 million • PubChem all docs (papers & patents) = 16 million • PubChem patents = 15 million • SureChemOpen = 13 million • PubChem journal sources (PubMed + ChEMBL) = 1 million • ~90% of all structures in databases have their primary origin in text sources [6]
  • 7. Medicinal chemistry patents (tombs with lids off) • 18,777,229 patents, 2,208,422 WOs (i.e. ~9 per family) • WO, C07 or A61 = 469,856 • WO, C07D or A61K = 235,854 • WO, C07D = 72,737 (assignee vs. year plots below) [7]
  • 8. PubMed at 22 mill: ~ 10% with chemistry (guarded tombs) “Free full text” = 575,513 (24%) [8]
  • 9. Top-5 Med Chem journals (4% lids off tombs) “Free full text” = 2671 (4.3%) [9]
  • 10. Growth: (escaping the tombs) • Patent “big bang” (SureChem & SCRIPDB in 2012) • Literature “slow burn” (ChEMBL 2009 jump) • Paradox - patents:papers 15:1 (both sets of CIDs cumulative) [10]
  • 11. Patents in PubChem: post-bang total vs. unique content PubChem at 47.3 million CIDs, 32% include patents, 20% patent-only [11]
  • 12. Citations: connections between tombs but still need to disinter structures Papers Abstracts PubMed Patents "relatedness" heuristics [12]
  • 13. Databases <> structures < > documents: links, but few reciprocal Papers Abstracts 0.8 mill (ChEMBL) 12K 0.2 mill (mainly MeSH) Patents 15 mill [13]
  • 14. Post-document retrieval: basic questions 1. What is the name:IUPAC:image:other ratio in the document? 2. Which tools might be appropriate for first-pass extractions? 3. How many and what proportion of strucs can be extracted? 4. Which SAR/in vivo/clinical data are linked to strucs? 5. Which document sections include the key strucs? 6. Which database entries have links (back) to this document? 7. Which strucs have InChIKey matches in Google, & database entries? 8. Which strucs have synthesis data? 9. What other documents specify and/or cite this struc? 10. Which database records for this struc have links to other documents? 11. What relationship connections can be made using similarity searches? 12. What intersects and differences are discernible within a document set? [14]
  • 15. Triaging document or webpage chemistry • Identify the structure specification types, e.g. – Semantic names (all sources) – Code names (press releases, papers and abstracts) – IUPAC names (papers, patents and abstracts) – Images (papers, patents, & Google images) – SMILES (open lab books) – InChI strings (open lab books) – SDF files (open lab books, & github) • Convert these to a structure (e.g. SDF, SMILES, InChI) then: – Search InChIKey in Google – Search major databases – Search SureChemOpen – Compare extracted sets for intersects and diffs – Extend exact match connectivity with similarity searching [15]
  • 16. Triage example: antimalarial starting point The MMV390048 code name is linked to an image in press reports but is PubChem and PubMed -ve [16]
  • 17. Images: convert and search Real chemists sketch them in a jiffy; the rest of us can use OSRA: Optical Structure Recognition Application (after editing, CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3) [17]
  • 18. Making connections: image > structure > database > documents CID 53311393 > ChEMBL > PubMed SureChem or chemicalize.org > patent [18]
  • 19. Patent SAR from WO2011086531: Collating activities via SureChemOpen CID 53311393 > [19]
  • 20. Patent SAR results: top-20 from 39 IC50s [20]
  • 23. SAR Table: iOS app from Molecular Materials Informatics SureChemOpen strucs -> manual data collation -> PubChem CIDs -> SDF -> Dropbox -> SAR Table -> edit in data, R-group decompose -> share [23]
  • 24. InChIKey in Google: instant orthogonal joining [24]
  • 25. Chemicalize.org: 413 strucs from WO2011086532 CID 53311393 -> [25]
  • 26. Using OPSIN and chemicalize.org to fix recalcitrant IUPACs from WO2011086532 Can quasi-manually extract ~10 more “split IUPAC” examples [26]
  • 27. Clustering document extraction sets: CheS-Mapper WO2011086531 -> chemicalize.org -> 413 cpds download -> CheS-Mapper -> cluster 8 -> export 53 cpds [27]
  • 28. PubChem -> ChEMBL -> PMID -> assay -> strucs • CHEMBL2041980 (structure) • PMID 22390538 (paper) • CHEMBL2045642 (assay for 32 strucs from paper) • The 32 CIDs all have patent matches [28]
  • 29. Venny: intersects, diffs, de-dupes and merges 1) WO2011086531 matches in PubChem 2) CheS-Mapper cluster 8 from WO2011086532 3) ChEMBL assayed cpds from PMID 22390538 (handles any regular strings e.g. db IDs, SMILES, InChI or InChIKey) [29]
  • 30. OSDDMalaria: global sharing test-bed • Different options being explored • Team or personal URLs >chemicalize.org • Github for SD files • PubChem public collections • Direct feed to ChEMBL malaria • G+ for real-time exchange and feedback [30]
  • 31. The open toolbox facilitates extraction and collation of 10 to 30 million structures entombed in text [31]
  • 32. Conclusions • The ability to extract chemical structures from text and web sources has been transformed by an expansion of the public toolbox • The PubChem big-bang increases probability of extraction having database exact or similarity matches • Paradoxically, the patent corpus is now completely open while access to journal text is still restricted • However, ChEMBL has extracted ~ 0.8 mill. SAR-linked and target mapped structures from ~ 50K papers • The submission of ~15 mill. patent structures to PubChem ensures at least representation from the majority of medicinal chemistry patents (many of which spawned the subsequent ChEMBL papers) • Those who want to share their structures globally (e.g. OSDD) have an expanding set of options for surfacing their results. [32]
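The triage step on slide 15 (identify the specification type, then search the InChIKey in Google, as on slide 24) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `classify` and `google_inchikey_url` are hypothetical, the SMILES test is a rough token heuristic rather than a parser, and the methane InChI and aspirin InChIKey are included purely as well-known examples that do not come from the slides.

```python
import re
from urllib.parse import quote_plus

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

# Rough SMILES tokens (organic subset plus bracket atoms); a heuristic, not a parser.
SMILES_RE = re.compile(
    r"(?:Cl|Br|[BCNOPSFI]|[bcnops]|\d|[()=#+\-/\\@.%:$*]|\[[^\[\]]+\])+"
)


def classify(spec: str) -> str:
    """Guess which kind of structure specification a string is (hypothetical helper)."""
    s = spec.strip()
    if s.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.fullmatch(s):
        return "inchikey"
    # Require at least one letter so bare numbers (e.g. a PMID) are not called SMILES.
    if SMILES_RE.fullmatch(s) and re.search(r"[A-Za-z]", s):
        return "smiles"
    return "name-or-code"


def google_inchikey_url(key: str) -> str:
    """Build an exact-phrase Google query URL for an InChIKey (assumed query shape)."""
    if not INCHIKEY_RE.fullmatch(key):
        raise ValueError(f"not an InChIKey: {key!r}")
    return "https://www.google.com/search?q=" + quote_plus(f'"{key}"')


examples = [
    "MMV390048",  # code name from slide 16
    "CS(=O)(=O)c3ccc(C2=CN=C(N)C(C1=CCC(C(F)(F)F)N=C1)C2)cc3",  # SMILES from slide 17
    "InChI=1S/CH4/h1H4",  # methane, illustration only
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # aspirin, illustration only
]
for text in examples:
    print(f"{classify(text):13s} {text}")
```

Anything classified as `name-or-code` still needs a name-to-structure or image-to-structure tool (OPSIN, chemicalize.org or OSRA in the slides) before it can be searched by structure.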

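Slide 26 describes fixing recalcitrant IUPAC names by hand with OPSIN and chemicalize.org. A simple pre-check for unbalanced brackets and stray gaps can point to where a name is probably broken before it is pasted into either tool. The sketch below is an assumption-laden illustration, not how OPSIN itself works: `bracket_problems` and `suspicious_gaps` are hypothetical helpers, and the names used are made-up examples rather than names taken from the patent.

```python
import re

PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

# Whitespace right after a hyphen, comma or opening bracket, or right before
# a hyphen, comma or closing bracket, is rarely legitimate in an IUPAC name.
GAP_RE = re.compile(r"(?<=[-,(\[])\s+|\s+(?=[-,)\]])")


def bracket_problems(name: str) -> list[tuple[int, str]]:
    """Return (position, description) for each unmatched bracket in a name."""
    stack = []     # (position, character) of brackets still open
    problems = []
    for i, ch in enumerate(name):
        if ch in OPENERS:
            stack.append((i, ch))
        elif ch in PAIRS:
            if stack and stack[-1][1] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append((i, f"unexpected '{ch}'"))
    problems.extend((i, f"unclosed '{ch}'") for i, ch in stack)
    return sorted(problems)


def suspicious_gaps(name: str) -> list[int]:
    """Return start positions of whitespace that looks like an OCR or line-break gap."""
    return [m.start() for m in GAP_RE.finditer(name)]


# Made-up names for illustration only.
for candidate in ["2-(4-chlorophenyl)ethanol",
                  "2-(4-chlorophenylethanol",
                  "pyridin- 2-amine"]:
    print(candidate, bracket_problems(candidate), suspicious_gaps(candidate))
```

A clean result does not mean the name is valid; it only means these two common faults are absent, so the name is still worth passing to a name-to-structure tool.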
Editor's notes

  1. 70 million substances in CAS suggest a 20-30 million shortfall (i.e. SciFinder only) but they include virtuals and libraries. SureChem will continue patent extraction but expect an asymptote of true novels only soon. PubMed capture is largely dependent on MeSH, but a lot of IUPAC chemistry is only annually updated, and some is not captured. SureChem, IBM and chemicalize all indicate that, including MeSH terms, at least 0.5 million structures could be extracted from PubMed. No idea how much web-unique chemistry (not in documents or databases) is out there, but open lab books will increase this.
  2. InChIKeys: estimate of PubChem + ChemSpider in Google, but PubChem currently has a backlog for Key scraping. The ROF + 250-800 is a very approximate circumscription of the property space that has some possibility of bioactivity. Probably a proportion of vendor structures may have never been committed to text. There are some virtuals “out there”, including some patent extractions, but these are difficult to estimate.
  3. Note the WO/PCT queries are non-redundant in the patent family sense. The medicinal chemistry corpus is actually quite small. Note big pharma patent decline post-2008. Average exemplified cpds with activity data per patent (family) is unknown but GVK's curation average is ~50.
  4. Using the top level MeSH term as a filter for “PubMeds with some chemistry”. Free full text is ~¼ but there are a lot of biological journals in this set.
  5. Select the core journals used for med chem extraction by GVKBIO and ChEMBL. Not a large corpus. Both extract ~15 cpds per paper. Note the proportion of “free full text” is low.
  6. Note that cumulative plots include an element of back-mapping, i.e. the 2005 matches are to the 2013 total, not just the 2005 documents.
  7. PubChem hit 15 million patents in March 2013. Largest unique content is SureChemOpen. Thomson uniqueness is low because a) they include at least 30% journal extractions and b) the Derwent WPI content (was) also in DiscoveryGate. IBM are only pre-2000 patents and the extracted content overlaps with other sources.
  8. Citations are a core tradition but they do not provide direct structure <-> structure links. Patents cite papers but papers rarely cite patents (with the exception of patent reviews).
  9. Only Nature Chemical Biology and Nature Chemistry have direct links from the journal document to PubChem. Given today's technology the major patent offices could put links in the PDFs but are unlikely to do so.
  10. The problem “how do I find the chemistry out there relevant to my interests” is a general search retrieval recall and specificity challenge that cannot be addressed here. Beyond PubMed and Google it’s getting better (e.g. indexing of full text patents) but there are still issues (e.g. text mining of chemical journals is still very restricted). Once you have found the documents or text, these are the typical set of questions you might want to address, especially in regard to choosing which tools are best for the job.
  11. Need to assess what representational types are being used in the document, e.g. some patents are image-only (but SureChem is pulling most of these out). Then select tools and sources for the job. Decide how to store your structures locally. The default batch search is an upload to PubChem. The default individual search is the InChIKey against Google.
  12. Self-explanatory. Note my blog post was indexed.
  13. The simplest of starting points; at least the press release had a structure diagram. OSRA provides good starting points to edit and get SMILES. The structure does not have to be exactly right because a database similarity match is OK to see what it should have been.
  14. SMILES from the image hits the CID in PubChem. This links to patents via SureChem and chemicalize.org. ChEMBL provides a link to the paper. Note none of these sources have MMV390048 as a synonym, so all the connections are via structure.
  15. We can start off with patent links. Note in this case numbered image capture, as opposed to the IUPAC listing, was important to manually collate the structure against the correct IC50.
  16. From manual cross-checking between the individual example structures and the IC50 table the Excel sheet can be populated.
  17. Useful way to share results that is citable. Indexed in Google but no live links in Excel sheet (yet).
  18. Can upload CID lists and download as a saved and public collection.
  19. This is the Pistoia/Alex Clark SAR Table app. Dropped the CIDs out of PubChem into Dropbox and picked them up on the iPad. Nice, but it would be good to automate the decomposition.
  20. InChIKey search picks up instantly. This was just a choice of one of the actives. So this connects PubChem and figshare.
  21. The CID links straight through to chemicalize and will just re-extract the whole patent in a few seconds. The 413 gave 358 hits in PubChem.
  22. IUPAC names have a lot of usage variants and OCR mistakes. Typically gaps, line breaks, l instead of 1 and missing brackets. OPSIN is good for indicating where the break is. This can then be fixed for a series in chemicalize.org.
  23. Total extractions from patents can include a lot of low Mw common reagent chemistry. CheS-Mapper display makes it easy to pick out clusters of lead-like compounds. Clusters can then be downloaded. Flexibility is high because document sets can be split or merged at the input stage.
  24. ChEMBL extracts structure and data. Can't actually select a set of cpds via the PubMed ID but can via the assay ID that is usually unique to that paper. In this case we got 32 structures, all of which came from that patent.
  25. Very useful utility for any kind of set operations, e.g. sets of extractions. Total flexibility, e.g. intersecting patents and papers with extractions from abstract sets. Sets can be de-duplicated and merged from multiple sets (e.g. 10 patent extractions in one box). Can combine with selected downloaded database records.
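The set operations that Venny provides (intersects, diffs, de-dupes and merges over any regular strings such as database IDs, SMILES or InChIKeys) can also be sketched with plain Python sets. This is a minimal illustration of the idea and not Venny's own implementation; `venn_regions` is a hypothetical helper and the keys `K1` to `K5` are placeholders standing in for real identifiers.

```python
from collections import defaultdict


def venn_regions(named_lists):
    """Map each combination of list names to the items found in exactly those lists."""
    membership = defaultdict(set)
    for name, items in named_lists.items():
        for item in items:
            key = item.strip()          # tolerate stray whitespace from copy-paste
            if key:
                membership[key].add(name)   # a set per item also de-duplicates
    regions = defaultdict(set)
    for key, names in membership.items():
        regions[frozenset(names)].add(key)
    return dict(regions)


# Placeholder identifiers; in practice these would be CIDs, SMILES or InChIKeys.
lists = {
    "patent":  ["K1", "K2", "K3", "K2"],   # note the duplicate K2
    "cluster": ["K2", "K3", "K4"],
    "paper":   ["K3", "K5 "],
}
regions = venn_regions(lists)

common_to_all = regions[frozenset(lists)]   # intersect of all three lists
merged = set().union(*regions.values())     # de-duplicated merge of everything

for names, items in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(" & ".join(sorted(names)), "->", sorted(items))
```

Exact string matching only works if all lists use the same identifier type, so converting everything to one representation (for example InChIKeys) before comparing is the safer route.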