Evaluating Patent Full Text Documents with Chemical Ontologies

Chemical ontologies are abstractions of chemical compounds that provide structural, functional, and chemical property classifications. As automated processing of patent text advances, there is growing interest in automatically classifying the chemical compounds in patent documents so that they can be searched by known chemical classes.

We present strategies for automatically classifying chemical compounds by name, chemical structure, or function. They use a chemical ontology that is derived from the purely lexical ontologies MeSH and ChEBI but adds logic based on SMARTS patterns and chemical calculations. We describe the development of this ontology, which also covers functional classifications and materials science terms such as alloys and polymers.
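As a rough illustration of how structure-based class logic and an "is a" hierarchy can work together, the sketch below assigns classes with AND/OR/NOT rules and then infers ancestor classes. It is not SODIAC or OCMiner code: the rule format, the feature names, and the small hierarchy are assumptions made for this example, and the sub-structure matches that a SMARTS engine would normally compute are supplied by hand as a set of feature names.

```python
# Hypothetical sketch: class assignment via AND/OR/NOT logic over
# sub-structure hits, followed by "is a" ancestor inference.
# Rules, feature names and the hierarchy are illustrative assumptions.

def evaluate(rule, features):
    """Evaluate a nested rule against a set of matched feature names.

    A rule is either a feature name (str) or a tuple:
    ("AND", r1, r2, ...), ("OR", r1, r2, ...) or ("NOT", r).
    """
    if isinstance(rule, str):
        return rule in features
    op, *args = rule
    if op == "AND":
        return all(evaluate(a, features) for a in args)
    if op == "OR":
        return any(evaluate(a, features) for a in args)
    if op == "NOT":
        return not evaluate(args[0], features)
    raise ValueError(f"unknown operator: {op}")

# Class definitions (assumed): class name -> rule over feature names.
CLASS_RULES = {
    "carboxylic acids": "carboxyl group",
    "carboxylic acid esters": "ester group",
    "benzenes": ("AND", "benzene ring", ("NOT", "fused ring")),
}

# "is a" edges (assumed): class name -> list of parent classes.
IS_A = {
    "carboxylic acids": ["organic acids"],
    "organic acids": ["acids", "organic compounds"],
    "carboxylic acid esters": ["esters"],
    "esters": ["organic compounds"],
    "benzenes": ["monocyclic carbocycles"],
    "monocyclic carbocycles": ["carbocycles"],
    "carbocycles": ["cyclic compounds"],
}

def direct_classes(features):
    """Return every class whose rule is satisfied by the feature set."""
    return {name for name, rule in CLASS_RULES.items() if evaluate(rule, features)}

def ancestors(classes):
    """Return all classes reachable via "is a" edges, excluding the inputs."""
    seen = set()
    stack = list(classes)
    while stack:
        current = stack.pop()
        for parent in IS_A.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen - set(classes)

# Hand-written feature hits for an aspirin-like compound (assumption).
aspirin_features = {"carboxyl group", "ester group", "benzene ring"}
aspirin_direct = direct_classes(aspirin_features)
aspirin_ancestors = ancestors(aspirin_direct)
print(sorted(aspirin_direct))
print(sorted(aspirin_ancestors))
```

Keeping the rules as nested tuples makes the AND/OR/NOT structure explicit and easy to validate; a real system would replace the feature-name lookup with actual sub-structure matching.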

Using our UIMA-based OCMiner annotation pipeline, we processed more than 90 million patent full-text documents to find mentions of chemical compounds, substances, chemical classes, and chemical groups, and also extracted the claimed uses of these compounds. The chemical terms were then classified with our chemical ontology, turning more than 10 billion chemical class mentions into an ontology-enabled, Lucene-based search index. We also used this index to analyze how often chemical classes occur per time period, which indicates the focus of general chemical research activity and recent trends in patenting strategies.
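The per-period frequency analysis can be pictured with a small counting sketch. This does not use Lucene or the actual index; the input format (pairs of publication year and chemical class), the function name, and the sample mentions are hypothetical and only show the bucketing idea.

```python
from collections import Counter, defaultdict

def class_frequency_by_period(mentions, period_years=5):
    """Count chemical class mentions per time period.

    mentions: iterable of (publication_year, chemical_class) pairs.
    Returns a dict mapping (first_year, last_year) -> Counter of classes.
    """
    table = defaultdict(Counter)
    for year, chem_class in mentions:
        # Align each year to the start of its period, e.g. 2003 -> 2000.
        start = year - (year % period_years)
        table[(start, start + period_years - 1)][chem_class] += 1
    return table

# Hypothetical sample mentions, not taken from the real data set.
sample_mentions = [
    (2001, "steroid acids"),
    (2003, "steroid acids"),
    (2004, "hallucinogens"),
    (2006, "steroid acids"),
]

for period, counts in sorted(class_frequency_by_period(sample_mentions).items()):
    print(period, counts.most_common())
```

Comparing the counters of consecutive periods is one simple way to spot classes whose mention frequency rises or falls over time.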

An annotated data set covering 10 years of US patents is freely available for further investigation and can be used to train on and to further develop the use, quality, and interchangeability of chemical ontologies.

  1. Evaluating patent full text documents with chemical ontologies. OntoChem IT Solutions GmbH, Blücherstr. 24, 06120 Halle (Saale), Germany. Tel. +49 345 4780472, Fax: +49 345 4780471, mail: info(at)ontochem.com
  2. OntoChem IT Solutions GmbH
      • spin-out from OntoChem GmbH
      • started 1.7.2015
      • 15 chemists, bioinformaticians, biologists, linguists, pharmacists
      • extracting knowledge from documents, selling software & services
  3. What are ontologies? A computer-readable, formal representation of knowledge. They describe relationships between knowledge concepts (diagram: aspirin, acetyl salicylic acids, benzoic acid and carboxylic acid, linked by "is a" relations) and can be used to infer, extract, search, sort and analyse knowledge.
  4. Chemical Ontologies... ChEBI (Chemical Entities of Biological Interest, https://www.ebi.ac.uk/chebi/), which has about 40,000 manually classified compounds; MeSH (Medical Subject Headings); PubChem; ...
  5. Classifying Chemistry: SODIAC. SODIAC (Structure based Ontology Development and Individual Assignment Center) is automated compound classification software:
      • ontology editor, OBO specification conformity
      • definition of compound classes via SMARTS
      • chemical structure editor
      • sub-structure AND, OR and NOT logic
      • compound to class assignment
      • chemistry error detection
      • chemical hierarchy construction
  6. Classifying Chemistry: SODIAC. AND/OR logic to assign Vitamin C derivatives:
      • described in different tautomeric forms in databases
      • logic needed for classifying correct stereochemistry in substituted compounds
      (Diagram: the concept "Vitamin C derivatives", defined by sub-structures combined with AND and OR.)
  7. Classifying Chemistry: not straightforward... Structural chemical ontologies are often not based on sub-structures! Figure labels, in slide order: Progesterone; 19-Norprogesterone (4-8* more active); class: Gestagens; class: Gestagens>Progestins; Pregnane (female hormones); Androstane (male hormones); class: Gonans>Pregnans; class: Gonans>Estrans; DrugBank & ChEBI: Progestin, a synthetic progestogen; parent & SSS; not parent but SSS; not parent but SSS; ChEBI: corticosteroid hormone; same family; different family.
  8. Chemistry Ontologies
      • Organic chemistry: 7,586 class concepts, 29,709 class terms; 3,185 concepts linked to ChEBI concepts; 2,465 concepts linked to MeSH concepts; 68 million concepts linked to PubChem
      • Inorganic materials: 52.4209 concepts, 56,332 terms
      • Groups-substituents-fragments: 4,428 concepts, 12,754 terms
      • Substances: 989 concepts, 3,522 terms
      • Polymers: 2,361 concepts, 7,176 terms
  9. Classifying Chemistry: Example. Acetylsalicylic acid (SODIAC v2.5.2; CHEBI:15365; MeSH:D001241)
      • Direct parents: aromatic compounds, benzenes, carbon compounds, carboxylic acids, ethanoic acid esters, methyl esters, monocyclic compounds, oxygen compounds, salicylic acid derivatives; bioavailable molecules, hydrophilic molecules, lead like molecules, lipinski molecules, small molecules
      • Ancestors: 6-membered carbocycles, 6-membered cyclic compounds, acetic acid derivatives, acids, carbocycles, carbon group compounds, carbonyl compounds, carboxylic acid derivatives, carboxylic acid esters, chalcogen compounds, cyclic compounds, esters, fatty acyls, fatty esters, lipids, monocarboxylic acid derivatives, monocyclic carbocycles, organic acids, organic compounds, organic esters, salicylic acid derivatives, short chain fatty acid esters
  10. Basic Biology Ontologies
      • Genes, Proteins & Peptides: annotation version: 708,141 concepts, 2,627,612 terms; classification version: 832,902 concepts, 3,177,057 terms; with linkouts to GO, InterPro, HomoloGene, HUGO, KEGG, Uniprot ...
      • Diseases: SNOMED-CT, MedDRA, ICD-9, ICD-10, HDO, UMLS, Loinc, MeSH; annotation version: 105,824 concepts, 360,077 terms
      • Species: based on NCBI, GRIN, IPNI, Cornucopia, World Economic Plants ...; annotation version: 1,012,634 concepts, 1,664,042 terms
      • Anatomy: different species and stage dependent ontologies available; general anatomy: 4,773 concepts, 19,450 terms
  11. Other Biology Ontologies
      • Cell lines: 5,566 concepts, 13,083 terms
      • Cosmetology: 1,187 concepts, 2,017 terms
      • Effects: 35,477 concepts, 111,012 terms
      • Nutrition: 19,193 concepts, 115,699 terms
      • Physiology: 533 concepts, 619 terms
      • Toxicology: 1,019 concepts, 2,150 terms
  12. Other Ontologies
      • Countries, annotation version: 245 concepts, 85,069 terms
      • Companies, annotation version: 26,388 concepts, 5,757 terms
      • Material properties, annotation version: 1,081 concepts, 2,428 terms
      • Methods, annotation version: 2,502 concepts, 10,053 terms
      • Regions & Geopolitics, annotation version: 3,774 concepts, 13,356 terms
      • Relations, annotation version: 603 concepts, 2,290 syntaxes
  13. General Ontologies
      • Wikipedia, annotation version: 5,200,842 concepts, 11,490,831 terms
      • Magnitudes & Units, annotation version: 228 concepts, 510 terms
      • Persons, annotation version: >1,000,000 persons
      • Relations, annotation version: 603 concepts, 2,290 syntaxes
  14. Understanding Patents with Ontologies. NLP for patents poses some unique challenges:
      • multilingual
      • poor OCR (optical character recognition)
      • multi-disciplinary
      • many: >90 million full text documents from >110 patent offices
      • large: up to 500 pages, with sentences spanning >20 pages
      • obscure: hand drawings, unclear language
  15. Understanding Patents. Collaboration with infoapps GmbH (Munich).
      • Standard full text data: US, EP, DE, WO, AT, CH, BE, CA, ES, FR, GB, MA.
      • Standard full text data: AR, BR, CN, DK, FI, ID, EI, EN, JP, KR, MX, MY, NL, NO, RU, SE, TH, TW, VN.
      • Original full text data, machine/human translation (EN): AR, AT, BE, BR, CA, CH, CN, DE, DK, EP, ES, FI, FR, ID, JP, KR, MX, NL, NO, RU, SE, TH, TW, VN, WO.
  16. OCMiner® UIMA Pipeline. Diagram of the pipeline stages: identify document type (picture PDF: OCR; text PDF: PDF reader; XML doc: XML reader; Office doc: Office reader), document classifier, XML detagger, language detector, normalize text, tokenize text, acronym/abbreviation detector, person annotator, document structure, domain annotators 1…n, relationship extraction, and consumers (BRAT, index, XML). The chemistry annotator comprises: dictionary, name-2-structure, formula & molpuzzler, class/group resolution, cleanup & rule combiner, coordinated entity resolution, context handler, NE confidence.
  17. Regular Names in Patents. BRAT (Goran Topić) file example. Source: Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SA, Sayle R, Kors JA, Muresan S. Annotated chemical patent corpus: a gold standard for text mining. PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477. eCollection 2014.
  18. Other Name Types in Patents
      • Chemical compound: 5,7-bis(trifluoromethyl)-pyrazolo[1,5-a]pyrimidine-2-carbonitrile
      • Chemical class: pyrazolo[1,5-a]pyrimidines
      • Chemical substituent + class: 2-Bromo-, 2-fluoro-, and 2-chloro pyrazolo[1,5-a]pyrimidines
  19. Named Entities in Patents. Extracting named entities (NE) from infoapps patents: from 19 million patents with chemistry, 4.7 million patents from 2001-2010 (publication year) were selected.

      | Ontology | Term annotation count | Unique concepts per doc | Unique concepts |
      |---|---|---|---|
      | Chemistry | 1,465,510,682 | 294,771,572 | ? |
      | Proteins | 204,902,329 | 30,167,344 | 67,993 |
      | Anatomy non-plants | 126,856,048 | 21,192,154 | 2,378 |
      | Methods | 112,230,880 | 21,725,977 | 1,959 |
      | Species | 105,618,715 | 25,901,359 | 81,036 |
      | Diseases | 82,857,385 | 24,592,233 | 21,367 |
      | Physiology | 68,504,035 | 12,703,542 | 497 |
      | Nutrition | 59,367,731 | 12,839,777 | 3,861 |
      | Cosmetology | 23,465,151 | 4,883,741 | 920 |
      | Anatomy plants, fungi | 22,326,124 | 4,212,548 | 802 |
      | Cell lines | 9,857,621 | 2,325,743 | 2,079 |
      | Toxicity | 7,986,832 | 2,858,977 | 423 |
      | Species plants, fungi | 7,444,143 | 2,345,605 | 7,347 |
      | Regions | 6,974,421 | 2,781,913 | 1,040 |
      | Herbal drugs | 162,729 | 46,830 | 131 |
  20. Understanding Patents with Ontologies
  21. Ontologies – Why? 3 reasons:
      • patent claims are "ontological"
      • background knowledge helps to extract the meaning of named entities
      • end users search using knowledge classifications, e.g. "Which natural product compound class is useful to treat inflammation of the skin?"
  22. Chemical Ontologies – Why? Patent claims are "ontological". Patent classes & ad hoc classes:
      • e.g. chemical: "compounds according to claim 1", "acyl-pyrrolopyridines", any Markush structure, patent classes etc.
      • e.g. uses: "anti-infectives" (e.g. antibacterial, antiviral, antiparasitic ...)
  23. Chemical Ontologies – Why? Ontology based NLP to extract the meaning of named entities:
      • ontology based, context sensitive named entity resolution: "...glucose...", then "...glucose oxidase...", then "...glucose oxidase activity...", finally "...inhibitor of glucose oxidase activity..."
      • ontology based anaphora & cataphora resolution: "Tetrahydrofurane is a commonly used solvent in organic ... This cyclic ether has a melting point of -108.4 °C"
      • ontology based fingerprints classifying documents, e.g. into patent classes
  24. Ontology Based Property Extraction: 3 BRAT parts of one document.
  25. Understanding Patent Claims Logic. High quality patent annotations need:
      • an annotated text corpus ("Gold Set")
      • background ontologies
      Annotated between <chemistry> & <disease>: p=is_Active_Part_Of, i=is_Instance_Of. Source: Antje Schlaf, Claudia Bobach, Matthias Irmer, Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations from Patents, LREC 2014.
  26. End User Application Examples
  27. End User: Understanding Patents. Collaboration with infoapps GmbH (Munich): ChemAnalyser
  28. End User: Understanding Patents. ChemAnalyser – causative relationship mining
  29. End User: Understanding Patents. ChemAnalyser – causative relationship mining
  30. End User: Understanding Patents. ChemAnalyser – causative relationship mining
  31. End User: Patent Big Data Analytics. Hot compounds, hot targets? Source: L. Weber, T. Böhme, M. Irmer, Ontology-based content analysis of US patent applications from 2001–2010, Pharm. Pat. Analyst 2013, 2.
  32. End User: Patent Big Data Analytics. Enrichment factors for chemistry related diseases...

      | Chemistry concept | cardiovascular system | disease of mental health | disease of metabolism | respiratory system | nervous system | musculo-skeletal system | reproductive system | gastro-intestinal system | immune system | endocrine system |
      |---|---|---|---|---|---|---|---|---|---|---|
      | prostaglandin F2β derivatives | 557 | 0 | 0 | 0 | 607 | 427 | 0 | 0 | 375 | 0 |
      | hallucinogens | 494 | 1922 | 332 | 449 | 538 | 364 | 3146 | 622 | 199 | 1901 |
      | cichoric acid | 821 | 1662 | 432 | 1625 | 509 | 652 | 11623 | 1480 | 604 | 7239 |
      | alpha 1-adrenoceptor agonist | 821 | 0 | 267 | 1736 | 501 | 611 | 8684 | 1014 | 543 | 5636 |
      | pregn-4,9(11)-enes | 398 | 256 | 231 | 450 | 491 | 386 | 0 | 467 | 317 | 1296 |
      | canrenoic acids | 771 | 1343 | 425 | 1180 | 473 | 534 | 8474 | 1260 | 459 | 4960 |
      | aconitane derivatives | 0 | 1785 | 205 | 0 | 458 | 257 | 0 | 0 | 0 | 0 |
      | pseudoalkaloid derivatives | 0 | 1778 | 204 | 0 | 456 | 256 | 0 | 0 | 0 | 0 |
      | diterpene alkaloid derivatives | 0 | 1778 | 204 | 0 | 456 | 256 | 0 | 0 | 0 | 0 |
      | 13,14-dihydro-15-keto-prostaglandin D2 derivatives | 651 | 0 | 213 | 1831 | 447 | 482 | 0 | 1188 | 521 | 3956 |
      | ripisartan derivatives | 953 | 0 | 351 | 0 | 436 | 411 | 0 | 0 | 409 | 0 |
      | potassium-sparing diuretics | 896 | 1387 | 399 | 1156 | 425 | 496 | 6456 | 1218 | 501 | 3863 |
      | steroid acids | 692 | 1193 | 379 | 1046 | 423 | 485 | 7578 | 1132 | 412 | 4418 |
      | Milfasartan | 926 | 0 | 304 | 0 | 407 | 414 | 0 | 917 | 404 | 0 |
      | pyrrolizidine alkaloids | 453 | 1041 | 293 | 1264 | 407 | 464 | 0 | 1081 | 498 | 0 |
      | milfasartan derivatives | 930 | 0 | 303 | 0 | 406 | 416 | 0 | 913 | 402 | 0 |
      | Pratosartan | 695 | 929 | 450 | 523 | 394 | 240 | 2747 | 794 | 246 | 2800 |
  33. End User: Online Database ChemAnalyser
      • ChemAnalyser – Structure
      • ChemAnalyser – Full text & ontology based semantic searching
      • ChemAnalyser – Organic chemistry & drug discovery
      • ChemAnalyser – Alloys & Inorganic Materials
      • ChemAnalyser – Cosmetics & Nutrition
      • ChemAnalyser – Polymers
      • ChemAnalyser – Reach Report Support
  34. Thanks! Please register at www.chemanalyser.com for more information and a free trial.
  35. Thanks!
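
The context-sensitive named entity resolution on slide 23 (glucose, glucose oxidase, glucose oxidase activity) can be approximated by a longest-match dictionary lookup, sketched below. This is a simplification and not the OCMiner implementation: the term dictionary, the domain labels attached to each term, and the whitespace tokenization are all assumptions chosen for illustration.

```python
# Hypothetical term dictionary: term -> ontology domain (illustrative only).
TERMS = {
    "glucose": "chemistry",
    "glucose oxidase": "protein",
    "glucose oxidase activity": "physiology",
}

def annotate(text, terms=TERMS):
    """Annotate text with the longest dictionary term starting at each token.

    Returns a list of (matched_term, domain) pairs in reading order.
    Shorter terms nested inside a longer match are not reported.
    """
    tokens = text.lower().split()
    max_len = max(len(term.split()) for term in terms)
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in terms:
                annotations.append((candidate, terms[candidate]))
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no term starts here; move on
    return annotations

print(annotate("an inhibitor of glucose oxidase activity"))
print(annotate("glucose oxidase converts glucose"))
```

Preferring the longest match keeps "glucose" inside "glucose oxidase activity" from being reported as a chemical mention; a production annotator would also need proper tokenization and handling of punctuation.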
