FAIR Data Knowledge Graphs

FAIR data has flown up the hype curve without a clear sense of the return on the data stewardship investment it requires. The killer use case for FAIR data is a science knowledge graph, which lets you ask novel questions of your own data and the world’s data. To make this happen, we started with data catalogues (findability) that exploit linked/referenced data using a few focused vocabularies (interoperability), for credentialed users (accessibility), with provenance and attribution (reusability).

This talk was presented at The Molecular Medicine Tri-Conference/Bio-IT West on March 11, 2019.

Published in: Health & Medicine

  1. FAIR* Data Knowledge Graphs
     Tom Plasterer, PhD, Director, Bioinformatics, Research Bioinformatics
     11 Mar 2019
     * Findable, Accessible, Interoperable and Reusable
  2. What do R&D researchers want the ability to do?
     • Gain a greater understanding of the biology of the molecular mechanisms of diseases
     • Use the human as a model organism to a greater degree
     • Discover how the microbiome is involved with human pathogenesis
     • Understand the molecular mechanisms of drug failures
     • Use patient-level clinical data to identify subphenotypes of diseases
     Source: Integrative Informatics: A hybrid approach to integrating data for Drug Discovery, Mathew Woodwark; Pharma 2020: March 28, 2018
  3. Can R&D researchers do these things today?
     • Currently, data exists in file shares, on laptops, in eLNs, in silos of managed systems and in unknown places
     • The level of data integration is immature and fragmented
     • Using systems biology approaches requires considerable time and effort
     • Bioinformatics groups become a bottleneck to analyzing data
     • Research scientists are not empowered to use information and knowledge to answer complex questions
     Source: Integrative Informatics: A hybrid approach to integrating data for Drug Discovery, Mathew Woodwark; Pharma 2020: March 28, 2018
  4. IIx Approach: Build a FAIR Data Knowledge Graph
  5. FAIR Principles: One-Slide Overview
     Findable:
     • F1 (meta)data are assigned a globally unique and persistent identifier
     • F2 data are described with rich metadata
     • F3 metadata clearly and explicitly include the identifier of the data it describes
     • F4 (meta)data are registered or indexed in a searchable resource
     Accessible:
     • A1 (meta)data are retrievable by their identifier using a standardized communications protocol
     • A1.1 the protocol is open, free, and universally implementable
     • A1.2 the protocol allows for an authentication and authorization procedure, where necessary
     • A2 metadata are accessible, even when the data are no longer available
     Interoperable:
     • I1 (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
     • I2 (meta)data use vocabularies that follow FAIR principles
     • I3 (meta)data include qualified references to other (meta)data
     Reusable:
     • R1 (meta)data are richly described with a plurality of accurate and relevant attributes
     • R1.1 (meta)data are released with a clear and accessible data usage license
     • R1.2 (meta)data are associated with detailed provenance
     • R1.3 (meta)data meet domain-relevant community standards
     Source: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 (2016), doi: 10.1038/sdata.2016.18
  6. Knowledge Graph: Definition(s)…
  7. Knowledge Graph: Innovation Trigger
     Gartner Identifies Five Emerging Technology Trends That Will Blur the Lines Between Human and Machine
  8. Knowledge Graph: Key Features and Differentiators
     Federation:
     • Leave data in place or ETL pipeline?
     • URIs, indices really important
     Standards Support (Syntactic and Semantic):
     • Universal structure or bespoke?
     • Universal query language or bespoke?
     Analytics Enablement:
     • Reasoning, inferencing, graph methodologies
     Hybrid:
     • Underlying data in multiple shapes and repositories
     For Machines (and occasionally people)
     Cypher
  9. Starting Point: Modeling Business Questions
     Classes: core:Study, core:Project, core:Target, core:Subject, core:Drug, core:Indication, core:TherapeuticArea, core:BiologicalSample, core:Measurement, core:Technology, core:Visit, bdm:Cohort
     Properties: core:hasSubject, core:hasProject, core:hasDrug, core:hasIndication, bdm:hasArm, bdm:participatesIn, core:hasTA, core:hasTarget, core:hasMeasurement, core:hasSample, core:hasVisit, core:measuredBy, bnav:measuredInStudy
     Example questions:
     • Find all subjects diagnosed with SLE with a disease activity score > 5
     • Find all studies evaluating the target PD-L1 with RNA Seq Datasets
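The second business question on this slide can be sketched as a graph traversal. This is a minimal illustration rather than the presenters' implementation: the class and property names come from the slide, but the instance identifiers (`study:S1`, `target:PD-L1`, `tech:RNASeq`, and so on), the direction of each property, and the tiny in-memory triple set are assumptions made for the example.

```python
# Hypothetical instance data. Only the core:* class and property names come
# from the slide; edge directions and all identifiers are assumed.
TRIPLES = {
    ("study:S1", "rdf:type", "core:Study"),
    ("study:S1", "core:hasTarget", "target:PD-L1"),
    ("study:S1", "core:hasMeasurement", "meas:M1"),
    ("meas:M1", "core:measuredBy", "tech:RNASeq"),
    ("study:S2", "rdf:type", "core:Study"),
    ("study:S2", "core:hasTarget", "target:PD-L1"),
    ("study:S2", "core:hasMeasurement", "meas:M2"),
    ("meas:M2", "core:measuredBy", "tech:BloodGas"),
    ("study:S3", "rdf:type", "core:Study"),
    ("study:S3", "core:hasTarget", "target:EGFR"),
    ("study:S3", "core:hasMeasurement", "meas:M3"),
    ("meas:M3", "core:measuredBy", "tech:RNASeq"),
}


def objects(triples, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def studies_for_target_and_technology(triples, target, technology):
    """Find studies with the given target and at least one measurement
    produced by the given technology."""
    studies = {s for s, p, o in triples if p == "rdf:type" and o == "core:Study"}
    hits = []
    for study in sorted(studies):
        if target not in objects(triples, study, "core:hasTarget"):
            continue
        measurements = objects(triples, study, "core:hasMeasurement")
        if any(technology in objects(triples, m, "core:measuredBy")
               for m in measurements):
            hits.append(study)
    return hits


result = studies_for_target_and_technology(TRIPLES, "target:PD-L1", "tech:RNASeq")
print(result)
```

In a production knowledge graph the same pattern would more likely be expressed in a graph query language; the pure-Python version only shows which hops the question requires.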
  10. Enrichment: Core Ontology Classes & API mapping
      Challenge is determining the “stickiest” representation for a given instance
      • Studies all have a ‘D’-code and then a number of other internal and external identifiers
      • API calls to an internal clinical study API and an external (licensed content) API to obtain the exact matches (skos:exactMatch)
      • Process is abstracted in an Enrichment Service
      • New relationships (triples) are added to the wrapped data model and pushed into a knowledge graph
      Example (core:Study): http://data.rd.astrazeneca.net/study/bdm/CP1103
      • skos:exactMatch: http://clinicaltrials.astrazeneca.net/study/D4660C00001, http://identifiers.org/clinicaltrials/NCT01448850, http://trialtrove.citeline.com/ClinicalTrial/154466
      • dct:identifier: "azct:D4660C00001", "ctg:NCT01448850", "trialtrove:154466"
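The enrichment step on this slide can be sketched as follows. The study URI, the prefixed identifiers and the target namespaces are taken from the slide; the function names, the prefix table and the idea that the API results arrive as a plain list of prefixed identifiers are assumptions made for illustration. The real Enrichment Service and the APIs it calls are not described in enough detail to reproduce.

```python
# Prefix-to-namespace table inferred from the URIs shown on the slide.
NAMESPACES = {
    "azct": "http://clinicaltrials.astrazeneca.net/study/",
    "ctg": "http://identifiers.org/clinicaltrials/",
    "trialtrove": "http://trialtrove.citeline.com/ClinicalTrial/",
}


def expand(curie):
    """Expand a prefixed identifier such as 'ctg:NCT01448850' into a URI."""
    prefix, _, local_id = curie.partition(":")
    return NAMESPACES[prefix] + local_id


def enrich_study(study_uri, identifiers):
    """Return new triples for one study (hypothetical helper).

    `identifiers` stands in for whatever the internal and external study
    APIs return; here it is simply a list of prefixed identifiers.
    """
    triples = []
    for curie in identifiers:
        triples.append((study_uri, "dct:identifier", curie))
        triples.append((study_uri, "skos:exactMatch", expand(curie)))
    return triples


STUDY = "http://data.rd.astrazeneca.net/study/bdm/CP1103"
triples = enrich_study(
    STUDY, ["azct:D4660C00001", "ctg:NCT01448850", "trialtrove:154466"]
)
for triple in triples:
    print(triple)
```

The returned triples are what the slide describes as new relationships to be pushed into the knowledge graph; how they are loaded is left out here.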
  11. Enrichment: Core Ontology Classes & Label Matching
      Now find “stickiest” representation for a given instance from a label
      • Use system label for the indication
      • Send to Enrichment API (augmented public disease vocabularies) and generate the preferred URI to obtain the close matches (skos:closeMatch)
      • Process is abstracted in an Enrichment Service
      • New relationships (triples) are added to the wrapped data model and pushed into a knowledge graph
      Example (core:Indication): http://data.rd.astrazeneca.net/indication/bdm/Rheumatoid%20Arthritis
      • skos:closeMatch: http://purl.obolibrary.org/obo/DOID_7148, http://identifiers.org/mesh/D001172
      • bnav:diseaseNameSymbol: "Rheumatoid Arthritis (D001172) "
      • skos:prefLabel: "Rheumatoid Arthritis"
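Label-based enrichment can be sketched in the same spirit. The internal URI pattern and the two public URIs are taken from the slide; the lookup table, the normalization rule and the function names are assumptions, standing in for the augmented public disease vocabularies behind the Enrichment API.

```python
from urllib.parse import quote

# Hypothetical stand-in for the augmented disease vocabularies:
# normalized label -> public URIs considered close matches.
VOCABULARY = {
    "rheumatoid arthritis": [
        "http://purl.obolibrary.org/obo/DOID_7148",
        "http://identifiers.org/mesh/D001172",
    ],
}

INTERNAL_BASE = "http://data.rd.astrazeneca.net/indication/bdm/"


def normalize(label):
    """Lower-case the label and collapse whitespace before lookup."""
    return " ".join(label.lower().split())


def enrich_indication(label):
    """Return triples linking the internal indication URI to public URIs."""
    internal_uri = INTERNAL_BASE + quote(label)
    triples = [(internal_uri, "skos:prefLabel", label)]
    for uri in VOCABULARY.get(normalize(label), []):
        triples.append((internal_uri, "skos:closeMatch", uri))
    return triples


triples = enrich_indication("Rheumatoid Arthritis")
for triple in triples:
    print(triple)
```

A label with no entry in the vocabulary yields only the `skos:prefLabel` triple, which makes unmatched labels easy to spot and queue for curation.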
  12. Enrichment: Core Ontology Classes & Mixed Vocabs
      Now find “stickiest” representation for a given instance from a label without a good vocabulary
      • Aligned internal Technology vocabulary with best public label and URI
      • Send to Enrichment API (augmented BDM-technology vocabulary) and generate the preferred URI to obtain the exact matches (skos:exactMatch)
      • Process is abstracted in an Enrichment Service
      • New relationships (triples) are added to the wrapped data model and pushed into a knowledge graph
      Example (core:Technology): http://data.rd.astrazeneca.net/technology/bdm/BDMTECH00005
      • skos:prefLabel: "Blood Gas"
      • skos:exactMatch: http://identifiers.org/ncit/C71252 (skos:prefLabel: "Arterial Blood Gas Measurement")
  13. Key Lesson: Where is Enrichment Critical?
      Classes: core:Study, core:Project, core:Target, core:Subject, core:Drug, core:Indication, core:TherapeuticArea, core:BiologicalSample, core:Measurement, core:Technology, core:Visit, bdm:Cohort
      Properties: core:hasSubject, core:hasProject, core:hasDrug, core:hasIndication, bdm:hasArm, bdm:participatesIn, core:hasTA, core:hasTarget, core:hasMeasurement, core:hasSample, core:hasVisit, core:measuredBy
      Legend: External, Internal, Mix
  14. Dataset Catalogs: Find me Datasets about:
      Projects, Study, Indication/Disease, Technology, Targets, Cohort, Dates, Agent, Therapeutic Area, Drugs
  15. Dataset Catalogs: Findability Starts Here
      Dataset Catalog is a collection of Dataset Records
      • Catalogs are needed to support FAIR (Findable) data
      • Catalogs can and should support Enterprise MDM strategies
      • Consumers can be internal or external
      Dataset Catalogs are needed so data consumers can find Datasets
      • Dataset records need sufficient metadata to support discoverability
      • Dataset terms are NOT the data instance
      Dataset Catalogs surface dataset provenance and enable data access
      Dataset Catalogs can provide datasets for multiple consumption patterns
      • Analytics readiness and fit
      • ‘Walking’ across information models
  16. The Backbone: A DCAT conformant Data Catalog
      https://www.w3.org/TR/hcls-dataset/
      https://www.w3.org/TR/vocab-dcat/#vocabulary-overview
      Semantic tagging of datasets with concepts from taxonomies:
      • provides context
      • multi-dimensional & flexible
      • effective for discoverability
      • light-weight semantics
      Diagram classes: skos:Concept, dcat:Catalog, skos:ConceptScheme, dctypes:Dataset (summary), dctypes:Dataset (version), dcat:Distribution (dctypes:Dataset), void:Dataset, <foaf:Agent>
      Diagram properties: dct:title, dct:description, dct:publisher, dct:creator, dct:created, dct:source, dct:license, dct:format, dct:conformsTo, dct:accrualPeriodicity, dct:isVersionOf, dct:hasPart, foaf:page, dcat:keyword, dcat:dataset, dcat:theme, dcat:themeTaxonomy, dcat:distribution, dcat:accessURL, dcat:downloadURL, pav:version, pav:previousVersion, pav:hasCurrentVersion, pav:retrievedFrom, pav:createdWith, void:sparqlEndpoint, void:vocabulary, void:exampleResource, …other void properties
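A dataset record with the three description levels named in the diagram (summary, version, distribution) might be sketched like this. The property names come from the slide's vocabulary list; which property attaches to which level is an assumption, loosely following the W3C HCLS dataset description note linked above. The `example.org` URIs, the class names and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Distribution:
    uri: str
    fmt: str
    download_url: str


@dataclass
class DatasetVersion:
    uri: str
    version: str
    distributions: List[Distribution]


@dataclass
class DatasetSummary:
    uri: str
    title: str
    themes: List[str]  # concept URIs used for semantic tagging
    current: DatasetVersion


def to_triples(summary):
    """Flatten a record into triples (assumed property placement)."""
    version = summary.current
    triples = [
        (summary.uri, "rdf:type", "dctypes:Dataset"),
        (summary.uri, "dct:title", summary.title),
        (summary.uri, "pav:hasCurrentVersion", version.uri),
        (version.uri, "dct:isVersionOf", summary.uri),
        (version.uri, "pav:version", version.version),
    ]
    triples += [(summary.uri, "dcat:theme", theme) for theme in summary.themes]
    for dist in version.distributions:
        triples += [
            (version.uri, "dcat:distribution", dist.uri),
            (dist.uri, "rdf:type", "dcat:Distribution"),
            (dist.uri, "dct:format", dist.fmt),
            (dist.uri, "dcat:downloadURL", dist.download_url),
        ]
    return triples


# Hypothetical record; none of these URIs come from the presentation.
record = DatasetSummary(
    uri="http://example.org/dataset/rnaseq",
    title="Example RNA-Seq dataset",
    themes=["http://purl.obolibrary.org/obo/DOID_7148"],
    current=DatasetVersion(
        uri="http://example.org/dataset/rnaseq/v1",
        version="1",
        distributions=[
            Distribution(
                uri="http://example.org/dataset/rnaseq/v1/csv",
                fmt="text/csv",
                download_url="http://example.org/files/rnaseq-v1.csv",
            )
        ],
    ),
)
triples = to_triples(record)
for triple in triples:
    print(triple)
```

Tagging the summary level with `dcat:theme` concepts is what makes the record discoverable by disease, target or technology without touching the instance data.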
  17. Data Discoverability: Multi-phase Filtering
      • Phase 1: Data Catalog Filter (Dataset Catalog Descriptions)
      • Phase 2: Experiment Metadata Filter
      • Phase 3: Ad hoc Analyses Filtering
      • Outbound to Data Analytics, Data Science Tools
      • Statistical Filtering, e.g., clinical trial with > 50 participants
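The multi-phase filtering idea can be sketched as three successive filters over dataset records. The phase names and the "> 50 participants" example come from the slide; the record fields, the sample data and the assignment of that example to phase 2 are assumptions made for illustration.

```python
# Hypothetical catalog records; all field names and values are assumed.
CATALOG = [
    {"id": "ds-1", "indication": "SLE", "participants": 120, "has_raw_counts": True},
    {"id": "ds-2", "indication": "SLE", "participants": 30, "has_raw_counts": True},
    {"id": "ds-3", "indication": "RA", "participants": 200, "has_raw_counts": True},
    {"id": "ds-4", "indication": "SLE", "participants": 80, "has_raw_counts": False},
]


def phase1_catalog_filter(records, indication):
    """Phase 1: narrow by catalog-level description (here, the indication)."""
    return [r for r in records if r["indication"] == indication]


def phase2_metadata_filter(records, min_participants):
    """Phase 2: narrow by experiment metadata, e.g. more than N participants."""
    return [r for r in records if r["participants"] > min_participants]


def phase3_adhoc_filter(records, predicate):
    """Phase 3: apply an analysis-specific predicate before handing off."""
    return [r for r in records if predicate(r)]


candidates = phase1_catalog_filter(CATALOG, "SLE")
candidates = phase2_metadata_filter(candidates, 50)
candidates = phase3_adhoc_filter(candidates, lambda r: r["has_raw_counts"])

selected_ids = [r["id"] for r in candidates]
print(selected_ids)
```

Each phase sees only what survived the previous one, so the cheap catalog-level filter runs over everything and the expensive, analysis-specific checks run over a short list.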
  18. DisQover Example
  19. FAIR Knowledge Graph: Take-aways
      R&D | RDI
      Multi-Phase Filtering joins the Catalog and Domain Model
      • Balance to what belongs in a catalog record vs. instance data
      Public Domain Ontologies and Identifiers should be reused
      • Consensus is emerging around best practices and cross-mapping
      DCTERMS, DCAT, VoID are almost sufficient
      • Extend for local needs
      Lots of Activity to Learn and Shape Best Practices
      • Didn’t reinvent a wheel
  20. Thanks
      Key Influencers: David Wood, Tim Berners-Lee, Lee Harland, Jane Lomax, James Malone, Dean Allemang, Barend Mons, Carole Goble, Bernadette Hyland, Bob Stanley, Eric Little, Michel Dumontier, John Wilbanks, Hans Constandt, Filip Pattyn, Dan Crowther, Tim Hoctor, Ian Harrow
      AstraZeneca/Pistoia FAIR Data Community: Mathew Woodwark, Rajan Desai, Nic Sinibaldi, Chia-Chien Chiang, Kerstin Forsberg, Ola Engkvist, Ian Dix, Colin Wood, Ted Slater, Martin Romacker, Eric Neumann, Jeff Saltzman, Kathy Reinold, Nirmal Keshava, Bryan Takasaki