Managing Biomedical
Data and Metadata in
Large Scale Collaborations
November 28, 2018
Copyright ©2018. All Rights Reserved. Confidential Databiology Ltd.
Why Metadata?
▪ What is Metadata?
− Content
− Context
− Process
▪ Metadata is not always derived directly from the artefact, but may be obtained from multiple sources
▪ Metadata semantics are key to unlocking findability, provenance and usability of data artefacts
Why Collaboration?
▪ Data continues to accumulate at an exponential rate
− Multiple efforts are capturing anything conceivable
− The line between study data and non-study data is blurring
▪ Demand for data continues to grow
− Everyone hungers for high-quality, consented biomedical datasets
− Regulation such as GDPR points to the need for large-scale consent management capability
▪ Generating and storing all data in-house no longer makes business sense
Status Quo
▪ Data is produced in silos
− Specialized systems: clinical, prescriptions, lab, imaging, sequencing, sensors, etc.
▪ There is no one warehouse of everything for everyone
− For the foreseeable future there will always be some (largish) degree of federation
− No single data science platform can cater to everyone
▪ There is no one view on the data
− No use case needs all the data
− Each use case needs a unique combination of data
Obstacles to Collaboration
▪ Working with data
− Data Access
o Non-local data
o Data islands
o Multi-disciplinary
− Data Preparation
o Data normalization
o Data scientist grunt work challenge
▪ Working together – sharing vs collaborating
− Involvement of different organizations
− Differing methods of processing
▪ Regulation, contracts and audit
The Common Approaches: Aggregation
▪ Aggregation: Central data warehouse with a corresponding API layer for querying very large data sets quickly
▪ Common Challenges
− The line between data and metadata is blurred
− Scalability
− Cost
− Access controls
The Common Approaches: Standardization
▪ Standardization: Common Data Models and APIs to obtain information from different custodians
▪ Common Challenges
− Many standards
− They are all in flux
− Big effort to implement and to maintain
− Coverage
[Figure labels: Analytics Coverage, Standards Coverage]
The Common Approaches: Federation
▪ Federation: Building on aggregation and standardization, query multiple data custodians and deliver aggregate answers
▪ Common Challenges
− Standardizing queries
− Authentication / Authorization
− Normalization
− Performance
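The federation idea on this slide, sending one query to several custodians and returning only an aggregate answer, can be sketched roughly as follows. This is a minimal illustration and not the platform's implementation: the `Custodian` class, the `federated_count` function and the sample records are all hypothetical, and a real deployment would also need the authentication, query standardization and normalization steps listed as challenges.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Custodian:
    """Hypothetical data custodian that only answers aggregate queries."""
    name: str
    records: List[Record]

    def count(self, predicate: Callable[[Record], bool]) -> int:
        # Row-level data stays with the custodian; only the count is returned.
        return sum(1 for r in self.records if predicate(r))


def federated_count(custodians: List[Custodian],
                    predicate: Callable[[Record], bool]) -> Dict[str, Any]:
    """Send the same query to every custodian and combine the counts."""
    per_site = {c.name: c.count(predicate) for c in custodians}
    return {"total": sum(per_site.values()), "per_site": per_site}


# Made-up example data for two custodians.
sites = [
    Custodian("hospital_a", [{"age": 61, "dx": "T2D"}, {"age": 40, "dx": "T2D"}]),
    Custodian("biobank_b", [{"age": 72, "dx": "T2D"}, {"age": 55, "dx": "CAD"}]),
]
result = federated_count(sites, lambda r: r["dx"] == "T2D" and r["age"] >= 50)
print(result)
```

Returning per-site counts alongside the total is a design choice made for this sketch; a stricter setup might return only the total to reduce disclosure risk.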
Metadata and Conway’s Law
“Organizations which design systems (in the broad sense) ... are constrained to produce designs which are copies of the communication structures of these organizations.”
Conway’s Law, Melvin Conway, Datamation, 1968
Metadata – Description of Data Artefacts
▪ One person's metadata is another person's data
▪ Collaborate and establish the broadest consensus for a given data type
− Minimum viable standard metadata model across custodians
− Further enriched with contextual data specialized per study
− Requirements:
o Handling the presence of unexpected data as well as the absence of expected data
o Propagation of change and its impact on provenance
▪ The data model needs to be accommodating: ideally standardized summary data with ad hoc extensions by interest
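One way to picture a minimum viable core model with ad hoc extensions is the sketch below. The field names in `CORE_FIELDS` and the `normalize` function are assumptions made for illustration; the slide does not specify a schema. The sketch keeps unexpected fields as extensions instead of rejecting them, and reports missing expected fields instead of failing.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical core model: field name -> expected Python type.
CORE_FIELDS = {"sample_id": str, "data_type": str, "custodian": str}


def normalize(raw: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Split a raw record into core fields and extensions, reporting gaps."""
    issues: List[str] = []
    core: Dict[str, Any] = {}
    for name, expected in CORE_FIELDS.items():
        if name not in raw:
            # Absence of expected data is recorded, not treated as fatal.
            issues.append(f"missing expected field: {name}")
        elif not isinstance(raw[name], expected):
            issues.append(f"wrong type for field: {name}")
        else:
            core[name] = raw[name]
    # Presence of unexpected data: keep it as a study-specific extension.
    extensions = {k: v for k, v in raw.items() if k not in CORE_FIELDS}
    return {"core": core, "extensions": extensions}, issues


record, issues = normalize(
    {"sample_id": "S-001", "data_type": "WGS", "tissue": "blood"}
)
print(record)
print(issues)
```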
Metadata Aggregation Lifecycle
Stages: Extract, Translate, Validate, Annotate, Store, Index, Project
▪ Extract: Any combination of tools to extract data from one or many sources
− File Systems
− Files
− Databases
− APIs
▪ Translate: Prepare extracted native data fields for processing by DBE
▪ Validate: Validate metadata inputs against type constraints
▪ Annotate: Process data fields marked for annotation with ontology providers
▪ Store: Store validated and annotated data in DBE database
▪ Index: Index stored data in DBE search index
▪ Project: Projection of outputs directly into analysis frameworks or via API
[Diagram labels: Data Sources, Importers, DBE Core Platform, Data Consumers; Distributed, Centralized]
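The seven lifecycle stages can be read as a simple pipeline. The sketch below is only an illustration of that reading: every function, the `TYPE_CONSTRAINTS` table, the `ONTOLOGY` lookup and its placeholder term identifier are hypothetical, and in-memory lists and dictionaries stand in for the DBE database and search index.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical type constraints and ontology lookup, for illustration only.
TYPE_CONSTRAINTS = {"sample_id": str, "tissue": str}
ONTOLOGY = {"blood": "EXAMPLE:0001"}  # placeholder term identifier


def extract(source: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Extract: stand-in for reading files, databases or APIs."""
    return list(source)


def translate(row: Dict[str, Any]) -> Dict[str, Any]:
    """Translate: map native field names onto the expected ones."""
    return {"sample_id": row["SampleID"], "tissue": row["Tissue"].lower()}


def validate(row: Dict[str, Any]) -> bool:
    """Validate: check each field against its type constraint."""
    return all(isinstance(row.get(k), t) for k, t in TYPE_CONSTRAINTS.items())


def annotate(row: Dict[str, Any]) -> Dict[str, Any]:
    """Annotate: attach an ontology term to the marked field."""
    return {**row, "tissue_term": ONTOLOGY.get(row["tissue"])}


def run(source: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], Dict[str, int]]:
    store: List[Dict[str, Any]] = []   # Store: stand-in for the database
    index: Dict[str, int] = {}         # Index: stand-in for the search index
    for row in map(translate, extract(source)):
        if not validate(row):
            continue  # invalid rows are skipped in this sketch
        store.append(annotate(row))
        index[row["sample_id"]] = len(store) - 1
    return store, index


def project(store: List[Dict[str, Any]], index: Dict[str, int],
            sample_id: str) -> Dict[str, Any]:
    """Project: hand one stored record to a data consumer."""
    return store[index[sample_id]]


store, index = run([
    {"SampleID": "S-001", "Tissue": "Blood"},
    {"SampleID": 42, "Tissue": "Liver"},  # fails validation: id is not a string
])
print(project(store, index, "S-001"))
```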
Metadata Federation Lifecycle
[Diagram labels: Portal API, Authentication, Query Builder, Query Federator, Data Basket, HL7 FHIR API, Workspaces, Cohort Management, Importers, DBE Core Platform, Federation Backends, GA4GH Beacon API]
Stages: Extract, Translate, Validate, Annotate, Store, Index, Project
Data as a function of other data
“Rien ne se perd, rien ne se crée, tout se transforme” (“Nothing is lost, nothing is created, everything is transformed”)
Antoine-Laurent de Lavoisier
▪ Metadata describes not only the content of an artefact, but also the function that created / transformed the artefact
▪ Every data artefact is the result of one or more functions
− User
− Application Stack, Configuration, Version
− Infrastructure
− Data Dependencies
− Projections
o Inputs or Source
o Outputs (Data)
Essential for provenance, reproducibility and consent operations
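Treating each artefact as the output of a recorded function makes provenance and consent operations tractable. The sketch below is a hedged illustration: the `Provenance` fields mirror the list on this slide, while `artefact_id`, `downstream` and the example catalogue are hypothetical names and data. It derives an identifier from content plus the producing function, and finds every artefact derived from a given input, as would be needed if consent for that input were withdrawn.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of the function that produced an artefact."""
    user: str
    application: str          # application stack and version
    configuration: str
    infrastructure: str
    inputs: Tuple[str, ...]   # identifiers of the artefacts depended on


def artefact_id(content: bytes, prov: Provenance) -> str:
    """Derive an identifier from the content and the producing function."""
    digest = hashlib.sha256()
    digest.update(content)
    digest.update(repr(prov).encode("utf-8"))
    return digest.hexdigest()[:12]


def downstream(root: str, catalogue: Dict[str, Provenance]) -> List[str]:
    """List every artefact derived, directly or indirectly, from `root`."""
    found: List[str] = []
    frontier = [root]
    while frontier:
        current = frontier.pop()
        for name, prov in catalogue.items():
            if current in prov.inputs and name not in found:
                found.append(name)
                frontier.append(name)
    return sorted(found)


# Made-up catalogue: raw data, an alignment of it, and variants called from that.
catalogue = {
    "raw": Provenance("alice", "sequencer 1.0", "default", "lab", ()),
    "aligned": Provenance("bob", "aligner 2.1", "fast", "cluster", ("raw",)),
    "variants": Provenance("bob", "caller 0.9", "default", "cluster", ("aligned",)),
}
affected = downstream("raw", catalogue)
print(affected)
```

Because the identifier covers the provenance record as well as the content, the same bytes produced by a different tool version get a different identifier in this sketch.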
Do You Have Any Questions?
Contact us or follow us online!
twitter.com/databiology
linkedin.com/company/databiology
databiology.com
Databiology Ltd.
Magdalen Centre
The Oxford Science Park
Oxford, OX4 4GA
United Kingdom
+44-1865-784426
contactus@databiology.com
Databiology Inc.
201 Spear Street, Suite 1100
San Francisco, CA 94105
USA
+1-415-426-3592
contactus@databiology.com
Databiology Hong Kong Ltd.
Unit E, 6/F Golden Sun Centre
59-67 Bonham Street West
Sheung Wan, Hong Kong
Hong Kong (SAR)
+852-8193-4005
contactus@databiology.com
