Metadata in the BioSample
Repository are Impaired by
Numerous Anomalies
Rafael Gonçalves
Stanford University
Metadata Are Essential In Science
• Metadata are crucial for finding, reproducing,
and reusing the data that they describe
• The FAIR data principles specify desirable
criteria that metadata and their datasets should
meet to be Findable, Accessible, Interoperable,
and Reusable
• For metadata to be interoperable, they should
rely on controlled terms from ontologies
The NCBI BioSample Metadata
Repository (2011-)
• BioSample stores metadata that describe
biological materials (samples) under investigation
• BioSample was designed to standardize
descriptions of samples for all NCBI repositories
• Our BioSample dump contains 6,615,347 records
• Collected on June 25th, 2017
A BioSample Metadata Record
[Figure: a BioSample metadata record; labels: Study description & Raw data]
Design of Metadata Quality Study
• Do attribute names correspond to
ontology terms?
• Are the attribute names used in
metadata records specified by BioSample?
• Are the attribute values valid
according to their specification?
E.g., is the value of a numeric
attribute truly a number?
BioSample Metadata Are
Categorized Into Packages
• BioSample provides specifications of 104 packages
• E.g., Human, Microbe, Virus, Plant, Pathogen, etc.
• A package specifies the set of mandatory and optional
attributes that should be used to describe samples
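The package idea above can be sketched in a few lines. This is only an illustration, not how BioSample itself checks submissions: the `Package` class, the helper names, and the attribute names listed for the "Human" package are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """Hypothetical stand-in for a BioSample package specification."""
    name: str
    mandatory: frozenset
    optional: frozenset = frozenset()


def missing_mandatory(record: dict, package: Package) -> set:
    """Return the mandatory attribute names that are absent or blank in the record."""
    missing = set()
    for name in package.mandatory:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.add(name)
    return missing


def adheres_to(record: dict, package: Package) -> bool:
    """A record adheres to its package when no mandatory attribute is missing."""
    return not missing_mandatory(record, package)


# Illustrative package; these attribute names are assumptions, not the real spec.
human = Package(
    name="Human",
    mandatory=frozenset({"age", "sex", "tissue"}),
    optional=frozenset({"disease", "smoker"}),
)

record = {"age": "42", "sex": "female", "disease": "BrCa"}
print(missing_mandatory(record, human))  # {'tissue'}
print(adheres_to(record, human))         # False
```

The check only looks at mandatory attributes; whether a value is well-formed is a separate question.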
[Diagram: the BioSample Metadata Repository contains BioSample Metadata Records; each record adheres to a BioSample Package]
The Human Package Specification
Metadata Records Define
Multiple Attributes
• Each attribute (name-value pair) represents a
characteristic of a sample
• BioSample specifies a dictionary of attributes
• 452 attribute names and their expected value types
• Users can provide attributes with arbitrary names
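One way to separate dictionary-specified attribute names from user-provided ones is a plain membership test, sketched below. The six-name dictionary is a made-up subset standing in for the 452 specified names, and exact string matching is an assumption; the slides do not say how names were compared.

```python
from collections import Counter

# Tiny illustrative subset; the real BioSample dictionary specifies 452 names.
BIOSAMPLE_DICTIONARY = {"age", "sex", "tissue", "disease", "smoker", "phenotype"}


def classify_names(records, dictionary=BIOSAMPLE_DICTIONARY):
    """Count attribute-name occurrences as 'specified' or 'user-defined'."""
    counts = Counter()
    for record in records:
        for name in record:
            # Assumption: a name is 'specified' only on an exact match.
            kind = "specified" if name in dictionary else "user-defined"
            counts[kind] += 1
    return counts


# Hypothetical records, each a mapping from attribute name to value.
records = [
    {"age": "42", "sex": "female", "patient id": "JM52"},
    {"tissue": "lung", "Smoking status": "former"},
]
print(classify_names(records))
```

Here three name occurrences fall in the dictionary and two ("patient id", "Smoking status") count as user-defined.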
[Diagram: the BioSample Metadata Repository contains BioSample Metadata Records; each record adheres to a BioSample Package and defines BioSample Attributes; each attribute is composed of an Attribute Name and an Attribute Value]
Attribute Types Under Analysis
• Integer - require values that are integers
• Boolean - require values that are Booleans
• Value set - take on values from value sets defined in
the BioSample documentation
• Ontology term - take on term values from specific
ontologies
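The first three attribute types lend themselves to simple per-value checks. The sketch below is one possible reading, not the study's actual code: the set of accepted Boolean spellings, the function names, and the example value set are assumptions.

```python
import re

# Assumption: the slides do not state which spellings count as Booleans.
BOOLEAN_LITERALS = {"true", "false"}


def is_integer(value: str) -> bool:
    """True when the value is an optionally signed run of ASCII digits."""
    return re.fullmatch(r"[+-]?[0-9]+", value.strip()) is not None


def is_boolean(value: str) -> bool:
    """True when the value is one of the accepted Boolean literals."""
    return value.strip().lower() in BOOLEAN_LITERALS


def in_value_set(value: str, allowed: set) -> bool:
    """True when the value is a member of the attribute's value set."""
    return value.strip() in allowed


def invalid_values(values, attr_type, allowed=None):
    """Return the values that fail the check for the given attribute type."""
    if attr_type == "integer":
        return [v for v in values if not is_integer(v)]
    if attr_type == "boolean":
        return [v for v in values if not is_boolean(v)]
    if attr_type == "value set":
        return [v for v in values if not in_value_set(v, allowed or set())]
    raise ValueError(f"unsupported attribute type: {attr_type}")


print(invalid_values(["12", "NO", "pig", "-3"], "integer"))
# ['NO', 'pig']
print(invalid_values(["true", "nonsmoker", "Former"], "boolean"))
# ['nonsmoker', 'Former']
```

A regular expression is used for integers rather than `int()`, because `int()` also accepts forms such as `"1_000"` that a strict integer field would likely reject.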
Most Metadata Submissions Do
Not Adhere To Packages
The Vast Majority of Attribute
Names Are Defined By Users
No correspondence
with ontology terms!
Summary of Results
Simple Fields Have a Wide Range
of Values
• Boolean-type attributes have many values that do not
parse into Booleans
• For example, for the smoker attribute, there are such diverse
values as: Non-smoker, nonsmoker, non smoker, ex-smoker, Ex
smoker, smoker, former-smoker, Former, current, …
• While most Integer values are valid, there are many
values that are plain text, for example: e;N/A, NO,
UVPgt59.4, pig, JM52, stock_180.92, ...
• Data types of attribute values are not enforced!
Ontology Term Attributes are Mostly
Populated with Invalid Values
• Example values for the disease metadata attribute: No
Adenomas, BrCa, presumed normal, no AD evident at
demise, “NL smooth muscle, stomach rmvd as part of
pancr., CA”, …
• Example values for phenotype metadata attribute:
unknown, monster, wild_type, none, 30 psu, “The 136
mutant has a shorter root meristem and a reduction in
root length of about 75% compared to wild type”, …
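To make the notion of an invalid ontology-term value concrete, here is a minimal sketch that looks values up in a label index. The index contents and the `EX:` identifiers are invented for illustration, and matching on normalized labels is an assumption; the slides do not describe the matching procedure used in the study.

```python
# Hypothetical label index: lower-cased term label -> term identifier.
# The labels and EX: identifiers are made up for this example.
TERM_INDEX = {
    "breast carcinoma": "EX:0001",
    "alzheimer's disease": "EX:0002",
}


def match_term(value, index=TERM_INDEX):
    """Return the identifier of the term whose label equals the value, else None."""
    # Normalize case and collapse runs of whitespace before the lookup.
    normalized = " ".join(value.lower().split())
    return index.get(normalized)


# The last three values are taken from the disease examples above.
values = ["Breast  Carcinoma", "BrCa", "presumed normal", "No Adenomas"]
unmatched = [v for v in values if match_term(v) is None]
print(unmatched)  # ['BrCa', 'presumed normal', 'No Adenomas']
```

Abbreviations such as "BrCa" and free-text notes fail this kind of lookup even when a human reader can tell which disease is meant.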
BioSample Lacks Standardization
of Metadata Attributes
• The values for attributes defined by BioSample
are not appropriately verified
• Neither the attribute names nor (most of) their
values rely on ontologies
• To be FAIR, the metadata in BioSample would
have to improve considerably
A Solution: The CEDAR Workbench
See the CEDAR Resource Track paper & talk
Tuesday - Session 6 - Biomedical and scientific applications
Questions

Metadata in the BioSample Online Repository are Impaired by Numerous Anomalies (SemSci 2017 Workshop)

  • 1. Metadata in the BioSample Repository are Impaired by Numerous Anomalies
      • Rafael Gonçalves, Stanford University
  • 2. Metadata Are Essential In Science
      • Metadata are crucial for finding, reproducing, and reusing the data that they describe
      • The FAIR data principles specify desirable criteria that metadata and their datasets should meet to be Findable, Accessible, Interoperable, and Reusable
      • For metadata to be interoperable, they should rely on controlled terms from ontologies
  • 3. The NCBI BioSample Metadata Repository (2011-)
  • 4. The NCBI BioSample Metadata Repository (2011-)
      • BioSample stores metadata that describe biological materials (samples) under investigation
      • BioSample was designed to standardize descriptions of samples for all NCBI repositories
      • Our BioSample dump contains 6,615,347 records
      • Collected on June 25th, 2017
  • 5. A BioSample Metadata Record
      • [Figure labels: Study description & Raw data]
  • 6. Design of Metadata Quality Study
  • 7. Design of Metadata Quality Study
      • Do attribute names correspond to ontology terms?
      • Are the attribute names used in metadata records specified by BioSample?
  • 8. Design of Metadata Quality Study
      • Are the attribute values valid according to their specification? E.g., is the value of a numeric attribute truly a number?
  • 9. BioSample Metadata Are Categorized Into Packages
      • BioSample provides specifications of 104 packages
      • E.g., Human, Microbe, Virus, Plant, Pathogen, etc.
      • A package specifies the set of mandatory and optional attributes that should be used to describe samples
      • [Diagram: the BioSample Metadata Repository contains BioSample Metadata Records; a record adheres to a BioSample Package]
  • 10. The Human Package Specification
  • 11. Metadata Records Define Multiple Attributes
      • Each attribute (name-value pair) represents a characteristic of a sample
      • BioSample specifies a dictionary of attributes
      • 452 attribute names and their expected value types
      • Users can provide attributes with arbitrary names
      • [Diagram: the BioSample Metadata Repository contains BioSample Metadata Records; a record adheres to a BioSample Package and defines BioSample Attributes, each composed of an Attribute Name and an Attribute Value]
  • 12. Attribute Types Under Analysis
      • Integer: attributes that require integer values
      • Boolean: attributes that require Boolean values
      • Value set: attributes that take on values from value sets defined in the BioSample documentation
      • Ontology term: attributes that take on term values from specific ontologies
  • 13. Most Metadata Submissions Do Not Adhere To Packages
  • 14. The Vast Majority of Attribute Names Are Defined By Users
      • No correspondence with ontology terms!
  • 16. Simple Fields Have a Wide Range of Values
      • Boolean-type attributes have many values that do not parse into Booleans
      • For example, for the smoker attribute, there are such diverse values as: Non-smoker, nonsmoker, non smoker, ex-smoker, Ex smoker, smoker, former-smoker, Former, current, …
      • While most Integer values are valid, there are many values that are plain text, for example: e;N/A, NO, UVPgt59.4, pig, JM52, stock_180.92, ...
      • Data types of attribute values are not enforced!
  • 17. Ontology Term Attributes are Mostly Populated with Invalid Values
      • Example values for the disease metadata attribute: No Adenomas, BrCa, presumed normal, no AD evident at demise, “NL smooth muscle, stomach rmvd as part of pancr., CA”, …
      • Example values for the phenotype metadata attribute: unknown, monster, wild_type, none, 30 psu, “The 136 mutant has a shorter root meristem and a reduction in root length of about 75% compared to wild type”, …
  • 18. BioSample Lacks Standardization of Metadata Attributes
      • The values for attributes defined by BioSample are not appropriately verified
      • Neither the attribute names nor (most of) their values rely on ontologies
      • To be FAIR, the metadata in BioSample would have to improve considerably
  • 19. A Solution: The CEDAR Workbench
      • See the CEDAR Resource Track paper & talk: Tuesday, Session 6, Biomedical and scientific applications
  • 20. Questions
      • See the CEDAR Resource Track paper & talk: Tuesday, Session 6, Biomedical and scientific applications
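
Slides 7, 9 and 11 describe a record that adheres to a package and defines name-value attributes, and they ask whether the attribute names a record uses are specified by BioSample. The sketch below is one possible way to model that check, not the study's actual code. The class names, the `check_names` helper, the example package and the three-entry dictionary are all hypothetical stand-ins; the real BioSample dictionary lists 452 attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """A package names the mandatory and optional attributes for a sample."""
    name: str
    mandatory: frozenset
    optional: frozenset = frozenset()


@dataclass
class Record:
    """A metadata record adheres to one package and defines attributes."""
    package: Package
    attributes: dict = field(default_factory=dict)  # attribute name -> value


def check_names(record, dictionary):
    """Report mandatory attributes that are missing and names not in the dictionary."""
    names = set(record.attributes)
    return {
        "missing_mandatory": sorted(record.package.mandatory - names),
        "user_defined": sorted(names - set(dictionary)),
    }


# Hypothetical data for illustration only; not the real package specification.
example_package = Package("Example", mandatory=frozenset({"sex", "age"}))
example_dictionary = {"sex", "age", "smoker"}
example_record = Record(
    example_package,
    {"sex": "male", "smoker": "Former", "my custom field": "x"},
)
report = check_names(example_record, example_dictionary)
print(report)
```

Here the record lacks the mandatory `age` attribute and carries one name that the dictionary does not define, which mirrors the two kinds of anomaly reported in slides 13 and 14.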
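
The value checks raised in slides 8, 12 and 16 (is an Integer value truly an integer, does a Boolean value parse into a Boolean, does a value belong to its value set) can be sketched as below. This is a minimal illustration under stated assumptions, not the validation used in the study: the accepted Boolean literals and the function names are assumptions, and an ontology-term check would need a lookup against the relevant ontology, which is only approximated here by membership in a caller-supplied set.

```python
import re

# Assumption: which literals count as Booleans is a choice made for this sketch.
BOOLEAN_LITERALS = frozenset({"true", "false", "yes", "no"})
_INTEGER_RE = re.compile(r"[+-]?[0-9]+")


def is_integer(value):
    """True if the value is an optionally signed run of ASCII digits."""
    return _INTEGER_RE.fullmatch(value.strip()) is not None


def is_boolean(value):
    """True if the value is one of the assumed Boolean literals (case-insensitive)."""
    return value.strip().lower() in BOOLEAN_LITERALS


def in_value_set(value, allowed):
    """True if the value matches an allowed entry, ignoring case and outer spaces."""
    return value.strip().lower() in {a.lower() for a in allowed}


def invalid_values(values, check):
    """Return the values that fail the given check, in their original order."""
    return [v for v in values if not check(v)]


# Example values quoted on slide 16.
smoker_values = ["Non-smoker", "nonsmoker", "non smoker", "ex-smoker",
                 "Ex smoker", "smoker", "former-smoker", "Former", "current"]
integer_values = ["e;N/A", "NO", "UVPgt59.4", "pig", "JM52", "stock_180.92"]

bad_booleans = invalid_values(smoker_values, is_boolean)
bad_integers = invalid_values(integer_values, is_integer)
print(len(bad_booleans), "of", len(smoker_values), "smoker values fail the Boolean check")
print(len(bad_integers), "of", len(integer_values), "example values fail the Integer check")
```

Under these assumptions, every smoker value quoted on slide 16 fails the Boolean check and every plain-text example fails the Integer check, which is the kind of unenforced typing the slide points out.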