Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Introduction to DATS v2.2 - NIH May 2017
1. Supported by the NIH grant 1U24 AI117966-01 to UCSD
PI , Co-Investigators at:
The
annotated with schema.org
Susanna-Assunta Sansone,
Alejandra Gonzalez-Beltran, Philippe Rocca-Serra
Oxford e-Research Centre, University of Oxford, UK
bioCADDIE - DATS Workshop, Bethesda, 7 May 2017
4. What is ?
Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature,
a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in
the DataMed prototype
5. Where do I find the documentation?
Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature,
a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in
the DataMed prototype
8. Mar15
Jun15
Dec15
Jun16
Aug15
May16
Sep16
Mar17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development
• Use cases collection= George, ExcTeam
• Method development (SOP); competency
questions; metadata mapping and
definition= Susanna, Alejandra, Philippe
SOP and
metadata
strawman
<DATS>
name
May17
Metadata
specification V1.0
with JSON
schema
Use cases
workshop
WG3 formed;
telecons start;
dissemination via
WG7 formed;
telecons start
Evaluation & iterative refinement Continued evaluation & consolidation
9. Mar15
Jun15
Dec15
Jun16
Aug15
May16
Sep16
Mar17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development
• Use cases collection= George, ExcTeam
• Method development (SOP); competency
questions; metadata mapping and
definition= Susanna, Alejandra, Philippe
SOP and
metadata
strawman
<DATS>
name
DATS
v1.1
May17
DATS v2.0
(with access
metadata,
WG7)
Metadata
specification V1.0
with JSON
schema
Use cases
workshop
1st DATS
workshop
WG3 formed;
telecons start;
dissemination via
WG7 formed;
telecons start
Evaluation & iterative refinement
• DATS specification, serializations,
refinement= (Susanna), Alejandra,
Philippe and CoreDevTeam, also
based on community feedback
Continued evaluation & consolidation
10. Mar15
Jun15
Dec15
Jun16
Aug15
May16
Sep16
Mar17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development
• Use cases collection= George, ExcTeam
• Method development (SOP); competency
questions; metadata mapping and
definition= Susanna, Alejandra, Philippe
SOP and
metadata
strawman
<DATS>
name
DATS
v1.1
May17
DATS v2.0
(with access
metadata,
WG7)
DATS v2.1
(schema.org
JSON-LD)
DATS
v2.2
Metadata
specification V1.0
with JSON
schema
Use cases
workshop
1st DATS
workshop
WG3 formed;
telecons start;
dissemination via
2nd DATS
workshop
WG7 formed;
telecons start
WG12 formed;
telecons start
Evaluation & iterative refinement
• DATS specification, serializations,
refinement= (Susanna), Alejandra,
Philippe and CoreDevTeam, also
based on community feedback
Continued evaluation & consolidation
• Alignment with other community efforts;
documentation and curation guidelines =
Susanna, Alejandra, Philippe, Jared,
ExcTeam and CoreDevTeam
11. Mar15
Jun15
Dec15
Jun16
Aug15
May16
Sep16
Mar17
Use cases
workshop
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development
• Use cases collection= George, ExcTeam
• Method development (SOP); competency
questions; metadata mapping and
definition= Susanna, Alejandra, Philippe
Evaluation & iterative refinement
• DATS specification, serializations,
refinement= (Susanna), Alejandra,
Philippe and CoreDevTeam, also
based on community feedback
Continued evaluation & consolidation
• Alignment with other community efforts;
documentation and curation guidelines =
Susanna, Alejandra, Philippe, Jared,
ExcTeam and CoreDevTeam
1st DATS
workshop
SOP and
metadata
strawman
WG3 formed;
telecons start;
dissemination via
<DATS>
name
DATS
v1.1
May17
2nd DATS
workshop
DATS v2.0
(with access
metadata,
WG7)
WG7 formed;
telecons start
DATS v2.1
(schema.org
JSON-LD)
DATS
v2.2
primarily metadata modelers
primarily implementers
Metadata
specification V1.0
with JSON
schema
WG12 formed;
telecons start
12. ❖ Enabling discoverability: find and access datasets
❖ Focusing on surfacing key metadata descriptors, such as
✧ information and relations between authors, datasets, publication,
funding sources, nature of biological signal and perturbation etc.
✧ Not the perfect model to represent the experimental details
✧ the level of details and metadata needed to ensure interoperability
and reusability are left to the indexed databases
❖ Better than just having keywords
✧ we have aimed to have maximum coverage of use cases with
minimal number of data elements and relations
What is supposed to do and be?
13. Metadata elements identified by combining the two complementary approaches
USE CASES: top-down approach SCHEMAS: bottom-up approach
The development process in a nutshell
(v1.0, v1.1, v2.0, v2.1, v2.2)
14. Extracting requirements from use cases
❖ Selected competency questions
✧ representative set collected from: use cases workshop, white paper, submitted by
the community and from NIH and Phil Bourne’s ADDS office
✧ key metadata elements processed: abstracted, color-coded and terms binned
binned as Material, Process, Information, Properties; relation identified
top-down approach
15. bottom-up approach
Standing on the shoulders of giants
❖ schema.org
❖ DataCite
❖ RIF-CS
❖ W3C HCLS dataset descriptions (mapping of many models including DCAT, PROV, VOID, Dublin
Core)
❖ Project Open Metadata (used by HealthData.gov is being added in this new iteration)
❖ ……(full list in the DATS specification)
❖ ISA
❖ BioProject
❖ BioSample
❖ ……(full list in the DATS specification)
❖ MiNIML
❖ PRIDE-ml
❖ MAGE-tab
❖ GA4GH metadata schema
❖ SRA xml
❖ CDISC SDM / element of BRIDGE model
❖ ……(full list in the DATS specification)
16. Convergence
of elements
extracted from
competency
questions
and existing
(generic and
biomedical)
data models
(incl. DataCite,
DCAT, schema.org,
HCLS dataset, RIF-
CS, ISA-Tab, SRA-
xml etc.)
model for scalable indexing
Adoption
of elements extracted
from
and from
core entities
extended entities
17. ❖ The descriptors for each metadata element (Entity), include
✧ Property (describing the Entity), Definition (of each Entity and Property),
Value(s) (allowed for each Property)
Key features of
18. ❖ The descriptors for each metadata element (Entity), include
✧ Property (describing the Entity), Definition (of each Entity and Property),
Value(s) (allowed for each Property)
❖ We have defined a set of core and extended entities
✧ Core elements are generic and applicable to any type of datasets, like the
JATS can describe any type of publication.
✧ Extended elements includes an additional elements, some of which are
specific for life, environmental and biomedical science domains
✧ this set can be further extended as needed
Key features of
19. ❖ The descriptors for each metadata element (Entity), include
✧ Property (describing the Entity), Definition (of each Entity and Property),
Value(s) (allowed for each Property)
❖ We have defined a set of core and extended entities
✧ Core elements are generic and applicable to any type of datasets, like the
JATS can describe any type of publication.
✧ Extended elements includes an additional elements, some of which are
specific for life, environmental and biomedical science domains
✧ this set can be further extended as needed
❖ Entities are not mandatory, in both core and extended set
✧ An entity is used only when applicable to the dataset to be described
✧ In that case, only few of its properties are defined as mandatory
Key features of
20. ❖ Dataset, a core entity catering for any unit of information
✧ archived experimental datasets, which do not change after deposition to the
repository => examples available for dbGAP, GEO, ClinicalTrials.org
✧ datasets in reference knowledge bases, describing dynamic concepts, such
as “genes”, whose definition morphs over time => examples available for
UniProt
❖ Dataset entity is also linked to other digital research objects
✧ Software and Data Standard, which are also part of the NIH Commons, but
the focus on other discovery indexes and therefore are not described in
detail in this model
General design of the
24. ❖ What is the dataset about?
✧ Material
❖ How was the dataset produced ? Which information does it hold?
✧ Dataset / Data Type with its Information, Method, Platform,
Instrument
❖ Where can a dataset be found?
✧ Dataset, Distribution, Access objects (links to License)
❖ When was the datasets produced, released etc.?
✧ Dates to specify the nature of an event {create, modify, start, end...}
and its timestamp
❖ Who did the work, funded the research, hosts the resources etc.?
✧ Person, Organization and their roles, Grant
Core elements provide the basic info
26. also follows the W3C Data on
the Web Best Practices
DATS follows these, which
also recommend
DatasetDistribution
https://www.w3.org/TR/dwbp
27. Serializations and use of schema.org
❖ DATS model in JSON schema, serialized as:
✧ JSON* format, and
✧ JSON-LD** with vocabulary from schema.org
✧ serializations in other formats can also be done, as / if needed
❖ Benefits for DataMed and databases index by DataMed
✧ Increased visibility (by both popular search engines), accessibility
(via common query interfaces) and possibly improved ranking
❖ Extending schema.org
✧ Submitted to their tracker missing DATS core elements
✧ Coordinating via the bioschemas.org initiative (ELIXIR is also part of)
the extension of schema.org for life science
* JavaScript Object Notation
** JavaScript Object Notation for Linked Data