So, you want to build a pan-national digital space for bioscience data and methods? One that works with a bunch of pre-existing data repositories and processing platforms? So you can share FAIR workflows and move them between services? Package them up with data and other stuff (or just package up data, for that matter)? How? WorkflowHub (https://workflowhub.eu) and RO-Crate Research Objects (https://www.researchobject.org/ro-crate), that’s how! A step towards FAIR Digital Objects gets a workout.
Presented at DataVerse Community Meeting 2021
1. FAIR Workflows and
Research Objects get a Workout
Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
DataVerse Community Conference 2021, 15th June 2021
2. EOSC-Life: a pan-national thematic commons for
bioscience data and methods
Using and sharing data, tools and workflows in the cloud
3. Infrastructure Zoo
Flows around a Federated & Diverse System
1466 data repositories / archives
916 data format and metadata
standards*
Not including the institutional or
national repositories like
DataVerse
https://fairsharing.org/ accessed May 2021
From compounds to clinical trials
Primary data - Secondary use
4. Infrastructure Zoo
Flows around a Federated & Diverse System
https://fairsharing.org/ accessed May 2021
Community domain enclaves
fragmented resources
flow across platforms & sovereignties
Workflows as an entry point and
integration mechanism
Legacy
• data repositories & data platforms
• processing and workflow
platforms
5. CryoEM Image Analysis / Metagenomic Pipelines / Drug Discovery
Quality control
Replication
Scrutiny
Shared know-how
Repetition
7. Beyond Data: Computational Workflows as method objects
to be shared, ported, reused & repurposed
Multi-step
Leverage third party codes
Scalable processing of data
Transparent research
Computational Workflows
Specification
description
Software
Execution
A special kind of software
Separation of the workflow specification from its execution
Precise description of a procedure: a multi-step process coordinated by input/output data relationships (data types).
Execution of computational processes (run a code, invoke a service…).
Data is consumed and produced by each step.
8. Beyond Data: Computational Workflows as method objects
to be shared, ported, reused & repurposed
Multi-step
Leverage third party codes
Scalable processing of data
Transparent research
Computational Workflows
<my scripts>
A Zoo of Workflow Systems and “systems”*
Native repositories
*https://s.apache.org/existing-workflow-systems
10. Beyond Data: Multi-part Research Objects
dependencies and associated objects scattered across and within repositories
made at different times by different people
Workflow itself:
• Specification / descriptions
• Software
• Execution
• Inputs and outputs

Workflow-associated objects:
• Parameters
• Input datasets
• Output datasets
• Runtime details & provenance
• Documentation
• Bind to dependencies: containers, codes, sub-workflows
• Bind to particular test engines
• Publications
• Image
• Other workflows / sub-workflows
• Author
11. Beyond Data: Computational Workflows as multi-part method objects
to be shared, ported, reused & repurposed
Services for FAIR Workflows
• Describe workflows with PIDs and metadata
• Flow: Move workflows between services and
platforms
• Parts: Package (scattered) objects linked
together by context (metadata files with their objects)
Honouring
• the legacy and diverse ecosystem
• buy-in from platforms
Be KISSy
• practical and developer-friendly standards, and webby mechanisms
• extensible open-endedness – unknown unknowns & diversity…
(Diagram: workflow registry, workflow systems, repositories, containers, deployments, testing, monitoring)
12. Open Registry for Workflows
Perpetual Development in the open by an open community
https://workflowhub.eu
Towards FAIR workflows and FAIR registry
• Find and Access Workflows
– Workflows may remain in their native repositories in their native form, or can be deposited.
– Register (push) / Harvest (pull)
• Workflows interoperability and reusability
– Using metadata standards framework
Makers are the custodians
• people organisation: spaces, teams, organisations …
• workflow organisation: collections, tagging, facets ...
• credit: for submitters and authors
Open to any platform,
any subject, any person
WorkflowHub Club
14. FAIR Workflows are FAIR Software
living, with dependencies… workflow history/provenance
Indicators of Status
Workflow monitoring
Register versions (supports GitHub Actions)
Incremental metadata and supplementary materials (tracking & lifting out sub-workflows)
15. Which Workflow Objects are FAIR?
• workflow specification with test or
exemplar data?
• implementation of that design in a particular WfMS?
• instantiation of that implementation
ready to run with input data, parameters
set, computational services spun up?
• run result with intermediate/final data
products and provenance logs?
• In practice this is a bit blurry.
A metadata framework extensible enough to cope
16. FAIR Workflows are FAIR Digital Objects
Descriptive, machine-actionable metadata framework from the community
practical and developer-friendly standards, extensible open-endedness
• Standardised metadata about the workflows, for registration and discovery: schema.org profile and types – ComputationalWorkflow, FormalParameter, ComputationalTool
• Canonical workflow description of the workflow itself: executable and abstract form
• Type the input and output data formats of the steps: ontology of types of data and data identifiers, data formats, operations in life sciences
Upload and Download the parts?
Exchange between services & platforms?
Sharing & archiving the components of science
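The schema.org profile above can be sketched as JSON-LD. A minimal, hypothetical example built with Python's standard library: the type names ComputationalWorkflow and FormalParameter come from the profile named on the slide, but every identifier, name and value below is made up for illustration.

```python
import json

# Hypothetical sketch of describing a workflow with the schema.org
# ComputationalWorkflow / FormalParameter types. All @id values, names
# and formats are illustrative placeholders, not real records.
workflow_entity = {
    "@id": "workflow/my-pipeline.ga",              # illustrative local id
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    "name": "Example analysis pipeline",
    "input": [{"@id": "#param-reads"}],            # links to typed parameters
    "output": [{"@id": "#param-results"}],
}

input_param = {
    "@id": "#param-reads",
    "@type": "FormalParameter",                    # a typed step input
    "name": "sequencing reads",
    "additionalType": "File",
}

print(json.dumps(workflow_entity, indent=2))
```

Registries can then harvest this metadata for discovery without needing to understand the underlying workflow language.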
17. Let's step back!
Beyond Data: Multi-part Research = Multi-part ROs
Each object has its own
metadata and repositories
Integrated view & context over
fragmented resources using
their PIDs and metadata
Need a way of packaging up,
describing the package and
parts, citing, shipping around,
storing, archiving, sharing.
Reference real things. Like
people, mice and equipment.
18. Beyond Data: Multi-part Research Objects
Describing a Dataset as a
Digital Object
A way of packaging up,
describing the package and
parts, citing, shipping around,
storing, archiving, sharing.
Even reference real things. Like
people, mice and equipment.
Image Courtesy of Peter Sefton: https://arkisto-platform.github.io/standards/ro-crate/
19. The dataset may contain any kind of
data resource, about anything, in any
format as a file or URL. They can be
scattered across repositories.
Each resource can have a machine
readable description in JSON-LD
format
A human-readable description and
preview can be in an HTML file
that lives alongside the metadata
Provenance and workflow information
can be included - to assist in data and
research-process re-use
RO-Crate Digital Objects may be packaged for distribution, e.g. via Zip, BagIt and OCFL Objects
Courtesy Peter Sefton, https://arkisto-platform.github.io/standards/ro-crate/
A data
repository
perspective
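The machine-readable description mentioned above can be sketched as follows: a minimal ro-crate-metadata.json following the RO-Crate 1.1 descriptor/root-dataset layout (the context URL and descriptor pattern are from the spec; the payload file names are illustrative).

```python
import json

# Minimal sketch of an ro-crate-metadata.json (RO-Crate 1.1 layout).
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # metadata descriptor: describes itself, points at the root
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root data entity: the dataset being packaged
            "@id": "./",
            "@type": "Dataset",
            "name": "Example crate",
            "hasPart": [{"@id": "data/results.csv"}],
        },
        {   # one data entity, described alongside the data it names
            "@id": "data/results.csv",
            "@type": "File",
            "encodingFormat": "text/csv",
        },
    ],
}
print(json.dumps(crate, indent=2))
```

The human-readable preview (ro-crate-preview.html) then lives alongside this file in the same directory.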
20. Not just for workflows!
For any kind of object
data, publications, SOPs, software …
and data repositories!
especially data repositories!
Aggregate files, any URI-addressable content, another
RO-Crate, along with contextual information, into a citable
RO-Crate which has its own metadata.
Can use as a bag of references:
large/sensitive datasets
citation aggregator
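The "bag of references" idea above can be sketched as a crate whose parts are remote, PID-addressed resources rather than local files. All URLs below are illustrative placeholders.

```python
import json

# Sketch of a "bag of references": the data entities are remote resources
# identified by their PIDs/URLs; no payload is copied into the crate.
remote_parts = [
    {"@id": "https://example.org/repo/dataset-123",   # stays in its repository
     "@type": "Dataset",
     "name": "Large imaging dataset (referenced, not copied)"},
    {"@id": "https://example.org/repo/sop-7",
     "@type": "File",
     "name": "Standard operating procedure"},
]

bag = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
         "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset",
         "name": "Citable aggregation of references",
         "hasPart": [{"@id": p["@id"]} for p in remote_parts]},
        *remote_parts,
    ],
}
print(json.dumps(bag, indent=2))
```

This keeps large or sensitive data in place while the crate itself stays small, citable and shippable.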
21. Unbounded Research Objects
Anything referenceable that may be scattered across different repositories and/or different datasets in the same repository.
Self-describing integrated view spanning fragmented resources, using PIDs and metadata
Metadata held alongside heterogeneous data
Infrastructure independent
• Exchange between repositories, registries and
services.
• Avoid vendor lock-in
22. Practical, lightweight approach
Machine- and human-readable, search-engine friendly and developer-familiar, blah blah
FAIR Object middleware/underware
Standard Web Native PIDs + JSON-LD +
Schema.org, off the shelf archiving formats
Self-describing, typed by profiles + add more schema.org and domain ontologies
Extensible, descriptive and content open-endedness, honouring legacy, diversity, and known and unknown unknowns - one size does not fit all, blah blah
A Graph inside the RO-Crate
PIDs connect the Graph to the
outside world
http://www.researchobject.org/ro-crate/
23. RO-Crate variants: Profiles are extensible typing
RO-Crates collect metadata
• Workflow-RO-Crate
• Workflow-Testing-RO-Crate
• Workflow-Run-RO-Crate
• Galaxy-Workflow-RO-Crate
• BioComputeObject-RO-Crate
• maDMP RO-Crate*
• DataRepo-RO-Crate
• DataRepo-DataCube-RO-Crate
• Aggregated DataCitation RO-Crate: secure bags of PIDs to sensitive / large data
*https://repository.publisso.de/resource/frl:6423291 https://www.researchobject.org/ro-crate/profiles.html
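A crate declares which profile it follows via conformsTo on its metadata descriptor; a sketch under the assumption that the Workflow RO-Crate 1.0 profile URI is as published on the profiles page (treat the exact URI, and the descriptor placement, as assumptions to check against the current profile documentation).

```python
import json

# Sketch: a metadata descriptor claiming conformance to the base RO-Crate
# spec plus a workflow profile. The profile URI is an assumption; verify
# against https://www.researchobject.org/ro-crate/profiles.html
metadata_descriptor = {
    "@id": "ro-crate-metadata.json",
    "@type": "CreativeWork",
    "about": {"@id": "./"},
    "conformsTo": [
        {"@id": "https://w3id.org/ro/crate/1.1"},                       # base spec
        {"@id": "https://w3id.org/workflowhub/workflow-ro-crate/1.0"},  # assumed profile URI
    ],
}
print(json.dumps(metadata_descriptor, indent=2))
```

Consumers such as registries can dispatch on the declared profile instead of sniffing the crate's contents.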
24. A step towards FAIR Digital Objects*
“To be FAIR each digital object
type has its own metadata
requirements,
and may have its own repositories
and registries”
FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units: https://doi.org/10.3390/publications8020021
https://fairdo.org
25. FAIR Digital Objects
Actionable knowledge unit
Digital butterfly – digital twins
Bags of references
courtesy Dimitris Koureas
Coordinator DiSSCo EU
Research Infrastructure
Specimen object image
courtesy of Alex Hardisty
26. Specimen Data Refinery
Workflows to Digitise Natural History Specimens
FAIR Digital Objects -> Packaged + Actionable
+
FAIR Digital Object
Framework
Open Digital Specimen
Workflow Infrastructure
courtesy of Alex Hardisty and Laurence Livermore
27. Real Use Cases Considered Essential!
• Building out in the open accelerated progress
RO-Crate is metadata middleware
• smart use of wheels already invented
• it takes a village: get tools, services on board
• developer friendly, firm best practice
A little bit of semantics goes a long way…
• Schema.org + JSON-LD
…prepare for more
Known and Unknown unknowns, One size does not fit all
• descriptive open-endedness, multi-interpretation
Metadata sucks
• auto-curation is the way forward folks!
What about
the workout?
28. What about
FAIR?
FAIR at multiple levels & granularities
• Workflows & RO-Crates are composite and
nested, with dependencies
• FAIR all the way down
• Not always compatible – e.g. licenses
FAIR+
• Reusable and usable workflows - testing & parameter validation. Documentation.
FAIR software paradigm is pervasive
• Applies to RO-Crate Research Objects
FAIR takes a village, of course
C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2 (2020), 108–121. doi: 10.1162/dint_a_00033
29. What about DataVerse?
Workflows have data and software
characteristics
RO-Crate preserves metadata and the objects
– workflow, data, datasets whatever…
• Archive/republish independent of
WorkflowHub
• Move content from one repository to
another, one service to another
• Point to content and don’t move it
• Sharing reproducible results & methods
Set data and
workflows and their
metadata free!
RO-Crate RepositoryCollection and RepositoryObject types represent records in a repository, to describe an export from a repository or digital library