This document discusses computational workflows and FAIR principles. It begins by providing background on computational workflows and their increasing importance. It then discusses challenges around finding, accessing, and sharing workflows. Next, it explores how applying FAIR principles to workflows could help address these challenges by making workflows and their associated objects findable, accessible, interoperable, and reusable. This includes discussing applying metadata standards, using persistent identifiers, and developing principles for FAIR workflows and FAIR software. The document concludes by examining the roles and responsibilities of different stakeholders in working towards FAIR workflows.
Computational Workflows for Data-Intensive Bioscience
1. FAIR Computational Workflows
Professor Carole Goble
The University of Manchester UK
EU Research Infrastructures ELIXIR, IBISBA, EOSC-Life
BioExcel Centre of Excellence
Software Sustainability Institute UK
FAIRDOM Consortium
carole.goble@manchester.ac.uk
16th Workshop on Workflows in Support of Large-Scale Science
November 15, 2021
2. 20 years+
Computational workflows
decades in the making…
finally coming of age….?
doi: 10.1093/gigascience/giaa140
Nature 573, 149-150 (2019)
https://doi.org/10.1038/d41586-019-02619-z
https://doi.org/10.1038/s41592-021-01254-9
4. Computational Workflows for Data intensive Bioscience
CryoEM Image Analysis
Metagenomic Pipelines
[Rob Finn]
[Carlos Oscar Sorzano Sanchez]
Nature 573, 149-150 (2019)
https://doi.org/10.1038/d41586-019-02619-z
Data pipelines, simulation
sweeps, workflow ensembles.
Mixture of workflow systems,
notebooks and scripts.
Chaining different codes.
Genome Annotation
[Romain Dallet]
High Throughput Sequencing
[Fabrice Allain]
Interactive &
exploratory analysis
Production, automated,
repetitive & workflow-integrated software
5. Workflow System Landscape
Inter-twingled, mix and matching
Scripting
environments
Interactive Electronic
Research Notebooks
Repositories
Registries
Workflow
Management
Systems & execution
platforms
*https://s.apache.org/existing-workflow-systems
300+ Systems*
General and Specialised
General Repositories
6. https://snakemake.github.io/
Workflows are rules:
Graph of jobs for automatic parallelisation,
DIY package & containerisation
installation, auto-documentation
From frameworks to web based analysis platforms
Communities cluster round a few systems.
Take-up of a WfMS typically depends on its “plugged-in” support for data types &
specific codes, the skill level of the workflow developers, and its popularity & sustainability.
Online portals: users build and reuse
workflows around publicly available
or user-uploaded data and
pre-wrapped, pre-installed tools.
7. A FAIR data and workflow commons
sharing and running workflows
Workflows are:
an entry point to tools and datasets,
democratising resources
functions for FAIR data processing
and secure data processing
FAIR digital objects
Honour legacy & diversity of WfMS -> Buy-in & on-boarding of WfMS
9. FAIR Guiding Principles for Research Data
Findable, Accessible, Interoperable, Reusable
A set of guiding principles to enhance the value
of all digital resources and their reuse
by people and by machines
A community journey to common guidelines
The glue to federate data and services,
to apply to all objects
Benefit both consumers and producers.
10. The FAIR Research Data Principles
RDA FAIR Data Maturity Model. Specification and Guidelines https://zenodo.org/record/3909563
https://www.go-fair.org/fair-principles/
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
https://doi.org/10.1038/sdata.2016.18
11. tl;dr FAIR Research Data Principles
https://www.go-fair.org/fair-principles/
Persistent human readable & machine-
actionable metadata
• Linked
• Community standards
Persistent identifiers
Clear licensing and access rules
Protocols for machine accessibility & AAI
Registration
Searching & Indexing
Enabling automation
12. FAIR Research Data Principles update in a nutshell
Policy
Rallying point
I’m FAIR!
What is it?
Definition
Spectrum
Contextual
Methodology
FAIRification
FAIR by Design
Assessment
Compliance
Certification
FREE
Infrastructure
Services
Adoption
Incentives
Stewardship
Services
13. FAIR Research Software Principles
Software is a digital object but research software is not (just) data
https://www.rd-alliance.org/groups/fair-4-researchsoftware-fair4rs-wg
FAIR for Research Software (FAIR4RS) Working Group
FAIR4RS First Draft of FAIR4RS principles
Katz, et al PATTERNS 2, 2021
Lamprecht et al., 2020
14. FAIR Research Software Principles
Software is a digital object but research software is not (just) data
Findable, Accessible
Interoperable
I1. Software should read, write or exchange data in a way that meets domain-relevant community standards
I2. Software includes qualified references to other objects.
Reusable
R. The software is usable (it can be executed) & reusable (it can be understood, modified, built upon, or incorporated into other software).
R1. Software is richly described with a plurality of accurate & relevant attributes
R1.1. Software is made available with a clear & accessible software usage license
R1.2. Software is associated with detailed provenance
R1.3. Software meets domain-relevant community standards
R2. Software includes qualified references to other software
(Katz et al, 2021 PATTERNS, https://doi.org/10.1016/j.patter.2021.100222)
15. Enabling FAIR?
FAIRification. Assessment. Services.
Governance. Incentives.
FAIR takes a Village*
*Borgman, C. L., & Bourne, P. E. (2021). Why it takes a village to manage and share data.
Harvard Data Science Review (under Review). https://arxiv.org/abs/2109.01694
FAIR Computational
Workflow Principles?
16. FAIR Principles for Workflows
Abstraction 1: Hybrid Processual Digital Objects
17. Image credit: BioExcel Centre of Excellence
different
components,
codes,
languages,
third parties
FAIR Principles for Workflows
Abstraction 2: Compositional Objects
Interoperability and Reusability
FAIR Unit Test
18. FAIR Principles for Workflows
Method “Data” Objects
Workflows as
FAIR Software
FAIR+R and FAIR++
Quality, maturity, maintainability
The principles revised
Workflows as
FAIR Digital Objects
Data-like method objects
Associated objects
The principles adapted
Workflows as
FAIR Data Instruments
FAIRification of the dataflow
The data principles supported
C. Goble, S. Cohen-Boulakia, S.
Soiland-Reyes, D. Garijo, Y. Gil, M.R.
Crusoe, K. Peters & D. Schober. FAIR
computational workflows. Data
Intelligence 2(2020), 108–121.
doi: 10.1162/dint_a_000
Workflow Objects
Software Objects
Data FAIRification
FAIR enabling services
Services
19. Findable & Accessible
WORKS 2007
WORKS 2021
https://workflowhub.eu https://workflowhub.org
https://myexperiment.org
20. Findable & Accessible
register workflows with assigned PID + metadata in a searchable resource.
https://workflowhub.eu
Publishing Services
Journals
Digital Objects of Scholarship
published, cited, exchanged, reviewed, validated & reused
• Versioning, DOI/PID assignment
• Collections, workflow libraries
scripts
Repos
Containers Deploys
Tools
WfMS Agnostic degrees of onboarding, support & access
• Native repositories
• Metadata standards framework,
handle associated objects and links
between objects.
• Execution API
https://dockstore.org/
21. Link up providers and users
Building visibility & reputation
Close the
“Find – Get– Use – Credit”
loop
Credit, Attribution, Citation
Knowledge Graphs linking
out to OpenAIRE, DataCite
Associate workflows
Associate sister objects
myExperiment influence
Social aspects
Teams, People
23. Tool Registry Service API
Accessible
“metadata & workflow retrievable by PID using a standardized communication protocol”
GitHub page: https://github.com/ga4gh/tool-registry-service-schemas
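As a rough illustration of "retrievable by PID using a standardized communication protocol", the sketch below builds the GA4GH TRS v2 URL for fetching one workflow descriptor. The registry host and the workflow identifier are hypothetical placeholders, and no network request is made.

```python
from urllib.parse import quote

# Base path defined by the GA4GH TRS v2 schema.
TRS_BASE_PATH = "/ga4gh/trs/v2"


def trs_descriptor_url(host: str, tool_id: str, version_id: str,
                       descriptor_type: str = "CWL") -> str:
    """Build the TRS URL that returns one workflow descriptor."""
    # Identifiers may contain '/' or '#', so percent-encode them fully.
    encoded_id = quote(tool_id, safe="")
    encoded_version = quote(version_id, safe="")
    return (f"{host.rstrip('/')}{TRS_BASE_PATH}/tools/{encoded_id}"
            f"/versions/{encoded_version}/{descriptor_type}/descriptor")


# Hypothetical registry host and workflow identifier, for illustration only.
url = trs_descriptor_url("https://registry.example.org",
                         "#workflow/example/demo", "1.0")
print(url)
```

Because the path layout is the same for every TRS-compliant registry, a client only needs the host and the identifier to retrieve the workflow, whichever WfMS wrote it.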
24. Accessible
an implementation of
GA4GH WES
https://github.com/sapporo-wes/sapporo
top layer over the tools, the
workflow languages, and the
workflow runners
GA4GH TRS
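A minimal sketch of what a client hands to a GA4GH WES endpoint: the form fields for a "run workflow" request. The field names and the `/ga4gh/wes/v1/runs` path follow the WES v1 schema; the workflow URL and parameters are hypothetical, and the request is only assembled here, not sent.

```python
import json

# Path of the "run workflow" operation in the GA4GH WES v1 schema.
WES_RUNS_PATH = "/ga4gh/wes/v1/runs"


def wes_run_request(workflow_url: str, workflow_type: str,
                    workflow_type_version: str, params: dict) -> dict:
    """Assemble the form fields for a WES run request."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # WES expects the workflow parameters as a JSON-encoded string.
        "workflow_params": json.dumps(params),
    }


# Hypothetical workflow location and input, for illustration only.
form = wes_run_request(
    "https://example.org/workflows/demo.cwl",
    "CWL",
    "v1.0",
    {"reads": {"class": "File", "path": "reads.fastq"}},
)
print(WES_RUNS_PATH, sorted(form))
```

The same form can be posted to any WES implementation, which is what makes the execution service a common top layer over different workflow languages and runners.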
25. FAIR Workflows are FAIR Software
lifecycle support for living objects
Git Coupling
Publishing
Status
Testing
Benchmarking
26. Extensible Metadata Framework
catering for those processual FAIR criteria
Common metadata
about the workflow,
tools & parameters
Canonical workflow
description of the
steps of the workflow
Type the input and
outputs of the steps
Run Provenance / Histories / Tests
WfMS native history logs
Format for packaging a
workflow, its metadata
and companion objects
(links to containers, data
etc) for exchange,
archiving, reporting,
citing.
WorkflowHub and Services
create and consume Crates
FAIR Digital Object
Adopting Open
Community efforts
27. FAIR Metadata for Machines & Humans
https://www.commonwl.org
WfMS neutral canonical description
Linked to containerised tools
• Portable, reusable workflows
• Standardise expression of workflow
• Standardise compatible I/O for steps
• Reduce vendor / project lock-in
• Workflow comparisons
• Collaboration & knowledge transfer
https://openwdl.org/
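To make "standardise compatible I/O for steps" concrete, here is a toy model in Python (not CWL or WDL itself) of steps with typed ports and a check that each link connects matching types. The `Step` class, the step names, and the format labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """A workflow step with typed input and output ports (port name -> format)."""
    name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)


def incompatible_links(steps, links):
    """Return the links whose source output format differs from the sink input format."""
    by_name = {step.name: step for step in steps}
    problems = []
    for (src_step, src_port), (dst_step, dst_port) in links:
        produced = by_name[src_step].outputs.get(src_port)
        expected = by_name[dst_step].inputs.get(dst_port)
        # A missing port or a format mismatch both make the link unusable.
        if produced is None or produced != expected:
            problems.append(((src_step, src_port), (dst_step, dst_port)))
    return problems


# Hypothetical three-step pipeline; the last link is deliberately mismatched.
steps = [
    Step("trim", inputs={"reads": "FASTQ"}, outputs={"trimmed": "FASTQ"}),
    Step("align", inputs={"reads": "FASTQ"}, outputs={"alignments": "BAM"}),
    Step("call_variants", inputs={"alignments": "CRAM"}, outputs={"variants": "VCF"}),
]
links = [
    (("trim", "trimmed"), ("align", "reads")),
    (("align", "alignments"), ("call_variants", "alignments")),
]
problems = incompatible_links(steps, links)
print(problems)
```

A canonical description with declared port types lets this kind of check run before execution, and independently of the WfMS that will eventually run the workflow.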
29. RO-Crate Digital Objects
Packaging everything together regardless where or what it is
https://www.researchobject.org/ro-crate/
Self describing format for
packaging up scattered resources
integrated view + context
metadata and PIDs reference
digital and real things - datasets,
workflows, services, software &
people, places etc.
Web-native, COTS
machine and human readable
search engine & developer
friendly.
Infrastructure independent
& self-describing
Avoid repository silos
Extensible and open-ended
profiles, duck-typing,
cope with diversity
and legacy
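As a sketch of the packaging idea, the snippet below assembles a minimal `ro-crate-metadata.json` for a crate whose main entity is a workflow file. The overall shape (JSON-LD `@context`, metadata descriptor, root dataset, workflow entity) follows RO-Crate 1.1; the file name, the crate name, the licence choice, and the local `#cwl` language identifier are illustrative assumptions, and a real Workflow RO-Crate would carry more metadata.

```python
import json


def workflow_crate(name: str, workflow_path: str, license_url: str) -> dict:
    """Build a minimal RO-Crate 1.1 metadata document describing one workflow."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            # Metadata descriptor: says which entity is the root of the crate.
            {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
             "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
             "about": {"@id": "./"}},
            # Root dataset: the package as a whole.
            {"@id": "./", "@type": "Dataset", "name": name,
             "license": {"@id": license_url},
             "mainEntity": {"@id": workflow_path},
             "hasPart": [{"@id": workflow_path}]},
            # The workflow file itself.
            {"@id": workflow_path,
             "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
             "name": name,
             "programmingLanguage": {"@id": "#cwl"}},
            # Local identifier for the language; an assumption for this sketch.
            {"@id": "#cwl", "@type": "ComputerLanguage",
             "name": "Common Workflow Language"},
        ],
    }


# Hypothetical workflow name and path, for illustration only.
crate = workflow_crate("Demo alignment workflow", "workflow.cwl",
                       "https://spdx.org/licenses/Apache-2.0")
print(json.dumps(crate, indent=2))
```

Because the crate is plain JSON-LD that travels alongside the files, the description remains readable without any particular repository or WfMS.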
31. Provenance & Preservation
Transparency & Reuse matter
more than Reproducibility?
Traceability more important?
When is it FAIR enough?
WfMS heavy lifting needed …
R1.2: (Meta)data, software and workflows are
associated with detailed provenance – data lineage,
workflow lineage & workflow logs
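One possible shape for such a lineage record, sketched under assumptions: each run notes which workflow version ran and fingerprints its inputs and outputs with SHA-256. The record layout and all names are hypothetical; a real WfMS would emit its own native logs or a standard provenance format.

```python
import hashlib
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Content fingerprint used to tie a record to exact bytes."""
    return hashlib.sha256(data).hexdigest()


def run_record(workflow_id: str, workflow_version: str,
               inputs: dict, outputs: dict) -> dict:
    """Describe one run: which workflow version turned which inputs into which outputs."""
    return {
        "workflow": {"id": workflow_id, "version": workflow_version},
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "inputs": {name: sha256_hex(data) for name, data in inputs.items()},
        "outputs": {name: sha256_hex(data) for name, data in outputs.items()},
    }


# Hypothetical identifier and file contents, for illustration only.
record = run_record(
    "https://example.org/workflows/demo",
    "1.0",
    inputs={"reads.fastq": b"@read1\nACGT\n+\nFFFF\n"},
    outputs={"report.txt": b"1 read processed\n"},
)
print(record["workflow"], sorted(record["inputs"]), sorted(record["outputs"]))
```

Checksums tie data lineage to exact file contents, while the workflow identifier and version carry the workflow lineage.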
32. ProvenanceWeek 2021, T7 Workshop on Provenance for Transparent Research, July 2021
https://iitdbgroup.github.io/ProvenanceWeek2021/t7.html
33. A2. metadata are accessible, even when
the workflow is no longer available
Read-reproducible as a method description if it no
longer runs. Metadata preserved beyond any one
service, republished in a long-term archive
R. The workflow is usable (it can
be executed) and reusable (it can
be understood, modified, built
upon, or incorporated into other
workflows).
FAIR Services
Law of decline
All workflows decay over time.
Complexity of Dependencies
Description persists -> Review, Repair, Remake
34. Reusable and Usable
i.e. can be executed once accessed
Quality, maturity, maintainability -> FAIR++
Multiple wf/test backends:
Galaxy Planemo, CWL,
Jenkins …
Check workflow
performance,
provenance on
containers, memory
usage …
Testing and monitoring -> metadata
into WorkflowHub
Portability
High-level workflow
execution service backend,
sensitive data analysis &
running on private clouds
“Interoperable” Execution
Is a workflow
reusable if it’s
resource greedy
or too slow or
needs special
resources or
unavailable data
or cannot be
ported or run by
anyone other
than the
developers? Like
Google ML…
35. Interoperable and Reusable Workflows…
a portability viewpoint
All good WORKS
stuff which I am not
going to talk about….
exascale computing
36. Composability -> Interoperability and Reusability
Community driven Reusability first
I1: Software interoperates through APIs and metadata standards.
FAIR Unit tested & validated canonical workflows & blocks.
Well documented, well maintained
CWL Canonical descriptions
• Recycle descriptions and sub-workflows
• Platform independent exchange and comparison
• Standardised I/O formats
Thanks: Rob Finn
37. Composability -> Interoperability and Reusability
Community driven Reusability first
I1: Software interoperates through APIs and metadata standards.
FAIR Unit tested & validated canonical workflows & blocks.
Canonical Workflow Frameworks
for Research (CWFR)
https://www.rd-alliance.org/canonical-workflow-frameworks-research-cwfr
https://fairdo.org/wg/fdo-cwfr/
Thanks: Stian Soiland-Reyes
38. Workflow Data FAIRification & FAIR Data by Design
Assisted by WfMS
Challenge of diverse API & AAI landscape, formats and packaging
Reviewing
Curation
Certification
Governance
Best Practice
Golden
Examples
Canonical
workflows
Design for
FAIR Data
and Reuse
Metadata generated for data products
39. FAIR Reusable Workflow Design is Hard and Hard Work
Nearly always post-hoc
Third party dependencies
Technology Debt and Refactoring
Software Engineering
In the Sweatshop
of Science who
has the Time?
Inclination?
Skills? Resources?
40. FAIR Reusable Workflow Design is Hard and Hard Work
Nearly always post-hoc
Workflow developers
Tool and data set
providers
Workflow readiness
FAIR Unit Testing
Brack, et al (2021). 10 Simple Rules for
making a software tool workflow-ready.
https://doi.org/10.5281/zenodo.5636487
What’s the reward?
What’s a FAIR Unit?
How will we assess?
How to refactor?
WfMS platforms
Programmatic access to workflow metadata
Common metadata, PID & API standards
FAIR Software.
Service that is FAIR enabling*
Ramezani et al . (2021). D2.7 Framework for
assessing FAIR Services (V1.0_DRAFT).
https://doi.org/10.5281/zenodo.5336234
41. Can we FAIR assist? automate?
Abstraction framework for
granularity assessment & (semi)-
automated refactoring
2021 IEEE International Conference on Cluster Computing
DOI: 10.1109/Cluster48925.2021.00053
43. WORKFLOW
APPLICATION USER
FAIR takes a Village
Shared responsibility, shared benefits, shared curation
TOOL
DEVELOPER
WORKFLOW
USER
WFMS
DEVOP
WORKFLOW
DEVELOPER
& CUSTODIAN
COMPUTATIONAL
USER
Platform Service
Workflow
Labour
Use
Reach
Software
44. What can a lab do to be FAIR?
As developer and user of workflows, datasets, tools?
Get Help
Skill the Team with
Best Practice
Register/Publish
Cite & credit makers
Document
for Strangers
https://fair-software.nl/
Professionalisation
Pre and post hoc
Corpas M et al (2018) A FAIR guide for data providers to maximise sharing of human genomic data, PLOS Comp Bio
Boeckhout M et al (2018) The FAIR guiding principles for data stewardship: fair enough?, E J of Human Genetics
Use WfMSs and
tools that are FAIR
enabling
Checklists
A Management Plan
Use Standards
Use IDs
46. What can the WfMS Community do?
Collective action by a few WfMS and
services nails 80% workflow use.
Ferreira da Silva et al, A Community Roadmap for Scientific Workflows Research and Development, arXiv:2110
Best Practice
Support a FAIR
metadata framework
47. TL;DR FAIR Computational Workflows
FAIR Principles laid the foundation for sharing
digital assets
Computational workflows are Hybrid Digital
Objects of scholarship
Should support the creation of FAIR data and
themselves adhere to FAIR Principles
Metadata matters
FAIR takes a Village.
Life Sciences has begun work.
48. Acknowledgements
The WorkflowHub Club, Bioschemas Community, RO-Crate
Community, CWL Community, Galaxy Europe, EOSC-Life and
ELIXIR Tools Platform.
https://about.workflowhub.eu/community/
Special Thanks
Rafael Ferreira da Silva (Oak Ridge)
Stian Soiland-Reyes (U Manchester / U Amsterdam)
Paul Brack, Stuart Owen, Finn Bacall, Alan Williams, Doug Lowe (U Manchester)
Björn Grüning (U Freiburg)
Frederik Coppens (VIB)
Sarah Jones (GEANT)
Herve Menager (Pasteur Institute)
Sarah Cohen-Boulakia (U Paris-Saclay)
Dan Katz (U Illinois Urbana-Champaign)
Simone Leo (CRS4)
Laura Rodriguez-Navas (BSC)
José Mª Fernández (BSC)
Denis Yuen (Ontario Institute for Cancer Research)
Tristan Glatard (Concordia University)
Chris Erdmann (AGU)
WorkflowHub https://workflowhub.eu/ and https://workflowhub.org
EOSC-Life https://www.eosc-life.eu/
ELIXIR http://elixir-europe.org
RO-Crate https://www.researchobject.org/ro-crate/
Galaxy Europe https://galaxyproject.eu/
Bioschemas https://bioschemas.org
Common Workflow Language https://www.commonwl.org/
WorkflowsRI https://workflowsri.org/
Dockstore https://dockstore.org/
RDMkit https://rdmkit.elixir-europe.org
49. Whither Workflow Interoperability? FAR not FAIR?
(Question by Rafael Ferreira da Silva)
What is Workflow Interoperability?
• CWL /WDL - WfMS independence rather than interoperability?
• Execution of sub-workflows – (re)usability rather than interoperability?
• Multiple WfMS execution – are WfMS really executed in mixed workflows
or is this front/backends that can run multiple WfMS (e.g. TES/WES)?
• Composability of workflow units - Data I/O compatibility
I1. Software should read, write or exchange data in a way that meets domain-relevant
community standards