FAIR Computational Workflows
Computational workflows capture precise descriptions of the steps and data dependencies needed to carry out computational data pipelines, analysis and simulations in many areas of Science, including the Life Sciences. The use of computational workflows to manage these multi-step computational processes has accelerated in the past few years driven by the need for scalable data processing, the exchange of processing know-how, and the desire for more reproducible (or at least transparent) and quality assured processing methods. The SARS-CoV-2 pandemic has significantly highlighted the value of workflows.
This increased interest in workflows has been matched by growth in the number of workflow management systems available to scientists (Galaxy, Snakemake, Nextflow and 270+ more) and in the number of workflow services such as registries and monitors. There is also recognition that workflows are first-class, publishable Research Objects, just as data are. They deserve their own FAIR (Findable, Accessible, Interoperable, Reusable) principles and services that cater for their dual roles as explicit method description and software method execution [1]. To promote long-term usability and uptake by the scientific community, workflows (as well as the tools that integrate them) should become FAIR+R(eproducible), and citable so that authors’ credit is attributed fairly and accurately.
The work on improving the FAIRness of workflows has already started and a whole ecosystem of tools, guidelines and best practices has been under development to reduce the time needed to adapt, reuse and extend existing scientific workflows. An example is the EOSC-Life Cluster of 13 European Biomedical Research Infrastructures which is developing a FAIR Workflow Collaboratory based on the ELIXIR Research Infrastructure for Life Science Data Tools ecosystem. While there are many tools for addressing different aspects of FAIR workflows, many challenges remain for describing, annotating, and exposing scientific workflows so that they can be found, understood and reused by other scientists.
This keynote will explore the FAIR principles for computational workflows in the Life Sciences using the EOSC-Life Workflow Collaboratory as an example.
[1] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, and Daniel Schober. FAIR Computational Workflows. Data Intelligence 2020 2:1-2, 108-121. https://doi.org/10.1162/dint_a_00033
1. FAIR Computational Workflows
Professor Carole Goble
The University of Manchester UK
EU Research Infrastructures: ELIXIR, IBISBA, EOSC-Life
Centre of Excellence: BioExcel
carole.goble@manchester.ac.uk
JOBIM 2021, 8th July 2021
https://tinyurl.com/jobim-goble
2. Computational Workflows for Data intensive Bioscience
prepare, analyze, and share increasing volumes of complex data
CryoEM Image Analysis
Metagenomic Pipelines
Drug Discovery
Protein Ligand MD
Simulation
Genome Annotation
High Throughput Sequencing
Fabrice Allain
Romain Dallet
3. 20 years+
Computational workflows
decades in the making…finally coming of age….
doi: 10.1093/gigascience/giaa140
Nature 573, 149-150 (2019)
https://doi.org/10.1038/d41586-019-02619-z
4. What are Data intensive Computational Workflows?
Systematic linking together multiple tools and software packages
inputs
outputs
tools, CLI,
containers,
workflows
Scale up
Access to computational infrastructure
and datasets, tool interoperability,
processing portability and
optimisation, data wrangling.
Specification
description
Software
Execution
WfMS
Engine
Workflow
Scale out
Flexible workflow composition to
construct & run executable control
and data flows using
heterogeneous software packages,
codes, tools, other workflows made
by other people.
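As a minimal sketch of the split between specification and execution described on this slide, the Python below models a workflow as steps with declared inputs and outputs, plus a toy engine that derives the run order from the data dependencies. The step names, data names and stand-in tools are hypothetical and are not taken from any particular WfMS.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Specification: each step names the data it consumes and produces.
# Step and data names are hypothetical placeholders.
STEPS = {
    "trim":  {"inputs": ["raw_reads"], "outputs": ["clean_reads"]},
    "align": {"inputs": ["clean_reads", "reference"], "outputs": ["alignments"]},
    "call":  {"inputs": ["alignments", "reference"], "outputs": ["variants"]},
}


def execution_order(steps):
    """Derive a valid run order from input/output data dependencies."""
    producer = {out: name for name, s in steps.items() for out in s["outputs"]}
    graph = {
        name: {producer[i] for i in s["inputs"] if i in producer}
        for name, s in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


def run(steps, data, tools):
    """Toy engine: call each step's tool once its inputs are available."""
    data = dict(data)
    for name in execution_order(steps):
        args = [data[i] for i in steps[name]["inputs"]]
        results = tools[name](*args)
        data.update(zip(steps[name]["outputs"], results))
    return data


# Stand-in "tools": real ones would be CLI programs or containers.
TOOLS = {
    "trim":  lambda reads: (reads.strip("N"),),
    "align": lambda reads, ref: (ref.find(reads),),
    "call":  lambda pos, ref: (f"match at {pos}" if pos >= 0 else "no match",),
}

if __name__ == "__main__":
    result = run(STEPS, {"raw_reads": "NNGATTNN", "reference": "ACGATTACA"}, TOOLS)
    print(execution_order(STEPS))
    print(result["variants"])
```

The specification (STEPS) never mentions how a tool is run, so the same description could in principle be handed to a different engine or infrastructure.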
5. SARS-CoV-2 allelic-variant surveillance
Automated monitoring of structured data
from the European COVID-19 Data Portal and
national SARS-CoV-2 sequencing datasets,
notably COG-UK.
Scalable via access to a global distributed
compute network
• Improved data quality
• Uniformly analysed data for downstream
analysis & visualisation
• Submission of data to public archives
• All workflows, data and documentation
available https://covid19.galaxyproject.org
https://elixir-europe.org/news/covid-19-variants-galaxy
https://doi.org/10.1101/2021.03.25.437046
Suite of
workflows
6. Distributed analysis, Pulsar network
Managed online hosted Workflow as a Service Platform
Designed for direct use by end users - 32K users
Experts build workflows that others can use with their own data
Researchers build and reuse workflows that are shared
End users also use it to access and interact with a tool
Workflow and Tool histories and reporting [Björn Grüning]
7. Those workflows in the WorkflowHub Registry
Find, publish and cite workflows and
collections. Reuse, recycle, repurpose.
8. Sharing Accelerates Science
Jacques van Helden
A digital space for
EMERGEN, the French
plan for SARS-CoV-2
genomic surveillance and
research
Adapting and Reusing the ELIXIR
Galaxy Workflows
Tried and tested transparent
methods.
9. Inter-twingled Workflow System Landscape
Scripting
environments
Interactive Electronic
Research Notebooks
Workflow
Management
Systems & execution
platforms
Repositories Registries
Inter-twingling
Mix and Matching
Interactive &
exploratory
analysis
Production, automated,
workflow-integrated
software
https://s.apache.org/existing-workflow-systems
298 Systems
10. 10 Handy Properties of Computational Workflows
Composition & Abstraction
Using the best codes written by 3rd parties
Handle heterogeneity
Shield complexity & incompatibility
Sharable reusable, re-mixable methods
Automation
Repetitive reproducible pipelines
Simulation sweeps
Manage data and control flow
Optimised monitoring & recovery
Automated deployment
Scalability & Infrastructure Access
Accessing infrastructures, datasets and tools
Optimised computation and data handling
Parallelisation
Secure sensitive data access & management
Interoperating datasets & permission handling
Reporting & Accreditation
Portability
Sharing & Adaptability
Provenance logging & data lineage
Auto-documentation
Result comparison
Dependency handling
Containerisation & packaging
Moving between on premise & cloud
Shared method, publishable know-how
BYOD / parameters
Different implementations
Changes in execution infrastructure
11. https://snakemake.github.io/
Workflows are rules:
Graph of jobs for automatic parallelisation,
DIY package & containerisation
installation, auto-documentation
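The idea of "workflows are rules" yielding a graph of jobs can be sketched in plain Python. This is an illustration of the general technique only, not Snakemake's implementation or syntax: the rule names, file patterns and the single {sample} wildcard are assumptions made for the example.

```python
import re

# Hypothetical rules: each maps input file patterns to one output pattern
# through a {sample} wildcard.
RULES = {
    "map":  {"input": ["reads/{sample}.fq"], "output": "mapped/{sample}.bam"},
    "sort": {"input": ["mapped/{sample}.bam"], "output": "sorted/{sample}.bam"},
}


def match(pattern, path):
    """Return the wildcard value if path fits the pattern, else None."""
    prefix, suffix = pattern.split("{sample}")
    regex = re.escape(prefix) + r"(?P<sample>[^/]+)" + re.escape(suffix)
    found = re.fullmatch(regex, path)
    return found.group("sample") if found else None


def jobs_for(target, rules, existing, jobs):
    """Walk backwards from a target file, collecting the jobs it needs."""
    if target in existing or target in jobs:
        return
    for name, rule in rules.items():
        sample = match(rule["output"], target)
        if sample is not None:
            inputs = [p.format(sample=sample) for p in rule["input"]]
            jobs[target] = {"rule": name, "inputs": inputs}
            for path in inputs:
                jobs_for(path, rules, existing, jobs)
            return
    raise FileNotFoundError(f"no rule produces {target}")


def plan(targets, rules, existing):
    """Group the required jobs into batches that could run in parallel."""
    jobs = {}
    for target in targets:
        jobs_for(target, rules, existing, jobs)
    batches, pending = [], dict(jobs)
    while pending:
        ready = sorted(
            out for out, job in pending.items()
            if not any(i in pending for i in job["inputs"])
        )
        if not ready:
            raise ValueError("cyclic dependency between rules")
        batches.append(ready)
        for out in ready:
            del pending[out]
    return batches


if __name__ == "__main__":
    files_on_disk = {"reads/a.fq", "reads/b.fq"}
    for batch in plan(["sorted/a.bam", "sorted/b.bam"], RULES, files_on_disk):
        print(batch)
```

Jobs within one batch share no dependencies, which is what lets an engine parallelise them automatically.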
from frameworks to web based analysis platforms, hybrid cloud deployment
Communities tend to cluster round a few systems.
Take up of a WfMS typically depends on the “plugged-in” availability of data type
specific codes, skills level of the workflow developers, and popularity.
Online portals users build and reuse
workflows around publicly available or
user-uploaded data and pre-wrapped,
pre-installed tools.
13. WORKFLOW
APPLICATION USER
Yes it’s work, Labour saving -> Labour shifting know-how
Production platforms & pipelines
TOOL
DEVELOPER
WORKFLOW
USER
SYS ADMIN WORKFLOW
DEVELOPER
& CUSTODIAN
COMPUTATIONAL
USER
Workflow System as a Platform Workflow System as a Service
Labour
Reach
need
infrastructure
& services
need tools to be
wrapped &
maintained
need workflows to be
developed, tested,
run & maintained
need to find and understand
workflows, with explanations to
use properly and safely.
14. from compounds &
genomics to tissue banks,
from plants to marine to
humans…
https://lifescience-ri.eu/
An open collaborative
space for digital
biology in Europe
15. A Workflow and Tools Collaboratory
A data and method commons
Workflows are an entry point to the tools and
datasets of EOSC-Life
functions for production quality FAIR data
processing and access to secure data processing
With thanks: Romain Dallet
Galaxy Genome Annotation (GGA) environment in the cloud
16. The EOSC-Life Workflow Collaboratory
People -> workflows, services and standards for FAIR Workflows.
18. Computational Workflow Framing: FAIR Principles
The EOSC-Life FAIR Workflow Collaboratory
A set of guiding principles to enhance the
value of all digital resources and their reuse
by people and by machines
aligning a community around a journey to
common data guidelines
To help accelerate science so folks can find
and reuse and interlink data – and tools
and workflows too!
Consumers and producers all benefit.
19. Computational Workflow Framing: FAIR Principles
The EOSC-Life FAIR Workflow Collaboratory
FAIR is the EOSC glue to federate
data and services,
to apply to all objects
20. How the FAIR Principles look
RDA FAIR Data Maturity Model. Specification and Guidelines
https://zenodo.org/record/3909563#.YORYkUzTX19
https://www.go-fair.org/fair-principles/
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data
management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
21. FAIR Principles for Data
tl;dr
https://www.go-fair.org/fair-principles/
Persistent human readable and machine-actionable metadata
Linked metadata and community standards
Persistent identifiers
Clear licensing and access rules
Protocols for machine accessibility
Register / Index
22. FAIR for Software
Software is a digital object but research software is not (just) data
https://www.rd-alliance.org/groups/fair-4-researchsoftware-fair4rs-wg
FAIR for Research Software (FAIR4RS) working group
Katz et al., 2016; Lamprecht et al., 2019
FAIR4RS First Draft of FAIR4RS principles
CodeMeta
https://github.com/codemeta/codemeta/
23. https://www.softwareheritage.org/
https://www.cascad.tech/
puts software on a par with publications and data and announces a
number of measures designed to open research software and
better recognize software development in research.
https://cache.media.enseignementsup-recherche.gouv.fr/file/science_ouverte/20/9/MEN_brochure_PNSO_web_1415209.pdf
24. Data and software are first class objects and
there will be sharing.
Primary responsibility aimed at creators and
providers for benefit of consumers
but consumers need to shoulder responsibility
too.
Operating in an (open) ecosystem.
Adoption at scale in legacy settings.
Not a green-field site.
EOSC-Life FAIR
Workflow Collaboratory
FAIR Implicit Assumptions in the Principles
25. FAIR Principles for Workflows
Hybrid Processual Digital Objects
Method “Data” Objects
Workflows as
FAIR Software
FAIR+R and FAIR++
The principles can be
revised
Workflows as
FAIR Digital Objects
Data-like method objects
The principles can be
adapted
Workflows as
FAIR Data Instruments
FAIRification of the dataflow
The data principles can be
supported
C. Goble, S. Cohen-Boulakia, S.
Soiland-Reyes, D. Garijo, Y. Gil, M.R.
Crusoe, K. Peters & D. Schober. FAIR
computational workflows. Data
Intelligence 2(2020), 108–121.
doi: 10.1162/dint_a_000
Workflow Objects
Software Objects
Composable
Usable
Reusable
FAIR Data
26. Abstraction & Reporting
Separation of the workflow specification from its execution & tools
Specification
description
Software
Execution
Precise description of a procedure
composed of multiple steps
coordinated by input/output data
relationships.
Execution of computational and
composed processes with data
consumed & produced by each step.
WfMS
Engine
Workflow
Sub
Workflows
Tools and
codes
Parameters
Inputs
Outputs
Infrastructure
Guidance
Associated
Objects
Data
Logs /
Histories /
Provenance
Services,
e.g. Test engines
+
Related workflows
Checker workflows
Contextual Entities
Metadata Graphs
Sample input
parameters, test data
Software
Management
28. FAIR Principles for Workflows
coping with Hybrid Processual Digital Objects
Composition & agency
Usable not just reusable
Abstraction forms
Living & reusable parts & whole
versioned, forked, cloned
parts recycled, repurposed, remixed
limited lifespans
citable credit
executability
reproducibility, portability
testing, maturity
quality, maintainability
specification
implementation
instantiation
run result
FAIR+R
FAIR++
modularisation
FAIR parts & dependencies
propagation of FAIR properties
29. Findable & Accessible
register workflows with assigned PID + metadata in a searchable resource.
https://workflowhub.eu
Publishing Services
Journals
Digital Objects of Scholarship
published, cited, exchanged, reviewed, validated & reused in
new and different ways
• Versioned identifiers
• DOI assignment (https://doi.org/10.48546/workflowhub.workflow.29.2)
• Collections, Canonical workflow libraries
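A versioned identifier can be handled programmatically. The sketch below assumes, from the single example DOI on this slide, that such DOIs follow the layout prefix/workflowhub.workflow.ID.VERSION; that layout and the helper names are assumptions for illustration, not a documented contract.

```python
import re
from typing import NamedTuple


class WorkflowPid(NamedTuple):
    prefix: str
    workflow_id: int
    version: int


# Assumed layout: <prefix>/workflowhub.workflow.<id>.<version>,
# optionally preceded by the https://doi.org/ resolver.
_DOI = re.compile(
    r"(?:https?://doi\.org/)?(10\.\d+)/workflowhub\.workflow\.(\d+)\.(\d+)"
)


def parse_workflow_doi(doi: str) -> WorkflowPid:
    """Split a versioned workflow DOI into prefix, workflow id and version."""
    found = _DOI.fullmatch(doi.strip())
    if found is None:
        raise ValueError(f"not a recognised workflow DOI: {doi!r}")
    return WorkflowPid(found.group(1), int(found.group(2)), int(found.group(3)))


def resolver_url(pid: WorkflowPid) -> str:
    """Rebuild the resolvable https://doi.org/ form of the identifier."""
    return (
        f"https://doi.org/{pid.prefix}/"
        f"workflowhub.workflow.{pid.workflow_id}.{pid.version}"
    )


if __name__ == "__main__":
    pid = parse_workflow_doi("https://doi.org/10.48546/workflowhub.workflow.29.2")
    print(pid)
    print(resolver_url(pid))
```

Keeping the version inside the identifier means a citation points at the exact workflow that was used, even after later releases.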
scripts
Repos
Containers Deploys
Tools
Agnostic and generous with the many
WfMSs (with different degrees of support)
• Workflows can be in native places
• Metadata standards framework that
all services can adopt on a spectrum
and handles associated objects and
links between objects.
• Perpetual development by an open
community
31. More than just a list
Spaces, Teams, People
Linking up providers and users
Building visibility & reputation
Reciprocity to close the
“Find – Get– Use – Credit” loop
Research objects to be cited
Build Knowledge Graphs linking
out to OpenAIRE, DataCite and
other tools
32. FAIR Workflows are FAIR Software
lifecycle support for living objects
Indicators of Status
Workflow
monitoring
Register versions
Version PIDs
Support GitHub Actions
Track authors and contributions
Incremental metadata and
supplementary materials
Track & lift out sub-
workflows
R1.2: (Meta)data and software are associated
with detailed provenance
33. Tool Registry Service API
Accessible
metadata and workflows are retrievable by their PID using a standardized
communication protocol
GitHub page: https://github.com/ga4gh/tool-registry-service-schemas
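Retrieval by identifier over a standard protocol can be sketched as building the request URL for a workflow descriptor. The path layout below follows the GA4GH Tool Registry Service v2 schema as I understand it (tools, versions, then descriptor type); the base URL and the identifiers in the usage lines are assumptions for illustration and should be checked against the registry actually used.

```python
from urllib.parse import quote

# Assumed base URL of a TRS v2 endpoint; replace with the registry's own.
TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"


def trs_version_url(tool_id: str, version_id: str, base: str = TRS_BASE) -> str:
    """URL describing one version of a registered tool or workflow."""
    # Identifiers may contain '/' or '#', so they are percent-encoded.
    return (
        f"{base}/tools/{quote(tool_id, safe='')}"
        f"/versions/{quote(version_id, safe='')}"
    )


def trs_descriptor_url(
    tool_id: str, version_id: str, descriptor_type: str, base: str = TRS_BASE
) -> str:
    """URL of the workflow descriptor in a given language, e.g. CWL or GALAXY."""
    return f"{trs_version_url(tool_id, version_id, base)}/{descriptor_type}/descriptor"


if __name__ == "__main__":
    print(trs_version_url("29", "2"))
    print(trs_descriptor_url("29", "2", "GALAXY"))
```

An HTTP GET on such a URL is all a workflow engine needs to fetch the workflow, which is what makes the registry usable by machines as well as people.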
34. FAIR Metadata for Machines
Machine and human readable canonical descriptions of the workflow
that are WfMS neutral
https://www.commonwl.org
Canonical description of the workflow
Linked to containerised tools
Aid collaboration & knowledge transfer
Standardise expression of workflow
Describe engine neutral portable, reusable workflows
Reduce vendor / project lock-in
Enable workflow comparisons
“Abstract” CWL
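To make "canonical description" concrete, here is a small sketch that builds a one-step CWL workflow description as a Python dictionary and serialises it as JSON, which CWL accepts alongside YAML. The tool path tools/qc.cwl and the input and output names are hypothetical; the consistency check is a simplified illustration that only understands the compact map form used here.

```python
import json

# A minimal engine-neutral workflow description (CWL v1.2, map form).
# "tools/qc.cwl" is a hypothetical tool description file.
WORKFLOW = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"reads": "File"},
    "outputs": {"report": {"type": "File", "outputSource": "qc/report"}},
    "steps": {
        "qc": {
            "run": "tools/qc.cwl",
            "in": {"reads": "reads"},
            "out": ["report"],
        }
    },
}


def dangling_sources(wf):
    """List references that match no workflow input and no step output."""
    known = set(wf["inputs"])
    for step_id, step in wf["steps"].items():
        known.update(f"{step_id}/{out}" for out in step["out"])
    used = [src for step in wf["steps"].values() for src in step["in"].values()]
    used += [out["outputSource"] for out in wf["outputs"].values()]
    return [src for src in used if src not in known]


if __name__ == "__main__":
    problems = dangling_sources(WORKFLOW)
    if problems:
        raise SystemExit(f"unresolved sources: {problems}")
    print(json.dumps(WORKFLOW, indent=2))
```

Because the description is plain structured data, it can be compared, validated and exchanged without running any particular WfMS.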
35. Design by canonical, modularised workflow blocks
Build a library of tested and validated CWL blocks
CWL:
• Canonical descriptions
• Recycle descriptions and sub-workflows
• Platform independent pipeline exchange and comparison
Rob Finn
Folker Meyer
AWE
MEGAHIT
Assembly
pipeline
[with thanks to Rob Finn]
36. Extensible Metadata Framework
that caters for all those processual FAIR criteria
Common metadata
about the workflow,
tools & parameters
Canonical workflow
description of the
steps of the workflow
Type the input and outputs
of the steps
Run Provenance / Histories / Tests
Format for packaging a
workflow, its metadata and
companion objects (links to
containers, data etc) for
exchange, archiving,
reporting, citing.
FAIR Digital Object
All Open Communities
37. Bioschemas lightweight metadata
Extensible and Linked metadata in service of the Life Science Community
Open community reusing industry de facto standard
Computation workflow profile
Formal
parameter
profile
https://bioschemas.org
Opinionated use of schema.org, the web
resource mark-up used by search engines,
knowledge graphs and increasingly science
as a whole.
Computational tool
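A sketch of what such lightweight markup could look like: the Python below builds a JSON-LD description using the ComputationalWorkflow and FormalParameter types named on this slide and wraps it in the script tag that web pages use to embed JSON-LD. The selection of properties, the workflow name and the example.org URL are assumptions for illustration; the Bioschemas profile itself defines which properties are required.

```python
import json


def workflow_markup(name, url, language, inputs, outputs):
    """Build schema.org-style JSON-LD for a workflow (illustrative subset)."""

    def params(names):
        return [{"@type": "FormalParameter", "name": n} for n in names]

    return {
        "@context": "https://schema.org",
        "@type": "ComputationalWorkflow",
        "name": name,
        "url": url,
        "programmingLanguage": language,
        "input": params(inputs),
        "output": params(outputs),
    }


SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def as_script_tag(markup):
    """Wrap the JSON-LD so it can be embedded in an HTML page."""
    return SCRIPT_OPEN + json.dumps(markup, indent=2) + SCRIPT_CLOSE


if __name__ == "__main__":
    markup = workflow_markup(
        name="Example variant calling workflow",   # hypothetical
        url="https://example.org/workflows/1",     # hypothetical
        language="Galaxy",
        inputs=["reads", "reference"],
        outputs=["variants"],
    )
    print(as_script_tag(markup))
```

Embedding the description in the registry's own web pages lets search engines and knowledge-graph harvesters pick it up without a bespoke API.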
Herve
Menager
Pasteur
Alban
Gaignard
Nantes
38. Workflow Digital Objects
Lightweight way of packaging everything together regardless where or what it is
https://www.researchobject.org/ro-crate/
Format for packaging up scattered resources and self
describing the package and its parts to get an integrated
view + context, using metadata and PIDs to reference
digital and real things - data, workflows & people, places.
Web-native, off the
shelf - machine and
human readable,
search engine &
developer friendly.
Infrastructure
independent &
self-describing
PIDs, JSON-LD,
Schema.org,
archive formats
Extensible and open-
ended to cope with
diversity and legacy
“Duck typing”
using profiles +
added schema.org
and domain
ontologies
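The packaging idea can be sketched by generating the ro-crate-metadata.json file that sits at the root of a crate. The layout below follows my reading of RO-Crate 1.1 (a metadata descriptor, a root dataset, and one entity per part); the file names are hypothetical and the typing of the workflow entity is a simplified take on the workflow profile, so treat it as a sketch rather than a validated crate.

```python
import json


def make_crate(name, workflow_file, other_files=()):
    """Build the JSON-LD for a crate containing a workflow and companions."""
    parts = [workflow_file, *other_files]
    graph = [
        {   # metadata descriptor: says what this file is and what it describes
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root dataset: the package itself
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "mainEntity": {"@id": workflow_file},
            "hasPart": [{"@id": p} for p in parts],
        },
        {
            "@id": workflow_file,
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
            "name": name,
        },
    ]
    graph += [{"@id": p, "@type": "File"} for p in other_files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def undescribed_parts(crate):
    """Parts listed by the root dataset that have no entity of their own."""
    described = {entity["@id"] for entity in crate["@graph"]}
    root = next(e for e in crate["@graph"] if e["@id"] == "./")
    return [p["@id"] for p in root["hasPart"] if p["@id"] not in described]


if __name__ == "__main__":
    crate = make_crate("Example workflow", "workflow.cwl", ["test/input.json"])
    print(json.dumps(crate, indent=2))
    print("undescribed parts:", undescribed_parts(crate))
```

The "@id" values may equally be URLs or PIDs, which is how a crate can reference resources that live elsewhere rather than copying them in.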
40. BioComputeObject - Regulation
why and how to use a workflow IEEE P2791-2020
robust, safe exchange & reuse of
HTS computational analytical
workflows
http://biocomputeobject.org
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al “Enabling Precision Medicine via standard
communication of NGS provenance, analysis, and results” PLOS Biology 2018,
https://doi.org/10.1371/journal.pbio.3000099
https://biocompute-objects.github.io/bco-ro-crate/
“Sidecar” third party metadata files
inside the RO-Crate
FAIR has to operate in a
legacy ecosystem
format
41. FAIR Digital Objects
RO-Crate a step towards FAIR Digital Object Middleware
“To be FAIR each digital object
type has its own metadata
requirements, and may have its
own repositories and registries”
FAIR Digital Objects for Science: From Data Pieces to Actionable
Knowledge Units: https://doi.org/10.3390/publications8020021
https://fairdo.org
https://fairdo.org/wg/fdo-cwfr/
42. Lightweight Semantic Workflow Underware is ready!
A2. metadata are accessible, even when the workflow is no
longer available
Metadata preservation...beyond any one service.
RO-Crate archive preserves metadata and workflow,
republished in a long-term archive
Archiving
General
Executing
Testing & Monitoring
WfMS
R1. workflows are richly described with a plurality of
accurate and relevant attributes
Automating metadata as much as possible, which
means on-boarding WfMS and FAIR services
Enough metadata that a workflow is read-
reproducible as a method description
43. FAIR Software - not just Reusable but Usable
i.e. can be executed once accessed
Multiple wf/test
backends: Galaxy
Planemo, CWL,
Jenkins …
Check workflow
performance,
provenance on
containers,
memory usage …
Testing and monitoring
Containers & Packaging
FAIR+R
FAIR++
Tool Registry Service API
UI to start
computational tasks
based on
containerised
software
https://github.com/inab/WfExS-backend
High-level workflow
execution service
backend, sensitive
data analysis &
running on private
clouds, produces &
consumes RO-Crate
44. Reproducibility – Repeatability
Provenance & Preservation
Workflow-Run-RO-Crate
Some heavy lifting … when is FAIR enough?
https://iitdbgroup.github.io/ProvenanceWeek2021/
July 22nd 2021
It’s free!!
R1.2: (Meta)data and software are associated with
detailed provenance - not just the workflow but the
run record associated with the data it produced ….
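A run record that ties outputs back to the workflow version, parameters and inputs could be sketched as follows. The record layout and function names are hypothetical and far simpler than a real provenance profile such as Workflow-Run-RO-Crate; the sketch only shows the principle of linking data to the run that produced it through content checksums.

```python
import hashlib
from datetime import datetime, timezone


def digest(content: bytes) -> str:
    """SHA-256 checksum used to identify a piece of data by its content."""
    return hashlib.sha256(content).hexdigest()


def run_record(workflow_pid, parameters, inputs, outputs):
    """Summarise one run: which workflow, which settings, which data."""
    return {
        "workflow": workflow_pid,
        "finished": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(parameters),
        "inputs": {name: digest(data) for name, data in inputs.items()},
        "outputs": {name: digest(data) for name, data in outputs.items()},
    }


def was_produced_by(record, name, content: bytes) -> bool:
    """Check whether this exact content is the named output of the run."""
    return record["outputs"].get(name) == digest(content)


if __name__ == "__main__":
    record = run_record(
        workflow_pid="https://doi.org/10.48546/workflowhub.workflow.29.2",
        parameters={"min_quality": 20},              # hypothetical parameter
        inputs={"reads.fq": b"@r1\nGATT\n+\nIIII\n"},  # hypothetical data
        outputs={"variants.vcf": b"##fileformat=VCFv4.2\n"},
    )
    print(record["workflow"], record["outputs"])
    print(was_produced_by(record, "variants.vcf", b"##fileformat=VCFv4.2\n"))
```

Recording the versioned workflow identifier alongside the checksums is what lets a later reader ask both "which method made this file?" and "is this still the same file?".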
45. FAIR Interoperability and Reusability = Composability
*Reusable (can be understood, modified, built
upon or incorporated into other software)
Software interoperates with other software through community
standard APIs and community standard (meta)data
Software includes qualified references to other objects
Richly described
Well documented
Licensed
Sample input parameters and test data
Checker workflows
Track versions
Programmatic access to (meta)data
Libraries of canonical workflow blocks
Make tools workflow-ready
Wrap tools
*FAIR4RS Proposed Principles for FAIR Software
Design for FAIR Data
Design for Reuse
Community Review
Community Curation
Certification
Best Practice
Licence combinations
Access permissions
Local -> Global identifiers
46. FAIR takes a village
it’s a JOINT responsibility and opportunity!
In order for data to be
FAIR, you need services
that enable FAIR
Be a good plug-in tool and data citizen
enable programmatic access to datasets
make clean tool interfaces
avoid usage restrictions
use open community data standards and formats
simplify installation
code for portability, parallelisation & reproducibility
manage versions
register! document!
Be a good workflow maker......and user
use and make FAIR identifiers for data
license data outputs
use open community data standards and formats
validate parameters
use a WfMS that tracks data provenance
consider secure data processing
manage versions
design tests and test data
credit tool and sub-workflow makers
choose FAIR data services
register! document! build libraries!
use well documented FAIR
enabling and FAIR workflows
credit the makers!
47. FAIR takes a village
it’s a JOINT responsibility and FAIR ≠ FREE
Advocate standards & practice
Sustain and manage infrastructure
Credit and incentives
Maturity models & metrics
Certification and canonical libraries
In order for data to be
FAIR, you need services
that enable FAIR
Training,
Stewardship &
Sustainability
Workflows are an entry point to
the tools and datasets of EOSC-Life
and functions for FAIR data.
48. FAIR Computational Workflows: TL;DL
Modern bioinformatics increasingly leans on computational workflows as
production workhorses and for transparent, reproducible processing.
Workflows democratise access to data and infrastructure and sharing of
complex processing.
Workflows are hybrid Digital Objects of scholarship that should be FAIR
which means defining FAIR, and the necessary standards, services and
processes.
FAIR is an opportunity and necessity to get wider uptake of workflows
FAIR data, workflows and their infrastructure and everything else takes a
village where everyone shoulders responsibility for the benefit of all.
49. Acknowledgements
The WorkflowHub Club, Bioschemas Community, RO-Crate
Community, CWL Community, Galaxy Europe, EOSC-Life
and ELIXIR Tools Platform.
Special Thanks
Stian Soiland-Reyes (U Manchester / U Amsterdam)
Paul Brack, Stuart Owen, Finn Bacall, Alan Williams (U Manchester)
Björn Grüning (U Freiburg)
Frederik Coppens (VIB)
Sarah Jones (GEANT)
Herve Menager (Pasteur Institute)
Sarah Cohen-Boulakia (U Paris-Saclay)
Dan Katz (U Illinois Urbana-Champaign)
Simone Leo (CRS4)
Laura Rodriguez-Navas (BSC)
José Mª Fernández (BSC)
EOSC-Life https://www.eosc-life.eu/
ELIXIR http://elixir-europe.org
RO-Crate https://www.researchobject.org/ro-crate/
WorkflowHub https://workflowhub.eu/ and workflowhub.org
Galaxy Europe https://galaxyproject.eu/
Bioschemas https://bioschemas.org
Common Workflow Language https://www.commonwl.org/