Assessing Galaxy's ability to express scientific workflows in bioinformatics
Introduction
Biological Sequence Analysis
Scientific workflow management systems
The Galaxy framework
Conclusion
Bibliography
References
Peter van Heusden and Alan Christoffels
South African National Bioinformatics Institute
University of the Western Cape
Bellville, South Africa
10th FASTAR/Espresso Workshop 2013 / 4-6 November 2013
What is bioinformatics?
Bioinformatics is the discipline of solving problems in biology and
medicine using computational resources.
Within bioinformatics, biological sequence analysis (BSA)
describes those analyses that “infer biological information from
sequence alone”. (Durbin, 1998)
Cost of biological sequence analysis has two parts:
1. Cost of acquiring sequence
2. Cost of analysing sequence
Cost of acquiring sequence
(Wetterstrand, 2013)
Cost of analysing sequence
The “sudden reliance on computation has created an ‘informatics
crisis’ for life science researchers: computational resources can
be difficult to use, and ensuring that computational experiments
are communicated well and hence reproducible is challenging”
(Goecks et al., 2010)
As the cost of sequencing plummets, analysis faces two challenges:
1. Growing data volume demands more sophisticated computational approaches
2. Translating biological questions into computational workflows remains difficult
How do we do bioinformatics?
Given a set of protein sequences from species A, which genes
from species B produce similar proteins, and where are these
genes located on the genome of B?
Analysis proceeds (Stevens et al., 2001) using:
1. Collections of data objects
2. Transformers that generate new collections (e.g. transform collection of proteins into collection of genome regions that they match)
3. Filters (e.g. discard low quality matches to genome)
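The three building blocks above can be sketched in a few lines of Python. This is only an illustration: the `Match` record, the `toy_aligner` stand-in and the 0.8 identity threshold are assumptions made for the example, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Match:
    """Hypothetical record linking a protein to a matching genome region."""
    protein_id: str
    region: str      # illustrative coordinate string
    identity: float  # fraction of identical residues, 0.0 to 1.0


def transform(proteins: Iterable[str],
              aligner: Callable[[str], List[Match]]) -> List[Match]:
    """Transformer: turn a collection of proteins into a collection of matches."""
    return [match for protein in proteins for match in aligner(protein)]


def keep_good(matches: Iterable[Match], min_identity: float = 0.8) -> List[Match]:
    """Filter: discard low quality matches."""
    return [m for m in matches if m.identity >= min_identity]


def toy_aligner(protein_id: str) -> List[Match]:
    """Stand-in for a real alignment tool; returns hard-coded hits."""
    hits = {
        "protA": [Match("protA", "chr1:100-400", 0.95),
                  Match("protA", "chr3:50-300", 0.40)],
        "protB": [Match("protB", "chr2:1500-2400", 0.85)],
    }
    return hits.get(protein_id, [])


matches = transform(["protA", "protB", "protC"], toy_aligner)
good = keep_good(matches)
```

Here `good` holds only the two matches that pass the threshold; the question "where are these genes located" is answered by their `region` fields.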
How we do bioinformatics (2)
Data collections typically exist as (compressed) files
Bioinformatics tools typically are command line executables that
accept and generate files (often using ad-hoc formats)
Scripting languages (Perl, Python) used to compose workflows,
APIs often used for reading/writing file formats
1. Workflow enactment often involves manual steps and is closely tied to the execution environment
2. Workflow is neither easily reproducible nor reusable
Scientific workflow management systems
Scientific workflow management systems (SciWMS) have been
proposed as an alternative to current script-based approaches to
analysis workflow.
SciWMSs “provide a high-level declarative way of specifying what
a particular in silico experiment modelled by a workflow is set to
achieve, not how it will be executed.” (Taverna project, 2009)
Workflow descriptions resemble dataflow languages (McPhillips
et al., 2009)
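As a rough illustration of "what, not how": a workflow can be written down purely as a dataflow graph, leaving the execution order to be derived. The task names below are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for a real workflow engine.

```python
from graphlib import TopologicalSorter

# Declarative description: each task lists the tasks whose output it consumes.
# Nothing here says in which order, or where, the tasks are run.
workflow = {
    "load_proteins": set(),
    "load_genome": set(),
    "align": {"load_proteins", "load_genome"},
    "filter_matches": {"align"},
    "report": {"filter_matches"},
}


def execution_order(spec):
    """Derive one valid execution order from the dataflow dependencies."""
    return list(TopologicalSorter(spec).static_order())


order = execution_order(workflow)
```

Any order that respects the dependencies is acceptable, which is what leaves room for an engine to schedule independent tasks as it sees fit.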
The promise of SciWMSs
In addition to workflow specification, SciWMSs sometimes offer:
Types that model objects of scientific domain
Recording of provenance of data objects
Execution of scientific workflows on diverse computing
environments (desktop, cluster, grid, cloud)
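Provenance recording can be pictured as each data object remembering the tool, parameters and parent objects that produced it. The sketch below is a minimal illustration under that assumption; `DataObject`, `derive` and `lineage` are hypothetical names, not the API of any SciWMS.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class DataObject:
    """A data object that carries its own provenance (illustrative only)."""
    content: bytes
    tool: str = "input"
    params: Tuple[Tuple[str, str], ...] = ()
    parents: Tuple["DataObject", ...] = ()

    @property
    def digest(self) -> str:
        """Short content hash, useful for checking that a rerun reproduced the data."""
        return hashlib.sha256(self.content).hexdigest()[:12]


def derive(tool: str, params: Dict[str, str],
           parents: Sequence[DataObject], content: bytes) -> DataObject:
    """Create a new object and record how it was produced."""
    return DataObject(content, tool, tuple(sorted(params.items())), tuple(parents))


def lineage(obj: DataObject) -> List[str]:
    """List the tools that led to obj, from original inputs to obj itself."""
    steps: List[str] = []
    for parent in obj.parents:
        steps.extend(lineage(parent))
    steps.append(obj.tool)
    return steps


raw = DataObject(b"MKVLAAGIV")
hits = derive("align", {"evalue": "1e-5"}, [raw], b"chr1:100-400")
kept = derive("filter", {"min_identity": "0.8"}, [hits], b"chr1:100-400")
```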
SciWMSs for bioinformatics
Many workflow systems have been proposed for use in
bioinformatics: Taverna, Kepler, Triana, Bioopera, Mobyle,
BiosFlow, bpipe
Some workflow features are also available in Galaxy
Support for workflow patterns
Scientific Data Modelling
Workflow representation and use
What is Galaxy
Galaxy emerged in 2004/5 as a web interface to bioinformatics
tools and data
Galaxy is becoming a common platform through which to “publish”
tools and data
More than 30 known public Galaxy servers
36 000 users on main public Galaxy server, 0.8 PB of data
Galaxy as an open-source project
Galaxy consists of c. 250 000 lines of (mostly Python) code
Core team includes 15 developers spread across 4 different
institutes
Development is open source and “out in the open” with code
hosted on BitBucket, development planning on Trello and mailing
lists
Galaxy I
Galaxy II
Galaxy workflow management features
Galaxy allows composition of workflows defined as series of
tasks and related dataflow
Allows execution of workflows on local machine or via various job
schedulers
Data objects generated in Galaxy have associated provenance
information
Limitations of Galaxy as a SciWMS
Limited support for scientific workflow patterns
Type refers to format of data items
Provenance is recorded as attribute of data files
Workflows are not first class objects
Analysis view focuses on individual datasets
Execution engine schedules tasks (with limited support for task
collections)
Galaxy can be enriched by drawing on prior research on
SciWMSs
Scientific workflow patterns
Analysis of scientific workflows has yielded a set of design
patterns used in workflows (Yildiz et al., 2009)
Galaxy workflow language supports sequential dataflow, parallel
split and synchronisation
Tool definition language has recently been extended to support
multiple instances of task (not workflow) execution with a-priori
runtime knowledge
Tool authors can signal that input to tool can be split for parallel
execution
No interface between workflow authors and multiple instance
support
Support for cancel of individual task but not entire workflow
No support for triggering new thread of activity (restart)
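For readers unfamiliar with the pattern names, the three supported patterns (sequential dataflow, parallel split, synchronisation) can be sketched in plain Python. This is not Galaxy code; the task functions are hypothetical and a thread pool stands in for the execution engine.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def trim(reads: List[str]) -> List[str]:
    """Remove leading and trailing unknown bases (N)."""
    return [r.strip("N") for r in reads]


def count_gc(reads: List[str]) -> int:
    return sum(r.count("G") + r.count("C") for r in reads)


def total_length(reads: List[str]) -> int:
    return sum(len(r) for r in reads)


def gc_workflow(reads: List[str]) -> float:
    # Sequential dataflow: trimming must finish before anything else starts.
    trimmed = trim(reads)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Parallel split: two independent tasks consume the same output.
        gc_future = pool.submit(count_gc, trimmed)
        length_future = pool.submit(total_length, trimmed)
        # Synchronisation: wait for both branches before continuing.
        gc, length = gc_future.result(), length_future.result()
    return gc / length if length else 0.0


gc_fraction = gc_workflow(["NNGATTACANN", "GGCC"])
```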
Scientific workflow patterns (2)
No support for exclusive choice (e.g. execute different dataflow
path based on different input)
No support for sub-workflows
Galaxy workflow language is “abstraction hating” (Green and
Petre, 1996)
Leads to workflow diagrams resembling bowl of spaghetti for
anything but the most simple cases
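To make the two missing patterns concrete, here is a minimal sketch of exclusive choice and sub-workflows as ordinary function composition. It shows the abstraction being asked for, not something Galaxy provides; all names (`compose`, `choose`, the tagging steps) are hypothetical.

```python
from typing import Callable, List

Step = Callable[[List[str]], List[str]]


def compose(*steps: Step) -> Step:
    """Sub-workflow: bundle several steps into one reusable step."""
    def sub_workflow(data: List[str]) -> List[str]:
        for step in steps:
            data = step(data)
        return data
    return sub_workflow


def choose(predicate: Callable[[List[str]], bool],
           if_true: Step, if_false: Step) -> Step:
    """Exclusive choice: send the data down exactly one of two paths."""
    def branch(data: List[str]) -> List[str]:
        return if_true(data) if predicate(data) else if_false(data)
    return branch


def is_protein(seqs: List[str]) -> bool:
    """Crude check: any letter outside the nucleotide alphabet means protein."""
    return any(set(s) - set("ACGTN") for s in seqs)


def upper(seqs: List[str]) -> List[str]:
    return [s.upper() for s in seqs]


def tag_protein(seqs: List[str]) -> List[str]:
    return ["protein:" + s for s in seqs]


def tag_dna(seqs: List[str]) -> List[str]:
    return ["dna:" + s for s in seqs]


# The composed step can itself be passed to compose(), i.e. workflows nest.
clean_and_route = compose(upper, choose(is_protein, tag_protein, tag_dna))
```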
The Galaxy type system
Galaxy types represent file types
File type does not map simply to semantics
Collection types are not supported, although some types are
“splittable” to allow parallel task execution
Workflow parameters are not supported via type system
Cannot guarantee that workflow is well-formed
Provenance recording is coarse-grained
What will happen if we update single element of input data
collection?
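A sketch of what "guaranteeing that a workflow is well-formed" could look like if tools declared semantic input and output types. The tool signatures and type names below are invented for illustration and do not correspond to Galaxy's tool definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToolSig:
    """Hypothetical tool signature: port name -> semantic type name."""
    inputs: Dict[str, str]
    outputs: Dict[str, str]


TOOLS = {
    "translate": ToolSig({"dna": "NucleotideCollection"},
                         {"protein": "ProteinCollection"}),
    "align": ToolSig({"query": "ProteinCollection",
                      "genome": "NucleotideCollection"},
                     {"hits": "RegionCollection"}),
}

# An edge connects (source tool, output port) to (target tool, input port).
Edge = Tuple[str, str, str, str]


def type_errors(edges: List[Edge]) -> List[str]:
    """Return one message per connection whose types do not agree."""
    errors = []
    for src, out_port, dst, in_port in edges:
        produced = TOOLS[src].outputs[out_port]
        expected = TOOLS[dst].inputs[in_port]
        if produced != expected:
            errors.append(
                f"{src}.{out_port} ({produced}) -> {dst}.{in_port} ({expected})")
    return errors


good_edges = [("translate", "protein", "align", "query")]
bad_edges = [("translate", "protein", "align", "genome")]
```

With types of this kind the check runs before any task is scheduled, whereas a check on file formats alone would accept both connections if both collections were stored as FASTA.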
Science questions vs execution plans
Type system could model scientific domain objects (e.g. protein
and nucleotide sequences) but . . .
Bioinformatics tools do not support standard formats or support
standard formats with quirks
Not clear what information to save from tool output
Experienced bioinformaticists want opportunity to review “raw
output” to explore factors that underpin confidence in analysis
Need to support both recording and reporting of workflow output
Both recording “raw” output trace and reporting provenance of
scientific domain objects are necessary features for SciWMS
Workflow execution in Galaxy
Internally workflows are expanded into collections of tasks at
execution time
Tasks are executed by backend classes: either local or via
scheduler
Execution parameters can be set by “dynamic job runners”
Allows e.g. resource requirements of job to be signalled to
scheduler
Configured using a combination of XML and Python code
maintained by Galaxy administrator
Workflow execution leaves no visible trace in the user interface
At runtime execution shows individual jobs running
Data objects are grouped by “history”, not associated with a
workflow
No support for re-execution of part of workflow
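Partial re-execution needs the workflow to be kept as a graph after it has been expanded into tasks. A minimal sketch, assuming a hypothetical `consumers` mapping from each task to the tasks that read its output (this is not a Galaxy data structure):

```python
from collections import deque
from typing import Dict, List, Set


def tasks_to_rerun(consumers: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return the changed task plus every task downstream of it."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        task = queue.popleft()
        for nxt in consumers.get(task, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


consumers = {
    "load": ["trim"],
    "trim": ["align", "qc"],
    "align": ["report"],
    "qc": ["report"],
}
rerun = tasks_to_rerun(consumers, "trim")
```

Everything outside `rerun` (here only `load`) could keep its existing output.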
Scope for workflow optimisation
Workflows are dataflow graphs (Johnston et al., 2004)
Knowledge of inputs and types can be used to plan execution
efficiently, e.g. pipeline tasks and exploit opportunities for
streaming
Collection of data objects and parameters sets can be exploited
for automatic parallel enactment of tasks and sub-workflows
Data collections and workflows provide structures for nesting of
provenance information
Knowledge of data provenance could facilitate lifecycle of data
products: kept for re-use or discarded as “intermediate products”
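The last point can be illustrated with a small sketch: once the dataflow graph is known, data products that feed a later task can be separated from final results. The `produces` and `consumes` mappings and the dataset names are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def classify_products(produces: Dict[str, List[str]],
                      consumes: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Split produced datasets into (intermediate, final).

    A dataset is intermediate if some task in the workflow consumes it;
    otherwise it is a final product of the workflow.
    """
    produced = {d for outputs in produces.values() for d in outputs}
    consumed = {d for inputs in consumes.values() for d in inputs}
    return sorted(produced & consumed), sorted(produced - consumed)


produces = {
    "align": ["raw_hits"],
    "filter": ["good_hits"],
    "report": ["summary"],
}
consumes = {
    "align": ["proteins", "genome"],
    "filter": ["raw_hits"],
    "report": ["good_hits"],
}
intermediate, final = classify_products(produces, consumes)
```

Whether an intermediate product is actually discarded remains a policy decision; experienced users may still want the "raw output" kept.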
Conclusion
Bioinformatics faces an “informatics crisis” as cost to generate
sequence has decreased while cost to compose or reproduce
analysis has remained high
Galaxy has emerged as a popular interface to bioinformatics tools
and data with workflow management features
Insight from prior research on SciWMSs suggests areas for
enhancement:
Support for additional workflow patterns
Extension of type system with support for biological types,
collections and parameter sets
Improvement of workflow execution through treating workflows as
first class objects with associated optimisation of execution and
provenance storage
Currently being pursued as a research agenda at SANBI
Thanks
Workflows for biological sequence analysis are discussed by the “Pipelines collaboration”
Research on SciWMS is supported by the MRC and Professor Alan Christoffels
Bibliography I
R. Durbin. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, Apr. 1998. ISBN 9780521629713.
J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life
sciences. Genome Biology, 11(8):R86, 2010.
T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a ‘cognitive
dimensions’ framework. Journal of Visual Languages and Computing, 7:131–174, 1996.
W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages.
ACM Computing Surveys, 36(1):1–34, Mar. 2004.
T. McPhillips, S. Bowers, D. Zinn, and B. Ludäscher. Scientific workflow design for mere mortals.
Future Generation Computer Systems, 25(5):541–551, May 2009.
R. Stevens, C. Goble, P. Baker, and A. Brass. A classification of tasks in bioinformatics.
Bioinformatics, 17(2):180–188, Feb. 2001.
Taverna project. Why use workflows?, 2009. URL
http://www.taverna.org.uk/introduction/why-use-workflows/.
Bibliography II
K. Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program
(GSP), 2013. URL http://www.genome.gov/sequencingcosts/.
U. Yildiz, A. Guabtni, and A. H. H. Ngu. Towards scientific workflow patterns. In Proceedings of
the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages
13:1–13:10, New York, NY, USA, 2009. ACM.