Introduction
Biological Sequence Analysis
Scientific workflow management systems
The Galaxy framework
Conclusion
Bibliography
References

Assessing Galaxy’s ability to express scientific
workflows in bioinformatics
Peter van Heusden and Alan Christoffels
South African National Bioinformatics Institute
University of the Western Cape
Bellville, South Africa

10th FASTAR/Espresso Workshop 2013 / 4-6 November 2013

What is bioinformatics?

Bioinformatics is the discipline of solving problems in biology and
medicine using computational resources.
Within bioinformatics, biological sequence analysis (BSA)
describes those analyses that “infer biological information from
sequence alone”. (Durbin, 1998)
Cost of biological sequence analysis has two parts:
1. Cost of acquiring sequence
2. Cost of analysing sequence

Cost of acquiring sequence

(Wetterstrand, 2013)

Cost of analysing sequence

The “sudden reliance on computation has created an ‘informatics
crisis’ for life science researchers: computational resources can
be difficult to use, and ensuring that computational experiments
are communicated well and hence reproducible is challenging”
(Goecks et al., 2010)

As cost of sequencing plummets, analysis faces two challenges:
1. Growing data volume demands more sophisticated computational
   approaches
2. Translating biological questions into computational workflows
   remains difficult

How do we do bioinformatics?

Given a set of protein sequences from species A, which genes
from species B produce similar proteins, and where are these
genes located on the genome of B?
Analysis proceeds (Stevens et al., 2001) using:
1. Collections of data objects
2. Transformers that generate new collections (e.g. transform
   collection of proteins into collection of genome regions that they
   match)
3. Filters (e.g. discard low quality matches to genome)
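
The three building blocks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Match` record, the `TOY_MATCHES` lookup table and the 0.5 score cut-off are all made up, and `match_to_genome` stands in for a real search tool.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Match:
    """Hypothetical record: one protein matched to one genome region."""
    protein_id: str
    region: str
    score: float


def transform(collection: Iterable, transformer: Callable) -> List:
    """Transformer: map each item to zero or more items of a new collection."""
    result = []
    for item in collection:
        result.extend(transformer(item))
    return result


def keep(collection: Iterable, predicate: Callable) -> List:
    """Filter: keep only the items for which the predicate holds."""
    return [item for item in collection if predicate(item)]


# Toy stand-in for a search tool: a lookup table of invented matches.
TOY_MATCHES = {
    "protA1": [Match("protA1", "chr1:100-400", 0.95),
               Match("protA1", "chr2:50-90", 0.20)],
    "protA2": [Match("protA2", "chr3:10-310", 0.80)],
}


def match_to_genome(protein_id: str) -> List[Match]:
    return TOY_MATCHES.get(protein_id, [])


proteins = ["protA1", "protA2", "protA3"]          # collection of data objects
matches = transform(proteins, match_to_genome)     # proteins -> genome regions
good = keep(matches, lambda m: m.score >= 0.5)     # discard low quality matches
regions = [m.region for m in good]
```

Here `regions` holds the locations on genome B that answer the question posed at the top of the slide, for this toy data only.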

How we do bioinformatics (2)

Data collections typically exist as (compressed) files
Bioinformatics tools typically are command line executables that
accept and generate files (often using ad-hoc formats)
Scripting languages (Perl, Python) used to compose workflows,
APIs often used for reading/writing file formats
1. Workflow enactment often involves manual steps and is closely
   tied to execution environment
2. Workflow is not easily reproducible nor reusable

Scientific workflow management systems

Scientific workflow management systems (SciWMS) have been
proposed as an alternative to current script-based approaches to
analysis workflow.
SciWMSs “provide a high-level declarative way of specifying what
a particular in silico experiment modelled by a workflow is set to
achieve, not how it will be executed.” (Taverna project, 2009)
Workflow descriptions resemble dataflow languages (McPhillips
et al., 2009)
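
The declarative, dataflow-like style described above can be illustrated with a small Python sketch. It is not the workflow language of Taverna or any other SciWMS: the `WORKFLOW` dictionary, its `run`/`needs` keys and the `enact` function are hypothetical names chosen for this example. The description states only which step consumes which output; the execution order is derived by the engine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Declarative description: what each step computes and which steps it needs.
# Nothing here says in which order the steps are executed.
WORKFLOW = {
    "load": {"run": lambda: ["ACGT", "AC", "GGGTTT"], "needs": []},
    "lengths": {"run": lambda seqs: [len(s) for s in seqs], "needs": ["load"]},
    "longest": {"run": lambda lens: max(lens), "needs": ["lengths"]},
    "count": {"run": lambda seqs: len(seqs), "needs": ["load"]},
    "report": {
        "run": lambda longest, count: f"{count} sequences, longest {longest}",
        "needs": ["longest", "count"],
    },
}


def enact(workflow):
    """Derive an execution order from the dataflow graph and run each step."""
    graph = {name: set(step["needs"]) for name, step in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        step = workflow[name]
        args = [results[dep] for dep in step["needs"]]
        results[name] = step["run"](*args)
    return results


results = enact(WORKFLOW)
```

Because the order is computed from the graph, the same description could in principle be handed to a different engine, for example one that runs independent steps in parallel.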

The promise of SciWMSs

In addition to workflow specification, SciWMSs sometimes offer:
Types that model objects of scientific domain
Recording of provenance of data objects
Execution of scientific workflows on diverse computing
environments (desktop, cluster, grid, cloud)
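
As a rough illustration of provenance recording, the sketch below attaches to every data object the name of the step that produced it and the objects it was derived from. The `DataObject` class and the `run_step` and `lineage` helpers are hypothetical and do not reflect how any particular SciWMS stores provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    """Hypothetical data object that remembers how it was produced."""
    value: str
    produced_by: str = "user upload"
    inputs: tuple = ()


def run_step(name, func, *inputs):
    """Run a step and record its name and inputs on the output object."""
    value = func(*(obj.value for obj in inputs))
    return DataObject(value, produced_by=name, inputs=inputs)


def lineage(obj):
    """List the steps that led to `obj`, most recent first."""
    steps = [obj.produced_by]
    for parent in obj.inputs:
        steps.extend(lineage(parent))
    return steps


raw = DataObject("acgt")
upper = run_step("uppercase", str.upper, raw)
rev = run_step("reverse", lambda s: s[::-1], upper)
history = lineage(rev)
```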

SciWMSs for bioinformatics

Many workflow systems have been proposed for use in
bioinformatics: Taverna, Kepler, Triana, BioOpera, Mobyle,
BiosFlow, Bpipe
Some workflow features are also available in Galaxy

Support for workflow patterns
Scientific Data Modelling
Workflow representation and use

What is Galaxy?

Galaxy emerged in 2004/5 as a web interface to bioinformatics
tools and data
Galaxy is becoming a common platform through which to “publish”
tools and data
More than 30 known public Galaxy servers
36 000 users on main public Galaxy server, 0.8 PB of data

Galaxy as an open-source project

Galaxy consists of c. 250 000 lines of (mostly Python) code
Core team includes 15 developers spread across 4 different
institutes
Development is open source and “out in the open” with code
hosted on BitBucket, development planning on Trello and mailing
lists

Galaxy I

Galaxy II

Galaxy workflow management features

Galaxy allows composition of workflows defined as series of
tasks and related dataflow
Allows execution of workflows on local machine or via various job
schedulers
Data objects generated in Galaxy have associated provenance
information

Limitations of Galaxy as a SciWMS

Limited support for scientific workflow patterns
Type refers to format of data items
Provenance is recorded as attribute of data files

Workflows are not first class objects
Analysis view focuses on individual datasets
Execution engine schedules tasks (with limited support for task
collections)

Galaxy can be enriched by drawing on prior research on
SciWMSs

Scientific workflow patterns
Analysis of scientific workflows has yielded a set of design
patterns used in workflows (Yildiz et al., 2009)
Galaxy workflow language supports sequential dataflow, parallel
split and synchronisation
Tool definition language has recently been extended to support
multiple instances of task (not workflow) execution with a priori
runtime knowledge
Tool authors can signal that input to tool can be split for parallel
execution
No interface between workflow authors and multiple instance
support

Support for cancelling an individual task, but not an entire workflow
No support for triggering new thread of activity (restart)
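
The three patterns named above as supported (sequential dataflow, parallel split and synchronisation) can be sketched with the Python standard library. This is not Galaxy code; the step functions `clean`, `gc_content` and `length` are made-up examples.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(seq):
    """First step: normalise the raw input."""
    return seq.strip().upper()


def gc_content(seq):
    """Fraction of G and C characters in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def length(seq):
    return len(seq)


def run(seq):
    # Sequential dataflow: the output of `clean` feeds the later steps.
    cleaned = clean(seq)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Parallel split: two independent tasks consume the same output.
        gc_future = pool.submit(gc_content, cleaned)
        len_future = pool.submit(length, cleaned)
        # Synchronisation: wait for both branches before continuing.
        return {"gc": gc_future.result(), "length": len_future.result()}


summary = run("  acgtgg\n")
```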

Scientific workflow patterns (2)

No support for exclusive choice (e.g. execute different dataflow
path based on different input)
No support for sub-workflows
Galaxy workflow language is “abstraction hating” (Green and
Petre, 1996)
Leads to workflow diagrams resembling bowl of spaghetti for
anything but the most simple cases
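
For comparison, the two missing patterns are easy to express in a general-purpose language. The sketch below is a hypothetical illustration: `is_protein` is a deliberately crude check, and all function names are invented for this example.

```python
def is_protein(seq):
    """Crude illustrative test: any letter outside a nucleotide alphabet."""
    return any(ch not in "ACGTUN" for ch in seq.upper())


def nucleotide_branch(seq):
    return ("nucleotide", len(seq))


def protein_branch(seq):
    return ("protein", len(seq))


def classify(seq):
    """Exclusive choice: exactly one branch runs, selected by the input."""
    branch = protein_branch if is_protein(seq) else nucleotide_branch
    return branch(seq)


def classify_all(seqs):
    """A larger workflow that reuses `classify` as a sub-workflow."""
    return [classify(s) for s in seqs]


labels = classify_all(["ACGTN", "MKVLA"])
```

Wrapping `classify` in a named unit is the kind of abstraction that keeps a larger diagram readable, since the caller sees one node rather than the branch logic inside it.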

The Galaxy type system

Galaxy types represent file types
File type does not map simply to semantics
Collection types are not supported, although some types are
“splittable” to allow parallel task execution
Workflow parameters are not supported via type system

Cannot guarantee that workflow is well-formed
Provenance recording is coarse-grained
What will happen if we update single element of input data
collection?
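
The gap between file type and semantics can be made concrete with a toy well-formedness check. The `Step` class, the `check_chain` function and the type labels such as `fasta/protein` are hypothetical and are not Galaxy datatypes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    consumes: str  # type label expected on the input
    produces: str  # type label emitted on the output


def check_chain(steps):
    """Return the type mismatches between consecutive steps."""
    errors = []
    for upstream, downstream in zip(steps, steps[1:]):
        if upstream.produces != downstream.consumes:
            errors.append(
                f"{upstream.name} -> {downstream.name}: "
                f"{upstream.produces} is not {downstream.consumes}"
            )
    return errors


# Typed by file format only: the chain looks well-formed.
by_format = [Step("fetch", "none", "fasta"),
             Step("align", "fasta", "tabular")]

# Typed by meaning: the same chain feeds proteins to a nucleotide aligner.
by_meaning = [Step("fetch", "none", "fasta/protein"),
              Step("align", "fasta/nucleotide", "tabular/alignment")]

format_errors = check_chain(by_format)
semantic_errors = check_chain(by_meaning)
```

With format-only labels the check passes, while the semantic labels expose the mismatch before anything is executed.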

Science questions vs execution plans
Type system could model scientific domain objects (e.g. protein
and nucleotide sequences) but ...
Bioinformatics tools do not support standard formats or support
standard formats with quirks
Not clear what information to save from tool output
Experienced bioinformaticists want opportunity to review “raw
output” to explore factors that underpin confidence in analysis

Need to support both recording and reporting of workflow output
Both recording “raw” output trace and reporting provenance of
scientific domain objects are necessary features for SciWMS

Workflow execution in Galaxy
Internally workflows are expanded into collections of tasks at
execution time
Tasks are executed by backend classes: either local or via
scheduler
Execution parameters can be set by “dynamic job runners”
Allows e.g. resource requirements of job to be signalled to
scheduler
Configured using a combination of XML and Python code
maintained by Galaxy administrator

Workflow execution leaves no visible trace in the user interface
At runtime execution shows individual jobs running
Data objects are grouped by “history”, not associated with a
workflow

No support for re-execution of part of workflow
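
The idea behind the “dynamic job runners” mentioned above, a piece of Python that decides what to request from the scheduler, can be sketched as follows. This does not use Galaxy's actual interface: the `choose_resources` function, its thresholds, the queue names and the tool id `toy_aligner` are all assumptions made for illustration.

```python
def choose_resources(input_size_bytes, tool_id):
    """Hypothetical rule mapping a job's input size to a scheduler request."""
    gib = input_size_bytes / 2**30
    if gib < 1:
        request = {"queue": "short", "cores": 1, "memory_gb": 2}
    elif gib < 50:
        request = {"queue": "normal", "cores": 4, "memory_gb": 16}
    else:
        request = {"queue": "bigmem", "cores": 16, "memory_gb": 128}
    request["tool_id"] = tool_id
    return request


small = choose_resources(10 * 2**20, "toy_aligner")   # 10 MiB input
large = choose_resources(80 * 2**30, "toy_aligner")   # 80 GiB input
```

A rule like this sees one job at a time, which matches the point above that execution is planned per task rather than per workflow.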

Scope for workflow optimisation

Workflows are dataflow graphs (Johnston et al., 2004)
Knowledge of inputs and types can be used to plan execution
efficiently, e.g. pipeline tasks and exploit opportunities for
streaming
Collections of data objects and parameter sets can be exploited
for automatic parallel enactment of tasks and sub-workflows

Data collections and workflows provide structures for nesting of
provenance information
Knowledge of data provenance could facilitate lifecycle of data
products: kept for re-use or discarded as “intermediate products”
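
One of the optimisations mentioned above, pipelining tasks so that records stream through them, can be sketched with Python generators. The three stages below are invented examples; a workflow engine would have to know from the tools' declared inputs and types that such chaining is safe.

```python
def read_records(lines):
    """Stage 1: yield one cleaned record at a time."""
    for line in lines:
        yield line.strip()


def drop_short(records, min_length):
    """Stage 2: filter records as they arrive."""
    for rec in records:
        if len(rec) >= min_length:
            yield rec


def reverse_complement(records):
    """Stage 3: transform each surviving record."""
    table = str.maketrans("ACGT", "TGCA")
    for rec in records:
        yield rec.translate(table)[::-1]


raw = ["ACGT\n", "AC\n", "GGGTTT\n"]
# Stages are chained lazily: no intermediate collection is written out.
pipeline = reverse_complement(drop_short(read_records(raw), min_length=4))
streamed = list(pipeline)
```

Each record passes through all three stages before the next one is read, so the intermediate products never need to be stored as separate datasets.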

Conclusion
Bioinformatics faces an “informatics crisis” as cost to generate
sequence has decreased while cost to compose or reproduce
analysis has remained high
Galaxy has emerged as a popular interface to bioinformatics tools
and data with workflow management features
Insight from prior research on SciWMSs suggests areas for
enhancement:
Support for additional workflow patterns
Extension of type system with support for biological types,
collections and parameter sets
Improvement of workflow execution through treating workflows as
first class objects with associated optimisation of execution and
provenance storage

Currently being pursued as a research agenda at SANBI

Thanks

Workflows for biological sequence analysis are discussed
by the “Pipelines collaboration”
Research on SciWMS supported
by the MRC and Prof Christoffels
Professor Alan Christoffels

Bibliography I
R. Durbin. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, Apr. 1998. ISBN 9780521629713.
J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life
sciences. Genome Biology, 11(8):R86, 2010.
T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a ‘cognitive
dimensions’ framework. Journal of Visual Languages and Computing, 7:131–174, 1996.
W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages.
ACM Computing Surveys, 36(1):1–34, Mar. 2004.
T. McPhillips, S. Bowers, D. Zinn, and B. Ludäscher. Scientific workflow design for mere mortals.
Future Generation Computer Systems, 25(5):541–551, May 2009.
R. Stevens, C. Goble, P. Baker, and A. Brass. A classification of tasks in bioinformatics.
Bioinformatics, 17(2):180–188, Feb. 2001.
Taverna project. Why use workflows?, 2009. URL
http://www.taverna.org.uk/introduction/why-use-workflows/.

Bibliography II

K. Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program
(GSP), 2013. URL http://www.genome.gov/sequencingcosts/.
U. Yildiz, A. Guabtni, and A. H. H. Ngu. Towards scientific workflow patterns. In Proceedings of
the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages
13:1–13:10, New York, NY, USA, 2009. ACM.

Peter van Heusden and Alan Christoffels

Assessing Galaxy’s ability to express scientific workflows in bioinformatics

Más contenido relacionado

La actualidad más candente

Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Bertram Ludäscher
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!Ian Foster
 
Accelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceAccelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceIan Foster
 
Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Carole Goble
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksCarole Goble
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Keiichiro Ono
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsCarole Goble
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research ObjectsCarole Goble
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 
Sciunits: Reusable Research Objects
Sciunits: Reusable Research Objects Sciunits: Reusable Research Objects
Sciunits: Reusable Research Objects Globus
 
247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research WorkbenchStuart Chalk
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitData Con LA
 

La actualidad más candente (20)

Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 
Accelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceAccelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy Science
 
Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014Results may vary: Collaborations Workshop, Oxford 2014
Results may vary: Collaborations Workshop, Oxford 2014
 
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object FrameworksResults Vary: The Pragmatics of Reproducibility and Research Object Frameworks
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
RARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research ObjectsRARE and FAIR Science: Reproducibility and Research Objects
RARE and FAIR Science: Reproducibility and Research Objects
 
ROHub
ROHubROHub
ROHub
 
NETTAB 2013
NETTAB 2013NETTAB 2013
NETTAB 2013
 
FAIRy Stories
FAIRy StoriesFAIRy Stories
FAIRy Stories
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research Objects
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 
Sciunits: Reusable Research Objects
Sciunits: Reusable Research Objects Sciunits: Reusable Research Objects
Sciunits: Reusable Research Objects
 
247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
VariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn LangitVariantSpark a library for genomics by Lynn Langit
VariantSpark a library for genomics by Lynn Langit
 

Destacado

HE orientaciones
HE orientacionesHE orientaciones
HE orientacionesbeaochoa
 
Apps for Life! Transition and Independence
Apps for Life! Transition and IndependenceApps for Life! Transition and Independence
Apps for Life! Transition and IndependenceBridgingApps
 
Ap comparative brazil privacy
Ap comparative brazil privacyAp comparative brazil privacy
Ap comparative brazil privacyMariaElenaGB
 
Patron bolo
Patron boloPatron bolo
Patron boloalex-3w
 
Building a cluster filesystem using distributed, directly-attached storage
Building a cluster filesystem using distributed, directly-attached storageBuilding a cluster filesystem using distributed, directly-attached storage
Building a cluster filesystem using distributed, directly-attached storagePeter van Heusden
 
Human resource development
Human resource developmentHuman resource development
Human resource developmentZeinul Haleem
 
Management Information System
Management Information SystemManagement Information System
Management Information SystemZeinul Haleem
 

Destacado (11)

Deporte
DeporteDeporte
Deporte
 
Afiq
AfiqAfiq
Afiq
 
HE orientaciones
HE orientacionesHE orientaciones
HE orientaciones
 
Interview - Cinezik
Interview - Cinezik Interview - Cinezik
Interview - Cinezik
 
Prese p.p confi. libre
Prese p.p confi. librePrese p.p confi. libre
Prese p.p confi. libre
 
Apps for Life! Transition and Independence
Apps for Life! Transition and IndependenceApps for Life! Transition and Independence
Apps for Life! Transition and Independence
 
Ap comparative brazil privacy
Ap comparative brazil privacyAp comparative brazil privacy
Ap comparative brazil privacy
 
Patron bolo
Patron boloPatron bolo
Patron bolo
 
Building a cluster filesystem using distributed, directly-attached storage
Building a cluster filesystem using distributed, directly-attached storageBuilding a cluster filesystem using distributed, directly-attached storage
Building a cluster filesystem using distributed, directly-attached storage
 
Human resource development
Human resource developmentHuman resource development
Human resource development
 
Management Information System
Management Information SystemManagement Information System
Management Information System
 

Similar a Assessing Galaxy's ability to express scientific workflows in bioinformatics

The Electronic Notebook Ontology
The Electronic Notebook OntologyThe Electronic Notebook Ontology
The Electronic Notebook OntologyStuart Chalk
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Stuart Chalk
 
How Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open ScienceHow Bio Ontologies Enable Open Science
How Bio Ontologies Enable Open Sciencedrnigam
 
From peer-reviewed to peer-reproduced: a role for research objects in scholar...
From peer-reviewed to peer-reproduced: a role for research objects in scholar...From peer-reviewed to peer-reproduced: a role for research objects in scholar...
From peer-reviewed to peer-reproduced: a role for research objects in scholar...Alejandra Gonzalez-Beltran
 
Bioinformatics databases: Current Trends and Future Perspectives
Bioinformatics databases: Current Trends and Future PerspectivesBioinformatics databases: Current Trends and Future Perspectives
Bioinformatics databases: Current Trends and Future PerspectivesUniversity of Malaya
 
2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurghJun Zhao
 
Human Studies Database Project (demo)
Human Studies Database Project (demo)Human Studies Database Project (demo)
Human Studies Database Project (demo)Ida Sim
 
Explorations in bioinformatics
Explorations in bioinformaticsExplorations in bioinformatics
Explorations in bioinformaticsDouglas Joubert
 
Mtsr2015 goble-keynote
Mtsr2015 goble-keynoteMtsr2015 goble-keynote
Mtsr2015 goble-keynoteCarole Goble
 
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaStuart Chalk
 
The repository ecology: an approach to understanding repository and service i...
The repository ecology: an approach to understanding repository and service i...The repository ecology: an approach to understanding repository and service i...
The repository ecology: an approach to understanding repository and service i...R. John Robertson
 
The eCrystals Federation
The eCrystals FederationThe eCrystals Federation
The eCrystals FederationManjulaPatel
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxesOwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxesRokan Uddin Faruqui
 
Experiences with logic programming in bioinformatics
Experiences with logic programming in bioinformaticsExperiences with logic programming in bioinformatics
Experiences with logic programming in bioinformaticsChris Mungall
 
Venkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitVenkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitBOSC 2010
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and modelsmyGrid team
 
Scientific workflow-overview-2012-01-rev-2
Scientific workflow-overview-2012-01-rev-2Scientific workflow-overview-2012-01-rev-2
Scientific workflow-overview-2012-01-rev-2Terence Critchlow
 

Assessing Galaxy's ability to express scientific workflows in bioinformatics

  • 1. Outline: Introduction; Biological Sequence Analysis; Scientific workflow management systems; The Galaxy framework; Conclusion; Bibliography; References
    Assessing Galaxy’s ability to express scientific workflows in bioinformatics
    Peter van Heusden and Alan Christoffels
    South African National Bioinformatics Institute, University of the Western Cape, Bellville, South Africa
    10th FASTAR/Espresso Workshop 2013 / 4-6 November 2013
  • 2. What is bioinformatics?
    - Bioinformatics is the discipline of solving problems in biology and medicine using computational resources.
    - Within bioinformatics, biological sequence analysis (BSA) describes those analyses that “infer biological information from sequence alone” (Durbin, 1998).
    - Cost of biological sequence analysis has two parts:
      1. Cost of acquiring sequence
      2. Cost of analysing sequence
  • 3. Cost of acquiring sequence
    (Wetterstrand, 2013)
  • 4. Cost of analysing sequence
    - The “sudden reliance on computation has created an ‘informatics crisis’ for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging” (Goecks et al., 2010)
    - As cost of sequencing plummets, analysis faces two challenges:
      1. Growing data volume demands more sophisticated computational approaches
      2. Translating biological questions into computational workflows remains difficult
  • 5. How do we do bioinformatics?
    - Given a set of protein sequences from species A, which genes from species B produce similar proteins, and where are these genes located on the genome of B?
    - Analysis proceeds (Stevens et al., 2001) using:
      1. Collections of data objects
      2. Transformers that generate new collections (e.g. transform collection of proteins into collection of genome regions that they match)
      3. Filters (e.g. discard low quality matches to genome)
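The collection / transformer / filter decomposition on slide 5 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Match` record, the `TOY_HITS` lookup table standing in for a real alignment tool, and the 0.8 score threshold are hypothetical and do not come from the slides or from any bioinformatics library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass(frozen=True)
class Match:
    """Hypothetical record: a protein matched to a region of genome B."""
    protein_id: str
    chromosome: str
    start: int
    end: int
    score: float


def transform(items: Iterable[A], fn: Callable[[A], Iterable[B]]) -> List[B]:
    """Transformer: build a new collection from zero or more outputs per item."""
    return [out for item in items for out in fn(item)]


def keep(items: Iterable[A], predicate: Callable[[A], bool]) -> List[A]:
    """Filter: keep only the items that satisfy the predicate."""
    return [item for item in items if predicate(item)]


# Toy stand-in for an alignment tool (invented data, not real results).
TOY_HITS = {
    "protA1": [Match("protA1", "chr1", 100, 400, 0.95),
               Match("protA1", "chr2", 50, 90, 0.30)],
    "protA2": [Match("protA2", "chr3", 700, 1200, 0.88)],
}


def align_to_genome(protein_id: str) -> List[Match]:
    return TOY_HITS.get(protein_id, [])


proteins = ["protA1", "protA2", "protA3"]          # collection of data objects
matches = transform(proteins, align_to_genome)     # proteins -> genome regions
good = keep(matches, lambda m: m.score >= 0.8)     # discard low quality matches
```

Here `good` holds the two matches that pass the threshold, each carrying the location on genome B that the original question asks for.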
  • 6. How we do bioinformatics (2)
    - Data collections typically exist as (compressed) files
    - Bioinformatics tools typically are command line executables that accept and generate files (often using ad-hoc formats)
    - Scripting languages (Perl, Python) used to compose workflows, APIs often used for reading/writing file formats
      1. Workflow enactment often involves manual steps and is closely tied to execution environment
      2. Workflow is not easily reproducible nor reusable
  • 7. Scientific workflow management systems
    - Scientific workflow management systems (SciWMS) have been proposed as an alternative to current script-based approaches to analysis workflow.
    - SciWMSs “provide a high-level declarative way of specifying what a particular in silico experiment modelled by a workflow is set to achieve, not how it will be executed.” (Taverna project, 2009)
    - Workflow descriptions resemble dataflow languages (McPhillips et al., 2009)
  • 8. The promise of SciWMSs
    In addition to workflow specification, SciWMSs sometimes offer:
    - Types that model objects of scientific domain
    - Recording of provenance of data objects
    - Execution of scientific workflows on diverse computing environments (desktop, cluster, grid, cloud)
  • 9. SciWMSs for bioinformatics
    - Many workflow systems have been proposed for use in bioinformatics: Taverna, Kepler, Triana, Bioopera, Mobyle, BiosFlow, bpipe
    - Some workflow features are also available in Galaxy
  • 10. What is Galaxy
    (Subsections of “The Galaxy framework”: Support for workflow patterns; Scientific Data Modelling; Workflow representation and use)
    - Galaxy emerged in 2004/5 as a web interface to bioinformatics tools and data
    - Galaxy is becoming a common platform through which to “publish” tools and data
    - More than 30 known public Galaxy servers
    - 36 000 users on main public Galaxy server, 0.8 PB of data
  • 11. Galaxy as an open-source project
    - Galaxy consists of c. 250 000 lines of (mostly Python) code
    - Core team includes 15 developers spread across 4 different institutes
    - Development is open source and “out in the open” with code hosted on BitBucket, development planning on Trello and mailing lists
  • 12. Galaxy I
  • 13. Galaxy II
  • 14. Galaxy workflow management features
    - Galaxy allows composition of workflows defined as series of tasks and related dataflow
    - Allows execution of workflows on local machine or via various job schedulers
    - Data objects generated in Galaxy have associated provenance information
  • 15. Limitations of Galaxy as a SciWMS
    - Limited support for scientific workflow patterns
    - Type refers to format of data items
    - Provenance is recorded as attribute of data files
    - Workflows are not first class objects
    - Analysis view focuses on individual datasets
    - Execution engine schedules tasks (with limited support for task collections)
    - Galaxy can be enriched by drawing on prior research on SciWMSs
  • 16. Scientific workflow patterns
    - Analysis of scientific workflows has yielded a set of design patterns used in workflows (Yildiz et al., 2009)
    - Galaxy workflow language supports sequential dataflow, parallel split and synchronisation
    - Tool definition language has recently been extended to support multiple instances of task (not workflow) execution with a-priori runtime knowledge
    - Tool authors can signal that input to tool can be split for parallel execution
    - No interface between workflow authors and multiple instance support
    - Support for cancel of individual task but not entire workflow
    - No support for triggering new thread of activity (restart)
  • 17. Scientific workflow patterns (2)
    - No support for exclusive choice (e.g. execute different dataflow path based on different input)
    - No support for sub-workflows
    - Galaxy workflow language is “abstraction hating” (Green and Petre, 1996)
    - Leads to workflow diagrams resembling bowl of spaghetti for anything but the most simple cases
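The patterns named on slides 16 and 17 (parallel split, synchronisation, exclusive choice, sub-workflow) can be made concrete with a short Python sketch. This is only an illustration of the patterns themselves, written with the standard library; it is not Galaxy's workflow language or API, and the function names and the crude alphabet test used to pick a branch are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(seq: str) -> int:
    """Task 1: count G and C bases."""
    return sum(base in "GC" for base in seq)


def reverse_complement(seq: str) -> str:
    """Task 2: reverse-complement a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def nucleotide_subworkflow(seq: str) -> dict:
    """Sub-workflow: parallel split into two tasks, then synchronisation."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        gc_future = pool.submit(count_gc, seq)            # parallel split
        rc_future = pool.submit(reverse_complement, seq)
        # Synchronisation: both branches must finish before the result is built.
        return {"gc": gc_future.result(), "revcomp": rc_future.result()}


def analyse(seq: str) -> tuple:
    """Exclusive choice: exactly one dataflow path runs, chosen from the input."""
    if set(seq) <= set("ACGTN"):       # toy test; real type detection is harder
        return ("nucleotide", nucleotide_subworkflow(seq))
    return ("protein", {"length": len(seq)})


print(analyse("GATTACA"))
print(analyse("MKVL"))
```

Note that `analyse` reuses `nucleotide_subworkflow` as a single named step, which is the kind of abstraction the slides report as missing from the Galaxy workflow language.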
  • 18. The Galaxy type system
    - Galaxy types represent file types
    - File type does not map simply to semantics
    - Collection types are not supported, although some types are “splittable” to allow parallel task execution
    - Workflow parameters are not supported via type system
    - Cannot guarantee that workflow is well-formed
    - Provenance recording is coarse-grained
    - What will happen if we update single element of input data collection?
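To show what a well-formedness guarantee could mean, the following Python sketch checks every connection in a workflow against declared semantic types rather than file formats. It is a hypothetical illustration only: the `Tool` record, the type names and the three example tools are invented here and are not part of Galaxy.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool description with semantic input and output types."""
    name: str
    input_type: str
    output_type: str


def ill_typed_edges(tools: Dict[str, Tool],
                    edges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the connections whose producer output does not match the consumer input."""
    return [(src, dst) for src, dst in edges
            if tools[src].output_type != tools[dst].input_type]


tools = {
    "aligner": Tool("aligner", "ProteinSequence", "GenomeRegion"),
    "region_filter": Tool("region_filter", "GenomeRegion", "GenomeRegion"),
    "translator": Tool("translator", "NucleotideSequence", "ProteinSequence"),
}
edges = [("aligner", "region_filter"), ("region_filter", "translator")]

# An empty result would mean every connection is well-typed.
problems = ill_typed_edges(tools, edges)
```

In this toy workflow `problems` contains the single connection from `region_filter` to `translator`, which could be reported before any task is scheduled.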
  • 19. Science questions vs execution plans
    - Type system could model scientific domain objects (e.g. protein and nucleotide sequences) but . . .
    - Bioinformatics tools do not support standard formats or support standard formats with quirks
    - Not clear what information to save from tool output
    - Experienced bioinformaticists want opportunity to review “raw output” to explore factors that underpin confidence in analysis
    - Need to support both recording and reporting of workflow output
    - Both recording “raw” output trace and reporting provenance of scientific domain objects are necessary features for SciWMS
  • 20. Workflow execution in Galaxy
    - Internally workflows are expanded into collections of tasks at execution time
    - Tasks are executed by backend classes: either local or via scheduler
    - Execution parameters can be set by “dynamic job runners”
    - Allows e.g. resource requirements of job to be signalled to scheduler
    - Configured using a combination of XML and Python code maintained by Galaxy administrator
    - Workflow execution leaves no visible trace in the user interface
    - At runtime execution shows individual jobs running
    - Data objects are grouped by “history”, not associated with a workflow
    - No support for re-execution of part of workflow
  • 21. Scope for workflow optimisation
    - Workflows are dataflow graphs (Johnston et al., 2004)
    - Knowledge of inputs and types can be used to plan execution efficiently, e.g. pipeline tasks and exploit opportunities for streaming
    - Collection of data objects and parameters sets can be exploited for automatic parallel enactment of tasks and sub-workflows
    - Data collections and workflows provide structures for nesting of provenance information
    - Knowledge of data provenance could facilitate lifecycle of data products: kept for re-use or discarded as “intermediate products”
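Treating a workflow as a dataflow graph, as slide 21 suggests, makes simple planning possible. The Python sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to group tasks into waves that could run in parallel and to list the data products that are only consumed by later steps. The workflow dictionary and the helper names are assumptions for illustration, not Galaxy internals.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow: Dict[str, Set[str]] = {
    "proteins": set(),
    "genome": set(),
    "align": {"proteins", "genome"},
    "filter": {"align"},
    "report": {"filter"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group steps into waves; steps within a wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)               # mark the whole wave as finished
    return waves


def intermediate_products(graph: Dict[str, Set[str]]) -> List[str]:
    """Steps whose output feeds a later step and that are not workflow inputs."""
    consumed = set().union(*graph.values())
    inputs = {step for step, deps in graph.items() if not deps}
    return sorted(consumed - inputs)


print(execution_waves(workflow))
print(intermediate_products(workflow))
```

The first wave contains both inputs, so they could be staged concurrently, while the outputs of `align` and `filter` are candidates for being discarded as intermediate products once `report` has run.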
  • 22. Conclusion
    - Bioinformatics faces an “informatics crisis” as cost to generate sequence has decreased while cost to compose or reproduce analysis has remained high
    - Galaxy has emerged as a popular interface to bioinformatics tools and data with workflow management features
    - Insight from prior research on SciWMSs suggests areas for enhancement:
      - Support for additional workflow patterns
      - Extension of type system with support for biological types, collections and parameter sets
      - Improvement of workflow execution through treating workflows as first class objects with associated optimisation of execution and provenance storage
    - Currently being pursued as a research agenda at SANBI
  • 23. Thanks
    - Workflows for biological sequence analysis are discussed by the “Pipelines collaboration”
    - Research on SciWMS supported by the MRC and Prof Alan Christoffels
  • 24. Bibliography I
    R. Durbin. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Apr. 1998. ISBN 9780521629713.
    J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8), 2010.
    T. R. G. Green and M. Petre. Usability analysis of visual programming environments: a ‘cognitive dimensions’ framework. Journal of Visual Languages and Computing, 7:131–174, 1996.
    W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages. ACM Computing Surveys, 36(1):1–34, Mar. 2004.
    T. McPhillips, S. Bowers, D. Zinn, and B. Ludäscher. Scientific workflow design for mere mortals. Future Generation Computer Systems, 25(5):541–551, May 2009.
    R. Stevens, C. Goble, P. Baker, and A. Brass. A classification of tasks in bioinformatics. Bioinformatics, 17(2):180–188, Feb. 2001.
    Taverna project. Why use workflows?, 2009. URL http://www.taverna.org.uk/introduction/why-use-workflows/.
  • 25. Bibliography II
    K. Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program (GSP), 2013. URL http://www.genome.gov/sequencingcosts/.
    U. Yildiz, A. Guabtni, and A. H. H. Ngu. Towards scientific workflow patterns. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS ’09, pages 13:1–13:10, New York, NY, USA, 2009. ACM.