The document describes scientific workflows for big data and the challenges they present. It discusses Prof. Shiyong Lu's work on developing the VIEW system for designing, executing, and analyzing scientific workflows. The VIEW system provides a runtime environment for workflows, supports their execution on servers or clouds, and enables efficient storage, querying and visualization of workflow provenance data.
1. Scientific Workflows for Big Data
Prof. Shiyong Lu
Big Data Research Laboratory
Department of Computer Science
Wayne State University
shiyong@wayne.edu
3. Big Data Challenges
Looking for a needle in a haystack
For Big Data, data management and movement is a frequent challenge
…between facilities, archives, researchers…
Many files, large data volumes
With security, reliability, performance…
Ian Foster: Father of Grid Computing
4. Big Data Challenges
Looking for a needle in a haystack
Capture
Curation
Storage
Search
Sharing
Analysis
Visualization
5. Big Data Science
Large Hadron Collider (LHC)
15 PB/year
173 TB/day
500 MB/sec
Higgs discovery is “only
possible because of the
extraordinary
achievements of … grid
computing”
—Rolf Heuer, CERN DG
6. Data flows at Argonne National Lab
Data management challenges
[Figure: estimated Argonne data flows in TB/day among external data sources, the Advanced Photon Source, the Argonne Leadership Computing Facility, data analysis, short-term storage, and long-term storage. Credit: Ian Foster]
7. Big Data demands new CS research
For example, existing clustering algorithms are typically cubic in N, and
when N is too big, they do not work! - Jim Gray
8. What is Big Data?
•Definition of Big Data:
“…refers to large, diverse, complex, longitudinal, and/or
distributed data sets generated from instruments, sensors,
Internet transactions, email, video, click streams, and/or
all other digital sources available today and in the future.”
from nsf.gov website
9. Big Data Challenges
•Challenges of Big Data:
“national big data challenges, which include advances in core
techniques and technologies; big data infrastructure projects in
various science, biomedical research, health and engineering
communities; education and workforce development; and a
comprehensive integrative program to support collaborations of
multi-disciplinary teams and communities to make advances in the
complex grand challenge science, biomedical research, and
engineering problems of a computational- and data-intensive world.”
from nsf.gov website
13. Introduction
Data Intensive Science
From computation intensive to data intensive.
A new research cycle – from data capture and data
curation to data analysis and data visualization.
“In the future, the rapidity with which any given
discipline advances is likely to depend on how well
the community acquires the necessary expertise in
database, workflow management, visualization,
and cloud computing technologies.” (“Beyond
the Data Deluge”, Science, Vol. 323. no. 5919, pp.
1297 – 1298, 2009.)
14. Introduction
Scientific Workflow
A formal specification of a scientific
process.
Represents, streamlines, and
automates the steps from dataset
selection and integration,
computation and analysis, to final
data product presentation and
visualization.
Applications: Bioinformatics,
Oceanography, Neuroinformatics,
Astronomy, etc.
15. Introduction
Scientific Workflow Management System
(SWFMS)
Supports the specification, modification, execution,
failure handling, and monitoring of a scientific
workflow.
Existing SWFMSs:
• Taverna
• Kepler
• Pegasus
• VisTrails
• VIEW
• …
22. Our VIEW System
Enables scientists to design workflows
Provides a runtime system to execute workflows
on a dedicated VIEW server
in a cloud computing environment
Supports efficient collection, storage,
querying, and visualization of workflow
provenance
Is currently used in several bioinformatics
applications, including genomic recombination
and gene conversion data analysis
29. An Example Workflow in VIEW
FiberFlow
Transforms large-scale neuroimaging data into knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical
intelligence in neural diseases.
30. VIEW: A Prototypical SWFMS
Minimal complexity for users, with sophisticated
techniques behind the scenes.
To provide a clear and simple abstraction for manipulating
and coordinating resources
Service-oriented architecture.
Intuitive, user-friendly GUI
33. A Reference Architecture for SWFMSs
Other advantages of VIEW:
VIEW workflows can be executed in other
systems (specifications are not tied to a particular
SWFMS)
Use of open standards (Web Services, XML)
promotes collaboration, interoperability and
extensibility of the system
Workflow and data models implemented in VIEW
are specifically geared towards heavy scientific
data
36. Workflow Engine
Workflow Engine is the heart of the
system.
Workflow Orchestration.
Workflow Execution.
Coordination of other subsystems.
Workflow Engine in VIEW.
Dataflow based.
Pure workflow composition.
Workflow constructs.
37. SWL
Example of our proposed scientific workflow
specification language (SWL).
39. Workflow Execution
Workflow Execution
Primitive workflow
Unary construct based workflow
Graph based workflow
• A workflow graph is a composition of workflows by
binary constructs.
• Optimistic scheduling.
41. Data Product Manager
Data Product Manager
Solid data model.
Scalable data storage.
Convenient data access.
Data Independence.
Data Product Manager is based on the
collectional data model.
42. DPM Architecture
Architecture of the Data Product Manager.
[Figure: the Data Product Manager, organized in three layers.
Data Access Layer: a main server (master) in front of several node databases.
Data Mapping Layer.
Data Storage Layer: data sets (Data Set 1, Data Set 2), each backed by relational databases and file repositories.]
43. DPL
Example of the XML description of a
collectional data product.
44. Data Storage
VIEW supports two ways of storage:
A collection can be stored in a table containing a
set of its key/value pairs, whose values are
references to existing collections.
A collection can be expanded and stored in two
tables.
• The Group By operator.
• The Compress operator.
45. Data Typing
A Data Product is
a Collection,
or a List,
or Empty.
The List type
Introduced in the workflow engine.
Each element is a data product.
Heterogeneous.
46. Collectional Data Querying
Operators are implemented in primitive
workflows.
Arithmetic operators.
Boolean operators.
Collectional operators.
List operators.
Queries are implemented in workflow
compositions.
47. Example
Given a table Reference < Student, Company,
GradTime >, find the number of distinct
students who received offers from each company in each
graduation year; sort the result by GradTime
in descending order, then by Company in ascending order.
SQL query:
SELECT Company, GradTime, COUNT(DISTINCT Student)
AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC;
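As a hedged illustration, the query above can be run against an in-memory SQLite database from Python. The student names, companies and years below are hypothetical sample rows, not data from the talk, and SQLite stands in for whatever database VIEW uses.

```python
import sqlite3

# Hypothetical sample rows (Student, Company, GradTime); not from the slides.
rows = [
    ("Alice", "Acme", 2012),
    ("Bob", "Acme", 2012),
    ("Alice", "Globex", 2012),
    ("Carol", "Acme", 2013),
    ("Carol", "Acme", 2013),  # duplicate row: counted once by COUNT(DISTINCT ...)
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Reference (Student TEXT, Company TEXT, GradTime INTEGER)"
)
connection.executemany("INSERT INTO Reference VALUES (?, ?, ?)", rows)

query = """
SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC
"""
result = connection.execute(query).fetchall()
for company, grad_time, number_of_job in result:
    print(company, grad_time, number_of_job)
```

With these sample rows the most recent graduation year is listed first, and companies are ordered alphabetically within each year.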
49. Key Requirements for Workflow Modeling
R1: Programming-in-the-large.
R2: Dataflow programming model.
R3: Composable dataflow constructs.
R4: Workflow encapsulation and
hierarchical composition.
R5: Single-assignment property.
R6: Physical and logical data models.
R7: Exception handling.
50. A Scientific Workflow Model
Workflows are the basic and the only
operands for workflow composition.
[Figure: workflows W1 and W2, each with input ports i1…ik and output port o1, composed into a workflow W3.]
Task components (e.g. Web services)
are wrapped as primitive workflows
(a.k.a. tasks), which are the basic
building blocks of scientific workflows.
51. A Scientific Workflow Model
A workflow construct is a mapping
from a set of workflows to a workflow.
Unary workflow constructs
Binary workflow constructs
…
A construct C takes a set of workflows W1, ...., Wn as input,
and composes them into Wc as the output workflow.
52. A Scientific Workflow Model
Our proposed scientific workflow model
consists of the following two layers:
The logical layer contains the workflow interface that
models the input ports and output ports of a workflow.
The physical layer contains the workflow body that models
the physical implementation of the workflow.
• Primitive workflows.
• Graph-based workflows.
• Unary-construct-based workflows.
54. The Map Construct
The Map construct enables the parallel
processing of a collection of data products
based on a workflow that can only process a
single data product.
Example:
W2 = M(W1) takes the collection [[1,2],[3,6],[4,7]] and runs W1 on each element in parallel:
[1,2] → W1 → 2
[3,6] → W1 → 18
[4,7] → W1 → 28
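The Map construct can be sketched as a higher-order function that lifts a single-product workflow into one that handles a whole collection. This is a minimal sketch, not VIEW's implementation: the names map_construct, w1 and w2 are hypothetical, workflows are modelled as plain Python callables, and W1 is assumed to multiply the two elements of its input, which is consistent with the outputs 2, 18 and 28 in the slide example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def map_construct(workflow: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """Lift a single-product workflow into one that processes a collection."""

    def mapped(products: List[A]) -> List[B]:
        # Each element is an independent run of the workflow, so the runs can
        # execute in parallel; executor.map returns results in input order.
        with ThreadPoolExecutor() as executor:
            return list(executor.map(workflow, products))

    return mapped


def w1(pair: List[int]) -> int:
    # Assumed behaviour of W1: multiply the two elements of the input.
    return pair[0] * pair[1]


w2 = map_construct(w1)
print(w2([[1, 2], [3, 6], [4, 7]]))  # [2, 18, 28]
```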
55. The Reduce Construct
The Reduce construct enables the aggregation
of a list of data products to a single data
product based on a workflow that aggregates
a limited (two or more) number of input data
products.
Example:
W3 = R(Add) with initial value 0, applied to the list [3,5,9]:
Add(0, 3) = 3
Add(3, 5) = 8
Add(8, 9) = 17
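A sequential fold is one way to read the Reduce construct. The sketch below assumes a two-input workflow and an explicit initial value, as in the slide example; reduce_construct, add and w3 are hypothetical names, and workflows are modelled as plain Python callables.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def reduce_construct(
    workflow: Callable[[T, T], T], initial: T
) -> Callable[[Iterable[T]], T]:
    """Turn a two-input workflow into one that aggregates a whole list."""

    def reduced(products: Iterable[T]) -> T:
        accumulator = initial
        for product in products:
            # The previous output is fed into the first input port and the
            # next list element into the second.
            accumulator = workflow(accumulator, product)
        return accumulator

    return reduced


def add(x: int, y: int) -> int:
    return x + y


w3 = reduce_construct(add, 0)
print(w3([3, 5, 9]))  # 17
```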
56. The Tree Construct
The Tree construct
Enables parallel aggregation of a collection of data products.
Aggregates a collection pairwise, as a binary tree, until a
single aggregated product is generated.
The Tree construct can be applied to
associative workflows.
Example:
W4 = T(Add), applied to the collection [0,3,5,9]:
Add(0, 3) = 3 and Add(5, 9) = 14 (independent, can run in parallel)
Add(3, 14) = 17
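The pairwise aggregation can be sketched level by level. This is an illustrative reading, not VIEW's scheduler: tree_construct is a hypothetical name, the pairs within a level are evaluated sequentially here although they are independent, and an unpaired last element is assumed to be carried to the next level unchanged.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def tree_construct(workflow: Callable[[T, T], T]) -> Callable[[Sequence[T]], T]:
    """Aggregate a collection pairwise, as a binary tree."""

    def aggregated(products: Sequence[T]) -> T:
        level: List[T] = list(products)
        if not level:
            raise ValueError("Tree needs at least one data product")
        while len(level) > 1:
            # Pairs within one level do not depend on each other, so a real
            # engine could run them in parallel.
            next_level = [
                workflow(level[i], level[i + 1])
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2 == 1:
                next_level.append(level[-1])  # odd element moves up as-is
            level = next_level
        return level[0]

    return aggregated


def add(x: int, y: int) -> int:
    return x + y


w4 = tree_construct(add)
print(w4([0, 3, 5, 9]))  # 17
```

Because elements are regrouped but never reordered, the result equals a left-to-right fold only when the workflow is associative, which is why the slide restricts Tree to associative workflows.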
57. The Conditional Construct
The Conditional construct enables the
conditional execution of a workflow based on a
condition on one of the inputs.
Example:
[Figure: the Conditional construct (C) applied to a Projection workflow, yielding W4, with input [2,3].
With predicate p = (PI 1 < PI 2), p = true and Projection is executed.
With predicate p = (PI 1 >= PI 2), p = false and the workflow fails.]
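One possible reading of the Conditional construct is a guard around the wrapped workflow. The sketch below rests on several assumptions that the slide does not confirm: the predicate is a function of the workflow inputs, a false predicate raises an error (the slide only shows "Fail"), Projection is 1-based, and the names conditional_construct, WorkflowFailure and projection are hypothetical.

```python
from typing import Any, Callable, Sequence


class WorkflowFailure(Exception):
    """Raised when the guarding predicate evaluates to false (assumed semantics)."""


def conditional_construct(
    workflow: Callable[..., Any], predicate: Callable[..., bool]
) -> Callable[..., Any]:
    """Run the workflow only if the predicate holds on its inputs."""

    def guarded(*inputs: Any) -> Any:
        if not predicate(*inputs):
            raise WorkflowFailure("predicate evaluated to false")
        return workflow(*inputs)

    return guarded


def projection(items: Sequence[Any], index: int) -> Any:
    # Assumed 1-based projection: index 1 selects the first element.
    return items[index - 1]


# Guard: the first element of the list must be smaller than the second.
guarded_projection = conditional_construct(
    projection, lambda items, index: items[0] < items[1]
)
print(guarded_projection([2, 3], 2))  # 3
```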
58. The Loop Construct
The Loop construct enables cyclic executions
of a workflow.
The output of the workflow will be repetitively
returned (fed back) to a specified input port
until the predicate evaluates to true.
Example:
L(Add) with predicate p = (PI 1 > 100) and initial inputs 0 and 1; the output is fed back to input port i1:
Add(0, 1) = 1, p = false
Add(1, 1) = 2, p = false
...
the value 101 is reached, p = true
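The feedback behaviour can be sketched as a loop that rewrites one input port with the previous output. This is a sketch under assumptions: the predicate is tested on the workflow output after each run, the fed-back port is chosen by a zero-based index, and loop_construct and feedback_port are hypothetical names.

```python
from typing import Any, Callable


def loop_construct(
    workflow: Callable[..., Any],
    predicate: Callable[[Any], bool],
    feedback_port: int,
) -> Callable[..., Any]:
    """Re-run a workflow, feeding its output back, until the predicate is true."""

    def looped(*inputs: Any) -> Any:
        arguments = list(inputs)
        while True:
            output = workflow(*arguments)
            if predicate(output):
                return output
            # Feed the output back into the chosen input port and run again.
            arguments[feedback_port] = output

    return looped


def add(x: int, y: int) -> int:
    return x + y


count_past_100 = loop_construct(add, lambda value: value > 100, feedback_port=0)
print(count_past_100(0, 1))  # 101
```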
59. The Curry Construct
The Curry construct allows users to fix one of the input
ports with a specified argument and thus reduce the
number of input ports.
By applying multiple Curry constructs, a workflow that
takes multiple arguments can be translated into a chain
of workflows each with a single argument.
Example:
W8 = U(Add) with one input port fixed to the argument 4.
Given the remaining input 1, W8 runs Add on 1 and 4 and outputs 5.
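Fixing an input port resembles partial application. In the minimal sketch below, curry_construct is a hypothetical name, ports are identified by a zero-based position, and which port the slide fixes is an assumption; subtraction is included only to show that the choice of port matters.

```python
from typing import Any, Callable


def curry_construct(
    workflow: Callable[..., Any], port: int, argument: Any
) -> Callable[..., Any]:
    """Fix one input port of a workflow, returning a workflow with one port fewer."""

    def curried(*remaining: Any) -> Any:
        arguments = list(remaining)
        arguments.insert(port, argument)  # put the fixed argument back in place
        return workflow(*arguments)

    return curried


def add(x: int, y: int) -> int:
    return x + y


def subtract(x: int, y: int) -> int:
    return x - y


w8 = curry_construct(add, port=1, argument=4)
print(w8(1))  # 5

# Applying the construct repeatedly leaves a workflow with no free ports.
constant = curry_construct(curry_construct(subtract, port=0, argument=10), 0, 3)
print(constant())  # 7
```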
60. Workflow Composition
Example of the composition of Map and Map
constructs.
A workflow that increases all the numbers in a nested list
by 1.
(a) W9 = M(M(Add)), with inputs [[1,2,3],[4,5,6]] and 1:
[1,2,3] → Add(1, 1) = 2, Add(2, 1) = 3, Add(3, 1) = 4
[4,5,6] → Add(4, 1) = 5, Add(5, 1) = 6, Add(6, 1) = 7
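Because a construct returns a workflow, Map can be applied to its own result. The sketch below is illustrative only: map_construct and fix_second_input are hypothetical helpers, the constant 1 is fixed with a Curry-like helper rather than passed as a second input as in the slide, and the runs are evaluated sequentially for brevity.

```python
from typing import Any, Callable, List


def map_construct(workflow: Callable[[Any], Any]) -> Callable[[List[Any]], List[Any]]:
    """Lift a workflow on one data product to a workflow on a list of them."""

    def mapped(products: List[Any]) -> List[Any]:
        # Runs are independent; an engine could execute them in parallel.
        return [workflow(product) for product in products]

    return mapped


def fix_second_input(workflow: Callable[[Any, Any], Any], argument: Any) -> Callable[[Any], Any]:
    """Curry-like helper: fix the second input port to a constant."""
    return lambda first: workflow(first, argument)


def add(x: int, y: int) -> int:
    return x + y


increment = fix_second_input(add, 1)
# Map applied twice handles one extra level of nesting per application.
w9 = map_construct(map_construct(increment))
print(w9([[1, 2, 3], [4, 5, 6]]))  # [[2, 3, 4], [5, 6, 7]]
```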
61. Workflow Composition
Example of the composition of Map and Reduce
constructs.
A workflow for parallel summation of each row in a
matrix.
W11 = M(R(Add)) with initial value 0, applied to [[1,2,3],[4,5,6]]:
[1,2,3] → Addition(0, 1) = 1, Addition(1, 2) = 3, Addition(3, 3) = 6
[4,5,6] → Addition(0, 4) = 4, Addition(4, 5) = 9, Addition(9, 6) = 15
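The row-summation workflow can be sketched by mapping a reduced Add over the rows. The helper names below are hypothetical, workflows are plain Python callables, and rows are processed sequentially here although each row is an independent run.

```python
from functools import reduce
from typing import Any, Callable, List


def reduce_construct(workflow: Callable[[Any, Any], Any], initial: Any) -> Callable[[List[Any]], Any]:
    """Aggregate a list with a two-input workflow, starting from an initial value."""
    return lambda products: reduce(workflow, products, initial)


def map_construct(workflow: Callable[[Any], Any]) -> Callable[[List[Any]], List[Any]]:
    """Apply a workflow to every element of a list."""
    return lambda products: [workflow(product) for product in products]


def add(x: int, y: int) -> int:
    return x + y


# Reduce sums one row; Map applies that row-level workflow to every row.
w11 = map_construct(reduce_construct(add, 0))
print(w11([[1, 2, 3], [4, 5, 6]]))  # [6, 15]
```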
62. Workflow Composition
Example of complicated workflow composition.
A workflow to calculate the greatest common divisor.
[Figure: workflow W17, built from intermediate workflows W13 to W16. It combines the Loop construct (L) with predicate p = (PI(2) == 0), the Map (M) and Curry (U) constructs, and the components Split, Modulus, Merge, Projection and G2W.]
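The idea behind the greatest-common-divisor workflow can be sketched with Euclid's algorithm: a loop over a pair that stops when the second element is 0, mirroring the predicate p = (PI(2) == 0). This does not reproduce the exact wiring of the slide; loop_until, modulus_step and gcd_workflow are hypothetical names, the predicate is assumed to be tested before each iteration, and Projection is assumed to be 1-based.

```python
from typing import Callable, Tuple

Pair = Tuple[int, int]


def loop_until(step: Callable[[Pair], Pair], predicate: Callable[[Pair], bool]) -> Callable[[Pair], Pair]:
    """Feed the output of `step` back as its input until the predicate holds."""

    def looped(state: Pair) -> Pair:
        while not predicate(state):
            state = step(state)
        return state

    return looped


def projection(items: Pair, index: int) -> int:
    # Assumed 1-based projection.
    return items[index - 1]


def modulus_step(pair: Pair) -> Pair:
    # One Euclid step: (a, b) becomes (b, a mod b).
    a, b = pair
    return (b, a % b)


def gcd_workflow(a: int, b: int) -> int:
    final = loop_until(modulus_step, lambda pair: projection(pair, 2) == 0)((a, b))
    return projection(final, 1)


print(gcd_workflow(48, 18))  # 6
```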
63. A Collectional Data Model
A collectional data model
Support collection oriented datasets.
• Scientists often work with collection oriented datasets,
such as arrays, lists, tables or file collections.
• A collection-oriented data model enables data
parallelism in scientific workflows.
Support nested data structures.
• Scientific data is often hierarchically organized.
• Scientific workflow tasks often produce collections of
data products, and the execution of a workflow
composed from such tasks can create increasingly
nested data collections.
Provide well-defined operators and their arbitrary
compositions to manipulate and query scientific data
collections.
64. A Collectional Data Model
A relation is a pair < R, r > where R is a
schema of the relation and r is an instance of
that schema.
A relation schema can be defined as an
unordered tuple < c1 : d1, c2 : d2, …, cn : dn >
where c1, c2, …, cn are column names and d1,
d2, …, dn are domain names.
A relation instance is a table with rows
(called tuples) and named columns (called
attributes).
65. A Collectional Data Model
A collection schema is a pair < K, V >.
K, the key, is a pair k : d where k is the key name and d is
the domain name.
V, the value, is either a relation schema or a collection
schema.
A collection instance is a set of key-value
pairs (pi, qi) (i∈ {1,…,m}).
Each pi is a scalar value.
Each qi is either a relation instance or a collection instance.
66. A Collectional Data Model
An example:
Parameters< Model : String, Experiments :
Integer, <Concentration : Double, Degree :
Integer >>.
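The schema definitions can be written down as a small recursive data structure. The sketch below encodes the Parameters example; the class names RelationSchema and CollectionSchema and the helper key_names are hypothetical, and the instance values are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class RelationSchema:
    # Column name -> domain name, e.g. {"Degree": "Integer"}.
    columns: Dict[str, str]


@dataclass(frozen=True)
class CollectionSchema:
    key_name: str
    key_domain: str
    # The value is either a relation schema or a nested collection schema.
    value: Union[RelationSchema, "CollectionSchema"]


def key_names(schema: Union[RelationSchema, CollectionSchema]) -> List[str]:
    """List the key names from the outermost collection inwards."""
    if isinstance(schema, RelationSchema):
        return []
    return [schema.key_name] + key_names(schema.value)


parameters = CollectionSchema(
    "Model",
    "String",
    CollectionSchema(
        "Experiments",
        "Integer",
        RelationSchema({"Concentration": "Double", "Degree": "Integer"}),
    ),
)
print(key_names(parameters))  # ['Model', 'Experiments']

# A matching instance as nested key-value pairs (illustrative values):
# model -> experiment -> list of (Concentration, Degree) tuples.
instance = {"m2": {1: [(7.1, 15)]}}
```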
67. The Collectional Operators
We extend the relational operators to
collectional operators, whose only operands
are collections.
Six primitive operators: union, set difference,
selection, projection, Cartesian product and
renaming.
The set of the collections is closed under those
operators.
A relation can be defined as a collection whose
height and cardinality are equal to 1. The
collectional operators will then reduce to the
relational operators.
68. The Collectional Operators
The union and the set difference operators can
only be applied on union-compatible
collections.
Two union-compatible collections (Model → Result):
First collection: m1 → 26, m2 → 32
Second collection: m2 → 32, m3 → 31
69. The Collectional Operators
Example of the union operator and the set
difference operator.
Union (Model → Result): m1 → 26, m2 → 32, m3 → 31
Set difference (Model → Result): m1 → 26
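The two operators can be sketched on collections modelled as nested dictionaries whose leaves are sets of tuples (relation instances). The semantics below are an assumption chosen to agree with the slide example, not VIEW's definition: union merges keys and unites values recursively, set difference subtracts recursively and drops keys whose value becomes empty. Operands are assumed to be union-compatible.

```python
from typing import Any, Dict

Collection = Dict[Any, Any]  # key -> nested collection (dict) or relation (set of tuples)


def union(left: Collection, right: Collection) -> Collection:
    result = dict(left)
    for key, value in right.items():
        if key not in result:
            result[key] = value
        elif isinstance(value, dict):
            result[key] = union(result[key], value)  # nested collections
        else:
            result[key] = result[key] | value  # relation instances
    return result


def difference(left: Collection, right: Collection) -> Collection:
    result: Collection = {}
    for key, value in left.items():
        if key not in right:
            result[key] = value
            continue
        if isinstance(value, dict):
            rest = difference(value, right[key])
        else:
            rest = value - right[key]
        if rest:  # keep the key only if something is left under it
            result[key] = rest
    return result


first = {"m1": {(26,)}, "m2": {(32,)}}
second = {"m2": {(32,)}, "m3": {(31,)}}
print(union(first, second))       # keys m1, m2 and m3
print(difference(first, second))  # key m1 only
```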
70. The Collectional Operators
Example of the Cartesian product Operator
and the Renaming Operator.
[Figure: Cartesian product of two collections renamed M1 and M2. The result is keyed by M1.model and M2.model, and each innermost relation has columns M1.Result and M2.Result; the rows shown are (26, 32), (26, 31), (32, 32) and (32, 31).]
71. The Collectional Operators
Example of the selection operator.
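[Figure: result of a selection on the Parameters collection: Model m2, Experiment 1, with a relation over the columns Concentration and Degree containing rows such as (7.1, 15).]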
72. The Collectional Operators
Example of the projection operator.
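[Figure: result of a projection on the Parameters collection: a collection keyed by Experiment (1, 2, ...), where each experiment maps to a relation over the columns Concentration and Degree; visible rows include (7.0, 15), (7.1, 15), (7.0, 30) and (7.1, 30).]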
73. Key Features of VIEW
F1: VIEW features the first uniform workflow
model, in which workflows are the only
building blocks. In VIEW, tasks are primitive
workflows, and no workflow construct
distinguishes workflows from tasks. Such a
model greatly simplifies workflow design: a
workflow designer only needs to
compose complex workflows from simpler
ones, without first encapsulating
workflows into tasks or vice versa during the
composition process.
74. F2: VIEW offers powerful workflow
composition: workflow constructs compose fully
with one another, to arbitrary levels of nesting.
VIEW workflows are therefore often more concise
and more efficient to execute than their
counterparts in other workflow systems, where
they can be hard to model.
75. F3: VIEW features a pure dataflow-based
workflow language SWL, including the
dataflow counterparts of controlflow-style
constructs, such as conditional and loop.
Existing workflow languages often require
both controlflow and dataflow constructs,
resulting in complex or even obscure
semantics and non-trivial workflow design.
76. F4: VIEW supports the cloud MapReduce
programming model not only at the job level,
but also at the workflow level. Therefore, one
can apply the Map and Reduce constructs to
an arbitrary workflow an arbitrary number
of times. As a result, VIEW can process
nested lists of data products in parallel using
multiple runs of a workflow.
77. F5: VIEW features a collectional data model
that supports not only traditional primitive
data types, such as integer, float, double,
boolean, char, string, but also files, relations,
hierarchical collections (hierarchical key-value
pairs) to support parallel processing of data
collections.
78. F6: VIEW supports a high-level graph-based provenance query language,
OPQL. In most cases, users can
formulate lineage queries easily without
writing recursive queries or
knowing the underlying database
schema.
79. F7: VIEW features the first service-oriented
architecture that conforms to the reference
architecture for scientific workflow
management systems (SWFMSs). This
architecture greatly facilitates interoperability
and subsystem reusability in the community.
This architecture also provides a generic
infrastructure upon which a domain-specific
scientific workflow application system
(SWFAS) can be easily developed with custom
interfaces for various platforms and devices.
80. Conclusions and Future Work
A scientific workflow composition model.
A collectional data model.
A prototypical SWFMS.
Future work:
Formalization of the scientific workflow algebra and
collectional algebra.
• Completeness.
• Integration.
Collaborative scientific workflow composition.
• Concurrent design and composition.
• Concurrent execution.