An Overview of VIEW

Scientific Workflows for Big Data
Prof. Shiyong Lu
Big Data Research Laboratory
Department of Computer Science
Wayne State University
shiyong@wayne.edu

Today’s data-intensive science
Looking for needle
in haystack

Looking into
haystack

Jim Gray: Turing Award laureate

Big Data Challenges
For Big Data, data management and movement is a frequent challenge
…between facilities, archives, researchers…
Many files, large data volumes
With security, reliability, performance…

Ian Foster: Father of Grid Computing

Big Data Challenges
Capture
Curation
Storage
Search
Sharing
Analysis
Visualization

Big Data Science

Large Hadron Collider (LHC)

15 PB/year
173 TB/day
500 MB/sec

Higgs discovery is “only
possible because of the
extraordinary
achievements of … grid
computing”
—Rolf Heuer, CERN DG

Data flows at Argonne National Lab

Data management challenges
[Figure: estimated data flows at Argonne, in TB/day, linking external data sources, the Advanced Photon Source (estimates), the Argonne Leadership Computing Facility, short-term storage, long-term storage, and data analysis. Individual flows in the diagram range from 9 to 163 TB/day.]
Credit: Ian Foster

Big Data demands new CS research

For example, existing clustering algorithms are typically cubic in N, and
when N is too big, they do not work! - Jim Gray

What is Big Data?
•Definition of Big Data:
“…refers to large, diverse, complex, longitudinal, and/or
distributed data sets generated from instruments, sensors,
Internet transactions, email, video, click streams, and/or
all other digital sources available today and in the future.”
from nsf.gov website

Big Data Challenges
•Challenges of Big Data:
“national big data challenges, which include advances in core
techniques and technologies; big data infrastructure projects in
various science, biomedical research, health and engineering
communities; education and workforce development; and a
comprehensive integrative program to support collaborations of
multi-disciplinary teams and communities to make advances in the
complex grand challenge science, biomedical research, and
engineering problems of a computational- and data-intensive world.”

from nsf.gov website

Big Data demands big workflows

Reminiscent of

And thousands of parallel executions

Managing big workflows and large-scale
parallel execution is a big CS challenge!

Outline

1. Introduction
2. VIEW: A Prototypical SWFMS
3. A Scientific Workflow Composition Model
4. A Collectional Data Model
5. Conclusions and Future Work

Introduction
 Data Intensive Science
 From computation intensive to data intensive.
 A new research cycle – from data capture and data
curation to data analysis and data visualization.
 “In the future, the rapidity with which any given
discipline advances is likely to depend on how well
the community acquires the necessary expertise in
database, workflow management, visualization,
and cloud computing technologies.” (“Beyond
the Data Deluge”, Science, Vol. 323. no. 5919, pp.
1297 – 1298, 2009.)

Introduction
 Scientific Workflow
 A formal specification of a scientific
process.
 Represents, streamlines, and
automates the steps from dataset
selection and integration,
computation and analysis, to final
data product presentation and
visualization.
 Applications: Bioinformatics,
Oceanography, Neuroinformatics,
Astronomy, etc.

Introduction
 Scientific Workflow Management System
(SWFMS)
 Supports the specification, modification, execution,
failure handling, and monitoring of a scientific
workflow.
 Existing SWFMSs:
• Taverna,
• Kepler,
• Pegasus,
• VisTrails,
• VIEW,
• …

Our VIEW System
 Enables scientists to design workflows
 Provides a runtime system to execute workflows
• on a dedicated VIEW server
• in a Cloud computing environment
 Supports efficient collection, storage,
querying, and visualization of workflow
provenance
 Is currently used in several bioinformatics
applications, including genomic recombination
and gene conversion data analysis

An Example Workflow in VIEW
 Example workflows in VIEW

VIEW 1-2-3
Step 1: Drag and drop inputs and outputs, and computational
Step 2: Link them into a scientific workflow
Step 3: Click the run button and you get the result!

An Example Workflow in VIEW
 FiberFlow
 Transforms large-scale neuroimaging data into knowledge through cross-subject, cross-modality computation, ultimately leading to high clinical
intelligence in neural diseases.

 Minimal complexity for users, with extensive
techniques behind the scenes.
 To provide a clear and simple abstraction for manipulating
and coordinating resources

 Service-oriented architecture.
 Intuitive, user-friendly GUI

A Reference Architecture for SWFMSs
Service-oriented architecture of VIEW

Other advantages of VIEW:

 VIEW workflows can be executed in other
systems (specifications are not tied to a particular
SWFMS)
 Use of open standards (Web Services, XML)
promotes collaboration, interoperability and
extensibility of the system
 Workflow and data models implemented in VIEW
are specifically geared towards heavy scientific
data

A typical scientific workflow execution diagram.

Workflow Engine
Workflow Engine is the heart of the
system.
 Workflow Orchestration.
 Workflow Execution.
 Coordination of other subsystems.

Workflow Engine in VIEW.
 Dataflow based.
 Pure workflow composition.
 Workflow constructs.

SWL
 Example of our proposed scientific workflow
specification language (SWL).

Primitive Workflow Specification
 Example SWL specification of a primitive
workflow.

Workflow Execution
 Primitive workflow
 Unary construct based workflow
 Graph based workflow
• A workflow graph is a composition of workflows by
binary constructs.
• Optimistic scheduling.

Data Product Manager
 Solid data model.
 Scalable data storage.
 Convenient data access.
 Data independence.

Data Product Manager is based on the
collectional data model.

DPM Architecture
 Architecture of the Data Product Manager.
[Figure: Data Product Manager architecture, showing a main server (master), node databases, and data sets (Data Set 1, Data Set 2) each backed by relational databases and file repositories, organized into a Data Access Layer, a Data Mapping Layer, and a Data Storage Layer.]

DPL
 Example of the XML description of a
collectional data product.

Data Storage
 VIEW supports two ways of storage:
 A collection can be stored in a table containing a
set of its key/value pairs, whose values are
references to existing collections.
 A collection can be expanded and stored in two
tables.
• The Group By operator.
• The Compress operator.

Data Typing
A Data Product is
 a Collection,
 or a List,
 or Empty.

The List type
 Introduced in the workflow engine.
 Each element is a data product.
 Heterogeneous.

Collectional Data Querying
Operators are implemented as primitive
workflows.
 Arithmetic operators.
 Boolean operators.
 Collectional operators.
 List operators.

Queries are implemented as workflow
compositions.

Example
 Given a table Reference < Student, Company,
GradTime >, find the number of distinct
students who received an offer from each company
in each graduation year; sort the result by GradTime
in descending order, then by Company in ascending order.
 SQL query.
 SELECT Company, GradTime, COUNT(DISTINCT Student)
AS NumberOfJob
FROM Reference
GROUP BY Company, GradTime
ORDER BY GradTime DESC, Company ASC;
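The SQL query above can be tried out with Python's built-in sqlite3 module. This is only a sketch for checking the query's behavior: the sample rows and the function name offers_per_company_year are made up for illustration, and VIEW itself expresses such queries as workflow compositions rather than SQL.

```python
import sqlite3

# Hypothetical sample rows for Reference(Student, Company, GradTime).
ROWS = [
    ("alice", "Acme", 2012),
    ("bob", "Acme", 2012),
    ("alice", "Initech", 2012),
    ("carol", "Acme", 2013),
    ("carol", "Acme", 2013),  # repeated row, counted once by DISTINCT
]


def offers_per_company_year(rows):
    """Run the slide's query against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Reference (Student TEXT, Company TEXT, GradTime INTEGER)"
    )
    conn.executemany("INSERT INTO Reference VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT Company, GradTime, COUNT(DISTINCT Student) AS NumberOfJob
        FROM Reference
        GROUP BY Company, GradTime
        ORDER BY GradTime DESC, Company ASC
        """
    ).fetchall()
    conn.close()
    return result
```

With the sample rows, the most recent graduation year comes first, and companies are listed alphabetically within each year.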

Example of Query Workflow
Query Workflow.

Key Requirements for Workflow Modeling

R1: Programming-in-the-large.
R2: Dataflow programming model.
R3: Composable dataflow constructs.
R4: Workflow encapsulation and
hierarchical composition.
R5: Single-assignment property.
R6: Physical and logical data models.
R7: Exception handling.

A Scientific Workflow Model
Workflows are the basic and the only
operands for workflow composition.

[Figure: example workflows W1, W2, and W3 with input ports i1 … ik and output port o1.]

Task components (e.g. Web services)
are wrapped as primitive workflows
(a.k.a. tasks), which are the basic
building blocks of scientific workflows.

A workflow construct is a mapping
from a set of workflows to a workflow.
 Unary workflow constructs
 Binary workflow constructs
 …

A construct C takes a set of workflows W1, ...., Wn as input,
and composes them into Wc as the output workflow.
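One way to picture a construct is as a higher-order function: it takes workflows and returns a workflow. The sketch below models a workflow as a plain Python callable, which is an assumption made for illustration and not VIEW's SWL representation. The binary construct sequence and the tasks add and double are hypothetical names.

```python
from typing import Callable

# Assumption: a workflow is modeled as any Python callable.
Workflow = Callable[..., object]


def sequence(first: Workflow, second: Workflow) -> Workflow:
    """Hypothetical binary construct: feed the output of `first` into `second`."""
    def composed(*inputs):
        return second(first(*inputs))
    return composed


def add(a, b):
    """Primitive workflow (task) with two input ports."""
    return a + b


def double(x):
    """Primitive workflow (task) with one input port."""
    return 2 * x


# The result of a construct is itself a workflow, so it can be composed again.
add_then_double = sequence(add, double)
add_then_quadruple = sequence(add_then_double, double)
```

Because the output of a construct is again a workflow, tasks and composed workflows can be used interchangeably as operands.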

 Our proposed scientific workflow model
consists of the following two layers:
 The logical layer contains the workflow interface that
models the input ports and output ports of a workflow.
 The physical layer contains the workflow body that models
the physical implementation of the workflow.
• Primitive workflows.
• Graph-based workflows.
• Unary-construct-based workflows.

Unary Workflow Constructs

Dataflow-based Unary Workflow Constructs

The Map Construct
 The Map construct enables the parallel
processing of a collection of data products
based on a workflow that can only process a
single data product.
 Example: applying the Map construct M to a workflow W1 yields W2. On the input [[1,2],[3,6],[4,7]], W2 runs W1 once per element, in parallel: [1,2] → 2, [3,6] → 18, [4,7] → 28.
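The Map construct can be sketched as a function that lifts a single-item workflow to a workflow over a collection. The code below is an illustration, not VIEW's implementation: workflows are modeled as Python callables, a thread pool stands in for parallel execution, and the task multiply is an assumption chosen because it reproduces the outputs 2, 18, and 28 of the example.

```python
from concurrent.futures import ThreadPoolExecutor


def map_construct(workflow):
    """Lift a workflow over one data product to a workflow over a collection."""
    def mapped(collection):
        # Each element is processed by an independent run of `workflow`;
        # pool.map returns results in input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(workflow, collection))
    return mapped


def multiply(pair):
    """Hypothetical single-item workflow: multiply the two numbers in a pair."""
    a, b = pair
    return a * b


multiply_all = map_construct(multiply)
```

Calling multiply_all([[1, 2], [3, 6], [4, 7]]) runs multiply once per pair and collects the results in input order.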

The Reduce Construct
 The Reduce construct enables the aggregation
of a list of data products to a single data
product based on a workflow that aggregates
a limited (two or more) number of input data
products.
 Example: applying the Reduce construct R to Add yields W3. With initial value 0 and the list [3,5,9], W3 runs Add three times: Add(0, 3) = 3, Add(3, 5) = 8, Add(8, 9) = 17.
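A minimal sketch of the Reduce construct, assuming a two-input aggregating workflow and an explicit initial value as in the example. Workflows are modeled as Python callables, and the names reduce_construct and add are illustrative only.

```python
def reduce_construct(workflow, initial):
    """Turn a two-input aggregating workflow into a workflow over a list."""
    def reduced(items):
        accumulator = initial
        for item in items:
            # Each run consumes the previous result and the next list element.
            accumulator = workflow(accumulator, item)
        return accumulator
    return reduced


def add(a, b):
    return a + b


sum_list = reduce_construct(add, 0)
```

For the list [3, 5, 9], the intermediate results are 3, 8, and 17, one per run of add.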

The Tree Construct
 The Tree construct
 Enables parallel aggregation of a collection of data products.
 Aggregates a collection pairwise, as a binary tree, until a
single aggregated product is generated.

 The Tree construct can be applied to
associative workflows.
 Example: applying the Tree construct T to Add yields W4. On the input [0,3,5,9], Add(0, 3) = 3 and Add(5, 9) = 14 run in parallel, and Add(3, 14) = 17 produces the final result.
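The pairwise aggregation of the Tree construct can be sketched level by level. This is an illustrative sketch under assumptions: workflows are Python callables, the pairs within a level are evaluated sequentially here (they are independent and could run in parallel), and an unpaired last element is carried to the next level unchanged, a detail the slides do not specify.

```python
def tree_construct(workflow):
    """Aggregate a collection pairwise, as a binary tree, with a two-input workflow."""
    def aggregated(items):
        level = list(items)
        if not level:
            raise ValueError("Tree needs at least one data product")
        while len(level) > 1:
            # Combine neighbors (0,1), (2,3), ...; these runs are independent.
            next_level = [
                workflow(level[i], level[i + 1])
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2 == 1:
                next_level.append(level[-1])  # assumption: carry the odd element over
            level = next_level
        return level[0]
    return aggregated


def add(a, b):
    return a + b


tree_sum = tree_construct(add)
```

The grouping differs from a left-to-right Reduce, which is why the wrapped workflow needs to be associative for both to give the same result.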

The Conditional Construct
 The Conditional construct enables the
conditional execution of a workflow based on a
condition on one of the inputs.
 Example: the Conditional construct C is applied to a Projection workflow on the input [2,3].
• With p = (PI 1 < PI 2), p = true, so Projection is executed.
• With p = (PI 1 >= PI 2), p = false, and the result is Fail.
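A rough sketch of the Conditional construct, under stated assumptions: workflows are Python callables, the predicate is evaluated on the workflow's inputs, and a false predicate yields a FAIL sentinel instead of running the workflow. The sentinel, the task project_first, and the exact predicate signature are hypothetical; the slides do not define them.

```python
FAIL = object()  # hypothetical sentinel standing for the "Fail" outcome


def conditional_construct(workflow, predicate):
    """Run `workflow` only when `predicate` holds on its inputs."""
    def guarded(*inputs):
        if predicate(*inputs):
            return workflow(*inputs)
        return FAIL
    return guarded


def project_first(a, b):
    """Hypothetical projection task: return the first input."""
    return a


run_if_less = conditional_construct(project_first, lambda a, b: a < b)
run_if_not_less = conditional_construct(project_first, lambda a, b: a >= b)
```

With inputs 2 and 3, the first guarded workflow executes the projection, while the second produces the failure marker.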

The Loop Construct
 The Loop construct enables cyclic executions
of a workflow.
 The output of the workflow is repeatedly
fed back to a specified input port
until the predicate evaluates to true.
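 Example: the Loop construct L is applied to Add with predicate p = (PI 1 > 100) and initial inputs 0 and 1. Each output is fed back as an input: the first run outputs 1 (p = false), the second outputs 2 (p = false), and so on until the value 101 is reached (p = true).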
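The feedback behavior of the Loop construct can be sketched as follows. This is an illustration under assumptions: workflows are Python callables, the predicate is checked on each output, the output is fed back to the first input port by default, and max_iterations is a safety limit added for the sketch. None of these details are taken from VIEW's implementation.

```python
def loop_construct(workflow, predicate, feedback_port=0, max_iterations=10_000):
    """Re-run `workflow`, feeding its output back to one input port,
    until `predicate(output)` is true."""
    def looped(*inputs):
        args = list(inputs)
        for _ in range(max_iterations):
            output = workflow(*args)
            if predicate(output):
                return output
            args[feedback_port] = output  # feed the result back in
        raise RuntimeError("loop did not terminate within max_iterations")
    return looped


def add(a, b):
    return a + b


count_past_100 = loop_construct(add, lambda value: value > 100)
```

Starting from inputs 0 and 1, add is re-run with its own output on the first port until the value exceeds 100.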

The Curry Construct
 The Curry construct allows users to fix one of the input
ports with a specified argument and thus reduce the
number of input ports.
 By applying multiple Curry constructs, a workflow that
takes multiple arguments can be translated into a chain
of workflows each with a single argument.
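 Example: applying the Curry construct U to Add fixes one of its two input ports, yielding the single-input workflow W8. With the values 1 and 4 on the two ports of Add, W8 outputs 5.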
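A small sketch of the Curry construct, assuming workflows are Python callables with positional input ports. The name curry_construct and the zero-based port parameter are illustrative choices, not part of SWL.

```python
def curry_construct(workflow, port, value):
    """Fix the input at position `port` (zero-based) to `value`,
    returning a workflow with one fewer input port."""
    def curried(*remaining):
        args = list(remaining)
        args.insert(port, value)  # put the fixed argument back in its slot
        return workflow(*args)
    return curried


def add(a, b):
    return a + b


def subtract(a, b):
    return a - b


add_one = curry_construct(add, 1, 1)             # second port fixed to 1
ten_minus = curry_construct(subtract, 0, 10)     # first port fixed to 10
five = curry_construct(curry_construct(add, 0, 1), 0, 4)  # both ports fixed
```

Applying the construct repeatedly removes one port at a time, which is how a multi-argument workflow becomes a chain of single-argument ones.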

Workflow Composition
 Example of the composition of Map and Map
constructs.
 A workflow that increases all the numbers in a nested list
by 1.
[Figure: workflow W9 applies two nested Map constructs (M M) to Add with one input fixed to 1. On the input [[1,2,3],[4,5,6]], six parallel Add runs produce [[2,3,4],[5,6,7]].]
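The nested-list example can be reproduced by applying a Map construct twice. In this sketch workflows are Python callables, Map is written sequentially for brevity, and curry_second is a hypothetical helper that fixes the second input of Add to 1.

```python
def add(a, b):
    return a + b


def map_construct(workflow):
    """Lift a workflow over one item to a workflow over a list of items."""
    def mapped(collection):
        return [workflow(item) for item in collection]
    return mapped


def curry_second(workflow, value):
    """Fix the second input port of a two-input workflow."""
    def curried(first):
        return workflow(first, value)
    return curried


add_one = curry_second(add, 1)
# Map applied to a Map: the outer one walks the rows, the inner one the numbers.
increment_nested = map_construct(map_construct(add_one))
```

Each additional Map application handles one more level of nesting in the input.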

 Example of the composition of Map and Reduce
constructs.
 A workflow for parallel summation of each row in a
matrix.
[Figure: workflow W11 applies Map (M) to a Reduce (R) of Add with initial value 0. On the input [[1,2,3],[4,5,6]], the rows are summed in parallel, producing 6 and 15.]
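Row summation combines the two constructs: Reduce turns Add into a list-summing workflow, and Map applies that workflow to every row. The sketch below models workflows as Python callables and runs the rows sequentially; the helper names are illustrative.

```python
from functools import reduce


def add(a, b):
    return a + b


def map_construct(workflow):
    """Lift a workflow over one item to a workflow over a list of items."""
    def mapped(collection):
        return [workflow(item) for item in collection]
    return mapped


def reduce_construct(workflow, initial):
    """Fold a list with a two-input workflow, starting from `initial`."""
    def reduced(items):
        return reduce(workflow, items, initial)
    return reduced


# Map over rows of a Reduce over each row's numbers.
row_sums = map_construct(reduce_construct(add, 0))
```

The rows are independent of one another, so a parallel Map could sum them concurrently.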

 Example of a complicated workflow composition.
 A workflow to calculate the greatest common divisor.
[Figure: workflows W13 to W17, built from Split, Modulus, Merge, and Projection tasks using the G2W, Map (M), Curry (U), and Loop (L) constructs, with loop predicate p = (PI(2) == 0).]
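The task names and the predicate in this example suggest Euclid's algorithm: split a pair, take the modulus, merge a new pair, loop until the second component is 0, then project the first component. The sketch below follows that reading; it is an interpretation, not a transcription of the VIEW workflow, and it checks the predicate before each step rather than after.

```python
def euclid_step(pair):
    """One loop body: split the pair, take the modulus, merge a new pair."""
    a, b = pair            # Split
    remainder = a % b      # Modulus
    return (b, remainder)  # Merge


def gcd_workflow(a, b):
    """Repeat euclid_step until the second component is 0, then project the first."""
    pair = (a, b)
    while pair[1] != 0:    # loop predicate: second component == 0 ends the loop
        pair = euclid_step(pair)
    return pair[0]         # Projection
```

For example, the pair (48, 18) becomes (18, 12), then (12, 6), then (6, 0), at which point the first component is the result.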

 A collectional data model
 Support collection oriented datasets.
• Scientists often work with collection oriented datasets,
such as arrays, lists, tables or file collections.
• A collection-oriented data model enables data
parallelism in scientific workflows.
 Support nested data structures.
• Scientific data is often hierarchically organized.
• Scientific workflow tasks often produce collections of
data products, and the execution of a workflow
composed from such tasks can create increasingly
nested data collections.
 Provide well-defined operators and their arbitrary
compositions to manipulate and query scientific data
collections.

 A relation is a pair < R, r > where R is a
schema of the relation and r is an instance of
that schema.
 A relation schema can be defined as an
unordered tuple < c1 : d1, c2 : d2, …, cn : dn >
where c1, c2, …, cn are column names and d1,
d2, …, dn are domain names.
 A relation instance is a table with rows
(called tuples) and named columns (called
attributes).

 A collection schema is a pair < K, V >.
 K, the key, is a pair k : d where k is the key name and d is
the domain name .
 V, the value, is either a relation schema or a collection
schema.

 A collection instance is a set of key-value
pairs (pi, qi) (i∈ {1,…,m}).
 Each pi is a scalar value.
 Each qi is either a relation instance or a collection instance.

An example:
 Parameters< Model : String, Experiments :
Integer, <Concentration : Double, Degree :
Integer >>.
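A collection instance of this shape can be pictured as nested key-value maps whose innermost values are relation instances. The sketch below uses Python dictionaries for the key levels (Model, then Experiment) and a list of rows for the relation; the sample values and the flatten helper are illustrative assumptions, not part of VIEW.

```python
# Hypothetical instance: Model -> Experiment -> relation of (Concentration, Degree).
parameters = {
    "m1": {1: [{"Concentration": 7.0, "Degree": 15}]},
    "m2": {
        1: [{"Concentration": 7.1, "Degree": 15}],
        2: [{"Concentration": 7.1, "Degree": 30}],
    },
}


def flatten(instance, key_names):
    """Yield one flat row per innermost tuple, prefixed with the enclosing keys."""
    if not key_names:
        # No key levels left: `instance` is a relation instance (a list of rows).
        for row in instance:
            yield dict(row)
        return
    name, rest = key_names[0], key_names[1:]
    for key, value in instance.items():
        for row in flatten(value, rest):
            yield {name: key, **row}


rows = list(flatten(parameters, ["Model", "Experiment"]))
```

Flattening shows how a nested collection relates to an ordinary table: every key level becomes an extra column.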

The Collectional Operators
 We extend the relational operators to the
collectional operators of which the collections
are the only operands.
 Six primitive operators: union, set difference,
selection, projection, Cartesian product and
renaming.
 The set of the collections is closed under those
operators.
 A relation can be defined as a collection whose
height and cardinality are equal to 1. The
collectional operators will then reduce to the
relational operators.

 The union and the set difference operators can
only be applied on union-compatible
collections.
 Example of the union operator and the set
difference operator.
[Figure: two union-compatible collections keyed by Model with a Result value, {m1: 26, m2: 32} and {m2: 32, m3: 31}, together with their union and set difference.]
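One possible reading of union and set difference on two such collections is sketched below. The semantics are an assumption: a collection is a dictionary from key to a set of tuples, union merges keys and unites the tuples under shared keys, and set difference removes tuples key by key and drops keys that become empty. The slides do not spell out these rules, so this is illustrative only.

```python
def union(a, b):
    """Merge two union-compatible collections key by key."""
    keys = sorted(set(a) | set(b))
    return {k: a.get(k, frozenset()) | b.get(k, frozenset()) for k in keys}


def difference(a, b):
    """Keep the tuples of `a` that do not appear under the same key in `b`."""
    result = {}
    for key, tuples in a.items():
        remaining = tuples - b.get(key, frozenset())
        if remaining:  # assumption: keys with no remaining tuples are dropped
            result[key] = remaining
    return result


# The two example collections: Model -> {(Result,)}.
first = {"m1": frozenset({(26,)}), "m2": frozenset({(32,)})}
second = {"m2": frozenset({(32,)}), "m3": frozenset({(31,)})}
```

Under these assumptions the union holds m1, m2, and m3, and the difference of the first collection minus the second keeps only m1.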

 Example of the Cartesian product operator
and the renaming operator.
[Figure: the two collections are renamed M1 and M2 and combined, giving a nested collection keyed by M1.model and M2.model whose values pair M1.Result with M2.Result: (26, 32), (26, 31), (32, 32), (32, 31).]

 Example of the selection operator.
[Figure: a selection result containing Model m2, Experiment 1, with Concentration 7.1 and Degree 15.]

 Example of the projection operator.
[Figure: a projection result keyed by Experiment (1, 2, ...), with value pairs such as (7.0, 15), (7.1, 15), (7.0, 30), (7.1, 30).]

Key Features of VIEW
F1: VIEW features the first uniform workflow
model, in which workflows are the only
building blocks. In VIEW, tasks are primitive
workflows, and no workflow construct
distinguishes workflows from tasks. Such a
model greatly simplifies workflow design:
a workflow designer only needs to
compose complex workflows from simpler
ones, without first encapsulating
workflows as tasks or vice versa during the
composition process.

F2: VIEW has powerful workflow composition,
in which workflow constructs are fully
composable with one another to arbitrary
levels of nesting. This often results in VIEW workflows
that are more concise and more efficient to
execute, and that can be hard to model in other
workflow systems.

F3: VIEW features a pure dataflow-based
workflow language, SWL, including the
dataflow counterparts of controlflow-style
constructs, such as conditional and loop.
Existing workflow languages often require
both controlflow and dataflow constructs,
resulting in complex or even obscure
semantics and non-trivial workflow design.

F4: VIEW supports the cloud MapReduce
programming model not only at the job level,
but also at the workflow level. Therefore, one
can apply the Map and Reduce constructs to
an arbitrary workflow an arbitrary number
of times. As a result, VIEW can process
nested lists of data products in parallel using
multiple runs of a workflow.

F5: VIEW features a collectional data model
that supports not only traditional primitive
data types, such as integer, float, double,
boolean, char, and string, but also files, relations,
and hierarchical collections (hierarchical key-value
pairs) to support parallel processing of data
collections.

F6: VIEW supports a high-level graph-based
provenance query language, OPQL. In most cases,
users can formulate lineage queries easily without
writing recursive queries or
knowing the underlying database
schema.

 F7: VIEW features the first service-oriented
architecture that conforms to the reference
architecture for scientific workflow
management systems (SWFMSs). This
architecture greatly facilitates interoperability
and subsystem reusability in the community.
This architecture also provides a generic
infrastructure upon which a domain-specific
scientific workflow application system
(SWFAS) can be easily developed with custom
interface for various platforms and devices.

Conclusions and Future Work
 A scientific workflow composition model.
 A collectional data model.
 A prototypical SWFMS.
 Future work:
 Formalization of the scientific workflow algebra and
collectional algebra.
• Completeness.
• Integration.

 Collaborative scientific workflow composition.
• Concurrent design and composition.
• Concurrent execution.

VIEW applications
 Fiber tract analysis for epilepsy.
 Computational detection of MARS in genome.
 DNA analysis for the bacterium E. coli.
 Simulation of Nereis succinea mate search behavior.

Big Data is a Pyramid

Can you contribute a piece too?

Big Data Research Laboratory
Wayne State University

viewsystem.org
