Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures, and presents a methodology for workflow-driven science based on these maturing requirements.
A Maturing Role of Workflows in the Presence of Heterogeneous Computing Architectures
1. A Maturing Role of Workflows in the Presence of Heterogeneous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas
Founder and Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego
2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data" and "Data-intensive HPC"
1985 to today
3. Workflows for Data Science Center
• Scientific Workflow Automation Technologies Research
• Workflows for Cloud Systems
• Big Data Applications
• Reproducible Science
• Workforce Training and Education
• Development and Consulting Services
Focus on the question, not the technology!
10+ years of data science R&D experience as a Center.
4.
5. Computational Data Science Workflows - Programmable, Reusable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org | WorDS.sdsc.edu
6. The Big Picture is to Capture the Workflow in an Executable and Reusable Way
From "Napkin Drawings" to Executable Workflows: conceptual SWF to executable SWF
[Diagram: SBNL workflow: Big Data feeds Quality Evaluation & Data Partitioning; Local Learners (Data Quality Evaluation, Local Ensemble Learning) feed a Master Learner (Master Ensemble Learning), which produces the final BN structure.]
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
7. Kepler is a Scientific Workflow System
• Ptolemy II: a laboratory for investigating design
• KEPLER: a problem-solving environment for Scientific Workflow
• KEPLER = Ptolemy II + X for Scientific Workflows
• A cross-project collaboration … initiated August 2003
• 2.5 will be released soon
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
8. Kepler can be applied to problems in different scientific disciplines: some here and many more…
• Astrophysics, e.g., DIAPL
• Nanotechnology, e.g., ANELLI
• Fusion, e.g., ITER
• Metagenomics, e.g., CAMERA
• Multi-scale biology, e.g., NBCR
9. A Toolbox with Many Tools
Need expertise to identify which tool to use when and how!
Require computation models to schedule and optimize execution!
• Data: search, database access, IO operations, streaming data in real-time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
10. So, how can we use workflows in the context of applications?
… while coupling computing at all scales within a reusable solution…
11. Some P's to focus on… People, Process, Platforms, Purpose, Programmability
12. There are more: provenance, publication, product, performance, policy, profit, ...
16. A Typical Workflow-Driven Process
Find data → Access data → Acquire data → Move data → Clean data → Integrate data → Subset data → Pre-process data → Analyze data → Process data → Interpret results → Summarize results → Visualize results → Post-process results
Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
→ configurable automated analysis
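The process above can be sketched as a configurable pipeline whose stages are plain functions composed in order. This is a minimal illustration only; the stage names and the toy data are hypothetical, not part of any WorDS tool.

```python
# A minimal sketch of a configurable, automated analysis pipeline:
# each stage is a plain function, and the workflow is the ordered
# composition of stages. All stage names here are hypothetical.
from functools import reduce

def clean(records):
    # Clean data: drop records with missing values.
    return [r for r in records if None not in r.values()]

def subset(records):
    # Subset data: keep only the fields the analysis needs.
    return [{"value": r["value"]} for r in records]

def analyze(records):
    # Analyze data: the "analysis function" here is a simple mean.
    values = [r["value"] for r in records]
    return sum(values) / len(values)

def run_workflow(data, stages):
    # Apply each stage to the output of the previous one.
    return reduce(lambda out, stage: stage(out), stages, data)

data = [{"value": 4.0}, {"value": None}, {"value": 8.0}]
result = run_workflow(data, [clean, subset, analyze])
print(result)  # mean of the cleaned values: 6.0
```

Because the stage list is just data, swapping in a different analysis function or inserting a visualization stage is a configuration change, which is the sense in which the process is "configurable automated analysis".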
18. Purpose…
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
use cases => purpose and value
19. Integration of Many Tools to Serve a Purpose
Need toolboxes with many tools for:
• data access,
• analysis,
• scalable execution,
• fault tolerance,
• provenance tracking,
• reporting,
• ...
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages
Build → Explore → Scale → Report
20. Build Once, Run Many Times…
• Data science process should support experimental work and dynamic scalability on many platforms
• Scalability based on:
  – data volume and velocity
  – dynamic modeling needs
  – highly-optimized HPC codes
  – changes in network, storage and computing availability
21. There are different styles of parallelism!
[Diagram: Task1–Task4 in states Finished/Running/Waiting illustrate pipelined execution over items 1, 2, 3 of an input data set; Task1–Task3 all Running at once on partitions 1, 2, 3 of the input data set illustrate data-parallel execution.]
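The contrast between the styles can be made concrete with a small sketch: the same task run once over the whole data set versus run simultaneously over partitions of it. Threads stand in for what a workflow system would dispatch to cluster nodes; the task and data are invented for illustration.

```python
# A minimal sketch contrasting sequential and data-parallel
# execution, using Python threads for illustration only (a workflow
# system would dispatch the tasks to cluster nodes instead).
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # The same program runs on each partition of the input data set.
    return sum(partition)

input_data_set = list(range(12))

# Sequential: one task processes the whole input data set.
sequential_result = task(input_data_set)

# Data-parallel: partition the input, run the task on every
# partition at once, then combine the partial results.
partitions = [input_data_set[i::3] for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(task, partitions))
data_parallel_result = sum(partial_results)

print(sequential_result, data_parallel_result)  # both 66
```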
22. Distributed Data-Parallel Computing
• A parallel and scalable programming model for Big Data
  – Input data is automatically partitioned onto multiple nodes
  – Programs are distributed and executed in parallel on the partitioned data blocks
MapReduce: move program to data!
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
23. Distributed Data-Parallel (DDP) Patterns
• A higher-level programming model
  – Moving computation to data
  – Good scalability and performance acceleration
  – Run-time features such as fault-tolerance
  – Easier parallel programming than MPI and OpenMP
Patterns for data distribution and parallel data processing
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
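The MapReduce pattern named above can be sketched in a few lines as a single-process word count with explicit map, shuffle and reduce phases. This is a toy model of the pattern, not of any particular engine; a real framework runs the map and reduce tasks on the nodes holding each data block.

```python
# A minimal single-process sketch of the MapReduce DDP pattern
# (word count). A real engine would run map and reduce tasks on
# the nodes holding each data block ("move program to data").
from collections import defaultdict

def map_phase(block):
    # Emit one (key, value) pair per word in a data block.
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values emitted for one key.
    return key, sum(values)

blocks = ["big data big", "data science"]  # partitioned input
mapped = [p for block in blocks for p in map_phase(block)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'science': 1}
```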
24. Hadoop
• Open source implementation of MapReduce
• A distributed file system across compute nodes (HDFS)
  – Automatic data partition
  – Automatic data replication
• Master and workers/slaves architecture
• Automatic task re-execution for failed tasks
Spark
• Fast Big Data engine
  – Keeps data in memory as much as possible
• Resilient Distributed Datasets (RDDs)
  – Evaluated lazily
  – Keeps track of lineage for fault tolerance
• More operators than just Map and Reduce
• Can run on YARN (Hadoop v2)
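The two RDD ideas named above, lazy evaluation and lineage-based fault tolerance, can be illustrated with a toy class: transformations only record themselves, and an action replays the recorded lineage from the source data, which is also how a lost partition would be recomputed. This is a conceptual sketch only; Spark's real RDD API is far richer.

```python
# A toy model of lazy, lineage-tracked evaluation (not Spark's API).
class ToyRDD:
    def __init__(self, data, lineage=()):
        self._data = data        # source partition
        self._lineage = lineage  # chain of transformations, not results

    def map(self, fn):
        # Transformations are recorded, not executed (lazy).
        return ToyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._data, self._lineage + (("filter", fn),))

    def collect(self):
        # An action replays the lineage from the source data; the
        # same replay would recompute a lost partition after a fault.
        items = self._data
        for op, fn in self._lineage:
            if op == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40] — computed only now
```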
26. Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
Local Cluster Resources: Gordon, Trestles
NSF/DOE TeraScale Resources (XSEDE): Gordon, Comet, Stampede, Lonestar
Private Cluster: User Owned Resources
Different executables have different computing architecture needs! e.g., memory-intensive, compute-intensive, I/O-intensive
27. Challenges for Heterogeneous Computing
• Dynamic scheduling optimization
  – Based on network availability
  – Data transfer and locality
  – Energy efficiency
  – Availability of exascale memory hierarchies
  – Workload changes
  – Dynamic memory- or file-based coupling
• Better programmable communication between workflow systems and infrastructure for computing, storage and network
• Harder form of reproducibility
• Harder to program using scripts
29. Using Big Data Computing in Bioinformatics
- Improving Programmability, Scalability and Reproducibility -
biokepler.org
30. bioKepler's Conceptual Framework
[Diagram: a bioinformatician customizes & integrates bioActors (BLAST, HMMER, CD-HIT) and bioinformatics tools (clustering, mapping, assembly, transfer) into a workflow; a director, scheduler and execution engine turn it into an executable workflow plan that is deployed & executed with data-parallel execution patterns (Map-Reduce, Master-Slave, All-Pairs). Compute: Amazon EC2, FutureGrid, Sun Grid Engine, ad hoc network, Triton Resource, XSEDE. Data: CAMERA, Ensembl, Genbank, private repositories, … Run manager features: provenance (execution history, data lineage), reporting (PDF generation, report designer), fault-tolerance (error handling, alternatives), tag search.]
31. Gateways and other user environments
[Stack: gateways and other user environments → bioKepler → Kepler and Provenance Framework → BioLinux, Galaxy, Clovr, Hadoop, … → cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE]
www.bioKepler.org
A coordinated ecosystem of biological and technological packages for bioinformatics!
32. RAMMCAP - Rapid Clustering and Functional Annotation for Metagenomic Sequences
[Chart: each pipeline tool (QC, tRNA, cd-hit, hmmer, metagene, blast) is characterized along four axes: data size (KB to TB), CPU time (seconds to years), memory (GB, 10GB, 100GB) and parallelism (no need, none, multi-threading, MPI, MapReduce).]
33. RAMMCAP – Rapid Clustering and Functional Annotation for Metagenomic Sequences
[Chart: the same tool characterization (data size, CPU time, memory, parallelism) for QC, tRNA, cd-hit, hmmer, metagene and blast, repeated with NGS data added.]
34. Computational NextGen Sequencing Pipeline: From Sequence to Taxonomy and Function
Source: Larry Smarr, Calit2
PI: Weizhong Li, CRBS, UCSD; NIH R01HG005978 (2010-2013, $1.1M)
35. Same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
36. Flexible programming of K-means
• R: programming language and software environment for statistical computing and graphics.
• KNIME: platform for data analytics.
• MLlib: scalable machine learning library running on the Spark cluster computing framework.
• Mahout: scalable machine learning library based on MapReduce.
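To make concrete the algorithm that R, KNIME, MLlib and Mahout each package with a different degree of scalability, here is a minimal pure-Python K-means sketch (1-D points, k=2, fixed iteration count); the data and starting centers are invented for illustration.

```python
# A minimal pure-Python K-means sketch: alternate assignment and
# update steps for a fixed number of iterations.
def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster's mean
        # (keep the old center if its cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans(points, centers=[0.0, 5.0]))  # converges to [2.0, 11.0]
```

The scalable libraries above parallelize exactly the assignment step, which is data-parallel over the points, and combine the per-partition sums in the update step, a Map and Reduce pairing.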
37. Scalable Bayesian Network Learning
From "Napkin Drawings" to Executable Workflows: conceptual SWF to executable SWF
[Diagram: SBNL workflow: Big Data feeds Quality Evaluation & Data Partitioning; Local Learners (Data Quality Evaluation, Local Ensemble Learning) feed a Master Learner (Master Ensemble Learning), which produces the final BN structure.]
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
39. Using Workflows and Cyberinfrastructure
for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
40. A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire
41. What is lacking in disaster management today is… a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers… before, during and after a firestorm.
43. Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps
[Scale axis: spatial scales from Å through nm–µm and 0.1mm–mm to cm; temporal scales from fs–µs through µs–ms and ms–s to s–lifespan; levels: Molecular & Macromolecular, Sub-Cellular, Cell, Tissue, Organ]
Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration
• Models at different scales are generally not designed to inform each other
• Specialized interfaces to communicate large numbers of parameters and data are needed
• Provenance of experiments needs to be portable
• Models require different levels of scalability
• Deployable software maintenance requires expertise
Rommie Amaro, UCSD
44. Sensitivity Analysis (SA) for Uncertainty Quantification (UQ)
Computational SA techniques to effectively and efficiently identify computational error and model sensitivity for differential equations (DE)
The Standard Scientific Simulation Workflow for DE Modeling in NBCR: biomedical theory and experimental data → nonlinear DE system as mathematical model → numerical solution of the nonlinear DE model → extraction of the quantity of interest from the simulation
Numerical solution of the nonlinear DE model: standard nonlinear solve of the primal problem → solution of the linearized dual problem for performing SA → use of SA information for UQ (error estimation) to build an improved numerical discretization → output of the numerical solution with UQ/SA info
FETK & FEniCS: support for the end-to-end computational scientific process, battling complexity while facilitating collaboration and increasing reproducibility.
Aim 1 Goal: Extract the Quantity of Interest (QoI) from accurate numerical simulation.
Mike Holst, UCSD
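The dual-problem step in this workflow is commonly formalized as adjoint-based a posteriori error estimation. The following is a standard textbook sketch under usual assumptions (a well-posed primal problem and a differentiable quantity of interest), not a statement of the specific FETK/FEniCS formulation used in NBCR:

```latex
% For a nonlinear problem R(u) = 0 with quantity of interest Q(u)
% and numerical solution u_h, the linearized dual (adjoint) problem
\[
  \bigl(R'(u_h)\bigr)^{*}\,\varphi \;=\; Q'(u_h)
\]
% yields a weighted-residual estimate of the error in the QoI,
\[
  Q(u) - Q(u_h) \;\approx\; -\,R(u_h)(\varphi),
\]
% whose local contributions guide the improved numerical
% discretization (e.g., adaptive refinement) named in the workflow.
```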
45. Molecular Dynamics CADD Workflow
Options: user MD-parameter configuration option; local execution option; GPU or Gordon execution option
Amber Molecular Dynamics Package
Resources: Local NBCR Cluster Resources | NSF/DOE TeraScale Resources (XSEDE) (Stampede) | NBCR and User Owned Cloud Resources (Comet)
BENEFITS:
• Enable users to configure MD job parameters through command-line, GUI or web interface
• Scale for multiple compounds in parallel
• Run on multiple computing platforms
• Increase reuse
• Provenance
46. Predicting Workflow Performance from Provenance (IPPD)
http://hpc.pnl.gov/IPPD/
IDEA: Use past workflow execution traces along with system, application and execution profiles for dynamic predictive scheduling.
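The idea can be sketched in its simplest form: fit past execution traces to predict a task's runtime from its input size. The trace numbers below are invented for illustration, and real IPPD profiles would include system and application features, not just input size.

```python
# A minimal sketch of runtime prediction from provenance traces:
# ordinary least squares on (input size, observed runtime) pairs.
def fit_line(xs, ys):
    # Fit y = a*x + b by ordinary least squares.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

# Hypothetical provenance traces: (input size in GB, runtime in minutes).
sizes = [1.0, 2.0, 4.0, 8.0]
runtimes = [3.0, 5.0, 9.0, 17.0]
a, b = fit_line(sizes, runtimes)
predicted = a * 16.0 + b  # predicted runtime for a 16 GB input
print(a, b, predicted)
```

A scheduler could use such per-task predictions to decide, before dispatch, which resource class a step should run on.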
48. To Sum Up
• Workflows and provenance are well-adopted in scientific
infrastructures today, with success
• WorDS Center applies these concepts to advanced
dynamic data-driven analytics applications
• One size does not fit all!
• Many diverse environments and requirements
• Need to orchestrate at a higher level
• Higher level programming components for each domain
• Lots of future challenges remain:
• Optimized execution on heterogeneous platforms
• Programmable interface to workload, storage and network needed
• Increasing reuse within and across application domains
• Querying and integration of workflow provenance data into
performance prediction
49. Questions?
Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu
Twitter: @WorDS_SDSC
Thanks to our many collaborators and funders!