ISUM 2015 Keynote
Summary: Computational and data science is about extracting knowledge from data and from modeling. This end goal can only be achieved through a craft that combines people, processes, computational and Big Data platforms, application-specific purpose, and programmability. Publications and the provenance of the data products leading to these publications are also important. With this in mind, this talk defines a terminology for computational and data science applications and discusses why focusing on these concepts is important for executability and reproducibility in computational and data science.
WorDS of Data Science in the Presence of Heterogeneous Computing Architectures
1. WorDS of Data Science in the Presence of Heterogeneous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas
Founder and Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego
2. San Diego Supercomputer Center at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data"
1985 to today
3. Workflows for Data Science Center
• Scientific workflow automation technologies research
• Workflows for cloud systems
• Big Data applications
• Reproducible science
• Workforce training and education
• Development and consulting services
Focus on the question, not the technology!
10+ years of data science R&D experience as a Center.
4.
5. So, what is a workflow?
Shop → Prepare → Cook → Store
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
6. Let's make pasta this evening!
Shop → Prepare → Cook → Store
(Each step in the diagram is annotated with a time estimate, ranging from 3 to 30 minutes.)
7. How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management, the essential principle of fast cooking, is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read, and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
9. MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie
10. REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine, using veggie type as the key
• Output: a bowl of veggies per veggie kind
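The chop-then-combine idea can be written down directly. Below is a minimal, self-contained Python sketch of the map/reduce pattern using the cooking analogy; the veggie list and the chop/combine functions are made up for illustration and are not part of any real MapReduce framework.

```python
from collections import defaultdict

# Input: veggies (several of each kind).
veggies = ["carrot", "onion", "carrot", "celery", "onion", "carrot"]

# Map UDF: "chop" one veggie, emitting a (key, value) pair keyed by its kind.
def chop(veggie):
    return (veggie, "chopped " + veggie)

# Reduce UDF: combine all chopped pieces that share the same key
# into one bowl per veggie kind.
def combine(kind, pieces):
    return "a bowl with %d pieces of %s" % (len(pieces), kind)

# Map phase
mapped = [chop(v) for v in veggies]

# Shuffle phase: group the mapped values by key (veggie kind)
groups = defaultdict(list)
for kind, piece in mapped:
    groups[kind].append(piece)

# Reduce phase
for kind, pieces in groups.items():
    print(combine(kind, pieces))
```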
11. Thanksgiving dinner preparation: more planning and tasks?

Menu Item          Preparation Time   Cooking Time   Cooling Time
Turkey             30 minutes         4 hours        15 minutes
Veggies            30 minutes         45 minutes     None
Cranberry Sauce    5 minutes          30 minutes     2 hours
Soup               20 minutes         30 minutes     None
Pie                30 minutes         5 minutes      1 day

• When do you start cooking?
• In what order do you cook?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
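These planning questions can be made concrete with a toy calculation. The Python sketch below uses the table above and assumes, simplifying heavily, that only preparation needs the cook's attention while cooking and cooling can proceed unattended; the numbers and logic are only illustrative.

```python
# (preparation, cooking, cooling) times in minutes, from the table above;
# the pie's "1 day" of cooling is written as 24 * 60 minutes.
menu = {
    "Turkey":          (30, 4 * 60, 15),
    "Veggies":         (30, 45, 0),
    "Cranberry Sauce": (5, 30, 2 * 60),
    "Soup":            (20, 30, 0),
    "Pie":             (30, 5, 24 * 60),
}

# If nothing overlaps, each dish must start this long before dinner.
for dish, times in sorted(menu.items(), key=lambda kv: -sum(kv[1])):
    print("%-15s needs %5.1f hours of lead time" % (dish, sum(times) / 60.0))

# If cooking and cooling run unattended, the cook's hands-on time is only
# the preparation column, so several dishes can effectively run in parallel.
hands_on = sum(prep for prep, cook, cool in menu.values())
print("Total hands-on preparation time: %d minutes" % hands_on)
```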
12. Data Science Workflows
- Programmable, Reusable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Example applications: Real-Time Hazards Management (wifire.ucsd.edu), Data-Parallel Bioinformatics (bioKepler.org), Scalable Automated Molecular Dynamics and Drug Discovery (nbcr.ucsd.edu)
kepler-project.org | WorDS.sdsc.edu
14. The Big Picture is Supporting the Scientist
From "napkin drawings" (conceptual SWF) to executable workflows (executable SWF)
Example workflow: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
15. The Big Picture is Supporting the Data Scientist
From "napkin drawings" (conceptual SWF) to executable workflows (executable SWF)…
Example: the SBNL workflow for insurance and traffic data analytics using Big Data Bayesian network learning
Quality evaluation & data partitioning of Big Data → Local Learner (data quality evaluation, local ensemble learning) → Master Learner (master ensemble learning) → Final BN structure
16. Kepler is a Scientific Workflow System
• Ptolemy II: a laboratory for investigating design
• KEPLER: a problem-solving environment for scientific workflows
• KEPLER = Ptolemy II + X for scientific workflows
• A cross-project collaboration, initiated August 2003
• Version 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
17. A Toolbox with Many Tools
Need expertise to identify which tool to use, when, and how!
Requires computation models to schedule and optimize execution!
• Data: search, database access, I/O operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
18. So, how does this relate to data science, big data and supercomputing?
19. Distributed Computing: using more than one computer connected through a network
• Types of distributed computing:
  - Computers in a local area network
  - Cluster or high-performance computing
  - Grid
  - Cloud computing
20. Cluster or High-Performance Computing
• Built from multiple computers
• May have:
  - a parallel file system
  - a high-speed network
• Provides a scheduler to manage the machines and submitted jobs
  - SGE/OGE, PBS, Condor, LSF, SLURM
21. Parallelization: multiple processes or threads running at the same time
• Execution environments
  - One machine
  - Distributed machines
• Parallelism types
  - Computation/task parallelism
  - Data parallelism
  - Pipeline parallelism
22. There are different styles of parallelism!
(Diagram: three panels for task parallelism, data parallelism and pipeline parallelism, each showing an input data set and tasks in running, waiting or finished states.)
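As an illustration of the data-parallel style in particular, the Python sketch below splits an input data set across worker processes that all run the same task; the task itself (squaring numbers) is a stand-in chosen only to keep the example self-contained.

```python
from multiprocessing import Pool

# The same user-defined task is applied to every element of the data set;
# in the data-parallel style it runs on many partitions at the same time.
def task(x):
    return x * x

if __name__ == "__main__":
    input_data_set = list(range(16))

    # One process works through the whole data set sequentially.
    sequential = [task(x) for x in input_data_set]

    # Data parallelism: the data set is split into chunks and four worker
    # processes run the same task on their own chunks concurrently.
    with Pool(processes=4) as pool:
        parallel = pool.map(task, input_data_set, chunksize=4)

    assert sequential == parallel
    print(parallel)
```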
23. Big Data: Short Definition
Some features ("V's") of big data:
• Volume: amount of data
• Velocity: speed of data in and out
• Variety: range of data types and sources
• Veracity: trustworthiness of data
Picture credit: IBM, 2012
24. Distributed Data-Parallel Computing: MapReduce
• A parallel and scalable programming model for Big Data
  - Input data is automatically partitioned onto multiple nodes
  - Programs are distributed and executed in parallel on the partitioned data blocks
Move the program to the data!
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
25. Distributed Data-Parallel (DDP) Patterns: patterns for data distribution and parallel data processing
• A higher-level programming model
  - Moving computation to data
  - Good scalability and performance acceleration
  - Run-time features such as fault tolerance
  - Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
26. Hadoop
• Open-source implementation of MapReduce
• A distributed file system across compute nodes (HDFS)
  - Automatic data partitioning
  - Automatic data replication
• Master and workers/slaves architecture
• Automatic task re-execution for failed tasks

Spark
• Fast Big Data engine: keeps data in memory as much as possible
• Resilient Distributed Datasets (RDDs)
  - Evaluated lazily
  - Keep track of lineage for fault tolerance
• More operators than just Map and Reduce
• Can run on YARN (Hadoop v2)
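For concreteness, a minimal PySpark sketch of these ideas is shown below, assuming a local Spark installation; the sample text and the word-count logic are placeholders. Transformations such as flatMap and reduceByKey only build up RDD lineage; nothing runs until an action such as collect is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-sketch")

# Transformations below only record lineage; nothing executes yet (lazy evaluation).
lines = sc.parallelize([
    "move the program to the data",
    "keep the data in memory",
])
words = lines.flatMap(lambda line: line.split())   # an operator beyond plain map/reduce
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action triggers execution; a lost partition could be recomputed
# from the recorded lineage, which is how RDDs provide fault tolerance.
print(counts.collect())

sc.stop()
```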
28. My favorite definition of Data Science
"By 'Data Science', we mean almost everything that has something to do with data: collecting, analyzing, modeling… yet the most important part is its applications, all sorts of applications."
Journal of Data Science (http://www.jds-online.com/about)
Implies: programming, data analysis, and problem solving.
29. Some P's of Data Science: People, Process, Platforms, Purpose, Programmability
30. There are more: provenance, publication, product, performance, policy, profit, …
34. Solution: Scale the Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
36. Defining a Typical Data Science Process
Find data → Access data → Acquire data → Move data → Clean data → Integrate data → Subset data → Pre-process data → Analyze data → Process data → Interpret results → Summarize results → Visualize results → Post-process results
Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
The goal: configurable, automated analysis.
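One way to read "configurable, automated analysis" is as a pipeline whose steps and parameters are declared as data rather than hard-coded. The Python sketch below is only illustrative: the step functions are hypothetical stand-ins for the acquire/clean/analyze/report stages named above, not part of any WorDS or Kepler API.

```python
# Hypothetical step functions standing in for the stages listed above;
# they are placeholders, not part of any real workflow system.
def acquire_data(config):
    return list(range(config["n_records"]))

def clean_data(data, config):
    return [x for x in data if x >= config["min_value"]]

def analyze_data(data, config):
    return {"count": len(data), "mean": sum(data) / float(len(data))}

def report_results(results, config):
    print("[%s] %s" % (config["run_name"], results))

# The pipeline itself is just configuration plus an ordered sequence of steps,
# so the same analysis can be rerun with different parameters or resources.
config = {"run_name": "demo", "n_records": 100, "min_value": 10}

data = acquire_data(config)
data = clean_data(data, config)
results = analyze_data(data, config)
report_results(results, config)
```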
37. Some P's of Data Science: People, Process, Purpose
38. Purpose…
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
Use cases => purpose and value
39. Integration of Many Tools to Serve a Purpose
Need toolboxes with many tools for:
• data access,
• analysis,
• scalable execution,
• fault tolerance,
• provenance tracking,
• reporting,
• …
(Diagram adapted from B. Tierney, 2013; labels include Business Analysis and Operations Research.)
40. Many Alternatives: Build → Explore → Scale → Report
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for the exploration and production stages
41. Build Once, Run Many Times…
• The data science process should support experimental work and dynamic scalability on many platforms
• Scalability based on:
  - data volume and velocity
  - dynamic modeling needs
  - highly-optimized HPC codes
  - changes in network, storage and computing availability
43. Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
Resources include Gordon and Trestles, local cluster resources, NSF/DOE TeraScale resources (XSEDE: Gordon, Comet, Stampede, Lonestar), and private clusters (user-owned resources).
Different executables have different computing architecture needs, e.g., memory-intensive, compute-intensive, I/O-intensive!
44. Challenges for Heterogeneous Computing
• Dynamic scheduling optimization needed
  - Based on network availability
  - Data transfer and locality
  - Energy efficiency
  - Availability of exascale memory hierarchies
  - Workload changes
• Better programmable communication between workflow systems and the infrastructure for computing, storage and network
46. Using Big Data Computing in Bioinformatics
- Improving Programmability, Scalability and Reproducibility -
biokepler.org
47. bioKepler: a coordinated ecosystem of biological and technological packages for bioinformatics!
Layered architecture: gateways and other user environments → bioKepler → Kepler and provenance framework → BioLinux, Galaxy, CloVR, Hadoop, … → cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
48. The same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
49. Flexible programming of K-means
• R: programming language and software environment for statistical computing and graphics
• KNIME: platform for data analytics
• MLlib: scalable machine learning library running on the Spark cluster computing framework
• Mahout: scalable machine learning library based on MapReduce
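As one concrete instance of this flexibility, a minimal K-means run with Spark MLlib might look like the sketch below (assuming a local Spark installation; the toy points and k=2 are arbitrary). The same clustering step could equally be expressed in R, KNIME, or Mahout, which is the point of keeping it programmable.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[2]", "kmeans-sketch")

# Toy two-dimensional points; in practice this would be a large, partitioned data set.
points = sc.parallelize([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5],
])

# Train K-means on the distributed data; scalability comes from the partitions.
model = KMeans.train(points, k=2, maxIterations=10)

print("cluster centers:", model.clusterCenters)
print("cluster of [1.0, 1.0]:", model.predict([1.0, 1.0]))

sc.stop()
```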
50. Scalable Bayesian Network Learning
SBNL workflow (as a Kepler workflow): quality evaluation & data partitioning of Big Data → Local Learner (data quality evaluation, local ensemble learning) → Master Learner (master ensemble learning) → Final BN structure
51. BN Workflow
• Top-level workflow
  - PartitionData: RExpression actor that contains the R script for the data partitioning step
  - DDPNetworkLearner: composite actor using MapReduce to perform parallel ensemble learning
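The structure of that workflow (partition the data, learn locally on each partition, then combine the local results into a master model) can be sketched in a few lines of plain Python. This is only an illustration of the distributed data-parallel ensemble pattern; the actual SBNL workflow learns Bayesian network structures with an R script and a MapReduce actor, which the toy averaging below does not attempt to reproduce.

```python
import random
from functools import reduce

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

# "PartitionData": split the input into disjoint blocks, one per map task.
def partition(values, n_parts):
    size = len(values) // n_parts
    return [values[i * size:(i + 1) * size] for i in range(n_parts)]

# Local learner (map): each block yields a local model; here just a mean and a
# count, standing in for a locally learned network structure.
def local_learn(block):
    return {"mean": sum(block) / len(block), "n": len(block)}

# Master learner (reduce): merge the local models into one ensemble result.
def combine(a, b):
    n = a["n"] + b["n"]
    mean = (a["mean"] * a["n"] + b["mean"] * b["n"]) / n
    return {"mean": mean, "n": n}

local_models = [local_learn(block) for block in partition(data, 4)]
master_model = reduce(combine, local_models)
print(master_model)
```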
52. WorDS – Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, YARN, Cascading
• Technology briefings and applied classes on end-to-end support for data science
53. Using Workflows and Cyberinfrastructure
for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
54. A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire
55. What is lacking in disaster management today is… a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers… before, during and after a firestorm.
57. Molecular Dynamics CADD Workflow (Amber Molecular Dynamics Package)
Execution options: local execution, GPU or Gordon execution, and a user MD-parameter configuration option
Resources: local NBCR cluster resources; NSF/DOE TeraScale resources (XSEDE: Stampede); NBCR and user-owned cloud resources (Comet)
BENEFITS:
• Enable users to configure MD job parameters through a command-line, GUI or web interface
• Scale for multiple compounds in parallel
• Run on multiple computing platforms
• Increase reuse
• Provenance
60. To Sum Up
• Workflows and provenance are well-adopted in scientific infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  - Many diverse environments and requirements
  - Need to orchestrate at a higher level
  - Higher-level programming components for each domain
• Lots of future challenges:
  - Optimized execution on heterogeneous platforms
  - Programmable interfaces to workload, storage and network needed
  - Increasing reuse within and across application domains
  - Querying and integration of workflow provenance data into performance prediction