Bridging Big Data and Data Science Using Scalable Workflows
1. Bridging Big Data and Data Science using Scalable Workflows
ILKAY ALTINTAS, Ph.D.
altintas@sdsc.edu
Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego
WorDS.sdsc.edu
2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data” (1985 to today)
3. Workflows for Data Science (WorDS) Center
• Scientific workflow automation technologies research
• Workflows for cloud systems
• Big data applications
• Reproducible science
• Workforce training and education
• Development and consulting services
Focus on the question, not the technology!
10+ years of data science R&D experience as a Center.
4. Why Data Science Workflows?
“You've got to think about big things while you're doing small things, so that all the small things go in the right direction.” – Alvin Toffler
The big things here are the use cases => purpose and value.
5. So, what is a workflow?
Shop -> Prepare -> Cook -> Store
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
6. Let’s make pasta this evening!
Shop (30 minutes) -> Prepare (30 minutes) -> Cook (15 minutes) -> Store (3 minutes)
7. How to Cook Everything Fast
“How to Cook Everything Fast is a book of kitchen innovations. Time management—the essential principle of fast cooking—is woven into revolutionary recipes that do the thinking for you. You’ll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read—and let the recipes guide you quickly and easily toward a delicious result.”
Image and quote source: amazon.com
9. MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie
10. REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine, with veggie type as key
• Output: a bowl of veggies per veggie kind
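To make the analogy concrete, here is a minimal sketch of the same chop/combine pattern in plain Python. The veggie records and the UDF bodies are illustrative assumptions, not code from the talk:

```python
from collections import defaultdict

def map_chop(veggie):
    """Map-phase UDF: 'chop' one input record into (key, value) pairs."""
    kind, count = veggie
    return [(kind, "piece") for _ in range(count)]

def reduce_combine(kind, pieces):
    """Reduce-phase UDF: combine all values that share a key (veggie type)."""
    return f"a bowl of {len(pieces)} {kind} pieces"

veggies = [("carrot", 3), ("onion", 2), ("pepper", 4)]

# Map: apply the UDF to every input record.
mapped = [pair for veggie in veggies for pair in map_chop(veggie)]

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for kind, piece in mapped:
    groups[kind].append(piece)

# Reduce: apply the combine UDF once per key.
for kind, pieces in sorted(groups.items()):
    print(reduce_combine(kind, pieces))
```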
11. Thanksgiving dinner preparation: more planning and tasks?

Menu Item        Preparation Time  Cooking Time  Cooling Time
Turkey           30 minutes        4 hours       15 minutes
Veggies          30 minutes        45 minutes    None
Cranberry Sauce  5 minutes         30 minutes    2 hours
Soup             20 minutes        30 minutes    None
Pie              30 minutes        5 minutes     1 day

• When do you start cooking?
• What order do you cook?
• Can you cook some menu items in parallel? (see the sketch below)
• Who cooks what?
• …
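Below is a toy sketch of one of these planning questions: given the durations in the table and assuming dishes can be prepared fully in parallel by different cooks, when must each dish start so that everything is ready at dinner time? The dinner time and the full-parallelism assumption are illustrative:

```python
from datetime import datetime, timedelta

# Per-dish (prep, cook, cool) durations in minutes, from the table above.
dishes = {
    "Turkey":          (30, 240, 15),
    "Veggies":         (30, 45, 0),
    "Cranberry Sauce": (5, 30, 120),
    "Soup":            (20, 30, 0),
    "Pie":             (30, 5, 24 * 60),
}

dinner = datetime(2014, 11, 27, 17, 0)  # hypothetical dinner time

for name, (prep, cook, cool) in dishes.items():
    total = timedelta(minutes=prep + cook + cool)
    # Latest possible start if this dish runs uninterrupted and in parallel.
    print(f"{name:15s} start by {dinner - total:%a %H:%M} (total {total})")
```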
12. Data Science Workflows - Programmable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Examples: Real-Time Hazards Management (wifire.ucsd.edu), Data-Parallel Bioinformatics (bioKepler.org), Scalable Automated Molecular Dynamics and Drug Discovery (nbcr.ucsd.edu)
kepler-project.org | WorDS.sdsc.edu
14. The Big Picture is Supporting the Scientist
From “Napkin Drawings” to Executable Workflows: conceptual SWF -> executable SWF
Example components: Fasta File, Circonspect, Average Genome Size, Combine Results, PHACCS
15. The Big Picture is Supporting the Data Scientist
From “Napkin Drawings” to Executable Workflows…: conceptual SWF -> executable SWF
SBNL workflow: Data Quality Evaluation -> Big Data Partitioning -> Local Learner / Local Ensemble Learning -> Master Learner / Master Ensemble Learning -> Final BN Structure
Insurance and traffic data analytics using big data Bayesian network learning
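The SBNL pipeline above follows a common partition/ensemble pattern. Here is a schematic sketch of that pattern; the candidate edges, the toy local learner, and the majority-vote combination are illustrative assumptions, not the SBNL implementation:

```python
from collections import Counter

CANDIDATE_EDGES = [("weather", "accident"), ("speed", "accident"),
                   ("age", "claim"), ("accident", "claim")]

def local_learner(partition):
    """Hypothetical learner: 'finds' more edges the more data it sees."""
    return {e for i, e in enumerate(CANDIDATE_EDGES) if len(partition) > i}

records = list(range(10))                    # stand-in for insurance/traffic records
k = 4
partitions = [records[i::k] for i in range(k)]           # big data partitioning
local_models = [local_learner(p) for p in partitions]    # local ensemble learning

# Master ensemble learning: keep edges a majority of local learners agree on.
votes = Counter(edge for model in local_models for edge in model)
final_bn_structure = sorted(e for e, n in votes.items() if n > k / 2)
print(final_bn_structure)
```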
16. Kepler is a Scientific Workflow System
www.kepler-project.org
• A cross-project collaboration … initiated August 2003
• Version 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
Ptolemy II: a laboratory for investigating design
KEPLER: a problem-solving environment for scientific workflows
KEPLER = “Ptolemy II + X” for scientific workflows
17. A Toolbox with Many Tools
• Data: search, database access, I/O operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
Need expertise to identify which tool to use, when, and how!
Requires computation models to schedule and optimize execution!
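As one concrete illustration of two such tools, here is a minimal sketch (an assumption for illustration, not Kepler's API) of fault tolerance and provenance tracking wrapped around an arbitrary workflow step:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def tracked_step(retries=3, delay=1.0):
    """Decorator: record provenance for a step and retry it on failure."""
    def wrap(func):
        @functools.wraps(func)
        def run(*args, **kwargs):
            for attempt in range(1, retries + 1):
                start = time.time()
                try:
                    result = func(*args, **kwargs)
                    logging.info("provenance: step=%s args=%s secs=%.3f attempt=%d",
                                 func.__name__, args, time.time() - start, attempt)
                    return result
                except Exception as exc:
                    logging.warning("step %s failed (%s), attempt %d/%d",
                                    func.__name__, exc, attempt, retries)
                    time.sleep(delay)
            raise RuntimeError(f"step {func.__name__} exhausted its retries")
        return run
    return wrap

@tracked_step(retries=2, delay=0.1)
def fetch_data(url):
    return f"contents of {url}"  # stand-in for a real I/O operation

print(fetch_data("http://example.org/data.csv"))
```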
18. So, how does this relate to data science and big data?
19. Workflows integrate data science building blocks!
Toolboxes with many tools for:
• data access,
• analysis,
• execution,
• fault tolerance,
• provenance tracking,
• reporting,
• ...
Business | Analysis | Operations | Research
Adapted from: B. Tierney, 2013
20. Data Scientist Skill Set
http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
22. Solution: Scale Your Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is “computational and data scientists”.
24. 1: Start with the Workflow as a Blackbox
• Treat the whole workflow as a blackbox
  – What is the use case/application?
    • What is the science question this workflow is solving?
  – What is the input data?
  – What are the expected outcomes?
Input data -> f (my workflow) -> Outputs
• Give the workflow a title based on the initial assessment!
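A minimal sketch of this step, capturing the blackbox view as a plain record before any implementation exists; the fields and example values are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class WorkflowBlackbox:
    title: str                 # named from the initial assessment
    science_question: str      # what the workflow is solving
    input_data: list           # what goes in
    expected_outcomes: list    # what should come out

wf = WorkflowBlackbox(
    title="Viral community structure from metagenomic reads",
    science_question="How diverse is the viral community in this sample?",
    input_data=["reads.fasta"],
    expected_outcomes=["community structure estimate"],
)
print(wf)
```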
25. 2: Conceptualization of Scientific Steps
Bioinformatics example: Fasta File, Circonspect, Average Genome Size, Combine Results, PHACCS
Thanksgiving example:
• Bake Turkey: …, Cook, Chill, …
• Bake Pie: …, Prepare, Cook, …
• Make Cranberry Sauce: …
• Make Side Dishes: Cut Veggies, Prepare Stuffing, …
26. 3: Treat Each Step Like a Workflow
- until you reach an atomic functional step -
• SHOP: find data, access data, acquire data, move data
• PREPARE: clean data, integrate data, subset data, pre-process data
• COOK: analyze data, process data
• STORE: interpret results, summarize results, visualize results, post-process results
Some questions to ask (see the sketch after this list):
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
The goal: configurable, automated analysis.
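A short sketch of this refinement: each phase is itself a composition of smaller functions, split further until the steps are atomic. The step names mirror the slide; the bodies are placeholder assumptions:

```python
# Atomic steps (placeholders for real data operations).
def find_data():                 return "http://example.org/data.csv"
def acquire_data(source):        return f"raw rows from {source}"
def clean_data(raw):             return raw.replace("raw", "clean")
def analyze_data(rows):          return f"summary of ({rows})"
def visualize_results(summary):  return f"plot of {summary}"

# Each phase is a small workflow over its atomic steps.
def shop():          return acquire_data(find_data())
def prepare(raw):    return clean_data(raw)
def cook(cleaned):   return analyze_data(cleaned)
def store(result):   return visualize_results(result)

# The top-level workflow is just the composition of its sub-workflows.
print(store(cook(prepare(shop()))))
```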
27. 4: Start Building Each Step, Including the Alternatives
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages
Build -> Explore -> Scale -> Report
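One way to keep the alternatives first-class is to register several interchangeable implementations per step and pick one by stage. A toy sketch, where all names and the lambda bodies are illustrative assumptions:

```python
# Interchangeable implementations of the same logical step.
ALTERNATIVES = {
    "align": {
        "explore": lambda seqs: f"quick alignment of {len(seqs)} sequences",
        "scale":   lambda seqs: f"data-parallel alignment of {len(seqs)} sequences",
    },
}

def run_step(step, stage, data):
    """Dispatch a workflow step to the alternative suited to the stage."""
    return ALTERNATIVES[step][stage](data)

sequences = ["ACGT", "ACGA", "TCGT"]
print(run_step("align", "explore", sequences))  # while building and exploring
print(run_step("align", "scale", sequences))    # when scaling to production
```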
28. Running on Heterogeneous Computing Resources
- Executing models where they run most efficiently -
Different models have different computing architecture needs, e.g., memory-intensive, compute-intensive, I/O-intensive.
• Local: NBCR cluster resources
• NSF/DOE: TeraScale resources (XSEDE): Gordon, Trestles, Lonestar, Stampede
• Private cluster: user-owned resources
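A toy sketch of this placement decision: tag each resource with what it is good at, then match each step's dominant need. The capability table and the chooser are illustrative assumptions, not WorDS code:

```python
# Which architectural needs each resource serves well (illustrative).
RESOURCES = {
    "NBCR cluster":    {"io"},               # local
    "Gordon":          {"memory", "io"},     # XSEDE
    "Stampede":        {"compute"},          # XSEDE
    "private cluster": {"compute", "io"},    # user-owned
}

def place(step_name, need):
    """Pick the first resource whose capabilities cover the step's need."""
    for resource, capabilities in RESOURCES.items():
        if need in capabilities:
            return f"{step_name} -> {resource}"
    raise ValueError(f"no resource offers {need!r}")

print(place("genome assembly", "memory"))
print(place("molecular dynamics", "compute"))
```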
29. 5: Save and Share Reports and Final Products with your Team
• The data scientist is in the middle, bridging the gap between business and development -> so the data scientist defines the business value and the steps to achieve the results as a workflow
• Developers/computer scientists use their favorite tools to implement the methods in the workflow
• The process is kept sharable, standardized, scalable and accountable
30. WorDS – Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, YARN, Cascading
• Technology briefings and applied classes on end-to-end support for data science
31. Using Big Data Computing in Bioinformatics
- Improving Programmability, Scalability and Reproducibility -
bioKepler.org
32. www.bioKepler.org
Gateways and other user environments
Kepler and the Provenance Framework
bioKepler components: BioLinux, Galaxy, CloVR, Hadoop, …
Cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE
A coordinated ecosystem of biological and technological packages for bioinformatics!
33. Status of bioActors
500+ bioActors are listed in the current bioKepler release; ~40 of them are parallelized.
34. Using Workflows and Cyberinfrastructure for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
35. Fire is Part of the Natural Ecology … but requires Monitoring, Prediction and Resilience
• Wildfires are critical for ecology, but volatile
• Fuel load is high due to fire suppression over the last century
• Changes in rainfall, wind and seasons, and thus wildfires, are potentially induced by climate change
• Better prevention, prediction and maintenance of wildfires is needed
Disaster management of (ongoing) wildfires heavily relies on understanding their direction and rate of spread (RoS).
Photo of Harris Fire (2007) by former Fire Captain Bill Clayton
36. Fire Data Today
Decision making for wildfire fighting and disaster management is based on heterogeneous data: satellite data, wildfire perimeter, wind, vegetation, terrain.
Photograph by Mark Thiessen
37. What is lacking in disaster management today is… a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers… before, during and after a firestorm.
38. A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of “cyberinfrastructure” for “analysis of large dimensional heterogeneous real-time sensed data” for fire resilience before, during and after a wildfire.
39. Data to Modeling in WIFIRE
Real-time remote sensor data -> modeling, data assimilation and dynamic wildfire behavior prediction.
40. WIFIRE System Integration
System integration of sensor data, data assimilation, dynamic models, and fire direction and RoS predictions (computations) is based on scientific and engineering workflows (Kepler):
• Visual programming
• Scalable parallel execution
• Standardized data interfaces
• Reuse and reproducibility
41. Training and Consulting Services in the WorDS Center
• Ongoing programs for workflow bootcamps and hackathons
• Technology briefings for industrial partners
• Industry labs for undergraduate student researchers
• Consulting projects on workflow technologies
42. To Sum Up
• Workflows and provenance are well-adopted, and successful, in scientific data science infrastructures today
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  • Many diverse environments and requirements
  • Need to orchestrate at a higher level
  • Higher-level programming components for each domain
• Lots of future challenges in:
  • Optimized execution on heterogeneous platforms
  • Increasing reuse within and across application domains
  • Querying and integration of workflow provenance data