ISUM 2015 Keynote
Summary: Computational and data science is about extracting knowledge from data and from modeling. This end goal can only be achieved through a craft that combines people, processes, computational and Big Data platforms, application-specific purpose, and programmability. Publications and the provenance of the data products leading to these publications are also important. With this in mind, this talk defines a terminology for computational and data science applications and discusses why focusing on these concepts is important for executability and reproducibility in computational and data science.
WorDS of Data Science in the Presence of Heterogeneous Computing Architectures
1. WorDS of Data Science in the Presence of Heterogeneous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas
Founder and Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego
2. San Diego Supercomputer Center at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data"
1985 to today
3. Workflows for Data Science Center
• Scientific workflow automation technologies research
• Workflows for cloud systems
• Big Data applications
• Reproducible science
• Workforce training and education
• Development and consulting services
Focus on the question, not the technology!
10+ years of data science R&D experience as a Center.
4.
5. So, what is a workflow?
Shop → Prepare → Cook → Store
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
6. Let's make pasta this evening!
Shop → Prepare → Cook → Store
(Each step in the diagram is annotated with a time estimate, ranging from 3 to 30 minutes.)
7. How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management, the essential principle of fast cooking, is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read, and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
9. MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie
10. REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine, using veggie type as the key
• Output: a bowl of veggies per veggie kind
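The chop-then-combine idea can be written down directly. Below is a minimal, self-contained Python sketch of the map/reduce pattern using the cooking analogy; the veggie list and the chop/combine functions are made up for illustration and are not part of any real MapReduce framework.

```python
from collections import defaultdict

# Input: veggies (several of each kind).
veggies = ["carrot", "onion", "carrot", "celery", "onion", "carrot"]

# Map UDF: "chop" one veggie, emitting a (key, value) pair keyed by its kind.
def chop(veggie):
    return (veggie, "chopped " + veggie)

# Reduce UDF: combine all chopped pieces that share the same key
# into one bowl per veggie kind.
def combine(kind, pieces):
    return "a bowl with %d pieces of %s" % (len(pieces), kind)

# Map phase
mapped = [chop(v) for v in veggies]

# Shuffle phase: group the mapped values by key (veggie kind)
groups = defaultdict(list)
for kind, piece in mapped:
    groups[kind].append(piece)

# Reduce phase
for kind, pieces in groups.items():
    print(combine(kind, pieces))
```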
11. Thanksgiving dinner preparation: more planning and tasks?

Menu Item          Preparation Time   Cooking Time   Cooling Time
Turkey             30 minutes         4 hours        15 minutes
Veggies            30 minutes         45 minutes     None
Cranberry Sauce    5 minutes          30 minutes     2 hours
Soup               20 minutes         30 minutes     None
Pie                30 minutes         5 minutes      1 day

• When do you start cooking?
• In what order do you cook?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
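These planning questions can be made concrete with a toy calculation. The Python sketch below uses the table above and assumes, simplifying heavily, that only preparation needs the cook's attention while cooking and cooling can proceed unattended; the numbers and logic are only illustrative.

```python
# (preparation, cooking, cooling) times in minutes, from the table above;
# the pie's "1 day" of cooling is written as 24 * 60 minutes.
menu = {
    "Turkey":          (30, 4 * 60, 15),
    "Veggies":         (30, 45, 0),
    "Cranberry Sauce": (5, 30, 2 * 60),
    "Soup":            (20, 30, 0),
    "Pie":             (30, 5, 24 * 60),
}

# If nothing overlaps, each dish must start this long before dinner.
for dish, times in sorted(menu.items(), key=lambda kv: -sum(kv[1])):
    print("%-15s needs %5.1f hours of lead time" % (dish, sum(times) / 60.0))

# If cooking and cooling run unattended, the cook's hands-on time is only
# the preparation column, so several dishes can effectively run in parallel.
hands_on = sum(prep for prep, cook, cool in menu.values())
print("Total hands-on preparation time: %d minutes" % hands_on)
```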
12. Data Science Workflows
- Programmable, Reusable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Example applications: Real-Time Hazards Management (wifire.ucsd.edu), Data-Parallel Bioinformatics (bioKepler.org), Scalable Automated Molecular Dynamics and Drug Discovery (nbcr.ucsd.edu)
kepler-project.org | WorDS.sdsc.edu
14. The Big Picture is Supporting the Scientist
From "napkin drawings" (conceptual SWF) to executable workflows (executable SWF)
Example workflow: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
15. The Big Picture is Supporting the Data Scientist
From "napkin drawings" (conceptual SWF) to executable workflows (executable SWF)…
Example: the SBNL workflow for insurance and traffic data analytics using Big Data Bayesian network learning
Quality evaluation & data partitioning of Big Data → Local Learner (data quality evaluation, local ensemble learning) → Master Learner (master ensemble learning) → Final BN structure
16. Kepler is a Scientific Workflow System
• Ptolemy II: a laboratory for investigating design
• KEPLER: a problem-solving environment for scientific workflows
• KEPLER = Ptolemy II + X for scientific workflows
• A cross-project collaboration, initiated August 2003
• Version 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
17. A Toolbox with Many Tools
Need expertise to identify which tool to use, when, and how!
Requires computation models to schedule and optimize execution!
• Data: search, database access, I/O operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
18. So, how does this relate to data science, big data and supercomputing?
19. Distributed Computing: using more than one computer connected through a network
• Types of distributed computing:
  - Computers in a local area network
  - Cluster or high-performance computing
  - Grid
  - Cloud computing
20. Cluster or High-Performance Computing
• Built from multiple computers
• May have:
  - a parallel file system
  - a high-speed network
• Provides a scheduler to manage the machines and submitted jobs
  - SGE/OGE, PBS, Condor, LSF, SLURM
21. Parallelization: multiple processes or threads running at the same time
• Execution environments
  - One machine
  - Distributed machines
• Parallelism types
  - Computation/task parallelism
  - Data parallelism
  - Pipeline parallelism
22. There are different styles of parallelism!
(Diagram: three panels for task parallelism, data parallelism and pipeline parallelism, each showing an input data set and tasks in running, waiting or finished states.)
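As an illustration of the data-parallel style in particular, the Python sketch below splits an input data set across worker processes that all run the same task; the task itself (squaring numbers) is a stand-in chosen only to keep the example self-contained.

```python
from multiprocessing import Pool

# The same user-defined task is applied to every element of the data set;
# in the data-parallel style it runs on many partitions at the same time.
def task(x):
    return x * x

if __name__ == "__main__":
    input_data_set = list(range(16))

    # One process works through the whole data set sequentially.
    sequential = [task(x) for x in input_data_set]

    # Data parallelism: the data set is split into chunks and four worker
    # processes run the same task on their own chunks concurrently.
    with Pool(processes=4) as pool:
        parallel = pool.map(task, input_data_set, chunksize=4)

    assert sequential == parallel
    print(parallel)
```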
23. Big Data: Short Definition
Some features ("V's") of big data:
• Volume: amount of data
• Velocity: speed of data in and out
• Variety: range of data types and sources
• Veracity: trustworthiness of data
Picture credit: IBM, 2012
24. Distributed Data-Parallel Computing: MapReduce
• A parallel and scalable programming model for Big Data
  - Input data is automatically partitioned onto multiple nodes
  - Programs are distributed and executed in parallel on the partitioned data blocks
Move the program to the data!
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
25. Distributed Data-Parallel (DDP) Patterns: patterns for data distribution and parallel data processing
• A higher-level programming model
  - Moving computation to data
  - Good scalability and performance acceleration
  - Run-time features such as fault tolerance
  - Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
26. Hadoop
• Open-source implementation of MapReduce
• A distributed file system across compute nodes (HDFS)
  - Automatic data partitioning
  - Automatic data replication
• Master and workers/slaves architecture
• Automatic task re-execution for failed tasks

Spark
• Fast Big Data engine: keeps data in memory as much as possible
• Resilient Distributed Datasets (RDDs)
  - Evaluated lazily
  - Keep track of lineage for fault tolerance
• More operators than just Map and Reduce
• Can run on YARN (Hadoop v2)
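For concreteness, a minimal PySpark sketch of these ideas is shown below, assuming a local Spark installation; the sample text and the word-count logic are placeholders. Transformations such as flatMap and reduceByKey only build up RDD lineage; nothing runs until an action such as collect is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-sketch")

# Transformations below only record lineage; nothing executes yet (lazy evaluation).
lines = sc.parallelize([
    "move the program to the data",
    "keep the data in memory",
])
words = lines.flatMap(lambda line: line.split())   # an operator beyond plain map/reduce
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action triggers execution; a lost partition could be recomputed
# from the recorded lineage, which is how RDDs provide fault tolerance.
print(counts.collect())

sc.stop()
```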
28. My favorite definition of Data Science
"By 'Data Science', we mean almost everything that has something to do with data: collecting, analyzing, modeling… yet the most important part is its applications, all sorts of applications."
Journal of Data Science (http://www.jds-online.com/about)
Implies: programming, data analysis, and problem solving.
29. Some P's of Data Science: People, Process, Platforms, Purpose, Programmability
30. There are more: provenance, publication, product, performance, policy, profit, …
34. Solution: Scale the Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
36. Defining a Typical Data Science Process
Find data → Access data → Acquire data → Move data → Clean data → Integrate data → Subset data → Pre-process data → Analyze data → Process data → Interpret results → Summarize results → Visualize results → Post-process results
Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
The goal: configurable, automated analysis.
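One way to read "configurable, automated analysis" is as a pipeline whose steps and parameters are declared as data rather than hard-coded. The Python sketch below is only illustrative: the step functions are hypothetical stand-ins for the acquire/clean/analyze/report stages named above, not part of any WorDS or Kepler API.

```python
# Hypothetical step functions standing in for the stages listed above;
# they are placeholders, not part of any real workflow system.
def acquire_data(config):
    return list(range(config["n_records"]))

def clean_data(data, config):
    return [x for x in data if x >= config["min_value"]]

def analyze_data(data, config):
    return {"count": len(data), "mean": sum(data) / float(len(data))}

def report_results(results, config):
    print("[%s] %s" % (config["run_name"], results))

# The pipeline itself is just configuration plus an ordered sequence of steps,
# so the same analysis can be rerun with different parameters or resources.
config = {"run_name": "demo", "n_records": 100, "min_value": 10}

data = acquire_data(config)
data = clean_data(data, config)
results = analyze_data(data, config)
report_results(results, config)
```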
37. Some P's of Data Science: People, Process, Purpose
38. Purpose…
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
Use cases => purpose and value
39. Integration of Many Tools to Serve a Purpose
Need toolboxes with many tools for:
• data access,
• analysis,
• scalable execution,
• fault tolerance,
• provenance tracking,
• reporting,
• …
(Diagram adapted from B. Tierney, 2013; labels include Business Analysis and Operations Research.)
40. Many Alternatives: Build → Explore → Scale → Report
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for the exploration and production stages
41. Build Once, Run Many Times…
• The data science process should support experimental work and dynamic scalability on many platforms
• Scalability based on:
  - data volume and velocity
  - dynamic modeling needs
  - highly-optimized HPC codes
  - changes in network, storage and computing availability
43. Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
Resources include Gordon and Trestles, local cluster resources, NSF/DOE TeraScale resources (XSEDE: Gordon, Comet, Stampede, Lonestar), and private clusters (user-owned resources).
Different executables have different computing architecture needs, e.g., memory-intensive, compute-intensive, I/O-intensive!
44. Challenges for Heterogeneous Computing
• Dynamic scheduling optimization needed
  - Based on network availability
  - Data transfer and locality
  - Energy efficiency
  - Availability of exascale memory hierarchies
  - Workload changes
• Better programmable communication between workflow systems and the infrastructure for computing, storage and network
46. Using Big Data Computing in Bioinformatics
- Improving Programmability, Scalability and Reproducibility -
biokepler.org
47. bioKepler: a coordinated ecosystem of biological and technological packages for bioinformatics!
Layered architecture: gateways and other user environments → bioKepler → Kepler and provenance framework → BioLinux, Galaxy, CloVR, Hadoop, … → cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
48. The same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
49. Flexible programming of K-means
• R: programming language and software environment for statistical computing and graphics
• KNIME: platform for data analytics
• MLlib: scalable machine learning library running on the Spark cluster computing framework
• Mahout: scalable machine learning library based on MapReduce
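As one concrete instance of this flexibility, a minimal K-means run with Spark MLlib might look like the sketch below (assuming a local Spark installation; the toy points and k=2 are arbitrary). The same clustering step could equally be expressed in R, KNIME, or Mahout, which is the point of keeping it programmable.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[2]", "kmeans-sketch")

# Toy two-dimensional points; in practice this would be a large, partitioned data set.
points = sc.parallelize([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5],
])

# Train K-means on the distributed data; scalability comes from the partitions.
model = KMeans.train(points, k=2, maxIterations=10)

print("cluster centers:", model.clusterCenters)
print("cluster of [1.0, 1.0]:", model.predict([1.0, 1.0]))

sc.stop()
```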
50. Scalable Bayesian Network Learning
SBNL workflow (as a Kepler workflow): quality evaluation & data partitioning of Big Data → Local Learner (data quality evaluation, local ensemble learning) → Master Learner (master ensemble learning) → Final BN structure
51. BN Workflow
• Top-level workflow
  - PartitionData: RExpression actor that contains the R script for the data partitioning step
  - DDPNetworkLearner: composite actor using MapReduce to perform parallel ensemble learning
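The structure of that workflow (partition the data, learn locally on each partition, then combine the local results into a master model) can be sketched in a few lines of plain Python. This is only an illustration of the distributed data-parallel ensemble pattern; the actual SBNL workflow learns Bayesian network structures with an R script and a MapReduce actor, which the toy averaging below does not attempt to reproduce.

```python
import random
from functools import reduce

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]

# "PartitionData": split the input into disjoint blocks, one per map task.
def partition(values, n_parts):
    size = len(values) // n_parts
    return [values[i * size:(i + 1) * size] for i in range(n_parts)]

# Local learner (map): each block yields a local model; here just a mean and a
# count, standing in for a locally learned network structure.
def local_learn(block):
    return {"mean": sum(block) / len(block), "n": len(block)}

# Master learner (reduce): merge the local models into one ensemble result.
def combine(a, b):
    n = a["n"] + b["n"]
    mean = (a["mean"] * a["n"] + b["mean"] * b["n"]) / n
    return {"mean": mean, "n": n}

local_models = [local_learn(block) for block in partition(data, 4)]
master_model = reduce(combine, local_models)
print(master_model)
```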
52. WorDS – Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, YARN, Cascading
• Technology briefings and applied classes on end-to-end support for data science
53. Using Workflows and Cyberinfrastructure
for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
54. A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire
55. What is lacking in disaster management today is… a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers… before, during and after a firestorm.
57. Molecular Dynamics CADD Workflow (Amber Molecular Dynamics Package)
Execution options: local execution, GPU or Gordon execution, and a user MD-parameter configuration option
Resources: local NBCR cluster resources; NSF/DOE TeraScale resources (XSEDE: Stampede); NBCR and user-owned cloud resources (Comet)
BENEFITS:
• Enable users to configure MD job parameters through a command-line, GUI or web interface
• Scale for multiple compounds in parallel
• Run on multiple computing platforms
• Increase reuse
• Provenance
60. To Sum Up
• Workflows and provenance are well-adopted in scientific infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  - Many diverse environments and requirements
  - Need to orchestrate at a higher level
  - Higher-level programming components for each domain
• Lots of future challenges:
  - Optimized execution on heterogeneous platforms
  - Programmable interfaces to workload, storage and network needed
  - Increasing reuse within and across application domains
  - Querying and integration of workflow provenance data into performance prediction