A Maturing Role of Workflows in the Presence of Heterogeneous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas
Founder and Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego
  
	
  
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
•  Established as a national supercomputer resource center in 1985 by NSF
•  A world leader in HPC, data-intensive computing, and scientific data management
•  Current strategic focus on “Big Data” and “Data-intensive HPC”
[Timeline figure: 1985 to today]
  
 
	
  
	
  
Workflows for Data Science Center
•  Scientific Workflow Automation Technologies Research
•  Workflows for Cloud Systems
•  Big Data Applications
•  Reproducible Science
•  Workforce Training and Education
•  Development and Consulting Services
Focus on the question, not the technology!
10+ years of data science R&D experience as a Center.
  
Computational Data Science Workflows
- Programmable, Reusable and Reproducible Scalability -
•  Access and query data
•  Scale computational analysis
•  Increase reuse
•  Save time, energy and money
•  Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org, WorDS.sdsc.edu
  
The Big Picture is to Capture the Workflow in an Executable and Reusable Way
From “Napkin Drawings” to Executable Workflows…
Conceptual SWF
Executable SWF
[Figure: SBNL workflow, with components Local Learner, Data Quality Evaluation, Local Ensemble Learning, Quality Evaluation & Data Partitioning, Big Data, Master Learner, Master Ensemble Learning, Final BN Structure]
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
  
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflow
KEPLER = Ptolemy II + X for Scientific Workflows
Kepler is a Scientific Workflow System
www.kepler-project.org
•  A cross-project collaboration … initiated August 2003
•  2.5 will be released soon
•  Builds upon the open-source Ptolemy II framework
Kepler can be applied to problems in different scientific disciplines: some here and many more…
•  Astrophysics, e.g., DIAPL
•  Nanotechnology, e.g., ANELLI
•  Fusion, e.g., ITER
•  Metagenomics, e.g., CAMERA
•  Multi-scale biology, e.g., NBCR
  
A Toolbox with Many Tools
Need expertise to identify which tool to use when and how!
Require computation models to schedule and optimize execution!
•  Data
–  Search, database access, IO operations, streaming data in real-time…
•  Compute
–  Data-parallel patterns, external execution, …
•  Network operations
•  Provenance and fault tolerance
  


So, how can we use workflows in the context of applications?
… while coupling all scales of computing within a reusable solution…
Some P’s to focus on…
People
Process
Platforms
Purpose
Programmability
There are more: provenance, publication, product, performance, policy, profit, ...
  	
  
People…
Computational Data Scientist Skill Set
http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
Need to communicate!
  
Process
A Typical Workflow-Driven Process
•  Find data
•  Access data
•  Acquire data
•  Move data
•  Clean data
•  Integrate data
•  Subset data
•  Pre-process data
•  Analyze data
•  Process data
•  Interpret results
•  Summarize results
•  Visualize results
•  Post-process results
Some questions to ask:
•  Where and how do I get the data?
•  What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
•  How do I integrate or subset datasets, e.g., knowledge representation, …?
•  How do I analyze the data and what is the analysis function?
•  What are the parameters to customize each step?
•  What are the computing needs to schedule and run each step?
•  How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
configurable automated analysis
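The idea of a configurable, automated analysis can be sketched in a few lines of Python. This is only an illustration under assumptions: the step names (`clean`, `subset`, `analyze`), the `min_value` parameter and the config layout are hypothetical and do not come from Kepler or any WorDS tool. The point is that the steps are fixed code while the parameters live in a separate configuration.

```python
# Hypothetical steps; each takes the data plus a dict of step parameters.
def clean(records, params):
    # Drop records with missing values.
    return [r for r in records if r is not None]

def subset(records, params):
    # Keep only records at or above a configurable threshold.
    return [r for r in records if r >= params["min_value"]]

def analyze(records, params):
    # A stand-in analysis function: count and mean.
    return {"count": len(records), "mean": sum(records) / len(records)}

# Registry that maps step names used in a config to callables.
STEPS = {"clean": clean, "subset": subset, "analyze": analyze}

def run_workflow(config, data):
    """Run the configured steps in order, feeding each result to the next."""
    result = data
    for step in config:
        result = STEPS[step["name"]](result, step.get("params", {}))
    return result

# The workflow is described as data, so it can be rerun with new parameters.
config = [
    {"name": "clean"},
    {"name": "subset", "params": {"min_value": 10}},
    {"name": "analyze"},
]
summary = run_workflow(config, [5, None, 12, 18])
```

Changing `min_value` in `config` reruns the same process with a different setting, without touching the step code.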
  
Purpose…
People, Process, Purpose
“You've got to think about big things while you're doing small things, so that all the small things go in the right direction.”
– Alvin Toffler
use cases => purpose and value
  
 
	
  
	
  
	
  
	
  
Need toolboxes with many tools for:
•  data access,
•  analysis,
•  scalable execution,
•  fault tolerance,
•  provenance tracking,
•  reporting
•  ...
Integration of Many Tools to Serve a Purpose
•  Alternative tools
•  Multiple modes of scalability
•  Support for each step of the development and production process
•  Different reporting needs for exploration and production stages
  
Build, Explore, Scale, Report
Build Once, Run Many Times…
•  Data science process should support experimental work and dynamic scalability on many platforms
•  Scalability based on:
–  data volume and velocity
–  dynamic modeling needs
–  highly-optimized HPC codes
–  changes in network, storage and computing availability
  
There are different styles of parallelism!
[Figure: diagrams of parallel execution styles. Panels show Task1 to Task4, and Task1 to Task3 processing items 1, 2, 3 of an input data set, with task states Finished, Running and Waiting.]
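A small Python sketch can make the difference between these styles concrete. It is an illustration only: the task functions and helper names are hypothetical, threads stand in for compute nodes, and the pipeline helper chains stages lazily but does not run them concurrently as a real pipelined engine would.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tasks used only for illustration.
def clean(text):
    return text.strip()

def upper(text):
    return text.upper()

def length(text):
    return len(text)

def data_parallel(task, data, workers=3):
    """Data parallelism: the same task runs on every item at the same time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, data))

def task_parallel(tasks, value, workers=3):
    """Task parallelism: independent tasks run at the same time on one input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, value) for task in tasks]
        return [future.result() for future in futures]

def pipeline(tasks, data):
    """Pipeline style: each item flows through the tasks in order."""
    stream = iter(data)
    for task in tasks:
        stream = map(task, stream)  # lazy: stages are chained, not yet run
    return list(stream)

items = [" a ", " bb ", " ccc "]
cleaned = data_parallel(clean, items)
views = task_parallel([upper, length], "ab")
lengths = pipeline([clean, upper, length], items)
```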
 	
  
Distributed-Data Parallel Computing
•  A parallel and scalable programming model for Big Data
–  Input data is automatically partitioned onto multiple nodes
–  Programs are distributed and executed in parallel on the partitioned data blocks
MapReduce: Move program to data!
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
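The partition, map and reduce steps described above can be sketched as a word count in plain Python. This is a single-process illustration under assumptions, not Hadoop or any real MapReduce API: `partition`, `map_phase` and `reduce_phase` are hypothetical names, and the map calls run one after another here where a real engine would run one per data block on the node that holds the block.

```python
from collections import Counter
from functools import reduce

def partition(records, n_parts):
    """Split the input into n_parts blocks (round-robin for simplicity)."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_phase(block):
    """Count words within a single block; needs no data from other blocks."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts

def reduce_phase(left, right):
    """Merge two partial results by adding their counts."""
    return left + right

def word_count(records, n_parts=2):
    blocks = partition(records, n_parts)
    # In a real system each call below would run in parallel next to its block.
    partials = [map_phase(block) for block in blocks]
    return reduce(reduce_phase, partials, Counter())

totals = word_count(["a b", "b c", "a a"])
```

Because each `map_phase` call depends only on its own block, the program can be shipped to wherever the block is stored, which is the "move program to data" idea.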
Distributed Data-Parallel (DDP) Patterns
Patterns for data distribution and parallel data processing
•  A higher-level programming model
–  Moving computation to data
–  Good scalability and performance acceleration
–  Run-time features such as fault-tolerance
–  Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
  	
  
Hadoop
•  Open source implementation of MapReduce
•  A distributed file system across compute nodes (HDFS)
–  Automatic data partition
–  Automatic data replication
•  Master and workers/slaves architecture
•  Automatic task re-execution for failed tasks
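Automatic re-execution of failed tasks can be illustrated with a small retry loop. This is a toy sketch, not Hadoop's implementation: `run_with_reexecution` and `FlakyTask` are hypothetical names, and the retry happens in the same process, whereas a cluster framework would typically reschedule the task on a worker.

```python
def run_with_reexecution(task, block, max_attempts=3):
    """Run task on a data block, re-running it if it raises an exception."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(block), attempt
        except Exception as err:  # broad on purpose: any failure triggers a rerun
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

class FlakyTask:
    """Simulated task that fails a fixed number of times, then succeeds."""
    def __init__(self, failures):
        self.remaining = failures

    def __call__(self, block):
        if self.remaining > 0:
            self.remaining -= 1
            raise IOError("simulated worker failure")
        return sum(block)

result, attempts = run_with_reexecution(FlakyTask(failures=2), [1, 2, 3])
```

Re-execution is safe here because the task is a pure function of its input block, so running it again gives the same answer.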
  
Spark
•  Fast Big Data Engine
–  Keeps data in memory as much as possible
•  Resilient Distributed Datasets (RDDs)
–  Evaluated lazily
–  Keeps track of lineage for fault tolerance
•  More operators than just Map and Reduce
•  Can run on YARN (Hadoop v2)
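Lazy evaluation and lineage tracking can be shown with a toy class in plain Python. This is not the Spark API: `LazyDataset` is a hypothetical stand-in that only records transformations until `collect` is called, and can rebuild its result from the source by replaying the recorded lineage.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded list of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = tuple(lineage)  # sequence of (operation name, function)

    def map(self, fn):
        # Nothing is computed here; the operation is only recorded.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def lineage(self):
        """Names of the recorded operations, oldest first."""
        return [name for name, _ in self._lineage]

    def collect(self):
        """Replay the lineage over the source and return the results."""
        data = iter(self._source)
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)

calls = []

def double(x):
    calls.append(x)  # lets us observe when work actually happens
    return 2 * x

dataset = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda v: v > 4)
calls_before_collect = list(calls)  # still empty: nothing has run yet
values = dataset.collect()
```

If a computed result is lost, calling `collect` again recomputes it from the source and the lineage, which is the intuition behind lineage-based fault tolerance.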
  
Scalability across platforms…
People, Process, Platforms, Purpose
Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
Gordon, Trestles
•  Local Cluster Resources
•  NSF/DOE: TeraScale Resources (XSEDE): (Gordon) (Comet) (Stampede) (Lonestar)
•  Private Cluster: User Owned Resources
Different executables have different computing architecture needs!
e.g., memory-intensive, compute-intensive, I/O-intensive
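One way to picture matching executables to architectures is a simple placement rule. The sketch below is hypothetical throughout: the resource names, their numbers and the task profiles are made up for illustration and do not describe Gordon, Comet or any real system, and a real scheduler would weigh many more factors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    name: str
    memory_gb: int
    cores: int
    io_mbps: int

@dataclass(frozen=True)
class Task:
    name: str
    profile: str  # "memory", "compute" or "io"

# Which resource attribute matters most for each task profile.
KEY_ATTRIBUTE = {"memory": "memory_gb", "compute": "cores", "io": "io_mbps"}

def place(task, resources):
    """Pick the resource that is strongest in the attribute the task needs."""
    attribute = KEY_ATTRIBUTE[task.profile]
    return max(resources, key=lambda resource: getattr(resource, attribute))

# Made-up resources and tasks, for illustration only.
resources = [
    Resource("large_memory_node", memory_gb=1024, cores=32, io_mbps=500),
    Resource("many_core_node", memory_gb=128, cores=256, io_mbps=500),
    Resource("flash_storage_node", memory_gb=64, cores=16, io_mbps=4000),
]
tasks = [
    Task("assembly", "memory"),
    Task("md_simulation", "compute"),
    Task("format_conversion", "io"),
]
placement = {task.name: place(task, resources).name for task in tasks}
```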
  
Challenges for Heterogeneous Computing
•  Dynamic scheduling optimization
–  Based on network availability
–  Data transfer and locality
–  Energy efficiency
–  Availability of exascale memory hierarchies
–  Workload changes
–  Dynamic memory or file-based coupling
•  Better programmable communication between workflow systems and infrastructure for computing, storage and network
•  Harder form of reproducibility
•  Harder to program using scripts
  
Programmability for scalability, reusability and reproducibility
People, Process, Platforms, Purpose, Programmability
Using Big Data Computing in Bioinformatics
- Improving Programmability, Scalability and Reproducibility -
biokepler.org
bioKepler’s Conceptual Framework
[Figure: framework diagram linking a Bioinformatician and a Workflow to Kepler and bioKepler. Labeled components:]
•  bioActors: BLAST, HMMER, CD-HIT
•  Bioinformatics Tools: Clustering, Mapping, Assembly, Transfer
•  Data-Parallel Execution Patterns: Map-Reduce, Master-Slave, All-Pairs
•  Compute: Amazon EC2, FutureGrid, Sun Grid Engine, Adhoc Network, Triton Resource
•  Data: CAMERA, Ensembl, Genbank
•  Provenance: Execution History, Data Lineage
•  Reporting: PDF Generation, Report Designer
•  Fault-Tolerance: Error Handling, Alternatives
•  Run Manager: Tag, Search
•  Director, Scheduler, Executable Workflow Plan, Execution Engine
•  Customize & Integrate; Deploy & Execute
[Figure: layered ecosystem diagram. Labels, in order:]
•  Private Repositories, …, XSEDE
•  Gateways and other user environments
•  bioKepler
•  Kepler and Provenance Framework
•  BioLinux, Galaxy, Clovr, Hadoop, …
•  CLOUD and OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
A coordinated ecosystem of biological and technological packages for bioinformatics!
RAMMCAP - Rapid Clustering and Functional Annotation for Metagenomic Sequences
[Figure: the RAMMCAP tools QC, tRNA, cd-hit, hmmer, metagene and blast placed along four axes]
•  Data size: KB, MB, GB, TB
•  CPU time: Second, Hour, Day, Month, Year
•  Memory: GB, 10GB, 100GB
•  Parallel: No need, No, Multithreading, MPI, MapReduce
  
  
[Figure: NGS shown alongside the tools QC, tRNA, cd-hit, hmmer, metagene and blast on four axes]
•  Data size: KB, MB, GB, TB
•  CPU time: Minute, Hour, Day, Month, Year
•  Memory: GB, 10GB, 100GB
•  Parallel: No need, No, Multithreading, MPI, MapReduce
  
Source: Larry Smarr, Calit2
PI: (Weizhong Li, CRBS, UCSD):
NIH R01HG005978 (2010-2013, $1.1M)
Computational NextGen Sequencing Pipeline:
From Sequence to Taxonomy and Function
  
Same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
  
Flexible programming of K-means
•  R: Programming language and software environment for statistical computing and graphics.
•  KNIME: Platform for data analytics.
•  MLlib: Scalable machine learning library running on Spark cluster computing framework
•  Mahout: Scalable machine learning library based on MapReduce.
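For readers who want to see what the listed tools compute, here is a minimal plain-Python sketch of K-means (Lloyd's algorithm). It is not the R, KNIME, MLlib or Mahout implementation; seeding the centroids with the first `k` points and running a fixed number of iterations are simplifying assumptions made for this illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Alternate an assignment step and an update step a fixed number of times."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic seeding (assumption)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

points = [(0, 0), (10, 10), (0, 1), (10, 11)]
centroids, assignments = kmeans(points, k=2)
```

The assignment step is independent per point, which is what lets data-parallel engines distribute it across partitions.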
  	
  
Scalable Bayesian Network Learning
From “Napkin Drawings” to Executable Workflows…
Conceptual SWF
Executable SWF
[Figure: SBNL workflow, with components Local Learner, Data Quality Evaluation, Local Ensemble Learning, Quality Evaluation & Data Partitioning, Big Data, Master Learner, Master Ensemble Learning, Final BN Structure]
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
Focus on the use case, not the technology!
  
Using Workflows and Cyberinfrastructure
for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of: “cyberinfrastructure” for “analysis of large dimensional heterogeneous real-time sensed data” for fire resilience before, during and after a wildfire
What is lacking in disaster management today is…
a system integration of real-time sensor networks, satellite imagery, near-real time data management tools, wildfire simulation tools, and connectivity to emergency command centers
…. before, during and after a firestorm.
  
http://nbcr.ucsd.edu/
Integrated Multi-Scale Biomedical Modeling Workflows in NBCR
Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps
Spatial and Temporal Scales
•  Levels: Molecular & Macromolecular, Sub-Cellular, Cell, Tissue, Organ
•  Spatial: Å, nm – µm, 0.1mm - mm, cm
•  Temporal: fs - µs, µs - ms, ms - s, s - lifespan
Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration
•  Models at different scales are generally not designed to inform each other
•  Specialized interfaces to communicate large number of parameters and data are needed
•  Provenance of experiments needs to be portable
•  Models require different levels of scalability
•  Deployable software maintenance requires expertise
Rommie Amaro, UCSD
  
Sensitivity Analysis (SA) for Uncertainty Quantification (UQ)
Computational SA techniques to effectively and efficiently identify computational error and model sensitivity for differential equations (DE)
The Standard Scientific Simulation Workflow for DE Modeling in NBCR
•  Biomedical Theory and Experimental Data
•  Nonlinear DE System as Mathematical Model
•  Numerical Solution of Nonlinear DE Model
•  Extraction of Quantity of Interest from Simulation
Numerical solution of Nonlinear DE Model
•  Standard Nonlinear Solve of Primal Problem
•  Solution of linearized Dual Problem for Performing SA
•  Use of SA information for UQ (error estimation) to build an improved numerical discretization
•  Output of Numerical Solution with UQ/SA Info
FETK & FEniCS
Support for end-to-end computational scientific process
Battling complexity while facilitating collaboration and increasing reproducibility.
Aim 1
Goal: Extract Quantity of Interest (QoI) from accurate numerical simulation.
Mike Holst, UCSD
  
Molecular Dynamic CADD Workflow
•  Local Execution Option
•  User MD-Parameter Configuration Option
•  GPU or Gordon Execution Option
Amber Molecular Dynamics Package
Local: NBCR Cluster Resources
NSF/DOE: TeraScale Resources (XSEDE)
(Stampede)
NBCR and User Owned Cloud Resources
(Comet)
BENEFITS:
•  Enable users to configure MD job parameters through command-line, GUI or web interface.
•  Scale for multiple compounds in parallel
•  Run on Multiple Computing platforms
•  Increase reuse
•  Provenance
  
http://hpc.pnl.gov/IPPD/
Predicting Workflow Performance from Provenance
IPPD IDEA: Use past workflow execution traces along with system, application and execution profiles for dynamic predictive scheduling.
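A minimal sketch of the idea, assuming provenance is available as simple (resource, input size, runtime) records. This is not the IPPD method: the trace values and resource names are made up, and a least-squares line per resource is a deliberately simple stand-in for a real performance model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up provenance records: (resource, input size in GB, runtime in seconds).
traces = [
    ("cluster_a", 1, 12.0), ("cluster_a", 2, 22.0), ("cluster_a", 4, 42.0),
    ("cluster_b", 1, 30.0), ("cluster_b", 2, 35.0), ("cluster_b", 4, 45.0),
]

def build_models(traces):
    """Fit one runtime model per resource from past execution traces."""
    by_resource = {}
    for resource, size, runtime in traces:
        xs, ys = by_resource.setdefault(resource, ([], []))
        xs.append(size)
        ys.append(runtime)
    return {resource: fit_line(xs, ys) for resource, (xs, ys) in by_resource.items()}

def predict(models, resource, size):
    slope, intercept = models[resource]
    return slope * size + intercept

def choose_resource(models, size):
    """Schedule on the resource with the lowest predicted runtime."""
    return min(models, key=lambda resource: predict(models, resource, size))

models = build_models(traces)
small_job = choose_resource(models, 2)
large_job = choose_resource(models, 10)
```

With these made-up traces, the resource with low startup cost wins for small inputs and the one that scales better wins for large inputs, so the choice depends on the job.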
  
https://smartmanufacturingcoalition.org/
Workflows-as-a-Service
  
To Sum Up
•  Workflows and provenance are well-adopted in scientific
infrastructures today, with success
•  WorDS Center applies these concepts to advanced
dynamic data-driven analytics applications
•  One size does not fit all!
•  Many diverse environments and requirements
•  Need to orchestrate at a higher level
•  Higher level programming components for each domain
•  Lots of future challenges on
•  Optimized execution on heterogeneous platforms
•  Programmable interface to workload, storage and network needed
•  Increasing reuse within and across application domains
•  Querying and integration of workflow provenance data into
performance prediction
Questions?
Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu
Thanks to our many collaborators and funders!
Twitter: @WorDS_SDSC
  

Más contenido relacionado

La actualidad más candente

H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonSri Ambati
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable PapersJose Enrique Ruiz
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopRussell Jurney
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Big Data Rampage
Big Data RampageBig Data Rampage
Big Data RampageNiko Vuokko
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Cloud-native Enterprise Data Science Teams
Cloud-native Enterprise Data Science TeamsCloud-native Enterprise Data Science Teams
Cloud-native Enterprise Data Science TeamsBoston Consulting Group
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
IEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialIEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialSrinath Perera
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...Dataiku
 
Josh Wills, MLconf 2013
Josh Wills, MLconf 2013Josh Wills, MLconf 2013
Josh Wills, MLconf 2013MLconf
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 

La actualidad más candente (19)

H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
Open Science and Executable Papers
Open Science and Executable PapersOpen Science and Executable Papers
Open Science and Executable Papers
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Big Data Rampage
Big Data RampageBig Data Rampage
Big Data Rampage
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Cloud-native Enterprise Data Science Teams
Cloud-native Enterprise Data Science TeamsCloud-native Enterprise Data Science Teams
Cloud-native Enterprise Data Science Teams
 
Anaconda Data Science Collaboration
Anaconda Data Science CollaborationAnaconda Data Science Collaboration
Anaconda Data Science Collaboration
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
IEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On TutorialIEEE Cloud 2012: Clouds Hands-On Tutorial
IEEE Cloud 2012: Clouds Hands-On Tutorial
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Josh Wills, MLconf 2013
Josh Wills, MLconf 2013Josh Wills, MLconf 2013
Josh Wills, MLconf 2013
 
Hadoop
HadoopHadoop
Hadoop
 
Agile data science
Agile data scienceAgile data science
Agile data science
 

Similar a A Maturing Role of Workflows in the Presence of Heterogenous Computing Architectures

Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataIMC Institute
 
Scientific Software Challenges and Community Responses
Scientific Software Challenges and Community ResponsesScientific Software Challenges and Community Responses
Scientific Software Challenges and Community ResponsesDaniel S. Katz
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEWShiyong Lu
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupSri Kanajan
 

A Maturing Role of Workflows in the Presence of Heterogeneous Computing Architectures

  • 1. A Maturing Role of Workflows in the Presence of Heterogeneous Computing Architectures. WorDS.sdsc.edu. Dr. Ilkay Altintas, Founder and Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego
  • 2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education • Established as a national supercomputer resource center in 1985 by NSF • A world leader in HPC, data-intensive computing, and scientific data management • Current strategic focus on “Big Data” and “Data-intensive HPC” [Figure labels: 1985, today]
  • 3. Scientific Workflow Automation Technologies Research • Workflows for Cloud Systems • Big Data Applications • Reproducible Science • Workforce Training and Education • Development and Consulting Services. Workflows for Data Science Center. Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.
  • 4.
  • 5. Computational Data Science Workflows - Programmable, Reusable and Reproducible Scalability - • Access and query data • Scale computational analysis • Increase reuse • Save time, energy and money • Formalize and standardize. Real-Time Hazards Management: wifire.ucsd.edu. Data-Parallel Bioinformatics: bioKepler.org. Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu. kepler-project.org. WorDS.sdsc.edu
  • 6. The Big Picture is to Capture the Workflow in an Executable and Reusable Way. From “Napkin Drawings” to Executable Workflows… Conceptual SWF, Executable SWF. [Figure labels, SBNL workflow: Big Data; Quality Evaluation & Data Partitioning; Local Learner; Data Quality Evaluation; Local Ensemble Learning; Master Learner; Master Ensemble Learning; Final BN Structure.] Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
  • 7. Ptolemy II: A laboratory for investigating design. KEPLER: A problem-solving environment for Scientific Workflow. KEPLER = Ptolemy II + X for Scientific Workflows. Kepler is a Scientific Workflow System • A cross-project collaboration … initiated August 2003 • 2.5 will be released soon: www.kepler-project.org • Builds upon the open-source Ptolemy II framework
  • 8. Kepler can be applied to problems in different scientific disciplines: some here and many more… • Astrophysics, e.g., DIAPL • Nanotechnology, e.g., ANELLI • Fusion, e.g., ITER • Metagenomics, e.g., CAMERA • Multi-scale biology, e.g., NBCR
  • 9. A Toolbox with Many Tools. Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution! • Data: search, database access, IO operations, streaming data in real-time… • Compute: data-parallel patterns, external execution, … • Network operations • Provenance and fault tolerance
  • 10. So, how can we use workflows in the context of applications? … while coupling all scales of computing within a reusable solution…
  • 11. Some P’s to focus on… People, Process, Platforms, Purpose, Programmability
  • 12. There are more: provenance, publication, product, performance, policy, profit, ...
  • 14. Computational Data Scientist Skill Set. http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/ Need to communicate!
  • 16. A Typical Workflow-Driven Process (configurable automated analysis): Find data, Access data, Acquire data, Move data, Clean data, Integrate data, Subset data, Pre-process data, Analyze data, Process data, Interpret results, Summarize results, Visualize results, Post-process results. Some questions to ask: • Where and how do I get the data? • What is the format and frequency of the data, e.g., structured, textual, real-time, image, …? • How do I integrate or subset datasets, e.g., knowledge representation, …? • How do I analyze the data and what is the analysis function? • What are the parameters to customize each step? • What are the computing needs to schedule and run each step? • How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
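The idea of a configurable, automated chain of steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step functions (`clean`, `subset`, `analyze`), their parameters and the `run_workflow` helper are hypothetical names, not part of Kepler or any tool named in the slides.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a name, a function, and the parameters that customize it.
Step = Tuple[str, Callable[..., Any], Dict[str, Any]]


def run_workflow(data: Any, steps: List[Step]) -> Tuple[Any, List[str]]:
    """Run each step on the output of the previous one and log what ran."""
    log = []
    for name, func, params in steps:
        data = func(data, **params)
        log.append(f"{name} {params}")
    return data, log


# Hypothetical step implementations, for illustration only.
def clean(records, drop_value=None):
    """Remove records equal to drop_value."""
    return [r for r in records if r != drop_value]


def subset(records, limit=None):
    """Keep only the first `limit` records."""
    return records[:limit]


def analyze(records):
    """Compute the mean of the remaining records."""
    return sum(records) / len(records)


steps = [
    ("clean", clean, {"drop_value": None}),
    ("subset", subset, {"limit": 3}),
    ("analyze", analyze, {}),
]

result, log = run_workflow([4, None, 8, 6, 10], steps)
print(result)
print(log)
```

Keeping the parameters next to each step is what makes the run repeatable: the same step list with different parameters gives a new, logged variant of the analysis.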
  • 18. Purpose… “You've got to think about big things while you're doing small things, so that all the small things go in the right direction.” – Alvin Toffler. use cases => purpose and value
  • 19. Integration of Many Tools to Serve a Purpose. Need toolboxes with many tools for: • data access, • analysis, • scalable execution, • fault tolerance, • provenance tracking, • reporting • ... Also: • Alternative tools • Multiple modes of scalability • Support for each step of the development and production process • Different reporting needs for exploration and production stages. Build, Explore, Scale, Report
  • 20. Build Once, Run Many Times… • Data science process should support experimental work and dynamic scalability on many platforms • Scalability based on: – data volume and velocity – dynamic modeling needs – highly-optimized HPC codes – changes in network, storage and computing availability
  • 21. There are different styles of parallelism! [Figure: three diagrams. One shows Task1 to Task4 in Finished, Running and Waiting states; two show an Input Data Set split into parts 1, 2, 3 and processed by Task1, Task2, Task3, all Running.]
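The slide's diagrams do not survive in this transcript, so the following is a sketch of three commonly distinguished styles (task, data and pipeline parallelism), on the assumption that these are the ones pictured. The functions `square` and `negate` are placeholder tasks, and Python threads are used only to show the structure; they are not a claim about how any workflow system implements it.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def negate(x):
    return -x


def task_parallel(x):
    """Independent tasks run at the same time on the same input."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(square, x), pool.submit(negate, x)]
        return [f.result() for f in futures]


def data_parallel(items):
    """One task is applied to every part of the input data set."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(square, items))


def pipeline_parallel(items):
    """Two stages run concurrently; stage 2 consumes what stage 1 emits."""
    done = object()  # sentinel marking the end of the stream
    between, out = queue.Queue(), queue.Queue()

    def stage1():
        for x in items:
            between.put(square(x))
        between.put(done)

    def stage2():
        while True:
            y = between.get()
            if y is done:
                out.put(done)
                return
            out.put(negate(y))

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results = []
    while True:
        z = out.get()
        if z is done:
            return results
        results.append(z)
```

In CPython, threads mainly help with I/O-bound work; for CPU-bound tasks a process pool or a cluster framework would be the more realistic choice.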
  • 22. Distributed-Data Parallel Computing • A parallel and scalable programming model for Big Data – Input data is automatically partitioned onto multiple nodes – Programs are distributed and executed in parallel on the partitioned data blocks. MapReduce: Move program to data! Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
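A single-process sketch can make the MapReduce phases concrete: partition the input, map each block, group by key, then reduce. This is a toy word count in plain Python, not Hadoop code; the function names are illustrative, and on a real cluster each block would be mapped on the node that stores it.

```python
from collections import defaultdict


def partition(records, n_blocks):
    """Split the input into n_blocks data blocks (round-robin)."""
    return [records[i::n_blocks] for i in range(n_blocks)]


def map_phase(block):
    """Emit a (word, 1) pair for every word in one data block."""
    return [(word, 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_blocks=2):
    blocks = partition(lines, n_blocks)
    # Each block is mapped independently; this loop is the part that a
    # cluster framework would run in parallel.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(mapped))


print(word_count(["a b a", "b c"]))
```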
  • 23. Distributed Data-Parallel (DDP) Patterns • A higher-level programming model – Moving computation to data – Good scalability and performance acceleration – Run-time features such as fault-tolerance – Easier parallel programming than MPI and OpenMP. Patterns for data distribution and parallel data processing. Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
  • 24. Hadoop • Open source implementation of MapReduce • A distributed file system across compute nodes (HDFS) – Automatic data partition – Automatic data replication • Master and workers/slaves architecture • Automatic task re-execution for failed tasks. Spark • Fast Big Data Engine – Keeps data in memory as much as possible • Resilient Distributed Datasets (RDDs) – Evaluated lazily – Keeps track of lineage for fault tolerance • More operators than just Map and Reduce • Can run on YARN (Hadoop v2)
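Lazy evaluation and lineage tracking can be illustrated with a toy class. `LazyDataset` below is a hypothetical stand-in written for this note, not the Spark RDD API: transformations only record themselves, and nothing runs until `collect()` replays the recorded lineage, which is also how a lost result could be recomputed.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records its lineage, computes on demand."""

    def __init__(self, source, lineage=()):
        self._source = source
        self.lineage = tuple(lineage)  # ordered (kind, function) pairs

    def map(self, func):
        # Nothing is computed here; the transformation is only recorded.
        return LazyDataset(self._source, self.lineage + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self.lineage + (("filter", predicate),))

    def collect(self):
        """Replay the lineage over the source data and return the result."""
        data = list(self._source)
        for kind, func in self.lineage:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data


doubled = LazyDataset([1, 2, 3]).map(lambda x: 2 * x)
large = doubled.filter(lambda x: x > 2)
print([kind for kind, _ in large.lineage])
print(large.collect())
```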
  • 25. Scalability across platforms… People, Process, Platforms, Purpose
  • 26. Running on Heterogeneous Computing Resources - Execution of programs where they run most efficiently - [Figure labels: Gordon; Trestles; Local Cluster Resources; NSF/DOE: TeraScale Resources (XSEDE) (Gordon) (Comet) (Stampede) (Lonestar); Private Cluster: User Owned Resources.] Different executables have different computing architecture needs! e.g., memory-intensive, compute-intensive, I/O-intensive
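One very simple way to read "run each executable where it runs most efficiently" is to match a job's dominant need to the resource that offers the most of it. The sketch below assumes exactly that rule; the `Resource` fields, the example machines and their numbers are invented for illustration and do not describe Gordon, Comet, Stampede or any other real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resource:
    name: str
    memory_gb: int
    cores: int
    io_mb_per_s: int


# Which resource attribute matters for each job profile named on the slide.
PROFILE_KEYS = {
    "memory-intensive": lambda r: r.memory_gb,
    "compute-intensive": lambda r: r.cores,
    "I/O-intensive": lambda r: r.io_mb_per_s,
}


def pick_resource(profile: str, resources: List[Resource]) -> Resource:
    """Return the resource with the most of what the job is bound by."""
    return max(resources, key=PROFILE_KEYS[profile])


# Hypothetical machines, for illustration only.
resources = [
    Resource("big_memory_node", memory_gb=1024, cores=16, io_mb_per_s=500),
    Resource("many_core_node", memory_gb=64, cores=128, io_mb_per_s=500),
    Resource("fast_storage_node", memory_gb=64, cores=16, io_mb_per_s=4000),
]

for profile in PROFILE_KEYS:
    print(profile, "->", pick_resource(profile, resources).name)
```

A production scheduler would also weigh queue wait times, data locality and cost, which are the challenges the next slide lists.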
  • 27. Challenges for Heterogeneous Computing • Dynamic scheduling optimization – Based on network availability – Data transfer and locality – Energy efficiency – Availability of exascale memory hierarchies – Workload changes – Dynamic memory or file-based coupling • Better programmable communication between workflow systems and infrastructure for computing, storage and network • Harder form of reproducibility • Harder to program using scripts
  • 28. Programmability for scalability, reusability and reproducibility. People, Process, Platforms, Purpose, Programmability
  • 29. Using Big Data Computing in Bioinformatics - Improving Programmability, Scalability and Reproducibility - biokepler.org
  • 30. bioKepler’s Conceptual Framework. [Figure labels: Kepler; bioKepler; Compute: Amazon EC2, FutureGrid, Sun Grid Engine, Adhoc Network; Data: CAMERA, Ensembl, Genbank; Deploy & Execute; Bioinformatics Tools: Clustering, Mapping, Assembly; Transfer; Customize & Integrate; Data-Parallel Execution Patterns: Map-Reduce, Master-Slave, All-Pairs; Triton Resource; Provenance: Execution History, Data Lineage; Reporting: PDF Generation, Report Designer; Fault-Tolerance: Error Handling, Alternatives; Run Manager: Tag, Search; Director; Executable Workflow Plan; Scheduler; Execution Engine; Bioinformatician; Workflow; bioActors: BLAST, HMMER, CD-HIT; Private Repositories; …; XSEDE]
  • 31. Gateways and other user environments • bioKepler • Kepler and Provenance Framework • BioLinux, Galaxy, Clovr, Hadoop, … • CLOUD and OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE. www.bioKepler.org A coordinated ecosystem of biological and technological packages for bioinformatics!
  • 32. RAMMCAP - Rapid Clustering and Functional Annotation for Metagenomic Sequences. [Figure: the tools QC, tRNA, cd-hit, hmmer, metagene and blast placed along four axes. Data size: KB, MB, GB, TB. CPU time: Second, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]
  • 33. RAMMCAP - Rapid Clustering and Functional Annotation for Metagenomic Sequences. [Figure: the same chart shown twice, the second copy labeled NGS. The tools QC, tRNA, cd-hit, hmmer, metagene and blast are placed along four axes. Data size: KB, MB, GB, TB. CPU time: Minute, Hour, Day, Month, Year. Memory: GB, 10GB, 100GB. Parallel: No need, No, Multi threading, MPI, Map Reduce.]
  • 34. Computational NextGen Sequencing Pipeline: From Sequence to Taxonomy and Function. PI: (Weizhong Li, CRBS, UCSD): NIH R01HG005978 (2010-2013, $1.1M). Source: Larry Smarr, Calit2
  • 35. Same approach can be applied to machine learning and other application areas! - REUSABILITY and REPURPOSABILITY -
  • 36. Flexible programming of K-means • R: Programming language and software environment for statistical computing and graphics. • KNIME: Platform for data analytics. • MLlib: Scalable machine learning library running on Spark cluster computing framework • Mahout: Scalable machine learning library based on MapReduce.
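For readers who have not seen it, K-means itself is short. The sketch below is plain Lloyd's iteration in pure Python with a deliberately naive initialization (the first `k` points); it is an assumption-laden illustration of the algorithm, not how R, KNIME, MLlib or Mahout implement it.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

The assignment step is independent per point, which is why the same algorithm maps naturally onto data-parallel engines: the points are partitioned, assigned in parallel, and the per-cluster sums are reduced.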
  • 37. Scalable Bayesian Network Learning. From “Napkin Drawings” to Executable Workflows… Conceptual SWF, Executable SWF. [Figure labels, SBNL workflow: Big Data; Quality Evaluation & Data Partitioning; Local Learner; Data Quality Evaluation; Local Ensemble Learning; Master Learner; Master Ensemble Learning; Final BN Structure.] Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
  • 38. Focus on the use case, not the technology!
  • 39. Using Workflows and Cyberinfrastructure for Wildfire Resilience - A Scalable Data-Driven Monitoring and Dynamic Prediction Approach - wifire.ucsd.edu  
  • 40. A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE). Development of: “cyberinfrastructure” for “analysis of large dimensional heterogeneous real-time sensed data” for fire resilience before, during and after a wildfire
  • 41. What is lacking in disaster management today is… a system integration of real-time sensor networks, satellite imagery, near-real time data management tools, wildfire simulation tools, and connectivity to emergency command centers … before, during and after a firestorm.
  • 42. http://nbcr.ucsd.edu/ Integrated Multi-Scale Biomedical Modeling Workflows in NBCR
  • 43. Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps. [Figure labels, Spatial and Temporal Scales. Spatial: Å; nm – µm; 0.1mm - mm; cm. Temporal: fs - µs; µs - ms; ms - s; s - lifespan. Levels: Molecular & Macromolecular; Sub-Cellular; Cell; Tissue; Organ.] Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration • Models at different scales are generally not designed to inform each other • Specialized interfaces to communicate large number of parameters and data are needed • Provenance of experiments needs to be portable • Models require different levels of scalability • Deployable software maintenance requires expertise. Rommie Amaro, UCSD
  • 44. Sensitivity Analysis (SA) for Uncertainty Quantification (UQ). Computational SA techniques to effectively and efficiently identify computational error and model sensitivity for differential equations (DE). The Standard Scientific Simulation Workflow for DE Modeling in NBCR: Biomedical Theory and Experimental Data; Nonlinear DE System as Mathematical Model; Numerical Solution of Nonlinear DE Model; Extraction of Quantity of Interest from Simulation. Numerical solution of Nonlinear DE Model: Standard Nonlinear Solve of Primal Problem; Solution of linearized Dual Problem for Performing SA; Use of SA information for UQ (error estimation) to build an improved numerical discretization; Output of Numerical Solution with UQ/SA Info. FETK & FEniCS. Support for end-to-end computational scientific process: battling complexity while facilitating collaboration and increasing reproducibility. Aim 1 Goal: Extract Quantity of Interest (QoI) from accurate numerical simulation. Mike Holst, UCSD
  • 45. Molecular Dynamic CADD Workflow. Amber Molecular Dynamics Package. User MD-Parameter Configuration Option; Local Execution Option; GPU or Gordon Execution Option. [Figure labels: Local: NBCR Cluster Resources; NSF/DOE: TeraScale Resources (XSEDE) (Stampede) (Comet); NBCR and User Owned Cloud Resources.] BENEFITS: • Enable users to configure MD job parameters through command-line, GUI or web interface. • Scale for multiple compounds in parallel • Run on Multiple Computing platforms • Increase reuse • Provenance
  • 46. http://hpc.pnl.gov/IPPD/ Predicting Workflow Performance from Provenance. IPPD IDEA: Use past workflow execution traces along with system, application and execution profiles for dynamic predictive scheduling.
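As a minimal illustration of predicting performance from past runs, one could fit a straight line to (input size, runtime) pairs taken from provenance records and use it to estimate the next run. This is a sketch under strong assumptions: the trace values are made up, the single-feature linear model is a stand-in chosen for brevity, and nothing here reflects the actual IPPD method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Estimate the runtime for an input of size x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical provenance records: (input size in GB, runtime in seconds).
past_runs = [(1.0, 12.0), (2.0, 22.0), (4.0, 42.0)]

model = fit_line([size for size, _ in past_runs],
                 [seconds for _, seconds in past_runs])
print(predict(model, 8.0))
```

A scheduler could compare such estimates across candidate resources before deciding where to place the next workflow step; richer models would add system and application profiles as further features.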
  • 48. To Sum Up •  Workflows and provenance are well-adopted in scientific infrastructures today, with success •  WorDS Center applies these concepts to advanced dynamic data-driven analytics applications •  One size does not fit all! •  Many diverse environments and requirements •  Need to orchestrate at a higher level •  Higher level programming components for each domain •  Lots of future challenges on •  Optimized execution on heterogeneous platforms •  Programmable interface to workload, storage and network needed •  Increasing reuse within and across application domains •  Querying and integration of workflow provenance data into performance prediction
  • 49. Questions? Ilkay Altintas, Ph.D. Email: altintas@sdsc.edu Twitter: @WorDS_SDSC Thanks to our many collaborators and funders!