WorDS of Data Science in the Presence of Heterogenous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas
Founder and Director, Workflows for Data Science (WorDS) Center of Excellence
San Diego Supercomputer Center, UC San Diego
  
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
•  Established as a national supercomputer resource center in 1985 by NSF
•  A world leader in HPC, data-intensive computing, and scientific data management
•  Current strategic focus on "Big Data"
[Timeline figure: 1985 to today]

Workflows for Data Science Center
Focus on the question, not the technology!
•  Scientific Workflow Automation Technologies Research
•  Workflows for Cloud Systems
•  Big Data Applications
•  Reproducible Science
•  Workforce Training and Education
•  Development and Consulting Services
10+ years of data science R&D experience as a Center.
  
So, what is a workflow?
Shop   Prepare   Cook   Store
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
  
Let's make pasta this evening!
Shop   Prepare   Cook   Store
[Figure labels: 30 minutes, 30 minutes, 15 minutes, 3 minutes, 15 minutes, 3 minutes]
  
How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management, the essential principle of fast cooking, is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read, and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
  
What if you have more than one cook?
MAP
•  Input: veggies
•  User-defined function (UDF): chop
•  Output: chopped groups of each kind of veggie
REDUCE
•  Input: chopped batches for each veggie type
•  User-defined function (UDF): combine based on veggie type as key
•  Output: a bowl of veggies per veggie kind
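The kitchen analogy above can be sketched in a few lines of plain Python. This is only an illustration: the function names `chop`, `combine` and `map_reduce`, and the "4 pieces per veggie" rule, are assumptions made for the example and do not come from any MapReduce library.

```python
from collections import defaultdict

def chop(veggie):
    """Map UDF: turn one (kind, count) entry into a (kind, pieces) pair."""
    kind, count = veggie
    return kind, [f"{kind} piece"] * (count * 4)  # assumed: 4 pieces per veggie

def combine(kind, batches):
    """Reduce UDF: merge all chopped batches of one kind into one bowl."""
    bowl = [piece for batch in batches for piece in batch]
    return kind, bowl

def map_reduce(veggies, map_udf, reduce_udf):
    # Map phase: every input is chopped independently (one cook per input).
    mapped = [map_udf(v) for v in veggies]
    # Shuffle phase: group the chopped batches by veggie kind, the key.
    groups = defaultdict(list)
    for kind, pieces in mapped:
        groups[kind].append(pieces)
    # Reduce phase: one bowl per veggie kind.
    return dict(reduce_udf(kind, batches) for kind, batches in groups.items())

veggies = [("carrot", 2), ("onion", 1), ("carrot", 1)]
bowls = map_reduce(veggies, chop, combine)
bowl_sizes = {kind: len(pieces) for kind, pieces in bowls.items()}
```

Because `chop` never looks at more than one input, the map phase could be handed to any number of cooks without changing the result; only the grouping by key needs coordination.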
  
Thanksgiving dinner preparation: more planning and tasks?

Menu Item          Preparation Time   Cooking Time   Cooling Time
Turkey             30 minutes         4 hours        15 minutes
Veggies            30 minutes         45 minutes     None
Cranberry Sauce    5 minutes          30 minutes     2 hours
Soup               20 minutes         30 minutes     None
Pie                30 minutes         5 minutes      1 day

•  When do you start cooking?
•  What order do you cook?
•  Can you cook some menu items in parallel?
•  Who cooks what?
•  …
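The first three questions can be answered mechanically once the table is written down as data. The sketch below is a simplification: it treats each menu item as one indivisible block of preparation, cooking and cooling, and assumes either one cook working strictly back to back or enough cooks and ovens to run everything at once. The names `MENU`, `latest_starts` and `makespan` are made up for the example.

```python
# Durations in minutes, copied from the table above.
MENU = {
    "Turkey":          {"prep": 30, "cook": 240, "cool": 15},
    "Veggies":         {"prep": 30, "cook": 45,  "cool": 0},
    "Cranberry Sauce": {"prep": 5,  "cook": 30,  "cool": 120},
    "Soup":            {"prep": 20, "cook": 30,  "cool": 0},
    "Pie":             {"prep": 30, "cook": 5,   "cool": 24 * 60},
}

def total_minutes(steps):
    return steps["prep"] + steps["cook"] + steps["cool"]

def latest_starts(menu):
    """Minutes before dinner at which each item has to be started."""
    return {item: total_minutes(steps) for item, steps in menu.items()}

def makespan(menu, parallel):
    """Elapsed minutes with everything in parallel, or one item after another."""
    totals = [total_minutes(steps) for steps in menu.values()]
    return max(totals) if parallel else sum(totals)

starts = latest_starts(MENU)
order = sorted(starts, key=starts.get, reverse=True)  # longest item starts first
```

Under these assumptions the pie sets the schedule: it has to be started 1475 minutes before dinner, and running the items in parallel takes 1475 minutes in total instead of 2040.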
  
Data Science Workflows
- Programmable, Reusable and Reproducible Scalability -
•  Access and query data
•  Scale computational analysis
•  Increase reuse
•  Save time, energy and money
•  Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org   WorDS.sdsc.edu
  
Why scalable and reproducible data science?
  
The Big Picture is Supporting the Scientist
Conceptual SWF
Executable SWF
From "Napkin Drawings" to Executable Workflows
[Workflow figure labels: Fasta File; Circonspect; Average Genome Size; Combine Results; PHACCS]
  
The Big Picture is Supporting the Data Scientist
Conceptual SWF
Executable SWF
From "Napkin Drawings" to Executable Workflows…
[SBNL workflow figure labels: Big Data; Quality Evaluation & Data Partitioning; Local Learner; Data Quality Evaluation; Local Ensemble Learning; Master Learner; Master Ensemble Learning; Final BN Structure]
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning
  
Kepler is a Scientific Workflow System
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflow
KEPLER = Ptolemy II + X for Scientific Workflows
•  A cross-project collaboration … initiated August 2003
•  2.4 released 04/2013
•  Builds upon the open-source Ptolemy II framework
www.kepler-project.org
A Toolbox with Many Tools
Need expertise to identify which tool to use when and how!
Require computation models to schedule and optimize execution!
•  Data
– Search, database access, IO operations, streaming data in real-time…
•  Compute
– Data-parallel patterns, external execution, …
•  Network operations
•  Provenance and fault tolerance
  


So, how does this relate to data science, big data and supercomputing?

Distributed Computing
Computing using more than one computer connected through a network.
•  Types of distributed computing:
– Computers in local area network
– Cluster or High-Performance Computing
– Grid
– Cloud
  
Cluster or High-Performance Computing
•  Built from multiple computers
•  May have
– parallel file system
– high-speed network
•  Provides a scheduler to manage the machines and submitted jobs
– SGE/OGE, PBS, Condor, LSF, SLURM
  
Parallelization
Multiple processes or threads running at the same time
•  Execution environments
– One machine
– Distributed machines
•  Parallelism Types
– Computation/task parallelism
– Data parallelism
– Pipeline parallelism
[Figure: Task Parallelism, Data Parallelism and Pipeline Parallelism, each drawn as numbered tasks in Running, Waiting or Finished state over an input data set]
There are different styles of parallelism!
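A minimal sketch of the three styles, using only the Python standard library. The functions and inputs are invented for illustration. A thread pool stands in for "more than one worker"; the pipeline part uses generators, which show the stage-by-stage dataflow but interleave the stages on a single thread rather than running them on separate workers.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def total(xs):
    return sum(xs)

def longest(words):
    return max(words, key=len)

numbers = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: different, independent tasks run at the same time.
    f_total = pool.submit(total, numbers)
    f_longest = pool.submit(longest, ["map", "reduce", "pipeline"])
    task_results = (f_total.result(), f_longest.result())

    # Data parallelism: the same task runs on different pieces of the input.
    data_results = list(pool.map(square, numbers))

# Pipeline parallelism: stages are chained, so one item can be in the second
# stage while the next item is still in the first stage.
def stage_clean(items):
    for item in items:
        yield item.strip()

def stage_upper(items):
    for item in items:
        yield item.upper()

pipeline_results = list(stage_upper(stage_clean([" a ", " b "])))
```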
  
Big Data: Short Definition
•  Some features ("V's") of big data
–  Volume: amount of data
–  Velocity: speed of data in and out
–  Variety: range of data types and sources
–  Veracity: trustworthiness of data
Picture credit: IBM 2012
  
Distributed-Data Parallel Computing
•  A parallel and scalable programming model for Big Data
–  Input data is automatically partitioned onto multiple nodes
–  Programs are distributed and executed in parallel on the partitioned data blocks
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
MapReduce
Move program
to data!
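"Move program to data" can be pictured as follows. This is a single-process sketch under simplifying assumptions: `partition` uses round-robin placement (real systems choose their own placement), and `run_on_node` merely calls the function where a real framework would ship it to the machine holding the block. All names are hypothetical.

```python
def partition(records, n_nodes):
    """Split the input into n_nodes blocks, round-robin."""
    blocks = [[] for _ in range(n_nodes)]
    for i, record in enumerate(records):
        blocks[i % n_nodes].append(record)
    return blocks

def run_on_node(program, block):
    """Stand-in for sending `program` to the node that already holds `block`."""
    return program(block)

def count_long_words(block):
    return sum(1 for word in block if len(word) > 4)

records = ["data", "science", "workflow", "big", "kepler", "spark"]
blocks = partition(records, n_nodes=3)
partials = [run_on_node(count_long_words, block) for block in blocks]
answer = sum(partials)
```

Only the small partial counts travel back to be summed; the blocks themselves never move.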
Distributed Data-Parallel (DDP) Patterns
Patterns for data distribution and parallel data processing
•  A higher-level programming model
–  Moving computation to data
–  Good scalability and performance acceleration
–  Run-time features such as fault-tolerance
–  Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
  
Hadoop
•  Open source implementation of MapReduce
•  A distributed file system across compute nodes (HDFS)
–  Automatic data partition
–  Automatic data replication
•  Master and workers/slaves architecture
•  Automatic task re-execution for failed tasks
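The last bullet, re-executing failed tasks, follows a simple pattern that can be shown without Hadoop itself. The sketch below is not Hadoop's implementation or API; `run_with_reexecution`, `flaky_task` and the limit of three attempts are assumptions chosen for the example.

```python
def run_with_reexecution(task, attempts_allowed=3):
    """Run task(attempt); on failure run it again, up to attempts_allowed times."""
    failures = []
    for attempt in range(1, attempts_allowed + 1):
        try:
            return task(attempt), failures
        except RuntimeError as err:
            failures.append(f"attempt {attempt}: {err}")
    raise RuntimeError(f"task failed {attempts_allowed} times: {failures}")

def flaky_task(attempt):
    # Hypothetical task that fails on its first attempt, e.g. a lost worker.
    if attempt == 1:
        raise RuntimeError("worker lost")
    return "block processed"

result, failures = run_with_reexecution(flaky_task)
```

Re-execution is only safe because a task's output depends solely on its input block, so running it twice gives the same answer.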
  
Spark
•  Fast Big Data Engine
–  Keeps data in memory as much as possible
•  Resilient Distributed Datasets (RDDs)
–  Evaluated lazily
–  Keeps track of lineage for fault tolerance
•  More operators than just Map and Reduce
•  Can run on YARN (Hadoop v2)
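Lazy evaluation and lineage are easier to see in a toy model than in a description. The class below is not Spark and does not use the Spark API; `ToyDataset` is a made-up stand-in that records transformations and replays them only when `collect` is called, which is the idea the two RDD bullets describe.

```python
class ToyDataset:
    """Toy stand-in for an RDD: stores how to compute data, not the data."""

    def __init__(self, source, lineage=()):
        self.source = source      # original input, assumed re-readable
        self.lineage = lineage    # tuple of (name, function) transformations

    def map(self, fn):
        return ToyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self.source, self.lineage + (("filter", fn),))

    def collect(self):
        """Action: only now is the lineage replayed over the source."""
        data = list(self.source)
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

calls = []

def traced_double(x):
    calls.append(x)  # records every time real work happens
    return 2 * x

ds = ToyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has been computed yet
first = ds.collect()
second = ds.collect()             # a lost result can be rebuilt from lineage
```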
  
Getting Value out of All This
  
My favorite definition of Data Science
"By "Data Science", we mean almost everything that has something to do with data: Collecting, analyzing, modeling...... yet the most important part is its applications --- all sorts of applications."
Journal of Data Science (http://www.jds-online.com/about)
Implies: programming, data analysis, and problem solving
  
Some P's of Data Science
People
Process
Platforms
Purpose
Programmability
There are more: provenance, publication, product, performance, policy, profit, ...
  
People…
People
Data Scientist Skill Set
http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
Unicorn?
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
  
Solution: Scale the Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
  
Some P's of Data Science
  
Process
Defining a Typical Data Science Process
Find data; Access data; Acquire data; Move data; Clean data; Integrate data; Subset data; Pre-process data; Analyze data; Process data
Interpret results; Summarize results; Visualize results; Post-process results
Some questions to ask:
•  Where and how do I get the data?
•  What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
•  How do I integrate or subset datasets, e.g., knowledge representation, …?
•  How do I analyze the data and what is the analysis function?
•  What are the parameters to customize each step?
•  What are the computing needs to schedule and run each step?
•  How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
configurable
automated analysis
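"Configurable automated analysis" can be read as: the steps are fixed and automated, the parameters are not. A small sketch under that reading, with step names borrowed from the list above; the `config` keys (`source`, `min_value`, `digits`) and the choice of a mean as the analysis function are assumptions made for the example.

```python
def acquire(config):
    return list(config["source"])

def clean(data, config):
    return [x for x in data if x is not None]

def subset(data, config):
    return [x for x in data if x >= config["min_value"]]

def analyze(data, config):
    return sum(data) / len(data)

def summarize(result, config):
    return f"mean={result:.{config['digits']}f}"

def run_process(config):
    """The process is written once; each run is customized through config."""
    data = acquire(config)
    data = clean(data, config)
    data = subset(data, config)
    result = analyze(data, config)
    return summarize(result, config)

config = {"source": [3, None, 10, 7, 1], "min_value": 2, "digits": 1}
report = run_process(config)
stricter_report = run_process({**config, "min_value": 5})
```

Changing one parameter reruns the same process with a different subset, which is what makes the analysis repeatable and comparable across runs.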
  
Some P's of Data Science
  
People
Process
Purpose
Purpose…
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction."
– Alvin Toffler
use cases => purpose and value
  
 
	
  
	
  
	
  
	
  
Integration of Many Tools to Serve a Purpose
Need toolboxes with many tools for:
•  data access,
•  analysis,
•  scalable execution,
•  fault tolerance,
•  provenance tracking,
•  reporting
•  ...
[Figure labels: Business; Analysis; Operations; Research]
Adapted from: B. Tierney, 2013
  
Many Alternatives
•  Alternative tools
•  Multiple modes of scalability
•  Support for each step of the development and production process
•  Different reporting needs for exploration and production stages
[Figure labels: Build; Explore; Scale; Report]
  
Build Once, Run Many Times…
•  Data science process should support experimental work and dynamic scalability on many platforms
•  Scalability based on:
–  data volume and velocity
–  dynamic modeling needs
–  highly-optimized HPC codes
–  changes in network, storage and computing availability
Scalability across platforms…
  
People
Process
Platforms
Purpose
Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
Gordon   Trestles
Local Cluster Resources
NSF/DOE: TeraScale Resources (XSEDE)
(Gordon)   (Comet)   (Stampede)   (Lonestar)
Private Cluster: User Owned Resources
Different executables have different computing architecture needs!
e.g., memory-intensive, compute-intensive, I/O-intensive
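One way to picture "run each executable where it runs most efficiently" is a lookup from what an executable stresses to the resource that is strongest on that axis. The resource names and the 0-10 scores below are entirely hypothetical; they are not measurements of Gordon, Comet, Stampede or any other real system.

```python
# Hypothetical resource profiles, scored 0-10 per category.
RESOURCES = {
    "large_memory_node":  {"memory": 9, "compute": 5, "io": 6},
    "gpu_cluster":        {"memory": 4, "compute": 9, "io": 4},
    "flash_storage_node": {"memory": 6, "compute": 4, "io": 9},
}

# What each (made-up) executable stresses, using the three categories above.
EXECUTABLES = {
    "assembler": "memory",
    "simulation": "compute",
    "log_scan": "io",
}

def best_resource(need, resources):
    """Pick the resource with the highest score for the given need."""
    return max(resources, key=lambda name: resources[name][need])

placement = {exe: best_resource(need, RESOURCES)
             for exe, need in EXECUTABLES.items()}
```

A static table like this ignores queue times, data location and availability, which is where dynamic scheduling has to come in.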
  
Challenges for Heterogeneous Computing
•  Dynamic scheduling optimization needed
– Based on network availability
– Data transfer and locality
– Energy efficiency
– Availability of exascale memory hierarchies
– Workload changes
•  Better programmable communication between workflow systems and infrastructure for computing, storage and network
  
Programmability for scalability, reusability and reproducibility
  
People
Process
Platforms
Purpose
Programmability
Using Big Data Computing in Bioinformatics
- Improving Programmability, Scalability and Reproducibility -
biokepler.org
Gateways and other user environments
bioKepler
Kepler and Provenance Framework
BioLinux   Galaxy   Clovr   Hadoop   …
CLOUD and OTHER COMPUTING RESOURCES
e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
A coordinated ecosystem of biological and
technological packages for bioinformatics!
Same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
  
Flexible programming of K-means
•  R: Programming language and software environment for statistical computing and graphics.
•  KNIME: Platform for data analytics.
•  MLlib: Scalable machine learning library running on Spark cluster computing framework.
•  Mahout: Scalable machine learning library based on MapReduce.
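For reference, the algorithm being programmed in each of these tools is small. The sketch below is a plain-Python version of the standard Lloyd iteration with caller-supplied starting centroids. It is independent of R, KNIME, MLlib and Mahout and does not reproduce any of their APIs; the function names are chosen for the example.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm; stops when the centroids no longer move."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Update step: keep the old centroid if a cluster ended up empty.
        updated = [mean(c) if c else old for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The assignment step is independent per point, which is why the same algorithm maps naturally onto data-parallel engines.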
  	
  
Scalable Bayesian Network Learning
[SBNL workflow figure labels: Big Data; Quality Evaluation & Data Partitioning; Local Learner; Data Quality Evaluation; Local Ensemble Learning; Master Learner; Master Ensemble Learning; Final BN Structure]
Kepler Workflow
BN Workflow
•  Top level workflow
–  PartitionData: RExpression actor that contains R script for the data partitioning step
–  DDPNetworkLearner: Composite actor using MapReduce to perform parallel ensemble learning
WorDS – Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
•  Develop new big data science technologies and infrastructure
•  Develop data science workflow applications through combination of tools, technologies and best practices
•  Hands on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, Yarn, Cascading
•  Technology briefings and applied classes on end-to-end support for data science
Using Workflows and Cyberinfrastructure
for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of: "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire
What is lacking in disaster management today is…
a system integration of real-time sensor networks, satellite imagery, near-real time data management tools, wildfire simulation tools, and connectivity to emergency command centers
…. before, during and after a firestorm.
  
http://nbcr.ucsd.edu/
Integrated Multi-Scale Biomedical Modeling Workflows in NBCR
Local Execution Option
User MD-Parameter Configuration Option
Molecular Dynamic CADD Workflow
Amber Molecular Dynamics Package
Local: NBCR Cluster Resources
NSF/DOE: TeraScale Resources (XSEDE)
(Stampede)
NBCR and User Owned Cloud Resources
(Comet)
BENEFITS:
•  Enable users to configure MD job parameters through command-line, GUI or web interface.
•  Scale for multiple compounds in parallel
•  Run on Multiple Computing platforms
•  Increase reuse
•  Provenance
GPU or Gordon Execution Option
  
http://hpc.pnl.gov/IPPD/
Predicting Workflow Performance from Provenance
https://smartmanufacturingcoalition.org/
Workflows-as-a-Service
  
To Sum Up
•  Workflows and provenance are well-adopted in scientific
infrastructures today, with success
•  WorDS Center applies these concepts to advanced
dynamic data-driven analytics applications
•  One size does not fit all!
•  Many diverse environments and requirements
•  Need to orchestrate at a higher level
•  Higher level programming components for each domain
•  Lots of future challenges on
•  Optimized execution on heterogeneous platforms
•  Programmable interface to workload, storage and network needed
•  Increasing reuse within and across application domains
•  Querying and integration of workflow provenance data into
performance prediction
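As a rough illustration of the last point, provenance records from earlier runs can feed a very simple runtime model. The records below are invented, and a straight-line fit of runtime against input size is only one possible model, chosen here for brevity; it is not the method used in the projects mentioned in the talk.

```python
# Hypothetical provenance records: (input size in GB, runtime in seconds).
provenance = [(1.0, 12.0), (2.0, 22.0), (4.0, 42.0), (8.0, 82.0)]

def fit_line(records):
    """Ordinary least squares fit of runtime = slope * size + intercept."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(size_gb, model):
    slope, intercept = model
    return slope * size_gb + intercept

model = fit_line(provenance)
estimate = predict(16.0, model)  # predicted runtime for an unseen input size
```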
Questions?
WorDS Director: Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu
  

 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 

WorDS of Data Science in the Presence of Heterogenous Computing Architectures

  • 1. WorDS of Data Science in the Presence of Heterogenous Computing Architectures. WorDS.sdsc.edu. Dr. Ilkay Altintas, Founder and Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego
  • 2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education
      • Established as a national supercomputer resource center in 1985 by NSF
      • A world leader in HPC, data-intensive computing, and scientific data management
      • Current strategic focus on “Big Data”
      (Images: 1985 and today)
  • 3. Workflows for Data Science Center. Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.
      • Scientific Workflow Automation Technologies Research
      • Workflows for Cloud Systems
      • Big Data Applications
      • Reproducible Science
      • Workforce Training and Education
      • Development and Consulting Services
  • 4.
  • 5. So, what is a workflow? Shop, Prepare, Cook, Store. Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
  • 6. Let’s make pasta this evening! Shop, Prepare, Cook, Store (durations shown on the slide: 30 minutes, 30 minutes, 15 minutes, 3 minutes, 15 minutes, 3 minutes)
  • 7. How to Cook Everything Fast. “How to Cook Everything Fast is a book of kitchen innovations. Time management - the essential principle of fast cooking - is woven into revolutionary recipes that do the thinking for you. You’ll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read - and let the recipes guide you quickly and easily toward a delicious result.” Image and quote source: amazon.com
  • 8. What if you have more than one cook?
  • 9. MAP
      • Input: veggies
      • User-defined function (UDF): chop
      • Output: chopped groups of each kind of veggie
  • 10. REDUCE
      • Input: chopped batches for each veggie type
      • User-defined function (UDF): combine based on veggie type as key
      • Output: a bowl of veggies per veggie kind
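The MAP and REDUCE slides above can be sketched in a few lines of plain Python. This is only an illustration of the kitchen analogy, not Hadoop or any real framework: the function names `chop`, `combine` and `map_reduce`, and the sample veggies, are made up for this example.

```python
from collections import defaultdict


def chop(veggie):
    """Map UDF: turn one whole veggie into a (kind, chopped piece) pair."""
    kind, name = veggie
    return kind, f"chopped {name}"


def combine(kind, pieces):
    """Reduce UDF: gather all chopped pieces of one kind into a bowl."""
    return kind, sorted(pieces)


def map_reduce(veggies):
    # Map phase: apply the UDF to every input item independently.
    mapped = [chop(v) for v in veggies]
    # Shuffle phase: group the intermediate pairs by key (the veggie kind).
    groups = defaultdict(list)
    for kind, piece in mapped:
        groups[kind].append(piece)
    # Reduce phase: one call of the reduce UDF per key.
    return dict(combine(kind, pieces) for kind, pieces in groups.items())


# Hypothetical input: (kind, name) pairs.
veggies = [("carrot", "carrot 1"), ("onion", "onion 1"), ("carrot", "carrot 2")]
bowls = map_reduce(veggies)
print(bowls)
```

Because each `chop` call touches only its own item, the map phase could be handed to several cooks (or machines) without coordination; only the grouping step needs to see all intermediate pairs.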
  • 11. Thanksgiving dinner preparation: more planning and tasks?

      Menu Item         Preparation Time   Cooking Time   Cooling Time
      Turkey            30 minutes         4 hours        15 minutes
      Veggies           30 minutes         45 minutes     None
      Cranberry Sauce   5 minutes          30 minutes     2 hours
      Soup              20 minutes         30 minutes     None
      Pie               30 minutes         5 minutes      1 day

      • When do you start cooking?
      • What order do you cook?
      • Can you cook some menu items in parallel?
      • Who cooks what?
      • …
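One rough way to approach the questions on this slide is to compute each item's total lead time and start the longest one first. The sketch below uses the times from the slide's table and assumes, purely for illustration, that every item has its own cook and its own oven or burner, so nothing has to wait for a shared resource. The names `MENU`, `lead_times` and `start_order` are invented for this example.

```python
# Times in minutes, taken from the slide's table: (preparation, cooking, cooling).
MENU = {
    "Turkey": (30, 4 * 60, 15),
    "Veggies": (30, 45, 0),
    "Cranberry Sauce": (5, 30, 2 * 60),
    "Soup": (20, 30, 0),
    "Pie": (30, 5, 24 * 60),
}


def lead_times(menu):
    """Total minutes needed before dinner for each item."""
    return {item: sum(times) for item, times in menu.items()}


def start_order(menu):
    """Items sorted so the one that must start earliest comes first."""
    leads = lead_times(menu)
    return sorted(leads, key=leads.get, reverse=True)


leads = lead_times(MENU)
order = start_order(MENU)
for item in order:
    print(f"{item}: start {leads[item]} minutes before dinner")
```

With a single oven or a single cook the problem becomes a real scheduling problem with resource constraints, which is the point the slide is leading to.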
  • 12. Data Science Workflows - Programmable, Reusable and Reproducible Scalability -
      • Access and query data
      • Scale computational analysis
      • Increase reuse
      • Save time, energy and money
      • Formalize and standardize
      Examples: Real-Time Hazards Management (wifire.ucsd.edu); Data-Parallel Bioinformatics (bioKepler.org); Scalable Automated Molecular Dynamics and Drug Discovery (nbcr.ucsd.edu). See also kepler-project.org and WorDS.sdsc.edu.
  • 13. Why scalable and reproducible data science?
  • 14. The Big Picture is Supporting the Scientist: From “Napkin Drawings” to Executable Workflows. Diagram labels: Conceptual SWF; Executable SWF; Fasta File; Circonspect; Average Genome Size; Combine Results; PHACCS.
  • 15. The Big Picture is Supporting the Data Scientist: From “Napkin Drawings” to Executable Workflows… Diagram labels: Conceptual SWF; Executable SWF; SBNL workflow; Local Learner; Data Quality Evaluation; Local Ensemble Learning; Quality Evaluation & Data Partitioning; Big Data; Master Learner; Master Ensemble Learning; Final BN Structure. Caption: Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning.
  • 16. Kepler is a Scientific Workflow System. Ptolemy II: A laboratory for investigating design. KEPLER: A problem-solving environment for Scientific Workflow. KEPLER = Ptolemy II + X for Scientific Workflows.
      • A cross-project collaboration … initiated August 2003
      • 2.4 released 04/2013
      • Builds upon the open-source Ptolemy II framework
      www.kepler-project.org
  • 17. A Toolbox with Many Tools. Need expertise to identify which tool to use when and how! Require computation models to schedule and optimize execution!
      • Data: search, database access, IO operations, streaming data in real-time…
      • Compute: data-parallel patterns, external execution, …
      • Network operations
      • Provenance and fault tolerance
  • 18. So, how does this relate to data science, big data and supercomputing?

  • 19. Distributed Computing: computing using more than one computer connected through a network.
      • Types of distributed computing:
          – Computers in local area network
          – Cluster or High-Performance Computing
          – Grid
          – Cloud
  • 20. Cluster or High-Performance Computing
      • Built from multiple computers
      • May have
          – parallel file system
          – high-speed network
      • Provides a scheduler to manage the machines and submitted jobs
          – SGE/OGE, PBS, Condor, LSF, SLURM
  • 21. Parallelization: multiple processes or threads running at the same time.
      • Execution environments
          – One machine
          – Distributed machines
      • Parallelism types
          – Computation/task parallelism
          – Data parallelism
          – Pipeline parallelism
  • 22. There are different styles of parallelism! Diagram panels: Task Parallelism, Data Parallelism, Pipeline Parallelism (each shows Tasks 1 to 5 in Running, Waiting or Finished states over an input data set).
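Two of the styles named on these slides, data parallelism and task parallelism, can be contrasted with Python's standard `concurrent.futures` module. This is a minimal sketch on one machine with made-up helper functions (`square`, `total`, `largest`); it does not show pipeline parallelism, and in CPython threads mainly help with I/O-bound rather than CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def total(values):
    return sum(values)


def largest(values):
    return max(values)


data = [1, 2, 3, 4]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism: the same task runs on different pieces of the input.
    squares = list(pool.map(square, data))

    # Task parallelism: different tasks run at the same time on the same input.
    total_future = pool.submit(total, data)
    largest_future = pool.submit(largest, data)
    summary = (total_future.result(), largest_future.result())

print(squares, summary)
```

`pool.map` returns results in input order, so the output does not depend on which worker finishes first.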
  • 23. Big Data: Short Definition
      • Some features (“V’s”) of big data
          – Volume: amount of data
          – Velocity: speed of data in and out
          – Variety: range of data types and sources
          – Veracity: trustworthiness of data
      Picture credit: IBM 2012
  • 24. Distributed-Data Parallel Computing. MapReduce: Move program to data!
      • A parallel and scalable programming model for Big Data
          – Input data is automatically partitioned onto multiple nodes
          – Programs are distributed and executed in parallel on the partitioned data blocks
      Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
  • 25. Distributed Data-Parallel (DDP) Patterns: patterns for data distribution and parallel data processing.
      • A higher-level programming model
          – Moving computation to data
          – Good scalability and performance acceleration
          – Run-time features such as fault-tolerance
          – Easier parallel programming than MPI and OpenMP
      Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
  • 26. Hadoop
      • Open source implementation of MapReduce
      • A distributed file system across compute nodes (HDFS)
          – Automatic data partition
          – Automatic data replication
      • Master and workers/slaves architecture
      • Automatic task re-execution for failed tasks
      Spark
      • Fast Big Data Engine
          – Keeps data in memory as much as possible
      • Resilient Distributed Datasets (RDDs)
          – Evaluated lazily
          – Keeps track of lineage for fault tolerance
      • More operators than just Map and Reduce
      • Can run on YARN (Hadoop v2)
  • 27. Getting Value out of All This
  • 28. My favorite definition of Data Science: “By "Data Science", we mean almost everything that has something to do with data: Collecting, analyzing, modeling...... yet the most important part is its applications --- all sorts of applications.” Journal of Data Science (http://www.jds-online.com/about). Implies: programming, data analysis, and problem solving.
  • 29. Some P’s of Data Science: People, Process, Platforms, Purpose, Programmability
  • 30. There are more: provenance, publication, product, performance, policy, profit, ...
  • 32. Data Scientist Skill Set. http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
  • 34. Solution: Scale the Data Scientists. Standardize the data science process, not the tools! Standardized processes enable data scientists to communicate with business and programming partners. Also, what these definitions really mean is “computational and data scientists”.
  • 35. Some P’s of Data Science: Process
  • 36. Defining a Typical Data Science Process (configurable, automated analysis)
      Steps: Find data; Access data; Acquire data; Move data; Clean data; Integrate data; Subset data; Pre-process data; Analyze data; Process data; Interpret results; Summarize results; Visualize results; Post-process results.
      Some questions to ask:
      • Where and how do I get the data?
      • What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
      • How do I integrate or subset datasets, e.g., knowledge representation, …?
      • How do I analyze the data and what is the analysis function?
      • What are the parameters to customize each step?
      • What are the computing needs to schedule and run each step?
      • How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
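The idea of a configurable, automated process can be illustrated by chaining a few of the steps above as plain functions that all read their parameters from one configuration. Everything in this sketch is hypothetical (the step functions, the `config` keys and the toy data); it shows the shape of such a pipeline, not how Kepler or any WorDS workflow is implemented.

```python
def acquire(config):
    # Stand-in for finding, accessing and moving data.
    return list(config["raw_values"])


def prepare(records, config):
    # Clean (drop missing values) and subset (keep values up to a limit).
    return [r for r in records if r is not None and r <= config["max_value"]]


def analyze(records, config):
    # The analysis function here is simply the mean.
    return sum(records) / len(records)


def report(result, config):
    # Summarize the result with a configurable number of decimals.
    decimals = config["decimals"]
    return f"mean={result:.{decimals}f}"


def run_process(config):
    """Run every step in order; rerunning with a new config repeats the whole analysis."""
    records = acquire(config)
    records = prepare(records, config)
    result = analyze(records, config)
    return report(result, config)


config = {"raw_values": [1.0, None, 2.0, 3.0, 100.0], "max_value": 10.0, "decimals": 2}
summary = run_process(config)
print(summary)
```

Keeping the parameters in one place is what makes the run repeatable: the same configuration always produces the same report.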
  • 37. Some P’s of Data Science: People, Process, Purpose
  • 38. Purpose… “You've got to think about big things while you're doing small things, so that all the small things go in the right direction.” (Alvin Toffler) use cases => purpose and value
  • 39. Integration of Many Tools to Serve a Purpose. Need toolboxes with many tools for:
      • data access,
      • analysis,
      • scalable execution,
      • fault tolerance,
      • provenance tracking,
      • reporting
      • ...
      Diagram labels: Business; Analysis; Operations; Research. Adapted from: B. Tierney, 2013
  • 40. Many Alternatives (Build, Explore, Scale, Report)
      • Alternative tools
      • Multiple modes of scalability
      • Support for each step of the development and production process
      • Different reporting needs for exploration and production stages
  • 41. Build Once, Run Many Times…
      • Data science process should support experimental work and dynamic scalability on many platforms
      • Scalability based on:
          – data volume and velocity
          – dynamic modeling needs
          – highly-optimized HPC codes
          – changes in network, storage and computing availability
  • 42. Scalability across platforms… People, Process, Platforms, Purpose
  • 43. Running on Heterogeneous Computing Resources - Execution of programs where they run most efficiently - Different executables have different computing architecture needs, e.g., memory-intensive, compute-intensive, I/O-intensive. Diagram labels: Gordon; Trestles; Local Cluster Resources; NSF/DOE: TeraScale Resources (XSEDE); (Gordon); (Comet); (Stampede); (Lonestar); Private Cluster: User Owned Resources.
  • 44. Challenges for Heterogeneous Computing
      • Dynamic scheduling optimization needed
          – Based on network availability
          – Data transfer and locality
          – Energy efficiency
          – Availability of exascale memory hierarchies
          – Workload changes
      • Better programmable communication between workflow systems and infrastructure for computing, storage and network
  • 45. Programmability for scalability, reusability and reproducibility. People, Process, Platforms, Purpose, Programmability
  • 46. Using Big Data Computing in Bioinformatics - Improving Programmability, Scalability and Reproducibility - biokepler.org
  • 47. A coordinated ecosystem of biological and technological packages for bioinformatics! Diagram layers: Gateways and other user environments; bioKepler; Kepler and Provenance Framework; BioLinux, Galaxy, Clovr, Hadoop, …; CLOUD and OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE. www.bioKepler.org
  • 48. Same approach can be applied to machine learning and other application areas! - REUSABILITY and REPURPOSABILITY -
  • 49. Flexible programming of K-means
      • R: Programming language and software environment for statistical computing and graphics.
      • KNIME: Platform for data analytics.
      • MLlib: Scalable machine learning library running on the Spark cluster computing framework.
      • Mahout: Scalable machine learning library based on MapReduce.
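For readers unfamiliar with K-means, the algorithm that each of these tools provides can be sketched in plain Python for one-dimensional points. This is a teaching sketch with fixed starting centroids chosen for the example; it is not the API of R, KNIME, MLlib or Mahout, and the function name `kmeans_1d` is made up here.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```

The assignment step is independent per point, which is why the scalable libraries on the slide can distribute it across a cluster and only combine per-cluster sums in the update step.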
  • 50. Scalable Bayesian Network Learning (Kepler Workflow). Diagram labels: SBNL workflow; Local Learner; Data Quality Evaluation; Local Ensemble Learning; Quality Evaluation & Data Partitioning; Big Data; Master Learner; Master Ensemble Learning; Final BN Structure.
  • 51. BN Workflow
      • Top level workflow
          – PartitionData: RExpression actor that contains R script for the data partitioning step
          – DDPNetworkLearner: Composite actor using MapReduce to perform parallel ensemble learning
  • 52. WorDS – Simple and Scalable Big Data Solutions using Workflows. Focus on the use case, not the technology!
      • Develop new big data science technologies and infrastructure
      • Develop data science workflow applications through combination of tools, technologies and best practices
      • Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, Yarn, Cascading
      • Technology briefings and applied classes on end-to-end support for data science
  • 53. Using Workflows and Cyberinfrastructure for Wildfire Resilience - A Scalable Data-Driven Monitoring and Dynamic Prediction Approach - wifire.ucsd.edu
  • 54. A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE). Development of: “cyberinfrastructure” for “analysis of large dimensional heterogeneous real-time sensed data” for fire resilience before, during and after a wildfire
  • 55. What is lacking in disaster management today is… a system integration of real-time sensor networks, satellite imagery, near-real time data management tools, wildfire simulation tools, and connectivity to emergency command centers … before, during and after a firestorm.
  • 56. Integrated Multi-Scale Biomedical Modeling Workflows in NBCR. http://nbcr.ucsd.edu/
  • 57. Molecular Dynamic CADD Workflow. Diagram labels: Local Execution Option; User MD-Parameter Configuration Option; Amber Molecular Dynamics Package; Local: NBCR Cluster Resources; NSF/DOE: TeraScale Resources (XSEDE); (Stampede); NBCR and User Owned Cloud Resources; (Comet); GPU or Gordon Execution Option.
      BENEFITS:
      • Enable users to configure MD job parameters through command-line, GUI or web interface.
      • Scale for multiple compounds in parallel
      • Run on multiple computing platforms
      • Increase reuse
      • Provenance
  • 58. Predicting Workflow Performance from Provenance. http://hpc.pnl.gov/IPPD/
  • 60. To Sum Up
      • Workflows and provenance are well-adopted in scientific infrastructures today, with success
      • WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
      • One size does not fit all!
          – Many diverse environments and requirements
          – Need to orchestrate at a higher level
          – Higher level programming components for each domain
      • Lots of future challenges on
          – Optimized execution on heterogeneous platforms
          – Programmable interface to workload, storage and network needed
          – Increasing reuse within and across application domains
          – Querying and integration of workflow provenance data into performance prediction
  • 61. Questions? WorDS Director: Ilkay Altintas, Ph.D. Email: altintas@sdsc.edu