Programming Systems for High-Performance Computing
Marc Snir
CCGrid13, May 2013

Preamble

In The Beginning
§ Distributed-memory parallel systems are replacing vector machines.
§ CM-5 (using SPARC) announced in 1991
  – A 1024-node CM-5 is at the top of the first TOP500 list in 1993
§ Problem: need to rewrite all (vector) code in a new programming model

Some of the Attempts
[Timeline figure, 1990–2010, one row per family of systems:]
§ HPF: HPF1, HPF2
§ MPI: MPI1, MPI2, MPI3
§ PGAS: UPC, CAF
§ Shared memory: OpenMP V1, V2, V3
§ GPGPU: OpenCL, OpenACC
Legend: widely used / does not scale / too early to say / little to no public use

Questions
1. Message-passing is claimed to be “a difficult way to program”. If so, why is it the only widely used way to program HPC systems?
2. What’s needed for a programming system to succeed?
3. What’s needed in a good HPC parallel programming system?

Definitions
§ Parallel programming model: a parallel programming pattern
  – Bulk synchronous, SIMD, message-passing…
§ Parallel programming system: a language or library that implements one or multiple models
  – MPI+C, OpenMP…
§ MPI+C implements multiple models (message-passing, bulk synchronous…)
§ MPI+OpenMP is a hybrid system

Why Did MPI Succeed?
Beyond the expectations of its designers

Some Pragmatic Reasons for the Success of MPI
§ Clear need: Each of the vendors had its own message-passing library; there were no major differences between them, and no significant competitive advantage. Users wanted portability
  – The same was true of OpenMP
§ Quick implementation: MPI libraries were available as soon as the standard was ready
  – As distinct from HPF [rise & fall of HPF]
§ Cheap: A few people-years (at most) to port MPI to a new platform

Some Technical Reasons for the Success of MPI
§ Good performance: HPC programming is about performance. An HPC programming system must give the user the raw performance of the hardware. MPI mostly did that
§ Performance portability: The same optimizations (reduce communication volume, aggregate communication, use collectives, overlap computation and communication) are good for any platform
  – This was a major weakness of HPF [rise & fall of HPF]
§ Performance transparency: Few surprises; a simple performance model (e.g., the postal model, a + b·m, with per-message latency a, per-byte cost b, and message size m) predicts communication performance well most of the time (see the sketch below)
  – Another weakness of HPF

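To make the postal model concrete, here is a small, self-contained C sketch that compares the predicted cost of n small messages against one aggregated message; the latency and bandwidth constants are illustrative assumptions, not measurements from any particular machine.

    #include <stdio.h>

    /* Postal model: sending m bytes costs T(m) = a + b*m, with per-message
       latency a and per-byte cost b.  The constants are illustrative
       assumptions, not measurements. */
    int main(void) {
        double a = 1e-6;     /* assumed 1 us per-message latency             */
        double b = 1e-10;    /* assumed 10 GB/s bandwidth, i.e. 1e-10 s/byte */
        int    n = 1000;     /* number of small messages                     */
        double m = 8.0;      /* 8 bytes each                                 */

        double separate   = n * (a + b * m);   /* n individual messages      */
        double aggregated = a + b * (n * m);   /* one aggregated message     */

        printf("separate:   %g s\n", separate);
        printf("aggregated: %g s\n", aggregated);
        return 0;
    }

The arithmetic shows why aggregation is a portable optimization: the per-message latency a is paid once instead of n times.
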
  
How About Productivity?
§ Coding productivity is much overrated: a relatively small fraction of programming effort is coding. Debugging, tuning, testing & validating are where most of the time is spent, and MPI is not worse than existing alternatives in these aspects of code development.
§ MPI code is a very small fraction of a large scientific framework
§ Programmer productivity depends on the entire tool chain (compiler, debugger, performance tools, etc.). It is easier to build a tool chain for a communication library than for a new language

Some Things MPI Got Right
§ Communicators
  – Support for time and space multiplexing (see the code sketch below)
§ Collective operations
§ Orthogonal, object-oriented design
[Figures: an application using a library; App1 and App2 connected by a coupler; App1 and App2 running side by side]

  
MPI Forever?
§ MPI will continue to be used in the exascale time frame
§ MPI may increasingly become a performance bottleneck at exascale
§ It’s not MPI; it is MPI+X, where X is used for node programming. The main problem is a lack of adequate X (see the hybrid sketch below).

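A minimal sketch of the MPI+X structure with X = OpenMP: MPI carries the inter-node communication while an OpenMP loop uses the cores within each node. This only illustrates the shape of such hybrid codes, not a claim about which X is adequate.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int provided;
        /* Ask for a threading level compatible with OpenMP regions in which
           only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { N = 1 << 20 };
        static double x[N];
        double local = 0.0, total = 0.0;

        /* X = OpenMP: node-level parallelism over this rank's share of the data. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++) {
            x[i] = (double)rank + (double)i;
            local += x[i];
        }

        /* MPI: inter-node communication (here, a global reduction to rank 0). */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }
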
  
It is Hard to Kill MPI
§ By now, there are tens of millions of lines of code that use MPI
§ MPI continues to improve (MPI-3)
§ MPI performance can still be improved in various ways (e.g., JIT specialization of MPI calls)
§ MPI scalability can still be improved in a variety of ways (e.g., distributed data structures)
§ The number of nodes in future exascale systems will not be significantly higher than current numbers

What’s Needed for a New System to Succeed?

Factors for Success
§ Compelling need
  – MPI runs out of steam (?)
  – OpenMP runs out of steam (!)
  – MPI+OpenMP+CUDA+… becomes too baroque
  – Fault tolerance and power management require new programming models
§ Solution to the backward-compatibility problem
§ Limited implementation cost & fast deployment
§ Community support (?)

Potential Problems with MPI
§ Software overhead for communication
  – Strong scaling requires finer granularity
§ MPI processes with hundreds of threads – serial bottlenecks
§ Interface to node programming models: the basic design of MPI is for a single-threaded process; tasks are not MPI entities
§ Interface to heterogeneous nodes (NUMA, deep memory hierarchy, possibly non-coherent memory)
§ Fault tolerance
  – No good exception model
§ Good RDMA leverage (low-overhead one-sided communication, sketched below)

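On the last point, MPI already standardizes one-sided communication; the sketch below uses the MPI-2 window interface, where each rank exposes a buffer and a neighbor writes into it with MPI_Put, the style of communication that maps most directly onto RDMA hardware. Whether a given implementation actually delivers low overhead on this path is a separate question.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each rank exposes one double that other ranks may write remotely. */
        double recv_buf = 0.0;
        MPI_Win win;
        MPI_Win_create(&recv_buf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        double my_value = (double)rank;
        int right = (rank + 1) % nranks;

        /* One-sided communication: write into the right neighbor's window
           without the neighbor posting a matching receive. */
        MPI_Win_fence(0, win);
        MPI_Put(&my_value, 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }
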
  
Potential Problems with OpenMP
§ No support for locality – flat memory model (see the sketch below)
§ Parallelism is most naturally expressed using fine-grain constructs (parallel loops) and serializing constructs (locks, serial sections)
§ Support for heterogeneity is lacking:
  – OpenACC has a narrow view of heterogeneous architecture (two non-coherent memory spaces)
  – No support for NUMA (memory heterogeneity)
(There is significant uncertainty about the future node architecture)

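The first bullet in code: OpenMP parallelizes the loop but says nothing about where a and b live within a NUMA node. The common workaround sketched below relies on first-touch page placement, which is an operating-system behavior (assumed here, typical of Linux) rather than anything the OpenMP model expresses.

    #include <stdlib.h>

    enum { N = 1 << 22 };

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);

        /* First-touch initialization: each thread touches the pages it will
           later use, so the OS (hopefully) places them on that thread's NUMA
           domain.  Nothing in OpenMP expresses this placement directly. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = (double)i;
        }

        /* The compute loop uses the same static schedule, so each thread
           mostly touches the pages it initialized. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        free(a);
        free(b);
        return 0;
    }
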
  
New Needs
§ Current mental picture of parallel computation: equal computation effort means equal computation time
  – Significant effort invested to make this true: jitter avoidance
§ This mental picture is likely to be inaccurate in the future
  – Power management is local
  – Fault correction is local
§ Global synchronization is evil!

What Do We Need in a Good HPC Programming System?

Our Focus
[Diagram: multiple systems, each with a specialized (?) software stack; domain-specific programming systems are mapped, via automated mapping, onto a common low-level parallel programming system]

Shared Memory Programming Model
Basics:
§ Program expresses parallelism and dependencies (ordering constraints)
§ Run-time maps execution threads so as to achieve load balance
§ Global name space
§ Communication is implied by referencing
§ Data is migrated close to the executing thread by caching hardware
§ User reduces communication by ordering accesses so as to achieve good temporal, spatial and thread locality
Incidentals:
§ Shared memory programs suffer from races (illustrated below)
§ Shared memory programs are nondeterministic
§ Lack of collective operations
§ Lack of parallel teams

  
Distributed Memory Programming Model (for HPC)
Basics:
§ Program distributes data, maps execution to data, and expresses dependencies
§ Communication is usually explicit, but can be implicit (PGAS)
  – Explicit communication needed for performance (see the halo-exchange sketch below)
  – User explicitly optimizes communication (no caching)
Incidentals:
§ Lack of global name space
§ Use of send-receive

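A minimal example of the explicit-communication style: a one-dimensional halo exchange in MPI, where each rank owns a block of u and must explicitly send and receive the boundary values its neighbors need. The block size is an arbitrary placeholder.

    #include <mpi.h>

    enum { LOCAL_N = 1024 };

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* u[0] and u[LOCAL_N+1] are ghost cells for the neighbors' boundaries. */
        double u[LOCAL_N + 2] = { 0.0 };
        int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int right = (rank == nranks - 1) ? MPI_PROC_NULL : rank + 1;

        /* Explicit communication: exchange boundary values with both neighbors. */
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 0,
                     &u[0],       1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }
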
  
Hypothetical Future Architecture: (At Least) 3 Levels
§ Assume core heterogeneity is transparent to the user (the alternative is too depressing…)
[Diagram: many nodes; each node contains several coherence domains; each coherence domain contains many cores]

Possible Direction
§ Need, for future nodes, a model intermediate between shared memory and distributed memory
  – Shared memory: global name space, implicit communication and caching
  – Distributed memory: control of data location and communication aggregation
  – With very good support for latency hiding and fine-grain parallelism
§ Such a model could be used uniformly across the entire memory hierarchy!
  – With a different mix of hardware and software support

Extending the Shared Memory Model Beyond Coherence Domains – Caching in Memory
§ Assume (to start with) a fixed partition of data to “homes” (locales)
§ Global name space: easy
§ Implicit communication (read/write): performance problem
  – Need to access data in large user-defined chunks ->
    • User-defined “logical cache lines” (a data structure or tile; a data set that can be moved together)
    • User- (or compiler-) inserted “prefetch/flush” instructions (acquire/release access); a hypothetical sketch follows below
  – Need latency hiding ->
    • Prefetch or “concurrent multitasking”

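To show the intended usage pattern of user-defined logical cache lines with acquire/release, here is a hypothetical sketch: the names llc_t, llc_bind, llc_acquire and llc_release are invented for this illustration and exist in no real library, and the single-node stubs move no data; in a real system acquire would fetch the whole tile from its home (prefetching to hide latency) and release would flush or publish it.

    #include <stdlib.h>

    /* Hypothetical "logical cache line" interface; all names are invented
       for this sketch.  The stubs below move no data on a single node. */
    typedef struct { void *home; size_t bytes; } llc_t;

    static llc_t llc_bind(void *home, size_t bytes) {
        llc_t line = { home, bytes };
        return line;
    }
    static void *llc_acquire(llc_t *line, int for_write) {
        (void)for_write;              /* stub: no data movement on one node */
        return line->home;
    }
    static void llc_release(llc_t *line) { (void)line; }

    int main(void) {
        double tile[256] = { 0.0 };
        llc_t line = llc_bind(tile, sizeof tile);

        /* One acquire, local work on the whole tile, one release: data moves
           in large user-defined units instead of word-at-a-time accesses. */
        double *t = llc_acquire(&line, 1);
        for (int i = 0; i < 256; i++)
            t[i] = 2.0 * i;
        llc_release(&line);

        return 0;
    }
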
  
Living Without Coherence
§ Scientific codes are often organized in successive phases where a variable (or a family of variables) is either exclusively updated or shared during the entire phase. Examples include
  – Particle codes (Barnes-Hut, molecular dynamics…)
  – Iterative solvers (finite element, red-black schemes…)
☛ Need not track the status of each “cache line”; software can effect global changes at phase end

How Does This Compare to PGAS Languages?
§ Added ability to cache and prefetch
  – As distinct from copying
  – Data changes location without changing name
§ Added message-driven task scheduling
  – Necessary for hiding the latency of reads in irregular, dynamic codes

Is Such a Model Useful?
§ Facilitates writing dynamic, irregular code (e.g., Barnes-Hut)
§ Can be used, without performance loss, by adding annotations that enable early binding

Example: caching
§ General case: a “logical cache line” defined as a user-provided class
§ Special case: code for simple, frequently used tile types is optimized & inlined (e.g., rectangular submatrices) [Chapel]
§ General case: need dynamic space management and (possibly) an additional indirection (often avoided via pointer swizzling)
§ Special case: the same remote data (e.g., a halo) is repeatedly accessed. Can preallocate a local buffer and compile remote references into local references (at compile time for regular halos; after the data is read for irregular halos) – see the sketch below

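A hedged sketch of the second special case, assuming a fixed-size regular halo: the local halo buffers are allocated once, refreshed at the start of each phase, and every "remote" reference afterwards is an ordinary local access. The names and the 3-point stencil are made up for the illustration; the refresh step stands in for whatever communication mechanism fills the buffers.

    #include <string.h>

    enum { HALO = 2, LOCAL_N = 1024 };

    /* Preallocated local buffers for the remote halo data.  How they are
       filled (send-receive, one-sided put, a logical-cache-line acquire, ...)
       is orthogonal; the point is that all later references are purely local. */
    static double halo_left[HALO], halo_right[HALO];
    static double u[LOCAL_N];

    /* Stand-in for the once-per-phase communication step. */
    static void refresh_halo(void) {
        memset(halo_left, 0, sizeof halo_left);
        memset(halo_right, 0, sizeof halo_right);
    }

    /* "Remote" references are compiled into local buffer references. */
    static double at(int i) {
        if (i < 0)        return halo_left[i + HALO];
        if (i >= LOCAL_N) return halo_right[i - LOCAL_N];
        return u[i];
    }

    int main(void) {
        refresh_halo();                       /* once per phase, not per access */
        double s = 0.0;
        for (int i = 0; i < LOCAL_N; i++)     /* 3-point stencil over the block */
            s += at(i - 1) + at(i) + at(i + 1);
        return (int)s;
    }
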
  
There is a L o n g List of Additional Issues
§ How general are home partitions?
  – Use of the “owner computes” approach requires a general partition
  – Complex partitions lead to expensive address translation
§ Is the number of homes (locales) fixed?
§ Can data homes migrate?
  – No migration is a problem for load balancing and failure recovery
  – The usual approach to migration (virtualization & overdecomposition) unnecessarily reduces granularity
§ Can we update a (parameterized) decomposition, rather than virtualize locations?

There is a L o n g List of Additional Issues (2)
§ Local view of control or global view? (both paraphrased in C below)
  Forall i on A[i]  A[i] = foo(i)
  – Transformation described in terms of the global data structure
  Foreach A.tile  foo(tile)
  – Transformation described in terms of the local data structure (with local references)
  – There are strong, diverging opinions on which is better
§ Hierarchical data structures & hierarchical control
§ Exception model
§ …

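The contrast can be paraphrased in plain C (the tile size and layout below are assumptions made only for this illustration): the first loop is written against global indices and leaves placement to the system, while the second iterates over the tiles a locale owns and indexes each tile locally.

    enum { NTILES = 8, TILE = 128, N = NTILES * TILE };

    static double A[N];

    static double foo(int i) { return (double)i; }

    int main(void) {
        /* Global view: the loop is written over global indices; where each
           iteration runs ("on A[i]") is left to the system. */
        for (int i = 0; i < N; i++)
            A[i] = foo(i);

        /* Local view: the loop is written over the tiles a locale owns, and
           the body uses purely local indices into each tile. */
        for (int t = 0; t < NTILES; t++) {
            double *tile = &A[t * TILE];
            for (int j = 0; j < TILE; j++)
                tile[j] = foo(t * TILE + j);
        }
        return 0;
    }
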
  
The End
