High-level talk on programming models for parallel heterogeneous architectures at the second workshop organized by the NSF-funded Conceptualization of Software Institute for Abstractions and Methodologies for HPC Simulations Codes on Future Architectures, http://flash.uchicago.edu/site/NSF-SI2/
1. State of programming models and code transformations on heterogeneous platforms
Boyana Norris
norris@mcs.anl.gov
- Computer Scientist, Mathematics and Computer Science Division, Argonne National Laboratory
- Senior Fellow, Computation Institute, University of Chicago
2. Before there were computers…
Jacquard Loom, invented in 1801
Programming was
– Parallel
– Pattern-based
– Multithreaded
4. Outline, goals
Parallel programming for heterogeneous architectures
– Challenges
– Example approaches
Help set the stage for subsequent panel discussions w.r.t. issues related to programming heterogeneous architectures
– Need your input, please do interrupt
5. Heterogeneity
Hardware heterogeneity (different devices with different capabilities), e.g.:
– Multicore x86 CPUs with GPUs
– Multicore x86 CPUs with Intel Phi accelerators
– big.LITTLE (coupling slower, low-power ARM cores with faster, power-hungry ARM cores)
– A cluster with different types of nodes
– x86 CPU with FPGAs (e.g., Convey)
– …
Software heterogeneity (e.g., OS, languages)
– Not part of this talk
6. Similarities among heterogeneous platforms
Typically each processor has several, and sometimes many, execution units
– NVIDIA Fermi GPUs have 16 streaming multiprocessors (SMs);
– AMD GPUs have 20 or more SIMD units;
– Intel Phi has >50 x86 cores
Each execution unit typically has SIMD or vector execution.
– NVIDIA GPUs execute threads in SIMD-like groups of 32 (what NVIDIA calls warps);
– AMD GPUs execute in wavefronts that are 64 threads wide;
– Intel Phi has 512-bit-wide SIMD instructions (16 floats or 8 doubles).
11. Challenges
Managing data
– Data distribution, movement, replication
– Load balancing
Different processing capabilities (FPUs, clock rates, vector
units)
Different instruction sets
12. Software developer’s point of view
Important considerations, tradeoffs
– Initial investment
• learning curve, reimplementation
– Ongoing costs
• Maintainability, portability
– Performance
• Real time, within power constraints,…
– Life expectancy
• Architectures, software dependencies
– Suitability for particular goals
• Embedded system vs petaflop machine
– Agility
• Ability to exploit new architectures
– …
13. Programming model implementations
Established:
– Parallelism expressed through message passing, thread-based shared memory, or PGAS languages
– High-level languages or libraries with APIs that can map to different models, e.g., MPI
– General-purpose languages with compiler support for exploiting hybrid architectures
– Small language extensions or annotations embedded in GPLs with compiler or source transformation tool support, e.g., CUDA Fortran
– Streaming, e.g., CUDA
More recent:
Extinct, e.g., HPF
15. Source transformations
Typically multiple levels of abstraction and programming models are used simultaneously
Goal is to express algorithms at the highest level appropriate for the functionality being implemented
A single language or library is unlikely to be best for any given application on all possible hardware
One approach:
– Define algorithms using high-level abstractions
– Provide tools to translate these into lower-level, possibly architecture-specific implementations
Most programming on heterogeneous platforms involves source transformation
16. Example: Annotation-based approaches
Pros: low effort, minimal code changes
Cons: limited expressivity and performance
Examples:
– MPI + OpenACC directives in a GPL
– Some embedded DSLs (e.g., as supported by Orio)
17. Current limitations
Minimally intrusive approaches typically don't achieve the best possible performance, e.g., OpenACC annotations without code restructuring
A number of single-platform solutions are provided by vendors (e.g., Intel, NVIDIA); portability and performance on other platforms are not guaranteed
18. General-purpose programming languages
GPLs for parallel, possibly heterogeneous architectures
– UPC, CAF, Chapel, X10
Pros:
– Robustness (e.g., type safety, memory consistency)
– Tools (e.g., debugging, performance analysis)
Cons:
– Manual reimplementation required in most cases
– Hard to balance user control with resource management automation
– Interoperability
21. High-level frameworks and libraries
Domain-specific problem-solving environments and mathematical libraries can encapsulate the specifics of mapping to heterogeneous architectures (e.g., PETSc, Trilinos, Cactus)
Advantages
– Efficient implementations of common functionality
– Different levels of APIs to hide or expose different levels of the implementation and runtime (unlike pure language approaches)
– Relatively rapid support of new hardware
Disadvantages
– Learning curves, deep software dependencies
22. Ongoing efforts attempting to balance scalability with productivity
The DOE X-Stack program pursues fundamental advances in programming models, languages, compilers, runtime systems, and tools to support the transition of applications to exascale platforms
– DEGAS (Dynamic, Exascale Global Address Space): a PGAS approach
– SLEEC (Semantics-rich Libraries for Effective Exascale Computation): annotations and cost models to compile into optimized low-level implementations
– X-Tune: model-based code generation and optimization of algorithms written in GPLs
– D-TEC: compilers for both new general-purpose languages and DSLs embedded in other languages
23. Summary
Many traditional programming models can be used on heterogeneous architectures, with vendor support for compilers, libraries, and runtimes
No clear multi-platform winner among programming models, languages, or frameworks
Many new efforts aim to deepen the software stack to enable a better balance of programmability, performance, and portability
Editor's notes
Inspired the punched cards used in Charles Babbage's Analytical Engine (conceived in 1834)