Critical Issues at Exascale for Algorithm and Software Design
1. Critical Issues at Exascale for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov. 2012
Jack Dongarra, University of Tennessee, Knoxville, Tennessee, USA
3. Potential System Architecture
Systems                      2012 (Titan)                        2022                       Difference (Today vs. 2022)
System peak                  27 Pflop/s                          1 Eflop/s                  O(100)
Power                        8.3 MW (2 Gflops/W)                 ~20 MW (50 Gflops/W)
System memory                710 TB (38 GB x 18,688 nodes)       32-64 PB                   O(10)
Node performance             1,452 GF/s (1,311 + 141)            1.2 or 15 TF/s             O(10) - O(100)
Node memory BW               232 GB/s (52 + 180)                 2-4 TB/s                   O(1000)
Node concurrency             16 CPU cores + 2,688 CUDA cores     O(1k) or O(10k)            O(100) - O(1000)
Total node interconnect BW   8 GB/s                              200-400 GB/s               O(10)
System size (nodes)          18,688                              O(100,000) or O(1M)        O(100) - O(1000)
Total concurrency            50 M                                O(billion)                 O(1,000)
MTTI                         unknown                             O(<1 day)                  O(10)
4. Potential System Architecture with a Cap of $200M and 20 MW
(Same projections as the Slide 3 table, now subject to a $200M cost cap and a 20 MW power cap.)
5. Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain lower bounds on communication.
• Mixed-precision methods: 2x the speed of operations and 2x the speed of data movement.
• Autotuning: today's machines are too complicated; build "smarts" into software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
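As a concrete illustration of the mixed-precision bullet, here is a minimal sketch (not from the talk) of iterative refinement: the expensive O(n^3) solve runs in float32, and cheap float64 residual corrections recover double-precision accuracy. A production code would factor A once in single precision and reuse the LU factors for every correction; np.linalg.solve is called again here only for brevity.

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    A32 = A.astype(np.float32)
    # Expensive solve in single precision: ~2x faster ops, half the bytes moved
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                       # residual computed in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction also solved in single precision (a real code reuses the factors)
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well conditioned by construction
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
```

For a well-conditioned system, a handful of O(n^2) refinement steps is enough, so nearly all the work runs at single-precision speed.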
6. Major Changes to Algorithms/Software
• We must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.
7. Fork-Join Parallelization of LU and QR
Parallelize the update (dgemm):
• Easy, and done in any reasonable software.
• This is the (2/3)n^3 term in the flop count.
• Can be done efficiently with LAPACK plus a multithreaded BLAS.
[Figure: fork-join execution trace across the cores]
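The structure being described can be sketched in a few lines. This is illustrative code, not LAPACK's dgetrf: it omits partial pivoting, so the test matrix is made diagonally dominant. The trailing-matrix update in step 3 is the single large dgemm that carries the (2/3)n^3 flops and that a multithreaded BLAS parallelizes easily.

```python
import numpy as np

def blocked_lu(A, nb=32):
    """Right-looking blocked LU without pivoting; returns L\\U packed in place."""
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1) Panel factorization: unblocked LU of the column panel A[k:, k:e]
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # 2) Triangular solve for the U block row (dtrsm)
            L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L_kk, A[k:e, e:])
            # 3) Trailing-matrix update: one big dgemm -- the (2/3)n^3 term.
            #    In fork-join codes every core joins here before the next panel.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(1)
n = 128
M = rng.standard_normal((n, n)) + n * np.eye(n)   # dominant => pivoting unneeded
LU = blocked_lu(M)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
```

The sequential panel step between the big updates is exactly where the fork-join bottleneck of the next slide comes from.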
8. Synchronization (in LAPACK LU)
• Bulk synchronous processing: each step (Step 1, Step 2, Step 3, Step 4, ...) is a fork-join phase, so all cores wait at a barrier between steps.
• The alternative is to allow delayed updates and out-of-order, asynchronous, dataflow execution.
[Figure: execution traces contrasting bulk synchronous fork-join with dataflow execution]
9. PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures
Objectives:
• High utilization of each core
• Scaling to a large number of cores
• Synchronization-reducing algorithms
Methodology:
• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout
• Arbitrary DAGs with dynamic scheduling
[Figure: execution traces of fork-join parallelism vs. DAG-scheduled parallelism]
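A minimal sketch of the dynamic DAG-scheduling idea, assuming a made-up task-graph representation (QUARK's real interface differs): each task lists its prerequisites, and a task becomes ready the moment its inputs finish, so work from different "steps" can overlap instead of waiting at a fork-join barrier. The task names below (panel1, upd1a, ...) are hypothetical, standing in for tile operations.

```python
from collections import deque

def dataflow_execute(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Runs each task once all its prerequisites have completed."""
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, ds in deps.items():
        for d in ds:
            successors[d].append(t)
    ready = deque(t for t, c in remaining.items() if c == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                     # execute the task
        order.append(t)
        for s in successors[t]:        # release dependents as data flows
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return order

log = []
tasks = {name: (lambda n=name: log.append(n))
         for name in ["panel1", "upd1a", "upd1b", "panel2", "upd2a"]}
deps = {"upd1a": {"panel1"}, "upd1b": {"panel1"},
        "panel2": {"upd1a"},                 # panel 2 needs only its own column
        "upd2a": {"panel2", "upd1b"}}
order = dataflow_execute(tasks, deps)
```

Note that panel2 can run before upd1b finishes: there is no global barrier after "step 1", which is precisely the synchronization reduction the slide advertises.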
10. Communication Avoiding QR
Example: the matrix is split into four row domains D0-D3. Each domain performs an independent tile QR factorization (Domain_Tile_QR), producing local triangular factors R0-R3. These small R factors are then merged pairwise in a binary reduction tree (R0 with R1, R2 with R3, then the two results) to form the final R.
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.
11/20/2012
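The reduction tree on this slide can be sketched with plain numpy. This is an illustration of the tall-skinny QR (TSQR) idea, not PLASMA's implementation; Domain_Tile_QR and the pairwise merges are both modeled with np.linalg.qr.

```python
import numpy as np

def tsqr_R(A, p=4):
    """Compute the R factor of a tall-skinny A via a binary reduction tree."""
    blocks = np.array_split(A, p, axis=0)
    # Local step: independent QR in every domain (D0..D3 -> R0..R3)
    Rs = [np.linalg.qr(B)[1] for B in blocks]
    # Reduction tree: stack neighboring R factors pairwise and re-factor
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(2)
A = rng.standard_normal((4000, 32))        # tall and skinny
R_tree = tsqr_R(A)
R_ref = np.linalg.qr(A)[1]                 # reference R (unique up to row signs)
```

Only the small n x n R factors travel up the tree, so the factorization needs O(log p) messages instead of one synchronization per column: that is the communication avoidance.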
(Slides 11-14 repeat the Slide 10 diagram and citation verbatim, as animation frames of the reduction tree.)
15. PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
16. Power for QR Factorization
Four variants are compared:
• LAPACK's QR factorization (fork-join based)
• MKL's QR factorization (fork-join based)
• PLASMA's conventional QR factorization (DAG based)
• PLASMA's communication-reducing QR factorization (DAG based)
Platform: dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS. The matrix is very tall and skinny (m x n = 1,152,000 x 288).
[Figure: power profiles over time for the four QR variants]
Editor's notes
Notes for the system-architecture slides (3-4):
• Bisection bandwidth is calculated from the partition sizes and the link bandwidth. For Sequoia, 96 racks are arranged as 16x12x16x16x2; the minimal surface is 12x16x16x2. The bandwidth between the two minimal surfaces is (12x16x16x2) x 2 x 2 x 2 = 49.152 TB/s, where the 2x2 reflects bidirectional 2 GB/s links and the remaining factor of 2 comes from the torus connections, which bring a pair of links in each dimension.
• Node memory bandwidth for BG/Q is 42.6 GB/s advertised and 28.6 GB/s on STREAM triad. That is per NODE, not per core. What is the BG/Q bisection bandwidth?
• Missing from the table: latencies and message injection rates.
• Flops/watt costs money, and bytes/flop costs money. Joules per op in 2009, 2015, and 2018: roughly 100 pJ/op in 2015. Capacity doesn't cost as much power as bandwidth; the question is how many joules it takes to move a bit: about 2 pJ/bit, versus 75 pJ/bit for accessing DRAM.
• 32 PB keeps system memory at the same fraction of the system; a dollar figure is still needed. Compare the best machine buildable for 20 MW with the best machine buildable for $200M.
• A memory op is a 64-bit word: 75 pJ/bit per the DDR3 spec (multiply by 64), versus about 50 pJ for an entire 64-bit floating-point op. Memory technology could reach 5 pJ/bit by 2015 if we invest soon; anything more aggressive than 4 pJ/bit is close to the limit (no one will sign up for 2 pJ/bit). Flops: 10 pJ/flop in 2015, 5 pJ/flop in 2018. So we are talking about roughly a 30:1 energy ratio of a memory reference to a flop: ~10 pJ/op to bring a byte in, and 8 terabits x 1 pJ -> 8 watts.
• JEDEC is fundamentally broken (DDR4 is the end); low-swing differential signaling is a known technology. Per-component bandwidth goes from 20 GB/s to one order of magnitude more, at 10-12 Gb/s per wire.
• 16-64x, using Courant-limited scaling of hydro codes. Need the cost per DRAM part in that timeframe and how much to spend.
• Outstanding memory references per cycle: bandwidth x latency, based on the memory reference size. Memory concurrency: 200 cycles from DRAM at 2 GHz is 100 ns (40 ns for the memory alone); with queues it will be ~100 ns, so O(1000) outstanding references per node to memory, or O(10k) for 64-byte cache lines.
• System bisection still needs to be added: for 2015, whatever the local node bandwidth is, a factor of 4-8 (or 2-4) against per-node interconnect bandwidth; similarly for 2018.
• Occupancy vs. latency: essentially zero occupancy (one slot for a message launch), about 5 ns each; 2-4 in 2015, 2-4 in 2018. Compare 10^4 vs. 10^9.
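The bisection-bandwidth arithmetic in the note above can be checked directly (this just re-does the Sequoia calculation from the notes in Python):

```python
# Links crossing the minimal surface of the 16x12x16x16x2 arrangement
links = 12 * 16 * 16 * 2
gb_per_link = 2        # GB/s per direction per link
bidirectional = 2      # count both directions
torus_pair = 2         # torus wrap gives a pair of links per dimension
bisection_gb = links * gb_per_link * bidirectional * torus_pair
print(bisection_gb / 1000, "TB/s")   # -> 49.152 TB/s, matching the note
```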
Notes for the PowerPack slides (15-16):
The PowerPack platform consists of software and hardware instrumentation. The hardware tools include a WattsUp Pro meter and an NI meter, which connect to the monitored system to measure system-level and component-level power streams.
System power is measured upstream of the AC/DC converter, which is only 80-85% efficient; the power lost in conversion is why the component-level numbers do not add up to the system-level number. The test system is an Intel Xeon E5462 (Harpertown) machine with 8 DIMMs, i.e. 16 GB of memory (each DIMM is a 2 GB DDR2 module).