Critical Issues at Exascale for Algorithm and Software Design
1. Critical Issues at Exascale for Algorithm and Software Design
SC12, Salt Lake City, Utah, Nov. 2012
Jack Dongarra, University of Tennessee, Knoxville, Tennessee, USA
3. Potential System Architecture
Systems                      2012 (Titan)                        2022                       Difference (Today vs. 2022)
System peak                  27 Pflop/s                          1 Eflop/s                  O(100)
Power                        8.3 MW (2 Gflops/W)                 ~20 MW (50 Gflops/W)
System memory                710 TB (38 GB x 18,688 nodes)       32-64 PB                   O(10)
Node performance             1,452 GF/s (1,311 + 141)            1.2 or 15 TF/s             O(10) - O(100)
Node memory BW               232 GB/s (52 + 180)                 2-4 TB/s                   O(1000)
Node concurrency             16 CPU cores + 2,688 CUDA cores     O(1k) or O(10k)            O(100) - O(1000)
Total node interconnect BW   8 GB/s                              200-400 GB/s               O(10)
System size (nodes)          18,688                              O(100,000) or O(1M)        O(100) - O(1000)
Total concurrency            50 M                                O(billion)                 O(1,000)
MTTI                         unknown                             O(<1 day)                  O(10)
4. Potential System Architecture with a Cap of $200M and 20 MW
(Same projections as the Slide 3 table, now subject to a $200M cost cap and a 20 MW power cap.)
5. Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms: break the fork-join model.
• Communication-reducing algorithms: use methods that attain lower bounds on communication.
• Mixed-precision methods: 2x the speed of operations and 2x the speed of data movement.
• Autotuning: today's machines are too complicated; build "smarts" into software so it adapts to the hardware.
• Fault-resilient algorithms: implement algorithms that can recover from failures and bit flips.
• Reproducibility of results: today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
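As a concrete illustration of the mixed-precision bullet, here is a minimal sketch (not from the talk) of iterative refinement: the expensive O(n^3) solve runs in float32, and cheap float64 residual corrections recover double-precision accuracy. A production code would factor A once in single precision and reuse the LU factors for every correction; np.linalg.solve is called again here only for brevity.

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    A32 = A.astype(np.float32)
    # Expensive solve in single precision: ~2x faster ops, half the bytes moved
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                       # residual computed in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction also solved in single precision (a real code reuses the factors)
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well conditioned by construction
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
```

For a well-conditioned system, a handful of O(n^2) refinement steps is enough, so nearly all the work runs at single-precision speed.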
6. Major Changes to Algorithms/Software
• We must rethink the design of our algorithms and software.
• Manycore and hybrid architectures are a disruptive technology, similar to what happened with cluster computing and message passing.
• Rethink and rewrite the applications, algorithms, and software.
• Data movement is expensive; flops are cheap.
7. Fork-Join Parallelization of LU and QR
Parallelize the update (dgemm):
• Easy, and done in any reasonable software.
• This is the (2/3)n^3 term in the flop count.
• Can be done efficiently with LAPACK plus a multithreaded BLAS.
[Figure: fork-join execution trace across the cores]
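The structure being described can be sketched in a few lines. This is illustrative code, not LAPACK's dgetrf: it omits partial pivoting, so the test matrix is made diagonally dominant. The trailing-matrix update in step 3 is the single large dgemm that carries the (2/3)n^3 flops and that a multithreaded BLAS parallelizes easily.

```python
import numpy as np

def blocked_lu(A, nb=32):
    """Right-looking blocked LU without pivoting; returns L\\U packed in place."""
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1) Panel factorization: unblocked LU of the column panel A[k:, k:e]
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # 2) Triangular solve for the U block row (dtrsm)
            L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L_kk, A[k:e, e:])
            # 3) Trailing-matrix update: one big dgemm -- the (2/3)n^3 term.
            #    In fork-join codes every core joins here before the next panel.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(1)
n = 128
M = rng.standard_normal((n, n)) + n * np.eye(n)   # dominant => pivoting unneeded
LU = blocked_lu(M)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
```

The sequential panel step between the big updates is exactly where the fork-join bottleneck of the next slide comes from.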
8. Synchronization (in LAPACK LU)
• Bulk synchronous processing: each step (Step 1, Step 2, Step 3, Step 4, ...) is a fork-join phase, so all cores wait at a barrier between steps.
• The alternative is to allow delayed updates and out-of-order, asynchronous, dataflow execution.
[Figure: execution traces contrasting bulk synchronous fork-join with dataflow execution]
9. PLASMA/MAGMA: Parallel Linear Algebra Software for Multicore/Hybrid Architectures
Objectives:
• High utilization of each core
• Scaling to a large number of cores
• Synchronization-reducing algorithms
Methodology:
• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout
• Arbitrary DAGs with dynamic scheduling
[Figure: execution traces of fork-join parallelism vs. DAG-scheduled parallelism]
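A minimal sketch of the dynamic DAG-scheduling idea, assuming a made-up task-graph representation (QUARK's real interface differs): each task lists its prerequisites, and a task becomes ready the moment its inputs finish, so work from different "steps" can overlap instead of waiting at a fork-join barrier. The task names below (panel1, upd1a, ...) are hypothetical, standing in for tile operations.

```python
from collections import deque

def dataflow_execute(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Runs each task once all its prerequisites have completed."""
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    successors = {t: [] for t in tasks}
    for t, ds in deps.items():
        for d in ds:
            successors[d].append(t)
    ready = deque(t for t, c in remaining.items() if c == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                     # execute the task
        order.append(t)
        for s in successors[t]:        # release dependents as data flows
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return order

log = []
tasks = {name: (lambda n=name: log.append(n))
         for name in ["panel1", "upd1a", "upd1b", "panel2", "upd2a"]}
deps = {"upd1a": {"panel1"}, "upd1b": {"panel1"},
        "panel2": {"upd1a"},                 # panel 2 needs only its own column
        "upd2a": {"panel2", "upd1b"}}
order = dataflow_execute(tasks, deps)
```

Note that panel2 can run before upd1b finishes: there is no global barrier after "step 1", which is precisely the synchronization reduction the slide advertises.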
10. Communication Avoiding QR
Example: the matrix is split into four row domains D0-D3. Each domain performs an independent tile QR factorization (Domain_Tile_QR), producing local triangular factors R0-R3. These small R factors are then merged pairwise in a binary reduction tree (R0 with R1, R2 with R3, then the two results) to form the final R.
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In Proc. 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM.
11/20/2012
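The reduction tree on this slide can be sketched with plain numpy. This is an illustration of the tall-skinny QR (TSQR) idea, not PLASMA's implementation; Domain_Tile_QR and the pairwise merges are both modeled with np.linalg.qr.

```python
import numpy as np

def tsqr_R(A, p=4):
    """Compute the R factor of a tall-skinny A via a binary reduction tree."""
    blocks = np.array_split(A, p, axis=0)
    # Local step: independent QR in every domain (D0..D3 -> R0..R3)
    Rs = [np.linalg.qr(B)[1] for B in blocks]
    # Reduction tree: stack neighboring R factors pairwise and re-factor
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]))[1]
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(2)
A = rng.standard_normal((4000, 32))        # tall and skinny
R_tree = tsqr_R(A)
R_ref = np.linalg.qr(A)[1]                 # reference R (unique up to row signs)
```

Only the small n x n R factors travel up the tree, so the factorization needs O(log p) messages instead of one synchronization per column: that is the communication avoidance.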
(Slides 11-14 repeat the Slide 10 diagram and citation verbatim, as animation frames of the reduction tree.)
15. PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
16. Power for QR Factorization
Four variants are compared:
• LAPACK's QR factorization (fork-join based)
• MKL's QR factorization (fork-join based)
• PLASMA's conventional QR factorization (DAG based)
• PLASMA's communication-reducing QR factorization (DAG based)
Platform: dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS. The matrix is very tall and skinny (m x n = 1,152,000 x 288).
[Figure: power profiles over time for the four QR variants]
Editor's notes
Notes for the system-architecture slides (3-4):
• Bisection bandwidth is calculated from the partition sizes and the link bandwidth. For Sequoia, 96 racks are arranged as 16x12x16x16x2; the minimal surface is 12x16x16x2. The bandwidth between the two minimal surfaces is (12x16x16x2) x 2 x 2 x 2 = 49.152 TB/s, where the 2x2 reflects bidirectional 2 GB/s links and the remaining factor of 2 comes from the torus connections, which bring a pair of links in each dimension.
• Node memory bandwidth for BG/Q is 42.6 GB/s advertised and 28.6 GB/s on STREAM triad. That is per NODE, not per core. What is the BG/Q bisection bandwidth?
• Missing from the table: latencies and message injection rates.
• Flops/watt costs money, and bytes/flop costs money. Joules per op in 2009, 2015, and 2018: roughly 100 pJ/op in 2015. Capacity doesn't cost as much power as bandwidth; the question is how many joules it takes to move a bit: about 2 pJ/bit, versus 75 pJ/bit for accessing DRAM.
• 32 PB keeps system memory at the same fraction of the system; a dollar figure is still needed. Compare the best machine buildable for 20 MW with the best machine buildable for $200M.
• A memory op is a 64-bit word: 75 pJ/bit per the DDR3 spec (multiply by 64), versus about 50 pJ for an entire 64-bit floating-point op. Memory technology could reach 5 pJ/bit by 2015 if we invest soon; anything more aggressive than 4 pJ/bit is close to the limit (no one will sign up for 2 pJ/bit). Flops: 10 pJ/flop in 2015, 5 pJ/flop in 2018. So we are talking about roughly a 30:1 energy ratio of a memory reference to a flop: ~10 pJ/op to bring a byte in, and 8 terabits x 1 pJ -> 8 watts.
• JEDEC is fundamentally broken (DDR4 is the end); low-swing differential signaling is a known technology. Per-component bandwidth goes from 20 GB/s to one order of magnitude more, at 10-12 Gb/s per wire.
• 16-64x, using Courant-limited scaling of hydro codes. Need the cost per DRAM part in that timeframe and how much to spend.
• Outstanding memory references per cycle: bandwidth x latency, based on the memory reference size. Memory concurrency: 200 cycles from DRAM at 2 GHz is 100 ns (40 ns for the memory alone); with queues it will be ~100 ns, so O(1000) outstanding references per node to memory, or O(10k) for 64-byte cache lines.
• System bisection still needs to be added: for 2015, whatever the local node bandwidth is, a factor of 4-8 (or 2-4) against per-node interconnect bandwidth; similarly for 2018.
• Occupancy vs. latency: essentially zero occupancy (one slot for a message launch), about 5 ns each; 2-4 in 2015, 2-4 in 2018. Compare 10^4 vs. 10^9.
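The bisection-bandwidth arithmetic in the note above can be checked directly (this just re-does the Sequoia calculation from the notes in Python):

```python
# Links crossing the minimal surface of the 16x12x16x16x2 arrangement
links = 12 * 16 * 16 * 2
gb_per_link = 2        # GB/s per direction per link
bidirectional = 2      # count both directions
torus_pair = 2         # torus wrap gives a pair of links per dimension
bisection_gb = links * gb_per_link * bidirectional * torus_pair
print(bisection_gb / 1000, "TB/s")   # -> 49.152 TB/s, matching the note
```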
Notes for the PowerPack slides (15-16):
The PowerPack platform consists of software and hardware instrumentation. The hardware tools include a WattsUp Pro meter and an NI meter, which connect to the monitored system to measure system-level and component-level power streams.
System power is measured upstream of the AC/DC converter, which is only 80-85% efficient; the power lost in conversion is why the component-level numbers do not add up to the system-level number. The test system is an Intel Xeon E5462 (Harpertown) machine with 8 DIMMs, i.e. 16 GB of memory (each DIMM is a 2 GB DDR2 module).