ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems
1. Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking for Extreme-Scale Many-Core Systems
ESPM2 2018
Nov.12th, Dallas
Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto,
Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori,
Miyuki Tsubouchi, Jun Makino
2. Abstract
For an explicit finite-difference scheme applied to computational
fluid dynamics, we achieved 4.78 PFlops (21.5% of peak
performance) by using temporal blocking on a large-scale
PEZY-SC2 based system, which has a very low B/F (bytes per flop)
The achieved efficiency is comparable to recent results on very
high B/F systems
To achieve this efficiency on a low B/F machine, we developed:
A framework for explicit stencil computation that generates
the MPI boilerplate code and the device kernel code with
temporal blocking
A finite-difference scheme suitable for temporal blocking
3. Table of Contents
Introduction
Explicit stencil computation
Temporal blocking
About PEZY-SC2
Details of our work
Code generation framework: Formura
Optimization for PEZY-SC2
Benchmark results
Performance on large-scale systems (Gyoukou)
Discussion and summary
5. Explicit Stencil Computation
Explicit stencil computation is a simple but very important
HPC application
It is used to simulate weather, earthquakes, the solar
interior, etc.
Optimizing stencil computation is therefore very important
Source: Riken
6. Efficiency of Recent Stencil Computation
The efficiency of explicit methods on recent HPC
hardware is not high enough
Even the best-case efficiency on the K computer is ≅ 20%;
in many other cases it is only ≅ 10%
This low efficiency is caused either by the processor
architecture or by limited memory bandwidth
We address the memory bandwidth problem, which
does not depend on the architecture
Over the past decades, the B/F of HPC systems has
decreased dramatically
This trend seems likely to continue
7. Relative Performance Trend
Green: FLOPS vs. memory bandwidth (4.5x/decade)
Red: FLOPS vs. network latency (~30x/decade)
This trend seems likely to continue
Source: John D. McCalpin
9. PEZY-SC2 Architecture
The chip consists of 8 prefectures
Each prefecture contains 16 cities
Each city contains 16 processor elements (PEs)
8 × 16 × 16 - 64 (redundancy) = 1984 PEs
Each city shares an L2 cache
10. Gyoukou
Supercomputer installed at JAMSTEC, Japan
(Available until April 2018)
Peak 28.2PFlops (Full nodes)
Top500 4th (Nov 2017)
10,000 PEZY-SC2 chips + 1,250 Xeon D processors (one per 8 SC2s)
World's largest number of MIMD processor cores
(≒ 20M; cf. TaihuLight ≒ 11M)
Suitable for testing whether code can scale to exascale systems
12. Temporal Blocking (TB)
One solution for running explicit methods on low-B/F systems
With TB, multiple time steps are computed on a working
array before it is written back, which reduces the required
B/F as long as the working array fits in the processor cache
[Figure labels: DRAM, Network, Cache, Computation for one time step, Read from memory, Write to memory]
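The idea above can be illustrated with a minimal Python sketch. It is not the Formura-generated code, and it does not use the parallelogram-shaped blocks described on the following slides. It uses the simpler overlapped variant instead, where each block is loaded together with a halo of NT cells per side and the halo is recomputed redundantly. The 3-point averaging stencil, the periodic boundary, and the names `step`, `naive`, and `temporally_blocked` are assumptions made for illustration only.

```python
def step(u):
    """One explicit step of a 3-point averaging stencil (illustrative).

    The output is one cell shorter on each side than the input.
    """
    return [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]


def naive(u, nt):
    """Advance the whole periodic array nt steps, one full sweep per step."""
    for _ in range(nt):
        padded = [u[-1]] + u + [u[0]]  # periodic neighbours
        u = step(padded)
    return u


def temporally_blocked(u, nt, block):
    """Load each block once, advance it nt steps locally, store it once."""
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Block plus a halo of nt cells per side (periodic wrap-around).
        # Each local step consumes one halo cell per side.
        local = [u[i % n] for i in range(start - nt, stop + nt)]
        for _ in range(nt):
            local = step(local)
        out[start:stop] = local  # exactly stop - start cells remain
    return out


if __name__ == "__main__":
    u0 = [float((7 * i) % 13) for i in range(40)]
    a = naive(u0, 4)
    b = temporally_blocked(u0, 4, 8)
    print(max(abs(x - y) for x, y in zip(a, b)))
```

In the blocked version, the main array is read and written once per NT steps instead of once per step. The price is the redundant computation inside the halos.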
13. Various Methods of TB
A variation of this is used for inter-node communication
This is used for in-node computation
14. Details of Our TB Calculation
Inter-node communication
Each node sends data in one direction only
Each node receives data from one direction only
This allows simple overlapping of communication and computation
15. Details of Our TB Calculation
Computation starts from the right-most block
The upper-right part of the parallelogram uses dummy data so
that all loops have the same length
The gray part holds results that are not needed
This method adds only a small amount of computation
16. SL4TH3 Scheme
Fourth order accuracy
Number of stencil = 2
Flop per cell per step ≒ 2800
Required B/F ≒ 0.05 w/o TB
18. Formura: a Framework for TB
From a stencil description written in the Formura DSL,
Formura generates optimized distributed-parallel code for
large-scale parallel computers
In this work, we added support for the TB method to
Formura and developed a device kernel code generator
for PEZY-SC2
[Figure: Formura DSL -> (formura) -> MPI driver code + TB kernel code -> (gcc/mpicc) -> Executable]
19. Code generation by Formura
Input Equation
(Needs some more configuration files)
(A very small part of the) generated C code
・・・
Output (Zoomed-in)
20. Code Generation
Formura generates:
Driver code that distributes the TB computation over MPI
Optimized kernel code for node-local computation
For a new accelerator (or any other processor), a backend
can be added by modifying the code that computes the
temporal blocking steps
Typically, the main device-specific optimizations are the
block layout (for data-access locality) and thread
scheduling
21. Optimization for PEZY-SC2
Decide the block size
The block must be smaller than the LLC
The parallelism should be close to the number of PEs for load balancing
44 × 44 × 44 is the best block size
44^3 × 10 (variables/cell) × 8 bytes = 6.50 MB
6.50 MB × 2 (for overlapping read and write) < 32 MB = LLC size
44^2 (parallelism) = 1936 ≒ 1984 (number of PEs)
Total 880^3 (= (44 × 20)^3) cells per node < 64 GB
Allocate adjacent cells to PEs that share an L2 cache
Reduce the instruction size of the innermost loop
PEZY-SC2's L1 I-cache size = 4 KB (= 1024 ops)
SL4TH3 requires 2800 ops/cell
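The block-size arithmetic above can be checked with a short Python sketch. The selection rule used here (largest cubic edge n such that two copies of the block fit in the LLC and n^2 does not exceed the number of PEs) is one reading of the constraints on this slide, not a procedure stated by the authors. The names `block_bytes`, `fits`, and `best_edge` are hypothetical, and "MB"/"GB" are assumed to mean 2^20 and 2^30 bytes, which is what makes 44^3 × 80 bytes come out as 6.50.

```python
N_PES = 1984                 # usable PEs per PEZY-SC2 chip
LLC_BYTES = 32 * 2**20       # LLC size, assuming "32MB" means 32 MiB
NODE_BYTES = 64 * 2**30      # per-node limit, assuming "64GB" means 64 GiB
VARIABLES_PER_CELL = 10
BYTES_PER_VALUE = 8          # double precision


def block_bytes(n):
    """Bytes needed to hold one n x n x n block of cells."""
    return n**3 * VARIABLES_PER_CELL * BYTES_PER_VALUE


def fits(n):
    """Two block copies (read and write) fit in the LLC, and the
    n x n plane of parallel work items does not exceed the PE count."""
    return 2 * block_bytes(n) <= LLC_BYTES and n * n <= N_PES


best_edge = max(n for n in range(1, 200) if fits(n))

if __name__ == "__main__":
    print("best edge:", best_edge)
    print("block size (MiB):", round(block_bytes(best_edge) / 2**20, 2))
    print("parallelism:", best_edge**2, "of", N_PES, "PEs")
    node_bytes = (best_edge * 20) ** 3 * VARIABLES_PER_CELL * BYTES_PER_VALUE
    print("per-node data (GiB):", round(node_bytes / 2**30, 1))
```

Under these assumptions the PE-count constraint is the binding one: the LLC alone would allow a larger block, but an edge of 45 would need 2025 parallel work items, more than the 1984 PEs available.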
24. Effect of Temporal Blocking
Required B/F by NT (size per node = 880^3; NT = time step parameter)
NT   Redundant calculation by TB   Required B/F
1    1.4%                          0.058
2    2.7%                          0.029
3    4.0%                          0.020
4    5.4%                          0.015
5    6.7%                          0.012
6    8.0%                          0.010
7    9.2%                          0.009
8    10.5%                         0.008
[Figure: Calculation speed by NT; axes: NT, GF]
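The required B/F values in the table above can be reproduced by a back-of-the-envelope model, sketched here in Python. The model is an assumption, not a formula given on the slides: each of the 10 variables per cell is read once and written once per NT time steps (8 bytes each), the traffic grows by the redundant-calculation fraction, and each cell costs about 2800 flop per step. The name `required_bf` is hypothetical.

```python
VARIABLES_PER_CELL = 10      # from the block-size slide
BYTES_PER_VALUE = 8          # double precision
FLOP_PER_CELL_STEP = 2800    # SL4TH3 estimate from the slides

# Redundant-calculation fractions copied from the table, indexed by NT.
REDUNDANCY = {1: 0.014, 2: 0.027, 3: 0.040, 4: 0.054,
              5: 0.067, 6: 0.080, 7: 0.092, 8: 0.105}


def required_bf(nt):
    """Estimated bytes moved per flop when nt steps are fused.

    Assumption: one read and one write of every variable per nt steps,
    inflated by the redundancy fraction for that nt.
    """
    traffic = 2 * VARIABLES_PER_CELL * BYTES_PER_VALUE  # bytes per cell
    return traffic * (1.0 + REDUNDANCY[nt]) / (FLOP_PER_CELL_STEP * nt)


if __name__ == "__main__":
    for nt in sorted(REDUNDANCY):
        print(nt, round(required_bf(nt), 3))
```

Rounded to three decimals, this model matches every row of the table, which suggests (but does not prove) that the authors used a similar estimate.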
25. Comparison with Other Studies
This work achieves very high efficiency
The result is comparable to those obtained on a very high B/F (= 0.5) system
[Chart: efficiency (%) and device B/F for Yashiro et al., Yang et al., Hotta et al., and this work]
26. Weak Scaling
The communication is completely hidden by computation
Thus, even though the actual communication time increases
as the number of nodes increases, the weak scaling of the
performance remains good
[Figure legend: communication time, total time]
28. Future Works
Other schemes
HLLD
Other applications
Tsunami (shallow-water equation)
Reaction-diffusion system
Further performance improvement
29. Conclusion
We achieved 4.78 PFlops (21.5% of peak performance)
with a fluid simulation code on a large-scale
PEZY-SC2 based system
We developed an automatic code generation framework
for TB, a finite-difference scheme suitable for TB, and a
backend for the PEZY-SC2 accelerator
The achieved efficiency is comparable to other results on
high B/F systems