200625material naruse

Akira Naruse, 25 June 2020
CUDAやOpenACCによる
GPUコンピューティング

2
AGENDA
▪ GPU Computing
▪ Latest GPUs (NVIDIA A100)
▪ OpenACC
▪ CUDA

4
Low latency + High throughput
CPU GPU

5
アプリケーション実行
アプリケーション・コード
GPU
CPU
並列部分は
GPUで実行
計算の
重い部分
逐次部分は
CPU上で実行
Do i=1,N
End do

6
GPUの構造 (NVIDIA A100)
6912 CUDA core/chip
この全てを活用できるかが鍵
108 Stream Multiprocessor (SM) /chip
64 CUDA core /SM

7
GPU APPLICATIONS
数百のアプリケーションがGPUに対応
www.nvidia.com/object/gpu-applications.html

9
アプリをGPU対応する方法
Application
CUDAOpenACCLibrary
GPU対応ライブラリにチェンジ
簡単に開始
主要処理をCUDAで記述
高い自由度
既存コードにディレクティブを挿入
簡単に加速

10
NVIDIA LIBRARIES
cuRAND Thrust
cuDNN
AmgXcuBLAS
cuFFT
cuSPARSE cuSOLVER
Performance
Primitives NCCL
nvGRAPH
TensorRT

11
PARTNER LIBRARIES
Computer Vision
Sparse direct solvers
Sparse Iterative MethodsLinear Algebra
Linear Algebra
Graph
Audio and Video Matrix, Signal and Image
Math, Signal and Image Linear Algebra
Computational
Geometry
Real-time visual
simulation

12
Application
Library
簡単に開始
CUDAOpenACC
高い自由度
簡単に加速

13
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc parallel copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
SAXPY (Y=A*X+Y)
OpenMP OpenACC
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma omp parallel for
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...

14
Application
Library OpenACC CUDA
簡単に開始
高い自由度
簡単に加速

15
__global__ void saxpy(int n, float a,
float *x, float *y)
{
int i = threadIdx.x + blodkDim.x * blockIdx;
if (i < n)
y[i] += a*x[i];
}
...
size_t size = sizeof(float) * N;
cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);
...
SAXPY (Y=A*X+Y)
CPU CUDA
void saxpy(int n, float a,
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...

17
NVIDIA AMPERE ARCHITECTURE GPU
54B トランジスタ
826 mm2
第3世代
Tensor Core
構造疎行列
加速
マルチ
インスタンス GPU
第3世代
NVLink & NVSwitch

1818
NVIDIA A100 GPU
3rd gen.
NVLINK
54 billion transistors in 7nmHBM2HBM2HBM2
HBM2HBM2HBM2
GigaThread Engine with MIG Control
L2 Cache L2 Cache
NVLink NVLink NVLink NVLink NVLinkNVLink NVLink NVLink NVLink NVLinkNVLink NVLink
BW 2x
A100
Scale
UP40MB L2
容量 6.7x
108 SMs
6912 CUDA Cores
40GB HBM2
1.56 TB/s
バンド幅 1.7x
Multi-Instance GPU
7x
Scale OUT

19
A100 SM
第3世代 Tensor Core
高速化と効率化
より多くのデータ型のサポート
構造的疎行列の加速
非同期コピーと非同期バリア
L1/共有メモリサイズの増加
3rd Gen.
TENSOR
CORE
3rd Gen.
TENSOR
CORE
3rd Gen.
TENSOR
CORE
3rd Gen.
TENSOR
CORE
192 KB L1 Data Cache / Shared Memory

20
V100
入力データ型積算データ型 TOPS
FP32 FP32 15.7
FP16 FP32 125
V100 TENSOR CORE
CUDA core
FP16
FP16
× + FP32
FP32
FP16 FP32
FP16/FP32混合精度「行列」演算ユニット

21
V100
入力データ型積算データ型 TOPS
FP32 FP32 15.7
FP16 FP32 125
FP32 FP32 19.5
TF32 FP32 156
FP16 FP32 312
BF16 FP32 312
FP16 FP16 312
INT8 INT32 624
INT4 INT32 1248
BINARY INT32 4992
IEEE FP64 19.5
A100 TENSOR CORE
A100
CUDA core
CUDA core

22
A100 NVLINK
12ポート、50GB/s /ポート (双方向)
8 GPU ベースボード4 GPU ベースボード
200
GB/s
600
GB/s

24
Application
Library
簡単に開始
CUDAOpenACC
高い自由度
簡単に加速

25
OPENACC
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
... serial code …
End Program myscience
CPU
GPU
既存のC/Fortranコード
簡単: 既存のコードに
コンパイラへのヒントを追加
強力: 相応の労力で、コンパイラが
コードを自動で並列化
オープン: 複数コンパイラベンダが、
様々なプロセッサをサポート
NVIDIA GPU, AMD GPU,
Intel CPU, Intel メニコア(予定)
ヒントの
追加

26
アプリケーション・コード
GPU
CPU
並列部分は
GPUで実行
計算の
重い部分
逐次部分は
CPU上で実行
Do i=1,N
End do
OpenACC

27
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
SAXPY (y=a*x+y)
▪ omp → acc ▪ データの移動
OpenMP OpenACC
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...

28
subroutine saxpy(n, a, X, Y)
real :: a, Y(:), Y(:)
integer :: n, i
!$acc parallel copy(Y(:)) copyin(X(:))
do i=1,n
Y(i) = a*X(i)+Y(i)
enddo
!$acc end parallel
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...
SAXPY (y=a*x+y, FORTRAN)
▪ FORTRANも同様
OpenMP OpenACC
subroutine saxpy(n, a, X, Y)
real :: a, X(:), Y(:)
integer :: n, i
!$omp parallel do
do i=1,n
Y(i) = a*X(i)+Y(i)
enddo
!$omp end parallel do
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...

29
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: PARALLEL DIRECTIVE
並列化領域を
指示する

30
float *x,
float *restrict y)
{
#pragma acc kernels copy(y[:n]) copyin(x[:n])
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: KERNELS DIRECTIVE
並列化領域を
指示する
Parallel: ユーザ主導
Kernels: コンパイラ主導

31
float *x,
float *restrict y)
{
#pragma acc loop independent
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: LOOP CLAUSE
ループの並列化方法を
指示する
independent: 並列化可能
seq: 逐次実行
collapse: ループ融合
gang/vector: 並列化粒度
...

32
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...
OPENACC: DATA CLAUSE
データの移動方法を
指示する
copyin: CPU → GPU
copyout: CPU  GPU
copy: (両方)
create: (メモリ確保)
...

33
float *x,
float *restrict y)
{
#pragma acc parallel present(x, y)
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
#pragma acc data copy(y[:N]) copyin(x[:N])
{
}
...
OPENACC DATA DIRECTIVE
データ領域を
指示する

34
OPENMP と OPENACC の共存
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i) {
y[i] += a*x[i];
}
}
...
...

35
簡単にコンパイル
OpenMP / OpenACC
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
$ pgcc -acc –Minfo=acc –ta=tesla saxpy.c
saxpy:
16, Generating present_or_copy(y[:n])
Generating present_or_copyin(x[:n])
Generating Tesla code
19, Loop is parallelizable
Accelerator kernel generated
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

36
簡単に実行
OpenMP / OpenACC
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
$ pgcc -acc –Minfo=acc –ta=tesla saxpy.c
saxpy:
16, Generating present_or_copy(y[:n])
Generating present_or_copyin(x[:n])
Generating Tesla code
19, Loop is parallelizable
Accelerator kernel generated
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
$ nvprof ./a.out
==10302== NVPROF is profiling process 10302, command: ./a.out
==10302== Profiling application: ./a.out
==10302== Profiling result:
Time(%) Time Calls Avg Min Max Name
62.95% 3.0358ms 2 1.5179ms 1.5172ms 1.5186ms [CUDA memcpy HtoD]
31.48% 1.5181ms 1 1.5181ms 1.5181ms 1.5181ms [CUDA memcpy DtoH]
5.56% 268.31us 1 268.31us 268.31us 268.31us saxpy_19_gpu

37
OPENACCが加速する科学技術計算

38
Janus Juul Eriksen, PhD Fellow
qLEAP Center for Theoretical Chemistry, Aarhus University
“
OpenACC makes GPU computing approachable for
domain scientists. Initial OpenACC implementation
required only minor effort, and more importantly,
no modificationsof our existing CPU implementation.
“
LSDALTON
Large-scale application for calculatinghigh-
accuracy molecular energies
Lines of Code
Modified
# of Weeks
Required
# of Codes to
Maintain
<100 Lines 1 Week 1 Source
Big Performance
7.9x
8.9x
11.7x
ALANINE-1
13 ATOMS
ALANINE-2
23 ATOMS
ALANINE-3
33 ATOMSSpeedupvsCPU
Minimal Effort
LS-DALTON CCSD(T) Module
Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
https://developer.nvidia.com/openacc/success-stories

39
Brad Sutton, Associate Professor of Bioengineering and
Technical Director of the Biomedical Imaging Center
University of Illinois at Urbana-Champaign
“Now that we’ve seen how easy it is to programthe
GPU using OpenACC and the PGI compiler, we’re
looking forward to translating more of our projects
“
CHALLENGE
• Produce detailed and accurate brain images by applying
computationally intensive algorithms to MRI data
• Reduce reconstruction time to make diagnostic use possible
SOLUTION
• Accelerated MRI reconstruction applicationwith OpenACC
using NVIDIA GPUs
RESULTS
• Reduced reconstruction time for a single high-resolution MRI
scan from 40 days to a couple of hours
• Scaled on Blue Waters at NCSAto reconstruct 3000 images in
under 24 hours
POWERGRID
Advanced MRI Reconstruction Model

40
“
OpenACC enabled us to
target routines for GPU
acceleration without
rewriting code, allowing us
to maintain portability on a
code that is 20-years old
“
David Gutzwiller, Head of HPC
NUMECA International
CHALLENGE
• Accelerate 20 year old highly optimized code
on GPUs
SOLUTION
• Accelerated computationally intensive routines
with OpenACC
RESULTS
• Achieved 10x or higher speed-up on key
routines
• Full app speedup of up to 2x on the Oak Ridge
Titan supercomputer
• Total time spent on optimizing various routines
with OpenACC wasjust five person-months
NUMECA FINE/TURBO
Commercial CFD Application

41
The most significant result from
our performance studiesis faster
computation with less energy
consumption compared with our
CPU-only runs.
Dr. Misun Min, Computation Scientist
Argonne National Laboratory
CHALLENGE
• Enable NekCEM to deliver strong scaling on
next-generation architectures while
maintaining portability
SOLUTION
• Use OpenACC to port the entire program
to GPUs
RESULTS
• 2.5x speedup over a highly tuned CPU-only
version
• GPU used only 39 percent of the energy
needed by 16 CPUs to do the same
computation in the same amount of time
NEKCEM
Computational Electromagnetics Application
“
“

42
Adam Jacobs, PhD candidate in the Department of Physics and
Astronomy at Stony Brook University
On the reactions side, accelerated calculations allow us to model larger
networks of nuclear reactions for similar computational costs as the
simple networks we model now
MAESTRO & CASTRO
Astrophysics 3D Simulation
““
CHALLENGE
• Pursuing strong scaling and portability to run
code on GPU-powered supercomputers
SOLUTION
• OpenACC compiler on OLCF’s Titan
supercomputer
RESULTS
• 4.4x faster reactionsthan on a multi-core
computer with 16 cores
• Accelerated calculationsallow modeling larger
networksof nuclear reactionsfor similar
computationalcosts as simple networks
2 weeks to
learn OpenACC
2 weeks to
modify code

43
Wayne Gaudin and Oliver Perks
Atomic Weapons Establishment, UK
We were extremely impressed that we can run OpenACC on a CPU with
no code change and get equivalent performance to our OpenMP/MPI
implementation.
CLOVERLEAF
Performance Portability for a Hydrodynamics Application
Speedupvs1CPUCore
Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80
““
CHALLENGE
• Application code that runs across architectures
without compromising on performance
SOLUTION
• Use OpenACC to port the CloverLeaf mini app
to GPUs and then recompile and run the same
source code on multi-core CPUs
RESULTS
• Same performance as optimized OpenMP
version on x86 CPU
• 4x faster performance using the same code on
a GPU

44
Lixiang Luo, Researcher
Aerospace Engineering Computational Fluid Dynamics Laboratory
North Carolina State University (NCSU)
“OpenACC is a highly effective tool for programming fully
implicit CFD solverson GPU to achieve true 4X speedup“
* CPU Speedup on 6 cor es of Xeon E5645, with additionalcor esperformance reduces due to partitioning and MPI overheads
CHALLENGE
• Accelerate a complex implicit CFD solver on GPU
SOLUTION
• Used OpenACC to run solver on GPU with
minimal code changes.
RESULTS
• Achieved up to 4X speedups on Tesla GPU over
parallel MPI implementation on x86 multicore
processor
• Structured approach to parallelism provided by
OpenACC allowed better algorithm design
without worrying about GPU architecture
INCOMP3D
3D Fully Implicit CFD Solver

45
Earthquake Simulation (理研、東大地震研)
▪ WACCPD 2016(SC16併催WS): Best Paper Award
http://waccpd.org/wp-content/uploads/2016/04/SC16_WACCPD_fujita.pdf より引用

46
NICAM
▪ 気象・気候モデル by 理研AICS/東大
▪ 膨大なコード (数十万行)
▪ ホットスポットがない (パレートの法則)
▪ 特性の異なる2種類の処理
▪ 力学系 … メモリバンド幅ネック
▪ 物理系 … 演算ネック

47
NICAM: 力学系(NICAM-DC)
▪ OpenACCによるGPU化
▪ 主要サブルーチンは、全てGPU
上で動作(50以上)
▪ MPI対応済み
▪ 2週間
▪ 良好なスケーラビリティ
▪ Tsubame 2.5, 最大2560
GPUs
▪ Scaling factor: 0.8
Weak scaling
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+00 1.E+01 1.E+02 1.E+03 1.E+04Performance(GFLOPS)
Number of CPUs or GPUs
Tsubame 2.5 (GPU:K20X)
K computer
Tsubame 2.5 (CPU:WSM)
(*) weak scaling
Courtesy of Dr. Yashiro from RIKEN AICS

48
NICAM: 物理系(SCALE-LES)
▪ Atmospheric radiation transfer
▪ 物理系の中で、最も重い計算
▪ OpenACCによるGPU対応済
1.00 1.99 3.88 8.51
37.8
76.0
151
0
20
40
60
80
100
120
140
160
1 core 2 core 4 core 10 core 1 GPU 2 GPUs 4 GPUs
Xeon E5-2690v2(3.0GHz,10-core) Tesla K40
Speedup
vs.CPU1-core
(*) PCIデータ転送時間込み, グリッドサイズ:1256x32x32
Better

49
SEISM3D
▪ 地震シミュレーション by 古村教授(東大地震研)
▪ 主要サブルーチンのGPU対応が完了
▪ メモリバンド幅ネック、 3次元モデル(2次元分割)、隣接プロセス間通信
605
459
134
0
100
200
300
400
500
600
K: 8x SPARC64
VIIIfx
CPU: 8x Xeon
E5-2690v2
GPU: 8x Tesla
K40
Time(sec)
SEISM3D (480x480x1024, 1K steps)
3.4x speedup
(アプリ全体)
0
20
40
60
80
100
120
140
GPU: 8x Tesla K40
Others (CPU, MPI and so on)
[CUDA memcpy DtoH]
[CUDA memcpy HtoD]
(other subroutines)
update_vel_pml
update_vel
update_stress_pml
update_stress
diff3d_*
GPUの実行時間内訳
Better

50
CM-RCM IC-CG
▪ IC-CG法のベンチマークコード by 中島教授(東大)
▪ CM-RCM法(Cyclic Multi-coloring Reverse Cuthill-Mckee)を使用
▪ メインループ内のサブルーチンを全てOpenACCでGPU化
3.36
1.58 1.53 1.39
0.655
0
1
2
3
4
OpenMP OpenMP OpenMP OpenMP OpenACC
Opteron 6386SE
(2.8GHz,16core)
SPARC64 Ixfx
(1.85GHz,16core)
Xeon E5-2680v2
(2.8GHz,10core)
Xeon-Phi 5110P Tesla K40
Time(second)
CM-RCM ICCG (100x100x100)
Better
Courtesy of Dr. Ohshima from U-Tokyo

51
CCS-QCD
▪ QCDコード by 石川准教授(広島大)
▪ BiCGStab計算を全てOpenACCでGPU化
▪ データレイアウトを変更: AoS → SoA
32.3
24.3 22.1 20.9 19.7
42.5
53.3
57.9 60.8 63.4
0
10
20
30
40
50
60
70
80
90
8x8x8x32 8x8x8x64 8x8x16x64 8x16x16x64 16x16x16x64
GFLOPS
Problem Size
CCS-QCD: BiCGStab Total FLOPS
Xeon E5-2690v2(3.0GHz,10core,OpenMP) Tesla K40(OpenACC)
Better

53
CUDAプログラミング
▪ プログラミングモデル
▪ アーキテクチャ
▪ 性能Tips

54
▪ 高スループット指向のプロセッサ
▪ 分離されたメモリ空間
GPU MemoryCPU Memory
PCI
CPU GPU

55
GPUプログラム
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
cudaDeviceSynchronize();
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU

56
GPU実行の基本的な流れ
▪ GPUは、CPUからの制御
で動作
▪ 入力データ: CPUからGPUに
転送 (H2D)
▪ GPUカーネル: CPUから投入
▪ 出力データ: GPUからCPUに
転送 (D2H)
CPU GPU
入力データ
転送
GPUカーネル
投入
同期
出力データ
転送
GPU上で
演算

57
GPUプログラム
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
入力データ転送
カーネル起動
同期
出力データ転送

58
GPUプログラム (Unified Memory)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, x, y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
カーネル起動
同期

59
GPUプログラム (Unified Memory)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
int dev_id = 0;
cudaMemPrefetchAsync(x, size, dev_id);
cudaMemPrefetchAsync(y, size, dev_id);
saxpy<<< N/128, 128 >>>(N, 3.0, x, y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
プリフェッチ

60
GPUカーネル
▪ GPUカーネル: 1つのGPUスレッドの処理内容を記述
▪ 基本: 1つのGPUスレッドが、1つの配列要素を担当
float *x, float *y)
{
int i = threadIdx.x + blodkDim.x * blockIdx.x;
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
...
float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
...
CPU GPU
Global スレッドID

61
Execution Configuration
(ブロック数とブロックサイズ)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
...
ブロック数ブロックサイズ
ブロック数 x ブロックサイズ = 配列要素数
スレッドID
ブロックID
ブロックサイズ

62
スレッド階層
(スレッド、ブロック、グリッド)
グリッド
x[] 0 127 128 255
y[] 0 127 128 255
y[i] = a*x[i] + y[i]
256 383 384
256 383 384
ブロック2ブロック1ブロック0
スレッド
(global)
0 127 128 255 256 383 384
▪ ブロックサイズ(スレッド数/ブロック)は、カーネル毎に設定可能
▪ 推奨: 128 or 256 スレッド
スレッド
(local)
0 127 0 127 0 127 0

63
Execution Configuration
(ブロック数とブロックサイズ)
float *x, float *y)
{
if (i < n)
y[i] += a*x[i];
}
...
saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
...
ブロック数ブロックサイズ
ブロック数 x ブロックサイズ = 配列要素数
N/ 64, 64N/256, 256

64
2D配列のGPUカーネル例
▪ ブロックサイズ(ブロック形状)は、1D～3Dで表現可能
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
int j = threadIdx.y + blodkDim.y * blockIdy.y;
if ( i < N && j < N )
C[i][i] = A[i][j] + B[i][j];
}
...
dim3 sizeBlock( 16, 16 );
dim3 numBlocks( N/sizeBlock.x, N/sizeBlock.y );
MatAdd<<< numBlocks, sizeBlock >>>(A, B, C);
...
32, 864, 4
Globalスレッド ID (x)
GlobalスレッドID (y)
ブロックサイズ (x,y)
ブロック数 (x,y)

65
ブロック・マッピング、スレッド・マッピング
dim3 sizeBlock(16,16)
(0,0) (1,0) (2,0)
(0,1) (1,1) (2,1)
(0,2) (1,2) (2,2)
ブロックID(blockIdx)
dim3 sizeBlock(32,8)
(0,0) (1,0)
(0,1) (1,1)
(0,2) (1,2)
(0,3) (1,3)
(0,4) (1,4)
(0,0)
(15,15)
(0,0)
(31,7)
スレッドID(threadIdx)

66
▪ 性能Tips

67
GPUアーキテクチャ概要 PCI I/F
▪ ホスト接続インタフェース
Giga Thread Engine
▪ SMに処理を割り振るスケジュー
ラ
DRAM I/F (HBM2)
▪ 全SM、PCI I/Fからアクセス可
能なメモリ (デバイスメモリ, フ
レームバッファ)
L2 cache (40 MB)
▪ 全SMからアクセス可能なR/W
キャッシュ
SM (Streaming Multiprocessor)
▪ 「並列」プロセッサ、A100:108

68
Stream Multi-Processor
演算ユニット
▪ INT32: 64個
▪ FP32: 64個
▪ FP64: 16個
▪ TensorCore: 4個
その他ユニット
▪ LD/ST, SFU, etc.
レジスタ(32bit): 64K個 (256KB)
共有メモリ/L1キャッシュ: 192KBNVIDIA A100

69
GPUカーネル実行の流れ
▪ CPUが、GPUに、グリッドを投入
▪ 具体的な投入先は、Giga Thread Engine
グリッド
ブロック
ブロック
スレッド
スレッド

70
▪ Giga Thread Engine(GTE)が、SMに、ブロックを投入
▪ GTEは、ブロックスケジューラ
▪ グリッドをブロックに分解して、ブロックを、空いているSMに割当てる
グリッド
ブロック
ブロック
スレッド
スレッド

71
ブロックをSMに割り当て
▪ 各ブロックは、互いに独立に実行
▪ ブロック間では同期しない、実行順序の保証なし
▪ 1つのブロックは複数SMにまたがらない
▪ 1つのSMに、複数ブロックが割当てられることはある
グリッド
ブロック
ブロック
ブロック
ブロック

72
▪ SM内のスケジューラが、スレッドをCUDAコアに投入
グリッド
ブロック
ブロック
スレッド
スレッド

73
▪ SM内のスケジューラが、ワープをCUDAコアに投入
▪ ワープ: 32スレッドの塊
▪ ブロックをワープに分割、実行可能なワープを、空CUDAコアに割当てる
グリッド
ブロック
ブロック
ワープ
ワープ

74
ワープのCUDAコアへの割り当て
▪ ワープ内の32スレッドは、同じ命令を同期して実行
▪ 各ワープは、互いに独立して実行
▪ 同じブロック内のワープは、明示的に同期可能(__syncthreads())
グリッド
ブロック
ブロック
ワープ
ワープ
ワープ
SIMT
(Single Instruction Multiple Threads)

75
GPUアーキの変化を問題としないプログラミングモデル
Kepler, CC3.5
192 cores /SM
Pascal, CC6.0
64 cores /SM
Maxwell, CC5.0
128 cores /SM

76
▪ 性能Tips

77
デバイスメモリへのアクセスは、まとめて
▪ コアレス・アクセス
▪ 32スレッド(ワープ)のロード/ストアをまとめて、メモリトランザクションを発行
▪ トランザクションサイズ: 32B, 64B or 128B
▪ トランザクション数は、少ないほど良い
配列 128B境界 128B境界
0 1 2 29 30 31283
0 1 14 17 30 311615
0 1 2
Best
Good
Bad
128B x1
64B x2
32B x32

78
リソース使用率 (Occupancy)
SMの利用効率を上げる
≒ SMに割当て可能なスレッド数を、
上限に近づける
▪ レジスタ使用量(/スレッド)
▪ できる限り減らす
▪ DP(64bit)は、2レジスタ消費
▪ レジスタ割り当て単位は8個
▪ レジスタ使用量と、割当て可能なスレッド数の関係
▪ 32レジスタ: 2048(100%), 64レジスタ: 1024(50%)
▪ 128レジスタ:512(25%), 256レジスタ: 256(12.5%)
CUDAコア数: 64
最大スレッド数: 2048
最大ブロック数: 32
共有メモリ: 160KB
レジスタ数(32-bit): 64K個
リソース量/SM (A100)

79
SMの利用効率を上げる
≒ SMに割当て可能なスレッド数を、
上限に近づける
▪ スレッド数(/ブロック)
▪ 64以上にする
▪ 64未満だと最大ブロック数がネックになる
▪ 共有メモリ使用量(/ブロック)
▪ できる限り減らす
▪ 共有メモリ使用量と、割当て可能なブロック数の関係
▪ 32KB:2ブロック, 16KB:4ブロック, 8KB:8ブロック
CUDAコア数: 64
最大スレッド数: 2048
最大ブロック数: 32
共有メモリ: 160KB
レジスタ数(32-bit): 64K個
リソース量/SM (A100)

80
空き時間を埋める
▪ CUDAストリーム (≒キュー)
▪ 同じCUDAストリームに投入した操作: 投入順に実行
▪ 別のCUDAストリームに投入した操作: 非同期に実行 (オーバラップ実行)
(*) 操作: GPUカーネル, データ転送
[CUDAストリームの効果例]
GPUカーネルとデータ転送が
オーバラップして
同時に実行されている

81
分岐を減らす
▪ ワープ内のスレッドが別パスを選択すると遅くなる
▪ ワープ内のスレッドは、命令を共有 (SIMT)
▪ ワープ内のスレッドが選んだ全パスの命令を実行
▪ あるパスの命令を実行中、そのパスにいないスレッドはinactive状態
▪ Path divergenceを減らす
▪ できる限り、同ワープ内のスレッドは
同じパスを選択させる
A[id] > 0
処理 X
処理 Y
true false
0 1
2
3

82
まとめ
▪ GPU Computing
▪ Latest GPUs (NVIDIA A100)
▪ OpenACC
▪ CUDA

83
GPU HACKATHONS
https://www.gpuhackathons.org/

200625material naruse

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 200625material naruse

Similar to 200625material naruse (20)

More from RCCSRENKEI

More from RCCSRENKEI (20)

200625material naruse