Slide 44
April 2022, Special Lecture on Computational Science and Technology B (計算科学技術特論B)
High Parallelization of RSDFT - Scalability -
The computation time, as a result of keeping the block data on the L1 cache manually, decreased by 12% compared with the computation time for the usual data replacement operations of the L1 cache. This DGEMM tuned for the K computer was also used for the LINPACK benchmark program.

5.2 Scalability
We measured the computation time for the SCF iterations with […]. On the other hand, the global communication time for the parallel tasks in orbitals was expected to increase as the number of parallel tasks in orbitals increased. The number of MPI processes requiring communications for the parallel tasks in orbitals, however, was actually restricted to a relatively small number of compute nodes, and therefore the wall-clock time for global communications of the parallel tasks in orbitals was small. This means we succeeded in decreasing the time for global communication by the combination […]
Figure 6. Computation and communication time of (a) GS, (b) CG, (c) MatE/SD and (d)
RotV/SD for different numbers of cores.
[Four panels plot time per routine (sec.) against the number of cores (0 to 80,000): (a) GS, 0–400 s, with series theoretical computation, computation, global/space, global/orbital and wait/orbital; (b) CG, 0–160 s; (c) MatE/SD, 0–200 s; (d) RotV/SD, 0–300 s. Panels (b)–(d) share the series theoretical computation, computation, adjacent/space, global/space and global/orbital.]
Resolving the mismatch in degrees of parallelism
Eliminating the growth of global communication
Slide 45
High Parallelization of RSDFT
[…] confinement becomes prominent. The quantum effects, which depend on the crystallographic directions of the nanowire axes and on the cross-sectional shapes of the nanowires, result in substantial modifications to the energy-band structures and the transport characteristics of SiNW FETs. However, knowledge of the effect of the structural morphology on the energy bands of SiNWs is lacking. In addition, actual nanowires have side-wall roughness. The effects of such imperfections on the energy bands are […]
Table 2. Distribution of computational costs for an iteration of the SCF calculation of the modified code.

Procedure      Execution  Computation  ----------- Communication time (s) -----------  Performance
block          time (s)   time (s)     Adjacent/grids  Global/grids  Global/orbitals  Wait/orbitals  (PFLOPS/%)
SCF            2903.10    1993.89      61.73           823.02        12.57            11.89          5.48/51.67
SD             1796.97    1281.44      13.90           497.36        4.27             -              5.32/50.17
MatE/SD        525.33     363.18       13.90           143.98        4.27             -              6.15/57.93
EigenSolve/SD  492.56     240.66       -               251.90        -                -              0.01/1.03
RotV/SD        779.08     677.60      -               101.48        -                -              8.14/76.70
CG             159.97     43.28        47.83           68.85         0.01             -              0.06/0.60
GS             946.16     669.17       -               256.81        8.29             11.89          6.70/63.10

The test model was a SiNW with 107,292 atoms. The numbers of grids and orbitals were 576 × 576 × 180 and 230,400, respectively. The numbers of parallel tasks in grids and orbitals were 27,648 and three, respectively, using 82,944 compute nodes. Each parallel task had 2160 grids and 76,800 orbitals.
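The decomposition figures in the note above are mutually consistent. A short sketch (plain Python, illustrative only, using the slide's numbers) verifies the arithmetic:

```python
# Cross-check of the decomposition in the Table 2 note.
grids = 576 * 576 * 180          # real-space grid points
orbitals = 230_400
tasks_grid, tasks_orb = 27_648, 3

nodes = tasks_grid * tasks_orb           # 27,648 x 3 = 82,944 compute nodes
grids_per_task = grids // tasks_grid     # 2,160 grids per parallel task
orbs_per_task = orbitals // tasks_orb    # 76,800 orbitals per parallel task
print(nodes, grids_per_task, orbs_per_task)
```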
Yukihiro Hasegawa, Jun-Ichi Iwata, Miwako Tsuji, Daisuke Takahashi, Atsushi Oshiyama, Kazuo Minami, Taisuke Boku, Hikaru Inoue, Yoshito Kitazawa, Ikuo Miyoshi and Mitsuo Yokokawa: "Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer," The International Journal of High Performance Computing Applications, pp. 1–21, published online 17 October 2013 (© The Author(s) 2013). DOI: 10.1177/1094342013508163. Online version: http://hpc.sagepub.com/content/early/2013/10/16/1094342013508163
Overall performance
Slide 61
The Roofline Model
Figure 1: Roofline model for (a) AMD Opteron X2 and (b) Opteron X2 vs. Opteron X4. [Log-log plots of attainable GFlops/s (1/2 to 128) against operational intensity (Flops/Byte, 1/4 to 16); the sloped roof is the peak memory bandwidth (STREAM) and the flat roof is the peak floating-point performance. Kernels with operational intensity left of the ridge point are memory-bound; those to the right are compute-bound.]
• Effective memory bandwidth of the hardware: B
• Peak performance of the hardware: F
• Operational intensity of the application: X = f/b
• Required flops of the application: f
• Required bytes of the application: b

Attainable application performance: min(F, B*X)

B*X = B*(f/b) = B*F*f/(b*F) = (B/F)/(b/f)*F

Ratio of application performance to peak: min(1.0, (B/F)/(b/f)),
i.e., the hardware's B/F value divided by the application's b/f value.

S. Williams, A. Waterman, and D. Patterson: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.
Slide 63
Memory and Cache Access (1)
do J = 1, NY
  do I = 1, NX
    do K = 3, NZ-1
      DZV(K,I,J) = (V(K,I,J) - V(K-1,I,J))*R40 &
                 - (V(K+1,I,J) - V(K-2,I,J))*R41
    end do
  end do
end do

Memory: 1 load, 1 store (DZV)
Memory: 1 load (V)
$L1: 2 loads
$L1: 1 load
Performance estimation with the roofline model
Slide 64
An Example of Performance Estimation
do J = 1, NY
  do I = 1, NX
    do K = 3, NZ-1
      DZV(K,I,J) = (V(K,I,J) - V(K-1,I,J))*R40 &
                 - (V(K+1,I,J) - V(K-2,I,J))*R41
    end do
  end do
end do
• The innermost (K) axis carries the finite difference.
• With one memory stream, the other three array references stay on $L1 and can be reused.

Required flops: add: 3, mult: 2 → 5 flops
Required bytes: counted as 1 store + 2 loads → 4 bytes × 3 = 12 bytes
Required B/F: 12/5 = 2.4
Predicted performance ratio: 0.36/2.4 = 0.15
Measured value: 0.153

Predicted fraction of peak performance for the application
= hardware's (effective) B/F value / application's required B/F value

Performance estimation with the roofline model
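The estimate above can be reproduced in a few lines. The hardware effective B/F of 0.36 is the slide's value; everything else follows from the kernel's 5 flops and 12 bytes per grid point:

```python
# Roofline estimate for the finite-difference kernel above.
flops_per_point = 3 + 2        # 3 additions + 2 multiplications
bytes_per_point = 3 * 4        # 1 store + 2 memory loads, 4 bytes each (single precision)

required_bf = bytes_per_point / flops_per_point   # application's required B/F = 2.4
hw_bf = 0.36                                      # hardware effective B/F (slide value)

predicted_ratio = min(1.0, hw_bf / required_bf)   # predicted fraction of peak = 0.15
print(required_bf, predicted_ratio)
```

The prediction of 15% of peak agrees closely with the measured 15.3% on the K computer, which is what makes the roofline model useful as a quick tuning target.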