16. Scenario of AT for ppOpen‐APPL/FDM
[Flow diagram] Scenario for the library user:
1. Specify the problem size and the numbers of MPI processes and OpenMP threads.
2. Set the AT parameter and execute the library with OAT_AT_EXEC=1. The auto-tuner runs with fixed loop lengths (determined by the specified problem size and the numbers of MPI processes and OpenMP threads), measures the timings of the target kernels, and stores the information of the best candidates.
3. Set the AT parameter and execute the library with OAT_AT_EXEC=0. Execution now uses the stored fastest kernels without the AT process; this remains valid as long as the problem size and the numbers of MPI processes and OpenMP threads do not change.
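The two-phase scenario above can be sketched in miniature as follows. This is an illustrative sketch, not the ppOpen-AT API: only the environment-variable name OAT_AT_EXEC and the 100-iteration count come from the slides; the candidate kernels and function names are hypothetical stand-ins for the auto-generated Fortran variants.

```python
import os
import time

# Hypothetical stand-ins for the auto-generated kernel variants.
def candidate_a(n):
    return sum(i * i for i in range(n))

def candidate_b(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

CANDIDATES = [candidate_a, candidate_b]
TRIALS = 100  # the slides use 100 iterations per kernel for auto-tuning

def autotune(n):
    """OAT_AT_EXEC=1 phase: time every candidate with the loop length
    fixed by the given problem size, and return the fastest one's index."""
    timings = []
    for kernel in CANDIDATES:
        t0 = time.perf_counter()
        for _ in range(TRIALS):
            kernel(n)
        timings.append(time.perf_counter() - t0)
    return timings.index(min(timings))  # store the best candidate

def run(n, best=None):
    """OAT_AT_EXEC=0 phase: execute with the stored fastest kernel, no AT."""
    if best is None:
        best = autotune(n) if os.environ.get("OAT_AT_EXEC") == "1" else 0
    return CANDIDATES[best](n)

best = autotune(1000)      # phase 1: fixed problem size, measure and store
result = run(1000, best)   # phase 2: production run with the stored kernel
```

The stored index stays valid only while the loop lengths (problem size, MPI process count, OpenMP thread count) match those used during tuning, mirroring the restriction in step 3 above.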
25. Automatically Generated Codes for
Kernel 1 (ppohFDM_update_stress)
#1 [Baseline]: Original 3-nested loop
#2 [Split]: Loop splitting at the K-loop
(separated into two 3-nested loops)
#3 [Split]: Loop splitting at the J-loop
#4 [Split]: Loop splitting at the I-loop
#5 [Split&Fusion]: Loop fusion of the K- and J-loops applied to #1
(2-nested loop)
#6 [Split&Fusion]: Loop fusion of the K- and J-loops applied to #2
(2-nested loops)
#7 [Fusion]: Loop fusion applied to #1
(loop collapse)
#8 [Split&Fusion]: Loop fusion applied to #2
(loop collapse; two 1-nested loops)
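The candidate shapes above can be illustrated with a toy two-statement kernel. This is a minimal sketch of three of the eight variants; the array names and the computed expressions are placeholders, not the ppohFDM_update_stress code.

```python
NZ, NY, NX = 4, 3, 2
a = [[[float(k * 6 + j * 2 + i) for i in range(NX)]
      for j in range(NY)] for k in range(NZ)]

def new3d():
    return [[[0.0] * NX for _ in range(NY)] for _ in range(NZ)]

def baseline(a):
    """#1 [Baseline]: one 3-nested loop computing both statements."""
    s, t = new3d(), new3d()
    for k in range(NZ):
        for j in range(NY):
            for i in range(NX):
                s[k][j][i] = 2.0 * a[k][j][i]   # statement 1
                t[k][j][i] = a[k][j][i] + 1.0   # statement 2
    return s, t

def split_k(a):
    """#2 [Split]: the body is separated into two 3-nested loops."""
    s, t = new3d(), new3d()
    for k in range(NZ):              # first loop: statement 1 only
        for j in range(NY):
            for i in range(NX):
                s[k][j][i] = 2.0 * a[k][j][i]
    for k in range(NZ):              # second loop: statement 2 only
        for j in range(NY):
            for i in range(NX):
                t[k][j][i] = a[k][j][i] + 1.0
    return s, t

def collapsed(a):
    """#7 [Fusion]: the K, J, I loops collapsed into a single loop."""
    s, t = new3d(), new3d()
    for kji in range(NZ * NY * NX):
        k, rem = divmod(kji, NY * NX)
        j, i = divmod(rem, NX)
        s[k][j][i] = 2.0 * a[k][j][i]
        t[k][j][i] = a[k][j][i] + 1.0
    return s, t
```

All variants compute identical results; they differ only in loop structure, which is exactly what lets the auto-tuner pick whichever maps best onto the cache and thread layout of the target machine.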
29. AT Candidates in This Experiment
1. Kernel update_stress
– 8 kinds of candidates with loop collapse and loop split.
2. Kernel update_vel
– 6 kinds of candidates with loop collapse and re-ordering of statements.
3. Kernel update_stress_sponge
4. Kernel update_vel_sponge
5. Kernel ppohFDM_pdiffx3_p4
6. Kernel ppohFDM_pdiffx3_m4
7. Kernel ppohFDM_pdiffy3_p4
8. Kernel ppohFDM_pdiffy3_m4
9. Kernel ppohFDM_pdiffz3_p4
10. Kernel ppohFDM_pdiffz3_m4
– 3 kinds of candidates with loop collapse (kernels 3–10).
11. Kernel ppohFDM_ps_pack
12. Kernel ppohFDM_ps_unpack
13. Kernel ppohFDM_pv_pack
14. Kernel ppohFDM_pv_unpack
– Candidates with loop collapse for data packing and data unpacking (kernels 11–14).
Total number of kernel candidates: 47
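For the pack/unpack kernels (11–14), loop collapse applies to copying a multi-dimensional region into a contiguous communication buffer. The following toy sketch shows the idea with a collapsed loop; the function names and the 3-D field are hypothetical, not the ppohFDM_ps_pack/ps_unpack code.

```python
NZ, NY, NX = 3, 4, 5
field = [[[k * 100 + j * 10 + i for i in range(NX)]
          for j in range(NY)] for k in range(NZ)]

def pack(field):
    """Gather a 3-D array into a contiguous 1-D send buffer using a
    single collapsed loop over (k, j, i) (toy version of a pack kernel)."""
    buf = [0] * (NZ * NY * NX)
    for kji in range(NZ * NY * NX):   # one collapsed loop over k, j, i
        k, rem = divmod(kji, NY * NX)
        j, i = divmod(rem, NX)
        buf[kji] = field[k][j][i]
    return buf

def unpack(buf):
    """Inverse operation: scatter the 1-D buffer back into a 3-D array."""
    out = [[[0] * NX for _ in range(NY)] for _ in range(NZ)]
    for kji in range(NZ * NY * NX):
        k, rem = divmod(kji, NY * NX)
        j, i = divmod(rem, NX)
        out[k][j][i] = buf[kji]
    return out
```

Collapsing the three copy loops into one exposes the full iteration count to the OpenMP runtime, which matters when the outermost extent alone is too small to keep all threads busy.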
30. Machine Environment
(8 nodes of the Xeon Phi cluster)
• Intel Xeon Phi
– Xeon Phi 5110P (1.053 GHz), 60 cores
– Memory: 8 GB (GDDR5)
– Theoretical peak performance: 1.01 TFLOPS
– One board per node of the Xeon Phi cluster
• InfiniBand FDR x 2 ports
– Mellanox Connect-IB, PCI-E Gen3 x16, 56 Gbps x 2
– Theoretical peak bandwidth: 13.6 GB/s
– Full bisection
• Intel MPI
– Based on MPICH2 and MVAPICH2
– Version 4.1 Update 3 (build 048)
• Compiler: Intel Fortran version 14.0.0.080 Build 20130728
– Compiler options:
-ipo20 -O3 -warn all -openmp -mcmodel=medium -shared-intel -mmic
-align array64byte
– KMP_AFFINITY=granularity=fine,balanced (uniform distribution of threads)
31. Execution Details
• ppOpen-APPL/FDM ver. 0.2
• ppOpen-AT ver. 0.2
• Number of time steps: 2000
• Number of nodes: 8
• Native-mode execution
• Target problem size
(almost the maximum size with 8 GB/node)
– NX * NY * NZ = 1536 x 768 x 240 for 8 nodes
– NX * NY * NZ = 768 x 384 x 120 per node
(not per MPI process)
• Number of iterations for kernels
to do auto-tuning: 100
34. Maximum Speedups by AT
(Xeon Phi, 8 Nodes)
[Bar chart: speedup [%] per kind of kernel; the observed maximum speedups include 558%, 200%, 171%, 51%, 30%, and 20%.]
Speedup =
max ( execution time of original code / execution time with AT )
over all combinations of hybrid MPI/OpenMP executions (PXTY).
NX * NY * NZ = 1536 x 768 x 240 for 8 nodes.
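The speedup definition above can be spelled out as a short computation. The timing values below are made-up placeholders, not the measured Xeon Phi numbers, and the percent convention (ratio minus one, times 100) is one common reading of a "speedup [%]" axis, stated here as an assumption.

```python
# Hypothetical timings (seconds) per hybrid MPI/OpenMP combination PXTY
# (X MPI processes, Y OpenMP threads); placeholders only.
timings = {
    "P8T240":  {"original": 10.0, "with_AT": 4.0},
    "P16T120": {"original": 12.0, "with_AT": 5.0},
    "P32T60":  {"original": 15.0, "with_AT": 9.0},
}

def max_speedup_percent(timings):
    """Speedup = max over all PXTY combinations of
    (execution time of original code / execution time with AT),
    reported here as (ratio - 1) * 100 percent (assumed convention)."""
    best_ratio = max(t["original"] / t["with_AT"] for t in timings.values())
    return (best_ratio - 1.0) * 100.0
```

For the sample data the best ratio is 10.0 / 4.0 = 2.5, i.e. a 150% speedup; taking the maximum over the PXTY combinations means the chart reports each kernel at its most favorable process/thread configuration.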
39. References (works related to the author)
1. H. Kuroda, T. Katagiri, M. Kudoh, Y. Kanada, “ILIB_GMRES: An auto‐tuning parallel
iterative solver for linear equations,” SC2001, 2001. (Poster.)
2. T. Katagiri, K. Kise, H. Honda, T. Yuba, “ABCLib_DRSSED: A parallel eigensolver with an
auto‐tuning facility,” Parallel Computing, Vol. 32, Issue 3, pp. 231–250, 2006.
3. T. Sakurai, T. Katagiri, K. Naono, H. Kuroda, K. Nakajima, M. Igai, S. Ohshima, S. Itoh,
“Evaluation of auto‐tuning function on OpenATLib,” IPSJ SIG Technical Reports, Vol.
2011‐HPC‐130, No. 43, pp. 1–6, 2011. (in Japanese)
4. T. Katagiri, K. Kise, H. Honda, T. Yuba, “FIBER: A general framework for auto‐tuning
software,” The Fifth International Symposium on High Performance Computing
(ISHPC‐V), Springer LNCS 2858, pp. 146–159, 2003.
5. T. Katagiri, S. Ito, S. Ohshima, “Early experiences for adaptation of auto‐tuning by
ppOpen‐AT to an explicit method,” Special Session: Auto‐Tuning for Multicore and
GPU (ATMG) (In Conjunction with the IEEE MCSoC‐13), Proceedings of MCSoC‐13,
2013.
6. T. Katagiri, S. Ohshima, M. Matsumoto, “Auto‐tuning of computation kernels from an
FDM Code with ppOpen‐AT,” Special Session: Auto‐Tuning for Multicore and GPU
(ATMG) (In Conjunction with the IEEE MCSoC‐14), Proceedings of MCSoC‐14, 2014.
7. T. Katagiri, S. Ohshima, M. Matsumoto, "Directive-based auto-tuning for the finite
difference method on the Xeon Phi," The Tenth International Workshop on Automatic
Performance Tuning (iWAPT2015) (In Conjunction with the IEEE IPDPS2015),
Proceedings of IPDPSW2015, pp. 1221–1230, 2015.