An evaluation of LLVM compiler for SVE with fairly complicated loops

An Evaluation of
(ARM’s) LLVM Compiler for SVE
with Fairly Complicated Loops
Hiroshi Nakashima
(Kyoto University / RIKEN-AICS)

Introduction
 We’re evaluating several compilers targeting
SVE and AVX-512 using kernel loops in a
production-level particle-in-cell (PIC) code.
 The program is:
 written in C99 with restrict (and const) pointer
qualifiers so that do-all loops operating on arrays
are vectorized.
 parallelized by OpenMP (and MPI) so that all loops
are in a big region of #pragma omp parallel.
 free from any compiler-specific directives,
intrinsics, and #pragma omp simd.
 Evaluation is done by inspecting the generated
assembly (.s) and is based on the number of
instructions in each loop body.
ARM HPC Workshop © 2017 H. Nakashima

Kernel Loops (1/2)
 Loops operate on:
 SOA-type 1D arrays p{xyz}[p] and v{xyz}[p]
of positional/velocity vectors of a particle p;
 SOA-type 4D arrays ef[][z][y][x] and
bf[][z][y][x] for electric/magnetic field;
 SOA-type 4D array jv[][z][y][x] for current
density;
 to accelerate p in each cell c referring to E/B
vectors at c’s vertices, to move p, and to
update J vectors at c’s vertices.
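As a rough illustration of the SOA layout described above, the sketch below keeps each component in its own contiguous array, so a loop over p touches only unit-stride data. The names `NP`, `move_particles` and `dt`, and the simple position update itself, are assumptions made for illustration; they are not taken from the code under evaluation.

```c
#include <stddef.h>

/* Hypothetical SOA particle storage: one contiguous array per component. */
enum { NP = 8 };                        /* assumed particle count */
static double px[NP], py[NP], pz[NP];   /* positions  p{xyz}[p] */
static double vx[NP], vy[NP], vz[NP];   /* velocities v{xyz}[p] */

/* Move the first n particles by v*dt (assumed update rule).
 * Every access is unit-stride in p, which is what a do-all
 * vectorizer wants to see. */
static void move_particles(size_t n, double dt)
{
    for (size_t p = 0; p < n; p++) {
        px[p] += vx[p] * dt;
        py[p] += vy[p] * dt;
        pz[p] += vz[p] * dt;
    }
}
```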

Kernel Loops (2/2)
 particle_push-1
Simple v=a*u for 4-dimensional arrays v and u.
 ppush-1
With p{xyz}[p], v{xyz}[p], E/B-field vectors in 48 scalar
variables and base coordinates in 3 scalar variables,
perform Lorentz acceleration to update v{xyz}[p] with
interpolation of E/B-field vectors.
 ppush-2 / cscat-1
With p{xyz}[p], v{xyz}[p] and base coordinates in 6
scalar variables, extrapolate the contributions to J
vectors in 12 scalar variables and accumulate them. In
ppush-2, p{xyz}[p] is updated and moving directions
are recorded in mdir[i=p-head].
 pmove-1
while (mdir[j]==0.0) j++;

Bad News
 Are 5 loops vectorized?
 Why cannot ARM vectorize them?
 We don’t know, esp. for ppush-2 which is very
similar to cscat-1.
 Can scalar loops be the base of vectorized
loops?
 Yes, but need some improvements.

Compiler       particle_push-1  ppush-1  ppush-2  cscat-1  pmove-1
ARM 1.4        NO               NO       NO       YES      NO
Fujitsu Oct17  NO               YES      YES      YES      NO
Intel 17.0.1   YES              YES      YES      YES      NO
Cray 8.6.1     YES              YES      YES      YES      NO

particle_push-1
 Source (summary)
double (*const restrict et)[esz][esy][esx]=...;
const double (*ef)[esz][esy][esx]=...;
for(z) for(y) for(x) {
et[0][z][y][x] = ef[0][z][y][x] * qmr;
et[1][z][y][x] = ef[1][z][y][x] * qmr;
et[2][z][y][x] = ef[2][z][y][x] * qmr;
bt[0][z][y][x] = bf[0][z][y][x] * qmr;
bt[1][z][y][x] = bf[1][z][y][x] * qmr;
bt[2][z][y][x] = bf[2][z][y][x] * qmr;
}
 const-qualification of RHS arrays looks
insufficient for ARM (& Fujitsu) to vectorize
the loop, while Intel & Cray exploit the
qualification for 8x2 unrolling.
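For readers who want to try the loop on their own compiler, here is a self-contained sketch of the summary above. The extents `ESZ`, `ESY`, `ESX`, the function name `scale_fields` and the explicit loop bounds are assumptions; the slide elides the declarations and bounds.

```c
#include <stddef.h>

/* Assumed field extents; the slide leaves esz/esy/esx unspecified. */
enum { ESZ = 2, ESY = 3, ESX = 4 };

/* Scale all three components of E and B by qmr.
 * Outputs are restrict-qualified; inputs point to const data,
 * mirroring the qualifiers shown on the slide. */
static void scale_fields(double (*const restrict et)[ESZ][ESY][ESX],
                         const double (*ef)[ESZ][ESY][ESX],
                         double (*const restrict bt)[ESZ][ESY][ESX],
                         const double (*bf)[ESZ][ESY][ESX],
                         double qmr)
{
    for (size_t z = 0; z < ESZ; z++)
        for (size_t y = 0; y < ESY; y++)
            for (size_t x = 0; x < ESX; x++) {
                et[0][z][y][x] = ef[0][z][y][x] * qmr;
                et[1][z][y][x] = ef[1][z][y][x] * qmr;
                et[2][z][y][x] = ef[2][z][y][x] * qmr;
                bt[0][z][y][x] = bf[0][z][y][x] * qmr;
                bt[1][z][y][x] = bf[1][z][y][x] * qmr;
                bt[2][z][y][x] = bf[2][z][y][x] * qmr;
            }
}
```

Only the destinations carry restrict here, so the compiler has to rely on the const qualification of the sources to conclude that the loads and stores do not interfere.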

ppush-1 (1/2)
 ARM (scalar) vs Intel (vector) in #-inst
 Is ARM’s code sufficiently good as the base
of vectorized version?
 No, because it has 21 redundant sub-s to access
21 loop-invariant scalar variables spilled-out to
memory, whose displacements from the frame
base are less than −256.
sub x21,x29,#168 //x29 is frame base
ldur d2,[x21,#-256] //load from x29-424

Compiler  (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
ARM       163        0            1        162
Intel     129        42           9        162
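The extra sub follows from the reach of the addressing modes. ldur takes a signed 9-bit unscaled byte offset, so x29-424 cannot be encoded directly. The helper names below are hypothetical. The second check describes the unsigned scaled form of ldr for 64-bit registers, which would reach the same slot in one instruction only if it were addressed at a non-negative offset from some lower base; that is a possible remedy, not something the slides establish.

```c
#include <stdbool.h>

/* LDUR: signed 9-bit, unscaled byte offset, i.e. -256..255. */
static bool fits_ldur(long off)
{
    return off >= -256 && off <= 255;
}

/* LDR (unsigned-offset form, 64-bit register): 12-bit offset
 * scaled by 8, i.e. multiples of 8 in 0..32760. */
static bool fits_ldr_scaled8(long off)
{
    return off >= 0 && off <= 32760 && off % 8 == 0;
}
```

With these, `fits_ldur(-424)` is false while `fits_ldur(-256)` is true, which matches the two-instruction sequence shown above: the sub moves the base so that the remaining displacement is exactly -256.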

ppush-1 (2/2)
 Any other improvements/modifications needed
to obtain good vectorized code?
 further eliminations
 a lsl for index scaling.
 2 redundant mov-s caused by a mysterious register
allocation for constant 1.0.
 2 redundant fsub-s for (b-a) to calculate c+d*(b-a)
when we have other fsub-s for (a-b).
 additions
 6 net additions to replace fdiv with NR approximation
with frecpe, frecps and fmul.
 2 movprfx-s for pseudo 4-operand FMAs out of 59 FMAs.

Compiler  (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
ARM       146        0            7        139 (3 movprfx)
Intel     129        42           9        162
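The Newton-Raphson replacement for fdiv can be sketched in portable C. This is an emulation under stated assumptions: `recip_estimate` stands in for frecpe by truncating an exact reciprocal to 8 mantissa bits (a real frecpe result differs in detail), `recip_step` stands in for frecps (which yields 2 - a*b), and the number of refinement steps is a free parameter here because the slides do not state how many are used.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for frecpe: 1/d with only the top 8 mantissa bits kept. */
static double recip_estimate(double d)
{
    double r = 1.0 / d;
    uint64_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= ~((UINT64_C(1) << 44) - 1);   /* clear low 44 of 52 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Stand-in for frecps: the Newton-Raphson correction factor 2 - a*b. */
static double recip_step(double a, double b)
{
    return 2.0 - a * b;
}

/* n / d via a rough reciprocal refined `steps` times: x <- x * (2 - d*x). */
static double nr_divide(double n, double d, int steps)
{
    double x = recip_estimate(d);
    for (int i = 0; i < steps; i++)
        x = x * recip_step(d, x);          /* one frecps + one fmul */
    return n * x;                          /* final fmul */
}
```

Each step roughly doubles the number of correct bits, so the accuracy required of the quotient determines how many frecps/fmul pairs are needed.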

ppush-2
 Does the small difference from cscat-1 really
make it impossible to vectorize ppush-2?
 update of px[p], py[p] and pz[p].
 update of mdir[i] where i is defined by:
for(int p=head,i=0;p<tail;p++,i++)
 The scalar loop cannot serve as the base of
vectorization because of an overly clever
optimization that couples variables/instructions
using NEON’s 128-bit SIMD.
 e.g. a1=a2+a3 and b1=b2+b3 are done by one
instruction.
 Even with the coupling, one loop-invariant scalar
variable is spilled out due to inappropriate
instruction ordering.
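One of the two differences from cscat-1 is the second induction variable i. Because i always equals p-head, the loop can be rewritten with a single induction variable, as sketched below. The function names and the loop body are hypothetical placeholders, and whether this rewrite removes the obstacle to vectorization is not established by the slides.

```c
/* Loop header as in the source: p and i advance in lockstep. */
static void record_coupled(double *restrict mdir, const double *restrict px,
                           int head, int tail)
{
    for (int p = head, i = 0; p < tail; p++, i++)
        mdir[i] = (px[p] < 0.0) ? -1.0 : 0.0;        /* hypothetical body */
}

/* Same iteration space with one induction variable; i is p - head. */
static void record_single(double *restrict mdir, const double *restrict px,
                          int head, int tail)
{
    for (int p = head; p < tail; p++)
        mdir[p - head] = (px[p] < 0.0) ? -1.0 : 0.0; /* hypothetical body */
}
```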

cscat-1
 Fairly good job without spilling any of 12
reduction variables, 6 loop-invariants and 2
constants.
 Still has a little room for improvement.
 an add to obtain array index p from the canonicalized
loop index, which holds p-head.
 xr=(x0==x1)?(px0+px1)*0.5:((x0<x1)?x1:x0);
 false part is fcmgt+sel instead of fmax.
 final assignment is sel instead of movprfx’ed fmul.
 movprfx+fnmls can be replaced with fnmsb.

Compiler  (a) gross  (b) mem opd  net=(a)+(b)
ARM       82         0            82 (3 movprfx)
Intel     76         6            82
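The xr expression can be read as "compute both candidates, then select", which is the shape the suggested fmax and predicated fmul would implement. The sketch below writes that form next to the original; `xr_select`, `xr_branchless` and `max2` are hypothetical names. The false part is only reached when x0 != x1, so it equals the larger of the two for all non-NaN operands; a NaN operand is the case where a true maximum instruction could behave differently.

```c
/* xr exactly as written in the source. */
static double xr_select(double x0, double x1, double px0, double px1)
{
    return (x0 == x1) ? (px0 + px1) * 0.5 : ((x0 < x1) ? x1 : x0);
}

/* Larger of a and b for non-NaN inputs (candidate for a max instruction). */
static double max2(double a, double b)
{
    return (a < b) ? b : a;
}

/* Both candidates are computed unconditionally, then one is selected. */
static double xr_branchless(double x0, double x1, double px0, double px1)
{
    double larger = max2(x0, x1);        /* false part */
    double middle = (px0 + px1) * 0.5;   /* true part  */
    return (x0 == x1) ? middle : larger;
}
```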

pmove-1
 Isn’t this a good example of fault-tolerant
speculative vectorization?
while (mdir[j]==0) j++;
 Though none of the four compilers can
vectorize this loop, ARM’s (& Fujitsu’s)
vectorization failure disappoints us because
speculative vectorization with ldff1d
and related predicating instructions is a
selling point of SVE.
 Vectorization is effective because particles
tend to stay in a cell with mdir[j]==0.
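What a speculatively vectorized version of this loop would do can be emulated in scalar C. This is a sketch, not SVE code: `vl` (lanes per vector) and `readable` (the number of elements that may be touched, modelling the point where a first-faulting load stops delivering lanes) are hypothetical parameters, and the inner loop stands in for the vector compare and the search for the first active lane.

```c
#include <stddef.h>

/* Scalar reference: the loop from the source. Relies on a nonzero sentinel. */
static size_t scan_scalar(const double *mdir, size_t j)
{
    while (mdir[j] == 0.0) j++;
    return j;
}

/* Chunked emulation of a first-faulting vector scan. */
static size_t scan_chunked(const double *mdir, size_t j, size_t vl, size_t readable)
{
    for (;;) {
        if (j >= readable)
            return readable;   /* a real first-lane fault would trap here */

        /* "First-faulting load": only lanes below the limit are valid. */
        size_t valid = (readable - j < vl) ? readable - j : vl;

        /* "Compare and break": first valid lane holding a nonzero value. */
        for (size_t lane = 0; lane < valid; lane++)
            if (mdir[j + lane] != 0.0)
                return j + lane;

        /* No hit: advance by the number of lanes actually loaded. */
        j += valid;
    }
}
```

The key property is that lanes beyond the readable limit are never consumed, so loading a whole vector ahead of the exit test cannot introduce a fault the scalar loop would not have had.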

Spilled Loop-Invariant Var/Const
 Spill is inevitable in ppush-1 (51 invariants + 2
constants) and very likely in ppush-2 (12 reductions
+ 6 invariants + 4 constants).
 Options

where            instruction  note
mem (VL/8-byte)  ldr          Intel’s way for variables. Signed 9-bit offset. Consume large space in L1.
mem (8byte)      ld1rd        Intel’s way for constants. Unsigned 6-bit offset. As efficient as ldr?
Xn               mov          Unique for SVE. Faster than mem?
Zn (a lane)      dup          Not vector-length agnostic?
immediate        fmov         All constants in 2 loops are short enough to use this option.
immediate        fadd, etc.   Used in cscat-1 for 0.5 (but Z3 also has 0.5).
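To put rough numbers on the two memory options, the sketch below compares the footprint of storing each spilled scalar as a full pre-broadcast vector with storing it once and broadcasting on load. The function names are hypothetical, and the ld1rd reach is taken as the unsigned 6-bit offset scaled by 8 bytes, i.e. byte offsets 0 to 504.

```c
#include <stddef.h>

/* Footprint when every spilled scalar occupies a whole vector (VL/8 bytes). */
static size_t full_vector_bytes(size_t n_vars, size_t vl_bits)
{
    return n_vars * (vl_bits / 8);
}

/* Footprint when every spilled scalar is stored once as an 8-byte double. */
static size_t broadcast_bytes(size_t n_vars)
{
    return n_vars * 8;
}

/* ld1rd: unsigned 6-bit offset scaled by 8, i.e. multiples of 8 in 0..504. */
static int fits_ld1rd(size_t byte_off)
{
    return byte_off % 8 == 0 && byte_off <= 504;
}
```

For the 53 values of ppush-1 (51 invariants + 2 constants), `full_vector_bytes(53, 512)` is 3392 bytes and grows to 13568 bytes at a 2048-bit vector length, while `broadcast_bytes(53)` is 424 bytes and the last slot, at offset 416, is still within reach of a single ld1rd.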

Summary
 ARM’s LLVM compiler cannot vectorize 3
kernel loops which Intel’s can vectorize.
 Investigating the reason is essential to
compete with Intel (& others) in the
game with real-world HPC applications whose
programmers know (or will soon know)
what Intel can do.
 Since scalar loops have reasonable quality,
simply removing the obstacles to
vectorization will give us good code.
 And with reasonable effort, ARM’s code can be
superior to Intel’s.
