An Evaluation of
(ARM’s) LLVM Compiler for SVE
with Fairly Complicated Loops
Hiroshi Nakashima
(Kyoto University / RIKEN-AICS)
Introduction
• We are evaluating several compilers targeting SVE and AVX-512, using kernel loops taken from a production-level particle-in-cell (PIC) code.
• The program is:
  - written in C99 with restrict (and const) pointer qualifiers so that do-all loops operating on arrays can be vectorized;
  - parallelized with OpenMP (and MPI) so that all loops are inside one big #pragma omp parallel region;
  - free from any compiler-specific directives, intrinsics, and #pragma omp simd.
• The evaluation is done by inspecting the generated assembly (.s) files and is based on the number of instructions in each loop body.
ARM HPC Workshop © 2017 H. Nakashima
Kernel Loops (1/2)
• Loops operate on:
  - SOA-type 1D arrays p{xyz}[p] and v{xyz}[p] holding the position/velocity vectors of a particle p;
  - SOA-type 4D arrays ef[][z][y][x] and bf[][z][y][x] for the electric/magnetic fields;
  - SOA-type 4D array jv[][z][y][x] for the current density.
• They accelerate p in each cell c by referring to the E/B vectors at c’s vertices, move p, and update the J vectors at c’s vertices.
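As a rough illustration of the layout described above, the following C99 sketch shows how such SOA arrays might be declared and indexed. The struct name `particles_soa`, the function names `field_at` and `drift_x`, and the extent parameters are assumptions made for this example, not names from the PIC code; only the array names and the [c][z][y][x] index order follow the slide.

```c
#include <stddef.h>

/* Hypothetical SOA container: one contiguous array per vector component,
 * so that consecutive particles p are unit-stride in memory. */
typedef struct {
    double *px, *py, *pz;   /* position components p{xyz}[p] */
    double *vx, *vy, *vz;   /* velocity components v{xyz}[p] */
} particles_soa;

/* Read component c of a field stored as f[c][z][y][x] (x is unit-stride).
 * The extents are passed in and used in a C99 variably modified type. */
double field_at(size_t esz, size_t esy, size_t esx,
                const double (*f)[esz][esy][esx],
                size_t c, size_t z, size_t y, size_t x)
{
    return f[c][z][y][x];
}

/* Advance the x position of particles [head, tail) by their x velocity.
 * This is an illustrative do-all loop, not one of the evaluated kernels. */
void drift_x(const particles_soa *s, size_t head, size_t tail, double dt)
{
    double *restrict px = s->px;
    const double *restrict vx = s->vx;
    for (size_t p = head; p < tail; p++)
        px[p] += vx[p] * dt;
}
```

Keeping each component in its own array is what lets a loop over p touch only unit-stride data, which is the access pattern a vectorizer handles most easily.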
Kernel Loops (2/2)
• particle_push-1
  Simple v=a*u for 4-dimensional arrays v and u.
• ppush-1
  Given p{xyz}[p], v{xyz}[p], the E/B-field vectors in 48 scalar variables, and the base coordinates in 3 scalar variables, perform Lorentz acceleration to update v{xyz}[p] with interpolation of the E/B-field vectors.
• ppush-2 / cscat-1
  Given p{xyz}[p], v{xyz}[p], and the base coordinates in 6 scalar variables, extrapolate the contributions to the J vectors in 12 scalar variables and accumulate them. In ppush-2, p{xyz}[p] is also updated and the moving directions are recorded in mdir[i=p-head].
• pmove-1
  while (mdir[j]==0.0) j++;
Bad News
• Are the 5 loops vectorized?

                 particle_push-1  ppush-1  ppush-2  cscat-1  pmove-1
  ARM 1.4        NO               NO       NO       YES      NO
  Fujitsu Oct17  NO               YES      YES      YES      NO
  Intel 17.0.1   YES              YES      YES      YES      NO
  Cray 8.6.1     YES              YES      YES      YES      NO

• Why can’t ARM’s compiler vectorize them?
  - We don’t know, especially for ppush-2, which is very similar to cscat-1.
• Can the scalar loops serve as the basis of vectorized loops?
  - Yes, but they need some improvements.
particle_push-1
• Source (summary)
    double (*const restrict et)[esz][esy][esx]=...;
    const double (*ef)[esz][esy][esx]=...;
    for(z) for(y) for(x) {
      et[0][z][y][x] = ef[0][z][y][x] * qmr;
      et[1][z][y][x] = ef[1][z][y][x] * qmr;
      et[2][z][y][x] = ef[2][z][y][x] * qmr;
      bt[0][z][y][x] = bf[0][z][y][x] * qmr;
      bt[1][z][y][x] = bf[1][z][y][x] * qmr;
      bt[2][z][y][x] = bf[2][z][y][x] * qmr;
    }
• The const qualification of the RHS arrays appears to be insufficient for ARM (and Fujitsu) to vectorize the loop, while Intel and Cray exploit the qualification and apply 8x2 unrolling.
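For experimenting with this loop outside the PIC code, a self-contained variant might look like the sketch below. The function name `scale_field`, the explicit loop bounds, and the restrict qualifier on the source array are assumptions added for this example; the slides do not report whether qualifying the source pointer with restrict changes ARM's or Fujitsu's result.

```c
#include <stddef.h>

/* Scale the three components of a field: dst[c][z][y][x] = src[c][z][y][x] * qmr.
 * Assumption: src is also restrict-qualified here, unlike ef in the slide,
 * so the compiler is told explicitly that dst and src do not alias. */
void scale_field(size_t esz, size_t esy, size_t esx,
                 double (*const restrict dst)[esz][esy][esx],
                 const double (*const restrict src)[esz][esy][esx],
                 double qmr)
{
    for (size_t z = 0; z < esz; z++)
        for (size_t y = 0; y < esy; y++)
            for (size_t x = 0; x < esx; x++) {
                dst[0][z][y][x] = src[0][z][y][x] * qmr;
                dst[1][z][y][x] = src[1][z][y][x] * qmr;
                dst[2][z][y][x] = src[2][z][y][x] * qmr;
            }
}
```

Calling it once for the et/ef pair and once for the bt/bf pair would reproduce the six assignments of the original loop body, at the cost of traversing the index space twice.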
ppush-1 (1/2)
• ARM (scalar) vs Intel (vector) in number of instructions

           (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
  ARM      163        0            1        162
  Intel    129        42           9        162

• Is ARM’s code good enough to serve as the basis of a vectorized version?
  - No, because it has 21 redundant sub instructions to access 21 loop-invariant scalar variables spilled to memory, whose displacements from the frame base are less than −256.
      sub x21,x29,#168 //x29 is frame base
      ldur d2,[x21,#-256] //load from x29-424
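The reason the sub/ldur pair is needed can be checked with a little arithmetic: ldur takes a signed 9-bit byte offset (−256 to 255), and the scaled unsigned-offset form of ldr for a 64-bit register only reaches non-negative multiples of 8. The sketch below encodes those two ranges; the function names and the simplified cost of "1 or 2 instructions" are assumptions for illustration.

```c
#include <stdbool.h>

/* LDUR (unscaled offset): signed 9-bit byte offset, -256..255. */
bool fits_ldur(long off)
{
    return off >= -256 && off <= 255;
}

/* LDR Dt, [Xn, #imm] (unsigned offset form): 12-bit immediate scaled by 8,
 * i.e. a non-negative multiple of 8 up to 32760. */
bool fits_ldr_d_unsigned(long off)
{
    return off >= 0 && off <= 32760 && off % 8 == 0;
}

/* Simplified instruction count for loading a double at frame_base + off when
 * only the frame base is available: 1 if directly addressable, otherwise
 * 2 (one address adjustment plus the load), as in the sub/ldur pair. */
int load_cost(long off)
{
    return (fits_ldur(off) || fits_ldr_d_unsigned(off)) ? 1 : 2;
}
```

With these ranges, an offset of −424 from x29 costs two instructions, which matches the pair shown on the slide (−168 followed by −256).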
ppush-1 (2/2)
• Are any other improvements/modifications needed to obtain good vectorized code?
  - Further eliminations:
    - an lsl for index scaling;
    - 2 redundant mov instructions caused by a mysterious register allocation for the constant 1.0;
    - 2 redundant fsub instructions for (b-a) to calculate c+d*(b-a) when other fsub instructions for (a-b) already exist.
  - Additions:
    - 6 net additions to replace fdiv with a Newton-Raphson (NR) approximation using frecpe, frecps and fmul;
    - 2 movprfx instructions for pseudo 4-operand FMAs out of 59 FMAs.

           (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
  ARM      146        0            7        139 (3 movprfx)
  Intel    129        42           9        162
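The NR replacement for fdiv mentioned above follows the usual pattern: start from a coarse reciprocal estimate and refine it with x ← x·(2 − d·x), where frecps supplies the factor (2 − d·x) and fmul applies it. The portable C sketch below only imitates that pattern. `recpe` and `recps` are hypothetical stand-ins written for this example, not the real instructions, and the estimate used here is a simple linear fit whose accuracy differs from that of frecpe.

```c
/* Stand-in for frecps: the Newton-Raphson correction factor 2 - a*b. */
static double recps(double a, double b)
{
    return 2.0 - a * b;
}

/* Stand-in for frecpe: a deliberately crude estimate of 1/d.
 * Assumption: d is finite and > 0 (the scaling loops do not terminate otherwise). */
static double recpe(double d)
{
    double m = d, scale = 1.0;
    while (m >= 1.0) { m *= 0.5; scale *= 0.5; }   /* bring m into [0.5, 1) */
    while (m < 0.5)  { m *= 2.0; scale *= 2.0; }
    /* Here d == m / scale, so 1/d == scale * (1/m); approximate 1/m linearly.
     * The constants are folded at compile time. */
    return scale * (48.0 / 17.0 - (32.0 / 17.0) * m);
}

/* Reciprocal of d by `steps` Newton-Raphson refinements of the estimate. */
double nr_reciprocal(double d, int steps)
{
    double x = recpe(d);
    for (int i = 0; i < steps; i++)
        x = x * recps(d, x);    /* the relative error is squared each step */
    return x;
}
```

Because the error is squared at every step, the number of refinement steps needed depends only on the accuracy of the initial estimate; with this crude estimate (relative error up to 1/17) four steps reach double-precision accuracy.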
ppush-2
• Do the small differences from cscat-1 really make it impossible to vectorize ppush-2?
  - update of px[p], py[p] and pz[p];
  - update of mdir[i], where i is defined by
      for(int p=head,i=0;p<tail;p++,i++)
• The scalar loop cannot be considered as the basis of vectorization because of an overly clever optimization that couples variables/instructions using NEON’s 128-bit SIMD.
  - e.g. a1=a2+a3 and b1=b2+b3 are done by one instruction.
  - Even with the coupling, one loop-invariant scalar variable is spilled due to inappropriate instruction ordering.
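The loop header with two induction variables, p and i, can always be rewritten with a single one, since i equals p−head throughout. The sketch below shows both forms side by side. The loop bodies (a position update and a direction test against the bounds 0.0 and 1.0) are hypothetical placeholders, and the slides do not say whether this rewrite affects ARM's vectorization of ppush-2.

```c
/* Form used in the slide: two induction variables, p and i. */
void push_two_iv(int head, int tail, double *restrict px,
                 const double *restrict vx, double *restrict mdir)
{
    for (int p = head, i = 0; p < tail; p++, i++) {
        px[p] += vx[p];                          /* hypothetical position update */
        mdir[i] = (px[p] >= 1.0) ? 1.0           /* hypothetical direction test  */
                : (px[p] <  0.0) ? -1.0 : 0.0;
    }
}

/* Equivalent form with one induction variable; p is derived as head + i. */
void push_one_iv(int head, int tail, double *restrict px,
                 const double *restrict vx, double *restrict mdir)
{
    const int n = tail - head;
    for (int i = 0; i < n; i++) {
        const int p = head + i;
        px[p] += vx[p];
        mdir[i] = (px[p] >= 1.0) ? 1.0
                : (px[p] <  0.0) ? -1.0 : 0.0;
    }
}
```

Comparing the compiler's vectorization report for the two forms would be one way to test whether the second induction variable is the obstacle.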
cscat-1
• Fairly good job: none of the 12 reduction variables, 6 loop invariants and 2 constants is spilled.

           (a) gross  (b) mem opd  net=(a)+(b)
  ARM      82         0            82 (3 movprfx)
  Intel    76         6            82

• Still has a little room for improvement.
  - An add is used to obtain the array index p from the canonicalized loop index, which holds p-head.
  - xr=(x0==x1)?(px0+px1)*0.5:((x0<x1)?x1:x0);
    - the false part is fcmgt+sel instead of fmax;
    - the final assignment is sel instead of a movprfx’ed fmul.
  - movprfx+fnmls can be replaced with fnmsb.
pmove-1
• Isn’t this a good example of fault-tolerant speculative vectorization?
    while (mdir[j]==0) j++;
• Although none of the four compilers can vectorize this loop, ARM’s (and Fujitsu’s) failure disappoints us because speculative vectorization with ldff1d and the related predicate instructions is a selling point of SVE.
• Vectorization would be effective because particles tend to stay in a cell, which gives mdir[j]==0.
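The idea behind vectorizing this search loop is to test a whole vector of mdir elements per iteration and stop at the first non-zero lane. On SVE, a first-faulting load such as ldff1d lets the hardware read speculatively past the element where the scalar loop would stop without raising spurious faults. The plain C sketch below only emulates the chunked control flow; it is not SVE code. The lane count `LANES` and the explicit length `n`, which replaces the hardware fault suppression, are assumptions for this example.

```c
#include <stddef.h>

/* Hypothetical vector length: 8 doubles, i.e. a 512-bit vector. */
static const size_t LANES = 8;

/* Scalar reference, as on the slide. The caller must guarantee that a
 * non-zero element exists at or after j. */
size_t scan_scalar(const double *mdir, size_t j)
{
    while (mdir[j] == 0.0) j++;
    return j;
}

/* Chunked emulation: examine up to LANES elements per step and return the
 * index of the first non-zero one, or n if none is found. The bound n keeps
 * the emulation from reading out of range. */
size_t scan_chunked(const double *mdir, size_t j, size_t n)
{
    while (j < n) {
        size_t active = (n - j < LANES) ? n - j : LANES;  /* readable lanes */
        size_t k = 0;
        /* Stands in for one vector compare plus "find first true lane". */
        while (k < active && mdir[j + k] == 0.0) k++;
        if (k < active)
            return j + k;
        j += active;
    }
    return n;
}
```

Since long runs of zeros are the common case, most chunks would be consumed in a single step, which is where the expected speedup comes from.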
Spilled Loop-Invariant Var/Const
• Spilling is inevitable in ppush-1 (51 invariants + 2 constants) and very likely in ppush-2 (12 reductions + 6 invariants + 4 constants).
• Options

  where            instruction  note
  mem (VL/8-byte)  ldr          Intel’s way for variables. Signed 9-bit offset. Consumes a large space in L1.
  mem (8-byte)     ld1rd        Intel’s way for constants. Unsigned 6-bit offset. As efficient as ldr?
  Xn               mov          Unique to SVE. Faster than mem?
  Zn (a lane)      dup          Not vector-length agnostic?
  immediate        fmov         All constants in the 2 loops are short enough to use this option.
  immediate        fadd, etc.   Used in cscat-1 for 0.5 (but Z3 also has 0.5).
Summary
• ARM’s LLVM compiler cannot vectorize 3 kernel loops that Intel’s compiler can vectorize.
  - Investigating why is essential for competing with Intel (and others) on real-world HPC applications, whose programmers already know (or will soon learn) what Intel’s compiler can do.
• Since the scalar loops are of reasonable quality, simply removing the obstacles to vectorization should yield good code.
  - And with reasonable effort, ARM’s code can be superior to Intel’s.

Más contenido relacionado

La actualidad más candente

Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsPragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsMarina Kolpakova
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemMarina Kolpakova
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Code gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introductionCode gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introductionMarina Kolpakova
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsLinaro
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
An Open Discussion of RISC-V BitManip, trends, and comparisons _ ClaireRISC-V International
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC EcosystemPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC EcosystemLinaro
 
Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Marina Kolpakova
 
Building Efficient and Highly Run-Time Adaptable Virtual Machines
Building Efficient and Highly Run-Time Adaptable Virtual MachinesBuilding Efficient and Highly Run-Time Adaptable Virtual Machines
Building Efficient and Highly Run-Time Adaptable Virtual MachinesGuido Chari
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerLinaro
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)elliando dias
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkAlexey Smirnov
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bitsChiou-Nan Chen
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」Shinya Takamaeda-Y
 

La actualidad más candente (20)

Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsPragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Code GPU with CUDA - SIMT
Code GPU with CUDA - SIMTCode GPU with CUDA - SIMT
Code GPU with CUDA - SIMT
 
Code GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory SubsystemCode GPU with CUDA - Memory Subsystem
Code GPU with CUDA - Memory Subsystem
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Code gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introductionCode gpu with cuda - CUDA introduction
Code gpu with cuda - CUDA introduction
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC EcosystemPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem
 
Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...Pragmatic optimization in modern programming - modern computer architecture c...
Pragmatic optimization in modern programming - modern computer architecture c...
 
64-bit Android
64-bit Android64-bit Android
64-bit Android
 
Building Efficient and Highly Run-Time Adaptable Virtual Machines
Building Efficient and Highly Run-Time Adaptable Virtual MachinesBuilding Efficient and Highly Run-Time Adaptable Virtual Machines
Building Efficient and Highly Run-Time Adaptable Virtual Machines
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-Vectorizer
 
OpenMP
OpenMPOpenMP
OpenMP
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions Framework
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Moving NEON to 64 bits
Moving NEON to 64 bitsMoving NEON to 64 bits
Moving NEON to 64 bits
 
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
助教が吼える! 各界の若手研究者大集合「ハードウェアはやわらかい」
 

Similar a An evaluation of LLVM compiler for SVE with fairly complicated loops

Practical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT MethodsPractical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT MethodsNaughty Dog
 
Practical spherical harmonics based PRT methods.ppsx
Practical spherical harmonics based PRT methods.ppsxPractical spherical harmonics based PRT methods.ppsx
Practical spherical harmonics based PRT methods.ppsxMannyK4
 
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06ManhHoangVan
 
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...Naoki Shibata
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareDaniel Blezek
 
Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...
Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...
Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...KTN
 
Lecture 16 RC Architecture Types & FPGA Interns Lecturer.pptx
Lecture 16 RC Architecture Types & FPGA Interns Lecturer.pptxLecture 16 RC Architecture Types & FPGA Interns Lecturer.pptx
Lecture 16 RC Architecture Types & FPGA Interns Lecturer.pptxwafawafa52
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxCA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxtrupeace
 
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against itEvgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against itSergey Platonov
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Hajime Tazaki
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)Sławomir Zborowski
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...Anne Nicolas
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Sourour Kanzari
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Sourour Kanzari
 

Similar a An evaluation of LLVM compiler for SVE with fairly complicated loops (20)

Practical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT MethodsPractical Spherical Harmonics Based PRT Methods
Practical Spherical Harmonics Based PRT Methods
 
Practical spherical harmonics based PRT methods.ppsx
Practical spherical harmonics based PRT methods.ppsxPractical spherical harmonics based PRT methods.ppsx
Practical spherical harmonics based PRT methods.ppsx
 
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
 
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
(Paper) Efficient Evaluation Methods of Elementary Functions Suitable for SIM...
 
Andes RISC-V processor solutions
Andes RISC-V processor solutionsAndes RISC-V processor solutions
Andes RISC-V processor solutions
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
 
Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...
Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...
Implementing AI: High Performance Architectures: Arm SVE and Supercomputer Fu...
 
Lecture 16 RC Architecture Types & FPGA Interns Lecturer.pptx
Lecture 16 RC Architecture Types & FPGA Interns Lecturer.pptxLecture 16 RC Architecture Types & FPGA Interns Lecturer.pptx
Lecture 16 RC Architecture Types & FPGA Interns Lecturer.pptx
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxCA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
 
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against itEvgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
 
Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01Library Operating System for Linux #netdev01
Library Operating System for Linux #netdev01
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Pipelining1
Pipelining1Pipelining1
Pipelining1
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)How it's made: C++ compilers (GCC)
How it's made: C++ compilers (GCC)
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
Kernel Recipes 2014 - x86 instruction encoding and the nasty hacks we do in t...
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
 

Más de Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018Linaro
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Linaro
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 

Más de Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...

An evaluation of LLVM compiler for SVE with fairly complicated loops

  • 1. An Evaluation of (ARM’s) LLVM Compiler for SVE with Fairly Complicated Loops
    Hiroshi Nakashima (Kyoto University / RIKEN-AICS)
  • 2. Introduction
      - We are evaluating several compilers targeting SVE and AVX-512, using kernel loops from a production-level particle-in-cell (PIC) code.
      - The program is:
          - written in C99 with restrict (and const) pointer qualifiers so that do-all loops operating on arrays are vectorized;
          - parallelized with OpenMP (and MPI), so that all loops sit inside one large #pragma omp parallel region;
          - free from compiler-specific directives, intrinsics, and #pragma omp simd.
      - The evaluation inspects the generated assembly (.s) and is based on the number of instructions in each loop body.
    ARM HPC Workshop © 2017 H. Nakashima
  • 3. Kernel Loops (1/2)
      - The loops operate on:
          - SOA-type 1D arrays p{xyz}[p] and v{xyz}[p] holding the position/velocity vectors of a particle p;
          - SOA-type 4D arrays ef[][z][y][x] and bf[][z][y][x] for the electric/magnetic field;
          - SOA-type 4D array jv[][z][y][x] for the current density;
      - in order to accelerate p in each cell c using the E/B vectors at c’s vertices, to move p, and to update the J vectors at c’s vertices.
  • 4. Kernel Loops (2/2)
      - particle_push-1
        Simple v=a*u for 4-dimensional arrays v and u.
      - ppush-1
        With p{xyz}[p], v{xyz}[p], the E/B-field vectors in 48 scalar variables and the base coordinate in 3 scalar variables, performs Lorentz acceleration to update v{xyz}[p], interpolating the E/B-field vectors.
      - ppush-2 / cscat-1
        With p{xyz}[p], v{xyz}[p] and the base coordinates in 6 scalar variables, extrapolates the contributions to the J vectors in 12 scalar variables and accumulates them. In ppush-2, p{xyz}[p] is also updated and the moving directions are recorded in mdir[i=p-head].
      - pmove-1
        while (mdir[j]==0.0) j++;
  • 5. Bad News
      - Are the 5 loops vectorized?

                         particle_push-1  ppush-1  ppush-2  cscat-1  pmove-1
          ARM 1.4        NO               NO       NO       YES      NO
          Fujitsu Oct17  NO               YES      YES      YES      NO
          Intel 17.0.1   YES              YES      YES      YES      NO
          Cray 8.6.1     YES              YES      YES      YES      NO

      - Why can’t ARM vectorize them?
          - We don’t know, especially for ppush-2, which is very similar to cscat-1.
      - Can the scalar loops serve as the base of vectorized loops?
          - Yes, but they need some improvements.
  • 6. particle_push-1
      - Source (summary)

          double (*const restrict et)[esz][esy][esx]=...;
          const double (*ef)[esz][esy][esx]=...;
          for(z) for(y) for(x) {
            et[0][z][y][x] = ef[0][z][y][x] * qmr;
            et[1][z][y][x] = ef[1][z][y][x] * qmr;
            et[2][z][y][x] = ef[2][z][y][x] * qmr;
            bt[0][z][y][x] = bf[0][z][y][x] * qmr;
            bt[1][z][y][x] = bf[1][z][y][x] * qmr;
            bt[2][z][y][x] = bf[2][z][y][x] * qmr;
          }

      - const-qualification of the RHS arrays appears to be insufficient for ARM (& Fujitsu) to vectorize the loop, while Intel & Cray exploit the qualification for 8x2 unrolling.
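The summarized source above can be turned into a compilable function along the following lines. This is only a sketch under stated assumptions: the fixed extents ESZ, ESY, ESX and the function name scale_fields are hypothetical (the slide declares the pointers as locals with run-time extents esz, esy, esx), and nothing here says how any particular compiler will treat it.

```c
#include <stddef.h>

/* Hypothetical fixed grid extents; the slide uses run-time esz/esy/esx. */
enum { ESZ = 4, ESY = 4, ESX = 8 };

/* Scale all three components of the E and B fields by qmr.
 * Destinations are restrict-qualified and sources are const-qualified,
 * mirroring the qualifiers shown on the slide. */
void scale_fields(double (*restrict et)[ESZ][ESY][ESX],
                  const double (*ef)[ESZ][ESY][ESX],
                  double (*restrict bt)[ESZ][ESY][ESX],
                  const double (*bf)[ESZ][ESY][ESX],
                  double qmr)
{
    for (size_t z = 0; z < ESZ; z++)
        for (size_t y = 0; y < ESY; y++)
            for (size_t x = 0; x < ESX; x++) {
                et[0][z][y][x] = ef[0][z][y][x] * qmr;
                et[1][z][y][x] = ef[1][z][y][x] * qmr;
                et[2][z][y][x] = ef[2][z][y][x] * qmr;
                bt[0][z][y][x] = bf[0][z][y][x] * qmr;
                bt[1][z][y][x] = bf[1][z][y][x] * qmr;
                bt[2][z][y][x] = bf[2][z][y][x] * qmr;
            }
}
```

Only the stores go through restrict-qualified pointers, so whether the loop is a do-all loop depends on the compiler accepting that the const-qualified sources do not overlap the destinations, which is the point the slide raises.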
  • 7. ppush-1 (1/2)
      - ARM (scalar) vs Intel (vector) in number of instructions

                   (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
          ARM      163        0            1        162
          Intel    129        42           9        162

      - Is ARM’s code good enough to serve as the base of a vectorized version?
          - No, because it has 21 redundant sub instructions to access 21 loop-invariant scalar variables that are spilled to memory and whose displacements from the frame base are less than −256.

              sub  x21,x29,#168     //x29 is frame base
              ldur d2,[x21,#-256]   //load from x29-424
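The address arithmetic behind the sub/ldur pair can be checked with a few lines of C. The sketch relies on the fact that ldur encodes a signed 9-bit byte offset (-256 to 255); the helper names fits_ldur_imm9 and sub_needed are hypothetical and only reproduce the pattern shown on the slide.

```c
#include <stdbool.h>

/* True if a byte displacement fits ldur's signed 9-bit immediate. */
bool fits_ldur_imm9(long offset)
{
    return offset >= -256 && offset <= 255;
}

/* Amount an extra sub must take off the frame base so that the
 * remaining displacement is exactly -256, as in the slide's example.
 * Hypothetical helper, meaningful for offsets below -256. */
long sub_needed(long offset)
{
    return -(offset + 256);
}
```

For the slide's slot at x29-424, fits_ldur_imm9(-424) is false and sub_needed(-424) is 168, which matches sub x21,x29,#168 followed by ldur d2,[x21,#-256].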
  • 8. ppush-1 (2/2)
      - Are any other improvements/modifications needed to obtain good vectorized code?
          - Further eliminations:
              - an lsl for index scaling;
              - 2 redundant mov instructions caused by a mysterious register allocation for the constant 1.0;
              - 2 redundant fsub instructions for (b-a) used to calculate c+d*(b-a), when other fsub instructions already compute (a-b).
          - Additions:
              - 6 net additions to replace fdiv with a Newton-Raphson (NR) approximation using frecpe, frecps and fmul;
              - 2 movprfx instructions for pseudo 4-operand FMAs, out of 59 FMAs.

                   (a) gross  (b) mem opd  (c) div  net=(a)+(b)-(c)
          ARM      146        0            7        139 (3 movprfx)
          Intel    129        42           9        162
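The Newton-Raphson replacement for fdiv mentioned above follows the recurrence x' = x*(2 - d*x), which squares the relative error at each step. The scalar C sketch below illustrates that recurrence only. recip_estimate is a hypothetical stand-in for frecpe (a simple linear fit, not the instruction's actual estimate), recip_step models the 2 - d*x factor that frecps produces, and the multiplication models fmul.

```c
#include <assert.h>
#include <math.h>

/* Crude initial estimate of 1/d: hypothetical stand-in for frecpe. */
static double recip_estimate(double d)
{
    int e;
    double m = frexp(d, &e);                     /* d = m * 2^e, m in [0.5, 1) */
    double x = 48.0 / 17.0 - (32.0 / 17.0) * m;  /* linear fit to 1/m, rel. error <= 1/17 */
    return ldexp(x, -e);                         /* undo the exponent scaling */
}

/* frecps-style correction factor: 2 - d*x. */
static double recip_step(double d, double x)
{
    return 2.0 - d * x;
}

/* Approximate 1/d with the given number of refinement steps.
 * Assumes d is positive and finite. */
double recip_nr(double d, int steps)
{
    assert(d > 0.0 && isfinite(d));
    double x = recip_estimate(d);
    for (int i = 0; i < steps; i++)
        x *= recip_step(d, x);                   /* fmul-style update */
    return x;
}
```

A quotient n/d then becomes n * recip_nr(d, steps). One possible reading of the slide's "6 net additions" is one frecpe plus three frecps/fmul pairs in place of a single fdiv; that reading is an assumption, and the number of steps the stand-in estimate needs differs from what the hardware estimate needs.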
  • 9. ppush-2
      - Do the small differences from cscat-1 really make it impossible to vectorize ppush-2?
          - update of px[p], py[p] and pz[p];
          - update of mdir[i], where i is defined by
              for(int p=head,i=0;p<tail;p++,i++)
      - The scalar loop cannot be considered as the base of vectorization, because of an overly clever optimization that couples variables/instructions using NEON’s 128-bit SIMD.
          - e.g. a1=a2+a3 and b1=b2+b3 are done by one instruction.
      - Even with the coupling, one loop-invariant scalar variable is spilled because of inappropriate instruction ordering.
  • 10. cscat-1
      - A fairly good job, without spilling any of the 12 reduction variables, 6 loop-invariants and 2 constants.

                   (a) gross  (b) mem opd  net=(a)+(b)
          ARM      82         0            82 (3 movprfx)
          Intel    76         6            82

      - There is still a little room for improvement:
          - an add is needed to obtain the array index p from the canonicalized loop index, which holds p-head;
          - xr=(x0==x1)?(px0+px1)*0.5:((x0<x1)?x1:x0);
              - the false part is compiled to fcmgt+sel instead of fmax;
              - the final assignment is a sel instead of a movprfx’ed fmul;
          - movprfx+fnmls can be replaced with fnmsb.
  • 11. pmove-1
      - Isn’t this a good example of fault-tolerant speculative vectorization?
          while (mdir[j]==0) j++;
      - Although none of the four compilers can vectorize this loop, ARM’s (& Fujitsu’s) failure is disappointing, because speculative vectorization with ldff1d and the related predicate instructions is a selling point of SVE.
      - Vectorization would be effective because particles tend to stay in a cell, with mdir[j]==0.
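The shape of a vectorized search can be sketched in plain C by scanning a block of elements per iteration. This is a scalar model, not SVE code: LANES is a hypothetical fixed block width (SVE itself is vector-length agnostic), the function name first_moved is hypothetical, and the explicit bound n is an assumption added to keep the sketch safe. The slide's loop has no bound, which is why a vector version would need a first-faulting load such as ldff1d to read ahead without trapping.

```c
#include <stddef.h>

/* Hypothetical block width standing in for the number of vector lanes. */
enum { LANES = 4 };

/* Return the first index j >= start with mdir[j] != 0.0, or n if none.
 * Each outer iteration models one vector step over up to LANES elements. */
size_t first_moved(const double *mdir, size_t start, size_t n)
{
    size_t j = start;
    while (j < n) {
        /* Active lanes for this step (models the governing predicate). */
        size_t len = (n - j < LANES) ? n - j : LANES;
        /* Count leading zero lanes (models compare, break, count). */
        size_t k = 0;
        while (k < len && mdir[j + k] == 0.0)
            k++;
        if (k < len)
            return j + k;   /* a non-zero lane ended the search */
        j += len;           /* whole block was zero: advance one step */
    }
    return n;
}
```

When most entries are zero, as the slide expects, nearly every step consumes a full block, which is where the benefit of vectorizing the search would come from.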
  • 12. Spilled Loop-Invariant Var/Const
      - Spilling is inevitable in ppush-1 (51 invariants + 2 constants) and very likely in ppush-2 (12 reductions + 6 invariants + 4 constants).
      - Options

          where              instruction  note
          mem (VL/8-byte)    ldr          Intel’s way for variables. Signed 9-bit offset. Consumes large space in L1.
          mem (8byte)        ld1rd        Intel’s way for constants. Unsigned 6-bit offset. As efficient as ldr?
          Xn                 mov          Unique to SVE. Faster than mem?
          Zn (a lane)        dup          Not vector-length agnostic?
          immediate          fmov         All constants in the 2 loops are short enough to use this option.
          immediate          fadd, etc.   Used in cscat-1 for 0.5 (but Z3 also has 0.5).
  • 13. Summary
      - ARM’s LLVM compiler cannot vectorize 3 kernel loops that Intel’s compiler can.
      - Finding out why is essential in order to compete with Intel (& others) on real-world HPC applications, whose programmers already know (or will soon know) what Intel can do.
      - Since the scalar loops are of reasonable quality, simply removing the obstacles to vectorization should yield good code.
      - And with reasonable effort, ARM’s code can be superior to Intel’s.