OpenPOWER Webinar

POWER9 Features and Strategies for Improving Application Performance on POWER9 with IBM XL and Open Source Compilers
Archana Ravindar
LLVM Compiler Performance (POWER Systems Performance), ISDL
aravind5@in.ibm.com
https://in.linkedin.com/in/archana-ravindar-0259625b

Scope of the Presentation
• Review POWER9 processor features
• Outline common bottlenecks caused by certain program characteristics
• How to identify these issues using tools on POWER9 Linux
• Which compiler options reduce the impact of these characteristics
• How to write programs so that these situations do not arise
• Platform: POWER Linux
• Compilers: XL, GCC wherever applicable
• Performance tools: perf

POWER Processor Technology Roadmap
• POWER7 (1H10, 45 nm), Enterprise
– 8 cores
– SMT4
– eDRAM L3 cache
• POWER7+ (2H12, 32 nm), Enterprise
– 2.5x larger L3 cache
– On-die acceleration
– Zero-power core idle state
• POWER8 Family (1H14 – 2H16, 22 nm), Enterprise & Big Data Optimized
– Up to 12 cores
– SMT8
– CAPI acceleration
– High bandwidth GPU attach
• POWER9 Family (2H17 – 2H18+, 14 nm), Built for the Cognitive Era
– Enhanced core and chip architecture optimized for emerging workloads
– Processor family with scale-up and scale-out optimized silicon
– Premier platform for accelerated computing

POWER9 Family – Deep Workload
Optimizations
Emerging Analytics, AI, Cognitive
- New core for stronger thread performance
- Delivers 2x compute resource per socket
- Built for acceleration – OpenPOWER solution enablement
Technical / HPC
- Highest bandwidth GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High bandwidth direct attach memory
Cloud / HSDC
- Power / Packaging / Cost optimizations for a range of platforms
- Superior virtualization features: security, power management, QoS, interrupt
- State of the art IO technology for network and storage performance
Enterprise
- Large, flat, Scale-Up Systems
- Buffered memory for maximum capacity
- Leading RAS
- Improved caching

POWER9 Core Execution Slice Microarchitecture
Modular Execution Slices
Re-factored Core Provides Improved Efficiency & Workload Alignment
• Enhanced pipeline efficiency with modular execution and intelligent pipeline control
• Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
• Shared compute resource optimizes data-type interchange
[Figure: 64b slices and 128b super-slices as building blocks; POWER8 SMT8 core compared with the POWER9 SMT8 and SMT4 cores]

POWER9 Core Pipeline Efficiency
Shorter Pipelines with Reduced Disruption
Improved application performance for modern codes
• Shorten fetch to compute by 5 cycles
• Advanced branch prediction
Higher performance and pipeline utilization
• Improved instruction management
– Removed instruction grouping and reduced cracking
– Complete up to 128 (64 – SMT4 Core) instructions per cycle
Reduced latency and improved scalability
• Local pipe control of load/store operations
– Improved hazard avoidance
– Local recycles – reduced hazard disruption
– Improved lock management

POWER ISA v3.0
New Instruction Set Architecture Implemented on POWER9
Broader data type support
• 128-bit IEEE 754 Quad-Precision Float – Full width quad-precision for financial and security applications
• Expanded BCD and 128b Decimal Integer – For database and native analytics
• Half-Precision Float Conversion – Optimized for accelerator bandwidth and data exchange
Support Emerging Algorithms
• Enhanced Arithmetic and SIMD
• Random Number Generation Instruction
Accelerate Emerging Workloads
• Memory Atomics – For high scale data-centric applications
• Hardware Assisted Garbage Collection – Optimize response time of interpretive languages
Cloud Optimization
• Enhanced Translation Architecture – Optimized for Linux
• New Interrupt Architecture – Automated partition routing for extreme virtualization
• Enhanced Accelerator Virtualization
• Hardware Enforced Trusted Execution
Energy & Frequency Management
• POWER9 Workload Optimized Frequency – Manage energy between threads and cores with reduced wakeup latency

Acceleration Super Highway
• 5.6x more data throughput vs. PCIe Gen3 with NVIDIA NVLink optimization to the core
• 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
• Access up to 2TB of system memory delivered with coherence … only on POWER!
• Superior data transfer to multiple devices
• 25G Links to OpenCAPI GPU devices
• GPU ↔ CPU and GPU ↔ GPU speed-up

Scope of the Compiler
• The compiler is an important layer in the system stack and is crucial for application performance
• The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipeline in mind
• The compiler emits the appropriate ISA for the architecture a program is compiled for
• Scheduling is done per architecture to ensure a smooth flow of instructions through the pipeline
• IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past three decades
• IBM has increasingly embraced open source compilers such as GCC and LLVM to leverage community participation and innovation
• This presentation focuses on how to leverage IBM XL and open source compilers to obtain optimum performance on POWER9

Tools that we use in the Discussion
• Compilers
– IBM proprietary compilers: xlC/xlc/xlf
• xlc -O[n] program.c -o program (n ranges from 0 to 5)
• Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
• Profile directed feedback (-qpdf1, -qpdf2)
– Open source compilers: GCC, LLVM
• -O[n] (n ranges from 0 to 3), -Ofast
• Common option: -mcpu=power9
• Profile directed feedback (-fprofile-generate, -fprofile-use)
• Perf tool
– To record hotspots/profile an application
• perf record -e r<code> ./binary args > out (produces perf.data)
• perf report (opens the profile report stored in perf.data)
– To measure hardware events
• perf stat -e r<code> ./binary args > out
– For more details, refer to the perf man page

The processor can be thought of as containing two components
• The front end ensures a smooth supply of instructions to the back end
• The back end is concerned only with executing instructions
• Code with *too many* branches can cause the processor to fetch more instructions than required, which hurts performance
[Figure: front end feeding the back end]

Branches
• Branches are predicted well in advance because resolving the condition takes time; waiting for it would introduce a bubble in the pipeline and slow down execution
• POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories and predicts most branches accurately. However, applications with complex control flow can still cause high misprediction rates
• A wrong prediction is a misprediction
– Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
– Use perf stat -e r<code> ./program arguments > out to collect various counters
• Function calls also introduce branches. Such branches affect instruction cache locality and increase instruction cache misses
– Counters to detect this: PM_L1_ICACHE_MISS
• Branches within loops hinder vectorization/SIMD opportunities

Guidelines to reduce branches
• Options to reduce loop/call branches
– Unrolling loops (reduces loop branches): #pragma unroll(N) or -qunroll (XL); -funroll-loops (GCC/LLVM)
– Inlining routines (reduces function call jumps/returns): -qinline=auto:level=<N>, N = 1..10 (XL); -finline-functions (GCC/LLVM)
• Loop versioning: a slow version of the loop (that contains branches) plus a fast version (that does not). Usually done automatically by compilers at higher optimization levels
• Provide hints in source code about the expected value of expressions in branch conditions, indicating whether a branch is more likely to be taken or not: long __builtin_expect(long expression, long value);
• If-conversion: remove simple branches wherever possible with coding patterns such as
– if(val!=0) a=a+val;  →  a+=val;
– if(val==0) a=a+1;  →  a+=(!val);
Register Spills
• In a RISC architecture, instructions predominantly operate on registers
– Load and store instructions transfer data between memory and registers
• When #live variables > #available registers, a spill is performed
• 1 spill = 1 store + 1 load
• *Spilling hot variables can hit performance*
– Spills can cause Load Hit Stores (a store followed by a load from the same address, which may cause a delay in the pipe depending on the distance separating them)
– Spills increase path length and add address arithmetic instructions
– Spills add unnecessary reads/writes to memory
• Issues due to spills show up in the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN

Guidelines to reduce spills
• Limit extensive unrolling/inlining, which can create long live ranges for variables
– Best to leave the inlining to the compiler's own heuristics
• XL compiler option -qcompact can help
• Programs that mix operand types extensively (signed, unsigned, etc.) use up extra registers for conversions
• Use other register resources such as SIMD registers if applicable; use vectorization wherever applicable, or code such that the compiler vectorizes automatically
• Use special POWER ISA instructions such as andc (logical AND with complement) and orc (logical OR with complement), which combine multiple operations in a single instruction and save a register. Compilers usually generate these instructions when -mcpu=power9 (GCC/LLVM) or -qarch=pwr9 (XL) is used
• Example: R3 = R1 & ~R2
– Without andc: R4 = not(R2); R3 = R1 and R4
– With andc: R3 = R1 andc R2

Memory Unit
• Memory is organized in a hierarchy
• L1 cache: closest memory to the processor and the fastest, followed by L2, L3, up to main memory
• Main memory is the most distant from the processor and the slowest
• Data cache stores data; instruction cache stores instructions
• Data cache misses can stall load instructions in the pipeline, with a cascading effect on all instructions that depend on them
• Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
• Access latencies: L1 cache (3 cycles), L2 cache (15.5 cycles), L3 cache (35.5 cycles), memory (74.5 ns)

Techniques to optimize memory performance
• Memory footprint reduction wherever possible
– If your program declares enums, -qenum=small can allocate as little as one byte per enum instead of the 4 bytes allocated by default
– Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
• Hardware prefetching
– Controlled by DSCR settings
– ppc64_cpu --dscr=<n>
– Common DSCR configurations
• 0 (all default values)
• 0x1D7 (reach the most aggressive depth, most quickly; enable stride-N prefetch)
• 1 (no prefetch)
• The POWER8 tuning guide has a detailed description of DSCR settings
• Software prefetching
– Programmer-inserted prefetch instructions: __dcbt, __dcbtst
– Prefetch parameters can be tuned with -qprefetch=aggressive:dscr=<value>
– Available GCC prefetch options: -fprefetch-loop-arrays/-fno-prefetch-loop-arrays
– To control prefetching explicitly in software, turn off hardware prefetching with the ppc64_cpu --dscr command (requires root privileges)

Flag Kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks
----------|----|----------|----------------------------|---------|----------
Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure
Inlining | -qinline=auto:level=N | -finline-functions | always_inline attribute or manual inlining | Increases scheduling opportunities; reduces branches and loads/stores | Increases register pressure; increases code size
Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause alignment issues
isel instructions | | -misel | Using the ?: operator | Generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable
General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power9, -mtune=power9 | | Turns on platform-specific tuning such as ISA and scheduling |
64-bit compilation | -q64 | -m64 | | |
Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used
Link time optimization | -qipa | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time
Profile directed feedback | -qpdf1, -qpdf2 | -fprofile-generate and -fprofile-use (LLVM has an intermediate step, llvm-profdata) | | Enables hot path optimizations | Requires a training run

Summary
• Today we talked about
– Various performance issues that can occur in an application on POWER9 Linux
– How to identify them
– What we can do to improve performance during compilation
– What we can do to improve performance while coding the application itself
• We saw that POWER9 has a comprehensive set of hardware counters that enable analysts to understand application performance and get to the bottlenecks quickly
• We saw that IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, have a diverse set of options tailored to different needs to get the required performance

References
• POWER9 User Manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
• IBM XL Compiler reference: http://www-01.ibm.com/support/docview.wss?uid=swg27036675
• POWER9 raw event codes (install libpfm): https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/power9-events-list.h
• GCC 9.2 manual: https://devdocs.io/gcc~9/
• LLVM manual: https://llvm.org/docs/CommandGuide/
