POWER9 Features and Strategies for
improving Application Performance
on POWER9 with IBM XL and Open
Source compilers
Archana Ravindar
LLVM Compiler Performance(POWER Systems Performance), ISDL
aravind5@in.ibm.com
https://in.linkedin.com/in/archana-ravindar-0259625b
Scope of the Presentation
• Review POWER9 processor features
• Outline common bottlenecks encountered due to
certain program characteristics
• How to Identify these issues using tools on POWER9
Linux
• What compiler Options can be used to reduce the
impact of these characteristics
• How can we code programs that prevent such
situations from arising
• POWER Linux platform
• Compilers- XL, gcc wherever applicable
• Performance Tools- perf
POWER Processor Technology Roadmap
• POWER7 (1H10, 45 nm), Enterprise
– 8 Cores
– SMT4
– eDRAM L3 Cache
• POWER7+ (2H12, 32 nm), Enterprise
– 2.5x Larger L3 cache
– On-die acceleration
– Zero-power core idle state
• POWER8 Family (1H14 – 2H16, 22 nm), Enterprise & Big Data Optimized
– Up to 12 Cores
– SMT8
– CAPI Acceleration
– High Bandwidth GPU Attach
• POWER9 Family (2H17 – 2H18+, 14 nm), Built for the Cognitive Era
– Enhanced Core and Chip Architecture Optimized for Emerging Workloads
– Processor Family with Scale-Up and Scale-Out Optimized Silicon
– Premier Platform for Accelerated Computing
POWER9 Family – Deep Workload
Optimizations
Emerging Analytics, AI, Cognitive
- New core for stronger thread performance
- Delivers 2x compute resource per socket
- Built for acceleration – OpenPOWER solution enablement
Technical / HPC
- Highest bandwidth GPU attach
- Advanced GPU/CPU interaction and memory sharing
- High bandwidth direct attach memory
Cloud / HSDC
- Power / Packaging / Cost optimizations for a range of platforms
- Superior virtualization features: security, power management, QoS, interrupt
- State of the art IO technology for network and storage performance
Enterprise
- Large, flat, Scale-Up Systems
- Buffered memory for maximum capacity
- Leading RAS
- Improved caching
POWER9 Core Execution Slice Microarchitecture
Modular Execution Slices
[Figure: POWER9 SMT8 core, showing a 64b slice and a 128b super-slice]
Re-factored Core Provides Improved Efficiency & Workload Alignment
• Enhanced pipeline efficiency with modular execution and intelligent pipeline
control
• Increased pipeline utilization with symmetric data-type engines: Fixed, Float,
128b, SIMD
• Shared compute resource optimizes data-type interchange
[Figure: POWER8 SMT8 core and POWER9 SMT4 core]
Shorter Pipelines with Reduced Disruption
Improved application performance for modern codes
• Shorten fetch to compute by 5 cycles
• Advanced branch prediction
Higher performance and pipeline utilization
• Improved instruction management
– Removed instruction grouping and reduced cracking
– Complete up to 128 (64 – SMT4 Core) instructions per cycle
Reduced latency and improved scalability
• Local pipe control of load/store operations
– Improved hazard avoidance
– Local recycles – reduced hazard disruption
– Improved lock management
POWER9 Core Pipeline Efficiency
POWER ISA v3.0
Broader data type support
• 128-bit IEEE 754 Quad-Precision Float – Full width quad-precision for financial and security applications
• Expanded BCD and 128b Decimal Integer – For database and native analytics
• Half-Precision Float Conversion – Optimized for accelerator bandwidth and data exchange
Support Emerging Algorithms
• Enhanced Arithmetic and SIMD
• Random Number Generation Instruction
Accelerate Emerging Workloads
• Memory Atomics – For high scale data-centric applications
• Hardware Assisted Garbage Collection – Optimize response time of interpretive languages
Cloud Optimization
• Enhanced Translation Architecture – Optimized for Linux
• New Interrupt Architecture – Automated partition routing for extreme virtualization
• Enhanced Accelerator Virtualization
• Hardware Enforced Trusted Execution
Energy & Frequency Management
• POWER9 Workload Optimized Frequency – Manage energy between threads and cores with reduced wakeup
latency
New Instruction Set Architecture Implemented on POWER9
Acceleration Super Highway
• 5.6x more data throughput vs. PCIe Gen3 with NVIDIA NVLink optimization to the core
• 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
• Access up to 2TB of system memory delivered with coherence … only on POWER!
• Superior data transfer to multiple devices
– 25G Links to OpenCAPI GPU devices
• GPU ↔ CPU and GPU ↔ GPU speed-up
Scope of the Compiler
• The compiler is an important layer in the system stack and is crucial for application performance
• The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipe in mind
• The compiler emits the appropriate ISA for the architecture a program is compiled for
• Scheduling is done per architecture to ensure a smooth flow of instructions through the pipe
• IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past three decades
• Increasingly, IBM has embraced open source compilers such as GCC and LLVM to leverage community participation and innovation
• This presentation focuses on how to leverage IBM XL and open source compilers to obtain optimum performance on POWER9
Tools that we use in the Discussion
• Compilers
– IBM proprietary compilers: xlC/xlc/xlf
– xlc -O[n] program.c -o program (n ranges from 0 to 5)
– Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
– Profile directed feedback (-qpdf1, -qpdf2)
– Open source compilers: GCC, LLVM
– -O[n] (n ranges from 0 to 3), -Ofast
– Common options: -mcpu=power9
– Profile directed feedback (-fprofile-generate, -fprofile-use)
• Perf tool
– To record hotspots/profile an application
• perf record -e r<code> ./binary args > out (produces perf.data)
• perf report (opens the profile report stored in perf.data)
– To measure hardware events
• perf stat -e r<code> ./binary args > out
– For more details, refer to the perf man page
A processor can be thought of as containing two components
• The front end ensures a smooth supply of instructions to be executed by the back end
• The back end is concerned only with the execution of instructions
• Code that has *too many* branches can cause the processor to fetch more instructions than required, which hurts performance
[Figure: front end and back end]
Branches
• Branches are predicted well in advance because resolving the condition takes time; waiting for it would introduce a bubble in the pipeline and slow down execution
• POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories and predicts most branches accurately. However, applications with complex control flow can still cause high misprediction rates
• A wrong prediction is a misprediction
– Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
– Use perf stat -e r<code> ./program arguments > out to collect various counters
• Function calls are also branches. Such branches affect instruction cache locality and increase instruction cache misses
– Counters to detect this: PM_L1_ICACHE_MISS
• Branches within loops hinder vectorization/SIMD opportunities
Guidelines to reduce branches
• Options to reduce loop/call branches
– Unrolling loops (reduces loop branches): #pragma unroll(N) or -qunroll (XL); -funroll-loops (GCC/LLVM)
– Inlining routines (reduces function call jump/return): -qinline=auto:level=<N>, N = 1..10 (XL); -finline-functions (GCC/LLVM)
• Loop versioning: a slow version of the loop (that contains branches) plus a fast version (that does not); usually done automatically by compilers at higher levels of optimization
• Provide hints in source code to indicate the expected values of expressions appearing in branch conditions, i.e. whether a branch is more likely to be taken or not: long __builtin_expect(long expression, long value);
• If-conversion: remove simple branches wherever possible with coding patterns such as
if(val!=0) a=a+val;  →  a+=val;
if(val==0) a=a+1;  →  a+=(!val);
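The hint and if-conversion guidelines can be sketched in C as follows. This is a minimal illustration, not code from the presentation: the function names and the LIKELY/UNLIKELY macros are hypothetical, and whether the compiler actually removes the branch depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convenience macros around __builtin_expect. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Branchy form: one data-dependent conditional branch per element. */
long count_zeros_branchy(const int *v, size_t n)
{
    long zeros = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] == 0)
            zeros = zeros + 1;
    }
    return zeros;
}

/* If-converted form: !v[i] is 1 when v[i] == 0 and 0 otherwise,
 * so the loop body has no data-dependent branch. */
long count_zeros_branchless(const int *v, size_t n)
{
    assert(v != NULL || n == 0);
    long zeros = 0;
    for (size_t i = 0; i < n; i++)
        zeros += !v[i];
    return zeros;
}

/* Branch hint: the NULL check is marked as rarely true, so the compiler
 * can lay out the common path as the fall-through. Returns -1 on NULL. */
long sum_checked(const int *v, size_t n)
{
    if (UNLIKELY(v == NULL))
        return -1;
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];   /* "if (v[i] != 0) sum += v[i];" would add a needless branch */
    return sum;
}
```

Both counting functions return the same result; comparing them under perf stat with the misprediction counters is one way to see whether the rewrite matters for a given input.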
Register Spills
• In a RISC architecture, instructions predominantly operate on registers
– Load and store instructions transfer data between memory and registers
• When #live variables > #available registers, a spill is performed
• 1 spill = 1 store + 1 load
• *Spilling hot variables can hit performance*
– Spills can cause Load Hit Stores (a store followed by a load to the same address, which may cause a delay in the pipe depending on the distance separating them)
– Spills increase path length and add address arithmetic instructions
– Unnecessary reads/writes to memory
• Issues due to spills show up in the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN
Guidelines to reduce spills
• Limit extensive unrolling/inlining, which can cause long live ranges of variables
– Best to leave inlining to the compiler's own heuristics
• XL compiler option: -qcompact can help
• Programs that mix operand types extensively (signed, unsigned, etc.) use up extra registers for conversions
• Use other register resources such as SIMD registers where applicable; use vectorization where applicable, or code such that the compiler vectorizes automatically
• Use special POWER ISA instructions such as andc (logical AND with complement) and orc (logical OR with complement), which combine multiple operations in a single instruction and save a register; compilers usually generate these when -mcpu=power9 (GCC/LLVM) or -qarch=pwr9 (XL) is used
• Example: R3 = R1 & ~R2
– Without andc: R4 = not(R2); R3 = R1 and R4
– With andc: R3 = R1 andc R2
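In C source, the AND-with-complement pattern is simply written as one expression. The sketch below uses hypothetical function names; whether a given compiler emits andc or orc for these expressions is an assumption that can be checked by inspecting the generated assembly (for example with -S).

```c
#include <assert.h>
#include <stdint.h>

/* Clears in `value` every bit that is set in `mask`.
 * Written as a single expression so the compiler is free to select
 * one AND-with-complement instruction instead of a separate NOT
 * (which needs a temporary register) followed by an AND. */
uint64_t clear_bits(uint64_t value, uint64_t mask)
{
    return value & ~mask;
}

/* OR-with-complement counterpart: sets every bit that is clear in `mask`. */
uint64_t set_unmasked_bits(uint64_t value, uint64_t mask)
{
    return value | ~mask;
}

/* Small self-check showing the expected results. */
void check_bit_ops(void)
{
    assert(clear_bits(UINT64_C(0xF0), UINT64_C(0x30)) == UINT64_C(0xC0));
    assert(set_unmasked_bits(UINT64_C(0x0F), ~UINT64_C(0xF0)) == UINT64_C(0xFF));
}
```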
Memory Unit
• Memory is organized in a hierarchy
• L1 cache: closest memory to the processor and the fastest, followed by L2, L3, up to main memory
• Main memory is the most distant from the processor and the slowest
• Data cache: stores data; instruction cache: stores instructions
• Data cache misses can stall load instructions in the pipeline, causing a cascading effect on all the instructions that depend on them
• Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
• Latencies: L1 $ (3 cyc) → L2 $ (15.5 cyc) → L3 $ (35.5 cyc) → Memory (74.5 ns)
Techniques to optimize memory performance
• Memory footprint reduction wherever possible
– If you have enums declared in your program, -qenum=small allocates the smallest size that can hold the enum values (as little as one byte) instead of the 4 bytes allocated by default
– Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
• Hardware prefetching
– Controlled by DSCR settings
– ppc64_cpu --dscr=<n>
– Common DSCR configurations
• 0 (all default values)
• 0x1D7 (achieve most aggressive depth, most quickly, enable stride N prefetch)
• 1 (no prefetch)
– The POWER8 tuning guide has a detailed description of DSCR settings
• Software prefetching
– Programmer-inserted prefetch instructions: __dcbt, __dcbtst
– Prefetch parameters can be tuned with -qprefetch=aggressive:dscr=<value>
– Available GCC prefetch options: -fprefetch-loop-arrays/-fno-prefetch-loop-arrays
– To control prefetching explicitly via software, turn off hardware prefetching with the ppc64_cpu --dscr command (requires root privileges)
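Two of the techniques above, bitmaps and programmer-inserted prefetch, can be sketched in C. This is an illustrative sketch only: the bitmap_t type, the function names, and the PREFETCH_DISTANCE value are assumptions, and it uses the portable GCC/Clang __builtin_prefetch rather than the XL __dcbt built-in. The right prefetch distance has to be found by measurement.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Bitmap: one bit per flag instead of one byte per flag,
 * which cuts the footprint of the flag array by a factor of CHAR_BIT. */
typedef struct {
    unsigned char *bits;
    size_t nbits;
} bitmap_t;

/* Allocates a zeroed bitmap; returns 0 on success, -1 on allocation failure. */
int bitmap_init(bitmap_t *bm, size_t nbits)
{
    bm->bits = calloc((nbits + CHAR_BIT - 1) / CHAR_BIT, 1);
    bm->nbits = nbits;
    return bm->bits != NULL ? 0 : -1;
}

void bitmap_set(bitmap_t *bm, size_t i)
{
    assert(i < bm->nbits);
    bm->bits[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

int bitmap_test(const bitmap_t *bm, size_t i)
{
    assert(i < bm->nbits);
    return (bm->bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}

void bitmap_free(bitmap_t *bm)
{
    free(bm->bits);
    bm->bits = NULL;
    bm->nbits = 0;
}

/* Software prefetch for an indirect (gather) access pattern, which is
 * harder for a stride-based hardware prefetcher to follow.
 * PREFETCH_DISTANCE is a tuning assumption, not a recommended value. */
#define PREFETCH_DISTANCE 16

long gather_sum(const long *table, const size_t *idx, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 3);
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch only hints at a future read (second argument 0) with high temporal locality (third argument 3); it does not change the result, so the function can be compared against a version without the hint using the cache-miss counters listed earlier.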
Hands-On Reference
| Flag Kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks |
|---|---|---|---|---|---|
| Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure |
| Inlining | -qinline=auto:level=N | -finline-functions | always_inline attribute or manual inlining | Increases opportunities for scheduling; reduces branches and loads/stores | Increases register pressure; increases code size |
| Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause issues in alignment |
| isel instructions | | -misel | Using ?: operator | Generates isel instruction instead of branch; reduces pressure on branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable |
| General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power9, -mtune=power9 | | Turns on platform-specific tuning like ISA, scheduling | |
| 64-bit compilation | -q64 | -m64 | | | |
| Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used |
| Link time optimization | -qipa | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time |
| Profile directed feedback | -qpdf1, -qpdf2 | -fprofile-generate and -fprofile-use; LLVM has an intermediate step, llvm-profdata | | Enables hot path optimizations | Requires a training run |
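Three of the "can be simulated in source" entries are sketched below in C: the always_inline attribute, the ?: operator, and a manual typedef for small enums. The names are hypothetical, and whether the ternary is compiled to isel rather than a branch depends on the compiler and flags, so treat that as an assumption to verify in the generated assembly.

```c
#include <assert.h>

/* Manual typedef in place of -qenum=small / -fshort-enums:
 * the enumerators keep their names, but values are stored in one byte. */
enum color_values { COLOR_RED, COLOR_GREEN, COLOR_BLUE };
typedef unsigned char color_t;

/* Forced inlining through the GCC/Clang always_inline attribute.
 * The ?: form gives the compiler a select-style expression to work with. */
static inline __attribute__((always_inline)) int clamp_low(int x, int lo)
{
    return (x < lo) ? lo : x;
}

/* Sums the array after raising every element below `lo` up to `lo`. */
int sum_clamped(const int *v, int n, int lo)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += clamp_low(v[i], lo);
    return sum;
}

/* Small self-check showing the expected results. */
void check_source_equivalents(void)
{
    const int v[] = { -5, 2, 9, -1 };
    color_t c = COLOR_BLUE;
    assert(sizeof(color_t) == 1);
    assert(c == 2);
    assert(sum_clamped(v, 4, 0) == 11);
}
```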
Summary
• Today we talked about
– Various performance issues that can occur in an application on POWER9 Linux
– How to identify them
– What we can do to improve performance during compilation
– What we can do to improve performance while coding the application itself
• We saw that POWER9 has a comprehensive set of hardware counters that enable analysts to understand application performance and get to the bottlenecks quickly
• We saw that IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, have a diverse set of options tailored to different needs to get the required performance
References
• POWER9 User Manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
• IBM XL Compiler reference: http://www-01.ibm.com/support/docview.wss?uid=swg27036675
• POWER9 raw event codes (install libpfm): https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/power9-events-list.h
• GCC 9.2 manual: https://devdocs.io/gcc~9/
• LLVM manual: https://llvm.org/docs/CommandGuide/

 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

OpenPOWER Webinar

  • 1. POWER9 Features and Strategies for improving Application Performance on POWER9 with IBM XL and Open Source compilers
      Archana Ravindar
      LLVM Compiler Performance (POWER Systems Performance), ISDL
      aravind5@in.ibm.com
      https://in.linkedin.com/in/archana-ravindar-0259625b
  • 2. Scope of the Presentation
      • Review POWER9 processor features
      • Outline common bottlenecks caused by certain program characteristics
      • How to identify these issues using tools on POWER9 Linux
      • Which compiler options can be used to reduce the impact of these characteristics
      • How to code programs so that such situations do not arise
      • POWER Linux platform
      • Compilers: XL, gcc wherever applicable
      • Performance tools: perf
  • 3. POWER Processor Technology Roadmap
      • POWER7 (1H10, 45 nm), Enterprise
          - 8 cores
          - SMT4
          - eDRAM L3 cache
      • POWER7+ (2H12, 32 nm), Enterprise
          - 2.5x larger L3 cache
          - On-die acceleration
          - Zero-power core idle state
      • POWER8 Family (1H14 to 2H16, 22 nm), Enterprise & Big Data Optimized
          - Up to 12 cores
          - SMT8
          - CAPI acceleration
          - High bandwidth GPU attach
      • POWER9 Family (2H17 to 2H18+, 14 nm), Built for the Cognitive Era
          - Enhanced core and chip architecture optimized for emerging workloads
          - Processor family with scale-up and scale-out optimized silicon
          - Premier platform for accelerated computing
  • 4. POWER9 Family: Deep Workload Optimizations
      • Emerging Analytics, AI, Cognitive
          - New core for stronger thread performance
          - Delivers 2x compute resource per socket
          - Built for acceleration: OpenPOWER solution enablement
      • Technical / HPC
          - Highest bandwidth GPU attach
          - Advanced GPU/CPU interaction and memory sharing
          - High bandwidth direct attach memory
      • Cloud / HSDC
          - Power / packaging / cost optimizations for a range of platforms
          - Superior virtualization features: security, power management, QoS, interrupt
          - State-of-the-art IO technology for network and storage performance
      • Enterprise
          - Large, flat, scale-up systems
          - Buffered memory for maximum capacity
          - Leading RAS
          - Improved caching
      [Figure label: DB2 BLU]
  • 5. POWER9 Core Execution Slice Microarchitecture
      [Figure: modular execution slices. Labels: 64b slice, 128b super-slice, POWER8 SMT8 core, POWER9 SMT8 core, POWER9 SMT4 core]
      Re-factored core provides improved efficiency & workload alignment
      • Enhanced pipeline efficiency with modular execution and intelligent pipeline control
      • Increased pipeline utilization with symmetric data-type engines: Fixed, Float, 128b, SIMD
      • Shared compute resource optimizes data-type interchange
  • 6. POWER9 Core Pipeline Efficiency: Shorter Pipelines with Reduced Disruption
      • Improved application performance for modern codes
          - Shorten fetch to compute by 5 cycles
          - Advanced branch prediction
      • Higher performance and pipeline utilization
          - Improved instruction management
          - Removed instruction grouping and reduced cracking
          - Complete up to 128 (64 on the SMT4 core) instructions per cycle
      • Reduced latency and improved scalability
          - Local pipe control of load/store operations
          - Improved hazard avoidance
          - Local recycles: reduced hazard disruption
          - Improved lock management
  • 7. POWER ISA v3.0: New Instruction Set Architecture Implemented on POWER9
      • Broader data type support
          - 128-bit IEEE 754 quad-precision float: full-width quad precision for financial and security applications
          - Expanded BCD and 128b decimal integer: for database and native analytics
          - Half-precision float conversion: optimized for accelerator bandwidth and data exchange
      • Support emerging algorithms
          - Enhanced arithmetic and SIMD
          - Random number generation instruction
      • Accelerate emerging workloads
          - Memory atomics: for high-scale data-centric applications
          - Hardware-assisted garbage collection: optimize response time of interpretive languages
      • Cloud optimization
          - Enhanced translation architecture: optimized for Linux
          - New interrupt architecture: automated partition routing for extreme virtualization
          - Enhanced accelerator virtualization
          - Hardware-enforced trusted execution
      • Energy & frequency management
          - POWER9 workload-optimized frequency: manage energy between threads and cores with reduced wakeup latency
  • 8. Acceleration Super Highway
      • 5.6x more data throughput vs. PCIe Gen3 with NVIDIA NVLink optimization to the core
      • 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
      • Access up to 2 TB of system memory delivered with coherence ... only on POWER!
      • Superior data transfer to multiple devices; 25G links to OpenCAPI GPU devices
      • GPU-to-CPU and GPU-to-GPU speed-up
  • 9. Scope of the Compiler
      • The compiler is an important layer in the system stack and is crucial for application performance
      • The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions within the pipe in mind
      • The compiler is designed to emit the appropriate ISA depending on the architecture a program is compiled for
      • Scheduling is done based on the architecture to ensure a smooth flow of instructions through the pipe
      • IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past 3 decades
      • Increasingly, IBM has embraced open source compilers such as GCC and LLVM to leverage community participation and innovation
      • This presentation focuses on how we can leverage IBM XL and open source compilers to obtain optimum performance on POWER9
  • 10. Tools that we use in the Discussion
      • Compilers
          - IBM proprietary compilers: xlC/xlc/xlf
              - xlc -O[n] program.c -o program (n ranges from 0 to 5)
              - Some common options: -qhot (array-intensive programs), -qtune=pwr9, -qsimd (enable SIMD), etc.
              - Profile-directed feedback (-qpdf1, -qpdf2)
          - Open source compilers: GCC, LLVM
              - -O[n] (n ranges from 0 to 3), -Ofast
              - Common options: -mcpu=power9
              - Profile-directed feedback (-fprofile-generate, -fprofile-use)
      • Perf tool
          - To record hotspots/profile an application
              - perf record -e r<code> ./binary args > out (produces perf.data)
              - perf report (opens the profile report stored in perf.data)
          - To measure hardware events
              - perf stat -e r<code> ./binary args > out
          - For more details, refer to the perf man page
  • 11. The processor can be thought of as containing two components
      • The front end ensures a smooth supply of instructions to the back end
      • The back end is concerned only with the execution of instructions
      • Code that has *too many* branches can cause the processor to fetch more instructions than required and hurt performance
      [Figure labels: Front end, Back end]
  • 12. Branches
      • Branches are predicted well in advance because resolving the condition takes time; waiting for it would introduce a bubble in the pipeline and slow down execution
      • POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories and predicts them very accurately. However, applications coded in a complex way can still cause a high number of mispredictions
      • A wrong prediction is a misprediction
          - Counters to detect this: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
          - Use perf stat -e r<code> ./program arguments > out to collect various counters
      • Function calls also cause branches. Such branches affect instruction cache locality and increase instruction cache misses
          - Counter to detect this: PM_L1_ICACHE_MISS
      • Branches within loops hinder vectorization/SIMD opportunities
  • 13. Guidelines to reduce branches
      • Options to reduce loop/call branches
          - Unrolling loops reduces loop branches: #pragma unroll(N) or -qunroll (XL); -funroll-loops compiler flag (GCC/LLVM)
          - Inlining routines reduces function call jumps/returns: -qinline=auto:level=<N>, N = 1 .. 10 (XL)
          - Corresponding GCC/LLVM compiler option: -finline-functions
      • Loop versioning: a slow version of the loop (that contains branches) plus a fast version (that does not contain branches). Compilers usually do this automatically at higher levels of optimization
      • Provide hints in source code to indicate the expected values of expressions appearing in branch conditions, that is, whether a branch is more likely to be taken or not: long __builtin_expect(long expression, long value);
      • If-conversion: remove simple branches wherever possible with coding patterns such as
          - if (val != 0) a = a + val;   becomes   a += val;
          - if (val == 0) a = a + 1;   becomes   a += (!val);
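The two source-level guidelines on this slide (branch hints and if-conversion) can be sketched in C roughly as follows. This is a minimal illustration, not code from the presentation: the function names and the `likely`/`unlikely` macros are hypothetical, and whether the compiler actually produces fewer branches depends on the compiler, optimization level, and target.

```c
#include <stddef.h>

/* Hypothetical convenience macros around the builtin named on the slide. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Branchy form: if (val != 0) a = a + val; */
long sum_nonzero_branchy(const long *v, size_t n)
{
    long a = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] != 0)
            a = a + v[i];
    }
    return a;
}

/* If-converted form: adding zero changes nothing, so the test is dropped. */
long sum_nonzero_branchless(const long *v, size_t n)
{
    long a = 0;
    for (size_t i = 0; i < n; i++)
        a += v[i];
    return a;
}

/* Branchy form: if (val == 0) a = a + 1; */
long count_zero_branchy(const long *v, size_t n)
{
    long a = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] == 0)
            a = a + 1;
    }
    return a;
}

/* If-converted form: !v[i] evaluates to 1 for zero and 0 otherwise. */
long count_zero_branchless(const long *v, size_t n)
{
    long a = 0;
    for (size_t i = 0; i < n; i++)
        a += !v[i];
    return a;
}

/* Branch hint: negative inputs are assumed (for illustration) to be rare. */
long sum_checked(const long *v, size_t n, int *err)
{
    long a = 0;
    *err = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0)) {   /* hinted as the cold path */
            *err = 1;
            break;
        }
        a += v[i];
    }
    return a;
}
```

Comparing a misprediction counter such as PM_BR_MPRED between the branchy and branchless variants with perf stat is one way to check whether the rewrite helps on a given input.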
  • 14. Register Spills
      • In a RISC architecture, instructions predominantly operate on registers
          - Load and store instructions are used to transfer data between memory and registers
      • When #live variables > #available registers, a spill is performed
      • 1 spill = 1 store + 1 load
      • *Spilling hot variables can hurt performance*
          - Spills can cause load-hit-stores (a store followed by a load to the same address, which may cause a delay in the pipe depending on the distance separating them)
          - Spills increase path length and add address arithmetic instructions
          - Unnecessary reads/writes to memory
      • Issues due to spills show up in the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN
  • 15. Guidelines to reduce spills
      • Limit extensive unrolling/inlining, which can cause long live ranges of variables
          - Best to let the compiler do the inlining using its own heuristics
      • XL compiler option: -qcompact can help
      • Avoid extensive use of mixed-mode operands (signed, unsigned, etc.); the conversions use up extra registers
      • Use other register resources such as SIMD registers if applicable; use vectorization wherever applicable, or code such that the compiler vectorizes automatically
      • Use special POWER ISA instructions such as andc (logical AND with complement) and orc (logical OR with complement), which combine multiple operations in a single instruction and save a register. Compilers usually generate these instructions when -mcpu=power9 (GCC/LLVM) or -qarch=pwr9 (XL) is used
      • Example: R3 = R1 & ~R2
          - Without andc: R4 = not(R2); R3 = R1 and R4
          - With andc: R3 = R1 andc R2
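In C, the andc/orc patterns correspond to combining `&` or `|` with the bitwise complement operator `~` (not the logical `!`). The sketch below uses hypothetical function names; whether a single andc or orc is actually emitted is up to the compiler and the target flags, and can be checked by inspecting the generated assembly (for example with -S).

```c
#include <stdint.h>

/* value AND (NOT mask): a candidate for a single andc instruction. */
uint64_t clear_bits(uint64_t value, uint64_t mask)
{
    return value & ~mask;
}

/* a OR (NOT b): a candidate for a single orc instruction. */
uint64_t or_complement(uint64_t a, uint64_t b)
{
    return a | ~b;
}
```

Writing the complement inline, rather than storing `~mask` in a separate variable first, keeps the intent visible to the reader; an optimizing compiler will normally recognize either form.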
  • 16. Memory Unit
      • Memory is organized in a hierarchy
      • L1 cache: the memory closest to the processor and the fastest, followed by L2 and L3, up to main memory
      • Main memory is the most distant from the processor and the slowest
      • Data cache: stores data; instruction cache: stores instructions
      • Data cache misses can stall load instructions in the pipeline, with a cascading effect on all instructions that depend on them
      • Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
      • Latencies: L1 cache (3 cycles), L2 cache (15.5 cycles), L3 cache (35.5 cycles), memory (74.5 ns)
  • 17. Techniques to optimize memory performance
      • Memory footprint reduction wherever possible
          - If you have enums declared in your program, -qenum=small allocates just one byte to enums versus the 4 bytes allocated by default
          - Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
      • Hardware prefetching
          - Controlled by DSCR settings
          - ppc64_cpu --dscr=<n>
          - Common DSCR configurations
              - 0 (all default values)
              - 0x1D7 (achieve the most aggressive depth, most quickly; enable stride-N prefetch)
              - 1 (no prefetch)
          - The POWER8 tuning guide has a detailed description of DSCR settings
      • Software prefetching
          - Programmer-inserted prefetch instructions: __dcbt, __dcbtst
          - Prefetch parameters can be tuned: -qprefetch=aggressive:dscr=<value>
          - Available gcc prefetch options: -fprefetch-loop-arrays / -fno-prefetch-loop-arrays
          - To control prefetching explicitly via software, turn off hardware prefetching using the ppc64_cpu --dscr command (requires root privileges)
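Two of the techniques above lend themselves to a short C sketch: replacing a bytemap with a bitmap, and programmer-inserted software prefetching. The helper names and the `PREFETCH_DISTANCE` value are assumptions made for illustration, and the sketch uses the GCC/Clang `__builtin_prefetch` builtin rather than the XL `__dcbt` intrinsic named on the slide. A good prefetch distance has to be found by measurement.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Bitmap: one bit per flag instead of one byte per flag (8x smaller). */
#define BITS_PER_WORD (sizeof(uint64_t) * CHAR_BIT)
#define BITMAP_WORDS(nbits) (((nbits) + BITS_PER_WORD - 1) / BITS_PER_WORD)

void bitmap_set(uint64_t *map, size_t i)
{
    map[i / BITS_PER_WORD] |= (uint64_t)1 << (i % BITS_PER_WORD);
}

void bitmap_clear(uint64_t *map, size_t i)
{
    map[i / BITS_PER_WORD] &= ~((uint64_t)1 << (i % BITS_PER_WORD));
}

int bitmap_test(const uint64_t *map, size_t i)
{
    return (int)((map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1u);
}

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while requesting future elements ahead of their use. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Guard keeps the prefetched address inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3); /* read, keep in cache */
        s += a[i];
    }
    return s;
}
```

For a simple sequential scan like this, the hardware prefetcher usually does well on its own; explicit prefetching tends to matter more for irregular access patterns, so counters such as PM_LD_MISS_L1 should be compared before and after.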
  • 18. Summary of compiler flags

      | Flag kind | XL | GCC/LLVM | Can be simulated in source | Benefit | Drawbacks |
      |---|---|---|---|---|---|
      | Unrolling | -qunroll | -funroll-loops | #pragma unroll(N) | Unrolls loops; increases scheduling opportunities for the compiler | Increases register pressure |
      | Inlining | -qinline=auto:level=N | -finline-functions | always_inline attribute or manual inlining | Increases opportunities for scheduling; reduces branches and loads/stores | Increases register pressure; increases code size |
      | Enum small | -qenum=small | -fshort-enums | Manual typedef | Reduces memory footprint | Can cause alignment issues |
      | isel instructions | | -misel | Using the ?: operator | Generates an isel instruction instead of a branch; reduces pressure on the branch predictor unit | Latency of isel is a bit higher; use if branches are not easily predictable |
      | General tuning | -qarch=pwr9, -qtune=pwr9 | -mcpu=power9, -mtune=power9 | | Turns on platform-specific tuning such as ISA and scheduling | |
      | 64-bit compilation | -q64 | -m64 | | | |
      | Prefetching | -qprefetch[=aggressive] | -fprefetch-loop-arrays | __dcbt/__dcbtst, __builtin_prefetch | Reduces cache misses | Can increase memory traffic, particularly if prefetched values are not used |
      | Link time optimization | -qipa | -flto, -flto=thin | | Enables interprocedural optimizations | Can increase overall compilation time |
      | Profile directed feedback | -qpdf1, -qpdf2 | -fprofile-generate and -fprofile-use (LLVM has an intermediate step, llvm-profdata) | | Enables hot path optimizations | Requires a training run |
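The "can be simulated in source" column for the inlining and isel rows can be illustrated with a short C sketch. The function names are hypothetical and the attribute syntax shown is the GCC/Clang form. The ?: operator only gives the compiler a select expression that it may lower to isel; it does not guarantee that instruction.

```c
/* Hypothetical helper, forced inline with the GCC/Clang attribute. */
static inline __attribute__((always_inline)) long clamp_nonneg(long x)
{
    /* A select expression; the compiler may emit isel instead of a branch. */
    return (x < 0) ? 0 : x;
}

/* Sum of the elements with negative values treated as zero. */
long sum_clamped(const long *v, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += clamp_nonneg(v[i]);   /* call is inlined into the loop body */
    return s;
}

/* Another select pattern written with ?: rather than if/else. */
long max_of(long a, long b)
{
    return (a > b) ? a : b;
}
```

As the table notes, isel has somewhat higher latency than a well-predicted branch, so this form is mainly worth trying where the condition is data-dependent and hard to predict.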
  • 20. Summary
      • Today we talked about
          - Various performance issues that can occur in an application on POWER9 Linux
          - How to identify them
          - What we can do to improve performance during compilation
          - What we can do to improve performance while coding the application itself
      • We saw that POWER9 has the most comprehensive set of hardware counters, which enable analysts to understand application performance and get to the bottlenecks quickly
      • We saw that IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, have a diverse set of options tailored to different needs to get the required performance
  • 21. References
      • POWER9 User Manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
      • IBM XL Compiler reference: http://www-01.ibm.com/support/docview.wss?uid=swg27036675
      • POWER9 raw event codes (install libpfm): https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/power9-events-list.h
      • GCC 9.2 manual: https://devdocs.io/gcc~9/
      • LLVM manual: https://llvm.org/docs/CommandGuide/

Editor's notes

  1. Memory enhancements, advances in graphics processing units (GPUs), interconnects, and bandwidth all provide building blocks for a better performing AI architecture. In fact, the POWER9 AC922 marks what will become an industry requirement: welcome to the “off-chip” era (where advanced accelerators like GPUs and FPGAs are engineered to drive modern workloads) and the sunset of the “totally on-chip” era where processing is integrated on a single chip. POWER9 is the first commercial architecture loaded with NVIDIA’s next generation NVLink (the AC922’s optimization isn’t just GPU to GPU like other commercial platforms; it also includes GPU to CPU, where it’s needed the most), OpenCAPI, and PCI-Express 4.0. Think of these technologies as a giant hose to transfer data. This slide takes a deeper look at what we mean by “Cutting Edge” and built for Enterprise AI. The AC922 combined with NVIDIA Next Generation NVLink technology provides 5.6x more data throughput than PCIe Gen3. And since this server comes with PCIe Gen4, it should be noted that Gen4 delivers 2x the throughput of PCIe Gen3. Finally, the server delivers simplified execution for Enterprise AI with up to 2 TB of coherent memory for use in complex model building.