Heterogeneous computing refers to systems that use more than one kind of processor and direct applications to run on the processor that is most efficient for that specific task. Power Systems servers based on the POWER8 processor support several accelerators that are integrated into the system to improve the efficiency of an application.
1. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
Heterogeneous computing on POWER
IBM and OpenPOWER technologies to accelerate your business
César Diniz Maciel
Executive IT Specialist
IBM Corporate Strategy
2. Session objectives
In this session we present how accelerators provide
significant application performance improvements
and how they can be deployed in a solution. We
evaluate different types of accelerators, and focus
on the latest accelerator technologies available for
IBM Power Systems and OpenPOWER solutions.
3. Acknowledgements
I would like to thank the following people for providing
invaluable information on POWER8, CAPI and FPGAs
• Jeff Stuecheli, POWER Hardware Architect
• Bill Starke, DE, POWER Server Nest Architect
• Bruce Wile, STG Hardware Design and Verification
• Jonathan Dement, Program Director, Power Systems and OpenPOWER Innovation, IBM US
4. What is heterogeneous computing?
From Wikipedia
“Heterogeneous computing refers to systems that use more than one kind of processor. These
are systems that gain performance not just by adding the same type of processors, but by adding
dissimilar processors, usually incorporating specialized processing capabilities to handle particular
tasks.“
5. What is heterogeneous computing?
Not a new concept; it is widely used in the industry
On-chip accelerators:
Cryptography and compression accelerators inside the POWER8 processor
PCIe-based accelerators:
GPGPUs, such as the NVIDIA Tesla
PCIe adapters for SSL acceleration, cryptography and compression
CAPI adapters for POWER8 systems
Appliance-based accelerators:
Netezza appliances that accelerate queries for the IBM DB2 Analytics Accelerator
6. Why heterogeneous computing?
Applications are becoming more complex and demanding more computing resources
Application speedup is limited by the performance of the slowest algorithm
If algorithm execution can be accelerated, the application runs faster – code optimization
If the algorithm can be built in silicon, execution speeds up – ASICs, FPGAs
If the algorithm can be broken into multiple pieces that can run simultaneously, the application runs faster – parallelization
“Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl’s law. On-chip heterogeneity allows the processor to better match execution resources to each application’s needs and to address a much wider spectrum of system loads—from low to high thread parallelism—with high efficiency.”
IEEE Computer, November 2005
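The limit on speedup that this slide and the quote describe is Amdahl's law: if a fraction p of the runtime is handed to an accelerator that speeds that portion up by a factor s, the overall speedup is 1 / ((1 - p) + p/s). A minimal sketch (the numbers are illustrative, not from the deck):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall application speedup when a fraction p of the runtime
    is accelerated by a factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

# If 60% of the runtime is offloaded to an accelerator that is 10x faster,
# the whole application only gets ~2.2x faster: the serial 40% dominates.
print(amdahl_speedup(0.6, 10.0))
```

This is why accelerating only the hottest algorithm has diminishing returns, and why the deck combines acceleration with parallelization of the remaining code.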
7. Why heterogeneous computing?
Scaling and power-wall issues, chip design and fabrication economics, and time-to-market demands are all intersecting, enabling and requiring a change to business as usual
8. On-chip accelerators
• Excellent performance and integration
• Accelerators are part of the microprocessor design and share
the same silicon die.
• Algorithms are implemented in silicon
• Fastest performance
• Application transparency – Hypervisor/OS abstract the
accelerator so that applications do not need to be modified
• Close integration with processor design and capabilities
• However, on-chip accelerators use the same silicon space that could be used for caches, processor features, etc.
• Trade-off between accelerators and core features
• Once built, they cannot be changed or updated
9. Examples of performance benefits of accelerators
POWER7+ and POWER8 processors include memory compression accelerators and cryptographic accelerators
On-chip accelerators: memory compression, asymmetric mathematical functions, cryptographic engines
The on-chip, loosely coupled accelerators are part of the processor chip and are managed by the Power Hypervisor. Multiple partitions can use the accelerators, and the Power Hypervisor manages the QoS and address mapping.
10. Examples of performance benefits of accelerators
Performance comparison of Active Memory Expansion (AME) on an SAP workload (SD 2-tier benchmark) on POWER7 and POWER7+ (output of the amepat command)
Significant reduction in core consumption
11. PCIe-based accelerators
• Provide great flexibility for adding capability to existing systems
• Allow many more options in terms of types of accelerators, algorithms, performance
characteristics and implementation devices
• GPUs
• FPGAs
• ASICs
• Easy to replace with newer/faster accelerators
• However, PCIe-based accelerators are seen as an I/O device, and therefore
communicate with the processor and main memory through an I/O subsystem
• The programming model needs to incorporate the I/O device
• Need for OS device drivers for the accelerator
• Data needs to be copied between main memory and accelerator memory
• Data transfer performance is limited by the I/O subsystem
12. Java gzip PCIe-based accelerator
• Feature code #EJ12 / #EJ13 - PCIe3 FPGA Accelerator Adapter
• Also known as the Generic Work Queue Engine (GenWQE) accelerator
• Hardware compression accelerator for AIX and Linux
• FPGA-based PCIe adapter; provides high-performance, low-latency compression without significant CPU overhead
• Based on the standard compression library – zlib
• Widely used open source C library that provides compression and decompression
• zlib supports RFC 1950, RFC 1951, and RFC 1952
• Enabled transparently in the IBM Java 7.1.1 release and later on Linux and AIX on POWER8 processor-based systems
• HW offload enabled by setting environment variables:
• ZLIB_INFLATE_IMPL = 1
• ZLIB_DEFLATE_IMPL = 1
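The zlib interface that GenWQE accelerates is the same one exposed by standard zlib bindings; a quick software-only illustration in Python of the round trip the adapter offloads (no accelerator required – the environment variables above apply to IBM's Java/zlib stack, not to this snippet):

```python
import zlib

data = b"heterogeneous computing on POWER " * 1000

# zlib.compress produces an RFC 1950 zlib stream, one of the three
# formats (RFC 1950/1951/1952) that the GenWQE accelerator supports.
compressed = zlib.compress(data, level=6)

assert zlib.decompress(compressed) == data  # lossless round trip
print(f"{len(data)} -> {len(compressed)} bytes")
```

Because the accelerator is wired in underneath the same library API, application code like this is unchanged whether the deflate work runs in software or on the FPGA.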
14. GPUs
From NVIDIA:
• “GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, analytics, engineering, consumer, and enterprise applications...
• ...GPU-accelerated computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user's perspective, applications simply run significantly faster.”
15. General Purpose GPUs
GPUs are well suited for parallel processing tasks. They have thousands of cores that work in parallel.
Coupled with a high-performance processor and a GPU programming model, significant application acceleration can be achieved.
16. NVIDIA K40 GPU
• Systems
• Up to 2 K40 GPUs in the S824L
• GPU spec
• Kepler-2 architecture GPU
• ASIC: GK110B
• PCIe interface
• PCIe Gen3 x16
• Full-length / double-wide PCIe form factor
• Plugs in using the existing double-wide cassette
• Power
• 235W max power draw: 75W via the PCIe slot plus 160W via an 8-pin aux. cable
• OS support on POWER
• Ubuntu 14.10 or later
http://www.nvidia.com/object/tesla-servers.html
19. NVIDIA on Power
“ The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the
highest performance we have ever seen in individual cores, and the close integration
with accelerators is outstanding for heterogeneous parallelization. Thanks to the little
endian chip and standard CUDA environment it took us less than 24 hours to port and
accelerate GROMACS.”
Erik Lindahl, Professor of Biophysics at the Science for Life
Laboratory, Stockholm University & KTH. Developer of GROMACS
20. GPUs are not only for technical computing
“Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs - A Technology Demonstration”
http://on-demand.gputechconf.com/gtc/2015/presentation/S5835-Sina-Meraji.pdf
21. Java acceleration with GPUs
“IBM’s POWER group has partnered with NVIDIA to make GPUs available on a high-performance server platform, promising the next generation of parallel performance for Java applications.”
The Next Wave of Enterprise Performance with Java, POWER Systems and NVIDIA GPUs
Performance of java.util.Arrays#sort(int[]) on an NVIDIA Tesla K40 GPU (ECC enabled) and an IBM POWER8 CPU.
22. NVIDIA Roadmap on POWER
Current: Kepler GPU, CUDA 5.5 – 7.0, unified memory (buffered), POWER8, PCIe attach
Near future: Pascal GPU, CUDA 8, full GPU paging, POWER8+, NVLink, SXM2 form factor
More in the future: Volta GPU, CUDA 9, cache coherent, POWER9, NVLink 2.0, SXM2 form factor
23. FPGA as an Accelerator
• FPGA: Field Programmable Gate Array
• From Wikipedia:
“A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a
customer or a designer after manufacturing – hence "field-programmable". The FPGA
configuration is generally specified using a hardware description language (HDL), similar to that
used for an application-specific integrated circuit (ASIC). “
• It implements the algorithm as a hardware logic circuit
• Not as optimized as an ASIC, but faster than running a software-based algorithm
• It can run fast (clock rates of 250 – 500 MHz or more)
• It has industry standard interfaces like PCIe Gen3
• The major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation
members
FPGA library examples: gzip, encryption, Monte Carlo
Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now we also have OpenCL, a more programmer-friendly language.
*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit
24. Why FPGAs
• Transistor Efficiency & Extreme Parallelism
• Bit-level operations
• Variable-precision floating point
• Compare divides on GPU vs. FPGA
• Power-Performance Advantage
• >2x compared to a general multicore processor or GPGPU
• Unused lookup tables (LUTs) are powered off
• Technology Scaling better than CPU/GPU
• FPGAs are not frequency or power limited yet
• 3D has great potential
• Dynamic reconfiguration
• Flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network connected
• allows network as well as compute specialization
25. Several IT companies are looking at FPGAs
Baidu FPGA accelerator
Microsoft Bing accelerator
27. Coherent Accelerator base principles
Function Based Acceleration
• Main application executed on the host processor
• Computationally heavy functions on the accelerator
• Single binary image encapsulates both HW and SW
version of accelerated functions
• Application calls Accelerator for common or custom
libraries
• Enable an application to work with or without available
Accelerator
• Accelerator call faults when function not available
• Host processor executes function when Accelerator
not available
• No special requirements on data structures
• System software virtualization of accelerator’s function(s)
Full Peer to Processor
• IBM-designed processor interface
• Maintains a trusted coherent interface to system
• Direct communications with application
• Accelerator function(s) use an unmodified EA
• Full access to real address space
• Utilize processor’s page tables directly
• Page faults handled by system software
• Multiple functions can exist in a single accelerator
28. Customizable Hardware
Application Accelerator
• Specific system SW, middleware, or user
application
• Written to durable interface provided by PSL
Virtual Addressing
• Accelerator can work with same memory
addresses that the processors use
• Pointers de-referenced same as the host
application
• Removes OS & device driver overhead
Hardware Managed Cache Coherence
• Enables the accelerator to participate in “locks” as a normal thread
• Lowers latency compared to the I/O communication model
POWER8 CAPI (Coherent Accelerator Processor Interface) Ecosystem
Diagram: the Power processor’s CAPP connects over PCIe to an FPGA that hosts CAPI functions 0 – n on top of the IBM-supplied POWER Service Layer, alongside other I/O (Ethernet, DASD, etc.)
29. Coherent Accelerator Processor Interface (CAPI) overview
Diagram: the POWER8 processor’s CAPP connects over PCIe to an FPGA containing the Accelerator Function Unit (AFU) and the IBM-supplied POWER Service Layer (CAPI).
Typical I/O model flow:
DD call → copy or pin source data → MMIO notify accelerator → acceleration → poll / interrupt completion → copy or unpin result data → return from DD completion
Flow with a coherent model:
Shared memory notify accelerator → acceleration → shared memory completion
Advantages of coherent attachment over I/O attachment
Virtual addressing and data caching
– Shared memory
– Lower latency for highly referenced data
Easier, more natural programming model
– Traditional thread-level programming
– Long latency of I/O typically requires restructuring of the application
Enables applications not possible on I/O
– Pointer chasing, and so on
30. CAPI vs. I/O Device Driver: Data Prep
Typical I/O model flow:
DD call (300 instructions) → copy or pin source data (10,000 instructions) → MMIO notify accelerator (3,000 instructions) → acceleration (application dependent) → poll / interrupt completion (1,000 instructions) → copy or unpin result data (1,000 instructions) → return from DD completion
7.9µs before the acceleration plus 4.9µs after it: total ~13µs for data prep
Flow with a coherent model:
Shared memory notify accelerator (<400 instructions, 0.3µs) → acceleration (application dependent, but equal to the above) → shared memory completion (<100 instructions, 0.06µs): total 0.36µs
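The totals on this slide imply the size of the win; a quick check of the arithmetic, using the slide's own numbers:

```python
# I/O model data-prep cost, from the slide
io_instructions = 300 + 10_000 + 3_000 + 1_000 + 1_000  # 15,300 instructions
io_time_us = 7.9 + 4.9                                  # ~12.8 µs (~13 µs)

# Coherent (CAPI) model data-prep cost, using the stated upper bounds
capi_instructions = 400 + 100                           # < 500 instructions
capi_time_us = 0.3 + 0.06                               # 0.36 µs

print(io_time_us / capi_time_us)  # roughly 36x less data-prep latency
```

The acceleration itself takes the same time in both models; CAPI's advantage is entirely in removing the device-driver and pin/copy overhead around it.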
31. Time spent by a GPU-accelerated application
“Figure 7 shows that the processing phases of the FD-OCT algorithm (DC-Removal, Resample, Dispersion Compensation, FFT and Logarithmic Scaling) account for ∼40% of the total run-time, while the memory (data) transfers (host-to-device and device-to-host) require approximately 60% of the time.”
From: Scalable, High Performance Fourier Domain Optical Coherence Tomography: why FPGAs and not GPGPUs
Jian Li, Marinko V. Sarunic, and Lesley Shannon
School of Engineering Science, Simon Fraser University, Burnaby BC, Canada
32. CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI coherent attach
• Issues read/write commands from applications to eliminate 97% of the code path length
• Saves 20-30 cores per 1M IOPS
Diagram: in the conventional stack, the application issues a read/write syscall that traverses the file system, LVM, and the disk and adapter device drivers (strategy()/iodone()), which must pin buffers, translate, map DMA, start the I/O, and then handle the interrupt, unmap, unpin, and iodone scheduling – about 20K instructions. In the CAPI stack, the application calls a user library with a POSIX async I/O style API (aio_read()/aio_write()) that talks to the device through a shared memory work queue – fewer than 500 instructions.
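The 97% figure on this slide follows directly from the instruction counts shown:

```python
baseline_instructions = 20_000  # conventional driver/filesystem path
capi_instructions = 500         # CAPI user-library path (upper bound)

reduction = 1 - capi_instructions / baseline_instructions
print(f"{reduction:.1%} of the code path length eliminated")  # 97.5%
```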
33. Innovative “In-Memory” NoSQL/KVS
IBM Data Engine for NoSQL
• 24:1 reduction in infrastructure
• 2.4x price reduction
• 12x less energy
• 6x less rack space
• 40TB of extended memory in 4U
35. Comparison of FFT on three paradigms
I/O-attached FPGA: same FFT algorithm as CAPI, but with a large latency impact
• The P8 core works much harder to deliver data to the FPGA because device driver code is invoked
CAPI-attached FPGA: small FFTs can be implemented in a fully streamed fashion on the FPGA
• Performance scales with host-AFU bandwidth
• Use a radix-2 pipeline for up to 4 GB/s
• Radix-2 needs 2 complex samples per cycle → 16B / 250MHz cycle → 4 GB/s
• Switch to the (more efficient) radix-4 for up to 8 GB/s
Software: FFT from the IBM Engineering and Scientific Subroutine Library (ESSL)
• Poor performance for small FFTs using multi-threaded software
• Little data reuse and strided access patterns lead to software inefficiencies
Diagram: the three configurations – P8 cores only, P8 core + FPGA over I/O, and P8 core + FPGA over CAPI.
37. CAPI projects – PDT planning
• FPGA-CAPI accelerator for PDT (Photodynamic Therapy)
• Monte-Carlo analysis of light scattering in tissue
• Used to plan non-invasive cancer treatment
• Project led by Jeff Cassidy (University of Toronto) and Lothar Lilge (Princess Margaret Cancer Centre), Canada, with IBM Austin support
41. Performance
Chart: throughput [MB/s] (0 – 800) for Queries A through F, comparing SW with 1 thread, SW with 64 threads, and an FPGA with 4 streams.
42. Additional CAPI projects
• Algorithm acceleration for risk analysis in high-performance trading
• Biometric (image and voice recognition) acceleration for secure identification in banking systems
• Flash acceleration (NVMe)
• Video transcoding and compression for IPTV/VoD
• Video analytics
• ... and many more
If you have an application, or know of an application, that can benefit from FPGAs and CAPI... talk to IBM!
43. References
First Annual OpenPOWER Summit – presentations and solutions on CAPI and GPU accelerators
Porting GPU-Accelerated Applications to POWER8 Systems
POWER8 Coherent Accelerator Processor Interface (CAPI)
IBM Data Engine for NoSQL – Power Systems Edition
Accelerating performance with the Generic Work Queue Engine (GenWQE)