Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
9.0
Heterogeneous computing on POWER
IBM and OpenPOWER technologies to accelerate your business
César Diniz Maciel
Executive IT Specialist
IBM Corporate Strategy
Session objectives
In this session we present how accelerators provide
significant application performance improvements
and how they can be deployed in a solution. We
evaluate different types of accelerators, and focus
on the latest accelerator technologies available for
IBM Power Systems and OpenPOWER solutions.
Acknowledgements
I would like to thank the following people for providing
invaluable information on POWER8, CAPI and FPGAs
• Jeff Stuecheli, POWER Hardware Architect
• Bill Starke, DE, POWER Server Nest Architect
• Bruce Wile, STG Hardware Design and Verification
• Jonathan Dement, Program Director, Power Systems and OpenPOWER Innovation, IBM US
What is heterogeneous computing?
From Wikipedia
“Heterogeneous computing refers to systems that use more than one kind of processor. These
are systems that gain performance not just by adding the same type of processors, but by adding
dissimilar processors, usually incorporating specialized processing capabilities to handle particular
tasks.”
What is heterogeneous computing?
Not a new concept, and widely used in the industry
 On-chip accelerators
 Cryptography and compression accelerators inside the POWER8 processor
 PCIe-based accelerators
 GPGPUs, such as NVIDIA Tesla
 PCIe adapters for SSL acceleration, cryptography and compression
 CAPI adapters for POWER8 systems
 Appliance-based accelerators
 Netezza appliances to accelerate queries on the IBM DB2 Analytics Accelerator
Why heterogeneous computing?
Applications are becoming more complex and demanding more
computing resources
 Application speedup is limited by the performance of the slowest algorithm
 If algorithm execution can be accelerated, the application runs faster – code optimization
 If the algorithm can be built in silicon, execution speeds up – ASICs, FPGAs
 If the algorithm can be broken into multiple pieces that run simultaneously, the application runs faster – parallelization
“Heterogeneous (or asymmetric) chip multiprocessors present unique
opportunities for improving system throughput, reducing processor power,
and mitigating Amdahl’s law. On-chip heterogeneity allows the processor
to better match execution resources to each application’s needs and to
address a much wider spectrum of system loads—from low to high thread
parallelism—with high efficiency.”
IEEE Computer, November 2005
Why heterogeneous computing?
Scaling and power-wall issues, chip design and fabrication economics, and time-to-market demands are all intersecting, both enabling and requiring a change to business as usual.
On-chip accelerators
• Excellent performance and integration
• Accelerators are part of the microprocessor design and share
the same silicon die.
• Algorithms are implemented in silicon
• Fastest performance
• Application transparency – Hypervisor/OS abstract the
accelerator so that applications do not need to be modified
• Close integration with processor design and capabilities
• However, on-chip accelerators use the same silicon space that could otherwise be used for caches, additional processor features, etc.
• Tradeoff between accelerators and core features
• Once built, they cannot be changed/updated
Examples of performance benefits of accelerators
POWER7+ and POWER8 processors include memory compression and cryptographic accelerators.
(Diagram: on-chip accelerator blocks – memory compression, asymmetric mathematical functions, and cryptographic engines.)
The on-chip loosely coupled accelerators are part of the processor chip and are managed by the Power Hypervisor. Multiple partitions can use the accelerators, and the Power Hypervisor manages QoS and address mapping.
Examples of performance benefits of accelerators
Performance comparison of Active Memory Expansion (AME) on an SAP workload (SD 2-Tier
benchmark) on POWER7 and POWER7+ (output of amepat command)
Significant reduction in core consumption
PCIe-based accelerators
• Provide great flexibility for adding capability to existing systems
• Allow many more options in terms of types of accelerators, algorithms, performance
characteristics and implementation devices
• GPUs
• FPGAs
• ASICs
• Easy to replace with newer/faster accelerators
• However, PCIe-based accelerators are seen as I/O devices, and therefore communicate with the processor and main memory through the I/O subsystem
• The programming model needs to incorporate the I/O device
• Requires OS device drivers for the accelerator
• Data must be copied between main memory and accelerator memory
• Data transfer performance is limited by the I/O subsystem (a sketch of this copy-based pattern follows)
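To make the copy-based model concrete, here is a minimal sketch of the offload pattern described above. The dev_* helpers are hypothetical placeholders (not a real accelerator API), and the "device" is simulated in host memory purely so the example runs; real GPU or FPGA runtimes follow the same allocate / copy-in / launch / copy-out shape.

```c
/* Sketch of the copy-based offload pattern used with I/O-attached accelerators.
 * dev_alloc/dev_copy_to/dev_run/dev_copy_from are hypothetical placeholders;
 * the "device" lives in host memory only so the example compiles and runs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { float *mem; size_t n; } dev_buf;      /* stand-in device buffer */

static dev_buf *dev_alloc(size_t n)                    /* allocate "device" memory */
{
    dev_buf *b = malloc(sizeof(*b));
    b->mem = malloc(n * sizeof(float));
    b->n = n;
    return b;
}
static void dev_copy_to(dev_buf *d, const float *h, size_t n)   /* host -> device */
{
    memcpy(d->mem, h, n * sizeof(float));
}
static void dev_run(dev_buf *in, dev_buf *out)                  /* "kernel" launch */
{
    for (size_t i = 0; i < in->n; i++)
        out->mem[i] = 2.0f * in->mem[i];
}
static void dev_copy_from(float *h, const dev_buf *d, size_t n) /* device -> host */
{
    memcpy(h, d->mem, n * sizeof(float));
}

int main(void)
{
    float in[4] = { 1, 2, 3, 4 }, out[4];
    dev_buf *d_in = dev_alloc(4), *d_out = dev_alloc(4);

    dev_copy_to(d_in, in, 4);       /* stage input into accelerator memory */
    dev_run(d_in, d_out);           /* notify accelerator and wait for it  */
    dev_copy_from(out, d_out, 4);   /* copy result back to main memory     */

    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```

Every offloaded operation pays for the two copies and the driver calls around them, which is the overhead that coherent attachment (discussed later) removes.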
Java gzip PCIe-based accelerator
• Feature code #EJ12/ #EJ13 - PCIe3 FPGA Accelerator Adapter
• Also known as Generic Work Queue Engine (GenWQE) accelerator
• Hardware compression accelerator for AIX and Linux
• FPGA-based PCIe adapter that provides high-performance, low-latency compression without significant CPU overhead
• Based on the standard compression library – zlib
• Widely used open-source C library that provides compression and decompression
• zlib supports RFC 1950, RFC 1951, and RFC 1952
• Enabled transparently in the IBM Java 7.1.1 release and later on Linux and AIX on POWER8 processor-based systems
• HW offload is enabled by setting the environment variables ZLIB_INFLATE_IMPL=1 and ZLIB_DEFLATE_IMPL=1 (see the sketch below)
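Because the adapter is driven through the standard zlib API, application code does not need to change. Below is a minimal, generic zlib/gzip deflate sequence (the input string and buffer size are arbitrary examples). On a system with the GenWQE adapter and the hardware-enabled zlib, calls like these can be serviced by the FPGA (for Java, the ZLIB_INFLATE_IMPL/ZLIB_DEFLATE_IMPL variables above select the hardware path); elsewhere they simply fall back to software zlib.

```c
/* Minimal gzip-format deflate using the standard zlib API; link with -lz. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *src = "hello hello hello hello";
    unsigned char dst[256];
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    /* windowBits = 31 -> 15-bit window + 16 selects gzip framing (RFC 1952) */
    if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 31, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    zs.next_in  = (unsigned char *)src;
    zs.avail_in = strlen(src);
    zs.next_out  = dst;
    zs.avail_out = sizeof(dst);

    if (deflate(&zs, Z_FINISH) != Z_STREAM_END) {   /* compress in one shot */
        deflateEnd(&zs);
        return 1;
    }
    printf("compressed %lu bytes to %lu bytes\n",
           (unsigned long)zs.total_in, (unsigned long)zs.total_out);
    deflateEnd(&zs);
    return 0;
}
```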
Java gzip PCIe-based accelerator - Performance
IBM internal testing
GPUs
From NVIDIA:
• “GPU-accelerated computing is the use of a graphics processing unit (GPU)
together with a CPU to accelerate scientific, analytics, engineering, consumer, and
enterprise applications...
• …GPU-accelerated computing offers unprecedented application performance by
offloading compute-intensive portions of the application to the GPU, while the
remainder of the code still runs on the CPU. From a user's perspective,
applications simply run significantly faster.”
General Purpose GPUs
GPUs are well suited for parallel processing tasks: they have thousands of cores that work in parallel.
Coupled with a high performance
processor and a GPU programming model,
significant application acceleration can be achieved
NVIDIA K40 GPU
• Systems
• Up to 2 K40 GPUs in the S824L
• GPU Spec
• Kepler-2 architecture GPU
• ASIC: GK110B
• PCIe interface
• PCIe Gen3 x16
• Full length / double wide PCIe form factor
• Plugs in using existing double wide cassette
• Power
• 235 W max power draw: 75 W via the PCIe slot plus 160 W via an 8-pin aux cable
• OS support on POWER
• Ubuntu 14.10 or later
http://www.nvidia.com/object/tesla-servers.html
NVIDIA and POWER8
From http://www.ecmwf.int/sites/default/files/HPC-WS-Appleyard.pdf
NVIDIA on Power
“ The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the
highest performance we have ever seen in individual cores, and the close integration
with accelerators is outstanding for heterogeneous parallelization. Thanks to the little
endian chip and standard CUDA environment it took us less than 24 hours to port and
accelerate GROMACS.”
Erik Lindahl, Professor of Biophysics at the Science for Life
Laboratory, Stockholm University & KTH. Developer of GROMACS
GPUs are not only for technical computing
“ Concurrent execution of an analytical
workload on a POWER8 server with K40
GPUs - A Technology Demonstration”
http://on-demand.gputechconf.com/gtc/2015/presentation/S5835-Sina-Meraji.pdf
Java acceleration with GPUs
“IBM’s POWER group has partnered with
NVIDIA to make GPUs available on a high-
performance server platform, promising the
next generation of parallel performance for
Java applications.”
The Next Wave of Enterprise Performance with Java, POWER Systems and
NVIDIA GPUs
Performance of java.util.Arrays#sort(int[]) on NVIDIA Tesla K40 GPU (ECC enabled) and IBM Power8 CPU.
NVIDIA Roadmap on POWER
• Current: Kepler (PCIe form factor), CUDA 5.5 – 7.0, unified memory (buffered), POWER8, PCIe attach
• Near future: Pascal (SXM2), CUDA 8, full GPU paging, POWER8+, NVLink
• Further out: Volta (SXM2), CUDA 9, cache coherent, POWER9, NVLink 2.0
FPGA as an Accelerator
• FPGA: Field Programmable Gate Array
• From Wikipedia:
“A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a
customer or a designer after manufacturing – hence "field-programmable". The FPGA
configuration is generally specified using a hardware description language (HDL), similar to that
used for an application-specific integrated circuit (ASIC).”
• It implements the algorithm as a hardware logic circuit
• Not as optimized as an ASIC, but faster than running a software-based algorithm
• It can run fast (clock rates of 250 – 500 MHz or more)
• It has industry standard interfaces like PCIe Gen3
• The major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation
members
(Diagram: an FPGA hosting a library of accelerated functions such as gzip, encryption, and Monte Carlo simulation.)
Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now we also have OpenCL, a more programmer-friendly language.
*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit
Why FPGAs
• Transistor Efficiency & Extreme Parallelism
• Bit-level operations
• Variable-precision floating point
• Compare divides on GPU vs. FPGA
• Power-Performance Advantage
• >2x compared to a general multicore processor or GPGPU
• Unused lookup tables (LUTs) are powered off
• Technology Scaling better than CPU/GPU
• FPGAs are not frequency or power limited yet
• 3D has great potential
• Dynamic reconfiguration
• Flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network connected
• allows network as well as compute specialization
Several IT companies are looking at FPGAs
Baidu FPGA accelerator
Microsoft Bing accelerator
Combining the best of both architectures
Coherent Accelerator base principles
Function Based Acceleration
• Main application executes on the host processor; computationally heavy functions run on the accelerator
• A single binary image encapsulates both the HW and SW versions of accelerated functions
• Application calls Accelerator for common or custom
libraries
• Enable an application to work with or without available
Accelerator
• Accelerator call faults when function not available
• Host processor executes function when Accelerator
not available
• No special requirements on data structures
• System software virtualization of accelerator’s function(s)
Full Peer to Processor
• IBM-designed processor interface
• Maintains a trusted coherent interface to system
• Direct communications with application
• Accelerator function(s) use an unmodified EA (effective address)
• Full access to real address space
• Utilize processor’s page tables directly
• Page faults handled by system software
• Multiple functions can exist in a single accelerator
Customizable Hardware
Application Accelerator
• Specific system SW, middleware, or user
application
• Written to durable interface provided by PSL
Virtual Addressing
• Accelerator can work with same memory
addresses that the processors use
• Pointers de-referenced same as the host
application
• Removes OS & device driver overhead
Hardware Managed Cache Coherence
• Enables the accelerator to participate in locks as a normal thread
• Lowers latency compared with the I/O communication model
POWER8 CAPI (Coherent Accelerator Processor Interface) ecosystem
(Diagram: a POWER8 processor with the CAPP unit connected over PCIe to an FPGA that hosts the IBM-supplied POWER Service Layer (CAPI) and multiple accelerator functions (Function 0 … Function n), alongside conventional I/O such as Ethernet and DASD.)
Coherent Accelerator Processor Interface (CAPI) overview
(Diagram: POWER8 processor with the CAPP unit connected over PCIe to an FPGA containing the IBM-supplied POWER Service Layer (CAPI) and an Accelerator Function Unit (AFU).)
Typical I/O model flow: DD call → copy or pin source data → MMIO notify accelerator → acceleration → poll/interrupt completion → copy or unpin result data → return from DD → completion
Flow with a coherent model: shared memory notify accelerator → acceleration → shared memory completion
Advantages of coherent attachment over I/O attachment
 Virtual addressing and data caching
– Shared memory
– Lower latency for highly referenced data
 Easier, more natural programming model
– Traditional thread-level programming
– Long latency of I/O typically requires
restructuring of application
 Enables applications not possible on I/O
– Pointer chasing, and so on
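To contrast with the copy-based sketch shown earlier for I/O-attached accelerators, the following is a conceptual sketch of the coherent flow: the application builds a work element in ordinary process memory and the accelerator is handed a virtual pointer, with no pinning or staging copies. notify_accelerator() is a hypothetical placeholder, stubbed in software here so the example runs; the stub also mirrors the principle that the host executes the function when no accelerator is available. A real CAPI AFU is driven through the POWER Service Layer and its accompanying library, not through this code.

```c
/* Conceptual sketch of the coherent (CAPI-style) flow.  The work element
 * lives in ordinary application memory and is referenced by virtual address;
 * notify_accelerator() is a hypothetical doorbell, stubbed in software here. */
#include <stdint.h>
#include <stdio.h>

struct work_element {
    const char *src;       /* effective (virtual) address of source data   */
    char       *dst;       /* effective address of result buffer           */
    uint64_t    len;       /* bytes to process                             */
    volatile int done;     /* accelerator writes this flag on completion   */
};

/* Hypothetical doorbell: a coherent accelerator would receive the work
 * element's address and dereference the pointers itself; here the host
 * performs the toy "function" (upper-casing a string) as a fallback.      */
static void notify_accelerator(struct work_element *wed)
{
    for (uint64_t i = 0; i < wed->len; i++)
        wed->dst[i] = (char)(wed->src[i] & ~0x20);
    wed->done = 1;                 /* completion written into shared memory */
}

int main(void)
{
    char out[16] = { 0 };
    struct work_element wed = { "hello", out, 5, 0 };

    notify_accelerator(&wed);      /* pointers passed as-is: no copy, no pin */
    while (!wed.done)              /* poll completion in our own address space */
        ;
    printf("%s\n", out);           /* prints HELLO */
    return 0;
}
```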
CAPI vs. I/O device driver: data preparation
Typical I/O model flow: DD call → copy or pin source data → MMIO notify accelerator → acceleration (application dependent, equal in both models) → poll/interrupt completion → copy or unpin result data → return from DD → completion. The data-preparation steps cost roughly 300 + 10,000 + 3,000 + 1,000 + 1,000 instructions, about 7.9 µs + 4.9 µs, ~13 µs total for data prep.
Flow with a coherent model: shared memory notify accelerator → acceleration (application dependent, equal in both models) → shared memory completion. Fewer than 400 + 100 instructions, about 0.3 µs + 0.06 µs, ~0.36 µs total.
Time spent by a GPU-accelerated application
“Figure 7 shows that the processing phases of the FD-OCT algorithm (DC-Removal, Resample, Dispersion Compensation, FFT and Logarithmic Scaling) account for ~40% of the total run-time, while the memory (data) transfers (host-to-device and device-to-host) require approximately 60% of the time.”
From: Scalable, High Performance Fourier Domain Optical Coherence Tomography: why FPGAs and not GPGPUs
Jian Li, Marinko V. Sarunic, and Lesley Shannon
School of Engineering Science
Simon Fraser University, Burnaby BC, Canada
CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI coherent attach
• Issues read/write commands from applications to eliminate 97% of the code path length
• Saves 20-30 cores per 1M IOPS
(Diagram: conventional path – Application → read/write syscall → file system → LVM → disk and adapter device driver (strategy()/iodone()), including pinning buffers, translating, mapping DMA, starting the I/O, then interrupt handling, unmapping, unpinning and iodone scheduling; roughly 20K instructions. CAPI path – Application → user library exposing a POSIX async-I/O-style API (aio_read()/aio_write()) → shared-memory work queue; reduced to fewer than 500 instructions.)
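The user library above presents a POSIX-async-I/O-style interface (aio_read()/aio_write()). For reference, this is the standard POSIX AIO pattern that interface resembles; the file path below is only an example, and the actual CAPI flash library calls are its own, not the ones shown here.

```c
/* Standard POSIX asynchronous read; on Linux, link with -lrt. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("/tmp/sample.dat", O_RDONLY);   /* example input file */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* The calling thread is free to do other work; here we simply poll. */
    while (aio_error(&cb) == EINPROGRESS)
        ;

    ssize_t n = aio_return(&cb);                  /* bytes transferred */
    printf("read %zd bytes asynchronously\n", n);
    close(fd);
    return 0;
}
```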
Innovative “In-Memory” NoSQL/KVS – IBM Data Engine for NoSQL
• 24:1 reduction in infrastructure
• 2.4x price reduction
• 12x less energy
• 6x less rack space
• 40 TB of extended memory in 4U
Demonstrating the Value of CAPI Attachment
IBM Data Engine for NoSQL
Comparison of FFT on three paradigms
• FPGA via device driver: same FFT algorithm as CAPI, but a large latency impact; the P8 core works much harder to deliver data to the FPGA because device driver code is invoked
• FPGA via CAPI: small FFTs can be implemented in a fully streamed fashion on the FPGA; performance scales with host-AFU bandwidth; a radix-2 pipeline handles up to 4 GB/s (radix-2 needs 2 complex samples per cycle → 16 B per 250 MHz cycle → 4 GB/s); switching to the more efficient radix-4 pipeline handles up to 8 GB/s
• Software: FFT from the IBM Engineering and Scientific Subroutine Library (ESSL); poor performance for small FFTs even with multi-threaded software; little data reuse and strided access patterns lead to software inefficiencies
(Diagram: three configurations – POWER8 cores only, a P8 core with an FPGA behind a device driver, and a P8 core with an FPGA attached via CAPI.)
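The bandwidth figures above follow from simple arithmetic, assuming 8-byte complex samples (consistent with the 16 B per cycle quoted for radix-2) and assuming the radix-4 pipeline consumes 4 complex samples per cycle:

\[
\text{radix-2: } 2 \times 8\,\mathrm{B} = 16\,\mathrm{B/cycle},\qquad 16\,\mathrm{B/cycle} \times 250\,\mathrm{MHz} = 4\,\mathrm{GB/s}
\]
\[
\text{radix-4: } 4 \times 8\,\mathrm{B} = 32\,\mathrm{B/cycle},\qquad 32\,\mathrm{B/cycle} \times 250\,\mathrm{MHz} = 8\,\mathrm{GB/s}
\]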
FFT Results
CAPI projects – PDT planning
• FPGA-CAPI accelerator for PDT (Photodynamic Therapy)
• Monte-Carlo analysis of light scattering in tissue
• Used to plan for non-invasive cancer treatment
• Project led by Jeff Cassidy (University of Toronto) and Lothar Lilge (Princess Margaret Cancer Centre), Canada, with IBM Austin support
CAPI projects – Text Analytics
(Performance chart: throughput [MB/s], 0 – 800, for Queries A–F, comparing SW with 1 thread, SW with 64 threads, and the FPGA with 4 streams.)
Additional CAPI projects
• Algorithm acceleration for risk analysis in high-performance trading
• Biometric (image and voice recognition) acceleration for secure identification in banking systems
• Flash acceleration (NVMe)
• Video transcoding and compression for IPTV/VoD
• Video analytics
• … and many more
If you have an application, or know of an application, that can benefit from FPGAs and CAPI... talk to IBM!
References
First Annual OpenPOWER Summit – presentations and solutions on CAPI and GPU accelerators
Porting GPU-Accelerated Applications to POWER8 Systems
POWER8 Coherent Accelerator Processor Interface (CAPI)
IBM Data Engine for NoSQL – Power Systems Edition
Accelerating performance with the Generic Work Queue Engine (GenWQE)
© Copyright IBM Corporation
2015
Continue growing your IBM skills
ibm.com/training provides a comprehensive portfolio of skills and career accelerators that are designed to meet all your training needs.
• Training in cities local to you - where and
when you need it, and in the format you want
– Use IBM Training Search to locate public training classes
near to you with our five Global Training Providers
– Private training is also available with our Global Training Providers
• Demanding a high standard of quality –
view the paths to success
– Browse Training Paths and Certifications to find the
course that is right for you
• If you can’t find the training that is right for you with
our Global Training Providers, we can help.
– Contact IBM Training at dpmc@us.ibm.com
Global Skills
Initiative