2. OpenCAPI Topics
• Industry Background
• Where/How OpenCAPI Technology Is Used
• Technology Overview and Advantages
• Heterogeneous Computing
• SNAP Framework for CAPI/OpenCAPI
3. Industry Background that Defined OpenCAPI
• Growing computational demand due to emerging workloads (e.g., AI, cognitive computing)
• Moore's Law no longer sustained by traditional silicon scaling
• Driving increased dependence on hardware acceleration for performance
  - Hyperscale datacenters and HPC need much higher network bandwidth
  - 100 Gb/s -> 200 Gb/s -> 400 Gb/s links are emerging
  - Deep learning and HPC require more bandwidth between accelerators and memory
  - Emerging memory/storage technologies are driving the need for high bandwidth with low latency
• Hardware accelerators are defining the attributes of a high-performance bus
  - Growing demand for network performance and network offload
  - Introduction of device coherency requirements (IBM's introduction in 2013)
  - Emergence of complex storage and memory solutions
  - Various form factors, with no single one able to address everything (e.g., GPUs, FPGAs, ASICs)
…all Relevant to Modern Data Centers
4. Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI
(Diagram: accelerator and memory use cases attached to the processor via OpenCAPI 3.0 and OpenCAPI 3.1)
5. POWER9 I/O Leading the Industry
The POWER9 silicon die (offered in various packages for scale-out and scale-up) provides two PHY types:
8 and 16 Gbps PHY - protocols supported:
• PCIe Gen3 x16 and PCIe Gen4 x8
• CAPI 2.0 on PCIe Gen4
25 Gbps PHY - protocols supported:
• OpenCAPI 3.0
• NVLink 2.0
In total, POWER9 I/O leads the industry: PCIe Gen4, CAPI 2.0, NVLink 2.0, and OpenCAPI 3.0.
6. Acceleration Paradigms with Great Performance
Egress Transform (processor chip -> accelerator via DLx/TLx -> network/storage)
Examples: encryption, compression, erasure coding prior to delivering data to the network or storage
Ingress Transform (network/storage -> accelerator via DLx/TLx -> processor chip)
Examples: video analytics, network security, deep packet inspection, data-plane acceleration, video encoding (H.265), high-frequency trading, etc.
Bi-Directional Transform (data transformed in both directions through the accelerator)
Examples: NoSQL such as Neo4j with graph node traversals, etc.
Memory Transform (basic work offload using OpenCAPI-attached memory)
Examples: machine or deep learning such as natural language processing, sentiment analysis, or other actionable intelligence using OpenCAPI-attached memory
Needle-in-a-Haystack Engine (the accelerator scans a large haystack of data and only the needles are sent to the processor; a sketch of this filter pattern follows at the end of this slide)
Examples: database searches, joins, intersections, merges
OpenCAPI is ideal for acceleration because of its bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture.
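To make the needle-in-a-haystack pattern concrete, here is a minimal C sketch of the kind of filter kernel the accelerator would implement near the data. The record_t type, the key-match predicate, and filter_needles() are hypothetical illustrations, not part of OpenCAPI:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical record type and match predicate, for illustration only. */
typedef struct {
    uint64_t key;
    uint64_t payload;
} record_t;

/* The accelerator-side filter: scan the large haystack held in
 * OpenCAPI-attached or host memory and copy only the matching records
 * (the "needles") into a small result buffer that the processor reads.
 * Written in plain C, as it might be expressed for an HLS flow. */
size_t filter_needles(const record_t *haystack, size_t n, uint64_t wanted_key,
                      record_t *needles, size_t max_out)
{
    size_t found = 0;
    for (size_t i = 0; i < n && found < max_out; i++) {
        if (haystack[i].key == wanted_key)   /* the match predicate        */
            needles[found++] = haystack[i];  /* only needles cross the bus */
    }
    return found;
}
```

The point of the pattern is that the haystack traversal stays on the accelerator side of the link, so only the small needles buffer consumes bus bandwidth.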
7. Comparison of Memory Paradigms
Main Memory (example: basic DDR attach)
Processor chip <-> DLx/TLx <-> DDR4/5
An ultra-low-latency ASIC buffer chip adds only about 5 ns on top of a native DDR direct connect.
Emerging Storage Class Memory
Processor chip <-> DLx/TLx <-> SCM
Storage class memories have the potential to be the next disruptive technology. Examples include ReRAM, MRAM, and Z-NAND; all are racing to become the de facto standard.
Tiered Memory
Processor chip <-> DLx/TLx <-> DDR4/5, plus DLx/TLx <-> SCM
Storage class memory tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 and 3.0 architecture, while retaining the ability to use load/store semantics.
Key points (see the load/store sketch after this list):
• Common physical interface between non-memory and memory devices
• The OpenCAPI protocol was architected to minimize latency; excellent for classic DRAM memory
• Extreme bandwidth beyond the classical DDR memory interface
• An agnostic interface will handle evolving memory technologies in the future (e.g., compute-in-memory)
• Ability to insert a memory buffer that decouples raw memory from the host interface to optimize power, cost, and performance
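As a minimal illustration of those load/store semantics: once OpenCAPI-attached DDR or SCM is mapped into a process, software touches it like any other memory. map_opencapi_mem() below is a hypothetical helper standing in for whatever OS/allocator mechanism exposes the attached memory; it is not a real API:

```c
#include <stdint.h>
#include <stddef.h>

/* map_opencapi_mem() is a hypothetical helper standing in for whatever
 * OS mechanism exposes OpenCAPI-attached DDR or SCM to a process; it is
 * not a real API. */
extern void *map_opencapi_mem(size_t bytes);

int main(void)
{
    uint64_t *p = map_opencapi_mem(1 << 20);  /* 1 MiB of attached memory */

    p[0] = 42;             /* plain store travels over the OpenCAPI link */
    uint64_t v = p[0];     /* plain load comes back the same way         */

    return (int)v;         /* no driver calls, no block I/O involved     */
}
```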
9. Latency Ping-Pong Test
• Simple workload created to simulate communication between the system and an attached FPGA
• Bus traffic recorded with a protocol analyzer and PowerBus traces
• Response times and statistics calculated
The same ping-pong loop is run over both stacks (a C sketch of the host loop follows):

Host code (both cases):
1. Copy 512B from cache to FPGA
2. Poll on incoming 128B cache injection
3. Reset poll location
4. Repeat

FPGA code (both cases):
1. Poll on 512B received from host
2. Reset poll location
3. DMA write 128B for cache injection
4. Repeat

OpenCAPI path: host TL, DL, PHY <-> OpenCAPI link <-> FPGA TLx, DLx, PHYx
PCIe path: host PCIe stack <-> PCIe link <-> Altera PCIe HIP* on the FPGA
* HIP refers to hardened IP
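A minimal C sketch of the host side of this loop, assuming the 512B send buffer and the 128B poll location live in coherent, FPGA-visible memory; copy_to_fpga() is a hypothetical stand-in for the actual transfer mechanism, and the setup code is not shown:

```c
#include <stdint.h>
#include <stddef.h>

#define ITERATIONS 10000

volatile uint8_t poll_loc[128];  /* FPGA cache-injects 128B here */
uint8_t send_buf[512];           /* host copies 512B to the FPGA */

extern void copy_to_fpga(const void *src, size_t len);  /* hypothetical */

void pingpong(void)
{
    for (int i = 0; i < ITERATIONS; i++) {
        copy_to_fpga(send_buf, sizeof(send_buf));  /* 1. copy 512B to FPGA   */
        while (poll_loc[0] == 0)                   /* 2. poll on injection   */
            ;
        poll_loc[0] = 0;                           /* 3. reset poll location */
    }                                              /* 4. repeat              */
}
```

Dividing the measured round-trip time by the iteration count yields the per-hop latency that the protocol analyzer and PowerBus traces corroborate.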
11. OpenCAPI-Enabled FPGA Cards
Mellanox Innova-2 Accelerator Card and Alpha Data 9V3 Accelerator Card
(Figure: typical eye diagram at 25 Gb/s using these cards)
12. OpenCAPI Topics
• Industry Background
• Where/How OpenCAPI Technology Is Used
• Technology Overview and Advantages
• Heterogeneous Computing
• SNAP Framework for CAPI/OpenCAPI
13. OpenCAPI: Heterogeneous Computing
• Why can OpenCAPI make specific workloads run faster?
  -> FPGAs (various high-bandwidth I/O; great at deep parallel and pipelined designs)
• How can OpenCAPI make software and applications run faster?
  -> Shared coherent memory (with virtual addressing; a low-latency, low-overhead design)
(Diagram: the evolution from a single processor, to distributed computing across many CPUs, to heterogeneous computing combining CPU, GPU, ASIC, and FPGA)
14. FPGAs: What They Are Good At
• FPGA: Field Programmable Gate Array
• Configurable I/O and high-speed serial links
• Integrated hard IP (multiply/add, SRAM, PLL, PCIe, Ethernet, DRAM controller, etc.)
• Custom logic and complex special instructions
• Bit/matrix manipulation, image processing, graphs, neural networks, etc.
• Great at deep parallel and pipelined designs (for suitable workloads)
(Diagram: arrays of processing engines illustrating parallelism, pipelining, and instruction complexity, e.g., hash, add, and RAM stages combined into one custom instruction)
15. FPGAs: Different Types of High-Bandwidth I/O Cards
Alpha Data 9V3 (networking), Mellanox Innova-2 (networking, ConnectX-5), Nallatech 250S+ (storage / NVMe SSDs)
16. OpenCAPI: Key Attributes for Acceleration
(Diagram: any OpenCAPI-enabled processor connected over a TL/DL 25Gb I/O link to an accelerated OpenCAPI device containing the accelerated function behind TLx/DLx)
1. Architecture-agnostic bus: applicable to any system/microprocessor architecture
2. Optimized for high bandwidth and low latency
   - 25 Gb/s links (with SlimSAS connector)
   - Removes the PCIe layering and uses the new, thinner TL/DL-TLx/DLx protocol instead (latency optimized)
   - High-performance industry-standard interface design with zero overhead
3. Coherency and virtual addressing
   - Attached devices operate natively within the application's user space and coherently with the host microprocessor
   - Enables low overhead with no kernel, hypervisor, or firmware involvement
   - Shared coherent data structures and pointers (put the data/memory closer to the processor/FPGA)
   - It is all traditional thread-level programming, with CPU-coherent device memory
4. Supports a wide range of use cases
   - Architected for both classic memory and emerging storage class memory
   - Devices: storage, compute, and network accelerators; ASIC/FPGA/FFSA; FPGA, SoC, or GPU accelerators
   - Memory: standard system memory, buffered system memory (OpenCAPI memory buffers), device memory, and advanced SCM solutions, with load/store or block access
17. OpenCAPI: Virtual Addressing and Benefits
• An OpenCAPI device operates in the virtual address spaces of the applications it supports
  - Eliminates kernel and device-driver software overhead
  - Allows the device to operate on application memory without kernel-level data copies or pinned pages
  - Simplifies the programming effort to integrate accelerators into applications (SNAP)
  - Improves accelerator performance
• Virtual-to-physical address translation occurs in the host CPU (no PSL logic needed on the device)
  - Reduces the design complexity of OpenCAPI-attached devices
  - Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
  - Security: since the OpenCAPI device never has access to a physical address, a defective or malicious device cannot access memory locations belonging to the kernel or to other applications it is not authorized to access
A sketch of what this enables follows.
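Because the AFU shares the application's virtual address space, an ordinary in-memory structure can be handed over by passing one pointer, with no copies and no pinned pages. In this sketch, notify_afu() stands in for a hypothetical MMIO doorbell write; it is not a real API:

```c
#include <stdint.h>

struct node {
    uint64_t value;
    struct node *next;   /* a virtual address, equally valid on the AFU */
};

extern void notify_afu(uint64_t work_descriptor);  /* hypothetical doorbell */

void offload_list(struct node *head)
{
    /* The AFU can chase head->next->next... directly, since it issues
     * loads using the same virtual addresses the application uses. */
    notify_afu((uint64_t)(uintptr_t)head);
}
```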
18. OpenCAPI: Shared Coherent Data Structures and Pointers
No kernel or device driver is involved (putting the data/memory closer to the processor/FPGA), in contrast to the typical I/O model with a device driver.
AFU: Attached Functional Unit
(Diagram: the typical I/O model with a device driver compared against OpenCAPI virtual addressing, in more depth)
19. OpenCAPI: Protocol Stack (Much Thinner Than Traditional PCIe)
• The OpenCAPI Transaction Layer specifies the control and response packets between a host and an endpoint OpenCAPI device: TL and TLx
• On the host side, the Transaction Layer converts:
  - host-specific protocol requests into Transaction Layer defined commands
  - TLx commands into host-specific protocol requests
  - responses, in both directions
• On the endpoint OpenCAPI device side, the Transaction Layer converts:
  - AFU protocol requests into Transaction Layer commands
  - TL commands into AFU protocol requests
  - responses, in both directions
• The OpenCAPI Data Link Layer supports a 25 Gbps serial data rate per lane, connecting a processor to an FPGA or ASIC that contains an endpoint accelerator or device: DL and DLx
  - The basic configuration supports 8 lanes running at 25.78125 GHz for a 25 GB/s data rate.
Note: on the host side, TL/DL/PHY -> IBM POWER9 hardware and firmware are both ready now; on the device side, TLx/DLx/PHYx -> I/O vendors also have the reference design ready.
(Diagram: the host processor stack - host bus protocol layer, TL with frame/parser, DL, PHY - connected over the serial link to the device stack - PHYx, DLx, TLx with frame/parser, AFU protocol layer, AFU - exchanging OpenCAPI packets and DL/DLx packets)
The full TL/DL specification can be obtained by going to opencapi.org and registering under the Technical -> Specifications pull-down menu.
20. OpenCAPI: OCSE (OpenCAPI Simulation Environment)
FPGA enablement includes:
• Customer application and accelerator
• Operating system enablement (little-endian Linux)
• Reference kernel driver (ocxl)
• Reference user library (libocxl)
• Hardware and reference designs to enable coherent acceleration
(Diagram: the application, libocxl, and the ocxl driver run on the host processor core with coherent memory, connected over the 25G TL/DL - DLx/TLx cable to the AFU on the card, also with coherent memory; OCSE models the red-outlined area)
• OCSE enables AFU and application co-simulation, but only when the reference libocxl and the reference TLx/DLx are used
• OCSE dependencies: a fixed reference TLx/AFU interface and a fixed reference libocxl user API
• Will be contributed to the OpenCAPI Consortium
• Development progress: 90%
(A minimal libocxl host-side sketch follows.)
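For orientation, here is a minimal host-side sketch using the reference libocxl user library. The call sequence (ocxl_afu_open, ocxl_afu_attach, ocxl_mmio_map, ocxl_mmio_write64) follows the library's published API, but the exact signatures should be checked against your installed version, and the AFU name and register offset below are made-up examples:

```c
#include <stdint.h>
#include <libocxl.h>  /* the reference user library */

int main(void)
{
    ocxl_afu_h afu;
    ocxl_mmio_h mmio;

    /* "IBM,MEMCPY" and offset 0x10 are made-up examples; use the name
     * and register map your AFU actually documents. */
    if (ocxl_afu_open("IBM,MEMCPY", &afu) != OCXL_OK)
        return 1;
    if (ocxl_afu_attach(afu, 0) != OCXL_OK)  /* share this process's address space */
        return 1;
    if (ocxl_mmio_map(afu, OCXL_PER_PASID_MMIO, &mmio) != OCXL_OK)
        return 1;

    /* Hypothetical doorbell: tell the AFU where its work descriptor lives. */
    uint64_t descriptor_ea = 0;  /* would be a real effective address */
    ocxl_mmio_write64(mmio, 0x10, OCXL_MMIO_LITTLE_ENDIAN, descriptor_ea);

    ocxl_afu_close(afu);
    return 0;
}
```

The same application code can run against OCSE for co-simulation, since OCSE sits behind the reference libocxl API.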
21. OpenCAPI: Two Factors of Low Latency
1. No kernel/device-driver processing to map memory addresses, and no need to move data back and forth between user space, the kernel, and the device
2. Thinner protocol layers, compared to PCIe

Typical I/O model flow with a device driver (total ~13 µs for data preparation):
DD call (300 instructions) -> copy or pin source data (10,000 instructions) -> MMIO notify accelerator (3,000 instructions) -> acceleration -> poll/interrupt completion (1,000 instructions) -> copy or unpin result data (1,000 instructions) -> return from DD completion (roughly 7.9 µs before and 4.9 µs after the acceleration)

Flow with a coherent model (CAPI) (total ~0.36 µs):
shared-memory notify (400 instructions, 0.3 µs) -> acceleration -> shared-memory completion (100 instructions, 0.06 µs)

The result: faster memory access and easier programming, plus a faster and more effective protocol (TL frame/parser, DL, and PHY on the host; PHYx, DLx, and TLx frame/parser on the device, over OpenCAPI links).
22. OpenCAPI Topics
• Industry Background
• Where/How OpenCAPI Technology Is Used
• Technology Overview and Advantages
• Heterogeneous Computing
• SNAP Framework for CAPI/OpenCAPI
23. SNAP Framework Concept for CAPI/OpenCAPI
(Diagram: actions X/Y/Z on the FPGA, connected through SNAP and CAPI/OpenCAPI to the host, with actions written in C/C++ and compiled by Vivado HLS)
CAPI/OpenCAPI: the FPGA becomes a peer of the CPU -> an action directly accesses host memory
SNAP: manages server threads and actions, and manages access to I/O (AXI to memory/network, ...) -> an action easily accesses resources
FPGA: gives on-demand compute capability and direct I/O access (AXI to storage/network, ...) -> an action directly accesses external resources
Vivado HLS: compiles an action written in C/C++ and optimizes the code for performance -> action code is developed efficiently (a toy action sketch follows this slide)
CAPI/OpenCAPI + SNAP + FPGA + Vivado HLS = the best way to offload and accelerate C/C++ code, with minimum code change, quick porting, and better performance than the CPU.
SNAP: Storage, Networking, Analytics Programming framework
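To give a flavor of what Vivado HLS compiles, here is a toy action in C: it reads words from host memory over an AXI master port, adds a constant, and writes them back. The pragmas follow standard Vivado HLS conventions, but the port layout is a simplified stand-in for the real SNAP action interface, and hls_action() is a hypothetical name:

```c
#include <stdint.h>

/* A toy action in the Vivado HLS style: read n 64-bit words from host
 * memory over an AXI master port, add a constant, write them back. */
void hls_action(uint64_t *host_mem, uint32_t n, uint64_t addend)
{
#pragma HLS INTERFACE m_axi port=host_mem offset=slave bundle=host
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=addend
#pragma HLS INTERFACE s_axilite port=return

    for (uint32_t i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        host_mem[i] += addend;  /* one iteration per clock once the pipeline fills */
    }
}
```

The PIPELINE pragma is where the FPGA's strength shows: HLS turns the loop body into a hardware pipeline that accepts a new element every cycle.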
24. CAPI Development without SNAP
(Diagram: processes A/B/C on the host, each with a slave context using libcxl/libocxl over the cxl/ocxl kernel driver, connected to the CAPI PSL on the FPGA HDK)
• Huge development effort
• Performance focused, with full cache-line control
• Programming based on libcxl/libocxl plus hand-written VHDL and Verilog code
(Software program: the application on the host; hardware logic: the acceleration on the FPGA)
25. SNAP: Focus on the Additional Acceleration Values
(Diagram: processes A/B/C on the host, each with a slave context, libcxl/libocxl, the cxl/ocxl driver, the SNAP library, and a job queue; on the FPGA, a PSL/AXI bridge with host DMA control, MMIO, and a job manager connects over AXI and AXI-Lite to the hardware actions - Action 1 in VHDL, Action 2 in C/C++, Action 3 in Go, ... - and over AXI to on-card DRAM, on-card NVMe, and networking (TBD))
• Quick and easy development
• Use a high-level synthesis tool to compile C/C++ to RTL, or use RTL directly
• Programming is based on SNAP library function calls and the AXI interface
• AXI is an industry standard for on-chip interconnection (https://www.arm.com/products/system-ip/amba-specifications)
(Software program: the application on the host; hardware actions: the acceleration on the FPGA. A host-side sketch follows.)
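A minimal host-side sketch of those SNAP library calls, following the pattern of the example programs in the snap repository. The device path, the action type ID, and the empty job below are placeholders, and the exact signatures should be checked against the library itself:

```c
#include <stdio.h>
#include <libsnap.h>  /* from https://github.com/open-power/snap */

int main(void)
{
    struct snap_card *card;
    struct snap_action *action;
    struct snap_job job;

    /* The device path and the 0x10140000 action type are placeholders;
     * real values come from your card and your action's registered ID. */
    card = snap_card_alloc_dev("/dev/cxl/afu0.0s",
                               SNAP_VENDOR_ID_IBM, SNAP_DEVICE_ID_SNAP);
    if (card == NULL)
        return 1;

    action = snap_attach_action(card, 0x10140000, 0, 60 /* timeout, s */);
    if (action == NULL)
        return 1;

    /* An empty job for brevity; a real action passes its input/output
     * descriptors via snap_job_set(). */
    snap_job_set(&job, NULL, 0, NULL, 0);
    if (snap_action_sync_execute_job(action, &job, 60) != 0)
        fprintf(stderr, "action failed\n");

    snap_detach_action(action);
    snap_card_free(card);
    return 0;
}
```

Compared with slide 24's raw libcxl flow, the job manager and AXI bridge absorb the DMA and MMIO plumbing, leaving the application with one synchronous job call.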
26. Summary: SNAP Framework for POWER9
• SNAP will be supported on POWER9
  - As an abstraction layer on CAPI, SNAP actions will be portable across CAPI 2.0 and OpenCAPI
  - Minimum code changes and quick porting
  - SNAP hides the differences (standard AXI interfaces to the I/Os: DRAM, NVMe, Ethernet, etc.)
  - Supports higher-level languages (Vivado HLS helps you convert C/C++ to VHDL/Verilog)
  - SNAP-for-OpenCAPI development progress is now about 70%
  - OpenCAPI Simulation Environment (OCSE) development progress is almost 90%
• All open source! -> https://www.github.com/open-power/snap
  - Driven by the OpenPOWER Foundation Accelerator Workgroup
  - Cross-company collaboration and contributions
(Diagram: your current actions - Action 1 in VHDL, Action 2 in C/C++, Action 3, ... - carry over from POWER8 to POWER9 and from CAPI 1.0 to CAPI 2.0/OpenCAPI; the software program uses libsnap APIs and the actions use AXI interfaces)
27. Table of Enablement Deliveries
Item / availability:
• OpenCAPI 3.0 TLx and DLx reference Xilinx FPGA designs (RTL and specifications): today
• Xilinx Vivado project build with memcopy exerciser: today
• Device discovery and configuration specification and RTL: today
• AFU interface specification: today
• Reference card design enablement specification: today
• 25 Gbps PHY signal specification: today
• 25 Gbps PHY mechanical specification: today
• OpenCAPI Simulation Environment (OCSE) tech preview: today
• Memcopy and memory home agent exercisers: today
• AFP exerciser: today
• Reference driver: available today
# Join us! # OpenCAPI Consortium: https://opencapi.org/
30. Backup: Overall Application Porting Preparations
Keys (the right people doing the right things):
• Software profiling
• Software/hardware function partitioning
• Understanding the data location, data size, and data dependencies
• Parallelism estimation
• Considering the I/O bandwidth limitations
• Deciding on the SNAP mode and FPGA card
• API parameters: keep the interface between the main application and the hardware action stable
Planning flow: start -> decide which algorithm(s) you want to accelerate -> validate the physical card choice -> define your API parameters -> decide on the SNAP mode -> proceed to the execution phase
31. Backup: Advantage Summary
Traditional I/O-attached FPGA vs. CAPI-attached FPGA, and the benefit:
• Traditional: a device driver with system calls to move data from application memory to I/O memory and initiate the transfer. CAPI: a "hardware" device driver; the hardware handles address translation and no system call is needed. Benefit: less latency to initiate a data transfer from host memory to the FPGA, and CPU offload.
• Traditional: limited to PCIe Gen3 bandwidth and hardware latency. CAPI: PCIe Gen4 x8 and OpenCAPI provide higher bandwidth and lower latency. Benefit: a better roadmap for future performance enhancements.
• Traditional: separate memory address domains. CAPI: true shared memory with the processor (the processor and FPGA use the same virtual addresses), plus the open-source CAPI-SNAP programming framework (https://github.com/open-power/snap). Benefits: easier programmability; enables pointer chasing and linked lists; no pinning of pages; FPGA access to all of system memory; scatter/gather capability removes the need to prepare data in sequential blocks.
• Traditional: virtualization and multi-process support handled by complex OS support. CAPI: supports multi-process access in hardware, with security to prevent cross-process access. Benefit: added security and multi-process capability.
32. Backup: Comparison of IBM CAPI Implementations & Roadmap
Feature-by-feature, across CAPI 1.0, CAPI 2.0, OpenCAPI 3.0, OpenCAPI 3.1, and OpenCAPI 4.0:
• Processor generation: CAPI 1.0: POWER8; CAPI 2.0: POWER9; OpenCAPI 3.0: POWER9; OpenCAPI 3.1: POWER9 follow-on; OpenCAPI 4.0: POWER9 follow-on
• CAPI logic placement: CAPI 1.0 and 2.0: in the FPGA/ASIC; OpenCAPI 3.0 and 4.0: DL/TL on the host, DLx/TLx on the endpoint FPGA/ASIC; OpenCAPI 3.1: DL/TL on the host, DLx/TLx on the memory buffer
• Interface, lanes per instance, lane bit rate: CAPI 1.0: PCIe Gen3, x8/x16, 8 Gb/s; CAPI 2.0: PCIe Gen4, 2 x (dual x8), 16 Gb/s; OpenCAPI 3.0/3.1/4.0: direct 25G, x8 (x4 fail-down), 25 Gb/s
• Address translation on CPU: CAPI 1.0: no; all others: yes
• Native DMA from endpoint accelerator: CAPI 1.0: no; CAPI 2.0, OpenCAPI 3.0, OpenCAPI 4.0: yes; OpenCAPI 3.1: NA
• Home agent memory on OpenCAPI endpoint with load/store access: CAPI 1.0 and 2.0: no; OpenCAPI 3.0 and 4.0: yes; OpenCAPI 3.1: NA
• Native atomic ops to host processor memory from accelerator: CAPI 1.0: no; CAPI 2.0, OpenCAPI 3.0, OpenCAPI 4.0: yes; OpenCAPI 3.1: NA
• Host memory caching function on accelerator: CAPI 1.0 and 2.0: real-address cache in the PSL; OpenCAPI 3.0: no; OpenCAPI 3.1: NA; OpenCAPI 4.0: effective-address cache in the accelerator
Note: OpenCAPI removes the PCIe layers to reduce latency significantly.