OpenCAPI Overview
Open Coherent Accelerator Processor Interface
Haman Yu / Hank Chang
IBM OpenPOWER Technical Enablement
OpenCAPI Topics
• Industry Background
• Where/How OpenCAPI Technology is Used
• Technology Overview and Advantages
• Heterogeneous Computing
• SNAP Framework for CAPI/OpenCAPI
Industry Background that Defined OpenCAPI
• Growing computational demand due to emerging workloads (e.g., AI, cognitive, etc.)
• Moore's Law no longer delivering gains through traditional silicon scaling
• Driving increased dependence on hardware acceleration for performance
  - Hyperscale datacenters and HPC need much higher network bandwidth
  - 100 Gb/s → 200 Gb/s → 400 Gb/s links are emerging
  - Deep learning and HPC require more bandwidth between accelerators and memory
  - Emerging memory/storage technologies are driving the need for high bandwidth with low latency
• Hardware accelerators are defining the attributes of a high-performance bus
  - Growing demand for network performance and network offload
  - Introduction of device coherency requirements (IBM's introduction in 2013)
  - Emergence of complex storage and memory solutions
  - Various form factors, with no single one able to address everything (e.g., GPUs, FPGAs, ASICs, etc.)
…all relevant to modern data centers
Use Cases - A True Heterogeneous Architecture Built Upon OpenCAPI 3.0 and OpenCAPI 3.1

POWER9 IO Leading the Industry: PCIe Gen4, CAPI 2.0, NVLink 2.0, OpenCAPI 3.0
• 8 and 16 Gbps PHY - protocols supported: PCIe Gen3 x16 and PCIe Gen4 x8; CAPI 2.0 on PCIe Gen4
• 25 Gbps PHY - protocols supported: OpenCAPI 3.0; NVLink 2.0
• A single P9 silicon die, offered in various packages (scale-out, scale-up)
Acceleration Paradigms with Great Performance
(In each paradigm, the processor chip connects over OpenCAPI to an accelerator's DLx/TLx, exchanging data directly.)

Egress Transform - Examples: encryption, compression, erasure coding prior to delivering data to the network or storage
Bi-Directional Transform - Examples: NoSQL such as Neo4j with graph node traversals, etc.
Memory Transform (example: basic work offload) - Examples: machine or deep learning such as natural language processing, sentiment analysis, or other actionable intelligence using OpenCAPI-attached memory
Ingress Transform - Examples: video analytics, network security, deep packet inspection, data-plane acceleration, video encoding (H.265), high-frequency trading, etc.
Needle-in-a-Haystack Engine - Examples: database searches, joins, intersections, merges; only the "needles" from a large haystack of data are sent to the processor

OpenCAPI is ideal for acceleration due to its bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture.
Comparison of Memory Paradigms (built on the OpenCAPI 3.1 architecture)

Main Memory (example: basic DDR attach) - processor chip → DLx/TLx → DDR4/5. An ultra-low-latency ASIC buffer chip adds only ~5 ns on top of a native DDR direct connect.
Emerging Storage Class Memory - processor chip → DLx/TLx → SCM. Storage-class memories have the potential to be the next disruptive technology; examples include ReRAM, MRAM, and Z-NAND, all racing to become the de facto standard.
Tiered Memory - processor chip → DLx/TLx → DDR4/5 alongside DLx/TLx → SCM. Storage-class memory tiered with traditional DDR memory, all built upon the OpenCAPI 3.1 and 3.0 architecture, while keeping the ability to use load/store semantics.

• Common physical interface between non-memory and memory devices
• The OpenCAPI protocol was architected to minimize latency; excellent for classic DRAM memory
• Extreme bandwidth beyond the classical DDR memory interface
• An agnostic interface will handle evolving memory technologies in the future (e.g., compute-in-memory)
• Ability to insert a memory buffer that decouples raw memory from the host interface to optimize power, cost, and performance
CAPI and OpenCAPI Performance

                 CAPI 1.0             CAPI 2.0              OpenCAPI 3.0
                 PCIe Gen3 x8         PCIe Gen4 x8          25 Gb/s x8
                 Measured BW @8Gb/s   Measured BW @16Gb/s   Measured BW @25Gb/s
128B DMA Read    3.81 GB/s            12.57 GB/s            22.1 GB/s
128B DMA Write   4.16 GB/s            11.85 GB/s            21.6 GB/s
256B DMA Read    N/A                  13.94 GB/s            22.1 GB/s
256B DMA Write   N/A                  14.04 GB/s            22.0 GB/s

CAPI 1.0 debuted on POWER8 (introduced in 2013); CAPI 2.0 is the second generation, on POWER9; OpenCAPI 3.0, also on POWER9, is an open architecture designed from a clean slate focused on bandwidth and latency. Measurements used POWER8 (CAPI 1.0) and POWER9 (CAPI 2.0 and OpenCAPI 3.0) hosts with Xilinx KU60/VU3P FPGAs.
Latency Ping-Pong Test
• Simple workload created to simulate communication between the system and an attached FPGA
• Bus traffic recorded with a protocol analyzer and PowerBus traces
• Response times and statistics calculated

OpenCAPI link (host: TL, DL, PHY; FPGA: TLx, DLx, PHYx)
Host code:
1. Copy 512B from cache to the FPGA
2. Poll on the incoming 128B cache injection
3. Reset the poll location
4. Repeat
FPGA code:
1. Poll on 512B received from the host
2. Reset the poll location
3. DMA write 128B for cache injection
4. Repeat

PCIe link (host: PCIe stack; FPGA: Altera PCIe HIP*) runs the same host-code and FPGA-code loops.
* HIP refers to hardened IP

A host-side C sketch of this loop follows.
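The sketch below is a minimal, illustrative rendering of the host side of the ping-pong loop. It assumes `fpga_inbox` and `poll_loc` are ordinary application buffers that the AFU can read and write coherently by virtual address (both names are invented here, not part of any slide or API); timing with `clock_gettime` stands in for the protocol-analyzer measurement.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERATIONS 10000

/* Hypothetical coherent buffers: in a real OpenCAPI setup these are
 * plain application memory that the AFU accesses by virtual address. */
extern volatile uint8_t  fpga_inbox[512]; /* host writes 512B here */
extern volatile uint64_t poll_loc;        /* AFU injects 128B; we poll
                                             the first 8 bytes of it */

void pingpong(const uint8_t payload[512])
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < ITERATIONS; i++) {
        /* 1. Copy 512B from cache to the FPGA-visible buffer */
        memcpy((void *)fpga_inbox, payload, 512);
        /* 2. Poll on the incoming 128B cache injection */
        while (poll_loc == 0)
            ;                /* spin until the AFU's DMA write lands */
        /* 3. Reset the poll location */
        poll_loc = 0;
        /* 4. Repeat */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg round trip: %.0f ns\n", ns / ITERATIONS);
}
```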
Latency Test Results
Total round-trip latency across the four tested configurations: 378 ns, est. <555 ns, 737 ns, and 776 ns, with the OpenCAPI link achieving the lowest latency and the PCIe-attached configurations the highest.

OpenCAPI Enabled FPGA Cards
Mellanox Innova-2 Accelerator Card and Alpha Data 9V3 Accelerator Card; a typical eye diagram at 25 Gb/s using these cards is shown on the original slide.
OpenCAPI Topics
• Industry Background
• Where/How OpenCAPI Technology is Used
• Technology Overview and Advantages
• Heterogeneous Computing
• SNAP Framework for CAPI/OpenCAPI
OpenCAPI: Heterogeneous Computing
• Why can OpenCAPI make specific workloads run faster?
  → FPGAs: various high-bandwidth I/O options, great at deep parallel and pipelined designs
• How can OpenCAPI make the software/application run faster?
  → Shared coherent memory: virtual addressing with a low-latency, low-overhead design

From a single processor (one CPU), to distributed computing (many CPUs), to heterogeneous computing (CPU + GPU + ASIC + FPGA).
FPGAs: What They Are Good At
• FPGA: Field Programmable Gate Array
• Configurable I/O and high-speed serial links
• Integrated hard IP (multiply/add, SRAM, PLL, PCIe, Ethernet, DRAM controller, etc.)
• Custom logic, complex special instructions
• Bit/matrix manipulation, image processing, graphs, neural networks, etc.
• Great at deep parallel and pipelined designs for workloads: arrays of processing engines combine parallelism, pipelining, and instruction complexity (e.g., a fused hash + add + RAM-lookup stage); a hedged HLS sketch follows.
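To make "deep parallel and pipelined designs" concrete, here is a hedged Vivado HLS-style C sketch. The function, constants, and hash stage are invented for illustration; only the `#pragma HLS PIPELINE`/`UNROLL` directives are standard Vivado HLS. UNROLL replicates the inner loop into parallel processing engines, and PIPELINE lets a new batch of inputs enter the outer loop each cycle, which is the parallelism-times-pipelining-times-instruction-complexity product the slide's diagram alludes to.

```c
#include <stdint.h>

#define LANES 8

/* Illustrative "complex instruction": one hash step that would be
 * custom logic on the FPGA rather than a sequence of CPU instructions. */
static inline uint32_t hash_step(uint32_t x)
{
    return (x * 2654435761u) ^ (x >> 13);
}

/* Process n words: LANES hash engines evaluated in parallel (UNROLL),
 * with a new batch entering the loop every clock cycle (PIPELINE). */
uint32_t hash_accumulate(const uint32_t *in, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i + LANES <= n; i += LANES) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < LANES; j++) {
#pragma HLS UNROLL
            acc ^= hash_step(in[i + j]);
        }
    }
    return acc;
}
```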
FPGAs: Different Types of High-Bandwidth I/O Cards
• Alpha Data 9V3 (networking)
• Mellanox Innova 2 (networking, ConnectX-5)
• Nallatech 250S+ (storage / NVMe SSDs)
OpenCAPI: Key Attributes for Acceleration

Any OpenCAPI-enabled processor (TL/DL) connects over a 25 Gb I/O link to an accelerated OpenCAPI device (TLx/DLx) hosting the accelerated function, with the application and caches on the host side.

1. Architecture-agnostic bus - applicable to any system/microprocessor architecture
2. Optimized for high bandwidth and low latency
   - 25 Gb/s links (with SlimSAS connector)
   - Removes PCIe layering and uses the new, thinner TL/DL - TLx/DLx protocol instead (latency optimized)
   - High-performance, industry-standard interface design with zero "overhead"
3. Coherency and virtual addressing
   - Attached devices operate natively within the application's user space and coherently with the host microprocessor
   - Enables low overhead with no kernel, hypervisor, or firmware involvement
   - Shared coherent data structures and pointers (keeping the data/memory close to the processor/FPGA)
   - It is all traditional thread-level programming with CPU-coherent device memory
4. Supports a wide range of use cases
   - Architected for both classic memory and emerging storage class memory (advanced SCM solutions)
   - Storage, compute, and network devices; ASIC/FPGA/FFSA; FPGA, SoC, and GPU accelerators
   - Load/store or block access; standard system memory, device memory, and buffered system memory via OpenCAPI memory buffers
OpenCAPI: Virtual Addressing and Benefits
• An OpenCAPI device operates in the virtual address spaces of the applications that it supports
  - Eliminates kernel and device-driver software overhead
  - Allows the device to operate on application memory without kernel-level data copies or pinned pages
  - Simplifies the programming effort to integrate accelerators into applications (SNAP)
  - Improves accelerator performance
• The virtual-to-physical address translation occurs in the host CPU (no PSL logic needed)
  - Reduces design complexity of OpenCAPI-attached devices
  - Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
  - Security: since the OpenCAPI device never has access to a physical address, a defective or malicious device cannot reach memory locations belonging to the kernel or to other applications that it is not authorized to access
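To make the "no kernel overhead" point concrete, here is a minimal hedged sketch of an application attaching an OpenCAPI AFU to its own virtual address space through the reference libocxl user library mentioned later in this deck. Function names follow the libocxl API, but the exact signatures, flags, and the AFU name "IBM,MEMCPY" are illustrative and may differ by version.

```c
#include <stdio.h>
#include <stdlib.h>
#include <libocxl.h>

int main(void)
{
    ocxl_afu_h afu;

    /* Open the AFU by name (the device name here is illustrative). */
    if (ocxl_afu_open("IBM,MEMCPY", &afu) != OCXL_OK) {
        fprintf(stderr, "cannot open AFU\n");
        return 1;
    }

    /* Attach the AFU to this process's virtual address space. From
     * here on the device can dereference the same virtual addresses
     * the application uses: no copies, no pinned pages. */
    if (ocxl_afu_attach(afu, 0 /* flags */) != OCXL_OK) {
        fprintf(stderr, "cannot attach AFU\n");
        return 1;
    }

    /* Ordinary heap memory is now directly visible to the AFU. */
    char *buf = malloc(4096);
    snprintf(buf, 4096, "hello accelerator");
    /* ... hand the virtual address of buf to the AFU via its MMIO
     *     registers (not shown) ... */

    free(buf);
    ocxl_afu_close(afu);
    return 0;
}
```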
OpenCAPI: Virtual Addressing and Benefits (deeper) - Shared Coherent Data Structures and Pointers
No kernel or device driver is involved (the data/memory stays close to the processor/FPGA), in contrast to the typical I/O model with a device driver.
AFU: Attached Functional Unit
OpenCAPI: Protocol Stack (much thinner than traditional PCIe)

• The OpenCAPI Transaction Layer specifies the control and response packets between a host and an endpoint OpenCAPI device: TL and TLx.
• On the host side, the Transaction Layer converts:
  - Host-specific protocol requests into transaction-layer-defined commands
  - TLx commands into host-specific protocol requests
  - Responses
• On the endpoint OpenCAPI device side, the Transaction Layer converts:
  - AFU protocol requests into transaction-layer commands
  - TL commands into AFU protocol requests
  - Responses
• The OpenCAPI Data Link Layer supports a 25 Gb/s serial data rate per lane, connecting a processor to an FPGA or an ASIC that contains an endpoint accelerator or device: DL and DLx.
  - The basic configuration runs 8 lanes at 25.78125 Gb/s each (about 206 Gb/s raw), for roughly a 25 GB/s data rate.

The stack on the host processor runs host bus protocol layer → TL → TL frame/parser → DL → PHY; the OpenCAPI device mirrors it with PHYx → DLx → TLx frame/parser → TLx → AFU protocol layer → AFU, exchanging OpenCAPI packets over the serial link.

Note: TL/DL/PHY (host side) → IBM P9 hardware and firmware are both ready now; TLx/DLx/PHYx (device side) → I/O vendors also have the reference design ready.

The full TL/DL specification can be obtained by going to opencapi.org and registering under the Technical → Specifications pull-down menu.
OpenCAPI: OCSE (OpenCAPI Simulation Environment)

Enablement stack: the application runs on the host (core processor, OS, coherent memory) and talks through the reference user library (libocxl) and reference kernel driver (ocxl) over TL/DL and the 25G cable to the FPGA's DLx/TLx and AFU, which has its own coherent memory.

FPGA enablement includes:
• Customer application and accelerator
• Operating system enablement (little-endian Linux)
• Reference kernel driver (ocxl)
• Reference user library (libocxl)
• Hardware and reference designs to enable coherent acceleration

• OCSE models the software/hardware interface region (outlined in red on the original slide)
• OCSE enables AFU and application co-simulation, but only when the reference libocxl and reference TLx/DLx are used
• OCSE dependencies: a fixed reference TLx/AFU interface and a fixed reference libocxl user API
• Will be contributed to the OpenCAPI Consortium
• Development progress: 90%
OpenCAPI: Two Factors of Low Latency

1. No kernel or device-driver involvement for mapping memory addresses, and no need to move data back and forth between user space, the kernel, and the device → faster memory access and easier programming.
2. Thinner protocol layers compared with PCIe → a faster, more efficient protocol (just TL/DL frame-and-parser layers on each side of the OpenCAPI link instead of the full PCIe stack).

Typical I/O model flow with a device driver:
DD call (300 instructions) → copy or pin source data (10,000 instructions) → MMIO notify accelerator (3,000 instructions) → acceleration → poll/interrupt completion (1,000 instructions) → copy or unpin result data (1,000 instructions) → return from DD, completion.
In total roughly 15,000 instructions and ~13 µs (7.9 µs + 4.9 µs) spent on data preparation alone.

Flow with a coherent model (CAPI):
Shared-memory notify (400 instructions, 0.3 µs) → acceleration → shared-memory completion (100 instructions, 0.06 µs).
In total about 500 instructions and 0.36 µs. A sketch of this pattern follows.
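The ~500-instruction coherent flow boils down to two cache-line writes. The hedged C11 sketch below shows the pattern from the host side: the work descriptor lives in ordinary shared memory, one atomic store notifies the accelerator, and one atomic load observes completion. The `work_element` layout and the `doorbell`/`done` fields are invented for illustration, not part of any CAPI specification.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical work-queue element shared coherently with the AFU. */
struct work_element {
    void *src;                 /* virtual address of input               */
    void *dst;                 /* virtual address of output              */
    uint64_t len;
    _Atomic uint64_t doorbell; /* host sets to 1: "shared mem. notify"    */
    _Atomic uint64_t done;     /* AFU sets to 1: "shared mem. completion" */
};

void offload(struct work_element *we, void *src, void *dst, uint64_t len)
{
    we->src = src;
    we->dst = dst;
    we->len = len;

    /* Notify: one coherent store visible to the AFU -- no system call. */
    atomic_store_explicit(&we->doorbell, 1, memory_order_release);

    /* Completion: poll one coherent location the AFU writes back. */
    while (atomic_load_explicit(&we->done, memory_order_acquire) == 0)
        ;   /* ~hundreds of instructions total vs. ~15,000 via a driver */

    we->done = 0;   /* re-arm for the next job */
}
```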
OpenCAPI Topics
• Industry Background
• Where/How OpenCAPI Technology is Used
• Technology Overview and Advantages
• Heterogeneous Computing
• SNAP Framework for CAPI/OpenCAPI
SNAP Framework Concept for CAPI/OpenCAPI
A storage, networking, and analytics programming framework: actions X/Y/Z plug into SNAP, which Vivado HLS compiles onto the FPGA behind CAPI/OpenCAPI.

CAPI/OpenCAPI - the FPGA becomes a peer of the CPU → an action directly accesses host memory
SNAP - manages server threads and actions, and manages access to I/Os (AXI to memory/network, …) → an action easily accesses resources
FPGA - gives on-demand compute capabilities and direct I/O access (AXI to storage/network, …) → an action directly accesses external resources
Vivado HLS - compiles actions written in C/C++ and optimizes the code for performance → develop action code efficiently

CAPI/OpenCAPI + SNAP + FPGA + Vivado HLS = the best way to offload/accelerate C/C++ code with:
- Minimum code changes
- Quick porting
- Better performance than the CPU
CAPI Development without SNAP

Application on the host, acceleration on the FPGA: each software process (A, B, C) has a slave context and talks through libcxl/libocxl and the cxl/ocxl kernel driver to the hardware logic on the FPGA (HDK: PSL, CAPI).

• Huge development effort
• Performance focused, with full cache-line control
• Programming based on libcxl/libocxl plus VHDL and Verilog code; a minimal libcxl sketch follows
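For reference, the "without SNAP" software side is raw libcxl programming. A minimal hedged sketch is below; the calls follow the published libcxl API from the open-power libcxl library, but the AFU device path and the empty WED are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>
#include <libcxl.h>

int main(void)
{
    /* Open the CAPI accelerator device (the path is illustrative). */
    struct cxl_afu_h *afu = cxl_afu_open_dev("/dev/cxl/afu0.0d");
    if (!afu) {
        perror("cxl_afu_open_dev");
        return 1;
    }

    /* The Work Element Descriptor (WED) is typically a pointer to a
     * job structure in application memory; 0 is a placeholder here. */
    uint64_t wed = 0;

    /* Start the AFU in this process's context. */
    if (cxl_afu_attach(afu, wed)) {
        perror("cxl_afu_attach");
        return 1;
    }

    /* Map the AFU's MMIO problem-state area for control registers. */
    if (cxl_mmio_map(afu, CXL_MMIO_BIG_ENDIAN)) {
        perror("cxl_mmio_map");
        return 1;
    }

    /* ... drive the AFU via cxl_mmio_write64()/cxl_mmio_read64() ... */

    cxl_mmio_unmap(afu);
    cxl_afu_free(afu);
    return 0;
}
```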
SNAP: Focus on the Additional Acceleration Values

Application on the host: each software process (A, B, C) has a slave context, the SNAP library, and a job queue, and talks through libcxl/libocxl and the cxl/ocxl kernel driver to the FPGA (HDK: PSL, CAPI).

Acceleration on the FPGA: a PSL/AXI bridge exposes host DMA, MMIO control, and a job manager with its job queue; hardware actions (Action 1 in VHDL, Action 2 in C/C++, Action 3 in Go, …) connect over AXI to on-card DRAM, on-card NVMe, and (TBD) network interfaces, with AXI-Lite for control.

• Quick and easy development
• Use a high-level synthesis tool to compile C/C++ to RTL, or directly use RTL
• Programming based on SNAP library function calls and the AXI interface; a host-side sketch follows
• AXI is an industry standard for on-chip interconnection (https://www.arm.com/products/system-ip/amba-specifications)
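With SNAP, the host side of an offload shrinks to a handful of library calls against a hardware action behind an AXI interface. The hedged sketch below is modeled on the libsnap API from the open-power/snap repository; the action type ID, card path, flag, timeout values, and the `my_params` job payload are all illustrative, and the exact libsnap signatures may differ by release.

```c
#include <stdint.h>
#include <stdio.h>
#include <libsnap.h>

#define MY_ACTION_TYPE 0x10141000   /* illustrative action type ID */

int main(void)
{
    struct snap_job job;
    struct my_params {              /* illustrative payload handed to
                                       the hardware action */
        uint64_t src, dst, len;
    } params = { 0, 0, 4096 };

    /* Allocate the card (the device path is illustrative). */
    struct snap_card *card = snap_card_alloc_dev("/dev/cxl/afu0.0s",
                                                 SNAP_VENDOR_ID_IBM,
                                                 SNAP_DEVICE_ID_SNAP);
    if (!card)
        return 1;

    /* Attach our hardware action, waiting up to 60s for it. */
    struct snap_action *action =
        snap_attach_action(card, MY_ACTION_TYPE, SNAP_ACTION_DONE_IRQ, 60);
    if (!action)
        return 1;

    /* Describe the job and run it synchronously on the FPGA action. */
    snap_job_set(&job, &params, sizeof(params), NULL, 0);
    if (snap_action_sync_execute_job(action, &job, 60))
        fprintf(stderr, "action failed\n");

    snap_detach_action(action);
    snap_card_free(card);
    return 0;
}
```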
Summary: SNAP Framework for POWER9

• SNAP will be supported on POWER9
  ✓ As an abstraction layer on CAPI, SNAP actions will be portable to CAPI 2.0 and OpenCAPI
  ✓ Minimum code changes and quick porting
    - SNAP hides the differences (standard AXI interfaces to the I/Os: DRAM/NVMe/Ethernet, etc.)
    - Supports higher-level languages (Vivado HLS helps you convert C/C++ to VHDL/Verilog)
    - SNAP for OpenCAPI development progress is now about 70%
    - OpenCAPI Simulation Environment (OCSE) development progress is almost 90%
• All open source! → https://www.github.com/open-power/snap
  ✓ Driven by the OpenPOWER Foundation Accelerator Workgroup
  ✓ Cross-company collaboration and contributions

Migration path: POWER8 → POWER9, CAPI 1.0 → CAPI 2.0/OpenCAPI. Your current actions (Action 1 in VHDL, Action 2 in C/C++, Action 3, …) keep using the same libsnap APIs and AXI interfaces.
Table of Enablement Deliveries

Item                                                          Availability
OpenCAPI 3.0 TLx and DLx reference Xilinx FPGA designs
  (RTL and specifications)                                    Today
Xilinx Vivado project build with Memcopy exerciser            Today
Device discovery and configuration specification and RTL      Today
AFU interface specification                                   Today
Reference card design enablement specification                Today
25 Gbps PHY signal specification                              Today
25 Gbps PHY mechanical specification                          Today
OpenCAPI Simulation Environment (OCSE) tech preview           Today
Memcopy and memory home agent exercisers                      Today
AFP exerciser                                                 Today
Reference driver                                              Available today
Join us! OpenCAPI Consortium: https://opencapi.org/
Any questions?
Backup
Backup: Overall Application Porting Preparations

Keys (the right people doing the right things):
• Software profiling
• Software/hardware function partitioning
• Understand the data location, data size, and data dependencies
• Parallelism estimation
• Consider I/O bandwidth limitations
• Decide on the SNAP mode and FPGA card
• API parameters: keep the interface between the main application and the hardware action lean

Planning flow: Start → decide what algorithm(s) you want to accelerate → validate the physical card choice → define your API parameters → decide on the SNAP mode → proceed to the execution phase.
Backup: Advantage Summary

Data movement
  Traditional IO-attached FPGA: device driver with system calls to move data from application memory to IO memory and initiate the data transfer.
  CAPI-attached FPGA: a "hardware" device driver; the hardware handles address translation, so no system call is needed.
  Benefit: less latency to initiate data transfer from host memory to the FPGA; offloads the CPU.

Bandwidth and latency
  Traditional: limited to PCIe Gen3 bandwidth and hardware latency.
  CAPI: PCIe Gen4 x8 and OpenCAPI provide higher bandwidth and lower latency.
  Benefit: better roadmap for future performance enhancements.

Memory model
  Traditional: separate memory address domains.
  CAPI: true shared memory with the processor; the processor and FPGA use the same virtual addresses. Programming framework: CAPI-SNAP, open source at https://github.com/open-power/snap.
  Benefit: easier programmability; enables pointer chasing and linked lists; no pinning of pages; FPGA access to all of system memory; scatter/gather capability removes the need to prepare data in sequential blocks.

Virtualization
  Traditional: virtualization and multi-process support handled by complex OS support.
  CAPI: supports multi-process access in hardware, with security to prevent cross-process access.
  Benefit: added security and multi-process capability.
Backup: Comparison of IBM CAPI Implementations & Roadmap

Processor generation:
  CAPI 1.0: POWER8 | CAPI 2.0: POWER9 | OpenCAPI 3.0: POWER9 | OpenCAPI 3.1: POWER9 follow-on | OpenCAPI 4.0: POWER9 follow-on

CAPI logic placement:
  CAPI 1.0 and 2.0: in the FPGA/ASIC
  OpenCAPI 3.0 and 4.0: NA on the endpoint (no PSL); DL/TL on the host, DLx/TLx on the endpoint FPGA/ASIC
  OpenCAPI 3.1: NA on the endpoint (no PSL); DL/TL on the host, DLx/TLx on the memory buffer

Interface, lanes per instance, lane bit rate:
  CAPI 1.0: PCIe Gen3, x8/x16, 8 Gb/s
  CAPI 2.0: PCIe Gen4, 2 x (dual x8), 16 Gb/s
  OpenCAPI 3.0/3.1/4.0: Direct 25G, x8 (x4 fail-down), 25 Gb/s

Address translation on CPU:
  CAPI 1.0: No | CAPI 2.0: Yes | OpenCAPI 3.0: Yes | OpenCAPI 3.1: Yes | OpenCAPI 4.0: Yes

Native DMA from endpoint accelerator:
  CAPI 1.0: No | CAPI 2.0: Yes | OpenCAPI 3.0: Yes | OpenCAPI 3.1: NA | OpenCAPI 4.0: Yes

Home agent memory on OpenCAPI endpoint with load/store access:
  CAPI 1.0: No | CAPI 2.0: No | OpenCAPI 3.0: Yes | OpenCAPI 3.1: NA | OpenCAPI 4.0: Yes

Native atomic ops to host processor memory from accelerator:
  CAPI 1.0: No | CAPI 2.0: Yes | OpenCAPI 3.0: Yes | OpenCAPI 3.1: NA | OpenCAPI 4.0: Yes

Host memory caching function on accelerator:
  CAPI 1.0: real-address cache in PSL | CAPI 2.0: real-address cache in PSL | OpenCAPI 3.0: No | OpenCAPI 3.1: NA | OpenCAPI 4.0: effective-address cache in the accelerator

OpenCAPI removes the PCIe layers to reduce latency significantly.