IBM Confidential
Heterogeneous Computing
The Future of Systems
Anand Haridass
Senior Technical Staff Member
IBM Cognitive Systems
NITK (KREC) – Batch of ‘95 (E&C)
IBM Academy of Technology
NITK-IBM Computer Systems Research Group (NCSRG)
Seminar Sep/18/2017
2
Agenda
System Overview
Technology Trends – End of Dennard Scaling
Vertical Integration - OpenPOWER
“Feeding the Engine” – Memory / Storage
Need for High Performance Bus – OpenCAPI
GPU Attach - NVLINK
Accelerator Examples
3
Von Neumann Architecture
• First published by John von Neumann in 1945.
• Design consists of a Control Unit, Arithmetic & Logic Unit (ALU), Memory Unit, Registers & Inputs/Outputs.
• Stored-program concept: instructions and data are stored in the same memory.
• Most servers and PCs produced today use this design.
4
Typical 2 Socket Systems [2017]
[Diagram: two CPUs connected socket-to-socket, each with local memory, an accelerator, and I/O / storage / network attach]
5
Processor Technology Trends
Moore’s Law
Alive & Kicking
Moore’s Law (1965)
“The number of transistors in a dense integrated circuit doubles approximately every two years.”
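As a rough illustration of that doubling cadence (a sketch; the starting point and horizon are illustrative, not figures from the talk):

```python
# Moore's Law as a rule of thumb: transistor count doubles every ~2 years.
def transistor_growth(years: float, doubling_period: float = 2.0) -> float:
    """Relative growth in transistor count after `years`."""
    return 2 ** (years / doubling_period)

# Over one decade the rule predicts a 32x increase.
print(transistor_growth(10))  # 32.0
```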
6
Dennard Scaling Limits
Dennard scaling: as transistors get smaller, their power density stays constant, so power use stays in proportion to area; both voltage and current scale (downward) with length.
Power requirements are proportional to area (both voltage and current being proportional to length). Transistor dimensions are scaled by 30% (0.7x) every technology generation, reducing their area by 50%. This reduces delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). To keep the electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%.
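The arithmetic above can be checked directly; a minimal sketch, modeling dynamic power as C·V²·f with capacitance scaling with length:

```python
# Classic Dennard scaling: linear dimensions shrink by 0.7x per generation.
k = 0.7  # dimension scale factor

area = k * k              # ~0.49: area halves
freq = 1 / k              # ~1.43: delay drops 30%, frequency up ~40%
voltage = k               # constant electric field
capacitance = k           # C scales with length

# Dynamic power P = C * V^2 * f
power = capacitance * voltage**2 * freq   # ~0.49: power halves
power_density = power / area              # ~1.0: stays constant

print(round(area, 2), round(freq, 2), round(power, 2), round(power_density, 2))
```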
• Voltage scaling for high-performance designs is limited
• By leakage issues: can’t reduce threshold voltages
• Need steeper sub-threshold slopes
• Limited by variability, especially VT variability
• Need to minimize random dopant fluctuations
• Limited by gate oxide thickness
• Some relief from high-K materials
• Limited voltage scaling + decreasing feature sizes → increasing electric fields
• New device structures needed (FinFETs)
• Reliability challenges (devices and wires)
7
CMOS Power - Performance Scaling
Where this curve is flat, can only improve chip frequency by:
a) Pushing core/chip to higher power density (air cooling limits)
b) Design power efficiency improvements (low-hanging fruit all gone)
[Chart: relative performance metric at constant power density vs. feature pitch (0.01–10 microns); performance scaled well at larger pitches (“when scaling was good…”) and flattens at small feature sizes]
8
Processor Technology Trends
‘Affordable’ air-cooled limit: ~120–190 W
Dennard Scaling
limiting from 2002-04
9
Processor Technology Trends
Processor frequency peaks at ~6 GHz and settles between 2–4 GHz
10
Processor Technology Trends
Strongly
Correlated
11
Processor Technology Trends
Multi-Cores (& threads)
Parallel Programming to
leverage
12
End customers don’t care about frequency, single-thread (ST) performance, or other ‘processor’ metrics; cost/performance is the metric
Processors
Semiconductor Technology
Industry trends, Challenges & Opportunities
Microprocessors alone no longer drive sufficient Cost/Performance improvements
13
System stack innovations are required to drive
Cost/Performance
14
OpenPOWER Foundation
15
Materials Innovations - Increased Complexity & Cost
GlobalFoundries projects that a computer-chip manufacturing plant in NY would cost $14.7 billion to build
16
“Data access” performance (bandwidth & latency) and cost (power) are still very challenging
Some techniques to hide latency/bandwidth/power:
Caches
Locality optimization
Out-of-order execution
Multithreading
Pre-fetching
“Fat” pipes / memory buffers
[Chart: memory-to-storage latency spectrum (ns); storage class memory sits between them at 100–1000 ns. Source: SNIA]
“Feeding the Engine” Challenge
17
Access latency in µP cycles (@ 4 GHz), on a power-of-two scale from ~2^1 to ~2^23 cycles (source: H. Hunter, IBM):
“Memory calls” (load/store): L1/L2 (SRAM) → L3/L4 → DRAM
“I/O calls” (read/write): Flash → HDD
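To relate those cycle counts back to wall-clock time, a small sketch assuming the chart’s 4 GHz clock (the tier-to-cycle mapping is approximate, not read precisely off the chart):

```python
# Convert access latency in processor cycles to nanoseconds at 4 GHz.
CLOCK_HZ = 4e9

def cycles_to_ns(cycles: float) -> float:
    return cycles / CLOCK_HZ * 1e9

# Illustrative tiers on the chart's power-of-two scale.
for tier, cyc in [("L1 (SRAM)", 2**2), ("L3/L4", 2**6),
                  ("DRAM", 2**9), ("Flash", 2**17), ("HDD", 2**23)]:
    print(f"{tier:10s} {cyc:>9d} cycles ~ {cycles_to_ns(cyc):,.1f} ns")
```

For example, 2^9 cycles at 4 GHz is 128 ns, a plausible DRAM access, while 2^23 cycles is over 2 ms, in HDD territory.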
Memory / Storage
Storage Class Memory
NVMe - Non-Volatile Memory Express (PCIe)
• Standardized high-performance interface for PCI Express SSDs. Available today in three form factors: PCIe add-in card, SFF 2.5” and M.2
• PCIe Gen3 (today): x8 ~8 GB/s [x4 ~4 GB/s, x2 ~2 GB/s] vs. SAS 12 Gb/s [1.5 GB/s per port]
• PCIe Gen4 (2018): x8 ~16 GB/s [x4 ~8 GB/s, x2 ~4 GB/s] vs. SAS 24 Gb/s [3 GB/s per port]
• NVMe over Fabrics (low-latency RDMA access): <10 µs including switches
• CAPI-based Flash (today): x16 (16 GB/s), at faster access latencies (more on this later)
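The per-width numbers follow from the per-lane rates; a sketch using the round figures implied by the slide (~1 GB/s per Gen3 lane and ~2 GB/s per Gen4 lane, effective after encoding overhead):

```python
# Approximate effective PCIe bandwidth per direction, in GB/s.
LANE_GBPS = {"gen3": 1.0, "gen4": 2.0}  # round per-lane figures

def pcie_bandwidth(gen: str, lanes: int) -> float:
    return LANE_GBPS[gen] * lanes

print(pcie_bandwidth("gen3", 8))   # ~8 GB/s, vs. SAS 12 Gb/s (~1.5 GB/s/port)
print(pcie_bandwidth("gen4", 8))   # ~16 GB/s
```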
HBM (High Bandwidth Memory)
• 3D-stacked DRAM from AMD/Hynix/Samsung
• HBM2: 256 GB/s, ~4 GB/package (8 DRAM die, TSV-stacked)
• 1024 bits × 2 GT/s
• HBM3: 512 GB/s, ~2020 time frame
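The HBM2 figure follows directly from the interface width and transfer rate; as a check:

```python
# HBM2 peak bandwidth: 1024-bit interface at 2 GT/s.
bits_per_transfer = 1024
transfers_per_sec = 2e9          # 2 GT/s

bytes_per_sec = bits_per_transfer * transfers_per_sec / 8
print(bytes_per_sec / 1e9)       # 256.0 GB/s per package
```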
NVDIMM
• Persistent memory solution on DDR interface
• Combines DRAM, NAND Flash and power source
• Delivers DRAM R/W perf with the persistence & reliability
of NAND
18 Source: SNIA
The Contenders
https://www.snia.org/sites/default/files/NVM/2016/presentations/Panel_1_Combined_NVM_Futures%20Revision.pdf
19
Function offload – greater concurrency & utilization
Power efficiency (performance/watt)
Workloads: encryption/decryption, compression/decompression, encoding/decoding, network controllers, math libraries, DB queries, search
Deep learning (an arms race!) for training & inferencing
Hardware Acceleration
Types of Accelerators
General Purpose GPU / Many Integrated Core (MIC)
Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon
Field Programmable Gate Array (FPGA)
Xilinx, Altera (now Intel)
Purpose-Built / Custom ASICs
Google’s TPU
Intelligent Network Controllers
Cavium ARM-accelerated NIC
Mellanox NIC+FPGA
Microsoft FPGA-only network adapter
Traditionally (“IO”-limited): sequential instructions run on the processor, while parallel compute is offloaded to the accelerator
Penalty for “IO” operations is heavy
20
HPC & Hyper-scale datacenters (Cloud) are driving need for higher network bandwidth
HPC & Deep learning require more bandwidth between accelerators and memory
PCI Express has limitations (coherence / bandwidth / protocol overhead)
Desired Attributes
Low Latency / High Bandwidth / Coherence
Emergence of complex storage & memory solutions (BW & latency & heterogeneity)
Growing demand for network performance (BW & latency)
Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
Open standard for broad industry, architecture agnostic participation / avoid vendor lock-in
Volume pricing advantages & Broad software ecosystem growth and adoption
Vendor specific variants
Intel Omni Path Architecture, Nvidia Nvlink, AMD Hypertransport
Open Standards evolving
Cache Coherent Interconnect for Accelerators (CCIX) www.ccixconsortium.com
Gen-Z genzconsortium.org
Open Coherent Accelerator Processor Interface (OpenCAPI) opencapi.org
Need for High Performance Next Generation Bus/Interconnect
21
Coherent Accelerator Processor Interface (CAPI) - 2014
[Diagram: POWER processor with a CAPP unit connected over PCIe to an FPGA hosting CAPI functions (Function0, Function1, Function2 … Functionn) on top of the IBM-supplied POWER Service Layer]
Virtual Addressing
• Removes the requirement for pinning system memory for PCIe transfers
• Eliminates the copying of data into and out of pinned DMA buffers
• Eliminates the operating-system call overhead to pin memory for DMA
• Accelerator can work with the same addresses that the processors use
• Pointers can be de-referenced the same as in the host application (for example, enables traversing data structures)
Coherent Caching of Data
• Enables an accelerator to cache data structures
• Enables cache-to-cache transfers between accelerator and processor
• Enables the accelerator to participate in “locks” as a normal thread
Elimination of the Device Driver
• Direct communication with the application
• No requirement to call an OS device driver or hypervisor function for mainline processing
Enables Accelerator Features Not Possible with PCIe
• Enables efficient hybrid applications: partially implemented in the accelerator and partially on the host CPU
• Visibility to full system memory
• Simpler programming model for application modules
Coherent Accelerator Processor Proxy (CAPP)
– Proxy for FPGA Accelerator on PowerBus
– Integrated into Processor
– Programmable (Table Driven) Protocol for CAPI
– Shadow Cache Directory for Accelerator
• Up to 1MB Cache Tags (Line based)
• Larger block based Cache
POWER Service Layer (PSL)
– Implemented in FPGA Technology
– Provides Address Translation for Accelerator
• Compatible with POWER Architecture
– Provides Cache for Accelerator
– Facilities for downloading Accelerator Functions
22
How CAPI Works
[Diagram: POWER8 processor and CAPI developer-kit card (over PCIe) sharing the same memory space; the accelerator is a peer to the POWER8 core. The application portion (data set-up, control) runs on the host; the acceleration portion (data- or compute-intensive work, storage or external I/O) runs on the card.]
Coherent Accelerator Processor Interface (CAPI) - 2014
Accelerator is a Full Peer to Processor
Accelerator Function(s) use an unmodified
Effective address
Full access to Real address space
Utilize Processor’s Page Tables Directly
Page Faults handled by System Software
Multiple Functions can exist in a single
Accelerator
23
Memory Subsystem
Virt Addr
IO Attached Accelerator
[Diagram: six POWER8 cores and memory subsystem; an application calls a device driver (DD), which memory-maps an FPGA behind PCIe. Variables, input data, and output data are replicated across application memory, the driver storage area, and the accelerator.]
An application called a device driver to utilize an FPGA Accelerator.
The device driver performed a memory mapping operation.
3 versions of the data (not coherent).
1000s of instructions in the device driver.
24
Memory Subsystem
Virt Addr
CAPI Coherency
[Diagram: six POWER8 cores and memory subsystem; the FPGA attaches through the PSL over PCIe.]
With CAPI, the FPGA shares memory with the cores: a single copy of variables, input data, and output data.
1 coherent version of the data.
No device driver call/instructions.
25
Typical I/O Model Flow:
DD call → copy or pin source data → MMIO notify accelerator → acceleration (application-dependent) → poll / interrupt completion → copy or unpin result data → return from DD completion
Per-step instruction counts as shown on the slide: 300, 10,000, 3,000, 1,000, and 1,000 instructions; 7.9 µs before and 4.9 µs after acceleration.
Total ~13 µs for data prep
Flow with a Coherent Model:
Shared-memory notify accelerator (400 instructions, 0.3 µs) → acceleration (application-dependent, but equal to above) → shared-memory completion (100 instructions, 0.06 µs)
Total 0.36 µs
CAPI vs. I/O Device Driver: Data Prep
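Summing the per-step times gives the headline comparison; a quick check of the slide's totals:

```python
# Data-prep overhead: classic I/O device-driver flow vs. CAPI coherent flow.
io_flow_us = [7.9, 4.9]          # before / after acceleration, ~13 us total
capi_flow_us = [0.3, 0.06]       # shared-memory notify / completion

io_total = sum(io_flow_us)       # 12.8 us
capi_total = sum(capi_flow_us)   # 0.36 us
print(f"speedup on data prep: ~{io_total / capi_total:.0f}x")
```

The acceleration step itself is excluded on both sides (it is application-dependent and equal in both flows), so this is purely the overhead comparison.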
26
IBM Accelerated GZIP Compression
An FPGA-based low-latency GZIP compressor & decompressor with single-thread throughput of ~2 GB/s and a compression ratio significantly better than low-CPU-overhead compressors like snappy.
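For contrast with the FPGA engine, software GZIP/DEFLATE is available in Python’s standard library; a minimal CPU baseline sketch (single-threaded software compression typically runs at tens to hundreds of MB/s, not ~2 GB/s):

```python
import zlib

data = b"heterogeneous computing " * 1000   # compressible sample input

compressed = zlib.compress(data, level=6)   # DEFLATE, as used by gzip
ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes (ratio ~{ratio:.0f}x)")

assert zlib.decompress(compressed) == data  # round-trips losslessly
```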
27
CAPI Attached Flash
28
29
CAPI Acceleration
Acceleration use cases (in each, a processor chip attaches to an accelerator via TLx/DLx):
Egress Transform: encryption, compression, erasure prior to network or storage
Bi-Directional Transform: NoSQL such as Neo4j with graph node traversals, etc.
Memory Transform: machine or deep learning, potentially using OpenCAPI-attached memory; basic work offload
Needle-in-a-Haystack Engine: database searches, joins, intersections, merges against haystack data
Ingress Transform: video analytics, HFT, VPN/IPsec/SSL, Deep Packet Inspection (DPI), Data Plane Accelerator (DPA), video encoding (H.265), etc.
OpenCAPI WINS due to Bandwidth
to/from accelerators, best of breed
latency, and flexibility of an Open
architecture
30
NVLink 1
4 links, 20 GB/s raw bandwidth per link each direction
~160 GB/s total net NVLink bandwidth
NVLink 2
6 links, 25 GB/s raw bandwidth per link each direction
~300 GB/s total net NVLink bandwidth
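The totals follow from links × per-link rate × two directions:

```python
# Aggregate NVLink bandwidth from per-link raw rate in each direction.
def nvlink_total(links: int, gbps_per_link: int) -> int:
    return links * gbps_per_link * 2   # both directions

print(nvlink_total(4, 20))   # NVLink 1: 160 GB/s
print(nvlink_total(6, 25))   # NVLink 2: 300 GB/s
```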
Volta GV100
• 15 TFLOPS FP32
• 16GB HBM2 – 900 GB/s
• 300W TDP
• 50 GFLOPS/W (FP32)
• 12nm process
• 300GB/s NV Link2
• Tensor Core....
Source: Nvidia
NVIDIA GPU
31
“Minsky” S822LC for HPC
• Tight coupling: strong CPU: strong GPU performance
• Equalizing access to memory - for all kinds of programming
• Closer programming to the CPU paradigm
[Diagram: OpenPOWER P8’ design: two POWER8’ sockets, each with 115 GB/s of DDR4 bandwidth and 80 GB/s NVLink to a pair of Tesla P100 GPUs. In contrast, x86 servers attach GPUs over PCIe at 32 GB/s.]
For x86 Servers: PCIe Bottleneck
No NVLink between CPU & GPU
2.7X faster query response time on “Minsky”
87% of the total speedup (2.35x of 2.7x
improvement) is due to the NVLink Interface
from CPU:GPU
• Profiling result based on running Kinetica “filter by geographic area” queries on a data set of 280 million simulated records; 1 simultaneous query stream with 0 think time.
• Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU; Ubuntu 16.04.
• Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 512GB memory 2x 6Gb SSDs, 2-port 10 GbEth, 4x Tesla P100 GPU, Ubuntu 16.04.
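The attribution arithmetic in the claim checks out:

```python
# Of the 2.7x total query speedup on "Minsky", 2.35x is attributed
# to the NVLink CPU:GPU interface.
total_speedup = 2.7
nvlink_speedup = 2.35

share = nvlink_speedup / total_speedup
print(f"{share:.0%} of the speedup comes from NVLink")  # ~87%
```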
32
Custom ASICs
Reducing Flexibility
CPU > GPU > FPGA > ASIC
Increasing Efficiency
CPU < GPU < FPGA < ASIC
Source: William Dally, Nvidia
33
Google TPU 1.0
[Jouppi et al., ISCA 2017]
Relative performance/Watt (TDP) of GPU server (blue) and
TPU server (red) to CPU server, and TPU server to GPU
server (orange).
TPU’ is an improved TPU that uses GDDR5 memory. The green bar shows its ratio to the CPU server, and the lavender bar shows its ratio to the GPU server.
Total includes host server power, but incremental doesn’t. GM
and WM are the geometric and weighted means.
34
Google TPU performance
Stars are for the TPU
Triangles are for the K80
Circles are for Haswell.
[Jouppi et al., ISCA 2017]
35
Microsoft Azure FPGA Usage
[M.Russinovich, MSBuild 2017]
FPGA for SDN Offload FPGA for Bing
36
Hardware Micro-services
A hardware-only self-contained service that can be distributed and
accessed from across the datacenter compute fabric
37
Ease of Consumption
Compiler Optimization
Math libraries optimization
Native Support for CUDA / OpenMP / OpenCL ..
Native support for frameworks, e.g., for deep learning (Torch/TensorFlow/Caffe …)
38
POWER9 (SO) – Premier Accelerator Platform
[Chip diagram: POWER9 cores and on-chip accelerators on an on-chip interconnect; interfaces include DDR4 memory I/O, PCIe Gen4, CAPI, SMP (16 Gb/s), and 25 Gb/s OpenCAPI/NVLink links; 512 KB L2 per SMT8 core + 120 MB L3 NUCA cache]
2-socket SMP: 256 GB/s
OpenCAPI and/or NVLink 2.0: 200–300 GB/s (NVIDIA GPUs, IBM / partner devices)
3 x16 PCIe Gen4: 192 GB/s (PCIe devices)
CAPI 2.0 links: 128 GB/s (uses up to 2 x16 ports; IBM / partner devices)
8 DDR4 memory ports @ 2667 MT/s
Bandwidths shown are bi-directional
39
Newell POWER9 System - 6 GPU / 2 CAPI
40
BACKUP
41
Source: SNIA / Flash Summit
42
When to Use FPGAs
Transistor Efficiency & Extreme Parallelism
Bit-level operations
Variable-precision floating point
Power-Performance Advantage
>2x compared to Multicore (MIC) or GPGPU
Unused LUTs are powered off
Technology Scaling better than CPU/GPU
FPGAs are not frequency or power limited yet
3D has great potential
Dynamic reconfiguration
Flexibility for application tuning at run-time vs.
compile-time
Additional advantages when FPGAs are network
connected ...
allows network as well as compute
specialization
When to Use GPGPUs
Extreme FLOPS & Parallelism
Double-precision floating-point leadership
Hundreds of GPGPU cores
Programming Ease & Software Group Interest
CUDA & extensive libraries
OpenCL
IBM Java (coming soon)
Bandwidth Advantage on Power
Start w/ PCIe Gen3 x16, then move to NVLink
Leverage existing GPGPU eco-system and development base
Lots of existing use cases to build on
Heavy HPC investment in GPGPU
43
CCIX
Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
44
Gen-Z
Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017
Use Cases – A truly heterogeneous architecture built upon OpenCAPI
OpenCAPI 3.0
OpenCAPI 3.1
OpenCAPI specifications are
downloadable from the
website
at www.opencapi.org
- Register
- Download
OpenCAPI Advantages for Memory
Open standard interface enables attaching a wide range of devices
OpenCAPI protocol was architected to minimize latency
Especially advantageous for classic DRAM memory
Extreme bandwidth beyond classical DDR memory interface
Agnostic interface allows extension to evolving memory technologies in the future
(e.g., compute-in-memory)
Ability to handle a memory buffer to decouple raw memory and host interfaces to
optimize power, cost and performance
Common physical interface between non-memory and memory devices
47
OpenCAPI Key Attributes
• Architecture agnostic bus – Applicable with any system/microprocessor architecture
• Coherency - Attached devices operate natively within application’s user space and coherently with host uP
• High performance interface design with no ‘overhead’ and optimized for a high bandwidth and low latency
• Point to point construct optimized within a system
• Allows attached device to fully participate in application without kernel involvement/overhead
• 25Gbit/sec signaling and protocol to enable very low latency interface on CPU and attached device
• Supports a wide range of use cases and access semantics
• Hardware accelerators
• High-performance I/O devices
• Advanced memories and Classic memory
• Various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.)
• Reduced complexity of design implementation
• Wanted to make this easy for the accelerator, memory and system design teams
• Moved complexities of coherence and virtual addressing onto the host microprocessor to simplify
attached devices and facilitate interoperability across multiple CPU architectures
Virtual Addressing and Benefits
An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device driver software overhead
• Allows device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies programming effort to integrate accelerators into applications
• Improves accelerator performance
The Virtual-to-Physical Address Translation occurs in the host CPU
• Reduces design complexity of OpenCAPI-attached devices
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security - Since the OpenCAPI device never has access to a physical address, this eliminates the
possibility of a defective or malicious device accessing memory locations belonging to the kernel or
other applications that it is not authorized to access
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai
 
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016
IBM Data Engine for Hadoop and Spark - POWER System Edition ver1 March 2016
 
Performance beyond moore's law
Performance beyond moore's lawPerformance beyond moore's law
Performance beyond moore's law
 
ISLPED 2015 FreqLeak (Presentation Charts)
ISLPED 2015 FreqLeak (Presentation Charts)ISLPED 2015 FreqLeak (Presentation Charts)
ISLPED 2015 FreqLeak (Presentation Charts)
 
VLSID 2015 FirmLeak (Poster)
VLSID 2015 FirmLeak (Poster)VLSID 2015 FirmLeak (Poster)
VLSID 2015 FirmLeak (Poster)
 
The Cloud & Its Impact on IT
The Cloud & Its Impact on ITThe Cloud & Its Impact on IT
The Cloud & Its Impact on IT
 

Último

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Último (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

• Need to minimize random dopant fluctuations
• Limited by gate oxide thickness
• Some relief from high-K materials
• Limited voltage scaling + decreasing feature sizes lead to increasing electric fields
• New device structures needed (FinFETs)
• Reliability challenges (devices and wires)
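The Dennard scaling arithmetic above can be checked directly: with a 0.7x linear shrink per generation (dimensions and voltage both scaling by the same factor), the area, delay, frequency, energy and power figures all fall out of one constant. A minimal sketch:

```python
# Dennard scaling per generation: every quantity follows from the 0.7x
# linear scale factor k applied to both dimensions and voltage.

k = 0.7                       # linear shrink per technology generation

area = k * k                  # ~0.49 -> ~50% area reduction
delay = k                     # 0.7   -> ~30% delay reduction
freq = 1 / k                  # ~1.43 -> ~40% higher operating frequency
energy = k * k**2             # C*V^2 with C and V each scaling by k -> ~65% less
power = energy * freq         # ~0.49 -> ~50% power reduction at 1.4x frequency
power_density = power / area  # ~1.0  -> constant, the heart of Dennard scaling
```

When voltage can no longer scale with feature size (the leakage and threshold-voltage limits above), `energy` stops shrinking and `power_density` rises instead of staying flat.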
7
CMOS Power - Performance Scaling
[Chart: relative performance metric (at constant power density) vs. feature pitch in microns - "when scaling was good" the curve rose steeply; it is now flat]
Where this curve is flat, chip frequency can only be improved by:
a) Pushing the core/chip to higher power density (air-cooling limits)
b) Design power-efficiency improvements (the low-hanging fruit is all gone)
8
Processor Technology Trends
'Affordable' air-cooled limit: ~120-190W
Dennard scaling began limiting frequency around 2002-04
9
Processor Technology Trends
Processor frequency peaked at ~6GHz and settled between 2-4GHz
11
Processor Technology Trends
Multi-cores (& threads)
Parallel programming is needed to leverage them
12
Semiconductor Technology Industry: Trends, Challenges & Opportunities
End customers don't care about frequency, single-thread performance, or other 'processor' metrics
Cost/performance is the metric
Microprocessors alone no longer drive sufficient cost/performance improvements
13
System stack innovations are required to drive cost/performance
15
Materials Innovations - Increased Complexity & Cost
GlobalFoundries projects that a computer chip manufacturing plant in NY would cost $14.7 billion to build
16
"Feeding the Engine" Challenge
"Data access" performance (bandwidth & latency) and cost (power) remain very challenging
Storage-class memory (100-1000ns) sits between memory and storage [Source: SNIA]
Some techniques to hide latency/bandwidth/power costs:
• Caches
• Locality optimization
• Out-of-order execution
• Multithreading
• Pre-fetching
• "Fat" pipes / memory buffers
17
Memory / Storage
[Chart: access latency in µP cycles (@ 4GHz) on a log scale from 2^1 to 2^23 cycles - L1/L2 (SRAM) at the low end, then L3/L4, DRAM, storage-class memory, Flash, and HDD at the high end; "memory calls" (load/store) on the left, "I/O calls" (read/write) on the right. Source: H. Hunter, IBM]
NVMe - Non-Volatile Memory express (PCIe)
• Standardized high-performance interface for PCI Express SSDs. Available today in three form factors: PCIe add-in card, SFF 2.5" and M.2
• PCIe Gen3 (today): x8 ~8GB/s [x4 ~4GB/s, x2 ~2GB/s] vs SAS 12Gb/s [1.5GB/s per port]
• PCIe Gen4 (2018): x8 ~16GB/s [x4 ~8GB/s, x2 ~4GB/s] vs SAS 24Gb/s [3GB/s per port]
• NVMe over Fabrics (low-latency RDMA access): <10µs including switches
• CAPI-based Flash (today): x16 (16GB/s) at faster access latencies (more on this later)
HBM (High Bandwidth Memory)
• 3D-stacked DRAM from AMD/Hynix/Samsung
• HBM2: 256GB/s, ~4GB per package (8 DRAMs TSV-stacked); 1024 bits x 2GT/s
• HBM3: 512GB/s, ~2020 time frame
NVDIMM
• Persistent-memory solution on the DDR interface
• Combines DRAM, NAND Flash and a power source
• Delivers DRAM read/write performance with the persistence & reliability of NAND
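The cycle counts and link-width figures above can be turned into concrete numbers. A small sketch, assuming the standard PCIe line rates (Gen3 = 8 GT/s per lane with 128b/130b encoding, Gen4 = 16 GT/s) and the slide's 4 GHz clock:

```python
# Convert the latency chart's cycle counts into wall-clock time, and derive
# the "~8GB/s" / "~16GB/s" NVMe figures from per-lane PCIe bandwidth.

CPU_HZ = 4e9  # 4 GHz, as stated on the latency chart

def cycles_to_seconds(cycles, hz=CPU_HZ):
    """Access latency expressed in seconds."""
    return cycles / hz

def pcie_gbps_per_dir(lanes, gt_per_s, encoding=128 / 130):
    """Usable GB/s in one direction for a PCIe link (8 bits per byte)."""
    return lanes * gt_per_s * encoding / 8

hdd_s = cycles_to_seconds(2**23)   # ~2.1 ms: why HDD access is an "I/O call"
l1_s = cycles_to_seconds(2**1)     # 0.5 ns: why L1 access is a "memory call"

gen3_x8 = pcie_gbps_per_dir(8, 8)    # ~7.9 GB/s -> the slide's "~8GB/s"
gen4_x8 = pcie_gbps_per_dir(8, 16)   # ~15.8 GB/s -> "~16GB/s"
```

The six-orders-of-magnitude spread between `l1_s` and `hdd_s` is exactly the gap that storage-class memory targets.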
18
The Contenders
[Source: SNIA, https://www.snia.org/sites/default/files/NVM/2016/presentations/Panel_1_Combined_NVM_Futures%20Revision.pdf]
19
Hardware Acceleration
• Function offload: greater concurrency & utilization
• Power efficiency (performance/watt)
Workloads: encryption/decryption, compression/decompression, encoding/decoding, network controllers, math libraries, DB queries/search, and deep learning (an arms race!) for training & inferencing
Types of Accelerators
• General-purpose GPU / Many Integrated Core (MIC): Nvidia Tesla/Volta, Intel Xeon Phi, AMD Radeon
• Field Programmable Gate Array (FPGA): Xilinx, Altera (now Intel)
• Purpose-built / custom ASICs: Google's TPU
• Intelligent network controllers: Cavium ARM-accelerated NIC, Mellanox NIC+FPGA, Microsoft FPGA-only network adapter
Traditionally ("IO"-limited), sequential instructions run on the processor while parallel compute is offloaded to the accelerator; the penalty for "IO" operations is heavy
20
Need for a High-Performance Next-Generation Bus/Interconnect
• HPC & hyperscale datacenters (cloud) are driving the need for higher network bandwidth
• HPC & deep learning require more bandwidth between accelerators and memory
• PCI Express has limitations (coherence / bandwidth / protocol overhead)
Desired attributes
• Low latency / high bandwidth / coherence
• Support for emerging complex storage & memory solutions (bandwidth, latency & heterogeneity)
• Growing demand for network performance (bandwidth & latency)
• Various form factors (e.g., GPUs, FPGAs, ASICs, etc.)
• Open standard for broad, architecture-agnostic industry participation / avoiding vendor lock-in
• Volume pricing advantages & broad software-ecosystem growth and adoption
Vendor-specific variants: Intel Omni-Path Architecture, Nvidia NVLink, AMD HyperTransport
Open standards evolving:
• Cache Coherent Interconnect for Accelerators (CCIX) - www.ccixconsortium.com
• Gen-Z - genzconsortium.org
• Open Coherent Accelerator Processor Interface (OpenCAPI) - opencapi.org
21
Coherent Accelerator Processor Interface (CAPI) - 2014
[Diagram: a Power processor with the CAPP unit, attached over PCIe to an FPGA carrying the IBM-supplied POWER Service Layer and accelerator functions 0..n]
Virtual Addressing
• Removes the requirement for pinning system memory for PCIe transfers
• Eliminates the copying of data into and out of pinned DMA buffers
• Eliminates the operating-system call overhead to pin memory for DMA
• The accelerator works with the same addresses the processors use
• Pointers can be de-referenced the same as in the host application - e.g., enables traversing data structures
Coherent Caching of Data
• Enables an accelerator to cache data structures
• Enables cache-to-cache transfers between accelerator and processor
• Enables the accelerator to participate in "locks" like a normal thread
Elimination of the Device Driver
• Direct communication with the application
• No requirement to call an OS device driver or hypervisor function for mainline processing
• Enables accelerator features not possible with PCIe
Enables Efficient Hybrid Applications
• Applications partially implemented in the accelerator and partially on the host CPU
• Visibility into full system memory
• Simpler programming model for application modules
Coherent Accelerator Processor Proxy (CAPP)
• Proxy for the FPGA accelerator on the PowerBus, integrated into the processor
• Programmable (table-driven) protocol for CAPI
• Shadow cache directory for the accelerator: up to 1MB of cache tags (line-based), plus a larger block-based cache
POWER Service Layer (PSL)
• Implemented in FPGA technology
• Provides address translation for the accelerator (compatible with the POWER Architecture)
• Provides a cache for the accelerator
• Facilities for downloading accelerator functions
22
How CAPI Works
[Diagram: a POWER8 processor and a CAPI developer-kit card sharing the same memory space over PCIe; the application portion (data set-up, control) runs on the CPU, the acceleration portion (data- or compute-intensive work, storage or external I/O) runs on the card; the accelerator is a peer to the POWER8 core]
The accelerator is a full peer to the processor
• Accelerator functions use an unmodified effective address
• Full access to the real address space
• Utilizes the processor's page tables directly
• Page faults are handled by system software
• Multiple functions can exist in a single accelerator
23
IO-Attached Accelerator (pre-CAPI)
[Diagram: POWER8 cores and the memory subsystem, with an FPGA attached over PCIe; the data exists in the application's variables, in the device driver's storage area, and in the FPGA's input/output buffers]
• An application calls a device driver to utilize an FPGA accelerator
• The device driver performs a memory-mapping operation
• Three versions of the data exist (not coherent)
• 1000s of instructions are spent in the device driver
24
CAPI Coherency
[Diagram: with CAPI, the FPGA (via the PSL) shares virtually-addressed memory with the POWER8 cores over PCIe]
• With CAPI, the FPGA shares memory with the cores
• One coherent version of the data
• No device-driver calls/instructions
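A loose software analogy for the contrast between slides 23 and 24 (this is plain Python, not CAPI code): a copied buffer, like a pinned DMA buffer, goes stale the moment the application updates its memory, while a shared view of the same memory stays coherent.

```python
# Illustrative analogy only: bytes() snapshots the data (the pre-CAPI copied
# buffer), memoryview() aliases it (the single coherent CAPI version).

data = bytearray(b"input data")       # the application's memory

copied_buffer = bytes(data)           # pre-CAPI: a second, separate copy
shared_view = memoryview(data)        # CAPI-style: a view of the same bytes

data[0:5] = b"INPUT"                  # the application updates its memory

stale = bytes(copied_buffer[0:5])     # still b"input" - the copy went stale
coherent = bytes(shared_view[0:5])    # b"INPUT" - the view observed the update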
25
CAPI vs. I/O Device Driver: Data Prep
Typical I/O model flow:
• DD call (300 instructions) → copy or pin source data (10,000 instructions) → MMIO notify accelerator (3,000 instructions) → [acceleration] → poll/interrupt completion (1,000 instructions) → copy or unpin result data (1,000 instructions) → return from DD
• ~7.9µs before plus ~4.9µs after the acceleration step: total ~13µs for data prep
Flow with a coherent model:
• Shared-memory notify accelerator (400 instructions, 0.3µs) → [acceleration] → shared-memory completion (100 instructions, 0.06µs)
• Total: ~0.36µs
(The acceleration step itself is application-dependent and equal in both flows.)
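The headline numbers above reduce to one division; a quick check of the slide's figures:

```python
# Data-prep overhead around the accelerator call, from the slide's timings.

io_model_us = 7.9 + 4.9     # device-driver flow: ~13 us of data prep
capi_us = 0.3 + 0.06        # coherent (CAPI) flow: ~0.36 us

speedup = io_model_us / capi_us   # ~35x less data-prep overhead per call
```

Note this is overhead per accelerator invocation, so the gain matters most for fine-grained offload, where data prep dominates the acceleration step itself.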
26
IBM Accelerated GZIP Compression
An FPGA-based low-latency GZIP compressor & decompressor with a single-thread throughput of ~2GB/s and a compression ratio significantly better than low-CPU-overhead compressors like snappy.
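For reference, the software baseline being displaced can be sketched with Python's built-in `zlib` (the same DEFLATE algorithm that underlies GZIP); this is illustrative only, and real throughput comparisons need large corpora and careful measurement:

```python
# Software DEFLATE round trip: the work the FPGA GZIP engine offloads.
import zlib

payload = b"heterogeneous computing " * 4096   # ~96 KB of repetitive data

compressed = zlib.compress(payload, level=6)   # CPU-side compression
restored = zlib.decompress(compressed)         # and decompression

ratio = len(payload) / len(compressed)         # compression ratio achieved
```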
29
CAPI Acceleration
[Diagram: six attach patterns, each a processor chip linked to an accelerator over DLx/TLx]
• Egress transform (data out): encryption, compression, erasure coding prior to network or storage
• Ingress transform (data in): video analytics, HFT, VPN/IPsec/SSL, deep packet inspection (DPI), data-plane accelerator (DPA), video encoding (H.265), etc.
• Bi-directional transform: NoSQL such as Neo4j with graph-node traversals, etc.
• Memory transform (potentially using OpenCAPI-attached memory): machine or deep learning
• Basic work offload
• Needle-in-a-haystack engine (needles sent to the accelerator, haystack data searched): database searches, joins, intersections, merges
OpenCAPI wins due to bandwidth to/from accelerators, best-of-breed latency, and the flexibility of an open architecture
30
NVIDIA GPU
NVLink 1: 4 links, 20GB/s raw bandwidth per link in each direction, ~160GB/s total net NVLink bandwidth
NVLink 2: 6 links, 25GB/s raw bandwidth per link in each direction, ~300GB/s total net NVLink bandwidth
Volta GV100
• 15 TFLOPS FP32
• 16GB HBM2 at 900 GB/s
• 300W TDP
• 50 GFLOPS/W (FP32)
• 12nm process
• 300GB/s NVLink 2
• Tensor Cores...
[Source: Nvidia]
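The NVLink totals above fall out of links x per-link bandwidth x 2 directions, and GV100's efficiency figure is just peak FLOPS over TDP:

```python
# Reproduce the slide's aggregate NVLink and GV100 efficiency numbers.

def nvlink_total_gbps(links, gbps_per_link_per_dir):
    """Aggregate bidirectional bandwidth across all links."""
    return links * gbps_per_link_per_dir * 2

nvlink1 = nvlink_total_gbps(4, 20)   # 160 GB/s
nvlink2 = nvlink_total_gbps(6, 25)   # 300 GB/s

gv100_flops_per_watt = 15e12 / 300   # 15 TFLOPS FP32 / 300 W = 50 GFLOPS/W
```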
31
"Minsky" - S822LC for HPC
• Tight coupling: strong CPU to strong GPU performance
• Equalized access to memory for all kinds of programming
• Programming stays closer to the CPU paradigm
[Diagram: two POWER8' sockets, each with DDR4 at 115GB/s and two Tesla P100s attached over 80GB/s NVLink; for x86 servers, CPU-GPU attach is PCIe at 32GB/s with no NVLink between CPU & GPU - a bottleneck]
2.7x faster query response time on "Minsky"; 87% of the total speedup (2.35x of the 2.7x improvement) is due to the NVLink interface between CPU and GPU
• Profiling result based on running Kinetica "filter by geographic area" queries on a data set of 280 million simulated records, 1 simultaneous query stream with 0 think time
• Power System S822LC for HPC: 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink, 2.86 GHz, 1024 GB memory, 2x 6Gb SSDs, 2-port 10 GbE, 4x Tesla P100 GPUs, Ubuntu 16.04
• Competitive stack: 2x Xeon E5-2640 v4, 20 cores (2 x 10c chips) / 40 threads, 2.4 GHz, 512GB memory, 2x 6Gb SSDs, 2-port 10 GbE, 4x Tesla P100 GPUs, Ubuntu 16.04
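The 87% attribution is the NVLink-only speedup expressed as a fraction of the total speedup:

```python
# How the slide's "87%" is derived from its two speedup figures.

total_speedup = 2.7     # overall query-response improvement vs. x86 + PCIe
nvlink_speedup = 2.35   # portion of the improvement attributed to NVLink

nvlink_share = nvlink_speedup / total_speedup   # ~0.87 -> "87% of the speedup"
```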
32
Custom ASICs
Decreasing flexibility: CPU > GPU > FPGA > ASIC
Increasing efficiency: CPU < GPU < FPGA < ASIC
[Source: William Dally, Nvidia]
33
Google TPU 1.0 [Jouppi et al., ISCA 2017]
[Chart: relative performance/Watt (TDP) of the GPU server (blue) and TPU server (red) vs. the CPU server, and of the TPU server vs. the GPU server (orange). TPU' is an improved TPU using GDDR5 memory; the green bar shows its ratio to the CPU server and the lavender bar its ratio to the GPU server. "Total" includes host-server power; "incremental" doesn't. GM and WM are the geometric and weighted means.]
34
Google TPU Performance [Jouppi et al., ISCA 2017]
[Roofline chart: stars are the TPU, triangles the K80, circles Haswell]
35
Microsoft Azure FPGA Usage [M. Russinovich, MSBuild 2017]
• FPGAs for SDN offload
• FPGAs for Bing
36
Hardware Micro-services
A hardware-only, self-contained service that can be distributed and accessed from across the datacenter compute fabric
37
Ease of Consumption
• Compiler optimization
• Math-library optimization
• Native support for CUDA / OpenMP / OpenCL ...
• Native support for frameworks, e.g. for deep learning (Torch/TensorFlow/Caffe ...)
38
POWER9 (SO) - Premier Accelerator Platform
[Diagram: POWER9 chip with its on-chip interconnect and on-chip accelerators; 512kB L2 per SMT8 core plus 120MB of L3 NUCA cache. Bandwidths shown are bidirectional:]
• 8 DDR4 memory ports @ 2667 MT/s
• 2-socket SMP: 256 GB/s
• OpenCAPI and/or NVLink 2.0 (25Gb/s signaling): 200-300 GB/s to IBM/partner devices and NVIDIA GPUs
• 3x16 PCIe Gen4: 192 GB/s to PCIe devices
• CAPI 2.0 links: 128 GB/s (uses up to 2 x16 ports)
39
Newell POWER9 System - 6 GPUs / 2 CAPI
41
[Source: SNIA / Flash Summit]
42
When to Use FPGAs
• Transistor efficiency & extreme parallelism: bit-level operations, variable-precision floating point
• Power-performance advantage: >2x compared to multicore (MIC) or GPGPU; unused LUTs are powered off
• Technology scaling better than CPU/GPU: FPGAs are not frequency- or power-limited yet; 3D has great potential
• Dynamic reconfiguration: flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network-connected: allows network as well as compute specialization
When to Use GPGPUs
• Extreme FLOPS & parallelism: double-precision floating-point leadership, hundreds of GPGPU cores
• Programming ease & software-group interest: CUDA & extensive libraries, OpenCL, IBM Java (coming soon)
• Bandwidth advantage on POWER: start with PCIe Gen3 x16, then move to NVLink
• Leverage the existing GPGPU ecosystem and development base: lots of existing use cases to build on, heavy HPC investment in GPGPU
43
CCIX
[Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017]
44
Gen-Z
[Source: Brad Benton, AMD, OpenFabrics Alliance Annual Workshop 2017]
45
Use Cases - A Truly Heterogeneous Architecture Built Upon OpenCAPI
The OpenCAPI 3.0 and OpenCAPI 3.1 specifications are downloadable from the website at www.opencapi.org (register, then download)
46
OpenCAPI Advantages for Memory
• The open-standard interface enables attaching a wide range of devices
• The OpenCAPI protocol was architected to minimize latency; especially advantageous for classic DRAM memory
• Extreme bandwidth beyond the classical DDR memory interface
• The agnostic interface allows extension to evolving memory technologies in the future (e.g., compute-in-memory)
• Ability to handle a memory buffer that decouples the raw memory and host interfaces, optimizing power, cost and performance
• Common physical interface between non-memory and memory devices
47
OpenCAPI Key Attributes
• Architecture-agnostic bus: applicable with any system/microprocessor architecture
• Coherency: attached devices operate natively within the application's user space and coherently with the host µP
• High-performance interface design with no 'overhead', optimized for high bandwidth and low latency
• Point-to-point construct optimized within a system
• Allows the attached device to fully participate in the application without kernel involvement/overhead
• 25Gbit/s signaling and protocol to enable a very-low-latency interface on the CPU and the attached device
• Supports a wide range of use cases and access semantics: hardware accelerators, high-performance I/O devices, advanced and classic memories, and various form factors (e.g., GPUs, FPGAs, ASICs, memory, etc.)
• Reduced complexity of design implementation: designed to be easy for the accelerator, memory and system design teams; the complexities of coherence and virtual addressing are moved onto the host microprocessor to simplify attached devices and facilitate interoperability across multiple CPU architectures
48
Virtual Addressing and Benefits
An OpenCAPI device operates in the virtual address spaces of the applications that it supports
• Eliminates kernel and device-driver software overhead
• Allows the device to operate on application memory without kernel-level data copies/pinned pages
• Simplifies the programming effort to integrate accelerators into applications
• Improves accelerator performance
The virtual-to-physical address translation occurs in the host CPU
• Reduces the design complexity of OpenCAPI-attached devices
• Makes it easier to ensure interoperability between OpenCAPI devices and different CPU architectures
• Security: since the OpenCAPI device never has access to a physical address, this eliminates the possibility of a defective or malicious device accessing memory locations belonging to the kernel or to other applications that it is not authorized to access