1. Compute API – Past & Future
Ofer Rosenberg
Visual Computing Software
2. Intro and acknowledgments
• Who am I ?
– For the past two years leading the Intel representation in OpenCL working group @
Khronos
– Additional background in Media, Signal Processing, etc.
– http://il.linkedin.com/in/oferrosenberg
• Acknowledgments:
– This presentation contains ideas based on talks with lots of people (who should be
mentioned here)
– Partial list:
– AMD: Mike Houston, Ben Gaster
– Apple: Aaftab Munshi
– DICE: Johan Andersson
– Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler and
more…
– And others…
3. Agenda
• The beginning – From Shaders to Compute
• The Past/Present: 1st Generation of Compute APIs
– Caveats of the 1st generation
• The Future: 2nd Generation of Compute APIs
4. From Shaders to Compute
• In the beginning, GPU HW was fixed & optimized for Graphics…
Slide from “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
5. From Shaders to Compute
• As GPUs evolved, the graphics stages became programmable…
• This led to the traditional GPGPU approach…
Slide from “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008
6. From Shaders to Compute
Traditional GPGPU
• Write in graphics language and use the GPU
• Highly effective, but:
– The developer had to learn another (non-intuitive) language
– The developer was limited by the graphics language
• Then came CUDA & CTM…
Slides from “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab
7. The cradle of GPU Compute APIs
• GeForce 8800 GTX (G80) was released in Nov. 2006; CUDA 0.8 (first official beta) followed in Feb. 2007
• ATI x1900 (R580) was released in Jan. 2006; CTM followed in Nov. 2006
Slides from “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06, and “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007
8. The 1st generation of Platform Compute API
• CUDA & CTM led the way to two compute standards: Direct Compute & OpenCL
• DirectCompute is a Microsoft standard
– Released as part of Win7/DX11, a.k.a. Compute Shaders
– Only runs under Windows on a GPU device
• OpenCL is a cross-OS / cross-Vendor standard
– Managed by a working group in Khronos
– Apple is the spec editor & conformance owner
– Work can be scheduled on both GPUs and CPUs
Timeline:
– Nov 2006: CTM released
– June 2007: CUDA 1.0 released
– Dec 2007: StreamSDK released
– Aug 2008: CUDA 2.0 released
– Dec 2008: OpenCL 1.0 released
– Oct 2009: DirectX 11 released
– Mar 2010: CUDA 3.0 released
– June 2010: OpenCL 1.1 released
The 1st generation was developed on GPU HW tuned for graphics usage – it merely extended
that HW for general-purpose use
9. The 1st generation of Platform Compute API
Execution Model
• The execution model was derived directly from shader programming in graphics (“fragment
processing”):
– Shader programming: initiate one instance of the shader per vertex/pixel
– Compute: initiate one instance of the kernel for each point in an N-dimensional grid
• Fits the GPU’s model of an array of scalar (or stream) processors
Drawing from OpenCL 1.1 Specification, Rev. 36
10. The 1st generation of Platform Compute API
Memory Model
• Distributed Memory system:
– Abstraction: Application gets a “handle” to the memory object / resource
– Explicit transactions: API calls to sync between Host & Device(s): read/write, map/unmap
(Diagram: the App holds a handle (H) via the OpenCL RT to an abstracted (A) memory object on each device, Dev1 and Dev2)
• Three address spaces: Global, Local (Shared) & Private
– Local/Shared Memory: the non-trivial memory space…
11. Disclaimer
Next slides provide my opinion and thoughts on caveats and
future improvements to the Platform Compute API.
12. The 2nd generation of Platform Compute API
• Recap:
– The 1st generation : CUDA (until 3.0), OpenCL 1.x, DX11 CS
– Defined on HW optimized for GFX, extended to General Compute
• The “cheese” has moved for GPUs
– Compute becomes an important usage scenario
– Advanced Graphics: Physics, Advanced Lighting Effects, Irregular Shadow Mapping, Screen Space
Rendering
– Media: Video Encoding & Processing, Image Processing, Image Segmentation, Face Recognition
– Throughput: Scientific Simulations, Finance, Oil Exploration
– Developer feedback from the 1st generation enables building better HW/APIs
• The 2nd generation of Platform Compute API: “OpenCL Next”, DirectX12?
The 2nd Generation of Compute API will run on HW which is designed with
Compute in mind
13. Caveats of the 1st generation:
Execution Model
• Developers input:
– Most “real world” compute usages are fine-grained (the grid is small – 100’s of points at best)
– “Real world” kernels have sequential parts interleaved with the parallel code (reduction, condition
testing, etc.)
__kernel void foo()
{
    // code here runs for each point in the grid
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
    {
        // this code runs once per work-group
    }
    // code here runs for each point in the grid
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (get_global_id(0) == 0)
    {
        // this code runs only once
    }
    // code here runs for each point in the grid
}
(Image: Battlefield 2 execution phase DAG, courtesy Johan Andersson, DICE)
Using “fragment processing” for these usages results in inefficient use of the machine
14. Caveats of the 1st generation
Execution Model
• The “array of scalar/stream processors” model is not optimal for CPUs & GPUs
• Works well for large grids (like in traditional graphics), but on finer grain there is a better
model…
(Block diagrams: NVIDIA Fermi, AMD R600, Intel Nehalem)
CPU’s and GPU’s are better modeled as multi-threaded vector machines
15. The 2nd generation of Platform Compute API
Ideas for new execution model
• Goals
– Support fine-grain task parallelism
– Support complex application execution graphs
– Better match HW evolution: target multi-threaded vector machines
– Aligned with CPU evolution, and SoC integration of CPU/GPU
• Solution: Tasking system as execution model foundation
(Diagram: Device domain – a Task Pool feeds per-compute-unit task queues; each queue
is drained by a SW thread running on an independent HW compute unit)
• Tasking system:
– Task Q’s mapped to independent HW units (~compute cores)
– Device load balancing enabled via task stealing
• OpenCL analogy: Tasks execute at the “work group level”
• OpenCL Task ≠ CPU Task
– More restricted: no preemption
– Evolved: Braided Tasks (sequential parts & fine-grain parallel parts interleaved)
16. The 2nd generation of Platform Compute API
Ideas for new execution model
• There are others who think along the same lines …
Slides from “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
17. Caveats of the 1st generation:
Memory Model
• Developers input:
– A growing number of compute workloads use complex data structures (linked lists, trees, etc.)
– Performance: the cost of pointer marshaling & reconstruction on the device is high
– Porting complexity: need to add explicit transactions, marshaling, etc.
– Supporting a shared/unified address space (API & HW) is required
(Diagrams: today the App reaches Dev1/Dev2 through the OpenCL RT via per-device handles;
with a Shared Address Space (SAS), Host & Devices see the same addresses)
Shared/Unified Address Space between Host & Devices
18. The 2nd generation of Platform Compute API
Ideas for new memory model
Baseline: memory objects / resources have the same starting address on Host & Devices
• Shared Address Space w. relaxed consistency
– Extends the existing OCL 1.x / DX11 memory model
– Uses explicit API calls to sync between Host & Device
– Suitable for disjoint memory architectures (discrete GPUs, for example…)
• Shared Address Space w. full coherency
– New model: memory is coherent between Host & Device
– Uses known “language level” mechanisms for concurrent access: atomics, volatile
(Diagrams: Host processors and Device processors over separate Host/Device memories
vs. over one coherent/shared memory)
– Suitable for shared-memory architectures
19. Some more thoughts for the 2nd generation
(and beyond)
• Promote Heterogeneous Processing – not GPU only…
– Run code on the device that fits the problem size:
– A Matrix Multiply of 16x16 should run on the CPU
– A Matrix Multiply of 1000x1000 should run on the GPU
– Where’s the decision point? Better leave it to the Runtime… (requires API)
(Chart: execution time vs. problem size – the CPU and GPU curves cross)
– Load Balancing
– Relevant especially on systems where the CPU & GPU are close in compute power
• One API to rule them all
– Compute API as the underlying infrastructure to run Media & GFX
– Extend the API to contain flexible pipeline, fixed-function HW, etc.
Slide from “Parallel Future of a Game Engine”, Johan Andersson, DICE
20. References:
• “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06
– http://gpgpu.org/static/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf
• “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008:
– http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
• “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab
– http://www-graphics.stanford.edu/~mhouston/public_talks/R520-mhouston.pdf
• “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007
– http://gpgpu.org/static/s2007/slides/07-CTM-overview.pdf
• “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”, Peter N. Glaskowsky
– http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf
• “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
– http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Njk3NTJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1
• “Parallel Future of a Game Engine”, Johan Andersson, DICE
– http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448