The document provides a history of GPUs and GPGPU computing. It describes how GPUs evolved from fixed-function graphics hardware to programmable hardware, which enabled general-purpose computing on GPUs (GPGPU). It discusses the development of GPGPU languages and APIs such as CUDA, OpenCL, and DirectCompute. The anatomy of a modern GPU is explained, highlighting its massively parallel architecture. Typical GPGPU execution and memory models are outlined. The use of GPGPU for applications like graphics, physics, computer vision, and HPC is mentioned, and leading GPU vendors and their products are briefly introduced.
4. From Shaders to Compute (1)
In the beginning, GPU HW was fixed & optimized for Graphics…
Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008:
5. From Shaders to Compute (2)
• GPUs evolved to be programmable
(which made gaming companies very happy…)
Shader:
A small program that runs on a graphics processing
unit and computes the attributes of either a vertex or a pixel.
6. The birth of GPGPU (1)
• Interest from the academic world
Pixel shader = the same program runs for (1024 × 768 × 60) pixels every second
= a highly efficient SPMD (Single Program, Multiple Data) machine
• A fictitious graphics pipeline was used to solve problems:
– Advanced graphics problems
– General computational problems
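The SPMD idea above can be sketched in a few lines of Python (a toy illustration, not real shader code): one small function plays the role of a pixel shader and runs once per pixel position.

```python
# A toy illustration of the SPMD idea behind pixel shaders:
# one small program runs independently for every pixel position.
WIDTH, HEIGHT = 8, 4  # a tiny "screen" instead of 1024 x 768

def pixel_shader(x, y):
    # Each instance sees only its own coordinates and produces one value.
    return (x + y) % 256  # a trivial gradient "color"

framebuffer = [[pixel_shader(x, y) for x in range(WIDTH)]
               for y in range(HEIGHT)]

# Every pixel ran the *same* program on *different* data -- SPMD.
assert framebuffer[0][0] == 0 and framebuffer[3][7] == 10
```

On real hardware the instances run in parallel across the GPU's lanes; the list comprehension here just makes the one-program-many-data structure explicit.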
7. The birth of GPGPU (2)
• In 2002, Mark Harris from NVIDIA coined the term GPGPU:
“General-Purpose computation on Graphics Processing Units”
• Used a graphics language for general computation
• Highly effective, but:
– The developer needs to learn another (non-intuitive) language
– The developer was limited by the graphics language
8. From Shaders to Compute (3)
• GPUs needed one more evolutionary step: Unified Shaders
9. Rise of modern GPGPU
• Unified Architecture paved the way for modern GPGPU languages
– GeForce 8800 GTX (G80) was released in Nov. 2006; CUDA 0.8 (first official beta) followed in Feb. 2007
– ATI X1900 (R580) was released in Jan. 2006; CTM followed in Nov. 2006
10. Evolution of Compute APIs (GPGPU)
• CUDA & CTM led to two compute standards: Direct Compute & OpenCL
• DirectCompute is a Microsoft standard
– Released as part of Win7/DX11, a.k.a. Compute Shaders
– Runs only on Windows
– Microsoft C++ AMP maps to DirectCompute
• OpenCL is a cross-OS / cross-Vendor standard
– Managed by a working group in Khronos
– Apple is the spec editor & conformance owner
– Work can be scheduled on both GPUs and CPUs
Timeline:
– CTM SDK released Nov. 2006
– CUDA 1.0 released June 2007
– CUDA 2.0 released Aug. 2008
– OpenCL 1.0 released Dec. 2008
– DirectX 11 released Oct. 2009
– CUDA 3.0 released Mar. 2010
– OpenCL 1.1 released June 2010
– CUDA 4.0 released May 2011
– OpenCL 1.2 released Nov. 2011
– CUDA 4.1 released Jan. 2012
– CUDA 4.2 released Apr. 2012
– C++ AMP 1.0 released Aug. 2012
– CUDA 5.0 released Oct. 2012
– CUDA 5.5 released July 2013
– OpenCL 2.0 (provisional) released July 2013
12. GPGPU Evolution
• Nov. 2009 – First hybrid supercomputer in the Top 10: the Chinese Tianhe-1
– 1,024 Intel Xeon E5450 CPUs
– 5,120 Radeon 4870 X2 GPUs
• Nov. 2010 – First hybrid supercomputer reaches #1 on the Top500 list: Tianhe-1A
– 14,336 Xeon X5670 CPUs
– 7,168 NVIDIA Tesla M2050 GPUs
Source: http://www.top500.org/lists/
13. GPGPU Evolution
• 2013 – OpenCL arrives on mobile: Nexus 4 (Qualcomm Adreno 320) and Nexus 10 (ARM Mali T604)
• Android 4.2 adds GPU support for RenderScript
• 2014 – NVIDIA Tegra 5 will support CUDA
• 2013 – The GPGPU continuum becomes a reality
17. Parallelism detailed
• Multi (Many) Cores
• Wide Vector Unit
• Multi-threaded (latency/stalls hiding)
– NVIDIA K20: 14 SMXs; Warp = 32 floats, 6 Warps per SMX; up to 64 Warps in flight per SMX
– AMD HD7970: 32 Compute Units; Wavefront = 64 floats, 4 Wavefronts per CU; up to 40 Wavefronts in flight per CU
– Intel Xeon Phi 5110P: 60 Cores; VPU = 16 floats, 1 VPU per Core
[Figure: NVIDIA GK110 SMX block diagram]
18. Typical GPU Caveats
• Wide vectors = SIMD (SIMT) execution
– Conditional code has to be executed “vector wide”
– Mitigation: Predication (execute all code using masks on parts)
– Performance hit on mixed execution, up to 1/N efficiency (where N is the vector width)
• Many Cores & Small caches = High percentage of Stalls
– Mitigation:
• Hold multiple in-flight contexts (aka Warps/Wavefronts) per core
• Stall = fast context switch between in-flight context and active context
• Requires huge register bank (NV & AMD: 256KB per SMX/CU)
– Latency hiding depends on having enough in-flight contexts
A must read (the images on the original slides are taken from this talk):
“From Shader Code to a Teraflop: How GPU Shader Cores Work”, by Kayvon Fatahalian (Stanford University) and Mike Houston (Fellow, AMD)
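The predication mitigation can be illustrated with a toy Python sketch (an 8-lane vector is assumed purely for illustration; real hardware applies lane masks in the execution units):

```python
# Sketch of predicated SIMD execution on a hypothetical 8-lane vector.
# A branch like "if v > 0: y = v*2 else: y = -v" cannot diverge per lane;
# instead BOTH sides execute across the whole vector and a mask picks results.
x = [-3, 5, -1, 7, 0, 2, -8, 4]

mask = [v > 0 for v in x]            # predicate, one bit per lane
then_result = [v * 2 for v in x]     # "if" side, executed for ALL lanes
else_result = [-v for v in x]        # "else" side, also for ALL lanes

y = [t if m else e for m, t, e in zip(mask, then_result, else_result)]

# Both sides cost full vector time: with a 50/50 mask the unit does 2x the
# work for 1x the useful results; the "up to 1/N" efficiency hit appears
# when only one lane of N takes a given path.
assert y == [3, 10, 1, 14, 0, 4, 8, 8]
```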
19. Typical GPGPU Models
This section describes some general GPGPU models, which apply
to a wide range of languages
20. Simplified System Model
• Host runs the OS, Application, Drivers, etc.
• GPU is connected to the Host through PCIe, shared memory, etc.
• Application code contains API calls*, which use a runtime environment, which in turn provides GPU access
• Application code also contains “kernels”: short programs/functions that are loaded onto and executed by the GPU
* In some languages the API calls are abstracted through special syntax or directives
[Figure: Host stack (Application → Runtime) connected to a GPU executing kernels]
21. GPGPU Execution Model (1)
• A “kernel” is executed on a grid (1D/2D/3D)
• Each point in the grid executes one instance of
the kernel, independently*
• Per-instance read/write is accomplished by using
the instance’s index
* There are sync primitives on a group/block level (or whole device)
CUDA example:
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
// Kernel invocation
dim3 dimBlock(16, 16);
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
(N + dimBlock.y - 1) / dimBlock.y);
MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}
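For readers without a CUDA toolchain, the same grid model can be emulated in plain Python (BLOCK, GRID, and the sequential loops below are illustrative stand-ins for the parallel launch, not real API names):

```python
# Pure-Python emulation of the grid execution model: every
# (block, thread) point in the grid runs one kernel instance.
N = 5
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i * j for j in range(N)] for i in range(N)]
C = [[0] * N for _ in range(N)]

BLOCK = 2                         # plays the role of blockDim.x/.y
GRID = (N + BLOCK - 1) // BLOCK   # same rounding-up as dimGrid in CUDA

def mat_add_kernel(block_x, block_y, thread_x, thread_y):
    # Mirrors the CUDA kernel: derive a global index, guard, then write.
    i = block_x * BLOCK + thread_x
    j = block_y * BLOCK + thread_y
    if i < N and j < N:
        C[i][j] = A[i][j] + B[i][j]

# A GPU runtime would launch these instances in parallel; here we loop.
for bx in range(GRID):
    for by in range(GRID):
        for tx in range(BLOCK):
            for ty in range(BLOCK):
                mat_add_kernel(bx, by, tx, ty)

assert C[2][3] == A[2][3] + B[2][3]
```

Note the boundary guard (`if i < N and j < N`): the grid is rounded up to whole blocks, so some instances fall outside the matrix and must do nothing, exactly as in the CUDA snippet above.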
22. GPGPU Execution Model (2)
• GPU execution model is asynchronous
– Commands are sent down the stack
– Kernels executed based on GPU load & status (serves a few Apps)
– Application code may wait on completion
• Queueing model
– Explicit (OpenCL)
– Default is implicit; advanced usage is explicit (CUDA)
• SPMD → MPMD
– The GPU used to be able to execute only one kernel at a time
– Modern languages support multiple simultaneous kernels
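The asynchronous model can be sketched in Python using a single-worker thread pool as a stand-in for an in-order command queue (the kernel function and queue here are illustrative, not a real GPU API):

```python
# Sketch of asynchronous kernel queueing. A one-worker thread pool
# stands in for the GPU command queue: submissions return immediately
# and run in order, and the host waits explicitly for completion.
from concurrent.futures import ThreadPoolExecutor

def kernel_scale(data, factor):
    # A hypothetical "kernel": scale every element.
    return [v * factor for v in data]

queue = ThreadPoolExecutor(max_workers=1)  # in-order, like a single stream

# Enqueue two "kernels"; submit() does not block (asynchronous dispatch).
f1 = queue.submit(kernel_scale, [1, 2, 3], 10)
f2 = queue.submit(kernel_scale, [4, 5, 6], 10)

# The application may do other host work here, then explicitly wait on
# completion -- the analogue of a finish/synchronize call.
assert f1.result() == [10, 20, 30]
assert f2.result() == [40, 50, 60]
queue.shutdown()
```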
23. GPGPU Memory Model
Basically, a distributed memory system:
• Separated Host memory / Device memory
– Create a buffer/image on the host
– Create a buffer/image on the device
• Opaque handle (OpenCL) or device-side pointer (CUDA)
• Sync operations between memories:
– Read / Write
– Map / Unmap (marshalling)
• Pinned memory for faster sync
• GPU can access host-mapped memory (CUDA)
[Figure: the Host (Application + Runtime) creates a buffer and writes it to a corresponding buffer on the GPU]
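The separated host/device memories can be modeled with a toy Python class (DeviceBuffer and its write/read methods are hypothetical names for illustration, not a real API):

```python
# Toy model of a distributed-memory system: the host never dereferences
# device memory directly; it holds an opaque handle and moves data with
# explicit Write/Read sync operations.
class DeviceBuffer:
    """Opaque handle to a buffer living in 'device' memory."""

    def __init__(self, size):
        self._storage = bytearray(size)  # stands in for GPU memory

    def write(self, host_data):
        # Host -> device copy (the "Write" sync operation).
        self._storage[:len(host_data)] = host_data

    def read(self, size):
        # Device -> host copy (the "Read" sync operation).
        return bytes(self._storage[:size])

host_data = bytes([1, 2, 3, 4])
buf = DeviceBuffer(4)        # create a buffer on the "device"
buf.write(host_data)         # explicit sync: Write
result = buf.read(4)         # explicit sync: Read
assert result == host_data
```

The key point the sketch preserves: host code sees only the handle (`buf`), never a raw pointer into device storage, which is exactly the OpenCL opaque-handle discipline noted above.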
24. GPU Memory Model
• A few memory types, driven by the GPU architecture
• Affects performance – use the right type for the job
• Watch out for coherency issues
– Not your typical MESI architecture…
25. Compilation Model
• Most GPGPU languages use dynamic compilation
– A common practice in the world of GPUs
– Different GPU architectures: no common ISA
– ISA varies even between generations of the same vendor
• Front-End converts a high-level language to IR (Intermediate Representation)
– Assembly of a virtual machine
– LLVM is very common in this world
– In some languages, this happens at application compile time
• Back-End(s) converts from IR to Binary
– Some Vendors use additional intermediate-to-intermediate stages
• Most languages enable storing of IR & IL
– Some do it implicitly (CUDA)
[Figure: compilation flow – OpenCL C, C for CUDA, Fortran, and OpenACC front-ends compile to LLVM* IR, then to PTX IL, then to GK110 / GF104 binaries]
* NVIDIA has “NVVM”, which is LLVM with a set of restrictions
30. Vendor overview: NVIDIA
GeForce:
• GPU for Gaming
• GTX680
Tesla:
• GPU Accelerators
• K10 / K20
Quadro:
• Professional GFX
• K5000
All running the same cores (Kepler GK104 or GK110)
31. Vendor overview: AMD
Radeon:
• GPU for Gaming
• HD7970
FirePro:
• Professional GFX
• W9000
All running the same cores (GCN)
APU:
• CPU+GPU on same die
• A10
32. Vendor overview: Intel
Xeon Phi:
• Accelerator Card
• 5110P
CPU:
• CPU+GPU on same die
• Haswell Core i7-4xxx
33. Leading Mobile GPU Vendors
Vivante CG4000
• Unified Shaders
• 4 Cores, SIMD4 each
• Supports OpenCL 1.2
• 48 GFLOPS
NVIDIA Tegra 4
• 6 X 4-wide Vertex shaders
• 4 X 4-wide Pixel Shaders
• No GPGPU support
• 74 GFLOPS
ARM Mali T604
• 4 Cores
• Multiple “pipes” per core
• Supports OpenCL 1.1
• 68 GFLOPS
Imagination PowerVR 5xx
• Used by Apple, Samsung, Motorola, Intel
• Unified Shaders
• Supports OpenCL 1.1 EP (543)
• 38 GFLOPS (Apple’s MP4 version)
Qualcomm Adreno 320
• Part of Snapdragon S4
• Unified Shader
• Supports OpenCL 1.1 EP
• 50 GFLOPS