Newbie’s guide to the GPGPU universe
Ofer Rosenberg
Agenda
• GPU History
• Anatomy of a Modern GPU
• Typical GPGPU Models
• The GPGPU universe
GPU History
A GPGPU perspective
From Shaders to Compute (1)
In the beginning, GPU HW was fixed & optimized for Graphics…
Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008:
From Shaders to Compute (2)
• GPUs evolved to become programmable
(which made gaming companies very happy…)
Shader:
A simple program that runs on a graphics processing unit and describes the traits of either a vertex or a pixel.
The birth of GPGPU (1)
• Interest from the academic world
Pixel shader = run the same program for every pixel of every frame (1024 x 768 pixels x 60 frames per second)
= highly efficient SPMD (Single Program, Multiple Data) machine
• A “fictitious” graphics pipeline was used to solve problems
– Advanced Graphics problems
– General Computational problems
The birth of GPGPU (2)
• In 2002, Mark Harris from NVIDIA coined the term GPGPU: “General-Purpose computation on Graphics Processing Units”
• Used a graphics language for general computation
• Highly effective, but:
– The developer needed to learn another (unintuitive) language
– The developer was limited by the graphics language
From Shaders to Compute (3)
• GPUs needed one more evolutionary step → Unified Shaders
Rise of modern GPGPU
• Unified Architecture paved the way for modern GPGPU languages
– Jan 2006: ATI X1900 (R580) released
– Nov 2006: CTM released
– Nov 2006: GeForce 8800 GTX (G80) released
– Feb 2007: CUDA 0.8 released (first official Beta)
Evolution of Compute APIs (GPGPU)
• CUDA & CTM led to two compute standards: DirectCompute & OpenCL
• DirectCompute is a Microsoft standard
– Released as part of Win7/DX11, a.k.a. Compute Shaders
– Runs only on Windows
– Microsoft C++ AMP maps to DirectCompute
• OpenCL is a cross-OS / cross-Vendor standard
– Managed by a working group in Khronos
– Apple is the spec editor & conformance owner
– Work can be scheduled on both GPUs and CPUs
Release timeline:
– Nov 2006: CTM SDK
– June 2007: CUDA 1.0
– Aug 2008: CUDA 2.0
– Dec 2008: OpenCL 1.0
– Oct 2009: DirectX 11
– Mar 2010: CUDA 3.0
– June 2010: OpenCL 1.1
– May 2011: CUDA 4.0
– Nov 2011: OpenCL 1.2
– Jan 2012: CUDA 4.1
– April 2012: CUDA 4.2
– Aug 2012: C++ AMP 1.0
– Oct 2012: CUDA 5.0
– July 2013: CUDA 5.5
– July 2013: OpenCL 2.0 (Provisional)
GPGPU Evolution
2004 – Stanford University: Brook for GPUs
2006 – AMD releases CTM, NVIDIA releases CUDA
2008 – OpenCL 1.0 released
G80: 346 GFLOPS, R580: 375 GFLOPS
GPGPU Evolution
Nov 2009 - First Hybrid SC in the Top10: Chinese Tianhe-1
1,024 Intel Xeon E5450 CPUs
5,120 Radeon 4870 X2 GPUs
Nov 2010 – First Hybrid SC reaches #1 on Top500 list: Tianhe-1A
14,336 Xeon X5670 CPUs
7,168 Nvidia Tesla M2050 GPUs
Source: http://www.top500.org/lists/
GPGPU Evolution
2013 – OpenCL on: Nexus 4 (Qualcomm Adreno 320), Nexus 10 (ARM Mali T604)
Android 4.2 adds GPU support for Renderscript
2014 – NVIDIA Tegra 5 will support CUDA
2013 – GPGPU Continuum becomes a reality
The GPGPU Continuum
– Apple A6 GPU: 25 GFLOPS, < 2 W
– AMD G-T16R: 46 GFLOPS*, 4.5 W
– Intel i7-3770: 511 GFLOPS*, 77 W
– NVIDIA GTX Titan: 4500 GFLOPS, 250 W
– ORNL TITAN SC: 27 PFLOPS, 8200 kW
* GFLOPS of CPU+GPU
Anatomy of a Modern GPU
GPGPU Perspective
Massive Parallelism
From a GPGPU perspective, a GPU is a highly multi-threaded, wide-vector machine
Parallelism detailed
• Multi (Many) Cores
– NVIDIA K20: 14 SMXs
– AMD HD7970: 32 Compute Units
– Intel Xeon Phi 5110P: 60 Cores
• Wide Vector Unit
– NVIDIA K20: 32 floats = Warp, 6 Warps per SMX
– AMD HD7970: 64 floats = Wavefront, 4 Wavefronts per CU
– Intel Xeon Phi 5110P: 16 floats = VPU, 1 VPU per Core
• Multi-threaded (latency/stalls hiding)
– NVIDIA K20: 64 Warps per SMX
– AMD HD7970: 40 Wavefronts per CU
[Figure: NVIDIA GK110 SMX]
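A minimal sketch of how these three axes can be read from a device at run time, assuming an NVIDIA GPU and the CUDA runtime API. `multiProcessorCount`, `warpSize` and `maxThreadsPerMultiProcessor` are fields of `cudaDeviceProp`; mapping each field to one of the bullets above is an interpretation, not something the slide states.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA device found\n");
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;

        // Multi (many) cores: number of SMs / SMXs on the device
        int cores = prop.multiProcessorCount;
        // Wide vector unit: number of threads that execute in lockstep (one warp)
        int vectorWidth = prop.warpSize;
        // Multi-threading: warps that can be resident on one SM at the same time
        int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

        std::printf("%s: %d SMs, warp = %d threads, up to %d resident warps per SM\n",
                    prop.name, cores, vectorWidth, warpsPerSM);
        std::printf("  => up to %d threads in flight on the whole device\n",
                    cores * prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```

It can be built with something like `nvcc query.cu -o query` (the file name is hypothetical). The resident-warp figure is the one that corresponds to the "Warps per SMX" row for multi-threading.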
Typical GPU Caveats
• Wide vectors = SIMD (SIMT) execution
– Conditional code has to be executed “vector wide”
– Mitigation: Predication (execute both code paths, masking off the inactive lanes)
– Performance hit on divergent execution: efficiency can drop as low as 1/N (where N is the vector width)
• Many Cores & Small caches = High percentage of Stalls
– Mitigation:
• Hold multiple in-flight contexts (aka Warps/Wavefronts) per core
• Stall = fast context switch between in-flight context and active context
• Requires huge register bank (NV & AMD: 256KB per SMX/CU)
– Latency hiding depends on having enough in-flight contexts
A Must Read: (images to the right are taken from this talk)
“From Shader Code to a Teraflop: How GPU Shader Cores Work”, by Kayvon Fatahalian, Stanford University and Mike Houston, Fellow, AMD
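The "vector wide" cost of conditional code can be sketched with two kernels that do the same work but branch differently. This is an illustrative experiment, not a benchmark from the talk: the kernel names, `pathA`/`pathB`, the loop counts and the problem size are all made up for the example, and the measured ratio will depend on the device and compiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two arbitrary code paths with enough arithmetic to be measurable.
__device__ float pathA(float x) { for (int k = 0; k < 1024; ++k) x = x * 1.0001f + 0.5f; return x; }
__device__ float pathB(float x) { for (int k = 0; k < 1024; ++k) x = x * 0.9999f - 0.5f; return x; }

// Divergent: even and odd lanes of the same warp take different paths,
// so each warp has to run both paths with part of its lanes masked off.
__global__ void divergent(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) data[i] = pathA(data[i]);
    else                      data[i] = pathB(data[i]);
}

// Uniform: the condition is the same for all lanes of a warp,
// so each warp runs only one of the two paths.
__global__ void uniform(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) data[i] = pathA(data[i]);
    else                                   data[i] = pathB(data[i]);
}

// Times one launch of the given kernel with CUDA events, after a warm-up launch.
static float timeKernel(void (*kernel)(float*, int), float* d, int n)
{
    const int block = 256, grid = (n + block - 1) / block;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<grid, block>>>(d, n);          // warm-up
    cudaEventRecord(start);
    kernel<<<grid, block>>>(d, n);          // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    std::printf("divergent: %.3f ms\n", timeKernel(divergent, d, n));
    std::printf("uniform:   %.3f ms\n", timeKernel(uniform, d, n));

    cudaFree(d);
    return 0;
}
```

With two paths of similar cost, the divergent version would be expected to take noticeably longer, since each warp pays for both paths; with N distinct paths inside one warp the cost approaches the 1/N efficiency mentioned above.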
Typical GPGPU Models
This section describes some general GPGPU models, which apply
to a wide range of languages
Simplified System Model
• Host runs the OS, Application, Drivers, etc.
• GPU is connected to the Host through PCIe, Shared
Memory, etc.
Application code contains API calls*,
→ which use a Runtime environment,
→ which provides GPU access
The Application code contains “kernels”,
→ which are short programs/functions,
→ which are loaded and executed on the GPU
* In some languages the API calls are abstracted through special syntax or directives
[Diagram: Host (Application, Runtime) connected to the GPU, which runs several Kernels]
GPGPU Execution Model (1)
• A “kernel” is executed on a grid (1D/2D/3D)
• Each point in the grid executes one instance of the kernel, independently of the others*
• Per-instance read/write is accomplished by using
the instance’s index
* There are sync primitives on a group/block level (or whole device)
[Figures: the kernel grid in OpenCL and in CUDA]
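#include <cuda_runtime.h>

#define N 1024  // matrix dimension (assumed; the slide does not define N)

// Kernel definition: each thread adds one matrix element
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Device buffers (added so the example is complete; inputs are zero-filled)
    float (*A)[N], (*B)[N], (*C)[N];
    size_t bytes = sizeof(float) * N * N;
    cudaMalloc(&A, bytes); cudaMalloc(&B, bytes); cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes); cudaMemset(B, 0, bytes);

    // Kernel invocation: round the grid up so it covers all N x N elements
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
    cudaDeviceSynchronize();  // wait for the kernel to finish

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}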
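The footnote about sync primitives at the group/block level can be illustrated with a per-block sum. This is a sketch under stated assumptions: the kernel name `blockSum`, the block size of 256 and the input of 1000 ones are chosen for the example; `__shared__` and `__syncthreads()` are the standard CUDA block-level storage and barrier.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed; must be a power of two here)

// Each block sums up to BLOCK consecutive inputs into one partial result.
__global__ void blockSum(const float* in, float* partial, int n)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // every load is visible to the whole block

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                // finish this step before starting the next
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];
}

int main()
{
    const int n = 1000;                 // deliberately not a multiple of BLOCK
    const int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<float> input(n, 1.0f), partial(blocks);

    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, input.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dPartial, n);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // There is no barrier across blocks inside this kernel, so the
    // per-block results are combined on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    std::printf("sum = %.1f\n", total); // the input is 1000 ones, so 1000.0 is expected

    cudaFree(dIn);
    cudaFree(dPartial);
    return 0;
}
```

Threads in different blocks stay independent of each other; only threads within one block cooperate, which is why the final combination happens outside the kernel.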
GPGPU Execution Model (2)
• GPU execution model is asynchronous
– Commands are sent down the stack
– Kernels are executed based on GPU load & status (the GPU serves several applications)
– Application code may wait on completion
• Queueing Model
– Explicit (OpenCL)
– Default is implicit, advanced usage is explicit (CUDA)
• SPMD → MPMD
– GPUs used to be able to execute only one kernel at a time
– Modern languages support multiple simultaneous kernels
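A sketch of the explicit queueing style in CUDA, where the queues are called streams. The `scale` kernel and the buffer sizes are hypothetical; `cudaStreamCreate`, `cudaStreamQuery` and `cudaStreamSynchronize` are CUDA runtime calls. Whether the two kernels actually overlap is up to the device and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    const int block = 256, grid = (n + block - 1) / block;

    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Explicit queues: work placed in different streams may run concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // A launch only enqueues the kernel and returns to the host right away.
    scale<<<grid, block, 0, s1>>>(a, n, 2.0f);
    scale<<<grid, block, 0, s2>>>(b, n, 3.0f);

    // The host can keep working here, and can poll a queue without blocking.
    cudaError_t status = cudaStreamQuery(s1);   // cudaSuccess or cudaErrorNotReady
    std::printf("stream 1 is %s\n", status == cudaSuccess ? "already done" : "still busy");

    // Wait on completion of each queue.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Leaving out the stream argument in the launch gives the implicit default queue, which is the simpler style most CUDA examples start with.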
GPGPU Memory Model
Basically, a distributed memory system:
• Separate Host memory / Device memory
– Create a buffer/image on the host
– Create a buffer/image on the device
• Opaque handle (OpenCL) or device-side pointer (CUDA)
• Sync operations between memories:
– Read / Write
– Map / Unmap (marshalling)
• Pinned memory for faster sync
• GPU can access Host mapped memory (CUDA)
[Diagram: Host (Application, Runtime, Buffer) and GPU (Buffer), linked by Create and Write operations]
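The create / write / read sequence can be sketched in CUDA as follows. The `addOne` kernel and the buffer size are made up for the example; `cudaMallocHost` allocates pinned host memory, `cudaMalloc` returns a device-side pointer, and `cudaMemcpy` performs the explicit sync between the two memories. Mapped host memory is not shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    // Buffer on the host, in pinned (page-locked) memory for faster transfers
    float* host = nullptr;
    cudaMallocHost(&host, bytes);
    for (int i = 0; i < n; ++i) host[i] = float(i);

    // Buffer on the device: in CUDA the handle is a device-side pointer
    float* dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   // Write: host -> device
    addOne<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   // Read: device -> host

    // Each element was incremented once, so 1.0 and 1024.0 are expected.
    std::printf("host[0] = %.1f, host[%d] = %.1f\n", host[0], n - 1, host[n - 1]);

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```

The device pointer `dev` must not be dereferenced on the host; it is only meaningful to the runtime and to kernels, which is the "distributed memory" aspect the slide describes.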
GPU Memory Model
• Few types, GPU architecture driven
• Affects performance – use the right type
• Watch out for coherency issues
– Not your typical MESI architecture…
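As an illustration of the "few types", the CUDA sketch below touches global, constant and shared memory in one kernel. The 3-point smoothing filter, the names `smooth` and `coeff`, and the sizes are assumptions made for the example; the slide itself does not name the memory types.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 128;  // threads per block (assumed; the kernel relies on it)

// Constant memory: small, read-only for kernels, cached
__constant__ float coeff[3];

// out[i] = c0 * in[i-1] + c1 * in[i] + c2 * in[i+1], with zeros outside the array
__global__ void smooth(const float* in, float* out, int n)
{
    // Shared memory: on-chip scratchpad, visible to one block only
    __shared__ float tile[BLOCK + 2];

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index into global memory
    int t = threadIdx.x + 1;                        // index into the tile

    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)                           // left halo element
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)              // right halo element
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();

    // Global memory: the large off-chip buffers passed in as pointers
    if (i < n)
        out[i] = coeff[0] * tile[t - 1] + coeff[1] * tile[t] + coeff[2] * tile[t + 1];
}

int main()
{
    const int n = 2 * BLOCK;
    const size_t bytes = n * sizeof(float);
    const float hostCoeff[3] = {0.25f, 0.5f, 0.25f};
    std::vector<float> h(n, 1.0f);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeff, hostCoeff, sizeof(hostCoeff));

    smooth<<<n / BLOCK, BLOCK>>>(dIn, dOut, n);
    cudaMemcpy(h.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    // With all-ones input: 0.75 at the edge, 1.00 in the interior
    std::printf("edge = %.2f, interior = %.2f\n", h[0], h[1]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Staging the inputs in shared memory means each global value is loaded once per block instead of up to three times, which is the kind of performance effect the second bullet refers to.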
Compilation Model
• Most GPGPU languages use dynamic compilation
– A common practice in the world of GPUs
– Different GPU architectures: no common ISA
– ISA varies even between generations of the same vendor
• Front-End converts High-level language to IR
(Intermediate Representation)
– Assembly of a virtual machine
– LLVM is very common in this world
– In some languages, this happens at application compile time
• Back-End(s) converts from IR to Binary
– Some Vendors use additional intermediate-to-intermediate stages
• Most languages enable storing of IR & IL
– Some do it implicitly (CUDA)
[Diagram: OpenCL C, C for CUDA, Fortran and OpenACC front ends → LLVM* IR → PTX IL → GK110 Binary / GF104 Binary]
* NVIDIA has “NVVM”, which is LLVM with a set of restrictions
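Dynamic compilation can be sketched with NVRTC, the runtime compilation library in more recent CUDA toolkits (it is not mentioned in this deck and is used here only as one possible illustration). The kernel source and the name `scale.cu` are hypothetical. The program compiles CUDA source to PTX while the application runs; it does not load or launch the result.

```cuda
#include <cstdio>
#include <string>
#include <nvrtc.h>

int main()
{
    // Kernel source held as a string and compiled while the application runs
    const char* source =
        "extern \"C\" __global__ void scale(float* data, int n, float f)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) data[i] *= f;\n"
        "}\n";

    // Front end: high-level source -> PTX (the intermediate language)
    nvrtcProgram prog;
    if (nvrtcCreateProgram(&prog, source, "scale.cu", 0, nullptr, nullptr) != NVRTC_SUCCESS) {
        std::printf("could not create program\n");
        return 1;
    }

    // No options: the library's default target architecture is used
    nvrtcResult result = nvrtcCompileProgram(prog, 0, nullptr);
    if (result != NVRTC_SUCCESS) {
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::string log(logSize, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        std::printf("compile failed:\n%s\n", log.c_str());
        nvrtcDestroyProgram(&prog);
        return 1;
    }

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    std::printf("generated %zu bytes of PTX\n", ptxSize);

    // Back end: when this PTX is loaded (for example with cuModuleLoadData from
    // the driver API), the driver translates it to a binary for the installed GPU.
    return 0;
}
```

It needs to be linked against the NVRTC library, for example with `-lnvrtc`. Keeping the PTX string around corresponds to the "storing of IR & IL" bullet: the same PTX can later be turned into binaries for different GPU generations.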
GPGPU usages
CUDA usages:
• Advanced Graphics
• Game Physics
• Computer Vision
• Cluster/HPC
• Finance
• Scientific
• Media Processing
CUDA Community Showcase:
• ~900 applications from Academia
• http://www.nvidia.com/object/cuda-apps-flash-new.html#
• Pictured: Johannes Gutenberg University Mainz, Imperial College London, UC Davis, California, TU Darmstadt
GPGPU Languages
• Welcome to the jungle…
Vendor overview: NVIDIA
GeForce:
• GPU for Gaming
• GTX680
Tesla:
• GPU Accelerators
• K10 / K20
Quadro:
• Professional GFX
• K5000
All running the same cores (Kepler GK104 or GK110)
Vendor overview: AMD
Radeon:
• GPU for Gaming
• HD7970
FirePro:
• Professional GFX
• W9000
All running the same cores (GCN)
APU:
• CPU+GPU on same die
• A10
Vendor overview: Intel
Xeon Phi:
• Accelerator Card
• 5110P
CPU:
• CPU+GPU on same die
• Haswell Core i7-4xxx
Leading Mobile GPU Vendors
Vivante GC4000
• Unified Shaders
• 4 Cores, SIMD4 each
• Supports OpenCL 1.2
• 48 GFLOPS
NVIDIA Tegra 4
• 6 x 4-wide Vertex Shaders
• 4 x 4-wide Pixel Shaders
• No GPGPU support
• 74 GFLOPS
ARM Mali T604
• 4 Cores
• Multiple “pipes” per core
• Supports OpenCL 1.1
• 68 GFLOPS
Imagination PowerVR 5xx
• Apple, Samsung, Motorola, Intel
• Unified Shaders
• Supports OpenCL 1.1 EP (543)
• 38 GFLOPS (Apple’s MP4 version)
Qualcomm Adreno 320
• Part of Snapdragon S4
• Unified Shader
• Supports OpenCL 1.1 EP
• 50 GFLOPS

 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Newbie’s guide to GPGPU

  • 1. Newbie’s guide to the GPGPU universe Ofer Rosenberg
  • 2. Agenda • GPU History • Anatomy of a Modern GPU • Typical GPGPU Models • The GPGPU universe
  • 3. GPU History: A GPGPU perspective
  • 4. From Shaders to Compute (1) In the beginning, GPU HW was fixed & optimized for Graphics… (Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008)
  • 5. From Shaders to Compute (2) • GPUs evolved to become programmable (which made gaming companies very happy…) Shader: a simple program that may run on a graphics processing unit and describes the traits of either a vertex or a pixel.
  • 6. The birth of GPGPU (1)
      • Interest from the academic world: a pixel shader runs the same program for every pixel (1024 × 768 × 60), which makes the GPU a highly efficient SPMD (Single Program, Multiple Data) machine
      • A fictitious graphics pipe was used to solve problems
          – Advanced graphics problems
          – General computational problems
  • 7. The birth of GPGPU (2)
      • In 2002, Mark Harris from NVIDIA coined the term GPGPU: “General-Purpose computation on Graphics Processing Units”
      • A graphics language was used for general computation
      • Highly effective, but:
          – The developer needed to learn another (unintuitive) language
          – The developer was limited by the graphics language
  • 8. From Shaders to Compute (3) • GPUs needed one more evolutionary step: Unified Shaders
  • 9. Rise of modern GPGPU
      • Unified Architecture paved the way for modern GPGPU languages
          – NVIDIA: GeForce 8800 GTX (G80) was released in Nov. 2006; CUDA 0.8 (first official Beta) was released in Feb. 2007
          – ATI: x1900 (R580) was released in Jan. 2006; CTM was released in Nov. 2006
  • 10. Evolution of Compute APIs (GPGPU)
      • CUDA & CTM led to two compute standards: DirectCompute & OpenCL
      • DirectCompute is a Microsoft standard
          – Released as part of Win7/DX11, a.k.a. Compute Shaders
          – Runs only on Windows
          – Microsoft C++ AMP maps to DirectCompute
      • OpenCL is a cross-OS / cross-vendor standard
          – Managed by a working group in Khronos
          – Apple is the spec editor & conformance owner
          – Work can be scheduled on both GPUs and CPUs
      • Release timeline
          – Nov 2006: CTM SDK
          – June 2007: CUDA 1.0
          – Aug 2008: CUDA 2.0
          – Dec 2008: OpenCL 1.0
          – Oct 2009: DirectX 11
          – Mar 2010: CUDA 3.0
          – June 2010: OpenCL 1.1
          – May 2011: CUDA 4.0
          – Nov 2011: OpenCL 1.2
          – Jan 2012: CUDA 4.1
          – April 2012: CUDA 4.2
          – Aug 2012: C++ AMP 1.0
          – Oct 2012: CUDA 5.0
          – July 2013: CUDA 5.5
          – July 2013: OpenCL 2.0 (provisional)
  • 11. GPGPU Evolution
      • 2004: Stanford University, Brook for GPUs
      • 2006: AMD releases CTM; NVIDIA releases CUDA
      • 2008: OpenCL 1.0 released
      • G80: 346 GFLOPS; R580: 375 GFLOPS
  • 12. GPGPU Evolution
      • Nov 2009: first hybrid supercomputer (SC) in the Top10, the Chinese Tianhe-1
          – 1,024 Intel Xeon E5450 CPUs
          – 5,120 Radeon 4870 X2 GPUs
      • Nov 2010: first hybrid SC reaches #1 on the Top500 list, Tianhe-1A
          – 14,336 Xeon X5670 CPUs
          – 7,168 Nvidia Tesla M2050 GPUs
      Source: http://www.top500.org/lists/
  • 13. GPGPU Evolution
      • 2013: OpenCL on Nexus 4 (Qualcomm Adreno 320) and Nexus 10 (ARM Mali T604)
      • Android 4.2 adds GPU support for Renderscript
      • 2014: NVIDIA Tegra 5 will support CUDA
      • 2013: the GPGPU Continuum becomes a reality
  • 14. The GPGPU Continuum
      • Apple A6 GPU: 25 GFLOPS, < 2 W
      • AMD G-T16R: 46 GFLOPS*, 4.5 W
      • Intel i7-3770: 511 GFLOPS*, 77 W
      • NVIDIA GTX Titan: 4500 GFLOPS, 250 W
      • ORNL TITAN SC: 27 PFLOPS, 8200 kW
      * GFLOPS of CPU+GPU
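The continuum is easier to compare once power is factored in. The following is a minimal C++ sketch that divides each peak figure from the slide by its power figure. The `Device` struct, the `kContinuum` table and the `gflops_per_watt` helper are illustrative names, not part of the deck. The Apple A6 entry assumes 2 W, which the slide gives only as an upper bound ("< 2W"), so its ratio is a lower bound.

```cpp
#include <cassert>
#include <cmath>

// Illustrative record; the figures are copied from the slide.
struct Device {
    const char* name;
    double gflops;  // peak throughput in GFLOPS
    double watts;   // power in watts
};

// The five points of the continuum, smallest to largest.
const Device kContinuum[] = {
    {"Apple A6 GPU",     25.0,   2.0},      // slide says "< 2W": 2 W is an upper bound
    {"AMD G-T16R",       46.0,   4.5},      // CPU+GPU figure
    {"Intel i7-3770",    511.0,  77.0},     // CPU+GPU figure
    {"NVIDIA GTX Titan", 4500.0, 250.0},
    {"ORNL TITAN SC",    27.0e6, 8200.0e3}  // 27 PFLOPS, 8200 kW
};

// Peak throughput per watt of power.
double gflops_per_watt(const Device& d) {
    return d.gflops / d.watts;
}
```

On these numbers the GTX Titan works out to 18 GFLOPS per watt and the ORNL TITAN system to roughly 3.3. Both are ratios of peak figures, not measured efficiency.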
  • 15. Anatomy of a Modern GPU: GPGPU Perspective
  • 16. Massive Parallelism: from a GPGPU perspective, the GPU is a highly multi-threaded, wide vector machine
  • 17. Parallelism detailed
      • Multi (Many) Cores
          – NVIDIA K20: 14 SMXs
          – AMD HD7970: 32 Compute Units
          – Intel Xeon Phi 5110P: 60 Cores
      • Wide Vector Unit
          – NVIDIA K20: 32 floats = Warp, 6 Warps per SMX
          – AMD HD7970: 64 floats = Wavefront, 4 Wavefronts per CU
          – Intel Xeon Phi 5110P: 16 floats = VPU, 1 VPU per Core
      • Multi-threaded (latency/stalls hiding)
          – NVIDIA K20: 64 Warps per SMX
          – AMD HD7970: 40 Wavefronts per CU
      [Figure: NVIDIA GK110 SMX]
  • 18. Typical GPU Caveats
      • Wide vectors = SIMD (SIMT) execution
          – Conditional code has to be executed “vector wide”
          – Mitigation: Predication (execute all code paths, using masks to disable the lanes that did not take each path)
          – Performance hit on mixed execution, down to 1/N efficiency (where N is the vector width)
      • Many cores & small caches = high percentage of stalls
          – Mitigation:
              • Hold multiple in-flight contexts (a.k.a. Warps/Wavefronts) per core
              • Stall = fast context switch between in-flight context and active context
              • Requires a huge register bank (NV & AMD: 256KB per SMX/CU)
          – Latency hiding depends on having enough in-flight contexts
      A must read (the slide’s images are taken from this talk): “From Shader Code to a Teraflop: How GPU Shader Cores Work”, by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD
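Predication can be emulated on the CPU to show where the efficiency goes. The following is a minimal C++ sketch, not vendor code. It assumes a hypothetical 8-lane vector (`kWidth`) and a two-way branch; `BranchResult`, `predicated_branch` and `lane_utilization` are illustrative names. Each side of the branch is issued for the whole vector, and a mask decides which lanes may commit a result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWidth = 8;  // hypothetical vector width N
using Vec = std::array<int, kWidth>;

struct BranchResult {
    Vec out;     // per-lane results
    int passes;  // vector-wide passes issued: 1 if all lanes agree, 2 if they diverge
};

// Emulates `out = (in is even) ? in * 2 : in + 1` the way a SIMD/SIMT unit
// would: each side of the branch runs across the whole vector, and the mask
// selects the lanes that are allowed to write.
BranchResult predicated_branch(const Vec& in) {
    std::array<bool, kWidth> mask{};
    bool any_then = false;
    bool any_else = false;
    for (std::size_t i = 0; i < kWidth; ++i) {
        mask[i] = (in[i] % 2 == 0);
        any_then = any_then || mask[i];
        any_else = any_else || !mask[i];
    }

    BranchResult r{in, 0};
    if (any_then) {  // "then" side: only masked-in lanes commit
        ++r.passes;
        for (std::size_t i = 0; i < kWidth; ++i)
            if (mask[i]) r.out[i] = in[i] * 2;
    }
    if (any_else) {  // "else" side: only masked-out lanes commit
        ++r.passes;
        for (std::size_t i = 0; i < kWidth; ++i)
            if (!mask[i]) r.out[i] = in[i] + 1;
    }
    return r;
}

// Fraction of issued lane slots that did useful work. Every lane commits
// exactly once, so this is width / (passes * width) = 1 / passes.
double lane_utilization(const BranchResult& r) {
    return 1.0 / r.passes;
}
```

With a two-way branch the utilization is 1 when all lanes agree and 1/2 when they diverge. The 1/N worst case on the slide corresponds to code in which every lane takes a different path, so that N passes are issued.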
  • 19. Typical GPGPU Models: This section describes some general GPGPU models, which apply to a wide range of languages
  • 20. Simplified System Model
      • Host runs the OS, Application, Drivers, etc.
      • GPU is connected to the Host through PCIe, Shared Memory, etc.
      • Application code contains API calls*, which use a Runtime environment, which provides GPU access
      • The Application code contains “kernels”, which are short programs/functions that are loaded and executed on the GPU
      * In some languages the API calls are abstracted through special syntax or directives
      [Diagram labels: Host, Application, Runtime, GPU, Kernels]
  • 21. GPGPU Execution Model (1)
      • A “kernel” is executed on a grid (1D/2D/3D)
      • Each point in the grid executes one instance of the kernel, independently of the others*
      • Per-instance read/write is accomplished by using the instance’s index
      * There are sync primitives on a group/block level (or whole device)
      Example (CUDA):
          // Kernel definition
          __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
          {
              // Global index of this instance within the grid
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              int j = blockIdx.y * blockDim.y + threadIdx.y;
              // Instances that fall past the matrix edge do nothing
              if (i < N && j < N)
                  C[i][j] = A[i][j] + B[i][j];
          }

          int main()
          {
              // A, B and C are N x N device arrays set up elsewhere
              // Kernel invocation: round the grid size up so it covers all N x N elements
              dim3 dimBlock(16, 16);
              dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                           (N + dimBlock.y - 1) / dimBlock.y);
              MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
          }
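The indexing in the execution model can be tried without a GPU. The following is a minimal host-only C++ sketch that walks the grid sequentially. It is a stand-in for a real launch, not how a GPU schedules work, and `Dim2`, `blocks_for`, `mat_add_instance` and `launch_mat_add` are illustrative names. Matrices are stored as flat row-major vectors, which is an assumption made here for simplicity.

```cpp
#include <cassert>
#include <vector>

struct Dim2 { int x; int y; };  // stand-in for CUDA's dim3, two dimensions only

// Number of blocks needed to cover n elements (rounds up).
int blocks_for(int n, int block) { return (n + block - 1) / block; }

// One kernel instance: derives its global index and handles one element.
void mat_add_instance(Dim2 block_idx, Dim2 block_dim, Dim2 thread_idx, int n,
                      const std::vector<float>& a, const std::vector<float>& b,
                      std::vector<float>& c) {
    int i = block_idx.x * block_dim.x + thread_idx.x;
    int j = block_idx.y * block_dim.y + thread_idx.y;
    if (i < n && j < n)  // instances past the matrix edge do nothing
        c[i * n + j] = a[i * n + j] + b[i * n + j];
}

// Sequential stand-in for a kernel launch: visits every grid point once.
// Returns the number of instances that were started.
int launch_mat_add(int n, Dim2 block, const std::vector<float>& a,
                   const std::vector<float>& b, std::vector<float>& c) {
    Dim2 grid{blocks_for(n, block.x), blocks_for(n, block.y)};
    int launched = 0;
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    mat_add_instance({bx, by}, block, {tx, ty}, n, a, b, c);
                    ++launched;
                }
    return launched;
}
```

For a 20 x 20 matrix and 16 x 16 blocks, the rounded-up grid is 2 x 2 blocks, so 1024 instances start but only 400 of them pass the bounds check. That is why the `if (i < N && j < N)` guard is needed whenever the matrix size is not a multiple of the block size.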
  • 22. GPGPU Execution Model (2)
      • GPU execution model is asynchronous
          – Commands are sent down the stack
          – Kernels are executed based on GPU load & status (the GPU serves several Apps)
          – Application code may wait on completion
      • Queueing model
          – Explicit (OpenCL)
          – Default is implicit, advanced usage is explicit (CUDA)
      • From SPMD to MPMD
          – GPUs used to be able to execute only one kernel at a time
          – Modern languages support multiple simultaneous kernels
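The queueing idea can be sketched in plain C++. This is a single-threaded toy model, not a real runtime API. The `CommandQueue` class and its `enqueue`, `finish` and `pending` methods are hypothetical names, and a real runtime would execute commands concurrently with the host rather than only inside `finish()`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Toy in-order command queue: enqueue() only records the command and
// returns immediately; nothing runs until the application waits.
class CommandQueue {
public:
    void enqueue(std::function<void()> cmd) { pending_.push(std::move(cmd)); }

    // Blocks until every queued command has run, in submission order.
    void finish() {
        while (!pending_.empty()) {
            std::function<void()> cmd = std::move(pending_.front());
            pending_.pop();
            cmd();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};
```

The point of the model is the ordering: after `enqueue` returns, the application cannot assume the work has happened, and results are safe to read only after the wait (`finish` here) completes.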
  • 23. GPGPU Memory Model
      Basically, a distributed memory system:
      • Separate Host memory / Device memory
          – Create a buffer/image on the host
          – Create a buffer/image on the device
              • Opaque handle (OpenCL) or device-side pointer (CUDA)
      • Sync operations between memories:
          – Read / Write
          – Map / Unmap (marshalling)
      • Pinned memory for faster sync
      • GPU can access Host mapped memory (CUDA)
      [Diagram labels: Host, Application, Runtime, GPU, Buffer, Create, Write Buffer]
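The separation between host and device memory can be mimicked with two independent copies of the data. The following is a minimal C++ sketch under that assumption. `FakeDevice`, `BufferHandle` and the method names are hypothetical and only loosely follow the create / write / read vocabulary of the slide; they are not OpenCL or CUDA calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using BufferHandle = int;  // opaque: the application never sees a device address

// Toy device that owns its own memory, separate from the host's vectors.
class FakeDevice {
public:
    // Allocates a zero-filled device buffer and returns a handle to it.
    BufferHandle create_buffer(std::size_t count) {
        BufferHandle h = next_handle_++;
        memory_[h] = std::vector<float>(count, 0.0f);
        return h;
    }

    // Host -> device copy (an explicit sync operation).
    void write_buffer(BufferHandle h, const std::vector<float>& host) {
        std::vector<float>& dev = memory_.at(h);
        std::copy_n(host.begin(), std::min(host.size(), dev.size()), dev.begin());
    }

    // Device -> host copy (an explicit sync operation).
    std::vector<float> read_buffer(BufferHandle h) const { return memory_.at(h); }

    // Stand-in for a kernel: it can touch device memory only.
    void run_scale_kernel(BufferHandle h, float factor) {
        for (float& v : memory_.at(h)) v *= factor;
    }

private:
    BufferHandle next_handle_ = 1;
    std::unordered_map<BufferHandle, std::vector<float>> memory_;
};
```

Once the data has been written to the device, changes on either side are invisible to the other until the next explicit read or write, which is the behaviour the "distributed memory system" description points at.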
  • 24. GPU Memory Model
      • Few memory types, driven by the GPU architecture
      • The choice affects performance: use the right type
      • Watch out for coherency issues
          – Not your typical MESI architecture…
  • 25. Compilation Model
      • Most GPGPU languages use dynamic compilation
          – A common practice in the world of GPUs
          – Different GPU architectures: no common ISA
          – ISA varies even between generations of the same vendor
      • Front-End converts High-level language to IR (Intermediate Representation)
          – Assembly of a virtual machine
          – LLVM is very common in this world
          – In some languages, this happens at application compile time
      • Back-End(s) converts from IR to Binary
          – Some Vendors use additional intermediate-to-intermediate stages
      • Most languages enable storing of IR & IL
          – Some do it implicitly (CUDA)
      [Diagram labels: OpenCL C, C for CUDA, Fortran, OpenACC; LLVM* IR; PTX; IL; GK110 Binary; GF104 Binary]
      * NVIDIA has “NVVM”, which is LLVM with a set of restrictions
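The split between a stored IR and a per-architecture back-end can be illustrated with a small cache. This is a toy C++ sketch. `backend_compile` and `KernelCache` are hypothetical names, the "binary" is just a descriptive string, and no real compiler is involved.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical back-end: a real one would emit machine code for `arch`.
std::string backend_compile(const std::string& ir, const std::string& arch) {
    return "[" + arch + " binary of " + ir + "]";
}

// Keeps one binary per (IR, architecture) pair and compiles on first use,
// which is the essence of dynamic compilation: the IR ships with the
// application, and the ISA-specific step happens on the target machine.
class KernelCache {
public:
    const std::string& get_binary(const std::string& ir, const std::string& arch) {
        std::pair<std::string, std::string> key(ir, arch);
        auto it = binaries_.find(key);
        if (it == binaries_.end()) {
            ++compilations_;
            it = binaries_.emplace(key, backend_compile(ir, arch)).first;
        }
        return it->second;
    }

    int compilations() const { return compilations_; }

private:
    std::map<std::pair<std::string, std::string>, std::string> binaries_;
    int compilations_ = 0;
};
```

Requesting the same IR for two architectures (for example the GK110 and GF104 named on the slide) triggers two back-end runs, while repeated requests for the same pair reuse the stored binary.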
  • 26. 26
  • 27. GPGPU usages
      • Usage areas: Advanced Graphics, Game Physics, Computer Vision, Cluster/HPC, Finance, Scientific, Media Processing
      • CUDA usages: the CUDA Community Showcase lists ~900 applications from Academia
        http://www.nvidia.com/object/cuda-apps-flash-new.html
      • Institutions named on the slide: Johannes Gutenberg University Mainz; Imperial College London; UC Davis, California; TU Darmstadt
  • 28. GPGPU Languages • Welcome to the jungle…
  • 29. 29
  • 30. Vendor overview: NVIDIA
      • GeForce:
          – GPU for Gaming
          – GTX680
      • Tesla:
          – GPU Accelerators
          – K10 / K20
      • Quadro:
          – Professional GFX
          – K5000
      All running the same cores (Kepler GK104 or GK110)
  • 31. Vendor overview: AMD
      • Radeon:
          – GPU for Gaming
          – HD7970
      • FirePro:
          – Professional GFX
          – W9000
      • APU:
          – CPU+GPU on same die
          – A10
      All running the same cores (GCN)
  • 32. Vendor overview: Intel
      • Xeon Phi:
          – Accelerator Card
          – 5110P
      • CPU:
          – CPU+GPU on same die
          – Haswell Core i7-4xxx
  • 33. Leading Mobile GPU Vendors
      • Vivante GC4000
          – Unified Shaders
          – 4 Cores, SIMD4 each
          – Supports OpenCL 1.2
          – 48 GFLOPS
      • NVIDIA Tegra 4
          – 6 x 4-wide Vertex Shaders
          – 4 x 4-wide Pixel Shaders
          – No GPGPU support
          – 74 GFLOPS
      • ARM Mali T604
          – 4 Cores
          – Multiple “pipes” per core
          – Supports OpenCL 1.1
          – 68 GFLOPS
      • Imagination PowerVR 5xx
          – Apple, Samsung, Motorola, Intel
          – Unified Shaders
          – Supports OpenCL 1.1 EP (543)
          – 38 GFLOPS (Apple’s MP4 ver)
      • Qualcomm Adreno 320
          – Part of Snapdragon S4
          – Unified Shader
          – Supports OpenCL 1.1 EP
          – 50 GFLOPS