Naga Vydyanathan
GPU Computing with CUDA
Accelerated Computing
GPU Teaching Kit
2
3 Ways to Accelerate Applications
[Figure: three paths from Applications — Libraries (easy to use, most performance), Compiler Directives (easy to use, portable code), and Programming Languages such as CUDA (most performance, most flexibility).]
3
Objective
– To understand the CUDA memory hierarchy
– Registers, shared memory, global memory
– Scope and lifetime
– To learn the basic memory API functions in CUDA host code
– Device Memory Allocation
– Host-Device Data Transfer
– To learn about CUDA threads, the main mechanism for exploiting data parallelism
– Hierarchical thread organization
– Launching parallel execution
– Thread index to data index mapping
4
Hardware View of CUDA Memories
[Figure: a streaming multiprocessor (SM) containing a register file, shared memory, a control unit (PC, IR), processing units (ALUs), and I/O, connected to off-chip global memory.]
5
Programmer View of CUDA Memories
[Figure: a grid of thread blocks — Block (0, 0), Block (1, 0), etc.; each block has its own shared memory, and each thread its own registers; all blocks share global memory and constant memory, both accessible from the host.]
6
Declaring CUDA Variables
– __device__ is optional when used with __shared__ or __constant__
– Automatic variables reside in a register
– Except per-thread arrays that reside in global memory
Variable declaration                      | Memory   | Scope  | Lifetime
int LocalVar;                             | register | thread | thread
__device__ __shared__ int SharedVar;      | shared   | block  | block
__device__ int GlobalVar;                 | global   | grid   | application
__device__ __constant__ int ConstantVar;  | constant | grid   | application
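As a minimal sketch of the table in use (all names here are ours, not from the deck), a kernel touching each variable class might look like:

__constant__ float coeff;        // constant memory: grid scope, application lifetime
__device__ int globalCounter;    // global memory: grid scope, application lifetime (declared for illustration)

__global__ void exampleKernel(float *out)
{
  __shared__ float tile[256];    // shared memory: one copy per block (assumes blockDim.x <= 256)
  int i = threadIdx.x;           // automatic variable: lives in a register, per thread
  tile[i] = coeff * (float)i;
  __syncthreads();               // make the block's shared-memory writes visible
  out[blockIdx.x * blockDim.x + i] = tile[i];
}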
7
Global Memory in CUDA
– GPU Device memory (DRAM)
– Large in size (~16/32 GB)
– Bandwidth ~700-900 GB/s
– High latency (~1000 cycles)
– Scope of access and sharing - kernel
– Lifetime – application
– Main means of communicating data between host and GPU
8
Shared Memory in CUDA
– A special type of memory whose contents are explicitly defined and
used in the kernel source code
– One in each SM
– Accessed at much higher speed (in both latency and throughput) than global
memory
– Scope of access and sharing - thread blocks
– Lifetime – thread block; contents disappear after the corresponding thread block finishes execution
– A form of scratchpad memory in computer architecture
We will re-visit shared memory later.
9
Data Parallelism - Vector Addition Example
[Figure: vectors A and B added element-wise to produce vector C; each C[i] = A[i] + B[i] for i = 0 ... N-1 can be computed independently.]
10
Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int i;
for (i = 0; i<n; i++) h_C[i] = h_A[i] + h_B[i];
}
int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements
…
vecAdd(h_A, h_B, h_C, N);
}
11
Heterogeneous Computing – vecAdd CUDA Host Code
[Figure: CPU with host memory and GPU with device memory; Part 1 copies the inputs to the device, Part 2 runs the kernel on the GPU, Part 3 copies the result back to the host.]
#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int size = n* sizeof(float);
float *d_A, *d_B, *d_C;
// Part 1
// Allocate device memory for A, B, and C
// copy A and B to device memory
// Part 2
// Kernel launch code – the device performs the actual vector addition
// Part 3
// copy C from the device memory
// Free device vectors
}
12
Partial Overview of CUDA Memories
– Device code can:
– R/W per-thread registers
– R/W all-shared global
memory
– Host code can:
– Transfer data to/from per-grid global memory
[Figure: CUDA memory model — the host transfers data to and from device global memory; each thread block's threads have their own per-thread registers.]
13
CUDA Device Memory Management API functions
– cudaMalloc()
– Allocates an object in the device
global memory
– Two parameters
– Address of a pointer to the
allocated object
– Size of allocated object in terms
of bytes
– cudaFree()
– Frees object from device global
memory
– One parameter
– Pointer to freed object
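The calling pattern is easy to get wrong: cudaMalloc needs the address of your pointer so it can write the device address into it. A minimal sketch (d_A and size are our placeholder names):

float *d_A;                          // will hold a device address
int size = 256 * sizeof(float);      // hypothetical byte count
cudaMalloc((void **) &d_A, size);    // note: address of the pointer, cast to void **
// ... use d_A in kernels and cudaMemcpy calls ...
cudaFree(d_A);                       // release the device allocation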
14
Host-Device Data Transfer API functions
– cudaMemcpy()
– memory data transfer
– Requires four parameters
– Pointer to destination
– Pointer to source
– Number of bytes copied
– Type/Direction of transfer
– Transfer to device is synchronous
15
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int size = n * sizeof(float); float *d_A, *d_B, *d_C;
cudaMalloc((void **) &d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_B, size);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_C, size);
// Kernel invocation code – to be shown later
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
16
In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}
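In real code this check is usually wrapped in a macro so every API call can be guarded without repeating the boilerplate. A common sketch (the macro name is ours):

// Requires <stdio.h> and <stdlib.h>.
#define CHECK_CUDA(call)                                           \
  do {                                                             \
    cudaError_t err_ = (call);                                     \
    if (err_ != cudaSuccess) {                                     \
      fprintf(stderr, "%s in %s at line %d\n",                     \
              cudaGetErrorString(err_), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

// Usage:
CHECK_CUDA(cudaMalloc((void **) &d_A, size));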
17
CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
– Serial parts in host C code
– Parallel parts in device SPMD kernel code
Serial Code (host)
. . .
. . .
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);
18
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program Multiple Data)
– Each thread has indexes that it uses to compute memory addresses and make
control decisions
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
[Figure: a grid of threads indexed 0, 1, 2, ..., 254, 255, each executing the two lines above on its own element.]
19
Thread Blocks: Scalable Cooperation
– Divide thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and
barrier synchronization
– Threads in different blocks do not interact
[Figure: the thread array divided into Thread Block 0, Thread Block 1, ..., Thread Block N-1; each block holds threads 0-255, and every thread runs
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];]
20
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory
addressing when processing
multidimensional data
– Image processing
– Solving PDEs on volumes
– …
[Figure: a device grid of four blocks, Block (0, 0) through Block (1, 1); Block (1,1) is expanded to show its threads labeled with 3D threadIdx values from (0,0,0) through (1,0,3).]
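To make the multidimensional case concrete, here is a minimal 2D sketch (the kernel and parameter names are ours; the next webinar covers multi-dimensional programs in detail):

__global__ void scaleImage(float *img, int width, int height, float s)
{
  int col = blockIdx.x * blockDim.x + threadIdx.x;  // x maps to column
  int row = blockIdx.y * blockDim.y + threadIdx.y;  // y maps to row
  if (col < width && row < height)                  // guard partial edge blocks
    img[row * width + col] *= s;
}

// Hypothetical launch with 2D blocks and a 2D grid covering the image:
// dim3 DimBlock(16, 16, 1);
// dim3 DimGrid((width - 1)/16 + 1, (height - 1)/16 + 1, 1);
// scaleImage<<<DimGrid, DimBlock>>>(d_img, width, height, 2.0f);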
21
Example: Vector Addition Kernel
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n)
{
int i = threadIdx.x+blockDim.x*blockIdx.x;
if(i<n) C[i] = A[i] + B[i];
}
Device Code
22
Example: Vector Addition Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
// d_A, d_B, d_C allocations and copies omitted
// Run ceil(n/256.0) blocks of 256 threads each
vecAddKernel<<<ceil(n/256.0),256>>>(d_A, d_B, d_C, n);
}
Host Code
The ceiling function makes sure that there
are enough threads to cover all elements.
23
More on Kernel Launch (Host Code)
void vecAdd(float* h_A, float* h_B, float* h_C, int n)
{
dim3 DimGrid((n-1)/256 + 1, 1, 1);
dim3 DimBlock(256, 1, 1);
vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}
Host Code
This is an equivalent way to express the ceiling function: because integer
division truncates, (n-1)/256 + 1 equals ceil(n/256.0) for any n > 0.
24
Kernel execution in a nutshell
__host__
void vecAdd(…)
{
  dim3 DimGrid(ceil(n/256.0), 1, 1);
  dim3 DimBlock(256, 1, 1);
  vecAddKernel<<<DimGrid,DimBlock>>>(d_A, d_B, d_C, n);
}

__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) C[i] = A[i] + B[i];
}

[Figure: the launch creates a grid of thread blocks Blk 0 ... Blk N-1 that execute the kernel on the GPU, which reads and writes device RAM (memories M0 ... Mk).]
25
More on CUDA Function Declarations
− __global__ defines a kernel function
− Each “__” consists of two underscore characters
− A kernel function must return void
− __device__ and __host__ can be used together
− __host__ is optional if used alone
Declaration                    | Executed on the: | Only callable from the:
__device__ float DeviceFunc()  | device           | device
__global__ void KernelFunc()   | device           | host
__host__ float HostFunc()      | host             | host
26
CUDA Device Query
– Enumerates properties of CUDA devices in your system
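A minimal sketch of such a query using the CUDA runtime API (the output formatting is ours):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
  int count = 0;
  cudaGetDeviceCount(&count);                 // number of CUDA devices present
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);      // fill in the property struct
    printf("Device %d: %s\n", dev, prop.name);
    printf("  Global memory:       %zu bytes\n", prop.totalGlobalMem);
    printf("  Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("  Max threads/block:   %d\n", prop.maxThreadsPerBlock);
  }
  return 0;
}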
27
Next Webinar
– Learn to write, compile and run a multi-dimensional CUDA program
– Learn some CUDA optimization techniques
– Memory optimization
– Compute optimization
– Transfer optimization
GPU Teaching Kit
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.
Accelerated Computing
Naga Vydyanathan: nvydyanathan@nvidia.com
NVIDIA Developer Zone: https://developer.nvidia.com/