5. FUNCTION SPECIFIERS
Function specifiers denote whether a function executes on the host or on the device,
and whether it is callable from the host or from the device
__global__ void kernel ( )
__device__ void device_func ( )
__host__ void host_func ( )
specifier     executes on   callable from
__global__    device        host
__device__    device        device
__host__      host          host
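As a sketch of how the three specifiers combine in one program (the function names here are illustrative, not from the slides):

```cuda
#include <cstdio>

// __device__: executes on the device, callable only from device code
__device__ float square(float x) { return x * x; }

// __global__: executes on the device, callable from the host via <<<...>>>
__global__ void squares(float *out)
{
    out[threadIdx.x] = square((float)threadIdx.x);
}

// __host__: executes on the host, callable only from host code
// (this is also the default for functions with no specifier)
__host__ void launch(float *d_out)
{
    squares<<<1, 32>>>(d_out);
}
```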
6. COMPILING PROCESS
Separate the source code into host code and device code
NVCC continues to process the device code (compiling it to PTX)
The host code is passed to a C++ compiler
The two are then linked into an executable file
[Diagram: a .cu file is split into CPU code and GPU code; the host code (.cpp) goes to the C++ compiler, the device code becomes .ptx, and the host linker combines them into the executable]
7. PROGRAMMING MODEL
[Diagram: CPU (host) and GPU (device) each with their own memory; the GPU acts as a coprocessor, running GPU code while CPU code runs on the host]
A CUDA program typically:
1. Allocates GPU memory
2. Copies data from CPU to GPU
3. Launches the kernel on the GPU
4. Copies results from GPU to CPU
C function   CUDA C function
malloc       cudaMalloc
memcpy       cudaMemcpy
memset       cudaMemset
free         cudaFree
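The four steps above map directly onto these calls; a minimal sketch (error checking omitted, names illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 256;
    int h_data[256] = {0};
    int *d_data;

    cudaMalloc(&d_data, n * sizeof(int));                                // allocate GPU memory
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice); // data: CPU to GPU
    addOne<<<1, n>>>(d_data, n);                                         // launch kernel on GPU
    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost); // data: GPU to CPU
    cudaFree(d_data);                                                    // free GPU memory

    printf("h_data[0] = %d\n", h_data[0]);
    return 0;
}
```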
8. OUTLINE
CUDA Programming and Execution Model
CUDA Memory Architecture
CUDA Exception List
CUDA Debugging
CUDA Terminology
10. MEMORY TYPE
declaration            scope    lifetime      location
variable in kernel     thread   kernel        register
array in kernel        thread   kernel        local
__shared__ in kernel   block    kernel        shared
__device__             grid     application   global
__constant__           grid     application   constant
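The rows of the table correspond to declarations like these (a sketch; the variable names are illustrative):

```cuda
__constant__ float scale;      // constant memory: grid scope, application lifetime
__device__   float bias;       // global memory: grid scope, application lifetime

__global__ void kernel(float *out)
{
    float v = (float)threadIdx.x;  // scalar in kernel: thread scope, usually a register
    float buf[8];                  // array in kernel: thread scope, may live in local memory
    __shared__ float tile[64];     // shared memory: block scope, kernel lifetime

    tile[threadIdx.x] = v * scale + bias;
    __syncthreads();
    buf[0] = tile[threadIdx.x];
    out[threadIdx.x] = buf[0];
}
```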
18. CUDA EXCEPTION LIST
Exception                        Reported at
illegal address                  Device / Warp / Lane
stack overflow                   Device / Warp / Lane
illegal instruction              Warp
out-of-range address             Warp
misaligned address               Warp / Lane
invalid address space            Warp
invalid PC                       Warp
warp assert                      Warp
syscall error                    Lane
invalid managed memory access    Host thread
19. ILLEGAL ADDRESS
Device
This occurs when a thread accesses an illegal (out-of-bounds) global address
Warp
This occurs when a thread accesses an illegal (out-of-bounds) global/local/shared address
Lane
Precise (requires memcheck on)
This occurs when a thread accesses an illegal (out-of-bounds) global address
20. STACK OVERFLOW
Device
This occurs when the application triggers a global hardware stack overflow
The main cause of this error is large amounts of divergence in the presence of
function calls
Warp
This occurs when any thread in a warp triggers a hardware stack overflow
Lane
This occurs when a thread exceeds its stack memory limit
21. INVALID ADDRESS SPACE
Warp
This occurs when any thread within a warp executes an instruction that
accesses a memory space not permitted for that instruction
22. MISALIGNED ADDRESS
Warp
This occurs when any thread within a warp accesses an address in the local
or shared memory segments that is not correctly aligned
Lane
This occurs when a thread accesses a global address that is not correctly
aligned
23. SYSCALL ERROR
Lane
This occurs when a thread corrupts the heap by invoking free with an
invalid address
(i.e., trying to free the same memory region twice)
24. INVALID MANAGED MEMORY ACCESS
Host thread
This occurs when a host thread attempts to access managed memory
currently used by the GPU
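A sketch of the pattern (whether concurrent host access faults or is transparently migrated depends on the GPU generation and OS; on pre-Pascal hardware it is an error):

```cuda
#include <cuda_runtime.h>

__global__ void touch(int *p) { *p = 42; }

int main()
{
    int *m;
    cudaMallocManaged(&m, sizeof(int));
    touch<<<1, 1>>>(m);
    // Reading *m here, while the kernel may still be running, is exactly the
    // "host thread accesses managed memory currently used by the GPU" case.
    cudaDeviceSynchronize();  // wait for the GPU first, then the access is safe
    int v = *m;
    cudaFree(m);
    return v == 42 ? 0 : 1;
}
```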
25. WARP ASSERT
Warp
This occurs when any thread in the warp hits a device side assertion
#include <assert.h>

__global__ void kernel()
{
    assert(threadIdx.x == 0);  // fails for every thread except thread 0
}
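On the host side, a failed device assertion surfaces as cudaErrorAssert from the next synchronizing call; a sketch of how the kernel above might be launched and the error detected:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    kernel<<<1, 32>>>();  // the kernel above: asserts threadIdx.x == 0
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaErrorAssert)
        printf("device-side assertion failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```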
26. OUTLINE
CUDA Programming and Execution Model
CUDA Memory Architecture
CUDA Exception List
CUDA Debugging
CUDA Terminology
27. CUDA DEBUGGING
1. Kernel Debugging
Inspect the flow and state of kernel execution on the fly
2. Memory Debugging
Trace odd program behavior back to the memory location involved
30. CUDA-GDB
Commands : break, print, run, continue, next, step, quit
A CUDA program contains multiple host threads and many CUDA threads
We can use cuda-gdb to report information about the current focus
31. CUDA INFO / FOCUS
(cuda-gdb) cuda thread lane warp block sm grid device kernel
kernel 1, grid 1027, block (0,0,0), thread (64,0,0), device 0, sm 1, warp 2, lane 0
(cuda-gdb) cuda thread (2)
49. OUTLINE
CUDA Programming and Execution Model
CUDA Memory Architecture
CUDA Exception List
CUDA Debugging
Appendix : CUDA Terminology
50. TERMINOLOGY
Host
CPU and the system memory
Device
GPU and its memory
Kernel
A function that executes on the device; its launch is composed of several thread blocks (a grid)
SM
Streaming Multiprocessor; composed of several SPs and assigned several thread blocks
SP
Streaming Processor (= CUDA Core); executes one thread
51. TERMINOLOGY
Grid
Multiple thread blocks form a grid
Block
Several threads are grouped into a block; threads in the same block can be
synchronized and can communicate with each other via shared memory
Warp
A set of threads that execute the same instruction at the same time
Thread
A CUDA program is executed by many threads; a thread of a warp is called a lane
52. CUDA GUARANTEES
All threads in a thread block run on the same SM at the same time
All threads in the same thread block may cooperate to solve a sub-problem
Threads in different thread blocks do not cooperate
All blocks in a kernel finish before any block from the next kernel runs