IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)
- 2. The “New” Moore’s Law
Computers no longer get faster, just wider
You must re-think your algorithms to be parallel !
Data-parallel computing is most scalable solution
© NVIDIA Corporation 2008
- 3. Enter the GPU
Massive economies of scale
Massively parallel
- 4. Enter CUDA
Scalable parallel programming model
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel computing
- 5. Sound Bite
GPUs + CUDA
=
The Democratization of Parallel Computing
Massively parallel computing has become
a commodity technology
- 8. CUDA: ‘C’ FOR PARALLELISM
Standard C Code:
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
Parallel C Code:
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
- 9. Hierarchy of concurrent threads
Parallel kernels composed of many threads
all threads execute the same sequential program
Threads are grouped into thread blocks
threads in the same block can cooperate
Threads/blocks have unique IDs
[Figure: a thread t; a block b of threads t0 t1 … tB; a kernel foo() composed of many blocks]
- 10. Hierarchical organization
[Figure: each Thread has per-thread local memory; each Block has per-block shared memory and a local barrier; successive kernels (Kernel 0, Kernel 1) share per-device global memory, with a global barrier between them]
- 11. Heterogeneous Programming
CUDA = serial program with parallel kernels, all in C
Serial C code executes in a CPU thread
Parallel kernel C code executes in thread blocks
across multiple processing elements
[Figure: execution alternates between serial code on the CPU and parallel kernels on the GPU]
Serial Code
Parallel Kernel: foo<<< nBlk, nTid >>>(args);
Serial Code
Parallel Kernel: bar<<< nBlk, nTid >>>(args);
- 12. Thread = virtualized scalar processor
Independent thread of execution
has its own PC, variables (registers), processor state, etc.
no implication about how threads are scheduled
CUDA threads might be physical threads
as on NVIDIA GPUs
CUDA threads might be virtual threads
might pick 1 block = 1 physical thread on multicore CPU
- 13. Block = virtualized multiprocessor
Provides programmer flexibility
freely choose processors to fit data
freely customize for each kernel launch
Thread block = a (data) parallel task
all blocks in kernel have the same entry point
but may execute any code they want
Thread blocks of kernel must be independent tasks
program valid for any interleaving of block executions
- 14. Blocks must be independent
Any possible interleaving of blocks should be valid
presumed to run to completion without pre-emption
can run in any order
can run concurrently OR sequentially
Blocks may coordinate but not synchronize
shared queue pointer: OK
shared lock: BAD … can easily deadlock
Independence requirement gives scalability
- 15. Scalable Execution Model
Kernel launched by host
...
Blocks Run on Multiprocessors
[Figure: blocks distributed across multiprocessors; each multiprocessor has a multithreaded instruction unit (MT IU), scalar processors (SP), and its own Shared Memory, all backed by common Device Memory]
- 16. Synchronization & Cooperation
Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …
Blocks coordinate via atomic memory operations
e.g., increment shared queue pointer with atomicInc()
Implicit barrier between dependent kernels
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
- 17. Using per-block shared memory
Variables shared across block:
__shared__ int *begin, *end;
Scratchpad memory:
__shared__ int scratch[blocksize];
scratch[threadIdx.x] = begin[threadIdx.x];
// … compute on scratch values …
begin[threadIdx.x] = scratch[threadIdx.x];
Communicating values between threads
scratch[threadIdx.x] = begin[threadIdx.x];
__syncthreads();
int left = scratch[threadIdx.x - 1];
- 18. Example: Parallel Reduction
Summing up a sequence with 1 thread:
int sum = 0;
for(int i=0; i<N; ++i) sum += x[i];
Parallel reduction builds a summation tree
each thread holds 1 element
stepwise partial sums
N threads need log N steps
one possible approach:
Butterfly pattern
- 20. Parallel Reduction for 1 Block
// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];
// One thread per element
sum[i] = x_i; __syncthreads();
for(int bit=blocksize/2; bit>0; bit/=2)
{
int t=sum[i]+sum[i^bit]; __syncthreads();
sum[i]=t; __syncthreads();
}
// OUTPUT: Every thread now holds sum in sum[i]
- 21. Summing Up CUDA
CUDA = C + a few simple extensions
makes it easy to start writing basic parallel programs
Three key abstractions:
1. hierarchy of parallel threads
2. corresponding levels of synchronization
3. corresponding memory spaces
Supports massive parallelism of manycore GPUs
- 23. More efficient reduction tree
template<class T> __device__ T reduce(T *x)
{
unsigned int i = threadIdx.x;
unsigned int n = blockDim.x;
for(unsigned int offset=n/2; offset>0; offset/=2)
{
if(i < offset) x[i] += x[i + offset];
__syncthreads();
}
// Note that only thread 0 has full sum
return x[i];
}
- 24. Reduction tree execution example
Input (shared memory):
10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
x[i] += x[i+8];  (threads 0–7 active):   8 -2 10  6  0  9  3  7
x[i] += x[i+4];  (threads 0–3 active):   8  7 13 13
x[i] += x[i+2];  (threads 0–1 active):  21 20
x[i] += x[i+1];  (thread 0 active):     41
(inactive elements keep their old values)
Final result: x[0] = 41
- 25. Pattern 1: Fit kernel to the data
Code lets B-thread block reduce B-element array
For larger sequences:
launch N/B blocks to reduce each B-element subsequence
write N/B partial sums to temporary array
repeat until done
We’ll be done in log_B N steps
- 26. Pattern 2: Fit kernel to the machine
For a block size of B:
1) Launch B blocks to sum N/B elements
2) Launch 1 block to combine B partial sums
[Figure: each of 8 blocks reduces its tile 3 1 7 0 4 1 6 3 → pairwise sums 4 7 5 9 → 11 14 → partial sum 25]
Stage 1:
25 25 25 25 25 25 25 25
many blocks