IAP09 CUDA@MIT/6.963 - Lecture 01: GPU Computing using CUDA

      David Luebke
     NVIDIA Research
The “New” Moore’s Law

  - Computers no longer get faster, just wider
  - You must re-think your algorithms to be parallel!
  - Data-parallel computing is the most scalable solution

© NVIDIA Corporation 2008
Enter the GPU

  - Massive economies of scale
  - Massively parallel

© NVIDIA Corporation 2008
Enter CUDA

  - Scalable parallel programming model
  - Minimal extensions to familiar C/C++ environment
  - Heterogeneous serial-parallel computing

© NVIDIA Corporation 2008
Sound Bite

        GPUs + CUDA
             =
  The Democratization of Parallel Computing

  Massively parallel computing has become a commodity technology

© NVIDIA Corporation 2007
MOTIVATION

[Figure: reported application speedups on CUDA GPUs: 146X, 36X, 19X, 17X, 100X, 149X, 47X, 20X, 24X, 30X]
CUDA: ‘C’ FOR PARALLELISM

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

© NVIDIA Corporation 2008
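
For context, a minimal host-side sketch of what the parallel launch above assumes: device allocation, host-to-device copies, the launch itself, and the copy back. The host arrays x_h and y_h are hypothetical names, and error checking is omitted.

    // Hedged sketch (not from the slides): host-side setup for saxpy_parallel.
    // x_h and y_h are hypothetical host arrays of length n.
    float *x_d, *y_d;
    cudaMalloc(&x_d, n * sizeof(float));
    cudaMalloc(&y_d, n * sizeof(float));
    cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, n * sizeof(float), cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;                 // enough 256-thread blocks to cover n
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);

    cudaMemcpy(y_h, y_d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x_d);
    cudaFree(y_d);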
Hierarchy of concurrent threads

  - Parallel kernels are composed of many threads
      - all threads execute the same sequential program
  - Threads are grouped into thread blocks
      - threads in the same block can cooperate
  - Threads and blocks have unique IDs

[Figure: a thread t; a block b containing threads t0 t1 … tB; a kernel foo() made of many blocks]

© NVIDIA Corporation 2008
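
As an illustration of those IDs, a small sketch (not from the slides) that launches a 2D grid of 2D blocks and derives each thread's unique position; the kernel name and sizes are arbitrary.

    // Hedged sketch: unique IDs from blockIdx, blockDim, and threadIdx.
    __global__ void index_demo(int *out, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column of this thread
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row of this thread
        out[y * width + x] = y * width + x;              // one unique ID per thread
    }

    // Example launch: a 4x4 grid of 16x16 blocks covers a 64x64 array.
    // dim3 grid(4, 4), block(16, 16);
    // index_demo<<<grid, block>>>(d_out, 64);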
Hierarchical organization

[Figure: each thread has per-thread local memory; each block has per-block shared memory with a local (per-block) barrier; the device has per-device global memory, with a global barrier between dependent kernels (Kernel 0, Kernel 1)]

© NVIDIA Corporation 2008
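
To make the memory spaces in that diagram concrete, a small sketch (not from the slides) showing where variables live; the names are arbitrary.

    // Hedged sketch: one variable in each level of the hierarchy above.
    __device__ int global_counter;                // per-device global memory

    __global__ void memory_spaces(const int *data)
    {
        int local_val = data[threadIdx.x];        // per-thread local (a register)
        __shared__ int block_buf[256];            // per-block shared memory
        block_buf[threadIdx.x] = local_val;
        __syncthreads();                          // local barrier: the whole block waits here
        if (threadIdx.x == 0)
            atomicAdd(&global_counter, block_buf[0]);   // global memory persists across kernels
    }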
Heterogeneous Programming

  - CUDA = serial program with parallel kernels, all in C
      - Serial C code executes in a CPU thread
      - Parallel kernel C code executes in thread blocks across multiple processing elements

[Figure: serial code alternates with parallel kernel launches:
    foo<<< nBlk, nTid >>>(args);  ... serial code ...  bar<<< nBlk, nTid >>>(args);]

© NVIDIA Corporation 2008
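
A runnable sketch of that interleaving (not from the slides): two trivial kernels stand in for foo and bar, and the launch parameters are arbitrary.

    // Hedged sketch: serial CPU code interleaved with parallel kernel launches.
    __global__ void foo(float *d) { d[threadIdx.x] *= 2.0f; }
    __global__ void bar(float *d) { d[threadIdx.x] += 1.0f; }

    int main()
    {
        float *d;
        cudaMalloc(&d, 256 * sizeof(float));
        cudaMemset(d, 0, 256 * sizeof(float));

        // ... serial code runs in the CPU thread ...
        foo<<<1, 256>>>(d);               // parallel kernel on the device
        // ... more serial code ...
        bar<<<1, 256>>>(d);               // kernels on the same stream run in order

        cudaDeviceSynchronize();          // wait for the device before exiting
        cudaFree(d);
        return 0;
    }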
Thread = virtualized scalar processor

  - Independent thread of execution
      - has its own PC, variables (registers), processor state, etc.
      - no implication about how threads are scheduled
  - CUDA threads might be physical threads
      - as on NVIDIA GPUs
  - CUDA threads might be virtual threads
      - might pick 1 block = 1 physical thread on a multicore CPU

© NVIDIA Corporation 2008
Block = virtualized multiprocessor

  - Provides programmer flexibility
      - freely choose processors to fit the data
      - freely customize for each kernel launch
  - Thread block = a (data-)parallel task
      - all blocks in a kernel have the same entry point
      - but may execute any code they want
  - Thread blocks of a kernel must be independent tasks
      - program valid for any interleaving of block executions

© NVIDIA Corporation 2008
Blocks must be independent

  - Any possible interleaving of blocks should be valid
      - presumed to run to completion without pre-emption
      - can run in any order
      - can run concurrently OR sequentially
  - Blocks may coordinate but not synchronize
      - shared queue pointer: OK
      - shared lock: BAD … can easily deadlock
  - Independence requirement gives scalability

© NVIDIA Corporation 2008
Scalable Execution Model

[Figure: a kernel launched by the host produces many blocks; blocks run on the GPU's multiprocessors, each with a multithreaded instruction unit (MT IU), scalar processors (SP), and per-multiprocessor shared memory, all backed by device memory]

© NVIDIA Corporation 2008
Synchronization & Cooperation

  - Threads within a block may synchronize with barriers
        … Step 1 …
        __syncthreads();
        … Step 2 …

  - Blocks coordinate via atomic memory operations
      - e.g., increment a shared queue pointer with atomicInc()

  - Implicit barrier between dependent kernels
        vec_minus<<<nblocks, blksize>>>(a, b, c);
        vec_dot<<<nblocks, blksize>>>(c, c);

© NVIDIA Corporation 2008
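
As a concrete sketch of coordination without synchronization (not from the slides), the blocks below claim work items through a global counter; atomicAdd is used in place of atomicInc for simplicity, and the names are arbitrary.

    // Hedged sketch: blocks coordinate through a shared queue pointer.
    __device__ unsigned int queue_head;                 // next unclaimed work item

    __global__ void drain_queue(int *items, unsigned int n_items)
    {
        __shared__ unsigned int my_item;
        for (;;) {
            if (threadIdx.x == 0)
                my_item = atomicAdd(&queue_head, 1u);   // claim one item per block
            __syncthreads();                            // broadcast the claim to the block
            if (my_item >= n_items) break;              // queue drained: all threads exit
            // ... all threads in the block cooperate on items[my_item] ...
            __syncthreads();                            // finish before re-claiming
        }
    }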
Using per-block shared memory

  - Variables shared across a block
        __shared__ int *begin, *end;

  - Scratchpad memory
        __shared__ int scratch[blocksize];
        scratch[threadIdx.x] = begin[threadIdx.x];
        // … compute on scratch values …
        begin[threadIdx.x] = scratch[threadIdx.x];

  - Communicating values between threads
        scratch[threadIdx.x] = begin[threadIdx.x];
        __syncthreads();
        int left = scratch[threadIdx.x - 1];

© NVIDIA Corporation 2008
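
Putting those fragments together, a complete kernel sketch (not from the slides): within one block, each thread stages one element in shared memory, the block synchronizes, and then each thread reads its left neighbour's value. blocksize, in, and out are assumed names.

    #define blocksize 256                 // assumed compile-time block size

    // Hedged sketch: per-block shared memory used to exchange neighbouring values.
    __global__ void adjacent_diff(const int *in, int *out)
    {
        __shared__ int scratch[blocksize];
        int i = threadIdx.x;
        scratch[i] = in[i];               // stage one element per thread
        __syncthreads();                  // make every write visible to the block
        if (i > 0)
            out[i] = scratch[i] - scratch[i - 1];
        else
            out[0] = scratch[0];          // thread 0 has no left neighbour
    }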
Example: Parallel Reduction

  - Summing up a sequence with 1 thread:
        int sum = 0;
        for(int i=0; i<N; ++i) sum += x[i];

  - Parallel reduction builds a summation tree
      - each thread holds 1 element
      - stepwise partial sums
      - N threads need log N steps
      - one possible approach: butterfly pattern

© NVIDIA Corporation 2008
Parallel Reduction for 1 Block

// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];

// One thread per element
sum[i] = x_i; __syncthreads();

for(int bit=blocksize/2; bit>0; bit/=2)
{
   int t=sum[i]+sum[i^bit]; __syncthreads();
   sum[i]=t;                __syncthreads();
}
// OUTPUT: Every thread now holds sum in sum[i]
© NVIDIA Corporation 2008
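
For completeness, a sketch (not from the slides) wrapping the butterfly code above in a kernel that loads x_i from global memory and writes the block total once; reduce_block and result are assumed names, and blocksize is a compile-time constant as before.

    // Hedged sketch: the butterfly reduction as a complete single-block kernel.
    __global__ void reduce_block(const int *x, int *result)
    {
        __shared__ int sum[blocksize];
        int i = threadIdx.x;
        sum[i] = x[i];                      // one thread loads one element (x_i)
        __syncthreads();

        for (int bit = blocksize/2; bit > 0; bit /= 2)
        {
            int t = sum[i] + sum[i ^ bit];  __syncthreads();
            sum[i] = t;                     __syncthreads();
        }
        if (i == 0)
            *result = sum[0];               // every thread holds the total; write it once
    }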
Summing Up CUDA

  - CUDA = C + a few simple extensions
      - makes it easy to start writing basic parallel programs
  - Three key abstractions:
      1. hierarchy of parallel threads
      2. corresponding levels of synchronization
      3. corresponding memory spaces
  - Supports massive parallelism of manycore GPUs

© NVIDIA Corporation 2008
More efficient reduction tree

    template<class T> __device__ T reduce(T *x)
    {
        unsigned int i = threadIdx.x;
        unsigned int n = blockDim.x;

        for(unsigned int offset=n/2; offset>0; offset/=2)
        {
            if(i < offset) x[i] += x[i + offset];
            __syncthreads();
        }

        // Note that only thread 0 has the full sum
        return x[i];
    }

© NVIDIA Corporation 2008
Reduction tree execution example

    Input (shared memory):        10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2

    x[i] += x[i+8];  (threads 0–7 active)
                                   8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2

    x[i] += x[i+4];  (threads 0–3 active)
                                   8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

    x[i] += x[i+2];  (threads 0–1 active)
                                  21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

    x[i] += x[i+1];  (thread 0 active)
                                  41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

    Final result: x[0] = 41

© NVIDIA Corporation 2008
Pattern 1: Fit kernel to the data

  - The code lets a B-thread block reduce a B-element array
  - For larger sequences (sketched below):
      - launch N/B blocks to reduce each B-element subsequence
      - write the N/B partial sums to a temporary array
      - repeat until done
  - We’ll be done in log_B N steps

© NVIDIA Corporation 2008
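
A host-side sketch of that multi-pass pattern (not from the slides). It assumes a hypothetical kernel reduce_blocks(in, out, n) in which each B-thread block writes one partial sum, plus device arrays d_in and d_out large enough for the data.

    // Hedged sketch: repeatedly launch block-sized reductions until one value remains.
    const int B = 256;                        // assumed block size
    int remaining = N;                        // N elements start in d_in
    while (remaining > 1) {
        int nblocks = (remaining + B - 1) / B;
        reduce_blocks<<<nblocks, B>>>(d_in, d_out, remaining);   // hypothetical kernel
        int *tmp = d_in; d_in = d_out; d_out = tmp;              // partial sums feed the next pass
        remaining = nblocks;
    }
    // After about log_B N passes, d_in[0] holds the total.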
Pattern 2: Fit kernel to the machine

  - For a block size of B:
      1) Launch B blocks to sum N/B elements
      2) Launch 1 block to combine the B partial sums

[Figure, Stage 1 (many blocks): each block builds a summation tree over its subsequence, e.g. 3 1 7 0 4 1 6 3 → 4 7 5 9 → 11 14 → 25]