NVIDIA GPU Technology
Siggraph Asia 2010

                        Introduction to CUDA C
                        Samuel Gateau | Seoul | December 16, 2010
                        Who should you thank for this talk?
                          Jason Sanders


                          Senior Software Engineer, NVIDIA


                          Co-author of CUDA by Example
                        What is CUDA?
                           CUDA Architecture
                             — Expose general-purpose GPU computing as first-class capability
                             — Retain traditional DirectX/OpenGL graphics performance


                           CUDA C
                             — Based on industry-standard C
                             — A handful of language extensions to allow heterogeneous programs
                             — Straightforward APIs to manage devices, memory, etc.


                           This talk will introduce you to CUDA C
                        Introduction to CUDA C

                           What will you learn today?
                              — Start from "Hello, World!"
                             — Write and launch CUDA C kernels
                             — Manage GPU memory
                             — Run parallel kernels in CUDA C
                             — Parallel communication and synchronization
                             — Race conditions and atomic operations
                        CUDA C Prerequisites


                           You (probably) need experience with C or C++


                           You do not need any GPU experience


                           You do not need any graphics experience


                           You do not need any parallel programming experience
                        CUDA C: The Basics


                          Terminology
                            Host – The CPU and its memory (host memory)
                            Device – The GPU and its memory (device memory)
                                     [Figure: Host (CPU and host memory) beside Device (GPU and device memory); note: figure not to scale]
                        Hello, World!


                              #include <stdio.h>

                              int main( void ) {
                                  printf( "Hello, World!\n" );
                                  return 0;
                              }

                          This basic program is just standard C that runs on the host

                          NVIDIA’s compiler (nvcc) will not complain about CUDA programs
                           with no device code

                          At its simplest, CUDA C is just C!
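
                          A quick build note (not on the original slides): assuming the source is saved as hello.cu, a hypothetical filename, it compiles with NVIDIA's compiler driver:

                              nvcc hello.cu -o hello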
                        Hello, World! with Device Code

                             __global__ void kernel( void ) {
                             }


                             int main( void ) {
                                 kernel<<<1,1>>>();
                                  printf( "Hello, World!\n" );
                                 return 0;
                             }


                           Two notable additions to the original "Hello, World!"
                        Hello, World! with Device Code

                               __global__ void kernel( void ) {
                               }


                            CUDA C keyword __global__ indicates that a function
                              — Runs on the device
                              — Called from host code


                            nvcc splits source file into host and device components
                              — NVIDIA’s compiler handles device functions like kernel()
                              — Standard host compiler handles host functions like main()
                                   gcc
                                   Microsoft Visual C
                        Hello, World! with Device Code

                              int main( void ) {
                                   kernel<<< 1, 1 >>>();
                                    printf( "Hello, World!\n" );
                                   return 0;
                              }


                           Triple angle brackets mark a call from host code to device code
                               — Sometimes called a "kernel launch"
                              — We’ll discuss the parameters inside the angle brackets later


                           This is all that’s required to execute a function on the GPU!

                           The function kernel() does nothing, so this is fairly anticlimactic…
                        A More Complex Example


                           A simple kernel to add two integers:

                                 __global__ void add( int *a, int *b, int *c ) {
                                         *c = *a + *b;
                                 }



                           As before,    __global__    is a CUDA C keyword meaning
                             — add()   will execute on the device
                             — add()   will be called from the host
                        A More Complex Example


                           Notice that we use pointers for our variables:

                                 __global__ void add( int *a, int *b, int *c ) {
                                     *c = *a + *b;
                                 }

                           add() runs on the device…so a, b, and c must point to
                            device memory


                           How do we allocate memory on the GPU?
                        Memory Management
                          Host and device memory are distinct entities
                            — Device pointers point to GPU memory
                                May be passed to and from host code
                                May not be dereferenced from host code

                            — Host pointers point to CPU memory
                                May be passed to and from device code
                                May not be dereferenced from device code



                          Basic CUDA API for dealing with device memory
                            — cudaMalloc(), cudaFree(), cudaMemcpy()
                            — Similar to their C equivalents, malloc(), free(), memcpy()
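
                          In real code it pays to check these calls: each returns a cudaError_t. A minimal sketch of a checking wrapper, the CHECK macro being an illustrative assumption rather than part of the slides:

                              #include <stdio.h>
                              #include <stdlib.h>

                              // Hypothetical convenience macro: abort with a message on any CUDA error.
                              #define CHECK( call )                                   \
                                  do {                                                \
                                      cudaError_t err = (call);                       \
                                      if( err != cudaSuccess ) {                      \
                                          printf( "CUDA error: %s\n",                 \
                                                  cudaGetErrorString( err ) );        \
                                          exit( 1 );                                  \
                                      }                                               \
                                  } while( 0 )

                              // Usage: CHECK( cudaMalloc( (void**)&dev_a, size ) );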
                        A More Complex Example: add()

                           Using our add() kernel:

                               __global__ void add( int *a, int *b, int *c ) {
                                  *c = *a + *b;
                               }


                           Let’s take a look at main()…
                        A More Complex Example: main()
                          int main( void ) {
                             int a, b, c;                   // host copies of a, b, c
                             int *dev_a, *dev_b, *dev_c;    // device copies of a, b, c
                             int size = sizeof( int );      // we need space for an integer


                             // allocate device copies of a, b, c
                             cudaMalloc( (void**)&dev_a, size );
                             cudaMalloc( (void**)&dev_b, size );
                             cudaMalloc( (void**)&dev_c, size );


                             a = 2;
                             b = 7;
                        A More Complex Example: main() (cont)
                               // copy inputs to device
                               cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
                               cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );


                               // launch add() kernel on GPU, passing parameters
                               add<<< 1, 1 >>>( dev_a, dev_b, dev_c );


                               // copy device result back to host copy of c
                               cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );


                               cudaFree( dev_a );
                               cudaFree( dev_b );
                               cudaFree( dev_c );
                               return 0;
                           }
                        Parallel Programming in CUDA C
                           But wait…GPU computing is about massive parallelism

                           So how do we run code in parallel on the device?

                           Solution lies in the parameters between the triple angle brackets:

                                       add<<< 1, 1 >>>( dev_a, dev_b, dev_c );


                                       add<<< N, 1 >>>( dev_a, dev_b, dev_c );


                           Instead of executing add() once, add() is executed N times in parallel
                        Parallel Programming in CUDA C
                           With add() running in parallel…let’s do vector addition

                           Terminology: Each parallel invocation of add() is referred to as a block

                           Kernel can refer to its block’s index with the variable blockIdx.x

                           Each block adds a value from a[] and b[], storing the result in c[]:

                                __global__ void add( int *a, int *b, int *c ) {
                                    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
                                }

                           By using blockIdx.x to index arrays, each block handles different indices
                        Parallel Programming in CUDA C
                           We write this code:
                                 __global__ void add( int *a, int *b, int *c ) {
                                     c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
                                 }

                           This is what runs in parallel on the device:
                                    Block 0:   c[0] = a[0] + b[0];
                                    Block 1:   c[1] = a[1] + b[1];
                                    Block 2:   c[2] = a[2] + b[2];
                                    Block 3:   c[3] = a[3] + b[3];
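
                          For contrast, the same work on the host would be a single serial loop, one iteration per block above; a sketch, not from the slides:

                              // Serial host equivalent: each loop iteration does one block's work.
                              void add_serial( int *a, int *b, int *c, int n ) {
                                  for( int i = 0; i < n; i++ )
                                      c[i] = a[i] + b[i];
                              }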
                        Parallel Addition: add()

                          Using our newly parallelized add() kernel:


                              __global__ void add( int *a, int *b, int *c ) {
                                  c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
                              }


                          Let’s take a look at main()…
                        Parallel Addition: main()
                          #define N   512
                          int main( void ) {
                             int *a, *b, *c;                   // host copies of a, b, c
                             int *dev_a, *dev_b, *dev_c;       // device copies of a, b, c
                             int size = N * sizeof( int );     // we need space for 512 integers

                             // allocate device copies of a, b, c
                             cudaMalloc( (void**)&dev_a, size );
                             cudaMalloc( (void**)&dev_b, size );
                             cudaMalloc( (void**)&dev_c, size );

                             a = (int*)malloc( size );
                             b = (int*)malloc( size );
                             c = (int*)malloc( size );

                             random_ints( a, N );
                             random_ints( b, N );
                        Parallel Addition: main() (cont)
                               // copy inputs to device
                               cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
                               cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );


                               // launch add() kernel with N parallel blocks
                               add<<< N, 1 >>>( dev_a, dev_b, dev_c );


                               // copy device result back to host copy of c
                               cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );


                               free( a ); free( b ); free( c );
                               cudaFree( dev_a );
                               cudaFree( dev_b );
                               cudaFree( dev_c );
                               return 0;
                           }
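
                          These listings call random_ints(), which the slides never define. A plausible host-side sketch, assuming the helper simply fills an array with random values:

                              #include <stdlib.h>

                              // Assumed helper: fill x[0..n-1] with random integers.
                              void random_ints( int *x, int n ) {
                                  for( int i = 0; i < n; i++ )
                                      x[i] = rand();
                              }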
                        Review
                          Difference between "host" and "device"
                            — Host = CPU
                            — Device = GPU


                          Using __global__ to declare a function as device code
                            — Runs on device
                            — Called from host


                          Passing parameters from host code to a device function
                        Review (cont)
                           Basic device memory management
                             — cudaMalloc()
                             — cudaMemcpy()
                             — cudaFree()


                           Launching parallel kernels
                             — Launch N copies of add() with: add<<< N, 1 >>>();
                             — Used blockIdx.x to access block’s index
                        Threads
                          Terminology: A block can be split into parallel threads


                          Let’s change vector addition to use parallel threads instead of parallel blocks:

                              __global__ void add( int *a, int *b, int *c ) {
                                     c[ threadIdx.x ] = a[ threadIdx.x ] + b[ threadIdx.x ];
                              }


                          We use threadIdx.x instead of blockIdx.x in add()


                          main() will require one change as well…
                        Parallel Addition (Threads): main()
                           #define N   512
                           int main( void ) {
                               int *a, *b, *c;                        //host copies of a, b, c
                               int *dev_a, *dev_b, *dev_c;            //device copies of a, b, c
                               int size = N * sizeof( int );          //we need space for 512 integers


                               // allocate device copies of a, b, c
                               cudaMalloc( (void**)&dev_a, size );
                               cudaMalloc( (void**)&dev_b, size );
                               cudaMalloc( (void**)&dev_c, size );


                               a = (int*)malloc( size );
                               b = (int*)malloc( size );
                               c = (int*)malloc( size );


                               random_ints( a, N );
                               random_ints( b, N );
                        Parallel Addition (Threads): main() (cont)
                              // copy inputs to device
                              cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
                              cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );


                              // launch add() kernel with N threads (previously: N blocks)
                              add<<< 1, N >>>( dev_a, dev_b, dev_c );


                              // copy device result back to host copy of c
                              cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );


                              free( a ); free( b );   free( c );
                              cudaFree( dev_a );
                              cudaFree( dev_b );
                              cudaFree( dev_c );
                              return 0;
                          }
                        Using Threads And Blocks
                          We’ve seen parallel vector addition using
                            — Many blocks with 1 thread apiece
                            — 1 block with many threads


                          Let’s adapt vector addition to use lots of both blocks and threads


                          After using threads and blocks together, we'll talk about why we bother with threads at all


                          First let’s discuss data indexing…
                        Indexing Arrays With Threads And Blocks
                          No longer as simple as just using threadIdx.x or blockIdx.x as indices


                          To index an array with 1 thread per entry (using 8 threads/block):

                                 threadIdx.x     threadIdx.x      threadIdx.x      threadIdx.x
                              0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

                               blockIdx.x = 0   blockIdx.x = 1   blockIdx.x = 2   blockIdx.x = 3



                          If we have M threads/block, a unique array index for each entry is given by

                                      int index = threadIdx.x + blockIdx.x * M;

                                      // analogous to 2D indexing: index = x + y * width
                        Indexing Arrays: Example
                           In this example, the red entry would have an index of 21:

                                [Figure: array entries 0 through 21 grouped into blocks of M = 8 threads; the highlighted entry is thread 5 of block 2 (blockIdx.x = 2)]

                                    int index = threadIdx.x + blockIdx.x * M;
                                              = 5 + 2 * 8;
                                              = 21;
                        Addition with Threads and Blocks
                           blockDim.x is a built-in variable giving the number of threads per block:
                                int index = threadIdx.x + blockIdx.x * blockDim.x;


                           A combined version of our vector addition kernel to use blocks and threads:
                               __global__ void add( int *a, int *b, int *c ) {
                                    int index = threadIdx.x + blockIdx.x * blockDim.x;
                                    c[index] = a[index] + b[index];
                               }


                           So what changes in main() when we use both blocks and threads?
                        Parallel Addition (Blocks/Threads): main()
                           #define N   (2048*2048)
                           #define THREADS_PER_BLOCK 512
                           int main( void ) {
                               int *a, *b, *c;                        // host copies of a, b, c
                               int *dev_a, *dev_b, *dev_c;            // device copies of a, b, c
                               int size = N * sizeof( int );          // we need space for N integers


                               // allocate device copies of a, b, c
                               cudaMalloc( (void**)&dev_a, size );
                               cudaMalloc( (void**)&dev_b, size );
                               cudaMalloc( (void**)&dev_c, size );


                               a = (int*)malloc( size );
                               b = (int*)malloc( size );
                               c = (int*)malloc( size );


                               random_ints( a, N );
                               random_ints( b, N );
                        Parallel Addition (Blocks/Threads): main() (cont)
                              // copy inputs to device
                              cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
                              cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );


                              // launch add() kernel with blocks and threads
                              add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );


                              // copy device result back to host copy of c
                              cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );


                              free( a ); free( b ); free( c );
                              cudaFree( dev_a );
                              cudaFree( dev_b );
                              cudaFree( dev_c );
                              return 0;
                          }
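
                          One caveat: this launch assumes N is an exact multiple of THREADS_PER_BLOCK, as it is here. A common guarded variant for arbitrary N, sketched rather than taken from the slides:

                              // Guarded kernel: pass n and skip threads past the end of the arrays.
                              __global__ void add( int *a, int *b, int *c, int n ) {
                                  int index = threadIdx.x + blockIdx.x * blockDim.x;
                                  if( index < n )
                                      c[index] = a[index] + b[index];
                              }

                              // Launch with the block count rounded up:
                              // add<<< (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
                              //        THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c, N );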
                        Why Bother With Threads?
                           Threads seem unnecessary
                             — Added a level of abstraction and complexity
                             — What did we gain?



                           Unlike parallel blocks, parallel threads have mechanisms to
                             — Communicate
                             — Synchronize



                           Let’s see how…
                        Dot Product
                          Unlike vector addition, dot product is a reduction from vectors to a scalar

                                 [Figure: each component ai is multiplied by bi and the four products are summed into c]

                                                  c = a · b
                                                  c = (a0, a1, a2, a3) · (b0, b1, b2, b3)
                                                  c = a0 b0 + a1 b1 + a2 b2 + a3 b3
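
                          As a reference point, not from the slides, the serial host version is one accumulating loop:

                              // Serial host reference: dot product of two n-element vectors.
                              int dot_serial( int *a, int *b, int n ) {
                                  int c = 0;
                                  for( int i = 0; i < n; i++ )
                                      c += a[i] * b[i];
                                  return c;
                              }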
                        Dot Product
                          Parallel threads have no problem computing the pairwise products:

                                 [Figure: each thread computes one pairwise product ai * bi; the products still need to be summed]
                          So we can start a dot product CUDA kernel by doing just that:
                                  __global__ void dot( int *a, int *b, int *c )   {
                                      // Each thread computes a pairwise product
                                      int temp = a[threadIdx.x] * b[threadIdx.x];
                        Dot Product
                          But we need to share data between threads to compute the final sum:

                                 [Figure: the pairwise products feed one final sum, which requires sharing data across threads]
                               __global__ void dot( int *a, int *b, int *c )   {
                                   // Each thread computes a pairwise product
                                   int temp = a[threadIdx.x] * b[threadIdx.x];

                                    // Can’t compute the final sum
                                    // Each thread’s copy of ‘temp’ is private
                               }
                        Sharing Data Between Threads
                         Terminology: A block of threads shares memory called…shared memory


                         Extremely fast, on-chip memory (user-managed cache)


                         Declared with the __shared__ CUDA keyword


                         Not visible to threads in other blocks running in parallel
                               [Figure: Block 0, Block 1, Block 2, …: each block has its own threads and its own shared memory; no block can see another block's shared memory]
                        Parallel Dot Product: dot()
                           We perform parallel multiplication, serial addition:

                           #define N 512
                           __global__ void dot( int *a, int *b, int *c ) {
                                 // Shared memory for results of multiplication
                                 __shared__ int temp[N];
                                 temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

                                 // Thread 0 sums the pairwise products
                                 if( 0 == threadIdx.x ) {
                                     int sum = 0;
                                     for( int i = 0; i < N; i++ )
                                         sum += temp[i];
                                     *c = sum;
                                 }
                           }
                        Parallel Dot Product Recap
                           We perform parallel, pairwise multiplications


                           Shared memory stores each thread’s result


                           We sum these pairwise products from a single thread


                           Sounds good…but we’ve made a huge mistake
                        Faulty Dot Product Exposed!
                          Step 1: In parallel, each thread writes its pairwise product into __shared__ int temp[]

                          Step 2: Thread 0 reads the products back out of temp[] and sums them

                          But there’s an assumption hidden in Step 1…
                        Read-Before-Write Hazard
                          Suppose thread 0 finishes its write in step 1

                          Then thread 0 reads index 12 in step 2…

                          …but what if thread 12 has not yet written to index 12 in step 1?

                          This read returns garbage!
                        Synchronization
                           We need threads to wait between the sections of dot():
                            __global__ void dot( int *a, int *b, int *c ) {
                                 __shared__ int temp[N];
                                 temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

                                 // * NEED THREADS TO SYNCHRONIZE HERE *
                                 // No thread can advance until all threads
                                 // have reached this point in the code

                                 // Thread 0 sums the pairwise products
                                 if( 0 == threadIdx.x ) {
                                     int sum = 0;
                                     for( int i = 0; i < N; i++ )
                                         sum += temp[i];
                                     *c = sum;
                                 }
                            }
                        __syncthreads()
                          We can synchronize threads with the function __syncthreads()


                          Threads in the block wait until all threads have hit the __syncthreads()
                                 [Figure: Threads 0 through 4 reach __syncthreads() at different times; each waits at the barrier until every thread in the block has arrived]




                          Threads are only synchronized within a block
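
                          A related caveat the slides do not cover: every thread in the block must reach the same __syncthreads(). A barrier placed inside a branch that only some threads take can hang the kernel:

                              // Hazardous pattern (sketch): threads with threadIdx.x >= 32 never
                              // reach this barrier, so the threads that do may wait forever.
                              if( threadIdx.x < 32 ) {
                                  __syncthreads();
                              }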
                        Parallel Dot Product: dot()
                          __global__ void dot( int *a, int *b, int *c ) {
                                __shared__ int temp[N];
                               temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

                                __syncthreads();

                                if( 0 == threadIdx.x ) {
                                    int sum = 0;
                                    for( int i = 0; i < N; i++ )
                                        sum += temp[i];
                                    *c = sum;
                                }
                          }

                          With a properly synchronized dot() routine, let’s look at main()
NVIDIA GPU Technology
Siggraph Asia 2010      Parallel Dot Product: main()
                            #define N   512
                            int main( void ) {
                                 int *a, *b, *c;                        // host copies of a, b, c
                                int *dev_a, *dev_b, *dev_c;            // device copies of a, b, c
                                int size = N * sizeof( int );          // we need space for 512 integers


                                // allocate device copies of a, b, c
                                cudaMalloc( (void**)&dev_a, size );
                                cudaMalloc( (void**)&dev_b, size );
                                cudaMalloc( (void**)&dev_c, sizeof( int ) );


                                a = (int *)malloc( size );
                                b = (int *)malloc( size );
                                c = (int *)malloc( sizeof( int ) );


                                random_ints( a, N );
                                random_ints( b, N );
                        Parallel Dot Product: main() (cont)
                               // copy inputs to device
                               cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
                               cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );


                               // launch dot() kernel with 1 block and N threads
                               dot<<< 1, N >>>( dev_a, dev_b, dev_c );


                               // copy device result back to host copy of c
                               cudaMemcpy( c, dev_c, sizeof( int ) , cudaMemcpyDeviceToHost );


                               free( a ); free( b ); free( c );
                               cudaFree( dev_a );
                               cudaFree( dev_b );
                               cudaFree( dev_c );
                               return 0;
                           }
                        Review
                         Launching kernels with parallel threads
                           — Launch add() with N threads: add<<< 1, N >>>();
                           — Used threadIdx.x to access thread’s index


                         Using both blocks and threads
                           — Used (threadIdx.x + blockIdx.x * blockDim.x) to index input/output
                           — N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads total
                        Review (cont)
                         Using __shared__ to declare memory as shared memory
                           — Data shared among threads in a block
                           — Not visible to threads in other parallel blocks


                         Using __syncthreads() as a barrier
                           — No thread executes instructions after __syncthreads() until all
                             threads have reached the __syncthreads()
                           — Needs to be used to prevent data hazards
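
                         The dot() kernel above leaves the whole summation to thread 0. A common refinement, sketched here rather than taken from the slides, is a shared-memory tree reduction built from the same __shared__ and __syncthreads() primitives; it assumes the thread count N is a power of two:

                              __global__ void dot( int *a, int *b, int *c ) {
                                  __shared__ int temp[N];
                                  temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
                                  __syncthreads();

                                  // Halve the number of active threads each pass; each survivor
                                  // folds one partner's partial sum into its own slot.
                                  for( int stride = blockDim.x / 2; stride > 0; stride /= 2 ) {
                                      if( threadIdx.x < stride )
                                          temp[threadIdx.x] += temp[threadIdx.x + stride];
                                      __syncthreads();
                                  }

                                  if( 0 == threadIdx.x )
                                      *c = temp[0];
                              }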
                        Multiblock Dot Product
                           Recall our dot product launch:
                                // launch dot() kernel with 1 block and N threads
                                dot<<< 1, N >>>( dev_a, dev_b, dev_c );



                           Launching with one block will not utilize much of the GPU


                           Let’s write a multiblock version of dot product
                        Multiblock Dot Product: Algorithm
                          Each block computes a sum of its pairwise products like before:

                                 [Figure: Block 0 forms the products a0*b0, a1*b1, a2*b2, a3*b3, … and folds them into its own sum; Block 1 does the same for a512*b512, a513*b513, …; and so on for the remaining blocks]
                        Multiblock Dot Product: Algorithm
                          And then contributes its sum to the final result:

                                 [Figure: each block's partial sum is then added into the single result c]
                        Multiblock Dot Product: dot()
                             #define N (2048*2048)
                             #define THREADS_PER_BLOCK 512
                             __global__ void dot( int *a, int *b, int *c ) {
                                 __shared__ int temp[THREADS_PER_BLOCK];
                                 int index = threadIdx.x + blockIdx.x * blockDim.x;
                                 temp[threadIdx.x] = a[index] * b[index];

                                 __syncthreads();

                                 if( 0 == threadIdx.x ) {
                                     int sum = 0;
                                     for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                                         sum += temp[i];
                                      *c += sum;
                                 }
                             }

                          But we have a race condition…
                          We can fix it with one of CUDA’s atomic operations
                        Race Conditions
                          Terminology: A race condition occurs when program behavior depends upon
                           relative timing of two (or more) event sequences

                          What actually takes place to execute the line in question: *c += sum;
                             — Read value at address c
                             — Add sum to value
                             — Write result to address c
                             (Terminology: this sequence is a Read-Modify-Write)


                          What if two threads are trying to do this at the same time?
                                  Thread 0, Block 0                   Thread 0, Block 1
                                     — Read value at address c            — Read value at address c
                                     — Add sum to value                   — Add sum to value
                                     — Write result to address c          — Write result to address c
                        Global Memory Contention
                         [Timeline, correct serialization: Block 0 (sum = 3) reads c = 0, computes 0 + 3 = 3, and writes c = 3; only then does Block 1 (sum = 4) read c = 3, compute 3 + 4 = 7, and write c = 7. Final value of c: 7]
                        Global Memory Contention
                         [Timeline, interleaved: Block 0 (sum = 3) reads c = 0; before it writes, Block 1 (sum = 4) also reads c = 0. Block 0 writes c = 3; Block 1 then writes c = 4, overwriting it. Final value of c: 4, and Block 0's contribution is lost]
                        Atomic Operations
                           Terminology: A read-modify-write sequence is uninterruptible when atomic

                          Many atomic operations on memory available with CUDA C
                                 atomicAdd()        atomicInc()
                                 atomicSub()        atomicDec()
                                 atomicMin()        atomicExch()
                                 atomicMax()        atomicCAS()

                           Predictable results when simultaneous access to memory is required

                          We need to atomically add sum to c in our multiblock dot product
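
                          For reference, the integer overload of atomicAdd() returns the value stored at the address before the addition:

                              // int atomicAdd( int *address, int val );
                              int old = atomicAdd( c, sum );   // adds sum to *c, uninterruptibly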
                        Multiblock Dot Product: dot()
                            __global__ void dot( int *a, int *b, int *c ) {
                                 __shared__ int temp[THREADS_PER_BLOCK];
                                 int index = threadIdx.x + blockIdx.x * blockDim.x;
                                 temp[threadIdx.x] = a[index] * b[index];

                                 __syncthreads();

                                 if( 0 == threadIdx.x ) {
                                     int sum = 0;
                                     for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                                         sum += temp[i];
                                     atomicAdd( c , sum );
                                 }
                             }


                          Now let’s fix up main() to handle a multiblock dot product
                        Parallel Dot Product: main()
                           #define N (2048*2048)
                           #define THREADS_PER_BLOCK 512
                           int main( void ) {
                               int *a, *b, *c;                       // host copies of a, b, c
                               int *dev_a, *dev_b, *dev_c;           // device copies of a, b, c
                               int size = N * sizeof( int );         // we need space for N ints

                               // allocate   device copies of a, b, c
                               cudaMalloc(   (void**)&dev_a, size );
                               cudaMalloc(   (void**)&dev_b, size );
                               cudaMalloc(   (void**)&dev_c, sizeof( int ) );

                               a = (int *)malloc( size );
                               b = (int *)malloc( size );
                               c = (int *)malloc( sizeof( int ) );

                               random_ints( a, N );
                               random_ints( b, N );
                        Parallel Dot Product: main() (cont)
                               // copy inputs to device
                               cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
                               cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );


                               // launch dot() kernel
                               dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );


                               // copy device result back to host copy of c
                               cudaMemcpy( c, dev_c, sizeof( int ) , cudaMemcpyDeviceToHost );


                               free( a ); free( b ); free( c );
                               cudaFree( dev_a );
                               cudaFree( dev_b );
                               cudaFree( dev_c );
                               return 0;
                           }
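
                          One subtlety worth flagging: cudaMalloc() does not zero memory, yet every block atomicAdd()s into *dev_c, so in practice the accumulator should be cleared before the launch. A one-line sketch of the fix, not on the slides:

                              // Zero the accumulator so the atomicAdd()s start from 0.
                              cudaMemset( dev_c, 0, sizeof( int ) );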
                        Review
                          Race conditions
                            — Behavior depends upon relative timing of multiple event sequences
                            — Can occur when an implied read-modify-write is interruptible


                          Atomic operations
                            — CUDA provides read-modify-write operations guaranteed to be atomic
                            — Atomics ensure correct results when multiple threads modify memory
                        To Learn More CUDA C
                           Check out CUDA by Example
                             — Parallel Programming in CUDA C
                             — Thread Cooperation
                             — Constant Memory and Events
                             — Texture Memory
                             — Graphics Interoperability
                             — Atomics
                             — Streams
                             — CUDA C on Multiple GPUs
                             — Other CUDA Resources



                           For sale here at GTC

                           http://developer.nvidia.com/object/cuda-by-example.html
                        Questions
                           First my questions


                           Now your questions…

 
Pycon2014 GPU computing
Pycon2014 GPU computingPycon2014 GPU computing
Pycon2014 GPU computing
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
 
Cuda
CudaCuda
Cuda
 
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
“A New, Open-standards-based, Open-source Programming Model for All Accelerat...
 
CUDA by Example : The Final Countdown : Notes
CUDA by Example : The Final Countdown : NotesCUDA by Example : The Final Countdown : Notes
CUDA by Example : The Final Countdown : Notes
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 

[02][cuda c 프로그래밍 소개] gateau intro to_cuda_c

                                 return 0;
                             }

                          Two notable additions to the original "Hello, World!"
  • 9. Hello, World! with Device Code

         __global__ void kernel( void ) {
         }

      CUDA C keyword __global__ indicates that a function
        — Runs on the device
        — Is called from host code
      nvcc splits the source file into host and device components
        — NVIDIA's compiler handles device functions like kernel()
        — The standard host compiler handles host functions like main()
             gcc
             Microsoft Visual C
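      Both halves live in the same .cu source file, and compiling is a single step with nvcc (for example, nvcc hello.cu -o hello); the resulting executable runs like any other host program.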
  • 10. Hello, World! with Device Code

         int main( void ) {
             kernel<<< 1, 1 >>>();
             printf( "Hello, World!\n" );
             return 0;
         }

      Triple angle brackets mark a call from host code to device code
        — Sometimes called a "kernel launch"
        — We'll discuss the parameters inside the angle brackets later
      This is all that's required to execute a function on the GPU!
      The function kernel() does nothing, so this is fairly anticlimactic…
  • 11. A More Complex Example

      A simple kernel to add two integers:

         __global__ void add( int *a, int *b, int *c ) {
             *c = *a + *b;
         }

      As before, __global__ is a CUDA C keyword meaning
        — add() will execute on the device
        — add() will be called from the host
  • 12. A More Complex Example

      Notice that we use pointers for our variables:

         __global__ void add( int *a, int *b, int *c ) {
             *c = *a + *b;
         }

      add() runs on the device…so a, b, and c must point to device memory
      How do we allocate memory on the GPU?
  • 13. Memory Management

      Host and device memory are distinct entities
        — Device pointers point to GPU memory
             May be passed to and from host code
             May not be dereferenced from host code
        — Host pointers point to CPU memory
             May be passed to and from device code
             May not be dereferenced from device code
      Basic CUDA API for dealing with device memory
        — cudaMalloc(), cudaFree(), cudaMemcpy()
        — Similar to their C equivalents, malloc(), free(), memcpy()
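      The slides omit error handling for brevity, but each of these API calls returns a cudaError_t that is worth checking in real code. A minimal sketch (the CHECK macro is our own convenience wrapper, not part of the CUDA API):

         #include <stdio.h>
         #include <stdlib.h>
         #include <cuda_runtime.h>

         // Our own convenience macro: print a readable message and abort
         // if any CUDA runtime call fails.
         #define CHECK(call)                                                 \
             do {                                                            \
                 cudaError_t err = (call);                                   \
                 if (err != cudaSuccess) {                                   \
                     fprintf( stderr, "CUDA error: %s at %s:%d\n",           \
                              cudaGetErrorString( err ), __FILE__, __LINE__ ); \
                     exit( 1 );                                              \
                 }                                                           \
             } while (0)

         int main( void ) {
             int *dev_p;
             CHECK( cudaMalloc( (void**)&dev_p, sizeof( int ) ) );
             CHECK( cudaFree( dev_p ) );
             return 0;
         }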
  • 14. A More Complex Example: add()

      Using our add() kernel:

         __global__ void add( int *a, int *b, int *c ) {
             *c = *a + *b;
         }

      Let's take a look at main()…
  • 15. A More Complex Example: main()

         int main( void ) {
             int a, b, c;                  // host copies of a, b, c
             int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
             int size = sizeof( int );     // we need space for an integer

             // allocate device copies of a, b, c
             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, size );

             a = 2;
             b = 7;
  • 16. A More Complex Example: main() (cont)

             // copy inputs to device
             cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );

             // launch add() kernel on GPU, passing parameters
             add<<< 1, 1 >>>( dev_a, dev_b, dev_c );

             // copy device result back to host copy of c
             cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );

             cudaFree( dev_a );
             cudaFree( dev_b );
             cudaFree( dev_c );
             return 0;
         }
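      Slides 11, 15, and 16 assemble into one complete program; a runnable sketch (the final printf is our addition, so the result is actually visible):

         #include <stdio.h>

         __global__ void add( int *a, int *b, int *c ) {
             *c = *a + *b;
         }

         int main( void ) {
             int a = 2, b = 7, c;
             int *dev_a, *dev_b, *dev_c;
             int size = sizeof( int );

             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, size );

             cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );

             add<<< 1, 1 >>>( dev_a, dev_b, dev_c );

             cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );
             printf( "2 + 7 = %d\n", c );   // our addition: show the result

             cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
             return 0;
         }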
  • 17. Parallel Programming in CUDA C

      But wait…GPU computing is about massive parallelism
      So how do we run code in parallel on the device?
      The solution lies in the parameters between the triple angle brackets:

         add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
         add<<< N, 1 >>>( dev_a, dev_b, dev_c );

      Instead of executing add() once, add() is executed N times in parallel
  • 18. Parallel Programming in CUDA C

      With add() running in parallel…let's do vector addition
      Terminology: Each parallel invocation of add() is referred to as a block
      A kernel can refer to its block's index with the variable blockIdx.x
      Each block adds a value from a[] and b[], storing the result in c[]:

         __global__ void add( int *a, int *b, int *c ) {
             c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
         }

      By using blockIdx.x to index arrays, each block handles different indices
  • 19. Parallel Programming in CUDA C

      We write this code:

         __global__ void add( int *a, int *b, int *c ) {
             c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
         }

      This is what runs in parallel on the device:

         Block 0:  c[0] = a[0] + b[0];
         Block 1:  c[1] = a[1] + b[1];
         Block 2:  c[2] = a[2] + b[2];
         Block 3:  c[3] = a[3] + b[3];
  • 20. Parallel Addition: add()

      Using our newly parallelized add() kernel:

         __global__ void add( int *a, int *b, int *c ) {
             c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
         }

      Let's take a look at main()…
  • 21. Parallel Addition: main()

         #define N 512

         int main( void ) {
             int *a, *b, *c;               // host copies of a, b, c
             int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
             int size = N * sizeof( int ); // we need space for 512 integers

             // allocate device copies of a, b, c
             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, size );

             a = (int*)malloc( size );
             b = (int*)malloc( size );
             c = (int*)malloc( size );

             random_ints( a, N );
             random_ints( b, N );
  • 22. Parallel Addition: main() (cont)

             // copy inputs to device
             cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

             // launch add() kernel with N parallel blocks
             add<<< N, 1 >>>( dev_a, dev_b, dev_c );

             // copy device result back to host copy of c
             cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

             free( a ); free( b ); free( c );
             cudaFree( dev_a );
             cudaFree( dev_b );
             cudaFree( dev_c );
             return 0;
         }
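      random_ints() is used here but never shown on the slides; a plausible helper (our own sketch, not from the deck) that fills an array with small random values:

         #include <stdlib.h>

         // Our own sketch of the helper the slides assume.
         void random_ints( int *p, int n ) {
             for( int i = 0; i < n; i++ )
                 p[i] = rand() % 100;
         }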
  • 23. Review

      Difference between "host" and "device"
        — Host = CPU
        — Device = GPU
      Using __global__ to declare a function as device code
        — Runs on the device
        — Called from the host
      Passing parameters from host code to a device function
  • 24. Review (cont)

      Basic device memory management
        — cudaMalloc()
        — cudaMemcpy()
        — cudaFree()
      Launching parallel kernels
        — Launch N copies of add() with: add<<< N, 1 >>>();
        — Used blockIdx.x to access the block's index
  • 25. Threads

      Terminology: A block can be split into parallel threads
      Let's change vector addition to use parallel threads instead of parallel blocks:

         __global__ void add( int *a, int *b, int *c ) {
             c[ threadIdx.x ] = a[ threadIdx.x ] + b[ threadIdx.x ];
         }

      We use threadIdx.x instead of blockIdx.x in add()
      main() will require one change as well…
  • 26. Parallel Addition (Threads): main()

         #define N 512

         int main( void ) {
             int *a, *b, *c;               // host copies of a, b, c
             int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
             int size = N * sizeof( int ); // we need space for 512 integers

             // allocate device copies of a, b, c
             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, size );

             a = (int*)malloc( size );
             b = (int*)malloc( size );
             c = (int*)malloc( size );

             random_ints( a, N );
             random_ints( b, N );
  • 27. Parallel Addition (Threads): main() (cont)

             // copy inputs to device
             cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

             // launch add() kernel with N threads
             add<<< 1, N >>>( dev_a, dev_b, dev_c );

             // copy device result back to host copy of c
             cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

             free( a ); free( b ); free( c );
             cudaFree( dev_a );
             cudaFree( dev_b );
             cudaFree( dev_c );
             return 0;
         }
  • 28. Using Threads And Blocks

      We've seen parallel vector addition using
        — Many blocks with 1 thread apiece
        — 1 block with many threads
      Let's adapt vector addition to use lots of both blocks and threads
      After using threads and blocks together, we'll talk about why threads
      First let's discuss data indexing…
  • 29. Indexing Arrays With Threads And Blocks

      No longer as simple as just using threadIdx.x or blockIdx.x as indices
      To index an array with 1 thread per entry (using 8 threads/block):

         threadIdx.x:  0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
         blockIdx.x:          0                 1                 2                 3

      If we have M threads/block, a unique array index for each entry is given by

         int index = threadIdx.x + blockIdx.x * M;

      (the same pattern as 2D indexing: int index = x + y * width;)
  • 30. Indexing Arrays: Example

      In this example, the red entry (thread 5 of block 2) would have an index of 21:

         M = 8 threads/block, blockIdx.x = 2

         int index = threadIdx.x + blockIdx.x * M;
                   = 5 + 2 * 8;
                   = 21;
  • 31. Addition with Threads and Blocks

      blockDim.x is a built-in variable giving the number of threads per block:

         int index = threadIdx.x + blockIdx.x * blockDim.x;

      A combined version of our vector addition kernel using both blocks and threads:

         __global__ void add( int *a, int *b, int *c ) {
             int index = threadIdx.x + blockIdx.x * blockDim.x;
             c[index] = a[index] + b[index];
         }

      So what changes in main() when we use both blocks and threads?
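      The slides assume N is an exact multiple of the block size; a common extension (our own variant, adding an n parameter, not shown on the slides) guards against out-of-range indices when it is not:

         __global__ void add( int *a, int *b, int *c, int n ) {
             int index = threadIdx.x + blockIdx.x * blockDim.x;
             if( index < n )               // skip threads past the end of the arrays
                 c[index] = a[index] + b[index];
         }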
  • 32. Parallel Addition (Blocks/Threads): main()

         #define N (2048*2048)
         #define THREADS_PER_BLOCK 512

         int main( void ) {
             int *a, *b, *c;               // host copies of a, b, c
             int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
             int size = N * sizeof( int ); // we need space for N integers

             // allocate device copies of a, b, c
             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, size );

             a = (int*)malloc( size );
             b = (int*)malloc( size );
             c = (int*)malloc( size );

             random_ints( a, N );
             random_ints( b, N );
  • 33. Parallel Addition (Blocks/Threads): main() (cont)

             // copy inputs to device
             cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

             // launch add() kernel with blocks and threads
             add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

             // copy device result back to host copy of c
             cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );

             free( a ); free( b ); free( c );
             cudaFree( dev_a );
             cudaFree( dev_b );
             cudaFree( dev_c );
             return 0;
         }
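      With the guarded add() sketched after slide 31, the matching launch rounds the block count up so every element is covered even when N is not a multiple of the block size (a sketch under the same assumption):

         // Ceiling division: enough blocks to cover all N elements.
         int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
         add<<< blocks, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c, N );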
  • 34. Why Bother With Threads?

      Threads seem unnecessary
        — They add a level of abstraction and complexity
        — What did we gain?
      Unlike parallel blocks, parallel threads have mechanisms to
        — Communicate
        — Synchronize
      Let's see how…
  • 35. Dot Product

      Unlike vector addition, dot product is a reduction from vectors to a scalar

         c = a ∙ b
         c = (a0, a1, a2, a3) ∙ (b0, b1, b2, b3)
         c = a0 b0 + a1 b1 + a2 b2 + a3 b3
  • 36. Dot Product

      Parallel threads have no problem computing the pairwise products ai * bi
      So we can start a dot product CUDA kernel by doing just that:

         __global__ void dot( int *a, int *b, int *c ) {
             // Each thread computes a pairwise product
             int temp = a[threadIdx.x] * b[threadIdx.x];
  • 37. Dot Product

      But we need to share data between threads to compute the final sum:

         __global__ void dot( int *a, int *b, int *c ) {
             // Each thread computes a pairwise product
             int temp = a[threadIdx.x] * b[threadIdx.x];

             // Can't compute the final sum:
             // each thread's copy of 'temp' is private
         }
  • 38. Sharing Data Between Threads

      Terminology: A block of threads shares memory called…shared memory
      Extremely fast, on-chip memory (user-managed cache)
      Declared with the __shared__ CUDA keyword
      Not visible to threads in other blocks running in parallel
        (diagram: Blocks 0, 1, 2, … each have their own threads and their own shared memory)
  • 39. Parallel Dot Product: dot()

      We perform parallel multiplication, serial addition:

         #define N 512

         __global__ void dot( int *a, int *b, int *c ) {
             // Shared memory for results of multiplication
             __shared__ int temp[N];
             temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

             // Thread 0 sums the pairwise products
             if( 0 == threadIdx.x ) {
                 int sum = 0;
                 for( int i = 0; i < N; i++ )
                     sum += temp[i];
                 *c = sum;
             }
         }
  • 40. Parallel Dot Product Recap

      We perform parallel, pairwise multiplications
      Shared memory stores each thread's result
      We sum these pairwise products from a single thread
      Sounds good…but we've made a huge mistake
  • 41. Faulty Dot Product Exposed!

      Step 1: In parallel, each thread writes its pairwise product into __shared__ int temp[]
      Step 2: Thread 0 reads the products from temp[] and sums them
      But there's an assumption hidden in Step 1…
  • 42. Read-Before-Write Hazard

      Suppose thread 0 finishes its write in Step 1
      Then thread 0 reads index 12 in Step 2
      But what if thread 12 hasn't yet written to index 12 in Step 1?
        — This read returns garbage!
  • 43. Synchronization

      We need threads to wait between the sections of dot():

         __global__ void dot( int *a, int *b, int *c ) {
             __shared__ int temp[N];
             temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

             // * NEED THREADS TO SYNCHRONIZE HERE *
             // No thread can advance until all threads
             // have reached this point in the code

             // Thread 0 sums the pairwise products
             if( 0 == threadIdx.x ) {
                 int sum = 0;
                 for( int i = 0; i < N; i++ )
                     sum += temp[i];
                 *c = sum;
             }
         }
  • 44. __syncthreads()

      We can synchronize threads with the function __syncthreads()
      Threads in the block wait until all threads have hit the __syncthreads()
        (diagram: threads 0, 1, 2, 3, 4, … each pause at __syncthreads() until every thread in the block arrives)
      Threads are only synchronized within a block
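      One caveat worth noting (not on the slide): every thread in the block must reach the same __syncthreads() call, so placing it inside a branch that only some threads take leads to undefined behavior. A sketch of the anti-pattern and the fix:

         // WRONG: only some threads reach the barrier;
         // the block can hang or behave unpredictably.
         if( threadIdx.x < 16 )
             __syncthreads();

         // RIGHT: every thread reaches the same barrier, then diverges afterwards.
         __syncthreads();
         if( threadIdx.x < 16 ) {
             // ... work for the first 16 threads ...
         }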
  • 45. Parallel Dot Product: dot()

         __global__ void dot( int *a, int *b, int *c ) {
             __shared__ int temp[N];
             temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

             __syncthreads();

             if( 0 == threadIdx.x ) {
                 int sum = 0;
                 for( int i = 0; i < N; i++ )
                     sum += temp[i];
                 *c = sum;
             }
         }

      With a properly synchronized dot() routine, let's look at main()
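      The loop in which thread 0 sums N products is serial; a standard refinement (not on the slides) is a shared-memory tree reduction. A minimal sketch, assuming N (the block size) is a power of two, replacing everything after the first __syncthreads():

         // Halve the active range each step; each active thread folds in
         // one partner's value. After the loop, temp[0] holds the full sum.
         for( int stride = N / 2; stride > 0; stride /= 2 ) {
             if( threadIdx.x < stride )
                 temp[threadIdx.x] += temp[threadIdx.x + stride];
             __syncthreads();   // every thread reaches this barrier each iteration
         }
         if( 0 == threadIdx.x )
             *c = temp[0];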
  • 46. Parallel Dot Product: main()

         #define N 512

         int main( void ) {
             int *a, *b, *c;               // host copies of a, b, c
             int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
             int size = N * sizeof( int ); // we need space for 512 integers

             // allocate device copies of a, b, c
             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, sizeof( int ) );

             a = (int *)malloc( size );
             b = (int *)malloc( size );
             c = (int *)malloc( sizeof( int ) );

             random_ints( a, N );
             random_ints( b, N );
  • 47. Parallel Dot Product: main() (cont)

             // copy inputs to device
             cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

             // launch dot() kernel with 1 block and N threads
             dot<<< 1, N >>>( dev_a, dev_b, dev_c );

             // copy device result back to host copy of c
             cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );

             free( a ); free( b ); free( c );
             cudaFree( dev_a );
             cudaFree( dev_b );
             cudaFree( dev_c );
             return 0;
         }
  • 48. Review

      Launching kernels with parallel threads
        — Launch add() with N threads: add<<< 1, N >>>();
        — Used threadIdx.x to access the thread's index
      Using both blocks and threads
        — Used (threadIdx.x + blockIdx.x * blockDim.x) to index input/output
        — N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads total
  • 49. Review (cont)

      Using __shared__ to declare memory as shared memory
        — Data is shared among threads in a block
        — Not visible to threads in other parallel blocks
      Using __syncthreads() as a barrier
        — No thread executes instructions after __syncthreads() until all threads have reached the __syncthreads()
        — Needed to prevent data hazards
  • 50. Multiblock Dot Product

      Recall our dot product launch:

         // launch dot() kernel with 1 block and N threads
         dot<<< 1, N >>>( dev_a, dev_b, dev_c );

      Launching with one block will not utilize much of the GPU
      Let's write a multiblock version of dot product
  • 51. Multiblock Dot Product: Algorithm

      Each block computes a sum of its pairwise products like before:

         Block 0:  sum of a0*b0 + a1*b1 + a2*b2 + a3*b3 + …
         Block 1:  sum of a512*b512 + a513*b513 + a514*b514 + a515*b515 + …
         …
  • 52. Multiblock Dot Product: Algorithm

      And then each block contributes its sum to the final result:

         Block 0's sum, Block 1's sum, … all accumulate into c
  • 53. Multiblock Dot Product: dot()

         #define N (2048*2048)
         #define THREADS_PER_BLOCK 512

         __global__ void dot( int *a, int *b, int *c ) {
             __shared__ int temp[THREADS_PER_BLOCK];
             int index = threadIdx.x + blockIdx.x * blockDim.x;
             temp[threadIdx.x] = a[index] * b[index];

             __syncthreads();

             if( 0 == threadIdx.x ) {
                 int sum = 0;
                 for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                     sum += temp[i];
                 *c += sum;
             }
         }

      But we have a race condition…
      We can fix it with one of CUDA's atomic operations
  • 54. Race Conditions

      Terminology: A race condition occurs when program behavior depends upon relative timing of two (or more) event sequences
      What actually takes place to execute the line in question: *c += sum;
        — Read value at address c
        — Add sum to value
        — Write result to address c
        (Terminology: Read-Modify-Write)
      What if two threads are trying to do this at the same time?
        — Thread 0, Block 0: read value at address c; add sum to value; write result to address c
        — Thread 0, Block 1: read value at address c; add sum to value; write result to address c
  • 55. Global Memory Contention

      With sum = 3 in Block 0 and sum = 4 in Block 1, one possible ordering of *c += sum serializes the two read-modify-writes and gives the right answer:

         Block 0: reads 0, computes 0 + 3 = 3, writes 3
         Block 1: (then)  reads 3, computes 3 + 4 = 7, writes 7

         c over time: 0 … 3 … 7   (correct)
  • 56. Global Memory Contention

      But if the two read-modify-write sequences interleave, one update is lost:

         Block 0: reads 0, computes 0 + 3 = 3, writes 3
         Block 1: reads 0, computes 0 + 4 = 4, writes 4

         c over time: 0 … 3 … 4   (wrong: Block 0's contribution is overwritten)
  • 57. Atomic Operations

      Terminology: A read-modify-write is uninterruptible when atomic
      Many atomic operations on memory are available with CUDA C
        — atomicAdd()      atomicInc()
        — atomicSub()      atomicDec()
        — atomicMin()      atomicExch()
        — atomicMax()      atomicCAS()
      Predictable result when simultaneous access to memory is required
      We need to atomically add sum to c in our multiblock dot product
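      atomicCAS() (compare-and-swap) is the most general of these; a minimal sketch (our own, not from the slides) of how it can build an atomic operation CUDA does not provide directly, here an atomic multiply on an int:

         // Our own sketch: retry the compare-and-swap until no other
         // thread has changed *addr between our read and our write.
         __device__ int atomicMul( int *addr, int val ) {
             int old = *addr, assumed;
             do {
                 assumed = old;
                 // Swap in assumed*val only if *addr still equals assumed;
                 // atomicCAS returns the value actually found at *addr.
                 old = atomicCAS( addr, assumed, assumed * val );
             } while( assumed != old );
             return old;
         }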
  • 58. Multiblock Dot Product: dot()

         __global__ void dot( int *a, int *b, int *c ) {
             __shared__ int temp[THREADS_PER_BLOCK];
             int index = threadIdx.x + blockIdx.x * blockDim.x;
             temp[threadIdx.x] = a[index] * b[index];

             __syncthreads();

             if( 0 == threadIdx.x ) {
                 int sum = 0;
                 for( int i = 0; i < THREADS_PER_BLOCK; i++ )
                     sum += temp[i];
                 atomicAdd( c, sum );
             }
         }

      Now let's fix up main() to handle a multiblock dot product
  • 59. Parallel Dot Product: main()

         #define N (2048*2048)
         #define THREADS_PER_BLOCK 512

         int main( void ) {
             int *a, *b, *c;               // host copies of a, b, c
             int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
             int size = N * sizeof( int ); // we need space for N ints

             // allocate device copies of a, b, c
             cudaMalloc( (void**)&dev_a, size );
             cudaMalloc( (void**)&dev_b, size );
             cudaMalloc( (void**)&dev_c, sizeof( int ) );

             a = (int *)malloc( size );
             b = (int *)malloc( size );
             c = (int *)malloc( sizeof( int ) );

             random_ints( a, N );
             random_ints( b, N );
  • 60. Parallel Dot Product: main() (cont)

             // copy inputs to device
             cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
             cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );

             // launch dot() kernel
             dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

             // copy device result back to host copy of c
             cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );

             free( a ); free( b ); free( c );
             cudaFree( dev_a );
             cudaFree( dev_b );
             cudaFree( dev_c );
             return 0;
         }
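      One detail the slides gloss over: dot() now accumulates into *dev_c with atomicAdd(), so the accumulator should be zeroed before the launch, or the sum starts from whatever the allocation happened to contain. A one-line fix (our addition), placed just before the kernel launch:

         cudaMemset( dev_c, 0, sizeof( int ) );   // start the accumulator at zero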
  • 61. Review

      Race conditions
        — Behavior depends upon relative timing of multiple event sequences
        — Can occur when an implied read-modify-write is interruptible
      Atomic operations
        — CUDA provides read-modify-write operations guaranteed to be atomic
        — Atomics ensure correct results when multiple threads modify memory
  • 62. To Learn More CUDA C

      Check out CUDA by Example
        — Parallel Programming in CUDA C
        — Thread Cooperation
        — Constant Memory and Events
        — Texture Memory
        — Graphics Interoperability
        — Atomics
        — Streams
        — CUDA C on Multiple GPUs
        — Other CUDA Resources
      For sale here at GTC
      http://developer.nvidia.com/object/cuda-by-example.html
  • 63. Questions

      First my questions
      Now your questions…