Introduction to
Parallel Programming With
CUDA & OpenCL


Moayad H. Almohaishi

Graduate student, Computer Science
Louisiana Tech University
mha023@latech.edu



                                     1
Outline
  •    Introduction
  •    Introduction to CUDA
         – Hello World
         – Addition application
         – Array Addition
         – CUDA Memories
         – Matrix Multiplication
         – Performance considerations
  •    Introduction to OpenCL
         – Addition Kernel
         – Differences from the CUDA kernel
         – Setting up the OpenCL host code
  •    Sources and additional resources



                                            2
Introduction

• Why the GPU?
  – Available in almost all new desktops and laptops
  – Many-core
     • 512 cores on the GTX580
  – High floating-point throughput
     • The GTX580 offers a peak performance of ≈1.5 TFLOPS (single
       precision)
  – High memory bandwidth
     • The GTX580 offers 192.4 GB/sec



                                                       3
Introduction to CUDA

• CUDA Architecture
  – The physical technology on the GPU
• CUDA C
  – The programming language to harness the power of
    the CUDA architecture
  – Based on standard C




                                                 4
What do you need to know?

Today:

• You will need some knowledge of C
• You don't need to know about parallel
  programming
• You don't need to know about the CUDA architecture




                                            5
Terminology

• Host
  – The CPU and its dedicated system memory (RAM).


• Device
  – The GPU and its on-board memory




                                               6
C Hello World

int main( void ) {
    printf("Hello World ! \n");
    return 0;
}



    This Hello World C code compiles without
     problems with the NVIDIA CUDA compiler (nvcc).


                                                7
CUDA Kernel

__global__ void kernel( void ){
}

int main( void ) {
    kernel<<<1,1>>>();
    printf("Hello World ! \n");
    return 0;
}


                                  8
CUDA Kernel

__global__ is a keyword that defines the
function as a CUDA kernel.

kernel<<<1,1>>>(); is the command that calls the
CUDA kernel from the host code.

 __global__ void kernel( void ){
 }

 int main( void ) {
     kernel<<<1,1>>>();
     printf("Hello World ! \n");
     return 0;
 }


                                     9
Single Addition on the CPU

float add( float a, float b ){
    return a + b;
}

int main( void ) {
    float a, b, c;
    ... // setting a and b values
    c = add(a, b);
    printf("%f + %f = %f \n", a, b, c);
    return 0;
}                                       10
Single Addition on the GPU

__global__ void add( float *a, float *b, float *c ){
    *c = *a + *b;
}

int main( void ) {
    float *a, *b, *c;
    ... // setting a and b values
    add<<<1,1>>>(a,b,c);
    printf("%f + %f = %f \n", *a, *b, *c);
    return 0;
}                                                  11
Single Addition on the GPU

__global__ void add( float *a, float *b, float *c ){
    *c = *a + *b;
}




                                        ?!
int main( void ) {
    float *a, *b, *c;
    ... // setting a and b values
    add<<<1,1>>>(a,b,c); // c will need to be copied to the host
    printf("%f + %f = %f \n", *a, *b, *c);
    return 0;
}                                                              12
CUDA Global Memory

(The original C memory commands are
malloc(), free(), and memcpy().)

 • To be able to use the GPU memory you will need to:
     – Allocate memory on the GPU using
        • cudaMalloc()
     – Copy the host memory to the device memory using
        • cudaMemcpy()
 • To free the memory:
        • cudaFree()




                                                    13
Single Addition on the GPU

The kernel is correct and will
stay the same:

  __global__ void add( float *a, float *b, float *c ){
      *c = *a + *b;
 }




                                                     14
Single Addition on the GPU

We need to define different variables for the
host and device memories, and allocate
the device memory:

 int main( void ) {
      float h_a, h_b, h_c;
      float *d_a, *d_b, *d_c;
      int size = sizeof(float);

      cudaMalloc((void**) &d_a, size);
      cudaMalloc((void**) &d_b, size);
      cudaMalloc((void**) &d_c, size);

      h_a = 150; h_b = 89;
                                         15
Single Addition on the GPU

Copy memory to and from the device,
then free the device memory:

      cudaMemcpy(d_a, &h_a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, &h_b, size, cudaMemcpyHostToDevice);
      add<<<1,1>>>(d_a,d_b,d_c);
      cudaMemcpy(&h_c, d_c, size, cudaMemcpyDeviceToHost);

      printf("%f + %f = %f \n", h_a, h_b, h_c);
      cudaFree(d_a);
      cudaFree(d_b);
      cudaFree(d_c);
      return 0;
 }
                                                       16
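Putting slides 14-16 together, a complete, compilable version of the
single-addition example might look like this (a minimal sketch; error
checking of the CUDA calls is omitted):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add( float *a, float *b, float *c ){
    *c = *a + *b;   // one thread adds the two values
}

int main( void ) {
    float h_a = 150, h_b = 89, h_c = 0;   // host copies
    float *d_a, *d_b, *d_c;               // device pointers
    int size = sizeof(float);

    cudaMalloc((void**) &d_a, size);
    cudaMalloc((void**) &d_b, size);
    cudaMalloc((void**) &d_c, size);

    cudaMemcpy(d_a, &h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, size, cudaMemcpyHostToDevice);

    add<<<1,1>>>(d_a, d_b, d_c);          // launch 1 block of 1 thread

    cudaMemcpy(&h_c, d_c, size, cudaMemcpyDeviceToHost);
    printf("%f + %f = %f \n", h_a, h_b, h_c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}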
Is this the right thing to do?

• The GPU is about massive parallelism, so running this
  program on the GPU is inefficient; it will run
  slower than the CPU version

• You need large data




                                              17
Array Addition on the CPU

The add function will stay the same.

int main( void ) {
    int n = 512; // 2^9
    float a[n], b[n], c[n];
    ... // setting a and b values
    for (int i=0; i<n; i++){
        c[i] = add(a[i], b[i]);
        printf("%f + %f = %f \n", a[i], b[i], c[i]);
    }
    return 0;
}
                                           18
Array Addition on the GPU

We have to modify the size:

 int main( void ) {
      int n = 512;
      float h_a[n], h_b[n], h_c[n];
      float *d_a, *d_b, *d_c;
      int size = sizeof(float) * n;

      cudaMalloc((void**) &d_a, size);
      cudaMalloc((void**) &d_b, size);
      cudaMalloc((void**) &d_c, size);
      ... // setting the input data h_a and h_b
                                                  19
Array Addition on the GPU

    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    add<<<1,1>>>(d_a,d_b,d_c);
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);




                          ?! (this launch still runs only a single thread)
    printf("%f + %f = %f \n", h_a[0], h_b[0], h_c[0]);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
                                                     20
Blocks

• CUDA runs the kernel as blocks on a grid
  containing n blocks.
• The maximum value of n can differ from device to
  device; the current devices' limit is 65535 blocks per
  grid.

• We will use blockIdx.x to access the block ID from
  the kernel.


                                               21
Array Addition on the GPU (1)

n blocks will run the kernel:

      cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
      add<<<n,1>>>(d_a,d_b,d_c);
      cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

      printf("%f + %f = %f \n", h_a[0], h_b[0], h_c[0]);
      cudaFree(d_a);
      cudaFree(d_b);
      cudaFree(d_c);
      return 0;
 }
                                                        22
Array Addition Kernel (1)

__global__ void add( float *a, float *b, float *c ){
    int idx = blockIdx.x;
    c[idx] = a[idx] + b[idx];
}




                                                   23
Threads

• Each block can contain up to 512 parallel threads
  on the first and second CUDA architectures.
• On the Fermi architecture each block can contain up to
  1024 parallel threads.

• We will use threadIdx.x to access the thread ID
  from the kernel.



                                               24
Array Addition on the GPU

n threads on a single block
will run the kernel:

      cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
      add<<<1,n>>>(d_a,d_b,d_c);
      cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

      printf("%f + %f = %f \n", h_a[0], h_b[0], h_c[0]);
      cudaFree(d_a);
      cudaFree(d_b);
      cudaFree(d_c);
      return 0;
 }
                                                        25
Array Addition Kernel

CUDA runs the threads as half-warps, so
it is more efficient to have at least 16
threads per block.

  __global__ void add( float *a, float *b, float *c ){
        int idx = threadIdx.x;
        c[idx] = a[idx] + b[idx];
 }




                                                     26
MORE

• Is it still massive parallelism?

• What about more than 512 elements?




                                        27
Terminology

• 1D grid
     blockIdx.x = 0     blockIdx.x = 1     blockIdx.x = 2

    0 1 2 3 4 5 6 | 0 1 2 3 4 5 6 | 0 1 2 3 4 5 6        Threads
    (the numbers are threadIdx.x)

                                        BlockSize = 7




                                                       28
global memory access

How do we point each thread to the right
global memory address?

          blockIdx.x = 0     blockIdx.x = 1     blockIdx.x = 2

        0 1 2 3 4 5 6 | 0 1 2 3 4 5 6 | 0 1 2 3 4 5 6         Threads

        0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  Global
                                                               Memory




                                                              29
global memory access

• 1D grid
     blockIdx.x = 0     blockIdx.x = 1     blockIdx.x = 2

    0 1 2 3 4 5 6 | 0 1 2 3 4 5 6 | 0 1 2 3 4 5 6        Threads

                                        BlockSize = 7




  idx = threadIdx.x + blockIdx.x * blockDim.x

                                                       30
Array Addition on the GPU

   cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
   int blockSize = 256;
   int blocks = n/blockSize; // assumes n is a multiple of blockSize
   add<<<blocks,blockSize>>>(d_a,d_b,d_c);
   cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
   //printf("%f + %f = %f \n", a, b, c);
   cudaFree(d_a);
   cudaFree(d_b);
   cudaFree(d_c);
   return 0; }
                                                    31
Array Addition Kernel

__global__ void add( float *a, float *b, float *c ){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}




                                                       32
Exercises



• What is the maximum number of threads that can
  be run on a grid?

• How can we go beyond that limit? (one approach is sketched below)




                                            33
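One common answer to the second question (a sketch, not from the slides)
is a grid-stride loop, where each thread processes several elements, so
an array of any size can be covered by a bounded grid:

__global__ void add( float *a, float *b, float *c, int n ){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    // each thread strides over the array until every element is covered
    for (int i = idx; i < n; i += stride)
        c[i] = a[i] + b[i];
}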
global memory access

How do we point each thread to the right global memory address,
allowing each thread to do 2 computations?
Hint: you need to find the idx formula that covers one memory index and
skips the next one; you will access the second index through idx + 1.

          blockIdx.x = 0     blockIdx.x = 1     blockIdx.x = 2

        0 1 2 3 4 5 6 | 0 1 2 3 4 5 6 | 0 1 2 3 4 5 6         Threads

        0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  Global
                                                               Memory
                                                                     34
global memory access

How do we point each thread to the right global memory address,
allowing each thread to do 2 computations?
Hint: you need to find the idx formula that covers one memory index and
jumps ahead by one blockSize; you will access the second index through
idx + blockDim.x. (Both schemes are sketched after this slide.)

          blockIdx.x = 0     blockIdx.x = 1     blockIdx.x = 2

        0 1 2 3 4 5 6 | 0 1 2 3 4 5 6 | 0 1 2 3 4 5 6         Threads

        0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20  Global
                                                               Memory
                                                                     35
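Sketches of the two hinted schemes (my reading of the hints on slides 34
and 35, not code from the deck; the array length is assumed to be twice
the total thread count):

// Slide 34's scheme: a consecutive pair per thread (idx and idx + 1)
__global__ void add2a( float *a, float *b, float *c ){
    int idx = 2 * (threadIdx.x + blockIdx.x * blockDim.x);
    c[idx]     = a[idx]     + b[idx];
    c[idx + 1] = a[idx + 1] + b[idx + 1];
}

// Slide 35's scheme: a strided pair per thread (idx and idx + blockDim.x);
// neighbouring threads touch neighbouring addresses, so this version keeps
// the accesses coalesced
__global__ void add2b( float *a, float *b, float *c ){
    int idx = threadIdx.x + blockIdx.x * (2 * blockDim.x);
    c[idx]              = a[idx]              + b[idx];
    c[idx + blockDim.x] = a[idx + blockDim.x] + b[idx + blockDim.x];
}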
What you learned

•   Creating a CUDA kernel
•   Calling the kernel from the host
•   Allocating CUDA memory
•   Copying to/from the device memory
•   Freeing the device memory
•   Controlling the number of threads through the block
    size and the number of blocks per grid



                                                 36
Dot Product



         A
              × (multiply the elements)
         B

                  + (sum the products)
                  C



                      37
• If each thread does one multiplication, which thread
  will do the addition?




                                                38
Shared Memory

• The shared memory is very fast memory on the
  GPU chip itself.
• Each block has its own shared memory space.
• It can be declared using the __shared__ CUDA
  keyword.

• To make sure all the threads have finished computing,
  use the CUDA keyword __syncthreads().


                                              39
Dot Product Kernel

__global__ void dotP(int *a, int *b, int *c){
   __shared__ int temp[N];
   temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
   __syncthreads();
   if (threadIdx.x == 0) {
       int sum = 0;
       for (int i = 0; i < N; i++)
           sum += temp[i];
       *c = sum;
   }
}
                                                          40
Exercise




• In this application the addition runs on thread 0
  only. Is that efficient?

• How can we make it better? (one approach is sketched below)




                                                  41
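One standard improvement (a sketch, not from the slides) is a tree
reduction in shared memory, so the summation is also parallel; it assumes
blockDim.x == N and that N is a power of two:

__global__ void dotP(int *a, int *b, int *c){
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();

    // halve the number of active threads each step:
    // log2(N) steps instead of N serial additions
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        *c = temp[0];
}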
Matrix multiplication



                        B




         A              C




                            42
MatrixMul on the GPU

int main( void ) {
   int n = 16;
   float h_a[n][n], h_b[n][n], h_c[n][n];
   float *d_a, *d_b, *d_c;
   int size = sizeof(float) * n * n;

   cudaMalloc((void**) &d_a, size);
   cudaMalloc((void**) &d_b, size);
   cudaMalloc((void**) &d_c, size);
   ... // setting the input data h_a and h_b
                                               43
MatrixMul on the GPU

    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    dim3 blockSize(n, n, 1);
    matrixMul<<<1,blockSize>>>(d_a,d_b,d_c);
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
                                                      44
Simple Matrix Multiplication Kernel

__global__ void matrixMul( float *a, float *b, float *c ){
    int x = threadIdx.x;  // row
    int y = threadIdx.y;  // column
    int n = blockDim.x;   // matrix width
    float temp = 0;
    for (int i = 0; i < n; i++){
        temp += a[x * n + i] * b[i * n + y];
    }
    c[x * n + y] = temp;
}
                                                   45
Exercise




• Use the shared memory to optimize the matrix
  algorithm (hint: look at the code in the SDK;
  a tiled sketch follows below)




                                             46
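For comparison with the SDK code, a minimal sketch of the shared-memory
(tiled) approach; this is my own sketch, assuming n is a multiple of TILE
and a launch of dim3 grid(n/TILE, n/TILE), block(TILE, TILE):

#define TILE 16

__global__ void matrixMulTiled( float *a, float *b, float *c, int n ){
    __shared__ float aTile[TILE][TILE];
    __shared__ float bTile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float temp = 0;

    // walk over the tiles of a and b that contribute to c[row][col]
    for (int t = 0; t < n / TILE; t++) {
        aTile[threadIdx.y][threadIdx.x] = a[row * n + t * TILE + threadIdx.x];
        bTile[threadIdx.y][threadIdx.x] = b[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();   // the tile must be fully loaded before anyone reads it

        for (int i = 0; i < TILE; i++)
            temp += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        __syncthreads();   // everyone is done reading before the next load
    }
    c[row * n + col] = temp;
}

Each element of a and b is now read from global memory n/TILE times
instead of n times, which is the point of the optimization.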
What you learned

• Using the shared memory to share data
  among the threads in a block
• Synchronizing the threads
• Setting a blockSize of more than one dimension
  using dim3




                                             47
Performance Considerations

• For maximum performance:
  – Reduce the global memory accesses.
  – Maximize the occupancy (allow scheduling of 1024
    threads per streaming multiprocessor):
     • use the right blockSize
     • use the right number of registers
     • use the right amount of shared memory
  – Increase the number of independent instructions.
  – Coalesce the memory accesses (a sketch follows below).
  – Use the right instruction:byte ratio.

                                                  48
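To illustrate coalescing (an illustrative sketch, not from the slides):
in the first kernel, consecutive threads touch consecutive addresses, so
the loads of a half-warp combine into few memory transactions; in the
second, each thread strides through memory and the accesses scatter:

// coalesced: thread t of a block reads in[base + t];
// neighbouring threads read neighbouring floats
__global__ void copyCoalesced( float *out, float *in ){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// non-coalesced: each thread reads stride floats apart,
// so neighbouring threads hit distant memory locations
__global__ void copyStrided( float *out, float *in, int stride ){
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}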
Introduction to OpenCL

• OpenCL is an open standard
• Cross-platform; it can run on:
  –   Multi-core CPUs
  –   GPUs (NVIDIA, ATI)
  –   Cell B/E
  –   others
• Close to CUDA




                                49
How the program works

  Host                                      Device (GPU)
  (GPU kernel code; memory: A[] B[] C[])    (stream processors; memory: A[] B[] C[])

  • Allocating the memory in the host
  • Initializing data in the memory objects
  • Allocating the memory in the device (GPU)
  • Copying the data from host to device
  • Running the kernel
  • Copying the results to the host memory
  • Clearing the memory and freeing the resources



                                                                                     50
Basic OpenCL program Structure

• OpenCL Kernel
• Host program containing:
   – a. Devices Context
   – b. Command Queue
   – c. Memory Objects
   – d. OpenCL Program
   – e. Kernel Memory Arguments
(a condensed sketch of this flow follows below)




                                   51
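Before walking through each call, here is a condensed sketch of that
host-side flow, assembled from the calls on the following slides (variable
names such as device, bytes, hostA, bufOut, count, source, and globalSize
are placeholders; error handling is omitted):

// 1. Create a context covering all GPU devices
cl_context ctx = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                         NULL, NULL, NULL);
// 2. Pick a device and create its command queue
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
// 3. Create memory objects for the kernel's inputs (copied from host)
cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             bytes, hostA, NULL);
// 4. Build the program from its source strings and extract the kernel
cl_program prog = clCreateProgramWithSource(ctx, count, source, NULL, NULL);
clBuildProgram(prog, 0, NULL, NULL, NULL, NULL);
cl_kernel k = clCreateKernel(prog, "VectorAdd", NULL);
// 5. Bind the arguments, enqueue the kernel, and read the result back
clSetKernelArg(k, 0, sizeof(cl_mem), &bufOut);
clEnqueueNDRangeKernel(queue, k, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, bufOut, CL_TRUE, 0, bytes, hostOut, 0, NULL, NULL);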
Creating the Kernel

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

const char* OpenCLSource[ ] = {
    "__kernel void VectorAdd(__global int* c, __global int* a, \n",
    "                        __global int* b) \n",
    "{ \n",
    "    unsigned int n = get_global_id(0); \n",
    "    c[n] = a[n] + b[n]; \n",
    "} \n"
};




                                                                      52
Creating the Kernel

Notice that the whole kernel here is stored as a
string (an array of char*):

  #include <stdio.h>
  #include <stdlib.h>
  #include <CL/cl.h>

  const char* OpenCLSource[ ] = {
      "__kernel void VectorAdd(__global int* c, __global int* a, \n",
      "                        __global int* b) \n",
      "{ \n",
      "    unsigned int n = get_global_id(0); \n",
      "    c[n] = a[n] + b[n]; \n",
      "} \n"
  };




                                                                         53
Creating the Kernel

• The __kernel keyword is equivalent to
  __global__ in CUDA.
• get_global_id() is a built-in function, used
  instead of calculating the global ID as in CUDA.
• The function parameters need to be defined as
  __global, which you don't need in CUDA.

  const char* OpenCLSource[ ] = {
      "__kernel void VectorAdd(__global int* c, __global int* a, \n",
      "                        __global int* b) \n",
      "{ \n",
      "    unsigned int n = get_global_id(0); \n",
      "    c[n] = a[n] + b[n]; \n",
      "} \n"
  };




                                                                           54
Initializing data




int InitialData1[12] = {62, 48, 20, -53, 39, 83, 19, 47, 13, 88, 38, -92};
int InitialData2[12] = {-49, 29, 38, 10, 37, 46, -12, 86, 17, 83, -22, 94};

#define SIZE 2048




                                                                       55
Creating the main function


int main (int argc, char **argv)
{
    int HostVector1[SIZE];
    int HostVector2[SIZE];
    for (int c = 0; c < SIZE; c++) {
        HostVector1[c] = InitialData1[c % 12];
        HostVector2[c] = InitialData2[c % 12];
    }




                                                                     56
Creating the context

cl_context clCreateContextFromType(
    cl_context_properties *properties,
    cl_device_type device_type,
    void (*pfn_notify)(const char *errinfo, const void *private_info,
                       size_t cb, void *user_data),
    void *user_data,
    cl_int *errcode_ret)

   cl_context GPUContext = clCreateContextFromType(0,
       CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);




                                                                        57
Creating the context

You can also use CL_DEVICE_TYPE_CPU as the
device type:

   cl_context GPUContext = clCreateContextFromType(0,
       CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);




                                                                        58
Query compute devices

cl_int clGetContextInfo(
    cl_context context,
    cl_context_info param_name,
    size_t param_value_size,
    void *param_value,
    size_t *param_value_size_ret)

param_name: CL_CONTEXT_REFERENCE_COUNT,
CL_CONTEXT_DEVICES, CL_CONTEXT_PROPERTIES

  size_t ParmDataBytes;
  clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);




                                                       59
Query compute devices

  cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
  clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes,
      GPUDevices, NULL);




                                                                60
Command queue

cl_command_queue clCreateCommandQueue(
    cl_context context,
    cl_device_id device,
    cl_command_queue_properties properties,
    cl_int *errcode_ret)

properties: CL_QUEUE_PROFILING_ENABLE,
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

cl_command_queue GPUCommandQueue = clCreateCommandQueue(
    GPUContext, GPUDevices[0], 0, NULL);




                                                     61
Allocating the Memory

cl_mem clCreateBuffer(
    cl_context context,
    cl_mem_flags flags,
    size_t size,
    void *host_ptr,
    cl_int *errcode_ret)

flags: CL_MEM_READ_WRITE, CL_MEM_READ_ONLY,
CL_MEM_WRITE_ONLY, CL_MEM_USE_HOST_PTR,
CL_MEM_ALLOC_HOST_PTR, CL_MEM_COPY_HOST_PTR

cl_mem GPUVector1 = clCreateBuffer(GPUContext,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof(int) * SIZE, HostVector1, NULL);




                                                                 62
Allocating the Memory



cl_mem GPUVector2 = clCreateBuffer(GPUContext,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    sizeof(int) * SIZE, HostVector2, NULL);

cl_mem GPUOutputVector = clCreateBuffer(GPUContext,
    CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, NULL);




                                                         63
Creating the program

cl_program clCreateProgramWithSource(
    cl_context context,
    cl_uint count,
    const char **strings,
    const size_t *lengths,
    cl_int *errcode_ret)

cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 6,
    OpenCLSource, NULL, NULL);




                                                                64
Creating the program

cl_int clBuildProgram(
    cl_program program,
    cl_uint num_devices,
    const cl_device_id *device_list,
    const char *options,
    void (*pfn_notify)(cl_program, void *user_data),
    void *user_data)

clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);




                                                            65
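If the build fails, the compiler log can be retrieved with
clGetProgramBuildInfo; a minimal sketch (not shown in the slides):

cl_int err = clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    char log[4096];
    // query the build log for the first GPU device
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0],
                          CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
    printf("Build log:\n%s\n", log);
}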
Creating the kernel

cl_kernel clCreateKernel(
    cl_program program,
    const char *kernel_name,
    cl_int *errcode_ret)

cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd",
    NULL);




                                                            66
Matching the GPU memory with the Kernel

cl_int clSetKernelArg(
    cl_kernel kernel,
    cl_uint arg_index,
    size_t arg_size,
    const void *arg_value)

clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem),
    (void*) &GPUOutputVector);




                                                              67
Matching the GPU memory with the Kernel

clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*) &GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*) &GPUVector2);




                                                              68
Launching the Kernel

cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)

size_t WorkSize[1] = {SIZE};
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
    WorkSize, NULL, 0, NULL, NULL);




                                                                  69
Copying the output to the host memory

cl_int clEnqueueReadBuffer(
    cl_command_queue command_queue,
    cl_mem buffer,
    cl_bool blocking_read,
    size_t offset,
    size_t cb,
    void *ptr,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)

int HostOutputVector[SIZE];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
    SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);




                                                              70
Cleaning the GPU device


clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
free(GPUDevices);
for (int c = 0; c < 305; c++)
    printf("%c", (char)HostOutputVector[c]);
return 0;
}




                                                             71
What you learned

• Writing an OpenCL kernel
• Writing an OpenCL application:
  –   Setting the context
  –   Preparing the command queue
  –   Setting the memory objects
  –   Setting the program
  –   Setting the kernel and the arguments




                                             72
Sources and additional resources

• Jason Sanders, "Introduction to CUDA" book and
  GTC presentation
• OpenCL specification document
• NVIDIA CUDA programming guide
• NVIDIA OpenCL getting started guide

• Videos from GTC'10 at the link:
• http://www.nvidia.com/object/gtc2010-presentation-a


                                             73
