SlideShare una empresa de Scribd logo
1 de 144
Descargar para leer sin conexión
6.963
               IT /
         A@M
      CUD
    9
IAP0

       Supercomputing on your desktop:
 Programming the next generation of cheap
and massively parallel hardware using CUDA

                                      Lecture 04
                                      Nicolas Pinto (MIT)




                                           #1
            CUDA            -   Advanced
During this course,
                                3
                               6
                        for 6.9
                     ed
                adapt


we’ll try to


          “                         ”

and use existing material ;-)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
warp != wrap
Today
yey!!
6.963
               IT /
         A@M
      CUD
    9
IAP0


       Textures & OpenGL
                 Async API
                  Libraries
         Interfacing CUDA
              Performance
6.963
               IT /
         A@M
      CUD
    9
IAP0



                            CUDA
       Textures and OpenGL
res
                                                                 xtu
                                                              Te
Textures in CUDA
        Different hardware path to memory
        Benefits of CUDA textures:
                 Texture fetches are cached
                            Optimized for 2D locality
                 Textures are addressable in 2D
                            Using integer or normalized coordinates
                            Means fewer addressing calculations in code
                 Provide filtering for free
                 Free wrap modes (boundary conditions)
                            Clamp to edge / repeat


        Limitations of CUDA textures:
                 Read-only
                 Currently either 1D or 2D (3D will be added)
                 9-bit accuracy of filter weights
© NVIDIA Corporation 2008                                                 160
res
                                                          xtu
                                                       Te
Two CUDA Texture Types

        Bound to linear memory
                 Global memory is bound to a texture
                 Only 1D
                 Integer addressing
                 No filtering, no addressing modes

        Bound to CUDA arrays
                 CUDA array is bound to a texture
                 1D or 2D
                 Float addressing (size-based or normalized)
                 Filtering
                 Addressing modes (clamping, repeat)

        Both:
                 Return either element type or normalized float

© NVIDIA Corporation 2008                                         161
res
                                                               xtu
                                                            Te
CUDA Texturing Steps
      Host (CPU) code:
               Allocate/obtain memory (global linear, or CUDA array)
               Create a texture reference object
                       Currently must be at file-scope
               Bind the texture reference to memory/array
               When done:
                       Unbind the texture reference, free resources

      Device (kernel) code:
               Fetch using texture reference
               Linear memory textures:
                       tex1Dfetch()
               Array textures:
                       tex1D() or tex2D()


© NVIDIA Corporation 2008                                              162
res
                                                                         xtu
                                                                      Te
Texture Reference
        Immutable parameters (compile-time)
                 Type: type returned when fetching
                            Basic int, float types
                            CUDA 1-, 2-, 4-element vectors
                 Dimensionality:
                            Currently 1 or 2 (3 will be supported in the future)
                 Read Mode:
                            cudaReadModeElementType
                            cudaReadModeNormalizedFloat (valid for 8- or 16-bit ints)
                             – returns [-1,1] for signed, [0,1] for unsigned
        Mutable parameters (run-time, only for array-textures)
                 Normalized:
                            non-zero = addressing range [0, 1]
                 Filter Mode:
                            cudaFilterModePoint
                            cudaFilterModeLinear
                 Address Mode:
                            cudaAddressModeClamp
                            cudaAddressModeWrap


© NVIDIA Corporation 2008                                                               163
Example: Host code for linear mem

// declare texture reference (must be at file-scope)
texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef;
       ...

// set up linear memory
unsigned short *dA = 0;
cudaMalloc((void**)&dA, numBytes);
cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice);

// bind texture reference to array


                                                        res
cudaBindTexture(NULL, texRef, dA);


                                                    xtu
                                                 Te

  © NVIDIA Corporation 2008                                       164
res
                                                                 xtu
                                                              Te
cudaArray Type


        Channel format, width, height
        cudaChannelFormatDesc structure
                 int x, y, z, w: bits for each component
                 enum cudaChannelFormatKind – one of:
                            cudaChannelFormatKindSigned
                            cudaChannelFormatKindUnsigned
                            cudaChannelFormatKindFloat
                 some predefined constructors:
                            cudaCreateChannelDesc<float>(void);
                            cudaCreateChannelDesc<float4>(void);

        Management functions:
                 cudaMallocArray, cudaFreeArray,
                 cudaMemcpyToArray, cudaMemcpyFromArray, ...

© NVIDIA Corporation 2008                                              165
Example: Host code for 2D array tex

// declare texture reference (must be at file-scope)
texture<float, 2, cudaReadModeElementType> texRef;
       ...

// set up the CUDA array
cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>();
cudaArray *texArray = 0;
cudaMallocArray(&texArray, &cf, dimX, dimY);
cudaMempcyToArray(texArray, 0,0, hA, numBytes, cudaMemcpyHostToDevice);

// specify mutable texture reference parameters
texRef.normalized = 0;


                                                         res
texRef.filterMode = cudaFilterModeLinear;


                                                     xtu
texRef.addressMode = cudaAddressModeClamp;

                                                  Te
// bind texture reference to array
cudaBindTextureToArray(texRef, texArray);



  © NVIDIA Corporation 2008                                       166
nGL
                                                        pe
                                                      O
OpenGL Interoperability


        OpenGL buffer objects can be mapped into the
        CUDA address space and then used as global
        memory
                 Vertex buffer objects
                 Pixel buffer objects

        Direct3D9 Vertex objects can be mapped
        Data can be accessed like any other global data in
        the device code
        Image data can be displayed from pixel buffer
        objects using glDrawPixels / glTexImage2D
                 Requires copy in video memory, but still fast
© NVIDIA Corporation 2008                                        177
nGL
                                                              pe
                                                            O
OpenGL Interop Steps

    Register a buffer object with CUDA
             cudaGLRegisterBufferObject(GLuint buffObj);
             OpenGL can use a registered buffer only as a source
             Unregister the buffer prior to rendering to it by OpenGL
    Map the buffer object to CUDA memory
             cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
             Returns an address in global memory
             Buffer must registered prior to mapping
    Launch a CUDA kernel to process the buffer
    Unmap the buffer object prior to use by OpenGL
             cudaGLUnmapBufferObject(GLuint buffObj);

    Unregister the buffer object
             cudaGLUnregisterBufferObject(GLuint buffObj);
             Optional: needed if the buffer is a render target
    Use the buffer object in OpenGL code

© NVIDIA Corporation 2008                                               178
nGL
                                                             pe
                                                           O
Interop Scenario:
Dynamic CUDA-generated texture
        Register the texture PBO with CUDA
        For each frame:
                 Map the buffer
                 Generate the texture in a CUDA kernel
                 Unmap the buffer
                 Update the texture
                 Render the textured object

                            unsigned char *p_d=0;
                            cudaGLMapBufferObject((void**)&p_d, pbo);
                            prepTexture<<<height,width>>>(p_d, time);
                            cudaGLUnmapBufferObject(pbo);
                            glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
                            glBindTexture(GL_TEXTURE_2D, texID);
                            glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, 256,256,
                                            GL_BGRA, GL_UNSIGNED_BYTE, 0);
© NVIDIA Corporation 2008                                                     179
nGL
                                                           pe
                                                         O
Interop Scenario:
Frame Post-processing by CUDA

        For each frame:
                 Render to PBO with OpenGL
                 Register the PBO with CUDA
                 Map the buffer
                 Process the buffer with a CUDA kernel
                 Unmap the buffer
                 Unregister the PBO from CUDA
                            unsigned char *p_d=0;
                            cudaGLRegisterBufferObject(pbo);
                            cudaGLMapBufferObject((void**)&p_d, pbo);
                            postProcess<<<blocks,threads>>>(p_d);
                            cudaGLUnmapBufferObject(pbo);
                            cudaGLUnregisterBufferObject(pbo);
                            ...
© NVIDIA Corporation 2008                                               180
6.963
               IT /
         A@M
      CUD
    9
IAP0


                      CUDA
                   Async API
ync
                                                     As
     !quot;#$%&'($()quot;*+,+('#*%(-#

      !quot;#$%&'($()quot;*&(quot;.*!quot; /,01%,*+,+('#*%(-#*2('*
      -34,56(%7,/*+,+('#*2',,quot;*)-*89:*($*366*8:;!*
      %3-3<6,*/,01%,quot;

      =0,'63-*1+-6,+,$.,/*<#*)quot;1$4*3*8:;!*quot;.',3+

      8:;!*>.',3+*?*>,@),$%,*(2*8:;!*(-,'3.1($quot;*.&3.*
      ,A,%).,*1$*('/,'

      >.',3+*!9BC
         D3%&*quot;.',3+*&3quot;*3$*B;C*E*?*/,23)6.*quot;.',3+
         cudaMemcpyAsync(dst, src, size, 0);


97
ync
                                                         As
     !quot;#$%&'()#$*#%(&*+(,#,-$.(/-'.

       0-*/1$$#*2(#3#/124-*(-5(&()#$*#%(&*+(&(6-72(!quot;
       +#quot;4/#(,#,-$.(/-'.(5-$('&8#9%-/)#+(,#,-$.
          0-,'12#(/&'&:4%42.(;<(=>=(?@AB(&*+(1'C
          Dquot;&4%&:%#(&7(&('$#quot;4#E(5#&21$#(4*(0FGD(=>=
          !quot;#$%&'7()#$*#%(#3#/124-*(4*(-*#(72$#&,(E426(&(,#,-$.(
          /-'.(5$-,(&*-26#$(72$#&,


       H2$#&,(DIJK
       cudaStreamCreate(&stream1);
       cudaStreamCreate(&stream2);
       cudaMemcpyAsync(dst, src, size, stream1);
                                                 -quot;#$%&''#+
       kernel<<<grid, block, 0, stream2>>>(…);
       cudaStreamQuery(stream2);


98
ync
                                                                As
      !quot;#$%&'()*%$+,
         &'()*-%./(%0)-(/*(1%2/(34/1(15%0)*4%!quot;#$%3.66%-*/(.7-
         quot;-.8(%-3()./04-9
             7(.-:/(%(6.;-(1%*07(%<4/%!quot;#$%3.66-%23643=%3>36(%;/(30-04)5
             ?:(/>%*@(%-*.*:-%4<%.)%.->)3@/4)4:-%!quot;#$%3.66
             A643=%!+quot;%:)*06%!quot;#$%3.66-%;/04/%*4%*@(%('()*%./(%347;6(*(1
             .->)3$+, -.7;6(%0)%!quot;#$%B#C

     3:1.&'()*D* -*./*E%-*4;F
     3:1.&'()*!/(.*(2G-*./*5F      3:1.&'()*!/(.*(2G-*4;5F
     3:1.&'()*H(34/12-*./*E%I5F
     =(/)(6JJJ8/01E%A643=KKK2LLL5F
     3:1.&'()*H(34/12-*4;E%I5F
     3:1.&'()*B>)3@/4)0M(2-*4;5F
     <64.*%(*F
     3:1.&'()*&6.;-(1N07(2G(*E%-*./*E%-*4;5F
     3:1.&'()*#(-*/4>2-*./*5F      3:1.&'()*#(-*/4>2-*4;5F
                                                                           95
95
6.963
               IT /
         A@M
      CUD
    9
IAP0


                         CUDA
                       Libraries
ary
                                                   ibr
                                                 L
 CUDA libraries


       CUDA includes 2 widely used libraries
             CUBLAS: BLAS implementation
             CUFFT: FFT implementation



       CUDPP (Data Parallel Primitives), available from
       http://www.gpgpu.org/developer/cudpp/ :
             Reduction
             Scan
             Sort




                                                          9
M02: High Performance Computing with CUDA
ary
                                                                           ibr
                                                                         L
 Closely Coupled CPU-GPU

           Function               Function   Function
                         Lib                            Lib

    Init
                                                          GPU
   Alloc


                                                              CPU
                Operation 1                   Operation 2           Operation 3




             Integrated programming model
             High speed data transfer – up to 5.5GB/sec
             Asynchronous data transfer
             Large GPU memory systems
                                                                                  10
M02: High Performance Computing with CUDA
ary
                                                                   ibr
                                                                 L
 CUBLAS
       Implementation of BLAS (Basic Linear Algebra Subprograms)
       on top of CUDA driver
             Self-contained at the API level, no direct interaction with CUDA
             driver

       Basic model for use
             Create matrix and vector objects in GPU memory space
             Fill objects with data
             Call sequence of CUBLAS functions
             Retrieve data from GPU

       CUBLAS library contains helper functions
             Creating and destroying objects in GPU space
             Writing data to and retrieving data from objects




                                                                           11
M02: High Performance Computing with CUDA
ary
                                                                  ibr
                                                                L
 Using CUBLAS
       Interface to CUBLAS library is in cublas.h
       Function naming convention
             cublas + BLAS name
             Eg., cublasSGEMM
       Error handling
             CUBLAS core functions do not return error
                  CUBLAS provides function to retrieve last error recorded
             CUBLAS helper functions do return error
       Helper functions:
             Memory allocation, data transfer
       Implemented using C-based CUDA tool chain
             Interfacing to C/C++ applications is trivial




                                                                         13
M02: High Performance Computing with CUDA
ary
                                                                         ibr
                                                                       L
Supported Features


                     Single Precision               Double Precision*

                    Real         Complex            Real                   Complex


                     !              !               !
   Level 1

                                               dgemv,
                     !                         dger,
   Level 2
                                               dsyr, dtrsv
                                  cgemm                                       zgemm
                     !                              !
   Level 3


*Double-precision functions only supported on GPUs with double-precision hardware

                                                           © 2008 NVIDIA Corporation.
ary
                                                        ibr
                                                      L
CUBLAS Helper Functions

  cublasInit()
     Initializes CUBLAS library
  cublasShutdown()
     Releases resources used by CUBLAS library
  cublasGetError()
     Returns last error from CUBLAS core function (+ resets)
  cublasAlloc()
     Wrapper around cudaMalloc() to allocate space for array
  cublasFree()
     destroys object in GPU memory
  cublas[Set|Get][Vector|Matrix]()
     Copies array elements between CPU and GPU memory
     Accommodates non-unit strides

                                          © 2008 NVIDIA Corporation.
ary
                                                                                    ibr
                                                                                  L
sgemmExample.c
#include <stdio.h>                                cublasInit();
#include <stdlib.h>
#include quot;cublas.hquot;                               cublasAlloc(n2, sizeof(float), (void **)&a_d);
                                                  cublasAlloc(n2, sizeof(float), (void **)&b_d);
int main(void)                                    cublasAlloc(n2, sizeof(float), (void **)&c_d);
{
    float *a_h, *b_h, *c_h;                       cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1);
    float *a_d, *b_d, *c_d;                       cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);
    float alpha = 1.0f, beta = 0.0f;
    int N = 2048, n2 = N*N;                       cublasSgemm('n', 'n', N, N, N, alpha, a_d, N,
    int nBytes, i;                                                b_d, N, beta, c_d, N);


    nBytes = n2*sizeof(float);                    cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);


    a_h = (float *)malloc(nBytes);                free(a_h); free(b_h); free(c_h);
    b_h = (float *)malloc(nBytes);                cublasFree(a_d); cublasFree(b_d);
    c_h = (float *)malloc(nBytes);                cublasFree(c_d);


    for (i=0; i < n2; i++) {                      cublasShutdown();
                                                  return 0;
        a_h[i] = rand() / (float) RAND_MAX;
                                              }
        b_h[i] = rand() / (float) RAND_MAX;
    }
                                                                      © 2008 NVIDIA Corporation.
ary
                                                                              ibr
                                                                            L
 Calling CUBLAS from FORTRAN
       Two interfaces:

             Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c)
                  Allows interfacing to existing applications without any changes
                  During each call, the wrappers allocate GPU memory, copy source data
                  from CPU memory space to GPU memory space, call CUBLAS, and finally
                  copy back the results to CPU memory space and deallocate the GPGPU
                  memory
                  Intended for light testing due to call overhead

             Non-Thunking (default)
                  Intended for production code
                  Substitute device pointers for vector and matrix arguments in all BLAS
                  functions
                  Existing applications need to be modified slightly to allocate and deallocate
                  data structures in GPGPU memory space (using CUBLAS_ALLOC and
                  CUBLAS_FREE) and to copy data between GPU and CPU memory
                  spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR,
                  CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)


                                                                                        14
M02: High Performance Computing with CUDA
ary
                                                                              ibr
                                                                            L
 SGEMM example (THUNKING)
 ! Define 3 single precision matrices A, B, C
 real , dimension(m1,m1)::    A, B, C
 ……
 ! Initialize
 ……
 #ifdef CUBLAS
  ! Call SGEMM in CUBLAS library using THUNKING interface (library takes care of
  ! memory allocation on device and data movement)
   call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
 #else
 ! Call SGEMM in host BLAS library
   call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
 #endif


   To use the host BLAS routine:
    g95 –O3 code.f90 –L/usr/local/lib -lblas

   To use the CUBLAS routine (fortran.c is provided by NVIDIA):
    gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
    g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas


                                                                                   15
M02: High Performance Computing with CUDA
ary
                                                                             ibr
                                                                           L
 SGEMM example (NON-THUNKING)
 ! Define 3 single precision matrices A, B, C
   real , dimension(m1,m1)::    A, B, C
   integer:: devPtrA, devPtrB, devPtrC, size_of_real=4
 ……
 ! Initialize A, B, C
 ………
 ! Allocate matrices on GPU
   cublasAlloc(m1*m1, size_of_real, devPtrA)
   cublasAlloc(m1*m1, size_of_real, devPtrB)
   cublasAlloc(m1*m1, size_of_real, devPtrC)
 !Copy data from CPU to GPU
   cublasSetMatrix(m1,m1, size_of_real, A,m1, devPtrA, m1)
   cublasSetMatrix(m1,m1, size_of_real, B,m1, devPtrB, m1)
   cublasSetMatrix(m1,m1, size_of_real, C,m1, devPtrC, m1)
 ! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in
        GPU memory)
   call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
 !Copy data from GPU to CPU
  cublasGetMatrix(m1,m1, size_of_real, devPtrC,m1, C, m1)
 ! Free memory on device
   cublasFree(devPtrA)
 ……

  g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas
                                                                                       16
M02: High Performance Computing with CUDA
!quot;#$%&'()*#+,-./0,12,34#quot;,5quot;#0quot;,6*#quot;'(,78+quot;9(',
                                                                           ,
                                 V&9F=J!V*=>*7!                                                     X&'(9!YB!S(''(=!
                        W*'ML8(+!4,F(%,(!SF7F9F*%!                             W*'ML8(+!4,F(%,(!SF7F9F*%!&%N!S(M&+8'(%8!*6!Q&8P('&8F,9!
                     $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J                                $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J!


7901('$1,
                                                                                                        quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;()*+,(!
                                                                                      quot;#$!%&'(!
Y(! M+(9(%8! M(+6*+'&%,(! +(9L=89! 6*+! N(%9(! =F%(&+! &=E(K+&! L9F%E!                                  quot;-./01! 2011quot;-.! 0011quot;-. 0311quot;-4
+(,(%8! ZV[S[! quot;#$9B! ]L+! '&8+F^U'&8+F^! 'L=8FM=J! +*L8F%(!                      5!*6!7(,8*+!,*+(9!       :1!         ;3!        ;3!         <!
_quot;`QQa! +L%9! LM! 8*! 31b! 6&98(+!8P&%! 8P(! 7(%N*+c9! F'M=('(%8&U


                                                                                                                                 ary
                                                                                   ,*+(!,=*,>?!quot;@A!        ;B:1!       ;B3C!      ;B:D!       ;B<D!
8F*%!&%N!&MM+*&,P(9!8P(!M(&>!*6!P&+NO&+(!,&M&KF=F8F(9B!]L+!d$?!


                                                                                                                              ibr
                                                                                     +(EF98(+9G,*+(!      3<HI!       :/HI!      :/HI!       :/HI!
ef! &%N! WP*=(9>J! 6&,8*+FA&8F*%9! &,PF(7(! LM! 8*! 01g21b! *6! 8P(!

                                                                                                                            L
                                                                                      9'('G,*+(!          ;3HI!       ;3HI!      ;3HI!       ;3HI!
M(&>! quot;`QQ! +&8(B! ]L+! M&+&==(=! d$! +L%%F%E! *%! 8O*! quot;#$9!
                                                                                  '('*+J!KL9?!quot;@A           ;B;!        ;B;!       1B2!       ;B1!
&,PF(7(9!LM!8*!hD<1!quot;6=*MG9B!-P(9(!+(9L=89!&+(!&,,*'M=F9P(N!KJ!
                                                                                  '('*+J!KL9?!MF%9         D;/!        /D3!       :0<!        ;/0!
,P&==(%EF%E!8P(!&,,(M8(N!7F(O!*6!8P(!quot;#$!&+,PF8(,8L+(!&%N!M+*U
                                                                                   K&%NOFN8P?!quot;IG9!        ;<;!         C1!        03!         :/!
E+&''F%E! ELFN(=F%(9B! Y(! &+EL(! 8P&8! '*N(+%! quot;#$9! 9P*L=N! K(!
                                                                                   '('*+J!&'*L%8!          ;quot;I!      D;/QI!     C30QI!      /D3QI!
7F(O(N! &9! 'L=8F8P+(&N(N! 'L=8F,*+(! 7(,8*+! L%F89B! Y(! (^M=*F8!
                                                                                   4#?!M(&>!quot;6=*MG9!       3/<!        </2!       :<3!         2:!
K=*,>F%E!9F'F=&+=J!8*!7(,8*+!,*'ML8(+9!&%N!P(8(+*E(%(F8J!*6!8P(!
                                                                                  4#?!M(&>!M(+!,*+(!        /;!         /C!        //!         /:!
9J98('! KJ! ,*'ML8F%E! K*8P! *%! quot;#$! &%N! W#$B! -PF9! 98LNJ! F%U
                                                                                    4#?!6=*M9RO*+N!         ;0!         /D!        ;3!         ;/!
,=LN(9! N(8&F=(N! K(%,P'&+>F%E! *6! 8P(! quot;#$! '('*+J! 9J98('! 8P&8!
                                                                                   S#?!M(&>!quot;6=*MG9!        C0!         T!         T!          T!
+(7(&=9! 9FA(9! &%N! =&8(%,F(9! *6! ,&,P(9! &%N! -dIB! Y(! M+(9(%8! &!
                                                                                    S#?!6=*M9RO*+N!         <B<!        T!         T!          T!
,*LM=(! *6! &=E*+F8P'F,! *M8F'FA&8F*%9! &F'(N! &8! F%,+(&9F%E! M&+&=U
=(=F9'!&%N!+(EL=&+F8J!F%!8P(!M+*K=('!8P&8!M+*7FN(!L9!OF8P!9=FEP8=J!               -&K=(!;R!-P(!=F98!*6!8P(!quot;#$9!L9(N!F%!8PF9!98LNJB!4#!F9!9F%E=(!M+(U
PFEP(+!M(+6*+'&%,(B!                                                              ,F9F*%!&%N!S#!F9!N*LK=(!M+(,F9F*%B!4'('!F9!9P&+(N!'('*+JB!#(&>!
                                                                                  6=*M!+&8(9!&+(!9P*O%!6*+!'L=8FM=J!&%N!&NN!*M(+&8F*%9B!)=*M9RO*+N!
:,;#1(2<4$1*2#,                                                                      F9!8P(!+&8F*!*6!M(&>!quot;6=*MG9!+&8(!8*!MF%U'('*+J!K&%NOFN8P!F%!
                                                                                                                 O*+N9B!!
Y(! '&>(! 8P(! 6*==*OF%E! ,*%8+FKL8F*%9B! )*+! 8P(! 6F+98! 8F'(?! O(!
9P*O!&%!d$?!ef!&%N!WP*=(9>J!6&,8*+FA&8F*%!8P&8!&,PF(7(!,*'U
                                                                                  9,+FK(9! 8P(! &+,PF8(,8L+(! *6! 8P(! quot;#$9! O(! L9(N?! PFEP=FEP8F%E! 8P(!
ML8&8F*%&=!+&8(9!*7(+!:11!quot;6=*MG9!*%!&!quot;#$B!-P(9(!&+(!8P+((!*6!
                                                                                  6(&8L+(9!,*''*%!8*!7(,8*+!&+,PF8(,8L+(9B!4(,8F*%!:!K(%,P'&+>9!
8P(!'*98!OFN(=J!L9(N!6&,8*+FA&8F*%9!F%!N(%9(!=F%(&+!&=E(K+&!&%N!
                                                                                  *M(+&8F*%9! F%,=LNF%E! '('*+J! 8+&%96(+?! >(+%(=! 98&+8ULM?! &%N! K&+U
M&7(! 8P(! O&J! 6*+! 8P(! F'M=('(%8&8F*%! *6! 8P(! (%8F+(! d#WH!
                                                                                  +F(+9?! &%N! L9(9! 8P(9(! 8*! &%&=JA(! 8P(! M(+6*+'&%,(! *6! 8P(! M&%(=!
=FK+&+J!i%N(+9*%!(8!&=B!;221j!6*+!8P(!quot;#$9B!
                                                                                  6&,8*+FA&8F*%! *6! d$B! 4(,8F*%! <! NF9,L99(9! 8P(! N(9FE%! &%N! M(+6*+U
    ]L+! +(9L=89! &=9*! F%,=LN(! M(+6*+'&%,(! *%! 8P(! 0U9(+F(9! *6!
                                                                                  '&%,(! (7&=L&8F*%! *6! '&8+F^! 'L=8FM=F,&8F*%B! 4(,8F*%! D! NF9,L99(9!
ZV[S[!quot;#$9!8P&8!O&9!%*8!M+(7F*L9=J!&88&F%(N!F%!8P(!;BD!J(&+9!
                                                                                  8P(! N(9FE%! *6! d$?! ef! &%N! WP*=(9>J?! &%N! 4(,8F*%! 3! (7&=L&8(9!
9F%,(!8P(9(!quot;#$9!O(+(!&7&F=&K=(B!Y(!M+*7FN(!%(O!F%9FEP89!F%8*!
                                                                                  8P(F+! M(+6*+'&%,(B! 4(,8F*%! C! 9L''&+FA(9! &%N! N(9,+FK(9! 6L8L+(!
M+*E+&''F%E! 8P(9(! &%N! %(O(+! quot;#$9! 8P&8! P(=M! L9! &,PF(7(! M(+U
                                                                                  O*+>B!
6*+'&%,(!F%!9L,P!K&9F,!>(+%(=9!&9!'&8+F^U'&8+F^!'L=8FM=J!8P&8!F9!
31b! 6&98(+! 8P&%! 8P*9(! F%! 8P(! *M8F'FA(N! 7(%N*+c9! =FK+&+J!
                                                                                  =,-./,7($%*1quot;$14(quot;,
W$Id4! ;B;B! 4*'(! *6! *L+! ,*N(9! P&7(! K((%! =F,(%9(N! KJ!
                                                                                  [%! 8PF9! O*+>! O(! &+(! ,*%,(+%(N! OF8P! M+*E+&''F%E! 0! 9(+F(9?! 2!
ZV[S[! &%N! F%,=LN(N! F%! W$Id4! /B1B! [%! *L+! &MM+*&,P! O(!                                                                     Volkov and Demmel (SC08)
                                                                                  9(+F(9?!&%N!/11!9(+F(9!*6!ZV[S[!quot;#$9?!&9!=F98(N!F%!-&K=(!;B!)*+!
8PF%>! *6! 8P(! quot;#$! &9! &! 'L=8F8P+(&N(N! 7(,8*+! L%F8! &%N! *L+! K(98!
rary
                                                                                                                                           Lib
                            !quot;#$%&'()*#+,-./0,12,34#quot;,5quot;#0quot;,6*#quot;'(,78+quot;9(',
                                                                           ,
                                 V&9F=J!V*=>*7!                                                     X&'(9!YB!S(''(=!
                        W*'ML8(+!4,F(%,(!SF7F9F*%!                             W*'ML8(+!4,F(%,(!SF7F9F*%!&%N!S(M&+8'(%8!*6!Q&8P('&8F,9!
                     $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J                                $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J!


7901('$1,
                                                                                                        quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;()*+,(!
                                                                                      quot;#$!%&'(!
Y(! M+(9(%8! M(+6*+'&%,(! +(9L=89! 6*+! N(%9(! =F%(&+! &=E(K+&! L9F%E!                                  quot;-./01! 2011quot;-.! 0011quot;-. 0311quot;-4
+(,(%8! ZV[S[! quot;#$9B! ]L+! '&8+F^U'&8+F^! 'L=8FM=J! +*L8F%(!                      5!*6!7(,8*+!,*+(9!       :1!         ;3!        ;3!         <!
_quot;`QQa! +L%9! LM! 8*! 31b! 6&98(+!8P&%! 8P(! 7(%N*+c9! F'M=('(%8&U                 ,*+(!,=*,>?!quot;@A!        ;B:1!       ;B3C!      ;B:D!       ;B<D!
8F*%!&%N!&MM+*&,P(9!8P(!M(&>!*6!P&+NO&+(!,&M&KF=F8F(9B!]L+!d$?!                      +(EF98(+9G,*+(!      3<HI!       :/HI!      :/HI!       :/HI!
ef! &%N! WP*=(9>J! 6&,8*+FA&8F*%9! &,PF(7(! LM! 8*! 01g21b! *6! 8P(!
                                                                                      9'('G,*+(!          ;3HI!       ;3HI!      ;3HI!       ;3HI!
M(&>! quot;`QQ! +&8(B! ]L+! M&+&==(=! d$! +L%%F%E! *%! 8O*! quot;#$9!
                                                                                  '('*+J!KL9?!quot;@A           ;B;!        ;B;!       1B2!       ;B1!
&,PF(7(9!LM!8*!hD<1!quot;6=*MG9B!-P(9(!+(9L=89!&+(!&,,*'M=F9P(N!KJ!
                                                                                  '('*+J!KL9?!MF%9         D;/!        /D3!       :0<!        ;/0!
,P&==(%EF%E!8P(!&,,(M8(N!7F(O!*6!8P(!quot;#$!&+,PF8(,8L+(!&%N!M+*U
                                                                                   K&%NOFN8P?!quot;IG9!        ;<;!         C1!        03!         :/!
E+&''F%E! ELFN(=F%(9B! Y(! &+EL(! 8P&8! '*N(+%! quot;#$9! 9P*L=N! K(!
                                                                                   '('*+J!&'*L%8!          ;quot;I!      D;/QI!     C30QI!      /D3QI!
7F(O(N! &9! 'L=8F8P+(&N(N! 'L=8F,*+(! 7(,8*+! L%F89B! Y(! (^M=*F8!
                                                                                   4#?!M(&>!quot;6=*MG9!       3/<!        </2!       :<3!         2:!
K=*,>F%E!9F'F=&+=J!8*!7(,8*+!,*'ML8(+9!&%N!P(8(+*E(%(F8J!*6!8P(!
                                                                                  4#?!M(&>!M(+!,*+(!        /;!         /C!        //!         /:!
9J98('! KJ! ,*'ML8F%E! K*8P! *%! quot;#$! &%N! W#$B! -PF9! 98LNJ! F%U
                                                                                    4#?!6=*M9RO*+N!         ;0!         /D!        ;3!         ;/!
,=LN(9! N(8&F=(N! K(%,P'&+>F%E! *6! 8P(! quot;#$! '('*+J! 9J98('! 8P&8!
                                                                                   S#?!M(&>!quot;6=*MG9!        C0!         T!         T!          T!
+(7(&=9! 9FA(9! &%N! =&8(%,F(9! *6! ,&,P(9! &%N! -dIB! Y(! M+(9(%8! &!
                                                                                    S#?!6=*M9RO*+N!         <B<!        T!         T!          T!
,*LM=(! *6! &=E*+F8P'F,! *M8F'FA&8F*%9! &F'(N! &8! F%,+(&9F%E! M&+&=U
=(=F9'!&%N!+(EL=&+F8J!F%!8P(!M+*K=('!8P&8!M+*7FN(!L9!OF8P!9=FEP8=J!               -&K=(!;R!-P(!=F98!*6!8P(!quot;#$9!L9(N!F%!8PF9!98LNJB!4#!F9!9F%E=(!M+(U
PFEP(+!M(+6*+'&%,(B!                                                              ,F9F*%!&%N!S#!F9!N*LK=(!M+(,F9F*%B!4'('!F9!9P&+(N!'('*+JB!#(&>!
                                                                                  6=*M!+&8(9!&+(!9P*O%!6*+!'L=8FM=J!&%N!&NN!*M(+&8F*%9B!)=*M9RO*+N!
:,;#1(2<4$1*2#,                                                                      F9!8P(!+&8F*!*6!M(&>!quot;6=*MG9!+&8(!8*!MF%U'('*+J!K&%NOFN8P!F%!
                                                                                                                 O*+N9B!!
Y(! '&>(! 8P(! 6*==*OF%E! ,*%8+FKL8F*%9B! )*+! 8P(! 6F+98! 8F'(?! O(!
9P*O!&%!d$?!ef!&%N!WP*=(9>J!6&,8*+FA&8F*%!8P&8!&,PF(7(!,*'U
                                                                                  9,+FK(9! 8P(! &+,PF8(,8L+(! *6! 8P(! quot;#$9! O(! L9(N?! PFEP=FEP8F%E! 8P(!
ML8&8F*%&=!+&8(9!*7(+!:11!quot;6=*MG9!*%!&!quot;#$B!-P(9(!&+(!8P+((!*6!
                                                                                  6(&8L+(9!,*''*%!8*!7(,8*+!&+,PF8(,8L+(9B!4(,8F*%!:!K(%,P'&+>9!
8P(!'*98!OFN(=J!L9(N!6&,8*+FA&8F*%9!F%!N(%9(!=F%(&+!&=E(K+&!&%N!
                                                                                  *M(+&8F*%9! F%,=LNF%E! '('*+J! 8+&%96(+?! >(+%(=! 98&+8ULM?! &%N! K&+U
M&7(! 8P(! O&J! 6*+! 8P(! F'M=('(%8&8F*%! *6! 8P(! (%8F+(! d#WH!
                                                                                  +F(+9?! &%N! L9(9! 8P(9(! 8*! &%&=JA(! 8P(! M(+6*+'&%,(! *6! 8P(! M&%(=!
=FK+&+J!i%N(+9*%!(8!&=B!;221j!6*+!8P(!quot;#$9B!
                                                                                  6&,8*+FA&8F*%! *6! d$B! 4(,8F*%! <! NF9,L99(9! 8P(! N(9FE%! &%N! M(+6*+U
    ]L+! +(9L=89! &=9*! F%,=LN(! M(+6*+'&%,(! *%! 8P(! 0U9(+F(9! *6!
                                                                                  '&%,(! (7&=L&8F*%! *6! '&8+F^! 'L=8FM=F,&8F*%B! 4(,8F*%! D! NF9,L99(9!
ZV[S[!quot;#$9!8P&8!O&9!%*8!M+(7F*L9=J!&88&F%(N!F%!8P(!;BD!J(&+9!
                                                                                  8P(! N(9FE%! *6! d$?! ef! &%N! WP*=(9>J?! &%N! 4(,8F*%! 3! (7&=L&8(9!
9F%,(!8P(9(!quot;#$9!O(+(!&7&F=&K=(B!Y(!M+*7FN(!%(O!F%9FEP89!F%8*!
                                                                                  8P(F+! M(+6*+'&%,(B! 4(,8F*%! C! 9L''&+FA(9! &%N! N(9,+FK(9! 6L8L+(!
M+*E+&''F%E! 8P(9(! &%N! %(O(+! quot;#$9! 8P&8! P(=M! L9! &,PF(7(! M(+U
                                                                                  O*+>B!
6*+'&%,(!F%!9L,P!K&9F,!>(+%(=9!&9!'&8+F^U'&8+F^!'L=8FM=J!8P&8!F9!
31b! 6&98(+! 8P&%! 8P*9(! F%! 8P(! *M8F'FA(N! 7(%N*+c9! =FK+&+J!
                                                                                  =,-./,7($%*1quot;$14(quot;,
W$Id4! ;B;B! 4*'(! *6! *L+! ,*N(9! P&7(! K((%! =F,(%9(N! KJ!                                                                      Volkov and Demmel (SC08)
                                                                                  [%! 8PF9! O*+>! O(! &+(! ,*%,(+%(N! OF8P! M+*E+&''F%E! 0! 9(+F(9?! 2!
ZV[S[! &%N! F%,=LN(N! F%! W$Id4! /B1B! [%! *L+! &MM+*&,P! O(!
                                                                                  9(+F(9?!&%N!/11!9(+F(9!*6!ZV[S[!quot;#$9?!&9!=F98(N!F%!-&K=(!;B!)*+!
8PF%>! *6! 8P(! quot;#$! &9! &! 'L=8F8P+(&N(N! 7(,8*+! L%F8! &%N! *L+! K(98!
ary
                                              ibr
                                            L
 DGEMM Performance




                                                17
M02: High Performance Computing with CUDA
ary
                                                      ibr
                                                    L
Additional Resources


  CUDA SDK example
     simpleCUBLAS
  CUBLAS Library documentation
     in doc folder of CUDA Toolkit or download from CUDA
     Zone




                                        © 2008 NVIDIA Corporation.
ary
                                                            ibr
                                                          L
 CUFFT
       The Fast Fourier Transform (FFT) is a divide-and-
       conquer algorithm for efficiently computing discrete
       Fourier transform of complex or real-valued data
       sets.
       CUFFT is the CUDA FFT library
             Provides a simple interface for computing parallel FFT on
             an NVIDIA GPU
             Allows users to leverage the floating-point power and
             parallelism of the GPU without having to develop a custom,
             GPU-based FFT implementation




                                                                  18
M02: High Performance Computing with CUDA
ary
                                                 ibr
                                               L
 Supported Features
       1D, 2D and 3D transforms of complex and real-valued
       data
       Batched execution for doing multiple 1D transforms
       in parallel
       1D transform size up to 8M elements
       2D and 3D transform sizes in the range [2,16384]
       In-place and out-of-place transforms for real and
       complex data.




                                                      19
M02: High Performance Computing with CUDA
ary
                                                                 ibr
                                                               L
 Transform Types
       Library supports real and complex transforms
             CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
       Directions
             CUFFT_FORWARD (-1) and CUFFT_INVERSE (1)
                  According to sign of the complex exponential term
       Real and imaginary parts of complex input and
       output arrays are interleaved
             cufftComplex type is defined for this
       Real to complex FFTs, output array holds only
       nonredundant coefficients
             N -> N/2+1
             N0 x N1 x … x Nn -> N0 x N1 x … x (Nn/2+1)
             For in-place transforms the input/output arrays need to be
             padded


                                                                      20
M02: High Performance Computing with CUDA
ary
                                                                 ibr
                                                               L
 More on Transforms
       For 2D and 3D transforms, CUFFT performs transforms in row-
       major (C-order)
       If calling from FORTRAN or MATLAB, remember to change the
       order of size parameters during plan creation
       CUFFT performs un-normalized transforms:
            IFFT(FFT(A))= length(A)*A
       CUFFT API is modeled after FFTW. Based on plans, that
       completely specify the optimal configuration to execute a
       particular size of FFT
       Once a plan is created, the library stores whatever state is
       needed to execute the plan multiple times without recomputing
       the configuration
             Works very well for CUFFT, because different kinds of FFTs
             require different thread configurations and GPU resources



                                                                          21
M02: High Performance Computing with CUDA
ary
                                                        ibr
                                                      L
CUFFT Types and Definitions

  cufftHandle
     Type used to store and access CUFFT plans
  cufftResults
     Enumeration of API function return values
  cufftReal
     single-precision, real datatype
  cufftComplex
     single-precision, complex datatype

  Real and complex transforms
     CUFFT_C2C, CUFFT_C2R, CUFFT_R2C
  Directions
     CUFFT_FORWARD, CUFFT_INVERSE
                                          © 2008 NVIDIA Corporation.
ary
                                                                                            ibr
                                                                                          L
CUFFT Example
#include <stdio.h>                                    cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
#include <math.h>
#include quot;cufft.hquot;                                    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
                                                      cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);
int main(int argc, char *argv[])
{                                                     cudaMemcpy(a_h, a_d, nBytes,
    cufftComplex *a_h, *a_d;                                           cudaMemcpyDeviceToHost);
    cufftHandle plan;
    int N = 1024, batchSize = 10;                     // check error - normalize
    int i, nBytes;                                    for (maxError = 0.0, i=0; i < N*batchSize; i++) {
    double maxError;                                      maxError = max(fabs(a_h[i].x/N-sinf(i)), maxError);
                                                          maxError = max(fabs(a_h[i].y/N-cosf(i)), maxError);
    nBytes = sizeof(cufftComplex)*N*batchSize;        }
    a_h = (cufftComplex *)malloc(nBytes);
                                                      printf(quot;Max fft error = %gnquot;, maxError);
    for (i=0; i < N*batchSize; i++) {
        a_h[i].x = sinf(i);                           cufftDestroy(plan);
        a_h[i].y = cosf(i);                           free(a_h); cudaFree(a_d);
    }
                                                      return 0;
    cudaMalloc((void **)&a_d, nBytes);            }
    cudaMemcpy(a_d, a_h, nBytes,
                       cudaMemcpyHostToDevice);                               © 2008 NVIDIA Corporation.
ary
                                                      ibr
                                                    L
Additional CUFFT Resources


  CUDA SDK examples
     simpleCUFFT
     convolutionFFT2D
     oceanFFT
  CUFFT Library documentation
     In doc folder of CUDA Toolkit or download from CUDA
     Zone




                                        © 2008 NVIDIA Corporation.
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
?
     e
  lu
G
6.963
               IT /
         A@M
      CUD
    9
IAP0




Interfacing CUDA
lue
                                                G
Interfacing CUDA with other languages

      CUDA kernels from FORTRAN, allocate pinned
      memory from FORTRAN

      Calling CUDA from MATLAB with MEX files

      Several packages (open source and commercial) to
      interface CUDA with Python, IDL, .NET, FORTRAN
      (Flagon). Browse CUDA Zone to find all the
      packages.




                                                    23
M02: High Performance Computing with CUDA
lue
                                                                                     G
 Pinned memory from FORTRAN
Pinned memory provides a fast PCI-e transfer speed and enables use of streams:
•Allocation needs to be done with cudaMallocHost
•Use new Fortran 2003 features for interoperability with C.

 use iso_c_binding
 ! The allocation is performed by C function calls. Define the C pointer as type (C_PTR)
  type(C_PTR) :: cptr_A, cptr_B, cptr_C
 ! Define Fortran arrays as pointer.
 real, dimension(:,:), pointer :: A, B, C

 ! Allocating memory with cudaMallocHost.
 ! The Fortan arrays, now defined as pointers, are then associated with the C pointers using the
 ! new interoperability defined in iso_c_binding. This is equivalent to allocate(A(m1,m1))
  res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
  call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )

 ! Use A as usual.
 ! See example code for cudaMallocHost interface code



     http://www.nvidia.com/object/cuda_programming_tools.html
                                                                                           24
M02: High Performance Computing with CUDA
lue
                                                                                                G
 Calling CUDA kernels from FORTRAN
From Fortran call C function that will call CUDA kernel
 ! Fortran -> C -> CUDA ->C ->Fortran
 call cudafunction(c,c2,N)


 /* NB: Fortran subroutine arguments are passed by reference. */
 extern quot;Cquot; void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
 {
   ...
   int N=*np;
   cudaMalloc ((void **) &a_d , sizeof(cuComplex)*N);
   cudaMemcpy( a_d, a, sizeof(cuComplex)*N ,cudaMemcpyHostToDevice);
   dim3 dimBlock(block_size); dim3 dimGrid (N/dimBlock.x); if( N % block_size != 0 ) dimGrid.x+=1;
   square_complex<<<dimGrid,dimBlock>>>(a_d,a_d,N);
    cudaMemcpy( b, a_d, sizeof(cuComplex)*N,cudaMemcpyDeviceToHost);
   cudaFree(a_d);
 }


 complex_mul: main.f90 Cuda_function.o
    $(FC) -o complex_mul main.f90 Cuda_function.o -L/usr/local/cuda/lib -lcudart

 cuda_function.o: cuda_function.cu
     nvcc -c -O3 cuda_function.cu

                                                                                                     25
M02: High Performance Computing with CUDA
lue
                                                   G
 CUDA & MATLAB

      Even though MATLAB is built on many well-
      optimized libraries, some functions can perform
      better when written in a compiled language (e.g. C
      and Fortran).

      MATLAB provides a convenient API for interfacing
      code written in C and FORTRAN to MATLAB
      functions with MEX files.

      MEX files could be used to exploit multi-core
      processors with OpenMP or threaded codes or like
      in this case to offload functions to the GPU.

                                                       26
M02: High Performance Computing with CUDA
lue
                                                          G
 NVMEX
       Native MATLAB script cannot parse CUDA code

       New MATLAB script nvmex.m compiles CUDA code
       (.cu) to create MATLAB function files

       Syntax similar to original mex script:

       >> nvmex –f nvmexopts.bat filename.cu –IC:cudainclude
           –LC:cudalib -lcudart


       Available for Windows and Linux from:
       http://developer.nvidia.com/object/matlab_cuda.html

                                                                 27
M02: High Performance Computing with CUDA
lue
                                                                 G
 Mex files for CUDA
 A typical mex file will perform the following steps:

        1.   Convert from double to single precision
        2.   Rearrange the data layout for complex data
        3.   Allocate memory on the GPU
        4.   Transfer the data from the host to the GPU
        5.   Perform computation on GPU (library, custom code)
        6.   Transfer results from the GPU to the host
        7.   Rearrange the data layout for complex data
        8.   Convert from single to double
        9.   Clean up memory and return results to MATLAB

 Some of these steps will go away with new versions of the library
    (2,7) and new hardware (1,8)



                                                                  28
M02: High Performance Computing with CUDA
lue
                                                                          G
 CUDA MEX example
      Additional code in MEX file to handle CUDA

  /*Parse input, convert to single precision and to interleaved complex format */
        …..
  /* Allocate array on the GPU */
        cufftComplex *rhs_complex_d;
        cudaMalloc( (void **) &rhs_complex_d,sizeof(cufftComplex)*N*M);
  /* Copy input array in interleaved format to the GPU */
        cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M,
                                          cudaMemcpyHostToDevice);
  /* Create plan for CUDA FFT NB: transposing dimensions*/
        cufftPlan2d(&plan, N, M, CUFFT_C2C) ;
  /* Execute FFT on GPU */
        cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE) ;
  /* Copy result back to host */
        cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M,
                                          cudaMemcpyDeviceToHost);
  /* Clean up memory and plan on the GPU */
        cufftDestroy(plan); cudaFree(rhs_complex_d);
  /*Convert back to double precision and to split complex format */
        ….

                                                                                29
M02: High Performance Computing with CUDA
lue
                                                                              G
Timing details
 1024x1024 mesh, 400 RK4 steps on Windows,
 2D isotropic turbulence
                                             Runtime      Speed    Runtime       Speed
                                            Opteron 250           Opteron 2210
                                                           up                     up

PCI-e Bandwidth:                            1135 MB/s             1483 MB/s
Host to/from device                         1003 MB/s             1223 MB/s
Standard MATLAB                             8098 s                9525s


Overload FFT2 and IFFT2                     4425 s        1.8x       4937s       1.9x

Overload Szeta                              735 s         11.x       789s        12.X


Overload Szeta , FFT2 and 577 s                           14.x       605s        15.7x
IFFT2

                                                                                 30
M02: High Performance Computing with CUDA
lue
G
lue
G
lue
G
lue
G
lue
G
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
Wanna Play with
The Big Guys?
6.963
               IT /
         A@M
      CUD
    9
IAP0



                            CUDA
       Performance Strategies
ing
                                                                             ead
                                                                           hr
                                                                 T
Programming Model
                                          Host            Device
         A kernel is executed as a                              Grid 1
         grid of thread blocks                                    Block     Block    Block
                                             Kernel
         A thread block is a batch                                (0, 0)    (1, 0)   (2, 0)
                                               1

         of threads that can                                      Block     Block    Block
         cooperate with each                                      (0, 1)    (1, 1)   (2, 1)

         other by:
                                                                  Grid 2
                   Sharing data through
                   shared memory            Kernel
                                              2
                   Synchronizing their
                   execution
                                                 Block (1, 1)


         Threads from different
                                                  Thread Thread Thread Thread Thread
                                                   (0, 0) (1, 0) (2, 0) (3, 0) (4, 0)

         blocks cannot cooperate                  Thread Thread Thread Thread Thread
                                                   (0, 1) (1, 1) (2, 1) (3, 1) (4, 1)

                                                  Thread Thread Thread Thread Thread
                                                   (0, 2) (1, 2) (2, 2) (3, 2) (4, 2)

                                                                                 3
© NVIDIA Corporation 2006
mory
                                  Me
Data Movement in a CUDA Program

Host Memory
 Device Memory
  [Shared Memory]
    COMPUTATION
  [Shared Memory]
 Device Memory
Host Memory




© NVIDIA Corporation 2008             10
erf
                                                       P
     !quot;#$%$&'()*+,-$#.%/(0,-(#.'(123

       456$%$&'($78'quot;'78'7#(quot;5-5**'*$/%

       456$%$&'(5-$#.%'#$9($7#'7/$#:(;%5#.<=578>$8#.?

       @,%'#$%'/($#A/(='##'-(#,(-'9,%quot;B#'(#.57(#,(959.'
          123(/quot;'78/($#/(#-57/$/#,-/(,7()C3/D(7,#(%'%,-:


       E,(%,-'(9,%quot;B#5#$,7(,7(#.'(123(#,(5F,$8(9,/#*:(
       85#5(#-57/0'-/
          GF'7(*,>(quot;5-5**'*$/%(9,%quot;B#5#$,7/(957(/,%'#$%'/(='(
          05/#'-(#.57(#-57/0'--$7+(=59H(578(0,-#.(#,(.,/#




39
erf
                                                    P
     !quot;#$%$&'()'%*+,(-*.'+'/0'

       -*12'30'4(536(7*/80*12'30'4(9(*+4'+(*:(%1;/$#<4'
          =2*>12?@*012(4'5$0'(%'%*+,(


       !quot;#$%$&'(:*+(3quot;1#$12(2*012$#,($/(010.'4(#'A#<+'(
       %'%*+,

       B/(3.1+'4(%'%*+,C(15*$4(.$;.84';+''(>1/D(0*/:2$0#3




40
erf
                                                        P
     !quot;#$%&'(quot;)*quot;+$%,-%./quot;0$'%1$2,03

       45)'0$'6%,-%*72$6%-quot;6*$0%*/quot;)%+8,9quot;8%2$2,03
       !/0$quot;'6%:quot;)%:,,;$0quot;*$%(7quot;%6/quot;0$'%2$2,03

       <6$%,)$%=%quot;%-$>%*/0$quot;'6%*,%8,quot;'%=%:,2;5*$%'quot;*quot;%
       6/quot;0$'%93%quot;88%*/0$quot;'6

       <6$%7*%*,%quot;(,7'%),)?:,quot;8$6:$'%quot;::$66
          .*quot;+$%8,quot;'6%quot;)'%6*,0$6%7)%6/quot;0$'%2$2,03%*,%0$?,0'$0%),)?
          :,quot;8$6:$quot;98$%quot;''0$667)+
          1quot;*07@%*0quot;)6;,6$%$@quot;2;8$%8quot;*$0




41
erf
                                                  P
     !quot;#$%&'&((#()quot;*$+,,)-)#./(0

       %&'/)/)1.$012'$-1*32/&/)1.$/1$4##3$/5#$6%!$
       *2(/)3'1-#quot;quot;1'quot;$#72&((0$82quot;0
          9&.0$/5'#&:quot;;$*&.0$/5'#&:$8(1-4quot;


       <##3$'#quot;12'-#$2quot;&=#$(1>$#.12=5$/1$quot;2331'/$
       *2(/)3(#$&-/)?#$/5'#&:$8(1-4quot;$3#'$*2(/)3'1-#quot;quot;1'
          @#=)quot;/#'quot;;$quot;5&'#:$*#*1'0




42
erf
                                           P
     !quot;#$%&'$()*#*+,)*$-.

       /()*#*+*-0'#quot;#$%&')%,-.1quot;%.
       2$,3quot;.4*-0'03$5,3'#quot;#$%&',44quot;..quot;.
       6.*-0'.7,%quot;8'#quot;#$%&'quot;11quot;4)*9quot;3&




44
erf
                                                          P
     !quot;#quot;$%&quot;'()*&(

       !*+,-*$.*./&0$#/$1/(#$.*./&0$2quot;'34,3#1$.5-1$
       6/4*&$#1quot;'$3*+,-*$.*./&0$#/$3*+,-*$2quot;'34,3#1
          789:($;*quot;<$=>?@A*$BCDE$+(F$GH$89:($;*quot;<$=I5quot;3&/$JK$LDHHE
          G89:($)/&$>?@A*$MFH


       N,',.,O*$#&quot;'()*&(
          @'#*&.*3,quot;#*$3quot;#quot;$(#&5-#5&*($-quot;'$2*$quot;66/-quot;#*3P$/;*&quot;#*3$
          /'P$quot;'3$3*quot;66/-quot;#*3$4,#1/5#$*+*&$-/;0,'Q$#1*.$#/$1/(#$
          .*./&0


       8&/5;$#&quot;'()*&(
          R'*$6quot;&Q*$#&quot;'()*&$.5-1$2*##*&$#1quot;'$.quot;'0$(.quot;66$/'*(



45
erf
                                                      P
     !quot;#$%&'()$*+,$-'./+0.quot;123$.2

       (4*quot;,quot;55'(6'2789+quot;55':2+quot;55'(quot;7;'1+'3+<quot;#$%5'()$*+
       ='27+-$-'./
       >1quot;?5$2+=;#=$27+(4*quot;,$-(</+<$.3'.-quot;1($
          @AB+CDE2F+('--'1+'1+!GH%$I<.$22+8IJK9
          LM+CDE2+-$quot;24.$*+'1+1N'.($+KOP;+-'7=$.?'quot;.*2+
          8'Q$.(5'()$*+!GH%$9

       R$$+7=$+S?quot;1*:;*7=0$27T GUVW+RVX+2quot;-<5$

       U2$+:;7=+(quot;47;'1
          W55'(quot;7;1#+7''+-4(=+<quot;#$%5'()$*+-$-'./+(quot;1+.$*4($+
          'Q$.quot;55+2/27$-+<$.3'.-quot;1($
          0$27+/'4.+2/27$-2+quot;1*+quot;<<2+7'+5$quot;.1+7=$;.+5;-;72



46
erf
                                                  P
     !quot;#$%quot;&'()#*+&,(%-./0*12(.

       3145(.2&quot;%2(67+&16.2*8721#6.9&:;;<=;;&7quot;#7>&7+7quot;(.

       ?1>(quot;+&2#&$(&@(*A#*)%67(&$#22quot;(6(7>

       B@21)1C%21#6.&7%6&4*(%2quot;+&167*(%.(&@(*A#*)%67(
          D#%quot;(.71649&8@&2#&E;F&.@((-8@
          ?%2(67+&51-1649&8@&2#&GHIF&.@((-8@




47
erf
                                                            P
     !quot;#$%&'()*


      +,'quot;quot;-.()#/%.,-%#.,01,#,2#$345#-6,789 /2-%#.&:
      +,'quot;)/(*;quot;;&,-%*(quot;),quot;3,*$quot;0#$,<%<quot;-1=
          9> 01/%&,4 %#'2,/2-%#.,-%#.&,#,5quot;-.=,()/?,3$quot;#/?,@
         8AB 01/%&,4 %#'2,/2-%#.,-%#.&,#,.quot;;0$%45quot;-.=,()/A?,3$quot;#/A?,@
         AC9 01/%&,D %#'2,/2-%#.,-%#.&,#,E;#.45quot;-.=,()/>?,3$quot;#/>?,@
      +..(/(quot;)#$,-%&/-('/(quot;)&,quot;),FBGHFIG,#-'2(/%'/;-%=
         J/#-/()*,#..-%&&,3quot;-,#,-%*(quot;),<;&/,0%,#,<;$/(6$%,quot;3,-%*(quot;),
         &(K%
         L2%,k/2 /2-%#.,(),#,2#$345#-6,<;&/,#''%&&,/2% k/2 %$%<%)/,(),#,
         0$quot;'M,0%()*,-%#.
      NO'%6/(quot;)=,)quot;/,#$$,/2-%#.&,<;&/,0%,6#-/('(6#/()*
         P-%.('#/%.,#''%&&?,.(Q%-*%)'%,5(/2(),#,2#$35#-6

48
erf
                                                                                      P
     !quot;#$%&'%()*''%&&+),%#(-./)0$quot;#1&


             12         13         14         17                        135 136



       349        374        378        352        355             395    399   3:4

                                              *$$)1>?%#(&)C#?1-'-C#1%



             12         13         14         17                        135 136



       349        374        378        352        355             395    399   3:4

                                   ;quot;<%)=>?%#(&)@quot;)Aquot;1)B#?1-'-C#1%


49
erf
                                                                                 P
     !quot;#$%&'(#')*+##'((,*-'%).quot;/*0&$%1(


             12         13         14         17                 135 136



       349        374        378        352        355         395   399   3B4

                                    :';<=1')*+##'((*>?*@A;'%)(



             12         13         14         17         137     135 136



       349        374        378        352        355         395   399   3B4

                  C.(%&./quot;')*D1%;1.quot;/*+));'((*Equot;$1*%*<=&1.F&'*$0*85G


50
erf
                                                         P
     !quot;#$%&'()*+,-(.()*,/%&0$1&

       234%5(.%)1,quot;),678+,
          9%5)%$+,5%#:,#,;$quot;#1<,()'5%.%)1<,=5(1%,>#'?
          @A,;$quot;#1&,BCDAEF
          -(.%&,#G%5#*%:,quot;G%5,C89,50)&
       CD9,>$quot;'?&,3,DHI,1J5%#:&+
            @HIK&,L 'quot;#$%&'%:
            @HMK&,L 'quot;#$%&'%:<,&quot;.%,1J5%#:&,:quot;)N1,4#51('(4#1%
          @<OPOK&,L 4%5.01%:Q.(&#$(*)%:,1J5%#:,#''%&&




51
erf
                                                                      P
     !quot;#$%&'()*+
     ,-./'-/.%&0quot;10&(2%0! 34054067089-%&
       :&%0#0,-./'-/.%0quot;10;..#9&0<,quot;;=0()&-%#>0quot;10;..#90quot;10,-./'-/.%&0
       <;quot;,=

       ?10,quot;;0(&0)quot;-0@(#A$%+
              Bquot;.'%0&-./'-/.%0#$(*)C%)-+0DD#$(*)<E=40FG%.%0E0H0340540quot;.067
              :&%0,IJI0-quot;0#'G(%@%0'quot;#$%&'()*


          x        y      z     Point structure


          x        y      z     x      y      z     x      y      z          AoS



          x        x      x     y      y      y     z      z      z          SoA


58
erf
                                                          P
     !quot;#$%&'()*+,-.//#01

       !quot;#$%&'()*,*0%#2$1,(/30quot;4%&,250quot;.*53.2

       !0(2('#$,2quot;,/%/quot;0167quot;.)8,9%0)%$&

       :%#8()*,&20.'2.0%&,quot;;,&(<%,quot;25%0,25#),=>,?>,quot;0,@A
       712%&,B($$,70%#9,'quot;#$%&'()*+
          C0%;%0,-20.'2.0%&,quot;;,D00#1& quot;4%0,Dquot;-
          E;,-quot;D,(&,)quot;2,4(#7$%>,0%#8FB0(2%,250quot;.*5,-GHG


       D88(2(quot;)#$,0%&quot;.0'%&+
          D$(*)%8,I13%&,-JK,-#/3$%




59
erf
                                                       P
     !quot;#quot;$$%$&'%()#*&+#,-./%,/0#%

       12&quot;&3quot;#quot;$$%$&(quot;,-.2%4&(quot;2*&/-#%quot;56&quot;,,%66&(%()#*
          7-%#%8)#%4&(%()#*&.6&5.9.5%5&.2/)&:quot;2;6
          <66%2/.quot;$&/)&quot;,-.%9%&-.=-&:quot;25>.5/-


       <quot;,-&:quot;2;&,quot;2&6%#9.,%&)2%&quot;55#%66&3%#&,*,$%
          +&(%()#*&,quot;2&6%#9.,%&quot;6&(quot;2*&6.(0$/quot;2%)06&
                                                        Bank 0
          quot;,,%66%6&quot;6&./&-quot;6&:quot;2;6
                                                        Bank 1
                                                        Bank 2
                                                        Bank 3
       '0$/.3$%&6.(0$/quot;2%)06&quot;,,%66%6&/)&quot;&:quot;2;         Bank 4
       #%60$/&.2&quot;&:quot;2;&,)28$.,/&                       Bank 5
                                                        Bank 6
          ?)28$.,/.2=&quot;,,%66%6&quot;#%&6%#.quot;$.@%5
                                                        Bank 7



                                                        Bank 15
64
erf
                                                           P
     !quot;#$%&''()**+#,%-.quot;/01)*
       23%!quot;#$%43#51+67*               23%!quot;#$%43#51+67*
            8+#)quot;(%quot;''()**+#,%             ;quot;#'3/%:<:%=)(/>7quot;7+3#
            *7(+')%99%:


      Thread 0              Bank 0   Thread 0              Bank 0
      Thread 1              Bank 1   Thread 1              Bank 1
      Thread 2              Bank 2   Thread 2              Bank 2
      Thread 3              Bank 3   Thread 3              Bank 3
      Thread 4              Bank 4   Thread 4              Bank 4
      Thread 5              Bank 5   Thread 5              Bank 5
      Thread 6              Bank 6   Thread 6              Bank 6
      Thread 7              Bank 7   Thread 7              Bank 7



     Thread 15             Bank 15   Thread 15             Bank 15



65
erf
                                                           P
     !quot;#$%&''()**+#,%-.quot;/01)*
       234quot;5%!quot;#$%67#81+9:*            =34quot;5%!quot;#$%67#81+9:*
            ;+#)quot;(%quot;''()**+#,%             ;+#)quot;(%quot;''()**+#,%
            *:(+')%<<%2                    *:(+')%<<%=


                                                     x8
      Thread 0              Bank 0   Thread 0              Bank 0
      Thread 1              Bank 1   Thread 1              Bank 1
      Thread 2              Bank 2   Thread 2              Bank 2
      Thread 3              Bank 3   Thread 3
      Thread 4              Bank 4   Thread 4
                            Bank 5   Thread 5              Bank 7
                            Bank 6   Thread 6              Bank 8
                            Bank 7   Thread 7              Bank 9
     Thread 8                                         x8
     Thread 9
     Thread 10
     Thread 11             Bank 15   Thread 15             Bank 15



66
erf
                                                                             P
     !quot;#$%&&'())()$*%+$,quot;$-%./)$quot;.$012

       3%.&#4&,5$quot;6$(%75$-%./$4)$89$-4,)$+('$9$7:quot;7/$7;7:()
       <=77())4>($89?-4,$#quot;'&)$%'($%))4@.(&$,quot;$)=77())4>($
       -%./)
       012$5%)$AB$-%./)
          <quot;$-%./$C$%&&'())$D$AB
          <%*($%)$,5($)4E($quot;6$%$5%:6?#%'+
             Fquot;$-%./$7quot;.6:47,)$-(,#((.$&466('(.,$5%:6?#%'+)G$quot;.:;$#4,54.$%$)4.@:($5%:6?#%'+




67
erf
                                                             P
     !quot;#$%&'(%()$*'+#,-'.),/01.23

       !quot;#$%&'(%()$*'13'#3'/#32'#3'$%4132%$3'1/'2quot;%$%'#$%'
       ,)'+#,-'.),/01.23

       5quot;%'/#32'.#3%6
          7/'#00'2quot;$%#&3')/'#'quot;#0/89#$:'#..%33'&1//%$%,2'+#,-3;'2quot;%$%'13'
          ,)'+#,-'.),/01.2
          7/'#00'2quot;$%#&3')/'#'quot;#0/89#$:'$%#&'2quot;%'1&%,21.#0'#&&$%33;'
          2quot;%$%'13',)'+#,-'.),/01.2'<+$)#&.#32=
       5quot;%'30)9'.#3%6
          >#,-'?),/01.26'(@021:0%'2quot;$%#&3'1,'2quot;%'3#(%'quot;#0/89#$:'
          #..%33'2quot;%'3#(%'+#,-
          A@32'3%$1#01B%'2quot;%'#..%33%3
          ?)32'C'(#D'E')/'31(@02#,%)@3'#..%33%3'2)'#'31,40%'+#,-


68
erf
                       P
     Conflicts,
Coalescing, Warps...
I hate growing up.
erf
                                P




!quot;#$%$&'#$()*+,'%quot;-./*0'#1$,*21')3quot;(3.
erf
                                                       P
     !quot;#$%&'($quot;)*+,*-

       ./0'.quot;1+2-'34#$quot;)*+,*-56
       7228*#$quot;#-*9
          :,quot;2-*;%)<
          =>,%?%)<'.!@!'Aquot;)B';,)C2%;#*
          .+--?8+*'C,$'->-)'*1quot;22'1quot;#$%;-*



                                     1   5   9    13
           1   2    3    4

                                     2   6   10   14
           5   6    7    8

                                     3   7   11   15
           9   10   11   12

                                     4   8   12   16
          13   14   15   16


70
erf
                                                                          P
     !quot;#$%&'(#')*+,%quot;(-$('



        __global__ void transpose_naive(float *odata, float *idata, int width, int height)
        {
      1. unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
      2. unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

      3. if (xIndex < width && yIndex < height)
           {
             unsigned int index_in = xIndex + width * yIndex;
      4.
             unsigned int index_out = yIndex + height * xIndex;
      5.
             $)%.%/0quot;)'12$3.4 = 0)%.%/0quot;)'120quot;4;
      6.
           }
         }




71
erf
                                                                        P
     !quot;#$%&'(#')*+,%quot;(-$('


           .'%)(*/quot;-01*2,$3*4565            <,/1'*$01-01*1$*4565

              ;8;    ;87     ;8:    ;879     ;8;   78;    :8;      798;


              78;    787     78:    7879     ;87   787    :87      7987




             798;   7987     798:   79879   ;879   7879   :879    79879




                     4565                                4565


     Stride = 1, coalesced                   Stride = 16, uncoalesced


72
erf
                                                         P
     !quot;#$%&'%()*+#,&-quot;&%

       .&&/0-12quot;,3)0#1+24)2&)-#+1212quot;,%()2,1quot;)&5/#+%)12$%&
       *6+%#(7$quot;'8)974:)7;<3
          =%#()16%)974:7;< 2,-/1)12$%:)&1quot;+%)2,1quot;)>?@?
          A+21%)16%)>?@?)(#1#)1quot;)97;:74< quot;/1-/1)12$%
              *+#,&-quot;&%)16%)2,(%42,B)2,1quot;)>?@?

       *6+%#()914:1;<3
          =%#(&)%$%0%,1)914:1;< C+quot;0)2,-/1)12$%
          A+21%&)%$%0%,1)914:1;< 2,1quot;)quot;/1-/1)12$%
       !quot;#$%&'2,B)2&)#'62%D%()2C3
          E$quot;'8F12$%)(20%,&2quot;,&)#+%)0/$12-$%&)quot;C)GH




73
erf
                                                            P
     !quot;#$%&'%()*+#,&-quot;&%
         4%#(&)5+quot;6)1232               .+/0%&)0quot;)7232

        <9<    <98    <9;    <98:    <9<    <98    <9;    <98:


        89<    898    89;    898:    89<    898    89;    898:




        8:9<   8:98   8:9;   8:98:   8:9<   8:98   8:9;   8:98:




         4%#(&)5+quot;6)7232              .+/0%&)0quot;)1232

        <9<    89<    ;9<    8:9<    <9<    <98    <9;    <98:

        <98    898    ;98    8:98    89<    898    89;    898:




        <98:   898:   ;98:   8:98:   8:9<   8:98   8:9;   8:98:

74
erf
                                                             P
     !quot;#quot;$%&'()(*+'(,-
      =1+23$;0,)$!quot;#quot;

                                  ./01+23$01+2$!quot;#quot;$4('/$3'0(21$5$67
     A?A    6?A    @?A    6>?A

                                      8+-9$:,-;<(:'3
     A?6    6?6    @?6    6>?6




     A?6>   6?6>   @?6>   6>?6>




                                  !,<B'(,-
     A?A    6?A    @?A    6>?A

                                      C<<,:+'1$+-$D1E'0+F :,<B)-
     A?6    6?6    @?6    6>?6
                                      =1+2$3'0(21$5$6G
                                      ./01+23$01+2$;0,)$:,-31:B'(H1$I+-93
     A?6>   6?6>   @?6>   6>?6>




75
erf
                                                             P
     !quot;#quot;$%&'()(*+'(,-
      =1+23$;0,)$!quot;#quot;

                                  ./01+23$01+2$!quot;#quot;$4('/$3'0(21$5$67
     A?A    6?A    @?A    6>?A

                                      8+-9$:,-;<(:'3
     A?6    6?6    @?6    6>?6




     A?6>   6?6>   @?6>   6>?6>




                                  !,<B'(,-
     A?A    6?A    @?A    6>?A

                                      C<<,:+'1$+-$D1E'0+F :,<B)-
     A?6    6?6    @?6    6>?6
                                      =1+2$3'0(21$5$6G
                                      ./01+23$01+2$;0,)$:,-31:B'(H1$I+-93
     A?6>   6?6>   @?6>   6>?6>




75
erf
                                                                                 P
     !quot;#$%&'%()*+#,&-quot;&%
        __global__ void transpose(float *odata, float *idata, int width, int height)
        {
       1. __shared__ float block[(BLOCK_DIM./)*BLOCK_DIM];

            unsigned int xBlock = blockDim.x * blockIdx.x;
       2.
            unsigned int yBlock = blockDim.y * blockIdx.y;
       3.
            unsigned int xIndex = xBlock + threadIdx.x;
       4.
            unsigned int yIndex = yBlock + threadIdx.y;
       5.
            unsigned int index_out, index_transpose;
       6.

       7. if (xIndex < width && yIndex < height)
          {
              unsigned int index_in = width * yIndex + xIndex;
       8.
              unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
       9.
              block[index_block] = idata[index_in];
      10.
              index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
      11.
              index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
      12.
          }
      13. __syncthreads();

      14. if (xIndex < width && yIndex < height)
              odata[index_out] = block[index_transpose];
      15.
        }
76
erf
                                                           P
     !quot;#$%&'%()!*+*$,%

       -&((./&%)0*12)3'#4(%3*$,)#$.)-565)'&1*+*7#1*'$8
            9:;<9:;8))=>=99+% ?%>)=>=::+%))@:>=A %&((./&B
            C9:<C9:8))=>=D+%)))?%>)=>EE+%))))@F>CA %&((./&B
          9=:F<9=:F8))=>E=+%)))?%>)9>G:+%))))@H>FA %&((./&B
          9=:F<:=F;8))=>DG+%)))?%>)H>H+%))))))@;>FA %&((./&B
       I'#4(%3*$,)0*12'/1)-565)'&1*+*7#1*'$8
            9:;<9:;8))=>=9F+%
            C9:<C9:8))=>9=9+%
          9=:F<9=:F8))=>F9:+%
          9=:F<:=F;8))=>;HG+%




77
erf
                                P




!quot;#$%&'()*+(),'-%./&'()*01&'2'3/&'()4
erf
                                                 P
     !quot;quot;#$%&quot;'

      ()*+%,-.&/0*#quot;0.1&/-%*+-+2+quot;#0+,-/+3#+&0.%44'5-/1-
      +2+quot;#0.&6-10)+*-7%*$/-./-0)+-1&4'-7%'-01-).,+-
      4%0+&quot;.+/-%&,-8++$-0)+-)%*,7%*+-9#/'

      !quot;quot;#$%&quot;' :-;#<9+*-1=-7%*$/-*#&&.&6-
      quot;1&quot;#**+&04'-1&-%-<#40.$*1quot;+//1*-,.>.,+,-9'-
      <%2.<#<-&#<9+*-1=-7%*$/-0)%0-quot;%&-*#&-
      quot;1&quot;#**+&04'

      ?.<.0+,-9'-*+/1#*quot;+-#/%6+@
         A+6./0+*/
         B)%*+,-<+<1*'


79
erf
                                                          P
     !quot;#$%&'()*+,#-.+/.0quot;#12#)1

       3+(4+5'()*1+6+3+(4+70'2#8quot;().11(quot;1
          ,(+9''+70'2#8quot;().11(quot;1+:9;.+92+'.912+(<.+5'()*+2(+.=.)02.


       3+(4+5'()*1+%+3+(4+70'2#8quot;().11(quot;1+6+>
          ?0'2#8'.+5'()*1+)9<+quot;0<+)(<)0quot;quot;.<2'@+#<+9+70'2#8quot;().11(quot;
          &'()*1+2:92+9quot;.<A2+B9#2#<C+92+9+DD1@<)2:quot;.9$1EF+*..8+2:.+
          :9quot;$B9quot;.+501@
          ,05G.)2+2(+quot;.1(0quot;).+9;9#'95#'#2@+H quot;.C#12.quot;1I+1:9quot;.$+7.7(quot;@


       3+(4+5'()*1+6+JKK+2(+1)9'.+2(+4020quot;.+$.;#).1
          &'()*1+.=.)02.$+#<+8#8.'#<.+491:#(<
          JKKK+5'()*1+8.quot;+Cquot;#$+B#''+1)9'.+9)quot;(11+70'2#8'.+C.<.quot;92#(<1



80
erf
                                                                 P
     !quot;#$%&quot;'()quot;*quot;+,quot;+-.

       !quot;/,0/1&quot;'02'$&quot;('quot;#$%&quot;'(,quot;*quot;+,quot;+-.
          3+%&'4-&$5+6%('quot;%47&(-/+(8quot;('quot;/,(9::(-.-7quot;%(7/&quot;'
          ;-quot;+/'$5%<=>)?<            @AB<


                       S T(.(U(JV         /,,N1O:(((P1OQ(P1EQ(P1:
                       W(T(S U(OV         /,,N1O:(((P1JQ(P1OQ(P1R

                       %[,/&/XYZ(UT(OV    7,N%D/'quot;,N1O:((P1OQ(XP'OEUYZ(
                                          /,,N1O:(((((((((((P1OQ(P1OQ(P1R


       A5(-5C*7quot;&quot;7.(D$,quot;(&Dquot;(7/&quot;+-.<(
          !4+(/&(7quot;/%&(EF: &D'quot;/,%(GH(2/'*%I(*quot;'(C47&$*'5-quot;%%5'
             ?&(7quot;/%&(:JK 5--4*/+-.
          AD'quot;/,%(,5(+5&(D/Lquot;(&5(8quot;75+#(&5(&Dquot;(%/Cquot;(&D'quot;/,(875-M

81
erf
                                                             P
     !quot;#$%&quot;'()'quot;%%*'quot;

       +$,quot;(-.&quot;/01(21(*%$/#(34'quot;(&5'quot;.,%(6quot;'(78
       9$3$&$/#(:.0&4'%;
          <*32quot;'(4=('quot;#$%&quot;'%(6quot;'(>quot;'/quot;-
              ?@AB 6quot;'(78C(6.'&$&$4/quot;,(.34/#(04/0*''quot;/&(&5'quot;.,%
          D34*/&(4=(%5.'quot;,(3quot;34'1
              @EFG 6quot;'(78C(6.'&$&$4/quot;,(.34/#(04/0*''quot;/&(&5'quot;.,2-40>%
       H5quot;0>(I0*2$/(=$-quot;(=4'(J('quot;#$%&quot;'%(K(>quot;'/quot;-
       L%quot;(M3.N''quot;#04*/&O< =-.#(&4(<PHH
          < O(,quot;%$'quot;,(3.N$3*3('quot;#$%&quot;'%(K(>quot;'/quot;-
          D&(%43quot;(64$/&(Q%6$--$/#R $/&4(98S8(3.1(400*'
              !quot;,*0quot;%(6quot;'=4'3./0quot;(M 98S8($%(%-4T
              H5quot;0>(I0*2$/(=$-quot;(=4'(98S8(*%.#quot;



82
erf
                                                           P
     !quot;#quot;$%&'&'()$quot;*+,$-quot;),*.(quot;
       /*quot;)012#3+2#&+'*4567 +2#&+')#+)'6--
       8$9)-+%2&:quot;)#;quot;)<quot;$'quot;:)-+=quot;)>&#;)#;quot;)5-,?&')@:.()#+)
       =quot;#quot;$%&'quot;)$quot;(&*#quot;$),*.(quot;A
       82quot;')#;quot;)A-,?&')@&:quot;)>&#;).)#quot;3#)quot;=&#+$).'=):++<)@+$)
       #;quot;)0-+=quot;7 *quot;-#&+'A
          architecture {sm_10}
          abiversion {0}
          modname {cubin}
          code {
                                              per thread local memory
              name = BlackScholesGPU
              lmem = 0
              smem = 68                   per thread block shared memory
              reg = 20
              bar = 0
                                                per thread registers
              bincode {
                   0xa0004205 0x04200780 0x40024c09 0x00200780
                   …

83
erf
                                 P
     !quot;#$%&''()*+',%!*-'(-*./0




84
erf
                                                             P
     !quot;#$%$&$'()#*+,-./)quot;,+)01234
       5*22/,)#*+,-./)quot;,+)01234)-/)-)%61#$quot;1,)27)8-+quot;)/$&,
          9:2$.)8-/#$'()32%quot;6#-#$2')2')6'.,+;quot;2quot;61-#,.)8-+quot;/
       <2+,)#*+,-./)quot;,+)01234)==)0,##,+)%,%2+>)1-#,'3>)
       *$.$'(
       ?6#@)%2+,)#*+,-./)quot;,+)01234)==)7,8,+)+,($/#,+/)quot;,+)
       #*+,-.
          A,+',1)$':23-#$2'/)3-')7-$1)$7)#22)%-'>)+,($/#,+/)-+,)6/,.
       B,6+$/#$3/
          <$'$%6%C)DE)#*+,-./)quot;,+)01234
             !'1>)$7)%61#$quot;1,)32'36++,'#)01234/)
          FGH)2+)HID)#*+,-./)-)0,##,+)3*2$3,
             J/6-11>)/#$11),'26(*)+,(/)#2)32%quot;$1,)-'.)$':24,)/633,//7611>
          K*$/)-11).,quot;,'./)2')>26+)32%quot;6#-#$2'@)/2),Lquot;+$%,'#M


85
erf
                                                            P
     !quot;quot;#$%&quot;'()*(+,-./-0%&quot;,

       1&quot;-,%23&4(/quot;quot;#$%&quot;'(5/,2(&/6(&,quot;,22%-37'(
       3&quot;-,%2,($,-./-0%&quot;,

                                BUT…

       8/9:/quot;quot;#$%&quot;'(0#763$-/quot;,22/-2(quot;%&&/6(%5,;#%6,7'(
       <35,(7%6,&quot;'(/&(0,0/-':=/#&5(>,-&,72
          ?16(%77(quot;/0,2(5/9&(6/(%-36<0,63quot;(3&6,&236'(%&5(%@%37%=7,(
          $%-%77,7320A




86
erf
                                                      P
     !quot;#quot;$%&%#'(%)*+,#)-../'0quot;&'+1

       !quot;#quot;$%&%#'(quot;&'+1)2%/.3)quot;4quot;.&quot;&'+1)&+)4'55%#%1&)6!73

       6!73)8quot;#9)'1)$quot;19):quot;93
          ;)+5)$,/&'.#+0%33+#3
          <%$+#9)=quot;14:'4&2
          >2quot;#%4)$%$+#9)3'(%
          ?%@'3&%#)5'/%)3'(%
          A2#%quot;43).%#)=/+0B

       *+,)0quot;1)%8%1)$quot;B%)quot;..3)3%/5C&,1'1@)D/'B%)EEAF)quot;14)
       -AG->H
          IJK.%#'$%1&L $+4%)4'30+8%#3)quot;14)3quot;8%3)+.&'$quot;/)
          0+15'@,#quot;&'+1



87
erf
                                                        P
     !quot;#$%&'(quot;#
       )#*+,'-.#*/!)01/2+,3quot;,4.#$+/$5.,.$-+,('-($'
          6+4quot;,7/$quot;.%+'$(#8
          0(9+,8+#-/:,.#$5(#8
          ;.#</$quot;#3%($-'
          =.-+#$7/5(*(#8
       )'+/2+.</2+,3quot;,4.#$+/4+-,($'/-quot;/8&(*+/quot;2-(4(>.-(quot;#/
       )#*+,'-.#*/2.,.%%+%/.%8quot;,(-54/$quot;42%+?(-7/-5+quot;,7
       @#quot;A/5quot;A/-quot;/(*+#-(37/-72+/quot;3/:quot;--%+#+$<
          +B8B/4+4quot;,7C/$quot;,+/$quot;42&-.-(quot;#C/quot;,/(#'-,&$-(quot;#/quot;9+,5+.*
       D2-(4(>+/7quot;&,/.%8quot;,(-54C/then &#,quot;%%/%quot;quot;2'
       )'+/-+42%.-+/2.,.4+-+,'/-quot;/8+#+,.-+/quot;2-(4.%/$quot;*+



88
erf
                           P




!quot;#$%&'($)*+,-.$/012*.#0
erf
                                                         P
     !quot;#$%&'($)*+,-.$/012*.#0

       3#.4+$5#-+,0#$-67$2*67$418#68*-.$4#02105-69#$
       401:.#5
          ;/&$-67$%/&$8*5*6<$210$-..$=#06#.$*6>19-8*16+$-67$
          5#594?+
          !*5#$+8-54+


       (99#++$81$quot;-07@-0#$4#02105-69#$91,68#0+$




61
erf
                      P
      !quot;#$%&'quot;()'*#




101
erf
                                                                        P
     !quot;#$%&'
       ()*$+',%-*,+-%./*0,1quot;+2,2%-01%-*,.34$+*-',3$,'quot;#$%&',quot;$,+2*,.2quot;56

           +quot;7*'+%75

           #&08quot;$.32*-*$+
                                     Global memory loads/stores are coalesced
           #&08.32*-*$+
                                     (coherent) or non-coalesced (incoherent)
           #'+8quot;$.32*-*$+
           #'+8.32*-*$+

           &3.%&8&3%0
                                      Local loads/stores
           &3.%&8'+3-*

                                     Total branches and divergent branches
           9-%$.2
           0quot;)*-#*$+89-%$.2          taken by threads

           quot;$'+-4.+quot;3$' : quot;$'+-4.+quot;3$,.34$+

           1%-58'*-quot;%&quot;;* : +2-*%0,1%-5',+2%+,'*-quot;%&quot;;*,3$,%00-*'',.3$<&quot;.+',+3,
           '2%-*0,3-,.3$'+%$+,7*73-=

           .+%8&%4$.2*0 : *>*.4+*0,+2-*%0,9&3./'


62
erf
                                                           P
     !quot;#$%&%$#'quot;()&%*+',$%)-*.quot;#$%/

       01,.$/)%$&%$/$quot;#)$2$quot;#/)3'#4'quot;)1)#4%$15)31%&

       6quot;,7)#1%($#/)*quot;$)8.,#'&%*-$//*%
          01,.$/)3',,)quot;*#)-*%%$/&*quot;5)#*)#4$)#*#1,)quot;.89$%)*+)31%&/)
          ,1.quot;-4$5)+*%)1)&1%#'-.,1%):$%quot;$,;
          <1.quot;-4)$quot;*.(4)#4%$15)9,*-:/)#*)$quot;/.%$)#41#)#4$)#1%($#)
          8.,#'&%*-$//*%)'/)('2$quot;)1)-*quot;/'/#$quot;#)&$%-$quot;#1($)*+)#4$)#*#1,)
          3*%:;

       01,.$/)1%$)9$/#)./$5)#*)'5$quot;#'+7)%$,1#'2$)&$%+*%81quot;-$)
       5'++$%$quot;-$/)9$#3$$quot;).quot;*&#'8'=$5)1quot;5)*&#'8'=$5)-*5$
          !quot;)*#4$%)3*%5/>)#%7)#*)%$5.-$)#4$)81(quot;'#.5$/)*+)
          (,5?(/#@'quot;-*4$%$quot;#>)5'2$%($quot;#@9%1quot;-4>)1quot;5)31%&@/$%'1,'=$


63
ME
CO
Back Pocket Slides




                     slide by David Cox
6.963
               IT /
         A@M
      CUD
    9
IAP0


               Dense
       Linear Algebra
!quot;#$quot;%&'#quot;()%*+,quot;-)(


4/5,-quot;.-,6789:;       B,A-C8quot;,Dquot;7/-?8E:C/78quot;C/:8:;

! <128/-quot;:=:089:      ! *8-,2/A01C:;quot;F>4
      $% ! &

! >1?82@/7A8:         ! +,9.A0/01,2/7quot;CG891:0-=
      $% ! quot;%
! B12?A7/-quot;@/7A8:     ! )/0/quot;91212?
    $ ! #!quot;       !


                                       !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
!quot;#$quot;%&'#quot;()%*+,quot;-)(


4/5,-quot;.-,6789:;       *7?,-10C9:;

! <128/-quot;:=:089:      ! D28E:1F8Fquot;G/H0,-1I/01,2:;
      $% ! &                <JKquot;+C,78:L=Kquot;MN

! >1?82@/7A8:         ! OP,E:1F8Fquot;G/H0,-1I/01,2:;
      $% ! quot;%                MNquot;/7?3Kquot;Q/H,61

! B12?A7/-quot;@/7A8:     ! OP,E:1F8Fquot;G/H0,-1I/01,2:;
    $ ! #!quot;       !
                                   B')

                                      !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
!quot;#$%&'!quot;#$%&()$*+,-#.(!quot;#quot;!quot;$quot;%quot;&quot;'


45*6quot;78,-0-/29::;<                 +=45*6quot;7+;<
                                       6789:;$<=--(!)*+,!!)*+,
    !quot;##!$%&''(!)*+,!)*+,
                                                    -,!.,!/,
                -,!.,!/,                            012,!>quot;,!#3quot;,
                012,!quot;,!#3quot;,                             >4,!#34,
                                                    012,!>!,!#3!!5?
                     4,!#34,
                012,!!,!#3!!5


+,>.?0/01,2quot;12quot;@A=quot;-BC?1-BD<
! (2101/E1F/01,2quot;,Gquot;+=)*quot;B2H1-,2>B20
! *EE,I/01,2quot;,Gquot;J/0/quot;D0-?I0?-BDquot;12quot;@A=quot;>B>,-Kquot;7L/2JEB-Dquot;!quot;#$!%#$!&;
! M-/2DGB-quot;,Gquot;J/0/quot;7>/0-1IBDquot;quot;#$%#$&;
! +,>.?0/01,2quot;7I?NE/D6OB>>;
! PB0-1BHBquot;-BD?E0quot;7>/0-1Qquot;&;
! 8-BBquot;J/0/quot;D0-?I0?-BDquot;12quot;@A=quot;>B>,-K
! MB->12/01,2quot;,Gquot;+=)*quot;B2H1-,2>B20
                                                  !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;()*+,(!
    quot;#$!%&'(!
                      quot;-./01! 2011quot;-.! 0011quot;-. 0311quot;-4
 5!*6!7(,8*+!,*+(9!       :1!         ;3!        ;3!         <!
 ,*+(!,=*,>?!quot;@A!        ;B:1!       ;B3C!      ;B:D!       ;B<D!
   +(EF98(+9G,*+(!      3<HI!       :/HI!      :/HI!       :/HI!
    9'('G,*+(!          ;3HI!       ;3HI!      ;3HI!       ;3HI!
'('*+J!KL9?!quot;@A           ;B;!        ;B;!       1B2!       ;B1!
'('*+J!KL9?!MF%9         D;/!        /D3!       :0<!        ;/0!
 K&%NOFN8P?!quot;IG9!        ;<;!         C1!        03!         :/!
 '('*+J!&'*L%8!          ;quot;I!      D;/QI!     C30QI!      /D3QI!
 4#?!M(&>!quot;6=*MG9!       3/<!        </2!       :<3!         2:!
4#?!M(&>!M(+!,*+(!        /;!         /C!        //!         /:!
  4#?!6=*M9RO*+N!         ;0!         /D!        ;3!         ;/!
 S#?!M(&>!quot;6=*MG9!        C0!         T!         T!          T!
  S#?!6=*M9RO*+N!         <B<!        T!         T!          T!
-&K=(!;R!-P(!=F98!*6!8P(!quot;#$9!L9(N!F%!8PF9!98LNJB!4#!F9!9F%E=(!M+(U
,F9F*%!&%N!S#!F9!N*LK=(!M+(,F9F*%B!4'('!F9!9P&+(N!'('*+JB!#(&>!
6=*M!+&8(9!&+(!9P*O%!6*+!'L=8FM=J!&%N!&NN!*M(+&8F*%9B!)=*M9RO*+N!
   F9!8P(!+&8F*!*6!M(&>!quot;6=*MG9!+&8(!8*!MF%U'('*+J!K&%NOFN8P!F%!
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Más contenido relacionado

La actualidad más candente

Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with RubyShin Yee Chung
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJAX London
 
Shader model 5 0 and compute shader
Shader model 5 0 and compute shaderShader model 5 0 and compute shader
Shader model 5 0 and compute shaderzaywalker
 
Ca บทที่สี่
Ca บทที่สี่Ca บทที่สี่
Ca บทที่สี่atit604
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 

La actualidad más candente (19)

GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
Cuda
CudaCuda
Cuda
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
CUDA
CUDACUDA
CUDA
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
CUDA Architecture
CUDA ArchitectureCUDA Architecture
CUDA Architecture
 
MXF & AAF
MXF & AAFMXF & AAF
MXF & AAF
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
Cuda
CudaCuda
Cuda
 
Lect10
Lect10Lect10
Lect10
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
Java and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk PepperdineJava and the machine - Martijn Verburg and Kirk Pepperdine
Java and the machine - Martijn Verburg and Kirk Pepperdine
 
Shader model 5 0 and compute shader
Shader model 5 0 and compute shaderShader model 5 0 and compute shader
Shader model 5 0 and compute shader
 
Ca บทที่สี่
Ca บทที่สี่Ca บทที่สี่
Ca บทที่สี่
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 

Similar a IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET Journal
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools
[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools
[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages toolslaparuma
 
[05][cuda 및 fermi 최적화 기술] hryu optimization
[05][cuda 및 fermi 최적화 기술] hryu optimization[05][cuda 및 fermi 최적화 기술] hryu optimization
[05][cuda 및 fermi 최적화 기술] hryu optimizationlaparuma
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.J On The Beach
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Stefano Di Carlo
 

Similar a IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT) (20)

An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Cuda materials
Cuda materialsCuda materials
Cuda materials
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012 Workshop actualización SVG CESGA 2012
Workshop actualización SVG CESGA 2012
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools
[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools
[01][gpu 컴퓨팅을 위한 언어, 도구 및 api] miller languages tools
 
[05][cuda 및 fermi 최적화 기술] hryu optimization
[05][cuda 및 fermi 최적화 기술] hryu optimization[05][cuda 및 fermi 최적화 기술] hryu optimization
[05][cuda 및 fermi 최적화 기술] hryu optimization
 
cuda.ppt
cuda.pptcuda.ppt
cuda.ppt
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.Using GPUs to handle Big Data with Java by Adam Roberts.
Using GPUs to handle Big Data with Java by Adam Roberts.
 
Advances in GPU Computing
Advances in GPU ComputingAdvances in GPU Computing
Advances in GPU Computing
 
GPU - how can we use it?
GPU - how can we use it?GPU - how can we use it?
GPU - how can we use it?
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
Multi-faceted Microarchitecture Level Reliability Characterization for NVIDIA...
 

Más de npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...npinto
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...npinto
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
 

Más de npinto (20)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 

Último

What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?TechSoup
 
UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE
 
CAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxCAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxSaurabhParmar42
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesCeline George
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxiammrhaywood
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
The Singapore Teaching Practice document
The Singapore Teaching Practice documentThe Singapore Teaching Practice document
The Singapore Teaching Practice documentXsasf Sfdfasd
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...Nguyen Thanh Tu Collection
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17Celine George
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICESayali Powar
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational PhilosophyShuvankar Madhu
 
Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationMJDuyan
 
Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxKatherine Villaluna
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17Celine George
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxraviapr7
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.EnglishCEIPdeSigeiro
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptxmary850239
 

Último (20)

Prelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quizPrelims of Kant get Marx 2.0: a general politics quiz
Prelims of Kant get Marx 2.0: a general politics quiz
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024UKCGE Parental Leave Discussion March 2024
UKCGE Parental Leave Discussion March 2024
 
CAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxCAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptx
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 Sales
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptxAUDIENCE THEORY -- FANDOM -- JENKINS.pptx
AUDIENCE THEORY -- FANDOM -- JENKINS.pptx
 
Finals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quizFinals of Kant get Marx 2.0 : a general politics quiz
Finals of Kant get Marx 2.0 : a general politics quiz
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
The Singapore Teaching Practice document
The Singapore Teaching Practice documentThe Singapore Teaching Practice document
The Singapore Teaching Practice document
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17How to Add a New Field in Existing Kanban View in Odoo 17
How to Add a New Field in Existing Kanban View in Odoo 17
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational Philosophy
 
Benefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive EducationBenefits & Challenges of Inclusive Education
Benefits & Challenges of Inclusive Education
 
Practical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptxPractical Research 1 Lesson 9 Scope and delimitation.pptx
Practical Research 1 Lesson 9 Scope and delimitation.pptx
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.Easter in the USA presentation by Chloe.
Easter in the USA presentation by Chloe.
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 

IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

  • 1. 6.963 IT / A@M CUD 9 IAP0 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 04 Nicolas Pinto (MIT) #1 CUDA - Advanced
  • 2. During this course, 3 6 for 6.9 ed adapt we’ll try to “ ” and use existing material ;-)
  • 7. 6.963 IT / A@M CUD 9 IAP0 Textures & OpenGL Async API Libraries Interfacing CUDA Performance
  • 8. 6.963 IT / A@M CUD 9 IAP0 CUDA Textures and OpenGL
  • 9. res xtu Te Textures in CUDA Different hardware path to memory Benefits of CUDA textures: Texture fetches are cached Optimized for 2D locality Textures are addressable in 2D Using integer or normalized coordinates Means fewer addressing calculations in code Provide filtering for free Free wrap modes (boundary conditions) Clamp to edge / repeat Limitations of CUDA textures: Read-only Currently either 1D or 2D (3D will be added) 9-bit accuracy of filter weights © NVIDIA Corporation 2008 160
  • 10. res xtu Te Two CUDA Texture Types Bound to linear memory Global memory is bound to a texture Only 1D Integer addressing No filtering, no addressing modes Bound to CUDA arrays CUDA array is bound to a texture 1D or 2D Float addressing (size-based or normalized) Filtering Addressing modes (clamping, repeat) Both: Return either element type or normalized float © NVIDIA Corporation 2008 161
  • 11. res xtu Te CUDA Texturing Steps Host (CPU) code: Allocate/obtain memory (global linear, or CUDA array) Create a texture reference object Currently must be at file-scope Bind the texture reference to memory/array When done: Unbind the texture reference, free resources Device (kernel) code: Fetch using texture reference Linear memory textures: tex1Dfetch() Array textures: tex1D() or tex2D() © NVIDIA Corporation 2008 162
  • 12. res xtu Te Texture Reference Immutable parameters (compile-time) Type: type returned when fetching Basic int, float types CUDA 1-, 2-, 4-element vectors Dimensionality: Currently 1 or 2 (3 will be supported in the future) Read Mode: cudaReadModeElementType cudaReadModeNormalizedFloat (valid for 8- or 16-bit ints) – returns [-1,1] for signed, [0,1] for unsigned Mutable parameters (run-time, only for array-textures) Normalized: non-zero = addressing range [0, 1] Filter Mode: cudaFilterModePoint cudaFilterModeLinear Address Mode: cudaAddressModeClamp cudaAddressModeWrap © NVIDIA Corporation 2008 163
  • 13. Example: Host code for linear mem // declare texture reference (must be at file-scope) texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef; ... // set up linear memory unsigned short *dA = 0; cudaMalloc((void**)&dA, numBytes); cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice); // bind texture reference to array res cudaBindTexture(NULL, texRef, dA); xtu Te © NVIDIA Corporation 2008 164
  • 14. res xtu Te cudaArray Type Channel format, width, height cudaChannelFormatDesc structure int x, y, z, w: bits for each component enum cudaChannelFormatKind – one of: cudaChannelFormatKindSigned cudaChannelFormatKindUnsigned cudaChannelFormatKindFloat some predefined constructors: cudaCreateChannelDesc<float>(void); cudaCreateChannelDesc<float4>(void); Management functions: cudaMallocArray, cudaFreeArray, cudaMemcpyToArray, cudaMemcpyFromArray, ... © NVIDIA Corporation 2008 165
  • 15. Example: Host code for 2D array tex // declare texture reference (must be at file-scope) texture<float, 2, cudaReadModeElementType> texRef; ... // set up the CUDA array cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>(); cudaArray *texArray = 0; cudaMallocArray(&texArray, &cf, dimX, dimY); cudaMempcyToArray(texArray, 0,0, hA, numBytes, cudaMemcpyHostToDevice); // specify mutable texture reference parameters texRef.normalized = 0; res texRef.filterMode = cudaFilterModeLinear; xtu texRef.addressMode = cudaAddressModeClamp; Te // bind texture reference to array cudaBindTextureToArray(texRef, texArray); © NVIDIA Corporation 2008 166
  • 16. nGL pe O OpenGL Interoperability OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory Vertex buffer objects Pixel buffer objects Direct3D9 Vertex objects can be mapped Data can be accessed like any other global data in the device code Image data can be displayed from pixel buffer objects using glDrawPixels / glTexImage2D Requires copy in video memory, but still fast © NVIDIA Corporation 2008 177
  • 17. nGL pe O OpenGL Interop Steps Register a buffer object with CUDA cudaGLRegisterBufferObject(GLuint buffObj); OpenGL can use a registered buffer only as a source Unregister the buffer prior to rendering to it by OpenGL Map the buffer object to CUDA memory cudaGLMapBufferObject(void **devPtr, GLuint buffObj); Returns an address in global memory Buffer must registered prior to mapping Launch a CUDA kernel to process the buffer Unmap the buffer object prior to use by OpenGL cudaGLUnmapBufferObject(GLuint buffObj); Unregister the buffer object cudaGLUnregisterBufferObject(GLuint buffObj); Optional: needed if the buffer is a render target Use the buffer object in OpenGL code © NVIDIA Corporation 2008 178
  • 18. nGL pe O Interop Scenario: Dynamic CUDA-generated texture Register the texture PBO with CUDA For each frame: Map the buffer Generate the texture in a CUDA kernel Unmap the buffer Update the texture Render the textured object unsigned char *p_d=0; cudaGLMapBufferObject((void**)&p_d, pbo); prepTexture<<<height,width>>>(p_d, time); cudaGLUnmapBufferObject(pbo); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glBindTexture(GL_TEXTURE_2D, texID); glTexSubImage2D(GL_TEXTURE_2D, 0, 0,0, 256,256, GL_BGRA, GL_UNSIGNED_BYTE, 0); © NVIDIA Corporation 2008 179
  • 19. nGL pe O Interop Scenario: Frame Post-processing by CUDA For each frame: Render to PBO with OpenGL Register the PBO with CUDA Map the buffer Process the buffer with a CUDA kernel Unmap the buffer Unregister the PBO from CUDA unsigned char *p_d=0; cudaGLRegisterBufferObject(pbo); cudaGLMapBufferObject((void**)&p_d, pbo); postProcess<<<blocks,threads>>>(p_d); cudaGLUnmapBufferObject(pbo); cudaGLUnregisterBufferObject(pbo); ... © NVIDIA Corporation 2008 180
  • 20. 6.963 IT / A@M CUD 9 IAP0 CUDA Async API
  • 21. ync As !quot;#$%&'($()quot;*+,+('#*%(-# !quot;#$%&'($()quot;*&(quot;.*!quot; /,01%,*+,+('#*%(-#*2('* -34,56(%7,/*+,+('#*2',,quot;*)-*89:*($*366*8:;!* %3-3<6,*/,01%,quot; =0,'63-*1+-6,+,$.,/*<#*)quot;1$4*3*8:;!*quot;.',3+ 8:;!*>.',3+*?*>,@),$%,*(2*8:;!*(-,'3.1($quot;*.&3.* ,A,%).,*1$*('/,' >.',3+*!9BC D3%&*quot;.',3+*&3quot;*3$*B;C*E*?*/,23)6.*quot;.',3+ cudaMemcpyAsync(dst, src, size, 0); 97
  • 22. ync As !quot;#$%&'()#$*#%(&*+(,#,-$.(/-'. 0-*/1$$#*2(#3#/124-*(-5(&()#$*#%(&*+(&(6-72(!quot; +#quot;4/#(,#,-$.(/-'.(5-$('&8#9%-/)#+(,#,-$. 0-,'12#(/&'&:4%42.(;<(=>=(?@AB(&*+(1'C Dquot;&4%&:%#(&7(&('$#quot;4#E(5#&21$#(4*(0FGD(=>= !quot;#$%&'7()#$*#%(#3#/124-*(4*(-*#(72$#&,(E426(&(,#,-$.( /-'.(5$-,(&*-26#$(72$#&, H2$#&,(DIJK cudaStreamCreate(&stream1); cudaStreamCreate(&stream2); cudaMemcpyAsync(dst, src, size, stream1); -quot;#$%&''#+ kernel<<<grid, block, 0, stream2>>>(…); cudaStreamQuery(stream2); 98
  • 23. ync As !quot;#$%&'()*%$+, &'()*-%./(%0)-(/*(1%2/(34/1(15%0)*4%!quot;#$%3.66%-*/(.7- quot;-.8(%-3()./04-9 7(.-:/(%(6.;-(1%*07(%<4/%!quot;#$%3.66-%23643=%3>36(%;/(30-04)5 ?:(/>%*@(%-*.*:-%4<%.)%.->)3@/4)4:-%!quot;#$%3.66 A643=%!+quot;%:)*06%!quot;#$%3.66-%;/04/%*4%*@(%('()*%./(%347;6(*(1 .->)3$+, -.7;6(%0)%!quot;#$%B#C 3:1.&'()*D* -*./*E%-*4;F 3:1.&'()*!/(.*(2G-*./*5F 3:1.&'()*!/(.*(2G-*4;5F 3:1.&'()*H(34/12-*./*E%I5F =(/)(6JJJ8/01E%A643=KKK2LLL5F 3:1.&'()*H(34/12-*4;E%I5F 3:1.&'()*B>)3@/4)0M(2-*4;5F <64.*%(*F 3:1.&'()*&6.;-(1N07(2G(*E%-*./*E%-*4;5F 3:1.&'()*#(-*/4>2-*./*5F 3:1.&'()*#(-*/4>2-*4;5F 95 95
  • 24. 6.963 IT / A@M CUD 9 IAP0 CUDA Libraries
  • 25. ary ibr L CUDA libraries CUDA includes 2 widely used libraries CUBLAS: BLAS implementation CUFFT: FFT implementation CUDPP (Data Parallel Primitives), available from http://www.gpgpu.org/developer/cudpp/ : Reduction Scan Sort 9 M02: High Performance Computing with CUDA
  • 26. ary ibr L Closely Coupled CPU-GPU Function Function Function Lib Lib Init GPU Alloc CPU Operation 1 Operation 2 Operation 3 Integrated programming model High speed data transfer – up to 5.5GB/sec Asynchronous data transfer Large GPU memory systems 10 M02: High Performance Computing with CUDA
  • 27. ary ibr L CUBLAS Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver Self-contained at the API level, no direct interaction with CUDA driver Basic model for use Create matrix and vector objects in GPU memory space Fill objects with data Call sequence of CUBLAS functions Retrieve data from GPU CUBLAS library contains helper functions Creating and destroying objects in GPU space Writing data to and retrieving data from objects 11 M02: High Performance Computing with CUDA
  • 28. ary ibr L Using CUBLAS Interface to CUBLAS library is in cublas.h Function naming convention cublas + BLAS name Eg., cublasSGEMM Error handling CUBLAS core functions do not return error CUBLAS provides function to retrieve last error recorded CUBLAS helper functions do return error Helper functions: Memory allocation, data transfer Implemented using C-based CUDA tool chain Interfacing to C/C++ applications is trivial 13 M02: High Performance Computing with CUDA
  • 29. ary ibr L Supported Features Single Precision Double Precision* Real Complex Real Complex ! ! ! Level 1 dgemv, ! dger, Level 2 dsyr, dtrsv cgemm zgemm ! ! Level 3 *Double-precision functions only supported on GPUs with double-precision hardware © 2008 NVIDIA Corporation.
  • 30. ary ibr L CUBLAS Helper Functions cublasInit() Initializes CUBLAS library cublasShutdown() Releases resources used by CUBLAS library cublasGetError() Returns last error from CUBLAS core function (+ resets) cublasAlloc() Wrapper around cudaMalloc() to allocate space for array cublasFree() destroys object in GPU memory cublas[Set|Get][Vector|Matrix]() Copies array elements between CPU and GPU memory Accommodates non-unit strides © 2008 NVIDIA Corporation.
  • 31. ary ibr L sgemmExample.c #include <stdio.h> cublasInit(); #include <stdlib.h> #include quot;cublas.hquot; cublasAlloc(n2, sizeof(float), (void **)&a_d); cublasAlloc(n2, sizeof(float), (void **)&b_d); int main(void) cublasAlloc(n2, sizeof(float), (void **)&c_d); { float *a_h, *b_h, *c_h; cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1); float *a_d, *b_d, *c_d; cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1); float alpha = 1.0f, beta = 0.0f; int N = 2048, n2 = N*N; cublasSgemm('n', 'n', N, N, N, alpha, a_d, N, int nBytes, i; b_d, N, beta, c_d, N); nBytes = n2*sizeof(float); cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1); a_h = (float *)malloc(nBytes); free(a_h); free(b_h); free(c_h); b_h = (float *)malloc(nBytes); cublasFree(a_d); cublasFree(b_d); c_h = (float *)malloc(nBytes); cublasFree(c_d); for (i=0; i < n2; i++) { cublasShutdown(); return 0; a_h[i] = rand() / (float) RAND_MAX; } b_h[i] = rand() / (float) RAND_MAX; } © 2008 NVIDIA Corporation.
  • 32. ary ibr L Calling CUBLAS from FORTRAN Two interfaces: Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c) Allows interfacing to existing applications without any changes During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy back the results to CPU memory space and deallocate the GPGPU memory Intended for light testing due to call overhead Non-Thunking (default) Intended for production code Substitute device pointers for vector and matrix arguments in all BLAS functions Existing applications need to be modified slightly to allocate and deallocate data structures in GPGPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX) 14 M02: High Performance Computing with CUDA
  • 33. ary ibr L SGEMM example (THUNKING) ! Define 3 single precision matrices A, B, C real , dimension(m1,m1):: A, B, C …… ! Initialize …… #ifdef CUBLAS ! Call SGEMM in CUBLAS library using THUNKING interface (library takes care of ! memory allocation on device and data movement) call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1) #else ! Call SGEMM in host BLAS library call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1) #endif To use the host BLAS routine: g95 –O3 code.f90 –L/usr/local/lib -lblas To use the CUBLAS routine (fortran.c is provided by NVIDIA): gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas 15 M02: High Performance Computing with CUDA
  • 34. ary ibr L SGEMM example (NON-THUNKING) ! Define 3 single precision matrices A, B, C real , dimension(m1,m1):: A, B, C integer:: devPtrA, devPtrB, devPtrC, size_of_real=4 …… ! Initialize A, B, C ……… ! Allocate matrices on GPU cublasAlloc(m1*m1, size_of_real, devPtrA) cublasAlloc(m1*m1, size_of_real, devPtrB) cublasAlloc(m1*m1, size_of_real, devPtrC) !Copy data from CPU to GPU cublasSetMatrix(m1,m1, size_of_real, A,m1, devPtrA, m1) cublasSetMatrix(m1,m1, size_of_real, B,m1, devPtrB, m1) cublasSetMatrix(m1,m1, size_of_real, C,m1, devPtrC, m1) ! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in GPU memory) call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1) !Copy data from GPU to CPU cublasGetMatrix(m1,m1, size_of_real, devPtrC,m1, C, m1) ! Free memory on device cublasFree(devPtrA) …… g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas 16 M02: High Performance Computing with CUDA
  • 35. !quot;#$%&'()*#+,-./0,12,34#quot;,5quot;#0quot;,6*#quot;'(,78+quot;9(', , V&9F=J!V*=>*7! X&'(9!YB!S(''(=! W*'ML8(+!4,F(%,(!SF7F9F*%! W*'ML8(+!4,F(%,(!SF7F9F*%!&%N!S(M&+8'(%8!*6!Q&8P('&8F,9! $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J! 7901('$1, quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;#$!%&'(! Y(! M+(9(%8! M(+6*+'&%,(! +(9L=89! 6*+! N(%9(! =F%(&+! &=E(K+&! L9F%E! quot;-./01! 2011quot;-.! 0011quot;-. 0311quot;-4 +(,(%8! ZV[S[! quot;#$9B! ]L+! '&8+F^U'&8+F^! 'L=8FM=J! +*L8F%(! 5!*6!7(,8*+!,*+(9! :1! ;3! ;3! <! _quot;`QQa! +L%9! LM! 8*! 31b! 6&98(+!8P&%! 8P(! 7(%N*+c9! F'M=('(%8&U ary ,*+(!,=*,>?!quot;@A! ;B:1! ;B3C! ;B:D! ;B<D! 8F*%!&%N!&MM+*&,P(9!8P(!M(&>!*6!P&+NO&+(!,&M&KF=F8F(9B!]L+!d$?! ibr +(EF98(+9G,*+(! 3<HI! :/HI! :/HI! :/HI! ef! &%N! WP*=(9>J! 6&,8*+FA&8F*%9! &,PF(7(! LM! 8*! 01g21b! *6! 8P(! L 9'('G,*+(! ;3HI! ;3HI! ;3HI! ;3HI! M(&>! quot;`QQ! +&8(B! ]L+! M&+&==(=! d$! +L%%F%E! *%! 8O*! quot;#$9! '('*+J!KL9?!quot;@A ;B;! ;B;! 1B2! ;B1! &,PF(7(9!LM!8*!hD<1!quot;6=*MG9B!-P(9(!+(9L=89!&+(!&,,*'M=F9P(N!KJ! '('*+J!KL9?!MF%9 D;/! /D3! :0<! ;/0! ,P&==(%EF%E!8P(!&,,(M8(N!7F(O!*6!8P(!quot;#$!&+,PF8(,8L+(!&%N!M+*U K&%NOFN8P?!quot;IG9! ;<;! C1! 03! :/! E+&''F%E! ELFN(=F%(9B! Y(! &+EL(! 8P&8! '*N(+%! quot;#$9! 9P*L=N! K(! '('*+J!&'*L%8! ;quot;I! D;/QI! C30QI! /D3QI! 7F(O(N! &9! 'L=8F8P+(&N(N! 'L=8F,*+(! 7(,8*+! L%F89B! Y(! (^M=*F8! 4#?!M(&>!quot;6=*MG9! 3/<! </2! :<3! 2:! K=*,>F%E!9F'F=&+=J!8*!7(,8*+!,*'ML8(+9!&%N!P(8(+*E(%(F8J!*6!8P(! 4#?!M(&>!M(+!,*+(! /;! /C! //! /:! 9J98('! KJ! ,*'ML8F%E! K*8P! *%! quot;#$! &%N! W#$B! -PF9! 98LNJ! F%U 4#?!6=*M9RO*+N! ;0! /D! ;3! ;/! ,=LN(9! N(8&F=(N! K(%,P'&+>F%E! *6! 8P(! quot;#$! '('*+J! 9J98('! 8P&8! S#?!M(&>!quot;6=*MG9! C0! T! T! T! +(7(&=9! 9FA(9! &%N! =&8(%,F(9! *6! ,&,P(9! &%N! -dIB! Y(! M+(9(%8! &! S#?!6=*M9RO*+N! <B<! T! T! T! ,*LM=(! *6! &=E*+F8P'F,! *M8F'FA&8F*%9! &F'(N! &8! F%,+(&9F%E! M&+&=U =(=F9'!&%N!+(EL=&+F8J!F%!8P(!M+*K=('!8P&8!M+*7FN(!L9!OF8P!9=FEP8=J! -&K=(!;R!-P(!=F98!*6!8P(!quot;#$9!L9(N!F%!8PF9!98LNJB!4#!F9!9F%E=(!M+(U PFEP(+!M(+6*+'&%,(B! ,F9F*%!&%N!S#!F9!N*LK=(!M+(,F9F*%B!4'('!F9!9P&+(N!'('*+JB!#(&>! 6=*M!+&8(9!&+(!9P*O%!6*+!'L=8FM=J!&%N!&NN!*M(+&8F*%9B!)=*M9RO*+N! :,;#1(2<4$1*2#, F9!8P(!+&8F*!*6!M(&>!quot;6=*MG9!+&8(!8*!MF%U'('*+J!K&%NOFN8P!F%! O*+N9B!! Y(! '&>(! 8P(! 6*==*OF%E! ,*%8+FKL8F*%9B! )*+! 8P(! 6F+98! 8F'(?! O(! 9P*O!&%!d$?!ef!&%N!WP*=(9>J!6&,8*+FA&8F*%!8P&8!&,PF(7(!,*'U 9,+FK(9! 8P(! &+,PF8(,8L+(! *6! 8P(! quot;#$9! O(! L9(N?! PFEP=FEP8F%E! 8P(! ML8&8F*%&=!+&8(9!*7(+!:11!quot;6=*MG9!*%!&!quot;#$B!-P(9(!&+(!8P+((!*6! 6(&8L+(9!,*''*%!8*!7(,8*+!&+,PF8(,8L+(9B!4(,8F*%!:!K(%,P'&+>9! 8P(!'*98!OFN(=J!L9(N!6&,8*+FA&8F*%9!F%!N(%9(!=F%(&+!&=E(K+&!&%N! *M(+&8F*%9! F%,=LNF%E! '('*+J! 8+&%96(+?! >(+%(=! 98&+8ULM?! &%N! K&+U M&7(! 8P(! O&J! 6*+! 8P(! F'M=('(%8&8F*%! *6! 8P(! (%8F+(! d#WH! +F(+9?! &%N! L9(9! 8P(9(! 8*! &%&=JA(! 8P(! M(+6*+'&%,(! *6! 8P(! M&%(=! =FK+&+J!i%N(+9*%!(8!&=B!;221j!6*+!8P(!quot;#$9B! 6&,8*+FA&8F*%! *6! d$B! 4(,8F*%! <! NF9,L99(9! 8P(! N(9FE%! &%N! M(+6*+U ]L+! +(9L=89! &=9*! F%,=LN(! M(+6*+'&%,(! *%! 8P(! 0U9(+F(9! *6! '&%,(! (7&=L&8F*%! *6! '&8+F^! 'L=8FM=F,&8F*%B! 4(,8F*%! D! NF9,L99(9! ZV[S[!quot;#$9!8P&8!O&9!%*8!M+(7F*L9=J!&88&F%(N!F%!8P(!;BD!J(&+9! 8P(! N(9FE%! *6! d$?! ef! &%N! WP*=(9>J?! &%N! 4(,8F*%! 3! (7&=L&8(9! 9F%,(!8P(9(!quot;#$9!O(+(!&7&F=&K=(B!Y(!M+*7FN(!%(O!F%9FEP89!F%8*! 8P(F+! M(+6*+'&%,(B! 4(,8F*%! C! 9L''&+FA(9! &%N! N(9,+FK(9! 6L8L+(! M+*E+&''F%E! 8P(9(! &%N! %(O(+! quot;#$9! 8P&8! P(=M! L9! &,PF(7(! M(+U O*+>B! 6*+'&%,(!F%!9L,P!K&9F,!>(+%(=9!&9!'&8+F^U'&8+F^!'L=8FM=J!8P&8!F9! 31b! 6&98(+! 8P&%! 8P*9(! F%! 8P(! *M8F'FA(N! 7(%N*+c9! =FK+&+J! =,-./,7($%*1quot;$14(quot;, W$Id4! ;B;B! 4*'(! *6! *L+! ,*N(9! P&7(! K((%! =F,(%9(N! KJ! [%! 8PF9! O*+>! O(! &+(! ,*%,(+%(N! OF8P! M+*E+&''F%E! 0! 9(+F(9?! 2! ZV[S[! &%N! F%,=LN(N! F%! W$Id4! /B1B! [%! *L+! &MM+*&,P! O(! Volkov and Demmel (SC08) 9(+F(9?!&%N!/11!9(+F(9!*6!ZV[S[!quot;#$9?!&9!=F98(N!F%!-&K=(!;B!)*+! 8PF%>! *6! 8P(! quot;#$! &9! &! 'L=8F8P+(&N(N! 7(,8*+! L%F8! &%N! *L+! K(98!
  • 36. rary Lib !quot;#$%&'()*#+,-./0,12,34#quot;,5quot;#0quot;,6*#quot;'(,78+quot;9(', , V&9F=J!V*=>*7! X&'(9!YB!S(''(=! W*'ML8(+!4,F(%,(!SF7F9F*%! W*'ML8(+!4,F(%,(!SF7F9F*%!&%N!S(M&+8'(%8!*6!Q&8P('&8F,9! $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J $%F7(+9F8J!*6!W&=F6*+%F&!&8!I(+>(=(J! 7901('$1, quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;#$!%&'(! Y(! M+(9(%8! M(+6*+'&%,(! +(9L=89! 6*+! N(%9(! =F%(&+! &=E(K+&! L9F%E! quot;-./01! 2011quot;-.! 0011quot;-. 0311quot;-4 +(,(%8! ZV[S[! quot;#$9B! ]L+! '&8+F^U'&8+F^! 'L=8FM=J! +*L8F%(! 5!*6!7(,8*+!,*+(9! :1! ;3! ;3! <! _quot;`QQa! +L%9! LM! 8*! 31b! 6&98(+!8P&%! 8P(! 7(%N*+c9! F'M=('(%8&U ,*+(!,=*,>?!quot;@A! ;B:1! ;B3C! ;B:D! ;B<D! 8F*%!&%N!&MM+*&,P(9!8P(!M(&>!*6!P&+NO&+(!,&M&KF=F8F(9B!]L+!d$?! +(EF98(+9G,*+(! 3<HI! :/HI! :/HI! :/HI! ef! &%N! WP*=(9>J! 6&,8*+FA&8F*%9! &,PF(7(! LM! 8*! 01g21b! *6! 8P(! 9'('G,*+(! ;3HI! ;3HI! ;3HI! ;3HI! M(&>! quot;`QQ! +&8(B! ]L+! M&+&==(=! d$! +L%%F%E! *%! 8O*! quot;#$9! '('*+J!KL9?!quot;@A ;B;! ;B;! 1B2! ;B1! &,PF(7(9!LM!8*!hD<1!quot;6=*MG9B!-P(9(!+(9L=89!&+(!&,,*'M=F9P(N!KJ! '('*+J!KL9?!MF%9 D;/! /D3! :0<! ;/0! ,P&==(%EF%E!8P(!&,,(M8(N!7F(O!*6!8P(!quot;#$!&+,PF8(,8L+(!&%N!M+*U K&%NOFN8P?!quot;IG9! ;<;! C1! 03! :/! E+&''F%E! ELFN(=F%(9B! Y(! &+EL(! 8P&8! '*N(+%! quot;#$9! 9P*L=N! K(! '('*+J!&'*L%8! ;quot;I! D;/QI! C30QI! /D3QI! 7F(O(N! &9! 'L=8F8P+(&N(N! 'L=8F,*+(! 7(,8*+! L%F89B! Y(! (^M=*F8! 4#?!M(&>!quot;6=*MG9! 3/<! </2! :<3! 2:! K=*,>F%E!9F'F=&+=J!8*!7(,8*+!,*'ML8(+9!&%N!P(8(+*E(%(F8J!*6!8P(! 4#?!M(&>!M(+!,*+(! /;! /C! //! /:! 9J98('! KJ! ,*'ML8F%E! K*8P! *%! quot;#$! &%N! W#$B! -PF9! 98LNJ! F%U 4#?!6=*M9RO*+N! ;0! /D! ;3! ;/! ,=LN(9! N(8&F=(N! K(%,P'&+>F%E! *6! 8P(! quot;#$! '('*+J! 9J98('! 8P&8! S#?!M(&>!quot;6=*MG9! C0! T! T! T! +(7(&=9! 9FA(9! &%N! =&8(%,F(9! *6! ,&,P(9! &%N! -dIB! Y(! M+(9(%8! &! S#?!6=*M9RO*+N! <B<! T! T! T! ,*LM=(! *6! &=E*+F8P'F,! *M8F'FA&8F*%9! &F'(N! &8! F%,+(&9F%E! M&+&=U =(=F9'!&%N!+(EL=&+F8J!F%!8P(!M+*K=('!8P&8!M+*7FN(!L9!OF8P!9=FEP8=J! -&K=(!;R!-P(!=F98!*6!8P(!quot;#$9!L9(N!F%!8PF9!98LNJB!4#!F9!9F%E=(!M+(U PFEP(+!M(+6*+'&%,(B! ,F9F*%!&%N!S#!F9!N*LK=(!M+(,F9F*%B!4'('!F9!9P&+(N!'('*+JB!#(&>! 6=*M!+&8(9!&+(!9P*O%!6*+!'L=8FM=J!&%N!&NN!*M(+&8F*%9B!)=*M9RO*+N! :,;#1(2<4$1*2#, F9!8P(!+&8F*!*6!M(&>!quot;6=*MG9!+&8(!8*!MF%U'('*+J!K&%NOFN8P!F%! O*+N9B!! Y(! '&>(! 8P(! 6*==*OF%E! ,*%8+FKL8F*%9B! )*+! 8P(! 6F+98! 8F'(?! O(! 9P*O!&%!d$?!ef!&%N!WP*=(9>J!6&,8*+FA&8F*%!8P&8!&,PF(7(!,*'U 9,+FK(9! 8P(! &+,PF8(,8L+(! *6! 8P(! quot;#$9! O(! L9(N?! PFEP=FEP8F%E! 8P(! ML8&8F*%&=!+&8(9!*7(+!:11!quot;6=*MG9!*%!&!quot;#$B!-P(9(!&+(!8P+((!*6! 6(&8L+(9!,*''*%!8*!7(,8*+!&+,PF8(,8L+(9B!4(,8F*%!:!K(%,P'&+>9! 8P(!'*98!OFN(=J!L9(N!6&,8*+FA&8F*%9!F%!N(%9(!=F%(&+!&=E(K+&!&%N! *M(+&8F*%9! F%,=LNF%E! '('*+J! 8+&%96(+?! >(+%(=! 98&+8ULM?! &%N! K&+U M&7(! 8P(! O&J! 6*+! 8P(! F'M=('(%8&8F*%! *6! 8P(! (%8F+(! d#WH! +F(+9?! &%N! L9(9! 8P(9(! 8*! &%&=JA(! 8P(! M(+6*+'&%,(! *6! 8P(! M&%(=! =FK+&+J!i%N(+9*%!(8!&=B!;221j!6*+!8P(!quot;#$9B! 6&,8*+FA&8F*%! *6! d$B! 4(,8F*%! <! NF9,L99(9! 8P(! N(9FE%! &%N! M(+6*+U ]L+! +(9L=89! &=9*! F%,=LN(! M(+6*+'&%,(! *%! 8P(! 0U9(+F(9! *6! '&%,(! (7&=L&8F*%! *6! '&8+F^! 'L=8FM=F,&8F*%B! 4(,8F*%! D! NF9,L99(9! ZV[S[!quot;#$9!8P&8!O&9!%*8!M+(7F*L9=J!&88&F%(N!F%!8P(!;BD!J(&+9! 8P(! N(9FE%! *6! d$?! ef! &%N! WP*=(9>J?! &%N! 4(,8F*%! 3! (7&=L&8(9! 9F%,(!8P(9(!quot;#$9!O(+(!&7&F=&K=(B!Y(!M+*7FN(!%(O!F%9FEP89!F%8*! 8P(F+! M(+6*+'&%,(B! 4(,8F*%! C! 9L''&+FA(9! &%N! N(9,+FK(9! 6L8L+(! M+*E+&''F%E! 8P(9(! &%N! %(O(+! quot;#$9! 8P&8! P(=M! L9! &,PF(7(! M(+U O*+>B! 6*+'&%,(!F%!9L,P!K&9F,!>(+%(=9!&9!'&8+F^U'&8+F^!'L=8FM=J!8P&8!F9! 31b! 6&98(+! 8P&%! 8P*9(! F%! 8P(! *M8F'FA(N! 7(%N*+c9! =FK+&+J! =,-./,7($%*1quot;$14(quot;, W$Id4! ;B;B! 4*'(! *6! *L+! ,*N(9! P&7(! K((%! =F,(%9(N! KJ! Volkov and Demmel (SC08) [%! 8PF9! O*+>! O(! &+(! ,*%,(+%(N! OF8P! M+*E+&''F%E! 0! 9(+F(9?! 2! ZV[S[! &%N! F%,=LN(N! F%! W$Id4! /B1B! [%! *L+! &MM+*&,P! O(! 9(+F(9?!&%N!/11!9(+F(9!*6!ZV[S[!quot;#$9?!&9!=F98(N!F%!-&K=(!;B!)*+! 8PF%>! *6! 8P(! quot;#$! &9! &! 'L=8F8P+(&N(N! 7(,8*+! L%F8! &%N! *L+! K(98!
  • 37. ary ibr L DGEMM Performance 17 M02: High Performance Computing with CUDA
  • 38. ary ibr L Additional Resources CUDA SDK example simpleCUBLAS CUBLAS Library documentation in doc folder of CUDA Toolkit or download from CUDA Zone © 2008 NVIDIA Corporation.
  • 39. ary ibr L CUFFT The Fast Fourier Transform (FFT) is a divide-and- conquer algorithm for efficiently computing discrete Fourier transform of complex or real-valued data sets. CUFFT is the CUDA FFT library Provides a simple interface for computing parallel FFT on an NVIDIA GPU Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation 18 M02: High Performance Computing with CUDA
  • 40. ary ibr L Supported Features 1D, 2D and 3D transforms of complex and real-valued data Batched execution for doing multiple 1D transforms in parallel 1D transform size up to 8M elements 2D and 3D transform sizes in the range [2,16384] In-place and out-of-place transforms for real and complex data. 19 M02: High Performance Computing with CUDA
  • 41. ary ibr L Transform Types Library supports real and complex transforms CUFFT_C2C, CUFFT_C2R, CUFFT_R2C Directions CUFFT_FORWARD (-1) and CUFFT_INVERSE (1) According to sign of the complex exponential term Real and imaginary parts of complex input and output arrays are interleaved cufftComplex type is defined for this Real to complex FFTs, output array holds only nonredundant coefficients N -> N/2+1 N0 x N1 x … x Nn -> N0 x N1 x … x (Nn/2+1) For in-place transforms the input/output arrays need to be padded 20 M02: High Performance Computing with CUDA
  • 42. ary ibr L More on Transforms For 2D and 3D transforms, CUFFT performs transforms in row- major (C-order) If calling from FORTRAN or MATLAB, remember to change the order of size parameters during plan creation CUFFT performs un-normalized transforms: IFFT(FFT(A))= length(A)*A CUFFT API is modeled after FFTW. Based on plans, that completely specify the optimal configuration to execute a particular size of FFT Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration Works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources 21 M02: High Performance Computing with CUDA
  • 43. ary ibr L CUFFT Types and Definitions cufftHandle Type used to store and access CUFFT plans cufftResults Enumeration of API function return values cufftReal single-precision, real datatype cufftComplex single-precision, complex datatype Real and complex transforms CUFFT_C2C, CUFFT_C2R, CUFFT_R2C Directions CUFFT_FORWARD, CUFFT_INVERSE © 2008 NVIDIA Corporation.
  • 44. ary ibr L CUFFT Example #include <stdio.h> cufftPlan1d(&plan, N, CUFFT_C2C, batchSize); #include <math.h> #include quot;cufft.hquot; cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD); cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE); int main(int argc, char *argv[]) { cudaMemcpy(a_h, a_d, nBytes, cufftComplex *a_h, *a_d; cudaMemcpyDeviceToHost); cufftHandle plan; int N = 1024, batchSize = 10; // check error - normalize int i, nBytes; for (maxError = 0.0, i=0; i < N*batchSize; i++) { double maxError; maxError = max(fabs(a_h[i].x/N-sinf(i)), maxError); maxError = max(fabs(a_h[i].y/N-cosf(i)), maxError); nBytes = sizeof(cufftComplex)*N*batchSize; } a_h = (cufftComplex *)malloc(nBytes); printf(quot;Max fft error = %gnquot;, maxError); for (i=0; i < N*batchSize; i++) { a_h[i].x = sinf(i); cufftDestroy(plan); a_h[i].y = cosf(i); free(a_h); cudaFree(a_d); } return 0; cudaMalloc((void **)&a_d, nBytes); } cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); © 2008 NVIDIA Corporation.
  • 45. ary ibr L Additional CUFFT Resources CUDA SDK examples simpleCUFFT convolutionFFT2D oceanFFT CUFFT Library documentation In doc folder of CUDA Toolkit or download from CUDA Zone © 2008 NVIDIA Corporation.
  • 47. ? e lu G
  • 48. 6.963 IT / A@M CUD 9 IAP0 Interfacing CUDA
  • 49. lue G Interfacing CUDA with other languages CUDA kernels from FORTRAN, allocate pinned memory from FORTRAN Calling CUDA from MATLAB with MEX files Several packages (open source and commercial) to interface CUDA with Python, IDL, .NET, FORTRAN (Flagon). Browse CUDA Zone to find all the packages. 23 M02: High Performance Computing with CUDA
  • 50. lue G Pinned memory from FORTRAN Pinned memory provides a fast PCI-e transfer speed and enables use of streams: •Allocation needs to be done with cudaMallocHost •Use new Fortran 2003 features for interoperability with C. use iso_c_binding ! The allocation is performed by C function calls. Define the C pointer as type (C_PTR) type(C_PTR) :: cptr_A, cptr_B, cptr_C ! Define Fortran arrays as pointer. real, dimension(:,:), pointer :: A, B, C ! Allocating memory with cudaMallocHost. ! The Fortan arrays, now defined as pointers, are then associated with the C pointers using the ! new interoperability defined in iso_c_binding. This is equivalent to allocate(A(m1,m1)) res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) ) call c_f_pointer ( cptr_A, A, (/ m1, m1 /) ) ! Use A as usual. ! See example code for cudaMallocHost interface code http://www.nvidia.com/object/cuda_programming_tools.html 24 M02: High Performance Computing with CUDA
  • 51. lue G Calling CUDA kernels from FORTRAN From Fortran call C function that will call CUDA kernel ! Fortran -> C -> CUDA ->C ->Fortran call cudafunction(c,c2,N) /* NB: Fortran subroutine arguments are passed by reference. */ extern quot;Cquot; void cudafunction_(cuComplex *a, cuComplex *b, int *Np) { ... int N=*np; cudaMalloc ((void **) &a_d , sizeof(cuComplex)*N); cudaMemcpy( a_d, a, sizeof(cuComplex)*N ,cudaMemcpyHostToDevice); dim3 dimBlock(block_size); dim3 dimGrid (N/dimBlock.x); if( N % block_size != 0 ) dimGrid.x+=1; square_complex<<<dimGrid,dimBlock>>>(a_d,a_d,N); cudaMemcpy( b, a_d, sizeof(cuComplex)*N,cudaMemcpyDeviceToHost); cudaFree(a_d); } complex_mul: main.f90 Cuda_function.o $(FC) -o complex_mul main.f90 Cuda_function.o -L/usr/local/cuda/lib -lcudart cuda_function.o: cuda_function.cu nvcc -c -O3 cuda_function.cu 25 M02: High Performance Computing with CUDA
  • 52. lue G CUDA & MATLAB Even though MATLAB is built on many well- optimized libraries, some functions can perform better when written in a compiled language (e.g. C and Fortran). MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files. MEX files could be used to exploit multi-core processors with OpenMP or threaded codes or like in this case to offload functions to the GPU. 26 M02: High Performance Computing with CUDA
  • 53. lue G NVMEX Native MATLAB script cannot parse CUDA code New MATLAB script nvmex.m compiles CUDA code (.cu) to create MATLAB function files Syntax similar to original mex script: >> nvmex –f nvmexopts.bat filename.cu –IC:cudainclude –LC:cudalib -lcudart Available for Windows and Linux from: http://developer.nvidia.com/object/matlab_cuda.html 27 M02: High Performance Computing with CUDA
  • 54. lue G Mex files for CUDA A typical mex file will perform the following steps: 1. Convert from double to single precision 2. Rearrange the data layout for complex data 3. Allocate memory on the GPU 4. Transfer the data from the host to the GPU 5. Perform computation on GPU (library, custom code) 6. Transfer results from the GPU to the host 7. Rearrange the data layout for complex data 8. Convert from single to double 9. Clean up memory and return results to MATLAB Some of these steps will go away with new versions of the library (2,7) and new hardware (1,8) 28 M02: High Performance Computing with CUDA
  • 55. lue G CUDA MEX example Additional code in MEX file to handle CUDA /*Parse input, convert to single precision and to interleaved complex format */ ….. /* Allocate array on the GPU */ cufftComplex *rhs_complex_d; cudaMalloc( (void **) &rhs_complex_d,sizeof(cufftComplex)*N*M); /* Copy input array in interleaved format to the GPU */ cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice); /* Create plan for CUDA FFT NB: transposing dimensions*/ cufftPlan2d(&plan, N, M, CUFFT_C2C) ; /* Execute FFT on GPU */ cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE) ; /* Copy result back to host */ cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost); /* Clean up memory and plan on the GPU */ cufftDestroy(plan); cudaFree(rhs_complex_d); /*Convert back to double precision and to split complex format */ …. 29 M02: High Performance Computing with CUDA
  • 56. lue G Timing details 1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence Runtime Speed Runtime Speed Opteron 250 Opteron 2210 up up PCI-e Bandwidth: 1135 MB/s 1483 MB/s Host to/from device 1003 MB/s 1223 MB/s Standard MATLAB 8098 s 9525s Overload FFT2 and IFFT2 4425 s 1.8x 4937s 1.9x Overload Szeta 735 s 11.x 789s 12.X Overload Szeta , FFT2 and 577 s 14.x 605s 15.7x IFFT2 30 M02: High Performance Computing with CUDA
  • 57. lue G
  • 58. lue G
  • 59. lue G
  • 60. lue G
  • 61. lue G
  • 63. Wanna Play with The Big Guys?
  • 64. 6.963 IT / A@M CUD 9 IAP0 CUDA Performance Strategies
  • 65. ing ead hr T Programming Model Host Device A kernel is executed as a Grid 1 grid of thread blocks Block Block Block Kernel A thread block is a batch (0, 0) (1, 0) (2, 0) 1 of threads that can Block Block Block cooperate with each (0, 1) (1, 1) (2, 1) other by: Grid 2 Sharing data through shared memory Kernel 2 Synchronizing their execution Block (1, 1) Threads from different Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) blocks cannot cooperate Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) 3 © NVIDIA Corporation 2006
  • 66. mory Me Data Movement in a CUDA Program Host Memory Device Memory [Shared Memory] COMPUTATION [Shared Memory] Device Memory Host Memory © NVIDIA Corporation 2008 10
  • 67. erf P !quot;#$%$&'()*+,-$#.%/(0,-(#.'(123 456$%$&'($78'quot;'78'7#(quot;5-5**'*$/% 456$%$&'(5-$#.%'#$9($7#'7/$#:(;%5#.<=578>$8#.? @,%'#$%'/($#A/(='##'-(#,(-'9,%quot;B#'(#.57(#,(959.' 123(/quot;'78/($#/(#-57/$/#,-/(,7()C3/D(7,#(%'%,-: E,(%,-'(9,%quot;B#5#$,7(,7(#.'(123(#,(5F,$8(9,/#*:( 85#5(#-57/0'-/ GF'7(*,>(quot;5-5**'*$/%(9,%quot;B#5#$,7/(957(/,%'#$%'/(='( 05/#'-(#.57(#-57/0'--$7+(=59H(578(0,-#.(#,(.,/# 39
  • 68. erf P !quot;#$%$&'()'%*+,(-*.'+'/0' -*12'30'4(536(7*/80*12'30'4(9(*+4'+(*:(%1;/$#<4' =2*>12?@*012(4'5$0'(%'%*+,( !quot;#$%$&'(:*+(3quot;1#$12(2*012$#,($/(010.'4(#'A#<+'( %'%*+, B/(3.1+'4(%'%*+,C(15*$4(.$;.84';+''(>1/D(0*/:2$0#3 40
  • 69. erf P !quot;#$%&'(quot;)*quot;+$%,-%./quot;0$'%1$2,03 45)'0$'6%,-%*72$6%-quot;6*$0%*/quot;)%+8,9quot;8%2$2,03 !/0$quot;'6%:quot;)%:,,;$0quot;*$%(7quot;%6/quot;0$'%2$2,03 <6$%,)$%=%quot;%-$>%*/0$quot;'6%*,%8,quot;'%=%:,2;5*$%'quot;*quot;% 6/quot;0$'%93%quot;88%*/0$quot;'6 <6$%7*%*,%quot;(,7'%),)?:,quot;8$6:$'%quot;::$66 .*quot;+$%8,quot;'6%quot;)'%6*,0$6%7)%6/quot;0$'%2$2,03%*,%0$?,0'$0%),)? :,quot;8$6:$quot;98$%quot;''0$667)+ 1quot;*07@%*0quot;)6;,6$%$@quot;2;8$%8quot;*$0 41
  • 70. erf P !quot;#$%&'&((#()quot;*$+,,)-)#./(0 %&'/)/)1.$012'$-1*32/&/)1.$/1$4##3$/5#$6%!$ *2(/)3'1-#quot;quot;1'quot;$#72&((0$82quot;0 9&.0$/5'#&:quot;;$*&.0$/5'#&:$8(1-4quot; <##3$'#quot;12'-#$2quot;&=#$(1>$#.12=5$/1$quot;2331'/$ *2(/)3(#$&-/)?#$/5'#&:$8(1-4quot;$3#'$*2(/)3'1-#quot;quot;1' @#=)quot;/#'quot;;$quot;5&'#:$*#*1'0 42
  • 71. erf P !quot;#$%&'$()*#*+,)*$-. /()*#*+*-0'#quot;#$%&')%,-.1quot;%. 2$,3quot;.4*-0'03$5,3'#quot;#$%&',44quot;..quot;. 6.*-0'.7,%quot;8'#quot;#$%&'quot;11quot;4)*9quot;3& 44
  • 72. erf P !quot;#quot;$%&quot;'()*&( !*+,-*$.*./&0$#/$1/(#$.*./&0$2quot;'34,3#1$.5-1$ 6/4*&$#1quot;'$3*+,-*$.*./&0$#/$3*+,-*$2quot;'34,3#1 789:($;*quot;<$=>?@A*$BCDE$+(F$GH$89:($;*quot;<$=I5quot;3&/$JK$LDHHE G89:($)/&$>?@A*$MFH N,',.,O*$#&quot;'()*&( @'#*&.*3,quot;#*$3quot;#quot;$(#&5-#5&*($-quot;'$2*$quot;66/-quot;#*3P$/;*&quot;#*3$ /'P$quot;'3$3*quot;66/-quot;#*3$4,#1/5#$*+*&$-/;0,'Q$#1*.$#/$1/(#$ .*./&0 8&/5;$#&quot;'()*&( R'*$6quot;&Q*$#&quot;'()*&$.5-1$2*##*&$#1quot;'$.quot;'0$(.quot;66$/'*( 45
  • 73. erf P !quot;#$%&'()$*+,$-'./+0.quot;123$.2 (4*quot;,quot;55'(6'2789+quot;55':2+quot;55'(quot;7;'1+'3+<quot;#$%5'()$*+ ='27+-$-'./ >1quot;?5$2+=;#=$27+(4*quot;,$-(</+<$.3'.-quot;1($ @AB+CDE2F+('--'1+'1+!GH%$I<.$22+8IJK9 LM+CDE2+-$quot;24.$*+'1+1N'.($+KOP;+-'7=$.?'quot;.*2+ 8'Q$.(5'()$*+!GH%$9 R$$+7=$+S?quot;1*:;*7=0$27T GUVW+RVX+2quot;-<5$ U2$+:;7=+(quot;47;'1 W55'(quot;7;1#+7''+-4(=+<quot;#$%5'()$*+-$-'./+(quot;1+.$*4($+ 'Q$.quot;55+2/27$-+<$.3'.-quot;1($ 0$27+/'4.+2/27$-2+quot;1*+quot;<<2+7'+5$quot;.1+7=$;.+5;-;72 46
  • 74. erf P !quot;#$%quot;&'()#*+&,(%-./0*12(. 3145(.2&quot;%2(67+&16.2*8721#6.9&:;;<=;;&7quot;#7>&7+7quot;(. ?1>(quot;+&2#&$(&@(*A#*)%67(&$#22quot;(6(7> B@21)1C%21#6.&7%6&4*(%2quot;+&167*(%.(&@(*A#*)%67( D#%quot;(.71649&8@&2#&E;F&.@((-8@ ?%2(67+&51-1649&8@&2#&GHIF&.@((-8@ 47
  • 75. erf P !quot;#$%&'()* +,'quot;quot;-.()#/%.,-%#.,01,#,2#$345#-6,789 /2-%#.&: +,'quot;)/(*;quot;;&,-%*(quot;),quot;3,*$quot;0#$,<%<quot;-1= 9> 01/%&,4 %#'2,/2-%#.,-%#.&,#,5quot;-.=,()/?,3$quot;#/?,@ 8AB 01/%&,4 %#'2,/2-%#.,-%#.&,#,.quot;;0$%45quot;-.=,()/A?,3$quot;#/A?,@ AC9 01/%&,D %#'2,/2-%#.,-%#.&,#,E;#.45quot;-.=,()/>?,3$quot;#/>?,@ +..(/(quot;)#$,-%&/-('/(quot;)&,quot;),FBGHFIG,#-'2(/%'/;-%= J/#-/()*,#..-%&&,3quot;-,#,-%*(quot;),<;&/,0%,#,<;$/(6$%,quot;3,-%*(quot;), &(K% L2%,k/2 /2-%#.,(),#,2#$345#-6,<;&/,#''%&&,/2% k/2 %$%<%)/,(),#, 0$quot;'M,0%()*,-%#. NO'%6/(quot;)=,)quot;/,#$$,/2-%#.&,<;&/,0%,6#-/('(6#/()* P-%.('#/%.,#''%&&?,.(Q%-*%)'%,5(/2(),#,2#$35#-6 48
  • 76. erf P !quot;#$%&'%()*''%&&+),%#(-./)0$quot;#1& 12 13 14 17 135 136 349 374 378 352 355 395 399 3:4 *$$)1>?%#(&)C#?1-'-C#1% 12 13 14 17 135 136 349 374 378 352 355 395 399 3:4 ;quot;<%)=>?%#(&)@quot;)Aquot;1)B#?1-'-C#1% 49
  • 77. erf P !quot;#$%&'(#')*+##'((,*-'%).quot;/*0&$%1( 12 13 14 17 135 136 349 374 378 352 355 395 399 3B4 :';<=1')*+##'((*>?*@A;'%)( 12 13 14 17 137 135 136 349 374 378 352 355 395 399 3B4 C.(%&./quot;')*D1%;1.quot;/*+));'((*Equot;$1*%*<=&1.F&'*$0*85G 50
  • 78. erf P !quot;#$%&'()*+,-(.()*,/%&0$1& 234%5(.%)1,quot;),678+, 9%5)%$+,5%#:,#,;$quot;#1<,()'5%.%)1<,=5(1%,>#'? @A,;$quot;#1&,BCDAEF -(.%&,#G%5#*%:,quot;G%5,C89,50)& CD9,>$quot;'?&,3,DHI,1J5%#:&+ @HIK&,L 'quot;#$%&'%: @HMK&,L 'quot;#$%&'%:<,&quot;.%,1J5%#:&,:quot;)N1,4#51('(4#1% @<OPOK&,L 4%5.01%:Q.(&#$(*)%:,1J5%#:,#''%&& 51
  • 79. erf P !quot;#$%&'()*+ ,-./'-/.%&0quot;10&(2%0! 34054067089-%& :&%0#0,-./'-/.%0quot;10;..#9&0<,quot;;=0()&-%#>0quot;10;..#90quot;10,-./'-/.%&0 <;quot;,= ?10,quot;;0(&0)quot;-0@(#A$%+ Bquot;.'%0&-./'-/.%0#$(*)C%)-+0DD#$(*)<E=40FG%.%0E0H0340540quot;.067 :&%0,IJI0-quot;0#'G(%@%0'quot;#$%&'()* x y z Point structure x y z x y z x y z AoS x x x y y y z z z SoA 58
  • 80. erf P !quot;#$%&'()*+,-.//#01 !quot;#$%&'()*,*0%#2$1,(/30quot;4%&,250quot;.*53.2 !0(2('#$,2quot;,/%/quot;0167quot;.)8,9%0)%$& :%#8()*,&20.'2.0%&,quot;;,&(<%,quot;25%0,25#),=>,?>,quot;0,@A 712%&,B($$,70%#9,'quot;#$%&'()*+ C0%;%0,-20.'2.0%&,quot;;,D00#1& quot;4%0,Dquot;- E;,-quot;D,(&,)quot;2,4(#7$%>,0%#8FB0(2%,250quot;.*5,-GHG D88(2(quot;)#$,0%&quot;.0'%&+ D$(*)%8,I13%&,-JK,-#/3$% 59
  • 81. erf P !quot;#quot;$$%$&'%()#*&+#,-./%,/0#% 12&quot;&3quot;#quot;$$%$&(quot;,-.2%4&(quot;2*&/-#%quot;56&quot;,,%66&(%()#* 7-%#%8)#%4&(%()#*&.6&5.9.5%5&.2/)&:quot;2;6 <66%2/.quot;$&/)&quot;,-.%9%&-.=-&:quot;25>.5/- <quot;,-&:quot;2;&,quot;2&6%#9.,%&)2%&quot;55#%66&3%#&,*,$% +&(%()#*&,quot;2&6%#9.,%&quot;6&(quot;2*&6.(0$/quot;2%)06& Bank 0 quot;,,%66%6&quot;6&./&-quot;6&:quot;2;6 Bank 1 Bank 2 Bank 3 '0$/.3$%&6.(0$/quot;2%)06&quot;,,%66%6&/)&quot;&:quot;2; Bank 4 #%60$/&.2&quot;&:quot;2;&,)28$.,/& Bank 5 Bank 6 ?)28$.,/.2=&quot;,,%66%6&quot;#%&6%#.quot;$.@%5 Bank 7 Bank 15 64
  • 82. erf P !quot;#$%&''()**+#,%-.quot;/01)* 23%!quot;#$%43#51+67* 23%!quot;#$%43#51+67* 8+#)quot;(%quot;''()**+#,% ;quot;#'3/%:<:%=)(/>7quot;7+3# *7(+')%99%: Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Bank 3 Thread 4 Bank 4 Thread 4 Bank 4 Thread 5 Bank 5 Thread 5 Bank 5 Thread 6 Bank 6 Thread 6 Bank 6 Thread 7 Bank 7 Thread 7 Bank 7 Thread 15 Bank 15 Thread 15 Bank 15 65
  • 83. erf P !quot;#$%&''()**+#,%-.quot;/01)* 234quot;5%!quot;#$%67#81+9:* =34quot;5%!quot;#$%67#81+9:* ;+#)quot;(%quot;''()**+#,% ;+#)quot;(%quot;''()**+#,% *:(+')%<<%2 *:(+')%<<%= x8 Thread 0 Bank 0 Thread 0 Bank 0 Thread 1 Bank 1 Thread 1 Bank 1 Thread 2 Bank 2 Thread 2 Bank 2 Thread 3 Bank 3 Thread 3 Thread 4 Bank 4 Thread 4 Bank 5 Thread 5 Bank 7 Bank 6 Thread 6 Bank 8 Bank 7 Thread 7 Bank 9 Thread 8 x8 Thread 9 Thread 10 Thread 11 Bank 15 Thread 15 Bank 15 66
  • 84. erf P !quot;#$%&&'())()$*%+$,quot;$-%./)$quot;.$012 3%.&#4&,5$quot;6$(%75$-%./$4)$89$-4,)$+('$9$7:quot;7/$7;7:() <=77())4>($89?-4,$#quot;'&)$%'($%))4@.(&$,quot;$)=77())4>($ -%./) 012$5%)$AB$-%./) <quot;$-%./$C$%&&'())$D$AB <%*($%)$,5($)4E($quot;6$%$5%:6?#%'+ Fquot;$-%./$7quot;.6:47,)$-(,#((.$&466('(.,$5%:6?#%'+)G$quot;.:;$#4,54.$%$)4.@:($5%:6?#%'+ 67
  • 85. erf P !quot;#$%&'(%()$*'+#,-'.),/01.23 !quot;#$%&'(%()$*'13'#3'/#32'#3'$%4132%$3'1/'2quot;%$%'#$%' ,)'+#,-'.),/01.23 5quot;%'/#32'.#3%6 7/'#00'2quot;$%#&3')/'#'quot;#0/89#$:'#..%33'&1//%$%,2'+#,-3;'2quot;%$%'13' ,)'+#,-'.),/01.2 7/'#00'2quot;$%#&3')/'#'quot;#0/89#$:'$%#&'2quot;%'1&%,21.#0'#&&$%33;' 2quot;%$%'13',)'+#,-'.),/01.2'<+$)#&.#32= 5quot;%'30)9'.#3%6 >#,-'?),/01.26'(@021:0%'2quot;$%#&3'1,'2quot;%'3#(%'quot;#0/89#$:' #..%33'2quot;%'3#(%'+#,- A@32'3%$1#01B%'2quot;%'#..%33%3 ?)32'C'(#D'E')/'31(@02#,%)@3'#..%33%3'2)'#'31,40%'+#,- 68
  • 86. erf P Conflicts, Coalescing, Warps... I hate growing up.
  • 87. erf P !quot;#$%$&'#$()*+,'%quot;-./*0'#1$,*21')3quot;(3.
  • 88. erf P !quot;#$%&'($quot;)*+,*- ./0'.quot;1+2-'34#$quot;)*+,*-56 7228*#$quot;#-*9 :,quot;2-*;%)< =>,%?%)<'.!@!'Aquot;)B';,)C2%;#* .+--?8+*'C,$'->-)'*1quot;22'1quot;#$%;-* 1 5 9 13 1 2 3 4 2 6 10 14 5 6 7 8 3 7 11 15 9 10 11 12 4 8 12 16 13 14 15 16 70
  • 89. erf P !quot;#$%&'(#')*+,%quot;(-$(' __global__ void transpose_naive(float *odata, float *idata, int width, int height) { 1. unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x; 2. unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y; 3. if (xIndex < width && yIndex < height) { unsigned int index_in = xIndex + width * yIndex; 4. unsigned int index_out = yIndex + height * xIndex; 5. $)%.%/0quot;)'12$3.4 = 0)%.%/0quot;)'120quot;4; 6. } } 71
  • 90. erf P !quot;#$%&'(#')*+,%quot;(-$(' .'%)(*/quot;-01*2,$3*4565 <,/1'*$01-01*1$*4565 ;8; ;87 ;8: ;879 ;8; 78; :8; 798; 78; 787 78: 7879 ;87 787 :87 7987 798; 7987 798: 79879 ;879 7879 :879 79879 4565 4565 Stride = 1, coalesced Stride = 16, uncoalesced 72
  • 91. erf P !quot;#$%&'%()*+#,&-quot;&% .&&/0-12quot;,3)0#1+24)2&)-#+1212quot;,%()2,1quot;)&5/#+%)12$%& *6+%#(7$quot;'8)974:)7;<3 =%#()16%)974:7;< 2,-/1)12$%:)&1quot;+%)2,1quot;)>?@? A+21%)16%)>?@?)(#1#)1quot;)97;:74< quot;/1-/1)12$% *+#,&-quot;&%)16%)2,(%42,B)2,1quot;)>?@? *6+%#()914:1;<3 =%#(&)%$%0%,1)914:1;< C+quot;0)2,-/1)12$% A+21%&)%$%0%,1)914:1;< 2,1quot;)quot;/1-/1)12$% !quot;#$%&'2,B)2&)#'62%D%()2C3 E$quot;'8F12$%)(20%,&2quot;,&)#+%)0/$12-$%&)quot;C)GH 73
  • 92. erf P !quot;#$%&'%()*+#,&-quot;&% 4%#(&)5+quot;6)1232 .+/0%&)0quot;)7232 <9< <98 <9; <98: <9< <98 <9; <98: 89< 898 89; 898: 89< 898 89; 898: 8:9< 8:98 8:9; 8:98: 8:9< 8:98 8:9; 8:98: 4%#(&)5+quot;6)7232 .+/0%&)0quot;)1232 <9< 89< ;9< 8:9< <9< <98 <9; <98: <98 898 ;98 8:98 89< 898 89; 898: <98: 898: ;98: 8:98: 8:9< 8:98 8:9; 8:98: 74
  • 93. erf P !quot;#quot;$%&'()(*+'(,- =1+23$;0,)$!quot;#quot; ./01+23$01+2$!quot;#quot;$4('/$3'0(21$5$67 A?A 6?A @?A 6>?A 8+-9$:,-;<(:'3 A?6 6?6 @?6 6>?6 A?6> 6?6> @?6> 6>?6> !,<B'(,- A?A 6?A @?A 6>?A C<<,:+'1$+-$D1E'0+F :,<B)- A?6 6?6 @?6 6>?6 =1+2$3'0(21$5$6G ./01+23$01+2$;0,)$:,-31:B'(H1$I+-93 A?6> 6?6> @?6> 6>?6> 75
  • 94. erf P !quot;#quot;$%&'()(*+'(,- =1+23$;0,)$!quot;#quot; ./01+23$01+2$!quot;#quot;$4('/$3'0(21$5$67 A?A 6?A @?A 6>?A 8+-9$:,-;<(:'3 A?6 6?6 @?6 6>?6 A?6> 6?6> @?6> 6>?6> !,<B'(,- A?A 6?A @?A 6>?A C<<,:+'1$+-$D1E'0+F :,<B)- A?6 6?6 @?6 6>?6 =1+2$3'0(21$5$6G ./01+23$01+2$;0,)$:,-31:B'(H1$I+-93 A?6> 6?6> @?6> 6>?6> 75
  • 95. erf P !quot;#$%&'%()*+#,&-quot;&% __global__ void transpose(float *odata, float *idata, int width, int height) { 1. __shared__ float block[(BLOCK_DIM./)*BLOCK_DIM]; unsigned int xBlock = blockDim.x * blockIdx.x; 2. unsigned int yBlock = blockDim.y * blockIdx.y; 3. unsigned int xIndex = xBlock + threadIdx.x; 4. unsigned int yIndex = yBlock + threadIdx.y; 5. unsigned int index_out, index_transpose; 6. 7. if (xIndex < width && yIndex < height) { unsigned int index_in = width * yIndex + xIndex; 8. unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x; 9. block[index_block] = idata[index_in]; 10. index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y; 11. index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x; 12. } 13. __syncthreads(); 14. if (xIndex < width && yIndex < height) odata[index_out] = block[index_transpose]; 15. } 76
  • 96. erf P !quot;#$%&'%()!*+*$,% -&((./&%)0*12)3'#4(%3*$,)#$.)-565)'&1*+*7#1*'$8 9:;<9:;8))=>=99+% ?%>)=>=::+%))@:>=A %&((./&B C9:<C9:8))=>=D+%)))?%>)=>EE+%))))@F>CA %&((./&B 9=:F<9=:F8))=>E=+%)))?%>)9>G:+%))))@H>FA %&((./&B 9=:F<:=F;8))=>DG+%)))?%>)H>H+%))))))@;>FA %&((./&B I'#4(%3*$,)0*12'/1)-565)'&1*+*7#1*'$8 9:;<9:;8))=>=9F+% C9:<C9:8))=>9=9+% 9=:F<9=:F8))=>F9:+% 9=:F<:=F;8))=>;HG+% 77
  • 97. erf P !quot;#$%&'()*+(),'-%./&'()*01&'2'3/&'()4
  • 98. erf P !quot;quot;#$%&quot;' ()*+%,-.&/0*#quot;0.1&/-%*+-+2+quot;#0+,-/+3#+&0.%44'5-/1- +2+quot;#0.&6-10)+*-7%*$/-./-0)+-1&4'-7%'-01-).,+- 4%0+&quot;.+/-%&,-8++$-0)+-)%*,7%*+-9#/' !quot;quot;#$%&quot;' :-;#<9+*-1=-7%*$/-*#&&.&6- quot;1&quot;#**+&04'-1&-%-<#40.$*1quot;+//1*-,.>.,+,-9'- <%2.<#<-&#<9+*-1=-7%*$/-0)%0-quot;%&-*#&- quot;1&quot;#**+&04' ?.<.0+,-9'-*+/1#*quot;+-#/%6+@ A+6./0+*/ B)%*+,-<+<1*' 79
  • 99. erf P !quot;#$%&'()*+,#-.+/.0quot;#12#)1 3+(4+5'()*1+6+3+(4+70'2#8quot;().11(quot;1 ,(+9''+70'2#8quot;().11(quot;1+:9;.+92+'.912+(<.+5'()*+2(+.=.)02. 3+(4+5'()*1+%+3+(4+70'2#8quot;().11(quot;1+6+> ?0'2#8'.+5'()*1+)9<+quot;0<+)(<)0quot;quot;.<2'@+#<+9+70'2#8quot;().11(quot; &'()*1+2:92+9quot;.<A2+B9#2#<C+92+9+DD1@<)2:quot;.9$1EF+*..8+2:.+ :9quot;$B9quot;.+501@ ,05G.)2+2(+quot;.1(0quot;).+9;9#'95#'#2@+H quot;.C#12.quot;1I+1:9quot;.$+7.7(quot;@ 3+(4+5'()*1+6+JKK+2(+1)9'.+2(+4020quot;.+$.;#).1 &'()*1+.=.)02.$+#<+8#8.'#<.+491:#(< JKKK+5'()*1+8.quot;+Cquot;#$+B#''+1)9'.+9)quot;(11+70'2#8'.+C.<.quot;92#(<1 80
  • 100. erf P !quot;#$%&quot;'()quot;*quot;+,quot;+-. !quot;/,0/1&quot;'02'$&quot;('quot;#$%&quot;'(,quot;*quot;+,quot;+-. 3+%&'4-&$5+6%('quot;%47&(-/+(8quot;('quot;/,(9::(-.-7quot;%(7/&quot;' ;-quot;+/'$5%<=>)?< @AB< S T(.(U(JV /,,N1O:(((P1OQ(P1EQ(P1: W(T(S U(OV /,,N1O:(((P1JQ(P1OQ(P1R %[,/&/XYZ(UT(OV 7,N%D/'quot;,N1O:((P1OQ(XP'OEUYZ( /,,N1O:(((((((((((P1OQ(P1OQ(P1R A5(-5C*7quot;&quot;7.(D$,quot;(&Dquot;(7/&quot;+-.<( !4+(/&(7quot;/%&(EF: &D'quot;/,%(GH(2/'*%I(*quot;'(C47&$*'5-quot;%%5' ?&(7quot;/%&(:JK 5--4*/+-. AD'quot;/,%(,5(+5&(D/Lquot;(&5(8quot;75+#(&5(&Dquot;(%/Cquot;(&D'quot;/,(875-M 81
  • 101. erf P !quot;#$%&quot;'()'quot;%%*'quot; +$,quot;(-.&quot;/01(21(*%$/#(34'quot;(&5'quot;.,%(6quot;'(78 9$3$&$/#(:.0&4'%; <*32quot;'(4=('quot;#$%&quot;'%(6quot;'(>quot;'/quot;- ?@AB 6quot;'(78C(6.'&$&$4/quot;,(.34/#(04/0*''quot;/&(&5'quot;.,% D34*/&(4=(%5.'quot;,(3quot;34'1 @EFG 6quot;'(78C(6.'&$&$4/quot;,(.34/#(04/0*''quot;/&(&5'quot;.,2-40>% H5quot;0>(I0*2$/(=$-quot;(=4'(J('quot;#$%&quot;'%(K(>quot;'/quot;- L%quot;(M3.N''quot;#04*/&O< =-.#(&4(<PHH < O(,quot;%$'quot;,(3.N$3*3('quot;#$%&quot;'%(K(>quot;'/quot;- D&(%43quot;(64$/&(Q%6$--$/#R $/&4(98S8(3.1(400*' !quot;,*0quot;%(6quot;'=4'3./0quot;(M 98S8($%(%-4T H5quot;0>(I0*2$/(=$-quot;(=4'(98S8(*%.#quot; 82
  • 102. erf P !quot;#quot;$%&'&'()$quot;*+,$-quot;),*.(quot; /*quot;)012#3+2#&+'*4567 +2#&+')#+)'6-- 8$9)-+%2&:quot;)#;quot;)<quot;$'quot;:)-+=quot;)>&#;)#;quot;)5-,?&')@:.()#+) =quot;#quot;$%&'quot;)$quot;(&*#quot;$),*.(quot;A 82quot;')#;quot;)A-,?&')@&:quot;)>&#;).)#quot;3#)quot;=&#+$).'=):++<)@+$) #;quot;)0-+=quot;7 *quot;-#&+'A architecture {sm_10} abiversion {0} modname {cubin} code { per thread local memory name = BlackScholesGPU lmem = 0 smem = 68 per thread block shared memory reg = 20 bar = 0 per thread registers bincode { 0xa0004205 0x04200780 0x40024c09 0x00200780 … 83
  • 103. erf P !quot;#$%&''()*+',%!*-'(-*./0 84
  • 104. erf P !quot;#$%$&$'()#*+,-./)quot;,+)01234 5*22/,)#*+,-./)quot;,+)01234)-/)-)%61#$quot;1,)27)8-+quot;)/$&, 9:2$.)8-/#$'()32%quot;6#-#$2')2')6'.,+;quot;2quot;61-#,.)8-+quot;/ <2+,)#*+,-./)quot;,+)01234)==)0,##,+)%,%2+>)1-#,'3>) *$.$'( ?6#@)%2+,)#*+,-./)quot;,+)01234)==)7,8,+)+,($/#,+/)quot;,+) #*+,-. A,+',1)$':23-#$2'/)3-')7-$1)$7)#22)%-'>)+,($/#,+/)-+,)6/,. B,6+$/#$3/ <$'$%6%C)DE)#*+,-./)quot;,+)01234 !'1>)$7)%61#$quot;1,)32'36++,'#)01234/) FGH)2+)HID)#*+,-./)-)0,##,+)3*2$3, J/6-11>)/#$11),'26(*)+,(/)#2)32%quot;$1,)-'.)$':24,)/633,//7611> K*$/)-11).,quot;,'./)2')>26+)32%quot;6#-#$2'@)/2),Lquot;+$%,'#M 85
  • 105. erf P !quot;quot;#$%&quot;'()*(+,-./-0%&quot;, 1&quot;-,%23&4(/quot;quot;#$%&quot;'(5/,2(&/6(&,quot;,22%-37'( 3&quot;-,%2,($,-./-0%&quot;, BUT… 8/9:/quot;quot;#$%&quot;'(0#763$-/quot;,22/-2(quot;%&&/6(%5,;#%6,7'( <35,(7%6,&quot;'(/&(0,0/-':=/#&5(>,-&,72 ?16(%77(quot;/0,2(5/9&(6/(%-36<0,63quot;(3&6,&236'(%&5(%@%37%=7,( $%-%77,7320A 86
  • 106. erf P !quot;#quot;$%&%#'(%)*+,#)-../'0quot;&'+1 !quot;#quot;$%&%#'(quot;&'+1)2%/.3)quot;4quot;.&quot;&'+1)&+)4'55%#%1&)6!73 6!73)8quot;#9)'1)$quot;19):quot;93 ;)+5)$,/&'.#+0%33+#3 <%$+#9)=quot;14:'4&2 >2quot;#%4)$%$+#9)3'(% ?%@'3&%#)5'/%)3'(% A2#%quot;43).%#)=/+0B *+,)0quot;1)%8%1)$quot;B%)quot;..3)3%/5C&,1'1@)D/'B%)EEAF)quot;14) -AG->H IJK.%#'$%1&L $+4%)4'30+8%#3)quot;14)3quot;8%3)+.&'$quot;/) 0+15'@,#quot;&'+1 87
  • 107. erf P !quot;#$%&'(quot;# )#*+,'-.#*/!)01/2+,3quot;,4.#$+/$5.,.$-+,('-($' 6+4quot;,7/$quot;.%+'$(#8 0(9+,8+#-/:,.#$5(#8 ;.#</$quot;#3%($-' =.-+#$7/5(*(#8 )'+/2+.</2+,3quot;,4.#$+/4+-,($'/-quot;/8&(*+/quot;2-(4(>.-(quot;#/ )#*+,'-.#*/2.,.%%+%/.%8quot;,(-54/$quot;42%+?(-7/-5+quot;,7 @#quot;A/5quot;A/-quot;/(*+#-(37/-72+/quot;3/:quot;--%+#+$< +B8B/4+4quot;,7C/$quot;,+/$quot;42&-.-(quot;#C/quot;,/(#'-,&$-(quot;#/quot;9+,5+.* D2-(4(>+/7quot;&,/.%8quot;,(-54C/then &#,quot;%%/%quot;quot;2' )'+/-+42%.-+/2.,.4+-+,'/-quot;/8+#+,.-+/quot;2-(4.%/$quot;*+ 88
  • 108. erf P !quot;#$%&'($)*+,-.$/012*.#0
  • 109. erf P !quot;#$%&'($)*+,-.$/012*.#0 3#.4+$5#-+,0#$-67$2*67$418#68*-.$4#02105-69#$ 401:.#5 ;/&$-67$%/&$8*5*6<$210$-..$=#06#.$*6>19-8*16+$-67$ 5#594?+ !*5#$+8-54+ (99#++$81$quot;-07@-0#$4#02105-69#$91,68#0+$ 61
  • 110. erf P !quot;#$%&'quot;()'*# 101
  • 111. erf P !quot;#$%&' ()*$+',%-*,+-%./*0,1quot;+2,2%-01%-*,.34$+*-',3$,'quot;#$%&',quot;$,+2*,.2quot;56 +quot;7*'+%75 #&08quot;$.32*-*$+ Global memory loads/stores are coalesced #&08.32*-*$+ (coherent) or non-coalesced (incoherent) #'+8quot;$.32*-*$+ #'+8.32*-*$+ &3.%&8&3%0 Local loads/stores &3.%&8'+3-* Total branches and divergent branches 9-%$.2 0quot;)*-#*$+89-%$.2 taken by threads quot;$'+-4.+quot;3$' : quot;$'+-4.+quot;3$,.34$+ 1%-58'*-quot;%&quot;;* : +2-*%0,1%-5',+2%+,'*-quot;%&quot;;*,3$,%00-*'',.3$<&quot;.+',+3, '2%-*0,3-,.3$'+%$+,7*73-= .+%8&%4$.2*0 : *>*.4+*0,+2-*%0,9&3./' 62
  • 112. erf P !quot;#$%&%$#'quot;()&%*+',$%)-*.quot;#$%/ 01,.$/)%$&%$/$quot;#)$2$quot;#/)3'#4'quot;)1)#4%$15)31%& 6quot;,7)#1%($#/)*quot;$)8.,#'&%*-$//*% 01,.$/)3',,)quot;*#)-*%%$/&*quot;5)#*)#4$)#*#1,)quot;.89$%)*+)31%&/) ,1.quot;-4$5)+*%)1)&1%#'-.,1%):$%quot;$,; <1.quot;-4)$quot;*.(4)#4%$15)9,*-:/)#*)$quot;/.%$)#41#)#4$)#1%($#) 8.,#'&%*-$//*%)'/)('2$quot;)1)-*quot;/'/#$quot;#)&$%-$quot;#1($)*+)#4$)#*#1,) 3*%:; 01,.$/)1%$)9$/#)./$5)#*)'5$quot;#'+7)%$,1#'2$)&$%+*%81quot;-$) 5'++$%$quot;-$/)9$#3$$quot;).quot;*&#'8'=$5)1quot;5)*&#'8'=$5)-*5$ !quot;)*#4$%)3*%5/>)#%7)#*)%$5.-$)#4$)81(quot;'#.5$/)*+) (,5?(/#@'quot;-*4$%$quot;#>)5'2$%($quot;#@9%1quot;-4>)1quot;5)31%&@/$%'1,'=$ 63
  • 113. ME CO
  • 114. Back Pocket Slides slide by David Cox
  • 115. 6.963 IT / A@M CUD 9 IAP0 Dense Linear Algebra
  • 116. !quot;#$quot;%&'#quot;()%*+,quot;-)( 4/5,-quot;.-,6789:; B,A-C8quot;,Dquot;7/-?8E:C/78quot;C/:8:; ! <128/-quot;:=:089: ! *8-,2/A01C:;quot;F>4 $% ! & ! >1?82@/7A8: ! +,9.A0/01,2/7quot;CG891:0-= $% ! quot;% ! B12?A7/-quot;@/7A8: ! )/0/quot;91212? $ ! #!quot; ! !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
  • 117. !quot;#$quot;%&'#quot;()%*+,quot;-)( 4/5,-quot;.-,6789:; *7?,-10C9:; ! <128/-quot;:=:089: ! D28E:1F8Fquot;G/H0,-1I/01,2:; $% ! & <JKquot;+C,78:L=Kquot;MN ! >1?82@/7A8: ! OP,E:1F8Fquot;G/H0,-1I/01,2:; $% ! quot;% MNquot;/7?3Kquot;Q/H,61 ! B12?A7/-quot;@/7A8: ! OP,E:1F8Fquot;G/H0,-1I/01,2:; $ ! #!quot; ! B') !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
  • 118. !quot;#$%&'!quot;#$%&()$*+,-#.(!quot;#quot;!quot;$quot;%quot;&quot;' 45*6quot;78,-0-/29::;< +=45*6quot;7+;< 6789:;$<=--(!)*+,!!)*+, !quot;##!$%&''(!)*+,!)*+, -,!.,!/, -,!.,!/, 012,!>quot;,!#3quot;, 012,!quot;,!#3quot;, >4,!#34, 012,!>!,!#3!!5? 4,!#34, 012,!!,!#3!!5 +,>.?0/01,2quot;12quot;@A=quot;-BC?1-BD< ! (2101/E1F/01,2quot;,Gquot;+=)*quot;B2H1-,2>B20 ! *EE,I/01,2quot;,Gquot;J/0/quot;D0-?I0?-BDquot;12quot;@A=quot;>B>,-Kquot;7L/2JEB-Dquot;!quot;#$!%#$!&; ! M-/2DGB-quot;,Gquot;J/0/quot;7>/0-1IBDquot;quot;#$%#$&; ! +,>.?0/01,2quot;7I?NE/D6OB>>; ! PB0-1BHBquot;-BD?E0quot;7>/0-1Qquot;&; ! 8-BBquot;J/0/quot;D0-?I0?-BDquot;12quot;@A=quot;>B>,-K ! MB->12/01,2quot;,Gquot;+=)*quot;B2H1-,2>B20 !quot;#$$%quot;&'()(*quot;+,-.,-/01,23
  • 119. quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;()*+,(! quot;#$!%&'(! quot;-./01! 2011quot;-.! 0011quot;-. 0311quot;-4 5!*6!7(,8*+!,*+(9! :1! ;3! ;3! <! ,*+(!,=*,>?!quot;@A! ;B:1! ;B3C! ;B:D! ;B<D! +(EF98(+9G,*+(! 3<HI! :/HI! :/HI! :/HI! 9'('G,*+(! ;3HI! ;3HI! ;3HI! ;3HI! '('*+J!KL9?!quot;@A ;B;! ;B;! 1B2! ;B1! '('*+J!KL9?!MF%9 D;/! /D3! :0<! ;/0! K&%NOFN8P?!quot;IG9! ;<;! C1! 03! :/! '('*+J!&'*L%8! ;quot;I! D;/QI! C30QI! /D3QI! 4#?!M(&>!quot;6=*MG9! 3/<! </2! :<3! 2:! 4#?!M(&>!M(+!,*+(! /;! /C! //! /:! 4#?!6=*M9RO*+N! ;0! /D! ;3! ;/! S#?!M(&>!quot;6=*MG9! C0! T! T! T! S#?!6=*M9RO*+N! <B<! T! T! T! -&K=(!;R!-P(!=F98!*6!8P(!quot;#$9!L9(N!F%!8PF9!98LNJB!4#!F9!9F%E=(!M+(U ,F9F*%!&%N!S#!F9!N*LK=(!M+(,F9F*%B!4'('!F9!9P&+(N!'('*+JB!#(&>! 6=*M!+&8(9!&+(!9P*O%!6*+!'L=8FM=J!&%N!&NN!*M(+&8F*%9B!)=*M9RO*+N! F9!8P(!+&8F*!*6!M(&>!quot;6=*MG9!+&8(!8*!MF%U'('*+J!K&%NOFN8P!F%!