Sci-Prog seminar series
Talks on computing and programming related topics ranging from basic to
                           advanced levels.



                Talk: Using GPUs for parallel processing
                          A. Stephen McGough


        Website: http://conferences.ncl.ac.uk/sciprog/index.php
   Research community site: contact Matt Wade for access
            Alerts mailing list: sci-prog-seminars@ncl.ac.uk
                   (sign up at http://lists.ncl.ac.uk )

Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough,
                 Dr Ben Allen and Gregg Iceton
Using GPUs for parallel processing

         A. Stephen McGough
Why?
• Is Moore’s law (strictly, an observation) dead?
     • “the number of transistors on integrated circuits
       doubles approximately every two years”
        – Processors aren’t getting faster… They’re getting fatter

Processor Speed and Energy
• Assume a 1 GHz core consumes 1 watt
• A 4 GHz core consumes ~64 watts
• Four 1 GHz cores consume ~4 watts
• Power ~ frequency^3

Computers are going many-core
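The power figures above follow from the cubic rule. A minimal sketch in plain C++, assuming the slide's baseline of 1 watt at 1 GHz; the function names are illustrative and not part of the talk:

```cpp
#include <cassert>

// Power of one core in watts, assuming power grows with the cube of the
// clock frequency and that a 1 GHz core draws 1 watt (the slide's assumption).
double core_power_watts(double freq_ghz) {
    assert(freq_ghz >= 0.0);
    return freq_ghz * freq_ghz * freq_ghz;
}

// Total power of `cores` identical cores running at the same frequency.
double total_power_watts(int cores, double freq_ghz) {
    assert(cores >= 0);
    return cores * core_power_watts(freq_ghz);
}
```

Under these assumptions total_power_watts(1, 4.0) is 64 and total_power_watts(4, 1.0) is 4, so four slow cores cost 16 times less power than one fast core.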
What?
• Games industry is multi-billion dollar
• Gamers want photo-realistic games
  – Computationally expensive
  – Requires complex physics calculations
• Latest generation of Graphics Processing Units (GPUs)
  are therefore many-core parallel processors
  – General Purpose Graphics Processing Units - GPGPUs
Not just normal processors
• 1000s of cores
  – But cores are simpler than a normal processor
  – Multiple cores perform the same action at the same
    time – Single Instruction Multiple Data – SIMD
• Conventional processor -> Minimize latency
  – Of a single program
• GPU -> Maximize throughput of all cores
• Potential for orders of magnitude speed-up
“If you were plowing a field, which would you
        rather use: two strong oxen or 1024 chickens?”

• Famous quote from Seymour Cray arguing for
  small numbers of processors
  – But the chickens are now winning
• Need a new way to think about programming
  – Need hugely parallel algorithms
     • Many existing algorithms won’t work (efficiently)
Some Issues with GPGPUs
• Cores are slower than a standard CPU
   – But you have lots more
• No direct control on when your code runs on a core
   – GPGPU decides where and when
      • Can’t communicate between cores
      • Order of execution is ‘random’
   – Synchronization is through exiting parallel GPU code
• SIMD only works (efficiently) if all cores are doing the
  same thing
   – NVIDIA GPUs have Warps of 32 cores working together
      • Code divergence within a Warp makes its branches run one after another
• Cores can interfere with each other
   – Overwriting each other’s memory
How
• Many approaches
  – OpenGL – for the mad Guru
  – Compute Unified Device Architecture (CUDA)
  – OpenCL – emerging standard
  – Dynamic Parallelism – For existing code loops
• Focus here on CUDA
  – Well developed and supported
  – Exploits full power of GPGPU
CUDA
• CUDA is a set of extensions to C/C++
   – (and Fortran)
• Code consists of sequential and parallel parts
   – Parallel parts are written as kernels
           • Describe what one thread of the code will do
 Program flow:
   Start
   -> Sequential code
   -> Transfer data to card
   -> Execute Kernel
   -> Transfer data from card
   -> Sequential code
   Finish
Example: Vector Addition
• One dimensional data
• Add two vectors (A,B) together to produce C
• Need to define the kernel to run and the main
  code
• Each thread can compute a single value for C
Example: Vector Addition
• Pseudo code for the kernel:
  – Identify which element in the vector I’m computing
     • i
  – Compute C[i] = A[i] + B[i]


• How do we identify our index (i)?
Blocks and Threads
• In CUDA the whole data space is the Grid
   – Divided into a number of blocks
      • Divided into a number of threads
• Blocks can be executed in any order
• Threads in a block are executed together
• Blocks and Threads can be 1D, 2D or 3D
Blocks
• As Blocks are executed in arbitrary order this gives CUDA the
  opportunity to scale to the number of cores in a particular device
Thread id
• CUDA provides three pieces of data for
  identifying a thread
  – blockIdx – block identity
  – blockDim – the size of a block (number of threads in a block)
  – threadIdx – identity of a thread in a block
• Can use these to compute the absolute thread id
        id = blockIdx * blockDim + threadIdx
• E.g. blockIdx = 2, blockDim = 3, threadIdx = 1
• id = 2 * 3 + 1 = 7

        Block          Block0   Block1   Block2
        Thread index   0 1 2    0 1 2    0 1 2
        Absolute id    0 1 2    3 4 5    6 7 8
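The id arithmetic can be checked on the host without a GPU. A small sketch in plain C++; the function name and parameters are illustrative stand-ins, since on the device these values come from CUDA's built-in variables rather than arguments:

```cpp
#include <cassert>

// Host-side stand-in for the CUDA expression
//   blockDim.x * blockIdx.x + threadIdx.x
// Parameter names are illustrative only.
int absolute_thread_id(int block_idx, int block_dim, int thread_idx) {
    assert(block_dim > 0);
    assert(block_idx >= 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

With the slide's values, absolute_thread_id(2, 3, 1) gives 7, the second thread of the third block.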
Example: Vector Addition
                         Kernel code

 // __global__ marks the entry point for a kernel;
 // the rest is a normal function definition
 __global__ void vector_add(double *A, double *B,
                            double* C, int N) {
   // Compute my absolute thread id - block and thread
   int id = blockDim.x * blockIdx.x + threadIdx.x;
   // We might be invalid if the data size is not completely divisible by the block size
   if (id >= N) {return;} // I'm not a valid ID
   C[id] = A[id] + B[id]; // do the work
 }
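To see why the bounds check matters, the launch can be imitated on the CPU by looping over every block and thread. This is only a sequential sketch in plain C++ under that assumption (a real GPU runs the threads in parallel and in no fixed order); the function names are hypothetical:

```cpp
#include <cassert>
#include <vector>

// Body of the kernel for one (block, thread) pair, written as ordinary C++.
void vector_add_one_thread(int block_idx, int block_dim, int thread_idx,
                           const double* A, const double* B, double* C, int N) {
    int id = block_dim * block_idx + thread_idx;
    if (id >= N) { return; }   // surplus thread in the last block
    C[id] = A[id] + B[id];
}

// Emulate a 1D launch: visit every thread of every block, one after another.
// Returns how many threads were started (including surplus ones).
int emulate_vector_add(const std::vector<double>& A, const std::vector<double>& B,
                       std::vector<double>& C, int block_dim) {
    assert(block_dim > 0);
    assert(A.size() == B.size() && A.size() == C.size());
    int N = static_cast<int>(A.size());
    int blocks = (N + block_dim - 1) / block_dim;  // round up
    for (int b = 0; b < blocks; b++) {
        for (int t = 0; t < block_dim; t++) {
            vector_add_one_thread(b, block_dim, t, A.data(), B.data(), C.data(), N);
        }
    }
    return blocks * block_dim;
}
```

For 10 elements and a block size of 4, three blocks (12 threads) are started and the last two threads return early without touching memory.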
Example: Vector Addition
         Pseudo code for sequential code
• Create Data on Host Computer

• Create space on device

• Copy data to device
• Run Kernel
• Copy data back to host and do something with it
• Clean up
Host and Device
• Data needs copying to / from the GPU (device)
• Often end up with same data on both
  – Suffix variable names with _device or _host
     • To help identify where data is

         Host: A_host                   Device: A_device
Example: Vector Addition
int N = 2000;
// Create data on host computer
double *A_host = new double[N];
double *B_host = new double[N];
double *C_host = new double[N];
for (int i = 0; i < N; i++) { A_host[i] = i; B_host[i] = (double)i / N; }
// Allocate space on device (GPGPU)
double *A_device, *B_device, *C_device;
cudaMalloc((void**) &A_device, N*sizeof(double));
cudaMalloc((void**) &B_device, N*sizeof(double));
cudaMalloc((void**) &C_device, N*sizeof(double));
// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*sizeof(double), cudaMemcpyHostToDevice);
// How many blocks will we need? Choose block size of 256 and round up
int blocks = (N + 255) / 256;
vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N); // run kernel
// Copy data back
cudaMemcpy(C_host, C_device, N*sizeof(double), cudaMemcpyDeviceToHost);
// do something with result

// Free device memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
// Free host memory (allocated with new[], so release with delete[])
delete[] A_host; delete[] B_host; delete[] C_host;
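The host code has to round the block count up so that a partly filled final block is still launched. A minimal sketch of that calculation in integer arithmetic, in plain C++; blocks_needed is an illustrative helper name, not a CUDA function:

```cpp
#include <cassert>

// Number of blocks needed to cover n elements with block_size threads each,
// rounding up so a partly filled final block is still launched.
int blocks_needed(int n, int block_size) {
    assert(n >= 0);
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

For the example above, blocks_needed(2000, 256) is 8: seven full blocks cover 1792 elements and the eighth covers the remaining 208.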
More Complex: Matrix Addition
• Now a 2D problem
  – blockIdx, blockDim, threadIdx now have x and y
• But general principles hold
  – For kernel
     • Compute location in matrix of two dimensions
  – For main code
     • Define and transmit data
• But keep data 1D
  – Why?
Why data in 1D?
• If you define data as 2D there is no guarantee
  that data will be a contiguous block of memory
  – Can’t be transmitted to card in one command




[Figure: rows of a 2D array separated in memory by some other data]
Faking 2D data
• 2D data size N*M
• Define 1D array of size N*M
• Index element at [x,y] (0 <= x < N, 0 <= y < M) as
                    y*N+x
• Then can transfer to device in one go

Memory layout:  Row 1 | Row 2 | Row 3 | Row 4
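A short host-side sketch of the flattened layout in plain C++, using the same convention as the matrix kernel (idY * N + idX), so x runs over 0..N-1 and y over 0..M-1. The helper names are assumptions made for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of element (x, y) in a 1D array holding an N-wide, M-high grid
// stored row after row.
int flat_index(int x, int y, int N, int M) {
    assert(x >= 0 && x < N);
    assert(y >= 0 && y < M);
    return y * N + x;
}

// Allocate one contiguous buffer for the whole grid, filled with zeros.
std::vector<double> make_grid(int N, int M) {
    assert(N > 0 && M > 0);
    return std::vector<double>(static_cast<std::size_t>(N) * M, 0.0);
}
```

Because the buffer is contiguous, its data() pointer and N*M*sizeof(double) describe the whole matrix, which is what a single copy to the device needs.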
Example: Matrix Add
                              Kernel
__global__ void matrix_add(double *A, double *B, double* C, int N, int M)
{
  // Find my thread id in both dimensions - block and thread
  int idX = blockDim.x * blockIdx.x + threadIdx.x;
  int idY = blockDim.y * blockIdx.y + threadIdx.y;
  if (idX >= N || idY >= M) {return;} // I'm not a valid ID
  int id = idY * N + idX; // compute 1D location
  C[id] = A[id] + B[id]; // do my work
}
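One way to convince yourself that the 2D indexing covers the matrix exactly once is to imitate the launch on the CPU and count the writes. This is a sequential sketch in plain C++, assuming square blocks; count_writes_2d is a hypothetical helper, not part of CUDA:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulate a 2D launch over an N-wide, M-high matrix with square blocks of
// side block_dim, and record how many times each element is written.
std::vector<int> count_writes_2d(int N, int M, int block_dim) {
    assert(N > 0 && M > 0 && block_dim > 0);
    int blocks_x = (N + block_dim - 1) / block_dim;  // round up in x
    int blocks_y = (M + block_dim - 1) / block_dim;  // round up in y
    std::vector<int> writes(static_cast<std::size_t>(N) * M, 0);
    for (int by = 0; by < blocks_y; by++) {
        for (int bx = 0; bx < blocks_x; bx++) {
            for (int ty = 0; ty < block_dim; ty++) {
                for (int tx = 0; tx < block_dim; tx++) {
                    int idX = block_dim * bx + tx;
                    int idY = block_dim * by + ty;
                    if (idX >= N || idY >= M) { continue; }  // surplus thread
                    writes[idY * N + idX] += 1;              // 1D location
                }
            }
        }
    }
    return writes;
}
```

With N = 20, M = 10 and 16 x 16 blocks, the grid is 2 x 1 blocks (512 threads), of which 200 do work and each element is written once.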
Example: Matrix Addition
                              Main Code
int N = 20;
int M = 10;
// Define matrices on host
double *A_host = new double[N * M];
double *B_host = new double[N * M];
double *C_host = new double[N * M];
for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    A_host[i + j * N] = i; B_host[i + j * N] = (double)j/M;
  }
}

// Define space on device (GPGPU)
double *A_device, *B_device, *C_device;
cudaMalloc((void**) &A_device, N*M*sizeof(double));
cudaMalloc((void**) &B_device, N*M*sizeof(double));
cudaMalloc((void**) &C_device, N*M*sizeof(double));

// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*M*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*M*sizeof(double), cudaMemcpyHostToDevice);

// Run kernel. How many blocks will we need? Choose block size of 16 x 16 and round up
int blocksX = (N + 15) / 16;
int blocksY = (M + 15) / 16;
dim3 dimGrid(blocksX, blocksY);
dim3 dimBlocks(16, 16);
matrix_add<<<dimGrid, dimBlocks>>>(A_device, B_device, C_device, N, M);

// Bring data back from device to host
cudaMemcpy(C_host, C_device, N*M*sizeof(double), cudaMemcpyDeviceToHost);
//for (int i = 0; i < N*M; i++) printf("C[%d,%d] = %f\n", i/N, i%N, C_host[i]);

// Tidy up: free device memory, then host memory (new[] pairs with delete[])
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host;
Running Example
• Computer: condor-gpu01
  – Set path
     • set path = ( $path /usr/local/cuda/bin/ )
• Compile command nvcc
• Then just run the binary file

• C2050, 448 cores, 3GB RAM
  – Single precision: 1.03 Tflops
  – Double precision: 515 Gflops
Summary and Questions
• GPGPUs have great potential for parallelism
• But at a cost
   – Not ‘normal’ parallel computing
   – Need to think about problems in a new way
• Further reading
   – NVIDIA CUDA Zone https://developer.nvidia.com/category/zone/cuda-zone
   – Online courses https://www.coursera.org/course/hetero

Más contenido relacionado

La actualidad más candente

CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
Implementation of Computational Algorithms using Parallel Programming
Implementation of Computational Algorithms using Parallel ProgrammingImplementation of Computational Algorithms using Parallel Programming
Implementation of Computational Algorithms using Parallel Programmingijtsrd
 
Gpu programming with java
Gpu programming with javaGpu programming with java
Gpu programming with javaGary Sieling
 
Introduction to Homomorphic Encryption
Introduction to Homomorphic EncryptionIntroduction to Homomorphic Encryption
Introduction to Homomorphic EncryptionChristoph Matthies
 
A survey on Fully Homomorphic Encryption
A survey on Fully Homomorphic EncryptionA survey on Fully Homomorphic Encryption
A survey on Fully Homomorphic Encryptioniosrjce
 
Agent threading model
Agent threading modelAgent threading model
Agent threading modelJudd Gaddie
 
CUDA by Example : Thread Cooperation : Notes
CUDA by Example : Thread Cooperation : NotesCUDA by Example : Thread Cooperation : Notes
CUDA by Example : Thread Cooperation : NotesSubhajit Sahu
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineNarann29
 
TensorFlow Study Part I
TensorFlow Study Part ITensorFlow Study Part I
TensorFlow Study Part ITe-Yen Liu
 
Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...DefCamp
 
Homomorphic Encryption
Homomorphic EncryptionHomomorphic Encryption
Homomorphic EncryptionVictor Pereira
 
Introduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System LanguageIntroduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System Language安齊 劉
 
2013 0928 programming by cuda
2013 0928 programming by cuda2013 0928 programming by cuda
2013 0928 programming by cuda小明 王
 
Groovy Fly Through
Groovy Fly ThroughGroovy Fly Through
Groovy Fly Throughniklal
 
Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Daniel Lemire
 

La actualidad más candente (20)

CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
GLSL
GLSLGLSL
GLSL
 
Implementation of Computational Algorithms using Parallel Programming
Implementation of Computational Algorithms using Parallel ProgrammingImplementation of Computational Algorithms using Parallel Programming
Implementation of Computational Algorithms using Parallel Programming
 
Gpu programming with java
Gpu programming with javaGpu programming with java
Gpu programming with java
 
Introduction to Homomorphic Encryption
Introduction to Homomorphic EncryptionIntroduction to Homomorphic Encryption
Introduction to Homomorphic Encryption
 
A survey on Fully Homomorphic Encryption
A survey on Fully Homomorphic EncryptionA survey on Fully Homomorphic Encryption
A survey on Fully Homomorphic Encryption
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
Agent threading model
Agent threading modelAgent threading model
Agent threading model
 
CUDA by Example : Thread Cooperation : Notes
CUDA by Example : Thread Cooperation : NotesCUDA by Example : Thread Cooperation : Notes
CUDA by Example : Thread Cooperation : Notes
 
Codes and Isogenies
Codes and IsogeniesCodes and Isogenies
Codes and Isogenies
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 
TensorFlow Study Part I
TensorFlow Study Part ITensorFlow Study Part I
TensorFlow Study Part I
 
Computing on Encrypted Data
Computing on Encrypted DataComputing on Encrypted Data
Computing on Encrypted Data
 
Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...Secure and privacy-preserving data transmission and processing using homomorp...
Secure and privacy-preserving data transmission and processing using homomorp...
 
Homomorphic Encryption
Homomorphic EncryptionHomomorphic Encryption
Homomorphic Encryption
 
Introduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System LanguageIntroduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System Language
 
2013 0928 programming by cuda
2013 0928 programming by cuda2013 0928 programming by cuda
2013 0928 programming by cuda
 
Groovy Fly Through
Groovy Fly ThroughGroovy Fly Through
Groovy Fly Through
 
Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)Engineering fast indexes (Deepdive)
Engineering fast indexes (Deepdive)
 

Destacado

Parallel Processing with IPython
Parallel Processing with IPythonParallel Processing with IPython
Parallel Processing with IPythonEnthought, Inc.
 
Parallel processing & Multi level logic
Parallel processing & Multi level logicParallel processing & Multi level logic
Parallel processing & Multi level logicHamza Saleem
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman
 
Parallel Processing for Digital Image Enhancement
Parallel Processing for Digital Image EnhancementParallel Processing for Digital Image Enhancement
Parallel Processing for Digital Image EnhancementNora Youssef
 
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesIntroduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesOfir Manor
 
QGIS plugin for parallel processing in terrain analysis
QGIS plugin for parallel processing in terrain analysisQGIS plugin for parallel processing in terrain analysis
QGIS plugin for parallel processing in terrain analysisRoss McDonald
 
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...Johann Schleier-Smith
 
Parallel batch processing with spring batch slideshare
Parallel batch processing with spring batch   slideshareParallel batch processing with spring batch   slideshare
Parallel batch processing with spring batch slideshareMorten Andersen-Gott
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQInMobi Technology
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Information processing approach
Information processing approachInformation processing approach
Information processing approachaj9ajeet
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 

Destacado (14)

Parallel Processing with IPython
Parallel Processing with IPythonParallel Processing with IPython
Parallel Processing with IPython
 
Parallel processing & Multi level logic
Parallel processing & Multi level logicParallel processing & Multi level logic
Parallel processing & Multi level logic
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
 
Parallel Processing for Digital Image Enhancement
Parallel Processing for Digital Image EnhancementParallel Processing for Digital Image Enhancement
Parallel Processing for Digital Image Enhancement
 
Computer Architecture
Computer ArchitectureComputer Architecture
Computer Architecture
 
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesIntroduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
 
QGIS plugin for parallel processing in terrain analysis
QGIS plugin for parallel processing in terrain analysisQGIS plugin for parallel processing in terrain analysis
QGIS plugin for parallel processing in terrain analysis
 
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...
 
Parallel batch processing with spring batch slideshare
Parallel batch processing with spring batch   slideshareParallel batch processing with spring batch   slideshare
Parallel batch processing with spring batch slideshare
 
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQMassively Parallel Processing with Procedural Python - Pivotal HAWQ
Massively Parallel Processing with Procedural Python - Pivotal HAWQ
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Information processing approach
Information processing approachInformation processing approach
Information processing approach
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 

Similar a Using GPUs for parallel processing

002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.pptceyifo9332
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...mouhouioui
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDARaymond Tay
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedHimanshu577858
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introductionHanibei
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfpepe464163
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012DefCamp
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceScaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceNeo4j
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 

Similar a Using GPUs for parallel processing (20)

002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
Introduction to CUDA
Introduction to CUDAIntroduction to CUDA
Introduction to CUDA
 
introduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely usedintroduction to CUDA_C.pptx it is widely used
introduction to CUDA_C.pptx it is widely used
 
Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
 
Tema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdfTema3_Introduction_to_CUDA_C.pdf
Tema3_Introduction_to_CUDA_C.pdf
 
Cuda 2011
Cuda 2011Cuda 2011
Cuda 2011
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceScaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data Science
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 

Using GPUs for parallel processing

  • 1. Sci-Prog seminar series Talks on computing and programming related topics ranging from basic to advanced levels. Talk: Using GPUs for parallel processing A. Stephen McGough Website: http://conferences.ncl.ac.uk/sciprog/index.php Research community site: contact Matt Wade for access Alerts mailing list: sci-prog-seminars@ncl.ac.uk (sign up at http://lists.ncl.ac.uk ) Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough, Dr Ben Allen and Gregg Iceton
  • 2. Using GPUs for parallel processing A. Stephen McGough
  • 3. Why? • Moore’s law (really an observation) is dead? • “the number of transistors on integrated circuits doubles approximately every two years” – Processors aren’t getting faster… They’re getting fatter • Processor Speed and Energy: assume a 1 GHz core consumes 1 watt; a 4 GHz core then consumes ~64 watts, while four 1 GHz cores consume ~4 watts (power ∝ frequency³) • Computers are going many-core
  • 4. What? • Games industry is multi-billion dollar • Gamers want photo-realistic games – Computationally expensive – Requires complex physics calculations • Latest generation of Graphics Processing Units (GPUs) are therefore many-core parallel processors – General Purpose Graphics Processing Units (GPGPUs)
  • 5. Not just normal processors • 1000s of cores – But cores are simpler than a normal processor – Multiple cores perform the same action at the same time – Single Instruction Multiple Data (SIMD) • Conventional processor -> Minimize latency – Of a single program • GPU -> Maximize throughput of all cores • Potential for orders of magnitude speed-up
  • 6. “If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?” • Famous quote from Seymour Cray arguing for small numbers of processors – But the chickens are now winning • Need a new way to think about programming – Need hugely parallel algorithms • Many existing algorithms won’t work (efficiently)
  • 7. Some Issues with GPGPUs • Cores are slower than a standard CPU – But you have lots more • No direct control on when your code runs on a core – GPGPU decides where and when • Can’t communicate between cores • Order of execution is ‘random’ – Synchronization is through exiting parallel GPU code • SIMD only works (efficiently) if all cores are doing the same thing – NVIDIA GPUs have Warps of 32 cores working together • Code divergence within a Warp makes its branches run one after another • Cores can interfere with each other – Overwriting each other’s memory
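The Warp point on this slide can be made concrete with a small CPU-side sketch. It only counts how many Warps contain threads that disagree on a branch condition; it does not model how any particular GPU schedules those branches. The function name `count_divergent_warps` and its parameters are hypothetical, introduced here for illustration.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: count the warps in which threads disagree about a branch.
// takes_branch(id) says whether the thread with absolute id `id` takes the branch.
// A warp size of 32 follows the slide; real devices may differ.
int count_divergent_warps(int n_threads,
                          const std::function<bool(int)> &takes_branch,
                          int warp_size = 32) {
    assert(n_threads >= 0 && warp_size > 0);
    int divergent = 0;
    for (int start = 0; start < n_threads; start += warp_size) {
        // The last warp may be only partially filled.
        int end = (start + warp_size < n_threads) ? start + warp_size : n_threads;
        bool first = takes_branch(start);
        for (int id = start + 1; id < end; ++id) {
            if (takes_branch(id) != first) {  // two threads in this warp disagree
                ++divergent;
                break;
            }
        }
    }
    return divergent;
}
```

With 128 threads, a condition such as `id % 2 == 0` splits every Warp, whereas `id < 64` lines up with Warp boundaries and splits none, which is why branch conditions are often arranged to follow Warp boundaries.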
  • 8. How • Many approaches – OpenGL – for the mad Guru – Compute Unified Device Architecture (CUDA) – OpenCL – emerging standard – Dynamic Parallelism – For existing code loops • Focus here on CUDA – Well developed and supported – Exploits full power of GPGPU
  • 9. CUDA • CUDA is a set of extensions to C/C++ – (and Fortran) • Code consists of sequential and parallel parts – Parallel parts are written as kernels • Describe what one thread of the code will do • Flow: Start → Sequential code → Transfer data to card → Execute Kernel → Transfer data from card → Sequential code → Finish
  • 10. Example: Vector Addition • One dimensional data • Add two vectors (A,B) together to produce C • Need to define the kernel to run and the main code • Each thread can compute a single value for C
  • 11. Example: Vector Addition • Pseudo code for the kernel: – Identify which element in the vector I’m computing: i – Compute C[i] = A[i] + B[i] • How do we identify our index (i)?
  • 12. Blocks and Threads • In CUDA the whole data space is the Grid – Divided into a number of blocks • Divided into a number of threads • Blocks can be executed in any order • Threads in a block are executed together • Blocks and Threads can be 1D, 2D or 3D
  • 13. Blocks • As Blocks are executed in arbitrary order this gives CUDA the opportunity to scale to the number of cores in a particular device
  • 14. Thread id • CUDA provides three pieces of data for identifying a thread – blockIdx – block identity – blockDim – the size of a block (number of threads in a block) – threadIdx – identity of a thread in a block • Can use these to compute the absolute thread id: id = blockIdx * blockDim + threadIdx • E.g. blockIdx = 2, blockDim = 3, threadIdx = 1 gives id = 2 * 3 + 1 = 7 • Figure: three blocks (Block0, Block1, Block2) of three threads each; thread indices 0 1 2 within each block map to absolute ids 0 to 8
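The id formula on this slide can be checked on the CPU with a few lines of plain C++. This is a minimal sketch for the 1D case only; `global_thread_id` and `print_grid` are hypothetical helper names, not part of CUDA.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical CPU-side mirror of the slide's formula (1D case):
// id = blockIdx * blockDim + threadIdx
int global_thread_id(int block_idx, int block_dim, int thread_idx) {
    assert(block_idx >= 0 && block_dim > 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}

// Print the absolute id of every thread in a 1D grid, one line per block.
void print_grid(int num_blocks, int block_dim) {
    for (int b = 0; b < num_blocks; ++b) {
        std::printf("Block%d:", b);
        for (int t = 0; t < block_dim; ++t) {
            std::printf(" %d", global_thread_id(b, block_dim, t));
        }
        std::printf("\n");
    }
}
```

Calling `print_grid(3, 3)` lists the ids 0 to 8 across Block0, Block1 and Block2, matching the slide's worked example where block 2, thread 1 has id 7.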
  • 15. Example: Vector Addition – Kernel code
        // __global__ marks the entry point for a kernel; the rest is a normal function definition
        __global__ void vector_add(double *A, double *B, double *C, int N) {
            // Find my thread id - block and thread (compute my absolute thread id)
            int id = blockDim.x * blockIdx.x + threadIdx.x;
            // We might be invalid if the data size is not completely divisible by the block size
            if (id >= N) { return; }   // I'm not a valid ID
            C[id] = A[id] + B[id];     // do my work
        }
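To see why the kernel needs its bounds check, the launch can be emulated on the CPU: every (block, thread) pair calls the same per-thread function, and the threads past the end of the data simply return. This is a sketch under that assumption, not how a GPU actually schedules work; `vector_add_thread` and `emulate_vector_add` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// CPU stand-in for the kernel body: the work done by ONE thread.
void vector_add_thread(int block_idx, int block_dim, int thread_idx,
                       const double *A, const double *B, double *C, int N) {
    int id = block_dim * block_idx + thread_idx;  // absolute thread id
    if (id >= N) { return; }                      // surplus thread: nothing to do
    C[id] = A[id] + B[id];
}

// Hypothetical emulation of a 1D launch: run every thread of every block in turn.
// On a GPU the order is not guaranteed; the result is the same because each
// thread writes only its own element.
std::vector<double> emulate_vector_add(const std::vector<double> &A,
                                       const std::vector<double> &B,
                                       int block_dim) {
    assert(A.size() == B.size() && block_dim > 0);
    int N = static_cast<int>(A.size());
    int blocks = (N + block_dim - 1) / block_dim;  // round up to cover all N elements
    std::vector<double> C(A.size(), 0.0);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < block_dim; ++t) {
            vector_add_thread(b, block_dim, t, A.data(), B.data(), C.data(), N);
        }
    }
    return C;
}
```

With five elements and a block size of four, two blocks (eight threads) are launched and the last three threads exit through the bounds check without touching memory.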
  • 16. Example: Vector Addition Pseudo code for sequential code • Create Data on Host Computer • Create space on device • Copy data to device • Run Kernel • Copy data back to host and do something with it • Clean up
  • 17. Host and Device • Data needs copying to / from the GPU (device) • Often end up with same data on both – Suffix variable names with _device or _host • To help identify where data is • Figure: A_host in Host memory, A_device in Device memory
  • 18. Example: Vector Addition
        int N = 2000;
        double *A_host = new double[N];   // Create data on host computer
        double *B_host = new double[N];
        double *C_host = new double[N];
        for (int i = 0; i < N; i++) { A_host[i] = i; B_host[i] = (double)i / N; }
        double *A_device, *B_device, *C_device;   // allocate space on device GPGPU
        cudaMalloc((void**) &A_device, N * sizeof(double));
        cudaMalloc((void**) &B_device, N * sizeof(double));
        cudaMalloc((void**) &C_device, N * sizeof(double));
        // Copy data from host memory to device memory
        cudaMemcpy(A_device, A_host, N * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(B_device, B_host, N * sizeof(double), cudaMemcpyHostToDevice);
        // How many blocks will we need? Choose block size of 256 and round up
        int blocks = (N + 255) / 256;
        vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N);   // run kernel
        // Copy data back
        cudaMemcpy(C_host, C_device, N * sizeof(double), cudaMemcpyDeviceToHost);
        // do something with result
        cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);   // free device memory
        delete[] A_host; delete[] B_host; delete[] C_host;   // free host memory (allocated with new[])
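The block count in this host code is a round-up division: enough blocks to cover all N elements, with the last block possibly only partly used. A minimal CPU-side sketch of that arithmetic follows; `blocks_needed` and `idle_threads` are hypothetical helper names.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements,
// i.e. n / block_size rounded up, using integer arithmetic only.
int blocks_needed(int n, int block_size) {
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Threads launched beyond the end of the data; in the kernel these are the
// ones that return early through the id >= N check.
int idle_threads(int n, int block_size) {
    return blocks_needed(n, block_size) * block_size - n;
}
```

For the slide's values (N = 2000, block size 256) this gives 8 blocks, so 2048 threads are launched and 48 of them have no element to compute.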
  • 19. More Complex: Matrix Addition • Now a 2D problem – blockIdx, blockDim, threadIdx now have x and y • But general principles hold – For kernel • Compute location in matrix of two dimensions – For main code • Define and transmit data • But keep data 1D – Why?
  • 20. Why data in 1D? • If you define data as 2D there is no guarantee that data will be a contiguous block of memory – Can’t be transmitted to card in one command • Figure: matrix rows (X) separated in memory by some other data
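  • 21. Faking 2D data • 2D data size N*M (N elements per row, M rows) • Define 1D array of size N*M • Index element at [x,y] as y*N+x, with 0 <= x < N and 0 <= y < M • Then can transfer to device in one go • Figure: Row 1, Row 2, Row 3, Row 4 stored end to end in a single 1D array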
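The flat layout can be sketched in plain C++ as below. It assumes row-major storage with N as the row width, which is the convention the matrix kernel uses (id = idY * N + idX); `flat_index` and `make_matrix` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: map a 2D position (x, y) to an offset in a 1D array.
// `width` is the number of elements per row (N in the slides).
int flat_index(int x, int y, int width) {
    assert(width > 0 && x >= 0 && x < width && y >= 0);
    return y * width + x;
}

// One contiguous allocation for the whole matrix, so it can be copied to the
// device with a single transfer.
std::vector<double> make_matrix(int width, int height) {
    assert(width > 0 && height > 0);
    return std::vector<double>(static_cast<std::size_t>(width) * height, 0.0);
}
```

For a 20 by 10 matrix the last element (x = 19, y = 9) lands at offset 199, the final slot of the 200-element array, so every valid (x, y) stays in bounds.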
  • 22. Example: Matrix Add Kernel
        __global__ void matrix_add(double *A, double *B, double *C, int N, int M) {
            // Find my thread id in both dimensions - block and thread
            int idX = blockDim.x * blockIdx.x + threadIdx.x;
            int idY = blockDim.y * blockIdx.y + threadIdx.y;
            if (idX >= N || idY >= M) { return; }   // I'm not a valid ID
            int id = idY * N + idX;   // compute 1D location from both dimensions
            C[id] = A[id] + B[id];    // do my work
        }
  • 23. Example: Matrix Addition – Main Code
        int N = 20; int M = 10;
        // Define matrices on host
        double *A_host = new double[N * M];   // Create data on host computer
        double *B_host = new double[N * M];
        double *C_host = new double[N * M];
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
                A_host[i + j * N] = i;
                B_host[i + j * N] = (double)j / M;
            }
        }
        // Define space on device
        double *A_device, *B_device, *C_device;   // allocate space on device GPGPU
        cudaMalloc((void**) &A_device, N * M * sizeof(double));
        cudaMalloc((void**) &B_device, N * M * sizeof(double));
        cudaMalloc((void**) &C_device, N * M * sizeof(double));
        // Copy data from host memory to device memory
        cudaMemcpy(A_device, A_host, N * M * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(B_device, B_host, N * M * sizeof(double), cudaMemcpyHostToDevice);
        // How many blocks will we need? Choose block size of 16 x 16 and round up
        int blocksX = (N + 15) / 16;
        int blocksY = (M + 15) / 16;
        dim3 dimGrid(blocksX, blocksY);
        dim3 dimBlocks(16, 16);
        // Run kernel
        matrix_add<<<dimGrid, dimBlocks>>>(A_device, B_device, C_device, N, M);
        // Copy data back from device to host
        cudaMemcpy(C_host, C_device, N * M * sizeof(double), cudaMemcpyDeviceToHost);
        //for (int i = 0; i < N*M; i++) printf("C[%d,%d] = %f\n", i/N, i%N, C_host[i]);
        // Tidy up: free device memory, then host memory (allocated with new[])
        cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
        delete[] A_host; delete[] B_host; delete[] C_host;
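The 2D launch applies the same round-up idea once per dimension. The CPU-side sketch below works out the grid size and how many threads are launched for a given matrix; `GridDims`, `grid_for` and `launched_threads` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

// Hypothetical plain-C++ stand-in for the two grid dimensions passed to dim3.
struct GridDims {
    int blocks_x;
    int blocks_y;
};

// Number of tile x tile blocks needed to cover an n x m matrix (rounded up).
GridDims grid_for(int n, int m, int tile) {
    assert(n >= 0 && m >= 0 && tile > 0);
    return GridDims{(n + tile - 1) / tile, (m + tile - 1) / tile};
}

// Total threads launched for that grid, including those outside the matrix.
int launched_threads(int n, int m, int tile) {
    GridDims g = grid_for(n, m, tile);
    return g.blocks_x * g.blocks_y * tile * tile;
}
```

For the slide's 20 by 10 matrix with 16 by 16 blocks, the grid is 2 by 1, so 512 threads are launched for 200 elements; the other 312 leave through the kernel's bounds check.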
  • 24. Running Example • Computer: condor-gpu01 – Set path: set path = ( $path /usr/local/cuda/bin/ ) • Compile command: nvcc • Then just run the binary file • C2050, 448 cores, 3GB RAM – Single precision 1.03 Tflops – Double precision 515 Gflops
  • 25. Summary and Questions • GPGPUs have great potential for parallelism • But at a cost – Not ‘normal’ parallel computing – Need to think about problems in a new way • Further reading – NVIDIA CUDA Zone https://developer.nvidia.com/category/zone/cuda-zone – Online courses https://www.coursera.org/course/hetero
  • 26. Sci-Prog seminar series Talks on computing and programming related topics ranging from basic to advanced levels. Talk: Using GPUs for parallel processing A. Stephen McGough Website: http://conferences.ncl.ac.uk/sciprog/index.php Research community site: contact Matt Wade for access Alerts mailing list: sci-prog-seminars@ncl.ac.uk (sign up at http://lists.ncl.ac.uk ) Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough, Dr Ben Allen and Gregg Iceton