SlideShare una empresa de Scribd logo
1 de 16
Descargar para leer sin conexión
Heterogenous Parallel Programming
Class of 2014

Week 1 Summary

Update 1

CUDA

Pipat Methavanitpong
Heterogeneous Computing
● Diversity of Computing Units
○

CPU, GPU, DSP, Configurable Cores, Cloud Computing

● Right Man, Right Job
○

Each application requires different orientation to perform best

● Application Examples
○

Financial Analysis, Scientific Simulation, Digital Audio Processing,
Computer Vision, Numerical Methods, Interactive Physics
Latency and Throughput Orientation
Latency

Throughput

● Min Time
● Smart / Weak
● Best Path

● Max Throughput
● Stupid / Strong
● Brute Force
Latency and Throughput Orientation
CPU

GPU

● Best for Sequential
● Powerful ALU

● Best for Parallel
● Weak ALU

○
○
○

Few
Low Latency
Lightly Pipelined

● Large Cache
○

Lower Latency than RAM

● Sophisticated Control
○
○

Smart Branch INSN* to take
Smart Hazard Handling

○
○
○

Many
High Latency
Heavily Pipelined

● Small Cache
○

But boost mem throughput

● Simple Control
○
○

No Predict
No Data Forwarding
Latency and Throughput Orientation
CPU
ALU

GPU
ALU
Control

ALU

ALU

Cache
DRAM

DRAM
System Cost
● Hardware + Software Cost
● Software dominates after 2010
● Reduce Software Cost = One on Many
○

Scalability
■

○

Same Arch / New Hardware Offer: # of cores, pipeline depth, vector length

Portability
■

Different Arch: x86, ARM

■

Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
Data Parallelism
Manipulation of Data in Parallel
e.g. Vector Addition

A[0]

A[1]

A[2]

A[3]

B[0]

B[1]

B[2]

B[3]

+

+

+

+

C[0]

C[1]

C[2]

C[3]
Introduction to CUDA
➔
➔
➔
➔
➔
➔
➔

CUDA = Compute Unified Device Architecture
Introduced by NVIDIA
Distribute workload from a Host to CUDA capable Devices
NVIDIA = GPU = Throughput Oriented = Best Parallel
Use of GPU to compute as CPU = GPGPU
GPGPU = General Purpose GPU
Extend C / C++ / Fortran
CUDA Thread Organization

Block

Block

Block

Block

Block

Grid

● Grid = [Vector~3D Matrix] of Blocks
○ Block = [Vector~3D Matrix] of Threads
■ Thread = One that computes

Thread

Thread

Thread

Thread
CUDA Thread Organization
Grid Dimension
Declaration

Declaration

dim3 DimGrid(x,y,z);
*var name can be others

dim3 DimBlock(x,y,z);
*var name can be others

This Block

dim3 DimGrid
(2,1,1);
dim3 DimBlock
(256,1,1);

Block Dimension

This Thread

Block 0
t0

Block 1
t1

t2

...

t255

t0

t1

t2

...

t255
CUDA Memory Organization
A Thread have its Private Registers
Threads in a Block have common Shared Memory
Blocks in a same Grid have common Global and Constant Memory

Shared

Thread

Global,
Constant

Block

Grid

HOST

But Host can only access Global and Constant Memory

Register

Register

Register

Register
Memory Management Command
Prototype

typedef enum cudaError cudaError_t

// Allocate Memory on Device
cudaError_t cudaMalloc(void** devPtr, size_t size)

enum cudaError

// Copy Data

0.
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.

cudaSuccess
cudaErrorMissingConfiguration
cudaErrorMemoryAllocation
cudaErrorInitializationError
cudaErrorLaunchFailure
cudaErrorPriorLaunchFailure
cudaErrorLaunchTimeout
cudaErrorLaunchOutOfResources
cudaErrorInvalidDeviceFunction
cudaErrorInvalidConfiguration
cudaErrorInvalidDevice

…

…

cudaError_t cudaMemcpy(void* dst, const void* src,
size_t size, enum cudaMemcpyKind kind)
// Free Memory on Device
cudaError_t cudaFree(void* devPtr)

enum cudaMemcpyKind
0.
1.
2.
3.
4.

cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
cudaMemcpyDefault

For more information
http://developer.download.
nvidia.
com/compute/cuda/4_1/rel/tool
kit/docs/online/group__CUDA
RT__MEMORY.html

size - size in bytes
Kernel
Terminology for Function for Device to be called by Host
Declared by adding attribute to Function
Attribute

Return
Type

Function Type

Executed on

Only Callable
from

__device__ any

DeviceFunc()

device

device

__global__ void

KernelFunc()

device

host

host

host

__host__ any

HostFunc()

This attribute is optional
Starting Kernel Function by giving it Grid&Block Structure and Parameters
KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);
Waiting for all thrown tasks to complete before move on
cudaDeviceSynchronize();
Row-Major Layout
Way of addressing an element in an Array
Multi-dimensional array can be addressed by 1D array
C / C++ use Row-Major Layout
A0,1

A0,2

A0,3

A1,0

A1,1

A1,2

A2,1

A2,2

A0,1

A0,2

A0,3

A1,0

A1,1

A1,2

A1,3

A2,0

A2,1

A2,2

A2,2

A1

A2

A3

A4

A5

A6

A7

A8

A9

A10

A11

A1,3

A2,0

A0,0

A0

A0,0

A2,3

Fortran uses Col-Major Index
Sample Code: Vector Addition
__global__ void vecAdd(int *d_vIn1, int *d_vIn2, *d_vOut, int n) {
int pos = blockIdx.x * blockDim.x + threadIdx.x;
if (pos < n)
d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos];
}
…
int main() {
int vecLength = …;
int* h_input1 = {…}; int* h_input2 = {…};
int* h_output = (int *) malloc(vecLength * sizeof(int));
int* d_input1, d_input2, d_output;
cudaMalloc((void **) &d_input1, vecLength * sizeof(int));
cudaMalloc((void **) &d_input2, vecLength * sizeof(int));
cudaMalloc((void **) &d_output, vecLength * sizeof(int));
cudaMemcpy(d_input1,h_input1,vecLength*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_input2,h_input2,vecLength*sizeof(int),cudaMemcpyHostToDevice);
dim3 dimGrid((vecLength-1)/256+1,1,1);
dim3 dimBlock(256,1,1);
vecAdd<<<dimGrid,dimBlock>>>(d_input1,d_input2,d_output,vecLength);
cudaDeviceSynchronize();
cudaMemcpy(h_output,d_output,vecLength*sizeof(int),cudaMemcpyDeviceToHost);
cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output);
return 0;
}
Error Checking Pattern
cudaError_t err = cudaMalloc((void **)) &d_input1, size);
if (err != cudaSuccess) {
printf(“%s in %s at line %dn”,
cudaGetErrorString(err), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}

Más contenido relacionado

La actualidad más candente

GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingJun Young Park
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)Alex Rasmussen
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAPiyush Mittal
 

La actualidad más candente (7)

NUMA and Java Databases
NUMA and Java DatabasesNUMA and Java Databases
NUMA and Java Databases
 
Cuda
CudaCuda
Cuda
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
TritonSort: A Balanced Large-Scale Sorting System (NSDI 2011)
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
Chap 17 advfs
Chap 17 advfsChap 17 advfs
Chap 17 advfs
 

Destacado

Hypergraph Mining For Social Networks
Hypergraph Mining For Social NetworksHypergraph Mining For Social Networks
Hypergraph Mining For Social NetworksGiacomo Bergami
 
Android Internals (This is not the droid you’re loking for...)
Android Internals (This is not the droid you’re loking for...)Android Internals (This is not the droid you’re loking for...)
Android Internals (This is not the droid you’re loking for...)Giacomo Bergami
 
Keynote presentation hr_and_optimism
Keynote presentation hr_and_optimismKeynote presentation hr_and_optimism
Keynote presentation hr_and_optimismBusiness_and_Optimism
 
Empathize and define
Empathize and defineEmpathize and define
Empathize and definealanmcn
 
May 2013 staff mtg
May 2013 staff mtgMay 2013 staff mtg
May 2013 staff mtgdmc1922
 

Destacado (6)

Hypergraph Mining For Social Networks
Hypergraph Mining For Social NetworksHypergraph Mining For Social Networks
Hypergraph Mining For Social Networks
 
Android Internals (This is not the droid you’re loking for...)
Android Internals (This is not the droid you’re loking for...)Android Internals (This is not the droid you’re loking for...)
Android Internals (This is not the droid you’re loking for...)
 
Keynote presentation hr_and_optimism
Keynote presentation hr_and_optimismKeynote presentation hr_and_optimism
Keynote presentation hr_and_optimism
 
Empathize and define
Empathize and defineEmpathize and define
Empathize and define
 
Suvidhi Industries
Suvidhi IndustriesSuvidhi Industries
Suvidhi Industries
 
May 2013 staff mtg
May 2013 staff mtgMay 2013 staff mtg
May 2013 staff mtg
 

Similar a HPP Week 1 Summary

Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDAMartin Peniak
 
Linux Hosting Training Course Level 1-1
Linux Hosting Training Course Level 1-1Linux Hosting Training Course Level 1-1
Linux Hosting Training Course Level 1-1Ramy Allam
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmAnne Nicolas
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Processor Organization
Processor OrganizationProcessor Organization
Processor OrganizationDominik Salvet
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Gpu computing workshop
Gpu computing workshopGpu computing workshop
Gpu computing workshopdatastack
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance CachingScyllaDB
 
An End to Order (many cores with java, session two)
An End to Order (many cores with java, session two)An End to Order (many cores with java, session two)
An End to Order (many cores with java, session two)Robert Burrell Donkin
 
Efficient Buffer Management
Efficient Buffer ManagementEfficient Buffer Management
Efficient Buffer Managementbasisspace
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 

Similar a HPP Week 1 Summary (20)

Micro-controllers (PIC) based Application Development
Micro-controllers (PIC) based Application DevelopmentMicro-controllers (PIC) based Application Development
Micro-controllers (PIC) based Application Development
 
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDAIntroduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
 
Linux Hosting Training Course Level 1-1
Linux Hosting Training Course Level 1-1Linux Hosting Training Course Level 1-1
Linux Hosting Training Course Level 1-1
 
Multicore
MulticoreMulticore
Multicore
 
Threads and processes
Threads and processesThreads and processes
Threads and processes
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Processor Organization
Processor OrganizationProcessor Organization
Processor Organization
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Lect 1 Into.pptx
Lect 1 Into.pptxLect 1 Into.pptx
Lect 1 Into.pptx
 
Caching in
Caching inCaching in
Caching in
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Gpu computing workshop
Gpu computing workshopGpu computing workshop
Gpu computing workshop
 
An End to Order
An End to OrderAn End to Order
An End to Order
 
cachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Cachingcachegrand: A Take on High Performance Caching
cachegrand: A Take on High Performance Caching
 
An End to Order (many cores with java, session two)
An End to Order (many cores with java, session two)An End to Order (many cores with java, session two)
An End to Order (many cores with java, session two)
 
Efficient Buffer Management
Efficient Buffer ManagementEfficient Buffer Management
Efficient Buffer Management
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 

Más de Pipat Methavanitpong

Influence of Native Language and Society on English Proficiency
Influence of Native Language and Society on English ProficiencyInfluence of Native Language and Society on English Proficiency
Influence of Native Language and Society on English ProficiencyPipat Methavanitpong
 
Intel processor trace - What are Recorded?
Intel processor trace - What are Recorded?Intel processor trace - What are Recorded?
Intel processor trace - What are Recorded?Pipat Methavanitpong
 
Exploring the World Classroom: MOOC
Exploring the World Classroom: MOOCExploring the World Classroom: MOOC
Exploring the World Classroom: MOOCPipat Methavanitpong
 

Más de Pipat Methavanitpong (6)

Influence of Native Language and Society on English Proficiency
Influence of Native Language and Society on English ProficiencyInfluence of Native Language and Society on English Proficiency
Influence of Native Language and Society on English Proficiency
 
Return oriented programming (ROP)
Return oriented programming (ROP)Return oriented programming (ROP)
Return oriented programming (ROP)
 
Intel processor trace - What are Recorded?
Intel processor trace - What are Recorded?Intel processor trace - What are Recorded?
Intel processor trace - What are Recorded?
 
Principles in software debugging
Principles in software debuggingPrinciples in software debugging
Principles in software debugging
 
Exploring the World Classroom: MOOC
Exploring the World Classroom: MOOCExploring the World Classroom: MOOC
Exploring the World Classroom: MOOC
 
Seminar 12-11-19
Seminar 12-11-19Seminar 12-11-19
Seminar 12-11-19
 

Último

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Último (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

HPP Week 1 Summary

  • 1. Heterogenous Parallel Programming Class of 2014 Week 1 Summary Update 1 CUDA Pipat Methavanitpong
  • 2. Heterogeneous Computing ● Diversity of Computing Units ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing ● Right Man, Right Job ○ Each application requires different orientation to perform best ● Application Examples ○ Financial Analysis, Scientific Simulation, Digital Audio Processing, Computer Vision, Numerical Methods, Interactive Physics
  • 3. Latency and Throughput Orientation Latency Throughput ● Min Time ● Smart / Weak ● Best Path ● Max Throughput ● Stupid / Strong ● Brute Force
  • 4. Latency and Throughput Orientation CPU GPU ● Best for Sequential ● Powerful ALU ● Best for Parallel ● Weak ALU ○ ○ ○ Few Low Latency Lightly Pipelined ● Large Cache ○ Lower Latency than RAM ● Sophisticated Control ○ ○ Smart Branch INSN* to take Smart Hazard Handling ○ ○ ○ Many High Latency Heavily Pipelined ● Small Cache ○ But boost mem throughput ● Simple Control ○ ○ No Predict No Data Forwarding
  • 5. Latency and Throughput Orientation CPU ALU GPU ALU Control ALU ALU Cache DRAM DRAM
  • 6. System Cost ● Hardware + Software Cost ● Software dominates after 2010 ● Reduce Software Cost = One on Many ○ Scalability ■ ○ Same Arch / New Hardware Offer: # of cores, pipeline depth, vector length Portability ■ Different Arch: x86, ARM ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
  • 7. Data Parallelism Manipulation of Data in Parallel e.g. Vector Addition A[0] A[1] A[2] A[3] B[0] B[1] B[2] B[3] + + + + C[0] C[1] C[2] C[3]
  • 8. Introduction to CUDA ➔ ➔ ➔ ➔ ➔ ➔ ➔ CUDA = Compute Unified Device Architecture Introduced by NVIDIA Distribute workload from a Host to CUDA capable Devices NVIDIA = GPU = Throughput Oriented = Best Parallel Use of GPU to compute as CPU = GPGPU GPGPU = General Purpose GPU Extend C / C++ / Fortran
  • 9. CUDA Thread Organization Block Block Block Block Block Grid ● Grid = [Vector~3D Matrix] of Blocks ○ Block = [Vector~3D Matrix] of Threads ■ Thread = One that computes Thread Thread Thread Thread
  • 10. CUDA Thread Organization Grid Dimension Declaration Declaration dim3 DimGrid(x,y,z); *var name can be others dim3 DimBlock(x,y,z); *var name can be others This Block dim3 DimGrid (2,1,1); dim3 DimBlock (256,1,1); Block Dimension This Thread Block 0 t0 Block 1 t1 t2 ... t255 t0 t1 t2 ... t255
  • 11. CUDA Memory Organization A Thread have its Private Registers Threads in a Block have common Shared Memory Blocks in a same Grid have common Global and Constant Memory Shared Thread Global, Constant Block Grid HOST But Host can only access Global and Constant Memory Register Register Register Register
  • 12. Memory Management Command Prototype typedef enum cudaError cudaError_t // Allocate Memory on Device cudaError_t cudaMalloc(void** devPtr, size_t size) enum cudaError // Copy Data 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. cudaSuccess cudaErrorMissingConfiguration cudaErrorMemoryAllocation cudaErrorInitializationError cudaErrorLaunchFailure cudaErrorPriorLaunchFailure cudaErrorLaunchTimeout cudaErrorLaunchOutOfResources cudaErrorInvalidDeviceFunction cudaErrorInvalidConfiguration cudaErrorInvalidDevice … … cudaError_t cudaMemcpy(void* dst, const void* src, size_t size, enum cudaMemcpyKind kind) // Free Memory on Device cudaError_t cudaFree(void* devPtr) enum cudaMemcpyKind 0. 1. 2. 3. 4. cudaMemcpyHostToHost cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice cudaMemcpyDefault For more information http://developer.download. nvidia. com/compute/cuda/4_1/rel/tool kit/docs/online/group__CUDA RT__MEMORY.html size - size in bytes
  • 13. Kernel Terminology for Function for Device to be called by Host Declared by adding attribute to Function Attribute Return Type Function Type Executed on Only Callable from __device__ any DeviceFunc() device device __global__ void KernelFunc() device host host host __host__ any HostFunc() This attribute is optional Starting Kernel Function by giving it Grid&Block Structure and Parameters KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …); Waiting for all thrown tasks to complete before move on cudaDeviceSynchronize();
  • 14. Row-Major Layout Way of addressing an element in an Array Multi-dimensional array can be addressed by 1D array C / C++ use Row-Major Layout A0,1 A0,2 A0,3 A1,0 A1,1 A1,2 A2,1 A2,2 A0,1 A0,2 A0,3 A1,0 A1,1 A1,2 A1,3 A2,0 A2,1 A2,2 A2,2 A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A1,3 A2,0 A0,0 A0 A0,0 A2,3 Fortran uses Col-Major Index
  • 15. Sample Code: Vector Addition __global__ void vecAdd(int *d_vIn1, int *d_vIn2, *d_vOut, int n) { int pos = blockIdx.x * blockDim.x + threadIdx.x; if (pos < n) d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos]; } … int main() { int vecLength = …; int* h_input1 = {…}; int* h_input2 = {…}; int* h_output = (int *) malloc(vecLength * sizeof(int)); int* d_input1, d_input2, d_output; cudaMalloc((void **) &d_input1, vecLength * sizeof(int)); cudaMalloc((void **) &d_input2, vecLength * sizeof(int)); cudaMalloc((void **) &d_output, vecLength * sizeof(int)); cudaMemcpy(d_input1,h_input1,vecLength*sizeof(int),cudaMemcpyHostToDevice); cudaMemcpy(d_input2,h_input2,vecLength*sizeof(int),cudaMemcpyHostToDevice); dim3 dimGrid((vecLength-1)/256+1,1,1); dim3 dimBlock(256,1,1); vecAdd<<<dimGrid,dimBlock>>>(d_input1,d_input2,d_output,vecLength); cudaDeviceSynchronize(); cudaMemcpy(h_output,d_output,vecLength*sizeof(int),cudaMemcpyDeviceToHost); cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output); return 0; }
  • 16. Error Checking Pattern cudaError_t err = cudaMalloc((void **)) &d_input1, size); if (err != cudaSuccess) { printf(“%s in %s at line %dn”, cudaGetErrorString(err), __FILE__, __LINE__); exit(EXIT_FAILURE); }