Enviar búsqueda
Cargar
Cuda tutorial
•
9 recomendaciones
•
4,455 vistas
M
Mahesh Khadatare
Seguir
Tecnología
Empresariales
Denunciar
Compartir
Denunciar
Compartir
1 de 60
Descargar ahora
Descargar para leer sin conexión
Recomendados
Introduction to CUDA
Introduction to CUDA
Raymond Tay
Cuda
Cuda
Amy Devadas
NVIDIA CUDA
NVIDIA CUDA
Jungsoo Nam
Cuda
Cuda
Mannu Malhotra
Cuda
Cuda
Gopi Saiteja
Introduction to GPU Programming
Introduction to GPU Programming
Chakkrit (Kla) Tantithamthavorn
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
Fatima Qayyum
GPU - Basic Working
GPU - Basic Working
Nived R Nambiar
Recomendados
Introduction to CUDA
Introduction to CUDA
Raymond Tay
Cuda
Cuda
Amy Devadas
NVIDIA CUDA
NVIDIA CUDA
Jungsoo Nam
Cuda
Cuda
Mannu Malhotra
Cuda
Cuda
Gopi Saiteja
Introduction to GPU Programming
Introduction to GPU Programming
Chakkrit (Kla) Tantithamthavorn
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
Fatima Qayyum
GPU - Basic Working
GPU - Basic Working
Nived R Nambiar
GPU Programming
GPU Programming
William Cunningham
Linux scheduler
Linux scheduler
Liran Ben Haim
eBPF maps 101
eBPF maps 101
SUSE Labs Taipei
Parallel computing with Gpu
Parallel computing with Gpu
Rohit Khatana
CUDA Architecture
CUDA Architecture
Dr Shashikant Athawale
GPU
GPU
Hamid Bluri
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Martin Peniak
Introduction to OpenCL
Introduction to OpenCL
Unai Lopez-Novoa
Gpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
Parallel Computing on the GPU
Parallel Computing on the GPU
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
Graphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
Saurabh Kumar
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
Saksham Tanwar
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
Jens Rühmkorf
Revisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIO
Hisaki Ohara
Gpu
Gpu
hashim102
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)
Akhila Dakshina
CUDA
CUDA
Rachel Miller
Introduction to eBPF and XDP
Introduction to eBPF and XDP
lcplcp1
Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
GPU - An Introduction
GPU - An Introduction
Dhan V Sagar
GPU: Understanding CUDA
GPU: Understanding CUDA
Joaquín Aparicio Ramos
Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2
Piyush Mittal
Más contenido relacionado
La actualidad más candente
GPU Programming
GPU Programming
William Cunningham
Linux scheduler
Linux scheduler
Liran Ben Haim
eBPF maps 101
eBPF maps 101
SUSE Labs Taipei
Parallel computing with Gpu
Parallel computing with Gpu
Rohit Khatana
CUDA Architecture
CUDA Architecture
Dr Shashikant Athawale
GPU
GPU
Hamid Bluri
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Martin Peniak
Introduction to OpenCL
Introduction to OpenCL
Unai Lopez-Novoa
Gpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
Parallel Computing on the GPU
Parallel Computing on the GPU
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
Graphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
Saurabh Kumar
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
Saksham Tanwar
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
Jens Rühmkorf
Revisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIO
Hisaki Ohara
Gpu
Gpu
hashim102
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)
Akhila Dakshina
CUDA
CUDA
Rachel Miller
Introduction to eBPF and XDP
Introduction to eBPF and XDP
lcplcp1
Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
GPU - An Introduction
GPU - An Introduction
Dhan V Sagar
La actualidad más candente
(20)
GPU Programming
GPU Programming
Linux scheduler
Linux scheduler
eBPF maps 101
eBPF maps 101
Parallel computing with Gpu
Parallel computing with Gpu
CUDA Architecture
CUDA Architecture
GPU
GPU
Introduction to parallel computing using CUDA
Introduction to parallel computing using CUDA
Introduction to OpenCL
Introduction to OpenCL
Gpu with cuda architecture
Gpu with cuda architecture
Parallel Computing on the GPU
Parallel Computing on the GPU
Graphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
Tech Talk NVIDIA CUDA
Tech Talk NVIDIA CUDA
Revisit DCA, PCIe TPH and DDIO
Revisit DCA, PCIe TPH and DDIO
Gpu
Gpu
10. GPU - Video Card (Display, Graphics, VGA)
10. GPU - Video Card (Display, Graphics, VGA)
CUDA
CUDA
Introduction to eBPF and XDP
Introduction to eBPF and XDP
Hadoop File system (HDFS)
Hadoop File system (HDFS)
GPU - An Introduction
GPU - An Introduction
Destacado
GPU: Understanding CUDA
GPU: Understanding CUDA
Joaquín Aparicio Ramos
Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2
Piyush Mittal
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
npinto
HPC Essentials 0
HPC Essentials 0
William Brouwer
GPU performance analysis
GPU performance analysis
Pedram Mazloom
Google Now Marketing
Google Now Marketing
Gil Reich
Siri Vs Google Now
Siri Vs Google Now
MeasurementMarketing.io
Design and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processor
Najeeb Ahmad
Hands on OpenCL
Hands on OpenCL
Vladimir Starostenkov
Tesla personal super computer
Tesla personal super computer
Priya Manik
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Hsien-Hsin Sean Lee, Ph.D.
Google glass ppt
Google glass ppt
Nidhin P Koshy
Destacado
(12)
GPU: Understanding CUDA
GPU: Understanding CUDA
Nvidia cuda programming_guide_0.8.2
Nvidia cuda programming_guide_0.8.2
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
HPC Essentials 0
HPC Essentials 0
GPU performance analysis
GPU performance analysis
Google Now Marketing
Google Now Marketing
Siri Vs Google Now
Siri Vs Google Now
Design and implementation of GPU-based SAR image processor
Design and implementation of GPU-based SAR image processor
Hands on OpenCL
Hands on OpenCL
Tesla personal super computer
Tesla personal super computer
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Google glass ppt
Google glass ppt
Similar a Cuda tutorial
Gpu archi
Gpu archi
Piyush Mittal
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentation
xKinAnx
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Jeff Larkin
HP - HPC-29mai2012
HP - HPC-29mai2012
Agora Group
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
Shinya Takamaeda-Y
Exaflop In 2018 Hardware
Exaflop In 2018 Hardware
Jacob Wu
PG-Strom
PG-Strom
Kohei KaiGai
MSI N480GTX Lightning Infokit
MSI N480GTX Lightning Infokit
MSI
Ibm power7
Ibm power7
Tom Presotto
Pgopencl
Pgopencl
Tim Child
PostgreSQL with OpenCL
PostgreSQL with OpenCL
Muhaza Liebenlito
Qnap nas TS 1679 introduction_info tech Middle east
Qnap nas TS 1679 introduction_info tech Middle east
Ali Shoaee
Qnap nas ts 1679 introduction-02
Qnap nas ts 1679 introduction-02
CarrierDigit
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
Kohei KaiGai
Modeling System Behaviors: A Better Paradigm on Prototyping
Modeling System Behaviors: A Better Paradigm on Prototyping
DVClub
Trend - HPC-29mai2012
Trend - HPC-29mai2012
Agora Group
OFC/NFOEC: GPU-based Parallelization of System Modeling
OFC/NFOEC: GPU-based Parallelization of System Modeling
ADVA
Sun Microsystems
Sun Microsystems
guest09c59b06
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architectures
inside-BigData.com
Fpga technology
Fpga technology
Naren Sridhar
Similar a Cuda tutorial
(20)
Gpu archi
Gpu archi
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentation
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
HP - HPC-29mai2012
HP - HPC-29mai2012
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
ScalableCore System: A Scalable Many-core Simulator by Employing Over 100 FPGAs
Exaflop In 2018 Hardware
Exaflop In 2018 Hardware
PG-Strom
PG-Strom
MSI N480GTX Lightning Infokit
MSI N480GTX Lightning Infokit
Ibm power7
Ibm power7
Pgopencl
Pgopencl
PostgreSQL with OpenCL
PostgreSQL with OpenCL
Qnap nas TS 1679 introduction_info tech Middle east
Qnap nas TS 1679 introduction_info tech Middle east
Qnap nas ts 1679 introduction-02
Qnap nas ts 1679 introduction-02
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
Modeling System Behaviors: A Better Paradigm on Prototyping
Modeling System Behaviors: A Better Paradigm on Prototyping
Trend - HPC-29mai2012
Trend - HPC-29mai2012
OFC/NFOEC: GPU-based Parallelization of System Modeling
OFC/NFOEC: GPU-based Parallelization of System Modeling
Sun Microsystems
Sun Microsystems
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architectures
Fpga technology
Fpga technology
Último
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
null - The Open Security Community
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Zilliz
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
charlottematthew16
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
RankYa
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
Dilum Bandara
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
Alan Dix
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
BookNet Canada
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
2toLead Limited
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
Manik S Magar
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Fwdays
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
Alfredo García Lavilla
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Mark Simos
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
Sergiu Bodiu
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
charlottematthew16
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
Mattias Andersson
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
Fwdays
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
Lorenzo Miniero
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
ScyllaDB
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
Slibray Presentation
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Sri Ambati
Último
(20)
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Cuda tutorial
1.
HPC COMPUTING WITH CUDA
AND TESLA HARDWARE Timothy Lanfear, NVIDIA
2.
WHAT IS GPU
COMPUTING? © NVIDIA Corporation 2009
3.
What is GPU
Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing © NVIDIA Corporation 2009
4.
Low Latency or
High Throughput? ALU ALU CPU Control ALU ALU Optimised for low-latency access to cached data sets Cache Control logic for out-of-order and speculative execution DRAM GPU Optimised for data-parallel, throughput computation Architecture tolerant of memory latency More transistors dedicated to DRAM computation © NVIDIA Corporation 2009
5.
Tesla C1060 Computing
Processor Processor 1 x Tesla T10 Number of cores 240 Core Clock 1.296 GHz 933 GFlops Single Precision Floating Point Performance 78 GFlops Double Precision On-board memory 4.0 GB Memory bandwidth 102 GB/sec peak Memory I/O 512-bit, 800MHz GDDR3 512- Full ATX: 4.736” x 10.5” Form factor Dual slot wide System I/O PCIe x16 Gen2 Typical power 160 W © NVIDIA Corporation 2009
6.
Tesla Streaming Multiprocessor
(SM) SM has 8 Thread Processors (SP) IEEE 754 32-bit floating point 32-bit and 64-bit integer 16K 32-bit registers SM has 2 Special Function Units (SFU) SM has 1 Double Precision Unit (DP) IEEE 754 64-bit floating point Fused multiply-add Multithreaded Instruction Unit 1024 threads, hardware multithreaded 32 single-instruction, multi-thread warps of 32 threads Independent thread execution Hardware thread scheduling 16KB Shared Memory Concurrent threads share data Low latency load/store © NVIDIA Corporation 2009
7.
Computing with Tesla
240 SP processors at 1.5 GHz: 1 TFLOPS peak 128 threads per processor: 30,720 threads total Tesla PCI-e board: C1060 (1 GPU) SM I-Cache 1U Server: S1070 (4 GPUs) MT Issue C-Cache SP SP Host CPU Bridge System Memory Work Distribution Tesla T10 SP SP SP SP SP SP SFU SFU Interconnection Network DP ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 Shared DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM Memory © NVIDIA Corporation 2009
8.
CUDA ARCHITECTURE © NVIDIA
Corporation 2009
9.
CUDA Parallel Computing
Architecture Parallel computing architecture and programming model Includes a C compiler plus support for OpenCL and DirectCompute Architected to natively support multiple computational interfaces (standard languages and APIs) © NVIDIA Corporation 2009
10.
NVIDIA CUDA C
and OpenCL CUDA C Entry point for developers who prefer high-level C Entry point for developers who want low-level API OpenCL Shared back-end compiler and optimization technology PTX GPU © NVIDIA Corporation 2009
11.
CUDA PROGRAMMING MODEL ©
NVIDIA Corporation 2009
12.
GPU is a
Multi-threaded Coprocessor The GPU is a compute device that is: A coprocessor to the CPU or host Has its own memory space Runs many threads in parallel Data parallel portions of the application are executed on the device as multi-threaded kernels GPU threads are very lightweight; thousands are needed for full efficiency © NVIDIA Corporation 2009
13.
Heterogeneous Execution
Serial Code (CPU) Parallel Kernel (GPU) ... KernelA<<< nBlk, nTid >>>(args); Serial Code (CPU) Parallel Kernel (GPU) KernelB<<< nBlk, nTid >>>(args); ... © NVIDIA Corporation 2009
14.
Processing Flow
Work Distribution GPU Geometry Controller Geometry Controller SMC SMC Host CPU CPU I-Cache I-Cache I-Cache I-Cache MT Issue MT Issue MT Issue MT Issue C-Cache C-Cache C-Cache C-Cache Bridge PCI Bus SP SP SP SP SP SP SP SP CPU Memory SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP 1. Copy input data from CPU memory SFU SFU SFU SFU SFU SFU SFU SFU Shared Shared Shared Shared to GPU memory Memory Memory Memory Memory 2. Load GPU program and execute, Texture Unit Texture Unit Tex L1 Tex L1 caching data on chip for performance Interconnection Network 3. Copy results from GPU memory to CPU memory ROP L2 ROP L2 © NVIDIA Corporation 2009 DRAM DRAM
15.
Execution Model
Host Device A kernel is executed by a grid, which contain blocks Grid 1 Block Block Block Kernel (0, 0) (1, 0) (2, 0) These blocks contain threads 1 Block Block Block (0, 1) (1, 1) (2, 1) A thread block is a batch of threads that can cooperate Grid 2 Sharing data through shared memory Kernel Synchronizing their execution 2 Threads from different blocks operate Block (1, 1) independently Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) © NVIDIA Corporation 2009
16.
Thread Blocks: Scalable
Cooperation Divide monolithic thread array into multiple blocks Threads within a block cooperate via shared memory Threads in different blocks cannot cooperate Enables programs to transparently scale to any number of processors Thread Block 0 Thread Block 1 Thread Block N - 1 threadID 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 … … … float x = input[threadID]; float x = input[threadID]; float x = input[threadID]; float y = func(x); output[threadID] = y; … float y = func(x); output[threadID] = y; … … float y = func(x); output[threadID] = y; … © NVIDIA Corporation 2009
17.
Transparent Scalability
Hardware is free to schedule thread blocks on any processor Kernel grid Device Device Block 0 Block 1 Block 2 Block 3 Block 4 Block 5 Block 0 Block 1 Block 0 Block 1 Block 2 Block 3 Block 6 Block 7 Block 2 Block 3 Block 4 Block 5 Block 6 Block 7 Block 4 Block 5 Block 6 Block 7 © NVIDIA Corporation 2009
18.
Reason for blocks:
GPU scalability G80: 128 Cores G84: 32 Cores GT200: 240 SP Cores © NVIDIA Corporation 2009
19.
Hierarchy of Threads
Thread Thread Computes result elements Thread id number Thread Block Block Runs on one SM, shared mem t0 t1 t2 … tm 1 to 512 threads per block Block id number Grid of Blocks Holds complete computation task One to many blocks per Grid Sequential Grids Compute sequential problem steps Grid Bl. 0 Bl. 1 Bl. 2 Bl. n ... © NVIDIA Corporation 2009
20.
CUDA Memory
Block Thread Shared Local Memory Memory Local barrier Grid 0 Global Sequential barrier .. Global Grids . Grid 1 Memory in Time ... © NVIDIA Corporation 2009
21.
Memory Spaces
Memory Location Cached Access Scope Lifetime Register On-chip N/A R/W One thread Thread Local Off-chip No R/W One thread Thread Shared On-chip N/A R/W All threads in a block Block Global Off-chip No R/W All threads + host Application Constant Off-chip Yes R All threads + host Application Texture Off-chip Yes R All threads + host Application © NVIDIA Corporation 2009
22.
Compiling CUDA C
CUDA C Combined CPU-GPU Code Application NVCC CUDA C Rest of C Kernels Application CUDACC CPU Compiler CUDA object CPU object files files Linker CPU-GPU © NVIDIA Corporation 2009 Executable
23.
Compilation Commands
nvcc <filename>.cu [-o <executable>] Builds release mode nvcc –g <filename>.cu Builds debug (device) mode Can debug host code but not device code (runs on GPU) nvcc –deviceemu <filename>.cu Builds device emulation mode All code runs on CPU, but no debug symbols nvcc –deviceemu –g <filename>.cu Builds debug device emulation mode All code runs on CPU, with debug symbols Debug using gdb or other linux debugger © NVIDIA Corporation 2009
24.
Exercise 0: Run
a Simple Program CUDA Device device (Runtime API) version (CUDART static linking) There is 1 Query supporting CUDA There are 2 devices supporting CUDA Log on to test system Device 0: "Quadro FX 570M" Device 0:revision C1060" Major "Tesla number: 1 Compile and run pre-written CUDA Minor Capabilitynumber: revision number: CUDA revision Major Total Capability global revision number: CUDA amount of Minor memory: 1 1 3 268107776 bytes program — deviceQuery Number amount of global memory: Total of multiprocessors: Number of multiprocessors: 4294705152 bytes 4 30 Number of cores: 32 Number of cores: Total amount of constant memory: 240 65536 bytes Total amount of shared memory per block: Total amount of constant memory: 16384 bytes 65536 bytes Total number of registers available per block: 8192 bytes Total amount of shared memory per block: 16384 Totalsize: Warp number of registers available per block: 32 16384 Warp size: Maximum number of threads per block: 32 512 Maximum sizes of each dimension of a block: Maximum number of threads per block: 512 x 512 x 64 512 Maximum sizes of each dimension of a block: Maximum sizes of each dimension of a grid: 512 x 512 x 64 65535 x 65535 x 1 Maximum memory of each dimension of a grid: Maximum sizes pitch: 262144 xbytes x 1 65535 65535 Maximum memory pitch: Texture alignment: 262144 bytes 256 bytes Texture alignment: Clock rate: 0.95 bytes 256 GHz Clock rate: Concurrent copy and execution: Yes GHz 1.44 Concurrent copy and execution: Yes Test PASSED limit on kernels: Run time No Integrated: No Support host exit... Press ENTER to page-locked memory mapping: Yes Compute mode: Exclusive (only one host thread at a time can use this device) © NVIDIA Corporation 2009
25.
CUDA C PROGRAMMING
LANGUAGE © NVIDIA Corporation 2009
26.
CUDA C —
C with Runtime Extensions Device management: cudaGetDeviceCount(), cudaGetDeviceProperties() Device memory management: cudaMalloc(), cudaFree(), cudaMemcpy() Texture management: cudaBindTexture(), cudaBindTextureToArray() Graphics interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer() © NVIDIA Corporation 2009
27.
CUDA C —
C with Language Extensions Function qualifiers __global__ void MyKernel() {} // call from host, execute on GPU __device__ float MyDeviceFunc() {} // call from GPU, execute on GPU __host__ int HostFunc() {} // call from host, execute on host Variable qualifiers __device__ float MyGPUArray[32]; // in GPU memory space __constant__ float MyConstArray[32]; // write by host; read by GPU __shared__ float MySharedArray[32]; // shared within thread block Built-in vector types int1, int2, int3, int4 float1, float2, float3, float4 double1, double2 etc. © NVIDIA Corporation 2009
28.
CUDA C —
C with Language Extensions Execution configuration dim3 dimGrid(100, 50); // 5000 thread blocks dim3 dimBlock(4, 8, 8); // 256 threads per block MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel Built-in variables and functions valid in device code: dim3 gridDim; // Grid dimension dim3 blockDim; // Block dimension dim3 blockIdx; // Block index dim3 threadIdx; // Thread index void __syncthreads(); // Thread synchronization © NVIDIA Corporation 2009
29.
SAXPY: Device Code
void saxpy_serial(int n, float a, float *x, float *y) { for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; Standard C Code } __global__ void saxpy_parallel(int n, float a, float *x, float *y) { blockIdx.x*blockDim.x .x*blockDim threadIdx.x; int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = a*x[i] + y[i]; } Parallel C Code © NVIDIA Corporation 2009
30.
SAXPY: Host Code
// Allocate two N-vectors h_x and h_y int size = N * sizeof(float); float* h_x = (float*)malloc(size); float* h_y = (float*)malloc(size); // Initialize them... // Allocate device memory float* d_x; float* d_y; cudaMalloc((void**)&d_x, size)); cudaMalloc((void**)&d_y, size)); // Copy host memory to device memory cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice); cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice); // Invoke parallel SAXPY kernel with 256 threads/block int nblocks = (n + 255) / 256; saxpy_parallel<<<nblocks, 256>>>(N, 2.0, d_x, d_y); // Copy result back from device memory to host memory cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost); © NVIDIA Corporation 2009
31.
Exercise 1: Move
Data between Host and GPU Start from the “cudaMallocAndMemcpy” template. Part 1: Allocate memory for pointers d_a and d_b on the device. Part 2: Copy h_a on the host to d_a on the device. Part 3: Do a device to device copy from d_a to d_b. Part 4: Copy d_b on the device back to h_a on the host. Part 5: Free d_a and d_b on the host. Bonus: Experiment with cudaMallocHost in place of malloc for allocating h_a. © NVIDIA Corporation 2009
32.
Launching a Kernel
Host Device Call a kernel with Grid 1 Func <<<Dg,Db,Ns,S>>> (params); Block Block Block Kernel (0, 0) (1, 0) (2, 0) dim3 Dg(mx,my,1); // grid spec dim3 Db(nx,ny,nz); // block spec Block Block Block (0, 1) (1, 1) (2, 1) size_t Ns; // shared memory cudaStream_t S; // CUDA stream Execution configuration is passed to kernel with built-in variables dim3 gridDim, blockDim, blockIdx, threadIdx; Block (1, 1) Extract components with Thread Thread Thread Thread Thread threadIdx.x, threadIdx.y, (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) threadIdx.z, etc. Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) © NVIDIA Corporation 2009
33.
Exercise 2: Launching
Kernels Start from the “myFirstKernel” template. Part1: Allocate device memory for the result of the kernel using pointer d_a. Part2: Configure and launch the kernel using a 1-D grid of 1-D thread blocks. Part3: Have each thread set an element of d_a as follows: idx = blockIdx.x*blockDim.x + threadIdx.x d_a[idx] = 1000*blockIdx.x + threadIdx.x Part4: Copy the result in d_a back to the host pointer h_a. Part5: Verify that the result is correct. © NVIDIA Corporation 2009
34.
Exercise 3: Reverse
Array, Single Block Given an input array {a0, a1, …, an-1} in pointer d_a, store the reversed array {an-1, an-2, …, a0} in pointer d_b Start from the “reverseArray_singleblock” template Only one thread block launched, to reverse an array of size N = numThreads = 256 elements Part 1 (of 1): All you have to do is implement the body of the kernel “reverseArrayBlock()” Each thread moves a single element to reversed position Read input from d_a pointer Store output in reversed location in d_b pointer © NVIDIA Corporation 2009
35.
Exercise 4: Reverse
Array, Multi-Block Given an input array {a0, a1, …, an-1} in pointer d_a, store the reversed array {an-1, an-2, …, a0} in pointer d_b Start from the “reverseArray_multiblock” template Multiple 256-thread blocks launched To reverse an array of size N, N/256 blocks Part 1: Compute the number of blocks to launch Part 2: Implement the kernel reverseArrayBlock() Note that now you must compute both The reversed location within the block The reversed offset to the start of the block © NVIDIA Corporation 2009
36.
PERFORMANCE CONSIDERATIONS © NVIDIA
Corporation 2009
37.
Single-Instruction, Multiple-Thread Execution
Warp: set of 32 parallel threads that execute together in single-instruction, multiple-thread mode (SIMT) on a streaming multiprocessor (SM) SM hardware implements zero-overhead warp and thread scheduling Threads can execute independently SIMT warp diverges and converges when threads branch independently Best efficiency and performance when threads of a warp execute together, so no penalty if all threads in a warp take same path of execution Each SM executes up to 1024 concurrent threads, as 32 SIMT warps of 32 threads © NVIDIA Corporation 2009
38.
Global Memory
Off-chip global memory is not cached SM I-Cache MT Issue C-Cache SP SP Host CPU Bridge System Memory Work Distribution Tesla T10 SP SP SP SP SP SP SFU SFU Interconnection Network DP ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 ROP L2 Shared DRAM DRAM DRAM DRAM DRAM DRAM DRAM DRAM Memory © NVIDIA Corporation 2009
39.
Efficient Access to
Global Memory Single memory transaction (coalescing) for some memory addressing patterns 128 bytes global memory Linear pattern Not all need participate Anywhere in block OK 16 threads (half-warp) © NVIDIA Corporation 2009
40.
Shared Memory
More than 1 Tbyte/sec aggregate memory bandwidth Use it As a cache To reorganize global memory accesses into coalesced pattern To share data between threads 16 kbytes per SM © NVIDIA Corporation 2009
41.
Shared Memory Bank
Conflicts Successive 32-bit words assigned to different banks Thread 0 Bank 0 Simultaneous access to the Thread 1 Bank 1 same bank by threads in a half- Thread 2 Bank 2 warp causes conflict and Thread 3 Bank 3 serializes access Thread 4 Bank 4 Thread 5 Bank 5 Linear access pattern Thread 6 Bank 6 Permutation Thread 7 Bank 7 Broadcast (from one address) Conflict, stride 8 Thread 15 Bank 15 © NVIDIA Corporation 2009
42.
Matrix Transpose
Access columns of a tile in shared memory to write contiguous data to global memory Requires __syncthreads() since threads write data read by other threads Pad shared memory array to avoid bank conflicts idata odata tile © NVIDIA Corporation 2009
43.
Matrix Transpose
There are further optimisations: see the New Matrix Transpose SDK example. © NVIDIA Corporation 2009
44.
OTHER GPU MEMORIES ©
NVIDIA Corporation 2009
45.
Texture Memory
Texture is an object for reading data Data is cached Host actions Allocate memory on GPU Create a texture memory reference object Bind the texture object to memory Clean up after use GPU actions Fetch using texture references text1Dfetch(), tex1D(), tex2D(), tex3D() © NVIDIA Corporation 2009
46.
Constant Memory
Write by host, read by GPU Data is cached Useful for tables of constants © NVIDIA Corporation 2009
47.
EXECUTION CONFIGURATION © NVIDIA
Corporation 2009
48.
Execution Configuration
vectorAdd <<< BLOCKS, THREADS_PER_BLOCK >>> (N, 2.0, d_x, d_y); How many blocks? At least one block per SM to keep every SM occupied At least two blocks per SM so something can run if block is waiting for a synchronization to complete Many blocks for scalability to larger and future GPUs How many threads? At least 192 threads per SM to hide read after write latency of 11 cycles (not necessarily in same block) Use many threads to hide global memory latency x = y + 5; Too many threads exhausts registers and shared memory Thread count a multiple of warp size z = x + 3; Typically, between 64 and 256 threads per block © NVIDIA Corporation 2009
49.
Occupancy Calculator
blocks per SM × threads per block occupancy = maximum threads per SM Occupancy calculator shows trade-offs between thread count, register use, shared memory use Low occupancy is bad Increasing occupancy doesn’t always help © NVIDIA Corporation 2009
50.
DEBUGGING AND PROFILING ©
NVIDIA Corporation 2009
51.
Debugging
nvcc flags –debug (-g) Generate debug information for host code --device-debug <level> (-G <level>) Generate debug information for device code, plus also specify the optimisation level for the device code in order to control its ‘debuggability’. Allowed values for this option: 0,1,2,3 Debug with cuda-gdb a.out Usual gdb commands available © NVIDIA Corporation 2009
52.
Debugging
Additional commands in cuda-gdb thread — Display the current host and CUDA thread of focus. thread <<<(TX,TY,TZ)>>> — Switch to the CUDA thread at specified coordinates thread <<<(BX,BY),(TX,TY,TZ)>>> — Switch to the CUDA block and thread at specified coordinates info cuda threads — Display a summary of all CUDA threads that are currently resident on the GPU info cuda threads all — Display a list of each CUDA thread that is currently resident on the GPU info cuda state — Display information about the current CUDA state. next and step advance all threads in a warp, except at _syncthreads() where all warps continue to an implicit barrier following sync © NVIDIA Corporation 2009
53.
CUDA Visual Profiler
cudaprof Documentation in $CUDA/cudaprof/doc/cudaprof.html © NVIDIA Corporation 2009
54.
CUDA Visual Profiler
Open a new project Select session settings through dialogue Execute CUDA program by clicking Start button Various views of collected data available Results of different runs stored in sessions for easy comparison Project can be saved © NVIDIA Corporation 2009
55.
MISCELLANEOUS TOPICS © NVIDIA
Corporation 2009
56.
Expensive Operations
32-bit multiply; __mul24() and __umul24() are fast 24-bit multiplies sin(), exp() etc.; faster, less accurate versions are __sin(), __exp() etc. Integer division and modulo; avoid if possible; replace with bit shift operations for powers of 2 Branching where threads of warp take differing paths of control flow © NVIDIA Corporation 2009
57.
Host to GPU
Data Transfers PCI Express Gen2, 8 Gbytes/sec peak Use page-locked (pinned) memory for maximum bandwidth between GPU and host Data transfer host-GPU and GPU-host can overlap with computation both on host and GPU © NVIDIA Corporation 2009
58.
Application Software
(written in C) CUDA Libraries cuFFT cuBLAS cuDPP CPU Hardware CUDA Compiler CUDA Tools 1U PCI-E Switch C Fortran Debugger Profiler © NVIDIA Corporation 2009 4 cores 240 cores
59.
On-line Course
Programming Massively Parallel Processors, Wen-Mei Hwu, University of Illinois at Urbana-Champaign http://courses.ece.illinois.edu/ece498/al/ PowerPoint slides, MP3 recordings of lectures, draft of textbook by Wen-Mei Hwu and David Kirk (NVIDIA) © NVIDIA Corporation 2009
60.
CUDA Zone: www.nvidia.com/CUDA
CUDA Toolkit Compiler Libraries CUDA SDK Code samples CUDA Profiler Forums Resources for CUDA developers © NVIDIA Corporation 2009
Descargar ahora