Optimizing Performance
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011

Instructor Notes
This lecture discusses three important optimizations:
- The performance impact of mapping threads to data on the GPU is subtle but extremely important. Examples are shown (including a detailed matrix transpose) along with actual empirical results.
- The number of threads that are active on the GPU can also play a large part in achieving good performance, so the subtleties of GPU occupancy are discussed.
- Vectorization is particularly important for AMD GPUs and is briefly discussed as well.

Topics
- Thread mapping
  - Choosing a proper mapping
  - Optimizing with local memory
- Device occupancy
- Vectorization

Thread Mapping
- Thread mapping determines which threads will access which data.
  - Proper mappings can align with hardware and provide large performance benefits.
  - Improper mappings can be disastrous to performance.
- The paper "Static Memory Access Pattern Analysis on a Massively Parallel GPU" by Jang et al. focuses on the task of effectively mapping threads to the data access patterns of an algorithm.

Thread Mapping
- By using different mappings, the same thread can be assigned to access different data elements.
- The examples below show three different possible mappings of threads to data (assuming the thread ID is used to access an element):

  Group-major (one contiguous block of IDs per work group):

      int group_size = get_local_size(0) * get_local_size(1);
      int tid = get_group_id(1) * get_num_groups(0) * group_size +
                get_group_id(0) * group_size +
                get_local_id(1) * get_local_size(0) +
                get_local_id(0);

  Row-major:

      int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

  Column-major:

      int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

- Resulting thread IDs over a 4x4 data set:

      Group-major*       Row-major          Column-major
       0  1  4  5         0  1  2  3         0  4  8 12
       2  3  6  7         4  5  6  7         1  5  9 13
       8  9 12 13         8  9 10 11         2  6 10 14
      10 11 14 15        12 13 14 15         3  7 11 15

      *assuming 2x2 groups

Thread Mapping
- Consider a serial matrix multiplication algorithm (see the sketch below).
- This algorithm is suited for output data decomposition:
  - We will create N×M threads, effectively removing the outer two loops.
  - Each thread will perform P calculations; the inner loop will remain as part of the kernel.
- Should the index space be MxN or NxM?

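The serial code itself did not survive the slide export. A minimal sketch of the algorithm being described, assuming row-major storage and the dimension names used on these slides (A is MxP, B is PxN, C is MxN):

    // Serial matrix multiplication: C = A * B (all matrices row-major).
    void matmul_serial(const float *A, const float *B, float *C,
                       int M, int N, int P)
    {
        for (int i1 = 0; i1 < M; i1++) {          // rows of C (outer loop)
            for (int i2 = 0; i2 < N; i2++) {      // columns of C (middle loop)
                float sum = 0.0f;
                for (int i3 = 0; i3 < P; i3++)    // dot product (inner loop)
                    sum += A[i1 * P + i3] * B[i3 * N + i2];
                C[i1 * N + i2] = sum;
            }
        }
    }

The output decomposition removes the two outer loops (one thread per element of C); the i3 loop stays in the kernel.
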
Thread Mapping
- Thread mapping 1: with an MxN index space, the kernel would be as in the first sketch below.
- Thread mapping 2: with an NxM index space, the kernel would be as in the second sketch below.
- Both mappings produce functionally equivalent versions of the program.

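The kernel listings did not survive the export either. Minimal sketches consistent with the mapping descriptions on the next few slides (argument names and the exact indexing are assumptions):

    // Mapping 1: MxN index space. Dimension 0 (tx) selects the row of C,
    // so consecutive threads touch different rows -- strided accesses.
    __kernel void matmul_mapping1(__global const float *A,
                                  __global const float *B,
                                  __global float *C, int N, int P)
    {
        int tx = get_global_id(0);   // row of C
        int ty = get_global_id(1);   // column of C
        float sum = 0.0f;
        for (int i3 = 0; i3 < P; i3++)
            sum += A[tx * P + i3] * B[i3 * N + ty];
        C[tx * N + ty] = sum;
    }

    // Mapping 2: NxM index space. Dimension 0 (tx) selects the column of C,
    // so consecutive threads access consecutive addresses in B and C.
    __kernel void matmul_mapping2(__global const float *A,
                                  __global const float *B,
                                  __global float *C, int N, int P)
    {
        int tx = get_global_id(0);   // column of C
        int ty = get_global_id(1);   // row of C
        float sum = 0.0f;
        for (int i3 = 0; i3 < P; i3++)
            sum += A[ty * P + i3] * B[i3 * N + tx];
        C[ty * N + tx] = sum;
    }
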
Thread Mapping
- This figure shows the execution of the two thread mappings on NVIDIA GeForce 285 and 8800 GPUs.
- Notice that mapping 2 is far superior in performance for both GPUs.

Thread Mapping
- The discrepancy in execution times between the mappings is due to data accesses on the global memory bus.
  - Assuming row-major data, data in a row (i.e., elements in adjacent columns) are stored sequentially in memory.
  - To ensure coalesced accesses, consecutive threads in the same wavefront should be mapped to columns (the second dimension) of the matrices. This gives coalesced accesses in Matrices B and C.
  - For Matrix A, the iterator i3 determines the access pattern for row-major data, so thread mapping does not affect it.

Thread Mapping
- In mapping 1, consecutive threads (tx) are mapped to different rows of Matrix C, and non-consecutive threads (ty) are mapped to columns of Matrix B.
- This mapping causes inefficient memory accesses.

Thread Mapping
- In mapping 2, consecutive threads (tx) are mapped to consecutive elements in Matrices B and C.
- Accesses to both of these matrices will be coalesced. The degree of coalescence depends on the workgroup and data sizes.

Thread Mapping
- In general, threads can be created and mapped to any data element by manipulating the values returned by the thread identifier functions.
- The following matrix transpose example shows how thread IDs can be modified to achieve efficient memory accesses.

Matrix Transpose
- A matrix transpose is a straightforward technique: Out(x,y) = In(y,x).
- No matter which thread mapping is chosen, one operation (read or write) will produce coalesced accesses while the other (write or read) produces uncoalesced accesses (a sketch follows).
- Note that data must be read to a temporary location (such as a register) before being written to a new location.
- [Figure: two mappings of threads 0-3 over In and Out; in one the reads are coalesced and the writes are uncoalesced, in the other the reverse.]

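A minimal sketch of a naive transpose kernel, using the mapping in which reads are coalesced and writes are not (kernel and argument names are assumptions; the slide shows only the figure):

    // In is height x width, Out is width x height, both row-major.
    __kernel void transpose_naive(__global const float *In,
                                  __global float *Out,
                                  int width, int height)
    {
        int x = get_global_id(0);        // column of In
        int y = get_global_id(1);        // row of In
        float tmp = In[y * width + x];   // coalesced: consecutive x -> consecutive addresses
        Out[x * height + y] = tmp;       // uncoalesced: consecutive x stride by 'height'
    }
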
Matrix Transpose
- If local memory is used to buffer the data between reading and writing, we can rearrange the thread mapping to provide coalesced accesses in both directions (a sketch follows).
- Note that the work group must be square.
- [Figure: threads 0-3 read In with coalesced accesses (global memory indices 0,1,2,3) into local memory, then use remapped local memory indices (0,4,8,12) so that writes to Out are also coalesced.]

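A minimal sketch of the local-memory version under the same assumptions, plus a fixed square tile size that is assumed to divide the matrix dimensions; the +1 padding is a common refinement to avoid local memory bank conflicts and is not shown on the slide:

    #define TILE 16   // work group is TILE x TILE (square, as required)

    __kernel void transpose_local(__global const float *In,
                                  __global float *Out,
                                  int width, int height)
    {
        __local float tile[TILE][TILE + 1];

        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_group_id(0) * TILE + lx;   // column of In
        int gy = get_group_id(1) * TILE + ly;   // row of In

        // Coalesced read: consecutive threads (lx) read consecutive columns.
        tile[ly][lx] = In[gy * width + gx];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Remapped indices: this work group writes the transposed tile, and
        // consecutive threads (lx) again write consecutive addresses.
        int ox = get_group_id(1) * TILE + lx;   // column of Out
        int oy = get_group_id(0) * TILE + ly;   // row of Out
        Out[oy * height + ox] = tile[lx][ly];
    }
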
Matrix Transpose
- The following figure shows a performance comparison of the two transpose kernels for matrices of size NxM on an AMD 5870 GPU.
- "Optimized" uses local memory and thread remapping.

Occupancy
- On current GPUs, work groups get mapped to compute units.
  - When a work group is mapped to a compute unit, it cannot be swapped off until all of its threads complete their execution.
- If there are enough resources available, multiple work groups can be mapped to the same compute unit at the same time, so wavefronts from another work group can be swapped in to hide latency.
- Resources are fixed per compute unit (number of registers, local memory size, maximum number of threads), and any one of these resource constraints may limit the number of work groups on a compute unit.
- The term occupancy is used to describe how well the resources of the compute unit are being utilized.

Occupancy – Registers
- The availability of registers is one of the major limiting factors for larger kernels.
- The maximum number of registers required by a kernel must be available for all threads of a workgroup.
- Example: consider a GPU with 16384 registers per compute unit running a kernel that requires 35 registers per thread. Each compute unit can execute at most 468 threads (the arithmetic is spelled out below).
- This affects the choice of workgroup size:
  - A workgroup of 512 is not possible.
  - Only 1 workgroup of 256 threads is allowed at a time, even though 212 more threads could be running.
  - 3 workgroups of 128 threads are allowed, providing 384 threads to be scheduled, etc.

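The arithmetic behind those numbers, using only the figures from the slide:

    16384 registers / 35 registers per thread = 468 threads (rounded down)
    1 x 256 = 256 <= 468, but 2 x 256 = 512 > 468  (468 - 256 = 212 threads of headroom)
    3 x 128 = 384 <= 468, but 4 x 128 = 512 > 468
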
Occupancy – Registers
- Consider another example:
  - A GPU has 16384 registers per compute unit.
  - The work group size of a kernel is fixed at 256 threads.
  - The kernel currently requires 17 registers per thread.
- Given this information, each work group requires 256 x 17 = 4352 registers.
  - This allows for 3 active work groups if registers are the only limiting factor (16384 / 4352 = 3, rounded down).
- If the code can be restructured to use only 16 registers, then 4 active work groups would be possible (16384 / (256 x 16) = 4).

Occupancy – Local Memory
- GPUs have a limited amount of local memory on each compute unit:
  - 32KB of local memory on AMD GPUs
  - 32-48KB of local memory on NVIDIA GPUs
- Local memory limits the number of active work groups per compute unit (a runtime query sketch follows).
- Depending on the kernel, the data per workgroup may be fixed regardless of the number of threads (e.g., histograms), or may vary based on the number of threads (e.g., matrix multiplication, convolution).

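One way to check this limit in practice is to ask the OpenCL runtime how much local memory a compiled kernel uses; a minimal host-side sketch using standard OpenCL queries (the creation of 'kernel' and 'device' is assumed to have happened earlier):

    cl_ulong kernel_local, device_local;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(cl_ulong), &kernel_local, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(cl_ulong), &device_local, NULL);

    // Upper bound on resident work groups imposed by local memory alone.
    if (kernel_local > 0)
        printf("at most %llu work groups per compute unit\n",
               (unsigned long long)(device_local / kernel_local));
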
Occupancy – Threads
- GPUs have hardware limitations on the maximum number of threads per work group:
  - 256 threads per work group on AMD GPUs
  - 512 threads per work group on NVIDIA GPUs
- NVIDIA GPUs have per-compute-unit limits on the number of active threads and work groups (depending on the GPU model):
  - 768 or 1024 threads per compute unit
  - 24 or 32 warps per compute unit (768 and 1024 threads, respectively, at 32 threads per warp)
- AMD GPUs have GPU-wide limits on the number of wavefronts:
  - 496 wavefronts on the 5870 GPU (~25 wavefronts, or ~1600 threads, per compute unit)

Occupancy – Limiting Factors
- The minimum of these three factors limits the active number of threads (i.e., the occupancy) of a compute unit.
- The interactions between the factors are complex:
  - The limiting factor may have either thread or wavefront granularity.
  - Changing work group size may affect register or shared memory usage.
  - Reducing one factor slightly (such as register usage) may allow another work group to be active.
- The CUDA occupancy calculator from NVIDIA plots these factors, allowing the tradeoffs to be visualized.

CUDA Occupancy Calculator
Using the CUDA occupancy calculator:
1. Enter the hardware model and kernel requirements.
2. Resource usage and limiting factors are displayed.
3. Graphs are shown to visualize the limiting factors.

Vectorization
- On AMD GPUs, each processing element executes a 5-way VLIW instruction:
  - 5 scalar operations, or
  - 4 scalar operations + 1 transcendental operation
- [Figure: a compute unit with processing elements PE0, PE1, PE2, ..., PEn-1; each PE contains registers, ALUs plus an ALU/T-unit for transcendentals, and a branch unit receiving the incoming instruction.]

Vectorization
- Vectorization allows a single thread to perform multiple operations at once.
- Explicit vectorization is achieved by using vector datatypes (such as float4) in the source program (see the sketch below):
  - When a number is appended to a datatype, the datatype becomes an array of that length.
  - Operations can be performed on vector datatypes just like regular datatypes.
  - Each ALU will operate on a different element of the float4 data.

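A minimal sketch of explicit vectorization (the kernel name and arguments are assumptions):

    // Each thread processes four floats at once; on a VLIW processing
    // element, each lane can handle one component of the float4.
    __kernel void saxpy4(__global const float4 *a,
                         __global const float4 *b,
                         __global float4 *out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * a[i] + b[i];   // component-wise on all four elements
    }

Compared with a scalar float version, each thread covers four elements, and each load or store moves 16 bytes per access.
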
Vectorization
- Vectorization improves memory performance on AMD GPUs.
- The AMD Accelerated Parallel Processing OpenCL Programming Guide compares float to float4 memory bandwidth.

Summary
- Although writing a simple OpenCL program is relatively easy, optimizing code can be very difficult.
- Improperly mapping loop iterations to OpenCL threads can significantly degrade performance.
- When creating work groups, hardware limitations (number of registers, size of local memory, etc.) need to be considered.
  - Work groups must be sized appropriately to maximize the number of active threads and properly hide latencies.
- Vectorization is an important optimization for AMD GPU hardware.
  - Though not covered here, vectorization may also help performance when targeting CPUs.


Editor's Notes

  1. Assuming that our thread structure will match the dimensions of our output matrix, the question here is whether the outer loop of the algorithm should be mapped to the X or the Y dimension of the thread structure. Same for the middle loop. The next slide will help clarify.
  2. These are results from Jang et al.
  3. Either approach causes horrible performance on the GPU.
  4. After reading the data to local memory, new indexes are computed that allow consecutive threads to write to consecutive memory locations.
  5. Because of the hardware schedulable unit of threads (i.e. wavefronts or warps), we usually try to create workgroups that are a power of two in size, or at least a multiple of the schedulable unit
  6. Since workgroups are units that must be scheduled all at once, reducing register pressure by a small amount could allow many more threads to be active.
  7. The impact of occupancy is latency hiding, so the real benefit of having as many active threads as possible is completely dependent on the degree of memory latency present in the program (compute-bound kernels will see minimal benefit from more active threads).
  8. NVIDIA GPUs do not have vector-based PEs
  9. The compiler will attempt to pack the VLIW instruction if vector data types are not used.