Tim Costa, NVIDIA HPC SW Product Manager
Jeff Larkin, Sr. DevTech Software Engineer
UK OpenMP Users Group, December 2020
BEST PRACTICES FOR OPENMP ON GPUS
2
THE FUTURE OF PARALLEL PROGRAMMING
Standard Languages | Directives | Specialized Languages
Drive Base Languages to Better Support Parallelism
Augment Base Languages with Directives
Maximize Performance with Specialized Languages & Intrinsics
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
!$omp target data map(x,y)
...
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
...
!$omp end target data
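attributes(global) subroutine saxpy(n, a, x, y)
  integer, value :: n
  real, value :: a
  real :: x(n), y(n)
  integer :: i
  ! CUDA Fortran block and thread indices start at 1
  i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
  if (i <= n) y(i) = y(i) + a*x(i)
end subroutine saxpy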
program main
real :: x(:), y(:)
real,device :: d_x(:), d_y(:)
d_x = x
d_y = y
call saxpy<<<(N+255)/256,256>>>(...)
y = d_y
std::for_each_n(POL, idx(0), n,
[&](Index_t i){
y[i] += a*x[i];
});
3
THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING
Serial Programming Languages | Parallel Programming Languages
Directives convey additional information to the compiler.
4
AVAILABLE NOW: THE NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, and in the Cloud
Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
NVIDIA HPC SDK
DEVELOPMENT
Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
Compilers: nvcc, nvc, nvc++, nvfortran
Core Libraries: libcu++, Thrust, CUB
Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
Communication Libraries: Open MPI, NVSHMEM, NCCL
ANALYSIS
Profilers: Nsight Systems, Nsight Compute
Debugger: cuda-gdb (host and device)
5
HPC COMPILERS
NVC | NVC++ | NVFORTRAN
Programmable: Standard Languages, Directives, CUDA
Multicore: Directives, Vectorization
Multi-Platform: x86_64, Arm, OpenPOWER
Accelerated: Latest GPUs, Automatic Acceleration
6
BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directives to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
7
Expose More Parallelism
Parallel For Isn’t Enough
Many codes already have parallel worksharing loops (Parallel For/Parallel Do). Isn’t that enough?
Parallel For creates a single contention group of threads with a shared view of memory and the ability to coordinate and synchronize.
This structure limits the degree of parallelism that a GPU can exploit and leaves many GPU advantages unused.
#pragma omp parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
[Figure: OMP PARALLEL, OMP FOR, thread team]
8
Expose More Parallelism
Use Teams Distribute
The Teams Distribute directives expose coarse-grained, scalable parallelism, which increases the total parallelism available.
Parallel For is still used to engage the threads within each team.
Coarse-grained and fine-grained parallelism may be combined (A) or split (B).
How do I know which will be better for my code and machine?
// (A) Combined: teams and threads share the outer loop
#pragma omp target teams distribute \
        parallel for reduction(max:error)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
// (B) Split: teams take the outer loop, threads take the inner loop
#pragma omp target teams distribute \
        reduction(max:error)
for( int j = 1; j < n-1; j++) {
  #pragma omp parallel for reduction(max:error)
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
[Figure: OMP TEAMS, OMP DISTRIBUTE, OMP FOR]
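The combined (A) and split (B) forms can be compared side by side. The following is a minimal sketch rather than the presenters' code: the function names `sweep_combined` and `sweep_split`, the flattened row-major indexing, and the explicit `map` clauses are assumptions added so that the snippet compiles on its own. The maximum is written with plain comparisons instead of `fmax`/`fabs` to avoid a math-library dependency. Without a GPU, or with a compiler that ignores the pragmas, both functions should still run on the host and give the same result.

```c
#include <assert.h>

/* (A) Combined: teams and threads share the outer j loop. */
double sweep_combined(int n, int m, const double *A, double *Anew)
{
    double error = 0.0;
    assert(n > 2 && m > 2);
    #pragma omp target teams distribute parallel for reduction(max:error) \
            map(to: A[0:n*m]) map(tofrom: Anew[0:n*m], error)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            double d;
            Anew[j*m + i] = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                                  + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
            d = Anew[j*m + i] - A[j*m + i];
            if (d < 0.0) d = -d;          /* absolute difference */
            if (d > error) error = d;     /* running maximum */
        }
    }
    return error;
}

/* (B) Split: teams take the j loop, threads within a team take the i loop. */
double sweep_split(int n, int m, const double *A, double *Anew)
{
    double error = 0.0;
    assert(n > 2 && m > 2);
    #pragma omp target teams distribute reduction(max:error) \
            map(to: A[0:n*m]) map(tofrom: Anew[0:n*m], error)
    for (int j = 1; j < n - 1; j++) {
        #pragma omp parallel for reduction(max:error)
        for (int i = 1; i < m - 1; i++) {
            double d;
            Anew[j*m + i] = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                                  + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
            d = Anew[j*m + i] - A[j*m + i];
            if (d < 0.0) d = -d;
            if (d > error) error = d;
        }
    }
    return error;
}
```

Timing both variants on the target machine is one way to answer the question of which form suits a given code; the two functions compute the same values, so they can be swapped freely.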
9
Expose Even More Parallelism!
Use Collapse Clause Aggressively
Collapsing loops increases the parallelism available for the compiler to exploit.
This code can likely perform as well as parallel for on the CPU while still exposing more parallelism on a GPU.
Collapsing can erase potential gains from locality (maybe tile can help?).
Not all loops can be collapsed.
#pragma omp target teams distribute \
        parallel for reduction(max:error) \
        collapse(2)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
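To make the effect of collapse(2) concrete, the sketch below pairs a collapsed loop nest with a hand-flattened loop over the same (n-2)*(m-2) iterations. It is an illustration under assumptions, not the presenters' code: the names `sweep_collapse2` and `sweep_flat`, the row-major flattened arrays, and the explicit `map` clauses are all introduced here. The hand-flattened version only shows how a single index can be split back into j and i; a compiler is free to implement collapse differently.

```c
#include <assert.h>

/* collapse(2): both loops form one iteration space of (n-2)*(m-2) points. */
double sweep_collapse2(int n, int m, const double *A, double *Anew)
{
    double error = 0.0;
    assert(n > 2 && m > 2);
    #pragma omp target teams distribute parallel for collapse(2) \
            reduction(max:error) \
            map(to: A[0:n*m]) map(tofrom: Anew[0:n*m], error)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            double d;
            Anew[j*m + i] = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                                  + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
            d = Anew[j*m + i] - A[j*m + i];
            if (d < 0.0) d = -d;
            if (d > error) error = d;
        }
    }
    return error;
}

/* Hand-flattened equivalent: one loop over k, with j and i recovered from k. */
double sweep_flat(int n, int m, const double *A, double *Anew)
{
    double error = 0.0;
    int rows = n - 2;   /* interior rows    */
    int cols = m - 2;   /* interior columns */
    assert(n > 2 && m > 2);
    #pragma omp target teams distribute parallel for reduction(max:error) \
            map(to: A[0:n*m]) map(tofrom: Anew[0:n*m], error)
    for (int k = 0; k < rows * cols; k++) {
        int j = 1 + k / cols;
        int i = 1 + k % cols;
        double d;
        Anew[j*m + i] = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                              + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
        d = Anew[j*m + i] - A[j*m + i];
        if (d < 0.0) d = -d;
        if (d > error) error = d;
    }
    return error;
}
```

The flattened form also hints at why not every nest qualifies: the loops must be perfectly nested with bounds that allow the index to be split this way.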
10
Optimize Data Motion
Use Target Data Mapping
The more loops you move to the device, the more data movement you may introduce.
Compilers will always choose correctness over performance, so the programmer can often do better.
Target Data directives and map clauses enable the programmer to take control of data movement.
On Unified Memory machines, locality may be good enough, but optimizing the data movement is more portable.
Note: I’ve seen cases where registering arrays with CUDA may further improve remaining data movement.
#pragma omp target data map(to:Anew) map(tofrom:A)
while ( error > tol && iter < iter_max )
{
… // Use A and Anew in target directives
}
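A fuller version of the convergence loop might look like the sketch below. It is a sketch under assumptions: the function name `jacobi_solve`, the flattened row-major arrays, and the copy-back loop are introduced here for illustration and are not taken from the slides. The inner `map` clauses name arrays that the enclosing target data region has already placed on the device, so only the scalar `error` should move on each iteration.

```c
#include <assert.h>

/* Runs Jacobi sweeps until the largest change drops to tol or iter_max is hit.
   Returns the number of sweeps performed. Boundary values of A and Anew must
   already match on entry, because only interior points are updated. */
int jacobi_solve(int n, int m, double *A, double *Anew, double tol, int iter_max)
{
    double error = tol + 1.0;   /* forces at least one sweep */
    int iter = 0;
    assert(n > 2 && m > 2);

    /* Both grids go to the device once; only A is copied back at the end. */
    #pragma omp target data map(to: Anew[0:n*m]) map(tofrom: A[0:n*m])
    while (error > tol && iter < iter_max) {
        error = 0.0;

        /* Arrays are already present, so no array copies happen here. */
        #pragma omp target teams distribute parallel for collapse(2) \
                reduction(max:error) map(tofrom: error) \
                map(to: A[0:n*m]) map(tofrom: Anew[0:n*m])
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++) {
                double d;
                Anew[j*m + i] = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                                      + A[(j - 1)*m + i] + A[(j + 1)*m + i]);
                d = Anew[j*m + i] - A[j*m + i];
                if (d < 0.0) d = -d;
                if (d > error) error = d;
            }
        }

        /* Copy the new interior values into A for the next sweep. */
        #pragma omp target teams distribute parallel for collapse(2) \
                map(to: Anew[0:n*m]) map(tofrom: A[0:n*m])
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++) {
                A[j*m + i] = Anew[j*m + i];
            }
        }

        iter++;
    }
    return iter;
}
```

Removing the target data line should leave the results unchanged but would let each inner construct copy the arrays in and out on every sweep, which is the data movement this slide is about avoiding.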
11
Accelerated Libraries & OpenMP
Eat your free lunch
Accelerated libraries are free performance.
The use_device_ptr clause enables sharing data from OpenMP to CUDA.
OpenMP 5.1 adds the interop construct to enable sharing CUDA streams.
#pragma omp target data map(alloc:x[0:n],y[0:n])
{
#pragma omp target teams distribute parallel for
for( i = 0; i < n; i++)
{
x[i] = 1.0f;
y[i] = 0.0f;
}
#pragma omp target data use_device_ptr(x,y)
{
cublasSaxpy(n, 2.0, x, 1, y, 1);
// Synchronize before using results
}
#pragma omp target update from(y[0:n])
}
12
Tasking & GPUs
Keep the whole system busy
Data copies, GPU compute, and CPU compute can all overlap.
OpenMP uses its existing tasking framework to manage asynchronous execution and dependencies.
Your mileage may vary regarding how well the runtime maps OpenMP tasks to CUDA streams.
Get it working synchronously first!
#pragma omp target data map(alloc:a[0:N],b[0:N])
{
  #pragma omp target update to(a[0:N]) nowait depend(inout:a)
  #pragma omp target update to(b[0:N]) nowait depend(inout:b)
  #pragma omp target teams distribute parallel for \
          nowait depend(inout:a)
  for(int i=0; i<N; i++ )
  {
    a[i] = 2.0f * a[i];
  }
  #pragma omp target teams distribute parallel for \
          nowait depend(inout:a,b)
  for(int i=0; i<N; i++ )
  {
    b[i] = 2.0f * a[i];
  }
  #pragma omp target update from(b[0:N]) nowait depend(inout:b)
  #pragma omp taskwait depend(inout:b)
}
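The overlap of device work with CPU work can be sketched as a complete function. The example below is an assumption-laden illustration, not the presenters' code: the name `overlap_demo`, the extra host array `c`, the array-section form of the `depend` clauses, and the explicit `map` clauses are introduced here. It also ends with a plain `taskwait` rather than the `taskwait depend(...)` form shown above, which waits for all outstanding child tasks instead of one dependence chain.

```c
#include <assert.h>

/* Device: a = 2*a, then b = 2*a, so b ends up as 4x the original a.
   Host: sums c while the device tasks may still be in flight.
   Only b is copied back; do not rely on the host copy of a afterwards. */
double overlap_demo(int n, float *a, float *b, const double *c)
{
    double host_sum = 0.0;
    assert(n > 0);

    #pragma omp target data map(alloc: a[0:n], b[0:n])
    {
        #pragma omp target update to(a[0:n]) nowait depend(inout: a[0:n])
        #pragma omp target update to(b[0:n]) nowait depend(inout: b[0:n])

        #pragma omp target teams distribute parallel for nowait \
                depend(inout: a[0:n]) map(tofrom: a[0:n])
        for (int i = 0; i < n; i++) {
            a[i] = 2.0f * a[i];
        }

        #pragma omp target teams distribute parallel for nowait \
                depend(inout: a[0:n], b[0:n]) \
                map(to: a[0:n]) map(tofrom: b[0:n])
        for (int i = 0; i < n; i++) {
            b[i] = 2.0f * a[i];
        }

        #pragma omp target update from(b[0:n]) nowait depend(inout: b[0:n])

        /* Independent CPU work that touches neither a nor b. */
        for (int i = 0; i < n; i++) {
            host_sum += c[i];
        }

        /* Wait for every deferred task before leaving the data region. */
        #pragma omp taskwait
    }
    return host_sum;
}
```

Following the advice above, a reasonable way to develop such code is to start without the nowait and depend clauses, confirm the results, and then add them back one construct at a time.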
13
Host Fallback
The “if” clause
The if clause can be used to selectively turn off OpenMP offloading.
Unless the compiler knows at compile-time the value of the if clause, both code paths are generated.
For simple loops, written to use the previous guidelines, it may be possible to have unified CPU & GPU code (no guarantee it will be optimal).
#pragma omp target teams distribute \
        parallel for reduction(max:error) \
        collapse(2) if(target:USE_GPU)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
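Because the if clause takes a runtime expression, one possible use is to offload only when the problem is large enough to repay the transfer cost. The sketch below assumes a hypothetical function `saxpy_auto` and a hypothetical tuning parameter `min_offload_n`; neither appears in the slides, and a sensible threshold would have to be measured on the machine in question.

```c
#include <assert.h>

/* y = y + a*x. The loop is offloaded only when n >= min_offload_n;
   otherwise the same source runs on the host. */
void saxpy_auto(int n, float a, const float *x, float *y, int min_offload_n)
{
    assert(n > 0);
    #pragma omp target teams distribute parallel for \
            if(target: n >= min_offload_n) \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = y[i] + a * x[i];
    }
}
```

A call such as `saxpy_auto(n, 2.0f, x, y, 100000)` would keep small vectors on the CPU, while the same call with a threshold of 1 would always request the device path.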
14
The Loop Directive
Give descriptiveness a try
The loop directive was added in OpenMP 5.0 as a descriptive option for programming.
Loop asserts that the iterations of a loop may be run in any order, including concurrently.
As more compilers and applications adopt it, we hope it will enable more performance portability.
#pragma omp target teams loop \
        reduction(max:error) \
        collapse(2)
for( int j = 1; j < n-1; j++) {
  for( int i = 1; i < m-1; i++ ) {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}
15
BEST PRACTICES FOR OPENMP ON GPUS
Always use the teams and distribute directives to expose all available parallelism
Aggressively collapse loops to increase available parallelism
Use the target data directive and map clauses to reduce data movement between CPU and GPU
Use accelerated libraries whenever possible
Use OpenMP tasks to go asynchronous and better utilize the whole system
Use host fallback (if clause) to generate host and device code
Bonus: Give loop a try

The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 

Best Practices for OpenMP on GPUs - OpenMP UK Users Group

  • 1. BEST PRACTICES FOR OPENMP ON GPUS
       Tim Costa, NVIDIA HPC SW Product Manager
       Jeff Larkin, Sr. DevTech Software Engineer
       UK OpenMP Users Group, December 2020
  • 2. THE FUTURE OF PARALLEL PROGRAMMING
       Standard Languages | Directives | Specialized Languages

       Standard Languages: Drive Base Languages to Better Support Parallelism
           do concurrent (i = 1:n)
             y(i) = y(i) + a*x(i)
           enddo

           std::for_each_n(POL, idx(0), n,
             [&](Index_t i){
               y[i] += a*x[i];
             });

       Directives: Augment Base Languages with Directives
           !$omp target data map(x,y)
           ...
           do concurrent (i = 1:n)
             y(i) = y(i) + a*x(i)
           enddo
           ...
           !$omp end target data

       Specialized Languages: Maximize Performance with Specialized Languages & Intrinsics
           attribute(global)
           subroutine saxpy(n, a, x, y) {
             int i = blockIdx%x*blockDim%x + threadIdx%x;
             if (i < n) y(i) += a*x(i)
           }

           program main
             real :: x(:), y(:)
             real,device :: d_x(:), d_y(:)
             d_x = x
             d_y = y
             call saxpy<<<(N+255)/256,256>>>(...)
             y = d_y
  • 3. THE ROLE OF DIRECTIVES FOR PARALLEL PROGRAMMING
       Serial Programming Languages | Parallel Programming Languages
       Directives convey additional information to the compiler.
  • 4. AVAILABLE NOW: THE NVIDIA HPC SDK
       Available at developer.nvidia.com/hpc-sdk, on NGC, and in the Cloud
       Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
       HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
       7-8 Releases Per Year | Freely Available

       DEVELOPMENT
           Compilers: nvcc, nvc, nvc++, nvfortran
           Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
           Core Libraries: libcu++, Thrust, CUB
           Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
           Communication Libraries: Open MPI, NVSHMEM, NCCL
       ANALYSIS
           Profilers: Nsight Systems, Nsight Compute
           Debugger: cuda-gdb (Host, Device)
  • 5. HPC COMPILERS
       NVC | NVC++ | NVFORTRAN
           Programmable: Standard Languages, Directives, CUDA
           Multicore: Directives, Vectorization
           Multi-Platform: x86_64, Arm, OpenPOWER
           Accelerated: Latest GPUs, Automatic Acceleration
  • 6. BEST PRACTICES FOR OPENMP ON GPUS
       - Always use the teams and distribute directives to expose all available parallelism
       - Aggressively collapse loops to increase available parallelism
       - Use the target data directive and map clauses to reduce data movement between CPU and GPU
       - Use accelerated libraries whenever possible
       - Use OpenMP tasks to go asynchronous and better utilize the whole system
       - Use host fallback (if clause) to generate host and device code
  • 7. Expose More Parallelism: Parallel For Isn’t Enough
       Many codes already have Parallel Worksharing loops (Parallel For/Parallel Do); isn’t that enough?
       Parallel For creates a single contention group of threads with a shared view of memory and the ability to coordinate and synchronize.
       This structure limits the degree of parallelism that a GPU can exploit and doesn’t exploit many GPU advantages.

           #pragma omp parallel for reduction(max:error)
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
               error = fmax( error, fabs(Anew[j][i] - A[j][i]));
             }
           }

       [Figure labels: OMP PARALLEL, OMP FOR, Thread Team]
  • 8. Expose More Parallelism: Use Teams Distribute
       The Teams Distribute directive exposes coarse-grained, scalable parallelism, generating more parallelism.
       Parallel For is still used to engage the threads within each team.
       Coarse- and fine-grained parallelism may be combined (A) or split (B).
       How do I know which will be better for my code & machine?

       A:
           #pragma omp target teams distribute parallel for reduction(max:error)
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
               error = fmax( error, fabs(Anew[j][i] - A[j][i]));
             }
           }

       B:
           #pragma omp target teams distribute reduction(max:error)
           for( int j = 1; j < n-1; j++) {
             #pragma omp parallel for reduction(max:error)
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
               error = fmax( error, fabs(Anew[j][i] - A[j][i]));
             }
           }

       [Figure labels: OMP TEAMS, OMP DISTRIBUTE, OMP FOR]
  • 9. Expose Even More Parallelism! Use Collapse Clause Aggressively
       Collapsing loops increases the parallelism available for the compiler to exploit.
       This code can likely perform as well as parallel for on the CPU while still exploiting more parallelism on a GPU.
       Collapsing can erase possible gains from locality (maybe tile can help?).
       Not all loops can be collapsed.

           #pragma omp target teams distribute parallel for reduction(max:error) collapse(2)
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
               error = fmax( error, fabs(Anew[j][i] - A[j][i]));
             }
           }
  • 10. Optimize Data Motion: Use Target Data Mapping
       The more loops you move to the device, the more data movement you may introduce.
       Compilers will always choose correctness over performance, so the programmer can often do better.
       Target Data directives and map clauses enable the programmer to take control of data movement.
       On Unified Memory machines, locality may be good enough, but optimizing the data movement is more portable.
       Note: I’ve seen cases where registering arrays with CUDA may further improve remaining data movement.

           #pragma omp target data map(to:Anew) map(tofrom:A)
           while ( error > tol && iter < iter_max )
           {
             … // Use A and Anew in target directives
           }
  • 11. Accelerated Libraries & OpenMP: Eat your free lunch
       Accelerated libraries are free performance.
       The use_device_ptr clause enables sharing data from OpenMP to CUDA.
       OpenMP 5.1 adds the interop construct to enable sharing CUDA Streams.

           #pragma omp target data map(alloc:x[0:n],y[0:n])
           {
             #pragma omp target teams distribute parallel for
             for( i = 0; i < n; i++)
             {
               x[i] = 1.0f;
               y[i] = 0.0f;
             }

             #pragma omp target data use_device_ptr(x,y)
             {
               cublasSaxpy(n, 2.0, x, 1, y, 1);
               // Synchronize before using results
             }

             #pragma omp target update from(y[0:n])
           }
  • 12. Tasking & GPUs: Keep the whole system busy
       Data copies, GPU compute, and CPU compute can all overlap.
       OpenMP uses its existing tasking framework to manage asynchronous execution and dependencies.
       Your mileage may vary regarding how well the runtime maps OpenMP tasks to CUDA streams.
       Get it working synchronously first!

           #pragma omp target data map(alloc:a[0:N],b[0:N])
           {
             #pragma omp target update to(a[0:N]) nowait depend(inout:a)
             #pragma omp target update to(b[0:N]) nowait depend(inout:b)

             #pragma omp target teams distribute parallel for nowait depend(inout:a)
             for(int i=0; i<N; i++ )
             {
               a[i] = 2.0f * a[i];
             }

             #pragma omp target teams distribute parallel for nowait depend(inout:a,b)
             for(int i=0; i<N; i++ )
             {
               b[i] = 2.0f * a[i];
             }

             #pragma omp target update from(b[0:N]) nowait depend(inout:b)
             #pragma omp taskwait depend(inout:b)
           }
  • 13. Host Fallback: The “if” clause
       The if clause can be used to selectively turn off OpenMP offloading.
       Unless the compiler knows at compile-time the value of the if clause, both code paths are generated.
       For simple loops, written to use the previous guidelines, it may be possible to have unified CPU & GPU code (no guarantee it will be optimal).

           #pragma omp target teams distribute parallel for reduction(max:error) collapse(2) if(target:USE_GPU)
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
               error = fmax( error, fabs(Anew[j][i] - A[j][i]));
             }
           }
  • 14. The Loop Directive: Give descriptiveness a try
       The Loop directive was added in 5.0 as a descriptive option for programming.
       Loop asserts the ability of a loop to be run in any order, including concurrently.
       As more compilers & applications adopt it, we hope it will enable more performance portability.

           #pragma omp target teams loop reduction(max:error) collapse(2)
           for( int j = 1; j < n-1; j++) {
             for( int i = 1; i < m-1; i++ ) {
               Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
               error = fmax( error, fabs(Anew[j][i] - A[j][i]));
             }
           }
  • 15. BEST PRACTICES FOR OPENMP ON GPUS
       - Always use the teams and distribute directives to expose all available parallelism
       - Aggressively collapse loops to increase available parallelism
       - Use the target data directive and map clauses to reduce data movement between CPU and GPU
       - Use accelerated libraries whenever possible
       - Use OpenMP tasks to go asynchronous and better utilize the whole system
       - Use host fallback (if clause) to generate host and device code
       - Bonus: Give loop a try
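
The first, second, third, and sixth practices can be combined in one small Jacobi solver. The sketch below is not taken from the slides: the function names (`jacobi_sweep`, `copy_interior`, `jacobi_solve`), the `use_gpu` flag, and the flat row-major array layout are assumptions made for illustration. The reduction variable is also mapped explicitly, which the slides omit; this is a conservative choice for compilers that predate the OpenMP 5.0 implicit-mapping rule. Built without OpenMP, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical helper: one Jacobi sweep over the interior of an n x m grid
 * stored row-major in flat arrays. Returns the largest pointwise change. */
double jacobi_sweep(const double *A, double *Anew, int n, int m, int use_gpu)
{
    double error = 0.0;
    /* teams distribute + parallel for + collapse(2) exposes all n*m-ish
     * iterations; if(target: ...) selects host fallback at run time.
     * The array map clauses move no data when an enclosing target data
     * region already holds the arrays. */
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(max:error) map(tofrom: error) \
        map(to: A[0:n*m]) map(tofrom: Anew[0:n*m]) if(target: use_gpu)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j*m + i] = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1]
                                  + A[(j-1)*m + i] + A[(j+1)*m + i]);
            double diff = Anew[j*m + i] - A[j*m + i];
            if (diff < 0.0) diff = -diff;
            if (diff > error) error = diff;   /* max reduction */
        }
    }
    return error;
}

/* Hypothetical helper: copy the updated interior back into A. */
void copy_interior(double *A, const double *Anew, int n, int m, int use_gpu)
{
    #pragma omp target teams distribute parallel for collapse(2) \
        map(tofrom: A[0:n*m]) map(to: Anew[0:n*m]) if(target: use_gpu)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            A[j*m + i] = Anew[j*m + i];
        }
    }
}

/* Iterate until the change drops to tol or iter_max sweeps have run.
 * Returns the number of sweeps performed. */
int jacobi_solve(double *A, double *Anew, int n, int m,
                 double tol, int iter_max, int use_gpu)
{
    assert(n >= 3 && m >= 3);
    int iter = 0;
    double error = tol + 1.0;
    /* Keep both arrays resident on the device for the whole loop, so the
     * sweeps inside do not copy them back and forth on every iteration. */
    #pragma omp target data map(tofrom: A[0:n*m]) map(to: Anew[0:n*m]) if(use_gpu)
    {
        while (error > tol && iter < iter_max) {
            error = jacobi_sweep(A, Anew, n, m, use_gpu);
            copy_interior(A, Anew, n, m, use_gpu);
            iter++;
        }
    }
    return iter;
}
```

Calling `jacobi_solve(A, Anew, n, m, tol, iter_max, 0)` forces the host path, which is a convenient way to check results before enabling offload with a non-zero `use_gpu`.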
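
The tasking practice can be sketched in the same spirit. The function below follows the shape of the slide 12 example but is an assumption-laden variant, not the authors' code: the name `scale_pipeline`, the extra host-side array `c`, and the narrower dependence kinds (`in`/`out` instead of `inout` everywhere) are illustrative choices, and the upload of `b` is dropped because `b` is fully overwritten. How much of the work really overlaps depends on the OpenMP runtime.

```c
#include <assert.h>

/* Hypothetical example: doubles a on the device, writes b = 2 * a on the
 * device, and sums c on the host while the device work is in flight.
 * On return b[i] holds 4 * (original a[i]); the function returns sum(c).
 * When the code really offloads, only b is copied back. */
float scale_pipeline(float *a, float *b, const float *c, int n)
{
    assert(n > 0);
    float host_sum = 0.0f;

    #pragma omp target data map(alloc: a[0:n], b[0:n])
    {
        /* Upload a asynchronously; later tasks that touch a wait for it. */
        #pragma omp target update to(a[0:n]) nowait depend(out: a[0:n])

        #pragma omp target teams distribute parallel for nowait \
            depend(inout: a[0:n]) map(tofrom: a[0:n])
        for (int i = 0; i < n; i++) {
            a[i] = 2.0f * a[i];
        }

        #pragma omp target teams distribute parallel for nowait \
            depend(in: a[0:n]) depend(out: b[0:n]) \
            map(to: a[0:n]) map(from: b[0:n])
        for (int i = 0; i < n; i++) {
            b[i] = 2.0f * a[i];
        }

        /* Download b once the second kernel has produced it. */
        #pragma omp target update from(b[0:n]) nowait depend(inout: b[0:n])

        /* Independent host work that may overlap with the tasks above. */
        for (int i = 0; i < n; i++) {
            host_sum += c[i];
        }

        /* Wait for every outstanding target task before leaving the region. */
        #pragma omp taskwait
    }
    return host_sum;
}
```

As the slide advises, a version without `nowait` and `depend` is worth validating first; the clauses can then be added one task at a time.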