April 4-7, 2016 | Silicon Valley
James Beyer, NVIDIA
Jeff Larkin, NVIDIA
GTC16 – April 7, 2016
S6410 - Comparing OpenACC 2.5
and OpenMP 4.5
AGENDA
History of OpenMP & OpenACC
Philosophical Differences
Technical Differences
Portability Challenges
Conclusions
A Tale of Two Specs.
A Brief History of OpenMP
1996 - Architecture Review Board (ARB) formed by several vendors implementing
their own directives for Shared Memory Parallelism (SMP).
1997 - 1.0 was released for C/C++ and Fortran with support for parallelizing loops
across threads.
2000, 2002 – Version 2.0 of Fortran, C/C++ specifications released.
2005 – Version 2.5 released, combining both specs into one.
2008 – Version 3.0 released, added support for tasking.
2011 – Version 3.1 released, improved support for tasking.
2013 – Version 4.0 released, added support for offloading (and more).
2015 – Version 4.5 released, improved support for offloading targets (and more).
4/6/2016
A Brief History of OpenACC
2010 – OpenACC founded by CAPS, Cray, PGI, and NVIDIA, to unify directives for
accelerators being developed by CAPS, Cray, and PGI independently
2011 – OpenACC 1.0 released.
2013 – OpenACC 2.0 released, adding support for unstructured data management and
clarifying specification language.
2015 – OpenACC 2.5 released, containing primarily clarifications with some
additional features.
Philosophical Differences
OpenMP: Compilers are dumb, users are
smart. Restructuring non-parallel code is
optional.
OpenACC: Compilers can be smart and
smarter with the user’s help. Non-parallel
code must be made parallel.
Philosophical Differences
OpenMP:
The OpenMP API covers only user-directed parallelization, wherein the programmer
explicitly specifies the actions to be taken by the compiler and runtime system in
order to execute the program in parallel.
The OpenMP API does not cover compiler-generated automatic parallelization and
directives to the compiler to assist such parallelization.
OpenACC:
The programming model allows the programmer to augment information available
to the compilers, including specification of data local to an accelerator, guidance
on mapping of loops onto an accelerator, and similar performance-related details.
Philosophical Trade-offs
OpenMP
Consistent, predictable behavior between implementations
Users can parallelize non-parallel code and protect data races explicitly
Some optimizations are off the table
Substantially different architectures require substantially different directives
OpenACC
Quality of implementation will greatly affect performance
Users must restructure their code to be parallel and free of data races
Compiler has more freedom and information to optimize
High-level parallel directives can be applied to different architectures by the compiler
Technical Differences
Parallel: Similar, but Different
OMP Parallel
Creates a team of threads
How the number of threads is
chosen is very well defined.
May synchronize within the team
Data races are the user’s
responsibility
ACC Parallel
Creates 1 or more gangs of workers
Compiler free to choose number of
gangs, workers, vector length
May not synchronize between gangs
Data races not allowed
OMP Teams vs. ACC Parallel
OMP Teams
Creates a league of 1 or more
thread teams
Compiler free to choose number of
teams, threads, and simd lanes.
May not synchronize between teams
Only available within target regions
ACC Parallel
Creates 1 or more gangs of workers
Compiler free to choose number of
gangs, workers, vector length
May not synchronize between gangs
May be used anywhere
Compiler-Driven Mode
OpenMP
Fully user-driven (no analogue)
Some compilers choose to go above
and beyond after applying OpenMP,
but not guaranteed
OpenACC
Kernels directive declares desire
to parallelize a region of code, but
places the burden of analysis on the
compiler
The compiler is required to do this
analysis and make decisions.
Loop: Similar but Different
OMP Loop (For/Do)
Splits (“workshares”) the iterations
of the next loop across the threads
in the team; the user guarantees any
data races have been managed
Loop will be run over threads, and
the scheduling of loop iterations
may restrict the compiler
ACC Loop
Declares the loop iterations as
independent & race free (parallel)
or interesting & should be analyzed
(kernels)
User able to declare independence
w/o declaring scheduling
Compiler free to schedule with
gangs/workers/vector, unless
overridden by user
Distribute vs. Loop
OMP Distribute
Must live in a TEAMS region
Distributes loop iterations over 1 or
more thread teams
Only master thread of each team
runs iterations, until PARALLEL is
encountered
Loop iterations are implicitly
independent, but some compiler
optimizations still restricted
ACC Loop
Declares the loop iterations as
independent & race free (parallel)
or interesting & should be analyzed
(kernels)
Compiler free to schedule with
gangs/workers/vector, unless
overridden by user
Distribute Example
#pragma omp target teams
{
#pragma omp distribute
for(i=0; i<n; i++)
for(j=0;j<m;j++)
for(k=0;k<p;k++)
}
#pragma acc parallel
{
#pragma acc loop
for(i=0; i<n; i++)
#pragma acc loop
for(j=0;j<m;j++)
#pragma acc loop
for(k=0;k<p;k++)
}
Distribute Example
#pragma omp target teams
{
#pragma omp distribute
for(i=0; i<n; i++)
for(j=0;j<m;j++)
for(k=0;k<p;k++)
}
#pragma acc parallel
{
#pragma acc loop
for(i=0; i<n; i++)
#pragma acc loop
for(j=0;j<m;j++)
#pragma acc loop
for(k=0;k<p;k++)
}
Generates one or more
thread teams
Distributes “i” over the
teams.
No information about the
“j” or “k” loops
Distribute Example
#pragma omp target teams
{
#pragma omp distribute
for(i=0; i<n; i++)
for(j=0;j<m;j++)
for(k=0;k<p;k++)
}
#pragma acc parallel
{
#pragma acc loop
for(i=0; i<n; i++)
#pragma acc loop
for(j=0;j<m;j++)
#pragma acc loop
for(k=0;k<p;k++)
}
Generates one or more
gangs
These loops are
independent, do the
right thing
Distribute Example
#pragma omp target teams
{
#pragma omp distribute
for(i=0; i<n; i++)
for(j=0;j<m;j++)
for(k=0;k<p;k++)
}
#pragma acc parallel
{
#pragma acc loop
for(i=0; i<n; i++)
#pragma acc loop
for(j=0;j<m;j++)
#pragma acc loop
for(k=0;k<p;k++)
}
What’s the right thing?
Interchange? Distribute? Workshare?
Vectorize? Stripmine? Ignore? …
Synchronization
OpenMP
Users may use barriers, critical
regions, and/or locks to protect
data races
It’s possible to parallelize non-
parallel code
OpenACC
Users expected to refactor code to
remove data races.
Code should be made truly parallel
and scalable
Synchronization Example
#pragma omp parallel private(p)
{
funcA(p);
#pragma omp barrier
funcB(p);
}
function funcA(p[N]){
#pragma acc parallel
// ... first phase of work on p ...
}
function funcB(p[N]){
#pragma acc parallel
// ... second phase of work on p ...
}
Synchronization Example
#pragma omp parallel for
for (i=0; i<N; i++)
{
#pragma omp critical
A[i] = rand();
A[i] *= 2;
}
parallelRand(A);
#pragma acc parallel loop
for (i=0; i<N; i++)
{
A[i] *= 2;
}
Portability Challenges
How to Write Portable Code (OMP)
#ifdef GPU
#pragma omp target teams distribute parallel for reduction(max:error) \
        collapse(2) schedule(static,1)
#elif defined(CPU)
#pragma omp parallel for reduction(max:error)
#elif defined(SOMETHING_ELSE)
#pragma omp …
#endif
for( int j = 1; j < n-1; j++)
{
#if defined(CPU) && defined(USE_SIMD)
#pragma omp simd
#endif
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
Ifdefs can be used to
choose particular
directives per device
at compile-time
How to Write Portable Code (OMP)
#pragma omp \
#ifdef GPU
  target teams distribute \
#endif
  parallel for reduction(max:error) \
#ifdef GPU
  collapse(2) schedule(static,1)
#endif
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
Creative ifdefs might
clean up the code, but
still one target at a
time.
How to Write Portable Code (OMP)
usegpu = 1;
#pragma omp target teams distribute parallel for reduction(max:error) \
#ifdef GPU
  collapse(2) schedule(static,1) \
#endif
if(target:usegpu)
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
The OpenMP if clause
can help some too (4.5
improves this).
Note: This example
assumes that a compiler
will choose to generate 1
team when not in a target,
making it the same as a
standard “parallel for.”
How to Write Portable Code (ACC)
#pragma acc kernels
{
for( int j = 1; j < n-1; j++)
{
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
}
The developer expresses the
desire to parallelize to the
compiler; the compiler
handles the rest.
How to Write Portable Code (ACC)
#pragma acc parallel loop reduction(max:error)
for( int j = 1; j < n-1; j++)
{
#pragma acc loop reduction(max:error)
for( int i = 1; i < m-1; i++ )
{
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmax( error, fabs(Anew[j][i] - A[j][i]));
}
}
The developer asserts the
parallelism of the loops
to the compiler; the
compiler makes decisions
about scheduling.
Execution Time (Smaller is Better)
[Bar chart of execution time in seconds; the original version runs in 893 seconds. Speedups relative to the original: Original 1.00X; CPU Threaded 5.12X; GPU Threaded 0.12X; GPU Teams 1.01X; GPU Split 2.47X; GPU Collapse 0.96X; GPU Split Sched 4.00X; GPU Collapse Sched 17.42X; OpenACC (CPU) 4.96X; OpenACC (GPU) 23.03X. The two OpenACC results use the same directives, unoptimized.]
NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @3.00GHz – See GTC16 S6510 for additional information
Compiler Portability (CPU)
OpenMP
Numerous well-tested
implementations
• PGI, IBM, Intel, GCC, Cray, …
OpenACC
CPU implementations beginning to
emerge
• X86: PGI
• ARM: PathScale
• Power: Coming soon
Compiler Portability (Offload)
OpenMP
Few mature implementations
• Intel (Phi)
• Cray (GPU, Phi?)
• GCC (Phi, GPUs in development)
• Clang (Multiple targets in
development)
OpenACC
Multiple mature implementations
• PGI (NVIDIA & AMD)
• PathScale (NVIDIA & AMD)
• Cray (NVIDIA)
• GCC (in development)
Conclusions
Conclusions
OpenMP & OpenACC, while similar, are still quite different in their approach
Each approach has clear tradeoffs with no clear “winner”
It should be possible to translate between the two, but the process may not be
automatic
Odoo Development Company in India | Devintelle Consulting Service
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 

GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5

  • 1. April 4-7, 2016 | Silicon Valley James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 – April 7, 2016 S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
  • 2. 2 AGENDA History of OpenMP & OpenACC Philosophical Differences Technical Differences Portability Challenges Conclusions
  • 3. 3 A Tale of Two Specs.
  • 4. 4 A Brief History of OpenMP 1996 – Architecture Review Board (ARB) formed by several vendors implementing their own directives for Shared Memory Parallelism (SMP). 1997 – 1.0 was released for C/C++ and Fortran with support for parallelizing loops across threads. 2000, 2002 – Version 2.0 of the Fortran and C/C++ specifications released. 2005 – Version 2.5 released, combining both specs into one. 2008 – Version 3.0 released, added support for tasking. 2011 – Version 3.1 released, improved support for tasking. 2013 – Version 4.0 released, added support for offloading (and more). 2015 – Version 4.5 released, improved support for offloading targets (and more). 4/6/2016
  • 5. 5 A Brief History of OpenACC 2010 – OpenACC founded by CAPS, Cray, PGI, and NVIDIA, to unify directives for accelerators being developed by CAPS, Cray, and PGI independently. 2011 – OpenACC 1.0 released. 2013 – OpenACC 2.0 released, adding support for unstructured data management and clarifying specification language. 2015 – OpenACC 2.5 released, contains primarily clarifications with some additional features.
  • 7. 7 OpenMP: Compilers are dumb, users are smart. Restructuring non-parallel code is optional. OpenACC: Compilers can be smart and smarter with the user’s help. Non-parallel code must be made parallel.
  • 8. 8 Philosophical Differences OpenMP: The OpenMP API covers only user-directed parallelization, wherein the programmer explicitly specifies the actions to be taken by the compiler and runtime system in order to execute the program in parallel. The OpenMP API does not cover compiler-generated automatic parallelization and directives to the compiler to assist such parallelization. OpenACC: The programming model allows the programmer to augment information available to the compilers, including specification of data local to an accelerator, guidance on mapping of loops onto an accelerator, and similar performance-related details.
  • 9. 9 Philosophical Trade-offs. OpenMP: consistent, predictable behavior between implementations; users can parallelize non-parallel code and protect data races explicitly; some optimizations are off the table; substantially different architectures require substantially different directives. OpenACC: quality of implementation will greatly affect performance; users must restructure their code to be parallel and free of data races; the compiler has more freedom and information to optimize; high-level parallel directives can be applied to different architectures by the compiler.
  • 11. 11 Parallel: Similar, but Different. OMP Parallel: creates a team of threads; very well-defined how the number of threads is chosen; may synchronize within the team; data races are the user's responsibility. ACC Parallel: creates 1 or more gangs of workers; compiler free to choose number of gangs, workers, vector length; may not synchronize between gangs; data races not allowed.
  • 12. 12 OMP Teams vs. ACC Parallel. OMP Teams: creates a league of 1 or more thread teams; compiler free to choose number of teams, threads, and simd lanes; may not synchronize between teams; only available within target regions. ACC Parallel: creates 1 or more gangs of workers; compiler free to choose number of gangs, workers, vector length; may not synchronize between gangs; may be used anywhere.
  • 13. 13 Compiler-Driven Mode. OpenMP: fully user-driven (no analogue); some compilers choose to go above and beyond after applying OpenMP, but this is not guaranteed. OpenACC: the kernels directive declares a desire to parallelize a region of code but places the burden of analysis on the compiler; the compiler is required to be able to do the analysis and make decisions.
  • 14. 14 Loop: Similar but Different. OMP Loop (For/Do): splits ("workshares") the iterations of the next loop across the threads in the team; guarantees the user has managed any data races; the loop will be run over threads, and scheduling of loop iterations may restrict the compiler. ACC Loop: declares the loop iterations as independent and race free (parallel) or interesting and worth analyzing (kernels); the user is able to declare independence without declaring scheduling; the compiler is free to schedule with gangs/workers/vector, unless overridden by the user.
  • 15. 15 Distribute vs. Loop. OMP Distribute: must live in a TEAMS region; distributes loop iterations over 1 or more thread teams; only the master thread of each team runs iterations, until PARALLEL is encountered; loop iterations are implicitly independent, but some compiler optimizations are still restricted. ACC Loop: declares the loop iterations as independent and race free (parallel) or interesting and worth analyzing (kernels); the compiler is free to schedule with gangs/workers/vector, unless overridden by the user.
  • 16. 16 Distribute Example

    #pragma omp target teams
    {
      #pragma omp distribute
      for(i=0; i<n; i++)
        for(j=0;j<m;j++)
          for(k=0;k<p;k++)
    }

    #pragma acc parallel
    {
      #pragma acc loop
      for(i=0; i<n; i++)
        #pragma acc loop
        for(j=0;j<m;j++)
          #pragma acc loop
          for(k=0;k<p;k++)
    }
  • 17. 17 Distribute Example (same code) OpenMP: generate 1 or more thread teams; distribute "i" over teams. No information about the "j" or "k" loops.
  • 18. 18 Distribute Example (same code) OpenACC: generate 1 or more gangs; these loops are independent, do the right thing.
  • 19. 19 Distribute Example (same code) What's the right thing? Interchange? Distribute? Workshare? Vectorize? Stripmine? Ignore? …
  • 20. 20 Synchronization. OpenMP: users may use barriers, critical regions, and/or locks to protect data races; it's possible to parallelize non-parallel code. OpenACC: users are expected to refactor code to remove data races; code should be made truly parallel and scalable.
  • 21. 21 Synchronization Example

    #pragma omp parallel private(p)
    {
      funcA(p);
      #pragma omp barrier
      funcB(p);
    }

    function funcA(p[N]){ #pragma acc parallel }
    function funcB(p[N]){ #pragma acc parallel }
  • 22. 22 Synchronization Example

    #pragma omp parallel for
    for (i=0; i<N; i++) {
      #pragma omp critical
      A[i] = rand();
      A[i] *= 2;
    }

    parallelRand(A);
    #pragma acc parallel loop
    for (i=0; i<N; i++) {
      A[i] *= 2;
    }
  • 24. 24 How to Write Portable Code (OMP)

    #ifdef GPU
    #pragma omp target teams distribute parallel for reduction(max:error) collapse(2) schedule(static,1)
    #elif defined(CPU)
    #pragma omp parallel for reduction(max:error)
    #elif defined(SOMETHING_ELSE)
    #pragma omp …
    #endif
    for( int j = 1; j < n-1; j++) {
    #if defined(CPU) && defined(USE_SIMD)
    #pragma omp simd
    #endif
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    Ifdefs can be used to choose particular directives per device at compile time.
  • 25. 25 How to Write Portable Code (OMP)

    #pragma omp
    #ifdef GPU
      target teams distribute
    #endif
      parallel for reduction(max:error)
    #ifdef GPU
      collapse(2) schedule(static,1)
    #endif
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    Creative ifdefs might clean up the code, but still one target at a time.
  • 26. 26 How to Write Portable Code (OMP)

    usegpu = 1;
    #pragma omp target teams distribute parallel for reduction(max:error)
    #ifdef GPU
      collapse(2) schedule(static,1)
    #endif
      if(target:usegpu)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    The OpenMP if clause can help some too (4.5 improves this). Note: This example assumes that a compiler will choose to generate 1 team when not in a target, making it the same as a standard "parallel for."
  • 27. 27 How to Write Portable Code (ACC)

    #pragma acc kernels
    {
      for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
          Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
          error = fmax( error, fabs(Anew[j][i] - A[j][i]));
        }
      }
    }

    Developer presents the desire to parallelize to the compiler; the compiler handles the rest.
  • 28. 28 How to Write Portable Code (ACC)

    #pragma acc parallel loop reduction(max:error)
    for( int j = 1; j < n-1; j++) {
      #pragma acc loop reduction(max:error)
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    Developer asserts the parallelism of the loops to the compiler; the compiler makes decisions about scheduling.
  • 29. 29 Execution Time (Smaller is Better) [bar chart, execution time in seconds] Speedups relative to Original: Original 1.00X; CPU Threaded 5.12X; GPU Threaded 0.12X (893 s); GPU Teams 1.01X; GPU Split 2.47X; GPU Collapse 0.96X; GPU Split Sched 4.00X; GPU Collapse Sched 17.42X; OpenACC (CPU) 4.96X; OpenACC (GPU) 23.03X. Chart annotation: "Same directives (unoptimized)". NVIDIA Tesla K40, Intel Xeon E5-2690 v2 @3.00GHz – See GTC16 S6510 for additional information
  • 30. 30 Compiler Portability (CPU) OpenMP Numerous well-tested implementations • PGI, IBM, Intel, GCC, Cray, … OpenACC CPU implementations beginning to emerge • X86: PGI • ARM: PathScale • Power: Coming soon
  • 31. 31 Compiler Portability (Offload) OpenMP Few mature implementations • Intel (Phi) • Cray (GPU, Phi?) • GCC (Phi, GPUs in development) • Clang (Multiple targets in development) OpenACC Multiple mature implementations • PGI (NVIDIA & AMD) • PathScale (NVIDIA & AMD) • Cray (NVIDIA) • GCC (in development)
  • 33. 33 Conclusions OpenMP & OpenACC, while similar, are still quite different in their approach. Each approach has clear tradeoffs with no clear "winner." It should be possible to translate between the two, but the process may not be automatic.