The document discusses porting and optimizing OpenMP applications to AMD APUs using CAPS tools. It provides an overview of CAPS Enterprise, which develops compilers and tools to help customers leverage the performance of multi-core and many-core processors. It then discusses CAPS' OpenACC and OpenMP compilers, which can generate code for AMD GPUs and APUs from directive-based programming models. The document demonstrates how the CAPS OpenMP compiler can analyze OpenMP applications and generate optimized code for execution on AMD APUs, showing speedups for the HydroC benchmark application.
4. COMPANY PROFILE
• Founded in 2002
‒ Large expertise in processor micro-architecture and code generation
‒ Spin-off of French INRIA Research Lab
‒ 30 employees
• Mission: to help its customers leverage the performance of multi/manycore machines
‒ Consulting & engineering services
‒ CAPS OpenACC Compiler & toolchain
‒ Trainings
• Expanding sales worldwide
‒ Resellers in US and APAC (Exxact, Absoft, JCC Gimmick Ltd, Nodasys, …)
4 | PRESENTATION TITLE | NOVEMBER 19, 2013 | CONFIDENTIAL
www.caps-entreprise.com
5. CAPS ECOSYSTEM
Customers
Business Partners
European R&D Projects
7. OPENACC INITIATIVE
• A CAPS, CRAY, Nvidia and PGI initiative
• Open Standard
• A directive-based approach for programming heterogeneous manycore hardware for C and FORTRAN applications
• http://www.openacc-standard.com
8. DIRECTIVE-BASED PROGRAMMING (1)
• Three ways of programming GPGPU applications:
‒ Libraries: ready-to-use acceleration
‒ Directives: quickly accelerate existing applications
‒ Programming languages: maximum performance
10. EXECUTION MODEL
• Among a bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators
‒ Parallel regions
‒ Kernels regions
• Host is responsible for:
‒ Allocating memory space on accelerator
‒ Initiating data transfers
‒ Launching computations
‒ Waiting for completion
‒ Deallocating memory space
• Accelerators execute parallel regions:
‒ Use work-sharing directives
‒ Specify level of parallelization
11. OPENACC EXECUTION MODEL
• Host-controlled execution
• Based on three parallelism levels
‒ Gangs: coarse grain
‒ Workers: fine grain
‒ Vectors: finest grain
[Figure: a device containing several gangs; each gang contains workers, and each worker operates on vectors]
13. OPENACC COMPILERS (1)
CAPS Compilers:
• Source-to-source compilers
• Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
PGI Accelerator:
• Extension of x86 PGI compiler
• Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
Cray Compilers:
• Provided with Cray system only
14. CAPS COMPILERS (2)
CAPS Compilers are source-to-source compilers composed of three parts:
• The directives (OpenACC or OpenHMPP)
‒ Define parts of code to be accelerated
‒ Indicate resource allocation and communication
‒ Ensure portability
• The toolchain
‒ Helps build manycore applications
‒ Includes compilers and target code generators
‒ Isolates hardware-specific computations
‒ Uses the hardware vendor's SDK
• The runtime
‒ Helps to adapt to platform configuration
‒ Manages hardware resource availability
15. CAPS COMPILERS (3)
• Take the original application as input and generate another application source code as output
‒ Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
• Compile the entire hybrid application
• Just prefix the original compilation line with capsmc to produce a hybrid application
$ capsmc gcc myprogram.c
$ capsmc gfortran myprogram.f90
• Compatible with:
‒ GNU
‒ Intel
‒ Open64
‒ Absoft
‒ …
16. CAPS COMPILERS (4)
• CAPS Compilers drive all compilation passes
• Host application compilation
‒ Calls traditional CPU compilers
‒ CAPS Runtime is linked to the host part of the application
• Device code production
‒ According to the specified target
‒ A dynamic library is built
[Figure: compilation flow. Labels: C, C++ and Fortran frontends; extraction module; host code; codelets (Fun #1, Fun #2, Fun #3); instrumentation module; CUDA code generation; OpenCL generation; CPU compiler (gcc, ifort, …); CUDA compilers; OpenCL compilers; executable (mybin.exe); CAPS Runtime; HWA code (dynamic library)]
18. CAPS OPENMP COMPILER
• Automatically turns OpenMP codes into OpenACC
• Diagnoses compatibility issues and suggests code transformations
• Builds accelerated versions based on CUDA or OpenCL
• Works with all platforms
‒ AMD and Nvidia GPUs
‒ AMD APUs
‒ Intel Xeon Phi
CAPS OpenMP Compiler - June 2013
19. CAPS OPENMP COMPILER OVERVIEW
[Figure: three stages labelled Profiling, Analysis and Acceleration]
20. EXTENSION OF THE CAPS OPENACC COMPILER
• Converts OpenMP codes into OpenACC
‒ Examines OpenMP loop nests and checks their OpenACC compatibility
‒ Diagnoses incompatibility issues and proposes advice
‒ Builds an APU version based on OpenCL
• Builds an interactive report
‒ Based on the compiler's static and dynamic analyses
‒ OpenMP to OpenACC kernels view
  o Performance details of each region
‒ Regions' In/Out and data dependencies between regions
‒ Gives the user control over pushing kernels onto the GPU and managing data transfers
21. OPENMP-BASED OPTIMIZATION PROCESS
[Figure: process flow. Application with OpenMP directives → Instrumentation → Traceable application → Execution → Profiling report → Analysis → HTML interactive report → Generation → Accelerated executable]
22. INSTRUMENTATION AND PROFILING PHASES
• Code preprocessing and instrumentation
‒ Identify supported OpenMP regions
  ‒ parallel, for and parallel for constructs
‒ Instrument the code to track data and measure kernel performance
• Instrumented application execution
‒ Based on the user data set
‒ Number of times an OpenMP region is executed
‒ Region's reads and writes
‒ Range of loop iterations
‒ Region performance
23. ANALYSIS PHASE
• Generates an interactive HTML report
‒ Based on the compiler's static and dynamic analyses
‒ Metrics for each OpenMP region
  ‒ Check OpenACC compliance
  ‒ Computation density
  ‒ Coalescing of data accesses
  ‒ Estimated speed-up
  ‒ Memory usage
  ‒ Propose a GPU execution or native OpenMP execution
‒ Data usage and data dependencies graph between regions
  ‒ Determine when transfers are required between kernels
‒ Let the user modify the CPU or GPU execution and data transfer policy
24. HTML INTERACTIVE REPORT (1)
• Get an overview of the regions in a snap!
• Code View: from OpenMP to OpenACC directives
25. HTML INTERACTIVE REPORT (2)
• Performance details of each region
• Analysis conclusions and portability diagnosis
26. HTML INTERACTIVE REPORT (3)
• Regions' inputs/outputs and data dependencies map
27. HTML INTERACTIVE REPORT (4)
• Take control!
‒ Manually push kernels onto accelerators
‒ Manage data transfers
28. CODE GENERATION PHASE
• Same as the CAPS OpenACC Compiler
‒ Based on the analysis report
‒ Generates OpenCL kernels from OpenACC
‒ Automatic data updates to ensure memory coherency
29. FEATURES
• Diagnoses
‒ OpenACC compliance
‒ Computational density
‒ Data access coalescing
‒ Memory usage
‒ Estimated speed-up
• Automatic porting to AMD, NVIDIA, or Intel accelerators
• Accelerates execution or keeps the native OpenMP one
• Gives users control for manual optimizations
31. HARDWARE AND SOFTWARE ENVIRONMENT
• Linux system
‒ AMD SDK 2.8
‒ CAPS Compiler revision 50387
‒ GCC 4.6.1
‒ OpenMPI 1.6.4
• Hardware
‒ AMD A10-5800K APU with Radeon HD Graphics
32. APPLICATIONS STATUS
• Main objective is proof of concept, not performance
‒ Performance limitations of current version of the APU
• HydroC
‒ Most convincing demo
‒ x1.3 speed-up by modifying the execution and transfer policy
33. HYDROC HTML REPORT
35. C2PO MISSION STATEMENT
Guides you through the whole process of porting and tuning applications onto manycore parallel systems
• Combines various CAPS technologies in a modular tool chain
‒ Static and dynamic code analyzers
‒ OpenMP to OpenACC code transformers
‒ Kernel micro-bencher
‒ Plugs into third-party tools: VTune, CUDA profiler
‒ Uses the CAPS Compiler at the final stage to produce the manycore application
C2PO - Oct. 2013
36. C2PO PHASES
1. Generation of an OpenACC skeleton from OpenMP or sequential code
2. Hotspot detection and dataflow analysis
Indicates global and local advice on
‒ Data management/placement between kernels or regions
‒ First ten tips on kernel performance
‒ Data coalescing, parallelism, gridification, loop order
3. Lets you rapidly optimize performance of kernels
‒ Extracts functions, loops or annotated regions
‒ Tune kernel code following C2PO advice
‒ Replay standalone with application data and measure performance gain
‒ Re-inject optimized kernels into application source code
4. Use CAPS Compilers to build for Intel Xeon Phi, NVIDIA or AMD GPUs
[Figure: phase diagram. Labels: Dataflow analysis; OpenACC skeleton generation; Extract loops, functions, regions; Fine tune kernels; User Input]
37. C2PO TOOL CHAIN
[Figure: tool chain diagram. Labels: Sequential Code; OpenMP Code; OpenACC Generator; OpenACC Code; Code skeleton generation; Data Movement Analyzer; Global tuning; Interactive Report; HTML Report; Kernels; ubencher (kernel micro-bencher); Local tuning; CUDA profiler; VTune; Performance analyzer]
38. C2PO OPENACC GENERATION
• From sequential or OpenMP code to first parallelized code
‒ Instrument application and detect hotspots
‒ Generate OpenACC skeleton of kernels from loops
‒ Manage data transfers between kernels
• A report is generated containing
‒ Various performance metrics
‒ Kernel execution
‒ Memory reads and writes
‒ Potential performance gain
‒ Data dependencies and usage between kernels
‒ OpenACC code view
39. C2PO GLOBAL TUNING
• Dynamic tracking of data so as to optimize their movement
‒ Dynamically trace uploads and downloads at execution time
‒ Detect potentially redundant data transfers
Difficult for the compiler to detect any CPU use of data
#openacc data region
// convergence loop
for {
    Upload data()
    Kernels' calls()
    Download data()
}
…
Possible advice: are the following parameters modified by the CPU between the downloads and uploads? If yes, insert OpenACC data region with non modified parameters
40. C2PO TUNING PHASE
• Microbenchmarking mechanism
‒ Loops, functions, user annotated regions are extracted into kernels
‒ Apply optimizations
‒ Replay kernels with original data set without running the whole application
‒ Once tuned, inject kernels into the application source code
• Apply performance analyzers from third party tools (VTune, CUDA profiler)
‒ Synthesizes raw metrics (hardware counters) linked to the source code
41. C2PO OBJECTIVES AND BENEFITS
• Keep one single OpenMP code for various parallel many-core systems (GPUs, APUs, MIC)
• Incrementally port and optimize codes in a modular way
• Use an interactive compiler: advice from dynamic and static analyses at source code level
42. THANK YOU FOR YOUR ATTENTION!
Vasnier Jean-Charles
Sales Engineer, CAPS entreprise
Phone: +1-865-227-6899
Email: jvasnier@caps-entreprise.com
43. GET PERFORMANCE IN NO TIME!
[Figure: bar chart of execution time (seconds) for Hydro and Nbody, with series Original (OpenMP), Generated (auto) and Generated (tweaked); bar labels: 63.42, 45.698, 27.539, 23.417, 12.71, 12.55]
• Hydro: x2 speed-up (after user's tuning)
• Nbody: x6 speed-up in 3 clicks (fully automatic)
‒ Measured on a dual Sandy Bridge E5-2687W with 32 GB RAM and a Kepler K20C driven by CUDA v5.0