To view the corresponding video, please visit: http://bit.ly/1iBiW17
This webinar takes you through a case study of accelerating a seismic algorithm on a cluster of AMD GPU compute nodes for a geophysical software provider. Acceleware Product Manager Chris Mason presents a programming example, step-by-step project phase profiling, optimization techniques, a look at the strategy behind taking advantage of the massively parallel GPU architecture, and run time performance results.
Chris has eight years of experience developing commercial applications for the GPU and multi-core CPUs. His previous experience also includes parallelization of algorithms on digital signal processors (DSPs) for cellular phones and base stations. His specialty is in electromagnetic simulations, medical imaging, signal processing and linear algebra.
Sign up for the developer newsletter and learn about future webinars here: http://bit.ly/176wril
For more training options from Acceleware, visit http://bit.ly/MRn6Gn
Share your ideas with other developers at http://bit.ly/P5ohUo
Case Study: Accelerating Full Waveform Inversion via OpenCL on AMD GPUs
About Acceleware
Software and services company specializing in HPC product development, developer training and consulting services
OpenCL training for AMD GPUs
– Progressive lectures and hands-on lab exercises
– Experienced instructors
– Delivered worldwide
High performance consulting
– Feasibility studies
– Porting and optimization
– Code commercialization
The Project
GeoTomo develops high-end geophysical software products that help geophysicists around the world image the subsurface
GeoTomo had a pre-existing, cluster-ready, multi-threaded (OpenMP-based) CPU solution for full waveform inversion (FWI)
GeoTomo required their FWI application to run faster so they could deliver results to their clients more quickly
– Looked to AMD GPUs to potentially accelerate their FWI and approached Acceleware for help to make it happen
Why use GPUs? Performance!
                                    AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                    59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                ~410                  4000                5910
Peak Gflops (double)                ~205                  1000                1480
Total Memory                        >>6 GB                6 GB                6 GB
Power Consumption                   140 W                 274 W               375 W
Gflops per Watt (single precision)  <3                    14.59               15.76
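The Gflops per Watt row follows from dividing peak single-precision throughput by power consumption. A minimal C sketch of that arithmetic, using only the figures from the table; the helper name `gflops_per_watt` is illustrative and does not come from the webinar:

```c
#include <assert.h>

/* Peak throughput per watt: peak Gflops divided by power draw in watts.
   Illustrative helper, not part of the original slides. */
double gflops_per_watt(double peak_gflops, double watts)
{
    assert(watts > 0.0);  /* a zero or negative power draw is meaningless */
    return peak_gflops / watts;
}
```

For example, `gflops_per_watt(5910.0, 375.0)` reproduces the 15.76 quoted for the S10000. For the W9000, 4000 / 274 is about 14.598, so the slide's 14.59 appears to be truncated rather than rounded.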
OpenCL Overview
Parallel computing architecture standardized by the Khronos Group
OpenCL:
– Is a royalty-free standard
– Provides an API to coordinate parallel computation across heterogeneous processors
  Of interest because heterogeneous devices can significantly accelerate certain (primarily data-parallel) workloads
– Defines a cross-platform programming language
– Is used on everything from handheld/embedded devices to supercomputers
The OpenCL Programming Model
Data-parallel portions of an algorithm are executed on the device as kernels
– Kernels are C functions with some restrictions and a few language extensions
– Many (parallel) work-items execute the kernel
The host executes serial code between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling
[Figure: execution alternates between host and device. Each kernel launch on the device runs over an ND range divided into work-groups: the first a 2 x 3 grid, Work-Group (0,0) to (1,2); the second a 3 x 2 grid, Work-Group (0,0) to (2,1).]
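The relationship between work-groups and work-items in an ND range can be sketched in plain C. Within each dimension, a work-item's global index is its work-group index times the work-group size plus its index inside the group, assuming a zero global offset. The struct and function names below are illustrative stand-ins for the values OpenCL exposes through get_group_id(), get_local_id() and get_global_id():

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item inside a 2D ND range (illustrative type). */
typedef struct {
    size_t group[2];  /* work-group coordinates, e.g. (1,2)      */
    size_t local[2];  /* coordinates inside that work-group      */
} WorkItemPos;

/* Global index along dimension d, assuming a zero global offset. */
size_t global_id(const WorkItemPos *p, const size_t local_size[2], int d)
{
    assert(d == 0 || d == 1);
    assert(p->local[d] < local_size[d]);
    return p->group[d] * local_size[d] + p->local[d];
}

/* Flatten a 2D global index into a row-major array offset. */
size_t flat_index(size_t gx, size_t gy, size_t global_width)
{
    assert(gx < global_width);
    return gy * global_width + gx;
}
```

With 16 x 16 work-groups, the work-item at local position (3,0) in Work-Group (1,2) has global index (19,32).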
OpenCL Memory Model
OpenCL kernels have access to four distinct memory regions:
– Global
  Allows read/write access from all work-items in all work-groups
  Persistent across kernels
– Local
  Memory shared by all work-items within a work-group
– Constant
  Region of memory that remains constant (read-only) during the execution of a kernel
– Private
  Memory that is private to a work-item
OpenCL vendors map memory regions into physical resources
– Local, constant and private memory usually have several orders of magnitude less capacity than global memory, but are orders of magnitude faster
OpenCL Syntax – Memory Spaces
Host and device have separate memory spaces
– Data is explicitly moved between them
Typically over PCIe bus
Host functions to allocate, copy, and free memory on device, e.g.
– clCreateBuffer()
– clEnqueueReadBuffer()
– clEnqueueWriteBuffer()
– clReleaseMemObject()
Putting It All Together
Operation: Cx = Ax + Bx (one work-item per element)
A0 A1 A2 A3 A4 A5 A6 A7
B0 B1 B2 B3 B4 B5 B6 B7
C0 C1 C2 C3 C4 C5 C6 C7

__kernel
void VectorAdd(__global float* a,
               __global float* b,
               __global float* c)
{
    // Each work-item has a unique index, typically used to index into arrays
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
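To see what "one work-item per element" means without a GPU, the kernel body can be emulated on the host in plain C. This is only a serial sketch: the loop variable stands in for get_global_id(0), and on a real device the iterations run in parallel with no guaranteed order. The function names are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Body of the VectorAdd kernel for a single work-item.
   idx plays the role of get_global_id(0). */
void vector_add_work_item(size_t idx, const float *a, const float *b, float *c)
{
    c[idx] = a[idx] + b[idx];
}

/* Serial emulation of launching n work-items, one per array element. */
void vector_add_emulated(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t idx = 0; idx < n; ++idx)
        vector_add_work_item(idx, a, b, c);
}
```

Because no work-item reads a value that another work-item writes, the result does not depend on execution order, which is what makes the operation data-parallel.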
Vector Add – Host Code
void VectorAdd(float* aH, float* bH, float* cH, int N)
{
    size_t N_BYTES = N * sizeof(float);
    size_t globalSize = N;  // clEnqueueNDRangeKernel expects a size_t*, not an int*

    // Device management code
    …

    // Allocate memory on device
    cl_mem aD = clCreateBuffer(…, N_BYTES, …);
    cl_mem bD = clCreateBuffer(…, N_BYTES, …);
    cl_mem cD = clCreateBuffer(…, N_BYTES, …);

    // Transfer input arrays to device
    clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
    clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

    // Pass kernel arguments and launch kernel
    …
    clEnqueueNDRangeKernel(…, &globalSize, …);

    // Transfer output array to host
    clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);

    // Free device memory
    clReleaseMemObject(aD);
    clReleaseMemObject(bD);
    clReleaseMemObject(cD);
}
Project Steps
1) Profiling
– Acquired code, datasets and reference benchmarks from GeoTomo
– Set up local machines with near-equivalent hardware, compiled code and confirmed reference benchmark numbers
– Augmented code with timers to determine time spent in parallel regions, areas of interest
2) Feasibility Analysis
– Investigated memory footprint for FWI jobs
  GPU memory limited to 6 GB per card
– Investigated potential speedup / time to port code
  Maximum speedup is bounded by the fraction of run time spent in parallel regions (Amdahl's Law)
  Time to port depends on the feature set
  – E.g. domain decomposition across multiple GPUs
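Amdahl's Law makes the speedup bound concrete: if a fraction p of the run time is in parallel regions and those regions are accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s), which can never exceed 1 / (1 - p). A minimal C sketch follows; the function names are illustrative, and any fractions plugged in are examples rather than measurements from the GeoTomo project:

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction p of the run time
   (0 <= p <= 1) is accelerated by a factor s > 0. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the speedup as s grows without limit: 1 / (1 - p). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

As a purely hypothetical illustration, if the timers showed 90% of run time in parallel regions, no GPU port could deliver more than a 10x overall speedup, and a 10x faster kernel would yield only about 5.3x overall.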
3) Implementation
– Creating testing harnesses
– Kernel implementation
– Resolving hardware driver issues
– Enabling multi-GPU device support
– Optimization iterations
4) Wrap-up
– Delivery of port, along with installation documentation
– Trained GeoTomo developer on OpenCL
Key GeoTomo Optimizations
1) Coalescing
– Changing memory access patterns in the kernels to those best suited for GPUs
  Global memory is accessed via a request for a multi-byte word
  Combine load/store requests from consecutive work-items to reduce the number of requested words
  – Fewer requests mean less contention for global memory
  Make one big multi-word burst request to global memory whenever possible
  – Contiguous bursts mean less global memory overhead
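The effect of the access pattern can be illustrated with a simplified counting model in plain C. It counts how many aligned memory segments a group of consecutive work-items touches when work-item i reads the float at element index i * stride. The 64-byte segment size used in the note below is an assumption chosen for illustration, not a documented property of any AMD GPU, and the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct aligned segments touched when work-item i reads the
   float at element index i * stride. segment_bytes is an assumed request
   size used only for illustration. */
size_t segments_touched(size_t work_items, size_t stride, size_t segment_bytes)
{
    assert(work_items > 0 && stride > 0 && segment_bytes >= sizeof(float));
    size_t count = 0;
    size_t last_segment = (size_t)-1;  /* sentinel: no segment seen yet */
    for (size_t i = 0; i < work_items; ++i) {
        size_t byte_offset = i * stride * sizeof(float);
        size_t segment = byte_offset / segment_bytes;
        /* Offsets only increase, so a segment differing from the previous
           one has not been counted before. */
        if (segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

Under this model, 64 work-items reading consecutive floats touch 4 segments of 64 bytes, while the same work-items reading with a stride of 16 floats touch 64 segments, one per work-item.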
Alternative: Iterating over 2D slices along the slowest dimension
– Each work-item is responsible for one column of the output array
– The work-group caches a 2D plane of the input in local memory
– Work-items keep the inputs that lie along the direction of iteration in registers
– Significantly reduces the number of global memory reads required
[Figure: single work-item view, distinguishing values held in registers from values held in local memory]
Acknowledgement: Paulius Micikevicius, 2009
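The register part of this scheme can be sketched in plain C for a single work-item. The example assumes a simple three-point second-difference stencil along the iteration direction, which is a stand-in and not the actual FWI stencil; the local memory caching of each 2D plane is omitted. The function name and the `plane` stride parameter are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* One work-item marching along the slowest dimension (z) of a volume in
   which consecutive z slices are `plane` elements apart. Computes
   out[z] = in[z-1] - 2*in[z] + in[z+1] for interior z, keeping the three
   values in local variables (the "registers"). Returns the number of reads
   from the input array. */
size_t march_column(const float *in, float *out, size_t nz, size_t plane)
{
    assert(nz >= 3 && plane >= 1);
    size_t reads = 0;
    float behind  = in[0];      ++reads;
    float current = in[plane];  ++reads;
    for (size_t z = 1; z + 1 < nz; ++z) {
        float front = in[(z + 1) * plane];  ++reads;  /* only new value per step */
        out[z * plane] = behind - 2.0f * current + front;
        behind  = current;  /* slide the register window forward */
        current = front;
    }
    return reads;
}
```

Each input value is read once (nz reads in total), whereas reloading all three neighbours at every step would take 3 * (nz - 2) reads for the same column.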
3) Kernel Fusion
– Reduce redundant memory accesses by fusing kernels that operate on the same volume
– Improves performance by reducing redundant global memory reads
4) Kernel Fission
– Improve occupancy by lowering kernel resource requirements (registers) via kernel simplification
– Allows more work-items to run concurrently on the GPU, which better hides global memory latency
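The saving from kernel fusion can be sketched with a host-side emulation in plain C. Two hypothetical kernels that each read the same volume v are compared with a fused version that reads each element once; the operations (squaring and adding one) and the function names are invented for illustration and are not the GeoTomo kernels:

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two separate passes over v, as two kernels would make.
   Returns the number of reads of v. */
size_t run_unfused(const float *v, float *squared, float *shifted, size_t n)
{
    assert(v != NULL && squared != NULL && shifted != NULL);
    size_t reads = 0;
    for (size_t i = 0; i < n; ++i) { squared[i] = v[i] * v[i]; ++reads; }  /* kernel 1 */
    for (size_t i = 0; i < n; ++i) { shifted[i] = v[i] + 1.0f; ++reads; }  /* kernel 2 */
    return reads;
}

/* Fused: one pass loads v[i] once and produces both outputs. */
size_t run_fused(const float *v, float *squared, float *shifted, size_t n)
{
    assert(v != NULL && squared != NULL && shifted != NULL);
    size_t reads = 0;
    for (size_t i = 0; i < n; ++i) {
        float x = v[i];  ++reads;  /* single read shared by both results */
        squared[i] = x * x;
        shifted[i] = x + 1.0f;
    }
    return reads;
}
```

Both versions produce identical outputs, but the fused pass performs n reads of v instead of 2n. The trade-off is that a fused kernel tends to need more registers per work-item, which is the pressure that kernel fission relieves.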
Questions?
Contact Us
Tel: +1 403.249.9099
Email: services@acceleware.com
OpenCL Courses
June 3-6, 2014, Calgary, Canada
Private onsite classes also available
OpenCL Consulting
Feasibility studies
Code commercialization
Porting and optimization
Mentoring