Accelerating Full Waveform Inversion via OpenCL on AMD GPUs - Case Study / Webinar

Case Study:
Accelerating Full Waveform Inversion
via OpenCL™ on AMD GPUs
©2014 Acceleware Ltd. All rights reserved.
Chris Mason, Acceleware Product Manager
March 5, 2014

About Acceleware
 Software and services company specializing in HPC
product development, developer training and
consulting services
 OpenCL training for AMD GPUs
– Progressive lectures and hands-on lab exercises
– Experienced instructors
– Delivered worldwide
 High performance consulting
– Feasibility studies
– Porting and optimization
– Code commercialization

Acceleware Software
 Seismic Applications
– Survey design and 3D modeling
– Reverse Time Migration
 Electromagnetics
– FDTD Solver
 Radio Frequency Heating
– Simulation application for the RF
heating of hydrocarbon reserves

Outline
 What is Full Waveform Inversion?
 The Project
 OpenCL
 Optimizations
– Coalescing
– Iterative kernel for stencil
operations
– Fusing kernels together to eliminate
redundant memory accesses
 Key Performance Results

What is Full Waveform Inversion?
 Seismic inversion technique
 Used to build Earth models from recorded seismic data
 Uses a finite-difference solution to the acoustic wave
equation
 Computationally expensive

What is FWI?
From a basic starting point...
... to an accurate velocity model

FWI Algorithm
 1) Initial Model Estimate
 2) Forward Propagate Source → Residuals
 3) Back Propagate Residuals → Gradient
 4) Forward Propagation(s) → Step Length
 5) Update Model
 6) Increase Frequency
Loops: over shots, until convergence, and over frequencies
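The control flow of the FWI algorithm can be sketched in C. This is a toy under stated assumptions: the nesting order (frequencies outermost, then iterations until convergence, then shots) is inferred rather than stated on the slide, the model is a single scalar, and `forward` is a hypothetical linear stand-in for wave propagation. All names are illustrative and none come from GeoTomo's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for wave propagation: the modelled data for a shot is
 * model * source. Real FWI solves the acoustic wave equation here. */
static double forward(double model, double source)
{
    return model * source;
}

/* Toy FWI driver. sources and observed hold num_freqs * num_shots values,
 * laid out as [frequency][shot]. Returns the updated scalar model. */
double toy_fwi(double model, const double *sources, const double *observed,
               size_t num_shots, size_t num_freqs, int max_iters, double tol)
{
    assert(sources && observed && num_shots > 0);
    for (size_t f = 0; f < num_freqs; ++f) {            /* loop over frequencies */
        const double *src = sources + f * num_shots;
        const double *obs = observed + f * num_shots;
        for (int it = 0; it < max_iters; ++it) {        /* loop until convergence */
            double gradient = 0.0, misfit = 0.0;
            for (size_t s = 0; s < num_shots; ++s) {    /* loop over shots */
                double residual = forward(model, src[s]) - obs[s];
                gradient += residual * src[s];          /* back-propagation stand-in */
                misfit += 0.5 * residual * residual;
            }
            if (misfit < tol)
                break;
            /* Step length from extra forward runs along the gradient direction. */
            double num = 0.0, den = 0.0;
            for (size_t s = 0; s < num_shots; ++s) {
                double residual = forward(model, src[s]) - obs[s];
                double perturbed = forward(gradient, src[s]);
                num += residual * perturbed;
                den += perturbed * perturbed;
            }
            if (den == 0.0)
                break;
            model -= (num / den) * gradient;            /* update model */
        }
    }
    return model;
}
```

Because the stand-in operator is linear, the toy converges in one update per frequency; a real inversion would need many iterations, which is where the compute cost comes from.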

FWI Compute Cost
 Cluster size of 10s to 100s of CPU nodes
 Many days of runtime
 Accuracy and quality reduced to keep runtime acceptable

The Project
 GeoTomo develops high-end geophysical software products
that help geophysicists around the world image the
subsurface
 GeoTomo had a pre-existing, cluster-ready, multi-threaded
(OpenMP-based) CPU FWI solution
 GeoTomo needed their FWI application to run faster so they
could deliver results to their clients sooner
– Looked to AMD GPUs to accelerate their FWI and approached
Acceleware for help with the port

Why use GPUs? Performance!
                                      AMD Opteron 6386 SE   AMD FirePro W9000   AMD FirePro S10000
Memory Bandwidth                      59.7 GB/s             264 GB/s            480 GB/s
Peak Gflops (single)                  ~410                  4000                5910
Peak Gflops (double)                  ~205                  1000                1480
Total Memory                          >>6 GB                6 GB                6 GB
Power Consumption                     140 W                 274 W               375 W
Gflops per Watt (single precision)    <3                    14.59               15.76

OpenCL Overview
 Parallel computing architecture standardized by the Khronos
Group
 OpenCL:
– Is a royalty free standard
– Provides an API to coordinate parallel computation across
heterogeneous processors
 Of interest because heterogeneous devices can significantly accelerate certain
(primarily data-parallel) workloads
– Defines a cross-platform programming language
– Used on handheld/embedded devices through supercomputers

OpenCL Programming Model
 Heterogeneous model, including provisions for a host connected to
one or more devices
– Example: GPUs, CPUs
[Figure: one host connected to Device 1 through Device N, each a GPU]

The OpenCL Programming Model
 Data-parallel portions of an
algorithm are executed on the
device as kernels
– Kernels are C functions with some
restrictions and a few language extensions
– Many (parallel) work-items execute the
kernel
 The host executes serial code
between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling
[Figure: execution alternates between host and device; each kernel launch runs an ND range divided into work-groups, shown as a 2 x 3 grid, Work-Group (0,0) to (1,2), and a 3 x 2 grid, Work-Group (0,0) to (2,1)]

OpenCL Memory Model
 OpenCL kernels have access to four distinct memory regions:
– Global
 Allows read/write access from all work-items in all work-groups
 Persistent across kernels
– Local
 Memory that is local to all work-items within a work-group
– Constant
 Region of memory that remains constant (read-only) during the execution of a kernel
– Private
 Memory that is private to a work-item
 OpenCL vendors map memory regions into physical resources
– Local/constant/private memory usually several orders of magnitude lower
capacity but orders of magnitude faster than global memory

OpenCL Syntax – Memory Spaces
 Host and device have separate memory spaces
– Data is explicitly moved between them
 Typically over PCIe bus
 Host functions to allocate, copy, and free memory on device, e.g.
– clCreateBuffer()
– clEnqueueReadBuffer()
– clEnqueueWriteBuffer()
– clReleaseMemObject()

Putting It All Together
Operation: Cx = Ax + Bx (one work-item per element)

    A0 A1 A2 A3 A4 A5 A6 A7
  + B0 B1 B2 B3 B4 B5 B6 B7
  = C0 C1 C2 C3 C4 C5 C6 C7

__kernel
void VectorAdd(__global float* a,
               __global float* b,
               __global float* c)
{
    // Each work-item has a unique index, typically used to index into arrays
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
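To make the work-item indexing concrete without an OpenCL device, the launch can be emulated serially in plain C. This is only a sketch: `emulate_vector_add` and `vector_add_work_item` are hypothetical names, and the loops stand in for the parallel ND range, assuming a 1D range with zero global offset.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the VectorAdd kernel, with the work-item index passed in explicitly
 * instead of being read from get_global_id(0). */
static void vector_add_work_item(size_t idx, const float *a, const float *b, float *c)
{
    c[idx] = a[idx] + b[idx];
}

/* Serial emulation of a 1D ND range: every (group, local) pair is one work-item.
 * global_size must be a multiple of local_size, as OpenCL 1.x requires. */
void emulate_vector_add(const float *a, const float *b, float *c,
                        size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            /* Relation satisfied by get_global_id(0) when the global offset is zero. */
            size_t idx = group * local_size + local;
            vector_add_work_item(idx, a, b, c);
        }
    }
}
```

On a GPU the two loops disappear: the runtime creates the work-groups and work-items, and they execute in parallel in no guaranteed order.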

Vector Add – Host Code
void VectorAdd(float* aH, float* bH, float* cH, int N)
{
    size_t N_BYTES = N * sizeof(float);
    size_t globalSize = N;  // work sizes are passed as size_t, not int

    // Device management code
    …

    // Allocate memory on device
    cl_mem aD = clCreateBuffer(…, N_BYTES, …);
    cl_mem bD = clCreateBuffer(…, N_BYTES, …);
    cl_mem cD = clCreateBuffer(…, N_BYTES, …);

    // Transfer input arrays to device
    clEnqueueWriteBuffer(…, aD, …, N_BYTES, aH, …);
    clEnqueueWriteBuffer(…, bD, …, N_BYTES, bH, …);

    // Pass kernel arguments and launch kernel
    …
    clEnqueueNDRangeKernel(…, &globalSize, …);

    // Transfer output array to host
    clEnqueueReadBuffer(…, cD, …, N_BYTES, cH, …);

    // Free device memory
    clReleaseMemObject(aD);
    clReleaseMemObject(bD);
    clReleaseMemObject(cD);
}

Project Steps
 1) Profiling
– Acquired code, datasets and reference benchmarks from
GeoTomo
– Set up local machines with near-equivalent hardware, compiled
code and confirmed reference benchmark numbers
– Augmented code with timers to determine time spent in parallel
regions, areas of interest

Project Steps
 2) Feasibility Analysis
– Investigated memory footprint for FWI jobs
 GPU memory limited to 6GB per card
– Investigated potential speedup / time to port code
 Maximum speed up determined by time spent in parallel regions
(Amdahl’s Law)
 Time to port dependent on feature set
– E.g. domain decomposition across multiple GPUs
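Both feasibility checks reduce to short calculations. The sketch below shows Amdahl's Law and a single-precision memory footprint estimate. The function names, grid sizes and volume counts are illustrative assumptions, not figures from the GeoTomo project.

```c
#include <assert.h>
#include <stddef.h>

/* Amdahl's Law: overall speedup when a fraction p of the runtime is
 * accelerated by a factor s and the remaining (1 - p) is unchanged. */
double amdahl_speedup(double parallel_fraction, double parallel_speedup)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(parallel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup);
}

/* Bytes needed for num_volumes single-precision grids of nx * ny * nz cells. */
unsigned long long footprint_bytes(size_t nx, size_t ny, size_t nz, size_t num_volumes)
{
    return (unsigned long long)nx * ny * nz * num_volumes * sizeof(float);
}
```

For example, if profiling showed 95% of runtime in parallel regions (a hypothetical figure), `amdahl_speedup(0.95, 20.0)` gives roughly 10.3, and no kernel speedup could push the total past 20. Likewise, eight 512 x 512 x 512 float volumes need 4 GiB, which fits under a 6 GB card limit with little room to spare.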

Project Steps
 3) Implementation
– Creating testing harnesses
– Kernel implementation
– Resolving hardware driver issues
– Enabling multi-GPU device support
– Optimization iterations
 4) Wrapup
– Delivery of port, along with installation documentation
– Trained GeoTomo developer on OpenCL

Key GeoTomo Optimizations
 1) Coalescing
– Changing memory access patterns in the kernels to those best
suited for GPUs
 Global memory is accessed via a request for a multi-byte word
 Combine load/store requests from consecutive work-items to reduce
the number of requested words
– Fewer requests -> less contention for global memory
 Make one big multi-word burst request to global memory whenever
possible
– Contiguous bursts -> less global memory overhead
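The effect of access stride on the number of memory requests can be illustrated with a simple counting model in C. This is a sketch, not a description of any specific AMD memory controller: `segments_touched` is a hypothetical helper, `segment_bytes` is an assumed burst size, and elements are assumed not to straddle segment boundaries.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many aligned memory segments a group of work-items touches when
 * work-item i loads one element at index i * stride_elems.
 * segment_bytes is a hypothetical burst size, not a measured hardware value. */
size_t segments_touched(size_t num_items, size_t stride_elems,
                        size_t elem_bytes, size_t segment_bytes)
{
    assert(num_items > 0 && stride_elems > 0);
    assert(elem_bytes > 0 && segment_bytes >= elem_bytes);
    size_t count = 0;
    size_t last_segment = 0;
    for (size_t i = 0; i < num_items; ++i) {
        size_t segment = (i * stride_elems * elem_bytes) / segment_bytes;
        /* Addresses grow with i, so a new segment id means a new request. */
        if (i == 0 || segment != last_segment) {
            ++count;
            last_segment = segment;
        }
    }
    return count;
}
```

Under this model, 64 work-items reading consecutive floats with 64-byte segments touch 4 segments, while the same work-items reading with a stride of 16 floats touch 64, one per work-item.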

 2) Iterative kernel for stencil operations
[Figure: input volumes convolved (*) with stencil kernels]
– Outputs are weighted combinations of surrounding elements from
input volumes
– Off-axis weights are zero
Acknowledgement: Paulius Micikevicius, 2009

 Naïve implementation would have each work-item read all of
its neighboring elements directly from global memory
– Possible to hit maximum GPU memory bandwidth but redundant
reads hurt performance

 Alternative: Iterating over 2D slices
along slowest dimension
– Single items responsible for column of
output array
– Work-group caches 2D plane of input in
local memory
– Work-items store inputs in direction of
iteration in registers
– Reduces required number of global
memory reads significantly
[Figure: single work-item view, distinguishing values held in registers from values held in local memory]
Acknowledgement: Paulius Micikevicius, 2009
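The register part of this scheme can be emulated serially in C. The sketch below assumes a radius-1, 7-point stencil with weights `c0` (centre) and `c1` (axis neighbours), chosen only for illustration; the function names are hypothetical. Each (x, y) column plays the role of one work-item, and the local-memory tile is not modelled: in-plane neighbours are read directly from the input.

```c
#include <assert.h>
#include <stddef.h>

#define IDX(x, y, z, nx, ny) (((z) * (ny) + (y)) * (nx) + (x))

/* Reference: every output reads all seven inputs straight from the volume. */
void stencil_naive(const float *in, float *out, size_t nx, size_t ny, size_t nz,
                   float c0, float c1)
{
    for (size_t z = 1; z + 1 < nz; ++z)
        for (size_t y = 1; y + 1 < ny; ++y)
            for (size_t x = 1; x + 1 < nx; ++x)
                out[IDX(x, y, z, nx, ny)] = c0 * in[IDX(x, y, z, nx, ny)] + c1 * (
                    in[IDX(x - 1, y, z, nx, ny)] + in[IDX(x + 1, y, z, nx, ny)] +
                    in[IDX(x, y - 1, z, nx, ny)] + in[IDX(x, y + 1, z, nx, ny)] +
                    in[IDX(x, y, z - 1, nx, ny)] + in[IDX(x, y, z + 1, nx, ny)]);
}

/* Each (x, y) pair walks its column along z, the slowest dimension.
 * behind/current/front stand in for registers; on a GPU the in-plane
 * neighbours would come from a local-memory tile shared by the work-group. */
void stencil_sliding(const float *in, float *out, size_t nx, size_t ny, size_t nz,
                     float c0, float c1)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    for (size_t y = 1; y + 1 < ny; ++y) {
        for (size_t x = 1; x + 1 < nx; ++x) {
            float behind = in[IDX(x, y, 0, nx, ny)];
            float current = in[IDX(x, y, 1, nx, ny)];
            for (size_t z = 1; z + 1 < nz; ++z) {
                /* The only new read along the iteration direction. */
                float front = in[IDX(x, y, z + 1, nx, ny)];
                out[IDX(x, y, z, nx, ny)] = c0 * current + c1 * (
                    in[IDX(x - 1, y, z, nx, ny)] + in[IDX(x + 1, y, z, nx, ny)] +
                    in[IDX(x, y - 1, z, nx, ny)] + in[IDX(x, y + 1, z, nx, ny)] +
                    behind + front);
                behind = current;   /* slide the window one plane forward */
                current = front;
            }
        }
    }
}
```

Along z, the sliding version reads each input once per column instead of three times. The saving grows with the stencil radius, since a radius-r stencil keeps 2r + 1 values in the window.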

 3) Kernel Fusion
– Reduce redundant memory accesses by fusing kernels that
operate on the same volume together
– Improves performance by reducing redundant global memory
reads
 4) Kernel Fission
– Improve occupancy by lowering kernel resource requirements
(registers) via kernel simplification
– Allows for more work-items to run concurrently on GPU,
improving masking of global memory latency
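Kernel fusion can be illustrated with two plain C loops standing in for two kernels. The scale-then-add operation and all names below are hypothetical and chosen only to keep the sketch short; they are not taken from the FWI code. Fission is the reverse trade-off and is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two passes, so the intermediate tmp is written and then read back. */
void scale_then_add_unfused(const float *in, const float *bias, float *tmp,
                            float *out, size_t n, float scale)
{
    assert(in && bias && tmp && out);
    for (size_t i = 0; i < n; ++i)      /* stands in for kernel 1 */
        tmp[i] = scale * in[i];
    for (size_t i = 0; i < n; ++i)      /* stands in for kernel 2 */
        out[i] = tmp[i] + bias[i];
}

/* Fused: one pass, the intermediate stays in a local variable
 * (a register on a GPU) and never travels through global memory. */
void scale_then_add_fused(const float *in, const float *bias,
                          float *out, size_t n, float scale)
{
    assert(in && bias && out);
    for (size_t i = 0; i < n; ++i) {
        float scaled = scale * in[i];
        out[i] = scaled + bias[i];
    }
}
```

Per element, the unfused version performs five memory accesses (read `in`, write `tmp`, read `tmp`, read `bias`, write `out`) and the fused version three. The cost is a larger kernel that may need more registers, which is the pressure that fission relieves.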

Performance Results
 FWI 15 Hz, 15 shots
– GPU version: 7997 seconds
– CPU (5 cores per shot): 67086 seconds [GPU 8.4X faster]
– CPU (30 cores per shot): 166948 seconds [GPU 20.9X faster]
 GPU: Sapphire Radeon HD 7970 GHz Edition
– 6GB model
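The bracketed factors follow directly from the reported runtimes. A minimal check in C, with `speedup` as a hypothetical helper name:

```c
#include <assert.h>

/* Speedup of the GPU run relative to a CPU run, from wall-clock seconds. */
double speedup(double cpu_seconds, double gpu_seconds)
{
    assert(cpu_seconds > 0.0 && gpu_seconds > 0.0);
    return cpu_seconds / gpu_seconds;
}
```

`speedup(67086.0, 7997.0)` is about 8.39 and `speedup(166948.0, 7997.0)` is about 20.88, matching the 8.4X and 20.9X on the slide after rounding.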

Performance Results
“Using GPUs we can use higher frequencies and more if not all
of the shots to improve the resolution and coverage.”
James Jackson, President, GeoTomo

Questions?
Contact Us
 Tel: +1 403.249.9099
 Email: services@acceleware.com
OpenCL Courses
 June 3-6, 2014, Calgary, Canada
 Private onsite classes also available
OpenCL Consulting
 Feasibility studies
 Code commercialization
 Porting and optimization
 Mentoring