17. TOP500 excerpt (Rank | Site | System | Cores | Rmax | Rpeak; Rmax/Rpeak in TFlop/s)
1 | National University of Defense Technology, China | Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P (NUDT) | 3120000 | 33862.7 | 54902.4
2 | DOE/SC/Oak Ridge National Laboratory, United States | Titan - Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x (Cray Inc.) | 560640 | 17590.0 | 27112.5
21. Titan: Cray XK7 Compute Node
[Figure: XK7 compute node block diagram - CPU and GPU connected via HT3 / PCIe Gen2, Gemini interconnect links in X/Y/Z]
XK7 Compute Node Characteristics:
- CPU: AMD Series 6200 (Interlagos)
- Host memory: 32GB, 1600 MT/s DDR3
- GPU: NVIDIA Tesla X2090, 6GB GDDR5 memory (Keplers in the final installation)
- Gemini high-speed interconnect
http://en.wikipedia.org/wiki/Titan_(supercomputer)
22. TOP500 excerpt (Rank | Site | System | Cores | Rmax | Rpeak; Rmax/Rpeak in TFlop/s)
6 | Texas Advanced Computing Center/Univ. of Texas, United States | Stampede - PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P (Dell) | 462462 | 5168.1 | 8520.1
10 | National Supercomputing Center in Tianjin, China | Tianhe-1A - NUDT YH MPP, Xeon X5670 6C 2.93GHz, NVIDIA 2050 (NUDT) | 186368 | 2566.0 | 4701.0
16 | National Supercomputing Centre in Shenzhen (NSCS), China | Nebulae - Dawning TC3600 Blade System, Xeon X5650 6C 2.66GHz, Infiniband QDR, NVIDIA 2050 (Dawning) | 120640 | 1271.0 | 2984.3
49. CPU: Latency-Oriented Design
[Figure: CPU block diagram - Control, ALUs, Cache, DRAM]
o Large caches
  - Turn long memory-access latencies into short cache-hit latencies
o Powerful arithmetic units
  - Reduce the latency of individual operations
o Sophisticated control logic
  - Branch prediction and speculative execution to reduce branch latency
  - Data prefetching to reduce data-access latency
84. // scale the calculation across threads requested
// need to set environment variables
// OMP_NUM_THREADS and KMP_AFFINITY
#pragma omp parallel for private(j,k)
for (i = 0; i < numthreads; i++)
{
    // each thread will work on its own array section
    // calc offset into the right section
    int offset = i * LOOP_COUNT;

    // loop many times to get lots of calculations
    for (j = 0; j < MAXFLOPS_ITERS; j++)
    {
        // scale 1st array and add in the 2nd array
        for (k = 0; k < LOOP_COUNT; k++)
        {
            fa[k+offset] = a * fa[k+offset] + fb[k+offset];
        }
    }
}
125. Calling vector addition - plain C code
int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h (N elements) omitted
    vecAdd(A_h, B_h, C_h, N);
}

Calling vector addition - CUDA host C code
int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h (N elements) omitted
    vecAddKernel<<<ceil(n/256), 256>>>(A_d, B_d, C_d, n);
}
136. Vector addition - CUDA kernel code
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}
137. Vector addition - CUDA host code
int vecAdd(float* A, float* B, float* C, int n)
{
    // allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}
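The allocations and copies the slide omits follow the standard CUDA runtime pattern. A minimal sketch of what that body might look like (error handling omitted; A_d, B_d, C_d are the device pointers assumed above):

int size = n * sizeof(float);
float *A_d, *B_d, *C_d;

// allocate device memory and copy the inputs host-to-device
cudaMalloc((void**)&A_d, size);
cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
cudaMalloc((void**)&B_d, size);
cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
cudaMalloc((void**)&C_d, size);

// ... launch vecAddKernel as shown above ...

// copy the result device-to-host and free the device memory
cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);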
145. Data Parallelism Model
o OpenCL uses a data-parallel execution model; in this respect it corresponds directly to CUDA.
o An OpenCL program consists of two parts: the kernels that execute on OpenCL devices, and the host program that manages the execution of those kernels.
146. Correspondence between OpenCL and CUDA
OpenCL                 | CUDA
Kernel                 | Kernel
Host Program           | Host Program
NDRange (Index Space)  | Grid
Work Item              | Thread
Work Group             | Block
165. Work Item, Work Group, NDRange
o When a kernel function is launched, its code is executed by work items; a work item corresponds to a CUDA thread.
o Work items are organized into work groups, which correspond to CUDA thread blocks.
o OpenCL work items are identified by a global dimension index range, the NDRange. This index space defines the work items and how data is mapped onto them.
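To make the correspondence concrete, here is a sketch of the earlier vecAdd example written as an OpenCL kernel (the kernel name and argument list are illustrative, not taken from the slides):

// OpenCL counterpart of the CUDA vecAddKernel shown earlier:
// each work item (CUDA: thread) handles one element, and
// get_global_id(0) plays the role of threadIdx.x + blockDim.x * blockIdx.x
__kernel void vecAddKernel(__global const float* A,
                           __global const float* B,
                           __global float* C,
                           int n)
{
    int i = get_global_id(0);
    if (i < n) C[i] = A[i] + B[i];
}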
166. Correspondence between OpenCL and CUDA
OpenCL                 | CUDA
Kernel                 | Kernel
Host Program           | Host Program
NDRange (Index Space)  | Grid
Work Item              | Thread
Work Group             | Block
190. OpenCL Design and Programming Guide for the Intel Xeon Phi Coprocessor
http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor
196. Nokia's WebCL Prototype
o Nokia open sourced their prototype in May 2011 (LGPL).
o Web-based interactive photo editor utilizing the GPU for image processing, through WebGL & Nokia's OpenCL bindings for JavaScript.
o YouTube Demo: http://www.youtube.com/watch?v=9BF7zzUM1kY
o Add-on for Firefox 4 on Win/Linux (Firefox 5 coming soon)
o Visit http://webcl.nokiaresearch.com for binaries, source code, demos and tutorials.
197. Samsung WebCL Prototype
o Samsung open sourced their prototype WebCL implementation for WebKit in July 2011 (BSD license).
o Allows JavaScript to run computations on the GPU.
o Demos on YouTube: http://www.youtube.com/user/SamsungSISA (demos use WebGL for 3D rendering).
o Code available at http://code.google.com/p/webcl/
o For comparison, the same computations were also done in pure JavaScript; WebCL gave performance increases of up to 100x.
218. WebCL: Create Kernel
<script>
var programSource = getProgramSource("squareProgram");
// JavaScript function using DOM APIs
var program = context.createProgram(programSource);
program.build();
var kernel = program.createKernel("square");
</script>
219. WebCL: Run Kernel 1
<script>
…
var inputBuf = context.createBuffer(WebCL.MEM_READ_ONLY,
    Float32Array.BYTES_PER_ELEMENT * count);
var outputBuf = context.createBuffer(WebCL.MEM_WRITE_ONLY,
    Float32Array.BYTES_PER_ELEMENT * count);
var data = new Float32Array(count);
// populate data …
queue.enqueueWriteBuffer(inputBuf, data, true);
// last arg indicates API is blocking
220. WebCL: Run Kernel 2
kernel.setKernelArg(0, inputBuf);
kernel.setKernelArg(1, outputBuf);
kernel.setKernelArg(2, count, WebCL.KERNEL_ARG_INT);
var workGroupSize = kernel.getWorkGroupInfo(devices[0],
    WebCL.KERNEL_WORK_GROUP_SIZE);
queue.enqueueNDRangeKernel(kernel, [count], [workGroupSize]);
221. WebCL: Run Kernel 3
queue.finish();
// this API blocks
queue.enqueueReadBuffer(outputBuf, data, true);
// last arg indicates API is blocking
</script>
222. WebCL: Image Object Creation
o From Uint8Array()
<script>
var bpp = 4; // bytes per pixel
var pixels = new Uint8Array(width * height * bpp);
var pitch = width * bpp;
var clImage = context.createImage(WebCL.MEM_READ_ONLY,
    { channelOrder: WebCL.RGBA,
      channelType: WebCL.UNORM_INT8,
      size: [width, height],
      pitch: pitch });
</script>
223. WebCL: Image Object Creation
o From <img> or <canvas> or <video>
<script>
var canvas = document.getElementById("aCanvas");
var clImage = context.createImage(WebCL.MEM_READ_ONLY,
    canvas); // format, size from element
</script>
224. WebCL: Vertex Buffer Initialization
<script>
// WebGL
var points = new Float32Array(NPOINTS * 3);
var glVertexBuffer = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, glVertexBuffer);
gl.bufferData(gl.ARRAY_BUFFER, points, gl.DYNAMIC_DRAW);
// WebCL
var clVertexBuffer = context.createFromGLBuffer(
    WebCL.MEM_READ_WRITE, glVertexBuffer);
kernel.setKernelArg(0, NPOINTS, WebCL.KERNEL_ARG_INT);
kernel.setKernelArg(1, clVertexBuffer);
</script>
231. OpenCL Design and Programming Guide for the Intel Xeon Phi Coprocessor
http://software.intel.com/en-us/articles/opencl-design-and-programming-guide-for-the-intel-xeon-phi-coprocessor
232. Why is this paper needed?
o While OpenCL is a portable programming
model, the performance portability is not
guaranteed. Traditional GPUs and the Intel
Xeon Phi coprocessor have different HW
designs. Their differences are such that they
benefit from different application optimizations.
For example, traditional GPUs rely on the
existence of fast shared local memory, which
the programmer needs to program explicitly.
The Intel Xeon Phi coprocessor includes a fully
coherent cache hierarchy, similar to regular
CPU caches, which automatically speeds up
memory accesses.
233. o Another example: while some traditional GPUs
are based on HW scheduling of many tiny
threads, Intel Xeon Phi coprocessors rely on
the device OS to schedule medium size
threads. These and other differences suggest
that applications usually benefit from tuning to
the HW they’re intended to run on.
234. Will I need to have different OpenCL
optimizations for different devices?
o Not necessarily. Will you add a small #ifdef in
your code to run 50% faster on Intel Xeon Phi
coprocessor? Will you duplicate a 1000-line file
for that? Would you do it for only 10%
speedup? Or, maybe you would prefer adding
the optimization unconditionally and pay 10%
slowdown on other devices for 50%
improvement on Intel Xeon Phi coprocessor? It
is totally your decision. In some cases, you will
need to make the tradeoff between cross
device performance and maintainability of your
OpenCL application.
235. o We really encourage developers to explore the
performance potential of the Intel Xeon Phi
coprocessor, using the guidelines available in
this document and then decide based on the
performance numbers. This document doesn’t
intend to answer all the questions, but instead
give you some tools to answer them yourself.
237. o An Intel Xeon Phi coprocessor contains many
cores, each with a 512-bit vector arithmetic
unit, capable of executing SIMD vector
instructions. An L1 cache is included in each
core (32 KB data + 32 KB instructions). An L2
cache is associated with each core (512 KB
combined Data and Instr, L1 D cache is
inclusive). A high-speed interconnect allows
data transfer between the L2 caches and the
memory subsystem. Each core can execute up
to four HW threads simultaneously.
238. o This simultaneous multi-threading helps hide
instruction and memory latencies. OpenCL
hides most of these details from the
programmer.
239. Key Intel Xeon Phi Coprocessor
Performance Aspects
o Multi-threading parallelism
o Intel Xeon Phi coprocessor HW includes many
cores depending on the SKU (I assume 60 in
this paper). Each core is capable of running up
to four HW threads. In most cases, populating
the 240 threads with tasks is essential to
maximize performance. The exact number of
HW threads can be queried with the
clGetDeviceInfo(CL_DEVICE_MAX_COMPUTE_UNITS)
interface.
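A minimal sketch of that query (assuming a valid cl_device_id in device; error checking omitted):

cl_uint num_compute_units = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(num_compute_units), &num_compute_units, NULL);
// On the coprocessor, each compute unit corresponds to one HW thread,
// so this reports the thread count the paper assumes to be 240.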
240. o In Core Vectorization
o The vector size in the Intel Xeon Phi
coprocessor is 512 bit wide SIMD. Typically,
this vector represents 8 double precision
floating point numbers, or 16 single precision
floating point numbers. Each Intel Xeon Phi
coprocessor core can issue a single vector
computation instruction per cycle.
241. o PCI Express* (PCIe) Bus Interface
o The Intel Xeon Phi coprocessor resides on the
PCIe bus. Transferring data over the PCIe bus
has the highest latency and the lowest
bandwidth. As you would do in any other PCIe
device, you should reduce this traffic to a
minimum.
242. o Memory subsystem
o The Intel Xeon Phi coprocessor includes three
levels of memory (GDDR, L2 cache, and L1
cache). The following table includes important
cache information:
              | L1 (Data + Instructions) | Shared L2
Total Size    | 32 KB + 32 KB            | 512 KB
Miss Latency  | 15-30 cycles             | 500-1000 cycles
243. o Since the Intel Xeon Phi coprocessor is an in-order machine, the latency of memory
accesses has significant impact on software
performance. Luckily, the programmer can
reduce these latencies. Prefetches are one of
the tools that can help hide memory latencies.
We will discuss it in more detail later.
244. Data Access Pattern
o Accessing memory consecutively is the fastest
way to access memory on the Intel Xeon Phi
coprocessor. It improves cache efficiency,
reduces the number of TLB (Translation
Lookaside Buffer) misses, and allows the HW
prefetcher to kick in.
245. Mapping the OpenCL constructs
to Intel Xeon Phi coprocessor
o Understanding how the key OpenCL constructs
are implemented on the Intel Xeon Phi
coprocessor will help you better design your
application to take advantage of the
coprocessor’s HW. It will also help you avoid
the coprocessor’s performance pitfalls.
246. o Conceptually, at initialization time, the OpenCL
driver creates 240 SW threads and pins them
to the HW threads (for a 60-core
configuration). Then, following a
clEnqueueNDRange() call, the driver schedules
the work groups (WG) of the current NDRange
on the 240 threads. A WG is the smallest task
being scheduled on the threads. So calling
clEnqueueNDRange() with less than 240 WGs,
leaves the coprocessor underutilized.
247. o The OpenCL compiler creates an optimized
routine that executes a WG. This routine is
built from up to three nested loops, as shown
in the following pseudo code:
__Kernel ABC(…)
  For (int i = 0; i < get_local_size(2); i++)
    For (int j = 0; j < get_local_size(1); j++)
      For (int k = 0; k < get_local_size(0); k++)
        Kernel_Body;
248. o Note that the innermost loop is used for
dimension zero of the NDRange. This directly
impacts the access pattern of your
performance critical code. It also impacts the
implicit vectorization efficiency.
249. o The OpenCL compiler implicitly vectorizes the
WG routine based on dimension zero loop, i.e.,
the dimension zero loop is unrolled by the
vector size. So the WG code with vectorization
looks like:
__Kernel ABC(…)
  For (int i = 0; i < get_local_size(2); i++)
    For (int j = 0; j < get_local_size(1); j++)
      For (int k = 0; k < get_local_size(0); k += VECTOR_SIZE)
        Vector_Kernel_Body;
250. o The vector size of Intel Xeon Phi coprocessor is
16, regardless of the data types used in the
kernel. However, in the future, we may
increase the vectorization size to allow more
instruction level parallelism.
251. Exposing algorithm parallelism
o While the OpenCL specification provides
various ways to express parallelism and
concurrency, some of them will not map well to
Intel Xeon Phi coprocessor. We will show you
how the key OpenCL constructs are mapped to
the coprocessor, so you can design your
application to exploit its parallelism.
252. Multi-threading
o To get good utilization of the 240 HW threads,
it’s best to have more than 1000 WGs per
NDRange. Having 180‒240 WGs per NDRange
will provide basic thread utilization; however,
the execution may suffer from poor load balancing and high invocation overhead.
o Recommendation: Have at least 1000 WGs
per NDRange to optimally utilize the Intel Xeon
Phi coprocessor HW threads. Applications with
NDRange of 100 WGs or less will suffer from
serious under-utilization of threads.
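A hypothetical host-side sketch of sizing the NDRange accordingly (variable names are illustrative): with a local size of 256 and a global size of 1024 * 256, the NDRange contains 1024 work groups, comfortably above the ~1000-WG recommendation.

size_t local_size  = 256;
size_t global_size = 1024 * local_size;   // 1024 work groups
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,             // work_dim
                                    NULL,          // no global offset (see slide 264)
                                    &global_size,
                                    &local_size,
                                    0, NULL, NULL);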
253. o Single WG execution duration also impacts the
threading efficiency. Lightweight WGs are also
not recommended, as these may suffer from
relatively high overheads.
254. Vectorization
o OpenCL on Intel Xeon Phi coprocessor includes
an implicit vectorization module. The OpenCL
compiler automatically vectorizes the implicit
WG loop over the work items in dimension zero
(see example above). The vectorization width
is currently 16, regardless of the data type
used in the kernel. In future implementations,
we may vectorize even 32 elements. As
OpenCL work items are guaranteed to be
independent, the OpenCL vectorizer needs no
feasibility analysis to apply vectorization.
255. o However, the vectorized kernel is only used if
the local size of dimension zero is greater than
or equal to 16. Otherwise, the OpenCL runtime
runs scalar kernel for each of the work items. If
the WG size at dimension zero is not divisible
by 16, then the end of the WG needs to be
executed by scalar code. This isn’t an issue for
large WGs, e.g., 1024 items at dimension zero,
but is for WGs of size 31 on dimension zero.
256. o Recommendation 1: Don’t manually vectorize
kernels, as the OpenCL compiler is going to
scalarize your code to prepare it for implicit
vectorization.
o Recommendation 2: Avoid using a WG size
that is not divisible by 32 (16 will work for
now).
257. Work-Item-ID nonuniform
control flow
o In this section, we explain the difference
between uniform and nonuniform control flow,
in the context of implicit vectorization. It is
important to understand because uniform
control flow may have small negative impacts
on performance. But nonuniform control flow
creates significant performance overhead
within the innermost NDRange dimension. The
uniformity with respect to the vectorized loop
(dimension zero) matters.
258. Uniform branch example:
o A branch is uniform if it is statically guaranteed
that all work items within a WG execute the
same side of the branch.
// isSimple is a kernel argument
int LID = get_local_id(0);
if (isSimple == 0)
    Res = buff[LID];
259. Nonuniform branch example:
int LID = get_local_id(0);
if (LID == 0)
    Res = -1;

Another uniform branch example:
int LID = get_local_id(1);
// Uniform, as the IF is based on dimension one, while vectorization is on dimension zero.
if (LID == 0)
    Res = -1;
260. o While vectorizing, the compiler has to linearize
(flatten) any code dominated by nonuniform
control flow via predication. The first and major
cost of predication is the execution of both
sides of the branch. Additional penalties result
from the masked execution.
o Recommendation: Avoid branches, especially
those that are nonuniform on dimension zero.
261. // Assuming the following original kernel code:
int gid = get_global_id(0);
if (gid % 32 == 0)
    Res = HandleEdgeCase();
else
    Res = HandleCommonCase();

// After vectorization (and predication), the code looks like:
int16 gid = get16_global_id(0);
uint mask;
mask = compare16int((gid % broadcast16(32)), 0);
res_if = HandleEdgeCase();
res_else = HandleCommonCase();
Res = (res_if & mask) | (res_else & not(mask));
// Note that both the IF and the ELSE are executed
// for all of the work items.
262. Data Alignment
o For various reasons, memory access that is
vector-size-aligned is faster than unaligned
memory access. In the Intel Xeon Phi
coprocessor, OpenCL buffers are guaranteed to
start on a vector-size-aligned address.
However, this only guarantees that the first
WG starts at an aligned address. To guarantee
that all WGs start at a properly aligned
location, the WG size (local size) needs to be
divisible by 16, or even by 32 if you want to
take advantage of potential product
improvements.
263. o Calling EnqueueNDRange with local size NULL,
lets the OpenCL driver choose the best WG size
for you. The driver should be smart enough to
choose a WG size matching the alignment
requirements. However, the programmer needs
to make sure that the global size is divisible by
VECTOR_SIZE and the quotient is big enough
to allow the runtime an efficient split into WGs. “Big
enough” is 1,000,000 in cases of a small kernel
and 1000 in the case of a huge kernel including
a 1000 iteration loop in the kernel. Also
NDRange offsetting can break the alignment.
264. o Recommendation 1: Don’t use NDRange
offset. If you have to use an offset, then make
it a multiple of 32, or at least a multiple of 16.
o Recommendation 2: Use local size that is a
multiple of 32, or at least of 16.
265. Design your algorithm to benefit from
the Intel Xeon Phi coprocessor
memory subsystem
o Since Intel Xeon Phi coprocessor is an in-order
machine, it is very sensitive to memory
latencies. Memory-related optimizations, at the
application level, can lead to 2X-4X
performance speedup.
266. Intra WG data reuse
o Designing your application to maximize the
amount of data reuse from the caches is the
first memory optimization to apply. However,
only certain algorithms need to reuse data. For
example, adding two matrices involves no
opportunity to reuse any data. But multiplying
two matrices (GEMM) involves significant data
reuse. Therefore, it is an obvious candidate for
blocking/tiling optimization. Please see more
details in the
Intel SDK for OpenCL Applications XE –
Optimization Guide.
267. o To benefit from data reuse, you need to take
into account the WG implicit loop(s), as
described earlier in this document. The
programmer’s control over these loops is
through the local size definition. The
programmer can add additional loop(s)
(explicit) in the kernel.
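As a sketch of what such an explicit loop can buy (illustrative only, not from the guide): a simple register-blocked matrix multiply where each work item computes TILE adjacent rows of C, so every element loaded from B is reused TILE times while dimension-zero accesses stay consecutive.

#define TILE 4
__kernel void gemm_rowblock(__global const float* A,
                            __global const float* B,
                            __global float* C,
                            int N)
{
    int col  = get_global_id(0);              // consecutive along dimension zero
    int row0 = get_global_id(1) * TILE;       // each work item owns TILE rows
    float acc[TILE] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (int k = 0; k < N; k++) {
        float b = B[k * N + col];             // loaded once, reused TILE times
        for (int t = 0; t < TILE; t++)
            acc[t] += A[(row0 + t) * N + k] * b;
    }
    for (int t = 0; t < TILE; t++)
        C[(row0 + t) * N + col] = acc[t];
}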
268. Cross WG data reuse
o Cross-group data reuse is a greater challenge.
Currently, OpenCL on Intel Xeon Phi
coprocessor doesn’t allow enough control over
the WGs scheduling. Therefore, cross WG data
reuse is almost impossible. We will keep this
section as a placeholder for future
development.
269. Data access pattern
o Consecutive data access usually allows the best
memory system performance. When one
considers consecutive memory access,
understanding the structure of the WG implicit
loops is crucial. The innermost implicit loop is
the loop over dimension zero. If your kernel
introduces no additional (explicit) loop, then
you should try having most of your memory
accesses consecutive with that implicit
dimension zero loop in mind. For example:
270. The following code accesses the 2D buffers consecutively in memory (recommended):
__kernel ABC(…){
    int ID1 = get_global_id(1);
    int ID0 = get_global_id(0);
    res[ID1][ID0] = param1 * buffer[ID1][ID0];
}

The following code doesn’t access the 2D buffers consecutively in memory (not recommended):
__kernel ABC(…){
    int ID1 = get_global_id(1);
    int ID0 = get_global_id(0);
    res[ID0][ID1] = param1 * buffer[ID0][ID1];
}
271. o The second code example scans the 2D buffers in “column major” order. With vectorization, this causes two problems: 1) The input vector data must be gathered along the column from 16 consecutive rows, and the result is stored via scatter instructions to 16 different rows; both operations perform slowly. 2) Memory access is not consecutive from iteration to iteration. Both of these increase the pressure on the TLB and prevent prefetching.
272. Simple one-dimension example
Consecutive access (recommended):
int id = get_global_id(0);
A[id] = B[id];

Non-consecutive access (not recommended):
int id = get_global_id(0);
A[id*4] = B[id*4];

Recommendation: Use ID(0) to index memory consecutively within the row.
With an explicit 2D buffer: buffer[ID1][ID0].
With 2D indexing into a 1D buffer: buffer[STRIDE * ID1 + ID0].
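A minimal sketch of the 2D-indexing-into-a-1D-buffer pattern (kernel and parameter names are illustrative): work items along dimension zero touch consecutive addresses.

__kernel void scale_rows(__global const float* buffer,
                         __global float* res,
                         float param1,
                         int STRIDE)               // row length in elements
{
    int ID1 = get_global_id(1);                    // row
    int ID0 = get_global_id(0);                    // column, the vectorized dimension
    res[STRIDE * ID1 + ID0] = param1 * buffer[STRIDE * ID1 + ID0];
}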
273. o If your kernel includes an explicit loop, then
you should remember that the implicit
vectorization is still based on the ID(0) implicit
loop. So accessing buffers through the OpenCL
IDs should follow the recommendation above
(buffer[ID1][ID0]). This will keep vector access
consecutive and efficient. Accessing buffers
through the inner loop index (idx), will be
consecutive within the inner loop (buffer[ID1]
[idx]) and will be uniform to the vectorized
loop, which is excellent! However, mixing ID0
and idx should be avoided. For example,
buffer[ID0][idx] is strided to the vectorized
loop, therefore will result in gather/scatter.
274. Data layout
o Pure SOA (Structure-of-Arrays) data layout
results in simple and efficient vector loads and
stores. However, spatial locality is lower, the
pressure on the TLB is higher, and the number
of pages used simultaneously can be higher.
o With AOS (Array-of-Structures) data layout,
the generated vectorized kernel needs to load
and store data via gather and scatter
instructions, which are less efficient than
simple vector load and store. However, for
random access pattern, AOS layout is often
more efficient than SOA because of better
spatial locality.
275. o Please remember that random access of SOA
data layout creates gather and scatter
instructions too.
o The third option is AOSOA—an array of
structures of small arrays. The size of the small
arrays should be 32 for the Intel Xeon Phi
coprocessor. This would allow vectorization of
up to 32 elements.
struct Point32 { float x[32], y[32], z[32]; };
__kernel void ABC(__global Point32* ptrData)
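A hypothetical sketch of how a kernel might index this AOSOA layout (names are illustrative): work item gid finds its chunk of 32 and its lane within that chunk, so neighbouring work items in dimension zero still read consecutive floats.

typedef struct { float x[32], y[32], z[32]; } Point32;

__kernel void scale_x(__global Point32* ptrData, float s)
{
    int gid   = get_global_id(0);
    int chunk = gid / 32;     // which Point32 structure
    int lane  = gid % 32;     // position inside the small arrays
    ptrData[chunk].x[lane] = s * ptrData[chunk].x[lane];
}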
276. o AOSOA allows efficient vectorization using
simple vector loads, while not overloading the
TLB, nor spreading the accesses across many
pages. The problem of AOSOA is the readability
of the code. Most people don’t naturally think
in AOSOA terms.
277. Data prefetching
o With the Intel Xeon Phi coprocessor being an
in-order machine, data prefetching is an
essential way to bring data closer to the cores,
in parallel with other computations. Loads and stores are executed serially, with no parallelism; for example, any two load instructions are executed entirely serially. The prefetch instruction is the exception: it is executed in parallel with other instructions, including other prefetch instructions. Therefore, a prefetch instruction that hasn’t finished on time can still improve performance, as its memory request executes in parallel with other instructions.
278. o A cache miss means a thread stall plus a few
cycles penalty to reissue the instruction. The
Intel Xeon Phi coprocessor includes a simple
automatic HW prefetcher to the L2. It takes
some time for the HW prefetcher to kick in, and
it needs to restart on every 4 KB virtual page
boundary.
279. o Automatic SW prefetches to the L1 and L2 are
inserted by the OpenCL compiler for data
accessed in future iterations, whenever it
figures out (through analysis) that such can be
inserted and provide benefit. The beta release
includes partial support for automatic SW
prefetching.
280. o Manual prefetching can be inserted by the
programmer into the OpenCL kernel, via the
prefetch built-in. Currently, manual prefetches
are inserted exactly at the location and to the
address that the programmer requested, but
these are limited to L2 prefetches. In the
future, the OpenCL compiler may add both L2
and L1 prefetches for the PREFETCH built-in. It
may also improve the location and stride
indicated by the programmer. Manual
prefetches should be inserted at least 500
cycles before the data is going to be actually
used. Usually only the main input and output
buffers need to be prefetched.
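A minimal sketch of placing the prefetch built-in inside an explicit loop (kernel name, parameters, and the prefetch distance are illustrative; the distance should be tuned so the request completes roughly 500+ cycles before the data is used):

#define PREFETCH_DISTANCE 8   // hypothetical: rows to look ahead

__kernel void scale_cols(__global const float* in,
                         __global float* out,
                         float a,
                         int width, int height)
{
    int ID0 = get_global_id(0);                    // column, consecutive in memory
    for (int idx = 0; idx < height; idx++) {
        // request data for a later iteration while the current one computes
        if (idx + PREFETCH_DISTANCE < height)
            prefetch(&in[(idx + PREFETCH_DISTANCE) * width + ID0], 1);
        out[idx * width + ID0] = a * in[idx * width + ID0];
    }
}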
281. Local memory and Barriers
o While traditional GPUs include Shared Local
Memory (SLM), which requires manual
management, Intel Xeon Phi coprocessor
includes a two-level cache system (automatic),
similar to most modern CPUs. Therefore, using
the OpenCL SLM provides no benefit on the
Intel Xeon Phi coprocessor. Furthermore, local
memory in the coprocessor is allocated on the
regular GDDR memory and is supported by the
cache system like any other memory.
Therefore, it introduces additional overhead in
terms of redundant data copy and
management.
282. o Recommendation: Avoid using Shared Local
Memory on the Intel Xeon Phi coprocessor.
o The Intel Xeon Phi coprocessor includes no
special HW support for barriers. Therefore,
barriers are emulated by OpenCL on the
coprocessor. We recommend avoiding the use
of barriers. Also, splitting the kernel into two
separate kernels will be slower than a barrier,
so we don’t recommend taking this path either.
o As of the beta release, the combination of a barrier and a WG size not divisible by 16 results in execution of a scalar kernel. Please avoid this combination. Currently, we don’t see a
283. Summary
o While designing your OpenCL application for
Intel Xeon Phi coprocessor, you should pay
careful attention to the following aspects:
1. Include enough work groups within each
NDRange—a minimum of 1000 is
recommended.
2. Avoid lightweight work groups. Don’t hesitate to use the maximum local size allowed (currently 1024). Keep the WG size a multiple of 32.
3. Avoid ID(0) dependent control flow. This allows
efficient implicit vectorization.
284. 5. Prefer consecutive data access.
6. Data layout preferences: AOS for sparse
random access; pure SOA or AOSOA(32)
otherwise.
7. Exploit data reuse through the caches within
the WG—tiling/blocking.
8. If auto-prefetching didn’t kick in, use the
PREFETCH built-in to bring the global data to
the cache 500‒1000 cycles before use.
9. Don’t use local memory. Avoid using barriers.