3. WHAT IS SPLIT COMPILATION?
An application starts as a source program:
1) A high-level compiler (HLC) generates HSAIL
2) The HSAIL is shipped to the target machine
3) A second compiler (a finalizer) turns the HSAIL into the target ISA
Unlike traditional compilers, where optimization is contained in one compiler or done twice,
HSAIL allows optimization to be split into two parts:
the heavy lifting goes to the HLC, the quick finish goes to the finalizer.
HSAIL provides ways for an HLC and a finalizer to cooperate. For instance:
HSAIL provides a fixed number of registers.
HSA implementations might support a different number.
When the HLC spills registers, it can use special operations that let the finalizer know
where extra registers could be used instead.
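To make the flow concrete, here is a conceptual sketch of the two-stage pipeline in C. The function names (hlc_compile_to_hsail, finalize_hsail_to_isa) and types are hypothetical illustrations, not a real HSA runtime API; the point is only where each stage runs.

/* Conceptual sketch of split compilation; all names here are hypothetical. */
#include <stddef.h>

typedef struct { const void *bytes; size_t size; } Blob;

/* Stage 1: the HLC. Heavy, aggressive optimization; little or no target knowledge.
   Runs once, on the developer's machine.  (Stub body for illustration.) */
static Blob hlc_compile_to_hsail(const char *source)
{
    (void)source;
    Blob hsail = { NULL, 0 };   /* would hold HSAIL/BRIG */
    return hsail;
}

/* Stage 2: the finalizer. Fast and simple, but knows the exact target ISA.
   Runs on the end user's machine, at install or load time.  (Stub body.) */
static Blob finalize_hsail_to_isa(Blob hsail, const char *target_device)
{
    (void)hsail; (void)target_device;
    Blob isa = { NULL, 0 };     /* would hold device ISA */
    return isa;
}

int main(void)
{
    Blob hsail = hlc_compile_to_hsail("kernel source...");  /* ship this blob */
    Blob isa   = finalize_hsail_to_isa(hsail, "some-gpu");   /* per-device finish */
    (void)isa;                                               /* hand off to the driver */
    return 0;
}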
4. SPLIT COMPILATION
(MEANS THERE HAVE TO BE WAYS TO PASS INFORMATION FROM THE HLC TO THE FINALIZER)
HLC – High level compiler
Lots of time
Info from source
Lots of aggressive optimizations
But limited (or no) knowledge of target
Finalizer
Very little time (we estimate that it will take close to linear time)
No info beyond what is in HSAIL (no back doors, almost)
Cannot be updated regularly (must be close to bug free)
Simple optimizations only
But knows the target
Exactly how to split some optimizations is still an open problem
5. WHY A VIRTUAL ISA - WHY NOT JUST TARGET THE REAL ISA?
Targeting the real ISA gains some performance; a virtual ISA cannot use every hardware trick.
Targeting HSAIL instead:
Better time to market (because hardware is finished faster)
No legacy boat anchor (a real ISA means one vendor / one chip family)
Can fix hardware bugs in software
Old and new code just works on old and new machines
Allows hardware innovation under the table
Features not in HSAIL are not exposed, and are hard to access
6. DEVELOPMENT TOOLS AT THE HSAIL LEVEL
Today the need for a complete tool chain for each core, each with its own technology, switches etc., is a
significant maintenance problem.
Debuggability, reproducibility.
Because the same application needs to run on different pieces of hardware, current source code contains
many conditional preprocessing directives.
Programmers rely on compiler intrinsics and ad-hoc command-line arguments to drive the
optimization. This severely impacts code readability and productivity, and the application
binary tested and debugged on a workstation is different from the one that eventually runs on the system.
Platform openness.
Independent software vendors rarely have access to the tool chains needed to program the
most powerful parts of the system, namely the DSPs and hardware accelerators. Virtualization
can make the whole platform programmable, opening opportunities to third-party high-performance
applications.
Performance through time to market
Because of the finalizer, last-minute fixes can happen after the chip is finished. This means that
the time to release a new part goes down. Less time per generation translates to better
performance.
7. GOALS OF HSAIL
1. Can support all of C++ (open up the GPU to mass programming, not only for specialists)
2. Avoid constant change (do not change the spec every chip)
3. Support accurate IEEE floating point math
4. Target lots of different machines
5. Allow for packed operations, SSE and friends, bytes/shorts/ints/doubles etc
6. Allow packed forms to save power
7. Make the model understandable
8. Make the finalizer fast (around linear time)
9. Make the finalizer simple (do not need monthly updates)
10. Less ambiguity in the spec (little undefined behavior)
11. Get good performance (little need to write in ISA)
12. Support all of OpenCL™ and C++ AMP™
13. Can ship linkable libraries in HSAIL
14. Clean up all nits in AMDIL
15. Allow the use of chip specific acceleration when it is a good idea
8. HSAIL – LOTS OF NEW FEATURES
Lots of features not in OpenCL and C++ AMP
Enough to implement C++
Exceptions / heterogeneous compute
Flat address space (work items on the GPU and agents on the CPU)
Because HSAIL can be written by hand, these features can be exposed early
Fine-grain barriers that work inside control flow, so you can implement producer-consumer models
Lots of cross-wave operations, so you can quickly move data between lanes without loads and stores
Spec is available on the web site
The memory model shows how the CPU and GPU can cooperate
Support for image operations
10. WAVEFRONTS
Most developers will not care about wavefronts
Similar to cache line sizes
Experts can get good performance if they code to the cache line size
The compiler has to avoid breaking the developer's model
HSAIL formalizes the notion of wavefronts
you can tell which work item goes into which wavefront
you can write producer consumer parallelism between work groups
11. AN EXAMPLE (IN OPENCL™)
__kernel void vec_add (__global const float *a, __global const float *b, __global float *c,
                       const unsigned int n)
{
    // Get our global thread ID
    int id = get_global_id(0);
    // Make sure we do not go out of bounds
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}
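For completeness, here is a minimal host-side sketch (not part of the original slides) of how this kernel might be launched through the standard OpenCL C API; error checking is omitted for brevity.

#include <CL/cl.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    /* The vec_add kernel from above, as a source string. */
    const char *src =
        "__kernel void vec_add(__global const float *a, __global const float *b,"
        "                      __global float *c, const unsigned int n) {"
        "    int id = get_global_id(0);"
        "    if (id < n) c[id] = a[id] + b[id];"
        "}";

    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * (float)i; }

    cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx     = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Device buffers for the three arrays. */
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    /* Build the kernel and bind its arguments. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);
    cl_uint n = N;
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(k, 2, sizeof(cl_mem), &d_c);
    clSetKernelArg(k, 3, sizeof(cl_uint), &n);

    /* One work-item per element. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[1] = %f\n", c[1]);   /* expect 3.0 */
    return 0;
}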
13. MEMORY SEGMENTS
Memory is split into 7 segments
kernarg, global, arg, readonly, private, group, and spill
There is a single flat address space that contains everything, but it is often advantageous to tell the finalizer
which segment to use
Load/store machine with registers
Some segments are used for intent –
– Spill indicates that the slot was used by the HLC for register spilling
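As an illustration (mine, not from the slides), the OpenCL C address-space qualifiers give a rough feel for where an HLC would typically place data; the exact placement is ultimately the compiler's choice.

// Illustrative OpenCL C kernel; comments note the HSAIL segment each object
// would typically land in (placement is up to the HLC/finalizer).
__constant float coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};    // readonly segment

__kernel void scale_demo(__global const float *in,        // the pointer itself: kernarg segment
                         __global float *out,              // the data it points to: global segment
                         __local float *tile,              // group segment (shared by the work-group)
                         const unsigned int n)             // scalar argument: kernarg segment
{
    int id  = get_global_id(0);
    int lid = get_local_id(0);
    float t = (id < n) ? in[id] : 0.0f;                    // a register or the private segment
    tile[lid] = t;                                         // group memory
    barrier(CLK_LOCAL_MEM_FENCE);
    if (id < n)
        out[id] = tile[lid] * coeff[0];
    // The arg and spill segments have no source-level spelling: the HLC uses
    // them internally for call arguments and register spills.
}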
14. SEGMENTS
[Diagram: an NDRange contains work-groups, and work-groups contain work-items.
Each work-group has group memory; each work-item has private memory.
Arg locations and spill locations are in private memory.
The agent provides a flat address space:
  group memory is within flat
  private memory is within flat
  arg memory is within private
  spill memory is within private
  privateRW is within private
  kernarg is within global
  readonly is within global]
15. HSAIL FEATURES: REGISTERS AND TYPES
Four classes of registers: c/s/d/q
  c: 1 bit
  s: 32 bits
  d: 64 bits
  q: 128 bits
Both binary (BRIG) and text formats; the binary format is fully specified
120 opcodes (Java bytecode has 200)
Types:
  Brigs8, Brigs16, Brigs32, Brigs64, Brigu8, Brigu16, Brigu32, Brigu64,
  Brigf16, Brigf32, Brigf64,
  Brigb1, Brigb8, Brigb16, Brigb32, Brigb64, Brigb128,
  BrigROImg, BrigRWImg, BrigSamp,
  Brigu8x4, Brigs8x4, Brigu8x8, Brigs8x8, Brigu8x16, Brigs8x16,
  Brigu16x2, Brigs16x2, Brigf16x2, Brigu16x4, Brigs16x4, Brigf16x4,
  Brigu16x8, Brigs16x8, Brigf16x8,
  Brigu32x2, Brigs32x2, Brigf32x2, Brigu32x4, Brigs32x4, Brigf32x4,
  Brigu64x2, Brigs64x2, Brigf64x2
16. WHY DOES HSAIL LOOK THIS WAY?
A SIMT model (single instruction, multiple threads) says that every work-item has its own program counter
So branch instructions look pretty natural
A vector machine model looks like SSE: one program counter and vector registers; this is like real AMD GPU
hardware
SIMT or Vector?
17. PROS FOR SIMT
We want HSAIL to outlast one hardware generation (so at the very least the vector length
and real types/number of registers should not get exposed).
Even with a vector model the finalizer would still have to map to the real vector length. We
expect this to mean that a vector finalizer would not have a much simpler time
We want to support lots of machines, including ones not built by AMD
We can add cross-lane operations (like count) to the SIMT model, so the line between SIMT
and vector is blurry
We want to open up to 3rd-party compilers and tools, all of which can support SIMT but few
of which can support vector
Work groups are a much more developer-friendly model than wavefronts
Natural path for OpenCL™ / CUDA™ / C++ AMP™
Graphics is SIMT, so the pressure to make future hardware work well for SIMT
is immense
18. PROS FOR VECTOR
Might get more performance; we estimated <10% even in good cases
Simpler for expert programmers to reason out what is going on
This was a big one for us: the exact rules on wavefront re-convergence are
hidden in the SIMT model but clear in the vector one
In the vector model you can prove some results about code, which cannot be
done when the finalizer reorders things
On the other hand, constructs like C++ virtual functions become very confusing on
a vector machine when the original program was SIMT
We think the performance deficits are a reasonable trade for broader adoption,
and in many cases can be closed by well written libraries for the cases that really
matter.
19. HSAIL AND FUNCTIONS
{
    arg_u32 %input1;
    arg_u32 %input2;
    // …
    call &fnWithTwoArgs ()(%input1, %input2); // call of a function
    // all work-items call the same function
}
// ...
HSAIL supports
Virtual functions,
Signatures
Jumps via a register
Load address of code
20. HSAIL PROVIDES A SERIES OF OPTIMIZATION CONTROLS
Sometimes you know if an operation is uniform over a range
ld_f32_width(8) $s1, address
  Work-items in groups of 8 will read the same value
call_width(64) $s1
  Even though this is a call through a register, work-items in groups of 64 will call the same function
ld_equiv(3)_u32 $s1, address
  Marks a block of memory that cannot alias with other blocks
21. HSAIL COMPARED TO LLVM-IR
HSAIL is low level
assumes the finalizer does not do as much optimization
no phi nodes
finite register count
no SSA input
Parallelism is built into HSAIL
No need to hack the meaning of a barrier
No structures or other high level features
22. HSAIL COMPARED TO JAVA BYTE CODE
HSAIL is more focused on performance
HSAIL has registers, not a stack
HSAIL has parallelism built in
HSAIL is not as focused on security (does not require a formal validator)
Not quite write once
HSAIL is less concerned about code compression
23. HSAIL COMPARED TO AMDIL
HSAIL supports lots of complex control flow
AMDIL provides structured control flow only
irreducible flow needed exponential compile time
No (or limited) graphics features
just enough for C++ AMP™ and OpenCL™
Four sizes of registers (1/32/64/128 bit) vs. 4x32 vector registers (no more .x/.y/.z/.w fields)
HSAIL is extensible (per-vendor / per-chip extensions)
Different cost model
24. HSAIL COMPARED TO PTX
More formal model of execution
possible to write valid programs that pass data between work groups
More formal model of memory - acq/rel semantics
Less of the semantics is defined by the device
Support for libraries and complex calls
Interaction between agents and HSAIL code,
shared memory, support for GPU to call CPU services
Per vendor extension mechanism
Clean separation of core features and per device operations
Support for linking/ libraries/ separate compilation
Removal of hard to finalize features
no predication
25. MEMORY MODEL
A memory model defines how writes by one work-item or agent become visible to other work-items and agents.
For many implementations, better performance will result if either the hardware or the finalizer is allowed to reorder
code. For example, the finalizer might find it more efficient if a write is moved later in the program; so long as the
program semantics do not change, the finalizer is free to do so. Once a store is deferred, other work-items and
agents will not see it until the store actually happens. Hardware might provide a cache that also defers writes.
The HSAIL memory model is based on acquire/release
An ld_acq creates a “downward fence.” This means that normal loads and stores can be moved (by the
implementation) down past the ld_acq but no memory operation (load, store, or atomic) can be moved up above the
ld_acq.
A st_rel creates an “upward fence.” That means that normal loads and stores can be moved (by the
implementation) above the st_rel but no memory operation (load, store, or atomic) can be moved down after the
st_rel.
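The same idea expressed with C11 atomics (a host-side analogy I am adding, not HSAIL): the release store keeps earlier memory operations above it, and the acquire load keeps later memory operations below it.

#include <stdatomic.h>

int        payload;          /* ordinary data */
atomic_int flag;             /* synchronization variable, initially 0 */

void producer(void)          /* plays the role of st_rel */
{
    payload = 42;                                           /* cannot move below the release */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* "upward fence" */
}

int consumer(void)           /* plays the role of ld_acq */
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* "downward fence": spin until published */
    return payload;                                         /* cannot move above the acquire; sees 42 */
}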
26. ORIGINAL AXIOMATIC DEFINITION [LAMPORT 1979]
A single processor (core) is sequentially consistent if
“the result of an execution is the same as if the operations had been executed in the order specified
by the program.”
A multiprocessor is sequentially consistent if
“the result of any execution is the same as if the operations of all processors (cores) were executed in
some sequential order, and the operations of each individual processor (core) appear in this sequence
in the order specified by its program.”
27. SEQUENTIAL CONSISTENCY (SC) OPERATIONAL DEFINITION
[Diagram: P simple processors connected to a single memory.]
Operation: pick one ready row, do it, and repeat until done
  Processor 0 ready to load/store memory
  …
  Processor P-1 ready to load/store memory
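To make the operational definition concrete, here is a toy C interpreter for it (my sketch, not from the slides): each processor is a list of loads and stores, and the machine repeatedly picks one ready processor and performs its next operation against a single shared memory.

#include <stdio.h>
#include <stdlib.h>

enum op_kind { LOAD, STORE };
struct op   { enum op_kind kind; int addr; int value; };  /* value is used by STORE */
struct proc { const struct op *ops; int count; int pc; };

static int memory[16];                                    /* the single shared memory */

static void run_sc(struct proc *procs, int nprocs)
{
    for (;;) {
        int ready[16], nready = 0;
        for (int p = 0; p < nprocs; p++)                  /* find processors with work left */
            if (procs[p].pc < procs[p].count)
                ready[nready++] = p;
        if (nready == 0)
            break;

        int p = ready[rand() % nready];                   /* pick one ready row ... */
        const struct op *o = &procs[p].ops[procs[p].pc++];
        if (o->kind == STORE) {                           /* ... do it, and repeat */
            memory[o->addr] = o->value;
            printf("P%d: store %d -> [%d]\n", p, o->value, o->addr);
        } else {
            printf("P%d: load  [%d] = %d\n", p, o->addr, memory[o->addr]);
        }
    }
}

int main(void)
{
    /* Two processors: store to one location, then load the other.  Under SC,
       at least one of the loads must observe a 1. */
    const struct op p0[] = { {STORE, 0, 1}, {LOAD, 1, 0} };
    const struct op p1[] = { {STORE, 1, 1}, {LOAD, 0, 0} };
    struct proc procs[2] = { { p0, 2, 0 }, { p1, 2, 0 } };
    run_sc(procs, 2);
    return 0;
}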
28. SEQUENTIAL CONSISTENCY
Any SC implementation must only permit executions allowed by SC operational model (SC executions).
The SC operational model is NOT a performance model.
SC implementation performance != counting operational model steps
The operational model hides most implementation techniques
pipelining, out-of-order, speculation, caches, cache coherence, …
HW must functionally behave "as if" it were the operational model
HW designers & verifiers often most comfortable with operational model
Each processor is eventually selected
29. HSAIL OPERATIONAL DEFINITION
[Diagram: P simple processors, each with a reorder buffer, connected to a single (host) memory.
Writes can get held in the reorder buffer; reads can be satisfied from it.]
Operation: pick one ready row, do it, and repeat until done
  Processor 0 ready to load/store memory
  …
  Processor P-1 ready to load/store memory
Write values may stay in the reorder buffer; reads may come out of the reorder buffer
Rules to move values between the reorder buffer and memory:
  rel = release the values from the buffer, acq = acquire new values
30. WITHIN ONE WORK ITEM
SEQUENCED BEFORE
This is the order operations appear in the source
What you see looking at the code
single work item - “as-if-serial” view
- each operation appears to happen in the order it appears in the source
X sb Y
- X and Y in same work item,
- X sequenced before Y
multiple work items and agents make this more complex
31. BETWEEN WORK ITEMS
X >> Y
What the memory system sees
memory system must see X before Y
global visibility order
this is transitive
if X >> Y and Y >> Z, then X >> Z
32. RULES, SOMETIMES
X sb Y => X >> Y:
If X sb Y and both access the same address, then X >> Y
Different addresses:
  If there is a barrier or sync between X and Y, then X >> Y
  If X is an acquire:
    ld_acq, atomic_acq, atomicNoRet_acq, atomic_ar, atomicNoRet_ar
    then X >> Y
    This is one-sided (Y cannot move before X)
The general rule is: use acquire and release when you want to force order
Acquire and release may take extra time, but they give you sequential consistency
Compilers can trade performance for simple cross work-item communication
33. RULES, CONTINUED
If Y is a release:
  st_rel, atomic_ar, or atomicNoRet_ar, then X >> Y
  st_rel is another one-way fence
Consider a critical region (you can use acquire and release to form critical sections):
  ld_acq x
  assorted memory operations
  st_rel y
No operations can move out of the region, but operations can move in
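A sketch of this critical-region pattern using C11 atomics (my analogy, not HSAIL): the acquire at the start and the release at the end keep the protected operations inside the region, while unrelated operations may still move in.

#include <stdatomic.h>

atomic_int lock;             /* 0 = free, 1 = held; initially 0 */
int shared_counter;

void critical_update(void)
{
    /* "ld_acq x": take the lock; later operations cannot move above this */
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;        /* a failed CAS rewrote expected; reset and retry */

    shared_counter++;        /* assorted memory operations: cannot leak out */

    /* "st_rel y": drop the lock; earlier operations cannot move below this */
    atomic_store_explicit(&lock, 0, memory_order_release);
}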
34. AN EXAMPLE SB ORDER DOES NOT FORCE MEMORY ORDER
Work-item 0 Work-item 1
------------------- ------------------------------------
@h0: st_u32 1, [&a] @k0: st_u32 1, [&b]
@h1: ld_u32 $s0, [&b] @k1: ld_u32 $s1, [&a]
Initially, &a = 0 and &b = 0. The result $s0 == 0 and $s1 == 0 is allowed.
The only constraints come from readers having to follow writers: if $s1 == 0 then k1 (the reader)
has to happen before h0 changes the value, and likewise h1 has to happen before k0.
(Synchronization would add further constraints, but this example has none.)
One global visibility order that produces this result is h1 >> k1 >> h0 >> k0.
Even though h0 appears before h1 in sequenced-before order, there is no
requirement that the operations reach the memory system in that (sequenced-before) order.
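The same experiment written with C11 relaxed atomics and two threads (a sketch I am adding; the 0/0 result is permitted by the model, though any particular run or machine may not exhibit it).

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int a, b;             /* both initially 0 */
int s0, s1;

void *work_item_0(void *unused)
{
    (void)unused;
    atomic_store_explicit(&a, 1, memory_order_relaxed);   /* @h0 */
    s0 = atomic_load_explicit(&b, memory_order_relaxed);  /* @h1 */
    return NULL;
}

void *work_item_1(void *unused)
{
    (void)unused;
    atomic_store_explicit(&b, 1, memory_order_relaxed);   /* @k0 */
    s1 = atomic_load_explicit(&a, memory_order_relaxed);  /* @k1 */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, work_item_0, NULL);
    pthread_create(&t1, NULL, work_item_1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("s0 = %d, s1 = %d\n", s0, s1);                  /* s0 == 0 && s1 == 0 is allowed */
    return 0;
}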
35. EXAMPLE 2 REGISTER DEPENDENCE DOES NOT FORCE MEMORY ORDER
Work-item 0 Work-item 1
----------------------- ---------------------
@h0: ld $s0, [&a] @j0: st 20, [100]
@h1: ld $s1, [$s0] @j1: st_rel 100, [&a]
Initially, &a and contents of location 100 = 0.
$s1 == 0 and $s0 == 100 is allowed
If $s1 == 0 then h1 >> j0. If $s0 == 100 then j1 >> h0.
Because this seems to violate dependence order, it is useful to consider how this can
come about.
Work-item 0 is allowed to prefetch the value for load h1. One reason it might do this is that code before
these operations reads address 96, and the implementation reads in large cache lines.
Later, work-item 0 reads the new value of &a, which is 100. Then it reads the value of
location 100, but because there is no synchronization, it can use the previously prefetched value of 0.
36. EXAMPLE 3
Work-item 0 Work-item 1
@h0: ld_acq $s0, [&a] @j0: st 20, [100]
@h1: ld $s1, [$s0] @j1: st_rel 100, [&a]
Initially, &a = 0 and the contents of location 100 = 0.
HSAIL does not allow $s1 == 0 and $s0 == 100.
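Examples 2 and 3 side by side as a C11 sketch (my analogy, not HSAIL): with a relaxed load of &a the stale 0 at location 100 may still be observed, while upgrading it to an acquire load, like the ld_acq of example 3, rules that outcome out.

#include <stdatomic.h>

atomic_int data[128];        /* data[100] plays the role of location 100; initially 0 */
atomic_int a;                /* plays the role of &a; initially 0 */

void work_item_1(void)
{
    atomic_store_explicit(&data[100], 20, memory_order_relaxed); /* @j0 */
    atomic_store_explicit(&a, 100, memory_order_release);        /* @j1: st_rel */
}

int work_item_0(int use_acquire)
{
    /* @h0: example 2 uses a plain (relaxed) load, example 3 uses ld_acq */
    int s0 = atomic_load_explicit(&a, use_acquire ? memory_order_acquire
                                                  : memory_order_relaxed);
    if (s0 == 0)
        return -1;                         /* nothing published yet */
    /* @h1: with relaxed, this may still read the stale 0;
       with acquire, s0 == 100 guarantees the 20 is visible */
    return atomic_load_explicit(&data[s0], memory_order_relaxed);
}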