HSAIL
Norm Rubin
Fellow


An introduction to the HSA Intermediate language
Disclaimer & Attribution
        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
        and typographical errors.

        The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
        to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
        differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation
        to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make
        changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

        NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
        RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
        INFORMATION.

        ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
        DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
        OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
        EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

        AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this
        presentation are for informational purposes only and may be trademarks of their respective owners.

        OpenCL is a trademark of Apple Inc. used with permission by Khronos.

        DirectX is a registered trademark of Microsoft Corporation.

        © 2012 Advanced Micro Devices, Inc. All rights reserved.



2 | hsail AFDS | June 11, 2012
WHAT IS SPLIT COMPILATION?

An app starts as a source program:
1) A high level compiler (HLC) generates HSAIL
2) The HSAIL is shipped to the target machine
3) A second compiler (a finalizer) turns HSAIL into ISA


Unlike traditional compilers, where optimization is contained in one part or done twice,
HSAIL allows optimization to be split into two parts.
The heavy lifting goes to the HLC; the quick finish goes to the finalizer.


HSAIL provides ways for an HLC and a finalizer to cooperate. For instance:
   HSAIL provides a fixed number of registers.
   HSA implementations might support a different number.
   When the HLC spills registers, it can use special operations that let the finalizer know
   where to use extra registers.
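The cooperation described above can be pictured with a toy finalizer pass. This is only a sketch, not the HSAIL mechanism itself: the function name `place_spill_slots`, the slot names, and the policy of handing out spare hardware registers in slot order are all assumptions made for illustration.

```python
def place_spill_slots(spill_slots, hsail_registers, hardware_registers):
    """Decide where each HLC spill slot lives on one target.

    Hypothetical model: the HLC compiled against `hsail_registers`
    registers and marked the overflow as spill slots. A target with
    more registers promotes slots to the spare ones; the rest stay
    in memory.
    """
    spare = max(0, hardware_registers - hsail_registers)
    placement = {}
    for index, slot in enumerate(spill_slots):
        if index < spare:
            # Spare hardware registers are numbered after the HSAIL ones.
            placement[slot] = ("register", hsail_registers + index)
        else:
            placement[slot] = ("memory", slot)
    return placement


# A target with 2 registers beyond the HSAIL count promotes two slots.
print(place_spill_slots(["spill0", "spill1", "spill2"], 8, 10))
```

Because the HLC marks which slots are spills, the finalizer can make this choice without redoing register allocation.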


SPLIT COMPILATION
(MEANS THERE HAVE TO BE WAYS TO PASS INFORMATION FROM HLC TO FINALIZER)

HLC – High level compiler
  Lots of time
  Info from source
  Lots of aggressive optimizations
  But limited (or no) knowledge of target

Finalizer
   Very little time (we estimate that it will take close to linear time)
   No info that is not in HSAIL (almost no back doors)
   Cannot update regularly (must be close to bug free)
   Simple optimizations only
   But knows the target

   Exactly how to split some optimizations is still an open problem



WHY A VIRTUAL ISA - WHY NOT JUST TARGET THE REAL ISA?


Gains performance: better time to market (because hardware is finished faster)

Loses performance (cannot use every hardware trick)

No legacy boat anchor

A real ISA means one vendor / one chip family

Can fix hardware bugs in software

Old and new code just works on old and new machines

Allows hardware innovation under the table

Features not in HSAIL are not exposed, and are hard to access

Development tools at HSAIL level
    Today the need for a complete tool chain for each core, each with its own technology, switches etc., is a
    significant maintenance problem.

  Debuggability, reproducibility.
    Because the same application needs to run on different pieces of hardware, current source code contains
    many conditional preprocessing directives.
    Programmers rely on compiler intrinsics and ad-hoc command line arguments to drive the
    optimization. This severely impacts code readability and productivity, and the application
    binary tested and debugged on a workstation is different from the one that eventually runs on the system.

  Platform openness.
     Independent software vendors rarely have access to the tool chains needed to program the
     most powerful parts of the system, namely the DSPs and hardware accelerators. Virtualization
     can make the whole platform programmable, opening opportunities to third-party high-performance
     applications.

  Performance through time to market.
     Because of the finalizer, last-minute fixes can happen after the chip is finished. This means that
     the time to release a new part goes down. Less time per generation translates to better
     performance.

GOALS OF HSAIL
1.    Can support all of C++ (open up the GPU to mass programming, not only for specialists)
2.    Avoid constant change (do not change the spec every chip)
3.    Support accurate IEEE floating point math
4.    Target lots of different machines
5.    Allow for packed operations (SSE and friends): bytes/shorts/ints/doubles, etc.
6.    Allow packed forms to save power
7.    Make the model understandable
8.    Make the finalizer fast (around linear time)
9.    Make the finalizer simple (do not need monthly updates)
10. Less ambiguity in the spec (little undefined behavior)
11. Get good performance (little need to write in ISA)
12. Support all of OpenCL™ and C++ AMP™
13. Can ship linkable libraries in HSAIL
14. Clean up all nits in AMDIL
15. Allow the use of chip specific acceleration when it is a good idea

HSAIL – LOTS OF NEW FEATURES
Lots of features not in OpenCL and C++ AMP
   Enough to implement C++
   Exceptions/ heterogeneous compute
   Flat address space (work items on the GPU and agents on the CPU)
Because HSAIL can be written by hand, these features can be exposed early


Fine-grain barriers that work inside control flow, so you can implement producer-consumer models


Lots of cross wave operations – so you can quickly move data between lanes without loads and stores


Spec is available on the web site


The memory model shows how the CPU and GPU can cooperate


Support for image operations

PARALLELISM MODEL




WAVEFRONTS


Most developers will not care about wavefronts
Similar to cache line sizes
     Experts can get good performance if they code to the cache line size
     The compiler has to avoid breaking the developer's model


HSAIL formalizes the notion of wavefronts
     You can tell which work-item goes into which wavefront
     You can write producer-consumer parallelism between work groups
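As a small illustration of being able to tell which work-item goes into which wavefront, the sketch below assumes that work-items are packed into wavefronts in order of their flattened ID within the work group. The slides do not state that rule, and the function name and the wavefront size of 64 are likewise assumptions.

```python
def wavefront_of(flat_work_item_id, wavefront_size):
    """Return (wavefront index, lane within that wavefront).

    Assumes consecutive flattened work-item IDs fill one wavefront
    before the next one starts.
    """
    return divmod(flat_work_item_id, wavefront_size)


# With a hypothetical wavefront size of 64:
print(wavefront_of(0, 64))    # (0, 0)
print(wavefront_of(130, 64))  # (2, 2)
```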




AN EXAMPLE (IN OPENCL™)


__kernel void vec_add(__global const float *a, __global const float *b, __global float *c,
                      const unsigned int n)
{
        // Get our global thread ID
        int id = get_global_id(0);
        // Make sure we do not go out of bounds
        if (id < n) {
          c[id] = a[id] + b[id];
        }
}
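For readers less familiar with OpenCL, here is a sequential Python rendering of what the kernel computes. It is a sketch only: the loop stands in for work-items that really run in parallel, and the `global_size` parameter is an assumption standing in for the launch size, which may be larger than `n` (hence the bounds check).

```python
def vec_add(a, b, n, global_size):
    """Reference semantics of the vec_add kernel, one loop pass per work-item."""
    c = [0.0] * n
    for work_item_id in range(global_size):   # get_global_id(0)
        if work_item_id < n:                  # the kernel's bounds check
            c[work_item_id] = a[work_item_id] + b[work_item_id]
    return c


# Four work-items launched for three elements: the last one does nothing.
print(vec_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], n=3, global_size=4))
```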




VECTOR ADD C[0:N-1] = A[0:N-1] + B[0:N-1]
version 1:0:$small;
kernel &__OpenCL_vec_add_kernel(
                  kernarg_u32 %arg_a,
                  kernarg_u32 %arg_b,
                  kernarg_u32 %arg_c,
                  kernarg_u32 %arg_n)
{ @__OpenCL_vec_add_kernel_entry:
                  // BB#0: // %entry
                  ld_kernarg_u32 $s0, [%arg_n];
                  workitemaid $s1, 0;
                  cmp_lt_b1_u32 $c0, $s1, $s0;
                  ld_kernarg_u32 $s0, [%arg_c];
                  ld_kernarg_u32 $s2, [%arg_b];
                  ld_kernarg_u32 $s3, [%arg_a];
                  cbr $c0, @BB0_2;
                  brn @BB0_1;
@BB0_1: // %if.end
                  ret;
@BB0_2: // %if.then
                  shl_u32 $s1, $s1, 2;
                  add_u32 $s2, $s2, $s1;
                  ld_global_f32 $s2, [$s2];
                  add_u32 $s3, $s3, $s1;
                  ld_global_f32 $s3, [$s3];
                  add_f32 $s2, $s3, $s2;
                  add_u32 $s0, $s0, $s1;
                  st_global_f32 $s2, [$s0];
                  brn @BB0_1;
};
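To make the address arithmetic in the listing easier to follow, the sketch below replays it for one work-item against a flat byte array. The memory layout, the base addresses, and the little-endian 32-bit float encoding are assumptions made for illustration; the comments name the instruction each line stands for.

```python
import struct

MASK32 = 0xFFFFFFFF


def run_work_item(memory, arg_a, arg_b, arg_c, arg_n, work_item_id):
    """Replay the HSAIL vector-add listing for a single work-item."""
    s0 = arg_n                                   # ld_kernarg_u32 $s0, [%arg_n]
    s1 = work_item_id                            # workitemaid $s1, 0
    c0 = s1 < s0                                 # cmp_lt_b1_u32 $c0, $s1, $s0
    s0, s2, s3 = arg_c, arg_b, arg_a             # three ld_kernarg_u32 loads
    if not c0:                                   # branch not taken: ret
        return
    s1 = (s1 << 2) & MASK32                      # shl_u32: index * 4 bytes
    s2 = (s2 + s1) & MASK32                      # add_u32: address of b[id]
    (f2,) = struct.unpack_from("<f", memory, s2)  # ld_global_f32 (reuses $s2)
    s3 = (s3 + s1) & MASK32                      # add_u32: address of a[id]
    (f3,) = struct.unpack_from("<f", memory, s3)  # ld_global_f32 (reuses $s3)
    f2 = f3 + f2                                 # add_f32 $s2, $s3, $s2
    s0 = (s0 + s1) & MASK32                      # add_u32: address of c[id]
    struct.pack_into("<f", memory, s0, f2)       # st_global_f32 $s2, [$s0]


n = 3
memory = bytearray(3 * 4 * n)                    # room for a, b and c
A, B, C = 0, 4 * n, 8 * n                        # hypothetical base addresses
struct.pack_into("<3f", memory, A, 1.0, 2.0, 3.0)
struct.pack_into("<3f", memory, B, 10.0, 20.0, 30.0)
for work_item in range(4):                       # one extra, out-of-range item
    run_work_item(memory, A, B, C, n, work_item)
result = list(struct.unpack_from("<3f", memory, C))
print(result)
```

The shift by 2 is the multiplication by `sizeof(float)` that the OpenCL source leaves implicit in `a[id]`.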

MEMORY SEGMENTS


 Memory is split into 7 segments:
 kernarg, global, arg, readonly, private, group, and spill

 There is a single flat address space covering everything, but it is often advantageous to tell the finalizer
  which segment to use


 Load/store machine with registers


 Some segments are used to convey intent:
     – Spill indicates that the slot was used by the HLC for register spilling




SEGMENTS
[Figure: an NDRange contains work groups, and each work group contains work items. Each work item has its own Private memory and each work group has its own Group memory. Arg locations and spill locations are in Private. Other labels: Agent, Flat address space, Private within flat, Group within flat.]

   arg memory is within Private
   spill memory is within Private
   privateRW is within Private
   kernarg is within Global
   ReadOnly is within Global




HSAIL FEATURES: REGISTERS AND TYPES

Four classes of registers
     c/s/d/q
     1 bit
     32 bits
     64 bits
     128 bits
Both binary (BRIG) and text format
The binary format is fully specified
120 opcodes (Java bytecode has 200)

Types
     Brigs8, Brigs16, Brigs32, Brigs64,
     Brigu8, Brigu16, Brigu32, Brigu64,
     Brigf16, Brigf32, Brigf64, Brigb1,
     Brigb8, Brigb16, Brigb32, Brigb64,
     Brigb128, Brigu8x16,
     BrigROImg, BrigRWImg, BrigSamp,
     Brigu8x4, Brigs8x4, Brigu8x8, Brigs8x8,
     Brigs8x16,
     Brigu16x2, Brigs16x2, Brigf16x2,
     Brigu16x4, Brigs16x4, Brigf16x4, Brigu16x8,
     Brigs16x8,
     Brigf16x8, Brigu32x2, Brigs32x2,
     Brigf32x2, Brigu32x4, Brigs32x4, Brigf32x4,
     Brigu64x2, Brigs64x2,
     Brigf64x2

WHY DOES HSAIL LOOK THIS WAY?


A SIMT model (single instruction, multiple threads) claims that every work-item has a program counter,
so branch instructions look pretty natural


A vector machine model looks like SSE: one program counter and vector registers. This is like real AMD GPU
hardware


SIMT or Vector?




PROS FOR SIMT
   We want HSAIL to outlast one hardware generation (so at the very least the vector length
   and real types/number of registers should not get exposed).
   Even with a vector model the finalizer will still have to map to the real vector length. We
   expected this to mean that a vector finalizer would not have a much simpler time

   We want to support lots of machines including ones not built by AMD

   We can add cross-lane operations (like count) to the SIMT model, so the line between SIMT
   and vector is blurry

   We want to open up to third-party compilers and tools, all of which can support SIMT but few
   of which can support vector

   Work groups are a much more developer-friendly model than wavefronts

   Natural path for OpenCL™ / CUDA™ / C++ AMP™

   Graphics is SIMT, so the pressure to make future hardware work well for SIMT
   is immense


PROS FOR VECTOR
   Might get more performance, we estimated <10% even in good cases

   Simpler for expert programmers to reason out what is going on

   This was a big one for us: the exact rules on wavefront re-convergence are
   hidden in the SIMT model but clear in the vector one

   In the vector model you can prove some results about code, which cannot be
   done when the finalizer reorders things

   On the other hand constructs like C++ virtual functions become very confusing on
   a vector machine, where the original program was SIMT

   We think the performance deficits are a reasonable trade for broader adoption,
   and in many cases can be closed by well written libraries for the cases that really
   matter.



HSAIL AND FUNCTIONS
{
      arg_u32 %input1;
      arg_u32 %input2;
      // …
      call &fnWithTwoArgs ()(%input1, %input2); // call of a function
         // all work-items call the same function
      }
// ...


HSAIL supports
Virtual functions,
Signatures
Jumps via a register
Load address of code


HSAIL PROVIDES A SERIES OF OPTIMIZATION CONTROLS

Sometimes you know if an operation is uniform over a range


ld_f32_width(8) $s1, address


Work items in groups of 8 will read the same value
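One way a finalizer could exploit such a hint is to issue a single memory access per group and broadcast the result. The sketch below illustrates that idea only; the function name, the dictionary standing in for memory, and the assumption that a group means `width` consecutive work-item IDs are not taken from the slides.

```python
def load_with_width(memory, address, work_item_ids, width):
    """Serve a uniform load with one memory access per group of `width`."""
    accesses = 0
    per_group = {}
    values = {}
    for work_item in work_item_ids:
        group = work_item // width        # assumed grouping rule
        if group not in per_group:
            per_group[group] = memory[address]
            accesses += 1
        values[work_item] = per_group[group]
    return values, accesses


values, accesses = load_with_width({0x40: 3.5}, 0x40, range(32), width=8)
print(accesses)  # 4 accesses serve 32 work-items
```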


call_width(64) $s1


Even though this is a call through a register, work items in groups of 64 will call the same function


ld_equiv(3)_u32 $s1, address


A block of memory that cannot alias with other blocks



HSAIL COMPARED TO LLVM-IR


HSAIL is low level
  assumes finalizer does not do as much optimization
  no phi nodes
  finite register count
  no SSA input
Parallelism is built into HSAIL
            No need to hack the meaning of a barrier
No structures or other high level features




HSAIL COMPARED TO JAVA BYTE CODE


HSAIL is more focused on performance,
HSAIL has registers not a stack
HSAIL has parallelism built in
HSAIL is not as focused on security (does not require a formal validator)
Not quite write once
HSAIL is less concerned about code compression




HSAIL COMPARED TO AMDIL


HSAIL supports lots of complex control flow
  AMDIL provides structured control flow only
  irreducible flow needed exponential compile time

No (or limited) graphics features
  just enough for C++ AMP™ and OpenCL™

Four sizes of registers (1/32/64/128 bit) vs. 4x32 vector registers (no more .x, .y, .z, .w fields)

HSAIL is extendable (per vendor/per chip extensions)

Different cost model




HSAIL COMPARED TO PTX


More formal model of execution
  possible to write valid programs that pass data between work groups

More formal model of memory - acq/rel semantics

Less semantics defined by the device

Support for libraries and complex calls
   Interaction between agents and HSAIL code,
   shared memory, support for GPU to call CPU services

Per vendor extension mechanism

Clean separation of core features and per device operations

Support for linking/ libraries/ separate compilation

Removal of hard to finalize features
  no predication




MEMORY MODEL


A memory model defines how writes by one work-item or agent become visible to other work-items and agents.


For many implementations, better performance will result if either the hardware or the finalizer is allowed to reorder
code. For example, the finalizer might find it more efficient if a write is moved later in the program; so long as the
program semantics do not change, the finalizer is free to do so. Once a store is deferred, other work-items and
agents will not see it until the store actually happens. Hardware might provide a cache that also defers writes.


The HSAIL memory model is based on acquire/release.

An ld_acq creates a “downward fence.” This means that normal loads and stores can be moved (by the
implementation) down past the ld_acq but no memory operation (load, store, or atomic) can be moved up above the
ld_acq.

A st_rel creates an “upward fence.” That means that normal loads and stores can be moved (by the
implementation) above the st_rel but no memory operation (load, store, or atomic) can be moved down after the
st_rel.
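The two one-way fences can be summarized as a rule for when an implementation may exchange two adjacent operations. This is a sketch under stated assumptions: only the four operation kinds named here are modelled, the two operations are assumed to touch different addresses, and forbidding a `st_rel` and a `ld_acq` from crossing each other is a conservative choice of this example, not something the slide states.

```python
ORDINARY = {"ld", "st"}


def may_swap(first, second):
    """May two adjacent operations be exchanged?

    `first` precedes `second` in program order. Swapping moves `second`
    up above `first` and `first` down past `second`.
    """
    if first == "ld_acq":
        return False      # nothing may move up above an acquire
    if second == "st_rel":
        return False      # nothing may move down past a release
    # Only ordinary loads and stores may cross a remaining fence.
    return first in ORDINARY or second in ORDINARY


for pair in [("ld_acq", "ld"), ("ld", "ld_acq"), ("st", "st_rel"), ("st_rel", "ld")]:
    print(pair, may_swap(*pair))
```

The second and fourth pairs are the permitted directions: an ordinary load may sink below a later `ld_acq`, and may rise above an earlier `st_rel`.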



Original Axiomatic Definition [Lamport 1979]



       A single processor (core) is sequentially consistent if
       “the result of an execution is the same as if the operations had been executed in the order specified
       by the program.”


       A multiprocessor is sequentially consistent if


       “the result of any execution is the same as if the operations of all processors (cores) were executed in
       some sequential order, and the operations of each individual processor (core) appear in this sequence
       in the order specified by its program.”




SEQUENTIAL CONSISTENCY (SC) OPERATIONAL DEFINITION




 System
   1 memory
   P simple processors
   [Figure: processors P ... P connected to a single MEMORY]
 Operation: Pick one ready row, do it, & repeat until done
   Processor 0 ready to load/store of memory
   …
   Processor P-1 ready to load/store of memory
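The "pick one ready row, do it, repeat" rule can be run exhaustively on a small program to list every result SC permits. The sketch below does that; the operation encoding, the function name, and the two-processor test program are assumptions chosen for illustration.

```python
def sc_outcomes(programs, registers):
    """Enumerate every result the SC operational model can produce.

    programs: one list of operations per processor, each operation being
    ("st", location, value) or ("ld", location, register_name).
    registers: names of the registers to report, in order.
    Memory locations start at 0.
    """
    results = set()

    def step(next_op, memory, regs):
        ready = [p for p, prog in enumerate(programs) if next_op[p] < len(prog)]
        if not ready:
            results.add(tuple(regs.get(r) for r in registers))
            return
        for p in ready:                      # pick any one ready processor
            kind, location, operand = programs[p][next_op[p]]
            new_memory, new_regs = dict(memory), dict(regs)
            if kind == "st":
                new_memory[location] = operand
            else:
                new_regs[operand] = new_memory.get(location, 0)
            advanced = list(next_op)
            advanced[p] += 1
            step(advanced, new_memory, new_regs)

    step([0] * len(programs), {}, {})
    return results


programs = [
    [("st", "a", 1), ("ld", "b", "r0")],     # processor 0
    [("st", "b", 1), ("ld", "a", "r1")],     # processor 1
]
print(sorted(sc_outcomes(programs, ["r0", "r1"])))
# [(0, 1), (1, 0), (1, 1)]
```

Under SC the result `(0, 0)` never appears: whichever load runs last must see the other processor's store.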




SEQUENTIAL CONSISTENCY

Any SC implementation must only permit executions allowed by the SC operational model (SC executions).

The SC operational model is NOT a performance model.
  SC implementation performance != Counting operation model steps

The operational model hides most implementation techniques
  pipelining, out-of-order, speculation, caches, cache coherence, …
  HW must functionally behave “as if” it were the operational model

HW designers & verifiers are often most comfortable with the operational model

Each processor is eventually selected




HSAIL OPERATIONAL DEFINITION



    System
      1 (host) memory
      P simple processors
      Reorder buffer
              Writes can get held
              Reads can be satisfied
      [Figure: processors P ... P, each connected to the single MEMORY]
    Operation: Pick one ready row, do it, & repeat until done
      Processor 0 ready to load/store of memory
      …
      Processor P-1 ready to load/store of memory
         Write values may stay in the reorder buffer; reads may come out of the reorder buffer
         Rules to move between reorder buffer and memory
                rel = release the values from the buffer, acq = acquire new values



WITHIN ONE WORK ITEM
SEQUENCED BEFORE

This is the order operations appear in the source
What you see looking at the code




single work item - “as-if-serial” view
      - each operation appears to happen in the order it appears in the source
X sb Y
      - X and Y in same work item,
      - X sequenced before Y

multiple work items and agents make this more complex




BETWEEN WORK ITEMS


X >> Y


What the memory system sees



memory system must see X before Y
global visibility order
this is transitive
     If X >> Y and Y >> Z, then X >> Z




RULES, SOMETIMES
X SB Y => X >> Y

• X sb Y, same address, then X >> Y
• Different address
    – If there is a barrier or sync between X and Y then X >> Y
• If X is an acquire:
    – ld_acq, atomic_acq, atomicNoRet_acq, atomic_ar, atomicNoRet_ar
    – Then X >> Y
    – This is one sided (Y cannot move before X)


The general rule is to use acquire and release when you want to force order
Acquire and release may take extra time, but they give you sequential consistency

     Compilers can trade performance for simple cross work-item communication




• If Y is a release
       – st_rel, atomic_ar or atomicNoRet_ar, then X >> Y
       – st_rel is another one-way fence

   • Consider a critical region (acquire and release can be used to form critical sections)



   • ld_acq x
   • Assorted memory operations
   • st_rel y

   • No operations can move out, but operations can move in




AN EXAMPLE SB ORDER DOES NOT FORCE MEMORY ORDER
Work-item 0                                       Work-item 1
-------------------                               ------------------------------------
@h0: st_u32 1, [&a]                               @k0: st_u32 1, [&b]
@h1: ld_u32 $s0, [&b]                              @k1: ld_u32 $s1, [&a]


Initially, &a and &b = 0. The outcome $s0 = 0 and $s1 = 0 is allowed.


Constraints are added because readers have to follow writers: k1 (the reader)
has to happen before h0 changes the value. There are also constraints caused by synchronization.
One global visibility order consistent with this outcome is h1 >> k1 >> h0 >> k0.


Even though h0 appears before h1 in sequenced-before order, there is no
requirement that the operations appear in text order (sequenced-before order) to the
memory system.
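A simple way to see how this outcome arises is to give each work-item a private buffer that can hold its writes for a while. The sketch below enumerates every execution of such a machine for the example above. It is a deliberate simplification in the spirit of the reorder buffer, not the actual HSAIL rules; the encoding and function name are assumptions.

```python
def buffered_outcomes(programs, registers):
    """Enumerate results when each work-item may hold writes in a buffer.

    A store enters the issuing work-item's buffer; a load reads the
    newest matching entry in its own buffer, else memory. At any time
    the oldest held write of any work-item may drain to memory.
    """
    results = set()

    def step(next_op, buffers, memory, regs):
        if all(next_op[p] == len(prog) for p, prog in enumerate(programs)):
            results.add(tuple(regs[r] for r in registers))
            return
        for p, prog in enumerate(programs):
            if buffers[p]:                     # drain the oldest held write
                (location, value), rest = buffers[p][0], buffers[p][1:]
                drained = buffers[:p] + (rest,) + buffers[p + 1:]
                step(next_op, drained, {**memory, location: value}, regs)
            if next_op[p] < len(prog):         # run the next operation
                kind, location, operand = prog[next_op[p]]
                advanced = next_op[:p] + (next_op[p] + 1,) + next_op[p + 1:]
                if kind == "st":
                    grown = buffers[p] + ((location, operand),)
                    held = buffers[:p] + (grown,) + buffers[p + 1:]
                    step(advanced, held, memory, regs)
                else:
                    own = [v for loc, v in buffers[p] if loc == location]
                    value = own[-1] if own else memory.get(location, 0)
                    step(advanced, buffers, memory, {**regs, operand: value})

    empty = tuple(() for _ in programs)
    step(tuple(0 for _ in programs), empty, {}, {})
    return results


example = [
    [("st", "a", 1), ("ld", "b", "s0")],       # work-item 0: h0, h1
    [("st", "b", 1), ("ld", "a", "s1")],       # work-item 1: k0, k1
]
found = buffered_outcomes(example, ["s0", "s1"])
print((0, 0) in found)  # True
```

Both stores can sit in their buffers while both loads read the initial zeros from memory, which is exactly the outcome the slide says is allowed.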


EXAMPLE 2 REGISTER DEPENDENCE DOES NOT FORCE MEMORY ORDER
Work-item 0                                       Work-item 1
-----------------------                           ---------------------
@h0: ld $s0, [&a]                                 @j0: st 20, [100]
@h1: ld $s1, [$s0]                                @j1: st_rel 100, [&a]
Initially, &a and contents of location 100 = 0.
$s1 == 0 and $s0 == 100 is allowed


If $s1 == 0 then h1 >> j0. If $s0 == 100 then j1 >> h0.
Because this seems to violate dependence order, it is useful to consider how this can
come about.
Work-item 0 is allowed to prefetch load h1. One reason it might do this is that code before these operations
reads address 96, and the implementation reads in large cache lines.
Later, work-item 0 reads the new value of &a, which is 100. Then it reads the value of
location 100, but because there is no synchronization, it can use the previously prefetched value of 0.



EXAMPLE 3


Work-item 0                                     Work-item 1

@h0: ld_acq $s0, [&a]                           @j0: st 20, [100]
@h1: ld $s1, [$s0]                              @j1: st_rel 100, [&a]
Initially, &a and contents of location 100 = 0.
HSAIL does not allow $s1 == 0 and $s0 == 100.
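The difference between the plain load and the ld_acq version can be checked by enumerating global visibility orders of the four operations. The sketch below is a simplified model, not the HSAIL specification: it assumes the only ordering constraints are j0 >> j1 (because j1 is a release) and, when h0 is an acquire, h0 >> h1. It also does not model what h1 reads when $s0 is 0.

```python
from itertools import permutations

OPS = ["h0", "h1", "j0", "j1"]


def outcomes(h0_is_acquire):
    """Collect ($s0, $s1) over every permitted global visibility order."""
    # j1 is a st_rel, so the earlier store j0 must become visible first.
    required = [("j0", "j1")]
    if h0_is_acquire:
        # Nothing after a ld_acq may move above it.
        required.append(("h0", "h1"))
    seen = set()
    for order in permutations(OPS):
        position = {op: i for i, op in enumerate(order)}
        if any(position[x] > position[y] for x, y in required):
            continue
        # h0 sees 100 only if the release store j1 is already visible.
        s0 = 100 if position["j1"] < position["h0"] else 0
        if s0 == 100:
            # h1 (possibly prefetched) sees 20 only if j0 came first.
            s1 = 20 if position["j0"] < position["h1"] else 0
        else:
            s1 = None     # h1 would dereference address 0: not modelled
        seen.add((s0, s1))
    return seen


print((100, 0) in outcomes(h0_is_acquire=False))  # True
print((100, 0) in outcomes(h0_is_acquire=True))   # False
```

With the acquire in place, seeing 100 in $s0 forces j0 >> j1 >> h0 >> h1, so the dependent load can only return 20.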




QUESTIONS?




37 | hsail AFDS | June 11, 2012

More Related Content

What's hot

SAN Extension Design and Solutions
SAN Extension Design and SolutionsSAN Extension Design and Solutions
SAN Extension Design and SolutionsTony Antony
 
5G RAN fundamentals
5G RAN fundamentals5G RAN fundamentals
5G RAN fundamentalsRavi Sharma
 
Beginners: Non Terrestrial Networks (NTN)
Beginners: Non Terrestrial Networks (NTN)Beginners: Non Terrestrial Networks (NTN)
Beginners: Non Terrestrial Networks (NTN)3G4G
 
Turn on 5G with Ericsson 5G Platform
Turn on 5G with Ericsson 5G PlatformTurn on 5G with Ericsson 5G Platform
Turn on 5G with Ericsson 5G PlatformEricsson
 
Intermediate: Bandwidth Parts (BWP)
Intermediate: Bandwidth Parts (BWP)Intermediate: Bandwidth Parts (BWP)
Intermediate: Bandwidth Parts (BWP)3G4G
 
5G Network Architecture Options
5G Network Architecture Options5G Network Architecture Options
5G Network Architecture Options3G4G
 
Advanced Topics and Future Directions in MPLS
Advanced Topics and Future Directions in MPLS Advanced Topics and Future Directions in MPLS
Advanced Topics and Future Directions in MPLS Cisco Canada
 
3GPP 5G NSA introduction 1(EN-DC Bearer)
3GPP 5G NSA introduction 1(EN-DC Bearer)3GPP 5G NSA introduction 1(EN-DC Bearer)
3GPP 5G NSA introduction 1(EN-DC Bearer)Ryuichi Yasunaga
 
Ttalteoverview 100923032416 Phpapp01 (1)
Ttalteoverview 100923032416 Phpapp01 (1)Ttalteoverview 100923032416 Phpapp01 (1)
Ttalteoverview 100923032416 Phpapp01 (1)Deepak Sharma
 
lte channel types
lte channel typeslte channel types
lte channel typesavneesh7
 
Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019
Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019
Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 20193G4G
 
Overview 5G Architecture Options from Deutsche Telekom
Overview 5G Architecture Options from Deutsche TelekomOverview 5G Architecture Options from Deutsche Telekom
Overview 5G Architecture Options from Deutsche TelekomEiko Seidel
 
5G and V2X Automotive Slicing
5G and V2X Automotive Slicing5G and V2X Automotive Slicing
5G and V2X Automotive SlicingMarie-Paule Odini
 
Amplificateurs optiques (soa, raman, edfa)
Amplificateurs optiques (soa, raman, edfa)Amplificateurs optiques (soa, raman, edfa)
Amplificateurs optiques (soa, raman, edfa)Assia Mounir
 
ZTE BTS Cards Description
ZTE BTS Cards DescriptionZTE BTS Cards Description
ZTE BTS Cards Descriptionibrahimnabil17
 
Cell PCH state - Some Questions Answered
Cell PCH state - Some Questions AnsweredCell PCH state - Some Questions Answered
Cell PCH state - Some Questions AnsweredFaraz Husain
 

What's hot (20)

SAN Extension Design and Solutions
SAN Extension Design and SolutionsSAN Extension Design and Solutions
SAN Extension Design and Solutions
 
5G RAN fundamentals
5G RAN fundamentals5G RAN fundamentals
5G RAN fundamentals
 
Beginners: Non Terrestrial Networks (NTN)
Beginners: Non Terrestrial Networks (NTN)Beginners: Non Terrestrial Networks (NTN)
Beginners: Non Terrestrial Networks (NTN)
 
5g-Air-Interface-pptx.pptx
5g-Air-Interface-pptx.pptx5g-Air-Interface-pptx.pptx
5g-Air-Interface-pptx.pptx
 
Turn on 5G with Ericsson 5G Platform
Turn on 5G with Ericsson 5G PlatformTurn on 5G with Ericsson 5G Platform
Turn on 5G with Ericsson 5G Platform
 
Intermediate: Bandwidth Parts (BWP)
Intermediate: Bandwidth Parts (BWP)Intermediate: Bandwidth Parts (BWP)
Intermediate: Bandwidth Parts (BWP)
 
5G Network Architecture Options
5G Network Architecture Options5G Network Architecture Options
5G Network Architecture Options
 
Advanced Topics and Future Directions in MPLS
Advanced Topics and Future Directions in MPLS Advanced Topics and Future Directions in MPLS
Advanced Topics and Future Directions in MPLS
 
NEI_LTE1235_ready_02.pdf
NEI_LTE1235_ready_02.pdfNEI_LTE1235_ready_02.pdf
NEI_LTE1235_ready_02.pdf
 
3GPP 5G NSA introduction 1(EN-DC Bearer)
3GPP 5G NSA introduction 1(EN-DC Bearer)3GPP 5G NSA introduction 1(EN-DC Bearer)
3GPP 5G NSA introduction 1(EN-DC Bearer)
 
Ttalteoverview 100923032416 Phpapp01 (1)
Ttalteoverview 100923032416 Phpapp01 (1)Ttalteoverview 100923032416 Phpapp01 (1)
Ttalteoverview 100923032416 Phpapp01 (1)
 
TWAMP NOKIA.pdf
TWAMP NOKIA.pdfTWAMP NOKIA.pdf
TWAMP NOKIA.pdf
 
lte channel types
lte channel typeslte channel types
lte channel types
 
Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019
Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019
Prof. Andy Sutton: 5G RAN Architecture Evolution - Jan 2019
 
Mpls
MplsMpls
Mpls
 
Overview 5G Architecture Options from Deutsche Telekom
Overview 5G Architecture Options from Deutsche TelekomOverview 5G Architecture Options from Deutsche Telekom
Overview 5G Architecture Options from Deutsche Telekom
 
5G and V2X Automotive Slicing
5G and V2X Automotive Slicing5G and V2X Automotive Slicing
5G and V2X Automotive Slicing
 
Amplificateurs optiques (soa, raman, edfa)
Amplificateurs optiques (soa, raman, edfa)Amplificateurs optiques (soa, raman, edfa)
Amplificateurs optiques (soa, raman, edfa)
 
ZTE BTS Cards Description
ZTE BTS Cards DescriptionZTE BTS Cards Description
ZTE BTS Cards Description
 
Cell PCH state - Some Questions Answered
Cell PCH state - Some Questions AnsweredCell PCH state - Some Questions Answered
Cell PCH state - Some Questions Answered
 

Viewers also liked

HSA Introduction Hot Chips 2013
HSA Introduction  Hot Chips 2013HSA Introduction  Hot Chips 2013
HSA Introduction Hot Chips 2013HSA Foundation
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMDHSA Foundation
 
HSA HSAIL Introduction Hot Chips 2013
HSA HSAIL Introduction  Hot Chips 2013 HSA HSAIL Introduction  Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013 HSA Foundation
 
HSA Memory Model Hot Chips 2013
HSA Memory Model Hot Chips 2013HSA Memory Model Hot Chips 2013
HSA Memory Model Hot Chips 2013HSA Foundation
 
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...
AFDS 2012 Phil Rogers Keynote: THE PROGRAMMER’S GUIDE TO A UNIVERSE OF POSSIB...HSA Foundation
 
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
 AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.” AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”
AFDS 2011 Phil Rogers Keynote: “The Programmer’s Guide to the APU Galaxy.”HSA Foundation
 
HSA Queuing Hot Chips 2013
HSA Queuing Hot Chips 2013 HSA Queuing Hot Chips 2013
HSA Queuing Hot Chips 2013 HSA Foundation
 

Deeper Look Into HSAIL And Its Runtime

  • 1. HSAIL Norm Rubin Fellow An introduction to the HSA Intermediate language
  • 2. Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used with permission by Khronos. DirectX is a registered trademark of Microsoft Corporation. © 2012 Advanced Micro Devices, Inc. All rights reserved. 2 | hsail AFDS | June 11, 2012
  • 3. WHAT IS SPLIT COMPILATION?
    The app starts as a source program:
    1) A high-level compiler (HLC) generates HSAIL
    2) The HSAIL is shipped to the target machine
    3) A second compiler (a finalizer) turns HSAIL into ISA
    Unlike traditional compilers, where optimization is contained in one part or done twice, HSAIL allows optimization to be split into two parts
    The heavy lifting goes to the HLC; the quick finish goes to the finalizer
    HSAIL provides ways for an HLC and a finalizer to cooperate
    For instance:
    - HSAIL provides a fixed number of registers; HSA implementations might support a different number
    - When the HLC spills registers, it can use special operations that let the finalizer know where to use extra registers
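The register example on slide 3 can be made concrete with a small sketch. This is not the HSAIL mechanism itself: the function name `place_spills`, the `reg`/`mem` naming, and the register counts are hypothetical, and a real finalizer works on HSAIL spill operations rather than bare slot numbers. The sketch only shows the idea that the HLC marks what it spilled and the finalizer, which knows the target, decides whether a spare register is available.

```python
def place_spills(spill_slots, il_registers, target_registers):
    """Toy finalizer step: decide where each HLC spill slot lives on the target.

    All names and the placement policy here are illustrative assumptions.
    """
    # Registers the target has beyond the fixed number the IL exposes.
    spare = max(0, target_registers - il_registers)
    placement = {}
    # dict.fromkeys de-duplicates while keeping first-use order,
    # so the whole step is a single linear pass over the spill list.
    for index, slot in enumerate(dict.fromkeys(spill_slots)):
        if index < spare:
            placement[slot] = f"reg{il_registers + index}"  # promoted to a spare register
        else:
            placement[slot] = f"mem[{slot}]"                # stays in memory
    return placement
```

For example, `place_spills([0, 1, 2, 1], il_registers=8, target_registers=10)` places slots 0 and 1 in the two spare registers and leaves slot 2 in memory. On a target with no spare registers every slot stays in memory.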
  • 4. SPLIT COMPILATION (MEANS THERE HAVE TO BE WAYS TO PASS INFORMATION FROM HLC TO FINALIZER)
    HLC (high-level compiler)
    - Lots of time
    - Info from source
    - Lots of aggressive optimizations
    - But limited (or no) knowledge of target
    Finalizer
    - Very little time (we estimate that it will take close to linear time)
    - No info not in HSAIL (almost no back doors)
    - Cannot update regularly (close to bug free)
    - Simple optimizations only
    - But knows the target
    Exactly how to split some optimizations is still an open problem
  • 5. WHY A VIRTUAL ISA - WHY NOT JUST TARGET THE REAL ISA?
    ISA
    - Gains performance
    - Better time to market (because hardware is finished faster)
    - Loses performance (cannot use every hardware trick)
    - No legacy boat anchor
    - Real ISA means one vendor / one chip family
    - Can fix hardware bugs in software
    - Old and new code just work on old and new machines
    - Allows hardware innovation under the table
    - Features not in HSAIL are not exposed, and are hard to access
  • 6. Development tools at HSAIL level
    Today the need for a complete tool chain for each core, each with its own technology, switches, etc., is a significant maintenance problem.
    Debuggability, reproducibility. Because the same application needs to run on different pieces of hardware, current source code contains many conditional preprocessing directives. Programmers rely on compiler intrinsics and ad hoc command-line arguments to drive the optimization. This severely impacts code readability and productivity, and the application binary tested and debugged on a workstation is different from the one that eventually runs on the system.
    Platform openness. Independent software vendors rarely have access to the tool chains needed to program the most powerful parts of the system, namely the DSPs and hardware accelerators. Virtualization can make the whole platform programmable, opening opportunities to third-party high-performance applications.
    Performance through time to market. Because of the finalizer, last-minute fixes can happen after the chip is finished. This means that the time to release a new part goes down. Less time per generation translates to better performance.
  • 7. GOALS OF HSAIL
    1. Can support all of C++ (open up the GPU to mass programming, not only for specialists)
    2. Avoid constant change (do not change the spec for every chip)
    3. Support accurate IEEE floating point math
    4. Target lots of different machines
    5. Allow for packed operations, SSE and friends, bytes/shorts/ints/doubles, etc.
    6. Allow packed forms to save power
    7. Make the model understandable
    8. Make the finalizer fast (around linear time)
    9. Make the finalizer simple (do not need monthly updates)
    10. Less ambiguity in the spec (little undefined behavior)
    11. Get good performance (little need to write in ISA)
    12. Support all of OpenCL™ and C++ AMP™
    13. Can ship linkable libraries in HSAIL
    14. Clean up all nits in AMDIL
    15. Allow the use of chip-specific acceleration when it is a good idea
  • 8. HSAIL – LOTS OF NEW FEATURES
    - Lots of features not in OpenCL and C++ AMP
    - Enough to implement C++ Exceptions / heterogeneous compute
    - Flat address space (work items on the GPU and agents on the CPU)
    - Because of hand-written HSAIL, these features can be exposed early
    - Fine-grain barriers that work inside control flow, so you can implement producer-consumer models
    - Lots of cross-wave operations, so you can quickly move data between lanes without loads and stores
    - Spec is available on the web site
    - The memory model shows how the CPU and GPU can cooperate
    - Support for image operations
  • 9. PARALLELISM MODEL
  • 10. WAVEFRONTS
    Most developers will not care about wavefronts
    Similar to cache line sizes
    - Experts can get good performance if they code to the cache line size
    - Compiler has to avoid breaking the developer's model
    HSAIL formalizes the notion of wavefronts
    - You can tell which work item goes into which wavefront
    - You can write producer-consumer parallelism between work groups
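As a rough illustration of "you can tell which work item goes into which wavefront", the sketch below maps a work item's local ID to a wavefront index and a lane. It assumes that work items are packed into wavefronts in order of their flattened ID within the work group and that the wavefront size is 64; both are assumptions for the example, not values taken from the slides, and `wavefront_of` is a hypothetical helper name.

```python
def wavefront_of(local_id, workgroup_size, wavefront_size=64):
    """Return (wavefront index, lane) for a work item inside its work group.

    Assumes packing by flattened work-item ID; wavefront_size=64 is an assumption.
    """
    x, y, z = local_id
    size_x, size_y, _size_z = workgroup_size
    # Flatten the 3-D local ID with x varying fastest.
    flat = x + y * size_x + z * size_x * size_y
    return flat // wavefront_size, flat % wavefront_size
```

With a 16x16x1 work group, work items with `y` in 0..3 fall into wavefront 0 and the item at `(0, 4, 0)` starts wavefront 1.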
  • 11. AN EXAMPLE (IN OPENCL™)
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          const unsigned int n)
    {
        // Get our global thread ID
        int id = get_global_id(0);

        // Make sure we do not go out of bounds
        if (id < n) {
            c[id] = a[id] + b[id];
        }
    }
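The bounds check in the kernel matters because the number of work items launched is often rounded up past `n`. The following sketch models that on the host in plain Python: `launch` is a hypothetical stand-in for an NDRange dispatch that simply calls the kernel once per global ID, and it is not part of any OpenCL API.

```python
def vec_add_kernel(global_id, a, b, c, n):
    """Python model of the vec_add kernel body for one work item."""
    if global_id < n:  # same guard as the OpenCL kernel
        c[global_id] = a[global_id] + b[global_id]


def launch(kernel, global_size, *args):
    """Hypothetical serial stand-in for a 1-D NDRange launch."""
    for global_id in range(global_size):
        kernel(global_id, *args)
```

Launching 4 work items over 3-element lists, for example `launch(vec_add_kernel, 4, a, b, c, 3)`, leaves the extra work item with nothing to do instead of indexing past the end of the arrays.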
  • 12. VECTOR ADD A[0:N-1] = B[0:N-1] + C[0:N-1]
        version 1:0:$small;
        kernel &__OpenCL_vec_add_kernel(
                kernarg_u32 %arg_a,
                kernarg_u32 %arg_b,
                kernarg_u32 %arg_c,
                kernarg_u32 %arg_n)
        {
        @__OpenCL_vec_add_kernel_entry:
        // BB#0:                                // %entry
                ld_kernarg_u32 $s0, [%arg_n];
                workitemaid $s1, 0;
                cmp_lt_b1_u32 $c0, $s1, $s0;
                ld_kernarg_u32 $s0, [%arg_c];
                ld_kernarg_u32 $s2, [%arg_b];
                ld_kernarg_u32 $s3, [%arg_a];
                cbr $c0, @BB0_2;
                brn @BB0_1;
        @BB0_1:                                 // %if.end
                ret;
        @BB0_2:                                 // %if.then
                shl_u32 $s1, $s1, 2;
                add_u32 $s2, $s2, $s1;
                ld_global_f32 $s2, [$s2];
                add_u32 $s3, $s3, $s1;
                ld_global_f32 $s3, [$s3];
                add_f32 $s2, $s3, $s2;
                add_u32 $s0, $s0, $s1;
                st_global_f32 $s2, [$s0];
                brn @BB0_1;
        };
  • 13. MEMORY SEGMENTS
      • Memory is split into 7 segments: kernarg, global, arg, readonly, private, group, and spill
      • There is a single flat address space that contains everything, but it is often advantageous to tell the finalizer which segment to use
      • Load/store machine with registers
      • Some segments are used to express intent
          • Spill indicates that the slot was used by the HLC for register spilling
  • 14. SEGMENTS
      [Figure: diagram with the labels NDRange, Work group, Work items, Group, Private, Agent, and Flat address space]
      • Arg locations are in private
      • Spill locations are in private
      • Group is within flat
      • Private is within flat
      • arg memory is within Private
      • spill memory is within Private
      • privateRW is within Private
      • kernarg is within Global
      • ReadOnly is within Global
  • 15. HSAIL FEATURES: REGISTERS AND TYPES
      • Four classes of registers (c/s/d/q): 1 bit, 32 bits, 64 bits, 128 bits
      • Both binary (BRIG) and text formats
      • The binary format is fully specified
      • 120 opcodes (Java bytecode has 200)
      • Types: Brigs8, Brigs16, Brigs32, Brigs64, Brigu8, Brigu16, Brigu32, Brigu64, Brigf16, Brigf32, Brigf64, Brigb1, Brigb8, Brigb16, Brigb32, Brigb64, Brigb128, Brigu8x16, BrigROImg, BrigRWImg, BrigSamp, Brigu8x4, Brigs8x4, Brigu8x8, Brigs8x8, Brigs8x16, Brigu16x2, Brigs16x2, Brigf16x2, Brigu16x4, Brigs16x4, Brigf16x4, Brigu16x8, Brigs16x8, Brigf16x8, Brigu32x2, Brigs32x2, Brigf32x2, Brigu32x4, Brigs32x4, Brigf32x4, Brigu64x2, Brigs64x2, Brigf64x2
  • 16. WHY DOES HSAIL LOOK THIS WAY?
      • An SIMT model (single instruction, multiple threads) claims that every work-item has a program counter, so branch instructions look pretty natural
      • A vector machine model looks like SSE: one program counter and vector registers. This is like real AMD GPU hardware
      • SIMT or vector?
  • 17. PROS FOR SIMT
      • We want HSAIL to outlast one hardware generation (so at the very least the vector length and the real types/number of registers should not be exposed).
      • Even with a vector model the finalizer would still have to map to the real vector length. We expected this to mean that a vector finalizer would not have a much simpler time.
      • We want to support lots of machines, including ones not built by AMD.
      • We can add cross-lane operations (like count) to the SIMT model, so the line between SIMT and vector is blurry.
      • We want to open up to 3rd-party compilers and tools, all of which can support SIMT but few of which can support vector.
      • Work groups are a much more developer-friendly model than wavefronts.
      • Natural path for OpenCL™/CUDA™/C++ AMP™.
      • Graphics is SIMT, so the pressure to make future hardware work well for SIMT is immense.
  • 18. PROS FOR VECTOR
      • Might get more performance; we estimated <10% even in good cases.
      • Simpler for expert programmers to reason out what is going on. This was a big one for us: the exact rules on wavefront re-convergence are hidden in the SIMT model but clear in the vector one.
      • In the vector model you can prove some results about code, which cannot be done when the finalizer reorders things.
      • On the other hand, constructs like C++ virtual functions become very confusing on a vector machine when the original program was SIMT.
      • We think the performance deficits are a reasonable trade for broader adoption, and in many cases they can be closed by well-written libraries for the cases that really matter.
  • 19. HSAIL AND FUNCTIONS
        {
            arg_u32 %input1;
            arg_u32 %input2;
            // …
            call &fnWithTwoArgs ()(%input1, %input2); // call of a function
            // all work-items call the same function
        }
        // ...
      HSAIL supports:
      • Virtual functions
      • Signatures
      • Jumps via a register
      • Load address of code
  • 20. HSAIL PROVIDES A SERIES OF OPTIMIZATION CONTROLS
      Sometimes you know if an operation is uniform over a range
      • ld_f32_width(8) $s1, address
          Work items in groups of 8 will read the same value
      • call_width(64) $s1
          Even though this is a call through a register, work items in groups of 64 will call the same function
      • ld_equiv(3)_u32 $s1, address
          A block of memory that cannot alias with other blocks
  • 21. HSAIL COMPARED TO LLVM-IR
      • HSAIL is low level
          • assumes the finalizer does not do as much optimization
          • no phi nodes, finite register count
          • no SSA input
      • Parallelism is built into HSAIL
          • no need to hack the meaning of a barrier
      • No structures or other high-level features
  • 22. HSAIL COMPARED TO JAVA BYTE CODE
      • HSAIL is more focused on performance
      • HSAIL has registers, not a stack
      • HSAIL has parallelism built in
      • HSAIL is not as focused on security (does not require a formal validator)
      • Not quite write once
      • HSAIL is less concerned about code compression
  • 23. HSAIL COMPARED TO AMDIL
      • HSAIL supports lots of complex control flow
          • AMDIL provides structured control flow only
          • irreducible flow needed exponential compile time
      • No (or limited) graphics features: just enough for C++ AMP™ and OpenCL™
      • Four sizes of registers (1/32/64/128 bit) vs. 4x32 vector registers (no more .x, .y, .z, .w fields)
      • HSAIL is extendable (per-vendor/per-chip extensions)
      • Different cost model
  • 24. HSAIL COMPARED TO PTX
      • More formal model of execution: possible to write valid programs that pass data between work groups
      • More formal model of memory: acq/rel semantics
      • Less semantics defined by the device
      • Support for libraries and complex calls
      • Interaction between agents and HSAIL code: shared memory, support for the GPU to call CPU services
      • Per-vendor extension mechanism
      • Clean separation of core features and per-device operations
      • Support for linking/libraries/separate compilation
      • Removal of hard-to-finalize features: no predication
  • 25. MEMORY MODEL
      A memory model defines how writes by one work-item or agent become visible to other work-items and agents.
      For many implementations, better performance will result if either the hardware or the finalizer is allowed to reorder code. For example, the finalizer might find it more efficient if a write is moved later in the program; so long as the program semantics do not change, the finalizer is free to do so. Once a store is deferred, other work-items and agents will not see it until the store actually happens. Hardware might provide a cache that also defers writes.
      The HSAIL memory model is based on acquire and release.
      An ld_acq creates a “downward fence.” This means that normal loads and stores can be moved (by the implementation) down past the ld_acq, but no memory operation (load, store, or atomic) can be moved up above the ld_acq.
      A st_rel creates an “upward fence.” That means that normal loads and stores can be moved (by the implementation) above the st_rel, but no memory operation (load, store, or atomic) can be moved down after the st_rel.
  • 26. Original Axiomatic Definition [Lamport 1979]
      • A single processor (core) is sequentially consistent if “the result of an execution is the same as if the operations had been executed in the order specified by the program.”
      • A multiprocessor is sequentially consistent if “the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program.”
  • 27. SEQUENTIAL CONSISTENCY (SC) OPERATIONAL DEFINITION
      System: P simple processors, 1 memory
      [Figure: processors P … P connected to a single MEMORY]
      • Processor 0: ready to load/store of memory
      • …
      • Processor P-1: ready to load/store of memory
      Operation: pick one ready row, do it, and repeat until done
  • 28. SEQUENTIAL CONSISTENCY
      • Any SC implementation must only permit executions allowed by the SC operational model (SC executions).
      • The SC operational model is NOT a performance model: SC implementation performance != counting operational model steps.
      • The operational model hides most implementation techniques: pipelining, out-of-order execution, speculation, caches, cache coherence, …
      • HW must functionally behave “as if” it were the operational model.
      • HW designers & verifiers are often most comfortable with the operational model.
      • Each processor is eventually selected.
  • 29. HSAIL OPERATIONAL DEFINITION
      System: P simple processors, reorder buffer, 1 (host) memory
      [Figure: processors P … P connected through a reorder buffer (“writes can get held”, “reads can be satisfied”) to MEMORY]
      • Processor 0: ready to load/store of memory
      • …
      • Processor P-1: ready to load/store of memory
      Operation: pick one ready row, do it, and repeat until done
      • Write values may stay in the reorder buffer; reads may come out of the reorder buffer
      • Rules to move between the reorder buffer and memory: rel = release the values from the buffer, acq = acquire new values
  • 30. WITHIN ONE WORK ITEM: SEQUENCED BEFORE
      • This is the order in which operations appear in the source: what you see looking at the code
      • Single work item, “as-if-serial” view: each operation appears to happen in the order it appears in the source
      • X sb Y: X and Y are in the same work item, and X is sequenced before Y
      • Multiple work items and agents make this more complex
  • 31. BETWEEN WORK ITEMS: X >> Y
      • What the memory system sees: the memory system must see X before Y
      • Global visibility order
      • This is transitive: if X >> Y and Y >> Z, then X >> Z
  • 32. RULES: SOMETIMES X SB Y => X >> Y
      • X sb Y, same address: then X >> Y
      • Different address:
          – If there is a barrier or sync between X and Y, then X >> Y
      • If X is an acquire:
          – ld_acq, atomic_acq, atomicNoRet_acq, atomic_ar, atomicNoRet_ar
          – Then X >> Y
          – This is one-sided (Y cannot move before X)
      The general rule is: use acquire and release when you want to force order.
      Acquire and release may take extra time, but they give you sequential consistency.
      Compilers can trade performance for simple cross-work-item communication.
  • 33.
      • If Y is a release:
          – st_rel, atomic_ar, or atomicNoRet_ar
          – Then X >> Y
          – st_rel is another one-way fence
      • Consider a critical region (acquire and release can be used to form critical sections):
            ld_acq x
            (assorted memory operations)
            st_rel y
      • No operations can move out, but operations can move in
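The critical-region pattern (an acquire at the top, a release at the bottom) has a close analogue in C++ atomics. The sketch below is an illustration only: `SpinLock` and `guarded_increment` are hypothetical names, and treating `std::memory_order_acquire` / `std::memory_order_release` as stand-ins for HSAIL's acquire and release operations is an assumption made here, not something the slides state.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical spin lock; the names are illustrative, not an HSAIL or AMD API.
class SpinLock {
public:
    // Returns true if the lock was free and is now held by the caller.
    // Acquire ordering: operations in the critical region cannot move up above this point.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    // Spins until the lock is obtained.
    void lock() {
        while (!try_lock()) { /* spin */ }
    }

    // Release ordering: operations in the critical region cannot move down below this point.
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// The "assorted memory operations" of the critical region: here, one increment.
int guarded_increment(SpinLock& lock, int& counter) {
    lock.lock();
    int value = ++counter;
    lock.unlock();
    return value;
}
```

As on the slide, both fences are one-way: nothing inside the region can leak out, while unrelated operations before `lock()` or after `unlock()` may still be moved into the region by the implementation.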
  • 34. AN EXAMPLE: SB ORDER DOES NOT FORCE MEMORY ORDER
        Work-item 0                   Work-item 1
        -------------------------     -------------------------
        @h0: st_u32 1, [&a]           @k0: st_u32 1, [&b]
        @h1: ld_u32 $s0, [&b]         @k1: ld_u32 $s1, [&a]
      Initially, &a and &b = 0.
      $s0 = 0 and $s1 = 0 is allowed.
      Constraints are added because readers have to follow writers: to read 0, k1 (the reader) has to happen before h0 changes the value. There are also constraints caused by synchronization.
      One order that produces this outcome is h1 >> k1 >> h0 >> k0.
      Even though h0 appears before h1 in sequenced-before order, there is no requirement that the operations appear in text order (sequenced-before order) to the memory system.
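To see why $s0 = 0 and $s1 = 0 is ruled out under sequential consistency but becomes possible once writes can be held back, the two operational models can be explored exhaustively for this four-operation program. This is a toy sketch, not the HSAIL definition: it assumes one FIFO write buffer per work-item (the slides only say that writes "may stay in the reorder buffer"), and every name in it (`Op`, `State`, `explore`, `outcomes`) is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <deque>
#include <set>
#include <utility>

// One operation of the toy program: store 1 to `addr`, or load `addr` into the work-item's register.
struct Op {
    bool is_store;
    int addr;  // 0 stands for &a, 1 stands for &b
};

struct State {
    std::array<int, 2> memory{0, 0};  // &a and &b, initially 0
    std::array<int, 2> pc{0, 0};      // index of the next operation per work-item
    std::array<int, 2> reg{-1, -1};   // $s0 (work-item 0) and $s1 (work-item 1)
    std::array<std::deque<std::pair<int, int>>, 2> buffer;  // pending (addr, value) writes
};

using Outcomes = std::set<std::pair<int, int>>;

// Work-item 0: h0 st 1,[&a]; h1 ld $s0,[&b].  Work-item 1: k0 st 1,[&b]; k1 ld $s1,[&a].
const Op kProgram[2][2] = {
    {{true, 0}, {false, 1}},
    {{true, 1}, {false, 0}},
};

// "Pick one ready row, do it, and repeat until done", trying every possible pick.
void explore(State s, bool buffered, Outcomes& out) {
    bool progressed = false;
    for (int w = 0; w < 2; ++w) {
        // Choice 1: work-item w executes its next operation.
        if (s.pc[w] < 2) {
            State next = s;
            const Op& op = kProgram[w][next.pc[w]];
            if (op.is_store) {
                if (buffered) next.buffer[w].push_back({op.addr, 1});  // write is held
                else next.memory[op.addr] = 1;                         // SC: straight to memory
            } else {
                int value = next.memory[op.addr];
                // A load sees the newest pending write in its own buffer, if there is one.
                for (const auto& entry : next.buffer[w])
                    if (entry.first == op.addr) value = entry.second;
                next.reg[w] = value;
            }
            ++next.pc[w];
            explore(next, buffered, out);
            progressed = true;
        }
        // Choice 2: the oldest held write of work-item w reaches memory.
        if (!s.buffer[w].empty()) {
            State next = s;
            std::pair<int, int> entry = next.buffer[w].front();
            next.buffer[w].pop_front();
            next.memory[entry.first] = entry.second;
            explore(next, buffered, out);
            progressed = true;
        }
    }
    // Nothing left to do: record the final ($s0, $s1) pair.
    if (!progressed) out.insert({s.reg[0], s.reg[1]});
}

// All reachable ($s0, $s1) results; buffered=false is the SC operational model.
Outcomes outcomes(bool buffered) {
    Outcomes out;
    explore(State{}, buffered, out);
    return out;
}
```

Calling `outcomes(false)` explores the plain SC model, in which the pair (0, 0) never appears; `outcomes(true)` adds the write buffers, and (0, 0) becomes reachable because both stores can still be held when both loads read memory.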
  • 35. EXAMPLE 2: REGISTER DEPENDENCE DOES NOT FORCE MEMORY ORDER
        Work-item 0                   Work-item 1
        -------------------------     -------------------------
        @h0: ld $s0, [&a]             @j0: st 20, [100]
        @h1: ld $s1, [$s0]            @j1: st_rel 100, [&a]
      Initially, &a and the contents of location 100 = 0.
      $s1 == 0 and $s0 == 100 is allowed.
      If $s1 == 0 then h1 >> j0. If $s0 == 100 then j1 >> h0.
      Because this seems to violate dependence order, it is useful to consider how this can come about. Work-item 0 is allowed to prefetch load h1. One reason it might do this is that code before these operations reads address 96, and the implementation reads in large cache lines. Later, work-item 0 reads the new value of &a, which is 100. Then it reads the value of location 100, but because there is no synchronization, it can use the previously prefetched value of 0.
  • 36. EXAMPLE 3
        Work-item 0                   Work-item 1
        -------------------------     -------------------------
        @h0: ld_acq $s0, [&a]         @j0: st 20, [100]
        @h1: ld $s1, [$s0]            @j1: st_rel 100, [&a]
      Initially, &a and the contents of location 100 = 0.
      HSAIL does not allow $s1 == 0 and $s0 == 100.
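Example 3 is the classic "publish an address, then read through it" pattern. A rough C++ analogue is sketched below, under the assumption that a `std::memory_order_release` store and a `std::memory_order_acquire` load play the roles of st_rel and ld_acq. `Channel`, `produce`, and `consume` are hypothetical names: `payload` stands in for location 100 and `published` for &a.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins: `payload` plays "location 100", `published` plays &a.
struct Channel {
    int payload = 0;
    std::atomic<int*> published{nullptr};
};

// Work-item 1 on the slide: a plain store followed by a store-release of the address.
void produce(Channel& ch, int value) {
    ch.payload = value;                                          // j0: st 20, [100]
    ch.published.store(&ch.payload, std::memory_order_release);  // j1: st_rel 100, [&a]
}

// Work-item 0 on the slide: a load-acquire of the address, then a plain load through it.
// Returns true and fills `out` only if the address has been published.
bool consume(Channel& ch, int& out) {
    int* p = ch.published.load(std::memory_order_acquire);  // h0: ld_acq $s0, [&a]
    if (p == nullptr) return false;
    out = *p;                                                // h1: ld $s1, [$s0]
    return true;
}
```

When `produce` and `consume` run on different threads, a consumer that observes the published address is guaranteed by the C++ memory model to also observe the payload written before the release, which mirrors the outcome the slide says HSAIL forbids ($s0 == 100 with $s1 == 0).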
  • 37. QUESTIONS?