Más contenido relacionado
La actualidad más candente (20)
Similar a ISCA final presentation - Queuing Model (20)
Más de HSA Foundation (20)
ISCA final presentation - Queuing Model
- 3. MOTIVATION (TODAY’S PICTURE)
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
- 5. REQUIREMENTS
Three key technologies are used to build the user mode queueing
mechanism
Shared Virtual Memory
System Coherency
Signaling
AQL (Architected Queueing Language) enables any agent
enqueue tasks
© Copyright 2014 HSA Foundation. All Rights Reserved
- 7. PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (TODAY)
Multiple Virtual memory address spaces
© Copyright 2014 HSA Foundation. All Rights Reserved
CPU0 GPU
VIRTUAL MEMORY1
PHYSICAL MEMORY
VA1->PA1 VA2->PA1
VIRTUAL MEMORY2
- 8. PHYSICAL MEMORY
SHARED VIRTUAL MEMORY (HSA)
Common Virtual Memory for all HSA agents
© Copyright 2014 HSA Foundation. All Rights Reserved
CPU0 GPU
VIRTUAL MEMORY
PHYSICAL MEMORY
VA->PA VA->PA
- 9. SHARED VIRTUAL MEMORY
Advantages
No mapping tricks, no copying back-and-forth between different PA
addresses
Send pointers (not data) back and forth between HSA agents.
Implications
Common Page Tables (and common interpretation of architectural
semantics such as shareability, protection, etc).
Common mechanisms for address translation (and servicing address
translation faults)
Concept of a process address space (PASID) to allow multiple, per
process virtual address spaces within the system.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 10. SHARED VIRTUAL MEMORY
Specifics
Minimum supported VA width is 48b for 64b systems, and 32b for
32b systems.
HSA agents may reserve VA ranges for internal use via system
software.
All HSA agents other than the host unit must use the lowest privilege
level
If present, read/write access flags for page tables must be
maintained by all agents.
Read/write permissions apply to all HSA agents, equally.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 11. GETTING THERE …
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
- 13. CACHE COHERENCY DOMAINS (1/3)
Data accesses to global memory segment from all HSA Agents shall be
coherent without the need for explicit cache maintenance.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 14. CACHE COHERENCY DOMAINS (2/3)
Advantages
Composability
Reduced SW complexity when communicating between agents
Lower barrier to entry when porting software
Implications
Hardware coherency support between all HSA agents
Can take many forms
Stand alone Snoop Filters / Directories
Combined L3/Filters
Snoop-based systems (no filter)
Etc …
© Copyright 2014 HSA Foundation. All Rights Reserved
- 15. CACHE COHERENCY DOMAINS (3/3)
Specifics
No requirement for instruction memory accesses to be
coherent
Only applies to the Primary memory type.
No requirement for HSA agents to maintain coherency to any
memory location where the HSA agents do not specify the
same memory attributes
Read-only image data is required to remain static during the
execution of an HSA kernel.
No double mapping (via different attributes) in order to
modify. Must remain static
© Copyright 2014 HSA Foundation. All Rights Reserved
- 16. GETTING CLOSER …
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
- 18. SIGNALING (1/3)
HSA agents support the ability to use signaling objects
All creation/destruction signaling objects occurs via HSA
runtime APIs
From an HSA Agent you can directly access signaling objects.
Signaling a signal object (this will wake up HSA agents
waiting upon the object)
Query current object
Wait on the current object (various conditions supported).
© Copyright 2014 HSA Foundation. All Rights Reserved
- 19. SIGNALING (2/3)
Advantages
Enables asynchronous events between HSA agents,
without involving the kernel
Common idiom for work offload
Low power waiting
Implications
Runtime support required
Commonly implemented on top of cache coherency flows
© Copyright 2014 HSA Foundation. All Rights Reserved
- 20. SIGNALING (3/3)
Specifics
Only supported within a PASID
Supported wait conditions are =, !=, < and >=
Wait operations may return sporadically (no guarantee against
false positives)
Programmer must test.
Wait operations have a maximum duration before returning.
The HSAIL atomic operations are supported on signal objects.
Signal objects are opaque
Must use dedicated HSAIL/HSA runtime operations
© Copyright 2014 HSA Foundation. All Rights Reserved
- 21. ALMOST THERE…
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
- 23. ONE BLOCK LEFT
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
- 24. USER MODE QUEUEING (1/3)
User mode Queueing
Enables user space applications to directly, without OS
intervention, enqueue jobs (“Dispatch Packets”) for HSA
agents.
Queues are created/destroyed via calls to the HSA
runtime.
One (or many) agents enqueue packets, a single agent
dequeues packets.
Requires coherency and shared virtual memory.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 25. USER MODE QUEUEING (2/3)
Advantages
Avoid involving the kernel/driver when dispatching work for an Agent.
Lower latency job dispatch enables finer granularity of offload
Standard memory protection mechanisms may be used to protect communication with
the consuming agent.
Implications
Packet formats/fields are Architected – standard across vendors!
Guaranteed backward compatibility
Packets are enqueued/dequeued via an Architected protocol (all via memory
accesses and signaling)
More on this later……
© Copyright 2014 HSA Foundation. All Rights Reserved
- 26. SUCCESS!
© Copyright 2014 HSA Foundation. All Rights Reserved
Application OS GPU
Transfer
buffer to GPU Copy/Map
Memory
Queue Job
Schedule Job
Start Job
Finish Job
Schedule
Application
Get Buffer
Copy/Map
Memory
- 29. ARCHITECTED QUEUEING LANGUAGE
HSA Queues look just like standard shared
memory queues, supporting multi-producer,
single-consumer
Single producer variant defined with some
optimizations possible.
Queues consist of storage, read/write indices, ID,
etc.
Queues are created/destroyed via calls to the
HSA runtime
“Packets” are placed in queues directly from user
mode, via an architected protocol
Packet format is architected
© Copyright 2014 HSA Foundation. All Rights Reserved
Producer Producer
Consumer
Read Index
Write Index
Storage in
coherent, shared
memory
Packets
- 30. ARCHITECTED QUEUING LANGUAGE
Packets are read and dispatched for execution from the queue in order, but
may complete in any order.
There is no guarantee that more than one packet will be processed in parallel at a
time
There may be many queues. A single agent may also consume from several
queues.
Any HSA agent may enqueue packets
CPUs
GPUs
Other accelerators
© Copyright 2014 HSA Foundation. All Rights Reserved
- 31. QUEUE STRUCTURE
© Copyright 2014 HSA Foundation. All Rights Reserved
Offset (bytes) Size (bytes) Field Notes
0 4 queueType Differentiate different queues
4 4 queueFeatures Indicate supported features
8 8 baseAddress Pointer to packet array
16 16 doorbellSignal HSA signaling object handle
24 4 size Packet array cardinality
28 4 queueId Unique per process
32 8 serviceQueue Queue for callback services
intrinsic 8 writeIndex Packet array write index
intrinsic 8 readIndex Packet array read index
- 32. QUEUE VARIANTS
queueType and queueFeatures together define queue semantics and
capabilities
Two queueType values defined, other values reserved:
MULTI – queue supports multiple producers
SINGLE – queue supports single producer
queueFeatures is a bitfield indicating capabilities
DISPATCH (bit 0) if set then queue supports DISPATCH packets
AGENT_DISPATCH (bit 1) if set then queue supports AGENT_DISPATCH packets
All other bits are reserved and must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
- 33. QUEUE STRUCTURE DETAILS
Queue doorbells are HSA signaling objects with restrictions
Created as part of the queue – lifetime tied to queue object
Atomic read-modify-write not allowed
size field value must be aligned to a power of 2
serviceQueue can be used by HSA kernel for callback services
Provided by application when queue is created
Can be mapped to HSA runtime provided serviceQueue, an application serviced
queue, or NULL if no serviceQueue required
© Copyright 2014 HSA Foundation. All Rights Reserved
- 34. READ/WRITE INDICES
readIndex and writeIndex properties are part of the queue, but not visible in the queue structure
Accessed through HSA runtime API and HSAIL operations
HSA runtime/HSAIL operations defined to
Read readIndex or writeIndex property
Write readIndex or writeIndex property
Add constant to writeIndex property (returns previous writeIndex value)
CAS on writeIndex property
readIndex & writeIndex operations treated as atomic in memory model
relaxed, acquire, release and acquire-release variants defined as applicable
readIndex and writeIndex never wrap
PacketID – the index of a particular packet
Uniquely identifies each packet of a queue
© Copyright 2014 HSA Foundation. All Rights Reserved
- 35. PACKET ENQUEUE
Packet enqueue follows a few simple steps:
Reserve space
Multiple packets can be reserved at a time
Write packet to queue
Mark packet as valid
Producer no longer allowed to modify packet
Consumer is allowed to start processing packet
Notify consumer of packet through the queue doorbell
Multiple packets can be notified at a time
Doorbell signal should be signaled with last packetID notified
On small machine model the lower 32 bits of the packetID are used
© Copyright 2014 HSA Foundation. All Rights Reserved
- 36. PACKET RESERVATION
Two flows envisaged
Atomic add writeIndex with number of packets to reserve
Producer must wait until packetID < readIndex + size before writing to packet
Queue can be sized so that wait is unlikely (or impossible)
Suitable when many threads use one queue
Check queue not full first, then use atomic CAS to update writeIndex
Can be inefficient if many threads use the same queue
Allows different failure model if queue is congested
© Copyright 2014 HSA Foundation. All Rights Reserved
- 37. QUEUE OPTIMIZATIONS
Queue behavior is loosely defined to allow optimizations
Some potential producer behavior optimizations:
Keep local copy of readIndex, update when required
For single producer queues:
Keep local copy of writeIndex
Use store operation rather than add/cas atomic to update writeIndex
Some potential consumer behavior optimizations:
Use packet format field to determine whether a packet has been submitted rather than writeIndex
property
Speculatively read multiple packets from the queue
Not update readIndex for each packet processed
Rely on value used for doorbellSignal to notify new packets
Especially useful for single producer queues
© Copyright 2014 HSA Foundation. All Rights Reserved
- 38. POTENTIAL MULTI-PRODUCER ALGORITHM
// Allocate packet
uint64_t packetID = hsa_queue_add_write_index_relaxed(q, 1);
// Wait until the queue is no longer full.
uint64_t rdIdx;
do {
rdIdx = hsa_queue_load_read_index_relaxed(q);
} while (packetID >= (rdIdx + q->size));
// calculate index
uint32_t arrayIdx = packetID & (q->size-1);
// copy over the packet, the format field is INVALID
q->baseAddress[arrayIdx] = pkt;
// Update format field with release semantics
q->baseAddress[index].hdr.format.store(DISPATCH, std::memory_order_release);
// ring doorbell, with release semantics (could also amortize over multiple packets)
hsa_signal_send_relaxed(q->doorbellSignal, packetID);
© Copyright 2014 HSA Foundation. All Rights Reserved
- 39. POTENTIAL CONSUMER ALGORITHM
// Get location of next packet
uint64_t readIndex = hsa_queue_load_read_index_relaxed(q);
// calculate the index
uint32_t arrayIdx = readIndex & (q->size-1);
// spin while empty (could also perform low-power wait on doorbell)
while (INVALID == q->baseAddress[arrayIdx].hdr.format) { }
// copy over the packet
pkt = q->baseAddress[arrayIdx];
// set the format field to invalid
q->baseAddress[arrayIdx].hdr.format.store(INVALID, std::memory_order_relaxed);
// Update the readIndex using HSA intrinsic
hsa_queue_store_read_index_relaxed(q, readIndex+1);
// Now process <pkt>!
© Copyright 2014 HSA Foundation. All Rights Reserved
- 41. PACKETS
© Copyright 2014 HSA Foundation. All Rights Reserved
Packets come in three main types with architected layouts
Always reserved & Invalid
Do not contain any valid tasks and are not processed (queue will not progress)
Dispatch
Specifies kernel execution over a grid
Agent Dispatch
Specifies a single function to perform with a set of parameters
Barrier
Used for task dependencies
- 42. COMMON PACKET HEADER
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t
format:8
Contains the packet type (Always reserved, Invalid,
Dispatch, Agent Dispatch, and Barrier). Other values are
reserved and should not be used.
barrier:1
If set then processing of packet will only begin when all
preceding packets are complete.
acquireFenceScope:2
Determines the scope and type of the memory fence
operation applied before the packet enters the active
phase.
Must be 0 for Barrier Packets.
releaseFenceScope:2
Determines the scope and type of the memory fence
operation applied after kernel completion but before the
packet is completed.
reserved:3 Must be 0
© Copyright 2014 HSA Foundation. All Rights Reserved
- 43. DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start
Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_t
dimensions:2 Number of dimensions specified in gridSize. Valid values are 1, 2, or 3.
reserved:14 Must be 0.
4 uint16_t workgroupSize.x x dimension of work-group (measured in work-items).
6 uint16_t workgroupSize.y y dimension of work-group (measured in work-items).
8 uint16_t workgroupSize.z z dimension of work-group (measured in work-items).
10 uint16_t reserved2 Must be 0.
12 uint32_t gridSize.x x dimension of grid (measured in work-items).
16 uint32_t gridSize.y y dimension of grid (measured in work-items).
20 uint32_t gridSize.z z dimension of grid (measured in work-items).
24 uint32_t privateSegmentSizeBytes Total size in bytes of private memory allocation request (per work-item).
28 uint32_t groupSegmentSizeBytes Total size in bytes of group memory allocation request (per work-group).
32 uint64_t kernelObjectAddress
Address of an object in memory that includes an implementation-defined
executable ISA image for the kernel.
40 uint64_t kernargAddress Address of memory containing kernel arguments.
48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
- 44. AGENT DISPATCH PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header
2 uint16_t type
The function to be performed by the destination Agent. The type value is
split into the following ranges:
0x0000:0x3FFF – Vendor specific
0x4000:0x7FFF – HSA runtime
0x8000:0xFFFF – User registered function
4 uint32_t reserved2 Must be 0.
8 uint64_t returnLocation Pointer to location to store the function return value in.
16 uint64_t arg[0]
64-bit direct or indirect arguments.
24 uint64_t arg[1]
32 uint64_t arg[2]
40 uint64_t arg[3]
48 uint64_t reserved3 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
- 45. BARRIER PACKET
Used for specifying dependences between packets
HSA agent will not launch any further packets from this queue until the barrier
packet signal conditions are met
Used for specifying dependences on packets dispatched from any queue.
Execution phase completes only when all of the dependent signals (up to five) have
been signaled (with the value of 0).
Or if an error has occurred in one of the packets upon which we have a dependence.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 46. BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Start Offset
(Bytes)
Format Field Name Description
0 uint16_t header Packet header, see 2.8.1 Packet header (p. 16).
2 uint16_t reserved2 Must be 0.
4 uint32_t reserved3 Must be 0.
8 uint64_t depSignal0
Address of dependent signaling objects to be evaluated by the packet processor.
16 uint64_t depSignal1
24 uint64_t depSignal2
32 uint64_t depSignal3
40 uint64_t depSignal4
48 uint64_t reserved4 Must be 0.
56 uint64_t completionSignal Address of HSA signaling object used to indicate completion of the job.
- 47. DEPENDENCES
A user may never assume more than one packet is being executed by an HSA
agent at a time.
Implications:
Packets can’t poll on shared memory values which will be set by packets issued from
other queues, unless the user has ensured the proper ordering.
To ensure all previous packets from a queue have been completed, use the Barrier
bit.
To ensure specific packets from any queue have completed, use the Barrier packet.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 49. PACKET EXECUTION
Launch phase
Initiated when launch conditions are met
All preceding packets in the queue must have exited launch phase
If the barrier bit in the packet header is set, then all preceding packets in the queue
must have exited completion phase
Includes memory acquire fence
Active phase
Execute the packet
Barrier packets remain in Active phase until conditions are met.
Completion phase
First step is memory release fence – make results visible.
completionSignal field is then signaled with a decrementing atomic.
© Copyright 2014 HSA Foundation. All Rights Reserved
- 50. PACKET EXECUTION – BARRIER BIT
© Copyright 2014 HSA Foundation. All Rights Reserved
Pkt1
Launch
Pkt2
Launch
Pkt1
Execute
Pkt2
Execute
Pkt1
Complete
Pkt3
Launch (barrier=1)
Pkt2
Complete
Pkt3
Execute
Time
Pkt3 launches whenall
packets in the queue
have completed.
- 51. PUTTING IT ALL TOGETHER (FFT)
© Copyright 2014 HSA Foundation. All Rights Reserved
Packet 1
Packet 2
Packet 3
Packet 4
Packet 5
Packet 6
Barrier Barrier
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
Time
- 52. PUTTING IT ALL TOGETHER
© Copyright 2014 HSA Foundation. All Rights Reserved
AQL Pseudo Code
// Send the packets to do the first stage.
aql_dispatch(pkt1);
aql_dispatch(pkt2);
// Send the next two packets, setting the barrier bit so we
// know packets 1 & 2 will be complete before 3 and 4 are
// launched.
aql_dispatch_with _barrier_bit(pkt3);
aql_dispatch(pkt4);
// Same as above (make sure 3 & 4 are done before issuing 5
// & 6)
aql_dispatch_with_barrier_bit(pkt5);
aql_dispatch(pkt6);
// This packet will notify us when 5 & 6 are complete)
aql_dispatch_with_barrier_bit(finish_pkt);
- 53. PACKET EXECUTION – BARRIER PACKET
© Copyright 2014 HSA Foundation. All Rights Reserved
Barrier T2Q2
T1Q1
Signal X
init to 1
depSignal0
completionSignal
Time
Decrements signal X
Barrier
Launch
T1
Launch
Barrier
Execute
T1
Execute
Barrier
Complete
T1
Complete
T2
Launch
T2
Execute
T2
Complete
Barrier completes
when signal X
signalled with 0
T2 launches once
barrier complete
- 54. DEPTH FIRST CHILD TASK EXECUTION
Consider two generations of child tasks
Task T submits tasks T.1 & T.2
Task T.1 submits tasks T.1.1 & T.1.2
Task T.2 submits tasks T.2.1 & T.2.2
Desired outcome
Depth first child task execution
I.e. T T1 T.1.1 T.1.2 T.2 T.2.1 T.2.2
T passed signal (allComplete) to decrement when all tasks are complete (T and its
children etc)
© Copyright 2014 HSA Foundation. All Rights Reserved
T
T.2.2T.1.2T.1.2T.1.1
T.1 T.2
- 55. HOW TO DO THIS WITH HSA QUEUES?
Use a separate user mode queue for each recursion level
Task T submits to queue Q1
Tasks T.1 & T.2 submits tasks to queue Q2
Queues could be passed in as parameters to task T
Depth first requires ordering of T.1, T.2 and their children
Use additional signal object (childrenComplete) to track completion of the children of
T.1 & T.2
childrenComplete set to number of children (i.e. 2) by each of T.1 & T.2
© Copyright 2014 HSA Foundation. All Rights Reserved
- 56. A PICTURE SAYS MORE THAN 1000 WORDS
© Copyright 2014 HSA Foundation. All Rights Reserved
T
T.2.2T.1.2T.1.2T.1.1
T.1 T.2 T.1 Barrier T.2 BarrierQ1
Wait on
childrenComplete
Signal
allComplete
T.1.1 T.1.2 T.2.1 T.2.2Q2
- 58. KEY HSA TECHNOLOGIES
HSA combines several mechanisms to enable low overhead task
dispatch
Shared Virtual Memory
System Coherency
Signaling
AQL
User mode queues – from any compatible agent
Architected packet format
Rich dependency mechanism
Flexible and efficient signaling of completion
© Copyright 2014 HSA Foundation. All Rights Reserved