The document describes how HSA enables more efficient programming techniques for GPUs compared to legacy approaches. It provides examples of pointer-based data structures, dynamic task management, and large data sets. For pointer-based data structures, HSA allows direct GPU access to data structures created by the CPU in unified coherent memory, avoiding copies. For dynamic task management, HSA platform atomics allow efficient sharing of task pools between the CPU and GPU without data copying or reconciliation. And for large data sets, HSA gives the GPU access to operate directly on large models in unified memory, reducing overhead from data copies and kernel launches.
The kernel has an input buffer containing the list of keys being searched for; its outputs are the values from the matching key-value pairs.
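A minimal OpenCL 2.0 kernel sketch of that search, assuming a hypothetical Node struct built by the CPU in shared virtual memory (the Node layout, the search_keys name, and the -1 sentinel are illustrative assumptions, not the study's actual code):

```c
/* Hypothetical node layout. Because the tree lives in coherent shared
 * virtual memory, the GPU can follow the CPU-written pointers directly. */
typedef struct Node {
    long key;
    long value;
    __global struct Node *left;
    __global struct Node *right;
} Node;

__kernel void search_keys(__global Node *root,
                          __global const long *keys, /* keys to search for */
                          __global long *values,     /* matched values out */
                          int num_keys)
{
    int i = get_global_id(0);
    if (i >= num_keys) return;

    long key = keys[i];
    __global Node *n = root;
    while (n) {
        if (key == n->key) { values[i] = n->value; return; }
        n = (key < n->key) ? n->left : n->right;
    }
    values[i] = -1; /* sentinel: key not found */
}
```

Because the tree sits in coherent shared memory, the kernel dereferences the CPU-written pointers as-is; no serialization into an index-based array is needed.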
This case study implements a dynamic task scheduling scheme that aims at load balancing among work-groups.
Traditional heterogeneous approach:
The host system enqueues tasks in several queues located in GPU memory.
Two variables per queue are used to synchronize CPU and GPU: the number of tasks that have been written to the queue, and the number of tasks that have already been consumed from it.
These variables are duplicated in CPU and GPU memory.
The GPU runs a number of persistent work-groups. A work-group can dequeue one task and update the number of consumed tasks.
A group of tasks is asynchronously transferred to one queue in GPU memory.
Then, the host updates the number of written tasks in CPU memory.
The number of written tasks is updated in GPU memory by an asynchronous transfer.
A work-group dequeues one task from the queue, updates the number of consumed tasks with a global memory atomic operation, and then checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
Work-groups 2, 3, and 4 repeat the same sequence in turn: each dequeues the next task, updates the number of consumed tasks, and compares it with the number of written tasks.
After work-group 4's update, the number of consumed tasks equals the number of written tasks, so the queue is empty.
The number of consumed tasks must then be updated in CPU memory; this is implemented using the zero-copy feature.
Once the number of consumed tasks in CPU memory is updated, the host thread detects that this number equals the number of written tasks, and more tasks can then be enqueued in queue 1.
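A rough host-side sketch of the legacy insertion steps above, assuming OpenCL 1.2-style buffers; the names d_queue and d_num_written and the Task payload are illustrative assumptions:

```c
#include <CL/cl.h>

typedef struct { int id; } Task;  /* placeholder task payload */

/* Legacy path: the queue and the written-count live in GPU memory,
 * so every insertion needs explicit asynchronous transfers. */
void enqueue_tasks_legacy(cl_command_queue cq,
                          cl_mem d_queue, cl_mem d_num_written,
                          const Task *tasks, int first, int count,
                          int *h_num_written)
{
    /* 1. Asynchronously transfer the group of tasks to the GPU queue. */
    clEnqueueWriteBuffer(cq, d_queue, CL_FALSE,
                         (size_t)first * sizeof(Task),
                         (size_t)count * sizeof(Task),
                         &tasks[first], 0, NULL, NULL);

    /* 2. Update the number of written tasks in CPU memory... */
    *h_num_written += count;

    /* 3. ...and mirror it to GPU memory with another transfer. */
    clEnqueueWriteBuffer(cq, d_num_written, CL_FALSE, 0, sizeof(int),
                         h_num_written, 0, NULL, NULL);
}
```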
Using HSA and full OpenCL 2.0, queues and synchronization variables can be allocated in host coherent memory.
Moving tasks to a queue is as simple as using memcpy.
No copies of the number of written tasks and the number of consumed tasks are needed in GPU memory.
A work-group can dequeue one task from a queue in host coherent memory.
The number of consumed tasks is updated by using platform atomics.
The function that inserts tasks into the queues alone needs 5x fewer lines of code than its legacy counterpart.
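A sketch of the same insertion under HSA/full OpenCL 2.0 with fine-grain SVM, plus the work-group-side dequeue with a platform atomic; the names queue, num_written, and num_consumed are assumptions for illustration:

```c
/* Host side (C11): queue and counters live in host coherent memory,
 * so inserting tasks is just a memcpy plus one atomic the GPU can see. */
#include <stdatomic.h>
#include <string.h>

typedef struct { int id; } Task;  /* placeholder task payload */

void enqueue_tasks_hsa(Task *queue, atomic_int *num_written,
                       const Task *tasks, int first, int count)
{
    memcpy(&queue[first], &tasks[first], count * sizeof(Task));
    atomic_fetch_add(num_written, count);  /* no mirror copy to GPU memory */
}

/* Device side (OpenCL C 2.0), inside a persistent work-group:
 * claim the next task with a platform atomic on the shared counter.
 *
 *   int next = atomic_fetch_add_explicit(num_consumed, 1,
 *                                        memory_order_acq_rel,
 *                                        memory_scope_all_svm_devices);
 *   if (next < atomic_load_explicit(num_written, memory_order_acquire,
 *                                   memory_scope_all_svm_devices))
 *       process(&queue[next]);
 */
```

The two asynchronous transfers of the legacy path collapse into a memcpy and a single platform atomic that both devices observe directly.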
This slide presents 8 tests. The total number of tasks in the task pool is 4096 or 16384, and the number of queues is 4 in every test.
Each time the host inserts tasks into a queue, it inserts 64, 128, 256, or 512 tasks.
An atomic operation locks a parent node before a child is added to it: a semaphore on the tree structure, taken with a compare-and-swap (CAS); a sketch follows below.
The tree has 2M nodes; 0.5M nodes are added.
The insertion is timed three ways: CPU only, GPU only, and CPU and GPU together.
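A minimal OpenCL C 2.0 sketch of that locking scheme, assuming the semaphore is an atomic_int lock carried on each node (0 = free, 1 = held); the field names and spin loop are illustrative, and the slides' wording could equally mean a single semaphore on the whole tree struct:

```c
typedef struct Node {
    long key;
    atomic_int lock;               /* semaphore: 0 = free, 1 = held */
    __global struct Node *left;
    __global struct Node *right;
} Node;

/* Take the parent's semaphore with CAS before linking a new child,
 * so concurrent CPU and GPU inserters cannot race on the same node. */
void insert_child(__global Node *parent, __global Node *child)
{
    int expected = 0;
    /* Spin until the CAS flips the lock from 0 (free) to 1 (held). */
    while (!atomic_compare_exchange_strong_explicit(
               &parent->lock, &expected, 1,
               memory_order_acq_rel, memory_order_acquire,
               memory_scope_all_svm_devices))
        expected = 0;              /* CAS wrote the observed value back */

    if (child->key < parent->key) parent->left = child;
    else                          parent->right = child;

    /* Release the semaphore, making the link visible to CPU and GPU. */
    atomic_store_explicit(&parent->lock, 0, memory_order_release,
                          memory_scope_all_svm_devices);
}
```

The all_svm_devices scope is what lets CPU and GPU contend for the same lock, which is the point of timing the insertion on CPU, GPU, and both.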
Only the dividing planes for the first two levels are loaded into GPU memory.
BVH – Bounding Volume Hierarchy
Each leaf has a collection of primitives (spheres)
The query looks for the first sphere that intersects a given point; a sketch follows below.
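A rough C sketch of that point query over a CPU-built BVH in shared memory; the BVHNode and Sphere layouts and the recursive traversal are assumptions standing in for the study's actual structures:

```c
#include <stddef.h>

typedef struct { float x, y, z; } Point;
typedef struct { Point center; float radius; } Sphere;

typedef struct BVHNode {
    Point bbox_min, bbox_max;      /* axis-aligned bounds of this subtree */
    struct BVHNode *left, *right;  /* NULL for leaf nodes */
    Sphere *spheres;               /* leaf payload */
    int num_spheres;
} BVHNode;

static int inside_box(const BVHNode *n, Point p)
{
    return p.x >= n->bbox_min.x && p.x <= n->bbox_max.x &&
           p.y >= n->bbox_min.y && p.y <= n->bbox_max.y &&
           p.z >= n->bbox_min.z && p.z <= n->bbox_max.z;
}

static int inside_sphere(const Sphere *s, Point p)
{
    float dx = p.x - s->center.x, dy = p.y - s->center.y,
          dz = p.z - s->center.z;
    return dx * dx + dy * dy + dz * dz <= s->radius * s->radius;
}

/* Walk the hierarchy, pruning subtrees whose bounding box misses the point. */
const Sphere *find_first(const BVHNode *n, Point p)
{
    if (n == NULL || !inside_box(n, p)) return NULL;
    if (n->left == NULL && n->right == NULL) {       /* leaf */
        for (int i = 0; i < n->num_spheres; ++i)
            if (inside_sphere(&n->spheres[i], p)) return &n->spheres[i];
        return NULL;
    }
    const Sphere *hit = find_first(n->left, p);
    return hit != NULL ? hit : find_first(n->right, p);
}
```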
Heap allocation (hUMA – virtual memory) – a dGPU version would need to page large portions of the model to the GPU, and a CPU-only version would be slow.
Data pointers (hUMA – unified addresses) – a non-HSA version would need to "serialize" the tree into an array (with indices in place of pointers) for the GPU.
Recursion (hQ – GPU enqueuing) – a non-HSA version would suffer from load imbalance, because the CPU has to wait and spawn one kernel to process all secondary rays, whereas with HSA the GPU threads can dynamically spawn kernels to process secondary rays.
Callbacks (hUMA – platform atomics) – in the non-HSA version, the CPU has to wait until the first kernel exits to begin processing callbacks, and cannot launch the second kernel until all callbacks have completed.
Atomics (hUMA – memory coherence & platform atomics) – in the non-HSA version, CPU and GPU processing is serialized.