HSA APPLICATIONS
WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ
USE CASES SHOWING HSA ADVANTAGE
Pointer-based Data Structures
• Use case: Binary tree searches — the GPU performs parallel searches in a CPU-created binary tree.
• HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.

Platform Atomics
• Use cases: Work-group dynamic task management — the GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. Binary tree updates — the CPU and GPU operate on the tree simultaneously, both making modifications.
• HSA advantage: The CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets
• Use case: Hierarchical data searches — applications include object recognition, collision detection, global illumination, and BVH.
• HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel-launch overhead.

CPU Callbacks
• Use case: Middleware user callbacks — the GPU processes work items, some of which require a call to a CPU function to fetch new data.
• HSA advantage: The GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require “split kernels”; higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY
FOR POINTER-BASED DATA
STRUCTURES
UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES

Legacy
[Diagram: the CPU builds a pointer-based tree (nodes with L/R child links) in system memory. To search it, the tree must be flattened and copied into GPU memory; the kernel searches the flat tree and writes into a result buffer, which is then copied back to the result buffer in system memory.]
UNIFIED COHERENT MEMORY
MORE EFFICIENT POINTER DATA STRUCTURES

HSA and full OpenCL 2.0
[Diagram: the kernel accesses the pointer-based tree and the result buffer directly in system memory; no flattening and no copying are needed.]
POINTER DATA STRUCTURES
- CODE COMPLEXITY
[Side-by-side source listings compare the HSA and Legacy implementations.]
POINTER DATA STRUCTURES
- PERFORMANCE
[Chart: binary tree search rate (nodes/ms, 0–60,000) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 cores), Legacy APU, and HSA APU. Measured in AMD labs Jan 1–3 on the system shown in a backup slide.]
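The pointer-chasing pattern above can be sketched on the host: a CPU-built binary search tree is searched in place by concurrent workers that follow the original pointers. This is an analogy, not HSA code — std::threads stand in for GPU work-items, and all names here are illustrative. Under HSA or OpenCL 2.0 fine-grained SVM, a GPU kernel could chase these same host pointers directly, which is the point the slides make.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// CPU-built pointer-based binary search tree (no flattening, no copies).
struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Host-side tree construction.
Node* insert(Node* root, int key) {
    if (!root) return new Node{key};
    if (key < root->key) root->left = insert(root->left, key);
    else if (key > root->key) root->right = insert(root->right, key);
    return root;
}

// The "kernel": plain pointer chasing over the shared tree.
bool search(const Node* n, int key) {
    while (n) {
        if (key == n->key) return true;
        n = key < n->key ? n->left : n->right;
    }
    return false;
}

// One searcher per query, all reading the same in-place tree concurrently.
int parallel_search_count(const Node* root, const std::vector<int>& queries) {
    std::atomic<int> hits{0};
    std::vector<std::thread> workers;
    for (int q : queries)
        workers.emplace_back([&, q] { if (search(root, q)) ++hits; });
    for (auto& t : workers) t.join();
    return hits.load();
}
```

The tree is never serialized into a flat buffer; every searcher dereferences the builder's pointers directly, which is exactly what unified coherent memory permits between CPU and GPU.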
PLATFORM ATOMICS FOR
DYNAMIC TASK MANAGEMENT
PLATFORM ATOMICS
ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*
[Animation: the CPU fills a task pool in system memory and transfers tasks asynchronously into per-queue buffers in GPU memory. Work-groups 1–4 each track “num. written” and “num. consumed” counters for queue 1 and queue 2; the written counters advance as transfers land, and the consumed counters advance from 0 to 4 as work-groups claim tasks with GPU atomic adds. Results return to system memory zero-copy only when the kernel completes.]

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
PLATFORM ATOMICS
ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0
[Animation: the task pool and the work-groups’ queue counters (“num. written”, “num. consumed”) live in host-coherent memory. The CPU publishes tasks with a memcpy, and work-groups 1–4 claim them with platform atomic adds that both CPU and GPU observe directly; no staging copies through GPU memory are required.]

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
PLATFORM ATOMICS – CODE COMPLEXITY
HSA: host enqueue function is 20 lines of code
Legacy: host enqueue function is 102 lines of code
PLATFORM ATOMICS - PERFORMANCE
[Chart: execution time (ms, 0–700) vs. tasks per insertion (64, 128, 256, 512) for task pool sizes of 4096 and 16384, comparing the Legacy implementation (ms) against the HSA implementation (ms).]
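The “num. written” / “num. consumed” bookkeeping from the slides can be sketched as a task queue driven by two atomic counters. This is a hedged host-side analogy: std::atomic plays the role of a platform atomic in host-coherent memory, the struct and member names are illustrative, and for simplicity the sketch assumes all tasks are produced before consumers start draining.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Fixed task pool plus the two counters the slides animate.
// Under HSA both counters would live in host-coherent memory, visible to
// CPU and GPU alike; here std::atomic stands in for that.
struct TaskQueue {
    std::vector<int> pool;            // tasks written by the host
    std::atomic<int> num_written{0};  // producer-side counter
    std::atomic<int> num_consumed{0}; // consumer-side counter

    // Host: append a task, then publish it by bumping num_written.
    // (Sketch assumption: production finishes before consumption begins,
    // so push_back never races with readers.)
    void produce(int task) {
        pool.push_back(task);
        num_written.fetch_add(1, std::memory_order_release);
    }

    // Work-group: claim the next unconsumed task, or report the queue
    // drained. fetch_add makes the claim race-free across consumers.
    bool consume(int& task_out) {
        int idx = num_consumed.fetch_add(1, std::memory_order_acq_rel);
        if (idx >= num_written.load(std::memory_order_acquire)) return false;
        task_out = pool[idx];
        return true;
    }
};
```

Because claiming a task is a single atomic add on a shared counter, no per-queue staging copies or host round-trips are needed — the efficiency the HSA slides contrast with the legacy flow.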
PLATFORM ATOMICS FOR
CPU/GPU COLLABORATION
PLATFORM ATOMICS
ENABLING EFFICIENT GPU/CPU COLLABORATION

Legacy
[Diagram: only the GPU kernel can work on the tree/input buffer; concurrent CPU processing is not possible.]
PLATFORM ATOMICS
ENABLING EFFICIENT GPU/CPU COLLABORATION

HSA and full OpenCL 2.0
[Diagram: the GPU kernel and CPU cores 0 and 1 all operate on the same tree/input buffer concurrently.]
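The “binary tree updates” use case — CPU and GPU both modifying one tree, synchronized by platform atomics — can be sketched with atomic child links and compare-and-swap. This is an illustrative host-only analogy, not HSA code: two std::threads stand in for the two devices, and all names are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Tree node whose child links are atomic, so concurrent inserters can
// install children with compare-and-swap instead of locks.
struct Node {
    int key;
    std::atomic<Node*> left{nullptr};
    std::atomic<Node*> right{nullptr};
    explicit Node(int k) : key(k) {}
};

// Lock-free insert: walk down, then CAS a null child link to the new node.
// If the CAS fails, another inserter won that slot; continue from its node.
void insert(Node* root, int key) {
    Node* fresh = new Node(key);
    Node* cur = root;
    for (;;) {
        if (key == cur->key) { delete fresh; return; }  // duplicate key
        std::atomic<Node*>& child = key < cur->key ? cur->left : cur->right;
        Node* expected = child.load();
        if (expected) { cur = expected; continue; }      // keep walking
        if (child.compare_exchange_strong(expected, fresh)) return;
        cur = expected;  // lost the race; expected now holds the winner
    }
}

// Sequential check: count every node reachable from the root.
int count(const Node* n) {
    if (!n) return 0;
    return 1 + count(n->left.load()) + count(n->right.load());
}
```

The same compare-and-swap discipline, expressed as platform atomics over shared virtual memory, is what lets a CPU core and a GPU kernel modify one structure without copying or reconciling.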
UNIFIED COHERENT MEMORY
FOR LARGE
DATA SETS
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in system memory. Computations using the data are offloaded to the GPU.
[Diagram: system memory and GPU]
PROCESSING LARGE DATA SETS
[Diagram: a large 3D spatial data structure in system memory, organized as a hierarchy of levels 1–5, with the GPU attached.]
The CPU creates a large data structure in system memory. Computations using the data are offloaded to the GPU. Compare the HSA and Legacy methods.
LEGACY ACCESS USING GPU MEMORY
Legacy
GPU memory is smaller than the large 3D spatial data structure (hierarchy levels 1–5), so the data has to be copied and processed in chunks:
• Copy one chunk at a time: the top 2 levels of the hierarchy are copied from system memory into GPU memory.
• Process one chunk at a time: the first kernel runs over the copied levels.
• Copy one chunk at a time: the bottom 3 levels of one branch of the hierarchy are copied in.
• Process one chunk at a time: the second kernel runs over that branch.
• The cycle repeats, copying the bottom 3 levels of a different branch and launching a new kernel each time, until the Nth kernel has processed the entire structure.
LARGE SPATIAL DATA STRUCTURE
HSA and full OpenCL 2.0
[Diagram: the kernel accesses the large 3D spatial data structure (hierarchy levels 1–5) directly in system memory.]
GPU CAN TRAVERSE ENTIRE HIERARCHY
HSA
[Animation: a single kernel walks all five levels of the hierarchy in place in system memory, with no chunked copies and no repeated kernel launches.]
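The legacy chunked flow versus the HSA in-place flow can be sketched side by side. This is a minimal host-only illustration under stated assumptions: a small vector stands in for limited GPU memory, a running sum stands in for the kernel, and all sizes and names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Legacy flow: the "device buffer" is smaller than the data set, so the
// host copies one chunk at a time and launches one "kernel" per chunk.
long process_chunked(const std::vector<int>& data, std::size_t dev_capacity) {
    std::vector<int> dev(dev_capacity);  // stand-in for small GPU memory
    long total = 0;
    for (std::size_t off = 0; off < data.size(); off += dev_capacity) {
        std::size_t n = std::min(dev_capacity, data.size() - off);
        std::copy_n(data.begin() + off, n, dev.begin());  // host->device copy
        total += std::accumulate(dev.begin(), dev.begin() + n, 0L); // "kernel"
    }
    return total;  // same answer, but paid for many copies and launches
}

// HSA flow: one "kernel" walks the whole structure in place in system
// memory -- no staging buffer, no per-chunk copies, one launch.
long process_in_place(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}
```

Both paths compute the same result; the difference the slides emphasize is purely the copy and kernel-launch overhead the chunked path carries.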
CALLBACKS
 Parallel processing algorithm with branches
 A seldom-taken branch requires new data from the CPU
 On legacy systems, the algorithm must be split:
 Process Kernel 1 on the GPU
 Check for CPU callbacks and, if any, process them on the CPU
 Process Kernel 2 on the GPU
 Example algorithm from image processing:
 Perform a filter
 Calculate the average LUMA in each tile
 Compare the LUMA against a threshold and call a CPU callback if it is exceeded (rare)
 Perform special processing on tiles with callbacks
COMMON SITUATION IN HETEROGENEOUS COMPUTING
[Figure: input image and output image]
CALLBACKS
Legacy
[Diagram: GPU threads 0, 1, 2, …, N. The first kernel must terminate before the CPU can service callbacks; a continuation kernel then finishes up the kernel’s work. The split results in poor GPU utilization.]
CALLBACKS
[Diagram: input image → output image; 1 tile = 1 OpenCL work item]
GPU
• Work items compute the average RGB value of all the pixels in a tile
• Work items also compute the average Luma from the average RGB
• If the average Luma > threshold, the workgroup invokes a CPU CALLBACK
• In parallel with the callback, computation continues
CPU
• For selected tiles, update the average Luma value (set to RED)
GPU
• Work items apply the Luma value to all pixels in the tile

GPU-to-CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using platform atomic Compare-and-Swap.
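The SVM-semaphore handshake described above can be sketched with compare-and-swap on a shared flag. This is an analogy under assumptions: std::atomic plays the role of a platform atomic in shared virtual memory, two std::threads stand in for a GPU work-group and the CPU service thread, and the state names and RED value are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

// Semaphore states for the callback handshake (illustrative names).
enum : int { IDLE = 0, REQUESTED = 1, DONE = 2 };

// "GPU work-group": raise a callback request with CAS, then spin until the
// CPU has serviced it. In the real flow, other work-items keep computing
// in parallel while this waits.
void request_callback(std::atomic<int>& sem) {
    int expected = IDLE;
    while (!sem.compare_exchange_weak(expected, REQUESTED)) expected = IDLE;
    while (sem.load() != DONE) {}  // busy-wait, fine for a sketch
}

// "CPU service thread": wait for a request, run the callback (here: set the
// tile's Luma value to RED), then signal completion.
void service_callbacks(std::atomic<int>& sem, int& tile_luma) {
    while (sem.load() != REQUESTED) {}
    tile_luma = 0xFF0000;  // hypothetical "set to RED" callback body
    sem.store(DONE);
}
```

Because both sides poll and update one coherent atomic word, the callback is serviced immediately without ending the kernel — the contrast with the legacy split-kernel flow.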
CALLBACKS
HSA and full OpenCL 2.0
[Diagram: GPU threads 0, 1, 2, …, N. A few kernel threads need CPU callback services, but they are serviced immediately while the rest of the kernel keeps running.]
SUMMARY - HSA ADVANTAGE
Pointer-based Data Structures
• Use case: Binary tree searches — the GPU performs parallel searches in a CPU-created binary tree.
• HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.

Platform Atomics
• Use cases: Work-group dynamic task management — the GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. Binary tree updates — the CPU and GPU operate on the tree simultaneously, both making modifications.
• HSA advantage: The CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Large Data Sets
• Use case: Hierarchical data searches — applications include object recognition, collision detection, global illumination, and BVH.
• HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel-launch overhead.

CPU Callbacks
• Use case: Middleware user callbacks — the GPU processes work items, some of which require a call to a CPU function to fetch new data.
• HSA advantage: The GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require “split kernels”; higher performance through parallel operations.
QUESTIONS?
Introducing Apache Geode and Spring Data GemFireIntroducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFireJohn Blum
 
Tendencias Storage
Tendencias StorageTendencias Storage
Tendencias StorageFran Navarro
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 




HSA-ENABLED POINTER DATA STRUCTURES

  • 1. HSA APPLICATIONS WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS WITH J.P. BORDES AND JUAN GOMEZ
  • 2. USE CASES SHOWING HSA ADVANTAGE
    - Pointer-based Data Structures. Use case: binary tree searches — the GPU performs parallel searches in a CPU-created binary tree. HSA advantage: the CPU and GPU have access to the entire unified coherent memory, so the GPU can access existing data structures containing pointers.
    - Platform Atomics. Use cases: work-group dynamic task management — the GPU operates directly on a task pool managed by the CPU for algorithms with dynamic computation loads; and binary tree updates — the CPU and GPU operate on the tree simultaneously, both making modifications. HSA advantage: the CPU and GPU can synchronize using platform atomics, giving higher performance through parallel operations and reducing the need for data copying and reconciling.
    - Large Data Sets. Use case: hierarchical data searches — applications include object recognition, collision detection, global illumination, and BVH. HSA advantage: the CPU and GPU have access to the entire unified coherent memory, so the GPU can operate on huge models in place, reducing copy and kernel-launch overhead.
    - CPU Callbacks. Use case: middleware user callbacks — the GPU processes work items, some of which require a call to a CPU function to fetch new data. HSA advantage: the GPU can invoke CPU functions from within a GPU kernel; this is simpler to program (no "split kernels") and gives higher performance through parallel operations.
    © Copyright 2014 HSA Foundation. All Rights Reserved
  • 3. UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES
  • 4–10. UNIFIED COHERENT MEMORY — MORE EFFICIENT POINTER DATA STRUCTURES (Legacy). [Animation: in the legacy model the pointer-based tree lives in system memory; it must be serialized into a pointer-free "flat tree" and copied into GPU memory along with a result buffer before the GPU kernel can search it, and the results are then copied back to system memory.] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 11–15. UNIFIED COHERENT MEMORY — MORE EFFICIENT POINTER DATA STRUCTURES (HSA and full OpenCL 2.0). [Animation: the GPU kernel traverses the pointer-based tree directly in system memory and writes into the result buffer in place; no flat-tree conversion and no copies are required.] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 16. POINTER DATA STRUCTURES — CODE COMPLEXITY. [Side-by-side comparison of the HSA and legacy host code.] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 17. POINTER DATA STRUCTURES — PERFORMANCE. [Chart: binary tree search rate (nodes/ms) vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 cores), legacy APU, and HSA APU. Measured in AMD labs Jan 1–3 on the system shown in a backup slide.] © Copyright 2014 HSA Foundation. All Rights Reserved
  • 18. PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT
  • 19. PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT Legacy* 0 SYSTEM MEMORY WORK- GROUP 1 GPU NUM. WRITTEN TASKS GPU MEMORY QUEUE 2QUEUE 1 TASKS POOL 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 2 0 NUM. WRITTEN TASKS 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 3 WORK- GROUP 4 *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 20. 0 SYSTEM MEMORY WORK- GROUP 1 GPU NUM. WRITTEN TASKS GPU MEMORY QUEUE 2QUEUE 1 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 2 0 NUM. WRITTEN TASKS 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 3 WORK- GROUP 4 TASKS POOL PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT Legacy* *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 Asynchronous transfer © Copyright 2014 HSA Foundation. All Rights Reserved
  • 21. 4 SYSTEM MEMORY WORK- GROUP 1 GPU NUM. WRITTEN TASKS GPU MEMORY QUEUE 2QUEUE 1 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 2 0 NUM. WRITTEN TASKS 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 3 WORK- GROUP 4 TASKS POOL PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT Legacy* *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 22. 4 SYSTEM MEMORY WORK- GROUP 1 GPU NUM. WRITTEN TASKS GPU MEMORY QUEUE 2QUEUE 1 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 2 4 NUM. WRITTEN TASKS 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 3 WORK- GROUP 4 TASKS POOL PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT Legacy* *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 Asynchronous transfer © Copyright 2014 HSA Foundation. All Rights Reserved
  • 23. 4 SYSTEM MEMORY WORK- GROUP 1 GPU NUM. WRITTEN TASKS GPU MEMORY QUEUE 2QUEUE 1 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 2 4 NUM. WRITTEN TASKS 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 3 WORK- GROUP 4 TASKS POOL PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT Legacy* *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 24. 4 SYSTEM MEMORY WORK- GROUP 1 GPU NUM. WRITTEN TASKS GPU MEMORY QUEUE 2QUEUE 1 0 0 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 2 4 NUM. WRITTEN TASKS 0 1 NUM. CONSUMED TASKS 0 QUEUE 1 QUEUE 2 WORK- GROUP 3 WORK- GROUP 4 TASKS POOL PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT Legacy* *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 Atomic add © Copyright 2014 HSA Foundation. All Rights Reserved
  • 25. PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT – Legacy* [Diagram: tasks pool in system memory; queues 1 and 2, each with "num. written tasks" / "num. consumed tasks" counters duplicated in CPU and GPU memory; persistent work-groups 1–4 on the GPU. Queue 1: written = 4, consumed = 1 (GPU copy; CPU copy still 0); a work-group dequeues a task] *Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010 © Copyright 2014 HSA Foundation. All Rights Reserved
  • 26. [Atomic add: the work-group updates the GPU-memory consumed counter to 2, then checks whether the queue is empty by comparing it with the written counter]
  • 27. [The next work-group dequeues a task]
  • 28. [Atomic add: GPU-memory consumed counter updated to 3]
  • 29. [The next work-group dequeues a task]
  • 30. [Atomic add: GPU-memory consumed counter updated to 4]
  • 31. [Zero-copy: the consumed count (4) is written back to the counter copy in CPU memory, so the host can detect that queue 1 has drained and enqueue more tasks]
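The counter protocol these slides animate can be sketched in plain host code. This is a hypothetical CPU-only sketch, not the slides' OpenCL source: all names (`LegacyQueue`, `host_enqueue`, `workgroup_dequeue`) are invented for illustration, `std::atomic` stands in for GPU global-memory atomics, and ordinary stores stand in for the async transfers and the zero-copy write-back.

```cpp
#include <atomic>
#include <vector>

// Sketch of the legacy bookkeeping: the written and consumed counters exist
// in BOTH CPU and GPU memory and must be kept in sync by explicit transfers.
struct LegacyQueue {
    std::vector<int> tasks;          // task slots in GPU memory
    int written_cpu = 0;             // host-side counter copy
    std::atomic<int> written_gpu{0}; // device-side copy, async-copied from host
    std::atomic<int> consumed_gpu{0};
    int consumed_cpu = 0;            // updated only by the zero-copy write-back
};

// Host: enqueue a batch, then "async copy" the written counter to the GPU.
void host_enqueue(LegacyQueue& q, const std::vector<int>& batch) {
    q.tasks.insert(q.tasks.end(), batch.begin(), batch.end());
    q.written_cpu += (int)batch.size();
    q.written_gpu.store(q.written_cpu);      // stands in for the async transfer
}

// Device: a persistent work-group claims one task with an atomic add, then
// checks whether the queue has drained. Returns the task, or -1 when empty.
int workgroup_dequeue(LegacyQueue& q) {
    int idx = q.consumed_gpu.fetch_add(1);
    if (idx >= q.written_gpu.load()) return -1;
    int task = q.tasks[idx];
    if (idx + 1 == q.written_gpu.load())     // queue drained:
        q.consumed_cpu = idx + 1;            // zero-copy write-back to the CPU
    return task;
}
```

Even in the sketch the legacy scheme's cost is visible: two copies of every counter, plus an explicit write-back before the host can refill the queue.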
  • 32. PLATFORM ATOMICS ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT – HSA and full OpenCL 2.0 [Diagram: tasks pool, queues 1 and 2, and a single set of "num. written tasks" / "num. consumed tasks" counters, all allocated in host coherent memory; work-groups 1–4 on the GPU; GPU memory holds no counter copies]
  • 33. [memcpy: the host moves a batch of tasks into queue 1]
  • 34. [The host sets the written-tasks counter to 4; the update is directly visible to the GPU]
  • 35. [A work-group dequeues a task from the queue in host coherent memory]
  • 36. [Platform atomic add: consumed counter updated to 1]
  • 37. [The next work-group dequeues a task]
  • 38. [Platform atomic add: consumed counter updated to 2]
  • 39. [The next work-group dequeues a task]
  • 40. [Platform atomic add: consumed counter updated to 3]
  • 41. [The next work-group dequeues a task]
  • 42. [Platform atomic add: consumed counter updated to 4; CPU and GPU observe the same counters with no write-back copy]
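For contrast, a sketch of the HSA variant under the same assumptions (names again invented, `std::atomic` again standing in for platform atomics): one set of counters in host coherent memory, an enqueue that is just a memcpy plus a counter update, and no GPU-side copies or write-backs.

```cpp
#include <atomic>
#include <cstring>

// Sketch: the queue and ONE set of counters live in host coherent memory,
// visible to the producer (CPU) and the consumers ("GPU work-groups") alike.
struct SharedQueue {
    int tasks[64] = {0};
    std::atomic<int> written{0};
    std::atomic<int> consumed{0};
};

// Host: moving tasks into the queue is just a memcpy plus a counter update.
void host_enqueue(SharedQueue& q, const int* batch, int n) {
    std::memcpy(q.tasks + q.written.load(), batch, n * sizeof(int));
    q.written.fetch_add(n);          // immediately visible to the GPU side
}

// A "work-group" claims one task with the same atomic add the host could use.
// No GPU-side counter copies and no zero-copy write-back step are needed.
int workgroup_dequeue(SharedQueue& q) {
    int idx = q.consumed.fetch_add(1);
    return (idx < q.written.load()) ? q.tasks[idx] : -1;
}
```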
  • 43. PLATFORM ATOMICS – CODE COMPLEXITY HSA: host enqueue function, 20 lines of code. Legacy: host enqueue function, 102 lines of code.
  • 44. PLATFORM ATOMICS – PERFORMANCE [Chart: execution time (ms, 0–700) for 64, 128, 256, and 512 tasks per insertion, at tasks-pool sizes of 4096 and 16384; legacy implementation (ms) vs. HSA implementation (ms)]
  • 46. PLATFORM ATOMICS ENABLING EFFICIENT GPU/CPU COLLABORATION – Legacy Only the GPU can work on the input array; concurrent processing is not possible. [Diagram: input buffer → GPU kernel → tree]
  • 47. [The GPU kernel consumes entries from the input buffer]
  • 48. [The GPU kernel inserts the entries into the tree while the CPU waits]
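The editor's notes for this sequence say the concurrent version takes a semaphore on a parent node via compare-and-swap before adding to it. A hypothetical sketch of that idea, with `Node`, `insert`, and the 0/1 lock encoding all invented for illustration, and `std::atomic` standing in for platform atomics:

```cpp
#include <atomic>

// Each node carries a semaphore taken with compare-and-swap, so CPU and GPU
// threads can both insert by locking only the parent they are modifying.
struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    std::atomic<int> lock{0};        // 0 = free, 1 = held; taken via CAS
    explicit Node(int k) : key(k) {}
};

void insert(Node* root, Node* n) {
    Node* cur = root;
    for (;;) {
        Node*& slot = (n->key < cur->key) ? cur->left : cur->right;
        if (slot) { cur = slot; continue; }          // descend to the child
        int expected = 0;                            // try to take the
        if (!cur->lock.compare_exchange_strong(expected, 1)) continue;
        if (!slot) {                                 // still empty: link in
            slot = n;
            cur->lock.store(0);
            return;
        }
        cur->lock.store(0);    // lost the race: slot was filled, keep walking
    }
}
```

Because only the parent being modified is locked, CPU and GPU threads can insert into disjoint parts of the tree at the same time, which is exactly the concurrency the legacy path above cannot express.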
  • 51. UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
  • 52. PROCESSING LARGE DATA SETS The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU. [Diagram: system memory; GPU]
  • 53. PROCESSING LARGE DATA SETS – Compare HSA and Legacy methods The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU. [Diagram: large 3D spatial data structure in system memory, organized as a hierarchy of levels 1–5; GPU]
  • 54. LEGACY ACCESS USING GPU MEMORY – Legacy GPU memory is smaller, so the data has to be copied and processed in chunks. [Diagram: system memory; GPU with its own GPU memory]
  • 55. LEGACY ACCESS TO LARGE STRUCTURES – Legacy [Diagram: large 3D spatial data structure (levels 1–5) in system memory; GPU with its own GPU memory]
  • 56. COPY ONE CHUNK AT A TIME – Legacy [A copy of the top 2 levels of the hierarchy is placed in GPU memory]
  • 57. PROCESS ONE CHUNK AT A TIME – Legacy [FIRST KERNEL runs on the copied top levels]
  • 58. [FIRST KERNEL continues traversing the copied chunk]
  • 59. [FIRST KERNEL finishes on this chunk]
  • 60. COPY ONE CHUNK AT A TIME – Legacy [The chunk in GPU memory is replaced for the next copy]
  • 61. [A copy of the bottom 3 levels of one branch of the hierarchy is placed in GPU memory]
  • 62. PROCESS ONE CHUNK AT A TIME – Legacy [SECOND KERNEL runs on the new chunk]
  • 63. [SECOND KERNEL continues traversing the chunk]
  • 64. [SECOND KERNEL finishes on this chunk]
  • 65. COPY ONE CHUNK AT A TIME – Legacy [The chunk in GPU memory is replaced again]
  • 66. [Another copy begins]
  • 67. [A copy of the bottom 3 levels of a different branch of the hierarchy is placed in GPU memory]
  • 68. PROCESS ONE CHUNK AT A TIME – Legacy [Nth KERNEL runs on the latest chunk]
  • 69. [Nth KERNEL continues traversing the chunk]
  • 70. [Nth KERNEL finishes: every chunk costs a host-to-device copy plus a kernel launch]
  • 71. LARGE SPATIAL DATA STRUCTURE – HSA and full OpenCL 2.0 [Diagram: large 3D spatial data structure (levels 1–5) in system memory; a single KERNEL on the GPU]
  • 72. GPU CAN TRAVERSE ENTIRE HIERARCHY – HSA [The kernel reads level 1 in place]
  • 73. [The kernel descends to level 2]
  • 74. [The kernel descends to level 3]
  • 75. [The kernel descends to level 4]
  • 76. [The kernel reaches level 5; the whole traversal runs in one kernel, with no copies]
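The in-place traversal these slides show is ordinary pointer chasing; under unified coherent memory the same code shape works inside a kernel because the CPU-built pointers remain valid on the GPU. The 1-D interval hierarchy below is a hypothetical stand-in for the deck's 3-D spatial structure (the BVH of the editor's notes), and every name in it is invented for illustration.

```cpp
#include <vector>

// A CPU-built hierarchy of nested intervals, levels linked by pointers.
struct BihNode {
    float lo, hi;                     // bounds covered by this node
    std::vector<BihNode*> children;   // empty at a leaf
    int leaf_id = -1;
};

// Descend from the root to the leaf whose interval contains x, level by
// level, exactly as the slides' kernel walks levels 1 through 5 in place.
// Returns the leaf id, or -1 if no child covers x at some level.
int find_leaf(const BihNode* node, float x) {
    if (x < node->lo || x > node->hi) return -1;
    while (!node->children.empty()) {
        const BihNode* next = nullptr;
        for (const BihNode* c : node->children)
            if (x >= c->lo && x <= c->hi) { next = c; break; }
        if (!next) return -1;
        node = next;                  // follow the CPU-created pointer
    }
    return node->leaf_id;
}
```

Without unified memory, the same search requires serializing the tree into an index-based array and copying it chunk by chunk, as the legacy sequence above shows.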
  • 78. CALLBACKS – COMMON SITUATION IN HC  Parallel processing algorithm with branches  A seldom-taken branch requires new data from the CPU  On legacy systems, the algorithm must be split:  Process Kernel 1 on GPU  Check for CPU callbacks and, if any, process on CPU  Process Kernel 2 on GPU  Example algorithm from Image Processing  Perform a filter  Calculate average LUMA in each tile  Compare LUMA against threshold and call CPU callback if exceeded (rare)  Perform special processing on tiles with callbacks [Diagram: input image → output image]
  • 79. CALLBACKS – Legacy [Diagram: GPU threads 0, 1, 2, …, N] The kernel must exit before the CPU can service callbacks, and a continuation kernel finishes up the work; this results in poor GPU utilization.
  • 80. CALLBACKS [Diagram: input image, 1 tile = 1 OpenCL work item, output image] GPU • Work items compute the average RGB value of all the pixels in a tile • Work items also compute average Luma from the average RGB • If average Luma > threshold, the workgroup invokes a CPU CALLBACK • In parallel with the callback, compute continues CPU • For selected tiles, update the average Luma value (set to RED) GPU • Work items apply the Luma value to all pixels in the tile GPU to CPU callbacks use Shared Virtual Memory (SVM) Semaphores, implemented using Platform Atomic Compare-and-Swap.
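The per-tile test described on this slide can be sketched as follows. The slide does not give the RGB-to-Luma weights; the BT.601 coefficients used here are an assumption, and the type and function names are invented.

```cpp
#include <vector>

struct RGB { float r, g, b; };

// Average the tile's pixels, then derive luma from the average RGB.
// 0.299/0.587/0.114 are the BT.601 luma weights (assumed, not from slides).
float tile_average_luma(const std::vector<RGB>& tile) {
    float r = 0, g = 0, b = 0;
    for (const RGB& p : tile) { r += p.r; g += p.g; b += p.b; }
    float n = (float)tile.size();
    r /= n; g /= n; b /= n;
    return 0.299f * r + 0.587f * g + 0.114f * b;
}

// Work-group decision: does this tile need the (rare) CPU callback?
bool needs_callback(const std::vector<RGB>& tile, float threshold) {
    return tile_average_luma(tile) > threshold;
}
```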
  • 81. CALLBACKS – HSA and full OpenCL 2.0 [Diagram: GPU threads 0, 1, 2, …, N; CPU callbacks] A few kernel threads need CPU callback services, and they are serviced immediately.
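The previous slide names the mechanism: SVM semaphores implemented with platform atomic compare-and-swap. A hypothetical single-mailbox sketch, with the state encoding and all function names invented, and `std::atomic` standing in for platform atomics:

```cpp
#include <atomic>

// A callback mailbox in shared virtual memory.
// States: 0 = free, 1 = claimed by a work-item, 2 = request posted,
//         3 = serviced by the CPU.
struct Mailbox {
    std::atomic<int> state{0};
    int request = 0;
    int reply = 0;
};

// "GPU" side: claim the mailbox with compare-and-swap, then post the request.
bool post_request(Mailbox& m, int arg) {
    int expected = 0;
    if (!m.state.compare_exchange_strong(expected, 1)) return false;
    m.request = arg;
    m.state.store(2);                 // request now visible to the CPU
    return true;
}

// "GPU" side: pick up the reply once the CPU has serviced the request.
bool collect_reply(Mailbox& m, int* out) {
    if (m.state.load() != 3) return false;
    *out = m.reply;
    m.state.store(0);                 // release the mailbox
    return true;
}

// "CPU" side: service one posted request, if any is pending.
bool cpu_service_once(Mailbox& m) {
    if (m.state.load() != 2) return false;
    m.reply = m.request * 2;          // stands in for fetching new data
    m.state.store(3);
    return true;
}
```

Splitting the GPU side into post and collect mirrors the slide's point that the work-group can keep computing between the two calls instead of idling while the CPU services the callback.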
  • 82. SUMMARY - HSA ADVANTAGE
  Pointer-based Data Structures – Binary tree searches: the GPU performs parallel searches in a CPU-created binary tree. Advantage: CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.
  Platform Atomics – Work-group dynamic task management: the GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads. Binary tree updates: CPU and GPU operate on the tree simultaneously, both making modifications. Advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.
  Large Data Sets – Hierarchical data searches: applications include object recognition, collision detection, global illumination, BVH. Advantage: CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel launch overhead.
  CPU Callbacks – Middleware user-callbacks: the GPU processes work items, some of which require a call to a CPU function to fetch new data. Advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler programming that does not require "split kernels"; higher performance through parallel operations.

Editor's notes

  1. Heap allocation (hUMA – virtual memory) – dGPU version would need to page large portions of the model to the GPU – CPU version would be slow. Data pointers (hUMA – unified addresses) – non-HSA version would need to “serialize” tree into array (with indices for pointers) for GPU. Recursion (hQ – GPU enqueuing) – non-HSA version would suffer from load imbalance, because CPU has to wait and spawn 1 kernel to process all secondary rays, whereas with HSA, the GPU threads can dynamically spawn kernels to process secondary rays. Callbacks (hUMA – platform atomics)– In the non-HSA version, the CPU has to wait until the first kernel exits to begin processing callbacks, and can’t launch second kernel until all callbacks have completed. Atomics (hUMA – memory coherence & platform atomics) – In the non-HSA version, the CPU and GPU processing is serialized.
  2. Kernel should have input buffer which is list of keys being searched for and outputs are the values from the key value pairs
  3. This case study implements a dynamic task scheduling scheme that aims at load balancing among work-groups. Traditional heterogeneous approach: The host system enqueues tasks in several queues located in GPU memory. Two variables per queue are used to synchronize CPU and GPU: the number of tasks that have been written in the queue, and the number of tasks that have been already consumed from the queue. These variables are duplicated in CPU and GPU memory. The GPU runs a number of persistent work-groups. A work-group can dequeue one task and update the number of consumed tasks.
  4. A group of tasks are asynchronously transferred to one queue in GPU memory.
  5. Then, the host updates the number of written tasks in CPU memory.
  6. The number of written tasks is updated in GPU memory by an asynchronous transfer.
  7. A work-group dequeues one task from a queue.
  8. The work-group updates the number of consumed tasks by using a global memory atomic operation. Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
  9. Then, a different work-group dequeues the next task.
  10. Work-group 2 updates the number of consumed tasks. Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
  11. Work-group 3 dequeues the next task.
  12. Work-group 3 updates the number of consumed tasks. Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks with the number of written tasks.
  13. Work-group 4 dequeues the next task.
  14. Work-group 4 updates the number of consumed tasks. Then, it checks whether the queue is already empty, that is, it compares the number of consumed tasks and the number of written tasks.
  15. Since the number of consumed tasks and the number of written tasks are equal, the queue is empty. Then, the number of consumed tasks should be updated in CPU memory. This is implemented by using the zero-copy feature. Once the number of consumed tasks in CPU memory is updated, the host thread will detect that this number is equal to the number of written tasks. More tasks can be then enqueued in queue 1.
  16. Using HSA and full OpenCL 2.0, queues and synchronization variables can be allocated in host coherent memory.
  17. Moving tasks to a queue is as simple as using memcpy.
  18. No copies of the number of written tasks and the number of consumed tasks are needed in GPU memory.
  19. A work-group can dequeue one task from a queue in host coherent memory.
  20. The number of consumed tasks is updated by using platform atomics.
  21. Only the function that inserts tasks in the queues needs 5x less lines of code than the legacy implementation.
  22. This slide presents 8 tests. The total number of tasks in the tasks pool is 4096 or 16384. The number of queues is 4 in every test. Each time the host inserts tasks in a queue, the number of tasks per insertion is 64, 128, 256 or 512.
  23. Atomic is a lock on a parent before adding to it. Semaphore on the tree struct. CAS to take semaphore. Tree has 2M nodes. Add 0.5 nodes. Time 3 ways: CPU, GPU, Both.
  24. Just the dividing planes are loaded to GPU memory for first two levels
  25. BVH – Bounding Volume Hierarchy Each leaf has a collection of primitives (spheres) Looking for first sphere that intercepts a point
  26. Heap allocation (hUMA – virtual memory) – dGPU version would need to page large portions of the model to the GPU – CPU version would be slow. Data pointers (hUMA – unified addresses) – non-HSA version would need to “serialize” tree into array (with indices for pointers) for GPU. Recursion (hQ – GPU enqueuing) – non-HSA version would suffer from load imbalance, because CPU has to wait and spawn 1 kernel to process all secondary rays, whereas with HSA, the GPU threads can dynamically spawn kernels to process secondary rays. Callbacks (hUMA – platform atomics)– In the non-HSA version, the CPU has to wait until the first kernel exits to begin processing callbacks, and can’t launch second kernel until all callbacks have completed. Atomics (hUMA – memory coherence & platform atomics) – In the non-HSA version, the CPU and GPU processing is serialized.