AFDS Keynote: “The Programmer’s Guide to the APU Galaxy.”
Phil Rogers, AMD Corporate Fellow
It’s a well-understood maxim in the technology industry that software and hardware must evolve in parallel, and be well matched, to achieve greatness. With the introduction of the world’s first APU in January 2011, AMD pointed the world toward a new way of computing. This was very much a first step in an architectural journey that is well underway at AMD. APUs combine different processing engines on a single chip to strike a unique balance between performance, power consumption and price. Hear how AMD is working to ease programmers’ access to this new level of compute horsepower and dramatically expand the processing resources available to modern applications.
2. THE OPPORTUNITY WE ARE SEIZING
Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today.
2 | The Programmer’s Guide to the APU Galaxy | June 2011
3. OUTLINE
The APU today and its programming environment
The future of the heterogeneous platform
AMD Fusion System Architecture
Roadmap
Software evolution
A visual view of the new command and data flow
4. APU: ACCELERATED PROCESSING UNIT
The APU has arrived and it is a great advance over previous platforms
Combines scalar processing on the CPU with parallel processing on the GPU, and high bandwidth access to memory
How do we make it even better going forward?
– Easier to program
– Easier to optimize
– Easier to load balance
– Higher performance
– Lower power
5. LOW POWER E-SERIES AMD FUSION APU: “ZACATE”
E-Series APU
2 x86 “Bobcat” CPU cores
Array of Radeon™ Cores
Discrete-class DirectX® 11 performance
80 Stream Processors
3rd Generation Unified Video Decoder
PCIe® Gen2
Single-channel DDR3 @ 1066
18W TDP
Performance:
Up to 8.5GB/s System Memory Bandwidth
Up to 90 Gflops of Single Precision Compute
6. TABLET Z-SERIES AMD FUSION APU: “DESNA”
Z-Series APU
2 x86 “Bobcat” CPU cores
Array of Radeon™ Cores
Discrete-class DirectX® 11 performance
80 Stream Processors
3rd Generation Unified Video Decoder
PCIe® Gen2
Single-channel DDR3 @ 1066
6W TDP w/ Local Hardware Thermal Control
Performance:
Up to 8.5GB/s System Memory Bandwidth
Suitable for sealed, passively cooled designs
7. MAINSTREAM A-SERIES AMD FUSION APU: “LLANO”
A-Series APU
Up to four x86 CPU cores
AMD Turbo CORE frequency acceleration
Array of Radeon™ Cores
Discrete-class DirectX® 11 performance
3rd Generation Unified Video Decoder
Blu-ray 3D stereoscopic display
PCIe® Gen2
Dual-channel DDR3
45W TDP
Performance:
Up to 29GB/s System Memory Bandwidth
Up to 500 Gflops of Single Precision Compute
8. COMMITTED TO OPEN STANDARDS
AMD drives open and de-facto standards (e.g. DirectX®)
– Compete on the best implementation
Open standards are the basis for large ecosystems
Open standards always win over time
– SW developers want their applications to run on multiple platforms from multiple hardware vendors
9. A NEW ERA OF PROCESSOR PERFORMANCE
Single-Core Era
– Enabled by: Moore’s Law, voltage scaling
– Constrained by: power, complexity
– Tools: Assembly → C/C++ → Java …
– Single-thread performance rose over time and has now plateaued (we are here)
Multi-Core Era
– Enabled by: Moore’s Law, SMP architecture
– Constrained by: power, parallel SW, scalability
– Tools: pthreads → OpenMP / TBB …
– Throughput performance rises with time (# of processors); we are here
Heterogeneous Systems Era
– Enabled by: abundant data parallelism, power-efficient GPUs
– Temporarily constrained by: programming models, communication overhead
– Tools: Shader → CUDA → OpenCL …
– Modern application performance rises with time (data-parallel exploitation); we are here, near the start of the curve
10. EVOLUTION OF HETEROGENEOUS COMPUTING
(Chart: architecture maturity & programmer accessibility, rising from poor to excellent across three eras)
Proprietary Drivers Era (2002 – 2008)
– Graphics & proprietary driver-based APIs: CUDA™, Brook+, etc
– “Adventurous” programmers
– Exploit early programmable “shader cores” in the GPU
– Make your program look like “graphics” to the GPU
Standards Drivers Era (2009 – 2011)
– GPU as a co-processor
– Driver-based APIs: OpenCL™, DirectCompute
– Expert programmers
– C and C++ subsets; compute centric APIs and data types
– Multiple address spaces with explicit data movement
– Specialized work queue based structures; kernel mode dispatch
Architected Era (2012 – 2020)
– AMD Fusion System Architecture: GPU as a peer processor
– Mainstream programmers
– Full C++; unified coherent address space
– Task parallel runtimes; nested data parallel programs
– User mode dispatch; pre-emption and context switching
See Herb Sutter’s keynote tomorrow for a cool example of plans for the architected era!
11. FSA FEATURE ROADMAP
Physical Integration
– Integrate CPU & GPU in silicon
– Unified Memory Controller
– Common Manufacturing Technology
Optimized Platforms
– GPU Compute C++ support
– User mode scheduling
– Bi-directional power management between CPU and GPU
Architectural Integration
– Unified address space for CPU and GPU
– GPU uses pageable system memory via CPU pointers
– Fully coherent memory between CPU & GPU
System Integration
– GPU compute context switch
– GPU graphics pre-emption
– Quality of Service
– Extend to discrete GPU
12. FUSION SYSTEM ARCHITECTURE – AN OPEN PLATFORM
Open Architecture, published specifications
– FSAIL virtual ISA
– FSA memory model
– FSA dispatch
ISA agnostic for both CPU and GPU
Inviting partners to join us, in all areas
– Hardware companies
– Operating Systems
– Tools and Middleware
– Applications
FSA review committee planned
13. FSA INTERMEDIATE LAYER - FSAIL
FSAIL is a virtual ISA for parallel programs
– Finalized to the target ISA by a JIT compiler, or “Finalizer”
Explicitly parallel
– Designed for data parallel programming
Support for exceptions, virtual functions, and other high level language features
Syscall methods
– GPU code can call directly to system services, IO, printf, etc.
Debugging support
14. FSA MEMORY MODEL
Designed to be compatible with the C++0x, Java and .NET memory models
Relaxed consistency memory model for parallel compute performance
Loads and stores can be re-ordered by the finalizer
Visibility controlled by:
– Load.Acquire, Store.Release
– Fences
– Barriers
15. DRIVER STACK VS. FSA SOFTWARE STACK
Today’s driver stack: Apps → Domain Libraries → OpenCL™ 1.x, DX Runtimes, User Mode Drivers → Graphics Kernel Mode Driver → Hardware
FSA software stack: Apps → FSA Domain Libraries → FSA Runtime, Task Queuing Libraries, FSA JIT → FSA Kernel Mode Driver → Hardware (APUs, CPUs, GPUs)
AMD supplies a user mode component and a kernel mode component; all other layers are contributed by third parties or AMD.
16. OPENCL™ AND FSA
FSA is an optimized platform architecture for OpenCL™
– Not an alternative to OpenCL™
OpenCL™ on FSA will benefit from:
– Avoidance of wasteful copies
– Low latency dispatch
– Improved memory model
– Pointers shared between CPU and GPU
FSA also exposes a lower level programming interface, for those who want the ultimate in control and performance
– Optimized libraries may choose the lower level interface
17. TASK QUEUING RUNTIMES
Popular pattern for task and data parallel programming on SMP systems today
Characterized by:
– A work queue per core
– A runtime library that divides large loops into tasks and distributes them to queues
– A work stealing runtime that keeps the system balanced
FSA is designed to extend this pattern to run on heterogeneous systems
18. TASK QUEUING RUNTIME ON CPUS
[Diagram] A work stealing runtime manages one queue (Q) per core; a CPU worker on each of the four x86 CPUs pulls tasks from its own queue, and all workers share memory.
19. TASK QUEUING RUNTIME ON THE FSA PLATFORM
[Diagram] The same work stealing runtime gains a fifth queue, served by a GPU manager, so the Radeon™ GPU joins the four x86 CPU workers; CPU threads and GPU threads share memory.
20. TASK QUEUING RUNTIME ON THE FSA PLATFORM
[Diagram] The GPU manager feeds a fetch-and-dispatch unit that spreads work across the GPU’s SIMD engines; CPU threads and GPU threads operate on shared memory.
21. FSA SOFTWARE EXAMPLE - REDUCTION
float foo(float);
float myArray[…];

Task<float, ReductionBin> task([myArray](IndexRange<1> index) [[device]] {
  float sum = 0.;
  for (size_t i = index.begin(); i != index.end(); i++) {
    sum += foo(myArray[i]);
  }
  return sum;
});

float result = task.enqueueWithReduce(Partition<1, Auto>(1920),
  [] (int x, int y) [[device]] { return x + y; }, 0.);
22. HETEROGENEOUS COMPUTE DISPATCH
How compute dispatch operates today in the driver model
How compute dispatch improves tomorrow under FSA
23. TODAY’S COMMAND AND DISPATCH FLOW
[Diagram] A single application (A) dispatches work through the full driver stack: Application → Direct3D → User Mode Driver → Soft Queue → Kernel Mode Driver → Hardware Queue → GPU hardware. Commands pass through a command buffer on the user mode side and a DMA buffer on the kernel mode side, with data flowing alongside.
24. TODAY’S COMMAND AND DISPATCH FLOW (CONTINUED)
[Diagram, built up across slides 24–26] Applications B and C are added, each with its own Direct3D / user mode driver / soft queue path and its own command and DMA buffers; all three applications must funnel through kernel mode into the single hardware queue, so the work of A, B and C is serialized at the GPU.
27. FUTURE COMMAND AND DISPATCH FLOW
[Diagram] Each application (A, B, C) now owns its own hardware queue and enqueues packets directly, with an optional dispatch buffer; hardware scheduling multiplexes the queues onto the GPU.
– Application codes to the hardware
– User mode queuing
– Hardware scheduling
– Low dispatch times
– No APIs
– No soft queues
– No user mode drivers
– No kernel mode transitions
– No overhead!
28. FUTURE COMMAND AND DISPATCH CPU <-> GPU
[Diagram, built up across slides 28–31] Application/runtime work queues are visible to CPU cores and the GPU alike: CPU1, CPU2 and the GPU each dispatch work to, and pull work from, any queue in either direction.
32. WHERE ARE WE TAKING YOU?
Platform design goals:
– Every processor now has serial and parallel cores
– All cores capable, with performance differences
– Simple and efficient program model
– Switch the compute, don’t move the data!
– Easy support of massive data sets
– Support for task based programming models
– Solutions for all platforms
– Open to all
33. THE FUTURE OF HETEROGENEOUS COMPUTING
The architectural path for the future is clear
– Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world
– An open architecture, with published specifications and an open source execution software stack
– Heterogeneous cores working together seamlessly in coherent memory
– Low latency dispatch
– No software fault lines