Direct3D12 and the Future of Graphics APIs
1. DIRECT3D12 AND THE FUTURE OF GRAPHICS APIS
Dave Oldcorn, Direct3D12 Technical Lead, AMD
2. 2 | AMD Direct3D Futures | March 20th, 2014
THE PROBLEM
3. THE PROBLEM
Mismatch between existing Direct3D and hardware capabilities
– Lots of CPU cores, but only one stream of data
– State communication in small chunks
– “Hidden” work: hard to predict from any single call what its overhead might be; implicit memory management
– Hardware evolving away from classical register programming
4. API LANDSCAPE
A gap has opened up between the PC ‘raw’ 3D APIs and the hardware
Very high level APIs are now ubiquitous; easy to access even for casual developers, plenty of choice
The PC APIs sit in a middle ground
[Figure: API landscape, ordered by capability, ease of use and distance from 3D engine. Labels: Metal (register level access); Console APIs; Opportunity; D3D7/8, D3D9, D3D11, OpenGL; Game Engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech); Flash / Silverlight; Application]
5. WHAT ARE THE CONSEQUENCES?
WHAT ARE THE SOLUTIONS?
6. SEQUENTIAL API
Sequential API: the state for a given draw may have been set at any arbitrary earlier time
Some states must be reconciled on the CPU (“delayed validation”)
– All contributing state needs to be visible
The GPU isn’t like this; it consumes command buffers
– State must be saved and restored at the start and end of each buffer
[Figure: API input stream: ..., Draw, Set PS CB, Draw x 5, Set VS CB, Draw x 3, Set Blend, Set PS, Set RT state, Draw, Set VS VB, Draw, ... State contributing to draw: (more, earlier), PS CB, VS CB, Blend state, PS, RT state]
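As a rough illustration of delayed validation, the sketch below replays the API input listed on this slide through a toy sequential context. All names (`SequentialContext`, `replaySlideStream`, the dirty bits) are hypothetical and do not come from Direct3D; the model only assumes that setting state is cheap and that each draw must reconcile whatever changed since the previous draw.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a sequential (D3D11-style) context. These names are
// illustrative only; they are not part of any real API.
struct SequentialContext {
    enum Dirty : std::uint32_t {
        PS_CB = 1, VS_CB = 2, BLEND = 4, PS = 8, RT = 16, VS_VB = 32
    };

    std::uint32_t dirty = 0;  // state changed since the last draw
    int validations = 0;      // per-state fix-ups performed on the CPU
    int draws = 0;

    // Setting state is cheap: only remember that something changed.
    void set(Dirty what) { dirty |= what; }

    // The hidden work lands in draw(): every dirty state is reconciled here,
    // no matter how long ago it was set.
    void draw() {
        for (std::uint32_t bit = 1; bit <= VS_VB; bit <<= 1) {
            if (dirty & bit) ++validations;
        }
        dirty = 0;
        ++draws;
    }
};

// Replays the API input stream shown on the slide.
inline SequentialContext replaySlideStream() {
    SequentialContext ctx;
    ctx.draw();
    ctx.set(SequentialContext::PS_CB);
    for (int i = 0; i < 5; ++i) ctx.draw();
    ctx.set(SequentialContext::VS_CB);
    for (int i = 0; i < 3; ++i) ctx.draw();
    ctx.set(SequentialContext::BLEND);
    ctx.set(SequentialContext::PS);
    ctx.set(SequentialContext::RT);
    ctx.draw();
    ctx.set(SequentialContext::VS_VB);
    ctx.draw();
    return ctx;
}
```

In this toy model the cost of a draw depends on calls made arbitrarily earlier, which is the unpredictability the slide describes: the draw after Set Blend, Set PS and Set RT state pays for three validations, while most other draws pay for none.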
7. THREADING A SEQUENTIAL API
Sequential API threading
– Simple producer / consumer model
Extra latency
Buffering has a cost
More threading would mean dividing tasks at a finer grain
– Bottlenecked on application or driver thread
Difficult to extract parallelism (Amdahl’s Law)
[Figure: Application: application simulation, prebuild on Thread 0 and Thread 1, application render thread. Runtime / Driver: driver thread. GPU execution queue: queued buffers 0, 1, 2, ...]
8. COMMAND BUFFER API
GPUs only listen to command buffers
Let the app build them
– Command Lists, at the API level
Solves sequential API CPU issues
[Figure: Application: application simulation; Thread 0 and Thread 1 each build a command buffer. Runtime / Driver. GPU execution queue: queued buffers 0, 1, ...]
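A minimal sketch of the idea, assuming a command list can be modelled as a plain vector of command names: each thread records into its own list without locking, and only the final submission step is serial. `CommandList`, `recordDraws`, `buildFrame` and `submit` are hypothetical names, not Direct3D12 calls.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in: a "command list" is just a vector of command names.
using CommandList = std::vector<std::string>;

// Each worker records into its own list, so no locking is needed while building.
inline void recordDraws(CommandList& list, int firstDraw, int count) {
    for (int i = 0; i < count; ++i) {
        list.push_back("draw " + std::to_string(firstDraw + i));
    }
}

// Builds one command list per thread in parallel; the lists come back in
// submission order because each thread owns a fixed slot.
inline std::vector<CommandList> buildFrame(int threadCount, int drawsPerThread) {
    std::vector<CommandList> lists(static_cast<std::size_t>(threadCount));
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back(recordDraws,
                             std::ref(lists[static_cast<std::size_t>(t)]),
                             t * drawsPerThread, drawsPerThread);
    }
    for (auto& w : workers) w.join();
    return lists;
}

// Submission is the only serial step: lists are appended to the queue in order.
inline std::vector<std::string> submit(const std::vector<CommandList>& lists) {
    std::vector<std::string> gpuQueue;
    for (const auto& list : lists) {
        gpuQueue.insert(gpuQueue.end(), list.begin(), list.end());
    }
    return gpuQueue;
}
```

For example, `submit(buildFrame(4, 3))` yields twelve commands in a deterministic order even though four threads recorded them concurrently, because ordering is decided at submission rather than at record time.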
9. BETTER SCHEDULING
App has much more control over scheduling work
– Both CPU side and GPU
Threads share very few resources
Many more options for streaming assets
[Figure: timelines for driver thread, create thread, build threads, render work, create work and GPU execution. Captions: D3D11: CB building threads tend to interfere. D3D12: CB building threads more independent. GPU load still added but only after queuing]
10. PIPELINE OBJECTS
Pipeline objects get rid of JIT shader compiles and enable LTCG (link-time code generation) for GPUs
Decouple interface and implementation
We’re aware that this is a hairpin bend for many graphics engines to negotiate.
– Many engines don’t think in terms of predicting state up front
– The benefits are worth it
[Figure: simplified dataflow through pipeline. Stages: Index Process, VS, Primitive Generation, Rasteriser, PS, Rendertarget Output; three points in the flow are marked "?"]
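One way an engine might adapt to predicting state up front is to describe the whole pipeline in a single structure and create (link) each distinct combination once. The sketch below assumes that model; `PipelineDesc`, `Pipeline` and `PipelineCache` are hypothetical names and do not reflect the actual Direct3D12 interface.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical description of everything a pipeline depends on.
struct PipelineDesc {
    std::string vertexShader;
    std::string pixelShader;
    std::string blendMode;
    std::string renderTargetFormat;

    // Ordering so the description can be used as a map key.
    bool operator<(const PipelineDesc& o) const {
        return std::tie(vertexShader, pixelShader, blendMode, renderTargetFormat) <
               std::tie(o.vertexShader, o.pixelShader, o.blendMode, o.renderTargetFormat);
    }
};

// Hypothetical handle to a fully linked pipeline.
struct Pipeline {
    int id;
};

class PipelineCache {
public:
    // Create time: all contributing state is known, so the stages can be
    // compiled and linked together exactly once per distinct description.
    const Pipeline& getOrCreate(const PipelineDesc& desc) {
        auto it = cache_.find(desc);
        if (it == cache_.end()) {
            ++linkCount_;  // stands in for the expensive compile + link step
            it = cache_.emplace(desc, Pipeline{linkCount_}).first;
        }
        return it->second;
    }

    int linkCount() const { return linkCount_; }

private:
    std::map<PipelineDesc, Pipeline> cache_;
    int linkCount_ = 0;
};
```

At draw time the engine would only look up and bind an existing `Pipeline`, so nothing is compiled just in time; the expensive step runs only when a description is seen for the first time.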
11. RENDER OBJECT BINDING MISMATCH
Hardware uses tables in video memory
BUT it is still programmed like a register solution
– So one bind becomes:
Allocate a new chunk of video memory
Create a new copy of the entire table
Update the one entry
Write the register with the new table base address
[Figure: on-chip root table (1 per stage) with SR and CB entries. SR: pointer to table (here, textures), an SRD table in GPU memory whose entries are pointers to (+ params of) resources in GPU memory. CB: pointer to table (constant buffers)]
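The four steps above can be sketched as follows. This is a toy model under stated assumptions: `Descriptor`, `Table` and `RegisterStyleBinder` are hypothetical names, heap allocation stands in for video memory, and a shared pointer stands in for the register holding the table base address.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Descriptor = int;                // hypothetical stand-in for one table entry
using Table = std::vector<Descriptor>; // hypothetical stand-in for a table in video memory

struct RegisterStyleBinder {
    std::shared_ptr<const Table> current;  // the "register": base address of the live table
    std::size_t tablesAllocated = 0;
    std::size_t descriptorsCopied = 0;

    explicit RegisterStyleBinder(std::size_t slots)
        : current(std::make_shared<Table>(slots, 0)) {}

    // One register-style bind, following the four steps on the slide.
    void bind(std::size_t slot, Descriptor d) {
        auto fresh = std::make_shared<Table>(*current);  // steps 1 and 2: allocate, copy whole table
        ++tablesAllocated;
        descriptorsCopied += fresh->size();
        fresh->at(slot) = d;                             // step 3: update the one entry
        current = std::move(fresh);                      // step 4: write the new base address
    }
};
```

Changing a single slot in a 16-entry table copies all 16 entries, and the previous table stays untouched for anything that still holds its address, so the cost of a bind scales with table size rather than with the amount of state that actually changed.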
12. DESCRIPTOR TABLES
Several tables of each type of resource
– Easy to divide up by update frequency
Tables can be of arbitrary size; dynamically indexed to provide bindless textures
Changing a pointer in the root table is cheap
Updating a descriptor in a table is not so cheap
– Some dynamic descriptors are a requirement, but avoid them in general.
[Figure: on-chip root table with entries SR.T[0] to SR.T[3], UAV, CB.T[0], CB.T[1] and Samp. SR.T[0] is a pointer to a table (textures table 0), an SRD table in GPU memory holding SR.T[0][0] to SR.T[0][2]. CB.T[1] is a pointer to a table (constbuf table 1) holding CB.T[1][0] and CB.T[1][1]]
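A minimal sketch of why changing a root-table pointer is cheap, assuming prebuilt tables that are never edited after creation. `RootTable`, `setTextureTable` and `fetch` are hypothetical names; the four texture slots mirror SR.T[0] to SR.T[3] in the figure.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Descriptor = int;                           // hypothetical stand-in for one descriptor
using DescriptorTable = std::vector<Descriptor>;  // a prebuilt table of arbitrary size

struct RootTable {
    std::array<const DescriptorTable*, 4> textureTables{};  // SR.T[0] .. SR.T[3]
    std::size_t rootWrites = 0;

    // Cheap path: repoint one root entry at a table that already exists.
    void setTextureTable(std::size_t slot, const DescriptorTable* table) {
        textureTables.at(slot) = table;
        ++rootWrites;
    }

    // What a shader would see for SR.T[slot][index] in this model.
    Descriptor fetch(std::size_t slot, std::size_t index) const {
        return textureTables.at(slot)->at(index);
    }
};
```

Dividing tables by frequency then falls out naturally: a per-frame table can stay in slot 0 while slot 1 is repointed between per-material tables, with one pointer write per change and no descriptor copied or edited.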
13. KEY INNOVATIONS
Innovation, with its CPU-side and GPU-side wins
Command buffers
– Build on many threads
– Control of scheduling
– Lower latency
– Simplified state tracking
Pipeline state objects
– Link at create time
– No JIT shader compiles
– Efficient batched updates
– Cheaper state updates
– Enables LTCG
Bind objects in groups
– CPU side: cheap to change group
– GPU side: cheap to change group; fits hardware paradigm
Move work to Create
– CPU side: predictability
– GPU side: enables optimisations
14. KEY INNOVATIONS
Innovation, with its CPU-side and GPU-side wins
Explicit Synchronisation
– Efficiency
– Required for bindless textures
– Less overhead
Explicit Memory Management
– Efficiency
– Predictability
– Application flexibility
– Zero copy
– Control over placement
Do less
– Predictability, Efficiency
– Enables aggressive schedule
FEWER BUGS
15. NEW PROBLEMS (AND TIPS TO SOLVE THEM)
16. NEW VISIBLE LIMITS
More draws in does not automatically mean more triangles out
– You will not see full rendering rates with triangles averaging 1 pixel each.
– Wireframe mode should look different to filled rendering
17. NEW VISIBLE LIMITS
Feeding the GPU much more efficiently means exploring interesting new limits that weren’t visible before
10k per frame of anything leaves only ~1µs for each one.
GPU pipeline depth is likely to be 1-10µs (1k-10k cycles).
Specific limit: context registers
– Root shader table is NOT in the context
– Compute doesn’t bottleneck on context
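The per-item budget and the cycle counts can be checked with simple arithmetic. The helpers below are a sketch; the frame rates used in the note after the code and the 1 GHz clock (implied by the slide equating 1µs with 1k cycles) are assumptions, not figures stated for any particular GPU.

```cpp
#include <cassert>

// Time available per item, in microseconds, for a given frame rate and item count.
constexpr double microsecondsPerItem(double framesPerSecond, double itemsPerFrame) {
    return 1.0e6 / framesPerSecond / itemsPerFrame;
}

// GPU cycles that elapse in a given time at a given clock.
// A clock in MHz is the same number as cycles per microsecond.
constexpr double cyclesIn(double microseconds, double clockMHz) {
    return microseconds * clockMHz;
}
```

With 10,000 items per frame, `microsecondsPerItem(60.0, 10000.0)` is about 1.67 and `microsecondsPerItem(100.0, 10000.0)` is 1.0, so each item gets roughly a microsecond. At an assumed 1000 MHz, `cyclesIn(1.0, 1000.0)` is 1,000 and `cyclesIn(10.0, 1000.0)` is 10,000, which puts the per-item budget at the same scale as the pipeline depth.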
18. APPLICATION IN CHARGE
Application is arbiter of correct rendering
– This is a serious responsibility
– The benefits of D3D12 aren’t readily available without this condition
Applications must be warning-free on the debug layer
Different opportunities for driver intervention
Consider controlling risk by avoiding riskier techniques
19. APPLICATION IN CHARGE
No driver thread in play
– App can target much lower latency
– BUT this implies the app has to be ready with new GPU work
[Figure: timelines for App Render (Frame 1, Frame 2, Frame 3), Driver (F1, F2, F3) and GPU (F1, F2, F3), with GPU dead time marked. Captions: D3D11: no dead GPU time after 1st frame (but extra latency). First work sent to driver. Driver buffers Present; no future dead time. No buffered Present reveals dead time on GPU]
20. USE COMMAND BUFFERS SPARINGLY
Each API command list maps to a single hardware command buffer
Starting / ending a command list has an overhead
– Writes full 3D state, may flush caches or idle GPU
We think a good rule of thumb will be to target around 100 command buffers/frame
– Use the multiple submission API where possible
[Figure: multiple applications running on system. Application 0 queue and Application 1 queue hold command buffers (CB0, CB1, CB2; CB0). GPU executes: CB0, CB1, CB2, CB0]
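One way to stay near the rule of thumb is to decide the number of command lists first and then split the frame's draws evenly across them, instead of opening a new list per object. `partitionDraws` below is a hypothetical helper; the default of 100 is taken from the slide's guidance, not from any API limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits `drawCount` draws into at most `targetLists` contiguous ranges
// [first, last), one range per command list.
inline std::vector<std::pair<std::size_t, std::size_t>>
partitionDraws(std::size_t drawCount, std::size_t targetLists = 100) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (drawCount == 0 || targetLists == 0) return ranges;

    const std::size_t lists = std::min(drawCount, targetLists);
    const std::size_t base = drawCount / lists;
    const std::size_t extra = drawCount % lists;  // the first `extra` lists take one more draw

    std::size_t first = 0;
    for (std::size_t i = 0; i < lists; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges.emplace_back(first, first + count);
        first += count;
    }
    return ranges;
}
```

For a frame with 10,050 draws this gives 100 ranges of 100 or 101 draws each, so the per-list start and end overhead is paid 100 times rather than once per draw. The resulting lists could then be handed over together through a multiple-submission call.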
22. ALL-NEW
There’s a learning curve here for all of us
In the main it’s a shallow one
– Compared at least to the general problem of multithreaded rendering
Multithreading is always hard.
– Simpler design means fewer bugs and more predictable performance
23. WHAT AMD PLAN TO DELIVER
Release driver for Direct3D12 launch
Continuous engagement
– With Microsoft
– With ISVs
Bring your opinions to us and to Microsoft.