3. DirectX 11 Basics
New API from Microsoft®
– Released alongside Windows® 7
– Runs on Windows Vista® as well
Supports downlevel hardware
– DirectX 9-, DirectX 10-, and DirectX 11-class HW supported
– Exposed features depend on GPU
Allows the use of the same API for multiple
generations of GPUs
– However Windows Vista/Windows 7 required
Lots of new features…
4. What is DirectCompute?
DirectCompute brings GPGPU to DirectX
DirectCompute is both separate from and
integrated with the DirectX graphics pipeline
– Compute Shader
– Compute features in Pixel Shader
Potential applications
– Physics
– AI
– Image processing
5. DirectCompute – part of
DirectX 11
DirectX 11 helps efficiently combine
Compute work with graphics
– Sharing of buffers is trivial
– Work graph is scheduled efficiently by the
driver
[Diagram: the DirectX 11 graphics pipeline — Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader — with the Compute Shader alongside it]
9. Transparency Problem
Classic problem in computer graphics
Correct rendering of semi-transparent geometry requires
sorting – blending is an order-dependent operation
Sometimes sorting triangles is enough but not always
– Difficult to sort: Multiple meshes interacting (many draw calls)
– Impossible to sort: Intersecting triangles (must sort fragments)
Try doing this
in PowerPoint!
10. Background
A-buffer – Carpenter '84
– CPU-side linked list per pixel for anti-aliasing
Fixed array per pixel
– F-buffer, stencil-routed A-buffer, Z3 buffer, k-buffer,
slice map, bucket depth peeling
Multi-pass
– Depth peeling methods for transparency
Recent
– Freepipe, PreCalc [DirectX11 SDK]
11. OIT using Per-Pixel Linked Lists
Fast creation of linked lists of arbitrary size
on the GPU using D3D11
– Computes correct transparency
Integration into the standard graphics pipeline
– Demonstrates compute from rasterized data
– DirectCompute features in Pixel Shader
– Works with depth and stencil testing
– Works with and without MSAA
Example of programmable blend
12. Linked List Construction
Two Buffers
– Head pointer buffer
addresses/offsets
Initialized to end-of-list (EOL) value (e.g., -1)
– Node buffer
arbitrary payload data + “next pointer”
Each shader thread
1. Retrieve and increment global counter value
2. Atomic exchange into head pointer buffer
3. Add new entry into the node buffer at location from step 1
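The three steps above map directly onto a CPU sketch using std::atomic. All names, and the single-float payload, are illustrative rather than the demo's actual code:

```cpp
#include <atomic>
#include <vector>

// CPU sketch of the GPU per-pixel linked-list insertion.
struct Node { float depth; int next; };

struct LinkedListBuffers {
    std::atomic<int> counter{0};          // global node counter
    std::vector<std::atomic<int>> heads;  // head pointer per pixel, -1 = end-of-list
    std::vector<Node> nodes;              // node buffer: payload + "next pointer"

    LinkedListBuffers(int pixels, int maxNodes)
        : heads(pixels), nodes(maxNodes) {
        for (auto& h : heads) h.store(-1);  // initialize heads to EOL
    }

    bool insert(int pixel, float depth) {
        int addr = counter.fetch_add(1);              // 1. retrieve & increment counter
        if (addr >= (int)nodes.size()) return false;  //    out of node memory
        int oldHead = heads[pixel].exchange(addr);    // 2. atomic exchange into head buffer
        nodes[addr] = {depth, oldHead};               // 3. write entry at the allocated slot
        return true;
    }
};
```

Each new node becomes the list head and links back to the previous head, so the per-pixel list is built newest-first without any locks.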
13. Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
2. Screen quad resolves and composites
fragment lists
15. Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
– All fragments are stored using per-pixel linked
lists
– Store each fragment's color, alpha, and depth
2. Screen quad resolves and composites
fragment lists
16. Setup
Two buffers
– Screen sized head pointer buffer
– Node buffer – large enough to handle all
fragments
Render as usual
Disable render target writes
Insert render target data into linked list
25. Node Buffer Counter
Counter allocated in GPU memory (i.e. a buffer)
– Atomic updates
– Contention issues
DirectX 11 Append feature
– Linear writes to a buffer
– Implicit writes: Append()
– Explicit writes: IncrementCounter()
Up to 60% faster than counters built on
standard memory operations
26. Code Example
RWStructuredBuffer<int> RWStructuredCounter;
RWTexture2D<int> tRWFragmentListHead;
RWTexture2D<float4> tRWFragmentColor;
RWTexture2D<int2> tRWFragmentDepthAndLink;

[earlydepthstencil]
void PS( PsInput input )
{
    float4 vFragment = ComputeFragmentColor(input);
    int2 vScreenAddress = int2(input.vPositionSS.xy);

    // Get counter value and increment
    int nNewFragmentAddress = RWStructuredCounter.IncrementCounter();
    if ( nNewFragmentAddress == FRAGMENT_LIST_NULL ) { return; }

    // Update head buffer
    int nOldFragmentAddress;
    InterlockedExchange( tRWFragmentListHead[vScreenAddress],
                         nNewFragmentAddress, nOldFragmentAddress );

    // Write the fragment attributes to the node buffer
    int2 vAddress = GetAddress( nNewFragmentAddress );
    tRWFragmentColor[vAddress] = vFragment;
    tRWFragmentDepthAndLink[vAddress] = int2(
        int(saturate(input.vPositionSS.z) * 0x7fffffff), nOldFragmentAddress );
}
27. Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
2. Screen quad resolves and composites
fragment lists
– Single pass
– Pixel shader sorts associated linked list (e.g., insertion sort)
– Composite fragments in sorted order with background
– Output final fragment
32. Anti-Aliasing
Store coverage information in the linked list
Resolve per-sample
– Execute a shader at each sample location
– Use MSAA hardware
Resolve per-pixel
– Execute a shader at each pixel location
– Average all sample contributions within the
shader
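One possible sketch of the two resolve paths, assuming each stored fragment carries an MSAA coverage bitmask; names and single-channel colors are illustrative:

```cpp
#include <cstdint>
#include <vector>

struct Fragment { float color; float alpha; uint32_t coverage; };  // one coverage bit per sample

// Per-sample resolve sketch: composite only the fragments covering this
// sample; the list is assumed already sorted back to front.
float resolveSample(const std::vector<Fragment>& sorted, int sample, float background) {
    float result = background;
    for (const Fragment& f : sorted)
        if (f.coverage & (1u << sample))
            result = f.color * f.alpha + result * (1.0f - f.alpha);
    return result;
}

// Per-pixel resolve sketch: average all sample contributions within the shader.
float resolvePixel(const std::vector<Fragment>& sorted, int numSamples, float background) {
    float sum = 0.0f;
    for (int s = 0; s < numSamples; ++s)
        sum += resolveSample(sorted, s, background);
    return sum / numSamples;
}
```

A fragment covering only some samples then contributes only a fraction of its color to the final pixel, which is what produces the anti-aliased edge.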
33. Performance Comparison
                       Teapot    Dragon
Linked List            743 fps   338 fps
Precalc                285 fps   143 fps
Depth Peeling          579 fps    45 fps
Bucket Depth Peeling   ---       256 fps
Dual Depth Peeling     ---        94 fps
Based on internal numbers
Performance scaled to ATI Radeon HD 5770 graphics card
37. Future Work
Memory allocation
Sort on insert
Other linked list applications
– Indirect illumination
– Motion blur
– Shadows
More complex data structures
39. DirectCompute for physics
DirectCompute in the Bullet physics SDK
– An introduction to cloth simulation
– Some tips for implementation in DirectCompute
– A demonstration of the current state of
development
40. Cloth simulation
Large number of particles
– Appropriate for parallel processing
– Force from each spring constraint applied to both
connected particles
[Diagram: cloth mesh, original layout vs. current layout — compute forces as stretch from rest length, compute new positions, apply position corrections to masses]
42. Cloth simulation steps
For each simulation iteration:
– Compute forces in each link based on its length
– Correct positions of masses/vertices from forces
– Compute new vertex positions
[Diagram: cloth mesh, original layout vs. current layout — compute forces as stretch from rest length, compute new positions, apply position corrections to masses]
45. Springs and masses
Two or three main types of springs
– Structural/shearing
– Bending
47. CPU approach to simulation
One link at a time
Perform updates in place
“Gauss-Seidel” style
Conserves momentum
Iterate n times
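The in-place, one-link-at-a-time update can be sketched as below; the 1D positions and the half-and-half correction are illustrative simplifications of the real 3D solver:

```cpp
#include <vector>

struct Link { int a, b; float restLength; };

// Gauss-Seidel-style sketch: relax one link at a time, moving both endpoints
// toward the rest length by equal and opposite amounts, which conserves
// momentum. Each update is in place, so later links in the same iteration
// already see the corrected positions.
void relax(std::vector<float>& pos, const std::vector<Link>& links, int iterations) {
    for (int it = 0; it < iterations; ++it) {       // iterate n times
        for (const Link& l : links) {                // one link at a time
            float delta = pos[l.b] - pos[l.a];
            float stretch = delta - l.restLength;    // signed deviation from rest length
            pos[l.a] += 0.5f * stretch;              // both ends corrected half each
            pos[l.b] -= 0.5f * stretch;
        }
    }
}
```

Because both endpoints move by equal and opposite amounts, the center of mass of each link is untouched; this is the momentum-conserving property the per-vertex pixel shader approach loses.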
53. Moving to the GPU:
The pixel shader approach
Offers full parallelism
One vertex at a time
No scattered writes
61. Downsides of the pixel
shader approach
No propagation of
updates
– If we double buffer
Or non-deterministic
– If we update in-place in a
read/write array
Momentum preservation
– Lacking due to single-ended link updates
62. Can DirectCompute help?
Offers scattered writes, as we saw earlier
The GPU implementation could be more like
the CPU
– Solver per-link rather than per-vertex
– Leads to races between links that update the
same vertex
63. Execute independent subsets
in parallel
All links act at both ends
Batch links
– No two links in a given
batch share a vertex
– No data races
68. On a real cloth mesh we need
many batches
Create independent subsets of links through
graph coloring.
Synchronize between batches
78. Returning to our batching
10 batches: 10 compute shader dispatches
Only 1/10 of the links per batch
Low compute density per thread
10 batches
79. Packing for higher efficiency
Can create larger groups
– The cloth is fixed-structure
– Can be preprocessed
Fewer batches/dispatches
Less parallelism
4 batches
80. Solving cloths together
Solve multiple cloths together in n batches
Grouping
– Fewer, larger dispatches
– Regain the parallelism that the increased
work per thread removed
4 batches
81. Shared memory
We've made use of scattered writes
The next feature of DirectCompute: shared
memory
– Load data at the start of a block
– Compute over multiple links together
– Write data out again
83. So let's look at the batching we saw before:
There are 4 batches:
– If we do this per group we need 3 groups rather
than three DirectX “threads”
How can we improve the
batching?
Group 1 Batch 1
94. Group execution
The sequence of operations for the first
batch is:
Few links, so low packing efficiency
– Not a problem with larger cloth
96. Group execution
The sequence of operations for the first
batch is:
// Load
AllMemoryBarrierWithGroupSync();
for ( each subgroup ) {
    // Process a subgroup
    AllMemoryBarrierWithGroupSync();
}
// Store
97. Why is this an improvement?
So we still need 10*4 batches. What have
we gained?
– The batches within a group chunk are in-shader
loops
– Only 4 shader dispatches, each with significant
overhead
The barriers will still hit performance
– We are no longer dispatch bound, but we are
likely to be on-chip synchronization bound
98. Exploiting the SIMD
architecture
Hardware executes 64- or 32-wide SIMD
Sequentially consistent at the SIMD level
Synchronization is now implicit
– Take care
– Execute over groups that are SIMD width or a
divisor thereof
99. Group execution
The sequence of operations for the first
batch is:
Just works… (synchronization within a SIMD is implicit)
101. Performance gains
For 90,000 links:
– No solver: 2.98 ms/frame
– Fully batched link solver: 3.84 ms/frame
– SIMD batched solver: 3.22 ms/frame
– CPU solver: 16.24 ms/frame
3.5x improvement in solver alone
(67x improvement over the CPU solver)
Based on internal numbers
102. One more thing…
Remember the tight pipeline integration?
How can we use this to our advantage?
[Diagram: the DirectX 11 graphics pipeline — Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader — with the Compute Shader alongside it]
103. Efficiently output vertex data
Cloth simulation updates vertex positions
– Generated on the GPU
– Need to be used on the GPU for rendering
– Why not keep them there?
Large amount of data to update
– Many vertices in fine simulation meshes
– Normals and other information present
104. Create a vertex buffer
// Create a vertex buffer with unordered access support
D3D11_BUFFER_DESC bd;
bd.Usage = D3D11_USAGE_DEFAULT;
bd.ByteWidth = vertexBufferSize * 32;
bd.BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_UNORDERED_ACCESS;
bd.CPUAccessFlags = 0;
bd.MiscFlags = 0;
bd.StructureByteStride = 32;
hr = m_d3dDevice->CreateBuffer( &bd, NULL, &m_Buffer );

// Create an unordered access view of the buffer to allow writing
D3D11_UNORDERED_ACCESS_VIEW_DESC ud;
ud.Format = DXGI_FORMAT_UNKNOWN;
ud.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
ud.Buffer.FirstElement = 0;
ud.Buffer.NumElements = vertexBufferSize;
ud.Buffer.Flags = 0;
hr = m_d3dDevice->CreateUnorderedAccessView( m_Buffer, &ud, &m_UAV );
It's a vertex buffer.
It's also bound for unordered access.
Scattered writes!
105. Performance gains
For 90,000 links with copy on GPU:
– No solver: 0.58 ms/frame
– Fully batched link solver: 0.82 ms/frame
– SIMD batched solver: 0.617 ms/frame
6.5x improvement in solver
6.5x improvement from eliminating the CPU copy alone
23x improvement over the simpler solver with host copy
Based on internal numbers
107. References
Yang J., Hensley J., Grün H., Thibieroz N.: Real-Time Concurrent
Linked List Construction on the GPU. In Rendering Techniques 2010:
Eurographics Symposium on Rendering (2010), vol. 29,
Eurographics.
Grün H., Thibieroz N.: OIT and Indirect Illumination using DirectX11
Linked Lists. In Proceedings of Game Developers Conference 2010
(Mar. 2010).
http://developer.amd.com/gpu_assets/OIT%20and%20Indirect%20Illumination%20using%20DirectX11%20Linked%20Lists_forweb.ppsx
http://developer.amd.com/samples/demos/pages/ATIRadeonHD5800
SeriesRealTimeDemos.aspx
http://bulletphysics.org