Advanced DirectX® 11
technology:
DirectCompute by
Example
Jason Yang and Lee Howes
August 18, 2010
DirectX 11 Basics
 New API from Microsoft®
– Released alongside Windows® 7
– Runs on Windows Vista® as well
 Supports downlevel hardware
– DirectX9, DirectX10, DirectX11-class HW supported
– Exposed features depend on GPU
 Allows the use of the same API for multiple
generations of GPUs
– However Windows Vista/Windows 7 required
 Lots of new features…
What is DirectCompute?
 DirectCompute brings GPGPU to DirectX
 DirectCompute is both separate from and
integrated with the DirectX graphics pipeline
– Compute Shader
– Compute features in Pixel Shader
 Potential applications
– Physics
– AI
– Image processing
DirectCompute – part of
DirectX 11
 DirectX 11 helps efficiently combine
Compute work with graphics
– Sharing of buffers is trivial
– Work graph is scheduled efficiently by the
driver
[Diagram: the graphics pipeline: Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader, with the Compute Shader alongside it]
DirectCompute Features
 Scattered writes
 Atomic operations
 Append/consume buffer
 Shared memory (local data share)
 Structured buffers
 Double precision (if supported)
DirectCompute by Example
 Order Independent Transparency (OIT)
– Atomic operations
– Scattered writes
– Append buffer feature
 Bullet Cloth Simulation
– Shared memory
– Shared compute and graphics buffers
Order Independent
Transparency
Transparency Problem
 Classic problem in computer graphics
 Correct rendering of semi-transparent geometry requires
sorting – blending is an order dependent operation
 Sometimes sorting triangles is enough but not always
– Difficult to sort: Multiple meshes interacting (many draw calls)
– Impossible to sort: Intersecting triangles (must sort fragments)
Try doing this
in PowerPoint!
Background
 A-buffer – Carpenter '84
– CPU side linked list per-pixel for anti-aliasing
 Fixed array per-pixel
– F-buffer, stencil routed A-buffer, Z3 buffer, and k-buffer,
Slice map, bucket depth peeling
 Multi-pass
– Depth peeling methods for transparency
 Recent
– Freepipe, PreCalc [DirectX11 SDK]
OIT using Per-Pixel Linked Lists
 Fast creation of linked lists of arbitrary size
on the GPU using D3D11
– Computes correct transparency
 Integration into the standard graphics pipeline
– Demonstrates compute from rasterized data
– DirectCompute features in Pixel Shader
– Works with depth and stencil testing
– Works with and without MSAA
 Example of programmable blend
Linked List Construction
 Two Buffers
– Head pointer buffer
 addresses/offsets
 Initialized to end-of-list (EOL) value (e.g., -1)
– Node buffer
 arbitrary payload data + “next pointer”
 Each shader thread
1. Retrieve and increment global counter value
2. Atomic exchange into head pointer buffer
3. Add new entry into the node buffer at location from step 1 (sketched below)
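A minimal HLSL sketch of these three steps, using illustrative resource names that are not from the talk (the talk's full pixel shader appears a few slides later):

struct FragmentNode { float4 color; float depth; uint next; };

RWStructuredBuffer<FragmentNode> gNodes    : register(u1); // UAV created with a hidden counter
RWTexture2D<uint>                gHeadPtrs : register(u2); // per-pixel head indices

[earlydepthstencil]
void PS_InsertFragment( float4 vPos : SV_Position, float4 vColor : COLOR0 )
{
    uint2 vPixel = uint2(vPos.xy);

    // 1. Retrieve and increment the global counter
    uint nNewNode = gNodes.IncrementCounter();

    // 2. Atomic exchange into the head pointer buffer
    uint nOldHead;
    InterlockedExchange( gHeadPtrs[vPixel], nNewNode, nOldHead );

    // 3. Add the new entry into the node buffer at the location from step 1
    FragmentNode node;
    node.color = vColor;
    node.depth = vPos.z;
    node.next  = nOldHead;
    gNodes[nNewNode] = node;
}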
Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
2. Screen quad resolves and composites
fragment lists
Step 0 – Render Opaque
 Render all opaque geometry normally
Render Target
Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
– All fragments are stored using per-pixel linked
lists
– Store each fragment's color, alpha, & depth
2. Screen quad resolves and composites
fragment lists
Setup
 Two buffers
– Screen sized head pointer buffer
– Node buffer – large enough to handle all
fragments
 Render as usual
 Disable render target writes
 Insert render target data into linked list
Step 1 – Create Linked List
[Animated diagram of the Render Target, Head Pointer Buffer and Node Buffer:
– Head Pointer Buffer starts filled with -1, Node Buffer empty, Counter = 0
– IncrementCounter(): Counter = 1
– InterlockedExchange(): the covered pixel's head pointer becomes 0
– Scatter write: node 0 stores depth 0.87 with next = -1
– The first triangle's remaining fragments store nodes 1 (0.89) and 2 (0.90); one fragment is culled due to existing scene geometry depth; Counter = 3
– A second triangle stores nodes 3 (0.65, next = 0) and 4 (0.65, next = -1); the affected head pointers become 3 and 4; Counter = 5
– Node 5 (0.71, next = 3) is stored and becomes the new head for its pixel; Counter = 6]
Node Buffer Counter
 Counter allocated in GPU memory (i.e. a buffer)
– Atomic updates
– Contention issues
 DirectX11 Append feature
– Linear writes to a buffer
– Implicit writes
 Append()
– Explicit writes
 IncrementCounter()
 Standard memory operations
– Up to 60% faster than memory counters
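A minimal HLSL sketch contrasting the two counter styles above (illustrative names, not the talk's code); the OIT shader on the next slide uses IncrementCounter() because the returned index doubles as the linked-list pointer:

struct Node { float4 color; uint next; };

AppendStructuredBuffer<Node> gAppendNodes : register(u1); // implicit, linear writes
RWStructuredBuffer<Node>     gNodes       : register(u2); // explicit writes via the counter

void StoreWithAppend( Node n )
{
    // Append(): the counter is bumped and the element written implicitly
    gAppendNodes.Append( n );
}

uint StoreWithCounter( Node n )
{
    // IncrementCounter(): we receive the slot and perform the memory write ourselves
    uint nSlot = gNodes.IncrementCounter();
    gNodes[nSlot] = n;
    return nSlot;
}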
Code Example
RWStructuredBuffer<int> RWStructuredCounter;
RWTexture2D<int> tRWFragmentListHead;
RWTexture2D<float4> tRWFragmentColor;
RWTexture2D<int2> tRWFragmentDepthAndLink;
[earlydepthstencil]
void PS( PsInput input )
{
float4 vFragment = ComputeFragmentColor(input);
int2 vScreenAddress = int2(input.vPositionSS.xy);
// Get counter value and increment
int nNewFragmentAddress = RWStructuredCounter.IncrementCounter();
if ( nNewFragmentAddress == FRAGMENT_LIST_NULL ) { return; }
// Update head buffer
int nOldFragmentAddress;
InterlockedExchange(tRWFragmentListHead[vScreenAddress], nNewFragmentAddress,
nOldFragmentAddress );
// Write the fragment attributes to the node buffer
int2 vAddress = GetAddress( nNewFragmentAddress );
tRWFragmentColor[vAddress] = vFragment;
tRWFragmentDepthAndLink[vAddress] = int2(
int(saturate(input.vPositionSS.z) * 0x7fffffff), nOldFragmentAddress );
return;
}
Algorithm Overview
0. Render opaque scene objects
1. Render transparent scene objects
2. Screen quad resolves and composites
fragment lists
– Single pass
– Pixel shader sorts associated linked list (e.g., insertion sort)
– Composite fragments in sorted order with background
– Output final fragment (a minimal resolve sketch follows below)
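A minimal HLSL sketch of this resolve pass, assuming a fixed-size temporary array and illustrative resource names that mirror the node buffers from step 1:

#define MAX_FRAGMENTS 16

Texture2D<uint>          gHeadPtrs         : register(t0);
StructuredBuffer<float4> gNodeColors       : register(t1);
StructuredBuffer<uint2>  gNodeDepthAndNext : register(t2); // x = depth bits, y = next index
Texture2D<float4>        gBackground       : register(t3); // opaque scene from step 0

float4 PS_Resolve( float4 vPos : SV_Position ) : SV_Target
{
    uint2 vPixel = uint2(vPos.xy);

    // Walk the per-pixel list into a temporary array (x = depth bits, y = node index)
    uint2 frags[MAX_FRAGMENTS];
    uint  count = 0;
    uint  node  = gHeadPtrs[vPixel];
    while ( node != 0xFFFFFFFF && count < MAX_FRAGMENTS ) // -1 marks end of list
    {
        frags[count] = uint2( gNodeDepthAndNext[node].x, node );
        count++;
        node = gNodeDepthAndNext[node].y;
    }

    // Insertion sort by depth, farthest fragment first
    for ( uint i = 1; i < count; i++ )
    {
        uint2 key = frags[i];
        int   j   = int(i) - 1;
        while ( j >= 0 && frags[j].x < key.x )
        {
            frags[j + 1] = frags[j];
            j--;
        }
        frags[j + 1] = key;
    }

    // Composite the sorted fragments back to front over the opaque background
    float4 result = gBackground[vPixel];
    for ( uint k = 0; k < count; k++ )
    {
        float4 c = gNodeColors[frags[k].y];
        result.rgb = lerp( result.rgb, c.rgb, c.a );
    }
    return float4( result.rgb, 1.0f );
}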
Step 2 – Render Fragments
[Animated diagram, walking the per-pixel lists over the Render Target:
– Pixels (0,0) → (1,1): fetch head pointer = -1, which indicates no fragment to render
– Pixel (1,1): fetch head pointer = 5, fetch node 5, walk the list and store depths 0.71, 0.65, 0.87 in a temp array
– Sort the temp array (0.65, 0.71, 0.87), blend the colors and write out]
Anti-Aliasing
 Store coverage information in the linked list
 Resolve per-sample
– Execute a shader at each sample location
– Use MSAA hardware
 Resolve per-pixel
– Execute a shader at each pixel location
– Average all sample contributions within the
shader
Performance Comparison
                      Teapot    Dragon
Linked List           743 fps   338 fps
Precalc               285 fps   143 fps
Depth Peeling         579 fps    45 fps
Bucket Depth Peeling    ---     256 fps
Dual Depth Peeling      ---      94 fps
Based on internal numbers
Performance scaled to ATI Radeon HD 5770 graphics card
Mecha Demo
 602K scene triangles
– 254K transparent triangles
Layers
[Chart: based on internal numbers]
Scaling
[Chart: based on internal numbers]
Future Work
 Memory allocation
 Sort on insert
 Other linked list applications
– Indirect illumination
– Motion blur
– Shadows
 More complex data structures
Bullet Cloth Simulation
DirectCompute for physics
 DirectCompute in the Bullet physics SDK
– An introduction to cloth simulation
– Some tips for implementation in DirectCompute
– A demonstration of the current state of
development
Cloth simulation
 Large number of particles
– Appropriate for parallel processing
– Force from each spring constraint applied to both connected particles
[Diagram: original layout vs. current layout; compute forces as stretch from the spring's rest length, apply position corrections to masses, compute new positions]
Cloth simulation steps
 For each simulation iteration:
– Compute forces in each link based on its length
– Correct positions of masses/vertices from forces
– Compute new vertex positions (a per-link kernel sketch follows below)
[Diagram: compute forces as stretch from rest length → apply position corrections to masses → compute new positions]
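A minimal HLSL sketch of one per-link solver step for a single batch, dispatched once per batch so that no two threads in flight touch the same vertex; the names, link layout and inverse-mass weighting are illustrative rather than Bullet's actual kernel:

cbuffer SolverConstants : register(b0)
{
    uint gNumLinksInBatch;
};

struct Link { uint a; uint b; float restLength; };

StructuredBuffer<Link>     gBatchLinks : register(t0); // links of the current batch
RWStructuredBuffer<float4> gPositions  : register(u0); // xyz = position, w = inverse mass

[numthreads(64, 1, 1)]
void SolvePositionsFromLinks( uint3 DTid : SV_DispatchThreadID )
{
    if ( DTid.x >= gNumLinksInBatch ) return;

    Link   currentLink = gBatchLinks[DTid.x];
    float4 pa = gPositions[currentLink.a];
    float4 pb = gPositions[currentLink.b];

    // Correction proportional to the stretch from the rest length
    float3 d    = pb.xyz - pa.xyz;
    float  len  = max( length(d), 1e-6f );
    float3 corr = ( len - currentLink.restLength ) * ( d / len );
    float  wSum = max( pa.w + pb.w, 1e-6f );

    // Apply the correction to both connected particles, weighted by inverse mass.
    // Safe because links within one batch never share a vertex.
    pa.xyz += ( pa.w / wSum ) * corr;
    pb.xyz -= ( pb.w / wSum ) * corr;
    gPositions[currentLink.a] = pa;
    gPositions[currentLink.b] = pb;
}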
Springs and masses
 Two or three main types of springs
– Structural/shearing
– Bending
CPU approach to simulation
 One link at a time
 Perform updates in place
 "Gauss-Seidel" style
 Conserves momentum
 Iterate n times
[Animated diagram: links are relaxed one at a time, stepping the cloth toward the target configuration]
Moving to the GPU:
The pixel shader approach
 Offers full parallelism
 One vertex at a time
 No scattered writes
[Animated diagram: every vertex is updated in parallel, stepping toward the target configuration]
Downsides of the pixel
shader approach
 No propagation of
updates
– If we double buffer
 Or non-deterministic
– If we update in-place in a
read/write array
 Momentum preservation
– Lacking due to single-ended link updates
Can DirectCompute help?
 Offers scattered writes as a feature as we
saw earlier
 The GPU implementation could be more like
the CPU
– Solver per-link rather than per-vertex
– Leads to races between links that update the
same vertex
Execute independent subsets in parallel
 All links act at both ends
 Batch links
– No two links in a given batch share a vertex
– No data races
On a real cloth mesh we need many batches
 Create independent subsets of links through graph coloring.
 Synchronize between batches
[Animated diagram: 1st batch, 2nd batch, 3rd batch, … 10 batches in total]
Driving batches and synchronizing
[Diagram: a simulation step runs Iteration 0, Iteration 1, Iteration 2, …; each iteration executes Batch 0 through Batch 4 in sequence]
Driving batches and synchronizing
[Diagram: Iteration 0; each of Batch 0 … Batch 4 is issued with its own dispatch:]
// Execute the kernel
context->CSSetShader( solvePositionsFromLinksKernel.kernel, NULL, 0 );
int numBlocks = (constBuffer.numLinks + (blockSize-1)) / blockSize;
context->Dispatch( numBlocks, 1, 1 );
Driving batches and synchronizing
[Animated diagram: within one iteration of the simulation step the host dispatches Batch 0 … Batch 4 in turn, twiddling fingers between every dispatch; remember, 10 batches!]
Returning to our batching
 10 batches: 10 compute shader dispatches
 1/10 of the links per batch
 Low compute density per thread
10 batches
Packing for higher efficiency
 Can create larger groups
– The cloth is fixed-structure
– Can be preprocessed
 Fewer batches/dispatches
 Less parallelism
4 batches
Solving cloths together
 Solve multiple cloths together in n batches
 Grouping
– Larger and reduced number of dispatches
– Regain the parallelism that the increased work per thread removed
4 batches
Shared memory
 We've made use of scattered writes
 The next feature of DirectCompute: shared
memory
– Load data at the start of a block
– Compute over multiple links together
– Write data out again
Driving batches and synchronizing
[Diagram: each iteration of the simulation step now runs Batch 0 and Batch 1, and each batch loops over inner batch 0, inner batch 1, … inside the shader]
How can we improve the batching?
 So let's look at the batching we saw before:
 There are 4 batches:
– If we do this per group we need 3 groups rather than three DirectX "threads"
[Animated diagram: Group 1 Batch 1, Group 2 Batch 1, Group 1 Batch 2, …]
Solving in shared memory
// Define a "groupshared" buffer for shared data storage
groupshared float4 positionSharedData[VERTS_PER_GROUP];

// Data will be shared across a group of threads with these dimensions
[numthreads(GROUP_SIZE, 1, 1)]
void
SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )
{
  // Load data from global buffers into the shared region
  for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
  {
    int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
    positionSharedData[vertex] = g_vertexPositions[vertexAddress];
  }
  ... // Perform computation in shared buffer
  // Write back to the global buffer after computation
  for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )
  {
    int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
    g_vertexPositions[vertexAddress] = positionSharedData[vertex];
  }
}
Group execution
 The sequence of operations for the first batch is:
[Animated diagram: load into shared memory, process each inner batch with a Synchronize between them, then store; few links here means low packing efficiency, but that is not a problem with larger cloth]
// load
AllMemoryBarrierWithGroupSync();
for( each subgroup ) {
  // Process a subgroup
  AllMemoryBarrierWithGroupSync();
}
// Store
Why is this an improvement?
 So we still need 10*4 batches. What have
we gained?
– The batches within a group chunk are in-shader
loops
– Only 4 shader dispatches, each with significant
overhead
 The barriers will still hit performance
– We are no longer dispatch bound, but we are
likely to be on-chip synchronization bound
Exploiting the SIMD
architecture
 Hardware executes 64- or 32-wide SIMD
 Sequentially consistent at the SIMD level
 Synchronization is now implicit
– Take care
– Execute over groups that are SIMD width or a divisor thereof (see the sketch below)
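A minimal sketch of the barrier-free variant, assuming GROUP_SIZE is chosen to match the hardware wavefront width (e.g. 64) so the whole group runs in lockstep; as the slide warns, this relies on hardware behaviour rather than an API guarantee, and the sizes and link layout here are illustrative:

#define GROUP_SIZE      64   // assumed equal to the SIMD/wavefront width
#define SUBGROUPS       4    // inner batches processed per group
#define VERTS_PER_GROUP 128

struct Link { uint a; uint b; float restLength; }; // a, b index into the group's shared region

StructuredBuffer<Link>     gGroupLinks : register(t0); // pre-batched: SUBGROUPS*GROUP_SIZE links per group
RWStructuredBuffer<float4> gPositions  : register(u0);

groupshared float4 sPos[VERTS_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]
void SolveLinksWaveSynchronous( uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID )
{
    // Load this group's vertices into shared memory
    for ( uint v = GTid.x; v < VERTS_PER_GROUP; v += GROUP_SIZE )
        sPos[v] = gPositions[Gid.x * VERTS_PER_GROUP + v];

    // No AllMemoryBarrierWithGroupSync() between subgroups: with GROUP_SIZE equal to
    // the wavefront width the group is sequentially consistent at the SIMD level.
    for ( uint sub = 0; sub < SUBGROUPS; ++sub )
    {
        uint linkIndex   = ( Gid.x * SUBGROUPS + sub ) * GROUP_SIZE + GTid.x;
        Link currentLink = gGroupLinks[linkIndex];

        float3 d    = sPos[currentLink.b].xyz - sPos[currentLink.a].xyz;
        float  len  = max( length(d), 1e-6f );
        float3 corr = 0.5f * ( len - currentLink.restLength ) * ( d / len );

        // Links in one subgroup never share a vertex, so these writes do not race
        sPos[currentLink.a].xyz += corr;
        sPos[currentLink.b].xyz -= corr;
    }

    // Store the results back to the global buffer
    for ( uint w = GTid.x; w < VERTS_PER_GROUP; w += GROUP_SIZE )
        gPositions[Gid.x * VERTS_PER_GROUP + w] = sPos[w];
}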
Group execution
 The sequence of operations for the first batch is:
[Animated diagram: the same load / inner batches / store sequence; every synchronization point now just works implicitly]
Driving batches and synchronizing
[Diagram: each iteration dispatches Batch 0 and Batch 1; inner batch 0, inner batch 1, … run inside each dispatch, with a Synchronize between the batches]
Performance gains
 For 90,000 links:
– No solver running in 2.98 ms/frame
– Fully batched link solver in 3.84 ms/frame
– SIMD batched solver 3.22 ms/frame
– CPU solver 16.24 ms/frame
 3.5x improvement in solver alone
 (67x improvement over the CPU solver)
Based on internal numbers
One more thing…
 Remember the tight pipeline integration?
 How can we use this to our advantage?
[Diagram: the graphics pipeline: Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader, with the Compute Shader alongside it]
Efficiently output vertex data
 Cloth simulation updates vertex positions
– Generated on the GPU
– Need to be used on the GPU for rendering
– Why not keep them there?
 Large amount of data to update
– Many vertices in fine simulation meshes
– Normals and other information present
Create a vertex buffer
// Create a vertex buffer with unordered access support
D3D11_BUFFER_DESC bd;
bd.Usage = D3D11_USAGE_DEFAULT;
bd.ByteWidth = vertexBufferSize * 32;
bd.BindFlags =
D3D11_BIND_VERTEX_BUFFER |
D3D11_BIND_UNORDERED_ACCESS;
bd.CPUAccessFlags = 0;
bd.MiscFlags = 0;
bd.StructureByteStride = 32;
hr = m_d3dDevice->CreateBuffer(&bd, NULL, &m_Buffer);
// Create an unordered access view of the buffer to allow writing
D3D11_UNORDERED_ACCESS_VIEW_DESC ud;
ud.Format = DXGI_FORMAT_UNKNOWN;
ud.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
ud.Buffer.NumElements = vertexBufferSize;
hr = m_d3dDevice->CreateUnorderedAccessView(m_Buffer, &ud, &m_UAV);
It's a vertex buffer. It's also bound for unordered access. Scattered writes!
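A minimal HLSL sketch of the shader side, assuming the UAV above is bound as the 32-byte-stride structured view and that a vertex holds a float4 position and a float4 normal; names are illustrative, not the SDK's code:

struct ClothVertex { float4 position; float4 normal; }; // 32 bytes, matching StructureByteStride

StructuredBuffer<float4>        gSolvedPositions : register(t0);
StructuredBuffer<float4>        gSolvedNormals   : register(t1);
RWStructuredBuffer<ClothVertex> gVertexBuffer    : register(u0); // view of the vertex buffer created above

[numthreads(64, 1, 1)]
void WriteVertices( uint3 DTid : SV_DispatchThreadID )
{
    // Scattered write straight into the vertex buffer: no CPU round trip before rendering.
    // Assumes the dispatch is sized to cover exactly vertexBufferSize elements.
    ClothVertex v;
    v.position = gSolvedPositions[DTid.x];
    v.normal   = gSolvedNormals[DTid.x];
    gVertexBuffer[DTid.x] = v;
}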
Performance gains
 For 90,000 links with copy on GPU:
– No solver running in 0.58 ms/frame
– Fully batched link solver in 0.82 ms/frame
– SIMD batched solver 0.617 ms/frame
 6.5x improvement in solver
 6.5x improvement from CPU copy alone
 23x improvement over simpler solver with host copy
Based on internal numbers
Thanks
 Justin Hensley
 Holger Grün
 Nicholas Thibieroz
 Erwin Coumans
References
 Yang J., Hensley J., Grün H., Thibieroz N.: Real-Time Concurrent
Linked List Construction on the GPU. In Rendering Techniques 2010:
Eurographics Symposium on Rendering (2010), vol. 29,
Eurographics.
 Grün H., Thibieroz N.: OIT and Indirect Illumination using DirectX11
Linked Lists. In Proceedings of Game Developers Conference 2010
(Mar. 2010).
http://developer.amd.com/gpu_assets/OIT%20and%20Indirect%20Il
lumination%20using%20DirectX11%20Linked%20Lists_forweb.ppsx
 http://developer.amd.com/samples/demos/pages/ATIRadeonHD5800
SeriesRealTimeDemos.aspx
 http://bulletphysics.org
Trademark Attribution
AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Microsoft, Windows,
Windows Vista, Windows 7 and DirectX are registered trademarks of Microsoft Corporation in the U.S.
and/or other jurisdictions. Other names used in this presentation are for identification purposes only and
may be trademarks of their respective owners.
©2010 Advanced Micro Devices, Inc. All rights reserved.
More Related Content

What's hot

Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
SQL Tuning, takes 3 to tango
SQL Tuning, takes 3 to tangoSQL Tuning, takes 3 to tango
SQL Tuning, takes 3 to tangoMauro Pagano
 
SQL Plan Directives explained
SQL Plan Directives explainedSQL Plan Directives explained
SQL Plan Directives explainedMauro Pagano
 
Libra : A Compatible Method for Defending Against Arbitrary Memory Overwrite
Libra : A Compatible Method for Defending Against Arbitrary Memory OverwriteLibra : A Compatible Method for Defending Against Arbitrary Memory Overwrite
Libra : A Compatible Method for Defending Against Arbitrary Memory OverwriteJeremy Haung
 
Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015mushupl
 
Brace yourselves, leap second is coming
Brace yourselves, leap second is comingBrace yourselves, leap second is coming
Brace yourselves, leap second is comingNati Cohen
 
All on Adaptive and Extended Cursor Sharing
All on Adaptive and Extended Cursor SharingAll on Adaptive and Extended Cursor Sharing
All on Adaptive and Extended Cursor SharingMohamed Houri
 
Oit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsOit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsHolger Gruen
 
Performance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingPerformance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingSveta Smirnova
 
Basic MySQL Troubleshooting for Oracle Database Administrators
Basic MySQL Troubleshooting for Oracle Database AdministratorsBasic MySQL Troubleshooting for Oracle Database Administrators
Basic MySQL Troubleshooting for Oracle Database AdministratorsSveta Smirnova
 
Performance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingPerformance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingSveta Smirnova
 
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesIntroduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesOfir Manor
 
Mysqlconf2013 mariadb-cassandra-interoperability
Mysqlconf2013 mariadb-cassandra-interoperabilityMysqlconf2013 mariadb-cassandra-interoperability
Mysqlconf2013 mariadb-cassandra-interoperabilitySergey Petrunya
 

What's hot (15)

Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
SQL Tuning, takes 3 to tango
SQL Tuning, takes 3 to tangoSQL Tuning, takes 3 to tango
SQL Tuning, takes 3 to tango
 
SQL Plan Directives explained
SQL Plan Directives explainedSQL Plan Directives explained
SQL Plan Directives explained
 
Libra : A Compatible Method for Defending Against Arbitrary Memory Overwrite
Libra : A Compatible Method for Defending Against Arbitrary Memory OverwriteLibra : A Compatible Method for Defending Against Arbitrary Memory Overwrite
Libra : A Compatible Method for Defending Against Arbitrary Memory Overwrite
 
Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015Character Encoding - MySQL DevRoom - FOSDEM 2015
Character Encoding - MySQL DevRoom - FOSDEM 2015
 
Brace yourselves, leap second is coming
Brace yourselves, leap second is comingBrace yourselves, leap second is coming
Brace yourselves, leap second is coming
 
MySQL SQL Tutorial
MySQL SQL TutorialMySQL SQL Tutorial
MySQL SQL Tutorial
 
All on Adaptive and Extended Cursor Sharing
All on Adaptive and Extended Cursor SharingAll on Adaptive and Extended Cursor Sharing
All on Adaptive and Extended Cursor Sharing
 
Oit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsOit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked Lists
 
Performance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingPerformance Schema for MySQL Troubleshooting
Performance Schema for MySQL Troubleshooting
 
Reduction
ReductionReduction
Reduction
 
Basic MySQL Troubleshooting for Oracle Database Administrators
Basic MySQL Troubleshooting for Oracle Database AdministratorsBasic MySQL Troubleshooting for Oracle Database Administrators
Basic MySQL Troubleshooting for Oracle Database Administrators
 
Performance Schema for MySQL Troubleshooting
Performance Schema for MySQL TroubleshootingPerformance Schema for MySQL Troubleshooting
Performance Schema for MySQL Troubleshooting
 
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesIntroduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
 
Mysqlconf2013 mariadb-cassandra-interoperability
Mysqlconf2013 mariadb-cassandra-interoperabilityMysqlconf2013 mariadb-cassandra-interoperability
Mysqlconf2013 mariadb-cassandra-interoperability
 

Viewers also liked

Escuela normal experimental de el fuerte jose
Escuela normal experimental de el fuerte joseEscuela normal experimental de el fuerte jose
Escuela normal experimental de el fuerte josejosesines
 
Aprendizaje por proyectos educativos
Aprendizaje por proyectos educativosAprendizaje por proyectos educativos
Aprendizaje por proyectos educativosPablo Román
 
Prueba 1. marioramos
Prueba 1. marioramosPrueba 1. marioramos
Prueba 1. marioramosramosiii
 
The palladium updates 12-7-2014
The palladium updates   12-7-2014The palladium updates   12-7-2014
The palladium updates 12-7-2014Sneha Chitte
 
работа с сервисом Wiki wall
работа с сервисом Wiki wallработа с сервисом Wiki wall
работа с сервисом Wiki wallovarlan
 
Unidad didáctica mujeres 8 marzo 2013
Unidad  didáctica mujeres 8 marzo 2013Unidad  didáctica mujeres 8 marzo 2013
Unidad didáctica mujeres 8 marzo 2013marinurg
 
North Media Center Highlights and Statistics February 2013
North Media Center Highlights and Statistics February 2013North Media Center Highlights and Statistics February 2013
North Media Center Highlights and Statistics February 2013Stoutl
 
Apresentação março de 2013 imprensa valida ar
Apresentação março de 2013 imprensa valida arApresentação março de 2013 imprensa valida ar
Apresentação março de 2013 imprensa valida arBdtm
 
Yogurt lacteosita (english)
Yogurt lacteosita (english)Yogurt lacteosita (english)
Yogurt lacteosita (english)patydiegod
 
Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014
Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014
Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014Heather Marsh
 
Web Application Architecture
Web Application ArchitectureWeb Application Architecture
Web Application ArchitectureAbhishek Chikane
 
‘ Final ‘ to avoid overriding
‘ Final ‘  to avoid overriding‘ Final ‘  to avoid overriding
‘ Final ‘ to avoid overridingmyrajendra
 
Embedded meets Agile
Embedded meets AgileEmbedded meets Agile
Embedded meets AgileRavneet Kaur
 
E lerote treinamento professores 2013
E lerote treinamento professores 2013E lerote treinamento professores 2013
E lerote treinamento professores 2013maryaguiar13
 
マルチタスクって奥が深い #mishimapm
マルチタスクって奥が深い #mishimapmマルチタスクって奥が深い #mishimapm
マルチタスクって奥が深い #mishimapm鉄次 尾形
 
Does the American Constitution guarantee freedom?
Does the American Constitution guarantee freedom? Does the American Constitution guarantee freedom?
Does the American Constitution guarantee freedom? amibeth26
 

Viewers also liked (20)

Web
WebWeb
Web
 
Putting Success in your Way:
Putting Success in your Way: Putting Success in your Way:
Putting Success in your Way:
 
Escuela normal experimental de el fuerte jose
Escuela normal experimental de el fuerte joseEscuela normal experimental de el fuerte jose
Escuela normal experimental de el fuerte jose
 
Aprendizaje por proyectos educativos
Aprendizaje por proyectos educativosAprendizaje por proyectos educativos
Aprendizaje por proyectos educativos
 
Prueba 1. marioramos
Prueba 1. marioramosPrueba 1. marioramos
Prueba 1. marioramos
 
The palladium updates 12-7-2014
The palladium updates   12-7-2014The palladium updates   12-7-2014
The palladium updates 12-7-2014
 
Bajaj (2)
Bajaj (2)Bajaj (2)
Bajaj (2)
 
работа с сервисом Wiki wall
работа с сервисом Wiki wallработа с сервисом Wiki wall
работа с сервисом Wiki wall
 
Unidad didáctica mujeres 8 marzo 2013
Unidad  didáctica mujeres 8 marzo 2013Unidad  didáctica mujeres 8 marzo 2013
Unidad didáctica mujeres 8 marzo 2013
 
North Media Center Highlights and Statistics February 2013
North Media Center Highlights and Statistics February 2013North Media Center Highlights and Statistics February 2013
North Media Center Highlights and Statistics February 2013
 
Apresentação março de 2013 imprensa valida ar
Apresentação março de 2013 imprensa valida arApresentação março de 2013 imprensa valida ar
Apresentação março de 2013 imprensa valida ar
 
Yogurt lacteosita (english)
Yogurt lacteosita (english)Yogurt lacteosita (english)
Yogurt lacteosita (english)
 
Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014
Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014
Multi-Channel Madness! Same Donors. Different Strategies: Bridge Conference 2014
 
Web Application Architecture
Web Application ArchitectureWeb Application Architecture
Web Application Architecture
 
‘ Final ‘ to avoid overriding
‘ Final ‘  to avoid overriding‘ Final ‘  to avoid overriding
‘ Final ‘ to avoid overriding
 
Embedded meets Agile
Embedded meets AgileEmbedded meets Agile
Embedded meets Agile
 
Cp 10
Cp 10Cp 10
Cp 10
 
E lerote treinamento professores 2013
E lerote treinamento professores 2013E lerote treinamento professores 2013
E lerote treinamento professores 2013
 
マルチタスクって奥が深い #mishimapm
マルチタスクって奥が深い #mishimapmマルチタスクって奥が深い #mishimapm
マルチタスクって奥が深い #mishimapm
 
Does the American Constitution guarantee freedom?
Does the American Constitution guarantee freedom? Does the American Constitution guarantee freedom?
Does the American Constitution guarantee freedom?
 

Similar to Gdce 2010 dx11

Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Alexander Dolbilov
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Johan Andersson
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio Owen Wu
 
Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019
Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019
Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019Unity Technologies
 
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)Keiichiro Ono
 
Unity AMD FSR - SIGGRAPH 2021.pptx
Unity AMD FSR - SIGGRAPH 2021.pptxUnity AMD FSR - SIGGRAPH 2021.pptx
Unity AMD FSR - SIGGRAPH 2021.pptxssuser2c3c67
 
Initial design (Game Architecture)
Initial design (Game Architecture)Initial design (Game Architecture)
Initial design (Game Architecture)Rajkumar Pawar
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookDatabricks
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
ECS: Streaming and Serialization - Unite LA
ECS: Streaming and Serialization - Unite LAECS: Streaming and Serialization - Unite LA
ECS: Streaming and Serialization - Unite LAUnity Technologies
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
RNN sharing at Trend Micro
RNN sharing at Trend MicroRNN sharing at Trend Micro
RNN sharing at Trend MicroChun Hao Wang
 
Introduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-finalIntroduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-finalM Malai
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerMydbops
 
Practical Deep Learning Using Tensor Flow - Sandeep Kath
Practical Deep Learning Using Tensor Flow - Sandeep KathPractical Deep Learning Using Tensor Flow - Sandeep Kath
Practical Deep Learning Using Tensor Flow - Sandeep KathSandeep Kath
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFishAnushree Prasanna Kumar
 

Similar to Gdce 2010 dx11 (20)

Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)Optimizing unity games (Google IO 2014)
Optimizing unity games (Google IO 2014)
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019
Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019
Refresh what you know about AssetDatabase.Refresh()- Unite Copenhagen 2019
 
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014)
 
Unity AMD FSR - SIGGRAPH 2021.pptx
Unity AMD FSR - SIGGRAPH 2021.pptxUnity AMD FSR - SIGGRAPH 2021.pptx
Unity AMD FSR - SIGGRAPH 2021.pptx
 
Initial design (Game Architecture)
Initial design (Game Architecture)Initial design (Game Architecture)
Initial design (Game Architecture)
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
Meetup talk
Meetup talkMeetup talk
Meetup talk
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
ECS: Streaming and Serialization - Unite LA
ECS: Streaming and Serialization - Unite LAECS: Streaming and Serialization - Unite LA
ECS: Streaming and Serialization - Unite LA
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
RNN sharing at Trend Micro
RNN sharing at Trend MicroRNN sharing at Trend Micro
RNN sharing at Trend Micro
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3
 
Introduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-finalIntroduction to-mongo db-execution-plan-optimizer-final
Introduction to-mongo db-execution-plan-optimizer-final
 
Introduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizerIntroduction to Mongodb execution plan and optimizer
Introduction to Mongodb execution plan and optimizer
 
Practical Deep Learning Using Tensor Flow - Sandeep Kath
Practical Deep Learning Using Tensor Flow - Sandeep KathPractical Deep Learning Using Tensor Flow - Sandeep Kath
Practical Deep Learning Using Tensor Flow - Sandeep Kath
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFish
 

More from mistercteam

Preliminary xsx die_fact_finding
Preliminary xsx die_fact_findingPreliminary xsx die_fact_finding
Preliminary xsx die_fact_findingmistercteam
 
20150207 howes-gpgpu8-dark secrets
20150207 howes-gpgpu8-dark secrets20150207 howes-gpgpu8-dark secrets
20150207 howes-gpgpu8-dark secretsmistercteam
 
S0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudaS0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudamistercteam
 
201210 howes-hsa and-the_modern_gpu
201210 howes-hsa and-the_modern_gpu201210 howes-hsa and-the_modern_gpu
201210 howes-hsa and-the_modern_gpumistercteam
 
3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)mistercteam
 
5 baker oxide (1)
5 baker oxide (1)5 baker oxide (1)
5 baker oxide (1)mistercteam
 
The technology behind_the_elemental_demo_16x9-1248544805
The technology behind_the_elemental_demo_16x9-1248544805The technology behind_the_elemental_demo_16x9-1248544805
The technology behind_the_elemental_demo_16x9-1248544805mistercteam
 
01 intro-bps-2011
01 intro-bps-201101 intro-bps-2011
01 intro-bps-2011mistercteam
 
Hpg2011 papers kazakov
Hpg2011 papers kazakovHpg2011 papers kazakov
Hpg2011 papers kazakovmistercteam
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloadedmistercteam
 
Mantle programming-guide-and-api-reference
Mantle programming-guide-and-api-referenceMantle programming-guide-and-api-reference
Mantle programming-guide-and-api-referencemistercteam
 
D3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performanceD3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performancemistercteam
 
D3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performanceD3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performancemistercteam
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-renderingmistercteam
 
Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12mistercteam
 

More from mistercteam (17)

Preliminary xsx die_fact_finding
Preliminary xsx die_fact_findingPreliminary xsx die_fact_finding
Preliminary xsx die_fact_finding
 
20150207 howes-gpgpu8-dark secrets
20150207 howes-gpgpu8-dark secrets20150207 howes-gpgpu8-dark secrets
20150207 howes-gpgpu8-dark secrets
 
S0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cudaS0333 gtc2012-gmac-programming-cuda
S0333 gtc2012-gmac-programming-cuda
 
201210 howes-hsa and-the_modern_gpu
201210 howes-hsa and-the_modern_gpu201210 howes-hsa and-the_modern_gpu
201210 howes-hsa and-the_modern_gpu
 
3 673 (1)
3 673 (1)3 673 (1)
3 673 (1)
 
3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)3 boyd direct3_d12 (1)
3 boyd direct3_d12 (1)
 
5 baker oxide (1)
5 baker oxide (1)5 baker oxide (1)
5 baker oxide (1)
 
The technology behind_the_elemental_demo_16x9-1248544805
The technology behind_the_elemental_demo_16x9-1248544805The technology behind_the_elemental_demo_16x9-1248544805
The technology behind_the_elemental_demo_16x9-1248544805
 
Lecture14
Lecture14Lecture14
Lecture14
 
01 intro-bps-2011
01 intro-bps-201101 intro-bps-2011
01 intro-bps-2011
 
Hpg2011 papers kazakov
Hpg2011 papers kazakovHpg2011 papers kazakov
Hpg2011 papers kazakov
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
 
Mantle programming-guide-and-api-reference
Mantle programming-guide-and-api-referenceMantle programming-guide-and-api-reference
Mantle programming-guide-and-api-reference
 
D3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performanceD3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performance
 
D3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performanceD3 d12 a-new-meaning-for-efficiency-and-performance
D3 d12 a-new-meaning-for-efficiency-and-performance
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 
Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12Getting the-best-out-of-d3 d12
Getting the-best-out-of-d3 d12
 

Recently uploaded

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams

Gdce 2010 dx11

  • 2. Advanced DirectX® 11 technology: DirectCompute by Example Jason Yang and Lee Howes August 18, 2010
  • 3. DirectX 11 Basics  New API from Microsoft® – Released alongside Windows® 7 – Runs on Windows Vista® as well  Supports downlevel hardware – DirectX9, DirectX10, DirectX11-class HW supported – Exposed features depend on GPU  Allows the use of the same API for multiple generations of GPUs – However Windows Vista/Windows 7 required  Lots of new features…
  • 4. What is DirectCompute?  DirectCompute brings GPGPU to DirectX  DirectCompute is both separate from and integrated with the DirectX graphics pipeline – Compute Shader – Compute features in Pixel Shader  Potential applications – Physics – AI – Image processing
  • 5. DirectCompute – part of DirectX 11  DirectX 11 helps efficiently combine Compute work with graphics – Sharing of buffers is trivial – Work graph is scheduled efficiently by the driver Input Assembler Vertex Shader Tesselation Geometry Shader Rasterizer Pixel Shader Compute Shader Graphics pipeline
  • 6. DirectCompute Features  Scattered writes  Atomic operations  Append/consume buffer  Shared memory (local data share)  Structured buffers  Double precision (if supported)
  • 7. DirectCompute by Example  Order Independent Transparency (OIT) – Atomic operations – Scattered writes – Append buffer feature  Bullet Cloth Simulation – Shared memory – Shared compute and graphics buffers
  • 9. Transparency Problem  Classic problem in computer graphics  Correct rendering of semi-transparent geometry requires sorting – blending is an order dependent operation  Sometimes sorting triangles is enough but not always – Difficult to sort: Multiple meshes interacting (many draw calls) – Impossible to sort: Intersecting triangles (must sort fragments) Try doing this in PowerPoint!
  • 10. Background  A-buffer – Carpenter '84 – CPU side linked list per-pixel for anti-aliasing  Fixed array per-pixel – F-buffer, stencil routed A-buffer, Z3 buffer, and k-buffer, Slice map, bucket depth peeling  Multi-pass – Depth peeling methods for transparency  Recent – Freepipe, PreCalc [DirectX11 SDK]
  • 11. OIT using Per-Pixel Linked Lists  Fast creation of linked lists of arbitrary size on the GPU using D3D11 – Computes correct transparency  Integration into the standard graphics pipeline – Demonstrates compute from rasterized data – DirectCompute features in Pixel Shader – Works with depth and stencil testing – Works with and without MSAA  Example of programmable blend
  • 12. Linked List Construction  Two Buffers – Head pointer buffer  addresses/offsets  Initialized to end-of-list (EOL) value (e.g., -1) – Node buffer  arbitrary payload data + “next pointer”  Each shader thread 1. Retrieve and increment global counter value 2. Atomic exchange into head pointer buffer 3. Add new entry into the node buffer at location from step 1
  • 13. Algorithm Overview 0. Render opaque scene objects 1. Render transparent scene objects 2. Screen quad resolves and composites fragment lists
  • 14. Step 0 – Render Opaque  Render all opaque geometry normally Render Target
  • 15. Algorithm Overview 0. Render opaque scene objects 1. Render transparent scene objects – All fragments are stored using per-pixel linked lists – Store fragment's: color, alpha, & depth 2. Screen quad resolves and composites fragment lists
  • 16. Setup  Two buffers – Screen sized head pointer buffer – Node buffer – large enough to handle all fragments  Render as usual  Disable render target writes  Insert render target data into linked list
  • 17. Step 1 – Create Linked List  [diagram: head pointer buffer and node buffer both initialized to -1; Counter = 0]
  • 18. Step 1 – Create Linked List  [diagram: a transparent fragment arrives over the render target; buffers unchanged, Counter = 0]
  • 19. Step 1 – Create Linked List  IncrementCounter()  [Counter = 1]
  • 20. Step 1 – Create Linked List  InterlockedExchange()  [the covered pixel's head pointer becomes 0; Counter = 1]
  • 21. Step 1 – Create Linked List  Scatter Write  [node 0 stores depth 0.87 with next = -1]
  • 22. Step 1 – Create Linked List  [Counter = 3: three fragments stored (depths 0.87, 0.89, 0.90); one fragment culled due to existing scene geometry depth]
  • 23. Step 1 – Create Linked List  [Counter = 5: two more fragments of depth 0.65 stored; node 3 links to node 0 where fragments overlap]
  • 24. Step 1 – Create Linked List  [Counter = 6: node 5 (depth 0.71) becomes the new head for its pixel and links to node 3]
  • 25. Node Buffer Counter  Counter allocated in GPU memory (i.e. a buffer) – Atomic updates – Contention issues  DirectX11 Append feature – Linear writes to a buffer – Implicit writes  Append() – Explicit writes  IncrementCounter()  Standard memory operations – Up to 60% faster than memory counters
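    To make the two counter styles concrete, here is a minimal HLSL sketch (not the demo's shaders; the node struct, buffer names and register slots are assumptions):

        struct FragmentNode { float4 color; float depth; int next; };

        // Explicit style: IncrementCounter() returns the allocated slot index.
        // The UAV must be created with D3D11_BUFFER_UAV_FLAG_COUNTER.
        RWStructuredBuffer<FragmentNode> g_nodeBuffer : register(u1);

        int AllocateNodeExplicit()
        {
            return (int)g_nodeBuffer.IncrementCounter();
        }

        // Implicit style: Append() writes linearly and returns no index.
        // The UAV must be created with D3D11_BUFFER_UAV_FLAG_APPEND.
        AppendStructuredBuffer<FragmentNode> g_appendNodes : register(u2);

        void AllocateNodeImplicit(FragmentNode node)
        {
            g_appendNodes.Append(node);
        }

    The linked-list build needs the returned index for the head-pointer exchange and for the next pointer, which is why the code example on the next slide uses IncrementCounter().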
  • 26. Code Example
        RWStructuredBuffer<int> RWStructuredCounter;   // used only for its hidden UAV counter
        RWTexture2D<int>    tRWFragmentListHead;
        RWTexture2D<float4> tRWFragmentColor;
        RWTexture2D<int2>   tRWFragmentDepthAndLink;

        [earlydepthstencil]
        void PS( PsInput input )
        {
            float4 vFragment = ComputeFragmentColor(input);
            int2 vScreenAddress = int2(input.vPositionSS.xy);

            // Get counter value and increment
            int nNewFragmentAddress = RWStructuredCounter.IncrementCounter();
            if ( nNewFragmentAddress == FRAGMENT_LIST_NULL ) { return; }

            // Update head buffer
            int nOldFragmentAddress;
            InterlockedExchange( tRWFragmentListHead[vScreenAddress], nNewFragmentAddress, nOldFragmentAddress );

            // Write the fragment attributes to the node buffer
            int2 vAddress = GetAddress( nNewFragmentAddress );
            tRWFragmentColor[vAddress] = vFragment;
            tRWFragmentDepthAndLink[vAddress] =
                int2( int(saturate(input.vPositionSS.z) * 0x7fffffff), nOldFragmentAddress );
            return;
        }
  • 27. Algorithm Overview 0. Render opaque scene objects 1. Render transparent scene objects 2. Screen quad resolves and composites fragment lists – Single pass – Pixel shader sorts associated linked list (e.g., insertion sort) – Composite fragments in sorted order with background – Output final fragment
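    A sketch of what such a resolve pixel shader could look like (illustrative only: the resource layout, the GetAddress packing, the 16-fragment limit and the blend are assumptions, and MSAA is ignored here):

        #define MAX_SORTED_FRAGMENTS 16
        #define NODE_BUFFER_WIDTH    1024   // assumed layout of the node textures

        Texture2D<int>    tFragmentListHead     : register(t0);  // per-pixel head pointers
        Texture2D<float4> tFragmentColor        : register(t1);  // node payload: color + alpha
        Texture2D<int2>   tFragmentDepthAndLink : register(t2);  // node payload: packed depth, next
        Texture2D<float4> tBackground           : register(t3);  // opaque render target

        int2 GetAddress(int node)
        {
            return int2(node % NODE_BUFFER_WIDTH, node / NODE_BUFFER_WIDTH);
        }

        float4 ResolvePS(float4 vPositionSS : SV_POSITION) : SV_Target
        {
            int2 pixel = int2(vPositionSS.xy);

            // 1. Walk the per-pixel list into a temporary array.
            int2 sorted[MAX_SORTED_FRAGMENTS];   // x = packed depth, y = node index
            int count = 0;
            int node = tFragmentListHead[pixel];
            [loop] while (node != -1 && count < MAX_SORTED_FRAGMENTS)
            {
                int2 depthAndLink = tFragmentDepthAndLink[GetAddress(node)];
                sorted[count++] = int2(depthAndLink.x, node);
                node = depthAndLink.y;
            }

            // 2. Insertion sort by depth, farthest first.
            for (int i = 1; i < count; ++i)
            {
                int2 key = sorted[i];
                int j = i - 1;
                while (j >= 0 && sorted[j].x < key.x) { sorted[j + 1] = sorted[j]; --j; }
                sorted[j + 1] = key;
            }

            // 3. Composite back to front over the opaque background.
            float4 result = tBackground[pixel];
            for (int k = 0; k < count; ++k)
            {
                float4 frag = tFragmentColor[GetAddress(sorted[k].y)];
                result.rgb = lerp(result.rgb, frag.rgb, frag.a);
            }
            return result;
        }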
  • 28. Step 2 – Render Fragments  (0,0)->(1,1): Fetch Head Pointer: -1  -1 indicates no fragment to render
  • 29. Step 2 – Render Fragments  (1,1): Fetch Head Pointer: 5  Fetch Node Data (5)  Walk the list and store in temp array: 0.71, 0.65, 0.87
  • 30. Step 2 – Render Fragments  (1,1): Sort temp array: 0.65, 0.71, 0.87  Blend colors and write out
  • 31. Step 2 – Render Fragments  [diagram: final composited render target]
  • 32. Anti-Aliasing  Store coverage information in the linked list  Resolve per-sample – Execute a shader at each sample location – Use MSAA hardware  Resolve per-pixel – Execute a shader at each pixel location – Average all sample contributions within the shader
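    One possible shape for the per-sample resolve (a sketch under assumptions: coverage is stored per node and the node layout matches the resolve sketch above; declaring SV_SampleIndex as an input is what forces per-sample execution):

        Texture2D<int>  tFragmentListHead        : register(t0);
        Texture2D<int2> tFragmentCoverageAndLink : register(t1);  // x = coverage mask, y = next

        float4 ResolvePerSamplePS(float4 vPositionSS : SV_POSITION,
                                  uint uSample : SV_SampleIndex) : SV_Target
        {
            int2 pixel = int2(vPositionSS.xy);
            float4 result = float4(0, 0, 0, 0);

            int node = tFragmentListHead[pixel];
            [loop] while (node != -1)
            {
                // 1024 = assumed node texture width
                int2 coverageAndLink = tFragmentCoverageAndLink[int2(node % 1024, node / 1024)];
                if ((coverageAndLink.x & (1 << uSample)) != 0)
                {
                    // This fragment covers the current sample: include it in the
                    // sort-and-blend (omitted here; identical to the per-pixel case).
                }
                node = coverageAndLink.y;
            }
            return result;
        }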
  • 33. Performance Comparison
                              Teapot     Dragon
        Linked List           743 fps    338 fps
        Precalc               285 fps    143 fps
        Depth Peeling         579 fps     45 fps
        Bucket Depth Peeling  ---        256 fps
        Dual Depth Peeling    ---         94 fps
        Based on internal numbers. Performance scaled to ATI Radeon HD 5770 graphics card.
  • 34. Mecha Demo  602K scene triangles – 254K transparent triangles
  • 37. Future Work  Memory allocation  Sort on insert  Other linked list applications – Indirect illumination – Motion blur – Shadows  More complex data structures
  • 39. DirectCompute for physics  DirectCompute in the Bullet physics SDK – An introduction to cloth simulation – Some tips for implementation in DirectCompute – A demonstration of the current state of development
  • 40. Cloth simulation  Large number of particles – Appropriate for parallel processing – Force from each spring constraint applied to both connected particles  [diagram: original layout vs. current layout; compute forces as stretch from rest length, compute new positions, apply position corrections to masses]
  • 41. Cloth simulation  (as above, highlighting the rest length of each spring)
  • 42.–44. Cloth simulation steps  For each simulation iteration: – Compute forces in each link based on its length – Correct positions of masses/vertices from forces – Compute new vertex positions  [diagram: the three steps highlighted in turn]
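    As a worked example of the per-link math above (a sketch only; explicit integration and these parameter names are assumptions, not Bullet's solver):

        // Spring force on particle i from one link (i, j) with a given rest length.
        float3 SpringForce(float3 xi, float3 xj, float restLength, float stiffness)
        {
            float3 d   = xj - xi;
            float  len = max(length(d), 1e-6f);
            // Proportional to the stretch from rest length, directed along the link.
            // The equal and opposite force is applied to particle j.
            return stiffness * (len - restLength) * (d / len);
        }

        // Advance one particle once all of its link forces have been accumulated.
        float3 IntegratePosition(float3 x, inout float3 v, float3 accumulatedForce,
                                 float invMass, float dt)
        {
            v += accumulatedForce * invMass * dt;
            return x + v * dt;
        }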
  • 45.–46. Springs and masses  Two or three main types of springs – Structural/shearing – Bending
  • 47.–52. CPU approach to simulation  One link at a time  Perform updates in place  “Gauss-Seidel” style  Conserves momentum  Iterate n times  [diagram: links relaxed one after another, converging toward the target shape]
  • 53.–60. Moving to the GPU: The pixel shader approach  Offers full parallelism  One vertex at a time  No scattered writes  [diagram: all vertices updated in parallel, converging toward the target shape]
  • 61. Downsides of the pixel shader approach  No propagation of updates – If we double buffer  Or non-deterministic – If we update in-place in a read/write array  Momentum preservation – Lacking due to single-ended link updates
  • 62. Can DirectCompute help?  Offers scattered writes as a feature as we saw earlier  The GPU implementation could be more like the CPU – Solver per-link rather than per-vertex – Leads to races between links that update the same vertex
  • 63.–67. Execute independent subsets in parallel  All links act at both ends  Batch links – No two links in a given batch share a vertex – No data races  [diagram: the independent links of one batch highlighted at a time]
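    A minimal sketch of the per-link solve inside one batch (a position correction applied to both ends of the link; the buffer layout, inverse masses and constant buffer are illustrative assumptions rather than the Bullet kernel):

        struct Link { int a; int b; float restLength; };

        StructuredBuffer<Link>     g_links           : register(t0);
        RWStructuredBuffer<float4> g_vertexPositions : register(u0); // xyz = position, w = inverse mass

        cbuffer SolverConstants : register(b0)
        {
            int g_firstLinkInBatch;  // links pre-sorted so each batch is contiguous
            int g_numLinksInBatch;
        };

        [numthreads(64, 1, 1)]
        void SolveLinksBatch(uint3 DTid : SV_DispatchThreadID)
        {
            if ((int)DTid.x >= g_numLinksInBatch)
                return;

            Link link = g_links[g_firstLinkInBatch + DTid.x];

            float4 pa = g_vertexPositions[link.a];
            float4 pb = g_vertexPositions[link.b];

            float3 d   = pb.xyz - pa.xyz;
            float  len = max(length(d), 1e-6f);

            // Distribute the correction by inverse mass; fixed vertices have w = 0.
            float  wSum       = pa.w + pb.w;
            float3 correction = (len - link.restLength) * (d / len) / max(wSum, 1e-6f);

            // Safe scattered writes: no other link in this batch touches these two vertices.
            g_vertexPositions[link.a] = float4(pa.xyz + correction * pa.w, pa.w);
            g_vertexPositions[link.b] = float4(pb.xyz - correction * pb.w, pb.w);
        }

    Each batch gets its own Dispatch(), as in the driving code on slide 74; no synchronization is needed inside a batch because its links never share a vertex.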
  • 68.–72. On a real cloth mesh we need many batches  Create independent subsets of links through graph coloring.  Synchronize between batches  [diagram: 1st, 2nd and 3rd batches highlighted in turn; 10 batches in total]
  • 73. Driving batches and synchronizing  [diagram: each simulation step runs iterations 0–2, and each iteration dispatches batches 0–4 in sequence]
  • 74. Driving batches and synchronizing  [diagram: iteration 0, batches 0–4]
        // Execute the kernel
        context->CSSetShader( solvePositionsFromLinksKernel.kernel, NULL, 0 );
        int numBlocks = (constBuffer.numLinks + (blockSize-1)) / blockSize;
        context->Dispatch( numBlocks, 1, 1 );
  • 75. Driving batches and synchronizing  [diagram: after dispatching batch 0 the GPU waits: “Twiddle fingers…”]
  • 76. Driving batches and synchronizing  [diagram: the wait repeats after the next batch]
  • 77. Driving batches and synchronizing  [diagram: and after every batch; remember, 10 batches!]
  • 78. Returning to our batching  10 batches: 10 compute shader dispatches  1/10 links per batch  Low compute density per thread 10 batches
  • 79. Packing for higher efficiency  Can create larger groups – The cloth is fixed-structure – Can be preprocessed  Fewer batches/dispatches  Less parallelism 4 batches
  • 80. Solving cloths together  Solve multiple cloths together in n batches  Grouping – Larger and reduced number of dispatches – Regain the parallelism that increased work-per-thread removed  [diagram: 4 batches]
  • 81. Shared memory  We've made use of scattered writes  The next feature of DirectCompute: shared memory – Load data at the start of a block – Compute over multiple links together – Write data out again
  • 82. Driving batches and synchronizing  [diagram: each iteration now dispatches two larger batches (batch 0 and batch 1), and each batch loops over its inner batches inside the shader]
  • 83.–85. How can we improve the batching?  So let's look at the batching we saw before:  There are 4 batches: – If we do this per group we need 3 groups rather than three DirectX "threads"  [diagram: group 1 of batch 1, group 2 of batch 1, then group 1 of batch 2 highlighted in turn]
  • 86. Solving in shared memory
        groupshared float4 positionSharedData[VERTS_PER_GROUP];

        [numthreads(GROUP_SIZE, 1, 1)]
        void SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )
        {
            for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex += GROUP_SIZE )
            {
                int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
                positionSharedData[vertex] = g_vertexPositions[vertexAddress];
            }

            ...  // Perform computation in shared buffer

            for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex += GROUP_SIZE )
            {
                int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];
                g_vertexPositions[vertexAddress] = positionSharedData[vertex];
            }
        }
  • 87. Solving in shared memory  Define a “groupshared” buffer for shared data storage
  • 88. Solving in shared memory  Data will be shared across a group of threads with these dimensions
  • 89. Solving in shared memory  Load data from global buffers into the shared region
  • 90. Solving in shared memory  Write back to the global buffer after computation
  • 91.–93. Group execution  The sequence of operations for the first batch is:  [diagram: load into shared memory, solve each inner batch in turn, then store]
  • 94. Group execution  The sequence of operations for the first batch is:  Few links so low packing efficiency: Not a problem with larger cloth
  • 95. Group execution  The sequence of operations for the first batch is:  [diagram: a Synchronize between each of these steps]
  • 96. Group execution  The sequence of operations for the first batch is:
        // load
        AllMemoryBarrierWithGroupSync();
        for( each subgroup )
        {
            // Process a subgroup
            AllMemoryBarrierWithGroupSync();
        }
        // Store
  • 97. Why is this an improvement?  So we still need 10*4 batches. What have we gained? – The batches within a group chunk are in-shader loops – Only 4 shader dispatches, each with significant overhead  The barriers will still hit performance – We are no longer dispatch bound, but we are likely to be on-chip synchronization bound
  • 98. Exploiting the SIMD architecture  Hardware executes 64- or 32-wide SIMD  Sequentially consistent at the SIMD level  Synchronization is now implicit – Take care – Execute over groups that are SIMD width or a divisor thereof
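    A sketch of the idea (assuming the group size equals the hardware wavefront width so the barriers between inner batches can be dropped; this relies on hardware behaviour rather than anything the DirectCompute specification guarantees):

        #define GROUP_SIZE      64    // assumed equal to the hardware wavefront width
        #define VERTS_PER_GROUP 256   // illustrative size

        groupshared float4 positionSharedData[VERTS_PER_GROUP];

        [numthreads(GROUP_SIZE, 1, 1)]
        void SolvePositionsSIMDKernel(uint3 GTid : SV_GroupThreadID)
        {
            // ... load this group's vertices into positionSharedData ...

            // No AllMemoryBarrierWithGroupSync() between inner batches:
            // all GROUP_SIZE threads advance in lockstep on one SIMD, so each
            // inner batch already sees the previous one's shared-memory writes.
            // for (each inner batch) { solve this thread's links in shared memory; }

            // ... write positionSharedData back to the global vertex buffer ...
        }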
  • 99. Group execution  The sequence of operations for the first batch is:  [diagram: each step now “Just works…” with no explicit synchronization]
  • 100. Driving batches and synchronizing  [diagram: batch 0 and batch 1 each loop over their inner batches, with a Synchronize only between the dispatched batches]
  • 101. Performance gains  For 90,000 links: – No solver running in 2.98 ms/frame – Fully batched link solver in 3.84 ms/frame – SIMD batched solver in 3.22 ms/frame – CPU solver in 16.24 ms/frame  3.5x improvement in the solver alone  (67x improvement over the CPU solver) Based on internal numbers
  • 102. One more thing…  Remember the tight pipeline integration?  How can we use this to our advantage?  [diagram: the graphics pipeline (Input Assembler, Vertex Shader, Tessellation, Geometry Shader, Rasterizer, Pixel Shader) with the Compute Shader alongside]
  • 103. Efficiently output vertex data  Cloth simulation updates vertex positions – Generated on the GPU – Need to be used on the GPU for rendering – Why not keep them there?  Large amount of data to update – Many vertices in fine simulation meshes – Normals and other information present
  • 104. Create a vertex buffer
        // Create a vertex buffer with unordered access support
        D3D11_BUFFER_DESC bd;
        bd.Usage = D3D11_USAGE_DEFAULT;
        bd.ByteWidth = vertexBufferSize * 32;
        bd.BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_UNORDERED_ACCESS;
        bd.CPUAccessFlags = 0;
        bd.MiscFlags = 0;
        bd.StructureByteStride = 32;
        hr = m_d3dDevice->CreateBuffer( &bd, NULL, &m_Buffer );

        // Create an unordered access view of the buffer to allow writing
        D3D11_UNORDERED_ACCESS_VIEW_DESC ud;
        ud.Format = DXGI_FORMAT_UNKNOWN;
        ud.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
        ud.Buffer.FirstElement = 0;
        ud.Buffer.NumElements = vertexBufferSize;
        ud.Buffer.Flags = 0;
        hr = m_d3dDevice->CreateUnorderedAccessView( m_Buffer, &ud, &m_UAV );
        It's a vertex buffer. It's also bound for unordered access. Scattered writes!
  • 105. Performance gains  For 90,000 links with copy on GPU: – No solver running in 0.58 ms/frame – Fully batched link solver in 0.82 ms/frame – SIMD batched solver in 0.617 ms/frame  6.5x improvement in the solver  6.5x improvement from removing the CPU copy alone  23x improvement over the simpler solver with host copy Based on internal numbers
  • 106. Thanks  Justin Hensley  Holger Grün  Nicholas Thibieroz  Erwin Coumans
  • 107. References  Yang J., Hensley J., Grün H., Thibieroz N.: Real-Time Concurrent Linked List Construction on the GPU. In Rendering Techniques 2010: Eurographics Symposium on Rendering (2010), vol. 29, Eurographics.  Grün H., Thibieroz N.: OIT and Indirect Illumination using DirectX11 Linked Lists. In Proceedings of Game Developers Conference 2010 (Mar. 2010). http://developer.amd.com/gpu_assets/OIT%20and%20Indirect%20Illumination%20using%20DirectX11%20Linked%20Lists_forweb.ppsx  http://developer.amd.com/samples/demos/pages/ATIRadeonHD5800SeriesRealTimeDemos.aspx  http://bulletphysics.org
  • 108. Trademark Attribution AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Microsoft, Windows, Windows Vista, Windows 7 and DirectX are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2010 Advanced Micro Devices, Inc. All rights reserved.