Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games
1.
2. Windows to Reality:
Getting the Most out of
Direct3D 10 Graphics in
Your Games
Shanon Drone
Software Development Engineer
XNA Developer Connection
Microsoft
3. Key areas
Debug Layer
Draw Calls
Constant Updates
State Management
Shader Linkage
Resource Updates
Dynamic Geometry
Porting Tips
4. Debug Layer
Use it!
The D3D10 layer can help find performance
issues
Enabled by the app by passing the
D3D10_CREATE_DEVICE_DEBUG flag into
D3D10CreateDevice
Use the D3DX10 Debug Runtime
Link against D3DX10d.lib
Only do this for debug builds!
Look for performance warnings in the debug
output
5. Draw Calls
Draw calls are still “not free”
Draw overhead is reduced in D3D10
But not enough that you can be lazy
Efficiency in the number of draw calls will
still give a performance win
6. Draw Calls
Excess baggage
An increase in the number of draw calls
generally increases the number of API
calls associated with those draws
ConstantBuffer updates
Resource changes (VBs, IBs, Textures)
InputLayout changes
These all have effects on performance
that vary with draw call count
7. Constant Updates
Updating shader constants was often a
bottleneck in D3D9
It can still be a bottleneck in D3D10
The main difference between the two is
the new Constant Buffer object in D3D10
This is the largest section of this talk
8. Constant Updates
Constant Buffer Recap
Constant Buffers are buffer objects that
hold shader constant data
They are updated by calling Map with
D3D10_MAP_WRITE_DISCARD or by calling
UpdateSubresource
There are 16 Constant Buffer slots
available to each shader in the pipeline
Try not to use all 16 to leave some headroom
9. Constant Updates
Porting Issues
D3D9 constants were updated individually
by calling SetXXXXXShaderConstantX
In D3D10, you have to update the entire
constant buffer all at once
A naïve port from D3D9 to D3D10 can have
crippling performance implications if
Constant Buffers are not handled
correctly!
Rule of thumb: Do not update more data
than you need to
10. Constant Updates
Naïve Port: AKA how to cripple perf
Each shader uses one big constant buffer
Submitting one value submits them all!
If you have one 4096 byte Constant
Buffer, and you only need to update your
World matrix, you will still have to update
4096 bytes of data and send it across the
bus
Don’t do this!
12. Constant Updates
Organize Constants
The first step is to organize constants by
frequency of update
One shader will generally be used to draw
several objects
Some data in this shader doesn’t need to
be set for every draw
For example: Time, ViewProj matrices
Split these out into their own buffers
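As a rough illustration of splitting constants by update frequency, the plain C++ sketch below compares how many bytes get resubmitted per frame. The struct names, their fields, and the two `bytesUploaded...` helpers are assumptions made for this example; they are not from the talk or the D3D10 API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layouts, grouped by how often the data changes.
struct PerFrameConstants {   // written once per frame
    float viewProj[16];
    float time;
    float padding[3];        // pads the struct to a multiple of 16 bytes
};

struct PerObjectConstants {  // written once per draw
    float world[16];
};

static_assert(sizeof(PerFrameConstants) % 16 == 0, "CB sizes are multiples of 16 bytes");
static_assert(sizeof(PerObjectConstants) % 16 == 0, "CB sizes are multiples of 16 bytes");

// Naive port: one big buffer, fully resubmitted for every draw.
inline std::size_t bytesUploadedNaive(std::size_t draws, std::size_t bigBufferBytes) {
    assert(bigBufferBytes % 16 == 0);
    return draws * bigBufferBytes;
}

// Split buffers: per-frame data once, per-object data once per draw.
inline std::size_t bytesUploadedSplit(std::size_t draws) {
    return sizeof(PerFrameConstants) + draws * sizeof(PerObjectConstants);
}
```

With 100 draws and a single 4096-byte buffer, the naive layout resubmits 409,600 bytes per frame, while this split layout resubmits 6,480 bytes (assuming 4-byte floats).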
15. Constant Updates
Managing Buffers
Constant buffers need to be managed in
the application
Creating a few buffers that are used for
all shader constants just won’t work
We update more data than necessary due to
large buffers
16. Constant Updates
Managing Buffers
Solution 1 (Fastest)
Create Constant Buffers that line up exactly
with the number of elements of each
frequency group
Global CBs
CBs per Mesh
CBs per Material
CBs per Pass
This ensures that EVERY constant buffer is no
larger than it absolutely needs to be
This also ensures the most efficient update of
CBs based upon frequency
17. Constant Updates
Managing Buffers
Solution 2 (Second Best)
If you cannot create CBs that line up exactly
with elements, you can create a tiered constant
buffer system
Create arrays of 32-byte, 64-byte, 128-byte,
256-byte, etc. constant buffers
Keep a shadow copy of the constant data in
system memory
When it comes time to render, select the
smallest CB from the array that will hold the
necessary constant data
May have to resubmit redundant data for
separate passes
Hybrid approach?
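The tier selection step can be sketched as below. The tier sizes double from 32 bytes as in the slide; the 4096-byte upper bound and the name `selectTier` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tier bounds in bytes. Tiers double: 32, 64, 128, 256, ...
const std::size_t kSmallestTier = 32;
const std::size_t kLargestTier = 4096;  // assumed cap, not from the talk

// Returns the size of the smallest tier that can hold `bytesNeeded`,
// or 0 if the data does not fit in any tier.
inline std::size_t selectTier(std::size_t bytesNeeded) {
    assert(bytesNeeded > 0);
    for (std::size_t tier = kSmallestTier; tier <= kLargestTier; tier *= 2) {
        if (bytesNeeded <= tier) {
            return tier;
        }
    }
    return 0;
}
```

At render time the engine would copy the shadow data into a buffer of the returned size, so a 33-byte update costs a 64-byte upload rather than the size of the largest buffer.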
18. Constant Updates
Case Study: Skinning using Solution 1
Skinning in D3D9 (or a bad D3D10 port)
Multiple passes cause redundant bone data
uploads to the GPU
Skinning in D3D10
Using Constant Buffers we only need to
upload it once
19. Constant Updates
D3D9 Version / or Naïve D3D10 Version
Pass1
Set Mesh1 Bones
Draw Mesh1
Set Mesh2 Bones
Draw Mesh2
Pass2
Set Mesh1 Bones
Draw Mesh1
Set Mesh2 Bones
Draw Mesh2
[Diagram: Constant Data holding Bone0 … BoneN, refilled with Mesh1 or Mesh2 bones by each Set call]
21. Constant Updates
Advanced D3D10 Version
Why not store all of our characters’ bones in
a 128-bit FP texture?
We can upload bones for all visible
characters at the start of a frame
We can draw similar characters using
instancing instead of individual draws
Use SV_InstanceID to select the start of the
character’s bone data in the texture
Stream the skinned meshes to memory using
Stream Output and render all subsequent
passes from the post-skinned buffer
22. State Management
Individual state setting is no longer
possible in D3D10
State in D3D10 is stored in state objects
These state objects are immutable
To change even one aspect of a state
object requires that you create an
entirely new state object with that one
change
23. State Management
Managing State Objects
Solution 1 (Fastest)
If you have a known set of materials and
required states, you can create all state
objects at load time
State objects are small and there is a finite
set of permutations
With all state objects created at load time, all
that needs to be done during rendering is to
bind the object
24. State Management
Managing State Objects
Solution 2 (Second Best)
If your content is not finalized, or if you
CANNOT get your engine to lump state
together
Create a state object hash table
Hash off of the setting that has the most
unique states
Grab pre-created states from the hash-table
Why not give your tools pipeline the ability to
do this for a level and save out the results?
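A state object hash table might look roughly like the sketch below. `RasterDesc` is a hypothetical, trimmed-down stand-in for a real state description, and `StateHandle` stands in for a pointer to an immutable state object; neither is a D3D10 type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical, simplified state description used as the cache key.
struct RasterDesc {
    std::uint8_t fillMode;
    std::uint8_t cullMode;
    bool depthClip;
    bool operator==(const RasterDesc& o) const {
        return fillMode == o.fillMode && cullMode == o.cullMode && depthClip == o.depthClip;
    }
};

struct RasterDescHash {
    std::size_t operator()(const RasterDesc& d) const {
        // Pack the fields into one integer; adequate for such a small key.
        std::uint32_t packed = (std::uint32_t(d.fillMode) << 16) |
                               (std::uint32_t(d.cullMode) << 8) |
                               std::uint32_t(d.depthClip ? 1 : 0);
        return std::hash<std::uint32_t>()(packed);
    }
};

// Keeps one state object per unique description.
class StateCache {
public:
    typedef int StateHandle;  // placeholder for a real state object pointer

    StateHandle get(const RasterDesc& desc) {
        auto found = cache_.find(desc);
        if (found != cache_.end()) {
            return found->second;            // reuse the pre-created state
        }
        StateHandle handle = creations_++;   // placeholder for real creation
        bool inserted = cache_.emplace(desc, handle).second;
        assert(inserted);
        (void)inserted;
        return handle;
    }

    int creations() const { return creations_; }

private:
    std::unordered_map<RasterDesc, StateHandle, RasterDescHash> cache_;
    int creations_ = 0;
};
```

A tools pipeline could run a level through such a cache and save the set of unique descriptions, so the runtime can create them all at load time.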
25. Shader Linkage
D3D9 shader linkage was based off of
semantics (POSITION, NORMAL,
TEXCOORDN)
D3D10 linkage is based off of offsets and
sizes
This means stricter linkage rules
This also means that the driver doesn’t
have to link shaders together at every
draw call!
26. Shader Linkage
No Holes Allowed!
Elements must be read in the order they
are output from the previous stage
Cannot have “holes” between linkages
struct VS_OUTPUT
{
    float3 Norm : NORMAL;
    float2 Tex : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
    float4 Pos : SV_POSITION;
};
struct PS_INPUT
{
    float3 Norm : NORMAL;
    float2 Tex : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
};
Holes at the end are OK
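One way to catch linkage mistakes early is a tool-side check along the lines of the sketch below. It is a simplified model of the rule on this slide (same order, no gaps, trailing elements may be left out), not the runtime's actual validation; the function name and the use of semantic strings as element identifiers are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if the consuming stage reads elements in exactly the order
// the producing stage wrote them, with nothing skipped in between.
// Leaving out trailing elements ("holes at the end") is allowed.
inline bool linksWithoutHoles(const std::vector<std::string>& producerOutputs,
                              const std::vector<std::string>& consumerInputs) {
    assert(!producerOutputs.empty());  // a producing stage writes at least one element
    if (consumerInputs.size() > producerOutputs.size()) {
        return false;
    }
    for (std::size_t i = 0; i < consumerInputs.size(); ++i) {
        if (consumerInputs[i] != producerOutputs[i]) {
            return false;
        }
    }
    return true;
}
```

For a vertex shader that outputs NORMAL, TEXCOORD0, TEXCOORD1, SV_POSITION, a pixel shader reading NORMAL, TEXCOORD0, TEXCOORD1 passes, while one reading TEXCOORD0 before NORMAL, or skipping TEXCOORD0, fails.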
27. Shader Linkage
Input Assembler to Vertex Shader
Input Layouts define the signature of the
vertex stream data
Input Layouts are similar to Vertex
Declarations in D3D9
Strict linkage rules are a big difference
Creating Input Layouts on the fly is not
recommended
CreateInputLayout requires a shader
signature to validate against
28. Shader Linkage
Input Assembler to Vertex Shader
Solution 1 (Fastest)
Create an Input Layout for each unique
Vertex Stream / Vertex Shader combination
up front
Input Layouts are small
This assumes that the shader input signature
is available when you call CreateInputLayout
Try to normalize Input Layouts across levels,
or have them be art directed
29. Shader Linkage
Input Assembler to Vertex Shader
Solution 2 (Second Best)
If you load meshes and create input layouts
before loading shaders, you might have a
problem
You can use a similar hashing scheme as the
one used for State Objects
When the Input Layout is needed, search the
hash for an Input Layout that matches the
Vertex Stream and Vertex Shader signature
Why not store this data to a file and pre-
populate the Input Layouts after your content
is tuned?
30. Shader Linkage
Aside: Instancing
Instancing is a first class citizen on D3D10!
Stream source frequency is now part of
the Input Layout
Multiple frequencies will mean multiple
Input Layouts
31. Resource Updates
Updating resources is different in D3D10
Create / Lock / Fill / Unlock paradigm is
no longer necessary (although you can
still do it)
Texture data can be passed into the
texture at create time
33. Resource Updates
D3D10_USAGE_DEFAULT
Use for resources that need fast GPU read
and write access
Can only be updated using
UpdateSubresource
Render targets are good candidates
Textures that are updated infrequently
(less than once per frame) are good
candidates
34. Resource Updates
D3D10_USAGE_IMMUTABLE
Use for resources that need fast GPU read
access only
Once they are created, they cannot be
updated... ever
Initial data must be passed in during the
creation call
Resources that will never change (static
textures, VBs / IBs) are good candidates
Don’t bend over backwards trying to make
everything D3D10_USAGE_IMMUTABLE
35. Resource Updates
D3D10_USAGE_DYNAMIC
Use for resources that need fast CPU write
access (at the expense of slower GPU read
access)
No CPU read access
Can only be updated using Map with:
D3D10_MAP_WRITE_DISCARD
D3D10_MAP_WRITE_NO_OVERWRITE
Dynamic Vertex Buffers are good candidates
Dynamic (> once per frame) textures are
good candidates
36. Resource Updates
D3D10_USAGE_STAGING
This is the only way to read data back
from the GPU
Can only be updated using Map
Cannot map with
D3D10_MAP_WRITE_DISCARD or
D3D10_MAP_WRITE_NO_OVERWRITE
Might want to double buffer to keep from
stalling GPU
The GPU cannot directly use these
37. Resource Updates
Summary
CPU updates the resource frequently
(more than once per frame)
Use D3D10_USAGE_DYNAMIC
CPU updates the resource infrequently
(once per frame or less)
Use D3D10_USAGE_DEFAULT
CPU doesn’t update the resource
Use D3D10_USAGE_IMMUTABLE
CPU needs to read the resource
Use D3D10_USAGE_STAGING
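The decision table above can be written down as a small helper. The `Usage` enum is a hypothetical mirror of the four D3D10 usage values, and `chooseUsage` with its parameters is an assumption made for this sketch.

```cpp
#include <cassert>

// Hypothetical mirror of the four usage classes in the summary.
enum class Usage { Default, Immutable, Dynamic, Staging };

// cpuReads:          the CPU needs to read the resource back.
// cpuWritesPerFrame: average CPU updates per frame
//                    (0 = never, values up to 1 = once per frame or less).
inline Usage chooseUsage(bool cpuReads, double cpuWritesPerFrame) {
    assert(cpuWritesPerFrame >= 0.0);
    if (cpuReads) {
        return Usage::Staging;      // only way to read data back
    }
    if (cpuWritesPerFrame == 0.0) {
        return Usage::Immutable;    // never updated after creation
    }
    if (cpuWritesPerFrame > 1.0) {
        return Usage::Dynamic;      // frequent CPU writes
    }
    return Usage::Default;          // infrequent CPU writes
}
```

For example, a static level mesh maps to `Usage::Immutable`, and a particle vertex buffer rewritten several times per frame maps to `Usage::Dynamic`.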
38. Resource Updates
Example: Vertex Buffer
The vertex buffer is touched by the CPU
less than once per frame
Create it with D3D10_USAGE_DEFAULT
Update it with UpdateSubresource
The vertex buffer is used for dynamic
geometry and the CPU needs to update it
multiple times per frame
Create it with D3D10_USAGE_DYNAMIC
Update it with Map
39. Resource Updates
The Exception: Constant Buffers
CBs are always expected to be updated
frequently
Select CB usage based upon which one
causes the least amount of system
memory to be transferred
Not just to the GPU, but system-to-system
memory copies as well
41. Resource Updates
Map
Map requires no extra system memory but
may hit driver renaming limits if abused
Use if compositing values on the fly or
collecting values from other places
42. Resource Updates
A note on overusing discard
Use D3D10_MAP_WRITE_DISCARD carefully
with buffers!
D3D10_MAP_WRITE_DISCARD tells the driver to
give us a new memory buffer if the current
one is busy
There is a LIMITED set of temporary buffers
If these run out, then your app will stall until
another buffer can be freed
This can happen if you do dynamic geometry
using one VB and D3D10_MAP_WRITE_DISCARD
44. Dynamic Geometry
Solution: Same as in D3D9
Use one large buffer, and map it with
D3D10_MAP_WRITE_NO_OVERWRITE
Advance the write position with every draw
Wrap to the beginning
Make sure your buffer is large enough that
you’re not overwriting data that the GPU is
reading
This is what happens under the covers for
D3D9 when using DIPUP or DUP in Windows
Vista
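The write-position bookkeeping for one large buffer could be sketched as follows. `DynamicRing` is a hypothetical helper that only tracks offsets; the actual Map call with D3D10_MAP_WRITE_NO_OVERWRITE and the copy into the buffer are left out.

```cpp
#include <cassert>
#include <cstddef>

// Tracks where the next batch of dynamic vertex data should be written
// inside one large buffer.
class DynamicRing {
public:
    explicit DynamicRing(std::size_t capacityBytes) : capacity_(capacityBytes) {}

    // Returns the byte offset at which `bytes` of new data should be written
    // and advances the write position. Wraps to the beginning when the
    // request does not fit in the remaining space.
    std::size_t allocate(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (head_ + bytes > capacity_) {
            head_ = 0;   // wrap to the beginning
            ++wraps_;
        }
        std::size_t offset = head_;
        head_ += bytes;
        return offset;
    }

    std::size_t wraps() const { return wraps_; }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
    std::size_t wraps_ = 0;
};
```

The sketch does nothing to stop a wrap from landing on data the GPU is still reading, which is why the buffer has to be sized generously. Counting wraps per frame is one cheap way to judge that: more than about one wrap per frame suggests the buffer is too small.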
45. Porting Tips
StretchRect is Gone
Work around using render-to-texture
A8R8G8B8 formats have been replaced with
R8G8B8A8 formats
Swizzle on texture load or swizzle in the
shader
Fixed Function AlphaTest is Gone
Add logic to the shader and call discard
Fixed Function Fog is Gone
Add it to the shader
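For the swizzle-on-load option, a minimal sketch is shown below. It assumes pixels are read as little-endian 32-bit values, so an A8R8G8B8 pixel is 0xAARRGGBB and the matching R8G8B8A8 pixel is 0xAABBGGRR; the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts one pixel from 0xAARRGGBB to 0xAABBGGRR by swapping the red and
// blue channels. Alpha and green stay where they are.
inline std::uint32_t argbToRgba(std::uint32_t p) {
    return (p & 0xFF00FF00u) |
           ((p & 0x00FF0000u) >> 16) |
           ((p & 0x000000FFu) << 16);
}

// Converts a whole image in place at load time.
inline void argbToRgbaInPlace(std::uint32_t* pixels, std::size_t count) {
    assert(pixels != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        pixels[i] = argbToRgba(pixels[i]);
    }
}
```

Doing this once at load (or in the content pipeline) avoids paying for a swizzle in every shader that samples the texture.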
46. Porting Tips
Continued
User Clip Planes usage has changed
They’ve moved to the shader
Experiment with the SV_ClipDistance semantic vs
discard in the PS to determine which is faster for
your shader
Query data sizes might have changed
Occlusion queries are UINT64 vs DWORD
No Triangle Fan Support
Work around in content pipeline or on load
SetCursorProperties, ShowCursor are gone
Use Win32 APIs to handle cursors now
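For the missing triangle fans, one possible load-time workaround is to build a triangle-list index buffer over the same vertices. The helper name `fanToList` and the choice of 16-bit indices are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Expands a triangle fan over `vertexCount` vertices into triangle-list
// indices: (0, i, i + 1) for i = 1 .. vertexCount - 2.
// Returns an empty list if there are fewer than 3 vertices.
inline std::vector<std::uint16_t> fanToList(std::uint16_t vertexCount) {
    std::vector<std::uint16_t> indices;
    if (vertexCount < 3) {
        return indices;
    }
    indices.reserve(3u * (vertexCount - 2u));
    for (std::uint16_t i = 1; i + 1 < vertexCount; ++i) {
        indices.push_back(0);
        indices.push_back(i);
        indices.push_back(static_cast<std::uint16_t>(i + 1));
    }
    assert(indices.size() == 3u * (vertexCount - 2u));
    return indices;
}
```

A five-vertex fan becomes the nine indices 0,1,2, 0,2,3, 0,3,4, and the vertex data itself does not need to change.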
47. Porting Tips
Continued
No offsets on Map calls
This was basically API clutter in D3D9
Calculate the offset from the returned pointer
Clears are no longer bound to pipeline state
If you want a clear call to respect scissor,
stencil, or other state, draw a full-screen quad
This is closer to the HW
The Driver/HW has been doing this for you for years
OMSetBlendState
Never set the SampleMask to 0 in
OMSetBlendState
48. Porting Tips
Continued
Input Layout conversions tightened up
D3DDECLTYPE_UBYTE4 in the vertex stream
could be converted to a float4 in the VS in D3D9
i.e., 255u in the stream would show up as 255.0 in
the VS
In D3D10 you either get a normalized [0..1] value
or 255 (u)int
Register keyword
It doesn’t mean the same thing in D3D10
Use register to determine which CB slot a CB
binds to
Use packoffset to place a variable inside a CB
49. Porting Tips
Continued
Sampler and Texture bindings
Samplers can be bound independently of textures
This is very flexible!
Sampler and Texture slots are not always the
same
Register Packing
In D3D9 all variables took up at least one float4
register (even if you only used a single float!)
In D3D10 variables are packed together
This saves a lot of space
Make sure your engine doesn’t do everything
based upon register offsets or your variables
might alias
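To see why D3D9-style register offsets stop matching, the sketch below computes packed byte offsets for a run of float vectors. It models only the basic rule that variables are packed together but may not straddle a 16-byte register boundary; arrays, matrices, and structs are not handled, and `packOffsets` is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes the byte offset of each variable in a constant buffer, given the
// number of float components (1 to 4) of each variable in declaration order.
inline std::vector<std::size_t> packOffsets(const std::vector<int>& componentCounts) {
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (int n : componentCounts) {
        assert(n >= 1 && n <= 4);
        std::size_t bytes = 4u * static_cast<std::size_t>(n);
        std::size_t usedInRegister = cursor % 16;
        if (usedInRegister + bytes > 16) {
            cursor += 16 - usedInRegister;  // does not fit: start a new register
        }
        offsets.push_back(cursor);
        cursor += bytes;
    }
    return offsets;
}
```

A float followed by a float3 lands at byte offsets 0 and 4 and shares one register, whereas an engine that assumes one float4 register per variable would expect offsets 0 and 16.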
50. Porting Tips
Continued
D3DSAMP_SRGBTEXTURE
This sampler state setting does not exist on
D3D10
Instead it’s included in the texture format
This is more like the Xbox 360
Consider re-optimizing resource usage and
upload for better D3D10 performance
But use D3D10_USAGE_DEFAULT resources
and UpdateSubresource as a baseline
51. Summary
Use the debug runtime!
More draw calls usually means more constant
updating and state changing calls
Be frugal with constant updates
Avoid resubmitting redundant data!
Create as much state and input layout
information up front as possible
Select D3D10_USAGE for resources based
upon the CPU access patterns needed
Use D3D10_MAP_WRITE_NO_OVERWRITE and a big
buffer as a replacement for DIPUP and DUP
52. Call to Action
Actually exploit D3D10!
This talk tells you how to get performance
gains from a straight port
You can get a whole lot more by using
D3D10’s advanced features!
StreamOut to minimize skinning costs
First class instancing support
Store some vertex data in textures
Move some systems to the GPU (Particles?)
Aggressive use of Constant Buffers