Shader Model 5.0 and
Compute Shader


Nick Thibieroz, AMD
DX11 Basics
» New API from Microsoft
» Will be released alongside Windows 7
  »   Runs on Vista as well
» Supports downlevel hardware
  »   DX9, DX10, DX11-class HW supported
  »   Exposed features depend on GPU
» Allows the use of the same API for
  multiple generations of GPUs
  »   However Vista/Windows7 required
» Lots of new features…
Shader Model 5.0
SM5.0 Basics
» All shader types support Shader Model 5.0
  »   Vertex Shader
  »   Hull Shader
  »   Domain Shader
  »   Geometry Shader
  »   Pixel Shader
» Some instructions/declarations/system
  values are shader-specific
» Pull Model
» Shader subroutines
Uniform Indexing
» Can now index resource inputs
  »   Buffer and Texture resources
  »   Constant buffers
  »   Texture samplers
» Indexing occurs on the slot number
  »   E.g. Indexing of multiple texture arrays
  »   E.g. indexing across constant buffer slots
» Index must be a constant expression
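Texture2D txDiffuse[2] : register(t0);    // occupies slots t0 and t1
// Texture2D txDiffuse1 : register(t1);   // alternative view of slot t1
static const uint Indices[4] = { 4, 3, 2, 1 };
float4 PS(PS_INPUT i) : SV_Target
{
  // Indices[3] == 1, so this samples the texture bound to slot t1
  float4 color=txDiffuse[Indices[3]].Sample(sam, i.Tex);
  // Equivalent to: float4 color=txDiffuse1.Sample(sam, i.Tex);
  return color;
}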
SV_Coverage
» System value available to PS stage only
» Bit field indicating the samples covered by
  the current primitive
  »   E.g. a value of 0x09 (1001b) indicates that
      samples 0 and 3 are covered by the primitive


» Easy way to detect MSAA edges for per-
  pixel/per-sample processing optimizations
  »   E.g. for MSAA 4x:
  »   bIsEdge=(uCovMask!=0x0F && uCovMask!=0);
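The edge test above can be generalized to other sample counts. The following is a minimal CPU-side sketch in C++, not shader code; the function name `IsMsaaEdge` and its parameters are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when a pixel is only partially covered by the primitive.
// coverageMask: one bit per MSAA sample (bit n set = sample n covered).
// sampleCount:  number of MSAA samples per pixel (1..32).
bool IsMsaaEdge(std::uint32_t coverageMask, unsigned sampleCount)
{
    assert(sampleCount >= 1 && sampleCount <= 32);
    // Mask with the low sampleCount bits set, e.g. 0x0F for 4x MSAA.
    const std::uint32_t fullMask =
        (sampleCount == 32) ? 0xFFFFFFFFu : ((1u << sampleCount) - 1u);
    const std::uint32_t covered = coverageMask & fullMask;
    // Neither empty nor fully covered means the primitive edge crosses the pixel.
    return covered != 0u && covered != fullMask;
}
```

With 4x MSAA, a mask of 0x09 is reported as an edge, while 0x0F and 0 are not.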
Double Precision
» Double precision optionally supported
    »   IEEE 754 format with full precision (0.5 ULP)
    »   Mostly used for applications requiring a high
        amount of precision
    »   Denormalized values support
» Slower performance than single precision!
» Check for support:
D3D11_FEATURE_DATA_DOUBLES fdDoubleSupport;
pDev->CheckFeatureSupport( D3D11_FEATURE_DOUBLES,
                           &fdDoubleSupport,
                           sizeof(fdDoubleSupport) );
if (fdDoubleSupport.DoublePrecisionFloatShaderOps)
{
    // Double precision floating-point supported!
}
Gather()
» Fetches 4 point-sampled values in a single
  texture instruction
» Allows reduction of texture processing
  »   Better/faster shadow kernels
  »   Optimized SSAO implementations
» SM 5.0 Gather() more flexible
  »   Channel selection now supported
  »   Offset support (-32..31 range) for Texture2D
  »   Depth compare version e.g. for shadow mapping
      »   Gather[Cmp]Red()
      »   Gather[Cmp]Green()
      »   Gather[Cmp]Blue()
      »   Gather[Cmp]Alpha()
[Figure: 2x2 texel footprint returned by Gather(); top row W Z, bottom row X Y]
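To make the component ordering of the 2x2 footprint concrete, here is a small CPU-side sketch in C++ that mimics a single-channel gather. The `Image` type and the `GatherRed` helper are hypothetical stand-ins, and the sketch addresses texels by integer coordinates rather than normalized texture coordinates.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical single-channel image stored row by row.
struct Image {
    int width;
    int height;
    std::vector<float> texels;
    float At(int x, int y) const { return texels[y * width + x]; }
};

// Returns the 2x2 footprint whose top-left texel is (x, y), ordered as in the
// slide's diagram: [0]=X bottom-left, [1]=Y bottom-right,
//                  [2]=Z top-right,   [3]=W top-left.
std::array<float, 4> GatherRed(const Image& img, int x, int y)
{
    assert(x >= 0 && y >= 0 && x + 1 < img.width && y + 1 < img.height);
    return { img.At(x, y + 1), img.At(x + 1, y + 1),
             img.At(x + 1, y), img.At(x, y) };
}
```

A shadow or SSAO kernel would call this once per 2x2 block instead of issuing four separate point-sampled fetches.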
Coarse Partial Derivatives
» ddx()/ddy() supplemented by coarse
  version
  »   ddx_coarse()
  »   ddy_coarse()
» Return same derivatives for whole 2x2 quad
  »   Actual derivatives used are IHV-specific
» Faster than “fine” version
  »   Trading quality for performance

[Figure: ddx_coarse() returns the same value for all four pixels of a 2x2 quad]
Same principle applies to ddy_coarse()
Other Instructions
» FP32 to/from FP16 conversion
  »   uint f32tof16(float value);
  »   float f16tof32(uint value);
  »   fp16 stored in low 16 bits of uint
» Bit manipulation
  »   Returns the first occurrence of a set bit
      »   int firstbithigh(int value);
      »   int firstbitlow(int value);
  »   Reverse bit ordering
      »   uint reversebits(uint value);
  »   Useful for packing/compression code
  »   And more…
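As an illustration of how these instructions help packing code, the sketch below gives C++ reference versions of bit reversal, lowest-set-bit search, and packing two 16-bit values into one uint. The function names are assumptions; the real HLSL intrinsics are `reversebits()`, `firstbitlow()` and `f32tof16()`, and the float-to-half conversion itself is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Reverses the order of the 32 bits of v (bit 0 becomes bit 31, and so on).
std::uint32_t ReverseBits(std::uint32_t v)
{
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

// Index of the lowest set bit, or -1 when no bit is set.
int FirstBitLow(std::uint32_t v)
{
    for (int i = 0; i < 32; ++i) {
        if (v & (1u << i)) return i;
    }
    return -1;
}

// Packs two 16-bit payloads (e.g. two fp16 values, each held in the low
// 16 bits of a uint) into a single 32-bit word.
std::uint32_t PackHalf2(std::uint32_t lo16, std::uint32_t hi16)
{
    assert(lo16 <= 0xFFFFu && hi16 <= 0xFFFFu);
    return lo16 | (hi16 << 16);
}
```

For example, packing the fp16 bit patterns 0x3C00 (1.0) and 0x4000 (2.0) yields 0x40003C00.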
Unordered Access Views
» New view available in Shader Model 5.0
» UAVs allow binding of resources for arbitrary
  (unordered) read or write operations
  »   Supported in PS 5.0 and CS 5.0
» Applications
  »   Scatter operations
  »   Order-Independent Transparency
  »   Data binning operations
» Pixel Shader limited to 8 RTVs+UAVs total
  »   OMSetRenderTargetsAndUnorderedAccessViews()
» Compute Shader limited to 8 UAVs
  »   CSSetUnorderedAccessViews()
Raw Buffer Views
» New Buffer and View creation flag in SM 5.0
  »   Allows a buffer to be viewed as array of typeless
      32-bit aligned values
      »   Exception: Structured Buffers
  »   Buffer must be created with flag
      D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS
  »   Can be bound as SRV or UAV
      »   SRV: need D3D11_BUFFEREX_SRV_FLAG_RAW flag
      »   UAV: need D3D11_BUFFER_UAV_FLAG_RAW flag
ByteAddressBuffer   MyInputRawBuffer;     // SRV
RWByteAddressBuffer MyOutputRawBuffer;    // UAV

float4 MyPS(PSINPUT input) : COLOR
{
  uint u32BitData;
  u32BitData = MyInputRawBuffer.Load(input.index);// Read from SRV
  MyOutputRawBuffer.Store(input.index, u32BitData);// Write to UAV
  // Rest of code ...
}
Structured Buffers
» New Buffer creation flag in SM 5.0
  »   Ability to read or write a data structure at a
      specified index in a Buffer
  »   Resource must be created with flag
      D3D11_RESOURCE_MISC_BUFFER_STRUCTURED
  »   Can be bound as SRV or UAV
struct MyStruct
{
    float4 vValue1;
    uint   uBitField;
};
StructuredBuffer<MyStruct>   MyInputBuffer;    // SRV
RWStructuredBuffer<MyStruct> MyOutputBuffer;   // UAV

float4 MyPS(PSINPUT input) : COLOR
{
  MyStruct StructElement;
  StructElement = MyInputBuffer[input.index]; // Read from SRV
  MyOutputBuffer[input.index] = StructElement; // Write to UAV
  // Rest of code ...
}
Buffer Append/Consume
» Append Buffer allows new data to be written
  at the end of the buffer
  »   Raw and Structured Buffers only
  »   Useful for building lists, stacks, etc.
» Declaration
      Append[ByteAddress/Structured]Buffer MyAppendBuf;

» Access to write counter (Raw Buffer only)
      uint uCounter = MyRawAppendBuf.IncrementCounter();

» Append data to buffer
      MyRawAppendBuf.Store(uWriteCounter, value);
      MyStructuredAppendBuf.Append(StructElement);

» Can specify counters’ start offset
» Similar API for Consume and reading back a
  buffer
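The append mechanism can be thought of as a buffer plus a hidden write counter. The following C++ sketch is a CPU-side analogy only, not the Direct3D API; the `AppendBuffer` class, its `startOffset` parameter and its overflow behaviour are assumptions chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for an append buffer: fixed storage plus an
// atomic write counter that hands out the next free slot.
template <typename T>
class AppendBuffer {
public:
    explicit AppendBuffer(std::size_t capacity, std::uint32_t startOffset = 0)
        : data_(capacity), counter_(startOffset)
    {
        assert(startOffset <= capacity);
    }

    // Reserves the next slot and writes the value into it.
    // Returns false when the buffer is full (the counter still advances).
    bool Append(const T& value)
    {
        const std::uint32_t slot = counter_.fetch_add(1);
        if (slot >= data_.size()) return false;
        data_[slot] = value;
        return true;
    }

    const T& operator[](std::size_t i) const { return data_[i]; }

private:
    std::vector<T> data_;
    std::atomic<std::uint32_t> counter_;
};
```

Because each caller receives a distinct slot from the counter, elements land at the end of the buffer in whatever order the callers happen to run.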
Atomic Operations
» PS and CS support atomic operations
  »   Can be used when multiple threads try to modify
      the same data location (UAV or TLS)
  » Avoid contention
  InterlockedAdd
  InterlockedAnd/InterlockedOr/InterlockedXor
  InterlockedCompareExchange
  InterlockedCompareStore
  InterlockedExchange
  InterlockedMax/InterlockedMin
» Can optionally return original value
» Potential cost in performance
  »   Especially if original value is required
  »   More latency hiding required
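The "optionally return original value" behaviour can be sketched on the CPU with `std::atomic`. This is an analogy under stated assumptions, not HLSL: the helper names are hypothetical, and the max operation is built from a compare-exchange loop because C++ (before C++26) has no atomic max.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Adds value to dest and returns the value dest held before the add.
std::uint32_t InterlockedAddReturn(std::atomic<std::uint32_t>& dest,
                                   std::uint32_t value)
{
    return dest.fetch_add(value);
}

// Stores max(dest, value) into dest and returns the value dest held before.
std::uint32_t InterlockedMaxReturn(std::atomic<std::uint32_t>& dest,
                                   std::uint32_t value)
{
    std::uint32_t original = dest.load();
    // compare_exchange_weak reloads 'original' on failure, so the loop
    // retries until the store succeeds or dest is already >= value.
    while (original < value &&
           !dest.compare_exchange_weak(original, value)) {
    }
    return original;
}
```

The retry loop hints at why returning the original value can cost more: the caller has to wait for the result of the memory operation instead of firing and forgetting.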
Compute Shader
Compute Shader Intro
» A new programmable shader stage in DX11
  »   Independent of the graphic pipeline
» New industry standard for GPGPU
  applications
» CS enables general processing operations
  »   Post-processing
  »   Video filtering
  »   Sorting/Binning
  »   Setting up resources for rendering
  »   Etc.
» Not limited to graphic applications
  »   E.g. AI, pathfinding, physics, compression…
CS 5.0 Features
» Supports Shader Model 5.0 instructions
» Texture sampling and filtering instructions
  »   Explicit derivatives required
» Execution not limited to fixed input/output
» Thread model execution
  »   Full control on the number of times the CS runs
» Read/write access to “on-cache” memory
  »   Thread Local Storage (TLS)
  »   Shared between threads
  »   Synchronization support
» Random access writes
  »   At last!  Enables new possibilities (scattering)
CS Threads
» A thread is the basic CS processing element
» CS declares the number of threads to
  operate on (the “thread group”)
  »   [numthreads(X, Y, Z)]
      void MyCS(…)
  »   CS 5.0 limits: X*Y*Z<=1024, Z<=64
» To kick off CS execution:
  »   pDev11->Dispatch( nX, nY, nZ );
  »   nX, nY, nZ: number of thread groups to execute
» Number of thread groups can be written
  out to a Buffer as pre-pass
  »   pDev11->DispatchIndirect(LPRESOURCE
      *hBGroupDimensions, DWORD dwOffsetBytes);
  »   Useful for conditional execution
CS Threads & Groups
» pDev11->Dispatch(3, 2, 1);
» [numthreads(4, 4, 1)]
  void MyCS(…)
» Total threads = 3*2*4*4 = 96
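The thread count arithmetic, together with the CS 5.0 group-size limits quoted earlier (X*Y*Z<=1024, Z<=64), can be checked with a few lines of C++. The `Dim3` struct and both function names are assumptions introduced for this sketch.

```cpp
#include <cassert>
#include <cstdint>

struct Dim3 { std::uint32_t x, y, z; };

// Checks the CS 5.0 thread group limits quoted in the slides.
bool IsValidCs50GroupSize(Dim3 numThreads)
{
    const std::uint64_t total =
        std::uint64_t(numThreads.x) * numThreads.y * numThreads.z;
    return total >= 1 && total <= 1024 && numThreads.z <= 64;
}

// Total threads launched = number of groups * threads per group.
std::uint64_t TotalThreads(Dim3 groups, Dim3 numThreads)
{
    assert(IsValidCs50GroupSize(numThreads));
    const std::uint64_t groupCount =
        std::uint64_t(groups.x) * groups.y * groups.z;
    const std::uint64_t perGroup =
        std::uint64_t(numThreads.x) * numThreads.y * numThreads.z;
    return groupCount * perGroup;
}
```

Calling `TotalThreads({3, 2, 1}, {4, 4, 1})` reproduces the 96 threads of the example above.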
CS Parameter Inputs
» pDev11->Dispatch(nX, nY, nZ);
» [numthreads(X, Y, Z)]
  void MyCS(
      uint3 groupID:                SV_GroupID,
      uint3 groupThreadID:          SV_GroupThreadID,
      uint3 dispatchThreadID:       SV_DispatchThreadID,
      uint groupIndex:              SV_GroupIndex);
» groupID.xyz: group offsets from Dispatch()
  »   groupID.xyz ∈ (0..nX-1, 0..nY-1, 0..nZ-1);
  »   Constant within a CS thread group invocation
» groupThreadID.xyz: thread ID in group
  »   groupThreadID.xyz ∈ (0..X-1, 0..Y-1, 0..Z-1);
  »   Independent of Dispatch() parameters
» dispatchThreadID.xyz: global thread offset
  »   = groupID.xyz*(X,Y,Z) + groupThreadID.xyz
» groupIndex: flattened version of groupThreadID
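The relationships between these system values can be written out as plain arithmetic. The C++ sketch below is a CPU-side illustration; the `UInt3` struct and the function names are assumptions, and the flattening order (x fastest, then y, then z) follows the documented definition of SV_GroupIndex.

```cpp
#include <cassert>
#include <cstdint>

struct UInt3 { std::uint32_t x, y, z; };

// dispatchThreadID = groupID * (X,Y,Z) + groupThreadID
UInt3 DispatchThreadID(UInt3 groupID, UInt3 groupThreadID, UInt3 numThreads)
{
    assert(groupThreadID.x < numThreads.x);
    assert(groupThreadID.y < numThreads.y);
    assert(groupThreadID.z < numThreads.z);
    return { groupID.x * numThreads.x + groupThreadID.x,
             groupID.y * numThreads.y + groupThreadID.y,
             groupID.z * numThreads.z + groupThreadID.z };
}

// groupIndex = z*X*Y + y*X + x (flattened thread position inside the group)
std::uint32_t GroupIndex(UInt3 groupThreadID, UInt3 numThreads)
{
    return groupThreadID.z * numThreads.x * numThreads.y
         + groupThreadID.y * numThreads.x
         + groupThreadID.x;
}
```

For [numthreads(4, 4, 1)], thread (3, 2, 0) of group (2, 1, 0) has dispatchThreadID (11, 6, 0) and groupIndex 11.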
CS Bandwidth Advantage
» Memory bandwidth often still a bottleneck
  »   Post-processing, compression, etc.
» Fullscreen filters often require input pixels
  to be fetched multiple times!
  »   Depth of Field, SSAO, Blur, etc.
  »   BW usage depends on TEX cache and kernel size
» TLS allows reduction in BW requirements
» Typical usage model
  »   Each thread reads data from input resource
  »   …and writes it into TLS group data
  »   Synchronize threads
  »   Read back and process TLS group data
Thread Local Storage
» Shared between threads
» Read/write access at any location
» Declared in the shader
  »   groupshared float4 vCacheMemory[1024];
» Limited to 32 KB
» Need synchronization before reading back
  data written by other threads
  »   To ensure all threads have finished writing
  »   GroupMemoryBarrier();
  »   GroupMemoryBarrierWithGroupSync();
CS 4.X
» Compute Shader supported on DX10(.1) HW
  »   CS 4.0 on DX10 HW, CS 4.1 on DX10.1 HW
» Useful for prototyping CS on HW device
  before DX11 GPUs become available
» Drivers available from ATI and NVIDIA
» Major differences compared to CS5.0
  »   Max number of threads is 768 total
  »   Dispatch Zn==1 & no DispatchIndirect() support
  »   TLS size is 16 KB
  »   Thread can only write to its own offset in TLS
  »   Atomic operations not supported
  »   Only one UAV can be bound
  »   Only writable resource is Buffer type
PS 5.0 vs CS 5.0
 Example: Gaussian Blur
» Comparison between a PS 5.0 and CS5.0
  implementation of Gaussian Blur
» Two-pass Gaussian Blur
  »   High cost in texture instructions and bandwidth


» Can the compute shader perform better?
Gaussian Blur PS
» Separable filter Horizontal/Vertical pass
  »   Using kernel size of x*y
» For each pixel of each line:
  »   Fetch x texels in a horizontal segment
  »   Write H-blurred output pixel in RT: $B_H = \sum_{i=1}^{x} G_i P_i$
» For each pixel of each column:
  »   Fetch y texels in a vertical segment from RT
  »   Write fully blurred output pixel: $B = \sum_{i=1}^{y} G_i P_i$
» Problems:
  »   Texels of source texture are read multiple times
  »   This will lead to cache thrashing if kernel is large
  »   Also leads to many texture instructions used!
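Gaussian Blur PS
Horizontal Pass
[Figure: the horizontal pass reads the source texture and writes the temp RT]
Gaussian Blur PS
Vertical Pass
[Figure: the vertical pass reads the temp RT as its source texture and writes the destination RT]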
Gaussian Blur CS – HP(1)
groupshared float4 HorizontalLine[WIDTH];             // TLS
Texture2D txInput;              // Input texture to read from
RWTexture2D<float4> OutputTexture;            // Tmp output
// Host side: pDevContext->Dispatch(1,HEIGHT,1); (one thread group per line)
[numthreads(WIDTH,1,1)]
void GausBlurHoriz(uint3 groupID: SV_GroupID,
                   uint3 groupThreadID: SV_GroupThreadID)
{
    // Fetch color from input texture
    float4 vColor=txInput[int2(groupThreadID.x,groupID.y)];
    // Store it into TLS
    HorizontalLine[groupThreadID.x]=vColor;
    // Synchronize threads
    GroupMemoryBarrierWithGroupSync();

    // Continued on next slide
Gaussian Blur CS – HP(2)
    // Compute horizontal Gaussian blur for each pixel
    vColor = float4(0,0,0,0);
    [unroll]for (int i=-GS2; i<=GS2; i++)
    {
        // Determine offset of pixel to fetch
        int nOffset = groupThreadID.x + i;
        // Clamp offset
        nOffset = clamp(nOffset, 0, WIDTH-1);
        // Add color for pixels within horizontal filter
        vColor += G[GS2+i] * HorizontalLine[nOffset];
    }

    // Store result
    OutputTexture[int2(groupThreadID.x,groupID.y)]=vColor;
}
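For checking shader output, a CPU reference of the same horizontal pass can be handy. The C++ sketch below mirrors the loop above for a single-channel line; the function name `BlurLine` is an assumption, `weights` plays the role of the shader's `G` array, and `halfSize` corresponds to `GS2`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Blurs one line of pixels with an odd-sized kernel, clamping reads at the
// borders exactly like the clamp() in the shader.
std::vector<float> BlurLine(const std::vector<float>& line,
                            const std::vector<float>& weights)
{
    assert(!line.empty());
    assert(weights.size() % 2 == 1);
    const int width = static_cast<int>(line.size());
    const int halfSize = static_cast<int>(weights.size()) / 2;

    std::vector<float> out(line.size(), 0.0f);
    for (int x = 0; x < width; ++x) {
        float sum = 0.0f;
        for (int i = -halfSize; i <= halfSize; ++i) {
            // Clamp the offset to the valid range [0, width-1].
            const int offset = std::max(0, std::min(x + i, width - 1));
            sum += weights[halfSize + i] * line[offset];
        }
        out[x] = sum;
    }
    return out;
}
```

In the compute shader each iteration of the outer loop is one thread, and `line` is the groupshared `HorizontalLine` array filled before the barrier.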
Gaussian Blur BW: PS vs CS
» Pixel Shader
  »   # of reads per source pixel: 7 (H) + 7 (V) = 14
  »   # of writes per source pixel: 1 (H) + 1 (V) = 2
  »   Total number of memory operations per pixel: 16
  »   For a 1024x1024 RGBA8 source texture this is 64
      MBytes worth of data transfer
      »   Texture cache will reduce this number
      »   But becomes less effective as the kernel gets larger

» Compute Shader
  »   # of reads per source pixel: 1 (H) + 1 (V) = 2
  »   # of writes per source pixel: 1 (H) + 1 (V) = 2
  »   Total number of memory operations per pixel: 4
  »   For a 1024x1024 RGBA8 source texture this is 16
      MBytes worth of data transfer
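The two totals follow directly from the per-pixel operation counts. A short C++ sketch of the arithmetic, assuming 4 bytes per RGBA8 pixel and ignoring any cache effects (the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Bytes moved = pixels * bytes per pixel * (reads + writes per pixel).
std::uint64_t BlurTrafficBytes(std::uint32_t width, std::uint32_t height,
                               std::uint32_t bytesPerPixel,
                               std::uint32_t readsPerPixel,
                               std::uint32_t writesPerPixel)
{
    assert(bytesPerPixel > 0);
    const std::uint64_t pixels = std::uint64_t(width) * height;
    return pixels * bytesPerPixel * (readsPerPixel + writesPerPixel);
}
```

For a 1024x1024 RGBA8 texture this gives 64 MBytes for the pixel shader (14 reads, 2 writes) and 16 MBytes for the compute shader (2 reads, 2 writes).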
Conclusion
» The new Shader Model 5.0 feature set is
  extensive and powerful
  »   New instructions
  »   Double precision support
  »   Scattering support through UAVs
» Compute Shader
  »   No longer limited to graphic applications
  »   TLS memory allows considerable
      performance savings
» DX11 SDK available for prototyping
  »   Ask your IHV for a CS4.X-enabled driver
  »   REF driver for full SM 5.0 support
Questions?




   nicolas.thibieroz@amd.com

Shader model 5 0 and compute shader

  • 1.
  • 2. Shader Model 5.0 and Compute Shader Nick Thibieroz, AMD
  • 3. DX11 Basics » New API from Microsoft » Will be released alongside Windows 7 » Runs on Vista as well » Supports downlevel hardware » DX9, DX10, DX11-class HW supported » Exposed features depend on GPU » Allows the use of the same API for multiple generations of GPUs » However Vista/Windows7 required » Lots of new features…
  • 5. SM5.0 Basics » All shader types support Shader Model 5.0 » Vertex Shader » Hull Shader » Domain Shader » Geometry Shader » Pixel Shader » Some instructions/declarations/system values are shader-specific » Pull Model » Shader subroutines
  • 6. Uniform Indexing » Can now index resource inputs » Buffer and Texture resources » Constant buffers » Texture samplers » Indexing occurs on the slot number » E.g. Indexing of multiple texture arrays » E.g. indexing across constant buffer slots » Index must be a constant expression Texture2D txDiffuse[2] : register(t0); Texture2D txDiffuse1 : register(t1); static uint Indices[4] = { 4, 3, 2, 1 }; float4 PS(PS_INPUT i) : SV_Target { float4 color=txDiffuse[Indices[3]].Sample(sam, i.Tex); // float4 color=txDiffuse1.Sample(sam, i.Tex); }
  • 7. SV_Coverage » System value available to PS stage only » Bit field indicating the samples covered by the current primitive » E.g. a value of 0x09 (1001b) indicates that sample 0 and 3 are covered by the primitive » Easy way to detect MSAA edges for per- pixel/per-sample processing optimizations » E.g. for MSAA 4x: » bIsEdge=(uCovMask!=0x0F && uCovMask!=0);
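The coverage test on the SV_Coverage slide can be tried out on the CPU. The following is a minimal C++ sketch, not shader code: `IsMsaaEdge` and `IsSampleCovered` are hypothetical helper names, and the mask is assumed to use one bit per sample starting at bit 0, as in the 0x09 example.

```cpp
#include <cstdint>

// Hypothetical helper: true when some, but not all, of the first
// sampleCount samples are covered (the slide's bIsEdge test, generalized
// from the 4x case where the full mask is 0x0F).
inline bool IsMsaaEdge(std::uint32_t covMask, unsigned sampleCount) {
    const std::uint32_t fullMask =
        (sampleCount >= 32) ? 0xFFFFFFFFu : ((1u << sampleCount) - 1u);
    const std::uint32_t masked = covMask & fullMask;
    return masked != 0u && masked != fullMask;
}

// Hypothetical helper: true when sample `index` is set in the mask.
inline bool IsSampleCovered(std::uint32_t covMask, unsigned index) {
    return ((covMask >> index) & 1u) != 0u;
}
```

With a mask of 0x09 and 4 samples, samples 0 and 3 report as covered and the pixel counts as an edge; 0x0F and 0 do not.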
  • 8. Double Precision » Double precision optionally supported » IEEE 754 format with full precision (0.5 ULP) » Mostly used for applications requiring a high amount of precision » Denormalized values support » Slower performance than single precision! » Check for support: D3D11_FEATURE_DATA_DOUBLES fdDoubleSupport; pDev->CheckFeatureSupport( D3D11_FEATURE_DOUBLES, &fdDoubleSupport, sizeof(fdDoubleSupport) ); if (fdDoubleSupport.DoublePrecisionFloatShaderOps) { // Double precision floating-point supported! }
  • 9. Gather() » Fetches 4 point-sampled values in a single texture instruction » Allows reduction of texture processing » Better/faster shadow kernels » Optimized SSAO implementations » SM 5.0 Gather() more flexible » Channel selection now supported » Offset support (-32..31 range) for Texture2D » Depth compare version e.g. for shadow mapping » Gather[Cmp]Red() Gather[Cmp]Green() Gather[Cmp]Blue() Gather[Cmp]Alpha() [Figure: 2x2 texel footprint labelled W Z (top row) and X Y (bottom row)]
  • 10. Coarse Partial Derivatives » ddx()/ddy() supplemented by coarse version » ddx_coarse() » ddy_coarse() » Return same derivatives for whole 2x2 quad » Actual derivatives used are IHV-specific » Faster than “fine” version » Trading quality for performance » ddx_coarse() returns the same value for each of the four pixels of the quad [Figure: the four quad pixels] » Same principle applies to ddy_coarse()
  • 11. Other Instructions » FP32 to/from FP16 conversion » uint f32tof16(float value); » float f16tof32(uint value); » fp16 stored in low 16 bits of uint » Bit manipulation » Returns the first occurrence of a set bit » int firstbithigh(int value); » int firstbitlow(int value); » Reverse bit ordering » uint reversebits(uint value); » Useful for packing/compression code » And more…
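To make the bit-manipulation bullets concrete, here is a rough CPU-side emulation in C++ for unsigned inputs only. The function names are hypothetical stand-ins for the HLSL intrinsics, and two conventions are assumptions rather than statements from the slide: the returned index is counted from bit 0, and -1 is returned when no bit is set.

```cpp
#include <cstdint>

// Assumed convention: index of the highest set bit counted from bit 0,
// or -1 when value is zero.
inline int FirstBitHigh(std::uint32_t value) {
    for (int bit = 31; bit >= 0; --bit)
        if ((value >> bit) & 1u) return bit;
    return -1;
}

// Assumed convention: index of the lowest set bit, or -1 when value is zero.
inline int FirstBitLow(std::uint32_t value) {
    for (int bit = 0; bit < 32; ++bit)
        if ((value >> bit) & 1u) return bit;
    return -1;
}

// Reverses the order of all 32 bits by swapping progressively larger groups.
inline std::uint32_t ReverseBits(std::uint32_t v) {
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  // adjacent bits
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  // bit pairs
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);  // nibbles
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);  // bytes
    return (v >> 16) | (v << 16);                             // half-words
}
```

For the value 0x50 (bits 6 and 4 set), the sketch reports 6 as the first high bit and 4 as the first low bit.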
  • 12. Unordered Access Views » New view available in Shader Model 5.0 » UAVs allow binding of resources for arbitrary (unordered) read or write operations » Supported in PS 5.0 and CS 5.0 » Applications » Scatter operations » Order-Independent Transparency » Data binning operations » Pixel Shader limited to 8 RTVs+UAVs total » OMSetRenderTargetsAndUnorderedAccessViews() » Compute Shader limited to 8 UAVs » CSSetUnorderedAccessViews()
  • 13. Raw Buffer Views » New Buffer and View creation flag in SM 5.0 » Allows a buffer to be viewed as array of typeless 32-bit aligned values » Exception: Structured Buffers » Buffer must be created with flag D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS » Can be bound as SRV or UAV » SRV: need D3D11_BUFFEREX_SRV_FLAG_RAW flag » UAV: need D3D11_BUFFER_UAV_FLAG_RAW flag ByteAddressBuffer MyInputRawBuffer; // SRV RWByteAddressBuffer MyOutputRawBuffer; // UAV float4 MyPS(PSINPUT input) : COLOR { uint u32BitData; u32BitData = MyInputRawBuffer.Load(input.index);// Read from SRV MyOutputRawBuffer.Store(input.index, u32BitData);// Write to UAV // Rest of code ... }
  • 14. Structured Buffers » New Buffer creation flag in SM 5.0 » Ability to read or write a data structure at a specified index in a Buffer » Resource must be created with flag D3D11_RESOURCE_MISC_BUFFER_STRUCTURED » Can be bound as SRV or UAV struct MyStruct { float4 vValue1; uint uBitField; }; StructuredBuffer<MyStruct> MyInputBuffer; // SRV RWStructuredBuffer<MyStruct> MyOutputBuffer; // UAV float4 MyPS(PSINPUT input) : COLOR { MyStruct StructElement; StructElement = MyInputBuffer[input.index]; // Read from SRV MyOutputBuffer[input.index] = StructElement; // Write to UAV // Rest of code ... }
  • 15. Buffer Append/Consume » Append Buffer allows new data to be written at the end of the buffer » Raw and Structured Buffers only » Useful for building lists, stacks, etc. » Declaration Append[ByteAddress/Structured]Buffer MyAppendBuf; » Access to write counter (Raw Buffer only) uint uWriteCounter = MyRawAppendBuf.IncrementCounter(); » Append data to buffer MyRawAppendBuf.Store(uWriteCounter, value); MyStructuredAppendBuf.Append(StructElement); » Can specify counters’ start offset » Similar API for Consume and reading back a buffer
  • 16. Atomic Operations » PS and CS support atomic operations » Can be used when multiple threads try to modify the same data location (UAV or TLS) » Avoid contention InterlockedAdd InterlockedAnd/InterlockedOr/InterlockedXor InterlockedCompareExchange InterlockedCompareStore InterlockedExchange InterlockedMax/InterlockedMin » Can optionally return original value » Potential cost in performance » Especially if original value is required » More latency hiding required
  • 18. Compute Shader Intro » A new programmable shader stage in DX11 » Independent of the graphic pipeline » New industry standard for GPGPU applications » CS enables general processing operations » Post-processing » Video filtering » Sorting/Binning » Setting up resources for rendering » Etc. » Not limited to graphic applications » E.g. AI, pathfinding, physics, compression…
  • 19. CS 5.0 Features » Supports Shader Model 5.0 instructions » Texture sampling and filtering instructions » Explicit derivatives required » Execution not limited to fixed input/output » Thread model execution » Full control on the number of times the CS runs » Read/write access to “on-cache” memory » Thread Local Storage (TLS) » Shared between threads » Synchronization support » Random access writes » At last!  Enables new possibilities (scattering)
  • 20. CS Threads » A thread is the basic CS processing element » CS declares the number of threads to operate on (the “thread group”) » [numthreads(X, Y, Z)] void MyCS(…) » CS 5.0 limits: X*Y*Z<=1024, Z<=64 » To kick off CS execution: » pDev11->Dispatch( nX, nY, nZ ); » nX, nY, nZ: number of thread groups to execute » Number of thread groups can be written out to a Buffer as pre-pass » pDev11->DispatchIndirect(LPRESOURCE *hBGroupDimensions, DWORD dwOffsetBytes); » Useful for conditional execution
  • 21. CS Threads & Groups » pDev11->Dispatch(3, 2, 1); » [numthreads(4, 4, 1)] void MyCS(…) » Total threads = 3*2*4*4 = 96
  • 22. CS Parameter Inputs » pDev11->Dispatch(nX, nY, nZ); » [numthreads(X, Y, Z)] void MyCS( uint3 groupID: SV_GroupID, uint3 groupThreadID: SV_GroupThreadID, uint3 dispatchThreadID: SV_DispatchThreadID, uint groupIndex: SV_GroupIndex); » groupID.xyz: group offsets from Dispatch() » groupID.xyz ∈ (0..nX-1, 0..nY-1, 0..nZ-1); » Constant within a CS thread group invocation » groupThreadID.xyz: thread ID in group » groupThreadID.xyz ∈ (0..X-1, 0..Y-1, 0..Z-1); » Independent of Dispatch() parameters » dispatchThreadID.xyz: global thread offset » = groupID.xyz*(X,Y,Z) + groupThreadID.xyz » groupIndex: flattened version of groupThreadID
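The relationship between the four system values can be checked with a few lines of arithmetic. This is a minimal C++ sketch on the CPU, not shader code; `UInt3` and the function names are hypothetical, and the flattening order used for the group index (x fastest, then y, then z) is stated here as an assumption about what "flattened" means.

```cpp
#include <cstdint>

// Hypothetical stand-in for HLSL's uint3.
struct UInt3 { std::uint32_t x, y, z; };

// dispatchThreadID = groupID * (X, Y, Z) + groupThreadID
inline UInt3 DispatchThreadID(UInt3 groupID, UInt3 groupThreadID, UInt3 numThreads) {
    return { groupID.x * numThreads.x + groupThreadID.x,
             groupID.y * numThreads.y + groupThreadID.y,
             groupID.z * numThreads.z + groupThreadID.z };
}

// Flattened groupThreadID, assuming x varies fastest, then y, then z.
inline std::uint32_t GroupIndex(UInt3 groupThreadID, UInt3 numThreads) {
    return groupThreadID.z * numThreads.x * numThreads.y
         + groupThreadID.y * numThreads.x
         + groupThreadID.x;
}

// Total threads launched: (nX * nY * nZ) groups of (X * Y * Z) threads each.
inline std::uint32_t TotalThreads(UInt3 groups, UInt3 numThreads) {
    return groups.x * groups.y * groups.z
         * numThreads.x * numThreads.y * numThreads.z;
}
```

For Dispatch(3, 2, 1) with numthreads(4, 4, 1), `TotalThreads` gives the 96 threads quoted on the previous slide, and the last thread of group (2, 1, 0) has a global offset of (11, 7, 0).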
  • 23. CS Bandwidth Advantage » Memory bandwidth often still a bottleneck » Post-processing, compression, etc. » Fullscreen filters often require input pixels to be fetched multiple times! » Depth of Field, SSAO, Blur, etc. » BW usage depends on TEX cache and kernel size » TLS allows reduction in BW requirements » Typical usage model » Each thread reads data from input resource » …and writes it into TLS group data » Synchronize threads » Read back and process TLS group data
  • 24. Thread Local Storage » Shared between threads » Read/write access at any location » Declared in the shader » groupshared float4 vCacheMemory[1024]; » Limited to 32 KB » Need synchronization before reading back data written by other threads » To ensure all threads have finished writing » GroupMemoryBarrier(); » GroupMemoryBarrierWithGroupSync();
  • 25. CS 4.X » Compute Shader supported on DX10(.1) HW » CS 4.0 on DX10 HW, CS 4.1 on DX10.1 HW » Useful for prototyping CS on HW device before DX11 GPUs become available » Drivers available from ATI and NVIDIA » Major differences compared to CS5.0 » Max number of threads is 768 total » Dispatch nZ==1 & no DispatchIndirect() support » TLS size is 16 KB » Thread can only write to its own offset in TLS » Atomic operations not supported » Only one UAV can be bound » Only writable resource is Buffer type
  • 26. PS 5.0 vs CS 5.0 Example: Gaussian Blur » Comparison between a PS 5.0 and CS5.0 implementation of Gaussian Blur » Two-pass Gaussian Blur » High cost in texture instructions and bandwidth » Can the compute shader perform better?
  • 27. Gaussian Blur PS » Separable filter: horizontal/vertical pass » Using kernel size of x*y » For each pixel of each line: » Fetch x texels in a horizontal segment » Write H-blurred output pixel in RT: $B_H = \sum_{i=1}^{x} G_i P_i$ » For each pixel of each column: » Fetch y texels in a vertical segment from RT » Write fully blurred output pixel: $B = \sum_{i=1}^{y} G_i P_i$ » Problems: » Texels of source texture are read multiple times » This will lead to cache thrashing if kernel is large » Also leads to many texture instructions used!
  • 28. Gaussian Blur PS Horizontal Pass Source texture Temp RT
  • 29. Gaussian Blur PS Vertical Pass Source texture (temp RT) Destination RT
  • 30. Gaussian Blur CS – HP(1) » Launched with pDevContext->Dispatch(1,HEIGHT,1); » groupshared float4 HorizontalLine[WIDTH]; // TLS Texture2D txInput; // Input texture to read from RWTexture2D<float4> OutputTexture; // Tmp output [numthreads(WIDTH,1,1)] void GausBlurHoriz(uint3 groupID: SV_GroupID, uint3 groupThreadID: SV_GroupThreadID) { // Fetch color from input texture float4 vColor=txInput[int2(groupThreadID.x,groupID.y)]; // Store it into TLS HorizontalLine[groupThreadID.x]=vColor; // Synchronize threads GroupMemoryBarrierWithGroupSync(); // Continued on next slide
  • 31. Gaussian Blur CS – HP(2) // Compute horizontal Gaussian blur for each pixel vColor = float4(0,0,0,0); [unroll]for (int i=-GS2; i<=GS2; i++) { // Determine offset of pixel to fetch int nOffset = groupThreadID.x + i; // Clamp offset nOffset = clamp(nOffset, 0, WIDTH-1); // Add color for pixels within horizontal filter vColor += G[GS2+i] * HorizontalLine[nOffset]; } // Store result OutputTexture[int2(groupThreadID.x,groupID.y)]=vColor; }
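The horizontal pass above can be mirrored on the CPU to check the clamping and weighting logic. The following C++ sketch is an illustration under stated assumptions, not the shader itself: `BlurRow` is a hypothetical name, one float stands in for the float4 color, the outer loop plays the role of the WIDTH threads, and `line` plays the role of the HorizontalLine TLS array after the barrier.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Blurs one image row. `weights` holds 2*radius+1 Gaussian weights
// (the G array in the slides; radius corresponds to GS2).
std::vector<float> BlurRow(const std::vector<float>& line,
                           const std::vector<float>& weights) {
    assert(weights.size() % 2 == 1 && "kernel must have an odd number of taps");
    const int width  = static_cast<int>(line.size());
    const int radius = static_cast<int>(weights.size() / 2);
    std::vector<float> out(line.size(), 0.0f);
    for (int x = 0; x < width; ++x) {              // one iteration per "thread"
        float sum = 0.0f;
        for (int i = -radius; i <= radius; ++i) {
            // Clamp the fetch offset to the row, as the shader does.
            const int offset = std::max(0, std::min(x + i, width - 1));
            sum += weights[radius + i] * line[offset];
        }
        out[x] = sum;
    }
    return out;
}
```

Each input value is read from memory once into `line` and then reused by every output that needs it, which is the source of the bandwidth saving compared on the next slide.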
  • 32. Gaussian Blur BW:PS vs CS » Pixel Shader » # of reads per source pixel: 7 (H) + 7 (V) = 14 » # of writes per source pixel: 1 (H) + 1 (V) = 2 » Total number of memory operations per pixel: 16 » For a 1024x1024 RGBA8 source texture this is 64 MBytes worth of data transfer » Texture cache will reduce this number » But becomes less effective as the kernel gets larger » Compute Shader » # of reads per source pixel: 1 (H) + 1 (V) = 2 » # of writes per source pixel: 1 (H) + 1 (V) = 2 » Total number of memory operations per pixel: 4 » For a 1024x1024 RGBA8 source texture this is 16 MBytes worth of data transfer
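The 64 MBytes and 16 MBytes figures follow directly from the per-pixel operation counts. A small C++ sketch of the arithmetic, assuming 4 bytes per RGBA8 pixel, 1 MByte = 1024*1024 bytes, and no help from the texture cache; `BlurTrafficBytes` is a hypothetical helper name.

```cpp
#include <cstdint>

// Bytes moved for a two-pass blur: every pixel is read readsPerPixel times
// and written writesPerPixel times, bytesPerPixel bytes each time.
constexpr std::uint64_t BlurTrafficBytes(std::uint64_t width,
                                         std::uint64_t height,
                                         std::uint64_t bytesPerPixel,
                                         std::uint64_t readsPerPixel,
                                         std::uint64_t writesPerPixel) {
    return width * height * bytesPerPixel * (readsPerPixel + writesPerPixel);
}

constexpr std::uint64_t kMByte = 1024ull * 1024ull;

// Pixel shader: 14 reads + 2 writes per pixel.
static_assert(BlurTrafficBytes(1024, 1024, 4, 14, 2) == 64 * kMByte,
              "PS path: 64 MBytes");
// Compute shader: 2 reads + 2 writes per pixel.
static_assert(BlurTrafficBytes(1024, 1024, 4, 2, 2) == 16 * kMByte,
              "CS path: 16 MBytes");
```

The pixel shader count grows with the kernel width (7 taps per pass here), while the compute shader count stays at one read and one write per pass.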
  • 33. Conclusion » New Shader Model 5.0 feature set is extremely powerful » New instructions » Double precision support » Scattering support through UAVs » Compute Shader » No longer limited to graphic applications » TLS memory allows considerable performance savings » DX11 SDK available for prototyping » Ask your IHV for a CS4.X-enabled driver » REF driver for full SM 5.0 support
  • 34. Questions? nicolas.thibieroz@amd.com