Why Graphics is Fast, and what it can teach us about parallel programming
Jonathan Ragan-Kelley, MIT
7 December 2009, University College London
13 November 2009, Harvard
(me)
PhD student at MIT
with Frédo Durand, Saman Amarasinghe

Previously
Stanford
Industrial Light + Magic
The Lightspeed Automatic Interactive Lighting Preview System
(SIGGRAPH 2007)
NVIDIA, ATI
Decoupled Sampling for Real-Time Graphics Pipelines
(ACM Transactions on Graphics 2010)
Intel
1 game frame
via Johan Andersson, DICE
Game Engine Parallelism
Each task (~200-300/frame) is potentially:

data parallel
physics, sound, AI
image post-processing, streaming/decompression

pipeline parallel

(internally) task parallel

entire invocation of the graphics pipeline

↳ braided parallelism
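
To make the braided structure concrete, here is a minimal C++ sketch (not from any engine discussed here; Task, parallel_for, and the particle update are illustrative): a frame is a set of coarse tasks with dependencies, and each task is internally data parallel over its items.

// Minimal sketch of "braided" parallelism: coarse task parallelism across
// subsystems, data parallelism inside each task. All names are illustrative.
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

struct Task {
    std::function<void()> run;      // coarse-grained work (e.g. physics, AI)
    std::vector<int> dependsOn;     // indices of tasks that must finish first
};

// Data-parallel helper used *inside* a task: split [0, n) across worker threads.
void parallel_for(int n, const std::function<void(int)>& body) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            for (int i = (int)w; i < n; i += (int)workers) body(i);
        });
    for (auto& t : pool) t.join();
}

int main() {
    std::vector<float> particles(10000, 1.0f);
    // One coarse task ("integrate particles") that is internally data parallel.
    Task integrate{[&] {
        parallel_for((int)particles.size(), [&](int i) { particles[i] += 0.016f; });
    }, {}};
    integrate.run();   // a real engine would walk the whole dependency graph instead
}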
The Graphics Pipeline
The logical Direct3D 10 pipeline, with the approximate share of work per stage:

Input Assembler       2%   (reads Vertex Buffer, Index Buffer)
Vertex Shader        10%   (reads Texture memory)
Primitive Assembler   2%
Geometry Shader       8%   (reads Texture memory)
Setup/Rasterizer      8%
Pixel Shader         50%   (reads Texture memory)
Output Merger        20%   (reads/writes Color Buffer, Depth Buffer)
Fast Implementation
90% resource utilization

Massive data parallelism

Fine-grained dynamic load balance

Pipeline parallelism

Producer-consumer locality

Global dependence analysis, scheduling

Efficient fixed-function
Local Parallelism
Shader Execution Model
Highly data-parallel
vertices, primitives, fragments, pixels

Hierarchical parallelism
Instruction bandwidth: SIMD ALUs
Pipeline latency: vector execution/software pipelining
Unpredictable latency: hardware multithreading
Memory latency: dynamic threading/fibering
Task independence: multicore

Regular input, output
Marshaling/unmarshaling from data structures
handled in “fixed-function” for efficiency
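
A rough C++ sketch of how these levels nest, under assumed constants (kSimdWidth, kBatchesInFlight) and a toy shade() function; real hardware and fiber schedulers are far more dynamic than this fixed loop nest:

// Structural C++ sketch of the hierarchy above; each loop level is annotated
// with the mechanism it stands in for. Constants and shade() are illustrative.
#include <algorithm>
#include <thread>
#include <vector>

constexpr int kSimdWidth = 8;        // "SIMD ALUs": amortize instruction bandwidth
constexpr int kBatchesInFlight = 4;  // stand-in for HW threads / fibers hiding latency

void shade(float* data, int i) { data[i] = data[i] * 0.5f + 1.0f; }  // toy kernel

void core_worker(float* data, int begin, int end) {
    // Several batches interleaved per core (models multithreading/fibering).
    for (int base = begin; base < end; base += kBatchesInFlight * kSimdWidth) {
        for (int b = 0; b < kBatchesInFlight; ++b) {
            int batch = base + b * kSimdWidth;
            // Innermost loop: a compiler can map these lanes onto SIMD.
            for (int lane = 0; lane < kSimdWidth; ++lane) {
                int i = batch + lane;
                if (i < end) shade(data, i);
            }
        }
    }
}

int main() {
    std::vector<float> fragments(1 << 16, 1.0f);
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());  // "multicore"
    std::vector<std::thread> pool;
    int per = (int)fragments.size() / (int)cores + 1;
    for (unsigned c = 0; c < cores; ++c) {
        int begin = (int)c * per;
        int end = std::min((int)fragments.size(), begin + per);
        pool.emplace_back(core_worker, fragments.data(), begin, end);
    }
    for (auto& t : pool) t.join();
}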
Shader Implementation
Hardware shader architecture (GPU)
static SIMD warps
many dynamic hardware threads, tens of cores
fine-grained dynamic load balance
many kernels, types simultaneously
heuristic knowledge of specific pipeline

Software shader architecture (Larrabee)
similar static SIMD, inside dynamic latency-hiding,
inside many-core
threading via software fibering (microthreads),
software scheduling
Shader Stage-specific Variation
Vertex
less latency hiding needed,
so use less local memory,
[perhaps] no dynamic fibering (simpler schedule).

Geometry
variable output.
more state = less parallelism,
but (hopefully) need less latency hiding.

Fragment
derivatives → neighborhood communication.
texture intensive → more (dynamic) latency hiding.
Fixed-function Stages
[Input,Primitive] Assembly
Hairy finite state machine
Indirect indexing

Rasterization
Locally data-parallel (~16xSIMD), No memory access
Globally branchy, incremental, hierarchical

Output Merge
Really data-parallel, but ordered read-modify-write
Implicit data amplification (for antialiasing)
Pipeline Implementation
Ordering Semantics
Logically sequential
input primitive order defines framebuffer update order

Pixels are independent
spatial parallelism in framebuffer

Buffer to reorder otherwise out-of-order completion

(Could be looser in custom pipelines)
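
A minimal sketch of the reorder buffer idea above, assuming a toy Shaded record and a std::map as the buffering structure: shading may finish out of order, but results retire to the framebuffer in input-primitive order. A real pipeline scoreboards per pixel rather than globally.

// Out-of-order completion, in-order retirement. All names are illustrative.
#include <cstdio>
#include <map>
#include <vector>

struct Shaded { int primitive; int pixel; float color; };

int main() {
    // Results arriving in completion order (not primitive order).
    std::vector<Shaded> completed = {
        {2, 7, 0.9f}, {0, 7, 0.2f}, {1, 7, 0.5f},
    };
    std::map<int, Shaded> reorder;        // keyed by primitive index
    std::vector<float> framebuffer(16, 0.0f);
    int nextToRetire = 0;                 // the primitive the API says comes next

    for (const Shaded& s : completed) {
        reorder[s.primitive] = s;         // park out-of-order results
        // Retire every result that is now contiguous with the in-order front.
        while (reorder.count(nextToRetire)) {
            const Shaded& r = reorder[nextToRetire];
            framebuffer[r.pixel] = r.color;   // ordered read-modify-write
            reorder.erase(nextToRetire++);
        }
    }
    std::printf("pixel 7 = %.1f\n", framebuffer[7]);  // 0.9: last primitive wins
}

The map plays the role of the hardware reorder buffer/scoreboard: retirement advances only when the in-order front of the primitive stream is complete.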
Sort-Last Fragment
Shaders fully parallel
buffer to reorder between stages (logically FIFO)
100s-1,000s of triangles in-flight

Output Merger is screen-space parallel
crossbar from Pixel Shade to Output Merge (“sort-last”)
fine-grained scoreboarding

Full producer-consumer locality
all inter-stage communication on-chip
primitive-order processing
framebuffer cache filters some read-modify-write b/w
Sort-Middle
Front-end: transform/vertex processing
Scatter all geometry to screen-space tiles

Merge sort (through memory)
Maintain primitive order

Back-end: pixel processing
In-order (per-pixel)
scoreboard per-pixel → screen-space parallelism
One tile per-core
framebuffer data in local store
one scheduler, several worker [hardware] threads collaborating
Pixel Shader + Output Merge together
lower scheduling overhead
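
A sketch of the sort-middle front-end under assumed tile and screen sizes: each triangle's bounding box is scattered into every screen-space bin it overlaps, and bins stay in primitive order because the front-end walks primitives sequentially. The back-end described above would then replay one bin per core against a tile held in local store.

// Front-end binning for a sort-middle pipeline. Sizes and structures are
// illustrative, not from any real GPU.
#include <algorithm>
#include <vector>

constexpr int kTile = 64;            // assumed tile size in pixels
constexpr int kScreenW = 1280, kScreenH = 720;
constexpr int kTilesX = (kScreenW + kTile - 1) / kTile;
constexpr int kTilesY = (kScreenH + kTile - 1) / kTile;

struct Tri { float x[3], y[3]; };

// bins[tile] holds primitive indices, already in input order because the
// front-end walks primitives sequentially.
void bin_triangles(const std::vector<Tri>& tris,
                   std::vector<std::vector<int>>& bins) {
    bins.assign(kTilesX * kTilesY, {});
    for (int p = 0; p < (int)tris.size(); ++p) {
        const Tri& t = tris[p];
        float minx = std::min({t.x[0], t.x[1], t.x[2]});
        float maxx = std::max({t.x[0], t.x[1], t.x[2]});
        float miny = std::min({t.y[0], t.y[1], t.y[2]});
        float maxy = std::max({t.y[0], t.y[1], t.y[2]});
        int tx0 = std::clamp((int)(minx / kTile), 0, kTilesX - 1);
        int tx1 = std::clamp((int)(maxx / kTile), 0, kTilesX - 1);
        int ty0 = std::clamp((int)(miny / kTile), 0, kTilesY - 1);
        int ty1 = std::clamp((int)(maxy / kTile), 0, kTilesY - 1);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * kTilesX + tx].push_back(p);
    }
    // Back-end (not shown): each core takes whole tiles, keeps that tile's
    // framebuffer in local store, and replays its bin in primitive order.
}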
(return to the task system, to put this back in context; reinforce the braided nature)

Game Frame
Existing Models
Parallel Programming Models
Task parallelism (Cilk, Jade, etc.)
dynamic, heavyweight items
very flexible, high overhead

Data parallelism (NESL, FortranX, C*)
dynamic, lightweight items
flexible, moderate overhead

Streaming (StreamIt, Brook)
static data + pipeline parallelism
very low overhead, but inflexible ➞ load imbalance
CUDA, GPGPU

Canonical model                 Key challenges

thread = work-item          }
one kernel at a time        }   dynamic load balance

purely data parallel        }
streaming data access       }   producer-consumer locality
Future directions
Task decomposition continuing to grow
Past:     subsystem threads: 2-3 threads utilized
Present:  task systems: 6-8 threads, some headroom
Future:   many more tasks, reduce false dependence
          orchestration language?

More data, braided parallelism within tasks
Essential to extracting FLOPS in future throughput chips

“Über-kernels”
(NVIDIA ray tracing, Reyes demos; OptiX system)
Dynamic task scheduling inside a single-kernel,
purely data-parallel architecture
Über-kernels

Each stage as its own kernel:

stage_1():
  // do stage 1…

stage_2():
  // do stage 2…

stage_3():
  // do stage 3…
Über-kernels

All stages fused into one kernel; a scheduler picks which stage each item runs:

while(true):
  state = scheduler()
  switch(state):
    case stage_1:
      // do stage 1…
    case stage_2:
      // do stage 2…
    case stage_3:
      // do stage 3…
Über-kernels

Stages can be subdivided into finer scheduling states:

while(true):
  state = scheduler()
  switch(state):
    case stage_1:
      // do stage 1…
    case stage_2_1:
      // beginning of 2…
    case stage_2_2:
      // end of 2…
    case stage_3_1:
      // beginning of 3…
    case stage_3_2:
      // end of 3…
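
A CPU-side C++ model of this pattern (all names illustrative): one flat pool of workers, each repeatedly asking a shared counter for the next tagged work item and switching on the stage that item needs. On a GPU, the body of this loop would live inside the single über-kernel.

// CPU model of an über-kernel: dynamic stage scheduling inside one flat,
// data-parallel dispatch. Work items, stages, and math are illustrative.
#include <atomic>
#include <cmath>
#include <functional>
#include <thread>
#include <vector>

enum Stage { kStage1, kStage2, kStage3 };
struct Work { Stage stage; float value; };

void uber_worker(std::vector<Work>& queue, std::atomic<int>& next,
                 std::vector<float>& out) {
    while (true) {
        int i = next.fetch_add(1);            // scheduler(): grab the next item
        if (i >= (int)queue.size()) return;   // no work left
        switch (queue[i].stage) {             // branch to the stage this item needs
            case kStage1: out[i] = queue[i].value * 2.0f; break;      // do stage 1…
            case kStage2: out[i] = queue[i].value + 1.0f; break;      // do stage 2…
            case kStage3: out[i] = std::sqrt(queue[i].value); break;  // do stage 3…
        }
    }
}

int main() {
    std::vector<Work> queue = {{kStage1, 3.0f}, {kStage3, 9.0f}, {kStage2, 1.5f}};
    std::vector<float> out(queue.size());
    std::atomic<int> next{0};
    std::vector<std::thread> pool;
    for (unsigned c = 0; c < 4; ++c)
        pool.emplace_back(uber_worker, std::ref(queue), std::ref(next), std::ref(out));
    for (auto& t : pool) t.join();
}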
The Ray Tracing Pipeline
Host
  Buffers, Texture Samplers, Variables

Entry Points
  Ray Generation Program
  Exception Program

Trace → Traversal
  Intersection Program
  Any Hit Program
  Selector Visit Program

Ray Shading
  Closest Hit Program
  Miss Program
Explicit continuation programs

stage_1():
  non_blocking_fetch()
  if(c): recurse(s_1)
  else: tail_call(s_2)

non_blocking_fetch():
  prefetch()
  call/cc(myScheduler)
  return result

recurse(s):
  yield(myScheduler)
  s()
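
Standard C++ has no call/cc, so a sketch of the same idea has to make each continuation explicit: a stage issues its (pretend) prefetch and then enqueues "the rest of the stage" as a closure for a toy scheduler to resume later. The scheduler, stage names, and payloads are all illustrative.

// Continuation-passing by hand: split a stage at its long-latency fetch and
// hand the remainder to a scheduler instead of blocking. Names illustrative.
#include <cstdio>
#include <deque>
#include <functional>

std::deque<std::function<void()>> scheduler;   // ready continuations

void run_scheduler() {
    while (!scheduler.empty()) {
        auto k = scheduler.front();
        scheduler.pop_front();
        k();                                   // resume a suspended stage
    }
}

void stage_2(float value) { std::printf("stage 2 got %.1f\n", value); }

void stage_1(int item) {
    // prefetch(): kick off the long-latency load (simulated here)...
    float pending = item * 0.5f;
    // ...then, instead of blocking, register the continuation that consumes it.
    scheduler.push_back([pending] {            // stand-in for call/cc(myScheduler)
        stage_2(pending);                      // tail_call(s_2)
    });
}

int main() {
    for (int item = 0; item < 3; ++item) stage_1(item);
    run_scheduler();                            // drains the work the items exposed
}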
Acknowledgements
CPS ideas
Tim Foley, Mark Lacey, Mark Leone - Intel

General Discussion
Saman Amarasinghe, Frédo Durand - MIT
Solomon Boulos, Kurt Akeley - Stanford/MSR

OptiX details, slides
Austin Robison, Steve Parker - NVIDIA

Current game-engine details, figures
Johan Andersson - DICE
extras
Need
Locality
across many (potentially competing) axes
application-specific scheduling knowledge
autotuning for complex optimization space?

Braids of parallelism
data-parallelism
static - SIMD, extreme arithmetic intensity
dynamic - latency hiding
task parallelism
static - pipelines
dynamic - to add dynamism to data-parallel layer
Key issues
Hierarchy

Braided decomposition

Managing complexity
separating orchestration from kernel implementation
recursively, at multiple levels

Hybrid static-dynamic
extreme arithmetic intensity + dynamic load balance

Hybrid pipeline and data-parallel
competing axes of locality
Task Optimization
Draw Batch
Group of primitives bound to common state

JIT compile, optimize on rebinding state
changing any shader stage
potential inter-stage elimination of dead outputs
late-binding constant folding
changing some fixed-function mode state
not on changing inputs (too expensive)

“near-static” compilation
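
A sketch of what "near-static" compilation might look like as code, with hypothetical StateKey and CompiledPipeline types: specialized pipeline variants are cached by the bound state that affects code generation, so rebinding a shader or mode triggers a lookup (or JIT), while per-draw input changes never do.

// Cache of specialized pipeline variants keyed by bound state. Everything
// here, including the "compile" step, is an illustrative stand-in.
#include <cstdint>
#include <string>
#include <unordered_map>

struct StateKey {
    uint64_t vertexShaderId, pixelShaderId;  // which programs are bound
    uint32_t blendMode;                      // example fixed-function mode
    bool operator==(const StateKey& o) const {
        return vertexShaderId == o.vertexShaderId &&
               pixelShaderId == o.pixelShaderId && blendMode == o.blendMode;
    }
};
struct KeyHash {
    size_t operator()(const StateKey& k) const {
        return std::hash<uint64_t>()(k.vertexShaderId * 31 + k.pixelShaderId) ^
               std::hash<uint32_t>()(k.blendMode);
    }
};

struct CompiledPipeline { std::string code; };   // stand-in for real machine code

CompiledPipeline compile(const StateKey& k) {
    // A real system would fuse stages, drop dead inter-stage outputs, and fold
    // late-bound constants for exactly this state combination.
    return {"specialized variant for blendMode " + std::to_string(k.blendMode)};
}

std::unordered_map<StateKey, CompiledPipeline, KeyHash> cache;

const CompiledPipeline& bind(const StateKey& k) {
    auto it = cache.find(k);
    if (it == cache.end()) it = cache.emplace(k, compile(k)).first;  // rebind → JIT
    return it->second;             // per-draw input changes never reach compile()
}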
Rendering Pass
Common output buffer binding
with no swap/clear

Synthesize renderer from canonical pipeline
render shadow, environment maps
2D image post-processing

Optimize performance
avoid wasted work (Z cull pre-pass, deferred shading)
reorder/coalesce batches (deferred lighting)
Inter-pass Optimization
Buffers never consumed
keep in scratchpad
“fast clear”
use optimized/native formats
resolve antialiasing before write-out

Passes overlapped
no startup/wind-down bubbles

Pass folding
e.g. merge 2D post-processing into back-end stage,
on local tile memory
Control vs. Data-Dependence
Simple task graph has control dependence
between stages

D3D scheduling based on resource (data)
dependence, not control dependence
‣Finer-grained scheduling
‣Fewer false hazards
Generalize to whole engine core?
AI, sound
Physics: rigid, soft, IK, cloth, particles, fracture
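
A sketch of resource-based dependence tracking with illustrative structures: each submitted task records which resources it reads and writes, and edges are added only against the last writer (and intervening readers) of those resources, rather than against every previously submitted task.

// Track read/write hazards per resource instead of serializing on submission
// order. ResourceId, TaskId, and the tracker layout are illustrative.
#include <set>
#include <unordered_map>
#include <vector>

using ResourceId = int;
using TaskId = int;

struct TaskDeps { std::vector<TaskId> dependsOn; };

struct Tracker {
    std::unordered_map<ResourceId, TaskId> lastWriter;
    std::unordered_map<ResourceId, std::vector<TaskId>> readersSinceWrite;
    std::vector<TaskDeps> tasks;

    TaskId submit(const std::vector<ResourceId>& reads,
                  const std::vector<ResourceId>& writes) {
        TaskId id = (TaskId)tasks.size();
        std::set<TaskId> deps;
        for (ResourceId r : reads)                        // read-after-write
            if (lastWriter.count(r)) deps.insert(lastWriter[r]);
        for (ResourceId w : writes) {                     // write-after-read/write
            if (lastWriter.count(w)) deps.insert(lastWriter[w]);
            for (TaskId t : readersSinceWrite[w]) deps.insert(t);
        }
        for (ResourceId r : reads) readersSinceWrite[r].push_back(id);
        for (ResourceId w : writes) { lastWriter[w] = id; readersSinceWrite[w].clear(); }
        tasks.push_back({{deps.begin(), deps.end()}});
        return id;
    }
};

Two tasks that touch disjoint resources get no edge between them, even if one was submitted long after the other; that is where the false hazards of a purely control-ordered schedule disappear.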
Core principles
                                     TODO: update

Hierarchical parallelism
Shader-style data-parallel kernels
Kernel fusion/global optimization

Separate kernel and orchestration
Separate-but-connected languages/runtime models

Resource-based scheduling

Application-specific/user-controlled
scheduling and memory management


Editor's notes

  1. Graphics and games among the most successful mainstream applications of highly parallel computing: - multi-million-line code bases, many real subsystems - not 50-line kernel expressed in 100k lines for performance, like SciComp - tight real-time budgets, memory constraints, … - Consoles look like mini Computers of the Future: - manycore, multithreaded - SIMD, in-order/throughput-oriented - heterogeneous: GPU + CPU + SPU - Portable across platforms: PC, PS3, Xbox - SIMD, threading, task systems Philosophy: if we can get graphics right—complex, heterogeneous, dynamic, but also well-understood—then good start to generalize So here I’m going to focus on explaining some key design patterns that have emerged from current graphics pipelines, to set the stage for the sorts of things future systems-programming tools need to support/express. --- What I’m hoping to do: parallel programming system good enough for graphics and games on next-generation (software+throughput processor) systems. That is: not just after graphics—whole game engine core, plus other complex/heterogeneous systems programming applications
  2. I’m an Nth year PhD student at MIT, where I work with Fredo Durand, and more recently also Saman Amarasinghe I’ve been in graphics and graphics systems for a while: Stanford: lucky to spend 4 years with Pat Hanrahan ILM: offline rendering, compilers - Lightspeed, at SIGGRAPH 2007 NVIDIA, ATI - graphics architecture - Decoupled Sampling, in revision for Trans. Graph., presented at next SIGGRAPH Intel: graphics architecture (now software - Larrabee!), data-parallel compilers, some chip architecture
  3. as a motivating example, look at 1 modern game frame
  4. 1000s of independent entities: - characters - vehicles - projectiles explosions, physics destruction many lights, changing occlusion/visibility In all: 10s of GFLOPS, 100s of thousands of lines of code executed
  5. This is the task graph for a single frame from the engine used to make that image. To render that: hundreds of (heterogeneous) tasks locally task, data, and pipeline parallel (point at graph) Task = 10k-100k lines/each—fairly large
  6. and even within those tasks data, pipeline, task parallel entire invocations of graphics pipeline many levels => braided parallelism
  7–19. Logical Direct3D 10 pipeline. Red = programmable. Blue = fixed-function. “Fixed-function” still configured by programmable state (and implied by shader input/output). Inter-stage data = always fixed flow/non-programmable. IA: reads from pointer to index, vertex buffers and pulls in vertex data (indirectly/complex traversal). Finite State Machine. VS: user program. Can read (only) from textures. PrimAsm: indexes vertex stream and groups together whole triangles. Finite State Machine. GS: variable output (filter/amplify). NOTE: programmable stages delineated by common one-to-one data stream. Any time there is reordering or complex stream amplification/filtering, we split stages. THIS IS LOGICAL PIPELINE, but in reality, fast implementation is
20. Newer pipelines add shader stages with new semantics and characteristics, including (unordered) read-write memory access from the pixel shader (sketched below).
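A hypothetical CUDA sketch of what such unordered read-write access looks like from a pixel-shader-like kernel; the record type, buffer, and counter are invented for illustration.

#include <cuda_runtime.h>

struct Record { int x, y; float value; };

__global__ void pixel_like_kernel(Record* out, unsigned int* count,
                                  unsigned int capacity, int width, int height) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;

  // Unordered append: which slot a pixel gets depends on scheduling, not on
  // primitive order. That is exactly the semantics the API leaves undefined.
  unsigned int slot = atomicAdd(count, 1u);
  Record r{x, y, float(x + y)};            // stand-in for real shading output
  if (slot < capacity) out[slot] = r;
}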
21. Future: programmable output blending? That means ordered read-modify-write buffers. Also, the more that becomes software-controlled, the more expensive synchronization becomes. As we’ll see, the biggest difference between the pipeline choices for Larrabee vs. a hardware GPU comes down to the cost, and therefore the granularity, of synchronization.
22. A fast implementation needs to sustain essentially maximum resource utilization. Dynamic load balance: the balance between stages shifts not just across applications or frames, but at very fine granularity within a frame (triangles vary in size; shaders vary in length). Pipeline parallelism: it has to overlap operations and passes to avoid bubbles. Producer-consumer locality is essential: there is far too much intermediate data to spill to memory. Task parallelism and producer-consumer locality are why, ironically, you can’t build a fast graphics pipeline in “GPGPU”/CUDA. A load-balance sketch follows below.
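As a rough illustration, here is a hypothetical CUDA sketch of fine-grained dynamic load balance on the producer side: triangles vary wildly in cost, so each thread pulls its next triangle from a shared atomic counter rather than being statically assigned one. All names and the flat fragment buffer are invented, and it assumes the buffer is sized for the worst case; a real pipeline would keep the produced fragments on-chip for the consuming stage.

#include <cuda_runtime.h>

struct Tri  { int frag_count; /* plus setup data */ };
struct Frag { int tri, index; };

__global__ void shade_triangles(const Tri* tris, int num_tris,
                                Frag* out_frags, unsigned int* out_count,
                                unsigned int* next_tri) {
  while (true) {
    // Dynamic work distribution: a thread that drew a cheap triangle
    // finishes quickly and immediately grabs more work instead of idling.
    unsigned int t = atomicAdd(next_tri, 1u);
    if (t >= (unsigned int)num_tris) return;

    // Producer side: emit a variable number of fragments per triangle.
    for (int i = 0; i < tris[t].frag_count; ++i) {
      unsigned int slot = atomicAdd(out_count, 1u);
      Frag f{(int)t, i};
      out_frags[slot] = f;
    }
  }
}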
  23. First, how parallelism is exploited and how the different stages work.
24. Parallelism is exploited across all data types: vertices, primitives, fragments, pixels. This hierarchical application of parallelism mirrors a hierarchical application of most of the major ideas from parallel, latency-tolerant (i.e. “throughput”) processor architecture. To maintain high regularity within shaders, the marshaling of data to and from complex or irregular structures is factored apart from the core programmable data-parallel shaders, into logically fixed-function stages.
25. There are two major schools of implementation today: a hardware GPU (e.g. NVIDIA) and a software graphics pipeline (Larrabee). The core hierarchical approach to exploiting parallelism and hiding latency is essentially the same; only the constants shift with the implementation. The high-order bit: a SIMD + vector + thread + core hierarchy exploits parallelism, balancing very high arithmetic intensity against the need to retain dynamic load balance. It is the same basic model in software or hardware; schedulers are application- and pipeline-specific, so software has a potential advantage. A software rendition of this hierarchy follows below.
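A hypothetical sketch of the same hierarchy in software form (Larrabee-flavored; the names, SIMD width, and partitioning are invented): cores map to OS threads, batches to the per-core unit of work, and the innermost loop to SIMD lanes. A hardware GPU has the same levels, just fixed in silicon.

#include <algorithm>
#include <thread>
#include <vector>

constexpr int kSimdWidth = 16;   // assumed vector width

void shade_batch(float* data, int begin, int end) {
  for (int i = begin; i < end; i += kSimdWidth) {
    // SIMD level: the compiler (or explicit intrinsics) maps this across lanes.
    for (int lane = 0; lane < kSimdWidth && i + lane < end; ++lane)
      data[i + lane] = data[i + lane] * 2.0f + 1.0f;   // stand-in shader math
  }
}

void shade_all(float* data, int n) {
  // Core level: one worker per hardware core, each taking a contiguous batch.
  unsigned cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 1;
  int per_core = (n + (int)cores - 1) / (int)cores;
  std::vector<std::thread> workers;
  for (unsigned c = 0; c < cores; ++c)
    workers.emplace_back(shade_batch, data, int(c) * per_core,
                         std::min(n, int(c + 1) * per_core));
  for (auto& w : workers) w.join();
}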
26. Data parallelism over the different data types is subtly different, in ways that necessitate different implementation and optimization tradeoffs for peak efficiency. A fast graphics pipeline implementation must optimize for all of these stage-specific characteristics, and this is just in the “simple data-parallel kernel” parts of the pipeline. Clearly, the triviality of embarrassing parallelism is overstated.
27. The fixed-function stages (IA, rasterizer, ROP/output merger) introduce even more workload variation. The output merger, in particular, requires ordered read-modify-write of the framebuffer, and, as always, order is the enemy of parallelism. Also, the more that becomes software-controlled, the more expensive synchronization becomes. As we’ll see, the biggest difference between the pipeline choices for Larrabee vs. a hardware GPU is driven by the cost, and therefore the granularity, of synchronization.
  28. Popping up a level, to the pipeline as a whole:
29. All this talk of parallelism and asynchrony, but the logical pipeline is scalar: the API-defined ordering requires that all pixel updates happen in exactly the order defined by the input primitive sequence. This synchronization is a key reason why graphics is not trivially parallel; many operations are order-dependent. There is still parallelism: across screen space, and via a reorder buffer for asynchronous completion (sketched below). As an aside, these strict semantics exist to make the API highly predictable and usable for many applications. In practice ordering could often be much looser given application-specific semantics, so I’ll speculate that application-defined software rendering pipelines will exploit optimizations here.
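Here is a hypothetical sketch of the reorder-buffer idea: shading may complete out of order, but results commit to the framebuffer strictly in primitive-submission order. The class and its interface are invented for illustration.

#include <cstdint>
#include <optional>
#include <vector>

struct Commit { int pixel; float color; };

class ReorderBuffer {
 public:
  explicit ReorderBuffer(size_t num_primitives) : slots_(num_primitives) {}

  // Called whenever a primitive's shading finishes, possibly out of order.
  void complete(size_t primitive_index, Commit c) { slots_[primitive_index] = c; }

  // Retire as many results as possible, oldest first; stop at the first
  // primitive that hasn't finished yet. (Order is the enemy of parallelism.)
  void retire(std::vector<float>& framebuffer) {
    while (head_ < slots_.size() && slots_[head_].has_value()) {
      const Commit& c = *slots_[head_];
      framebuffer[c.pixel] = c.color;   // the ordered read-modify-write point
      ++head_;
    }
  }

 private:
  std::vector<std::optional<Commit>> slots_;
  size_t head_ = 0;
};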
30. Here is how those semantics are exploited in two kinds of modern pipelines. Most hardware GPUs look like the logical pipeline, because they directly implement the API, but they are highly parallel at most stages.
31. The key tradeoff is what spills to memory: stream vs. cache/buffer the intra-pipeline data. Either stream over the intra-pipeline data (post-transformed primitives) and cache the framebuffer, or stream over the framebuffer and buffer the intra-pipeline data, as in tile-based binning (sketched below). Also critical: lower scheduling overhead and coarser-grained synchronization.
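A hypothetical sketch of the second option, binning: buffer the post-transform triangles into screen-space tiles, then stream over the framebuffer one tile at a time so each tile's pixels stay in on-chip memory, while bin order preserves the original primitive order within a tile. The tile size, structs, and names are invented, and the per-triangle rasterization is omitted.

#include <vector>

constexpr int kTile = 64;   // assumed tile size in pixels

struct Tri { int min_x, min_y, max_x, max_y; /* plus shading data */ };

void render_binned(const std::vector<Tri>& tris, int width, int height) {
  int tiles_x = (width + kTile - 1) / kTile;
  int tiles_y = (height + kTile - 1) / kTile;

  // Binning pass: append each triangle to every tile its bounding box overlaps.
  std::vector<std::vector<int>> bins(tiles_x * tiles_y);
  for (int t = 0; t < (int)tris.size(); ++t)
    for (int ty = tris[t].min_y / kTile; ty <= tris[t].max_y / kTile; ++ty)
      for (int tx = tris[t].min_x / kTile; tx <= tris[t].max_x / kTile; ++tx)
        if (tx >= 0 && tx < tiles_x && ty >= 0 && ty < tiles_y)
          bins[ty * tiles_x + tx].push_back(t);

  // Tile pass: the tile's color (and depth) never leave local storage until done.
  for (int tile = 0; tile < tiles_x * tiles_y; ++tile) {
    float local_color[kTile * kTile] = {};   // stand-in for on-chip tile memory
    for (int t : bins[tile]) {
      (void)t;  // rasterize and shade tris[t] into local_color (omitted)
    }
    (void)local_color;  // ...then write local_color back to memory once per tile
  }
}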
32. To tie back to the whole: within a single frame, this pipeline is driven by sequences of passes, each made up of many draw batches. The app’s logical rendering pipeline != the D3D pipeline, both for core algorithmic reasons and for performance optimization reasons.
33. An entire pass of the pipeline is just one node (the large circles, I think) in the frame’s task graph. And there are many other task stages: physics simulation, sound, culling, streaming and decompression, AI and agent updates. These are going to become more data-parallel, more like the graphics pipeline, to utilize future processors. (The biggest current barrier to running them fully on the GPU is latency/communication overhead, but developers want to.)
34. People have obviously built parallel programming models before. Task-centric systems are dynamic (good), but have high overhead and can’t achieve sufficient compute density (real work). Data-parallel systems I divide into more dynamic and more static ones. Many historical data-parallel systems had dynamic runtimes targeting dynamically independent processors, but their overhead was still too high. Streaming, as in StreamIt, is the extreme end point of static data-parallel optimization: very low overhead, but it struggles with variable and dynamic loads. Graphics on StreamIt is too static and load-imbalanced; the obvious reason is that triangles vary in size. This is by no means exhaustive, just a sampling of systems covering the major individual models I’ve discussed. The biggest point: the focus has traditionally been on single-model parallel programming systems, while large systems like graphics and games require all of the above.
35. Consider CUDA, because on massively parallel throughput machines it is the leading candidate model actually available today. The canonical model is a thread per work item: purely data-parallel, with streaming data accesses (sketched below). The challenges are dynamic load balance (like any semi-static model) and producer-consumer locality. Ironically, these are two key things graphics pipelines have to do very well: dynamic load balance within and between tasks and data elements, and producer-consumer locality for bandwidth savings on huge intermediate streams.
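The canonical shape, for reference (a standard SAXPY-style kernel, not drawn from the talk): one thread per work item, purely data-parallel, streaming loads and stores, no producer-consumer communication.

#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];   // every item costs about the same
}

// Launch: saxpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
// This works beautifully when items are uniform. It is exactly what breaks
// down when items vary in cost (triangles!) or feed each other (pipelines).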
36. Promising directions are emerging, and the challenges for the future are starting to become clear.
37. Going forward, it’s interesting to look at how things might change. Dynamic task decomposition will become even more important. First it was just putting each subsystem on its own thread, which doesn’t scale far; now it is several hundred jobs per frame, which goes somewhere; next it will be many more jobs. The key issue will be continuing to radically increase the number of independent tasks that can be extracted. The problem with traditional “task systems” is false control dependence; the graphics pipeline already shows how this can be overcome. Pure data flow? Orchestration, complexity, and dynamism in general will be challenging (language, tools?). Individual tasks will become internally more complex; hierarchical decomposition and low-level data parallelism remain important for efficiency, and data-centric tasks will become more like graphics and rendering. At the data-parallel level, one interesting trend emerged at SIGGRAPH this year: dynamic scheduling between multiple in-flight logical kernels on today’s one-kernel-at-a-time GPUs. Every interesting demo from NVIDIA that broke the CUDA model demonstrated an alternative rendering system on the GPU, entirely in software, but breaking the pure data-parallel, streaming model to achieve essential performance and flexibility.
38. The idea is to take what is logically a series of pipeline stages, each of which would normally be a separate kernel launch, and fuse them into a single kernel that steps through a state machine of stages.
39. You can use these state machines for recursive code and cyclic graphs, not just trivial pipelines.
40. You can even use them for dynamic branching between arbitrary points in the flow, by logically decomposing stages and their entry/exit points into sub-states, or for stages that internally fan out to go data-parallel, dynamically. NVIDIA OptiX uses this pattern (a sketch follows below).
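A hypothetical sketch of the pattern, in the spirit of (but not taken from) OptiX: one persistent kernel pulls (stage, item) records from a shared queue and switches on the stage, so logically distinct, even recursive, stages run inside a single launch. The queue layout, stage names, and pop helper are all invented, and the pushes back into the queue are left as comments.

#include <cuda_runtime.h>

enum Stage { RAYGEN, INTERSECT, SHADE, DONE };
struct Work { Stage stage; int item; };

__device__ bool pop(Work* queue, unsigned int* head, unsigned int tail, Work* out) {
  unsigned int i = atomicAdd(head, 1u);   // grab the next queued record
  if (i >= tail) return false;
  *out = queue[i];
  return true;
}

__global__ void uber_kernel(Work* queue, unsigned int* head, unsigned int* tail) {
  Work w;
  while (pop(queue, head, *tail, &w)) {
    switch (w.stage) {            // the "state machine" over logical stages
      case RAYGEN:
        // ...generate a ray, then push an INTERSECT record (push omitted)
        break;
      case INTERSECT:
        // ...traverse; a hit would push SHADE, a bounce would push INTERSECT
        // again, which is how recursion and cycles fit inside one launch
        break;
      case SHADE:
        // ...shade and write output
        break;
      case DONE:
        break;
    }
  }
}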
41. NVIDIA’s OptiX system uses this pattern to implement an entirely different rendering “pipeline,” entirely in software, and with recursion (which isn’t formally supported in the CUDA model). It uses a just-in-time “pipeline compiler” to generate the CUDA über-kernel for a given pipeline configuration and shader binding: effectively, a special-purpose continuation compiler. This points to one place I’d like to go next: explicit continuations as a low-level primitive, implemented as a static compiler transform.
42. One thing I’m playing with doing next: a general-purpose abstraction of this idea could be much simpler than domain-specific compilers like the Larrabee shader JIT or the OptiX pipeline JIT. The lesson is that those systems are complex because they fuse the proven need for application-specific scheduling with the compiler transform that supports it. Separation of concerns: contain the application-specific complexity in code within the system, while keeping the compiler transform totally agnostic. This is useful for many things: texture fetches with software latency hiding, as in Larrabee; recursive ray tracing; dynamically coalescing work items; and lower-level task and pipeline (producer-consumer) parallelism within generally data-parallel, arithmetically intense jobs. Are there good references for prior work in this area? I mean a *static* continuation/state-machine transform, not *dynamic*, heavyweight mechanisms like those from the Lisp world. A sketch of the transform follows below.
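To make the transform concrete, here is a hypothetical, hand-written version of what the static continuation/state-machine transform would produce for a shader that blocks on a texture fetch: the function is split at the fetch into explicit states, so a scheduler can issue the fetch, run other work items, and resume this one when the data arrives. All types and names are invented.

#include <vector>

struct TextureRequest { int texel; float* destination; };

// Original (logical) shader:
//   float shade(int texel) { float t = fetch(texel); return t * 0.5f; }
// After the state-machine transform:
struct ShadeTask {
  enum State { ISSUE_FETCH, USE_RESULT, FINISHED } state = ISSUE_FETCH;
  int   texel   = 0;
  float fetched = 0.0f;   // live value carried across the split point
  float result  = 0.0f;

  // Advances one state; returns true while more steps remain.
  bool step(std::vector<TextureRequest>& pending) {
    switch (state) {
      case ISSUE_FETCH:
        pending.push_back({texel, &fetched});  // start the long-latency fetch
        state = USE_RESULT;
        return true;                           // yield: scheduler runs other tasks
      case USE_RESULT:
        result = fetched * 0.5f;               // the continuation after the fetch
        state = FINISHED;
        return false;
      case FINISHED:
        return false;
    }
    return false;
  }
};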
43. Popping up another level: this pipeline as a whole is driven by sequences of passes, each made up of many draw batches.
44. The app’s logical rendering pipeline != the D3D pipeline, both for core algorithmic reasons and for performance optimization reasons. This is part of the motivation for programmable pipelines.
45. Pass folding is further motivation for programmable pipelines.