Graphics has been at the forefront of the resurgence of parallel computation. Real-time graphics and games have driven many of today’s new parallel programming models and architectures. Modern games are arguably the only successful mainstream application of highly parallel programming in heterogeneous, million-line codebases. But while graphics is thought of as an embarrassingly parallel application, there has been little success in implementing high-performance graphics systems in any single general-purpose parallel programming model, ironically including those that have come from the GPGPU community.
I will talk about the key patterns of parallelism and locality used in graphics pipelines and games, and how existing tools and monolithic programming models fail to express these patterns with sufficient efficiency. I will try to synthesize some directions for future programming systems based on this experience, including my current thoughts on how a compile-time continuation-passing transform could help formalize the patterns by which high-performance systems manually work around the limitations of existing GPU programming models.
This talk will be at least as much informal, educational and speculative as it will be about any currently active research.
Why Graphics Is Fast, and What It Can Teach Us About Parallel Programming
1. Why Graphics is Fast, and what it can teach us about parallel programming
Jonathan Ragan-Kelley, MIT
7 December 2009, University College London
13 November 2009, Harvard
2. (me)
PhD student at MIT
with Frédo Durand, Saman Amarasinghe
Previously
Stanford
Industrial Light + Magic
The Lightspeed Automatic Interactive Lighting Preview System
(SIGGRAPH 2007)
NVIDIA, ATI
Decoupled Sampling for Real-Time Graphics Pipelines
(ACM Transactions on Graphics 2010)
Intel
6. Game Engine Parallelism
Each task (~200-300/frame) is potentially:
data parallel
physics, sound, AI
image post-processing, streaming/decompression
pipeline parallel
(internally) task parallel
entire invocation of the graphics pipeline
↳ braided parallelism
18. Shader Execution Model
Highly data-parallel
vertices, primitives, fragments, pixels
Hierarchical parallelism
Instruction bandwidth: SIMD ALUs
Pipeline latency: vector execution/software pipelining
Unpredictable latency: hardware multithreading
Memory latency: dynamic threading/fibering
Task independence: multicore
Regular input, output
Marshaling/unmarshaling from data structures
handled in “fixed-function” for efficiency
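As a concrete (if simplified) illustration of this hierarchy, here is a hedged sketch of a shader-style data-parallel kernel in CUDA: one thread per fragment, threads within a warp mapping to SIMD lanes, blocks spreading across cores, and oversubscribed threads hiding memory latency. The Fragment type, the flat input/output arrays, and the shading arithmetic are hypothetical stand-ins for what the fixed-function stages would marshal in a real pipeline.
```
// Hedged sketch: a "shader-style" data-parallel kernel in CUDA.
// One thread per fragment: threads within a warp map to SIMD lanes,
// blocks spread across cores, and oversubscribing threads per core
// hides memory latency. Types and arrays here are illustrative
// stand-ins for data the fixed-function stages would marshal.
struct Fragment { float u, v; };                    // regular, pre-marshaled input

__global__ void shade_fragments(const Fragment* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global work-item id
    if (i >= n) return;                             // guard the ragged tail

    Fragment f = in[i];
    // "Shader" body: independent per-fragment arithmetic, no side effects.
    float r = f.u, g = f.v, b = 0.5f * (f.u + f.v);
    out[i] = make_float4(r, g, b, 1.0f);            // regular output stream
}

// Host-side launch (illustrative):
//   shade_fragments<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```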
19. Shader Implementation
Hardware shader architecture (GPU)
static SIMD warps
many dynamic hardware threads, tens of cores
fine-grained dynamic load balance
many kernels, types simultaneously
heuristic knowledge of specific pipeline
Software shader architecture (Larrabee)
similar static SIMD, inside dynamic latency-hiding,
inside many-core
threading via software fibering (microthreads),
software scheduling
20. Shader Stage-specific Variation
Vertex
less latency hiding needed,
so use less local memory,
[perhaps] no dynamic fibering (simpler schedule).
Geometry
variable output.
more state = less parallelism,
but (hopefully) need less latency hiding.
Fragment
derivatives → neighborhood communication.
texture intensive → more (dynamic) latency hiding.
21. Fixed-function Stages
[Input,Primitive] Assembly
Hairy finite state machine
Indirect indexing
Rasterization
Locally data-parallel (~16xSIMD), No memory access
Globally branchy, incremental, hierarchical
Output Merge
Really data-parallel, but ordered read-modify-write
Implicit data amplification (for antialiasing)
23. Ordering Semantics
Logically sequential
input primitive order defines framebuffer update order
Pixels are independent
spatial parallelism in framebuffer
Otherwise, buffer to reorder (asynchronous completions)
(Could be looser in custom pipelines)
24. Sort-Last Fragment
Shaders fully parallel
buffer to reorder between stages (logically FIFO)
100s-1,000s of triangles in-flight
Output Merger is screen-space parallel
crossbar from Pixel Shade to Output Merge (“sort-last”)
fine-grained scoreboarding
Full producer-consumer locality
all inter-stage communication on-chip
primitive-order processing
framebuffer cache filters some read-modify-write b/w
25. Sort-Middle
Front-end: transform/vertex processing
Scatter all geometry to screen-space tiles
Merge sort (through memory)
Maintain primitive order
Back-end: pixel processing
In-order (per-pixel)
scoreboard per-pixel → screen-space parallelism
One tile per-core
framebuffer data in local store
one scheduler, several worker [hardware] threads collaborating
Pixel Shader + Output Merge together
lower scheduling overhead
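To make the front-end/back-end split concrete, here is a minimal host-side sketch (assuming a hypothetical Triangle type, a fixed screen size, and a fixed tile grid) of the sort-middle structure: bin transformed triangles by their screen-space bounds in submission order, then process each tile independently with its framebuffer data kept local. This is illustrative, not any shipping renderer's code.
```
// Hedged sketch of the sort-middle structure (host C++; not a real renderer).
#include <algorithm>
#include <vector>

struct Triangle { float min_x, min_y, max_x, max_y; /* ...vertex data... */ };

constexpr int TILE = 64;                          // tile size in pixels (illustrative)
constexpr int TILES_X = 1920 / TILE, TILES_Y = 1088 / TILE;

// Front-end: scatter each (already transformed) triangle into every tile its
// screen-space bounding box touches, preserving submission order per bin.
std::vector<std::vector<Triangle>> bin_triangles(const std::vector<Triangle>& tris)
{
    std::vector<std::vector<Triangle>> bins(TILES_X * TILES_Y);
    for (const Triangle& t : tris) {
        int x0 = std::max(0, (int)t.min_x / TILE), x1 = std::min(TILES_X - 1, (int)t.max_x / TILE);
        int y0 = std::max(0, (int)t.min_y / TILE), y1 = std::min(TILES_Y - 1, (int)t.max_y / TILE);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[y * TILES_X + x].push_back(t); // order within a bin = primitive order
    }
    return bins;
}

// Back-end: one tile per core; each tile rasterizes, shades, and merges its
// triangles in order against framebuffer data held in local storage.
void process_tile(const std::vector<Triangle>& bin /*, local framebuffer tile */)
{
    for (const Triangle& t : bin) {
        (void)t; // rasterize, shade, and blend here, in primitive order
    }
}
```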
26. Return to the task system, to put it back in context and reinforce its braided nature.
Game Frame
29. Parallel Programming Models
Task parallelism (Cilk, Jade, etc.)
dynamic, heavyweight items
very flexible, high overhead
Data parallelism (NESL, FortranX, C*)
dynamic, lightweight items
flexible, moderate overhead
Streaming (StreamIt, Brook)
static data + pipeline parallelism
very low overhead, but inflexible ➞ load imbalance
30. CUDA, GPGPU
Canonical model → key challenges
thread = work-item, one kernel at a time → dynamic load balance
purely data parallel, streaming data access → producer-consumer locality
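A minimal sketch of that canonical model, and of why it fights producer-consumer locality: two logical pipeline stages become two separate kernel launches, so all intermediate data round-trips through off-chip global memory. The stage bodies and buffer names are purely illustrative.
```
// Hedged sketch: the canonical one-kernel-at-a-time GPGPU structure.
// Each stage is a full-device kernel; the only way to pass data from
// stage_a to stage_b is a global-memory buffer, so producer-consumer
// locality between stages is lost.
__global__ void stage_a(const float* in, float* mid, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread = work-item
    if (i < n) mid[i] = in[i] * 2.0f;                // purely data parallel
}

__global__ void stage_b(const float* mid, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mid[i] + 1.0f;               // streaming access
}

void run_pipeline(const float* d_in, float* d_mid, float* d_out, int n)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    stage_a<<<blocks, threads>>>(d_in, d_mid, n);    // whole device runs stage A
    stage_b<<<blocks, threads>>>(d_mid, d_out, n);   // then the whole device runs stage B
    // d_mid is written and re-read through DRAM: no on-chip producer-consumer
    // reuse, and the fixed launch shape gives no dynamic load balance between stages.
}
```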
32. Task decomposition continuing to grow
Past: subsystem threads: 2-3 threads utilized
Present: task systems: 6-8 threads, some headroom
Future:
many more tasks, reduce false dependence
orchestration language?
More data, braided parallelism within tasks
Essential to extracting FLOPS in future throughput chips
“Über-kernels”
(NVIDIA ray tracing, Reyes demos; OptiX system)
Dynamic task scheduling inside a single-kernel,
purely data-parallel architecture
33. Über-kernels
[pipeline diagram: stage 1 → stage 2 → stage 3]
stage_1():
  // do stage 1…
stage_2():
  // do stage 2…
stage_3():
  // do stage 3…
34. Über-kernels
[pipeline diagram: stage 1 → stage 2 → stage 3]
while(true):
  state = scheduler()
  switch(state):
    case stage_1:
      // do stage 1…
    case stage_2:
      // do stage 2…
    case stage_3:
      // do stage 3…
36. Über-kernels
[pipeline diagram: stage 1 → stage 2 → stage 3, with sub-state entry points]
while(true):
  state = scheduler()
  switch(state):
    case stage_1:
      // do stage 1…
    case stage_2_1:
      // beginning of 2…
    case stage_2_2:
      // end of 2…
    case stage_3_1:
      // beginning of 3…
    case stage_3_2:
      // end of 3…
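Pulling slides 33-36 together, here is a hedged CUDA sketch of the persistent-threads über-kernel pattern: every thread loops, asks a tiny software scheduler (here just an atomic counter over a pre-filled queue) for its next work item, and switches on that item's stage. The queue layout, stage bodies, and termination handling are deliberately simplified and hypothetical; a real system would also let stages push follow-on work.
```
// Hedged sketch of an über-kernel with a software scheduler.
enum Stage { STAGE_1, STAGE_2, STAGE_3 };

struct WorkItem { Stage stage; int data; };

// Hypothetical pre-filled global work queue; a real system would also let
// stages enqueue follow-on work and would need proper termination detection.
__device__ int      g_next = 0;
__device__ WorkItem g_queue[1 << 20];

__device__ int scheduler(int n_items)
{
    // Claim the next work item; -1 means "no work left, retire".
    int idx = atomicAdd(&g_next, 1);
    return (idx < n_items) ? idx : -1;
}

__global__ void uber_kernel(int n_items)
{
    while (true) {
        int idx = scheduler(n_items);
        if (idx < 0) return;                 // this thread is done
        WorkItem w = g_queue[idx];
        switch (w.stage) {
        case STAGE_1: /* do stage 1… */ break;
        case STAGE_2: /* do stage 2… */ break;
        case STAGE_3: /* do stage 3… */ break;
        }
    }
}

// Launched once with roughly enough threads to fill the machine, e.g.
//   uber_kernel<<<num_sms * blocks_per_sm, 256>>>(n_items);
```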
37. The Ray Tracing Pipeline
[OptiX system diagram]
Host: Buffers, Texture Samplers, Variables, Entry Points
Ray Generation Program, Exception Program
Traversal: Trace, Intersection Program, Any Hit Program, Selector Visit Program
Ray Shading: Closest Hit Program, Miss Program
39. Acknowledgements
CPS ideas
Tim Foley, Mark Lacey, Mark Leone - Intel
General Discussion
Saman Amarasinghe, Frédo Durand - MIT
Solomon Boulos, Kurt Akeley - Stanford/MSR
OptiX details, slides
Austin Robison, Steve Parker - NVIDIA
Current game-engine details, figures
Johan Andersson - DICE
44. Draw Batch
Group of primitives bound to common state
JIT compile, optimize on rebinding state
changing any shader stage
potential inter-stage elimination of dead outputs
late-binding constant folding
changing some fixed-function mode state
not on changing inputs (too expensive)
“near-static” compilation
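As a toy illustration of the inter-stage dead-output elimination mentioned above, here is a hedged host-side sketch: when a new pixel shader is bound, any vertex-shader output that no pixel-shader input reads can be dropped before the batch's shaders are recompiled. The ShaderIR representation is invented purely for illustration.
```
// Hedged sketch: pruning dead vertex-shader outputs at state-rebind time (host C++).
#include <set>
#include <string>
#include <vector>

struct ShaderIR {                       // invented, minimal stand-in for a shader IR
    std::vector<std::string> outputs;   // varyings written (vertex shader)
    std::set<std::string>    inputs;    // varyings read (pixel shader)
};

// Keep only the VS outputs that the newly bound PS actually reads, then hand
// the trimmed IR to the batch's "near-static" JIT compile.
std::vector<std::string> live_vs_outputs(const ShaderIR& vs, const ShaderIR& ps)
{
    std::vector<std::string> live;
    for (const std::string& o : vs.outputs)
        if (ps.inputs.count(o))
            live.push_back(o);          // referenced downstream: keep
    return live;                        // everything else is dead and dropped
}
```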
45. Rendering Pass
Common output buffer binding
with no swap/clear
Synthesize renderer from canonical pipeline
render shadow, environment maps
2D image post-processing
Optimize performance
avoid wasted work (Z cull pre-pass, deferred shading)
reorder/coalesce batches (deferred lighting)
46. Inter-pass Optimization
Buffers never consumed
keep in scratchpad
“fast clear”
use optimized/native formats
resolve antialiasing before write-out
Passes overlapped
no startup/wind-down bubbles
Pass folding
e.g. merge 2D post-processing into back-end stage,
on local tile memory
47. Control vs. Data-Dependence
Simple task graph has control dependence
between stages
D3D scheduling based on resource (data)
dependence, not control dependence
‣Finer-grained scheduling
‣Fewer false hazards
Generalize to whole engine core?
AI, sound
Physics: rigid, soft, IK, cloth, particles, fracture
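A minimal host-side sketch of resource-based (rather than control-based) dependence tracking in the spirit described above: each task declares the resources it reads and writes, and edges are derived from last-writer/reader hazards, so tasks that touch disjoint resources get no false dependences. The Task and resource representations are illustrative, not any real engine's API.
```
// Hedged sketch: deriving task dependences from declared resource usage (host C++).
#include <string>
#include <unordered_map>
#include <vector>

struct Task {
    std::string              name;
    std::vector<std::string> reads, writes;
    std::vector<int>         deps;   // indices of tasks this one must run after
};

// Build RAW/WAR/WAW edges: a task depends on the last writer of each resource
// it touches, and a writer also depends on readers since that last write.
void build_dependences(std::vector<Task>& tasks)
{
    std::unordered_map<std::string, int>              last_writer;
    std::unordered_map<std::string, std::vector<int>> readers_since_write;

    for (int i = 0; i < (int)tasks.size(); ++i) {
        for (const std::string& r : tasks[i].reads) {
            if (last_writer.count(r)) tasks[i].deps.push_back(last_writer[r]);  // RAW
            readers_since_write[r].push_back(i);
        }
        for (const std::string& w : tasks[i].writes) {
            if (last_writer.count(w)) tasks[i].deps.push_back(last_writer[w]);  // WAW
            for (int r : readers_since_write[w])
                if (r != i) tasks[i].deps.push_back(r);                         // WAR
            last_writer[w] = i;
            readers_since_write[w].clear();
        }
    }
    // Tasks touching disjoint resources end up with no edges at all: no false
    // control dependence inherited from a hand-written task graph.
}
```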
48. Core principles
Hierarchical parallelism
Shader-style data-parallel kernels
Kernel fusion/global optimization
Separate kernel and orchestration
Separate-but-connected languages/runtime models
Resource-based scheduling
Application-specific/user-controlled
scheduling and memory management
Editor's notes
Graphics and games among the most successful mainstream applications of highly parallel computing:
- multi-million-line code bases, many real subsystems - not 50-line kernel expressed in 100k lines for performance, like SciComp
- tight real-time budgets, memory constraints, …
- Consoles look like mini Computers of the Future:
- manycore, multithreaded
- SIMD, in-order/throughput-oriented
- heterogeneous: GPU + CPU + SPU
- Portable across platforms: PC, PS3, Xbox
- SIMD, threading, task systems
Philosophy: if we can get graphics right—complex, heterogeneous, dynamic, but also well-understood—then good start to generalize
So here I’m going to focus on explaining some key design patterns that have emerged from current graphics pipelines, to set the stage for the sorts of things future systems-programming tools need to support/express.
---
What I’m hoping to do: parallel programming system good enough for graphics and games on next-generation (software+throughput processor) systems.
That is: not just after graphics—whole game engine core, plus other complex/heterogeneous systems programming applications
I’m an Nth year PhD student at MIT, where I work with Fredo Durand, and more recently also Saman Amarasinghe
I’ve been in graphics and graphics systems for a while:
Stanford: lucky to spend 4 years with Pat Hanrahan
ILM: offline rendering, compilers - Lightspeed, at SIGGRAPH 2007
NVIDIA, ATI - graphics architecture - Decoupled Sampling, in revision for Trans. Graph., presented at next SIGGRAPH
Intel: graphics architecture (now software - Larrabee!),
data-parallel compilers, some chip architecture
as a motivating example, look at 1 modern game frame
1000s of independent entities:
- characters
- vehicles
- projectiles
explosions, physics
destruction
many lights, changing occlusion/visibility
In all:
10s of GFLOPS, 100s of thousands of lines of code executed
This is the task graph for a single frame from the engine used to make that image.
To render that:
hundreds of (heterogeneous) tasks
locally task, data, and pipeline parallel (point at graph)
Task = 10k-100k lines/each—fairly large
and even within those tasks
data,
pipeline,
task parallel
entire invocations of graphics pipeline
many levels => braided parallelism
Logical Direct3D 10 pipeline. Red = programmable. Blue = fixed-function.
“Fixed-function” is still configured by programmable state (and implied by shader input/output).
- Inter-stage data = always fixed flow/non-programmable
IA: reads from pointers to index and vertex buffers and pulls in vertex data (indirect/complex traversal). Finite state machine.
VS: user program. Can read (only) from textures.
PrimAsm: indexes the vertex stream and groups together whole triangles. Finite state machine.
GS: user program. Variable output (filter/amplify).
NOTE: programmable stages are delineated by a common one-to-one data stream.
Any time there is reordering or complex stream amplification/filtering, we split stages.
THIS IS THE LOGICAL PIPELINE, but in reality the fast implementation is…
New shader stages with new semantics, characteristics.
(Unordered) read-write memory from the pixel shader
Future: Programmable output blending?
Ordered read-modify-write buffers.
Also: the more that becomes software-controlled, the more expensive synchronization becomes.
As we’ll see: biggest difference between pipeline choice for LRB vs. GPU is based on cost→granularity of synchronization.
Needs to be able to drive essentially maximum resource utilization, sustained.
Dynamic load balance: load balance between stages shifts not just across apps or frames, but at very fine granularity within frame. (Triangles are different size. Shaders are different length.)
Pipeline parallelism: has to overlap operations, passes to avoid bubbles.
Producer-consumer locality is essential: way too much intermediate data to spill to memory
Task parallelism and producer-consumer are why (ironically) you can’t do a fast graphics pipeline in “GPGPU”/CUDA.
First, how parallelism is exploited
and how the different stages work.
Parallelism across all data types: vertices, primitives, fragments, pixels.
Hierarchical application of parallelism mirrors a hierarchical application of most major ideas from parallel, latency-tolerant (i.e. “throughput”) processor architecture.
To maintain high regularity within shaders,
Marshaling of data to/from complex or irregular structures is factored apart from the core programmable data-parallel shaders, in logically fixed-function stages.
There are two major schools of implementation today:
- In a hardware (NVIDIA) GPU,
- In a software graphics pipeline (Larrabee)
but the core hierarchical approach to exploiting parallelism and hiding latency is essentially the same, only the constants are shifted by the given implementation.
So the high-order bit:
- SIMD+vector+thread+core hierarchy to exploit parallelism
Balance very high arithmetic intensity with retaining dynamic load balance
- same basic model in software or hardware
- schedulers are application/pipeline-specific, so software has potential advantage
data-parallelism over different data types is subtly different, but in ways which necessitate different implementation and optimization tradeoffs for peak efficiency.
a fast graphics pipeline implementation must optimize for all of these stage specific characteristics,
and this is just in the “simple data-parallel kernels” parts of the pipeline.
Clearly: triviality of embarrassing parallelism is over-stated.
The fixed-function stages introduce even more workload variation.
- IA
- Rast
- ROP
Ordered read-modify-write buffers. (As always, order is the enemy of parallelism.)
Also: the more that becomes software-controlled, the more expensive synchronization becomes.
As we’ll see: biggest difference between pipeline choice for LRB vs. GPU is based on cost→granularity of synchronization.
Popping up a level, to the pipeline as a whole:
All this talk of parallelism and asynchrony, but the logical pipeline was scalar.
API-defined ordering is that all pixel updates must happen in exactly the order defined by the input primitive sequence.
This synchronization is a key reason why graphics is not trivially parallel. Many operations are order-dependent.
There still is parallelism:
- screen-space
- reorder buffer for asynchronous completion
I’ll also point out as an aside that these strict semantics exist to make the API highly predictable and usable for many applications.
In practice, could often be much looser, given application-specific semantics,
so I’ll speculate that application-defined software rendering pipelines will exploit optimization here
Here’s how those semantics are exploited in 2 kinds of modern pipelines…
Most Hardware GPUs.
Looks like logical pipeline -- because directly implemented to run API -- but highly parallel at most stages
Tradeoff in memory spill: stream vs. cache/buffer intra-pipeline data.
- Stream over intra-pipeline data (post-transformed primitives), cache framebuffer, vs:
- Stream over framebuffer, buffer intra-pipeline data.
Also critical: lower scheduling overhead, coarser-grained synchronization
To tie back to the whole, in a single frame
this pipeline as a whole is driven by sequences of passes made up of many draw batches
the app’s logical rendering pipeline != D3D pipeline
both for core algorithm reasons
and for performance optimization reasons
an entire one of those passes of the pipeline is just one node in this graph
(large circles, I think)
And there are many other task stages:
- physics simulations, sound
- culling
- streaming, decompression
- AI, agent updates
Others are going to become *more* data-parallel, *more* like graphics pipeline, to utilize future processors.
(Biggest current barrier to full-on GPGPU for these is latency/communication overhead,
but devs want to)
People have obviously built parallel programming models before.
Task-centric systems:
- dynamic (+)
- high overhead (-): can't achieve sufficient compute density (real work)
Data-parallel systems:
I’ve divided into more dynamic and more static ones.
Many historical DP systems had dynamic runtimes
- targeted dynamically independent processors
- overhead still too high
Streaming as in StreamIt is extreme end point of static data parallel optimization:
- very low overhead, but struggles with variable/dynamic loads.
Graphics on StreamIt: too static, load imbalance - obvious: triangles vary in size!
this is by no means all, just represents a sampling of systems covering major individual models I’ve talked about
but biggest key: focus has traditionally been on single-model parallel programming systems.
large systems like graphics and games require all of the above
Consider CUDA, because on massively parallel throughput machines, this is the leading candidate model actually available today.
Canonical model:
- thread per-work-item
- purely data-parallel, streaming data accesses
Challenges:
- struggle with dynamic load balance, like any semi-static
- producer-consumer locality
Ironically, these are two key things graphics pipelines have to do very well:
- dynamic load balance within and between tasks, data elements
- producer-consumer locality for bandwidth savings on huge streams
promising directions are emerging,
and challenges for the future are starting to become clear
Going forward, it’s interesting to look at how things might change.
Dynamic task decompositions will become even more important:
- first: just put each subsystem on a thread. doesn’t scale far.
- now: several hundred jobs/frame, goes somewhere.
- next: many more jobs.
Key issue in the future will be continuing to radically increase the number of independent tasks which can be extracted.
Problem with traditional “task systems” is false control dependence. Graphics Pipeline already shows how this can be overcome.
Pure data-flow?
Orchestration, complexity, dynamism in general will be challenging - language, tools?
------
Individual tasks will become internally more complex. Hierarchical decomposition, low-level data parallelism important for efficiency. Data-centric tasks will become More like graphics/rendering.
------
At the data-parallel level, one Interesting idea/trend emerged at SIGGRAPH this year:
dynamic scheduling between multiple in-flight logical kernels on their one-kernel-at-a-time GPUs.
every interesting demo from NVIDIA breaking the CUDA model
demonstrated many alternative rendering systems on GPUs, entirely in software, but breaking the pure data-parallel, streaming model to achieve essential performance and flexibility
The idea is: take what is logically a series of pipeline stages, which would otherwise be separate kernel launches, and fold them into a single kernel with an internal scheduler.
You can use this for recursive code and cyclic graphs, not just trivial pipelines.
You can even use it for dynamic branching between arbitrary points in the flow, by logically decomposing stages and entry/exit points into sub-states.
And when stages internally fan out to go data-parallel, they can do so dynamically.
---
NVIDIA OptiX uses this pattern
NVIDIA’s OptiX system uses this pattern to implement an entirely different rendering “pipeline,”
entirely in software, and with recursion (which isn’t formally supported in the CUDA model).
Uses a just-in-time “pipeline compiler” to generate the CUDA über-kernel for a given pipeline configuration and shader binding.
Effectively, a special-purpose continuation compiler.
This is the idea of one place I’d like to go next: explicit continuations as a low-level primitive.
Static compiler transform
One thing I’m playing with doing next:
General-purpose abstraction of this idea could be much simpler than domain-specific compilers like:
- Larrabee shader JIT
- OptiX pipeline JIT
Lesson: those systems are complex because they fuse proven need for application-specific scheduling with compiler transform to support it.
Separation of concerns:
Contain application-specific complexity to *code in system*, while keeping compiler transform totally agnostic.
Useful for many things:
- texture fetches with software latency hiding, as in Larrabee
- recursive ray tracing
- dynamically coalescing work items
- lower-level task and pipeline parallelism (producer-consumer) within generally data-parallel, arithmetically-intense jobs
---
Are there good references for prior work in this area?
*Static* continuation, state machine transform - not *dynamic*, heavy-weight mechanisms like from Lisp world.
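To make the kind of transform I have in mind concrete, here is a heavily hedged, hand-written sketch of what a static continuation/state-machine transform could produce for the software latency-hiding case: a per-fragment shader is split at a long-latency fetch into two states, and a trivial fiber loop interleaves many fragments so other fragments' arithmetic covers each outstanding fetch. All names (issue_texture_fetch, fetch_ready, Fiber) are invented stubs for illustration; this is not Larrabee's or anyone else's actual code.
```
// Hedged sketch (host C++): a shader split at a long-latency fetch into a
// two-state machine, scheduled by a trivial software fiber loop.
#include <vector>

// Invented stand-ins for an asynchronous texture unit.
struct FetchHandle { float value; };
static FetchHandle issue_texture_fetch(float u, float v) { return {u + v}; }  // stub
static bool  fetch_ready(const FetchHandle&)   { return true; }               // stub
static float fetch_result(const FetchHandle& h) { return h.value; }

// Original (logical) shader:
//   color = shade_pre(uv); t = texture(uv);   <-- long latency here
//   return shade_post(color, t);
// After a static continuation transform, the code after the fetch becomes a
// second state, and per-fragment registers live in an explicit frame:
enum class State { PreFetch, PostFetch, Done };

struct Fiber {
    State       state = State::PreFetch;
    float       u = 0, v = 0, color = 0;
    FetchHandle fetch{};
};

static void step(Fiber& f)
{
    switch (f.state) {
    case State::PreFetch:
        f.color = 0.5f * (f.u + f.v);               // shade_pre(uv)
        f.fetch = issue_texture_fetch(f.u, f.v);    // start the long-latency op
        f.state = State::PostFetch;                 // continuation = next state
        break;
    case State::PostFetch:
        if (!fetch_ready(f.fetch)) break;           // not ready: yield to another fiber
        f.color += fetch_result(f.fetch);           // shade_post(color, t)
        f.state = State::Done;
        break;
    case State::Done:
        break;
    }
}

// The scheduler just round-robins fibers, so other fragments' arithmetic
// hides each outstanding fetch.
static void run(std::vector<Fiber>& fibers)
{
    bool work_left = true;
    while (work_left) {
        work_left = false;
        for (Fiber& f : fibers)
            if (f.state != State::Done) { step(f); work_left = true; }
    }
}
```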
Popping up another level,
this pipeline as a whole is driven by sequences of passes made up of many draw batches
the app’s logical rendering pipeline != D3D pipeline
both for core algorithm reasons
and for performance optimization reasons
THIS IS PART OF THE MOTIVATION FOR PROGRAMMABLE PIPELINES
pass folding is further motivation for programmable pipelines