Efficient floating-point texture decompression

Tomi Aarnio (NRC Tampere)
Claudio Brunelli (NRC Tampere)
Timo Viitanen (TUT)

Texturing pipeline in a GPU
- Memory bandwidth is the worst bottleneck
- Cache size is another
- Texture compression can alleviate both!
- Must be very fast: ~40 gigatexels/sec

The established solution
- Nearly all existing schemes work the same way
  - Partition the image into blocks of 4 x 4 pixels
  - Compress each block independently
  - Use a fixed compression ratio (6:1)
- Our focus is on high dynamic range (HDR) textures
  - RGB colors in 16-bit floating-point (FP16)
  - Compressed from 48 bits per pixel, down to 8 bpp
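The arithmetic behind these numbers can be sketched as follows. A 4 x 4 block at 8 bpp occupies 128 bits, and a fixed block size means the block holding any texel can be located with simple index arithmetic. The row-major block layout and the function name `block_offset` are assumptions made for illustration; the slides do not specify a memory layout, and real GPUs typically use tiled layouts.

```python
BLOCK_DIM = 4            # blocks are 4 x 4 pixels (from the slide)
BITS_PER_TEXEL = 8       # compressed rate: 8 bpp (from the slide)
UNCOMPRESSED_BPP = 48    # RGB in FP16: 3 x 16 bits

# 16 texels x 8 bits = 128 bits = 16 bytes per compressed block
BLOCK_BYTES = BLOCK_DIM * BLOCK_DIM * BITS_PER_TEXEL // 8


def block_offset(x: int, y: int, width: int) -> int:
    """Byte offset of the compressed block containing texel (x, y).

    Assumes blocks are stored row by row (a hypothetical layout).
    """
    blocks_per_row = (width + BLOCK_DIM - 1) // BLOCK_DIM
    block_index = (y // BLOCK_DIM) * blocks_per_row + (x // BLOCK_DIM)
    return block_index * BLOCK_BYTES


if __name__ == "__main__":
    print("bytes per block:", BLOCK_BYTES)
    print("compression ratio:", UNCOMPRESSED_BPP // BITS_PER_TEXEL)
    print("offset of texel (5, 0) in a 256-wide texture:", block_offset(5, 0, 256))
```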

FP16 texture compression
1. Roimela et al. [SIGGRAPH 2006, I3D 2008]
2. Munkberg et al. [SIGGRAPH 2006, CGF 2008]
3. Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]
4. BC6H/BPTC [DirectX 11, OpenGL 4]

Far too high complexity

Our contribution
- Implemented and optimized #1 (a.k.a. "NXR")
- Benchmarked against #4

Baseline decoder
[Block diagram: "Extract bitfields" yields R, B, Lexponent and Lmantissa. Each of the Red, Green and Blue outputs is produced by an int-to-fp16 converter followed by an fp16 multiplier. Lexponent and Lmantissa pass through an fp16 normalizer.]

Optimizations
[Same block diagram as the baseline decoder, with two parts of the datapath marked "Simplify this".]

Optimizations (Part 1)
- Red and Blue are in 0.10-bit fixed point
  - Can be treated as fp16 denormals with no conversion logic
- Simplify the multipliers (L*R and L*B)
  - Exponent can't increase: remove biasing and overflow logic
  - Mantissa will fit in 1.20 fixed point: remove overflow logic
  - At most 10 leading zeros: truncate post-normalizers
  - No need to deal with signs, infinities and NaNs
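The denormal observation can be checked in a few lines. An fp16 bit pattern with a zero sign and a zero exponent field encodes mantissa x 2^-24, so a 10-bit fixed-point value placed directly in the mantissa field is already a valid fp16 number, equal to the intended fraction scaled by the constant 2^-14. The function name `fixed10_as_fp16` is made up for this sketch, which uses Python's `struct` half-precision format to reinterpret the bits.

```python
import struct


def fixed10_as_fp16(r: int) -> float:
    """Reinterpret a 10-bit fixed-point value as fp16 bits, unchanged.

    With sign = 0 and exponent field = 0, the 16-bit pattern is simply r,
    which fp16 decodes as the denormal r * 2**-24.
    """
    if not 0 <= r < 1024:
        raise ValueError("r must fit in 10 bits")
    return struct.unpack("<e", struct.pack("<H", r))[0]


if __name__ == "__main__":
    r = 512  # 0.5 in 0.10 fixed point
    # Same value as (r / 1024) scaled by the constant 2**-14
    print(fixed10_as_fp16(r), (r / 1024) * 2.0 ** -14)
```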

Optimized decoder
[Block diagram. Legend: CLZ = Count Leading Zeros; << = Shift Left; 10 x 11-bit multiplier. "Extract bitfields" yields R, B, Lexponent and Lmantissa. Red path: multiplier on R and Lmantissa, CLZ, <<, Rmantissa, Rexponent (from Lexponent), then "Clamp, Shift & Pack". Blue path: the same structure with B, Bmantissa and Bexponent. The Green path is not yet drawn.]
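A software sketch of one such channel path (multiply, count leading zeros, shift, pack) may make the datapath easier to follow. It rests on assumptions not stated in the slides: L is taken to be a positive normal fp16 value given as a 5-bit exponent and 10-bit mantissa, the chroma value is a 0.10 fixed-point fraction, and the function name `scale_fp16_by_fixed10` is hypothetical. The result is truncated toward zero, and the clamp stage is reduced to handling underflow into denormals.

```python
import struct


def scale_fp16_by_fixed10(l_exp: int, l_man: int, r: int) -> int:
    """Return the fp16 bit pattern of L * (r / 1024), rounded toward zero.

    Assumes 1 <= l_exp <= 30 (L is a positive normal fp16), 0 <= l_man < 1024,
    and 0 <= r < 1024.
    """
    # 11-bit mantissa (with implicit leading 1) times 10-bit fraction:
    # the product always fits in 21 bits, i.e. 1.20 fixed point.
    product = (1024 + l_man) * r
    if product == 0:
        return 0

    # Count leading zeros within the 21-bit field. Since product >= 1024
    # whenever r >= 1, there are at most 10 of them.
    lz = 21 - product.bit_length()
    normalized = product << lz       # bit 20 is now set
    exp = l_exp - lz                 # the exponent can only decrease

    if exp >= 1:
        # Normal result: drop the implicit 1 and the low 10 bits.
        return (exp << 10) | ((normalized >> 10) & 0x3FF)
    # Underflow: emit a denormal with exponent field 0.
    return normalized >> (11 - exp)


def fp16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit pattern as an fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


if __name__ == "__main__":
    # L = 1.0 (exponent field 15, mantissa 0) scaled by 0.5
    print(fp16_bits_to_float(scale_fp16_by_fixed10(15, 0, 512)))
```

Truncating instead of rounding to nearest keeps the result within one unit in the last place below the exact product.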

Optimizations (Part 2)
- Eliminate the green channel multiplier
  - LG = L (1024 - (R + B)) = 1024L - (LR + LB)
  - Two 20-bit adders are much cheaper than a 10-bit multiplier
- Round to zero instead of nearest
  - Introduces a maximum of 1-bit error
  - Compression error is much larger, 4-8 bits
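The green-channel identity is easy to confirm numerically. The sketch below treats L as an 11-bit integer mantissa and R, B as 10-bit fractions with R + B <= 1024 (an assumption about the value range; the slide does not state it). The products LR and LB are the ones the red and blue paths compute anyway, 1024L is a 10-bit left shift, and what remains is one addition and one subtraction. The function name `green_product` is made up for illustration.

```python
def green_product(l: int, lr: int, lb: int) -> int:
    """Compute L * (1024 - (R + B)) from the already available LR and LB.

    1024 * L is a shift, so no multiplier is needed for the green channel.
    """
    return (l << 10) - (lr + lb)


if __name__ == "__main__":
    l, r, b = 1500, 300, 200
    lr, lb = l * r, l * b                 # reused from the red and blue paths
    print(green_product(l, lr, lb), l * (1024 - (r + b)))
```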

Optimized decoder
[Complete block diagram. Legend: CLZ = Count Leading Zeros; << = Shift Left; 10 x 11-bit multiplier. "Extract bitfields" yields R, B, Lexponent and Lmantissa. The Red and Blue paths each use a multiplier, CLZ, << and "Clamp, Shift & Pack", producing Rmantissa/Rexponent and Bmantissa/Bexponent. The Green path has no multiplier: G is derived from the other products, then passes through CLZ, << and "Clamp, Shift & Pack", producing Gmantissa and Gexponent.]

FPGA synthesis (Altera Stratix III)

ASIC synthesis @ 180 nm (Synopsys)
- Only one of 14 modes. A complete decoder would be somewhat larger.
- Relatively long critical path, due to leading-zero counters.

Summary
- VHDL implementation of a floating-point texture decoder
- Our optimizations reduced area by ~50%
- Competing decoder turned out 75% larger
- Main weakness: long critical path
- Completely feasible to put on real hardware

Future work
- Measure power consumption
  - More important than silicon area
- Optimize the long latency
  - Can also help reduce area & power
- Implement an encoder in ASIC
  - Textures are increasingly generated in real time
