Presentation at SoC 2010 (International Symposium on System-on-Chip) in Tampere, Finland. The full paper is available at IEEE Xplore (http://dx.doi.org/10.1109/ISSOC.2010.5625555).
7. The established solution. Nearly all existing schemes work the same way: partition the image into blocks of 4 x 4 pixels, compress each block independently, and use a fixed compression ratio (6:1). Our focus is on high dynamic range (HDR) textures: RGB colors in 16-bit floating point (FP16), compressed from 48 bits per pixel down to 8 bpp.
8. FP16 texture compression. (1) Roimela et al. [SIGGRAPH 2006, I3D 2008]; (2) Munkberg et al. [SIGGRAPH 2006, CGF 2008]; (3) Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]; (4) BC6H/BPTC [DirectX 11, OpenGL 4].
9. FP16 texture compression. (1) Roimela et al. [SIGGRAPH 2006, I3D 2008]; (2) Munkberg et al. [SIGGRAPH 2006, CGF 2008]; (3) Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]; (4) BC6H/BPTC [DirectX 11, OpenGL 4]. Far too high complexity.
10. FP16 texture compression. (1) Roimela et al. [SIGGRAPH 2006, I3D 2008]; (2) Munkberg et al. [SIGGRAPH 2006, CGF 2008]; (3) Sun et al. [Graphics Hardware 2008, IEEE TVCG 2010]; (4) BC6H/BPTC [DirectX 11, OpenGL 4]. Our contribution: implemented and optimized #1 (a.k.a. "NXR"); benchmarked against #4.
11. Baseline decoder. [Block diagram. Recoverable labels: Extract bitfields (R, B, Lexponent, Lmantissa); an int-to-fp16 converter and an fp16 multiplier for each of Red, Green and Blue; a constant 2^10; an fp16 normalizer at Lexponent and Lmantissa.]
12. Optimizations. [The baseline decoder block diagram from slide 11, repeated with two "Simplify this" callouts.]
13. Optimizations (Part 1). Red and Blue are in 0.10-bit fixed point, so they can be treated as fp16 denormals with no conversion logic. Simplify the multipliers (L*R and L*B). The exponent can't increase, so remove biasing and overflow logic. The mantissa will fit in 1.20 fixed point, so remove overflow logic. There are at most 10 leading zeros, so truncate the post-normalizers. There is no need to deal with signs, infinities and NaNs.
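The denormal observation on this slide can be checked numerically. The sketch below is illustrative Python, not the authors' VHDL, and the helper name fixed10_as_fp16 is hypothetical. It reinterprets a 10-bit value as the mantissa field of an fp16 number whose sign and exponent fields are zero, and relies on the IEEE 754 rule that such a denormal equals m * 2^-24, which is the fixed-point value m / 1024 scaled by the constant 2^-14.

```python
import struct

def fixed10_as_fp16(m: int) -> float:
    """Reinterpret a 10-bit integer as an fp16 bit pattern (sign = 0, exponent = 0)."""
    if not 0 <= m < 1024:
        raise ValueError("expected a 10-bit value")
    # 'H' packs the raw 16 bits; 'e' reads them back as IEEE 754 half precision.
    return struct.unpack("<e", struct.pack("<H", m))[0]

# The denormal differs from the 0.10 fixed-point value m / 1024 only by the
# constant factor 2**-14, so the bits can be used as they are.
for m in (0, 1, 512, 1023):
    assert fixed10_as_fp16(m) == (m / 1024) * 2.0**-14
```

Because the scale factor is a fixed power of two, it can be absorbed into the exponent handling instead of a separate int-to-fp16 converter.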
15. Optimized decoder. [Block diagram. Legend: CLZ = count leading zeros; << = shift left; multiplier symbol = 10 x 11-bit multiplier. Recoverable labels: Extract bitfields (R, B, Lexponent, Lmantissa); on the red path CLZ, <<, Clamp, Shift & Pack, Rexponent and Rmantissa; outputs Red, Green and Blue.]
16. Optimized decoder. [Block diagram as on slide 15, with the blue path labels added: CLZ, <<, Clamp, Shift & Pack, Bexponent and Bmantissa.]
17. Optimizations (Part 2). Eliminate the green channel multiplier: L*G = L*(1024 - (R + B)) = 1024*L - (L*R + L*B). Two 20-bit adders are much cheaper than a 10-bit multiplier. Round to zero instead of nearest: this introduces a maximum of 1-bit error, while the compression error is much larger, 4-8 bits.
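The identity behind the green channel can be checked with plain integer arithmetic. This is a sketch rather than the authors' implementation: the function name green_product is hypothetical, L is assumed to be an 11-bit mantissa with the leading one included, and R and B are assumed to be 10-bit values with R + B at most 1024.

```python
def green_product(l_man: int, lr: int, lb: int) -> int:
    """Compute L*G from the already available products L*R and L*B.

    One shift, one addition and one subtraction replace a multiplier:
    L*G = 1024*L - (L*R + L*B).
    """
    return (l_man << 10) - (lr + lb)

l_man, r, b = 0x5A3, 300, 200      # arbitrary example values (assumed ranges)
lr, lb = l_man * r, l_man * b      # products the red and blue paths compute anyway
g = 1024 - (r + b)                 # green weight implied by R and B
assert green_product(l_man, lr, lb) == l_man * g
```

The multiplication by 1024 is only a left shift by 10 bits, so the green path needs no multiplier hardware at all.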
18. Optimized decoder. [Block diagram as on slide 15, now with all three paths. Labels added for the green path: CLZ, <<, Clamp, Shift & Pack, Gexponent, Gmantissa and a constant 2^20.]
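To make the multiply, CLZ, shift and pack stages of the diagram concrete, here is a rough software model of one channel. It is a sketch under stated assumptions, not the decoder from the paper: the function names are hypothetical, Lexponent is assumed to be a biased fp16 exponent in the range 1..30, Lmantissa a 10-bit fraction with an implicit leading one, and the colour factor a 10-bit value in 0.10 fixed point. The clamp stage is omitted because the slides do not describe it.

```python
import struct

def pack_product_fp16(l_exp: int, l_man: int, c: int) -> int:
    """Multiply luminance by a 10-bit fixed-point factor and pack as fp16 bits.

    l_exp: assumed biased fp16 exponent of L (1..30); l_man: 10-bit fraction;
    c: 10-bit factor in 0.10 fixed point. Truncates (rounds toward zero).
    """
    p = (0x400 | l_man) * c        # 11 x 10-bit product in 1.20 fixed point
    if p == 0:
        return 0
    lz = 21 - p.bit_length()       # leading zeros in the 21-bit product (0..10)
    exp = l_exp - lz               # the exponent can only decrease
    if exp <= 0:
        # The result is an fp16 denormal: shift right instead of normalizing.
        return p >> (11 - l_exp)
    norm = p << lz                 # the leading one now sits at bit 20
    return (exp << 10) | ((norm >> 10) & 0x3FF)

def fp16_bits_to_float(bits: int) -> float:
    """Decode a raw 16-bit pattern as IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

# Example: L = 1.5 (exponent 15, fraction 0x200), factor 512/1024 = 0.5.
assert fp16_bits_to_float(pack_product_fp16(15, 0x200, 512)) == 0.75
```

Dropping the low bits of the shifted product is what rounding toward zero amounts to in this model, so no rounding adder is needed.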
21. ASIC synthesis @ 180 nm (Synopsys). Only one of 14 modes. A complete decoder would be somewhat larger.
22. ASIC synthesis @ 180 nm (Synopsys). Relatively long critical path, due to leading-zero counters.
23. Summary. VHDL implementation of a floating-point texture decoder. Our optimizations reduced area by ~50%. The competing decoder turned out 75% larger. Main weakness: long critical path. Completely feasible to put on real hardware.
24. Future work. Measure power consumption, which is more important than silicon area. Optimize the long latency, which can also help reduce area and power. Implement an encoder in ASIC, since textures are increasingly generated in real time.