SlideShare una empresa de Scribd logo
1 de 48
Descargar para leer sin conexión
Iago Toral Quiroga
<itoral@igalia.com>
Bringing ARB_gpu_shader_fp64 to
Intel GPUs
XDC 2016
Helsinki, Finland
Contents
→ ARB_gpu_shader_fp64
→ Overview
→ Scope
→ Intel implementation
→ NIR
→ i965
→ Current status
ARB_gpu_shader_fp64
● New GLSL types: double, dvecX and dmatX
● Variables, Inputs, outputs, uniforms, constants (LF suffix)
● No 64-bit vertex attributes (handled by another extension)
● No 64-bit fragment shader outputs (no 64-bit FBs)
● Arithmetic and relational operators supported
● Most built-in functions can take the new types
● Some exceptions: angle, trigonometry, exponential, noise
● New packing functions: (un)packDouble2x32
● Conversions from/to 32-bit types
● No interpolation
Scope (i965)
● Combined work by Intel and Igalia for over a year
● ~260 new patches to add NIR (~60) and i965 support (~100
Broadwell+, ~100 Haswell)
● More IvyBridge patches on the way!
● Usable across all shader stages
● 3 IRs to support: NIR, i965/align1, i965/align16
● Lots of GL functionality involved: ALU, varyings, uniforms, UBO,
SSBO, shared variables, etc.
● Lots of internal driver modules involved: Optimization passes,
liveness analysis, register spilling, etc.
Scope (Piglit)
● Piglit coverage was limited
● Mostly focused on ALU operations.
● ~2,000 new tests added
NIR
NIR
● Added support for bit-sized ALU types:
● nir_type_float32 = 32 | nir_type_float, etc
● nir_alu_get_type_size(), nir_alu_get_base_type()
● We need to be a bit more careful now:
- if (alu_info.output_type == nir_type_bool) {
+if (nir_alu_type_get_base_type(alu_info.output_type) == nir_type_bool) {
NIR
● Lots of plumbing to deal correctly with bit-sizes
everywhere.
● Algebraic rules and bit-size validation
● New double-precision opcodes:
● d2f, d2i, d2u, d2b, f2d, i2d, u2d
● (un)pack_double_2x32
● etc
NIR
● Added lowering for unsupported 64-bit operations on
Intel GPU hardware:
● trunc, floor, ceil, fract, round, mod, rcp, sqrt, rsq
● nir_lower_doubles(). Implemented in terms of:
● Supported 64-bit operations
● 32-bit float/integer math
● Drivers can choose which lowerings to use:
● nir_lower_doubles_options
i965 – Fp64
(Align1)
Align1 – SIMD8 - 32bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
mov(8) g0.0<1>F g1.1<8,4,2>F { align1 1Q }
Row 1
Row 2
Align1 – SIMD8 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
g4
g5
g6
128B
160B
192B
g7224B
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
c0:DF
c7:DF
Align1 – SIMD8 - 64bit
(Haswell+)
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
96B
Xl0 Xh0 Xl1 Xh1 Xl2 Xh2 Xl3 Xh3
mov(8) g2<1>DF g0<4,4,1>DF { align1 1Q }
Xl0 Xh0 Xl1 Xh1 Xl2 Xh2 Xl3 Xh3
Xl4 Xh4 Xl5 Xh5 Xl6 Xh6 Xl7 Xh7
Xl4 Xh4 Xl5 Xh5 Xl6 Xh6 Xl7 Xh7
Row 1
Row 2
Align1 – SIMD8 - 64bit
(Haswell+)
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
96B
Xl0
Xl4
Xh0
Xh4
Xl1
Xl5
Xh1
Xh5
Xl2
Xl6
Xh2
Xh6
Xl3
Xl7
Xh3
Xh7
mov(8) g2.1<2>UD g0.1<8,4,2>UD { align1 1Q }
Xh0 Xh1 Xh2 Xh3
Xh4 Xh5 Xh6 Xh7
● Use the subscript() helper:
subscript(reg, BRW_REGISTER_TYPE_UD, 1)
i965 – Fp64
(Align16)
Align16 – SIMD4x2 - 32bit
mov(8) g0.0<1>.xyF g1.0<4,4,1>.ywwwF { align16 1Q }
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
16B boundary
X Y Z W X Y Z W
0B boundary
FixedFixed
x3 x3
Row 1
Row 2
Align16 - 64bit
● Broadwell+ used to need Align16 for Geometry and
Tessellation shaders.
● Nowadays these platforms are fully scalar and Align1 support is
sufficient to expose Fp64
● Older platforms (Haswell, IvyBridge, etc) still need Align16 for
Vertex, Geometry and Tessellation shaders.
Align16 - 64bit
● Swizzle channels are 32-bit, even on 64-bit operands
● We can only address DF components XY directly!
● Writemasks are 64-bit for DF destinations though
● WRITEMASK_XY and WRITEMASK_ZW are 32-bit though → no
native representation
Align16 - SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh Yl Yh Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q }
Fixed
Thread boundary
Vertex 1
Vertex 2
Align16 - SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh Yl Yh Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q }
Thread boundary
● Align16 requires 16B alignment
→ 2 DF components in each row.
● Vstride=2 to cover the entire region
Align16 - SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh Yl Yh Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q }
● Each 16B region applies the 4-component 32-bit swizzle
Thread boundary
Align16 - SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh Yl Yh Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q }
● We can’t do all swizzle combinations:
● XXXX, YZYZ, XYYZ, etc. are not supported..
● We need to translate our 64-bit swizzles to 32-bit.
– X → XY, Y → ZW, Z → ?, W → ?
Thread boundary
Align16 - SIMD4x2 - 64bit
XY / ZW component splitting
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh Yl Yh Yh
Zl Zh Wl Zl Zh Wl Wh
mov(8) g2<1>.xywDF g0<2,2,1>.zwyDF
mov(4) g2<1>.xyDF g0+1<2,2,1>.xyzwDF
mov(4) g2+1<1>.yDF g0<2,2,1>.zwzwDF
Wh
Xl Xh
Thread boundary
Yl
Zl Zh Wl Zl Zh Wl WhWh
Yl Yh YhYl
mov 1
mov 2
Align16 - SIMD4x2 - 64bit
XY / ZW component splitting
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh Yl Yh Yh
Zl Zh Wl Zl Zh Wl WhWh
Xl Xh
Thread boundary
Yl
Thread boundary0 0 0 0
mov(4) g2<1>.xDF g0<2,2,1>.xyxyDF { align16 1Q }
emask
Xl Xh Xl Xh
Wrong!
Not used
1 1 1 1
Align16 - SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh
Yl Yh
Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q }
● Back to square one…
● Z → XY, W → ZW (at 16B offset)
Yl Yh
Yl Yh
Yl Yh
Yl Yh
Align16 – SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh
Zl Zh
Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0.2<2,2,1>.xyzwDF { align16 1Q }
● Not good enough (and violates register region restrictions)
● We could use a combination of vstride=0 and SIMD splitting.
Yl Yh
Zl Zh
Wl Wh
Wl Wh
Align16 - SIMD4x2 - 64bit
● Gen7 hardware seems to have an interesting bug feature:
● The second half of a compressed instruction with vstride=0 will ignore
the vstride and offset exactly 1 register
● We can use this to avoid the SIMD splitting
Align16 - SIMD4x2 - 64bit
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh
Zl Zh
Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
mov(8) g2<1>.xyDF g0.2<0,2,1>.xyzwDF { align16 1Q }
Yl Yh
Zl Zh
Wl Wh
Wl Wh
x2
x2
Align16 - SIMD4x2 - 64bit
mov(8) g2<1>.xyDF g0.2<0,2,1>.xyzwDF { align16 1Q }
● Remember that issue with 32-bit writemasks?
● WRITEMASK_XY == WRITEMASK_X
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh
Zl Zh
Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
Yl Yh
Zl Zh
Align16 - SIMD4x2 - 64bit
mov(8) g2<1>.xDF g0.2<0,2,1>.xyzwDF { align16 1Q }
mov(8) g2<1>.yDF g0.2<0,2,1>.xyzwDF { align16 1Q }
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl Xh
Zl Zh
Zl Zh Wl Wh
Xl Xh Yl Yh Zl Zh Wl Wh
Yl Yh
Zl Zh
Wl Wh
Wl Wh
Align16 - SIMD4x2 - 64bit
● This is just an example:
● Different swizzle combinations may require different implementations
● 2-src instructions and 3-src instructions
Align16 - SIMD4x2 - 64bit
● Implementation:
● Step 1: scalarize everything, swizzle translation at codegen - Done
● Step 2: let through swizzle classes that we can support natively (e.g.
XYZW) - Done
● Step 3: let through swizzle classes that we can support by exploiting
the vstride=0 behavior (e.g. XXXX) - Done
● Step 4: use component splitting (partial scalarization) to support
more swizzle classes – Not Done (yet)
i965 – Fp64
Common Issues
Multiple hardware generations
● Significant differences between IvyBridge, Haswell and
Broadwell+ hardware
● Skylake did not require specific adaptations
● Broxton, CherryView and Braswell only required minor tweaks:
– 32b to 64b conversions need 64b aligned source data
– 64b indirect addressing not supported
32-bit driver
● Before fp64 all GLSL types were implemented as 32-bit types.
● Driver code assumed 32-bit types (and even hstride=1) in lots of
places.
● Manyfixes like:
- int dst_width = inst->exec_size / 8;
+ int dst_width =
DIV_ROUND_UP(inst->dst.component_size(inst->exec_size), REG_SIZE);
● This could happen anywhere in the driver
● Piglit was the driving force to find these
Unfamiliar code patterns
● Fp64 operation produces new code patterns:
● 32-bit access patterns on low/high 32-bit chunks of 64-bit data
● Horizontal strides != 1
● Some parts of the driver did not handle these scenarios
properly.
– Copy-propagation received at least 7 patches!
32-bit read/write messages
● All read/write messages are 32-bit
● Pull loads, UBOs, SSBOs, URB, scratch...
● 64bit data needs to be shuffled into 32-bit channels before
writing
● 32bit data reads need to be shuffled into valid 64bit data
channels
● shuffle_64bit_data_for_32bit_write()
● shuffle_32bit_load_result_to_64bit_data()
32-bit read/write messages
(SIMD8 read)
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
X X X X X X X X
Y Y Y Y Y Y Y Y
Z Z Z Z Z Z Z Z
W W W W W W W W
4x4B = 16B
● Just what we want for 32-bit scalar operation
● 8x16B SIMD8 read messages
● Separate variables (registers) for each component
● Consecutive components in consecutive registers
32-bit read/write messages
(SIMD8 read)
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
Xl
Xh
Yl
Yh
g4
g5
g6
128B
160B
192B
g7224B
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Zl
Zh
Wl
Wh
Invalid 64b data
read(+0B)
read(+16B)
32-bit read/write messages
(SIMD8 read)
g0
g1
g2
c0 c1 c2 c3 c4 c5 c6 c7
0B
32B
64B
g396B
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
Xl
Xl
Yl
Yl
Xh
Xh
Yh
Yh
g4
g5
g6
128B
160B
192B
g7224B
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
Zl
Zl
Wl
Wl
Zh
Zh
Wh
Wh
read(+0B)
read(+16B)
64-bit immediates (gen7)
● No support for 64-bit immediates (Haswell, IvyBridge)
● Haswell provides the DIM instruction specifically for this purpose, we
just had to add support for this in the driver.
● IvyBridge requires that we emit code to load each 32-bit chunk of the
constant into a register and then return either a XXXX swizzle
(align16) or a stride 0 (align1).
Bugs / restrictions (Align1)
● Second half of compressed instructions that don’t write all
channels have wrong emask (Haswell)
● Requires SIMD splitting
● Second half of compressed 64-bit instructions has wrong
emask (IvyBridge)
● Requires unconditional SIMD splitting of all 64-bit instructions :-(
Bugs / restrictions (Align16)
● Vertical stride 0 doesn’t work (gen7)
● The second half of a compressed instruction will invariably offset a full register
● Requires SIMD splitting
● Instructions that write 2 registers must also read 2 registers (gen7)
● This was a known issue in gen7, but it was never triggered in Align16 before
Fp64
● Requires SIMD splitting
A simple test that just copies a DF uniform to the output hits both of
these bugs! :-(
Bugs / restrictions (Align16)
● 3-src instructions can’t use RepCtrl=1
● Not supported for 64-bit instructions
● Only affects to MAD
● RepCtrl=0 leads to <4,4,1>:DF regions so it can only be used to work
with components XY
– Requires temporaries and component splitting, but leads to quite bad code in
general
– Avoiding MAD altogether seems a better option for now
Bugs / restrictions (Align16)
● Compressed bcsel
● Does not read the predication mask properly
● Requires SIMD splitting
● Dependency control
● Can’t be used with 64-bit instructions → GPU hangs
Current State
Status
● Skylake: Available in Mesa 12.0
● Broadwell: Available in Mesa 12.0
● Haswell:
● ARB_gpu_shader_fp64: Implemented, in review
● ARB_vertex_attrib_64bit: Implemented
● IvyBridge:
● ARB_gpu_shader_fp64: Align1 implemented, Align16 implementation
in progress
● ARB_vertex_attrib_64bit: Not started
Questions?

Más contenido relacionado

Más de Igalia

Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerIgalia
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in MesaIgalia
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIgalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera LinuxIgalia
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVMIgalia
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsIgalia
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesIgalia
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSIgalia
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webIgalia
 
Replacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shadersReplacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shadersIgalia
 
I'm not an AMD expert, but...
I'm not an AMD expert, but...I'm not an AMD expert, but...
I'm not an AMD expert, but...Igalia
 
Status of Vulkan on Raspberry
Status of Vulkan on RaspberryStatus of Vulkan on Raspberry
Status of Vulkan on RaspberryIgalia
 
Enable hardware acceleration for GL applications without glamor on Xorg modes...
Enable hardware acceleration for GL applications without glamor on Xorg modes...Enable hardware acceleration for GL applications without glamor on Xorg modes...
Enable hardware acceleration for GL applications without glamor on Xorg modes...Igalia
 
Async page flip in DRM atomic API
Async page flip in DRM  atomic APIAsync page flip in DRM  atomic API
Async page flip in DRM atomic APIIgalia
 
From the proposal to ECMAScript – Step by Step
From the proposal to ECMAScript – Step by StepFrom the proposal to ECMAScript – Step by Step
From the proposal to ECMAScript – Step by StepIgalia
 
Migrating Babel from CommonJS to ESM
Migrating Babel     from CommonJS to ESMMigrating Babel     from CommonJS to ESM
Migrating Babel from CommonJS to ESMIgalia
 
The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...
The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...
The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...Igalia
 
Freedreno on Android – XDC 2023
Freedreno on Android          – XDC 2023Freedreno on Android          – XDC 2023
Freedreno on Android – XDC 2023Igalia
 
On-going challenges in the Raspberry Pi driver stack – XDC 2023
On-going challenges in the Raspberry Pi driver stack – XDC 2023On-going challenges in the Raspberry Pi driver stack – XDC 2023
On-going challenges in the Raspberry Pi driver stack – XDC 2023Igalia
 
Status Update of the VKMS DRM driver – XDC 2023
Status Update of the VKMS DRM driver – XDC 2023Status Update of the VKMS DRM driver – XDC 2023
Status Update of the VKMS DRM driver – XDC 2023Igalia
 

Más de Igalia (20)

Implementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamerImplementing a Vulkan Video Encoder From Mesa to GStreamer
Implementing a Vulkan Video Encoder From Mesa to GStreamer
 
8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa8 Years of Open Drivers, including the State of Vulkan in Mesa
8 Years of Open Drivers, including the State of Vulkan in Mesa
 
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por IgaliaIntroducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
Introducción a Mesa. Caso específico dos dispositivos Raspberry Pi por Igalia
 
2023 in Chimera Linux
2023 in Chimera                    Linux2023 in Chimera                    Linux
2023 in Chimera Linux
 
Building a Linux distro with LLVM
Building a Linux distro        with LLVMBuilding a Linux distro        with LLVM
Building a Linux distro with LLVM
 
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUsturnip: Update on Open Source Vulkan Driver for Adreno GPUs
turnip: Update on Open Source Vulkan Driver for Adreno GPUs
 
Graphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devicesGraphics stack updates for Raspberry Pi devices
Graphics stack updates for Raspberry Pi devices
 
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOSDelegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
Delegated Compositing - Utilizing Wayland Protocols for Chromium on ChromeOS
 
MessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the webMessageFormat: The future of i18n on the web
MessageFormat: The future of i18n on the web
 
Replacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shadersReplacing the geometry pipeline with mesh shaders
Replacing the geometry pipeline with mesh shaders
 
I'm not an AMD expert, but...
I'm not an AMD expert, but...I'm not an AMD expert, but...
I'm not an AMD expert, but...
 
Status of Vulkan on Raspberry
Status of Vulkan on RaspberryStatus of Vulkan on Raspberry
Status of Vulkan on Raspberry
 
Enable hardware acceleration for GL applications without glamor on Xorg modes...
Enable hardware acceleration for GL applications without glamor on Xorg modes...Enable hardware acceleration for GL applications without glamor on Xorg modes...
Enable hardware acceleration for GL applications without glamor on Xorg modes...
 
Async page flip in DRM atomic API
Async page flip in DRM  atomic APIAsync page flip in DRM  atomic API
Async page flip in DRM atomic API
 
From the proposal to ECMAScript – Step by Step
From the proposal to ECMAScript – Step by StepFrom the proposal to ECMAScript – Step by Step
From the proposal to ECMAScript – Step by Step
 
Migrating Babel from CommonJS to ESM
Migrating Babel     from CommonJS to ESMMigrating Babel     from CommonJS to ESM
Migrating Babel from CommonJS to ESM
 
The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...
The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...
The rainbow treasure map: Advanced color management on Linux with AMD/Steam D...
 
Freedreno on Android – XDC 2023
Freedreno on Android          – XDC 2023Freedreno on Android          – XDC 2023
Freedreno on Android – XDC 2023
 
On-going challenges in the Raspberry Pi driver stack – XDC 2023
On-going challenges in the Raspberry Pi driver stack – XDC 2023On-going challenges in the Raspberry Pi driver stack – XDC 2023
On-going challenges in the Raspberry Pi driver stack – XDC 2023
 
Status Update of the VKMS DRM driver – XDC 2023
Status Update of the VKMS DRM driver – XDC 2023Status Update of the VKMS DRM driver – XDC 2023
Status Update of the VKMS DRM driver – XDC 2023
 

Último

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Último (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Bringing GL_ARB_gpu_shader_fp64 to Intel GPUs (XDC 2016)

  • 1. Iago Toral Quiroga <itoral@igalia.com> Bringing ARB_gpu_shader_fp64 to Intel GPUs XDC 2016 Helsinki, Finland
  • 2. Contents → ARB_gpu_shader_fp64 → Overview → Scope → Intel implementation → NIR → i965 → Current status
  • 3. ARB_gpu_shader_fp64 ● New GLSL types: double, dvecX and dmatX ● Variables, Inputs, outputs, uniforms, constants (LF suffix) ● No 64-bit vertex attributes (handled by another extension) ● No 64-bit fragment shader outputs (no 64-bit FBs) ● Arithmetic and relational operators supported ● Most built-in functions can take the new types ● Some exceptions: angle, trigonometry, exponential, noise ● New packing functions: (un)packDouble2x32 ● Conversions from/to 32-bit types ● No interpolation
  • 4. Scope (i965) ● Combined work by Intel and Igalia for over a year ● ~260 new patches to add NIR (~60) and i965 support (~100 Broadwell+, ~100 Haswell) ● More IvyBridge patches on the way! ● Usable across all shader stages ● 3 IRs to support: NIR, i965/align1, i965/align16 ● Lots of GL functionality involved: ALU, varyings, uniforms, UBO, SSBO, shared variables, etc. ● Lots of internal driver modules involved: Optimization passes, liveness analysis, register spilling, etc.
  • 5. Scope (Piglit) ● Piglit coverage was limited ● Mostly focused on ALU operations. ● ~2,000 new tests added
  • 6. NIR
  • 7. NIR ● Added support for bit-sized ALU types: ● nir_type_float32 = 32 | nir_type_float, etc ● nir_alu_get_type_size(), nir_alu_get_base_type() ● We need to be a bit more careful now: - if (alu_info.output_type == nir_type_bool) { +if (nir_alu_type_get_base_type(alu_info.output_type) == nir_type_bool) {
  • 8. NIR ● Lots of plumbing to deal correctly with bit-sizes everywhere. ● Algebraic rules and bit-size validation ● New double-precision opcodes: ● d2f, d2i, d2u, d2b, f2d, i2d, u2d ● (un)pack_double_2x32 ● etc
  • 9. NIR ● Added lowering for unsupported 64-bit operations on Intel GPU hardware: ● trunc, floor, ceil, fract, round, mod, rcp, sqrt, rsq ● nir_lower_doubles(). Implemented in terms of: ● Supported 64-bit operations ● 32-bit float/integer math ● Drivers can choose which lowerings to use: ● nir_lower_doubles_options
  • 11. Align1 – SIMD8 - 32bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B mov(8) g0.0<1>F g1.1<8,4,2>F { align1 1Q } Row 1 Row 2
  • 12. Align1 – SIMD8 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xl Yl Yl Xh Xh Yh Yh Xl Xl Yl Yl Xh Xh Yh Yh Xl Xl Yl Yl Xh Xh Yh Yh Xl Xl Yl Yl Xh Xh Yh Yh g4 g5 g6 128B 160B 192B g7224B Zl Zl Wl Wl Zh Zh Wh Wh Zl Zl Wl Wl Zh Zh Wh Wh Zl Zl Wl Wl Zh Zh Wh Wh Zl Zl Wl Wl Zh Zh Wh Wh c0:DF c7:DF
  • 13. Align1 – SIMD8 - 64bit (Haswell+) c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B 96B Xl0 Xh0 Xl1 Xh1 Xl2 Xh2 Xl3 Xh3 mov(8) g2<1>DF g0<4,4,1>DF { align1 1Q } Xl0 Xh0 Xl1 Xh1 Xl2 Xh2 Xl3 Xh3 Xl4 Xh4 Xl5 Xh5 Xl6 Xh6 Xl7 Xh7 Xl4 Xh4 Xl5 Xh5 Xl6 Xh6 Xl7 Xh7 Row 1 Row 2
  • 14. Align1 – SIMD8 - 64bit (Haswell+) c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B 96B Xl0 Xl4 Xh0 Xh4 Xl1 Xl5 Xh1 Xh5 Xl2 Xl6 Xh2 Xh6 Xl3 Xl7 Xh3 Xh7 mov(8) g2.1<2>UD g0.1<8,4,2>UD { align1 1Q } Xh0 Xh1 Xh2 Xh3 Xh4 Xh5 Xh6 Xh7 ● Use the subscript() helper: subscript(reg, BRW_REGISTER_TYPE_UD, 1)
  • 16. Align16 – SIMD4x2 - 32bit mov(8) g0.0<1>.xyF g1.0<4,4,1>.ywwwF { align16 1Q } g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B 16B boundary X Y Z W X Y Z W 0B boundary FixedFixed x3 x3 Row 1 Row 2
  • 17. Align16 - 64bit ● Broadwell+ used to need Align16 for Geometry and Tessellation shaders. ● Nowadays these platforms are fully scalar and Align1 support is sufficient to expose Fp64 ● Older platforms (Haswell, IvyBridge, etc) still need Align16 for Vertex, Geometry and Tessellation shaders.
  • 18. Align16 - 64bit ● Swizzle channels are 32-bit, even on 64-bit operands ● We can only address DF components XY directly! ● Writemasks are 64-bit for DF destinations though ● WRITEMASK_XY and WRITEMASK_ZW are 32-bit though → no native representation
  • 19. Align16 - SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q } Fixed Thread boundary Vertex 1 Vertex 2
  • 20. Align16 - SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q } Thread boundary ● Align16 requires 16B alignment → 2 DF components in each row. ● Vstride=2 to cover the entire region
  • 21. Align16 - SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q } ● Each 16B region applies the 4-component 32-bit swizzle Thread boundary
  • 22. Align16 - SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q } ● We can’t do all swizzle combinations: ● XXXX, YZYZ, XYYZ, etc. are not supported.. ● We need to translate our 64-bit swizzles to 32-bit. – X → XY, Y → ZW, Z → ?, W → ? Thread boundary
  • 23. Align16 - SIMD4x2 - 64bit XY / ZW component splitting g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Yh Zl Zh Wl Zl Zh Wl Wh mov(8) g2<1>.xywDF g0<2,2,1>.zwyDF mov(4) g2<1>.xyDF g0+1<2,2,1>.xyzwDF mov(4) g2+1<1>.yDF g0<2,2,1>.zwzwDF Wh Xl Xh Thread boundary Yl Zl Zh Wl Zl Zh Wl WhWh Yl Yh YhYl mov 1 mov 2
  • 24. Align16 - SIMD4x2 - 64bit XY / ZW component splitting g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Yh Zl Zh Wl Zl Zh Wl WhWh Xl Xh Thread boundary Yl Thread boundary0 0 0 0 mov(4) g2<1>.xDF g0<2,2,1>.xyxyDF { align16 1Q } emask Xl Xh Xl Xh Wrong! Not used 1 1 1 1
  • 25. Align16 - SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0<2,2,1>.zwzwDF { align16 1Q } ● Back to square one… ● Z → XY, W → ZW (at 16B offset) Yl Yh Yl Yh Yl Yh Yl Yh
  • 26. Align16 – SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Zl Zh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0.2<2,2,1>.xyzwDF { align16 1Q } ● Not good enough (and violates register region restrictions) ● We could use a combination of vstride=0 and SIMD splitting. Yl Yh Zl Zh Wl Wh Wl Wh
  • 27. Align16 - SIMD4x2 - 64bit ● Gen7 hardware seems to have an interesting bug feature: ● The second half of a compressed instruction with vstride=0 will ignore the vstride and offset exactly 1 register ● We can use this to avoid the SIMD splitting
  • 28. Align16 - SIMD4x2 - 64bit g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Zl Zh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh mov(8) g2<1>.xyDF g0.2<0,2,1>.xyzwDF { align16 1Q } Yl Yh Zl Zh Wl Wh Wl Wh x2 x2
  • 29. Align16 - SIMD4x2 - 64bit mov(8) g2<1>.xyDF g0.2<0,2,1>.xyzwDF { align16 1Q } ● Remember that issue with 32-bit writemasks? ● WRITEMASK_XY == WRITEMASK_X g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Zl Zh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh Yl Yh Zl Zh
  • 30. Align16 - SIMD4x2 - 64bit mov(8) g2<1>.xDF g0.2<0,2,1>.xyzwDF { align16 1Q } mov(8) g2<1>.yDF g0.2<0,2,1>.xyzwDF { align16 1Q } g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Zl Zh Zl Zh Wl Wh Xl Xh Yl Yh Zl Zh Wl Wh Yl Yh Zl Zh Wl Wh Wl Wh
  • 31. Align16 - SIMD4x2 - 64bit ● This is just an example: ● Different swizzle combinations may require different implementations ● 2-src instructions and 3-src instructions
  • 32. Align16 - SIMD4x2 - 64bit ● Implementation: ● Step 1: scalarize everything, swizzle translation at codegen - Done ● Step 2: let through swizzle classes that we can support natively (e.g. XYZW) - Done ● Step 3: let through swizzle classes that we can support by exploiting the vstride=0 behavior (e.g. XXXX) - Done ● Step 4: use component splitting (partial scalarization) to support more swizzle classes – Not Done (yet)
  • 34. Multiple hardware generations ● Significant differences between IvyBridge, Haswell and Broadwell+ hardware ● Skylake did not require specific adaptations ● Broxton, CherryView and Braswell only required minor tweaks: – 32b to 64b conversions need 64b aligned source data – 64b indirect addressing not supported
  • 35. 32-bit driver ● Before fp64 all GLSL types were implemented as 32-bit types. ● Driver code assumed 32-bit types (and even hstride=1) in lots of places. ● Manyfixes like: - int dst_width = inst->exec_size / 8; + int dst_width = DIV_ROUND_UP(inst->dst.component_size(inst->exec_size), REG_SIZE); ● This could happen anywhere in the driver ● Piglit was the driving force to find these
  • 36. Unfamiliar code patterns ● Fp64 operation produces new code patterns: ● 32-bit access patterns on low/high 32-bit chunks of 64-bit data ● Horizontal strides != 1 ● Some parts of the driver did not handle these scenarios properly. – Copy-propagation received at least 7 patches!
  • 37. 32-bit read/write messages ● All read/write messages are 32-bit ● Pull loads, UBOs, SSBOs, URB, scratch... ● 64bit data needs to be shuffled into 32-bit channels before writing ● 32bit data reads need to be shuffled into valid 64bit data channels ● shuffle_64bit_data_for_32bit_write() ● shuffle_32bit_load_result_to_64bit_data()
  • 38. 32-bit read/write messages (SIMD8 read) g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B X X X X X X X X Y Y Y Y Y Y Y Y Z Z Z Z Z Z Z Z W W W W W W W W 4x4B = 16B ● Just what we want for 32-bit scalar operation ● 8x16B SIMD8 read messages ● Separate variables (registers) for each component ● Consecutive components in consecutive registers
  • 39. 32-bit read/write messages (SIMD8 read) g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xh Yl Yh Xl Xh Yl Yh Xl Xh Yl Yh Xl Xh Yl Yh Xl Xh Yl Yh Xl Xh Yl Yh Xl Xh Yl Yh Xl Xh Yl Yh g4 g5 g6 128B 160B 192B g7224B Zl Zh Wl Wh Zl Zh Wl Wh Zl Zh Wl Wh Zl Zh Wl Wh Zl Zh Wl Wh Zl Zh Wl Wh Zl Zh Wl Wh Zl Zh Wl Wh Invalid 64b data read(+0B) read(+16B)
  • 40. 32-bit read/write messages (SIMD8 read) g0 g1 g2 c0 c1 c2 c3 c4 c5 c6 c7 0B 32B 64B g396B Xl Xl Yl Yl Xh Xh Yh Yh Xl Xl Yl Yl Xh Xh Yh Yh Xl Xl Yl Yl Xh Xh Yh Yh Xl Xl Yl Yl Xh Xh Yh Yh g4 g5 g6 128B 160B 192B g7224B Zl Zl Wl Wl Zh Zh Wh Wh Zl Zl Wl Wl Zh Zh Wh Wh Zl Zl Wl Wl Zh Zh Wh Wh Zl Zl Wl Wl Zh Zh Wh Wh read(+0B) read(+16B)
  • 41. 64-bit immediates (gen7) ● No support for 64-bit immediates (Haswell, IvyBridge) ● Haswell provides the DIM instruction specifically for this purpose, we just had to add support for this in the driver. ● IvyBridge requires that we emit code to load each 32-bit chunk of the constant into a register and then return either a XXXX swizzle (align16) or a stride 0 (align1).
  • 42. Bugs / restrictions (Align1) ● Second half of compressed instructions that don’t write all channels have wrong emask (Haswell) ● Requires SIMD splitting ● Second half of compressed 64-bit instructions has wrong emask (IvyBridge) ● Requires unconditional SIMD splitting of all 64-bit instructions :-(
  • 43. Bugs / restrictions (Align16) ● Vertical stride 0 doesn’t work (gen7) ● The second half of a compressed instruction will invariably offset a full register ● Requires SIMD splitting ● Instructions that write 2 registers must also read 2 registers (gen7) ● This was a known issue in gen7, but it was never triggered in Align16 before Fp64 ● Requires SIMD splitting A simple test that just copies a DF uniform to the output hits both of these bugs! :-(
  • 44. Bugs / restrictions (Align16) ● 3-src instructions can’t use RepCtrl=1 ● Not supported for 64-bit instructions ● Only affects to MAD ● RepCtrl=0 leads to <4,4,1>:DF regions so it can only be used to work with components XY – Requires temporaries and component splitting, but leads to quite bad code in general – Avoiding MAD altogether seems a better option for now
  • 45. Bugs / restrictions (Align16) ● Compressed bcsel ● Does not read the predication mask properly ● Requires SIMD splitting ● Dependency control ● Can’t be used with 64-bit instructions → GPU hangs
  • 47. Status ● Skylake: Available in Mesa 12.0 ● Broadwell: Available in Mesa 12.0 ● Haswell: ● ARB_gpu_shader_fp64: Implemented, in review ● ARB_vertex_attrib_64bit: Implemented ● IvyBridge: ● ARB_gpu_shader_fp64: Align1 implemented, Align16 implementation in progress ● ARB_vertex_attrib_64bit: Not started