# ==Compute shaders==

<p class="doc-sub">// status: seedling</p>

Compute shaders are the "just let me run code on the GPU" pipeline stage. No triangles, no rasterizer, no fixed-function ceremony — a flat grid of threads reading and writing buffers/images. In modern engines they do everything from culling, light binning, and [[Voxel rendering techniques|voxelization]] to particle sims, tone mapping, and mesh generation.

# The execution model

You dispatch a 3D grid of ==workgroups==; each workgroup contains a 3D grid of ==invocations== (threads). The shader declares its local size:

```glsl
#version 460
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
```

Call `glDispatchCompute(gx, gy, gz)` / `vkCmdDispatch` with the number of workgroups. Total threads = `local_size_* × gx × gy × gz`.

Within a shader:

- `gl_GlobalInvocationID` — unique 3D index across the whole dispatch.
- `gl_LocalInvocationID` — index within the workgroup.
- `gl_WorkGroupID` — which workgroup this invocation belongs to.
- `gl_LocalInvocationIndex` — linearised local index.

# Warps / waves / subgroups

Under the hood the GPU runs threads in lock-step groups — ==warps== (NVIDIA, 32 threads), ==wavefronts== (AMD, 64 on GCN, 32 or 64 on RDNA), ==SIMD-groups== (Apple), ==subgroups== (Vulkan's portable name). A whole subgroup executes the same instruction at the same time; divergent branches serialise.

Subgroup operations (core in Vulkan 1.1, exposed in GLSL via `GL_KHR_shader_subgroup`) let you do things like `subgroupAdd`, `subgroupBallot`, `subgroupShuffle` — often an order of magnitude faster than the equivalent shared-memory reduction.

Picking a `local_size`:

- Multiple of the subgroup size (32 on NV, 32/64 on AMD). A total of 64 is a safe floor; 128 or 256 are common.
- For a 2D image, `16×16 = 256` is a sweet spot.
- Too small and you can't hide memory latency; too large and register pressure kills occupancy.
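A minimal end-to-end kernel tying the pieces above together. The binding point, the `img` name, and the invert operation are all made up for illustration — a sketch, not a canonical shader:

```glsl
#version 460
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

// Hypothetical binding: a single storage image we read-modify-write.
layout(rgba8, binding = 0) uniform image2D img;

void main() {
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size = imageSize(img);

    // Dispatched as ceil(w/16) × ceil(h/16) workgroups, so the last
    // row/column of groups can hang past the edge — guard it.
    if (p.x >= size.x || p.y >= size.y) return;

    vec4 c = imageLoad(img, p);
    imageStore(img, p, vec4(1.0 - c.rgb, c.a)); // invert colour, keep alpha
}
```

Host side, the matching dispatch is `glDispatchCompute((w + 15) / 16, (h + 15) / 16, 1)` — integer round-up, so every pixel is covered and the in-shader guard handles the overshoot.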
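As a sketch of why subgroup ops beat shared-memory ping-pong: a workgroup sum reduction that only touches `shared` once per subgroup. Assumes Vulkan 1.1-level subgroup support and a subgroup size of at least 32; the buffer names and bindings are hypothetical:

```glsl
#version 460
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require
layout(local_size_x = 256) in;

layout(std430, binding = 0) readonly buffer In  { float values[]; };
layout(std430, binding = 1) buffer Out { float partialSums[]; };

// One slot per subgroup: 256 / gl_SubgroupSize, so 8 assumes size >= 32.
shared float subgroupTotals[8];

void main() {
    float v = values[gl_GlobalInvocationID.x];

    // Step 1: reduce within each subgroup — no shared memory, no barrier.
    float sum = subgroupAdd(v);

    // Step 2: one lane per subgroup stashes its total.
    if (subgroupElect()) subgroupTotals[gl_SubgroupID] = sum;
    barrier();

    // Step 3: the first subgroup reduces the per-subgroup totals
    // and writes one partial sum per workgroup.
    if (gl_SubgroupID == 0) {
        uint n = gl_WorkGroupSize.x / gl_SubgroupSize;
        float t = gl_SubgroupInvocationID < n
                ? subgroupTotals[gl_SubgroupInvocationID] : 0.0;
        if (subgroupElect()) partialSums[gl_WorkGroupID.x] = subgroupAdd(t);
    }
}
```

One `barrier()` and one shared write per subgroup, versus log₂(256) barrier rounds for the classic tree reduction. A second tiny dispatch (or one atomicAdd per workgroup) folds `partialSums` into the final value.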
# Shared memory (LDS)

`shared` variables live in fast on-chip memory, shared across a workgroup:

```glsl
shared float tile[16][16];

void main() {
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    ivec2 l = ivec2(gl_LocalInvocationID.xy);
    tile[l.y][l.x] = imageLoad(src, p).r;
    barrier(); // everyone finished writing
    // ... read neighbours from `tile`, compute a blur, write out.
}
```

`barrier()` is a workgroup-scope sync — all invocations in the group must reach it before anyone continues, and in compute shaders it also makes `shared` writes visible across the group. `memoryBarrierShared()` orders `shared`-memory accesses without blocking execution; `groupMemoryBarrier()` does the same for all memory types (shared, buffer, image) at workgroup scope. Memory barriers alone don't synchronise execution, so they're usually paired with `barrier()`.

Shared memory is the single biggest optimisation for stencil-style problems (blurs, separable filters, histograms, reductions). The pattern:

1. Each thread loads one (or a few) samples into shared memory.
2. `barrier()`.
3. Each thread reads neighbours from shared memory and writes its output.

The cooperative fetch turns N redundant global reads per output into one.

# Atomics

`atomicAdd`, `atomicMin`/`atomicMax`, `atomicExchange`, `atomicCompSwap`, etc. work on buffer and `shared` memory. Great for per-tile counters, histograms, particle append buffers. Slow if heavily contended — prefer subgroup reductions or per-workgroup locals first, then one atomic per workgroup.

# Images vs buffers

- **`imageStore` / `imageLoad`** — typed access to textures. Hardware format conversion; you usually want `rgba8` / `r16f` / `r32f` etc.
- **SSBOs (GL) / storage buffers (VK)** — untyped, `std430` layout, good for structured data.
- Read-only access via samplers is still faster for spatial patterns — filtering, mipmaps, the texture cache. Only use `imageLoad` when you need random writes.

# What compute is actually good at

- **Per-pixel post-processing** — tonemap, bloom, SSAO, SSR. Writes go straight back to the HDR buffer.
- **Particle systems** — integrate, spawn, cull in compute, then render with a graphics pipeline.
- **Light binning / tile culling** — the workhorse of clustered [[Deferred vs forward rendering|forward+]] rendering.
- **[[Marching Cubes|Marching cubes]] / meshing** — each workgroup processes a cell block; appends triangles.
- **Ray traversal** when hardware RT isn't available — software BVH traversal ([[Spatial acceleration structures]]) scales very well on compute.
- **Prefix scans, histograms, reductions** — now one-liners with subgroup ops.
- **Animation / skinning** — CPU-side skinning is usually a waste.

# What compute is bad at

- **Small dispatches** — dispatch overhead is non-trivial; one-dispatch-per-object is usually a mistake. Batch.
- **Highly divergent control flow** — when every thread in a subgroup takes a different branch, you serialise. Sort inputs to keep similar work adjacent.
- **Random writes into hot memory** — atomics contend badly; think in terms of local aggregation, then a single writeback.

# Things that tripped me up

- **Forgetting `barrier()` between shared writes and reads** — race conditions that happen to work on one GPU and not another.
- **Image format mismatch** — `rgba8` declared in the shader but the texture was created as `rgba16f`. Silent garbage on some drivers.
- **Edge workgroups overshooting the image** — dispatching `ceil(w/16), ceil(h/16)` workgroups puts threads past the edge in the last row/column. Always guard: `if (p.x >= width || p.y >= height) return;`.
- **`memoryBarrier*` is not `barrier()`** — memory barriers order _memory_; `barrier()` orders _execution_. You often need both.
- **Indirect dispatch** — `vkCmdDispatchIndirect` / `glDispatchComputeIndirect` lets a previous compute pass decide the workgroup count. Massively useful for GPU-driven pipelines.

# References

- [Vulkan Spec — Compute chapter](https://www.khronos.org/vulkan/)
- _GPU Gems 3_ — still has some of the clearest compute-style write-ups.
- _A trip through the Graphics Pipeline_ — Fabian Giesen's series, essential for understanding what the hardware is actually doing.

---

Back to [[Index|Notes]] · see also [[OpenGL - learning log]] · [[Vulkan - learning log]] · [[Deferred vs forward rendering]]