# ==Compute shaders==
<p class="doc-sub">// status: seedling</p>
Compute shaders are the "just let me run code on the GPU" pipeline stage. No triangles, no rasterizer, no fixed-function ceremony — a flat grid of threads reading and writing buffers/images. In modern engines they do everything from culling, light binning, and [[Voxel rendering techniques|voxelization]] to particle sims, tone mapping, and mesh generation.
# The execution model
You dispatch a 3D grid of ==workgroups==; each workgroup contains a 3D grid of ==invocations== (threads). The shader declares its local size:
```glsl
#version 460
layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
```
Call `glDispatchCompute(gx, gy, gz)` / `vkCmdDispatch` with the number of *workgroups*, not threads. Total invocations = `(local_size_x·gx) × (local_size_y·gy) × (local_size_z·gz)`.
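For concreteness, with the 16×16×1 local size above covering a 1280×720 image (a sketch — `gl_NumWorkGroups` and `gl_WorkGroupSize` are the built-ins exposing these numbers in-shader):

```glsl
// dispatch gx = ceil(1280 / 16) = 80, gy = ceil(720 / 16) = 45, gz = 1
// total invocations = (16 * 80) × (16 * 45) × 1 = 1280 × 720
uvec3 total = gl_NumWorkGroups * gl_WorkGroupSize;  // queryable from inside the shader
```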
Within a shader:
- `gl_GlobalInvocationID` — unique 3D index across the whole dispatch.
- `gl_LocalInvocationID` — index within the workgroup.
- `gl_WorkGroupID` — which workgroup this invocation belongs to.
- `gl_LocalInvocationIndex` — linearised local index.
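How the built-ins relate (a sketch of the spec-defined identities, not code you would normally write):

```glsl
// all uvec3:
// gl_GlobalInvocationID == gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
//
// gl_LocalInvocationIndex is the row-major flattening of gl_LocalInvocationID:
uint idx = gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y
         + gl_LocalInvocationID.y * gl_WorkGroupSize.x
         + gl_LocalInvocationID.x;   // == gl_LocalInvocationIndex
```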
# Warps / waves / subgroups
Under the hood the GPU runs threads in lock-step groups — ==warps== (NVIDIA, 32 threads), ==wavefronts== (AMD, 32 or 64), ==SIMD-group== (Apple), ==subgroups== (Vulkan's portable name). A whole subgroup executes the same instruction at the same time; divergent branches serialise.
Subgroup operations (core in Vulkan 1.1; in GLSL via the `GL_KHR_shader_subgroup_*` extensions) let you do things like `subgroupAdd`, `subgroupBallot`, `subgroupShuffle` — often an order of magnitude faster than the equivalent shared-memory reduction.
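A minimal subgroup reduction, as a sketch — the `In`/`Out` buffer names are made up for illustration:

```glsl
#version 460
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_arithmetic : require
layout(local_size_x = 256) in;

layout(std430, binding = 0) readonly buffer In { uint data[]; };
layout(std430, binding = 1) buffer Out { uint total; };

void main() {
    uint v = data[gl_GlobalInvocationID.x];
    uint s = subgroupAdd(v);     // sum across the whole subgroup, one instruction
    if (subgroupElect())         // exactly one invocation per subgroup
        atomicAdd(total, s);     // one atomic per subgroup instead of per thread
}
```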
Picking a `local_size`:
- Multiple of the subgroup size (32 on NVIDIA, 32/64 on AMD) — 64 is a safe floor; 128 or 256 is common.
- For a 2D image, `16×16 = 256` is a sweet spot.
- Too small and you can't hide memory latency; too large and register pressure kills occupancy.
# Shared memory (LDS)
`shared` variables live in fast on-chip memory, shared across a workgroup:
```glsl
#version 460
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0, r32f) readonly  uniform image2D src;
layout(binding = 1, r32f) writeonly uniform image2D dst;

shared float tile[16][16];

void main() {
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    ivec2 l = ivec2(gl_LocalInvocationID.xy);
    tile[l.y][l.x] = imageLoad(src, p).r;
    barrier(); // everyone finished writing
    // ... read neighbours from `tile`, compute a blur, write out via `dst`.
}
```
`barrier()` is a workgroup-scope execution sync — all invocations in the group must reach it before any continue (in compute shaders it also synchronises `shared` accesses, so prior writes are visible after it). `memoryBarrierShared()` orders shared-memory accesses only; `groupMemoryBarrier()` orders *all* memory accesses (shared, buffer, image) at workgroup scope. Both are memory barriers, not execution barriers — they order visibility, they don't make anyone wait.
Shared memory is the single biggest optimisation for stencil-style problems (blurs, separable filters, histograms, reductions). The pattern:
1. Each thread loads one (or a few) samples into shared.
2. `barrier`.
3. Each thread reads neighbours from shared and writes its output.
The cooperative fetch turns N redundant global reads per output into one.
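Step 3 fleshed out for the tile example above — a sketch that assumes a `writeonly image2D dst` bound next to `src`, and clamps inside the tile rather than loading the halo a production blur would need at tile edges:

```glsl
// horizontal 3-tap box blur; neighbours come from shared, not global memory
float sum = 0.0;
for (int dx = -1; dx <= 1; ++dx) {
    int x = clamp(l.x + dx, 0, 15);   // stay inside the tile (no halo loaded)
    sum += tile[l.y][x];
}
imageStore(dst, p, vec4(sum / 3.0));
```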
# Atomics
`atomicAdd`, `atomicMin`, `atomicExchange`, `atomicCompSwap`, etc. work on buffer and `shared` memory (GL also has separate atomic counter buffers). Great for per-tile counters, histograms, particle append buffers. Slow if heavily contended — prefer subgroup reductions or per-workgroup `shared` accumulation first, then one atomic per workgroup.
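The "aggregate locally, one atomic per workgroup" pattern as a sketch — the `Bucket`/`Hist` buffer names are made up, and it assumes 256 bins matching a 256-thread workgroup:

```glsl
#version 460
layout(local_size_x = 256) in;
layout(std430, binding = 0) readonly buffer Bucket { uint bucket_of[]; };
layout(std430, binding = 1) buffer Hist { uint bins[256]; };

shared uint local_bins[256];

void main() {
    local_bins[gl_LocalInvocationIndex] = 0u;
    barrier();
    // contention is confined to fast shared memory
    atomicAdd(local_bins[bucket_of[gl_GlobalInvocationID.x]], 1u);
    barrier();
    // one global atomic per bin per workgroup, not one per element
    uint c = local_bins[gl_LocalInvocationIndex];
    if (c != 0u) atomicAdd(bins[gl_LocalInvocationIndex], c);
}
```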
# Images vs buffers
- **`imageStore` / `imageLoad`** — typed access to textures. Hardware format conversion, usually want `rgba8` / `r16f` / `r32f` etc.
- **SSBOs (GL) / storage buffers (VK)** — untyped, `std430` layout, good for structured data.
- Read-only access via a sampler is still faster for spatial access patterns — it goes through the texture cache and gives you filtering and mipmaps. Reach for `imageLoad`/`imageStore` only when you need writes or exact, unfiltered texel access.
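The three access paths side by side (binding numbers and the `Particle` struct are illustrative):

```glsl
// typed image: format is declared here and must match the bound texture
layout(binding = 0, rgba16f) uniform image2D hdr;

// untyped storage buffer: std430 layout, arbitrary structs, runtime-sized tail array
struct Particle { vec4 pos_life; vec4 vel_pad; };
layout(std430, binding = 1) buffer Particles { Particle particles[]; };

// read-only sampling still goes through the texture unit: filtering, mips, wrap modes
layout(binding = 2) uniform sampler2D env;
```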
# What compute is actually good at
- **Per-pixel post-processing** — tonemap, bloom, SSAO, SSR. Writes go straight back to the HDR buffer.
- **Particle systems** — integrate, spawn, cull in compute, then render with a graphics pipeline.
- **Light binning / tile culling** — the workhorse of clustered [[Deferred vs forward rendering|forward+]] rendering.
- **[[Marching Cubes|Marching cubes]] / meshing** — each workgroup processes a cell block; appends triangles.
- **Ray traversal** when hardware RT isn't available — software BVH traversal ([[Spatial acceleration structures]]) scales very well on compute.
- **Prefix scans, histograms, reductions** — now one-liners with subgroup ops.
- **Animation / skinning** — CPU-side skinning is usually a waste.
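The scan/compaction "one-liner" mentioned above, as a sketch — the `In`/`Out` buffers and the `v > 0.0` predicate are made up for illustration:

```glsl
#version 460
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_ballot : require
layout(local_size_x = 64) in;
layout(std430, binding = 0) readonly buffer In { float vals[]; };
layout(std430, binding = 1) buffer Out { uint count; float kept[]; };

void main() {
    float v = vals[gl_GlobalInvocationID.x];
    bool keep = v > 0.0;                                    // arbitrary predicate
    uvec4 ballot = subgroupBallot(keep);                    // bitmask of passing lanes
    uint  slot   = subgroupBallotExclusiveBitCount(ballot); // my rank among them
    uint  base = 0u;
    if (subgroupElect())                                    // one atomic per subgroup
        base = atomicAdd(count, subgroupBallotBitCount(ballot));
    base = subgroupBroadcastFirst(base);                    // share the reserved base
    if (keep) kept[base + slot] = v;
}
```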
# What compute is bad at
- **Small dispatches** — dispatch overhead is non-trivial; one-dispatch-per-object is usually a mistake. Batch.
- **Highly divergent control flow** — when every thread in a subgroup takes a different branch, you serialise. Sort inputs to keep similar work adjacent.
- **Random writes into hot memory** — atomics contend badly; think in terms of local aggregation then single writeback.
# Things that tripped me up
- **Forgetting `barrier()` between shared writes and reads** — race conditions that happen to work on one GPU and not another.
- **Image format mismatch** — `rgba8` declared in shader but the texture was created as `rgba16f`. Silent garbage on some drivers.
- **Out-of-range threads at image edges** — dispatching `ceil(w/16), ceil(h/16)` rounds up, so the last row/column of workgroups contains threads outside the image. Always guard: `if (p.x >= width || p.y >= height) return;`.
- **`memoryBarrier*` is not `barrier()`** — memory barriers order _memory_; `barrier()` orders _execution_. You often need both.
- **Indirect dispatch** — `vkCmdDispatchIndirect` / `glDispatchComputeIndirect` lets a previous compute pass decide the workgroup count. Massively useful for GPU-driven pipelines.
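A sketch of the GPU-driven half: a one-thread pass writing the indirect command for a later dispatch. Buffer names and the 64-per-group split are made up; the three `uint`s match the `VkDispatchIndirectCommand` / GL `DispatchIndirectCommand` layout:

```glsl
#version 460
layout(local_size_x = 1) in;
layout(std430, binding = 0) writeonly buffer Indirect { uint gx, gy, gz; };
layout(std430, binding = 1) readonly buffer Visible { uint visible_count; };

void main() {
    // one 64-thread workgroup per 64 surviving particles, rounded up
    gx = (visible_count + 63u) / 64u;
    gy = 1u;
    gz = 1u;
}
```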
# References
- [Vulkan Spec — Compute chapter](https://www.khronos.org/vulkan/)
- _GPU Gems 3_ — still has some of the clearest compute-style write-ups.
- _A trip through the Graphics Pipeline_ — Fabian Giesen's series, essential for understanding what the hardware is actually doing.
---
Back to [[Index|Notes]] · see also [[OpenGL - learning log]] · [[Vulkan - learning log]] · [[Deferred vs forward rendering]]