diff --git a/DESIGN.md b/DESIGN.md
index d59a5b3..cb5648e 100644
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -88,6 +88,7 @@ The GPU runs **everything that is massively parallel per macroblock**:
 | 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
 | 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
 | 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
+| 5. NV12 → RGBA (optional) | YUV→RGB matrix conversion via H.264 VUI colourspace | Fully parallel (per pixel) |
 
 ---
 
@@ -144,7 +145,50 @@ Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simp
 
 **Stage 4 (deblock)** — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's `v3d_h264deblock` shader, adapted for the new buffer layout.
 
-Total dispatches per frame: ~200 (mostly the intra wavefront), all in one command buffer, one submit, one fence wait.
+**Stage 5 (optional, NV12 → RGBA conversion)** — a single per-pixel
+compute dispatch that converts the just-deblocked NV12 to packed
+RGBA8. Roughly:
+
+```glsl
+// BT.709 limited-range example; matrix selected per H.264 VUI.
+int Y = int(luma[...]) - 16;
+int CB = int(chroma_uv[...]) - 128;
+int CR = int(chroma_uv[... + 1]) - 128;
+int R = clip255((298*Y + 459*CR + 128) >> 8);
+int G = clip255((298*Y - 55*CB - 136*CR + 128) >> 8);
+int B = clip255((298*Y + 541*CB + 128) >> 8);
+out_rgba[r*W + c] = pack32(R, G, B, 255);
+```
+
+Off by default. Most consumers want NV12 because:
+
+- V4L2 stateless decoder output is canonically NV12; the daedalus-v4l2
+  CAPTURE plane is already plumbed for NV12 dmabuf.
+- Wayland compositors (sway, mutter, KWin, weston) advertise NV12 via
+  `zwp_linux_dmabuf_v1` and convert during composition — essentially
+  free because it's one texture sample with a swizzle in the
+  composite fragment shader.
+- Firefox / mpv VAAPI paths use NV12 dmabuf zero-copy to the
+  compositor (cf. project memory `pi5_h264_hwaccel_landed`).
+- RGBA8 is roughly 2.7× the bandwidth of NV12 (8.3 MB/frame vs
+  3.1 MB/frame at 1080p) — converting in the decoder when no consumer
+  needs it burns DMA bandwidth and electrons.
+
+Opt in via a decoder config flag for consumers that genuinely need
+RGBA pre-compositor (SHM-buffer-only clients, future X11/headless
+paths, GL contexts without YUV sampler extension support).
+
+Colourspace metadata MUST be plumbed through when Stage 5 is on:
+the H.264 SPS's `vui_parameters` carry `colour_primaries`,
+`transfer_characteristics`, `matrix_coefficients`, and `video_full_range_flag`.
+Picking the wrong matrix (BT.601 vs BT.709) or range silently
+mis-saturates skin tones and shadows. Default to the BT.709 limited-range
+combination when VUI is absent (the dominant case for HD content);
+log a warning in that case.
+
+Total dispatches per frame: ~200 without Stage 5 (mostly the intra
+wavefront); +1 with Stage 5. All in one command buffer, one
+submit, one fence wait.
 
 ---
 
@@ -200,6 +244,8 @@ New shaders needed:
 - `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
 - `v3d_h264_deblock_h_luma` + chroma variants
 - `v3d_h264_reconstruct` — trivial per-pixel add+clip
+- `v3d_h264_yuv_to_rgba` (optional Stage 5) — BT.601 / BT.709 / SMPTE-240M
+  matrix selectable; limited vs full range selectable from VUI
 
 That's a substantial shader inventory. Each requires bit-exact M1 gate against an FFmpeg reference, same methodology as daedalus-fourier cycles 1-9.
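
Note on Stage 5 matrix selection: the integer coefficients for each matrix/range combination can be derived from the luma weights (Kr, Kb) rather than hard-coded. The sketch below is a hypothetical CPU-side reference model in Python, not part of this patch and not the shader itself. The names `LUMA_WEIGHTS`, `coefficients`, and `yuv_to_rgb` are assumptions made for illustration; the 8.8 fixed-point scaling with a `+128` rounding term follows the Stage 5 description, and the `matrix_coefficients` codes (1, 5, 6, 7) are the H.264 VUI values for BT.709, the two BT.601 variants, and SMPTE 240M.

```python
# Hypothetical reference model: derive 8.8 fixed-point YCbCr -> RGB
# coefficients from (Kr, Kb) and convert a single pixel.

# Luma weights keyed by H.264 VUI matrix_coefficients code.
LUMA_WEIGHTS = {
    1: (0.2126, 0.0722),  # BT.709
    5: (0.299, 0.114),    # BT.470BG (BT.601, 625-line)
    6: (0.299, 0.114),    # SMPTE 170M (BT.601, 525-line)
    7: (0.212, 0.087),    # SMPTE 240M
}


def coefficients(matrix_coefficients=1, full_range=False):
    """Return integer coefficients scaled by 256 for the given VUI settings."""
    kr, kb = LUMA_WEIGHTS[matrix_coefficients]
    kg = 1.0 - kr - kb
    # Limited range: luma spans 16..235 (219 steps), chroma 16..240 (224 steps).
    y_scale = 1.0 if full_range else 255.0 / 219.0
    c_scale = 1.0 if full_range else 255.0 / 224.0

    def q(x):
        return int(round(x * 256))

    return {
        "y": q(y_scale),
        "r_cr": q(2 * (1 - kr) * c_scale),
        "g_cb": q(2 * kb * (1 - kb) / kg * c_scale),
        "g_cr": q(2 * kr * (1 - kr) / kg * c_scale),
        "b_cb": q(2 * (1 - kb) * c_scale),
        "y_offset": 0 if full_range else 16,
    }


def clip255(x):
    """Clamp to the 8-bit range."""
    return max(0, min(255, x))


def yuv_to_rgb(y, cb, cr, k):
    """Convert one 8-bit YCbCr sample triple to 8-bit RGB using coefficients k."""
    yy = (y - k["y_offset"]) * k["y"]
    cb -= 128
    cr -= 128
    r = clip255((yy + k["r_cr"] * cr + 128) >> 8)
    g = clip255((yy - k["g_cb"] * cb - k["g_cr"] * cr + 128) >> 8)
    b = clip255((yy + k["b_cb"] * cb + 128) >> 8)
    return r, g, b


if __name__ == "__main__":
    bt709 = coefficients(1, full_range=False)
    print(bt709)
    print(yuv_to_rgb(235, 128, 128, bt709))  # limited-range white
```

Under these assumptions, BT.709 limited range comes out as 298 for luma, 459 for Cr into R, 55 for Cb and 136 for Cr into G, and 541 for Cb into B, while BT.601 limited range comes out as 298, 409, 100, 208, 516. A model along these lines could serve as one input to the bit-exact gate for `v3d_h264_yuv_to_rgba`, provided the FFmpeg reference is configured with the same rounding.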