design: optional Stage 5 NV12 → RGBA conversion
User question 2026-05-23: 'Wayland does need a conversion of NV12 to
its output format. Could we cram that in?'
Yes — trivially. Added Stage 5 to the pipeline doc with:
- 5-line per-pixel compute shader (BT.709 limited-range example
given; matrix selected from H.264 VUI at runtime)
- explicit OPT-IN flag, off by default
- rationale for default-off: most consumers (V4L2 stateless,
Wayland zwp_linux_dmabuf NV12 passthrough, Firefox/mpv VAAPI
paths) want NV12 because compositors convert during composition
essentially for free. RGBA8 is ~2.7x the bandwidth of NV12 (4 vs
1.5 bytes/pixel), so it is not worth burning DMA + electrons when
no downstream needs it
- colourspace metadata plumbing requirement: SPS vui_parameters
(colour_primaries, transfer_characteristics, matrix_coefficients,
video_full_range_flag) MUST flow through to the shader; default
BT.709 limited-range with warning if VUI absent
Updated the new-shader inventory to include v3d_h264_yuv_to_rgba.
Total dispatches/frame remains ~190-200; Stage 5 adds one.
@@ -88,6 +88,7 @@ The GPU runs **everything that is massively parallel per macroblock**:
| 2b. Motion compensation | qpel filter on reference pixels | Fully parallel |
| 3. Reconstruct | residual + predicted, clip255 | Fully parallel (per pixel) |
| 4. Deblocking | Filter MB edges (alpha/beta/tc0 table) | Two-pass: V-edges (rows ∥), H-edges (cols ∥) |
| 5. NV12 → RGBA (optional) | YUV→RGB matrix conversion via H.264 VUI colourspace | Fully parallel (per pixel) |

---

@@ -144,7 +145,50 @@ Alternative: speculative CPU intra prediction, GPU only does inter blocks. Simp

**Stage 4 (deblock)** — two dispatches: vertical edges (all rows independent), horizontal edges (all columns independent). Reuses daedalus-fourier's `v3d_h264deblock` shader, adapted for the new buffer layout.

Total dispatches per frame: ~200 (mostly the intra wavefront), all in one command buffer, one submit, one fence wait.

**Stage 5 (optional, NV12 → RGBA conversion)** — a single per-pixel
compute dispatch that converts the just-deblocked NV12 to packed
RGBA8. Roughly:

```glsl
// BT.709 limited-range example; matrix selected per H.264 VUI.
// Coefficients are 8.8 fixed point; +128 rounds before the >> 8.
int Y  = int(luma[...]) - 16;            // remove luma offset
int CB = int(chroma_uv[...]) - 128;      // NV12 interleaves Cb, Cr
int CR = int(chroma_uv[... + 1]) - 128;
int R = clip255((298*Y + 459*CR + 128) >> 8);
int G = clip255((298*Y - 55*CB - 136*CR + 128) >> 8);
int B = clip255((298*Y + 541*CB + 128) >> 8);
out_rgba[r*W + c] = pack32(R, G, B, 255);
```

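To sanity-check the fixed-point arithmetic off-GPU, a CPU-side model of a single pixel can be sketched as below. This is a minimal sketch, not the shader itself: the function name is hypothetical, and the coefficients are derived here from the BT.709 constants (Kr = 0.2126, Kb = 0.0722) scaled by 256 for limited-range input.

```python
def clip255(x: int) -> int:
    """Clamp an integer to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, x))


def nv12_pixel_to_rgba_bt709_limited(y: int, cb: int, cr: int) -> tuple:
    """Convert one limited-range BT.709 Y/Cb/Cr triple to (R, G, B, A).

    Hypothetical reference model: 8.8 fixed-point coefficients,
    +128 for rounding, arithmetic right shift by 8.
    """
    yy = y - 16      # remove luma offset
    cbb = cb - 128   # centre chroma around zero
    crr = cr - 128
    r = clip255((298 * yy + 459 * crr + 128) >> 8)
    g = clip255((298 * yy - 55 * cbb - 136 * crr + 128) >> 8)
    b = clip255((298 * yy + 541 * cbb + 128) >> 8)
    return (r, g, b, 255)


if __name__ == "__main__":
    # Limited-range black and white map to the RGB extremes.
    print(nv12_pixel_to_rgba_bt709_limited(16, 128, 128))
    print(nv12_pixel_to_rgba_bt709_limited(235, 128, 128))
```

Python's `>>` on negative integers is an arithmetic shift, which matches the behaviour of `>>` on signed `int` in GLSL, so intermediate negative values round the same way in both.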
Off by default. Most consumers want NV12 because:

- V4L2 stateless decoder output is canonically NV12; the daedalus-v4l2
  CAPTURE plane is already plumbed for NV12 dmabuf.
- Wayland compositors (sway, mutter, KWin, weston) advertise NV12 via
  `zwp_linux_dmabuf_v1` and convert during composition. That is
  essentially free because it is a per-pixel matrix multiply on top of
  the texture sample the composite fragment shader already does.
- Firefox / mpv VAAPI paths use NV12 dmabuf zero-copy to the
  compositor (cf. project memory `pi5_h264_hwaccel_landed`).
- RGBA8 is about 2.7× the bandwidth of NV12 (roughly 8.3 MB/frame vs
  3.1 MB/frame at 1080p), so converting in the decoder when no
  consumer needs it burns DMA bandwidth and electrons.
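
The per-frame sizes behind the bandwidth argument can be worked out directly. The sketch below assumes tightly packed buffers with no stride or height alignment padding (real allocations are often padded, for example to a 1088-line height), and the function names are illustrative only.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Full-resolution Y plane plus one interleaved CbCr plane at 4:2:0."""
    luma = width * height
    chroma = 2 * ((width + 1) // 2) * ((height + 1) // 2)
    return luma + chroma


def rgba8_frame_bytes(width: int, height: int) -> int:
    """Four bytes per pixel, packed."""
    return width * height * 4


if __name__ == "__main__":
    nv12 = nv12_frame_bytes(1920, 1080)
    rgba = rgba8_frame_bytes(1920, 1080)
    print(f"NV12:  {nv12 / 1e6:.1f} MB/frame")
    print(f"RGBA8: {rgba / 1e6:.1f} MB/frame")
    print(f"ratio: {rgba / nv12:.2f}x")
```

At 1080p this gives 3,110,400 bytes for NV12 and 8,294,400 bytes for RGBA8, a ratio of 8/3.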

Opt in via a decoder config flag for consumers that genuinely need
RGBA pre-compositor (SHM-buffer-only clients, future X11/headless
paths, GL contexts without YUV sampler extension support).

Colourspace metadata MUST be plumbed through when Stage 5 is on:
the H.264 SPS's `vui_parameters` carry `colour_primaries`,
`transfer_characteristics`, `matrix_coefficients`, and `video_full_range_flag`.
Picking the wrong matrix (BT.601 vs BT.709) silently shifts hues such as
skin tones, and picking the wrong range crushes or washes out shadows.
Default to the BT.709 limited-range combination when VUI is absent
(the dominant case for HD content); log a warning in that case.
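
One way the VUI fields could be turned into shader constants is sketched below. Everything named here (`select_coeffs`, `Coeffs`, `LUMA_WEIGHTS`, the logger name) is hypothetical and not part of the design doc. The `matrix_coefficients` code points follow the H.264 VUI table (1 = BT.709, 5 and 6 = BT.601 variants, 7 = SMPTE 240M). Treating unsupported code points the same as an absent VUI is an assumption of this sketch.

```python
import logging
from typing import NamedTuple, Optional

log = logging.getLogger("stage5")  # hypothetical logger name

# (Kr, Kb) luma weights keyed by the H.264 matrix_coefficients code point.
LUMA_WEIGHTS = {
    1: (0.2126, 0.0722),  # BT.709
    5: (0.299, 0.114),    # BT.470BG (BT.601, 625-line)
    6: (0.299, 0.114),    # SMPTE 170M (BT.601, 525-line)
    7: (0.212, 0.087),    # SMPTE 240M
}


class Coeffs(NamedTuple):
    """8.8 fixed-point constants for the per-pixel conversion."""
    y_offset: int
    y_scale: int
    cr_to_r: int
    cb_to_g: int
    cr_to_g: int
    cb_to_b: int


def select_coeffs(matrix_coefficients: Optional[int],
                  full_range: Optional[bool]) -> Coeffs:
    """Pick constants from VUI fields; None means the field was absent."""
    if matrix_coefficients not in LUMA_WEIGHTS:
        # Assumption: absent, unspecified (2), and unsupported values all
        # fall back to BT.709 with a warning.
        log.warning("matrix_coefficients=%r: defaulting to BT.709",
                    matrix_coefficients)
        matrix_coefficients = 1
    if full_range is None:
        log.warning("video_full_range_flag absent: assuming limited range")
        full_range = False

    kr, kb = LUMA_WEIGHTS[matrix_coefficients]
    kg = 1.0 - kr - kb
    y_gain = 1.0 if full_range else 255.0 / 219.0  # 16..235 -> 0..255
    c_gain = 1.0 if full_range else 255.0 / 224.0  # 16..240 -> full swing

    def q(v: float) -> int:
        return round(v * 256)

    return Coeffs(
        y_offset=0 if full_range else 16,
        y_scale=q(y_gain),
        cr_to_r=q(2 * (1 - kr) * c_gain),
        cb_to_g=q(2 * (1 - kb) * kb / kg * c_gain),
        cr_to_g=q(2 * (1 - kr) * kr / kg * c_gain),
        cb_to_b=q(2 * (1 - kb) * c_gain),
    )
```

Deriving the constants from Kr and Kb keeps the BT.601, BT.709, and SMPTE 240M variants in one code path, so the shader only needs the six integers as uniforms or push constants rather than a per-matrix branch.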

Total dispatches per frame: ~190-200 without Stage 5 (mostly the
intra wavefront); +1 with Stage 5. All in one command buffer, one
submit, one fence wait.

---

@@ -200,6 +244,8 @@ New shaders needed:
- `v3d_h264_chroma_mc` — 1/8-pel chroma MC (bilinear)
- `v3d_h264_deblock_h_luma` + chroma variants
- `v3d_h264_reconstruct` — trivial per-pixel add+clip
- `v3d_h264_yuv_to_rgba` (optional Stage 5) — BT.601 / BT.709 / SMPTE-240M
  matrix selectable; limited vs full range selectable from VUI

That's a substantial shader inventory. Each requires a bit-exact M1 gate against an FFmpeg reference, same methodology as daedalus-fourier cycles 1-9.
