h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete #35
Closes the H.264 deblock QPU coverage matrix (8 of 8 variants now on QPU).
4 new shaders for luma_v/h_intra + chroma_v/h_intra. Algorithmically distinct from bS<4 — per-side strong/weak filter selector decides whether to update 3 cells (strong, 5-/4-/3-tap blends) or 1 cell (weak), reads p3/q3, no tc0. Chroma intra: always weak, p0/q0 only. Transcribed from PR #11's C ref.
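The per-side selection described above can be sketched in plain C for a single line of samples across an edge. This is a minimal illustration written from the standard's bS=4 formulas, not the PR's shader or its C reference: the function names and the packed `s[]` layout are assumptions made for the example, and the activity gate at the top comes from the standard rather than from this description.

```c
#include <stdint.h>
#include <stdlib.h>

/* Luma bS=4 filter for one line across an edge (hypothetical helper).
 * s[0..7] = p3 p2 p1 p0 | q0 q1 q2 q3; the edge sits between s[3] and s[4]. */
static void luma_intra_filter_line(uint8_t s[8], int alpha, int beta)
{
    const int p3 = s[0], p2 = s[1], p1 = s[2], p0 = s[3];
    const int q0 = s[4], q1 = s[5], q2 = s[6], q3 = s[7];

    /* Activity gate from the standard: leave genuine image edges untouched. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Half of the selector that both sides share. */
    const int small_step = abs(p0 - q0) < (alpha >> 2) + 2;

    if (small_step && abs(p2 - p0) < beta) {
        /* Strong p side: rewrites p0, p1 and p2 (p2 reads p3). */
        s[3] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        s[2] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        s[1] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {
        /* Weak p side: rewrites p0 only. */
        s[3] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_step && abs(q2 - q0) < beta) {
        /* Strong q side: mirror of the p side (q2 reads q3). */
        s[4] = (uint8_t)((p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3);
        s[5] = (uint8_t)((p0 + q0 + q1 + q2 + 2) >> 2);
        s[6] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {
        /* Weak q side: rewrites q0 only. */
        s[4] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}

/* Chroma bS=4 filter for one line (hypothetical helper).
 * s[0..3] = p1 p0 | q0 q1; always the weak form, only p0/q0 change. */
static void chroma_intra_filter_line(uint8_t s[4], int alpha, int beta)
{
    const int p1 = s[0], p0 = s[1], q0 = s[2], q1 = s[3];

    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    s[1] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    s[2] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
}
```

All inputs are copied into locals before any write, so the p-side and q-side updates never read an already-filtered sample; a transcription that filters in place without that copy is one of the index pitfalls random inputs tend to expose.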
All 4 pass bit-exact on the QPU on the first try. That is meaningful: the strong/weak selector and per-side asymmetry have sign/shift/index pitfalls that show up immediately on random inputs.
Dispatch: new dispatch_h264_deblock_luma_intra_qpu helper parameterised by orient_h; chroma reuses the existing chroma helper. Recipe table flips all 4 DEBLOCK_*_INTRA CPU→QPU. DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter.

Final H.264 8-bit 4:2:0 hot-path QPU coverage:
Closes the H.264 deblock QPU coverage matrix. Adds the 4 intra (bS=4) variants: luma_v/h_intra + chroma_v/h_intra.

Algorithmically distinct from the bS<4 path:
- Per-side strong/weak filter selector
    strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2) + 2)
    strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2) + 2)
- Strong-p updates p0/p1/p2 with 5-/4-/3-tap blends (reads p3)
- Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2
- Mirror for q-side; no tc0 (bS=4 hardcodes the strength)
- Chroma always weak, only p0/q0 updated (same as bS<4 chroma)

Per H.264 §8.7.2.4. Transcribed from PR #11's C reference (tests/h264_intra_loop_filter_ref.c).

Shaders:
- v3d_h264deblock_luma_v_intra.comp (luma 16-cell + strong/weak)
- v3d_h264deblock_luma_h_intra.comp (transpose of luma_v_intra)
- v3d_h264deblock_chroma_v_intra.comp (8-cell, always weak)
- v3d_h264deblock_chroma_h_intra.comp (transpose of chroma_v_intra)

Dispatch wiring:
- 4 new pipeline pairs in daedalus_ctx
- dispatch_h264_deblock_luma_intra_qpu helper (parameterised by orient_h for V vs H); 2 wrappers
- chroma intra reuses the existing dispatch_h264_deblock_chroma_qpu helper (same WG geometry as bS<4 chroma); 2 wrappers
- DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter, routes CPU/QPU per recipe table
- Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_*_INTRA from CPU to QPU

Verified on hertz:
    $ ./build/test_api_h264 | grep intra
    H.264 deblock luma v intra: 1024/1024 bytes bit-exact
    H.264 deblock luma h intra: 1024/1024 bytes bit-exact
    H.264 deblock chroma v intra: 256/256 bytes bit-exact
    H.264 deblock chroma h intra: 256/256 bytes bit-exact

All 4 PASS first try. Strong/weak quad-tree selector + per-side asymmetry would have surfaced any sign/shift/index mistake; passing on all 4 (including the asymmetric writes-3-cells cases) means the transcription from C is clean.
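Because the H shaders are described as transposes of the V shaders, one way to picture a helper parameterised by orient_h is as a pair of strides: one stepping across the edge, one stepping from line to line. The C sketch below is only an illustration of that idea. The function names are hypothetical, and the meaning assigned to `orient_h` here (non-zero means samples across the edge are adjacent bytes in a row) is an assumption, not something the commit message states.

```c
#include <stddef.h>
#include <stdint.h>

/* Gather p3..q3 for line `i` of an edge. `pix` points at the q0 sample of
 * line 0. Assumed convention: orient_h != 0 walks across the edge along a
 * row (step 1) and moves to the next line by `stride`; orient_h == 0 is
 * the transpose of that. */
static void gather_line(const uint8_t *pix, ptrdiff_t stride, int orient_h,
                        int i, uint8_t s[8])
{
    const ptrdiff_t across = orient_h ? 1 : stride;  /* p -> q direction */
    const ptrdiff_t along  = orient_h ? stride : 1;  /* line -> next line */
    const uint8_t *q0 = pix + i * along;

    for (int k = 0; k < 8; k++)
        s[k] = q0[(k - 4) * across];                 /* s[4] is q0 */
}

/* Write the 8 samples back using the same addressing. */
static void scatter_line(uint8_t *pix, ptrdiff_t stride, int orient_h,
                         int i, const uint8_t s[8])
{
    const ptrdiff_t across = orient_h ? 1 : stride;
    const ptrdiff_t along  = orient_h ? stride : 1;
    uint8_t *q0 = pix + i * along;

    for (int k = 0; k < 8; k++)
        q0[(k - 4) * across] = s[k];
}
```

With addressing isolated like this, the filter arithmetic itself is identical for both orientations, which is consistent with the two luma wrappers sharing one helper.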
Deblock QPU coverage matrix: COMPLETE (8 of 8)

bS<4 (non-intra):
  luma_v    ✓ cycle 8
  luma_h    ✓ PR #28
  chroma_v  ✓ PR #29
  chroma_h  ✓ PR #29

bS=4 (intra, this PR):
  luma_v    ✓
  luma_h    ✓
  chroma_v  ✓
  chroma_h  ✓

The full H.264 8-bit 4:2:0 hot-path pixel-math layer is now on QPU when daedalus is initialised with a QPU-capable context:
- IDCT 4x4 / 8x8 ✓
- All 8 deblock variants ✓
- All 30 qpel positions (15 put_ + 15 avg_) ✓
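The routing mentioned in the dispatch wiring (a macro that takes both a CPU and a QPU function and picks one per recipe table entry) might look roughly like the following. Every name in this sketch is hypothetical and prefixed with `sketch_` or `SKETCH_` to make that plain; the real DEFINE_INTRA_DISPATCH signature and recipe table layout are not shown in this PR text.

```c
#include <stdint.h>

/* Hypothetical backend and kernel identifiers for the illustration. */
typedef enum { SKETCH_BACKEND_CPU, SKETCH_BACKEND_QPU } sketch_backend;
typedef enum {
    SKETCH_LUMA_V_INTRA,
    SKETCH_LUMA_H_INTRA,
    SKETCH_KERNEL_COUNT
} sketch_kernel;

typedef int (*sketch_deblock_fn)(uint8_t *pix, int stride, int alpha, int beta);

/* Recipe table: which backend each kernel should run on. */
static sketch_backend sketch_recipe[SKETCH_KERNEL_COUNT] = {
    [SKETCH_LUMA_V_INTRA] = SKETCH_BACKEND_QPU,
    [SKETCH_LUMA_H_INTRA] = SKETCH_BACKEND_CPU,
};

/* Stand-in implementations that only report which backend ran. */
static int sketch_cpu_fn(uint8_t *pix, int stride, int alpha, int beta)
{
    (void)pix; (void)stride; (void)alpha; (void)beta;
    return SKETCH_BACKEND_CPU;
}

static int sketch_qpu_fn(uint8_t *pix, int stride, int alpha, int beta)
{
    (void)pix; (void)stride; (void)alpha; (void)beta;
    return SKETCH_BACKEND_QPU;
}

/* Define an entry point that consults the recipe table on every call. */
#define SKETCH_DEFINE_INTRA_DISPATCH(name, kernel, cpu_fn, qpu_fn)          \
    static int name(uint8_t *pix, int stride, int alpha, int beta)          \
    {                                                                       \
        sketch_deblock_fn fn =                                              \
            (sketch_recipe[kernel] == SKETCH_BACKEND_QPU) ? (qpu_fn)        \
                                                          : (cpu_fn);       \
        return fn(pix, stride, alpha, beta);                                \
    }

SKETCH_DEFINE_INTRA_DISPATCH(sketch_dispatch_luma_v_intra,
                             SKETCH_LUMA_V_INTRA, sketch_cpu_fn, sketch_qpu_fn)
SKETCH_DEFINE_INTRA_DISPATCH(sketch_dispatch_luma_h_intra,
                             SKETCH_LUMA_H_INTRA, sketch_cpu_fn, sketch_qpu_fn)
```

Keeping the backend choice in a table rather than in the call sites means moving a kernel from CPU to QPU is a one-entry change, which matches how this PR describes flipping the four intra entries.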