h264: V3D shaders for chroma deblock V + H (4:2:0) #29

Merged
marfrit merged 1 commit from noether/v3d-shader-h264-deblock-chroma into main 2026-05-25 16:35:09 +00:00

Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra, bS<4), siblings of cycle 8's luma_v and PR #28's luma_h. Per H.264 §8.7.2.4 the chroma kernel is simpler than luma: only p0/q0 are updated, tC = tc0_seg+1, and there are 8 cells per edge (vs luma's 16). The shader is 64 lines vs luma_v's 108.

4:2:0-only: 4:2:2 has a 16-row chroma_h edge geometry that this shader does not handle. The recipe table flips DEBLOCK_CV/CH from CPU to QPU, and the shared QPU plumbing is factored between V and H.

Test results on hertz: chroma v: 256/256 bit-exact, chroma h: 256/256 bit-exact. Recipe substrate=2 (QPU) for both.

Non-intra deblock QPU matrix complete after this PR: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Intra (bS=4) variants stay CPU NEON (less common, smaller per-frame contribution, structurally different algorithm).

marfrit added 1 commit 2026-05-25 15:10:37 +00:00
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra
bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h.
Completes QPU coverage for the non-intra half (4 of 8) of the deblock kernels:

  luma_v   ✓ cycle 8
  luma_h   ✓ PR #28
  chroma_v ✓ this PR
  chroma_h ✓ this PR
  *_intra  — CPU NEON (less common; smaller volume)

Per H.264 §8.7.2.4 the chroma kernel is simpler than luma: only p0/q0
are updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq
side bonus), and there are 8 cells per edge (vs luma's 16).  Shader: 64
lines vs luma_v's 108, with the same WG geometry (16 edges × 16 lanes;
lanes 8..15 of each edge early-return).
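
For readers who want the arithmetic spelled out, here is a minimal CPU-side
sketch in C of the per-cell chroma filter and the lane mapping described
above. It is not the shader from this PR. The function names, the "pointer
at q0" convention and the "tc0 < 0 means skip" convention are assumptions
made for illustration; the filter formula itself is the standard bS<4
chroma update for 8-bit samples.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp v into [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Map a flat workgroup-local id (0..255) to (edge, cell), mirroring the
 * 16 edges x 16 lanes geometry.  Returns 0 for lanes 8..15 of an edge,
 * which have no chroma cell and would return early.
 */
static int lane_to_cell(int local_id, int *edge, int *cell)
{
    *edge = local_id / 16;
    *cell = local_id % 16;
    return *cell < 8;
}

/*
 * Filter one 8-bit chroma cell across an edge (bS < 4).
 * q0p points at the first sample on the q side; xstride steps across the
 * edge, so p1 = q0p[-2*xstride], p0 = q0p[-xstride], q1 = q0p[xstride].
 * ASSUMPTION: tc0 < 0 marks an unfiltered segment (bS == 0).
 */
static void chroma_filter_cell(uint8_t *q0p, ptrdiff_t xstride,
                               int alpha, int beta, int tc0)
{
    if (tc0 < 0)
        return;

    const int p1 = q0p[-2 * xstride];
    const int p0 = q0p[-xstride];
    const int q0 = q0p[0];
    const int q1 = q0p[xstride];

    /* Sample-level filter decision: skip cells that look like real edges. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int tc = tc0 + 1;  /* chroma: no ap/aq side bonus */
    /* Relies on arithmetic right shift of negative ints. */
    const int delta = clip3(-tc, tc, (((q0 - p0) * 4) + (p1 - q1) + 4) >> 3);

    /* Only p0 and q0 are written; p1 and q1 are read-only taps. */
    q0p[-xstride] = (uint8_t)clip3(0, 255, p0 + delta);
    q0p[0]        = (uint8_t)clip3(0, 255, q0 - delta);
}
```

With samples {60, 60, 80, 80}, alpha = 40, beta = 10 and tc0 = 1, the raw
delta is 8, which is clamped to tC = 2, so p0 becomes 62 and q0 becomes 78.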

4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this
shader doesn't address.  daedalus_dispatch_h264_deblock_chroma_h is
4:2:0-only by design; caller-side gating already covers this in the
libavcodec substitution arc (marfrit-packages PR #98).
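
To make the 4:2:2 restriction concrete, the sketch below derives the chroma
macroblock dimensions from chroma_format_idc (the values 0..3 are those
defined by H.264) and shows what a caller-side gate could look like. The
helper names, including can_use_qpu_chroma_deblock, are hypothetical and are
not taken from this repository or from marfrit-packages.

```c
#include <stdbool.h>

/* chroma_format_idc values as defined by H.264. */
enum {
    CHROMA_MONO = 0,  /* no chroma planes */
    CHROMA_420  = 1,
    CHROMA_422  = 2,
    CHROMA_444  = 3
};

/* Chroma macroblock width in samples (0 for monochrome). */
static int chroma_mb_width(int chroma_format_idc)
{
    switch (chroma_format_idc) {
    case CHROMA_420:
    case CHROMA_422: return 8;
    case CHROMA_444: return 16;
    default:         return 0;
    }
}

/* Chroma macroblock height in samples (0 for monochrome). */
static int chroma_mb_height(int chroma_format_idc)
{
    switch (chroma_format_idc) {
    case CHROMA_420: return 8;
    case CHROMA_422:
    case CHROMA_444: return 16;
    default:         return 0;
    }
}

/*
 * Hypothetical gate: only 4:2:0 has the 8-row edges that an
 * 8-cells-per-edge kernel covers.  In 4:2:2 the chroma block is 8 wide
 * but 16 tall, so edges running down the block are 16 rows long.
 */
static bool can_use_qpu_chroma_deblock(int chroma_format_idc)
{
    return chroma_format_idc == CHROMA_420;
}
```

A caller would fall back to its existing CPU path whenever the gate returns
false.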

Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to
QPU.  dispatch_h264_deblock_chroma_qpu factored to share QPU
plumbing between V and H (orientation passed as a flag for the
dst_max calculation).
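
As an illustration of why orientation matters for a bound like dst_max, the
sketch below computes the highest byte offset one 8-cell chroma edge reads,
relative to q0 of its first cell. This is only one plausible reading of
dst_max; the real calculation in dispatch_h264_deblock_chroma_qpu is not
shown in this PR text. The enum, struct and function names are hypothetical,
and the V/H stride convention (V steps across rows, H steps across columns)
is an assumption borrowed from common libavcodec usage.

```c
#include <stddef.h>

/* Hypothetical orientation flag. */
typedef enum { DEBLOCK_V = 0, DEBLOCK_H = 1 } deblock_orient;

typedef struct {
    size_t across;  /* step from p0 to q0, i.e. across the edge */
    size_t along;   /* step from one cell to the next along the edge */
} edge_strides;

/* ASSUMPTION: V filters across rows, H filters across columns. */
static edge_strides strides_for(deblock_orient o, size_t row_stride)
{
    edge_strides s;
    if (o == DEBLOCK_V) {
        s.across = row_stride;
        s.along  = 1;
    } else {
        s.across = 1;
        s.along  = row_stride;
    }
    return s;
}

/*
 * Highest offset read by one edge: q1 of the last of 8 cells,
 * i.e. 7 steps along the edge plus 1 step across it.
 */
static size_t edge_max_offset(deblock_orient o, size_t row_stride)
{
    const edge_strides s = strides_for(o, row_stride);
    return 7 * s.along + s.across;
}
```

With a 64-byte row stride this gives 71 for V and 449 for H, which is why a
single shared dispatch path would need the orientation to size its bound.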

Verified on hertz:

  $ ./build/test_api_h264 | grep "deblock chroma [vh]:"
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)

  Recipe substrate now reports 2 (QPU) for both CV and CH.
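
The bit-exact figures above come from comparing QPU output against a
reference byte for byte. A minimal sketch of such a check follows; it is not
the code in test_api_h264, and the helper names are hypothetical. Only the
report format is modelled on the output lines quoted above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count positions where the two buffers hold the same byte. */
static size_t count_matching_bytes(const uint8_t *ref, const uint8_t *out,
                                   size_t n)
{
    size_t match = 0;
    for (size_t i = 0; i < n; i++)
        match += (ref[i] == out[i]);
    return match;
}

/*
 * Format a report line into `line` and return 1 only if every byte
 * matched.  An empty comparison is reported as 0% and treated as a
 * failure.
 */
static int report_bit_exact(const char *name, const uint8_t *ref,
                            const uint8_t *out, size_t n,
                            char *line, size_t line_sz)
{
    const size_t match = count_matching_bytes(ref, out, n);
    const double pct = n ? 100.0 * (double)match / (double)n : 0.0;

    snprintf(line, line_sz, "%s: %zu/%zu bytes bit-exact (%.4f%%)",
             name, match, n, pct);
    return n > 0 && match == n;
}
```

A test would fill `ref` from the CPU path and `out` from the QPU path, then
fail if report_bit_exact returns 0.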

Coverage now:
                bS<4 QPU     bS=4 (intra)
  luma_v        ✓ cycle 8    CPU NEON
  luma_h        ✓ PR #28     CPU NEON
  chroma_v      ✓ this PR    CPU NEON
  chroma_h      ✓ this PR    CPU NEON

Intra (bS=4) variants stay CPU NEON.  Less common case, smaller
per-frame contribution, and the algorithm is structurally different
(no tc0; strong-vs-weak filter quad-tree).  Can land as a follow-up
PR if perf demands.
marfrit merged commit 37b75b5813 into main 2026-05-25 16:35:09 +00:00
marfrit deleted branch noether/v3d-shader-h264-deblock-chroma 2026-05-25 16:35:09 +00:00

Reference: marfrit/daedalus-fourier#29