d8de7754fa
Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h. Closes 4 of 8 deblock QPU coverage at non-intra: luma_v ✓ cycle 8 luma_h ✓ PR #28 chroma_v ✓ this PR chroma_h ✓ this PR *_intra — CPU NEON (less common; smaller volume) Per H.264 §8.7.2.4 chroma kernel is simpler than luma: only p0/q0 updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq side bonus), 8 cells per edge (vs luma's 16). Shader: 64 lines vs luma_v's 108 — same WG geometry (16 edges × 16 lanes, lanes 8..15 of each edge early-return). 4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is 4:2:0-only by design, caller-side gating already covers this in the libavcodec substitution arc (marfrit-packages PR #98). Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to QPU. dispatch_h264_deblock_chroma_qpu factored to share QPU plumbing between V and H (orientation passed as a flag for the dst_max calculation). Verified on hertz: $ ./build/test_api_h264 | grep "deblock chroma [vh]:" H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) Recipe substrate now reports 2 (QPU) for both CV and CH. Coverage now: bS<4 QPU bS=4 (intra) luma_v ✓ cycle 8 CPU NEON luma_h ✓ PR #28 CPU NEON chroma_v ✓ this PR CPU NEON chroma_h ✓ this PR CPU NEON Intra (bS=4) variants stay CPU NEON. Less common case, smaller per-frame contribution, and the algorithm is structurally different (no tc0; strong-vs-weak filter quad-tree). Can land as a follow-up PR if perf demands.