h264: V3D shaders for the 4 bS=4 intra deblock variants — deblock QPU complete #35

Merged
marfrit merged 1 commit from noether/v3d-shader-h264-deblock-intra into main 2026-05-25 18:36:11 +00:00
Closes the H.264 deblock QPU coverage matrix (8 of 8 variants now on QPU).

4 new shaders: luma_v_intra, luma_h_intra, chroma_v_intra, chroma_h_intra. Algorithmically distinct from bS<4: a per-side strong/weak filter selector decides whether to update 3 cells (strong, 5-/4-/3-tap blends, reads p3/q3) or 1 cell (weak), and there is no tc0 clipping. Chroma intra: always weak, p0/q0 only. Transcribed from PR #11's C ref.

All 4 pass bit-exact on the QPU at the first attempt. That is meaningful: the strong/weak selector and per-side asymmetry have sign/shift/index pitfalls that show up immediately on random inputs.

Dispatch: new dispatch_h264_deblock_luma_intra_qpu helper parameterised by orient_h; chroma reuses the existing chroma helper. Recipe table flips all 4 DEBLOCK_*_INTRA CPU→QPU. DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter.

Final H.264 8-bit 4:2:0 hot-path QPU coverage:

  • IDCT 4x4 / 8x8 ✓
  • All 8 deblock variants ✓ (bS<4 + bS=4)
  • All 30 qpel positions ✓ (15 put_ + 15 avg_)
marfrit added 1 commit 2026-05-25 18:30:09 +00:00
Closes the H.264 deblock QPU coverage matrix.  Adds the 4 intra
(bS=4) variants — luma_v/h_intra + chroma_v/h_intra.

Algorithmically distinct from the bS<4 path:
  - Per-side strong/weak filter selector
      strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2) + 2)
      strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2) + 2)
  - Strong-p updates p0/p1/p2 with 5-/4-/3-tap blends (reads p3)
  - Weak-p updates p0 only with (2*p1 + p0 + q1 + 2) >> 2
  - Mirror for q-side; no tc0 (bS=4 hardcodes the strength)
  - Chroma always weak, only p0/q0 updated (same as bS<4 chroma)

Per H.264 §8.7.2.4 (filtering process for edges with bS equal to 4).
Transcribed from PR #11's C reference
(tests/h264_intra_loop_filter_ref.c).
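
The bullets above can be sketched in C for a single line of luma samples. This is a minimal illustration, not the contents of the PR's reference file or shaders: the function name, the `step` parameter and the layout (`pix` pointing at q0) are assumptions made for the example. The blend weights and the initial edge gate (|p0-q0| < α, |p1-p0| < β, |q1-q0| < β) come from the standard's bS=4 luma filter; the gate is not spelled out in the notes above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Filter one line of luma samples across a bS=4 edge (hypothetical helper).
 * pix points at q0; step is the distance between neighbouring samples
 * across the edge, so p0 = pix[-step], p3 = pix[-4*step], q3 = pix[3*step]. */
static void luma_intra_line_sketch(uint8_t *pix, ptrdiff_t step,
                                   int alpha, int beta)
{
    const int p3 = pix[-4 * step], p2 = pix[-3 * step];
    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0],         q1 = pix[1 * step];
    const int q2 = pix[2 * step],  q3 = pix[3 * step];

    /* Edge gate from the standard: leave the line untouched if it fails. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Per-side strong/weak selector; the p0/q0 term is shared. */
    const int small_gap = abs(p0 - q0) < (alpha >> 2) + 2;
    const int strong_p  = small_gap && abs(p2 - p0) < beta;
    const int strong_q  = small_gap && abs(q2 - q0) < beta;

    if (strong_p) {            /* strong: rewrite p0, p1, p2 (reads p3) */
        pix[-1 * step] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        pix[-2 * step] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        pix[-3 * step] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                   /* weak: rewrite p0 only */
        pix[-1 * step] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (strong_q) {            /* mirror image for the q side (reads q3) */
        pix[0 * step] = (uint8_t)((p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3);
        pix[1 * step] = (uint8_t)((p0 + q0 + q1 + q2 + 2) >> 2);
        pix[2 * step] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {
        pix[0 * step] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

As a worked case, four samples of 100 facing four samples of 104 with alpha=40 and beta=10 take the strong path on both sides and become 100 101 101 102 103 103 104 104. With alpha=8 the same line falls back to the weak path and only p0 and q0 change (to 101 and 103).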

Shaders:
  - v3d_h264deblock_luma_v_intra.comp  (luma 16-cell + strong/weak)
  - v3d_h264deblock_luma_h_intra.comp  (transpose of luma_v_intra)
  - v3d_h264deblock_chroma_v_intra.comp (8-cell, always weak)
  - v3d_h264deblock_chroma_h_intra.comp (transpose of chroma_v_intra)
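
The "transpose" relationship between the v and h shaders can be pictured as one routine with two strides swapped. The sketch below is an assumption about how such a parameterisation might look in C, using the always-weak chroma filter because it is the shortest; `chroma_intra_edge_sketch`, `across` and `along` are hypothetical names, and the actual shaders may be organised differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Filter `cells` lines across one bS=4 chroma edge (hypothetical helper).
 * pix points at q0 of the first line.
 * across: distance between samples on either side of the edge.
 * along:  distance between consecutive lines running along the edge.
 * Swapping the two strides turns one orientation into the other. */
static void chroma_intra_edge_sketch(uint8_t *pix, ptrdiff_t across,
                                     ptrdiff_t along, int cells,
                                     int alpha, int beta)
{
    for (int i = 0; i < cells; i++, pix += along) {
        const int p1 = pix[-2 * across], p0 = pix[-1 * across];
        const int q0 = pix[0],           q1 = pix[1 * across];

        /* Same edge gate as luma; skip lines that fail it. */
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
            abs(q1 - q0) >= beta)
            continue;

        /* Always weak: only p0 and q0 are rewritten. */
        pix[-1 * across] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
        pix[0]           = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

For an 8x8 block with row stride 8, passing `across = 8, along = 1` filters the edge between rows 3 and 4, while `across = 1, along = 8` filters the edge between columns 3 and 4. Running both on transposed inputs should give transposed outputs.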

Dispatch wiring:
  - 4 new pipeline pairs in daedalus_ctx
  - dispatch_h264_deblock_luma_intra_qpu helper (parameterised by
    orient_h for V vs H) — 2 wrappers
  - chroma intra reuses the existing dispatch_h264_deblock_chroma_qpu
    helper (same WG geometry as bS<4 chroma) — 2 wrappers
  - DEFINE_INTRA_DISPATCH macro extended with qpu_fn parameter,
    routes CPU/QPU per recipe table
  - Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_*_INTRA from CPU
    to QPU
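
The routing described above (a macro that takes both a CPU and a QPU function and consults the recipe table) might look roughly like the following. This is a sketch under stated assumptions, not the real DEFINE_INTRA_DISPATCH: every identifier here (`backend_sketch_t`, `recipe_table_sketch`, `DEFINE_INTRA_DISPATCH_SKETCH`, the stub functions, the `have_qpu` flag) is hypothetical, and the actual macro's signature is not shown in this PR text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend and kernel identifiers for illustration only. */
typedef enum { BACKEND_CPU_SKETCH, BACKEND_QPU_SKETCH } backend_sketch_t;
typedef enum { KERNEL_LUMA_V_INTRA_SKETCH, KERNEL_COUNT_SKETCH } kernel_sketch_t;

/* Recipe table: which backend each kernel should use. */
static backend_sketch_t recipe_table_sketch[KERNEL_COUNT_SKETCH] = {
    BACKEND_QPU_SKETCH,  /* luma_v_intra flipped from CPU to QPU */
};

enum { RAN_ON_CPU = 1, RAN_ON_QPU = 2 };

/* Stubs standing in for the real CPU and QPU entry points. */
static int luma_v_intra_cpu_stub(uint8_t *pix, ptrdiff_t stride,
                                 int alpha, int beta)
{
    (void)pix; (void)stride; (void)alpha; (void)beta;
    return RAN_ON_CPU;
}

static int luma_v_intra_qpu_stub(uint8_t *pix, ptrdiff_t stride,
                                 int alpha, int beta)
{
    (void)pix; (void)stride; (void)alpha; (void)beta;
    return RAN_ON_QPU;
}

/* Generate a wrapper that picks qpu_fn only when the context is
 * QPU-capable and the recipe table asks for it; otherwise cpu_fn. */
#define DEFINE_INTRA_DISPATCH_SKETCH(name, kernel, cpu_fn, qpu_fn)        \
    static int name(int have_qpu, uint8_t *pix, ptrdiff_t stride,         \
                    int alpha, int beta)                                  \
    {                                                                     \
        if (have_qpu && recipe_table_sketch[kernel] == BACKEND_QPU_SKETCH) \
            return qpu_fn(pix, stride, alpha, beta);                      \
        return cpu_fn(pix, stride, alpha, beta);                          \
    }

DEFINE_INTRA_DISPATCH_SKETCH(dispatch_luma_v_intra_sketch,
                             KERNEL_LUMA_V_INTRA_SKETCH,
                             luma_v_intra_cpu_stub,
                             luma_v_intra_qpu_stub)
```

Keeping the CPU function as the fallback means a context without QPU support, or a table entry left at CPU, still produces output through the same wrapper.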

Verified on hertz:

  $ ./build/test_api_h264 | grep intra
    H.264 deblock luma v intra:   1024/1024 bytes bit-exact
    H.264 deblock luma h intra:   1024/1024 bytes bit-exact
    H.264 deblock chroma v intra:  256/256 bytes bit-exact
    H.264 deblock chroma h intra:  256/256 bytes bit-exact

All 4 PASS first try.  The per-side strong/weak selector and its
asymmetry would have surfaced any sign/shift/index mistake; passing
on all 4 (including the asymmetric writes-3-cells cases) means the
transcription from C is clean.
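
A bit-exact check of this kind boils down to a byte-for-byte comparison of the reference output against the QPU output on pseudo-random input. The helpers below are a hypothetical sketch of that comparison, not the code in test_api_h264; the names, the LCG used to generate input, and the exact report format are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fill a buffer with pseudo-random bytes from a small LCG so that runs
 * are reproducible from a seed (hypothetical helper). */
static void fill_random_sketch(uint8_t *buf, size_t n, uint32_t *state)
{
    for (size_t i = 0; i < n; i++) {
        *state = *state * 1664525u + 1013904223u;
        buf[i] = (uint8_t)(*state >> 24);
    }
}

/* Count positions where the two outputs hold the same byte. */
static size_t count_matching_bytes_sketch(const uint8_t *ref,
                                          const uint8_t *out, size_t n)
{
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        same += (ref[i] == out[i]);
    return same;
}

/* Print a one-line report and return 1 only if every byte matches. */
static int report_bit_exact_sketch(const char *label, const uint8_t *ref,
                                   const uint8_t *out, size_t n)
{
    const size_t same = count_matching_bytes_sketch(ref, out, n);
    printf("  %s: %zu/%zu bytes %s\n", label, same, n,
           same == n ? "bit-exact" : "MISMATCH");
    return same == n;
}
```

Random input matters here because a flat or smooth test block tends to exercise only one branch of the strong/weak selector, while noisy data hits the gate, the weak path, and the strong path on each side independently.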

Deblock QPU coverage matrix — COMPLETE (8 of 8):

  bS<4 (non-intra):
    luma_v    ✓ cycle 8
    luma_h    ✓ PR #28
    chroma_v  ✓ PR #29
    chroma_h  ✓ PR #29

  bS=4 (intra, this PR):
    luma_v    ✓
    luma_h    ✓
    chroma_v  ✓
    chroma_h  ✓

The full H.264 8-bit 4:2:0 hot-path pixel-math layer is now on QPU
when daedalus is initialised with a QPU-capable context:
  - IDCT 4x4 / 8x8 ✓
  - All 8 deblock variants ✓
  - All 30 qpel positions (15 put_ + 15 avg_) ✓
marfrit merged commit 1446b779a6 into main 2026-05-25 18:36:11 +00:00
marfrit deleted branch noether/v3d-shader-h264-deblock-intra 2026-05-25 18:36:11 +00:00
Reference: marfrit/daedalus-fourier#35