h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) #10

Merged

marfrit merged 1 commits from noether/h264-deblock-chroma into main

2026-05-24 21:55:58 +00:00

Author	SHA1	Message	Date
claude-noether	a5c47aa51c	h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs.	2026-05-24 23:53:09 +02:00

Author

SHA1

Message

Date

claude-noether

a5c47aa51c

h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)

Continues the deblock buildout after PR #9 (luma_h).  Adds the two
chroma orientations via the same recipe-table-routed-to-CPU pattern;
QPU shaders for chroma deblock are still a follow-up.

Scope:
  - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}).
  - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the
    vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols.
  - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11,
    DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU.  Explicit
    SUBSTRATE_QPU returns -1 (no shader yet).
  - C reference: tests/h264_chroma_loop_filter_ref.c — covers both
    orientations.  Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter):
    tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0
    are updated (chroma never modifies p1/p2/q1/q2).
  - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) +
    test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x
    2 cells per segment per spec.

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H264_DEBLOCK_CV recipe substrate: 1 (CPU)
    H264_DEBLOCK_CH recipe substrate: 1 (CPU)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%)
    H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

  All 7 kernels bit-exact PASS.  Chroma test sizes are smaller (256
  bytes per orientation) because the per-MB chroma deblock surface is
  smaller than luma — accurate to the production geometry.

Why no QPU shader yet (per the established pattern):
  - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter
    the pixel count of luma per MB) — modest QPU win even after the
    shader exists.
  - Same R-band considerations as the luma _h follow-up: the V shader
    transpose isn't mechanical, and the 8-cell tile is small enough
    that NEON's per-edge cost (~3 ns) is already inside the budget.
  - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us.
    Negligible compared to the IDCT layer's 10 ms (CPU NEON).

Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is
complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓.  Remaining
deblock work: bS=4 intra variants (luma + chroma, V + H).

What this unblocks downstream:
  - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4
    edge categories that a typical inter MB needs.

2026-05-24 23:53:09 +02:00

h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) #10

1 Commits