h264: deblock_luma_h (CPU/NEON via vendored ff_h264_h_loop_filter) #9

Merged
marfrit merged 2 commits from noether/h264-deblock-luma-h into main 2026-05-24 21:48:01 +00:00
Owner

Adds the _h sibling of cycle 8's deblock_luma_v: the horizontal filter, applied across vertical edges. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon; this PR wires up the symbol, the bit-exact C reference, and a new recipe-table kernel DAEDALUS_KERNEL_H264_DEBLOCK_LH=10 (mapped to CPU until a QPU H shader lands; an explicit SUBSTRATE_QPU request returns -1, with no silent CPU degradation).
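A minimal sketch of the fail-fast substrate rule described above. The substrate values 1=CPU and 2=QPU follow the smoke-test log; every `sketch_` name, the callback type, and the parameter list are hypothetical, since the PR text elides the real signature of daedalus_dispatch_h264_deblock_luma_h.

```c
#include <stddef.h>
#include <stdint.h>

/* Substrate ids as printed by the API smoke test (1=CPU, 2=QPU). */
enum sketch_substrate { SKETCH_SUBSTRATE_CPU = 1, SKETCH_SUBSTRATE_QPU = 2 };

/* Hypothetical CPU backend type. In the PR this role is played by
 * dispatch_h264_deblock_h_cpu(), which calls the NEON entry. */
typedef void (*sketch_cpu_filter_fn)(uint8_t *pix, ptrdiff_t stride,
                                     int alpha, int beta, int8_t *tc0);

/* Stand-in backend that only counts calls, so the sketch links
 * without the vendored NEON symbol. */
static int sketch_stub_calls = 0;
static void sketch_stub_filter(uint8_t *pix, ptrdiff_t stride,
                               int alpha, int beta, int8_t *tc0)
{
    (void)pix; (void)stride; (void)alpha; (void)beta; (void)tc0;
    sketch_stub_calls++;
}

/* Returns 0 on success, -1 if the requested substrate has no implementation. */
static int sketch_dispatch_deblock_luma_h(enum sketch_substrate sub,
                                          sketch_cpu_filter_fn cpu_fn,
                                          uint8_t *pix, ptrdiff_t stride,
                                          int alpha, int beta, int8_t *tc0)
{
    switch (sub) {
    case SKETCH_SUBSTRATE_CPU:
        cpu_fn(pix, stride, alpha, beta, tc0);
        return 0;
    case SKETCH_SUBSTRATE_QPU:
        /* No QPU H shader yet: fail fast, never fall back to CPU silently. */
        return -1;
    default:
        return -1;
    }
}
```

Returning an error instead of falling back means a caller that asked for the QPU cannot mistake CPU timings for QPU timings.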

test_api_h264 grows test_deblock_h() with 8 tiles (8 cols × 16 rows each, vertical edge at col 4, random alpha/beta/tc0). PASS 1024/1024 bytes bit-exact on hertz. All 5 H.264 kernels (idct4, idct8, deblock_lv, deblock_lh, qpel_mc20) now have bit-exact gates in the api smoke. Closes the gap that prevented daedalus-decoder from dispatching the per-MB internal vertical edges through fourier.
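The bit-exact gate can be sketched as a plain byte-for-byte comparison. The tile geometry (8 tiles of 8 cols × 16 rows) comes from the description above; the `sketch_` names are hypothetical and this is not the code in test_api_h264.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

enum {
    SKETCH_TILES     = 8,   /* tiles in the test */
    SKETCH_TILE_COLS = 8,   /* 4 pixels on each side of the edge at col 4 */
    SKETCH_TILE_ROWS = 16,  /* one full luma edge */
    SKETCH_TOTAL     = SKETCH_TILES * SKETCH_TILE_COLS * SKETCH_TILE_ROWS
};

/* Count positions where the dispatched output equals the reference output. */
static size_t sketch_count_matches(const uint8_t *got, const uint8_t *ref, size_t n)
{
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        same += (got[i] == ref[i]);
    return same;
}

/* Print a summary line in the style of the smoke test.
 * Returns 1 only when every byte matches; n must be non-zero. */
static int sketch_report(const char *name, const uint8_t *got,
                         const uint8_t *ref, size_t n)
{
    size_t same = sketch_count_matches(got, ref, n);
    printf("  %s: %zu/%zu bytes bit-exact (%.4f%%)\n",
           name, same, n, 100.0 * (double)same / (double)n);
    return same == n;
}
```

With this geometry SKETCH_TOTAL is 8 × 8 × 16 = 1024, which matches the 1024/1024 figure reported for the H variant.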

Follow-up backlog (not in this PR): V3D shader for the H variant; bS=4 intra-strength filter; chroma deblock _v/_h.

marfrit added 2 commits 2026-05-24 21:29:22 +00:00
Adds the _h sibling of cycle 8's deblock_luma_v: the horizontal
filter, applied across vertical edges.  The vendored FFmpeg snapshot
already includes ff_h264_h_loop_filter_luma_neon in
libavcodec/aarch64/h264dsp_neon.S; this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.

Scope:
  - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
    + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
  - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
  - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
    to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written.  An
    explicit SUBSTRATE_QPU request on the H dispatch returns -1
    (fails fast, no silent CPU degradation).
  - C reference: tests/h264_h_loop_filter_luma_ref.c — the
    column-axis transpose of h264_deblock_ref.c.  Same per-segment
    kernel; pix[-4..+3] accesses cols instead of rows*stride.
  - Test: test_api_h264 grows a test_deblock_h() with 8 tiles
    (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
    compares NEON dispatch against reference byte-for-byte.
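
The per-segment kernel named in the C reference item can be sketched as
below.  This is modelled on the standard H.264 normal-strength (bS < 4)
luma filter for 8-bit samples, not copied from
tests/h264_h_loop_filter_luma_ref.c, and the sketch_ names are
hypothetical.  It shows the column-axis access pattern: neighbours sit
at pix[-3..+2] within a row, and stride is used only to step to the
next row.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int sketch_clip(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* pix points at q0 (first pixel right of the vertical edge) in row 0.
 * Assumes arithmetic right shift of negative ints, as common C
 * implementations of this filter do. */
static void sketch_h_loop_filter_luma(uint8_t *pix, ptrdiff_t stride,
                                      int alpha, int beta, const int8_t *tc0)
{
    for (int seg = 0; seg < 4; seg++) {      /* four 4-row segments */
        const int tc_orig = tc0[seg];
        if (tc_orig < 0) {                   /* negative tc0: skip segment */
            pix += 4 * stride;
            continue;
        }
        for (int row = 0; row < 4; row++, pix += stride) {
            const int p0 = pix[-1], p1 = pix[-2], p2 = pix[-3];
            const int q0 = pix[0],  q1 = pix[1],  q2 = pix[2];

            /* Leave real image edges alone: filter only small steps. */
            if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
                abs(q1 - q0) >= beta)
                continue;

            int tc = tc_orig;
            const int avg = (p0 + q0 + 1) >> 1;
            if (abs(p2 - p0) < beta) {       /* smooth p side: adjust p1 */
                if (tc_orig)
                    pix[-2] = (uint8_t)(p1 + sketch_clip(((p2 + avg) >> 1) - p1,
                                                         -tc_orig, tc_orig));
                tc++;
            }
            if (abs(q2 - q0) < beta) {       /* smooth q side: adjust q1 */
                if (tc_orig)
                    pix[1] = (uint8_t)(q1 + sketch_clip(((q2 + avg) >> 1) - q1,
                                                        -tc_orig, tc_orig));
                tc++;
            }
            const int delta = sketch_clip(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3,
                                          -tc, tc);
            pix[-1] = (uint8_t)sketch_clip(p0 + delta, 0, 255);
            pix[0]  = (uint8_t)sketch_clip(q0 - delta, 0, 255);
        }
    }
}
```

For an 8 x 16 tile with the edge at col 4, a caller would pass
tile + 4 as pix and 8 as stride.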

Verified on hertz (Pi 5 / V3D 7.1):

  $ ./build/test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

  All 5 kernels bit-exact PASS.  The new H variant joins the suite
  with 128 random-input bytes per tile x 8 tiles = 1024 bytes.

Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere.  Even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline), a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us, well inside the
~33 ms per-frame budget at 30 fps.  Writing the V3D H-edge shader is
a follow-up (cycle 8' or similar); the V-edge shader's transpose
isn't mechanical because of how the workgroup organisation maps to
columns vs rows.
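
The budget estimate above can be reproduced with a few lines of C.
The 6 ns/edge figure and the 4 edges per MB are taken from the text;
the sketch_ helper names are hypothetical.

```c
#include <stdio.h>

/* Macroblocks needed to cover `pixels` at 16 pixels per MB, rounding up
 * (1080 rows need 68 MB rows, i.e. 1088 coded rows). */
static int sketch_mbs(int pixels)
{
    return (pixels + 15) / 16;
}

/* Estimated H-deblock time per frame, in microseconds. */
static double sketch_frame_cost_us(int width, int height,
                                   int edges_per_mb, double ns_per_edge)
{
    long edges = (long)sketch_mbs(width) * sketch_mbs(height) * edges_per_mb;
    return (double)edges * ns_per_edge / 1000.0;
}

/* Print the cost next to the per-frame budget at 30 fps. */
static void sketch_print_budget(void)
{
    double cost_us   = sketch_frame_cost_us(1920, 1080, 4, 6.0);
    double budget_us = 1e6 / 30.0;
    printf("%.2f us of %.0f us (%.2f%%)\n",
           cost_us, budget_us, 100.0 * cost_us / budget_us);
}
```

For 1920x1080 this gives 120 * 68 = 8160 MBs, 32 640 edges and
195.84 us, which is under 1% of the per-frame budget at 30 fps.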

Backlog addition (out of scope for this PR):
  - V3D shader for the H variant (mirror of v3d_h264deblock.spv).
  - bS=4 intra-strength filter (different algebra; both _v and _h).
  - Chroma deblock _v/_h (8-cell variants).
The previous commit unintentionally added .claude/scheduled_tasks.lock
which is an agent-runtime artefact, not source. Untrack it and add
.claude/ to .gitignore so it stays out of future commits.
marfrit merged commit f4af24020f into main 2026-05-24 21:48:01 +00:00
marfrit deleted branch noether/h264-deblock-luma-h 2026-05-24 21:48:03 +00:00

Reference: marfrit/daedalus-fourier#9