h264: deblock_luma_h (CPU/NEON via vendored ff_h264_h_loop_filter) #9
Adds deblock_luma_h, the horizontal-filter sibling of cycle 8's deblock_luma_v; it filters across vertical edges. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon; this PR wires up the symbol, the bit-exact C reference, and a new recipe-table kernel DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10 (CPU until a QPU H shader lands; an explicit SUBSTRATE_QPU request returns -1, with no silent CPU degradation).

test_api_h264 grows test_deblock_h() with 8 tiles (8 cols × 16 rows each, vertical edge at col 4, random alpha/beta/tc0). PASS: 1024/1024 bytes bit-exact on hertz. All 5 H.264 kernels (idct4, idct8, deblock_lv, deblock_lh, qpel_mc20) now have bit-exact gates in the API smoke test. This closes the gap that prevented daedalus-decoder from dispatching the per-MB internal vertical edges through fourier.
Follow-up backlog (not in this PR): V3D shader for the H variant; bS=4 intra-strength filter; chroma deblock _v/_h.
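The fail-fast rule for an explicit QPU request can be sketched as below. This is a minimal illustration and not the PR's code: every sketch_ name, the function-pointer plumbing, and the call-counting stand-in kernel are assumptions made so the example runs off-target. The kernel signature is modeled on FFmpeg's h264dsp loop-filter callbacks, and the substrate values follow the "1=CPU, 2=QPU" legend in the smoke-test output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative substrate ids (1=CPU, 2=QPU); the names are hypothetical. */
enum sketch_substrate {
    SKETCH_SUBSTRATE_CPU = 1,
    SKETCH_SUBSTRATE_QPU = 2
};

/* Kernel entry shape, modeled on FFmpeg's h264dsp loop-filter callbacks. */
typedef void (*sketch_deblock_fn)(uint8_t *pix, ptrdiff_t stride,
                                  int alpha, int beta, int8_t *tc0);

/* Stand-in CPU kernel so the sketch runs anywhere. On target this slot
 * would hold the NEON entry; here it only counts calls. */
int sketch_cpu_calls;
void sketch_cpu_kernel(uint8_t *pix, ptrdiff_t stride,
                       int alpha, int beta, int8_t *tc0)
{
    (void)pix; (void)stride; (void)alpha; (void)beta; (void)tc0;
    sketch_cpu_calls++;
}

/* Returns 0 on success, -1 if the requested substrate has no kernel. */
int sketch_dispatch_deblock_h(int substrate, sketch_deblock_fn cpu_fn,
                              uint8_t *pix, ptrdiff_t stride,
                              int alpha, int beta, int8_t *tc0)
{
    assert(cpu_fn != NULL);
    switch (substrate) {
    case SKETCH_SUBSTRATE_CPU:
        cpu_fn(pix, stride, alpha, beta, tc0);
        return 0;
    case SKETCH_SUBSTRATE_QPU:
        /* No H shader yet: refuse instead of quietly running on the CPU. */
        return -1;
    default:
        return -1;
    }
}
```

Returning -1 keeps the caller's substrate choice honest: a consumer that asked for the QPU learns immediately that the work did not run there, rather than seeing unexplained CPU time.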
Adds deblock_luma_h, the horizontal-filter sibling of cycle 8's deblock_luma_v; it filters across vertical edges. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon in libavcodec/aarch64/h264dsp_neon.S; this PR wires up the symbol, the bit-exact reference, and the recipe-table entry so daedalus-decoder and other consumers can call the H variant through the same dispatch shape they use for _v.

Scope:
- Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...) + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
- Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
- Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An explicit SUBSTRATE_QPU request on the H dispatch returns -1 (fails fast, no silent CPU degradation).
- C reference: tests/h264_h_loop_filter_luma_ref.c, the column-axis transpose of h264_deblock_ref.c. Same per-segment kernel; pix[-4..+3] steps across columns within a row instead of across rows (multiples of stride).
- Test: test_api_h264 grows a test_deblock_h() with 8 tiles (8 cols x 16 rows each, edge at col 4) and random alpha/beta/tc0; it compares the NEON dispatch against the reference byte-for-byte.

Verified on hertz (Pi 5 / V3D 7.1):

    $ ./build/test_api_h264
    === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate: 2
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate: 2
    H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
    H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
    H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

All 5 kernels bit-exact PASS. The new H variant joins the suite with 8 tiles x 128 random-input bytes per tile (1024 bytes total).
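For readers who have not seen the column-axis form, here is a minimal sketch of a normal-strength (bS < 4) luma filter across a vertical edge, following the standard H.264 algebra as FFmpeg's C fallback arranges it. It is not the contents of tests/h264_h_loop_filter_luma_ref.c: the function name, the clip3 helper, and the fixed 16-row, 8-bit layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp v into [lo, hi]. */
static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical stand-in for the C reference (name and layout assumed).
 * Filters 16 rows across the vertical edge between pix[-1] and pix[0].
 * tc0[i] covers rows 4*i .. 4*i+3; tc0[i] < 0 skips that segment.
 * Like FFmpeg's C code, it assumes >> on a negative int is arithmetic.
 */
void sketch_h_loop_filter_luma(uint8_t *pix, ptrdiff_t stride,
                               int alpha, int beta, const int8_t *tc0)
{
    assert(pix != NULL && tc0 != NULL);
    for (int row = 0; row < 16; row++, pix += stride) {
        const int tc_orig = tc0[row >> 2];
        if (tc_orig < 0)
            continue;
        /* Neighbours are adjacent columns, not rows*stride. */
        const int p2 = pix[-3], p1 = pix[-2], p0 = pix[-1];
        const int q0 = pix[0],  q1 = pix[1],  q2 = pix[2];
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta ||
            abs(q1 - q0) >= beta)
            continue;
        const int avg = (p0 + q0 + 1) >> 1;
        int tc = tc_orig;
        if (abs(p2 - p0) < beta) {
            if (tc_orig)
                pix[-2] = (uint8_t)(p1 + clip3(((p2 + avg) >> 1) - p1,
                                               -tc_orig, tc_orig));
            tc++;
        }
        if (abs(q2 - q0) < beta) {
            if (tc_orig)
                pix[1] = (uint8_t)(q1 + clip3(((q2 + avg) >> 1) - q1,
                                              -tc_orig, tc_orig));
            tc++;
        }
        const int delta = clip3((((q0 - p0) * 4) + (p1 - q1) + 4) >> 3,
                                -tc, tc);
        pix[-1] = (uint8_t)clip3(p0 + delta, 0, 255);
        pix[0]  = (uint8_t)clip3(q0 - delta, 0, 255);
    }
}
```

For an 8-column tile with the edge at col 4, the call would pass tile + 4 as pix and 8 as stride. The only difference from a row-axis version is the indexing: offsets of 1 along the row replace offsets of stride, and the per-row advance is stride.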
Why CPU-only for now: the daedalus-decoder downstream needs the H edge dispatched somewhere. Even at CPU NEON cost (~6 ns/edge per the cycle 8 M3 baseline), a frame's worth at 1080p is 8160 MBs * 4 edges = 32 640 edges, or ~200 us, well inside the ~33 ms frame budget at 30 fps. Writing the V3D H-edge shader is a follow-up (would be cycle 8' or similar; the V-edge shader's transpose isn't mechanical because of how the workgroup organisation maps to columns vs rows).

Backlog addition (out of scope for this PR):
- V3D shader for the H variant (mirror of v3d_h264deblock.spv).
- bS=4 intra-strength filter (different algebra; both _v and _h).
- Chroma deblock _v/_h (8-cell variants).
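The CPU-cost estimate above can be reproduced with a few lines of arithmetic. The helper names are hypothetical; the inputs (4 edges per MB, ~6 ns per edge) are the figures quoted in this message, and the MB count assumes each dimension is rounded up to a multiple of 16.

```c
#include <assert.h>
#include <stdint.h>

/* Macroblocks in a frame, rounding each dimension up to a multiple of 16. */
uint32_t sketch_mb_count(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Estimated per-frame cost in nanoseconds for the given per-edge cost. */
uint64_t sketch_frame_cost_ns(uint32_t width, uint32_t height,
                              uint32_t edges_per_mb, uint32_t ns_per_edge)
{
    return (uint64_t)sketch_mb_count(width, height) * edges_per_mb * ns_per_edge;
}
```

For 1920x1080 this gives 120 x 68 = 8160 MBs and 8160 * 4 * 6 = 195 840 ns, which is about 0.6% of a 33.3 ms frame at 30 fps.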