daedalus-fourier

Author	SHA1	Message	Date
claude-noether	a5c47aa51c	h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs.	2026-05-24 23:53:09 +02:00
claude-noether	9d5451e0fe	h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol, the bit-exact reference, and the recipe-table entry so daedalus-decoder and other consumers can call the H variant through the same dispatch shape they use for _v. Scope: - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...) + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper. - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry. - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An explicit SUBSTRATE_QPU request on the H dispatch returns -1 (fails fast, no silent CPU degradation). - C reference: tests/h264_h_loop_filter_luma_ref.c — the column-axis transpose of h264_deblock_ref.c. Same per-segment kernel; pix[-4..+3] accesses cols instead of rowsstride. - Test: test_api_h264 grows a test_deblock_h() with 8 tiles (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0; compares NEON dispatch against reference byte-for-byte. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 5 kernels bit-exact PASS. The new H variant joins the suite with 1024 random-input bytes per tile x 8 tiles. Why CPU-only for now: the daedalus-decoder downstream needs the H edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per the cycle 8 M3 baseline) a frame's worth at 1080p is ~ 8160 MBs 4 edges = 32 640 edges = ~200 us — well inside the 30 fps budget. Writing the V3D H-edge shader is a follow-up (would be cycle 8' or similar; the V-edge shader's transpose isn't mechanical because of how the workgroup organisation maps to columns vs rows). Backlog addition (out of scope for this PR): - V3D shader for the H variant (mirror of v3d_h264deblock.spv). - bS=4 intra-strength filter (different algebra; both _v and _h). - Chroma deblock luma_v/_h (8-cell variants).	2026-05-24 23:28:56 +02:00
claude-noether	8fdef27a7d	Phase 8c: H.264 luma qpel mc20 through public API Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly. API additions (header + library): - daedalus_h264_qpel_meta { dst_off, src_off } - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) - daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper) - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9 The 6-tap horizontal half-pel filter signature matches FFmpeg's H264QpelContext convention exactly: dst and src share a single stride and src already points at output column 0 (filter reads cols -2..+3). Single-stride API to make the marfrit-packages FFmpeg shim a straight pointer-pass; no buffer rearrangement. Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Recipe table reflects that — the recipe_dispatch entry is a one-line forward to the CPU path. CMakeLists changes: - h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before; now the public API needs it too) - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024 bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.	2026-05-23 03:25:24 +02:00
marfrit	af8146a2cd	Phase 8a: H.264 kernels through public API Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:03 +00:00

4 Commits