Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The
vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon
in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol,
the bit-exact reference, and the recipe-table entry so daedalus-decoder
and other consumers can call the H variant through the same dispatch
shape they use for _v.
Scope:
- Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...)
+ daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper.
- Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry.
- Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped
to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An
explicit SUBSTRATE_QPU request on the H dispatch returns -1
(fails fast, no silent CPU degradation).
- C reference: tests/h264_h_loop_filter_luma_ref.c — the
column-axis transpose of h264_deblock_ref.c. Same per-segment
kernel; pix[-4..+3] accesses cols instead of rows*stride.
- Test: test_api_h264 grows a test_deblock_h() with 8 tiles
(8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0;
compares NEON dispatch against reference byte-for-byte.
Verified on hertz (Pi 5 / V3D 7.1):
$ ./build/test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 2
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 2
H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet)
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%)
H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)
All 5 kernels bit-exact PASS. The new H variant joins the suite
with 1024 random-input bytes per tile x 8 tiles.
Why CPU-only for now: the daedalus-decoder downstream needs the H
edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per
the cycle 8 M3 baseline) a frame's worth at 1080p is
~ 8160 MBs * 4 edges = 32 640 edges = ~200 us — well inside the
30 fps budget. Writing the V3D H-edge shader is a follow-up
(would be cycle 8' or similar; the V-edge shader's transpose isn't
mechanical because of how the workgroup organisation maps to columns
vs rows).
Backlog addition (out of scope for this PR):
- V3D shader for the H variant (mirror of v3d_h264deblock.spv).
- bS=4 intra-strength filter (different algebra; both _v and _h).
- Chroma deblock luma_v/_h (8-cell variants).
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.
API additions (header + library):
- daedalus_h264_qpel_meta { dst_off, src_off }
- daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
n_blocks, meta)
- daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper)
- DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
- daedalus_recipe_substrate_for() returns CPU NEON for cycle 9
The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.
Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse. Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.
CMakeLists changes:
- h264qpel_neon.S added to the daedalus_core static lib (only the
bench targets owned it before; now the public API needs it too)
- tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target
Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.
Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4,
IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU
(matches per-cycle Phase 7 verdicts).
src/daedalus_core.c: NEON-path implementations + recipe routing.
daedalus_core library now links the full FFmpeg H.264 NEON
snapshot (h264idct + h264dsp) plus existing VP9 + dav1d.
tests/test_api_h264.c: smoke test covering all 3 H.264 kernels
via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact.
Public API coverage after this commit:
- Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch
(test_api_idct, test_api_lpf, both pass)
- Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1)
- Cycle 5 CDEF: CPU only (QPU stub)
- Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired)
- Cycle 7 H.264 IDCT 8x8: CPU only
- Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired
through API yet; bench_v3d_h264deblock exists for direct test)
Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles
3+5+8 through the API (so override-mode users can request QPU).
Then surface V4L2-wrapper architecture decisions to user.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
daedalus_ctx now owns a v3d_runner when V3D is available. The
public API's dispatch_vp9_idct8 routes QPU calls through a
new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1
v4 pipeline on first use, (2) allocates 3 host-visible SSBOs
per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch
with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host.
Per-call alloc is intentional for Phase 8 correctness-first
scope; buffer-pool perf optimization is deferred.
Added daedalus_ctx_create_no_qpu() for fast-path callers that
know they want CPU only.
test_api_idct extended to a 3-mode matrix: CPU forced, QPU
forced, AUTO recipe. All three deliver 4096/4096 bit-exact
on hertz with V3D 7.1.7.0:
recipe substrate for VP9_IDCT8: 2 (QPU)
[CPU] 4096/4096 bit-exact
[QPU] 4096/4096 bit-exact (real QPU dispatch through the API)
[AUTO] 4096/4096 bit-exact (recipe routes to QPU)
Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4
and cycle 4 LPF wd=8 (the other two recipe-QPU kernels).
Cycle 3 MC and cycle 5 CDEF only need the dispatch hook
(recipe routes to CPU; QPU stays opportunistic via explicit
override).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
include/daedalus.h: stable C API surface exposing the 5 cycles
(VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel
recipe-dispatch helpers default to the cycle 1-5 verdict
substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit
override available for benchmarking and runtime-aware scheduling.
src/daedalus_core.c: NEON-path implementation of all 5 kernels
wrapped behind the public API. QPU path stubbed out (returns -1)
since wiring v3d_runner into daedalus_ctx is the next Phase 8
sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.
tests/test_api_idct.c: 64-block IDCT through the public recipe
dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the
API surface compiles, library links, dispatch routing works, and
NEON fallback delivers correct results.
docs/phase8_scoping.md: architecture options (A=userspace V4L2,
B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly
out-of-scope work tracked.
Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so
has_qpu=1 and QPU dispatch goes through the API too. After that:
V4L2 ioctl glue, bitstream parser, superblock loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>