Commit Graph

2 Commits

Author SHA1 Message Date
claude-noether 8fdef27a7d Phase 8c: H.264 luma qpel mc20 through public API
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

API additions (header + library):
  - daedalus_h264_qpel_meta { dst_off, src_off }
  - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
                                     n_blocks, meta)
  - daedalus_recipe_dispatch_h264_qpel_mc20(...)  (AUTO wrapper)
  - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9

The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse.  Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.

CMakeLists changes:
  - h264qpel_neon.S added to the daedalus_core static lib (only the
    bench targets owned it before; now the public API needs it too)
  - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target

Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:25:24 +02:00
marfrit af8146a2cd Phase 8a: H.264 kernels through public API
Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4,
IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU
(matches per-cycle Phase 7 verdicts).

src/daedalus_core.c: NEON-path implementations + recipe routing.
daedalus_core library now links the full FFmpeg H.264 NEON
snapshot (h264idct + h264dsp) plus existing VP9 + dav1d.

tests/test_api_h264.c: smoke test covering all 3 H.264 kernels
via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact.

Public API coverage after this commit:
- Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch
  (test_api_idct, test_api_lpf, both pass)
- Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1)
- Cycle 5 CDEF: CPU only (QPU stub)
- Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired)
- Cycle 7 H.264 IDCT 8x8: CPU only
- Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired
  through API yet; bench_v3d_h264deblock exists for direct test)

Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles
3+5+8 through the API (so override-mode users can request QPU).
Then surface V4L2-wrapper architecture decisions to user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:46:03 +00:00