Phase 8c: H.264 luma qpel mc20 through public API

Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

API additions (header + library):
  - daedalus_h264_qpel_meta { dst_off, src_off }
  - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
                                     n_blocks, meta)
  - daedalus_recipe_dispatch_h264_qpel_mc20(...)  (AUTO wrapper)
  - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9

The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
Single-stride API to make the marfrit-packages FFmpeg shim a
straight pointer-pass; no buffer rearrangement.

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse.  Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.

CMakeLists changes:
  - h264qpel_neon.S added to the daedalus_core static lib (only the
    bench targets owned it before; now the public API needs it too)
  - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target

Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
This commit is contained in:
2026-05-23 03:25:24 +02:00
parent d87239d817
commit 8fdef27a7d
4 changed files with 116 additions and 0 deletions
+34
View File
@@ -263,6 +263,39 @@ int daedalus_dispatch_h264_deblock_luma_v(daedalus_ctx *ctx, daedalus_substrate
uint8_t *dst, size_t dst_stride,
size_t n_edges, const daedalus_h264_deblock_meta *meta);
/* -------------------------------------------------------------------
* H.264 luma qpel mc20 (8×8, horizontal half-pel) — cycle 9
* (CPU by recipe; per-block 7.6 ns NEON, QPU not viable — see
* docs/k9_h264qpel_mc20.md for the R-band rationale).
*
* Per H.264 §8.4.2.2.1, horizontal half-pel luma 6-tap filter:
* dst[r,c] = clip255((s[r,c-2] - 5*s[r,c-1] + 20*s[r,c]
* + 20*s[r,c+1] - 5*s[r,c+2] + s[r,c+3]
* + 16) >> 5)
*
* Single-stride: dst and src share `stride`; this matches FFmpeg's
* H264QpelContext.put_h264_qpel_pixels_tab[][] convention and the
* vendored ff_put_h264_qpel8_mc20_neon signature.
*
* `src + src_off` points at the leftmost OUTPUT column (col 0); the
* filter reads cols -2..+3, so the caller must guarantee src has at
* least 2 pixels of left context and 3 pixels of right context per
* row. (FFmpeg already maintains an edge-emulated buffer for the
* frame boundary; this matches that contract.)
* ----------------------------------------------------------------- */
typedef struct {
uint32_t dst_off; /* byte offset into dst (block top-left) */
uint32_t src_off; /* byte offset into src (col 0, row 0) */
} daedalus_h264_qpel_meta;
int daedalus_recipe_dispatch_h264_qpel_mc20(daedalus_ctx *ctx,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
int daedalus_dispatch_h264_qpel_mc20(daedalus_ctx *ctx, daedalus_substrate sub,
uint8_t *dst, const uint8_t *src, size_t stride,
size_t n_blocks, const daedalus_h264_qpel_meta *meta);
/* -------------------------------------------------------------------
* Recipe query — what does the API recommend for each kernel?
* ----------------------------------------------------------------- */
@@ -275,6 +308,7 @@ typedef enum {
DAEDALUS_KERNEL_H264_IDCT4 = 6,
DAEDALUS_KERNEL_H264_IDCT8 = 7,
DAEDALUS_KERNEL_H264_DEBLOCK_LV = 8,
DAEDALUS_KERNEL_H264_QPEL_MC20 = 9,
} daedalus_kernel;
daedalus_substrate daedalus_recipe_substrate_for(daedalus_kernel k);