Phase 8c: H.264 luma qpel mc20 through public API #2

2026-05-23T01:26:03Z

marfrit commented

2026-05-23 01:26:03 +00:00

Closes the substitution-arc API surface (cycles 6/7/8 already wired): adds daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

Build + smoke (hertz aarch64):

=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
  H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      1
  H264_DEBLOCK_LV recipe substrate: 1
  H264_QPEL_MC20 recipe substrate:  1
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
  H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
  H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

Verdict: CPU NEON per docs/k9_h264qpel_mc20.md. Per-block 7.6 ns gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. The recipe_dispatch entry is a one-line forward to the CPU path.
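The PR does not spell out how the 135x figure is derived. One reading that reproduces it, offered here as an assumption rather than as the method in docs/k9_h264qpel_mc20.md, is that every 8x8 luma block of a 1920x1080 frame passes through mc20 once per frame. The helper name `mc20_realtime_margin` is hypothetical.

```c
#include <assert.h>

/* Real-time margin: how many times over the kernel could process one
 * second of video within one second of wall time.
 * Assumption: each 8x8 luma block is filtered exactly once per frame. */
static double mc20_realtime_margin(int width, int height, double fps,
                                   double ns_per_block)
{
    assert(width > 0 && height > 0 && fps > 0.0 && ns_per_block > 0.0);

    double blocks_per_frame = (width / 8.0) * (height / 8.0);
    double busy_ns_per_second = blocks_per_frame * fps * ns_per_block;

    return 1e9 / busy_ns_per_second;
}
```

With 1920x1080, 30 fps and 7.6 ns per block this gives 32,400 blocks per frame, about 7.4 ms of kernel time per second of video, and a margin of roughly 135.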

API shape:

  - New struct daedalus_h264_qpel_meta { dst_off, src_off }
  - New daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) + AUTO wrapper
  - New DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - Single-stride: matches FFmpeg's H264QpelContext convention (dst and src share stride, src already at output col 0) so the consumer shim is a straight pointer-pass
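As a rough illustration of the "straight pointer-pass", the sketch below wraps a stand-in dispatch function in a shim with the shape of FFmpeg's `qpel_mc_func` (`dst`, `src`, `stride`). Everything prefixed `sketch_` or `shim_` is hypothetical: the PR names the meta fields but not their types, and it does not show the AUTO wrapper's parameter list, so the `uint32_t` offsets and the omission of `ctx` and `sub` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the PR gives the field names only, not their types. */
typedef struct {
    uint32_t dst_off;
    uint32_t src_off;
} sketch_qpel_meta;

/* Records the most recent call so the forwarding can be inspected. */
static struct {
    uint8_t         *dst;
    const uint8_t   *src;
    ptrdiff_t        stride;
    size_t           n_blocks;
    sketch_qpel_meta first;
} last_call;

/* Stand-in for the AUTO wrapper. The real signature is not shown in the
 * PR; this one reuses the argument order of the explicit dispatch call. */
static int sketch_recipe_dispatch_qpel_mc20(uint8_t *dst, const uint8_t *src,
                                            ptrdiff_t stride, size_t n_blocks,
                                            const sketch_qpel_meta *meta)
{
    assert(n_blocks > 0 && meta != NULL);
    last_call.dst      = dst;
    last_call.src      = src;
    last_call.stride   = stride;
    last_call.n_blocks = n_blocks;
    last_call.first    = meta[0];
    return 0;
}

/* Shim with the (dst, src, stride) shape of FFmpeg's qpel_mc_func:
 * one block, zero offsets, pointers and stride forwarded unchanged. */
static void shim_put_h264_qpel8_mc20(uint8_t *dst, const uint8_t *src,
                                     ptrdiff_t stride)
{
    const sketch_qpel_meta meta = { 0, 0 };
    (void)sketch_recipe_dispatch_qpel_mc20(dst, src, stride, 1, &meta);
}
```

Because dst and src share one stride and src already points at output column 0, the shim has nothing to repack: it only supplies a single-entry meta array.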

CMakeLists:

  - h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before)
  - tests/h264_qpel8_mc20_ref.c added to test_api_h264

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.


marfrit added 1 commit 2026-05-23 01:26:04 +00:00

Phase 8c: H.264 luma qpel mc20 through public API 8fdef27a7d

Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

API additions (header + library):
  - daedalus_h264_qpel_meta { dst_off, src_off }
  - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
                                     n_blocks, meta)
  - daedalus_recipe_dispatch_h264_qpel_mc20(...)  (AUTO wrapper)
  - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9

The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
The API is single-stride so that the marfrit-packages FFmpeg shim is a
straight pointer-pass with no buffer rearrangement.
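For readers unfamiliar with the kernel, here is a minimal scalar sketch of the standard H.264 6-tap horizontal half-pel filter on an 8x8 block, using the taps (1, -5, 20, 20, -5, 1), rounding by 16, a shift by 5 and a clip to 8 bits. It is not the contents of tests/h264_qpel8_mc20_ref.c, which this page does not show; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round, shift and clip a 6-tap sum to the 0..255 range. Negative sums
 * are clamped before the shift to avoid shifting a negative value. */
static uint8_t round_clip_u8(int sum)
{
    int v = sum + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Scalar 8x8 horizontal half-pel filter. src points at output column 0,
 * so each row reads src[-2] through src[10]; dst and src share stride. */
static void sketch_put_qpel8_mc20(uint8_t *dst, const uint8_t *src,
                                  ptrdiff_t stride)
{
    assert(dst != NULL && src != NULL);

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                    + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
            dst[x] = round_clip_u8(sum);
        }
        dst += stride;
        src += stride;
    }
}
```

The caller has to provide two readable columns to the left and three to the right of the 8-wide block, which is why src is passed already positioned at output column 0.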

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse.  Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.

CMakeLists changes:
  - h264qpel_neon.S added to the daedalus_core static lib (only the
    bench targets owned it before; now the public API needs it too)
  - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target

Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.

marfrit merged commit 209a4218bc into main

2026-05-23 01:29:25 +00:00

marfrit deleted branch noether/api-h264-qpel-mc20

2026-05-23 01:29:25 +00:00

marfrit referenced this pull request from marfrit/marfrit-packages

2026-05-23 01:33:17 +00:00

ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier #90


Reference: marfrit/daedalus-fourier#2