Phase 8c: H.264 luma qpel mc20 through public API #2

Merged
marfrit merged 1 commit from noether/api-h264-qpel-mc20 into main 2026-05-23 01:29:25 +00:00

Closes the substitution-arc API surface (cycles 6/7/8 already wired): adds daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

Build + smoke (hertz aarch64):

=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
  H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      1
  H264_DEBLOCK_LV recipe substrate: 1
  H264_QPEL_MC20 recipe substrate:  1
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
  H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
  H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

Verdict: CPU NEON, per docs/k9_h264qpel_mc20.md. At 7.6 ns per block the NEON path has a 135x margin over the 30 fps 1080p budget; the QPU dispatch floor of ~250 ns makes any V3D shader strictly worse. The recipe_dispatch entry is therefore a one-line forward to the CPU path.
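
The PR does not spell out how the 135x figure is derived. One reading that reproduces it, assuming every 8x8 luma block of a 1920x1080 frame passes through mc20 once per frame (an upper bound chosen here for illustration, not a measured workload), is sketched below. The function name is hypothetical.

```c
#include <assert.h>

/* Real-time margin: how many times the kernel's per-second cost fits into
 * one second of wall time.
 * Assumption (not stated in the PR): every 8x8 luma block of the frame
 * runs through mc20 exactly once per frame. */
double sketch_qpel_margin(double ns_per_block, int width, int height, int fps)
{
    assert(ns_per_block > 0.0 && width > 0 && height > 0 && fps > 0);
    /* 1920x1080 gives 240 * 135 = 32400 blocks per frame */
    double blocks_per_frame = (width / 8.0) * (height / 8.0);
    /* nanoseconds spent in the kernel per second of video */
    double busy_ns_per_sec = blocks_per_frame * fps * ns_per_block;
    return 1e9 / busy_ns_per_sec;
}
```

With the quoted figures, `sketch_qpel_margin(7.6, 1920, 1080, 30)` evaluates to roughly 135.4 (972,000 blocks per second at 7.6 ns each is about 7.4 ms of work per second), which is consistent with the 135x claim.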

API shape:

  • New struct daedalus_h264_qpel_meta { dst_off, src_off }
  • New daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) + AUTO wrapper
  • New DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  • Single-stride: matches FFmpeg's H264QpelContext convention (dst and src share stride, src already at output col 0) so the consumer shim is a straight pointer-pass
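
To illustrate why the single-stride shape keeps the consumer shim trivial, here is a minimal sketch. Every `sketch_` name is hypothetical. The field types of the meta struct are assumed, the `ctx` and `sub` parameters are left out, and an 8x8 copy stands in for the real filter. The only part taken from FFmpeg is the `(dst, src, stride)` argument shape of its `qpel_mc_func` callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for daedalus_h264_qpel_meta; field types are assumed. */
typedef struct {
    uint32_t dst_off; /* byte offset of this block inside dst */
    uint32_t src_off; /* byte offset of this block inside src */
} sketch_qpel_meta;

/* Placeholder kernel: a plain 8x8 copy standing in for the real filter. */
void sketch_kernel8x8(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = src[y * stride + x];
}

/* Batched entry point: one kernel call per meta record, one shared stride. */
int sketch_dispatch_qpel_mc20(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                              size_t n_blocks, const sketch_qpel_meta *meta)
{
    if (!dst || !src || (n_blocks > 0 && !meta))
        return -1;
    for (size_t i = 0; i < n_blocks; i++)
        sketch_kernel8x8(dst + meta[i].dst_off, src + meta[i].src_off, stride);
    return 0;
}

/* Shim with the (dst, src, stride) shape of FFmpeg's qpel_mc_func:
 * pointers and stride go through unchanged, as one block at offset 0. */
void sketch_put_qpel8_mc20_shim(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    const sketch_qpel_meta one = { 0, 0 };
    int rc = sketch_dispatch_qpel_mc20(dst, src, stride, 1, &one);
    assert(rc == 0);
    (void)rc; /* keeps NDEBUG builds free of unused-variable warnings */
}
```

Because dst and src share one stride and src already points at output column 0, the shim needs no copy, no repacking and no second stride argument.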

CMakeLists:

  • h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before)
  • tests/h264_qpel8_mc20_ref.c added to test_api_h264

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.

marfrit added 1 commit 2026-05-23 01:26:04 +00:00
Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20
so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2]
through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly.

API additions (header + library):
  - daedalus_h264_qpel_meta { dst_off, src_off }
  - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride,
                                     n_blocks, meta)
  - daedalus_recipe_dispatch_h264_qpel_mc20(...)  (AUTO wrapper)
  - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum
  - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9

The 6-tap horizontal half-pel filter signature matches FFmpeg's
H264QpelContext convention exactly: dst and src share a single stride
and src already points at output column 0 (filter reads cols -2..+3).
The API is single-stride so that the marfrit-packages FFmpeg shim is a
straight pointer-pass with no buffer rearrangement.
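
For reference, the filter described above can be written as a scalar sketch. This is not the NEON kernel and not the project's tests/h264_qpel8_mc20_ref.c; the function name is hypothetical. It applies the standard H.264 luma half-pel taps (1, -5, 20, 20, -5, 1) with rounding and clipping to one 8x8 block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar sketch of "put" mc20 for one 8x8 block: horizontal half-pel,
 * taps (1, -5, 20, 20, -5, 1), rounded with +16, divided by 32, clipped. */
void sketch_put_qpel8_mc20(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    assert(dst && src);
    for (int y = 0; y < 8; y++) {
        const uint8_t *s = src + y * stride; /* one shared stride for both planes */
        uint8_t *d = dst + y * stride;
        for (int x = 0; x < 8; x++) {
            /* src points at output column 0, so the taps span x-2 .. x+3 */
            int sum = s[x - 2] - 5 * s[x - 1] + 20 * s[x]
                    + 20 * s[x + 1] - 5 * s[x + 2] + s[x + 3];
            int v = sum + 16;
            if (v < 0)
                v = 0;   /* clamp first so a negative value is never shifted */
            v >>= 5;
            d[x] = (uint8_t)(v > 255 ? 255 : v);
        }
    }
}
```

The caller must guarantee two readable bytes to the left and three to the right of each 8-pixel row. The taps sum to 32, so a flat input region passes through unchanged.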

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns
gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns
makes any V3D shader strictly worse.  Recipe table reflects that —
the recipe_dispatch entry is a one-line forward to the CPU path.
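
A rough sketch of what "a one-line forward to the CPU path" could look like. All names are hypothetical stand-ins. The substrate ids follow the smoke log (1=CPU, 2=QPU), a counter replaces the NEON kernel, and returning -1 for an explicit QPU request is an assumed behaviour, not something the PR states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Substrate ids as printed by the smoke log (1=CPU, 2=QPU). */
enum sketch_substrate { SKETCH_SUBSTRATE_CPU = 1, SKETCH_SUBSTRATE_QPU = 2 };

/* Counts blocks handed to the CPU path; stands in for the NEON kernel. */
size_t sketch_cpu_blocks_seen = 0;

static int sketch_cpu_qpel_mc20(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride, size_t n_blocks)
{
    assert(dst && src);
    (void)stride;
    sketch_cpu_blocks_seen += n_blocks;
    return 0;
}

/* Explicit-substrate entry point. */
int sketch_dispatch_qpel_mc20(enum sketch_substrate sub, uint8_t *dst,
                              const uint8_t *src, ptrdiff_t stride,
                              size_t n_blocks)
{
    if (sub == SKETCH_SUBSTRATE_CPU)
        return sketch_cpu_qpel_mc20(dst, src, stride, n_blocks);
    return -1; /* assumed: no QPU shader exists for this kernel */
}

/* AUTO wrapper: the recipe verdict is CPU, so this is a one-line forward. */
int sketch_recipe_dispatch_qpel_mc20(uint8_t *dst, const uint8_t *src,
                                     ptrdiff_t stride, size_t n_blocks)
{
    return sketch_dispatch_qpel_mc20(SKETCH_SUBSTRATE_CPU, dst, src, stride, n_blocks);
}
```

Keeping the wrapper even when it only forwards means callers stay on the recipe layer, so a later change of verdict would not require touching the consumer.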

CMakeLists changes:
  - h264qpel_neon.S added to the daedalus_core static lib (only the
    bench targets owned it before; now the public API needs it too)
  - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target

Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024
bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20.
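
The bit-exact figure is a byte-for-byte comparison between the dispatched output and a reference output. A minimal sketch of such a check, printing in the style of the smoke log, might look like this. The helper names are hypothetical and are not taken from test_api_h264.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Number of positions at which the two buffers hold the same byte. */
size_t sketch_count_matching(const uint8_t *got, const uint8_t *want, size_t n)
{
    assert(got && want);
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        same += (got[i] == want[i]);
    return same;
}

/* Prints one line in the style of the smoke log; returns 1 only when
 * every byte matches, 0 otherwise. */
int sketch_report_bitexact(const char *name, const uint8_t *got,
                           const uint8_t *want, size_t n)
{
    size_t same = sketch_count_matching(got, want, n);
    double pct = n ? 100.0 * (double)same / (double)n : 100.0;
    printf("  %s: %zu/%zu bytes bit-exact (%.4f%%)\n", name, same, n, pct);
    return same == n;
}
```

A count of 1024 bytes would correspond to sixteen 8x8 blocks, although the commit message does not state how the test buffer is laid out.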

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
marfrit merged commit 209a4218bc into main 2026-05-23 01:29:25 +00:00
marfrit deleted branch noether/api-h264-qpel-mc20 2026-05-23 01:29:25 +00:00

Reference: marfrit/daedalus-fourier#2