ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier #90

2026-05-23T01:33:17Z

marfrit commented

2026-05-23 01:33:17 +00:00

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the 4-cycle libavcodec.so substitution sequence:

| Cycle | PR | Kernel | Status |
|---|---|---|---|
| 6 | #76 | H.264 IDCT 4x4 | landed |
| 7 | #85 | H.264 IDCT 8x8 | landed |
| 8 | #86 | H.264 luma-v deblock | landed |
| 9 | this | H.264 qpel mc20 | new |

H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal half-pel, 6-tap put) now dispatches through daedalus_recipe_dispatch_h264_qpel_mc20 instead of ff_put_h264_qpel8_mc20_neon directly.
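For readers unfamiliar with the kernel being substituted, here is a minimal scalar sketch of what an 8x8 horizontal half-pel "put" computes, using the standard H.264 6-tap luma filter (1, -5, 20, 20, -5, 1) with rounding and a shift by 5. The name `ref_put_h264_qpel8_mc20` is hypothetical and is not part of FFmpeg or daedalus-fourier; the `(dst, src, stride)` argument shape is assumed to follow FFmpeg's `qpel_mc_func` convention.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate filter result to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Hypothetical scalar reference for the 8x8 horizontal half-pel "put".
 * Each output pixel is the half-sample between src[x] and src[x + 1].
 * The caller must guarantee 2 readable pixels to the left and 3 to the
 * right of every 8-pixel row. dst and src share one stride.
 */
static void ref_put_h264_qpel8_mc20(uint8_t *dst, const uint8_t *src,
                                    ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                    + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
            /* Negative sums rely on an arithmetic right shift, then clamp to 0. */
            dst[x] = clip_u8((sum + 16) >> 5);
        }
        dst += stride;
        src += stride;
    }
}
```

Because the taps sum to 32, a flat region passes through unchanged: a block of constant value 100 yields 100 at every output position.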

Pin bump

daedalus-fourier d87239d → 209a421 (marfrit/daedalus-fourier#2 — public API extended with daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict

CPU NEON per docs/k9_h264qpel_mc20.md:

=== M1₉ bit-exact (10000 random 8x8 blocks) ===
M1₉ correctness: 10000 / 10000 blocks bit-exact (100.0000%)

=== M3₉ NEON throughput ===
  throughput      = 131.477 Mblock/s
  per-block       = 7.6 ns
  H.264 1080p30 8x8 MC floor: 135.26× margin
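
The figures in this log can be cross-checked with a few lines of arithmetic. The sketch below assumes the "1080p30 8x8 MC floor" means every luma pixel of a 1920x1080 frame is covered by exactly one 8x8 block per frame at 30 fps (972,000 blocks/s). That definition is an assumption, not taken from docs/k9_h264qpel_mc20.md, but it does reproduce the reported margin. The helper names are hypothetical.

```c
/* 8x8 blocks per second needed to cover a width x height luma plane
 * once per frame (assumed definition of the "floor"). */
static double floor_blocks_per_second(int width, int height, int fps)
{
    return (double)width * (double)height / 64.0 * (double)fps;
}

/* How many times faster the measured throughput is than the floor. */
static double throughput_margin(double measured_blocks_per_s,
                                double floor_blocks_per_s)
{
    return measured_blocks_per_s / floor_blocks_per_s;
}

/* Time per block in nanoseconds for a throughput given in blocks/s. */
static double ns_per_block(double measured_blocks_per_s)
{
    return 1e9 / measured_blocks_per_s;
}
```

With the measured 131.477e6 blocks/s, `ns_per_block` gives roughly 7.6 ns and `throughput_margin(131.477e6, floor_blocks_per_second(1920, 1080, 30))` gives roughly 135.26, matching the log.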

The QPU dispatch floor of ~250 ns against 7.6 ns per block on NEON makes any V3D shader strictly worse for this kernel. The substitution is therefore plumbing-only: it reuses the daedalus_ctx_create_no_qpu pthread_once shape that the cycle 6/7/8 shims already use. The ctx is kept SEPARATE from the H264DSP shim's ctx because H264QPEL is its own libavcodec Makefile module and link order does not guarantee that a single .o owns the ctx symbol. The cost is one extra ~µs init per process, paid lazily on the first MC call.
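
The lazy, once-per-process context pattern described here can be sketched as follows. Everything prefixed `demo_` is a hypothetical stand-in: the real signature of `daedalus_ctx_create_no_qpu` and the real context type are not shown in this PR, so the sketch substitutes a stub constructor. Only `pthread_once` and `PTHREAD_ONCE_INIT` are real POSIX API.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the daedalus context type. */
typedef struct demo_ctx {
    int ready;
} demo_ctx;

/* Counts constructor runs so the once-only behaviour can be observed. */
static int demo_create_calls;

/* Hypothetical stand-in for daedalus_ctx_create_no_qpu (signature assumed). */
static demo_ctx *demo_ctx_create_no_qpu(void)
{
    demo_ctx *c = calloc(1, sizeof(*c));
    if (c)
        c->ready = 1;
    demo_create_calls++;
    return c;
}

/* File-local state: each shim object file owns its own ctx and once flag,
 * so no cross-module symbol has to be shared at link time. */
static pthread_once_t qpel_ctx_once = PTHREAD_ONCE_INIT;
static demo_ctx *qpel_ctx;

static void qpel_ctx_init(void)
{
    qpel_ctx = demo_ctx_create_no_qpu();
}

/* Called on every MC dispatch; the init cost is paid on the first call only. */
static demo_ctx *qpel_get_ctx(void)
{
    pthread_once(&qpel_ctx_once, qpel_ctx_init);
    return qpel_ctx;
}
```

Keeping the once flag and pointer `static` in each shim is what makes a second context per process unavoidable but harmless: two calls to `qpel_get_ctx()` return the same pointer, and the constructor runs once.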

Scope of substitution

Only put_h264_qpel_pixels_tab[1][2] is substituted. Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16 size tier stay on the in-tree NEON .S code per the cycle-9 phase-1 rationale (mc20 8x8 is representative; remaining variants would multiply recipe-lookup overhead without changing the substrate verdict).
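
As an illustration of how narrow the substitution is, the sketch below models a `[4][16]` function-pointer table in which a single slot is redirected and every other slot keeps its original entry. All `demo_` and `Demo` names are hypothetical. The slot index for variant mcXY is assumed to be X + 4*Y, which yields index 2 for mc20 and is consistent with the `[1][2]` slot named above.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as a motion-compensation function: dst, src, shared stride. */
typedef void (*demo_mc_func)(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);

/* Hypothetical model of the put table: 4 size tiers x 16 qpel positions. */
typedef struct DemoQpelContext {
    demo_mc_func put_tab[4][16];
} DemoQpelContext;

/* Slot of variant mcXY inside one size tier (assumed mapping: X + 4*Y). */
static int demo_qpel_index(int x_quarter, int y_quarter)
{
    return x_quarter + 4 * y_quarter;
}

/* Placeholder for the untouched in-tree routines. */
static void demo_in_tree(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    (void)dst; (void)src; (void)stride;
}

/* Placeholder for the shim that would forward to the recipe dispatch. */
static void demo_shim_mc20(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    (void)dst; (void)src; (void)stride;
}

static void demo_qpel_init(DemoQpelContext *c)
{
    /* Every slot starts on the in-tree implementation. */
    for (int tier = 0; tier < 4; tier++)
        for (int i = 0; i < 16; i++)
            c->put_tab[tier][i] = demo_in_tree;

    /* Redirect only the 8x8 tier (index 1), variant mc20. */
    c->put_tab[1][demo_qpel_index(2, 0)] = demo_shim_mc20;
}
```

After `demo_qpel_init`, only `put_tab[1][2]` points at the shim; the other 63 entries, including the same mc20 position in the other size tiers, are unchanged.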

Smoke

=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
  H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      1
  H264_DEBLOCK_LV recipe substrate: 1
  H264_QPEL_MC20 recipe substrate:  1
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%)
  H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%)
  H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%)

Patch chain 0003→0007 dry-applied cleanly against the FFmpeg b57fbbe pin; new shim file compiles against the installed daedalus prefix (libavutil/attributes.h is the only FFmpeg header it needs).

PKGREL 9 → 10. No SONAME change, no Depends change.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.


marfrit added 1 commit 2026-05-23 01:33:18 +00:00

ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier 0bfc4ab03e

H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" — the canonical representative of the H.264
luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4         done
  cycle 7 (PR #85)  H.264 IDCT 8x8         done
  cycle 8 (PR #86)  H.264 luma-v deblock   done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API
gains daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse.  Substitution is
plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape
the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP
shim's ctx because H264QPEL is its own libavcodec Makefile module
and link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s).

No SONAME change, no Depends change.  PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.

marfrit merged commit b736dd0529 into main

2026-05-23 01:34:05 +00:00

marfrit deleted branch noether/ffmpeg-fourier-qpel-mc20-daedalus

2026-05-23 01:34:06 +00:00

Reference: marfrit/marfrit-packages#90
