ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier

H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put" — the canonical representative of the H.264
luma motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4         done
  cycle 7 (PR #85)  H.264 IDCT 8x8         done
  cycle 8 (PR #86)  H.264 luma-v deblock   done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API
gains daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON.  Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse.  Substitution is
plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape
the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP
shim's ctx because H264QPEL is its own libavcodec Makefile module
and link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s).

No SONAME change, no Depends change.  PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
This commit is contained in:
2026-05-23 03:32:29 +02:00
parent 8729c2db92
commit 0bfc4ab03e
5 changed files with 334 additions and 18 deletions
+34
View File
@@ -1,3 +1,37 @@
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
* Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
horizontal half-pel, 6-tap "put" — the canonical representative
of the H.264 luma motion-compensation family) now dispatches
through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon. Cycle 9 of the daedalus-v4l2#11
step 2 substitution arc; closes the 4-cycle libavcodec.so
substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
9 qpel mc20).
* Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).
* Cycle 9 is "CPU primary; QPU pointless" per
docs/k9_h264qpel_mc20.md. Per-block 7.6 ns at 131 Mblock/s
gives 135x margin over 30 fps 1080p; QPU dispatch floor at
~250 ns makes any V3D shader strictly worse. Substitution
is plumbing-only, NEON-by-recipe — same
daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
shims already own (kept SEPARATE from the H264DSP shim's ctx
because H264QPEL is its own libavcodec Makefile module and
link order does not guarantee a single .o owns the ctx symbol;
one extra ~µs init per process, paid lazily on first MC call).
* Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
16x16 size tier stay on the in-tree NEON .S code. Per the
cycle-9 phase-1 rationale, mc20 8x8 is representative of the
whole family's per-block cost.
* Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks).
* No SONAME change, no Depends change.
-- Markus Fritsche <mfritsche@reauktion.de> Sat, 23 May 2026 12:00:00 +0000
ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium
* Add 0006-h264-restore-low-delay.patch — restore the documented