ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier

H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal half-pel, 6-tap "put" — the canonical representative of the H.264 luma motion-compensation family) now dispatches through daedalus_recipe_dispatch_h264_qpel_mc20 instead of ff_put_h264_qpel8_mc20_neon. Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the 4-cycle libavcodec.so substitution sequence: cycle 6 (PR #76) H.264 IDCT 4x4 done cycle 7 (PR #85) H.264 IDCT 8x8 done cycle 8 (PR #86) H.264 luma-v deblock done cycle 9 (this) H.264 qpel mc20 Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public API gains daedalus_recipe_dispatch_h264_qpel_mc20 + DAEDALUS_KERNEL_H264_QPEL_MC20). Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns at 131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Substitution is plumbing-only — same daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP shim's ctx because H264QPEL is its own libavcodec Makefile module and link order does not guarantee a single .o owns the ctx symbol; one extra ~µs init per process, paid lazily on first MC call). Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16 size tier stay on the in-tree NEON .S code per the cycle-9 phase-1 rationale (mc20 8x8 is representative; remaining variants would multiply recipe-lookup overhead without changing the substrate verdict). Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131 Mblock/s). No SONAME change, no Depends change. PKGREL 9 → 10. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
2026-05-23 03:32:29 +02:00
parent 8729c2db92
commit 0bfc4ab03e
5 changed files with 334 additions and 18 deletions
@@ -1,3 +1,37 @@
+ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-10) bookworm trixie; urgency=medium
+
+  * Add 0007-h264-qpel-mc20-daedalus-fourier.patch —
+    H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma
+    horizontal half-pel, 6-tap "put" — the canonical representative
+    of the H.264 luma motion-compensation family) now dispatches
+    through daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+    ff_put_h264_qpel8_mc20_neon.  Cycle 9 of the daedalus-v4l2#11
+    step 2 substitution arc; closes the 4-cycle libavcodec.so
+    substitution sequence (6 IDCT4 / 7 IDCT8 / 8 luma-v deblock /
+    9 qpel mc20).
+  * Bumps daedalus-fourier pin d87239d → 209a421 (PR #2 — public
+    API extended with daedalus_recipe_dispatch_h264_qpel_mc20 +
+    DAEDALUS_KERNEL_H264_QPEL_MC20).
+  * Cycle 9 is "CPU primary; QPU pointless" per
+    docs/k9_h264qpel_mc20.md.  Per-block 7.6 ns at 131 Mblock/s
+    gives 135x margin over 30 fps 1080p; QPU dispatch floor at
+    ~250 ns makes any V3D shader strictly worse.  Substitution
+    is plumbing-only, NEON-by-recipe — same
+    daedalus_ctx_create_no_qpu pthread_once shape the cycles 6/7/8
+    shims already own (kept SEPARATE from the H264DSP shim's ctx
+    because H264QPEL is its own libavcodec Makefile module and
+    link order does not guarantee a single .o owns the ctx symbol;
+    one extra ~µs init per process, paid lazily on first MC call).
+  * Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the
+    16x16 size tier stay on the in-tree NEON .S code.  Per the
+    cycle-9 phase-1 rationale, mc20 8x8 is representative of the
+    whole family's per-block cost.
+  * Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
+    cycle 9 green; 10000/10000 random blocks).
+  * No SONAME change, no Depends change.
+
+ -- Markus Fritsche <mfritsche@reauktion.de>  Sat, 23 May 2026 12:00:00 +0000
+
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-9) bookworm trixie; urgency=medium

  * Add 0006-h264-restore-low-delay.patch — restore the documented