ffmpeg-v4l2-request-fourier: substitute H.264 qpel mc20 → daedalus-fourier

H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
half-pel, 6-tap "put"; the canonical representative of the H.264 luma
motion-compensation family) now dispatches through
daedalus_recipe_dispatch_h264_qpel_mc20 instead of
ff_put_h264_qpel8_mc20_neon.

Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
4-cycle libavcodec.so substitution sequence:

  cycle 6 (PR #76)  H.264 IDCT 4x4        done
  cycle 7 (PR #85)  H.264 IDCT 8x8        done
  cycle 8 (PR #86)  H.264 luma-v deblock  done
  cycle 9 (this)    H.264 qpel mc20

Bumps daedalus-fourier pin d87239d → 209a421 (PR #2: public API gains
daedalus_recipe_dispatch_h264_qpel_mc20 +
DAEDALUS_KERNEL_H264_QPEL_MC20).

Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns at
131 Mblock/s gives 135× margin over 30 fps 1080p; QPU dispatch floor
at ~250 ns makes any V3D shader strictly worse. Substitution is
plumbing-only: same daedalus_ctx_create_no_qpu pthread_once shape the
cycles 6/7/8 shims already own (kept SEPARATE from the H264DSP shim's
ctx because H264QPEL is its own libavcodec Makefile module and link
order does not guarantee a single .o owns the ctx symbol; one extra
~µs init per process, paid lazily on first MC call).

Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
size tier stay on the in-tree NEON .S code per the cycle-9 phase-1
rationale (mc20 8x8 is representative; remaining variants would
multiply recipe-lookup overhead without changing the substrate
verdict).

Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
cycle 9 green; 10000/10000 random blocks bit-exact, M3 = 131
Mblock/s).

No SONAME change, no Depends change. PKGREL 9 → 10.

Refs reauktion/daedalus-v4l2#11, substitution arc step 2 cycle 9.
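
The 135× margin quoted in the message can be reproduced with a short back-of-the-envelope calculation. The sketch below is not taken from docs/k9_h264qpel_mc20.md; it assumes the worst case where every 8x8 luma block of every 1920x1080 frame goes through mc20, and the helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helpers, for illustration only. */

/* 8x8 luma blocks per second for a given frame size and frame rate.
 * Assumption: every block of every frame takes the mc20 path. */
static double blocks_per_second(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (double)(width / 8) * (double)(height / 8) * (double)fps;
}

/* Throughput (blocks/s) implied by a per-block cost in nanoseconds. */
static double throughput_from_ns(double ns_per_block)
{
    assert(ns_per_block > 0.0);
    return 1e9 / ns_per_block;
}

/* Headroom: how many times the demand fits into the throughput. */
static double margin(double throughput, double demand)
{
    assert(demand > 0.0);
    return throughput / demand;
}
```

With these, blocks_per_second(1920, 1080, 30) is 240 * 135 * 30 = 972000 blocks/s, throughput_from_ns(7.6) is roughly 131.6 Mblock/s, and margin(131e6, 972000.0) comes out just under 135. The same helpers show why the ~250 ns QPU dispatch floor loses: throughput_from_ns(250.0) is only 4 Mblock/s before any shader work is counted.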
2026-05-23 03:32:29 +02:00
parent 8729c2db92
commit 0bfc4ab03e
5 changed files with 334 additions and 18 deletions
@@ -0,0 +1,139 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Sat, 23 May 2026 12:00:00 +0200
+Subject: [PATCH] avcodec/aarch64/h264qpel: route 8x8 mc20 through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264QpelContext.put_h264_qpel_pixels_tab[1][2] (8x8 luma horizontal
+half-pel, 6-tap "put" variant — the canonical representative of the
+H.264 luma motion-compensation family) now dispatches through
+daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ff_put_h264_qpel8_mc20_neon.
+
+Cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes the
+4-cycle libavcodec.so substitution sequence (6 IDCT 4x4 / 7 IDCT 8x8 /
+8 luma-v deblock / 9 qpel mc20).
+
+The recipe layer picks the substrate. Per docs/k9_h264qpel_mc20.md
+the verdict is CPU NEON: per-block 7.6 ns at 131 Mblock/s gives 135x
+margin over 30 fps 1080p, and the QPU dispatch floor (~250 ns)
+makes any V3D shader strictly worse. Substitution is plumbing-only,
+NEON-by-recipe — same daedalus_ctx_create_no_qpu pthread_once
+context shape the cycles 6/7/8 shims already own (kept SEPARATE
+from the H264DSP shim's ctx because H264QPEL is its own libavcodec
+Makefile module and link order does not guarantee a single .o
+owns the ctx symbol; one extra ~µs init per process, paid lazily).
+
+Other H.264 luma MC variants (mc02, mc11, mc22 etc.) and the 16x16
+size tier stay on the in-tree NEON .S code. Per the cycle-9 phase-1
+rationale, mc20 8x8 is representative of the whole family's per-block
+cost — extending the substitution to other variants would multiply
+recipe-lookup overhead without changing the substrate verdict.
+
+Bit-exact against ff_put_h264_qpel8_mc20_neon (daedalus-fourier
+cycle 9 green; M1 = 100% bit-exact across 10000 random blocks).
+
+No SONAME change, no Depends change.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.
+---
+ libavcodec/aarch64/Makefile                |  3 +-
+ libavcodec/aarch64/h264_qpel_daedalus.c    | 50 ++++++++++++++++++++++
+ libavcodec/aarch64/h264qpel_init_aarch64.c |  4 +-
+ 3 files changed, 55 insertions(+), 2 deletions(-)
+ create mode 100644 libavcodec/aarch64/h264_qpel_daedalus.c
+
+diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
+--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
+@@ -7,7 +7,8 @@ OBJS-$(CONFIG_H264DSP)                  += aarch64/h264dsp_init_aarch64.o \
+                                            aarch64/h264_idct_daedalus.o
+ OBJS-$(CONFIG_HUFFYUVDSP)               += aarch64/huffyuvdsp_init_aarch64.o
+ OBJS-$(CONFIG_H264PRED)                 += aarch64/h264pred_init.o
+-OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o
+OBJS-$(CONFIG_H264QPEL)                 += aarch64/h264qpel_init_aarch64.o \
+                                           aarch64/h264_qpel_daedalus.o
+ OBJS-$(CONFIG_HPELDSP)                  += aarch64/hpeldsp_init_aarch64.o
+ OBJS-$(CONFIG_IDCTDSP)                  += aarch64/idctdsp_init_aarch64.o
+ OBJS-$(CONFIG_ME_CMP)                   += aarch64/me_cmp_init_aarch64.o
+diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
+new file mode 100644
+--- /dev/null
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
+@@ -0,0 +1,50 @@
+/*
+ * H.264 luma qpel mc20 (8x8, horizontal half-pel, 6-tap "put")
+ * — daedalus-fourier substitution shim.
+ *
+ * Routes H264QpelContext.put_h264_qpel_pixels_tab[1][2] through
+ * daedalus_recipe_dispatch_h264_qpel_mc20 instead of
+ * ff_put_h264_qpel8_mc20_neon.  The recipe layer picks the substrate
+ * (CPU NEON for cycle 9; QPU not viable — per-block 7.6 ns vs
+ * ~250 ns QPU dispatch floor, see docs/k9_h264qpel_mc20.md).
+ *
+ * Sibling to libavcodec/aarch64/h264_idct_daedalus.c.  We keep a
+ * SEPARATE process-global pthread_once context here instead of
+ * sharing the H264DSP one because H264QPEL is its own libavcodec
+ * Makefile module and link order does not guarantee a single .o
+ * owns the ctx symbol.  The cost is one extra
+ * daedalus_ctx_create_no_qpu (~µs) per process; daemon and host
+ * processes pay this lazily on first MC call.
+ *
+ * FFmpeg H264QpelContext convention: both dst and src use a SINGLE
+ * stride and `src` already points at the leftmost OUTPUT column
+ * (col 0); the 6-tap filter reads cols -2..+3.  This matches
+ * daedalus_recipe_dispatch_h264_qpel_mc20's documented contract
+ * directly, so dst_off = src_off = 0.
+ */
+
+#include <pthread.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <daedalus.h>
+
+#include "libavutil/attributes.h"
+
+static daedalus_ctx     *g_dctx;
+static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
+
+static void daedalus_ctx_init_once(void)
+{
+    g_dctx = daedalus_ctx_create_no_qpu();
+}
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+    static const daedalus_h264_qpel_meta meta = { .dst_off = 0, .src_off = 0 };
+    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
+    daedalus_recipe_dispatch_h264_qpel_mc20(g_dctx, dst, src, (size_t)stride,
+                                            1, &meta);
+}
+diff --git a/libavcodec/aarch64/h264qpel_init_aarch64.c b/libavcodec/aarch64/h264qpel_init_aarch64.c
+--- a/libavcodec/aarch64/h264qpel_init_aarch64.c
+++ b/libavcodec/aarch64/h264qpel_init_aarch64.c
+@@ -47,6 +47,8 @@ void ff_put_h264_qpel8_mc00_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t str
+ void ff_put_h264_qpel8_mc10_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+ void ff_put_h264_qpel8_mc20_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+ void ff_put_h264_qpel8_mc30_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src,
+                                     ptrdiff_t stride);
+ void ff_put_h264_qpel8_mc01_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+ void ff_put_h264_qpel8_mc11_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+ void ff_put_h264_qpel8_mc21_neon(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
+@@ -184,7 +186,7 @@ av_cold void ff_h264qpel_init_aarch64(H264QpelContext *c, int bit_depth)
+
+         c->put_h264_qpel_pixels_tab[1][ 0] = ff_put_h264_qpel8_mc00_neon;
+         c->put_h264_qpel_pixels_tab[1][ 1] = ff_put_h264_qpel8_mc10_neon;
+-        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_neon;
+        c->put_h264_qpel_pixels_tab[1][ 2] = ff_put_h264_qpel8_mc20_daedalus;
+         c->put_h264_qpel_pixels_tab[1][ 3] = ff_put_h264_qpel8_mc30_neon;
+         c->put_h264_qpel_pixels_tab[1][ 4] = ff_put_h264_qpel8_mc01_neon;
+         c->put_h264_qpel_pixels_tab[1][ 5] = ff_put_h264_qpel8_mc11_neon;
+--
+2.47.3
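
For readers checking the bit-exactness claim, the shim's header comment (the 6-tap filter reads cols -2..+3 relative to each output column) corresponds to the standard H.264 luma half-pel filter with taps (1, -5, 20, 20, -5, 1), rounding offset 16 and a shift by 5. The following is a minimal scalar sketch under that assumption. It is not the daedalus-fourier test harness, and the function name ref_put_h264_qpel8_mc20 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round, scale and clamp a 6-tap filter sum to 8 bits.
 * Negative sums are clamped before shifting so the code never
 * right-shifts a negative int. */
static uint8_t round_clip_u8(int sum)
{
    int r = sum + 16;
    if (r < 0)
        return 0;
    r >>= 5;
    return r > 255 ? 255 : (uint8_t)r;
}

/* Hypothetical scalar reference for 8x8 horizontal half-pel "put".
 * Same calling convention as the shim: one stride for dst and src,
 * src points at output column 0, columns -2..+10 must be readable. */
static void ref_put_h264_qpel8_mc20(uint8_t *dst, const uint8_t *src,
                                    ptrdiff_t stride)
{
    assert(dst != NULL && src != NULL);
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                    + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
            dst[x] = round_clip_u8(sum);
        }
        dst += stride;
        src += stride;
    }
}
```

A comparison loop would fill a padded buffer with random bytes, run this reference and the function under test on the same input, and compare the two 8x8 outputs byte for byte.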