ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier

Cycle 8 of the libavcodec.so substitution arc (reauktion/daedalus-v4l2#11 step 2). H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma deblock, called per macroblock-row edge from the slice deblock loop in libavcodec/h264_loopfilter.c) now dispatches through daedalus_recipe_dispatch_h264_deblock_luma_v instead of ff_h264_v_loop_filter_luma_neon.

## What

- Add 0005-h264-deblock-luma-v-daedalus-fourier.patch (in both arch/ and debian/ ffmpeg-v4l2-request-fourier/). Extends libavcodec/aarch64/h264_idct_daedalus.c with ff_h264_v_loop_filter_luma_daedalus (constructs a daedalus_h264_deblock_meta from FFmpeg's (alpha, beta, tc0[4]) and calls daedalus_recipe_dispatch_h264_deblock_luma_v with n_edges=1). Patches libavcodec/aarch64/h264dsp_init_aarch64.c to wire c->v_loop_filter_luma to the new shim.
- arch/PKGBUILD + debian/build-deb.sh: append patch + bump pkgrel/PKGREL to 8.
- No new build-deps, no Depends change, no daedalus-fourier rev: the d87239d pin already exposes daedalus_recipe_dispatch_h264_deblock_luma_v.

## Why

Cycle 8 is marked "CPU primary; QPU opportunistic" in the daedalus-fourier API docstring. Per the hybrid substrate philosophy ("if there's a coprocessor, use it") we eventually want the QPU opportunism active here. But the libavcodec.so context is process-global and shared with cycles 6/7 via pthread_once, and it uses daedalus_ctx_create_no_qpu deliberately to avoid implicit Vulkan init in arbitrary host processes (Firefox content, mpv-fourier, ffmpeg-fourier CLI, ...). Switching to daedalus_ctx_create here without a feature flag would be a footgun.

So cycle 8 lands as plumbing-only NEON-by-recipe substitution for now; opportunistic QPU enablement is a separate follow-up that adds a DAEDALUS_FOURIER_ENABLE_QPU env var or equivalent.

## Scope NOT covered

- Intra (bS=4) loop filter c->v_loop_filter_luma_intra: daedalus's daedalus_h264_deblock_meta only covers the non-intra path.
- Horizontal-edge variant c->h_loop_filter_luma: separate kernel (not yet in daedalus-fourier API).
- Chroma loop filters: separate kernels.
- Bulk batching: single-edge dispatch wastes the kernel's n_edges>1 amortization. Same caveat as cycles 6/7; follow-up.
- QPU opportunism: see "Why" above.

## SONAME

Unchanged. libavcodec.so.62 / libavformat.so.62 / libavutil.so.60.

## Refs

- reauktion/daedalus-v4l2 issue #11: reauktion/daedalus-v4l2#11
- marfrit-packages PR #76 (cycle 6 IDCT 4×4)
- marfrit-packages PR #85 (cycle 7 IDCT 8×8)
- marfrit/daedalus-fourier cycle 8 close (deblock luma-v NEON green)
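
## Sketch: opt-in gate for the QPU follow-up (illustrative only)

The "Why" section defers QPU opportunism to a follow-up that adds a DAEDALUS_FOURIER_ENABLE_QPU env var or equivalent. As a rough illustration of what such a gate might look like, the sketch below decides which kind of context to build from an environment variable. Everything in it is an assumption and not part of this commit: the helper names `qpu_opt_in` and `choose_ctx_kind`, the `ctx_kind` enum, and the rule that only the exact value "1" opts in. It deliberately does not call daedalus_ctx_create or daedalus_ctx_create_no_qpu, since their signatures are not shown here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tag for which context constructor the init-once hook
 * would pick; not a daedalus-fourier type. */
typedef enum { CTX_NO_QPU = 0, CTX_QPU = 1 } ctx_kind;

/* Assumed policy: opt in only when the value is exactly "1".
 * Unset, empty, "0" or anything else keeps the no-QPU default. */
static int qpu_opt_in(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Reads the named environment variable once and maps it to a
 * context kind.  Intended to run inside the pthread_once init hook,
 * so the decision is made a single time per process. */
static ctx_kind choose_ctx_kind(const char *env_name)
{
    assert(env_name != NULL);
    return qpu_opt_in(getenv(env_name)) ? CTX_QPU : CTX_NO_QPU;
}
```

Under these assumptions, a host process that never sets the variable keeps today's behaviour (no implicit Vulkan init), and only an explicit opt-in would switch the shared context away from the no-QPU constructor.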
2026-05-22 12:17:14 +02:00
parent 510a31622c
commit 29e0852d11
5 changed files with 280 additions and 8 deletions
@@ -0,0 +1,121 @@
+From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 12:15:16 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
+deblock, called per macroblock-row edge from the slice deblock
+loop) now dispatches through
+daedalus_recipe_dispatch_h264_deblock_luma_v instead of
+ff_h264_v_loop_filter_luma_neon.
+
+The recipe layer picks the substrate; for cycle 8 the daedalus
+docstring marks the kernel "CPU primary; QPU opportunistic", but
+the libavcodec.so context here is built with
+daedalus_ctx_create_no_qpu — process-global pthread_once init,
+shared with cycles 6/7.  QPU opportunism stays gated off until a
+follow-up adds an explicit feature flag (no implicit Vulkan init
+in arbitrary host processes).  In the meantime cycle 8 is a
+plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
+
+Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
+the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
+only covers the non-intra path per its docstring.
+
+FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
+(int32_t alpha/beta + inline int8_t tc0[4]).  pix already points
+to row 0 of the bottom block per FFmpeg's deblock convention,
+satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c   | 36 +++++++++++++++++++----
+ libavcodec/aarch64/h264dsp_init_aarch64.c |  4 ++-
+ 2 files changed, 33 insertions(+), 7 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+index cbb98af..92365fa 100644
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -1,11 +1,14 @@
+ /*
+- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
++ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
+  *
+- * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+- *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
+- * recipe layer picks the substrate (CPU NEON by default for cycles
+- * 6 + 7; future cycles may dispatch to V3D opportunistically).
++ * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
++ *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
++ *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
++ * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
++ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
++ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
++ * so cycle 8 stays on the CPU NEON path until a separate change
++ * gates QPU init on a daedalus-fourier feature flag).
+  *
+  * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+  * column-major convention: block[r + N*c] = coefficient at
+@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
++void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                         int alpha, int beta, int8_t *tc0);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+ {
+@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+     daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                         block, 1, &meta);
+ }
++
++void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                         int alpha, int beta, int8_t *tc0)
++{
++    daedalus_h264_deblock_meta meta = {
++        .dst_off = 0,
++        .alpha   = alpha,
++        .beta    = beta,
++    };
++    meta.tc0[0] = tc0[0];
++    meta.tc0[1] = tc0[1];
++    meta.tc0[2] = tc0[2];
++    meta.tc0[3] = tc0[3];
++
++    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
++
++    daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
++                                                 1, &meta);
++}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+index 741e551..85ac381 100644
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
+@@ -27,6 +27,8 @@
+ 
+ void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
++void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                         int alpha, int beta, int8_t *tc0);
+ void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
+ void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
+     int cpu_flags = av_get_cpu_flags();
+ 
+     if (have_neon(cpu_flags) && bit_depth == 8) {
+-        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_neon;
++        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
+         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
+         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
+         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
+-- 
+2.47.3
+
@@ -24,7 +24,7 @@ _srcname=FFmpeg
 _version='8.1'
 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935'  # v4l2-request-n8.1 tip 2026-04-24
 pkgver=8.1.r123329.b57fbbe
-pkgrel=7   # pkgrel=7 — H.264 IDCT 8x8 daedalus-fourier substitution (cycle 7, 2026-05-22)
+pkgrel=8   # pkgrel=8 — H.264 luma-v deblock daedalus-fourier substitution (cycle 8, 2026-05-22)
 epoch=2

 # daedalus-fourier pin — first kernel substitution in libavcodec
@@ -91,8 +91,9 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
        '0001-libudev-bypass-fallback.patch'
        '0002-nv15-to-p010-unpack.patch'
        '0003-h264-idct4-daedalus-fourier.patch'
-        '0004-h264-idct8-daedalus-fourier.patch')
-sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
+        '0004-h264-idct8-daedalus-fourier.patch'
+        '0005-h264-deblock-luma-v-daedalus-fourier.patch')
+sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')

 pkgver() {
  cd "${_srcname}"
@@ -107,6 +108,7 @@ prepare() {
  patch -Np1 -i "${srcdir}/0002-nv15-to-p010-unpack.patch"
  patch -Np1 -i "${srcdir}/0003-h264-idct4-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
+  patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
 }

 build() {
@@ -0,0 +1,121 @@
+From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Fri, 22 May 2026 12:15:16 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
+ daedalus-fourier
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
+deblock, called per macroblock-row edge from the slice deblock
+loop) now dispatches through
+daedalus_recipe_dispatch_h264_deblock_luma_v instead of
+ff_h264_v_loop_filter_luma_neon.
+
+The recipe layer picks the substrate; for cycle 8 the daedalus
+docstring marks the kernel "CPU primary; QPU opportunistic", but
+the libavcodec.so context here is built with
+daedalus_ctx_create_no_qpu — process-global pthread_once init,
+shared with cycles 6/7.  QPU opportunism stays gated off until a
+follow-up adds an explicit feature flag (no implicit Vulkan init
+in arbitrary host processes).  In the meantime cycle 8 is a
+plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
+
+Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
+the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
+only covers the non-intra path per its docstring.
+
+FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
+(int32_t alpha/beta + inline int8_t tc0[4]).  pix already points
+to row 0 of the bottom block per FFmpeg's deblock convention,
+satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c   | 36 +++++++++++++++++++----
+ libavcodec/aarch64/h264dsp_init_aarch64.c |  4 ++-
+ 2 files changed, 33 insertions(+), 7 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+index cbb98af..92365fa 100644
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -1,11 +1,14 @@
+ /*
+- * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
++ * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
+  *
+- * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
+- *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
+- * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
+- * recipe layer picks the substrate (CPU NEON by default for cycles
+- * 6 + 7; future cycles may dispatch to V3D opportunistically).
++ * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
++ *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
++ *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
++ * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
++ * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
++ * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
++ * so cycle 8 stays on the CPU NEON path until a separate change
++ * gates QPU init on a daedalus-fourier feature flag).
+  *
+  * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
+  * column-major convention: block[r + N*c] = coefficient at
+@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
++void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                         int alpha, int beta, int8_t *tc0);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+ {
+@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+     daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
+                                         block, 1, &meta);
+ }
++
++void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                         int alpha, int beta, int8_t *tc0)
++{
++    daedalus_h264_deblock_meta meta = {
++        .dst_off = 0,
++        .alpha   = alpha,
++        .beta    = beta,
++    };
++    meta.tc0[0] = tc0[0];
++    meta.tc0[1] = tc0[1];
++    meta.tc0[2] = tc0[2];
++    meta.tc0[3] = tc0[3];
++
++    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
++
++    daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
++                                                 1, &meta);
++}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+index 741e551..85ac381 100644
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c
+++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
+@@ -27,6 +27,8 @@
+ 
+ void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
++void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                         int alpha, int beta, int8_t *tc0);
+ void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                      int beta, int8_t *tc0);
+ void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
+     int cpu_flags = av_get_cpu_flags();
+ 
+     if (have_neon(cpu_flags) && bit_depth == 8) {
+-        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_neon;
++        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
+         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
+         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
+         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
+-- 
+2.47.3
+
@@ -33,11 +33,13 @@ FFMPEG_VERSION=8.1
 # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
 # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
 PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
-PKGREL=7  # pkgrel=7 — H.264 IDCT 8x8 daedalus-fourier substitution
-          # (cycle 7).  Stacks on top of cycle-6 IDCT 4x4 (PR #76) and
-          # the libxml2-drop ABI-skew workaround (PR #78).  Wires
-          # H264DSPContext.idct8_add through
-          # daedalus_recipe_dispatch_h264_idct8.  (2026-05-22)
+PKGREL=8  # pkgrel=8 — H.264 luma-v deblock daedalus-fourier substitution
+          # (cycle 8, non-intra bS<4 vertical luma).  Stacks on cycles
+          # 6/7 (IDCT 4x4 + 8x8).  Wires H264DSPContext.v_loop_filter_luma
+          # through daedalus_recipe_dispatch_h264_deblock_luma_v.
+          # ctx stays no-QPU until a separate change gates Vulkan init
+          # on a feature flag; cycle-8 dispatch is NEON-by-recipe for
+          # now.  (2026-05-22)

 # daedalus-fourier pin — first kernel substitution in libavcodec (cycle 6
 # H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon already ships
@@ -68,6 +70,7 @@ patch -Np1 -i "$HERE/0001-libudev-bypass-fallback.patch"
 patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
 patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"

 # --- daedalus-fourier: fetch + build static .a with PIC, install to a
 # per-build prefix; libavcodec.so links it into the shared object so
@@ -1,3 +1,28 @@
+ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-8) bookworm trixie; urgency=medium
+
+  * Add 0005-h264-deblock-luma-v-daedalus-fourier.patch —
+    H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
+    deblock, called per macroblock-row edge from the slice deblock
+    loop in libavcodec/h264_loopfilter.c) now dispatches through
+    daedalus_recipe_dispatch_h264_deblock_luma_v instead of
+    ff_h264_v_loop_filter_luma_neon.  Cycle 8 of the daedalus-v4l2#11
+    step 2 substitution arc.
+  * Cycle 8 is marked "CPU primary; QPU opportunistic" in
+    daedalus-fourier, but the libavcodec.so context here uses
+    daedalus_ctx_create_no_qpu (process-global pthread_once,
+    shared with cycles 6/7).  Opportunistic QPU is deferred to a
+    separate change that gates Vulkan init on a feature flag, to
+    avoid implicit Vulkan init in arbitrary host processes.  For
+    now cycle 8 is plumbing-only — NEON-by-recipe.
+  * Intra (bS=4) loop filter c->v_loop_filter_luma_intra stays on
+    the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
+    only covers the non-intra path per its API docstring.
+  * Bit-exact against ff_h264_v_loop_filter_luma_neon (daedalus-fourier
+    cycle 8 green).
+  * No SONAME change, no Depends change.
+
+ -- Markus Fritsche <mfritsche@reauktion.de>  Fri, 22 May 2026 12:30:00 +0000
+
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-7) bookworm trixie; urgency=medium

  * Add 0004-h264-idct8-daedalus-fourier.patch — H264DSPContext.idct8_add