Merge pull request 'ffmpeg-v4l2-request-fourier: substitute H.264 luma-v deblock → daedalus-fourier' (#86) from claude-noether/marfrit-packages:noether/ffmpeg-fourier-deblock-luma-v-daedalus into main

Reviewed-on: #86
2026-05-22 10:29:09 +00:00
parent 510a31622c 29e0852d11
commit d11a52405d
5 changed files with 280 additions and 8 deletions
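Reviewer note: the per-edge operation this PR reroutes can be pictured with a scalar sketch. The code below is not taken from daedalus-fourier or from FFmpeg's NEON assembly. It is a minimal reference, assuming 8-bit samples and the standard H.264 normal-strength (bS<4) luma edge filter, with the same argument order as `v_loop_filter_luma`. The names `ref_h264_v_loop_filter_luma` and `clip3` are hypothetical, and the code assumes the compiler right-shifts negative integers arithmetically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp v to the inclusive range [lo, hi]. */
static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical scalar reference for the bS<4 vertical luma edge filter.
 * pix points at row 0 of the block below the edge, so p0..p2 sit at
 * pix[-1*stride] .. pix[-3*stride] and q0..q2 at pix[0] .. pix[2*stride].
 * tc0[i] covers columns 4*i .. 4*i+3; a negative tc0[i] skips that group.
 * Assumes arithmetic right shift for negative values.
 */
static void ref_h264_v_loop_filter_luma(uint8_t *pix, ptrdiff_t stride,
                                        int alpha, int beta, const int8_t *tc0)
{
    assert(pix != NULL && tc0 != NULL);

    for (int x = 0; x < 16; x++) {
        const int tc_orig = tc0[x >> 2];
        uint8_t *c = pix + x;

        if (tc_orig < 0)
            continue;

        const int p0 = c[-1 * stride], p1 = c[-2 * stride], p2 = c[-3 * stride];
        const int q0 = c[0], q1 = c[1 * stride], q2 = c[2 * stride];

        /* Edge activity test: leave real image edges untouched. */
        if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
            continue;

        int tc = tc_orig;
        const int avg = (p0 + q0 + 1) >> 1;

        if (abs(p2 - p0) < beta) {          /* smooth p side: adjust p1 */
            if (tc_orig)
                c[-2 * stride] = (uint8_t)(p1 + clip3(((p2 + avg) >> 1) - p1,
                                                      -tc_orig, tc_orig));
            tc++;
        }
        if (abs(q2 - q0) < beta) {          /* smooth q side: adjust q1 */
            if (tc_orig)
                c[stride] = (uint8_t)(q1 + clip3(((q2 + avg) >> 1) - q1,
                                                 -tc_orig, tc_orig));
            tc++;
        }

        const int delta = clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
        c[-stride] = (uint8_t)clip3(p0 + delta, 0, 255);
        c[0]       = (uint8_t)clip3(q0 - delta, 0, 255);
    }
}
```

A reference of this shape could serve as an oracle when comparing the output of a substituted kernel against the in-tree one on the same input rows. Note that it reads three rows above `pix`, which is why the caller must guarantee that those rows exist.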
@@ -0,0 +1,121 @@
 From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Fri, 22 May 2026 12:15:16 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
 daedalus-fourier
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
 deblock, called per macroblock-row edge from the slice deblock
 loop) now dispatches through
 daedalus_recipe_dispatch_h264_deblock_luma_v instead of
 ff_h264_v_loop_filter_luma_neon.
 The recipe layer picks the substrate; for cycle 8 the daedalus
 docstring marks the kernel "CPU primary; QPU opportunistic", but
 the libavcodec.so context here is built with
 daedalus_ctx_create_no_qpu — process-global pthread_once init,
 shared with cycles 6/7.  QPU opportunism stays gated off until a
 follow-up adds an explicit feature flag (no implicit Vulkan init
 in arbitrary host processes).  In the meantime cycle 8 is a
 plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
 Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
 the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
 only covers the non-intra path per its docstring.
 FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
 (int32_t alpha/beta + inline int8_t tc0[4]).  pix already points
 to row 0 of the bottom block per FFmpeg's deblock convention,
 satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
 ---
 libavcodec/aarch64/h264_idct_daedalus.c   | 36 +++++++++++++++++++----
 libavcodec/aarch64/h264dsp_init_aarch64.c |  4 ++-
 2 files changed, 33 insertions(+), 7 deletions(-)
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 index cbb98af..92365fa 100644
 --- a/libavcodec/aarch64/h264_idct_daedalus.c
 +++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,11 +1,14 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
  *
 - * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
 - *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
 - * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
 - * recipe layer picks the substrate (CPU NEON by default for cycles
 - * 6 + 7; future cycles may dispatch to V3D opportunistically).
 + * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
 + *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
 + *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
 + * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
 + * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
 + * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
 + * so cycle 8 stays on the CPU NEON path until a separate change
 + * gates QPU init on a daedalus-fourier feature flag).
  *
  * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
  * column-major convention: block[r + N*c] = coefficient at
@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 +void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
     daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
                                         block, 1, &meta);
 }
 +
 +void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
 +                                                 1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 index 741e551..85ac381 100644
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c
 +++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -27,6 +27,8 @@
 void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                      int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                      int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
     int cpu_flags = av_get_cpu_flags();
     if (have_neon(cpu_flags) && bit_depth == 8) {
 -        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_neon;
 +        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 -- 
 2.47.3
@@ -24,7 +24,7 @@ _srcname=FFmpeg
 _version='8.1'
 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935'  # v4l2-request-n8.1 tip 2026-04-24
 pkgver=8.1.r123329.b57fbbe
-pkgrel=7   # pkgrel=7 — H.264 IDCT 8x8 daedalus-fourier substitution (cycle 7, 2026-05-22)
+pkgrel=8   # pkgrel=8 — H.264 luma-v deblock daedalus-fourier substitution (cycle 8, 2026-05-22)
 epoch=2
 # daedalus-fourier pin — first kernel substitution in libavcodec
@@ -91,8 +91,9 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
        '0001-libudev-bypass-fallback.patch'
        '0002-nv15-to-p010-unpack.patch'
        '0003-h264-idct4-daedalus-fourier.patch'
-        '0004-h264-idct8-daedalus-fourier.patch')
-sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
+        '0004-h264-idct8-daedalus-fourier.patch'
+        '0005-h264-deblock-luma-v-daedalus-fourier.patch')
+sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
 pkgver() {
  cd "${_srcname}"
@@ -107,6 +108,7 @@ prepare() {
  patch -Np1 -i "${srcdir}/0002-nv15-to-p010-unpack.patch"
  patch -Np1 -i "${srcdir}/0003-h264-idct4-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0004-h264-idct8-daedalus-fourier.patch"
+ patch -Np1 -i "${srcdir}/0005-h264-deblock-luma-v-daedalus-fourier.patch"
 }
 build() {
@@ -0,0 +1,121 @@
 From 68731c41d7ea68be0e912b128cb4e71fb56e8263 Mon Sep 17 00:00:00 2001
 From: Markus Fritsche <mfritsche@reauktion.de>
 Date: Fri, 22 May 2026 12:15:16 +0200
 Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 luma-v deblock through
 daedalus-fourier
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit
 H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
 deblock, called per macroblock-row edge from the slice deblock
 loop) now dispatches through
 daedalus_recipe_dispatch_h264_deblock_luma_v instead of
 ff_h264_v_loop_filter_luma_neon.
 The recipe layer picks the substrate; for cycle 8 the daedalus
 docstring marks the kernel "CPU primary; QPU opportunistic", but
 the libavcodec.so context here is built with
 daedalus_ctx_create_no_qpu — process-global pthread_once init,
 shared with cycles 6/7.  QPU opportunism stays gated off until a
 follow-up adds an explicit feature flag (no implicit Vulkan init
 in arbitrary host processes).  In the meantime cycle 8 is a
 plumbing-only substitution, NEON-to-NEON via the daedalus recipe.
 Intra (bS=4) loop filter — c->v_loop_filter_luma_intra — stays on
 the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
 only covers the non-intra path per its docstring.
 FFmpeg `int alpha/beta/int8_t tc0[4]` → daedalus_h264_deblock_meta
 (int32_t alpha/beta + inline int8_t tc0[4]).  pix already points
 to row 0 of the bottom block per FFmpeg's deblock convention,
 satisfying daedalus's `dst_off >= 4 * dst_stride` constraint.
 Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 8.
 ---
 libavcodec/aarch64/h264_idct_daedalus.c   | 36 +++++++++++++++++++----
 libavcodec/aarch64/h264dsp_init_aarch64.c |  4 ++-
 2 files changed, 33 insertions(+), 7 deletions(-)
 diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
 index cbb98af..92365fa 100644
 --- a/libavcodec/aarch64/h264_idct_daedalus.c
 +++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -1,11 +1,14 @@
 /*
 - * H.264 4x4 / 8x8 IDCT + add — daedalus-fourier substitution shims.
 + * H.264 4x4 / 8x8 IDCT + luma-v deblock — daedalus-fourier substitution shims.
  *
 - * Routes H264DSPContext.idct_add  → daedalus_recipe_dispatch_h264_idct4
 - *        H264DSPContext.idct8_add → daedalus_recipe_dispatch_h264_idct8
 - * instead of the in-tree ff_h264_idct{,8}_add_neon assembly.  The
 - * recipe layer picks the substrate (CPU NEON by default for cycles
 - * 6 + 7; future cycles may dispatch to V3D opportunistically).
 + * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
 + *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
 + *        H264DSPContext.v_loop_filter_luma → daedalus_recipe_dispatch_h264_deblock_luma_v
 + * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
 + * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
 + * is CPU primary with QPU opportunistic — the ctx below is no-QPU,
 + * so cycle 8 stays on the CPU NEON path until a separate change
 + * gates QPU init on a daedalus-fourier feature flag).
  *
  * FFmpeg's 4x4 and 8x8 block memory layouts match daedalus's
  * column-major convention: block[r + N*c] = coefficient at
@@ -40,6 +43,8 @@ static void daedalus_ctx_init_once(void)
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride);
 +void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
 {
@@ -60,3 +65,22 @@ void ff_h264_idct8_add_daedalus(uint8_t *dst, int16_t *block, int stride)
     daedalus_recipe_dispatch_h264_idct8(g_dctx, dst, (size_t)stride,
                                         block, 1, &meta);
 }
 +
 +void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0)
 +{
 +    daedalus_h264_deblock_meta meta = {
 +        .dst_off = 0,
 +        .alpha   = alpha,
 +        .beta    = beta,
 +    };
 +    meta.tc0[0] = tc0[0];
 +    meta.tc0[1] = tc0[1];
 +    meta.tc0[2] = tc0[2];
 +    meta.tc0[3] = tc0[3];
 +
 +    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
 +
 +    daedalus_recipe_dispatch_h264_deblock_luma_v(g_dctx, pix, (size_t)stride,
 +                                                 1, &meta);
 +}
 diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
 index 741e551..85ac381 100644
 --- a/libavcodec/aarch64/h264dsp_init_aarch64.c
 +++ b/libavcodec/aarch64/h264dsp_init_aarch64.c
@@ -27,6 +27,8 @@
 void ff_h264_v_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                      int beta, int8_t *tc0);
 +void ff_h264_v_loop_filter_luma_daedalus(uint8_t *pix, ptrdiff_t stride,
 +                                         int alpha, int beta, int8_t *tc0);
 void ff_h264_h_loop_filter_luma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
                                      int beta, int8_t *tc0);
 void ff_h264_v_loop_filter_luma_intra_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
@@ -114,7 +116,7 @@ av_cold void ff_h264dsp_init_aarch64(H264DSPContext *c, const int bit_depth,
     int cpu_flags = av_get_cpu_flags();
     if (have_neon(cpu_flags) && bit_depth == 8) {
 -        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_neon;
 +        c->v_loop_filter_luma   = ff_h264_v_loop_filter_luma_daedalus;
         c->h_loop_filter_luma   = ff_h264_h_loop_filter_luma_neon;
         c->v_loop_filter_luma_intra= ff_h264_v_loop_filter_luma_intra_neon;
         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_neon;
 -- 
 2.47.3
@@ -33,11 +33,13 @@ FFMPEG_VERSION=8.1
 # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
 # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
 PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
-PKGREL=7  # pkgrel=7 — H.264 IDCT 8x8 daedalus-fourier substitution
-          # (cycle 7).  Stacks on top of cycle-6 IDCT 4x4 (PR #76) and
-          # the libxml2-drop ABI-skew workaround (PR #78).  Wires
-          # H264DSPContext.idct8_add through
-          # daedalus_recipe_dispatch_h264_idct8.  (2026-05-22)
+PKGREL=8  # pkgrel=8 — H.264 luma-v deblock daedalus-fourier substitution
+          # (cycle 8, non-intra bS<4 vertical luma).  Stacks on cycles
+          # 6/7 (IDCT 4x4 + 8x8).  Wires H264DSPContext.v_loop_filter_luma
+          # through daedalus_recipe_dispatch_h264_deblock_luma_v.
+          # ctx stays no-QPU until a separate change gates Vulkan init
+          # on a feature flag; cycle-8 dispatch is NEON-by-recipe for
+          # now.  (2026-05-22)
 # daedalus-fourier pin — first kernel substitution in libavcodec (cycle 6
 # H.264 IDCT 4x4).  Same SHA as the daedalus-v4l2 daemon already ships
@@ -68,6 +70,7 @@ patch -Np1 -i "$HERE/0001-libudev-bypass-fallback.patch"
 patch -Np1 -i "$HERE/0002-nv15-to-p010-unpack.patch"
 patch -Np1 -i "$HERE/0003-h264-idct4-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0004-h264-idct8-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0005-h264-deblock-luma-v-daedalus-fourier.patch"
 # --- daedalus-fourier: fetch + build static .a with PIC, install to a
 # per-build prefix; libavcodec.so links it into the shared object so
@@ -1,3 +1,28 @@
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-8) bookworm trixie; urgency=medium
  * Add 0005-h264-deblock-luma-v-daedalus-fourier.patch —
    H264DSPContext.v_loop_filter_luma (non-intra bS<4 vertical luma
    deblock, called per macroblock-row edge from the slice deblock
    loop in libavcodec/h264_loopfilter.c) now dispatches through
    daedalus_recipe_dispatch_h264_deblock_luma_v instead of
    ff_h264_v_loop_filter_luma_neon.  Cycle 8 of the daedalus-v4l2#11
    step 2 substitution arc.
  * Cycle 8 is marked "CPU primary; QPU opportunistic" in
    daedalus-fourier, but the libavcodec.so context here uses
    daedalus_ctx_create_no_qpu (process-global pthread_once,
    shared with cycles 6/7).  Opportunistic QPU is deferred to a
    separate change that gates Vulkan init on a feature flag, to
    avoid implicit Vulkan init in arbitrary host processes.  For
    now cycle 8 is plumbing-only — NEON-by-recipe.
  * Intra (bS=4) loop filter c->v_loop_filter_luma_intra stays on
    the in-tree NEON .S code; daedalus's daedalus_h264_deblock_meta
    only covers the non-intra path per its API docstring.
  * Bit-exact against ff_h264_v_loop_filter_luma_neon (daedalus-fourier
    cycle 8 green).
  * No SONAME change, no Depends change.
 -- Markus Fritsche <mfritsche@reauktion.de>  Fri, 22 May 2026 12:30:00 +0000
 ffmpeg-v4l2-request-fourier (2:8.1+rfourier+gb57fbbe-7) bookworm trixie; urgency=medium
  * Add 0004-h264-idct8-daedalus-fourier.patch — H264DSPContext.idct8_add