ffmpeg-v4l2-request-fourier: flip libavcodec daedalus ctx no_qpu → qpu-capable (0014)

(Renumbered from 0013: PR #102 landed 0013-h264-deblock-chroma-intra while this PR was open, so the next free slot is 0014.)

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so process-global daedalus_ctx via daedalus_ctx_create_no_qpu(). Rationale at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx would have meant pointless Vulkan init in every host process.

Two things changed since:

  1. Every H.264 hot-path primitive now has a V3D7 compute shader: IDCT 4x4/8x8 + 8 deblock variants (luma+chroma × V+H × inter+intra) + 30 qpel positions. See daedalus-fourier PRs #28-#35.

  2. Dispatch overhead has been hammered down: buffer pool in v3d_runner + persistent command buffer. daedalus-fourier PR #36 bench on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
         CPU NEON only:  5.57 ms
         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

PR #10's CPU-4x-faster-than-QPU verdict (which justified the original no_qpu ctx choice) is reversed by ~17x.

This commit adds 0014-h264-ctx-qpu-capable.patch which flips both H.264 TUs (h264_idct_daedalus.c, h264_qpel_daedalus.c) from daedalus_ctx_create_no_qpu() to daedalus_ctx_create().

daedalus_ctx_create() probes for a usable Vulkan device and falls back to no_qpu mode if unavailable, so this is safe on hosts without V3D (x86 build runners, Debian aarch64 builders without renderD, etc.). Hosts WITH V3D (Pi 5 deployment targets) now route the H.264 hot-path through V3D compute instead of CPU NEON.

Wired into both arch PKGBUILD (source[] + prepare()) and debian build-deb.sh; both pkgrel bumped 10 → 11.

Refs reauktion/daedalus-fourier!36.
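The message does not derive the "~17x" figure. One plausible reading, sketched below as an assumption rather than the author's stated method, is that it multiplies the old verdict (CPU about 4x faster) by the new one (QPU 4.30x faster). The helper names are hypothetical. Note that dividing the rounded millisecond figures gives roughly 4.28, so the quoted 4.30x presumably comes from unrounded measurements.

```c
/* Hypothetical helpers for the bench arithmetic; not part of daedalus-fourier. */

/* Speedup of the QPU path over CPU NEON for the summed 1080p hot path. */
double qpu_speedup(double cpu_ms, double qpu_ms)
{
    return cpu_ms / qpu_ms;
}

/* Size of the reversal when "CPU is a times faster" becomes
 * "QPU is b times faster": the two ratios multiply (assumed reading). */
double verdict_swing(double old_cpu_advantage, double new_qpu_advantage)
{
    return old_cpu_advantage * new_qpu_advantage;
}
```

With the figures from the message, qpu_speedup(5.57, 1.30) is about 4.28 and verdict_swing(4.0, 4.30) is about 17.2.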
Merge pull request 'mesa-panvk-bifrost r9: bump maxImageDimension3D to 2048 (iter22, unblocks Dawn/WebGPU)' (#103 ) from claude-noether/marfrit-packages:mesa-panvk-bifrost-r9 into main
2026-05-25 21:18:18 +02:00 · 2026-05-25 18:17:16 +00:00 · 2026-05-25 12:48:19 +00:00 · 2026-05-25 14:22:02 +02:00
6 changed files with 420 additions and 4 deletions
@@ -0,0 +1,120 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: claude-noether <claude-noether@noreply.localhost>
+Date: Mon, 25 May 2026 14:30:00 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier
+
+Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra
+with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0)
+branch.  4:2:2 stays on the in-tree NEON path (the daedalus chroma intra
+dispatch is 4:2:0-only).
+
+The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH
+macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra
+symbols + recipe wrappers).
+
+Re-architects the chroma init: v_loop_filter_chroma_intra was previously
+assigned unconditionally to the NEON variant (which works for both 4:2:0
+and 4:2:2).  We now assign it INSIDE both branches of the chroma_format_idc
+conditional, with the 4:2:0 branch picking daedalus and the 4:2:2 branch
+keeping NEON.  No regression for 4:2:2 streams.
+
+Same NEON-to-NEON-via-recipe shape as the 0010 luma intra patch.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc chroma intra.
+---
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+--- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 14:21:08.267156263 +0200
++++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 14:21:08.287745931 +0200
+@@ -1,5 +1,5 @@
+ /*
+- * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
++ * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h (inter+intra) deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
+  *
+  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
+  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
+@@ -9,6 +9,8 @@
+  *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
+  *        H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
+  *        H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
++ *        H264DSPContext.v_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_v_intra
++ *        H264DSPContext.h_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_h_intra
+  *        H264DSPContext.chroma_dc_dequant_idct   → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
+  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
+  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+@@ -61,6 +63,10 @@
+                                                 int alpha, int beta);
+ void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                                 int alpha, int beta);
++void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
++void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
+ void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+@@ -218,3 +224,30 @@
+     block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
+     block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
+ }
++
++void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta)
++{
++    daedalus_h264_deblock_meta meta = {
++        .dst_off = 0,
++        .alpha   = alpha,
++        .beta    = beta,
++    };
++    /* tc0[] unused for intra (bS=4 hardcodes the strength). */
++    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
++    daedalus_recipe_dispatch_h264_deblock_chroma_v_intra(g_dctx, pix, (size_t)stride,
++                                                          1, &meta);
++}
++
++void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta)
++{
++    daedalus_h264_deblock_meta meta = {
++        .dst_off = 0,
++        .alpha   = alpha,
++        .beta    = beta,
++    };
++    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
++    daedalus_recipe_dispatch_h264_deblock_chroma_h_intra(g_dctx, pix, (size_t)stride,
++                                                          1, &meta);
++}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 14:21:08.268311057 +0200
++++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 14:21:08.287886563 +0200
+@@ -42,6 +42,10 @@
+ void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                                 int alpha, int beta);
+ void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
++void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
++void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
+ void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                        int beta, int8_t *tc0);
+ void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+@@ -133,14 +137,15 @@
+         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
+ 
+         c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
+-        c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
+ 
+         if (chroma_format_idc <= 1) {
+             c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
++            c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_daedalus;
+             c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
+-            c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
++            c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_daedalus;
+             c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
+         } else {
++            c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
+             c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma422_neon;
+             c->h_loop_filter_chroma_mbaff = ff_h264_h_loop_filter_chroma_neon;
+             c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma422_intra_neon;
+--
+2.47.3
+
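The init change in the patch above moves the chroma intra assignment inside both branches of the chroma_format_idc conditional. A minimal sketch of that selection pattern follows. The struct, the stand-in filters, and the tag values are hypothetical and exist only so the choice can be observed; the real H264DSPContext and filter implementations are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Signature shared by the intra chroma loop filters in the patch. */
typedef void (*chroma_intra_fn)(uint8_t *pix, ptrdiff_t stride,
                                int alpha, int beta);

/* Hypothetical, cut-down stand-in for H264DSPContext. */
typedef struct {
    chroma_intra_fn v_loop_filter_chroma_intra;
    chroma_intra_fn h_loop_filter_chroma_intra;
} demo_dsp_ctx;

enum { TAG_DAEDALUS_V = 1, TAG_DAEDALUS_H, TAG_NEON_V, TAG_NEON_H422 };

/* Stand-in filters: each only records which implementation ran. */
void demo_v_daedalus(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = TAG_DAEDALUS_V; }

void demo_h_daedalus(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = TAG_DAEDALUS_H; }

void demo_v_neon(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = TAG_NEON_V; }

void demo_h422_neon(uint8_t *pix, ptrdiff_t stride, int alpha, int beta)
{ (void)stride; (void)alpha; (void)beta; pix[0] = TAG_NEON_H422; }

/* 4:2:0 (chroma_format_idc <= 1) picks the daedalus wrappers;
 * anything else keeps the NEON variants, as in the patched init. */
void demo_dsp_init(demo_dsp_ctx *c, int chroma_format_idc)
{
    if (chroma_format_idc <= 1) {
        c->v_loop_filter_chroma_intra = demo_v_daedalus;
        c->h_loop_filter_chroma_intra = demo_h_daedalus;
    } else {
        c->v_loop_filter_chroma_intra = demo_v_neon;
        c->h_loop_filter_chroma_intra = demo_h422_neon;
    }
}
```

Assigning the vertical filter in both branches, instead of once before the conditional, is what lets the 4:2:2 path keep its previous behaviour while 4:2:0 changes.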
@@ -0,0 +1,85 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Mon, 25 May 2026 21:00:00 +0200
+Subject: [PATCH] avcodec/aarch64/h264: use QPU-capable daedalus ctx (bench
+ shows 4.30x faster on Pi 5)
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
+process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
+at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
+would have meant pointless Vulkan init in every host process (firefox-
+fourier, mpv-fourier, daedalus_v4l2_daemon, ...).
+
+Two things changed since:
+
+  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
+     IDCT 4x4/8x8 (cycles 6, 7), 8 deblock variants (luma+chroma x V+H
+     x inter+intra), 30 qpel positions (15 put_ + 15 avg_).  See
+     daedalus-fourier PRs #28-#35.
+
+  2. Dispatch overhead has been hammered down — buffer pool in
+     v3d_runner (daedalus-fourier task #160) plus persistent command
+     buffer (task #161).  daedalus-fourier PR #36 bench measures the
+     1080p worst-case sum on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):
+
+       kernel             CPU ns/op  QPU ns/op  winner
+       IDCT 4x4 luma          10.79       2.47  QPU 4.36x
+       IDCT 8x8 luma          29.69       9.23  QPU 3.22x
+       Deblock luma_v         17.58      10.21  QPU 1.72x
+       Deblock luma_h         38.41       9.98  QPU 3.85x
+       qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
+       qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
+       qpel mc22 (8x8)        71.58       9.64  QPU 7.43x
+
+       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
+         CPU NEON only:  5.57 ms
+         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)
+
+PR #10's verdict (CPU 4x faster than QPU at IDCT) is reversed.  Switch
+the substitution context to daedalus_ctx_create() in both H.264 TUs
+(h264_idct_daedalus.c, h264_qpel_daedalus.c) so the recipe layer can
+actually route through the now-faster QPU path.
+
+daedalus_ctx_create() probes for a usable Vulkan device and falls back
+to no_qpu mode if unavailable, so this is safe on hosts without V3D
+(x86 reauktion build runners, debian-aarch64 builders without renderD,
+etc.).  Hosts WITH V3D (Pi 5 deployment targets) get the speedup.
+
+The remaining qpel mc02 anomaly (single-axis vertical filter, CPU 1.21x
+faster) is bench-flagged for a v2 shader follow-up; the recipe entry
+stays on QPU because the 2026-05-23 substrate decree still holds and
+the gap is marginal.
+
+Refs reauktion/daedalus-fourier!36.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c | 2 +-
+ libavcodec/aarch64/h264_qpel_daedalus.c | 2 +-
+ 2 files changed, 2 insertions(+), 2 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
++++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -32,7 +32,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
+
+ static void daedalus_ctx_init_once(void)
+ {
+-    g_dctx = daedalus_ctx_create_no_qpu();
++    g_dctx = daedalus_ctx_create();
+ }
+
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
+--- a/libavcodec/aarch64/h264_qpel_daedalus.c
++++ b/libavcodec/aarch64/h264_qpel_daedalus.c
+@@ -38,7 +38,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
+
+ static void daedalus_ctx_init_once(void)
+ {
+-    g_dctx = daedalus_ctx_create_no_qpu();
++    g_dctx = daedalus_ctx_create();
+ }
+
+ void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
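The patch above relies on two behaviours: pthread_once() creating the process-global context exactly once, and the create call degrading to a CPU-only context when no usable device is found. The sketch below illustrates that shape under stated assumptions. Every demo_ name is hypothetical, and the probe is a stub that always reports "no device"; the internals of daedalus_ctx_create() are not shown in the source and are not reproduced here.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context: only records whether the QPU path is enabled. */
typedef struct {
    bool qpu_enabled;
} demo_ctx;

static demo_ctx        g_ctx_storage;
static demo_ctx       *g_ctx;
static pthread_once_t  g_ctx_once = PTHREAD_ONCE_INIT;
static int             g_init_calls;

/* Stub probe standing in for "is a usable Vulkan device present?".
 * Always false here, as on a build host without V3D. */
bool demo_probe_device(void)
{
    return false;
}

/* Runs at most once per process, no matter how many threads race. */
static void demo_ctx_init_once(void)
{
    g_init_calls++;
    g_ctx_storage.qpu_enabled = demo_probe_device(); /* false => CPU-only mode */
    g_ctx = &g_ctx_storage;
}

/* Lazy accessor: every wrapper would call this before dispatching. */
demo_ctx *demo_get_ctx(void)
{
    pthread_once(&g_ctx_once, demo_ctx_init_once);
    return g_ctx;
}

/* Exposed so callers can confirm the initializer ran only once. */
int demo_init_calls(void)
{
    return g_init_calls;
}
```

Because the probe result is stored in the context, callers never branch on hardware availability themselves; they fetch the context and let the dispatch layer decide.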
@@ -24,7 +24,7 @@ _srcname=FFmpeg
 _version='8.1'
 _commit='b57fbbe50c9b2656fad86a1a7eeabfd2b2a50935'  # v4l2-request-n8.1 tip 2026-04-24
 pkgver=8.1.r123329.b57fbbe
-pkgrel=10  # pkgrel=10 — H.264 luma qpel mc20 daedalus-fourier substitution (cycle 9, 2026-05-23)
+pkgrel=11  # pkgrel=11 — libavcodec.so daedalus ctx flipped no_qpu → qpu-capable (PR #36 bench: QPU 4.30x on 1080p hot-path sum, 2026-05-25)
 epoch=2

 # daedalus-fourier pin.  209a421 = PR #2 merge (Phase 8c — public API
@@ -99,8 +99,10 @@ source=("git+https://github.com/Kwiboo/FFmpeg.git#commit=${_commit}"
        '0009-h264-deblock-chroma-daedalus-fourier.patch'
        '0010-h264-deblock-luma-intra-daedalus-fourier.patch'
        '0011-h264-chroma-dc-hadamard-daedalus-fourier.patch'
-        '0012-h264-qpel-rest-daedalus-fourier.patch')
-sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')
+        '0012-h264-qpel-rest-daedalus-fourier.patch'
+        '0013-h264-deblock-chroma-intra-daedalus-fourier.patch'
+        '0014-h264-ctx-qpu-capable.patch')
+sha256sums=('SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP' 'SKIP')

 pkgver() {
  cd "${_srcname}"
@@ -123,6 +125,8 @@ prepare() {
  patch -Np1 -i "${srcdir}/0010-h264-deblock-luma-intra-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0011-h264-chroma-dc-hadamard-daedalus-fourier.patch"
  patch -Np1 -i "${srcdir}/0012-h264-qpel-rest-daedalus-fourier.patch"
+  patch -Np1 -i "${srcdir}/0013-h264-deblock-chroma-intra-daedalus-fourier.patch"
+  patch -Np1 -i "${srcdir}/0014-h264-ctx-qpu-capable.patch"
 }

 build() {
@@ -0,0 +1,120 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: claude-noether <claude-noether@noreply.localhost>
+Date: Mon, 25 May 2026 14:30:00 +0200
+Subject: [PATCH] avcodec/aarch64/h264dsp: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier
+
+Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra
+with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0)
+branch.  4:2:2 stays on the in-tree NEON path (the daedalus chroma intra
+dispatch is 4:2:0-only).
+
+The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH
+macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra
+symbols + recipe wrappers).
+
+Re-architects the chroma init: v_loop_filter_chroma_intra was previously
+assigned unconditionally to the NEON variant (which works for both 4:2:0
+and 4:2:2).  We now assign it INSIDE both branches of the chroma_format_idc
+conditional, with the 4:2:0 branch picking daedalus and the 4:2:2 branch
+keeping NEON.  No regression for 4:2:2 streams.
+
+Same NEON-to-NEON-via-recipe shape as the 0010 luma intra patch.
+
+Refs reauktion/daedalus-v4l2#11 — substitution arc chroma intra.
+---
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+--- a/libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 14:21:08.267156263 +0200
++++ libavcodec/aarch64/h264_idct_daedalus.c	2026-05-25 14:21:08.287745931 +0200
+@@ -1,5 +1,5 @@
+ /*
+- * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
++ * H.264 4x4 / 8x8 IDCT + luma v/h (inter+intra) + chroma v/h (inter+intra) deblock + chroma DC Hadamard — daedalus-fourier substitution shims.
+  *
+  * Routes H264DSPContext.idct_add           → daedalus_recipe_dispatch_h264_idct4
+  *        H264DSPContext.idct8_add          → daedalus_recipe_dispatch_h264_idct8
+@@ -9,6 +9,8 @@
+  *        H264DSPContext.h_loop_filter_chroma → daedalus_recipe_dispatch_h264_deblock_chroma_h
+  *        H264DSPContext.v_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_v_intra
+  *        H264DSPContext.h_loop_filter_luma_intra → daedalus_recipe_dispatch_h264_deblock_luma_h_intra
++ *        H264DSPContext.v_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_v_intra
++ *        H264DSPContext.h_loop_filter_chroma_intra → daedalus_recipe_dispatch_h264_deblock_chroma_h_intra
+  *        H264DSPContext.chroma_dc_dequant_idct   → daedalus_h264_chroma_dc_hadamard_2x2 + caller-side qmul
+  * instead of the in-tree ff_h264_*_neon assembly.  The recipe layer
+  * picks the substrate (CPU NEON for cycles 6 + 7 by default; cycle 8
+@@ -61,6 +63,10 @@
+                                                 int alpha, int beta);
+ void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                                 int alpha, int beta);
++void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
++void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
+ void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
+ 
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride)
+@@ -218,3 +224,30 @@
+     block[stride*1 + xStride*0] = (int16_t)((int)dc[2] * qmul >> 7);
+     block[stride*1 + xStride*1] = (int16_t)((int)dc[3] * qmul >> 7);
+ }
++
++void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta)
++{
++    daedalus_h264_deblock_meta meta = {
++        .dst_off = 0,
++        .alpha   = alpha,
++        .beta    = beta,
++    };
++    /* tc0[] unused for intra (bS=4 hardcodes the strength). */
++    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
++    daedalus_recipe_dispatch_h264_deblock_chroma_v_intra(g_dctx, pix, (size_t)stride,
++                                                          1, &meta);
++}
++
++void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta)
++{
++    daedalus_h264_deblock_meta meta = {
++        .dst_off = 0,
++        .alpha   = alpha,
++        .beta    = beta,
++    };
++    pthread_once(&g_dctx_once, daedalus_ctx_init_once);
++    daedalus_recipe_dispatch_h264_deblock_chroma_h_intra(g_dctx, pix, (size_t)stride,
++                                                          1, &meta);
++}
+diff --git a/libavcodec/aarch64/h264dsp_init_aarch64.c b/libavcodec/aarch64/h264dsp_init_aarch64.c
+--- a/libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 14:21:08.268311057 +0200
++++ libavcodec/aarch64/h264dsp_init_aarch64.c	2026-05-25 14:21:08.287886563 +0200
+@@ -42,6 +42,10 @@
+ void ff_h264_h_loop_filter_luma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
+                                                 int alpha, int beta);
+ void ff_h264_chroma_dc_dequant_idct_daedalus(int16_t *block, int qmul);
++void ff_h264_v_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
++void ff_h264_h_loop_filter_chroma_intra_daedalus(uint8_t *pix, ptrdiff_t stride,
++                                                 int alpha, int beta);
+ void ff_h264_v_loop_filter_chroma_neon(uint8_t *pix, ptrdiff_t stride, int alpha,
+                                        int beta, int8_t *tc0);
+ void ff_h264_v_loop_filter_chroma_daedalus(uint8_t *pix, ptrdiff_t stride,
+@@ -133,14 +137,15 @@
+         c->h_loop_filter_luma_intra= ff_h264_h_loop_filter_luma_intra_daedalus;
+ 
+         c->v_loop_filter_chroma = ff_h264_v_loop_filter_chroma_daedalus;
+-        c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
+ 
+         if (chroma_format_idc <= 1) {
+             c->chroma_dc_dequant_idct = ff_h264_chroma_dc_dequant_idct_daedalus;
++            c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_daedalus;
+             c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma_daedalus;
+-            c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_neon;
++            c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma_intra_daedalus;
+             c->h_loop_filter_chroma_mbaff_intra = ff_h264_h_loop_filter_chroma_mbaff_intra_neon;
+         } else {
++            c->v_loop_filter_chroma_intra = ff_h264_v_loop_filter_chroma_intra_neon;
+             c->h_loop_filter_chroma = ff_h264_h_loop_filter_chroma422_neon;
+             c->h_loop_filter_chroma_mbaff = ff_h264_h_loop_filter_chroma_neon;
+             c->h_loop_filter_chroma_intra = ff_h264_h_loop_filter_chroma422_intra_neon;
+--
+2.47.3
+
@@ -0,0 +1,85 @@
+From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
+From: Markus Fritsche <mfritsche@reauktion.de>
+Date: Mon, 25 May 2026 21:00:00 +0200
+Subject: [PATCH] avcodec/aarch64/h264: use QPU-capable daedalus ctx (bench
+ shows 4.30x faster on Pi 5)
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
+process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
+at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
+would have meant pointless Vulkan init in every host process (firefox-
+fourier, mpv-fourier, daedalus_v4l2_daemon, ...).
+
+Two things changed since:
+
+  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
+     IDCT 4x4/8x8 (cycles 6, 7), 8 deblock variants (luma+chroma x V+H
+     x inter+intra), 30 qpel positions (15 put_ + 15 avg_).  See
+     daedalus-fourier PRs #28-#35.
+
+  2. Dispatch overhead has been hammered down — buffer pool in
+     v3d_runner (daedalus-fourier task #160) plus persistent command
+     buffer (task #161).  daedalus-fourier PR #36 bench measures the
+     1080p worst-case sum on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):
+
+       kernel             CPU ns/op  QPU ns/op  winner
+       IDCT 4x4 luma          10.79       2.47  QPU 4.36x
+       IDCT 8x8 luma          29.69       9.23  QPU 3.22x
+       Deblock luma_v         17.58      10.21  QPU 1.72x
+       Deblock luma_h         38.41       9.98  QPU 3.85x
+       qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
+       qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
+       qpel mc22 (8x8)        71.58       9.64  QPU 7.43x
+
+       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
+         CPU NEON only:  5.57 ms
+         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)
+
+PR #10's verdict (CPU 4x faster than QPU at IDCT) is reversed.  Switch
+the substitution context to daedalus_ctx_create() in both H.264 TUs
+(h264_idct_daedalus.c, h264_qpel_daedalus.c) so the recipe layer can
+actually route through the now-faster QPU path.
+
+daedalus_ctx_create() probes for a usable Vulkan device and falls back
+to no_qpu mode if unavailable, so this is safe on hosts without V3D
+(x86 reauktion build runners, debian-aarch64 builders without renderD,
+etc.).  Hosts WITH V3D (Pi 5 deployment targets) get the speedup.
+
+The remaining qpel mc02 anomaly (single-axis vertical filter, CPU 1.21x
+faster) is bench-flagged for a v2 shader follow-up; the recipe entry
+stays on QPU because the 2026-05-23 substrate decree still holds and
+the gap is marginal.
+
+Refs reauktion/daedalus-fourier!36.
+---
+ libavcodec/aarch64/h264_idct_daedalus.c | 2 +-
+ libavcodec/aarch64/h264_qpel_daedalus.c | 2 +-
+ 2 files changed, 2 insertions(+), 2 deletions(-)
+
+diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
+--- a/libavcodec/aarch64/h264_idct_daedalus.c
++++ b/libavcodec/aarch64/h264_idct_daedalus.c
+@@ -32,7 +32,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
+
+ static void daedalus_ctx_init_once(void)
+ {
+-    g_dctx = daedalus_ctx_create_no_qpu();
++    g_dctx = daedalus_ctx_create();
+ }
+
+ void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
+diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
+--- a/libavcodec/aarch64/h264_qpel_daedalus.c
++++ b/libavcodec/aarch64/h264_qpel_daedalus.c
+@@ -38,7 +38,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;
+
+ static void daedalus_ctx_init_once(void)
+ {
+-    g_dctx = daedalus_ctx_create_no_qpu();
++    g_dctx = daedalus_ctx_create();
+ }
+
+ void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
@@ -33,7 +33,7 @@ FFMPEG_VERSION=8.1
 # epoch 2 matches Debian's stock ffmpeg (currently 7:7.1.x in trixie);
 # +rfourier suffix to avoid colliding with upstream/Debian rebuilds.
 PKGVER=2:${FFMPEG_VERSION}+rfourier+gb57fbbe
-PKGREL=10  # pkgrel=10 — H.264 luma qpel mc20 daedalus-fourier substitution
+PKGREL=11  # pkgrel=11 — libavcodec.so daedalus ctx flipped no_qpu → qpu-capable (PR #36 bench: QPU 4.30x on 1080p hot-path sum, 2026-05-25)
           # (cycle 9 of the daedalus-v4l2#11 step 2 substitution arc; closes
           # the libavcodec.so substitution sequence 6 IDCT4 / 7 IDCT8 /
           # 8 luma-v deblock / 9 qpel mc20).  Pulls daedalus-fourier PR #2
@@ -79,6 +79,8 @@ patch -Np1 -i "$HERE/0009-h264-deblock-chroma-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0010-h264-deblock-luma-intra-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0011-h264-chroma-dc-hadamard-daedalus-fourier.patch"
 patch -Np1 -i "$HERE/0012-h264-qpel-rest-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0013-h264-deblock-chroma-intra-daedalus-fourier.patch"
+patch -Np1 -i "$HERE/0014-h264-ctx-qpu-capable.patch"

 # --- daedalus-fourier: fetch + build static .a with PIC, install to a
 # per-build prefix; libavcodec.so links it into the shared object so
Author	SHA1	Message	Date
claude-noether	9c70ffffe7	ffmpeg-v4l2-request-fourier: flip libavcodec daedalus ctx no_qpu → qpu-capable (0014)	2026-05-25 21:18:18 +02:00
marfrit	520f2fce33	Merge pull request 'mesa-panvk-bifrost r9: bump maxImageDimension3D to 2048 (iter22, unblocks Dawn/WebGPU)' (#103) from claude-noether/marfrit-packages:mesa-panvk-bifrost-r9 into main Reviewed-on: #103	2026-05-25 18:17:16 +00:00
marfrit	875156782e	Merge pull request 'ffmpeg-v4l2-request-fourier: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier (0013)' (#102) from claude-noether/marfrit-packages:noether/h264-substitute-deblock-chroma-intra into main Reviewed-on: #102	2026-05-25 12:48:19 +00:00
claude-noether	8f9487d355	ffmpeg-v4l2-request-fourier: route H.264 chroma intra deblock (4:2:0) through daedalus-fourier (0013) Substitutes c->v_loop_filter_chroma_intra and c->h_loop_filter_chroma_intra with daedalus wrappers in the bit_depth=8 / chroma_format_idc<=1 (4:2:0) branch. 4:2:2 stays on the in-tree NEON path (the daedalus chroma intra dispatch is 4:2:0-only). The fourier dispatches were exposed in PR #11 (DEFINE_INTRA_DISPATCH macro generates the public daedalus_dispatch_h264_deblock_chroma_*_intra symbols + recipe wrappers). Re-architects the chroma init: v_loop_filter_chroma_intra was previously assigned unconditionally to the NEON variant (which works for both 4:2:0 and 4:2:2). We now assign it INSIDE both branches of the chroma_format_idc conditional — 4:2:0 picks daedalus, 4:2:2 keeps NEON. No regression for 4:2:2 streams. Same NEON-to-NEON via recipe shape as 0010 luma intra. Closes the deblock substitution layer for the 4:2:0 / 8-bit hot path: - 0005 luma_v non-intra ✓ - 0008 luma_h non-intra ✓ - 0009 chroma_v / chroma_h non-intra ✓ - 0010 luma_v / luma_h intra ✓ - 0013 chroma_v / chroma_h intra ✓ All 8 deblock variants for the common 4:2:0 path now route through daedalus. 4:2:2 chroma + the chroma422 mbaff variants stay on in-tree NEON. Verified the patch applies cleanly on top of 0001-0012 against the pinned upstream commit b57fbbe5 on hertz.	2026-05-25 14:22:02 +02:00