marfrit-packages/arch/ffmpeg-v4l2-request-fourier/0014-h264-ctx-qpu-capable.patch

From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Mon, 25 May 2026 21:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264: use QPU-capable daedalus ctx (bench
 shows 4.30x faster on Pi 5)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
would have meant pointless Vulkan init in every host process
(firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...).

Two things changed since:

  1. Every H.264 hot-path primitive now has a V3D7 compute shader:
     IDCT 4x4/8x8 (cycles 6, 7), 8 deblock variants (luma+chroma x V+H
     x inter+intra), and 30 qpel positions (15 put_ + 15 avg_).  See
     daedalus-fourier PRs #28-#35.

  2. Dispatch overhead has been cut by a buffer pool in v3d_runner
     (daedalus-fourier task #160) plus a persistent command buffer
     (task #161).  The daedalus-fourier PR #36 bench measures the
     1080p worst-case sum on hertz (Pi 5, V3D 7.1; 30 iters, 5 warmup):

       kernel             CPU ns/op  QPU ns/op  winner
       IDCT 4x4 luma          10.79       2.47  QPU 4.36x
       IDCT 8x8 luma          29.69       9.23  QPU 3.22x
       Deblock luma_v         17.58      10.21  QPU 1.72x
       Deblock luma_h         38.41       9.98  QPU 3.85x
       qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
       qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
       qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
         CPU NEON only:  5.57 ms
         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)
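
For reference, the "winner" column is the slower per-op cost divided by
the faster one.  A minimal sketch of that arithmetic follows; the names
bench_row, bench_speedup and bench_print_winners are hypothetical and
not part of the daedalus-fourier bench.  Ratios recomputed from the
rounded figures above may differ slightly from the printed ones.

```c
#include <stdio.h>

/* One bench row: per-op cost in nanoseconds on each backend.
 * Hypothetical type, for illustration only. */
typedef struct {
    const char *kernel;
    double cpu_ns;
    double qpu_ns;
} bench_row;

/* Returns slower/faster (always >= 1) and reports which side won. */
static double bench_speedup(double cpu_ns, double qpu_ns, int *qpu_wins)
{
    *qpu_wins = qpu_ns < cpu_ns;
    return *qpu_wins ? cpu_ns / qpu_ns : qpu_ns / cpu_ns;
}

/* Prints one "kernel  winner ratio" line per row. */
static void bench_print_winners(const bench_row *rows, int n)
{
    for (int i = 0; i < n; i++) {
        int qpu_wins;
        double r = bench_speedup(rows[i].cpu_ns, rows[i].qpu_ns, &qpu_wins);
        printf("%-18s %s %.2fx\n", rows[i].kernel,
               qpu_wins ? "QPU" : "CPU", r);
    }
}
```

Feeding it the mc02 row (16.96, 20.54) reports CPU as the winner, which
is the one case where the QPU path is slower.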

PR #10's verdict (CPU 4x faster than QPU at IDCT) is reversed.  Switch
the substitution context to daedalus_ctx_create() in both H.264 TUs
(h264_idct_daedalus.c, h264_qpel_daedalus.c) so the recipe layer can
actually route through the now-faster QPU path.

daedalus_ctx_create() probes for a usable Vulkan device and falls back
to no_qpu mode if none is found, so this is safe on hosts without V3D
(x86 reauktion build runners, debian-aarch64 builders without renderD,
etc.).  Hosts with V3D (Pi 5 deployment targets) get the speedup.
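
The probe-then-fall-back behaviour, combined with the pthread_once
guard both TUs already use, can be sketched roughly as below.  This is
an illustration under assumptions, not the daedalus implementation:
demo_ctx, demo_device_present and the demo_* functions are hypothetical
stand-ins, and the boolean replaces a real Vulkan device probe.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real context type. */
typedef struct demo_ctx {
    bool qpu_enabled;
} demo_ctx;

/* Stands in for the result of a real Vulkan device probe (assumption). */
static bool demo_device_present = false;

static demo_ctx        g_storage;
static demo_ctx       *g_ctx;
static pthread_once_t  g_once = PTHREAD_ONCE_INIT;

/* CPU-only context: never touches the GPU stack. */
static demo_ctx *demo_ctx_create_no_qpu(void)
{
    g_storage.qpu_enabled = false;
    return &g_storage;
}

/* QPU-capable context: probe first, degrade to CPU-only on failure. */
static demo_ctx *demo_ctx_create(void)
{
    if (!demo_device_present)
        return demo_ctx_create_no_qpu();
    g_storage.qpu_enabled = true;
    return &g_storage;
}

static void demo_ctx_init_once(void)
{
    g_ctx = demo_ctx_create();
}

/* Process-global accessor: the probe runs at most once per process. */
static demo_ctx *demo_ctx_get(void)
{
    pthread_once(&g_once, demo_ctx_init_once);
    return g_ctx;
}
```

Callers always receive a usable context and only need to check
qpu_enabled (or its real equivalent) when choosing a path, so a host
without a device pays for one failed probe and nothing more.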

The remaining qpel mc02 anomaly (single-axis vertical filter, CPU 1.21x
faster) is bench-flagged for a v2 shader follow-up; the recipe entry
stays on QPU because the 2026-05-23 substrate decree still holds and
the gap is marginal.

Refs reauktion/daedalus-fourier!36.
---
 libavcodec/aarch64/h264_idct_daedalus.c | 2 +-
 libavcodec/aarch64/h264_qpel_daedalus.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -32,7 +32,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;

 static void daedalus_ctx_init_once(void)
 {
-    g_dctx = daedalus_ctx_create_no_qpu();
+    g_dctx = daedalus_ctx_create();
 }

 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
--- a/libavcodec/aarch64/h264_qpel_daedalus.c
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -38,7 +38,7 @@ static pthread_once_t    g_dctx_once = PTHREAD_ONCE_INIT;

 static void daedalus_ctx_init_once(void)
 {
-    g_dctx = daedalus_ctx_create_no_qpu();
+    g_dctx = daedalus_ctx_create();
 }

 void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);