9c70ffffe7
(Renumbered from 0013: PR #102 landed 0013-h264-deblock-chroma-intra while this PR was open, so the next free slot is 0014.)

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so process-global daedalus_ctx via daedalus_ctx_create_no_qpu(). Rationale at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx would have meant pointless Vulkan init in every host process.

Two things changed since:

1. Every H.264 hot-path primitive now has a V3D7 compute shader: IDCT 4x4/8x8 + 8 deblock variants (luma+chroma × V+H × inter+intra) + 30 qpel positions. See daedalus-fourier PRs #28-#35.

2. Dispatch overhead has been hammered down: buffer pool in v3d_runner + persistent command buffer. daedalus-fourier PR #36 bench on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

   1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
     CPU NEON only: 5.57 ms
     QPU only:      1.30 ms (CPU/QPU sum ratio = 4.30x)

PR #10's CPU-4x-faster-than-QPU verdict (which justified the original no_qpu ctx choice) is reversed by ~17x.

This commit adds 0014-h264-ctx-qpu-capable.patch, which flips both H.264 TUs (h264_idct_daedalus.c, h264_qpel_daedalus.c) from daedalus_ctx_create_no_qpu() to daedalus_ctx_create(). daedalus_ctx_create() probes for a usable Vulkan device and falls back to no_qpu mode if unavailable, so this is safe on hosts without V3D (x86 build runners, Debian aarch64 builders without renderD, etc.). Hosts WITH V3D (Pi 5 deployment targets) now route the H.264 hot-path through V3D compute instead of CPU NEON.

Wired into both arch PKGBUILD (source[] + prepare()) and debian build-deb.sh; both pkgrel bumped 10 → 11.

Refs reauktion/daedalus-fourier!36.
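The headline ratios can be rechecked from the rounded figures quoted above. The sketch below is only an arithmetic aid; the function names are made up for this note and are not part of daedalus-fourier. Dividing the rounded sums gives roughly 4.28x rather than 4.30x, so the quoted 4.30x presumably comes from the unrounded bench totals. The "~17x" swing is the old CPU advantage from PR #10 (about 4x) multiplied by the new QPU advantage.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of CPU time to QPU time; values above 1.0 mean the QPU is faster.
 * Hypothetical helper for this note, not a daedalus-fourier function. */
static double speedup(double cpu, double qpu)
{
    assert(qpu > 0.0);
    return cpu / qpu;
}

/* Recomputes the headline numbers from the rounded sums in the description. */
static void print_headline_ratios(void)
{
    const double cpu_ms = 5.57;            /* 1080p worst-case sum, CPU NEON only */
    const double qpu_ms = 1.30;            /* 1080p worst-case sum, QPU only */
    const double old_cpu_advantage = 4.0;  /* PR #10: CPU about 4x faster than QPU */

    double now = speedup(cpu_ms, qpu_ms);
    printf("QPU speedup from rounded sums: %.2fx\n", now);
    printf("swing vs. PR #10 verdict: ~%.0fx\n", old_cpu_advantage * now);
}
```

Calling print_headline_ratios() from a main() is enough to reproduce the check.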
86 lines
3.7 KiB
Diff
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Mon, 25 May 2026 21:00:00 +0200
Subject: [PATCH] avcodec/aarch64/h264: use QPU-capable daedalus ctx (bench
 shows 4.30x faster on Pi 5)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
process-global daedalus_ctx via daedalus_ctx_create_no_qpu(). Rationale
at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
would have meant pointless Vulkan init in every host process
(firefox-fourier, mpv-fourier, daedalus_v4l2_daemon, ...).

Two things changed since:

  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
     IDCT 4x4/8x8 (cycles 6, 7), 8 deblock variants (luma+chroma x V+H
     x inter+intra), 30 qpel positions (15 put_ + 15 avg_). See
     daedalus-fourier PRs #28-#35.

  2. Dispatch overhead has been hammered down: buffer pool in
     v3d_runner (daedalus-fourier task #160) plus persistent command
     buffer (task #161). daedalus-fourier PR #36 bench measures the
     1080p worst-case sum on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

       kernel             CPU ns/op   QPU ns/op   winner
       IDCT 4x4 luma          10.79        2.47   QPU 4.36x
       IDCT 8x8 luma          29.69        9.23   QPU 3.22x
       Deblock luma_v         17.58       10.21   QPU 1.72x
       Deblock luma_h         38.41        9.98   QPU 3.85x
       qpel mc20 (8x8)        28.24        9.66   QPU 2.92x
       qpel mc02 (8x8)        16.96       20.54   CPU 1.21x
       qpel mc22 (8x8)        71.58        9.64   QPU 7.43x

     1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
       CPU NEON only: 5.57 ms
       QPU only:      1.30 ms (CPU/QPU sum ratio = 4.30x)

PR #10's verdict (CPU 4x faster than QPU at IDCT) is reversed. Switch
the substitution context to daedalus_ctx_create() in both H.264 TUs
(h264_idct_daedalus.c, h264_qpel_daedalus.c) so the recipe layer can
actually route through the now-faster QPU path.

daedalus_ctx_create() probes for a usable Vulkan device and falls back
to no_qpu mode if unavailable, so this is safe on hosts without V3D
(x86 reauktion build runners, debian-aarch64 builders without renderD,
etc.). Hosts WITH V3D (Pi 5 deployment targets) get the speedup.

The remaining qpel mc02 anomaly (single-axis vertical filter, 1.21x
CPU) is bench-flagged for a v2 shader follow-up; the recipe entry
stays QPU since the policy decree (2026-05-23 substrate decree) holds
and the gap is marginal.

Refs reauktion/daedalus-fourier!36.
---
 libavcodec/aarch64/h264_idct_daedalus.c | 2 +-
 libavcodec/aarch64/h264_qpel_daedalus.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/h264_idct_daedalus.c b/libavcodec/aarch64/h264_idct_daedalus.c
--- a/libavcodec/aarch64/h264_idct_daedalus.c
+++ b/libavcodec/aarch64/h264_idct_daedalus.c
@@ -32,7 +32,7 @@ static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
 
 static void daedalus_ctx_init_once(void)
 {
-    g_dctx = daedalus_ctx_create_no_qpu();
+    g_dctx = daedalus_ctx_create();
 }
 
 void ff_h264_idct_add_daedalus(uint8_t *dst, int16_t *block, int stride);
diff --git a/libavcodec/aarch64/h264_qpel_daedalus.c b/libavcodec/aarch64/h264_qpel_daedalus.c
--- a/libavcodec/aarch64/h264_qpel_daedalus.c
+++ b/libavcodec/aarch64/h264_qpel_daedalus.c
@@ -38,7 +38,7 @@ static pthread_once_t g_dctx_once = PTHREAD_ONCE_INIT;
 
 static void daedalus_ctx_init_once(void)
 {
-    g_dctx = daedalus_ctx_create_no_qpu();
+    g_dctx = daedalus_ctx_create();
 }
 
 void ff_put_h264_qpel8_mc20_daedalus(uint8_t *dst, const uint8_t *src, ptrdiff_t stride);
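
For readers unfamiliar with the pattern the two hunks touch, here is a minimal sketch of a pthread_once-guarded process-global context whose constructor probes for a device and falls back to CPU-only mode. It assumes the entry points reach the context through pthread_once(), which the hunk headers suggest but the patch does not show. Everything prefixed sketch_ is a hypothetical stand-in written for this note; none of it is the daedalus API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real context type. */
typedef struct sketch_ctx {
    bool qpu_enabled;
} sketch_ctx;

/* Hypothetical probe result: true simulates a host with a usable device. */
static bool sketch_device_present = false;

static sketch_ctx g_storage;
static sketch_ctx *g_ctx = NULL;
static pthread_once_t g_ctx_once = PTHREAD_ONCE_INIT;

/* Probe-and-fall-back constructor: always returns a usable context,
 * with the QPU path enabled only when the probe succeeds. */
static sketch_ctx *sketch_ctx_create(void)
{
    g_storage.qpu_enabled = sketch_device_present;
    return &g_storage;
}

static void sketch_ctx_init_once(void)
{
    g_ctx = sketch_ctx_create();
}

/* All callers go through pthread_once, so the constructor (and its probe)
 * runs exactly once per process, on first use. */
static sketch_ctx *sketch_ctx_get(void)
{
    int rc = pthread_once(&g_ctx_once, sketch_ctx_init_once);
    assert(rc == 0);
    (void)rc;
    return g_ctx;
}
```

One consequence of this shape, at least in the sketch: the mode is decided on first use and stays fixed for the life of the process, so a device that becomes available later is not picked up without a restart.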