ffmpeg-v4l2-request-fourier: flip libavcodec daedalus ctx no_qpu → qpu-capable (0013) #104

Merged
marfrit merged 1 commit from claude-noether/marfrit-packages:noether/h264-ctx-qpu-capable into main 2026-05-25 19:25:34 +00:00
Owner

Why

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so process-global daedalus_ctx via daedalus_ctx_create_no_qpu(). At the time the rationale was sound: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx would have meant pointless Vulkan init in every host process.

Two things changed since.

  1. Every H.264 hot-path primitive now has a V3D7 compute shader: IDCT 4x4/8x8, 8 deblock variants (luma+chroma × V+H × inter+intra), and 30 qpel positions. See daedalus-fourier PRs #28–#35.

  2. Dispatch overhead has been cut down with a buffer pool in v3d_runner plus a persistent command buffer. daedalus-fourier PR #36 bench on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

| kernel | CPU ns/op | QPU ns/op | winner |
|---|---|---|---|
| IDCT 4x4 luma | 10.79 | 2.47 | QPU 4.36× |
| IDCT 8x8 luma | 29.69 | 9.23 | QPU 3.22× |
| Deblock luma_v | 17.58 | 10.21 | QPU 1.72× |
| Deblock luma_h | 38.41 | 9.98 | QPU 3.85× |
| qpel mc20 (8x8) | 28.24 | 9.66 | QPU 2.92× |
| qpel mc22 (8x8) | 71.58 | 9.64 | QPU 7.43× |
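
The winner column can be recomputed from the two ns/op columns as CPU time divided by QPU time. The following is a minimal sketch in C using the rounded figures from the table. The names `bench_row`, `qpu_speedup`, and `smallest_qpu_win` are illustrative and not part of daedalus-fourier, and ratios derived from rounded inputs may differ from the table in the last digit.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the benchmark table (illustrative type, not a library struct). */
typedef struct {
    const char *kernel;
    double cpu_ns;   /* CPU NEON ns/op */
    double qpu_ns;   /* V3D QPU ns/op  */
} bench_row;

/* Rounded figures copied from the table above. */
static const bench_row bench_rows[] = {
    { "IDCT 4x4 luma",   10.79,  2.47 },
    { "IDCT 8x8 luma",   29.69,  9.23 },
    { "Deblock luma_v",  17.58, 10.21 },
    { "Deblock luma_h",  38.41,  9.98 },
    { "qpel mc20 (8x8)", 28.24,  9.66 },
    { "qpel mc22 (8x8)", 71.58,  9.64 },
};
static const size_t bench_row_count = sizeof bench_rows / sizeof bench_rows[0];

/* Speedup of QPU over CPU: values above 1.0 mean the QPU wins. */
static double qpu_speedup(const bench_row *row)
{
    assert(row != NULL && row->qpu_ns > 0.0);
    return row->cpu_ns / row->qpu_ns;
}

/* Index of the row where the QPU advantage is narrowest. */
static size_t smallest_qpu_win(void)
{
    size_t best = 0;
    for (size_t i = 1; i < bench_row_count; i++) {
        if (qpu_speedup(&bench_rows[i]) < qpu_speedup(&bench_rows[best]))
            best = i;
    }
    return best;
}
```

On these figures the narrowest margin is Deblock luma_v at about 1.72×, so every listed kernel still favors the QPU.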

1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):

  • CPU NEON only: 5.57 ms
  • QPU only: 1.30 ms
  • CPU/QPU ratio: 4.30×

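PR #10's verdict that the CPU was 4× faster than the QPU (which justified the original no_qpu ctx choice) is now reversed: moving from CPU 4× faster to QPU 4.30× faster is a swing of roughly 4 × 4.30 ≈ 17×.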

What

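New patch 0013-h264-ctx-qpu-capable.patch (renumbered to 0014-h264-ctx-qpu-capable.patch on rebase, because PR #102 took the 0013 slot) flips both H.264 translation units (h264_idct_daedalus.c, h264_qpel_daedalus.c) from daedalus_ctx_create_no_qpu() to daedalus_ctx_create().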

daedalus_ctx_create() probes for a usable Vulkan device and falls back to no_qpu mode if none is available, so the change is safe on hosts without V3D (x86 build runners, etc.). Hosts with V3D get the speedup.
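
The probe-then-fall-back behaviour described above can be sketched roughly as follows. This is not the daedalus-fourier implementation: `sketch_ctx`, `sketch_ctx_create`, `sketch_ctx_create_no_qpu`, and the probe callbacks are hypothetical stand-ins, and the real context layout and Vulkan probing are not shown in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the real context; not the library's layout. */
typedef struct {
    bool has_qpu;   /* true when a usable Vulkan/V3D device was found */
} sketch_ctx;

/* Hypothetical probe callback: returns true if a Vulkan device is usable. */
typedef bool (*sketch_probe_fn)(void);

/* Always builds a CPU-only context, mirroring the no_qpu constructor. */
static sketch_ctx sketch_ctx_create_no_qpu(void)
{
    sketch_ctx ctx = { .has_qpu = false };
    return ctx;
}

/* Probes first; if the probe fails, degrades to the CPU-only context
 * instead of returning an error, so callers never see a failure. */
static sketch_ctx sketch_ctx_create(sketch_probe_fn probe)
{
    assert(probe != NULL);
    if (!probe())
        return sketch_ctx_create_no_qpu();

    sketch_ctx ctx = { .has_qpu = true };
    return ctx;
}

/* Name of the backend a dispatch would use for this context. */
static const char *sketch_ctx_backend(const sketch_ctx *ctx)
{
    assert(ctx != NULL);
    return ctx->has_qpu ? "qpu" : "cpu-neon";
}

/* Two fake probes: an x86 build runner and a Pi 5 with V3D. */
static bool probe_no_device(void)   { return false; }
static bool probe_v3d_present(void) { return true; }
```

Passing the probe in as a function pointer only keeps the sketch testable without a GPU; per the description above, the real daedalus_ctx_create() does its Vulkan probing internally. Calling `sketch_ctx_create(probe_no_device)` yields a context whose backend is the CPU path, while `sketch_ctx_create(probe_v3d_present)` yields a QPU-capable one.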

Wired into both the Arch PKGBUILD (source[] + prepare()) and the Debian build-deb.sh; both pkgrel bumped 10 → 11.

Notes

The daedalus-fourier pin in PKGBUILD currently points at b9f9ff2 (PR #25 — chroma DC Hadamard public symbol). The bench-measured speedup was at daedalus-fourier head post-#35. Bumping the pin is a follow-up; the ctx flip alone already gets the QPU path engaged for whatever shaders are present at the pinned commit.

Refs reauktion/daedalus-fourier!36.

claude-noether added 1 commit 2026-05-25 19:18:22 +00:00
(Renumbered from 0013 — PR #102 landed 0013-h264-deblock-chroma-intra
while this PR was open, so the next free slot is 0014.)

Patches 0003 (IDCT 4x4) and 0007 (qpel mc20) created the libavcodec.so
process-global daedalus_ctx via daedalus_ctx_create_no_qpu().  Rationale
at the time: cycle 6/9 had only CPU NEON paths, so a QPU-capable ctx
would have meant pointless Vulkan init in every host process.

Two things changed since:

  1. Every H.264 hot-path primitive now has a V3D7 compute shader.
     IDCT 4x4/8x8 + 8 deblock variants (luma+chroma × V+H × inter+intra)
     + 30 qpel positions.  See daedalus-fourier PRs #28-#35.

  2. Dispatch overhead has been hammered down — buffer pool in
     v3d_runner + persistent command buffer.  daedalus-fourier PR #36
     bench on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

       1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
         CPU NEON only:  5.57 ms
         QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

PR #10's CPU-4x-faster-than-QPU verdict (which justified the original
no_qpu ctx choice) is reversed by ~17x.

This commit adds 0014-h264-ctx-qpu-capable.patch which flips both H.264
TUs (h264_idct_daedalus.c, h264_qpel_daedalus.c) from
daedalus_ctx_create_no_qpu() to daedalus_ctx_create().

daedalus_ctx_create() probes for a usable Vulkan device and falls back
to no_qpu mode if unavailable, so this is safe on hosts without V3D
(x86 build runners, Debian aarch64 builders without renderD, etc.).
Hosts WITH V3D (Pi 5 deployment targets) now route the H.264 hot-path
through V3D compute instead of CPU NEON.

Wired into both arch PKGBUILD (source[] + prepare()) and debian
build-deb.sh; both pkgrel bumped 10 → 11.

Refs reauktion/daedalus-fourier!36.
claude-noether force-pushed noether/h264-ctx-qpu-capable from 6047c04f7f to 9c70ffffe7 2026-05-25 19:18:22 +00:00 Compare
Author
Owner

Rebased onto current main. PR #102 (0013-h264-deblock-chroma-intra) merged while this PR was open, so the patch is renumbered 0013 → 0014. No content change beyond the filename and the corresponding '0014-...' entries in PKGBUILD source[]/prepare() and build-deb.sh. Mergeable now.

marfrit merged commit 190f810843 into main 2026-05-25 19:25:34 +00:00
marfrit deleted branch noether/h264-ctx-qpu-capable 2026-05-25 19:25:37 +00:00

Reference: marfrit/marfrit-packages#104