Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.
What's added:
- src/v3d_h264_idct4.comp: GLSL compute shader implementing
the H.264 §8.5.12.1 1D butterfly twice (row pass then column
pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
dst. Block memory layout is column-major (matches FFmpeg
`ff_h264_idct_add_neon` convention).
- CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.
- dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
(n_blocks, dst_stride), 16 blocks per WG dispatch. Matches
the existing dispatch_*_qpu patterns; uses
v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
- daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
the same CPU/QPU substrate switch the deblock dispatch uses.
- daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
that the shader exists.
Verification on hertz (Pi 5 + V3D 7.1):
$ ./test_api_h264
=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact
H.264 qpel mc20: 1024/1024 bytes bit-exact
The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).
Remaining cycle-6/7/9 work in task #165:
- cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
block, fewer blocks per WG)
- cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
not a transform)
This commit lands the cycle-6 piece of task #165.
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:
1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
every kernel that has a shipped V3D compute shader:
cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged)
cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged)
cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv)
cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged)
cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv)
cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165)
cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165)
cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv)
cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165)
The R-band cost/benefit framework still applies but is now
superseded for substrate selection by the decree. Where R
stays RED, the cost is in dispatch overhead, which is a
fixable engineering issue (tasks 160 buffer-pool, 161
persistent cmdbuf, 162 dmabuf import).
2) daedalus_ctx_create_no_qpu() now honours an env-var override:
set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
silently escalates to a full daedalus_ctx_create(). Lets
the libavcodec substitution shims in marfrit-packages (which
pthread_once a create_no_qpu ctx — see
libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
without rebuilding those patches.
Firefox / mpv consumers stay on the Vulkan-free path by
default (env var unset). The daedalus-v4l2 daemon will set
DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
(separate daedalus-v4l2 follow-up).
Smoke (hertz, Pi 5, kernel 6.18.29):
=== test_api_h264 ===
H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU)
H264_IDCT8 recipe substrate: 1
H264_DEBLOCK_LV recipe substrate: 2 ← flipped
H264_QPEL_MC20 recipe substrate: 1
H.264 IDCT 4x4: 2048/2048 bytes bit-exact
H.264 IDCT 8x8: 2048/2048 bytes bit-exact
H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path
H.264 qpel mc20: 1024/1024 bytes bit-exact
=== test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
=== test_api_lpf === all substrates bit-exact wd=4 and wd=8
The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.
Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.