QPU is default substrate: recipe table + ctx env-var override #7
Phase 3 of the QPU-default substrate campaign per the user decree 2026-05-23 — "what can be done in QPU will be done in QPU."
Flips production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe. Pairs with PR #6 (buffer pool + persistent cmdbuf) which cut dispatch overhead so the QPU paths are now viable.
Recipe table changes
- cycle 3, VP9 MC 8h: v3d_mc_8h.spv exists
- cycle 5, AV1 CDEF 8x8: v3d_cdef.spv exists
- cycle 8, H.264 deblock luma-v: v3d_h264deblock.spv exists
Three kernels flip from CPU verdict to QPU. Cycles 6/7/9 still CPU because their V3D shaders are deferred to task #165.
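The recipe state at this point in the PR can be sketched as a plain lookup. This is a minimal illustration, not the project's code: the `sketch_kernel` ids and the function name `sketch_recipe_substrate_for` are hypothetical, and only the substrate codes (1=CPU, 2=QPU) come from the smoke output.

```c
/* Substrate codes as printed by the smoke test (1=CPU, 2=QPU). */
typedef enum { SUBSTRATE_CPU = 1, SUBSTRATE_QPU = 2 } substrate_t;

/* Hypothetical kernel ids, numbered by cycle; the real enum may differ. */
typedef enum {
    K_VP9_IDCT8       = 1,
    K_VP9_LPF_WD4     = 2,
    K_VP9_MC_8H       = 3,
    K_VP9_LPF_WD8     = 4,
    K_AV1_CDEF8       = 5,
    K_H264_IDCT4      = 6,
    K_H264_IDCT8      = 7,
    K_H264_DEBLOCK_LV = 8,
    K_H264_QPEL_MC20  = 9
} sketch_kernel;

/* QPU for every kernel with a shipped V3D shader, CPU otherwise. */
static substrate_t sketch_recipe_substrate_for(sketch_kernel k)
{
    switch (k) {
    case K_VP9_IDCT8:        /* was QPU; unchanged */
    case K_VP9_LPF_WD4:      /* was QPU; unchanged */
    case K_VP9_LPF_WD8:      /* was QPU; unchanged */
    case K_VP9_MC_8H:        /* flipped: v3d_mc_8h.spv */
    case K_AV1_CDEF8:        /* flipped: v3d_cdef.spv */
    case K_H264_DEBLOCK_LV:  /* flipped: v3d_h264deblock.spv */
        return SUBSTRATE_QPU;
    default:                 /* cycles 6/7/9: no shader yet (task #165) */
        return SUBSTRATE_CPU;
    }
}
```

Keeping the verdicts in one switch means a later cycle flips by moving a single case label, which is the kind of change the follow-up commits in this PR make.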
DAEDALUS_FORCE_QPU=1 env-var override
daedalus_ctx_create_no_qpu() now honours the env var: when set, it silently escalates to a full daedalus_ctx_create(). This lets the libavcodec substitution shims in marfrit-packages, which pthread_once a create_no_qpu ctx, fire QPU paths without rebuilding those patches.
Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec (separate small daedalus-v4l2 follow-up).
Smoke (hertz, Pi 5, kernel 6.18.29)
Safety: graceful fallback
The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case where the recipe says QPU but the consumer didn't opt in. No regression for non-Vulkan consumers; the dispatch quietly falls back to NEON.
Closes tasks #163, #164. Sequenced after PR #6 (dispatch overhead reduction) so the new QPU paths benefit from the buffer pool + persistent cmdbuf.
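The fall-through rule quoted under "Safety: graceful fallback" is small enough to sketch in isolation. The `sketch_ctx` struct and the `effective_substrate` helper below are hypothetical stand-ins; only the condition itself and the names `ctx_has_qpu`, `SUBSTRATE_CPU` and `SUBSTRATE_QPU` appear in the PR text, and the real `ctx_has_qpu` may inspect the context differently.

```c
#include <stddef.h>

typedef enum { SUBSTRATE_CPU = 1, SUBSTRATE_QPU = 2 } substrate_t;

/* Hypothetical context: a single flag standing in for "Vulkan/QPU was set up". */
typedef struct { int has_qpu; } sketch_ctx;

static int ctx_has_qpu(const sketch_ctx *ctx)
{
    return ctx != NULL && ctx->has_qpu;
}

/* Resolve the substrate a dispatch will actually use. */
static substrate_t effective_substrate(substrate_t recipe, const sketch_ctx *ctx)
{
    substrate_t eff = recipe;
    /* The recipe may say QPU, but a ctx created without QPU cannot honour it. */
    if (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx))
        eff = SUBSTRATE_CPU;
    return eff;
}
```

Because the downgrade happens per dispatch rather than in the recipe table, the table can say QPU unconditionally while consumers that never opted in keep their CPU behaviour.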
Per the user decree 2026-05-23 ("what can be done in QPU will be done in QPU"), this lands two coupled changes that flip production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe:

1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for every kernel that has a shipped V3D compute shader:

     cycle 1  VP9 IDCT 8x8          QPU  (was QPU; unchanged)
     cycle 2  VP9 LPF wd=4          QPU  (was QPU; unchanged)
     cycle 3  VP9 MC 8h             QPU  (FLIPPED from CPU; v3d_mc_8h.spv)
     cycle 4  VP9 LPF wd=8          QPU  (was QPU; unchanged)
     cycle 5  AV1 CDEF 8x8          QPU  (FLIPPED from CPU; v3d_cdef.spv)
     cycle 6  H.264 IDCT 4x4        CPU  (no shader yet; task #165)
     cycle 7  H.264 IDCT 8x8        CPU  (no shader yet; task #165)
     cycle 8  H.264 deblock luma-v  QPU  (FLIPPED from CPU; v3d_h264deblock.spv)
     cycle 9  H.264 qpel mc20       CPU  (no shader yet; task #165)

   The R-band cost/benefit framework still applies but is now superseded for substrate selection by the decree. Where R stays RED, the cost is in dispatch overhead, which is a fixable engineering issue (tasks 160 buffer-pool, 161 persistent cmdbuf, 162 dmabuf import).

2) daedalus_ctx_create_no_qpu() now honours an env-var override: set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu silently escalates to a full daedalus_ctx_create(). Lets the libavcodec substitution shims in marfrit-packages (which pthread_once a create_no_qpu ctx; see libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths without rebuilding those patches.

   Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec (separate daedalus-v4l2 follow-up).
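The env-var escalation in change 2 could look roughly like the following. This is a sketch under assumptions: the `sketch_*` names and the context struct are hypothetical, and treating only the exact string "1" as "set" is a guess, since the PR text does not say how other values are handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical context: a single flag standing in for "QPU/Vulkan initialised". */
typedef struct { int has_qpu; } sketch_ctx;

/* Pure helper so the policy can be checked without touching the environment. */
static int force_qpu_from_value(const char *value)
{
    /* Assumption: only the literal "1" enables the override. */
    return value != NULL && strcmp(value, "1") == 0;
}

static int force_qpu_requested(void)
{
    return force_qpu_from_value(getenv("DAEDALUS_FORCE_QPU"));
}

/* Stand-in for the full constructor. */
static sketch_ctx sketch_ctx_create(void)
{
    sketch_ctx ctx = { 1 };
    return ctx;
}

/* Stand-in for the no-QPU constructor: escalates when the override is set. */
static sketch_ctx sketch_ctx_create_no_qpu(void)
{
    if (force_qpu_requested())
        return sketch_ctx_create();
    sketch_ctx ctx = { 0 };
    return ctx;
}
```

Reading the variable inside the constructor is what lets already-built shims pick up QPU paths: they keep calling the no-QPU entry point, and the process environment decides the outcome.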
Smoke (hertz, Pi 5, kernel 6.18.29):

    === test_api_h264 ===
    H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate: 1
    H264_DEBLOCK_LV recipe substrate: 2    ← flipped
    H264_QPEL_MC20 recipe substrate: 1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact    ← QPU path
    H.264 qpel mc20: 1024/1024 bytes bit-exact
    === test_api_idct ===
    all substrates (CPU/QPU/AUTO) bit-exact
    === test_api_lpf ===
    all substrates bit-exact wd=4 and wd=8

The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case where the recipe says QPU but the consumer didn't opt in: it falls back to CPU silently, no regression.

Closes daedalus-fourier tasks #163, #164. Refs the 2026-05-23 "QPU default substrate" decree.

Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264 IDCT 4x4 + add) was the highest-priority H.264 kernel to flip from NEON-only to QPU-capable. It has the same shape as VP9 IDCT 8x8 (cycle 1), a two-pass butterfly with shared-memory transpose, but at 4x4 scale: 4 lanes per block, 16 blocks per WG.

What's added:
- src/v3d_h264_idct4.comp: GLSL compute shader implementing the H.264 §8.5.12.1 1D butterfly twice (row pass then column pass), with (val + 32) >> 6 rounding and clip-to-u8 add to dst. Block memory layout is column-major (matches FFmpeg `ff_h264_idct_add_neon` convention).
- CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.
- dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant (n_blocks, dst_stride), 16 blocks per WG dispatch. Matches the existing dispatch_*_qpu patterns; uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands).
- daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with the same CPU/QPU substrate switch the deblock dispatch uses.
- daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now that the shader exists.

Verification on hertz (Pi 5 + V3D 7.1):

    $ ./test_api_h264
    === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate: 1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate: 1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)    ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and the output is bit-exact against the C reference (which is identical to the NEON .S code by construction: same FFmpeg upstream).

Remaining cycle-6/7/9 work in task #165:
- cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per block, fewer blocks per WG)
- cycle 9: H.264 luma qpel mc20 (different shape: 6-tap MC, not a transform)

This commit lands the cycle-6 piece of task #165.

Cycle 6 H.264 IDCT 4×4 V3D shader added (commit 65bd5c3): the first H.264 cycle to actually run on QPU with bit-exact correctness.
The earlier two-bullet PR title still describes Phase 3 of the campaign but the scope is now "all kernels that have V3D shaders, including the new cycle-6 one".
Cycle 6 shader
src/v3d_h264_idct4.comp: H.264 §8.5.12.1 1D butterfly twice (row pass, column pass), (val + 32) >> 6 rounded, clipped, added to dst. Column-major block layout matches FFmpeg ff_h264_idct_add_neon. 4 lanes per block, 16 blocks per WG.
Verification on hertz (Pi 5 / V3D 7.1)
Bit-exact first try.
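For readers who want the cycle 6 arithmetic spelled out, here is a scalar C sketch of the transform the shader is described as implementing: the 4-point butterfly applied to rows and then columns, followed by (val + 32) >> 6 rounding and a clipped add to dst. It is written from the description above, not taken from the shader or from FFmpeg. The function names are hypothetical, the column-major indexing (`coeffs[c*4 + r]`) is an assumption about what "column-major" means here, and it has not been checked for bit-exactness against `ff_h264_idct_add_neon`.

```c
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 4-point H.264 inverse-transform butterfly.
 * Relies on arithmetic right shift of negative ints, as FFmpeg's C code does. */
static void butterfly4(int a0, int a1, int a2, int a3, int out[4])
{
    const int z0 = a0 + a2;
    const int z1 = a0 - a2;
    const int z2 = (a1 >> 1) - a3;
    const int z3 = a1 + (a3 >> 1);
    out[0] = z0 + z3;
    out[1] = z1 + z2;
    out[2] = z1 - z2;
    out[3] = z0 - z3;
}

/* Hypothetical scalar reference: inverse-transform one 4x4 block and add to dst. */
static void sketch_h264_idct4_add(uint8_t *dst, int dst_stride,
                                  const int16_t coeffs[16])
{
    int tmp[16];
    int out[4];

    /* Pass 1: rows. Element (r, c) is assumed stored at coeffs[c*4 + r]. */
    for (int r = 0; r < 4; r++) {
        butterfly4(coeffs[0 * 4 + r], coeffs[1 * 4 + r],
                   coeffs[2 * 4 + r], coeffs[3 * 4 + r], out);
        for (int c = 0; c < 4; c++)
            tmp[r * 4 + c] = out[c];
    }

    /* Pass 2: columns, then round, add to the destination pixel, and clip. */
    for (int c = 0; c < 4; c++) {
        butterfly4(tmp[0 * 4 + c], tmp[1 * 4 + c],
                   tmp[2 * 4 + c], tmp[3 * 4 + c], out);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = &dst[r * dst_stride + c];
            *p = clip_u8(*p + ((out[r] + 32) >> 6));
        }
    }
}
```

A quick sanity check is a DC-only block: a single coefficient of 64 at index 0 spreads to 64 in all sixteen positions, and (64 + 32) >> 6 adds exactly 1 to every destination pixel.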
Current recipe state after this PR
Seven of nine kernels now QPU-by-recipe (cycles 1 to 6 and 8). The remaining cycles 7 and 9 are follow-ons under the same task.
Uses v3d_runner_create_buffer / destroy_buffer rather than the pool API (PR #6 hadn't merged when this branch was cut); trivial post-merge swap.
Closes tasks #163, #164, and partial #165 (cycle 6 piece).
Cycle 7 H.264 IDCT 8×8 V3D shader added (commit 74687d9). Same template as cycle 6: 8 lanes per block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice.
Current state after this commit
8 of 9 cycles now QPU-by-recipe. Only cycle 9 (6-tap horizontal MC, different shape from transforms) still pending — task #165 residual.
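The lane and block figures quoted for cycles 6 and 7 imply the same workgroup size for both shaders, which a few lines of arithmetic make explicit. The struct and helper names below are hypothetical, and rounding the workgroup count up for a partial tail is an assumption (the n_blocks push constant mentioned for cycle 6 would presumably let the shader skip out-of-range blocks).

```c
#include <stdint.h>

/* Hypothetical description of how a transform shader tiles its workgroup. */
typedef struct {
    uint32_t lanes_per_block;
    uint32_t blocks_per_wg;
} wg_geometry;

static const wg_geometry H264_IDCT4_GEOM = { 4, 16 };  /* cycle 6 */
static const wg_geometry H264_IDCT8_GEOM = { 8, 8 };   /* cycle 7 */

/* Invocations per workgroup: lanes times blocks. */
static uint32_t wg_invocations(wg_geometry g)
{
    return g.lanes_per_block * g.blocks_per_wg;
}

/* Workgroups needed to cover n_blocks, rounding up for a partial tail. */
static uint32_t wg_count(uint32_t n_blocks, wg_geometry g)
{
    return (n_blocks + g.blocks_per_wg - 1) / g.blocks_per_wg;
}
```

Both geometries come out at 64 invocations per workgroup (4 × 16 and 8 × 8), which is consistent with cycle 7 being described as the same template as cycle 6.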