QPU is default substrate: recipe table + ctx env-var override #7

Merged
marfrit merged 3 commits from noether/qpu-default-recipe-cycles-5-8 into main 2026-05-23 18:59:35 +00:00
Owner

Phase 3 of the QPU-default substrate campaign per the user decree 2026-05-23 — "what can be done in QPU will be done in QPU."

Flips production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe. Pairs with PR #6 (buffer pool + persistent cmdbuf) which cut dispatch overhead so the QPU paths are now viable.

## Recipe table changes

| Cycle | Kernel | Before | After | Reason |
|---|---|---|---|---|
| 1 | VP9 IDCT 8x8 | QPU | QPU | unchanged |
| 2 | VP9 LPF wd=4 | QPU | QPU | unchanged |
| 3 | VP9 MC 8h | CPU | **QPU** | `v3d_mc_8h.spv` exists |
| 4 | VP9 LPF wd=8 | QPU | QPU | unchanged |
| 5 | AV1 CDEF 8x8 | CPU | **QPU** | `v3d_cdef.spv` exists |
| 6 | H.264 IDCT 4x4 | CPU | CPU | no shader yet (task #165) |
| 7 | H.264 IDCT 8x8 | CPU | CPU | no shader yet (task #165) |
| 8 | H.264 deblock luma-v | CPU | **QPU** | `v3d_h264deblock.spv` exists |
| 9 | H.264 qpel mc20 | CPU | CPU | no shader yet (task #165) |

Three kernels flip from CPU verdict to QPU. Cycles 6/7/9 still CPU because their V3D shaders are deferred to task #165.
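
A minimal sketch of how the table above might be encoded, assuming a plain `switch` inside `daedalus_recipe_substrate_for()`. The `SK_*` kernel names and the `sketch_` function are hypothetical (only the `H264_*` spellings appear in the smoke output), the values 1 and 2 follow the `1=CPU, 2=QPU` legend, and the verdicts reflect the table as of this first commit, before the cycle 6 and 7 shaders were added.

```c
/* Sketch only: type and constant names are assumptions, not the project's header. */
typedef enum {
    SUBSTRATE_CPU = 1,  /* matches "1=CPU" in the smoke output */
    SUBSTRATE_QPU = 2   /* matches "2=QPU" in the smoke output */
} sketch_substrate;

typedef enum {
    SK_VP9_IDCT8 = 1,       /* cycle 1 */
    SK_VP9_LPF4,            /* cycle 2 */
    SK_VP9_MC8H,            /* cycle 3 */
    SK_VP9_LPF8,            /* cycle 4 */
    SK_AV1_CDEF8,           /* cycle 5 */
    SK_H264_IDCT4,          /* cycle 6 */
    SK_H264_IDCT8,          /* cycle 7 */
    SK_H264_DEBLOCK_LV,     /* cycle 8 */
    SK_H264_QPEL_MC20       /* cycle 9 */
} sketch_kernel;

/* Returns QPU for every kernel that has a shipped V3D shader, CPU otherwise. */
static sketch_substrate sketch_recipe_substrate_for(sketch_kernel k)
{
    switch (k) {
    case SK_VP9_IDCT8:
    case SK_VP9_LPF4:
    case SK_VP9_MC8H:        /* flipped: v3d_mc_8h.spv exists */
    case SK_VP9_LPF8:
    case SK_AV1_CDEF8:       /* flipped: v3d_cdef.spv exists */
    case SK_H264_DEBLOCK_LV: /* flipped: v3d_h264deblock.spv exists */
        return SUBSTRATE_QPU;
    default:                 /* cycles 6, 7, 9: no shader yet at this commit */
        return SUBSTRATE_CPU;
    }
}
```

Keeping the verdict in one function means a later shader only needs one extra `case` label to flip its kernel.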

## `DAEDALUS_FORCE_QPU=1` env-var override

`daedalus_ctx_create_no_qpu()` now honours the env var: when it is set, the call silently escalates to a full `daedalus_ctx_create()`. This lets the libavcodec substitution shims in marfrit-packages (which `pthread_once` a `create_no_qpu` ctx) fire QPU paths without rebuilding those patches.

Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set `DAEDALUS_FORCE_QPU=1` explicitly before dlopen'ing libavcodec (separate small daedalus-v4l2 follow-up).
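
The override could look roughly like the sketch below. Everything prefixed `sketch_` is hypothetical, the context struct is a stand-in for the real one, and the choice to accept only the literal value `1` is an assumption (the PR text only says "when set").

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in context: the real daedalus ctx is opaque to this sketch. */
typedef struct { int has_qpu; } sketch_ctx;

/* Assumption: only the exact string "1" enables the override. */
static int sketch_force_qpu_value(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Full context: would bring up Vulkan / V3D in the real library. */
static sketch_ctx sketch_ctx_create(void)
{
    sketch_ctx c;
    c.has_qpu = 1;
    return c;
}

/* CPU-only context unless DAEDALUS_FORCE_QPU escalates it. */
static sketch_ctx sketch_ctx_create_no_qpu(void)
{
    sketch_ctx c;
    if (sketch_force_qpu_value(getenv("DAEDALUS_FORCE_QPU")))
        return sketch_ctx_create();
    c.has_qpu = 0;
    return c;
}
```

Because the check happens inside the `no_qpu` constructor, callers that were compiled against it need no source change; they only need the variable in their environment before the first call.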

## Smoke (hertz, Pi 5, kernel 6.18.29)

=== test_api_h264 ===
  H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      1
  H264_DEBLOCK_LV recipe substrate: 2     ← flipped
  H264_QPEL_MC20 recipe substrate:  1
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact
  H.264 deblock luma v: 2048/2048 bytes bit-exact   ← QPU path
  H.264 qpel mc20: 1024/1024 bytes bit-exact

=== test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
=== test_api_lpf  === all substrates bit-exact wd=4 and wd=8

## Safety: graceful fallback

The dispatch wrapper's fall-through logic (`eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU`) handles the case where the recipe says QPU but the consumer didn't opt in. No regression for non-Vulkan consumers; the dispatch quietly falls back to NEON.
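
A sketch of that resolution step in isolation, under stated assumptions: the `SUBSTRATE_AUTO` value of 0, the `sketch_effective_substrate` name, and passing the recipe verdict and QPU availability as plain arguments are all illustrative, not taken from `daedalus_core.c`.

```c
typedef enum {
    SUBSTRATE_AUTO = 0,  /* assumed value; the smoke output only shows 1 and 2 */
    SUBSTRATE_CPU  = 1,
    SUBSTRATE_QPU  = 2
} sketch_substrate;

/*
 * requested   : what the caller asked for (AUTO defers to the recipe)
 * recipe      : verdict from the recipe table for this kernel
 * ctx_has_qpu : non-zero when the context was created with QPU support
 */
static sketch_substrate sketch_effective_substrate(sketch_substrate requested,
                                                   sketch_substrate recipe,
                                                   int ctx_has_qpu)
{
    sketch_substrate eff = (requested == SUBSTRATE_AUTO) ? recipe : requested;

    /* The fall-through from the PR: QPU wanted but not available -> CPU. */
    if (eff == SUBSTRATE_QPU && !ctx_has_qpu)
        eff = SUBSTRATE_CPU;

    return eff;
}
```

With this shape, flipping a recipe verdict to QPU cannot break a consumer whose context has no QPU: the worst case is the CPU path it was already running.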

Closes tasks #163, #164. Sequenced after PR #6 (dispatch overhead reduction) so the new QPU paths benefit from the buffer pool + persistent cmdbuf.

marfrit added 1 commit 2026-05-23 18:00:20 +00:00
Per the user decree 2026-05-23 — "what can be done in QPU will
be done in QPU" — this lands two coupled changes that flip
production-decode kernels with existing V3D shaders from
CPU-by-recipe to QPU-by-recipe:

1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for
   every kernel that has a shipped V3D compute shader:

    cycle 1 VP9 IDCT 8x8         QPU  (was QPU; unchanged)
    cycle 2 VP9 LPF wd=4         QPU  (was QPU; unchanged)
    cycle 3 VP9 MC 8h            QPU  (FLIPPED from CPU — v3d_mc_8h.spv)
    cycle 4 VP9 LPF wd=8         QPU  (was QPU; unchanged)
    cycle 5 AV1 CDEF 8x8         QPU  (FLIPPED from CPU — v3d_cdef.spv)
    cycle 6 H.264 IDCT 4x4       CPU  (no shader yet; task #165)
    cycle 7 H.264 IDCT 8x8       CPU  (no shader yet; task #165)
    cycle 8 H.264 deblock luma-v QPU  (FLIPPED from CPU — v3d_h264deblock.spv)
    cycle 9 H.264 qpel mc20      CPU  (no shader yet; task #165)

   The R-band cost/benefit framework still applies but is now
   superseded for substrate selection by the decree.  Where R
   stays RED, the cost is in dispatch overhead, which is a
   fixable engineering issue (tasks 160 buffer-pool, 161
   persistent cmdbuf, 162 dmabuf import).

2) daedalus_ctx_create_no_qpu() now honours an env-var override:
   set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu
   silently escalates to a full daedalus_ctx_create().  Lets
   the libavcodec substitution shims in marfrit-packages (which
   pthread_once a create_no_qpu ctx — see
   libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths
   without rebuilding those patches.

   Firefox / mpv consumers stay on the Vulkan-free path by
   default (env var unset).  The daedalus-v4l2 daemon will set
   DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec
   (separate daedalus-v4l2 follow-up).

Smoke (hertz, Pi 5, kernel 6.18.29):

  === test_api_h264 ===
    H264_IDCT4 recipe substrate:      1 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2     ← flipped
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact   ← QPU path
    H.264 qpel mc20: 1024/1024 bytes bit-exact

  === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact
  === test_api_lpf  === all substrates bit-exact wd=4 and wd=8

The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU
&& !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case
where the recipe says QPU but the consumer didn't opt in — it
falls back to CPU silently, no regression.

Closes daedalus-fourier tasks #163, #164.
Refs the 2026-05-23 "QPU default substrate" decree.
claude-noether added 1 commit 2026-05-23 18:06:23 +00:00
Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264
IDCT 4x4 + add) was the highest-priority H.264 kernel to flip
from NEON-only to QPU-capable.  The same shape as VP9 IDCT 8x8
(cycle 1) — two-pass butterfly with shared-memory transpose —
but at 4x4 scale: 4 lanes per block, 16 blocks per WG.

What's added:

  - src/v3d_h264_idct4.comp: GLSL compute shader implementing
    the H.264 §8.5.12.1 1D butterfly twice (row pass then column
    pass), with (val + 32) >> 6 rounding and clip-to-u8 add to
    dst.  Block memory layout is column-major (matches FFmpeg
    `ff_h264_idct_add_neon` convention).

  - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv.

  - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline
    init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant
    (n_blocks, dst_stride), 16 blocks per WG dispatch.  Matches
    the existing dispatch_*_qpu patterns; uses
    v3d_runner_create_buffer / destroy_buffer (will swap to
    pool API once PR #6 lands).

  - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with
    the same CPU/QPU substrate switch the deblock dispatch uses.

  - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now
    that the shader exists.
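
The dispatch geometry described above (4 lanes per block, 16 blocks per WG, push constants `n_blocks` and `dst_stride`) can be sketched on the host side as follows. The struct layout, the helper names, and the specific lane-to-block mapping (`local_id / 4`) are assumptions for illustration; the shader may index differently.

```c
#include <stdint.h>

enum {
    SKETCH_LANES_PER_BLOCK = 4,   /* one lane per row/column of a 4x4 block */
    SKETCH_BLOCKS_PER_WG   = 16,  /* 16 * 4 = 64 invocations per work-group */
    SKETCH_WG_SIZE         = SKETCH_LANES_PER_BLOCK * SKETCH_BLOCKS_PER_WG
};

/* Assumed push-constant layout: the two fields named in the commit message. */
typedef struct {
    uint32_t n_blocks;
    uint32_t dst_stride;
} sketch_idct4_push;

/* Number of work-groups needed to cover n_blocks (rounds up). */
static uint32_t sketch_wg_count(uint32_t n_blocks)
{
    return (n_blocks + SKETCH_BLOCKS_PER_WG - 1) / SKETCH_BLOCKS_PER_WG;
}

/*
 * Maps (work-group, local invocation) to (block, lane).
 * Returns 0 for padding invocations past the last block, which a shader
 * would need to mask off before touching dst.
 */
static int sketch_lane_to_block(uint32_t wg, uint32_t local_id,
                                uint32_t n_blocks,
                                uint32_t *block, uint32_t *lane)
{
    *block = wg * SKETCH_BLOCKS_PER_WG + local_id / SKETCH_LANES_PER_BLOCK;
    *lane  = local_id % SKETCH_LANES_PER_BLOCK;
    return *block < n_blocks;
}
```

The same helpers would cover the 8x8 case by changing the two constants to 8 lanes and 8 blocks, which keeps the work-group at 64 invocations.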

Verification on hertz (Pi 5 + V3D 7.1):

  $ ./test_api_h264
  === Phase 8a API smoke: H.264 kernels via recipe dispatch ===
    H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
    H264_IDCT8 recipe substrate:      1
    H264_DEBLOCK_LV recipe substrate: 2
    H264_QPEL_MC20 recipe substrate:  1
    H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%)   ← QPU
    H.264 IDCT 8x8: 2048/2048 bytes bit-exact
    H.264 deblock luma v: 2048/2048 bytes bit-exact
    H.264 qpel mc20: 1024/1024 bytes bit-exact

The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and
the output is bit-exact against the C reference (which is
identical to the NEON .S code by construction — same FFmpeg
upstream).

Remaining cycle-6/7/9 work in task #165:
  - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per
    block, fewer blocks per WG)
  - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC
    not a transform)

This commit lands the cycle-6 piece of task #165.
Author
Owner

Cycle 6 H.264 IDCT 4×4 V3D shader added (commit 65bd5c3) — the first H.264 cycle to actually run on QPU with bit-exact correctness.

The earlier two-bullet PR title still describes Phase 3 of the campaign but the scope is now "all kernels that have V3D shaders, including the new cycle-6 one".

## Cycle 6 shader

`src/v3d_h264_idct4.comp`: H.264 §8.5.12.1 1D butterfly twice (row pass, column pass), `(val + 32) >> 6` rounded, clipped, added to dst. Column-major block layout matches FFmpeg `ff_h264_idct_add_neon`. 4 lanes per block, 16 blocks per WG.
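
For readers who want the arithmetic spelled out, here is a scalar C sketch of the steps listed above: row pass, column pass, `(val + 32) >> 6`, clip to u8, add to dst, with coefficients stored column-major. It is neither the shader nor FFmpeg's code; the function names are hypothetical, the butterfly is the commonly used H.264 4x4 inverse-transform form, and the code assumes `>>` on negative `int` is an arithmetic shift (true on mainstream compilers, but implementation-defined in C).

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t sketch_clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 4-point inverse butterfly. */
static void sketch_butterfly4(const int in[4], int out[4])
{
    const int z0 = in[0] + in[2];
    const int z1 = in[0] - in[2];
    const int z2 = (in[1] >> 1) - in[3];
    const int z3 = in[1] + (in[3] >> 1);
    out[0] = z0 + z3;
    out[1] = z1 + z2;
    out[2] = z1 - z2;
    out[3] = z0 - z3;
}

/* coeffs is column-major: element (row r, col c) lives at coeffs[c * 4 + r]. */
static void sketch_h264_idct4_add(uint8_t *dst, ptrdiff_t stride,
                                  const int16_t coeffs[16])
{
    int tmp[16];   /* row-major intermediate, tmp[r * 4 + c] */
    int in[4], out[4];

    for (int r = 0; r < 4; r++) {            /* row pass */
        for (int c = 0; c < 4; c++)
            in[c] = coeffs[c * 4 + r];
        sketch_butterfly4(in, out);
        for (int c = 0; c < 4; c++)
            tmp[r * 4 + c] = out[c];
    }
    for (int c = 0; c < 4; c++) {            /* column pass + round + add */
        for (int r = 0; r < 4; r++)
            in[r] = tmp[r * 4 + c];
        sketch_butterfly4(in, out);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = dst + r * stride + c;
            *p = sketch_clip_u8(*p + ((out[r] + 32) >> 6));
        }
    }
}
```

A DC-only block is a convenient sanity check: a single coefficient of 64 at position (0, 0) adds exactly 1 to every pixel of the 4x4 destination.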

## Verification on hertz (Pi 5 / V3D 7.1)

=== Phase 8a API smoke: H.264 kernels via recipe dispatch ===
  H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)   ← flipped
  H264_IDCT8 recipe substrate:      1
  H264_DEBLOCK_LV recipe substrate: 2
  H264_QPEL_MC20 recipe substrate:  1
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact   ← QPU dispatch
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact

Bit-exact first try.
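
The `N/N bytes bit-exact` lines suggest a byte-wise comparison between the dispatched output and a reference buffer. A minimal sketch of such a check, assuming hypothetical helper names (the actual test harness is not shown in this PR):

```c
#include <stddef.h>
#include <stdint.h>

/* Counts positions where the two buffers hold the same byte. */
static size_t sketch_count_matching(const uint8_t *got, const uint8_t *ref,
                                    size_t n)
{
    size_t same = 0;
    for (size_t i = 0; i < n; i++)
        same += (got[i] == ref[i]);
    return same;
}

/* Bit-exact means every byte matches, not "close enough". */
static int sketch_is_bit_exact(const uint8_t *got, const uint8_t *ref,
                               size_t n)
{
    return sketch_count_matching(got, ref, n) == n;
}
```

Reporting the match count rather than a bare pass/fail makes a partial mismatch visible at a glance, for example a single wrong lane showing up as a fixed fraction of bytes.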

## Current recipe state after this PR

| Cycle | Kernel | Recipe verdict | V3D shader |
|---|---|---|---|
| 1 | VP9 IDCT 8x8 | QPU | yes |
| 2 | VP9 LPF wd=4 | QPU | yes |
| 3 | VP9 MC 8h | QPU | yes |
| 4 | VP9 LPF wd=8 | QPU | yes |
| 5 | AV1 CDEF 8x8 | QPU | yes |
| 6 | H.264 IDCT 4x4 | **QPU** | **new** |
| 7 | H.264 IDCT 8x8 | CPU | task #165 |
| 8 | H.264 deblock luma-v | QPU | yes |
| 9 | H.264 qpel mc20 | CPU | task #165 |

Seven of nine kernels are now QPU-by-recipe (cycles 1 through 6 and cycle 8). The remaining cycles 7 and 9 are follow-ons under the same task #165.

The cycle-6 dispatch uses `v3d_runner_create_buffer` / `destroy_buffer` rather than the pool API (PR #6 hadn't merged when this branch was cut); the post-merge swap is trivial.

Closes tasks #163, #164, and partial #165 (cycle 6 piece).

claude-noether added 1 commit 2026-05-23 18:09:28 +00:00
Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per
block, 8 blocks per WG.  H.264 §8.5.13.2 1D butterfly twice (row
pass, column pass), (val + 32) >> 6 rounded + clipped + added to
dst.

Bit-exact first try on hertz (Pi 5, V3D 7.1):

  H264_IDCT4 recipe substrate:      2 (QPU)
  H264_IDCT8 recipe substrate:      2 (QPU)    ← flipped
  H264_DEBLOCK_LV recipe substrate: 2 (QPU)
  H264_QPEL_MC20 recipe substrate:  1 (CPU)    ← task #165 remaining
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact    ← QPU
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact

8 of 9 daedalus-fourier cycles now QPU-by-recipe.  Only cycle 9
(H.264 luma qpel mc20) still CPU — different shape (6-tap MC
filter, not a transform) so needs its own shader template; task
#165 covers it as a follow-on.

Same pattern as cycle 6 commit (65bd5c3): adds h264_idct8_pipe
field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs,
v3d_h264_idct8.spv install rule.

Uses v3d_runner_create_buffer / destroy_buffer (will swap to
pool API once PR #6 lands).
Author
Owner

Cycle 7 H.264 IDCT 8×8 V3D shader added (commit 74687d9). Same template as cycle 6: 8 lanes per block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice.

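## Current state after this commit

| Cycle | Kernel | Recipe | Shader | Bit-exact |
|---|---|---|---|---|
| 1 | VP9 IDCT 8x8 | QPU | yes | ✓ |
| 2 | VP9 LPF wd=4 | QPU | yes | ✓ |
| 3 | VP9 MC 8h | QPU | yes | ✓ |
| 4 | VP9 LPF wd=8 | QPU | yes | ✓ |
| 5 | AV1 CDEF 8x8 | QPU | yes | ✓ |
| 6 | H.264 IDCT 4x4 | QPU | new | ✓ |
| 7 | H.264 IDCT 8x8 | **QPU** | **new** | ✓ |
| 8 | H.264 deblock luma-v | QPU | yes | ✓ |
| 9 | H.264 qpel mc20 | CPU | task #165 follow-on | - |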

8 of 9 cycles now QPU-by-recipe. Only cycle 9 (6-tap horizontal MC, different shape from transforms) still pending — task #165 residual.

  H264_IDCT4 recipe substrate:      2 (1=CPU, 2=QPU)
  H264_IDCT8 recipe substrate:      2
  H264_DEBLOCK_LV recipe substrate: 2
  H264_QPEL_MC20 recipe substrate:  1
  H.264 IDCT 4x4: 2048/2048 bytes bit-exact
  H.264 IDCT 8x8: 2048/2048 bytes bit-exact    ← QPU
  H.264 deblock luma v: 2048/2048 bytes bit-exact
  H.264 qpel mc20: 1024/1024 bytes bit-exact
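
To illustrate why cycle 9 has a different shape from the transforms, here is a scalar C sketch of a 6-tap horizontal half-sample filter of the kind H.264 uses for luma `mc20`: taps (1, -5, 20, 20, -5, 1), rounding `(sum + 16) >> 5`, clip to u8. The function name and signature are hypothetical, and the caller is assumed to provide 2 readable pixels to the left and 3 to the right of each row.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t sketch_clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Horizontal half-sample interpolation between src[x] and src[x + 1].
 * Reads src[x - 2] .. src[x + 3] for every output pixel, so each output
 * depends on a sliding window of neighbours rather than on one block.
 */
static void sketch_h264_qpel_mc20(uint8_t *dst, ptrdiff_t dst_stride,
                                  const uint8_t *src, ptrdiff_t src_stride,
                                  int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *s = src + x;
            const int sum = s[-2] - 5 * s[-1] + 20 * s[0]
                          + 20 * s[1] - 5 * s[2] + s[3];
            dst[x] = sketch_clip_u8((sum + 16) >> 5);
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

Unlike the IDCT kernels, there is no transpose step and neighbouring outputs share most of their inputs, which is one plausible reason a separate shader template is needed.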
marfrit merged commit a092ee34aa into main 2026-05-23 18:59:35 +00:00
marfrit deleted branch noether/qpu-default-recipe-cycles-5-8 2026-05-23 18:59:35 +00:00

Reference: marfrit/daedalus-fourier#7