v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline (#37)
PR #36 was wrong
The "QPU 4.30× faster than CPU NEON" headline was a measurement artifact, not a real result.
The bug
v3d_runner.read_spv() did fopen(spv_path, "rb") with no path search. The caller passes a bare filename (e.g. v3d_h264_idct4.spv). cmake puts SPVs in $builddir, but the bench was typically invoked from the source dir, so fopen failed.
On failure, read_spv printed perror and returned NULL. Pipeline create returned -1, dispatch returned -1, but the bench loop ignored the return value and timed the failure path. Each iteration cost ~1–5 µs (open + perror + return), which divided across 256 ops gave ~10–20 ns/op, looking convincingly like real-but-fast QPU work.
PR #36's "IDCT 4x4 luma … QPU 2.47 ns/op" was that artifact. PR #10's much-slower QPU measurement was real (SPV happened to be findable that time, perhaps run from build/). The gap never closed; we just measured the wrong thing in PR #36.
Corrected numbers (hertz, Pi 5 V3D 7.1, 30 iters × 5 warmup, AFTER this PR)
1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
QPU is 12–77× slower per kernel. PR #10's verdict stands; PR #36's reversal was withdrawn.
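The failure mode described under "The bug" can be sketched in a few lines. This is a minimal illustration, not the bench's actual code: failing_dispatch, working_dispatch, time_ignoring_rc and time_with_preflight are hypothetical names, and the stubs stand in for a real dispatch. It shows why a loop that discards the return code reports a small, plausible-looking ns/op for a call that did no work, and how a single untimed preflight call catches that case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a dispatch whose pipeline was never created:
 * it does no work and reports failure immediately. */
static int failing_dispatch(void)
{
    return -1;
}

/* Hypothetical stand-in for a dispatch that succeeds. */
static int working_dispatch(void)
{
    return 0;
}

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Buggy pattern: the return value is discarded, so a call that fails
 * fast is timed as if it were real work. Returns ns per op. */
static double time_ignoring_rc(int (*fn)(void), int iters, int ops_per_iter)
{
    assert(iters > 0 && ops_per_iter > 0);
    double t0 = now_ns();
    for (int i = 0; i < iters; i++)
        (void)fn();
    double t1 = now_ns();
    return (t1 - t0) / ((double)iters * (double)ops_per_iter);
}

/* Guarded pattern: one untimed preflight call. A nonzero rc is returned
 * to the caller and nothing is timed; on success, returns 0 and stores
 * ns per op in *ns_per_op. */
static int time_with_preflight(int (*fn)(void), int iters, int ops_per_iter,
                               double *ns_per_op)
{
    int rc = fn();
    if (rc != 0)
        return rc;
    *ns_per_op = time_ignoring_rc(fn, iters, ops_per_iter);
    return 0;
}
```

Calling time_ignoring_rc(failing_dispatch, 30, 256) still yields a number, which is exactly the trap; time_with_preflight(failing_dispatch, 30, 256, &ns) returns -1 and leaves ns untouched, so the caller can skip the kernel instead of reporting a timing.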
The fix
v3d_runner.c — SPV path search
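As a rough illustration of the ordered search this section describes, the sketch below tries a list of candidate directories and opens the first match. It is not the code in v3d_runner.c: open_spv_searching, join_path, exe_dir and PATH_CAP are hypothetical names, and "binary-relative" is assumed here to mean the directory that contains the running executable. The /proc/self/exe lookup is Linux-specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { PATH_CAP = 4096 };  /* assumed buffer size for this sketch */

/* Join dir and name into out. Returns 0 on success, -1 if it does not fit. */
static int join_path(char *out, size_t cap, const char *dir, const char *name)
{
    int n = snprintf(out, cap, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Directory containing the running binary, via /proc/self/exe (Linux only).
 * Returns 0 on success, -1 on failure or possible truncation. */
static int exe_dir(char *out, size_t cap)
{
    ssize_t n = readlink("/proc/self/exe", out, cap - 1);
    if (n <= 0 || (size_t)n >= cap - 1)
        return -1;
    out[n] = '\0';              /* readlink does not NUL-terminate */
    char *slash = strrchr(out, '/');
    if (slash == NULL)
        return -1;
    *slash = '\0';              /* strip the executable's file name */
    return 0;
}

/* Try each candidate directory in order and return the first file that
 * opens. Succeeds silently; on total failure prints one error line that
 * names every directory searched, then returns NULL. */
static FILE *open_spv_searching(const char *name)
{
    assert(name != NULL);
    char exe[PATH_CAP];
    const char *dirs[5];
    size_t ndirs = 0;

    dirs[ndirs++] = ".";                              /* cwd */
    const char *env = getenv("DAEDALUS_SHADER_DIR");
    if (env != NULL && *env != '\0')
        dirs[ndirs++] = env;                          /* env override */
    if (exe_dir(exe, sizeof exe) == 0)
        dirs[ndirs++] = exe;                          /* next to the binary */
    dirs[ndirs++] = "/opt/fourier/share/daedalus-fourier";
    dirs[ndirs++] = "/usr/share/daedalus-fourier";

    for (size_t i = 0; i < ndirs; i++) {
        char path[PATH_CAP];
        if (join_path(path, sizeof path, dirs[i], name) != 0)
            continue;
        FILE *f = fopen(path, "rb");
        if (f != NULL)
            return f;
    }

    fprintf(stderr, "%s: not found; searched:", name);
    for (size_t i = 0; i < ndirs; i++)
        fprintf(stderr, " %s", dirs[i]);
    fputc('\n', stderr);
    return NULL;
}
```

Building the directory list first and reporting it once keeps the error output to a single line, rather than one perror per failed attempt.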
Tries, in order: cwd → $DAEDALUS_SHADER_DIR → binary-relative (readlink /proc/self/exe) → /opt/fourier/share/daedalus-fourier/ → /usr/share/daedalus-fourier/. Found-anywhere succeeds silently; found-nowhere prints one error naming all searched locations.
bench_h264_primitives.c — preflight + abort
bench_fn now returns int. bench_ns does a single preflight call; if rc != 0 it prints "DISPATCH FAILED rc=N — kernel skipped" and skips the kernel. Main counts QPU failures and exits 2 before printing the comparison table if any kernel failed, so the next person running this can't read fail-fast timings as substrate numbers.
Policy implications
The QPU substrate decree (2026-05-23) was conceived as a policy choice overriding per-kernel measurement. With corrected data the gap is not a "fixable defect we'll close with one more optimization"; it's an order of magnitude. Whether to keep the decree, soften it, or revert is now a clear-eyed decision.
This PR doesn't change the recipe table. That's a separate decision taken on its own merits.
Related
marfrit-packages PR #104 (libavcodec ctx no_qpu → qpu-capable) was justified by PR #36's artifact and is being reverted in a follow-up to marfrit-packages.