v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline #37

2026-05-25T19:45:44Z

marfrit commented

2026-05-25 19:45:44 +00:00

PR #36 was wrong

The "QPU 4.30× faster than CPU NEON" headline was a measurement artifact, not a real result.

The bug

v3d_runner.read_spv() did fopen(spv_path, "rb") with no path search. The caller passes a bare filename (e.g. v3d_h264_idct4.spv). cmake puts SPVs in $builddir, but the bench was typically invoked from the source dir, so fopen failed.

On failure, read_spv printed perror and returned NULL. Pipeline create returned -1, dispatch returned -1, but the bench loop ignored the return value and timed the failure path. Each iteration cost ~1–5 µs (open + perror + return), which divided across 256 ops gave ~4–20 ns/op, looking convincingly like real-but-fast QPU work.
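To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the actual v3d_runner or bench code: `fake_dispatch` and `buggy_bench_ns_per_op` are hypothetical names, and the only things taken from this PR are the shape of the bug (a failing `fopen`, an ignored return value) and the division by iterations × ops.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* clock_gettime() */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a dispatch that needs its shader file:
 * returns 0 on success, -1 if the file cannot be opened. */
int fake_dispatch(const char *spv_path)
{
    FILE *f = fopen(spv_path, "rb");
    if (!f)
        return -1;              /* fail-fast path: no GPU work happens */
    fclose(f);
    return 0;
}

/* Buggy pattern: the return value is discarded, so a loop of failures
 * is timed exactly like a loop of successes. */
double buggy_bench_ns_per_op(const char *spv_path, int iters,
                             int ops_per_dispatch)
{
    assert(iters > 0 && ops_per_dispatch > 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        (void)fake_dispatch(spv_path);      /* rc ignored: this is the bug */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)iters * ops_per_dispatch);
}
```

Called with a path that does not exist, `buggy_bench_ns_per_op` still returns a small, plausible-looking ns/op figure, because all it measured was the cost of failing.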

PR #36's IDCT 4x4 luma … QPU 2.47 ns/op was that artifact. PR #10's much-slower QPU measurement was real (SPV happened to be findable that time, perhaps run from build/). The gap never closed; we just measured the wrong thing in PR #36.

Corrected numbers (hertz, Pi 5 V3D 7.1, 30 iters × 5 warmup, AFTER this PR)

| kernel | CPU ns/op | QPU ns/op | winner |
|---|---|---|---|
| IDCT 4x4 luma | 10.75 | 217.63 | CPU 20.24× |
| IDCT 8x8 luma | 29.69 | 785.94 | CPU 26.47× |
| Deblock luma_v | 17.63 | 467.42 | CPU 26.51× |
| Deblock luma_h | 38.30 | 498.53 | CPU 13.02× |
| qpel mc20 (8x8) | 30.17 | 1300.44 | CPU 43.10× |
| qpel mc02 (8x8) | 17.69 | 1363.40 | CPU 77.08× |
| qpel mc22 (8x8) | 71.60 | 1948.37 | CPU 27.21× |

1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):

CPU NEON only: 5.57 ms
QPU only: 123.54 ms
CPU/QPU sum ratio: 0.05× (QPU 22× SLOWER than CPU)

QPU is 13–77× slower per kernel. PR #10's verdict stands; PR #36's reversal is withdrawn.
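The winner column and the summary ratio can be rechecked directly from the table. The sketch below does that arithmetic; the struct and function names (`bench_row`, `qpu_slowdown`, `slowdown_range`) are illustrative and not part of the bench, and the numbers are copied from the corrected table above.

```c
#include <assert.h>
#include <stddef.h>

struct bench_row {
    const char *kernel;
    double cpu_ns;   /* CPU NEON ns/op */
    double qpu_ns;   /* QPU ns/op */
};

/* Values copied from the corrected table. */
const struct bench_row corrected_rows[] = {
    { "IDCT 4x4 luma",   10.75,  217.63 },
    { "IDCT 8x8 luma",   29.69,  785.94 },
    { "Deblock luma_v",  17.63,  467.42 },
    { "Deblock luma_h",  38.30,  498.53 },
    { "qpel mc20 (8x8)", 30.17, 1300.44 },
    { "qpel mc02 (8x8)", 17.69, 1363.40 },
    { "qpel mc22 (8x8)", 71.60, 1948.37 },
};
const size_t corrected_row_count =
    sizeof corrected_rows / sizeof corrected_rows[0];

/* How many times slower the QPU is than the CPU for one kernel. */
double qpu_slowdown(const struct bench_row *r)
{
    return r->qpu_ns / r->cpu_ns;
}

/* Smallest and largest slowdown across n rows. */
void slowdown_range(const struct bench_row *rows, size_t n,
                    double *lo, double *hi)
{
    assert(rows && lo && hi && n > 0);
    *lo = *hi = qpu_slowdown(&rows[0]);
    for (size_t i = 1; i < n; i++) {
        double s = qpu_slowdown(&rows[i]);
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

Over these rows the smallest slowdown is Deblock luma_h (about 13×) and the largest is qpel mc02 (about 77×). The 1080p sum ratio is 123.54 / 5.57 ≈ 22.2, matching the quoted "22× slower".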

The fix

v3d_runner.c — SPV path search

Tries, in order: cwd → $DAEDALUS_SHADER_DIR → binary-relative (readlink /proc/self/exe) → /opt/fourier/share/daedalus-fourier/ → /usr/share/daedalus-fourier/. Found-anywhere succeeds silently; found-nowhere prints one error naming all searched locations.
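The search order can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in v3d_runner.c: the function names `open_spv` and `exe_dir`, the 4096-byte path buffer, and the exact wording of the error message are all made up here. "Binary-relative" is taken to mean the directory containing the running executable, and the `/proc/self/exe` lookup is Linux-specific.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* readlink() */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { SPV_PATH_MAX = 4096 };     /* assumed buffer size */

/* Fill `out` with the directory containing the running binary.
 * Leaves `out` empty if it cannot be determined. */
static void exe_dir(char *out, size_t cap)
{
    ssize_t n = readlink("/proc/self/exe", out, cap - 1);
    if (n <= 0) {
        out[0] = '\0';
        return;
    }
    out[n] = '\0';                /* readlink() does not NUL-terminate */
    char *slash = strrchr(out, '/');
    if (slash)
        *slash = '\0';            /* strip the file name, keep its directory */
    else
        out[0] = '\0';
}

/* Try each candidate directory in order; return an open FILE* or NULL. */
FILE *open_spv(const char *name)
{
    assert(name != NULL);

    char bindir[SPV_PATH_MAX];
    exe_dir(bindir, sizeof bindir);

    const char *dirs[] = {
        ".",                               /* cwd (legacy behaviour) */
        getenv("DAEDALUS_SHADER_DIR"),     /* env override, may be NULL */
        bindir,                            /* binary-relative */
        "/opt/fourier/share/daedalus-fourier",
        "/usr/share/daedalus-fourier",
    };
    const size_t ndirs = sizeof dirs / sizeof dirs[0];

    for (size_t i = 0; i < ndirs; i++) {
        char path[SPV_PATH_MAX];
        if (!dirs[i] || !dirs[i][0])
            continue;                      /* unset or empty: skip */
        int len = snprintf(path, sizeof path, "%s/%s", dirs[i], name);
        if (len < 0 || (size_t)len >= sizeof path)
            continue;                      /* path too long: skip */
        FILE *f = fopen(path, "rb");
        if (f)
            return f;                      /* found: succeed silently */
    }

    /* Found nowhere: one message naming every location that was tried. */
    fprintf(stderr, "cannot find %s; searched:", name);
    for (size_t i = 0; i < ndirs; i++)
        if (dirs[i] && dirs[i][0])
            fprintf(stderr, " %s", dirs[i]);
    fputc('\n', stderr);
    return NULL;
}
```

Keeping cwd first preserves the old behaviour for anyone who already runs from the build directory, while the later entries cover the source-dir invocation that triggered the bug.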

bench_h264_primitives.c — preflight + abort

bench_fn now returns int. bench_ns does a single preflight call; if rc != 0 it prints "DISPATCH FAILED rc=N — kernel skipped" and skips that kernel. Main counts QPU failures and, if any kernel failed, exits with status 2 before printing the comparison table, so the next person running this can't mistake fail-fast timings for substrate numbers.
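A minimal sketch of the preflight-then-time pattern follows. The names `bench_fn` and `bench_ns` come from this PR, but their signatures here are assumptions; `now_ns`, `report_exit_code`, `demo_ok`, and `demo_fail` are hypothetical helpers added for illustration. The return-value checks inside the warmup and timed loops are an extra precaution that the PR text does not describe.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* clock_gettime() */
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Assumed shape: a benchmarked call returns 0 on success, nonzero on failure. */
typedef int (*bench_fn)(void *ctx);

static double now_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (double)t.tv_sec * 1e9 + (double)t.tv_nsec;
}

/* Times fn only if a single preflight call succeeds.
 * Returns 0 and writes ns/op to *ns_per_op, or returns the failing rc
 * and leaves *ns_per_op untouched. */
int bench_ns(const char *label, bench_fn fn, void *ctx,
             int warmup, int iters, int ops_per_call, double *ns_per_op)
{
    assert(fn && ns_per_op && iters > 0 && ops_per_call > 0);

    int rc = fn(ctx);                          /* preflight */
    if (rc != 0) {
        fprintf(stderr, "%s: DISPATCH FAILED rc=%d — kernel skipped\n",
                label, rc);
        return rc;
    }

    for (int i = 0; i < warmup; i++)
        if ((rc = fn(ctx)) != 0)
            return rc;                         /* extra check, see lead-in */

    double t0 = now_ns();
    for (int i = 0; i < iters; i++)
        if ((rc = fn(ctx)) != 0)
            return rc;                         /* extra check, see lead-in */
    double t1 = now_ns();

    *ns_per_op = (t1 - t0) / ((double)iters * ops_per_call);
    return 0;
}

/* Exit-status policy: any failed kernel suppresses the comparison table. */
int report_exit_code(int qpu_failures)
{
    if (qpu_failures > 0) {
        fprintf(stderr, "%d QPU kernel(s) failed; comparison table not printed\n",
                qpu_failures);
        return 2;
    }
    return 0;
}

/* Two trivial bench_fn examples for demonstration. */
int demo_ok(void *ctx)   { (void)ctx; return 0; }
int demo_fail(void *ctx) { (void)ctx; return -1; }
```

With this shape, `bench_ns("idct4", demo_fail, NULL, 5, 30, 256, &ns)` returns -1 without writing a timing, and a caller that passes its failure count to `report_exit_code` gets status 2 instead of a table built from failed dispatches.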

Policy implications

The QPU substrate decree (2026-05-23) was conceived as a policy choice overriding per-kernel measurement. With corrected data the gap is not "fixable defect we'll close with one more optimization" — it's an order of magnitude. Whether to keep the decree, soften it, or revert is now a clear-eyed decision.

This PR doesn't change the recipe table. That's a separate decision taken on its own merits.

Related

marfrit-packages PR #104 (libavcodec ctx no_qpu → qpu-capable) was justified by PR #36's artifact and is being reverted in a follow-up to marfrit-packages.

marfrit added 1 commit 2026-05-25 19:45:47 +00:00

v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline 1347fb961c

PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path
sum.  That number was a measurement artifact.  This commit makes the
artifact impossible to reproduce by ANYONE running the bench again.

THE BUG
-------

v3d_runner read_spv() did fopen(spv_path, "rb") with no path search:
the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen
resolves it relative to cwd.  The cmake build puts SPVs in $builddir
(e.g. ~/src/daedalus-fourier/build/), but the bench (and test_api_h264)
were typically invoked from ~/src/daedalus-fourier/, so fopen failed.

On failure read_spv printed perror and returned NULL; pipeline create
then returned -1; dispatch then returned -1; the bench loop ignored
the return value and timed the failure path.  Each iter cost ~1-5 µs
(open + perror + return), which divided across 256 ops gave ~4-20
ns/op, looking convincingly like real-but-fast QPU work.

PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact.  PR #10's
much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found
that time, perhaps run from build/), so the artifact is what made it
look like the gap had closed.  The gap never closed.

CORRECTED NUMBERS
-----------------

Run from hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup) AFTER this commit:

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.75     217.63  CPU 20.24x
  IDCT 8x8 luma          29.69     785.94  CPU 26.47x
  Deblock luma_v         17.63     467.42  CPU 26.51x
  Deblock luma_h         38.30     498.53  CPU 13.02x
  qpel mc20 (8x8)        30.17    1300.44  CPU 43.10x
  qpel mc02 (8x8)        17.69    1363.40  CPU 77.08x
  qpel mc22 (8x8)        71.60    1948.37  CPU 27.21x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:   5.57 ms
    QPU only:      123.54 ms
    Ratio:        CPU/QPU sum = 0.05x  (QPU 22x SLOWER than CPU)

QPU is currently 13-77x slower per kernel.  The buffer-pool and
persistent-cmdbuf dispatch-overhead work (tasks #160, #161) did NOT
close the gap with NEON.  Whether those tasks helped at all needs
re-measurement: the previous "we saw a big win" reading was the
same artifact.

PR #36's commit-message claim "PR #10's verdict is reversed" is
withdrawn.  PR #10 was right; PR #36 was wrong.

THE FIX
-------

Two changes:

1. v3d_runner: SPV search now tries, in order:
     - cwd (legacy)
     - $DAEDALUS_SHADER_DIR (env override)
     - <readlink /proc/self/exe>/.. (binary-relative)
     - /opt/fourier/share/daedalus-fourier/ (Pi 5 install)
     - /usr/share/daedalus-fourier/ (system-wide)
   Found-anywhere succeeds silently.  Found-nowhere prints one error
   naming all searched locations.

2. bench_h264_primitives: bench_fn now returns int.  bench_ns does
   a single preflight call; if rc != 0 it prints "DISPATCH FAILED
   rc=N — kernel skipped" and bails on the kernel.  Main loop counts
   QPU failures and exits 2 BEFORE printing the comparison table if
   any kernel failed — so the next person running this can't read
   fail-fast timings as substrate numbers.

POLICY IMPLICATIONS
-------------------

The QPU substrate decree (2026-05-23) was conceived as a policy
choice that overrides per-kernel measurement.  With the corrected
data the gap is not "fixable defect we'll close with one more
optimization", it's an order of magnitude.  Whether to keep the
decree, soften it (auto = QPU only where measured advantage), or
revert is now a clear-eyed decision for the user.

This commit doesn't change the recipe table — that's a separate
question, taken on its own merits with this data in hand.

Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu →
qpu-capable) was justified by PR #36's artifact and should be
reverted; that revert lands in a follow-up to marfrit-packages.

marfrit referenced this pull request from marfrit/marfrit-packages

2026-05-25 19:48:02 +00:00

ffmpeg-v4l2-request-fourier: revert ctx flip — PR #36 was a measurement artifact (0015) #105

marfrit merged commit 432d127ea9 into main

2026-05-25 20:33:32 +00:00

marfrit deleted branch noether/spv-search-and-bench-retract

2026-05-25 20:33:32 +00:00


Reference: marfrit/daedalus-fourier#37