1347fb961c7708187f0c348c6be8ef404ff8a9b7
3 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
1347fb961c |
v3d_runner: SPV path search + bench preflight — RETRACTS PR #36's headline
PR #36 reported a 4.30x QPU-over-CPU win for the H.264 1080p hot-path sum. That number was a measurement artifact. This commit makes the artifact impossible to reproduce by ANYONE running the bench again. THE BUG ------- v3d_runner read_spv() did fopen(spv_path, "rb") with no path search: the caller passes a bare filename like "v3d_h264_idct4.spv" and fopen resolves it relative to cwd. The cmake build puts SPVs in $builddir (e.g. ~/src/daedalus-fourier/build/), but the bench (and test_api_h264) were typically invoked from ~/src/daedalus-fourier/, so fopen failed. On failure read_spv printed perror and returned NULL; pipeline create then returned -1; dispatch then returned -1; the bench loop ignored the return value and timed the failure path. Each iter cost ~1-5 µs (open + perror + return), which divided across 256 ops gave ~10-20 ns/op — looking convincingly like real-but-fast QPU work. PR #36's "QPU 2.47 ns/op" for IDCT 4x4 was that artifact. PR #10's much-slower "QPU 37.77 ms" measurement was REAL (SPV apparently found that time, perhaps run from build/), so the artifact is what made it look like the gap had closed. The gap never closed. CORRECTED NUMBERS ----------------- Run from hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup) AFTER this commit: kernel CPU ns/op QPU ns/op winner IDCT 4x4 luma 10.75 217.63 CPU 20.24x IDCT 8x8 luma 29.69 785.94 CPU 26.47x Deblock luma_v 17.63 467.42 CPU 26.51x Deblock luma_h 38.30 498.53 CPU 13.02x qpel mc20 (8x8) 30.17 1300.44 CPU 43.10x qpel mc02 (8x8) 17.69 1363.40 CPU 77.08x qpel mc22 (8x8) 71.60 1948.37 CPU 27.21x 1080p worst-case sum (IDCT4 + deblock luma + qpel mc22): CPU NEON only: 5.57 ms QPU only: 123.54 ms Ratio: CPU/QPU sum = 0.05x (QPU 22x SLOWER than CPU) QPU is currently 12-77x slower per kernel. The post-buffer-pool / post-persistent-cmdbuf dispatch overhead (tasks #160, #161) did NOT close the gap with NEON. Whether those tasks helped at all needs re-measurement — the previous "we saw a big win" reading was the same artifact. PR #36's commit-message claim "PR #10's verdict is reversed" is withdrawn. PR #10 was right; PR #36 was wrong. THE FIX ------- Two changes: 1. v3d_runner: SPV search now tries, in order: - cwd (legacy) - $DAEDALUS_SHADER_DIR (env override) - <readlink /proc/self/exe>/.. (binary-relative) - /opt/fourier/share/daedalus-fourier/ (Pi 5 install) - /usr/share/daedalus-fourier/ (system-wide) Found-anywhere succeeds silently. Found-nowhere prints one error naming all searched locations. 2. bench_h264_primitives: bench_fn now returns int. bench_ns does a single preflight call; if rc != 0 it prints "DISPATCH FAILED rc=N — kernel skipped" and bails on the kernel. Main loop counts QPU failures and exits 2 BEFORE printing the comparison table if any kernel failed — so the next person running this can't read fail-fast timings as substrate numbers. POLICY IMPLICATIONS ------------------- The QPU substrate decree (2026-05-23) was conceived as a policy choice that overrides per-kernel measurement. With the corrected data the gap is not "fixable defect we'll close with one more optimization", it's an order of magnitude. Whether to keep the decree, soften it (auto = QPU only where measured advantage), or revert is now a clear-eyed decision for the user. This commit doesn't change the recipe table — that's a separate question, taken on its own merits with this data in hand. Related: marfrit-packages PR #104 (libavcodec ctx flipped no_qpu → qpu-capable) was justified by PR #36's artifact and should be reverted; that revert lands in a follow-up to marfrit-packages. |
||
|
|
989818c2e6 |
bench: H.264 primitive bench now measures both substrates + comparison table
Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path). Now that all H.264 hot-path primitives have QPU shaders and the dispatch overhead has been hammered down (tasks #160 buffer pool, #161 persistent command buffer), bench_h264_primitives no longer measures one column. Two passes — CPU NEON and QPU V3D7 compute — with a side-by-side per-kernel comparison and ratio. Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup): kernel CPU ns/op QPU ns/op winner IDCT 4x4 luma 10.79 2.47 QPU 4.36x IDCT 8x8 luma 29.69 9.23 QPU 3.22x Deblock luma_v 17.58 10.21 QPU 1.72x Deblock luma_h 38.41 9.98 QPU 3.85x qpel mc20 (8x8) 28.24 9.66 QPU 2.92x qpel mc02 (8x8) 16.96 20.54 CPU 1.21x qpel mc22 (8x8) 71.58 9.64 QPU 7.43x 1080p worst-case sum (IDCT4 + deblock luma + qpel mc22): CPU NEON only: 5.57 ms QPU only: 1.30 ms (CPU/QPU sum ratio = 4.30x) Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land hard. Only qpel mc02 still shows CPU ahead, marginally (single- axis vertical filter, row-strided memory pattern unfriendly to the WG layout — left as a follow-up for cycle-9-style targeted tuning). Substrate decree (2026-05-23) stays in force as policy — these numbers retroactively justify it. Also tightens test_api_h264's startup recipe print: the stale "(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra are now wrong since PRs #28, #29, #35 (those kernels are on QPU). |
||
|
|
ba5bbae8e2 |
bench: H.264 primitives NEON CPU baseline (1080p budget projection)
Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets. Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."
Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):
Per-kernel ns/op (CPU NEON):
IDCT 4x4 luma 10.78 ns/block
IDCT 8x8 luma 29.73 ns/block
Deblock luma_v 18.04 ns/edge
Deblock luma_h 41.65 ns/edge (H access pattern less SIMD-friendly)
qpel mc20 (H half-pel) 25.66 ns/block
qpel mc02 (V half-pel) 15.06 ns/block (faster than mc20!)
qpel mc22 (HV half-pel) 71.50 ns/block (cascaded H+V, expected)
Projected 1080p frame budgets (worst-case, CPU NEON only):
IDCT 4x4 (all-4x4 MBs): 1.41 ms (130,560 blocks)
IDCT 8x8 (all-8x8 MBs): 0.97 ms ( 32,640 blocks)
Deblock luma_v (all MBs): 0.59 ms ( 32,640 edges)
Deblock luma_h (all MBs): 1.36 ms ( 32,640 edges)
qpel mc22 (all 8x8 blocks): 2.33 ms ( 32,640 blocks)
Sum (IDCT 4x4 + deblock luma + MC all-mc22): 5.69 ms
30 fps deadline: 33.33 ms
Margin: +27.64 ms
What this validates:
- The "30fps@1080p is the fine floor" memory note holds with
huge headroom on the pixel-math layer alone. 17% of the
deadline goes to pixel math (worst case); 83% is available
for entropy decode + reference frame management + intra
prediction + chroma deblock + chroma IDCT + the libavcodec
intercept overhead.
- The CPU-vs-QPU substrate finding from earlier (PR #10 on
daedalus-decoder showed CPU NEON is 4x faster than QPU for
IDCT) is consistent here. All these kernels have CPU-only
recipes by default; the data suggests that's the right call
for now. The recipe substrate decision can be revisited
per-kernel once QPU shaders catch up.
- mc22 (2D HV half-pel) is the most expensive single qpel
position at ~71 ns/block — 2-7x more than the 1D variants.
Real B-slice biprediction with two mc22 calls per MB would
add ~4.7 ms/frame; still comfortable but worth knowing.
What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):
- Chroma IDCT (4 cb + 4 cr 4x4 per MB). At similar ns/block to
luma, that's ~0.7 ms/frame.
- Chroma deblock (smaller tile, simpler kernel — sub-ms).
- Intra prediction (per-block, ~50 ops at NEON, but serialized
in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
- bS=4 intra deblock variants — different algorithm, similar
cost to bS<4.
- chroma DC Hadamard — trivial.
Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms. Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.
|