Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).
Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column. Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.
Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):
kernel CPU ns/op QPU ns/op winner
IDCT 4x4 luma 10.79 2.47 QPU 4.36x
IDCT 8x8 luma 29.69 9.23 QPU 3.22x
Deblock luma_v 17.58 10.21 QPU 1.72x
Deblock luma_h 38.41 9.98 QPU 3.85x
qpel mc20 (8x8) 28.24 9.66 QPU 2.92x
qpel mc02 (8x8) 16.96 20.54 CPU 1.21x
qpel mc22 (8x8) 71.58 9.64 QPU 7.43x
1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
CPU NEON only: 5.57 ms
QPU only: 1.30 ms (CPU/QPU sum ratio = 4.30x)
Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard. Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).
Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.
Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
are now wrong since PRs #28, #29, #35 (those kernels are on QPU).