bench: H.264 primitive bench now measures both substrates + comparison table #36
Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).
Why
All H.264 hot-path primitives now have QPU shaders (PRs #28–#35). The dispatch overhead has also been hammered down via the buffer pool (task #160) and persistent command buffer (task #161).
bench_h264_primitives was still single-column CPU-only. Time to look at the actual numbers.
Headline result (Pi 5 V3D 7.1, hertz, 30 iters × 5 warmup)
1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
Significance
PR #10 measured CPU NEON 4× faster than QPU for IDCT at 1080p. Today the same kernel runs 4.36× FASTER on QPU. That's a ~17× swing in QPU's favor — paid for by the buffer-pool + persistent-cmdbuf work.
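The ~17× figure follows from the two ratios quoted above. As a quick check of the arithmetic, with both measurements expressed as CPU time divided by QPU time (the variable names are illustrative, not taken from the bench):

```python
# Both ratios are cpu_time / qpu_time, so a value below 1 means the CPU was ahead.
old_ratio = 1 / 4.0   # PR #10: CPU NEON 4x faster than QPU for IDCT at 1080p
new_ratio = 4.36      # this PR: the same kernel is 4.36x faster on QPU

# The swing is how far the ratio moved in QPU's favor between the two measurements.
swing = new_ratio / old_ratio
print(f"swing in QPU's favor: ~{swing:.1f}x")
```

This gives 4 × 4.36 = 17.44, which is what the text rounds to ~17×.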
The substrate decree (2026-05-23 "what can be done, will be done in QPU") is now retroactively backed by measurement. The only kernel still showing CPU ahead is qpel mc02 — single-axis vertical filter, row-strided memory pattern unfriendly to the current WG layout. Left as a targeted follow-up.
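For readers who want to reproduce this kind of two-substrate comparison, here is a minimal sketch of the measurement protocol named in the headline. It is not the code of bench_h264_primitives. It assumes that "30 iters × 5 warmup" means 5 untimed warmup calls followed by 30 timed ones, it assumes the median is the reported statistic, and the names `bench`, `comparison_row`, `cpu_kernel`, and `qpu_kernel` are hypothetical stand-ins.

```python
import time
from statistics import median


def bench(fn, warmup=5, iters=30):
    """Return the median wall-clock time of fn() in microseconds.

    Warmup calls are executed but not timed, so one-time costs
    (cache fills, lazy allocation) do not skew the timed samples.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - t0) / 1000.0)
    return median(samples)


def comparison_row(name, cpu_us, qpu_us):
    """Format one table row: kernel, CPU us, QPU us, speedup, winner."""
    speedup = cpu_us / qpu_us  # > 1 means the QPU is faster
    winner = "QPU" if speedup > 1.0 else "CPU"
    return f"{name:<14}{cpu_us:>10.1f}{qpu_us:>10.1f}{speedup:>8.2f}x  {winner}"


if __name__ == "__main__":
    # Placeholder workloads only; a real run would call the CPU and QPU
    # implementations of the same H.264 primitive on the same input.
    cpu_kernel = lambda: sum(range(2000))
    qpu_kernel = lambda: sum(range(1000))
    print(f"{'kernel':<14}{'CPU us':>10}{'QPU us':>10}{'speedup':>9}")
    print(comparison_row("placeholder", bench(cpu_kernel), bench(qpu_kernel)))
```

Reporting the ratio alongside both raw columns keeps cases like qpel mc02, where the CPU is still ahead, visible in the same table rather than hidden in a summed total.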
Also in this PR
Tightens test_api_h264's startup recipe substrate print. The stale (CPU) / (CPU, no QPU H shader yet) / (CPU, bS=4 set) labels next to DEBLOCK_LH, DEBLOCK_CV, DEBLOCK_CH, DEBLOCK_*_INTRA are now wrong (those kernels have been on QPU since PRs #28, #29, #35).
Bundled into one PR per request to stop ping-ponging on small follow-ups.