bench: H.264 primitive bench now measures both substrates + comparison table #36

2026-05-25T18:43:01Z

marfrit commented

2026-05-25 18:43:01 +00:00

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Why

All H.264 hot-path primitives now have QPU shaders (PRs #28–#35). The dispatch overhead has also been hammered down via the buffer pool (task #160) and persistent command buffer (task #161). bench_h264_primitives was still single-column CPU-only. Time to look at the actual numbers.

Headline result (Pi 5 V3D 7.1, hertz, 30 iters × 5 warmup)

kernel	CPU ns/op	QPU ns/op	winner
IDCT 4x4 luma	10.79	2.47	QPU 4.36×
IDCT 8x8 luma	29.69	9.23	QPU 3.22×
Deblock luma_v	17.58	10.21	QPU 1.72×
Deblock luma_h	38.41	9.98	QPU 3.85×
qpel mc20 (8x8)	28.24	9.66	QPU 2.92×
qpel mc02 (8x8)	16.96	20.54	CPU 1.21×
qpel mc22 (8x8)	71.58	9.64	QPU 7.43×

1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):

CPU NEON only: 5.57 ms
QPU only: 1.30 ms
CPU/QPU sum ratio: 4.30×

Significance

PR #10 measured CPU NEON 4× faster than QPU for IDCT at 1080p. Today the same kernel runs 4.36× FASTER on QPU. That's a ~17× swing in QPU's favor — paid for by the buffer-pool + persistent-cmdbuf work.

The substrate decree (2026-05-23 "what can be done, will be done in QPU") is now retroactively backed by measurement. The only kernel still showing CPU ahead is qpel mc02 — single-axis vertical filter, row-strided memory pattern unfriendly to the current WG layout. Left as a targeted follow-up.

Also in this PR

Tightens test_api_h264's startup recipe substrate print. The stale (CPU) / (CPU, no QPU H shader yet) / (CPU, bS=4 set) labels next to DEBLOCK_LH, DEBLOCK_CV, DEBLOCK_CH, DEBLOCK_*_INTRA are now wrong (those kernels are on QPU since PRs #28, #29, #35).

Bundled into one PR per request to stop ping-ponging on small follow-ups.

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path). ## Why All H.264 hot-path primitives now have QPU shaders (PRs #28–#35). The dispatch overhead has also been hammered down via the buffer pool (task #160) and persistent command buffer (task #161). `bench_h264_primitives` was still single-column CPU-only. Time to look at the actual numbers. ## Headline result (Pi 5 V3D 7.1, hertz, 30 iters × 5 warmup) | kernel | CPU ns/op | QPU ns/op | winner | |-------------------|-----------|-----------|--------------| | IDCT 4x4 luma | 10.79 | **2.47** | QPU **4.36×** | | IDCT 8x8 luma | 29.69 | **9.23** | QPU **3.22×** | | Deblock luma_v | 17.58 | **10.21** | QPU **1.72×** | | Deblock luma_h | 38.41 | **9.98** | QPU **3.85×** | | qpel mc20 (8x8) | 28.24 | **9.66** | QPU **2.92×** | | qpel mc02 (8x8) | **16.96** | 20.54 | CPU 1.21× | | qpel mc22 (8x8) | 71.58 | **9.64** | QPU **7.43×** | 1080p worst-case sum (IDCT4 + deblock luma + qpel mc22): - **CPU NEON only**: 5.57 ms - **QPU only**: 1.30 ms - **CPU/QPU sum ratio: 4.30×** ## Significance PR #10 measured CPU NEON 4× faster than QPU for IDCT at 1080p. Today the same kernel runs **4.36× FASTER on QPU**. That's a ~17× swing in QPU's favor — paid for by the buffer-pool + persistent-cmdbuf work. The substrate decree (2026-05-23 "what can be done, will be done in QPU") is now retroactively backed by measurement. The only kernel still showing CPU ahead is qpel mc02 — single-axis vertical filter, row-strided memory pattern unfriendly to the current WG layout. Left as a targeted follow-up. ## Also in this PR Tightens `test_api_h264`'s startup recipe substrate print. The stale `(CPU)` / `(CPU, no QPU H shader yet)` / `(CPU, bS=4 set)` labels next to `DEBLOCK_LH`, `DEBLOCK_CV`, `DEBLOCK_CH`, `DEBLOCK_*_INTRA` are now wrong (those kernels are on QPU since PRs #28, #29, #35). Bundled into one PR per request to stop ping-ponging on small follow-ups.

marfrit added 1 commit 2026-05-25 18:43:02 +00:00

bench: H.264 primitive bench now measures both substrates + comparison table 989818c2e6

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
are now wrong since PRs #28, #29, #35 (those kernels are on QPU).