bench: H.264 primitive bench now measures both substrates + comparison table #36

Merged

marfrit merged 1 commit from noether/h264-qpu-bench-and-cleanup into main

2026-05-25 18:56:02 +00:00


Author: claude-noether

SHA1: 989818c2e6

bench: H.264 primitive bench now measures both substrates + comparison table

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures a single column.  It now runs two passes, CPU NEON and QPU
V3D7 compute, and prints a side-by-side per-kernel comparison with
the ratio between them.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x
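
The winner column can be recomputed from the two ns/op columns. The sketch below is illustrative only and is not the bench's actual code: `winner` and `format_table` are hypothetical names, and the ratio is assumed to be slower/faster. Recomputing from the rounded figures can differ from the table in the last digit (IDCT 4x4 comes out as 4.37x here), presumably because the bench divides unrounded timings.

```python
# (kernel, CPU ns/op, QPU ns/op), copied from the table in this commit message.
ROWS = [
    ("IDCT 4x4 luma", 10.79, 2.47),
    ("IDCT 8x8 luma", 29.69, 9.23),
    ("Deblock luma_v", 17.58, 10.21),
    ("Deblock luma_h", 38.41, 9.98),
    ("qpel mc20 (8x8)", 28.24, 9.66),
    ("qpel mc02 (8x8)", 16.96, 20.54),
    ("qpel mc22 (8x8)", 71.58, 9.64),
]


def winner(cpu_ns, qpu_ns):
    """Return (substrate, ratio), with ratio = slower / faster (assumed convention)."""
    if qpu_ns <= cpu_ns:
        return "QPU", cpu_ns / qpu_ns
    return "CPU", qpu_ns / cpu_ns


def format_table(rows):
    """Build the comparison table as a list of text lines, header first."""
    lines = [f"{'kernel':<18} {'CPU ns/op':>9}  {'QPU ns/op':>9}  winner"]
    for name, cpu_ns, qpu_ns in rows:
        who, ratio = winner(cpu_ns, qpu_ns)
        lines.append(f"{name:<18} {cpu_ns:>9.2f}  {qpu_ns:>9.2f}  {who} {ratio:.2f}x")
    return lines


if __name__ == "__main__":
    print("\n".join(format_table(ROWS)))
```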

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)
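
The commit does not say how the per-frame sums are derived. The sketch below is one reconstruction that reproduces them, under assumptions that are mine and not stated in the source: the frame is 1920x1080 padded to 1088 rows (8160 macroblocks), every macroblock runs 16 IDCT 4x4 ops and 4 qpel mc22 8x8 ops, and "deblock luma" means 4 luma_v plus 4 luma_h ops per macroblock. All names are hypothetical.

```python
# Per-op timings in ns, copied from the table in this commit message.
NS_PER_OP = {
    "cpu": {"idct4": 10.79, "deblock_v": 17.58, "deblock_h": 38.41, "mc22": 71.58},
    "qpu": {"idct4": 2.47, "deblock_v": 10.21, "deblock_h": 9.98, "mc22": 9.64},
}

# ASSUMPTION: 1080p is coded as 1920x1088, i.e. 120 x 68 macroblocks of 16x16.
MACROBLOCKS = (1920 // 16) * ((1080 + 15) // 16)

# ASSUMPTION: worst-case op counts per macroblock (not given in the commit).
OPS_PER_MB = {"idct4": 16, "deblock_v": 4, "deblock_h": 4, "mc22": 4}


def frame_ms(substrate):
    """Sum ns/op times op count over all kernels and convert to milliseconds."""
    total_ns = sum(
        NS_PER_OP[substrate][kernel] * OPS_PER_MB[kernel] * MACROBLOCKS
        for kernel in OPS_PER_MB
    )
    return total_ns / 1e6


if __name__ == "__main__":
    cpu_ms = frame_ms("cpu")
    qpu_ms = frame_ms("qpu")
    print(f"CPU NEON only: {cpu_ms:.2f} ms")
    print(f"QPU only:      {qpu_ms:.2f} ms  (ratio {cpu_ms / qpu_ms:.2f}x)")
```

Under these assumed counts the totals round to the reported 5.57 ms and 1.30 ms, and the ratio to 4.30x. That supports the reconstruction but does not confirm it is what the bench does.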

This reverses PR #10's verdict, which had CPU NEON 4x faster than
QPU for IDCT-only: the buffer-pool and persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows the CPU ahead, and only by 1.21x.
It is a single-axis vertical filter whose row-strided memory pattern
is unfriendly to the WG layout; this is left as a follow-up for
cycle-9-style targeted tuning.

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print, fixing the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra.
These have been wrong since PRs #28, #29, #35 (those kernels are on
QPU).

Date: 2026-05-25 20:42:39 +02:00
