bench: H.264 primitive bench now measures both substrates + comparison table

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
are now wrong since PRs #28, #29, #35 (those kernels are on QPU).
This commit is contained in:
2026-05-25 20:41:38 +02:00
parent 1446b779a6
commit 989818c2e6
2 changed files with 126 additions and 77 deletions
+4 -4
View File
@@ -683,13 +683,13 @@ int main(void)
printf(" H264_QPEL_MC20 recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_QPEL_MC20));
printf(" H264_DEBLOCK_LH recipe substrate: %d (CPU, no QPU H shader yet)\n",
printf(" H264_DEBLOCK_LH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LH));
printf(" H264_DEBLOCK_CV recipe substrate: %d (CPU)\n",
printf(" H264_DEBLOCK_CV recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CV));
printf(" H264_DEBLOCK_CH recipe substrate: %d (CPU)\n",
printf(" H264_DEBLOCK_CH recipe substrate: %d\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_CH));
printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (CPU, bS=4 set)\n",
printf(" H264_DEBLOCK_*_INTRA recipe substrate: %d (bS=4 family, all on QPU)\n",
(int) daedalus_recipe_substrate_for(DAEDALUS_KERNEL_H264_DEBLOCK_LV_INTRA));
int fail = 0;