daedalus-fourier/tests at 989818c2e6e17eaa786bce9478ef14164a66a2c8 - daedalus-fourier - marfrit's space

marfrit/daedalus-fourier

claude-noether 989818c2e6 bench: H.264 primitive bench now measures both substrates + comparison table

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures a single substrate.  It now runs two passes, CPU NEON and
QPU V3D7 compute, and prints a side-by-side per-kernel comparison
with the ratio between them.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only): the buffer-pool and persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows the CPU ahead, by 1.21x.  It is a
single-axis vertical filter whose row-strided memory pattern is
unfriendly to the workgroup (WG) layout; this is left as a follow-up
for cycle-9-style targeted tuning.

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
have been wrong since PRs #28, #29, #35 (those kernels now run on
the QPU).

2026-05-25 20:42:39 +02:00

..

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

.gitkeep

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

bench_concurrent_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_concurrent_lpf.c

Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

2026-05-18 12:39:26 +00:00

bench_concurrent_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_concurrent_mixed.c

Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper

2026-05-18 14:44:21 +00:00

bench_concurrent.c

Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues

2026-05-18 12:18:36 +00:00

bench_h264_primitives.c

bench: H.264 primitive bench now measures both substrates + comparison table

2026-05-25 20:42:39 +02:00

bench_neon_cdef.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

bench_neon_h264deblock.c

Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

2026-05-18 14:39:36 +00:00

bench_neon_h264idct4.c

Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

2026-05-18 14:14:43 +00:00

bench_neon_h264idct8.c

Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred

2026-05-18 14:16:42 +00:00

bench_neon_h264qpel_mc20.c

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

bench_neon_idct.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

bench_neon_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_neon_lpf.c

Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

2026-05-18 12:28:57 +00:00

bench_neon_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_pool_overhead.c

v3d_runner: buffer pool for QPU dispatch hot path

2026-05-23 19:52:50 +02:00

bench_v3d_cdef.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

bench_v3d_h264deblock.c

Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper

2026-05-18 14:44:21 +00:00

bench_v3d_idct.c

Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz

2026-05-18 12:09:00 +00:00

bench_v3d_lpf8.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

bench_v3d_lpf.c

Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

2026-05-18 12:39:26 +00:00

bench_v3d_mc.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00

bench_vulkan_dispatch.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

cdef_ref.c

Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper

2026-05-18 13:52:46 +00:00

h264_chroma_dc_hadamard_ref.c

h264: chroma DC 2x2 Hadamard pre-pass primitive

2026-05-25 11:18:59 +02:00

h264_chroma_loop_filter_ref.c

h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)

2026-05-24 23:53:09 +02:00

h264_deblock_ref.c

Cycle 8 Phase 3 closed: H.264 deblock NEON = 92 Medge/s

2026-05-18 14:39:36 +00:00

h264_h_loop_filter_luma_ref.c

h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter

2026-05-24 23:28:56 +02:00

h264_idct4_ref.c

Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s

2026-05-18 14:14:43 +00:00

h264_idct8_ref.c

Cycle 7 closed: H.264 IDCT 8x8 = 151 Mblock/s NEON, Phase 4 deferred

2026-05-18 14:16:42 +00:00

h264_intra_loop_filter_ref.c

h264: deblock bS=4 intra variants (luma + chroma, V + H)

2026-05-25 00:00:46 +02:00

h264_qpel8_avg_anchors_ref.c

h264: qpel avg anchors (avg_mc20/02/22, biprediction support)

2026-05-25 08:35:25 +02:00

h264_qpel8_avg_rest_ref.c

h264: qpel avg — 12 remaining variants (closes the matrix)

2026-05-25 08:49:42 +02:00

h264_qpel8_diag_ref.c

h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)

2026-05-25 07:49:12 +02:00

h264_qpel8_mc02_ref.c

h264: qpel mc02 (vertical half-pel, CPU/NEON)

2026-05-25 00:47:37 +02:00

h264_qpel8_mc20_ref.c

Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only

2026-05-18 14:53:21 +00:00

h264_qpel8_mc22_ref.c

h264: qpel mc22 (2D half-pel, CPU/NEON)

2026-05-25 01:03:14 +02:00

h264_qpel8_quarter_axis_ref.c

h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)

2026-05-25 01:29:52 +02:00

test_api_h264.c

bench: H.264 primitive bench now measures both substrates + comparison table

2026-05-25 20:42:39 +02:00

test_api_idct.c

Phase 8: wire IDCT QPU dispatch through public API

2026-05-18 13:55:55 +00:00

test_api_lpf.c

Phase 8: wire LPF wd=4 + wd=8 QPU through public API

2026-05-18 13:57:25 +00:00

test_api_opportunistic_qpu.c

Phase 8b: opportunistic QPU paths through public API

2026-05-18 14:50:41 +00:00

test_chroma_dc_hadamard.c

h264: expose chroma DC 2x2 Hadamard as public API

2026-05-25 13:32:01 +02:00

test_intra_pred_4x4.c

h264: promote Intra_4x4 luma prediction (9 modes) to public API

2026-05-25 14:53:37 +02:00

test_intra_pred_8x8_luma.c

h264: promote remaining intra prediction modes (17) to public API

2026-05-25 15:37:44 +02:00

test_intra_pred_16x16.c

h264: promote remaining intra prediction modes (17) to public API

2026-05-25 15:37:44 +02:00

test_intra_pred_chroma8x8.c

h264: promote remaining intra prediction modes (17) to public API

2026-05-25 15:37:44 +02:00

vp9_idct8_ref.c

Path B pivot + Phase 0-3 closed with first baseline numbers

2026-05-18 11:30:12 +00:00

vp9_lpf8_ref.c

Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS

2026-05-18 12:56:25 +00:00

vp9_lpf_ref.c

Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline

2026-05-18 12:28:57 +00:00

vp9_mc_ref.c

Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%

2026-05-18 12:51:43 +00:00