daedalus-fourier

marfrit/daedalus-fourier

Fork 0

Commit Graph

Author	SHA1	Message	Date
claude-noether	989818c2e6	bench: H.264 primitive bench now measures both substrates + comparison table Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path). Now that all H.264 hot-path primitives have QPU shaders and the dispatch overhead has been hammered down (tasks #160 buffer pool, #161 persistent command buffer), bench_h264_primitives no longer measures one column. Two passes — CPU NEON and QPU V3D7 compute — with a side-by-side per-kernel comparison and ratio. Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup): kernel CPU ns/op QPU ns/op winner IDCT 4x4 luma 10.79 2.47 QPU 4.36x IDCT 8x8 luma 29.69 9.23 QPU 3.22x Deblock luma_v 17.58 10.21 QPU 1.72x Deblock luma_h 38.41 9.98 QPU 3.85x qpel mc20 (8x8) 28.24 9.66 QPU 2.92x qpel mc02 (8x8) 16.96 20.54 CPU 1.21x qpel mc22 (8x8) 71.58 9.64 QPU 7.43x 1080p worst-case sum (IDCT4 + deblock luma + qpel mc22): CPU NEON only: 5.57 ms QPU only: 1.30 ms (CPU/QPU sum ratio = 4.30x) Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land hard. Only qpel mc02 still shows CPU ahead, marginally (single- axis vertical filter, row-strided memory pattern unfriendly to the WG layout — left as a follow-up for cycle-9-style targeted tuning). Substrate decree (2026-05-23) stays in force as policy — these numbers retroactively justify it. Also tightens test_api_h264's startup recipe print: the stale "(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra are now wrong since PRs #28, #29, #35 (those kernels are on QPU).	2026-05-25 20:42:39 +02:00
claude-noether	ba5bbae8e2	bench: H.264 primitives NEON CPU baseline (1080p budget projection) Adds bench_h264_primitives — a non-ctest binary that times the H.264 pixel-math primitives at their representative per-frame N and projects 1080p frame budgets. Lets us answer "how much of the 33-ms 30fps deadline does the pixel-math layer eat on NEON alone, before the intercept patch adds entropy decode + metadata work." Results on hertz (Pi 5 / 4×Cortex-A76, NEON path): Per-kernel ns/op (CPU NEON): IDCT 4x4 luma 10.78 ns/block IDCT 8x8 luma 29.73 ns/block Deblock luma_v 18.04 ns/edge Deblock luma_h 41.65 ns/edge (H access pattern less SIMD-friendly) qpel mc20 (H half-pel) 25.66 ns/block qpel mc02 (V half-pel) 15.06 ns/block (faster than mc20!) qpel mc22 (HV half-pel) 71.50 ns/block (cascaded H+V, expected) Projected 1080p frame budgets (worst-case, CPU NEON only): IDCT 4x4 (all-4x4 MBs): 1.41 ms (130,560 blocks) IDCT 8x8 (all-8x8 MBs): 0.97 ms ( 32,640 blocks) Deblock luma_v (all MBs): 0.59 ms ( 32,640 edges) Deblock luma_h (all MBs): 1.36 ms ( 32,640 edges) qpel mc22 (all 8x8 blocks): 2.33 ms ( 32,640 blocks) Sum (IDCT 4x4 + deblock luma + MC all-mc22): 5.69 ms 30 fps deadline: 33.33 ms Margin: +27.64 ms What this validates: - The "30fps@1080p is the fine floor" memory note holds with huge headroom on the pixel-math layer alone. 17% of the deadline goes to pixel math (worst case); 83% is available for entropy decode + reference frame management + intra prediction + chroma deblock + chroma IDCT + the libavcodec intercept overhead. - The CPU-vs-QPU substrate finding from earlier (PR #10 on daedalus-decoder showed CPU NEON is 4x faster than QPU for IDCT) is consistent here. All these kernels have CPU-only recipes by default; the data suggests that's the right call for now. The recipe substrate decision can be revisited per-kernel once QPU shaders catch up. - mc22 (2D HV half-pel) is the most expensive single qpel position at ~71 ns/block — 2-7x more than the 1D variants. Real B-slice biprediction with two mc22 calls per MB would add ~4.7 ms/frame; still comfortable but worth knowing. What this DOESN'T measure (intentionally — they aren't on the critical path at NEON speeds): - Chroma IDCT (4 cb + 4 cr 4x4 per MB). At similar ns/block to luma, that's ~0.7 ms/frame. - Chroma deblock (smaller tile, simpler kernel — sub-ms). - Intra prediction (per-block, ~50 ops at NEON, but serialized in z-scan order so cache-friendly; ~0.5 ms/frame estimate). - bS=4 intra deblock variants — different algorithm, similar cost to bS<4. - chroma DC Hadamard — trivial. Adding all of those in the worst case would maybe double the 5.69 ms number to ~12 ms. Still leaves 20+ ms for entropy decode + metadata work in the intercept patch.	2026-05-25 11:26:11 +02:00

Author

SHA1

Message

Date

claude-noether

989818c2e6

bench: H.264 primitive bench now measures both substrates + comparison table

Closes task #166 (re-measure R-bands on post-buffer-pool dispatch path).

Now that all H.264 hot-path primitives have QPU shaders and the
dispatch overhead has been hammered down (tasks #160 buffer pool,
#161 persistent command buffer), bench_h264_primitives no longer
measures one column.  Two passes — CPU NEON and QPU V3D7 compute —
with a side-by-side per-kernel comparison and ratio.

Headline result on hertz (Pi 5 V3D 7.1, 30 iters x 5 warmup):

  kernel             CPU ns/op  QPU ns/op  winner
  IDCT 4x4 luma          10.79       2.47  QPU 4.36x
  IDCT 8x8 luma          29.69       9.23  QPU 3.22x
  Deblock luma_v         17.58      10.21  QPU 1.72x
  Deblock luma_h         38.41       9.98  QPU 3.85x
  qpel mc20 (8x8)        28.24       9.66  QPU 2.92x
  qpel mc02 (8x8)        16.96      20.54  CPU 1.21x
  qpel mc22 (8x8)        71.58       9.64  QPU 7.43x

  1080p worst-case sum (IDCT4 + deblock luma + qpel mc22):
    CPU NEON only:  5.57 ms
    QPU only:       1.30 ms   (CPU/QPU sum ratio = 4.30x)

Reverses PR #10's verdict (which had CPU NEON 4x faster than QPU
for IDCT-only) — the buffer-pool + persistent-cmdbuf wins land
hard.  Only qpel mc02 still shows CPU ahead, marginally (single-
axis vertical filter, row-strided memory pattern unfriendly to the
WG layout — left as a follow-up for cycle-9-style targeted tuning).

Substrate decree (2026-05-23) stays in force as policy — these
numbers retroactively justify it.

Also tightens test_api_h264's startup recipe print: the stale
"(CPU)" / "(CPU, no QPU H shader yet)" / "(CPU, bS=4 set)" labels
next to deblock_lh, deblock_cv, deblock_ch and deblock_*_intra
are now wrong since PRs #28, #29, #35 (those kernels are on QPU).

2026-05-25 20:42:39 +02:00

claude-noether

ba5bbae8e2

bench: H.264 primitives NEON CPU baseline (1080p budget projection)

Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets.  Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."

Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):

  Per-kernel ns/op (CPU NEON):
    IDCT 4x4 luma            10.78 ns/block
    IDCT 8x8 luma            29.73 ns/block
    Deblock luma_v           18.04 ns/edge
    Deblock luma_h           41.65 ns/edge   (H access pattern less SIMD-friendly)
    qpel mc20  (H half-pel)  25.66 ns/block
    qpel mc02  (V half-pel)  15.06 ns/block  (faster than mc20!)
    qpel mc22  (HV half-pel) 71.50 ns/block  (cascaded H+V, expected)

  Projected 1080p frame budgets (worst-case, CPU NEON only):
    IDCT 4x4 (all-4x4 MBs):       1.41 ms   (130,560 blocks)
    IDCT 8x8 (all-8x8 MBs):       0.97 ms   ( 32,640 blocks)
    Deblock luma_v (all MBs):     0.59 ms   ( 32,640 edges)
    Deblock luma_h (all MBs):     1.36 ms   ( 32,640 edges)
    qpel mc22 (all 8x8 blocks):   2.33 ms   ( 32,640 blocks)

    Sum (IDCT 4x4 + deblock luma + MC all-mc22):    5.69 ms
    30 fps deadline:                              33.33 ms
    Margin:                                       +27.64 ms

What this validates:

  - The "30fps@1080p is the fine floor" memory note holds with
    huge headroom on the pixel-math layer alone.  17% of the
    deadline goes to pixel math (worst case); 83% is available
    for entropy decode + reference frame management + intra
    prediction + chroma deblock + chroma IDCT + the libavcodec
    intercept overhead.
  - The CPU-vs-QPU substrate finding from earlier (PR #10 on
    daedalus-decoder showed CPU NEON is 4x faster than QPU for
    IDCT) is consistent here.  All these kernels have CPU-only
    recipes by default; the data suggests that's the right call
    for now.  The recipe substrate decision can be revisited
    per-kernel once QPU shaders catch up.
  - mc22 (2D HV half-pel) is the most expensive single qpel
    position at ~71 ns/block — 2-7x more than the 1D variants.
    Real B-slice biprediction with two mc22 calls per MB would
    add ~4.7 ms/frame; still comfortable but worth knowing.

What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):

  - Chroma IDCT (4 cb + 4 cr 4x4 per MB).  At similar ns/block to
    luma, that's ~0.7 ms/frame.
  - Chroma deblock (smaller tile, simpler kernel — sub-ms).
  - Intra prediction (per-block, ~50 ops at NEON, but serialized
    in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
  - bS=4 intra deblock variants — different algorithm, similar
    cost to bS<4.
  - chroma DC Hadamard — trivial.

Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms.  Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.

2026-05-25 11:26:11 +02:00

2 Commits