bench: H.264 primitives NEON CPU baseline (1080p budget projection) #24

Merged

marfrit merged 1 commits from noether/h264-primitives-bench into main

2026-05-25 09:51:21 +00:00

Author	SHA1	Message	Date
claude-noether	ba5bbae8e2	bench: H.264 primitives NEON CPU baseline (1080p budget projection) Adds bench_h264_primitives — a non-ctest binary that times the H.264 pixel-math primitives at their representative per-frame N and projects 1080p frame budgets. Lets us answer "how much of the 33-ms 30fps deadline does the pixel-math layer eat on NEON alone, before the intercept patch adds entropy decode + metadata work." Results on hertz (Pi 5 / 4×Cortex-A76, NEON path): Per-kernel ns/op (CPU NEON): IDCT 4x4 luma 10.78 ns/block IDCT 8x8 luma 29.73 ns/block Deblock luma_v 18.04 ns/edge Deblock luma_h 41.65 ns/edge (H access pattern less SIMD-friendly) qpel mc20 (H half-pel) 25.66 ns/block qpel mc02 (V half-pel) 15.06 ns/block (faster than mc20!) qpel mc22 (HV half-pel) 71.50 ns/block (cascaded H+V, expected) Projected 1080p frame budgets (worst-case, CPU NEON only): IDCT 4x4 (all-4x4 MBs): 1.41 ms (130,560 blocks) IDCT 8x8 (all-8x8 MBs): 0.97 ms ( 32,640 blocks) Deblock luma_v (all MBs): 0.59 ms ( 32,640 edges) Deblock luma_h (all MBs): 1.36 ms ( 32,640 edges) qpel mc22 (all 8x8 blocks): 2.33 ms ( 32,640 blocks) Sum (IDCT 4x4 + deblock luma + MC all-mc22): 5.69 ms 30 fps deadline: 33.33 ms Margin: +27.64 ms What this validates: - The "30fps@1080p is the fine floor" memory note holds with huge headroom on the pixel-math layer alone. 17% of the deadline goes to pixel math (worst case); 83% is available for entropy decode + reference frame management + intra prediction + chroma deblock + chroma IDCT + the libavcodec intercept overhead. - The CPU-vs-QPU substrate finding from earlier (PR #10 on daedalus-decoder showed CPU NEON is 4x faster than QPU for IDCT) is consistent here. All these kernels have CPU-only recipes by default; the data suggests that's the right call for now. The recipe substrate decision can be revisited per-kernel once QPU shaders catch up. - mc22 (2D HV half-pel) is the most expensive single qpel position at ~71 ns/block — 2-7x more than the 1D variants. Real B-slice biprediction with two mc22 calls per MB would add ~4.7 ms/frame; still comfortable but worth knowing. What this DOESN'T measure (intentionally — they aren't on the critical path at NEON speeds): - Chroma IDCT (4 cb + 4 cr 4x4 per MB). At similar ns/block to luma, that's ~0.7 ms/frame. - Chroma deblock (smaller tile, simpler kernel — sub-ms). - Intra prediction (per-block, ~50 ops at NEON, but serialized in z-scan order so cache-friendly; ~0.5 ms/frame estimate). - bS=4 intra deblock variants — different algorithm, similar cost to bS<4. - chroma DC Hadamard — trivial. Adding all of those in the worst case would maybe double the 5.69 ms number to ~12 ms. Still leaves 20+ ms for entropy decode + metadata work in the intercept patch.	2026-05-25 11:26:11 +02:00

Author

SHA1

Message

Date

claude-noether

ba5bbae8e2

bench: H.264 primitives NEON CPU baseline (1080p budget projection)

Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets.  Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."

Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):

  Per-kernel ns/op (CPU NEON):
    IDCT 4x4 luma            10.78 ns/block
    IDCT 8x8 luma            29.73 ns/block
    Deblock luma_v           18.04 ns/edge
    Deblock luma_h           41.65 ns/edge   (H access pattern less SIMD-friendly)
    qpel mc20  (H half-pel)  25.66 ns/block
    qpel mc02  (V half-pel)  15.06 ns/block  (faster than mc20!)
    qpel mc22  (HV half-pel) 71.50 ns/block  (cascaded H+V, expected)

  Projected 1080p frame budgets (worst-case, CPU NEON only):
    IDCT 4x4 (all-4x4 MBs):       1.41 ms   (130,560 blocks)
    IDCT 8x8 (all-8x8 MBs):       0.97 ms   ( 32,640 blocks)
    Deblock luma_v (all MBs):     0.59 ms   ( 32,640 edges)
    Deblock luma_h (all MBs):     1.36 ms   ( 32,640 edges)
    qpel mc22 (all 8x8 blocks):   2.33 ms   ( 32,640 blocks)

    Sum (IDCT 4x4 + deblock luma + MC all-mc22):    5.69 ms
    30 fps deadline:                              33.33 ms
    Margin:                                       +27.64 ms

What this validates:

  - The "30fps@1080p is the fine floor" memory note holds with
    huge headroom on the pixel-math layer alone.  17% of the
    deadline goes to pixel math (worst case); 83% is available
    for entropy decode + reference frame management + intra
    prediction + chroma deblock + chroma IDCT + the libavcodec
    intercept overhead.
  - The CPU-vs-QPU substrate finding from earlier (PR #10 on
    daedalus-decoder showed CPU NEON is 4x faster than QPU for
    IDCT) is consistent here.  All these kernels have CPU-only
    recipes by default; the data suggests that's the right call
    for now.  The recipe substrate decision can be revisited
    per-kernel once QPU shaders catch up.
  - mc22 (2D HV half-pel) is the most expensive single qpel
    position at ~71 ns/block — 2-7x more than the 1D variants.
    Real B-slice biprediction with two mc22 calls per MB would
    add ~4.7 ms/frame; still comfortable but worth knowing.

What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):

  - Chroma IDCT (4 cb + 4 cr 4x4 per MB).  At similar ns/block to
    luma, that's ~0.7 ms/frame.
  - Chroma deblock (smaller tile, simpler kernel — sub-ms).
  - Intra prediction (per-block, ~50 ops at NEON, but serialized
    in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
  - bS=4 intra deblock variants — different algorithm, similar
    cost to bS<4.
  - chroma DC Hadamard — trivial.

Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms.  Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.

2026-05-25 11:26:11 +02:00

bench: H.264 primitives NEON CPU baseline (1080p budget projection) #24

1 Commits