ba5bbae8e2
Adds bench_h264_primitives — a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets. Lets us answer "how much of the
33-ms 30fps deadline does the pixel-math layer eat on NEON alone,
before the intercept patch adds entropy decode + metadata work."
Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):
Per-kernel ns/op (CPU NEON):
IDCT 4x4 luma 10.78 ns/block
IDCT 8x8 luma 29.73 ns/block
Deblock luma_v 18.04 ns/edge
Deblock luma_h 41.65 ns/edge (H access pattern less SIMD-friendly)
qpel mc20 (H half-pel) 25.66 ns/block
qpel mc02 (V half-pel) 15.06 ns/block (faster than mc20!)
qpel mc22 (HV half-pel) 71.50 ns/block (cascaded H+V, expected)
Projected 1080p frame budgets (worst-case, CPU NEON only):
IDCT 4x4 (all-4x4 MBs): 1.41 ms (130,560 blocks)
IDCT 8x8 (all-8x8 MBs): 0.97 ms ( 32,640 blocks)
Deblock luma_v (all MBs): 0.59 ms ( 32,640 edges)
Deblock luma_h (all MBs): 1.36 ms ( 32,640 edges)
qpel mc22 (all 8x8 blocks): 2.33 ms ( 32,640 blocks)
Sum (IDCT 4x4 + deblock luma + MC all-mc22): 5.69 ms
30 fps deadline: 33.33 ms
Margin: +27.64 ms
What this validates:
- The "30fps@1080p is the fine floor" memory note holds with
huge headroom on the pixel-math layer alone. 17% of the
deadline goes to pixel math (worst case); 83% is available
for entropy decode + reference frame management + intra
prediction + chroma deblock + chroma IDCT + the libavcodec
intercept overhead.
- The CPU-vs-QPU substrate finding from earlier (PR #10 on
daedalus-decoder showed CPU NEON is 4x faster than QPU for
IDCT) is consistent here. All these kernels have CPU-only
recipes by default; the data suggests that's the right call
for now. The recipe substrate decision can be revisited
per-kernel once QPU shaders catch up.
- mc22 (2D HV half-pel) is the most expensive single qpel
position at ~71 ns/block — 2-7x more than the 1D variants.
Real B-slice biprediction with two mc22 calls per MB would
add ~4.7 ms/frame; still comfortable but worth knowing.
What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):
- Chroma IDCT (4 cb + 4 cr 4x4 per MB). At similar ns/block to
luma, that's ~0.7 ms/frame.
- Chroma deblock (smaller tile, simpler kernel — sub-ms).
- Intra prediction (per-block, ~50 ops at NEON, but serialized
in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
- bS=4 intra deblock variants — different algorithm, similar
cost to bS<4.
- chroma DC Hadamard — trivial.
Adding all of those in the worst case would maybe double the 5.69
ms number to ~12 ms. Still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.