bench: H.264 primitives NEON CPU baseline (1080p budget projection) #24

2026-05-25T09:26:40Z

marfrit commented

2026-05-25 09:26:40 +00:00

Adds bench_h264_primitives, a benchmark that times each pixel-math primitive at a representative per-frame N and projects 1080p frame budgets. This quantifies how much of the 33.33 ms per-frame deadline at 30 fps the pixel-math layer consumes on NEON alone.

Headline numbers on hertz (Pi 5 / Cortex-A76 NEON):

Per-kernel ns/op:

IDCT 4x4 luma: 10.78 ns/block
IDCT 8x8 luma: 29.73 ns/block
Deblock luma_v: 18.04 ns/edge
Deblock luma_h: 41.65 ns/edge (less SIMD-friendly access)
qpel mc20: 25.66 ns/block / mc02: 15.06 ns/block / mc22: 71.50 ns/block

1080p budget summary (worst case, CPU NEON only):

IDCT 4x4 + luma deblock + all-mc22 MC = 5.69 ms
30 fps deadline = 33.33 ms
Margin: +27.64 ms for entropy + intra + chroma + intercept overhead

The pixel-math layer takes about 17% of the deadline (5.69 ms of 33.33 ms) on plain NEON. This supports the campaign's '30fps@1080p is fine' memory note with huge headroom, and it is consistent with the prior PR #10 finding that CPU NEON beats QPU for these kernels at current overhead levels.

Not a ctest: it produces wall-time numbers and is meant for manual invocation.

marfrit added 1 commit 2026-05-25 09:26:41 +00:00

bench: H.264 primitives NEON CPU baseline (1080p budget projection) ba5bbae8e2

Adds bench_h264_primitives, a non-ctest binary that times the
H.264 pixel-math primitives at their representative per-frame N and
projects 1080p frame budgets.  Lets us answer: how much of the
33.33 ms deadline at 30 fps does the pixel-math layer eat on NEON
alone, before the intercept patch adds entropy decode + metadata
work?

Results on hertz (Pi 5 / 4×Cortex-A76, NEON path):

  Per-kernel ns/op (CPU NEON):
    IDCT 4x4 luma            10.78 ns/block
    IDCT 8x8 luma            29.73 ns/block
    Deblock luma_v           18.04 ns/edge
    Deblock luma_h           41.65 ns/edge   (H access pattern less SIMD-friendly)
    qpel mc20  (H half-pel)  25.66 ns/block
    qpel mc02  (V half-pel)  15.06 ns/block  (faster than mc20!)
    qpel mc22  (HV half-pel) 71.50 ns/block  (cascaded H+V, expected)

  Projected 1080p frame budgets (worst-case, CPU NEON only):
    IDCT 4x4 (all-4x4 MBs):       1.41 ms   (130,560 blocks)
    IDCT 8x8 (all-8x8 MBs):       0.97 ms   ( 32,640 blocks)
    Deblock luma_v (all MBs):     0.59 ms   ( 32,640 edges)
    Deblock luma_h (all MBs):     1.36 ms   ( 32,640 edges)
    qpel mc22 (all 8x8 blocks):   2.33 ms   ( 32,640 blocks)

    Sum (IDCT 4x4 + deblock luma + MC all-mc22):    5.69 ms
    30 fps deadline:                              33.33 ms
    Margin:                                       +27.64 ms
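
The projection above can be reproduced from the per-kernel numbers. The following is a minimal sketch, not the benchmark's own code: it assumes a 1920x1088 coded frame (120 x 68 = 8,160 macroblocks, which matches the 130,560 and 32,640 counts in the table) and per-macroblock op counts inferred from those same totals. All names are hypothetical.

```python
# Macroblock grid, assuming 1080p is coded as 1920x1088 (height padded to 16).
MB_W = 1920 // 16
MB_H = 1088 // 16
MBS = MB_W * MB_H  # 8160 macroblocks per frame

# Measured cost per op in nanoseconds, copied from the results above.
NS_PER_OP = {
    "idct4x4": 10.78,
    "idct8x8": 29.73,
    "deblock_luma_v": 18.04,
    "deblock_luma_h": 41.65,
    "qpel_mc22": 71.50,
}

# Ops per macroblock, inferred from the block/edge totals in the table
# (130,560 = 16 per MB; 32,640 = 4 per MB). This is an assumption.
OPS_PER_MB = {
    "idct4x4": 16,
    "idct8x8": 4,
    "deblock_luma_v": 4,
    "deblock_luma_h": 4,
    "qpel_mc22": 4,
}


def frame_ms(kernel):
    """Projected per-frame cost of one kernel in milliseconds."""
    return NS_PER_OP[kernel] * OPS_PER_MB[kernel] * MBS / 1e6


# Worst-case sum used in the summary: all-4x4 IDCT, both luma deblock
# directions, and mc22 for every 8x8 block.
WORST_CASE = ["idct4x4", "deblock_luma_v", "deblock_luma_h", "qpel_mc22"]
total_ms = sum(frame_ms(k) for k in WORST_CASE)
deadline_ms = 1000.0 / 30.0
margin_ms = deadline_ms - total_ms
share = total_ms / deadline_ms

if __name__ == "__main__":
    for k in NS_PER_OP:
        print(f"{k:16s} {frame_ms(k):5.2f} ms")
    print(f"sum {total_ms:.2f} ms, margin {margin_ms:+.2f} ms, {share:.0%} of deadline")
```

Keeping the op counts in a table separate from the timings makes it easy to re-run the projection with fresh measurements or a different block-size mix.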

What this validates:

  - The "30fps@1080p is the fine floor" memory note holds with
    huge headroom on the pixel-math layer alone.  17% of the
    deadline goes to pixel math (worst case); 83% is available
    for entropy decode + reference frame management + intra
    prediction + chroma deblock + chroma IDCT + the libavcodec
    intercept overhead.
  - The CPU-vs-QPU substrate finding from earlier (PR #10 on
    daedalus-decoder showed CPU NEON is 4x faster than QPU for
    IDCT) is consistent here.  All these kernels have CPU-only
    recipes by default; the data suggests that's the right call
    for now.  The recipe substrate decision can be revisited
    per-kernel once QPU shaders catch up.
  - mc22 (2D HV half-pel) is the most expensive single qpel
    position at ~71 ns/block, about 2.8x mc20 and 4.7x mc02.
    Real B-slice biprediction with two mc22 calls per 8x8 block
    would cost ~4.7 ms/frame for MC (about 2.3 ms more than the
    single-prediction projection); still comfortable but worth
    knowing.
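
The mc22 comparison can be checked directly against the measured numbers. This is a minimal sketch under one assumption: that biprediction means two mc22 calls for each of the 32,640 8x8 blocks in a frame. Variable names are hypothetical.

```python
# Measured qpel costs in ns/block, copied from the results above.
mc20_ns, mc02_ns, mc22_ns = 25.66, 15.06, 71.50

# How much more expensive the 2D position is than each 1D position.
ratio_vs_mc20 = mc22_ns / mc20_ns
ratio_vs_mc02 = mc22_ns / mc02_ns

# Assumed worst case: every 8x8 block of a 1080p frame uses mc22.
BLOCKS_8X8 = 32_640
single_pred_ms = mc22_ns * BLOCKS_8X8 / 1e6

# Biprediction modelled as two mc22 calls per block (an assumption).
bipred_ms = 2 * single_pred_ms
extra_ms = bipred_ms - single_pred_ms

if __name__ == "__main__":
    print(f"mc22 vs mc20: {ratio_vs_mc20:.1f}x, vs mc02: {ratio_vs_mc02:.1f}x")
    print(f"single {single_pred_ms:.2f} ms, bipred {bipred_ms:.2f} ms (+{extra_ms:.2f} ms)")
```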

What this DOESN'T measure (intentionally — they aren't on the
critical path at NEON speeds):

  - Chroma IDCT (4 cb + 4 cr 4x4 per MB).  At similar ns/block to
    luma, that's ~0.7 ms/frame.
  - Chroma deblock (smaller tile, simpler kernel — sub-ms).
  - Intra prediction (per-block, ~50 ops at NEON, but serialized
    in z-scan order so cache-friendly; ~0.5 ms/frame estimate).
  - bS=4 intra deblock variants — different algorithm, similar
    cost to bS<4.
  - chroma DC Hadamard — trivial.

Adding all of those in the worst case would roughly double the 5.69
ms figure to ~12 ms, which still leaves 20+ ms for entropy decode +
metadata work in the intercept patch.
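
The chroma IDCT estimate and the remaining headroom follow from the same arithmetic. This is a rough sketch only: it assumes 8,160 macroblocks per frame and reuses the luma 4x4 cost for chroma, as the note above does. The ~12 ms figure is the commit's own rough estimate, not something computed here.

```python
MBS = 8160                 # macroblocks per 1080p frame (assumed 120 x 68)
CHROMA_BLOCKS_PER_MB = 8   # 4 Cb + 4 Cr 4x4 blocks
IDCT4X4_NS = 10.78         # measured luma cost, reused for chroma (assumption)

chroma_idct_ms = MBS * CHROMA_BLOCKS_PER_MB * IDCT4X4_NS / 1e6

# Headroom left if the full pixel-math layer really costs about 12 ms.
deadline_ms = 1000.0 / 30.0
rough_total_ms = 12.0      # rough estimate quoted in the commit message
headroom_ms = deadline_ms - rough_total_ms

if __name__ == "__main__":
    print(f"chroma IDCT ~{chroma_idct_ms:.2f} ms/frame")
    print(f"headroom after ~{rough_total_ms:.0f} ms of pixel math: {headroom_ms:.1f} ms")
```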

marfrit merged commit b21b35c74b into main

2026-05-25 09:51:21 +00:00

marfrit deleted branch noether/h264-primitives-bench

2026-05-25 09:51:22 +00:00

marfrit referenced this pull request

2026-05-25 16:38:44 +00:00

h264: V3D shader for qpel mc02 (vertical half-pel) #30

claude-noether referenced this issue from a commit

2026-05-25 16:38:52 +00:00

h264: V3D shader for qpel mc02 (vertical half-pel)

marfrit referenced this pull request

2026-05-25 16:52:42 +00:00