phase1: bench_flush_frame substrate selector + IDCT-layer CPU vs QPU data #10

2026-05-24T21:19:57Z

marfrit commented

2026-05-24 21:19:57 +00:00

Adds argv[5] substrate selector (auto/cpu/qpu) to bench_flush_frame and reports a measured comparison at 1920×1088:

substrate	median	mean	p99	fps
CPU NEON	9.27 ms	11.10 ms	33.06 ms	107.8
QPU V3D7	37.77 ms	37.67 ms	47.27 ms	26.5
AUTO	33.19 ms	36.04 ms	92.23 ms	30.1

For the IDCT-only workload, CPU NEON is 4.1× faster than QPU V3D7. AUTO (per the QPU-substrate decree) lands at 30 fps with no headroom for the remaining decoder stages.

Also surfaces a measurement correction: PR #8's '101 fps median at 1080p' was actually the CPU NEON path — hertz's fourier install at the time predated cycle 6's H.264 QPU shader, so recipe AUTO silently fell back to NEON. Path C 'viable' conclusion still holds; the substrate label was wrong.

The decree's 'dispatch overhead is fixable defect' claim is still aspirational for the H.264 IDCT shaders. Each QPU IDCT round-trip costs ~3.6 ms vs ~2.3 ms total for CPU NEON. daedalus-decoder doesn't take a position; AUTO follows the recipe table per the decree, the substrate selector is the escape hatch.

Adds argv[5] substrate selector (auto/cpu/qpu) to bench_flush_frame and reports a measured comparison at 1920×1088: | substrate | median | mean | p99 | fps | |-----------|--------|------|-----|-----| | **CPU NEON** | 9.27 ms | 11.10 ms | 33.06 ms | **107.8** | | QPU V3D7 | 37.77 ms | 37.67 ms | 47.27 ms | 26.5 | | AUTO | 33.19 ms | 36.04 ms | 92.23 ms | 30.1 | **For the IDCT-only workload, CPU NEON is 4.1× faster than QPU V3D7.** AUTO (per the QPU-substrate decree) lands at 30 fps with no headroom for the remaining decoder stages. Also surfaces a measurement correction: PR #8's '101 fps median at 1080p' was actually the CPU NEON path — hertz's fourier install at the time predated cycle 6's H.264 QPU shader, so recipe AUTO silently fell back to NEON. Path C 'viable' conclusion still holds; the substrate label was wrong. The decree's 'dispatch overhead is fixable defect' claim is still aspirational for the H.264 IDCT shaders. Each QPU IDCT round-trip costs ~3.6 ms vs ~2.3 ms total for CPU NEON. daedalus-decoder doesn't take a position; AUTO follows the recipe table per the decree, the substrate selector is the escape hatch.

marfrit added 1 commit 2026-05-24 21:19:58 +00:00

phase1: bench_flush_frame substrate selector + IDCT-layer QPU vs CPU data 0b6482bc8f

Extends bench_flush_frame with an argv[5] substrate selector
(auto/cpu/qpu). Same enum as test_idct_bitexact's argv[4] — keeps
both binaries' CLI in sync.

The whole point of plumbing the selector through is to put a number
on the "QPU is default substrate" decree (2026-05-23,
feedback_qpu_is_default_substrate.md) for the IDCT layer
specifically. The decree said: "What can be done, will be done in
QPU. Dispatch overhead is fixable defect." This measurement
quantifies the unfixed defect.

Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8
luma MBs + chroma always 4x4. Pi 5 / V3D 7.1 / daedalus-fourier
0.1.0 (with cycle 6/7/9 H.264 IDCT shaders). Hertz, idle system.

Results:

substrate min median mean p99 fps (median)
─────────────────────────────────────────────────────────────
CPU NEON 8.75 9.27 11.10 33.06 107.8
QPU V3D7 31.92 37.77 37.67 47.27 26.5
AUTO 31.99 33.19 36.04 92.23 30.1

Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md).
Stages NOT yet measured: intra prediction, MC, deblock.

Interpretation:

- For the IDCT-only workload at frame batch granularity, CPU NEON
is 4.1x faster than QPU V3D7.
- AUTO → recipe table → QPU per the decree → BELOW the 30 fps
target with no headroom for the remaining decoder stages.
- The earlier "101 fps median at 1080p" measurement reported in
PR #8's commit was actually the CPU NEON path — the daedalus-
fourier install on hertz at the time predated the cycle 6 H.264
QPU shader, so recipe AUTO silently fell back to CPU NEON.
PR #8's "Path C is viable" conclusion stands, but the substrate
label was wrong. Apologies for the misleading number.

What this means for the campaign:

- The decree's "fixable defect" claim is still aspirational for
the H.264 IDCT shaders. The current QPU shader dispatch costs
~3.6 ms per IDCT round-trip (luma 4x4 + luma 8x8 + chroma 4x4 =
~10 ms total cf. CPU's 2.3 ms), which dominates over the compute.
- daedalus-decoder doesn't need to take a position on this — the
AUTO path follows the recipe table and respects the decree.
The substrate selector is the escape hatch when consumers want
to override.
- For the libavcodec intercept patch when it lands, the right
move is probably to start with CPU NEON for IDCT and switch to
QPU once the dispatch overhead drops (issue #162 dmabuf import
+ further pool work on the daedalus-fourier side).

No source change to flush_frame itself; this is purely a measurement
add. The bench is opt-in (not a ctest) — these numbers belong in
commit messages and the campaign log, not in CI gating.

marfrit merged commit 820771d24b into main

2026-05-24 21:22:43 +00:00

marfrit deleted branch noether/phase1-bench-substrate

2026-05-24 21:22:43 +00:00

Sign in to join this conversation.

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: marfrit/daedalus-decoder#10