phase1: bench_flush_frame substrate selector + IDCT-layer CPU vs QPU data #10
Reference in New Issue
Block a user
Delete Branch "noether/phase1-bench-substrate"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Adds argv[5] substrate selector (auto/cpu/qpu) to bench_flush_frame and reports a measured comparison at 1920×1088:
For the IDCT-only workload, CPU NEON is 4.1× faster than QPU V3D7. AUTO (per the QPU-substrate decree) lands at 30 fps with no headroom for the remaining decoder stages.
Also surfaces a measurement correction: PR #8's '101 fps median at 1080p' was actually the CPU NEON path — hertz's fourier install at the time predated cycle 6's H.264 QPU shader, so recipe AUTO silently fell back to NEON. Path C 'viable' conclusion still holds; the substrate label was wrong.
The decree's 'dispatch overhead is fixable defect' claim is still aspirational for the H.264 IDCT shaders. Each QPU IDCT round-trip costs ~3.6 ms vs ~2.3 ms total for CPU NEON. daedalus-decoder doesn't take a position; AUTO follows the recipe table per the decree, the substrate selector is the escape hatch.
Extends bench_flush_frame with an argv[5] substrate selector (auto/cpu/qpu). Same enum as test_idct_bitexact's argv[4] — keeps both binaries' CLI in sync. The whole point of plumbing the selector through is to put a number on the "QPU is default substrate" decree (2026-05-23, feedback_qpu_is_default_substrate.md) for the IDCT layer specifically. The decree said: "What can be done, will be done in QPU. Dispatch overhead is fixable defect." This measurement quantifies the unfixed defect. Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8 luma MBs + chroma always 4x4. Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 (with cycle 6/7/9 H.264 IDCT shaders). Hertz, idle system. Results: substrate min median mean p99 fps (median) ───────────────────────────────────────────────────────────── CPU NEON 8.75 9.27 11.10 33.06 107.8 QPU V3D7 31.92 37.77 37.67 47.27 26.5 AUTO 31.99 33.19 36.04 92.23 30.1 Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md). Stages NOT yet measured: intra prediction, MC, deblock. Interpretation: - For the IDCT-only workload at frame batch granularity, CPU NEON is 4.1x faster than QPU V3D7. - AUTO → recipe table → QPU per the decree → BELOW the 30 fps target with no headroom for the remaining decoder stages. - The earlier "101 fps median at 1080p" measurement reported in PR #8's commit was actually the CPU NEON path — the daedalus- fourier install on hertz at the time predated the cycle 6 H.264 QPU shader, so recipe AUTO silently fell back to CPU NEON. PR #8's "Path C is viable" conclusion stands, but the substrate label was wrong. Apologies for the misleading number. What this means for the campaign: - The decree's "fixable defect" claim is still aspirational for the H.264 IDCT shaders. The current QPU shader dispatch costs ~3.6 ms per IDCT round-trip (luma 4x4 + luma 8x8 + chroma 4x4 = ~10 ms total cf. CPU's 2.3 ms), which dominates over the compute. - daedalus-decoder doesn't need to take a position on this — the AUTO path follows the recipe table and respects the decree. The substrate selector is the escape hatch when consumers want to override. - For the libavcodec intercept patch when it lands, the right move is probably to start with CPU NEON for IDCT and switch to QPU once the dispatch overhead drops (issue #162 dmabuf import + further pool work on the daedalus-fourier side). No source change to flush_frame itself; this is purely a measurement add. The bench is opt-in (not a ctest) — these numbers belong in commit messages and the campaign log, not in CI gating.