phase1: bench_flush_frame substrate selector + IDCT-layer CPU vs QPU data #10

Merged
marfrit merged 1 commits from noether/phase1-bench-substrate into main 2026-05-24 21:22:43 +00:00
Owner

Adds argv[5] substrate selector (auto/cpu/qpu) to bench_flush_frame and reports a measured comparison at 1920×1088:

substrate median mean p99 fps
CPU NEON 9.27 ms 11.10 ms 33.06 ms 107.8
QPU V3D7 37.77 ms 37.67 ms 47.27 ms 26.5
AUTO 33.19 ms 36.04 ms 92.23 ms 30.1

For the IDCT-only workload, CPU NEON is 4.1× faster than QPU V3D7. AUTO (per the QPU-substrate decree) lands at 30 fps with no headroom for the remaining decoder stages.

Also surfaces a measurement correction: PR #8's '101 fps median at 1080p' was actually the CPU NEON path — hertz's fourier install at the time predated cycle 6's H.264 QPU shader, so recipe AUTO silently fell back to NEON. Path C 'viable' conclusion still holds; the substrate label was wrong.

The decree's 'dispatch overhead is fixable defect' claim is still aspirational for the H.264 IDCT shaders. Each QPU IDCT round-trip costs ~3.6 ms vs ~2.3 ms total for CPU NEON. daedalus-decoder doesn't take a position; AUTO follows the recipe table per the decree, the substrate selector is the escape hatch.

Adds argv[5] substrate selector (auto/cpu/qpu) to bench_flush_frame and reports a measured comparison at 1920×1088: | substrate | median | mean | p99 | fps | |-----------|--------|------|-----|-----| | **CPU NEON** | 9.27 ms | 11.10 ms | 33.06 ms | **107.8** | | QPU V3D7 | 37.77 ms | 37.67 ms | 47.27 ms | 26.5 | | AUTO | 33.19 ms | 36.04 ms | 92.23 ms | 30.1 | **For the IDCT-only workload, CPU NEON is 4.1× faster than QPU V3D7.** AUTO (per the QPU-substrate decree) lands at 30 fps with no headroom for the remaining decoder stages. Also surfaces a measurement correction: PR #8's '101 fps median at 1080p' was actually the CPU NEON path — hertz's fourier install at the time predated cycle 6's H.264 QPU shader, so recipe AUTO silently fell back to NEON. Path C 'viable' conclusion still holds; the substrate label was wrong. The decree's 'dispatch overhead is fixable defect' claim is still aspirational for the H.264 IDCT shaders. Each QPU IDCT round-trip costs ~3.6 ms vs ~2.3 ms total for CPU NEON. daedalus-decoder doesn't take a position; AUTO follows the recipe table per the decree, the substrate selector is the escape hatch.
marfrit added 1 commit 2026-05-24 21:19:58 +00:00
Extends bench_flush_frame with an argv[5] substrate selector
(auto/cpu/qpu).  Same enum as test_idct_bitexact's argv[4] — keeps
both binaries' CLI in sync.

The whole point of plumbing the selector through is to put a number
on the "QPU is default substrate" decree (2026-05-23,
feedback_qpu_is_default_substrate.md) for the IDCT layer
specifically.  The decree said: "What can be done, will be done in
QPU.  Dispatch overhead is fixable defect."  This measurement
quantifies the unfixed defect.

Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8
luma MBs + chroma always 4x4.  Pi 5 / V3D 7.1 / daedalus-fourier
0.1.0 (with cycle 6/7/9 H.264 IDCT shaders).  Hertz, idle system.

Results:

  substrate   min     median   mean    p99      fps (median)
  ─────────────────────────────────────────────────────────────
  CPU NEON    8.75    9.27     11.10   33.06    107.8
  QPU V3D7    31.92   37.77    37.67   47.27    26.5
  AUTO        31.99   33.19    36.04   92.23    30.1

  Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md).
  Stages NOT yet measured: intra prediction, MC, deblock.

Interpretation:

  - For the IDCT-only workload at frame batch granularity, CPU NEON
    is 4.1x faster than QPU V3D7.
  - AUTO → recipe table → QPU per the decree → BELOW the 30 fps
    target with no headroom for the remaining decoder stages.
  - The earlier "101 fps median at 1080p" measurement reported in
    PR #8's commit was actually the CPU NEON path — the daedalus-
    fourier install on hertz at the time predated the cycle 6 H.264
    QPU shader, so recipe AUTO silently fell back to CPU NEON.
    PR #8's "Path C is viable" conclusion stands, but the substrate
    label was wrong.  Apologies for the misleading number.

What this means for the campaign:

  - The decree's "fixable defect" claim is still aspirational for
    the H.264 IDCT shaders.  The current QPU shader dispatch costs
    ~3.6 ms per IDCT round-trip (luma 4x4 + luma 8x8 + chroma 4x4 =
    ~10 ms total cf. CPU's 2.3 ms), which dominates over the compute.
  - daedalus-decoder doesn't need to take a position on this — the
    AUTO path follows the recipe table and respects the decree.
    The substrate selector is the escape hatch when consumers want
    to override.
  - For the libavcodec intercept patch when it lands, the right
    move is probably to start with CPU NEON for IDCT and switch to
    QPU once the dispatch overhead drops (issue #162 dmabuf import
    + further pool work on the daedalus-fourier side).

No source change to flush_frame itself; this is purely a measurement
add.  The bench is opt-in (not a ctest) — these numbers belong in
commit messages and the campaign log, not in CI gating.
marfrit merged commit 820771d24b into main 2026-05-24 21:22:43 +00:00
marfrit deleted branch noether/phase1-bench-substrate 2026-05-24 21:22:43 +00:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: marfrit/daedalus-decoder#10