phase1: add IDCT-layer throughput benchmark (bench_flush_frame) #8
Reference in New Issue
Block a user
Delete Branch "noether/phase1-bench-flush"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Adds a non-ctest benchmark binary that times flush_frame at configurable resolutions, reporting steady-state per-frame latency stats and fps.
First measurement on hertz: 101 fps median at 1080p (10 ms/frame), giving 3.4x headroom over the 30 fps floor for the remaining intra/MC/deblock work. Validates that the Path C single-submit-per-dispatch architecture pays back vs the 6.5 s/frame per-block fail-mode that motivated the pivot. Full commit message has the test transcript and a breakdown of what is/isn't measured.
Establishes a steady-state baseline for the Path C frame-level dispatch architecture. Times daedalus_decoder_flush_frame at a configurable coded resolution with random coefficients, reporting per-frame latency stats and fps. NOT a ctest — produces wall-time numbers, doesn't pass/fail. Run manually: ./build/bench_flush_frame [width] [height] [iters] [warmup] Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader- pipeline-pool materialisation cost from the timing average). Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ./build/bench_flush_frame bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup) ctx has_qpu=1 flush_frame (post-warmup, 95 samples): min = 9.699 ms median = 9.905 ms mean = 10.014 ms p99 = 12.011 ms max = 12.011 ms throughput (steady-state, IDCT only — NO intra/MC/deblock): mean = 99.9 fps median = 101.0 fps target = 30.0 fps (project_30fps_floor_is_fine.md) status = MEETS target (with 3.4x headroom for intra/MC/deblock) Interpretation: Per-frame work measured: - CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8, chroma meta+coeffs buffers - 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their respective vkQueueSubmit + vkQueueWaitIdle round-trips - CPU NV12 interleave (chroma planar → UV) - calloc/free for scratch_y / coeffs / meta buffers Doing all of that in ~10 ms means the architecture pays back the Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer pool keeps amortised cost low) is the right granularity. The per-block dispatch fail-mode that motivated Path C (~6500 ms/frame from the libavcodec substitution arc) is 600x slower than this. 3.4x headroom from 101 fps → 30 fps target gives a budget of ~23 ms/frame for the remaining decode work (intra prediction wavefront, MC, deblock). Each of those needs to fit inside that budget at steady state for the end-to-end decoder to hit 30 fps at 1080p. p99 latency 12 ms means even worst-case frames clear the 33-ms deadline (30 fps) easily; tail latency isn't a concern at this stage. What this number does NOT validate: - Intra prediction shader dispatch overhead (likely per-anti-diagonal or per-MB-wavefront; dispatch count goes up) - MC dispatch (per qpel-block; up to several per MB) - Deblock dispatch (4 edges per MB; per-edge meta entries) - Real H.264 streams (random coeffs ≠ real residuals; perf shape of memory access is content-independent, but cache pressure may differ at scale).