phase1: add IDCT-layer throughput benchmark (bench_flush_frame) #8

2026-05-24T20:54:13Z

marfrit commented

2026-05-24 20:54:13 +00:00

Adds a non-ctest benchmark binary that times flush_frame at configurable resolutions, reporting steady-state per-frame latency stats and fps.

First measurement on hertz: 101 fps median at 1080p (about 10 ms/frame), giving 3.4x headroom over the 30 fps floor for the remaining intra/MC/deblock work. This validates that the Path C single-submit-per-dispatch architecture pays off compared with the 6.5 s/frame per-block failure mode that motivated the pivot. The full commit message has the test transcript and a breakdown of what is and isn't measured.

marfrit added 1 commit 2026-05-24 20:54:21 +00:00

phase1: add IDCT-layer throughput benchmark (bench_flush_frame) 352373a9be

Establishes a steady-state baseline for the Path C frame-level
dispatch architecture.  Times daedalus_decoder_flush_frame at a
configurable coded resolution with random coefficients, reporting
per-frame latency stats and fps.

NOT a ctest — produces wall-time numbers, doesn't pass/fail.  Run
manually:

  ./build/bench_flush_frame [width] [height] [iters] [warmup]

Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader-
pipeline-pool materialisation cost from the timing average).
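
The iteration and warmup handling described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual source: `frame_fn`, `now_ms`, `time_frames`, and `count_frame` are hypothetical names, and treating the warmup frames as part of the iteration count is an assumption inferred from the transcript (100 iters, 5 warmup, 95 samples).

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime / CLOCK_MONOTONIC */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call does one frame's worth of work. */
typedef void (*frame_fn)(void *user);

/* Monotonic time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/*
 * Calls fn `iters` times and records the latency of every call after the
 * first `warmup` calls. samples_ms must hold at least iters - warmup
 * entries. Returns the number of samples written.
 */
static size_t time_frames(frame_fn fn, void *user, size_t iters,
                          size_t warmup, double *samples_ms)
{
    assert(fn != NULL && samples_ms != NULL && warmup <= iters);
    size_t n = 0;
    for (size_t i = 0; i < iters; i++) {
        double t0 = now_ms();
        fn(user);
        double t1 = now_ms();
        if (i >= warmup)          /* discard warmup frames */
            samples_ms[n++] = t1 - t0;
    }
    return n;
}

/* Stand-in workload for illustration only: counts how often it ran. */
static void count_frame(void *user)
{
    ++*(size_t *)user;
}
```

In a real harness the callback would wrap the call under test (here, daedalus_decoder_flush_frame); the stand-in only counts invocations so the loop can be checked in isolation.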

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/bench_flush_frame
  bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
  ctx has_qpu=1

  flush_frame (post-warmup, 95 samples):
    min    =   9.699 ms
    median =   9.905 ms
    mean   =  10.014 ms
    p99    =  12.011 ms
    max    =  12.011 ms

  throughput (steady-state, IDCT only — NO intra/MC/deblock):
    mean   = 99.9 fps
    median = 101.0 fps
    target = 30.0 fps (project_30fps_floor_is_fine.md)
    status = MEETS target (with 3.4x headroom for intra/MC/deblock)
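
For readers who want to reproduce the summary lines from raw samples, here is one way the statistics could be computed. It is a sketch under stated assumptions: `latency_stats`, `summarize`, and `fps_from_ms` are hypothetical names, and the nearest-rank p99 is an assumed definition, since the benchmark's actual percentile method is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct latency_stats {
    double min_ms, median_ms, mean_ms, p99_ms, max_ms;
};

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and summarises them. */
static struct latency_stats summarize(double *s, size_t n)
{
    assert(s != NULL && n > 0);
    qsort(s, n, sizeof *s, cmp_double);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i];

    struct latency_stats st;
    st.min_ms = s[0];
    st.max_ms = s[n - 1];
    st.mean_ms = sum / (double)n;
    /* Median: middle element, or the mean of the two middle elements. */
    st.median_ms = (n % 2) ? s[n / 2] : 0.5 * (s[n / 2 - 1] + s[n / 2]);
    /* Nearest-rank percentile (assumed): 1-based rank = ceil(0.99 * n). */
    size_t rank = (99 * n + 99) / 100;
    st.p99_ms = s[rank - 1];
    return st;
}

/* Converts a per-frame latency in milliseconds to frames per second. */
static double fps_from_ms(double ms)
{
    assert(ms > 0.0);
    return 1000.0 / ms;
}
```

With 95 samples the nearest-rank p99 lands on the largest sample, which would be consistent with p99 and max both reading 12.011 ms in the transcript. The fps lines follow from the latencies: 1000 / 9.905 ms is about 101.0 fps and 1000 / 10.014 ms is about 99.9 fps.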

Interpretation:

  Per-frame work measured:
    - CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8,
      chroma meta+coeffs buffers
    - 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their
      respective vkQueueSubmit + vkQueueWaitIdle round-trips
    - CPU NV12 interleave (chroma planar → UV)
    - calloc/free for scratch_y / coeffs / meta buffers
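
The CPU NV12 interleave step in the list above is the standard planar-to-semi-planar chroma conversion. A minimal sketch follows; `interleave_nv12_uv` and its parameters are hypothetical, and the decoder's real buffer layout and strides may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Interleaves planar Cb and Cr into an NV12 UV plane (U first, then V).
 * chroma_w and chroma_h are the chroma plane dimensions (half the luma
 * dimensions for 4:2:0). Strides are in bytes.
 */
static void interleave_nv12_uv(const uint8_t *cb, const uint8_t *cr,
                               size_t src_stride,
                               uint8_t *uv, size_t uv_stride,
                               size_t chroma_w, size_t chroma_h)
{
    assert(cb != NULL && cr != NULL && uv != NULL);
    assert(src_stride >= chroma_w && uv_stride >= 2 * chroma_w);

    for (size_t y = 0; y < chroma_h; y++) {
        const uint8_t *cb_row = cb + y * src_stride;
        const uint8_t *cr_row = cr + y * src_stride;
        uint8_t *uv_row = uv + y * uv_stride;
        for (size_t x = 0; x < chroma_w; x++) {
            uv_row[2 * x]     = cb_row[x];  /* U sample */
            uv_row[2 * x + 1] = cr_row[x];  /* V sample */
        }
    }
}
```

For a 1920x1088 frame this would touch 960x544 samples per chroma plane, so the cost is a linear pass over roughly 1 MB of chroma data per frame.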

  Doing all of that in ~10 ms means the Path C design bet pays off:
  ONE Vulkan submit per dispatch (the cycle 8b buffer pool keeps the
  amortised cost low) is the right granularity.  The per-block
  dispatch fail-mode that motivated Path C (~6500 ms/frame from the
  libavcodec substitution arc) is roughly 650x slower than this.

  3.4x headroom (101 fps measured vs the 30 fps target) leaves a
  budget of ~23 ms/frame (33.3 ms frame period minus ~10 ms for
  IDCT) for the remaining decode work (intra prediction wavefront,
  MC, deblock).  Those stages together need to fit inside that
  budget at steady state for the end-to-end decoder to hit 30 fps
  at 1080p.
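
The budget arithmetic above can be checked with a few lines of C. The helper names are hypothetical, and the calculation assumes the remaining stages run back to back with the IDCT stage inside one frame period.

```c
#include <assert.h>

/* Frame period in milliseconds for a target frame rate. */
static double frame_period_ms(double target_fps)
{
    assert(target_fps > 0.0);
    return 1000.0 / target_fps;
}

/* Time left per frame once the measured stage has run. */
static double remaining_budget_ms(double target_fps, double measured_ms)
{
    return frame_period_ms(target_fps) - measured_ms;
}

/* Ratio of measured throughput to the target throughput. */
static double headroom_factor(double measured_fps, double target_fps)
{
    assert(target_fps > 0.0);
    return measured_fps / target_fps;
}
```

With the measured median of 9.905 ms and a 30 fps target, remaining_budget_ms gives about 23.4 ms, and headroom_factor(101.0, 30.0) gives about 3.4.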

  p99 latency 12 ms means even worst-case frames clear the 33-ms
  deadline (30 fps) easily; tail latency isn't a concern at this
  stage.

What this number does NOT validate:
  - Intra prediction shader dispatch overhead (likely per-anti-diagonal
    or per-MB-wavefront; dispatch count goes up)
  - MC dispatch (per qpel-block; up to several per MB)
  - Deblock dispatch (4 edges per MB; per-edge meta entries)
  - Real H.264 streams (random coeffs ≠ real residuals; perf shape
    of memory access is content-independent, but cache pressure may
    differ at scale).

marfrit merged commit bfe43003f3 into main

2026-05-24 21:03:11 +00:00

marfrit deleted branch noether/phase1-bench-flush

2026-05-24 21:03:12 +00:00

claude-noether referenced this issue from a commit

2026-05-24 21:19:43 +00:00

phase1: bench_flush_frame substrate selector + IDCT-layer QPU vs CPU data

marfrit referenced this pull request

2026-05-24 21:19:57 +00:00

phase1: bench_flush_frame substrate selector + IDCT-layer CPU vs QPU data #10
