daedalus-decoder

marfrit/daedalus-decoder

Fork 0

Commit Graph

Author	SHA1	Message	Date
claude-noether	0b6482bc8f	phase1: bench_flush_frame substrate selector + IDCT-layer QPU vs CPU data Extends bench_flush_frame with an argv[5] substrate selector (auto/cpu/qpu). Same enum as test_idct_bitexact's argv[4] — keeps both binaries' CLI in sync. The whole point of plumbing the selector through is to put a number on the "QPU is default substrate" decree (2026-05-23, feedback_qpu_is_default_substrate.md) for the IDCT layer specifically. The decree said: "What can be done, will be done in QPU. Dispatch overhead is fixable defect." This measurement quantifies the unfixed defect. Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8 luma MBs + chroma always 4x4. Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 (with cycle 6/7/9 H.264 IDCT shaders). Hertz, idle system. Results: substrate min median mean p99 fps (median) ───────────────────────────────────────────────────────────── CPU NEON 8.75 9.27 11.10 33.06 107.8 QPU V3D7 31.92 37.77 37.67 47.27 26.5 AUTO 31.99 33.19 36.04 92.23 30.1 Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md). Stages NOT yet measured: intra prediction, MC, deblock. Interpretation: - For the IDCT-only workload at frame batch granularity, CPU NEON is 4.1x faster than QPU V3D7. - AUTO → recipe table → QPU per the decree → BELOW the 30 fps target with no headroom for the remaining decoder stages. - The earlier "101 fps median at 1080p" measurement reported in PR #8's commit was actually the CPU NEON path — the daedalus- fourier install on hertz at the time predated the cycle 6 H.264 QPU shader, so recipe AUTO silently fell back to CPU NEON. PR #8's "Path C is viable" conclusion stands, but the substrate label was wrong. Apologies for the misleading number. What this means for the campaign: - The decree's "fixable defect" claim is still aspirational for the H.264 IDCT shaders. The current QPU shader dispatch costs ~3.6 ms per IDCT round-trip (luma 4x4 + luma 8x8 + chroma 4x4 = ~10 ms total cf. CPU's 2.3 ms), which dominates over the compute. - daedalus-decoder doesn't need to take a position on this — the AUTO path follows the recipe table and respects the decree. The substrate selector is the escape hatch when consumers want to override. - For the libavcodec intercept patch when it lands, the right move is probably to start with CPU NEON for IDCT and switch to QPU once the dispatch overhead drops (issue #162 dmabuf import + further pool work on the daedalus-fourier side). No source change to flush_frame itself; this is purely a measurement add. The bench is opt-in (not a ctest) — these numbers belong in commit messages and the campaign log, not in CI gating.	2026-05-24 23:19:39 +02:00
claude-noether	352373a9be	phase1: add IDCT-layer throughput benchmark (bench_flush_frame) Establishes a steady-state baseline for the Path C frame-level dispatch architecture. Times daedalus_decoder_flush_frame at a configurable coded resolution with random coefficients, reporting per-frame latency stats and fps. NOT a ctest — produces wall-time numbers, doesn't pass/fail. Run manually: ./build/bench_flush_frame [width] [height] [iters] [warmup] Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader- pipeline-pool materialisation cost from the timing average). Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ./build/bench_flush_frame bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup) ctx has_qpu=1 flush_frame (post-warmup, 95 samples): min = 9.699 ms median = 9.905 ms mean = 10.014 ms p99 = 12.011 ms max = 12.011 ms throughput (steady-state, IDCT only — NO intra/MC/deblock): mean = 99.9 fps median = 101.0 fps target = 30.0 fps (project_30fps_floor_is_fine.md) status = MEETS target (with 3.4x headroom for intra/MC/deblock) Interpretation: Per-frame work measured: - CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8, chroma meta+coeffs buffers - 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their respective vkQueueSubmit + vkQueueWaitIdle round-trips - CPU NV12 interleave (chroma planar → UV) - calloc/free for scratch_y / coeffs / meta buffers Doing all of that in ~10 ms means the architecture pays back the Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer pool keeps amortised cost low) is the right granularity. The per-block dispatch fail-mode that motivated Path C (~6500 ms/frame from the libavcodec substitution arc) is 600x slower than this. 3.4x headroom from 101 fps → 30 fps target gives a budget of ~23 ms/frame for the remaining decode work (intra prediction wavefront, MC, deblock). Each of those needs to fit inside that budget at steady state for the end-to-end decoder to hit 30 fps at 1080p. p99 latency 12 ms means even worst-case frames clear the 33-ms deadline (30 fps) easily; tail latency isn't a concern at this stage. What this number does NOT validate: - Intra prediction shader dispatch overhead (likely per-anti-diagonal or per-MB-wavefront; dispatch count goes up) - MC dispatch (per qpel-block; up to several per MB) - Deblock dispatch (4 edges per MB; per-edge meta entries) - Real H.264 streams (random coeffs ≠ real residuals; perf shape of memory access is content-independent, but cache pressure may differ at scale).	2026-05-24 22:53:49 +02:00

Author

SHA1

Message

Date

claude-noether

0b6482bc8f

phase1: bench_flush_frame substrate selector + IDCT-layer QPU vs CPU data

Extends bench_flush_frame with an argv[5] substrate selector
(auto/cpu/qpu). Same enum as test_idct_bitexact's argv[4] — keeps
both binaries' CLI in sync.

The whole point of plumbing the selector through is to put a number
on the "QPU is default substrate" decree (2026-05-23,
feedback_qpu_is_default_substrate.md) for the IDCT layer
specifically. The decree said: "What can be done, will be done in
QPU. Dispatch overhead is fixable defect." This measurement
quantifies the unfixed defect.

Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8
luma MBs + chroma always 4x4. Pi 5 / V3D 7.1 / daedalus-fourier
0.1.0 (with cycle 6/7/9 H.264 IDCT shaders). Hertz, idle system.

Results:

substrate min median mean p99 fps (median)
─────────────────────────────────────────────────────────────
CPU NEON 8.75 9.27 11.10 33.06 107.8
QPU V3D7 31.92 37.77 37.67 47.27 26.5
AUTO 31.99 33.19 36.04 92.23 30.1

Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md).
Stages NOT yet measured: intra prediction, MC, deblock.

Interpretation:

- For the IDCT-only workload at frame batch granularity, CPU NEON
is 4.1x faster than QPU V3D7.
- AUTO → recipe table → QPU per the decree → BELOW the 30 fps
target with no headroom for the remaining decoder stages.
- The earlier "101 fps median at 1080p" measurement reported in
PR #8's commit was actually the CPU NEON path — the daedalus-
fourier install on hertz at the time predated the cycle 6 H.264
QPU shader, so recipe AUTO silently fell back to CPU NEON.
PR #8's "Path C is viable" conclusion stands, but the substrate
label was wrong. Apologies for the misleading number.

What this means for the campaign:

- The decree's "fixable defect" claim is still aspirational for
the H.264 IDCT shaders. The current QPU shader dispatch costs
~3.6 ms per IDCT round-trip (luma 4x4 + luma 8x8 + chroma 4x4 =
~10 ms total cf. CPU's 2.3 ms), which dominates over the compute.
- daedalus-decoder doesn't need to take a position on this — the
AUTO path follows the recipe table and respects the decree.
The substrate selector is the escape hatch when consumers want
to override.
- For the libavcodec intercept patch when it lands, the right
move is probably to start with CPU NEON for IDCT and switch to
QPU once the dispatch overhead drops (issue #162 dmabuf import
+ further pool work on the daedalus-fourier side).

No source change to flush_frame itself; this is purely a measurement
add. The bench is opt-in (not a ctest) — these numbers belong in
commit messages and the campaign log, not in CI gating.

2026-05-24 23:19:39 +02:00

claude-noether

352373a9be

phase1: add IDCT-layer throughput benchmark (bench_flush_frame)

Establishes a steady-state baseline for the Path C frame-level
dispatch architecture.  Times daedalus_decoder_flush_frame at a
configurable coded resolution with random coefficients, reporting
per-frame latency stats and fps.

NOT a ctest — produces wall-time numbers, doesn't pass/fail.  Run
manually:

  ./build/bench_flush_frame [width] [height] [iters] [warmup]

Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader-
pipeline-pool materialisation cost from the timing average).

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/bench_flush_frame
  bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
  ctx has_qpu=1

  flush_frame (post-warmup, 95 samples):
    min    =   9.699 ms
    median =   9.905 ms
    mean   =  10.014 ms
    p99    =  12.011 ms
    max    =  12.011 ms

  throughput (steady-state, IDCT only — NO intra/MC/deblock):
    mean   = 99.9 fps
    median = 101.0 fps
    target = 30.0 fps (project_30fps_floor_is_fine.md)
    status = MEETS target (with 3.4x headroom for intra/MC/deblock)

Interpretation:

  Per-frame work measured:
    - CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8,
      chroma meta+coeffs buffers
    - 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their
      respective vkQueueSubmit + vkQueueWaitIdle round-trips
    - CPU NV12 interleave (chroma planar → UV)
    - calloc/free for scratch_y / coeffs / meta buffers

  Doing all of that in ~10 ms means the architecture pays back the
  Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer
  pool keeps amortised cost low) is the right granularity.  The
  per-block dispatch fail-mode that motivated Path C (~6500 ms/frame
  from the libavcodec substitution arc) is 600x slower than this.

  3.4x headroom from 101 fps → 30 fps target gives a budget of
  ~23 ms/frame for the remaining decode work (intra prediction
  wavefront, MC, deblock).  Each of those needs to fit inside that
  budget at steady state for the end-to-end decoder to hit 30 fps
  at 1080p.

  p99 latency 12 ms means even worst-case frames clear the 33-ms
  deadline (30 fps) easily; tail latency isn't a concern at this
  stage.

What this number does NOT validate:
  - Intra prediction shader dispatch overhead (likely per-anti-diagonal
    or per-MB-wavefront; dispatch count goes up)
  - MC dispatch (per qpel-block; up to several per MB)
  - Deblock dispatch (4 edges per MB; per-edge meta entries)
  - Real H.264 streams (random coeffs ≠ real residuals; perf shape
    of memory access is content-independent, but cache pressure may
    differ at scale).

2026-05-24 22:53:49 +02:00

2 Commits