352373a9be1d6f404bd7ce571b9b91bf7640ccad
Establishes a steady-state baseline for the Path C frame-level
dispatch architecture. Times daedalus_decoder_flush_frame at a
configurable coded resolution with random coefficients, reporting
per-frame latency stats and fps.
NOT a ctest — produces wall-time numbers, doesn't pass/fail. Run
manually:
./build/bench_flush_frame [width] [height] [iters] [warmup]
Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader-
pipeline-pool materialisation cost from the timing average).
Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
  $ ./build/bench_flush_frame
  bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
  ctx has_qpu=1
  flush_frame (post-warmup, 95 samples):
    min    = 9.699 ms
    median = 9.905 ms
    mean   = 10.014 ms
    p99    = 12.011 ms
    max    = 12.011 ms
  throughput (steady-state, IDCT only — NO intra/MC/deblock):
    mean   = 99.9 fps
    median = 101.0 fps
    target = 30.0 fps (project_30fps_floor_is_fine.md)
    status = MEETS target (with 3.4x headroom for intra/MC/deblock)
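The reduction from raw per-frame timings to the stats above can be sketched as follows. This is an illustrative Python sketch, not the benchmark's code: `summarise` and `nearest_rank` are hypothetical names, and the nearest-rank percentile is an assumption (it is one definition under which p99 equals max at 95 samples, as in the run above; the method the benchmark actually uses is not stated).

```python
import statistics


def nearest_rank(sorted_ms, pct):
    """Nearest-rank percentile of an ascending list (assumed definition)."""
    rank = -(-pct * len(sorted_ms) // 100)  # integer ceil(pct * n / 100)
    return sorted_ms[max(rank, 1) - 1]


def summarise(samples_ms, warmup=5):
    """Drop the warmup samples, then reduce the rest to latency and fps stats."""
    steady = sorted(samples_ms[warmup:])
    mean = statistics.fmean(steady)
    median = statistics.median(steady)
    return {
        "samples": len(steady),
        "min": steady[0],
        "median": median,
        "mean": mean,
        "p99": nearest_rank(steady, 99),
        "max": steady[-1],
        # Throughput is the reciprocal of per-frame latency in milliseconds.
        "fps_mean": 1000.0 / mean,
        "fps_median": 1000.0 / median,
    }
```

With 100 timings and a 5-frame warmup this leaves 95 samples, and the nearest-rank p99 lands on the 95th (largest) sample, which is why p99 and max coincide in a run of this length.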
Interpretation:
Per-frame work measured:
- CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8,
chroma meta+coeffs buffers
- 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their
respective vkQueueSubmit + vkQueueWaitIdle round-trips
- CPU NV12 interleave (chroma planar → UV)
- calloc/free for scratch_y / coeffs / meta buffers
Doing all of that in ~10 ms means the architecture pays back the
Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer
pool keeps amortised cost low) is the right granularity. The
per-block dispatch fail-mode that motivated Path C (~6500 ms/frame
from the libavcodec substitution arc) is roughly 650x slower than
this.
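The CPU NV12 interleave step listed above (chroma planar to UV) can be illustrated with a minimal sketch. It is written in Python only to show the byte layout; `interleave_nv12_chroma` is a hypothetical name and the decoder's real routine is not shown in this message.

```python
def interleave_nv12_chroma(u_plane: bytes, v_plane: bytes) -> bytes:
    """Merge planar U and V into the interleaved UV plane NV12 expects (U first)."""
    if len(u_plane) != len(v_plane):
        raise ValueError("U and V planes must have the same size")
    uv = bytearray(2 * len(u_plane))
    uv[0::2] = u_plane  # even bytes carry U samples
    uv[1::2] = v_plane  # odd bytes carry V samples
    return bytes(uv)


def chroma_plane_bytes(width: int, height: int) -> int:
    """Bytes per 8-bit chroma plane for 4:2:0 subsampling (assumes even dimensions)."""
    return (width // 2) * (height // 2)
```

For the default 1920x1088 frame each chroma plane is 960x544 = 522,240 bytes, so the interleaved UV plane is 1,044,480 bytes, half the size of the luma plane.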
The 3.4x headroom (101 fps median against the 30 fps target) leaves
a budget of ~23 ms/frame (33.3 ms deadline minus ~10 ms measured)
for the remaining decode work (intra prediction wavefront, MC,
deblock). Those stages together need to fit inside that budget at
steady state for the end-to-end decoder to hit 30 fps at 1080p.
A p99 latency of 12 ms means even worst-case frames clear the 33-ms
deadline (30 fps) easily; tail latency isn't a concern at this
stage.
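The headroom and budget figures follow from simple arithmetic on the measured latencies. A small sketch (Python, with hypothetical helper names) that reproduces them:

```python
def frame_deadline_ms(target_fps: float) -> float:
    """Time available per frame at the target rate."""
    return 1000.0 / target_fps


def remaining_budget_ms(measured_ms: float, target_fps: float) -> float:
    """Deadline minus what the measured stage already consumes."""
    return frame_deadline_ms(target_fps) - measured_ms


def headroom(measured_ms: float, target_fps: float) -> float:
    """Ratio of achieved fps to target fps."""
    return (1000.0 / measured_ms) / target_fps
```

Plugging in the run above: the median of 9.905 ms against 30 fps gives a headroom of about 3.4x and about 23.4 ms left per frame, while the p99 of 12.011 ms still leaves about 21.3 ms.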
What this number does NOT validate:
- Intra prediction shader dispatch overhead (likely per-anti-diagonal
or per-MB-wavefront; dispatch count goes up)
- MC dispatch (per qpel-block; up to several per MB)
- Deblock dispatch (4 edges per MB; per-edge meta entries)
- Real H.264 streams (random coeffs ≠ real residuals; the
memory-access pattern is content-independent, but cache pressure
may differ at scale).
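To get a feel for how the dispatch count could grow for the unvalidated stages, the sketch below counts waves and edge entries at the default resolution. Everything here is an assumption for illustration: the helper names are hypothetical, the wavefront variant assumes each MB waits on its left, top-left, top and top-right neighbours (wave index x + 2y), and the deblock count assumes one meta entry per edge with the 4 edges per MB mentioned above. Which schedule the decoder will use is not decided in this message.

```python
MB_SIZE = 16  # H.264 macroblocks are 16x16 luma samples


def mb_grid(width: int, height: int) -> tuple[int, int]:
    """Macroblock columns and rows for a coded resolution."""
    return (width + MB_SIZE - 1) // MB_SIZE, (height + MB_SIZE - 1) // MB_SIZE


def antidiagonal_waves(mb_w: int, mb_h: int) -> int:
    """Waves if MBs with equal x + y are dispatched together."""
    return mb_w + mb_h - 1


def wavefront_waves(mb_w: int, mb_h: int) -> int:
    """Waves if MBs with equal x + 2*y are dispatched together
    (assumed top-right dependency)."""
    return mb_w + 2 * (mb_h - 1)


def deblock_edge_entries(mb_w: int, mb_h: int, edges_per_mb: int = 4) -> int:
    """Per-edge meta entries, assuming one entry per edge."""
    return mb_w * mb_h * edges_per_mb
```

At 1920x1088 the grid is 120x68 = 8160 MBs. Under these assumptions that is 187 anti-diagonal waves or 254 wavefront waves, against the 3 dispatches measured here, and 32,640 deblock edge entries.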
daedalus-decoder
Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.
The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.
Sibling projects:
- daedalus-fourier — V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
- daedalus-v4l2 — V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- libva-v4l2-request-fourier — VAAPI ↔ V4L2 stateless bridge. End consumer.
See DESIGN.md for the architecture sketch.
Description
Frame-level GPU H.264 decoder for Raspberry Pi 5 V3D7. NVDEC-shaped pipeline (encoded bitstream in, NV12 out, one Vulkan submit per frame) built on daedalus-fourier's V3D compute primitives. Phase 1 design exploration.