44ca4e550fbd45fafef783af7d2b8b1e1eadb023
Surfaces daedalus-fourier's substrate-override capability at the
decoder boundary. Lets tests run on CPU-only hosts (CI runners,
x86 dev boxes) AND cross-checks V3D shader output against NEON
reference on hosts that have both.
API additions (pre-0.1 ABI, additive):
- daedalus_decoder_substrate enum { AUTO, CPU, QPU }
(mirrors daedalus_substrate; isolated for ABI reasons).
- daedalus_decoder_set_substrate(dec, sub) setter, same
mid-frame-change restrictions as set_output_format.
- Default remains AUTO — the only sensible choice for production.
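
For illustration, the additions above could look roughly like this from the decoder side. Only the names daedalus_decoder_substrate and daedalus_decoder_set_substrate and the values AUTO, CPU, QPU come from this commit. The enumerator spellings, the 0/-1 return convention, the stand-in struct, and the in_frame flag are assumptions for the sketch, not the actual header.

```c
#include <stddef.h>

/* Hypothetical enumerator spellings; the commit only names AUTO, CPU, QPU. */
typedef enum daedalus_decoder_substrate {
    DAEDALUS_DECODER_SUBSTRATE_AUTO = 0,
    DAEDALUS_DECODER_SUBSTRATE_CPU  = 1,
    DAEDALUS_DECODER_SUBSTRATE_QPU  = 2
} daedalus_decoder_substrate;

/* Stand-in decoder state; the real struct is not shown in the commit. */
typedef struct daedalus_decoder {
    daedalus_decoder_substrate substrate;
    int in_frame; /* nonzero while a frame is being assembled (assumed) */
} daedalus_decoder;

/* Assumed convention: 0 on success, -1 on rejection. */
int daedalus_decoder_set_substrate(daedalus_decoder *dec,
                                   daedalus_decoder_substrate sub)
{
    if (dec == NULL)
        return -1;
    /* Reject values outside the enum (the "bogus" case in test_smoke). */
    if ((int)sub < DAEDALUS_DECODER_SUBSTRATE_AUTO ||
        (int)sub > DAEDALUS_DECODER_SUBSTRATE_QPU)
        return -1;
    /* Assumed reading of the mid-frame restriction: no change while a
       frame is open. */
    if (dec->in_frame)
        return -1;
    dec->substrate = sub;
    return 0;
}
```

A caller that never touches the setter stays on AUTO, which matches the stated default.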
Internal:
- flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an
explicit substrate instead of daedalus_recipe_dispatch_*. Mapped
via a small map_substrate() helper. No perf delta on AUTO (recipe
layer was just doing the same dispatch under the hood).
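
A minimal sketch of what a map_substrate() helper of this kind might look like. Both enum definitions below are placeholders with assumed enumerator names. Only the type names daedalus_substrate and daedalus_decoder_substrate and the helper name appear in this commit, and the fallback to AUTO for unknown values is an assumption.

```c
/* Placeholder for the daedalus-fourier enum; enumerator names assumed. */
typedef enum daedalus_substrate {
    DAEDALUS_SUBSTRATE_AUTO,
    DAEDALUS_SUBSTRATE_CPU,
    DAEDALUS_SUBSTRATE_QPU
} daedalus_substrate;

/* Placeholder for the decoder-side enum; enumerator names assumed. */
typedef enum daedalus_decoder_substrate {
    DAEDALUS_DECODER_SUBSTRATE_AUTO,
    DAEDALUS_DECODER_SUBSTRATE_CPU,
    DAEDALUS_DECODER_SUBSTRATE_QPU
} daedalus_decoder_substrate;

/* Translate the decoder-facing value into the daedalus-fourier one.
   An explicit switch keeps the two enums free to diverge numerically. */
static daedalus_substrate map_substrate(daedalus_decoder_substrate sub)
{
    switch (sub) {
    case DAEDALUS_DECODER_SUBSTRATE_CPU:
        return DAEDALUS_SUBSTRATE_CPU;
    case DAEDALUS_DECODER_SUBSTRATE_QPU:
        return DAEDALUS_SUBSTRATE_QPU;
    case DAEDALUS_DECODER_SUBSTRATE_AUTO:
    default:
        /* Assumed fallback: anything unrecognised behaves like AUTO. */
        return DAEDALUS_SUBSTRATE_AUTO;
    }
}
```

Mapping through a switch instead of casting is one way to keep the decoder ABI isolated from the daedalus-fourier enum, which is the stated reason for having two types.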
Test changes:
- test_smoke: new EXPECTs for set_substrate (valid + bogus).
- test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or
"qpu" to force the substrate.
- CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the
QVGA case forcing the CPU path. Catches silent drift between
the V3D shader and the NEON reference; both must produce
identical output for the same coefficient input (and they do —
see ctest log below).
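
The argv[4] handling in test_idct_bitexact could be sketched as below. The function name parse_substrate_arg, the test_substrate enum, and the choice to reject unknown strings with -1 are hypothetical; the commit only states that the argument accepts "auto" (default), "cpu", or "qpu".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical test-local enum; not the decoder's public type. */
typedef enum test_substrate {
    TEST_SUBSTRATE_AUTO,
    TEST_SUBSTRATE_CPU,
    TEST_SUBSTRATE_QPU
} test_substrate;

/* Returns 0 and writes *out on success, -1 for an unrecognised string.
   A missing argument (NULL) selects the default, "auto". */
int parse_substrate_arg(const char *arg, test_substrate *out)
{
    if (arg == NULL || strcmp(arg, "auto") == 0) {
        *out = TEST_SUBSTRATE_AUTO;
        return 0;
    }
    if (strcmp(arg, "cpu") == 0) {
        *out = TEST_SUBSTRATE_CPU;
        return 0;
    }
    if (strcmp(arg, "qpu") == 0) {
        *out = TEST_SUBSTRATE_QPU;
        return 0;
    }
    return -1; /* leave *out untouched on error */
}
```

In main() this would be called as parse_substrate_arg(argc > 4 ? argv[4] : NULL, &sub), so existing ctest entries that pass fewer arguments keep running on AUTO.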
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
    $ ctest --test-dir build --output-on-failure
    Start 1: smoke
    1/4 Test #1: smoke ............................ Passed 0.10 sec
    Start 2: idct_bitexact
    2/4 Test #2: idct_bitexact .................... Passed 0.03 sec
    Start 3: idct_bitexact_cpu
    3/4 Test #3: idct_bitexact_cpu ................ Passed 0.03 sec
    Start 4: idct_bitexact_1080p
    4/4 Test #4: idct_bitexact_1080p .............. Passed 0.06 sec
    100% tests passed, 0 tests failed out of 4
CPU substrate produces byte-identical Y + Cb + Cr planes against the
same C reference that the AUTO/QPU path matches — confirming the V3D
shaders and the daedalus-fourier NEON path agree at the spec level.
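
A byte-identical plane check of the kind described here can be sketched as follows. The helper name plane_identical and its signature are hypothetical, and the real test's buffer layout is not shown in this commit. The sketch compares row by row so that stride padding, which may differ between two buffers, does not affect the result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if two planes hold identical pixels, 0 otherwise.
   Only the first `width` bytes of each row are compared, so any
   padding bytes between `width` and the stride are ignored. */
int plane_identical(const uint8_t *a, size_t stride_a,
                    const uint8_t *b, size_t stride_b,
                    size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        if (memcmp(a + y * stride_a, b + y * stride_b, width) != 0)
            return 0;
    }
    return 1;
}
```

The same helper would be called three times per frame, once each for Y, Cb, and Cr, with the chroma dimensions halved under the assumption of 4:2:0 output.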
Why we plumbed the lower-level dispatch instead of leaving recipe in
place: recipe is just a thin wrapper that calls dispatch with
AUTO. Once we needed substrate control, the wrapper became a
liability (would have required adding a parallel recipe API for each
substrate); going direct is simpler and the AUTO path is unchanged.
Coverage note: idct_bitexact_cpu runs only at QVGA (300 MBs), not at
1080p. The CPU path's wall time scales linearly with block count, and
a 1080p CPU run takes ~0.5s on hertz: fine standalone, but slow
enough under ctest that it would tempt opt-in gating. The bit-exact
content is the same regardless of frame size; the 1080p variant
exists only to catch index-arithmetic bugs that surface above small
integer boundaries.
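
The block counts behind this note follow from the 16x16 macroblock size. A small sketch, assuming QVGA means 320x240 and the 1080p case is 1920x1080 coded as whole frames (the helper name mb_count is hypothetical):

```c
/* Number of 16x16 macroblocks needed to cover a width x height frame.
   Dimensions that are not multiples of 16 round up to the next
   macroblock boundary. */
unsigned mb_count(unsigned width, unsigned height)
{
    unsigned mb_w = (width + 15u) / 16u;
    unsigned mb_h = (height + 15u) / 16u;
    return mb_w * mb_h;
}
```

Under those assumptions mb_count(320, 240) is 20 x 15 = 300, matching the figure above, and mb_count(1920, 1080) is 120 x 68 = 8160, about 27 times as many macroblocks for a path whose cost scales linearly with block count.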
daedalus-decoder
Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.
The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.
Sibling projects:
- daedalus-fourier — V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
- daedalus-v4l2 — V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- libva-v4l2-request-fourier — VAAPI ↔ V4L2 stateless bridge. End consumer.
See DESIGN.md for the architecture sketch.