948697ef0d9f3d8ff7dcfc4f488651c40016e719
Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").
What's tested:
- 320×240 coded frame (300 MBs, 4800 4×4 luma blocks), enough to cover
  multiple workgroups of the V3D shader (16 blocks/WG → 300 WGs)
- Per-MB → flat-raster block layout consistent with flush_frame
- Random coeffs in [-512, 511] (same range as daedalus-fourier
cycle-6 M1 gate)
- Inline C reference: H.264 §8.5.12.1 butterfly with column-major
block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
mirrors daedalus-fourier tests/h264_idct4_ref.c
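
For readers without daedalus-fourier checked out, the reference path described above can be sketched roughly as follows. This is a minimal illustration, not the contents of tests/h264_idct4_ref.c: the names `butterfly4` and `idct4x4_add_ref` are hypothetical, and the sketch assumes "column-major" means the coefficient at (row, col) is stored at `coeffs[col * 4 + row]`.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 1D inverse-transform butterfly over four values spaced `stride` apart.
 * Relies on >> being an arithmetic shift for negative ints, which holds on
 * GCC and Clang but is implementation-defined in ISO C. */
static void butterfly4(int *d, int stride)
{
    const int d0 = d[0], d1 = d[stride], d2 = d[2 * stride], d3 = d[3 * stride];
    const int e0 = d0 + d2;
    const int e1 = d0 - d2;
    const int e2 = (d1 >> 1) - d3;
    const int e3 = d1 + (d3 >> 1);
    d[0]          = e0 + e3;
    d[stride]     = e1 + e2;
    d[2 * stride] = e1 - e2;
    d[3 * stride] = e0 - e3;
}

/* Hypothetical reference (assumed layout): coeffs are column-major,
 * coeffs[col * 4 + row]; dst holds the predicted pixels on entry and the
 * reconstructed pixels on return. */
static void idct4x4_add_ref(const int16_t coeffs[16], uint8_t *dst, int dst_stride)
{
    int t[16]; /* row-major working copy: t[row * 4 + col] */

    assert(dst_stride >= 4);
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            t[row * 4 + col] = coeffs[col * 4 + row];

    for (int row = 0; row < 4; row++)
        butterfly4(&t[row * 4], 1);   /* horizontal pass: each row */
    for (int col = 0; col < 4; col++)
        butterfly4(&t[col], 4);       /* vertical pass: each column */

    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++) {
            const int residual = (t[row * 4 + col] + 32) >> 6; /* +32 rounding, >>6 */
            uint8_t *p = &dst[row * dst_stride + col];
            *p = clip255(*p + residual);                       /* add to predicted, clip */
        }
}
```

With a DC-only block (`coeffs[0] = 640`, everything else zero) and zero predicted pixels, both passes spread 640 across the block and every output pixel becomes (640 + 32) >> 6 = 10.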
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/2 Test #1: smoke ............................   Passed    0.16 sec
        Start 2: idct_bitexact
    2/2 Test #2: idct_bitexact ....................   Passed    0.03 sec

    100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame. Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).
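
The block and workgroup counts quoted above follow from the frame geometry. The sketch below shows that arithmetic plus one plausible reading of the "per-MB → flat raster" layout. The helper names are hypothetical and the indexing formula is an assumption about flush_frame, not taken from its source; only the 16 blocks/WG figure comes from this PR.

```c
#include <assert.h>
#include <stdint.h>

enum { MB_SIZE = 16, BLK_SIZE = 4, BLOCKS_PER_WG = 16 };

/* Number of 4x4 luma blocks in a frame whose dimensions are MB-aligned. */
static uint32_t luma_block_count(uint32_t width, uint32_t height)
{
    assert(width % MB_SIZE == 0 && height % MB_SIZE == 0);
    return (width / BLK_SIZE) * (height / BLK_SIZE);
}

/* Workgroups needed to cover n_blocks, rounding up. */
static uint32_t workgroup_count(uint32_t n_blocks)
{
    return (n_blocks + BLOCKS_PER_WG - 1) / BLOCKS_PER_WG;
}

/* Assumed flat-raster index: blocks numbered left to right, top to bottom
 * across the whole frame. (mb_x, mb_y) is the macroblock position and
 * (bx, by) the block position inside it, each in 0..3. */
static uint32_t flat_raster_block_index(uint32_t width,
                                        uint32_t mb_x, uint32_t mb_y,
                                        uint32_t bx, uint32_t by)
{
    const uint32_t blocks_per_row = width / BLK_SIZE;
    assert(bx < 4 && by < 4);
    return (mb_y * 4 + by) * blocks_per_row + (mb_x * 4 + bx);
}
```

For the 320×240 test frame this gives 80 × 60 = 4800 blocks and 4800 / 16 = 300 workgroups; the last block of the last macroblock (MB 19,14, block 3,3) lands at flat index 4799.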
What is NOT tested by this PR (deferred to follow-ons):
- Non-zero predicted pixels: flush_frame zero-initialises scratch_y,
  so the IDCT-ADD reduces to clip255(IDCT). Real predicted pixels will
  come from Stage 2a intra prediction.
- Z-scan permutation between FFmpeg's per-MB coeffs layout and our
per-MB → flat raster — the test uses its own coefficient generator
that already matches our layout, so it doesn't exercise the
permutation. The libavcodec-intercept patch is where the
permutation lands and gets validated against real H.264 streams.
- Chroma 4×4 IDCT.
- IDCT 8×8 (High profile).
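
For context on the deferred Z-scan item, the sketch below shows the standard H.264 mapping from a luma 4×4 block index (luma4x4BlkIdx, the order in which blocks are coded within a macroblock) to its raster position inside the MB. It assumes the intercepted libavcodec coefficients arrive in that standard order, which this PR does not verify; `zscan_to_block_pos` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t bx; /* block column within the MB, 0..3 */
    uint8_t by; /* block row within the MB, 0..3 */
} blk_pos;

/* Z-scan index -> raster position. Bits 0 and 1 of the index select the
 * block inside an 8x8 quadrant; bits 2 and 3 select the quadrant. */
static blk_pos zscan_to_block_pos(unsigned blk_idx)
{
    blk_pos p;
    assert(blk_idx < 16);
    p.bx = (uint8_t)((((blk_idx >> 2) & 1u) << 1) | (blk_idx & 1u));
    p.by = (uint8_t)((((blk_idx >> 3) & 1u) << 1) | ((blk_idx >> 1) & 1u));
    return p;
}
```

Indices 0..3 cover the top-left 8×8 quadrant as (0,0), (1,0), (0,1), (1,1); index 4 starts the top-right quadrant at (2,0). A test generator that emits blocks directly in raster order, as this one does, never passes through this mapping.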
Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch). Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).
daedalus-decoder
Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.
The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.
Sibling projects:
- daedalus-fourier — V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
- daedalus-v4l2 — V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- libva-v4l2-request-fourier — VAAPI ↔ V4L2 stateless bridge. End consumer.
See DESIGN.md for the architecture sketch.
Description
Frame-level GPU H.264 decoder for Raspberry Pi 5 V3D7. NVDEC-shaped pipeline (encoded bitstream in, NV12 out, one Vulkan submit per frame) built on daedalus-fourier's V3D compute primitives. Phase 1 design exploration.