T

claude-noether 948697ef0d phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4

Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").

What's tested:

  - 320×240 coded frame (300 MBs), enough to cover multiple workgroups
    of the V3D shader (16 blocks/WG → ≥30 WGs)
  - Per-MB → flat-raster block layout consistent with flush_frame
  - Random coeffs in [-512, 511] (same range as daedalus-fourier
    cycle-6 M1 gate)
  - Inline C reference: H.264 §8.5.12.1 butterfly with column-major
    block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
    mirrors daedalus-fourier tests/h264_idct4_ref.c

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
    Start 1: smoke
  1/2 Test #1: smoke ............................   Passed    0.16 sec
    Start 2: idct_bitexact
  2/2 Test #2: idct_bitexact ....................   Passed    0.03 sec

  100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame.  Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).

What is NOT tested by this PR (deferred to follow-ons):

  - Non-zero predicted pixels — flush_frame zero-initialises scratch_y,
    so the IDCT-ADD reduces to clip255(IDCT).  Real predicted comes
    from Stage 2a intra prediction.
  - Z-scan permutation between FFmpeg's per-MB coeffs layout and our
    per-MB → flat raster — the test uses its own coefficient generator
    that already matches our layout, so it doesn't exercise the
    permutation.  The libavcodec-intercept patch is where the
    permutation lands and gets validated against real H.264 streams.
  - Chroma 4×4 IDCT.
  - IDCT 8×8 (High profile).

Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch).  Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).

2026-05-24 22:20:21 +02:00

include

scaffold: CMake + API skeleton + smoke test

2026-05-24 22:08:46 +02:00

src

phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip

2026-05-24 22:15:35 +02:00

tests

phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4

2026-05-24 22:20:21 +02:00

.gitignore

scaffold: CMake + API skeleton + smoke test

2026-05-24 22:08:46 +02:00

CMakeLists.txt

phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4

2026-05-24 22:20:21 +02:00

DESIGN.md

initial design doc — frame-level GPU H.264 decoder for V3D7

2026-05-23 22:44:03 +02:00

LICENSE

scaffold: CMake + API skeleton + smoke test

2026-05-24 22:08:46 +02:00

README.md

initial design doc — frame-level GPU H.264 decoder for V3D7

2026-05-23 22:44:03 +02:00

README.md

daedalus-decoder

Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.

The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.

Sibling projects:

daedalus-fourier — V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
daedalus-v4l2 — V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
libva-v4l2-request-fourier — VAAPI ↔ V4L2 stateless bridge. End consumer.

See DESIGN.md for the architecture sketch.