045553ccaf04d2acc9807695a7fcc6f5eecdd770
The existing 320x240 bit-exact test (300 MBs) is the fast inner-loop
gate, but it's small enough that index arithmetic bugs that only
surface above 16-bit boundaries would slip through. This adds a
second ctest entry that runs the same binary against a full coded
1080p frame (1920x1088, 8160 MBs):
- 4080 MBs at transform_8x8=0 → 65,280 luma 4x4 blocks
- 4080 MBs at transform_8x8=1 → 16,320 luma 8x8 blocks
- 65,280 chroma 4x4 blocks (32,640 Cb + 32,640 Cr)
- 146,880 IDCTs total across three separate dispatches (luma_4x4,
luma_8x8, chroma), each compared bit-exact against the in-test C
reference.
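The block counts above follow directly from the frame geometry. A minimal sketch of that arithmetic, assuming 4:2:0 sampling and a coded size that is a multiple of 16; `struct idct_counts` and `count_idcts` are hypothetical names for illustration, not part of the test binary:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type, not part of test_idct_bitexact. */
struct idct_counts {
    uint32_t luma_4x4;   /* 16 per MB with transform_8x8=0 */
    uint32_t luma_8x8;   /* 4 per MB with transform_8x8=1 */
    uint32_t chroma_4x4; /* 4 Cb + 4 Cr per MB in 4:2:0 */
    uint32_t total;
};

/* Count IDCTs for one frame. mbs_8x8 is the number of macroblocks
 * that use the 8x8 transform; the rest use the 4x4 transform. */
static struct idct_counts count_idcts(uint32_t width, uint32_t height,
                                      uint32_t mbs_8x8)
{
    assert(width % 16 == 0 && height % 16 == 0);
    uint32_t mbs = (width / 16) * (height / 16);
    assert(mbs_8x8 <= mbs);

    struct idct_counts c;
    c.luma_4x4 = (mbs - mbs_8x8) * 16;
    c.luma_8x8 = mbs_8x8 * 4;
    c.chroma_4x4 = mbs * 8;
    c.total = c.luma_4x4 + c.luma_8x8 + c.chroma_4x4;
    return c;
}
```

Calling `count_idcts(1920, 1088, 4080)` reproduces the figures listed above: 65,280 + 16,320 + 65,280 = 146,880.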
No code change to the test binary itself — it already accepted
width/height as argv[1..2]. Just a second `add_test` in
CMakeLists.txt that invokes it with `1920 1088`.
Coverage rationale:
- dst_off is uint32_t in daedalus_h264_block_meta; at 1920x1088
the max offset is ~2.1 MB, still well within uint32 range, but
the test exercises the largest stride math we'll see in production
(per-MB chroma offset = mb_y*8 + cb_plane_size = up to 1.06 MiB).
- flush_frame partitions 8160 MBs by transform mode → exercises the
bi4 == 4080*16 and bi8 == 4080*4 accumulators at frame scale.
- Verifies handling of the 1088 coded height (1080 displayed rows +
8 cropped rows), the trap that catches Pi 5 H.264 integrations.
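The coded-height and offset-range points can be checked with a few lines of arithmetic. This is a sketch only, assuming 8-bit 4:2:0 planes; `align16`, `struct coded_frame`, `describe_frame` and `plane_offsets_fit_u32` are illustrative names and do not reflect the actual layout behind `daedalus_h264_block_meta`:

```c
#include <assert.h>
#include <stdint.h>

/* Round a dimension up to the next multiple of 16 (one macroblock). */
static uint32_t align16(uint32_t v)
{
    return (v + 15u) & ~15u;
}

/* Hypothetical summary of one 4:2:0 frame; field names are illustrative. */
struct coded_frame {
    uint32_t coded_w;   /* luma width in pixels, multiple of 16 */
    uint32_t coded_h;   /* luma height in pixels, multiple of 16 */
    uint32_t crop_rows; /* coded rows that are not displayed */
    uint64_t y_bytes;   /* size of the 8-bit luma plane */
    uint64_t c_bytes;   /* size of each 8-bit chroma plane (Cb or Cr) */
};

static struct coded_frame describe_frame(uint32_t display_w,
                                         uint32_t display_h)
{
    assert(display_w > 0 && display_h > 0);

    struct coded_frame f;
    f.coded_w = align16(display_w);
    f.coded_h = align16(display_h);
    f.crop_rows = f.coded_h - display_h;
    f.y_bytes = (uint64_t)f.coded_w * f.coded_h;
    f.c_bytes = (uint64_t)(f.coded_w / 2) * (f.coded_h / 2);
    return f;
}

/* True when every byte offset inside a plane of this size fits in uint32_t. */
static int plane_offsets_fit_u32(uint64_t plane_bytes)
{
    return plane_bytes > 0 && plane_bytes - 1 <= UINT32_MAX;
}
```

For a 1920x1080 display size this gives a coded height of 1088 with 8 cropped rows, a 2,088,960-byte Y plane and 522,240-byte chroma planes, which matches the byte totals the test prints.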
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
    $ ctest --test-dir build --output-on-failure
    Start 1: smoke
    1/3 Test #1: smoke ............................ Passed 0.09 sec
    Start 2: idct_bitexact
    2/3 Test #2: idct_bitexact .................... Passed 0.03 sec
    Start 3: idct_bitexact_1080p
    3/3 Test #3: idct_bitexact_1080p .............. Passed 0.06 sec
    100% tests passed, 0 tests failed out of 3
    $ ./build/test_idct_bitexact 1920 1088
    test_idct_bitexact: 1920x1088 (8160 MBs), seed=0xfeedface5a5a5a5a
    MB mix: 4080 4x4 MBs, 4080 8x8 MBs
    Y bytes total: 2088960
    Y bytes diff: 0 (0.0000%)
    Cb bytes total: 522240 diff: 0 (0.0000%)
    Cr bytes total: 522240 diff: 0 (0.0000%)
    BIT-EXACT PASS (Y + Cb + Cr)
(0.06 s when the shader pool is warm; ~0.2 s cold via the standalone
invocation above. In ctest the 1080p run happens after smoke, so the
pool is already primed by the time it runs.)
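The per-plane "bytes total / diff" lines in the output above could come from a byte-wise comparison along these lines. This is a sketch under assumptions, not the test binary's actual code: `count_diff_bytes` and `report_plane` are hypothetical names, and the print format only approximates the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count positions where the GPU output differs from the C reference. */
static size_t count_diff_bytes(const uint8_t *got, const uint8_t *ref,
                               size_t n)
{
    assert(got != NULL && ref != NULL);
    size_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff += (got[i] != ref[i]);
    return diff;
}

/* Print one summary line for a plane; returns 1 if it is bit-exact. */
static int report_plane(const char *name, const uint8_t *got,
                        const uint8_t *ref, size_t n)
{
    size_t diff = count_diff_bytes(got, ref, n);
    double pct = n ? 100.0 * (double)diff / (double)n : 0.0;
    printf("%s bytes total: %zu diff: %zu (%.4f%%)\n", name, n, diff, pct);
    return diff == 0;
}
```

A frame would pass only if `report_plane` returns 1 for Y, Cb and Cr. Counting every differing byte instead of stopping at the first mismatch keeps the percentage useful when diagnosing a partial failure.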
daedalus-decoder
Frame-level GPU H.264 decoder for Raspberry Pi 5 / V3D7. Design phase — not implemented yet.
The objective: build the NVDEC-equivalent shape on Pi 5. One Vulkan submit per frame, one fence wait per frame, encoded H.264 bitstream in, NV12 frame out. Reuses daedalus-fourier's V3D compute primitives at the right granularity — not the per-block-call granularity that the kernel-substitution prototype exposed as architecturally wrong.
Sibling projects:
- daedalus-fourier — V3D + NEON kernel pack (IDCT, MC, deblock primitives). Stays as research/microbench artifact.
- daedalus-v4l2 — V4L2 stateless decoder shim + userspace daemon for Pi 5. The eventual consumer of this decoder.
- libva-v4l2-request-fourier — VAAPI ↔ V4L2 stateless bridge. End consumer.
See DESIGN.md for the architecture sketch.
Description
Frame-level GPU H.264 decoder for Raspberry Pi 5 V3D7. NVDEC-shaped pipeline (encoded bitstream in, NV12 out, one Vulkan submit per frame) built on daedalus-fourier's V3D compute primitives. Phase 1 design exploration.