phase1: IDCT 8x8 dispatch (High profile transform_size_8x8_flag) #6
Adds the High-profile 8x8 luma transform path alongside the existing 4x4 dispatch. `flush_frame` now partitions MBs by each MB's new `transform_8x8` flag and issues one luma dispatch per non-empty partition.

Bit-exact PASS on hertz for a mixed-mode frame (150 4x4 MBs + 150 8x8 MBs, 600 8x8 blocks tested). Validates both the daedalus-fourier IDCT 8x8 shader end-to-end through `flush_frame` and the 8x8 layout assumptions (column-major coeffs, raster `sb_y*2+sb_x` order). 8x8 C reference transcribed from the daedalus-fourier reference per H.264 §8.5.13.2.
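The per-MB partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `struct mb_flag_sketch`, `struct mb_partition_sketch`, `partition_mbs` and `count_luma_dispatches` are hypothetical names, and the real `flush_frame` may organise its index lists differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-MB input; only the flag matters here. */
struct mb_flag_sketch {
    uint8_t transform_8x8; /* 0 = 4x4 transform, 1 = 8x8 transform */
};

/* Number of MBs that landed in each luma partition. */
struct mb_partition_sketch {
    size_t n4; /* MBs using the 4x4 path */
    size_t n8; /* MBs using the 8x8 path */
};

/* Split raster MB indices into two lists according to the per-MB flag.
 * idx4 and idx8 must each have room for n_mbs entries. */
static struct mb_partition_sketch
partition_mbs(const struct mb_flag_sketch *mbs, size_t n_mbs,
              uint32_t *idx4, uint32_t *idx8)
{
    struct mb_partition_sketch p = {0, 0};

    assert(mbs && idx4 && idx8);
    for (size_t i = 0; i < n_mbs; i++) {
        if (mbs[i].transform_8x8)
            idx8[p.n8++] = (uint32_t)i;
        else
            idx4[p.n4++] = (uint32_t)i;
    }
    return p;
}

/* One luma dispatch per non-empty partition; empty partitions are skipped. */
static int count_luma_dispatches(struct mb_partition_sketch p)
{
    return (p.n4 > 0) + (p.n8 > 0);
}
```

With the alternating pattern used by the test frame (odd raster MBs flagged 8x8), 300 MBs split into 150 + 150 and both dispatches are issued; a stream with every flag at 0 would issue only the 4x4 dispatch.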
New `uint8_t transform_8x8` field at the end of `struct daedalus_decoder_mb_input` (defaults to 0 = existing 4x4 behaviour). ABI is pre-0.1.

Full commit message has the test transcript and the list of what is NOT yet covered (z-scan permutation, chroma DC Hadamard).
Adds the High-profile 8x8 luma transform path alongside the existing 4x4 dispatch. flush_frame now partitions macroblocks by each MB's transform_8x8 flag and issues a separate luma dispatch per partition:

- mb.transform_8x8 == 0 (Baseline/Main) → coeffs[0..256) interpreted as 16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4 (existing behaviour, unchanged).
- mb.transform_8x8 == 1 (High) → coeffs[0..256) interpreted as 4 8x8 blocks (64 int16 each, column-major), fed to the new daedalus_recipe_dispatch_h264_idct8 call.

Both luma partitions can be non-empty in the same frame (FFmpeg sets the flag per-MB). Each non-empty partition costs one vkQueueSubmit + vkQueueWaitIdle; empty partitions are skipped (common case: Baseline streams skip the 8x8 dispatch entirely).

Chroma is unchanged; 4:2:0 chroma always uses the 4x4 transform.

API surface:

- New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input` (after deblock_*). Backwards-compatible at the source level because the field defaults to 0 with C99 designated initialisers or {0} struct zeroing, both of which select the existing 4x4 path. ABI is pre-0.1 (per the header doc) so structural change is fine.
- Mirrored in `struct daedalus_decoder_mb_desc` (internal layout).

Test changes:

- test_idct_bitexact now exercises a mixed-mode frame: every odd raster MB uses 8x8, every even uses 4x4 (so flush_frame's partitioning is also under test, not just the underlying shaders).
- 8x8 C reference (h264_idct8_butterfly + ref_idct8_add) transcribed from daedalus-fourier tests/h264_idct8_ref.c per H.264 §8.5.13.2. Block layout column-major; +32 >> 6 rounding; add-to-predicted; clip255.
- Reference luma compute branches per MB on the same mb_8x8[] array used to set the input flag.
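The last stage of the reference named above (+32 >> 6 rounding, add-to-predicted, clip255) could look roughly like the sketch below. It does not reproduce the butterfly itself. `reconstruct_sample` and `add_block8x8` are hypothetical names, and the column-major indexing of the transformed residual (`c * 8 + r`) is an assumption carried over from the stated coefficient layout; the actual ref_idct8_add may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Round the transformed residual with +32 >> 6, add it to the predicted
 * sample, and clip. Relies on arithmetic right shift for negative values,
 * which mainstream compilers provide but ISO C leaves implementation-defined. */
static uint8_t reconstruct_sample(uint8_t predicted, int32_t transformed)
{
    return clip255((int)predicted + (int)((transformed + 32) >> 6));
}

/* Apply the final stage to one 8x8 block in place.
 * ASSUMPTION: res is column-major (index = c * 8 + r); dst is a raster
 * plane holding the predicted samples, with `stride` bytes per row. */
static void add_block8x8(uint8_t *dst, size_t stride, const int32_t res[64])
{
    assert(dst && res && stride >= 8);
    for (int c = 0; c < 8; c++) {
        for (int r = 0; r < 8; r++) {
            uint8_t *p = &dst[(size_t)r * stride + (size_t)c];
            *p = reconstruct_sample(*p, res[c * 8 + r]);
        }
    }
}
```

In the residual-only test setup (predicted = 0), any negative rounded residual clips to 0, so the comparison against the shader output also exercises the lower clip bound.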
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ./build/test_idct_bitexact
    test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
    MB mix: 150 4x4 MBs, 150 8x8 MBs
    Y bytes total: 76800
    Y bytes diff: 0 (0.0000%)
    Cb bytes total: 19200 diff: 0 (0.0000%)
    Cr bytes total: 19200 diff: 0 (0.0000%)
    BIT-EXACT PASS (Y + Cb + Cr)

    $ ctest --test-dir build
    100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try for the 8x8 path: 150 8x8 MBs × 4 blocks = 600 8x8 IDCTs against the spec C reference, identical output. Validates both the daedalus-fourier IDCT 8x8 shader (already gated by its own cycle-7 bit-exact test, now also gated end-to-end through our flush_frame), and our 8x8 layout assumptions (column-major coeffs, raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8).

What's NOT covered yet (deferred):

- Z-scan permutation for FFmpeg compatibility (libavcodec intercept patch's concern; both 4x4 and 8x8 z-scans differ).
- Chroma DC / luma Intra16x16 DC Hadamard pre-pass.
- Mixed intra/inter MB handling: currently all MBs treated as residual-only (predicted=0).

Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.
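The 8x8 layout assumptions listed above (raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8) can be read per axis as in the following sketch. `struct block8_pos` and `locate_block8` are hypothetical names, and the `sb * 64` coefficient offset is an assumption that follows from "4 8x8 blocks (64 int16 each)" being stored back to back in coeffs[0..256).

```c
#include <assert.h>
#include <stdint.h>

/* Where one 8x8 luma block lives: pixel position of its top-left sample
 * in the frame, and the offset of its coefficients within the MB. */
struct block8_pos {
    uint32_t x;            /* top-left pixel column */
    uint32_t y;            /* top-left pixel row */
    uint32_t coeff_offset; /* first int16 index inside coeffs[0..256) */
};

/* mb_index is the raster MB index, mb_width the frame width in MBs,
 * sb the 8x8 block index inside the MB, with sb = sb_y * 2 + sb_x. */
static struct block8_pos
locate_block8(uint32_t mb_index, uint32_t mb_width, uint32_t sb)
{
    struct block8_pos pos;
    uint32_t mb_x, mb_y, sb_x, sb_y;

    assert(mb_width > 0 && sb < 4);
    mb_x = mb_index % mb_width;
    mb_y = mb_index / mb_width;
    sb_x = sb & 1;  /* raster order: 0 = left, 1 = right */
    sb_y = sb >> 1; /* 0 = top, 1 = bottom */

    pos.x = mb_x * 16 + sb_x * 8;
    pos.y = mb_y * 16 + sb_y * 8;
    pos.coeff_offset = sb * 64; /* ASSUMPTION: blocks stored contiguously */
    return pos;
}
```

For the 320x240 test frame (20 MBs wide), MB 21 with sb = 3 maps to the block whose top-left pixel is (24, 24).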