phase1: IDCT 8x8 dispatch (High profile transform_size_8x8_flag) #6

Merged
marfrit merged 1 commit from noether/phase1-idct8 into main 2026-05-24 20:45:27 +00:00
Owner

Adds the High-profile 8x8 luma transform path alongside the existing 4x4 dispatch. flush_frame now partitions MBs by each MB's new transform_8x8 flag and issues one luma dispatch per non-empty partition.

Bit-exact PASS on hertz for a mixed-mode frame (150 4x4 MBs + 150 8x8 MBs, 600 8x8 blocks tested). Validates both the daedalus-fourier IDCT 8x8 shader end-to-end through flush_frame and the 8x8 layout assumptions (column-major coeffs, raster sb_y*2+sb_x order). 8x8 C reference transcribed from the daedalus-fourier reference per H.264 §8.5.13.2.

New uint8_t transform_8x8 field at the end of struct daedalus_decoder_mb_input (defaults to 0 = existing 4x4 behaviour). ABI is pre-0.1.

Full commit message has the test transcript and the list of what is NOT yet covered (z-scan permutation, chroma DC Hadamard).

marfrit added 1 commit 2026-05-24 20:41:19 +00:00
Adds the High-profile 8x8 luma transform path alongside the existing
4x4 dispatch.  flush_frame now partitions macroblocks by each MB's
transform_8x8 flag and issues a separate luma dispatch per partition:

  - mb.transform_8x8 == 0 (every Baseline/Main MB, and any High MB
    that keeps the 4x4 transform) → coeffs[0..256) interpreted
    as 16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4
    (existing behaviour, unchanged).
  - mb.transform_8x8 == 1 (High only) → coeffs[0..256) interpreted
    as 4 8x8 blocks (64 int16 each, column-major), fed to the new
    daedalus_recipe_dispatch_h264_idct8 call.

Both luma partitions can be non-empty in the same frame (FFmpeg sets
the flag per-MB).  Each non-empty partition costs one
vkQueueSubmit + vkQueueWaitIdle; empty partitions are skipped
(common case: Baseline streams skip the 8x8 dispatch entirely).
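The partitioning step described above can be sketched roughly as follows. This is a minimal illustration, not the actual flush_frame code: `struct mb_input_sketch`, `partition_by_transform` and `luma_dispatch_count` are hypothetical names, and the real implementation may store its partitions differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-MB input; only the flag matters here. */
struct mb_input_sketch {
    uint8_t transform_8x8; /* 0 = 4x4 luma transform, 1 = 8x8 luma transform */
};

/* Split MB indices into a 4x4 list and an 8x8 list, preserving raster order.
 * Both output arrays must have room for n entries. */
static void partition_by_transform(const struct mb_input_sketch *mbs, size_t n,
                                   uint32_t *idx4, size_t *n4,
                                   uint32_t *idx8, size_t *n8)
{
    *n4 = 0;
    *n8 = 0;
    for (size_t i = 0; i < n; i++) {
        if (mbs[i].transform_8x8)
            idx8[(*n8)++] = (uint32_t)i;
        else
            idx4[(*n4)++] = (uint32_t)i;
    }
}

/* One luma dispatch per non-empty partition; empty partitions cost nothing. */
static int luma_dispatch_count(size_t n4, size_t n8)
{
    return (n4 > 0) + (n8 > 0);
}
```

With a Baseline stream every MB lands in the 4x4 list, so `luma_dispatch_count` returns 1 and the 8x8 submit is never issued.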

Chroma is unchanged — 4:2:0 chroma always uses the 4x4 transform.

API surface:
  - New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input`
    (after deblock_*).  Backwards-compatible at the source level
    because the field defaults to 0 with C99 designated initialisers
    or {0} struct zeroing, both of which select the existing 4x4
    path.  The ABI is pre-0.1 (per the header doc), so a structural
    change is acceptable.
  - Mirrored in `struct daedalus_decoder_mb_desc` (internal layout).
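To illustrate the source-compatibility argument above, here is a small sketch. `struct mb_input_demo` is a hypothetical, cut-down stand-in for `struct daedalus_decoder_mb_input` (the real struct has more fields); only the zero-default behaviour of the new field is the point.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the real input struct. */
struct mb_input_demo {
    int16_t coeffs[256];   /* luma coefficients for one macroblock */
    uint8_t transform_8x8; /* new field; 0 selects the existing 4x4 path */
};

/* Caller written before the field existed: {0} zeroing leaves the flag at 0. */
static struct mb_input_demo make_legacy_mb(void)
{
    struct mb_input_demo mb = {0};
    return mb;
}

/* Caller that names other fields only: C99 zero-fills every omitted member. */
static struct mb_input_demo make_legacy_designated_mb(void)
{
    struct mb_input_demo mb = { .coeffs = { 1 } };
    return mb;
}

/* New caller opting in to the 8x8 path. */
static struct mb_input_demo make_high_mb(void)
{
    struct mb_input_demo mb = { .transform_8x8 = 1 };
    return mb;
}
```

Callers that fill the struct member by member without zeroing it first would not get this guarantee; the argument only covers the two initialisation styles named above.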

Test changes:

  - test_idct_bitexact now exercises a mixed-mode frame: every odd
    raster MB uses 8x8, every even uses 4x4 (so flush_frame's
    partitioning is also under test, not just the underlying shaders).
  - 8x8 C reference (h264_idct8_butterfly + ref_idct8_add)
    transcribed from daedalus-fourier tests/h264_idct8_ref.c per
    H.264 §8.5.13.2.  Block layout column-major; +32 >> 6 rounding;
    add-to-predicted; clip255.
  - Reference luma compute branches per MB on the same mb_8x8[]
    array used to set the input flag.
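The last stage of the reference listed above (+32 >> 6 rounding, add-to-predicted, clip255) can be sketched per sample as below. This covers only the final reconstruction step, not the 8x8 butterfly itself; `clip255` follows the name used above, while `reconstruct_sample` is a hypothetical helper, not a function from daedalus-fourier.

```c
#include <stdint.h>

/* Clamp an integer to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* v    : one sample after both butterfly passes (still scaled by 64)
 * pred : the predicted pixel at the same position
 * Assumes >> on a negative int is an arithmetic shift, which holds on the
 * usual compilers but is implementation-defined in ISO C. */
static uint8_t reconstruct_sample(int v, uint8_t pred)
{
    return clip255(pred + ((v + 32) >> 6));
}
```

In a residual-only test with predicted = 0, any negative residual clips to 0, so the clip stage is exercised even without a prediction path.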

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/test_idct_bitexact
  test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
  MB mix: 150 4x4 MBs, 150 8x8 MBs
  Y bytes total:  76800
  Y bytes diff:   0 (0.0000%)
  Cb bytes total: 19200  diff: 0 (0.0000%)
  Cr bytes total: 19200  diff: 0 (0.0000%)
  BIT-EXACT PASS (Y + Cb + Cr)

  $ ctest --test-dir build
  100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try for the 8x8 path — 150 8x8 MBs × 4 blocks =
600 8x8 IDCTs against the spec C reference, identical output.
Validates both the daedalus-fourier IDCT 8x8 shader (already gated
by its own cycle-7 bit-exact test, now also gated end-to-end through
our flush_frame), and our 8x8 layout assumptions (column-major coeffs,
raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8).
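The layout assumptions named above can be written out as index arithmetic. This is a sketch under one reading of them: `block8_origin` and `coeff_index8` are hypothetical helpers, and column-major is taken to mean that consecutive coefficients walk down a column (index = col*8 + row within a block).

```c
#include <stdint.h>

struct px {
    int x, y;
};

/* Top-left luma pixel of 8x8 sub-block sb (0..3, raster order sb_y*2+sb_x)
 * inside macroblock mb_index of a frame that is mbs_per_row MBs wide. */
static struct px block8_origin(int mb_index, int mbs_per_row, int sb)
{
    int mb_x = mb_index % mbs_per_row;
    int mb_y = mb_index / mbs_per_row;
    int sb_x = sb & 1;
    int sb_y = sb >> 1;
    struct px p = { mb_x * 16 + sb_x * 8, mb_y * 16 + sb_y * 8 };
    return p;
}

/* Offset into the MB's coeffs[0..256) for block sb, given row and column
 * within the block, under the column-major assumption stated above. */
static int coeff_index8(int sb, int row, int col)
{
    return sb * 64 + col * 8 + row;
}
```

For the 320x240 test frame (20 MBs per row), sub-block 3 of MB 21 starts at pixel (24, 24).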

What's NOT covered yet (deferred):

  - Z-scan permutation for FFmpeg compatibility (libavcodec intercept
    patch's concern; both 4x4 and 8x8 z-scans differ).
  - Chroma DC / luma Intra16x16 DC Hadamard pre-pass.
  - Mixed intra/inter MB handling — currently all MBs treated as
    residual-only (predicted=0).

Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.
marfrit merged commit 72ee154b36 into main 2026-05-24 20:45:27 +00:00
marfrit deleted branch noether/phase1-idct8 2026-05-24 20:45:29 +00:00

Reference: marfrit/daedalus-decoder#6