phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr, NV12 interleave) #5

2026-05-24T20:34:53Z

marfrit commented

2026-05-24 20:34:53 +00:00

Replaces the chroma UV memset(128) placeholder with a real frame-scaled IDCT dispatch over a combined Cb||Cr planar scratch buffer. Single recipe call, CPU-side NV12 interleave post-dispatch.

Bit-exact PASS on hertz for Y + Cb + Cr (4800 luma + 1200 Cb + 1200 Cr blocks at 320×240). Smoke at 1080p (8160 MBs, 130,560 luma + 65,280 chroma blocks across two dispatches) completes in 1.27s including pool warm-up. Full commit message has the test transcript and the list of what is NOT yet covered (chroma DC Hadamard, z-scan permutation, IDCT 8×8).

marfrit added 1 commit 2026-05-24 20:34:56 +00:00

phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave) 58848bd162

Replaces the chroma placeholder (memset 128) with a real frame-scaled
4x4 IDCT dispatch for the Cb and Cr components.  Two Vulkan submits +
waits per frame now (one luma, one chroma) instead of one + memset.

Implementation:

  - One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr;
    a single `daedalus_recipe_dispatch_h264_idct4` call processes both
    components by setting meta[].dst_off accordingly (Cr blocks add
    cb_plane_size).
  - Stride = W/2 (chroma row pitch); shared between Cb and Cr since
    they have identical geometry.
  - Per-MB coeff layout already had [256..320) for Cb and [320..384)
    for Cr (4 raster-order 4x4 blocks per component) from the original
    daedalus_decoder_append_mb design — no header-side changes.
  - Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c]
    into NV12 UV at out_uv[r][2c..2c+1].  ~1 MB/frame at 1080p, well
    off the critical path; a GPU-side interleave shader is a Stage-5
    optimisation.
  - Chroma dispatch is gated on out_uv != NULL so callers that only
    want luma (e.g. the bit-exact test before this PR) still pay
    nothing.
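The scratch layout and the interleave described above can be sketched roughly as follows. This is an illustration, not the decoder's code: `chroma_block_dst_off` and `interleave_nv12_uv` are hypothetical names, and the sketch assumes `dst_off` is the byte offset of a block's top-left sample in the combined Cb||Cr buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of chroma 4x4 block `blk` (0..3, raster
 * order) of macroblock (mbx, mby) inside the combined Cb||Cr scratch buffer.
 * Assumption: Cr starts cb_plane_size bytes after Cb and shares its stride. */
static size_t chroma_block_dst_off(unsigned w, unsigned h,
                                   unsigned mbx, unsigned mby,
                                   unsigned blk, int is_cr)
{
    assert(w % 16 == 0 && h % 16 == 0 && blk < 4);
    size_t stride = w / 2;                      /* chroma row pitch */
    size_t cb_plane_size = stride * (h / 2);    /* W*H/4 bytes per component */
    size_t x = (size_t)mbx * 8 + (blk & 1) * 4; /* 8x8 chroma samples per MB */
    size_t y = (size_t)mby * 8 + (blk >> 1) * 4;
    return (is_cr ? cb_plane_size : 0) + y * stride + x;
}

/* Hypothetical helper: interleave planar Cb and Cr (each W/2 x H/2, tightly
 * packed) into NV12 UV rows, so that out_uv[r][2c] = Cb and [2c+1] = Cr. */
static void interleave_nv12_uv(const uint8_t *scratch, unsigned w, unsigned h,
                               uint8_t *out_uv, size_t uv_stride)
{
    size_t cw = w / 2, ch = h / 2;
    const uint8_t *cb = scratch;
    const uint8_t *cr = scratch + cw * ch;
    for (size_t r = 0; r < ch; r++) {
        uint8_t *dst = out_uv + r * uv_stride;
        for (size_t c = 0; c < cw; c++) {
            dst[2 * c]     = cb[r * cw + c];
            dst[2 * c + 1] = cr[r * cw + c];
        }
    }
}
```

For a 32x32 frame the chroma stride is 16 and each plane is 256 bytes, so block 3 of the Cr component of MB (1, 1) would land at offset 256 + 12*16 + 12 = 460 under these assumptions.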

Test changes:

  - tests/test_idct_bitexact.c now runs a parallel reference IDCT for
    the Cb and Cr planes (W/2 x H/2 each) and deinterleaves the NV12
    UV output back into Cb/Cr for the compare.  Random coeffs in
    [-512, 511] for all 384 per-MB int16 slots (previously only luma
    was randomised).
  - tests/test_smoke.c UV expectation flipped from "all 128 placeholder"
    to "all 0" (real dispatch with zero coeffs).  Sentinel 0xcd
    pre-fill stays — same purpose: catches read-then-write bugs.
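The compare step in the bit-exact test amounts to splitting the NV12 UV output back into planes and counting mismatching bytes against the reference. A minimal sketch of that idea follows; `deinterleave_nv12_uv` and `count_diff_bytes` are hypothetical names, not the test's actual functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split NV12 UV rows (Cb at even bytes, Cr at odd
 * bytes) into two tightly packed cw x ch planes. */
static void deinterleave_nv12_uv(const uint8_t *uv, size_t uv_stride,
                                 size_t cw, size_t ch,
                                 uint8_t *cb, uint8_t *cr)
{
    assert(uv && cb && cr);
    for (size_t r = 0; r < ch; r++) {
        const uint8_t *src = uv + r * uv_stride;
        for (size_t c = 0; c < cw; c++) {
            cb[r * cw + c] = src[2 * c];
            cr[r * cw + c] = src[2 * c + 1];
        }
    }
}

/* Hypothetical helper: number of byte positions where a and b differ.
 * A bit-exact pass means this returns 0 for every plane. */
static size_t count_diff_bytes(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff += (a[i] != b[i]);
    return diff;
}
```

With a sentinel pre-fill such as 0xcd in the output buffer, any byte the dispatch failed to write would show up as a mismatch in the same count.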

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
    Start 1: smoke
  1/2 Test #1: smoke ............................   Passed    1.27 sec
    Start 2: idct_bitexact
  2/2 Test #2: idct_bitexact ....................   Passed    0.05 sec

  100% tests passed, 0 tests failed out of 2

  $ ./build/test_idct_bitexact
  test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
  Y bytes total:  76800
  Y bytes diff:   0 (0.0000%)
  Cb bytes total: 19200  diff: 0 (0.0000%)
  Cr bytes total: 19200  diff: 0 (0.0000%)
  BIT-EXACT PASS (Y + Cb + Cr)

  $ ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  appended 8160 MBs (120x68)
  flush_frame rc=0
  Y non-zero bytes: 0 / 2088960
  UV non-zero bytes: 0 / 1044480
  smoke OK

(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches —
shader pool warm-up dominates the wall time, not the IDCT work.)
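The block and byte counts quoted in the transcripts follow directly from the frame geometry (16x16 macroblocks, 16 luma and 8 chroma 4x4 blocks per MB, 4:2:0 sampling). A small sketch for checking them; `frame_counts` and `count_frame` are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of per-frame work for a 4:2:0 frame. */
struct frame_counts {
    size_t mbs;           /* 16x16 macroblocks */
    size_t luma_blocks;   /* 16 4x4 blocks per MB */
    size_t chroma_blocks; /* 4 Cb + 4 Cr 4x4 blocks per MB */
    size_t y_bytes;       /* W * H */
    size_t uv_bytes;      /* interleaved Cb+Cr, W * H / 2 */
};

static struct frame_counts count_frame(unsigned w, unsigned h)
{
    /* Dimensions are assumed to be MB-aligned, e.g. 1920x1088 for 1080p. */
    assert(w % 16 == 0 && h % 16 == 0);
    struct frame_counts fc;
    fc.mbs = (size_t)(w / 16) * (h / 16);
    fc.luma_blocks = fc.mbs * 16;
    fc.chroma_blocks = fc.mbs * 8;
    fc.y_bytes = (size_t)w * h;
    fc.uv_bytes = fc.y_bytes / 2;
    return fc;
}
```

For 1920x1088 this gives 8160 MBs, 130,560 luma blocks, 65,280 chroma blocks, 2,088,960 Y bytes and 1,044,480 UV bytes, matching the smoke output; 320x240 gives 300 MBs, 4800 luma and 2400 chroma blocks.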

What's NOT covered yet (deferred):

  - Chroma DC (2x2) and Intra16x16 luma DC (4x4) Hadamard pre-pass.
    Real H.264 chroma gathers the DC coefficient of each 4x4 block in
    a component and runs them through a 2x2 Hadamard before each one
    rejoins its AC block; we currently treat all chroma blocks as
    plain 4x4 AC.  Will land alongside the libavcodec intercept patch,
    since CABAC/CAVLC is where the DC vs AC distinction is exposed.
  - Z-scan permutation for FFmpeg compatibility — only matters at the
    intercept boundary, not here.
  - IDCT 8x8 (High profile).
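For orientation, the deferred chroma DC step for 4:2:0 reduces to a 2x2 Hadamard butterfly over the four DC coefficients of one component. The sketch below shows only that butterfly; `chroma_dc_hadamard2x2` is a hypothetical name, the dequantisation scaling that follows it in the standard is omitted, and nothing like this exists in the decoder yet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: in-place 2x2 Hadamard, out = H * in * H with
 * H = [[1, 1], [1, -1]].  dc[] holds the four chroma DC coefficients of
 * one component in raster order: {c00, c01, c10, c11}. */
static void chroma_dc_hadamard2x2(int32_t dc[4])
{
    assert(dc != 0);
    int32_t a = dc[0] + dc[1];
    int32_t b = dc[0] - dc[1];
    int32_t c = dc[2] + dc[3];
    int32_t d = dc[2] - dc[3];
    dc[0] = a + c;
    dc[1] = b + d;
    dc[2] = a - c;
    dc[3] = b - d;
}
```

Applying the transform twice returns the input scaled by 4, which makes a convenient self-check when this pre-pass is eventually wired in.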

Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.

marfrit merged commit 5fa495964d into main

2026-05-24 20:36:51 +00:00

marfrit deleted branch noether/phase1-stage1-chroma

2026-05-24 20:36:51 +00:00

Reference: marfrit/daedalus-decoder#5