phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave)

Replaces the chroma placeholder (memset 128) with a real frame-scaled
4x4 IDCT dispatch for the Cb and Cr components. Two Vulkan submits +
waits per frame now (one luma, one chroma) instead of one + memset.

Implementation:
- One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr; a
  single `daedalus_recipe_dispatch_h264_idct4` call processes both
  components by setting meta[].dst_off accordingly (Cr blocks add
  cb_plane_size).
- Stride = W/2 (chroma row pitch); shared between Cb and Cr since they
  have identical geometry.
- Per-MB coeff layout already had [256..320) for Cb and [320..384) for
  Cr (4 raster-order 4x4 blocks per component) from the original
  daedalus_decoder_append_mb design, so no header-side changes.
- Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c] into
  NV12 UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well off the
  critical path; a GPU-side interleave shader is a Stage-5 optimisation.
- Chroma dispatch is gated on out_uv != NULL so callers that only want
  luma (e.g. the bit-exact test before this PR) still pay nothing.

Test changes:
- tests/test_idct_bitexact.c extended with parallel reference IDCT for
  Cb and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV back
  into Cb/Cr for the compare. Random coeffs in [-512, 511] for all 384
  per-MB int16 slots (previously only luma was randomised).
- tests/test_smoke.c UV expectation flipped from "all 128 placeholder"
  to "all 0" (real dispatch with zero coeffs). Sentinel 0xcd pre-fill
  stays, with the same purpose: catches read-then-write bugs.

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/2 Test #1: smoke ............................   Passed    1.27 sec
        Start 2: idct_bitexact
    2/2 Test #2: idct_bitexact ....................   Passed    0.05 sec

    100% tests passed, 0 tests failed out of 2

    $ ./build/test_idct_bitexact
    test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
    Y bytes total: 76800
    Y bytes diff: 0 (0.0000%)
    Cb bytes total: 19200 diff: 0 (0.0000%)
    Cr bytes total: 19200 diff: 0 (0.0000%)
    BIT-EXACT PASS (Y + Cb + Cr)

    $ ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-zero bytes: 0 / 1044480
    smoke OK

(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches. Shader
pool warm-up dominates the wall time, not the IDCT work.)

What's NOT covered yet (deferred):
- Chroma DC (2x2) / Intra16x16 luma DC (4x4) Hadamard pre-pass. Real
  H.264 chroma puts the per-block DC coefficients through a Hadamard
  before they are combined with the AC block; we currently treat all
  chroma blocks as plain 4x4 AC. Will land alongside the libavcodec
  intercept patch, since CABAC/CAVLC is where the DC vs AC distinction
  is exposed.
- Z-scan permutation for FFmpeg compatibility: only matters at the
  intercept boundary, not here.
- IDCT 8x8 (High profile).

Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.
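
For reviewers who want the addressing spelled out, here is a minimal standalone sketch of the planar Cb||Cr offset calculation and the NV12 interleave / deinterleave round trip described above. The helper names (chroma_dst_off, nv12_interleave, nv12_deinterleave) are hypothetical and are not part of the daedalus API; the sketch assumes frame dimensions that are multiples of 16 and a tightly packed scratch buffer with row pitch W/2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a daedalus API): flat byte offset of the
 * top-left pixel of one chroma 4x4 block inside a planar Cb||Cr scratch
 * buffer.  comp: 0 = Cb, 1 = Cr.  sb_x, sb_y: 0 or 1 within the MB. */
static size_t chroma_dst_off(size_t chroma_w, size_t chroma_h, int comp,
                             size_t mb_x, size_t mb_y, int sb_x, int sb_y)
{
    assert(comp == 0 || comp == 1);
    assert((sb_x == 0 || sb_x == 1) && (sb_y == 0 || sb_y == 1));
    size_t plane_base = (size_t) comp * chroma_w * chroma_h; /* Cr follows Cb */
    size_t px_y = mb_y * 8 + (size_t) sb_y * 4;  /* each MB is 8x8 in chroma */
    size_t px_x = mb_x * 8 + (size_t) sb_x * 4;
    return plane_base + px_y * chroma_w + px_x;
}

/* Planar Cb, Cr -> NV12 UV rows: uv[r][2c] = Cb[r][c], uv[r][2c+1] = Cr[r][c].
 * uv_stride is in bytes and must be at least 2 * chroma_w. */
static void nv12_interleave(uint8_t *uv, size_t uv_stride,
                            const uint8_t *cb, const uint8_t *cr,
                            size_t chroma_w, size_t chroma_h)
{
    for (size_t r = 0; r < chroma_h; r++) {
        uint8_t *dst = uv + r * uv_stride;
        for (size_t c = 0; c < chroma_w; c++) {
            dst[2 * c + 0] = cb[r * chroma_w + c];
            dst[2 * c + 1] = cr[r * chroma_w + c];
        }
    }
}

/* Inverse of nv12_interleave, as a test would use it before comparing
 * against per-plane reference output. */
static void nv12_deinterleave(uint8_t *cb, uint8_t *cr,
                              const uint8_t *uv, size_t uv_stride,
                              size_t chroma_w, size_t chroma_h)
{
    for (size_t r = 0; r < chroma_h; r++) {
        const uint8_t *src = uv + r * uv_stride;
        for (size_t c = 0; c < chroma_w; c++) {
            cb[r * chroma_w + c] = src[2 * c + 0];
            cr[r * chroma_w + c] = src[2 * c + 1];
        }
    }
}
```

For the 320x240 test frame (chroma 160x120), chroma_dst_off(160, 120, 1, 0, 0, 0, 0) evaluates to 19200, i.e. the first Cr block starts immediately after the Cb plane. Padding bytes beyond 2 * chroma_w in each UV row are left untouched by the interleave.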
2026-05-24 22:34:42 +02:00
parent 41306e48ee
commit 58848bd162
3 changed files with 238 additions and 52 deletions
@@ -128,25 +128,25 @@ int daedalus_decoder_append_mb(daedalus_decoder *dec,
    return 0;
 }

-/* Phase 1 stage 1 — frame-scaled IDCT 4x4 dispatch.
+/* Phase 1 stage 1 — frame-scaled IDCT 4x4 dispatch (luma + chroma).
 *
 * Brings up the GPU substrate by calling daedalus-fourier's existing
- * `daedalus_recipe_dispatch_h264_idct4` at frame batch granularity
- * (n_blocks = N_MBs × 16 luma 4×4 blocks per frame), in contrast to
- * the substitution-arc shim that called it with n_blocks = 1 per call.
- * ONE Vulkan submit + wait round-trip per frame instead of millions.
+ * `daedalus_recipe_dispatch_h264_idct4` at frame batch granularity in
+ * contrast to the substitution-arc shim that called it with
+ * n_blocks = 1 per call.  Two Vulkan submits + waits per frame (one
+ * luma, one chroma) instead of millions of per-block dispatches.
 *
 * What's done in this stage:
- *   - Build a per-frame luma-4x4 meta[] in raster order across all MBs
- *   - Repack the per-MB coeffs[] (384 int16, first 256 are luma) into
- *     a flat block-major coeffs buffer (n_blocks × 16 int16)
- *   - Allocate a frame-sized scratch Y plane (zero-initialised — no
- *     intra prediction yet, so "predicted" = 0)
- *   - Dispatch once via the recipe layer; the shader does
- *     clip255(predicted + idct(coeffs)), i.e. with predicted=0 it's
- *     clip255(idct(coeffs))
- *   - Copy the scratch Y plane to the caller's out_y at the requested
- *     stride
+ *   - Luma: build a per-frame meta[] in raster order (n_blocks =
+ *     N_MBs × 16); flat-pack coeffs from each MB's first 256 int16;
+ *     dispatch into a frame-sized zero-initialised Y scratch plane.
+ *   - Chroma: build an interleaved Cb+Cr meta[] (n_blocks = N_MBs × 8,
+ *     4 Cb + 4 Cr per MB); flat-pack coeffs from each MB's next 128
+ *     int16 (64 Cb + 64 Cr); dispatch into a planar Cb||Cr scratch
+ *     buffer (W*H/4 each, concatenated W*H/2 total); CPU-interleave
+ *     into the caller's NV12 UV plane post-dispatch.
+ *   - Both dispatches use predicted=0 (the scratch buffers are
+ *     calloc'd); the shader does clip255(predicted + idct(coeffs)).
 *
 * What's NOT done yet (follow-on Phase 1 sub-PRs):
 *   - Intra prediction (Stage 2a wavefront): predicted is forced to 0,
@@ -154,14 +154,17 @@ int daedalus_decoder_append_mb(daedalus_decoder *dec,
 *     Sufficient for Vulkan round-trip validation, not for bit-exact
 *     against FFmpeg.
 *   - Motion compensation (Stage 2b): inter MBs not handled.
- *   - High-profile IDCT 8x8 (Stage 1 extension)
- *   - Deblock (Stage 4)
- *   - Chroma planes — the daedalus-fourier idct4 shader is luma-only
- *     in this revision; chroma blocks (4×4, 4 cb + 4 cr per MB) need a
- *     separate dispatch with different meta/dst layout.  out_uv is
- *     filled with neutral grey (128) as placeholder.
+ *   - High-profile IDCT 8x8 (Stage 1 extension).
+ *   - Chroma DC / luma Intra16x16 DC Hadamard pre-pass (currently we
+ *     treat all chroma blocks as plain 4×4 AC IDCT; real decode needs
+ *     the chroma DC 2×2 Hadamard contribution folded in).
+ *   - Deblock (Stage 4).
 *   - dmabuf export — still memcpy-out to caller-provided planes.
 *   - Stage 5 RGBA opt-in.
+ *   - GPU-side NV12 interleave — currently a CPU memcpy loop after
+ *     the chroma dispatch.  Trivial cost (~1 MB / frame at 1080p)
+ *     vs the IDCT itself, but worth folding into a Stage-5 pass
+ *     later for full-GPU residency.
 */
 int daedalus_decoder_flush_frame(daedalus_decoder *dec,
                                  uint8_t *out_y,  size_t y_stride,
@@ -246,17 +249,116 @@ int daedalus_decoder_flush_frame(daedalus_decoder *dec,
        goto cleanup;
    }

-    /* ---- Copy out to caller's planes at the requested stride. ---- */
+    /* ---- Copy Y out to caller's plane at the requested stride. ---- */
    for (int r = 0; r < dec->height; r++)
        memcpy(out_y + (size_t) r * y_stride,
               &scratch_y[(size_t) r * y_stride_int],
               (size_t) dec->width);

-    /* Chroma placeholder: 128 = mid-grey (NV12 neutral).  Real chroma
-     * IDCT dispatch is the next sub-PR. */
+    /* ---- Build frame-scaled chroma 4×4 dispatch ---- */
+    /*
+     * 4:2:0 layout: chroma planes are (W/2) by (H/2), one Cb + one Cr
+     * sample per 2×2 luma block.  H.264 per-MB chroma is two 8×8
+     * components, each split into four 4×4 blocks: eight per MB.
+     *
+     * We dispatch BOTH components in a single shader call against a
+     * planar scratch buffer:
+     *     scratch_uv[0 .. cb_plane_size)                - Cb plane (W/2 × H/2)
+     *     scratch_uv[cb_plane_size .. uv_scratch_size)  - Cr plane (W/2 × H/2)
+     *
+     * meta[i].dst_off is a flat offset into the scratch buffer (the
+     * shader treats dst+dst_off as a contiguous 4×4 with row pitch =
+     * stride), so Cr blocks just add cb_plane_size to their offset.
+     * Stride is W/2 (the chroma row width); this works because Cb and
+     * Cr planes share the same row pitch.
+     *
+     * Post-dispatch we interleave the two planes into NV12 UV layout
+     * on the CPU.  Doing this on the GPU is a Stage-5 follow-up
+     * (would need a small "copy + interleave" shader); CPU memcpy
+     * loop is ~1 MB/frame at 1080p so it's not on the critical path.
+     */
+    int16_t *chroma_coeffs = NULL;
+    daedalus_h264_block_meta *chroma_meta = NULL;
+    uint8_t *scratch_uv = NULL;
    if (out_uv) {
-        for (int r = 0; r < dec->height / 2; r++)
-            memset(out_uv + (size_t) r * uv_stride, 128, (size_t) dec->width);
+        const size_t n_chroma_blocks_per_mb = 8;  /* 4 Cb + 4 Cr */
+        const size_t n_chroma_blocks =
+            (size_t) dec->n_mbs * n_chroma_blocks_per_mb;
+        const size_t chroma_w = (size_t) dec->width  / 2;
+        const size_t chroma_h = (size_t) dec->height / 2;
+        const size_t cb_plane_size = chroma_w * chroma_h;
+        const size_t uv_scratch_size = 2 * cb_plane_size;
+
+        scratch_uv    = calloc(1, uv_scratch_size);
+        chroma_coeffs = malloc(n_chroma_blocks * 16 * sizeof(int16_t));
+        chroma_meta   = malloc(n_chroma_blocks *
+                               sizeof(daedalus_h264_block_meta));
+        if (!scratch_uv || !chroma_coeffs || !chroma_meta) {
+            rc = -1;
+            goto chroma_cleanup;
+        }
+
+        size_t cbi = 0;
+        for (int mb_y = 0; mb_y < dec->mb_height; mb_y++) {
+            for (int mb_x = 0; mb_x < dec->mb_width; mb_x++) {
+                int mb_idx = mb_y * dec->mb_width + mb_x;
+                const int16_t *mb_coeffs = &dec->coeffs[(size_t) mb_idx * 384];
+                /* Per-MB coeff layout (set by append_mb):
+                 *   [  0 .. 256) — 16 luma 4×4 blocks
+                 *   [256 .. 320) — 4 Cb 4×4 blocks (raster sb_y*2+sb_x)
+                 *   [320 .. 384) — 4 Cr 4×4 blocks (raster sb_y*2+sb_x)
+                 */
+                for (int comp = 0; comp < 2; comp++) {           /* 0=Cb 1=Cr */
+                    size_t plane_base = (size_t) comp * cb_plane_size;
+                    size_t coeff_base = 256u + (size_t) comp * 64u;
+                    for (int sb_y = 0; sb_y < 2; sb_y++) {
+                        for (int sb_x = 0; sb_x < 2; sb_x++) {
+                            size_t px_y = (size_t) mb_y * 8 + (size_t) sb_y * 4;
+                            size_t px_x = (size_t) mb_x * 8 + (size_t) sb_x * 4;
+                            chroma_meta[cbi].dst_off = (uint32_t)
+                                (plane_base + px_y * chroma_w + px_x);
+
+                            int block_in_comp = sb_y * 2 + sb_x;
+                            memcpy(&chroma_coeffs[cbi * 16],
+                                   &mb_coeffs[coeff_base + (size_t) block_in_comp * 16],
+                                   16 * sizeof(int16_t));
+                            cbi++;
+                        }
+                    }
+                }
+            }
+        }
+        /* assert cbi == n_chroma_blocks; loop math guarantees it */
+
+        int cr_rc = daedalus_recipe_dispatch_h264_idct4(dec->dctx,
+                                                         scratch_uv, chroma_w,
+                                                         chroma_coeffs,
+                                                         n_chroma_blocks,
+                                                         chroma_meta);
+        if (cr_rc != 0) {
+            rc = -3;
+            goto chroma_cleanup;
+        }
+
+        /* CPU NV12 interleave: out_uv[r][2c+0] = Cb[r][c], [2c+1] = Cr. */
+        const uint8_t *cb_plane = scratch_uv;
+        const uint8_t *cr_plane = scratch_uv + cb_plane_size;
+        for (size_t r = 0; r < chroma_h; r++) {
+            uint8_t *dst_row = out_uv + r * uv_stride;
+            const uint8_t *cb_row = cb_plane + r * chroma_w;
+            const uint8_t *cr_row = cr_plane + r * chroma_w;
+            for (size_t c = 0; c < chroma_w; c++) {
+                dst_row[c * 2 + 0] = cb_row[c];
+                dst_row[c * 2 + 1] = cr_row[c];
+            }
+        }
+
+    chroma_cleanup:
+        free(chroma_meta);
+        free(chroma_coeffs);
+        free(scratch_uv);
+        if (rc != 0)
+            goto cleanup;
    }

 cleanup: