phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave)
Replaces the chroma placeholder (memset 128) with a real frame-scaled
4x4 IDCT dispatch for the Cb and Cr components. Two Vulkan submits +
waits per frame now (one luma, one chroma) instead of one + memset.
Implementation:
- One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr;
a single `daedalus_recipe_dispatch_h264_idct4` call processes both
components by setting meta[].dst_off accordingly (Cr blocks add
cb_plane_size).
- Stride = W/2 (chroma row pitch); shared between Cb and Cr since
they have identical geometry.
- Per-MB coeff layout already had [256..320) for Cb and [320..384)
for Cr (4 raster-order 4x4 blocks per component) from the original
daedalus_decoder_append_mb design — no header-side changes.
- Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c]
into NV12 UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well
off the critical path; a GPU-side interleave shader is a Stage-5
optimisation.
- Chroma dispatch is gated on out_uv != NULL so callers that only
want luma (e.g. the bit-exact test before this PR) still pay
nothing.
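The offset math and the interleave described above can be sketched in isolation. This is a minimal illustration, not the code in this commit: `chroma_dst_off` and `nv12_interleave` are hypothetical helper names, and the sketch assumes W and H are multiples of 16 and that the scratch buffer holds Cb followed by Cr with row pitch W/2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: flat offset of one chroma 4x4 block inside the
 * planar Cb-then-Cr scratch buffer. comp is 0 for Cb, 1 for Cr;
 * sb_x / sb_y select one of the four 4x4 blocks inside the MB's 8x8. */
static size_t chroma_dst_off(size_t chroma_w, size_t chroma_h,
                             size_t mb_x, size_t mb_y,
                             int comp, int sb_x, int sb_y)
{
    size_t plane_base = (size_t) comp * chroma_w * chroma_h; /* Cr sits after Cb */
    size_t px_y = mb_y * 8 + (size_t) sb_y * 4;
    size_t px_x = mb_x * 8 + (size_t) sb_x * 4;
    return plane_base + px_y * chroma_w + px_x;
}

/* Hypothetical helper: interleave planar Cb and Cr into an NV12 UV
 * plane, so that out_uv[r][2c] = Cb[r][c] and out_uv[r][2c+1] = Cr[r][c]. */
static void nv12_interleave(uint8_t *out_uv, size_t uv_stride,
                            const uint8_t *cb, const uint8_t *cr,
                            size_t chroma_w, size_t chroma_h)
{
    for (size_t r = 0; r < chroma_h; r++) {
        uint8_t *dst = out_uv + r * uv_stride;
        for (size_t c = 0; c < chroma_w; c++) {
            dst[2 * c + 0] = cb[r * chroma_w + c];
            dst[2 * c + 1] = cr[r * chroma_w + c];
        }
    }
}
```

For a 320x240 frame (chroma 160x120), the Cr block at MB (0,1), sub-block (0,1) lands at 19200 + 12 * 160 = 21120 in the scratch buffer.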
Test changes:
- tests/test_idct_bitexact.c extended with parallel reference IDCT
for Cb and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV
back into Cb/Cr for the compare. Random coeffs in [-512, 511] for
all 384 per-MB int16 slots (previously only luma was randomised).
- tests/test_smoke.c UV expectation flipped from "all 128 placeholder"
to "all 0" (real dispatch with zero coeffs). The 0xcd sentinel
pre-fill stays for the same reason as before: it shows the expected
bytes were written by flush_frame, not left over in the buffer.
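For readers without the test source at hand, a reference 4x4 inverse transform of the kind the bit-exact test relies on could look roughly like the sketch below. The test's own `ref_idct4_add` is not shown in this commit, so this is an assumption-laden illustration: the name `sketch_idct4_add` is hypothetical, coefficients are assumed row-major (`coef[4*row + col]`), and right shifts of negative values are assumed to be arithmetic (true on the usual compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t) (v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Sketch of the H.264 4x4 inverse integer transform, added onto dst.
 * Assumption: row-major coefficients; the real test may use another layout. */
static void sketch_idct4_add(uint8_t *dst, ptrdiff_t stride, const int16_t *coef)
{
    int t[16];

    /* Horizontal 1-D pass over each row. */
    for (int i = 0; i < 4; i++) {
        int a = coef[4 * i + 0], b = coef[4 * i + 1];
        int c = coef[4 * i + 2], d = coef[4 * i + 3];
        int z0 = a + c, z1 = a - c;
        int z2 = (b >> 1) - d, z3 = b + (d >> 1);
        t[4 * i + 0] = z0 + z3;
        t[4 * i + 1] = z1 + z2;
        t[4 * i + 2] = z1 - z2;
        t[4 * i + 3] = z0 - z3;
    }

    /* Vertical 1-D pass over each column, then round, shift, add, clip. */
    for (int j = 0; j < 4; j++) {
        int a = t[j], b = t[4 + j], c = t[8 + j], d = t[12 + j];
        int z0 = a + c, z1 = a - c;
        int z2 = (b >> 1) - d, z3 = b + (d >> 1);
        int r[4] = { z0 + z3, z1 + z2, z1 - z2, z0 - z3 };
        for (int i = 0; i < 4; i++) {
            uint8_t *p = dst + i * stride + j;
            *p = clip255(*p + ((r[i] + 32) >> 6));
        }
    }
}
```

With a zeroed destination and a DC-only block of 64, every output pixel is (64 + 32) >> 6 = 1, which is a quick sanity check before running random coefficients.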
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
$ ctest --test-dir build --output-on-failure
Start 1: smoke
1/2 Test #1: smoke ............................ Passed 1.27 sec
Start 2: idct_bitexact
2/2 Test #2: idct_bitexact .................... Passed 0.05 sec
100% tests passed, 0 tests failed out of 2
$ ./build/test_idct_bitexact
test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
Y bytes total: 76800
Y bytes diff: 0 (0.0000%)
Cb bytes total: 19200 diff: 0 (0.0000%)
Cr bytes total: 19200 diff: 0 (0.0000%)
BIT-EXACT PASS (Y + Cb + Cr)
$ ./build/test_smoke
daedalus-decoder version: 0.0.1
ctx created: 1920x1088, has_qpu=1
appended 8160 MBs (120x68)
flush_frame rc=0
Y non-zero bytes: 0 / 2088960
UV non-zero bytes: 0 / 1044480
smoke OK
(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches —
shader pool warm-up dominates the wall time, not the IDCT work.)
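The block and byte counts quoted above follow directly from the frame size. A small sketch of that arithmetic, assuming dimensions that are multiples of 16 (`frame_counts` and `count_frame` are hypothetical names used only for this illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the per-frame counts discussed above. */
typedef struct {
    size_t n_mbs;         /* 16x16 macroblocks per frame */
    size_t luma_blocks;   /* 16 luma 4x4 blocks per MB */
    size_t chroma_blocks; /* 4 Cb + 4 Cr 4x4 blocks per MB */
    size_t y_bytes;       /* W * H */
    size_t uv_bytes;      /* W * H / 2 (NV12 interleaved UV) */
} frame_counts;

/* Assumes width and height are multiples of 16 (e.g. 1920x1088). */
static frame_counts count_frame(size_t width, size_t height)
{
    frame_counts fc;
    fc.n_mbs = (width / 16) * (height / 16);
    fc.luma_blocks = fc.n_mbs * 16;
    fc.chroma_blocks = fc.n_mbs * 8;
    fc.y_bytes = width * height;
    fc.uv_bytes = width * height / 2;
    return fc;
}
```

For 1920x1088 this gives 8160 MBs, 130,560 luma blocks, 65,280 chroma blocks, 2,088,960 Y bytes and 1,044,480 UV bytes, matching the smoke-test output.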
What's NOT covered yet (deferred):
- Chroma DC (2x2 Hadamard) / Intra16x16 luma DC (4x4 Hadamard)
pre-pass. Real H.264 sends the DC coefficients of these blocks
through a separate Hadamard transform before each one rejoins its
block's AC coefficients; we currently treat all chroma blocks as
plain 4x4 AC. Will land alongside the libavcodec intercept patch,
since CABAC/CAVLC is where the DC vs AC distinction is exposed.
- Z-scan permutation for FFmpeg compatibility — only matters at the
intercept boundary, not here.
- IDCT 8x8 (High profile).
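To make the deferred chroma DC item concrete, here is a rough sketch of the 2x2 Hadamard on the four chroma DC coefficients of one component (4:2:0). It only shows the butterfly; the dequantisation scaling that surrounds it in a real decoder is omitted, and the function name is hypothetical, not part of this codebase.

```c
#include <assert.h>

/* Hypothetical helper: 2x2 Hadamard H * C * H with H = [[1, 1], [1, -1]].
 * in/out are row-major 2x2 matrices: {c00, c01, c10, c11}.
 * Dequant scaling is intentionally left out of this sketch. */
static void chroma_dc_hadamard2x2(const int in[4], int out[4])
{
    int a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = a + b + c + d;
    out[1] = a - b + c - d;
    out[2] = a + b - c - d;
    out[3] = a - b - c + d;
}
```

Each output would then become the DC term of one of the four 4x4 chroma blocks before the regular 4x4 IDCT runs; how that is wired into the per-MB coeff layout is left open here.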
Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.
+128
-26
@@ -128,25 +128,25 @@ int daedalus_decoder_append_mb(daedalus_decoder *dec,
     return 0;
 }
 
-/* Phase 1 stage 1 — frame-scaled IDCT 4x4 dispatch.
+/* Phase 1 stage 1 — frame-scaled IDCT 4x4 dispatch (luma + chroma).
  *
  * Brings up the GPU substrate by calling daedalus-fourier's existing
- * `daedalus_recipe_dispatch_h264_idct4` at frame batch granularity
- * (n_blocks = N_MBs × 16 luma 4×4 blocks per frame), in contrast to
- * the substitution-arc shim that called it with n_blocks = 1 per call.
- * ONE Vulkan submit + wait round-trip per frame instead of millions.
+ * `daedalus_recipe_dispatch_h264_idct4` at frame batch granularity in
+ * contrast to the substitution-arc shim that called it with
+ * n_blocks = 1 per call. Two Vulkan submits + waits per frame (one
+ * luma, one chroma) instead of millions of per-block dispatches.
  *
  * What's done in this stage:
- * - Build a per-frame luma-4x4 meta[] in raster order across all MBs
- * - Repack the per-MB coeffs[] (384 int16, first 256 are luma) into
- *   a flat block-major coeffs buffer (n_blocks × 16 int16)
- * - Allocate a frame-sized scratch Y plane (zero-initialised — no
- *   intra prediction yet, so "predicted" = 0)
- * - Dispatch once via the recipe layer; the shader does
- *   clip255(predicted + idct(coeffs)), i.e. with predicted=0 it's
- *   clip255(idct(coeffs))
- * - Copy the scratch Y plane to the caller's out_y at the requested
- *   stride
+ * - Luma: build a per-frame meta[] in raster order (n_blocks =
+ *   N_MBs × 16); flat-pack coeffs from each MB's first 256 int16;
+ *   dispatch into a frame-sized zero-initialised Y scratch plane.
+ * - Chroma: build an interleaved Cb+Cr meta[] (n_blocks = N_MBs × 8,
+ *   4 Cb + 4 Cr per MB); flat-pack coeffs from each MB's next 128
+ *   int16 (64 Cb + 64 Cr); dispatch into a planar Cb||Cr scratch
+ *   buffer (W*H/4 each, concatenated W*H/2 total); CPU-interleave
+ *   into the caller's NV12 UV plane post-dispatch.
+ * - Both dispatches use predicted=0 (the scratch buffers are
+ *   calloc'd); the shader does clip255(predicted + idct(coeffs)).
  *
  * What's NOT done yet (follow-on Phase 1 sub-PRs):
  * - Intra prediction (Stage 2a wavefront): predicted is forced to 0,
@@ -154,14 +154,17 @@ int daedalus_decoder_append_mb(daedalus_decoder *dec,
  *   Sufficient for Vulkan round-trip validation, not for bit-exact
  *   against FFmpeg.
  * - Motion compensation (Stage 2b): inter MBs not handled.
- * - High-profile IDCT 8x8 (Stage 1 extension)
- * - Deblock (Stage 4)
- * - Chroma planes — the daedalus-fourier idct4 shader is luma-only
- *   in this revision; chroma blocks (4×4, 4 cb + 4 cr per MB) need a
- *   separate dispatch with different meta/dst layout. out_uv is
- *   filled with neutral grey (128) as placeholder.
+ * - High-profile IDCT 8x8 (Stage 1 extension).
+ * - Chroma DC / luma Intra16x16 DC Hadamard pre-pass (currently we
+ *   treat all chroma blocks as plain 4×4 AC IDCT; real decode needs
+ *   the chroma DC 2×2 Hadamard contribution folded in).
+ * - Deblock (Stage 4).
  * - dmabuf export — still memcpy-out to caller-provided planes.
  * - Stage 5 RGBA opt-in.
+ * - GPU-side NV12 interleave — currently a CPU memcpy loop after
+ *   the chroma dispatch. Trivial cost (~1 MB / frame at 1080p)
+ *   vs the IDCT itself, but worth folding into a Stage-5 pass
+ *   later for full-GPU residency.
  */
 int daedalus_decoder_flush_frame(daedalus_decoder *dec,
                                  uint8_t *out_y, size_t y_stride,
@@ -246,17 +249,116 @@ int daedalus_decoder_flush_frame(daedalus_decoder *dec,
         goto cleanup;
     }
 
-    /* ---- Copy out to caller's planes at the requested stride. ---- */
+    /* ---- Copy Y out to caller's plane at the requested stride. ---- */
     for (int r = 0; r < dec->height; r++)
         memcpy(out_y + (size_t) r * y_stride,
                &scratch_y[(size_t) r * y_stride_int],
                (size_t) dec->width);
 
-    /* Chroma placeholder: 128 = mid-grey (NV12 neutral). Real chroma
-     * IDCT dispatch is the next sub-PR. */
+    /* ---- Build frame-scaled chroma 4×4 dispatch ---- */
+    /*
+     * 4:2:0 layout — chroma planes are (W/2) by (H/2), one Cb + one
+     * Cr per pixel pair. H.264 per-MB chroma is two 8×8 components,
+     * each split into 4 4×4 blocks, so 8 chroma 4×4 blocks per MB.
+     *
+     * We dispatch BOTH components in a single shader call against a
+     * planar scratch buffer:
+     *   scratch_uv[0 .. cb_plane_size) — Cb plane (W/2 × H/2)
+     *   scratch_uv[cb_plane_size .. 2*size) — Cr plane (W/2 × H/2)
+     *
+     * meta[i].dst_off is a flat offset into the scratch buffer (the
+     * shader treats dst+dst_off as a contiguous 4×4 with row pitch =
+     * stride), so Cr blocks just add cb_plane_size to their offset.
+     * Stride is W/2 (the chroma row width); this works because Cb and
+     * Cr planes share the same row pitch.
+     *
+     * Post-dispatch we interleave the two planes into NV12 UV layout
+     * on the CPU. Doing this on the GPU is a Stage-5 follow-up
+     * (would need a small "copy + interleave" shader); CPU memcpy
+     * loop is ~1 MB/frame at 1080p so it's not on the critical path.
+     */
+    int16_t *chroma_coeffs = NULL;
+    daedalus_h264_block_meta *chroma_meta = NULL;
+    uint8_t *scratch_uv = NULL;
     if (out_uv) {
-        for (int r = 0; r < dec->height / 2; r++)
-            memset(out_uv + (size_t) r * uv_stride, 128, (size_t) dec->width);
+        const size_t n_chroma_blocks_per_mb = 8; /* 4 Cb + 4 Cr */
+        const size_t n_chroma_blocks =
+            (size_t) dec->n_mbs * n_chroma_blocks_per_mb;
+        const size_t chroma_w = (size_t) dec->width / 2;
+        const size_t chroma_h = (size_t) dec->height / 2;
+        const size_t cb_plane_size = chroma_w * chroma_h;
+        const size_t uv_scratch_size = 2 * cb_plane_size;
+
+        scratch_uv = calloc(1, uv_scratch_size);
+        chroma_coeffs = malloc(n_chroma_blocks * 16 * sizeof(int16_t));
+        chroma_meta = malloc(n_chroma_blocks *
+                             sizeof(daedalus_h264_block_meta));
+        if (!scratch_uv || !chroma_coeffs || !chroma_meta) {
+            rc = -1;
+            goto chroma_cleanup;
+        }
+
+        size_t cbi = 0;
+        for (int mb_y = 0; mb_y < dec->mb_height; mb_y++) {
+            for (int mb_x = 0; mb_x < dec->mb_width; mb_x++) {
+                int mb_idx = mb_y * dec->mb_width + mb_x;
+                const int16_t *mb_coeffs = &dec->coeffs[(size_t) mb_idx * 384];
+                /* Per-MB coeff layout (set by append_mb):
+                 *   [ 0 .. 256) — 16 luma 4×4 blocks
+                 *   [256 .. 320) — 4 Cb 4×4 blocks (raster sb_y*2+sb_x)
+                 *   [320 .. 384) — 4 Cr 4×4 blocks (raster sb_y*2+sb_x)
+                 */
+                for (int comp = 0; comp < 2; comp++) { /* 0=Cb 1=Cr */
+                    size_t plane_base = (size_t) comp * cb_plane_size;
+                    size_t coeff_base = 256u + (size_t) comp * 64u;
+                    for (int sb_y = 0; sb_y < 2; sb_y++) {
+                        for (int sb_x = 0; sb_x < 2; sb_x++) {
+                            size_t px_y = (size_t) mb_y * 8 + (size_t) sb_y * 4;
+                            size_t px_x = (size_t) mb_x * 8 + (size_t) sb_x * 4;
+                            chroma_meta[cbi].dst_off = (uint32_t)
+                                (plane_base + px_y * chroma_w + px_x);
+
+                            int block_in_comp = sb_y * 2 + sb_x;
+                            memcpy(&chroma_coeffs[cbi * 16],
+                                   &mb_coeffs[coeff_base + (size_t) block_in_comp * 16],
+                                   16 * sizeof(int16_t));
+                            cbi++;
+                        }
+                    }
+                }
+            }
+        }
+        /* assert cbi == n_chroma_blocks; loop math guarantees it */
+
+        int cr_rc = daedalus_recipe_dispatch_h264_idct4(dec->dctx,
+                                                        scratch_uv, chroma_w,
+                                                        chroma_coeffs,
+                                                        n_chroma_blocks,
+                                                        chroma_meta);
+        if (cr_rc != 0) {
+            rc = -3;
+            goto chroma_cleanup;
+        }
+
+        /* CPU NV12 interleave: out_uv[r][2c+0] = Cb[r][c], [2c+1] = Cr. */
+        const uint8_t *cb_plane = scratch_uv;
+        const uint8_t *cr_plane = scratch_uv + cb_plane_size;
+        for (size_t r = 0; r < chroma_h; r++) {
+            uint8_t *dst_row = out_uv + r * uv_stride;
+            const uint8_t *cb_row = cb_plane + r * chroma_w;
+            const uint8_t *cr_row = cr_plane + r * chroma_w;
+            for (size_t c = 0; c < chroma_w; c++) {
+                dst_row[c * 2 + 0] = cb_row[c];
+                dst_row[c * 2 + 1] = cr_row[c];
+            }
+        }
+
+    chroma_cleanup:
+        free(chroma_meta);
+        free(chroma_coeffs);
+        free(scratch_uv);
+        if (rc != 0)
+            goto cleanup;
     }
 
 cleanup:
+101
-21
@@ -19,9 +19,14 @@
  * layout is a separate concern (handled in the eventual libavcodec-
  * intercept patch).
  *
+ * Covers BOTH luma (Y plane, 16 blocks/MB) and chroma (UV plane,
+ * 4 Cb + 4 Cr blocks/MB, NV12-interleaved). Random coeffs for all
+ * three components; reference IDCT applied per block. The chroma
+ * compare deinterleaves NV12 UV back into separate Cb/Cr expectations.
+ *
  * Not in scope (covered by other tests / future PRs):
- * - chroma planes (Phase 1 stage 1 fills UV with grey 128)
  * - IDCT 8×8 (Phase 1 follow-on)
+ * - Chroma DC / Intra16x16 DC Hadamard pre-pass
  * - bit-exactness against real H.264 streams (test-vector PR)
  * - non-zero predicted pixels (intra prediction lands in Stage 2a)
  */
@@ -120,10 +125,9 @@ int main(int argc, char **argv)
 
     for (int mb = 0; mb < n_mbs; mb++) {
         for (int i = 0; i < 384; i++) {
-            if (i < 256)
-                per_mb_coeffs[mb][i] = (int16_t)((int)(xs64() % 1024) - 512);
-            else
-                per_mb_coeffs[mb][i] = 0; /* chroma — unused this stage */
+            /* Random coeffs in [-512, 511] for all of luma + Cb + Cr.
+             * Same range as the daedalus-fourier cycle-6 M1 gate. */
+            per_mb_coeffs[mb][i] = (int16_t)((int)(xs64() % 1024) - 512);
         }
     }
 
@@ -142,12 +146,16 @@ int main(int argc, char **argv)
         }
     }
 
-    /* Flush. */
-    size_t y_size = (size_t) width * height;
-    uint8_t *gpu_y = calloc(1, y_size);
-    if (!gpu_y) return 1;
+    /* Flush — exercise BOTH the luma path (out_y) and the chroma path
+     * (out_uv set to non-NULL so flush_frame runs the chroma dispatch
+     * + NV12 interleave). */
+    size_t y_size = (size_t) width * height;
+    size_t uv_size = (size_t) width * height / 2;
+    uint8_t *gpu_y = calloc(1, y_size);
+    uint8_t *gpu_uv = calloc(1, uv_size);
+    if (!gpu_y || !gpu_uv) return 1;
     int frc = daedalus_decoder_flush_frame(dec, gpu_y, (size_t) width,
-                                           NULL, 0);
+                                           gpu_uv, (size_t) width);
     if (frc != 0) {
         fprintf(stderr, "flush_frame rc=%d\n", frc);
         return 1;
@@ -180,29 +188,101 @@ int main(int argc, char **argv)
         }
     }
 
-    /* Byte-by-byte compare. */
-    size_t diffs = 0;
-    size_t first_diff = 0;
+    /* Build the chroma reference: separate planar Cb and Cr (W/2 by
+     * H/2), each block IDCT'd into its plane. Chroma per-MB layout
+     * matches flush_frame: 4 Cb blocks then 4 Cr blocks, raster order
+     * within each component (sb_y * 2 + sb_x). */
+    size_t chroma_w = (size_t) width / 2;
+    size_t chroma_h = (size_t) height / 2;
+    size_t chroma_plane_size = chroma_w * chroma_h;
+    uint8_t *ref_cb = calloc(1, chroma_plane_size);
+    uint8_t *ref_cr = calloc(1, chroma_plane_size);
+    if (!ref_cb || !ref_cr) return 1;
+    for (int my = 0; my < mb_h; my++) {
+        for (int mx = 0; mx < mb_w; mx++) {
+            int mb_idx = my * mb_w + mx;
+            for (int comp = 0; comp < 2; comp++) {
+                uint8_t *plane = (comp == 0) ? ref_cb : ref_cr;
+                size_t coeff_base = 256u + (size_t) comp * 64u;
+                for (int sb_y = 0; sb_y < 2; sb_y++) {
+                    for (int sb_x = 0; sb_x < 2; sb_x++) {
+                        int block_in_comp = sb_y * 2 + sb_x;
+                        memcpy(block_scratch,
+                               &per_mb_coeffs[mb_idx][coeff_base +
+                                   (size_t) block_in_comp * 16],
+                               16 * sizeof(int16_t));
+                        size_t px_y = (size_t) my * 8 + (size_t) sb_y * 4;
+                        size_t px_x = (size_t) mx * 8 + (size_t) sb_x * 4;
+                        ref_idct4_add(&plane[px_y * chroma_w + px_x],
+                                      (ptrdiff_t) chroma_w, block_scratch);
+                    }
+                }
+            }
+        }
+    }
+
+    /* Y compare. */
+    size_t y_diffs = 0, y_first_diff = 0;
     for (size_t i = 0; i < y_size; i++) {
         if (gpu_y[i] != ref_y[i]) {
-            if (diffs == 0) first_diff = i;
-            diffs++;
+            if (y_diffs == 0) y_first_diff = i;
+            y_diffs++;
         }
     }
     printf("Y bytes total: %zu\n", y_size);
-    printf("Y bytes diff: %zu (%.4f%%)\n", diffs, 100.0 * diffs / y_size);
-    if (diffs) {
-        printf("first diff at offset %zu: gpu=%u ref=%u\n",
-               first_diff, gpu_y[first_diff], ref_y[first_diff]);
+    printf("Y bytes diff: %zu (%.4f%%)\n", y_diffs, 100.0 * y_diffs / y_size);
+    if (y_diffs) {
+        printf("Y first diff at offset %zu: gpu=%u ref=%u\n",
+               y_first_diff, gpu_y[y_first_diff], ref_y[y_first_diff]);
     }
 
+    /* UV compare — deinterleave NV12 back into Cb/Cr and compare. */
+    size_t cb_diffs = 0, cr_diffs = 0;
+    size_t cb_first = 0, cr_first = 0;
+    for (size_t r = 0; r < chroma_h; r++) {
+        const uint8_t *gpu_row = gpu_uv + r * (size_t) width;
+        const uint8_t *cb_row = ref_cb + r * chroma_w;
+        const uint8_t *cr_row = ref_cr + r * chroma_w;
+        for (size_t c = 0; c < chroma_w; c++) {
+            uint8_t gpu_cb = gpu_row[c * 2 + 0];
+            uint8_t gpu_cr = gpu_row[c * 2 + 1];
+            if (gpu_cb != cb_row[c]) {
+                if (cb_diffs == 0) cb_first = r * chroma_w + c;
+                cb_diffs++;
+            }
+            if (gpu_cr != cr_row[c]) {
+                if (cr_diffs == 0) cr_first = r * chroma_w + c;
+                cr_diffs++;
+            }
+        }
+    }
+    printf("Cb bytes total: %zu diff: %zu (%.4f%%)\n",
+           chroma_plane_size, cb_diffs,
+           100.0 * cb_diffs / chroma_plane_size);
+    printf("Cr bytes total: %zu diff: %zu (%.4f%%)\n",
+           chroma_plane_size, cr_diffs,
+           100.0 * cr_diffs / chroma_plane_size);
+    if (cb_diffs) {
+        size_t r = cb_first / chroma_w, c = cb_first % chroma_w;
+        printf("Cb first diff at (%zu,%zu): gpu=%u ref=%u\n",
+               r, c, gpu_uv[r * (size_t) width + c * 2 + 0], ref_cb[cb_first]);
+    }
+    if (cr_diffs) {
+        size_t r = cr_first / chroma_w, c = cr_first % chroma_w;
+        printf("Cr first diff at (%zu,%zu): gpu=%u ref=%u\n",
+               r, c, gpu_uv[r * (size_t) width + c * 2 + 1], ref_cr[cr_first]);
+    }
+
+    free(ref_cr);
+    free(ref_cb);
     free(ref_y);
+    free(gpu_uv);
     free(gpu_y);
     free(per_mb_coeffs);
     daedalus_decoder_destroy(dec);
 
-    if (diffs == 0) {
-        printf("BIT-EXACT PASS\n");
+    if (y_diffs == 0 && cb_diffs == 0 && cr_diffs == 0) {
+        printf("BIT-EXACT PASS (Y + Cb + Cr)\n");
         return 0;
     }
     fprintf(stderr, "BIT-EXACT FAIL\n");
+9
-5
@@ -145,12 +145,16 @@ int main(void)
     printf("Y non-zero bytes: %d / %zu\n", y_nz, y_size);
     EXPECT(y_nz == 0, "Y plane all zero for zero-coeff frame");
 
-    /* UV plane should be neutral grey (128) per Phase 1 placeholder. */
-    int uv_wrong = 0;
+    /* UV plane should be all zero now (real chroma IDCT runs with
+     * zero coeffs → zero residual → clip255(0+0) = 0). Previously a
+     * 128 placeholder when chroma was a memset stub; this PR replaced
+     * that with the real dispatch. Sentinel 0xcd above guarantees we
+     * are observing post-dispatch writes, not the leftover memset. */
+    int uv_nz = 0;
     for (size_t i = 0; i < uv_size; i++)
-        if (out_uv[i] != 128) uv_wrong++;
-    printf("UV non-128 bytes: %d / %zu\n", uv_wrong, uv_size);
-    EXPECT(uv_wrong == 0, "UV plane is grey (128) Phase 1 placeholder");
+        if (out_uv[i] != 0) uv_nz++;
+    printf("UV non-zero bytes: %d / %zu\n", uv_nz, uv_size);
+    EXPECT(uv_nz == 0, "UV plane all zero for zero-coeff frame");
 
     free(out_y);
     free(out_uv);