phase1/stage1: frame-scaled luma IDCT 4x4 dispatch (first GPU round-trip)

flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame-batch granularity, in contrast to the
substitution-arc shim, which paid Vulkan sync overhead per block.

What's wired:
- Build per-frame luma-4x4 meta[] in raster order across all MBs
  (N_MBs × 16 entries; 130,560 for 1080p)
- Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
  block-major coeffs buffer (n_blocks × 16 int16)
- Allocate a frame-sized scratch Y plane, zero-initialised; there is no
  intra prediction yet, so "predicted" = 0
- daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
  n_blocks, meta): ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
- Copy result to caller's out_y at requested stride
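
The first two steps above can be sketched as follows. This is a minimal illustration, not the code in this commit: the struct blk_meta_sketch layout (pixel position of each block's top-left corner) and the helper name pack_mb_luma are hypothetical, and the sketch assumes MB-major ordering with flat raster order inside each MB, which is one plausible reading of "raster order across all MBs".

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-block metadata. The real daedalus-fourier meta
 * layout is not shown in this commit; this is an assumption. */
struct blk_meta_sketch {
    uint16_t x; /* pixel x of the block's top-left corner */
    uint16_t y; /* pixel y of the block's top-left corner */
};

/* Pack the 16 luma 4x4 blocks of one macroblock into the per-frame
 * meta[] and flat coefficient arrays. mb_index is the MB's slot in
 * the frame (assumed raster order); mb_coeffs holds 384 int16, of
 * which the first 256 are luma (16 blocks x 16 coefficients). */
void pack_mb_luma(int mb_index, int mb_x, int mb_y,
                  const int16_t mb_coeffs[384],
                  struct blk_meta_sketch *meta, int16_t *flat)
{
    for (int b = 0; b < 16; b++) {
        size_t blk = (size_t) mb_index * 16 + (size_t) b;

        /* Flat raster order inside the MB: 4 blocks per row. */
        meta[blk].x = (uint16_t) (mb_x * 16 + (b % 4) * 4);
        meta[blk].y = (uint16_t) (mb_y * 16 + (b / 4) * 4);

        /* Block-major output: 16 contiguous int16 per block. */
        memcpy(&flat[blk * 16], &mb_coeffs[b * 16],
               16 * sizeof(int16_t));
    }
}
```

Calling this once per appended MB yields n_blocks = N_MBs × 16 entries in both arrays, which is the shape the single dispatch call consumes.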

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

    $ time ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-128 bytes: 0 / 1044480
    smoke OK

    real 0m0.163s

163 ms wall for a full 1080p frame, including ctx-create (Vulkan init).
Per-block dispatch via the substitution arc would have paid
130,560 × ~50 us = ~6.5 s on the same workload, so dispatching at the
right granularity gives a ~40x speedup.
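
The arithmetic behind that estimate can be reproduced with a few lines of C. The helper names below are illustrative only, and the ~50 us per-dispatch sync cost is the rough figure quoted above, not a new measurement.

```c
#include <stdio.h>

/* Number of luma 4x4 blocks in a frame: 16 per macroblock. */
long luma4x4_blocks(int mb_width, int mb_height)
{
    return (long) mb_width * mb_height * 16;
}

/* Estimated wall time in seconds if every block paid one
 * submit-and-wait round-trip of sync_us microseconds. */
double per_block_estimate_s(long n_blocks, double sync_us)
{
    return (double) n_blocks * sync_us * 1e-6;
}

/* Print the comparison for the 1920x1088 smoke-test frame. */
void print_dispatch_estimate(void)
{
    long n_blocks = luma4x4_blocks(120, 68);
    double per_block_s = per_block_estimate_s(n_blocks, 50.0);
    double measured_s = 0.163; /* wall time quoted above */

    printf("blocks: %ld\n", n_blocks);
    printf("per-block estimate: %.2f s\n", per_block_s);
    printf("ratio vs measured: %.0fx\n", per_block_s / measured_s);
}
```

Because the measured 163 ms also covers Vulkan initialisation, the ratio is a conservative estimate of the dispatch-only gain.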

Smoke validates:
- flush_frame succeeds (rc=0) on a complete frame
- Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
- UV plane filled with neutral grey 128 (placeholder until chroma
  dispatch lands)
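
As a CPU-side illustration of why zero coefficients over a zero prediction give zero pixels, here is a reference sketch of the standard H.264 4x4 inverse transform with add-and-clip. It is not the GPU shader from daedalus-fourier; the function name idct4x4_add_ref and the row-major coefficient layout are assumptions made for this example, and right shifts of negative values are assumed to be arithmetic, as on common compilers.

```c
#include <stdint.h>

/* Saturate to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return (uint8_t) (v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Reference 4x4 inverse transform: adds the residual to the
 * prediction already stored in dst. Coefficients are assumed
 * row-major here (an assumption for this sketch). */
void idct4x4_add_ref(uint8_t *dst, int stride, const int16_t coeffs[16])
{
    int tmp[16];

    /* Horizontal pass over each row of coefficients. */
    for (int i = 0; i < 4; i++) {
        const int16_t *r = &coeffs[i * 4];
        int z0 = r[0] + r[2];
        int z1 = r[0] - r[2];
        int z2 = (r[1] >> 1) - r[3];
        int z3 = r[1] + (r[3] >> 1);
        tmp[i * 4 + 0] = z0 + z3;
        tmp[i * 4 + 1] = z1 + z2;
        tmp[i * 4 + 2] = z1 - z2;
        tmp[i * 4 + 3] = z0 - z3;
    }

    /* Vertical pass over each column, then round, add, clip. */
    for (int j = 0; j < 4; j++) {
        int z0 = tmp[0 * 4 + j] + tmp[2 * 4 + j];
        int z1 = tmp[0 * 4 + j] - tmp[2 * 4 + j];
        int z2 = (tmp[1 * 4 + j] >> 1) - tmp[3 * 4 + j];
        int z3 = tmp[1 * 4 + j] + (tmp[3 * 4 + j] >> 1);
        int col[4] = { z0 + z3, z1 + z2, z1 - z2, z0 - z3 };

        for (int i = 0; i < 4; i++) {
            uint8_t *p = &dst[i * stride + j];
            *p = clip255(*p + ((col[i] + 32) >> 6));
        }
    }
}
```

With all-zero coefficients every intermediate is zero and (0 + 32) >> 6 is 0, so dst is left unchanged; over a zero-initialised scratch plane that is the all-zero Y output the smoke test checks.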

What's deliberately deferred to follow-on sub-PRs:
- Intra prediction wavefront (Stage 2a): predicted=0 means output
  pixels are residual-only, not a valid frame decode. Sufficient for
  Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8x8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4x4 IDCT: needs a separate dispatch with chroma stride
- Z-scan permutation of per-MB 4x4 block order (currently flat
  raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
  Bit-exact against FFmpeg requires this permutation; deferred to
  the test-vector PR.
- dmabuf export (still memcpy-out)
- Stage 5 RGBA opt-in
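
For reference, the deferred z-scan permutation could look roughly like the sketch below. It follows the luma 4x4 block scan of H.264 §6.4.3 (blocks visited in 2x2 groups within each 8x8 quadrant); the function names are hypothetical and nothing here is part of this commit.

```c
#include <stdint.h>
#include <string.h>

/* Map a z-scan luma 4x4 block index (0..15) to the pixel offset of
 * its top-left corner inside the 16x16 macroblock. */
void zscan_to_xy(int idx, int *x, int *y)
{
    *x = ((idx / 4) % 2) * 8 + ((idx % 4) % 2) * 4;
    *y = ((idx / 4) / 2) * 8 + ((idx % 4) / 2) * 4;
}

/* Map a z-scan block index to the flat raster index (row * 4 + col). */
int zscan_to_raster(int idx)
{
    int x, y;
    zscan_to_xy(idx, &x, &y);
    return (y / 4) * 4 + (x / 4);
}

/* Reorder the 256 luma coefficients of one MB from z-scan block
 * order (src) to flat raster block order (dst). */
void reorder_luma_zscan_to_raster(const int16_t src[256],
                                  int16_t dst[256])
{
    for (int idx = 0; idx < 16; idx++) {
        memcpy(&dst[zscan_to_raster(idx) * 16], &src[idx * 16],
               16 * sizeof(int16_t));
    }
}
```

An alternative that avoids the copy is to keep the coefficients in z-scan order and apply the same mapping when computing each block's position in meta[].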

API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub. Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2). Rebase on main after #2
lands; the diff is purely additive against the scaffold.
@@ -95,6 +95,67 @@ int main(void)
 
     daedalus_decoder_destroy(dec);
 
+    /* ---- Full-frame round-trip with all-zero coefficients.
+     * Phase 1 stage 1 validation: flush_frame builds the per-frame IDCT
+     * dispatch and a successful GPU round-trip returns 0. IDCT of
+     * all-zero coefficients with zero-initialised predicted = all-zero
+     * output pixels. */
+    dec = daedalus_decoder_create(1920, 1088);
+    if (!dec) {
+        fprintf(stderr, "SKIP roundtrip: ctx create failed\n");
+        return 0;
+    }
+
+    static int16_t zero_coeffs[384] = {0};
+    struct daedalus_decoder_mb_input zmb = {0};
+    zmb.coeffs = zero_coeffs;
+
+    int mb_width = 1920 / 16; /* 120 */
+    int mb_height = 1088 / 16; /* 68 */
+    int n_mbs = mb_width * mb_height;
+
+    for (int mby = 0; mby < mb_height; mby++) {
+        for (int mbx = 0; mbx < mb_width; mbx++) {
+            zmb.mb_x = (uint16_t) mbx;
+            zmb.mb_y = (uint16_t) mby;
+            if (daedalus_decoder_append_mb(dec, &zmb) != 0) {
+                fprintf(stderr, "append (%d, %d) failed\n", mbx, mby);
+                return 1;
+            }
+        }
+    }
+    printf("appended %d MBs (%dx%d)\n", n_mbs, mb_width, mb_height);
+
+    size_t y_size = (size_t) 1920 * 1088;
+    size_t uv_size = (size_t) 1920 * 1088 / 2;
+    uint8_t *out_y = malloc(y_size);
+    uint8_t *out_uv = malloc(uv_size);
+    /* Pre-fill with sentinel so any read-then-write bug becomes visible. */
+    memset(out_y, 0xab, y_size);
+    memset(out_uv, 0xcd, uv_size);
+
+    int frc = daedalus_decoder_flush_frame(dec, out_y, 1920, out_uv, 1920);
+    printf("flush_frame rc=%d\n", frc);
+    EXPECT(frc == 0, "flush succeeds on full frame");
+
+    /* Y plane should be all zero (clip255(IDCT(zeros)) = 0). */
+    int y_nz = 0;
+    for (size_t i = 0; i < y_size; i++)
+        if (out_y[i] != 0) y_nz++;
+    printf("Y non-zero bytes: %d / %zu\n", y_nz, y_size);
+    EXPECT(y_nz == 0, "Y plane all zero for zero-coeff frame");
+
+    /* UV plane should be neutral grey (128) per Phase 1 placeholder. */
+    int uv_wrong = 0;
+    for (size_t i = 0; i < uv_size; i++)
+        if (out_uv[i] != 128) uv_wrong++;
+    printf("UV non-128 bytes: %d / %zu\n", uv_wrong, uv_size);
+    EXPECT(uv_wrong == 0, "UV plane is grey (128) Phase 1 placeholder");
+
+    free(out_y);
+    free(out_uv);
+    daedalus_decoder_destroy(dec);
+
     printf("smoke OK\n");
     return 0;
 }