phase1/stage1: frame-scaled luma IDCT 4×4 — first GPU round-trip #3
Stacked on PR #2 (now merged). First real GPU dispatch in daedalus-decoder.
The win
163 ms wall for a full 1080p frame round-trip including Vulkan init. Per-block dispatch via the substitution arc would have paid 130,560 × ~50 µs sync = ~6.5 s on the same workload. ~40× speedup from the right dispatch granularity — the structural fix predicted in the design doc.
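The arithmetic behind that estimate can be checked with a few lines of C. This is only a sketch: the helper names are invented for illustration and are not part of daedalus-decoder, and the ~50 µs per-dispatch sync cost is the estimate quoted above, not something this code measures.

```c
/* Number of 16x16 macroblocks covering a frame (dimensions rounded up). */
static unsigned mb_count(unsigned width, unsigned height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Luma 4x4 blocks in a frame: 16 per macroblock. */
static unsigned luma4x4_count(unsigned width, unsigned height)
{
    return mb_count(width, height) * 16;
}

/* Estimated sync cost in seconds if every block pays one submit + wait.
 * sync_us is an assumed per-dispatch cost in microseconds. */
static double per_block_sync_seconds(unsigned n_blocks, double sync_us)
{
    return n_blocks * sync_us * 1e-6;
}
```

For a 1920×1080 frame, `mb_count` gives 8160 MBs (120 × 68) and `luma4x4_count` gives 130,560 blocks. At 50 µs each that is about 6.5 s of sync overhead, roughly 40 times the measured 0.163 s.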
What is wired
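daedalus_decoder_flush_frame():
- Builds the per-frame luma-4x4 meta[] in raster order across all MBs (N_MBs × 16 entries; 130,560 for 1080p)
- Repacks per-MB coeffs[] (384 int16; first 256 luma) into a flat block-major buffer (n_blocks × 16 int16)
- Calls daedalus_recipe_dispatch_h264_idct4() once: one vkQueueSubmit + vkQueueWaitIdle
What is deliberately deferred (follow-on sub-PRs)
- Intra prediction wavefront (Stage 2a)
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8x8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4x4 IDCT
- Z-scan permutation of per-MB 4x4 block order
- dmabuf export
- Stage 5 RGBA opt-in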
API surface unchanged from scaffold PR; only flush_frame body becomes non-stub.
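The meta-build and repack steps listed under "What is wired" might look roughly like the sketch below. Everything named here is an assumption made for illustration: `blk_meta_sketch` is a hypothetical stand-in (the real daedalus-fourier meta layout is not shown in this PR), and `repack_luma` is not the actual file-local helper. The block order inside each MB is flat raster, matching the current behaviour described in this PR.

```c
#include <stdint.h>
#include <string.h>

enum {
    COEFFS_PER_MB    = 384, /* per-MB coeffs[] length; first 256 are luma */
    COEFFS_PER_BLOCK = 16,  /* one 4x4 block */
    BLOCKS_PER_MB    = 16   /* luma 4x4 blocks per macroblock */
};

/* HYPOTHETICAL per-block metadata: pixel origin of the block in the Y plane. */
typedef struct {
    uint16_t x;
    uint16_t y;
} blk_meta_sketch;

/* Copy the luma part of every MB's coeffs[] into one flat block-major buffer
 * and fill one meta entry per block. MBs are visited in raster order and
 * blocks inside an MB are taken in flat raster order (no z-scan permutation).
 * Returns the number of blocks written (mb_w * mb_h * 16). */
static size_t repack_luma(const int16_t *mb_coeffs, unsigned mb_w, unsigned mb_h,
                          int16_t *flat, blk_meta_sketch *meta)
{
    size_t n = 0;
    for (unsigned mby = 0; mby < mb_h; mby++) {
        for (unsigned mbx = 0; mbx < mb_w; mbx++) {
            const int16_t *src =
                mb_coeffs + ((size_t)mby * mb_w + mbx) * COEFFS_PER_MB;
            for (unsigned b = 0; b < BLOCKS_PER_MB; b++, n++) {
                /* Luma blocks are the first 16 blocks of the MB; chroma is skipped. */
                memcpy(flat + n * COEFFS_PER_BLOCK,
                       src + b * COEFFS_PER_BLOCK,
                       COEFFS_PER_BLOCK * sizeof(int16_t));
                meta[n].x = (uint16_t)(mbx * 16 + (b % 4) * 4);
                meta[n].y = (uint16_t)(mby * 16 + (b / 4) * 4);
            }
        }
    }
    return n;
}
```

With the buffers laid out this way, the whole frame can be handed to a single dispatch call instead of one call per block, which is where the frame-granularity saving comes from.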
flush_frame now performs a real GPU dispatch via the daedalus-fourier public API at frame batch granularity, in contrast to the substitution-arc shim that paid Vulkan sync overhead per-block.

What's wired:
- Build per-frame luma-4x4 meta[] in raster order across all MBs (N_MBs × 16 entries; 130,560 for 1080p)
- Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat block-major coeffs buffer (n_blocks × 16 int16)
- Allocate a frame-sized scratch Y plane, zero-initialised; no intra prediction yet, so "predicted" = 0
- daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs, n_blocks, meta): ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
- Copy result to caller's out_y at requested stride

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

    $ time ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-128 bytes: 0 / 1044480
    smoke OK
    real 0m0.163s

163ms wall for full 1080p frame including ctx-create (Vulkan init). Per-block dispatch via the substitution arc would have paid 130,560 × ~50us = ~6.5s on the same workload: a ~40x speedup from the right dispatch granularity.

Smoke validates:
- flush_frame succeeds (rc=0) on a complete frame
- Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
- UV plane filled with neutral grey 128 (placeholder until chroma dispatch lands)

What's deliberately deferred to follow-on sub-PRs:
- Intra prediction wavefront (Stage 2a): predicted=0 means output pixels are residual-only, not a valid frame decode. Sufficient for Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8x8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4x4 IDCT: needs separate dispatch with chroma stride
- Z-scan permutation of per-MB 4x4 block order (currently flat raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan). Bit-exact against FFmpeg requires this permutation; deferred to the test-vector PR.
- dmabuf export (still memcpy-out)
- Stage 5 RGBA opt-in

API surface unchanged from the scaffold PR; only the body of flush_frame becomes non-stub. Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2). Rebase on main after #2 lands; the diff is purely additive against the scaffold.
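For reference, the deferred z-scan permutation could be sketched as below. This follows the inverse 4x4 luma block scanning formula from the H.264 spec; the function names are made up for illustration and are not the planned implementation. Whether the coefficients inside each 4x4 block need any further reordering to match FFmpeg is not covered here.

```c
#include <stdint.h>
#include <string.h>

/* Pixel origin, within a 16x16 macroblock, of the luma 4x4 block with the
 * given z-scan index (0..15). Blocks are grouped into four 8x8 quadrants,
 * and each quadrant holds four 4x4 blocks, both visited in Z order. */
static void zscan_block_origin(unsigned blk_idx, unsigned *x, unsigned *y)
{
    unsigned quad = blk_idx / 4; /* which 8x8 quadrant */
    unsigned sub  = blk_idx % 4; /* which 4x4 block inside the quadrant */
    *x = (quad % 2) * 8 + (sub % 2) * 4;
    *y = (quad / 2) * 8 + (sub / 2) * 4;
}

/* Raster block index (row-major over the 4x4 grid of blocks) for a z-scan index. */
static unsigned zscan_to_raster(unsigned blk_idx)
{
    unsigned x, y;
    zscan_block_origin(blk_idx, &x, &y);
    return (y / 4) * 4 + (x / 4);
}

/* Reorder one MB's 256 luma coefficients from z-scan block order to raster
 * block order. The 16 coefficients inside each block are copied unchanged. */
static void luma_zscan_to_raster(const int16_t *zscan_coeffs, int16_t *raster_coeffs)
{
    for (unsigned b = 0; b < 16; b++)
        memcpy(raster_coeffs + zscan_to_raster(b) * 16,
               zscan_coeffs + b * 16,
               16 * sizeof(int16_t));
}
```

The resulting mapping from z-scan index to raster index is 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15. A precomputed 16-entry table would serve equally well in the per-frame repack loop.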