phase1/stage1: frame-scaled luma IDCT 4×4 — first GPU round-trip #3

2026-05-24T20:17:12Z

marfrit commented

2026-05-24 20:17:12 +00:00

Stacked on PR #2 (now merged). First real GPU dispatch in daedalus-decoder.

The win

$ time ./build/test_smoke
ctx created: 1920x1088, has_qpu=1
appended 8160 MBs (120x68)
flush_frame rc=0
Y non-zero bytes: 0 / 2088960
UV non-128 bytes: 0 / 1044480
smoke OK
real	0m0.163s

163 ms wall for a full 1080p frame round-trip including Vulkan init. Per-block dispatch via the substitution arc would have paid 130,560 × ~50 µs sync = ~6.5 s on the same workload. ~40× speedup from the right dispatch granularity — the structural fix predicted in the design doc.
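The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and the ~50 µs per-dispatch sync cost is the rough figure quoted above, treated here as a constant.

```c
#include <assert.h>
#include <stdint.h>

/* Number of luma 4x4 blocks in a frame of mb_w x mb_h macroblocks
 * (16 luma 4x4 blocks per 16x16 macroblock). */
static uint64_t luma_blocks(uint32_t mb_w, uint32_t mb_h)
{
    return (uint64_t)mb_w * mb_h * 16u;
}

/* Total sync cost in microseconds if every dispatch pays sync_us of
 * submit-and-wait overhead. The constant per-dispatch cost is an assumption. */
static uint64_t sync_cost_us(uint64_t dispatches, uint64_t sync_us)
{
    assert(sync_us == 0 || dispatches <= UINT64_MAX / sync_us);
    return dispatches * sync_us;
}
```

For the 120x68 macroblock frame, `luma_blocks(120, 68)` gives 130,560 and `sync_cost_us(130560, 50)` gives 6,528,000 µs, roughly 40 times the measured 163 ms wall time.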

What is wired

daedalus_decoder_flush_frame():

1. Builds frame-scaled meta[] in raster order across all MBs (N_MBs × 16 entries; 130,560 for 1080p)
2. Repacks per-MB coeffs[] (384 int16; first 256 luma) into flat block-major buffer (n_blocks × 16 int16)
3. Allocates scratch Y plane, zero-initialised (predicted = 0 because no intra/MC yet)
4. Calls daedalus_recipe_dispatch_h264_idct4() once — one vkQueueSubmit + vkQueueWaitIdle
5. Copies result to caller out_y at requested stride
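The meta-build and repack steps could look roughly like the sketch below. It is not the code in this PR: the `block_meta` layout (top-left pixel position per block) and the function name `build_frame_batch` are assumptions made for illustration, since the real meta format belongs to daedalus-fourier and is not shown here. Block order inside each MB is flat raster, matching the current behaviour described in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MB_COEFFS = 384, MB_LUMA_BLOCKS = 16, BLOCK_COEFFS = 16 };

/* Hypothetical meta entry: top-left pixel position of one 4x4 luma block. */
typedef struct { uint16_t x, y; } block_meta;

/* Walks n_mbs macroblocks (mb_w per row, raster order), writing one meta
 * entry and 16 coefficients per luma 4x4 block. Only the first 256 of each
 * MB's 384 coefficients (the luma part) are copied.
 * Returns the number of blocks written, i.e. n_mbs * 16. */
static size_t build_frame_batch(const int16_t *mb_coeffs, size_t n_mbs,
                                size_t mb_w, int16_t *flat, block_meta *meta)
{
    assert(mb_w > 0);
    size_t n = 0;
    for (size_t mb = 0; mb < n_mbs; mb++) {
        size_t mb_x = mb % mb_w, mb_y = mb / mb_w;
        const int16_t *src = mb_coeffs + mb * MB_COEFFS;
        for (size_t b = 0; b < MB_LUMA_BLOCKS; b++, n++) {
            /* Flat raster block order inside the MB (no z-scan). */
            meta[n].x = (uint16_t)(mb_x * 16 + (b % 4) * 4);
            meta[n].y = (uint16_t)(mb_y * 16 + (b / 4) * 4);
            memcpy(flat + n * BLOCK_COEFFS, src + b * BLOCK_COEFFS,
                   BLOCK_COEFFS * sizeof(int16_t));
        }
    }
    return n;
}
```

Building both arrays in a single pass keeps `meta[i]` and the coefficients at `flat + i * 16` aligned by index, which is what lets one dispatch cover the whole frame.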

What is deliberately deferred (follow-on sub-PRs)

- Intra prediction wavefront (Stage 2a): predicted=0, so output is residual-only, not a valid frame decode. Sufficient for round-trip validation.
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8×8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4×4 IDCT (separate dispatch with chroma stride)
- Z-scan permutation of per-MB 4×4 block order: FFmpeg's per-MB coeffs[] uses the spec §6.4.3 z-scan; we currently assume flat raster. Bit-exact output vs FFmpeg requires this permutation; deferred to the test-vector PR.
- dmabuf export (still memcpy-out)
- Stage 5 RGBA opt-in
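For reference, the deferred z-scan permutation can be derived from the inverse 4x4 luma block scanning process in spec §6.4.3. The sketch below is illustrative only; the function names are hypothetical and nothing here is part of this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Top-left pixel offset, inside a 16x16 macroblock, of the luma 4x4 block
 * with z-scan index idx (0..15), following the inverse 4x4 luma block
 * scanning process: 8x8 quadrants in raster order, then 4x4 blocks in
 * raster order inside each quadrant. */
static void zscan_block_xy(unsigned idx, unsigned *x, unsigned *y)
{
    assert(idx < 16);
    *x = ((idx / 4) % 2) * 8 + ((idx % 4) % 2) * 4;
    *y = ((idx / 4) / 2) * 8 + ((idx % 4) / 2) * 4;
}

/* Maps a z-scan block index to the flat raster block index (row * 4 + col). */
static unsigned zscan_to_raster(unsigned idx)
{
    unsigned x, y;
    zscan_block_xy(idx, &x, &y);
    return (y / 4) * 4 + x / 4;
}
```

Evaluated for indices 0 to 15 this yields 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15, so a 16-entry lookup table would serve equally well in the repack loop.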

API surface unchanged from scaffold PR; only flush_frame body becomes non-stub.


marfrit added 1 commit 2026-05-24 20:17:13 +00:00

phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip 69b124adf1

flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-
arc shim that paid Vulkan sync overhead per-block.

What's wired:

  - Build per-frame luma-4x4 meta[] in raster order across all MBs
    (N_MBs × 16 entries; 130,560 for 1080p)
  - Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
    block-major coeffs buffer (n_blocks × 16 int16)
  - Allocate a frame-sized scratch Y plane, zero-initialised — no intra
    prediction yet so "predicted" = 0
  - daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
    n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
  - Copy result to caller's out_y at requested stride
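The final copy step has to respect two possibly different strides, the scratch plane's and the caller's. A minimal sketch follows, assuming 8-bit samples and a hypothetical helper name `copy_plane`; the actual helper in flush_frame is file-local and not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies a width x height plane of 8-bit samples row by row. Bytes between
 * width and dst_stride in each destination row are left untouched. */
static void copy_plane(uint8_t *dst, size_t dst_stride,
                       const uint8_t *src, size_t src_stride,
                       size_t width, size_t height)
{
    assert(dst_stride >= width && src_stride >= width);
    for (size_t row = 0; row < height; row++)
        memcpy(dst + row * dst_stride, src + row * src_stride, width);
}
```

When both strides equal the width, the loop could collapse to a single memcpy; the row-wise form is kept here because the caller's stride is arbitrary.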

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

  $ time ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  appended 8160 MBs (120x68)
  flush_frame rc=0
  Y non-zero bytes: 0 / 2088960
  UV non-128 bytes: 0 / 1044480
  smoke OK
  real  0m0.163s

163ms wall for full 1080p frame including ctx-create (Vulkan init).
Per-block dispatch via the substitution arc would have paid
130,560 × ~50us = ~6.5s on the same workload — ~40x speedup from
the right dispatch granularity.

Smoke validates:
  - flush_frame succeeds (rc=0) on a complete frame
  - Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
  - UV plane filled with neutral grey 128 (placeholder until chroma
    dispatch lands)
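The zero-in, zero-out property can also be checked against a CPU reference. The sketch below follows the standard H.264 4x4 inverse transform butterfly with the final (x + 32) >> 6 rounding; it is not the GPU kernel from daedalus-fourier. The row-major coefficient layout and the function name `idct4x4_add` are assumptions, and the code relies on arithmetic right shift of negative ints, which mainstream compilers provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Adds the inverse-transformed residual of one 4x4 block to the predicted
 * pixels already in dst. coeffs is assumed row-major (an assumption). */
static void idct4x4_add(uint8_t *dst, size_t stride, const int16_t coeffs[16])
{
    int tmp[16];
    assert(stride >= 4);
    /* Horizontal pass: 1-D butterfly over each row. */
    for (int i = 0; i < 4; i++) {
        const int16_t *d = coeffs + 4 * i;
        int e0 = d[0] + d[2], e1 = d[0] - d[2];
        int e2 = (d[1] >> 1) - d[3], e3 = d[1] + (d[3] >> 1);
        tmp[4 * i + 0] = e0 + e3;
        tmp[4 * i + 1] = e1 + e2;
        tmp[4 * i + 2] = e1 - e2;
        tmp[4 * i + 3] = e0 - e3;
    }
    /* Vertical pass over each column, then round, add prediction, clip. */
    for (int j = 0; j < 4; j++) {
        int g0 = tmp[j] + tmp[8 + j], g1 = tmp[j] - tmp[8 + j];
        int g2 = (tmp[4 + j] >> 1) - tmp[12 + j];
        int g3 = tmp[4 + j] + (tmp[12 + j] >> 1);
        int h[4] = { g0 + g3, g1 + g2, g1 - g2, g0 - g3 };
        for (int i = 0; i < 4; i++) {
            uint8_t *p = dst + (size_t)i * stride + j;
            *p = clip255(*p + ((h[i] + 32) >> 6));
        }
    }
}
```

With all-zero coefficients every intermediate is 0 and (0 + 32) >> 6 is 0, so a zeroed destination stays zero, which is the condition the smoke test observes on the Y plane.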

What's deliberately deferred to follow-on sub-PRs:

  - Intra prediction wavefront (Stage 2a) — predicted=0 means output
    pixels are residual-only, not a valid frame decode.  Sufficient for
    Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
  - Motion compensation (Stage 2b) for inter MBs
  - High-profile IDCT 8x8 (Stage 1 extension)
  - Deblocking filter (Stage 4)
  - Chroma 4x4 IDCT — needs separate dispatch with chroma stride
  - Z-scan permutation of per-MB 4x4 block order (currently flat
    raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
    Bit-exact against FFmpeg requires this permutation; deferred to
    the test-vector PR.
  - dmabuf export (still memcpy-out)
  - Stage 5 RGBA opt-in

API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub.  Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2).  Rebase on main after #2
lands; the diff is purely additive against the scaffold.

marfrit merged commit abd94e9db5 into main

2026-05-24 20:18:44 +00:00

marfrit deleted branch noether/phase1-stage1-idct

2026-05-24 20:18:44 +00:00

claude-noether referenced this issue from a commit

2026-05-24 20:20:24 +00:00

phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4

marfrit referenced this pull request

2026-05-24 20:20:36 +00:00

phase1/stage1: bit-exact gate for the frame-scaled IDCT 4×4 #4

claude-noether referenced this issue from a commit

2026-05-24 20:34:44 +00:00