phase1/stage1: frame-scaled luma IDCT 4×4 — first GPU round-trip #3

Merged

marfrit merged 1 commit from noether/phase1-stage1-idct into main

2026-05-24 20:18:44 +00:00

Author: claude-noether
SHA1: 69b124adf1
Message:

phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip

flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-
arc shim that paid Vulkan sync overhead per-block.

What's wired:

  - Build per-frame luma-4x4 meta[] in raster order across all MBs
    (N_MBs × 16 entries; 130,560 for 1080p)
  - Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
    block-major coeffs buffer (n_blocks × 16 int16)
  - Allocate a frame-sized scratch Y plane, zero-initialised — no intra
    prediction yet so "predicted" = 0
  - daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
    n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
  - Copy result to caller's out_y at requested stride
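
The meta[] build and coefficient repack listed above can be sketched roughly as follows. This is an illustration only: `blk_meta` and `pack_luma_frame` are hypothetical names, and the real daedalus-fourier meta layout is not shown in this commit. The sketch assumes each meta entry carries the block's pixel origin in the Y plane and that the 16 luma blocks inside an MB are taken in flat raster order, as the commit currently does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MB_COEFFS    384  /* int16 per MB: 256 luma followed by 128 chroma */
#define LUMA_BLOCKS  16   /* 4x4 luma blocks per MB */
#define BLOCK_COEFFS 16   /* int16 per 4x4 block */

/* Hypothetical per-block metadata: pixel origin of the block in the Y plane. */
typedef struct { uint16_t x, y; } blk_meta;

/* Walks MBs in raster order and, inside each MB, the 16 luma blocks in flat
 * raster order. Fills meta[] and the block-major flat[] buffer.
 * Returns n_blocks (mbs_w * mbs_h * 16). */
static size_t pack_luma_frame(const int16_t *mb_coeffs, /* n_mbs * 384 */
                              unsigned mbs_w, unsigned mbs_h,
                              blk_meta *meta,           /* n_mbs * 16 */
                              int16_t *flat)            /* n_mbs * 16 * 16 */
{
    size_t n = 0;
    for (unsigned mby = 0; mby < mbs_h; mby++) {
        for (unsigned mbx = 0; mbx < mbs_w; mbx++) {
            const int16_t *src =
                mb_coeffs + ((size_t)mby * mbs_w + mbx) * MB_COEFFS;
            for (unsigned k = 0; k < LUMA_BLOCKS; k++, n++) {
                meta[n].x = (uint16_t)(mbx * 16 + (k % 4) * 4);
                meta[n].y = (uint16_t)(mby * 16 + (k / 4) * 4);
                /* Only the first 256 coefficients (luma) are copied. */
                memcpy(flat + n * BLOCK_COEFFS, src + k * BLOCK_COEFFS,
                       BLOCK_COEFFS * sizeof(int16_t));
            }
        }
    }
    return n;
}
```

For a 120x68 MB frame this yields 8160 × 16 = 130,560 blocks, matching the count above. The resulting `flat` and `meta` buffers are what a single frame-level dispatch would consume.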

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

  $ time ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  appended 8160 MBs (120x68)
  flush_frame rc=0
  Y non-zero bytes: 0 / 2088960
  UV non-128 bytes: 0 / 1044480
  smoke OK
  real  0m0.163s

163 ms wall-clock for a full 1080p frame, including ctx-create (Vulkan
init).  Per-block dispatch via the substitution arc would have paid an
estimated 130,560 × ~50 us = ~6.5 s on the same workload, so dispatching
at frame granularity is roughly 40x faster.
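
The arithmetic behind that estimate, as a small sketch. The helper names are hypothetical; the ~50 us per-block cost and the 0.163 s measurement are the figures quoted in this message, not new measurements.

```c
/* Estimated wall time if every 4x4 luma block paid its own sync cost.
 * sync_us is the rough per-dispatch overhead in microseconds. */
static double per_block_estimate_s(unsigned mbs_w, unsigned mbs_h,
                                   double sync_us)
{
    unsigned long n_blocks = (unsigned long)mbs_w * mbs_h * 16;
    return (double)n_blocks * sync_us * 1e-6;
}

/* Ratio of the per-block estimate to the measured batched time. */
static double dispatch_speedup(double per_block_s, double batched_s)
{
    return per_block_s / batched_s;
}
```

With 120x68 MBs and 50 us per block this gives about 6.53 s, and dividing by 0.163 s gives about 40. Since the 0.163 s figure also includes Vulkan init, the ratio for the dispatch alone is presumably somewhat higher.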

Smoke validates:
  - flush_frame succeeds (rc=0) on a complete frame
  - Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
  - UV plane filled with neutral grey 128 (placeholder until chroma
    dispatch lands)
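
For reference, a CPU-side sketch of why zero coefficients must yield zero pixels. It implements the standard H.264 4x4 inverse integer transform with the (x + 32) >> 6 rounding, added onto the predicted pixels. The function names are hypothetical, the row-major coefficient layout is an assumption (FFmpeg stores blocks transposed), and this is not the GPU kernel itself.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Inverse-transforms dequantised coefficients c[16] (row-major, assumed)
 * and adds the residual to the predicted pixels already in dst.
 * Relies on arithmetic right shift of negative ints, as common compilers do. */
static void idct4_add_ref(uint8_t *dst, size_t stride, const int16_t c[16])
{
    int t[16];
    for (int i = 0; i < 4; i++) {            /* horizontal pass, per row */
        const int16_t *r = c + 4 * i;
        int e0 = r[0] + r[2], e1 = r[0] - r[2];
        int e2 = (r[1] >> 1) - r[3], e3 = r[1] + (r[3] >> 1);
        t[4 * i + 0] = e0 + e3;
        t[4 * i + 1] = e1 + e2;
        t[4 * i + 2] = e1 - e2;
        t[4 * i + 3] = e0 - e3;
    }
    for (int j = 0; j < 4; j++) {            /* vertical pass, per column */
        int e0 = t[j] + t[8 + j], e1 = t[j] - t[8 + j];
        int e2 = (t[4 + j] >> 1) - t[12 + j];
        int e3 = t[4 + j] + (t[12 + j] >> 1);
        int o[4] = { e0 + e3, e1 + e2, e1 - e2, e0 - e3 };
        for (int i = 0; i < 4; i++) {
            uint8_t *p = dst + (size_t)i * stride + j;
            *p = clip255(*p + ((o[i] + 32) >> 6));
        }
    }
}

/* Counts bytes that differ from the expected value, as the smoke test
 * does for Y (expected 0) and UV (expected 128). */
static size_t count_not_equal(const uint8_t *buf, size_t n, uint8_t expected)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        bad += (buf[i] != expected);
    return bad;
}
```

With all-zero coefficients every residual is (0 + 32) >> 6 = 0, so a zero-initialised scratch plane stays zero, which is what the "Y non-zero bytes: 0" line checks.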

What's deliberately deferred to follow-on sub-PRs:

  - Intra prediction wavefront (Stage 2a) — predicted=0 means output
    pixels are residual-only, not a valid frame decode.  Sufficient for
    Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
  - Motion compensation (Stage 2b) for inter MBs
  - High-profile IDCT 8x8 (Stage 1 extension)
  - Deblocking filter (Stage 4)
  - Chroma 4x4 IDCT — needs separate dispatch with chroma stride
  - Z-scan permutation of per-MB 4x4 block order (currently flat
    raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
    Bit-exact against FFmpeg requires this permutation; deferred to
    the test-vector PR.
  - dmabuf export (still memcpy-out)
  - Stage 5 RGBA opt-in
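
A sketch of the deferred z-scan permutation, following the inverse 4x4 luma block scanning process in spec §6.4.3. The function names are hypothetical and nothing here is wired into flush_frame yet.

```c
/* Maps a z-scan luma4x4BlkIdx (0..15) to the block's pixel offset
 * inside its 16x16 macroblock. */
static void zscan_to_xy(unsigned blk, unsigned *x, unsigned *y)
{
    *x = ((blk & 1u) << 2) | ((blk & 4u) << 1);  /* bit0 -> +4, bit2 -> +8 */
    *y = ((blk & 2u) << 1) | (blk & 8u);         /* bit1 -> +4, bit3 -> +8 */
}

/* Maps a z-scan index to the flat raster index (row * 4 + column). */
static unsigned zscan_to_raster(unsigned blk)
{
    unsigned x, y;
    zscan_to_xy(blk, &x, &y);
    return (y / 4) * 4 + x / 4;
}
```

Z-scan indices 0..15 map to raster indices 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15. Applying `zscan_to_xy` when filling the per-block meta, instead of the flat raster walk, would be one way to place FFmpeg-ordered blocks correctly.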

API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub.  Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2).  Rebase on main after #2
lands; the diff is purely additive against the scaffold.

Date: 2026-05-24 22:15:35 +02:00
