Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte
against an inline C reference of the H.264 §8.5.12.1 1D butterfly.
Closes the validation gap from the previous PR ("dispatch succeeds"
becomes "dispatch is bit-exact").
What's tested:
- 320×240 coded frame (300 MBs, 4800 luma 4×4 blocks), enough to cover
multiple workgroups of the V3D shader (16 blocks/WG → 300 WGs)
- Per-MB → flat-raster block layout consistent with flush_frame
- Random coeffs in [-512, 511] (same range as daedalus-fourier
cycle-6 M1 gate)
- Inline C reference: H.264 §8.5.12.1 butterfly with column-major
block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 —
mirrors daedalus-fourier tests/h264_idct4_ref.c
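A minimal sketch of what such an inline reference could look like, following the steps listed above. The names clip255, butterfly4 and idct4_add_ref are hypothetical and not taken from tests/h264_idct4_ref.c, and the column-major indexing coeffs[c*4 + r] is an assumption about how "column-major block layout" is meant here.

```c
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* One 1D butterfly pass (H.264 §8.5.12.1).
 * Assumes >> on negative ints is an arithmetic shift, as on GCC/Clang. */
static void butterfly4(const int in[4], int out[4])
{
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* dst holds the predicted pixels on entry and the reconstructed pixels on
 * exit. Assumed layout: coefficient (row r, col c) lives at coeffs[c*4 + r]. */
static void idct4_add_ref(uint8_t *dst, int stride, const int16_t coeffs[16])
{
    int tmp[4][4], in[4], out[4];

    /* Horizontal pass: transform each row of coefficients. */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++)
            in[c] = coeffs[c * 4 + r];
        butterfly4(in, out);
        for (int c = 0; c < 4; c++)
            tmp[r][c] = out[c];
    }

    /* Vertical pass, then +32 rounding, >>6, add to predicted, clip. */
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++)
            in[r] = tmp[r][c];
        butterfly4(in, out);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = &dst[r * stride + c];
            *p = clip255(*p + ((out[r] + 32) >> 6));
        }
    }
}
```

With predicted = 0, as in this test, the add step reduces to clip255 of the rounded residual, so negative residuals clamp to 0.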
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):
    $ ctest --test-dir build --output-on-failure
    Start 1: smoke
    1/2 Test #1: smoke ............................ Passed 0.16 sec
    Start 2: idct_bitexact
    2/2 Test #2: idct_bitexact .................... Passed 0.03 sec
    100% tests passed, 0 tests failed out of 2
Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader
produces identical pixels to the C reference for all 4800 blocks in
the test frame. Validates BOTH the shader correctness AND the
frame-batched-dispatch correctness (this is the first time
n_blocks > ~30 has been exercised at the recipe-dispatch layer; the
substitution arc only ever called with n_blocks=1).
What is NOT tested by this PR (deferred to follow-ons):
- Non-zero predicted pixels: flush_frame zero-initialises scratch_y,
so the IDCT-ADD reduces to clip255(IDCT). Real predicted pixels come
from Stage 2a intra prediction.
- Z-scan permutation between FFmpeg's per-MB coeffs layout and our
per-MB → flat raster — the test uses its own coefficient generator
that already matches our layout, so it doesn't exercise the
permutation. The libavcodec-intercept patch is where the
permutation lands and gets validated against real H.264 streams.
- Chroma 4×4 IDCT.
- IDCT 8×8 (High profile).
Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled
dispatch). Rebase on main after #3 lands; the diff is purely additive
(one new test file + 5 lines of CMake).
flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-
arc shim that paid Vulkan sync overhead per-block.
What's wired:
- Build per-frame luma-4x4 meta[] in raster order across all MBs
(N_MBs × 16 entries; 130,560 for 1080p)
- Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
block-major coeffs buffer (n_blocks × 16 int16)
- Allocate a frame-sized scratch Y plane, zero-initialised — no intra
prediction yet so "predicted" = 0
- daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
- Copy result to caller's out_y at requested stride
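The repack step above can be sketched as follows. This is an illustration under stated assumptions, not the flush_frame implementation: the helper names are hypothetical, the 16 luma blocks inside each MB are assumed to be stored in raster order, and "flat raster" is assumed to mean raster order over all 4x4 blocks of the frame.

```c
#include <stdint.h>
#include <string.h>

enum {
    MB_COEFFS          = 384, /* int16 per MB; first 256 are luma */
    LUMA_BLOCKS_PER_MB = 16,
    COEFFS_PER_BLOCK   = 16
};

/* Flat index of luma block blk (0..15, raster within the MB) of macroblock
 * (mb_x, mb_y), in frame-wide raster order of 4x4 blocks (assumed layout). */
static int flat_block_index(int mb_x, int mb_y, int blk, int mb_w)
{
    int bx = mb_x * 4 + (blk & 3);  /* block column within the frame */
    int by = mb_y * 4 + (blk >> 2); /* block row within the frame */
    return by * (mb_w * 4) + bx;
}

/* Copy the luma part of every MB's coeffs[] into a flat block-major buffer
 * of (mb_w * mb_h * 16) blocks x 16 int16. */
static void repack_luma_coeffs(const int16_t *mb_coeffs, int mb_w, int mb_h,
                               int16_t *flat)
{
    for (int mb_y = 0; mb_y < mb_h; mb_y++) {
        for (int mb_x = 0; mb_x < mb_w; mb_x++) {
            const int16_t *src =
                mb_coeffs + (size_t)(mb_y * mb_w + mb_x) * MB_COEFFS;
            for (int blk = 0; blk < LUMA_BLOCKS_PER_MB; blk++) {
                int idx = flat_block_index(mb_x, mb_y, blk, mb_w);
                memcpy(flat + (size_t)idx * COEFFS_PER_BLOCK,
                       src + blk * COEFFS_PER_BLOCK,
                       COEFFS_PER_BLOCK * sizeof(int16_t));
            }
        }
    }
}
```

For a 1080p frame (120x68 MBs) the last block of the last MB lands at index 130,559, matching the N_MBs × 16 entry count given above.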
Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):
    $ time ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-128 bytes: 0 / 1044480
    smoke OK
    real 0m0.163s
163ms wall for full 1080p frame including ctx-create (Vulkan init).
Per-block dispatch via the substitution arc would have paid
130,560 × ~50us = ~6.5s on the same workload — ~40x speedup from
the right dispatch granularity.
Smoke validates:
- flush_frame succeeds (rc=0) on a complete frame
- Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
- UV plane filled with neutral grey 128 (placeholder until chroma
dispatch lands)
What's deliberately deferred to follow-on sub-PRs:
- Intra prediction wavefront (Stage 2a) — predicted=0 means output
pixels are residual-only, not a valid frame decode. Sufficient for
Vulkan round-trip validation; not bit-exact vs FFmpeg yet.
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8x8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4x4 IDCT — needs separate dispatch with chroma stride
- Z-scan permutation of per-MB 4x4 block order (currently flat
raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan).
Bit-exact against FFmpeg requires this permutation; deferred to
the test-vector PR.
- dmabuf export (still memcpy-out)
- Stage 5 RGBA opt-in
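For the deferred z-scan item above, the mapping from a spec §6.4.3 luma4x4BlkIdx to a raster block index inside the MB can be sketched as below. The function name is hypothetical, and how FFmpeg's coeffs[] is actually ordered should be checked against the libavcodec source before relying on this.

```c
/* Map an H.264 z-scan luma 4x4 block index (0..15) to a raster index
 * (row * 4 + col) within the macroblock.
 * Bits 0 and 2 of the z index select the column, bits 1 and 3 the row. */
static int zscan_to_raster(int z)
{
    int col = (((z >> 2) & 1) << 1) | (z & 1);
    int row = (((z >> 3) & 1) << 1) | ((z >> 1) & 1);
    return row * 4 + col;
}
```

Applied to z = 0..15 this yields 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15.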
API surface unchanged from the scaffold PR; only the body of
flush_frame becomes non-stub. Internal helpers stay file-local.
Stacks on noether/repo-scaffold (PR #2). Rebase on main after #2
lands; the diff is purely additive against the scaffold.
All seven questions from the initial design draft were decided in the
user's 2026-05-24 review:
1. Intra prediction: GPU wavefront in Phase 1, revisit if bottleneck
2. libavcodec intercept: macroblock-level for Phase 1
3. Shader parameterisation: measure both during Phase 2 MC, pick winner
4. DPB allocation: Vulkan-native VkImage with dma_buf export
5. Daemon integration: library link
6. daedalus-fourier dep: CMake find_package, pinned to tagged release
7. Codec scope: H.264 first; HEVC/10-bit/interlaced/FMO/ASO firmly out;
VP9 + AV1 deferred to Phase 5+ but NOT firmly out (a scope expansion
vs the initial draft, which had grouped them with HEVC)
Section heading renamed "Open questions" → "Phase 1 decisions" with
explicit user-confirmed annotations. Each item preserves the original
wording for traceability.
§8 Phasing extended with a Phase 5+ paragraph clarifying the VP9/AV1
deferral and reaffirming HEVC's firmly-out status.
No architecture changes; only decisions captured. Phase 1
implementation can now begin against this baseline.
Read-only research done autonomously while push to marfrit/daedalus-decoder
is blocked on user perms. All findings appended to DESIGN.md; no new
files, no architecture changes.
Appendix A — daedalus-fourier shader reuse audit
- 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8),
just dispatched at frame scale instead of n_blocks=1 per call
- 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20)
serve as templates for ~20 sibling variants (horizontal/chroma
deblock variants, 15 missing qpel positions + 16x16 size + avg)
- 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific)
- 7 brand-new shaders required (iquant, intra prediction modes,
chroma MC, reconstruct, optional yuv→rgba)
- ~22 H.264 shaders total; estimate 6-10 weeks for the inventory
if done in sequence with M1 bit-exact gate methodology
Appendix B — libavcodec intercept point
- decode_slice() at libavcodec/h264_slice.c:2598 is the loop site
- Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb
- Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots
sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP,
non_zero_count_cache into a frame-shaped descriptor SSBO
- End-of-slice flush builds + submits the GPU pipeline
- CABAC/CAVLC stay in libavcodec (we don't re-implement entropy)
- New FFmpeg patch in marfrit-packages, sibling to 0003-0007:
0008-h264-daedalus-decoder-frame-pipeline.patch
- daedalus_decoder_active(h) gates the intercept; default OFF =
no-op = full coexistence with the kernel-pack substitution arc
Appendix C — risk register
- 6 risks catalogued: intra wavefront perf, qpel shader explosion,
Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier
pin drift, Phase 4 30fps@1080p target miss
- Highest impact: project fails to beat NEON. Acknowledged from
project start (§10), with explicit pivot language.
User question 2026-05-23: 'Wayland does need a conversion of NV12 to
its output format. Could we cram that in?'
Yes — trivially. Added Stage 5 to the pipeline doc with:
- 5-line per-pixel compute shader (BT.709 limited-range example
given; matrix selected from H.264 VUI at runtime)
- explicit OPT-IN flag, off by default
- rationale for default-off: most consumers (V4L2 stateless,
Wayland zwp_linux_dmabuf NV12 passthrough, Firefox/mpv VAAPI
paths) want NV12 because compositors convert during composition
essentially for free. RGBA8 is ~2.7x the bandwidth of NV12 (32 vs
12 bits per pixel), not worth burning DMA + electrons when no
downstream needs it
- colourspace metadata plumbing requirement: SPS vui_parameters
(colour_primaries, transfer_characteristics, matrix_coefficients,
video_full_range_flag) MUST flow through to the shader; default
BT.709 limited-range with warning if VUI absent
Updated the new-shader inventory to include v3d_h264_yuv_to_rgba.
Total dispatches/frame remains ~190-200; Stage 5 adds one.
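As a scalar C model of the per-pixel math (not the shader itself), the BT.709 limited-range case could look like the sketch below. The function names are hypothetical; the constants follow from the BT.709 luma weights Kr = 0.2126 and Kb = 0.0722, and a runtime-selected matrix would replace them for other VUI values.

```c
#include <stdint.h>

/* Round to nearest and clamp to the 8-bit range. */
static uint8_t clamp_u8(float v)
{
    if (v <= 0.0f)   return 0;
    if (v >= 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* BT.709 limited-range Y'CbCr (Y in 16..235, C in 16..240) to 8-bit RGBA. */
static void bt709_limited_to_rgba(uint8_t y, uint8_t cb, uint8_t cr,
                                  uint8_t rgba[4])
{
    /* Expand limited range to full scale (0..255 for luma, about
     * +/-127.5 for chroma). */
    float yf = (float)(y - 16)   * (255.0f / 219.0f);
    float pb = (float)(cb - 128) * (255.0f / 224.0f);
    float pr = (float)(cr - 128) * (255.0f / 224.0f);

    rgba[0] = clamp_u8(yf + 1.5748f * pr);               /* R */
    rgba[1] = clamp_u8(yf - 0.1873f * pb - 0.4681f * pr); /* G */
    rgba[2] = clamp_u8(yf + 1.8556f * pb);               /* B */
    rgba[3] = 255;                                       /* opaque alpha */
}
```

In an NV12 frame each Cb/Cr pair is shared by a 2x2 block of luma samples, so a caller would fetch chroma at (x/2, y/2) for every output pixel.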
This is Path C of the 2026-05-23 architecture decision, taken after the
daedalus-fourier substitution arc's per-block QPU dispatch was measured to be
>600x slower than NEON in production. Root cause: per-block synchronous
Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of
queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic.
NVDEC and Vulkan Video escape this by dispatching at picture-level.
Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does
not implement VK_KHR_video_decode_h264; this project builds the same
*shape* (one submit per frame, one fence wait per frame, encoded
bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate.
DESIGN.md covers:
- architecture sketch (CPU side keeps entropy decode + descriptors;
GPU runs 4-stage compute pipeline per frame)
- per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p)
- inter-stage dependencies (vkCmdPipelineBarrier within one command
buffer)
- intra prediction wavefront (~187 dispatches per frame on diagonals)
- libavcodec intercept point (macroblock-level, evolves the
substitution shim from "dispatch now" to "append to frame buffer")
- shader inventory (existing daedalus-fourier reuse + ~14 new ones)
- 4-phase plan, 4-6 months total budget
- 7 open questions including DPB allocation, qpel parameterization,
daemon integration shape
- explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced
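The ~187 figure for the intra wavefront matches a plain anti-diagonal sweep of the 120x68 MB grid, which can be sketched as below. Both helper names are hypothetical, and the sketch assumes one dispatch per anti-diagonal (mb_x + mb_y constant); a scheduler that also honours top-right neighbour dependencies may need a different skew.

```c
/* Number of anti-diagonals (one dispatch each, by assumption) in an
 * mb_w x mb_h macroblock grid. */
static int wavefront_count(int mb_w, int mb_h)
{
    return mb_w + mb_h - 1;
}

/* Number of MBs on anti-diagonal d, i.e. those with mb_x + mb_y == d. */
static int wavefront_size(int d, int mb_w, int mb_h)
{
    int x_min = d - (mb_h - 1) > 0 ? d - (mb_h - 1) : 0;
    int x_max = d < mb_w - 1 ? d : mb_w - 1;
    return x_max >= x_min ? x_max - x_min + 1 : 0;
}
```

For 1080p, wavefront_count(120, 68) gives 187, and the widest diagonal holds 68 MBs, so the first and last dispatches run with very little parallelism.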
This is design only. No code beyond README.md and DESIGN.md. User
review + redirect expected before Phase 1 implementation begins.