daedalus-decoder

Author	SHA1	Message	Date
claude-noether	a7a0d56ecd	Stage 2 PR-a: predicted samples plumbing — caller-supplied per-MB pixels First concrete deliverable on the daedalus-decoder Stage 2 path post the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA). Q2 decision: CPU intra prediction. libavcodec's existing NEON intra prediction kernels generate predicted samples per MB; daedalus-decoder accepts those samples through the API and uses them as the IDCT-add starting state. FFmpeg's `idct_add` semantics — dst += idct(coeffs); clip255 — fold DESIGN.md's Stage 3 reconstruction into the existing Stage 1 IDCT dispatch for free. No new GPU work. API change ---------- `daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field: predicted [ 0 .. 256) — 16×16 luma, row-major raster predicted [256 .. 320) — 8×8 Cb, row-major raster predicted [320 .. 384) — 8×8 Cr, row-major raster NULL is legal and equivalent to all-zero predicted samples — preserves the existing IDCT-isolation test contract. Internal changes ---------------- - `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar Cb||Cr, W×H/2) buffers allocated at create, zeroed at end of every flush_frame so NULL `mb->predicted` is indistinguishable from explicit zeros from one frame to the next. - `append_mb` splats mb->predicted into predicted_y/_uv at raster (mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma component. - `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)` with `malloc + memcpy from predicted_y/_uv` — the IDCT dispatch then writes residual on top, clip-adding to the predicted samples in place. Test ---- `test_idct_bitexact` extended: - Generates random predicted samples (uint8_t) per MB alongside the existing random coeffs. - Pre-fills the reference ref_y / ref_cb / ref_cr planes with those same predicted samples at the corresponding raster positions BEFORE applying ref_idct4_add / ref_idct8_add per block. - Compares GPU output to reference byte-for-byte. 
Result on hertz (Pi 5 V3D 7.1), all three substrates: test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto} Y bytes diff: 0/76800 (0.0000%) Cb bytes diff: 0/19200 (0.0000%) Cr bytes diff: 0/19200 (0.0000%) BIT-EXACT PASS on all three substrates Catches any silent drift between substrates and any predicted-samples plumbing mistake on either the API or the dispatch side. Followups --------- - Stage 2 PR-b: deblock dispatch in flush_frame. - Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace avcodec_send_packet/receive_frame with a libavcodec-parser-only path that drives daedalus_decoder_append_mb in raster order + flush_frame at slice boundary.	2026-05-25 23:01:20 +02:00
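The per-MB `predicted` layout described in a7a0d56ecd (256 luma bytes, then 64 Cb, then 64 Cr) can be illustrated with a minimal sketch of the splat step. This is not the project's `append_mb`; the function name `splat_predicted`, its parameters, and the plane variables are hypothetical, and it assumes the planes were zeroed beforehand and that W and H are multiples of 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (illustration only, not the project's code).
 * pred_y  : W x H luma plane, stride W
 * pred_uv : planar Cb followed by Cr, each (W/2) x (H/2), stride W/2 */
static void splat_predicted(uint8_t *pred_y, uint8_t *pred_uv,
                            int W, int H, int mb_x, int mb_y,
                            const uint8_t *predicted)
{
    if (!predicted)
        return;                 /* NULL: planes keep their zero fill */

    const int cw = W / 2;       /* chroma row pitch */
    const size_t cb_plane_size = (size_t)cw * (size_t)(H / 2);

    /* predicted[0..256): 16x16 luma, row-major */
    for (int r = 0; r < 16; r++)
        memcpy(pred_y + (size_t)(mb_y * 16 + r) * W + mb_x * 16,
               predicted + r * 16, 16);

    /* predicted[256..320): 8x8 Cb; predicted[320..384): 8x8 Cr */
    for (int r = 0; r < 8; r++) {
        size_t off = (size_t)(mb_y * 8 + r) * cw + mb_x * 8;
        memcpy(pred_uv + off, predicted + 256 + r * 8, 8);
        memcpy(pred_uv + cb_plane_size + off, predicted + 320 + r * 8, 8);
    }
}
```

Because a NULL pointer leaves the zeroed planes untouched, the IDCT-add then reduces to clip255(IDCT), which is the behaviour the earlier IDCT-isolation tests rely on.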
claude-noether	44ca4e550f	phase1: substrate selector API + cross-substrate bit-exact ctest Surfaces daedalus-fourier's substrate-override capability at the decoder boundary. Lets tests run on CPU-only hosts (CI runners, x86 dev boxes) AND cross-checks V3D shader output against NEON reference on hosts that have both. API additions (pre-0.1 ABI, additive): - daedalus_decoder_substrate enum { AUTO, CPU, QPU } (mirrors daedalus_substrate; isolated for ABI reasons). - daedalus_decoder_set_substrate(dec, sub) setter, same mid-frame-change restrictions as set_output_format. - Default remains AUTO — the only sensible choice for production. Internal: - flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an explicit substrate instead of daedalus_recipe_dispatch_*. Mapped via a small map_substrate() helper. No perf delta on AUTO (recipe layer was just doing the same dispatch under the hood). Test changes: - test_smoke: new EXPECTs for set_substrate (valid + bogus). - test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or "qpu" to force the substrate. - CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the QVGA case forcing the CPU path. Catches silent drift between the V3D shader and the NEON reference; both must produce identical output for the same coefficient input (and they do — see ctest log below). Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ctest --test-dir build --output-on-failure Start 1: smoke 1/4 Test #1: smoke ............................ Passed 0.10 sec Start 2: idct_bitexact 2/4 Test #2: idct_bitexact .................... Passed 0.03 sec Start 3: idct_bitexact_cpu 3/4 Test #3: idct_bitexact_cpu ................ Passed 0.03 sec Start 4: idct_bitexact_1080p 4/4 Test #4: idct_bitexact_1080p .............. 
Passed 0.06 sec 100% tests passed, 0 tests failed out of 4 CPU substrate produces byte-identical Y + Cb + Cr planes against the same C reference that the AUTO/QPU path matches — confirming the V3D shaders and the daedalus-fourier NEON path agree at the spec level. Why we plumbed the lower-level dispatch instead of leaving recipe in place: recipe is just a thin wrapper that calls dispatch with AUTO. Once we needed substrate control, the wrapper became a liability (would have required adding a parallel recipe API for each substrate); going direct is simpler and the AUTO path is unchanged. Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs); not also at 1080p because the CPU path's wall time scales linearly with block count and a 1080p CPU run is ~0.5s on hertz — fine standalone but slows ctest enough that it would tempt opt-in gating. The bit-exact content is the same regardless of frame size; the 1080p variant only exists to gate index-arithmetic bugs that surface above small int boundaries.	2026-05-24 23:07:45 +02:00
claude-noether	adaabb1f63	phase1: IDCT 8x8 dispatch (High profile transform_8x8_size_flag) Adds the High-profile 8x8 luma transform path alongside the existing 4x4 dispatch. flush_frame now partitions macroblocks by each MB's transform_8x8 flag and issues a separate luma dispatch per partition: - mb.transform_8x8 == 0 (Baseline/Main) → coeffs[0..256) interpreted as 16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4 (existing behaviour, unchanged). - mb.transform_8x8 == 1 (High) → coeffs[0..256) interpreted as 4 8x8 blocks (64 int16 each, column-major), fed to the new daedalus_recipe_dispatch_h264_idct8 call. Both luma partitions can be non-empty in the same frame (FFmpeg sets the flag per-MB). Each non-empty partition costs one vkQueueSubmit + vkQueueWaitIdle; empty partitions are skipped (common case: Baseline streams skip the 8x8 dispatch entirely). Chroma is unchanged — 4:2:0 chroma always uses the 4x4 transform. API surface: - New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input` (after deblock_). Backwards-compatible at the source level because the field defaults to 0 with C99 designated initialisers or {0} struct zeroing, both of which select the existing 4x4 path. ABI is pre-0.1 (per the header doc) so structural change is fine. - Mirrored in `struct daedalus_decoder_mb_desc` (internal layout). Test changes: - test_idct_bitexact now exercises a mixed-mode frame: every odd raster MB uses 8x8, every even uses 4x4 (so flush_frame's partitioning is also under test, not just the underlying shaders). - 8x8 C reference (h264_idct8_butterfly + ref_idct8_add) transcribed from daedalus-fourier tests/h264_idct8_ref.c per H.264 §8.5.13.2. Block layout column-major; +32 >> 6 rounding; add-to-predicted; clip255. - Reference luma compute branches per MB on the same mb_8x8[] array used to set the input flag. 
Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ./build/test_idct_bitexact test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a MB mix: 150 4x4 MBs, 150 8x8 MBs Y bytes total: 76800 Y bytes diff: 0 (0.0000%) Cb bytes total: 19200 diff: 0 (0.0000%) Cr bytes total: 19200 diff: 0 (0.0000%) BIT-EXACT PASS (Y + Cb + Cr) $ ctest --test-dir build 100% tests passed, 0 tests failed out of 2 Bit-exact PASS first try for the 8x8 path — 150 8x8 MBs × 4 blocks = 600 8x8 IDCTs against the spec C reference, identical output. Validates both the daedalus-fourier IDCT 8x8 shader (already gated by its own cycle-7 bit-exact test, now also gated end-to-end through our flush_frame), and our 8x8 layout assumptions (column-major coeffs, raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8). What's NOT covered yet (deferred): - Z-scan permutation for FFmpeg compatibility (libavcodec intercept patch's concern; both 4x4 and 8x8 z-scans differ). - Chroma DC / luma Intra16x16 DC Hadamard pre-pass. - Mixed intra/inter MB handling — currently all MBs treated as residual-only (predicted=0). Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.	2026-05-24 22:41:05 +02:00
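The per-MB partitioning and the 8x8 block placement described in adaabb1f63 can be sketched as follows. Both function names and all parameters are hypothetical; the sketch only illustrates splitting MB indices by the `transform_8x8` flag and computing a destination offset from the stated layout (block index sb = sb_y*2 + sb_x, top-left at mb*16 + sb*8), and is not the project's `flush_frame`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: split MB indices into a 4x4 list and an 8x8 list.
 * idx4 and idx8 must each have room for n_mbs entries. */
static void partition_by_transform(const uint8_t *mb_8x8, size_t n_mbs,
                                   uint32_t *idx4, size_t *n4,
                                   uint32_t *idx8, size_t *n8)
{
    *n4 = 0;
    *n8 = 0;
    for (size_t i = 0; i < n_mbs; i++) {
        if (mb_8x8[i])
            idx8[(*n8)++] = (uint32_t)i;   /* High-profile 8x8 luma */
        else
            idx4[(*n4)++] = (uint32_t)i;   /* 16 4x4 luma blocks */
    }
}

/* Hypothetical: byte offset of the top-left pixel of 8x8 sub-block
 * sb (0..3) of macroblock (mb_x, mb_y) in a luma plane. */
static size_t luma8_dst_off(int mb_x, int mb_y, int sb, int stride)
{
    int sb_x = sb & 1;    /* sb = sb_y * 2 + sb_x */
    int sb_y = sb >> 1;
    return (size_t)(mb_y * 16 + sb_y * 8) * (size_t)stride
         + (size_t)(mb_x * 16 + sb_x * 8);
}
```

An empty list (n4 == 0 or n8 == 0) corresponds to the case where the matching dispatch is skipped.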
claude-noether	58848bd162	phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave) Replaces the chroma placeholder (memset 128) with a real frame-scaled 4x4 IDCT dispatch for the Cb and Cr components. Two Vulkan submits + waits per frame now (one luma, one chroma) instead of one + memset. Implementation: - One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr; a single `daedalus_recipe_dispatch_h264_idct4` call processes both components by setting meta[].dst_off accordingly (Cr blocks add cb_plane_size). - Stride = W/2 (chroma row pitch); shared between Cb and Cr since they have identical geometry. - Per-MB coeff layout already had [256..320) for Cb and [320..384) for Cr (4 raster-order 4x4 blocks per component) from the original daedalus_decoder_append_mb design — no header-side changes. - Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c] into NV12 UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well off the critical path; a GPU-side interleave shader is a Stage-5 optimisation. - Chroma dispatch is gated on out_uv != NULL so callers that only want luma (e.g. the bit-exact test before this PR) still pay nothing. Test changes: - tests/test_idct_bitexact.c extended with parallel reference IDCT for Cb and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV back into Cb/Cr for the compare. Random coeffs in [-512, 511] for all 384 per-MB int16 slots (previously only luma was randomised). - tests/test_smoke.c UV expectation flipped from "all 128 placeholder" to "all 0" (real dispatch with zero coeffs). Sentinel 0xcd pre-fill stays — same purpose: catches read-then-write bugs. Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ctest --test-dir build --output-on-failure Start 1: smoke 1/2 Test #1: smoke ............................ Passed 1.27 sec Start 2: idct_bitexact 2/2 Test #2: idct_bitexact .................... 
Passed 0.05 sec 100% tests passed, 0 tests failed out of 2 $ ./build/test_idct_bitexact test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a Y bytes total: 76800 Y bytes diff: 0 (0.0000%) Cb bytes total: 19200 diff: 0 (0.0000%) Cr bytes total: 19200 diff: 0 (0.0000%) BIT-EXACT PASS (Y + Cb + Cr) $ ./build/test_smoke daedalus-decoder version: 0.0.1 ctx created: 1920x1088, has_qpu=1 appended 8160 MBs (120x68) flush_frame rc=0 Y non-zero bytes: 0 / 2088960 UV non-zero bytes: 0 / 1044480 smoke OK (Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches — shader pool warm-up dominates the wall time, not the IDCT work.) What's NOT covered yet (deferred): - Chroma DC / Intra16x16 luma DC 2x2 Hadamard pre-pass. Real H.264 chroma puts the per-block DC coefficient through a Hadamard before it's added to the AC block; we currently treat all chroma blocks as plain 4x4 AC. Will land alongside the libavcodec intercept patch, since CABAC/CAVLC is where the DC vs AC distinction is exposed. - Z-scan permutation for FFmpeg compatibility — only matters at the intercept boundary, not here. - IDCT 8x8 (High profile). Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.	2026-05-24 22:34:42 +02:00
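The post-dispatch interleave in 58848bd162 (planar Cb and Cr into NV12 UV, with Cb at even and Cr at odd byte positions of each row) can be sketched as below. The function name `interleave_nv12_uv` and the explicit `uv_stride` parameter are assumptions for illustration; the commit only states that Cb[r][c] and Cr[r][c] land at out_uv[r][2c..2c+1].

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interleave two planar chroma components
 * (each cw x ch, row pitch cw) into an NV12 UV plane.
 * uv_stride is the output row pitch in bytes (at least 2 * cw). */
static void interleave_nv12_uv(const uint8_t *cb, const uint8_t *cr,
                               int cw, int ch,
                               uint8_t *out_uv, int uv_stride)
{
    for (int r = 0; r < ch; r++) {
        const uint8_t *cb_row = cb + (size_t)r * cw;
        const uint8_t *cr_row = cr + (size_t)r * cw;
        uint8_t *dst = out_uv + (size_t)r * uv_stride;
        for (int c = 0; c < cw; c++) {
            dst[2 * c]     = cb_row[c];   /* U sample */
            dst[2 * c + 1] = cr_row[c];   /* V sample */
        }
    }
}
```

With the combined scratch buffer from the commit, `cr` would simply be `cb + cb_plane_size`.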
claude-noether	948697ef0d	phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4 Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame end-to-end with random coefficients and compares every output byte against an inline C reference of the H.264 §8.5.12.1 1D butterfly. Closes the validation gap from the previous PR ("dispatch succeeds" becomes "dispatch is bit-exact"). What's tested: - 320×240 coded frame (300 MBs), enough to cover multiple workgroups of the V3D shader (16 blocks/WG → ≥30 WGs) - Per-MB → flat-raster block layout consistent with flush_frame - Random coeffs in [-512, 511] (same range as daedalus-fourier cycle-6 M1 gate) - Inline C reference: H.264 §8.5.12.1 butterfly with column-major block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 — mirrors daedalus-fourier tests/h264_idct4_ref.c Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ctest --test-dir build --output-on-failure Start 1: smoke 1/2 Test #1: smoke ............................ Passed 0.16 sec Start 2: idct_bitexact 2/2 Test #2: idct_bitexact .................... Passed 0.03 sec 100% tests passed, 0 tests failed out of 2 Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader produces identical pixels to the C reference for all 4800 blocks in the test frame. Validates BOTH the shader correctness AND the frame-batched-dispatch correctness (this is the first time n_blocks > ~30 has been exercised at the recipe-dispatch layer; the substitution arc only ever called with n_blocks=1). What is NOT tested by this PR (deferred to follow-ons): - Non-zero predicted pixels — flush_frame zero-initialises scratch_y, so the IDCT-ADD reduces to clip255(IDCT). Real predicted comes from Stage 2a intra prediction. - Z-scan permutation between FFmpeg's per-MB coeffs layout and our per-MB → flat raster — the test uses its own coefficient generator that already matches our layout, so it doesn't exercise the permutation. 
The libavcodec-intercept patch is where the permutation lands and gets validated against real H.264 streams. - Chroma 4×4 IDCT. - IDCT 8×8 (High profile). Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled dispatch). Rebase on main after #3 lands; the diff is purely additive (one new test file + 5 lines of CMake).	2026-05-24 22:20:21 +02:00
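For orientation, the reference arithmetic named in 948697ef0d (4x4 butterfly, column-major block layout, +32 rounding, >>6, add-to-predicted, clip255) can be sketched as below. This is a minimal illustration in the style of the widely used C `idct_add` formulation, not a copy of the project's test file or of daedalus-fourier's `h264_idct4_ref.c`; the name `idct4_add_sketch` is hypothetical. It assumes arithmetic right shift of negative values, which C leaves implementation-defined but which mainstream compilers provide.

```c
#include <stdint.h>

static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical sketch: dst[row * stride + col] += idct(coeffs), clipped.
 * coeffs is column-major: coeffs[row + 4 * col]. */
static void idct4_add_sketch(uint8_t *dst, int stride,
                             const int16_t coeffs[16])
{
    int t[16];
    for (int i = 0; i < 16; i++)
        t[i] = coeffs[i];
    t[0] += 32;                       /* rounding term for the final >> 6 */

    /* First 1D pass, across the stride-4 direction of the block */
    for (int i = 0; i < 4; i++) {
        int z0 = t[i] + t[i + 8];
        int z1 = t[i] - t[i + 8];
        int z2 = (t[i + 4] >> 1) - t[i + 12];
        int z3 = t[i + 4] + (t[i + 12] >> 1);
        t[i]      = z0 + z3;
        t[i + 4]  = z1 + z2;
        t[i + 8]  = z1 - z2;
        t[i + 12] = z0 - z3;
    }

    /* Second 1D pass, then scale, add to predicted pixels, and clip */
    for (int i = 0; i < 4; i++) {
        int z0 = t[4 * i] + t[4 * i + 2];
        int z1 = t[4 * i] - t[4 * i + 2];
        int z2 = (t[4 * i + 1] >> 1) - t[4 * i + 3];
        int z3 = t[4 * i + 1] + (t[4 * i + 3] >> 1);
        dst[i + 0 * stride] = clip255(dst[i + 0 * stride] + ((z0 + z3) >> 6));
        dst[i + 1 * stride] = clip255(dst[i + 1 * stride] + ((z1 + z2) >> 6));
        dst[i + 2 * stride] = clip255(dst[i + 2 * stride] + ((z1 - z2) >> 6));
        dst[i + 3 * stride] = clip255(dst[i + 3 * stride] + ((z0 - z3) >> 6));
    }
}
```

A DC-only block makes a convenient sanity check: a single coefficient of 64 raises every pixel of the 4x4 block by 1, and a zeroed destination reproduces the predicted=0 case this commit tests.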

5 Commits