daedalus-decoder

Author	SHA1	Message	Date
claude-noether	321f94bba9	wip: deblock dispatch	2026-05-25 23:14:24 +02:00
claude-noether	a7a0d56ecd	Stage 2 PR-a: predicted samples plumbing (caller-supplied per-MB pixels)

First concrete deliverable on the daedalus-decoder Stage 2 path after the 2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).

Q2 decision: CPU intra prediction. libavcodec's existing NEON intra prediction kernels generate predicted samples per MB; daedalus-decoder accepts those samples through the API and uses them as the IDCT-add starting state. FFmpeg's `idct_add` semantics (dst += idct(coeffs); clip255) fold DESIGN.md's Stage 3 reconstruction into the existing Stage 1 IDCT dispatch for free. No new GPU work.

API change
----------
`daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field:

  predicted [  0 .. 256): 16×16 luma, row-major raster
  predicted [256 .. 320): 8×8 Cb, row-major raster
  predicted [320 .. 384): 8×8 Cr, row-major raster

NULL is legal and equivalent to all-zero predicted samples, which preserves the existing IDCT-isolation test contract.

Internal changes
----------------
- `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar Cb||Cr, W×H/2) buffers, allocated at create and zeroed at the end of every flush_frame, so a NULL `mb->predicted` is indistinguishable from explicit zeros from one frame to the next.
- `append_mb` splats mb->predicted into predicted_y/_uv at raster (mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma component.
- `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)` with `malloc + memcpy from predicted_y/_uv`; the IDCT dispatch then writes the residual on top, clip-adding to the predicted samples in place.

Test
----
`test_idct_bitexact` extended:
- Generates random predicted samples (uint8_t) per MB alongside the existing random coeffs.
- Pre-fills the reference ref_y / ref_cb / ref_cr planes with those same predicted samples at the corresponding raster positions BEFORE applying ref_idct4_add / ref_idct8_add per block.
- Compares GPU output to the reference byte-for-byte.

Result on hertz (Pi 5 V3D 7.1), all three substrates:

  test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto}
  Y bytes diff: 0/76800 (0.0000%)
  Cb bytes diff: 0/19200 (0.0000%)
  Cr bytes diff: 0/19200 (0.0000%)
  BIT-EXACT PASS on all three substrates

Catches any silent drift between substrates and any predicted-samples plumbing mistake on either the API or the dispatch side.

Followups
---------
- Stage 2 PR-b: deblock dispatch in flush_frame.
- Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace avcodec_send_packet/receive_frame with a libavcodec-parser-only path that drives daedalus_decoder_append_mb in raster order + flush_frame at slice boundary.	2026-05-25 23:01:20 +02:00
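
The per-MB predicted layout and the splat into frame planes described in the commit above can be sketched as follows. This is a minimal illustration, not the repository's `append_mb`: the helper name `splat_predicted`, the `PRED_*` constants, and the argument order are hypothetical; only the 384-byte layout and the planar Cb-then-Cr arrangement come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets into the 384-byte per-MB predicted buffer (from the commit text). */
enum { PRED_Y_OFF = 0, PRED_CB_OFF = 256, PRED_CR_OFF = 320, PRED_SIZE = 384 };

/* Hypothetical helper: copy one MB's predicted samples into frame planes.
 * plane_y is w*h bytes; plane_uv is w*h/2 bytes holding the whole Cb plane
 * (w/2 x h/2) followed by the whole Cr plane. Planes are assumed pre-zeroed. */
static void splat_predicted(uint8_t *plane_y, uint8_t *plane_uv,
                            int w, int h, int mb_x, int mb_y,
                            const uint8_t *pred)
{
    const int cw = w / 2, ch = h / 2;
    uint8_t *cb = plane_uv;
    uint8_t *cr = plane_uv + (size_t)cw * (size_t)ch;

    assert(w % 16 == 0 && h % 16 == 0);
    if (!pred)
        return; /* NULL means all-zero predicted samples: nothing to copy */

    /* Luma: 16 rows of 16 bytes at raster (mb_y*16, mb_x*16). */
    for (int r = 0; r < 16; r++)
        memcpy(plane_y + (size_t)(mb_y * 16 + r) * (size_t)w + (size_t)(mb_x * 16),
               pred + PRED_Y_OFF + r * 16, 16);

    /* Chroma: 8 rows of 8 bytes at raster (mb_y*8, mb_x*8) in each plane. */
    for (int r = 0; r < 8; r++) {
        size_t off = (size_t)(mb_y * 8 + r) * (size_t)cw + (size_t)(mb_x * 8);
        memcpy(cb + off, pred + PRED_CB_OFF + r * 8, 8);
        memcpy(cr + off, pred + PRED_CR_OFF + r * 8, 8);
    }
}
```

Because the planes start zeroed, skipping the copy for a NULL pointer gives the same result as copying 384 zero bytes, which is the equivalence the commit relies on.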
claude-noether	44ca4e550f	phase1: substrate selector API + cross-substrate bit-exact ctest

Surfaces daedalus-fourier's substrate-override capability at the decoder boundary. Lets tests run on CPU-only hosts (CI runners, x86 dev boxes) AND cross-checks V3D shader output against the NEON reference on hosts that have both.

API additions (pre-0.1 ABI, additive):
- daedalus_decoder_substrate enum { AUTO, CPU, QPU } (mirrors daedalus_substrate; isolated for ABI reasons).
- daedalus_decoder_set_substrate(dec, sub) setter, same mid-frame-change restrictions as set_output_format.
- Default remains AUTO, the only sensible choice for production.

Internal:
- flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an explicit substrate instead of daedalus_recipe_dispatch_*. Mapped via a small map_substrate() helper. No perf delta on AUTO (the recipe layer was just doing the same dispatch under the hood).

Test changes:
- test_smoke: new EXPECTs for set_substrate (valid + bogus).
- test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or "qpu" to force the substrate.
- CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the QVGA case forcing the CPU path. Catches silent drift between the V3D shader and the NEON reference; both must produce identical output for the same coefficient input (and they do; see the ctest log below).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ctest --test-dir build --output-on-failure
      Start 1: smoke
  1/4 Test #1: smoke ............................ Passed 0.10 sec
      Start 2: idct_bitexact
  2/4 Test #2: idct_bitexact .................... Passed 0.03 sec
      Start 3: idct_bitexact_cpu
  3/4 Test #3: idct_bitexact_cpu ................ Passed 0.03 sec
      Start 4: idct_bitexact_1080p
  4/4 Test #4: idct_bitexact_1080p .............. Passed 0.06 sec
  100% tests passed, 0 tests failed out of 4

CPU substrate produces byte-identical Y + Cb + Cr planes against the same C reference that the AUTO/QPU path matches, confirming that the V3D shaders and the daedalus-fourier NEON path agree at the spec level.

Why we plumbed the lower-level dispatch instead of leaving recipe in place: recipe is just a thin wrapper that calls dispatch with AUTO. Once we needed substrate control, the wrapper became a liability (it would have required adding a parallel recipe API for each substrate); going direct is simpler and the AUTO path is unchanged.

Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs) and not also at 1080p, because the CPU path's wall time scales linearly with block count and a 1080p CPU run is ~0.5s on hertz. That is fine standalone but slows ctest enough that it would tempt opt-in gating. The bit-exact content is the same regardless of frame size; the 1080p variant only exists to gate index-arithmetic bugs that surface above small int boundaries.	2026-05-24 23:07:45 +02:00
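
The cross-substrate check described above comes down to a byte-wise comparison of each output plane against the reference. A minimal sketch of such a comparison follows; `count_diff` and `report_plane` are hypothetical names, and the output format only imitates the "bytes diff" lines quoted in this log rather than reproducing the actual test source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Count positions where two equally sized planes differ. */
static size_t count_diff(const uint8_t *got, const uint8_t *ref, size_t n)
{
    size_t diff = 0;
    assert(got && ref);
    for (size_t i = 0; i < n; i++)
        diff += (got[i] != ref[i]);
    return diff;
}

/* Print a one-line summary for a plane; returns 1 if bit-exact, else 0. */
static int report_plane(const char *name, const uint8_t *got,
                        const uint8_t *ref, size_t n)
{
    size_t diff = count_diff(got, ref, n);
    double pct = n ? 100.0 * (double)diff / (double)n : 0.0;
    printf("%s bytes diff: %zu/%zu (%.4f%%)\n", name, diff, n, pct);
    return diff == 0;
}
```

Running the same comparison once per forced substrate against a single C reference is what lets a mismatch be attributed to one substrate rather than to the reference itself.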
claude-noether	adaabb1f63	phase1: IDCT 8x8 dispatch (High profile transform_size_8x8_flag)

Adds the High-profile 8x8 luma transform path alongside the existing 4x4 dispatch. flush_frame now partitions macroblocks by each MB's transform_8x8 flag and issues a separate luma dispatch per partition:
- mb.transform_8x8 == 0 (Baseline/Main) → coeffs[0..256) interpreted as 16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4 (existing behaviour, unchanged).
- mb.transform_8x8 == 1 (High) → coeffs[0..256) interpreted as 4 8x8 blocks (64 int16 each, column-major), fed to the new daedalus_recipe_dispatch_h264_idct8 call.

Both luma partitions can be non-empty in the same frame (FFmpeg sets the flag per-MB). Each non-empty partition costs one vkQueueSubmit + vkQueueWaitIdle; empty partitions are skipped (common case: Baseline streams skip the 8x8 dispatch entirely). Chroma is unchanged: 4:2:0 chroma always uses the 4x4 transform.

API surface:
- New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input` (after deblock_*). Backwards-compatible at the source level because the field defaults to 0 with C99 designated initialisers or {0} struct zeroing, both of which select the existing 4x4 path. ABI is pre-0.1 (per the header doc), so a structural change is fine.
- Mirrored in `struct daedalus_decoder_mb_desc` (internal layout).

Test changes:
- test_idct_bitexact now exercises a mixed-mode frame: every odd raster MB uses 8x8, every even one uses 4x4 (so flush_frame's partitioning is also under test, not just the underlying shaders).
- 8x8 C reference (h264_idct8_butterfly + ref_idct8_add) transcribed from daedalus-fourier tests/h264_idct8_ref.c per H.264 §8.5.13.2. Block layout column-major; +32 >> 6 rounding; add-to-predicted; clip255.
- Reference luma compute branches per MB on the same mb_8x8[] array used to set the input flag.

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

  $ ./build/test_idct_bitexact
  test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
  MB mix: 150 4x4 MBs, 150 8x8 MBs
  Y bytes total: 76800
  Y bytes diff: 0 (0.0000%)
  Cb bytes total: 19200 diff: 0 (0.0000%)
  Cr bytes total: 19200 diff: 0 (0.0000%)
  BIT-EXACT PASS (Y + Cb + Cr)

  $ ctest --test-dir build
  100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try for the 8x8 path: 150 8x8 MBs × 4 blocks = 600 8x8 IDCTs against the spec C reference, identical output. Validates both the daedalus-fourier IDCT 8x8 shader (already gated by its own cycle-7 bit-exact test, now also gated end-to-end through our flush_frame) and our 8x8 layout assumptions (column-major coeffs, raster sb_y*2+sb_x block order, top-left = mb*16 + sb*8).

What's NOT covered yet (deferred):
- Z-scan permutation for FFmpeg compatibility (libavcodec intercept patch's concern; both 4x4 and 8x8 z-scans differ).
- Chroma DC / luma Intra16x16 DC Hadamard pre-pass.
- Mixed intra/inter MB handling: currently all MBs are treated as residual-only (predicted=0).

Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.	2026-05-24 22:41:05 +02:00
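
The per-flag partitioning that flush_frame performs before the two luma dispatches can be sketched as below. This is an illustration under assumptions: `struct mb_partition` and `partition_mbs` are hypothetical names, and the real code may track partitions differently (for example as coefficient ranges rather than MB index lists).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes of the two luma partitions (hypothetical type). */
struct mb_partition {
    size_t n4; /* MBs using the 4x4 transform */
    size_t n8; /* MBs using the 8x8 transform */
};

/* Split raster-ordered MB indices by their transform_8x8 flag.
 * idx4 and idx8 must each have room for n_mbs entries. */
static struct mb_partition partition_mbs(const uint8_t *transform_8x8,
                                         size_t n_mbs,
                                         uint32_t *idx4, uint32_t *idx8)
{
    struct mb_partition p = { 0, 0 };
    assert(transform_8x8 && idx4 && idx8);
    for (size_t i = 0; i < n_mbs; i++) {
        if (transform_8x8[i])
            idx8[p.n8++] = (uint32_t)i;
        else
            idx4[p.n4++] = (uint32_t)i;
    }
    return p;
}
```

A caller would issue one dispatch per non-empty partition and skip a partition whose count is zero, which matches the Baseline case where no 8x8 dispatch is submitted.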
claude-noether	08080f062c	scaffold: CMake + API skeleton + smoke test

First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24. Repo skeleton only: no Vulkan pipeline yet, no shaders, no libavcodec intercept. Establishes the build shape so subsequent work has a place to land.

Layout:
  LICENSE
      BSD-2-Clause (matches daedalus-fourier)
  .gitignore
      build/, CMake artefacts, *.spv
  CMakeLists.txt
      top-level: finds daedalus-fourier ≥0.1.0 via pkg-config (per §9.6 decision: find_package, pinned to tagged release; .pc consumed via pkg_check_modules until we ship a CMake config), Vulkan via find_package, builds static lib + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h
      public API surface:
      - daedalus_decoder_{create,destroy,version,has_qpu}
      - daedalus_decoder_set_output_format (NV12 default, RGBA opt-in per §5)
      - daedalus_decoder_append_mb + struct daedalus_decoder_mb_input (matches §3 per-MB descriptor)
      - daedalus_decoder_flush_frame (per-frame submit + wait)
      - daedalus_decoder_export_dmabuf (Vulkan-native VkImage export per §9.4 decision)
      Dimensions are CODED frame size (mod-16), not displayed; the caller translates from SPS + crop offsets.
  src/internal.h
      internal mb_desc struct (matches shader std430 layout, to be nailed down once shaders exist) + per-ctx state
  src/daedalus_decoder.c
      stub bodies:
      - create/destroy with proper resource lifecycle
      - append_mb validates + writes CPU staging buffers (no GPU yet)
      - flush_frame returns -2 (not implemented; Phase 1 work)
      - export_dmabuf returns -1
      - has_qpu / version diagnostics
  tests/test_smoke.c
      link + lifecycle test: bad dims reject, OOB MB reject, null inputs reject, raster-order enforcement, mid-frame format-change reject, incomplete-frame flush reject. On hosts without V3D7 Vulkan, SKIPs gracefully (returns 0).

Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier 0.1.0):

  $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
  $ cmake --build build
  $ ctest --test-dir build --output-on-failure
  Test #1: smoke ... Passed
  $ ./build/test_smoke
  daedalus-decoder version: 0.0.1
  ctx created: 1920x1088, has_qpu=1
  smoke OK

Note the coded-vs-displayed dims trap: 1080p H.264 has coded height 1088 with 8 rows cropped via SPS frame_cropping_*. The header docstring on daedalus_decoder_create() spells this out so future callers don't hit the multiple-of-16 reject (the smoke test caught it during scaffold write).

Next: Phase 1 implementation begins with IDCT 4×4 / 8×8 frame-scaled dispatch (reusing daedalus-fourier shaders per Appendix A), intra prediction wavefront, reconstruct stage, NV12 output via dmabuf export. Smoke test grows from "ctx lifecycle works" to "I-frame-only Baseline decode bit-exact vs FFmpeg reference".	2026-05-24 22:08:46 +02:00
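
The coded-versus-displayed trap noted above is plain round-up-to-16 arithmetic. A small sketch, assuming hypothetical helper names `coded_dim` and `mb_count` that are not part of the daedalus-decoder API:

```c
#include <assert.h>

/* Round a displayed dimension up to the next multiple of 16 (one macroblock). */
static int coded_dim(int displayed)
{
    assert(displayed > 0);
    return (displayed + 15) & ~15;
}

/* Number of 16x16 macroblocks in a frame with the given coded size. */
static int mb_count(int coded_w, int coded_h)
{
    assert(coded_w % 16 == 0 && coded_h % 16 == 0);
    return (coded_w / 16) * (coded_h / 16);
}
```

For a 1920x1080 stream this gives a coded size of 1920x1088, so 8 rows are cropped away for display, and a 320x240 frame yields the 300 macroblocks quoted in the test logs.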

5 Commits