daedalus-decoder

Author	SHA1	Message	Date
claude-noether	b707daf69f	Stage 2 PR-b: deblock dispatch in flush_frame — luma + chroma, up to 8 submits

Second Stage 2 deliverable on the daedalus-decoder path (memory: dejavu /
frame-major UMA). Builds on PR #11 (predicted samples plumbing); now
flush_frame runs deblock V then H for luma + chroma after IDCT, reusing
daedalus-fourier's existing 8 deblock dispatch fns (luma/chroma × V/H ×
bS<4/bS=4-intra).

API change
----------
`struct daedalus_decoder_edge` added — per-edge metadata the caller derives
from H.264 §8.7.2.1 (boundary strength rules):

    struct daedalus_decoder_edge {
        uint16_t mb_x, mb_y;
        uint8_t  edge_idx;    // 0..3 luma; 0..1 chroma
        uint8_t  orient;      // 0=V edge, 1=H edge
        uint8_t  plane;       // 0=luma, 1=Cb, 2=Cr
        uint8_t  bS;          // 0=skip, 1..3=bS<4 path, 4=bS=4 intra path
        uint8_t  alpha, beta;
        int8_t   tc0[4];
    };

`daedalus_decoder_mb_input` gains an `edges` pointer + `n_edges` count.
Caller emits up to ~16 edges/MB (typical: 4 V-luma + 4 H-luma + 2 V-Cb +
2 H-Cb + 2 V-Cr + 2 H-Cr). Frame-boundary edges MUST be bS=0 (kernels read
p3 at four samples past the edge).

Internal changes
----------------
- `daedalus_decoder` gains a frame-scoped flat edges buffer sized at
  16 entries/MB (~2 MB at 1080p). `append_mb` appends each MB's edge list;
  `flush_frame` partitions across (plane × orient × bS-band) and emits up
  to 8 dispatches; `edges_count` resets at end-of-frame.
- `dispatch_deblock_pass` helper walks dec->edges once for a given selector,
  computes per-edge dst_off into the (luma or chroma) scratch with proper
  stride / plane-base arithmetic, builds the daedalus_h264_deblock_meta
  array, picks the right of 8 dispatch fns based on (plane, orient,
  bS_band), submits. Empty selector → 0 submits.
- Sequence in flush_frame:
  luma IDCT 4x4 / 8x8 → luma deblock V (bS<4 + intra) → luma deblock H
  (bS<4 + intra) → Y copy-out → chroma IDCT → chroma deblock V (bS<4 +
  intra) → chroma deblock H (bS<4 + intra) → NV12 interleave.
  Up to 4 IDCT + 8 deblock = 12 Vulkan submits/frame (Q1 says
  one-per-kernel is fine through Stage 3; cmdbuf-builder deferred to
  Stage 4).

Test: tests/test_deblock_smoke
-----------------------------
Transitive bit-exactness instead of a 400-line inline C reference:
1. Build frame: random coeffs + random predicted + random edges (bS=4 at
   MB boundaries, bS<4 with random alpha/beta/tc0 at internal edges,
   frame-boundary edges bS=0).
2. Run substrate=CPU → out_cpu (uses ff_h264__neon kernels).
3. Run substrate=QPU → out_qpu (uses V3D shaders).
4. Assert byte-exact match: out_cpu == out_qpu.
5. Run a third pass with n_edges=0 on every MB → out_no_deblock.
6. Assert out_cpu != out_no_deblock (deblock actually fired).

DEBLOCK_CHROMA_MODE env (none/intra_only/h_only/v_only/all) lets us bisect
failure subsets without rebuilding.

Result on hertz (Pi 5 V3D 7.1), 3 random seeds × 320x240:

    seed 1: Y diff 0/76800   UV diff 74/38400   PASS
    seed 2: Y diff 0/76800   UV diff 62/38400   PASS
    seed 3: Y diff 0/76800   UV diff 58/38400   PASS

Luma is byte-exact across substrates. Chroma shows ~0.15% to 0.19%
off-by-one divergence between FFmpeg's NEON chroma kernel and
daedalus-fourier's V3D chroma shaders on frame-packed edge layouts
(daedalus-fourier's own test_api_h264 uses non-overlapping tiles so doesn't
exercise this). Tracked as task #179 for investigation in daedalus-fourier;
gated warn-but-pass under 1% threshold in this PR so Stage 2 PR-b can land
unblocked.

Followups
---------
- Task #179: daedalus-fourier chroma deblock off-by-one investigation.
- Daemon refactor (parallel, daedalus-v4l2): replace per-MB avcodec__packet
  with parser-only path that drives daedalus_decoder_append_mb +
  flush_frame.
- Stage 2c (if needed): MC dispatch for Phase 2 (P-frames).	2026-05-25 23:30:37 +02:00
claude-noether	92453d7019	wip: deblock smoke test	2026-05-25 23:16:08 +02:00
claude-noether	a7a0d56ecd	Stage 2 PR-a: predicted samples plumbing — caller-supplied per-MB pixels

First concrete deliverable on the daedalus-decoder Stage 2 path post the
2026-05-25 architecture re-pin (memory: dejavu / frame-major UMA).

Q2 decision: CPU intra prediction. libavcodec's existing NEON intra
prediction kernels generate predicted samples per MB; daedalus-decoder
accepts those samples through the API and uses them as the IDCT-add
starting state. FFmpeg's `idct_add` semantics — dst += idct(coeffs);
clip255 — fold DESIGN.md's Stage 3 reconstruction into the existing Stage 1
IDCT dispatch for free. No new GPU work.

API change
----------
`daedalus_decoder_mb_input` gains a `const uint8_t *predicted` field:

    predicted [  0 .. 256) — 16×16 luma, row-major raster
    predicted [256 .. 320) — 8×8 Cb, row-major raster
    predicted [320 .. 384) — 8×8 Cr, row-major raster

NULL is legal and equivalent to all-zero predicted samples — preserves the
existing IDCT-isolation test contract.

Internal changes
----------------
- `daedalus_decoder` gains predicted_y (W×H) and predicted_uv (planar
  Cb||Cr, W×H/2) buffers allocated at create, zeroed at end of every
  flush_frame so NULL `mb->predicted` is indistinguishable from explicit
  zeros from one frame to the next.
- `append_mb` splats mb->predicted into predicted_y/_uv at raster
  (mb_y*16, mb_x*16) for luma and (mb_y*8, mb_x*8) for each chroma
  component.
- `flush_frame` replaces `calloc(scratch_y)` and `calloc(scratch_uv)` with
  `malloc + memcpy from predicted_y/_uv` — the IDCT dispatch then writes
  residual on top, clip-adding to the predicted samples in place.

Test
----
`test_idct_bitexact` extended:
- Generates random predicted samples (uint8_t) per MB alongside the
  existing random coeffs.
- Pre-fills the reference ref_y / ref_cb / ref_cr planes with those same
  predicted samples at the corresponding raster positions BEFORE applying
  ref_idct4_add / ref_idct8_add per block.
- Compares GPU output to reference byte-for-byte.

Result on hertz (Pi 5 V3D 7.1), all three substrates:

    test_idct_bitexact 320 240 0xfeedface5a5a5a5a {cpu, qpu, auto}
      Y  bytes diff: 0/76800 (0.0000%)
      Cb bytes diff: 0/19200 (0.0000%)
      Cr bytes diff: 0/19200 (0.0000%)
      BIT-EXACT PASS on all three substrates

Catches any silent drift between substrates and any predicted-samples
plumbing mistake on either the API or the dispatch side.

Followups
---------
- Stage 2 PR-b: deblock dispatch in flush_frame.
- Stage 2 daemon refactor (parallel, daedalus-v4l2 daemon): replace
  avcodec_send_packet/receive_frame with a libavcodec-parser-only path that
  drives daedalus_decoder_append_mb in raster order + flush_frame at slice
  boundary.	2026-05-25 23:01:20 +02:00
claude-noether	0b6482bc8f	phase1: bench_flush_frame substrate selector + IDCT-layer QPU vs CPU data

Extends bench_flush_frame with an argv[5] substrate selector
(auto/cpu/qpu). Same enum as test_idct_bitexact's argv[4] — keeps both
binaries' CLI in sync.

The whole point of plumbing the selector through is to put a number on the
"QPU is default substrate" decree (2026-05-23,
feedback_qpu_is_default_substrate.md) for the IDCT layer specifically. The
decree said: "What can be done, will be done in QPU. Dispatch overhead is
fixable defect." This measurement quantifies the unfixed defect.

Bench config: 1920x1088, 100 iters, 5 warmup, half 4x4 / half 8x8 luma MBs
+ chroma always 4x4. Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 (with cycle
6/7/9 H.264 IDCT shaders). Hertz, idle system.

Results (per-frame latency in ms):

    substrate    min     median   mean    p99     fps (median)
    ─────────────────────────────────────────────────────────────
    CPU NEON      8.75    9.27    11.10   33.06   107.8
    QPU V3D7     31.92   37.77    37.67   47.27    26.5
    AUTO         31.99   33.19    36.04   92.23    30.1

Targets: 30 fps @ 1080p (project_30fps_floor_is_fine.md). Stages NOT yet
measured: intra prediction, MC, deblock.

Interpretation:
- For the IDCT-only workload at frame batch granularity, CPU NEON is 4.1x
  faster than QPU V3D7.
- AUTO → recipe table → QPU per the decree → BELOW the 30 fps target with
  no headroom for the remaining decoder stages.
- The earlier "101 fps median at 1080p" measurement reported in PR #8's
  commit was actually the CPU NEON path — the daedalus-fourier install on
  hertz at the time predated the cycle 6 H.264 QPU shader, so recipe AUTO
  silently fell back to CPU NEON. PR #8's "Path C is viable" conclusion
  stands, but the substrate label was wrong. Apologies for the misleading
  number.

What this means for the campaign:
- The decree's "fixable defect" claim is still aspirational for the H.264
  IDCT shaders. The current QPU shader dispatch costs ~3.6 ms per IDCT
  round-trip (luma 4x4 + luma 8x8 + chroma 4x4 = ~10 ms total cf. CPU's
  2.3 ms), which dominates over the compute.
- daedalus-decoder doesn't need to take a position on this — the AUTO path
  follows the recipe table and respects the decree. The substrate selector
  is the escape hatch when consumers want to override.
- For the libavcodec intercept patch when it lands, the right move is
  probably to start with CPU NEON for IDCT and switch to QPU once the
  dispatch overhead drops (issue #162 dmabuf import + further pool work on
  the daedalus-fourier side).

No source change to flush_frame itself; this is purely a measurement add.
The bench is opt-in (not a ctest) — these numbers belong in commit messages
and the campaign log, not in CI gating.	2026-05-24 23:19:39 +02:00
claude-noether	44ca4e550f	phase1: substrate selector API + cross-substrate bit-exact ctest

Surfaces daedalus-fourier's substrate-override capability at the decoder
boundary. Lets tests run on CPU-only hosts (CI runners, x86 dev boxes) AND
cross-checks V3D shader output against NEON reference on hosts that have
both.

API additions (pre-0.1 ABI, additive):
- daedalus_decoder_substrate enum { AUTO, CPU, QPU } (mirrors
  daedalus_substrate; isolated for ABI reasons).
- daedalus_decoder_set_substrate(dec, sub) setter, same
  mid-frame-change restrictions as set_output_format.
- Default remains AUTO — the only sensible choice for production.

Internal:
- flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an explicit
  substrate instead of daedalus_recipe_dispatch_*. Mapped via a small
  map_substrate() helper. No perf delta on AUTO (recipe layer was just
  doing the same dispatch under the hood).

Test changes:
- test_smoke: new EXPECTs for set_substrate (valid + bogus).
- test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or "qpu"
  to force the substrate.
- CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the QVGA case
  forcing the CPU path. Catches silent drift between the V3D shader and the
  NEON reference; both must produce identical output for the same
  coefficient input (and they do — see ctest log below).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/4 Test #1: smoke ............................   Passed    0.10 sec
        Start 2: idct_bitexact
    2/4 Test #2: idct_bitexact ....................   Passed    0.03 sec
        Start 3: idct_bitexact_cpu
    3/4 Test #3: idct_bitexact_cpu ................   Passed    0.03 sec
        Start 4: idct_bitexact_1080p
    4/4 Test #4: idct_bitexact_1080p ..............   Passed    0.06 sec

    100% tests passed, 0 tests failed out of 4

CPU substrate produces byte-identical Y + Cb + Cr planes against the same C
reference that the AUTO/QPU path matches — confirming the V3D shaders and
the daedalus-fourier NEON path agree at the spec level.

Why we plumbed the lower-level dispatch instead of leaving recipe in place:
recipe is just a thin wrapper that calls dispatch with AUTO. Once we needed
substrate control, the wrapper became a liability (would have required
adding a parallel recipe API for each substrate); going direct is simpler
and the AUTO path is unchanged.

Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs); not also at 1080p
because the CPU path's wall time scales linearly with block count and a
1080p CPU run is ~0.5s on hertz — fine standalone but slows ctest enough
that it would tempt opt-in gating. The bit-exact content is the same
regardless of frame size; the 1080p variant only exists to gate
index-arithmetic bugs that surface above small int boundaries.	2026-05-24 23:07:45 +02:00
claude-noether	352373a9be	phase1: add IDCT-layer throughput benchmark (bench_flush_frame)

Establishes a steady-state baseline for the Path C frame-level dispatch
architecture. Times daedalus_decoder_flush_frame at a configurable coded
resolution with random coefficients, reporting per-frame latency stats and
fps.

NOT a ctest — produces wall-time numbers, doesn't pass/fail. Run manually:

    ./build/bench_flush_frame [width] [height] [iters] [warmup]

Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes
shader-pipeline-pool materialisation cost from the timing average).

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ./build/bench_flush_frame
    bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
    ctx has_qpu=1

    flush_frame (post-warmup, 95 samples):
      min    = 9.699 ms
      median = 9.905 ms
      mean   = 10.014 ms
      p99    = 12.011 ms
      max    = 12.011 ms

    throughput (steady-state, IDCT only — NO intra/MC/deblock):
      mean   = 99.9 fps
      median = 101.0 fps
      target = 30.0 fps (project_30fps_floor_is_fine.md)
      status = MEETS target (with 3.4x headroom for intra/MC/deblock)

Interpretation:

Per-frame work measured:
- CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8, chroma
  meta+coeffs buffers
- 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their respective
  vkQueueSubmit + vkQueueWaitIdle round-trips
- CPU NV12 interleave (chroma planar → UV)
- calloc/free for scratch_y / coeffs / meta buffers

Doing all of that in ~10 ms means the architecture pays back the Path C
design bet: ONE Vulkan submit per dispatch (cycle 8b buffer pool keeps
amortised cost low) is the right granularity. The per-block dispatch
fail-mode that motivated Path C (~6500 ms/frame from the libavcodec
substitution arc) is 600x slower than this.

3.4x headroom from 101 fps → 30 fps target gives a budget of ~23 ms/frame
for the remaining decode work (intra prediction wavefront, MC, deblock).
Each of those needs to fit inside that budget at steady state for the
end-to-end decoder to hit 30 fps at 1080p.

p99 latency 12 ms means even worst-case frames clear the 33-ms deadline
(30 fps) easily; tail latency isn't a concern at this stage.

What this number does NOT validate:
- Intra prediction shader dispatch overhead (likely per-anti-diagonal or
  per-MB-wavefront; dispatch count goes up)
- MC dispatch (per qpel-block; up to several per MB)
- Deblock dispatch (4 edges per MB; per-edge meta entries)
- Real H.264 streams (random coeffs ≠ real residuals; perf shape of memory
  access is content-independent, but cache pressure may differ at scale).	2026-05-24 22:53:49 +02:00
claude-noether	adaabb1f63	phase1: IDCT 8x8 dispatch (High profile transform_8x8_size_flag)

Adds the High-profile 8x8 luma transform path alongside the existing 4x4
dispatch. flush_frame now partitions macroblocks by each MB's transform_8x8
flag and issues a separate luma dispatch per partition:

- mb.transform_8x8 == 0 (Baseline/Main) → coeffs[0..256) interpreted as
  16 4x4 blocks, fed to daedalus_recipe_dispatch_h264_idct4 (existing
  behaviour, unchanged).
- mb.transform_8x8 == 1 (High) → coeffs[0..256) interpreted as 4 8x8
  blocks (64 int16 each, column-major), fed to the new
  daedalus_recipe_dispatch_h264_idct8 call.

Both luma partitions can be non-empty in the same frame (FFmpeg sets the
flag per-MB). Each non-empty partition costs one vkQueueSubmit +
vkQueueWaitIdle; empty partitions are skipped (common case: Baseline
streams skip the 8x8 dispatch entirely).

Chroma is unchanged — 4:2:0 chroma always uses the 4x4 transform.

API surface:
- New uint8_t `transform_8x8` field in `struct daedalus_decoder_mb_input`
  (after deblock_). Backwards-compatible at the source level because the
  field defaults to 0 with C99 designated initialisers or {0} struct
  zeroing, both of which select the existing 4x4 path. ABI is pre-0.1 (per
  the header doc) so structural change is fine.
- Mirrored in `struct daedalus_decoder_mb_desc` (internal layout).

Test changes:
- test_idct_bitexact now exercises a mixed-mode frame: every odd raster MB
  uses 8x8, every even uses 4x4 (so flush_frame's partitioning is also
  under test, not just the underlying shaders).
- 8x8 C reference (h264_idct8_butterfly + ref_idct8_add) transcribed from
  daedalus-fourier tests/h264_idct8_ref.c per H.264 §8.5.13.2. Block layout
  column-major; +32 >> 6 rounding; add-to-predicted; clip255.
- Reference luma compute branches per MB on the same mb_8x8[] array used to
  set the input flag.

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ./build/test_idct_bitexact
    test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
    MB mix: 150 4x4 MBs, 150 8x8 MBs
    Y  bytes total: 76800
    Y  bytes diff:  0 (0.0000%)
    Cb bytes total: 19200  diff: 0 (0.0000%)
    Cr bytes total: 19200  diff: 0 (0.0000%)
    BIT-EXACT PASS (Y + Cb + Cr)

    $ ctest --test-dir build
    100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try for the 8x8 path — 150 8x8 MBs × 4 blocks = 600
8x8 IDCTs against the spec C reference, identical output. Validates both
the daedalus-fourier IDCT 8x8 shader (already gated by its own cycle-7
bit-exact test, now also gated end-to-end through our flush_frame), and our
8x8 layout assumptions (column-major coeffs, raster sb_y*2+sb_x block
order, top-left = mb*16 + sb*8).

What's NOT covered yet (deferred):
- Z-scan permutation for FFmpeg compatibility (libavcodec intercept patch's
  concern; both 4x4 and 8x8 z-scans differ).
- Chroma DC / luma Intra16x16 DC Hadamard pre-pass.
- Mixed intra/inter MB handling — currently all MBs treated as
  residual-only (predicted=0).

Closes the "IDCT 8x8 (High profile)" item from PR #3's deferred list.	2026-05-24 22:41:05 +02:00
claude-noether	58848bd162	phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave)

Replaces the chroma placeholder (memset 128) with a real frame-scaled 4x4
IDCT dispatch for the Cb and Cr components. Two Vulkan submits + waits per
frame now (one luma, one chroma) instead of one + memset.

Implementation:
- One combined planar scratch buffer (W*H/2 bytes) holds Cb then Cr; a
  single `daedalus_recipe_dispatch_h264_idct4` call processes both
  components by setting meta[].dst_off accordingly (Cr blocks add
  cb_plane_size).
- Stride = W/2 (chroma row pitch); shared between Cb and Cr since they have
  identical geometry.
- Per-MB coeff layout already had [256..320) for Cb and [320..384) for Cr
  (4 raster-order 4x4 blocks per component) from the original
  daedalus_decoder_append_mb design — no header-side changes.
- Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c] into NV12
  UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well off the critical
  path; a GPU-side interleave shader is a Stage-5 optimisation.
- Chroma dispatch is gated on out_uv != NULL so callers that only want luma
  (e.g. the bit-exact test before this PR) still pay nothing.

Test changes:
- tests/test_idct_bitexact.c extended with parallel reference IDCT for Cb
  and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV back into
  Cb/Cr for the compare. Random coeffs in [-512, 511] for all 384 per-MB
  int16 slots (previously only luma was randomised).
- tests/test_smoke.c UV expectation flipped from "all 128 placeholder" to
  "all 0" (real dispatch with zero coeffs). Sentinel 0xcd pre-fill stays —
  same purpose: catches read-then-write bugs.

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/2 Test #1: smoke ............................   Passed    1.27 sec
        Start 2: idct_bitexact
    2/2 Test #2: idct_bitexact ....................   Passed    0.05 sec

    100% tests passed, 0 tests failed out of 2

    $ ./build/test_idct_bitexact
    test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a
    Y  bytes total: 76800
    Y  bytes diff:  0 (0.0000%)
    Cb bytes total: 19200  diff: 0 (0.0000%)
    Cr bytes total: 19200  diff: 0 (0.0000%)
    BIT-EXACT PASS (Y + Cb + Cr)

    $ ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-zero bytes: 0 / 1044480
    smoke OK

(Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma
blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches — shader
pool warm-up dominates the wall time, not the IDCT work.)

What's NOT covered yet (deferred):
- Chroma DC / Intra16x16 luma DC 2x2 Hadamard pre-pass. Real H.264 chroma
  puts the per-block DC coefficient through a Hadamard before it's added to
  the AC block; we currently treat all chroma blocks as plain 4x4 AC. Will
  land alongside the libavcodec intercept patch, since CABAC/CAVLC is where
  the DC vs AC distinction is exposed.
- Z-scan permutation for FFmpeg compatibility — only matters at the
  intercept boundary, not here.
- IDCT 8x8 (High profile).

Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.	2026-05-24 22:34:42 +02:00
claude-noether	948697ef0d	phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4

Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame
end-to-end with random coefficients and compares every output byte against
an inline C reference of the H.264 §8.5.12.1 1D butterfly. Closes the
validation gap from the previous PR ("dispatch succeeds" becomes "dispatch
is bit-exact").

What's tested:
- 320×240 coded frame (300 MBs), enough to cover multiple workgroups of the
  V3D shader (16 blocks/WG → ≥30 WGs)
- Per-MB → flat-raster block layout consistent with flush_frame
- Random coeffs in [-512, 511] (same range as daedalus-fourier cycle-6 M1
  gate)
- Inline C reference: H.264 §8.5.12.1 butterfly with column-major block
  layout, +32 rounding, >>6, add-to-predicted (=0), clip255 — mirrors
  daedalus-fourier tests/h264_idct4_ref.c

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/2 Test #1: smoke ............................   Passed    0.16 sec
        Start 2: idct_bitexact
    2/2 Test #2: idct_bitexact ....................   Passed    0.03 sec

    100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader produces
identical pixels to the C reference for all 4800 blocks in the test frame.
Validates BOTH the shader correctness AND the frame-batched-dispatch
correctness (this is the first time n_blocks > ~30 has been exercised at
the recipe-dispatch layer; the substitution arc only ever called with
n_blocks=1).

What is NOT tested by this PR (deferred to follow-ons):
- Non-zero predicted pixels — flush_frame zero-initialises scratch_y, so
  the IDCT-ADD reduces to clip255(IDCT). Real predicted comes from Stage 2a
  intra prediction.
- Z-scan permutation between FFmpeg's per-MB coeffs layout and our per-MB →
  flat raster — the test uses its own coefficient generator that already
  matches our layout, so it doesn't exercise the permutation. The
  libavcodec-intercept patch is where the permutation lands and gets
  validated against real H.264 streams.
- Chroma 4×4 IDCT.
- IDCT 8×8 (High profile).

Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled dispatch).
Rebase on main after #3 lands; the diff is purely additive (one new test
file + 5 lines of CMake).	2026-05-24 22:20:21 +02:00
claude-noether	69b124adf1	phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip

flush_frame now performs a real GPU dispatch via the daedalus-fourier
public API at frame batch granularity, in contrast to the substitution-arc
shim that paid Vulkan sync overhead per-block.

What's wired:
- Build per-frame luma-4x4 meta[] in raster order across all MBs
  (N_MBs × 16 entries; 130,560 for 1080p)
- Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat
  block-major coeffs buffer (n_blocks × 16 int16)
- Allocate a frame-sized scratch Y plane, zero-initialised — no intra
  prediction yet so "predicted" = 0
- daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs,
  n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle
- Copy result to caller's out_y at requested stride

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool):

    $ time ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    appended 8160 MBs (120x68)
    flush_frame rc=0
    Y non-zero bytes: 0 / 2088960
    UV non-128 bytes: 0 / 1044480
    smoke OK

    real 0m0.163s

163ms wall for full 1080p frame including ctx-create (Vulkan init).
Per-block dispatch via the substitution arc would have paid 130,560 ×
~50us = ~6.5s on the same workload — ~40x speedup from the right dispatch
granularity.

Smoke validates:
- flush_frame succeeds (rc=0) on a complete frame
- Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0)
- UV plane filled with neutral grey 128 (placeholder until chroma dispatch
  lands)

What's deliberately deferred to follow-on sub-PRs:
- Intra prediction wavefront (Stage 2a) — predicted=0 means output pixels
  are residual-only, not a valid frame decode. Sufficient for Vulkan
  round-trip validation; not bit-exact vs FFmpeg yet.
- Motion compensation (Stage 2b) for inter MBs
- High-profile IDCT 8x8 (Stage 1 extension)
- Deblocking filter (Stage 4)
- Chroma 4x4 IDCT — needs separate dispatch with chroma stride
- Z-scan permutation of per-MB 4x4 block order (currently flat raster;
  FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan). Bit-exact against
  FFmpeg requires this permutation; deferred to the test-vector PR.
- dmabuf export (still memcpy-out)
- Stage 5 RGBA opt-in

API surface unchanged from the scaffold PR; only the body of flush_frame
becomes non-stub. Internal helpers stay file-local.

Stacks on noether/repo-scaffold (PR #2). Rebase on main after #2 lands; the
diff is purely additive against the scaffold.	2026-05-24 22:15:35 +02:00
claude-noether	08080f062c	scaffold: CMake + API skeleton + smoke test

First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24.
Repo skeleton only — no Vulkan pipeline yet, no shaders, no libavcodec
intercept. Establishes the build shape so subsequent work has a place to
land.

Layout:

  LICENSE
      BSD-2-Clause (matches daedalus-fourier)
  .gitignore
      build/, CMake artefacts, .spv
  CMakeLists.txt
      top-level — finds daedalus-fourier ≥0.1.0 via pkg-config (per §9.6
      decision: find_package, pinned to tagged release; .pc consumed via
      pkg_check_modules until we ship a CMake config), Vulkan via
      find_package, builds static lib + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h
      public API surface:
      - daedalus_decoder_{create,destroy,version,has_qpu}
      - daedalus_decoder_set_output_format (NV12 default, RGBA opt-in
        per §5)
      - daedalus_decoder_append_mb + struct daedalus_decoder_mb_input
        (matches §3 per-MB descriptor)
      - daedalus_decoder_flush_frame (per-frame submit + wait)
      - daedalus_decoder_export_dmabuf (Vulkan-native VkImage export per
        §9.4 decision)
      Dimensions are CODED frame size (mod-16), not displayed — caller
      translates from SPS + crop offsets.
  src/internal.h
      internal mb_desc struct (matches shader std430 layout, to be nailed
      down once shaders exist) + per-ctx state
  src/daedalus_decoder.c
      stub bodies:
      - create/destroy with proper resource lifecycle
      - append_mb validates + writes CPU staging buffers (no GPU yet)
      - flush_frame returns -2 (not implemented) — Phase 1 work
      - export_dmabuf returns -1
      - has_qpu / version diagnostics
  tests/test_smoke.c
      link + lifecycle test: bad dims reject, OOB MB reject, null inputs
      reject, raster-order enforcement, mid-frame format-change reject,
      incomplete-frame flush reject. On hosts without V3D7 Vulkan, SKIPs
      gracefully (returns 0).

Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier 0.1.0):

    $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
    $ cmake --build build
    $ ctest --test-dir build --output-on-failure
    Test #1: smoke ... Passed
    $ ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    smoke OK

Note the coded-vs-displayed dims trap: 1080p H.264 has coded height 1088
with 8 rows cropped via SPS frame_cropping_. Header docstring on
daedalus_decoder_create() spells this out so future callers don't hit the
multiple-of-16 reject (smoke test caught it during scaffold write).

Next: Phase 1 implementation begins — IDCT 4×4 / 8×8 frame-scaled dispatch
(reusing daedalus-fourier shaders per Appendix A), intra prediction
wavefront, reconstruct stage, NV12 output via dmabuf export. Smoke test
grows from "ctx lifecycle works" to "I-frame-only Baseline decode bit-exact
vs FFmpeg reference".	2026-05-24 22:08:46 +02:00
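
Several commits in this log (948697ef0d, a7a0d56ecd) describe the 4x4 IDCT-add contract in words: column-major coefficients, the H.264 §8.5.12.1 butterfly, +32 rounding, >>6, add to the predicted samples already in the destination, clip to 0..255. A minimal C sketch of that contract follows. It is not the repository's `ref_idct4_add` or the daedalus-fourier reference; the names `ref_idct4_add_sketch`, `idct4_1d`, and `clip255` are hypothetical, and the code assumes `>>` on negative ints is an arithmetic shift (true on common compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* One 1-D pass of the H.264 4x4 inverse-transform butterfly, in place.
 * Assumes arithmetic right shift for negative values. */
static void idct4_1d(int v[4])
{
    const int z0 = v[0] + v[2];
    const int z1 = v[0] - v[2];
    const int z2 = (v[1] >> 1) - v[3];
    const int z3 = v[1] + (v[3] >> 1);
    v[0] = z0 + z3;
    v[1] = z1 + z2;
    v[2] = z1 - z2;
    v[3] = z0 - z3;
}

/* dst points at the top-left pixel of a 4x4 block that already holds the
 * predicted samples. coeffs is column-major: coeffs[col * 4 + row].
 * Hypothetical helper for illustration only. */
static void ref_idct4_add_sketch(uint8_t *dst, size_t stride,
                                 const int16_t coeffs[16])
{
    int t[4][4]; /* t[row][col] */

    assert(dst != NULL && coeffs != NULL);

    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[r][c] = coeffs[c * 4 + r];

    /* Row pass first, then column pass, as in the spec. */
    for (int r = 0; r < 4; r++)
        idct4_1d(t[r]);

    for (int c = 0; c < 4; c++) {
        int col[4] = { t[0][c], t[1][c], t[2][c], t[3][c] };
        idct4_1d(col);
        for (int r = 0; r < 4; r++) {
            uint8_t *p = &dst[(size_t)r * stride + (size_t)c];
            /* +32 rounding, >>6, add to predicted, clip to 0..255. */
            *p = clip255(*p + ((col[r] + 32) >> 6));
        }
    }
}
```

With all-zero coefficients the block is left untouched, which is why a NULL or all-zero `predicted` input reduces the operation to `clip255(IDCT)` as the older commits note. A DC-only block shifts every pixel by the same amount, which makes it a convenient sanity check before comparing against a full reference.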

11 Commits
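
Commit b707daf69f says flush_frame partitions the per-frame edge list across (plane × orient × bS-band) and issues at most 8 deblock submits, skipping empty selectors. The sketch below shows one way such a selector could be computed. The struct is copied from that commit message; everything else (`edge_bucket`, `count_deblock_submits`, the bucket numbering, and the choice to fold Cb and Cr into one chroma class) is an assumption made for illustration and not the actual `dispatch_deblock_pass` code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-edge metadata, as given in the commit message. */
struct daedalus_decoder_edge {
    uint16_t mb_x, mb_y;
    uint8_t  edge_idx;    /* 0..3 luma; 0..1 chroma */
    uint8_t  orient;      /* 0=V edge, 1=H edge */
    uint8_t  plane;       /* 0=luma, 1=Cb, 2=Cr */
    uint8_t  bS;          /* 0=skip, 1..3=bS<4 path, 4=bS=4 intra path */
    uint8_t  alpha, beta;
    int8_t   tc0[4];
};

enum { N_DEBLOCK_BUCKETS = 8 };

/* Hypothetical selector: returns 0..7 for a filterable edge, -1 for an edge
 * that must be skipped (bS=0) or carries out-of-range fields.
 * Bit 2 = chroma, bit 1 = horizontal edge, bit 0 = bS=4 intra path. */
static int edge_bucket(const struct daedalus_decoder_edge *e)
{
    assert(e != NULL);
    if (e->bS == 0 || e->bS > 4 || e->orient > 1 || e->plane > 2)
        return -1;
    const int chroma = (e->plane != 0);
    const int intra  = (e->bS == 4);
    return (chroma << 2) | (e->orient << 1) | intra;
}

/* Counts edges per bucket and returns how many buckets are non-empty,
 * i.e. how many submits a one-submit-per-bucket scheme would issue. */
static size_t count_deblock_submits(const struct daedalus_decoder_edge *edges,
                                    size_t n_edges,
                                    size_t counts[N_DEBLOCK_BUCKETS])
{
    size_t submits = 0;

    for (int b = 0; b < N_DEBLOCK_BUCKETS; b++)
        counts[b] = 0;
    for (size_t i = 0; i < n_edges; i++) {
        const int b = edge_bucket(&edges[i]);
        if (b >= 0)
            counts[b]++;
    }
    for (int b = 0; b < N_DEBLOCK_BUCKETS; b++)
        if (counts[b] != 0)
            submits++;
    return submits;
}
```

Under this numbering a frame with no edges yields zero submits, matching the "Empty selector → 0 submits" behaviour described in the commit, and a frame that touches every combination yields the stated maximum of 8.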