daedalus-decoder

Author	SHA1	Message	Date
claude-noether	a9380e33ad	wip: tools/daedalus_decode_h264 cli	2026-05-26 06:10:56 +02:00
claude-noether	92453d7019	wip: deblock smoke test	2026-05-25 23:16:08 +02:00
claude-noether	44ca4e550f	phase1: substrate selector API + cross-substrate bit-exact ctest

Surfaces daedalus-fourier's substrate-override capability at the decoder boundary. Lets tests run on CPU-only hosts (CI runners, x86 dev boxes) AND cross-checks V3D shader output against NEON reference on hosts that have both.

API additions (pre-0.1 ABI, additive):
- daedalus_decoder_substrate enum { AUTO, CPU, QPU } (mirrors daedalus_substrate; isolated for ABI reasons).
- daedalus_decoder_set_substrate(dec, sub) setter, same mid-frame-change restrictions as set_output_format.
- Default remains AUTO, the only sensible choice for production.

Internal:
- flush_frame now calls daedalus_dispatch_h264_idct{4,8} with an explicit substrate instead of daedalus_recipe_dispatch_*. Mapped via a small map_substrate() helper. No perf delta on AUTO (recipe layer was just doing the same dispatch under the hood).

Test changes:
- test_smoke: new EXPECTs for set_substrate (valid + bogus).
- test_idct_bitexact: new argv[4] takes "auto" (default), "cpu", or "qpu" to force the substrate.
- CMakeLists.txt: new ctest entry `idct_bitexact_cpu` re-runs the QVGA case forcing the CPU path. Catches silent drift between the V3D shader and the NEON reference; both must produce identical output for the same coefficient input (and they do; see ctest log below).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/4 Test #1: smoke ............................ Passed 0.10 sec
        Start 2: idct_bitexact
    2/4 Test #2: idct_bitexact .................... Passed 0.03 sec
        Start 3: idct_bitexact_cpu
    3/4 Test #3: idct_bitexact_cpu ................ Passed 0.03 sec
        Start 4: idct_bitexact_1080p
    4/4 Test #4: idct_bitexact_1080p .............. Passed 0.06 sec
    100% tests passed, 0 tests failed out of 4

CPU substrate produces byte-identical Y + Cb + Cr planes against the same C reference that the AUTO/QPU path matches, confirming the V3D shaders and the daedalus-fourier NEON path agree at the spec level.

Why we plumbed the lower-level dispatch instead of leaving recipe in place: recipe is just a thin wrapper that calls dispatch with AUTO. Once we needed substrate control, the wrapper became a liability (would have required adding a parallel recipe API for each substrate); going direct is simpler and the AUTO path is unchanged.

Coverage note: idct_bitexact_cpu runs at QVGA (300 MBs); not also at 1080p because the CPU path's wall time scales linearly with block count and a 1080p CPU run is ~0.5s on hertz: fine standalone, but it slows ctest enough that it would tempt opt-in gating. The bit-exact content is the same regardless of frame size; the 1080p variant only exists to gate index-arithmetic bugs that surface above small int boundaries.	2026-05-24 23:07:45 +02:00
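The argv[4] handling described in the commit above could look roughly like the sketch below. The enum and function names here (`substrate_t`, `parse_substrate`) are hypothetical stand-ins, not the identifiers in daedalus_decoder.h, and the numeric values are assumed; only the three accepted strings and the "auto" default come from the commit message.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the AUTO/CPU/QPU enum named in the commit.
 * The real identifiers and values in daedalus_decoder.h may differ. */
typedef enum {
    SUBSTRATE_AUTO = 0,
    SUBSTRATE_CPU  = 1,
    SUBSTRATE_QPU  = 2
} substrate_t;

/* Parse an optional substrate argument.
 * A missing argument (NULL) falls back to AUTO, matching the stated default.
 * Returns 0 on success and -1 for an unrecognised string. */
int parse_substrate(const char *arg, substrate_t *out)
{
    if (arg == NULL || strcmp(arg, "auto") == 0) {
        *out = SUBSTRATE_AUTO;
        return 0;
    }
    if (strcmp(arg, "cpu") == 0) {
        *out = SUBSTRATE_CPU;
        return 0;
    }
    if (strcmp(arg, "qpu") == 0) {
        *out = SUBSTRATE_QPU;
        return 0;
    }
    return -1; /* bogus value: caller should reject, as test_smoke expects */
}
```

A test binary would call it as `parse_substrate(argc > 4 ? argv[4] : NULL, &sub)` and exit non-zero on -1.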
claude-noether	352373a9be	phase1: add IDCT-layer throughput benchmark (bench_flush_frame)

Establishes a steady-state baseline for the Path C frame-level dispatch architecture. Times daedalus_decoder_flush_frame at a configurable coded resolution with random coefficients, reporting per-frame latency stats and fps.

NOT a ctest: it produces wall-time numbers and doesn't pass/fail. Run manually:

    ./build/bench_flush_frame [width] [height] [iters] [warmup]

Defaults to 1920x1088, 100 iters, 5-frame warmup (excludes shader-pipeline-pool materialisation cost from the timing average).

Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ./build/bench_flush_frame
    bench_flush_frame: 1920x1088 (8160 MBs), 100 iters (5 warmup)
    ctx has_qpu=1
    flush_frame (post-warmup, 95 samples):
      min = 9.699 ms
      median = 9.905 ms
      mean = 10.014 ms
      p99 = 12.011 ms
      max = 12.011 ms
    throughput (steady-state, IDCT only — NO intra/MC/deblock):
      mean = 99.9 fps
      median = 101.0 fps
      target = 30.0 fps (project_30fps_floor_is_fine.md)
      status = MEETS target (with 3.4x headroom for intra/MC/deblock)

Interpretation:

Per-frame work measured:
- CPU partition + flat-pack of 8160 MBs into luma_4x4, luma_8x8, chroma meta+coeffs buffers
- 3 GPU dispatches (luma 4x4, luma 8x8, chroma 4x4) with their respective vkQueueSubmit + vkQueueWaitIdle round-trips
- CPU NV12 interleave (chroma planar → UV)
- calloc/free for scratch_y / coeffs / meta buffers

Doing all of that in ~10 ms means the architecture pays back the Path C design bet: ONE Vulkan submit per dispatch (cycle 8b buffer pool keeps amortised cost low) is the right granularity. The per-block dispatch fail-mode that motivated Path C (~6500 ms/frame from the libavcodec substitution arc) is 600x slower than this.

3.4x headroom from 101 fps → 30 fps target gives a budget of ~23 ms/frame for the remaining decode work (intra prediction wavefront, MC, deblock). Each of those needs to fit inside that budget at steady state for the end-to-end decoder to hit 30 fps at 1080p.

p99 latency 12 ms means even worst-case frames clear the 33-ms deadline (30 fps) easily; tail latency isn't a concern at this stage.

What this number does NOT validate:
- Intra prediction shader dispatch overhead (likely per-anti-diagonal or per-MB-wavefront; dispatch count goes up)
- MC dispatch (per qpel-block; up to several per MB)
- Deblock dispatch (4 edges per MB; per-edge meta entries)
- Real H.264 streams (random coeffs ≠ real residuals; perf shape of memory access is content-independent, but cache pressure may differ at scale).	2026-05-24 22:53:49 +02:00
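The headroom and budget figures quoted in the benchmark commit follow from simple arithmetic on the median latency and the 30 fps target. The sketch below reproduces that arithmetic; the struct and function names are illustrative only and are not taken from bench_flush_frame.

```c
/* Illustrative helper (not part of bench_flush_frame): derive throughput,
 * headroom and remaining per-frame budget from a measured frame latency. */
typedef struct {
    double fps;        /* frames per second at the measured latency */
    double headroom;   /* measured fps divided by target fps */
    double budget_ms;  /* time left per frame before the target deadline */
} frame_budget;

frame_budget frame_budget_from_latency(double frame_ms, double target_fps)
{
    frame_budget b;
    b.fps = 1000.0 / frame_ms;
    b.headroom = b.fps / target_fps;
    b.budget_ms = 1000.0 / target_fps - frame_ms;
    return b;
}
```

With the median of 9.905 ms and a 30 fps target this gives about 101 fps, about 3.4x headroom, and about 23.4 ms left for intra prediction, MC and deblock.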
claude-noether	045553ccaf	phase1: add deployment-scale bit-exact ctest (1080p, 8160 MBs)

The existing 320x240 bit-exact test (300 MBs) is the fast inner-loop gate, but it's small enough that index arithmetic bugs that only surface above 16-bit boundaries would slip through. This adds a second ctest entry that runs the same binary against a full coded 1080p frame (1920x1088, 8160 MBs):
- 4080 MBs at transform_8x8=0 → 65,280 luma 4x4 blocks
- 4080 MBs at transform_8x8=1 → 16,320 luma 8x8 blocks
- 65,280 chroma 4x4 blocks (32,640 Cb + 32,640 Cr)
- 146,880 IDCTs total across 3 separate luma_4x4 + luma_8x8 + chroma dispatches; bit-exact compared against the in-test C reference for each.

No code change to the test binary itself; it already accepted width/height as argv[1..2]. Just a second `add_test` in CMakeLists.txt that invokes it with `1920 1088`.

Coverage rationale:
- dst_off is uint32_t in daedalus_h264_block_meta; at 1920x1088 the max offset is ~2.1 MiB, still well within uint32 range, but the test exercises the largest stride math we'll see in production (per-MB chroma offset = mb_y*8 + cb_plane_size = up to 1.06 MiB).
- flush_frame partitions 8160 MBs by transform mode → exercises the bi4 == 4080*16 and bi8 == 4080*4 accumulators at frame scale.
- Verifies the 1088 coded height handling (the displayed 1080 + 8 cropped rows trap that catches Pi 5 H.264 integrations).

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/3 Test #1: smoke ............................ Passed 0.09 sec
        Start 2: idct_bitexact
    2/3 Test #2: idct_bitexact .................... Passed 0.03 sec
        Start 3: idct_bitexact_1080p
    3/3 Test #3: idct_bitexact_1080p .............. Passed 0.06 sec
    100% tests passed, 0 tests failed out of 3

    $ ./build/test_idct_bitexact 1920 1088
    test_idct_bitexact: 1920x1088 (8160 MBs), seed=0xfeedface5a5a5a5a
    MB mix: 4080 4x4 MBs, 4080 8x8 MBs
    Y bytes total: 2088960
    Y bytes diff: 0 (0.0000%)
    Cb bytes total: 522240 diff: 0 (0.0000%)
    Cr bytes total: 522240 diff: 0 (0.0000%)
    BIT-EXACT PASS (Y + Cb + Cr)

(0.06 s when shader pool warm; ~0.2 s cold via the standalone invocation above. The 1080p run happens after smoke, so the pool is already primed by the time it runs in ctest.)	2026-05-24 22:49:01 +02:00
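The block counts listed in the 1080p commit can be rederived from the coded dimensions. The sketch below assumes 4:2:0 chroma (eight chroma 4x4 blocks per macroblock, four Cb plus four Cr) and a caller-supplied count of transform_8x8 macroblocks; the type and function names are made up for illustration and do not appear in the repository.

```c
#include <stdint.h>

/* Illustrative only: per-frame IDCT block counts for a coded frame. */
typedef struct {
    uint32_t mbs;        /* 16x16 macroblocks in the coded frame */
    uint32_t luma4x4;    /* luma 4x4 blocks (MBs with transform_8x8=0) */
    uint32_t luma8x8;    /* luma 8x8 blocks (MBs with transform_8x8=1) */
    uint32_t chroma4x4;  /* chroma 4x4 blocks, Cb + Cr, assuming 4:2:0 */
} block_counts;

/* coded_w and coded_h must already be multiples of 16. */
block_counts count_blocks(uint32_t coded_w, uint32_t coded_h, uint32_t mbs_8x8)
{
    block_counts c;
    c.mbs = (coded_w / 16u) * (coded_h / 16u);
    c.luma4x4 = (c.mbs - mbs_8x8) * 16u; /* 16 blocks of 4x4 per MB */
    c.luma8x8 = mbs_8x8 * 4u;            /* 4 blocks of 8x8 per MB */
    c.chroma4x4 = c.mbs * 8u;            /* 4 Cb + 4 Cr per MB in 4:2:0 */
    return c;
}
```

For 1920x1088 with 4080 macroblocks in 8x8 mode this yields 8160 MBs, 65,280 + 16,320 + 65,280 blocks, 146,880 in total, matching the figures in the commit message.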
claude-noether	948697ef0d	phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4

Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame end-to-end with random coefficients and compares every output byte against an inline C reference of the H.264 §8.5.12.1 1D butterfly. Closes the validation gap from the previous PR ("dispatch succeeds" becomes "dispatch is bit-exact").

What's tested:
- 320×240 coded frame (300 MBs), enough to cover multiple workgroups of the V3D shader (16 blocks/WG → ≥30 WGs)
- Per-MB → flat-raster block layout consistent with flush_frame
- Random coeffs in [-512, 511] (same range as daedalus-fourier cycle-6 M1 gate)
- Inline C reference: H.264 §8.5.12.1 butterfly with column-major block layout, +32 rounding, >>6, add-to-predicted (=0), clip255; mirrors daedalus-fourier tests/h264_idct4_ref.c

Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0):

    $ ctest --test-dir build --output-on-failure
        Start 1: smoke
    1/2 Test #1: smoke ............................ Passed 0.16 sec
        Start 2: idct_bitexact
    2/2 Test #2: idct_bitexact .................... Passed 0.03 sec
    100% tests passed, 0 tests failed out of 2

Bit-exact PASS first try: daedalus-fourier's V3D IDCT 4x4 shader produces identical pixels to the C reference for all 4800 blocks in the test frame. Validates BOTH the shader correctness AND the frame-batched-dispatch correctness (this is the first time n_blocks > ~30 has been exercised at the recipe-dispatch layer; the substitution arc only ever called with n_blocks=1).

What is NOT tested by this PR (deferred to follow-ons):
- Non-zero predicted pixels: flush_frame zero-initialises scratch_y, so the IDCT-ADD reduces to clip255(IDCT). Real predicted comes from Stage 2a intra prediction.
- Z-scan permutation between FFmpeg's per-MB coeffs layout and our per-MB → flat raster: the test uses its own coefficient generator that already matches our layout, so it doesn't exercise the permutation. The libavcodec-intercept patch is where the permutation lands and gets validated against real H.264 streams.
- Chroma 4×4 IDCT.
- IDCT 8×8 (High profile).

Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled dispatch). Rebase on main after #3 lands; the diff is purely additive (one new test file + 5 lines of CMake).	2026-05-24 22:20:21 +02:00
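For readers unfamiliar with the reference that the bit-exact gate compares against, the following is a minimal sketch of a 4x4 H.264 inverse transform with +32 rounding, >>6, add-to-predicted and clip to [0, 255]. It is not the test's inline reference: it assumes row-major coefficients (the commit describes a column-major layout), and the function names are chosen here for illustration.

```c
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Sketch of a 4x4 inverse transform + add. coef is row-major here, which is
 * an assumption of this example. Relies on arithmetic right shift of
 * negative ints, which mainstream compilers provide. */
void idct4x4_add(uint8_t *dst, int stride, const int16_t coef[16])
{
    int tmp[16];

    /* Horizontal 1D butterfly over each row. */
    for (int i = 0; i < 4; i++) {
        int d0 = coef[4 * i + 0], d1 = coef[4 * i + 1];
        int d2 = coef[4 * i + 2], d3 = coef[4 * i + 3];
        int e0 = d0 + d2, e1 = d0 - d2;
        int e2 = (d1 >> 1) - d3, e3 = d1 + (d3 >> 1);
        tmp[4 * i + 0] = e0 + e3;
        tmp[4 * i + 1] = e1 + e2;
        tmp[4 * i + 2] = e1 - e2;
        tmp[4 * i + 3] = e0 - e3;
    }

    /* Vertical 1D butterfly over each column, then round, add and clip. */
    for (int j = 0; j < 4; j++) {
        int f0 = tmp[j], f1 = tmp[4 + j], f2 = tmp[8 + j], f3 = tmp[12 + j];
        int g0 = f0 + f2, g1 = f0 - f2;
        int g2 = (f1 >> 1) - f3, g3 = f1 + (f3 >> 1);
        int h[4] = { g0 + g3, g1 + g2, g1 - g2, g0 - g3 };
        for (int i = 0; i < 4; i++) {
            int px = dst[i * stride + j] + ((h[i] + 32) >> 6);
            dst[i * stride + j] = clip255(px);
        }
    }
}
```

With a zero-initialised destination, as flush_frame provides for scratch_y, the add step reduces to clip255 of the residual, which is the case the commit says the gate covers.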
claude-noether	08080f062c	scaffold: CMake + API skeleton + smoke test

First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24. Repo skeleton only: no Vulkan pipeline yet, no shaders, no libavcodec intercept. Establishes the build shape so subsequent work has a place to land.

Layout:

  LICENSE
    BSD-2-Clause (matches daedalus-fourier)
  .gitignore
    build/, CMake artefacts, *.spv
  CMakeLists.txt
    top-level: finds daedalus-fourier ≥0.1.0 via pkg-config (per §9.6 decision: find_package, pinned to tagged release; .pc consumed via pkg_check_modules until we ship a CMake config), Vulkan via find_package, builds static lib + smoke test, GNUInstallDirs install
  include/daedalus_decoder.h
    public API surface:
    - daedalus_decoder_{create,destroy,version,has_qpu}
    - daedalus_decoder_set_output_format (NV12 default, RGBA opt-in per §5)
    - daedalus_decoder_append_mb + struct daedalus_decoder_mb_input (matches §3 per-MB descriptor)
    - daedalus_decoder_flush_frame (per-frame submit + wait)
    - daedalus_decoder_export_dmabuf (Vulkan-native VkImage export per §9.4 decision)
    Dimensions are CODED frame size (mod-16), not displayed; caller translates from SPS + crop offsets.
  src/internal.h
    internal mb_desc struct (matches shader std430 layout, to be nailed down once shaders exist) + per-ctx state
  src/daedalus_decoder.c
    stub bodies:
    - create/destroy with proper resource lifecycle
    - append_mb validates + writes CPU staging buffers (no GPU yet)
    - flush_frame returns -2 (not implemented); Phase 1 work
    - export_dmabuf returns -1
    - has_qpu / version diagnostics
  tests/test_smoke.c
    link + lifecycle test: bad dims reject, OOB MB reject, null inputs reject, raster-order enforcement, mid-frame format-change reject, incomplete-frame flush reject. On hosts without V3D7 Vulkan, SKIPs gracefully (returns 0).

Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier 0.1.0):

    $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
    $ cmake --build build
    $ ctest --test-dir build --output-on-failure
    Test #1: smoke ... Passed
    $ ./build/test_smoke
    daedalus-decoder version: 0.0.1
    ctx created: 1920x1088, has_qpu=1
    smoke OK

Note the coded-vs-displayed dims trap: 1080p H.264 has coded height 1088 with 8 rows cropped via SPS frame_cropping_*. Header docstring on daedalus_decoder_create() spells this out so future callers don't hit the multiple-of-16 reject (smoke test caught it during scaffold write).

Next: Phase 1 implementation begins with IDCT 4×4 / 8×8 frame-scaled dispatch (reusing daedalus-fourier shaders per Appendix A), intra prediction wavefront, reconstruct stage, NV12 output via dmabuf export. Smoke test grows from "ctx lifecycle works" to "I-frame-only Baseline decode bit-exact vs FFmpeg reference".	2026-05-24 22:08:46 +02:00
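The coded-versus-displayed trap noted in the scaffold commit comes down to rounding each dimension up to a multiple of 16. A caller-side helper might look like the sketch below. The names are hypothetical, and it assumes progressive (frame-coded) streams; a real caller should derive coded dimensions from the SPS rather than from the display size.

```c
#include <stdint.h>

/* Round a displayed dimension up to the next multiple of 16 (macroblock size).
 * Hypothetical helper: assumes progressive, frame-coded content. */
uint32_t coded_dim(uint32_t displayed)
{
    return (displayed + 15u) & ~15u;
}

/* Rows (or columns) that must be cropped away again after decode. */
uint32_t cropped_samples(uint32_t displayed)
{
    return coded_dim(displayed) - displayed;
}
```

For 1080p this gives a coded height of 1088 with 8 cropped rows, while the width of 1920 is already a multiple of 16 and needs no padding.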

7 Commits