daedalus-decoder

Author	SHA1	Message	Date
marfrit	5fa495964d	Merge pull request 'phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr, NV12 interleave)' (#5) from noether/phase1-stage1-chroma into main Reviewed-on: #5	2026-05-24 20:36:50 +00:00
claude-noether	58848bd162	phase1/stage1: chroma 4x4 IDCT dispatch (Cb+Cr planar scratch, NV12 interleave) Replaces the chroma placeholder (memset 128) with a real frame-scaled 4x4 IDCT dispatch for the Cb and Cr components. Two Vulkan submits + waits per frame now (one luma, one chroma) instead of one + memset. Implementation: - One combined planar scratch buffer (WH/2 bytes) holds Cb then Cr; a single `daedalus_recipe_dispatch_h264_idct4` call processes both components by setting meta[].dst_off accordingly (Cr blocks add cb_plane_size). - Stride = W/2 (chroma row pitch); shared between Cb and Cr since they have identical geometry. - Per-MB coeff layout already had [256..320) for Cb and [320..384) for Cr (4 raster-order 4x4 blocks per component) from the original daedalus_decoder_append_mb design — no header-side changes. - Post-dispatch CPU memcpy loop interleaves Cb[r][c] and Cr[r][c] into NV12 UV at out_uv[r][2c..2c+1]. ~1 MB/frame at 1080p, well off the critical path; a GPU-side interleave shader is a Stage-5 optimisation. - Chroma dispatch is gated on out_uv != NULL so callers that only want luma (e.g. the bit-exact test before this PR) still pay nothing. Test changes: - tests/test_idct_bitexact.c extended with parallel reference IDCT for Cb and Cr planes (W/2 x H/2 each), then deinterleaves NV12 UV back into Cb/Cr for the compare. Random coeffs in [-512, 511] for all 384 per-MB int16 slots (previously only luma was randomised). - tests/test_smoke.c UV expectation flipped from "all 128 placeholder" to "all 0" (real dispatch with zero coeffs). Sentinel 0xcd pre-fill stays — same purpose: catches read-then-write bugs. Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ctest --test-dir build --output-on-failure Start 1: smoke 1/2 Test #1: smoke ............................ Passed 1.27 sec Start 2: idct_bitexact 2/2 Test #2: idct_bitexact .................... 
Passed 0.05 sec 100% tests passed, 0 tests failed out of 2 $ ./build/test_idct_bitexact test_idct_bitexact: 320x240 (300 MBs), seed=0xfeedface5a5a5a5a Y bytes total: 76800 Y bytes diff: 0 (0.0000%) Cb bytes total: 19200 diff: 0 (0.0000%) Cr bytes total: 19200 diff: 0 (0.0000%) BIT-EXACT PASS (Y + Cb + Cr) $ ./build/test_smoke daedalus-decoder version: 0.0.1 ctx created: 1920x1088, has_qpu=1 appended 8160 MBs (120x68) flush_frame rc=0 Y non-zero bytes: 0 / 2088960 UV non-zero bytes: 0 / 1044480 smoke OK (Smoke's 1.27s includes the 1080p frame: 8160 MBs * 16 = 130,560 luma blocks + 8160 * 8 = 65,280 chroma blocks across two dispatches — shader pool warm-up dominates the wall time, not the IDCT work.) What's NOT covered yet (deferred): - Chroma DC / Intra16x16 luma DC 2x2 Hadamard pre-pass. Real H.264 chroma puts the per-block DC coefficient through a Hadamard before it's added to the AC block; we currently treat all chroma blocks as plain 4x4 AC. Will land alongside the libavcodec intercept patch, since CABAC/CAVLC is where the DC vs AC distinction is exposed. - Z-scan permutation for FFmpeg compatibility — only matters at the intercept boundary, not here. - IDCT 8x8 (High profile). Closes the "chroma is a stub" item from PR #3's "what's NOT done" list.	2026-05-24 22:34:42 +02:00
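The post-dispatch interleave described in commit 58848bd162 (planar Cb and Cr copied into NV12 UV at out_uv[r][2c..2c+1]) can be sketched in C as below. This is a minimal illustration, not the project's code: the function name `interleave_nv12_uv`, its parameters, and the separate Cb/Cr pointers are assumptions made for the example; only the byte layout follows the commit text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleave planar Cb and Cr (each cw x ch bytes, row pitch cw) into an
 * NV12 UV plane with a row pitch of uv_stride bytes.
 * Hypothetical helper: names and signature are assumptions for illustration. */
void interleave_nv12_uv(const uint8_t *cb, const uint8_t *cr,
                        size_t cw, size_t ch,
                        uint8_t *out_uv, size_t uv_stride)
{
    assert(cb && cr && out_uv);
    assert(uv_stride >= 2 * cw);   /* each UV row holds cw Cb/Cr pairs */

    for (size_t r = 0; r < ch; r++) {
        const uint8_t *cb_row = cb + r * cw;
        const uint8_t *cr_row = cr + r * cw;
        uint8_t *uv_row = out_uv + r * uv_stride;
        for (size_t c = 0; c < cw; c++) {
            uv_row[2 * c]     = cb_row[c];  /* Cb (U) comes first in NV12 */
            uv_row[2 * c + 1] = cr_row[c];  /* Cr (V) follows */
        }
    }
}
```

With the combined scratch buffer the commit describes, the Cr pointer would presumably be the Cb pointer plus the Cb plane size; bytes past 2 * cw in each output row are left untouched.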
marfrit	41306e48ee	Merge pull request 'phase1/stage1: bit-exact gate for the frame-scaled IDCT 4×4' (#4) from noether/phase1-stage1-bitexact into main Reviewed-on: #4	2026-05-24 20:27:04 +00:00
claude-noether	948697ef0d	phase1/stage1: bit-exact gate for the frame-scaled luma IDCT 4x4 Adds test_idct_bitexact that exercises daedalus_decoder_flush_frame end-to-end with random coefficients and compares every output byte against an inline C reference of the H.264 §8.5.12.1 1D butterfly. Closes the validation gap from the previous PR ("dispatch succeeds" becomes "dispatch is bit-exact"). What's tested: - 320×240 coded frame (300 MBs), enough to cover multiple workgroups of the V3D shader (16 blocks/WG → ≥30 WGs) - Per-MB → flat-raster block layout consistent with flush_frame - Random coeffs in [-512, 511] (same range as daedalus-fourier cycle-6 M1 gate) - Inline C reference: H.264 §8.5.12.1 butterfly with column-major block layout, +32 rounding, >>6, add-to-predicted (=0), clip255 — mirrors daedalus-fourier tests/h264_idct4_ref.c Verified on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0): $ ctest --test-dir build --output-on-failure Start 1: smoke 1/2 Test #1: smoke ............................ Passed 0.16 sec Start 2: idct_bitexact 2/2 Test #2: idct_bitexact .................... Passed 0.03 sec 100% tests passed, 0 tests failed out of 2 Bit-exact PASS first try — daedalus-fourier's V3D IDCT 4x4 shader produces identical pixels to the C reference for all 4800 blocks in the test frame. Validates BOTH the shader correctness AND the frame-batched-dispatch correctness (this is the first time n_blocks > ~30 has been exercised at the recipe-dispatch layer; the substitution arc only ever called with n_blocks=1). What is NOT tested by this PR (deferred to follow-ons): - Non-zero predicted pixels — flush_frame zero-initialises scratch_y, so the IDCT-ADD reduces to clip255(IDCT). Real predicted comes from Stage 2a intra prediction. - Z-scan permutation between FFmpeg's per-MB coeffs layout and our per-MB → flat raster — the test uses its own coefficient generator that already matches our layout, so it doesn't exercise the permutation. 
The libavcodec-intercept patch is where the permutation lands and gets validated against real H.264 streams. - Chroma 4×4 IDCT. - IDCT 8×8 (High profile). Stacked on noether/phase1-stage1-idct (PR #3, the frame-scaled dispatch). Rebase on main after #3 lands; the diff is purely additive (one new test file + 5 lines of CMake).	2026-05-24 22:20:21 +02:00
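For orientation, the reference path that commit 948697ef0d describes (1D butterfly, +32 rounding, >>6, add-to-predicted, clip255) might look roughly like the following. This is a sketch under stated assumptions, not a copy of tests/h264_idct4_ref.c: the function names are invented here, and the column-major indexing `coeffs[4*x + y]` is an interpretation of the commit's "column-major block layout".

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
uint8_t clip255(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* In-place 1-D H.264 4x4 inverse butterfly on d[0], d[step], d[2*step], d[3*step].
 * Relies on arithmetic right shift of negative ints, as mainstream compilers provide. */
void butterfly4(int *d, int step)
{
    int e0 = d[0] + d[2 * step];
    int e1 = d[0] - d[2 * step];
    int e2 = (d[step] >> 1) - d[3 * step];
    int e3 = d[step] + (d[3 * step] >> 1);
    d[0]        = e0 + e3;
    d[step]     = e1 + e2;
    d[2 * step] = e1 - e2;
    d[3 * step] = e0 - e3;
}

/* Hypothetical reference: IDCT one 4x4 block and add it to the predicted pixels in dst. */
void idct4_add_ref(uint8_t *dst, int stride, const int16_t coeffs[16])
{
    int t[16];
    assert(dst && coeffs && stride >= 4);

    for (int i = 0; i < 16; i++) t[i] = coeffs[i];
    /* Assumed layout: t[4*x + y] is column x, row y. */
    for (int y = 0; y < 4; y++) butterfly4(t + y, 4);      /* horizontal pass, per row */
    for (int x = 0; x < 4; x++) butterfly4(t + 4 * x, 1);  /* vertical pass, per column */

    for (int x = 0; x < 4; x++)
        for (int y = 0; y < 4; y++)
            dst[y * stride + x] =
                clip255(dst[y * stride + x] + ((t[4 * x + y] + 32) >> 6));
}
```

Because flush_frame zero-initialises the scratch plane at this stage, the add step reduces to clip255 of the rounded residual, which is the case the test in that commit exercises.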
marfrit	abd94e9db5	Merge pull request 'phase1/stage1: frame-scaled luma IDCT 4×4 — first GPU round-trip' (#3) from noether/phase1-stage1-idct into main Reviewed-on: #3	2026-05-24 20:18:43 +00:00
claude-noether	69b124adf1	phase1/stage1: frame-scaled luma IDCT 4x4 dispatch — first GPU round-trip flush_frame now performs a real GPU dispatch via the daedalus-fourier public API at frame batch granularity, in contrast to the substitution- arc shim that paid Vulkan sync overhead per-block. What's wired: - Build per-frame luma-4x4 meta[] in raster order across all MBs (N_MBs × 16 entries; 130,560 for 1080p) - Repack per-MB coeffs[] (384 int16; first 256 are luma) into a flat block-major coeffs buffer (n_blocks × 16 int16) - Allocate a frame-sized scratch Y plane, zero-initialised — no intra prediction yet so "predicted" = 0 - daedalus_recipe_dispatch_h264_idct4(ctx, scratch_y, stride, coeffs, n_blocks, meta) — ONE call, ONE vkQueueSubmit, ONE vkQueueWaitIdle - Copy result to caller's out_y at requested stride Measured on hertz (Pi 5 / V3D 7.1 / daedalus-fourier 0.1.0 post-pool): $ time ./build/test_smoke daedalus-decoder version: 0.0.1 ctx created: 1920x1088, has_qpu=1 appended 8160 MBs (120x68) flush_frame rc=0 Y non-zero bytes: 0 / 2088960 UV non-128 bytes: 0 / 1044480 smoke OK real 0m0.163s 163ms wall for full 1080p frame including ctx-create (Vulkan init). Per-block dispatch via the substitution arc would have paid 130,560 × ~50us = ~6.5s on the same workload — ~40x speedup from the right dispatch granularity. Smoke validates: - flush_frame succeeds (rc=0) on a complete frame - Zero-coefficient input → zero-pixel Y output (clip255(IDCT(zeros))=0) - UV plane filled with neutral grey 128 (placeholder until chroma dispatch lands) What's deliberately deferred to follow-on sub-PRs: - Intra prediction wavefront (Stage 2a) — predicted=0 means output pixels are residual-only, not a valid frame decode. Sufficient for Vulkan round-trip validation; not bit-exact vs FFmpeg yet. 
- Motion compensation (Stage 2b) for inter MBs - High-profile IDCT 8x8 (Stage 1 extension) - Deblocking filter (Stage 4) - Chroma 4x4 IDCT — needs separate dispatch with chroma stride - Z-scan permutation of per-MB 4x4 block order (currently flat raster; FFmpeg's per-MB coeffs[] uses spec §6.4.3 z-scan). Bit-exact against FFmpeg requires this permutation; deferred to the test-vector PR. - dmabuf export (still memcpy-out) - Stage 5 RGBA opt-in API surface unchanged from the scaffold PR; only the body of flush_frame becomes non-stub. Internal helpers stay file-local. Stacks on noether/repo-scaffold (PR #2). Rebase on main after #2 lands; the diff is purely additive against the scaffold.	2026-05-24 22:15:35 +02:00
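The repack step from commit 69b124adf1 (per-MB coeffs[] of 384 int16, first 256 luma, flattened into a block-major buffer with one meta entry per 4x4 block) could be sketched as follows. Everything named here is hypothetical: `struct block_meta`, `repack_luma`, and the choice of plain raster order for the 16 blocks inside a macroblock are assumptions for illustration; the real meta[] layout belongs to daedalus-fourier and is not shown in the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MB_COEFFS = 384, LUMA_BLOCKS_PER_MB = 16, COEFFS_PER_BLOCK = 16 };

/* Hypothetical per-block metadata: byte offset of the block's top-left pixel. */
struct block_meta { uint32_t dst_off; };

/* Flatten the luma coefficients of every MB (raster order) into `flat` and
 * fill `meta` with destination offsets into a Y plane of pitch `stride`.
 * Returns the number of 4x4 blocks written (mbs_w * mbs_h * 16). */
size_t repack_luma(const int16_t *mb_coeffs, int mbs_w, int mbs_h, size_t stride,
                   int16_t *flat, struct block_meta *meta)
{
    size_t n = 0;
    assert(mb_coeffs && flat && meta && mbs_w > 0 && mbs_h > 0);
    assert(stride >= (size_t)mbs_w * 16);

    for (int mby = 0; mby < mbs_h; mby++) {
        for (int mbx = 0; mbx < mbs_w; mbx++) {
            const int16_t *mb = mb_coeffs + ((size_t)mby * mbs_w + mbx) * MB_COEFFS;
            for (int b = 0; b < LUMA_BLOCKS_PER_MB; b++) {
                int bx = b & 3, by = b >> 2;   /* assumed raster order inside the MB */
                memcpy(flat + n * COEFFS_PER_BLOCK, mb + b * COEFFS_PER_BLOCK,
                       COEFFS_PER_BLOCK * sizeof(int16_t));
                meta[n].dst_off = (uint32_t)(((size_t)mby * 16 + by * 4) * stride
                                             + (size_t)mbx * 16 + bx * 4);
                n++;
            }
        }
    }
    return n;
}
```

For the 1080p case in the commit (120x68 MBs) this yields 8160 * 16 = 130,560 blocks, which is the n_blocks passed to the single dispatch call.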
marfrit	90d7c546bd	Merge pull request 'scaffold: CMake + API skeleton + smoke test' (#2) from noether/repo-scaffold into main Reviewed-on: #2	2026-05-24 20:10:25 +00:00
claude-noether	08080f062c	scaffold: CMake + API skeleton + smoke test First code on daedalus-decoder per the Phase 1 decisions merged 2026-05-24. Repo skeleton only — no Vulkan pipeline yet, no shaders, no libavcodec intercept. Establishes the build shape so subsequent work has a place to land. Layout: LICENSE BSD-2-Clause (matches daedalus-fourier) .gitignore build/, CMake artefacts, .spv CMakeLists.txt top-level — finds daedalus-fourier ≥0.1.0 via pkg-config (per §9.6 decision: find_package, pinned to tagged release; .pc consumed via pkg_check_modules until we ship a CMake config), Vulkan via find_package, builds static lib + smoke test, GNUInstallDirs install include/daedalus_decoder.h public API surface: - daedalus_decoder_{create,destroy, version,has_qpu} - daedalus_decoder_set_output_format (NV12 default, RGBA opt-in per §5) - daedalus_decoder_append_mb + struct daedalus_decoder_mb_input (matches §3 per-MB descriptor) - daedalus_decoder_flush_frame (per-frame submit + wait) - daedalus_decoder_export_dmabuf (Vulkan-native VkImage export per §9.4 decision) Dimensions are CODED frame size (mod-16), not displayed — caller translates from SPS + crop offsets. src/internal.h internal mb_desc struct (matches shader std430 layout, to be nailed down once shaders exist) + per-ctx state src/daedalus_decoder.c stub bodies: - create/destroy with proper resource lifecycle - append_mb validates + writes CPU staging buffers (no GPU yet) - flush_frame returns -2 (not implemented) — Phase 1 work - export_dmabuf returns -1 - has_qpu / version diagnostics tests/test_smoke.c link + lifecycle test: bad dims reject, OOB MB reject, null inputs reject, raster-order enforcement, mid-frame format-change reject, incomplete-frame flush reject. On hosts without V3D7 Vulkan, SKIPs gracefully (returns 0). 
Verified on hertz (Pi 5 / V3D 7.1 / Mesa V3DV via daedalus-fourier 0.1.0): $ cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release $ cmake --build build $ ctest --test-dir build --output-on-failure Test #1: smoke ... Passed $ ./build/test_smoke daedalus-decoder version: 0.0.1 ctx created: 1920x1088, has_qpu=1 smoke OK Note the coded-vs-displayed dims trap: 1080p H.264 has coded height 1088 with 8 rows cropped via SPS frame_cropping_*. Header docstring on daedalus_decoder_create() spells this out so future callers don't hit the multiple-of-16 reject (smoke test caught it during scaffold write). Next: Phase 1 implementation begins — IDCT 4×4 / 8×8 frame-scaled dispatch (reusing daedalus-fourier shaders per Appendix A), intra prediction wavefront, reconstruct stage, NV12 output via dmabuf export. Smoke test grows from "ctx lifecycle works" to "I-frame-only Baseline decode bit-exact vs FFmpeg reference".	2026-05-24 22:08:46 +02:00
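The coded-versus-displayed trap noted in commit 08080f062c comes down to rounding each dimension up to the next multiple of 16. A small sketch of that arithmetic follows; the helper names are hypothetical, and it assumes progressive (frame-only) content where the macroblock grid is 16x16 in both directions.

```c
#include <assert.h>

/* Round a displayed dimension up to the coded (multiple-of-16) dimension.
 * Hypothetical helper; assumes progressive frames. */
int coded_dim(int displayed)
{
    assert(displayed > 0);
    return (displayed + 15) & ~15;
}

/* Rows or columns that the SPS cropping fields must remove again. */
int cropped_amount(int displayed)
{
    return coded_dim(displayed) - displayed;
}
```

For 1080p this gives a coded size of 1920x1088 with 8 cropped rows, and 120 x 68 = 8160 macroblocks, matching the smoke-test output in the log.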
marfrit	4c9f07f082	Merge pull request 'design: §9 open questions → Phase 1 decisions (user confirmed 2026-05-24)' (#1) from noether/design-decisions into main Reviewed-on: #1	2026-05-24 19:58:41 +00:00
claude-noether	7cbf4ce15b	design: §9 open questions → Phase 1 decisions (user confirmed 2026-05-24) All seven questions from the initial design draft decided in the user's 2026-05-24 review: 1. Intra prediction: GPU wavefront in Phase 1, revisit if bottleneck 2. libavcodec intercept: macroblock-level for Phase 1 3. Shader parameterisation: measure both during Phase 2 MC, pick winner 4. DPB allocation: Vulkan-native VkImage with dma_buf export 5. Daemon integration: library link 6. daedalus-fourier dep: CMake find_package, pinned to tagged release 7. Codec scope: H.264 first; HEVC/10-bit/interlaced/FMO/ASO firmly out; VP9 + AV1 deferred to Phase 5+ but NOT firmly out (scope expansion vs the initial draft which had grouped them with HEVC) Section heading renamed "Open questions" → "Phase 1 decisions" with explicit user-confirmed annotations. Each item preserves the original wording for traceability. §8 Phasing extended with a Phase 5+ paragraph clarifying the VP9/AV1 deferral and reaffirming HEVC's firmly-out status. No architecture changes; only decisions captured. Phase 1 implementation can now begin against this baseline.	2026-05-24 21:57:20 +02:00
claude-noether	8a4fb10a7f	design: appendices A (shader reuse audit) + B (libavcodec intercept) + C (risk register) Read-only research done autonomously while push to marfrit/daedalus-decoder is blocked on user perms. All findings appended to DESIGN.md; no new files, no architecture changes. Appendix A — daedalus-fourier shader reuse audit - 2 shaders directly reusable (v3d_h264_idct4, v3d_h264_idct8) just at frame scale instead of n_blocks=1 per call - 2 shaders partial-reuse (v3d_h264deblock + v3d_h264_qpel_mc20) serve as templates for ~20 sibling variants (horizontal/chroma deblock variants, 15 missing qpel positions + 16x16 size + avg) - 5 daedalus-fourier shaders not reusable (VP9/AV1 codec-specific) - 7 brand-new shaders required (iquant, intra prediction modes, chroma MC, reconstruct, optional yuv→rgba) - ~22 H.264 shaders total; estimate 6-10 weeks for the inventory if done in sequence with M1 bit-exact gate methodology Appendix B — libavcodec intercept point - decode_slice() at libavcodec/h264_slice.c:2598 is the loop site - Per-MB sequence: ff_h264_decode_mb_cabac → ff_h264_hl_decode_mb - Intercept replaces ff_h264_hl_decode_mb with a stub that snapshots sl->mb[] (coefficients), MV/ref caches, intra modes, mb_type, QP, non_zero_count_cache into a frame-shaped descriptor SSBO - End-of-slice flush builds + submits the GPU pipeline - CABAC/CAVLC stay in libavcodec (we don't re-implement entropy) - New FFmpeg patch in marfrit-packages, sibling to 0003-0007: 0008-h264-daedalus-decoder-frame-pipeline.patch - daedalus_decoder_active(h) gates the intercept; default OFF = no-op = full coexistence with the kernel-pack substitution arc Appendix C — risk register - 6 risks catalogued: intra wavefront perf, qpel shader explosion, Stage 5 colourspace bugs, Mesa V3DV concurrency, daedalus-fourier pin drift, Phase 4 30fps@1080p target miss - Highest impact: project fails to beat NEON. Acknowledged from project start (§10), explicit pivot language.	
2026-05-23 23:10:39 +02:00
claude-noether	4182b32adf	design: optional Stage 5 NV12 → RGBA conversion User question 2026-05-23: 'Wayland does need a conversion of NV12 to its output format. Could we cram that in?' Yes — trivially. Added Stage 5 to the pipeline doc with: - 5-line per-pixel compute shader (BT.709 limited-range example given; matrix selected from H.264 VUI at runtime) - explicit OPT-IN flag, off by default - rationale for default-off: most consumers (V4L2 stateless, Wayland zwp_linux_dmabuf NV12 passthrough, Firefox/mpv VAAPI paths) want NV12 because compositors convert during composition essentially for free. RGBA8 is 4x the bandwidth of NV12 — not worth burning DMA + electrons when no downstream needs it - colourspace metadata plumbing requirement: SPS vui_parameters (colour_primaries, transfer_characteristics, matrix_coefficients, video_full_range_flag) MUST flow through to the shader; default BT.709 limited-range with warning if VUI absent Updated the new-shader inventory to include v3d_h264_yuv_to_rgba. Total dispatches/frame remains ~190-200; Stage 5 adds one.	2026-05-23 22:46:45 +02:00
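The per-pixel conversion that commit 4182b32adf mentions for Stage 5 (BT.709 limited-range as the default case) can be illustrated on the CPU as below. This is only a C sketch of the arithmetic, not the compute shader itself: the function names are hypothetical, the matrix is fixed to BT.709 limited range rather than selected from VUI, and alpha is simply set to opaque.

```c
#include <assert.h>
#include <stdint.h>

/* Round to nearest and clamp to 0..255. */
uint8_t clamp_round_u8(float v)
{
    if (v <= 0.0f) return 0;
    if (v >= 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert one BT.709 limited-range Y'CbCr sample to 8-bit RGBA.
 * Hypothetical helper; coefficients derive from Kr = 0.2126, Kb = 0.0722. */
void bt709_limited_to_rgba(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgba[4])
{
    assert(rgba);
    float yf = ((float)y - 16.0f) / 219.0f;    /* luma scaled to 0..1 */
    float pb = ((float)cb - 128.0f) / 224.0f;  /* chroma scaled to -0.5..0.5 */
    float pr = ((float)cr - 128.0f) / 224.0f;

    rgba[0] = clamp_round_u8(255.0f * (yf + 1.5748f * pr));
    rgba[1] = clamp_round_u8(255.0f * (yf - 0.1873f * pb - 0.4681f * pr));
    rgba[2] = clamp_round_u8(255.0f * (yf + 1.8556f * pb));
    rgba[3] = 255;                             /* opaque alpha (assumption) */
}
```

Each output pixel takes four bytes, which is the 4x bandwidth figure the commit compares against NV12 when arguing for the opt-in default.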
claude-noether	59885dd868	initial design doc — frame-level GPU H.264 decoder for V3D7 Path C of the 2026-05-23 architecture decision after the daedalus- fourier substitution arc's per-block QPU dispatch was measured to be >600x slower than NEON in production. Root cause: per-block synchronous Vulkan dispatch from inside libavcodec's per-MB loops, paying ~50us of queue-submit/wait round-trip per ~30ns of NEON-equivalent arithmetic. NVDEC and Vulkan Video escape this by dispatching at picture-level. Pi 5 has no dedicated H.264 hardware decode block and Mesa V3DV does not implement VK_KHR_video_decode_h264; this project builds the same shape (one submit per frame, one fence wait per frame, encoded bitstream in, NV12 out) using V3D7 Vulkan compute as the substrate. DESIGN.md covers: - architecture sketch (CPU side keeps entropy decode + descriptors; GPU runs 4-stage compute pipeline per frame) - per-MB descriptor layout (frame-shaped SSBO, ~8160 entries for 1080p) - inter-stage dependencies (vkCmdPipelineBarrier within one command buffer) - intra prediction wavefront (~187 dispatches per frame on diagonals) - libavcodec intercept point (macroblock-level, evolves the substitution shim from "dispatch now" to "append to frame buffer") - shader inventory (existing daedalus-fourier reuse + ~14 new ones) - 4-phase plan, 4-6 months total budget - 7 open questions including DPB allocation, qpel parameterization, daemon integration shape - explicit out-of-scope: VP9 / AV1 / HEVC / 10-bit / interlaced This is design only. No code beyond README.md and DESIGN.md. User review + redirect expected before Phase 1 implementation begins.	2026-05-23 22:44:03 +02:00

13 Commits