daedalus-fourier

Author	SHA1	Message	Date
claude-noether	e01f7bc7c6	h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON) Closes the 4 single-axis quarter-pel positions in one PR. Each is a half-pel lowpass clipped to u8 followed by L2 rounded-average with an integer-aligned source pixel per H.264 §8.4.2.2.1: mc10 ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c] mc30 ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1] mc01 ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c] mc03 ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c] The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer source pixel they average with — the half-pel computation is the same. Putting them in one PR is justified by that uniformity. Scope: - 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU. - 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon. - 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro (collapses ~50 LOC of repetition). - 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro. - 4 recipe wrappers via DEFINE_QPEL_RECIPE macro. - tests/h264_qpel8_quarter_axis_ref.c covers all four via shared hpel_h() / hpel_v() inlines + per-mode L2 average. - Test refactor: generic run_quarter_axis_qpel() harness exercises all 4 positions through a single helper (~50 LOC for 4 tests vs ~200 if each was hand-rolled). Verified on hertz: $ ./build/test_api_h264 | tail -8 H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%) All 4 new positions bit-exact PASS first try. 
Coverage matrix update: put_ mc00 mc10 mc20 mc30 mc01 — ✓ — ✓ mc11 — — ✓ — ← this row mc21 — — — — mc31 — — — — mc02 — — ✓ — ← mc02 + mc22 anchor mc03 — — ✓ — After this PR: 7 of 16 single-axis + diagonal positions done. Remaining 9 are mc00 (integer copy) plus the 8 off-axis quarter-pel combinations (mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33); each off-axis position combines a 2D lowpass intermediate with L2 averaging against a 1D-lowpass output. Next PR scope. Why no QPU shaders: same R-band logic as the prior CPU additions. At ~10 ns per 8x8 NEON block, all 16 qpel positions together would land in ~1.3 ms/frame at 1080p worst case, comfortably inside the 33 ms budget. QPU shader for mc20 already exists (cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a clear perf reason emerges.	2026-05-25 01:29:52 +02:00
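The half-pel-then-average structure of the horizontal pair (mc10/mc30) described in this commit can be sketched in plain C. This is a minimal sketch, not the project's `tests/h264_qpel8_quarter_axis_ref.c`: the names `hpel_h_px`, `rnd_avg_u8` and `qpel8_quarter_h_sketch` are hypothetical, and the caller is assumed to provide 2 columns of padding on the left and 3 on the right of the 8x8 block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 6-tap (1,-5,20,20,-5,1) horizontal half-pel sample between s[0] and s[1],
 * rounded with +16 >> 5 and clipped to 0..255. */
static uint8_t hpel_h_px(const uint8_t *s)
{
    int v = s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1] - 5 * s[2] + s[3] + 16;
    if (v < 0)
        return 0;                 /* clip first, so no negative value is shifted */
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Rounded average of two pixels. */
static uint8_t rnd_avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical helper: int_col_off = 0 gives mc10 ("a"), 1 gives mc30 ("c").
 * dst and src share one stride, as in the commit's single-stride convention. */
static void qpel8_quarter_h_sketch(uint8_t *dst, const uint8_t *src,
                                   ptrdiff_t stride, int int_col_off)
{
    assert(int_col_off == 0 || int_col_off == 1);
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            const uint8_t *s = src + r * stride + c;
            dst[r * stride + c] = rnd_avg_u8(hpel_h_px(s), s[int_col_off]);
        }
    }
}
```

The only difference between the two positions is the integer pixel passed to the average, which is why a single helper with an offset argument covers both; the vertical pair (mc01/mc03) would follow the same shape with row offsets instead of column offsets.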
claude-noether	20a4299c5c	h264: qpel mc22 (2D half-pel, CPU/NEON) Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the most common qpel positions in real H.264 streams — many encoders emit 1/2-1/2 motion vectors as their best-RD choice. Algorithmically distinct from the 1D mc20/mc02 siblings: - Horizontal 6-tap produces 13 rows of int16 intermediate (no per-stage clip/round — full precision retained). - Vertical 6-tap on the intermediate, then +512 >> 10 (the double-shift compensates for both 6-tap scalings) + clip255. The intermediate-precision requirement means the C reference can't just be "call mc20 then mc02" — that would double-clip and produce the wrong result. The 13-row int16 tmp[] buffer is the central invariant. Scope (same pattern as mc02 PR #15): - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper. - Internal: dispatch_h264_qpel_mc22_cpu calling ff_put_h264_qpel8_mc22_neon. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU. - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8] int16 staging buffer; spec-derived shifts and rounding. - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's [-2 .. +10] read window stays in-tile. Verified on hertz: $ ./build/test_api_h264 | tail -5 H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) All 13 H.264 kernels in api_smoke now bit-exact PASS. mc22 being right first try is meaningful — the +512 >> 10 scaling + int16 intermediate sequence has multiple sign/shift/clip pitfalls and any of them would surface on random inputs immediately. 
Coverage matrix update: put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU) → 12 single put_ positions still missing (¼/¾ + HV combos with L2 averaging).	2026-05-25 01:03:14 +02:00
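The two-stage mc22 computation above (unrounded horizontal pass into a 13-row int16 buffer, then a vertical pass with `+512 >> 10`) might look roughly like the following. This is a sketch under stated assumptions, not the contents of `tests/h264_qpel8_mc22_ref.c`: the function names are hypothetical, and the caller is assumed to guarantee the [-2 .. +10] read window in both directions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unrounded, unclipped horizontal 6-tap sum. For u8 input the result lies in
 * [-2550, 10710], so it fits in int16. */
static int16_t tap6_h_raw(const uint8_t *s)
{
    return (int16_t)(s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1] - 5 * s[2] + s[3]);
}

static void qpel8_mc22_sketch(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    int16_t tmp[13][8];   /* horizontal intermediates for source rows -2..+10 */

    assert(stride >= 8);
    for (int r = 0; r < 13; r++)
        for (int c = 0; c < 8; c++)
            tmp[r][c] = tap6_h_raw(src + (r - 2) * stride + c);

    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            /* Output row r uses source rows r-2..r+3, i.e. tmp[r]..tmp[r+5]. */
            int v = tmp[r][c] - 5 * tmp[r + 1][c] + 20 * tmp[r + 2][c]
                  + 20 * tmp[r + 3][c] - 5 * tmp[r + 4][c] + tmp[r + 5][c] + 512;
            if (v < 0)
                v = 0;            /* clip low before shifting */
            else
                v >>= 10;         /* undo both x32 filter gains at once */
            dst[r * stride + c] = (uint8_t)(v > 255 ? 255 : v);
        }
    }
}
```

Keeping the intermediate unclipped is the point of the design: clipping after the horizontal pass would discard overshoot that the vertical pass is meant to see.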
claude-noether	c3301b0c2e	h264: qpel mc02 (vertical half-pel, CPU/NEON) Mirror of cycle 9's mc20 transposed to vertical orientation. Wires up the second qpel half-pel position via the vendored ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical sibling" gap that mc20 left open since cycle 9. Scope: - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper. - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap transpose of mc20 (reads src[(r±N)*stride + c] instead of src[r*stride + c±N]). - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16 cols × 16 rows, random input, bit-exact compare against the C ref. Verified on hertz: $ ./build/test_api_h264 ... H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) All 12 H.264 kernels in the api_smoke now bit-exact PASS. Why CPU-only: same R-band logic as the deblock _h sibling pattern. mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at 1080p — comfortably inside the 33 ms budget. QPU shader is a fast-follow once the V vs H shader work is consolidated (the transpose for the V shader is not mechanical — different SIMD access pattern than the H shader). Coverage matrix update: qpel position put_ status avg_ status ------------- ----------- ----------- mc00 (copy) not wired not wired mc10 (¼-H) not wired not wired mc20 (½-H) ✓ QPU+CPU not wired mc30 (¾-H) not wired not wired mc01 (¼-V) not wired not wired mc02 (½-V) ✓ CPU not wired (this PR) mc03 (¾-V) not wired not wired mc11..mc33 not wired not wired 13 more qpel positions to go for the full put_ matrix. Adding them follows the same template; each is a small contained PR.	2026-05-25 00:47:37 +02:00
claude-noether	9b1c106dc5	h264: deblock bS=4 intra variants (luma + chroma, V + H) Closes the deblock matrix: adds the four bS=4 intra-strength loop filters used at I-MB edges (and other boundaries where H.264 §8.7.2.1 forces boundary strength to 4). After this PR fourier covers all 8 standard 8-bit 4:2:0 deblock combinations: bS<4 bS=4 ----- ----- luma_v ✓ (cycle 8 QPU) ✓ (CPU) luma_h ✓ (CPU, PR #9) ✓ (CPU) chrm_v ✓ (CPU, PR #10) ✓ (CPU) chrm_h ✓ (CPU, PR #10) ✓ (CPU) Scope: - 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15, CH_INTRA=16), all → CPU substrate in the recipe table. - 4 new public dispatch fns + 4 recipe wrappers (defined via two DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the boilerplate tight). - 4 new extern decls for the vendored ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols. - C reference: tests/h264_intra_loop_filter_ref.c covers all four orientations. Algorithm per H.264 §8.7.2.3: Luma: per-side strong/weak filter selector strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2) strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2) Strong updates p0/p1/p2 (and mirror); weak updates p0 only. Chroma: always weak, only p0/q0 updated. - daedalus_h264_deblock_meta is REUSED for intra dispatches; the tc0[] field is ignored (bS=4 hardcodes the strength). Callers can build a single edge list and route by kernel without an extra struct. - Test refactor: an intra_test_spec table + run_intra_test helper drives all four orientations through one harness, keeping the new test surface compact (~50 LOC for 4 kernels vs ~200 if each had its own test_deblock_*_intra fn). Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === ... 
H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) ... All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed. The bit-exact match on first try is meaningful for these kernels: the strong/weak filter selector + per-side asymmetry would have surfaced any sign / shift / rounding mistake immediately. The C reference is now a usable spec checkpoint for the eventual QPU shader work. QPU shader follow-up: not in this PR. The intra path's 3-cell per-side update + strong/weak branch is structurally more complex than the bS<4 path that already has a V shader (v3d_h264deblock.spv). Per the prior R-band logic for deblock, intra edges are < 20% of total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge fits comfortably in the budget.	2026-05-25 00:00:46 +02:00
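The per-side strong/weak selector for bS=4 luma edges described in this commit can be illustrated for a single line of samples. The sketch below follows the filter equations of the H.264 bS=4 luma case; the function name and the flat 8-sample layout are assumptions made for illustration and are not taken from `tests/h264_intra_loop_filter_ref.c`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Filters one line of 8 samples across an edge, in place.
 * Layout (an assumption of this sketch): px[0..3] = p3,p2,p1,p0 and
 * px[4..7] = q0,q1,q2,q3. */
static void luma_intra_line_sketch(uint8_t px[8], int alpha, int beta)
{
    const int p3 = px[0], p2 = px[1], p1 = px[2], p0 = px[3];
    const int q0 = px[4], q1 = px[5], q2 = px[6], q3 = px[7];

    assert(alpha >= 0 && beta >= 0);

    /* Edge gate: leave real image edges untouched. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    /* Shared half of the strong_p / strong_q conditions. */
    const int small_step = abs(p0 - q0) < (alpha >> 2) + 2;

    if (small_step && abs(p2 - p0) < beta) {          /* strong: p0, p1, p2 */
        px[3] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        px[2] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        px[1] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                          /* weak: p0 only */
        px[3] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_step && abs(q2 - q0) < beta) {          /* strong: q0, q1, q2 */
        px[4] = (uint8_t)((p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3);
        px[5] = (uint8_t)((p0 + q0 + q1 + q2 + 2) >> 2);
        px[6] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                          /* weak: q0 only */
        px[4] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

Because each side decides independently, one line can take the strong path on the p side and the weak path on the q side; that asymmetry is what the commit calls out as the main source of potential mistakes.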
claude-noether	a5c47aa51c	h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. 
Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs.	2026-05-24 23:53:09 +02:00
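For the bS<4 chroma filter described above (tC = tc0 + 1, only p0/q0 modified), a one-line sketch in C could read as follows. The function name, the 4-sample layout and the `tc0 >= 0` precondition are assumptions of this sketch; the project's reference lives in `tests/h264_chroma_loop_filter_ref.c` and may be organised differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int clip_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Floor division by 8 that avoids right-shifting a negative value. */
static int floor_div8(int v)
{
    return v >= 0 ? v >> 3 : -((-v + 7) >> 3);
}

/* px[0..3] = p1, p0, q0, q1 (layout is an assumption of this sketch). */
static void chroma_inter_line_sketch(uint8_t px[4], int alpha, int beta, int tc0)
{
    const int p1 = px[0], p0 = px[1], q0 = px[2], q1 = px[3];

    assert(tc0 >= 0);             /* sketch only handles filtered segments */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int tc = tc0 + 1;       /* chroma: no ap/aq side bonus */
    const int delta =
        clip_int(floor_div8((q0 - p0) * 4 + (p1 - q1) + 4), -tc, tc);

    px[1] = (uint8_t)clip_int(p0 + delta, 0, 255);   /* p0' */
    px[2] = (uint8_t)clip_int(q0 - delta, 0, 255);   /* q0' */
    /* p1 and q1 are read but never written. */
}
```

A caller would invoke this once per sample line of a segment, with the segment's own `tc0` value.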
claude-noether	9d5451e0fe	h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol, the bit-exact reference, and the recipe-table entry so daedalus-decoder and other consumers can call the H variant through the same dispatch shape they use for _v. Scope: - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...) + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper. - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry. - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An explicit SUBSTRATE_QPU request on the H dispatch returns -1 (fails fast, no silent CPU degradation). - C reference: tests/h264_h_loop_filter_luma_ref.c — the column-axis transpose of h264_deblock_ref.c. Same per-segment kernel; pix[-4..+3] accesses cols instead of rows*stride. - Test: test_api_h264 grows a test_deblock_h() with 8 tiles (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0; compares NEON dispatch against reference byte-for-byte. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 5 kernels bit-exact PASS. The new H variant joins the suite with 1024 random-input bytes per tile x 8 tiles. 
Why CPU-only for now: the daedalus-decoder downstream needs the H edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per the cycle 8 M3 baseline) a frame's worth at 1080p is ~ 8160 MBs × 4 edges = 32 640 edges = ~200 us — well inside the 30 fps budget. Writing the V3D H-edge shader is a follow-up (would be cycle 8' or similar; the V-edge shader's transpose isn't mechanical because of how the workgroup organisation maps to columns vs rows). Backlog addition (out of scope for this PR): - V3D shader for the H variant (mirror of v3d_h264deblock.spv). - bS=4 intra-strength filter (different algebra; both _v and _h). - Chroma deblock chroma_v/_h (8-cell variants).	2026-05-24 23:28:56 +02:00
claude-noether	79553c6e22	cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage Closes the QPU-default substrate campaign per the 2026-05-23 decree: every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal half-pel luma motion compensation, H.264 §8.4.2.2.1. Shader (src/v3d_h264_qpel_mc20.comp): - local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout that avoids any inter-lane communication — V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns. - Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte. - Single-stride convention matches FFmpeg's H264QpelContext (dst and src share `stride`; src+src_off points at output col 0 with the caller-guaranteed -2/+3 padding). Dispatch wiring (src/daedalus_core.c): - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init. - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), src_max = src_off + 7*stride + 11 (covers the +3-col read footprint on the last row), dst_max = dst_off + 7*stride + 8. 1 block per WG. - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY with the substrate-switch pattern matching the other H.264 kernels. - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU. Verification (hertz, Pi 5, V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 ← flipped H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing src_max in the host wrapper. 
The filter reads src_off + 7*stride + (7 + 3) = +10 at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch. All 9 daedalus-fourier cycles now QPU-by-recipe: cycle 1 VP9 IDCT 8x8 QPU cycle 2 VP9 LPF wd=4 QPU cycle 3 VP9 MC 8h QPU cycle 4 VP9 LPF wd=8 QPU cycle 5 AV1 CDEF 8x8 QPU cycle 6 H.264 IDCT 4x4 QPU cycle 7 H.264 IDCT 8x8 QPU cycle 8 H.264 deblock luma-v QPU cycle 9 H.264 qpel mc20 QPU ← this commit Closes daedalus-fourier task #165. Per the decree memory [QPU is default substrate], the prototype now demonstrates GPU acceleration on every measured kernel.	2026-05-23 21:05:36 +02:00
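The `src_max` sizing rule from this commit (last row, last column, three right-hand taps, plus one to turn an index into a byte count) can be written out as a small helper. The function name is hypothetical; the arithmetic is the one stated in the message.

```c
#include <assert.h>
#include <stddef.h>

/* One-past-the-end byte offset read by one 8x8 mc20 block whose output
 * column 0 sits at src_off. The leftmost read is at src_off - 2. */
static size_t qpel_mc20_src_end(size_t src_off, size_t stride)
{
    const size_t last_row = 7, last_col = 7, right_taps = 3;

    assert(src_off >= 2);   /* caller-guaranteed left padding */
    /* Highest index touched is src_off + 7*stride + 10; +1 makes it a size. */
    return src_off + last_row * stride + last_col + right_taps + 1;
}
```

Sizing the upload to `src_off + 7*stride + 11` bytes therefore covers every read of the last row, which is the fix the commit describes for the 1017/1024 first-iteration result.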
marfrit	a092ee34aa	Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7) from noether/qpu-default-recipe-cycles-5-8 into main Reviewed-on: #7	2026-05-23 18:59:34 +00:00
claude-noether	74687d9def	cycle 7: V3D shader for H.264 IDCT 8x8 Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row pass, column pass), (val + 32) >> 6 rounded + clipped + added to dst. Bit-exact first try on hertz (Pi 5, V3D 7.1): H264_IDCT4 recipe substrate: 2 (QPU) H264_IDCT8 recipe substrate: 2 (QPU) ← flipped H264_DEBLOCK_LV recipe substrate: 2 (QPU) H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact 8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9 (H.264 luma qpel mc20) still CPU — different shape (6-tap MC filter, not a transform) so needs its own shader template; task #165 covers it as a follow-on. Same pattern as cycle 6 commit (`65bd5c3`): adds h264_idct8_pipe field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs, v3d_h264_idct8.spv install rule. Uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands).	2026-05-23 20:09:25 +02:00
claude-noether	65bd5c3fe3	cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch) Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264 IDCT 4x4 + add) was the highest-priority H.264 kernel to flip from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8 (cycle 1) — two-pass butterfly with shared-memory transpose — but at 4x4 scale: 4 lanes per block, 16 blocks per WG. What's added: - src/v3d_h264_idct4.comp: GLSL compute shader implementing the H.264 §8.5.12.1 1D butterfly twice (row pass then column pass), with (val + 32) >> 6 rounding and clip-to-u8 add to dst. Block memory layout is column-major (matches FFmpeg `ff_h264_idct_add_neon` convention). - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv. - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant (n_blocks, dst_stride), 16 blocks per WG dispatch. Matches the existing dispatch_*_qpu patterns; uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands). - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with the same CPU/QPU substrate switch the deblock dispatch uses. - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now that the shader exists. Verification on hertz (Pi 5 + V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and the output is bit-exact against the C reference (which is identical to the NEON .S code by construction — same FFmpeg upstream). 
Remaining cycle-6/7/9 work in task #165: - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per block, fewer blocks per WG) - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC not a transform) This commit lands the cycle-6 piece of task #165.	2026-05-23 20:06:20 +02:00
claude-noether	737e87980d	QPU is default substrate: recipe table + ctx env-var override Per the user decree 2026-05-23 — "what can be done in QPU will be done in QPU" — this lands two coupled changes that flip production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe: 1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for every kernel that has a shipped V3D compute shader: cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged) cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged) cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv) cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged) cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv) cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165) cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165) cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv) cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165) The R-band cost/benefit framework still applies but is now superseded for substrate selection by the decree. Where R stays RED, the cost is in dispatch overhead, which is a fixable engineering issue (tasks 160 buffer-pool, 161 persistent cmdbuf, 162 dmabuf import). 2) daedalus_ctx_create_no_qpu() now honours an env-var override: set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu silently escalates to a full daedalus_ctx_create(). Lets the libavcodec substitution shims in marfrit-packages (which pthread_once a create_no_qpu ctx — see libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths without rebuilding those patches. Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec (separate daedalus-v4l2 follow-up). 
Smoke (hertz, Pi 5, kernel 6.18.29): === test_api_h264 === H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 ← flipped H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path H.264 qpel mc20: 1024/1024 bytes bit-exact === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact === test_api_lpf === all substrates bit-exact wd=4 and wd=8 The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case where the recipe says QPU but the consumer didn't opt in — it falls back to CPU silently, no regression. Closes daedalus-fourier tasks #163, #164. Refs the 2026-05-23 "QPU default substrate" decree.	2026-05-23 19:59:53 +02:00
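The fall-through rule quoted in this commit (recipe says QPU, context has no QPU, so the dispatch runs on CPU) can be sketched as a pure function. The enum and function names are hypothetical; the values 1 = CPU and 2 = QPU match the smoke-test output, while AUTO = 0 is an assumption, as is applying the same fallback to an explicit QPU request.

```c
#include <assert.h>

/* 1 = CPU and 2 = QPU as printed by the smoke test; AUTO = 0 is assumed. */
enum substrate { SUB_AUTO = 0, SUB_CPU = 1, SUB_QPU = 2 };

/* Resolve the substrate a dispatch should actually use.
 * requested:   what the caller asked for (AUTO defers to the recipe table)
 * recipe:      what the recipe table says for this kernel (never AUTO)
 * ctx_has_qpu: nonzero when the context owns a usable QPU runner */
static enum substrate resolve_substrate(enum substrate requested,
                                        enum substrate recipe,
                                        int ctx_has_qpu)
{
    assert(recipe == SUB_CPU || recipe == SUB_QPU);

    enum substrate eff = (requested == SUB_AUTO) ? recipe : requested;
    if (eff == SUB_QPU && !ctx_has_qpu)
        eff = SUB_CPU;      /* silent fallback, no error returned */
    return eff;
}
```

With this shape, flipping a recipe entry from CPU to QPU changes nothing for consumers that created a CPU-only context, which is the "no regression" property the commit relies on.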
claude-noether	98553278dd	v3d_runner: persistent per-pipeline command buffer Phase 2 of the QPU-default substrate campaign — eliminate vkAllocateCommandBuffers from the dispatch hot path. Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in v3d_runner_create_pipeline() and freed in destroy_pipeline(). The five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1) versus the driver-side allocation walk. Pool was already created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is permitted. Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1): before (task 160 pool only): steady-state p50: 76.44 us steady-state mean: 77.95 us after (task 160 pool + task 161 persistent cb): steady-state p50: 54.56 us steady-state mean: 56.00 us -> 28% per-dispatch reduction The remaining ~54 us steady-state is dominated by vkQueueWaitIdle + shader execution + the two memcpy(in/out) on the dst buffer — task 162 (dmabuf import for dst) targets the memcpy half. test_api_idct stays bit-exact across CPU/QPU/AUTO substrates. Refs daedalus-fourier task #161.	2026-05-23 19:56:35 +02:00
claude-noether	0a042a8e95	v3d_runner: buffer pool for QPU dispatch hot path Per the QPU-default substrate decree 2026-05-23: the per-dispatch vkAllocateMemory in dispatch_*_qpu was the biggest single fixable contributor to QPU dispatch overhead. This pools v3d_buffer allocations by power-of-2 size class so the second-and-subsequent dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7 memory-allocation cost per call. API additions (v3d_runner.h): - v3d_runner_acquire_buffer(): pulls from per-bucket freelist; falls through to v3d_runner_create_buffer() on miss. - v3d_runner_release_buffer(): pushes back onto the freelist; the backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in v3d_runner_destroy(). - v3d_runner_pool_total_bytes(): diagnostic watermark. Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall through to non-pooled (vkAllocateMemory) for both ends — pool stays correct, just degenerates to old behaviour for those calls. Migration: daedalus_core.c dispatch_*_qpu paths globally swap create_buffer → acquire_buffer and destroy_buffer → release_buffer. All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef / h264_deblock) now reuse buffers across calls. test_api_idct stays bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz). Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5, 6.18.29+rpt-rpi-2712, V3D 7.1): call 0: 434.89 us (cold — 3x vkAllocateMemory) call 1: 100.06 us (pool hit on all 3 buffers) steady-state: p50: 76.44 us p90: 90.52 us mean: 77.95 us first-call / steady-state ratio: 5.7x The remaining ~76us steady-state is dominated by vkQueueWaitIdle + shader execution + per-call descriptor-set update + command-buffer allocation — addressed in follow-on tasks 161 (persistent cmdbuf) and 162 (dmabuf import for dst, eliminates memcpy in/out). Refs daedalus-fourier task #160.	2026-05-23 19:52:50 +02:00
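The power-of-2 size classes described in this commit (2^8 to 2^23, with oversize requests bypassing the pool) suggest a bucket lookup along these lines. This is only a sketch: the macro and function names are hypothetical, and rounding each request up to the next class is an assumption about how the pool picks a bucket.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN_SHIFT 8    /* smallest class: 256 B */
#define POOL_MAX_SHIFT 23   /* largest class:  8 MiB */

/* Returns the bucket index (0..15) of the smallest class 2^(8+index) that
 * fits `size`, or -1 for oversize requests that should bypass the pool. */
static int pool_bucket_for(size_t size)
{
    assert(size > 0);
    for (int shift = POOL_MIN_SHIFT; shift <= POOL_MAX_SHIFT; shift++) {
        if (size <= ((size_t)1 << shift))
            return shift - POOL_MIN_SHIFT;
    }
    return -1;
}
```

An acquire path built on this would pop from the freelist of the returned bucket when it is non-empty and otherwise allocate a buffer of the full class size, so that the buffer can serve any later request in the same class.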
claude-noether	8fdef27a7d	Phase 8c: H.264 luma qpel mc20 through public API Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly. API additions (header + library): - daedalus_h264_qpel_meta { dst_off, src_off } - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) - daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper) - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9 The 6-tap horizontal half-pel filter signature matches FFmpeg's H264QpelContext convention exactly: dst and src share a single stride and src already points at output column 0 (filter reads cols -2..+3). Single-stride API to make the marfrit-packages FFmpeg shim a straight pointer-pass; no buffer rearrangement. Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Recipe table reflects that — the recipe_dispatch entry is a one-line forward to the CPU path. CMakeLists changes: - h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before; now the public API needs it too) - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024 bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.	2026-05-23 03:25:24 +02:00
marfrit	0a99b16489	Phase 8b: opportunistic QPU paths through public API Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264 deblock) through the public API. These three kernels have recipe substrate = CPU, but per Issue 003 the mixed-kernel helper value is real — the dispatch path must exist so override-mode callers can request QPU on the side. Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO alloc + memcpy + dispatch + readback). Each kernel has its own push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc 2-field shared with lpf). Notable bug caught + fixed in test_api_opportunistic_qpu: the initial dispatch_mc_8h_qpu sized src_max using CPU-side reach (src_off + 3 + 8 + 7*stride), but the QPU shader reads src[ src_off + row*stride + 0..14] for row=0..7. Last block had 3 uninitialized bytes → 99.8% match → 100% after fix. After this commit, the public API surface fully covers cycles 1-8: Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact Cycle 5 (CDEF): CPU recipe; QPU override (untested in this test — bench_v3d_cdef is the authoritative 3-way M1) Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe) Cycle 7 (H.264 IDCT 8x8): CPU only Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact comparison for VP9 MC and H.264 deblock through the API. test_api_idct, test_api_lpf, test_api_h264 still pass. Per the locked Phase 8 architecture (project_phase8_architecture memory): next session opens daedalus-v4l2 sibling repo with Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen FFmpeg parser). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:41 +00:00
marfrit	af8146a2cd	Phase 8a: H.264 kernels through public API Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:03 +00:00
marfrit	eb5cfb34c4	Phase 8: wire LPF wd=4 + wd=8 QPU through public API Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:57:25 +00:00
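The push-constant mismatch described in this commit is easier to see with the two layouts side by side. In the sketch below only `dst_stride_u8`, `blocks_per_row` and their slot positions come from the commit message; the `slot0` field, the `uint32_t` slot width and the struct tags are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* wd=4 shader: dst_stride_u8 lives in slot 1. */
struct lpf4_pc {
    uint32_t slot0;           /* hypothetical: meaning not given in the commit */
    uint32_t dst_stride_u8;   /* slot 1 */
};

/* wd=8 shader: slot 1 is an unused blocks_per_row, stride moves to slot 2. */
struct lpf8_pc {
    uint32_t slot0;           /* hypothetical, as above */
    uint32_t blocks_per_row;  /* slot 1, unused by the shader */
    uint32_t dst_stride_u8;   /* slot 2 */
};

/* Compile-time guards: a shared struct would fail one of these. */
_Static_assert(offsetof(struct lpf4_pc, dst_stride_u8) == 1 * sizeof(uint32_t),
               "wd=4 expects dst_stride_u8 in slot 1");
_Static_assert(offsetof(struct lpf8_pc, dst_stride_u8) == 2 * sizeof(uint32_t),
               "wd=8 expects dst_stride_u8 in slot 2");
```

Pinning each layout with a static assertion turns the kind of silent corruption reported here (1958/2048 bytes matching) into a build failure if the structs are ever merged or reordered.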
marfrit	1085c5699c	Phase 8: wire IDCT QPU dispatch through public API daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:55:55 +00:00
marfrit	760f6a4060	Phase 8 skeleton: public C API + first end-to-end smoke test include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:54:43 +00:00

19 Commits