daedalus-fourier

Author	SHA1	Message	Date
claude-noether	01f782cfaf	h264: qpel avg — 12 remaining variants (closes the matrix) Closes the H.264 8x8 qpel buildout. Adds the remaining 12 avg_ biprediction positions: 4 quarter-axis: avg_mc{10,30,01,03} 8 diagonals : avg_mc{11,12,13,21,23,31,32,33} Each follows the established pattern: same half-pel formula as the put_ sibling, then L2 average with the existing dst contents per H.264 §8.4.2.3.1. Scope: - 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU. - 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon. - 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro. - 12 public dispatches via DEFINE_QPEL_DISPATCH macro. - 12 recipe wrappers via DEFINE_QPEL_RECIPE macro. - 12 header decls via DECLARE_QPEL_AVG macro. - tests/h264_qpel8_avg_rest_ref.c — references via two parametric macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms, DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms. - Test harness extended with a RUN(MC) sub-macro that derives both the ref name and dispatch name from the bare mcXX. (The ref is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is daedalus_recipe_dispatch_h264_qpel_avg_<mc>. Macro had a typo on first try that duplicated "avg_" in the ref name — caught at compile, fixed.) Verified on hertz: $ ./build/test_api_h264 \| tail -12 H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%) All 12 new positions bit-exact PASS first try. Final qpel matrix state: put_: mc00 (none — integer copy) mc01 ✓ mc02 ✓ mc03 ✓ mc10 ✓ mc11 ✓ mc12 ✓ mc13 ✓ mc20 ✓ (QPU+CPU) mc21 ✓ mc22 ✓ mc23 ✓ mc30 ✓ mc31 ✓ mc32 ✓ mc33 ✓ avg_: same 15-of-16 coverage, all CPU. Every B-slice biprediction case the libavcodec intercept can throw at us is now serviceable. QPU shaders remain mc20-only (cycle 9); the other 29 positions are CPU NEON. Whether to write more QPU shaders depends on real perf measurement — at NEON ~10 ns per 8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work, well inside budget.	2026-05-25 08:49:42 +02:00
claude-noether	1113953f97	h264: qpel avg anchors (avg_mc20/02/22, biprediction support) Begins the avg_ qpel buildout for B-slice biprediction. Each avg_ form computes the same half-pel formula as its put_ sibling, then L2-averages the result with the existing dst contents — the caller pre-loads dst with the list0 prediction; the avg_ call adds list1 per H.264 §8.4.2.3.1. Scope (3 anchors, sets the pattern for the remaining 13 avg_ variants): - 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU. - 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon. - 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro (the macro is type-agnostic so it didn't need changes for avg_). - 3 public dispatches via DEFINE_QPEL_DISPATCH macro. - 3 recipe wrappers via DEFINE_QPEL_RECIPE macro. - tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg. - Test harness: run_avg_qpel() seeds dst with random content so the L2 averaging is actually exercised (not just put_-style overwrite that would silently pass). Verified on hertz: $ ./build/test_api_h264 \| tail -3 H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%) All 3 anchors bit-exact PASS first try. Why anchors only in this PR: the avg_ pattern is uniform across all 16 positions (each is just "put_ result + L2 with dst"). Landing the anchors first confirms the macro pattern works for both put_ and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow the same template in a follow-up PR. State of the qpel matrix after this PR: put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper) avg_ : 3 of 16 positions ✓ (mc20, mc02, mc22 anchors) 13 follow-up positions	2026-05-25 08:35:25 +02:00
claude-noether	0894a46114	h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33) Closes the qpel buildout. All 8 remaining diagonal positions land in one PR. Each is the rounded average of two half-pel intermediates per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching the FFmpeg .S reference structure (verified by reading external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758). Decomposition table (the formula for each output cell at (r,c)): mc11 ¼¼ : avg(mc20[r, c], mc02[r, c]) mc12 ¼½ : avg(mc22[r, c], mc02[r, c]) mc13 ¼¾ : avg(mc20[r+1, c], mc02[r, c]) mc21 ½¼ : avg(mc22[r, c], mc20[r, c]) mc23 ½¾ : avg(mc22[r, c], mc20[r+1, c]) mc31 ¾¼ : avg(mc20[r, c], mc02[r, c+1]) mc32 ¾½ : avg(mc22[r, c], mc02[r, c+1]) mc33 ¾¾ : avg(mc20[r+1, c], mc02[r, c+1]) The (r±1, c±1) offsets capture the position-dependent shift that the FFmpeg .S encodes by pre-incrementing x1 (src pointer) before branching into the common mc11/mc21 code paths. Scope (tightly macro-ised): - 8 new kernel enums (MC11..MC33 = 23..30) → CPU. - 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon. - 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro. - 8 public dispatches via DEFINE_QPEL_DISPATCH macro. - 8 recipe wrappers via DEFINE_QPEL_RECIPE macro. - Header decls condensed via a DECLARE_QPEL_DIAG macro that expands to both recipe + dispatch decls per name. - C references via DEFINE_DIAG_REF macro: each ref is a 6-line wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers (the latter being the per-cell version of mc22's 13-row int16 tmp[] computation). - Test wrapper: test_qpel_diag_all() drives all 8 through the existing run_quarter_axis_qpel() harness. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 \| tail -8 H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%) ALL 8 diagonal positions bit-exact PASS first try. Meaningful because the position-dependent (r±1, c±1) source offsets are easy to get wrong by transcription, and any of them would surface on random inputs immediately. After this PR the H.264 qpel 8x8 put_ matrix is complete: mc00 mc01 mc02 mc03 mc10 mc11 mc12 mc13 mc20 mc21 mc22 mc23 mc30 mc31 mc32 mc33 15 of 16 positions exposed through the daedalus API; mc00 is just integer copy and rarely needs a dispatch wrapper (libavcodec sets the function pointer table directly). mc20 retains its QPU shader (cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON. What this does NOT cover (still in backlog): - avg_ variants (the "add" form for biprediction, 16 more positions). Currently the API only exposes put_. - 16x16 qpel (separate function family in FFmpeg; the 8x8 path can be used twice to substitute when 16x16 isn't critical). - QPU shaders for any qpel position other than mc20.	2026-05-25 07:49:12 +02:00
claude-noether	e01f7bc7c6	h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON) Closes the 4 single-axis quarter-pel positions in one PR. Each is a half-pel lowpass clipped to u8 followed by L2 rounded-average with an integer-aligned source pixel per H.264 §8.4.2.2.1: mc10 ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c] mc30 ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1] mc01 ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c] mc03 ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c] The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer source pixel they average with — the half-pel computation is the same. Putting them in one PR is justified by that uniformity. Scope: - 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU. - 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon. - 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro (collapses ~50 LOC of repetition). - 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro. - 4 recipe wrappers via DEFINE_QPEL_RECIPE macro. - tests/h264_qpel8_quarter_axis_ref.c covers all four via shared hpel_h() / hpel_v() inlines + per-mode L2 average. - Test refactor: generic run_quarter_axis_qpel() harness exercises all 4 positions through a single helper (~50 LOC for 4 tests vs ~200 if each was hand-rolled). Verified on hertz: $ ./build/test_api_h264 \| tail -8 H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%) All 4 new positions bit-exact PASS first try. Coverage matrix update: put_ mc00 mc10 mc20 mc30 mc01 — ✓ — ✓ mc11 — — ✓ — ← this row mc21 — — — — mc31 — — — — mc02 — — ✓ — ← mc02 + mc22 anchor mc03 — — ✓ — After this PR: 7 of 16 single-axis + diagonal positions done. Remaining 9 are the off-axis quarter-pel combinations (mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33) — each combines a 2D lowpass intermediate with L2 averaging against a 1D-lowpass output. Next PR scope. Why no QPU shaders: same R-band logic as the prior CPU additions. At ~10 ns per 8x8 NEON block, all 16 qpel positions together would land in ~1.3 ms/frame at 1080p worst case — comfortably inside the 33 ms budget. QPU shader for mc20 already exists (cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a clear perf reason emerges.	2026-05-25 01:29:52 +02:00
claude-noether	20a4299c5c	h264: qpel mc22 (2D half-pel, CPU/NEON) Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the most common qpel positions in real H.264 streams — many encoders emit 1/2-1/2 motion vectors as their best-RD choice. Algorithmically distinct from the 1D mc20/mc02 siblings: - Horizontal 6-tap produces 13 rows of int16 intermediate (no per-stage clip/round — full precision retained). - Vertical 6-tap on the intermediate, then +512 >> 10 (the double-shift compensates for both 6-tap scalings) + clip255. The intermediate-precision requirement means the C reference can't just be "call mc20 then mc02" — that would double-clip and produce the wrong result. The 13-row int16 tmp[] buffer is the central invariant. Scope (same pattern as mc02 PR #15): - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper. - Internal: dispatch_h264_qpel_mc22_cpu calling ff_put_h264_qpel8_mc22_neon. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU. - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8] int16 staging buffer; spec-derived shifts and rounding. - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's [-2 .. +10] read window stays in-tile. Verified on hertz: $ ./build/test_api_h264 \| tail -5 H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) All 13 H.264 kernels in api_smoke now bit-exact PASS. mc22 being right first try is meaningful — the +512 >> 10 scaling + int16 intermediate sequence has multiple sign/shift/clip pitfalls and any of them would surface on random inputs immediately. Coverage matrix update: put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU) → 12 single put_ positions still missing (¼/¾ + HV combos with L2 averaging).	2026-05-25 01:03:14 +02:00
claude-noether	c3301b0c2e	h264: qpel mc02 (vertical half-pel, CPU/NEON) Mirror of cycle 9's mc20 transposed to vertical orientation. Wires up the second qpel half-pel position via the vendored ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical sibling" gap that mc20 left open since cycle 9. Scope: - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper. - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap transpose of mc20 (reads src[(r±N)stride + c] instead of src[rstride + c±N]). - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16×16 cols × 16 rows, random input, bit-exact compare against the C ref. Verified on hertz: $ ./build/test_api_h264 ... H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) All 12 H.264 kernels in the api_smoke now bit-exact PASS. Why CPU-only: same R-band logic as the deblock _h sibling pattern. mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at 1080p — comfortably inside the 33 ms budget. QPU shader is a fast-follow once the V vs H shader work is consolidated (the transpose for the V shader is not mechanical — different SIMD access pattern than the H shader). Coverage matrix update: qpel position put_ status avg_ status ------------- ----------- ----------- mc00 (copy) not wired not wired mc10 (¼-H) not wired not wired mc20 (½-H) ✓ QPU+CPU not wired mc30 (¾-H) not wired not wired mc01 (¼-V) not wired not wired mc02 (½-V) ✓ CPU not wired (this PR) mc03 (¾-V) not wired not wired mc11..mc33 not wired not wired 13 more qpel positions to go for the full put_ matrix. Adding them follows the same template; each is a small contained PR.	2026-05-25 00:47:37 +02:00
claude-noether	9b1c106dc5	h264: deblock bS=4 intra variants (luma + chroma, V + H) Closes the deblock matrix: adds the four bS=4 intra-strength loop filters used at I-MB edges (and other boundaries where H.264 §8.7.2.1 forces boundary strength to 4). After this PR fourier covers all 8 standard 8-bit 4:2:0 deblock combinations: bS<4 bS=4 ----- ----- luma_v ✓ (cycle 8 QPU) ✓ (CPU) luma_h ✓ (CPU, PR #9) ✓ (CPU) chrm_v ✓ (CPU, PR #10) ✓ (CPU) chrm_h ✓ (CPU, PR #10) ✓ (CPU) Scope: - 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15, CH_INTRA=16), all → CPU substrate in the recipe table. - 4 new public dispatch fns + 4 recipe wrappers (defined via two DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the boilerplate tight). - 4 new extern decls for the vendored ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols. - C reference: tests/h264_intra_loop_filter_ref.c covers all four orientations. Algorithm per H.264 §8.7.2.3: Luma: per-side strong/weak filter selector strong_p = (\|p2-p0\| < β) AND (\|p0-q0\| < (α>>2)+2) strong_q = (\|q2-q0\| < β) AND (\|p0-q0\| < (α>>2)+2) Strong updates p0/p1/p2 (and mirror); weak updates p0 only. Chroma: always weak, only p0/q0 updated. - daedalus_h264_deblock_meta is REUSED for intra dispatches; the tc0[] field is ignored (bS=4 hardcodes the strength). Callers can build a single edge list and route by kernel without an extra struct. - Test refactor: an intra_test_spec table + run_intra_test helper drives all four orientations through one harness, keeping the new test surface compact (~50 LOC for 4 kernels vs ~200 if each had its own test_deblock_*_intra fn). Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === ... H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) ... All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed. The bit-exact match on first try is meaningful for these kernels: the strong/weak filter selector + per-side asymmetry would have surfaced any sign / shift / rounding mistake immediately. The C reference is now a usable spec checkpoint for the eventual QPU shader work. QPU shader follow-up: not in this PR. The intra path's 3-cell per-side update + strong/weak branch is structurally more complex than the bS<4 path that already has a V shader (v3d_h264deblock.spv). Per the prior R-band logic for deblock, intra edges are < 20% of total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge fits comfortably in the budget.	2026-05-25 00:00:46 +02:00
claude-noether	a5c47aa51c	h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs.	2026-05-24 23:53:09 +02:00
claude-noether	9d5451e0fe	h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol, the bit-exact reference, and the recipe-table entry so daedalus-decoder and other consumers can call the H variant through the same dispatch shape they use for _v. Scope: - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...) + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper. - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry. - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An explicit SUBSTRATE_QPU request on the H dispatch returns -1 (fails fast, no silent CPU degradation). - C reference: tests/h264_h_loop_filter_luma_ref.c — the column-axis transpose of h264_deblock_ref.c. Same per-segment kernel; pix[-4..+3] accesses cols instead of rowsstride. - Test: test_api_h264 grows a test_deblock_h() with 8 tiles (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0; compares NEON dispatch against reference byte-for-byte. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 5 kernels bit-exact PASS. The new H variant joins the suite with 1024 random-input bytes per tile x 8 tiles. Why CPU-only for now: the daedalus-decoder downstream needs the H edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per the cycle 8 M3 baseline) a frame's worth at 1080p is ~ 8160 MBs 4 edges = 32 640 edges = ~200 us — well inside the 30 fps budget. Writing the V3D H-edge shader is a follow-up (would be cycle 8' or similar; the V-edge shader's transpose isn't mechanical because of how the workgroup organisation maps to columns vs rows). Backlog addition (out of scope for this PR): - V3D shader for the H variant (mirror of v3d_h264deblock.spv). - bS=4 intra-strength filter (different algebra; both _v and _h). - Chroma deblock luma_v/_h (8-cell variants).	2026-05-24 23:28:56 +02:00
claude-noether	79553c6e22	cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage Closes the QPU-default substrate campaign per the 2026-05-23 decree: every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal half-pel luma motion compensation, H.264 §8.4.2.2.1. Shader (src/v3d_h264_qpel_mc20.comp): - local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout that avoids any inter-lane communication — V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns. - Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte. - Single-stride convention matches FFmpeg's H264QpelContext (dst and src share `stride`; src+src_off points at output col 0 with the caller-guaranteed -2/+3 padding). Dispatch wiring (src/daedalus_core.c): - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init. - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), src_max = src_off + 7stride + 11 (covers the +3-col read footprint on the last row), dst_max = dst_off + 7stride + 8. 1 block per WG. - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY with the substrate-switch pattern matching the other H.264 kernels. - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU. Verification (hertz, Pi 5, V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 ← flipped H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing src_max in the host wrapper. The filter reads src_off + 7*stride + (7 + 3) = +10 at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch. All 9 daedalus-fourier cycles now QPU-by-recipe: cycle 1 VP9 IDCT 8x8 QPU cycle 2 VP9 LPF wd=4 QPU cycle 3 VP9 MC 8h QPU cycle 4 VP9 LPF wd=8 QPU cycle 5 AV1 CDEF 8x8 QPU cycle 6 H.264 IDCT 4x4 QPU cycle 7 H.264 IDCT 8x8 QPU cycle 8 H.264 deblock luma-v QPU cycle 9 H.264 qpel mc20 QPU ← this commit Closes daedalus-fourier task #165. Per the decree memory [QPU is default substrate], the prototype now demonstrates GPU acceleration on every measured kernel.	2026-05-23 21:05:36 +02:00
marfrit	a092ee34aa	Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7 ) from noether/qpu-default-recipe-cycles-5-8 into main Reviewed-on: #7	2026-05-23 18:59:34 +00:00
claude-noether	74687d9def	cycle 7: V3D shader for H.264 IDCT 8x8 Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row pass, column pass), (val + 32) >> 6 rounded + clipped + added to dst. Bit-exact first try on hertz (Pi 5, V3D 7.1): H264_IDCT4 recipe substrate: 2 (QPU) H264_IDCT8 recipe substrate: 2 (QPU) ← flipped H264_DEBLOCK_LV recipe substrate: 2 (QPU) H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact 8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9 (H.264 luma qpel mc20) still CPU — different shape (6-tap MC filter, not a transform) so needs its own shader template; task #165 covers it as a follow-on. Same pattern as cycle 6 commit (`65bd5c3`): adds h264_idct8_pipe field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs, v3d_h264_idct8.spv install rule. Uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands).	2026-05-23 20:09:25 +02:00
claude-noether	65bd5c3fe3	cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch) Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264 IDCT 4x4 + add) was the highest-priority H.264 kernel to flip from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8 (cycle 1) — two-pass butterfly with shared-memory transpose — but at 4x4 scale: 4 lanes per block, 16 blocks per WG. What's added: - src/v3d_h264_idct4.comp: GLSL compute shader implementing the H.264 §8.5.12.1 1D butterfly twice (row pass then column pass), with (val + 32) >> 6 rounding and clip-to-u8 add to dst. Block memory layout is column-major (matches FFmpeg `ff_h264_idct_add_neon` convention). - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv. - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant (n_blocks, dst_stride), 16 blocks per WG dispatch. Matches the existing dispatch_*_qpu patterns; uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands). - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with the same CPU/QPU substrate switch the deblock dispatch uses. - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now that the shader exists. Verification on hertz (Pi 5 + V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and the output is bit-exact against the C reference (which is identical to the NEON .S code by construction — same FFmpeg upstream). Remaining cycle-6/7/9 work in task #165: - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per block, fewer blocks per WG) - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC not a transform) This commit lands the cycle-6 piece of task #165.	2026-05-23 20:06:20 +02:00
claude-noether	737e87980d	QPU is default substrate: recipe table + ctx env-var override Per the user decree 2026-05-23 — "what can be done in QPU will be done in QPU" — this lands two coupled changes that flip production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe: 1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for every kernel that has a shipped V3D compute shader: cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged) cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged) cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv) cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged) cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv) cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165) cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165) cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv) cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165) The R-band cost/benefit framework still applies but is now superseded for substrate selection by the decree. Where R stays RED, the cost is in dispatch overhead, which is a fixable engineering issue (tasks 160 buffer-pool, 161 persistent cmdbuf, 162 dmabuf import). 2) daedalus_ctx_create_no_qpu() now honours an env-var override: set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu silently escalates to a full daedalus_ctx_create(). Lets the libavcodec substitution shims in marfrit-packages (which pthread_once a create_no_qpu ctx — see libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths without rebuilding those patches. Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec (separate daedalus-v4l2 follow-up). Smoke (hertz, Pi 5, kernel 6.18.29): === test_api_h264 === H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 ← flipped H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path H.264 qpel mc20: 1024/1024 bytes bit-exact === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact === test_api_lpf === all substrates bit-exact wd=4 and wd=8 The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case where the recipe says QPU but the consumer didn't opt in — it falls back to CPU silently, no regression. Closes daedalus-fourier tasks #163, #164. Refs the 2026-05-23 "QPU default substrate" decree.	2026-05-23 19:59:53 +02:00
claude-noether	98553278dd	v3d_runner: persistent per-pipeline command buffer Phase 2 of the QPU-default substrate campaign — eliminate vkAllocateCommandBuffers from the dispatch hot path. Attaches a VkCommandBuffer to each v3d_pipeline, allocated once in v3d_runner_create_pipeline() and freed in destroy_pipeline(). The five dispatch_*_qpu sites switch from v3d_runner_alloc_cmdbuf() to v3d_runner_pipeline_cmdbuf_reset() — vkResetCommandBuffer is O(1) versus the driver-side allocation walk. Pool was already created with VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT so reset is permitted. Microbench (hertz, Pi 5, kernel 6.18.29, V3D 7.1): before (task 160 pool only): steady-state p50: 76.44 us steady-state mean: 77.95 us after (task 160 pool + task 161 persistent cb): steady-state p50: 54.56 us steady-state mean: 56.00 us -> 28% per-dispatch reduction The remaining ~54 us steady-state is dominated by vkQueueWaitIdle + shader execution + the two memcpy(in/out) on the dst buffer — task 162 (dmabuf import for dst) targets the memcpy half. test_api_idct stays bit-exact across CPU/QPU/AUTO substrates. Refs daedalus-fourier task #161.	2026-05-23 19:56:35 +02:00
claude-noether	0a042a8e95	v3d_runner: buffer pool for QPU dispatch hot path Per the QPU-default substrate decree 2026-05-23: the per-dispatch vkAllocateMemory in dispatch__qpu was the biggest single fixable contributor to QPU dispatch overhead. This pools v3d_buffer allocations by power-of-2 size class so the second-and-subsequent dispatch hits a freelist instead of paying ~10-50us of Mesa-V3D7 memory-allocation cost per call. API additions (v3d_runner.h): - v3d_runner_acquire_buffer(): pulls from per-bucket freelist; falls through to v3d_runner_create_buffer() on miss. - v3d_runner_release_buffer(): pushes back onto the freelist; the backing VkBuffer/VkDeviceMemory only get vkFreeMemory'd in v3d_runner_destroy(). - v3d_runner_pool_total_bytes(): diagnostic watermark. Size classes 2^8..2^23 (256 B to 8 MiB). Oversize requests fall through to non-pooled (vkAllocateMemory) for both ends — pool stays correct, just degenerates to old behaviour for those calls. Migration: daedalus_core.c dispatch__qpu paths globally swap create_buffer → acquire_buffer and destroy_buffer → release_buffer. All five QPU dispatch functions (idct8 / lpf / mc_8h / cdef / h264_deblock) now reuse buffers across calls. test_api_idct stays bit-exact (4096/4096 bytes on CPU/QPU/AUTO substrates on hertz). Microbench (tests/bench_pool_overhead.c) on hertz (Pi 5, 6.18.29+rpt-rpi-2712, V3D 7.1): call 0: 434.89 us (cold — 3x vkAllocateMemory) call 1: 100.06 us (pool hit on all 3 buffers) steady-state: p50: 76.44 us p90: 90.52 us mean: 77.95 us first-call / steady-state ratio: 5.7x The remaining ~76us steady-state is dominated by vkQueueWaitIdle + shader execution + per-call descriptor-set update + command-buffer allocation — addressed in follow-on tasks 161 (persistent cmdbuf) and 162 (dmabuf import for dst, eliminates memcpy in/out). Refs daedalus-fourier task #160.	2026-05-23 19:52:50 +02:00
claude-noether	8fdef27a7d	Phase 8c: H.264 luma qpel mc20 through public API Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly. API additions (header + library): - daedalus_h264_qpel_meta { dst_off, src_off } - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) - daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper) - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9 The 6-tap horizontal half-pel filter signature matches FFmpeg's H264QpelContext convention exactly: dst and src share a single stride and src already points at output column 0 (filter reads cols -2..+3). Single-stride API to make the marfrit-packages FFmpeg shim a straight pointer-pass; no buffer rearrangement. Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Recipe table reflects that — the recipe_dispatch entry is a one-line forward to the CPU path. CMakeLists changes: - h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before; now the public API needs it too) - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024 bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.	2026-05-23 03:25:24 +02:00
marfrit	0a99b16489	Phase 8b: opportunistic QPU paths through public API Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264 deblock) through the public API. These three kernels have recipe substrate = CPU, but per Issue 003 the mixed-kernel helper value is real — the dispatch path must exist so override-mode callers can request QPU on the side. Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO alloc + memcpy + dispatch + readback). Each kernel has its own push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc 2-field shared with lpf). Notable bug caught + fixed in test_api_opportunistic_qpu: the initial dispatch_mc_8h_qpu sized src_max using CPU-side reach (src_off + 3 + 8 + 7stride), but the QPU shader reads src[ src_off + rowstride + 0..14] for row=0..7. Last block had 3 uninitialized bytes → 99.8% match → 100% after fix. After this commit, the public API surface fully covers cycles 1-8: Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact Cycle 5 (CDEF): CPU recipe; QPU override (untested in this test — bench_v3d_cdef is the authoritative 3-way M1) Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe) Cycle 7 (H.264 IDCT 8x8): CPU only Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact comparison for VP9 MC and H.264 deblock through the API. test_api_idct, test_api_lpf, test_api_h264 still pass. Per the locked Phase 8 architecture (project_phase8_architecture memory): next session opens daedalus-v4l2 sibling repo with Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen FFmpeg parser). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:41 +00:00
marfrit	af8146a2cd	Phase 8a: H.264 kernels through public API Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:03 +00:00
marfrit	373f63a910	Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads, no spills). Phase 5 REDs applied: RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write RED-2: bench-enforced m.x >= 4*stride contract M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON). M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14). Lower than prediction; H.264 deblock has 4 early-return paths + 2 conditional writes that hurt V3D branchy execution more than expected. M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15 (neutral). M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s. QPU contribution is essentially unchanged from isolation — the cross-substrate contention is gentle (consistent with Issue 003's V4 finding). Verdict: H.264 deblock = opportunistic QPU helper. Same recipe slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core deblock capacity, available when CPU is busy with other work. Cycles 1-8 deployment recipe complete: Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound) Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON) Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock) Phase 9 lessons added: - Branchy kernels underperform V3D vs straight-line ones - Mixed-kernel helper value scales with isolation M2, not same-kernel M4 - R prediction needs branchiness weight, not just compute density - src/v3d_h264deblock.comp (132 inst QPU shader) - tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification) - tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK - CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock + h264dsp linked into bench_concurrent_mixed - docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe) Next: Phase 8 — V4L2 wrapper / deployment infra. Public API already exposes recipe-default substrate per kernel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:44:21 +00:00
marfrit	eb5cfb34c4	Phase 8: wire LPF wd=4 + wd=8 QPU through public API Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:57:25 +00:00
marfrit	1085c5699c	Phase 8: wire IDCT QPU dispatch through public API daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:55:55 +00:00
marfrit	760f6a4060	Phase 8 skeleton: public C API + first end-to-end smoke test include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:54:43 +00:00
marfrit	5223d3cb3f	Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper Phase 4 plan with 3 Phase-5 REDs applied inline: - meta layout: m.z=tmp_off, m.w=dir - sec_shift clamped to >=0 (NEON uqsub semantics) - directions table as const ivec2[14], not OR-packed Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills). 3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096. M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED). M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative). M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC, QPU 0.42 Mblock/s CDEF helper. CPU side higher than the Issue 003 NEON-fallback proxy suggested - cross-substrate contention is gentler than same-side NEON contention. Verdict: CDEF stays on CPU; QPU dispatch path exists for opportunistic use. Deployment recipe table updated for all 5 cycles. Phase 9 lessons: linear extrapolation across cycles is too pessimistic; CDEF is bandwidth-bound on NEON despite high per-block ns; real-substrate-cross contention < NEON-proxy contention. - src/v3d_cdef.comp: cycle 5 QPU shader - tests/bench_v3d_cdef.c: 3-way M1, M2 bench - tests/bench_concurrent_mixed.c: K_CDEF on both sides - tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp + expanded damping range to exercise the edge case - CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring - docs/k5_cdef_phase4.md updated with Phase 5 review applied - docs/k5_cdef_phase7.md: closure doc with full verdict matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:52:46 +00:00
marfrit	85feba4087	Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: \| \| wd=4 (k2) \| wd=8 (k4) \| \| M3 NEON \| 48.3 \| 52.4 \| \| M2 QPU iso \| 19.6 \| 17.8 \| \| R iso \| 0.41 \| 0.34 \| \| M4 delta \| +6.9% \| +4.1% \| \| 30fps mixed \| 7.2x \| 20.3x \| \| Verdict \| GO QPU \| GO QPU \| NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:56:25 +00:00
marfrit	356e446a49	Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% Third daedalus-fourier kernel — VP9 8-tap regular subpel filter, horizontal direction, 8-wide output. Multiply-heavy by design to stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''. Phase 5''' second-model review delivered cleanly — caught 1 RED bug pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate). Without the review, M1''' would have failed silently on first run with cryptic "high-index source pixels wrong" symptoms. Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks across all 16 mx phases). Phase 5''' filter-LUT prediction materialised exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4 due to register pressure). Performance: M2''' = 1.413 Mblock/s (707.9 ns/block) M3''' = 20.997 Mblock/s (NEON baseline phase3) R''' = 0.067 (RED band — structural mismatch) shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills M4''' concurrent matrix (8s windows): NEON 1-core 14.479 Mblock/s NEON 4-core 15.248 Mblock/s <- baseline (compute-bound, not bandwidth-saturated like cycles 1+2!) QPU only 1.380 Mblock/s MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate) MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3% NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so freeing a CPU core via QPU offload added throughput. Cycle 3 MC is compute-bound (4-core scaling 1.05x of 1-core — near-linear), no free cycles to free. QPU contribution (0.45 Mblock/s in contention) doesn't compensate for losing 1 NEON core delivering ~3.8 Mblock/s. But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation margin). Per project_30fps_floor_is_fine.md, user-facing test never fails — daily YouTube playback works fine on any CPU/QPU split. DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split): IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core) LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core) MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU) Entropy -> CPU (structurally serial) Mixed-substrate deployment, not "QPU does everything". Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode etc. New artifacts: - src/v3d_mc_8h.comp — GLSL kernel - tests/vp9_mc_ref.c — standalone C ref (REGULAR filter embedded; clean transcription) - tests/bench_neon_mc.c — M1'''_c + M3''' bench - tests/bench_v3d_mc.c — M1''' + M2''' bench with contract asserts + 30fps margin display - tests/bench_concurrent_mc.c — M4''' pthread bench - external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored) - external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c (hand-extracted; provides ff_vp9_subpel_filters symbol without dragging in full vp9dsp.c) - docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation Memory updates: project_30fps_floor_is_fine.md (user's 30fps target recalibration), MEMORY.md index updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:51:43 +00:00
marfrit	36eca40ff2	Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification including M4'' concurrent gate. Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons applied successfully). 2 YELLOW findings baked into phase4 §4: - stride >= 4 contract added alongside m.x >= 4 (finding 2) - assert(...) in bench made a MUST not a suggestion (finding 4) - V3D divergence-cost note: don't restructure to always-execute, masked lanes consume clock anyway (finding 3, informational) Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run (65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg, uint8_t SSBO, oob early-return discipline) baked in from start worked as expected. Performance: M2'' = 19.645 Medge/s (50.9 ns/edge) M3'' = 48.285 Medge/s (NEON baseline from phase3) R'' = 0.41 (ORANGE band - doesn't auto-close per cycle-1 calibration adjustment) shaderdb: 160 inst, 4 threads, 0 spills, 21 max-temps — shader is already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1 because there's nothing more to extract from the compiled shape. The 30x gap between theoretical instruction throughput and measured wall-clock is divergence-tax + memory latency, not compile quality. M4'' concurrent matrix on hertz (8s windows): NEON-1 LPF 41.131 Medge/s NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling (per-core 7-9; same bandwidth-saturation as cycle-1 F1) QPU only 14.299 Medge/s MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4 MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: oversubscribed mode hurts for lighter kernels (LPF -5.4% vs cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to "always N-1 NEON cores + QPU, never N + QPU". Phase 9 lessons (in phase7 §"Phase 9 lessons"): 1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations) 2. Phase 5 review pays off every cycle 3. R isolation misleading on bandwidth-saturated hardware 4. Oversubscription tax depends on kernel weight 5. shaderdb 4-threads/0-spills = compute not the bottleneck New artifacts: - src/v3d_lpf_h_4_8.comp — GLSL kernel - tests/bench_v3d_lpf.c — M1'' + M2'' harness with contract asserts + fm/hev pass-rate instrumentation - tests/bench_concurrent_lpf.c — M4'' pthread bench (mirrors bench_concurrent.c) - docs/k2_deblock_phase{4,5,7}.md — plan + review + verification Project verdict: continue. Cycle 3 candidates: MC interpolation (multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different neighborhood shape), or wd=8/wd=16 LPF variants. User to direct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:39:26 +00:00
marfrit	d66f22f333	Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa v3dv compute. Five iterations through a Phase 7→Phase 4' loopback; production kernel is v4. New files: - src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance, V3D device picker, HOST_VISIBLE\|COHERENT SSBOs with mmap, compute pipeline from .spv, enables storageBuffer{8,16}BitAccess) - src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production: 256 invocations/WG, 2 blocks/subgroup (no idle lanes), uint8 dst SSBO (race-free per phase5 finding 5), unrolled writes (no chained ternary), oob-flag pattern (barrier-safe per phase5 finding 7) - tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref - docs/phase7.md — full iteration journey + decision verdict CMakeLists.txt updated to build the new shader, library, and bench when DAEDALUS_BUILD_VULKAN=ON. Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3): ver change R ns/block v1 first-light 0.230 533 v2 kill ternary + 2-blocks-per-sg 0.474 258 v3 per-pass scope oN 0.481 254 (noise) v4 WG 64 -> 256 invocations 0.947 129 v5 packed uint32 coeff reads 0.938 130 (noise, reverted) v4 final N=3 0.918 +/- 0.033 Bit-exactness 100.0000% across all iterations (10000-block sample on 128x128, 32640-block sample on 1080p) against both the C reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON ff_vp9_idct_idct_8x8_add_neon. Key learning over the Phase 5 review's prediction model: the chained ternary was NOT a spill killer on V3D 7.1 (shaderdb showed 0:0 spills:fills even in v1). The actual lever was workgroup-size-driven latency hiding — going from 64 to 256 invocations doubled throughput with the same compiled code (270 inst, 2 threads, 21 max-temps, 0 spills) because the v3dv scheduler had 4x more in-flight work to overlap TMU latency. Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0) by a wide margin, near GREEN boundary. Phase 1 YELLOW rule: add M4 (concurrent CPU+QPU throughput) before honest-close or continue. M4 is the next measurement, not more shader tuning — at R = 0.92 with all 4 A76 cores still 100% free for other work, the question is whether the system aggregate beats pure 4-core NEON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:09:00 +00:00
marfrit	dcbbc77038	Path B pivot + Phase 0-3 closed with first baseline numbers This is a from-scratch initial commit on a fresh .git. The original scaffold commit (`7510b56`) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:30:12 +00:00

29 Commits