daedalus-fourier

Author	SHA1	Message	Date
marfrit	37b75b5813	Merge pull request 'h264: V3D shaders for chroma deblock V + H (4:2:0)' (#29 ) from noether/v3d-shader-h264-deblock-chroma into main Reviewed-on: #29	2026-05-25 16:35:08 +00:00
claude-noether	d8de7754fa	h264: V3D shaders for chroma deblock V + H (4:2:0) Adds the QPU shader pair for chroma_v / chroma_h deblock (non-intra bS<4), siblings of the cycle 8 luma_v shader and PR #28's luma_h. Brings QPU coverage to 4 of the 8 deblock kernels, all non-intra: luma_v ✓ (cycle 8); luma_h ✓ (PR #28); chroma_v ✓ (this PR); chroma_h ✓ (this PR); *_intra: CPU NEON (less common; smaller volume). Per H.264 §8.7.2.4 the chroma kernel is simpler than luma: only p0/q0 are updated (never p1/p2/q1/q2), tC = tc0_seg + 1 (no luma-style ap/aq side bonus), 8 cells per edge (vs luma's 16). Shader: 64 lines vs luma_v's 108, with the same WG geometry (16 edges × 16 lanes, lanes 8..15 of each edge early-return). 4:2:0-only: 4:2:2 chroma_h has a 16-row edge geometry that this shader doesn't address; daedalus_dispatch_h264_deblock_chroma_h is 4:2:0-only by design, and caller-side gating already covers this in the libavcodec substitution arc (marfrit-packages PR #98). Recipe table flips DAEDALUS_KERNEL_H264_DEBLOCK_CV / CH from CPU to QPU. dispatch_h264_deblock_chroma_qpu is factored to share QPU plumbing between V and H (orientation passed as a flag for the dst_max calculation). Verified on hertz: $ ./build/test_api_h264 \| grep "deblock chroma [vh]:" H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) Recipe substrate now reports 2 (QPU) for both CV and CH. Coverage now (kernel: bS<4 QPU / bS=4 intra): luma_v: ✓ cycle 8 / CPU NEON; luma_h: ✓ PR #28 / CPU NEON; chroma_v: ✓ this PR / CPU NEON; chroma_h: ✓ this PR / CPU NEON. Intra (bS=4) variants stay CPU NEON: it is the less common case with a smaller per-frame contribution, and the algorithm is structurally different (no tc0; strong-vs-weak filter quad-tree). Can land as a follow-up PR if perf demands.	2026-05-25 17:10:34 +02:00
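The chroma kernel summarised in the commit above (only p0/q0 updated, tC = tc0 + 1) can be sketched per sample position as follows. This is a minimal scalar sketch of the standard H.264 bS<4 chroma filter, not the shader or the NEON code from the repository: the function name, the `step` parameter, and the "negative tc0 means skip" convention are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp v into [lo, hi]. */
static int clip3(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical helper: filter one chroma sample position across an edge.
 * q0_ptr points at q0; `step` is the distance between successive samples
 * across the edge (the row stride for a vertical filter, 1 for a
 * horizontal one). A negative tc0 is treated as "skip" (assumption). */
static void chroma_deblock_cell(uint8_t *q0_ptr, ptrdiff_t step,
                                int alpha, int beta, int tc0)
{
    int p1 = q0_ptr[-2 * step];
    int p0 = q0_ptr[-1 * step];
    int q0 = q0_ptr[0];
    int q1 = q0_ptr[1 * step];

    if (tc0 < 0)
        return;
    /* Edge-activity gate: leave real image edges untouched. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    int tc = tc0 + 1; /* chroma: no ap/aq bonus, just +1 */
    /* Relies on arithmetic right shift for negative values. */
    int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);

    /* Only p0 and q0 are rewritten; p1 and q1 are read-only. */
    q0_ptr[-1 * step] = (uint8_t)clip3(0, 255, p0 + delta);
    q0_ptr[0]         = (uint8_t)clip3(0, 255, q0 - delta);
}
```

For a row `{100, 100, 108, 108}` with alpha = 20, beta = 10 and tc0 = 1, the unclamped delta is 3, tC limits it to 2, and the row becomes `{100, 102, 106, 108}`.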
marfrit	de9266a6eb	Merge pull request 'h264: V3D shader for deblock_luma_h — first QPU port since cycle 9' (#28 ) from noether/v3d-shader-h264-deblock-luma-h into main Reviewed-on: #28	2026-05-25 15:06:18 +00:00
claude-noether	3db059ffab	h264: V3D shader for deblock_luma_h - first QPU port since cycle 9 Ports cycle 8's v3d_h264deblock.comp (V edge, horizontal across a row) to the H orientation (V edge, horizontal across a column). Same algorithm, transposed access pattern: V variant: lane → column, reads/writes pix[±N*stride] (vertical I/O); H variant: lane → row, reads/writes pix[±N] (horizontal I/O). WG geometry unchanged: 256 invocations, 16 edges/WG, 16 lanes/edge. Lane-in-edge interpretation flips: column-index for V → row-index for H. tc0 segment math unchanged (one tc0 byte per 4 lanes). dst_max calculation flips: V used dst_off + 3*stride + 16 (cols), H uses dst_off + 15*stride + 4 (rows). Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_LH = QPU (was CPU). AUTO dispatch now picks QPU for the H edge as well as the V edge. CPU NEON path stays as the explicit-SUBSTRATE_CPU + has_qpu=0 fallback. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 \| grep luma_h H264_DEBLOCK_LH recipe substrate: 2 (was 1 - flipped to QPU) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) Bit-exact against the C reference (h264_h_loop_filter_luma_ref) on 8 tiles × 8 cols × 16 rows of random input. Same correctness gate as the cycle 8 V shader. CMake plumbing: glslang rule for v3d_h264deblock_h.comp; new SPV added to daedalus_shaders ALL list + install rule. daedalus_ctx gains a parallel h264deblock_h_pipe_ready / h264deblock_h_pipe pair (can't share with V because pipelines bind a specific SPIR-V module at create time). What this changes for the substitution arc: PR #97's 0008-h264-deblock-luma-h substitution patch already plumbed daedalus_recipe_dispatch_h264_deblock_luma_h through libavcodec. That path was NEON-by-recipe; with this PR it becomes QPU-by-recipe (unless the libavcodec ctx is no-QPU per daedalus_ctx_create_no_qpu, in which case it stays NEON, the same shape as cycle 8's V shader). 
Coverage state for H.264 8-bit 4:2:0 deblock kernels (QPU shaders): luma_v: ✓ cycle 8, ✓ now; luma_h: none before, ✓ THIS PR; chroma_v/h: none (CPU NEON; smaller tiles, lower-priority); *_intra (4): none (CPU NEON; less common)	2026-05-25 16:50:41 +02:00
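The workgroup geometry described in the commit above (256 invocations, 16 edges per workgroup, 16 lanes per edge, one tc0 byte per 4 lanes) and the two dst_max formulas can be written out as plain index arithmetic. This is a sketch in C rather than GLSL; the struct, the function names, and the edge-major ordering of invocations within a workgroup are assumptions, since the commit message does not show the shader source.

```c
#include <stddef.h>

/* Hypothetical decomposition of a local invocation index (0..255). */
typedef struct {
    unsigned edge;    /* which of the 16 edges in this workgroup */
    unsigned lane;    /* position along the edge: column (V) or row (H) */
    unsigned tc0_seg; /* which of the 4 tc0 bytes applies to this lane */
} lane_map;

/* Assumes invocations are laid out edge-major: 16 consecutive lanes per edge. */
static lane_map map_invocation(unsigned local_id)
{
    lane_map m;
    m.edge    = local_id / 16;
    m.lane    = local_id % 16;
    m.tc0_seg = m.lane / 4; /* one tc0 byte per 4 lanes */
    return m;
}

/* Upper bound of the destination region, as quoted in the commit message:
 * V uses dst_off + 3*stride + 16, H uses dst_off + 15*stride + 4. */
static size_t dst_max(size_t dst_off, size_t stride, int is_h)
{
    return is_h ? dst_off + 15 * stride + 4
                : dst_off + 3 * stride + 16;
}
```

Only the meaning of `lane` changes between the two orientations; the tc0 segment index is computed identically, which matches the "tc0 segment math unchanged" remark.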
marfrit	2faa849ce2	Merge pull request 'h264: promote remaining intra prediction modes (17) to public API' (#27 ) from noether/h264-intra-pred-rest-api into main Reviewed-on: #27	2026-05-25 13:43:56 +00:00
claude-noether	cb3aef3dac	h264: promote remaining intra prediction modes (17) to public API Follows PR #26 (Intra_4x4 luma) with the same promotion pattern for the rest of the intra prediction primitive set: Intra_16x16 luma (4 modes, PR #13): V/H/DC/Plane; Intra_8x8 chroma (4 modes, PR #14): DC/H/V/Plane (4:2:0); Intra_8x8 luma (9 modes, PRs #21 + #22): High profile, with 1-2-1 pre-filter. 3 file moves via `git mv`, ~17 function renames stripping the `_ref` suffix. Test binaries rewired to link daedalus_core instead of compiling the (now moved) ref files directly. No code change: pure plumbing for substitution-arc consumers. 26 intra prediction modes total now in the public API after this PR. Verified on hertz: test_intra_pred_16x16: 5/5 PASS test_intra_pred_chroma8x8: 5/5 PASS test_intra_pred_8x8_luma: 11/11 PASS All via public symbols (test binaries linked against daedalus_core). Unblocks marfrit-packages substitution arc patch 0014, which wires H264PredContext.pred4x4[], pred16x16[], pred8x8[], pred8x8l[] through daedalus alongside the existing IDCT / deblock / qpel / DC Hadamard substitutions. After 0014 lands, the libavcodec.so built by marfrit-packages will have EVERY hot-path pixel-math kernel of an H.264 8-bit 4:2:0 decode routing through daedalus; the substitution arc is feature-complete for the campaign target (Pi 5 Firefox YouTube playback).	2026-05-25 15:37:44 +02:00
marfrit	31c68d0d0e	Merge pull request 'h264: promote Intra_4x4 luma prediction (9 modes) to public API' (#26 ) from noether/h264-intra-pred-4x4-api into main Reviewed-on: #26	2026-05-25 13:35:56 +00:00
claude-noether	df9e1c9d78	h264: promote Intra_4x4 luma prediction (9 modes) to public API PR #12 added the 9 Intra_4x4 luma intra prediction modes as test-only spec references in tests/. This PR promotes them to public src/ symbols so consumers (the eventual marfrit-packages substitution-arc patch 0014) can link against them. Moved: tests/h264_intra_pred_4x4_ref.c → src/h264_intra_pred_4x4.c Renamed: daedalus_h264_pred_4x4_<mode>_ref → daedalus_h264_pred_4x4_<mode> (9 functions: vertical/horizontal/dc/ddl/ddr/vr/hd/vl/hu) The src/ implementation is byte-for-byte the same code as the test-only ref; this PR is plain plumbing. The test binary now links against daedalus_core to pull in the public symbols (instead of compiling the ref file directly), exercising the path that real consumers will use. Same promotion shape as PR #25 (chroma DC Hadamard). Verified on hertz: $ ./build/test_intra_pred_4x4 Vertical (mode 0) PASS Horizontal (mode 1) PASS DC (mode 2) PASS DiagDownLeft (mode 3) PASS DiagDownRight (mode 4) PASS VerticalRight (mode 5) PASS HorizontalDown (mode 6) PASS VerticalLeft (mode 7) PASS HorizontalUp (mode 8) PASS VR asym (sanity) PASS ALL 10 intra-4x4 mode references PASS $ nm -g build/libdaedalus_core.a \| grep "T daedalus_h264_pred_4x4" (9 symbols exported) Follow-ups (same promotion pattern, can land in parallel): - Intra_16x16 luma (4 modes, PR #13) - Intra_8x8 chroma (4 modes, PR #14) - Intra_8x8 luma (9 modes, PRs #21 + #22) Once all 26 intra modes are in the public API, the marfrit-packages substitution arc can route H264PredContext's pred function pointer tables through daedalus alongside the IDCT / deblock / qpel / DC Hadamard substitutions already in place.	2026-05-25 14:53:37 +02:00
marfrit	b9f9ff2a89	Merge pull request 'h264: expose chroma DC 2x2 Hadamard as public API' (#25 ) from noether/h264-chroma-dc-hadamard-api into main Reviewed-on: #25	2026-05-25 11:35:05 +00:00
claude-noether	1f07f3cd70	h264: expose chroma DC 2x2 Hadamard as public API PR #23 added the Hadamard as a test-only spec reference; this PR promotes it to a public symbol in src/ so consumers (the eventual marfrit-packages substitution-arc patch 0011) can link against it. New: void daedalus_h264_chroma_dc_hadamard_2x2(int16_t c[4]); — operates in-place on 4 int16, no QP-dependent scaling (caller composes that themselves per §8.5.11.2). The src/ implementation is byte-for-byte identical to the test-only ref in tests/h264_chroma_dc_hadamard_ref.c (kept as a separate spec-validation copy). A new "public API parity" test case verifies the two produce identical output for a non-trivial input. Pure CPU primitive — no substrate-dispatch wrapper because the work is 4 adds + 4 subs; the substrate machinery would cost more than the kernel itself. Verified on hertz: $ ./build/test_chroma_dc_hadamard all-uniform 5 PASS col gradient [0,10,0,10] PASS row gradient [0,0,10,10] PASS anti-diagonal [10,0,0,10] PASS asymmetric [1,2,3,4] PASS sign-alternating [-5,5,-5,5] PASS double-Hadamard = 4*orig PASS public API parity vs _ref PASS ALL chroma DC Hadamard tests PASS $ nm -g build/libdaedalus_core.a \| grep chroma_dc_hadamard 0000000000000000 T daedalus_h264_chroma_dc_hadamard_2x2 Unblocks marfrit-packages 0011 (substituting H264DSPContext.chroma_dc_dequant_idct, which composes the Hadamard + qmul scaling).	2026-05-25 13:32:01 +02:00
marfrit	b21b35c74b	Merge pull request 'bench: H.264 primitives NEON CPU baseline (1080p budget projection)' (#24 ) from noether/h264-primitives-bench into main Reviewed-on: #24	2026-05-25 09:51:20 +00:00
claude-noether	ba5bbae8e2	bench: H.264 primitives NEON CPU baseline (1080p budget projection) Adds bench_h264_primitives — a non-ctest binary that times the H.264 pixel-math primitives at their representative per-frame N and projects 1080p frame budgets. Lets us answer "how much of the 33-ms 30fps deadline does the pixel-math layer eat on NEON alone, before the intercept patch adds entropy decode + metadata work." Results on hertz (Pi 5 / 4×Cortex-A76, NEON path): Per-kernel ns/op (CPU NEON): IDCT 4x4 luma 10.78 ns/block IDCT 8x8 luma 29.73 ns/block Deblock luma_v 18.04 ns/edge Deblock luma_h 41.65 ns/edge (H access pattern less SIMD-friendly) qpel mc20 (H half-pel) 25.66 ns/block qpel mc02 (V half-pel) 15.06 ns/block (faster than mc20!) qpel mc22 (HV half-pel) 71.50 ns/block (cascaded H+V, expected) Projected 1080p frame budgets (worst-case, CPU NEON only): IDCT 4x4 (all-4x4 MBs): 1.41 ms (130,560 blocks) IDCT 8x8 (all-8x8 MBs): 0.97 ms ( 32,640 blocks) Deblock luma_v (all MBs): 0.59 ms ( 32,640 edges) Deblock luma_h (all MBs): 1.36 ms ( 32,640 edges) qpel mc22 (all 8x8 blocks): 2.33 ms ( 32,640 blocks) Sum (IDCT 4x4 + deblock luma + MC all-mc22): 5.69 ms 30 fps deadline: 33.33 ms Margin: +27.64 ms What this validates: - The "30fps@1080p is the fine floor" memory note holds with huge headroom on the pixel-math layer alone. 17% of the deadline goes to pixel math (worst case); 83% is available for entropy decode + reference frame management + intra prediction + chroma deblock + chroma IDCT + the libavcodec intercept overhead. - The CPU-vs-QPU substrate finding from earlier (PR #10 on daedalus-decoder showed CPU NEON is 4x faster than QPU for IDCT) is consistent here. All these kernels have CPU-only recipes by default; the data suggests that's the right call for now. The recipe substrate decision can be revisited per-kernel once QPU shaders catch up. - mc22 (2D HV half-pel) is the most expensive single qpel position at ~71 ns/block — 2-7x more than the 1D variants. 
Real B-slice biprediction with two mc22 calls per MB would add ~4.7 ms/frame; still comfortable but worth knowing. What this DOESN'T measure (intentionally — they aren't on the critical path at NEON speeds): - Chroma IDCT (4 cb + 4 cr 4x4 per MB). At similar ns/block to luma, that's ~0.7 ms/frame. - Chroma deblock (smaller tile, simpler kernel — sub-ms). - Intra prediction (per-block, ~50 ops at NEON, but serialized in z-scan order so cache-friendly; ~0.5 ms/frame estimate). - bS=4 intra deblock variants — different algorithm, similar cost to bS<4. - chroma DC Hadamard — trivial. Adding all of those in the worst case would maybe double the 5.69 ms number to ~12 ms. Still leaves 20+ ms for entropy decode + metadata work in the intercept patch.	2026-05-25 11:26:11 +02:00
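The budget projection in the commit above is per-kernel ns/op multiplied by a per-frame call count. The sketch below reproduces that arithmetic using the ns/op figures quoted in the message. The function names are hypothetical, and the per-macroblock counts (16 4x4 blocks, 4 luma edges per direction, 4 8x8 blocks) are inferred from the quoted totals of 130,560 and 32,640 rather than taken from the benchmark source.

```c
/* Macroblock count, rounding each dimension up to a multiple of 16. */
static long mb_count(int width, int height)
{
    return (long)((width + 15) / 16) * ((height + 15) / 16);
}

/* Per-frame cost in milliseconds for `ops` calls at `ns_per_op` each. */
static double frame_ms(long ops, double ns_per_op)
{
    return (double)ops * ns_per_op / 1e6;
}

/* Worst-case sum from the commit message:
 * IDCT 4x4 + deblock luma V + deblock luma H + qpel mc22. */
static double worst_case_sum_ms(int width, int height)
{
    long mbs = mb_count(width, height);
    double idct4 = frame_ms(mbs * 16, 10.78); /* 16 4x4 blocks per MB */
    double dbk_v = frame_ms(mbs * 4, 18.04);  /* 4 edges per MB (inferred) */
    double dbk_h = frame_ms(mbs * 4, 41.65);
    double mc22  = frame_ms(mbs * 4, 71.50);  /* 4 8x8 blocks per MB */
    return idct4 + dbk_v + dbk_h + mc22;
}
```

With 1920x1080 rounded up to 120 × 68 = 8,160 macroblocks, the block count comes to 130,560 and the sum lands at roughly 5.69 ms, matching the figures in the message.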
marfrit	eef7f034b0	Merge pull request 'h264: chroma DC 2x2 Hadamard pre-pass primitive' (#23 ) from noether/h264-chroma-dc-hadamard into main Reviewed-on: #23	2026-05-25 09:23:05 +00:00
claude-noether	854bdeda20	h264: chroma DC 2x2 Hadamard pre-pass primitive Adds the H.264 §8.5.11.1 chroma DC Hadamard transform. In 4:2:0 chroma, the four DC coefficients (one from each chroma 4x4 AC block within an MB) go through a 2x2 Hadamard before quant-scaling and before being added back to each block's [0,0] coefficient prior to the 4x4 AC IDCT. This PR ships the pure Hadamard transform: f[0,0] = c[0,0] + c[0,1] + c[1,0] + c[1,1]; f[0,1] = c[0,0] - c[0,1] + c[1,0] - c[1,1]; f[1,0] = c[0,0] + c[0,1] - c[1,0] - c[1,1]; f[1,1] = c[0,0] - c[0,1] - c[1,0] + c[1,1]; implemented as the 2-stage row+col butterfly (1:1 with the NEON SIMD shape upstream). Operates in-place on int16[4]. What this does NOT do (deferred to caller-side composition): - QP-dependent scaling per §8.5.11.2. The scale depends on QP_C (with chroma_qp_offset adjustment), so the formula has branches (>=6 vs <6) and looks up LevelScale4x4 table values. The libavcodec intercept patch composes Hadamard + scale + shift itself since the scale shape varies by codec-level context (slice header chroma_qp_offset, PPS chroma_qp_offset, second_chroma_qp_offset for the chroma_qp_index_offset). - A separate inverse transform (the transform applied at decode time is the same Hadamard as the forward direction up to scaling, although the spec conceptually distinguishes them in §8.5.11; we expose only the matrix). Test design (tests/test_chroma_dc_hadamard.c): 7 cases, all spec-derived hand-computations: - all-uniform 5 → [20, 0, 0, 0] - col gradient [0,10,0,10] → [20, -20, 0, 0] - row gradient [0,0,10,10] → [20, 0, -20, 0] - anti-diagonal [10,0,0,10] → [20, 0, 0, 20] - asymmetric [1,2,3,4] → [10, -2, -4, 0] - sign-alternating [-5,5,-5,5] → [0, -20, 0, 0] - double-Hadamard invariant: H·H = 4·I, so applying twice gives [4*c[0], 4*c[1], 4*c[2], 4*c[3]] for any input. The double-Hadamard test is the strongest correctness gate: any single sign error in the butterfly would break the H·H = 4·I algebraic property, surfacing immediately. All 7 PASS first try. 
Verified on hertz: $ ./build/test_chroma_dc_hadamard all-uniform 5 PASS col gradient [0,10,0,10] PASS row gradient [0,0,10,10] PASS anti-diagonal [10,0,0,10] PASS asymmetric [1,2,3,4] PASS sign-alternating [-5,5,-5,5] PASS double-Hadamard = 4*orig PASS ALL chroma DC Hadamard tests PASS With this primitive the H.264 8-bit 4:2:0 pixel-math primitive matrix is complete in fourier: - IDCT 4x4 (luma + chroma) ✓ - IDCT 8x8 (luma, High profile) ✓ - Chroma DC Hadamard 2x2 ✓ (this PR) - Deblock (8 variants) ✓ - Intra prediction (26 modes) ✓ - MC qpel (30 dispatches) ✓ What remains for the libavcodec intercept patch: CABAC/CAVLC entropy decode, SPS/PPS parsing, slice header parsing, MB type / QP / CBP / intra mode prediction. All of that lives at the intercept layer (it's spec-derived from the bitstream syntax, not pixel-math); the intercept patch will call into these fourier primitives once the metadata is decoded.	2026-05-25 11:18:59 +02:00
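The four equations in the commit above reduce to a two-stage butterfly. The sketch below is one way to write it, assuming the four coefficients are stored row-major as {c[0,0], c[0,1], c[1,0], c[1,1]}; the function name is hypothetical and this is not the repository's implementation.

```c
#include <stdint.h>

/* In-place 2x2 Hadamard on four chroma DC coefficients.
 * Storage order assumed: c[0]=c00, c[1]=c01, c[2]=c10, c[3]=c11. */
static void chroma_dc_hadamard_2x2_sketch(int16_t c[4])
{
    /* Stage 1: butterflies within each row. */
    int a = c[0] + c[1]; /* top row sum */
    int b = c[0] - c[1]; /* top row difference */
    int d = c[2] + c[3]; /* bottom row sum */
    int e = c[2] - c[3]; /* bottom row difference */

    /* Stage 2: butterflies across the two rows. */
    c[0] = (int16_t)(a + d); /* c00 + c01 + c10 + c11 */
    c[1] = (int16_t)(b + e); /* c00 - c01 + c10 - c11 */
    c[2] = (int16_t)(a - d); /* c00 + c01 - c10 - c11 */
    c[3] = (int16_t)(b - e); /* c00 - c01 - c10 + c11 */
}
```

Applying it to [1,2,3,4] gives [10,-2,-4,0], and applying it a second time gives [4,8,12,16], which is the H·H = 4·I property the test suite relies on.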
marfrit	17d672ebef	Merge pull request 'h264: Intra_8x8 luma — 6 directional modes (DDL/DDR/VR/HD/VL/HU)' (#22 ) from noether/h264-intra-pred-8x8-directional into main Reviewed-on: #22	2026-05-25 09:16:19 +00:00
claude-noether	5565cc2bef	h264: Intra_8x8 luma — 6 directional modes (DDL/DDR/VR/HD/VL/HU) Closes the H.264 8-bit 4:2:0 intra-prediction primitive matrix. Adds the 6 directional Intra_8x8 luma modes per H.264 §8.3.2.1.5.. §8.3.2.1.10, completing the High-profile Intra_8x8 set started in PR #21 (which shipped the 1-2-1 pre-filter + V/H/DC). Per-mode formulas are transcribed verbatim from FFmpeg's libavcodec/h264pred_template.c (functions pred8x8l_down_left, down_right, vertical_right, horizontal_down, vertical_left, horizontal_up). Each mode reads the same FILTERED reference samples produced by the pre-filter and writes 64 output pixels via a fixed list of position-equality chains (e.g. for DDL, SRC(0,7)=SRC(1,6)=SRC(2,5)=...=SRC(7,0)= some shared 3-tap formula). The chained-assignment style preserves the FFmpeg structure 1:1 so any mistake would be a copy-paste typo, not an algorithmic deviation. Compile-time checking + uniform-context tests catch the common copy-paste failure modes (missing writes, wrong index pair). Scope: - 6 new ref functions: ddl/ddr/vr/hd/vl/hu_ref. - Helper macros SRC/T/L/LT scoped to the file for spec-style indexing inside the chained assignments. - 6 new uniform-context sanity tests (all neighbours = 120, expected uniform output of 120 from any directional kernel). 
Verified on hertz: $ ./build/test_intra_pred_8x8_luma Vertical (mode 0, uniform top) PASS Horizontal (mode 1, uniform left) PASS DC (mode 2, uniform) PASS Vertical (mode 0, gradient) PASS (filtered gradient) Horizontal (mode 1, gradient) PASS (filtered gradient) DDL (mode 3, uniform) PASS DDR (mode 4, uniform) PASS VR (mode 5, uniform) PASS HD (mode 6, uniform) PASS VL (mode 7, uniform) PASS HU (mode 8, uniform) PASS ALL Intra_8x8 luma PASS (9 modes) Uniform-context tests verify structural correctness (every output position is written by some formula); arithmetic correctness on non-uniform inputs comes from FFmpeg's spec-derived C reference (which is validated against H.264 conformance bitstreams upstream). The libavcodec intercept patch will exercise these on real streams. Combined intra-prediction primitive coverage: Intra_4x4 luma ✓ (9 modes, PR #12) Intra_16x16 luma ✓ (4 modes, PR #13) Intra_8x8 chroma ✓ (4 modes, PR #14) Intra_8x8 luma ✓ (9 modes, PRs #21 + this one) 26 intra-prediction modes total, all bit-exact gated. Every H.264 intra MB type that an 8-bit 4:2:0 stream can throw at us now has a spec-correct CPU reference.	2026-05-25 09:56:45 +02:00
marfrit	18ca708f87	Merge pull request 'h264: Intra_8x8 luma (High profile) — pre-filter + 3 modes (V/H/DC)' (#21 ) from noether/h264-intra-pred-8x8-luma into main Reviewed-on: #21	2026-05-25 07:51:51 +00:00
claude-noether	8bc6d27ea7	h264: Intra_8x8 luma prediction (High profile) - pre-filter + 3 modes Adds the High-profile Intra_8x8 luma primitive set. Per H.264 §8.3.2.1, this is distinct from Intra_4x4 in two ways: 1. REFERENCE SAMPLE PRE-FILTER (§8.3.2.1.1). The 25 raw neighbour samples are smoothed with a 1-2-1 filter BEFORE prediction. Spec-defined boundary handling at corners and the right edge: - top-left filt'd: (top[0] + 2*tl + left[0] + 2) >> 2 - top[0] filt'd: (tl + 2*t[0] + t[1] + 2) >> 2 - top[i] for 1..14: (t[i-1] + 2*t[i] + t[i+1] + 2) >> 2 - top[15] filt'd: (t[14] + 3*t[15] + 2) >> 2 ← 3× boundary - left analogous, with l[7] using 3× boundary. 2. SCALE. All 9 prediction modes operate at 8x8 on the filtered samples (Intra_4x4 is 4x4 on raw samples). This PR ships the pre-filter + the 3 simple modes (V, H, DC): - Mode 0 Vertical (§8.3.2.1.2): pred[r,c] = filt_top[c] - Mode 1 Horizontal (§8.3.2.1.3): pred[r,c] = filt_left[r] - Mode 2 DC (§8.3.2.1.4): ((sum_filt_top[0..7] + sum_filt_left[0..7] + 8) >> 4) broadcast The 6 directional modes (DDL, DDR, VR, HD, VL, HU at 8x8 per §8.3.2.1.5..§8.3.2.1.10) follow in a separate PR. They use the same filtered samples; only the per-cell formula differs. Test design (tests/test_intra_pred_8x8_luma.c): - 3 uniform-context tests, one per mode (sanity). - 2 gradient tests that exercise the pre-filter's interior + boundary cases: * Vertical with top = 0..15: spec arithmetic gives filtered top[c] = c for c in 0..7 (gradient input → identity through the 1-2-1 filter on the interior; boundaries arithmetically verify too). Test expects pred[r,c] = c. * Horizontal with left = 0..7: same arithmetic chain on the left col. Test expects pred[r,c] = r. 
Verified on hertz: $ ./build/test_intra_pred_8x8_luma Vertical (mode 0, uniform top) PASS Horizontal (mode 1, uniform left) PASS DC (mode 2, uniform) PASS Vertical (mode 0, gradient) PASS (filtered gradient) Horizontal (mode 1, gradient) PASS (filtered gradient) ALL Intra_8x8 luma PASS (3 modes — V, H, DC) The pre-filter being right first try is meaningful — the boundary samples use a 3× weight rather than 2× (filt[top 15] = (t[14] + 3*t[15] + 2) >> 2), which is easy to forget when transcribing. The gradient test would have surfaced any boundary mistake immediately. Combined intra-prediction primitive coverage after this PR: Intra_4x4 luma ✓ (9 modes, PR #12) Intra_16x16 luma ✓ (4 modes, PR #13) Intra_8x8 chroma ✓ (4 modes, PR #14) Intra_8x8 luma △ (3 of 9 modes — V, H, DC ✓; DDL/DDR/VR/HD/VL/HU pending) The 6 remaining Intra_8x8 luma directional modes are spec-mechanical follow-ups; each is a ~30-line formula per §8.3.2.1.5+.	2026-05-25 09:35:49 +02:00
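The top-row half of the 1-2-1 pre-filter described in the commit above can be sketched as below, including the 3× weight on the last sample. The function name is hypothetical, and the sketch assumes all 16 top neighbours and the top-left sample are available; the availability fallbacks that the spec defines are left out.

```c
#include <stdint.h>

/* 1-2-1 smoothing of the 16 top reference samples.
 * t[0..15] are the raw top neighbours, tl is the raw top-left sample.
 * Assumes every neighbour is available (no substitution rules applied). */
static void filter_top_sketch(const uint8_t t[16], uint8_t tl, uint8_t out[16])
{
    /* Left boundary: the top-left sample stands in for t[-1]. */
    out[0] = (uint8_t)((tl + 2 * t[0] + t[1] + 2) >> 2);

    /* Interior: plain 1-2-1 with rounding. */
    for (int i = 1; i < 15; i++)
        out[i] = (uint8_t)((t[i - 1] + 2 * t[i] + t[i + 1] + 2) >> 2);

    /* Right boundary: no t[16], so t[15] carries weight 3 instead of 2. */
    out[15] = (uint8_t)((t[14] + 3 * t[15] + 2) >> 2);
}
```

With t = 0..15 and tl = 0 every output equals its input, which is the identity behaviour the gradient test checks; a uniform input of 120 also passes through unchanged, including at both boundaries.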
marfrit	1ee8b1c0ab	Merge pull request 'h264: qpel avg — 12 remaining variants (closes the matrix)' (#20 ) from noether/h264-qpel-avg-rest into main Reviewed-on: #20	2026-05-25 07:33:02 +00:00
claude-noether	01f782cfaf	h264: qpel avg — 12 remaining variants (closes the matrix) Closes the H.264 8x8 qpel buildout. Adds the remaining 12 avg_ biprediction positions: 4 quarter-axis: avg_mc{10,30,01,03} 8 diagonals : avg_mc{11,12,13,21,23,31,32,33} Each follows the established pattern: same half-pel formula as the put_ sibling, then L2 average with the existing dst contents per H.264 §8.4.2.3.1. Scope: - 12 new kernel enums (MC10..MC33 avg_ = 34..45) → CPU. - 12 NEON externs for the vendored ff_avg_h264_qpel8_mc*_neon. - 12 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro. - 12 public dispatches via DEFINE_QPEL_DISPATCH macro. - 12 recipe wrappers via DEFINE_QPEL_RECIPE macro. - 12 header decls via DECLARE_QPEL_AVG macro. - tests/h264_qpel8_avg_rest_ref.c — references via two parametric macros: DEFINE_AVG_QUARTER for the 4 ¼-pel L2 forms, DEFINE_AVG_DIAG for the 8 two-half-pel-avg forms. - Test harness extended with a RUN(MC) sub-macro that derives both the ref name and dispatch name from the bare mcXX. (The ref is daedalus_avg_h264_qpel8_<mc>_ref; the dispatch is daedalus_recipe_dispatch_h264_qpel_avg_<mc>. Macro had a typo on first try that duplicated "avg_" in the ref name — caught at compile, fixed.) 
Verified on hertz: $ ./build/test_api_h264 \| tail -12 H.264 qpel avg_mc10: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc30: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc01: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc03: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc11: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc12: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc13: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc21: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc23: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc31: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc32: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc33: 2048/2048 bytes bit-exact (100.0000%) All 12 new positions bit-exact PASS first try. Final qpel matrix state: put_: mc00 (none — integer copy) mc01 ✓ mc02 ✓ mc03 ✓ mc10 ✓ mc11 ✓ mc12 ✓ mc13 ✓ mc20 ✓ (QPU+CPU) mc21 ✓ mc22 ✓ mc23 ✓ mc30 ✓ mc31 ✓ mc32 ✓ mc33 ✓ avg_: same 15-of-16 coverage, all CPU. Every B-slice biprediction case the libavcodec intercept can throw at us is now serviceable. QPU shaders remain mc20-only (cycle 9); the other 29 positions are CPU NEON. Whether to write more QPU shaders depends on real perf measurement — at NEON ~10 ns per 8x8 block, full qpel coverage at 1080p is ~2-3 ms of total work, well inside budget.	2026-05-25 08:49:42 +02:00
marfrit	1cc0990c9f	Merge pull request 'h264: qpel avg anchors (avg_mc20/02/22, biprediction support)' (#19 ) from noether/h264-qpel-avg-anchors into main Reviewed-on: #19	2026-05-25 06:45:34 +00:00
claude-noether	1113953f97	h264: qpel avg anchors (avg_mc20/02/22, biprediction support) Begins the avg_ qpel buildout for B-slice biprediction. Each avg_ form computes the same half-pel formula as its put_ sibling, then L2-averages the result with the existing dst contents — the caller pre-loads dst with the list0 prediction; the avg_ call adds list1 per H.264 §8.4.2.3.1. Scope (3 anchors, sets the pattern for the remaining 13 avg_ variants): - 3 new kernel enums (AVG_MC20=31, AVG_MC02=32, AVG_MC22=33) → CPU. - 3 NEON externs for the vendored ff_avg_h264_qpel8_{mc20,mc02,mc22}_neon. - 3 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro (the macro is type-agnostic so it didn't need changes for avg_). - 3 public dispatches via DEFINE_QPEL_DISPATCH macro. - 3 recipe wrappers via DEFINE_QPEL_RECIPE macro. - tests/h264_qpel8_avg_anchors_ref.c — per-cell helpers + L2 avg. - Test harness: run_avg_qpel() seeds dst with random content so the L2 averaging is actually exercised (not just put_-style overwrite that would silently pass). Verified on hertz: $ ./build/test_api_h264 \| tail -3 H.264 qpel avg_mc20: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel avg_mc22: 2048/2048 bytes bit-exact (100.0000%) All 3 anchors bit-exact PASS first try. Why anchors only in this PR: the avg_ pattern is uniform across all 16 positions (each is just "put_ result + L2 with dst"). Landing the anchors first confirms the macro pattern works for both put_ and avg_; the remaining 13 (avg_mc10/30/01/03 + avg_mc11..33) follow the same template in a follow-up PR. State of the qpel matrix after this PR: put_ : 15 of 16 positions ✓ (mc00 is integer copy, no wrapper) avg_ : 3 of 16 positions ✓ (mc20, mc02, mc22 anchors) 13 follow-up positions	2026-05-25 08:35:25 +02:00
marfrit	76e3076670	Merge pull request 'h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33)' (#18 ) from noether/h264-qpel-diagonals into main Reviewed-on: #18	2026-05-25 06:32:02 +00:00
claude-noether	0894a46114	h264: qpel diagonals — 8 positions (mc11/12/13/21/23/31/32/33) Closes the qpel buildout. All 8 remaining diagonal positions land in one PR. Each is the rounded average of two half-pel intermediates per H.264 §8.4.2.2.1 / Table 8-4, with the decomposition matching the FFmpeg .S reference structure (verified by reading external/ffmpeg-snapshot/.../h264qpel_neon.S lines 622-758). Decomposition table (the formula for each output cell at (r,c)): mc11 ¼¼ : avg(mc20[r, c], mc02[r, c]) mc12 ¼½ : avg(mc22[r, c], mc02[r, c]) mc13 ¼¾ : avg(mc20[r+1, c], mc02[r, c]) mc21 ½¼ : avg(mc22[r, c], mc20[r, c]) mc23 ½¾ : avg(mc22[r, c], mc20[r+1, c]) mc31 ¾¼ : avg(mc20[r, c], mc02[r, c+1]) mc32 ¾½ : avg(mc22[r, c], mc02[r, c+1]) mc33 ¾¾ : avg(mc20[r+1, c], mc02[r, c+1]) The (r±1, c±1) offsets capture the position-dependent shift that the FFmpeg .S encodes by pre-incrementing x1 (src pointer) before branching into the common mc11/mc21 code paths. Scope (tightly macro-ised): - 8 new kernel enums (MC11..MC33 = 23..30) → CPU. - 8 NEON externs for the vendored ff_put_h264_qpel8_mc*_neon. - 8 CPU dispatches via existing DEFINE_QPEL_CPU_DISPATCH macro. - 8 public dispatches via DEFINE_QPEL_DISPATCH macro. - 8 recipe wrappers via DEFINE_QPEL_RECIPE macro. - Header decls condensed via a DECLARE_QPEL_DIAG macro that expands to both recipe + dispatch decls per name. - C references via DEFINE_DIAG_REF macro: each ref is a 6-line wrapper around the per-cell hpel_h / hpel_v / hpel_hv helpers (the latter being the per-cell version of mc22's 13-row int16 tmp[] computation). - Test wrapper: test_qpel_diag_all() drives all 8 through the existing run_quarter_axis_qpel() harness. 
Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 \| tail -8 H.264 qpel mc11: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc12: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc13: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc21: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc23: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc31: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc32: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc33: 2048/2048 bytes bit-exact (100.0000%) ALL 8 diagonal positions bit-exact PASS first try. Meaningful because the position-dependent (r±1, c±1) source offsets are easy to get wrong by transcription, and any of them would surface on random inputs immediately. After this PR the H.264 qpel 8x8 put_ matrix is complete: mc00 mc01 mc02 mc03 mc10 mc11 mc12 mc13 mc20 mc21 mc22 mc23 mc30 mc31 mc32 mc33 15 of 16 positions exposed through the daedalus API; mc00 is just integer copy and rarely needs a dispatch wrapper (libavcodec sets the function pointer table directly). mc20 retains its QPU shader (cycle 9 / v3d_h264_qpel_mc20.spv); all other 14 are CPU NEON. What this does NOT cover (still in backlog): - avg_ variants (the "add" form for biprediction, 16 more positions). Currently the API only exposes put_. - 16x16 qpel (separate function family in FFmpeg; the 8x8 path can be used twice to substitute when 16x16 isn't critical). - QPU shaders for any qpel position other than mc20.	2026-05-25 07:49:12 +02:00
marfrit	d0a1db3c8f	Merge pull request 'h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON)' (#17 ) from noether/h264-qpel-quarter-axis into main Reviewed-on: #17	2026-05-25 05:42:16 +00:00
claude-noether	e01f7bc7c6	h264: qpel single-axis quarter-pel — mc10/mc30/mc01/mc03 (CPU/NEON) Closes the 4 single-axis quarter-pel positions in one PR. Each is a half-pel lowpass clipped to u8 followed by L2 rounded-average with an integer-aligned source pixel per H.264 §8.4.2.2.1: mc10 ¼-H ("a" pos): clip255(mc20(s)) avg src[r,c] mc30 ¾-H ("c" pos): clip255(mc20(s)) avg src[r,c+1] mc01 ¼-V ("d" pos): clip255(mc02(s)) avg src[r,c] mc03 ¾-V ("n" pos): clip255(mc02(s)) avg src[r+1,c] The mc10/mc30 pair and mc01/mc03 pair only differ in WHICH integer source pixel they average with — the half-pel computation is the same. Putting them in one PR is justified by that uniformity. Scope: - 4 new kernel enums: MC10=19, MC30=20, MC01=21, MC03=22 → CPU. - 4 NEON externs for the vendored ff_put_h264_qpel8_mc{10,30,01,03}_neon. - 4 CPU dispatch wrappers via DEFINE_QPEL_CPU_DISPATCH macro (collapses ~50 LOC of repetition). - 4 public dispatch fns via DEFINE_QPEL_DISPATCH macro. - 4 recipe wrappers via DEFINE_QPEL_RECIPE macro. - tests/h264_qpel8_quarter_axis_ref.c covers all four via shared hpel_h() / hpel_v() inlines + per-mode L2 average. - Test refactor: generic run_quarter_axis_qpel() harness exercises all 4 positions through a single helper (~50 LOC for 4 tests vs ~200 if each was hand-rolled). Verified on hertz: $ ./build/test_api_h264 \| tail -8 H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc10: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc30: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc01: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc03: 2048/2048 bytes bit-exact (100.0000%) All 4 new positions bit-exact PASS first try. 
Coverage matrix update (put_): done after this PR: mc10, mc30, mc01, mc03 (this PR) plus the existing mc20, mc02 and mc22 anchors; pending: mc11, mc12, mc13, mc21, mc23, mc31, mc32, mc33. After this PR: 7 of 16 single-axis + diagonal positions done. Remaining 9 are mc00 (integer copy) plus the 8 off-axis quarter-pel combinations (mc11/mc12/mc13/mc21/mc23/mc31/mc32/mc33); each of those combines a 2D lowpass intermediate with L2 averaging against a 1D-lowpass output. Next PR scope. Why no QPU shaders: same R-band logic as the prior CPU additions. At ~10 ns per 8x8 NEON block, all 16 qpel positions together would land in ~1.3 ms/frame at 1080p worst case, comfortably inside the 33 ms budget. QPU shader for mc20 already exists (cycle 9 / v3d_h264_qpel_mc20.spv); the other 15 follow once a clear perf reason emerges.	2026-05-25 01:29:52 +02:00
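The quarter-pel construction in the commit above (half-pel lowpass, clip to u8, then rounded average with an integer pixel) can be shown for a single output cell on the horizontal axis. This is a scalar sketch using the standard H.264 6-tap filter (1, -5, 20, 20, -5, 1); the function names are hypothetical and are not the repository's reference helpers.

```c
#include <stdint.h>

/* Clip to the 8-bit sample range. */
static uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-pel sample between s[0] and s[1] (the mc20 position).
 * Reads s[-2] .. s[3], so the caller must provide that margin. */
static uint8_t hpel_h_sketch(const uint8_t *s)
{
    int v = s[-2] - 5 * s[-1] + 20 * s[0] + 20 * s[1] - 5 * s[2] + s[3];
    return clip255((v + 16) >> 5); /* arithmetic shift assumed for v < 0 */
}

/* Rounded average of two samples. */
static uint8_t avg2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* mc10 ("a" position): half-pel averaged with the left integer pixel. */
static uint8_t mc10_cell(const uint8_t *s) { return avg2(hpel_h_sketch(s), s[0]); }

/* mc30 ("c" position): same half-pel, averaged with the right integer pixel. */
static uint8_t mc30_cell(const uint8_t *s) { return avg2(hpel_h_sketch(s), s[1]); }
```

On a step edge `{0, 0, 0, 255, 255, 255}` with `s` pointing at the third sample, the half-pel value is 128, so mc10 yields 64 and mc30 yields 192: the two positions share the filter and differ only in which neighbour they lean towards.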
marfrit	f3d4b15b9a	Merge pull request 'h264: qpel mc22 (2D half-pel, CPU/NEON)' (#16 ) from noether/h264-qpel-mc22 into main Reviewed-on: #16	2026-05-24 23:26:14 +00:00
claude-noether	20a4299c5c	h264: qpel mc22 (2D half-pel, CPU/NEON) Adds the "j position" 2D half-pel via cascaded H + V 6-tap lowpass with intermediate 16-bit precision per H.264 §8.4.2.2.1. One of the most common qpel positions in real H.264 streams — many encoders emit 1/2-1/2 motion vectors as their best-RD choice. Algorithmically distinct from the 1D mc20/mc02 siblings: - Horizontal 6-tap produces 13 rows of int16 intermediate (no per-stage clip/round — full precision retained). - Vertical 6-tap on the intermediate, then +512 >> 10 (the double-shift compensates for both 6-tap scalings) + clip255. The intermediate-precision requirement means the C reference can't just be "call mc20 then mc02" — that would double-clip and produce the wrong result. The 13-row int16 tmp[] buffer is the central invariant. Scope (same pattern as mc02 PR #15): - Public API: daedalus_dispatch_h264_qpel_mc22 + recipe wrapper. - Internal: dispatch_h264_qpel_mc22_cpu calling ff_put_h264_qpel8_mc22_neon. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC22 = 18 → CPU. - C reference: tests/h264_qpel8_mc22_ref.c — explicit tmp[13][8] int16 staging buffer; spec-derived shifts and rounding. - Test: test_qpel_mc22 in test_api_h264, 8 tiles at 16×16 with output positioned at (SRC_ROW=3, SRC_COL=3) so the kernel's [-2 .. +10] read window stays in-tile. Verified on hertz: $ ./build/test_api_h264 | tail -5 H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) H.264 qpel mc22: 2048/2048 bytes bit-exact (100.0000%) All 13 H.264 kernels in api_smoke now bit-exact PASS. mc22 being right first try is meaningful — the +512 >> 10 scaling + int16 intermediate sequence has multiple sign/shift/clip pitfalls and any of them would surface on random inputs immediately.
Coverage matrix update: put_ mc20 ✓ (QPU+CPU) put_ mc02 ✓ (CPU) put_ mc22 ✓ (CPU) → 12 single put_ positions still missing (¼/¾ + HV combos with L2 averaging).	2026-05-25 01:03:14 +02:00
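The intermediate-precision point in the mc22 commit is easier to see in code. The following is a sketch under stated assumptions: the function name is hypothetical and this is not the contents of tests/h264_qpel8_mc22_ref.c. It keeps the unrounded horizontal sums in a 13-row int16 buffer and applies the single +512 >> 10 at the end, as the commit describes.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Hypothetical 8x8 "j position" sketch. src points at output pixel (0,0);
 * the caller guarantees rows -2..+10 and columns -2..+10 are readable. */
static void qpel8_mc22_sketch(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    int16_t tmp[13][8];

    /* Pass 1: horizontal 6-tap with NO rounding or clipping.
     * The sums span [-2550, 10710], so they fit in int16. */
    for (int r = 0; r < 13; r++) {
        const uint8_t *p = src + (r - 2) * stride;
        for (int c = 0; c < 8; c++)
            tmp[r][c] = (int16_t)(p[c - 2] - 5 * p[c - 1] + 20 * p[c]
                                  + 20 * p[c + 1] - 5 * p[c + 2] + p[c + 3]);
    }

    /* Pass 2: vertical 6-tap on the intermediate, then one combined
     * +512 >> 10 that undoes both factor-of-32 scalings, then clip. */
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            int v = tmp[r][c] - 5 * tmp[r + 1][c] + 20 * tmp[r + 2][c]
                  + 20 * tmp[r + 3][c] - 5 * tmp[r + 4][c] + tmp[r + 5][c];
            dst[r * stride + c] = clip_u8((v + 512) >> 10);
        }
    }
}
```

Chaining two clipped 1D passes instead would round and saturate twice, which is the mismatch the commit warns about.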
marfrit	a2575d5e42	Merge pull request 'h264: qpel mc02 (vertical half-pel, CPU/NEON)' (#15 ) from noether/h264-qpel-mc02 into main Reviewed-on: #15	2026-05-24 22:59:38 +00:00
claude-noether	c3301b0c2e	h264: qpel mc02 (vertical half-pel, CPU/NEON) Mirror of cycle 9's mc20 transposed to vertical orientation. Wires up the second qpel half-pel position via the vendored ff_put_h264_qpel8_mc02_neon symbol, closes the "missing vertical sibling" gap that mc20 left open since cycle 9. Scope: - Public API: daedalus_dispatch_h264_qpel_mc02 + recipe wrapper. - Internal: dispatch_h264_qpel_mc02_cpu calling the NEON entry. - Recipe table: DAEDALUS_KERNEL_H264_QPEL_MC02 = 17 → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_qpel8_mc02_ref.c — vertical 6-tap transpose of mc20 (reads src[(r±N)*stride + c] instead of src[r*stride + c±N]). - Test: test_qpel_mc02 in test_api_h264, 8 tiles × 16 cols × 16 rows, random input, bit-exact compare against the C ref. Verified on hertz: $ ./build/test_api_h264 ... H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc02: 2048/2048 bytes bit-exact (100.0000%) All 12 H.264 kernels in the api_smoke now bit-exact PASS. Why CPU-only: same R-band logic as the deblock _h sibling pattern. mc02 at ~7.6 ns per 8x8 block on NEON (per the cycle 9 baseline measurements) gives ~700 us for 8160 MBs × 4 8x8 luma blocks at 1080p — comfortably inside the 33 ms budget. QPU shader is a fast-follow once the V vs H shader work is consolidated (the transpose for the V shader is not mechanical — different SIMD access pattern than the H shader). Coverage matrix update: qpel position put_ status avg_ status ------------- ----------- ----------- mc00 (copy) not wired not wired mc10 (¼-H) not wired not wired mc20 (½-H) ✓ QPU+CPU not wired mc30 (¾-H) not wired not wired mc01 (¼-V) not wired not wired mc02 (½-V) ✓ CPU not wired (this PR) mc03 (¾-V) not wired not wired mc11..mc33 not wired not wired 13 more qpel positions to go for the full put_ matrix. Adding them follows the same template; each is a small contained PR.	2026-05-25 00:47:37 +02:00
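The "vertical transpose of mc20" idea amounts to running the same 6-tap kernel with a different step between taps. A minimal sketch, assuming hypothetical names (`hpel_6tap`, `qpel8_mc02_sketch`) that are not taken from the repository:

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* One half-pel sample between p[0] and p[step].
 * step = 1 gives the horizontal (mc20) kernel, step = stride the vertical (mc02) one.
 * Assumes arithmetic right shift for negative sums. */
static uint8_t hpel_6tap(const uint8_t *p, ptrdiff_t step)
{
    int v = p[-2 * step] - 5 * p[-step] + 20 * p[0]
          + 20 * p[step] - 5 * p[2 * step] + p[3 * step];
    return clip_u8((v + 16) >> 5);
}

/* Hypothetical 8x8 vertical half-pel: reads rows -2..+10 around the block. */
static void qpel8_mc02_sketch(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[r * stride + c] = hpel_6tap(src + r * stride + c, stride);
}
```

The scalar code is symmetric, but as the commit notes the SIMD and shader versions are not, because the memory access pattern changes with the step.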
marfrit	9abc73d308	Merge pull request 'h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates' (#14 ) from noether/h264-intra-pred-chroma8x8 into main Reviewed-on: #14	2026-05-24 22:43:26 +00:00
claude-noether	d7100459f2	h264: Intra_8x8 chroma prediction — 4-mode C reference + spec gates Third intra-prediction primitive after PR #12 (Intra_4x4 luma) and PR #13 (Intra_16x16 luma). Covers Intra_8x8 chroma per H.264 §8.3.3: 4 modes used for BOTH Cb and Cr planes at 4:2:0. Mode quirks worth flagging in code review: - Mode 0 DC is asymmetric per quadrant. The 8x8 chroma block splits into four 4x4 quadrants with different DC formulas: (0,0) top-left : (sum_top[0..3] + sum_left[0..3] + 4) >> 3 (0,1) top-right : (sum_top[4..7] + 2) >> 2 (1,0) bot-left : (sum_left[4..7] + 2) >> 2 (1,1) bot-right : (sum_top[4..7] + sum_left[4..7] + 4) >> 3 The top-right quadrant deliberately IGNORES the top-left half even though it's available — that's per spec §8.3.3.2. - Mode 3 Plane uses slope coefficient 34 (not 5 like Intra_16x16 luma). Centre is (x-3, y-3) instead of (x-7, y-7). Sums span 4 differences instead of 8. Easy to copy-paste-bug from the luma Plane if you don't notice the constants change. Test highlights: - DC quadrants: distinct expected values per quadrant (16, 16, 40, 28 from asymmetric top/left halves) — any quadrant mix-up would surface immediately. Hand-derived from the formulas in the test comment. - Plane uniform: all-100 context → all-100 output (a = 3200, H = V = 0, (3200+16) >> 5 = 100 exactly). - Plane gradient: top + left = 0..7, hand-derives pred[0][0] = 1 and pred[7][7] = 15 via the full arithmetic chain (H = V = 56, b = c = 30, a = 224). Same hand-traced spec-walkthrough as the Intra_16x16 Plane gradient test. Verified on hertz: $ ./build/test_intra_pred_chroma8x8 Horizontal (mode 1) PASS Vertical (mode 2) PASS DC quadrants (mode 0) PASS Plane uniform (mode 3) PASS Plane gradient (mode 3) PASS (corners 1, 15) ALL Intra_8x8 chroma mode references PASS All 5 tests PASS first try. 
The DC quadrant correctness is meaningful (4 different formulas in one kernel) and the Plane gradient corners validate the slope=34 + centre=(x-3,y-3) constants vs the luma equivalents. Combined coverage after this PR: - Intra_4x4 luma: 9 modes ✓ (PR #12, all 9 PASS) - Intra_16x16 luma: 4 modes ✓ (PR #13, all 5 tests PASS) - Intra_8x8 chroma: 4 modes ✓ (this PR, all 5 tests PASS) - Intra_8x8 luma (High profile): 9 modes + smoothing — pending. Remaining backlog: Intra_8x8 luma (High profile, 9 modes + 1-2-1 smoothing pre-filter — distinct algorithm from Intra_4x4 because of the pre-filter), neighbour-availability fallback, dispatch wrappers.	2026-05-25 00:42:49 +02:00
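The four-quadrant DC rule listed in this commit can be written out as follows. This is a sketch assuming all neighbours are available and using a hypothetical function name; the formulas are the ones quoted in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chroma 8x8 DC predictor (interior-MB case only).
 * Reads the top row at dst[-stride + x] and the left column at dst[y*stride - 1]. */
static void pred_chroma8x8_dc_sketch(uint8_t *dst, ptrdiff_t stride)
{
    int top0 = 0, top1 = 0, left0 = 0, left1 = 0;
    for (int i = 0; i < 4; i++) {
        top0  += dst[i - stride];
        top1  += dst[4 + i - stride];
        left0 += dst[i * stride - 1];
        left1 += dst[(4 + i) * stride - 1];
    }

    /* dc[quadrant row][quadrant column] */
    const uint8_t dc[2][2] = {
        { (uint8_t)((top0 + left0 + 4) >> 3), (uint8_t)((top1 + 2) >> 2) },
        { (uint8_t)((left1 + 2) >> 2),        (uint8_t)((top1 + left1 + 4) >> 3) },
    };

    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = dc[y >> 2][x >> 2];
}
```

With top halves of 8 and 16 and left halves of 24 and 40, the quadrants come out as 16, 16, 40 and 28. That is one input choice that reproduces the values quoted above; the inputs of the actual test are not shown in the log.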
marfrit	dff610e13d	Merge pull request 'h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates' (#13 ) from noether/h264-intra-pred-16x16 into main Reviewed-on: #13	2026-05-24 22:40:29 +00:00
claude-noether	c43ee84d8e	h264: Intra_16x16 luma prediction — 4-mode C reference + spec gates Second piece of the intra-prediction primitive set after PR #12 (Intra_4x4 luma 9 modes). Covers the Intra_16x16 luma MB type per H.264 §8.3.2: 4 modes (Vertical, Horizontal, DC, Plane). Scope: - tests/h264_intra_pred_16x16_ref.c — 4 spec-derived modes. Same FFmpeg-style interface as the 4x4 sibling: void daedalus_h264_pred_16x16_<name>_ref(uint8_t *dst, ptrdiff_t stride); Assumes all neighbours valid (interior-MB case). The Plane mode is the algorithmically heaviest of the four — spec §8.3.2.4 has two slope sums (H, V) over the asymmetric top/left contexts, a clipped quadratic evaluation per cell, and a top-left-corner participant at i=7 / j=7. Implementation follows the spec straightforwardly with `clip_u8` on the final saturating cast. - tests/test_intra_pred_16x16.c — 5 test cases: V, H, DC: standard contexts (gradient top / gradient left / small uniform pair). Plane (uniform): all neighbours = 100 → H = V = 0 → output = (16*200 + 16) >> 5 = 100. Verifies the orientation-free portion of the formula. Plane (gradient): top + left both 0..15, spec-derived corner expectations pred[0][0] = 1 and pred[15][15] = 31. The arithmetic chain (H = V = 400 → b = c = 31, a = 480) is fully hand-traced in the test comment so the expected values are auditable. - CMakeLists.txt — new test_intra_pred_16x16 binary; pure-CPU library, no daedalus_core dependency (same separation as the 4x4 ref). Verified on hertz: $ ./build/test_intra_pred_16x16 Vertical (mode 0) PASS Horizontal (mode 1) PASS DC (mode 2) PASS Plane (mode 3, uniform) PASS Plane (mode 3, gradient) PASS (corners 1, 31) ALL Intra_16x16 mode references PASS Plane mode being right first try is meaningful — H/V sums, b/c slope shifts, and the a-baseline arithmetic have many sign / index error opportunities. The asymmetric gradient test would have caught any of them; it didn't.
What this does NOT cover (still in the intra-pred backlog): - Intra_8x8 chroma (4 modes per H.264 §8.3.3). - Intra_8x8 luma (High profile, 9 modes per §8.3.2.1 + the 1-2-1 smoothing pre-filter — distinct algorithm from Intra_4x4). - Neighbour-availability fallback for boundary MBs. - Dispatch wrappers (same architectural question as before — wait for decoder Stage 2a strategy decision).	2026-05-25 00:35:24 +02:00
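The Plane arithmetic chain quoted in this commit (H = V = 400, b = c = 31, a = 480, corners 1 and 31) can be reproduced with a short sketch. The function name is hypothetical and the interior-MB assumption is the same one the commit states; the slope constant 5 and the (x-7, y-7) centre are as described in the log.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* Hypothetical Intra_16x16 Plane predictor, all neighbours available.
 * Assumes arithmetic right shift for negative slope sums. */
static void pred16x16_plane_sketch(uint8_t *dst, ptrdiff_t stride)
{
    const uint8_t *top = dst - stride;   /* top[-1] is the top-left corner */
    int H = 0, V = 0;

    for (int i = 0; i < 8; i++) {
        /* At i = 7 both sums reach the top-left corner sample. */
        H += (i + 1) * (top[8 + i] - top[6 - i]);
        V += (i + 1) * (dst[(8 + i) * stride - 1] - dst[(6 - i) * stride - 1]);
    }

    const int a = 16 * (dst[15 * stride - 1] + top[15]);
    const int b = (5 * H + 32) >> 6;
    const int c = (5 * V + 32) >> 6;

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            dst[y * stride + x] = clip_u8((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}
```

For the gradient case with a zero corner sample: H = 2(1 + 4 + ... + 49) + 8*15 = 400, b = (2000 + 32) >> 6 = 31, and pred[0][0] = (480 - 434 + 16) >> 5 = 1.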
marfrit	fad600000b	Merge pull request 'h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates' (#12 ) from noether/h264-intra-pred-4x4 into main Reviewed-on: #12	2026-05-24 22:28:39 +00:00
claude-noether	ce6703a862	h264: Intra_4x4 luma prediction — 9-mode C reference + spec gates Lays the bit-exact gate for H.264 §8.3.1.4 Intra_4x4 luma prediction. Spec-derived C reference covering all 9 modes; standalone test exercises each against hand-computed expected 4x4 patterns. Why fourier (not the decoder) gets this: it's a reusable spec-level primitive — both daedalus-decoder (Phase 1 Stage 2a intra prediction) and any future shader work will need the same bit-exact reference. Putting it in fourier alongside the IDCT / deblock refs keeps the "spec implementations" library cohesive. Why CPU C reference, not NEON or QPU: the vendored FFmpeg snapshot (external/ffmpeg-snapshot/libavcodec/aarch64/) has h264dsp/idct/qpel but NOT h264pred. Vendoring h264pred_neon.S would expand the snapshot surface; deferring that pending real perf data. Per the cycle 9 NEON benches that take ~5 ns per 8x8 qpel block, intra prediction at ~5 ns per 4x4 block × 16 blocks/MB × 8160 MBs = ~650 us/frame at 1080p — well inside budget even at NEON, and much further inside at plain C. Not the critical-path concern. Scope: - tests/h264_intra_pred_4x4_ref.c — 9 prediction modes per H.264 spec §8.3.1.4 sub-clauses, FFmpeg-style interface: void daedalus_h264_pred_4x4_<name>_ref(uint8_t *dst, ptrdiff_t stride); Reads top/top-right/left/top-left neighbours from dst[-stride/-1] offsets, writes 4×4 output at dst[0..3][0..3]. Assumes all 13 neighbour bytes are valid (interior-MB case; availability fallbacks are caller-side per spec). - tests/test_intra_pred_4x4.c — 10 cases: 9 uniform-context degenerate tests (one per mode), establishing that nothing is structurally broken (all output cells must equal the uniform input value), plus 1 asymmetric Vertical_Right sanity test with 16 distinct expected cells hand-computed from spec §8.3.1.4.6 — the "really exercise orientation + row/col arithmetic" gate.
- CMakeLists.txt — new test_intra_pred_4x4 binary (no daedalus_core dependency; pure-CPU library doesn't need a context to construct). Verified on hertz: $ ./build/test_intra_pred_4x4 Vertical (mode 0) PASS Horizontal (mode 1) PASS DC (mode 2) PASS DiagDownLeft (mode 3) PASS DiagDownRight (mode 4) PASS VerticalRight (mode 5) PASS HorizontalDown (mode 6) PASS VerticalLeft (mode 7) PASS HorizontalUp (mode 8) PASS VR asym (sanity) PASS ALL 10 intra-4x4 mode references PASS The VR asym test passed first try; the DC test failed on the first attempt because my test expectation miscomputed the rounding shift (I wrote 4, actual is 2 = (16+4)>>3). Fixed in the test. Reference itself never had the bug. What this does NOT cover (next-step backlog): - Intra_16x16 luma prediction (4 modes per H.264 §8.3.2): vertical, horizontal, DC, plane. - Intra_8x8 chroma prediction (4 modes per H.264 §8.3.3): DC, horizontal, vertical, plane. - Intra_8x8 luma prediction (High profile, 9 modes per §8.3.2.1) — these are the High-profile siblings of the modes in this PR with the 1-2-1 smoothing pre-filter. Different but well-defined. - Neighbour availability fallback (top-edge MB, left-edge MB, slice-boundary, top-right unavailable in some positions). - Dispatch wrappers — these refs aren't surfaced through daedalus_dispatch_*(). Whether to do that depends on the daedalus-decoder Stage 2a architecture (per-block CPU vs per-diagonal GPU wavefront — TBD).	2026-05-25 00:14:51 +02:00
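The DC rounding that tripped the first test attempt is small enough to show in full. A minimal sketch, assuming the same dst/stride neighbour convention the commit describes and a hypothetical function name:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical Intra_4x4 DC predictor, all neighbours available:
 * dc = (sum of 4 top + 4 left samples + 4) >> 3. */
static void pred4x4_dc_sketch(uint8_t *dst, ptrdiff_t stride)
{
    int sum = 4;                              /* rounding term for the >> 3 */
    for (int i = 0; i < 4; i++)
        sum += dst[i - stride] + dst[i * stride - 1];

    const uint8_t dc = (uint8_t)(sum >> 3);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] = dc;
}
```

With every neighbour equal to 2 the sum of the eight samples is 16, so the prediction is (16 + 4) >> 3 = 2, matching the corrected expectation in the commit.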
marfrit	5306bf0f61	Merge pull request 'h264: deblock bS=4 intra variants (luma + chroma, V + H)' (#11 ) from noether/h264-deblock-intra into main Reviewed-on: #11	2026-05-24 22:09:15 +00:00
claude-noether	9b1c106dc5	h264: deblock bS=4 intra variants (luma + chroma, V + H) Closes the deblock matrix: adds the four bS=4 intra-strength loop filters used at I-MB edges (and other boundaries where H.264 §8.7.2.1 forces boundary strength to 4). After this PR fourier covers all 8 standard 8-bit 4:2:0 deblock combinations: bS<4 bS=4 ----- ----- luma_v ✓ (cycle 8 QPU) ✓ (CPU) luma_h ✓ (CPU, PR #9) ✓ (CPU) chrm_v ✓ (CPU, PR #10) ✓ (CPU) chrm_h ✓ (CPU, PR #10) ✓ (CPU) Scope: - 4 new kernel enums (LV_INTRA=13, LH_INTRA=14, CV_INTRA=15, CH_INTRA=16), all → CPU substrate in the recipe table. - 4 new public dispatch fns + 4 recipe wrappers (defined via two DEFINE_INTRA_DISPATCH / DEFINE_INTRA_RECIPE macros to keep the boilerplate tight). - 4 new extern decls for the vendored ff_h264_{v,h}_loop_filter_{luma,chroma}_intra_neon symbols. - C reference: tests/h264_intra_loop_filter_ref.c covers all four orientations. Algorithm per H.264 §8.7.2.3: Luma: per-side strong/weak filter selector strong_p = (|p2-p0| < β) AND (|p0-q0| < (α>>2)+2) strong_q = (|q2-q0| < β) AND (|p0-q0| < (α>>2)+2) Strong updates p0/p1/p2 (and mirror); weak updates p0 only. Chroma: always weak, only p0/q0 updated. - daedalus_h264_deblock_meta is REUSED for intra dispatches; the tc0[] field is ignored (bS=4 hardcodes the strength). Callers can build a single edge list and route by kernel without an extra struct. - Test refactor: an intra_test_spec table + run_intra_test helper drives all four orientations through one harness, keeping the new test surface compact (~50 LOC for 4 kernels vs ~200 if each had its own test_deblock_*_intra fn). Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === ...
H.264 deblock luma v intra: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock luma h intra: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v intra: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h intra: 256/256 bytes bit-exact (100.0000%) ... All 11 H.264 kernels bit-exact PASS — the deblock matrix is closed. The bit-exact match on first try is meaningful for these kernels: the strong/weak filter selector + per-side asymmetry would have surfaced any sign / shift / rounding mistake immediately. The C reference is now a usable spec checkpoint for the eventual QPU shader work. QPU shader follow-up: not in this PR. The intra path's 3-cell per-side update + strong/weak branch is structurally more complex than the bS<4 path that already has a V shader (v3d_h264deblock.spv). Per the prior R-band logic for deblock, intra edges are < 20% of total deblock work at typical bit-rates, so NEON-only at ~ 10 ns/edge fits comfortably in the budget.	2026-05-25 00:00:46 +02:00
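The per-side strong/weak selector for bS=4 luma can be sketched for a single line of samples across the edge. The function name and the `step` parameter are hypothetical (step = stride for one edge orientation, 1 for the other); the selector conditions are the ones quoted in the commit, and the tap weights are the standard H.264 bS=4 luma formulas as I understand them.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bS=4 luma filter for one sample line. pix points at q0;
 * p0..p3 sit at pix[-1*step]..pix[-4*step], q1..q3 at pix[step]..pix[3*step]. */
static void deblock_luma_intra_line(uint8_t *pix, ptrdiff_t step, int alpha, int beta)
{
    const int p3 = pix[-4 * step], p2 = pix[-3 * step];
    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0], q1 = pix[step], q2 = pix[2 * step], q3 = pix[3 * step];

    /* Edge activity gate shared by both sides. */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int small_gap = abs(p0 - q0) < (alpha >> 2) + 2;

    if (small_gap && abs(p2 - p0) < beta) {          /* strong: p0, p1, p2 */
        pix[-1 * step] = (uint8_t)((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3);
        pix[-2 * step] = (uint8_t)((p2 + p1 + p0 + q0 + 2) >> 2);
        pix[-3 * step] = (uint8_t)((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3);
    } else {                                         /* weak: p0 only */
        pix[-1 * step] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2);
    }

    if (small_gap && abs(q2 - q0) < beta) {          /* strong: q0, q1, q2 */
        pix[0]        = (uint8_t)((q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3);
        pix[step]     = (uint8_t)((q2 + q1 + q0 + p0 + 2) >> 2);
        pix[2 * step] = (uint8_t)((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3);
    } else {                                         /* weak: q0 only */
        pix[0] = (uint8_t)((2 * q1 + q0 + p1 + 2) >> 2);
    }
}
```

All outputs are weighted averages of 8-bit samples, so no clip is needed in this path. The chroma intra variant would use only the weak branch.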
marfrit	ce436bfd96	Merge pull request 'h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4)' (#10 ) from noether/h264-deblock-chroma into main Reviewed-on: #10	2026-05-24 21:55:57 +00:00
claude-noether	a5c47aa51c	h264: deblock chroma_v + chroma_h (CPU/NEON, bS<4) Continues the deblock buildout after PR #9 (luma_h). Adds the two chroma orientations via the same recipe-table-routed-to-CPU pattern; QPU shaders for chroma deblock are still a follow-up. Scope: - Public API: 4 new fns (dispatch + recipe wrapper × {v, h}). - Internal: dispatch_h264_deblock_chroma_{v,h}_cpu calling the vendored ff_h264_{v,h}_loop_filter_chroma_neon symbols. - Recipe table: DAEDALUS_KERNEL_H264_DEBLOCK_CV = 11, DAEDALUS_KERNEL_H264_DEBLOCK_CH = 12, both → CPU. Explicit SUBSTRATE_QPU returns -1 (no shader yet). - C reference: tests/h264_chroma_loop_filter_ref.c — covers both orientations. Algorithm per H.264 §8.7.2.4 (bS<4 chroma inter): tC = tc0_seg + 1 (no luma-style ap/aq side bonus); only p0/q0 are updated (chroma never modifies p1/p2/q1/q2). - Tests: test_deblock_chroma_v (8x4 tile, edge at row 2) + test_deblock_chroma_h (4x8 tile, edge at col 2), 4 segments x 2 cells per segment per spec. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H264_DEBLOCK_CV recipe substrate: 1 (CPU) H264_DEBLOCK_CH recipe substrate: 1 (CPU) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 deblock chroma v: 256/256 bytes bit-exact (100.0000%) H.264 deblock chroma h: 256/256 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 7 kernels bit-exact PASS. 
Chroma test sizes are smaller (256 bytes per orientation) because the per-MB chroma deblock surface is smaller than luma — accurate to the production geometry. Why no QPU shader yet (per the established pattern): - Chroma deblock is ~25% of total deblock work at 4:2:0 (one quarter the pixel count of luma per MB) — modest QPU win even after the shader exists. - Same R-band considerations as the luma _h follow-up: the V shader transpose isn't mechanical, and the 8-cell tile is small enough that NEON's per-edge cost (~3 ns) is already inside the budget. - Total bench at 1080p: 8160 MBs × 4 chroma edges × 3 ns = ~100 us. Negligible compared to the IDCT layer's 10 ms (CPU NEON). Now coverage in fourier for the bS<4 8-bit 4:2:0 deblock matrix is complete: luma_v ✓, luma_h ✓, chroma_v ✓, chroma_h ✓. Remaining deblock work: bS=4 intra variants (luma + chroma, V + H). What this unblocks downstream: - daedalus-decoder Stage 4 deblock can now dispatch all four bS<4 edge categories that a typical inter MB needs.	2026-05-24 23:53:09 +02:00
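The bS<4 chroma kernel described in this commit touches only p0 and q0. A minimal per-line sketch follows; the function name is hypothetical, tC = tc0 + 1 follows the commit's wording, and treating a negative tc0 as "skip this segment" is an assumption borrowed from the usual FFmpeg convention rather than something the log states.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }
static int clip3(int lo, int hi, int v) { return v < lo ? lo : v > hi ? hi : v; }

/* Hypothetical bS<4 chroma filter for one sample line. pix points at q0;
 * p0/p1 sit at pix[-step]/pix[-2*step], q1 at pix[step]. */
static void deblock_chroma_line(uint8_t *pix, ptrdiff_t step,
                                int alpha, int beta, int tc0)
{
    if (tc0 < 0)                     /* assumption: negative tc0 disables the segment */
        return;

    const int p1 = pix[-2 * step], p0 = pix[-1 * step];
    const int q0 = pix[0], q1 = pix[step];

    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    const int tc = tc0 + 1;          /* no luma-style ap/aq bonus */
    /* Multiplication instead of << avoids shifting a negative value left;
     * the >> 3 assumes arithmetic right shift for negative sums. */
    const int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);

    pix[-1 * step] = clip_u8(p0 + delta);
    pix[0]         = clip_u8(q0 - delta);
}
```

p1 and q1 are read but never written, which is the structural difference from the luma kernel that the commit highlights.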
marfrit	f4af24020f	Merge pull request 'h264: deblock_luma_h (CPU/NEON via vendored ff_h264_h_loop_filter)' (#9 ) from noether/h264-deblock-luma-h into main Reviewed-on: #9	2026-05-24 21:47:57 +00:00
claude-noether	818e71560e	gitignore: exclude .claude/ runtime files The previous commit unintentionally added .claude/scheduled_tasks.lock which is an agent-runtime artefact, not source. Untrack it and add .claude/ to .gitignore so it stays out of future commits.	2026-05-24 23:29:06 +02:00
claude-noether	9d5451e0fe	h264: deblock_luma_h — CPU/NEON via vendored ff_h264_h_loop_filter Adds the horizontal-edge sibling of cycle 8's deblock_luma_v. The vendored FFmpeg snapshot already includes ff_h264_h_loop_filter_luma_neon in libavcodec/aarch64/h264dsp_neon.S — this PR wires up the symbol, the bit-exact reference, and the recipe-table entry so daedalus-decoder and other consumers can call the H variant through the same dispatch shape they use for _v. Scope: - Public API: daedalus_dispatch_h264_deblock_luma_h(ctx, sub, ...) + daedalus_recipe_dispatch_h264_deblock_luma_h(ctx, ...) wrapper. - Internal: dispatch_h264_deblock_h_cpu() calls the NEON entry. - Recipe table: new DAEDALUS_KERNEL_H264_DEBLOCK_LH = 10, mapped to DAEDALUS_SUBSTRATE_CPU until a QPU shader is written. An explicit SUBSTRATE_QPU request on the H dispatch returns -1 (fails fast, no silent CPU degradation). - C reference: tests/h264_h_loop_filter_luma_ref.c — the column-axis transpose of h264_deblock_ref.c. Same per-segment kernel; pix[-4..+3] accesses cols instead of rows*stride. - Test: test_api_h264 grows a test_deblock_h() with 8 tiles (8 cols x 16 rows each, edge at col 4), random alpha/beta/tc0; compares NEON dispatch against reference byte-for-byte. Verified on hertz (Pi 5 / V3D 7.1): $ ./build/test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 H264_DEBLOCK_LH recipe substrate: 1 (CPU, no QPU H shader yet) H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) H.264 IDCT 8x8: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma v: 2048/2048 bytes bit-exact (100.0000%) H.264 deblock luma h: 1024/1024 bytes bit-exact (100.0000%) H.264 qpel mc20: 1024/1024 bytes bit-exact (100.0000%) All 5 kernels bit-exact PASS. The new H variant joins the suite with 1024 random-input bytes per tile x 8 tiles.
Why CPU-only for now: the daedalus-decoder downstream needs the H edge dispatched somewhere — even at CPU NEON cost (~6 ns/edge per the cycle 8 M3 baseline) a frame's worth at 1080p is 8160 MBs × 4 edges = 32 640 edges ≈ 200 us — well inside the 30 fps budget. Writing the V3D H-edge shader is a follow-up (would be cycle 8' or similar; the V-edge shader's transpose isn't mechanical because of how the workgroup organisation maps to columns vs rows). Backlog addition (out of scope for this PR): - V3D shader for the H variant (mirror of v3d_h264deblock.spv). - bS=4 intra-strength filter (different algebra; both _v and _h). - Chroma deblock _v/_h (8-cell variants).	2026-05-24 23:28:56 +02:00
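The per-frame budget estimates that recur in these commit messages all follow the same arithmetic. A small sketch with hypothetical helper names makes it explicit; the 6 ns/edge figure is the one quoted above, not a new measurement.

```c
/* Macroblock count for a frame, rounding each dimension up to a multiple of 16. */
static unsigned mb_count(unsigned width, unsigned height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Estimated per-frame cost in microseconds for a kernel that runs
 * units_per_mb times per macroblock at ns_per_unit nanoseconds each. */
static double frame_cost_us(unsigned mbs, unsigned units_per_mb, double ns_per_unit)
{
    return (double)mbs * (double)units_per_mb * ns_per_unit / 1000.0;
}
```

For 1920x1080 this gives 120 × 68 = 8160 macroblocks, and 8160 × 4 edges × 6 ns comes to about 196 us, against roughly 33 333 us available per frame at 30 fps.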
marfrit	0d54d68f38	Merge pull request 'cycle 9: V3D shader for H.264 luma qpel mc20 — closes 9/9 QPU coverage' (#8 ) from noether/v3d-shader-h264-qpel-mc20 into main Reviewed-on: #8	2026-05-23 19:14:44 +00:00
claude-noether	79553c6e22	cycle 9: V3D shader for H.264 luma qpel mc20 — 9/9 QPU coverage Closes the QPU-default substrate campaign per the 2026-05-23 decree: every daedalus-fourier kernel that can be done in QPU is now done in QPU. Cycle 9 is the last piece — 6-tap horizontal half-pel luma motion compensation, H.264 §8.4.2.2.1. Shader (src/v3d_h264_qpel_mc20.comp): - local_size = 64, 1 lane per output pixel of one 8x8 block, 1 block per workgroup. Simplest layout that avoids any inter-lane communication — V3D's L2 cache handles the redundant reads from adjacent lanes computing adjacent output columns. - Per-pixel: read 6 src samples (cols c-2..c+3 in row r), apply the (1, -5, 20, 20, -5, 1) / 32 filter with +16 rounding, clip to u8, write one dst byte. - Single-stride convention matches FFmpeg's H264QpelContext (dst and src share `stride`; src+src_off points at output col 0 with the caller-guaranteed -2/+3 padding). Dispatch wiring (src/daedalus_core.c): - h264_qpel_mc20_pipe field on daedalus_ctx, lazy init. - dispatch_h264_qpel_mc20_qpu(): 3 SSBOs (src / dst / meta), src_max = src_off + 7*stride + 11 (covers the +3-col read footprint on the last row), dst_max = dst_off + 7*stride + 8. 1 block per WG. - daedalus_dispatch_h264_qpel_mc20() replaces ROUTE_CPU_ONLY with the substrate-switch pattern matching the other H.264 kernels. - Recipe table: H264_QPEL_MC20 returns SUBSTRATE_QPU. Verification (hertz, Pi 5, V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 2 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 2 ← flipped H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact ← QPU First-iteration result was 1017/1024 (99.32%) — off-by-7 traced to undersizing src_max in the host wrapper.
The filter reads src_off + 7*stride + (7 + 3) = +10 at the last row last column; add 1 for memcpy's [0..N-1] semantic → 11. Fixed in the same patch. All 9 daedalus-fourier cycles now QPU-by-recipe: cycle 1 VP9 IDCT 8x8 QPU cycle 2 VP9 LPF wd=4 QPU cycle 3 VP9 MC 8h QPU cycle 4 VP9 LPF wd=8 QPU cycle 5 AV1 CDEF 8x8 QPU cycle 6 H.264 IDCT 4x4 QPU cycle 7 H.264 IDCT 8x8 QPU cycle 8 H.264 deblock luma-v QPU cycle 9 H.264 qpel mc20 QPU ← this commit Closes daedalus-fourier task #165. Per the decree memory [QPU is default substrate], the prototype now demonstrates GPU acceleration on every measured kernel.	2026-05-23 21:05:36 +02:00
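The src_max fix described in this commit is a footprint calculation that can be checked by brute force. The helper names below are hypothetical; the closed forms are the ones quoted in the commit (exclusive end offsets for an 8x8 block).

```c
#include <stddef.h>

/* Closed-form exclusive end offsets for one 8x8 mc20 block:
 * last row (7), last column (7), 3 taps to the right, plus 1 for an exclusive bound. */
static size_t mc20_src_end(size_t src_off, size_t stride)
{
    return src_off + 7 * stride + 7 + 3 + 1;
}

static size_t mc20_dst_end(size_t dst_off, size_t stride)
{
    return dst_off + 7 * stride + 7 + 1;
}

/* Brute-force check: visit every tap of every output pixel and track the
 * highest byte index read. Requires src_off >= 2 (the left padding). */
static size_t mc20_src_end_bruteforce(size_t src_off, size_t stride)
{
    size_t max_idx = 0;
    for (size_t r = 0; r < 8; r++)
        for (size_t c = 0; c < 8; c++)
            for (size_t tap = 0; tap < 6; tap++) {       /* taps at c-2 .. c+3 */
                size_t idx = src_off + r * stride + c + tap - 2;
                if (idx > max_idx)
                    max_idx = idx;
            }
    return max_idx + 1;
}
```

A bound that forgets the right-hand taps or the final +1 would cut off bytes at the end of the last row, which is consistent with the handful of mismatching output bytes the commit reports for the first iteration.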
marfrit	a092ee34aa	Merge pull request 'QPU is default substrate: recipe table + ctx env-var override' (#7 ) from noether/qpu-default-recipe-cycles-5-8 into main Reviewed-on: #7	2026-05-23 18:59:34 +00:00
marfrit	c01754e849	Merge pull request 'v3d_runner: buffer pool for QPU dispatch hot path' (#6 ) from noether/v3d-buffer-pool into main Reviewed-on: #6	2026-05-23 18:59:18 +00:00
claude-noether	74687d9def	cycle 7: V3D shader for H.264 IDCT 8x8 Mirrors cycle 6 (PR #7 prior commit) but at 8x8 scale: 8 lanes per block, 8 blocks per WG. H.264 §8.5.13.2 1D butterfly twice (row pass, column pass), (val + 32) >> 6 rounded + clipped + added to dst. Bit-exact first try on hertz (Pi 5, V3D 7.1): H264_IDCT4 recipe substrate: 2 (QPU) H264_IDCT8 recipe substrate: 2 (QPU) ← flipped H264_DEBLOCK_LV recipe substrate: 2 (QPU) H264_QPEL_MC20 recipe substrate: 1 (CPU) ← task #165 remaining H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact ← QPU H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact 8 of 9 daedalus-fourier cycles now QPU-by-recipe. Only cycle 9 (H.264 luma qpel mc20) still CPU — different shape (6-tap MC filter, not a transform) so needs its own shader template; task #165 covers it as a follow-on. Same pattern as cycle 6 commit (`65bd5c3`): adds h264_idct8_pipe field + lazy init, dispatch_h264_idct8_qpu() with 3 SSBOs, v3d_h264_idct8.spv install rule. Uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands).	2026-05-23 20:09:25 +02:00
claude-noether	65bd5c3fe3	cycle 6: V3D shader for H.264 IDCT 4x4 (first cycle-6 QPU dispatch) Per the QPU-default substrate decree 2026-05-23, cycle 6 (H.264 IDCT 4x4 + add) was the highest-priority H.264 kernel to flip from NEON-only to QPU-capable. The same shape as VP9 IDCT 8x8 (cycle 1) — two-pass butterfly with shared-memory transpose — but at 4x4 scale: 4 lanes per block, 16 blocks per WG. What's added: - src/v3d_h264_idct4.comp: GLSL compute shader implementing the H.264 §8.5.12.1 1D butterfly twice (row pass then column pass), with (val + 32) >> 6 rounding and clip-to-u8 add to dst. Block memory layout is column-major (matches FFmpeg `ff_h264_idct_add_neon` convention). - CMakeLists: glslang rule + install entry for v3d_h264_idct4.spv. - dispatch_h264_idct4_qpu() in daedalus_core.c: lazy pipeline init, 3 SSBOs (coeffs / dst / meta as uvec4), push-constant (n_blocks, dst_stride), 16 blocks per WG dispatch. Matches the existing dispatch_*_qpu patterns; uses v3d_runner_create_buffer / destroy_buffer (will swap to pool API once PR #6 lands). - daedalus_dispatch_h264_idct4() replaces ROUTE_CPU_ONLY with the same CPU/QPU substrate switch the deblock dispatch uses. - daedalus_recipe_substrate_for(H264_IDCT4) returns QPU now that the shader exists. Verification on hertz (Pi 5 + V3D 7.1): $ ./test_api_h264 === Phase 8a API smoke: H.264 kernels via recipe dispatch === H264_IDCT4 recipe substrate: 2 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact (100.0000%) ← QPU H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact H.264 qpel mc20: 1024/1024 bytes bit-exact The AUTO-substrate path now picks QPU for H.264 IDCT 4x4, and the output is bit-exact against the C reference (which is identical to the NEON .S code by construction — same FFmpeg upstream). 
Remaining cycle-6/7/9 work in task #165: - cycle 7: H.264 IDCT 8x8 (template same shape; 8 lanes per block, fewer blocks per WG) - cycle 9: H.264 luma qpel mc20 (different shape — 6-tap MC not a transform) This commit lands the cycle-6 piece of task #165.	2026-05-23 20:06:20 +02:00
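The two-pass butterfly that the cycle 6 shader implements can be sketched on the CPU as follows. This is an illustration with hypothetical names, not the shader or the vendored FFmpeg code; the (val + 32) >> 6 rounding and clip-and-add come from the commit, the butterfly is the standard H.264 4x4 inverse transform, and the coefficient indexing is only an assumption about the column-major layout mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_u8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

/* One 1D pass of the H.264 4x4 inverse transform.
 * Assumes arithmetic right shift for negative values. */
static void idct4_1d(const int in[4], int out[4])
{
    const int e0 = in[0] + in[2];
    const int e1 = in[0] - in[2];
    const int e2 = (in[1] >> 1) - in[3];
    const int e3 = in[1] + (in[3] >> 1);
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* Hypothetical 4x4 IDCT + add: two 1D passes, then (v + 32) >> 6,
 * added to the existing prediction in dst with saturation. */
static void h264_idct4_add_sketch(uint8_t *dst, ptrdiff_t stride, const int16_t coef[16])
{
    int tmp[16];

    for (int i = 0; i < 4; i++) {                 /* first pass */
        const int in[4] = { coef[i], coef[i + 4], coef[i + 8], coef[i + 12] };
        int out[4];
        idct4_1d(in, out);
        for (int k = 0; k < 4; k++)
            tmp[i + 4 * k] = out[k];
    }

    for (int i = 0; i < 4; i++) {                 /* second pass + add */
        const int in[4] = { tmp[4 * i], tmp[4 * i + 1], tmp[4 * i + 2], tmp[4 * i + 3] };
        int out[4];
        idct4_1d(in, out);
        for (int k = 0; k < 4; k++)
            dst[k * stride + i] = clip_u8(dst[k * stride + i] + ((out[k] + 32) >> 6));
    }
}
```

A DC-only block is a convenient sanity check: a DC coefficient of 64 adds exactly 1 to every pixel, since (64 + 32) >> 6 = 1.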
claude-noether	737e87980d	QPU is default substrate: recipe table + ctx env-var override Per the user decree 2026-05-23 — "what can be done in QPU will be done in QPU" — this lands two coupled changes that flip production-decode kernels with existing V3D shaders from CPU-by-recipe to QPU-by-recipe: 1) daedalus_recipe_substrate_for() returns SUBSTRATE_QPU for every kernel that has a shipped V3D compute shader: cycle 1 VP9 IDCT 8x8 QPU (was QPU; unchanged) cycle 2 VP9 LPF wd=4 QPU (was QPU; unchanged) cycle 3 VP9 MC 8h QPU (FLIPPED from CPU — v3d_mc_8h.spv) cycle 4 VP9 LPF wd=8 QPU (was QPU; unchanged) cycle 5 AV1 CDEF 8x8 QPU (FLIPPED from CPU — v3d_cdef.spv) cycle 6 H.264 IDCT 4x4 CPU (no shader yet; task #165) cycle 7 H.264 IDCT 8x8 CPU (no shader yet; task #165) cycle 8 H.264 deblock luma-v QPU (FLIPPED from CPU — v3d_h264deblock.spv) cycle 9 H.264 qpel mc20 CPU (no shader yet; task #165) The R-band cost/benefit framework still applies but is now superseded for substrate selection by the decree. Where R stays RED, the cost is in dispatch overhead, which is a fixable engineering issue (tasks 160 buffer-pool, 161 persistent cmdbuf, 162 dmabuf import). 2) daedalus_ctx_create_no_qpu() now honours an env-var override: set DAEDALUS_FORCE_QPU=1 in the process and create_no_qpu silently escalates to a full daedalus_ctx_create(). Lets the libavcodec substitution shims in marfrit-packages (which pthread_once a create_no_qpu ctx — see libavcodec/aarch64/h264_idct_daedalus.c) fire QPU paths without rebuilding those patches. Firefox / mpv consumers stay on the Vulkan-free path by default (env var unset). The daedalus-v4l2 daemon will set DAEDALUS_FORCE_QPU=1 explicitly before dlopen'ing libavcodec (separate daedalus-v4l2 follow-up). 
Smoke (hertz, Pi 5, kernel 6.18.29): === test_api_h264 === H264_IDCT4 recipe substrate: 1 (1=CPU, 2=QPU) H264_IDCT8 recipe substrate: 1 H264_DEBLOCK_LV recipe substrate: 2 ← flipped H264_QPEL_MC20 recipe substrate: 1 H.264 IDCT 4x4: 2048/2048 bytes bit-exact H.264 IDCT 8x8: 2048/2048 bytes bit-exact H.264 deblock luma v: 2048/2048 bytes bit-exact ← QPU path H.264 qpel mc20: 1024/1024 bytes bit-exact === test_api_idct === all substrates (CPU/QPU/AUTO) bit-exact === test_api_lpf === all substrates bit-exact wd=4 and wd=8 The dispatch wrapper's fall-through logic (eff == SUBSTRATE_QPU && !ctx_has_qpu(ctx) → eff = SUBSTRATE_CPU) handles the case where the recipe says QPU but the consumer didn't opt in — it falls back to CPU silently, no regression. Closes daedalus-fourier tasks #163, #164. Refs the 2026-05-23 "QPU default substrate" decree.	2026-05-23 19:59:53 +02:00
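The fall-through rule and the environment override described in this commit can be sketched as below. Everything here is hypothetical except the quoted condition (QPU selected but the context has no QPU means CPU), the DAEDALUS_FORCE_QPU variable name, and the 1=CPU / 2=QPU numbering seen in the smoke output; the AUTO value, the struct and the function names are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* 1 = CPU and 2 = QPU match the smoke output; AUTO = 0 is an assumption. */
typedef enum { SUB_AUTO = 0, SUB_CPU = 1, SUB_QPU = 2 } substrate_sketch;

/* Stand-in for the real context: only tracks whether a QPU runner exists. */
typedef struct { int has_qpu; } ctx_sketch;

/* AUTO defers to the recipe table; a QPU choice on a QPU-less context
 * silently degrades to CPU, as the commit describes. */
static substrate_sketch resolve_substrate(const ctx_sketch *ctx,
                                          substrate_sketch requested,
                                          substrate_sketch recipe)
{
    substrate_sketch eff = (requested == SUB_AUTO) ? recipe : requested;
    if (eff == SUB_QPU && !ctx->has_qpu)
        eff = SUB_CPU;
    return eff;
}

/* Returns 1 when the process asked for QPU escalation via the environment. */
static int force_qpu_requested(void)
{
    const char *v = getenv("DAEDALUS_FORCE_QPU");
    return v != NULL && strcmp(v, "1") == 0;
}
```

Under this reading, a consumer that created its context without a QPU keeps working on the CPU path even after the recipe table flips, which is the no-regression property the commit claims.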


95 Commits