daedalus-fourier

106 Commits 3 Branches 0 Tags

Author	SHA1	Message	Date
marfrit	36eca40ff2	Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification including M4'' concurrent gate. Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons applied successfully). 2 YELLOW findings baked into phase4 §4: - stride >= 4 contract added alongside m.x >= 4 (finding 2) - assert(...) in bench made a MUST not a suggestion (finding 4) - V3D divergence-cost note: don't restructure to always-execute, masked lanes consume clock anyway (finding 3, informational) Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run (65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg, uint8_t SSBO, oob early-return discipline) baked in from start worked as expected. Performance: M2'' = 19.645 Medge/s (50.9 ns/edge) M3'' = 48.285 Medge/s (NEON baseline from phase3) R'' = 0.41 (ORANGE band - doesn't auto-close per cycle-1 calibration adjustment) shaderdb: 160 inst, 4 threads, 0 spills, 21 max-temps — shader is already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1 because there's nothing more to extract from the compiled shape. The 30x gap between theoretical instruction throughput and measured wall-clock is divergence-tax + memory latency, not compile quality. M4'' concurrent matrix on hertz (8s windows): NEON-1 LPF 41.131 Medge/s NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling (per-core 7-9; same bandwidth-saturation as cycle-1 F1) QPU only 14.299 Medge/s MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4 MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: oversubscribed mode hurts for lighter kernels (LPF -5.4% vs cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to "always N-1 NEON cores + QPU, never N + QPU". Phase 9 lessons (in phase7 §"Phase 9 lessons"): 1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations) 2. Phase 5 review pays off every cycle 3. R isolation misleading on bandwidth-saturated hardware 4. Oversubscription tax depends on kernel weight 5. shaderdb 4-threads/0-spills = compute not the bottleneck New artifacts: - src/v3d_lpf_h_4_8.comp — GLSL kernel - tests/bench_v3d_lpf.c — M1'' + M2'' harness with contract asserts + fm/hev pass-rate instrumentation - tests/bench_concurrent_lpf.c — M4'' pthread bench (mirrors bench_concurrent.c) - docs/k2_deblock_phase{4,5,7}.md — plan + review + verification Project verdict: continue. Cycle 3 candidates: MC interpolation (multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different neighborhood shape), or wd=8/wd=16 LPF variants. User to direct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:39:26 +00:00
marfrit	be7ff5587c	Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline Second kernel candidate per phase7_M4.md verdict "next-kernel cycle authorised". VP9 4-tap inner loop filter, horizontal direction, 8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline). Different workload shape from IDCT - boundary streaming, lighter compute per unit, per-row conditionals - tests whether QPU win generalises. docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules as cycle 1 (phase1.md), with the cycle-1 calibration adjustment: ORANGE band is no longer auto-close because M4 showed mixed > pure CPU even at modest R when CPU bandwidth-saturates. docs/k2_deblock_phase2.md - situation analysis. C reference already in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin, SHA-256 384e49e7...). PROVENANCE.md updated. docs/k2_deblock_phase3.md - NEON baseline: M1''_c bit-exact 100.0000 % (10000 random edges) M3'' throughput 48.285 Medge/s (20.7 ns/edge, single A76) per-frame 1080p-eq 748 FPS (worst case 64 530 edges/frame) cycles/edge ~58 (=20.7ns x 2.8GHz), ~7 cycles/row LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9 - frame-level batching amortises the same 33us dispatch overhead; workload becomes bandwidth-bound rather than compute-bound (~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge). New artifacts: - tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4 inner only; clean transcription) - tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench (5s window, edge-content-biased RNG for realistic fm/hev hit rates) - external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S - CMakeLists.txt updated with bench_neon_lpf target Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md + phase5.md + phase7.md learnings apply directly - bake in the v4 winning patterns from the start (WG=256, edges-per-subgroup pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled writes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:28:57 +00:00
marfrit	8182e43c15	Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues The YELLOW-band gate test from phase1.md (concurrent CPU+QPU vs pure-CPU baseline). tests/bench_concurrent.c is a pthread harness that runs N NEON workers (pinned to cores 0..N-1) and optionally a QPU dispatch loop on its own pinned thread; 8s time-based windows; sums per-worker block counts. Raw results on hertz (1920x1088, 32640 blocks/dispatch): Config Mblock/s 1080p FPS-eq NEON 1-core 12.623 389.6 NEON 4-core 7.074 218.3 <- realistic CPU ceiling QPU only 6.890 212.7 MIXED NEON-3 + QPU 7.583 234.0 <- +7.2 % over NEON-4 MIXED NEON-4 + QPU 7.739 238.9 <- +9.4 % oversubscribed Headline findings beyond the gate test itself: F1 — Pi 5 LPDDR4x saturates well before 4-core CPU scaling. NEON-1 (12.6) > NEON-4 (7.1): 4 cores deliver 0.56x the per-core throughput, not 4x. The realistic CPU ceiling for memory-bound IDCT work is ~7 Mblock/s aggregate, not the ~32 Mblock/s a naive 4x scaling would predict. This recasts phase7.md's R=0.92 framing: the right baseline is "4-core NEON saturated", which the QPU effectively matches (6.89 vs 7.07) on its own. F2 — QPU contributes meaningfully BECAUSE it doesn't fully share the CPU's bandwidth bottleneck (own access channel + v3d L2 cache partially insulate it). Mixed adds the QPU's 0.51 Mblock/s on top of an already-saturated CPU. F3 — Oversubscribed mode (NEON-4 + QPU) is not harmful — per-NEON drops slightly but QPU adds more than the loss. Net +9 %. F4 — Freed-core story is bigger than the throughput delta. In mixed NEON-3+QPU, the 4th core is 100 % free for entropy decode (Bool coder, ANS) which MUST run on CPU. Pure NEON-4 has nothing left. Realistic decode pipeline gets more like a 30-50 % effective throughput uplift, not just 7 %. Verdict per phase1.md YELLOW-band rule (mixed > pure-CPU): PASS. Project continues to next-kernel cycle (recommend deblocking or CDEF — same "small parallel block-level" workload class that amortises the same M4 wins). docs/phase7_M4.md captures the full M4 harness design, all 5 configs raw output, and the leaves-open items: M7 wall-power via Himbeere plug, sustained-thermal test, realistic-bitstream coefficient distribution, multi-frame async pipelining. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:18:36 +00:00
marfrit	d66f22f333	Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa v3dv compute. Five iterations through a Phase 7→Phase 4' loopback; production kernel is v4. New files: - src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance, V3D device picker, HOST_VISIBLE\|COHERENT SSBOs with mmap, compute pipeline from .spv, enables storageBuffer{8,16}BitAccess) - src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production: 256 invocations/WG, 2 blocks/subgroup (no idle lanes), uint8 dst SSBO (race-free per phase5 finding 5), unrolled writes (no chained ternary), oob-flag pattern (barrier-safe per phase5 finding 7) - tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref - docs/phase7.md — full iteration journey + decision verdict CMakeLists.txt updated to build the new shader, library, and bench when DAEDALUS_BUILD_VULKAN=ON. Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3): ver change R ns/block v1 first-light 0.230 533 v2 kill ternary + 2-blocks-per-sg 0.474 258 v3 per-pass scope oN 0.481 254 (noise) v4 WG 64 -> 256 invocations 0.947 129 v5 packed uint32 coeff reads 0.938 130 (noise, reverted) v4 final N=3 0.918 +/- 0.033 Bit-exactness 100.0000% across all iterations (10000-block sample on 128x128, 32640-block sample on 1080p) against both the C reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON ff_vp9_idct_idct_8x8_add_neon. Key learning over the Phase 5 review's prediction model: the chained ternary was NOT a spill killer on V3D 7.1 (shaderdb showed 0:0 spills:fills even in v1). The actual lever was workgroup-size-driven latency hiding — going from 64 to 256 invocations doubled throughput with the same compiled code (270 inst, 2 threads, 21 max-temps, 0 spills) because the v3dv scheduler had 4x more in-flight work to overlap TMU latency. Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0) by a wide margin, near GREEN boundary. Phase 1 YELLOW rule: add M4 (concurrent CPU+QPU throughput) before honest-close or continue. M4 is the next measurement, not more shader tuning — at R = 0.92 with all 4 A76 cores still 100% free for other work, the question is whether the system aggregate beats pure 4-core NEON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:09:00 +00:00
marfrit	71db72928f	Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS) Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch constraint. Frame-level dispatch (8160 WGs for 1080p), 64 invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s (BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules. Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent, no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class findings + 1 YELLOW that this commit applies: RED finding 5 (dst race condition): int32_t[] dst with 4 lanes writing to overlapping 32-bit words = non-atomic concurrent writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess (verified exposed). Applied to phase4.md sec 5 + GLSL declaration. RED finding 7 (early-return before barrier): if (block_idx >= n_blocks) return; ahead of barrier() is UB by Vulkan spec. For 1080p (32640 blocks, /4) no partial WGs; for any frame width not /32 there are. Fix: oob flag, gate work bodies, barrier() unconditional. Applied to phase4.md sec 4 pseudocode. YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only; actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+ SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any subgroup ops in v1 so unaffected, but the wrong constraint would mislead Phase 6/7. Corrected in phase0.md sec 2, phase2.md sec 6, phase4.md sec 1 (C4). GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW prediction, compute envelope accounting) accepted as-is or deferred to Phase 7 M6 sweep per plan's existing flagging. Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6 implementation. No re-review needed; revisions are mechanical and address verified bugs/errors." Phase 5 itself just paid for itself: two real UB bugs caught before any GLSL was written. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:47:03 +00:00
marfrit	dcbbc77038	Path B pivot + Phase 0-3 closed with first baseline numbers This is a from-scratch initial commit on a fresh .git. The original scaffold commit (`7510b56`) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:30:12 +00:00

106 Commits