daedalus-fourier

Author	SHA1	Message	Date
marfrit	20b59cd6a5	Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred CDEF is the most compute-intensive kernel measured so far — 254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x even on single NEON core in isolation. M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact gate failing due to tmp-layout mismatch between my standalone C reference and dav1d's NEON expectation. The smoking gun: NEON output appears at (+2 rows, -2 cols) shifted positions vs C ref output — suggests NEON's padding-function output has a different convention than my manual tmp construction. Untangled in setup work: - dav1d has TWO directions tables: stride-12 in src/tables.c (C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side). Initially vendored the C-side; should have used the NEON-side. - dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon (a separate function with its own conventions), not the C-side padding() function from cdef_tmpl.c. - Updated cdef_ref.c to use NEON-layout (stride 16) with table transcribed from cdef_tmpl.S. Algorithm matches — but bench's manual tmp construction doesn't match what NEON expects. Resolution paths for next session (documented in docs/k5_cdef_phase3_partial.md §'Resolution paths'): 1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest) 2. Vendor dav1d's full C reference (most rigorous) 3. Reverse-engineer dav1d's padding output layout (hackiest) Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED). CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound kernels don't benefit from QPU offload). 30fps floor still passes regardless. New artifacts: - external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored) - external/dav1d-snapshot/config.h — 14-define asm preamble shim - tests/cdef_ref.c — standalone C ref (algorithmically correct, layout mismatch with NEON known) - tests/bench_neon_cdef.c — bench (M1 made warning, M3 captured) - docs/k5_cdef_phase3_partial.md — phase 3 partial closure + resumption checklist dav1d snapshot in PROVENANCE.md should be updated next session with the new cdef_tmpl.S entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:21:24 +00:00
marfrit	85feba4087	Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: \| \| wd=4 (k2) \| wd=8 (k4) \| \| M3 NEON \| 48.3 \| 52.4 \| \| M2 QPU iso \| 19.6 \| 17.8 \| \| R iso \| 0.41 \| 0.34 \| \| M4 delta \| +6.9% \| +4.1% \| \| 30fps mixed \| 7.2x \| 20.3x \| \| Verdict \| GO QPU \| GO QPU \| NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:56:25 +00:00
marfrit	356e446a49	Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% Third daedalus-fourier kernel — VP9 8-tap regular subpel filter, horizontal direction, 8-wide output. Multiply-heavy by design to stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''. Phase 5''' second-model review delivered cleanly — caught 1 RED bug pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate). Without the review, M1''' would have failed silently on first run with cryptic "high-index source pixels wrong" symptoms. Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks across all 16 mx phases). Phase 5''' filter-LUT prediction materialised exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4 due to register pressure). Performance: M2''' = 1.413 Mblock/s (707.9 ns/block) M3''' = 20.997 Mblock/s (NEON baseline phase3) R''' = 0.067 (RED band — structural mismatch) shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills M4''' concurrent matrix (8s windows): NEON 1-core 14.479 Mblock/s NEON 4-core 15.248 Mblock/s <- baseline (compute-bound, not bandwidth-saturated like cycles 1+2!) QPU only 1.380 Mblock/s MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate) MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3% NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so freeing a CPU core via QPU offload added throughput. Cycle 3 MC is compute-bound (4-core scaling 1.05x of 1-core — near-linear), no free cycles to free. QPU contribution (0.45 Mblock/s in contention) doesn't compensate for losing 1 NEON core delivering ~3.8 Mblock/s. But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation margin). Per project_30fps_floor_is_fine.md, user-facing test never fails — daily YouTube playback works fine on any CPU/QPU split. DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split): IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core) LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core) MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU) Entropy -> CPU (structurally serial) Mixed-substrate deployment, not "QPU does everything". Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode etc. New artifacts: - src/v3d_mc_8h.comp — GLSL kernel - tests/vp9_mc_ref.c — standalone C ref (REGULAR filter embedded; clean transcription) - tests/bench_neon_mc.c — M1'''_c + M3''' bench - tests/bench_v3d_mc.c — M1''' + M2''' bench with contract asserts + 30fps margin display - tests/bench_concurrent_mc.c — M4''' pthread bench - external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored) - external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c (hand-extracted; provides ff_vp9_subpel_filters symbol without dragging in full vp9dsp.c) - docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation Memory updates: project_30fps_floor_is_fine.md (user's 30fps target recalibration), MEMORY.md index updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:51:43 +00:00
marfrit	36eca40ff2	Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification including M4'' concurrent gate. Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons applied successfully). 2 YELLOW findings baked into phase4 §4: - stride >= 4 contract added alongside m.x >= 4 (finding 2) - assert(...) in bench made a MUST not a suggestion (finding 4) - V3D divergence-cost note: don't restructure to always-execute, masked lanes consume clock anyway (finding 3, informational) Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run (65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg, uint8_t SSBO, oob early-return discipline) baked in from start worked as expected. Performance: M2'' = 19.645 Medge/s (50.9 ns/edge) M3'' = 48.285 Medge/s (NEON baseline from phase3) R'' = 0.41 (ORANGE band - doesn't auto-close per cycle-1 calibration adjustment) shaderdb: 160 inst, 4 threads, 0 spills, 21 max-temps — shader is already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1 because there's nothing more to extract from the compiled shape. The 30x gap between theoretical instruction throughput and measured wall-clock is divergence-tax + memory latency, not compile quality. M4'' concurrent matrix on hertz (8s windows): NEON-1 LPF 41.131 Medge/s NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling (per-core 7-9; same bandwidth-saturation as cycle-1 F1) QPU only 14.299 Medge/s MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4 MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: oversubscribed mode hurts for lighter kernels (LPF -5.4% vs cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to "always N-1 NEON cores + QPU, never N + QPU". Phase 9 lessons (in phase7 §"Phase 9 lessons"): 1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations) 2. Phase 5 review pays off every cycle 3. R isolation misleading on bandwidth-saturated hardware 4. Oversubscription tax depends on kernel weight 5. shaderdb 4-threads/0-spills = compute not the bottleneck New artifacts: - src/v3d_lpf_h_4_8.comp — GLSL kernel - tests/bench_v3d_lpf.c — M1'' + M2'' harness with contract asserts + fm/hev pass-rate instrumentation - tests/bench_concurrent_lpf.c — M4'' pthread bench (mirrors bench_concurrent.c) - docs/k2_deblock_phase{4,5,7}.md — plan + review + verification Project verdict: continue. Cycle 3 candidates: MC interpolation (multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different neighborhood shape), or wd=8/wd=16 LPF variants. User to direct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:39:26 +00:00
marfrit	be7ff5587c	Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline Second kernel candidate per phase7_M4.md verdict "next-kernel cycle authorised". VP9 4-tap inner loop filter, horizontal direction, 8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline). Different workload shape from IDCT - boundary streaming, lighter compute per unit, per-row conditionals - tests whether QPU win generalises. docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules as cycle 1 (phase1.md), with the cycle-1 calibration adjustment: ORANGE band is no longer auto-close because M4 showed mixed > pure CPU even at modest R when CPU bandwidth-saturates. docs/k2_deblock_phase2.md - situation analysis. C reference already in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin, SHA-256 384e49e7...). PROVENANCE.md updated. docs/k2_deblock_phase3.md - NEON baseline: M1''_c bit-exact 100.0000 % (10000 random edges) M3'' throughput 48.285 Medge/s (20.7 ns/edge, single A76) per-frame 1080p-eq 748 FPS (worst case 64 530 edges/frame) cycles/edge ~58 (=20.7ns x 2.8GHz), ~7 cycles/row LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9 - frame-level batching amortises the same 33us dispatch overhead; workload becomes bandwidth-bound rather than compute-bound (~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge). New artifacts: - tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4 inner only; clean transcription) - tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench (5s window, edge-content-biased RNG for realistic fm/hev hit rates) - external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S - CMakeLists.txt updated with bench_neon_lpf target Phase 4 next: plan the QPU LPF compute shader. Cycle 1's phase4.md + phase5.md + phase7.md learnings apply directly - bake in the v4 winning patterns from the start (WG=256, edges-per-subgroup pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled writes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:28:57 +00:00
marfrit	8182e43c15	Phase 7 M4: mixed CPU+QPU beats pure 4-core NEON; project continues The YELLOW-band gate test from phase1.md (concurrent CPU+QPU vs pure-CPU baseline). tests/bench_concurrent.c is a pthread harness that runs N NEON workers (pinned to cores 0..N-1) and optionally a QPU dispatch loop on its own pinned thread; 8s time-based windows; sums per-worker block counts. Raw results on hertz (1920x1088, 32640 blocks/dispatch): Config Mblock/s 1080p FPS-eq NEON 1-core 12.623 389.6 NEON 4-core 7.074 218.3 <- realistic CPU ceiling QPU only 6.890 212.7 MIXED NEON-3 + QPU 7.583 234.0 <- +7.2 % over NEON-4 MIXED NEON-4 + QPU 7.739 238.9 <- +9.4 % oversubscribed Headline findings beyond the gate test itself: F1 — Pi 5 LPDDR4x saturates well before 4-core CPU scaling. NEON-1 (12.6) > NEON-4 (7.1): 4 cores deliver 0.56x the per-core throughput, not 4x. The realistic CPU ceiling for memory-bound IDCT work is ~7 Mblock/s aggregate, not the ~32 Mblock/s a naive 4x scaling would predict. This recasts phase7.md's R=0.92 framing: the right baseline is "4-core NEON saturated", which the QPU effectively matches (6.89 vs 7.07) on its own. F2 — QPU contributes meaningfully BECAUSE it doesn't fully share the CPU's bandwidth bottleneck (own access channel + v3d L2 cache partially insulate it). Mixed adds the QPU's 0.51 Mblock/s on top of an already-saturated CPU. F3 — Oversubscribed mode (NEON-4 + QPU) is not harmful — per-NEON drops slightly but QPU adds more than the loss. Net +9 %. F4 — Freed-core story is bigger than the throughput delta. In mixed NEON-3+QPU, the 4th core is 100 % free for entropy decode (Bool coder, ANS) which MUST run on CPU. Pure NEON-4 has nothing left. Realistic decode pipeline gets more like a 30-50 % effective throughput uplift, not just 7 %. Verdict per phase1.md YELLOW-band rule (mixed > pure-CPU): PASS. Project continues to next-kernel cycle (recommend deblocking or CDEF — same "small parallel block-level" workload class that amortises the same M4 wins). docs/phase7_M4.md captures the full M4 harness design, all 5 configs raw output, and the leaves-open items: M7 wall-power via Himbeere plug, sustained-thermal test, realistic-bitstream coefficient distribution, multi-frame async pipelining. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:18:36 +00:00
marfrit	d66f22f333	Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa v3dv compute. Five iterations through a Phase 7→Phase 4' loopback; production kernel is v4. New files: - src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance, V3D device picker, HOST_VISIBLE\|COHERENT SSBOs with mmap, compute pipeline from .spv, enables storageBuffer{8,16}BitAccess) - src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production: 256 invocations/WG, 2 blocks/subgroup (no idle lanes), uint8 dst SSBO (race-free per phase5 finding 5), unrolled writes (no chained ternary), oob-flag pattern (barrier-safe per phase5 finding 7) - tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref - docs/phase7.md — full iteration journey + decision verdict CMakeLists.txt updated to build the new shader, library, and bench when DAEDALUS_BUILD_VULKAN=ON. Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3): ver change R ns/block v1 first-light 0.230 533 v2 kill ternary + 2-blocks-per-sg 0.474 258 v3 per-pass scope oN 0.481 254 (noise) v4 WG 64 -> 256 invocations 0.947 129 v5 packed uint32 coeff reads 0.938 130 (noise, reverted) v4 final N=3 0.918 +/- 0.033 Bit-exactness 100.0000% across all iterations (10000-block sample on 128x128, 32640-block sample on 1080p) against both the C reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON ff_vp9_idct_idct_8x8_add_neon. Key learning over the Phase 5 review's prediction model: the chained ternary was NOT a spill killer on V3D 7.1 (shaderdb showed 0:0 spills:fills even in v1). The actual lever was workgroup-size-driven latency hiding — going from 64 to 256 invocations doubled throughput with the same compiled code (270 inst, 2 threads, 21 max-temps, 0 spills) because the v3dv scheduler had 4x more in-flight work to overlap TMU latency. Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0) by a wide margin, near GREEN boundary. Phase 1 YELLOW rule: add M4 (concurrent CPU+QPU throughput) before honest-close or continue. M4 is the next measurement, not more shader tuning — at R = 0.92 with all 4 A76 cores still 100% free for other work, the question is whether the system aggregate beats pure 4-core NEON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:09:00 +00:00
marfrit	dcbbc77038	Path B pivot + Phase 0-3 closed with first baseline numbers This is a from-scratch initial commit on a fresh .git. The original scaffold commit (`7510b56`) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:30:12 +00:00

8 Commits