daedalus-fourier

Author	SHA1	Message	Date
marfrit	5c8b09349c	Cycle 9 closed: H.264 luma qpel mc20 = 131 Mblock/s NEON, CPU-only Last unmeasured H.264 kernel. mc20 picked as representative (horizontal half-pel, 6-tap filter; canonical for the H.264 luma qpel family). M1 PASS 10000/10000 first try, M3 = 131.477 Mblock/s on a single core (7.6 ns/block), 135x the 1080p30 floor. Per the cycles 6+7 lightweight-kernel rationale, Phase 4 deferred: QPU dispatch floor (~250 ns/block) is 33x above the NEON per-block cost; R9 ≈ 0.03 deep RED. No realistic QPU offload value. Generalization: all H.264 luma MC variants (mc02, mc11, mc22, etc.) will share this verdict. No need to measure each variant individually. H.264 NEON is dramatically faster than VP9 NEON across the board: - IDCT 4x4: 175 vs N/A (no VP9 analog) - IDCT 8x8: 151 vs 8.2 Mblock/s (18x faster) - MC 6/8-tap: 131 vs 7.0 (19x faster) - Deblock: 92 vs 48 Medge/s (2x faster) H.264 deployment recipe: all CPU NEON except deblock (opportunistic QPU). On a Pi 5 running H.264-only, the QPU is mostly idle. Cycles 1-9 complete. Public API exposes all 9. Next: daedalus-v4l2 sibling repo per locked Phase 8 architecture (B + γ + sibling), then README polish. - external/ffmpeg-snapshot/libavcodec/aarch64/h264qpel_neon.S vendored (1467 lines, all qpel variants) - tests/h264_qpel8_mc20_ref.c: 40-line C ref (clip255 of 6-tap convolution) - tests/bench_neon_h264qpel_mc20.c: M1 + M3 bench - docs/k9_h264qpel_mc20.md: cycle 9 closure with comparison matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:53:21 +00:00
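The mc20 C reference mentioned in this commit is described only as "clip255 of 6-tap convolution". A minimal sketch of that idea follows, assuming the standard H.264 half-pel taps (1, -5, 20, 20, -5, 1) with rounding offset 16 and a right shift by 5. The function name `h264_qpel8_mc20_sketch` and its argument order are hypothetical and are not taken from tests/h264_qpel8_mc20_ref.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate filter result to the 8-bit pixel range. */
static inline uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-pel interpolation for one 8x8 luma block (hypothetical name).
 * Assumption: src has 2 readable pixels left of column 0 and 3 right of
 * column 7 on every row, as the 6-tap window requires. */
static void h264_qpel8_mc20_sketch(uint8_t *dst, const uint8_t *src,
                                   ptrdiff_t dst_stride, ptrdiff_t src_stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            /* 6-tap window centred between src[x] and src[x + 1]. */
            int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                    + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
            /* The taps sum to 32, so add 16 and shift by 5 to round.
             * Negative sums rely on arithmetic right shift and clamp to 0. */
            dst[x] = clip255((sum + 16) >> 5);
        }
        dst += dst_stride;
        src += src_stride;
    }
}
```

Because the taps sum to 32, a flat input region passes through unchanged, which makes a convenient sanity check before running a bit-exact comparison against the NEON routine.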
marfrit	5a085e7180	Cycle 8 (H.264 deblock) opened — Phase 1 + NEON vendored Targets the one H.264 kernel most likely to be QPU-worthy: in-loop deblock. Cycles 6 and 7 (IDCT 4x4 and 8x8) both came in CPU-only because H.264 transforms are NEON-trivial. H.264 deblock has analogous structure to VP9 LPF (cycles 2+4, both GREEN) so predicted R8 = ORANGE/YELLOW. This commit: - Vendors ff_h264__loop_filter__neon from h264dsp_neon.S (1076 lines, includes both v/h luma + chroma + intra variants + weight/biweight) - PROVENANCE.md updated with the new vendored file - Phase 1 doc captures the full plan: start with luma vertical non-intra (most common case), defer Phase 3+ to next session H.264 deblock C ref scope is ~2 hours (per-row branching, per-4-row-segment tc0, ap/aq side conditions, alpha/beta thresholds — much more complex than VP9 LPF wd=4's single-branch filter). Deferring to fresh attention next session rather than rushing now. After cycle 8 closes, the H.264 QPU surface is well-characterised and the cycles-1-8 inventory drives the Phase 8 V4L2 wrapper's substrate-routing recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:18:19 +00:00
marfrit	f92dc40f43	Cycle 6 (H.264) opened — IDCT 4x4 Phase 1+3, M3 = 175 Mblock/s H.264 scope added 2026-05-18 per user direction. Pi 5's VideoCore VII has no hardware H.264 decoder block (only HEVC), so a QPU-accelerated H.264 path fills the most impactful codec gap. Cycle 6 = first H.264 kernel (4x4 IDCT + add, smallest H.264 transform, simplest first cycle). Phase 1: goal doc + 1080p30 floor analysis (5.85 Mblock/s worst-case, 2.0 Mblock/s realistic since most MBs use 8x8 or P-skip). Phase 3: NEON M3 baseline captured. ff_h264_idct_add_neon on hertz delivers 175 Mblock/s (5.7 ns per block) = 30x worst-case floor margin. H.264 IDCT 4x4 is dramatically lighter than VP9 IDCT 8x8 (21x faster per block). Phase 3 closure also caught the key Phase 9 lesson: H.264/FFmpeg blocks are COLUMN-MAJOR (block[c*4 + r] = (row=r, col=c)). NEON ld1 with 4 registers interleaves loading, and the FFmpeg C ref indexing makes this convention explicit. Initial C ref assumed row-major, M1 was 5% bit-exact; after fix, M1 = 100%. Convention encoded for all subsequent H.264 cycles (cycle 7+). - external/ffmpeg-snapshot/libavcodec/aarch64/h264idct_neon.S (vendored verbatim from FFmpeg n7.1.3, 415 lines) - external/ffmpeg-snapshot/PROVENANCE.md: updated - tests/h264_idct4_ref.c: column-major C ref - tests/bench_neon_h264idct4.c: M1 + M3 bench - CMakeLists.txt: cycle 6 NEON bench wiring - docs/k6_h264idct4_phase1.md, phase3.md Phase 4 next: QPU shader for cycle 6. Predicted R6 = 0.01 (deep RED — kernel too small relative to QPU dispatch overhead) but worth building for cycle-completeness + the opportunistic-helper hypothesis (cycle 6 may stay CPU per recipe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:14:43 +00:00
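The column-major convention recorded in this commit (block[c*4 + r] holds row r, column c) can be illustrated with a small sketch. The helper names `coeff_at` and `row_major_to_column_major_4x4` are hypothetical and only show the indexing; they are not part of tests/h264_idct4_ref.c.

```c
#include <stdint.h>

/* Read the coefficient at (row r, col c) from a 4x4 block stored
 * column-major, i.e. block[c*4 + r] as described in the commit. */
static inline int16_t coeff_at(const int16_t block[16], int r, int c)
{
    return block[c * 4 + r];
}

/* Re-pack a row-major 4x4 block (src[r*4 + c]) into the column-major
 * layout. This is a plain transpose of the 16 coefficients. */
static void row_major_to_column_major_4x4(int16_t dst[16], const int16_t src[16])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[c * 4 + r] = src[r * 4 + c];
}
```

A reference that indexes the same buffer as row-major effectively transforms the transposed block, which is consistent with the mostly failing bit-exact result the commit reports before the fix.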
marfrit	20b59cd6a5	Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred CDEF is the most compute-intensive kernel measured so far — 254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x even on single NEON core in isolation. M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact gate failing due to tmp-layout mismatch between my standalone C reference and dav1d's NEON expectation. The smoking gun: NEON output appears at (+2 rows, -2 cols) shifted positions vs C ref output — suggests NEON's padding-function output has a different convention than my manual tmp construction. Untangled in setup work: - dav1d has TWO directions tables: stride-12 in src/tables.c (C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side). Initially vendored the C-side; should have used the NEON-side. - dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon (a separate function with its own conventions), not the C-side padding() function from cdef_tmpl.c. - Updated cdef_ref.c to use NEON-layout (stride 16) with table transcribed from cdef_tmpl.S. Algorithm matches — but bench's manual tmp construction doesn't match what NEON expects. Resolution paths for next session (documented in docs/k5_cdef_phase3_partial.md §'Resolution paths'): 1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest) 2. Vendor dav1d's full C reference (most rigorous) 3. Reverse-engineer dav1d's padding output layout (hackiest) Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED). CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound kernels don't benefit from QPU offload). 30fps floor still passes regardless. 
New artifacts: - external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additionally vendored) - external/dav1d-snapshot/config.h - 14-define asm preamble shim - tests/cdef_ref.c - standalone C ref (algorithmically correct; known layout mismatch with NEON) - tests/bench_neon_cdef.c - bench (M1 gate downgraded to a warning, M3 captured) - docs/k5_cdef_phase3_partial.md - phase 3 partial closure + resumption checklist. The dav1d snapshot entry in PROVENANCE.md should be updated next session with the new cdef_tmpl.S entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:21:24 +00:00
marfrit	2cd2258a7b	Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2 docs lay out the structural complexity (CDEF needs pre-padded 12x12 working buffer + external edge context + direction lookup + constraint function — meaningfully more complex than cycles 1-4). Phase 3+ deferred to next session — CDEF is the first cycle that doesn't fit cleanly into a single autonomous run. Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than FFmpeg's LGPL-2.1+): src/arm/64/cdef.S 520 lines — NEON impl src/arm/64/util.S 278 lines — NEON helpers src/arm/asm.S 335 lines — GAS preamble src/cdef_tmpl.c 331 lines — C reference (templated) include/common/intops.h 84 lines — utility helpers src/tables_cdef_subset.c hand-extracted — dav1d_cdef_directions only (avoids dragging full 1013-line tables.c + transitive includes) Discovery from Phase 2 analysis: - Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes (dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h). The 'tmp' arg is the pre-padded 12x12 buffer constructed externally by the dav1d C-side padding() function. - Tap weights are inline-computed (not table): pri_tap = 4 or 3 (based on pri_strength bit), sec_tap = 2 or 1. Only dav1d_cdef_directions[12][2] is an external table. - Constraint function: constrain(diff, threshold, shift) = apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))), diff) Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than LPF (per-pixel min/max conditional logic), so likely worse R than cycle 2/4 but better than cycle 3 MC. M4 gate likely required. What Phase 3+ needs (next session): 1. config.h shim for dav1d's asm preamble (defines TBD on first build) 2. Standalone C reference for cdef_filter_block_8x8_c (cdef_tmpl.c references several dav1d private headers; cleaner to transcribe to a self-contained tests/cdef_ref.c) 3. tests/bench_neon_cdef.c — M1+M3 bench 4. 
Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure PROVENANCE.md documents pin + per-file role + re-vendoring procedure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:12:25 +00:00
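The constraint function quoted in this commit can be transcribed almost directly into C. The sketch below follows the formula as written in the message; the helpers `imin`, `imax` and `apply_sign` are spelled out locally as assumptions so the block stands alone, and `shift` is assumed to be non-negative.

```c
#include <stdlib.h>

static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }

/* Give v the sign of s (v is expected to be non-negative here). */
static inline int apply_sign(int v, int s) { return s < 0 ? -v : v; }

/* constrain(diff, threshold, shift) =
 *   apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))), diff)
 * Small differences pass through; large ones are attenuated towards zero. */
static inline int constrain(int diff, int threshold, int shift)
{
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}
```

For example, with threshold 4 and shift 1 a difference of 3 is kept as 3, while a difference of -10 is suppressed to 0 because threshold - (10 >> 1) is negative.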
marfrit	356e446a49	Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% Third daedalus-fourier kernel — VP9 8-tap regular subpel filter, horizontal direction, 8-wide output. Multiply-heavy by design to stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''. Phase 5''' second-model review delivered cleanly — caught 1 RED bug pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate). Without the review, M1''' would have failed silently on first run with cryptic "high-index source pixels wrong" symptoms. Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks across all 16 mx phases). Phase 5''' filter-LUT prediction materialised exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4 due to register pressure). Performance: M2''' = 1.413 Mblock/s (707.9 ns/block) M3''' = 20.997 Mblock/s (NEON baseline phase3) R''' = 0.067 (RED band — structural mismatch) shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills M4''' concurrent matrix (8s windows): NEON 1-core 14.479 Mblock/s NEON 4-core 15.248 Mblock/s <- baseline (compute-bound, not bandwidth-saturated like cycles 1+2!) QPU only 1.380 Mblock/s MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate) MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3% NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so freeing a CPU core via QPU offload added throughput. Cycle 3 MC is compute-bound (4-core scaling 1.05x of 1-core — near-linear), no free cycles to free. QPU contribution (0.45 Mblock/s in contention) doesn't compensate for losing 1 NEON core delivering ~3.8 Mblock/s. But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation margin). Per project_30fps_floor_is_fine.md, user-facing test never fails — daily YouTube playback works fine on any CPU/QPU split. 
DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split): IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core) LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core) MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU) Entropy -> CPU (structurally serial) Mixed-substrate deployment, not "QPU does everything". Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode etc. New artifacts: - src/v3d_mc_8h.comp — GLSL kernel - tests/vp9_mc_ref.c — standalone C ref (REGULAR filter embedded; clean transcription) - tests/bench_neon_mc.c — M1'''_c + M3''' bench - tests/bench_v3d_mc.c — M1''' + M2''' bench with contract asserts + 30fps margin display - tests/bench_concurrent_mc.c — M4''' pthread bench - external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored) - external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c (hand-extracted; provides ff_vp9_subpel_filters symbol without dragging in full vp9dsp.c) - docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation Memory updates: project_30fps_floor_is_fine.md (user's 30fps target recalibration), MEMORY.md index updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:51:43 +00:00
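The R and M4 figures in this commit are simple ratios of the measured throughputs. A minimal sketch of the arithmetic follows; the function names are hypothetical, and the R-band thresholds themselves live in docs/phase1.md and are deliberately not reproduced here.

```c
/* R = M2 / M3: QPU throughput relative to the single-core NEON baseline.
 * Both arguments must use the same unit (e.g. Mblock/s). */
static double r_ratio(double m2_qpu, double m3_neon)
{
    return m2_qpu / m3_neon;
}

/* M4 delta: percentage change of a mixed NEON+QPU configuration
 * relative to the NEON-only baseline. Negative means a regression. */
static double mixed_delta_pct(double mixed, double baseline)
{
    return 100.0 * (mixed - baseline) / baseline;
}
```

With the numbers reported above, `r_ratio(1.413, 20.997)` is about 0.067, and `mixed_delta_pct(12.277, 15.248)` is about -19.5, matching the R''' and MIXED NEON-3 + QPU entries.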
marfrit	be7ff5587c	Cycle 2 (deblocking) Phase 1-3: M3'' = 48.285 Medge/s baseline Second kernel candidate per phase7_M4.md verdict "next-kernel cycle authorised". VP9 4-tap inner loop filter, horizontal direction, 8-pixel edge (libavcodec ff_vp9_loop_filter_h_4_8_neon as baseline). Different workload shape from IDCT - boundary streaming, lighter compute per unit, per-row conditionals - tests whether QPU win generalises. docs/k2_deblock_phase1.md - goal-setting. Same R-band decision rules as cycle 1 (phase1.md), with the cycle-1 calibration adjustment: ORANGE band is no longer auto-close because M4 showed mixed > pure CPU even at modest R when CPU bandwidth-saturates. docs/k2_deblock_phase2.md - situation analysis. C reference already in vendored snapshot (vp9dsp_template.c:1780-1898). Fetched vp9lpf_neon.S fresh (1334 lines, LGPL-2.1+, FFmpeg n7.1.3 pin, SHA-256 384e49e7...). PROVENANCE.md updated. docs/k2_deblock_phase3.md - NEON baseline: M1''_c bit-exact 100.0000 % (10000 random edges) M3'' throughput 48.285 Medge/s (20.7 ns/edge, single A76) per-frame 1080p-eq 748 FPS (worst case 64 530 edges/frame) cycles/edge ~58 (=20.7ns x 2.8GHz), ~7 cycles/row LPF is 5.9x faster per-unit than IDCT M3 (20.7 vs 122 ns), so the QPU break-even point moves down. Predicted R''_v1 band ~0.5-0.9 - frame-level batching amortises the same 33us dispatch overhead; workload becomes bandwidth-bound rather than compute-bound (~5.7 MB/frame traffic at 64 530 edges x ~88 B per edge). New artifacts: - tests/vp9_lpf_ref.c - standalone bit-exact C ref (8-bit, wd=4 inner only; clean transcription) - tests/bench_neon_lpf.c - M1''_c gate + M3'' time-based bench (5s window, edge-content-biased RNG for realistic fm/hev hit rates) - external/ffmpeg-snapshot/libavcodec/aarch64/vp9lpf_neon.S - CMakeLists.txt updated with bench_neon_lpf target Phase 4 next: plan the QPU LPF compute shader. 
Cycle 1's phase4.md + phase5.md + phase7.md learnings apply directly - bake in the v4 winning patterns from the start (WG=256, edges-per-subgroup pattern adapted from blocks, uint8_t dst SSBO, oob flag, unrolled writes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:28:57 +00:00
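The per-edge, per-frame and traffic figures in the phase 3 summary all derive from the single M3'' throughput number. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the inputs (64 530 edges per frame, 2.8 GHz, roughly 88 B per edge) are taken from the commit message as stated.

```c
/* Throughput in mega-units per second -> nanoseconds per unit. */
static double ns_per_unit(double mega_units_per_s)
{
    return 1000.0 / mega_units_per_s;
}

/* Frames per second sustained at a given throughput and per-frame load. */
static double frames_per_second(double mega_units_per_s, double units_per_frame)
{
    return mega_units_per_s * 1e6 / units_per_frame;
}

/* CPU cycles per unit at a given clock in GHz (ns * cycles-per-ns). */
static double cycles_per_unit(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}

/* Memory traffic per frame in bytes. */
static double bytes_per_frame(double units_per_frame, double bytes_per_unit)
{
    return units_per_frame * bytes_per_unit;
}
```

Plugging in 48.285 Medge/s gives about 20.7 ns per edge, about 748 frames per second at 64 530 edges per frame, about 58 cycles per edge at 2.8 GHz, and about 5.7 MB of traffic per frame.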
marfrit	dcbbc77038	Path B pivot + Phase 0-3 closed with first baseline numbers This is a from-scratch initial commit on a fresh .git. The original scaffold commit (`7510b56`) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:30:12 +00:00
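The batching constraint stated in the Phase 3 summary can be made concrete by expressing a dispatch's fixed cost in units of NEON block time. The sketch below is illustrative only: the function names are hypothetical, and the batch size used in the note (32 400 blocks, the number of 8x8 blocks in a 1920x1080 luma plane) is an assumption chosen for the example, not a figure from the commit.

```c
/* How many NEON-processed blocks fit in one dispatch's fixed overhead. */
static double overhead_in_blocks(double overhead_us, double ns_per_block)
{
    return overhead_us * 1000.0 / ns_per_block;
}

/* Fixed dispatch overhead per block when `batch` blocks share one dispatch. */
static double amortised_overhead_ns(double overhead_us, double batch)
{
    return overhead_us * 1000.0 / batch;
}
```

With the measured values, `overhead_in_blocks(55.60, 122.4)` is about 454, which appears to be the basis of the ~455x figure (the full M5b time), while the 32.95 us delta alone corresponds to about 269 blocks. Spreading the 32.95 us delta over a frame-sized batch of 32 400 blocks leaves roughly 1 ns of overhead per block, which is why frame-level batching is called for.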

8 Commits