daedalus-fourier

Author	SHA1	Message	Date
claude-noether	8fdef27a7d	Phase 8c: H.264 luma qpel mc20 through public API Extends daedalus-fourier with daedalus_recipe_dispatch_h264_qpel_mc20 so libavcodec.so can route H264QpelContext.put_h264_qpel_pixels_tab[1][2] through the recipe layer instead of ff_put_h264_qpel8_mc20_neon directly. API additions (header + library): - daedalus_h264_qpel_meta { dst_off, src_off } - daedalus_dispatch_h264_qpel_mc20(ctx, sub, dst, src, stride, n_blocks, meta) - daedalus_recipe_dispatch_h264_qpel_mc20(...) (AUTO wrapper) - DAEDALUS_KERNEL_H264_QPEL_MC20 = 9 in the recipe-query enum - daedalus_recipe_substrate_for() returns CPU NEON for cycle 9 The 6-tap horizontal half-pel filter signature matches FFmpeg's H264QpelContext convention exactly: dst and src share a single stride and src already points at output column 0 (filter reads cols -2..+3). Single-stride API to make the marfrit-packages FFmpeg shim a straight pointer-pass; no buffer rearrangement. Verdict per docs/k9_h264qpel_mc20.md: CPU NEON. Per-block 7.6 ns gives 135x margin over 30 fps 1080p; QPU dispatch floor at ~250 ns makes any V3D shader strictly worse. Recipe table reflects that — the recipe_dispatch entry is a one-line forward to the CPU path. CMakeLists changes: - h264qpel_neon.S added to the daedalus_core static lib (only the bench targets owned it before; now the public API needs it too) - tests/h264_qpel8_mc20_ref.c added to the test_api_h264 target Phase 8a/8b smoke gains a 4th case (test_qpel_mc20): 1024/1024 bytes bit-exact via daedalus_recipe_dispatch_h264_qpel_mc20. Refs reauktion/daedalus-v4l2#11 — substitution arc step 2 cycle 9.	2026-05-23 03:25:24 +02:00
marfrit	0a99b16489	Phase 8b: opportunistic QPU paths through public API Wires QPU dispatch for cycles 3 (VP9 MC), 5 (AV1 CDEF), 8 (H.264 deblock) through the public API. These three kernels have recipe substrate = CPU, but per Issue 003 the mixed-kernel helper value is real — the dispatch path must exist so override-mode callers can request QPU on the side. Pattern mirrors dispatch_idct8_qpu (lazy pipeline + per-call SSBO alloc + memcpy + dispatch + readback). Each kernel has its own push-constant struct (mc_pc 3-field, cdef_pc 3-field, deblock_pc 2-field shared with lpf). Notable bug caught + fixed in test_api_opportunistic_qpu: the initial dispatch_mc_8h_qpu sized src_max using CPU-side reach (src_off + 3 + 8 + 7stride), but the QPU shader reads src[ src_off + rowstride + 0..14] for row=0..7. Last block had 3 uninitialized bytes → 99.8% match → 100% after fix. After this commit, the public API surface fully covers cycles 1-8: Cycle 1 (IDCT 8x8): CPU + QPU + AUTO bit-exact Cycle 2 (LPF wd=4): CPU + QPU + AUTO bit-exact Cycle 3 (MC 8h): CPU recipe; QPU override bit-exact Cycle 4 (LPF wd=8): CPU + QPU + AUTO bit-exact Cycle 5 (CDEF): CPU recipe; QPU override (untested in this test — bench_v3d_cdef is the authoritative 3-way M1) Cycle 6 (H.264 IDCT 4x4): CPU only (no QPU shader by recipe) Cycle 7 (H.264 IDCT 8x8): CPU only Cycle 8 (H.264 deblock luma-v): CPU recipe; QPU override bit-exact Tests: test_api_opportunistic_qpu adds CPU-vs-QPU bit-exact comparison for VP9 MC and H.264 deblock through the API. test_api_idct, test_api_lpf, test_api_h264 still pass. Per the locked Phase 8 architecture (project_phase8_architecture memory): next session opens daedalus-v4l2 sibling repo with Option B (kernel V4L2 shim + userspace daemon), Option γ (dlopen FFmpeg parser). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:50:41 +00:00
marfrit	af8146a2cd	Phase 8a: H.264 kernels through public API Extends include/daedalus.h with cycles 6, 7, 8 (H.264 IDCT 4x4, IDCT 8x8, luma deblock luma-v). All recipe-substrate = CPU (matches per-cycle Phase 7 verdicts). src/daedalus_core.c: NEON-path implementations + recipe routing. daedalus_core library now links the full FFmpeg H.264 NEON snapshot (h264idct + h264dsp) plus existing VP9 + dav1d. tests/test_api_h264.c: smoke test covering all 3 H.264 kernels via daedalus_recipe_dispatch_*. All pass 2048/2048 bit-exact. Public API coverage after this commit: - Cycles 1 IDCT 8x8 + 2 LPF4 + 4 LPF8: CPU+QPU+AUTO dispatch (test_api_idct, test_api_lpf, both pass) - Cycle 3 MC 8h: CPU only (QPU dispatch stub returns -1) - Cycle 5 CDEF: CPU only (QPU stub) - Cycle 6 H.264 IDCT 4x4: CPU only (recipe + only NEON wired) - Cycle 7 H.264 IDCT 8x8: CPU only - Cycle 8 H.264 deblock: CPU only (QPU opportunistic — not wired through API yet; bench_v3d_h264deblock exists for direct test) Next Phase 8 sub-step: wire opportunistic QPU dispatch for cycles 3+5+8 through the API (so override-mode users can request QPU). Then surface V4L2-wrapper architecture decisions to user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:46:03 +00:00
marfrit	373f63a910	Cycle 8 closed: H.264 deblock R8=0.061 RED, opportunistic helper Phase 6 deliverable: v3d_h264deblock.comp (132 inst, 4 threads, no spills). Phase 5 REDs applied: RED-1: explicit clamp p1'/q1' to [0,255] before uint8 write RED-2: bench-enforced m.x >= 4*stride contract M1: 3-way 4096/4096 bit-exact (QPU vs C ref AND vs NEON). M2: 5.629 Medge/s isolation → R8 = 0.061 RED (predicted 0.09-0.14). Lower than prediction; H.264 deblock has 4 early-return paths + 2 conditional writes that hurt V3D branchy execution more than expected. M4 same-kernel: NEON-3+QPU 12.81 Medge/s ≈ pure-NEON-4 ~12-15 (neutral). M4 MIXED (real H.264 deployment shape): CPU=MC + QPU=h264deblock gives CPU MC 25.11 Mblock/s + QPU h264deblock 6.23 Medge/s. QPU contribution is essentially unchanged from isolation — the cross-substrate contention is gentle (consistent with Issue 003's V4 finding). Verdict: H.264 deblock = opportunistic QPU helper. Same recipe slot as cycle 5 CDEF. 6 Medge/s helper = 85% of single-NEON-core deblock capacity, available when CPU is busy with other work. Cycles 1-8 deployment recipe complete: Primary QPU: cycles 1+2+4 (VP9 IDCT/LPF, all bandwidth-bound) Primary CPU: cycles 3+6+7 (compute-heavy or trivially fast on NEON) Opportunistic helper: cycles 5+8 (CDEF, H.264 deblock) Phase 9 lessons added: - Branchy kernels underperform V3D vs straight-line ones - Mixed-kernel helper value scales with isolation M2, not same-kernel M4 - R prediction needs branchiness weight, not just compute density - src/v3d_h264deblock.comp (132 inst QPU shader) - tests/bench_v3d_h264deblock.c (3-way M1 + M2 + R classification) - tests/bench_concurrent_mixed.c extended with K_H264DEBLOCK - CMakeLists.txt: v3d_h264deblock.spv + bench_v3d_h264deblock + h264dsp linked into bench_concurrent_mixed - docs/k8_h264deblock_phase7.md (full closure with cycles 1-8 recipe) Next: Phase 8 — V4L2 wrapper / deployment infra. Public API already exposes recipe-default substrate per kernel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 14:44:21 +00:00
marfrit	eb5cfb34c4	Phase 8: wire LPF wd=4 + wd=8 QPU through public API Mirror the IDCT pattern (lazy pipeline + per-call SSBO alloc + dispatch + readback) for cycles 2 (LPF wd=4) and 4 (LPF wd=8). Important caught-empirically bug: the two LPF shaders disagree on push-constant slot order — wd=4 puts dst_stride_u8 at slot 1, wd=8 puts it at slot 2 (with unused blocks_per_row at slot 1). Initial single-struct attempt silently corrupted wd=8 output (1958/2048 = 95.6 % bit-exact on test_api_lpf). Fixed by keeping separate lpf4_pc and lpf8_pc struct definitions. dst-window calc handles both kernels (same -4..+3 byte footprint per row). test_api_lpf exercises both kernels in CPU / QPU / AUTO modes against the C reference. All 6 mode/kernel combinations pass 2048/2048 bit-exact (32 edges × 8 rows × 8 bytes/edge). Phase 8 status after this commit: 3 of 5 kernels wired through API for QPU dispatch (IDCT, LPF wd=4, LPF wd=8 — i.e., all 3 QPU-default kernels per recipe). Cycle 3 MC and cycle 5 CDEF still need wiring for opportunistic-override mode but aren't needed for recipe-AUTO path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:57:25 +00:00
marfrit	1085c5699c	Phase 8: wire IDCT QPU dispatch through public API daedalus_ctx now owns a v3d_runner when V3D is available. The public API's dispatch_vp9_idct8 routes QPU calls through a new dispatch_idct8_qpu helper that: (1) lazy-creates the cycle 1 v4 pipeline on first use, (2) allocates 3 host-visible SSBOs per call (coeffs/dst/meta), (3) memcpy host->GPU, (4) dispatch with the v4 32-blocks-per-WG geometry, (5) memcpy GPU->host. Per-call alloc is intentional for Phase 8 correctness-first scope; buffer-pool perf optimization is deferred. Added daedalus_ctx_create_no_qpu() for fast-path callers that know they want CPU only. test_api_idct extended to a 3-mode matrix: CPU forced, QPU forced, AUTO recipe. All three deliver 4096/4096 bit-exact on hertz with V3D 7.1.7.0: recipe substrate for VP9_IDCT8: 2 (QPU) [CPU] 4096/4096 bit-exact [QPU] 4096/4096 bit-exact (real QPU dispatch through the API) [AUTO] 4096/4096 bit-exact (recipe routes to QPU) Next Phase 8 sub-step: same wiring pattern for cycle 2 LPF wd=4 and cycle 4 LPF wd=8 (the other two recipe-QPU kernels). Cycle 3 MC and cycle 5 CDEF only need the dispatch hook (recipe routes to CPU; QPU stays opportunistic via explicit override). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:55:55 +00:00
marfrit	760f6a4060	Phase 8 skeleton: public C API + first end-to-end smoke test include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling. src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly. tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results. docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked. Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:54:43 +00:00
marfrit	5223d3cb3f	Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper Phase 4 plan with 3 Phase-5 REDs applied inline: - meta layout: m.z=tmp_off, m.w=dir - sec_shift clamped to >=0 (NEON uqsub semantics) - directions table as const ivec2[14], not OR-packed Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills). 3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096. M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED). M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative). M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC, QPU 0.42 Mblock/s CDEF helper. CPU side higher than the Issue 003 NEON-fallback proxy suggested - cross-substrate contention is gentler than same-side NEON contention. Verdict: CDEF stays on CPU; QPU dispatch path exists for opportunistic use. Deployment recipe table updated for all 5 cycles. Phase 9 lessons: linear extrapolation across cycles is too pessimistic; CDEF is bandwidth-bound on NEON despite high per-block ns; real-substrate-cross contention < NEON-proxy contention. - src/v3d_cdef.comp: cycle 5 QPU shader - tests/bench_v3d_cdef.c: 3-way M1, M2 bench - tests/bench_concurrent_mixed.c: K_CDEF on both sides - tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp + expanded damping range to exercise the edge case - CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring - docs/k5_cdef_phase4.md updated with Phase 5 review applied - docs/k5_cdef_phase7.md: closure doc with full verdict matrix Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:52:46 +00:00
marfrit	85feba4087	Cycle 4 (LPF wd=8) closure: M1=100%, R=0.34, M4=+4.1%, PASS Fourth daedalus-fourier kernel — VP9 8-tap inner loop filter wd=8 h_8_8 variant. Width extension of cycle 2's wd=4; completes VP9 inner-edge LPF coverage. Full cycle Phase 1-7 + M4'''' in one combined go (cycle compressed since incremental from cycle 2). Phase 5 review explicitly skipped (incremental ~30-line shader delta from cycle 2 + same geometry + cycle-2 RED-pattern checks still apply). Flagged in docs/k4_lpf8_phase4_7.md per dev_process.md "Skipping phases is a deliberate choice that should be flagged." Phase 6 v1 first-light: M1'''' 100.0000% bit-exact (65536/65536) first try. Shaderdb shows 231 inst, 4 hardware threads, 0 spills, 27 max-temps, 48 uniforms — compiler at the latency-hiding ceiling. Performance: M3'''' NEON (single-core) 52.382 Medge/s M2'''' QPU isolation 17.847 Medge/s R'''' 0.341 → ORANGE band 30fps floor margin 9.2x (isolation), 20.3x (mixed) M4'''' concurrent matrix: NEON 4-core 37.823 Medge/s <- baseline QPU only 14.867 Medge/s MIXED NEON-3 + QPU 39.389 Medge/s <- +4.1% PASS Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU alongside cycle 2 wd=4. Combined VP9 inner-edge LPF coverage now complete. Cross-cycle LPF comparison: \| \| wd=4 (k2) \| wd=8 (k4) \| \| M3 NEON \| 48.3 \| 52.4 \| \| M2 QPU iso \| 19.6 \| 17.8 \| \| R iso \| 0.41 \| 0.34 \| \| M4 delta \| +6.9% \| +4.1% \| \| 30fps mixed \| 7.2x \| 20.3x \| \| Verdict \| GO QPU \| GO QPU \| NEW finding (Phase 9 lesson): NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The relative QPU loss grows with width. wd=16 would probably flip negative based on the trend line. Deployment recipe with cycle 4: IDCT 8x8 (k1) -> QPU (R=0.92, +7% mixed) LPF wd=4 (k2) -> QPU (R=0.41, +7% mixed) LPF wd=8 (k4) -> QPU (R=0.34, +4% mixed) MC 8h (k3) -> CPU (R=0.067, -19% mixed) Entropy -> CPU (structural) VP9 inner-edge LPF coverage complete. Project continues to higgs deployment plumbing or further kernels per user direction. New artifacts: - src/v3d_lpf_h_8_8.comp — GLSL shader - tests/vp9_lpf8_ref.c — standalone C ref - tests/bench_neon_lpf8.c — M1+M3 bench - tests/bench_v3d_lpf8.c — M1+M2 bench - tests/bench_concurrent_lpf8.c — M4 pthread bench - docs/k4_lpf8_phase1_3.md + phase4_7.md — combined cycle docs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:56:25 +00:00
marfrit	356e446a49	Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5% Third daedalus-fourier kernel — VP9 8-tap regular subpel filter, horizontal direction, 8-wide output. Multiply-heavy by design to stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''. Phase 5''' second-model review delivered cleanly — caught 1 RED bug pre-implementation (src_off off-by-3 indexing convention) and 2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate). Without the review, M1''' would have failed silently on first run with cryptic "high-index source pixels wrong" symptoms. Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536 blocks across all 16 mx phases). Phase 5''' filter-LUT prediction materialised exactly: 197 uniforms (gate was 144), 2 threads (down from cycle-2's 4 due to register pressure). Performance: M2''' = 1.413 Mblock/s (707.9 ns/block) M3''' = 20.997 Mblock/s (NEON baseline phase3) R''' = 0.067 (RED band — structural mismatch) shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills M4''' concurrent matrix (8s windows): NEON 1-core 14.479 Mblock/s NEON 4-core 15.248 Mblock/s <- baseline (compute-bound, not bandwidth-saturated like cycles 1+2!) QPU only 1.380 Mblock/s MIXED NEON-3 + QPU 12.277 Mblock/s <- -19.5% (FAIL gate) MIXED NEON-4 + QPU 12.158 Mblock/s <- -20.3% NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU workloads make the QPU-offload story collapse. Cycles 1+2 were bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so freeing a CPU core via QPU offload added throughput. Cycle 3 MC is compute-bound (4-core scaling 1.05x of 1-core — near-linear), no free cycles to free. QPU contribution (0.45 Mblock/s in contention) doesn't compensate for losing 1 NEON core delivering ~3.8 Mblock/s. But 30fps@1080p floor: PASS in every config (1.4x to 15.7x isolation margin). Per project_30fps_floor_is_fine.md, user-facing test never fails — daily YouTube playback works fine on any CPU/QPU split. DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split): IDCT (k1) -> QPU (R=0.92, +7% mixed, frees CPU core) LPF (k2) -> QPU (R=0.41, +7% mixed, frees CPU core) MC (k3) -> CPU (R=0.067, -19.5% mixed — stays on CPU) Entropy -> CPU (structurally serial) Mixed-substrate deployment, not "QPU does everything". Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode etc. New artifacts: - src/v3d_mc_8h.comp — GLSL kernel - tests/vp9_mc_ref.c — standalone C ref (REGULAR filter embedded; clean transcription) - tests/bench_neon_mc.c — M1'''_c + M3''' bench - tests/bench_v3d_mc.c — M1''' + M2''' bench with contract asserts + 30fps margin display - tests/bench_concurrent_mc.c — M4''' pthread bench - external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S (vendored) - external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c (hand-extracted; provides ff_vp9_subpel_filters symbol without dragging in full vp9dsp.c) - docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation Memory updates: project_30fps_floor_is_fine.md (user's 30fps target recalibration), MEMORY.md index updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:51:43 +00:00
marfrit	36eca40ff2	Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification including M4'' concurrent gate. Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons applied successfully). 2 YELLOW findings baked into phase4 §4: - stride >= 4 contract added alongside m.x >= 4 (finding 2) - assert(...) in bench made a MUST not a suggestion (finding 4) - V3D divergence-cost note: don't restructure to always-execute, masked lanes consume clock anyway (finding 3, informational) Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run (65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg, uint8_t SSBO, oob early-return discipline) baked in from start worked as expected. Performance: M2'' = 19.645 Medge/s (50.9 ns/edge) M3'' = 48.285 Medge/s (NEON baseline from phase3) R'' = 0.41 (ORANGE band - doesn't auto-close per cycle-1 calibration adjustment) shaderdb: 160 inst, 4 threads, 0 spills, 21 max-temps — shader is already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1 because there's nothing more to extract from the compiled shape. The 30x gap between theoretical instruction throughput and measured wall-clock is divergence-tax + memory latency, not compile quality. M4'' concurrent matrix on hertz (8s windows): NEON-1 LPF 41.131 Medge/s NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling (per-core 7-9; same bandwidth-saturation as cycle-1 F1) QPU only 14.299 Medge/s MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4 MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: oversubscribed mode hurts for lighter kernels (LPF -5.4% vs cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to "always N-1 NEON cores + QPU, never N + QPU". Phase 9 lessons (in phase7 §"Phase 9 lessons"): 1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations) 2. Phase 5 review pays off every cycle 3. R isolation misleading on bandwidth-saturated hardware 4. Oversubscription tax depends on kernel weight 5. shaderdb 4-threads/0-spills = compute not the bottleneck New artifacts: - src/v3d_lpf_h_4_8.comp — GLSL kernel - tests/bench_v3d_lpf.c — M1'' + M2'' harness with contract asserts + fm/hev pass-rate instrumentation - tests/bench_concurrent_lpf.c — M4'' pthread bench (mirrors bench_concurrent.c) - docs/k2_deblock_phase{4,5,7}.md — plan + review + verification Project verdict: continue. Cycle 3 candidates: MC interpolation (multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different neighborhood shape), or wd=8/wd=16 LPF variants. User to direct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:39:26 +00:00
marfrit	d66f22f333	Phase 6 (v1+v4 production) + Phase 7 closure: R = 0.92 ± 0.03 on hertz First QPU IDCT8 kernel running and bit-exact on V3D 7.1 via Mesa v3dv compute. Five iterations through a Phase 7→Phase 4' loopback; production kernel is v4. New files: - src/v3d_runner.{c,h} — reusable Vulkan compute plumbing (instance, V3D device picker, HOST_VISIBLE\|COHERENT SSBOs with mmap, compute pipeline from .spv, enables storageBuffer{8,16}BitAccess) - src/v3d_idct8.comp — VP9 8x8 DCT_DCT IDCT add, v4 production: 256 invocations/WG, 2 blocks/subgroup (no idle lanes), uint8 dst SSBO (race-free per phase5 finding 5), unrolled writes (no chained ternary), oob-flag pattern (barrier-safe per phase5 finding 7) - tests/bench_v3d_idct.c — M1' bit-exact gate + M2 throughput vs C ref - docs/phase7.md — full iteration journey + decision verdict CMakeLists.txt updated to build the new shader, library, and bench when DAEDALUS_BUILD_VULKAN=ON. Iteration record (1920x1088 luma, 32640 blocks/dispatch, N=3): ver change R ns/block v1 first-light 0.230 533 v2 kill ternary + 2-blocks-per-sg 0.474 258 v3 per-pass scope oN 0.481 254 (noise) v4 WG 64 -> 256 invocations 0.947 129 v5 packed uint32 coeff reads 0.938 130 (noise, reverted) v4 final N=3 0.918 +/- 0.033 Bit-exactness 100.0000% across all iterations (10000-block sample on 128x128, 32640-block sample on 1080p) against both the C reference (tests/vp9_idct8_ref.c) and the vendored FFmpeg NEON ff_vp9_idct_idct_8x8_add_neon. Key learning over the Phase 5 review's prediction model: the chained ternary was NOT a spill killer on V3D 7.1 (shaderdb showed 0:0 spills:fills even in v1). The actual lever was workgroup-size-driven latency hiding — going from 64 to 256 invocations doubled throughput with the same compiled code (270 inst, 2 threads, 21 max-temps, 0 spills) because the v3dv scheduler had 4x more in-flight work to overlap TMU latency. Verdict per phase1.md decision rules: YELLOW band (0.5 <= R < 1.0) by a wide margin, near GREEN boundary. Phase 1 YELLOW rule: add M4 (concurrent CPU+QPU throughput) before honest-close or continue. M4 is the next measurement, not more shader tuning — at R = 0.92 with all 4 A76 cores still 100% free for other work, the question is whether the system aggregate beats pure 4-core NEON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:09:00 +00:00
marfrit	dcbbc77038	Path B pivot + Phase 0-3 closed with first baseline numbers This is a from-scratch initial commit on a fresh .git. The original scaffold commit (`7510b56`) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:30:12 +00:00

13 Commits