marfrit
5223d3cb3f
Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline:
- meta layout: m.z=tmp_off, m.w=dir
- sec_shift clamped to >=0 (NEON uqsub semantics)
- directions table as const ivec2[14], not OR-packed
Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills).
3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096.
M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED).
M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative:
adding the QPU loses to a 4th NEON thread).
M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC,
QPU 0.42 Mblock/s CDEF helper. CPU side is higher than the
Issue 003 NEON-fallback proxy suggested: cross-substrate
contention is gentler than same-side NEON contention.
Verdict: CDEF stays on CPU; the QPU dispatch path exists for
opportunistic use. Deployment recipe table updated for all 5
cycles. Phase 9 lessons: linear extrapolation across cycles is
too pessimistic; CDEF is bandwidth-bound on NEON despite high
per-block ns; real cross-substrate contention is lower than
NEON-proxy contention.
- src/v3d_cdef.comp: cycle 5 QPU shader
- tests/bench_v3d_cdef.c: 3-way M1, M2 bench
- tests/bench_concurrent_mixed.c: K_CDEF on both sides
- tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp +
  expanded damping range to exercise the edge case
- CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring
- docs/k5_cdef_phase4.md updated with Phase 5 review applied
- docs/k5_cdef_phase7.md: closure doc with full verdict matrix
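The sec_shift clamp mentioned above can be sketched in C as follows. This is a minimal illustration, not the code in tests/cdef_ref.c: it assumes sec_shift is derived as damping - ulog2(sec_strength), as in the usual CDEF formulation, and the helper names ulog2_sketch, sat_sub_u32 and sec_shift_clamped are hypothetical. The saturating subtract mirrors what NEON uqsub does per lane: an unsigned difference that would fall below zero becomes 0.

```c
#include <stdint.h>

/* Hypothetical helper: floor(log2(v)) for v > 0; returns 0 for v == 0. */
static uint32_t ulog2_sketch(uint32_t v)
{
    uint32_t n = 0;
    while (v >>= 1)
        n++;
    return n;
}

/* Unsigned saturating subtract: max(a - b, 0) without wrapping.
 * This is the per-lane behaviour of NEON uqsub. */
static uint32_t sat_sub_u32(uint32_t a, uint32_t b)
{
    return a > b ? a - b : 0u;
}

/* Assumed formula: sec_shift = damping - ulog2(sec_strength),
 * clamped so the result is never negative. */
static uint32_t sec_shift_clamped(uint32_t damping, uint32_t sec_strength)
{
    return sat_sub_u32(damping, ulog2_sketch(sec_strength));
}
```

With damping = 1 and sec_strength = 4 the unclamped difference would be -1; the clamped helper returns 0, so a later right shift by sec_shift stays well defined. A damping range that reaches such small values is what exercises this edge case.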
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:52:46 +00:00