Calibration: M4 same-kernel measures worst-case contention

User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real decoder
pipeline has CPU doing kernel A + QPU doing kernel B concurrently;
different access patterns contend less.

Concretely: in a real pipeline, CPU runs entropy + MC + other work while
QPU is idle except for IDCT + LPF. The 'opportunistic QPU helper' for
CDEF (or MC) hasn't been measured. M4 set the bar too high.

Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the user's
  contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation' softened
  from 'CPU only' to 'CPU baseline; QPU helper viable in mixed-kernel
  deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed: the rigorous test to
  close the question (4 variants: bandwidth+bandwidth, compute+CDEF,
  same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/feedback_m4_same_kernel_worst_case.md
  added: carries the calibration into future cycles + Phase 8
  deployment decisions
- MEMORY.md index updated

The bandwidth-bound vs compute-bound classification still holds at the
kernel level, so the Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
- Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
- Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
  bandwidth-light CPU work running concurrently (cycles 3+5, needs
  Issue 003 measurement)

Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU or
QPU at runtime (not hard-baked), so Issue 003's result can update the
dispatch table without re-architecture.

No code changes. Doc + memory + issue only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
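
A minimal sketch of what "slot-able at runtime" could look like for the Phase 8 wrapper, assuming a small per-kernel dispatch table. Every identifier here (`backend_t`, `kernel_t`, `dispatch_table`, `dispatch_set`, `dispatch_get`) is hypothetical and does not come from the repository. The initial backends mirror the current recommendation: IDCT and LPF on the QPU, MC and CDEF on the CPU but left switchable pending Issue 003.

```c
#include <string.h>

/* Hypothetical types: illustration only, not the repository's API. */
typedef enum { BACKEND_CPU, BACKEND_QPU } backend_t;
typedef enum { K_IDCT, K_MC, K_LPF, K_CDEF, K_COUNT } kernel_t;

typedef struct {
    const char *name;       /* kernel name used for runtime lookup */
    backend_t   backend;    /* where the kernel currently runs */
    int         switchable; /* 1 = may be re-slotted at runtime */
} dispatch_entry_t;

/* Bandwidth-bound kernels are pinned to the QPU; compute-bound kernels
 * default to the CPU but stay switchable until Issue 003 is measured. */
static dispatch_entry_t dispatch_table[K_COUNT] = {
    [K_IDCT] = { "idct", BACKEND_QPU, 0 },
    [K_MC]   = { "mc",   BACKEND_CPU, 1 },
    [K_LPF]  = { "lpf",  BACKEND_QPU, 0 },
    [K_CDEF] = { "cdef", BACKEND_CPU, 1 },
};

/* Re-slot a kernel by name. Returns 0 on success, -1 if the kernel is
 * unknown or pinned to its current backend. */
static int dispatch_set(const char *name, backend_t backend)
{
    for (int i = 0; i < K_COUNT; i++) {
        if (strcmp(dispatch_table[i].name, name) == 0) {
            if (!dispatch_table[i].switchable)
                return -1;
            dispatch_table[i].backend = backend;
            return 0;
        }
    }
    return -1;
}

/* Look up the backend a kernel is currently slotted to. */
static backend_t dispatch_get(kernel_t kernel)
{
    return dispatch_table[kernel].backend;
}
```

With a table of this shape, a positive Issue 003 result would only need a call such as `dispatch_set("cdef", BACKEND_QPU)` at startup (or a config value feeding it), while the pinned IDCT and LPF entries reject accidental changes.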
2026-05-18 13:31:27 +00:00
parent 20b59cd6a5
commit 460a6a6d08
3 changed files with 138 additions and 8 deletions
@@ -95,11 +95,27 @@ chasing two layout issues simultaneously).
 - 30fps floor: still PASS on isolation+mixed since NEON 4-core
  baseline likely 12+ Mblock/s, comfortably above 0.972

-**Deployment recommendation** (provisional, pending Phase 4-7):
-CDEF stays on CPU. Same verdict as MC. **All compute-bound kernels
-stay on CPU; all bandwidth-bound (IDCT/LPF) kernels offload to QPU.**
-This is starting to look like a clean classification rule across all
-cycles.
+**Deployment recommendation** (provisional, pending Phase 4-7 +
+Issue 003 mixed-kernel M4): **CDEF baseline = CPU, QPU offload
+viable as opportunistic helper, not measured**.
+
+Same caveat as cycle 3 MC (see `k3_mc_phase7.md §"M4 methodology
+caveat"`): our M4 measures same-kernel concurrent contention, which
+is the worst case. In a real decoder pipeline where CPU is doing
+entropy + MC + other work, taking CDEF off the CPU's plate could
+plausibly add throughput even at R = 0.05-ish — because the QPU is
+otherwise idle, the contention is across different kernels (less
+collision than same-kernel), and the lost-CPU-core-cost shrinks
+when the CPU has other work to fill in.
+
+The **bandwidth-bound vs compute-bound classification rule** still
+holds at the kernel level, but its mapping to deployment is more
+nuanced than "compute-bound → never QPU." Better framing:
+
+- **Bandwidth-bound on QPU** → **definitive** QPU offload (cycle 1+2+4)
+- **Compute-bound on QPU** → **opportunistic** QPU helper if pipeline
+  has bandwidth-light CPU work running concurrently (cycle 3+5,
+  needs Issue 003 measurement to confirm)

 ## Phase 9 lessons (provisional)