Calibration: M4 same-kernel measures worst-case contention

User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only' verdicts were based on M4 measuring same-kernel concurrent NEON+QPU, which is the WORST case for memory-bandwidth contention. A real decoder pipeline has CPU doing kernel A + QPU doing kernel B concurrently — different access patterns contend less. Concretely: in a real pipeline, CPU runs entropy + MC + other work while QPU is idle except for IDCT + LPF. The 'opportunistic QPU helper' for CDEF (or MC) hasn't been measured. M4 set the bar too high. Updates: - docs/k3_mc_phase7.md §'M4 methodology caveat' added with the user's contribution framing - docs/k5_cdef_phase3_partial.md §'Deployment recommendation' softened from 'CPU only' to 'CPU baseline; QPU helper viable in mixed-kernel deployment, unmeasured' - docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous test to close the question (4 variants: bandwidth+bandwidth, compute+CDEF, same-kernel control, real-pipeline mix) - ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/ feedback_m4_same_kernel_worst_case.md added — carries the calibration into future cycles + Phase 8 deployment decisions - MEMORY.md index updated The bandwidth-bound vs compute-bound classification still holds at the kernel level — Phase 9 cross-cycle lesson stays valid. But its mapping to deployment is nuanced: - Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4) - Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has bandwidth-light CPU work running concurrently (cycles 3+5, needs Issue 003 measurement) Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU or QPU at runtime (not hard-baked), so Issue 003's result can update the dispatch table without re-architecture. No code changes. Doc + memory + issue only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:27 +00:00
parent 20b59cd6a5
commit 460a6a6d08
3 changed files with 138 additions and 8 deletions
@@ -86,15 +86,42 @@ back-end-on-QPU/CPU split for the consumed decoder pipeline:

 - **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core.
 - **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core.
- **MC (cycle 3)** → **CPU NEON**. R = 0.067, −19.5 % mixed.
-  Compute-bound on CPU but CPU already comfortably exceeds 30fps;
-  offload makes things worse.
+- **MC (cycle 3)** → **CPU NEON baseline; QPU offload viable as
+  opportunistic helper, not yet measured.** R = 0.067 in isolation
+  was discouraging; M4 same-kernel mixed was −19.5 % which looks
+  conclusive but isn't — see *M4 methodology caveat* below.
 - **Entropy** (VP9 Bool / AV1 ANS) → CPU. Structurally serial.

 This is a **mixed-substrate deployment**, not a "QPU does everything"
 plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF
 dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc.

+## M4 methodology caveat (added 2026-05-18 after cycle 5)
+
+The M4 mixed bench (`bench_concurrent_mc.c`) tests NEON-3 + QPU
+running the SAME kernel concurrently. This is the **worst case** for
+memory-bandwidth contention — both substrates competing for the same
+bus with the same access pattern.
+
+A real decoder pipeline has different shape: CPU runs entropy + MC
+ other CPU-bound work; QPU runs IDCT + LPF + (potentially) MC as
+opportunistic helper. **Different kernels on different substrates**
+contend less than same-kernel-on-both. Our M4-same-kernel result is
+a pessimistic lower bound, not the actual deployment number.
+
+Empirically supporting this: cycle 3 M4 showed per-core NEON
+throughput in 3-core mode (3.78-4.16 Mblock/s) was higher than in
+4-core mode (3.24-4.48), confirming bandwidth saturation at ≥4
+cores. So freeing 1 core via QPU offload costs ~25 % of total NEON
+MC throughput, but the QPU contributes 0.45 (-MC) or 1.4 (in CDEF
+isolation) on top.
+
+**To rigorously test the helper hypothesis**: see
+`docs/issues/003-mixed-kernel-m4-bench.md`. A bench that runs
+NEON-3 on kernel-A + QPU on kernel-B concurrently would close the
+question. ~½ day of additional bench work; would update the
+deployment recipe for cycles 3 + 5 if the result is positive.
+
 ## Decision per Phase 1 rules + 30fps-floor calibration

 | Rule | Result | Status |