# Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict) **Status**: open, blocks Phase 8 deployment plumbing for cycles 3+5 **Type**: measurement gap; methodology fix **Predicted verdict**: cycle 3 MC + cycle 5 CDEF may flip from "CPU only" to "opportunistic QPU helper" **Priority**: medium (changes deployment recipe; doesn't block other cycles) **Filed**: 2026-05-18 ## Background Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU" based on M4 measurements showing mixed NEON-3 + QPU running the **same kernel** ran SLOWER than pure NEON-4. Specifically: | | NEON-4 | NEON-3 + QPU | delta | |---|---|---|---| | Cycle 3 MC | 15.25 Mblock/s | 12.28 | **−19.5 %** | | Cycle 5 CDEF (predicted) | ~ 12-15 | ~ 10-12 | negative | But this is the **worst-case contention scenario**: both substrates competing for the same memory bus with the same access pattern. **Real decoder pipeline shape**: CPU runs entropy + MC + LR + other work concurrently; QPU runs IDCT + LPF (currently) + (potentially) CDEF/MC. Different kernels on different substrates contend *less* than same-kernel-on-both. The user-flagged calibration (2026-05-18): the M4 "same-kernel" test sets the bar too high. A "different-kernel" test would more accurately reflect deployment. ## What to measure A new bench harness `tests/bench_concurrent_mixed.c` that runs: | Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures | |---|---|---|---| | A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop | | B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop | | C | MC | MC | (cycle 3 M4 control) | | D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation | Compute "QPU helper value" = (mixed total throughput in the relevant kernel) − (CPU-only baseline) for each variant. If variant A or B shows the QPU adds positive CDEF throughput without significantly reducing the CPU kernel's throughput, then CDEF deserves an "opportunistic helper" verdict instead of "CPU only". ## Expected outcome Per the user's "5 % CPU drop / 50 % bored QPU" framing: - Variant A (bandwidth+bandwidth): QPU contention with bandwidth- heavy LPF is real; QPU contribution likely ~70 % of isolation - Variant B (compute+CDEF): MC is the worst-saturated case from cycle 3; QPU likely under-contributes, CPU MC may drop. Net result ~ cycle 3 M4 (−19.5 % rerun) - Variant D (mixed): probably the closest-to-deployment number. Best estimate of "additional QPU helper" value. ## Acceptance criteria - `tests/bench_concurrent_mixed.c` lands, 4 variants measurable - Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline - Cycle 3 and cycle 5 deployment recipes updated either way - `docs/k3_mc_phase7.md §"M4 methodology caveat"` updated with results ## Why deferred User-directed cycle 5 was CDEF; M4 methodology calibration only surfaced AFTER cycle 5 close. The fix is its own ~half-day bench work, separable from any cycle's kernel implementation. ## Related - `docs/k3_mc_phase7.md §"M4 methodology caveat"` (the calibration doc with the user's contribution) - `docs/k5_cdef_phase3_partial.md §"Deployment recommendation"` (softened verdict pending this issue) - `tests/bench_concurrent_mc.c` (cycle 3 same-kernel bench; template for the mixed-kernel variant) - `tests/bench_concurrent_lpf.c` + `bench_concurrent_lpf8.c` (cycle 2/4 bench templates) - Memory: `feedback_m4_same_kernel_worst_case.md`