Calibration: M4 same-kernel measures worst-case contention

User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only' verdicts were based on M4 measuring same-kernel concurrent NEON+QPU, which is the WORST case for memory-bandwidth contention. A real decoder pipeline has CPU doing kernel A + QPU doing kernel B concurrently — different access patterns contend less. Concretely: in a real pipeline, CPU runs entropy + MC + other work while QPU is idle except for IDCT + LPF. The 'opportunistic QPU helper' for CDEF (or MC) hasn't been measured. M4 set the bar too high. Updates: - docs/k3_mc_phase7.md §'M4 methodology caveat' added with the user's contribution framing - docs/k5_cdef_phase3_partial.md §'Deployment recommendation' softened from 'CPU only' to 'CPU baseline; QPU helper viable in mixed-kernel deployment, unmeasured' - docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous test to close the question (4 variants: bandwidth+bandwidth, compute+CDEF, same-kernel control, real-pipeline mix) - ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/ feedback_m4_same_kernel_worst_case.md added — carries the calibration into future cycles + Phase 8 deployment decisions - MEMORY.md index updated The bandwidth-bound vs compute-bound classification still holds at the kernel level — Phase 9 cross-cycle lesson stays valid. But its mapping to deployment is nuanced: - Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4) - Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has bandwidth-light CPU work running concurrently (cycles 3+5, needs Issue 003 measurement) Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU or QPU at runtime (not hard-baked), so Issue 003's result can update the dispatch table without re-architecture. No code changes. Doc + memory + issue only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:27 +00:00
parent 20b59cd6a5
commit 460a6a6d08
3 changed files with 138 additions and 8 deletions
@@ -0,0 +1,87 @@
+# Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)
+
+**Status**: open, blocks Phase 8 deployment plumbing for cycles 3+5
+**Type**: measurement gap; methodology fix
+**Predicted verdict**: cycle 3 MC + cycle 5 CDEF may flip from
+                       "CPU only" to "opportunistic QPU helper"
+**Priority**: medium (changes deployment recipe; doesn't block other cycles)
+**Filed**: 2026-05-18
+
+## Background
+
+Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU"
+based on M4 measurements showing mixed NEON-3 + QPU running the
+**same kernel** ran SLOWER than pure NEON-4. Specifically:
+
+| | NEON-4 | NEON-3 + QPU | delta |
+|---|---|---|---|
+| Cycle 3 MC | 15.25 Mblock/s | 12.28 | **−19.5 %** |
+| Cycle 5 CDEF (predicted) | ~ 12-15 | ~ 10-12 | negative |
+
+But this is the **worst-case contention scenario**: both substrates
+competing for the same memory bus with the same access pattern.
+
+**Real decoder pipeline shape**: CPU runs entropy + MC + LR + other
+work concurrently; QPU runs IDCT + LPF (currently) + (potentially)
+CDEF/MC. Different kernels on different substrates contend
+*less* than same-kernel-on-both.
+
+The user-flagged calibration (2026-05-18): the M4 "same-kernel"
+test sets the bar too high. A "different-kernel" test would more
+accurately reflect deployment.
+
+## What to measure
+
+A new bench harness `tests/bench_concurrent_mixed.c` that runs:
+
+| Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures |
+|---|---|---|---|
+| A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop |
+| B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop |
+| C | MC | MC | (cycle 3 M4 control) |
+| D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation |
+
+Compute "QPU helper value" = (mixed total throughput in the relevant
+kernel) − (CPU-only baseline) for each variant.
+
+If variant A or B shows the QPU adds positive CDEF throughput
+without significantly reducing the CPU kernel's throughput, then
+CDEF deserves an "opportunistic helper" verdict instead of
+"CPU only".
+
+## Expected outcome
+
+Per the user's "5 % CPU drop / 50 % bored QPU" framing:
+- Variant A (bandwidth+bandwidth): QPU contention with bandwidth-
+  heavy LPF is real; QPU contribution likely ~70 % of isolation
+- Variant B (compute+CDEF): MC is the worst-saturated case from
+  cycle 3; QPU likely under-contributes, CPU MC may drop. Net
+  result ~ cycle 3 M4 (−19.5 % rerun)
+- Variant D (mixed): probably the closest-to-deployment number.
+  Best estimate of "additional QPU helper" value.
+
+## Acceptance criteria
+
+- `tests/bench_concurrent_mixed.c` lands, 4 variants measurable
+- Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline
+- Cycle 3 and cycle 5 deployment recipes updated either way
+- `docs/k3_mc_phase7.md §"M4 methodology caveat"` updated with
+  results
+
+## Why deferred
+
+User-directed cycle 5 was CDEF; M4 methodology calibration only
+surfaced AFTER cycle 5 close. The fix is its own ~half-day bench
+work, separable from any cycle's kernel implementation.
+
+## Related
+
+- `docs/k3_mc_phase7.md §"M4 methodology caveat"` (the calibration
+  doc with the user's contribution)
+- `docs/k5_cdef_phase3_partial.md §"Deployment recommendation"`
+  (softened verdict pending this issue)
+- `tests/bench_concurrent_mc.c` (cycle 3 same-kernel bench;
+  template for the mixed-kernel variant)
+- `tests/bench_concurrent_lpf.c` + `bench_concurrent_lpf8.c`
+  (cycle 2/4 bench templates)
+- Memory: `feedback_m4_same_kernel_worst_case.md`