From 460a6a6d08dc41252c7faba6b237237b88755a70 Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Mon, 18 May 2026 13:31:27 +0000 Subject: [PATCH] Calibration: M4 same-kernel measures worst-case contention MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only' verdicts were based on M4 measuring same-kernel concurrent NEON+QPU, which is the WORST case for memory-bandwidth contention. A real decoder pipeline has CPU doing kernel A + QPU doing kernel B concurrently — different access patterns contend less. Concretely: in a real pipeline, CPU runs entropy + MC + other work while QPU is idle except for IDCT + LPF. The 'opportunistic QPU helper' for CDEF (or MC) hasn't been measured. M4 set the bar too high. Updates: - docs/k3_mc_phase7.md §'M4 methodology caveat' added with the user's contribution framing - docs/k5_cdef_phase3_partial.md §'Deployment recommendation' softened from 'CPU only' to 'CPU baseline; QPU helper viable in mixed-kernel deployment, unmeasured' - docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous test to close the question (4 variants: bandwidth+bandwidth, compute+CDEF, same-kernel control, real-pipeline mix) - ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/ feedback_m4_same_kernel_worst_case.md added — carries the calibration into future cycles + Phase 8 deployment decisions - MEMORY.md index updated The bandwidth-bound vs compute-bound classification still holds at the kernel level — Phase 9 cross-cycle lesson stays valid. But its mapping to deployment is nuanced: - Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4) - Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has bandwidth-light CPU work running concurrently (cycles 3+5, needs Issue 003 measurement) Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU or QPU at runtime (not hard-baked), so Issue 003's result can update the dispatch table without re-architecture. No code changes. Doc + memory + issue only. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/issues/003-mixed-kernel-m4-bench.md | 87 ++++++++++++++++++++++++ docs/k3_mc_phase7.md | 33 ++++++++- docs/k5_cdef_phase3_partial.md | 26 +++++-- 3 files changed, 138 insertions(+), 8 deletions(-) create mode 100644 docs/issues/003-mixed-kernel-m4-bench.md diff --git a/docs/issues/003-mixed-kernel-m4-bench.md b/docs/issues/003-mixed-kernel-m4-bench.md new file mode 100644 index 0000000..fa6c2f1 --- /dev/null +++ b/docs/issues/003-mixed-kernel-m4-bench.md @@ -0,0 +1,87 @@ +# Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict) + +**Status**: open, blocks Phase 8 deployment plumbing for cycles 3+5 +**Type**: measurement gap; methodology fix +**Predicted verdict**: cycle 3 MC + cycle 5 CDEF may flip from + "CPU only" to "opportunistic QPU helper" +**Priority**: medium (changes deployment recipe; doesn't block other cycles) +**Filed**: 2026-05-18 + +## Background + +Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU" +based on M4 measurements showing mixed NEON-3 + QPU running the +**same kernel** ran SLOWER than pure NEON-4. Specifically: + +| | NEON-4 | NEON-3 + QPU | delta | +|---|---|---|---| +| Cycle 3 MC | 15.25 Mblock/s | 12.28 | **−19.5 %** | +| Cycle 5 CDEF (predicted) | ~ 12-15 | ~ 10-12 | negative | + +But this is the **worst-case contention scenario**: both substrates +competing for the same memory bus with the same access pattern. + +**Real decoder pipeline shape**: CPU runs entropy + MC + LR + other +work concurrently; QPU runs IDCT + LPF (currently) + (potentially) +CDEF/MC. Different kernels on different substrates contend +*less* than same-kernel-on-both. + +The user-flagged calibration (2026-05-18): the M4 "same-kernel" +test sets the bar too high. A "different-kernel" test would more +accurately reflect deployment. + +## What to measure + +A new bench harness `tests/bench_concurrent_mixed.c` that runs: + +| Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures | +|---|---|---|---| +| A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop | +| B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop | +| C | MC | MC | (cycle 3 M4 control) | +| D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation | + +Compute "QPU helper value" = (mixed total throughput in the relevant +kernel) − (CPU-only baseline) for each variant. + +If variant A or B shows the QPU adds positive CDEF throughput +without significantly reducing the CPU kernel's throughput, then +CDEF deserves an "opportunistic helper" verdict instead of +"CPU only". + +## Expected outcome + +Per the user's "5 % CPU drop / 50 % bored QPU" framing: +- Variant A (bandwidth+bandwidth): QPU contention with bandwidth- + heavy LPF is real; QPU contribution likely ~70 % of isolation +- Variant B (compute+CDEF): MC is the worst-saturated case from + cycle 3; QPU likely under-contributes, CPU MC may drop. Net + result ~ cycle 3 M4 (−19.5 % rerun) +- Variant D (mixed): probably the closest-to-deployment number. + Best estimate of "additional QPU helper" value. + +## Acceptance criteria + +- `tests/bench_concurrent_mixed.c` lands, 4 variants measurable +- Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline +- Cycle 3 and cycle 5 deployment recipes updated either way +- `docs/k3_mc_phase7.md §"M4 methodology caveat"` updated with + results + +## Why deferred + +User-directed cycle 5 was CDEF; M4 methodology calibration only +surfaced AFTER cycle 5 close. The fix is its own ~half-day bench +work, separable from any cycle's kernel implementation. + +## Related + +- `docs/k3_mc_phase7.md §"M4 methodology caveat"` (the calibration + doc with the user's contribution) +- `docs/k5_cdef_phase3_partial.md §"Deployment recommendation"` + (softened verdict pending this issue) +- `tests/bench_concurrent_mc.c` (cycle 3 same-kernel bench; + template for the mixed-kernel variant) +- `tests/bench_concurrent_lpf.c` + `bench_concurrent_lpf8.c` + (cycle 2/4 bench templates) +- Memory: `feedback_m4_same_kernel_worst_case.md` diff --git a/docs/k3_mc_phase7.md b/docs/k3_mc_phase7.md index 64c60b0..df22a83 100644 --- a/docs/k3_mc_phase7.md +++ b/docs/k3_mc_phase7.md @@ -86,15 +86,42 @@ back-end-on-QPU/CPU split for the consumed decoder pipeline: - **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core. - **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core. -- **MC (cycle 3)** → **CPU NEON**. R = 0.067, −19.5 % mixed. - Compute-bound on CPU but CPU already comfortably exceeds 30fps; - offload makes things worse. +- **MC (cycle 3)** → **CPU NEON baseline; QPU offload viable as + opportunistic helper, not yet measured.** R = 0.067 in isolation + was discouraging; M4 same-kernel mixed was −19.5 % which looks + conclusive but isn't — see *M4 methodology caveat* below. - **Entropy** (VP9 Bool / AV1 ANS) → CPU. Structurally serial. This is a **mixed-substrate deployment**, not a "QPU does everything" plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc. +## M4 methodology caveat (added 2026-05-18 after cycle 5) + +The M4 mixed bench (`bench_concurrent_mc.c`) tests NEON-3 + QPU +running the SAME kernel concurrently. This is the **worst case** for +memory-bandwidth contention — both substrates competing for the same +bus with the same access pattern. + +A real decoder pipeline has different shape: CPU runs entropy + MC ++ other CPU-bound work; QPU runs IDCT + LPF + (potentially) MC as +opportunistic helper. **Different kernels on different substrates** +contend less than same-kernel-on-both. Our M4-same-kernel result is +a pessimistic lower bound, not the actual deployment number. + +Empirically supporting this: cycle 3 M4 showed per-core NEON +throughput in 3-core mode (3.78-4.16 Mblock/s) was higher than in +4-core mode (3.24-4.48), confirming bandwidth saturation at ≥4 +cores. So freeing 1 core via QPU offload costs ~25 % of total NEON +MC throughput, but the QPU contributes 0.45 (-MC) or 1.4 (in CDEF +isolation) on top. + +**To rigorously test the helper hypothesis**: see +`docs/issues/003-mixed-kernel-m4-bench.md`. A bench that runs +NEON-3 on kernel-A + QPU on kernel-B concurrently would close the +question. ~½ day of additional bench work; would update the +deployment recipe for cycles 3 + 5 if the result is positive. + ## Decision per Phase 1 rules + 30fps-floor calibration | Rule | Result | Status | diff --git a/docs/k5_cdef_phase3_partial.md b/docs/k5_cdef_phase3_partial.md index ddf93d2..e62cf55 100644 --- a/docs/k5_cdef_phase3_partial.md +++ b/docs/k5_cdef_phase3_partial.md @@ -95,11 +95,27 @@ chasing two layout issues simultaneously). - 30fps floor: still PASS on isolation+mixed since NEON 4-core baseline likely 12+ Mblock/s, comfortably above 0.972 -**Deployment recommendation** (provisional, pending Phase 4-7): -CDEF stays on CPU. Same verdict as MC. **All compute-bound kernels -stay on CPU; all bandwidth-bound (IDCT/LPF) kernels offload to QPU.** -This is starting to look like a clean classification rule across all -cycles. +**Deployment recommendation** (provisional, pending Phase 4-7 + +Issue 003 mixed-kernel M4): **CDEF baseline = CPU, QPU offload +viable as opportunistic helper, not measured**. + +Same caveat as cycle 3 MC (see `k3_mc_phase7.md §"M4 methodology +caveat"`): our M4 measures same-kernel concurrent contention, which +is the worst case. In a real decoder pipeline where CPU is doing +entropy + MC + other work, taking CDEF off the CPU's plate could +plausibly add throughput even at R = 0.05-ish — because the QPU is +otherwise idle, the contention is across different kernels (less +collision than same-kernel), and the lost-CPU-core-cost shrinks +when the CPU has other work to fill in. + +The **bandwidth-bound vs compute-bound classification rule** still +holds at the kernel level, but its mapping to deployment is more +nuanced than "compute-bound → never QPU." Better framing: + +- **Bandwidth-bound on QPU** → **definitive** QPU offload (cycle 1+2+4) +- **Compute-bound on QPU** → **opportunistic** QPU helper if pipeline + has bandwidth-light CPU work running concurrently (cycle 3+5, + needs Issue 003 measurement to confirm) ## Phase 9 lessons (provisional)