Calibration: M4 same-kernel measures worst-case contention

User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently; different access patterns contend less.

Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.

Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
  user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
  softened from 'CPU only' to 'CPU baseline; QPU helper viable in
  mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed: the rigorous
  test to close the question (4 variants: bandwidth+bandwidth,
  compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/feedback_m4_same_kernel_worst_case.md
  added: carries the calibration into future cycles + Phase 8
  deployment decisions
- MEMORY.md index updated

The bandwidth-bound vs compute-bound classification still holds at
the kernel level, so the Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
- Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
- Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
  bandwidth-light CPU work running concurrently (cycles 3+5,
  needs Issue 003 measurement)

Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.

No code changes. Doc + memory + issue only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,87 @@
# Issue 003: Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)

**Status**: open, blocks Phase 8 deployment plumbing for cycles 3+5
**Type**: measurement gap; methodology fix
**Predicted verdict**: cycle 3 MC + cycle 5 CDEF may flip from
"CPU only" to "opportunistic QPU helper"
**Priority**: medium (changes deployment recipe; doesn't block other cycles)
**Filed**: 2026-05-18

## Background

Cycles 3 (MC) and 5 (CDEF, partial) were given a "stay on CPU" verdict
based on M4 measurements showing that mixed NEON-3 + QPU running the
**same kernel** ran SLOWER than pure NEON-4. Specifically:

| Kernel | NEON-4 (Mblock/s) | NEON-3 + QPU (Mblock/s) | Delta |
|---|---|---|---|
| Cycle 3 MC | 15.25 | 12.28 | **−19.5 %** |
| Cycle 5 CDEF (predicted) | ~12-15 | ~10-12 | negative |

But this is the **worst-case contention scenario**: both substrates
compete for the same memory bus with the same access pattern.

**Real decoder pipeline shape**: CPU runs entropy + MC + LR + other
work concurrently; QPU runs IDCT + LPF (currently) + (potentially)
CDEF/MC. Different kernels on different substrates should contend
*less* than same-kernel-on-both.

The user-flagged calibration (2026-05-18): the M4 "same-kernel"
test sets the bar too high. A "different-kernel" test would more
accurately reflect deployment.
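
For reference, the −19.5 % delta quoted for cycle 3 MC follows directly from the two throughput figures in the table above, taking pure NEON-4 as the baseline:

$$
\Delta_{\text{MC}} = \frac{12.28 - 15.25}{15.25} = \frac{-2.97}{15.25} \approx -0.195 = -19.5\,\%
$$

The cycle 5 CDEF row is a prediction, so no comparable figure can be derived for it yet.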

## What to measure

A new bench harness, `tests/bench_concurrent_mixed.c`, that runs:

| Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures |
|---|---|---|---|
| A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop |
| B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop |
| C | MC | MC | (cycle 3 M4 control) |
| D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation |

For each variant, compute "QPU helper value" = (mixed total throughput
in the relevant kernel) − (CPU-only baseline).

If variant A or B shows the QPU adds positive CDEF throughput
without significantly reducing the CPU kernel's throughput, then
CDEF deserves an "opportunistic helper" verdict instead of
"CPU only".
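
The helper-value definition and the "positive QPU throughput without a significant CPU drop" rule could be written down roughly as follows. This is only a sketch: the `mixed_result` struct, every function name, and the `max_drop_frac` threshold parameter are hypothetical and do not come from the existing bench code; throughputs are assumed to be in Mblock/s.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-variant result record; all throughputs in Mblock/s. */
typedef struct {
    double cpu_isolated; /* CPU kernel alone, QPU idle                    */
    double cpu_mixed;    /* same CPU kernel while the QPU runs its kernel */
    double qpu_mixed;    /* QPU kernel throughput during the mixed run    */
} mixed_result;

/* "QPU helper value": mixed total minus the CPU-only baseline. */
double helper_value(double mixed_total, double cpu_only_baseline)
{
    return mixed_total - cpu_only_baseline;
}

/* Signed percentage change of the mixed total against the baseline. */
double pct_delta(double mixed_total, double cpu_only_baseline)
{
    assert(cpu_only_baseline > 0.0);
    return 100.0 * helper_value(mixed_total, cpu_only_baseline)
                 / cpu_only_baseline;
}

/* Fraction of CPU-side throughput lost to contention (0.05 means 5 %). */
double cpu_drop_frac(const mixed_result *r)
{
    assert(r->cpu_isolated > 0.0);
    return (r->cpu_isolated - r->cpu_mixed) / r->cpu_isolated;
}

/* Helper verdict: the QPU contributes something and the CPU kernel
 * loses no more than the caller-chosen fraction max_drop_frac. */
bool is_opportunistic_helper(const mixed_result *r, double max_drop_frac)
{
    return r->qpu_mixed > 0.0 && cpu_drop_frac(r) <= max_drop_frac;
}

/* Formats a verdict in the "+X.X %" style into buf. */
int format_verdict(char *buf, size_t n, double mixed_total, double baseline)
{
    return snprintf(buf, n, "%+.1f %%", pct_delta(mixed_total, baseline));
}
```

Fed the cycle 3 same-kernel figures (12.28 against a 15.25 baseline), `format_verdict` writes `-19.5 %`, matching the Background table. What counts as a "significant" CPU drop is left to the caller on purpose, since the issue does not fix a threshold.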

## Expected outcome

Per the user's "5 % CPU drop / 50 % bored QPU" framing:

- Variant A (bandwidth+bandwidth): QPU contention with
  bandwidth-heavy LPF is real; QPU contribution likely ~70 % of isolation
- Variant B (compute+CDEF): MC is the worst-saturated case from
  cycle 3; QPU likely under-contributes, CPU MC may drop. Net
  result ~ cycle 3 M4 (−19.5 % rerun)
- Variant D (mixed): probably the closest-to-deployment number.
  Best estimate of "additional QPU helper" value.

## Acceptance criteria

- `tests/bench_concurrent_mixed.c` lands, 4 variants measurable
- Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline
- Cycle 3 and cycle 5 deployment recipes updated either way
- `docs/k3_mc_phase7.md §"M4 methodology caveat"` updated with
  results
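
The commit message asks that the Phase 8 V4L2 wrapper keep CDEF and MC slot-able to CPU or QPU at runtime, so that whichever way this issue resolves, the recipe update is a table change. A minimal sketch of such a table is below. Every identifier is hypothetical and nothing here reflects an existing API in the repository; the initial values mirror the current recipe (IDCT and LPF on QPU, MC and CDEF on the CPU baseline).

```c
#include <stddef.h>

/* Hypothetical names throughout; the commit does not define this API. */
typedef enum { SUBSTRATE_CPU, SUBSTRATE_QPU } substrate;
typedef enum { K_IDCT, K_LPF, K_MC, K_CDEF, K_COUNT } kernel_id;

/* Current recipe: bandwidth-bound kernels on QPU, MC and CDEF on CPU. */
substrate dispatch_table[K_COUNT] = {
    [K_IDCT] = SUBSTRATE_QPU,
    [K_LPF]  = SUBSTRATE_QPU,
    [K_MC]   = SUBSTRATE_CPU,
    [K_CDEF] = SUBSTRATE_CPU,
};

/* Re-slot one kernel at runtime. Returns 0 on success, -1 on a bad id. */
int dispatch_set(kernel_id k, substrate s)
{
    if ((int)k < 0 || (int)k >= (int)K_COUNT)
        return -1;
    dispatch_table[k] = s;
    return 0;
}

/* Look up where a kernel currently runs; the caller passes a valid id. */
substrate dispatch_get(kernel_id k)
{
    return dispatch_table[k];
}
```

If the mixed-kernel bench comes back positive for CDEF, the recipe change would amount to `dispatch_set(K_CDEF, SUBSTRATE_QPU)`; a negative result leaves the table untouched.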

## Why deferred

User-directed cycle 5 was CDEF; M4 methodology calibration only
surfaced AFTER cycle 5 close. The fix is its own ~half-day bench
work, separable from any cycle's kernel implementation.

## Related

- `docs/k3_mc_phase7.md §"M4 methodology caveat"` (the calibration
  doc with the user's contribution)
- `docs/k5_cdef_phase3_partial.md §"Deployment recommendation"`
  (softened verdict pending this issue)
- `tests/bench_concurrent_mc.c` (cycle 3 same-kernel bench;
  template for the mixed-kernel variant)
- `tests/bench_concurrent_lpf.c` + `bench_concurrent_lpf8.c`
  (cycle 2/4 bench templates)
- Memory: `feedback_m4_same_kernel_worst_case.md`
@@ -86,15 +86,42 @@ back-end-on-QPU/CPU split for the consumed decoder pipeline:

 - **IDCT (cycle 1)** → QPU. R = 0.92, +7 % mixed, frees a CPU core.
 - **LPF (cycle 2)** → QPU. R = 0.41, +7 % mixed, frees a CPU core.
-- **MC (cycle 3)** → **CPU NEON**. R = 0.067, −19.5 % mixed.
-Compute-bound on CPU but CPU already comfortably exceeds 30fps;
-offload makes things worse.
+- **MC (cycle 3)** → **CPU NEON baseline; QPU offload viable as
+opportunistic helper, not yet measured.** R = 0.067 in isolation
+was discouraging; M4 same-kernel mixed was −19.5 % which looks
+conclusive but isn't; see *M4 methodology caveat* below.
 - **Entropy** (VP9 Bool / AV1 ANS) → CPU. Structurally serial.

 This is a **mixed-substrate deployment**, not a "QPU does everything"
 plan. Realistic for higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF
 dispatched to QPU concurrently; 1-2 ARM cores left for vscode / etc.

+## M4 methodology caveat (added 2026-05-18 after cycle 5)
+
+The M4 mixed bench (`bench_concurrent_mc.c`) tests NEON-3 + QPU
+running the SAME kernel concurrently. This is the **worst case** for
+memory-bandwidth contention: both substrates competing for the same
+bus with the same access pattern.
+
+A real decoder pipeline has different shape: CPU runs entropy + MC
++ other CPU-bound work; QPU runs IDCT + LPF + (potentially) MC as
+opportunistic helper. **Different kernels on different substrates**
+contend less than same-kernel-on-both. Our M4-same-kernel result is
+a pessimistic lower bound, not the actual deployment number.
+
+Empirically supporting this: cycle 3 M4 showed per-core NEON
+throughput in 3-core mode (3.78-4.16 Mblock/s) was higher than in
+4-core mode (3.24-4.48), confirming bandwidth saturation at ≥4
+cores. So freeing 1 core via QPU offload costs ~25 % of total NEON
+MC throughput, but the QPU contributes 0.45 (-MC) or 1.4 (in CDEF
+isolation) on top.
+
+**To rigorously test the helper hypothesis**: see
+`docs/issues/003-mixed-kernel-m4-bench.md`. A bench that runs
+NEON-3 on kernel-A + QPU on kernel-B concurrently would close the
+question. ~½ day of additional bench work; would update the
+deployment recipe for cycles 3 + 5 if the result is positive.
+
 ## Decision per Phase 1 rules + 30fps-floor calibration

 | Rule | Result | Status |
@@ -95,11 +95,27 @@ chasing two layout issues simultaneously).
 - 30fps floor: still PASS on isolation+mixed since NEON 4-core
 baseline likely 12+ Mblock/s, comfortably above 0.972

-**Deployment recommendation** (provisional, pending Phase 4-7):
-CDEF stays on CPU. Same verdict as MC. **All compute-bound kernels
-stay on CPU; all bandwidth-bound (IDCT/LPF) kernels offload to QPU.**
-This is starting to look like a clean classification rule across all
-cycles.
+**Deployment recommendation** (provisional, pending Phase 4-7 +
+Issue 003 mixed-kernel M4): **CDEF baseline = CPU, QPU offload
+viable as opportunistic helper, not measured**.
+
+Same caveat as cycle 3 MC (see `k3_mc_phase7.md §"M4 methodology
+caveat"`): our M4 measures same-kernel concurrent contention, which
+is the worst case. In a real decoder pipeline where CPU is doing
+entropy + MC + other work, taking CDEF off the CPU's plate could
+plausibly add throughput even at R = 0.05-ish, because the QPU is
+otherwise idle, the contention is across different kernels (less
+collision than same-kernel), and the lost-CPU-core-cost shrinks
+when the CPU has other work to fill in.
+
+The **bandwidth-bound vs compute-bound classification rule** still
+holds at the kernel level, but its mapping to deployment is more
+nuanced than "compute-bound → never QPU." Better framing:
+
+- **Bandwidth-bound on QPU** → **definitive** QPU offload (cycle 1+2+4)
+- **Compute-bound on QPU** → **opportunistic** QPU helper if pipeline
+has bandwidth-light CPU work running concurrently (cycle 3+5,
+needs Issue 003 measurement to confirm)

 ## Phase 9 lessons (provisional)
