User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently — different access patterns contend less.
Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.
Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
softened from 'CPU only' to 'CPU baseline; QPU helper viable in
mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous
test to close the question (4 variants: bandwidth+bandwidth,
compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/
feedback_m4_same_kernel_worst_case.md added — carries the
calibration into future cycles + Phase 8 deployment decisions
- MEMORY.md index updated
The bandwidth-bound vs compute-bound classification still holds at
the kernel level — Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
- Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
- Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
bandwidth-light CPU work running concurrently (cycles 3+5,
needs Issue 003 measurement)
Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.
No code changes. Doc + memory + issue only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3.5 KiB
Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)
Status: open, blocks Phase 8 deployment plumbing for cycles 3+5 Type: measurement gap; methodology fix Predicted verdict: cycle 3 MC + cycle 5 CDEF may flip from "CPU only" to "opportunistic QPU helper" Priority: medium (changes deployment recipe; doesn't block other cycles) Filed: 2026-05-18
Background
Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU" based on M4 measurements showing mixed NEON-3 + QPU running the same kernel ran SLOWER than pure NEON-4. Specifically:
| NEON-4 | NEON-3 + QPU | delta | |
|---|---|---|---|
| Cycle 3 MC | 15.25 Mblock/s | 12.28 | −19.5 % |
| Cycle 5 CDEF (predicted) | ~ 12-15 | ~ 10-12 | negative |
But this is the worst-case contention scenario: both substrates competing for the same memory bus with the same access pattern.
Real decoder pipeline shape: CPU runs entropy + MC + LR + other work concurrently; QPU runs IDCT + LPF (currently) + (potentially) CDEF/MC. Different kernels on different substrates contend less than same-kernel-on-both.
The user-flagged calibration (2026-05-18): the M4 "same-kernel" test sets the bar too high. A "different-kernel" test would more accurately reflect deployment.
What to measure
A new bench harness tests/bench_concurrent_mixed.c that runs:
| Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures |
|---|---|---|---|
| A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop |
| B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop |
| C | MC | MC | (cycle 3 M4 control) |
| D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation |
Compute "QPU helper value" = (mixed total throughput in the relevant kernel) − (CPU-only baseline) for each variant.
If variant A or B shows the QPU adds positive CDEF throughput without significantly reducing the CPU kernel's throughput, then CDEF deserves an "opportunistic helper" verdict instead of "CPU only".
Expected outcome
Per the user's "5 % CPU drop / 50 % bored QPU" framing:
- Variant A (bandwidth+bandwidth): QPU contention with bandwidth- heavy LPF is real; QPU contribution likely ~70 % of isolation
- Variant B (compute+CDEF): MC is the worst-saturated case from cycle 3; QPU likely under-contributes, CPU MC may drop. Net result ~ cycle 3 M4 (−19.5 % rerun)
- Variant D (mixed): probably the closest-to-deployment number. Best estimate of "additional QPU helper" value.
Acceptance criteria
tests/bench_concurrent_mixed.clands, 4 variants measurable- Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline
- Cycle 3 and cycle 5 deployment recipes updated either way
docs/k3_mc_phase7.md §"M4 methodology caveat"updated with results
Why deferred
User-directed cycle 5 was CDEF; M4 methodology calibration only surfaced AFTER cycle 5 close. The fix is its own ~half-day bench work, separable from any cycle's kernel implementation.
Related
docs/k3_mc_phase7.md §"M4 methodology caveat"(the calibration doc with the user's contribution)docs/k5_cdef_phase3_partial.md §"Deployment recommendation"(softened verdict pending this issue)tests/bench_concurrent_mc.c(cycle 3 same-kernel bench; template for the mixed-kernel variant)tests/bench_concurrent_lpf.c+bench_concurrent_lpf8.c(cycle 2/4 bench templates)- Memory:
feedback_m4_same_kernel_worst_case.md