bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B concurrently. Matrix on hertz: V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in cycles 1-5 was a worst-case contention bound, not a deployment number — user's "5%/50%" framing was correct. Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any contention); cycle 5 CDEF deferred verdict softened to opportunistic helper (NEON-fallback proxy used since cycle 5 Phase 6 not yet built). - tests/bench_concurrent_mixed.c (configurable cpu-kernel / qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU dispatch; CDEF uses NEON-on-core-3 fallback) - CMakeLists.txt: build target wired with all FFmpeg + dav1d sources - docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix - docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4 - docs/k5_cdef_phase3_partial.md: deployment recommendation updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.8 KiB
Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)
Status: CLOSED 2026-05-18 (partial — real QPU CDEF still deferred to cycle 5 Phase 6, but enough data to update deployment recipe)
Type: measurement gap; methodology fix
Verdict shift: cycle 3 MC verdict stands (CPU only); cycle 5 CDEF deserves "opportunistic helper" caveat; cycle 1+2+4 deployment recipe validated by V4 result.
Filed: 2026-05-18
Bench: tests/bench_concurrent_mixed.c (built bench_concurrent_mixed)
Background
Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU" based on M4 measurements showing mixed NEON-3 + QPU running the same kernel ran SLOWER than pure NEON-4. The user-flagged calibration (2026-05-18): the M4 "same-kernel" test sets the bar too high. A "different-kernel" test would more accurately reflect deployment.
Measurement results (hertz, 2026-05-18)
bench_concurrent_mixed matrix, 6-second windows, NEON-3 pinned
to cores 0-2, QPU/fallback worker on core 3:
| # | CPU side | QPU side | CPU agg | QPU contrib |
|---|---|---|---|---|
| V1 | MC NEON-3 | CDEF (NEON fallback, core 3) | 24.49 Mblock/s | 1.75 Mblock/s CDEF |
| V2 | LPF4 NEON-3 | CDEF (NEON fallback, core 3) | 27.28 Medge/s | 1.70 Mblock/s CDEF |
| V3 | MC NEON-3 (control) | MC (real QPU dispatch) | 22.64 Mblock/s | 0.39 Mblock/s MC |
| V4 | MC NEON-3 | LPF4 (real QPU dispatch) | 27.87 Mblock/s | 12.74 Medge/s LPF4 |
| V5 | LPF4 NEON-3 | MC (real QPU dispatch) | 30.82 Medge/s | 0.37 Mblock/s MC |
The "QPU side" cell records the substrate actually used. V1 and V2 use NEON-on-core-3 as a proxy for QPU CDEF because cycle 5 Phase 6 (real QPU CDEF shader) is not yet implemented; the proxy gives a lower bound on the "QPU helper" question.
Cross-variant deltas
Effect on CPU MC throughput when the QPU runs a different kernel:
| QPU kernel | CPU MC agg | delta vs V3 | per-core delta |
|---|---|---|---|
| MC (V3, same-kernel) | 22.64 Mblock/s | — | baseline |
| CDEF NEON fallback (V1) | 24.49 Mblock/s | +8.2 % | +0.6 Mblock/s/core |
| LPF4 real QPU (V4) | 27.87 Mblock/s | +23.1 % | +1.7 Mblock/s/core |
Switching the QPU off MC (the same kernel) onto LPF4 (a different bandwidth-bound kernel) gave the CPU MC side a 23 % per-core throughput uplift — because the QPU stopped contending for the shared memory channel with the same access pattern.
Headline finding — V4 is the validated deployment shape
V4 = NEON-3 doing MC + QPU doing LPF4 is precisely the daedalus-fourier deployment recipe (CPU runs cycle 3 MC; QPU runs cycle 2 LPF4 via the GREEN-band offload). The measurement:
- CPU MC: 27.87 Mblock/s (per-core 8.3-10.0)
- QPU LPF4: 12.74 Medge/s (65 % of QPU LPF4 isolation throughput, 19.6 Medge/s from cycle 2; bandwidth contention is real but doesn't kill the offload)
- Both substrates productive concurrently.
This is the experiment that should have run first; the same-kernel M4 was the wrong comparison. The user was right.
V3 vs V4 — why same-kernel M4 was pessimistic
V3 (cycle 3 same-kernel rerun in this bench): 22.64 CPU MC + 0.39 QPU MC = 23.03 total Mblock/s. The QPU substrate is a poor substitute for a 4th NEON core when both are doing the same kernel (QPU contributes 0.39 vs ~9.0 a 4th NEON core would add).
V4 (different-kernel deployment): 27.87 CPU MC + 12.74 QPU LPF4. The QPU is "free" — it's not stealing throughput from the CPU side (CPU MC is higher than in V3), and it's adding real LPF4 work that the CPU would otherwise have to do.
Conclusion: the same-kernel M4 in cycles 1-5 was a worst-case contention bound. The real deployment shape (V4) performs better than same-kernel M4 suggested.
V1, V2 — CDEF as opportunistic helper
V1/V2 use NEON-on-core-3 (not real QPU) as a proxy because cycle 5 Phase 6 isn't built. The proxy results:
- V1: NEON-core-3 CDEF adds 1.75 Mblock/s while NEON-3 MC delivers 24.49 Mblock/s (slightly higher than V3 control's 22.64, because CDEF is compute-bound so it contends little on the memory bus).
- V2: NEON-core-3 CDEF adds 1.70 Mblock/s while NEON-3 LPF4 delivers 27.28 Medge/s (close to NEON-4 LPF4 isolation 29.47).
So the 4th core CAN run CDEF concurrently without crushing the other 3 cores' MC or LPF work. Whether the actual QPU (after cycle 5 Phase 6 lands) does likewise is unknown:
- QPU CDEF predicted R₅ = 0.02-0.05 → at best 0.05 × 3.9 ≈ 0.2 Mblock/s of CDEF helper. That's an order of magnitude below the NEON-fallback proxy.
- But the QPU substrate would contend on the QPU side of the memory hierarchy; the CPU MC side may be less affected than V1's 24.49 (which had NEON contention).
The conservative read: CDEF stays on CPU as primary path; QPU CDEF dispatch path should exist in the V4L2 wrapper but only used when no IDCT/LPF queue is pending. Re-measure after cycle 5 Phase 6 closes.
V5 — LPF on CPU side with QPU MC
V5 inverts V4: NEON-3 does LPF4, QPU does MC. CPU LPF agg = 30.82 Medge/s (essentially NEON-4 isolation), QPU MC adds 0.37 Mblock/s. This is the wrong deployment — QPU has no comparative advantage for MC, and the LPF kernel that should go to QPU stays on CPU. Confirms that cycle 2 LPF belongs on QPU, not the other way around.
Updated deployment recipe
| Cycle | Kernel | Primary substrate | QPU dispatch path | Notes |
|---|---|---|---|---|
| 1 IDCT 8×8 | QPU | yes | M4 +7.2 % validated | |
| 2 LPF wd=4 | QPU | yes | M4 +6.9 % validated; V4 confirms under MC contention | |
| 3 MC 8h | CPU | optional / unused | QPU MC contributes 0.39 Mblock/s under any contention scenario — keep dispatch path but don't enqueue | |
| 4 LPF wd=8 | QPU | yes | M4 +4.1 % validated | |
| 5 CDEF | CPU | opportunistic only | Cycle 5 Phase 6 deferred; real QPU CDEF measurement still owed |
What changes in repo state
tests/bench_concurrent_mixed.clands (~470 LOC).CMakeLists.txtbuildsbench_concurrent_mixedtarget with all the FFmpeg + dav1d NEON sources.docs/k3_mc_phase7.md§ "M4 methodology caveat" updated with V3 vs V4 deltas.docs/k5_cdef_phase3_partial.md§ "Deployment recommendation" updated with V1/V2 fallback-proxy results.- Memory
feedback_m4_same_kernel_worst_case.mdannotated with closing numbers.
What's still open after this issue
- Real QPU CDEF measurement (depends on cycle 5 Phase 6 landing).
- Variant D (mixed LPF+MC alternating CPU work) skipped — the V1 vs V4 contrast already answers the deployment question.
- Phase 8 V4L2 wrapper should follow the recipe table above: dispatch paths for ALL kernels exist; the scheduler chooses per-kernel based on the validated recipe.