Files
daedalus-fourier/docs/issues/003-mixed-kernel-m4-bench.md
marfrit 2dd774a9ab Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B
concurrently. Matrix on hertz:

  V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s
  V4 (CPU MC + QPU LPF4):            CPU 27.87 + QPU 12.74 Medge/s
  V1 (CPU MC + NEON-fb CDEF):        CPU 24.49 + 1.75 Mblock/s CDEF
  V2 (CPU LPF4 + NEON-fb CDEF):      CPU 27.28 Medge + 1.70 Mblock/s

V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs
LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU
MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in
cycles 1-5 was a worst-case contention bound, not a deployment
number — user's "5%/50%" framing was correct.

Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any
contention); cycle 5 CDEF deferred verdict softened to
opportunistic helper (NEON-fallback proxy used since cycle 5
Phase 6 not yet built).

- tests/bench_concurrent_mixed.c (configurable cpu-kernel /
  qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU
  dispatch; CDEF uses NEON-on-core-3 fallback)
- CMakeLists.txt: build target wired with all FFmpeg + dav1d sources
- docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix
- docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4
- docs/k5_cdef_phase3_partial.md: deployment recommendation updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:44:08 +00:00

6.8 KiB
Raw Permalink Blame History

Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)

Status: CLOSED 2026-05-18 (partial — real QPU CDEF still deferred to cycle 5 Phase 6, but enough data to update deployment recipe) Type: measurement gap; methodology fix Verdict shift: cycle 3 MC verdict stands (CPU only); cycle 5 CDEF deserves "opportunistic helper" caveat; cycle 1+2+4 deployment recipe validated by V4 result. Filed: 2026-05-18 Bench: tests/bench_concurrent_mixed.c (built bench_concurrent_mixed)

Background

Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU" based on M4 measurements showing mixed NEON-3 + QPU running the same kernel ran SLOWER than pure NEON-4. The user-flagged calibration (2026-05-18): the M4 "same-kernel" test sets the bar too high. A "different-kernel" test would more accurately reflect deployment.

Measurement results (hertz, 2026-05-18)

bench_concurrent_mixed matrix, 6-second windows, NEON-3 pinned to cores 0-2, QPU/fallback worker on core 3:

# CPU side QPU side CPU agg QPU contrib
V1 MC NEON-3 CDEF (NEON fallback, core 3) 24.49 Mblock/s 1.75 Mblock/s CDEF
V2 LPF4 NEON-3 CDEF (NEON fallback, core 3) 27.28 Medge/s 1.70 Mblock/s CDEF
V3 MC NEON-3 (control) MC (real QPU dispatch) 22.64 Mblock/s 0.39 Mblock/s MC
V4 MC NEON-3 LPF4 (real QPU dispatch) 27.87 Mblock/s 12.74 Medge/s LPF4
V5 LPF4 NEON-3 MC (real QPU dispatch) 30.82 Medge/s 0.37 Mblock/s MC

The "QPU side" cell records the substrate actually used. V1 and V2 use NEON-on-core-3 as a proxy for QPU CDEF because cycle 5 Phase 6 (real QPU CDEF shader) is not yet implemented; the proxy gives a lower bound on the "QPU helper" question.

Cross-variant deltas

Effect on CPU MC throughput when the QPU runs a different kernel:

QPU kernel CPU MC agg delta vs V3 per-core delta
MC (V3, same-kernel) 22.64 Mblock/s baseline
CDEF NEON fallback (V1) 24.49 Mblock/s +8.2 % +0.6 Mblock/s/core
LPF4 real QPU (V4) 27.87 Mblock/s +23.1 % +1.7 Mblock/s/core

Switching the QPU off MC (the same kernel) onto LPF4 (a different bandwidth-bound kernel) gave the CPU MC side a 23 % per-core throughput uplift — because the QPU stopped contending for the shared memory channel with the same access pattern.

Headline finding — V4 is the validated deployment shape

V4 = NEON-3 doing MC + QPU doing LPF4 is precisely the daedalus-fourier deployment recipe (CPU runs cycle 3 MC; QPU runs cycle 2 LPF4 via the GREEN-band offload). The measurement:

  • CPU MC: 27.87 Mblock/s (per-core 8.3-10.0)
  • QPU LPF4: 12.74 Medge/s (65 % of QPU LPF4 isolation throughput, 19.6 Medge/s from cycle 2; bandwidth contention is real but doesn't kill the offload)
  • Both substrates productive concurrently.

This is the experiment that should have run first; the same-kernel M4 was the wrong comparison. The user was right.

V3 vs V4 — why same-kernel M4 was pessimistic

V3 (cycle 3 same-kernel rerun in this bench): 22.64 CPU MC + 0.39 QPU MC = 23.03 total Mblock/s. The QPU substrate is a poor substitute for a 4th NEON core when both are doing the same kernel (QPU contributes 0.39 vs ~9.0 a 4th NEON core would add).

V4 (different-kernel deployment): 27.87 CPU MC + 12.74 QPU LPF4. The QPU is "free" — it's not stealing throughput from the CPU side (CPU MC is higher than in V3), and it's adding real LPF4 work that the CPU would otherwise have to do.

Conclusion: the same-kernel M4 in cycles 1-5 was a worst-case contention bound. The real deployment shape (V4) performs better than same-kernel M4 suggested.

V1, V2 — CDEF as opportunistic helper

V1/V2 use NEON-on-core-3 (not real QPU) as a proxy because cycle 5 Phase 6 isn't built. The proxy results:

  • V1: NEON-core-3 CDEF adds 1.75 Mblock/s while NEON-3 MC delivers 24.49 Mblock/s (slightly higher than V3 control's 22.64, because CDEF is compute-bound so it contends little on the memory bus).
  • V2: NEON-core-3 CDEF adds 1.70 Mblock/s while NEON-3 LPF4 delivers 27.28 Medge/s (close to NEON-4 LPF4 isolation 29.47).

So the 4th core CAN run CDEF concurrently without crushing the other 3 cores' MC or LPF work. Whether the actual QPU (after cycle 5 Phase 6 lands) does likewise is unknown:

  • QPU CDEF predicted R₅ = 0.02-0.05 → at best 0.05 × 3.9 ≈ 0.2 Mblock/s of CDEF helper. That's an order of magnitude below the NEON-fallback proxy.
  • But the QPU substrate would contend on the QPU side of the memory hierarchy; the CPU MC side may be less affected than V1's 24.49 (which had NEON contention).

The conservative read: CDEF stays on CPU as primary path; QPU CDEF dispatch path should exist in the V4L2 wrapper but only used when no IDCT/LPF queue is pending. Re-measure after cycle 5 Phase 6 closes.

V5 — LPF on CPU side with QPU MC

V5 inverts V4: NEON-3 does LPF4, QPU does MC. CPU LPF agg = 30.82 Medge/s (essentially NEON-4 isolation), QPU MC adds 0.37 Mblock/s. This is the wrong deployment — QPU has no comparative advantage for MC, and the LPF kernel that should go to QPU stays on CPU. Confirms that cycle 2 LPF belongs on QPU, not the other way around.

Updated deployment recipe

Cycle Kernel Primary substrate QPU dispatch path Notes
1 IDCT 8×8 QPU yes M4 +7.2 % validated
2 LPF wd=4 QPU yes M4 +6.9 % validated; V4 confirms under MC contention
3 MC 8h CPU optional / unused QPU MC contributes 0.39 Mblock/s under any contention scenario — keep dispatch path but don't enqueue
4 LPF wd=8 QPU yes M4 +4.1 % validated
5 CDEF CPU opportunistic only Cycle 5 Phase 6 deferred; real QPU CDEF measurement still owed

What changes in repo state

  • tests/bench_concurrent_mixed.c lands (~470 LOC).
  • CMakeLists.txt builds bench_concurrent_mixed target with all the FFmpeg + dav1d NEON sources.
  • docs/k3_mc_phase7.md § "M4 methodology caveat" updated with V3 vs V4 deltas.
  • docs/k5_cdef_phase3_partial.md § "Deployment recommendation" updated with V1/V2 fallback-proxy results.
  • Memory feedback_m4_same_kernel_worst_case.md annotated with closing numbers.

What's still open after this issue

  • Real QPU CDEF measurement (depends on cycle 5 Phase 6 landing).
  • Variant D (mixed LPF+MC alternating CPU work) skipped — the V1 vs V4 contrast already answers the deployment question.
  • Phase 8 V4L2 wrapper should follow the recipe table above: dispatch paths for ALL kernels exist; the scheduler chooses per-kernel based on the validated recipe.