Files
daedalus-fourier/docs/issues/003-mixed-kernel-m4-bench.md
T
marfrit 2dd774a9ab Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B
concurrently. Matrix on hertz:

  V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s
  V4 (CPU MC + QPU LPF4):            CPU 27.87 + QPU 12.74 Medge/s
  V1 (CPU MC + NEON-fb CDEF):        CPU 24.49 + 1.75 Mblock/s CDEF
  V2 (CPU LPF4 + NEON-fb CDEF):      CPU 27.28 Medge + 1.70 Mblock/s

V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs
LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU
MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in
cycles 1-5 was a worst-case contention bound, not a deployment
number — user's "5%/50%" framing was correct.

Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any
contention); cycle 5 CDEF deferred verdict softened to
opportunistic helper (NEON-fallback proxy used since cycle 5
Phase 6 not yet built).

- tests/bench_concurrent_mixed.c (configurable cpu-kernel /
  qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU
  dispatch; CDEF uses NEON-on-core-3 fallback)
- CMakeLists.txt: build target wired with all FFmpeg + dav1d sources
- docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix
- docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4
- docs/k5_cdef_phase3_partial.md: deployment recommendation updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:44:08 +00:00

149 lines
6.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)
**Status**: **CLOSED 2026-05-18** (partial — real QPU CDEF still deferred to cycle 5 Phase 6, but enough data to update deployment recipe)
**Type**: measurement gap; methodology fix
**Verdict shift**: cycle 3 MC verdict stands (CPU only); cycle 5 CDEF deserves "opportunistic helper" caveat; cycle 1+2+4 deployment recipe **validated by V4 result**.
**Filed**: 2026-05-18
**Bench**: `tests/bench_concurrent_mixed.c` (built `bench_concurrent_mixed`)
## Background
Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU"
based on M4 measurements showing mixed NEON-3 + QPU running the
**same kernel** ran SLOWER than pure NEON-4. The user-flagged
calibration (2026-05-18): the M4 "same-kernel" test sets the bar
too high. A "different-kernel" test would more accurately reflect
deployment.
## Measurement results (hertz, 2026-05-18)
`bench_concurrent_mixed` matrix, 6-second windows, NEON-3 pinned
to cores 0-2, QPU/fallback worker on core 3:
| # | CPU side | QPU side | CPU agg | QPU contrib |
|---|---------------------------|--------------------------------|-------------|--------------|
|V1 | MC NEON-3 | CDEF (NEON fallback, core 3) | 24.49 Mblock/s | 1.75 Mblock/s CDEF |
|V2 | LPF4 NEON-3 | CDEF (NEON fallback, core 3) | 27.28 Medge/s | 1.70 Mblock/s CDEF |
|V3 | MC NEON-3 (**control**) | MC (real QPU dispatch) | 22.64 Mblock/s | 0.39 Mblock/s MC |
|V4 | MC NEON-3 | LPF4 (real QPU dispatch) | 27.87 Mblock/s | 12.74 Medge/s LPF4 |
|V5 | LPF4 NEON-3 | MC (real QPU dispatch) | 30.82 Medge/s | 0.37 Mblock/s MC |
The "QPU side" cell records the substrate actually used.
**V1 and V2 use NEON-on-core-3** as a proxy for QPU CDEF because
cycle 5 Phase 6 (real QPU CDEF shader) is not yet implemented;
the proxy gives a lower bound on the "QPU helper" question.
## Cross-variant deltas
**Effect on CPU MC throughput when the QPU runs a different kernel:**
| QPU kernel | CPU MC agg | delta vs V3 | per-core delta |
|---|---|---|---|
| MC (V3, same-kernel) | 22.64 Mblock/s | — | baseline |
| CDEF NEON fallback (V1) | 24.49 Mblock/s | +8.2 % | +0.6 Mblock/s/core |
| LPF4 real QPU (V4) | 27.87 Mblock/s | **+23.1 %** | +1.7 Mblock/s/core |
Switching the QPU off MC (the same kernel) onto LPF4 (a different
bandwidth-bound kernel) gave the CPU MC side a **23 % per-core
throughput uplift** — because the QPU stopped contending for the
shared memory channel with the same access pattern.
## Headline finding — V4 is the validated deployment shape
**V4 = NEON-3 doing MC + QPU doing LPF4** is precisely the
daedalus-fourier deployment recipe (CPU runs cycle 3 MC; QPU runs
cycle 2 LPF4 via the GREEN-band offload). The measurement:
- CPU MC: 27.87 Mblock/s (per-core 8.3-10.0)
- QPU LPF4: 12.74 Medge/s (65 % of QPU LPF4 isolation throughput,
19.6 Medge/s from cycle 2; bandwidth contention is real but
doesn't kill the offload)
- **Both substrates productive concurrently.**
This is the experiment that should have run *first*; the
same-kernel M4 was the wrong comparison. The user was right.
## V3 vs V4 — why same-kernel M4 was pessimistic
V3 (cycle 3 same-kernel rerun in this bench): 22.64 CPU MC + 0.39
QPU MC = 23.03 total Mblock/s. The QPU substrate is a poor
substitute for a 4th NEON core when both are doing the same
kernel (QPU contributes 0.39 vs ~9.0 a 4th NEON core would add).
V4 (different-kernel deployment): 27.87 CPU MC + 12.74 QPU LPF4.
The QPU is "free" — it's not stealing throughput from the CPU
side (CPU MC is *higher* than in V3), and it's adding real LPF4
work that the CPU would otherwise have to do.
**Conclusion**: the same-kernel M4 in cycles 1-5 was a
worst-case contention bound. The real deployment shape (V4)
performs *better* than same-kernel M4 suggested.
## V1, V2 — CDEF as opportunistic helper
V1/V2 use NEON-on-core-3 (not real QPU) as a proxy because cycle
5 Phase 6 isn't built. The proxy results:
- V1: NEON-core-3 CDEF adds **1.75 Mblock/s** while NEON-3 MC
delivers 24.49 Mblock/s (slightly *higher* than V3 control's
22.64, because CDEF is compute-bound so it contends little on
the memory bus).
- V2: NEON-core-3 CDEF adds **1.70 Mblock/s** while NEON-3 LPF4
delivers 27.28 Medge/s (close to NEON-4 LPF4 isolation 29.47).
So **the 4th core CAN run CDEF concurrently** without crushing
the other 3 cores' MC or LPF work. Whether the actual *QPU*
(after cycle 5 Phase 6 lands) does likewise is unknown:
- QPU CDEF predicted R₅ = 0.02-0.05 → at best 0.05 × 3.9
≈ 0.2 Mblock/s of CDEF helper. That's an order of magnitude
*below* the NEON-fallback proxy.
- But the QPU substrate would contend on the QPU side of the
memory hierarchy; the CPU MC side may be *less* affected than
V1's 24.49 (which had NEON contention).
The conservative read: **CDEF stays on CPU as primary path; QPU
CDEF dispatch path should exist in the V4L2 wrapper but only used
when no IDCT/LPF queue is pending**. Re-measure after cycle 5
Phase 6 closes.
## V5 — LPF on CPU side with QPU MC
V5 inverts V4: NEON-3 does LPF4, QPU does MC. CPU LPF agg =
30.82 Medge/s (essentially NEON-4 isolation), QPU MC adds 0.37
Mblock/s. This is the **wrong deployment** — QPU has no comparative
advantage for MC, and the LPF kernel that *should* go to QPU
stays on CPU. Confirms that cycle 2 LPF belongs on QPU, not the
other way around.
## Updated deployment recipe
| Cycle | Kernel | Primary substrate | QPU dispatch path | Notes |
|---|---|---|---|---|
| 1 IDCT 8×8 | QPU | yes | M4 +7.2 % validated |
| 2 LPF wd=4 | QPU | yes | M4 +6.9 % validated; **V4 confirms under MC contention** |
| 3 MC 8h | **CPU** | optional / unused | QPU MC contributes 0.39 Mblock/s under any contention scenario — keep dispatch path but don't enqueue |
| 4 LPF wd=8 | QPU | yes | M4 +4.1 % validated |
| 5 CDEF | **CPU** | opportunistic only | Cycle 5 Phase 6 deferred; real QPU CDEF measurement still owed |
## What changes in repo state
- `tests/bench_concurrent_mixed.c` lands (~470 LOC).
- `CMakeLists.txt` builds `bench_concurrent_mixed` target with all
the FFmpeg + dav1d NEON sources.
- `docs/k3_mc_phase7.md` § "M4 methodology caveat" updated with V3
vs V4 deltas.
- `docs/k5_cdef_phase3_partial.md` § "Deployment recommendation"
updated with V1/V2 fallback-proxy results.
- Memory `feedback_m4_same_kernel_worst_case.md` annotated with
closing numbers.
## What's still open after this issue
- Real QPU CDEF measurement (depends on cycle 5 Phase 6 landing).
- Variant D (mixed LPF+MC alternating CPU work) skipped — the V1
vs V4 contrast already answers the deployment question.
- Phase 8 V4L2 wrapper should follow the recipe table above:
dispatch paths for ALL kernels exist; the scheduler chooses
per-kernel based on the validated recipe.