Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B concurrently. Matrix on hertz: V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in cycles 1-5 was a worst-case contention bound, not a deployment number — user's "5%/50%" framing was correct. Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any contention); cycle 5 CDEF deferred verdict softened to opportunistic helper (NEON-fallback proxy used since cycle 5 Phase 6 not yet built). - tests/bench_concurrent_mixed.c (configurable cpu-kernel / qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU dispatch; CDEF uses NEON-on-core-3 fallback) - CMakeLists.txt: build target wired with all FFmpeg + dav1d sources - docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix - docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4 - docs/k5_cdef_phase3_partial.md: deployment recommendation updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,87 +1,148 @@
|
||||
# Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)
|
||||
|
||||
**Status**: open, blocks Phase 8 deployment plumbing for cycles 3+5
|
||||
**Status**: **CLOSED 2026-05-18** (partial — real QPU CDEF still deferred to cycle 5 Phase 6, but enough data to update deployment recipe)
|
||||
**Type**: measurement gap; methodology fix
|
||||
**Predicted verdict**: cycle 3 MC + cycle 5 CDEF may flip from
|
||||
"CPU only" to "opportunistic QPU helper"
|
||||
**Priority**: medium (changes deployment recipe; doesn't block other cycles)
|
||||
**Verdict shift**: cycle 3 MC verdict stands (CPU only); cycle 5 CDEF deserves "opportunistic helper" caveat; cycle 1+2+4 deployment recipe **validated by V4 result**.
|
||||
**Filed**: 2026-05-18
|
||||
**Bench**: `tests/bench_concurrent_mixed.c` (built `bench_concurrent_mixed`)
|
||||
|
||||
## Background
|
||||
|
||||
Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU"
|
||||
based on M4 measurements showing mixed NEON-3 + QPU running the
|
||||
**same kernel** ran SLOWER than pure NEON-4. Specifically:
|
||||
**same kernel** ran SLOWER than pure NEON-4. The user-flagged
|
||||
calibration (2026-05-18): the M4 "same-kernel" test sets the bar
|
||||
too high. A "different-kernel" test would more accurately reflect
|
||||
deployment.
|
||||
|
||||
| | NEON-4 | NEON-3 + QPU | delta |
|
||||
## Measurement results (hertz, 2026-05-18)
|
||||
|
||||
`bench_concurrent_mixed` matrix, 6-second windows, NEON-3 pinned
|
||||
to cores 0-2, QPU/fallback worker on core 3:
|
||||
|
||||
| # | CPU side | QPU side | CPU agg | QPU contrib |
|
||||
|---|---------------------------|--------------------------------|-------------|--------------|
|
||||
|V1 | MC NEON-3 | CDEF (NEON fallback, core 3) | 24.49 Mblock/s | 1.75 Mblock/s CDEF |
|
||||
|V2 | LPF4 NEON-3 | CDEF (NEON fallback, core 3) | 27.28 Medge/s | 1.70 Mblock/s CDEF |
|
||||
|V3 | MC NEON-3 (**control**) | MC (real QPU dispatch) | 22.64 Mblock/s | 0.39 Mblock/s MC |
|
||||
|V4 | MC NEON-3 | LPF4 (real QPU dispatch) | 27.87 Mblock/s | 12.74 Medge/s LPF4 |
|
||||
|V5 | LPF4 NEON-3 | MC (real QPU dispatch) | 30.82 Medge/s | 0.37 Mblock/s MC |
|
||||
|
||||
The "QPU side" cell records the substrate actually used.
|
||||
**V1 and V2 use NEON-on-core-3** as a proxy for QPU CDEF because
|
||||
cycle 5 Phase 6 (real QPU CDEF shader) is not yet implemented;
|
||||
the proxy gives a lower bound on the "QPU helper" question.
|
||||
|
||||
## Cross-variant deltas
|
||||
|
||||
**Effect on CPU MC throughput when the QPU runs a different kernel:**
|
||||
|
||||
| QPU kernel | CPU MC agg | delta vs V3 | per-core delta |
|
||||
|---|---|---|---|
|
||||
| Cycle 3 MC | 15.25 Mblock/s | 12.28 | **−19.5 %** |
|
||||
| Cycle 5 CDEF (predicted) | ~ 12-15 | ~ 10-12 | negative |
|
||||
| MC (V3, same-kernel) | 22.64 Mblock/s | — | baseline |
|
||||
| CDEF NEON fallback (V1) | 24.49 Mblock/s | +8.2 % | +0.6 Mblock/s/core |
|
||||
| LPF4 real QPU (V4) | 27.87 Mblock/s | **+23.1 %** | +1.7 Mblock/s/core |
|
||||
|
||||
But this is the **worst-case contention scenario**: both substrates
|
||||
competing for the same memory bus with the same access pattern.
|
||||
Switching the QPU off MC (the same kernel) onto LPF4 (a different
|
||||
bandwidth-bound kernel) gave the CPU MC side a **23 % per-core
|
||||
throughput uplift** — because the QPU stopped contending for the
|
||||
shared memory channel with the same access pattern.
|
||||
|
||||
**Real decoder pipeline shape**: CPU runs entropy + MC + LR + other
|
||||
work concurrently; QPU runs IDCT + LPF (currently) + (potentially)
|
||||
CDEF/MC. Different kernels on different substrates contend
|
||||
*less* than same-kernel-on-both.
|
||||
## Headline finding — V4 is the validated deployment shape
|
||||
|
||||
The user-flagged calibration (2026-05-18): the M4 "same-kernel"
|
||||
test sets the bar too high. A "different-kernel" test would more
|
||||
accurately reflect deployment.
|
||||
**V4 = NEON-3 doing MC + QPU doing LPF4** is precisely the
|
||||
daedalus-fourier deployment recipe (CPU runs cycle 3 MC; QPU runs
|
||||
cycle 2 LPF4 via the GREEN-band offload). The measurement:
|
||||
|
||||
## What to measure
|
||||
- CPU MC: 27.87 Mblock/s (per-core 8.3-10.0)
|
||||
- QPU LPF4: 12.74 Medge/s (65 % of QPU LPF4 isolation throughput,
|
||||
19.6 Medge/s from cycle 2; bandwidth contention is real but
|
||||
doesn't kill the offload)
|
||||
- **Both substrates productive concurrently.**
|
||||
|
||||
A new bench harness `tests/bench_concurrent_mixed.c` that runs:
|
||||
This is the experiment that should have run *first*; the
|
||||
same-kernel M4 was the wrong comparison. The user was right.
|
||||
|
||||
| Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures |
|
||||
|---|---|---|---|
|
||||
| A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop |
|
||||
| B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop |
|
||||
| C | MC | MC | (cycle 3 M4 control) |
|
||||
| D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation |
|
||||
## V3 vs V4 — why same-kernel M4 was pessimistic
|
||||
|
||||
Compute "QPU helper value" = (mixed total throughput in the relevant
|
||||
kernel) − (CPU-only baseline) for each variant.
|
||||
V3 (cycle 3 same-kernel rerun in this bench): 22.64 CPU MC + 0.39
|
||||
QPU MC = 23.03 total Mblock/s. The QPU substrate is a poor
|
||||
substitute for a 4th NEON core when both are doing the same
|
||||
kernel (QPU contributes 0.39 vs ~9.0 a 4th NEON core would add).
|
||||
|
||||
If variant A or B shows the QPU adds positive CDEF throughput
|
||||
without significantly reducing the CPU kernel's throughput, then
|
||||
CDEF deserves an "opportunistic helper" verdict instead of
|
||||
"CPU only".
|
||||
V4 (different-kernel deployment): 27.87 CPU MC + 12.74 QPU LPF4.
|
||||
The QPU is "free" — it's not stealing throughput from the CPU
|
||||
side (CPU MC is *higher* than in V3), and it's adding real LPF4
|
||||
work that the CPU would otherwise have to do.
|
||||
|
||||
## Expected outcome
|
||||
**Conclusion**: the same-kernel M4 in cycles 1-5 was a
|
||||
worst-case contention bound. The real deployment shape (V4)
|
||||
performs *better* than same-kernel M4 suggested.
|
||||
|
||||
Per the user's "5 % CPU drop / 50 % bored QPU" framing:
|
||||
- Variant A (bandwidth+bandwidth): QPU contention with bandwidth-
|
||||
heavy LPF is real; QPU contribution likely ~70 % of isolation
|
||||
- Variant B (compute+CDEF): MC is the worst-saturated case from
|
||||
cycle 3; QPU likely under-contributes, CPU MC may drop. Net
|
||||
result ~ cycle 3 M4 (−19.5 % rerun)
|
||||
- Variant D (mixed): probably the closest-to-deployment number.
|
||||
Best estimate of "additional QPU helper" value.
|
||||
## V1, V2 — CDEF as opportunistic helper
|
||||
|
||||
## Acceptance criteria
|
||||
V1/V2 use NEON-on-core-3 (not real QPU) as a proxy because cycle
|
||||
5 Phase 6 isn't built. The proxy results:
|
||||
|
||||
- `tests/bench_concurrent_mixed.c` lands, 4 variants measurable
|
||||
- Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline
|
||||
- Cycle 3 and cycle 5 deployment recipes updated either way
|
||||
- `docs/k3_mc_phase7.md §"M4 methodology caveat"` updated with
|
||||
results
|
||||
- V1: NEON-core-3 CDEF adds **1.75 Mblock/s** while NEON-3 MC
|
||||
delivers 24.49 Mblock/s (slightly *higher* than V3 control's
|
||||
22.64, because CDEF is compute-bound so it contends little on
|
||||
the memory bus).
|
||||
- V2: NEON-core-3 CDEF adds **1.70 Mblock/s** while NEON-3 LPF4
|
||||
delivers 27.28 Medge/s (close to NEON-4 LPF4 isolation 29.47).
|
||||
|
||||
## Why deferred
|
||||
So **the 4th core CAN run CDEF concurrently** without crushing
|
||||
the other 3 cores' MC or LPF work. Whether the actual *QPU*
|
||||
(after cycle 5 Phase 6 lands) does likewise is unknown:
|
||||
|
||||
User-directed cycle 5 was CDEF; M4 methodology calibration only
|
||||
surfaced AFTER cycle 5 close. The fix is its own ~half-day bench
|
||||
work, separable from any cycle's kernel implementation.
|
||||
- QPU CDEF predicted R₅ = 0.02-0.05 → at best 0.05 × 3.9
|
||||
≈ 0.2 Mblock/s of CDEF helper. That's an order of magnitude
|
||||
*below* the NEON-fallback proxy.
|
||||
- But the QPU substrate would contend on the QPU side of the
|
||||
memory hierarchy; the CPU MC side may be *less* affected than
|
||||
V1's 24.49 (which had NEON contention).
|
||||
|
||||
## Related
|
||||
The conservative read: **CDEF stays on CPU as primary path; QPU
|
||||
CDEF dispatch path should exist in the V4L2 wrapper but only used
|
||||
when no IDCT/LPF queue is pending**. Re-measure after cycle 5
|
||||
Phase 6 closes.
|
||||
|
||||
- `docs/k3_mc_phase7.md §"M4 methodology caveat"` (the calibration
|
||||
doc with the user's contribution)
|
||||
- `docs/k5_cdef_phase3_partial.md §"Deployment recommendation"`
|
||||
(softened verdict pending this issue)
|
||||
- `tests/bench_concurrent_mc.c` (cycle 3 same-kernel bench;
|
||||
template for the mixed-kernel variant)
|
||||
- `tests/bench_concurrent_lpf.c` + `bench_concurrent_lpf8.c`
|
||||
(cycle 2/4 bench templates)
|
||||
- Memory: `feedback_m4_same_kernel_worst_case.md`
|
||||
## V5 — LPF on CPU side with QPU MC
|
||||
|
||||
V5 inverts V4: NEON-3 does LPF4, QPU does MC. CPU LPF agg =
|
||||
30.82 Medge/s (essentially NEON-4 isolation), QPU MC adds 0.37
|
||||
Mblock/s. This is the **wrong deployment** — QPU has no comparative
|
||||
advantage for MC, and the LPF kernel that *should* go to QPU
|
||||
stays on CPU. Confirms that cycle 2 LPF belongs on QPU, not the
|
||||
other way around.
|
||||
|
||||
## Updated deployment recipe
|
||||
|
||||
| Cycle | Kernel | Primary substrate | QPU dispatch path | Notes |
|
||||
|---|---|---|---|---|
|
||||
| 1 IDCT 8×8 | QPU | yes | M4 +7.2 % validated |
|
||||
| 2 LPF wd=4 | QPU | yes | M4 +6.9 % validated; **V4 confirms under MC contention** |
|
||||
| 3 MC 8h | **CPU** | optional / unused | QPU MC contributes 0.39 Mblock/s under any contention scenario — keep dispatch path but don't enqueue |
|
||||
| 4 LPF wd=8 | QPU | yes | M4 +4.1 % validated |
|
||||
| 5 CDEF | **CPU** | opportunistic only | Cycle 5 Phase 6 deferred; real QPU CDEF measurement still owed |
|
||||
|
||||
## What changes in repo state
|
||||
|
||||
- `tests/bench_concurrent_mixed.c` lands (~470 LOC).
|
||||
- `CMakeLists.txt` builds `bench_concurrent_mixed` target with all
|
||||
the FFmpeg + dav1d NEON sources.
|
||||
- `docs/k3_mc_phase7.md` § "M4 methodology caveat" updated with V3
|
||||
vs V4 deltas.
|
||||
- `docs/k5_cdef_phase3_partial.md` § "Deployment recommendation"
|
||||
updated with V1/V2 fallback-proxy results.
|
||||
- Memory `feedback_m4_same_kernel_worst_case.md` annotated with
|
||||
closing numbers.
|
||||
|
||||
## What's still open after this issue
|
||||
|
||||
- Real QPU CDEF measurement (depends on cycle 5 Phase 6 landing).
|
||||
- Variant D (mixed LPF+MC alternating CPU work) skipped — the V1
|
||||
vs V4 contrast already answers the deployment question.
|
||||
- Phase 8 V4L2 wrapper should follow the recipe table above:
|
||||
dispatch paths for ALL kernels exist; the scheduler chooses
|
||||
per-kernel based on the validated recipe.
|
||||
|
||||
Reference in New Issue
Block a user