Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B concurrently. Matrix on hertz: V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s V4 (CPU MC + QPU LPF4): CPU 27.87 + QPU 12.74 Medge/s V1 (CPU MC + NEON-fb CDEF): CPU 24.49 + 1.75 Mblock/s CDEF V2 (CPU LPF4 + NEON-fb CDEF): CPU 27.28 Medge + 1.70 Mblock/s V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in cycles 1-5 was a worst-case contention bound, not a deployment number — user's "5%/50%" framing was correct. Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any contention); cycle 5 CDEF deferred verdict softened to opportunistic helper (NEON-fallback proxy used since cycle 5 Phase 6 not yet built). - tests/bench_concurrent_mixed.c (configurable cpu-kernel / qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU dispatch; CDEF uses NEON-on-core-3 fallback) - CMakeLists.txt: build target wired with all FFmpeg + dav1d sources - docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix - docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4 - docs/k5_cdef_phase3_partial.md: deployment recommendation updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -95,18 +95,29 @@ chasing two layout issues simultaneously).
|
||||
- 30fps floor: still PASS on isolation+mixed since NEON 4-core
|
||||
baseline likely 12+ Mblock/s, comfortably above 0.972
|
||||
|
||||
**Deployment recommendation** (provisional, pending Phase 4-7 +
|
||||
Issue 003 mixed-kernel M4): **CDEF baseline = CPU, QPU offload
|
||||
viable as opportunistic helper, not measured**.
|
||||
**Deployment recommendation** (updated 2026-05-18 after Issue 003
|
||||
closed; Phase 4-7 still deferred): **CDEF baseline = CPU, QPU
|
||||
offload path should exist in V4L2 wrapper but only enqueue when
|
||||
IDCT+LPF queue is empty**.
|
||||
|
||||
Same caveat as cycle 3 MC (see `k3_mc_phase7.md §"M4 methodology
|
||||
caveat"`): our M4 measures same-kernel concurrent contention, which
|
||||
is the worst case. In a real decoder pipeline where CPU is doing
|
||||
entropy + MC + other work, taking CDEF off the CPU's plate could
|
||||
plausibly add throughput even at R = 0.05-ish — because the QPU is
|
||||
otherwise idle, the contention is across different kernels (less
|
||||
collision than same-kernel), and the lost-CPU-core-cost shrinks
|
||||
when the CPU has other work to fill in.
|
||||
`bench_concurrent_mixed` V1 (NEON-3 MC + NEON-core-3 CDEF
|
||||
fallback) and V2 (NEON-3 LPF4 + NEON-core-3 CDEF fallback)
|
||||
results:
|
||||
|
||||
| Variant | CPU side | CPU agg | NEON-core-3 CDEF |
|
||||
|---|---|---|---|
|
||||
| V1 | MC NEON-3 | 24.49 Mblock/s | 1.75 Mblock/s |
|
||||
| V2 | LPF4 NEON-3 | 27.28 Medge/s | 1.70 Mblock/s |
|
||||
|
||||
The proxy (NEON-on-core-3 doing CDEF) adds 1.7-1.75 Mblock/s of
|
||||
CDEF work without crushing the other 3 cores' main work. CPU
|
||||
aggregate stays close to single-kernel 4-core levels. Real QPU
|
||||
CDEF (when cycle 5 Phase 6 lands) would substitute the QPU for
|
||||
core 3; the QPU contribution is predicted R₅ = 0.02-0.05 →
|
||||
~0.2 Mblock/s (much less than the NEON-fallback proxy).
|
||||
|
||||
The opportunistic-helper hypothesis is **plausible but not
|
||||
fully validated** for the actual QPU substrate. Conservative read:
|
||||
|
||||
The **bandwidth-bound vs compute-bound classification rule** still
|
||||
holds at the kernel level, but its mapping to deployment is more
|
||||
|
||||
Reference in New Issue
Block a user