Files
daedalus-fourier/docs/issues/003-mixed-kernel-m4-bench.md
marfrit 460a6a6d08 Calibration: M4 same-kernel measures worst-case contention
User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently — different access patterns contend less.

Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.

Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
  user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
  softened from 'CPU only' to 'CPU baseline; QPU helper viable in
  mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous
  test to close the question (4 variants: bandwidth+bandwidth,
  compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/
  feedback_m4_same_kernel_worst_case.md added — carries the
  calibration into future cycles + Phase 8 deployment decisions
- MEMORY.md index updated

The bandwidth-bound vs compute-bound classification still holds at
the kernel level — Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
  - Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
  - Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
    bandwidth-light CPU work running concurrently (cycles 3+5,
    needs Issue 003 measurement)

Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.

No code changes. Doc + memory + issue only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:27 +00:00

3.5 KiB
Raw Permalink Blame History

Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)

Status: open, blocks Phase 8 deployment plumbing for cycles 3+5 Type: measurement gap; methodology fix Predicted verdict: cycle 3 MC + cycle 5 CDEF may flip from "CPU only" to "opportunistic QPU helper" Priority: medium (changes deployment recipe; doesn't block other cycles) Filed: 2026-05-18

Background

Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU" based on M4 measurements showing mixed NEON-3 + QPU running the same kernel ran SLOWER than pure NEON-4. Specifically:

NEON-4 NEON-3 + QPU delta
Cycle 3 MC 15.25 Mblock/s 12.28 19.5 %
Cycle 5 CDEF (predicted) ~ 12-15 ~ 10-12 negative

But this is the worst-case contention scenario: both substrates competing for the same memory bus with the same access pattern.

Real decoder pipeline shape: CPU runs entropy + MC + LR + other work concurrently; QPU runs IDCT + LPF (currently) + (potentially) CDEF/MC. Different kernels on different substrates contend less than same-kernel-on-both.

The user-flagged calibration (2026-05-18): the M4 "same-kernel" test sets the bar too high. A "different-kernel" test would more accurately reflect deployment.

What to measure

A new bench harness tests/bench_concurrent_mixed.c that runs:

Variant CPU side (NEON-3 pinned) QPU side (1 core) Captures
A LPF wd=4 (bandwidth-bound, like real LPF stage) CDEF CDEF helper throughput; CPU LPF throughput drop
B MC (compute-bound, like real MC stage) CDEF CDEF helper throughput; CPU MC throughput drop
C MC MC (cycle 3 M4 control)
D LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") CDEF Real-pipeline approximation

Compute "QPU helper value" = (mixed total throughput in the relevant kernel) (CPU-only baseline) for each variant.

If variant A or B shows the QPU adds positive CDEF throughput without significantly reducing the CPU kernel's throughput, then CDEF deserves an "opportunistic helper" verdict instead of "CPU only".

Expected outcome

Per the user's "5 % CPU drop / 50 % bored QPU" framing:

  • Variant A (bandwidth+bandwidth): QPU contention with bandwidth- heavy LPF is real; QPU contribution likely ~70 % of isolation
  • Variant B (compute+CDEF): MC is the worst-saturated case from cycle 3; QPU likely under-contributes, CPU MC may drop. Net result ~ cycle 3 M4 (19.5 % rerun)
  • Variant D (mixed): probably the closest-to-deployment number. Best estimate of "additional QPU helper" value.

Acceptance criteria

  • tests/bench_concurrent_mixed.c lands, 4 variants measurable
  • Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline
  • Cycle 3 and cycle 5 deployment recipes updated either way
  • docs/k3_mc_phase7.md §"M4 methodology caveat" updated with results

Why deferred

User-directed cycle 5 was CDEF; M4 methodology calibration only surfaced AFTER cycle 5 close. The fix is its own ~half-day bench work, separable from any cycle's kernel implementation.

  • docs/k3_mc_phase7.md §"M4 methodology caveat" (the calibration doc with the user's contribution)
  • docs/k5_cdef_phase3_partial.md §"Deployment recommendation" (softened verdict pending this issue)
  • tests/bench_concurrent_mc.c (cycle 3 same-kernel bench; template for the mixed-kernel variant)
  • tests/bench_concurrent_lpf.c + bench_concurrent_lpf8.c (cycle 2/4 bench templates)
  • Memory: feedback_m4_same_kernel_worst_case.md