Files
daedalus-fourier/docs/issues/003-mixed-kernel-m4-bench.md
marfrit 460a6a6d08 Calibration: M4 same-kernel measures worst-case contention
User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently — different access patterns contend less.

Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.

Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
  user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
  softened from 'CPU only' to 'CPU baseline; QPU helper viable in
  mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous
  test to close the question (4 variants: bandwidth+bandwidth,
  compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/
  feedback_m4_same_kernel_worst_case.md added — carries the
  calibration into future cycles + Phase 8 deployment decisions
- MEMORY.md index updated

The bandwidth-bound vs compute-bound classification still holds at
the kernel level — Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
  - Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
  - Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
    bandwidth-light CPU work running concurrently (cycles 3+5,
    needs Issue 003 measurement)

Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.

No code changes. Doc + memory + issue only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:27 +00:00

88 lines
3.5 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Issue 003 — Mixed-kernel M4 bench (closes cycle 3/5 deployment verdict)
**Status**: open, blocks Phase 8 deployment plumbing for cycles 3+5
**Type**: measurement gap; methodology fix
**Predicted verdict**: cycle 3 MC + cycle 5 CDEF may flip from
"CPU only" to "opportunistic QPU helper"
**Priority**: medium (changes deployment recipe; doesn't block other cycles)
**Filed**: 2026-05-18
## Background
Cycles 3 (MC) and 5 (CDEF, partial) were verdict'd "stay on CPU"
based on M4 measurements showing mixed NEON-3 + QPU running the
**same kernel** ran SLOWER than pure NEON-4. Specifically:
| | NEON-4 | NEON-3 + QPU | delta |
|---|---|---|---|
| Cycle 3 MC | 15.25 Mblock/s | 12.28 | **19.5 %** |
| Cycle 5 CDEF (predicted) | ~ 12-15 | ~ 10-12 | negative |
But this is the **worst-case contention scenario**: both substrates
competing for the same memory bus with the same access pattern.
**Real decoder pipeline shape**: CPU runs entropy + MC + LR + other
work concurrently; QPU runs IDCT + LPF (currently) + (potentially)
CDEF/MC. Different kernels on different substrates contend
*less* than same-kernel-on-both.
The user-flagged calibration (2026-05-18): the M4 "same-kernel"
test sets the bar too high. A "different-kernel" test would more
accurately reflect deployment.
## What to measure
A new bench harness `tests/bench_concurrent_mixed.c` that runs:
| Variant | CPU side (NEON-3 pinned) | QPU side (1 core) | Captures |
|---|---|---|---|
| A | LPF wd=4 (bandwidth-bound, like real LPF stage) | CDEF | CDEF helper throughput; CPU LPF throughput drop |
| B | MC (compute-bound, like real MC stage) | CDEF | CDEF helper throughput; CPU MC throughput drop |
| C | MC | MC | (cycle 3 M4 control) |
| D | LPF wd=4 + MC alternating (proxy for "CPU doing mixed real work") | CDEF | Real-pipeline approximation |
Compute "QPU helper value" = (mixed total throughput in the relevant
kernel) (CPU-only baseline) for each variant.
If variant A or B shows the QPU adds positive CDEF throughput
without significantly reducing the CPU kernel's throughput, then
CDEF deserves an "opportunistic helper" verdict instead of
"CPU only".
## Expected outcome
Per the user's "5 % CPU drop / 50 % bored QPU" framing:
- Variant A (bandwidth+bandwidth): QPU contention with bandwidth-
heavy LPF is real; QPU contribution likely ~70 % of isolation
- Variant B (compute+CDEF): MC is the worst-saturated case from
cycle 3; QPU likely under-contributes, CPU MC may drop. Net
result ~ cycle 3 M4 (19.5 % rerun)
- Variant D (mixed): probably the closest-to-deployment number.
Best estimate of "additional QPU helper" value.
## Acceptance criteria
- `tests/bench_concurrent_mixed.c` lands, 4 variants measurable
- Verdict per variant: "+X.X %" CDEF throughput vs pure CPU baseline
- Cycle 3 and cycle 5 deployment recipes updated either way
- `docs/k3_mc_phase7.md §"M4 methodology caveat"` updated with
results
## Why deferred
User-directed cycle 5 was CDEF; M4 methodology calibration only
surfaced AFTER cycle 5 close. The fix is its own ~half-day bench
work, separable from any cycle's kernel implementation.
## Related
- `docs/k3_mc_phase7.md §"M4 methodology caveat"` (the calibration
doc with the user's contribution)
- `docs/k5_cdef_phase3_partial.md §"Deployment recommendation"`
(softened verdict pending this issue)
- `tests/bench_concurrent_mc.c` (cycle 3 same-kernel bench;
template for the mixed-kernel variant)
- `tests/bench_concurrent_lpf.c` + `bench_concurrent_lpf8.c`
(cycle 2/4 bench templates)
- Memory: `feedback_m4_same_kernel_worst_case.md`