3 Commits

Author SHA1 Message Date
marfrit 2dd774a9ab Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B
concurrently. Matrix on hertz:

  V3 (CPU MC + QPU MC, same-kernel): CPU 22.64 Mblock/s + QPU 0.39 Mblock/s
  V4 (CPU MC + QPU LPF4):            CPU 27.87 Mblock/s + QPU 12.74 Medge/s
  V1 (CPU MC + NEON-fb CDEF):        CPU 24.49 Mblock/s + CDEF 1.75 Mblock/s
  V2 (CPU LPF4 + NEON-fb CDEF):      CPU 27.28 Medge/s  + CDEF 1.70 Mblock/s

V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs
LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU
MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in
cycles 1-5 was a worst-case contention bound, not a deployment
number — user's "5%/50%" framing was correct.
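
The +23% figure follows directly from the two CPU MC rows of the matrix. A minimal sketch of that arithmetic, assuming both values are in Mblock/s; the constant and function names are illustrative and not taken from bench_concurrent_mixed.c:

```c
#include <assert.h>

/* CPU MC throughput in Mblock/s, copied from the V3 and V4 rows above. */
static const double CPU_MC_V3_SAME_KERNEL = 22.64; /* QPU also runs MC */
static const double CPU_MC_V4_MIXED       = 27.87; /* QPU runs LPF4    */

/* Relative change of `value` against `baseline`, in percent. */
static double pct_gain(double baseline, double value)
{
    assert(baseline > 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

Calling `pct_gain(CPU_MC_V3_SAME_KERNEL, CPU_MC_V4_MIXED)` gives roughly 23.1, which is the per-core gain quoted above.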

Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 Mblock/s under
any contention); cycle 5 CDEF deferred verdict softened to
opportunistic helper. The CDEF rows use a NEON-fallback proxy
because cycle 5 Phase 6 is not yet built.

- tests/bench_concurrent_mixed.c (configurable cpu-kernel /
  qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU
  dispatch; CDEF uses NEON-on-core-3 fallback)
- CMakeLists.txt: build target wired with all FFmpeg + dav1d sources
- docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix
- docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4
- docs/k5_cdef_phase3_partial.md: deployment recommendation updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:44:08 +00:00
marfrit 460a6a6d08 Calibration: M4 same-kernel measures worst-case contention
User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently — different access patterns contend less.

Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.

Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
  user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
  softened from 'CPU only' to 'CPU baseline; QPU helper viable in
  mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous
  test to close the question (4 variants: bandwidth+bandwidth,
  compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/
  feedback_m4_same_kernel_worst_case.md added — carries the
  calibration into future cycles + Phase 8 deployment decisions
- MEMORY.md index updated

The bandwidth-bound vs compute-bound classification still holds at
the kernel level — Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
  - Bandwidth-bound on QPU → DEFINITIVE offload (M4 positive in
    cycles 1+2+4)
  - Compute-bound on QPU → OPPORTUNISTIC helper if the pipeline has
    bandwidth-light CPU work running concurrently (cycles 3+5;
    needs Issue 003 measurement)

Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.
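
One way to keep kernels slot-able at runtime is a small mutable table mapping each kernel to a backend. The sketch below is only an illustration under assumptions: every identifier is hypothetical, and the defaults simply mirror this message (IDCT and LPF on the QPU, MC and CDEF on the CPU until Issue 003 is measured). It is not the Phase 8 wrapper's actual design.

```c
#include <assert.h>

/* Hypothetical identifiers; the real wrapper may name these differently. */
enum kernel_id { K_MC, K_LPF4, K_IDCT, K_CDEF, K_COUNT };
enum backend   { BACKEND_CPU, BACKEND_QPU };

/* Default placement: bandwidth-bound kernels on the QPU, compute-bound
 * kernels on the CPU. Entries are mutable so a later measurement can
 * move a kernel without touching the callers. */
static enum backend dispatch_table[K_COUNT] = {
    [K_MC]   = BACKEND_CPU,
    [K_LPF4] = BACKEND_QPU,
    [K_IDCT] = BACKEND_QPU,
    [K_CDEF] = BACKEND_CPU,
};

/* Re-slot one kernel at runtime. */
static void dispatch_set(enum kernel_id k, enum backend b)
{
    assert((int)k >= 0 && (int)k < K_COUNT);
    dispatch_table[k] = b;
}

/* Look up where a kernel currently runs. */
static enum backend dispatch_get(enum kernel_id k)
{
    assert((int)k >= 0 && (int)k < K_COUNT);
    return dispatch_table[k];
}
```

With this shape, acting on an Issue 003 result is a single call such as `dispatch_set(K_CDEF, BACKEND_QPU)`; code that asks `dispatch_get` before submitting work needs no change.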

No code changes. Doc + memory + issue only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:31:27 +00:00
marfrit 20e3d004ae Issues 001+002: defer LPF wd=16 + LPF vertical variants
Per user direction at cycle-4 close: file wd=16 (trend prediction
validation) and vertical variants (column-stride TMU behaviour
unknown) as local issues for future cycles. Progress instead to
CDEF (AV1) for codec breadth.

docs/issues/001 — wd=16 prediction validation. Per cycle 4 lesson 4,
trend says wd=16 likely flips M4 negative. Quick incremental cycle
when revisited.

docs/issues/002 — vertical variants. Different memory access pattern
(column-strided vs row-strided). The load-bearing unknown is
whether the cycle 2 +6.9% mixed gain survives the TMU coalescing
shift. If positive, deployment recipe gains symmetry; if negative,
must split by orientation.

Both issues have acceptance criteria + expected outcomes documented.

Cycle 5 next: CDEF (AV1) — codec-breadth expansion.

No Gitea repo exists for daedalus-fourier yet (project is local-
only). If a tracker is wanted, create the repo and migrate these
.md files. For now they live in-tree as part of the project history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:09:51 +00:00