User-flagged 2026-05-18: the cycles 3 (MC) + 5 (CDEF) 'CPU only'
verdicts were based on M4 measuring same-kernel concurrent NEON+QPU,
which is the WORST case for memory-bandwidth contention. A real
decoder pipeline has CPU doing kernel A + QPU doing kernel B
concurrently — different access patterns contend less.
Concretely: in a real pipeline, CPU runs entropy + MC + other work
while QPU is idle except for IDCT + LPF. The 'opportunistic QPU
helper' for CDEF (or MC) hasn't been measured. M4 set the bar too
high.
Updates:
- docs/k3_mc_phase7.md §'M4 methodology caveat' added with the
user's contribution framing
- docs/k5_cdef_phase3_partial.md §'Deployment recommendation'
softened from 'CPU only' to 'CPU baseline; QPU helper viable in
mixed-kernel deployment, unmeasured'
- docs/issues/003-mixed-kernel-m4-bench.md filed — the rigorous
test to close the question (4 variants: bandwidth+bandwidth,
compute+CDEF, same-kernel control, real-pipeline mix)
- ~/.claude/projects/-home-mfritsche-src-daedalus-fourier/memory/
feedback_m4_same_kernel_worst_case.md added — carries the
calibration into future cycles + Phase 8 deployment decisions
- MEMORY.md index updated
The bandwidth-bound vs compute-bound classification still holds at
the kernel level — Phase 9 cross-cycle lesson stays valid. But its
mapping to deployment is nuanced:
- Bandwidth-bound on QPU → DEFINITIVE offload (M4 +ve, cycles 1+2+4)
- Compute-bound on QPU → OPPORTUNISTIC helper if pipeline has
bandwidth-light CPU work running concurrently (cycles 3+5,
needs Issue 003 measurement)
Phase 8 V4L2 wrapper should keep CDEF + MC slot-able to either CPU
or QPU at runtime (not hard-baked), so Issue 003's result can update
the dispatch table without re-architecture.
No code changes. Doc + memory + issue only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| cycle | phase | status | date_opened | date_partial_close | parent |
|---|---|---|---|---|---|
| 5 | 3 (partial — M3 captured, M1 deferred) | in_progress (M1 known-issue, Phase 4+ deferred) | 2026-05-18 | 2026-05-18 | k5_cdef_phase1_2.md |
Cycle 5, Phase 3 (partial) — CDEF NEON baseline
Cycle 5 Phase 3 captured M3₅ throughput, but the M1 bit-exact gate is deferred to the next session because of a tmp-layout mismatch between the standalone C reference and the layout dav1d's NEON code expects.
M3₅ NEON throughput (captured)
=== M3₅ NEON throughput ===
blocks/batch: 65536
batches done: 279
total blocks: 18 284 544
elapsed (kernel)=4.661 s
throughput = 3.923 Mblock/s
per-block = 254.9 ns
equiv 1080p = 121.1 FPS (32 400 blocks/frame)
At 254.9 ns per block, CDEF is the most compute-intensive kernel measured so far:
| kernel | per-block ns | relative |
|---|---|---|
| IDCT 8×8 (k1) | 122 | 1.0× |
| LPF wd=4 (k2) | 20.7 | 0.17× |
| MC 8h (k3) | 47.6 | 0.39× |
| LPF wd=8 (k4) | 19.1 | 0.16× |
| CDEF (k5) | 254.9 | 2.09× |
30fps@1080p floor margin: 4× in isolation (32 400 blocks/frame × 30 fps ÷ 1e6 = 0.972 Mblock/s required; 3.923 / 0.972 = 4.04). NEON CDEF on a single CPU core comfortably clears the 30 fps floor when it runs on its own.
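The derived figures in the bench output and the floor margin all follow from three raw numbers: blocks per batch, batches done, and elapsed kernel time. A minimal sketch of that arithmetic follows; `m3_summary` and `m3_summarise` are hypothetical names for illustration and are not taken from `tests/bench_neon_cdef.c`.

```c
#include <assert.h>
#include <stdint.h>

/* 8x8 blocks per 1080p frame and the fps floor, as quoted in this document. */
#define BLOCKS_PER_FRAME 32400.0
#define TARGET_FPS       30.0

/* Hypothetical summary record; field names are illustrative only. */
typedef struct {
    uint64_t total_blocks;  /* blocks processed over the whole run        */
    double   mblock_per_s;  /* throughput in millions of blocks per second */
    double   ns_per_block;  /* average kernel time per block               */
    double   equiv_fps;     /* 1080p frames per second at this throughput  */
    double   floor_margin;  /* throughput divided by the 30 fps requirement */
} m3_summary;

static m3_summary m3_summarise(uint32_t blocks_per_batch, uint32_t batches,
                               double elapsed_s)
{
    assert(elapsed_s > 0.0);
    m3_summary s;
    s.total_blocks = (uint64_t)blocks_per_batch * batches;
    double blocks_per_s = (double)s.total_blocks / elapsed_s;
    s.mblock_per_s = blocks_per_s / 1e6;
    s.ns_per_block = elapsed_s * 1e9 / (double)s.total_blocks;
    s.equiv_fps    = blocks_per_s / BLOCKS_PER_FRAME;
    s.floor_margin = blocks_per_s / (BLOCKS_PER_FRAME * TARGET_FPS);
    return s;
}
```

Calling `m3_summarise(65536, 279, 4.661)` reproduces the captured run: 18 284 544 blocks, roughly 3.92 Mblock/s, about 254.9 ns per block, and a margin just above 4× over the 0.972 Mblock/s floor.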
M1 known-issue (deferred to next session)
The bit-exact gate against my standalone C reference fails. Comparing the two outputs, the NEON path produces algorithmically plausible pixel values, but at a shifted (row, col) position within dst. Trace evidence:
neon  row 5, cols 2-7 = 90 213 247 143 95 76
C ref row 3, cols 0-5 = 90 213 247 143 95 76
The same 6-byte sequence appears in both outputs, but the NEON copy
sits at (+2 rows, +2 cols) relative to the C ref, i.e. a byte offset
of 2×8 + 2 = +18 at a dst row stride of 8. The smoking gun is that
dav1d's NEON expects tmp built by a specific
dav1d_cdef_padding8_8bpc_neon routine (different from the C-side
padding() function), and my manual tmp construction doesn't match
that convention.
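A shift like this can be located mechanically rather than by eye. The sketch below is a hypothetical diagnostic (`find_shift` is not part of the bench harness): it tries every small (row, col) displacement and reports the one at which the NEON output agrees with the C ref on the most pixels. It assumes both buffers are 8×8 with a row stride of 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

#define BLK 8  /* 8x8 block, row stride of 8 bytes (assumption of this sketch) */

typedef struct { int drow, dcol, matches; } shift_result;

/* For each (drow, dcol) in [-max_shift, max_shift], count the pixels where
 * neon[r + drow][c + dcol] == ref[r][c], and return the shift with the
 * highest count. Positions that fall outside the block are skipped. */
static shift_result find_shift(const uint8_t *neon, const uint8_t *ref,
                               int max_shift)
{
    assert(max_shift >= 0 && max_shift < BLK);
    shift_result best = { 0, 0, -1 };
    for (int dr = -max_shift; dr <= max_shift; dr++) {
        for (int dc = -max_shift; dc <= max_shift; dc++) {
            int matches = 0;
            for (int r = 0; r < BLK; r++) {
                for (int c = 0; c < BLK; c++) {
                    int nr = r + dr, nc = c + dc;
                    if (nr < 0 || nr >= BLK || nc < 0 || nc >= BLK)
                        continue;
                    if (neon[nr * BLK + nc] == ref[r * BLK + c])
                        matches++;
                }
            }
            if (matches > best.matches) {
                best.drow = dr;
                best.dcol = dc;
                best.matches = matches;
            }
        }
    }
    return best;
}
```

The byte offset within dst is then `drow * BLK + dcol`. Raw match counts favour shifts with a large overlap, so on flat or repetitive input the result should be cross-checked against a second block.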
Resolution paths (next session):
1. Call dav1d's NEON padding function to construct tmp from random dst+left+top+bottom inputs. Then the filter reads it with the right layout. Adds another extern symbol to bind.
2. Vendor dav1d_cdef_filter_block_8x8_c from dav1d's C side (with templated headers shimmed). Compare NEON output against dav1d's own C, not my standalone transcription. Eliminates the layout-shim ambiguity entirely.
3. Inspect dav1d_cdef_padding8_8bpc_neon output for one block, reverse-engineer the layout, update standalone C ref to match.
Path 1 is probably simplest. The padding function signature
(inferred from cdef.S padding_func macro):
void dav1d_cdef_padding8_8bpc_neon(uint16_t *tmp, const uint8_t *src,
                                   ptrdiff_t src_stride,
                                   const uint8_t (*left)[2],
                                   const uint8_t *top, const uint8_t *bottom,
                                   int h, size_t edges);
Phase 3 closure requires M1 bit-exact verified.
Phase 4-7 deferred
Without M1 verified, we can't safely build the QPU shader: it would have no correctness gate against the NEON path either, and we'd be chasing two layout issues simultaneously.
Predicted R₅ (extrapolating from cycle 3 MC):
- CDEF is ~5× heavier per-block than MC on NEON (254 vs 47 ns)
- NEON ~5× advantage → QPU likely ~25× behind
- R₅ isolation estimate: 0.02-0.05 (deep RED)
- M4₅ mixed: very likely negative (deeper than cycle 3 MC's -19.5%)
- 30fps floor: still PASS on isolation+mixed since NEON 4-core baseline likely 12+ Mblock/s, comfortably above 0.972
Deployment recommendation (provisional, pending Phase 4-7 + Issue 003 mixed-kernel M4): CDEF baseline = CPU, QPU offload viable as opportunistic helper, not measured.
Same caveat as cycle 3 MC (see k3_mc_phase7.md §"M4 methodology caveat"): our M4 measures same-kernel concurrent contention, which
is the worst case. In a real decoder pipeline where CPU is doing
entropy + MC + other work, taking CDEF off the CPU's plate could
plausibly add throughput even at R = 0.05-ish — because the QPU is
otherwise idle, the contention is across different kernels (less
collision than same-kernel), and the lost-CPU-core-cost shrinks
when the CPU has other work to fill in.
The bandwidth-bound vs compute-bound classification rule still holds at the kernel level, but its mapping to deployment is more nuanced than "compute-bound → never QPU." Better framing:
- Bandwidth-bound on QPU → definitive QPU offload (cycle 1+2+4)
- Compute-bound on QPU → opportunistic QPU helper if pipeline has bandwidth-light CPU work running concurrently (cycle 3+5, needs Issue 003 measurement to confirm)
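One way to keep this mapping adjustable is a small runtime dispatch table whose defaults follow the classification above and whose entries can be overridden once Issue 003 is measured. The sketch below is illustrative only: every type and function name is hypothetical, and the per-kernel classification is read off the cycle numbers quoted above (cycles 1, 2, 4 bandwidth-bound; cycles 3, 5 compute-bound).

```c
#include <assert.h>

typedef enum { BACKEND_CPU, BACKEND_QPU } backend_t;
typedef enum { BOUND_BANDWIDTH, BOUND_COMPUTE } bound_t;
typedef enum { K_IDCT, K_LPF4, K_MC, K_LPF8, K_CDEF, K_COUNT } kernel_t;

/* One slot per kernel: its classification and the backend currently chosen. */
typedef struct {
    const char *name;
    bound_t     bound;
    backend_t   backend;
} kernel_slot;

/* Default policy: bandwidth-bound kernels go to the QPU, compute-bound
 * kernels stay on the CPU until a mixed-kernel measurement says otherwise. */
static backend_t default_backend(bound_t b)
{
    return b == BOUND_BANDWIDTH ? BACKEND_QPU : BACKEND_CPU;
}

static void dispatch_init(kernel_slot table[K_COUNT])
{
    static const struct { const char *name; bound_t bound; } info[K_COUNT] = {
        [K_IDCT] = { "idct8x8", BOUND_BANDWIDTH },  /* cycle 1 */
        [K_LPF4] = { "lpf_wd4", BOUND_BANDWIDTH },  /* cycle 2 */
        [K_MC]   = { "mc_8h",   BOUND_COMPUTE   },  /* cycle 3 */
        [K_LPF8] = { "lpf_wd8", BOUND_BANDWIDTH },  /* cycle 4 */
        [K_CDEF] = { "cdef",    BOUND_COMPUTE   },  /* cycle 5 */
    };
    for (int k = 0; k < K_COUNT; k++) {
        table[k].name    = info[k].name;
        table[k].bound   = info[k].bound;
        table[k].backend = default_backend(info[k].bound);
    }
}

/* Runtime override, e.g. after a mixed-kernel benchmark result lands. */
static void dispatch_set(kernel_slot table[K_COUNT], kernel_t k, backend_t b)
{
    assert((int)k >= 0 && (int)k < K_COUNT);
    table[k].backend = b;
}
```

With this shape, moving CDEF or MC to the QPU is a single `dispatch_set` call, so the decision stays data rather than architecture.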
Phase 9 lessons (provisional)
- Vendoring from a SECOND upstream (dav1d after FFmpeg) added non-trivial layout-convention friction. Different projects make different optimisation tradeoffs (dav1d NEON uses stride-16 tmp for vector-load alignment; dav1d C uses stride-12 because it doesn't matter for scalar code). Standalone C ref had to be re-fit to match NEON layout, not just transcribe C.
- Two different dav1d_cdef_directions tables in dav1d: stride-12 in src/tables.c (used by C path), stride-16 in src/arm/64/cdef_tmpl.S (used by NEON path). I initially vendored the C-side table; should have used the NEON-side embedded version for matching against NEON.
- Bit-exact gate fundamentally requires the standalone C ref to match the actual NEON call convention exactly. When the layout convention differs (as here), no amount of correct algorithm transcription saves you. The cleanest fix is to either run dav1d's own C ref (vendor more headers) or use dav1d's NEON padding to construct tmp.
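The stride difference matters because a direction-table entry is a linear offset into tmp, so the same (dy, dx) tap encodes to a different number at each stride. The sketch below illustrates this with two hypothetical helpers; it does not reproduce dav1d's actual table contents, and `restride_offset` assumes the column displacement is small (well under half the source stride).

```c
#include <assert.h>

#define TMP_STRIDE_C    12  /* tmp row stride on the C path, per the notes above    */
#define TMP_STRIDE_NEON 16  /* tmp row stride on the NEON path, per the notes above */

/* Linear offset of a tap dy rows and dx columns away from the centre pixel. */
static int tap_offset(int dy, int dx, int stride)
{
    assert(stride > 0);
    return dy * stride + dx;
}

/* Re-encode an offset built for from_stride so it addresses the same
 * (dy, dx) tap at to_stride. dy is recovered by rounding to the nearest
 * row, which keeps dx in a small signed range around zero. */
static int restride_offset(int off, int from_stride, int to_stride)
{
    int half = from_stride / 2;
    int dy = (off >= 0 ? off + half : off - half) / from_stride;
    int dx = off - dy * from_stride;
    return tap_offset(dy, dx, to_stride);
}
```

For example, the tap at (dy, dx) = (-1, +1) is offset -11 at stride 12 but -15 at stride 16, so a C-side table applied to a NEON-layout tmp reads the wrong pixels even though every entry is individually correct.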
What lands in this commit
- external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored file, needed for cdef.S to include)
- tests/cdef_ref.c: standalone C ref (algorithmically correct, layout known-mismatched)
- tests/bench_neon_cdef.c: bench harness with the M1 gate downgraded to a warning (proceeds to M3 even on layout mismatch)
- external/dav1d-snapshot/config.h: asm preamble shim (works: dav1d's cdef.S assembles + links + executes)
- CMakeLists.txt: dav1d asm + table source build wiring
- M3₅ baseline: 3.923 Mblock/s captured on hertz
Resumption checklist (next session)
- Pick M1 resolution path (1, 2, or 3 from §"Resolution paths")
- If path 1: vendor + bind dav1d_cdef_padding8_8bpc_neon, update bench to call padding-then-filter, recapture M1 gate
- Phase 4 plan QPU CDEF kernel (likely brief; predicted RED)
- Phase 5 review (mandatory; first AV1 QPU work)
- Phase 6 implement
- Phase 7 measure M2 + M4 if reaches threshold
- Confirm deployment recipe: CDEF stays on CPU (likely)