---
cycle: 5
phase: 3 (partial; M3 captured, M1 deferred)
status: in_progress (M1 known-issue, Phase 4+ deferred)
date_opened: 2026-05-18
date_partial_close: 2026-05-18
parent: k5_cdef_phase1_2.md
---

# Cycle 5, Phase 3 (partial): CDEF NEON baseline

Cycle 5 Phase 3 captured **M3₅ throughput**, but the **M1 bit-exact gate is deferred** to the next session because of a tmp-layout mismatch between the standalone C reference and the layout dav1d's NEON code expects.

## M3₅ NEON throughput (captured)

```
=== M3₅ NEON throughput ===
blocks/batch: 65536
batches done: 279
total blocks: 18 284 544
elapsed (kernel)=4.661 s
throughput = 3.923 Mblock/s
per-block = 254.9 ns
equiv 1080p = 121.1 FPS (32 400 blocks/frame)
```

**Per-block 254.9 ns.** CDEF is the most compute-intensive kernel measured so far:

| Kernel (cycle) | per-block ns | relative to IDCT |
|---|---|---|
| IDCT 8×8 (k1) | 122 | 1.0× |
| LPF wd=4 (k2) | 20.7 | 0.17× |
| MC 8h (k3) | 47.6 | 0.39× |
| LPF wd=8 (k4) | 19.1 | 0.16× |
| **CDEF (k5)** | **254.9** | **2.09×** |

30fps@1080p floor margin: **4×** in isolation (32 400 blocks/frame × 30 fps ÷ 1e6 = 0.972 Mblock/s required; 3.923 / 0.972 = 4.04). NEON CDEF on a single CPU core clears the 30 fps floor comfortably on its own.

## M1 known-issue (deferred to next session)

The bit-exact gate against my standalone C reference fails. Comparing the two outputs, the NEON path produces algorithmically plausible pixel values, but at a shifted (row, col) offset within dst. Trace evidence:

> neon row 5, cols 2-7 = `90 213 247 143 95 76`
>
> C ref row 3, cols 0-5 = `90 213 247 143 95 76`

The same 6-byte sequence appears in both outputs, displaced by two rows and two columns. From the positions quoted, the NEON output sits at (+2 rows, +2 cols) relative to the C ref, which at a dst stride of 8 is 2×8 + 2 = +18 bytes. This is a constant position offset, not a wrong-value error.

The smoking gun is that dav1d's NEON filter expects tmp built by a specific `dav1d_cdef_padding8_8bpc_neon` routine (different from the C-side `padding()` function), and my manual tmp construction doesn't match that convention.

**Resolution paths** (next session):

1.
   **Call dav1d's NEON padding function** to construct tmp from random dst + left + top + bottom inputs, so the filter reads tmp in the layout it expects. This adds one more extern symbol to bind.

2. **Vendor `dav1d_cdef_filter_block_8x8_c` from dav1d's C side** (with templated headers shimmed). Compare NEON output against dav1d's *own* C, not my standalone transcription. This eliminates the layout-shim ambiguity entirely.

3. Inspect `dav1d_cdef_padding8_8bpc_neon` output for one block, reverse-engineer the layout, and update the standalone C ref to match.

Path 1 is probably simplest. The padding function signature (inferred from the `padding_func` macro in cdef.S):

```
#include <stddef.h>
#include <stdint.h>

/* Inferred from the asm, not yet checked against dav1d's headers. */
void dav1d_cdef_padding8_8bpc_neon(uint16_t *tmp, const uint8_t *src,
                                   ptrdiff_t src_stride,
                                   const uint8_t (*left)[2],
                                   const uint8_t *top, const uint8_t *bottom,
                                   int h, size_t edges);
```

Phase 3 closure requires M1 to be verified bit-exact.

## Phase 4-7 deferred

Without M1 verified, the QPU shader can't be built safely: it would have no correctness gate against the NEON path either, and we'd be chasing two layout issues simultaneously.

**Predicted R₅** (extrapolating from cycle 3 MC):

- CDEF is ~5× heavier per block than MC on NEON (254.9 vs 47.6 ns)
- NEON ~5× advantage, compounded with the ~5× heavier kernel → QPU likely ~25× behind (R ≈ 0.04)
- R₅ isolation estimate: **0.02-0.05 (deep RED)**
- M4₅ mixed: very likely negative (deeper than cycle 3 MC's -19.5%)
- 30fps floor: still PASS on isolation + mixed, since the NEON 4-core baseline is likely 12+ Mblock/s, comfortably above 0.972

**Deployment recommendation** (provisional, pending Phase 4-7 + Issue 003 mixed-kernel M4): **CDEF baseline = CPU; QPU offload possibly viable as an opportunistic helper, not yet measured**.

Same caveat as cycle 3 MC (see `k3_mc_phase7.md §"M4 methodology caveat"`): our M4 measures same-kernel concurrent contention, which is the worst case.
In a real decoder pipeline where the CPU is doing entropy + MC + other work, taking CDEF off the CPU's plate could plausibly add throughput even at R = 0.05-ish: the QPU is otherwise idle, the contention is across different kernels (less collision than same-kernel), and the lost-CPU-core cost shrinks when the CPU has other work to fill in.

The **bandwidth-bound vs compute-bound classification rule** still holds at the kernel level, but its mapping to deployment is more nuanced than "compute-bound → never QPU." Better framing:

- **Bandwidth-bound on QPU** → **definitive** QPU offload (cycle 1+2+4)
- **Compute-bound on QPU** → **opportunistic** QPU helper if the pipeline has bandwidth-light CPU work running concurrently (cycle 3+5, needs Issue 003 measurement to confirm)

## Phase 9 lessons (provisional)

1. **Vendoring from a SECOND upstream (dav1d after FFmpeg) added non-trivial layout-convention friction.** Different projects make different optimisation tradeoffs (dav1d NEON uses stride-16 tmp for vector-load alignment; dav1d C uses stride-12 because it doesn't matter for scalar code). The standalone C ref had to be re-fit to match the NEON layout, not just transcribe the C.

2. **Two different `dav1d_cdef_directions` tables in dav1d**: stride-12 in `src/tables.c` (used by the C path), stride-16 in `src/arm/64/cdef_tmpl.S` (used by the NEON path). I initially vendored the C-side table; I should have used the NEON-side embedded version for matching against NEON.

3. **The bit-exact gate fundamentally requires the standalone C ref to match the actual NEON call convention exactly.** When the layout convention differs (as here), no amount of correct algorithm transcription saves you. The cleanest fix is to either run dav1d's own C ref (vendor more headers) or use dav1d's NEON padding to construct tmp.
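
The stride point in lessons 1 and 2 can be made concrete with a small sketch. A tap offset stored as `dy * 12 + dx` is split back into (dy, dx) and re-encoded as `dy * 16 + dx`. The helper name `restride_offset` is hypothetical, and the sketch assumes |dx| is smaller than half the source stride so the split is unambiguous. It illustrates why a stride-12 table cannot be used as-is against a stride-16 tmp; it is not how dav1d builds either table.

```c
#include <stddef.h>

/* Hypothetical helper (not part of dav1d): re-encode a flat tap offset
 * from one tmp row stride to another.
 * Assumption: |dx| < src_stride / 2, so (dy, dx) can be recovered
 * unambiguously from off = dy * src_stride + dx. */
static int restride_offset(int off, int src_stride, int dst_stride)
{
    int half = src_stride / 2;
    /* Round-to-nearest division, written out for both signs because
     * C integer division truncates toward zero. */
    int dy = (off >= 0) ? (off + half) / src_stride
                        : -((-off + half) / src_stride);
    int dx = off - dy * src_stride;
    return dy * dst_stride + dx;
}
```

For example, an offset of `-1 * 12 + 1` (one row up, one column right) maps to `-1 * 16 + 1` when called as `restride_offset(-11, 12, 16)`.
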
## What lands in this commit

- `external/dav1d-snapshot/src/arm/64/cdef_tmpl.S` (additional vendored file, needed for cdef.S to include)
- `tests/cdef_ref.c`: standalone C ref (algorithmically correct, layout known-mismatched)
- `tests/bench_neon_cdef.c`: bench harness with the M1 gate demoted to a warning (proceeds to M3 even on layout mismatch)
- `external/dav1d-snapshot/config.h`: asm preamble shim (works: dav1d's cdef.S assembles, links, and executes)
- `CMakeLists.txt`: dav1d asm + table source build wiring
- M3₅ baseline: 3.923 Mblock/s captured on hertz

## Resumption checklist (next session)

- [ ] Pick M1 resolution path (1, 2, or 3 from §"Resolution paths")
- [ ] If path 1: vendor + bind `dav1d_cdef_padding8_8bpc_neon`, update the bench to call padding-then-filter, recapture the M1 gate
- [ ] Phase 4: plan QPU CDEF kernel (likely brief; predicted RED)
- [ ] Phase 5: review (mandatory; first AV1 QPU work)
- [ ] Phase 6: implement
- [ ] Phase 7: measure M2 + M4 if it reaches threshold
- [ ] Confirm deployment recipe: CDEF stays on CPU (likely)
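
When recapturing the M1 gate, it may help to distinguish "wrong values" from "right values at a shifted position", which is the failure seen this session. The following is a minimal sketch of such a diagnostic. The name `find_block_shift`, the search window, and the quarter-block minimum overlap are all assumptions for illustration; none of this exists in the bench harness or in dav1d.

```c
#include <stdint.h>

/* Hypothetical diagnostic: look for a constant (dr, dc) such that
 * got[(r + dr) * stride + (c + dc)] == ref[r * stride + c]
 * for every (r, c) whose shifted position stays inside the w x h block.
 * Shifts are tried in the range [-max_shift, max_shift] on both axes.
 * Returns 1 and stores the first matching shift, or 0 if none matches.
 * A match must cover at least a quarter of the block, to avoid
 * accepting tiny accidental overlaps. */
static int find_block_shift(const uint8_t *ref, const uint8_t *got,
                            int w, int h, int stride, int max_shift,
                            int *out_dr, int *out_dc)
{
    for (int dr = -max_shift; dr <= max_shift; dr++) {
        for (int dc = -max_shift; dc <= max_shift; dc++) {
            int compared = 0, ok = 1;
            for (int r = 0; r < h && ok; r++) {
                for (int c = 0; c < w && ok; c++) {
                    int gr = r + dr, gc = c + dc;
                    if (gr < 0 || gr >= h || gc < 0 || gc >= w)
                        continue; /* shifted position falls outside the block */
                    compared++;
                    if (got[gr * stride + gc] != ref[r * stride + c])
                        ok = 0;
                }
            }
            if (ok && compared >= (w * h) / 4) {
                *out_dr = dr;
                *out_dc = dc;
                return 1;
            }
        }
    }
    return 0;
}
```

With `ref` as the C reference block and `got` as the NEON output, a result other than (0, 0) points at a layout or addressing problem, not an arithmetic one. On flat or highly repetitive blocks several shifts can match, so random input blocks give the clearest answer.
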