--- cycle: 5 phase: 3 status: closed 2026-05-18 — M1 PASS, M3 captured date_opened: 2026-05-18 date_closed: 2026-05-18 parent: k5_cdef_phase1_2.md host: hertz --- # Cycle 5, Phase 3 — CDEF NEON baseline (closed) Supersedes `k5_cdef_phase3_partial.md`. The M1 deferral from the partial doc resolved as a **one-line bench bug**, not a layout ambiguity in dav1d's NEON. ## Root cause of the previous "layout mismatch" `tests/cdef_ref.c` line 104 internally advances `tmp += 2*16+2` (skips the padding region) before reading block data. `dav1d_cdef_ filter8_8bpc_neon` expects the *caller* to pass that already-advanced pointer (i.e., pointer to the 8×8 block origin, not the padded buffer origin). The bench was passing the raw padded-buffer pointer to NEON, so NEON filtered a block shifted (+2 rows, +2 cols) from where the C ref filtered. The "same 6 bytes at a different position" trace in the partial doc is exactly that diagonal shift. Fix: `tmps + i*TMP_INTS + (2 * TMP_W + 2)` for the NEON call. Three-line patch in `tests/bench_neon_cdef.c`. ## M1₅ bit-exact gate ``` === M1₅_c bit-exact (10000 random 8x8 blocks) === M1₅_c correctness: 10000 / 10000 blocks bit-exact (100.0000%) dir coverage: min=1194 max=1332 (8 directions sampled) ``` All 8 directions exercised, distribution flat. **M1 gate PASS.** ## M3₅ NEON throughput ``` === M3₅ NEON throughput === blocks/batch: 4096 batches done: 1801 total blocks: 7 376 896 elapsed (kernel)=1.937 s throughput = 3.809 Mblock/s per-block = 262.5 ns equiv 1080p = 117.6 FPS (32 400 blocks/frame) ``` Consistent with the previously captured 3.923 Mblock/s (longer window). Per-block ~260 ns. **CDEF remains the most compute- intensive kernel cycle so far** (2.1× IDCT, 13× LPF wd=4, 5.5× MC). | | per-block ns | relative | |---|---|---| | IDCT 8×8 (k1) | 122 | 1.0× | | LPF wd=4 (k2) | 20.7 | 0.17× | | MC 8h (k3) | 47.6 | 0.39× | | LPF wd=8 (k4) | 19.1 | 0.16× | | **CDEF (k5)** | **262.5** | **2.15×** | 30fps@1080p floor margin: **3.9×** isolation NEON single-core. NEON-4 baseline would be ~12-15 Mblock/s → 12-15× margin. ## Methodology lessons 1. **Inverted-bench bugs look like layout mismatches.** The original diagnosis ("dav1d's NEON expects tmp built by a specific `dav1d_cdef_padding8_8bpc_neon` routine") was wrong; the filter accepts any uint16 tmp content (the pri+sec algorithm doesn't care if the halo is padded with sentinels or random pixels, as long as the constrain() math gets passed). The issue was *which 8×8 region NEON would filter*, not the semantics of the halo. 2. **Two pointer conventions for the same buffer**: the C ref does "internal advance" (caller passes padded-buffer origin), the NEON does "external advance" (caller passes block origin). Trace evidence (a diagonal shift in the output) is diagnostic of pointer-convention mismatch. 3. **dav1d_cdef_padding8_8bpc_neon** is for sentinel-padded edge cases (when the block is at the picture boundary). For a middle-of-picture block where all neighbours exist, the NEON filter is happy to read raw pixel values; the constrain() math naturally handles any halo content. ## What lands in this commit - `tests/bench_neon_cdef.c`: 3-line fix (tmp+34 for NEON calls) - `docs/k5_cdef_phase3.md` (this doc) supersedes `k5_cdef_phase3_partial.md` ## Phase 4 unblocked Predicted R₅ (from `k5_cdef_phase3_partial.md`): - CDEF is ~5× heavier per-block than MC on NEON (262 vs 48 ns) - NEON ~5× per-core advantage on MC → QPU likely ~25× behind on CDEF - R₅ isolation estimate: **0.02-0.05 (deep RED)** Issue 003 V1/V2 NEON-fallback proxy showed that a 4th NEON core running CDEF adds 1.7 Mblock/s of CDEF helper without crushing the other 3 cores. Real QPU CDEF is predicted at ~0.2 Mblock/s (an order of magnitude below the NEON-fallback proxy). **Phase 4 plan rationale**: even predicted RED, build the QPU CDEF kernel because: - Confirms or refutes the R₅ 0.02-0.05 prediction with real data - Completes the cycle 5 record (Phases 1-7 all closed) - Provides the QPU CDEF dispatch path needed for the V4L2 wrapper to *exist* (Phase 8), even if scheduler doesn't enqueue it by default Expected Phase 4 effort: 2-3 hours given the kernel shape is similar to cycle 2/4 LPF (per-block stencil with table lookups for directions; primary + secondary tap accumulation).