---
cycle: 5
phase: 3
status: closed 2026-05-18 — M1 PASS, M3 captured
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k5_cdef_phase1_2.md
host: hertz
---

# Cycle 5, Phase 3 — CDEF NEON baseline (closed)

Supersedes `k5_cdef_phase3_partial.md`. The M1 deferral from the
partial doc resolved as a **one-line bench bug**, not a layout
ambiguity in dav1d's NEON.

## Root cause of the previous "layout mismatch"

`tests/cdef_ref.c` line 104 internally advances `tmp += 2*16+2`
(skipping the padding region) before reading block data.
`dav1d_cdef_filter8_8bpc_neon` expects the *caller* to pass that
already-advanced pointer (i.e., a pointer to the 8×8 block origin, not
the padded-buffer origin). The bench was passing the raw padded-buffer
pointer to NEON, so NEON filtered a block offset by 2 rows and 2 columns
from the one the C ref filtered. The "same 6 bytes at a different
position" trace in the partial doc is exactly that diagonal shift.

Fix: pass `tmps + i*TMP_INTS + (2 * TMP_W + 2)` to the NEON call,
i.e. the same +34 offset the C ref applies internally.
Three-line patch in `tests/bench_neon_cdef.c`.
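
The two pointer conventions can be illustrated with a toy stand-in. This is a sketch, not dav1d code: `TMP_H`, `BLOCK_OFF`, `sum_internal`, `sum_external` and `fill_positions` are hypothetical names, the stride of 16 is inferred from the `2*16+2` offset above, and the "filter" is just a sum over the 8×8 block so that the effect of a wrong pointer is easy to see.

```c
#include <stdint.h>

#define TMP_W 16                    /* row stride in uint16 units (from 2*16+2) */
#define TMP_H 12                    /* assumed: 8 block rows + 2 halo rows each side */
#define BLOCK_OFF (2 * TMP_W + 2)   /* skip 2 halo rows and 2 halo columns = 34 */

/* "Internal advance": the caller passes the padded-buffer origin. */
static uint32_t sum_internal(const uint16_t *tmp)
{
    tmp += BLOCK_OFF;               /* callee moves to the block origin itself */
    uint32_t s = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            s += tmp[y * TMP_W + x];
    return s;
}

/* "External advance": the caller must already pass the block origin. */
static uint32_t sum_external(const uint16_t *blk)
{
    uint32_t s = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            s += blk[y * TMP_W + x];
    return s;
}

/* Fill the buffer so that every element encodes its own position. */
static void fill_positions(uint16_t *tmp)
{
    for (int i = 0; i < TMP_W * TMP_H; i++)
        tmp[i] = (uint16_t)i;
}
```

With a position-encoded buffer, `sum_external(tmp + BLOCK_OFF)` agrees with `sum_internal(tmp)`, while `sum_external(tmp)` reads a block two rows up and two columns left and disagrees. That is the shape of the bench bug described above.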

## M1₅ bit-exact gate

```
=== M1₅_c bit-exact (10000 random 8x8 blocks) ===
M1₅_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
  dir coverage: min=1194 max=1332 (8 directions sampled)
```

All 8 directions exercised, distribution flat. **M1 gate PASS.**

## M3₅ NEON throughput

```
=== M3₅ NEON throughput ===
  blocks/batch:    4096
  batches done:    1801
  total blocks:    7 376 896
  elapsed (kernel)=1.937 s
  throughput      = 3.809 Mblock/s
  per-block       = 262.5 ns
  equiv 1080p     = 117.6 FPS  (32 400 blocks/frame)
```

Consistent with the previously captured 3.923 Mblock/s (longer
window). Per-block ~260 ns. **CDEF remains the most compute-intensive
kernel of any cycle so far** (2.15× IDCT, 13× LPF wd=4,
5.5× MC).

| kernel | per-block ns | relative to IDCT |
|---|---|---|
| IDCT 8×8 (k1) | 122 | 1.0× |
| LPF wd=4 (k2) | 20.7 | 0.17× |
| MC 8h (k3) | 47.6 | 0.39× |
| LPF wd=8 (k4) | 19.1 | 0.16× |
| **CDEF (k5)** | **262.5** | **2.15×** |

30fps@1080p floor margin: **3.9×** for single-core NEON in isolation
(117.6 FPS / 30 FPS). A NEON-4 baseline would be ~12-15 Mblock/s,
i.e. a 12-15× margin.
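
The derived figures follow from three inputs in the bench output: blocks per batch, batch count, and elapsed time. A minimal sketch of that arithmetic is below; `cdef_stats` and `derive_stats` are hypothetical names, not part of the bench. Because the printed elapsed time is rounded to three decimals, recomputing from it can differ from the logged values in the last digit.

```c
#include <stdint.h>

/* 8x8 blocks in one 1920x1080 frame: 240 * 135 = 32400 */
#define BLOCKS_PER_1080P_FRAME ((1920 / 8) * (1080 / 8))

typedef struct {
    double mblock_per_s;   /* throughput, millions of 8x8 blocks per second */
    double ns_per_block;   /* average time per block, nanoseconds */
    double fps_1080p;      /* equivalent 1080p frames per second */
    double margin_30fps;   /* headroom over a 30 fps floor */
} cdef_stats;

static cdef_stats derive_stats(uint64_t blocks_per_batch, uint64_t batches,
                               double elapsed_s)
{
    const double total = (double)(blocks_per_batch * batches);
    cdef_stats s;
    s.mblock_per_s = total / elapsed_s / 1e6;
    s.ns_per_block = elapsed_s * 1e9 / total;
    s.fps_1080p    = total / elapsed_s / BLOCKS_PER_1080P_FRAME;
    s.margin_30fps = s.fps_1080p / 30.0;
    return s;
}
```

Calling `derive_stats(4096, 1801, 1.937)` reproduces the run above to within rounding: roughly 3.81 Mblock/s, about 262.6 ns per block, and a margin just over 3.9×.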

## Methodology lessons

1. **Inverted-bench bugs look like layout mismatches.** The original
   diagnosis ("dav1d's NEON expects tmp built by a specific
   `dav1d_cdef_padding8_8bpc_neon` routine") was wrong; the
   filter accepts any uint16 tmp content (the pri+sec algorithm
   doesn't care whether the halo holds sentinels or random
   pixels, as long as every tap goes through the constrain() math).
   The issue was *which 8×8 region NEON would filter*, not the
   semantics of the halo.

2. **Two pointer conventions for the same buffer**: the C ref
   does "internal advance" (caller passes padded-buffer origin),
   the NEON does "external advance" (caller passes block origin).
   Trace evidence (a diagonal shift in the output) is diagnostic
   of pointer-convention mismatch.

3. **`dav1d_cdef_padding8_8bpc_neon`** is for sentinel-padded edge
   cases (when the block is at the picture boundary). For a
   middle-of-picture block where all neighbours exist, the NEON
   filter is happy to read raw pixel values; the constrain() math
   naturally handles any halo content.
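
Lessons 1 and 3 rest on how CDEF clamps each tap difference. The sketch below is written from the AV1 CDEF constrain definition rather than copied from dav1d, so treat it as an assumption to check against the source; the name `cdef_constrain` and the sample values are illustrative only.

```c
#include <stdlib.h>

/* Clamp a tap difference: small differences pass through unchanged,
 * larger ones are attenuated, and differences far above the threshold
 * contribute nothing. */
static int cdef_constrain(int diff, int threshold, int shift)
{
    const int adiff = abs(diff);
    int lim = threshold - (adiff >> shift);
    if (lim < 0)
        lim = 0;                                 /* never a negative limit */
    const int mag = adiff < lim ? adiff : lim;   /* min(adiff, lim) */
    return diff < 0 ? -mag : mag;                /* restore the sign */
}
```

Under this definition the result never exceeds `threshold` in magnitude, whatever value sits in the halo. For example, `cdef_constrain(3, 4, 2)` passes the difference through as 3, while `cdef_constrain(1000, 4, 2)` clamps to 0, which is how a halo value far from the pixel under test drops out of the sum.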

## What lands in this commit

- `tests/bench_neon_cdef.c`: 3-line fix (tmp+34 for NEON calls)
- `docs/k5_cdef_phase3.md` (this doc) supersedes
  `k5_cdef_phase3_partial.md`

## Phase 4 unblocked

Predicted R₅ (from `k5_cdef_phase3_partial.md`):
- CDEF is ~5× heavier per-block than MC on NEON (262 vs 48 ns)
- NEON ~5× per-core advantage on MC → QPU likely ~25× behind on CDEF
- R₅ isolation estimate: **0.02-0.05 (deep RED)**

Issue 003 V1/V2 NEON-fallback proxy showed that a 4th NEON core
running CDEF adds 1.7 Mblock/s of CDEF helper without crushing
the other 3 cores. Real QPU CDEF is predicted at ~0.2 Mblock/s
(an order of magnitude below the NEON-fallback proxy).
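
The prediction can be reproduced as simple arithmetic. The sketch below assumes that R₅ means the QPU/NEON per-core throughput ratio for CDEF and that the two ~5× factors multiply, as the bullets above imply; both function names are hypothetical.

```c
/* Assumption: R5 = QPU throughput / NEON single-core throughput on CDEF. */
static double predicted_r5(double neon_advantage_on_mc, double cdef_weight_vs_mc)
{
    /* The two ~5x factors are multiplied, giving the ~25x deficit. */
    return 1.0 / (neon_advantage_on_mc * cdef_weight_vs_mc);
}

/* Scale the measured NEON throughput by the predicted ratio. */
static double predicted_qpu_mblock_per_s(double neon_mblock_per_s, double r5)
{
    return neon_mblock_per_s * r5;
}
```

With both factors at 5, the ratio is 1/25 = 0.04, inside the 0.02-0.05 band. Applied to the measured 3.809 Mblock/s it gives about 0.15 Mblock/s, in line with the ~0.2 Mblock/s figure.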

**Phase 4 plan rationale**: even predicted RED, build the QPU
CDEF kernel because:
- Confirms or refutes the R₅ 0.02-0.05 prediction with real data
- Completes the cycle 5 record (Phases 1-7 all closed)
- Provides the QPU CDEF dispatch path needed for the V4L2 wrapper
  to *exist* (Phase 8), even if the scheduler doesn't enqueue it by
  default

Expected Phase 4 effort: 2-3 hours given the kernel shape is
similar to cycle 2/4 LPF (per-block stencil with table lookups
for directions; primary + secondary tap accumulation).