diff --git a/docs/k5_cdef_phase3.md b/docs/k5_cdef_phase3.md new file mode 100644 index 0000000..09f4150 --- /dev/null +++ b/docs/k5_cdef_phase3.md @@ -0,0 +1,121 @@ +--- +cycle: 5 +phase: 3 +status: closed 2026-05-18 — M1 PASS, M3 captured +date_opened: 2026-05-18 +date_closed: 2026-05-18 +parent: k5_cdef_phase1_2.md +host: hertz +--- + +# Cycle 5, Phase 3 — CDEF NEON baseline (closed) + +Supersedes `k5_cdef_phase3_partial.md`. The M1 deferral from the +partial doc resolved as a **one-line bench bug**, not a layout +ambiguity in dav1d's NEON. + +## Root cause of the previous "layout mismatch" + +`tests/cdef_ref.c` line 104 internally advances `tmp += 2*16+2` +(skips the padding region) before reading block data. `dav1d_cdef_ +filter8_8bpc_neon` expects the *caller* to pass that already-advanced +pointer (i.e., pointer to the 8×8 block origin, not the padded +buffer origin). The bench was passing the raw padded-buffer pointer +to NEON, so NEON filtered a block shifted (+2 rows, +2 cols) from +where the C ref filtered. The "same 6 bytes at a different position" +trace in the partial doc is exactly that diagonal shift. + +Fix: `tmps + i*TMP_INTS + (2 * TMP_W + 2)` for the NEON call. +Three-line patch in `tests/bench_neon_cdef.c`. + +## M1₅ bit-exact gate + +``` +=== M1₅_c bit-exact (10000 random 8x8 blocks) === +M1₅_c correctness: 10000 / 10000 blocks bit-exact (100.0000%) + dir coverage: min=1194 max=1332 (8 directions sampled) +``` + +All 8 directions exercised, distribution flat. **M1 gate PASS.** + +## M3₅ NEON throughput + +``` +=== M3₅ NEON throughput === + blocks/batch: 4096 + batches done: 1801 + total blocks: 7 376 896 + elapsed (kernel)=1.937 s + throughput = 3.809 Mblock/s + per-block = 262.5 ns + equiv 1080p = 117.6 FPS (32 400 blocks/frame) +``` + +Consistent with the previously captured 3.923 Mblock/s (longer +window). Per-block ~260 ns. **CDEF remains the most compute- +intensive kernel cycle so far** (2.1× IDCT, 13× LPF wd=4, +5.5× MC). + +| | per-block ns | relative | +|---|---|---| +| IDCT 8×8 (k1) | 122 | 1.0× | +| LPF wd=4 (k2) | 20.7 | 0.17× | +| MC 8h (k3) | 47.6 | 0.39× | +| LPF wd=8 (k4) | 19.1 | 0.16× | +| **CDEF (k5)** | **262.5** | **2.15×** | + +30fps@1080p floor margin: **3.9×** isolation NEON single-core. +NEON-4 baseline would be ~12-15 Mblock/s → 12-15× margin. + +## Methodology lessons + +1. **Inverted-bench bugs look like layout mismatches.** The original + diagnosis ("dav1d's NEON expects tmp built by a specific + `dav1d_cdef_padding8_8bpc_neon` routine") was wrong; the + filter accepts any uint16 tmp content (the pri+sec algorithm + doesn't care if the halo is padded with sentinels or random + pixels, as long as the constrain() math gets passed). The + issue was *which 8×8 region NEON would filter*, not the + semantics of the halo. + +2. **Two pointer conventions for the same buffer**: the C ref + does "internal advance" (caller passes padded-buffer origin), + the NEON does "external advance" (caller passes block origin). + Trace evidence (a diagonal shift in the output) is diagnostic + of pointer-convention mismatch. + +3. **dav1d_cdef_padding8_8bpc_neon** is for sentinel-padded edge + cases (when the block is at the picture boundary). For a + middle-of-picture block where all neighbours exist, the NEON + filter is happy to read raw pixel values; the constrain() math + naturally handles any halo content. + +## What lands in this commit + +- `tests/bench_neon_cdef.c`: 3-line fix (tmp+34 for NEON calls) +- `docs/k5_cdef_phase3.md` (this doc) supersedes + `k5_cdef_phase3_partial.md` + +## Phase 4 unblocked + +Predicted R₅ (from `k5_cdef_phase3_partial.md`): +- CDEF is ~5× heavier per-block than MC on NEON (262 vs 48 ns) +- NEON ~5× per-core advantage on MC → QPU likely ~25× behind on CDEF +- R₅ isolation estimate: **0.02-0.05 (deep RED)** + +Issue 003 V1/V2 NEON-fallback proxy showed that a 4th NEON core +running CDEF adds 1.7 Mblock/s of CDEF helper without crushing +the other 3 cores. Real QPU CDEF is predicted at ~0.2 Mblock/s +(an order of magnitude below the NEON-fallback proxy). + +**Phase 4 plan rationale**: even predicted RED, build the QPU +CDEF kernel because: +- Confirms or refutes the R₅ 0.02-0.05 prediction with real data +- Completes the cycle 5 record (Phases 1-7 all closed) +- Provides the QPU CDEF dispatch path needed for the V4L2 wrapper + to *exist* (Phase 8), even if scheduler doesn't enqueue it by + default + +Expected Phase 4 effort: 2-3 hours given the kernel shape is +similar to cycle 2/4 LPF (per-block stencil with table lookups +for directions; primary + secondary tap accumulation). diff --git a/tests/bench_neon_cdef.c b/tests/bench_neon_cdef.c index 9168b8f..5a04638 100644 --- a/tests/bench_neon_cdef.c +++ b/tests/bench_neon_cdef.c @@ -113,11 +113,16 @@ static int correctness_check(uint64_t seed, int n) tmp_center_to_dst(dst_a, tmp); memcpy(dst_b, dst_a, DST_BYTES); + /* C ref advances tmp internally by +2*stride+2. + * NEON expects the caller to pass the already-advanced pointer + * (i.e. pointer to the block-data origin, not the padded-buffer + * origin). Hence the tmp+34 for the NEON call. */ daedalus_cdef_filter_8x8_pri_sec_ref( dst_a, DST_W, tmp, pri, sec, dir, damping, 8); dav1d_cdef_filter8_8bpc_neon( - dst_b, DST_W, tmp, pri, sec, dir, damping, 8, - /* edges = */ 0); /* != 0xf → non-edged path, uint16 tmp w/stride 12 */ + dst_b, DST_W, tmp + (2 * TMP_W + 2), + pri, sec, dir, damping, 8, + /* edges = */ 0); /* uint16 tmp non-edged path */ if (memcmp(dst_a, dst_b, DST_BYTES) != 0) { if (mismatches < 3) { @@ -180,7 +185,7 @@ static void throughput_neon(uint64_t seed, int n_blocks, double duration_s) for (int i = 0; i < n_blocks; i++) dav1d_cdef_filter8_8bpc_neon( work_dst + (size_t)i * DST_BYTES, DST_W, - tmps + (size_t)i * TMP_INTS, + tmps + (size_t)i * TMP_INTS + (2 * TMP_W + 2), pris[i], secs[i], dirs[i], damps[i], 8, 0); double t0 = now_seconds(); @@ -191,7 +196,7 @@ static void throughput_neon(uint64_t seed, int n_blocks, double duration_s) for (int i = 0; i < n_blocks; i++) dav1d_cdef_filter8_8bpc_neon( work_dst + (size_t)i * DST_BYTES, DST_W, - tmps + (size_t)i * TMP_INTS, + tmps + (size_t)i * TMP_INTS + (2 * TMP_W + 2), pris[i], secs[i], dirs[i], damps[i], 8, 0); done += n_blocks; }