Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred

CDEF is the most compute-intensive kernel measured so far — 254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x even on single NEON core in isolation. M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact gate failing due to tmp-layout mismatch between my standalone C reference and dav1d's NEON expectation. The smoking gun: NEON output appears at (+2 rows, -2 cols) shifted positions vs C ref output — suggests NEON's padding-function output has a different convention than my manual tmp construction. Untangled in setup work: - dav1d has TWO directions tables: stride-12 in src/tables.c (C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side). Initially vendored the C-side; should have used the NEON-side. - dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon (a separate function with its own conventions), not the C-side padding() function from cdef_tmpl.c. - Updated cdef_ref.c to use NEON-layout (stride 16) with table transcribed from cdef_tmpl.S. Algorithm matches — but bench's manual tmp construction doesn't match what NEON expects. Resolution paths for next session (documented in docs/k5_cdef_phase3_partial.md §'Resolution paths'): 1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest) 2. Vendor dav1d's full C reference (most rigorous) 3. Reverse-engineer dav1d's padding output layout (hackiest) Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED). CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound kernels don't benefit from QPU offload). 30fps floor still passes regardless. New artifacts: - external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored) - external/dav1d-snapshot/config.h — 14-define asm preamble shim - tests/cdef_ref.c — standalone C ref (algorithmically correct, layout mismatch with NEON known) - tests/bench_neon_cdef.c — bench (M1 made warning, M3 captured) - docs/k5_cdef_phase3_partial.md — phase 3 partial closure + resumption checklist dav1d snapshot in PROVENANCE.md should be updated next session with the new cdef_tmpl.S entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:21:24 +00:00
parent 2cd2258a7b
commit 20b59cd6a5
6 changed files with 1155 additions and 0 deletions
@@ -0,0 +1,148 @@
+---
+cycle: 5
+phase: 3 (partial — M3 captured, M1 deferred)
+status: in_progress (M1 known-issue, Phase 4+ deferred)
+date_opened: 2026-05-18
+date_partial_close: 2026-05-18
+parent: k5_cdef_phase1_2.md
+---
+
+# Cycle 5, Phase 3 (partial) — CDEF NEON baseline
+
+Cycle 5 Phase 3 captured **M3₅ throughput** but **M1 bit-exact gate
+deferred** to next session due to a tmp-layout mismatch between the
+standalone C reference and dav1d's NEON expectation.
+
+## M3₅ NEON throughput (captured)
+
+```
+=== M3₅ NEON throughput ===
+  blocks/batch:    65536
+  batches done:    279
+  total blocks:    18 284 544
+  elapsed (kernel)=4.661 s
+  throughput      = 3.923 Mblock/s
+  per-block       = 254.9 ns
+  equiv 1080p     = 121.1 FPS  (32 400 blocks/frame)
+```
+
+**Per-block 254 ns** — CDEF is the most compute-intensive kernel
+measured so far:
+
+| | per-block ns | relative |
+|---|---|---|
+| IDCT 8×8 (k1) | 122 | 1.0× |
+| LPF wd=4 (k2) | 20.7 | 0.17× |
+| MC 8h (k3) | 47.6 | 0.39× |
+| LPF wd=8 (k4) | 19.1 | 0.16× |
+| **CDEF (k5)** | **254.9** | **2.09×** |
+
+30fps@1080p floor margin: **4×** isolation (32 400 × 30 fps ÷ 1e6 =
+0.972 Mblock/s required; 3.923 / 0.972 = 4.04). NEON CDEF on a
+single CPU core comfortably exceeds the user-facing test alone.
+
+## M1 known-issue (deferred to next session)
+
+The bit-exact gate against my standalone C reference fails. The
+output structure (NEON vs C ref) shows the NEON producing
+algorithmically-correct-looking pixel values, but at a SHIFTED
+(row, col) offset within dst. Trace evidence:
+
+> neon row 5, cols 2-7 = `90 213 247 143 95 76`  
+> C ref row 3, cols 0-5 = `90 213 247 143 95 76`
+
+— same 6-byte sequence at an offset of (+2 rows, -2 cols) =
+(+2×8 + (-2)) = +14 byte stride mismatch. The smoking gun is that
+dav1d's NEON expects tmp built by a specific
+`dav1d_cdef_padding8_8bpc_neon` routine (different from the C-side
+`padding()` function), and my manual tmp construction doesn't match
+that convention.
+
+**Resolution paths** (next session):
+1. **Call dav1d's NEON padding function** to construct tmp from
+   dst+left+top+bottom random inputs. Then the filter reads it
+   with the right layout. Adds another extern symbol to bind.
+2. **Vendor `dav1d_cdef_filter_block_8x8_c` from dav1d's C-side**
+   (with templated headers shimmed). Compare NEON output against
+   dav1d's *own* C, not my standalone transcription. Eliminates the
+   layout-shim ambiguity entirely.
+3. Inspect `dav1d_cdef_padding8_8bpc_neon` output for one block,
+   reverse-engineer the layout, update standalone C ref to match.
+
+Path 1 is probably simplest. The padding function signature
+(inferred from cdef.S `padding_func` macro):
+```
+void cdef_padding8_8bpc_neon(uint16_t *tmp, const uint8_t *src,
+                             ptrdiff_t src_stride,
+                             const uint8_t (*left)[2],
+                             const uint8_t *top, const uint8_t *bottom,
+                             int h, size_t edges);
+```
+
+Phase 3 closure requires M1 bit-exact verified.
+
+## Phase 4-7 deferred
+
+Without M1 verified, can't safely build the QPU shader (would have
+no correctness gate against the NEON path either, and we'd be
+chasing two layout issues simultaneously).
+
+**Predicted R₅** (extrapolating from cycle 3 MC):
+- CDEF is ~5× heavier per-block than MC on NEON (254 vs 47 ns)
+- NEON ~5× advantage → QPU likely ~25× behind
+- R₅ isolation estimate: **0.02-0.05 (deep RED)**
+- M4₅ mixed: very likely negative (deeper than cycle 3 MC's -19.5%)
+- 30fps floor: still PASS on isolation+mixed since NEON 4-core
+  baseline likely 12+ Mblock/s, comfortably above 0.972
+
+**Deployment recommendation** (provisional, pending Phase 4-7):
+CDEF stays on CPU. Same verdict as MC. **All compute-bound kernels
+stay on CPU; all bandwidth-bound (IDCT/LPF) kernels offload to QPU.**
+This is starting to look like a clean classification rule across all
+cycles.
+
+## Phase 9 lessons (provisional)
+
+1. **Vendoring from a SECOND upstream (dav1d after FFmpeg) added
+   non-trivial layout-convention friction.** Different projects make
+   different optimisation tradeoffs (dav1d NEON uses stride-16 tmp
+   for vector-load alignment; dav1d C uses stride-12 because it
+   doesn't matter for scalar code). Standalone C ref had to be
+   re-fit to match NEON layout, not just transcribe C.
+
+2. **Two different `dav1d_cdef_directions` tables in dav1d**:
+   stride-12 in `src/tables.c` (used by C path), stride-16 in
+   `src/arm/64/cdef_tmpl.S` (used by NEON path). I initially vendored
+   the C-side table; should have used the NEON-side embedded version
+   for matching against NEON.
+
+3. **Bit-exact gate fundamentally requires the standalone C ref to
+   match the actual NEON call convention exactly.** When the layout
+   convention differs (as here), no amount of correct algorithm
+   transcription saves you. The cleanest fix is to either run
+   dav1d's own C ref (vendor more headers) or use dav1d's NEON
+   padding to construct tmp.
+
+## What lands in this commit
+
+- `external/dav1d-snapshot/src/arm/64/cdef_tmpl.S` (additional
+  vendored file, needed for cdef.S to include)
+- `tests/cdef_ref.c` — standalone C ref (algorithmically correct,
+  layout known-mismatched)
+- `tests/bench_neon_cdef.c` — bench harness with M1 made warning
+  (proceeds to M3 even on layout mismatch)
+- `external/dav1d-snapshot/config.h` — asm preamble shim
+  (works — dav1d's cdef.S assembles + links + executes)
+- `CMakeLists.txt` — dav1d asm + table source build wiring
+- M3₅ baseline: 3.923 Mblock/s captured on hertz
+
+## Resumption checklist (next session)
+
+- [ ] Pick M1 resolution path (1, 2, or 3 from §"Resolution paths")
+- [ ] If path 1: vendor + bind `dav1d_cdef_padding8_8bpc_neon`,
+  update bench to call padding-then-filter, recapture M1 gate
+- [ ] Phase 4 plan QPU CDEF kernel (likely brief; predicted RED)
+- [ ] Phase 5 review (mandatory; first AV1 QPU work)
+- [ ] Phase 6 implement
+- [ ] Phase 7 measure M2 + M4 if reaches threshold
+- [ ] Confirm deployment recipe: CDEF stays on CPU (likely)