---
cycle: 5
phases: 1-2 (combined; phase 3+ pending)
status: setup in progress
date_opened: 2026-05-18
parent_cycle: k4_lpf8_phase4_7.md
target_kernel: AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only (assume direction + strengths pre-computed)
new_vendor: dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin
---

# Cycle 5, Phases 1-2: AV1 CDEF

First AV1 kernel; first cycle that vendors from outside the FFmpeg snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause, mature aarch64 NEON, used by VLC + Firefox via libdav1d).

## Phase 1: goal

**Kernel**: AV1 Constrained Directional Enhancement Filter, 8×8 luma output, 8 bits/component, FILTER stage (direction + strength parameters assumed pre-computed). This matches the "pre-computed params" convention of LPF (E/I/H) and MC (mx).

**NEON symbol target**: `dav1d_cdef_filter8_pri_sec_8bpc_neon` (combined primary + secondary filter). There are also `_pri_`-only and `_sec_`-only variants for the cases where one strength is 0; for the bench we cover the worst case (both active).

**C reference**: `cdef_filter_block_8x8_c` from `dav1d/src/cdef_tmpl.c` (macro-expanded), delegating to `cdef_filter_block_c`. Spec source: AV1 specification §7.15 (CDEF).

### Measurable success (cycle-5 numbering, `5` subscript)

| ID | Measurement | Gate |
|---|---|---|
| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
| M2₅ | QPU throughput Mblock/s | recorded |
| M3₅ | NEON `dav1d_cdef_filter8_pri_sec_8bpc_neon` Mblock/s | recorded |
| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |

### Decision bands (carried)

Same R bands and 30fps-floor calibration as cycles 1-4.

### Predicted R₅

The CDEF filter is **compute-heavier than LPF**:

- Per pixel: 8 constraint applications (abs + min + max + sign-restore) plus the per-pixel accumulation with min/max tracking
- Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many conditionals
- Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block (vs LPF's ~88 B and MC's ~184 B)
- No DP4A applicability (the multipliers are small constants, but the constraint function dominates)

**Predicted R₅ band**: 0.15-0.30 (ORANGE). The constraint function's per-pixel min/max conditional logic is heavier than LPF's per-row fm/flat tests, so the kernel is compute-bound on QPU. M4₅ may still rescue it, following the cycle 1-2 pattern.

### NEW for cycle 5

- **First AV1 kernel** → expands codec coverage beyond VP9
- **First dav1d-vendored source** → new external/ subdirectory: `external/dav1d-snapshot/` (BSD-2-Clause; clean license vs LGPL FFmpeg)
- **First kernel needing external padding context**: CDEF reads beyond the 8×8 block (2-pixel halo on each side); dav1d's C reference uses a pre-padded `tmp_buf[12×12]` constructed by a separate `padding()` function from left/top/bottom edge arrays. Our bench will construct this padding inline for each random block.

## Phase 2: situation analysis

### C reference structure (dav1d)

`cdef_filter_block_8x8_c` signature:

```c
void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
                             const pixel (*left)[2],
                             const pixel *top, const pixel *bottom,
                             int pri_strength, int sec_strength,
                             int dir, int damping,
                             enum CdefEdgeFlags edges);
```

The function:

1. Allocates `int16_t tmp_buf[144]` (12×12 working buffer)
2. Calls `padding()` to fill it from left/top/bottom + dst with edge-replicate
3. Iterates 8 rows × 8 cols; per pixel:
   - Looks up direction offsets: `dav1d_cdef_directions[dir+offset][k]`
   - For each of 4 primary tap positions (k=0..1, both signs): computes the pri-constrained diff, multiplies by the tap weight, accumulates
   - For each of 4 secondary tap positions (k=0..1, both signs, two adjacent directions): same with sec weights
   - Tracks min/max across all sampled neighbours
   - Output: `iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)`

The "constraint" function:

```c
static inline int constrain(int diff, int threshold, int shift) {
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}
```

This is the per-pixel-pair clamp that makes CDEF *constrained*: a directional enhancement whose contribution cannot exceed a threshold tied to local strength.

### Tables needed

- `dav1d_cdef_directions[12][2]`: 12 directions (8 + 4 wrap-arounds), each a (y_offset, x_offset) pair. In `dav1d/src/tables.c`.
- `dav1d_cdef_pri_taps[2][2]`: primary tap weights, indexed by `(pri_strength & 1)` and tap position k. Small ints.
- `dav1d_cdef_sec_taps[2]`: secondary tap weights, just 2 entries.

### NEON reference structure (dav1d)

`dav1d_cdef_filter8_pri_sec_8bpc_neon` signature:

```
x0: dst           pixel buffer
x1: dst_stride    ptrdiff_t
x2: tmp           uint8_t source (the pre-padded 12×12 buffer reinterpreted)
w3: pri_strength
w4: sec_strength
w5: dir
w6: damping
w7: h             height (8 for 8×8)
```

Notable: dav1d's NEON takes the already-padded `tmp` buffer pointer (after the C side has run `padding()`), so our bench needs to construct the padded buffer per block.

Padded buffer layout (12×12, int16 elements):

- Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
- Halo at rows {0,1,10,11} and cols {0,1,10,11}: either edge-replicated from the adjacent block (if the edges flag is set) or `INT16_MIN` (which the constraint function treats as "skip this neighbour")

### Vendoring plan

New directory: `external/dav1d-snapshot/` (BSD-2-Clause, separate PROVENANCE.md from FFmpeg pin).

Files to vendor from dav1d 1.4.3:

1. `src/arm/64/cdef.S`: main NEON file (~870 lines)
2. `src/arm/64/util.S`: helper macros referenced by cdef.S
3. `src/arm/asm.S`: top-level macros (function, endfunc, etc.)
4. `src/cdef_tmpl.c`: C reference (~250 lines)
5. `src/tables.c`: the static tables (cdef_directions, pri/sec taps), *or* hand-extract just the CDEF tables (~50 lines)
6. `include/common/intops.h`: apply_sign, imin, imax, iclip helpers
7. A standalone PROVENANCE.md with pin + SHA-256s

dav1d's asm preamble may need its own config.h shim (different defines than FFmpeg's). Phase 6 setup will identify exact needs.

### Build path

dav1d's asm uses a GAS preamble similar to FFmpeg's. The config defines differ, though: `ARCH_AARCH64`, `HAVE_AS_FUNC`, etc., plus dav1d-specific ones such as `PRIVATE_PREFIX dav1d_` and `EXTERN_ASM` (empty for ELF, same as in cycle 1).

### What Phase 2 does *not* close

- The exact list of dav1d asm.S macros needed (will surface during the first build attempt)
- C reference completeness: the `padding()` setup logic is non-trivial (it handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP, HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by always passing "all edges valid" with synthetic neighbouring pixels.
- Direction validation: directions 0..7 should all be tested for bit-exactness; an off-by-one in the direction-offset table would be caught by M1₅.

Phase 3 next: vendor the dav1d files, write the standalone C ref + bench, capture the M3₅ NEON baseline.

This is **the first multi-session cycle**: Phase 3+ likely lands in the next session. The cycle setup commit lands at the end of this session.
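
As a small sanity aid for the standalone C ref, the constraint function and the rounding term from the Phase 2 output expression can be exercised in isolation. The following is a minimal sketch, not the vendored dav1d code: `imin`, `imax`, and `apply_sign` are re-declared locally as stand-ins for the `include/common/intops.h` helpers (their bodies here are an assumption about that header), and `cdef_round` is a hypothetical name introduced only for this illustration.

```c
#include <stdlib.h> /* abs */

/* Local stand-ins for the helpers listed under include/common/intops.h.
 * Assumption: these bodies mirror the vendored header's behaviour. */
static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }
static inline int apply_sign(int v, int s) { return s < 0 ? -v : v; }

/* Same body as the constraint function quoted in Phase 2. */
static inline int constrain(int diff, int threshold, int shift) {
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}

/* Hypothetical helper: the rounding term of the output expression
 * px + ((sum - (sum < 0) + 8) >> 4).
 * Assumes >> on a negative int is an arithmetic shift, which the C
 * standard leaves implementation-defined. */
static inline int cdef_round(int sum) {
    return (sum - (sum < 0) + 8) >> 4;
}
```

With `threshold = 4` and `shift = 1`, `constrain(5, 4, 1)` gives 2, `constrain(-5, 4, 1)` gives -2, and `constrain(20, 4, 1)` gives 0: small differences pass through, moderate ones are clamped, and large ones (likely real edges) contribute nothing. The rounding term is symmetric around zero: `cdef_round(8)` is 1 and `cdef_round(-8)` is -1, while `cdef_round(7)` and `cdef_round(-7)` are both 0.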