First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2
docs lay out the structural complexity (CDEF needs pre-padded 12x12
working buffer + external edge context + direction lookup +
constraint function — meaningfully more complex than cycles 1-4).
Phase 3+ deferred to next session — CDEF is the first cycle that
doesn't fit cleanly into a single autonomous run.
Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than
FFmpeg's LGPL-2.1+):
  src/arm/64/cdef.S          520 lines   NEON impl
  src/arm/64/util.S          278 lines   NEON helpers
  src/arm/asm.S              335 lines   GAS preamble
  src/cdef_tmpl.c            331 lines   C reference (templated)
  include/common/intops.h     84 lines   utility helpers
  src/tables_cdef_subset.c   hand-extracted: dav1d_cdef_directions only
                             (avoids dragging in the full 1013-line
                             tables.c + transitive includes)
Discovery from Phase 2 analysis:
- Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes
(dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h).
The 'tmp' arg is the pre-padded 12x12 buffer constructed externally
by the dav1d C-side padding() function.
- Tap weights are inline-computed (not table): pri_tap = 4 or 3
(based on pri_strength bit), sec_tap = 2 or 1. Only
dav1d_cdef_directions[12][2] is an external table.
- Constraint function: constrain(diff, threshold, shift) =
apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))),
diff)
Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than
LPF (per-pixel min/max conditional logic), so likely worse R than
cycle 2/4 but better than cycle 3 MC. M4 gate likely required.
What Phase 3+ needs (next session):
1. config.h shim for dav1d's asm preamble (defines TBD on first build)
2. Standalone C reference for cdef_filter_block_8x8_c
(cdef_tmpl.c references several dav1d private headers; cleaner to
transcribe to a self-contained tests/cdef_ref.c)
3. tests/bench_neon_cdef.c — M1+M3 bench
4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure
PROVENANCE.md documents pin + per-file role + re-vendoring procedure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| cycle | phases | status | date_opened | parent_cycle | target_kernel | new_vendor |
|---|---|---|---|---|---|---|
| 5 | 1-2 (combined; phase 3+ pending) | setup in progress | 2026-05-18 | k4_lpf8_phase4_7.md | AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only (assume direction + strengths pre-computed) | dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin |
Cycle 5, Phases 1-2 — AV1 CDEF
First AV1 kernel; first cycle that vendors from outside the FFmpeg snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause, mature aarch64 NEON, used by VLC + Firefox via libdav1d).
Phase 1 — goal
Kernel: AV1 Constrained Directional Enhancement Filter, 8×8 luma output, 8 bits/component, FILTER stage (direction + strength parameters assumed pre-computed). Match the "pre-computed params" convention of LPF (E/I/H) and MC (mx).
NEON symbol target: dav1d_cdef_filter8_pri_sec_8bpc_neon (combined
primary + secondary filter). There are also _pri_ and _sec_ only
variants for the cases where one strength is 0; for the bench we
cover the worst case (both active).
C reference: cdef_filter_block_8x8_c from dav1d/src/cdef_tmpl.c
(macro-expanded), delegating to cdef_filter_block_c. Spec source:
AV1 specification §7.15 (CDEF).
Measurable success (cycle-5 numbering, subscript 5)
| ID | Measurement | Gate |
|---|---|---|
| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
| M2₅ | QPU throughput Mblock/s | recorded |
| M3₅ | NEON dav1d_cdef_filter8_pri_sec_8bpc_neon Mblock/s | recorded |
| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |
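The M1₅ gate reduces to counting differing pixels between two 8×8 outputs and requiring zero. A minimal sketch of that comparison follows; the name count_mismatches_8x8 is hypothetical and not part of dav1d or the existing bench code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of differing pixels between two 8x8 blocks.
 * A bit-exact result (the M1 gate) means this returns 0 for every block. */
static int count_mismatches_8x8(const uint8_t *a, ptrdiff_t a_stride,
                                const uint8_t *b, ptrdiff_t b_stride)
{
    assert(a != NULL && b != NULL);
    int bad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            bad += a[y * a_stride + x] != b[y * b_stride + x];
    return bad;
}
```

Summing this over all N random blocks and dividing by 64·N would give the percentage reported against the 100.0000 % gate.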
Decision bands (carried)
Same R bands and 30fps-floor calibration as cycles 1-4.
Predicted R₅
The CDEF filter is compute-heavier than LPF:
- Per pixel: 12 constraint applications (4 primary + 8 secondary taps, each abs + min + max + sign-restore) plus the per-pixel accumulation with min/max tracking
- Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many conditionals
- Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block (vs LPF's ~88 B and MC's ~184 B)
- No DP4A applicability (the multipliers are small constants, but the constraint function dominates)
Predicted R₅ band: 0.15-0.30 (ORANGE). The constraint function's per-pixel min/max conditional logic is heavier than LPF's per-row fm/flat tests. Compute-bound on QPU. M4 may still rescue per cycle-1+2 pattern.
NEW for cycle 5
- First AV1 kernel → expands codec coverage beyond VP9
- First dav1d-vendored source → new external/ subdirectory: external/dav1d-snapshot/ (BSD-2-Clause; clean license vs LGPL FFmpeg)
- First kernel needing external padding context: CDEF reads beyond the 8×8 block (2-pixel halo on each side); dav1d's C reference uses a pre-padded tmp_buf[12×12] constructed by a separate padding() function from left/top/bottom edge arrays. Our bench will construct this padding inline for each random block.
Phase 2 — situation analysis
C reference structure (dav1d)
cdef_filter_block_8x8_c signature:
void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
const pixel (*left)[2],
const pixel *top, const pixel *bottom,
int pri_strength, int sec_strength,
int dir, int damping,
enum CdefEdgeFlags edges);
The function:
- Allocates int16_t tmp_buf[144] (12×12 working buffer)
- Calls padding() to fill from left/top/bottom + dst with edge-replicate
- Iterates 8 rows × 8 cols; per pixel:
  - Looks up direction offsets: dav1d_cdef_directions[dir+offset][k]
  - For each of 4 primary tap positions (k=0..1, both signs): compute pri-constrained diff, multiply by tap weight, accumulate
  - For each of 8 secondary tap positions (k=0..1, both signs, two adjacent directions): same with sec weights
  - Track min/max across all sampled neighbours
  - Output: iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)
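The output step rounds the accumulated sum symmetrically around zero before clipping to the tracked min/max. A small sketch of just that step, assuming that >> on a negative int is an arithmetic shift (true for GCC and Clang, but implementation-defined in ISO C); cdef_round_output is a hypothetical name, and iclip here is a local stand-in for the intops.h helper.

```c
#include <assert.h>

/* Local stand-in for dav1d's iclip() helper. */
static inline int iclip(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Hypothetical wrapper around the output expression from the list above.
 * Subtracting (sum < 0) makes the /16 rounding symmetric:
 * sum = 8 gives +1 and sum = -8 gives -1.
 * Assumes arithmetic right shift for negative values. */
static inline int cdef_round_output(int px, int sum, int min, int max) {
    assert(min <= max);
    return iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max);
}
```

For example, with px = 100 and a wide clip range, sum = 7 leaves the pixel at 100, sum = 8 gives 101, and sum = -8 gives 99.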
The "constraint" function:
static inline int constrain(int diff, int threshold, int shift) {
int adiff = abs(diff);
return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
diff);
}
This is the per-pixel-pair clamp that makes CDEF constrained (directional enhancement that can't exceed a threshold tied to local strength).
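To make the clamp's behaviour concrete, here is a self-contained copy with local stand-ins for the intops.h helpers (imin, imax, apply_sign are re-declared here as assumptions about their behaviour, not copied from dav1d). Small differences pass through unchanged, medium ones are reduced, and large ones are zeroed.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-ins for the intops.h helpers named in the text. */
static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }
/* Gives v the sign of s (v is expected to be non-negative). */
static inline int apply_sign(int v, int s) { return s < 0 ? -v : v; }

static inline int constrain(int diff, int threshold, int shift) {
    assert(shift >= 0);
    int adiff = abs(diff);
    /* The allowed magnitude shrinks as |diff| grows and bottoms out at 0. */
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}
```

With threshold = 4 and shift = 1: a diff of 1 stays 1, a diff of 5 is limited to 2 (4 - (5 >> 1)), a diff of -5 becomes -2, and a diff of 20 is suppressed to 0 because 4 - (20 >> 1) is negative.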
Tables needed
- dav1d_cdef_directions[12][2]: 12 entries (8 directions + 4 wrap-arounds), each holding one offset into the padded buffer per tap position k. In dav1d/src/tables.c.
- dav1d_cdef_pri_taps[2][2]: primary tap weights, indexed by (pri_strength & 1) and tap position k. Small ints.
- dav1d_cdef_sec_taps[2]: secondary tap weights, just 2 entries.
NEON reference structure (dav1d)
dav1d_cdef_filter8_pri_sec_8bpc_neon signature:
x0: dst pixel buffer
x1: dst_stride ptrdiff_t
x2: tmp uint8_t source (the pre-padded 12×12 buffer reinterpreted)
w3: pri_strength
w4: sec_strength
w5: dir
w6: damping
w7: h height (8 for 8×8)
Notable: dav1d's NEON takes the already-padded tmp buffer pointer
(after the C side did padding()). So our bench needs to construct
the padded buffer per block.
Padded buffer layout (12×12, int16 elements):
- Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
- Halo at rows {0,1,10,11} and cols {0,1,10,11}: either edge-replicate from adjacent block (if edges flag set) or INT16_MIN (which the constraint function treats as "skip this neighbour")
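The layout above can be sketched as a small fill routine. This is only the simplest case (the whole halo set to one fill value, such as INT16_MIN for "no neighbours available"); it is not dav1d's padding(), and build_padded_tmp and the CDEF_* constants are hypothetical names. A bench using synthetic neighbours would instead copy real values into the halo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    CDEF_BLK  = 8,                        /* 8x8 block */
    CDEF_HALO = 2,                        /* 2-pixel halo on each side */
    CDEF_PAD  = CDEF_BLK + 2 * CDEF_HALO  /* 12 */
};

/* Hypothetical helper: fill a 12x12 int16 buffer with halo_fill, then copy
 * the 8x8 source block into rows [2..9], cols [2..9]. */
static void build_padded_tmp(int16_t tmp[CDEF_PAD * CDEF_PAD],
                             const uint8_t *src, ptrdiff_t src_stride,
                             int16_t halo_fill)
{
    assert(src != NULL);
    for (int i = 0; i < CDEF_PAD * CDEF_PAD; i++)
        tmp[i] = halo_fill;
    for (int y = 0; y < CDEF_BLK; y++)
        for (int x = 0; x < CDEF_BLK; x++)
            tmp[(y + CDEF_HALO) * CDEF_PAD + (x + CDEF_HALO)] =
                src[y * src_stride + x];
}
```

Pixel (x, y) of the block then sits at tmp[(y + 2) * 12 + (x + 2)], so neighbour offsets in the range -2..+2 in either axis never leave the buffer.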
Vendoring plan
New directory: external/dav1d-snapshot/ (BSD-2-Clause, separate
PROVENANCE.md from FFmpeg pin).
Files to vendor from dav1d 1.4.3:
- src/arm/64/cdef.S: main NEON file (~870 lines)
- src/arm/64/util.S: helper macros referenced by cdef.S
- src/arm/asm.S: top-level macros (function, endfunc, etc.)
- src/cdef_tmpl.c: C reference (~250 lines)
- src/tables.c: the static tables (cdef_directions, pri/sec taps), or hand-extract just the CDEF tables (~50 lines)
- include/common/intops.h: apply_sign, imin, imax, iclip helpers
- A standalone PROVENANCE.md with pin + SHA-256s
dav1d's asm preamble may need its own config.h shim (different defines than FFmpeg's). Phase 6 setup will identify exact needs.
Build path
dav1d's asm uses a GAS preamble similar to FFmpeg's. The config
defines differ: alongside ARCH_AARCH64, HAVE_AS_FUNC, etc., there are
dav1d-specific ones such as PRIVATE_PREFIX (dav1d_) and EXTERN_ASM
(empty for ELF, as in cycle 1).
What Phase 2 does not close
- The exact list of dav1d asm.S macros needed (will surface during first build attempt)
- C reference completeness: padding() setup logic is non-trivial (handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP, HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by always passing "all edges valid" with synthetic neighbouring pixels.
- Direction validation: directions 0..7 should all be tested for bit-exactness; an off-by-one in the direction-offset table would be caught by M1.
Phase 3 next: vendor the dav1d files, write standalone C ref + bench, capture M3₅ NEON baseline.
This is the first multi-session cycle — Phase 3+ likely lands in next session. Cycle setup commit at end of this session.