daedalus-fourier/docs/k5_cdef_phase1_2.md
marfrit 2cd2258a7b Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources
First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2
docs lay out the structural complexity (CDEF needs pre-padded 12x12
working buffer + external edge context + direction lookup +
constraint function — meaningfully more complex than cycles 1-4).

Phase 3+ deferred to next session — CDEF is the first cycle that
doesn't fit cleanly into a single autonomous run.

Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than
FFmpeg's LGPL-2.1+):

  src/arm/64/cdef.S            520 lines — NEON impl
  src/arm/64/util.S            278 lines — NEON helpers
  src/arm/asm.S                335 lines — GAS preamble
  src/cdef_tmpl.c              331 lines — C reference (templated)
  include/common/intops.h       84 lines — utility helpers
  src/tables_cdef_subset.c      hand-extracted — dav1d_cdef_directions
                                only (avoids dragging full 1013-line
                                tables.c + transitive includes)

Discovery from Phase 2 analysis:
- Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes
  (dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h).
  The 'tmp' arg is the pre-padded 12x12 buffer constructed externally
  by the dav1d C-side padding() function.
- Tap weights are inline-computed (not table): pri_tap = 4 or 3
  (based on pri_strength bit), sec_tap = 2 or 1. Only
  dav1d_cdef_directions[12][2] is an external table.
- Constraint function: constrain(diff, threshold, shift) =
  apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))),
             diff)

Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than
LPF (per-pixel min/max conditional logic), so likely worse R than
cycle 2/4 but better than cycle 3 MC. M4 gate likely required.

What Phase 3+ needs (next session):
1. config.h shim for dav1d's asm preamble (defines TBD on first build)
2. Standalone C reference for cdef_filter_block_8x8_c
   (cdef_tmpl.c references several dav1d private headers; cleaner to
   transcribe to a self-contained tests/cdef_ref.c)
3. tests/bench_neon_cdef.c — M1+M3 bench
4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure

PROVENANCE.md documents pin + per-file role + re-vendoring procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:12:25 +00:00


| Field | Value |
| --- | --- |
| cycle | 5 |
| phases | 1-2 (combined; phase 3+ pending) |
| status | setup in progress |
| date_opened | 2026-05-18 |
| parent_cycle | k4_lpf8_phase4_7.md |
| target_kernel | AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only (assume direction + strengths pre-computed) |
| new_vendor | dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin |

Cycle 5, Phases 1-2 — AV1 CDEF

First AV1 kernel; first cycle that vendors from outside the FFmpeg snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause, mature aarch64 NEON, used by VLC + Firefox via libdav1d).

Phase 1 — goal

Kernel: AV1 Constrained Directional Enhancement Filter, 8×8 luma output, 8 bits/component, FILTER stage (direction + strength parameters assumed pre-computed). Match the "pre-computed params" convention of LPF (E/I/H) and MC (mx).

NEON symbol target: dav1d_cdef_filter8_pri_sec_8bpc_neon (combined primary + secondary filter). There are also _pri_ and _sec_ only variants for the cases where one strength is 0; for the bench we cover the worst case (both active).

C reference: cdef_filter_block_8x8_c from dav1d/src/cdef_tmpl.c (macro-expanded), delegating to cdef_filter_block_c. Spec source: AV1 specification §7.15 (CDEF).

Measurable success (cycle-5 numbering, subscript 5)

| ID | Measurement | Gate |
| --- | --- | --- |
| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
| M2₅ | QPU throughput | Mblock/s recorded |
| M3₅ | NEON dav1d_cdef_filter8_pri_sec_8bpc_neon | Mblock/s recorded |
| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |

Decision bands (carried)

Same R bands and 30fps-floor calibration as cycles 1-4.

Predicted R₅

The CDEF filter is compute-heavier than LPF:

  • Per pixel: 12 constraint applications (4 primary + 8 secondary taps; each is abs + min + max + sign-restore) plus the per-pixel accumulation with min/max tracking
  • Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many conditionals
  • Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block (vs LPF's ~88 B and MC's ~184 B)
  • No DP4A applicability (the multipliers are small constants, but the constraint function dominates)

Predicted R₅ band: 0.15-0.30 (ORANGE). The constraint function's per-pixel min/max conditional logic is heavier than LPF's per-row fm/flat tests, so the kernel is expected to be compute-bound on the QPU. M4 may still rescue it, following the pattern of cycles 1 and 2.

NEW for cycle 5

  • First AV1 kernel → expands codec coverage beyond VP9
  • First dav1d-vendored source → new external/ subdirectory: external/dav1d-snapshot/ (BSD-2-Clause; clean license vs LGPL FFmpeg)
  • First kernel needing external padding context — CDEF reads beyond the 8×8 block (2-pixel halo on each side); dav1d's C reference uses pre-padded tmp_buf[12×12] constructed by a separate padding() function from left/top/bottom edge arrays. Our bench will construct this padding inline for each random block.

Phase 2 — situation analysis

C reference structure (dav1d)

cdef_filter_block_8x8_c signature:

void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
                             const pixel (*left)[2],
                             const pixel *top, const pixel *bottom,
                             int pri_strength, int sec_strength,
                             int dir, int damping,
                             enum CdefEdgeFlags edges);

The function:

  1. Allocates int16_t tmp_buf[144] (12×12 working buffer)
  2. Calls padding() to fill from left/top/bottom + dst with edge-replicate
  3. Iterates 8 rows × 8 cols; per pixel:
    • Looks up direction offsets: dav1d_cdef_directions[dir+offset][k]
    • For each of 4 primary tap positions (k=0..1, both signs): compute pri-constrained diff, multiply by tap weight, accumulate
    • For each of 8 secondary tap positions (k=0..1, both signs, two adjacent directions): same with sec weights
    • Track min/max across all sampled neighbours
    • Output: iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)
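The per-pixel loop above can be sketched in self-contained C. This is an illustration under stated assumptions, not the vendored code: cdef_filter_8x8_sketch, umin_i and the dirs parameter are hypothetical names, the offsets table is passed in by the caller instead of being hard-coded, and the k=1 primary weights (2 or 3) are assumed from the AV1 primary tap sets {4,2} and {3,3}. It also assumes both strengths are non-zero and that tmp_buf is already padded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { TMP_STRIDE = 12 };

static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }
static inline int iclip(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }
static inline int apply_sign(int v, int ref) { return ref < 0 ? -v : v; }
/* Unsigned compare: an INT16_MIN halo marker looks huge and never wins the min. */
static inline int umin_i(int a, int b) { return (unsigned)a < (unsigned)b ? a : b; }
static inline int ulog2(unsigned v) { int n = 0; while (v >>= 1) n++; return n; }

static inline int constrain(int diff, int threshold, int shift) {
    const int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}

/* dirs: 12 rows of two flattened offsets (k = 0, 1) into the 12-wide buffer. */
static void cdef_filter_8x8_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                                   const int16_t *tmp_buf,
                                   const int8_t (*dirs)[2],
                                   int pri_strength, int sec_strength,
                                   int dir, int damping)
{
    assert(pri_strength > 0 && sec_strength > 0 && dir >= 0 && dir < 8);
    const int16_t *tmp = tmp_buf + 2 * TMP_STRIDE + 2; /* first real pixel */
    const int pri_shift = imax(0, damping - ulog2((unsigned)pri_strength));
    const int sec_shift = damping - ulog2((unsigned)sec_strength);
    assert(sec_shift >= 0);

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            const int px = tmp[x]; /* centre of tmp mirrors the source block */
            int sum = 0, min = px, max = px;
            int pri_tap = 4 - (pri_strength & 1); /* 4 or 3 for k = 0 */
            for (int k = 0; k < 2; k++) {
                const int sec_tap = 2 - k; /* 2, then 1 */
                const int off[3] = { dirs[dir + 2][k],   /* primary direction */
                                     dirs[dir + 4][k],   /* secondary, one side */
                                     dirs[dir + 0][k] }; /* secondary, other side */
                for (int t = 0; t < 3; t++) {
                    const int strength = t ? sec_strength : pri_strength;
                    const int shift = t ? sec_shift : pri_shift;
                    const int tap = t ? sec_tap : pri_tap;
                    const int n0 = tmp[x + off[t]], n1 = tmp[x - off[t]];
                    sum += tap * constrain(n0 - px, strength, shift);
                    sum += tap * constrain(n1 - px, strength, shift);
                    min = umin_i(n0, min); max = imax(n0, max);
                    min = umin_i(n1, min); max = imax(n1, max);
                }
                pri_tap = (pri_tap & 3) | 2; /* 4 becomes 2, 3 stays 3 */
            }
            /* Relies on arithmetic right shift of negative ints, as the formula above does. */
            dst[x] = (uint8_t)iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max);
        }
        dst += dst_stride;
        tmp += TMP_STRIDE;
    }
}
```

On a flat buffer the output equals the input. With one pixel raised by 2 and a purely horizontal offsets table, that pixel is pulled back and then clamped to the local minimum, which exercises the final iclip.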

The "constraint" function:

static inline int constrain(int diff, int threshold, int shift) {
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
                      diff);
}

This is the per-pixel-pair clamp that makes CDEF constrained (directional enhancement that can't exceed a threshold tied to local strength).
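A few worked values make the shape of the clamp concrete. The sketch below repeats constrain() with stand-in definitions of apply_sign, imin and imax (assumed to behave like the intops.h helpers; they are not copied from the vendored header) and lists results for threshold 4, shift 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in helpers, assumed equivalent to the intops.h versions. */
static inline int apply_sign(int v, int ref) { return ref < 0 ? -v : v; }
static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }

static inline int constrain(int diff, int threshold, int shift) {
    assert(shift >= 0);
    const int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}

/*
 * threshold = 4, shift = 1:
 *   diff =  1 ->  1   small differences pass through unchanged
 *   diff =  3 ->  3   min(3, 4 - 1)
 *   diff =  5 ->  2   min(5, 4 - 2): the allowance shrinks as |diff| grows
 *   diff = -5 -> -2   sign restored
 *   diff =  8 ->  0   4 - 4 = 0: strong edges contribute nothing
 */
```

The response is a ramp up followed by a ramp down to zero, so small deviations are smoothed while large ones (real edges) are left alone.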

Tables needed

  • dav1d_cdef_directions[12][2]: 12 rows (8 directions + 4 wrap-around rows), each holding the two tap offsets for k = 0 and k = 1 (matching the [dir+offset][k] lookup above). In dav1d/src/tables.c.
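  • dav1d_cdef_pri_taps[2][2]: primary tap weights, indexed by (pri_strength & 1) and tap position k. Small ints. Per the Phase 2 discovery recorded in the commit notes, dav1d 1.4.3 computes these inline (first tap 4 or 3) rather than reading a table.
  • dav1d_cdef_sec_taps[2]: secondary tap weights, just 2 entries (2 or 1); likewise computed inline in 1.4.3, so dav1d_cdef_directions is the only external table.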

NEON reference structure (dav1d)

dav1d_cdef_filter8_pri_sec_8bpc_neon signature:

x0: dst         pixel buffer
x1: dst_stride  ptrdiff_t
x2: tmp         uint8_t source (the pre-padded 12×12 buffer reinterpreted)
w3: pri_strength
w4: sec_strength
w5: dir
w6: damping
w7: h           height (8 for 8×8)

Notable: dav1d's NEON takes the already-padded tmp buffer pointer (after the C side did padding()). So our bench needs to construct the padded buffer per block.

Padded buffer layout (12×12, int16 elements):

  • Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
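  • Halo at rows {0,1,10,11} and cols {0,1,10,11}: either edge-replicate from adjacent block (if edges flag set) or INT16_MIN. The sentinel is so far from any 8-bit pixel that threshold - (abs(diff) >> shift) goes negative and constrain() returns 0, which amounts to "skip this neighbour".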
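For the bench, the layout above could be filled by a small helper along these lines. This is a sketch, not dav1d's padding(): build_padded_tmp, the 12×12 uint8_t input patch and the single halo_valid flag are hypothetical simplifications (all four edges are treated the same, and no edge replication is done).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLK = 8, HALO = 2, TMP_STRIDE = BLK + 2 * HALO }; /* 12 */

/*
 * patch: 12x12 synthetic pixels (block plus neighbours), row stride patch_stride.
 * halo_valid != 0: copy the neighbours into the halo ("all edges valid").
 * halo_valid == 0: mark the halo with INT16_MIN so the filter ignores it.
 */
static void build_padded_tmp(int16_t *tmp, const uint8_t *patch,
                             ptrdiff_t patch_stride, int halo_valid)
{
    assert(tmp && patch);
    for (int y = 0; y < TMP_STRIDE; y++) {
        for (int x = 0; x < TMP_STRIDE; x++) {
            const int inside = y >= HALO && y < HALO + BLK &&
                               x >= HALO && x < HALO + BLK;
            tmp[y * TMP_STRIDE + x] = (inside || halo_valid)
                ? (int16_t)patch[y * patch_stride + x]
                : INT16_MIN;
        }
    }
}
```

Generating one random 12×12 patch per block and calling this with halo_valid set gives the "all edges valid" case; clearing the flag gives the fully isolated case.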

Vendoring plan

New directory: external/dav1d-snapshot/ (BSD-2-Clause, separate PROVENANCE.md from FFmpeg pin).

Files to vendor from dav1d 1.4.3:

  1. src/arm/64/cdef.S — main NEON file (520 lines as vendored)
  2. src/arm/64/util.S — helper macros referenced by cdef.S
  3. src/arm/asm.S — top-level macros (function, endfunc, etc.)
  4. src/cdef_tmpl.c — C reference (331 lines as vendored)
  5. src/tables.c — the static tables (cdef_directions, pri/sec taps) or hand-extract just the CDEF tables (~50 lines)
  6. include/common/intops.h — apply_sign, imin, imax, iclip helpers
  7. A standalone PROVENANCE.md with pin + SHA-256s

dav1d's asm preamble may need its own config.h shim (different defines than FFmpeg's). Phase 3 setup (the first build attempt) will identify the exact needs.

Build path

dav1d's asm uses a GAS preamble similar to FFmpeg's, but the config defines differ: alongside ARCH_AARCH64, HAVE_AS_FUNC, etc., there are dav1d-specific ones such as PRIVATE_PREFIX (dav1d_) and EXTERN_ASM (empty for ELF, as in cycle 1).

What Phase 2 does not close

  • The exact list of dav1d asm.S macros needed (will surface during first build attempt)
  • C reference completeness — padding() setup logic is non-trivial (handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP, HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by always passing "all edges valid" with synthetic neighbouring pixels.
  • Direction validation — directions 0..7 should all be tested for bit-exactness; an off-by-one in the direction-offset table would be caught by M1.
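The direction sweep for M1 could be driven by a harness like the sketch below. Everything here is hypothetical scaffolding: the cdef8_fn signature, count_mismatches, the strength ranges (pri 1..15, sec in {1, 2, 4}), the fixed damping of 3, and the two stub functions, which only stand in for the real C reference and NEON entry points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical common signature for the two implementations under test. */
typedef void (*cdef8_fn)(uint8_t *dst, ptrdiff_t stride, const int16_t *tmp,
                         int pri, int sec, int dir, int damping);

/* xorshift32: small reproducible PRNG; the seed must be non-zero. */
static uint32_t rng_next(uint32_t *s) {
    *s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
    return *s;
}

/* Count blocks whose outputs differ, over all 8 directions and a strength grid. */
static long count_mismatches(cdef8_fn ref, cdef8_fn opt,
                             int blocks_per_case, uint32_t seed)
{
    static const int sec_vals[3] = { 1, 2, 4 }; /* assumed non-zero sec strengths */
    const int damping = 3;                      /* assumed fixed for the sketch */
    long bad = 0;
    assert(ref && opt && seed != 0);
    for (int dir = 0; dir < 8; dir++)
        for (int pri = 1; pri <= 15; pri++)
            for (int si = 0; si < 3; si++)
                for (int n = 0; n < blocks_per_case; n++) {
                    int16_t tmp[144]; /* 12x12, all edges valid */
                    uint8_t a[64], b[64];
                    for (int i = 0; i < 144; i++)
                        tmp[i] = (int16_t)(rng_next(&seed) & 0xff);
                    memset(a, 0, sizeof a);
                    memset(b, 0, sizeof b);
                    ref(a, 8, tmp, pri, sec_vals[si], dir, damping);
                    opt(b, 8, tmp, pri, sec_vals[si], dir, damping);
                    bad += memcmp(a, b, sizeof a) != 0;
                }
    return bad;
}

/* Stub "filter": copies the 8x8 centre of tmp. Stands in for the C reference. */
static void stub_copy(uint8_t *dst, ptrdiff_t stride, const int16_t *tmp,
                      int pri, int sec, int dir, int damping)
{
    (void)pri; (void)sec; (void)dir; (void)damping;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = (uint8_t)tmp[(y + 2) * 12 + x + 2];
}

/* Stub with a deliberate fault in direction 7 only, to show the sweep catches it. */
static void stub_bad_dir7(uint8_t *dst, ptrdiff_t stride, const int16_t *tmp,
                          int pri, int sec, int dir, int damping)
{
    stub_copy(dst, stride, tmp, pri, sec, dir, damping);
    if (dir == 7) dst[0] ^= 1;
}
```

Comparing stub_copy against itself reports zero mismatches, while stub_bad_dir7 is flagged on every direction-7 block. A fault confined to one direction would go unnoticed if that direction were left out of the sweep.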

Phase 3 next: vendor the dav1d files, write standalone C ref + bench, capture M3₅ NEON baseline.

This is the first multi-session cycle — Phase 3+ likely lands in next session. Cycle setup commit at end of this session.