marfrit 2cd2258a7b Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources
First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2
docs lay out the structural complexity (CDEF needs pre-padded 12x12
working buffer + external edge context + direction lookup +
constraint function — meaningfully more complex than cycles 1-4).

Phase 3+ deferred to next session — CDEF is the first cycle that
doesn't fit cleanly into a single autonomous run.

Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than
FFmpeg's LGPL-2.1+):

  src/arm/64/cdef.S            520 lines — NEON impl
  src/arm/64/util.S            278 lines — NEON helpers
  src/arm/asm.S                335 lines — GAS preamble
  src/cdef_tmpl.c              331 lines — C reference (templated)
  include/common/intops.h       84 lines — utility helpers
  src/tables_cdef_subset.c      hand-extracted — dav1d_cdef_directions
                                only (avoids dragging full 1013-line
                                tables.c + transitive includes)

Discovery from Phase 2 analysis:
- Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes
  (dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h).
  The 'tmp' arg is the pre-padded 12x12 buffer constructed externally
  by the dav1d C-side padding() function.
- Tap weights are inline-computed (not table): pri_tap = 4 or 3
  (based on pri_strength bit), sec_tap = 2 or 1. Only
  dav1d_cdef_directions[12][2] is an external table.
- Constraint function: constrain(diff, threshold, shift) =
  apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))),
             diff)

Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than
LPF (per-pixel min/max conditional logic), so likely worse R than
cycle 2/4 but better than cycle 3 MC. M4 gate likely required.

What Phase 3+ needs (next session):
1. config.h shim for dav1d's asm preamble (defines TBD on first build)
2. Standalone C reference for cdef_filter_block_8x8_c
   (cdef_tmpl.c references several dav1d private headers; cleaner to
   transcribe to a self-contained tests/cdef_ref.c)
3. tests/bench_neon_cdef.c — M1+M3 bench
4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure

PROVENANCE.md documents pin + per-file role + re-vendoring procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:12:25 +00:00

---
cycle: 5
phases: 1-2 (combined; phase 3+ pending)
status: setup in progress
date_opened: 2026-05-18
parent_cycle: k4_lpf8_phase4_7.md
target_kernel: AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only
(assume direction + strengths pre-computed)
new_vendor: dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin
---
# Cycle 5, Phases 1-2 — AV1 CDEF
First AV1 kernel; first cycle that vendors from outside the FFmpeg
snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause,
mature aarch64 NEON, used by VLC + Firefox via libdav1d).
## Phase 1 — goal
**Kernel**: AV1 Constrained Directional Enhancement Filter, 8×8 luma
output, 8 bits/component, FILTER stage (direction + strength
parameters assumed pre-computed). This matches the "pre-computed params"
convention of LPF (E/I/H) and MC (mx).
**NEON symbol target**: `dav1d_cdef_filter8_pri_sec_8bpc_neon` (combined
primary + secondary filter). There are also `_pri_` and `_sec_` only
variants for the cases where one strength is 0; for the bench we
cover the worst case (both active).
**C reference**: `cdef_filter_block_8x8_c` from `dav1d/src/cdef_tmpl.c`
(macro-expanded), delegating to `cdef_filter_block_c`. Spec source:
AV1 specification §7.15 (CDEF).
### Measurable success (cycle-5 numbering, `5` subscript)
| ID | Measurement | Gate |
|---|---|---|
| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
| M2₅ | QPU throughput Mblock/s | recorded |
| M3₅ | NEON `dav1d_cdef_filter8_pri_sec_8bpc_neon` Mblock/s | recorded |
| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |
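
A minimal sketch of how the M1₅ gate could be checked, assuming both implementations are wrapped behind one common signature. The `filter_fn` type, `count_mismatches`, the two stand-in filters, and the strength and damping ranges are hypothetical names and values chosen for illustration; the real bench would pass the C reference and a NEON wrapper instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical common signature: filter one 8x8 block in place, reading
 * neighbours from a pre-padded 12x12 buffer. Not dav1d's actual API. */
typedef void (*filter_fn)(uint8_t dst[64], const int16_t tmp[144],
                          int pri_strength, int sec_strength,
                          int dir, int damping);

/* Deterministic PRNG (xorshift32) so a failing block can be reproduced. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Stand-in filters used only to exercise the harness itself. */
static void identity_filter(uint8_t dst[64], const int16_t tmp[144],
                            int pri, int sec, int dir, int damping) {
    (void)dst; (void)tmp; (void)pri; (void)sec; (void)dir; (void)damping;
}
static void invert_filter(uint8_t dst[64], const int16_t tmp[144],
                          int pri, int sec, int dir, int damping) {
    (void)tmp; (void)pri; (void)sec; (void)dir; (void)damping;
    for (int i = 0; i < 64; i++) dst[i] ^= 0xff;
}

/* Runs n_blocks random blocks through both filters, cycling through all
 * 8 directions, and returns how many blocks differ in any byte. */
static size_t count_mismatches(filter_fn ref, filter_fn impl,
                               size_t n_blocks, uint32_t seed) {
    assert(seed != 0); /* xorshift32 never leaves the all-zero state */
    size_t bad = 0;
    for (size_t n = 0; n < n_blocks; n++) {
        int16_t tmp[144];
        uint8_t out_ref[64], out_impl[64];
        for (int i = 0; i < 144; i++)
            tmp[i] = (int16_t)(xorshift32(&seed) & 0xff);
        /* dst starts as the interior [2..9] x [2..9] of the padded buffer */
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++)
                out_ref[y * 8 + x] = out_impl[y * 8 + x] =
                    (uint8_t)tmp[(y + 2) * 12 + (x + 2)];
        int dir     = (int)(n % 8);
        int pri     = (int)(xorshift32(&seed) % 16);    /* range: assumption */
        int sec     = (int)(xorshift32(&seed) % 5);     /* range: assumption */
        int damping = 3 + (int)(xorshift32(&seed) % 4); /* range: assumption */
        ref(out_ref, tmp, pri, sec, dir, damping);
        impl(out_impl, tmp, pri, sec, dir, damping);
        if (memcmp(out_ref, out_impl, sizeof out_ref) != 0) bad++;
    }
    return bad;
}
```

The gate passes only when `count_mismatches` returns 0; counting whole blocks rather than bytes keeps the report aligned with the "N random 8×8 blocks" wording of M1₅.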
### Decision bands (carried)
Same R bands and 30fps-floor calibration as cycles 1-4.
### Predicted R₅
The CDEF filter is **compute-heavier than LPF**:
- Per pixel: 12 constraint applications (4 primary + 8 secondary taps,
each abs + min + max + sign-restore) plus the per-pixel accumulation
with min/max tracking
- Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many
conditionals
- Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block
(vs LPF's ~88 B and MC's ~184 B)
- No DP4A applicability (the multipliers are small constants, but
the constraint function dominates)
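
The memory figure above can be reproduced with a few lines of arithmetic. This is a sketch with hypothetical helper names (`padded_side`, `bytes_per_block`); it also shows what the total becomes if the padded source is read as 2-byte elements, which is an assumption about the working buffer rather than a measured number.

```c
#include <assert.h>

enum { BLOCK = 8, HALO = 2 };

/* Elements along one side of the padded buffer: block plus halo on both sides. */
static int padded_side(void) { return BLOCK + 2 * HALO; }

/* Bytes moved per block: padded-source reads plus 8x8 output writes.
 * read_elem_bytes is 1 for 8-bit samples, 2 for a 16-bit working buffer. */
static int bytes_per_block(int read_elem_bytes) {
    assert(read_elem_bytes == 1 || read_elem_bytes == 2);
    int reads  = padded_side() * padded_side(); /* 12 * 12 = 144 elements */
    int writes = BLOCK * BLOCK;                 /* 64 output pixels, 1 byte each */
    return reads * read_elem_bytes + writes;
}
```

With 1-byte reads this gives the 208 B/block quoted above; with 2-byte reads the same traffic is 352 B/block.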
**Predicted R₅ band**: 0.15-0.30 (ORANGE). The constraint function's
per-pixel min/max conditional logic is heavier than LPF's per-row
fm/flat tests, so the kernel is expected to be compute-bound on the QPU.
M4 may still rescue it, following the cycle 1 and 2 pattern.
### NEW for cycle 5
- **First AV1 kernel** → expands codec coverage beyond VP9
- **First dav1d-vendored source** → new external/ subdirectory:
`external/dav1d-snapshot/` (BSD-2-Clause; clean license vs LGPL
FFmpeg)
- **First kernel needing external padding context** — CDEF reads
beyond the 8×8 block (2-pixel halo on each side); dav1d's C
reference uses pre-padded `tmp_buf[12×12]` constructed by a
separate `padding()` function from left/top/bottom edge arrays.
Our bench will construct this padding inline for each random
block.
## Phase 2 — situation analysis
### C reference structure (dav1d)
`cdef_filter_block_8x8_c` signature:
```c
void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
const pixel (*left)[2],
const pixel *top, const pixel *bottom,
int pri_strength, int sec_strength,
int dir, int damping,
enum CdefEdgeFlags edges);
```
The function:
1. Allocates `int16_t tmp_buf[144]` (12×12 working buffer)
2. Calls `padding()` to fill it from left/top/bottom + dst, writing a sentinel where an edge is unavailable
3. Iterates 8 rows × 8 cols; per pixel:
- Looks up direction offsets: `dav1d_cdef_directions[dir+offset][k]`
- For each of 4 primary tap positions (k=0..1, both signs):
compute pri-constrained diff, multiply by tap weight, accumulate
- For each of 8 secondary tap positions (k=0..1, both signs,
two adjacent directions):
same with sec weights
- Track min/max across all sampled neighbours
- Output: `iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)`
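
The output step in the last bullet is worth a worked example, because the `- (sum < 0)` term is what makes the rounding symmetric around zero. The sketch below uses a hypothetical wrapper name `cdef_output` and a local `iclip`; it assumes `>>` on a negative `int` is an arithmetic shift, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <assert.h>

/* Clamp v to [lo, hi]. */
static int iclip(int v, int lo, int hi) {
    return v < lo ? lo : v > hi ? hi : v;
}

/* Final per-pixel step: round sum/16 to nearest with ties away from zero,
 * add to the centre pixel, then clamp to the min/max of the sampled
 * neighbourhood. */
static int cdef_output(int px, int sum, int min, int max) {
    assert(min <= max);
    return iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max);
}
```

For `px = 100` and a wide clamp range, `sum = 8` yields 101 and `sum = -8` yields 99, while `sum = 7` and `sum = -7` both leave the pixel at 100. Without the `(sum < 0)` correction, `sum = -8` would round to 0 instead of -1.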
The "constraint" function:
```c
static inline int constrain(int diff, int threshold, int shift) {
int adiff = abs(diff);
return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
diff);
}
```
This is the per-pixel-pair clamp that makes CDEF *constrained*
(directional enhancement that can't exceed a threshold tied to
local strength).
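
To make the shape of the clamp concrete, here is the same function with its helpers filled in so it compiles on its own. The helper bodies are assumptions about what `apply_sign`, `imin` and `imax` do, and the threshold and shift values in the note below are arbitrary illustration values, not strengths taken from a real stream.

```c
#include <assert.h>
#include <stdlib.h>

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Give v the sign of s (v is expected to be non-negative). */
static int apply_sign(int v, int s) { return s < 0 ? -v : v; }

/* Small differences pass through unchanged, medium ones are shrunk,
 * and large ones are zeroed: a tent-shaped response in |diff|. */
static int constrain(int diff, int threshold, int shift) {
    assert(shift >= 0);
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
                      diff);
}
```

With `threshold = 4` and `shift = 1`: a diff of 2 passes through as 2, a diff of 6 is shrunk to 1 (and -6 to -1), and a diff of 10 contributes 0. Large differences are treated as real edges and left alone, which is the behaviour the paragraph above describes.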
### Tables needed
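- `dav1d_cdef_directions[12][2]` — 12 rows (8 directions + 4 wrap-arounds),
each row holding the offsets for the two tap positions k = 0, 1 (the
`[dir+offset][k]` lookup above). In `dav1d/src/tables.c`.
- Primary tap weights — selected by `(pri_strength & 1)` and tap
position k. Small ints; per the Phase 2 discovery these are computed
inline (first tap 4 or 3), not read from a `dav1d_cdef_pri_taps[2][2]` table.
- Secondary tap weights — just 2 values (2 or 1), also computed inline
rather than read from a `dav1d_cdef_sec_taps[2]` table.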
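
The tap weights are small enough to compute inline rather than look up. A minimal 8bpc-only sketch follows; `pri_tap` and `sec_tap` are hypothetical helper names, and the second primary weight (2 or 3) is taken from the AV1 primary tap pairs {4, 2} and {3, 3} as I understand them, so treat those values as an assumption to verify against the vendored sources.

```c
#include <assert.h>

/* Primary tap weight for tap position k (0 or 1), 8bpc only.
 * k = 0: 4 when pri_strength is even, 3 when it is odd.
 * k = 1: 2 when pri_strength is even, 3 when it is odd. */
static int pri_tap(int pri_strength, int k) {
    assert(k == 0 || k == 1);
    int first = 4 - (pri_strength & 1);
    return k == 0 ? first : ((first & 3) | 2);
}

/* Secondary tap weight: 2 for the near tap, 1 for the far tap. */
static int sec_tap(int k) {
    assert(k == 0 || k == 1);
    return 2 - k;
}
```

Computing the weights this way leaves the direction offsets as the only table a standalone reference would need to carry.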
### NEON reference structure (dav1d)
`dav1d_cdef_filter8_pri_sec_8bpc_neon` signature:
```
x0: dst pixel buffer
x1: dst_stride ptrdiff_t
x2: tmp uint8_t source (the pre-padded 12×12 buffer reinterpreted)
w3: pri_strength
w4: sec_strength
w5: dir
w6: damping
w7: h height (8 for 8×8)
```
Notable: dav1d's NEON takes the already-padded `tmp` buffer pointer
(after the C side did `padding()`). So our bench needs to construct
the padded buffer per block.
Padded buffer layout (12×12, int16 elements):
- Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
- Halo at rows {0,1,10,11} and cols {0,1,10,11}: either copied from the
adjacent block (if the matching edge flag is set) or INT16_MIN, a
sentinel so far from any pixel value that the constraint function
returns 0 for it, which effectively skips that neighbour
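
The layout above could be built per block roughly as follows. This is a sketch, not dav1d's `padding()`: the `pad_block` name, the `EDGE_*` flags, and the choice of a 12×12 8-bit source patch (block plus neighbours already gathered in one array) are all assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

enum { PAD_W = 12, PAD_H = 12, HALO = 2, BLK = 8 };

/* Hypothetical edge flags for this sketch (not dav1d's identifiers). */
enum { EDGE_LEFT = 1, EDGE_TOP = 2, EDGE_RIGHT = 4, EDGE_BOTTOM = 8 };

/* Fill a 12x12 int16 buffer from a 12x12 8-bit patch whose interior
 * [2..9] x [2..9] is the block. Halo cells on a side whose flag is clear
 * get the INT16_MIN sentinel; everything else is copied. */
static void pad_block(int16_t tmp[PAD_W * PAD_H],
                      const uint8_t src[PAD_W * PAD_H], int edges) {
    assert((edges & ~15) == 0);
    for (int y = 0; y < PAD_H; y++) {
        for (int x = 0; x < PAD_W; x++) {
            int missing =
                (x <  HALO       && !(edges & EDGE_LEFT))  ||
                (x >= HALO + BLK && !(edges & EDGE_RIGHT)) ||
                (y <  HALO       && !(edges & EDGE_TOP))   ||
                (y >= HALO + BLK && !(edges & EDGE_BOTTOM));
            tmp[y * PAD_W + x] =
                missing ? INT16_MIN : (int16_t)src[y * PAD_W + x];
        }
    }
}
```

For the "all edges valid" bench simplification, passing `EDGE_LEFT | EDGE_TOP | EDGE_RIGHT | EDGE_BOTTOM` reduces this to a plain widening copy; the sentinel path only matters once partial-edge cases are added to the test matrix.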
### Vendoring plan
New directory: `external/dav1d-snapshot/` (BSD-2-Clause, separate
PROVENANCE.md from FFmpeg pin).
Files to vendor from dav1d 1.4.3:
1. `src/arm/64/cdef.S` — main NEON file (520 lines)
2. `src/arm/64/util.S` — helper macros referenced by cdef.S
3. `src/arm/asm.S` — top-level macros (function, endfunc, etc.)
4. `src/cdef_tmpl.c` — C reference (331 lines)
5. `src/tables.c` — the static tables (cdef_directions, pri/sec taps)
*or* hand-extract just the CDEF tables (~50 lines)
6. `include/common/intops.h` — apply_sign, imin, imax, iclip helpers
7. A standalone PROVENANCE.md with pin + SHA-256s
dav1d's asm preamble may need its own config.h shim (different
defines than FFmpeg's). Phase 3 setup will identify the exact needs.
### Build path
dav1d's asm uses a GAS preamble similar to FFmpeg's, but the config
defines differ: alongside `ARCH_AARCH64`, `HAVE_AS_FUNC`, etc., there are
dav1d-specific ones such as `PRIVATE_PREFIX dav1d_` and `EXTERN_ASM`
(empty for ELF, as in cycle 1).
### What Phase 2 does *not* close
- The exact list of dav1d asm.S macros needed (will surface during
first build attempt)
- C reference completeness — `padding()` setup logic is non-trivial
(handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP,
HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by
always passing "all edges valid" with synthetic neighbouring pixels.
- Direction validation — directions 0..7 should all be tested for
bit-exactness; an off-by-one in the direction-offset table would
be caught by M1.
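
One cheap structural check can run before any pixel comparison: every direction should index a valid row of the 12-row table. The sketch below assumes the row offsets are +2 for the primary direction and +4 and +0 for the two secondary directions, which is my reading of dav1d's C reference and should be confirmed against the vendored `cdef_tmpl.c`; the function names are hypothetical.

```c
#include <assert.h>

enum { N_DIRS = 8, TABLE_ROWS = 12 };

/* Assumed row offsets added to dir before indexing the table:
 * primary direction, then the two secondary directions. */
static const int row_offsets[3] = { 2, 4, 0 };

/* Returns 1 if every (dir, offset) pair lands on a valid table row. */
static int direction_rows_in_range(void) {
    for (int dir = 0; dir < N_DIRS; dir++)
        for (int i = 0; i < 3; i++) {
            int row = dir + row_offsets[i];
            if (row < 0 || row >= TABLE_ROWS) return 0;
        }
    return 1;
}

/* Counts how many distinct table rows are touched across all directions. */
static int rows_touched(void) {
    int seen[TABLE_ROWS] = {0};
    int n = 0;
    for (int dir = 0; dir < N_DIRS; dir++)
        for (int i = 0; i < 3; i++) {
            int row = dir + row_offsets[i];
            assert(row >= 0 && row < TABLE_ROWS);
            if (!seen[row]) { seen[row] = 1; n++; }
        }
    return n;
}
```

Under these offsets all 12 rows are reachable, which is consistent with the table being 8 directions plus 4 wrap-around rows. A hand-extracted table with a missing or shifted row would still need the M1 comparison to be detected, since this check only covers index ranges.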
Phase 3 next: vendor the dav1d files, write the standalone C ref +
bench, capture the M3₅ NEON baseline.
This is **the first multi-session cycle**: Phase 3+ will likely land
in the next session, with the cycle setup commit at the end of this one.