Cycle 5 setup (Phase 1+2): vendor dav1d 1.4.3 CDEF sources
First AV1 kernel cycle and first dav1d-vendored sources. Phase 1+2
docs lay out the structural complexity (CDEF needs pre-padded 12x12
working buffer + external edge context + direction lookup +
constraint function — meaningfully more complex than cycles 1-4).
Phase 3+ deferred to next session — CDEF is the first cycle that
doesn't fit cleanly into a single autonomous run.
Vendored from dav1d 1.4.3 (BSD-2-Clause, cleaner license than
FFmpeg's LGPL-2.1+):
  src/arm/64/cdef.S            520 lines       NEON impl
  src/arm/64/util.S            278 lines       NEON helpers
  src/arm/asm.S                335 lines       GAS preamble
  src/cdef_tmpl.c              331 lines       C reference (templated)
  include/common/intops.h      84 lines        utility helpers
  src/tables_cdef_subset.c     hand-extracted  dav1d_cdef_directions only
                                               (avoids dragging full 1013-line
                                               tables.c + transitive includes)
Discovery from Phase 2 analysis:
- Filter type and shape: dav1d_cdef_filter8_pri_sec_8bpc_neon takes
(dst, dst_stride, tmp, pri_strength, sec_strength, dir, damping, h).
The 'tmp' arg is the pre-padded 12x12 buffer constructed externally
by the dav1d C-side padding() function.
- Tap weights are inline-computed (not table): pri_tap = 4 or 3
(based on pri_strength bit), sec_tap = 2 or 1. Only
dav1d_cdef_directions[12][2] is an external table.
- Constraint function: constrain(diff, threshold, shift) =
apply_sign(min(abs(diff), max(0, threshold - (abs(diff) >> shift))),
diff)
Predicted R5 band: 0.15-0.30 (ORANGE). CDEF is compute-heavier than
LPF (per-pixel min/max conditional logic), so likely worse R than
cycle 2/4 but better than cycle 3 MC. M4 gate likely required.
What Phase 3+ needs (next session):
1. config.h shim for dav1d's asm preamble (defines TBD on first build)
2. Standalone C reference for cdef_filter_block_8x8_c
(cdef_tmpl.c references several dav1d private headers; cleaner to
transcribe to a self-contained tests/cdef_ref.c)
3. tests/bench_neon_cdef.c — M1+M3 bench
4. Phase 4 plan, Phase 5 review (mandatory), Phase 6 shader, Phase 7 measure
PROVENANCE.md documents pin + per-file role + re-vendoring procedure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,190 @@
---
cycle: 5
phases: 1-2 (combined; phase 3+ pending)
status: setup in progress
date_opened: 2026-05-18
parent_cycle: k4_lpf8_phase4_7.md
target_kernel: AV1 CDEF filter, 8×8 luma, 8bpc, FILTER stage only
               (assume direction + strengths pre-computed)
new_vendor: dav1d 1.4.3 (BSD-2-Clause), separate from FFmpeg pin
---

# Cycle 5, Phases 1-2: AV1 CDEF

First AV1 kernel; first cycle that vendors from outside the FFmpeg
snapshot. dav1d is the canonical AV1 reference (clean BSD-2-Clause,
mature aarch64 NEON, used by VLC + Firefox via libdav1d).

## Phase 1: goal

**Kernel**: AV1 Constrained Directional Enhancement Filter, 8×8 luma
output, 8 bits/component, FILTER stage (direction + strength
parameters assumed pre-computed). This matches the "pre-computed
params" convention of LPF (E/I/H) and MC (mx).

**NEON symbol target**: `dav1d_cdef_filter8_pri_sec_8bpc_neon` (combined
primary + secondary filter). There are also `_pri_`-only and `_sec_`-only
variants for the cases where one strength is 0; the bench covers
the worst case (both active).

**C reference**: `cdef_filter_block_8x8_c` from `dav1d/src/cdef_tmpl.c`
(macro-expanded), delegating to `cdef_filter_block_c`. Spec source:
AV1 specification §7.15 (CDEF).

### Measurable success (cycle-5 numbering, `5` subscript)

| ID | Measurement | Gate |
|---|---|---|
| M1₅ | bit-exact vs C ref, N random 8×8 blocks across all 8 directions × various strengths | 100.0000 % |
| M2₅ | QPU throughput Mblock/s | recorded |
| M3₅ | NEON `dav1d_cdef_filter8_pri_sec_8bpc_neon` Mblock/s | recorded |
| M4₅ | mixed NEON-3 + QPU vs pure NEON-4 (if YELLOW/ORANGE band) | conditional |
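
The M1₅ gate is all-or-nothing: a single differing byte in any tested block fails it. A minimal sketch of the comparison the bench could use, assuming both outputs are 8-bit buffers with their own strides; the function names are hypothetical and not part of dav1d or the existing bench code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if two 8x8 8bpc blocks are byte-identical.
 * Each block may live inside a larger buffer with its own stride. */
int block8x8_equal(const uint8_t *a, ptrdiff_t a_stride,
                   const uint8_t *b, ptrdiff_t b_stride)
{
    for (int y = 0; y < 8; y++) {
        if (memcmp(a + y * a_stride, b + y * b_stride, 8) != 0)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: the 100.0000 % gate holds only if every tested
 * block matched and at least one block was actually tested. */
int m1_gate_passes(size_t blocks_tested, size_t blocks_matching)
{
    return blocks_tested > 0 && blocks_matching == blocks_tested;
}
```

Counting matches as integers and comparing for equality avoids any rounding question around the 100.0000 % figure.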

### Decision bands (carried)

Same R bands and 30fps-floor calibration as cycles 1-4.

### Predicted R₅

The CDEF filter is **compute-heavier than LPF**:
- Per pixel: 12 constraint applications (4 primary + 8 secondary
  taps; each is abs + min + max + sign-restore) plus the per-pixel
  accumulation with min/max tracking
- Per 8×8 block: ~32 mults (small constants 1-4) + many adds + many
  conditionals
- Memory: 12×12 padded source = 144 reads + 64 writes = 208 B/block
  (vs LPF's ~88 B and MC's ~184 B)
- No DP4A applicability (the multipliers are small constants, but
  the constraint function dominates)

**Predicted R₅ band**: 0.15-0.30 (ORANGE). The constraint function's
per-pixel min/max conditional logic is heavier than LPF's per-row
fm/flat tests, so the kernel is expected to be compute-bound on QPU.
M4₅ may still rescue the result, following the cycle 1 and 2 pattern.

### NEW for cycle 5

- **First AV1 kernel** → expands codec coverage beyond VP9
- **First dav1d-vendored source** → new external/ subdirectory:
  `external/dav1d-snapshot/` (BSD-2-Clause; clean license vs LGPL
  FFmpeg)
- **First kernel needing external padding context**: CDEF reads
  beyond the 8×8 block (2-pixel halo on each side); dav1d's C
  reference uses pre-padded `tmp_buf[12×12]` constructed by a
  separate `padding()` function from left/top/bottom edge arrays.
  Our bench will construct this padding inline for each random
  block.

## Phase 2: situation analysis

### C reference structure (dav1d)

`cdef_filter_block_8x8_c` signature:
```c
void cdef_filter_block_8x8_c(pixel *dst, ptrdiff_t stride,
                             const pixel (*left)[2],
                             const pixel *top, const pixel *bottom,
                             int pri_strength, int sec_strength,
                             int dir, int damping,
                             enum CdefEdgeFlags edges);
```

The function:
1. Allocates `int16_t tmp_buf[144]` (12×12 working buffer)
2. Calls `padding()` to fill from left/top/bottom + dst with edge-replicate
3. Iterates 8 rows × 8 cols; per pixel:
   - Looks up direction offsets: `dav1d_cdef_directions[dir+offset][k]`
   - For each of 4 primary tap positions (k=0..1, both signs):
     compute pri-constrained diff, multiply by tap weight, accumulate
   - For each of 8 secondary tap positions (k=0..1, both signs,
     two adjacent directions):
     same with sec weights
   - Track min/max across all sampled neighbours
   - Output: `iclip(px + ((sum - (sum < 0) + 8) >> 4), min, max)`
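
The per-pixel steps above can be sketched as one function that takes already-gathered neighbours. This is a sketch under stated assumptions, not dav1d's code: the function name and argument layout are hypothetical, the tap weights and shifts are passed in by the caller, all neighbours are assumed valid (no skip sentinel), and the helpers are local stand-ins for the `intops.h` ones. It counts 4 primary and 8 secondary neighbours (2 tap positions × 2 signs, and for secondary × 2 adjacent directions).

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-ins for the intops.h helpers named in this document. */
static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }
static inline int iclip(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }
static inline int apply_sign(int v, int s) { return s < 0 ? -v : v; }

static inline int constrain(int diff, int threshold, int shift) {
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))), diff);
}

/* Hypothetical helper: filter one pixel.
 * pri[2*k], pri[2*k+1]  : the +/- primary neighbours at tap position k
 * sec[4*k] .. sec[4*k+3]: the four secondary neighbours at tap position k
 * pri_tap[k], sec_tap[k]: tap weights supplied by the caller */
int cdef_filter_pixel_sketch(int px, const int pri[4], const int sec[8],
                             const int pri_tap[2], const int sec_tap[2],
                             int pri_strength, int sec_strength,
                             int pri_shift, int sec_shift)
{
    assert(pri_shift >= 0 && sec_shift >= 0);
    int sum = 0, lo = px, hi = px;
    for (int k = 0; k < 2; k++) {
        for (int i = 0; i < 2; i++) {
            int p = pri[2 * k + i];
            sum += pri_tap[k] * constrain(p - px, pri_strength, pri_shift);
            lo = imin(lo, p);
            hi = imax(hi, p);
        }
        for (int i = 0; i < 4; i++) {
            int s = sec[4 * k + i];
            sum += sec_tap[k] * constrain(s - px, sec_strength, sec_shift);
            lo = imin(lo, s);
            hi = imax(hi, s);
        }
    }
    /* Round sum/16 symmetrically around zero, then clamp to the range of
     * the sampled neighbourhood. Assumes >> on a negative int is an
     * arithmetic shift, which holds on mainstream compilers. */
    return iclip(px + ((sum - (sum < 0) + 8) >> 4), lo, hi);
}
```

The `- (sum < 0)` term makes the rounding symmetric: a sum of +36 and a sum of -36 move the pixel by +2 and -2 respectively.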

The "constraint" function:
```c
static inline int constrain(int diff, int threshold, int shift) {
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
                      diff);
}
```

This is the per-pixel-pair clamp that makes CDEF *constrained*
(directional enhancement that can't exceed a threshold tied to
local strength).
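
A few worked values make the shape of the clamp concrete. The sketch below repeats the function so it compiles on its own; the `imin`, `imax`, and `apply_sign` definitions are local stand-ins for the `intops.h` helpers, written here as assumptions about their behaviour.

```c
#include <assert.h>
#include <stdlib.h>

/* Local stand-ins for the intops.h helpers (assumed behaviour). */
static inline int imin(int a, int b) { return a < b ? a : b; }
static inline int imax(int a, int b) { return a > b ? a : b; }
static inline int apply_sign(int v, int s) { return s < 0 ? -v : v; }

/* Small differences pass through unchanged, medium ones are reduced,
 * and large ones (likely real edges) contribute nothing. */
static inline int constrain(int diff, int threshold, int shift) {
    assert(shift >= 0);
    int adiff = abs(diff);
    return apply_sign(imin(adiff, imax(0, threshold - (adiff >> shift))),
                      diff);
}
```

With `threshold = 4` and `shift = 2`, a difference of 3 is kept as 3, a difference of 8 is reduced to 2, and a difference of 16 or more becomes 0; negative differences mirror these results.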

### Tables needed

- `dav1d_cdef_directions[12][2]`: 12 rows (8 directions + 4 wrap-arounds).
  The second index is the tap position k, matching the
  `dav1d_cdef_directions[dir+offset][k]` lookup above, and each entry
  is an offset into the padded buffer. In `dav1d/src/tables.c`.
- `dav1d_cdef_pri_taps[2][2]`: primary tap weights, indexed by
  `(pri_strength & 1)` and tap position k. Small ints.
- `dav1d_cdef_sec_taps[2]`: secondary tap weights, just 2 entries.
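
The tap weights are small enough to compute inline instead of reading a table, which is what the commit notes for this cycle record for dav1d 1.4.3 (primary 4 or 3 depending on a `pri_strength` bit, secondary 2 or 1). A minimal sketch, assuming the 8bpc case and the weight values {4, 2} / {3, 3} for primary taps and {2, 1} for secondary taps; the helper names are hypothetical and not dav1d's API.

```c
/* Hypothetical helper: primary tap weight for tap position k (0 or 1).
 * Even pri_strength gives weights {4, 2}; odd gives {3, 3}. */
int pri_tap_weight(int pri_strength, int k)
{
    int first = 4 - (pri_strength & 1);
    /* (4 & 3) | 2 == 2 and (3 & 3) | 2 == 3 */
    return k == 0 ? first : ((first & 3) | 2);
}

/* Hypothetical helper: secondary tap weight, 2 for k == 0 and 1 for k == 1. */
int sec_tap_weight(int k)
{
    return 2 - k;
}
```

If the standalone C reference uses inline weights like these, `dav1d_cdef_directions` is the only table it has to carry.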

### NEON reference structure (dav1d)

`dav1d_cdef_filter8_pri_sec_8bpc_neon` signature:
```
x0: dst           pixel buffer
x1: dst_stride    ptrdiff_t
x2: tmp           uint8_t source (the pre-padded 12×12 buffer reinterpreted)
w3: pri_strength
w4: sec_strength
w5: dir
w6: damping
w7: h             height (8 for 8×8)
```

Notable: dav1d's NEON takes the already-padded `tmp` buffer pointer
(after the C side did `padding()`). So our bench needs to construct
the padded buffer per block.

Padded buffer layout (12×12, int16 elements):
- Real pixel region at rows [2..9], cols [2..9] (the 8×8 dst)
- Halo at rows {0,1,10,11} and cols {0,1,10,11}: either edge-replicate
  from adjacent block (if edges flag set) or INT16_MIN (which the
  constraint function treats as "skip this neighbour")
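
The layout above can be sketched as a small fill routine for the bench. This is an assumption-laden sketch and not dav1d's `padding()`: the function name and the `SK_HAVE_*` flag values are hypothetical, and the caller is assumed to hold the block inside a larger image so that the 2-pixel margin is readable wherever the matching edge flag is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical edge flags for this sketch only. */
enum {
    SK_HAVE_LEFT   = 1,
    SK_HAVE_RIGHT  = 2,
    SK_HAVE_TOP    = 4,
    SK_HAVE_BOTTOM = 8
};

/* Fill a 12x12 int16 buffer: rows/cols 2..9 hold the 8x8 block, the
 * 2-wide halo holds neighbouring pixels where the edge is available
 * and INT16_MIN where it is not. blk points at the block's top-left
 * pixel; stride is the source row pitch in bytes. */
void build_padded_sketch(int16_t tmp[144], const uint8_t *blk,
                         ptrdiff_t stride, unsigned edges)
{
    assert(blk != NULL);
    for (int y = 0; y < 12; y++) {
        for (int x = 0; x < 12; x++) {
            int missing =
                (y < 2   && !(edges & SK_HAVE_TOP))    ||
                (y >= 10 && !(edges & SK_HAVE_BOTTOM)) ||
                (x < 2   && !(edges & SK_HAVE_LEFT))   ||
                (x >= 10 && !(edges & SK_HAVE_RIGHT));
            /* The source is only read when the cell is available. */
            tmp[y * 12 + x] = missing
                ? INT16_MIN
                : (int16_t)blk[(y - 2) * stride + (x - 2)];
        }
    }
}
```

For the "all edges valid" simplification, the bench would pass all four flags and point `blk` two rows and two columns into a 12×12 (or larger) random image.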

### Vendoring plan

New directory: `external/dav1d-snapshot/` (BSD-2-Clause, separate
PROVENANCE.md from FFmpeg pin).

Files to vendor from dav1d 1.4.3:
1. `src/arm/64/cdef.S`: main NEON file (~870 lines)
2. `src/arm/64/util.S`: helper macros referenced by cdef.S
3. `src/arm/asm.S`: top-level macros (function, endfunc, etc.)
4. `src/cdef_tmpl.c`: C reference (~250 lines)
5. `src/tables.c`: the static tables (cdef_directions, pri/sec taps)
   *or* hand-extract just the CDEF tables (~50 lines)
6. `include/common/intops.h`: apply_sign, imin, imax, iclip helpers
7. A standalone PROVENANCE.md with pin + SHA-256s

dav1d's asm preamble may need its own config.h shim (different
defines than FFmpeg's). Phase 6 setup will identify exact needs.

### Build path

dav1d's asm uses a GAS preamble similar to FFmpeg's. The config
defines are different: `ARCH_AARCH64`, `HAVE_AS_FUNC`, etc., plus
dav1d-specific ones such as `PRIVATE_PREFIX dav1d_` and `EXTERN_ASM`
(empty for ELF, the same as in cycle 1).
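
A starting point for the shim could look like the fragment below. It contains only the defines named above; the values chosen for `ARCH_AARCH64` and `HAVE_AS_FUNC` are assumptions, and the first build attempt is expected to show which further defines the preamble needs.

```c
/* config.h: hypothetical shim for building the vendored dav1d asm.
 * Only the defines named in this document are listed; the final set
 * is unknown until the first build attempt. */
#ifndef DAV1D_SNAPSHOT_CONFIG_H
#define DAV1D_SNAPSHOT_CONFIG_H

#define ARCH_AARCH64   1        /* assumed value: building for aarch64 */
#define HAVE_AS_FUNC   0        /* assumed value: adjust if the assembler needs it */
#define PRIVATE_PREFIX dav1d_   /* symbol prefix, as named above */
#define EXTERN_ASM              /* empty for ELF, as in cycle 1 */

#endif /* DAV1D_SNAPSHOT_CONFIG_H */
```

Keeping this file separate from the FFmpeg shim avoids the two vendored preambles seeing each other's defines.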

### What Phase 2 does *not* close

- The exact list of dav1d asm.S macros needed (will surface during
  first build attempt)
- C reference completeness: `padding()` setup logic is non-trivial
  (handles edges/CdefEdgeFlags = combinations of HAVE_LEFT, HAVE_TOP,
  HAVE_RIGHT, HAVE_BOTTOM). For the bench, we can simplify by
  always passing "all edges valid" with synthetic neighbouring pixels.
- Direction validation: directions 0..7 should all be tested for
  bit-exactness; an off-by-one in the direction-offset table would
  be caught by M1₅.

Phase 3 next: vendor the dav1d files, write standalone C ref +
bench, capture M3₅ NEON baseline.

This is **the first multi-session cycle**: Phase 3+ likely lands
in the next session. The cycle setup commit closes this session.