20b59cd6a5
CDEF is the most compute-intensive kernel measured so far:
254.9 ns/block (2x the cost of IDCT, 5x that of MC). The 30fps@1080p
floor still passes with about 4x margin, even on a single NEON core
in isolation.
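The roughly 4x margin can be reproduced with back-of-envelope arithmetic. The sketch below assumes one filter call per 8x8 luma block and ignores chroma; both are assumptions of this example, not something the bench states. The helper name cdef_realtime_margin is hypothetical.

```c
#include <assert.h>

/* Real-time margin for a per-block kernel: 1 second of wall time divided
 * by the kernel time needed per second of video.
 * Assumption: one call per 8x8 luma block, chroma ignored. */
static double cdef_realtime_margin(int width, int height, int fps,
                                   double ns_per_block)
{
    assert(width > 0 && height > 0 && fps > 0 && ns_per_block > 0.0);
    /* Round up so partial blocks at the frame edge are counted. */
    const long blocks_w = (width + 7) / 8;
    const long blocks_h = (height + 7) / 8;
    const double blocks_per_sec = (double)(blocks_w * blocks_h) * fps;
    const double busy_sec_per_sec = blocks_per_sec * ns_per_block * 1e-9;
    return 1.0 / busy_sec_per_sec;
}
```

With 1920x1080 at 30 fps this is 240 * 135 = 32400 blocks per frame, 972000 blocks per second, and about 0.248 s of kernel time per second of video at 254.9 ns/block, so the margin lands just above 4.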
M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact
gate failing due to tmp-layout mismatch between my standalone C
reference and dav1d's NEON expectation. The smoking gun: NEON output
appears shifted by (+2 rows, -2 cols) relative to the C ref output,
which suggests the NEON padding function lays out tmp with a different
convention than my manual tmp construction.
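One way to confirm a constant shift like this is a brute-force search over small (dy, dx) offsets. The helper below is a hypothetical diagnostic, not part of dav1d or the bench; it assumes both outputs are 8-bit buffers with the same stride and that the content is varied enough for only one shift to match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical diagnostic helper (not part of dav1d or the bench).
 * Searches shifts (dy, dx) in [-max_shift, max_shift] for the first one
 * where a[y][x] == b[y + dy][x + dx] on every overlapping pixel.
 * Returns 1 and writes the shift on success, 0 if none matches. */
static int find_shift(const uint8_t *a, const uint8_t *b,
                      int w, int h, ptrdiff_t stride, int max_shift,
                      int *out_dy, int *out_dx)
{
    assert(a && b && out_dy && out_dx);
    assert(max_shift >= 0 && max_shift < w && max_shift < h);
    for (int dy = -max_shift; dy <= max_shift; dy++) {
        for (int dx = -max_shift; dx <= max_shift; dx++) {
            int ok = 1;
            for (int y = 0; y < h && ok; y++) {
                const int yb = y + dy;
                if (yb < 0 || yb >= h) continue; /* row falls outside b */
                for (int x = 0; x < w; x++) {
                    const int xb = x + dx;
                    if (xb < 0 || xb >= w) continue; /* column outside b */
                    if (a[y * stride + x] != b[yb * stride + xb]) {
                        ok = 0;
                        break;
                    }
                }
            }
            if (ok) { *out_dy = dy; *out_dx = dx; return 1; }
        }
    }
    return 0;
}
```

Passing the C ref output as a and the NEON output as b should report dy = 2, dx = -2 if the observation above holds. Flat test content can match several shifts, so a random or gradient input is the safer choice.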
Untangled in setup work:
- dav1d has TWO directions tables: stride-12 in src/tables.c
(C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side).
Initially vendored the C-side; should have used the NEON-side.
- dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon
(a separate function with its own conventions), not the C-side
padding() function from cdef_tmpl.c.
- Updated cdef_ref.c to use NEON-layout (stride 16) with table
transcribed from cdef_tmpl.S. Algorithm matches — but bench's
manual tmp construction doesn't match what NEON expects.
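Why the table stride matters can be shown with a small sketch. It assumes each directions-table entry is a linear offset of the form dy * stride + dx with |dx| smaller than half the stride; that form and the helper names tap_offset and tap_decode are assumptions of this example.

```c
#include <assert.h>

/* Linear offset of a tap at (dy, dx) in a tmp buffer whose rows are
 * `stride` elements apart. */
static int tap_offset(int dy, int dx, int stride)
{
    assert(stride > 0);
    return dy * stride + dx;
}

/* Decode a linear offset back to (dy, dx) for a given stride.
 * Assumes |dx| < stride / 2, so rounding to the nearest row is unique. */
static void tap_decode(int off, int stride, int *dy, int *dx)
{
    assert(stride > 0 && dy && dx);
    const int half = stride / 2;
    const int row = (off >= 0 ? off + half : off - half) / stride;
    *dy = row;
    *dx = off - row * stride;
}
```

A tap at (-1, +1) encodes to -11 at stride 12 and to -15 at stride 16. Reading the stride-12 value -11 against a stride-16 buffer decodes to (-1, +5), a different pixel, which is why a table vendored from the C side cannot be reused with the NEON-layout tmp.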
Resolution paths for next session (documented in
docs/k5_cdef_phase3_partial.md §'Resolution paths'):
1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest)
2. Vendor dav1d's full C reference (most rigorous)
3. Reverse-engineer dav1d's padding output layout (hackiest)
Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED).
CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound
kernels don't benefit from QPU offload). 30fps floor still
passes regardless.
New artifacts:
- external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored)
- external/dav1d-snapshot/config.h — 14-define asm preamble shim
- tests/cdef_ref.c — standalone C ref (algorithmically correct,
layout mismatch with NEON known)
- tests/bench_neon_cdef.c — bench (M1 gate downgraded to a warning, M3 captured)
- docs/k5_cdef_phase3_partial.md — phase 3 partial closure +
resumption checklist
dav1d snapshot in PROVENANCE.md should be updated next session
with the new cdef_tmpl.S entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
36 lines
1.3 KiB
C
/*
 * Minimal config.h shim for assembling dav1d's vendored .S files
 * outside the dav1d build tree. Targets aarch64-Linux, A76 (no SVE).
 *
 * Defines collected by grep over src/arm/asm.S + src/arm/64/*.S.
 * See ../../docs/k5_cdef_phase1_2.md.
 */
#pragma once

#define ARCH_AARCH64 1
#define ARCH_ARM 0
#define CONFIG_THUMB 0

#define HAVE_AS_FUNC 1
#define HAVE_AS_ARCH_DIRECTIVE 1
#define AS_ARCH_LEVEL armv8-a
#define HAVE_AS_ARCHEXT_DOTPROD_DIRECTIVE 1
#define HAVE_AS_ARCHEXT_I8MM_DIRECTIVE 1
#define HAVE_AS_ARCHEXT_SVE_DIRECTIVE 0
#define HAVE_AS_ARCHEXT_SVE2_DIRECTIVE 0

/* PRIVATE_PREFIX is the symbol-name prefix dav1d uses. By convention
 * dav1d_ in the exported symbols (e.g. dav1d_cdef_filter8_8bpc_neon). */
#define PRIVATE_PREFIX dav1d_

/* CdefEdgeFlags bit values — from dav1d include/dav1d/cdef.h (enum):
 *   CDEF_HAVE_LEFT   = 1
 *   CDEF_HAVE_RIGHT  = 2
 *   CDEF_HAVE_TOP    = 4
 *   CDEF_HAVE_BOTTOM = 8
 * The asm references these as bit-test immediate values. */
#define CDEF_HAVE_LEFT 1
#define CDEF_HAVE_RIGHT 2
#define CDEF_HAVE_TOP 4
#define CDEF_HAVE_BOTTOM 8
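For a bench or test harness, the edge flags above can be combined per block position. The sketch below is a simplified illustration: the helper cdef_edges_for_block is hypothetical, and it treats only the frame border as an edge, whereas a real decoder may also clear flags at other boundaries.

```c
#include <assert.h>

/* Same bit values as the shim above. */
#define CDEF_HAVE_LEFT 1
#define CDEF_HAVE_RIGHT 2
#define CDEF_HAVE_TOP 4
#define CDEF_HAVE_BOTTOM 8

/* Hypothetical helper: edge flags for the 8x8 block at (bx, by) in a
 * frame of blocks_w x blocks_h blocks. A flag is set when a neighbour
 * exists on that side. Only frame borders are considered (assumption). */
static int cdef_edges_for_block(int bx, int by, int blocks_w, int blocks_h)
{
    assert(bx >= 0 && bx < blocks_w && by >= 0 && by < blocks_h);
    int edges = 0;
    if (bx > 0)            edges |= CDEF_HAVE_LEFT;
    if (bx < blocks_w - 1) edges |= CDEF_HAVE_RIGHT;
    if (by > 0)            edges |= CDEF_HAVE_TOP;
    if (by < blocks_h - 1) edges |= CDEF_HAVE_BOTTOM;
    return edges;
}
```

An interior block gets all four bits (15), while the top-left block of a 1080p frame gets only RIGHT and BOTTOM (10). Sweeping the flags this way in the bench exercises every padding path of the filter.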