Files

T

marfrit 20b59cd6a5 Cycle 5 phase 3 partial: M3 NEON = 3.923 Mblock/s; M1 deferred

CDEF is the most compute-intensive kernel measured so far —
254.9 ns/block (2x IDCT, 5x MC). 30fps@1080p floor margin: 4x
even on single NEON core in isolation.

M3 captured cleanly via dav1d_cdef_filter8_8bpc_neon. M1 bit-exact
gate failing due to tmp-layout mismatch between my standalone C
reference and dav1d's NEON expectation. The smoking gun: NEON output
appears at (+2 rows, -2 cols) shifted positions vs C ref output —
suggests NEON's padding-function output has a different convention
than my manual tmp construction.

Untangled in setup work:
- dav1d has TWO directions tables: stride-12 in src/tables.c
  (C-side), stride-16 in src/arm/64/cdef_tmpl.S (NEON-side).
  Initially vendored the C-side; should have used the NEON-side.
- dav1d's NEON expects tmp built by dav1d_cdef_padding8_8bpc_neon
  (a separate function with its own conventions), not the C-side
  padding() function from cdef_tmpl.c.
- Updated cdef_ref.c to use NEON-layout (stride 16) with table
  transcribed from cdef_tmpl.S. Algorithm matches — but bench's
  manual tmp construction doesn't match what NEON expects.

Resolution paths for next session (documented in
docs/k5_cdef_phase3_partial.md §'Resolution paths'):
1. Use dav1d_cdef_padding8_8bpc_neon to construct tmp (simplest)
2. Vendor dav1d's full C reference (most rigorous)
3. Reverse-engineer dav1d's padding output layout (hackiest)

Predicted R5 if/when QPU shader implemented: 0.02-0.05 (RED).
CDEF likely stays on CPU per cycle 3 lesson 7 (compute-bound
kernels don't benefit from QPU offload). 30fps floor still
passes regardless.

New artifacts:
- external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored)
- external/dav1d-snapshot/config.h — 14-define asm preamble shim
- tests/cdef_ref.c — standalone C ref (algorithmically correct,
                     layout mismatch with NEON known)
- tests/bench_neon_cdef.c — bench (M1 made warning, M3 captured)
- docs/k5_cdef_phase3_partial.md — phase 3 partial closure +
                                    resumption checklist

dav1d snapshot in PROVENANCE.md should be updated next session
with the new cdef_tmpl.S entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 13:21:24 +00:00

5.9 KiB

Raw Blame History

cycle, phase, status, date_opened, date_partial_close, parent

cycle	phase	status	date_opened	date_partial_close	parent
5	3 (partial — M3 captured, M1 deferred)	in_progress (M1 known-issue, Phase 4+ deferred)	2026-05-18	2026-05-18	k5_cdef_phase1_2.md

Cycle 5, Phase 3 (partial) — CDEF NEON baseline

Cycle 5 Phase 3 captured M3₅ throughput but M1 bit-exact gate deferred to next session due to a tmp-layout mismatch between the standalone C reference and dav1d's NEON expectation.

M3₅ NEON throughput (captured)

=== M3₅ NEON throughput ===
  blocks/batch:    65536
  batches done:    279
  total blocks:    18 284 544
  elapsed (kernel)=4.661 s
  throughput      = 3.923 Mblock/s
  per-block       = 254.9 ns
  equiv 1080p     = 121.1 FPS  (32 400 blocks/frame)

Per-block 254 ns — CDEF is the most compute-intensive kernel measured so far:

	per-block ns	relative
IDCT 8×8 (k1)	122	1.0×
LPF wd=4 (k2)	20.7	0.17×
MC 8h (k3)	47.6	0.39×
LPF wd=8 (k4)	19.1	0.16×
CDEF (k5)	254.9	2.09×

30fps@1080p floor margin: 4× isolation (32 400 × 30 fps ÷ 1e6 = 0.972 Mblock/s required; 3.923 / 0.972 = 4.04). NEON CDEF on a single CPU core comfortably exceeds the user-facing test alone.

M1 known-issue (deferred to next session)

The bit-exact gate against my standalone C reference fails. The output structure (NEON vs C ref) shows the NEON producing algorithmically-correct-looking pixel values, but at a SHIFTED (row, col) offset within dst. Trace evidence:

neon row 5, cols 2-7 = 90 213 247 143 95 76
C ref row 3, cols 0-5 = 90 213 247 143 95 76

— same 6-byte sequence at an offset of (+2 rows, -2 cols) = (+2×8 + (-2)) = +14 byte stride mismatch. The smoking gun is that dav1d's NEON expects tmp built by a specific dav1d_cdef_padding8_8bpc_neon routine (different from the C-side padding() function), and my manual tmp construction doesn't match that convention.

Resolution paths (next session):

Call dav1d's NEON padding function to construct tmp from dst+left+top+bottom random inputs. Then the filter reads it with the right layout. Adds another extern symbol to bind.
Vendor dav1d_cdef_filter_block_8x8_c from dav1d's C-side (with templated headers shimmed). Compare NEON output against dav1d's own C, not my standalone transcription. Eliminates the layout-shim ambiguity entirely.
Inspect dav1d_cdef_padding8_8bpc_neon output for one block, reverse-engineer the layout, update standalone C ref to match.

Path 1 is probably simplest. The padding function signature (inferred from cdef.S padding_func macro):

void cdef_padding8_8bpc_neon(uint16_t *tmp, const uint8_t *src,
                             ptrdiff_t src_stride,
                             const uint8_t (*left)[2],
                             const uint8_t *top, const uint8_t *bottom,
                             int h, size_t edges);

Phase 3 closure requires M1 bit-exact verified.

Phase 4-7 deferred

Without M1 verified, can't safely build the QPU shader (would have no correctness gate against the NEON path either, and we'd be chasing two layout issues simultaneously).

Predicted R₅ (extrapolating from cycle 3 MC):

CDEF is ~5× heavier per-block than MC on NEON (254 vs 47 ns)
NEON ~5× advantage → QPU likely ~25× behind
R₅ isolation estimate: 0.02-0.05 (deep RED)
M4₅ mixed: very likely negative (deeper than cycle 3 MC's -19.5%)
30fps floor: still PASS on isolation+mixed since NEON 4-core baseline likely 12+ Mblock/s, comfortably above 0.972

Deployment recommendation (provisional, pending Phase 4-7): CDEF stays on CPU. Same verdict as MC. All compute-bound kernels stay on CPU; all bandwidth-bound (IDCT/LPF) kernels offload to QPU. This is starting to look like a clean classification rule across all cycles.

Phase 9 lessons (provisional)

Vendoring from a SECOND upstream (dav1d after FFmpeg) added non-trivial layout-convention friction. Different projects make different optimisation tradeoffs (dav1d NEON uses stride-16 tmp for vector-load alignment; dav1d C uses stride-12 because it doesn't matter for scalar code). Standalone C ref had to be re-fit to match NEON layout, not just transcribe C.
Two different dav1d_cdef_directions tables in dav1d: stride-12 in src/tables.c (used by C path), stride-16 in src/arm/64/cdef_tmpl.S (used by NEON path). I initially vendored the C-side table; should have used the NEON-side embedded version for matching against NEON.
Bit-exact gate fundamentally requires the standalone C ref to match the actual NEON call convention exactly. When the layout convention differs (as here), no amount of correct algorithm transcription saves you. The cleanest fix is to either run dav1d's own C ref (vendor more headers) or use dav1d's NEON padding to construct tmp.

What lands in this commit

external/dav1d-snapshot/src/arm/64/cdef_tmpl.S (additional vendored file, needed for cdef.S to include)
tests/cdef_ref.c — standalone C ref (algorithmically correct, layout known-mismatched)
tests/bench_neon_cdef.c — bench harness with M1 made warning (proceeds to M3 even on layout mismatch)
external/dav1d-snapshot/config.h — asm preamble shim (works — dav1d's cdef.S assembles + links + executes)
CMakeLists.txt — dav1d asm + table source build wiring
M3₅ baseline: 3.923 Mblock/s captured on hertz

Resumption checklist (next session)

Pick M1 resolution path (1, 2, or 3 from §"Resolution paths")
If path 1: vendor + bind dav1d_cdef_padding8_8bpc_neon, update bench to call padding-then-filter, recapture M1 gate
Phase 4 plan QPU CDEF kernel (likely brief; predicted RED)
Phase 5 review (mandatory; first AV1 QPU work)
Phase 6 implement
Phase 7 measure M2 + M4 if reaches threshold
Confirm deployment recipe: CDEF stays on CPU (likely)

5.9 KiB Raw Blame History Unescape Escape