Files
marfrit 9c0bd72e70 Cycle 5 Phase 3 closed: M1 PASS via bench pointer-convention fix
The previous "layout mismatch" deferral was a one-line bench bug:
NEON expects the caller to pass tmp pointing at the 8x8 block
origin (after the 2*16+2 padding skip), but the bench passed the
raw padded-buffer origin. C ref does the advance internally, so it
filtered the correct block; NEON filtered a (+2 rows, +2 cols)
shifted region. Diagonal-shift trace in the partial doc was
exactly that.

Fix: tmps + i*TMP_INTS + (2*TMP_W + 2) for NEON calls.

Results:
  M1: 10000/10000 bit-exact (100.0000%), all 8 dirs balanced
  M3: 3.809 Mblock/s (consistent with 3.923 from longer window)

Phase 4 unblocked; predicted R5 = 0.02-0.05 (deep RED) per earlier
analysis. Will build QPU CDEF anyway for cycle-completeness +
V4L2 dispatch-path existence.

- tests/bench_neon_cdef.c: 3-line tmp pointer fix
- docs/k5_cdef_phase3.md: supersedes k5_cdef_phase3_partial.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:46:50 +00:00

4.3 KiB
Raw Permalink Blame History

cycle, phase, status, date_opened, date_closed, parent, host
cycle phase status date_opened date_closed parent host
5 3 closed 2026-05-18 — M1 PASS, M3 captured 2026-05-18 2026-05-18 k5_cdef_phase1_2.md hertz

Cycle 5, Phase 3 — CDEF NEON baseline (closed)

Supersedes k5_cdef_phase3_partial.md. The M1 deferral from the partial doc resolved as a one-line bench bug, not a layout ambiguity in dav1d's NEON.

Root cause of the previous "layout mismatch"

tests/cdef_ref.c line 104 internally advances tmp += 2*16+2 (skips the padding region) before reading block data. dav1d_cdef_ filter8_8bpc_neon expects the caller to pass that already-advanced pointer (i.e., pointer to the 8×8 block origin, not the padded buffer origin). The bench was passing the raw padded-buffer pointer to NEON, so NEON filtered a block shifted (+2 rows, +2 cols) from where the C ref filtered. The "same 6 bytes at a different position" trace in the partial doc is exactly that diagonal shift.

Fix: tmps + i*TMP_INTS + (2 * TMP_W + 2) for the NEON call. Three-line patch in tests/bench_neon_cdef.c.

M1₅ bit-exact gate

=== M1₅_c bit-exact (10000 random 8x8 blocks) ===
M1₅_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
  dir coverage: min=1194 max=1332 (8 directions sampled)

All 8 directions exercised, distribution flat. M1 gate PASS.

M3₅ NEON throughput

=== M3₅ NEON throughput ===
  blocks/batch:    4096
  batches done:    1801
  total blocks:    7 376 896
  elapsed (kernel)=1.937 s
  throughput      = 3.809 Mblock/s
  per-block       = 262.5 ns
  equiv 1080p     = 117.6 FPS  (32 400 blocks/frame)

Consistent with the previously captured 3.923 Mblock/s (longer window). Per-block ~260 ns. CDEF remains the most compute- intensive kernel cycle so far (2.1× IDCT, 13× LPF wd=4, 5.5× MC).

per-block ns relative
IDCT 8×8 (k1) 122 1.0×
LPF wd=4 (k2) 20.7 0.17×
MC 8h (k3) 47.6 0.39×
LPF wd=8 (k4) 19.1 0.16×
CDEF (k5) 262.5 2.15×

30fps@1080p floor margin: 3.9× isolation NEON single-core. NEON-4 baseline would be ~12-15 Mblock/s → 12-15× margin.

Methodology lessons

  1. Inverted-bench bugs look like layout mismatches. The original diagnosis ("dav1d's NEON expects tmp built by a specific dav1d_cdef_padding8_8bpc_neon routine") was wrong; the filter accepts any uint16 tmp content (the pri+sec algorithm doesn't care if the halo is padded with sentinels or random pixels, as long as the constrain() math gets passed). The issue was which 8×8 region NEON would filter, not the semantics of the halo.

  2. Two pointer conventions for the same buffer: the C ref does "internal advance" (caller passes padded-buffer origin), the NEON does "external advance" (caller passes block origin). Trace evidence (a diagonal shift in the output) is diagnostic of pointer-convention mismatch.

  3. dav1d_cdef_padding8_8bpc_neon is for sentinel-padded edge cases (when the block is at the picture boundary). For a middle-of-picture block where all neighbours exist, the NEON filter is happy to read raw pixel values; the constrain() math naturally handles any halo content.

What lands in this commit

  • tests/bench_neon_cdef.c: 3-line fix (tmp+34 for NEON calls)
  • docs/k5_cdef_phase3.md (this doc) supersedes k5_cdef_phase3_partial.md

Phase 4 unblocked

Predicted R₅ (from k5_cdef_phase3_partial.md):

  • CDEF is ~5× heavier per-block than MC on NEON (262 vs 48 ns)
  • NEON ~5× per-core advantage on MC → QPU likely ~25× behind on CDEF
  • R₅ isolation estimate: 0.02-0.05 (deep RED)

Issue 003 V1/V2 NEON-fallback proxy showed that a 4th NEON core running CDEF adds 1.7 Mblock/s of CDEF helper without crushing the other 3 cores. Real QPU CDEF is predicted at ~0.2 Mblock/s (an order of magnitude below the NEON-fallback proxy).

Phase 4 plan rationale: even predicted RED, build the QPU CDEF kernel because:

  • Confirms or refutes the R₅ 0.02-0.05 prediction with real data
  • Completes the cycle 5 record (Phases 1-7 all closed)
  • Provides the QPU CDEF dispatch path needed for the V4L2 wrapper to exist (Phase 8), even if scheduler doesn't enqueue it by default

Expected Phase 4 effort: 2-3 hours given the kernel shape is similar to cycle 2/4 LPF (per-block stencil with table lookups for directions; primary + secondary tap accumulation).