Files
daedalus-fourier/docs/k5_cdef_phase3.md
T
marfrit 9c0bd72e70 Cycle 5 Phase 3 closed: M1 PASS via bench pointer-convention fix
The previous "layout mismatch" deferral was a one-line bench bug:
NEON expects the caller to pass tmp pointing at the 8x8 block
origin (after the 2*16+2 padding skip), but the bench passed the
raw padded-buffer origin. C ref does the advance internally, so it
filtered the correct block; NEON filtered a (+2 rows, +2 cols)
shifted region. Diagonal-shift trace in the partial doc was
exactly that.

Fix: tmps + i*TMP_INTS + (2*TMP_W + 2) for NEON calls.

Results:
  M1: 10000/10000 bit-exact (100.0000%), all 8 dirs balanced
  M3: 3.809 Mblock/s (consistent with 3.923 from longer window)

Phase 4 unblocked; predicted R5 = 0.02-0.05 (deep RED) per earlier
analysis. Will build QPU CDEF anyway for cycle-completeness +
V4L2 dispatch-path existence.

- tests/bench_neon_cdef.c: 3-line tmp pointer fix
- docs/k5_cdef_phase3.md: supersedes k5_cdef_phase3_partial.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:46:50 +00:00

122 lines
4.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
cycle: 5
phase: 3
status: closed 2026-05-18 — M1 PASS, M3 captured
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k5_cdef_phase1_2.md
host: hertz
---
# Cycle 5, Phase 3 — CDEF NEON baseline (closed)
Supersedes `k5_cdef_phase3_partial.md`. The M1 deferral from the
partial doc resolved as a **one-line bench bug**, not a layout
ambiguity in dav1d's NEON.
## Root cause of the previous "layout mismatch"
`tests/cdef_ref.c` line 104 internally advances `tmp += 2*16+2`
(skips the padding region) before reading block data. `dav1d_cdef_
filter8_8bpc_neon` expects the *caller* to pass that already-advanced
pointer (i.e., pointer to the 8×8 block origin, not the padded
buffer origin). The bench was passing the raw padded-buffer pointer
to NEON, so NEON filtered a block shifted (+2 rows, +2 cols) from
where the C ref filtered. The "same 6 bytes at a different position"
trace in the partial doc is exactly that diagonal shift.
Fix: `tmps + i*TMP_INTS + (2 * TMP_W + 2)` for the NEON call.
Three-line patch in `tests/bench_neon_cdef.c`.
## M1₅ bit-exact gate
```
=== M1₅_c bit-exact (10000 random 8x8 blocks) ===
M1₅_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
dir coverage: min=1194 max=1332 (8 directions sampled)
```
All 8 directions exercised, distribution flat. **M1 gate PASS.**
## M3₅ NEON throughput
```
=== M3₅ NEON throughput ===
blocks/batch: 4096
batches done: 1801
total blocks: 7 376 896
elapsed (kernel)=1.937 s
throughput = 3.809 Mblock/s
per-block = 262.5 ns
equiv 1080p = 117.6 FPS (32 400 blocks/frame)
```
Consistent with the previously captured 3.923 Mblock/s (longer
window). Per-block ~260 ns. **CDEF remains the most compute-
intensive kernel cycle so far** (2.1× IDCT, 13× LPF wd=4,
5.5× MC).
| | per-block ns | relative |
|---|---|---|
| IDCT 8×8 (k1) | 122 | 1.0× |
| LPF wd=4 (k2) | 20.7 | 0.17× |
| MC 8h (k3) | 47.6 | 0.39× |
| LPF wd=8 (k4) | 19.1 | 0.16× |
| **CDEF (k5)** | **262.5** | **2.15×** |
30fps@1080p floor margin: **3.9×** isolation NEON single-core.
NEON-4 baseline would be ~12-15 Mblock/s → 12-15× margin.
## Methodology lessons
1. **Inverted-bench bugs look like layout mismatches.** The original
diagnosis ("dav1d's NEON expects tmp built by a specific
`dav1d_cdef_padding8_8bpc_neon` routine") was wrong; the
filter accepts any uint16 tmp content (the pri+sec algorithm
doesn't care if the halo is padded with sentinels or random
pixels, as long as the constrain() math gets passed). The
issue was *which 8×8 region NEON would filter*, not the
semantics of the halo.
2. **Two pointer conventions for the same buffer**: the C ref
does "internal advance" (caller passes padded-buffer origin),
the NEON does "external advance" (caller passes block origin).
Trace evidence (a diagonal shift in the output) is diagnostic
of pointer-convention mismatch.
3. **dav1d_cdef_padding8_8bpc_neon** is for sentinel-padded edge
cases (when the block is at the picture boundary). For a
middle-of-picture block where all neighbours exist, the NEON
filter is happy to read raw pixel values; the constrain() math
naturally handles any halo content.
## What lands in this commit
- `tests/bench_neon_cdef.c`: 3-line fix (tmp+34 for NEON calls)
- `docs/k5_cdef_phase3.md` (this doc) supersedes
`k5_cdef_phase3_partial.md`
## Phase 4 unblocked
Predicted R₅ (from `k5_cdef_phase3_partial.md`):
- CDEF is ~5× heavier per-block than MC on NEON (262 vs 48 ns)
- NEON ~5× per-core advantage on MC → QPU likely ~25× behind on CDEF
- R₅ isolation estimate: **0.02-0.05 (deep RED)**
Issue 003 V1/V2 NEON-fallback proxy showed that a 4th NEON core
running CDEF adds 1.7 Mblock/s of CDEF helper without crushing
the other 3 cores. Real QPU CDEF is predicted at ~0.2 Mblock/s
(an order of magnitude below the NEON-fallback proxy).
**Phase 4 plan rationale**: even predicted RED, build the QPU
CDEF kernel because:
- Confirms or refutes the R₅ 0.02-0.05 prediction with real data
- Completes the cycle 5 record (Phases 1-7 all closed)
- Provides the QPU CDEF dispatch path needed for the V4L2 wrapper
to *exist* (Phase 8), even if scheduler doesn't enqueue it by
default
Expected Phase 4 effort: 2-3 hours given the kernel shape is
similar to cycle 2/4 LPF (per-block stencil with table lookups
for directions; primary + secondary tap accumulation).