Cycle 5 Phase 3 closed: M1 PASS via bench pointer-convention fix
The previous "layout mismatch" deferral was a one-line bench bug: NEON expects the caller to pass tmp pointing at the 8x8 block origin (after the 2*16+2 padding skip), but the bench passed the raw padded-buffer origin. C ref does the advance internally, so it filtered the correct block; NEON filtered a (+2 rows, +2 cols) shifted region. Diagonal-shift trace in the partial doc was exactly that. Fix: tmps + i*TMP_INTS + (2*TMP_W + 2) for NEON calls. Results: M1: 10000/10000 bit-exact (100.0000%), all 8 dirs balanced M3: 3.809 Mblock/s (consistent with 3.923 from longer window) Phase 4 unblocked; predicted R5 = 0.02-0.05 (deep RED) per earlier analysis. Will build QPU CDEF anyway for cycle-completeness + V4L2 dispatch-path existence. - tests/bench_neon_cdef.c: 3-line tmp pointer fix - docs/k5_cdef_phase3.md: supersedes k5_cdef_phase3_partial.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,121 @@
|
|||||||
|
---
|
||||||
|
cycle: 5
|
||||||
|
phase: 3
|
||||||
|
status: closed 2026-05-18 — M1 PASS, M3 captured
|
||||||
|
date_opened: 2026-05-18
|
||||||
|
date_closed: 2026-05-18
|
||||||
|
parent: k5_cdef_phase1_2.md
|
||||||
|
host: hertz
|
||||||
|
---
|
||||||
|
|
||||||
|
# Cycle 5, Phase 3 — CDEF NEON baseline (closed)
|
||||||
|
|
||||||
|
Supersedes `k5_cdef_phase3_partial.md`. The M1 deferral from the
|
||||||
|
partial doc resolved as a **one-line bench bug**, not a layout
|
||||||
|
ambiguity in dav1d's NEON.
|
||||||
|
|
||||||
|
## Root cause of the previous "layout mismatch"
|
||||||
|
|
||||||
|
`tests/cdef_ref.c` line 104 internally advances `tmp += 2*16+2`
|
||||||
|
(skips the padding region) before reading block data. `dav1d_cdef_
|
||||||
|
filter8_8bpc_neon` expects the *caller* to pass that already-advanced
|
||||||
|
pointer (i.e., pointer to the 8×8 block origin, not the padded
|
||||||
|
buffer origin). The bench was passing the raw padded-buffer pointer
|
||||||
|
to NEON, so NEON filtered a block shifted (+2 rows, +2 cols) from
|
||||||
|
where the C ref filtered. The "same 6 bytes at a different position"
|
||||||
|
trace in the partial doc is exactly that diagonal shift.
|
||||||
|
|
||||||
|
Fix: `tmps + i*TMP_INTS + (2 * TMP_W + 2)` for the NEON call.
|
||||||
|
Three-line patch in `tests/bench_neon_cdef.c`.
|
||||||
|
|
||||||
|
## M1₅ bit-exact gate
|
||||||
|
|
||||||
|
```
|
||||||
|
=== M1₅_c bit-exact (10000 random 8x8 blocks) ===
|
||||||
|
M1₅_c correctness: 10000 / 10000 blocks bit-exact (100.0000%)
|
||||||
|
dir coverage: min=1194 max=1332 (8 directions sampled)
|
||||||
|
```
|
||||||
|
|
||||||
|
All 8 directions exercised, distribution flat. **M1 gate PASS.**
|
||||||
|
|
||||||
|
## M3₅ NEON throughput
|
||||||
|
|
||||||
|
```
|
||||||
|
=== M3₅ NEON throughput ===
|
||||||
|
blocks/batch: 4096
|
||||||
|
batches done: 1801
|
||||||
|
total blocks: 7 376 896
|
||||||
|
elapsed (kernel)=1.937 s
|
||||||
|
throughput = 3.809 Mblock/s
|
||||||
|
per-block = 262.5 ns
|
||||||
|
equiv 1080p = 117.6 FPS (32 400 blocks/frame)
|
||||||
|
```
|
||||||
|
|
||||||
|
Consistent with the previously captured 3.923 Mblock/s (longer
|
||||||
|
window). Per-block ~260 ns. **CDEF remains the most compute-
|
||||||
|
intensive kernel cycle so far** (2.1× IDCT, 13× LPF wd=4,
|
||||||
|
5.5× MC).
|
||||||
|
|
||||||
|
| | per-block ns | relative |
|
||||||
|
|---|---|---|
|
||||||
|
| IDCT 8×8 (k1) | 122 | 1.0× |
|
||||||
|
| LPF wd=4 (k2) | 20.7 | 0.17× |
|
||||||
|
| MC 8h (k3) | 47.6 | 0.39× |
|
||||||
|
| LPF wd=8 (k4) | 19.1 | 0.16× |
|
||||||
|
| **CDEF (k5)** | **262.5** | **2.15×** |
|
||||||
|
|
||||||
|
30fps@1080p floor margin: **3.9×** isolation NEON single-core.
|
||||||
|
NEON-4 baseline would be ~12-15 Mblock/s → 12-15× margin.
|
||||||
|
|
||||||
|
## Methodology lessons
|
||||||
|
|
||||||
|
1. **Inverted-bench bugs look like layout mismatches.** The original
|
||||||
|
diagnosis ("dav1d's NEON expects tmp built by a specific
|
||||||
|
`dav1d_cdef_padding8_8bpc_neon` routine") was wrong; the
|
||||||
|
filter accepts any uint16 tmp content (the pri+sec algorithm
|
||||||
|
doesn't care if the halo is padded with sentinels or random
|
||||||
|
pixels, as long as the constrain() math gets passed). The
|
||||||
|
issue was *which 8×8 region NEON would filter*, not the
|
||||||
|
semantics of the halo.
|
||||||
|
|
||||||
|
2. **Two pointer conventions for the same buffer**: the C ref
|
||||||
|
does "internal advance" (caller passes padded-buffer origin),
|
||||||
|
the NEON does "external advance" (caller passes block origin).
|
||||||
|
Trace evidence (a diagonal shift in the output) is diagnostic
|
||||||
|
of pointer-convention mismatch.
|
||||||
|
|
||||||
|
3. **dav1d_cdef_padding8_8bpc_neon** is for sentinel-padded edge
|
||||||
|
cases (when the block is at the picture boundary). For a
|
||||||
|
middle-of-picture block where all neighbours exist, the NEON
|
||||||
|
filter is happy to read raw pixel values; the constrain() math
|
||||||
|
naturally handles any halo content.
|
||||||
|
|
||||||
|
## What lands in this commit
|
||||||
|
|
||||||
|
- `tests/bench_neon_cdef.c`: 3-line fix (tmp+34 for NEON calls)
|
||||||
|
- `docs/k5_cdef_phase3.md` (this doc) supersedes
|
||||||
|
`k5_cdef_phase3_partial.md`
|
||||||
|
|
||||||
|
## Phase 4 unblocked
|
||||||
|
|
||||||
|
Predicted R₅ (from `k5_cdef_phase3_partial.md`):
|
||||||
|
- CDEF is ~5× heavier per-block than MC on NEON (262 vs 48 ns)
|
||||||
|
- NEON ~5× per-core advantage on MC → QPU likely ~25× behind on CDEF
|
||||||
|
- R₅ isolation estimate: **0.02-0.05 (deep RED)**
|
||||||
|
|
||||||
|
Issue 003 V1/V2 NEON-fallback proxy showed that a 4th NEON core
|
||||||
|
running CDEF adds 1.7 Mblock/s of CDEF helper without crushing
|
||||||
|
the other 3 cores. Real QPU CDEF is predicted at ~0.2 Mblock/s
|
||||||
|
(an order of magnitude below the NEON-fallback proxy).
|
||||||
|
|
||||||
|
**Phase 4 plan rationale**: even predicted RED, build the QPU
|
||||||
|
CDEF kernel because:
|
||||||
|
- Confirms or refutes the R₅ 0.02-0.05 prediction with real data
|
||||||
|
- Completes the cycle 5 record (Phases 1-7 all closed)
|
||||||
|
- Provides the QPU CDEF dispatch path needed for the V4L2 wrapper
|
||||||
|
to *exist* (Phase 8), even if scheduler doesn't enqueue it by
|
||||||
|
default
|
||||||
|
|
||||||
|
Expected Phase 4 effort: 2-3 hours given the kernel shape is
|
||||||
|
similar to cycle 2/4 LPF (per-block stencil with table lookups
|
||||||
|
for directions; primary + secondary tap accumulation).
|
||||||
@@ -113,11 +113,16 @@ static int correctness_check(uint64_t seed, int n)
|
|||||||
tmp_center_to_dst(dst_a, tmp);
|
tmp_center_to_dst(dst_a, tmp);
|
||||||
memcpy(dst_b, dst_a, DST_BYTES);
|
memcpy(dst_b, dst_a, DST_BYTES);
|
||||||
|
|
||||||
|
/* C ref advances tmp internally by +2*stride+2.
|
||||||
|
* NEON expects the caller to pass the already-advanced pointer
|
||||||
|
* (i.e. pointer to the block-data origin, not the padded-buffer
|
||||||
|
* origin). Hence the tmp+34 for the NEON call. */
|
||||||
daedalus_cdef_filter_8x8_pri_sec_ref(
|
daedalus_cdef_filter_8x8_pri_sec_ref(
|
||||||
dst_a, DST_W, tmp, pri, sec, dir, damping, 8);
|
dst_a, DST_W, tmp, pri, sec, dir, damping, 8);
|
||||||
dav1d_cdef_filter8_8bpc_neon(
|
dav1d_cdef_filter8_8bpc_neon(
|
||||||
dst_b, DST_W, tmp, pri, sec, dir, damping, 8,
|
dst_b, DST_W, tmp + (2 * TMP_W + 2),
|
||||||
/* edges = */ 0); /* != 0xf → non-edged path, uint16 tmp w/stride 12 */
|
pri, sec, dir, damping, 8,
|
||||||
|
/* edges = */ 0); /* uint16 tmp non-edged path */
|
||||||
|
|
||||||
if (memcmp(dst_a, dst_b, DST_BYTES) != 0) {
|
if (memcmp(dst_a, dst_b, DST_BYTES) != 0) {
|
||||||
if (mismatches < 3) {
|
if (mismatches < 3) {
|
||||||
@@ -180,7 +185,7 @@ static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
|
|||||||
for (int i = 0; i < n_blocks; i++)
|
for (int i = 0; i < n_blocks; i++)
|
||||||
dav1d_cdef_filter8_8bpc_neon(
|
dav1d_cdef_filter8_8bpc_neon(
|
||||||
work_dst + (size_t)i * DST_BYTES, DST_W,
|
work_dst + (size_t)i * DST_BYTES, DST_W,
|
||||||
tmps + (size_t)i * TMP_INTS,
|
tmps + (size_t)i * TMP_INTS + (2 * TMP_W + 2),
|
||||||
pris[i], secs[i], dirs[i], damps[i], 8, 0);
|
pris[i], secs[i], dirs[i], damps[i], 8, 0);
|
||||||
|
|
||||||
double t0 = now_seconds();
|
double t0 = now_seconds();
|
||||||
@@ -191,7 +196,7 @@ static void throughput_neon(uint64_t seed, int n_blocks, double duration_s)
|
|||||||
for (int i = 0; i < n_blocks; i++)
|
for (int i = 0; i < n_blocks; i++)
|
||||||
dav1d_cdef_filter8_8bpc_neon(
|
dav1d_cdef_filter8_8bpc_neon(
|
||||||
work_dst + (size_t)i * DST_BYTES, DST_W,
|
work_dst + (size_t)i * DST_BYTES, DST_W,
|
||||||
tmps + (size_t)i * TMP_INTS,
|
tmps + (size_t)i * TMP_INTS + (2 * TMP_W + 2),
|
||||||
pris[i], secs[i], dirs[i], damps[i], 8, 0);
|
pris[i], secs[i], dirs[i], damps[i], 8, 0);
|
||||||
done += n_blocks;
|
done += n_blocks;
|
||||||
}
|
}
|
||||||
|
|||||||
Reference in New Issue
Block a user