daedalus-fourier/docs/k5_cdef_phase3_partial.md
marfrit 2dd774a9ab Issue 003 closed: mixed-kernel M4 validates V4 deployment shape
bench_concurrent_mixed runs NEON-N on kernel A + QPU on kernel B
concurrently. Matrix on hertz:

  V3 (CPU MC + QPU MC same-kernel): CPU 22.64 + QPU 0.39 Mblock/s
  V4 (CPU MC + QPU LPF4):            CPU 27.87 + QPU 12.74 Medge/s
  V1 (CPU MC + NEON-fb CDEF):        CPU 24.49 + 1.75 Mblock/s CDEF
  V2 (CPU LPF4 + NEON-fb CDEF):      CPU 27.28 Medge + 1.70 Mblock/s

V4 is the daedalus-fourier deployment shape (CPU runs MC; QPU runs
LPF4 via cycle 2 GREEN offload). Both substrates productive; CPU
MC +23% per-core vs same-kernel V3 control. Same-kernel M4 in
cycles 1-5 was a worst-case contention bound, not a deployment
number — user's "5%/50%" framing was correct.

Cycle 3 MC verdict unchanged (QPU MC contributes ~0.4 under any
contention); cycle 5 CDEF deferred verdict softened to
opportunistic helper (NEON-fallback proxy used since cycle 5
Phase 6 not yet built).

- tests/bench_concurrent_mixed.c (configurable cpu-kernel /
  qpu-kernel matrix; supports MC, LPF4, LPF8, IDCT real QPU
  dispatch; CDEF uses NEON-on-core-3 fallback)
- CMakeLists.txt: build target wired with all FFmpeg + dav1d sources
- docs/issues/003-mixed-kernel-m4-bench.md: closure + matrix
- docs/k3_mc_phase7.md: M4 methodology caveat extended with V3/V4
- docs/k5_cdef_phase3_partial.md: deployment recommendation updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:44:08 +00:00

---
cycle: 5
phase: 3 (partial — M3 captured, M1 deferred)
status: in_progress (M1 known-issue, Phase 4+ deferred)
date_opened: 2026-05-18
date_partial_close: 2026-05-18
parent: k5_cdef_phase1_2.md
---
# Cycle 5, Phase 3 (partial) — CDEF NEON baseline
Cycle 5 Phase 3 captured **M3₅ throughput**, but the **M1 bit-exact
gate is deferred** to the next session due to a tmp-layout mismatch
between the standalone C reference and dav1d's NEON expectation.
## M3₅ NEON throughput (captured)
```
=== M3₅ NEON throughput ===
blocks/batch: 65536
batches done: 279
total blocks: 18 284 544
elapsed (kernel)=4.661 s
throughput = 3.923 Mblock/s
per-block = 254.9 ns
equiv 1080p = 121.1 FPS (32 400 blocks/frame)
```
**Per-block 254.9 ns**: CDEF is the most compute-intensive kernel
measured so far:
| kernel | per-block ns | relative to IDCT 8×8 |
|---|---|---|
| IDCT 8×8 (k1) | 122 | 1.0× |
| LPF wd=4 (k2) | 20.7 | 0.17× |
| MC 8h (k3) | 47.6 | 0.39× |
| LPF wd=8 (k4) | 19.1 | 0.16× |
| **CDEF (k5)** | **254.9** | **2.09×** |
30fps@1080p floor margin: **4×** in isolation (32 400 blocks/frame ×
30 fps ÷ 1e6 = 0.972 Mblock/s required; 3.923 / 0.972 = 4.04). NEON
CDEF on a single CPU core clears the 30 fps floor on its own.
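
The derived figures above all follow from two measured values (total blocks and elapsed time). A minimal sketch of that arithmetic, with `m3_summary` and `m3_summarise` as hypothetical names that are not part of the bench harness:

```c
#include <assert.h>
#include <stdint.h>

/* Derived figures for one throughput capture (hypothetical helper). */
typedef struct {
    double mblock_per_s;  /* blocks / elapsed, in millions per second */
    double ns_per_block;  /* elapsed / blocks, in nanoseconds */
    double fps;           /* frames per second at blocks_per_frame */
    double floor_margin;  /* fps divided by the target frame rate */
} m3_summary;

static m3_summary m3_summarise(uint64_t blocks, double elapsed_s,
                               uint32_t blocks_per_frame, double target_fps)
{
    assert(blocks > 0 && elapsed_s > 0.0);
    assert(blocks_per_frame > 0 && target_fps > 0.0);

    double blocks_per_s = (double)blocks / elapsed_s;
    m3_summary s;
    s.mblock_per_s = blocks_per_s / 1e6;
    s.ns_per_block = elapsed_s * 1e9 / (double)blocks;
    s.fps          = blocks_per_s / (double)blocks_per_frame;
    s.floor_margin = s.fps / target_fps;
    return s;
}
```

Calling `m3_summarise(18284544, 4.661, 32400, 30.0)` feeds in the captured run; 32 400 is 1920×1080 divided into 8×8 blocks.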
## M1 known-issue (deferred to next session)
The bit-exact gate against my standalone C reference fails. The
output structure (NEON vs C ref) shows the NEON producing
algorithmically-correct-looking pixel values, but at a SHIFTED
(row, col) offset within dst. Trace evidence:
> neon row 5, cols 2-7 = `90 213 247 143 95 76`
> C ref row 3, cols 0-5 = `90 213 247 143 95 76`
The same 6-byte sequence appears in both outputs, with the NEON copy
displaced by (+2 rows, +2 cols) relative to the C ref, i.e.
2×8 + 2 = +18 bytes at the 8-byte row stride. This is a fixed offset,
not a stride mismatch. The most likely cause is that dav1d's NEON
expects tmp built by a specific `dav1d_cdef_padding8_8bpc_neon`
routine (different from the C-side `padding()` function), and my
manual tmp construction doesn't match that convention.
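
Spotting a displaced-but-otherwise-identical block by eye is slow, so a small diagnostic can search for it. The sketch below is an assumption-laden helper, not code from the harness: `find_shift` and `shift_result` are hypothetical names, and it assumes 8×8 row-major blocks at stride 8. It reports a shift only when every overlapping pixel agrees, so flat or repetitive test data can still produce a spurious match.

```c
#include <assert.h>
#include <stdint.h>

enum { BLK = 8 };  /* 8x8 block, row-major, stride BLK (assumed layout) */

typedef struct {
    int found;   /* 1 if some shift makes the whole overlap identical */
    int dr, dc;  /* got[r + dr][c + dc] == want[r][c] */
    int linear;  /* dr * BLK + dc, offset in bytes */
} shift_result;

/* Return 1 if every overlapping pixel agrees under the shift (dr, dc). */
static int shift_matches(const uint8_t *got, const uint8_t *want,
                         int dr, int dc, int *overlap)
{
    int n = 0;
    for (int r = 0; r < BLK; r++) {
        for (int c = 0; c < BLK; c++) {
            int gr = r + dr, gc = c + dc;
            if (gr < 0 || gr >= BLK || gc < 0 || gc >= BLK)
                continue;  /* outside the overlap */
            if (got[gr * BLK + gc] != want[r * BLK + c])
                return 0;
            n++;
        }
    }
    *overlap = n;
    return n > 0;
}

/* Try all shifts within +/- max_shift; keep the exact match with the
 * largest overlap. */
static shift_result find_shift(const uint8_t *got, const uint8_t *want,
                               int max_shift)
{
    assert(got && want && max_shift >= 0 && max_shift < BLK);
    shift_result best = { 0, 0, 0, 0 };
    int best_overlap = 0;
    for (int dr = -max_shift; dr <= max_shift; dr++) {
        for (int dc = -max_shift; dc <= max_shift; dc++) {
            int overlap = 0;
            if (shift_matches(got, want, dr, dc, &overlap) &&
                overlap > best_overlap) {
                best = (shift_result){ 1, dr, dc, dr * BLK + dc };
                best_overlap = overlap;
            }
        }
    }
    return best;
}
```

Passing the NEON output as `got` and the C ref output as `want` yields the row/column displacement and its linear byte offset in one call.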
**Resolution paths** (next session):
1. **Call dav1d's NEON padding function** to construct tmp from
dst+left+top+bottom random inputs. Then the filter reads it
with the right layout. Adds another extern symbol to bind.
2. **Vendor `dav1d_cdef_filter_block_8x8_c` from dav1d's C-side**
(with templated headers shimmed). Compare NEON output against
dav1d's *own* C, not my standalone transcription. Eliminates the
layout-shim ambiguity entirely.
3. Inspect `dav1d_cdef_padding8_8bpc_neon` output for one block,
reverse-engineer the layout, update standalone C ref to match.
Path 1 is probably simplest. The padding function signature
(inferred from cdef.S `padding_func` macro):
```
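#include <stddef.h>  /* ptrdiff_t, size_t */
#include <stdint.h>  /* uint8_t, uint16_t */

/* Inferred from the cdef.S `padding_func` macro. */
void dav1d_cdef_padding8_8bpc_neon(uint16_t *tmp, const uint8_t *src,
                                   ptrdiff_t src_stride,
                                   const uint8_t (*left)[2],
                                   const uint8_t *top, const uint8_t *bottom,
                                   int h, size_t edges);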
```
Phase 3 closure requires M1 bit-exact verified.
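
For reference, the gate itself is just a strided pixel compare that reports where the first difference is. A minimal sketch, assuming 8-bit pixels and caller-supplied strides; `m1_compare` and `m1_result` are hypothetical names and may not match what `tests/bench_neon_cdef.c` does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int mismatches;  /* number of differing pixels */
    int first_row;   /* row of the first difference, -1 if none */
    int first_col;   /* column of the first difference, -1 if none */
} m1_result;

/* Compare a w x h region of two buffers that may use different strides. */
static m1_result m1_compare(const uint8_t *got, ptrdiff_t got_stride,
                            const uint8_t *want, ptrdiff_t want_stride,
                            int w, int h)
{
    assert(got && want && w > 0 && h > 0);
    m1_result res = { 0, -1, -1 };
    for (int r = 0; r < h; r++) {
        for (int c = 0; c < w; c++) {
            if (got[r * got_stride + c] != want[r * want_stride + c]) {
                if (res.mismatches == 0) {
                    res.first_row = r;
                    res.first_col = c;
                }
                res.mismatches++;
            }
        }
    }
    return res;
}
```

The gate passes only when `mismatches == 0`; the first-difference coordinates are what make a layout problem distinguishable from an arithmetic one.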
## Phase 4-7 deferred
Without M1 verified, can't safely build the QPU shader (would have
no correctness gate against the NEON path either, and we'd be
chasing two layout issues simultaneously).
**Predicted R₅** (extrapolating from cycle 3 MC):
- CDEF is ~5.4× heavier per-block than MC on NEON (254.9 vs 47.6 ns)
- NEON ~5× advantage → QPU likely ~25× behind
- R₅ isolation estimate: **0.02-0.05 (deep RED)**
- M4₅ mixed: very likely negative (deeper than cycle 3 MC's -19.5%)
- 30fps floor: still PASS on isolation+mixed since NEON 4-core
baseline likely 12+ Mblock/s, comfortably above 0.972
**Deployment recommendation** (updated 2026-05-18 after Issue 003
closed; Phase 4-7 still deferred): **CDEF baseline = CPU, QPU
offload path should exist in V4L2 wrapper but only enqueue when
IDCT+LPF queue is empty**.
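
The enqueue rule in this recommendation reduces to a single predicate. The sketch below only illustrates that rule; the type and function names (`qpu_queue_state`, `choose_cdef_target`) are hypothetical and do not correspond to any existing V4L2 wrapper code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { CDEF_ON_CPU, CDEF_ON_QPU } cdef_target;

/* Snapshot of pending QPU work (hypothetical structure). */
typedef struct {
    size_t idct_pending;        /* IDCT jobs waiting for the QPU */
    size_t lpf_pending;         /* LPF jobs waiting for the QPU */
    bool   qpu_cdef_available;  /* false until a QPU CDEF kernel exists */
} qpu_queue_state;

/* CPU is the baseline; the QPU is used only when it would otherwise idle. */
static cdef_target choose_cdef_target(const qpu_queue_state *q)
{
    assert(q != NULL);
    if (q->qpu_cdef_available && q->idct_pending == 0 && q->lpf_pending == 0)
        return CDEF_ON_QPU;
    return CDEF_ON_CPU;
}
```

Keeping the decision in one function means the deployment recipe can change (for example, to CPU-only) without touching the callers.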
`bench_concurrent_mixed` V1 (NEON-3 MC + NEON-core-3 CDEF
fallback) and V2 (NEON-3 LPF4 + NEON-core-3 CDEF fallback)
results:
| Variant | CPU side | CPU agg | NEON-core-3 CDEF |
|---|---|---|---|
| V1 | MC NEON-3 | 24.49 Mblock/s | 1.75 Mblock/s |
| V2 | LPF4 NEON-3 | 27.28 Medge/s | 1.70 Mblock/s |
The proxy (NEON-on-core-3 doing CDEF) adds 1.7-1.75 Mblock/s of
CDEF work without crushing the other 3 cores' main work. CPU
aggregate stays close to single-kernel 4-core levels. Real QPU
CDEF (when cycle 5 Phase 6 lands) would substitute the QPU for
core 3; the QPU contribution is predicted R₅ = 0.02-0.05 →
~0.2 Mblock/s (much less than the NEON-fallback proxy).
The opportunistic-helper hypothesis is **plausible but not
fully validated** for the actual QPU substrate, because the V1/V2
CDEF numbers come from the NEON-fallback proxy. Conservative read:
CDEF stays on the CPU until cycle 5 Phase 6 provides a real QPU
measurement.
The **bandwidth-bound vs compute-bound classification rule** still
holds at the kernel level, but its mapping to deployment is more
nuanced than "compute-bound → never QPU." Better framing:
- **Bandwidth-bound on QPU** → **definitive** QPU offload (cycle 1+2+4)
- **Compute-bound on QPU** → **opportunistic** QPU helper if the pipeline
has bandwidth-light CPU work running concurrently (cycle 3+5; for
CDEF, Issue 003 measured only the NEON-fallback proxy, so real-QPU
confirmation is still open)
## Phase 9 lessons (provisional)
1. **Vendoring from a SECOND upstream (dav1d after FFmpeg) added
non-trivial layout-convention friction.** Different projects make
different optimisation tradeoffs (dav1d NEON uses stride-16 tmp
for vector-load alignment; dav1d C uses stride-12 because it
doesn't matter for scalar code). The standalone C ref had to be
re-fit to match the NEON layout, not just transcribed from the C.
2. **Two different `dav1d_cdef_directions` tables in dav1d**:
stride-12 in `src/tables.c` (used by C path), stride-16 in
`src/arm/64/cdef_tmpl.S` (used by NEON path). I initially vendored
the C-side table; should have used the NEON-side embedded version
for matching against NEON.
3. **Bit-exact gate fundamentally requires the standalone C ref to
match the actual NEON call convention exactly.** When the layout
convention differs (as here), no amount of correct algorithm
transcription saves you. The cleanest fix is to either run
dav1d's own C ref (vendor more headers) or use dav1d's NEON
padding to construct tmp.
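
To make the stride-12 vs stride-16 point concrete: if a direction offset is stored as `dy * stride + dx`, the same (dy, dx) step has a different linear value under each stride, so a table built for one layout cannot be used with the other unchanged. The sketch below re-encodes such an offset. It assumes that encoding and that |dx| is smaller than half the stride; both function names are hypothetical, and the actual dav1d tables should be checked before relying on it.

```c
#include <assert.h>

/* Split off = dy * stride + dx into (dy, dx). Assumes |dx| < stride / 2. */
static void offset_decode(int off, int stride, int *dy, int *dx)
{
    assert(stride > 1 && dy && dx);
    int half = stride / 2;
    /* Round to the nearest row so a negative dx is not read as a row step. */
    int y = (off >= 0) ? (off + half) / stride
                       : -((-off + half) / stride);
    *dy = y;
    *dx = off - y * stride;
}

/* Re-encode an offset from one tmp stride to another. */
static int offset_restride(int off, int from_stride, int to_stride)
{
    int dy, dx;
    offset_decode(off, from_stride, &dy, &dx);
    return dy * to_stride + dx;
}
```

For example, the step (dy = -1, dx = +1) is -11 at stride 12 and -15 at stride 16.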
## What lands in this commit
- `external/dav1d-snapshot/src/arm/64/cdef_tmpl.S` (additional
vendored file, needed for cdef.S to include)
- `tests/cdef_ref.c` — standalone C ref (algorithmically correct,
layout known-mismatched)
- `tests/bench_neon_cdef.c` — bench harness with M1 made warning
(proceeds to M3 even on layout mismatch)
- `external/dav1d-snapshot/config.h` — asm preamble shim
(works — dav1d's cdef.S assembles + links + executes)
- `CMakeLists.txt` — dav1d asm + table source build wiring
- M3₅ baseline: 3.923 Mblock/s captured on hertz
## Resumption checklist (next session)
- [ ] Pick M1 resolution path (1, 2, or 3 from §"Resolution paths")
- [ ] If path 1: vendor + bind `dav1d_cdef_padding8_8bpc_neon`,
update bench to call padding-then-filter, recapture M1 gate
- [ ] Phase 4 plan QPU CDEF kernel (likely brief; predicted RED)
- [ ] Phase 5 review (mandatory; first AV1 QPU work)
- [ ] Phase 6 implement
- [ ] Phase 7 measure M2 + M4 if reaches threshold
- [ ] Confirm deployment recipe: CDEF stays on CPU (likely)