---
cycle: 5
phase: 3 (partial — M3 captured, M1 deferred)
status: in_progress (M1 known-issue, Phase 4+ deferred)
date_opened: 2026-05-18
date_partial_close: 2026-05-18
parent: k5_cdef_phase1_2.md
---

# Cycle 5, Phase 3 (partial): CDEF NEON baseline

Cycle 5 Phase 3 captured **M3₅ throughput**, but the **M1 bit-exact gate
is deferred** to the next session because of a tmp-layout mismatch between
the standalone C reference and dav1d's NEON expectation.

## M3₅ NEON throughput (captured)

```
=== M3₅ NEON throughput ===
blocks/batch: 65536
batches done: 279
total blocks: 18 284 544
elapsed (kernel)=4.661 s
throughput = 3.923 Mblock/s
per-block = 254.9 ns
equiv 1080p = 121.1 FPS (32 400 blocks/frame)
```

**Per-block 254.9 ns.** CDEF is the most compute-intensive kernel
measured so far:

| Kernel | Per-block (ns) | Relative to IDCT 8×8 |
|---|---|---|
| IDCT 8×8 (k1) | 122 | 1.0× |
| LPF wd=4 (k2) | 20.7 | 0.17× |
| MC 8h (k3) | 47.6 | 0.39× |
| LPF wd=8 (k4) | 19.1 | 0.16× |
| **CDEF (k5)** | **254.9** | **2.09×** |

30fps@1080p floor margin: **4×** in isolation (32 400 blocks/frame × 30 fps ÷ 1e6 =
0.972 Mblock/s required; 3.923 / 0.972 = 4.04). NEON CDEF on a
single CPU core comfortably clears the user-facing 30 fps floor on its own.

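The figures above can be re-derived from the raw counters. The following is a minimal sketch in C, assuming 8×8 blocks on a 1920×1080 frame (which gives the 32 400 blocks/frame quoted above). The function names are illustrative and are not part of the bench harness.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks per 1080p frame, assuming 8x8 blocks: (1920 * 1080) / 64 = 32 400. */
enum { BLOCKS_PER_1080P_FRAME = (1920 * 1080) / (8 * 8) };

/* Throughput in Mblock/s from a raw block count and elapsed seconds. */
double mblock_per_s(uint64_t blocks, double seconds)
{
    assert(seconds > 0.0);
    return (double)blocks / seconds / 1e6;
}

/* Mean cost of one block in nanoseconds. */
double ns_per_block(uint64_t blocks, double seconds)
{
    assert(blocks > 0);
    return seconds * 1e9 / (double)blocks;
}

/* Equivalent 1080p frame rate for a given throughput in Mblock/s. */
double fps_1080p(double mblock_s)
{
    return mblock_s * 1e6 / BLOCKS_PER_1080P_FRAME;
}

/* Margin over the floor: measured throughput divided by the throughput
 * required to sustain target_fps at 1080p. */
double floor_margin(double mblock_s, double target_fps)
{
    double required = BLOCKS_PER_1080P_FRAME * target_fps / 1e6;
    return mblock_s / required;
}
```

With the captured run (279 batches of 65 536 blocks in 4.661 s), `mblock_per_s` gives roughly 3.92 Mblock/s, and `floor_margin(3.923, 30.0)` gives roughly 4.04.
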
## M1 known-issue (deferred to next session)

The bit-exact gate against my standalone C reference fails. Comparing
the two outputs, the NEON path produces algorithmically-correct-looking
pixel values, but at a shifted (row, col) position within dst. Trace
evidence:

> neon row 5, cols 2-7 = `90 213 247 143 95 76`
> C ref row 3, cols 0-5 = `90 213 247 143 95 76`

The same 6-byte sequence appears in both outputs, with the NEON copy
displaced by (+2 rows, +2 cols) relative to the C ref, i.e.
2×8 + 2 = 18 bytes at a dst stride of 8. The likely cause is that
dav1d's NEON filter expects tmp to be built by its own
`dav1d_cdef_padding8_8bpc_neon` routine (which differs from the C-side
`padding()` function), and my manual tmp construction doesn't follow
that convention.

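A displacement like the one in the trace can be located mechanically rather than by eye. The sketch below searches both output blocks for the same byte run and reports where the NEON copy sits relative to the C ref. It assumes both blocks are 8-bit and share a stride; `find_run` and `run_displacement` are hypothetical helper names, not functions from the bench harness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Find the first position where `needle` (n bytes) occurs inside a single
 * row of a rows x cols block. Returns 1 and writes (row, col) on success,
 * 0 if the run is not present. */
int find_run(const uint8_t *blk, ptrdiff_t stride, int rows, int cols,
             const uint8_t *needle, int n, int *row, int *col)
{
    assert(n > 0 && n <= cols);
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x + n <= cols; x++) {
            if (memcmp(blk + y * stride + x, needle, (size_t)n) == 0) {
                *row = y;
                *col = x;
                return 1;
            }
        }
    }
    return 0;
}

/* Displacement of the run in `neon` relative to its position in `cref`
 * (NEON position minus C-ref position). Returns 1 if the run was found
 * in both blocks, 0 otherwise. */
int run_displacement(const uint8_t *neon, const uint8_t *cref,
                     ptrdiff_t stride, int rows, int cols,
                     const uint8_t *needle, int n, int *drow, int *dcol)
{
    int nr, nc, cr, cc;
    if (!find_run(neon, stride, rows, cols, needle, n, &nr, &nc))
        return 0;
    if (!find_run(cref, stride, rows, cols, needle, n, &cr, &cc))
        return 0;
    *drow = nr - cr;
    *dcol = nc - cc;
    return 1;
}
```

For the positions in the trace (NEON row 5, col 2; C ref row 3, col 0) this reports a displacement of (+2, +2).
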
**Resolution paths** (next session):

1. **Call dav1d's NEON padding function** to construct tmp from
   random dst+left+top+bottom inputs. The filter then reads tmp
   with the right layout. Adds another extern symbol to bind.
2. **Vendor `dav1d_cdef_filter_block_8x8_c` from dav1d's C side**
   (with templated headers shimmed). Compare NEON output against
   dav1d's *own* C, not my standalone transcription. Eliminates the
   layout-shim ambiguity entirely.
3. Inspect `dav1d_cdef_padding8_8bpc_neon` output for one block,
   reverse-engineer the layout, and update the standalone C ref to match.

Path 1 is probably simplest. The padding function signature
(inferred from the `padding_func` macro in cdef.S):

```
void dav1d_cdef_padding8_8bpc_neon(uint16_t *tmp, const uint8_t *src,
                                   ptrdiff_t src_stride,
                                   const uint8_t (*left)[2],
                                   const uint8_t *top, const uint8_t *bottom,
                                   int h, size_t edges);
```

Phase 3 closure requires the M1 bit-exact gate to pass.

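Whichever path is chosen, the gate itself reduces to a byte-for-byte comparison of the two dst blocks. A minimal sketch of such a check follows, assuming 8-bit output and independent strides for the two buffers; `blocks_bit_exact` is an illustrative name and not the harness's actual gate function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two rows x cols pixel blocks byte for byte.
 * Returns 1 if they are identical. Otherwise returns 0 and writes the
 * position of the first mismatch (in row-major order) to bad_row/bad_col. */
int blocks_bit_exact(const uint8_t *a, ptrdiff_t a_stride,
                     const uint8_t *b, ptrdiff_t b_stride,
                     int rows, int cols, int *bad_row, int *bad_col)
{
    assert(rows > 0 && cols > 0);
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            if (a[y * a_stride + x] != b[y * b_stride + x]) {
                *bad_row = y;
                *bad_col = x;
                return 0;
            }
        }
    }
    return 1;
}
```

Reporting the first mismatching position, rather than only pass/fail, makes a systematic shift easier to spot than a scattered arithmetic error.
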
## Phase 4-7 deferred

Without M1 verified, the QPU shader can't be built safely: it would have
no correctness gate against the NEON path either, and we'd be
chasing two layout issues simultaneously.

**Predicted R₅** (extrapolating from cycle 3 MC):

- CDEF is ~5× heavier per-block than MC on NEON (254.9 vs 47.6 ns)
- NEON ~5× advantage → QPU likely ~25× behind
- R₅ isolation estimate: **0.02-0.05 (deep RED)**
- M4₅ mixed: very likely negative (deeper than cycle 3 MC's -19.5%)
- 30fps floor: still PASS in both isolation and mixed runs, since the
  NEON 4-core baseline is likely 12+ Mblock/s, comfortably above the
  required 0.972

**Deployment recommendation** (updated 2026-05-18 after Issue 003
closed; Phase 4-7 still deferred): **CDEF baseline = CPU. A QPU
offload path should exist in the V4L2 wrapper, but should only enqueue
when the IDCT+LPF queue is empty**.

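The recommendation amounts to a simple dispatch rule. A minimal sketch of it in C is shown below; the struct, enum, and function names are hypothetical and do not correspond to any existing V4L2 wrapper code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of work already queued for the QPU. */
struct qpu_queue_state {
    size_t idct_pending; /* IDCT jobs waiting on the QPU */
    size_t lpf_pending;  /* LPF jobs waiting on the QPU */
};

typedef enum { CDEF_ON_CPU, CDEF_ON_QPU } cdef_target;

/* CDEF defaults to the CPU. The QPU is used only as an opportunistic
 * helper: when a QPU CDEF kernel exists and the QPU has no IDCT or LPF
 * work queued. */
cdef_target choose_cdef_target(const struct qpu_queue_state *q,
                               bool qpu_cdef_available)
{
    assert(q != NULL);
    if (qpu_cdef_available && q->idct_pending == 0 && q->lpf_pending == 0)
        return CDEF_ON_QPU;
    return CDEF_ON_CPU;
}
```

Keeping the decision in one small function would let the policy be revisited once real QPU CDEF numbers exist, without touching the enqueue path.
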
`bench_concurrent_mixed` results for V1 (NEON-3 MC + NEON-core-3 CDEF
fallback) and V2 (NEON-3 LPF4 + NEON-core-3 CDEF fallback):

| Variant | CPU side | CPU aggregate | NEON-core-3 CDEF |
|---|---|---|---|
| V1 | MC NEON-3 | 24.49 Mblock/s | 1.75 Mblock/s |
| V2 | LPF4 NEON-3 | 27.28 Medge/s | 1.70 Mblock/s |

The proxy (NEON on core 3 doing CDEF) adds 1.70-1.75 Mblock/s of
CDEF work without crushing the main work on the other 3 cores; the CPU
aggregate stays close to single-kernel 4-core levels. Real QPU
CDEF (when cycle 5 Phase 6 lands) would substitute the QPU for
core 3. With the predicted R₅ = 0.02-0.05, the QPU contribution would be
~0.2 Mblock/s, much less than the NEON-fallback proxy.

The opportunistic-helper hypothesis is therefore **plausible but not
fully validated** for the actual QPU substrate. Conservative read: the
**bandwidth-bound vs compute-bound classification rule** still
holds at the kernel level, but its mapping to deployment is more
nuanced than "compute-bound → never QPU." Better framing:

- **Bandwidth-bound on QPU** → **definitive** QPU offload (cycle 1+2+4)
- **Compute-bound on QPU** → **opportunistic** QPU helper if the pipeline
  has bandwidth-light CPU work running concurrently (cycle 3+5; Issue 003
  covered CDEF only through the NEON-fallback proxy, so confirmation on
  the real QPU is still pending)

## Phase 9 lessons (provisional)

1. **Vendoring from a SECOND upstream (dav1d after FFmpeg) added
   non-trivial layout-convention friction.** Different projects make
   different optimisation tradeoffs (dav1d NEON uses a stride-16 tmp
   for vector-load alignment; dav1d C uses stride-12 because alignment
   doesn't matter for scalar code). The standalone C ref had to be
   re-fit to match the NEON layout, not just transcribed from the C.

2. **Two different `dav1d_cdef_directions` tables in dav1d**:
   stride-12 in `src/tables.c` (used by the C path), stride-16 in
   `src/arm/64/cdef_tmpl.S` (used by the NEON path). I initially vendored
   the C-side table; I should have used the NEON-side embedded version
   for matching against NEON.

3. **The bit-exact gate fundamentally requires the standalone C ref to
   match the actual NEON call convention exactly.** When the layout
   convention differs (as here), no amount of correct algorithm
   transcription saves you. The cleanest fix is either to run
   dav1d's own C ref (vendor more headers) or to use dav1d's NEON
   padding to construct tmp.

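Lessons 1 and 2 come down to the same flat offset meaning different things at different strides. The sketch below shows how an offset encoded for one tmp stride could be rebased to another. It assumes each table entry encodes `dy * stride + dx` with a small `|dx|` (less than half the stride), which is an assumption about the table layout and not something verified against dav1d here; the function names are illustrative.

```c
#include <assert.h>

/* Split a flat offset (dy * stride + dx) back into (dy, dx).
 * Assumes |dx| < stride / 2, so the row can be recovered by rounding
 * off / stride to the nearest integer. */
void split_offset(int off, int stride, int *dy, int *dx)
{
    assert(stride > 0);
    /* C integer division truncates toward zero, so bias by half a
     * stride in the direction of off to round to the nearest row. */
    int half = stride / 2;
    *dy = (off >= 0 ? off + half : off - half) / stride;
    *dx = off - *dy * stride;
}

/* Re-express an offset computed for from_stride in terms of to_stride. */
int rebase_offset(int off, int from_stride, int to_stride)
{
    int dy, dx;
    split_offset(off, from_stride, &dy, &dx);
    return dy * to_stride + dx;
}
```

For example, an entry of `-1 * 12 + 1` (one row up, one column right at stride 12) rebases to `-1 * 16 + 1` at stride 16.
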
## What lands in this commit

- `external/dav1d-snapshot/src/arm/64/cdef_tmpl.S` (additional
  vendored file, needed because cdef.S includes it)
- `tests/cdef_ref.c`: standalone C ref (algorithmically correct,
  layout known-mismatched)
- `tests/bench_neon_cdef.c`: bench harness with the M1 gate downgraded
  to a warning (proceeds to M3 even on layout mismatch)
- `external/dav1d-snapshot/config.h`: asm preamble shim
  (works: dav1d's cdef.S assembles + links + executes)
- `CMakeLists.txt`: dav1d asm + table source build wiring
- M3₅ baseline: 3.923 Mblock/s captured on hertz

## Resumption checklist (next session)

- [ ] Pick M1 resolution path (1, 2, or 3 from §"Resolution paths")
- [ ] If path 1: vendor + bind `dav1d_cdef_padding8_8bpc_neon`,
  update bench to call padding-then-filter, recapture M1 gate
- [ ] Phase 4: plan QPU CDEF kernel (likely brief; predicted RED)
- [ ] Phase 5: review (mandatory; first AV1 QPU work)
- [ ] Phase 6: implement
- [ ] Phase 7: measure M2 + M4 if the kernel reaches threshold
- [ ] Confirm deployment recipe: CDEF stays on CPU (likely)