Files
daedalus-fourier/docs/k5_cdef_phase7.md
T
marfrit 5223d3cb3f Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline:
  - meta layout: m.z=tmp_off, m.w=dir
  - sec_shift clamped to >=0 (NEON uqsub semantics)
  - directions table as const ivec2[14], not OR-packed

Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills).
3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096.

M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED).
M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative).
M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC,
  QPU 0.42 Mblock/s CDEF helper. CPU side higher than the
  Issue 003 NEON-fallback proxy suggested - cross-substrate
  contention is gentler than same-side NEON contention.

Verdict: CDEF stays on CPU; QPU dispatch path exists for
opportunistic use. Deployment recipe table updated for all 5
cycles. Phase 9 lessons: linear extrapolation across cycles is
too pessimistic; CDEF is bandwidth-bound on NEON despite high
per-block ns; real-substrate-cross contention < NEON-proxy
contention.

- src/v3d_cdef.comp: cycle 5 QPU shader
- tests/bench_v3d_cdef.c: 3-way M1, M2 bench
- tests/bench_concurrent_mixed.c: K_CDEF on both sides
- tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp +
  expanded damping range to exercise the edge case
- CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring
- docs/k5_cdef_phase4.md updated with Phase 5 review applied
- docs/k5_cdef_phase7.md: closure doc with full verdict matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:52:46 +00:00

197 lines
7.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
cycle: 5
phase: 7
status: closed 2026-05-18 — M1 PASS, R₅=0.116 ORANGE, M4 same-kernel NEGATIVE, M4 mixed-kernel POSITIVE
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k5_cdef_phase6 (no doc — phase 6 is the shader + bench commit)
host: hertz
verdict: CDEF baseline = CPU; QPU dispatch path exists for opportunistic use. Better than predicted (ORANGE not RED).
---
# Cycle 5, Phase 7 — Verification (CDEF on V3D)
## Phase 6 deliverable
- `src/v3d_cdef.comp` — 256 inv/WG, 4 blocks/WG, no barrier,
uint16 tmp via `GL_EXT_shader_16bit_storage`, uint8 dst.
- `tests/bench_v3d_cdef.c` — 3-way M1 (QPU vs C ref vs NEON) per
Phase 5 YELLOW-1, M2 throughput, R₅ band classifier.
- `tests/bench_concurrent_mixed.c` extended with K_CDEF on both
CPU and QPU sides for M4.
shaderdb:
```
SHADER-DB-4a79c02a... 387 inst, 2 threads, 0 loops, 133 uniforms,
21 max-temps, 0:0 spills:fills, 0 sfu-stalls, 5 nops
```
2 threads (not 4 as plan hoped) — register pressure same as
cycle 3 MC. 133 uniforms under the 144 gate. No spills.
## M1 — 3-way bit-exact
```
=== M1₅: QPU vs C-ref vs NEON 3-way ===
C ref vs NEON parity check: 0/4096 mismatches
QPU vs C ref: 4096 / 4096 blocks bit-exact (100.0000%)
QPU vs NEON: 4096 / 4096 blocks bit-exact (100.0000%)
```
All three implementations agree. Phase 5 RED-1, RED-2, RED-3 fixes
verified (meta layout, sec_shift clamp, ivec2 dirs table).
## M2 — QPU throughput
```
=== M2₅: QPU throughput ===
blocks/dispatch: 4096
iters: 50
total blocks: 204 800
elapsed (kernel)=0.462 s
M2₅ throughput = 0.443 Mblock/s
per-block = 2256.1 ns
per-dispatch = 9241.0 us
```
R₅ = 0.443 / 3.809 = **0.116 → ORANGE band**.
**Better than predicted** (Phase 4 estimated R₅ = 0.02-0.05, deep
RED). The prediction was extrapolated from cycle 3 MC's R₃ = 0.067
× scaling for higher per-block compute weight. The actual QPU
overhead per block (387 inst at 2 threads) doesn't scale as
badly as that linear projection suggested — likely because
the constrain() inner loop has less filter-coefficient overhead
than MC's 8-tap subpel and the 16-bit tmp loads are well-suited
to the V3D 7.1 storage path.
30fps@1080p floor: 0.443 / 0.972 = **0.46× margin (isolation)**.
**Below the user-facing floor as sole substrate.** But CDEF is
not commonly applied to every block in real video — it's
strength-gated per superblock. Effective CDEF rate in real
content is often < 0.5 Mblock/s. Within reach.
## M4 — concurrent matrix
All windows 6 s, hertz, `bench_concurrent_mixed`.
### M4 same-kernel (cycle 5 closure)
| Config | CPU CDEF agg | QPU CDEF | total | per-core CPU |
|---|---|---|---|---|
| **NEON-3 + QPU** | 8.080 | 0.381 | 8.461 | 2.69 avg |
| **NEON-4 + QPU** | 7.866 | 0.385 | 8.251 | 1.97 avg |
NEON-3 + QPU > NEON-4 + QPU (8.46 > 8.25). NEON CDEF is
**bandwidth-saturated at 4 cores** despite per-block compute
weight (262 ns) suggesting compute-bound — the per-core
throughput drop from 2.69 (NEON-3) to 1.97 (NEON-4) confirms it.
Same pattern as cycle 1 IDCT and cycle 2 LPF.
Without a "no QPU" baseline in this bench (rerun with cycle 5's
M3 alone gives 3.8 Mblock/s per core × 4 ≈ 15 Mblock/s
theoretical), the same-kernel M4 verdict:
- NEON-4 alone CDEF estimated ~9-10 Mblock/s (saturation
reduces from theoretical 15 to actual; matches per-core 2.5
trend)
- NEON-3 + QPU CDEF (8.46) is **below NEON-4 alone**
- Same-kernel M4: **NEGATIVE**
This matches the pessimistic same-kernel-bench framing
(`feedback_m4_same_kernel_worst_case.md`).
### M4 mixed-kernel (deployment shape)
| Config | CPU side | CPU agg | QPU CDEF |
|---|---|---|---|
| **NEON-3 MC + QPU CDEF** | MC | 34.17 Mblock/s | 0.424 Mblock/s |
| **NEON-3 LPF4 + QPU CDEF** | LPF4 | 31.48 Medge/s | 0.414 Mblock/s |
QPU CDEF contributes 0.41-0.42 Mblock/s while the CPU side runs
near-maximum throughput. Compare against Issue 003 V1/V2
NEON-fallback proxy (1.7 Mblock/s): the real QPU CDEF is
~4× weaker than the NEON-on-core-3 proxy estimated, but still
positive helper value.
CPU MC agg in this mixed config (34.17 Mblock/s) is **higher**
than CPU MC in Issue 003 V1 (24.49) — because the V1 proxy used
NEON on core 3 which contended on the CPU memory bus, whereas
the real QPU contends on the QPU side. Real-substrate-cross
contention is gentler than NEON-core-3 proxy contention. **Issue
003 V1/V2 numbers underestimated CPU side**, but correctly
overestimated QPU helper magnitude.
## Verdict
| Rule | Result | Status |
|---|---|---|
| M1 bit-exact (3-way) | 100.00% on 4096 blocks | ✓ PASS |
| R₅ = M2₅/M3₅ | 0.116 (ORANGE) | better than predicted |
| M4 same-kernel | NEGATIVE (8.46 < ~10) | ✗ FAIL gate |
| M4 mixed-kernel (CPU=MC) | +0.42 Mblock/s QPU helper | ✓ POSITIVE |
| 30fps@1080p floor (isolation) | 0.46× | ✗ FAIL as sole substrate |
| 30fps@1080p floor (CPU baseline) | 8.46 / 0.972 = 8.7× | ✓ PASS via CPU |
**Engineering verdict**: CDEF QPU offload viable as
**opportunistic helper**; CPU NEON remains primary substrate.
Phase 8 V4L2 wrapper should expose CDEF QPU dispatch path, but
scheduler defaults to CPU CDEF.
**Surprise (positive)**: cycle 5 came in better than predicted
(ORANGE not RED). The "compute-bound → QPU bad" classification
held at the broad level, but the magnitude was less severe than
extrapolated.
## Deployment recipe update
| Cycle | Kernel | Primary | QPU dispatch path | Verdict |
|---|---|---|---|---|
| 1 IDCT 8×8 | QPU | yes | M4 +7.2 % validated |
| 2 LPF wd=4 | QPU | yes | M4 +6.9 % validated; V4 confirmed |
| 3 MC 8h | CPU | exists, unused | QPU MC = 0.39 Mblock/s under any contention |
| 4 LPF wd=8 | QPU | yes | M4 +4.1 % validated |
| 5 CDEF | CPU | exists, opportunistic | QPU CDEF = 0.42 Mblock/s mixed, ~half-floor on its own |
## Phase 9 lessons
1. **Predictions extrapolated linearly from one cycle can be too
pessimistic.** Cycle 3 MC R₃ = 0.067 extrapolated → R₅ = 0.02-0.05
predicted; actual R₅ = 0.116. The "compute-bound" axis isn't a
single dimension — CDEF and MC are both compute-bound but have
different inner-loop shapes that affect V3D compiled code
differently.
2. **CDEF is bandwidth-bound on NEON despite high per-block ns.**
Per-block 262 ns suggested "compute-bound" but per-core
saturation at 4 cores (2.5 → 2.0 Mblock/s) shows the real
constraint is memory bandwidth (192 u16 × 64 lanes/core reads
+ 64 byte writes per block). This is a re-calibration of the
bandwidth-bound/compute-bound classification: the binary
categorization needs nuance.
3. **Real-substrate-cross contention is gentler than same-side
NEON proxy.** Issue 003 V1/V2 used NEON-on-core-3 as a "QPU
helper" proxy; that overestimated the QPU's helper magnitude
(because NEON-on-core-3 has more parallelism than QPU) but
underestimated the CPU side throughput (because NEON-on-core-3
contended on the CPU memory bus). The real QPU gives lower
helper throughput but does NOT hurt the CPU side at all.
4. **3-way M1 (QPU vs C ref vs NEON) caught nothing — but it would
have caught the Phase 5 REDs cleanly.** The Phase 5 review's
recommendation (YELLOW-1) was correct prudence; in this case
the Phase 5 fixes prevented all bugs the gate would have caught,
but the 3-way structure is the right discipline going forward.
## What lands in this commit
- `src/v3d_cdef.comp` (Phase 6 shader, 387 inst, 2 threads)
- `tests/bench_v3d_cdef.c` (3-way M1, M2, R₅ classifier)
- `tests/bench_concurrent_mixed.c` extended with K_CDEF on both
sides; uses real QPU CDEF (Issue 003 NEON fallback removed)
- `CMakeLists.txt`: build wiring for v3d_cdef.spv + bench_v3d_cdef
- `docs/k5_cdef_phase7.md` (this doc) — Phase 7 closure
- Memory: update `feedback_m4_same_kernel_worst_case.md` with
cycle 5 real-QPU numbers (Issue 003 V1/V2 fallback proxy
obsolete).