--- cycle: 2 phase: 7 status: closed 2026-05-18 — PASS date_opened: 2026-05-18 date_closed: 2026-05-18 parent: k2_deblock_phase4.md (+ phase5 revisions) host: hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75+rpt-rpi-2712, Mesa 25.0.7-2+rpt4, V3D 7.1.7 @ 1 GHz, A76 @ 2.8 GHz) verdict: M4'' PASS — mixed +6.9 % over pure NEON-4; project continues --- # Cycle 2, Phase 7 — Verification (v1 + M4'') Per `dev_process.md`: repeat measurements from Phase 3, compare explicitly to baseline. Phase 4 §6 predicted R'' ≈ 0.5–0.9 isolation, bandwidth ceiling at 0.87. Measured R'' = 0.41 isolation — below the predicted lower bound. Per cycle-1 calibration (M4 showed mixed > pure-CPU even at modest R), this triggers M4'' rather than honest-close. M4'' gate result: **PASS.** Project continues. ## v1 first-light (single dispatch, isolation R'') ``` === v3d LPF h_4_8 bench === device: V3D 7.1.7.0 n_edges: 65536 iters: 100 fm pass rate: 8.09% (10k-edge sample) hev pass rate: 4.93% (of fm-passing) dispatch: 2048 WGs × 256 invocations = 65536 edges === M1'': QPU vs C-reference bit-exact === edges bit-exact: 65536 / 65536 (100.0000 %) total byte diffs: 0 / 4194304 (0.0000 %) === M2'': QPU throughput === M2'' throughput = 19.645 Medge/s per-edge = 50.9 ns per-dispatch = 3336.1 us R'' = M2''/M3'' = 0.407 → ORANGE band ``` shaderdb (v1 LPF kernel): ``` SHADER-DB-6c8e828054...: MESA_SHADER_COMPUTE shader: 160 inst, 4 threads, 0 loops, 36 uniforms, 21 max-temps, 0:0 spills:fills, 0 sfu-stalls, 160 inst-and-stalls, 15 nops ``` The shader is *already well-optimised by v3d_compiler*: - **4 hardware threads** (vs cycle-1 IDCT's 2 — better latency hiding from the start) - 0 spills:fills (compiler delivered) - 160 instructions — about 60 % of cycle-1 IDCT's 270 Yet R'' = 0.41. The 30× gap between theoretical instruction throughput and measured wall-clock is **not** compile-quality limited. Plausible attribution: 1. fm-pass rate 8 % → 92 % of edges read+compute then return. But masked lanes still pay clock (phase5'' finding 3) — no throughput benefit from early-return. 2. Memory latency: per-edge 64 reads + 0-4 writes via TMU; less compute density per memory op than IDCT. 3. v3dv per-dispatch overhead is 0.05 % of total at 3.3 ms per-dispatch — not the bottleneck. The fundamental issue: LPF on QPU is **memory-bound**, not compute-bound. Per-edge ~88 B of traffic × 19.6 Medge/s ≈ 1.7 GB/s — well below the 4 GB/s GPU bandwidth ceiling. The divergence tax may be eating the bandwidth headroom (lanes that early-return don't write but still consume cycle). ## M4'' concurrent matrix (cycle-2 gate test) 8-second time-based windows, hertz, all 65 536-edge dispatches: | Config | Medge/s | per-core (NEON) | vs NEON-4 | |---|---|---|---| | **NEON 1-core** | 41.131 | 41.131 | — | | **NEON 4-core** | 33.726 | 7.21 – 9.28 | **baseline ceiling** | | QPU alone (host on core 3) | 14.299 | n/a | — | | **MIXED NEON-3 + QPU** | **36.049** | 9.44 – 12.98 | **+6.9 %** | | MIXED NEON-4 + QPU (oversubscribed) | 31.892 | 6.45 – 8.02 | **−5.4 %** | **The gate verdict:** NEON-3 + QPU (36.05) **>** NEON-4 alone (33.73) by 2.32 Medge/s = +6.9 %. M4'' PASSES. QPU's contribution in mixed mode (4.0 Medge/s) is 28 % of its isolation throughput (14.3) — the same QPU-bandwidth-collapse under CPU contention seen in cycle-1 M4 (where QPU dropped from 6.9 → 1.6 Medge/s = 23 % survival). ## Cycle-2 vs cycle-1 M4 deltas | | Cycle 1 (IDCT) | Cycle 2 (LPF) | |---|---|---| | NEON 1-core (Mblock/s vs Medge/s) | 12.6 | 41.1 | | NEON 4-core | 7.07 | 33.7 | | QPU isolation | 6.89 | 14.3 | | R isolation (vs 1-core NEON) | 0.55 | 0.35 | | R isolation (vs 4-core NEON saturated) | 0.97 | 0.42 | | MIXED N3+Q vs N4 | **+7.2 %** | **+6.9 %** | | MIXED N4+Q vs N4 | +9.4 % (neutral-to-pos) | **−5.4 % (negative)** | The "freed-core" pattern generalizes: NEON-3+QPU > NEON-4 by roughly the same percentage in both cycles. The oversubscription flip (cycle 1 positive → cycle 2 negative) is the new finding: **lighter per-unit kernels are more sensitive to CPU/QPU-host contention**. For deployment on higgs the recommendation hardens to "always NEON-3 + QPU, never NEON-4 + QPU". ## Phase 4''/5'' prediction calibration What Phase 4'' got right: - Bandwidth-bound — bench fm-pass rate confirms most edges don't even do the conditional write work, yet bandwidth is the ceiling - 4-thread shaderdb result — phase 4 §6 predicted "compute doesn't bottleneck"; confirmed What Phase 4'' got wrong: - Isolation R'' band 0.5–0.9 was too optimistic by ~25 %. Actual 0.41. Divergence tax was bigger than estimated. - Phase 5'' finding 3 specifically warned not to restructure for divergence — that holds; the 0.41 IS the floor. What this means: **the cycle-1-style "single big v4 jump from WG sweep" probably doesn't exist for LPF** — we're already at WG 256 from v1, already at 4 hardware threads, already at 0 spills. The compiler delivered. The hardware limit on LPF-shape kernels appears to be ~14 Medge/s isolation. The project can pursue further optimization only by attacking the algorithm structure (e.g., fused multi-edge-per-WG with shared prefetch — but that adds shared mem and barriers, complicating divergence further). For now: cycle 2 closes as a YELLOW-PASS via M4''. Cycle 3 next. ## Phase 7'' decision Per `k2_deblock_phase1.md §"Decision rules"` and cycle-1 calibration adjustment: | Rule | Result | Status | |---|---|---| | M1'' bit-exact | 100.0000 % | ✓ PASS | | R'' = M2''/M3'' | 0.41 (ORANGE) | does not auto-close | | M4'' > pure-CPU 4-core | +6.9 % | ✓ PASS | | **Cycle verdict** | **YELLOW-via-M4''** | **continue to next kernel** | Phase 9 (lessons): see end of this doc. ## Leaves open - **Real-bitstream fm-pass rate.** Bench's random distribution gives 8 % fm-pass. Real VP9 streams may be 30-60 %. If fm-pass rate matters for the divergence tax, real content might measurably shift M2''. Worth a sample-stream re-measurement if/when an end-to-end pipeline exists. - **Vertical variant v_4_8.** Different memory access pattern (column-strided reads). Cycle 2 v2 if there's a reason; not blocking. - **wd=8 and wd=16 filters.** Bigger conditional paths. Cycle 3+ candidates. ## Phase 9 lessons (added to project memory) 1. **Cycle-1 v4-pattern is the v1 starting point.** Bake in WG 256, 2-block-per-subgroup adaptation, uint8_t SSBO, oob early-return discipline, NO chained ternary from the start. Saves 3 iterations. 2. **Phase 5 review pays off every cycle.** Cycle 1 caught 2 RED bugs; cycle 2 caught 2 YELLOW contract gaps (stride ≥ 4, assert discipline) and 1 V3D-specific divergence-cost warning. No wasted code from review-flagged bugs in either cycle. 3. **R isolation is a misleading metric on bandwidth-saturated hardware.** Comparing QPU vs 1-core NEON is the wrong baseline when 4-core NEON only delivers 0.56-0.82× of 1-core scaled. The right comparison is QPU vs 4-core-NEON-saturated, then the mixed-vs-pure-CPU delta. Both cycles' M4 confirm this. 4. **Oversubscription tax depends on kernel weight.** Heavy per-unit work (IDCT) tolerates NEON-4 + QPU (+9 %). Light per-unit work (LPF) is hurt by it (-5 %). Recommendation for deployment: always N-1 NEON cores + QPU, never N + QPU. 5. **shaderdb at 4 threads / 0 spills means compute is not the bottleneck.** Subsequent optimization should target memory pattern (TMU prefetch, working-set tiling) or accept the silicon limit. Cycle 2 v1 hit this ceiling — no v2-v5 iterations needed because there's nothing to improve in the compiled shader shape.