Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification including M4'' concurrent gate. Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons applied successfully). 2 YELLOW findings baked into phase4 §4: - stride >= 4 contract added alongside m.x >= 4 (finding 2) - assert(...) in bench made a MUST not a suggestion (finding 4) - V3D divergence-cost note: don't restructure to always-execute, masked lanes consume clock anyway (finding 3, informational) Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run (65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg, uint8_t SSBO, oob early-return discipline) baked in from start worked as expected. Performance: M2'' = 19.645 Medge/s (50.9 ns/edge) M3'' = 48.285 Medge/s (NEON baseline from phase3) R'' = 0.41 (ORANGE band - doesn't auto-close per cycle-1 calibration adjustment) shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps — shader is already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1 because there's nothing more to extract from the compiled shape. The 30x gap between theoretical instruction throughput and measured wall-clock is divergence-tax + memory latency, not compile quality. M4'' concurrent matrix on hertz (8s windows): NEON-1 LPF 41.131 Medge/s NEON-4 LPF 33.726 Medge/s <- realistic CPU ceiling (per-core 7-9; same bandwidth-saturation as cycle-1 F1) QPU only 14.299 Medge/s MIXED NEON-3 + QPU 36.049 Medge/s <- +6.9% over NEON-4 MIXED NEON-4 + QPU 31.892 Medge/s <- -5.4% oversubscribed The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: **oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to "always N-1 NEON cores + QPU, never N + QPU". Phase 9 lessons (in phase7 §"Phase 9 lessons"): 1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations) 2. Phase 5 review pays off every cycle 3. R isolation misleading on bandwidth-saturated hardware 4. Oversubscription tax depends on kernel weight 5. shaderdb 4-threads/0-spills = compute not the bottleneck New artifacts: - src/v3d_lpf_h_4_8.comp — GLSL kernel - tests/bench_v3d_lpf.c — M1'' + M2'' harness with contract asserts + fm/hev pass-rate instrumentation - tests/bench_concurrent_lpf.c — M4'' pthread bench (mirrors bench_concurrent.c) - docs/k2_deblock_phase{4,5,7}.md — plan + review + verification Project verdict: continue. Cycle 3 candidates: MC interpolation (multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different neighborhood shape), or wd=8/wd=16 LPF variants. User to direct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00
parent be7ff5587c
commit 36eca40ff2
7 changed files with 1436 additions and 1 deletions
@@ -0,0 +1,194 @@
+---
+cycle: 2
+phase: 7
+status: closed 2026-05-18 — PASS
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+parent: k2_deblock_phase4.md (+ phase5 revisions)
+host: hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75+rpt-rpi-2712,
+      Mesa 25.0.7-2+rpt4, V3D 7.1.7 @ 1 GHz, A76 @ 2.8 GHz)
+verdict: M4'' PASS — mixed +6.9 % over pure NEON-4; project continues
+---
+
+# Cycle 2, Phase 7 — Verification (v1 + M4'')
+
+Per `dev_process.md`: repeat measurements from Phase 3, compare
+explicitly to baseline. Phase 4 §6 predicted R'' ≈ 0.5–0.9 isolation,
+bandwidth ceiling at 0.87. Measured R'' = 0.41 isolation — below the
+predicted lower bound. Per cycle-1 calibration (M4 showed mixed >
+pure-CPU even at modest R), this triggers M4'' rather than honest-close.
+
+M4'' gate result: **PASS.** Project continues.
+
+## v1 first-light (single dispatch, isolation R'')
+
+```
+=== v3d LPF h_4_8 bench ===
+  device:  V3D 7.1.7.0
+  n_edges: 65536  iters: 100
+  fm pass rate:  8.09% (10k-edge sample)
+  hev pass rate: 4.93% (of fm-passing)
+  dispatch: 2048 WGs × 256 invocations = 65536 edges
+
+=== M1'': QPU vs C-reference bit-exact ===
+  edges bit-exact: 65536 / 65536 (100.0000 %)
+  total byte diffs: 0 / 4194304 (0.0000 %)
+
+=== M2'': QPU throughput ===
+  M2'' throughput = 19.645 Medge/s
+  per-edge        = 50.9 ns
+  per-dispatch    = 3336.1 us
+  R'' = M2''/M3''   = 0.407 → ORANGE band
+```
+
+shaderdb (v1 LPF kernel):
+```
+SHADER-DB-6c8e828054...: MESA_SHADER_COMPUTE shader:
+  160 inst, 4 threads, 0 loops, 36 uniforms, 21 max-temps,
+  0:0 spills:fills, 0 sfu-stalls, 160 inst-and-stalls, 15 nops
+```
+
+The shader is *already well-optimised by v3d_compiler*:
+- **4 hardware threads** (vs cycle-1 IDCT's 2 — better latency
+  hiding from the start)
+- 0 spills:fills (compiler delivered)
+- 160 instructions — about 60 % of cycle-1 IDCT's 270
+
+Yet R'' = 0.41. The 30× gap between theoretical instruction
+throughput and measured wall-clock is **not** compile-quality
+limited. Plausible attribution:
+1. fm-pass rate 8 % → 92 % of edges read+compute then return.
+   But masked lanes still pay clock (phase5'' finding 3) — no
+   throughput benefit from early-return.
+2. Memory latency: per-edge 64 reads + 0-4 writes via TMU; less
+   compute density per memory op than IDCT.
+3. v3dv per-dispatch overhead is 0.05 % of total at 3.3 ms
+   per-dispatch — not the bottleneck.
+
+The fundamental issue: LPF on QPU is **memory-bound**, not
+compute-bound. Per-edge ~88 B of traffic × 19.6 Medge/s ≈
+1.7 GB/s — well below the 4 GB/s GPU bandwidth ceiling. The
+divergence tax may be eating the bandwidth headroom (lanes
+that early-return don't write but still consume cycle).
+
+## M4'' concurrent matrix (cycle-2 gate test)
+
+8-second time-based windows, hertz, all 65 536-edge dispatches:
+
+| Config | Medge/s | per-core (NEON) | vs NEON-4 |
+|---|---|---|---|
+| **NEON 1-core** | 41.131 | 41.131 | — |
+| **NEON 4-core** | 33.726 | 7.21 – 9.28 | **baseline ceiling** |
+| QPU alone (host on core 3) | 14.299 | n/a | — |
+| **MIXED NEON-3 + QPU** | **36.049** | 9.44 – 12.98 | **+6.9 %** |
+| MIXED NEON-4 + QPU (oversubscribed) | 31.892 | 6.45 – 8.02 | **−5.4 %** |
+
+**The gate verdict:** NEON-3 + QPU (36.05) **>** NEON-4 alone
+(33.73) by 2.32 Medge/s = +6.9 %. M4'' PASSES.
+
+QPU's contribution in mixed mode (4.0 Medge/s) is 28 % of its
+isolation throughput (14.3) — the same QPU-bandwidth-collapse
+under CPU contention seen in cycle-1 M4 (where QPU dropped from
+6.9 → 1.6 Medge/s = 23 % survival).
+
+## Cycle-2 vs cycle-1 M4 deltas
+
+| | Cycle 1 (IDCT) | Cycle 2 (LPF) |
+|---|---|---|
+| NEON 1-core (Mblock/s vs Medge/s) | 12.6 | 41.1 |
+| NEON 4-core | 7.07 | 33.7 |
+| QPU isolation | 6.89 | 14.3 |
+| R isolation (vs 1-core NEON) | 0.55 | 0.35 |
+| R isolation (vs 4-core NEON saturated) | 0.97 | 0.42 |
+| MIXED N3+Q vs N4 | **+7.2 %** | **+6.9 %** |
+| MIXED N4+Q vs N4 | +9.4 % (neutral-to-pos) | **−5.4 % (negative)** |
+
+The "freed-core" pattern generalizes: NEON-3+QPU > NEON-4 by
+roughly the same percentage in both cycles. The oversubscription
+flip (cycle 1 positive → cycle 2 negative) is the new finding:
+**lighter per-unit kernels are more sensitive to CPU/QPU-host
+contention**. For deployment on higgs the recommendation
+hardens to "always NEON-3 + QPU, never NEON-4 + QPU".
+
+## Phase 4''/5'' prediction calibration
+
+What Phase 4'' got right:
+- Bandwidth-bound — bench fm-pass rate confirms most edges don't
+  even do the conditional write work, yet bandwidth is the
+  ceiling
+- 4-thread shaderdb result — phase 4 §6 predicted "compute
+  doesn't bottleneck"; confirmed
+
+What Phase 4'' got wrong:
+- Isolation R'' band 0.5–0.9 was too optimistic by ~25 %.
+  Actual 0.41. Divergence tax was bigger than estimated.
+- Phase 5'' finding 3 specifically warned not to restructure
+  for divergence — that holds; the 0.41 IS the floor.
+
+What this means: **the cycle-1-style "single big v4 jump from
+WG sweep" probably doesn't exist for LPF** — we're already at
+WG 256 from v1, already at 4 hardware threads, already at 0
+spills. The compiler delivered. The hardware limit on
+LPF-shape kernels appears to be ~14 Medge/s isolation. The
+project can pursue further optimization only by attacking the
+algorithm structure (e.g., fused multi-edge-per-WG with shared
+prefetch — but that adds shared mem and barriers, complicating
+divergence further).
+
+For now: cycle 2 closes as a YELLOW-PASS via M4''. Cycle 3 next.
+
+## Phase 7'' decision
+
+Per `k2_deblock_phase1.md §"Decision rules"` and cycle-1
+calibration adjustment:
+
+| Rule | Result | Status |
+|---|---|---|
+| M1'' bit-exact | 100.0000 % | ✓ PASS |
+| R'' = M2''/M3'' | 0.41 (ORANGE) | does not auto-close |
+| M4'' > pure-CPU 4-core | +6.9 % | ✓ PASS |
+| **Cycle verdict** | **YELLOW-via-M4''** | **continue to next kernel** |
+
+Phase 9 (lessons): see end of this doc.
+
+## Leaves open
+
+- **Real-bitstream fm-pass rate.** Bench's random distribution
+  gives 8 % fm-pass. Real VP9 streams may be 30-60 %. If fm-pass
+  rate matters for the divergence tax, real content might
+  measurably shift M2''. Worth a sample-stream re-measurement
+  if/when an end-to-end pipeline exists.
+- **Vertical variant v_4_8.** Different memory access pattern
+  (column-strided reads). Cycle 2 v2 if there's a reason; not
+  blocking.
+- **wd=8 and wd=16 filters.** Bigger conditional paths. Cycle 3+
+  candidates.
+
+## Phase 9 lessons (added to project memory)
+
+1. **Cycle-1 v4-pattern is the v1 starting point.** Bake in WG 256,
+   2-block-per-subgroup adaptation, uint8_t SSBO, oob early-return
+   discipline, NO chained ternary from the start. Saves 3 iterations.
+
+2. **Phase 5 review pays off every cycle.** Cycle 1 caught 2 RED
+   bugs; cycle 2 caught 2 YELLOW contract gaps (stride ≥ 4, assert
+   discipline) and 1 V3D-specific divergence-cost warning. No
+   wasted code from review-flagged bugs in either cycle.
+
+3. **R isolation is a misleading metric on bandwidth-saturated
+   hardware.** Comparing QPU vs 1-core NEON is the wrong baseline
+   when 4-core NEON only delivers 0.56-0.82× of 1-core scaled.
+   The right comparison is QPU vs 4-core-NEON-saturated, then
+   the mixed-vs-pure-CPU delta. Both cycles' M4 confirm this.
+
+4. **Oversubscription tax depends on kernel weight.** Heavy
+   per-unit work (IDCT) tolerates NEON-4 + QPU (+9 %). Light
+   per-unit work (LPF) is hurt by it (-5 %). Recommendation
+   for deployment: always N-1 NEON cores + QPU, never N + QPU.
+
+5. **shaderdb at 4 threads / 0 spills means compute is not the
+   bottleneck.** Subsequent optimization should target memory
+   pattern (TMU prefetch, working-set tiling) or accept the
+   silicon limit. Cycle 2 v1 hit this ceiling — no v2-v5
+   iterations needed because there's nothing to improve in the
+   compiled shader shape.