Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS: 2 YELLOW
contract gaps applied) + Phase 6 v1 implementation + Phase 7 verification
including M4'' concurrent gate.

Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons applied
successfully). 2 YELLOW findings baked into phase4 §4:
- stride >= 4 contract added alongside m.x >= 4 (finding 2)
- assert(...) in bench made a MUST not a suggestion (finding 4)
- V3D divergence-cost note: don't restructure to always-execute,
  masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start worked
as expected.

Performance:
  M2'' = 19.645 Medge/s (50.9 ns/edge)
  M3'' = 48.285 Medge/s (NEON baseline from phase3)
  R''  = 0.41 (ORANGE band - doesn't auto-close per cycle-1
               calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps — shader is
already at the compiler ceiling. No v2/v3/v4 iteration loop like cycle 1
because there's nothing more to extract from the compiled shape. The 30x
gap between theoretical instruction throughput and measured wall-clock
is divergence-tax + memory latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):
  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling (per-core
                                         7-9; same bandwidth-saturation
                                         as cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU beats
pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding: **oversubscribed
mode hurts for lighter kernels** (LPF -5.4% vs cycle-1 IDCT +9.4%).
Recommendation for higgs deployment hardens to "always N-1 NEON cores +
QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp — GLSL kernel
- tests/bench_v3d_lpf.c — M1'' + M2'' harness with contract asserts +
  fm/hev pass-rate instrumentation
- tests/bench_concurrent_lpf.c — M4'' pthread bench (mirrors
  bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md — plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00
parent be7ff5587c
commit 36eca40ff2
7 changed files with 1436 additions and 1 deletions
@@ -0,0 +1,303 @@
+---
+cycle: 2
+phase: 4
+status: open (awaiting Phase 5'' review)
+date_opened: 2026-05-18
+parent: k2_deblock_phase3.md
+template_doc: phase4.md (cycle 1)
+target_kernel: VP9 loop filter h_4_8 — 4-tap inner, horizontal, 8-pixel edge
+expected_artifacts: src/v3d_lpf_h_4_8.comp, tests/bench_v3d_lpf.c, CMakeLists.txt updates
+---
+
+# Cycle 2, Phase 4 — Plan QPU LPF kernel
+
+This doc is compact. Cycle-1 `phase4.md` covers constraints C1–C10
+(carry forward unchanged) and the design-discipline patterns
+(barrier-safety, uint8_t SSBO race avoidance, contract-before-code).
+Phase 4'' references those rather than re-deriving.
+
+## 1. Constraints (carried from cycle 1 phase4.md §1)
+
+All 10 constraints apply unchanged. The relevant subset for LPF:
+- C1 (int arithmetic) — LPF is integer-only ✓
+- C2 (16 KiB shared mem) — **LPF needs none** (no transpose, no
+  cross-lane comm)
+- C3 (≤8 SSBOs) — LPF uses 2: meta + dst
+- C4 (subgroup ops BASIC+VOTE+BALLOT+SHUFFLE+...) — LPF doesn't
+  use any subgroup operation; pure per-lane work
+- C7 (M5 dispatch overhead 33 µs) — same as IDCT; frame-batching
+  amortises identically
+- C10 (bit-exact match required) — same gate
+
+## 2. Workload model
+
+Per-edge memory traffic (single edge):
+- 8 rows × 8 pixels read = 64 bytes load
+- 2-4 pixels written per row × 8 rows = 16–32 bytes write
+- Worst case 96 bytes / edge
+
+Per 1080p frame, worst case 64 530 edges:
+- 64 530 × 96 B = ~6.2 MB total traffic (cf. IDCT cycle 1: 8 MB)
+- At GPU's measured 4 GB/s share: 1.55 ms / frame = 645 FPS-eq
+  (32 % faster than IDCT bandwidth ceiling because traffic is
+  lower)
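
The traffic arithmetic above can be reproduced in a few lines of host-side C. This is a sketch only: the function names are hypothetical (they do not come from the repository), and the 4 GB/s bandwidth share is taken as given from the text rather than re-measured.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case bytes moved per edge: 8 rows x 8 pixels read,
 * plus up to 4 pixels written back per row. */
uint32_t lpf_bytes_per_edge(void)
{
    const uint32_t rows = 8, read_px = 8, max_write_px = 4;
    return rows * read_px + rows * max_write_px;   /* 64 + 32 */
}

/* Worst-case traffic for one frame, in bytes. */
uint64_t lpf_frame_bytes(uint32_t n_edges)
{
    return (uint64_t)n_edges * lpf_bytes_per_edge();
}

/* Frame time in milliseconds if bandwidth were the only limit.
 * bw_bytes_per_s is an input assumption (4e9 in the text above). */
double lpf_bandwidth_ms(uint32_t n_edges, double bw_bytes_per_s)
{
    assert(bw_bytes_per_s > 0.0);
    return (double)lpf_frame_bytes(n_edges) / bw_bytes_per_s * 1e3;
}
```

Calling `lpf_bandwidth_ms(64530, 4e9)` gives roughly 1.55 ms, which is where the 645 FPS-equivalent figure comes from.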
+
+Per-frame compute (1080p, worst case):
+- ~25 ALU ops/lane × 8 lanes/edge (= row count, see §3) = 200
+  lane-ops/edge; × 64 530 edges / 16 (SIMD width) ≈ 800 K SIMD-cycles
+- At v3d 92 GFLOPS theoretical × 23 % SGEMM-style util = 21 GOPS
+  effective → 40 µs compute per frame
+- **Compute ≈ dispatch overhead** (~40 µs vs the 33 µs of C7), and
+  both are far below the 1.55 ms bandwidth time above. LPF is not
+  compute-bound.
+
+## 3. Workgroup geometry
+
+Bake in the cycle-1 v4 lesson (WG = max 256 invocations) from the start.
+
+- **`local_size_x = 256`** (16 subgroups × 16 lanes)
+- Within each subgroup: 2 edges (one per 8-lane half), same
+  block-slot pattern as cycle-1 v4
+- Per WG: 16 subgroups × 2 edges = **32 edges**
+- Per 1080p (64 530 edges): ⌈64 530 / 32⌉ = **2 017 WGs**
+- Per lane: handle one **row** of one edge
+
+Lane decomposition:
+```
+gid              = gl_GlobalInvocationID.x
+wg_id            = gid / 256
+lane_in_wg       = gid & 255
+sg_in_wg         = lane_in_wg >> 4    // 0..15
+lane_in_sg       = lane_in_wg & 15
+edge_slot        = lane_in_sg >> 3    // 0 (lanes 0..7) or 1 (8..15)
+row              = lane_in_sg & 7     // 0..7
+
+edge_local       = sg_in_wg * 2 + edge_slot       // 0..31 in WG
+edge_idx         = wg_id * 32 + edge_local
+oob              = edge_idx >= n_edges
+```
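
The lane decomposition above can be checked on the host before any shader is written. The following is a minimal sketch, assuming the same constants as §3 (256 invocations per workgroup, 16-lane subgroups, 2 edges per subgroup); `lpf_lane`, `lpf_decompose` and `lpf_num_workgroups` are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side mirror of the shader's index math. */
typedef struct {
    uint32_t wg_id;      /* workgroup index */
    uint32_t sg_in_wg;   /* subgroup within the workgroup, 0..15 */
    uint32_t edge_slot;  /* which 8-lane half of the subgroup, 0 or 1 */
    uint32_t row;        /* row of the edge handled by this lane, 0..7 */
    uint32_t edge_idx;   /* global edge index */
    bool     oob;        /* true when edge_idx is past the last edge */
} lpf_lane;

lpf_lane lpf_decompose(uint32_t gid, uint32_t n_edges)
{
    lpf_lane l;
    uint32_t lane_in_wg = gid & 255u;
    uint32_t lane_in_sg = lane_in_wg & 15u;

    l.wg_id     = gid / 256u;
    l.sg_in_wg  = lane_in_wg >> 4;
    l.edge_slot = lane_in_sg >> 3;
    l.row       = lane_in_sg & 7u;
    l.edge_idx  = l.wg_id * 32u + l.sg_in_wg * 2u + l.edge_slot;
    l.oob       = l.edge_idx >= n_edges;
    return l;
}

/* Workgroups needed to cover n_edges at 32 edges per workgroup. */
uint32_t lpf_num_workgroups(uint32_t n_edges)
{
    assert(n_edges > 0u);
    return (n_edges + 31u) / 32u;
}
```

For 64 530 edges this yields 2 017 workgroups, and the last lanes of the final workgroup come back with `oob` set, which is the case the early return in §4 handles.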
+
+**No barrier needed.** Each lane is fully independent — no
+cross-lane data flow, no transpose. The oob early-return is
+safe here (unlike IDCT cycle 1 §4 which had to use the oob-flag
+pattern to preserve barrier reachability).
+
+## 4. Per-thread algorithm
+
+```glsl
+if (edge_idx >= pc.n_edges) return;          // safe — no barrier follows
+
+uvec4 m = u_meta.meta[edge_idx];
+uint base = m.x + row * pc.dst_stride_u8;    // m.x = dst byte offset of row-0 col-0 of this edge
+int E = int(m.y), I = int(m.z), H = int(m.w);
+
+int p3 = int(u_dst.dst[base - 4u]);
+int p2 = int(u_dst.dst[base - 3u]);
+int p1 = int(u_dst.dst[base - 2u]);
+int p0 = int(u_dst.dst[base - 1u]);
+int q0 = int(u_dst.dst[base + 0u]);
+int q1 = int(u_dst.dst[base + 1u]);
+int q2 = int(u_dst.dst[base + 2u]);
+int q3 = int(u_dst.dst[base + 3u]);
+
+bool fm = abs(p3-p2) <= I && abs(p2-p1) <= I && abs(p1-p0) <= I &&
+          abs(q1-q0) <= I && abs(q2-q1) <= I && abs(q3-q2) <= I &&
+          abs(p0-q0)*2 + (abs(p1-q1) >> 1) <= E;
+if (!fm) return;
+
+bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
+
+if (hev) {
+    int f  = clamp(p1 - q1, -128, 127);
+    f      = clamp(3*(q0-p0) + f, -128, 127);
+    int f1 = min(f + 4, 127) >> 3;
+    int f2 = min(f + 3, 127) >> 3;
+    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
+    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
+} else {
+    int f  = clamp(3*(q0-p0), -128, 127);
+    int f1 = min(f + 4, 127) >> 3;
+    int f2 = min(f + 3, 127) >> 3;
+    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
+    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
+    int fp = (f1 + 1) >> 1;
+    u_dst.dst[base - 2u] = uint8_t(clamp(p1 + fp, 0, 255));
+    u_dst.dst[base + 1u] = uint8_t(clamp(q1 - fp, 0, 255));
+}
+```
+
+Mirrors `tests/vp9_lpf_ref.c` line-for-line. Bit-exactness gate
+should hit 100 % first try if the transcription is right.
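
For comparison during review, here is a host-side C transcription of the per-row algorithm above. It is a sketch written from the GLSL in this section, not a copy of `tests/vp9_lpf_ref.c`; the name `lpf_row_4` and its return codes are hypothetical. The two branches are folded where they share work, which should be behaviour-preserving because the non-hev path is the hev path with the `p1 - q1` term set to zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }
static int mini(int a, int b) { return a < b ? a : b; }

/* Filter one row in place. `px` points at q0, so px[-4..3] = p3..q3.
 * Returns 0 when the row is skipped (fm false), 1 when the hev path
 * ran (2 pixels written), 2 when the non-hev path ran (4 pixels).
 * Assumes >> on a negative int is an arithmetic shift, as in GLSL. */
int lpf_row_4(uint8_t *px, int E, int I, int H)
{
    assert(px != NULL);
    int p3 = px[-4], p2 = px[-3], p1 = px[-2], p0 = px[-1];
    int q0 = px[0],  q1 = px[1],  q2 = px[2],  q3 = px[3];

    int fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I &&
             abs(q1 - q0) <= I && abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
             abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
    if (!fm) return 0;

    int hev = abs(p1 - p0) > H || abs(q1 - q0) > H;
    int f   = hev ? clampi(p1 - q1, -128, 127) : 0;   /* extra term only on hev */
    f       = clampi(3 * (q0 - p0) + f, -128, 127);
    int f1  = mini(f + 4, 127) >> 3;
    int f2  = mini(f + 3, 127) >> 3;

    px[-1] = (uint8_t)clampi(p0 + f2, 0, 255);
    px[0]  = (uint8_t)clampi(q0 - f1, 0, 255);
    if (hev) return 1;

    int fp = (f1 + 1) >> 1;                            /* outer taps, non-hev only */
    px[-2] = (uint8_t)clampi(p1 + fp, 0, 255);
    px[1]  = (uint8_t)clampi(q1 - fp, 0, 255);
    return 2;
}
```

The return code is one possible way to feed `fm`/`hev` pass-rate counters in a harness; whether the real bench counts them this way is not stated here.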
+
+**uint** for `base`: the GLSL `base - 4u` is a `uint - uint`
+expression, so it wraps around to a huge offset instead of going
+negative if `base < 4` (for row 0, that means `m.x < 4`).
+
+**Contracts (revised per phase5'' findings 2 + 4):**
+1. The host guarantees `m.x ≥ 4` for every edge.
+2. The host guarantees `dst_stride_u8 ≥ 4` for every dispatch.
+   (Required for race safety, see §5: with `s = dst_stride_u8`, rows
+   `r` and `r+1` write to `[m.x+r·s−2 .. m.x+r·s+1]` and
+   `[m.x+(r+1)·s−2 .. m.x+(r+1)·s+1]`, disjoint iff `s ≥ 4`.)
+3. **Phase 6 MUST add `assert(m_x >= 4 && dst_stride >= 4)` in
+   `bench_v3d_lpf.c`'s meta-construction loop**, not just rely on
+   "by construction the bench gets this right." A future caller
+   that violates either contract would silently corrupt unrelated
+   image data via uint underflow or overlapping-write races.
+
+Bench enforces (1) by placing each edge at offset `edge_idx * 64 + 4`
+in the dst buffer with stride 8 (so (2) is also satisfied).
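
A meta-construction helper with the contract assert might look like the sketch below. It assumes the bench layout just described (64 bytes per edge, q0 of row 0 at byte 4, stride 8); `lpf_meta`, `lpf_bench_meta` and the two range helpers are hypothetical names, not the actual contents of `bench_v3d_lpf.c`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the shader's uvec4 meta entry: (dst_offset, E, I, H). */
typedef struct { uint32_t x, y, z, w; } lpf_meta;

enum { LPF_BENCH_EDGE_BYTES = 64 };

/* Build the meta entry for one edge in the bench layout. */
lpf_meta lpf_bench_meta(uint32_t edge_idx, uint32_t E, uint32_t I, uint32_t H,
                        uint32_t dst_stride)
{
    lpf_meta m;
    m.x = edge_idx * LPF_BENCH_EDGE_BYTES + 4u;
    m.y = E;
    m.z = I;
    m.w = H;

    /* Contracts 1 and 2. True by construction here, but asserted anyway
     * so that a later layout change cannot silently break them. */
    assert(m.x >= 4u && dst_stride >= 4u);
    return m;
}

/* First and last byte a given row reads (p3 .. q3). */
uint32_t lpf_row_first_byte(lpf_meta m, uint32_t row, uint32_t stride)
{
    return m.x + row * stride - 4u;
}

uint32_t lpf_row_last_byte(lpf_meta m, uint32_t row, uint32_t stride)
{
    return m.x + row * stride + 3u;
}
```

With stride 8, edge 0 spans bytes 0..63 and edge 1 starts at byte 64, so neighbouring edges do not share any byte in this layout.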
+
+## 5. Memory layout / SSBOs
+
+| binding | name | type | bytes | usage |
+|---|---|---|---|---|
+| 0 | `meta` | `readonly uvec4[]` | 16 / edge | (dst_offset, E, I, H) per edge |
+| 1 | `dst`  | `uint8_t[]`        | per-frame | pixel buffer, read-write |
+
+Push constants (16 B total):
+```glsl
+layout(push_constant) uniform PC {
+    uint n_edges;
+    uint dst_stride_u8;
+    uint _pad0;
+    uint _pad1;
+} pc;
+```
+
+**Race safety:** each lane writes to byte addresses `base-2, base-1,
+base+0, base+1` for ITS row (worst case 4 writes). Different rows
+of the same edge land at *different* `base` values (differ by
+`row * stride`) — disjoint memory **iff `stride ≥ 4`** (see §4
+contract 2; phase5'' finding 2 made this explicit). Different
+edges have disjoint `m.x` values by construction. No multi-lane
+write to the same byte under the stated contracts. Race-free
+without atomics.
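
The "disjoint iff stride ≥ 4" claim is small enough to brute-force. The sketch below marks every byte each of the 8 rows may write and reports whether any byte is claimed twice. The function name is hypothetical, and the upper bound on `stride` exists only to size the scratch array in this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when no byte is written by more than one of the 8 rows
 * of a single edge. Offsets are taken relative to (m.x - 2), so row r
 * may write bytes r*stride + 0 .. r*stride + 3. */
bool lpf_rows_write_disjoint(uint32_t stride)
{
    uint8_t hits[7 * 64 + 4] = {0};
    assert(stride <= 64u);                        /* bound for this sketch only */

    for (uint32_t row = 0; row < 8; row++) {
        for (uint32_t k = 0; k < 4; k++) {        /* base-2, base-1, base+0, base+1 */
            if (hits[row * stride + k]++)
                return false;                     /* byte already claimed by another row */
        }
    }
    return true;
}
```

Strides 0 to 3 report an overlap and every stride from 4 upward is clean, which matches contract 2 in §4.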
+
+## 6. Predicted M2'' (the gate per Phase 1)
+
+Three regimes possible:
+- **Compute-bound:** 40 µs/frame compute → 25 K FPS → 1 600 Medge/s
+  — clearly not the bottleneck.
+- **Bandwidth-bound:** 6.2 MB / 4 GB/s = 1.55 ms/frame → 645 FPS
+  → **42 Medge/s** (at 64 530 edges/frame). R'' = 42 / 48.3 ≈ **0.87**.
+- **Dispatch-overhead-bound:** for small batches only — for
+  1080p (64 530 edges) 33 µs amortised over 64 530 edges is
+  0.5 ns/edge → negligible vs the 20 ns NEON floor.
+
+**Predicted M2'' band (1080p frame batches): R'' ≈ 0.5 – 0.9.**
+The bandwidth ceiling at R = 0.87 is the optimistic case; v3d_compiler
+and Vulkan-compute overhead realistically pull it down 20-30 %.
+
+Honest lower bound: R'' = 0.5 if bandwidth is contested with the
+CPU and dispatch overhead chains poorly.
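
The three regime figures follow from the per-frame times in §2 by simple unit conversion. A minimal sketch of that conversion, with hypothetical helper names and all inputs (frame times, NEON baseline, dispatch overhead) taken from this document as given:

```c
#include <assert.h>

/* Edge throughput in Medge/s for a given per-frame time in seconds. */
double lpf_medge_per_s(double n_edges, double frame_s)
{
    assert(frame_s > 0.0);
    return n_edges / frame_s / 1e6;
}

/* Ratio R'' = QPU throughput / NEON throughput (same units). */
double lpf_ratio(double qpu_medge_s, double neon_medge_s)
{
    assert(neon_medge_s > 0.0);
    return qpu_medge_s / neon_medge_s;
}

/* Dispatch overhead amortised over one batch, in ns per edge. */
double lpf_dispatch_ns_per_edge(double dispatch_s, double n_edges)
{
    assert(n_edges > 0.0);
    return dispatch_s / n_edges * 1e9;
}
```

Plugging in 1.55 ms per frame gives about 42 Medge/s, 40 µs gives about 1 600 Medge/s, and 33 µs over 64 530 edges gives about 0.5 ns per edge.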
+
+**What would invalidate the prediction:** divergence on the `fm`
+and `hev` branches splits the subgroup into 2-4 paths; if v3d
+serialises divergent lanes more aggressively than expected, the
+per-lane wall-clock could double relative to the worst case
+predicted by the flat compute model. Phase 7'' will measure.
+
+**Divergence handling on V3D** (phase5'' finding 3): on V3D 7.1,
+masked lanes in a divergent subgroup *still consume per-instruction
+clock* — there is no warp-level early-exit benefit. The natural
+branching structure in §4 (`if (!fm) return;` plus hev select)
+is correct as written. **Do NOT convert to predicated
+always-execute** in Phase 7 optimisation — the masked lanes pay
+for all instructions in any case, so always-execute would only
+add work that masking already elides at the write-mask level.
+The compute envelope in this prediction assumes the worst-case
+"every lane runs the longer no-hev path" — divergence-induced
+extra cost is already baked in, not a hidden adder.
+
+## 7. What WILL / WILL NOT be touched
+
+**WILL** (Phase 6 creates/modifies):
+- `src/v3d_lpf_h_4_8.comp` — the GLSL compute shader
+- `tests/bench_v3d_lpf.c` — bit-exact + throughput harness
+  (mirrors `bench_v3d_idct.c` shape). **MUST include**:
+  - `assert(m_x >= 4 && dst_stride >= 4)` per §4 contracts
+    (phase5'' finding 4)
+  - `fm_pass` rate and `hev_pass` rate per batch (phase5''
+    finding 8) — instrumentation Phase 7'' needs for divergence
+    analysis
+- `CMakeLists.txt` — add shader compilation + bench target
+- `tests/bench_concurrent.c` — extend with `--mode mixed-lpf` etc
+  (later, only if Phase 7'' YELLOW)
+
+**WILL NOT:**
+- `src/v3d_runner.{c,h}` — works as-is for any compute kernel
+- `tests/vp9_lpf_ref.c`, `tests/bench_neon_lpf.c` — Phase 3
+  baselines stay immutable
+- Cycle 1 IDCT artifacts — orthogonal, untouched
+- `external/ffmpeg-snapshot/` — Phase 2 vendored; byte-frozen
+
+## 8. Phase 5'' review prep
+
+Mandatory per `dev_process.md` ("Reviews are never skippable", per
+user-global CLAUDE.md). Cycle-1 phase 5 caught 2 RED bugs; cycle 2
+deserves the same outside look.
+
+Files for the reviewer to read verbatim:
+- `docs/k2_deblock_phase1.md` (goal)
+- `docs/k2_deblock_phase2.md` (situation, refs)
+- `docs/k2_deblock_phase3.md` (baseline M3'')
+- `docs/k2_deblock_phase4.md` (this file)
+- `tests/vp9_lpf_ref.c` (the C ref the QPU must match)
+- `tests/bench_neon_lpf.c` (M3'' methodology)
+- `phase4.md` + `phase5.md` (cycle 1 — context for what was
+  already reviewed)
+- `phase7.md` + `phase7_M4.md` (cycle 1 — lessons)
+
+Specific review prompts (the high-risk decisions):
+
+1. **Orientation correctness.** §4 pseudocode mirrors
+   `tests/vp9_lpf_ref.c` line-for-line. Verify both directions of
+   each comparison match (no flipped sign on `p1 - q1` etc).
+   This is the canonical "bit-exact will fail on first run" trap.
+2. **Race safety claim in §5.** Convincing? Different rows of the
+   same edge land at offsets `m.x + r * stride` for r = 0..7 —
+   guaranteed disjoint? What if `stride < 8`? (Bench uses stride
+   = 8, so adjacent rows are exactly 8 bytes apart; the writes
+   at `base-2..base+1` span 4 bytes — fits within the row's
+   8-byte stride. ✓ unless I'm missing something.)
+3. **Divergence cost.** `fm` test fails → entire lane returns
+   early. `hev` test selects between 2-pixel and 4-pixel paths.
+   Within a 16-lane subgroup, mixed outcomes are common. Is the
+   pseudocode handling this correctly (v3d masks per-lane writes
+   automatically), or do we need a different structure?
+4. **`base - 4u` underflow assumption.** §4 contracts `m.x ≥ 4`.
+   Robust enough? What if a future caller violates it — silent
+   pixel-buffer-underread? Worth an assert in the bench-side
+   harness when constructing meta.
+5. **Anything missing.** Same prompt as cycle 1.
+
+## 9. Phase 6'' execution order
+
+If Phase 5'' approves:
+1. Write `src/v3d_lpf_h_4_8.comp` (GLSL shader from §4)
+2. Write `tests/bench_v3d_lpf.c` (clone of `bench_v3d_idct.c`,
+   swap kernel + meta layout)
+3. CMake wiring
+4. Build, run M1''
+5. If 100 % bit-exact → run M2'', compute R''
+6. Per Phase 1 decision table:
+   - R'' ≥ 0.5 → run M4''
+   - R'' < 0.5 → still run M4'' per cycle-1 calibration adjustment
+7. Phase 7'' verdict → Phase 9 lessons → cycle 3 (CDEF? MC?
+   another kernel) OR an honest close after cycle 2.
+
+## 10. Open questions Phase 4'' doesn't close
+
+- **Branch-divergence cost measurement.** Phase 7'' should record
+  v3dv shader inst count + threads + spills with
+  `V3D_DEBUG=shaderdb` and compare divergence-friendly real-content
+  edges vs the random-distribution bench. If real-content has very
+  uniform branches (e.g., all-pass-`fm` runs), per-frame perf
+  improves over the predicted band.
+- **Per-edge meta packing.** Cycle 1 v5 showed that manually
+  packing storage didn't help. Skip the pre-emptive optimisation
+  here.
+- **Vertical variant.** `v_4_8` (vertical edges) has a different
+  memory access pattern (column-strided reads). Cycle 2 v2 if
+  v1 succeeds.
+- **wd=8 / wd=16 paths.** Bigger filters with more conditional
+  branches. Cycle 3+ if cycle 2 succeeds.