---
cycle: 2
phase: 4
status: open (awaiting Phase 5'' review)
date_opened: 2026-05-18
parent: k2_deblock_phase3.md
template_doc: phase4.md (cycle 1)
target_kernel: VP9 loop filter h_4_8 (4-tap inner, horizontal, 8-pixel edge)
expected_artifacts: src/v3d_lpf_h_4_8.comp, tests/bench_v3d_lpf.c, CMakeLists.txt updates
---

# Cycle 2, Phase 4: Plan QPU LPF kernel

This doc is compact. Cycle-1 `phase4.md` covers constraints C1–C10 (carried forward unchanged) and the design-discipline patterns (barrier-safety, uint8_t SSBO race avoidance, contract-before-code). Phase 4'' references those rather than re-deriving them.

## 1. Constraints (carried from cycle 1 phase4.md §1)

All 10 constraints apply unchanged. The relevant subset for LPF:

- C1 (int arithmetic): LPF is integer-only ✓
- C2 (16 KiB shared mem): **LPF needs none** (no transpose, no cross-lane comm)
- C3 (≤8 SSBOs): LPF uses 2 (meta + dst)
- C4 (subgroup ops BASIC+VOTE+BALLOT+SHUFFLE+...): LPF doesn't use any subgroup operation; pure per-lane work
- C7 (M5 dispatch overhead 33 µs): same as IDCT; frame-batching amortises identically
- C10 (bit-exact match required): same gate

## 2. Workload model

Per-edge memory traffic (single edge):

- 8 rows × 8 pixels read = 64 bytes load
- 2–4 pixels written per row × 8 rows = 16–32 bytes write
- Worst case 96 bytes / edge

Per 1080p frame, worst case 64 530 edges:

- 64 530 × 96 B = ~6.2 MB total traffic (cf. IDCT cycle 1: 8 MB)
- At GPU's measured 4 GB/s share: 1.55 ms / frame = 645 FPS-eq (32 % faster than IDCT bandwidth ceiling because traffic is lower)

Per-edge compute (1080p, worst case):

- ~25 ALU ops/lane × 8 lanes/edge (= row count, see §3) = 200 lane-ops/edge × 64 530 / 16 (SIMD wide) ≈ 800 K SIMD-cycles
- At v3d 92 GFLOPS theoretical × 23 % SGEMM-style util = 21 GOPS effective → 40 µs compute per frame
- **Compute (~40 µs) is on the order of the 33 µs dispatch overhead (C7) and far below the 1.55 ms bandwidth time.** LPF is not compute-bound; §6 works through the regimes.

## 3. Workgroup geometry

Bake in the cycle-1 v4 lesson (WG = max 256 invocations) from the start.

- **`local_size_x = 256`** (16 subgroups × 16 lanes)
- Within each subgroup: 2 edges (one per 8-lane half), same block-slot pattern as cycle-1 v4
- Per WG: 16 subgroups × 2 edges = **32 edges**
- Per 1080p (64 530 edges): ⌈64 530 / 32⌉ = **2 017 WGs**
- Per lane: handle one **row** of one edge

Lane decomposition:

```
gid        = gl_GlobalInvocationID.x
wg_id      = gid / 256
lane_in_wg = gid & 255
sg_in_wg   = lane_in_wg >> 4            // 0..15
lane_in_sg = lane_in_wg & 15
edge_slot  = lane_in_sg >> 3            // 0 (lanes 0..7) or 1 (8..15)
row        = lane_in_sg & 7             // 0..7
edge_local = sg_in_wg * 2 + edge_slot   // 0..31 in WG
edge_idx   = wg_id * 32 + edge_local
oob        = edge_idx >= n_edges
```

**No barrier needed.** Each lane is fully independent: no cross-lane data flow, no transpose. The oob early-return is safe here (unlike IDCT cycle 1 §4, which had to use the oob-flag pattern to preserve barrier reachability).

## 4. Per-thread algorithm

```glsl
if (edge_idx >= pc.n_edges) return;   // safe: no barrier follows

uvec4 m    = u_meta.meta[edge_idx];
uint  base = m.x + row * pc.dst_stride_u8;   // m.x = dst byte offset of row-0 col-0 of this edge
int E = int(m.y), I = int(m.z), H = int(m.w);

// Load the 8 pixels straddling the edge: p3..p0 on one side, q0..q3 on the other.
int p3 = int(u_dst.dst[base - 4u]);
int p2 = int(u_dst.dst[base - 3u]);
int p1 = int(u_dst.dst[base - 2u]);
int p0 = int(u_dst.dst[base - 1u]);
int q0 = int(u_dst.dst[base + 0u]);
int q1 = int(u_dst.dst[base + 1u]);
int q2 = int(u_dst.dst[base + 2u]);
int q3 = int(u_dst.dst[base + 3u]);

// Filter mask: this row is filtered only if every test passes.
bool fm = abs(p3-p2) <= I && abs(p2-p1) <= I && abs(p1-p0) <= I &&
          abs(q1-q0) <= I && abs(q2-q1) <= I && abs(q3-q2) <= I &&
          abs(p0-q0)*2 + (abs(p1-q1) >> 1) <= E;
if (!fm) return;

bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
if (hev) {
    // hev path: only p0 and q0 are rewritten.
    int f = clamp(p1 - q1, -128, 127);
    f = clamp(3*(q0-p0) + f, -128, 127);
    int f1 = min(f + 4, 127) >> 3;
    int f2 = min(f + 3, 127) >> 3;
    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
} else {
    // no-hev path: p1, p0, q0 and q1 are rewritten.
    int f = clamp(3*(q0-p0), -128, 127);
    int f1 = min(f + 4, 127) >> 3;
    int f2 = min(f + 3, 127) >> 3;
    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
    int fp = (f1 + 1) >> 1;
    u_dst.dst[base - 2u] = uint8_t(clamp(p1 + fp, 0, 255));
    u_dst.dst[base + 1u] = uint8_t(clamp(q1 - fp, 0, 255));
}
```

Mirrors `tests/vp9_lpf_ref.c` line-for-line. Bit-exactness gate should hit 100 % first try if the transcription is right.

**uint** for `base`: the GLSL `base - 4u` is a `uint - uint` expression; it will wrap around (unsigned underflow) if `m.x < 4`.

**Contracts (revised per phase5'' findings 2 + 4):**

1. The host guarantees `m.x ≥ 4` for every edge.
2. The host guarantees `dst_stride_u8 ≥ 4` for every dispatch. (Required for race safety, see §5; rows `r` and `r+1` write to `[base+r·s−2..base+r·s+1]` and `[base+(r+1)·s−2..base+(r+1)·s+1]`, disjoint iff `s ≥ 4`.)
3. **Phase 6 MUST add `assert(m_x >= 4 && dst_stride >= 4)` in `bench_v3d_lpf.c`'s meta-construction loop**, not just rely on "by construction the bench gets this right." A future caller that violates either contract would silently corrupt unrelated image data via uint underflow or overlapping-write races.

Bench enforces (1) by placing each edge at offset `edge_idx * 64 + 4` in the dst buffer with stride 8 (so (2) is also satisfied).

## 5. Memory layout / SSBOs

| binding | name | type | bytes | usage |
|---|---|---|---|---|
| 0 | `meta` | `readonly uvec4[]` | 16 / edge | (dst_offset, E, I, H) per edge |
| 1 | `dst` | `uint8_t[]` | per-frame | pixel buffer, read-write |

Push constants (16 B total):

```glsl
layout(push_constant) uniform PC {
    uint n_edges;
    uint dst_stride_u8;
    uint _pad0;
    uint _pad1;
} pc;
```

**Race safety:** each lane writes to byte addresses `base-2, base-1, base+0, base+1` for ITS row (worst case 4 writes).
Different rows of the same edge land at *different* `base` values (differ by `row * stride`), which gives disjoint memory **iff `stride ≥ 4`** (see §4 contract 2; phase5'' finding 2 made this explicit). Different edges have disjoint `m.x` values by construction. No multi-lane write to the same byte under the stated contracts. Race-free without atomics. Strictly, this argument covers write-write overlap only: each lane also reads `base-4..base+3`, so with a stride of 4 or 5 one row's reads would overlap a neighbouring row's writes. The bench's stride of 8 keeps the reads and writes of different rows fully disjoint.

## 6. Predicted M2'' (the gate per Phase 1)

Three regimes possible:

- **Compute-bound:** 40 µs/frame compute → 25 K FPS → 1 600 Medge/s. Clearly not the bottleneck.
- **Bandwidth-bound:** 6.2 MB / 4 GB/s = 1.55 ms/frame → 645 FPS → **42 Medge/s** (at 64 530 edges/frame). R'' = 42 / 48.3 ≈ **0.87**.
- **Dispatch-overhead-bound:** for small batches only. For 1080p (64 530 edges), 33 µs amortised over 64 530 edges is 0.5 ns/edge → negligible vs the 20 ns NEON floor.

**Predicted M2'' band (1080p frame batches): R'' ≈ 0.5 – 0.9.** The bandwidth ceiling at R = 0.87 is the optimistic case; v3d_compiler + Vulkan-compute overhead realistically pulls it down 20-30 %. Honest lower bound: R'' = 0.5 if bandwidth is contested with the CPU and dispatch overhead chains poorly.

**What would invalidate the prediction:** divergence on the `fm` and `hev` branches splits the subgroup into 2-4 paths; if v3d serialises divergent lanes more aggressively than expected, the per-lane wall-clock could double relative to the worst case predicted by flat compute. Phase 7'' will measure.

**Divergence handling on V3D** (phase5'' finding 3): on V3D 7.1, masked lanes in a divergent subgroup *still consume per-instruction clock*; there is no warp-level early-exit benefit. The natural branching structure in §4 (`if (!fm) return;` plus hev select) is correct as written. **Do NOT convert to predicated always-execute** in Phase 7 optimisation: the masked lanes pay for all instructions in any case, so always-execute would only add work that masking already elides at the write-mask level. The compute envelope in this prediction assumes the worst-case "every lane runs the longer no-hev path", so divergence-induced extra cost is already baked in, not a hidden adder.

## 7. What WILL / WILL NOT be touched

**WILL** (Phase 6 creates/modifies):

- `src/v3d_lpf_h_4_8.comp`: the GLSL compute shader
- `tests/bench_v3d_lpf.c`: bit-exact + throughput harness (mirrors `bench_v3d_idct.c` shape). **MUST include**:
  - `assert(m_x >= 4 && dst_stride >= 4)` per §4 contracts (phase5'' finding 4)
  - `fm_pass` rate and `hev_pass` rate per batch (phase5'' finding 8), the instrumentation Phase 7'' needs for divergence analysis
- `CMakeLists.txt`: add shader compilation + bench target
- `tests/bench_concurrent.c`: extend with `--mode mixed-lpf` etc (later, only if Phase 7'' YELLOW)

**WILL NOT:**

- `src/v3d_runner.{c,h}`: works as-is for any compute kernel
- `tests/vp9_lpf_ref.c`, `tests/bench_neon_lpf.c`: Phase 3 baselines stay immutable
- Cycle 1 IDCT artifacts: orthogonal, untouched
- `external/ffmpeg-snapshot/`: Phase 2 vendored; byte-frozen

## 8. Phase 5'' review prep

Mandatory per `dev_process.md` ("Reviews are never skippable", per user-global CLAUDE.md). Cycle-1 phase 5 caught 2 RED bugs; cycle 2 deserves the same outside look.

Files for the reviewer to read verbatim:

- `docs/k2_deblock_phase1.md` (goal)
- `docs/k2_deblock_phase2.md` (situation, refs)
- `docs/k2_deblock_phase3.md` (baseline M3'')
- `docs/k2_deblock_phase4.md` (this file)
- `tests/vp9_lpf_ref.c` (the C ref the QPU must match)
- `tests/bench_neon_lpf.c` (M3'' methodology)
- `phase4.md` + `phase5.md` (cycle 1, context for what was already reviewed)
- `phase7.md` + `phase7_M4.md` (cycle 1, lessons)

Specific review prompts (the high-risk decisions):

1. **Orientation correctness.** §4 pseudocode mirrors `tests/vp9_lpf_ref.c` line-for-line. Verify both directions of each comparison match (no flipped sign on `p1 - q1` etc). This is the canonical "bit-exact will fail on first run" trap.
2. **Race safety claim in §5.** Convincing? Different rows of the same edge land at offsets `m.x + r * stride` for r = 0..7. Are they guaranteed disjoint? What if `stride < 8`? (Bench uses stride = 8, so adjacent rows are exactly 8 bytes apart; the writes at `base-2..base+1` span 4 bytes, which fits within the row's 8-byte stride. ✓ unless I'm missing something.)
3. **Divergence cost.** `fm` test fails → entire lane returns early. `hev` test selects between 2-pixel and 4-pixel paths. Within a 16-lane subgroup, mixed outcomes are common. Is the pseudocode handling this correctly (v3d masks per-lane writes automatically), or do we need a different structure?
4. **`base - 4u` underflow assumption.** §4 makes `m.x ≥ 4` a contract. Robust enough? What if a future caller violates it: silent pixel-buffer underread? Worth an assert in the bench-side harness when constructing meta.
5. **Anything missing.** Same prompt as cycle 1.

## 9. Phase 6'' execution order

If Phase 5'' approves:

1. Write `src/v3d_lpf_h_4_8.comp` (GLSL shader from §4)
2. Write `tests/bench_v3d_lpf.c` (clone of `bench_v3d_idct.c`, swap kernel + meta layout)
3. CMake wiring
4. Build, run M1''
5. If 100 % bit-exact → run M2'', compute R''
6. Per Phase 1 decision table:
   - R'' ≥ 0.5 → run M4''
   - R'' < 0.5 → still run M4'' per cycle-1 calibration adjustment
7. Phase 7'' verdict → Phase 9 lessons → cycle 3 (CDEF? MC? another kernel) or an honest close after cycle 2 only.

## 10. Open questions Phase 4'' doesn't close

- **Branch-divergence cost measurement.** Phase 7'' should record v3dv shader inst count + threads + spills with `V3D_DEBUG=shaderdb` and compare divergence-friendly real-content edges vs the random-distribution bench. If real content has very uniform branches (e.g., all-pass-`fm` runs), per-frame perf improves over the predicted band.
- **Per-edge meta packing.** Cycle 1 v5 showed that manually packing storage didn't help. Skip the pre-emptive optimisation here.
- **Vertical variant.** `v_4_8` (vertical edges) has a different memory access pattern (column-strided reads). Cycle 2 v2 if v1 succeeds.
- **wd=8 / wd=16 paths.** Bigger filters with more conditional branches. Cycle 3+ if cycle 2 succeeds.
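
As a host-side cross-check of the §3 lane decomposition and the §4 contracts, the following is a minimal C sketch, not the planned `bench_v3d_lpf.c`. It assumes the bench layout stated in §4 (edge `e` at byte `e * 64 + 4`) and the worst-case 4-byte write footprint from §5. The helper names `wg_count`, `map_lane`, `bench_dst_offset` and `count_write_collisions` are hypothetical and exist only for this illustration. It checks write-write overlap only, not read/write overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { WG_SIZE = 256, EDGES_PER_WG = 32 };

/* Workgroups needed to cover n_edges at 32 edges per workgroup (section 3). */
static uint32_t wg_count(uint32_t n_edges)
{
    return (n_edges + EDGES_PER_WG - 1u) / EDGES_PER_WG;
}

/* Host-side mirror of the section 3 lane decomposition (hypothetical helper). */
static void map_lane(uint32_t gid, uint32_t *edge_idx, uint32_t *row)
{
    uint32_t wg_id      = gid / WG_SIZE;
    uint32_t lane_in_wg = gid & (WG_SIZE - 1u);
    uint32_t sg_in_wg   = lane_in_wg >> 4;    /* 0..15 */
    uint32_t lane_in_sg = lane_in_wg & 15u;
    uint32_t edge_slot  = lane_in_sg >> 3;    /* 0 (lanes 0..7) or 1 (8..15) */
    *row      = lane_in_sg & 7u;              /* 0..7 */
    *edge_idx = wg_id * EDGES_PER_WG + sg_in_wg * 2u + edge_slot;
}

/* Bench placement from section 4: edge e starts at byte e * 64 + 4. */
static uint32_t bench_dst_offset(uint32_t edge_idx)
{
    return edge_idx * 64u + 4u;
}

/*
 * Walks every lane of the dispatch, asserts the two section 4 contracts,
 * and marks the worst-case write footprint base-2 .. base+1 of each lane.
 * Returns the number of byte writes that land on an already-written byte
 * (0 means no write-write overlap), or UINT32_MAX if allocation fails.
 */
static uint32_t count_write_collisions(uint32_t n_edges, uint32_t stride)
{
    assert(n_edges > 0u);
    assert(stride >= 4u);                          /* contract 2 */

    /* Large enough for the last row of the last edge, including base+3. */
    size_t len = (size_t)bench_dst_offset(n_edges - 1u)
               + 7u * (size_t)stride + 4u;
    uint8_t *hits = calloc(len, 1);
    if (!hits)
        return UINT32_MAX;

    uint32_t collisions = 0;
    uint32_t n_lanes = wg_count(n_edges) * WG_SIZE;
    for (uint32_t gid = 0; gid < n_lanes; gid++) {
        uint32_t edge_idx, row;
        map_lane(gid, &edge_idx, &row);
        if (edge_idx >= n_edges)
            continue;                              /* oob early-return */

        uint32_t m_x = bench_dst_offset(edge_idx);
        assert(m_x >= 4u);                         /* contract 1 */

        uint32_t base = m_x + row * stride;
        for (uint32_t b = base - 2u; b <= base + 1u; b++) {
            if (hits[b]++)
                collisions++;
        }
    }
    free(hits);
    return collisions;
}
```

Calling `count_write_collisions(64530, 8)` walks the 2 017 workgroups of a worst-case 1080p frame; under the bench layout it should report no collisions. The asserts sit inside the loop that derives each `m_x`, which is the placement §4 contract 3 asks for in the real harness.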