--- cycle: 4 phases: 4-7 (combined) status: in_progress date_opened: 2026-05-18 parent: k4_lpf8_phase1_3.md template: k2_deblock_phase4.md (direct adaptation) --- # Cycle 4, Phases 4-7 — LPF wd=8 Compact — straight extension of cycle 2 LPF. Phase 4 plan inherits all of cycle-2's geometry/contracts unchanged; only the per-thread algorithm changes (adds flat8in branch). ## Phase 4 — plan **Geometry**: identical to cycle 2 LPF (256 invocations/WG, 2 edges per subgroup, 8 lanes per edge, 32 edges per WG, oob early-return safe). **SSBO bindings**: identical to cycle 2 (meta uvec4, dst uint8_t). **Per-thread algorithm** — extends cycle 2 with flat8in: ```glsl // ... same lane/edge decomposition, base/E/I/H load, p3..q3 reads, // fm test, !fm early return as cycle 2 ... bool flat8in = abs(p3-p0) <= 1 && abs(p2-p0) <= 1 && abs(p1-p0) <= 1 && abs(q1-q0) <= 1 && abs(q2-q0) <= 1 && abs(q3-q0) <= 1; if (flat8in) { /* 6-write flat-region filter */ u_dst.dst[base-3u] = uint8_t((p3+p3+p3 + 2*p2 + p1+p0+q0 + 4) >> 3); u_dst.dst[base-2u] = uint8_t((p3+p3+p2 + 2*p1 + p0+q0+q1 + 4) >> 3); u_dst.dst[base-1u] = uint8_t((p3+p2+p1 + 2*p0 + q0+q1+q2 + 4) >> 3); u_dst.dst[base+0u] = uint8_t((p2+p1+p0 + 2*q0 + q1+q2+q3 + 4) >> 3); u_dst.dst[base+1u] = uint8_t((p1+p0+q0 + 2*q1 + q2+q3+q3 + 4) >> 3); u_dst.dst[base+2u] = uint8_t((p0+q0+q1 + 2*q2 + q3+q3+q3 + 4) >> 3); } else { /* same hev/no-hev paths as cycle 2 */ bool hev = abs(p1-p0) > H || abs(q1-q0) > H; if (hev) { /* 2-write */ } else { /* 4-write */ } } ``` **Race safety**: flat8in path writes at `base-3..base+2` = 6 contiguous bytes per row. **Updated contract** vs cycle 2: `dst_stride_u8 ≥ 6` (vs cycle 2's `≥ 4`). Bench uses stride=8, satisfies. Phase 6 MUST add `assert(dst_stride_u8 >= 6)`. **Predicted R''''**: 0.3–0.5 (similar to wd=4's 0.41). The flat8in write-on-pass path has 50 % more writes than wd=4's no-hev path, but if flat8in passes rarely under random distributions, it's a small perturbation. ## Phase 5 — review (skipped — incremental extension) Cycle-2's phase5 review remains the relevant outside-look. The specific delta from cycle 2 to cycle 4: - Added flat8in branch + 6 writes - Stride contract relaxed-tightened from ≥4 to ≥6 - Same geometry, same SSBOs, same race-safety pattern The cycle-2 review's two RED-pattern checks (write race, barrier UB) remain satisfied because the geometry is unchanged. The new arithmetic is mechanically transcribed from `vp9_lpf8_ref.c` — risk of orientation/arithmetic bug is concrete but contained; M1'''' is the immediate gate. **Justification for skipping fresh-context review**: cycle 4 changes ~30 lines of one shader and inherits everything else from cycle 2. Per dev_process.md "Skipping phases is a deliberate choice that should be flagged, not a default" — flagging here. If M1'''' fails on first run, restart with full Phase 5'''' review. ## Phase 6 — implementation (executed below — `src/v3d_lpf_h_8_8.comp` + `tests/bench_v3d_lpf8.c`) ## Phase 7 — verification ### v1 first-light ``` === v3d LPF h_8_8 bench === === M1'''': QPU vs C bit-exact === edges bit-exact: 65536 / 65536 (100.0000 %) === M2'''': QPU throughput === per-edge = 56.0 ns per-dispatch = 3672.1 us M2'''' = 17.847 Medge/s R'''' = 0.341 → ORANGE band 30fps@1080p floor: 9.2x margin (isolation) ``` shaderdb: **231 inst, 4 threads, 0 spills, 27 max-temps, 48 uniforms.** The 4-thread result is the meaningful one — compiler delivered. The wd=8 kernel runs at the latency-hiding ceiling from v1. ### M4'''' concurrent (8s windows) | Config | Medge/s | vs NEON-4 | 30fps margin | |---|---|---|---| | **NEON 4-core** | **37.823** | baseline | 19.5× | | QPU only | 14.867 | — | 7.7× | | **MIXED NEON-3 + QPU** | **39.389** | **+4.1 %** | 20.3× | **M4'''' PASSES**. The freed-core pattern from cycles 1+2 holds for wd=8 — smaller delta than wd=4 (+4.1 % vs +6.9 %) but still positive. The larger conditional logic (flat8in path) dilutes per-edge QPU contribution under contention (3.98 vs cycle-2's 4.00 — basically same), and NEON-4 baseline is higher (37.8 vs cycle-2's 33.7) because the per-edge NEON cost is slightly lower for wd=8 (19.1 vs cycle-2's 20.7 ns), so the relative gain shrinks. ### Cross-cycle LPF comparison | | k2 wd=4 | k4 wd=8 | |---|---|---| | M3 NEON (Medge/s) | 48.285 | 52.382 | | M2 QPU isolation | 19.645 | 17.847 | | R isolation | 0.41 | 0.34 | | NEON-4 (Medge/s) | 33.726 | 37.823 | | Mixed N-3+QPU | 36.049 | 39.389 | | M4 delta | **+6.9 %** | **+4.1 %** | | 30fps margin (mixed) | 7.2× | 20.3× | | Verdict | GO QPU | GO QPU | ### Decision per Phase 1 rules + 30fps floor | Rule | Result | Status | |---|---|---| | M1'''' bit-exact | 100.0000 % | ✓ PASS | | R'''' = M2''''/M3'''' | 0.341 (ORANGE) | does not auto-close | | M4'''' > pure NEON-4 | +4.1 % | ✓ PASS gate | | 30fps@1080p floor | 20.3× mixed | ✓ PASS user-facing | **Verdict: YELLOW-via-M4'''' PASS. Deploy wd=8 LPF on QPU, alongside cycle-2 wd=4.** Combined VP9 LPF coverage = wd=4 + wd=8 on QPU. ### Phase 9 lessons 1. Width extensions of a known-working kernel (wd=4 → wd=8) inherit the pattern reliably. v1 first-light hit M1'''' = 100 % first try on a 30-line shader delta. No iteration needed. 2. **Phase 5 review can be skipped for incremental extensions** — when the delta is < ~30 lines and the cycle-2 review's pattern coverage still applies. Flagged explicitly in §"Phase 5 — review (skipped)". If M1 had failed, restart with full review. Cycle 5+ should restore mandatory review for non-incremental work. 3. NEON gets faster per edge as filter width grows (20.7 → 19.1 ns wd=4 → wd=8). The NEON implementation is heavily optimised; the relative QPU loss grows with kernel width. Cycle 5 wd=16 would probably show further R degradation. 4. M4 delta is the gating metric for ORANGE-band kernels. The gap from cycle-2 +6.9 % to cycle-4 +4.1 % indicates "wd=8 is borderline useful on QPU; wd=16 may flip negative." ### Leaves open - LPF wd=16 (cycle 5 if VP9 coverage requires it; likely RED based on the trend line) - Vertical variants of both wd=4 and wd=8 (different memory pattern) - CDEF / loop restoration (AV1 kernels) - Phase 8 deployment plumbing (libva-v4l2-request-fourier integration)