Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS

Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly: no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute,
    masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges): the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.

Performance:
  M2'' = 19.645 Medge/s (50.9 ns/edge)
  M3'' = 48.285 Medge/s (NEON baseline from phase3)
  R''  = 0.41 (ORANGE band - doesn't auto-close per cycle-1
         calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps. The shader
is already at the compiler ceiling. No v2/v3/v4 iteration loop like
cycle 1 because there's nothing more to extract from the compiled
shape. The 30x gap between theoretical instruction throughput and
measured wall-clock is divergence-tax + memory latency, not compile
quality.

M4'' concurrent matrix on hertz (8s windows):
  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling
                                         (per-core 7-9; same bandwidth-
                                         saturation as cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens to
"always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
  1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
  2. Phase 5 review pays off every cycle
  3. R isolation misleading on bandwidth-saturated hardware
  4. Oversubscription tax depends on kernel weight
  5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
  - src/v3d_lpf_h_4_8.comp: GLSL kernel
  - tests/bench_v3d_lpf.c: M1'' + M2'' harness with contract asserts
    + fm/hev pass-rate instrumentation
  - tests/bench_concurrent_lpf.c: M4'' pthread bench (mirrors
    bench_concurrent.c)
  - docs/k2_deblock_phase{4,5,7}.md: plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
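
The headline figures in the message (R'', ns/edge, and the two M4'' deltas) all follow from the raw throughputs quoted there. The following is a minimal C sketch of that arithmetic, not code from the repository: the macro and function names (`M2_QPU`, `ratio`, `ns_per_edge`, `pct_delta`, `print_cycle2_summary`) are hypothetical, and the tolerances simply reflect the precision the message prints.

```c
#include <assert.h>
#include <stdio.h>

/* Throughputs in Medge/s, copied from the commit message.
 * The macro names are illustrative only. */
#define M2_QPU    19.645   /* M2'': QPU alone                */
#define M3_NEON   48.285   /* M3'': NEON baseline (phase3)   */
#define NEON4     33.726   /* M4'': NEON-4 LPF               */
#define MIXED_N3  36.049   /* M4'': MIXED NEON-3 + QPU       */
#define MIXED_N4  31.892   /* M4'': MIXED NEON-4 + QPU       */

static double absd(double v) { return v < 0.0 ? -v : v; }

/* R = accelerator throughput divided by baseline throughput. */
static double ratio(double qpu, double baseline) { return qpu / baseline; }

/* Medge/s -> ns per edge: 1e9 ns/s divided by (x * 1e6) edges/s. */
static double ns_per_edge(double medge_per_s) { return 1000.0 / medge_per_s; }

/* Signed percentage change of x relative to base. */
static double pct_delta(double x, double base)
{
    return (x / base - 1.0) * 100.0;
}

/* Recompute the derived figures, check that they agree with the
 * reported values at the printed precision, then print them. */
void print_cycle2_summary(void)
{
    double r    = ratio(M2_QPU, M3_NEON);
    double ns   = ns_per_edge(M2_QPU);
    double up   = pct_delta(MIXED_N3, NEON4);
    double down = pct_delta(MIXED_N4, NEON4);

    assert(absd(r - 0.41) < 0.005);    /* reported R'' = 0.41      */
    assert(absd(ns - 50.9) < 0.05);    /* reported 50.9 ns/edge    */
    assert(absd(up - 6.9) < 0.05);     /* reported +6.9%           */
    assert(absd(down + 5.4) < 0.05);   /* reported -5.4%           */

    printf("R'' = %.2f, %.1f ns/edge, N-1+QPU %+.1f%%, N+QPU %+.1f%%\n",
           r, ns, up, down);
}
```

Calling `print_cycle2_summary()` from a test harness re-derives the numbers; an assertion failure would mean a throughput or a derived figure was mistyped.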
2026-05-18 12:39:26 +00:00
parent be7ff5587c
commit 36eca40ff2
7 changed files with 1436 additions and 1 deletions
@@ -0,0 +1,101 @@
+// daedalus-fourier cycle 2 — VP9 4-tap inner loop filter, horizontal
+// direction, 8-pixel edge. V3D 7.1 via Mesa v3dv compute.
+//
+// Bakes in cycle-1 v4 winning patterns from the start:
+//   - 256 invocations / WG (max), for v3dv latency hiding
+//   - uint8_t dst SSBO via storageBuffer8BitAccess (race-free byte writes)
+//   - 2 "block_slot"s per subgroup pattern (here: 2 edges per 16-lane subgroup)
+//   - NO chained-ternary writes, only direct named-variable writes
+//
+// Differs from cycle-1 IDCT structurally:
+//   - NO barrier — each lane fully independent (one row of one edge)
+//   - NO shared memory — no transpose needed
+//   - oob early-return is SAFE here (no barrier reachability issue)
+//
+// Contracts (per k2_deblock_phase4.md §4, revised per phase5'' findings 2+4):
+//   1. meta[i].x ≥ 4 for every edge — bench enforced via assert
+//   2. pc.dst_stride_u8 ≥ 4 — bench enforced via assert
+//
+// License: BSD-2-Clause. Algorithm transcribed from
+// tests/vp9_lpf_ref.c which mirrors libavcodec/vp9dsp_template.c
+// (vendored LGPL-2.1+).
+
+#version 450
+#extension GL_EXT_shader_8bit_storage              : require
+#extension GL_EXT_shader_explicit_arithmetic_types : require
+
+layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
+
+layout(binding = 0) readonly buffer Meta {
+    uvec4 meta[];   // per edge: (dst_offset_bytes, E, I, H)
+} u_meta;
+
+layout(binding = 1) buffer Dst {
+    uint8_t dst[];
+} u_dst;
+
+layout(push_constant) uniform PC {
+    uint n_edges;
+    uint dst_stride_u8;
+    uint _pad0;
+    uint _pad1;
+} pc;
+
+void main()
+{
+    // Lane / edge decomposition (cycle-1 v4 pattern adapted: 8 lanes
+    // per edge instead of 8 lanes per block; 2 edges per subgroup,
+    // 16 subgroups per WG, 32 edges per WG).
+    uint gid         = gl_GlobalInvocationID.x;
+    uint wg_id       = gid / 256u;
+    uint lane_in_wg  = gid & 255u;
+    uint sg_in_wg    = lane_in_wg >> 4;          // 0..15
+    uint lane_in_sg  = lane_in_wg & 15u;
+    uint edge_slot   = lane_in_sg >> 3;          // 0 (lanes 0..7) or 1 (8..15)
+    uint row         = lane_in_sg & 7u;          // 0..7 — which row of this edge
+
+    uint edge_local  = sg_in_wg * 2u + edge_slot;
+    uint edge_idx    = wg_id * 32u + edge_local;
+
+    // Safe early-return: no barrier follows. Per phase4 §4.
+    if (edge_idx >= pc.n_edges) return;
+
+    uvec4 m = u_meta.meta[edge_idx];
+    uint base = m.x + row * pc.dst_stride_u8;
+    int E = int(m.y), I = int(m.z), H = int(m.w);
+
+    int p3 = int(u_dst.dst[base - 4u]);
+    int p2 = int(u_dst.dst[base - 3u]);
+    int p1 = int(u_dst.dst[base - 2u]);
+    int p0 = int(u_dst.dst[base - 1u]);
+    int q0 = int(u_dst.dst[base + 0u]);
+    int q1 = int(u_dst.dst[base + 1u]);
+    int q2 = int(u_dst.dst[base + 2u]);
+    int q3 = int(u_dst.dst[base + 3u]);
+
+    bool fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I &&
+              abs(p1 - p0) <= I && abs(q1 - q0) <= I &&
+              abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
+              abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
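+    if (!fm) return;   // filter mask rejects this row: leave pixels untouched
+
+    bool hev = abs(p1 - p0) > H || abs(q1 - q0) > H;   // high edge variance
+
+    if (hev) {         // high variance: adjust p0/q0 only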
+        int f  = clamp(p1 - q1, -128, 127);
+        f      = clamp(3 * (q0 - p0) + f, -128, 127);
+        int f1 = min(f + 4, 127) >> 3;
+        int f2 = min(f + 3, 127) >> 3;
+        u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
+        u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
+    } else {           // smooth edge: adjust p0/q0, then p1/q1 by half the step
+        int f  = clamp(3 * (q0 - p0), -128, 127);
+        int f1 = min(f + 4, 127) >> 3;
+        int f2 = min(f + 3, 127) >> 3;
+        u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
+        u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
+        int fp = (f1 + 1) >> 1;
+        u_dst.dst[base - 2u] = uint8_t(clamp(p1 + fp, 0, 255));
+        u_dst.dst[base + 1u] = uint8_t(clamp(q1 - fp, 0, 255));
+    }
+}
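
For checking the kernel off-device, the lane decomposition and the per-row filter in the shader above can be mirrored on the CPU. This is a hedged sketch, not the contents of tests/vp9_lpf_ref.c or tests/bench_v3d_lpf.c. The names `edge_meta`, `decompose`, `lpf_row` and `lpf_dispatch_cpu` are hypothetical. It assumes invocations touch disjoint bytes, so running them sequentially gives the same result as the GPU dispatch, and it assumes `>>` on negative `int` is an arithmetic shift, as on common compilers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors one uvec4 of the shader's meta[] buffer (hypothetical type). */
typedef struct { uint32_t off, E, I, H; } edge_meta;

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }
static int mini(int a, int b) { return a < b ? a : b; }

/* Map a global invocation id to (edge, row) the same way main() does:
 * 256 lanes per workgroup, 16 lanes per subgroup, 8 lanes per edge. */
static void decompose(uint32_t gid, uint32_t *edge_idx, uint32_t *row)
{
    uint32_t wg_id      = gid / 256u;
    uint32_t lane_in_wg = gid & 255u;
    uint32_t sg_in_wg   = lane_in_wg >> 4;
    uint32_t lane_in_sg = lane_in_wg & 15u;
    *row      = lane_in_sg & 7u;
    *edge_idx = wg_id * 32u + sg_in_wg * 2u + (lane_in_sg >> 3);
}

/* Filter one row; base is the index of q0. Same arithmetic as the shader,
 * with the two branches merged: the p1 - q1 term is used only when hev. */
static void lpf_row(uint8_t *dst, uint32_t base, int E, int I, int H)
{
    int p3 = dst[base - 4], p2 = dst[base - 3], p1 = dst[base - 2], p0 = dst[base - 1];
    int q0 = dst[base],     q1 = dst[base + 1], q2 = dst[base + 2], q3 = dst[base + 3];

    int fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I &&
             abs(p1 - p0) <= I && abs(q1 - q0) <= I &&
             abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
             abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
    if (!fm) return;

    int hev = abs(p1 - p0) > H || abs(q1 - q0) > H;
    int f   = hev ? clampi(p1 - q1, -128, 127) : 0;
    f       = clampi(3 * (q0 - p0) + f, -128, 127);
    int f1  = mini(f + 4, 127) >> 3;
    int f2  = mini(f + 3, 127) >> 3;
    dst[base - 1] = (uint8_t)clampi(p0 + f2, 0, 255);
    dst[base]     = (uint8_t)clampi(q0 - f1, 0, 255);
    if (!hev) {
        int fp = (f1 + 1) >> 1;
        dst[base - 2] = (uint8_t)clampi(p1 + fp, 0, 255);
        dst[base + 1] = (uint8_t)clampi(q1 - fp, 0, 255);
    }
}

/* Run every invocation of ceil(n_edges / 32) workgroups in order,
 * enforcing the two contracts listed in the shader header. */
void lpf_dispatch_cpu(uint8_t *dst, const edge_meta *meta,
                      uint32_t n_edges, uint32_t stride)
{
    assert(stride >= 4);                       /* contract 2 */
    uint32_t n_wg = (n_edges + 31u) / 32u;
    for (uint32_t gid = 0; gid < n_wg * 256u; gid++) {
        uint32_t e, row;
        decompose(gid, &e, &row);
        if (e >= n_edges) continue;            /* the shader's oob early-return */
        assert(meta[e].off >= 4);              /* contract 1 */
        lpf_row(dst, meta[e].off + row * stride,
                (int)meta[e].E, (int)meta[e].I, (int)meta[e].H);
    }
}
```

Comparing the output of `lpf_dispatch_cpu` byte-for-byte against the buffer read back from the device is one way to reproduce an M1''-style bit-exactness check; the contract asserts fire before any out-of-range read at `base - 4`.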