Cycle 8 Phase 4: H.264 deblock QPU shader plan

Per-column dispatch (16 cols/edge, 16 edges/WG, 256 inv/WG). Follows cycle 2 LPF wd=4 template with H.264-specific adjustments: alpha/beta gating + tc0 per-4-col-segment + ap/aq side conditions for conditional p1/q1 writes.

Predicted shaderdb: ~150-200 inst, 2-3 threads. Predicted R8 = 0.09-0.14 ORANGE (per Phase 3 closure).

7 Phase 5 review items flagged for Sonnet audit:
- tc0 sign-extension semantics
- Multiple early-return safety (no barrier follows, so safe)
- abs() on int operands
- clamp vs clip3 equivalence
- per-segment tc0 LUT extraction tradeoff
- alpha=0/beta=0 outer precondition
- dst_off arithmetic

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 14:40:07 +00:00
parent 436a5c4f74
commit f2ba08e1cf
1 changed files with 246 additions and 0 deletions
@@ -0,0 +1,246 @@
+---
+cycle: 8
+phase: 4
+status: draft, awaiting Phase 5 review
+date_opened: 2026-05-18
+parent: k8_h264deblock_phase3.md
+predicted_R: 0.09-0.14 (ORANGE)
+---
+
+# Cycle 8, Phase 4 — H.264 deblock QPU shader plan
+
+Plan a Vulkan compute shader for H.264 luma vertical deblock
+filter (the "v_loop_filter" — vertical filtering across a
+horizontal edge). Follows cycle 2 LPF wd=4 shader template
+(`src/v3d_lpf_h_4_8.comp`) with H.264-specific adjustments.
+
+## Kernel contract (recap)
+
+Per H.264 spec §8.7.2.3 (filtering process for edges with bS less
+than 4), applied to luma samples adjacent to a horizontal edge:
+
+Inputs:
+- pix: pointer to (row 0, col 0) of the bottom block
+- stride: bytes between rows
+- alpha, beta: thresholds (uint8 range)
+- tc0[4]: int8 per-segment strengths; segment s covers cols
+  4s..4s+3; tc0[s] < 0 (e.g. -1) means skip filter for that segment
+
+Per column c (c = 0..15):
+1. Read p3, p2, p1, p0 from pix[-4*stride..-1*stride] at col c
+   Read q0, q1, q2, q3 from pix[0..+3*stride] at col c
+2. tc0_s = tc0[c >> 2]; if tc0_s < 0, skip
+3. Edge precondition: |p0-q0|<alpha && |p1-p0|<beta && |q1-q0|<beta
+4. ap = |p2-p0|, aq = |q2-q0|; ap<beta and aq<beta gate p1/q1 updates
+5. tc = tc0_s + (ap<beta) + (aq<beta)
+6. delta = clip3(-tc, tc, ((q0-p0)*4 + (p1-q1) + 4) >> 3)
+7. p0' = clip255(p0 + delta), q0' = clip255(q0 - delta)
+8. If ap<beta: p1' = p1 + clip3(-tc0_s, tc0_s, (p2 + ((p0+q0+1)>>1) - 2*p1) >> 1)
+9. If aq<beta: q1' = q1 + clip3(-tc0_s, tc0_s, (q2 + ((p0+q0+1)>>1) - 2*q1) >> 1)
+10. Write back p1', p0', q0', q1' to pix[-2*stride..+1*stride] at col c
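The ten steps above can be sketched as a scalar C reference for a single column. This is a minimal sketch, not the project's actual C ref: the names `deblock_col_ref` and `clip3` are hypothetical, `pix` is assumed to point at q0 for the column, and right shifts of negative values are assumed to be arithmetic (implementation-defined in C, true on mainstream compilers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* clip3 in the spec's argument order: clip3(lo, hi, v). */
static int clip3(int lo, int hi, int v) { return v < lo ? lo : v > hi ? hi : v; }

/* Hypothetical reference: filter one column of a horizontal luma edge
 * with bS < 4. pix points at q0 (row 0 of the bottom block). */
static void deblock_col_ref(uint8_t *pix, ptrdiff_t stride,
                            int alpha, int beta, int tc0_s)
{
    assert(stride > 0);
    if (tc0_s < 0) return;                          /* step 2: segment skip */

    /* step 1: p3 and q3 are not needed by the bS < 4 filter */
    int p2 = pix[-3 * stride], p1 = pix[-2 * stride], p0 = pix[-1 * stride];
    int q0 = pix[0], q1 = pix[1 * stride], q2 = pix[2 * stride];

    /* step 3: edge precondition */
    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return;

    int ap_lt = abs(p2 - p0) < beta;                /* step 4 */
    int aq_lt = abs(q2 - q0) < beta;
    int tc    = tc0_s + ap_lt + aq_lt;              /* step 5 */
    int avg   = (p0 + q0 + 1) >> 1;

    /* step 6; assumes arithmetic >> on negative ints */
    int delta = clip3(-tc, tc, ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3);

    /* steps 8-9 use the unfiltered p0/q0 held in locals */
    if (ap_lt)
        pix[-2 * stride] = (uint8_t)(p1 + clip3(-tc0_s, tc0_s, (p2 + avg - 2 * p1) >> 1));
    if (aq_lt)
        pix[1 * stride]  = (uint8_t)(q1 + clip3(-tc0_s, tc0_s, (q2 + avg - 2 * q1) >> 1));

    /* step 7 */
    pix[-1 * stride] = (uint8_t)clip3(0, 255, p0 + delta);
    pix[0]           = (uint8_t)clip3(0, 255, q0 - delta);
}
```

As a worked case, with p2 = p1 = p0 = 60, q0 = q1 = q2 = 68, alpha = 20, beta = 10 and tc0 = 2, the sketch gives tc = 4, delta = 3, and writes p1' = 62, p0' = 63, q0' = 65, q1' = 66.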
+
+## Lane decomposition
+
+Adapting the cycle 2 LPF wd=4 pattern (which used 256 inv/WG, 32 edges/WG):
+- 256 invocations per workgroup
+- 16 lanes per edge (one lane per column 0..15)
+- 16 edges per WG (256/16)
+
+Lane mapping:
+- `gid = gl_GlobalInvocationID.x`
+- `lane_in_wg = gid & 255u`
+- `edge_in_wg = lane_in_wg >> 4`         // 0..15 (16 edges/WG)
+- `col_in_edge = lane_in_wg & 15u`       // 0..15
+- `edge_idx = wg_id * 16u + edge_in_wg`
+
+(Cycle 2 used 32 edges/WG with 8 lanes/edge. Here 16 edges/WG with
+16 lanes/edge gives the same total of 256 invocations per WG and
+matches H.264 deblock's 16-column edge width.)
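The lane mapping above can be mirrored on the host to sanity-check it. The sketch below is illustrative only: `lane_map`, `map_lane` and `check_lane_map` are hypothetical names, and `wg_id` is derived as `gid / 256`, which assumes a 1-D dispatch with `local_size_x = 256`.

```c
#include <assert.h>
#include <stdint.h>

enum { WG_SIZE = 256, COLS_PER_EDGE = 16, EDGES_PER_WG = WG_SIZE / COLS_PER_EDGE };

typedef struct { uint32_t edge_idx, col_in_edge; } lane_map;

/* Mirror of the shader's index arithmetic for one global invocation id. */
static lane_map map_lane(uint32_t gid)
{
    uint32_t wg_id      = gid / WG_SIZE;   /* gl_WorkGroupID.x in a 1-D dispatch */
    uint32_t lane_in_wg = gid & 255u;
    uint32_t edge_in_wg = lane_in_wg >> 4; /* 0..15 */
    lane_map m;
    m.edge_idx    = wg_id * EDGES_PER_WG + edge_in_wg;
    m.col_in_edge = lane_in_wg & 15u;      /* 0..15 */
    return m;
}

/* Self-check: the mapping reduces to gid >> 4 and gid & 15, so every
 * (edge, column) pair is covered by exactly one lane. */
static void check_lane_map(uint32_t n_lanes)
{
    for (uint32_t gid = 0; gid < n_lanes; gid++) {
        lane_map m = map_lane(gid);
        assert(m.edge_idx == gid >> 4);
        assert(m.col_in_edge == (gid & 15u));
    }
}
```

Because 256 is a multiple of 16, the workgroup split does not change the result: the edge index is simply `gid >> 4`.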
+
+## SSBO layout
+
+- `Meta[i]`: one `uvec4` per edge. A first draft packed alpha, beta,
+  tc0[0] and tc0[1] into a single `params` word, but that leaves no
+  room for tc0[2] and tc0[3]. alpha/beta therefore go in `.y` and all
+  four tc0 bytes in `.z`, with `.w` unused; the exact fields are
+  listed under "Meta refined" below.
+- `Dst[]`: uint8_t SSBO via `GL_EXT_shader_8bit_storage`
+
+Meta refined:
+- `meta[i].x` = dst_off_bytes (pointer to row 0 col 0 of edge)
+- `meta[i].y` = alpha | (beta << 8)
+- `meta[i].z` = packed tc0 (4 int8, tc0[s] in bits 8s..8s+7);
+  shader extracts via shift + mask + sign-extend
+- `meta[i].w` = 0 (reserved)
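A host-side packing routine for this layout might look like the sketch below. `pack_meta` and `unpack_tc0` are hypothetical names, and the byte order (tc0[s] in bits 8s..8s+7) is taken from the shader's `m.z >> (seg * 8u)` extraction rather than from any existing host code.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one edge's metadata into the four 32-bit words of meta[i]. */
static void pack_meta(uint32_t out[4], uint32_t dst_off_bytes,
                      int alpha, int beta, const int8_t tc0[4])
{
    assert(alpha >= 0 && alpha <= 255 && beta >= 0 && beta <= 255);
    uint32_t z = 0;
    for (int s = 0; s < 4; s++)
        z |= (uint32_t)(uint8_t)tc0[s] << (8 * s);  /* tc0[s] -> bits 8s..8s+7 */
    out[0] = dst_off_bytes;                         /* .x */
    out[1] = (uint32_t)alpha | ((uint32_t)beta << 8); /* .y */
    out[2] = z;                                     /* .z */
    out[3] = 0;                                     /* .w reserved */
}

/* C mirror of the shader-side unpack for segment seg (0..3). */
static int unpack_tc0(uint32_t z, uint32_t seg)
{
    int t = (int)((z >> (seg * 8u)) & 0xffu);
    return t >= 128 ? t - 256 : t;
}
```

Round-tripping through `unpack_tc0` is a cheap way to confirm that negative tc0 values (segment skips) survive the packing.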
+
+## Push constants
+
+```glsl
+layout(push_constant) uniform PC {
+    uint n_edges;
+    uint dst_stride_u8;
+    uint _pad0;
+    uint _pad1;
+} pc;
+```
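On the host, the push-constant block can be mirrored by a plain struct of four 32-bit words. The sketch below is an assumption about how the host side could look: `deblock_pc`, `make_pc` and `wg_count` are hypothetical names, and the workgroup count follows from the 16 edges/WG decomposition.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side mirror of the PC block: four uints, 16 bytes in total. */
typedef struct {
    uint32_t n_edges;
    uint32_t dst_stride_u8;
    uint32_t _pad0;
    uint32_t _pad1;
} deblock_pc;

_Static_assert(sizeof(deblock_pc) == 16, "must match the GLSL block size");

static deblock_pc make_pc(uint32_t n_edges, uint32_t stride_bytes)
{
    assert(stride_bytes > 0);
    deblock_pc pc = { n_edges, stride_bytes, 0u, 0u };
    return pc;
}

/* Workgroups needed to cover n_edges at 16 edges per workgroup.
 * Lanes past n_edges in the last workgroup take the early return. */
static uint32_t wg_count(uint32_t n_edges)
{
    return (n_edges + 15u) / 16u;
}
```

The value from `wg_count` would be the x dimension of the dispatch, with y and z set to 1.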
+
+## Shader pseudo-code (pending Phase 5 review)
+
+```glsl
+#version 450
+#extension GL_EXT_shader_8bit_storage              : require
+#extension GL_EXT_shader_explicit_arithmetic_types : require
+
+layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
+
+layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
+layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
+
+layout(push_constant) uniform PC {
+    uint n_edges;
+    uint dst_stride_u8;
+    uint _pad0;
+    uint _pad1;
+} pc;
+
+void main()
+{
+    uint gid          = gl_GlobalInvocationID.x;
+    uint wg_id        = gl_WorkGroupID.x;
+    uint lane_in_wg   = gid & 255u;
+    uint edge_in_wg   = lane_in_wg >> 4;
+    uint col_in_edge  = lane_in_wg & 15u;
+
+    uint edge_idx = wg_id * 16u + edge_in_wg;
+    if (edge_idx >= pc.n_edges) return;   // safe — no barrier follows
+
+    uvec4 m = u_meta.meta[edge_idx];
+    uint dst_off = m.x + col_in_edge;
+    uint stride  = pc.dst_stride_u8;
+    int alpha = int(m.y & 0xffu);
+    int beta  = int((m.y >> 8) & 0xffu);
+
+    // Unpack tc0: 4 int8 in m.z low 32 bits, segment = col_in_edge >> 2
+    uint seg = col_in_edge >> 2;
+    uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
+    int tc0_s = int(tc0_byte);
+    if (tc0_s >= 128) tc0_s -= 256;       // sign-extend
+
+    if (alpha == 0 || beta == 0) return;
+    if (tc0_s < 0) return;                // segment skip
+
+    // Read p3..p0, q0..q3 at this column (p3/q3 are unused when bS<4).
+    int p3 = int(u_dst.dst[dst_off - 4u * stride]);
+    int p2 = int(u_dst.dst[dst_off - 3u * stride]);
+    int p1 = int(u_dst.dst[dst_off - 2u * stride]);
+    int p0 = int(u_dst.dst[dst_off - 1u * stride]);
+    int q0 = int(u_dst.dst[dst_off]);
+    int q1 = int(u_dst.dst[dst_off + 1u * stride]);
+    int q2 = int(u_dst.dst[dst_off + 2u * stride]);
+    int q3 = int(u_dst.dst[dst_off + 3u * stride]);
+
+    // Edge preconditions.
+    if (abs(p0 - q0) >= alpha) return;
+    if (abs(p1 - p0) >= beta)  return;
+    if (abs(q1 - q0) >= beta)  return;
+
+    int ap = abs(p2 - p0);
+    int aq = abs(q2 - q0);
+    bool ap_lt = ap < beta;
+    bool aq_lt = aq < beta;
+    int tc = tc0_s + int(ap_lt) + int(aq_lt);
+
+    int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
+    int p0p = clamp(p0 + delta, 0, 255);
+    int q0p = clamp(q0 - delta, 0, 255);
+
+    int p1p = p1;
+    if (ap_lt) {
+        int d_p1 = clamp((p2 + ((p0 + q0 + 1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s);
+        p1p = p1 + d_p1;
+    }
+    int q1p = q1;
+    if (aq_lt) {
+        int d_q1 = clamp((q2 + ((p0 + q0 + 1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s);
+        q1p = q1 + d_q1;
+    }
+
+    u_dst.dst[dst_off - 2u * stride] = uint8_t(p1p);
+    u_dst.dst[dst_off - 1u * stride] = uint8_t(p0p);
+    u_dst.dst[dst_off            ]  = uint8_t(q0p);
+    u_dst.dst[dst_off + 1u * stride] = uint8_t(q1p);
+}
+```
+
+## V3D substrate fit
+
+Per `docs/phase0.md`:
+- 16 KB shared: not needed (no inter-lane data sharing)
+- ≤ 8 SSBOs: 2 used (meta, dst). Comfortable.
+- subgroupSize = 16: 16 cols/edge = 1 subgroup per edge. Good fit.
+- No DP4A: doesn't matter here; H.264 deblock is per-pixel scalar
+- No shaderFloat16/Int8 ALU: all int math; uint8 dst via 8bit_storage
+
+## Predicted shaderdb stats
+
+- ~150-200 instructions (alpha/beta gating + tc0 conditional +
+  multiple writes per lane)
+- 2-3 threads (alpha/beta condition tracking + 8 pixel context
+  variables + intermediate p0', q0', p1', q1' = high register
+  pressure)
+- 0 loops, 0 spills (target, to be confirmed by shaderdb)
+- ~20 uniforms (push consts + constants)
+
+## Phase 5 review focus
+
+Items for the Sonnet second-model audit:
+
+1. **tc0 sign-extension** — `if (tc0_s >= 128) tc0_s -= 256` —
+   correct? After the `& 0xffu` mask the value is 0..255, so the
+   uint→int cast is value-preserving and the subtract sign-extends.
+   Alternative: store tc0 as int32 in meta, already sign-extended.
+
+2. **Multiple early-return statements** — `if (... ) return;` paths
+   for edge preconditions. SAFE here (no barrier follows), but
+   should document explicitly to avoid cargo-culting the cycle-1
+   barrier-before-return UB lesson.
+
+3. **abs() on signed int** — GLSL's `abs(int)` works as expected for
+   negative numbers. Make sure operands are signed int (cast from
+   uint8 first).
+
+4. **clamp() vs clip3** — GLSL clamp(x, lo, hi) = max(lo, min(hi, x)).
+   Equivalent to my C ref's clip3 (argument order differs: the ref is
+   `clip3(v, lo, hi) = v < lo ? lo : v > hi ? hi : v`) whenever
+   lo <= hi, which holds here because tc0_s >= 0 after the segment skip.
+
+5. **Per-segment tc0 LUT** — extracting 4 int8 from a uint32 via
+   shifts is fine but adds 3-4 instructions per lane. Alternative:
+   `meta[i].z = sext_to_int32(tc0[0])` and `.w = sext_to_int32(tc0[1])`
+   etc — uses more meta storage but avoids unpacking per lane.
+   Tradeoff to weigh.
+
+6. **Edge-case alpha=0 / beta=0 early return** — covered by the
+   spec's outer precondition. Both implementations (NEON + ours)
+   must bail out before reading pixels (which might be stale if the
+   filter was supposed to skip entirely). Currently the shader
+   bails at lane level. Should it bail at the WG level instead to
+   save dispatching the WG? Probably not: alpha/beta are per edge and
+   a WG holds 16 edges, so each lane checks independently.
+
+7. **dst_off arithmetic** — `m.x + col_in_edge` then offsets by
+   `stride * N` for the 8 rows. Confirm dst_off is byte offset
+   (not pixel index — same in 8-bit luma).
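Items 1 and 4 are small enough to check exhaustively on the host before the audit. The sketch below is one way to do that, with hypothetical helper names. `glsl_clamp` models clamp as `min(max(x, lo), hi)`, and the reference cast `(int8_t)` assumes two's-complement wraparound, which is implementation-defined before C23.

```c
#include <assert.h>
#include <stdint.h>

/* Item 1: the shader's mask / compare / subtract sign extension. */
static int sext8_shader(uint32_t byte)
{
    int t = (int)(byte & 0xffu);
    if (t >= 128) t -= 256;
    return t;
}

/* Item 4: clamp modelled as min(max(x, lo), hi). */
static int glsl_clamp(int x, int lo, int hi)
{
    int m = x > lo ? x : lo;
    return m < hi ? m : hi;
}

/* C-ref clip3 in the argument order quoted in item 4. */
static int clip3_ref(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Asserts both equivalences over ranges wider than the kernel can produce:
 * tc is at most 127 + 2, and the unclipped delta stays well within +-1024. */
static void check_items_1_and_4(void)
{
    for (uint32_t b = 0; b < 256; b++)
        assert(sext8_shader(b) == (int8_t)(uint8_t)b);

    for (int tc = 0; tc <= 129; tc++)
        for (int x = -1024; x <= 1024; x++)
            assert(glsl_clamp(x, -tc, tc) == clip3_ref(x, -tc, tc));
}
```

The clamp check only covers lo <= hi, which matches the shader because negative tc0 values return before any clamp is evaluated.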
+
+## Acceptance criteria
+
+- shaderdb predicted ≤ 200 inst, ≥ 2 threads, 0 spills
+- M1 bit-exact (3-way: QPU vs NEON vs C ref); 10000+ edges, both
+  filter-triggering and skip cases sampled
+- M2 captured, R₈ classified per band
+- M4 same-kernel mixed bench measured
+
+## Estimated effort
+
+2-3 hours through Phase 7 closure (similar to cycle 2 LPF wd=4
+build).