diff --git a/docs/k8_h264deblock_phase4.md b/docs/k8_h264deblock_phase4.md
new file mode 100644
index 0000000..488f2a4
--- /dev/null
+++ b/docs/k8_h264deblock_phase4.md
@@ -0,0 +1,246 @@
+---
+cycle: 8
+phase: 4
+status: draft, awaiting Phase 5 review
+date_opened: 2026-05-18
+parent: k8_h264deblock_phase3.md
+predicted_R: 0.09-0.14 (ORANGE)
+---
+
+# Cycle 8, Phase 4: H.264 deblock QPU shader plan
+
+Plan a Vulkan compute shader for the H.264 luma vertical deblock
+filter (the "v_loop_filter": vertical filtering across a
+horizontal edge). Follows the cycle 2 LPF wd=4 shader template
+(`src/v3d_lpf_h_4_8.comp`) with H.264-specific adjustments.
+
+## Kernel contract (recap)
+
+Per H.264 spec §8.7.2.3 (luma filtering for samples adjacent to
+a horizontal edge, bS<4):
+
+Inputs:
+- pix: pointer to (row 0, col 0) of the bottom block
+- stride: bytes between rows
+- alpha, beta: thresholds (uint8 range)
+- tc0[4]: int8 per-segment strengths; segment s covers cols
+  4s..4s+3; tc0[s] = -1 means skip filter for that segment
+
+Per column c (c = 0..15):
+1. Read p3, p2, p1, p0 from pix[-4*stride..-1*stride] at col c
+   Read q0, q1, q2, q3 from pix[0..+3*stride] at col c
+2. tc0_s = tc0[c >> 2]; if tc0_s < 0, skip
+3. Edge precondition: |p0-q0| < alpha && |p1-p0| < beta &&
+   |q1-q0| < beta; otherwise skip
+4. ap = |p2-p0|, aq = |q2-q0|
+5. tc = tc0_s + (ap < beta) + (aq < beta)
+6. delta = clip3(((q0-p0)*4 + (p1-q1) + 4) >> 3, -tc, tc)
+7. p0' = clip255(p0 + delta), q0' = clip255(q0 - delta)
+8. If ap < beta:
+   p1' = p1 + clip3((p2 + ((p0+q0+1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s)
+9. If aq < beta:
+   q1' = q1 + clip3((q2 + ((p0+q0+1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s)
+10. Write back p1', p0', q0', q1' to pix[-2*stride..+1*stride] at col c
+
+Here clip3(v, lo, hi) clamps v to [lo, hi] (same argument order as
+the C ref). p3 and q3 are read for the contract but are not used by
+the bS<4 filter.
+
+## Lane decomposition
+
+Following the cycle 2 LPF wd=4 pattern (256 inv/WG), regrouped for
+16-column edges:
+- 256 invocations per workgroup
+- 16 lanes per edge (one lane per column 0..15)
+- 16 edges per WG (256/16)
+
+Lane mapping:
+- `gid = gl_GlobalInvocationID.x`
+- `lane_in_wg = gid & 255u`
+- `edge_in_wg = lane_in_wg >> 4` // 0..15 (16 edges/WG)
+- `col_in_edge = lane_in_wg & 15u` // 0..15
+- `edge_idx = wg_id * 16u + edge_in_wg`
+
+(Cycle 2 used 32 edges/WG with 8 lanes/edge. Here 16 edges/WG with
+16 lanes/edge gives the same total of 256 invocations per WG and
+matches H.264 deblock's 16-column edge width.)
+
+## SSBO layout
+
+- `Meta[i]`: one `uvec4` per edge. A first draft packed alpha,
+  beta, tc0[0] and tc0[1] into a single `params` word in `.y`, but
+  that leaves room for only 2 of the 4 tc0 values, so tc0 gets its
+  own word (layout below).
+- `Dst[]`: uint8_t SSBO via `GL_EXT_shader_8bit_storage`
+
+Meta refined:
+- `meta[i].x` = dst_off_bytes (byte offset of row 0, col 0 of the edge)
+- `meta[i].y` = alpha | (beta << 8)
+- `meta[i].z` = packed tc0 (4 int8, tc0[s] in bits 8s..8s+7);
+  shader extracts via shift, mask and sign-extend
+- `meta[i].w` = 0 (reserved)
+
+## Push constants
+
+```glsl
+layout(push_constant) uniform PC {
+    uint n_edges;
+    uint dst_stride_u8;
+    uint _pad0;
+    uint _pad1;
+} pc;
+```
+
+## Shader pseudo-code (Phase 5 review pending)
+
+```glsl
+#version 450
+#extension GL_EXT_shader_8bit_storage : require
+#extension GL_EXT_shader_explicit_arithmetic_types : require
+
+layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;
+
+layout(binding = 0) readonly buffer Meta { uvec4 meta[]; } u_meta;
+layout(binding = 1) buffer Dst { uint8_t dst[]; } u_dst;
+
+layout(push_constant) uniform PC {
+    uint n_edges;
+    uint dst_stride_u8;
+    uint _pad0;
+    uint _pad1;
+} pc;
+
+void main()
+{
+    uint gid = gl_GlobalInvocationID.x;
+    uint wg_id = gl_WorkGroupID.x;
+    uint lane_in_wg = gid & 255u;
+    uint edge_in_wg = lane_in_wg >> 4;
+    uint col_in_edge = lane_in_wg & 15u;
+
+    uint edge_idx = wg_id * 16u + edge_in_wg;
+    if (edge_idx >= pc.n_edges) return; // safe: no barrier follows
+
+    uvec4 m = u_meta.meta[edge_idx];
+    uint dst_off = m.x + col_in_edge;   // byte offset of q0 at this column
+    uint stride = pc.dst_stride_u8;
+    int alpha = int(m.y & 0xffu);
+    int beta = int((m.y >> 8) & 0xffu);
+
+    // Unpack tc0: 4 int8 packed in m.z, segment = col_in_edge >> 2
+    uint seg = col_in_edge >> 2;
+    uint tc0_byte = (m.z >> (seg * 8u)) & 0xffu;
+    int tc0_s = int(tc0_byte);
+    if (tc0_s >= 128) tc0_s -= 256; // sign-extend
+
+    if (alpha == 0 || beta == 0) return;
+    if (tc0_s < 0) return; // segment skip
+
+    // Read 8 rows of context (p3..p0, q0..q3) at this column.
+    // p3 and q3 are not used by the bS<4 filter; they are read only
+    // to mirror the kernel contract.
+    int p3 = int(u_dst.dst[dst_off - 4u * stride]);
+    int p2 = int(u_dst.dst[dst_off - 3u * stride]);
+    int p1 = int(u_dst.dst[dst_off - 2u * stride]);
+    int p0 = int(u_dst.dst[dst_off - 1u * stride]);
+    int q0 = int(u_dst.dst[dst_off]);
+    int q1 = int(u_dst.dst[dst_off + 1u * stride]);
+    int q2 = int(u_dst.dst[dst_off + 2u * stride]);
+    int q3 = int(u_dst.dst[dst_off + 3u * stride]);
+
+    // Edge preconditions.
+    if (abs(p0 - q0) >= alpha) return;
+    if (abs(p1 - p0) >= beta) return;
+    if (abs(q1 - q0) >= beta) return;
+
+    int ap = abs(p2 - p0);
+    int aq = abs(q2 - q0);
+    bool ap_lt = ap < beta;
+    bool aq_lt = aq < beta;
+    int tc = tc0_s + int(ap_lt) + int(aq_lt);
+
+    int delta = clamp(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);
+    int p0p = clamp(p0 + delta, 0, 255);
+    int q0p = clamp(q0 - delta, 0, 255);
+
+    // p1/q1 are only modified when the corresponding side is smooth.
+    int p1p = p1;
+    if (ap_lt) {
+        int d_p1 = clamp((p2 + ((p0 + q0 + 1) >> 1) - 2*p1) >> 1, -tc0_s, tc0_s);
+        p1p = p1 + d_p1;
+    }
+    int q1p = q1;
+    if (aq_lt) {
+        int d_q1 = clamp((q2 + ((p0 + q0 + 1) >> 1) - 2*q1) >> 1, -tc0_s, tc0_s);
+        q1p = q1 + d_q1;
+    }
+
+    u_dst.dst[dst_off - 2u * stride] = uint8_t(p1p);
+    u_dst.dst[dst_off - 1u * stride] = uint8_t(p0p);
+    u_dst.dst[dst_off              ] = uint8_t(q0p);
+    u_dst.dst[dst_off + 1u * stride] = uint8_t(q1p);
+}
+```
+
+## V3D substrate fit
+
+Per `docs/phase0.md`:
+- 16 KB shared: not needed (no inter-lane data sharing)
+- ≤ 8 SSBOs: 2 used (meta, dst). Comfortable.
+- subgroupSize = 16: 16 cols/edge = 1 subgroup per edge. Good fit.
+- No DP4A: doesn't matter here; H.264 deblock is per-pixel scalar
+- No shaderFloat16/Int8 ALU: all int math; uint8 dst via 8bit_storage
+
+## Predicted shaderdb stats
+
+- ~150-200 instructions (alpha/beta gating + tc0 conditional +
+  multiple writes per lane)
+- 2-3 threads (alpha/beta condition tracking + 8 pixel context
+  variables + intermediate p0', q0', p1', q1' = high register
+  pressure)
+- 0 loops, 0 spills (hopefully)
+- ~20 uniforms (push consts + constants)
+
+## Phase 5 review focus
+
+Items for the Sonnet second-model audit:
+
+1. **tc0 sign-extension**: is `if (tc0_s >= 128) tc0_s -= 256`
+   correct? The byte is masked to 0..255 before the uint→int cast,
+   so the cast cannot go negative and the subtract does the sign
+   extension. Alternative: pack tc0 as an int32 array in meta with
+   the sign already encoded.
+
+2. **Multiple early-return statements**: `if (...) return;` paths
+   for edge preconditions. SAFE here (no barrier follows), but
+   should be documented explicitly to avoid cargo-culting the
+   cycle-1 barrier-before-return UB lesson.
+
+3. **abs() on signed int**: GLSL's `abs(int)` works as expected for
+   negative numbers. Make sure operands are signed int (cast from
+   uint8 first).
+
+4. **clamp() vs clip3**: GLSL clamp(x, lo, hi) = min(max(x, lo), hi),
+   with undefined results if lo > hi. Every call here has lo <= hi
+   (tc >= tc0_s >= 0 after the segment-skip check), so it is
+   equivalent to my C ref's clip3 (which I wrote as
+   `clip3(v, lo, hi) = v < lo ? lo : v > hi ? hi : v`).
+   Match.
+
+5. **Per-segment tc0 LUT**: extracting 4 int8 from a uint32 via
+   shifts is fine but adds 3-4 instructions per lane. Alternative:
+   `meta[i].z = sext_to_int32(tc0[0])` and `.w = sext_to_int32(tc0[1])`
+   etc; this uses more meta storage but avoids unpacking per lane.
+   Tradeoff to weigh.
+
+6. **Edge-case alpha=0 / beta=0 early return**: covered by the
+   spec's outer precondition. Both implementations (NEON + our
+   shader) must bail out before reading pixels (which might be
+   stale if the filter was supposed to skip entirely). Currently
+   the shader bails at lane level. Should it bail at the WG level
+   instead to save dispatching the WG? Probably not; it is easier
+   to let each lane check independently.
+
+7. **dst_off arithmetic**: `m.x + col_in_edge`, then offsets by
+   `stride * N` for the 8 rows. Confirm dst_off is a byte offset
+   (not a pixel index; the two are the same in 8-bit luma).
+
+## Acceptance criteria
+
+- shaderdb predicted ≤ 200 inst, ≥ 2 threads, 0 spills
+- M1 bit-exact (3-way: QPU vs NEON vs C ref); 10000+ edges, both
+  filter-triggering and skip cases sampled
+- M2 captured, R₈ classified per band
+- M4 same-kernel mixed bench measured
+
+## Estimated effort
+
+2-3 hours through Phase 7 closure (similar to cycle 2 LPF wd=4
+build).
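
Review note on the M1 bit-exact check: the plan refers to a C ref but does not show it, so the following is a minimal sketch of what a per-column reference could look like, assuming the kernel contract and the `meta[i].z` packing described in the patch. The names `pack_tc0`, `unpack_tc0` and `deblock_v_luma_col_ref` are hypothetical and not taken from the repository. It also assumes `>>` on negative `int` is an arithmetic shift, which is implementation-defined in C but matches GLSL's behaviour on signed integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* clip3(v, lo, hi): same argument order as the plan's Phase 5 item 4. */
static int clip3(int v, int lo, int hi) { return v < lo ? lo : v > hi ? hi : v; }

/* Hypothetical host-side helper: pack tc0[4] into one word, tc0[s] in
 * bits 8s..8s+7, matching the meta[i].z layout described in the plan. */
static uint32_t pack_tc0(const int8_t tc0[4])
{
    uint32_t w = 0;
    for (int s = 0; s < 4; s++)
        w |= (uint32_t)(uint8_t)tc0[s] << (8 * s);
    return w;
}

/* Mirror of the shader's unpack: shift, mask, then manual sign-extend. */
static int unpack_tc0(uint32_t w, unsigned seg)
{
    int v = (int)((w >> (seg * 8u)) & 0xffu);
    return v >= 128 ? v - 256 : v;
}

/* Filter column c of one horizontal edge (bS<4). pix points at row 0
 * (the q0 row), col 0. Assumes arithmetic >> on negative ints. */
static void deblock_v_luma_col_ref(uint8_t *pix, ptrdiff_t stride, int c,
                                   int alpha, int beta, uint32_t tc0_packed)
{
    assert(c >= 0 && c < 16);
    int tc0_s = unpack_tc0(tc0_packed, (unsigned)c >> 2);
    if (alpha == 0 || beta == 0 || tc0_s < 0)
        return; /* whole edge or this segment is skipped */

    uint8_t *p = pix + c;
    int p2 = p[-3 * stride], p1 = p[-2 * stride], p0 = p[-1 * stride];
    int q0 = p[0], q1 = p[1 * stride], q2 = p[2 * stride];

    if (abs(p0 - q0) >= alpha || abs(p1 - p0) >= beta || abs(q1 - q0) >= beta)
        return; /* edge precondition not met */

    int ap_lt = abs(p2 - p0) < beta;
    int aq_lt = abs(q2 - q0) < beta;
    int tc = tc0_s + ap_lt + aq_lt;
    int avg = (p0 + q0 + 1) >> 1;
    int delta = clip3(((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, -tc, tc);

    if (ap_lt)
        p[-2 * stride] = (uint8_t)(p1 + clip3((p2 + avg - 2 * p1) >> 1, -tc0_s, tc0_s));
    if (aq_lt)
        p[1 * stride] = (uint8_t)(q1 + clip3((q2 + avg - 2 * q1) >> 1, -tc0_s, tc0_s));
    p[-1 * stride] = (uint8_t)clip3(p0 + delta, 0, 255);
    p[0] = (uint8_t)clip3(q0 - delta, 0, 255);
}
```

A 3-way harness could call this once per column for each edge and compare the four written rows against the QPU and NEON output. Keeping the unpack identical to the shader's means the tc0 sign-extension question from the review list is exercised by the same test data.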