Cycle 5 Phase 4: QPU CDEF shader plan (predicted deep RED)

Per-block stencil: 12 constrain ops per pixel, 64 pixels per block, 4 blocks/WG, 256 invocations/WG. Predicted R5 = 0.03 (deep RED) from cycle-3 MC scaling. Plan calls out 5 Phase 5 review items, notably sentinel handling and signed/unsigned min/max distinction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:47:18 +00:00
parent 9c0bd72e70
commit 1740e7c165
1 changed files with 227 additions and 0 deletions
@@ -0,0 +1,227 @@
 ---
 cycle: 5
 phase: 4
 status: draft, awaiting Phase 5 review
 date_opened: 2026-05-18
 parent: k5_cdef_phase3.md
 predicted_R: 0.02-0.05 (deep RED)
 ---
 # Cycle 5, Phase 4 — QPU CDEF shader plan
 Plan a Vulkan compute shader for the AV1 CDEF primary+secondary
 8×8 luma filter on V3D 7.1. Predicted **deep RED** (R₅ = 0.02-0.05);
 plan + build it anyway because:
 - Confirms the prediction with measured data (or refutes it).
 - Provides the dispatch path needed for Phase 8 V4L2 wrapper.
 - Closes cycle 5 (Phases 1-7 all on the record).
 ## Kernel shape (NEON reference: 262 ns/block)
 Per 8×8 output block: an 8-direction table with 2 offsets per
 direction (one per k-round). For each output pixel, each of the
 2 k-rounds (different tap weights) reads:
 - 2 primary taps (off1, -off1) using `dir`
 - 4 secondary taps (off2, -off2, off3, -off3) using `(dir+2)%8` and `(dir-2+8)%8`
 That is 6 taps × 2 k-rounds = 12 `constrain()` ops per pixel;
 12 × 64 pixels = **768 constrain ops per block**, plus min/max
 bookkeeping for iclip.
 The constrain math:
 ```
 diff = p - px;
 adiff = abs(diff);
 clip = max(0, threshold - (adiff >> shift));
 constrained = sign(diff) * min(adiff, clip);
 sum += tap * constrained;
 ```
 Output: `dst[r,c] = clamp(px + ((sum - (sum<0) + 8) >> 4), min, max);`
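The constrain step and the final rounding can be checked on the CPU with a small scalar model. This is a minimal sketch, not the NEON reference itself: the function names `constrain` and `cdef_output` and their parameter names are assumptions made for illustration, and the rounding relies on `>>` being an arithmetic shift for negative `int` values (true on mainstream compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar model of the constrain math above.
 * diff      = p - px
 * threshold = primary or secondary strength (assumed name)
 * shift     = matching damping shift, must be >= 0 */
static int constrain(int diff, int threshold, int shift)
{
    assert(shift >= 0);
    int adiff = abs(diff);
    int clip = threshold - (adiff >> shift);
    if (clip < 0)
        clip = 0;                       /* clip = max(0, ...) */
    int mag = adiff < clip ? adiff : clip;
    return diff < 0 ? -mag : mag;       /* sign(diff) * min(adiff, clip) */
}

/* Final step: round sum/16 to nearest (ties away from zero),
 * add to the centre pixel, then clamp to [mn, mx]. */
static int cdef_output(int px, int sum, int mn, int mx)
{
    int v = px + ((sum - (sum < 0) + 8) >> 4);
    if (v < mn)
        return mn;
    return v > mx ? mx : v;
}
```

For example, `constrain(-5, 4, 2)` clips the magnitude to 3 and keeps the sign, and a large difference such as `constrain(-10, 4, 1)` is suppressed to 0.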
 ## V3D substrate fit (phase0 constraints)
 - **No DP4A**: each constrain is scalar int math; no vector packing
  helps (per cycle 3 MC finding). Predicted instruction count
  proportional to ops.
 - **16KB shared**: not needed. Each pixel computes independently, so
  there is no row sharing on the compute side (tmp is read-only input).
 - **subgroupSize=16**: 1 pixel per lane × 16 lanes/sg = 16 pixels
  per sg, so a 64-pixel block occupies 4 sg. A WG of 256
  invocations (16 sg) covers 256 pixels = 4 blocks. The cycle-2
  pattern of 64 blocks/WG is too high here: 64 × 64 = 4096
  pixels/WG would mean 16 pixels per lane across 256 lanes.
  Settle on **4 blocks/WG**, 256 invocations.
 - **≤8 SSBO**: need 3 (meta, tmp, dst). Comfortable.
 - **No shaderFloat16/Int8 ALU**: int math everywhere. uint8 dst
  via storageBuffer8BitAccess (cycle-1 v4 pattern).
 ## SSBO layout
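 - `Meta[i]`: `uvec4(dst_off_bytes, params0, dir, params1)` where
  `params0 = (pri | sec << 8 | damping << 16)` and
  `params1 = tmp_off_u16`, the block-origin offset into `Tmp[]` in
  uint16 elements (= padded_origin + 2*16+2). This matches the
  shader, which reads `dir` from `m.z` and the tmp offset from `m.w`.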
 - `Tmp[]`: `uint16_t` array, read through `storageBuffer16BitAccess`
  (V3D 7.1 supports the 16-bit storage extension) rather than a
  `uint8_t` SSBO with manual 16-bit reads.
 - `Dst[]`: `uint8_t` array
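On the host side, the bench has to build the packed `params0` word and the tmp block-origin offset. The sketch below shows one way to do that and to mirror the shader's unpacking; the helper names are hypothetical, and it assumes each of `pri`, `sec`, and `damping` fits in 8 bits, as the `& 0xff` masks in the shader imply.

```c
#include <assert.h>
#include <stdint.h>

/* Pack pri | sec << 8 | damping << 16 (each field assumed < 256). */
static uint32_t pack_params0(uint32_t pri, uint32_t sec, uint32_t damping)
{
    assert(pri < 256u && sec < 256u && damping < 256u);
    return pri | (sec << 8) | (damping << 16);
}

/* Mirror of the shader-side unpacking of params0. */
static void unpack_params0(uint32_t p, int *pri, int *sec, int *damping)
{
    *pri     = (int)(p & 0xffu);
    *sec     = (int)((p >> 8) & 0xffu);
    *damping = (int)((p >> 16) & 0xffu);
}

/* Block origin inside the padded tile: skip 2 halo rows of
 * stride 16 plus 2 halo columns, i.e. padded_origin + 2*16 + 2. */
static uint32_t tmp_block_origin(uint32_t padded_origin)
{
    return padded_origin + 2u * 16u + 2u;
}
```

With `pri = 4`, `sec = 2`, `damping = 6` the packed word is `0x060204`, and a tile starting at element 0 has its block origin at element 34.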
 ## Lane decomposition
 256 invocations / WG, 4 blocks/WG:
 - `lane_in_wg = 0..255`
 - `block_in_wg = lane_in_wg / 64` (0..3)
 - `pixel_in_block = lane_in_wg & 63` (0..63 → row=>>3, col=&7)
 - `block_idx = wg_id * 4 + block_in_wg`
 No barrier needed; each pixel computes independently.
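The lane mapping above is easy to get wrong by one shift, so a CPU model is useful for checking it before writing the shader. This is a sketch under the plan's stated sizes (256 invocations and 4 blocks per WG); the `LaneMap` struct and `map_lane` function are illustrative names, not part of the shader.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t block_idx; /* global block index */
    uint32_t row;       /* 0..7 inside the 8x8 block */
    uint32_t col;       /* 0..7 inside the 8x8 block */
} LaneMap;

/* Map (workgroup id, lane in workgroup) to a block and pixel. */
static LaneMap map_lane(uint32_t wg_id, uint32_t lane_in_wg)
{
    assert(lane_in_wg < 256u);
    uint32_t block_in_wg = lane_in_wg / 64u;      /* 0..3  */
    uint32_t pixel_in_block = lane_in_wg & 63u;   /* 0..63 */
    LaneMap m;
    m.block_idx = wg_id * 4u + block_in_wg;
    m.row = pixel_in_block >> 3;
    m.col = pixel_in_block & 7u;
    return m;
}
```

Lane 70 of workgroup 1, for instance, lands in block 5 at row 0, column 6; the last lane of workgroup 2 lands in block 11 at row 7, column 7.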
 ## Push constants
 ```glsl
 layout(push_constant) uniform PC {
    uint n_blocks;
    uint tmp_stride_u16;   // = 16
    uint dst_stride_u8;
    uint _pad;
 } pc;
 ```
 ## Directions table
 Store the 14-entry stride-16 directions table as a `const uint
 dirs[14]` in the shader, packed as
 `((off1 & 0xffff) << 16) | (off2 & 0xffff)` per direction (both
 signed offsets fit in int16; the masks keep a negative off2 from
 overwriting off1, and unpacking must sign-extend). Read via index.
 Alternative: store it as a plain constant array (the compiler may
 unroll it into a uniform LUT), the same way cycle-2 LPF stored its
 tap weights.
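Packing two signed offsets into one 32-bit word needs masking on the way in and sign extension on the way out, because a negative low half would otherwise smear ones across the high half. The sketch below illustrates that round trip; the helper names are hypothetical and the example offsets are arbitrary, not entries of the real table.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two signed 16-bit offsets into one word: off1 high, off2 low. */
static uint32_t pack_offsets(int off1, int off2)
{
    assert(off1 >= -32768 && off1 <= 32767);
    assert(off2 >= -32768 && off2 <= 32767);
    /* Conversion to uint16_t is modulo 2^16, which masks the sign bits. */
    return ((uint32_t)(uint16_t)off1 << 16) | (uint32_t)(uint16_t)off2;
}

/* Sign-extend a 16-bit field held in the low bits of v. */
static int sign_extend16(uint32_t v)
{
    int x = (int)(v & 0xffffu);
    return x >= 0x8000 ? x - 0x10000 : x;
}

static int unpack_off1(uint32_t e) { return sign_extend16(e >> 16); }
static int unpack_off2(uint32_t e) { return sign_extend16(e); }
```

In GLSL the same unpacking can be written with `bitfieldExtract` on a signed `int`, which sign-extends the extracted field.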
 ## Shader pseudo-code
 ```glsl
 void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint wg_id = gid / 256u;
    uint block_in_wg = (gid & 255u) >> 6;   // 0..3
    uint px_idx = gid & 63u;                 // 0..63
    uint row = px_idx >> 3;                  // 0..7
    uint col = px_idx & 7u;                  // 0..7
    uint block_idx = wg_id * 4u + block_in_wg;
    if (block_idx >= pc.n_blocks) return;
    uvec4 m = u_meta.meta[block_idx];
    uint dst_off = m.x + row * pc.dst_stride_u8 + col;
    uint tmp_off = m.w + row * pc.tmp_stride_u16 + col;   // m.w = tmp block-origin u16 offset
    int pri = int(m.y & 0xffu);
    int sec = int((m.y >> 8) & 0xffu);
    int damping = int((m.y >> 16) & 0xffu);
    int dir = int(m.z & 7u);
    int px = int(u_tmp.tmp[tmp_off]);
    int sum = 0;
    int mn = px, mx = px;
    int pri_shift = max(0, damping - ulog2(pri));
    int sec_shift = damping - ulog2(sec);
    // pri_tap[k] for k=0,1 = 4-(pri&1), then (tap & 3) | 2
    int pri_tap0 = 4 - (pri & 1);
    int pri_tap1 = (pri_tap0 & 3) | 2;
    int pri_idx = dir;
    int sec1_idx = (dir + 2) & 7;
    int sec2_idx = (dir + 6) & 7;
    // k=0
    {
        int off = dirs_off1[pri_idx];
        int p0 = int(u_tmp.tmp[tmp_off + off]);
        int p1 = int(u_tmp.tmp[tmp_off - off]);
        sum += pri_tap0 * constrain(p0 - px, pri, pri_shift);
        sum += pri_tap0 * constrain(p1 - px, pri, pri_shift);
        mn = min(min(mn, p0), p1); mx = max(max(mx, p0), p1);
        // ... 4 secondary taps the same way for off2, off3
    }
    // k=1: same six taps using the k=1 offsets (`*[idx][1]`) and pri_tap1
    int adj = (sum - int(sum < 0) + 8) >> 4;
    int result = clamp(px + adj, mn, mx);   // `out` is a reserved GLSL keyword
    u_dst.dst[dst_off] = uint8_t(result);
 }
 ```
 Note: dirs_off1/dirs_off2 are per-k-round offsets. For k=0 use
 `*[idx][0]` (the "+1 row" component); for k=1 use `*[idx][1]`
 (the "+2 rows" component).
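The per-block setup in the pseudo-code (damping shifts and the two primary tap weights) can be modelled as below. This is a sketch that assumes both strengths are nonzero, since `ulog2(0)` is undefined; the `TapSetup` struct and `tap_setup` name are hypothetical. In GLSL, `findMSB` would be the natural counterpart of `ulog2`.

```c
#include <assert.h>

/* floor(log2(v)) for v > 0. */
static int ulog2(unsigned v)
{
    assert(v != 0u);
    int n = 0;
    while (v >>= 1)
        n++;
    return n;
}

typedef struct {
    int pri_tap0;   /* primary tap weight for k=0 */
    int pri_tap1;   /* primary tap weight for k=1 */
    int pri_shift;
    int sec_shift;
} TapSetup;

/* Mirrors the shift and tap-weight lines of the shader pseudo-code. */
static TapSetup tap_setup(int pri, int sec, int damping)
{
    assert(pri > 0 && sec > 0);   /* assumption: both strengths nonzero */
    TapSetup t;
    t.pri_tap0 = 4 - (pri & 1);            /* 4 if pri is even, 3 if odd */
    t.pri_tap1 = (t.pri_tap0 & 3) | 2;     /* 4 -> 2, 3 -> 3 */
    int ps = damping - ulog2((unsigned)pri);
    t.pri_shift = ps > 0 ? ps : 0;
    t.sec_shift = damping - ulog2((unsigned)sec);
    return t;
}
```

An even primary strength therefore gives weights (4, 2) across the two k-rounds, and an odd one gives (3, 3).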
 ## Throughput prediction
 NEON 1-core: 3.81 Mblock/s = 262 ns/block.
 V3D 7.1 compute estimate (per cycle 3 MC pattern):
 - 12 constrain ops × 8 SMUL24+ADD per constrain = ~96 instructions per pixel
 - 64 pixels per block, 4 blocks/WG → 256 lanes work in parallel
 - Per-block QPU latency ≈ instruction count / lanes × cycle time
 - Predicted: ~5000-8000 ns per block → 0.125-0.2 Mblock/s
 - R₅ = 0.125 / 3.81 ≈ **0.033** at the slow end and 0.2 / 3.81 ≈ 0.052
  at the fast end (deep RED, matches prediction)
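The arithmetic behind these estimates is simple enough to reproduce, which makes it easy to re-run once measured latencies are available. The sketch below only restates the plan's own numbers; the function names are illustrative, and the 8-instructions-per-constrain figure is the plan's estimate, not a measurement.

```c
#include <assert.h>

/* Instructions per block under the plan's estimate:
 * constrain ops per pixel x instructions per constrain x pixels. */
static int instr_per_block(int constrains_per_px, int instr_per_constrain,
                           int px_per_block)
{
    return constrains_per_px * instr_per_constrain * px_per_block;
}

/* Convert per-block latency in ns to throughput in Mblock/s. */
static double mblock_per_s(double ns_per_block)
{
    assert(ns_per_block > 0.0);
    return 1e3 / ns_per_block;   /* (1e9 ns/s) / ns / 1e6 */
}

/* R = QPU throughput / NEON throughput. */
static double ratio_r(double qpu_ns_per_block, double neon_mblock_per_s)
{
    assert(neon_mblock_per_s > 0.0);
    return mblock_per_s(qpu_ns_per_block) / neon_mblock_per_s;
}
```

With the plan's inputs, `instr_per_block(12, 8, 64)` is 6144, and `ratio_r` evaluates to roughly 0.033 for 8000 ns/block and roughly 0.052 for 5000 ns/block against the 3.81 Mblock/s NEON baseline.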
 shaderdb prediction:
 - ~800-1200 instructions (similar shape to cycle 1 IDCT, more
  ops though)
 - 2-4 threads (if uniform count stays < 144 per phase5''' finding 2)
 - uniform count: 14 entries × 2 offsets = 28; + tap weights 4
  = small. Should stay well below threshold. Predict 4 threads.
 ## Phase 5 review focus
 Particular review items for the Phase 5 second-model audit:
 1. **Sentinel handling**: when reading from tmp halo, raw uint16
   values could be 0x8000 (INT16_MIN sentinel from padding) for
   real picture-boundary blocks. Our cycle 5 bench uses random
   pixel values (no sentinels), but a production deployment would
   pass through padded blocks. The constrain() math naturally
   handles INT16_MIN-as-uint16=32768 (clip becomes 0), BUT the
   `min(mn, p)` should use UNSIGNED compare and `max(mx, p)`
   should use SIGNED compare to match NEON. GLSL's `min`/`max`
   on `int` is signed; need separate `umin` (or cast to uint).
   Concretely: `mn = int(min(uint(mn), uint(p)))`,
   `mx = max(mx, int(int16_t(p)))`.
 2. **OOB read on direction taps**: for blocks near the picture
   edge, the direction offsets reach into the halo. Our bench
   uses random pixels there (valid uint8). For deployment with
   sentinels, we need to either (a) zero-out halo values that are
   sentinels before reading or (b) accept the constrain-math-
   handles-it argument.
 3. **Tmp stride**: must equal 16 (stride_u16=16) to match the
   directions table that's baked at stride 16. push constant
   `tmp_stride_u16` should be const or asserted = 16 in bench.
 4. **dst_stride_u8**: cycle-2 LPF used dst_stride_u8 = 8 (for
   isolated blocks). Same here. Production deployment with real
   picture strides (e.g. 1920) would need re-validation.
 5. **Meta entry size**: m.z carries dir (only 3 bits used); it
   could be packed into params0. The current layout is simple, so
   leave it as-is.
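Review item 1 can be made concrete with a scalar model of the two compares. The sketch assumes tmp values arrive as raw `uint16_t` and that `mn` is never negative (it starts at the centre pixel); the helper names are hypothetical. Under an unsigned compare the 0x8000 sentinel looks like 32768 and can never become the minimum, and under a signed compare it looks like -32768 and can never become the maximum.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a raw 16-bit value as signed (0x8000 -> -32768). */
static int as_signed16(uint16_t raw)
{
    return raw >= 0x8000u ? (int)raw - 0x10000 : (int)raw;
}

/* Minimum with UNSIGNED compare: the sentinel is never selected. */
static int update_min(int mn, uint16_t raw)
{
    assert(mn >= 0);
    return (unsigned)raw < (unsigned)mn ? (int)raw : mn;
}

/* Maximum with SIGNED compare: the sentinel is never selected. */
static int update_max(int mx, uint16_t raw)
{
    int s = as_signed16(raw);
    return s > mx ? s : mx;
}
```

Ordinary pixel values pass through both helpers unchanged in meaning: `update_min(100, 90)` is 90 and `update_max(100, 120)` is 120, while a sentinel input leaves both bounds at 100.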
 ## Acceptance criteria
 - shaderdb predicted ≤ 1200 inst, ≥ 2 threads, ≤ 30 uniforms, no
  spills.
 - M1 bit-exact (use the same bench setup as Phase 3 but compare
  QPU output vs NEON output).
 - M2 captured (any number, even deep RED).
 - M4 measured against pure-NEON-4 baseline (expected: negative,
  per same-kernel pattern); cross-reference Issue 003 V1/V2 for
  the mixed-kernel context.
 ## Estimated effort
 2-3 hours for the shader; 30 min for the M2 bench; 30 min for
 M4. Total: ~4 hours, then Phase 7 closure.