daedalus-fourier/docs/k5_cdef_phase4.md
marfrit 5223d3cb3f Cycle 5 closed: CDEF QPU R5=0.116 ORANGE, opportunistic helper
Phase 4 plan with 3 Phase-5 REDs applied inline:
  - meta layout: m.z=tmp_off, m.w=dir
  - sec_shift clamped to >=0 (NEON uqsub semantics)
  - directions table as const ivec2[14], not OR-packed

Phase 6 deliverable: v3d_cdef.comp (387 inst, 2 threads, no spills).
3-way M1 (QPU vs C ref vs NEON) PASS 4096/4096.

M2: 0.443 Mblock/s -> R5 = 0.116 ORANGE (predicted 0.02-0.05 RED).
M4 same-kernel: NEON-3+QPU 8.46 < NEON-4 alone ~10 (negative).
M4 mixed (NEON-3 MC + QPU CDEF): CPU 34.17 Mblock/s MC,
  QPU 0.42 Mblock/s CDEF helper. CPU side higher than the
  Issue 003 NEON-fallback proxy suggested - cross-substrate
  contention is gentler than same-side NEON contention.

Verdict: CDEF stays on CPU; QPU dispatch path exists for
opportunistic use. Deployment recipe table updated for all 5
cycles. Phase 9 lessons: linear extrapolation across cycles is
too pessimistic; CDEF is bandwidth-bound on NEON despite high
per-block ns; real-substrate-cross contention < NEON-proxy
contention.

- src/v3d_cdef.comp: cycle 5 QPU shader
- tests/bench_v3d_cdef.c: 3-way M1, M2 bench
- tests/bench_concurrent_mixed.c: K_CDEF on both sides
- tests/cdef_ref.c + bench_neon_cdef.c: sec_shift clamp +
  expanded damping range to exercise the edge case
- CMakeLists.txt: v3d_cdef.spv + bench_v3d_cdef wiring
- docs/k5_cdef_phase4.md updated with Phase 5 review applied
- docs/k5_cdef_phase7.md: closure doc with full verdict matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:52:46 +00:00


cycle: 5
phase: 4
status: draft, awaiting Phase 5 review
date_opened: 2026-05-18
parent: k5_cdef_phase3.md
predicted_R: 0.02-0.05 (deep RED)

Cycle 5, Phase 4 — QPU CDEF shader plan

Plan a Vulkan compute shader for the AV1 CDEF primary+secondary 8×8 luma filter on V3D 7.1. Predicted deep RED (R₅ = 0.02-0.05); plan + build it anyway because:

  • Confirms the prediction with measured data (or refutes it).
  • Provides the dispatch path needed for Phase 8 V4L2 wrapper.
  • Closes cycle 5 (Phases 1-7 all on the record).

Kernel shape (NEON reference: 263 ns/block)

Per 8×8 output block: 8 directions table, 2 offsets each. For each output pixel:

  • 2 primary taps (off1, -off1) using dir
  • 4 secondary taps (off2, -off2, off3, -off3) using (dir+2)%8 and (dir-2+8)%8
  • Both tap sets are evaluated in each of 2 k-rounds (different tap weights)
  • (2 + 4) taps × 2 k-rounds = 12 constrain() ops per pixel; × 64 pixels = 768 constrain ops per block
  • Plus min/max bookkeeping for iclip

The constrain math:

diff = p - px;
adiff = abs(diff);
clip = max(0, threshold - (adiff >> shift));
constrained = sign(diff) * min(adiff, clip);
sum += tap * constrained;

Output: dst[r,c] = clamp(px + ((sum - (sum<0) + 8) >> 4), min, max);
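The constrain and output steps above can be sketched on the host side in C. This is a minimal illustration, not the contents of tests/cdef_ref.c; the function names cdef_constrain and cdef_finish are hypothetical, and the sketch assumes the shift has already been clamped to >= 0 by the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Limit a tap difference: small differences pass through, large ones fade to 0. */
static int cdef_constrain(int diff, int threshold, int shift)
{
    assert(shift >= 0);                    /* assumption: caller clamps the shift */
    int adiff = abs(diff);
    int clip = threshold - (adiff >> shift);
    if (clip < 0) clip = 0;                /* max(0, ...) */
    int mag = adiff < clip ? adiff : clip; /* min(adiff, clip) */
    return diff < 0 ? -mag : mag;          /* sign(diff) * mag */
}

/* Round the weighted sum symmetrically around zero, then clamp to [mn, mx]. */
static int cdef_finish(int px, int sum, int mn, int mx)
{
    /* Assumes >> on a negative int is an arithmetic shift (true on GCC and Clang). */
    int v = px + ((sum - (sum < 0) + 8) >> 4);
    return v < mn ? mn : (v > mx ? mx : v);
}
```

The (sum < 0) term makes the rounding symmetric: a sum of +8 adjusts the pixel by +1 and a sum of -8 by -1, while sums of ±7 leave it unchanged.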

V3D substrate fit (phase0 constraints)

  • No DP4A: each constrain is scalar int math; no vector packing helps (per cycle 3 MC finding). Predicted instruction count proportional to ops.
  • 16KB shared: not needed — each pixel computes independently; no row sharing in compute side (tmp is read-only input).
  • subgroupSize=16: 1 pixel per lane × 16 lanes/sg = 16 pixels per sg, so a 64-pixel block occupies 4 subgroups. A WG of 256 invocations (16 sg) at 1 pixel/lane covers 256 pixels = 4 blocks. The cycle-2 pattern of 64 blocks/WG is too high here: 64 × 64 = 4096 pixels/WG would mean 16 pixels per lane across 256 lanes. Settle on 4 blocks/WG, 256 invocations.
  • ≤8 SSBO: need 3 (meta, tmp, dst). Comfortable.
  • No shaderFloat16/Int8 ALU: int math everywhere. uint8 dst via storageBuffer8BitAccess (cycle-1 v4 pattern).

SSBO layout (post Phase 5 RED-1 fix)

  • Meta[i]: uvec4(dst_off_bytes, params0, tmp_off_u16, dir) — i.e. m.x = dst_off, m.y = params (pri | sec << 8 | damping << 16), m.z = tmp block-origin u16-element offset, m.w = dir (3 bits used). Pseudo-code below uses this layout consistently.
  • Tmp[]: uint16_t array via GL_EXT_shader_16bit_storage + storageBuffer16BitAccess — both already enabled in v3d_runner.c and used by cycle 1 IDCT shader. No uncertainty.
  • Dst[]: uint8_t array via GL_EXT_shader_8bit_storage (per cycle-1 v4 pattern).
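A host-side sketch of how one Meta entry could be filled and decoded, following the field order listed above. The struct and function names (meta_entry, meta_pack, meta_unpack_params) are hypothetical and only illustrate the bit layout; the real bench code may build the buffer differently.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t x, y, z, w; } meta_entry;  /* mirrors a GLSL uvec4 */

static meta_entry meta_pack(uint32_t dst_off_bytes, uint32_t tmp_off_u16,
                            uint32_t pri, uint32_t sec,
                            uint32_t damping, uint32_t dir)
{
    /* Each params field gets 8 bits; dir uses 3 bits. */
    assert(pri < 256u && sec < 256u && damping < 256u && dir < 8u);
    meta_entry m;
    m.x = dst_off_bytes;
    m.y = pri | (sec << 8) | (damping << 16);
    m.z = tmp_off_u16;   /* block-origin offset in u16 elements */
    m.w = dir;
    return m;
}

/* Same decode the shader pseudo-code performs on m.y and m.w. */
static void meta_unpack_params(meta_entry m, int *pri, int *sec,
                               int *damping, int *dir)
{
    *pri     = (int)(m.y & 0xffu);
    *sec     = (int)((m.y >> 8) & 0xffu);
    *damping = (int)((m.y >> 16) & 0xffu);
    *dir     = (int)(m.w & 7u);
}
```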

Lane decomposition

256 invocations / WG, 4 blocks/WG:

  • lane_in_wg = 0..255
  • block_in_wg = lane_in_wg / 64 (0..3)
  • pixel_in_block = lane_in_wg & 63 (0..63 → row=>>3, col=&7)
  • block_idx = wg_id * 4 + block_in_wg

No barrier needed; each pixel computes independently.
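The index arithmetic above, written out as a C sketch so it can be checked on the host. map_lane and wg_count are hypothetical names; wg_count is an assumed dispatch-size helper (round up to whole workgroups), which this plan does not spell out.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t block_idx, row, col; } lane_map;

/* Map (workgroup, lane) to the block and pixel that lane computes. */
static lane_map map_lane(uint32_t wg_id, uint32_t lane_in_wg)
{
    assert(lane_in_wg < 256u);
    uint32_t block_in_wg = lane_in_wg / 64u;   /* 0..3 */
    uint32_t pixel = lane_in_wg & 63u;         /* 0..63 */
    lane_map m;
    m.block_idx = wg_id * 4u + block_in_wg;
    m.row = pixel >> 3;                        /* 0..7 */
    m.col = pixel & 7u;                        /* 0..7 */
    return m;
}

/* Assumed host helper: workgroups needed at 4 blocks per WG. */
static uint32_t wg_count(uint32_t n_blocks)
{
    return (n_blocks + 3u) / 4u;
}
```

When n_blocks is not a multiple of 4, the last workgroup has idle lanes; the block_idx >= n_blocks early return in the shader covers that case.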

Push constants

layout(push_constant) uniform PC {
    uint n_blocks;
    uint tmp_stride_u16;   // = 16
    uint dst_stride_u8;
    uint _pad;
} pc;

Directions table (post Phase 5 RED-3 fix)

Use const ivec2 dirs[14] (8 directions + 6 wrap copies), each entry = (off_k0, off_k1). Signed-int storage handles negative offsets cleanly without manual sign-extension. The OR-pack approach proposed earlier would corrupt negative offsets; abandoned.

Values from tests/cdef_ref.c neon_directions8[14][2]:

dirs[ 0] = ivec2(-1*16+1, -2*16+2)  // (-15, -30)
dirs[ 1] = ivec2( 0*16+1, -1*16+2)  // (1, -14)
... (etc.)
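A small C sketch of why signed storage matters here. dir_off reproduces the dy*16+dx arithmetic of the two entries shown above. pack_naive is a hypothetical OR-pack (the exact packing proposed earlier is not recorded in this doc) that shows one way a negative low half can overwrite the high half.

```c
#include <assert.h>
#include <stdint.h>

enum { TMP_STRIDE = 16 };   /* the table is baked at stride 16 */

/* Linear offset of a tap at (dy rows, dx columns) from the current pixel. */
static int dir_off(int dy, int dx)
{
    assert(dx > -TMP_STRIDE && dx < TMP_STRIDE);
    return dy * TMP_STRIDE + dx;
}

/* Hypothetical naive pack: OR two signed offsets without masking the low one. */
static uint32_t pack_naive(int off_k0, int off_k1)
{
    return (uint32_t)off_k0 | ((uint32_t)off_k1 << 16);
}

/* Sign-extending read of the high 16 bits. */
static int unpack_hi(uint32_t packed)
{
    int v = (int)(packed >> 16);
    return v >= 0x8000 ? v - 0x10000 : v;
}
```

With off_k0 = -15, the sign bits of the low half fill the upper 16 bits, so the high half reads back as -1 instead of -30. A positive low half (for example 1) happens to survive, which is the kind of case that can hide the bug in a quick test.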

Shader pseudo-code

void main() {
    uint wg_id = gl_WorkGroupID.x;               // YELLOW-4: use directly, not gid / 256u
    uint lane_in_wg = gl_LocalInvocationID.x;    // 0..255
    uint block_in_wg = lane_in_wg >> 6;          // 0..3
    uint px_idx = lane_in_wg & 63u;              // 0..63
    uint row = px_idx >> 3;                      // 0..7
    uint col = px_idx & 7u;                      // 0..7

    uint block_idx = wg_id * 4u + block_in_wg;
    if (block_idx >= pc.n_blocks) return;

    uvec4 m = u_meta.meta[block_idx];
    uint dst_off = m.x + row * pc.dst_stride_u8 + col;
    uint tmp_off = m.z + row * pc.tmp_stride_u16 + col;   // m.z = tmp block-origin u16 offset
    int pri = int(m.y & 0xffu);
    int sec = int((m.y >> 8) & 0xffu);
    int damping = int((m.y >> 16) & 0xffu);
    int dir = int(m.w & 7u);

    int px = int(u_tmp.tmp[tmp_off]);
    int sum = 0;
    int mn = px, mx = px;

    int pri_shift = max(0, damping - ulog2(pri));
    int sec_shift = max(0, damping - ulog2(sec));  // RED-2: NEON uqsub saturates to 0; GLSL >> by negative is UB.

    // pri_tap[k] for k=0,1 = 4-(pri&1), then (tap & 3) | 2
    int pri_tap0 = 4 - (pri & 1);
    int pri_tap1 = (pri_tap0 & 3) | 2;

    int pri_idx = dir;
    int sec1_idx = (dir + 2) & 7;
    int sec2_idx = (dir + 6) & 7;

    // k=0: one-step offsets (.x of each dirs[] entry), primary weight pri_tap0
    {
        int off = dirs[pri_idx].x;
        int p0 = int(u_tmp.tmp[tmp_off + off]);
        int p1 = int(u_tmp.tmp[tmp_off - off]);
        sum += pri_tap0 * constrain(p0 - px, pri, pri_shift);
        sum += pri_tap0 * constrain(p1 - px, pri, pri_shift);
        mn = min(min(mn, p0), p1); mx = max(max(mx, p0), p1);
        // ... 4 secondary taps the same way via dirs[sec1_idx].x and dirs[sec2_idx].x,
        //     using sec, sec_shift, and weight sec_tap0 = 2
    }
    // k=1: same with the .y offsets, pri_tap1, and sec_tap1 = 1

    int adj = (sum - int(sum < 0) + 8) >> 4;
    int result = clamp(px + adj, mn, mx);    // "out" is a reserved word in GLSL
    u_dst.dst[dst_off] = uint8_t(result);
}

Note: each dirs[] entry holds the per-k-round offsets. For k=0 use dirs[idx].x (the one-step offset); for k=1 use dirs[idx].y (the two-step offset).
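The shift and tap-weight setup from the pseudo-code can be checked on the host with a sketch like the following. ulog2 here is an assumed floor(log2) helper written as a plain loop, and cdef_shift and pri_taps are hypothetical names. The sketch assumes strength > 0; how a zero strength is handled is not stated in this plan.

```c
#include <assert.h>

/* floor(log2(x)) for x > 0. */
static int ulog2(unsigned x)
{
    assert(x != 0u);          /* assumption: callers never pass a zero strength */
    int n = 0;
    while (x >>= 1) n++;
    return n;
}

/* damping - ulog2(strength), saturated at 0 like NEON uqsub (RED-2). */
static int cdef_shift(int damping, int strength)
{
    int s = damping - ulog2((unsigned)strength);
    return s < 0 ? 0 : s;
}

/* Primary tap weights for k=0 and k=1, as in the pseudo-code. */
static void pri_taps(int pri, int *tap0, int *tap1)
{
    *tap0 = 4 - (pri & 1);
    *tap1 = (*tap0 & 3) | 2;
}
```

An even pri gives weights (4, 2) and an odd pri gives (3, 3). A damping of 3 with strength 16 would produce a shift of -1 without the clamp, which is the edge case RED-2 targets.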

Throughput prediction

NEON 1-core: 3.81 Mblock/s = 262 ns/block. V3D 7.1 compute estimate (per cycle 3 MC pattern):

  • 12 constrain ops × 8 SMUL24+ADD per constrain = ~96 instructions per pixel
  • 64 pixels per block, 4 blocks/WG → 256 lanes work in parallel
  • Per-block QPU latency ≈ instruction count / lanes × cycle time
  • Predicted: ~5000-8000 ns per block → 0.125-0.2 Mblock/s
  • R₅ = 0.125 / 3.81 to 0.2 / 3.81 = 0.033-0.052 (deep RED, consistent with the 0.02-0.05 prediction)
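The unit conversions behind these figures, as a short C sketch. The helper names are hypothetical; the only inputs are the numbers quoted above (5000-8000 ns per block predicted, 3.81 Mblock/s NEON).

```c
#include <assert.h>

/* ns per block -> Mblock/s: (1e9 ns/s / ns_per_block) / 1e6. */
static double mblock_per_s(double ns_per_block)
{
    assert(ns_per_block > 0.0);
    return 1000.0 / ns_per_block;
}

/* R = QPU throughput / NEON 1-core throughput. */
static double r_ratio(double qpu_mblock_s, double neon_mblock_s)
{
    assert(neon_mblock_s > 0.0);
    return qpu_mblock_s / neon_mblock_s;
}
```

For example, r_ratio(mblock_per_s(8000.0), 3.81) is about 0.033 and r_ratio(mblock_per_s(5000.0), 3.81) is about 0.052, the two ends of the predicted range.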

shaderdb prediction:

  • ~800-1200 instructions (similar shape to cycle 1 IDCT, more ops though)
  • 2-4 threads (if uniform count stays < 144 per phase5''' finding 2)
  • uniform count: 14 entries × 2 offsets = 28; + tap weights 4 = small. Should stay well below threshold. Predict 4 threads.

Phase 5 review applied (2026-05-18, Sonnet)

REDs fixed inline above:

  • RED-1: meta field layout — m.z = tmp_off, m.w = dir (was swapped).
  • RED-2: sec_shift = max(0, ...) to match NEON's uqsub saturation.
  • RED-3: directions table is const ivec2 dirs[14], not packed.

YELLOWs accepted:

  • YELLOW-1: Phase 6 bench is 3-way M1 (QPU vs NEON vs C ref), not 2-way.
  • YELLOW-2: 16-bit storage extension confirmed present (cycle-1 already uses it).
  • YELLOW-3: sec_tap0 = 2, sec_tap1 = 1 made explicit in shader.
  • YELLOW-4: use gl_WorkGroupID.x directly, not gid / 256u.

Also: clamp sec_shift in tests/cdef_ref.c (currently unguarded; the M1 gate passes only by bench-param luck, since the params don't exercise a negative shift). Fix the C ref and add negative-shift cases to the bench param generator so the 3-way M1 actually stresses the edge case.

Phase 5 review focus

Particular review items for the Phase 5 second-model audit:

  1. Sentinel handling: when reading from tmp halo, raw uint16 values could be 0x8000 (INT16_MIN sentinel from padding) for real picture-boundary blocks. Our cycle 5 bench uses random pixel values (no sentinels), but a production deployment would pass through padded blocks. The constrain() math naturally handles INT16_MIN-as-uint16=32768 (clip becomes 0), BUT the min(mn, p) should use UNSIGNED compare and max(mx, p) should use SIGNED compare to match NEON. GLSL's min/max on int is signed; need separate umin (or cast to uint).

    Concretely: mn = int(min(uint(mn), uint(p))), mx = max(mx, int(int16_t(p))).

  2. OOB read on direction taps: for blocks near the picture edge, the direction offsets reach into the halo. Our bench uses random pixels there (valid uint8). For deployment with sentinels, we need to either (a) zero out halo values that are sentinels before reading or (b) accept the constrain-math-handles-it argument.

  3. Tmp stride: must equal 16 (stride_u16=16) to match the directions table that's baked at stride 16. push constant tmp_stride_u16 should be const or asserted = 16 in bench.

  4. dst_stride_u8: cycle-2 LPF used dst_stride_u8 = 8 (for isolated blocks). Same here. Production deployment with real picture strides (e.g. 1920) would need re-validation.

  5. Meta entry size: m.w carries dir (only 3 bits used); it could be packed into params0 (m.y). The current layout is simple, so leave as-is.
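For review item 1, the sentinel-aware min/max update can be sketched in C as below. minmax_update is a hypothetical name, and the sketch assumes mn and mx start from a real pixel value (non-negative), as in the shader pseudo-code. It only illustrates the unsigned-min / signed-max idea; it is not the NEON code.

```c
#include <assert.h>
#include <stdint.h>

/* Fold one raw uint16 tap into the running min/max, ignoring 0x8000 sentinels. */
static void minmax_update(int *mn, int *mx, uint16_t p)
{
    assert(*mn >= 0 && *mn <= *mx);   /* assumption: seeded from a real pixel */

    /* Unsigned compare: 0x8000 (32768) exceeds any pixel, so it never wins min. */
    if ((unsigned)p < (unsigned)*mn)
        *mn = (int)p;

    /* Signed 16-bit view: 0x8000 reads as -32768, so it never wins max. */
    int sp = p >= 0x8000u ? (int)p - 0x10000 : (int)p;
    if (sp > *mx)
        *mx = sp;
}
```

Starting from mn = mx = 100, a sentinel tap leaves both values at 100, while ordinary taps of 50 and 200 move mn and mx as expected.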

Acceptance criteria

  • shaderdb predicted ≤ 1200 inst, ≥ 2 threads, ≤ 30 uniforms, no spills.
  • M1 bit-exact (use the same bench setup as Phase 3, but compare QPU output against both the NEON output and the C reference, per YELLOW-1).
  • M2 captured (any number, even deep RED).
  • M4 measured against pure-NEON-4 baseline (expected: negative, per same-kernel pattern); cross-reference Issue 003 V1/V2 for the mixed-kernel context.

Estimated effort

2-3 hours for the shader; 30 min for the M2 bench; 30 min for M4. Total: ~4 hours, then Phase 7 closure.