daedalus-fourier/docs/k2_deblock_phase4.md
marfrit 36eca40ff2 Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute,
    masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.

Performance:

  M2'' = 19.645 Medge/s     (50.9 ns/edge)
  M3'' = 48.285 Medge/s     (NEON baseline from phase3)
  R''  = 0.41               (ORANGE band - doesn't auto-close per
                             cycle-1 calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):

  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling
                                          (per-core 7-9; same
                                          bandwidth-saturation as
                                          cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp                — GLSL kernel
- tests/bench_v3d_lpf.c                 — M1'' + M2'' harness with
                                          contract asserts + fm/hev
                                          pass-rate instrumentation
- tests/bench_concurrent_lpf.c          — M4'' pthread bench
                                          (mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md       — plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00


cycle: 2
phase: 4
status: open (awaiting Phase 5'' review)
date_opened: 2026-05-18
parent: k2_deblock_phase3.md
template_doc: phase4.md (cycle 1)
target_kernel: VP9 loop filter h_4_8 (4-tap inner, horizontal, 8-pixel edge)
expected_artifacts: src/v3d_lpf_h_4_8.comp, tests/bench_v3d_lpf.c, CMakeLists.txt updates

Cycle 2, Phase 4 — Plan QPU LPF kernel

This doc is compact. Cycle-1 phase4.md covers constraints C1–C10 (carried forward unchanged) and the design-discipline patterns (barrier-safety, uint8_t SSBO race avoidance, contract-before-code). Phase 4'' references those rather than re-deriving them.

1. Constraints (carried from cycle 1 phase4.md §1)

All 10 constraints apply unchanged. The relevant subset for LPF:

  • C1 (int arithmetic) — LPF is integer-only ✓
  • C2 (16 KiB shared mem) — LPF needs none (no transpose, no cross-lane comm)
  • C3 (≤8 SSBOs) — LPF uses 2: meta + dst
  • C4 (subgroup ops BASIC+VOTE+BALLOT+SHUFFLE+...) — LPF doesn't use any subgroup operation; pure per-lane work
  • C7 (M5 dispatch overhead 33 µs) — same as IDCT; frame-batching amortises identically
  • C10 (bit-exact match required) — same gate

2. Workload-model

Per-edge memory traffic (single edge):

  • 8 rows × 8 pixels read = 64 bytes load
  • 2-4 pixels written per row × 8 rows = 16–32 bytes write
  • Worst case 96 bytes / edge

Per 1080p frame, worst case 64 530 edges:

  • 64 530 × 96 B = ~6.2 MB total traffic (cf. IDCT cycle 1: 8 MB)
  • At GPU's measured 4 GB/s share: 1.55 ms / frame = 645 FPS-eq (32 % faster than IDCT bandwidth ceiling because traffic is lower)
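The traffic arithmetic above can be reproduced with a few lines of C. This is a minimal sketch, not part of the bench: the function names are hypothetical, and the 4 GB/s figure is passed in as a parameter rather than measured.

```c
#include <assert.h>
#include <stdint.h>

/* Worst case per edge: 8 rows x 8 px read, 8 rows x 4 px written. */
enum { TRAFFIC_ROWS = 8, TRAFFIC_READ_PX = 8, TRAFFIC_MAX_WRITE_PX = 4 };

/* Worst-case bytes moved for a single edge (64 read + 32 written). */
static uint32_t lpf_bytes_per_edge(void) {
    return TRAFFIC_ROWS * TRAFFIC_READ_PX + TRAFFIC_ROWS * TRAFFIC_MAX_WRITE_PX;
}

/* Worst-case bytes moved for a frame of n_edges edges. */
static uint64_t lpf_frame_bytes(uint32_t n_edges) {
    return (uint64_t)n_edges * lpf_bytes_per_edge();
}

/* Frame time in milliseconds at a sustained bandwidth in bytes per second. */
static double lpf_frame_ms(uint32_t n_edges, double bytes_per_s) {
    assert(bytes_per_s > 0.0);
    return 1e3 * (double)lpf_frame_bytes(n_edges) / bytes_per_s;
}
```

For 64 530 edges at 4e9 B/s this gives about 1.55 ms per frame, i.e. roughly 645 frames per second.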

Per-edge compute (1080p, worst case):

  • ~25 ALU ops/lane × 8 lanes/edge (= row count, see §3) = 200 lane-ops/edge × 64 530 / 16 (SIMD wide) ≈ 800 K SIMD-cycles
  • At v3d 92 GFLOPS theoretical × 23 % SGEMM-style util = 21 GOPS effective → 40 µs compute per frame
  • Compute (~40 µs) is of the same order as the 33 µs dispatch overhead (C7) and far below the 1.55 ms bandwidth time above. LPF is not compute-bound.

3. Workgroup geometry

Bake in the cycle-1 v4 lesson (WG = max 256 invocations) from the start.

  • local_size_x = 256 (16 subgroups × 16 lanes)
  • Within each subgroup: 2 edges (one per 8-lane half), same block-slot pattern as cycle-1 v4
  • Per WG: 16 subgroups × 2 edges = 32 edges
  • Per 1080p (64 530 edges): ⌈64 530 / 32⌉ = 2 017 WGs
  • Per lane: handle one row of one edge

Lane decomposition:

gid              = gl_GlobalInvocationID.x
wg_id            = gid / 256
lane_in_wg       = gid & 255
sg_in_wg         = lane_in_wg >> 4    // 0..15
lane_in_sg       = lane_in_wg & 15
edge_slot        = lane_in_sg >> 3    // 0 (lanes 0..7) or 1 (8..15)
row              = lane_in_sg & 7     // 0..7

edge_local       = sg_in_wg * 2 + edge_slot       // 0..31 in WG
edge_idx         = wg_id * 32 + edge_local
oob              = edge_idx >= n_edges

No barrier needed. Each lane is fully independent — no cross-lane data flow, no transpose. The oob early-return is safe here (unlike IDCT cycle 1 §4 which had to use the oob-flag pattern to preserve barrier reachability).
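The lane decomposition can be checked on the host before any shader runs. The sketch below restates it in C so the mapping from gid to (edge_idx, row) can be unit-tested; the struct and function names are hypothetical and do not exist in the repo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t edge_idx;  /* global edge handled by this invocation */
    uint32_t row;       /* 0..7, row within the edge */
    bool     oob;       /* true when edge_idx >= n_edges */
} lpf_lane;

/* Host-side mirror of the shader's lane decomposition (WG = 256 invocations). */
static lpf_lane lpf_decompose(uint32_t gid, uint32_t n_edges) {
    uint32_t wg_id      = gid / 256u;
    uint32_t lane_in_wg = gid & 255u;
    uint32_t sg_in_wg   = lane_in_wg >> 4;   /* 0..15 */
    uint32_t lane_in_sg = lane_in_wg & 15u;
    uint32_t edge_slot  = lane_in_sg >> 3;   /* 0 or 1 */
    lpf_lane l;
    l.row      = lane_in_sg & 7u;
    l.edge_idx = wg_id * 32u + sg_in_wg * 2u + edge_slot;
    l.oob      = l.edge_idx >= n_edges;
    return l;
}

/* Number of workgroups needed to cover n_edges at 32 edges per WG. */
static uint32_t lpf_workgroups(uint32_t n_edges) {
    assert(n_edges <= UINT32_MAX - 31u);     /* guard the round-up add */
    return (n_edges + 31u) / 32u;
}
```

With n_edges = 64 530 the last workgroup (wg_id 2016) covers edge slots 64 512..64 543, so its final 14 slots are oob and return early.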

4. Per-thread algorithm

if (edge_idx >= pc.n_edges) return;          // safe — no barrier follows

uvec4 m = u_meta.meta[edge_idx];
uint base = m.x + row * pc.dst_stride_u8;    // m.x = dst byte offset of row-0 col-0 of this edge
int E = int(m.y), I = int(m.z), H = int(m.w);

int p3 = int(u_dst.dst[base - 4u]);
int p2 = int(u_dst.dst[base - 3u]);
int p1 = int(u_dst.dst[base - 2u]);
int p0 = int(u_dst.dst[base - 1u]);
int q0 = int(u_dst.dst[base + 0u]);
int q1 = int(u_dst.dst[base + 1u]);
int q2 = int(u_dst.dst[base + 2u]);
int q3 = int(u_dst.dst[base + 3u]);

bool fm = abs(p3-p2) <= I && abs(p2-p1) <= I && abs(p1-p0) <= I &&
          abs(q1-q0) <= I && abs(q2-q1) <= I && abs(q3-q2) <= I &&
          abs(p0-q0)*2 + (abs(p1-q1) >> 1) <= E;
if (!fm) return;

bool hev = abs(p1-p0) > H || abs(q1-q0) > H;

if (hev) {
    int f  = clamp(p1 - q1, -128, 127);
    f      = clamp(3*(q0-p0) + f, -128, 127);
    int f1 = min(f + 4, 127) >> 3;
    int f2 = min(f + 3, 127) >> 3;
    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
} else {
    int f  = clamp(3*(q0-p0), -128, 127);
    int f1 = min(f + 4, 127) >> 3;
    int f2 = min(f + 3, 127) >> 3;
    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
    int fp = (f1 + 1) >> 1;
    u_dst.dst[base - 2u] = uint8_t(clamp(p1 + fp, 0, 255));
    u_dst.dst[base + 1u] = uint8_t(clamp(q1 - fp, 0, 255));
}

Mirrors tests/vp9_lpf_ref.c line-for-line. Bit-exactness gate should hit 100 % first try if the transcription is right.
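For spot-checking the transcription by hand, the same per-row logic can be written as scalar C. This is a sketch transcribed from the pseudocode above, not a copy of tests/vp9_lpf_ref.c; lpf4_row, clampi and mini are hypothetical names, and the code assumes right shift of a negative int is arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }
static int mini(int a, int b) { return a < b ? a : b; }

/* Filter one row in place. px points at q0, so p3..p0 are px[-4..-1] and
 * q0..q3 are px[0..3]. Returns 1 if the row was filtered, 0 if fm rejected it. */
static int lpf4_row(uint8_t *px, int E, int I, int H) {
    assert(E >= 0 && I >= 0 && H >= 0);
    int p3 = px[-4], p2 = px[-3], p1 = px[-2], p0 = px[-1];
    int q0 = px[0],  q1 = px[1],  q2 = px[2],  q3 = px[3];

    int fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I &&
             abs(q1 - q0) <= I && abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
             abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
    if (!fm) return 0;

    int hev = abs(p1 - p0) > H || abs(q1 - q0) > H;
    if (hev) {
        /* High edge variance: adjust only p0 and q0. */
        int f  = clampi(p1 - q1, -128, 127);
        f      = clampi(3 * (q0 - p0) + f, -128, 127);
        int f1 = mini(f + 4, 127) >> 3;
        int f2 = mini(f + 3, 127) >> 3;
        px[-1] = (uint8_t)clampi(p0 + f2, 0, 255);
        px[0]  = (uint8_t)clampi(q0 - f1, 0, 255);
    } else {
        /* Low variance: adjust p1, p0, q0 and q1. */
        int f  = clampi(3 * (q0 - p0), -128, 127);
        int f1 = mini(f + 4, 127) >> 3;
        int f2 = mini(f + 3, 127) >> 3;
        int fp = (f1 + 1) >> 1;
        px[-1] = (uint8_t)clampi(p0 + f2, 0, 255);
        px[0]  = (uint8_t)clampi(q0 - f1, 0, 255);
        px[-2] = (uint8_t)clampi(p1 + fp, 0, 255);
        px[1]  = (uint8_t)clampi(q1 - fp, 0, 255);
    }
    return 1;
}
```

Worked example: the row 10 10 10 10 | 20 20 20 20 with E = 30, I = 5, H = 2 passes fm, takes the non-hev path with f = 30, f1 = f2 = 4, fp = 2, and becomes 10 10 12 14 | 16 18 20 20.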

Note on uint for base: in GLSL, base - 4u is a uint - uint expression, so it wraps around to a very large index (rather than going negative) if m.x < 4.
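The wraparound is easy to demonstrate on the host, because uint32_t subtraction in C wraps modulo 2^32 just as GLSL uint arithmetic does. The helper names below are hypothetical; the bounds check is one possible host-side guard, not something the plan prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Index the shader computes for p3. Wraps modulo 2^32 when base < 4. */
static uint32_t lpf_p3_index(uint32_t base) { return base - 4u; }

/* 1 when base-4 .. base+3 all lie inside a buffer of buf_len bytes. */
static int lpf_base_in_bounds(uint32_t base, uint32_t buf_len) {
    return base >= 4u && buf_len >= 4u && base <= buf_len - 4u;
}

/* Checked variant: enforce the contract before computing the index. */
static uint32_t lpf_p3_index_checked(uint32_t base, uint32_t buf_len) {
    assert(lpf_base_in_bounds(base, buf_len));
    return base - 4u;
}
```

With base = 3 the unchecked index is 4 294 967 295, which is the silent out-of-range access the m.x ≥ 4 requirement is there to prevent.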

Contracts (revised per phase5'' findings 2 + 4):

  1. The host guarantees m.x ≥ 4 for every edge.
  2. The host guarantees dst_stride_u8 ≥ 4 for every dispatch. (Required for race safety, see §5: with s = dst_stride_u8, rows r and r+1 write to [m.x+r·s−2..m.x+r·s+1] and [m.x+(r+1)·s−2..m.x+(r+1)·s+1], which are disjoint iff s ≥ 4.)
  3. Phase 6 MUST add assert(m_x >= 4 && dst_stride >= 4) in bench_v3d_lpf.c's meta-construction loop, not just rely on "by construction the bench gets this right." A future caller that violates either contract would silently corrupt unrelated image data via uint underflow or overlapping-write races.

Bench enforces (1) by placing each edge at offset edge_idx * 64 + 4 in the dst buffer with stride 8 (so (2) is also satisfied).
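A meta-construction loop that satisfies contract 3 might look like the following. This is a sketch under the bench layout described above (64 bytes per edge, first row at offset 4, stride 8); lpf_meta and lpf_fill_meta are hypothetical names, and the real bench_v3d_lpf.c may be structured differently.

```c
#include <assert.h>
#include <stdint.h>

/* Host mirror of the shader's uvec4 meta entry: (dst_offset, E, I, H). */
typedef struct { uint32_t x, y, z, w; } lpf_meta;

enum { BENCH_EDGE_BYTES = 64 };  /* 8 rows x stride 8 in the bench layout */

/* Fill meta[] for n_edges edges, asserting both host contracts per edge. */
static void lpf_fill_meta(lpf_meta *meta, uint32_t n_edges, uint32_t dst_stride,
                          uint32_t E, uint32_t I, uint32_t H) {
    for (uint32_t i = 0; i < n_edges; i++) {
        uint32_t m_x = i * BENCH_EDGE_BYTES + 4u;
        /* Contracts 1 and 2: no uint underflow, no overlapping row writes. */
        assert(m_x >= 4 && dst_stride >= 4);
        meta[i] = (lpf_meta){ m_x, E, I, H };
    }
}
```

The assert fires in debug builds only; a caller that compiles with NDEBUG would need an explicit runtime check instead.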

5. Memory layout / SSBOs

| binding | name | type | bytes | usage |
|---|---|---|---|---|
| 0 | meta | readonly uvec4[] | 16 / edge | (dst_offset, E, I, H) per edge |
| 1 | dst | uint8_t[] | per-frame | pixel buffer, read-write |

Push constants (16 B total):

layout(push_constant) uniform PC {
    uint n_edges;
    uint dst_stride_u8;
    uint _pad0;
    uint _pad1;
} pc;
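On the host side, a matching C struct keeps the 16-byte layout honest at compile time. The struct and helper names are hypothetical; only the field names and order come from the block above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host mirror of the shader's push-constant block PC. */
typedef struct {
    uint32_t n_edges;
    uint32_t dst_stride_u8;
    uint32_t _pad0;
    uint32_t _pad1;
} lpf_push_constants;

/* Fail the build if the layout drifts from the shader's 16 bytes. */
static_assert(sizeof(lpf_push_constants) == 16, "push constants must be 16 B");
static_assert(offsetof(lpf_push_constants, dst_stride_u8) == 4, "field offset");

/* Build the block, enforcing the stride contract from section 4. */
static lpf_push_constants lpf_make_pc(uint32_t n_edges, uint32_t dst_stride_u8) {
    assert(dst_stride_u8 >= 4u);
    lpf_push_constants pc = { n_edges, dst_stride_u8, 0u, 0u };
    return pc;
}
```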

Race safety: each lane writes to byte addresses base-2, base-1, base+0, base+1 for ITS row (worst case 4 writes). Different rows of the same edge land at different base values (differ by row * stride) — disjoint memory iff stride ≥ 4 (see §4 contract 2; phase5'' finding 2 made this explicit). Different edges have disjoint m.x values by construction. No multi-lane write to the same byte under the stated contracts. Race-free without atomics.
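The stride ≥ 4 condition can be confirmed by brute force: count how many lanes of one edge write each byte. The sketch below does that for the worst-case 4-byte write window; lpf_max_writers is a hypothetical helper and the 64-byte stride cap is an arbitrary limit for the scratch array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { RACE_ROWS = 8, RACE_MAX_STRIDE = 64 };

/* Largest number of lanes writing any single byte of one edge.
 * A result of 1 means the 8 rows never touch the same byte. */
static int lpf_max_writers(uint32_t stride) {
    uint8_t hits[RACE_ROWS * RACE_MAX_STRIDE + 4];
    int worst = 0;
    assert(stride <= RACE_MAX_STRIDE);
    memset(hits, 0, sizeof hits);
    for (uint32_t r = 0; r < RACE_ROWS; r++) {
        for (uint32_t k = 0; k < 4; k++) {   /* k = 0..3 maps to base-2..base+1 */
            uint32_t rel = r * stride + k;   /* offset relative to m.x - 2 */
            if (++hits[rel] > worst) worst = hits[rel];
        }
    }
    return worst;
}
```

At stride 3 the last byte of row r coincides with the first byte of row r+1, so two lanes write it; at stride 4 and above every byte has a single writer.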

6. Predicted M2'' (the gate per Phase 1)

Three regimes possible:

  • Compute-bound: 40 µs/frame compute → 25 K FPS → 1 600 Medge/s — clearly not the bottleneck.
  • Bandwidth-bound: 6.2 MB / 4 GB/s = 1.55 ms/frame → 645 FPS → 42 Medge/s (at 64 530 edges/frame). R'' = 42 / 48.3 ≈ 0.87.
  • Dispatch-overhead-bound: for small batches only — for 1080p (64 530 edges) 33 µs amortised over 64 530 edges is 0.5 ns/edge → negligible vs the 20 ns NEON floor.
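The three regime figures follow from simple unit conversions, sketched here in C. Function names are hypothetical, and the inputs (frame times, 33 µs overhead, 48.285 Medge/s NEON baseline) are the numbers quoted in this plan, not new measurements.

```c
#include <assert.h>

/* Throughput in Medge/s for a frame of n_edges edges taking frame_s seconds. */
static double medge_per_s(double n_edges, double frame_s) {
    assert(frame_s > 0.0);
    return n_edges / frame_s / 1e6;
}

/* Dispatch overhead per edge in nanoseconds when amortised over n_edges. */
static double overhead_ns_per_edge(double overhead_s, double n_edges) {
    assert(n_edges > 0.0);
    return overhead_s / n_edges * 1e9;
}

/* R'' = QPU throughput divided by the NEON baseline, both in Medge/s. */
static double r_ratio(double qpu_medge, double neon_medge) {
    assert(neon_medge > 0.0);
    return qpu_medge / neon_medge;
}
```

For example, medge_per_s(64530, 1.55e-3) is about 41.6, and r_ratio(42.0, 48.285) is about 0.87, matching the bandwidth-bound row.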

Predicted M2'' band (1080p frame batches): R'' ≈ 0.5–0.9. The bandwidth ceiling at R'' = 0.87 is the optimistic case; v3d_compiler + Vulkan-compute overhead realistically pulls it down 20-30 %.

Honest lower bound: R'' = 0.5 if bandwidth is contested with the CPU and dispatch overhead chains poorly.

What would invalidate the prediction: divergence on the fm and hev branches splits the subgroup into 2-4 paths; if v3d serialises divergent lanes more aggressively than expected, the per-lane wall-clock could double relative to the worst case predicted by flat compute. Phase 7'' will measure.

Divergence handling on V3D (phase5'' finding 3): on V3D 7.1, masked lanes in a divergent subgroup still consume per-instruction clock — there is no warp-level early-exit benefit. The natural branching structure in §4 (if (!fm) return; plus hev select) is correct as written. Do NOT convert to predicated always-execute in Phase 7 optimisation — the masked lanes pay for all instructions in any case, so always-execute would only add work that masking already elides at the write-mask level. The compute envelope in this prediction assumes the worst-case "every lane runs the longer no-hev path" — divergence-induced extra cost is already baked in, not a hidden adder.

7. What WILL / WILL NOT be touched

WILL (Phase 6 creates/modifies):

  • src/v3d_lpf_h_4_8.comp — the GLSL compute shader
  • tests/bench_v3d_lpf.c — bit-exact + throughput harness (mirrors bench_v3d_idct.c shape). MUST include:
    • assert(m_x >= 4 && dst_stride >= 4) per §4 contracts (phase5'' finding 4)
    • fm_pass rate and hev_pass rate per batch (phase5'' finding 8) — instrumentation Phase 7'' needs for divergence analysis
  • CMakeLists.txt — add shader compilation + bench target
  • tests/bench_concurrent.c — extend with --mode mixed-lpf etc (later, only if Phase 7'' YELLOW)
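The fm/hev pass-rate instrumentation could be computed on the host from the input rows before dispatch. The sketch below is one possible shape: lpf_stats, lpf_count_row and lpf_rate are hypothetical names, and the choice of denominator for the hev rate (fm-passing rows versus all rows) is an assumption the plan does not settle.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t rows, fm_pass, hev_pass; } lpf_stats;

/* Classify one row (px points at q0) and update the counters. */
static void lpf_count_row(const uint8_t *px, int E, int I, int H, lpf_stats *s) {
    assert(s != NULL);
    int p3 = px[-4], p2 = px[-3], p1 = px[-2], p0 = px[-1];
    int q0 = px[0],  q1 = px[1],  q2 = px[2],  q3 = px[3];
    s->rows++;
    int fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I &&
             abs(q1 - q0) <= I && abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
             abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
    if (!fm) return;                 /* lane would return early in the shader */
    s->fm_pass++;
    if (abs(p1 - p0) > H || abs(q1 - q0) > H) s->hev_pass++;
}

/* Ratio helper that tolerates an empty denominator. */
static double lpf_rate(uint32_t num, uint32_t den) {
    return den ? (double)num / (double)den : 0.0;
}
```

Reporting lpf_rate(s.fm_pass, s.rows) and lpf_rate(s.hev_pass, s.fm_pass) per batch would give Phase 7'' the two branch probabilities it needs for the divergence analysis.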

WILL NOT:

  • src/v3d_runner.{c,h} — works as-is for any compute kernel
  • tests/vp9_lpf_ref.c, tests/bench_neon_lpf.c — Phase 3 baselines stay immutable
  • Cycle 1 IDCT artifacts — orthogonal, untouched
  • external/ffmpeg-snapshot/ — Phase 2 vendored; byte-frozen

8. Phase 5'' review prep

Mandatory per dev_process.md ("Reviews are never skippable", per user-global CLAUDE.md). Cycle-1 phase 5 caught 2 RED bugs; cycle 2 deserves the same outside look.

Files for the reviewer to read verbatim:

  • docs/k2_deblock_phase1.md (goal)
  • docs/k2_deblock_phase2.md (situation, refs)
  • docs/k2_deblock_phase3.md (baseline M3'')
  • docs/k2_deblock_phase4.md (this file)
  • tests/vp9_lpf_ref.c (the C ref the QPU must match)
  • tests/bench_neon_lpf.c (M3'' methodology)
  • phase4.md + phase5.md (cycle 1 — context for what was already reviewed)
  • phase7.md + phase7_M4.md (cycle 1 — lessons)

Specific review prompts (the high-risk decisions):

  1. Orientation correctness. §4 pseudocode mirrors tests/vp9_lpf_ref.c line-for-line. Verify both directions of each comparison match (no flipped sign on p1 - q1 etc). This is the canonical "bit-exact will fail on first run" trap.
  2. Race safety claim in §5. Convincing? Different rows of the same edge land at offsets m.x + r * stride for r = 0..7 — guaranteed disjoint? What if stride < 8? (Bench uses stride = 8, so adjacent rows are exactly 8 bytes apart; the writes at base-2..base+1 span 4 bytes — fits within the row's 8-byte stride. ✓ unless I'm missing something.)
  3. Divergence cost. fm test fails → entire lane returns early. hev test selects between 2-pixel and 4-pixel paths. Within a 16-lane subgroup, mixed outcomes are common. Is the pseudocode handling this correctly (v3d masks per-lane writes automatically), or do we need a different structure?
  4. base - 4u underflow assumption. §4 contracts m.x ≥ 4. Robust enough? What if a future caller violates it — silent pixel-buffer-underread? Worth an assert in the bench-side harness when constructing meta.
  5. Anything missing. Same prompt as cycle 1.

9. Phase 6'' execution order

If Phase 5'' approves:

  1. Write src/v3d_lpf_h_4_8.comp (GLSL shader from §4)
  2. Write tests/bench_v3d_lpf.c (clone of bench_v3d_idct.c, swap kernel + meta layout)
  3. CMake wiring
  4. Build, run M1''
  5. If 100 % bit-exact → run M2'', compute R''
  6. Per Phase 1 decision table:
    • R'' ≥ 0.5 → run M4''
    • R'' < 0.5 → still run M4'' per cycle-1 calibration adjustment
  7. Phase 7'' verdict → Phase 9 lessons → cycle 3 (CDEF? MC? another kernel) OR honest close cycle 2 only.

10. Open questions Phase 4'' doesn't close

  • Branch-divergence cost measurement. Phase 7'' should record v3dv shader inst count + threads + spills with V3D_DEBUG=shaderdb and compare divergence-friendly real-content edges vs the random-distribution bench. If real-content has very uniform branches (e.g., all-pass-fm runs), per-frame perf improves over the predicted band.
  • Per-edge meta packing. Cycle 1 v5 showed that manually packing storage didn't help. Skip the pre-emptive optimisation here.
  • Vertical variant. v_4_8 (vertical edges) has different memory access pattern (column-strided reads). Cycle 2 v2 if v1 succeeds.
  • wd=8 / wd=16 paths. Bigger filters with more conditional branches. Cycle 3+ if cycle 2 succeeds.