daedalus-fourier/docs/k2_deblock_phase4.md
marfrit 36eca40ff2 Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute,
    masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.

Performance:

  M2'' = 19.645 Medge/s     (50.9 ns/edge)
  M3'' = 48.285 Medge/s     (NEON baseline from phase3)
  R''  = 0.41               (ORANGE band - doesn't auto-close per
                             cycle-1 calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):

  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling
                                          (per-core 7-9; same
                                          bandwidth-saturation as
                                          cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp                — GLSL kernel
- tests/bench_v3d_lpf.c                 — M1'' + M2'' harness with
                                          contract asserts + fm/hev
                                          pass-rate instrumentation
- tests/bench_concurrent_lpf.c          — M4'' pthread bench
                                          (mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md       — plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00

---
cycle: 2
phase: 4
status: open (awaiting Phase 5'' review)
date_opened: 2026-05-18
parent: k2_deblock_phase3.md
template_doc: phase4.md (cycle 1)
target_kernel: VP9 loop filter h_4_8 — 4-tap inner, horizontal, 8-pixel edge
expected_artifacts: src/v3d_lpf_h_4_8.comp, tests/bench_v3d_lpf.c, CMakeLists.txt updates
---
# Cycle 2, Phase 4 — Plan QPU LPF kernel
This doc is compact. Cycle-1 `phase4.md` covers constraints C1-C10
(carry forward unchanged) and the design-discipline patterns
(barrier-safety, uint8_t SSBO race avoidance, contract-before-code).
Phase 4'' references those rather than re-deriving.
## 1. Constraints (carried from cycle 1 phase4.md §1)
All 10 constraints apply unchanged. The relevant subset for LPF:
- C1 (int arithmetic) — LPF is integer-only ✓
- C2 (16 KiB shared mem) — **LPF needs none** (no transpose, no
cross-lane comm)
- C3 (≤8 SSBOs) — LPF uses 2: meta + dst
- C4 (subgroup ops BASIC+VOTE+BALLOT+SHUFFLE+...) — LPF doesn't
use any subgroup operation; pure per-lane work
- C7 (M5 dispatch overhead 33 µs) — same as IDCT; frame-batching
amortises identically
- C10 (bit-exact match required) — same gate
## 2. Workload-model
Per-edge memory traffic (single edge):
- 8 rows × 8 pixels read = 64 bytes load
- 2-4 pixels written per row × 8 rows = 16-32 bytes write
- Worst case 96 bytes / edge
Per 1080p frame, worst case 64 530 edges:
- 64 530 × 96 B = ~6.2 MB total traffic (cf. IDCT cycle 1: 8 MB)
- At GPU's measured 4 GB/s share: 1.55 ms / frame = 645 FPS-eq
(32 % faster than IDCT bandwidth ceiling because traffic is
lower)
Per-edge compute (1080p, worst case):
- ~25 ALU ops/lane × 8 lanes/edge (= row count, see §3) = 200
lane-ops/edge × 64 530 / 16 (SIMD width) ≈ 800 K SIMD-cycles
- At v3d 92 GFLOPS theoretical × 23 % SGEMM-style util = 21 GOPS
effective → 40 µs compute per frame
- **Compute < dispatch overhead.** LPF is overhead-bound, not
compute-bound.
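The arithmetic in this section can be reproduced with a few lines of host code. This is a minimal sketch: the struct and function names are hypothetical, and the constants (64 B read, 32 B written, ~25 ALU ops per lane, 16-wide SIMD) are taken from the bullets above.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case per-edge traffic from the bullets above: 64 B read + 32 B written. */
enum { LPF_READ_BYTES = 64, LPF_WRITE_BYTES_MAX = 32 };

/* Hypothetical container for the per-frame estimates. */
typedef struct {
    double traffic_bytes; /* total bytes moved per frame */
    double bandwidth_ms;  /* time to move them at the given bandwidth */
    double simd_cycles;   /* lane-ops divided by the 16-lane SIMD width */
    double compute_us;    /* simd_cycles divided by the effective op rate */
} lpf_frame_estimate;

static lpf_frame_estimate lpf_estimate_frame(uint32_t n_edges,
                                             double bytes_per_sec,
                                             double effective_ops_per_sec)
{
    assert(n_edges > 0 && bytes_per_sec > 0.0 && effective_ops_per_sec > 0.0);
    lpf_frame_estimate e;
    e.traffic_bytes = (double)n_edges * (LPF_READ_BYTES + LPF_WRITE_BYTES_MAX);
    e.bandwidth_ms  = e.traffic_bytes / bytes_per_sec * 1e3;
    /* ~25 ALU ops per lane, 8 lanes (rows) per edge, 16 lanes per SIMD op. */
    e.simd_cycles   = (double)n_edges * 25.0 * 8.0 / 16.0;
    /* Mirrors the text: SIMD-cycles divided by the effective op rate. */
    e.compute_us    = e.simd_cycles / effective_ops_per_sec * 1e6;
    return e;
}
```

Calling `lpf_estimate_frame(64530, 4e9, 21e9)` gives roughly 6.2 MB of traffic, about 1.55 ms at the bandwidth ceiling, and a little under 40 µs of compute, which matches the figures quoted above.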
## 3. Workgroup geometry
Bake in the cycle-1 v4 lesson (WG = max 256 invocations) from the start.
- **`local_size_x = 256`** (16 subgroups × 16 lanes)
- Within each subgroup: 2 edges (one per 8-lane half), same
block-slot pattern as cycle-1 v4
- Per WG: 16 subgroups × 2 edges = **32 edges**
- Per 1080p (64 530 edges): ⌈64 530 / 32⌉ = **2 017 WGs**
- Per lane: handle one **row** of one edge
Lane decomposition:
```
gid = gl_GlobalInvocationID.x
wg_id = gid / 256
lane_in_wg = gid & 255
sg_in_wg = lane_in_wg >> 4 // 0..15
lane_in_sg = lane_in_wg & 15
edge_slot = lane_in_sg >> 3 // 0 (lanes 0..7) or 1 (8..15)
row = lane_in_sg & 7 // 0..7
edge_local = sg_in_wg * 2 + edge_slot // 0..31 in WG
edge_idx = wg_id * 32 + edge_local
oob = edge_idx >= n_edges
```
**No barrier needed.** Each lane is fully independent — no
cross-lane data flow, no transpose. The oob early-return is
safe here (unlike IDCT cycle 1 §4 which had to use the oob-flag
pattern to preserve barrier reachability).
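The decomposition above can be checked on the host before any shader is written. The sketch below is a plain C model of the index math; the type and function names are hypothetical. It brute-forces that every (edge, row) pair is claimed by exactly one in-bounds lane.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t edge_idx; /* global edge index */
    uint32_t row;      /* 0..7, row of that edge handled by this lane */
    bool     oob;      /* true if edge_idx is past the end of the batch */
} lpf_lane;

/* Host-side model of the lane decomposition (WG = 256, 2 edges per subgroup). */
static lpf_lane lpf_decompose(uint32_t gid, uint32_t n_edges)
{
    uint32_t wg_id      = gid / 256u;
    uint32_t lane_in_wg = gid & 255u;
    uint32_t sg_in_wg   = lane_in_wg >> 4;  /* 0..15 */
    uint32_t lane_in_sg = lane_in_wg & 15u;
    uint32_t edge_slot  = lane_in_sg >> 3;  /* 0 or 1 */
    lpf_lane l;
    l.row      = lane_in_sg & 7u;
    l.edge_idx = wg_id * 32u + sg_in_wg * 2u + edge_slot;
    l.oob      = l.edge_idx >= n_edges;
    return l;
}

/* Workgroups needed: ceil(n_edges / 32). */
static uint32_t lpf_workgroup_count(uint32_t n_edges)
{
    assert(n_edges > 0);
    return (n_edges + 31u) / 32u;
}

/* True if every (edge, row) pair is hit by exactly one in-bounds lane. */
static bool lpf_covers_exactly_once(uint32_t n_edges)
{
    size_t n_rows = (size_t)n_edges * 8u;
    uint8_t *hits = calloc(n_rows, 1);
    if (!hits) return false;
    uint32_t n_lanes = lpf_workgroup_count(n_edges) * 256u;
    bool ok = true;
    for (uint32_t gid = 0; gid < n_lanes; gid++) {
        lpf_lane l = lpf_decompose(gid, n_edges);
        if (l.oob) continue; /* models the shader's early return */
        if (++hits[(size_t)l.edge_idx * 8u + l.row] != 1) ok = false;
    }
    for (size_t i = 0; i < n_rows; i++)
        if (hits[i] != 1) ok = false;
    free(hits);
    return ok;
}
```

For the 1080p worst case, `lpf_workgroup_count(64530)` returns 2017, and the last workgroup has 14 out-of-bounds edge slots that the early return skips.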
## 4. Per-thread algorithm
```glsl
if (edge_idx >= pc.n_edges) return;  // safe: no barrier follows
uvec4 m   = u_meta.meta[edge_idx];
uint base = m.x + row * pc.dst_stride_u8;  // m.x = dst byte offset of row-0 col-0 of this edge
int E = int(m.y), I = int(m.z), H = int(m.w);

int p3 = int(u_dst.dst[base - 4u]);
int p2 = int(u_dst.dst[base - 3u]);
int p1 = int(u_dst.dst[base - 2u]);
int p0 = int(u_dst.dst[base - 1u]);
int q0 = int(u_dst.dst[base + 0u]);
int q1 = int(u_dst.dst[base + 1u]);
int q2 = int(u_dst.dst[base + 2u]);
int q3 = int(u_dst.dst[base + 3u]);

bool fm = abs(p3-p2) <= I && abs(p2-p1) <= I && abs(p1-p0) <= I &&
          abs(q1-q0) <= I && abs(q2-q1) <= I && abs(q3-q2) <= I &&
          abs(p0-q0)*2 + (abs(p1-q1) >> 1) <= E;
if (!fm) return;

bool hev = abs(p1-p0) > H || abs(q1-q0) > H;
if (hev) {
    int f = clamp(p1 - q1, -128, 127);
    f = clamp(3*(q0-p0) + f, -128, 127);
    int f1 = min(f + 4, 127) >> 3;
    int f2 = min(f + 3, 127) >> 3;
    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
} else {
    int f = clamp(3*(q0-p0), -128, 127);
    int f1 = min(f + 4, 127) >> 3;
    int f2 = min(f + 3, 127) >> 3;
    u_dst.dst[base - 1u] = uint8_t(clamp(p0 + f2, 0, 255));
    u_dst.dst[base + 0u] = uint8_t(clamp(q0 - f1, 0, 255));
    int fp = (f1 + 1) >> 1;
    u_dst.dst[base - 2u] = uint8_t(clamp(p1 + fp, 0, 255));
    u_dst.dst[base + 1u] = uint8_t(clamp(q1 - fp, 0, 255));
}
```
Mirrors `tests/vp9_lpf_ref.c` line-for-line. Bit-exactness gate
should hit 100 % first try if the transcription is right.
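For spot-checking single rows by hand, the same logic can be written as a scalar C function. This is a sketch only and is not the contents of `tests/vp9_lpf_ref.c`; the names `lpf_row_h4`, `clampi` and `mini` are hypothetical. It folds the two branches together (the `p1 - q1` term is simply zero on the no-hev path) and assumes `>>` on a negative `int` is an arithmetic shift, as it is in GLSL and on common C compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }
static int mini(int a, int b) { return a < b ? a : b; }

/* Filters one row in place. px points at q0; px[-4..3] must be valid.
 * Returns true if the fm test passed (the row was written). */
static bool lpf_row_h4(uint8_t *px, int E, int I, int H)
{
    assert(px != NULL);
    int p3 = px[-4], p2 = px[-3], p1 = px[-2], p0 = px[-1];
    int q0 = px[0],  q1 = px[1],  q2 = px[2],  q3 = px[3];

    bool fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I &&
              abs(q1 - q0) <= I && abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
              abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
    if (!fm) return false;

    bool hev = abs(p1 - p0) > H || abs(q1 - q0) > H;
    /* hev path adds clamp(p1 - q1); the no-hev path adds nothing. */
    int f = hev ? clampi(p1 - q1, -128, 127) : 0;
    f = clampi(3 * (q0 - p0) + f, -128, 127);
    int f1 = mini(f + 4, 127) >> 3; /* assumes arithmetic shift for negatives */
    int f2 = mini(f + 3, 127) >> 3;
    px[-1] = (uint8_t)clampi(p0 + f2, 0, 255);
    px[0]  = (uint8_t)clampi(q0 - f1, 0, 255);
    if (!hev) {
        int fp = (f1 + 1) >> 1;
        px[-2] = (uint8_t)clampi(p1 + fp, 0, 255);
        px[1]  = (uint8_t)clampi(q1 - fp, 0, 255);
    }
    return true;
}
```

As a worked example, the row `100 100 100 100 | 110 110 110 110` with E = 40, I = 10, H = 5 passes `fm`, takes the no-hev path, and becomes `100 100 102 104 | 106 108 110 110`.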
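**uint** for `base`: the GLSL `base - 4u` is a `uint - uint`
expression, so it wraps around to a huge offset instead of going
negative whenever `base < 4`. For row 0, `base == m.x`, hence the
`m.x ≥ 4` contract below.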
**Contracts (revised per phase5'' findings 2 + 4):**
1. The host guarantees `m.x ≥ 4` for every edge.
2. The host guarantees `dst_stride_u8 ≥ 4` for every dispatch.
(Required for race safety, see §5: rows `r` and `r+1` write to
`[m.x + r·s - 2 .. m.x + r·s + 1]` and `[m.x + (r+1)·s - 2 .. m.x + (r+1)·s + 1]`,
disjoint iff `s ≥ 4`.)
3. **Phase 6 MUST add `assert(m_x >= 4 && dst_stride >= 4)` in
`bench_v3d_lpf.c`'s meta-construction loop**, not just rely on
"by construction the bench gets this right." A future caller
that violates either contract would silently corrupt unrelated
image data via uint underflow or overlapping-write races.
Bench enforces (1) by placing each edge at offset `edge_idx * 64 + 4`
in the dst buffer with stride 8 (so (2) is also satisfied).
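A minimal sketch of what the meta-construction loop's contract check could look like follows. The `lpf_meta` struct, the helper names and the `n_edges * 64` buffer size are assumptions for illustration; only the `edge_idx * 64 + 4` offset, the stride of 8 and the assert condition come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host mirror of one uvec4 meta entry: (dst_offset, E, I, H). */
typedef struct { uint32_t dst_offset, E, I, H; } lpf_meta;

/* Bench layout from the text: 64 bytes per edge, 4-byte left margin, stride 8. */
enum { BENCH_EDGE_BYTES = 64, BENCH_EDGE_MARGIN = 4, BENCH_STRIDE = 8 };

static lpf_meta lpf_make_meta(uint32_t edge_idx, uint32_t E, uint32_t I,
                              uint32_t H, uint32_t dst_stride)
{
    lpf_meta m = { edge_idx * BENCH_EDGE_BYTES + BENCH_EDGE_MARGIN, E, I, H };
    /* Contracts 1 and 2: no uint underflow on base - 4u, no overlapping row writes. */
    assert(m.dst_offset >= 4 && dst_stride >= 4);
    return m;
}

/* Assumed dst buffer size: each edge owns one 64-byte slot (8 rows x 8 bytes). */
static size_t lpf_bench_dst_bytes(uint32_t n_edges)
{
    return (size_t)n_edges * BENCH_EDGE_BYTES;
}
```

With this layout, row `r` of edge `e` reads bytes `e*64 + 8r .. e*64 + 8r + 7`, so every access stays inside the edge's own slot. The asserts only fire in builds where `NDEBUG` is not defined.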
## 5. Memory layout / SSBOs
| binding | name | type | bytes | usage |
|---|---|---|---|---|
| 0 | `meta` | `readonly uvec4[]` | 16 / edge | (dst_offset, E, I, H) per edge |
| 1 | `dst` | `uint8_t[]` | per-frame | pixel buffer, read-write |
Push constants (16 B total):
```glsl
layout(push_constant) uniform PC {
uint n_edges;
uint dst_stride_u8;
uint _pad0;
uint _pad1;
} pc;
```
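On the host side, the push-constant block needs a byte-for-byte mirror. The sketch below is one way to pin the layout down at compile time; the struct and function names are hypothetical, and it assumes the four `uint` members are laid out contiguously at offsets 0, 4, 8 and 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host mirror of the PC block: four tightly packed 32-bit words. */
typedef struct {
    uint32_t n_edges;
    uint32_t dst_stride_u8;
    uint32_t pad0;
    uint32_t pad1;
} lpf_push_constants;

/* Fail the build if the host layout drifts from the 16-byte shader block. */
_Static_assert(sizeof(lpf_push_constants) == 16, "PC block must be 16 bytes");
_Static_assert(offsetof(lpf_push_constants, dst_stride_u8) == 4,
               "dst_stride_u8 must sit at byte offset 4");

static lpf_push_constants lpf_make_push_constants(uint32_t n_edges,
                                                  uint32_t dst_stride_u8)
{
    assert(dst_stride_u8 >= 4); /* contract 2 from §4 */
    lpf_push_constants pc = { n_edges, dst_stride_u8, 0u, 0u };
    return pc;
}
```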
**Race safety:** each lane writes to byte addresses `base-2, base-1,
base+0, base+1` for ITS row (worst case 4 writes). Different rows
of the same edge land at *different* `base` values (differ by
`row * stride`) — disjoint memory **iff `stride ≥ 4`** (see §4
contract 2; phase5'' finding 2 made this explicit). Different
edges have disjoint `m.x` values by construction. No multi-lane
write to the same byte under the stated contracts. Race-free
without atomics.
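The `stride ≥ 4` condition can be confirmed by brute force. The following sketch (function name hypothetical) marks the four bytes each of the 8 rows may write, `base-2 .. base+1`, and reports whether any byte is claimed twice. It covers write/write overlap within one edge only, which is the claim made above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LPF_ROWS = 8, LPF_MAX_STRIDE = 64 };

/* True if no byte is written by two different rows of the same edge. */
static bool lpf_row_writes_disjoint(uint32_t stride)
{
    assert(stride <= LPF_MAX_STRIDE);
    /* Index 0 stands for byte m.x - 2, the lowest byte row 0 can write. */
    uint8_t written[LPF_ROWS * LPF_MAX_STRIDE + 4];
    memset(written, 0, sizeof written);
    for (uint32_t row = 0; row < LPF_ROWS; row++) {
        for (uint32_t k = 0; k < 4; k++) { /* base-2, base-1, base+0, base+1 */
            uint32_t idx = row * stride + k;
            if (written[idx]) return false; /* a second row hit this byte */
            written[idx] = 1;
        }
    }
    return true;
}
```

The function returns false for strides 0 to 3 and true from 4 upward, including the bench's stride of 8.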
## 6. Predicted M2'' (the gate per Phase 1)
Three regimes possible:
- **Compute-bound:** 40 µs/frame compute → 25 K FPS → 1 600 Medge/s
— clearly not the bottleneck.
- **Bandwidth-bound:** 6.2 MB / 4 GB/s = 1.55 ms/frame → 645 FPS →
**42 Medge/s** (at 64 530 edges/frame). R'' = 42 / 48.3 ≈ **0.87**.
- **Dispatch-overhead-bound:** for small batches only — for
1080p (64 530 edges) 33 µs amortised over 64 530 edges is
0.5 ns/edge → negligible vs the 20 ns NEON floor.
**Predicted M2'' band (1080p frame batches): R'' ≈ 0.5-0.9.**
The bandwidth ceiling at R = 0.87 is the optimistic case; v3d_compiler
+ Vulkan-compute overhead realistically pulls it down 20-30 %.
Honest lower bound: R'' = 0.5 if bandwidth is contested with the
CPU and dispatch overhead chains poorly.
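The three regimes reduce to two small conversions: frame time to Medge/s (and from there to R''), and dispatch overhead to nanoseconds per edge. A sketch with hypothetical names, using the 48.285 Medge/s NEON figure as the baseline:

```c
#include <assert.h>

typedef struct {
    double medge_per_s; /* throughput in millions of edges per second */
    double r;           /* ratio against the NEON baseline (R'') */
} lpf_rate;

/* Converts a per-frame time into Medge/s and R''. */
static lpf_rate lpf_rate_from_frame_time(double frame_seconds, double n_edges,
                                         double neon_medge_per_s)
{
    assert(frame_seconds > 0.0 && n_edges > 0.0 && neon_medge_per_s > 0.0);
    lpf_rate out;
    out.medge_per_s = n_edges / frame_seconds / 1e6;
    out.r = out.medge_per_s / neon_medge_per_s;
    return out;
}

/* Dispatch overhead amortised over one batch, in nanoseconds per edge. */
static double lpf_dispatch_ns_per_edge(double dispatch_us, double n_edges)
{
    assert(n_edges > 0.0);
    return dispatch_us * 1e3 / n_edges;
}
```

Feeding in 1.55 ms per frame gives about 41.6 Medge/s, which the text rounds to 42 before dividing by 48.3 to reach R'' ≈ 0.87. The 40 µs compute figure gives roughly 1 600 Medge/s, and 33 µs over 64 530 edges comes to about 0.5 ns per edge.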
**What would invalidate the prediction:** divergence on the `fm`
and `hev` branches splits the subgroup into 2-4 paths; if v3d
serialises divergent lanes more aggressively than expected, the
per-lane wall-clock could double relative to the worst case
predicted by flat compute. Phase 7'' will measure.
**Divergence handling on V3D** (phase5'' finding 3): on V3D 7.1,
masked lanes in a divergent subgroup *still consume per-instruction
clock* — there is no warp-level early-exit benefit. The natural
branching structure in §4 (`if (!fm) return;` plus hev select)
is correct as written. **Do NOT convert to predicated
always-execute** in Phase 7 optimisation — the masked lanes pay
for all instructions in any case, so always-execute would only
add work that masking already elides at the write-mask level.
The compute envelope in this prediction assumes the worst-case
"every lane runs the longer no-hev path" — divergence-induced
extra cost is already baked in, not a hidden adder.
## 7. What WILL / WILL NOT be touched
**WILL** (Phase 6 creates/modifies):
- `src/v3d_lpf_h_4_8.comp` — the GLSL compute shader
- `tests/bench_v3d_lpf.c` — bit-exact + throughput harness
(mirrors `bench_v3d_idct.c` shape). **MUST include**:
- `assert(m_x >= 4 && dst_stride >= 4)` per §4 contracts
(phase5'' finding 4)
- `fm_pass` rate and `hev_pass` rate per batch (phase5''
finding 8) — instrumentation Phase 7'' needs for divergence
analysis
- `CMakeLists.txt` — add shader compilation + bench target
- `tests/bench_concurrent.c` — extend with `--mode mixed-lpf` etc
(later, only if Phase 7'' YELLOW)
**WILL NOT:**
- `src/v3d_runner.{c,h}` — works as-is for any compute kernel
- `tests/vp9_lpf_ref.c`, `tests/bench_neon_lpf.c` — Phase 3
baselines stay immutable
- Cycle 1 IDCT artifacts — orthogonal, untouched
- `external/ffmpeg-snapshot/` — Phase 2 vendored; byte-frozen
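The `fm_pass` / `hev_pass` instrumentation listed under WILL could be computed on the host from the unfiltered input, before the kernel modifies `dst`. The sketch below uses hypothetical names throughout. It counts `hev_pass` only among rows that passed `fm`, mirroring the order of tests in §4; whether the bench should report it against all rows instead is not specified here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-batch branch counters for divergence analysis. */
typedef struct { uint32_t rows, fm_pass, hev_pass; } lpf_branch_stats;

/* Classifies one unfiltered row; px points at q0, px[-4..3] must be valid. */
static void lpf_count_row(const uint8_t *px, int E, int I, int H,
                          lpf_branch_stats *s)
{
    assert(px != NULL && s != NULL);
    int p3 = px[-4], p2 = px[-3], p1 = px[-2], p0 = px[-1];
    int q0 = px[0],  q1 = px[1],  q2 = px[2],  q3 = px[3];
    s->rows++;
    int fm = abs(p3 - p2) <= I && abs(p2 - p1) <= I && abs(p1 - p0) <= I &&
             abs(q1 - q0) <= I && abs(q2 - q1) <= I && abs(q3 - q2) <= I &&
             abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) <= E;
    if (!fm) return; /* lane would return early in the shader */
    s->fm_pass++;
    if (abs(p1 - p0) > H || abs(q1 - q0) > H)
        s->hev_pass++; /* lane would take the 2-pixel path */
}

/* Safe ratio helper: returns 0 when the denominator is 0. */
static double lpf_fraction(uint32_t num, uint32_t den)
{
    return den ? (double)num / (double)den : 0.0;
}
```

Reporting `lpf_fraction(s.fm_pass, s.rows)` and `lpf_fraction(s.hev_pass, s.fm_pass)` per batch would give Phase 7'' the branch mix it needs to interpret divergence cost.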
## 8. Phase 5'' review prep
Mandatory per `dev_process.md` ("Reviews are never skippable", per
user-global CLAUDE.md). Cycle-1 phase 5 caught 2 RED bugs; cycle 2
deserves the same outside look.
Files for the reviewer to read verbatim:
- `docs/k2_deblock_phase1.md` (goal)
- `docs/k2_deblock_phase2.md` (situation, refs)
- `docs/k2_deblock_phase3.md` (baseline M3'')
- `docs/k2_deblock_phase4.md` (this file)
- `tests/vp9_lpf_ref.c` (the C ref the QPU must match)
- `tests/bench_neon_lpf.c` (M3'' methodology)
- `phase4.md` + `phase5.md` (cycle 1 — context for what was
already reviewed)
- `phase7.md` + `phase7_M4.md` (cycle 1 — lessons)
Specific review prompts (the high-risk decisions):
1. **Orientation correctness.** §4 pseudocode mirrors
`tests/vp9_lpf_ref.c` line-for-line. Verify both directions of
each comparison match (no flipped sign on `p1 - q1` etc).
This is the canonical "bit-exact will fail on first run" trap.
2. **Race safety claim in §5.** Convincing? Different rows of the
same edge land at offsets `m.x + r * stride` for r = 0..7 —
guaranteed disjoint? What if `stride < 8`? (Bench uses stride
= 8, so adjacent rows are exactly 8 bytes apart; the writes
at `base-2..base+1` span 4 bytes — fits within the row's
8-byte stride. ✓ unless I'm missing something.)
3. **Divergence cost.** `fm` test fails → entire lane returns
early. `hev` test selects between 2-pixel and 4-pixel paths.
Within a 16-lane subgroup, mixed outcomes are common. Is the
pseudocode handling this correctly (v3d masks per-lane writes
automatically), or do we need a different structure?
4. **`base - 4u` underflow assumption.** §4 contracts `m.x ≥ 4`.
Robust enough? What if a future caller violates it — silent
pixel-buffer-underread? Worth an assert in the bench-side
harness when constructing meta.
5. **Anything missing.** Same prompt as cycle 1.
## 9. Phase 6'' execution order
If Phase 5'' approves:
1. Write `src/v3d_lpf_h_4_8.comp` (GLSL shader from §4)
2. Write `tests/bench_v3d_lpf.c` (clone of `bench_v3d_idct.c`,
swap kernel + meta layout)
3. CMake wiring
4. Build, run M1''
5. If 100 % bit-exact → run M2'', compute R''
6. Per Phase 1 decision table:
- R'' ≥ 0.5 → run M4''
- R'' < 0.5 → still run M4'' per cycle-1 calibration adjustment
7. Phase 7'' verdict → Phase 9 lessons → cycle 3 (CDEF? MC?
another kernel), or an honest close after cycle 2.
## 10. Open questions Phase 4'' doesn't close
- **Branch-divergence cost measurement.** Phase 7'' should record
v3dv shader inst count + threads + spills with `V3D_DEBUG=
shaderdb` and compare divergence-friendly real-content edges
vs the random-distribution bench. If real-content has very
uniform branches (e.g., all-pass-`fm` runs), per-frame perf
improves over the predicted band.
- **Per-edge meta packing.** Cycle 1 v5 showed that manually
packing storage didn't help. Skip the pre-emptive optimisation
here.
- **Vertical variant.** `v_4_8` (vertical edges) has different
memory access pattern (column-strided reads). Cycle 2 v2 if
v1 succeeds.
- **wd=8 / wd=16 paths.** Bigger filters with more conditional
branches. Cycle 3+ if cycle 2 succeeds.