daedalus-fourier/docs/k2_deblock_phase5.md

---
cycle: 2
phase: 5
status: closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied
date_opened: 2026-05-18
date_closed: 2026-05-18
parent: k2_deblock_phase4.md
reviewer: Claude Sonnet (general-purpose Agent, fresh context)
plan_author: Claude Opus 4.7 (this session)
verdict: PASS-WITH-REVISIONS
---

# Cycle 2, Phase 5 — Second-Model Review of LPF Plan

Same handoff mechanism as cycle 1's Phase 5: in-session `Agent` tool with
`subagent_type=general-purpose, model=sonnet`, fresh context, files read
directly off disk, no Claude-Opus curation.

The reviewer was given the 5 prompts that `k2_deblock_phase4.md §8`
itself flagged plus "anything else."

Outcome: **no RED-class bugs** (cycle 1's two RED findings did not
recur — orientation is independently verified against the C reference,
no barrier exists so no barrier-UB risk, no subgroup operations used
so the cycle-1 finding-6 false-constraint pattern doesn't apply).
**Two YELLOW findings** to address before Phase 6:
- F2: stride contract missing from §5 race-safety argument
- F4: assert that should be MUST not "worth considering"

One additional YELLOW (F3) on V3D divergence behavior — informational
note to add to §6 so Phase 7'' doesn't go down the wrong optimisation
rabbit hole. One YELLOW (F8) on bench instrumentation — bake into
Phase 6 implementation requirements.

## Review (verbatim)

````markdown
## Verdict

**PASS-WITH-REVISIONS** — the pseudocode is arithmetically correct against the C reference, and no RED-class correctness bugs were found. One YELLOW issue (missing stride contract) and one YELLOW issue (divergence-cost accounting) need to be acknowledged and documented; neither requires rework of the algorithm. No safety net against the `base - 4u` underflow is present in the bench harness and should be added.

---

## Per-prompt findings

**1. Orientation correctness — GREEN**

The pseudocode in §4 matches `tests/vp9_lpf_ref.c` line-for-line. Traced independently:

- `fm` mask: all six abs-diff comparisons against `I`, and `abs(p0-q0)*2 + (abs(p1-q1) >> 1) <= E`, appear in the same order as the C reference. No term is swapped or missing.
- hev path: `f = clamp(p1 - q1, -128, 127)` — sign is correct (`p1 - q1`, not `q1 - p1`). The second clip `clamp(3*(q0-p0) + f, ...)` matches. `f1 = min(f+4,127)>>3`, `f2 = min(f+3,127)>>3` match. Writes: `p0 + f2` at `base-1`, `q0 - f1` at `base+0` — correct direction.
- no-hev path: `f = clamp(3*(q0-p0), ...)` — no `p1-q1` term, correct (the no-hev path drops the initial `p1-q1` from `f`). `fp = (f1+1)>>1` shared between `p1+fp` and `q1-fp`. Writes: `base-2` gets `p1+fp`, `base+1` gets `q1-fp`. The C reference writes `dst[-2] = clip_u8(p1+fp)` and `dst[+1] = clip_u8(q1-fp)` — matches exactly.
- Pixel reads: `p3..p0` at `base-4..base-1`, `q0..q3` at `base+0..base+3`. Row advance via `row * pc.dst_stride_u8`. The C reference iterates `dst += stride` and reads `dst[-4..+3]` — same semantics.
- GLSL `min(int, int)` works correctly (GLSL 4.50 provides `genIType min(genIType x, genIType y)`).

The bit-exact gate should pass on first run if the transcription holds through the actual `.comp` file.

**2. Race safety claim in §5 — YELLOW**

The §5 disjointness claim is **mostly correct but missing a lower bound on stride**. Examining the geometry: row `r` writes to byte addresses `[base + r*stride - 2 .. base + r*stride + 1]` (4 bytes). For rows `r` and `r+1` to be disjoint, we need `r*stride + 1 < (r+1)*stride - 2`, i.e., `stride > 3`, i.e., `stride ≥ 4`. The bench uses stride = 8, so there is no actual race in the bench. But §5's race-safety argument says "disjoint memory" without stating the stride lower bound. If a future caller passes stride=2 (a valid, if unusual, packed-pixel scenario), rows 0 and 1 would overlap and the writes would race silently. The contract in §4 currently only asserts `m.x ≥ 4`; it does not assert `dst_stride_u8 ≥ 4`. This is not the same shape as cycle-1 phase5 finding 5 (that was sub-word packing within a single 32-bit word across lanes in the same row), but it is the same class: a hidden assumption that the correctness of the write-back relies on.

The plan is safe for the bench as written; the gap is that it doesn't document the stride contract, which means a future caller cannot know what the shader requires. This should be stated explicitly.

**3. Divergence cost — YELLOW**

The §6 prediction of R'' ≈ 0.5–0.9 is plausible but the divergence penalty analysis is incomplete. Within a 16-lane subgroup processing 2 edges (8 rows each), `fm` failure causes a `return` that retires 1 or more lanes early. V3D's compiler will mask the retired lanes for the rest of the subgroup's execution — those lanes are not actually idle; they still consume subgroup clock cycles on every instruction. The cost of a divergent `return` is that the surviving lanes continue executing to subgroup completion. However, V3D 7.1 differs from desktop GPUs in one important way: the v3d_compiler controls divergence handling entirely in software, via per-lane predicate masks. There are no warp-level early-exit savings. An `if (!fm) return` that triggers for half the lanes in the subgroup means the other half continues through the full hev/no-hev path while the failed lanes are masked. The per-lane wall-clock for all lanes is the same. The §6 prediction does acknowledge "divergence on `fm` and `hev` branches splits the subgroup into 2-4 paths" but then only mentions it as a risk without quantifying it. If `fm` fails for 50 % of rows (a realistic real-bitstream rate), the effective throughput is the same as if `fm` passed for 100 % — because the masked lanes still consume clock. This makes the prediction slightly optimistic: R'' is bounded by the bandwidth ceiling regardless, but the divergence cost means the compute overhead does not improve when `fm` skips rows. Whether restructuring to always-execute would help: it would not, because the masked lanes in V3D do pay for all instructions anyway. There is no benefit to removing the `if (!fm) return` in terms of V3D wall-clock. The `hev` branch within the fm-passing path is a different shape: it selects between 2-write and 4-write paths; the non-hev path executes more instructions. The shorter hev path wastes no clock on V3D (masked lanes pay for max-path instructions in any case). Given that, restructuring to always-execute the longer no-hev path with predicated writes would make things **worse** (more instructions always executed vs. fewer instructions with masking). The plan's choice of the natural branching structure is correct. The YELLOW here is that the prediction doesn't spell this out — it should note that "divergence has no throughput benefit because masked lanes still consume clock on V3D."

**4. `base - 4u` underflow assumption — YELLOW**

The plan correctly identifies the risk in §4 ("Contract: host guarantees `m.x ≥ 4`") and notes the bench enforces it by placing edges at `edge_idx * 64 + 4`. However, the bench harness (`bench_neon_lpf.c`) constructs `meta[]` entries with no runtime assert. If `n_edges` is large enough that the edge layout is correct by construction and cannot be violated, this is fine — but it's not a runtime check. The plan says "Worth an assert in the bench-side harness when constructing meta" as a review question, then never commits to adding it. Given that cycle-1 lessons explicitly mention contract enforcement, this should be a concrete "Phase 6 MUST add assert" requirement in the plan, not a question for the reviewer. The risk is concrete: a caller who miscomputes `m.x` and passes a value less than 4 will get a silent pixel-buffer-underread wrapped around to a high address. In the QPU context this would silently corrupt unrelated image data. The fix is one line (`assert(m.x >= 4)` in the bench's `meta[]` construction loop); the plan should commit to it.

**5. Anything missing — GREEN with one note**

The three cycle-1 phase5 bugs do not recur here in the same form:

- **Finding 5 (write race on packed dst):** The LPF kernel uses a per-lane byte write pattern that is intrinsically race-free for well-separated rows — the pattern is the same fix that cycle-1 applied. Not a risk here provided the stride contract holds (see finding 2 above).
- **Finding 7 (barrier UB before early-return):** The LPF kernel has no barrier, and §3 correctly notes "No barrier needed." The early `if (edge_idx >= pc.n_edges) return;` is safe here, unlike cycle-1. The plan explicitly calls this out with the correct reasoning.
- **Finding 6 (subgroup ops table error):** The LPF kernel uses no subgroup operations at all; C4 compliance is trivially satisfied regardless of which ops are available.

One unstated assumption worth noting: the push constant `dst_stride_u8` is passed as `uint`, but the computation `m.x + row * pc.dst_stride_u8` is unsigned arithmetic throughout. If `row` is `uint` (derived from `lane_in_sg & 7`, which is `uint`), this multiplication is `uint * uint = uint`, and for normal stride values (stride ≤ ~500K) there's no overflow risk. This is fine and does not require a fix, but it is unstated.

A second unstated assumption: the `uvec4 m = u_meta.meta[edge_idx]` load packs `(dst_offset, E, I, H)` as four `uint` components. The shader casts `m.y, m.z, m.w` to `int` via `int(m.y)` etc. The C reference takes `E, I, H` as `int` parameters in range 0–80 / 0–40 / 0–10 per `gen_thresholds()`. All positive, so the uint→int cast is safe. The plan states this correctly in §5 without explicitly noting the sign assumption, but it holds for any physically meaningful threshold value.

---

## Other findings

**6. YELLOW — `fm` mask operand order in the C reference versus GLSL: subtle `abs(q1-q0)` vs `abs(q0-q1)`**

Both are equivalent (absolute value), but the plan deserves a note that these are identical. The C reference has `abs_i(q1 - q0)` (line 55) while the GLSL has `abs(q1-q0)` — same order. No issue, but someone reading quickly might second-guess. Confirm: both are identical.

**7. GREEN — No shared memory needed is correctly stated**

Unlike IDCT which required an 8 KiB shared scratch for the transpose, the LPF kernel needs zero shared memory. This is a significant simplification that eliminates the entire barrier-safety problem class. The plan correctly identifies this and uses it to justify the safe early return.

**8. YELLOW — Bench `fm_pass` tracking is acknowledged-broken**

`bench_neon_lpf.c` lines 119–124 contain a comment `/* fm_pass above is broken — left as TODO */`. This is a known deficiency. The bench cannot report what fraction of edges actually triggered `fm`-pass vs `fm`-skip. For M1''_c correctness this doesn't matter (mismatch count is the headline). For Phase 7'' analysis of divergence behavior under the QPU, knowing the actual `fm` pass rate of the random edge distribution matters — it determines how often the divergent `return` fires and whether the bench's random distribution is representative of real content. The plan should note that `bench_v3d_lpf.c` in Phase 6 should track `fm` pass rate (and `hev` rate), since that informs Phase 7'' divergence analysis. The broken tracking in `bench_neon_lpf.c` should either be fixed in Phase 6 or marked as Phase 7'' instrumentation debt.

---

## What you would change about the plan

**Required before Phase 6:**

1. **§4, contract block** — Add a second contract line: "Contract: the host guarantees `dst_stride_u8 ≥ 4` for every dispatch." Without this, the race-safety claim in §5 is incomplete.

2. **§4, contract enforcement** — Change "Worth an assert in the bench-side harness when constructing meta" (currently a question) to a concrete requirement: "Phase 6 MUST add `assert(m_x >= 4)` in `bench_v3d_lpf.c`'s meta construction loop before the dispatch." Lock it, don't leave it as a review suggestion.

**Recommended before Phase 6:**

3. **§6, divergence discussion** — Add one sentence: "On V3D 7.1, masked lanes in a divergent subgroup still consume per-instruction clock; there is no throughput benefit from restructuring the filter to always-execute. The branching structure in §4 is correct as written — do not convert to predicated always-execute."

4. **§8, review prep item 2** — Update the stride-race note: "The race-safety argument in §5 requires `dst_stride_u8 ≥ 4`; add this as a stated contract alongside `m.x ≥ 4`."

**Defer to Phase 6/7:**

5. `bench_v3d_lpf.c` should track `fm` pass rate and `hev` rate per batch, so Phase 7'' divergence analysis has the data it needs. This is Phase 6 instrumentation scope, not a plan change.

---

## Confidence

High on findings 1 (orientation, independently verified line-by-line against both the C reference and the GLSL pseudocode), 2 (stride contract gap, derived from first principles), and 4 (underflow assert gap, directly stated in the plan as a question then not resolved). Medium on finding 3 (V3D divergence masking behavior — this is informed by the cycle-1 phase7 shaderdb evidence and V3D architecture documentation, but actual masking behavior is not publicly specified and Phase 7'' will be the real answer).
````

## Resolution

| Finding | Severity | Action | Applied? |
|---|---|---|---|
| 1 (orientation) | GREEN | None — verified against C ref | ✓ accepted |
| 2 (stride contract missing) | YELLOW | Add `dst_stride_u8 ≥ 4` to §4 contracts and §5 disjointness argument | applied to phase4.md |
| 3 (divergence on V3D) | YELLOW | Add note to §6: masked lanes consume clock; do not restructure to always-execute | applied to phase4.md |
| 4 (assert as MUST) | YELLOW | Change §4 question to Phase 6 implementation requirement | applied to phase4.md |
| 5 (anything missing) | GREEN | None — three cycle-1 RED patterns absent here | ✓ accepted |
| 6 (`q1-q0` vs `q0-q1`) | GREEN | None — both verified identical | ✓ accepted |
| 7 (no shared mem) | GREEN | None — already correctly stated | ✓ accepted |
| 8 (fm_pass tracking) | YELLOW | Phase 6 `bench_v3d_lpf.c` MUST track fm/hev rates | applied as Phase 6 requirement note |

After revisions: **Phase 4'' APPROVED for Phase 6'' implementation.**
Phase 6'' may proceed.