Files
daedalus-fourier/docs/k2_deblock_phase5.md
T
marfrit 36eca40ff2 Cycle 2 (LPF) closure: M1''=100%, R''=0.41, M4''=+6.9%, PASS
Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS:
2 YELLOW contract gaps applied) + Phase 6 v1 implementation +
Phase 7 verification including M4'' concurrent gate.

Phase 5'' review delivered cleanly — no RED bugs (cycle 1 lessons
applied successfully). 2 YELLOW findings baked into phase4 §4:
  - stride >= 4 contract added alongside m.x >= 4 (finding 2)
  - assert(...) in bench made a MUST not a suggestion (finding 4)
  - V3D divergence-cost note: don't restructure to always-execute,
    masked lanes consume clock anyway (finding 3, informational)

Phase 6 v1 first-light hit M1'' 100.0000% bit-exact on first run
(65536/65536 edges) — the cycle-1 v4 patterns (WG=256, 2-per-sg,
uint8_t SSBO, oob early-return discipline) baked in from start
worked as expected.

Performance:

  M2'' = 19.645 Medge/s     (50.9 ns/edge)
  M3'' = 48.285 Medge/s     (NEON baseline from phase3)
  R''  = 0.41               (ORANGE band - doesn't auto-close per
                             cycle-1 calibration adjustment)

shaderdb: 160 inst, **4 threads**, 0 spills, 21 max-temps —
shader is already at the compiler ceiling. No v2/v3/v4 iteration
loop like cycle 1 because there's nothing more to extract from
the compiled shape. The 30x gap between theoretical instruction
throughput and measured wall-clock is divergence-tax + memory
latency, not compile quality.

M4'' concurrent matrix on hertz (8s windows):

  NEON-1 LPF          41.131 Medge/s
  NEON-4 LPF          33.726 Medge/s  <- realistic CPU ceiling
                                          (per-core 7-9; same
                                          bandwidth-saturation as
                                          cycle-1 F1)
  QPU only            14.299 Medge/s
  MIXED NEON-3 + QPU  36.049 Medge/s  <- +6.9% over NEON-4
  MIXED NEON-4 + QPU  31.892 Medge/s  <- -5.4% oversubscribed

The "freed-core" pattern generalizes from IDCT to LPF: NEON-3+QPU
beats pure NEON-4 by ~7% in both cycles. Cycle-2 NEW finding:
**oversubscribed mode hurts for lighter kernels** (LPF -5.4% vs
cycle-1 IDCT +9.4%). Recommendation for higgs deployment hardens
to "always N-1 NEON cores + QPU, never N + QPU".

Phase 9 lessons (in phase7 §"Phase 9 lessons"):
1. Cycle-1 v4-pattern is the v1 starting point (saves 3 iterations)
2. Phase 5 review pays off every cycle
3. R isolation misleading on bandwidth-saturated hardware
4. Oversubscription tax depends on kernel weight
5. shaderdb 4-threads/0-spills = compute not the bottleneck

New artifacts:
- src/v3d_lpf_h_4_8.comp                — GLSL kernel
- tests/bench_v3d_lpf.c                 — M1'' + M2'' harness with
                                          contract asserts + fm/hev
                                          pass-rate instrumentation
- tests/bench_concurrent_lpf.c          — M4'' pthread bench
                                          (mirrors bench_concurrent.c)
- docs/k2_deblock_phase{4,5,7}.md       — plan + review + verification

Project verdict: continue. Cycle 3 candidates: MC interpolation
(multiply-heavy, stress V3D SMUL24), CDEF (AV1-only, different
neighborhood shape), or wd=8/wd=16 LPF variants. User to direct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:39:26 +00:00

14 KiB
Raw Blame History

cycle, phase, status, date_opened, date_closed, parent, reviewer, plan_author, verdict
cycle phase status date_opened date_closed parent reviewer plan_author verdict
2 5 closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied 2026-05-18 2026-05-18 k2_deblock_phase4.md Claude Sonnet (general-purpose Agent, fresh context) Claude Opus 4.7 (this session) PASS-WITH-REVISIONS

Cycle 2, Phase 5 — Second-Model Review of LPF Plan

Same handoff mechanism as cycle 1's Phase 5: in-session Agent tool with subagent_type=general-purpose, model=sonnet, fresh context, files read directly off disk, no Claude-Opus curation.

The reviewer was given the 5 prompts that k2_deblock_phase4.md §8 itself flagged plus "anything else."

Outcome: no RED-class bugs (cycle 1's two RED findings did not recur — orientation is independently verified against the C reference, no barrier exists so no barrier-UB risk, no subgroup operations used so the cycle-1 finding-6 false-constraint pattern doesn't apply). Two YELLOW findings to address before Phase 6:

  • F2: stride contract missing from §5 race-safety argument
  • F4: assert that should be MUST not "worth considering"

One additional YELLOW (F3) on V3D divergence behavior — informational note to add to §6 so Phase 7'' doesn't go down the wrong optimisation rabbit hole. One YELLOW (F8) on bench instrumentation — bake into Phase 6 implementation requirements.

Review (verbatim)

## Verdict

**PASS-WITH-REVISIONS** — the pseudocode is arithmetically correct against the C reference, and no RED-class correctness bugs were found. One YELLOW issue (missing stride contract) and one YELLOW issue (divergence-cost accounting) need to be acknowledged and documented; neither requires rework of the algorithm. No safety net against the `base - 4u` underflow is present in the bench harness and should be added.

---

## Per-prompt findings

**1. Orientation correctness — GREEN**

The pseudocode in §4 matches `tests/vp9_lpf_ref.c` line-for-line. Traced independently:

- `fm` mask: all six abs-diff comparisons against `I`, and `abs(p0-q0)*2 + (abs(p1-q1) >> 1) <= E`, appear in the same order as the C reference. No term is swapped or missing.
- hev path: `f = clamp(p1 - q1, -128, 127)` — sign is correct (`p1 - q1`, not `q1 - p1`). The second clip `clamp(3*(q0-p0) + f, ...)` matches. `f1 = min(f+4,127)>>3`, `f2 = min(f+3,127)>>3` match. Writes: `p0 + f2` at `base-1`, `q0 - f1` at `base+0` — correct direction.
- no-hev path: `f = clamp(3*(q0-p0), ...)` — no `p1-q1` term, correct (the no-hev path drops the initial `p1-q1` from `f`). `fp = (f1+1)>>1` shared between `p1+fp` and `q1-fp`. Writes: `base-2` gets `p1+fp`, `base+1` gets `q1-fp`. The C reference writes `dst[-2] = clip_u8(p1+fp)` and `dst[+1] = clip_u8(q1-fp)` — matches exactly.
- Pixel reads: `p3..p0` at `base-4..base-1`, `q0..q3` at `base+0..base+3`. Row advance via `row * pc.dst_stride_u8`. The C reference iterates `dst += stride` and reads `dst[-4..+3]` — same semantics.
- GLSL `min(int, int)` works correctly (GLSL 4.50 provides `genIType min(genIType x, genIType y)`).

The bit-exact gate should pass on first run if the transcription holds through the actual `.comp` file.

**2. Race safety claim in §5 — YELLOW**

The §5 disjointness claim is **mostly correct but missing a lower bound on stride**. Examining the geometry: row `r` writes to byte addresses `[base + r*stride - 2 .. base + r*stride + 1]` (4 bytes). For rows `r` and `r+1` to be disjoint, we need `r*stride + 1 < (r+1)*stride - 2`, i.e., `stride > 3`, i.e., `stride ≥ 4`. The bench uses stride = 8, so there is no actual race in the bench. But §5's race-safety argument says "disjoint memory" without stating the stride lower bound. If a future caller passes stride=2 (a valid, if unusual, packed-pixel scenario), rows 0 and 1 would overlap and the writes would race silently. The contract in §4 currently only asserts `m.x ≥ 4`; it does not assert `dst_stride_u8 ≥ 4`. This is not the same shape as cycle-1 phase5 finding 5 (that was sub-word packing within a single 32-bit word across lanes in the same row), but it is the same class: a hidden assumption that the correctness of the write-back relies on.

The plan is safe for the bench as written; the gap is that it doesn't document the stride contract, which means a future caller cannot know what the shader requires. This should be stated explicitly.

**3. Divergence cost — YELLOW**

The §6 prediction of R'' ≈ 0.50.9 is plausible but the divergence penalty analysis is incomplete. Within a 16-lane subgroup processing 2 edges (8 rows each), `fm` failure causes a `return` that retires 1 or more lanes early. V3D's compiler will mask the retired lanes for the rest of the subgroup's execution — those lanes are not actually idle; they still consume subgroup clock cycles on every instruction. The cost of a divergent `return` is that the surviving lanes continue executing to subgroup completion. However, V3D 7.1 differs from desktop GPUs in one important way: the v3d_compiler controls divergence handling entirely in software, via per-lane predicate masks. There are no warp-level early-exit savings. An `if (!fm) return` that triggers for half the lanes in the subgroup means the other half continues through the full hev/no-hev path while the failed lanes are masked. The per-lane wall-clock for all lanes is the same. The §6 prediction does acknowledge "divergence on `fm` and `hev` branches splits the subgroup into 2-4 paths" but then only mentions it as a risk without quantifying it. If `fm` fails for 50 % of rows (a realistic real-bitstream rate), the effective throughput is the same as if `fm` passed for 100 % — because the masked lanes still consume clock. This makes the prediction slightly optimistic: R'' is bounded by the bandwidth ceiling regardless, but the divergence cost means the compute overhead does not improve when `fm` skips rows. Whether restructuring to always-execute would help: it would not, because the masked lanes in V3D do pay for all instructions anyway. There is no benefit to removing the `if (!fm) return` in terms of V3D wall-clock. The `hev` branch within the fm-passing path is a different shape: it selects between 2-write and 4-write paths; the non-hev path executes more instructions. The shorter hev path wastes no clock on V3D (masked lanes pay for max-path instructions in any case). Given that, restructuring to always-execute the longer no-hev path with predicated writes would make things **worse** (more instructions always executed vs. fewer instructions with masking). The plan's choice of the natural branching structure is correct. The YELLOW here is that the prediction doesn't spell this out — it should note that "divergence has no throughput benefit because masked lanes still consume clock on V3D."

**4. `base - 4u` underflow assumption — YELLOW**

The plan correctly identifies the risk in §4 ("Contract: host guarantees `m.x ≥ 4`") and notes the bench enforces it by placing edges at `edge_idx * 64 + 4`. However, the bench harness (`bench_neon_lpf.c`) constructs `meta[]` entries with no runtime assert. If `n_edges` is large enough that the edge layout is correct by construction and cannot be violated, this is fine — but it's not a runtime check. The plan says "Worth an assert in the bench-side harness when constructing meta" as a review question, then never commits to adding it. Given that cycle-1 lessons explicitly mention contract enforcement, this should be a concrete "Phase 6 MUST add assert" requirement in the plan, not a question for the reviewer. The risk is concrete: a caller who miscomputes `m.x` and passes a value less than 4 will get a silent pixel-buffer-underread wrapped around to a high address. In the QPU context this would silently corrupt unrelated image data. The fix is one line (`assert(m.x >= 4)` in the bench's `meta[]` construction loop); the plan should commit to it.

**5. Anything missing — GREEN with one note**

The three cycle-1 phase5 bugs do not recur here in the same form:

- **Finding 5 (write race on packed dst):** The LPF kernel uses a per-lane byte write pattern that is intrinsically race-free for well-separated rows — the pattern is the same fix that cycle-1 applied. Not a risk here provided the stride contract holds (see finding 2 above).
- **Finding 7 (barrier UB before early-return):** The LPF kernel has no barrier, and §3 correctly notes "No barrier needed." The early `if (edge_idx >= pc.n_edges) return;` is safe here, unlike cycle-1. The plan explicitly calls this out with the correct reasoning.
- **Finding 6 (subgroup ops table error):** The LPF kernel uses no subgroup operations at all; C4 compliance is trivially satisfied regardless of which ops are available.

One unstated assumption worth noting: the push constant `dst_stride_u8` is passed as `uint`, but the computation `m.x + row * pc.dst_stride_u8` is unsigned arithmetic throughout. If `row` is `uint` (derived from `lane_in_sg & 7`, which is `uint`), this multiplication is `uint * uint = uint`, and for normal stride values (stride ≤ ~500K) there's no overflow risk. This is fine and does not require a fix, but it is unstated.

A second unstated assumption: the `uvec4 m = u_meta.meta[edge_idx]` load packs `(dst_offset, E, I, H)` as four `uint` components. The shader casts `m.y, m.z, m.w` to `int` via `int(m.y)` etc. The C reference takes `E, I, H` as `int` parameters in range 080 / 040 / 010 per `gen_thresholds()`. All positive, so the uint→int cast is safe. The plan states this correctly in §5 without explicitly noting the sign assumption, but it holds for any physically meaningful threshold value.

---

## Other findings

**6. YELLOW — `fm` mask operand order in the C reference versus GLSL: subtle `abs(q1-q0)` vs `abs(q0-q1)`**

Both are equivalent (absolute value), but the plan deserves a note that these are identical. The C reference has `abs_i(q1 - q0)` (line 55) while the GLSL has `abs(q1-q0)` — same order. No issue, but someone reading quickly might second-guess. Confirm: both are identical.

**7. GREEN — No shared memory needed is correctly stated**

Unlike IDCT which required an 8 KiB shared scratch for the transpose, the LPF kernel needs zero shared memory. This is a significant simplification that eliminates the entire barrier-safety problem class. The plan correctly identifies this and uses it to justify the safe early return.

**8. YELLOW — Bench `fm_pass` tracking is acknowledged-broken**

`bench_neon_lpf.c` lines 119124 contain a comment `/* fm_pass above is broken — left as TODO */`. This is a known deficiency. The bench cannot report what fraction of edges actually triggered `fm`-pass vs `fm`-skip. For M1''_c correctness this doesn't matter (mismatch count is the headline). For Phase 7'' analysis of divergence behavior under the QPU, knowing the actual `fm` pass rate of the random edge distribution matters — it determines how often the divergent `return` fires and whether the bench's random distribution is representative of real content. The plan should note that `bench_v3d_lpf.c` in Phase 6 should track `fm` pass rate (and `hev` rate), since that informs Phase 7'' divergence analysis. The broken tracking in `bench_neon_lpf.c` should either be fixed in Phase 6 or marked as Phase 7'' instrumentation debt.

---

## What you would change about the plan

**Required before Phase 6:**

1. **§4, contract block** — Add a second contract line: "Contract: the host guarantees `dst_stride_u8 ≥ 4` for every dispatch." Without this, the race-safety claim in §5 is incomplete.

2. **§4, contract enforcement** — Change "Worth an assert in the bench-side harness when constructing meta" (currently a question) to a concrete requirement: "Phase 6 MUST add `assert(m_x >= 4)` in `bench_v3d_lpf.c`'s meta construction loop before the dispatch." Lock it, don't leave it as a review suggestion.

**Recommended before Phase 6:**

3. **§6, divergence discussion** — Add one sentence: "On V3D 7.1, masked lanes in a divergent subgroup still consume per-instruction clock; there is no throughput benefit from restructuring the filter to always-execute. The branching structure in §4 is correct as written — do not convert to predicated always-execute."

4. **§8, review prep item 2** — Update the stride-race note: "The race-safety argument in §5 requires `dst_stride_u8 ≥ 4`; add this as a stated contract alongside `m.x ≥ 4`."

**Defer to Phase 6/7:**

5. `bench_v3d_lpf.c` should track `fm` pass rate and `hev` rate per batch, so Phase 7'' divergence analysis has the data it needs. This is Phase 6 instrumentation scope, not a plan change.

---

## Confidence

High on findings 1 (orientation, independently verified line-by-line against both the C reference and the GLSL pseudocode), 2 (stride contract gap, derived from first principles), and 4 (underflow assert gap, directly stated in the plan as a question then not resolved). Medium on finding 3 (V3D divergence masking behavior — this is informed by the cycle-1 phase7 shaderdb evidence and V3D architecture documentation, but actual masking behavior is not publicly specified and Phase 7'' will be the real answer).

Resolution

Finding Severity Action Applied?
1 (orientation) GREEN None — verified against C ref ✓ accepted
2 (stride contract missing) YELLOW Add dst_stride_u8 ≥ 4 to §4 contracts and §5 disjointness argument applied to phase4.md
3 (divergence on V3D) YELLOW Add note to §6: masked lanes consume clock; do not restructure to always-execute applied to phase4.md
4 (assert as MUST) YELLOW Change §4 question to Phase 6 implementation requirement applied to phase4.md
5 (anything missing) GREEN None — three cycle-1 RED patterns absent here ✓ accepted
6 (q1-q0 vs q0-q1) GREEN None — both verified identical ✓ accepted
7 (no shared mem) GREEN None — already correctly stated ✓ accepted
8 (fm_pass tracking) YELLOW Phase 6 bench_v3d_lpf.c MUST track fm/hev rates applied as Phase 6 requirement note

After revisions: Phase 4'' APPROVED for Phase 6'' implementation. Phase 6'' may proceed.