Files
marfrit 71db72928f Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)
Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:47:03 +00:00

15 KiB
Raw Permalink Blame History

phase, status, date_opened, date_closed, parent, reviewer, plan_author, verdict
phase status date_opened date_closed parent reviewer plan_author verdict
5 closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied 2026-05-18 2026-05-18 phase4.md Claude Sonnet (general-purpose Agent, fresh context, no prior session memory) Claude Opus 4.7 (this session's primary model) PASS-WITH-REVISIONS

Phase 5 — Second-Model Review of Phase 4 Plan

Per dev_process.md:

Goal, situation, measurements, plan get pasted into DokuWiki. Markus reviews and redacts, then initiates the handover to a fresh model instance. Claude does not curate the artifact going to the reviewer — that would re-introduce the blind-spot accumulation the review is meant to escape.

For this session the handover mechanism was the in-session Agent tool with subagent_type=general-purpose, model=sonnet rather than DokuWiki round-trip. The reviewer had no prior session context and read the artifacts (README, dev_process.md, phase04, ref.c, bench_neon_idct.c, PROVENANCE.md, relevant vulkaninfo sections, the vendored FFmpeg vp9dsp_template.c) directly off disk.

The reviewer was given the 5 specific high-risk prompts from phase4.md §8 plus "anything else."

The review found:

  • 0 issues on orientation (independently traced against 3 sources)
  • 1 deferrable optimisation opportunity on WG geometry
  • 1 design improvement opportunity on idle-lane geometry
  • 1 documentation honesty improvement on bandwidth (range vs point)
  • 1 RED-class undefined-behaviour bug (data race on packed dst)
  • 1 RED-class undefined-behaviour bug (early-return before barrier)
  • 1 factual error in the constraint table carried through 3 documents
  • 1 accounting error in the compute envelope (no impact on conclusion)

Two RED-class bugs in the plan, both caught before any code was written. The review paid for itself.


Review (verbatim)

The following is the Sonnet reviewer's response, verbatim, with no editing beyond markdown-fence wrapping.

## Verdict

**PASS-WITH-REVISIONS** — two issues must be addressed before Phase 6 writes any shader code. Everything else is acceptable-for-v1 or a documentation fix.

---

## Per-prompt findings

**1. Orientation correctness — GREEN**

The Phase 4 §4 pseudocode is correct against `tests/vp9_idct8_ref.c` and against the upstream FFmpeg `itxfm_wrapper` macro.

Tracing the FFmpeg macro (lines 11801191): the column pass calls `idct8_1d(block + i, stride=sz, tmp + i*sz, pass=0)`. The `IN(x)` macro expands to `block[i + x*sz]`, which reads row `x`, column `i` — i.e., column `i` of the row-major block. The output lands at `tmp[i*sz + 0..7]`, which is row `i` of `tmp`. That is the transposed write. The row pass calls `idct8_1d(tmp + i, stride=sz, out, pass=1)`, which reads `tmp[i + x*sz]` = column `i` of `tmp`, and writes the result as `dst[j*stride]` with `dst++` — column `i` of `dst`. Both passes verified.

`vp9_idct8_ref.c` does the same: column pass reads `block[r*8+i]` and writes `tmp[i*8+r]` (transposed), row pass reads `tmp[r*8+i]` (column `i` of `tmp`) and writes `dst[r*stride+i]` (column `i` of dst). Phase 4 pseudocode: column pass reads `block[block_idx*64 + r*8 + k]` and writes `tmp_shared[block_local_idx*64 + k*8 + r]` — correct. Row pass reads `tmp_shared[block_local_idx*64 + r*8 + k]` (column `k` of `tmp`) and writes `dst[dy*stride + dx]` where `dx = block_x*8 + k` (column `k` of dst) — correct. All three match identically. The orientation is right.

**2. Workgroup geometry — YELLOW**

4 blocks/WG × 64 invocations is a defensible v1 starting point: it maps cleanly to one block per 16-lane subgroup, shared memory footprint is only 1 KiB (4 × 64 × 4B), and the dispatch arithmetic is straightforward. The max of 256 invocations would give 16 blocks/WG, which reduces dispatch overhead by 4× and uses the same shared memory fraction. That is not complex to implement. The plan correctly defers the WG-size sweep to Phase 7 M6, which is the right approach — get bit-exact first, then sweep. No revision needed; flag for M6 that the sweet spot is likely in the 128256 range.

**3. Idle lanes — YELLOW (stronger than the plan implies)**

The plan acknowledges the issue and flags it for Phase 7 M6. The proposed alternative — "lanes 0..7 do both passes sequentially, separated by a barrier" — would not eliminate idle lanes; it would just move the idling. The actual fix is 2 blocks per subgroup: lanes 0..7 do the column pass of block A while lanes 8..15 do the column pass of block B simultaneously; after one barrier, lanes 0..7 do the row pass of block A while lanes 8..15 do the row pass of block B. This uses all 16 lanes productively in both phases, doubles throughput per subgroup, shared memory cost is 8 blocks × 64 × 4B = 2048B (still well under 16 KiB), and dispatch count halves to 4080 WGs. The added code complexity is modest: change the lane mapping from `col_pass = lane < 8` to `block_slot = lane / 8, col_row_idx = lane % 8` and adjust `block_local_idx` accordingly. This is straightforward enough that implementing it in v1 is recommended, not a Phase 7 afterthought, especially given that BW is the governing constraint and even small compute gains matter. Acceptable to defer, but the description in §4 of the "alternative geometry" as "lanes 0..7 do column, lanes 0..7 also do row" is incorrect — that is not how you eliminate idle lanes.

**4. Bandwidth prediction — YELLOW**

The 8 MB/frame figure (4 MB coeffs + 2 MB pred-read + 2 MB dst-write) is correctly derived from 32,640 blocks. The 4 GB/s divisor is the optimistic end of the GPU memory bandwidth range. The `dst` concern is real: in the row pass, lane `k` writes to column `k` of the output block. For a 1920-pixel-wide frame (stride 1920 bytes), consecutive rows of the same output column are 1920 bytes = 30 cache lines apart. The 8 writes per lane per block are all cache-line-miss candidates. However, the WGs will be dispatched in block-index order, and consecutive blocks in the same row share output cache lines — for block_x=0,1 (both in row 0), their dst rows overlap in the same L2 lines. V3D's L2C is 256 KB and the 8 rows of a frame row fit in 15 KB, so temporal reuse within a row of blocks is plausible. An honest bandwidth range is 14 GB/s, not a point estimate at 4. This gives an honest M2 range of ~416 Mblock/s (R = 0.52.0), which still covers the "hybrid concurrent work viable" decision band at the low end. The plan's acknowledgment in §6 of "may push below 1 GB/s" is appropriate. No revision needed, but Phase 7 should record the actual L2 hit rate if the v3dv/DRM perf counters expose it.

**5. Unstated load-bearing assumption — RED**

The `dst` SSBO is specified as `int32_t[]` with 4 pixels packed per word. In the row pass, lanes 0..7 each write to a different column of the output block. For any block position, lanes 0, 1, 2, 3 write to columns `block_x*8+0`, `block_x*8+1`, `block_x*8+2`, `block_x*8+3` — which are bytes 0, 1, 2, 3 of the same `int32_t` word at `dst[(row) * stride_u32 + (block_x*8)/4]`. Lanes 4, 5, 6, 7 write to the next word. For every block at every position, exactly four lanes are writing to the same 32-bit word simultaneously. In Vulkan, concurrent non-atomic writes from different invocations to overlapping memory addresses are undefined behavior. The v3d hardware does not guarantee atomic sub-word stores. The result in practice will be silent data corruption where one lane's write stomps another's.

The fix is straightforward: use a `uint8_t[]` SSBO binding. `storageBuffer8BitAccess = true` on this device (verified in vulkaninfo line 735), and GLSL supports `uint8_t` buffer members with `VK_KHR_8bit_storage`. Each lane then writes to address `block_y*8*stride + r*stride + block_x*8 + k` — a unique byte address, no overlap. The pred/dst buffer is still a single SSBO binding (≤8 limit not affected), just typed `uint8_t` instead of `int32_t`. This must be resolved before Phase 6 writes the shader. The current §5 layout specification is wrong.

---

## Other findings

**6. Subgroup operations constraint table is incorrect — YELLOW**

Phase 0 §2 table, Phase 2 §6 C4, and Phase 4 §1 C4 all state "subgroupSupportedOperations = BASIC + VOTE only." The actual vulkaninfo output (lines 285291) shows six operations: BASIC, VOTE, **BALLOT, SHUFFLE, SHUFFLE_RELATIVE, QUAD**. The plan still works correctly — it never calls any subgroup operation at all and uses a workgroup barrier + shared memory, which is fully valid. But the constraint table will mislead Phase 6 and Phase 7 implementers into thinking `subgroupShuffle` is unavailable. It is available, and it opens an alternative algorithm: the 16-lane subgroup can distribute column-pass results to row-pass lanes via `subgroupShuffle` without any shared memory or barrier, replacing the `tmp_shared[4][64]` array entirely. That would eliminate the barrier latency and reduce shared memory pressure, though it requires a different lane-mapping design. This is a Phase 7 optimization candidate, but the constraint table needs correcting so future phases don't build on a false foundation.

**Required fix:** Correct C4 in phase4.md §1 to: "subgroupSupportedOperations = BASIC, VOTE, BALLOT, SHUFFLE, SHUFFLE_RELATIVE, QUAD — no arithmetic reductions (no `subgroupAdd`). Note: `subgroupShuffle` IS available; Phase 1 does not use it." Also correct phase0.md §2 table.

**7. Early-return before barrier is UB for non-multiple-of-4 block counts — RED (addressable in Phase 6 with 2 lines)**

The pseudocode guard `if (block_idx >= n_blocks) return;` fires before `barrier()`. In Vulkan GLSL, a `barrier()` call must be reached by all invocations in the workgroup — a divergent early exit before it is undefined behavior (the spec says "all invocations of a workgroup must execute barrier with the same operands, in the same order, relative to other uses of OpMemoryBarrier"). For 1080p (32,640 blocks, evenly divisible by 4), there are no partial WGs and no problem. But VP9 frames need not have dimensions divisible by 32 pixels (4 blocks × 8). A 1366×768 frame has 170×96 = 16,320 blocks, divisible by 4. A hypothetical 80×80 frame has 10×10 = 100 blocks, divisible by 4. However an 88×72 frame has 11×9 = 99 blocks, not divisible by 4, and will have a partial WG. The plan does not address this. The fix is two lines: replace the early return with a `bool oob = block_idx >= n_blocks;` flag, then guard the column-pass work body and row-pass work body individually with `if (!oob)`, while leaving the `barrier()` unconditional. All lanes in the WG then always reach the barrier.

**8. Compute envelope derivation has an accounting error — GREEN (no impact on final number)**

Phase 4 §6 derives "~2 MOPs / frame" as `32,640 blocks × 60 ops`. This is then divided by 92 GOPS to get 22 µs. The arithmetic conflates per-block lane-ops with SIMD-cycle-equivalent ops. Each 16-lane subgroup processes 1 block in the column pass by running 8 active lanes × 30 ops = 240 useful lane-ops in 30 SIMD cycles, then similarly for the row pass. The "60 ops" correctly counts SIMD cycles per block (2 × 30), but dividing by 92 GOPS (which are whole-SIMD-width ops) underestimates compute time by a factor of ~16/8 = 2 (idle lanes). The corrected compute frame time is ~44 µs, not 22 µs. But since bandwidth at ~2 ms is the governing constraint and is 45× the compute time, the error has zero impact on the `min(compute, bandwidth) ≈ 16 Mblock/s` conclusion. No revision required, but §6 should note the 2× discrepancy to avoid misleading future optimization work.

---

## What you would change about the plan

**Required before Phase 6 (two changes):**

1. **Phase 4 §5, binding 1** — Replace `int32_t[]` with `uint8_t[]` backed by `storageBuffer8BitAccess`. Remove the sentence "Packing 4 bytes per i32 simplifies write-back to coalesced 32-bit stores." Replace with: "Declare as `layout(binding=1) buffer Dst { uint8_t dst[]; };` using `VK_KHR_8bit_storage` (storageBuffer8BitAccess = true on this device). Lane k in the row pass writes `dst[(block_y*8+r)*stride + block_x*8+k]` directly. No packing. No sub-word race condition." Update `dst_stride_u32` push constant to `dst_stride_u8` (now in bytes, not words).

2. **Phase 4 §4, pseudocode preamble** — Add a `bool oob = (block_idx >= n_blocks);` before the `if (col_pass)` branch. Replace `if (block_idx >= n_blocks) return;` (early return) with the `oob` flag, and wrap the col-pass body and row-pass body with `if (!oob)`, leaving `barrier()` unconditional. Add a comment: `// barrier() must be reached by all WG invocations — do not early-return before it.`

**Recommended before Phase 6 (one change):**

3. **Phase 4 §1, C4 row** — Correct from "subgroupSupportedOperations = BASIC + VOTE only" to "BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD — arithmetic reductions unavailable but `subgroupShuffle` is." Also fix the same sentence in phase0.md §2 and phase2.md §6.

**Defer to Phase 7 (no change needed now):**

4. The 2-blocks-per-subgroup redesign is worth implementing as the v1 shape rather than the half-idle design, but it is not a correctness issue. If the team prefers the simpler v1, the M6 sweep in Phase 7 is the right time to do it. The Phase 4 description of the alternative ("lanes 0..7 do column, lanes 0..7 also do row") is an incorrect description of how to eliminate idle lanes — correct it to "lanes 0..7 handle block A, lanes 8..15 handle block B simultaneously in both passes."

---

## Confidence

High on findings 5 (race condition), 6 (subgroup ops), and 7 (early return / barrier); those are mechanical reads against the vulkaninfo output and the Vulkan spec. Medium on finding 4 (bandwidth): the 14 GB/s range is informed reasoning but the actual L2 behavior on V3D 7.1 is not documented publicly and Phase 7 will be the real answer. The orientation correctness (finding 1) has the highest confidence — three independent traces (FFmpeg macro, ref.c, phase4 pseudocode) all agree.

Resolution

Finding Severity Action Status
1 (orientation) GREEN None accepted
2 (WG geom) YELLOW Defer to Phase 7 M6 sweep (matches plan) accepted
3 (idle lanes) YELLOW Correct §4 description of "alternative"; keep half-idle v1 design (defer 2-blocks-per-subgroup to Phase 7) apply description fix
4 (bandwidth) YELLOW Acceptable; note that 4 GB/s is optimistic end accepted as-is, §6 already flags downside
5 (dst race) RED Switch dst SSBO to uint8_t[] via VK_KHR_8bit_storage + storageBuffer8BitAccess applied to phase4.md §5
6 (subgroup ops table) YELLOW Correct C4 in phase0.md §2, phase2.md §6, phase4.md §1 applied to all three
7 (barrier UB) RED Add oob flag pattern in §4 pseudocode applied to phase4.md §4
8 (compute envelope 2× off) GREEN Note in §6 — doesn't change conclusion apply note

After revisions:

  • Phase 4 is APPROVED for Phase 6 implementation.
  • No re-review needed; revisions are mechanical and address verified bugs/errors.
  • Phase 6 may proceed with the revised plan.