Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)

Phase 4: Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch constraint. Frame-level dispatch (8160 WGs for 1080p), 64 invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s (BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5: Second-model review by Claude Sonnet (fresh-context Agent, no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class findings + 1 YELLOW that this commit applies:

RED finding 5 (dst race condition): int32_t[] dst with 4 lanes writing to overlapping 32-bit words = non-atomic concurrent writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

RED finding 7 (early-return before barrier): if (block_idx >= n_blocks) return; ahead of barrier() is UB by Vulkan spec. For 1080p (32640 blocks, divisible by 4) there are no partial WGs; for any frame width that is not a multiple of 32 there are. Fix: oob flag, gate work bodies, barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only; actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any subgroup ops in v1 so it is unaffected, but the wrong constraint would mislead Phase 6/7. Corrected in phase0.md sec 2, phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW prediction, compute envelope accounting) accepted as-is or deferred to Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6 implementation. No re-review needed; revisions are mechanical and address verified bugs/errors."

Phase 5 just paid for itself: two real UB bugs caught before any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
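The dispatch arithmetic and the dst race from the message above can be illustrated with a small Python sketch. It is not the kernel itself. It assumes a linear block index with 4 blocks per workgroup, and it models RED finding 5 as an unsynchronised read-modify-write of one 32-bit word under one particular interleaving (real Vulkan behaviour is undefined, so this is only one possible outcome). All function names are hypothetical.

```python
def dispatch_geometry(n_blocks, blocks_per_wg=4):
    """Return (workgroups, out_of_bounds_block_slots) for a linear block index."""
    workgroups = -(-n_blocks // blocks_per_wg)      # ceiling division
    oob_slots = workgroups * blocks_per_wg - n_blocks
    return workgroups, oob_slots


def packed_word_writes(pixels):
    """Model 4 lanes packing bytes into one shared 32-bit word without atomics.

    Assumed interleaving: every lane reads the old word before any lane
    writes back, so each write-back overwrites the previous lane's result.
    """
    word = 0
    snapshots = [word for _ in pixels]              # all lanes read first
    for lane, (old, px) in enumerate(zip(snapshots, pixels)):
        mask = 0xFF << (8 * lane)
        word = (old & ~mask) | (px << (8 * lane))   # last writer wins
    return word


def byte_writes(pixels):
    """Model the uint8_t[] fix: each lane stores to its own byte address."""
    dst = bytearray(4)
    for lane, px in enumerate(pixels):
        dst[lane] = px
    return int.from_bytes(dst, "little")


if __name__ == "__main__":
    print(dispatch_geometry(32640))                 # the 1080p figure from the plan
    print(dispatch_geometry(32641))                 # one extra block -> partial WG
    lanes = [0x11, 0x22, 0x33, 0x44]
    print(hex(packed_word_writes(lanes)), hex(byte_writes(lanes)))
```

Any nonzero second value from `dispatch_geometry` means the last workgroup contains invocations with no block to process. Those invocations are the ones the oob flag has to gate while still reaching `barrier()`.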
2026-05-18 11:47:03 +00:00
parent dcbbc77038
commit 71db72928f
4 changed files with 550 additions and 2 deletions
@@ -178,7 +178,7 @@ re-derivation:
 - **C1**: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
 - **C2**: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
 - **C3**: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
- **C4**: subgroupSupportedOperations = BASIC + VOTE only → no `subgroupAdd`/etc. for accumulator reductions. Workaround: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite.
+- **C4**: subgroupSupportedOperations = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (corrected per phase5.md finding 6). Arithmetic reductions such as `subgroupAdd` are not exposed, but `subgroupShuffle` *is* available. No workaround is needed for accumulators: the IDCT structure is fully data-parallel without reductions, so this constraint doesn't bite. `subgroupShuffle` is a possible alternative to the shared-mem transpose for Phase 7.
 - **C5**: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
 - **C6**: shared LPDDR4x bus; GPU sees ~4–7 GB/s vs CPU ~12–15 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
 - **C7**: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
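
The headroom claimed in C2 and the operand ranges in C5 can be sanity-checked with a few lines of Python. This is a minimal sketch under stated assumptions: one int32 of shared memory per coefficient (256 B per 8×8 block), and SMUL24 taken to accept signed 24-bit operands. Neither assumption is confirmed by the text above, and the helper names are hypothetical.

```python
SHARED_MEM_LIMIT = 16 * 1024          # C2: maxComputeSharedMemorySize in bytes
BYTES_PER_BLOCK = 8 * 8 * 4           # assumption: one int32 per coefficient


def max_blocks_in_shared(limit=SHARED_MEM_LIMIT, per_block=BYTES_PER_BLOCK):
    """How many 8x8 int32 blocks fit in shared memory at once."""
    return limit // per_block


def fits_signed(value, bits):
    """True if value is representable as a two's-complement integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


I16_MIN, I16_MAX = -(1 << 15), (1 << 15) - 1
WORST_PRODUCT = I16_MIN * I16_MIN     # largest magnitude of an i16 x i16 product

if __name__ == "__main__":
    print(max_blocks_in_shared())                 # blocks that fit in 16 KiB
    print(4 * BYTES_PER_BLOCK)                    # bytes used at 4 blocks/WG
    print(fits_signed(I16_MIN, 24), fits_signed(I16_MAX, 24))
    print(fits_signed(WORST_PRODUCT, 32))
```

Under these assumptions a 4-block workgroup uses 1 KiB of the 16 KiB limit, and every i16 operand and i16×i16 product stays within the 24-bit operand and 32-bit result ranges.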