From 71db72928feed11b07286df5d61d78692c7817cd Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Mon, 18 May 2026 11:47:03 +0000
Subject: [PATCH] Phase 4 plan + Phase 5 second-model review
 (PASS-WITH-REVISIONS)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/phase0.md |   2 +-
 docs/phase2.md |   2 +-
 docs/phase4.md | 400 +++++++++++++++++++++++++++++++++++++++++++++++++
 docs/phase5.md | 148 ++++++++++++++++++
 4 files changed, 550 insertions(+), 2 deletions(-)
 create mode 100644 docs/phase4.md
 create mode 100644 docs/phase5.md

diff --git a/docs/phase0.md b/docs/phase0.md
index 8dbc370..bb77004 100644
--- a/docs/phase0.md
+++ b/docs/phase0.md
@@ -68,7 +68,7 @@ Captured live 2026-05-17 via SSH. Full `vulkaninfo` in
 | `maxComputeWorkGroupInvocations` | 256 | Standard |
 | `maxComputeWorkGroupSize` | 256 / 256 / ? | Standard |
 | `subgroupSize` | 16 (fixed) | Matches QPU SIMD width |
-| `subgroupSupportedOperations` | BASIC + VOTE only | No arithmetic reductions — accumulate via shared memory |
+| `subgroupSupportedOperations` | BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD | No arithmetic reductions — accumulate via shared memory. `subgroupShuffle` *is* available (corrected per phase5.md finding 6 — earlier text said BASIC+VOTE only). |
 | `shaderFloat16` | **false** | Storage only; arithmetic runs FP32 |
 | `shaderInt8` | **false** | Storage only; arithmetic on widened ints |
 | `shaderInt16` | **false** | Same |
diff --git a/docs/phase2.md b/docs/phase2.md
index 8cad881..c346501 100644
--- a/docs/phase2.md
+++ b/docs/phase2.md
@@ -178,7 +178,7 @@ re-derivation:
 - **C1**: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
 - **C2**: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
 - **C3**: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
-- **C4**: subgroupSupportedOperations = BASIC + VOTE only → no `subgroupAdd`/etc. for accumulator reductions. Workaround: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite.
+- **C4**: subgroupSupportedOperations = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions like `subgroupAdd`; but `subgroupShuffle` *is* available — corrected per phase5.md finding 6). Workaround needed for accumulators: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite. `subgroupShuffle` is an alternative to shared-mem transpose for Phase 7.
 - **C5**: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
 - **C6**: shared LPDDR4x bus; GPU sees ~4–7 GB/s vs CPU ~12–15 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
 - **C7**: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
diff --git a/docs/phase4.md b/docs/phase4.md
new file mode 100644
index 0000000..4d8a3fa
--- /dev/null
+++ b/docs/phase4.md
@@ -0,0 +1,400 @@
+---
+phase: 4
+status: open (awaiting Phase 5 review)
+date_opened: 2026-05-18
+parent: phase3.md
+target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
+expected_artifacts: src/v3d_idct8_kernel.comp, src/v3d_kernel_runner.{c,h},
+                    tests/bench_v3d_idct.c, CMakeLists.txt updates
+---
+
+# Phase 4 — Plan
+
+Per `dev_process.md`:
+
+> Formulate the approach. Identify what will and will not be touched.
+> State expected outcome of implementation in the *same* measurable
+> terms used in Phase 1/3.
+
+And the Phase 6 contract-before-code rule, applied early here so
+Phase 5 review can verify the contracts before any code is written:
+
+> Read the API contract — kernel docs, header comments, and upstream
+> source for every call touched. State the contract explicitly
+> before implementing against it.
+
+## 1. Constraint recap
+
+All numbered constraints are carried verbatim from `phase0.md §6`
+and `phase3.md`. The plan must satisfy every one.
+
+| ID | Constraint | Source |
+|---|---|---|
+| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
+| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
+| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
+| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6 |
+| C5 | per-block fixed-point multiplies fit SMUL24 (≤17-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
+| C6 | GPU LPDDR4x share ≈ 4–7 GB/s (CPU sees 12–15) | py-videocore7 scopy benchmark |
+| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
+| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
+| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
+| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |
+
+Spec contracts the kernel must implement against:
+
+- **VP9 specification §8.7 — inverse transform process** (Google,
+  2016). The canonical text. Bit-exact verification is the
+  ultimate gate, not the spec's words.
+- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
+  `libavcodec/vp9dsp_template.c` and
+  `ff_vp9_idct_idct_8x8_add_neon` from
+  `libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
+  `external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The
+  C reference and NEON implementation differ in output
+  *orientation*: FFmpeg's column pass writes transposed (column
+  `i` of block IDCT'd into row `i` of `tmp`), and the row pass
+  reads columns of `tmp` and writes columns of `dst` via the
+  `dst++` pattern. **Our QPU kernel must reproduce this
+  orientation** or M1 fails. See
+  `tests/vp9_idct8_ref.c` for the corrected reference (which
+  itself was caught failing during Phase 3 and fixed to match
+  the FFmpeg orientation).
+
+## 2. Workload model
+
+For a 1920×1088-luma 1080p frame:
+
+| Quantity | Value |
+|---|---|
+| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
+| Coeff bytes per frame (i16 × 64 / block) | 4 178 880 ≈ 3.99 MiB |
+| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
+| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MB / frame** |
+
+Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT
+kernel is the same arithmetic; chroma plumbing is a separate
+follow-on once luma works).
+
+## 3. Workgroup geometry decision
+
+Choices and why:
+
+- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
+  Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON
+  on overhead alone; frame-level (32 640 blocks) is the only
+  granularity that amortises C7 to negligible (1.0 ns/block of
+  overhead vs NEON's 122 ns/block).
+
+- **Workgroup size**: **64 invocations** (= 4 subgroups × 16
+  lanes), `local_size_x = 64`. One workgroup processes **4
+  blocks**, one block per subgroup.
+
+  Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256):
+  Phase 1 deliberately keeps the first kernel simple. 4 blocks
+  gives us a clean subgroup-per-block mapping and small shared
+  memory footprint (4 × 64 × 4 B = 1024 B for the transpose
+  scratch, far below C2's 16 KiB). Phase 7 may sweep wider WGs
+  (16 / 32 / 64 / 128 / 256) per `phase1.md §"Secondary
+  measurements" M6` to find the optimum.
+
+  Why 1 block/subgroup: a 16-lane subgroup naturally maps to
+  "one cooperating team for one 8×8 block." Lanes 0..7 do the
+  column pass (each owning a column of the block); lanes 8..15
+  do the row pass (each owning a row of dst output). Lanes are
+  not reduced (C4 satisfied — we never call `subgroupAdd`).
+
+- **Dispatch count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p
+  luma plane. Per Vulkan spec
+  `maxComputeWorkGroupCount.x = 65 535` — plenty of headroom.
+
+## 4. Per-thread algorithm
+
+The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
+exactly (column pass with transposed write, row pass with
+columnar write). Pseudocode:
+
+```glsl
+// Per-thread state derived from gl_GlobalInvocationID.x:
+//   block_idx  = global_id / 16
+//   lane       = global_id % 16        // [0..15], subgroup-local
+//   col_pass   = lane < 8               // first half does columns
+//   row_pass   = lane >= 8              // second half does rows
+//   k          = lane & 7               // column or row index 0..7
+
+// Out-of-bounds flag — do NOT early-return here. Vulkan requires
+// barrier() to be reached by all WG invocations; a divergent early
+// return ahead of the barrier is undefined behaviour (and would bite
+// us on any frame whose block count isn't a multiple of 4).
+// Per phase5.md finding 7.
+bool oob = (block_idx >= n_blocks);
+
+if (col_pass && !oob) {
+    // Read column k of block_idx, 8 i16 values.
+    int in[8];
+    for (r = 0; r < 8; r++)
+        in[r] = block[block_idx * 64 + r * 8 + k];
+    int out[8];
+    idct8_1d(in, out);                        // 14 mults + 12 adds + 4 shifts
+    // Transposed write: row k of tmp_shared[block_local_idx]
+    for (r = 0; r < 8; r++)
+        tmp_shared[block_local_idx * 64 + k * 8 + r] = out[r];
+}
+
+barrier();   // ALL lanes reach this, oob or not
+
+if (row_pass && !oob) {
+    // Read column k of tmp_shared[block_local_idx], 8 i32 values.
+    int in[8];
+    for (r = 0; r < 8; r++)
+        in[r] = tmp_shared[block_local_idx * 64 + r * 8 + k];
+    int out[8];
+    idct8_1d(in, out);                        // same kernel
+    // Columnar write into dst: column k of the destination 8x8.
+    // Block position in dst: (block_y, block_x) = (block_idx / Wb, block_idx % Wb) × 8.
+    // dst is uint8_t[] (NOT packed i32): each lane writes a unique
+    // byte address → no sub-word race. Per phase5.md finding 5.
+    for (r = 0; r < 8; r++) {
+        uint dx = block_x * 8 + k;
+        uint dy = block_y * 8 + r;
+        uint8_t px = dst[dy * stride + dx];
+        int sum = int(px) + ((out[r] + 16) >> 5);
+        dst[dy * stride + dx] = uint8_t(clamp(sum, 0, 255));
+    }
+}
+```
+
+Notes:
+- `idct8_1d` is the same 1D 8-point IDCT used in
+  `tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged
+  (constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102,
+  Q14 round-shift `(x + (1<<13)) >> 14`).
+- All arithmetic is `int` (32-bit) per C1. Multiplies are
+  i16 × i16 → i32, well within SMUL24 (C5).
+- The barrier is a *workgroup* barrier (`barrier()` + memory
+  barrier), not a subgroup operation, so C4 is satisfied.
+- DC-only fast path (`eob == 1`) is *not* implemented in v1.
+  Per Phase 3 M1 stats, DC-only frequency is 0.11 %; the cost
+  of unconditional general-path execution is ~1× per-block. Phase
+  7 may add the fast path if M2 measurement is close to the
+  decision threshold.
+- The 8 idle lanes per subgroup per phase (lanes 8..15 idle
+  during column pass; lanes 0..7 idle during row pass) waste
+  half of subgroup-cycle bandwidth. The real alternative geometry
+  (corrected per phase5.md finding 3) is **2 blocks per subgroup**:
+  lanes 0..7 handle block A's column pass while lanes 8..15
+  handle block B's column pass simultaneously; one barrier; lanes
+  0..7 handle A's row pass, lanes 8..15 handle B's row pass.
+  Doubles useful work per subgroup, halves dispatch count to
+  ~4080 WGs, shared mem grows to 2 KiB (still 8 % of C2). The v1
+  half-idle design is kept deliberately for simplicity-first; the
+  2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
+
+## 5. Memory layout / SSBO design
+
+Three SSBO bindings, well under C3's limit of 8:
+
+| binding | name | type | size | usage |
+|---|---|---|---|---|
+| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
+| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
+| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |
+
+Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
+finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
+writing to the *same* 32-bit word every block (lanes 0..3 → bytes
+0..3 of one word; lanes 4..7 → next word). Vulkan does not provide
+atomic sub-word writes — concurrent non-atomic writes to overlapping
+addresses are undefined behaviour. v3dv exposes
+`storageBuffer8BitAccess = true` (verified in
+`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:
+
+```glsl
+#extension GL_EXT_shader_8bit_storage : require
+layout(binding = 1) buffer Dst { uint8_t dst[]; };
+```
+
+Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]` —
+a unique byte address per lane, no race.
+
+Push constants (C8: 128 B max, we use 16):
+
+```
+layout(push_constant) uniform PC {
+    uint n_blocks;        // total 8×8 blocks in this dispatch
+    uint blocks_per_row;  // (dst_width / 8), for block_idx → (by, bx)
+    uint dst_stride_u8;   // dst row stride in bytes (8-bit storage)
+    uint _pad;
+} pc;
+```
+
+The 7 Q14 trig constants are baked into the shader source as
+literal `const int` — they don't need to be in a buffer.
+
+Shared memory layout (4 blocks per WG, i32 per element):
+
+```
+shared int tmp_shared[4][64];  // 4 × 64 × 4 B = 1024 B
+```
+
+## 6. Predicted M2 (the expected outcome per Phase 1)
+
+Phase 1 §"Decision rules" requires a predicted measurement. We
+state two predictions: a compute-envelope upper bound and a
+bandwidth-ceiling upper bound. M2 is bounded by the *minimum*.
+
+### Compute-envelope estimate
+
+- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
+- 2 passes per block (column + row) = 60 SIMD cycles per block
+  *(idealised — every SIMD cycle issued on the full 16-lane width)*
+- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
+- At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at
+  960 MHz — we use int arithmetic, SMUL24 throughput is comparable.
+  Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
+- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only,
+  idealised — see correction below)
+- **Correction per phase5.md finding 8:** the v1 half-idle
+  geometry runs only 8 of the 16 subgroup lanes per phase, so
+  the effective SIMD utilisation is 0.5× — compute frame time
+  is closer to **~44 µs / frame**. This doesn't change the
+  conclusion because the bandwidth ceiling (below) is the
+  governing constraint, but the 2× note is essential for any
+  future compute-bound optimisation work.
+- Plus C7 = 33 µs/dispatch overhead, idealised total ≈ **55 µs / frame** ≈ **18 000 FPS** ≈ **587 Mblock/s**
+- Realistic 23% utilization (per `py-videocore7` SGEMM ceiling):
+  → ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**
+
+### Bandwidth-ceiling estimate
+
+- Per frame: 8.0 MB total traffic (§2)
+- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
+- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** = **16.3 Mblock/s**
+- R = 16.3 / 8.171 ≈ **2.0**
+
+### Combined prediction
+
+M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
+Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.
+
+Honest uncertainty: this is a *prediction*, not a measurement.
+Phase 7 may show the GPU sees less than 4 GB/s under this access
+pattern (the dst read-modify-write is not coalesced and may push
+us below 1 GB/s, dragging R below 0.5). Or per-WG launch + the 8
+idle lanes per phase may halve the compute envelope. The honest
+*lower bound* is R ≈ 1.0 (memory-bound, 8 Mblock/s) — still passes
+Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid concurrent
+work viable").
+
+### What would invalidate the prediction
+
+- Vulkan compute launches that pre-amble more cost than M5b
+  estimates (e.g., per-dispatch descriptor binding overhead the
+  noop bench doesn't measure). Phase 7 records actual.
+- v3dv's compiler producing significantly worse code than `idct8_1d`'s
+  expected ~30 ops (e.g., unable to fuse multiply-adds, or
+  spilling registers). Phase 7 records `glsl-spv` and
+  `spv-disas` output for inspection.
+- Bandwidth contention with concurrent CPU activity (which is
+  guaranteed on hertz — LXD spine). Phase 7 must measure with
+  representative CPU load.
+
+## 7. What will be touched / not touched
+
+**Touched** (created or modified in Phase 6):
+
+- `src/v3d_idct8.comp` — the GLSL compute shader.
+- `src/v3d_runner.c` + `src/v3d_runner.h` — Vulkan boilerplate
+  for instance/device/queue/buffer/pipeline setup, common to
+  all kernels.
+- `tests/bench_v3d_idct.c` — the M2 throughput bench and the
+  M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
+- `CMakeLists.txt` — add the new GLSL shader to the
+  `daedalus_shaders` custom command + the new bench target.
+
+**NOT touched**:
+
+- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
+  `tests/bench_vulkan_dispatch.c` — Phase 3 baselines stay
+  immutable so re-running them on the same hertz gives the
+  same numbers.
+- `external/ffmpeg-snapshot/` — vendored reference, byte-frozen.
+- Any kernel-side / DRM / firmware code. Path B specifically
+  scopes us to Vulkan compute on stock v3dv.
+
+## 8. Phase 5 review handoff
+
+Per `dev_process.md`:
+
+> Goal, situation, measurements, plan get pasted into DokuWiki.
+> Markus reviews and redacts, then initiates the handover to a
+> fresh model instance. **Claude does not curate the artifact going
+> to the reviewer** — that would re-introduce the blind-spot
+> accumulation the review is meant to escape. Do not summarize
+> when handing over; paste the actual artifacts.
+
+Files the second-model reviewer needs verbatim:
+
+- `README.md`
+- `docs/phase0.md`
+- `docs/phase1.md`
+- `docs/phase2.md`
+- `docs/phase3.md`
+- `docs/phase4.md` (this file)
+- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
+- `tests/vp9_idct8_ref.c`
+- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
+- `external/ffmpeg-snapshot/PROVENANCE.md`
+
+Specific review prompts (to focus the second model on the highest-
+risk decisions):
+
+1. **Orientation correctness.** The plan asserts the QPU kernel
+   must reproduce FFmpeg's transposed column-pass orientation.
+   Is the §4 pseudocode correct against `tests/vp9_idct8_ref.c`?
+2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the
+   right starting point, or should v1 already explore wider WGs?
+3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup
+   lanes idle for each pass. Acceptable for v1, or rework the
+   algorithm so all 16 lanes work per phase (e.g., 2 blocks per
+   subgroup)?
+4. **Bandwidth prediction.** Is the §6 bandwidth estimate
+   defensible, or is the dst read-modify-write going to collapse
+   to ~1 GB/s through GPU L2 misses?
+5. **Anything missing.** What load-bearing assumption is unstated?
+
+## 9. Phase 6 execution order (if Phase 5 approves)
+
+1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
+   instance/device/queue/CB plumbing into a reusable API.
+2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
+   accepts it, spirv-val passes).
+3. `bench_v3d_idct.c`: harness that runs the shader on the same
+   randomly-generated blocks that `bench_neon_idct` uses, reads
+   back, compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
+4. First-light run on hertz. Expect non-bit-exact output —
+   transpose orientation or barrier placement bugs are likely.
+   Iterate to M1' = 100 % bit-exact.
+5. Once bit-exact: measure M2 (throughput). Compare to predicted
+   R ≈ 2.0; record actual.
+6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel.
+   If R < 0.5 → Phase 4 revision (loopback) or honest close.
+
+## 10. Open questions Phase 4 does not close
+
+- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed
+  at 16 on V3D 7.1. The shader hard-codes `gl_SubgroupSize == 16`
+  via the lane mask `lane = global_id % 16`. If a future v3dv
+  exposes a different subgroup size, the algorithm needs revisiting.
+  Acceptable risk for Phase 1; document and move on.
+- **Memory coherency between CPU and GPU.** The Phase 0
+  `is_host=true` zero-copy story is the *intent*; whether v3dv's
+  default memory allocation actually gives us coherent shared
+  buffers without explicit cache flushes is *unverified*. Phase 6
+  must check — if writes don't appear on the other side, add
+  `VkMappedMemoryRange` flush/invalidate calls.
+- **Block-position metadata source.** §5 specifies `meta` SSBO
+  with (by, bx). Alternatively, derive position from block_idx
+  arithmetic using `blocks_per_row` push constant — saves a
+  buffer at the cost of a divide-modulo per thread. Phase 4 picks
+  the buffered version; Phase 7 may revisit.
diff --git a/docs/phase5.md b/docs/phase5.md
new file mode 100644
index 0000000..ac4a757
--- /dev/null
+++ b/docs/phase5.md
@@ -0,0 +1,148 @@
+---
+phase: 5
+status: closed 2026-05-18 — PASS-WITH-REVISIONS, revisions applied
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+parent: phase4.md
+reviewer: Claude Sonnet (general-purpose Agent, fresh context, no prior session memory)
+plan_author: Claude Opus 4.7 (this session's primary model)
+verdict: PASS-WITH-REVISIONS
+---
+
+# Phase 5 — Second-Model Review of Phase 4 Plan
+
+Per `dev_process.md`:
+
+> Goal, situation, measurements, plan get pasted into DokuWiki.
+> Markus reviews and redacts, then initiates the handover to a
+> fresh model instance. Claude does not curate the artifact going
+> to the reviewer — that would re-introduce the blind-spot
+> accumulation the review is meant to escape.
+
+For this session the handover mechanism was the in-session `Agent`
+tool with `subagent_type=general-purpose, model=sonnet` rather than
+DokuWiki round-trip. The reviewer had no prior session context and
+read the artifacts (README, dev_process.md, phase0–4, ref.c,
+bench_neon_idct.c, PROVENANCE.md, relevant vulkaninfo sections, the
+vendored FFmpeg `vp9dsp_template.c`) directly off disk.
+
+The reviewer was given the 5 specific high-risk prompts from
+`phase4.md §8` plus "anything else."
+
+The review found:
+- 0 issues on orientation (independently traced against 3 sources)
+- 1 deferrable optimisation opportunity on WG geometry
+- 1 design improvement opportunity on idle-lane geometry
+- 1 documentation honesty improvement on bandwidth (range vs point)
+- **1 RED-class undefined-behaviour bug** (data race on packed dst)
+- **1 RED-class undefined-behaviour bug** (early-return before barrier)
+- 1 factual error in the constraint table carried through 3 documents
+- 1 accounting error in the compute envelope (no impact on conclusion)
+
+Two RED-class bugs in the plan, both caught before any code was
+written. The review paid for itself.
+
+---
+
+## Review (verbatim)
+
+The following is the Sonnet reviewer's response, verbatim, with no
+editing beyond markdown-fence wrapping.
+
+````markdown
+## Verdict
+
+**PASS-WITH-REVISIONS** — two issues must be addressed before Phase 6 writes any shader code. Everything else is acceptable-for-v1 or a documentation fix.
+
+---
+
+## Per-prompt findings
+
+**1. Orientation correctness — GREEN**
+
+The Phase 4 §4 pseudocode is correct against `tests/vp9_idct8_ref.c` and against the upstream FFmpeg `itxfm_wrapper` macro.
+
+Tracing the FFmpeg macro (lines 1180–1191): the column pass calls `idct8_1d(block + i, stride=sz, tmp + i*sz, pass=0)`. The `IN(x)` macro expands to `block[i + x*sz]`, which reads row `x`, column `i` — i.e., column `i` of the row-major block. The output lands at `tmp[i*sz + 0..7]`, which is row `i` of `tmp`. That is the transposed write. The row pass calls `idct8_1d(tmp + i, stride=sz, out, pass=1)`, which reads `tmp[i + x*sz]` = column `i` of `tmp`, and writes the result as `dst[j*stride]` with `dst++` — column `i` of `dst`. Both passes verified.
+
+`vp9_idct8_ref.c` does the same: column pass reads `block[r*8+i]` and writes `tmp[i*8+r]` (transposed), row pass reads `tmp[r*8+i]` (column `i` of `tmp`) and writes `dst[r*stride+i]` (column `i` of dst). Phase 4 pseudocode: column pass reads `block[block_idx*64 + r*8 + k]` and writes `tmp_shared[block_local_idx*64 + k*8 + r]` — correct. Row pass reads `tmp_shared[block_local_idx*64 + r*8 + k]` (column `k` of `tmp`) and writes `dst[dy*stride + dx]` where `dx = block_x*8 + k` (column `k` of dst) — correct. All three match identically. The orientation is right.
+
+**2. Workgroup geometry — YELLOW**
+
+4 blocks/WG × 64 invocations is a defensible v1 starting point: it maps cleanly to one block per 16-lane subgroup, shared memory footprint is only 1 KiB (4 × 64 × 4B), and the dispatch arithmetic is straightforward. The max of 256 invocations would give 16 blocks/WG, which reduces dispatch overhead by 4× and uses the same shared memory fraction. That is not complex to implement. The plan correctly defers the WG-size sweep to Phase 7 M6, which is the right approach — get bit-exact first, then sweep. No revision needed; flag for M6 that the sweet spot is likely in the 128–256 range.
+
+**3. Idle lanes — YELLOW (stronger than the plan implies)**
+
+The plan acknowledges the issue and flags it for Phase 7 M6. The proposed alternative — "lanes 0..7 do both passes sequentially, separated by a barrier" — would not eliminate idle lanes; it would just move the idling. The actual fix is 2 blocks per subgroup: lanes 0..7 do the column pass of block A while lanes 8..15 do the column pass of block B simultaneously; after one barrier, lanes 0..7 do the row pass of block A while lanes 8..15 do the row pass of block B. This uses all 16 lanes productively in both phases, doubles throughput per subgroup, shared memory cost is 8 blocks × 64 × 4B = 2048B (still well under 16 KiB), and dispatch count halves to 4080 WGs. The added code complexity is modest: change the lane mapping from `col_pass = lane < 8` to `block_slot = lane / 8, col_row_idx = lane % 8` and adjust `block_local_idx` accordingly. This is straightforward enough that implementing it in v1 is recommended, not a Phase 7 afterthought, especially given that BW is the governing constraint and even small compute gains matter. Acceptable to defer, but the description in §4 of the "alternative geometry" as "lanes 0..7 do column, lanes 0..7 also do row" is incorrect — that is not how you eliminate idle lanes.
+
+**4. Bandwidth prediction — YELLOW**
+
+The 8 MB/frame figure (4 MB coeffs + 2 MB pred-read + 2 MB dst-write) is correctly derived from 32,640 blocks. The 4 GB/s divisor is the optimistic end of the GPU memory bandwidth range. The `dst` concern is real: in the row pass, lane `k` writes to column `k` of the output block. For a 1920-pixel-wide frame (stride 1920 bytes), consecutive rows of the same output column are 1920 bytes = 30 cache lines apart. The 8 writes per lane per block are all cache-line-miss candidates. However, the WGs will be dispatched in block-index order, and consecutive blocks in the same row share output cache lines — for block_x=0,1 (both in row 0), their dst rows overlap in the same L2 lines. V3D's L2C is 256 KB and the 8 rows of a frame row fit in 15 KB, so temporal reuse within a row of blocks is plausible. An honest bandwidth range is 1–4 GB/s, not a point estimate at 4. This gives an honest M2 range of ~4–16 Mblock/s (R = 0.5–2.0), which still covers the "hybrid concurrent work viable" decision band at the low end. The plan's acknowledgment in §6 of "may push below 1 GB/s" is appropriate. No revision needed, but Phase 7 should record the actual L2 hit rate if the v3dv/DRM perf counters expose it.
+
+**5. Unstated load-bearing assumption — RED**
+
+The `dst` SSBO is specified as `int32_t[]` with 4 pixels packed per word. In the row pass, lanes 0..7 each write to a different column of the output block. For any block position, lanes 0, 1, 2, 3 write to columns `block_x*8+0`, `block_x*8+1`, `block_x*8+2`, `block_x*8+3` — which are bytes 0, 1, 2, 3 of the same `int32_t` word at `dst[(row) * stride_u32 + (block_x*8)/4]`. Lanes 4, 5, 6, 7 write to the next word. For every block at every position, exactly four lanes are writing to the same 32-bit word simultaneously. In Vulkan, concurrent non-atomic writes from different invocations to overlapping memory addresses are undefined behavior. The v3d hardware does not guarantee atomic sub-word stores. The result in practice will be silent data corruption where one lane's write stomps another's.
+
+The fix is straightforward: use a `uint8_t[]` SSBO binding. `storageBuffer8BitAccess = true` on this device (verified in vulkaninfo line 735), and GLSL supports `uint8_t` buffer members with `VK_KHR_8bit_storage`. Each lane then writes to address `block_y*8*stride + r*stride + block_x*8 + k` — a unique byte address, no overlap. The pred/dst buffer is still a single SSBO binding (≤8 limit not affected), just typed `uint8_t` instead of `int32_t`. This must be resolved before Phase 6 writes the shader. The current §5 layout specification is wrong.
+
+---
+
+## Other findings
+
+**6. Subgroup operations constraint table is incorrect — YELLOW**
+
+Phase 0 §2 table, Phase 2 §6 C4, and Phase 4 §1 C4 all state "subgroupSupportedOperations = BASIC + VOTE only." The actual vulkaninfo output (lines 285–291) shows six operations: BASIC, VOTE, **BALLOT, SHUFFLE, SHUFFLE_RELATIVE, QUAD**. The plan still works correctly — it never calls any subgroup operation at all and uses a workgroup barrier + shared memory, which is fully valid. But the constraint table will mislead Phase 6 and Phase 7 implementers into thinking `subgroupShuffle` is unavailable. It is available, and it opens an alternative algorithm: the 16-lane subgroup can distribute column-pass results to row-pass lanes via `subgroupShuffle` without any shared memory or barrier, replacing the `tmp_shared[4][64]` array entirely. That would eliminate the barrier latency and reduce shared memory pressure, though it requires a different lane-mapping design. This is a Phase 7 optimization candidate, but the constraint table needs correcting so future phases don't build on a false foundation.
+
+**Required fix:** Correct C4 in phase4.md §1 to: "subgroupSupportedOperations = BASIC, VOTE, BALLOT, SHUFFLE, SHUFFLE_RELATIVE, QUAD — no arithmetic reductions (no `subgroupAdd`). Note: `subgroupShuffle` IS available; Phase 1 does not use it." Also correct phase0.md §2 table.
+
+**7. Early-return before barrier is UB for non-multiple-of-4 block counts — RED (addressable in Phase 6 with 2 lines)**
+
+The pseudocode guard `if (block_idx >= n_blocks) return;` fires before `barrier()`. In Vulkan GLSL, a `barrier()` call must be reached by all invocations in the workgroup — a divergent early exit before it is undefined behavior (the spec says "all invocations of a workgroup must execute barrier with the same operands, in the same order, relative to other uses of OpMemoryBarrier"). For 1080p (32,640 blocks, evenly divisible by 4), there are no partial WGs and no problem. But VP9 frames need not have dimensions divisible by 32 pixels (4 blocks × 8). A 1366×768 frame has 170×96 = 16,320 blocks, divisible by 4. A hypothetical 80×80 frame has 10×10 = 100 blocks, divisible by 4. However an 88×72 frame has 11×9 = 99 blocks, not divisible by 4, and will have a partial WG. The plan does not address this. The fix is two lines: replace the early return with a `bool oob = block_idx >= n_blocks;` flag, then guard the column-pass work body and row-pass work body individually with `if (!oob)`, while leaving the `barrier()` unconditional. All lanes in the WG then always reach the barrier.
+
+**8. Compute envelope derivation has an accounting error — GREEN (no impact on final number)**
+
+Phase 4 §6 derives "~2 MOPs / frame" as `32,640 blocks × 60 ops`. This is then divided by 92 GOPS to get 22 µs. The arithmetic conflates per-block lane-ops with SIMD-cycle-equivalent ops. Each 16-lane subgroup processes 1 block in the column pass by running 8 active lanes × 30 ops = 240 useful lane-ops in 30 SIMD cycles, then similarly for the row pass. The "60 ops" correctly counts SIMD cycles per block (2 × 30), but dividing by 92 GOPS (which are whole-SIMD-width ops) underestimates compute time by a factor of ~16/8 = 2 (idle lanes). The corrected compute frame time is ~44 µs, not 22 µs. But since bandwidth at ~2 ms is the governing constraint and is 45× the compute time, the error has zero impact on the `min(compute, bandwidth) ≈ 16 Mblock/s` conclusion. No revision required, but §6 should note the 2× discrepancy to avoid misleading future optimization work.
+
+---
+
+## What you would change about the plan
+
+**Required before Phase 6 (two changes):**
+
+1. **Phase 4 §5, binding 1** — Replace `int32_t[]` with `uint8_t[]` backed by `storageBuffer8BitAccess`. Remove the sentence "Packing 4 bytes per i32 simplifies write-back to coalesced 32-bit stores." Replace with: "Declare as `layout(binding=1) buffer Dst { uint8_t dst[]; };` using `VK_KHR_8bit_storage` (storageBuffer8BitAccess = true on this device). Lane k in the row pass writes `dst[(block_y*8+r)*stride + block_x*8+k]` directly. No packing. No sub-word race condition." Update `dst_stride_u32` push constant to `dst_stride_u8` (now in bytes, not words).
+
+2. **Phase 4 §4, pseudocode preamble** — Add a `bool oob = (block_idx >= n_blocks);` before the `if (col_pass)` branch. Replace `if (block_idx >= n_blocks) return;` (early return) with the `oob` flag, and wrap the col-pass body and row-pass body with `if (!oob)`, leaving `barrier()` unconditional. Add a comment: `// barrier() must be reached by all WG invocations — do not early-return before it.`
+
+**Recommended before Phase 6 (one change):**
+
+3. **Phase 4 §1, C4 row** — Correct from "subgroupSupportedOperations = BASIC + VOTE only" to "BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD — arithmetic reductions unavailable but `subgroupShuffle` is." Also fix the same sentence in phase0.md §2 and phase2.md §6.
+
+**Defer to Phase 7 (no change needed now):**
+
+4. The 2-blocks-per-subgroup redesign is worth implementing as the v1 shape rather than the half-idle design, but it is not a correctness issue. If the team prefers the simpler v1, the M6 sweep in Phase 7 is the right time to do it. The Phase 4 description of the alternative ("lanes 0..7 do column, lanes 0..7 also do row") is an incorrect description of how to eliminate idle lanes — correct it to "lanes 0..7 handle block A, lanes 8..15 handle block B simultaneously in both passes."
+
+---
+
+## Confidence
+
+High on findings 5 (race condition), 6 (subgroup ops), and 7 (early return / barrier); those are mechanical reads against the vulkaninfo output and the Vulkan spec. Medium on finding 4 (bandwidth): the 1–4 GB/s range is informed reasoning but the actual L2 behavior on V3D 7.1 is not documented publicly and Phase 7 will be the real answer. The orientation correctness (finding 1) has the highest confidence — three independent traces (FFmpeg macro, ref.c, phase4 pseudocode) all agree.
+````
+
+---
+
+## Resolution
+
+| Finding | Severity | Action | Status |
+|---|---|---|---|
+| 1 (orientation) | GREEN | None | accepted |
+| 2 (WG geom) | YELLOW | Defer to Phase 7 M6 sweep (matches plan) | accepted |
+| 3 (idle lanes) | YELLOW | Correct §4 description of "alternative"; keep half-idle v1 design (defer 2-blocks-per-subgroup to Phase 7) | apply description fix |
+| 4 (bandwidth) | YELLOW | Acceptable; note that 4 GB/s is optimistic end | accepted as-is, §6 already flags downside |
+| 5 (dst race) | **RED** | Switch dst SSBO to `uint8_t[]` via `VK_KHR_8bit_storage` + storageBuffer8BitAccess | applied to phase4.md §5 |
+| 6 (subgroup ops table) | YELLOW | Correct C4 in phase0.md §2, phase2.md §6, phase4.md §1 | applied to all three |
+| 7 (barrier UB) | **RED** | Add `oob` flag pattern in §4 pseudocode | applied to phase4.md §4 |
+| 8 (compute envelope 2× off) | GREEN | Note in §6 — doesn't change conclusion | apply note |
+
+After revisions:
+- Phase 4 is **APPROVED** for Phase 6 implementation.
+- No re-review needed; revisions are mechanical and address verified bugs/errors.
+- Phase 6 may proceed with the revised plan.