Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)

Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch constraint. Frame-level dispatch (8160 WGs for 1080p), 64 invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s (BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules. Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent, no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class findings + 1 YELLOW that this commit applies: RED finding 5 (dst race condition): int32_t[] dst with 4 lanes writing to overlapping 32-bit words = non-atomic concurrent writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess (verified exposed). Applied to phase4.md sec 5 + GLSL declaration. RED finding 7 (early-return before barrier): if (block_idx >= n_blocks) return; ahead of barrier() is UB by Vulkan spec. For 1080p (32640 blocks, /4) no partial WGs; for any frame width not /32 there are. Fix: oob flag, gate work bodies, barrier() unconditional. Applied to phase4.md sec 4 pseudocode. YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only; actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+ SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any subgroup ops in v1 so unaffected, but the wrong constraint would mislead Phase 6/7. Corrected in phase0.md sec 2, phase2.md sec 6, phase4.md sec 1 (C4). GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW prediction, compute envelope accounting) accepted as-is or deferred to Phase 7 M6 sweep per plan's existing flagging. Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6 implementation. No re-review needed; revisions are mechanical and address verified bugs/errors." Phase 5 itself just paid for itself: two real UB bugs caught before any GLSL was written. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:47:03 +00:00
parent dcbbc77038
commit 71db72928f
4 changed files with 550 additions and 2 deletions
@@ -0,0 +1,400 @@
+---
+phase: 4
+status: open (awaiting Phase 5 review)
+date_opened: 2026-05-18
+parent: phase3.md
+target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
+expected_artifacts: src/v3d_idct8_kernel.comp, src/v3d_kernel_runner.{c,h},
+                    tests/bench_v3d_idct.c, CMakeLists.txt updates
+---
+
+# Phase 4 — Plan
+
+Per `dev_process.md`:
+
+> Formulate the approach. Identify what will and will not be touched.
+> State expected outcome of implementation in the *same* measurable
+> terms used in Phase 1/3.
+
+And the Phase 6 contract-before-code rule, applied early here so
+Phase 5 review can verify the contracts before any code is written:
+
+> Read the API contract — kernel docs, header comments, and upstream
+> source for every call touched. State the contract explicitly
+> before implementing against it.
+
+## 1. Constraint recap
+
+All numbered constraints are carried verbatim from `phase0.md §6`
+and `phase3.md`. The plan must satisfy every one.
+
+| ID | Constraint | Source |
+|---|---|---|
+| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
+| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
+| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
+| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6 |
+| C5 | per-block fixed-point multiplies fit SMUL24 (≤17-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
+| C6 | GPU LPDDR4x share ≈ 4–7 GB/s (CPU sees 12–15) | py-videocore7 scopy benchmark |
+| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
+| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
+| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
+| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |
+
+Spec contracts the kernel must implement against:
+
+- **VP9 specification §8.7 — inverse transform process** (Google,
+  2016). The canonical text. Bit-exact verification is the
+  ultimate gate, not the spec's words.
+- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
+  `libavcodec/vp9dsp_template.c` and
+  `ff_vp9_idct_idct_8x8_add_neon` from
+  `libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
+  `external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The
+  C reference and NEON implementation differ in output
+  *orientation*: FFmpeg's column pass writes transposed (column
+  `i` of block IDCT'd into row `i` of `tmp`), and the row pass
+  reads columns of `tmp` and writes columns of `dst` via the
+  `dst++` pattern. **Our QPU kernel must reproduce this
+  orientation** or M1 fails. See
+  `tests/vp9_idct8_ref.c` for the corrected reference (which
+  itself was caught failing during Phase 3 and fixed to match
+  the FFmpeg orientation).
+
+## 2. Workload model
+
+For a 1920×1088-luma 1080p frame:
+
+| Quantity | Value |
+|---|---|
+| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
+| Coeff bytes per frame (i16 × 64 / block) | 4 178 880 ≈ 3.99 MiB |
+| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
+| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MB / frame** |
+
+Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT
+kernel is the same arithmetic; chroma plumbing is a separate
+follow-on once luma works).
+
+## 3. Workgroup geometry decision
+
+Choices and why:
+
+- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
+  Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON
+  on overhead alone; frame-level (32 640 blocks) is the only
+  granularity that amortises C7 to negligible (1.0 ns/block of
+  overhead vs NEON's 122 ns/block).
+
+- **Workgroup size**: **64 invocations** (= 4 subgroups × 16
+  lanes), `local_size_x = 64`. One workgroup processes **4
+  blocks**, one block per subgroup.
+
+  Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256):
+  Phase 1 deliberately keeps the first kernel simple. 4 blocks
+  gives us a clean subgroup-per-block mapping and small shared
+  memory footprint (4 × 64 × 4 B = 1024 B for the transpose
+  scratch, far below C2's 16 KiB). Phase 7 may sweep wider WGs
+  (16 / 32 / 64 / 128 / 256) per `phase1.md §"Secondary
+  measurements" M6` to find the optimum.
+
+  Why 1 block/subgroup: a 16-lane subgroup naturally maps to
+  "one cooperating team for one 8×8 block." Lanes 0..7 do the
+  column pass (each owning a column of the block); lanes 8..15
+  do the row pass (each owning a row of dst output). Lanes are
+  not reduced (C4 satisfied — we never call `subgroupAdd`).
+
+- **Dispatch count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p
+  luma plane. Per Vulkan spec
+  `maxComputeWorkGroupCount.x = 65 535` — plenty of headroom.
+
+## 4. Per-thread algorithm
+
+The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
+exactly (column pass with transposed write, row pass with
+columnar write). Pseudocode:
+
+```glsl
+// Per-thread state derived from gl_GlobalInvocationID.x:
+//   block_idx  = global_id / 16
+//   lane       = global_id % 16        // [0..15], subgroup-local
+//   col_pass   = lane < 8               // first half does columns
+//   row_pass   = lane >= 8              // second half does rows
+//   k          = lane & 7               // column or row index 0..7
+
+// Out-of-bounds flag — do NOT early-return here. Vulkan requires
+// barrier() to be reached by all WG invocations; a divergent early
+// return ahead of the barrier is undefined behaviour (and would bite
+// us on any frame whose block count isn't a multiple of 4).
+// Per phase5.md finding 7.
+bool oob = (block_idx >= n_blocks);
+
+if (col_pass && !oob) {
+    // Read column k of block_idx, 8 i16 values.
+    int in[8];
+    for (r = 0; r < 8; r++)
+        in[r] = block[block_idx * 64 + r * 8 + k];
+    int out[8];
+    idct8_1d(in, out);                        // 14 mults + 12 adds + 4 shifts
+    // Transposed write: row k of tmp_shared[block_local_idx]
+    for (r = 0; r < 8; r++)
+        tmp_shared[block_local_idx * 64 + k * 8 + r] = out[r];
+}
+
+barrier();   // ALL lanes reach this, oob or not
+
+if (row_pass && !oob) {
+    // Read column k of tmp_shared[block_local_idx], 8 i32 values.
+    int in[8];
+    for (r = 0; r < 8; r++)
+        in[r] = tmp_shared[block_local_idx * 64 + r * 8 + k];
+    int out[8];
+    idct8_1d(in, out);                        // same kernel
+    // Columnar write into dst: column k of the destination 8x8.
+    // Block position in dst: (block_y, block_x) = (block_idx / Wb, block_idx % Wb) × 8.
+    // dst is uint8_t[] (NOT packed i32): each lane writes a unique
+    // byte address → no sub-word race. Per phase5.md finding 5.
+    for (r = 0; r < 8; r++) {
+        uint dx = block_x * 8 + k;
+        uint dy = block_y * 8 + r;
+        uint8_t px = dst[dy * stride + dx];
+        int sum = int(px) + ((out[r] + 16) >> 5);
+        dst[dy * stride + dx] = uint8_t(clamp(sum, 0, 255));
+    }
+}
+```
+
+Notes:
+- `idct8_1d` is the same 1D 8-point IDCT used in
+  `tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged
+  (constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102,
+  Q14 round-shift `(x + (1<<13)) >> 14`).
+- All arithmetic is `int` (32-bit) per C1. Multiplies are
+  i16 × i16 → i32, well within SMUL24 (C5).
+- The barrier is a *workgroup* barrier (`barrier()` + memory
+  barrier), not a subgroup operation, so C4 is satisfied.
+- DC-only fast path (`eob == 1`) is *not* implemented in v1.
+  Per Phase 3 M1 stats, DC-only frequency is 0.11 %; the cost
+  of unconditional general-path execution is ~1× per-block. Phase
+  7 may add the fast path if M2 measurement is close to the
+  decision threshold.
+- The 8 idle lanes per subgroup per phase (lanes 8..15 idle
+  during column pass; lanes 0..7 idle during row pass) waste
+  half of subgroup-cycle bandwidth. The real alternative geometry
+  (corrected per phase5.md finding 3) is **2 blocks per subgroup**:
+  lanes 0..7 handle block A's column pass while lanes 8..15
+  handle block B's column pass simultaneously; one barrier; lanes
+  0..7 handle A's row pass, lanes 8..15 handle B's row pass.
+  Doubles useful work per subgroup, halves dispatch count to
+  ~4080 WGs, shared mem grows to 2 KiB (still 8 % of C2). The v1
+  half-idle design is kept deliberately for simplicity-first; the
+  2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
+
+## 5. Memory layout / SSBO design
+
+Three SSBO bindings, well under C3's limit of 8:
+
+| binding | name | type | size | usage |
+|---|---|---|---|---|
+| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
+| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
+| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |
+
+Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
+finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
+writing to the *same* 32-bit word every block (lanes 0..3 → bytes
+0..3 of one word; lanes 4..7 → next word). Vulkan does not provide
+atomic sub-word writes — concurrent non-atomic writes to overlapping
+addresses are undefined behaviour. v3dv exposes
+`storageBuffer8BitAccess = true` (verified in
+`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:
+
+```glsl
+#extension GL_EXT_shader_8bit_storage : require
+layout(binding = 1) buffer Dst { uint8_t dst[]; };
+```
+
+Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]` —
+a unique byte address per lane, no race.
+
+Push constants (C8: 128 B max, we use 16):
+
+```
+layout(push_constant) uniform PC {
+    uint n_blocks;        // total 8×8 blocks in this dispatch
+    uint blocks_per_row;  // (dst_width / 8), for block_idx → (by, bx)
+    uint dst_stride_u8;   // dst row stride in bytes (8-bit storage)
+    uint _pad;
+} pc;
+```
+
+The 7 Q14 trig constants are baked into the shader source as
+literal `const int` — they don't need to be in a buffer.
+
+Shared memory layout (4 blocks per WG, i32 per element):
+
+```
+shared int tmp_shared[4][64];  // 4 × 64 × 4 B = 1024 B
+```
+
+## 6. Predicted M2 (the expected outcome per Phase 1)
+
+Phase 1 §"Decision rules" requires a predicted measurement. We
+state two predictions: a compute-envelope upper bound and a
+bandwidth-ceiling upper bound. M2 is bounded by the *minimum*.
+
+### Compute-envelope estimate
+
+- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
+- 2 passes per block (column + row) = 60 SIMD cycles per block
+  *(idealised — every SIMD cycle issued on the full 16-lane width)*
+- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
+- At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at
+  960 MHz — we use int arithmetic, SMUL24 throughput is comparable.
+  Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
+- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only,
+  idealised — see correction below)
+- **Correction per phase5.md finding 8:** the v1 half-idle
+  geometry runs only 8 of the 16 subgroup lanes per phase, so
+  the effective SIMD utilisation is 0.5× — compute frame time
+  is closer to **~44 µs / frame**. This doesn't change the
+  conclusion because the bandwidth ceiling (below) is the
+  governing constraint, but the 2× note is essential for any
+  future compute-bound optimisation work.
+- Plus C7 = 33 µs/dispatch overhead, idealised total ≈ **55 µs / frame** ≈ **18 000 FPS** ≈ **587 Mblock/s**
+- Realistic 23% utilization (per `py-videocore7` SGEMM ceiling):
+  → ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**
+
+### Bandwidth-ceiling estimate
+
+- Per frame: 8.0 MB total traffic (§2)
+- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
+- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** = **16.3 Mblock/s**
+- R = 16.3 / 8.171 ≈ **2.0**
+
+### Combined prediction
+
+M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
+Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.
+
+Honest uncertainty: this is a *prediction*, not a measurement.
+Phase 7 may show the GPU sees less than 4 GB/s under this access
+pattern (the dst read-modify-write is not coalesced and may push
+us below 1 GB/s, dragging R below 0.5). Or per-WG launch + the 8
+idle lanes per phase may halve the compute envelope. The honest
+*lower bound* is R ≈ 1.0 (memory-bound, 8 Mblock/s) — still passes
+Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid concurrent
+work viable").
+
+### What would invalidate the prediction
+
+- Vulkan compute launches that pre-amble more cost than M5b
+  estimates (e.g., per-dispatch descriptor binding overhead the
+  noop bench doesn't measure). Phase 7 records actual.
+- v3dv's compiler producing significantly worse code than `idct8_1d`'s
+  expected ~30 ops (e.g., unable to fuse multiply-adds, or
+  spilling registers). Phase 7 records `glsl-spv` and
+  `spv-disas` output for inspection.
+- Bandwidth contention with concurrent CPU activity (which is
+  guaranteed on hertz — LXD spine). Phase 7 must measure with
+  representative CPU load.
+
+## 7. What will be touched / not touched
+
+**Touched** (created or modified in Phase 6):
+
+- `src/v3d_idct8.comp` — the GLSL compute shader.
+- `src/v3d_runner.c` + `src/v3d_runner.h` — Vulkan boilerplate
+  for instance/device/queue/buffer/pipeline setup, common to
+  all kernels.
+- `tests/bench_v3d_idct.c` — the M2 throughput bench and the
+  M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
+- `CMakeLists.txt` — add the new GLSL shader to the
+  `daedalus_shaders` custom command + the new bench target.
+
+**NOT touched**:
+
+- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
+  `tests/bench_vulkan_dispatch.c` — Phase 3 baselines stay
+  immutable so re-running them on the same hertz gives the
+  same numbers.
+- `external/ffmpeg-snapshot/` — vendored reference, byte-frozen.
+- Any kernel-side / DRM / firmware code. Path B specifically
+  scopes us to Vulkan compute on stock v3dv.
+
+## 8. Phase 5 review handoff
+
+Per `dev_process.md`:
+
+> Goal, situation, measurements, plan get pasted into DokuWiki.
+> Markus reviews and redacts, then initiates the handover to a
+> fresh model instance. **Claude does not curate the artifact going
+> to the reviewer** — that would re-introduce the blind-spot
+> accumulation the review is meant to escape. Do not summarize
+> when handing over; paste the actual artifacts.
+
+Files the second-model reviewer needs verbatim:
+
+- `README.md`
+- `docs/phase0.md`
+- `docs/phase1.md`
+- `docs/phase2.md`
+- `docs/phase3.md`
+- `docs/phase4.md` (this file)
+- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
+- `tests/vp9_idct8_ref.c`
+- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
+- `external/ffmpeg-snapshot/PROVENANCE.md`
+
+Specific review prompts (to focus the second model on the highest-
+risk decisions):
+
+1. **Orientation correctness.** The plan asserts the QPU kernel
+   must reproduce FFmpeg's transposed column-pass orientation.
+   Is the §4 pseudocode correct against `tests/vp9_idct8_ref.c`?
+2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the
+   right starting point, or should v1 already explore wider WGs?
+3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup
+   lanes idle for each pass. Acceptable for v1, or rework the
+   algorithm so all 16 lanes work per phase (e.g., 2 blocks per
+   subgroup)?
+4. **Bandwidth prediction.** Is the §6 bandwidth estimate
+   defensible, or is the dst read-modify-write going to collapse
+   to ~1 GB/s through GPU L2 misses?
+5. **Anything missing.** What load-bearing assumption is unstated?
+
+## 9. Phase 6 execution order (if Phase 5 approves)
+
+1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
+   instance/device/queue/CB plumbing into a reusable API.
+2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
+   accepts it, spirv-val passes).
+3. `bench_v3d_idct.c`: harness that runs the shader on the same
+   randomly-generated blocks that `bench_neon_idct` uses, reads
+   back, compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
+4. First-light run on hertz. Expect non-bit-exact output —
+   transpose orientation or barrier placement bugs are likely.
+   Iterate to M1' = 100 % bit-exact.
+5. Once bit-exact: measure M2 (throughput). Compare to predicted
+   R ≈ 2.0; record actual.
+6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel.
+   If R < 0.5 → Phase 4 revision (loopback) or honest close.
+
+## 10. Open questions Phase 4 does not close
+
+- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed
+  at 16 on V3D 7.1. The shader hard-codes `gl_SubgroupSize == 16`
+  via the lane mask `lane = global_id % 16`. If a future v3dv
+  exposes a different subgroup size, the algorithm needs revisiting.
+  Acceptable risk for Phase 1; document and move on.
+- **Memory coherency between CPU and GPU.** The Phase 0
+  `is_host=true` zero-copy story is the *intent*; whether v3dv's
+  default memory allocation actually gives us coherent shared
+  buffers without explicit cache flushes is *unverified*. Phase 6
+  must check — if writes don't appear on the other side, add
+  `VkMappedMemoryRange` flush/invalidate calls.
+- **Block-position metadata source.** §5 specifies `meta` SSBO
+  with (by, bx). Alternatively, derive position from block_idx
+  arithmetic using `blocks_per_row` push constant — saves a
+  buffer at the cost of a divide-modulo per thread. Phase 4 picks
+  the buffered version; Phase 7 may revisit.