Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)

Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-18 11:47:03 +00:00
parent dcbbc77038
commit 71db72928f
4 changed files with 550 additions and 2 deletions
+400
View File
@@ -0,0 +1,400 @@
---
phase: 4
status: open (awaiting Phase 5 review)
date_opened: 2026-05-18
parent: phase3.md
target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
expected_artifacts: src/v3d_idct8_kernel.comp, src/v3d_kernel_runner.{c,h},
tests/bench_v3d_idct.c, CMakeLists.txt updates
---
# Phase 4 — Plan
Per `dev_process.md`:
> Formulate the approach. Identify what will and will not be touched.
> State expected outcome of implementation in the *same* measurable
> terms used in Phase 1/3.
And the Phase 6 contract-before-code rule, applied early here so
Phase 5 review can verify the contracts before any code is written:
> Read the API contract — kernel docs, header comments, and upstream
> source for every call touched. State the contract explicitly
> before implementing against it.
## 1. Constraint recap
All numbered constraints are carried verbatim from `phase0.md §6`
and `phase3.md`. The plan must satisfy every one.
| ID | Constraint | Source |
|---|---|---|
| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6 |
| C5 | per-block fixed-point multiplies fit SMUL24 (≤17-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
| C6 | GPU LPDDR4x share ≈ 47 GB/s (CPU sees 1215) | py-videocore7 scopy benchmark |
| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |
Spec contracts the kernel must implement against:
- **VP9 specification §8.7 — inverse transform process** (Google,
2016). The canonical text. Bit-exact verification is the
ultimate gate, not the spec's words.
- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
`libavcodec/vp9dsp_template.c` and
`ff_vp9_idct_idct_8x8_add_neon` from
`libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
`external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The
C reference and NEON implementation differ in output
*orientation*: FFmpeg's column pass writes transposed (column
`i` of block IDCT'd into row `i` of `tmp`), and the row pass
reads columns of `tmp` and writes columns of `dst` via the
`dst++` pattern. **Our QPU kernel must reproduce this
orientation** or M1 fails. See
`tests/vp9_idct8_ref.c` for the corrected reference (which
itself was caught failing during Phase 3 and fixed to match
the FFmpeg orientation).
## 2. Workload model
For a 1920×1088-luma 1080p frame:
| Quantity | Value |
|---|---|
| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
| Coeff bytes per frame (i16 × 64 / block) | 4 178 880 ≈ 3.99 MiB |
| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MB / frame** |
Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT
kernel is the same arithmetic; chroma plumbing is a separate
follow-on once luma works).
## 3. Workgroup geometry decision
Choices and why:
- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON
on overhead alone; frame-level (32 640 blocks) is the only
granularity that amortises C7 to negligible (1.0 ns/block of
overhead vs NEON's 122 ns/block).
- **Workgroup size**: **64 invocations** (= 4 subgroups × 16
lanes), `local_size_x = 64`. One workgroup processes **4
blocks**, one block per subgroup.
Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256):
Phase 1 deliberately keeps the first kernel simple. 4 blocks
gives us a clean subgroup-per-block mapping and small shared
memory footprint (4 × 64 × 4 B = 1024 B for the transpose
scratch, far below C2's 16 KiB). Phase 7 may sweep wider WGs
(16 / 32 / 64 / 128 / 256) per `phase1.md §"Secondary
measurements" M6` to find the optimum.
Why 1 block/subgroup: a 16-lane subgroup naturally maps to
"one cooperating team for one 8×8 block." Lanes 0..7 do the
column pass (each owning a column of the block); lanes 8..15
do the row pass (each owning a row of dst output). Lanes are
not reduced (C4 satisfied — we never call `subgroupAdd`).
- **Dispatch count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p
luma plane. Per Vulkan spec
`maxComputeWorkGroupCount.x = 65 535` — plenty of headroom.
## 4. Per-thread algorithm
The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
exactly (column pass with transposed write, row pass with
columnar write). Pseudocode:
```glsl
// Per-thread state derived from gl_GlobalInvocationID.x:
// block_idx = global_id / 16
// lane = global_id % 16 // [0..15], subgroup-local
// col_pass = lane < 8 // first half does columns
// row_pass = lane >= 8 // second half does rows
// k = lane & 7 // column or row index 0..7
// Out-of-bounds flag — do NOT early-return here. Vulkan requires
// barrier() to be reached by all WG invocations; a divergent early
// return ahead of the barrier is undefined behaviour (and would bite
// us on any frame whose block count isn't a multiple of 4).
// Per phase5.md finding 7.
bool oob = (block_idx >= n_blocks);
if (col_pass && !oob) {
// Read column k of block_idx, 8 i16 values.
int in[8];
for (r = 0; r < 8; r++)
in[r] = block[block_idx * 64 + r * 8 + k];
int out[8];
idct8_1d(in, out); // 14 mults + 12 adds + 4 shifts
// Transposed write: row k of tmp_shared[block_local_idx]
for (r = 0; r < 8; r++)
tmp_shared[block_local_idx * 64 + k * 8 + r] = out[r];
}
barrier(); // ALL lanes reach this, oob or not
if (row_pass && !oob) {
// Read column k of tmp_shared[block_local_idx], 8 i32 values.
int in[8];
for (r = 0; r < 8; r++)
in[r] = tmp_shared[block_local_idx * 64 + r * 8 + k];
int out[8];
idct8_1d(in, out); // same kernel
// Columnar write into dst: column k of the destination 8x8.
// Block position in dst: (block_y, block_x) = (block_idx / Wb, block_idx % Wb) × 8.
// dst is uint8_t[] (NOT packed i32): each lane writes a unique
// byte address → no sub-word race. Per phase5.md finding 5.
for (r = 0; r < 8; r++) {
uint dx = block_x * 8 + k;
uint dy = block_y * 8 + r;
uint8_t px = dst[dy * stride + dx];
int sum = int(px) + ((out[r] + 16) >> 5);
dst[dy * stride + dx] = uint8_t(clamp(sum, 0, 255));
}
}
```
Notes:
- `idct8_1d` is the same 1D 8-point IDCT used in
`tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged
(constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102,
Q14 round-shift `(x + (1<<13)) >> 14`).
- All arithmetic is `int` (32-bit) per C1. Multiplies are
i16 × i16 → i32, well within SMUL24 (C5).
- The barrier is a *workgroup* barrier (`barrier()` + memory
barrier), not a subgroup operation, so C4 is satisfied.
- DC-only fast path (`eob == 1`) is *not* implemented in v1.
Per Phase 3 M1 stats, DC-only frequency is 0.11 %; the cost
of unconditional general-path execution is ~1× per-block. Phase
7 may add the fast path if M2 measurement is close to the
decision threshold.
- The 8 idle lanes per subgroup per phase (lanes 8..15 idle
during column pass; lanes 0..7 idle during row pass) waste
half of subgroup-cycle bandwidth. The real alternative geometry
(corrected per phase5.md finding 3) is **2 blocks per subgroup**:
lanes 0..7 handle block A's column pass while lanes 8..15
handle block B's column pass simultaneously; one barrier; lanes
0..7 handle A's row pass, lanes 8..15 handle B's row pass.
Doubles useful work per subgroup, halves dispatch count to
~4080 WGs, shared mem grows to 2 KiB (still 8 % of C2). The v1
half-idle design is kept deliberately for simplicity-first; the
2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
## 5. Memory layout / SSBO design
Three SSBO bindings, well under C3's limit of 8:
| binding | name | type | size | usage |
|---|---|---|---|---|
| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |
Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
writing to the *same* 32-bit word every block (lanes 0..3 → bytes
0..3 of one word; lanes 4..7 → next word). Vulkan does not provide
atomic sub-word writes — concurrent non-atomic writes to overlapping
addresses are undefined behaviour. v3dv exposes
`storageBuffer8BitAccess = true` (verified in
`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:
```glsl
#extension GL_EXT_shader_8bit_storage : require
layout(binding = 1) buffer Dst { uint8_t dst[]; };
```
Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]`
a unique byte address per lane, no race.
Push constants (C8: 128 B max, we use 16):
```
layout(push_constant) uniform PC {
uint n_blocks; // total 8×8 blocks in this dispatch
uint blocks_per_row; // (dst_width / 8), for block_idx → (by, bx)
uint dst_stride_u8; // dst row stride in bytes (8-bit storage)
uint _pad;
} pc;
```
The 7 Q14 trig constants are baked into the shader source as
literal `const int` — they don't need to be in a buffer.
Shared memory layout (4 blocks per WG, i32 per element):
```
shared int tmp_shared[4][64]; // 4 × 64 × 4 B = 1024 B
```
## 6. Predicted M2 (the expected outcome per Phase 1)
Phase 1 §"Decision rules" requires a predicted measurement. We
state two predictions: a compute-envelope upper bound and a
bandwidth-ceiling upper bound. M2 is bounded by the *minimum*.
### Compute-envelope estimate
- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
- 2 passes per block (column + row) = 60 SIMD cycles per block
*(idealised — every SIMD cycle issued on the full 16-lane width)*
- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
- At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at
960 MHz — we use int arithmetic, SMUL24 throughput is comparable.
Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only,
idealised — see correction below)
- **Correction per phase5.md finding 8:** the v1 half-idle
geometry runs only 8 of the 16 subgroup lanes per phase, so
the effective SIMD utilisation is 0.5× — compute frame time
is closer to **~44 µs / frame**. This doesn't change the
conclusion because the bandwidth ceiling (below) is the
governing constraint, but the 2× note is essential for any
future compute-bound optimisation work.
- Plus C7 = 33 µs/dispatch overhead, idealised total ≈ **55 µs / frame****18 000 FPS****587 Mblock/s**
- Realistic 23% utilization (per `py-videocore7` SGEMM ceiling):
→ ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**
### Bandwidth-ceiling estimate
- Per frame: 8.0 MB total traffic (§2)
- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** = **16.3 Mblock/s**
- R = 16.3 / 8.171 ≈ **2.0**
### Combined prediction
M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.
Honest uncertainty: this is a *prediction*, not a measurement.
Phase 7 may show the GPU sees less than 4 GB/s under this access
pattern (the dst read-modify-write is not coalesced and may push
us below 1 GB/s, dragging R below 0.5). Or per-WG launch + the 8
idle lanes per phase may halve the compute envelope. The honest
*lower bound* is R ≈ 1.0 (memory-bound, 8 Mblock/s) — still passes
Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid concurrent
work viable").
### What would invalidate the prediction
- Vulkan compute launches that pre-amble more cost than M5b
estimates (e.g., per-dispatch descriptor binding overhead the
noop bench doesn't measure). Phase 7 records actual.
- v3dv's compiler producing significantly worse code than `idct8_1d`'s
expected ~30 ops (e.g., unable to fuse multiply-adds, or
spilling registers). Phase 7 records `glsl-spv` and
`spv-disas` output for inspection.
- Bandwidth contention with concurrent CPU activity (which is
guaranteed on hertz — LXD spine). Phase 7 must measure with
representative CPU load.
## 7. What will be touched / not touched
**Touched** (created or modified in Phase 6):
- `src/v3d_idct8.comp` — the GLSL compute shader.
- `src/v3d_runner.c` + `src/v3d_runner.h` — Vulkan boilerplate
for instance/device/queue/buffer/pipeline setup, common to
all kernels.
- `tests/bench_v3d_idct.c` — the M2 throughput bench and the
M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
- `CMakeLists.txt` — add the new GLSL shader to the
`daedalus_shaders` custom command + the new bench target.
**NOT touched**:
- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
`tests/bench_vulkan_dispatch.c` — Phase 3 baselines stay
immutable so re-running them on the same hertz gives the
same numbers.
- `external/ffmpeg-snapshot/` — vendored reference, byte-frozen.
- Any kernel-side / DRM / firmware code. Path B specifically
scopes us to Vulkan compute on stock v3dv.
## 8. Phase 5 review handoff
Per `dev_process.md`:
> Goal, situation, measurements, plan get pasted into DokuWiki.
> Markus reviews and redacts, then initiates the handover to a
> fresh model instance. **Claude does not curate the artifact going
> to the reviewer** — that would re-introduce the blind-spot
> accumulation the review is meant to escape. Do not summarize
> when handing over; paste the actual artifacts.
Files the second-model reviewer needs verbatim:
- `README.md`
- `docs/phase0.md`
- `docs/phase1.md`
- `docs/phase2.md`
- `docs/phase3.md`
- `docs/phase4.md` (this file)
- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
- `tests/vp9_idct8_ref.c`
- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
- `external/ffmpeg-snapshot/PROVENANCE.md`
Specific review prompts (to focus the second model on the highest-
risk decisions):
1. **Orientation correctness.** The plan asserts the QPU kernel
must reproduce FFmpeg's transposed column-pass orientation.
Is the §4 pseudocode correct against `tests/vp9_idct8_ref.c`?
2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the
right starting point, or should v1 already explore wider WGs?
3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup
lanes idle for each pass. Acceptable for v1, or rework the
algorithm so all 16 lanes work per phase (e.g., 2 blocks per
subgroup)?
4. **Bandwidth prediction.** Is the §6 bandwidth estimate
defensible, or is the dst read-modify-write going to collapse
to ~1 GB/s through GPU L2 misses?
5. **Anything missing.** What load-bearing assumption is unstated?
## 9. Phase 6 execution order (if Phase 5 approves)
1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
instance/device/queue/CB plumbing into a reusable API.
2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
accepts it, spirv-val passes).
3. `bench_v3d_idct.c`: harness that runs the shader on the same
randomly-generated blocks that `bench_neon_idct` uses, reads
back, compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
4. First-light run on hertz. Expect non-bit-exact output —
transpose orientation or barrier placement bugs are likely.
Iterate to M1' = 100 % bit-exact.
5. Once bit-exact: measure M2 (throughput). Compare to predicted
R ≈ 2.0; record actual.
6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel.
If R < 0.5 → Phase 4 revision (loopback) or honest close.
## 10. Open questions Phase 4 does not close
- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed
at 16 on V3D 7.1. The shader hard-codes `gl_SubgroupSize == 16`
via the lane mask `lane = global_id % 16`. If a future v3dv
exposes a different subgroup size, the algorithm needs revisiting.
Acceptable risk for Phase 1; document and move on.
- **Memory coherency between CPU and GPU.** The Phase 0
`is_host=true` zero-copy story is the *intent*; whether v3dv's
default memory allocation actually gives us coherent shared
buffers without explicit cache flushes is *unverified*. Phase 6
must check — if writes don't appear on the other side, add
`VkMappedMemoryRange` flush/invalidate calls.
- **Block-position metadata source.** §5 specifies `meta` SSBO
with (by, bx). Alternatively, derive position from block_idx
arithmetic using `blocks_per_row` push constant — saves a
buffer at the cost of a divide-modulo per thread. Phase 4 picks
the buffered version; Phase 7 may revisit.