---
phase: 4
status: open (awaiting Phase 5 review)
date_opened: 2026-05-18
parent: phase3.md
target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
expected_artifacts: src/v3d_idct8.comp, src/v3d_runner.{c,h}, tests/bench_v3d_idct.c, CMakeLists.txt updates
---

# Phase 4: Plan

Per `dev_process.md`:

> Formulate the approach. Identify what will and will not be touched.
> State expected outcome of implementation in the *same* measurable
> terms used in Phase 1/3.

And the Phase 6 contract-before-code rule, applied early here so Phase 5
review can verify the contracts before any code is written:

> Read the API contract: kernel docs, header comments, and upstream
> source for every call touched. State the contract explicitly
> before implementing against it.

## 1. Constraint recap

All numbered constraints are carried verbatim from `phase0.md §6` and
`phase3.md`. The plan must satisfy every one.

| ID | Constraint | Source |
|---|---|---|
| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations (corrected per phase5.md finding 6) |
| C5 | per-block fixed-point multiplies fit SMUL24 (≤24-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
| C6 | GPU LPDDR4x share ≈ 4–7 GB/s (CPU sees 12–15) | py-videocore7 scopy benchmark |
| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |

Spec
contracts the kernel must implement against:

- **VP9 specification §8.7, inverse transform process** (Google, 2016).
  The canonical text. Bit-exact verification is the ultimate gate, not
  the spec's words.
- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
  `libavcodec/vp9dsp_template.c` and `ff_vp9_idct_idct_8x8_add_neon` from
  `libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
  `external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The C
  reference and NEON implementation differ in output *orientation*:
  FFmpeg's column pass writes transposed (column `i` of block IDCT'd into
  row `i` of `tmp`), and the row pass reads columns of `tmp` and writes
  columns of `dst` via the `dst++` pattern. **Our QPU kernel must
  reproduce this orientation** or M1 fails. See `tests/vp9_idct8_ref.c`
  for the corrected reference (which itself was caught failing during
  Phase 3 and fixed to match the FFmpeg orientation).

## 2. Workload model

For a 1920×1088-luma 1080p frame:

| Quantity | Value |
|---|---|
| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
| Coeff bytes per frame (i16 × 64 / block) | 32 640 × 128 = 4 177 920 ≈ 3.98 MiB |
| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MB / frame** |

Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT kernel is
the same arithmetic; chroma plumbing is a separate follow-on once luma
works).

## 3. Workgroup geometry decision

Choices and why:

- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
  Justification: C7 forces ≥ 270 blocks/dispatch to beat NEON on overhead
  alone (32.95 µs ÷ 122 ns/block ≈ 270); frame-level (32 640 blocks) is
  the only granularity that amortises C7 to negligible (1.0 ns/block of
  overhead vs NEON's 122 ns/block).
- **Workgroup size**: **64 invocations** (= 4 subgroups × 16 lanes),
  `local_size_x = 64`. One workgroup processes **4 blocks**, one block
  per subgroup.

  Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256): Phase 1
  deliberately keeps the first kernel simple. 4 blocks gives us a clean
  subgroup-per-block mapping and a small shared memory footprint
  (4 × 64 × 4 B = 1024 B for the transpose scratch, far below C2's
  16 KiB). Phase 7 may sweep wider WGs (16 / 32 / 64 / 128 / 256) per
  `phase1.md §"Secondary measurements" M6` to find the optimum.

  Why 1 block/subgroup: a 16-lane subgroup naturally maps to "one
  cooperating team for one 8×8 block." Lanes 0..7 do the column pass
  (each owning a column of the block); lanes 8..15 do the row pass (each
  owning a row of dst output). Lanes are not reduced (C4 satisfied: we
  never call `subgroupAdd`).
- **Dispatch count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p luma
  plane. The Vulkan spec guarantees `maxComputeWorkGroupCount.x` ≥ 65 535,
  so there is plenty of headroom.

## 4. Per-thread algorithm

The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
exactly (column pass with transposed write, row pass with columnar
write). Pseudocode for the body of `main()`; the buffer, push-constant
and shared-memory declarations are in §5:

```glsl
// Per-thread state derived from the invocation IDs:
uint global_id       = gl_GlobalInvocationID.x;
uint block_idx       = global_id / 16u;
uint block_local_idx = gl_LocalInvocationID.x / 16u; // 0..3 within the WG
uint lane            = global_id % 16u;  // [0..15], subgroup-local
bool col_pass        = lane < 8u;        // first half does columns
bool row_pass        = lane >= 8u;       // second half does rows
uint k               = lane & 7u;        // column or row index 0..7

// Out-of-bounds flag: do NOT early-return here. Vulkan requires
// barrier() to be reached by all WG invocations; a divergent early
// return ahead of the barrier is undefined behaviour (and would bite
// us on any frame whose block count isn't a multiple of 4).
// Per phase5.md finding 7.
bool oob = (block_idx >= pc.n_blocks);

// `in` and `out` are reserved words in GLSL, hence v / res.
int v[8];
int res[8];

if (col_pass && !oob) {
    // Read column k of block_idx, 8 i16 values.
    for (uint r = 0u; r < 8u; r++)
        v[r] = int(coeffs[block_idx * 64u + r * 8u + k]);
    idct8_1d(v, res);   // 14 mults + 12 adds + 4 shifts
    // Transposed write: row k of tmp_shared[block_local_idx]
    for (uint r = 0u; r < 8u; r++)
        tmp_shared[block_local_idx][k * 8u + r] = res[r];
}

barrier();              // ALL lanes reach this, oob or not

if (row_pass && !oob) {
    // Read column k of tmp_shared[block_local_idx], 8 i32 values.
    for (uint r = 0u; r < 8u; r++)
        v[r] = tmp_shared[block_local_idx][r * 8u + k];
    idct8_1d(v, res);   // same kernel
    // Columnar write into dst: column k of the destination 8x8.
    // Block position in dst comes from meta: (block_y, block_x), in
    // units of 8 pixels.
    // dst is uint8_t[] (NOT packed i32): each lane writes a unique
    // byte address → no sub-word race. Per phase5.md finding 5.
    uvec2 pos = meta[block_idx];   // .x = block_y, .y = block_x
    for (uint r = 0u; r < 8u; r++) {
        uint dx  = pos.y * 8u + k;
        uint dy  = pos.x * 8u + r;
        uint idx = dy * pc.dst_stride_u8 + dx;
        int  sum = int(dst[idx]) + ((res[r] + 16) >> 5);
        dst[idx] = uint8_t(clamp(sum, 0, 255));
    }
}
```

Notes:

- `idct8_1d` is the same 1D 8-point IDCT used in
  `tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged (constants
  11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102, Q14 round-shift
  `(x + (1<<13)) >> 14`).
- All arithmetic is `int` (32-bit) per C1. Multiplies are
  i16 × i16 → i32, well within SMUL24 (C5).
- The barrier is a *workgroup* barrier (`barrier()` + memory barrier),
  not a subgroup operation, so C4 is satisfied.
- DC-only fast path (`eob == 1`) is *not* implemented in v1. Per Phase 3
  M1 stats, DC-only frequency is 0.11 %; the cost of unconditional
  general-path execution is ~1× per-block. Phase 7 may add the fast path
  if M2 measurement is close to the decision threshold.
- The 8 idle lanes per subgroup per phase (lanes 8..15 idle during the
  column pass; lanes 0..7 idle during the row pass) waste half of
  subgroup-cycle bandwidth.
  The real alternative geometry (corrected per phase5.md finding 3) is
  **2 blocks per subgroup**: lanes 0..7 handle block A's column pass
  while lanes 8..15 handle block B's column pass simultaneously; one
  barrier; lanes 0..7 handle A's row pass, lanes 8..15 handle B's row
  pass. Doubles useful work per subgroup, halves dispatch count to
  ~4080 WGs, shared mem grows to 2 KiB (still only 12.5 % of C2). The v1
  half-idle design is kept deliberately for simplicity-first; the
  2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.

## 5. Memory layout / SSBO design

Three SSBO bindings, well under C3's limit of 8:

| binding | name | type | size | usage |
|---|---|---|---|---|
| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |

Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
writing to the *same* 32-bit word every block (row-pass lanes with
k = 0..3 → bytes 0..3 of one word; k = 4..7 → the next word). Vulkan does
not provide atomic sub-word writes, and concurrent non-atomic writes to
overlapping addresses are undefined behaviour. v3dv exposes
`storageBuffer8BitAccess = true` (verified in
`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:

```glsl
#extension GL_EXT_shader_8bit_storage : require
layout(binding = 1) buffer Dst { uint8_t dst[]; };
```

Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]`, a
unique byte address per lane, so there is no race.
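
As a host-side sanity check of the addressing argument above, here is a minimal C sketch that enumerates the offsets the eight row-pass lanes touch for one row of one block. It counts how many distinct bytes, and how many distinct 32-bit words, those offsets fall into. The helper names (`dst_byte_offset`, `distinct_units`) and the stride used in the usage note are illustrative assumptions, not part of the planned runner API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset written by the row-pass lane with index k (0..7) for row r
 * (0..7) of the block at (block_y, block_x); same formula as in §5. */
static size_t dst_byte_offset(uint32_t block_y, uint32_t block_x,
                              uint32_t r, uint32_t k, uint32_t stride)
{
    assert(r < 8 && k < 8);
    return (size_t)(block_y * 8u + r) * stride + block_x * 8u + k;
}

/* Number of distinct storage units of `unit_bytes` bytes touched by the
 * 8 lanes writing row r: unit_bytes = 1 models uint8_t[] storage,
 * unit_bytes = 4 models the rejected packed-i32 layout. */
static int distinct_units(uint32_t block_y, uint32_t block_x, uint32_t r,
                          uint32_t stride, size_t unit_bytes)
{
    size_t seen[8];
    int n = 0;
    for (uint32_t k = 0; k < 8; k++) {
        size_t unit =
            dst_byte_offset(block_y, block_x, r, k, stride) / unit_bytes;
        int dup = 0;
        for (int i = 0; i < n; i++)
            if (seen[i] == unit)
                dup = 1;
        if (!dup)
            seen[n++] = unit;
    }
    return n;
}
```

With an assumed stride of 1920 bytes, `distinct_units(3, 5, 2, 1920, 1)` yields 8 (every lane owns its own byte), while `distinct_units(3, 5, 2, 1920, 4)` yields 2 (four lanes would share each word in the packed layout).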

Push constants (the Vulkan spec guarantees `maxPushConstantsSize` ≥ 128 B;
we use 16):

```glsl
layout(push_constant) uniform PC {
    uint n_blocks;        // total 8×8 blocks in this dispatch
    uint blocks_per_row;  // (dst_width / 8), for block_idx → (by, bx)
    uint dst_stride_u8;   // dst row stride in bytes (8-bit storage)
    uint _pad;
} pc;
```

The 7 Q14 trig constants are baked into the shader source as literal
`const int`; they don't need to be in a buffer.

Shared memory layout (4 blocks per WG, i32 per element):

```glsl
shared int tmp_shared[4][64];  // 4 × 64 × 4 B = 1024 B
```

## 6. Predicted M2 (the expected outcome per Phase 1)

Phase 1 §"Decision rules" requires a predicted measurement. We state two
predictions: a compute-envelope upper bound and a bandwidth-ceiling
upper bound. M2 is bounded by the *minimum*.

### Compute-envelope estimate

- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
- 2 passes per block (column + row) = 60 SIMD cycles per block
  *(idealised: every SIMD cycle issued on the full 16-lane width)*
- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
- V3D 7.1 peaks at ~92 GFLOPS theoretical FP32 across all 12 QPUs at
  960 MHz. We use int arithmetic and assume SMUL24 throughput is
  comparable, so 92 GOPS (whole-SIMD-width ops) serves as the optimistic
  envelope.
- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only, idealised;
  see correction below)
- **Correction per phase5.md finding 8:** the v1 half-idle geometry runs
  only 8 of the 16 subgroup lanes per phase, so the effective SIMD
  utilisation is 0.5× and compute frame time is closer to
  **~44 µs / frame**. This doesn't change the conclusion because the
  bandwidth ceiling (below) is the governing constraint, but the 2× note
  is essential for any future compute-bound optimisation work.
- Plus C7 = 33 µs/dispatch overhead, idealised total (using the
  uncorrected 22 µs compute figure) ≈ 22 + 33 = **55 µs / frame** ≈
  **18 000 FPS** ≈ **587 Mblock/s**
- Realistic 23 % utilisation (per `py-videocore7` SGEMM ceiling):
  587 × 0.23 ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**

### Bandwidth-ceiling estimate

- Per frame: 8.0 MB total traffic (§2)
- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** =
  **16.3 Mblock/s**
- R = 16.3 / 8.171 ≈ **2.0**

### Combined prediction

M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.

Honest uncertainty: this is a *prediction*, not a measurement. Phase 7
may show the GPU sees less than 4 GB/s under this access pattern (the
dst read-modify-write is not coalesced and may push us below 1 GB/s,
dragging R below 0.5). Or per-WG launch plus the 8 idle lanes per phase
may halve the compute envelope. The honest *lower bound* is R ≈ 1.0
(memory-bound, 8 Mblock/s, i.e. R = 8 / 8.171 ≈ 0.98), which still
passes the Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid
concurrent work viable").

### What would invalidate the prediction

- Vulkan compute launches whose preamble costs more than M5b estimates
  (e.g., per-dispatch descriptor binding overhead the noop bench doesn't
  measure). Phase 7 records actual.
- v3dv's compiler producing significantly worse code than `idct8_1d`'s
  expected ~30 ops (e.g., unable to fuse multiply-adds, or spilling
  registers). Phase 7 records `glsl-spv` and `spv-disas` output for
  inspection.
- Bandwidth contention with concurrent CPU activity (which is guaranteed
  on hertz: LXD spine). Phase 7 must measure with representative CPU
  load.

## 7. What will be touched / not touched

**Touched** (created or modified in Phase 6):

- `src/v3d_idct8.comp`: the GLSL compute shader.
- `src/v3d_runner.c` + `src/v3d_runner.h`: Vulkan boilerplate for
  instance/device/queue/buffer/pipeline setup, common to all kernels.
- `tests/bench_v3d_idct.c`: the M2 throughput bench and the M1'
  bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
- `CMakeLists.txt`: add the new GLSL shader to the `daedalus_shaders`
  custom command + the new bench target.

**NOT touched**:

- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
  `tests/bench_vulkan_dispatch.c`: Phase 3 baselines stay immutable so
  re-running them on the same hertz gives the same numbers.
- `external/ffmpeg-snapshot/`: vendored reference, byte-frozen.
- Any kernel-side / DRM / firmware code. Path B specifically scopes us
  to Vulkan compute on stock v3dv.

## 8. Phase 5 review handoff

Per `dev_process.md`:

> Goal, situation, measurements, plan get pasted into DokuWiki.
> Markus reviews and redacts, then initiates the handover to a
> fresh model instance. **Claude does not curate the artifact going
> to the reviewer**; that would re-introduce the blind-spot
> accumulation the review is meant to escape. Do not summarize
> when handing over; paste the actual artifacts.

Files the second-model reviewer needs verbatim:

- `README.md`
- `docs/phase0.md`
- `docs/phase1.md`
- `docs/phase2.md`
- `docs/phase3.md`
- `docs/phase4.md` (this file)
- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
- `tests/vp9_idct8_ref.c`
- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
- `external/ffmpeg-snapshot/PROVENANCE.md`

Specific review prompts (to focus the second model on the highest-risk
decisions):

1. **Orientation correctness.** The plan asserts the QPU kernel must
   reproduce FFmpeg's transposed column-pass orientation. Is the §4
   pseudocode correct against `tests/vp9_idct8_ref.c`?
2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the right
   starting point, or should v1 already explore wider WGs?
3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup lanes idle
   for each pass. Acceptable for v1, or rework the algorithm so all 16
   lanes work per phase (e.g., 2 blocks per subgroup)?
4. **Bandwidth prediction.** Is the §6 bandwidth estimate defensible, or
   is the dst read-modify-write going to collapse to ~1 GB/s through GPU
   L2 misses?
5. **Anything missing.** What load-bearing assumption is unstated?

## 9. Phase 6 execution order (if Phase 5 approves)

1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
   instance/device/queue/CB plumbing into a reusable API.
2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
   accepts it, spirv-val passes).
3. `bench_v3d_idct.c`: harness that runs the shader on the same
   randomly-generated blocks that `bench_neon_idct` uses, reads back,
   compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
4. First-light run on hertz. Expect non-bit-exact output; transpose
   orientation or barrier placement bugs are likely. Iterate to
   M1' = 100 % bit-exact.
5. Once bit-exact: measure M2 (throughput). Compare to predicted
   R ≈ 2.0; record actual.
6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel. If R < 0.5 →
   Phase 4 revision (loopback) or honest close.

## 10. Open questions Phase 4 does not close

- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed at 16
  on V3D 7.1. The shader hard-codes `gl_SubgroupSize == 16` via the lane
  mask `lane = global_id % 16`. If a future v3dv exposes a different
  subgroup size, the algorithm needs revisiting. Acceptable risk for
  Phase 1; document and move on.
- **Memory coherency between CPU and GPU.** The Phase 0 `is_host=true`
  zero-copy story is the *intent*; whether v3dv's default memory
  allocation actually gives us coherent shared buffers without explicit
  cache flushes is *unverified*. Phase 6 must check. If writes don't
  appear on the other side, add `vkFlushMappedMemoryRanges` /
  `vkInvalidateMappedMemoryRanges` calls (each takes a
  `VkMappedMemoryRange`).
- **Block-position metadata source.** §5 specifies a `meta` SSBO with
  (by, bx). Alternatively, derive position from block_idx arithmetic
  using the `blocks_per_row` push constant, which saves a buffer at the
  cost of a divide-modulo per thread. Phase 4 picks the buffered version;
  Phase 7 may revisit.
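
For reference, the block_idx arithmetic alternative from the last open question can be sketched on the host side in C. It assumes blocks are numbered in raster order across the plane, which this plan does not state explicitly; the `block_pos` type and both function names are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: block position in units of 8 pixels. */
typedef struct {
    uint32_t by; /* block row    */
    uint32_t bx; /* block column */
} block_pos;

/* Derive (by, bx) from a linear block index, ASSUMING raster-order
 * numbering. This is the divide-modulo each thread would pay instead
 * of reading meta[block_idx]. */
static block_pos block_pos_from_index(uint32_t block_idx,
                                      uint32_t blocks_per_row)
{
    assert(blocks_per_row != 0);
    block_pos p = { block_idx / blocks_per_row, block_idx % blocks_per_row };
    return p;
}

/* Size of the meta SSBO that the arithmetic variant would make
 * unnecessary: one uvec2 (8 bytes) per block, as in the §5 table. */
static uint64_t meta_bytes(uint32_t n_blocks)
{
    return (uint64_t)n_blocks * 8u;
}
```

For the 1080p luma plane (`blocks_per_row` = 240, 32 640 blocks), the last block index 32 639 maps to (by, bx) = (135, 239), and the `meta` buffer that would be dropped is 32 640 × 8 B = 261 120 bytes.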