Files
marfrit 71db72928f Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)
Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:47:03 +00:00

401 lines
17 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: 4
status: open (awaiting Phase 5 review)
date_opened: 2026-05-18
parent: phase3.md
target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
expected_artifacts: src/v3d_idct8_kernel.comp, src/v3d_kernel_runner.{c,h},
tests/bench_v3d_idct.c, CMakeLists.txt updates
---
# Phase 4 — Plan
Per `dev_process.md`:
> Formulate the approach. Identify what will and will not be touched.
> State expected outcome of implementation in the *same* measurable
> terms used in Phase 1/3.
And the Phase 6 contract-before-code rule, applied early here so
Phase 5 review can verify the contracts before any code is written:
> Read the API contract — kernel docs, header comments, and upstream
> source for every call touched. State the contract explicitly
> before implementing against it.
## 1. Constraint recap
All numbered constraints are carried verbatim from `phase0.md §6`
and `phase3.md`. The plan must satisfy every one.
| ID | Constraint | Source |
|---|---|---|
| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6 |
| C5 | per-block fixed-point multiplies fit SMUL24 (≤17-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
| C6 | GPU LPDDR4x share ≈ 47 GB/s (CPU sees 1215) | py-videocore7 scopy benchmark |
| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |
Spec contracts the kernel must implement against:
- **VP9 specification §8.7 — inverse transform process** (Google,
2016). The canonical text. Bit-exact verification is the
ultimate gate, not the spec's words.
- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
`libavcodec/vp9dsp_template.c` and
`ff_vp9_idct_idct_8x8_add_neon` from
`libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
`external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The
C reference and NEON implementation differ in output
*orientation*: FFmpeg's column pass writes transposed (column
`i` of block IDCT'd into row `i` of `tmp`), and the row pass
reads columns of `tmp` and writes columns of `dst` via the
`dst++` pattern. **Our QPU kernel must reproduce this
orientation** or M1 fails. See
`tests/vp9_idct8_ref.c` for the corrected reference (which
itself was caught failing during Phase 3 and fixed to match
the FFmpeg orientation).
## 2. Workload model
For a 1920×1088-luma 1080p frame:
| Quantity | Value |
|---|---|
| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
| Coeff bytes per frame (i16 × 64 / block) | 4 178 880 ≈ 3.99 MiB |
| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MB / frame** |
Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT
kernel is the same arithmetic; chroma plumbing is a separate
follow-on once luma works).
## 3. Workgroup geometry decision
Choices and why:
- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON
on overhead alone; frame-level (32 640 blocks) is the only
granularity that amortises C7 to negligible (1.0 ns/block of
overhead vs NEON's 122 ns/block).
- **Workgroup size**: **64 invocations** (= 4 subgroups × 16
lanes), `local_size_x = 64`. One workgroup processes **4
blocks**, one block per subgroup.
Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256):
Phase 1 deliberately keeps the first kernel simple. 4 blocks
gives us a clean subgroup-per-block mapping and small shared
memory footprint (4 × 64 × 4 B = 1024 B for the transpose
scratch, far below C2's 16 KiB). Phase 7 may sweep wider WGs
(16 / 32 / 64 / 128 / 256) per `phase1.md §"Secondary
measurements" M6` to find the optimum.
Why 1 block/subgroup: a 16-lane subgroup naturally maps to
"one cooperating team for one 8×8 block." Lanes 0..7 do the
column pass (each owning a column of the block); lanes 8..15
do the row pass (each owning a row of dst output). Lanes are
not reduced (C4 satisfied — we never call `subgroupAdd`).
- **Dispatch count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p
luma plane. Per Vulkan spec
`maxComputeWorkGroupCount.x = 65 535` — plenty of headroom.
## 4. Per-thread algorithm
The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
exactly (column pass with transposed write, row pass with
columnar write). Pseudocode:
```glsl
// Per-thread state derived from gl_GlobalInvocationID.x:
// block_idx = global_id / 16
// lane = global_id % 16 // [0..15], subgroup-local
// col_pass = lane < 8 // first half does columns
// row_pass = lane >= 8 // second half does rows
// k = lane & 7 // column or row index 0..7
// Out-of-bounds flag — do NOT early-return here. Vulkan requires
// barrier() to be reached by all WG invocations; a divergent early
// return ahead of the barrier is undefined behaviour (and would bite
// us on any frame whose block count isn't a multiple of 4).
// Per phase5.md finding 7.
bool oob = (block_idx >= n_blocks);
if (col_pass && !oob) {
// Read column k of block_idx, 8 i16 values.
int in[8];
for (r = 0; r < 8; r++)
in[r] = block[block_idx * 64 + r * 8 + k];
int out[8];
idct8_1d(in, out); // 14 mults + 12 adds + 4 shifts
// Transposed write: row k of tmp_shared[block_local_idx]
for (r = 0; r < 8; r++)
tmp_shared[block_local_idx * 64 + k * 8 + r] = out[r];
}
barrier(); // ALL lanes reach this, oob or not
if (row_pass && !oob) {
// Read column k of tmp_shared[block_local_idx], 8 i32 values.
int in[8];
for (r = 0; r < 8; r++)
in[r] = tmp_shared[block_local_idx * 64 + r * 8 + k];
int out[8];
idct8_1d(in, out); // same kernel
// Columnar write into dst: column k of the destination 8x8.
// Block position in dst: (block_y, block_x) = (block_idx / Wb, block_idx % Wb) × 8.
// dst is uint8_t[] (NOT packed i32): each lane writes a unique
// byte address → no sub-word race. Per phase5.md finding 5.
for (r = 0; r < 8; r++) {
uint dx = block_x * 8 + k;
uint dy = block_y * 8 + r;
uint8_t px = dst[dy * stride + dx];
int sum = int(px) + ((out[r] + 16) >> 5);
dst[dy * stride + dx] = uint8_t(clamp(sum, 0, 255));
}
}
```
Notes:
- `idct8_1d` is the same 1D 8-point IDCT used in
`tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged
(constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102,
Q14 round-shift `(x + (1<<13)) >> 14`).
- All arithmetic is `int` (32-bit) per C1. Multiplies are
i16 × i16 → i32, well within SMUL24 (C5).
- The barrier is a *workgroup* barrier (`barrier()` + memory
barrier), not a subgroup operation, so C4 is satisfied.
- DC-only fast path (`eob == 1`) is *not* implemented in v1.
Per Phase 3 M1 stats, DC-only frequency is 0.11 %; the cost
of unconditional general-path execution is ~1× per-block. Phase
7 may add the fast path if M2 measurement is close to the
decision threshold.
- The 8 idle lanes per subgroup per phase (lanes 8..15 idle
during column pass; lanes 0..7 idle during row pass) waste
half of subgroup-cycle bandwidth. The real alternative geometry
(corrected per phase5.md finding 3) is **2 blocks per subgroup**:
lanes 0..7 handle block A's column pass while lanes 8..15
handle block B's column pass simultaneously; one barrier; lanes
0..7 handle A's row pass, lanes 8..15 handle B's row pass.
Doubles useful work per subgroup, halves dispatch count to
~4080 WGs, shared mem grows to 2 KiB (still 8 % of C2). The v1
half-idle design is kept deliberately for simplicity-first; the
2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
## 5. Memory layout / SSBO design
Three SSBO bindings, well under C3's limit of 8:
| binding | name | type | size | usage |
|---|---|---|---|---|
| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |
Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
writing to the *same* 32-bit word every block (lanes 0..3 → bytes
0..3 of one word; lanes 4..7 → next word). Vulkan does not provide
atomic sub-word writes — concurrent non-atomic writes to overlapping
addresses are undefined behaviour. v3dv exposes
`storageBuffer8BitAccess = true` (verified in
`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:
```glsl
#extension GL_EXT_shader_8bit_storage : require
layout(binding = 1) buffer Dst { uint8_t dst[]; };
```
Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]`
a unique byte address per lane, no race.
Push constants (C8: 128 B max, we use 16):
```
layout(push_constant) uniform PC {
uint n_blocks; // total 8×8 blocks in this dispatch
uint blocks_per_row; // (dst_width / 8), for block_idx → (by, bx)
uint dst_stride_u8; // dst row stride in bytes (8-bit storage)
uint _pad;
} pc;
```
The 7 Q14 trig constants are baked into the shader source as
literal `const int` — they don't need to be in a buffer.
Shared memory layout (4 blocks per WG, i32 per element):
```
shared int tmp_shared[4][64]; // 4 × 64 × 4 B = 1024 B
```
## 6. Predicted M2 (the expected outcome per Phase 1)
Phase 1 §"Decision rules" requires a predicted measurement. We
state two predictions: a compute-envelope upper bound and a
bandwidth-ceiling upper bound. M2 is bounded by the *minimum*.
### Compute-envelope estimate
- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
- 2 passes per block (column + row) = 60 SIMD cycles per block
*(idealised — every SIMD cycle issued on the full 16-lane width)*
- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
- At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at
960 MHz — we use int arithmetic, SMUL24 throughput is comparable.
Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only,
idealised — see correction below)
- **Correction per phase5.md finding 8:** the v1 half-idle
geometry runs only 8 of the 16 subgroup lanes per phase, so
the effective SIMD utilisation is 0.5× — compute frame time
is closer to **~44 µs / frame**. This doesn't change the
conclusion because the bandwidth ceiling (below) is the
governing constraint, but the 2× note is essential for any
future compute-bound optimisation work.
- Plus C7 = 33 µs/dispatch overhead, idealised total ≈ **55 µs / frame****18 000 FPS****587 Mblock/s**
- Realistic 23% utilization (per `py-videocore7` SGEMM ceiling):
→ ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**
### Bandwidth-ceiling estimate
- Per frame: 8.0 MB total traffic (§2)
- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** = **16.3 Mblock/s**
- R = 16.3 / 8.171 ≈ **2.0**
### Combined prediction
M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.
Honest uncertainty: this is a *prediction*, not a measurement.
Phase 7 may show the GPU sees less than 4 GB/s under this access
pattern (the dst read-modify-write is not coalesced and may push
us below 1 GB/s, dragging R below 0.5). Or per-WG launch + the 8
idle lanes per phase may halve the compute envelope. The honest
*lower bound* is R ≈ 1.0 (memory-bound, 8 Mblock/s) — still passes
Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid concurrent
work viable").
### What would invalidate the prediction
- Vulkan compute launches that pre-amble more cost than M5b
estimates (e.g., per-dispatch descriptor binding overhead the
noop bench doesn't measure). Phase 7 records actual.
- v3dv's compiler producing significantly worse code than `idct8_1d`'s
expected ~30 ops (e.g., unable to fuse multiply-adds, or
spilling registers). Phase 7 records `glsl-spv` and
`spv-disas` output for inspection.
- Bandwidth contention with concurrent CPU activity (which is
guaranteed on hertz — LXD spine). Phase 7 must measure with
representative CPU load.
## 7. What will be touched / not touched
**Touched** (created or modified in Phase 6):
- `src/v3d_idct8.comp` — the GLSL compute shader.
- `src/v3d_runner.c` + `src/v3d_runner.h` — Vulkan boilerplate
for instance/device/queue/buffer/pipeline setup, common to
all kernels.
- `tests/bench_v3d_idct.c` — the M2 throughput bench and the
M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
- `CMakeLists.txt` — add the new GLSL shader to the
`daedalus_shaders` custom command + the new bench target.
**NOT touched**:
- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
`tests/bench_vulkan_dispatch.c` — Phase 3 baselines stay
immutable so re-running them on the same hertz gives the
same numbers.
- `external/ffmpeg-snapshot/` — vendored reference, byte-frozen.
- Any kernel-side / DRM / firmware code. Path B specifically
scopes us to Vulkan compute on stock v3dv.
## 8. Phase 5 review handoff
Per `dev_process.md`:
> Goal, situation, measurements, plan get pasted into DokuWiki.
> Markus reviews and redacts, then initiates the handover to a
> fresh model instance. **Claude does not curate the artifact going
> to the reviewer** — that would re-introduce the blind-spot
> accumulation the review is meant to escape. Do not summarize
> when handing over; paste the actual artifacts.
Files the second-model reviewer needs verbatim:
- `README.md`
- `docs/phase0.md`
- `docs/phase1.md`
- `docs/phase2.md`
- `docs/phase3.md`
- `docs/phase4.md` (this file)
- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
- `tests/vp9_idct8_ref.c`
- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
- `external/ffmpeg-snapshot/PROVENANCE.md`
Specific review prompts (to focus the second model on the highest-
risk decisions):
1. **Orientation correctness.** The plan asserts the QPU kernel
must reproduce FFmpeg's transposed column-pass orientation.
Is the §4 pseudocode correct against `tests/vp9_idct8_ref.c`?
2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the
right starting point, or should v1 already explore wider WGs?
3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup
lanes idle for each pass. Acceptable for v1, or rework the
algorithm so all 16 lanes work per phase (e.g., 2 blocks per
subgroup)?
4. **Bandwidth prediction.** Is the §6 bandwidth estimate
defensible, or is the dst read-modify-write going to collapse
to ~1 GB/s through GPU L2 misses?
5. **Anything missing.** What load-bearing assumption is unstated?
## 9. Phase 6 execution order (if Phase 5 approves)
1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
instance/device/queue/CB plumbing into a reusable API.
2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
accepts it, spirv-val passes).
3. `bench_v3d_idct.c`: harness that runs the shader on the same
randomly-generated blocks that `bench_neon_idct` uses, reads
back, compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
4. First-light run on hertz. Expect non-bit-exact output —
transpose orientation or barrier placement bugs are likely.
Iterate to M1' = 100 % bit-exact.
5. Once bit-exact: measure M2 (throughput). Compare to predicted
R ≈ 2.0; record actual.
6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel.
If R < 0.5 → Phase 4 revision (loopback) or honest close.
## 10. Open questions Phase 4 does not close
- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed
at 16 on V3D 7.1. The shader hard-codes `gl_SubgroupSize == 16`
via the lane mask `lane = global_id % 16`. If a future v3dv
exposes a different subgroup size, the algorithm needs revisiting.
Acceptable risk for Phase 1; document and move on.
- **Memory coherency between CPU and GPU.** The Phase 0
`is_host=true` zero-copy story is the *intent*; whether v3dv's
default memory allocation actually gives us coherent shared
buffers without explicit cache flushes is *unverified*. Phase 6
must check — if writes don't appear on the other side, add
`VkMappedMemoryRange` flush/invalidate calls.
- **Block-position metadata source.** §5 specifies `meta` SSBO
with (by, bx). Alternatively, derive position from block_idx
arithmetic using `blocks_per_row` push constant — saves a
buffer at the cost of a divide-modulo per thread. Phase 4 picks
the buffered version; Phase 7 may revisit.