---
phase: 4
status: open (awaiting Phase 5 review)
date_opened: 2026-05-18
parent: phase3.md
target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
expected_artifacts: src/v3d_idct8.comp, src/v3d_runner.{c,h},
                    tests/bench_v3d_idct.c, CMakeLists.txt updates
---

# Phase 4 — Plan

Per `dev_process.md`:

> Formulate the approach. Identify what will and will not be touched.
> State expected outcome of implementation in the *same* measurable
> terms used in Phase 1/3.

And the Phase 6 contract-before-code rule, applied early here so
Phase 5 review can verify the contracts before any code is written:

> Read the API contract — kernel docs, header comments, and upstream
> source for every call touched. State the contract explicitly
> before implementing against it.

## 1. Constraint recap

All numbered constraints are carried verbatim from `phase0.md §6`
and `phase3.md`. The plan must satisfy every one.

| ID | Constraint | Source |
|---|---|---|
| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6 |
| C5 | per-block fixed-point multiplies fit SMUL24 (≤17-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
| C6 | GPU LPDDR4x share ≈ 4–7 GB/s (CPU sees 12–15) | py-videocore7 scopy benchmark |
| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |

Spec contracts the kernel must implement against:

- **VP9 specification §8.7 — inverse transform process** (Google,
  2016). The canonical text. Bit-exact verification is the
  ultimate gate, not the spec's words.
- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
  `libavcodec/vp9dsp_template.c` and
  `ff_vp9_idct_idct_8x8_add_neon` from
  `libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
  `external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The
  C reference and NEON implementation differ in output
  *orientation*: FFmpeg's column pass writes transposed (column
  `i` of block IDCT'd into row `i` of `tmp`), and the row pass
  reads columns of `tmp` and writes columns of `dst` via the
  `dst++` pattern. **Our QPU kernel must reproduce this
  orientation** or M1 fails. See
  `tests/vp9_idct8_ref.c` for the corrected reference (which
  itself was caught failing during Phase 3 and fixed to match
  the FFmpeg orientation).

## 2. Workload model

For a 1920×1088-luma 1080p frame:

| Quantity | Value |
|---|---|
| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
| Coeff bytes per frame (i16 × 64 / block) | 4 178 880 ≈ 3.99 MiB |
| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MiB / frame** (8 356 800 B; treated as 8.0 MB in §6) |

Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT
kernel is the same arithmetic; chroma plumbing is a separate
follow-on once luma works).

## 3. Workgroup geometry decision

Choices and why:

- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
  Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON
  on overhead alone; frame-level (32 640 blocks) is the only
  granularity that amortises C7 to negligible (1.0 ns/block of
  overhead vs NEON's 122 ns/block).

- **Workgroup size**: **64 invocations** (= 4 subgroups × 16
  lanes), `local_size_x = 64`. One workgroup processes **4
  blocks**, one block per subgroup.

  Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256):
  Phase 1 deliberately keeps the first kernel simple. 4 blocks
  gives us a clean subgroup-per-block mapping and small shared
  memory footprint (4 × 64 × 4 B = 1024 B for the transpose
  scratch, far below C2's 16 KiB). Phase 7 may sweep WG sizes
  (16 / 32 / 64 / 128 / 256) per `phase1.md §"Secondary
  measurements" M6` to find the optimum.

  Why 1 block/subgroup: a 16-lane subgroup naturally maps to
  "one cooperating team for one 8×8 block." Lanes 0..7 do the
  column pass (each owning a column of the block); lanes 8..15
  do the row pass (each owning a row of dst output). Lanes are
  not reduced (C4 satisfied — we never call `subgroupAdd`).

- **Workgroup count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p
  luma plane, all launched by the single dispatch. The Vulkan spec
  guarantees `maxComputeWorkGroupCount[0]` ≥ 65 535, so there is
  plenty of headroom.
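
The geometry arithmetic above can be checked with a small C sketch. Function names are hypothetical; the 32.95 µs figure is C7 from the constraint table:

```c
#include <stdint.h>

/* Ceiling division: workgroups needed for n_blocks at blocks_per_wg each. */
uint32_t idct8_wg_count(uint32_t n_blocks, uint32_t blocks_per_wg)
{
    return (n_blocks + blocks_per_wg - 1u) / blocks_per_wg;
}

/* Dispatch overhead amortised over every block in the dispatch, in ns. */
double idct8_overhead_ns_per_block(double dispatch_us, uint32_t n_blocks)
{
    return dispatch_us * 1000.0 / (double)n_blocks;
}
```

The workgroup count would be passed as `groupCountX` to `vkCmdDispatch(cmd, count, 1, 1)`. The ceiling division matters for planes whose block count is not a multiple of 4, which is also why the kernel needs the out-of-bounds flag in §4.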

## 4. Per-thread algorithm

The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
exactly (column pass with transposed write, row pass with
columnar write). Pseudocode:

```glsl
// Per-thread state (all uint), derived from the built-in invocation IDs:
//   global_id       = gl_GlobalInvocationID.x
//   block_idx       = global_id / 16
//   block_local_idx = gl_LocalInvocationID.x / 16  // [0..3], block within this WG
//   lane            = global_id % 16        // [0..15], subgroup-local
//   col_pass        = lane < 8              // first half does columns
//   row_pass        = lane >= 8             // second half does rows
//   k               = lane & 7              // column or row index 0..7
//   stride          = pc.dst_stride_u8      // dst row stride in bytes
// `in` and `out` are reserved words in GLSL, so the arrays are named src/res.

// Out-of-bounds flag: do NOT early-return here. Vulkan requires
// barrier() to be reached by all WG invocations; a divergent early
// return ahead of the barrier is undefined behaviour (and would bite
// us on any frame whose block count isn't a multiple of 4).
// Per phase5.md finding 7.
bool oob = (block_idx >= pc.n_blocks);

if (col_pass && !oob) {
    // Read column k of block_idx, 8 i16 values.
    int src[8];
    for (uint r = 0u; r < 8u; r++)
        src[r] = int(coeffs[block_idx * 64u + r * 8u + k]);
    int res[8];
    idct8_1d(src, res);                       // 14 mults + 12 adds + 4 shifts
    // Transposed write: row k of this block's slice of tmp_shared.
    for (uint r = 0u; r < 8u; r++)
        tmp_shared[block_local_idx * 64u + k * 8u + r] = res[r];
}

barrier();   // ALL lanes reach this, oob or not

if (row_pass && !oob) {
    // Read column k of this block's slice of tmp_shared, 8 i32 values.
    int src[8];
    for (uint r = 0u; r < 8u; r++)
        src[r] = tmp_shared[block_local_idx * 64u + r * 8u + k];
    int res[8];
    idct8_1d(src, res);                       // same kernel
    // Columnar write into dst: column k of the destination 8x8.
    // Block position comes from meta[block_idx] = (block_y, block_x), see §5.
    // For raster-ordered blocks this equals
    // (block_idx / blocks_per_row, block_idx % blocks_per_row).
    uint block_y = meta[block_idx].x;
    uint block_x = meta[block_idx].y;
    // dst is uint8_t[] (NOT packed i32): each lane writes a unique
    // byte address, so there is no sub-word race. Per phase5.md finding 5.
    for (uint r = 0u; r < 8u; r++) {
        uint idx = (block_y * 8u + r) * stride + block_x * 8u + k;
        int sum = int(uint(dst[idx])) + ((res[r] + 16) >> 5);
        dst[idx] = uint8_t(uint(clamp(sum, 0, 255)));
    }
}
```
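
To make the orientation contract easier to check against `tests/vp9_idct8_ref.c`, here is a minimal single-threaded C model of the same two passes. It is a sketch under stated assumptions: the function names are hypothetical, the 1-D transform is passed in as a function pointer rather than reproduced here, and `tmp` is kept as 32-bit to mirror the shared-memory plan.

```c
#include <stdint.h>

/* Signature of a 1-D 8-point transform; a stand-in for idct8_1d. */
typedef void (*xform8_fn)(const int32_t in[8], int32_t out[8]);

/* Clamp to the 8-bit pixel range. */
uint8_t clip_u8(int32_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* CPU model of the column pass and row pass for one 8x8 block. */
void idct8x8_add_model(const int16_t block[64], uint8_t *dst,
                       uint32_t stride, xform8_fn xform)
{
    int32_t tmp[64], in[8], out[8];

    /* Column pass: read column k of block, write row k of tmp. */
    for (int k = 0; k < 8; k++) {
        for (int r = 0; r < 8; r++) in[r] = block[r * 8 + k];
        xform(in, out);
        for (int r = 0; r < 8; r++) tmp[k * 8 + r] = out[r];
    }
    /* Row pass: read column k of tmp, add into column k of dst. */
    for (int k = 0; k < 8; k++) {
        for (int r = 0; r < 8; r++) in[r] = tmp[r * 8 + k];
        xform(in, out);
        for (int r = 0; r < 8; r++) {
            /* Assumes >> on a negative int is an arithmetic shift. */
            int32_t sum = dst[r * stride + k] + ((out[r] + 16) >> 5);
            dst[r * stride + k] = clip_u8(sum);
        }
    }
}

/* Identity stand-in, used only to make the data movement visible. */
void xform8_identity(const int32_t in[8], int32_t out[8])
{
    for (int i = 0; i < 8; i++) out[i] = in[i];
}
```

With the identity stand-in, a coefficient placed at row 0, column 3 of `block` ends up contributing to row 3, column 0 of `dst`: the two passes together transpose the data movement. Swapping in the real `idct8_1d` gives a CPU-side cross-check for the shader's indexing.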

Notes:
- `idct8_1d` is the same 1D 8-point IDCT used in
  `tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged
  (constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102,
  Q14 round-shift `(x + (1<<13)) >> 14`).
- All arithmetic is `int` (32-bit) per C1. Multiplies are
  i16 × i16 → i32, well within SMUL24 (C5).
- The barrier is a *workgroup* barrier (`barrier()` + memory
  barrier), not a subgroup operation, so C4 is satisfied.
- DC-only fast path (`eob == 1`) is *not* implemented in v1.
  Per Phase 3 M1 stats, DC-only frequency is 0.11 %, so running
  the general path unconditionally leaves the average per-block
  cost at essentially 1× the general-path cost. Phase 7 may add
  the fast path if the M2 measurement is close to the decision
  threshold.
- The 8 idle lanes per subgroup per phase (lanes 8..15 idle
  during column pass; lanes 0..7 idle during row pass) waste
  half of subgroup-cycle bandwidth. The real alternative geometry
  (corrected per phase5.md finding 3) is **2 blocks per subgroup**:
  lanes 0..7 handle block A's column pass while lanes 8..15
  handle block B's column pass simultaneously; one barrier; lanes
  0..7 handle A's row pass, lanes 8..15 handle B's row pass.
  Doubles useful work per subgroup, halves the workgroup count to
  4 080, shared mem grows to 2 KiB (still 12.5 % of C2). The v1
  half-idle design is kept deliberately for simplicity-first; the
  2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
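
The Q14 round-shift quoted in the first note can be written as a pair of small C helpers. This is a sketch with hypothetical names; it assumes `>>` on a negative `int32_t` is an arithmetic shift, which holds on the mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Q14 round-to-nearest shift: (x + (1 << 13)) >> 14. */
int32_t round_q14(int32_t x)
{
    return (x + (1 << 13)) >> 14;
}

/* Multiply a sample by a Q14 constant and round back to integer scale. */
int32_t mul_q14(int32_t a, int32_t c)
{
    /* With a <= 17 bits and c <= 14 bits the product fits in int32. */
    return round_q14(a * c);
}
```

For example, `mul_q14(100, 11585)` returns 71, matching 100 × cos(π/4) ≈ 70.7 rounded to nearest.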

## 5. Memory layout / SSBO design

Three SSBO bindings, well under C3's limit of 8:

| binding | name | type | size | usage |
|---|---|---|---|---|
| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |

Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
writing to the *same* 32-bit word every block (lanes 0..3 → bytes
0..3 of one word; lanes 4..7 → next word). Vulkan does not provide
atomic sub-word writes — concurrent non-atomic writes to overlapping
addresses are undefined behaviour. v3dv exposes
`storageBuffer8BitAccess = true` (verified in
`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:

```glsl
#extension GL_EXT_shader_8bit_storage : require
layout(binding = 1) buffer Dst { uint8_t dst[]; };
```

Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]` —
a unique byte address per lane, no race.

Push constants (Vulkan guarantees `maxPushConstantsSize` ≥ 128 B; we use 16):

```glsl
layout(push_constant) uniform PC {
    uint n_blocks;        // total 8×8 blocks in this dispatch
    uint blocks_per_row;  // (dst_width / 8), for block_idx → (by, bx)
    uint dst_stride_u8;   // dst row stride in bytes (8-bit storage)
    uint _pad;
} pc;
```
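
On the host side the same block needs a matching C struct. A minimal sketch, assuming a hypothetical type name; the compile-time check guards against the two layouts drifting apart:

```c
#include <stdint.h>

/* Host-side mirror of the shader's push-constant block (hypothetical name). */
typedef struct {
    uint32_t n_blocks;        /* total 8x8 blocks in this dispatch */
    uint32_t blocks_per_row;  /* dst_width / 8                     */
    uint32_t dst_stride_u8;   /* dst row stride in bytes           */
    uint32_t pad;             /* keeps the block at 16 bytes       */
} idct8_push_constants;

/* The shader block is 4 x uint = 16 bytes; fail the build if this drifts. */
_Static_assert(sizeof(idct8_push_constants) == 16,
               "push-constant block must stay 16 bytes");
```

An instance of this struct would be handed to `vkCmdPushConstants` with `VK_SHADER_STAGE_COMPUTE_BIT`, offset 0 and `sizeof(idct8_push_constants)` before the dispatch is recorded.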

The 7 Q14 trig constants are baked into the shader source as
literal `const int` — they don't need to be in a buffer.

Shared memory layout (4 blocks per WG, i32 per element):

```glsl
shared int tmp_shared[4 * 64];  // 4 blocks × 64 × 4 B = 1024 B; block b starts at b * 64
```

## 6. Predicted M2 (the expected outcome per Phase 1)

Phase 1 §"Decision rules" requires a predicted measurement. We
state two predictions: a compute-envelope upper bound and a
bandwidth-ceiling upper bound. M2 is bounded by the *minimum*.

### Compute-envelope estimate

- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
- 2 passes per block (column + row) = 60 SIMD cycles per block
  *(idealised — every SIMD cycle issued on the full 16-lane width)*
- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
- At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at
  960 MHz — we use int arithmetic, SMUL24 throughput is comparable.
  Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only,
  idealised — see correction below)
- **Correction per phase5.md finding 8:** the v1 half-idle
  geometry runs only 8 of the 16 subgroup lanes per phase, so
  the effective SIMD utilisation is 0.5× — compute frame time
  is closer to **~44 µs / frame**. This doesn't change the
  conclusion because the bandwidth ceiling (below) is the
  governing constraint, but the 2× note is essential for any
  future compute-bound optimisation work.
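- Plus C7 = 33 µs/dispatch overhead, the idealised total is 22 + 33 ≈ **55 µs / frame** ≈ **18 000 FPS** ≈ **587 Mblock/s** (with the half-idle correction: 44 + 33 ≈ 77 µs / frame ≈ 424 Mblock/s)
- Realistic 23 % utilisation (per `py-videocore7` SGEMM ceiling), applied to the idealised figure:
  → ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**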

### Bandwidth-ceiling estimate

- Per frame: 8.0 MB total traffic (§2)
- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** = **16.3 Mblock/s**
- R = 16.3 / 8.171 ≈ **2.0**

### Combined prediction

M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.
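
The bandwidth-ceiling and combined figures reduce to a few lines of arithmetic. A C sketch with hypothetical function names, using the inputs stated above (8.0 MB per frame, 4 GB/s, 32 640 blocks, NEON baseline 8.171 Mblock/s):

```c
/* Millions of blocks per second when limited purely by memory traffic. */
double mblocks_bw_bound(double bytes_per_frame, double bytes_per_s,
                        double blocks_per_frame)
{
    double frame_s = bytes_per_frame / bytes_per_s;
    return blocks_per_frame / frame_s / 1e6;
}

/* M2 is bounded by the smaller of the two envelopes. */
double predicted_m2(double compute_mblocks, double bw_mblocks)
{
    return compute_mblocks < bw_mblocks ? compute_mblocks : bw_mblocks;
}
```

Calling `mblocks_bw_bound(8.0e6, 4.0e9, 32640.0)` gives about 16.3, and `predicted_m2(135.0, 16.3) / 8.171` reproduces R ≈ 2.0. Re-running with a lower bandwidth figure shows how quickly R falls if the access pattern underperforms.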

Honest uncertainty: this is a *prediction*, not a measurement.
Phase 7 may show the GPU sees less than 4 GB/s under this access
pattern: the dst read-modify-write is not coalesced and may push
us below 1 GB/s, dragging R below 0.5. Per-WG launch cost plus the
8 idle lanes per phase may also halve the compute envelope. The
conservative planning figure is R ≈ 1.0 (memory-bound at ~2 GB/s,
8 Mblock/s, R = 8 / 8.171 ≈ 0.98), which sits at the top of the
Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid concurrent
work viable"). Only the sub-1 GB/s case falls below that band.

### What would invalidate the prediction

- Vulkan compute launches whose preamble costs more than M5b
  estimates (e.g., per-dispatch descriptor binding overhead the
  noop bench doesn't measure). Phase 7 records actual.
- v3dv's compiler producing significantly worse code than `idct8_1d`'s
  expected ~30 ops (e.g., unable to fuse multiply-adds, or
  spilling registers). Phase 7 records the glslang SPIR-V output
  and its `spirv-dis` disassembly for inspection.
- Bandwidth contention with concurrent CPU activity (which is
  guaranteed on hertz — LXD spine). Phase 7 must measure with
  representative CPU load.

## 7. What will be touched / not touched

**Touched** (created or modified in Phase 6):

- `src/v3d_idct8.comp` — the GLSL compute shader.
- `src/v3d_runner.c` + `src/v3d_runner.h` — Vulkan boilerplate
  for instance/device/queue/buffer/pipeline setup, common to
  all kernels.
- `tests/bench_v3d_idct.c` — the M2 throughput bench and the
  M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
- `CMakeLists.txt` — add the new GLSL shader to the
  `daedalus_shaders` custom command + the new bench target.

**NOT touched**:

- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
  `tests/bench_vulkan_dispatch.c` — Phase 3 baselines stay
  immutable so re-running them on the same hertz gives the
  same numbers.
- `external/ffmpeg-snapshot/` — vendored reference, byte-frozen.
- Any kernel-side / DRM / firmware code. Path B specifically
  scopes us to Vulkan compute on stock v3dv.

## 8. Phase 5 review handoff

Per `dev_process.md`:

> Goal, situation, measurements, plan get pasted into DokuWiki.
> Markus reviews and redacts, then initiates the handover to a
> fresh model instance. **Claude does not curate the artifact going
> to the reviewer** — that would re-introduce the blind-spot
> accumulation the review is meant to escape. Do not summarize
> when handing over; paste the actual artifacts.

Files the second-model reviewer needs verbatim:

- `README.md`
- `docs/phase0.md`
- `docs/phase1.md`
- `docs/phase2.md`
- `docs/phase3.md`
- `docs/phase4.md` (this file)
- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
- `tests/vp9_idct8_ref.c`
- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
- `external/ffmpeg-snapshot/PROVENANCE.md`

Specific review prompts (to focus the second model on the highest-
risk decisions):

1. **Orientation correctness.** The plan asserts the QPU kernel
   must reproduce FFmpeg's transposed column-pass orientation.
   Is the §4 pseudocode correct against `tests/vp9_idct8_ref.c`?
2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the
   right starting point, or should v1 already explore wider WGs?
3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup
   lanes idle for each pass. Acceptable for v1, or rework the
   algorithm so all 16 lanes work per phase (e.g., 2 blocks per
   subgroup)?
4. **Bandwidth prediction.** Is the §6 bandwidth estimate
   defensible, or is the dst read-modify-write going to collapse
   to ~1 GB/s through GPU L2 misses?
5. **Anything missing.** What load-bearing assumption is unstated?

## 9. Phase 6 execution order (if Phase 5 approves)

1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
   instance/device/queue/CB plumbing into a reusable API.
2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
   accepts it, spirv-val passes).
3. `bench_v3d_idct.c`: harness that runs the shader on the same
   randomly-generated blocks that `bench_neon_idct` uses, reads
   back, compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
4. First-light run on hertz. Expect non-bit-exact output —
   transpose orientation or barrier placement bugs are likely.
   Iterate to M1' = 100 % bit-exact.
5. Once bit-exact: measure M2 (throughput). Compare to predicted
   R ≈ 2.0; record actual.
6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel.
   If R < 0.5 → Phase 4 revision (loopback) or honest close.
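
For step 4, the bit-exact gate needs little more than a byte-wise comparison that reports where the first difference occurs. A minimal C sketch; the function name is hypothetical and not taken from the existing benches:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first differing byte, or n if the planes match. */
size_t first_mismatch(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return i;
    return n;
}
```

Dividing the returned index by the plane stride gives the pixel row and the remainder gives the column, which helps tell an orientation bug apart from a rounding bug during first light.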

## 10. Open questions Phase 4 does not close

- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed
  at 16 on V3D 7.1. The shader hard-codes a subgroup size of 16
  through the lane computation `lane = global_id % 16`. If a future
  v3dv exposes a different subgroup size, the algorithm needs
  revisiting. Acceptable risk for Phase 1; document and move on.
- **Memory coherency between CPU and GPU.** The Phase 0
  `is_host=true` zero-copy story is the *intent*; whether v3dv's
  default memory allocation actually gives us coherent shared
  buffers without explicit cache flushes is *unverified*. Phase 6
  must check whether the chosen memory type reports
  `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT`. If it does not, or if
  writes don't appear on the other side, add
  `vkFlushMappedMemoryRanges` / `vkInvalidateMappedMemoryRanges`
  calls over the affected `VkMappedMemoryRange`s.
- **Block-position metadata source.** §5 specifies `meta` SSBO
  with (by, bx). Alternatively, derive position from block_idx
  arithmetic using `blocks_per_row` push constant — saves a
  buffer at the cost of a divide-modulo per thread. Phase 4 picks
  the buffered version; Phase 7 may revisit.
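
For the buffered option, the host fills `meta` once per frame layout. A minimal C sketch, assuming blocks are stored in raster order; the type and function names are hypothetical:

```c
#include <stdint.h>

/* One meta entry, matching the uvec2 (block_y, block_x) layout in §5. */
typedef struct {
    uint32_t block_y;
    uint32_t block_x;
} idct8_meta;

/* Fill meta[] for n_blocks blocks laid out in raster order. */
void idct8_fill_meta(idct8_meta *meta, uint32_t n_blocks,
                     uint32_t blocks_per_row)
{
    for (uint32_t i = 0; i < n_blocks; i++) {
        meta[i].block_y = i / blocks_per_row;
        meta[i].block_x = i % blocks_per_row;
    }
}
```

The same divide and modulo are what the shader would compute per thread if the `meta` buffer were dropped in favour of the `blocks_per_row` push constant.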