marfrit 71db72928f Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)

Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 11:47:03 +00:00

phase	status	date_opened	parent	target_kernel	expected_artifacts
4	open (awaiting Phase 5 review)	2026-05-18	phase3.md	VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)	src/v3d_idct8_kernel.comp, src/v3d_kernel_runner.{c,h}, tests/bench_v3d_idct.c, CMakeLists.txt updates

Phase 4 — Plan

Per dev_process.md:

Formulate the approach. Identify what will and will not be touched. State expected outcome of implementation in the same measurable terms used in Phase 1/3.

And the Phase 6 contract-before-code rule, applied early here so Phase 5 review can verify the contracts before any code is written:

Read the API contract — kernel docs, header comments, and upstream source for every call touched. State the contract explicitly before implementing against it.

1. Constraint recap

All numbered constraints are carried verbatim from phase0.md §6 and phase3.md. The plan must satisfy every one.

ID	Constraint	Source
C1	shader arithmetic = int32 only (no FP16, no native FP)	`vulkaninfo` shaderFloat16 = false
C2	shared mem per WG ≤ 16 KiB	`vulkaninfo` maxComputeSharedMemorySize
C3	SSBOs per shader stage ≤ 8	`vulkaninfo` maxPerStageDescriptorStorageBuffers
C4	subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` is available)	`vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6
C5	per-block fixed-point multiplies fit SMUL24 (≤17-bit operands)	Q14 constants are 14 bits, coeffs ≤16 bits
C6	GPU LPDDR4x share ≈ 4–7 GB/s (CPU sees 12–15)	py-videocore7 scopy benchmark
C7	per-dispatch overhead = 32.95 µs	`phase3.md` M5
C8	1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes	`vulkaninfo` maxComputeWorkGroupInvocations
C9	`maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB	`vulkaninfo`
C10	bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon`	Phase 1 M1

Spec contracts the kernel must implement against:

VP9 specification §8.7 — inverse transform process (Google, 2016). The canonical text. Bit-exact verification is the ultimate gate, not the spec's words.
FFmpeg n7.1.3 idct_idct_8x8_add_c from libavcodec/vp9dsp_template.c and ff_vp9_idct_idct_8x8_add_neon from libavcodec/aarch64/vp9itxfm_neon.S (vendored at external/ffmpeg-snapshot/, pinned to commit f46e514). The C reference and NEON implementation differ in output orientation: FFmpeg's column pass writes transposed (column i of block IDCT'd into row i of tmp), and the row pass reads columns of tmp and writes columns of dst via the dst++ pattern. Our QPU kernel must reproduce this orientation or M1 fails. See tests/vp9_idct8_ref.c for the corrected reference (which itself was caught failing during Phase 3 and fixed to match the FFmpeg orientation).

2. Workload model

For a 1920×1088-luma 1080p frame:

Quantity	Value
8×8 blocks per luma plane	(1920/8) × (1088/8) = 240 × 136 = 32 640
Coeff bytes per frame (i16 × 64 / block)	32 640 × 64 × 2 = 4 177 920 ≈ 3.98 MiB
Pixel bytes per frame (u8 luma)	2 088 960 ≈ 1.99 MiB
Coeff read once + pred read + dst write	≈ 4.0 + 2.0 + 2.0 = 8.0 MB / frame

Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT kernel is the same arithmetic; chroma plumbing is a separate follow-on once luma works).
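The quantities in the table follow from the plane dimensions alone. A minimal C sketch of that arithmetic, assuming width and height are multiples of 8 (the names `plane_workload` and `luma_workload` are hypothetical and do not exist in the repository):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper; only the arithmetic comes from the workload table. */
typedef struct {
    uint32_t blocks;       /* 8x8 blocks in the plane            */
    uint64_t coeff_bytes;  /* int16 coefficients, 64 per block   */
    uint64_t pixel_bytes;  /* 8-bit luma samples, one per pixel  */
} plane_workload;

static plane_workload luma_workload(uint32_t width, uint32_t height)
{
    plane_workload w;
    /* Assumption: both dimensions are already padded to a multiple of 8. */
    assert(width % 8u == 0u && height % 8u == 0u);
    w.blocks      = (width / 8u) * (height / 8u);
    w.coeff_bytes = (uint64_t)w.blocks * 64u * sizeof(int16_t);
    w.pixel_bytes = (uint64_t)width * height;
    return w;
}
```

Calling `luma_workload(1920, 1088)` gives 32 640 blocks and 2 088 960 pixel bytes; total traffic per frame is the coefficient bytes plus the pixel bytes counted twice (pred read and dst write).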

3. Workgroup geometry decision

Choices and why:

Dispatch granularity: one vkCmdDispatch per frame plane. Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON on overhead alone; frame-level (32 640 blocks) is the only granularity that amortises C7 to negligible (1.0 ns/block of overhead vs NEON's 122 ns/block).
Workgroup size: 64 invocations (= 4 subgroups × 16 lanes), local_size_x = 64. One workgroup processes 4 blocks, one block per subgroup.

Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256): Phase 1 deliberately keeps the first kernel simple. 4 blocks gives us a clean subgroup-per-block mapping and small shared memory footprint (4 × 64 × 4 B = 1024 B for the transpose scratch, far below C2's 16 KiB). Phase 7 may sweep wider WGs (16 / 32 / 64 / 128 / 256) per phase1.md §"Secondary measurements" M6 to find the optimum.

Why 1 block/subgroup: a 16-lane subgroup naturally maps to "one cooperating team for one 8×8 block." Lanes 0..7 do the column pass (each owning a column of the block); lanes 8..15 do the row pass (each owning a row of dst output). Lanes are not reduced (C4 satisfied — we never call subgroupAdd).
Workgroup count: ⌈32 640 / 4⌉ = 8 160 workgroups per 1080p luma plane, all in a single dispatch. The Vulkan spec guarantees maxComputeWorkGroupCount[0] ≥ 65 535, so there is plenty of headroom.
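The geometry arithmetic above can be sketched in C as follows. The helper names are hypothetical; the 4 blocks/WG figure and the 32.95 µs overhead are taken from this plan, and nothing here is measured.

```c
#include <stdint.h>

#define BLOCKS_PER_WG 4u  /* v1 geometry: one block per 16-lane subgroup */

/* Ceiling division: a partially filled final workgroup must still launch. */
static uint32_t workgroup_count(uint32_t n_blocks)
{
    return (n_blocks + BLOCKS_PER_WG - 1u) / BLOCKS_PER_WG;
}

/* Per-dispatch overhead spread over every block in the dispatch, in ns. */
static double overhead_ns_per_block(double dispatch_us, uint32_t n_blocks)
{
    return dispatch_us * 1000.0 / (double)n_blocks;
}
```

For 32 640 blocks this gives 8 160 workgroups and about 1.0 ns of overhead per block. A block count that is not a multiple of 4 rounds up, which is the partial-workgroup case the out-of-bounds flag in §4 exists for.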

4. Per-thread algorithm

The contract reproduces FFmpeg's idct_idct_8x8_add_c orientation exactly (column pass with transposed write, row pass with columnar write). Pseudocode:

// Per-thread state derived from gl_GlobalInvocationID.x (global_id)
// and gl_LocalInvocationID.x (local_id):
//   block_idx       = global_id / 16
//   block_local_idx = local_id / 16         // [0..3], block slot within this WG
//   lane            = global_id % 16        // [0..15], subgroup-local
//   col_pass        = lane < 8              // first half does columns
//   row_pass        = lane >= 8             // second half does rows
//   k               = lane & 7              // column or row index 0..7

// Out-of-bounds flag — do NOT early-return here. Vulkan requires
// barrier() to be reached by all WG invocations; a divergent early
// return ahead of the barrier is undefined behaviour (and would bite
// us on any frame whose block count isn't a multiple of 4).
// Per phase5.md finding 7.
bool oob = (block_idx >= n_blocks);

if (col_pass && !oob) {
    // Read column k of block_idx, 8 i16 values.
    int in[8];
    for (r = 0; r < 8; r++)
        in[r] = block[block_idx * 64 + r * 8 + k];
    int out[8];
    idct8_1d(in, out);                        // 14 mults + 12 adds + 4 shifts
    // Transposed write: row k of tmp_shared[block_local_idx]
    for (r = 0; r < 8; r++)
        tmp_shared[block_local_idx * 64 + k * 8 + r] = out[r];
}

barrier();   // ALL lanes reach this, oob or not

if (row_pass && !oob) {
    // Read column k of tmp_shared[block_local_idx], 8 i32 values.
    int in[8];
    for (r = 0; r < 8; r++)
        in[r] = tmp_shared[block_local_idx * 64 + r * 8 + k];
    int out[8];
    idct8_1d(in, out);                        // same kernel
    // Columnar write into dst: column k of the destination 8x8.
    // Block position in dst: (block_y, block_x) = (block_idx / Wb, block_idx % Wb) × 8.
    // dst is uint8_t[] (NOT packed i32): each lane writes a unique
    // byte address → no sub-word race. Per phase5.md finding 5.
    for (r = 0; r < 8; r++) {
        uint dx = block_x * 8 + k;
        uint dy = block_y * 8 + r;
        uint8_t px = dst[dy * stride + dx];
        int sum = int(px) + ((out[r] + 16) >> 5);
        dst[dy * stride + dx] = uint8_t(clamp(sum, 0, 255));
    }
}
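The per-invocation bookkeeping in the pseudocode can be checked on the CPU before any GLSL exists. The sketch below is a C model of that decode step only. It assumes 64 invocations per workgroup with each block owning 16 consecutive invocation IDs, and it derives `block_local_idx` as `block_idx % 4`, which is an assumption of this sketch. The names `lane_role` and `decode_invocation` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES_PER_BLOCK 16u
#define BLOCKS_PER_WG    4u

typedef struct {
    uint32_t block_idx;        /* block index within the dispatch       */
    uint32_t block_local_idx;  /* block slot within the workgroup, 0..3 */
    uint32_t k;                /* column or row index, 0..7             */
    bool     col_pass;         /* lanes 0..7 run the column pass        */
    bool     row_pass;         /* lanes 8..15 run the row pass          */
    bool     oob;              /* past the last block: skip work only   */
} lane_role;

static lane_role decode_invocation(uint32_t global_id, uint32_t n_blocks)
{
    lane_role role;
    uint32_t lane = global_id % LANES_PER_BLOCK;

    role.block_idx       = global_id / LANES_PER_BLOCK;
    /* With 64 invocations per WG this equals local_id / 16. */
    role.block_local_idx = role.block_idx % BLOCKS_PER_WG;
    role.k               = lane & 7u;
    role.col_pass        = lane < 8u;
    role.row_pass        = lane >= 8u;
    /* oob gates the work bodies; the barrier itself stays unconditional. */
    role.oob             = role.block_idx >= n_blocks;
    return role;
}
```

Invocation 77, for example, decodes to block 4 (slot 0 of the second workgroup), lane 13, so it is a row-pass lane with k = 5.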

Notes:

idct8_1d is the same 1D 8-point IDCT used in tests/vp9_idct8_ref.c::idct8_1d, transcribed unchanged (constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102, Q14 round-shift (x + (1<<13)) >> 14).
All arithmetic is int (32-bit) per C1. Multiplies are i16 × i16 → i32, well within SMUL24 (C5).
The barrier is a workgroup barrier (barrier() + memory barrier), not a subgroup operation, so C4 is satisfied.
DC-only fast path (eob == 1) is not implemented in v1. Per Phase 3 M1 stats, DC-only frequency is 0.11 %; the cost of unconditional general-path execution is ~1× per-block. Phase 7 may add the fast path if M2 measurement is close to the decision threshold.
The 8 idle lanes per subgroup per phase (lanes 8..15 idle during column pass; lanes 0..7 idle during row pass) waste half of subgroup-cycle bandwidth. The real alternative geometry (corrected per phase5.md finding 3) is 2 blocks per subgroup: lanes 0..7 handle block A's column pass while lanes 8..15 handle block B's column pass simultaneously; one barrier; lanes 0..7 handle A's row pass, lanes 8..15 handle B's row pass. This doubles useful work per subgroup, halves the workgroup count to 4 080, and grows shared memory to 2 KiB (12.5 % of C2's 16 KiB). The v1 half-idle design is kept deliberately for simplicity-first; the 2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
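For orientation, here is a C sketch of a 1D 8-point inverse DCT using the seven constants and the Q14 round-shift named in the notes. The butterfly structure is an assumption: it follows the usual VP9 formulation as recalled, not a copy of the vendored snapshot, so tests/vp9_idct8_ref.c and external/ffmpeg-snapshot/ remain the authority for bit-exactness. The names `q14_round` and `idct8_1d_sketch` are hypothetical.

```c
#include <stdint.h>

/* Q14 round-shift. Assumes >> on a negative int32_t is an arithmetic
 * shift (true for GCC and Clang, implementation-defined in ISO C). */
static int32_t q14_round(int32_t x)
{
    return (x + (1 << 13)) >> 14;
}

/* Assumed butterfly structure; verify against the reference before use. */
static void idct8_1d_sketch(const int32_t in[8], int32_t out[8])
{
    /* Even half: inputs 0, 4, 2, 6. */
    int32_t t0a = q14_round((in[0] + in[4]) * 11585);
    int32_t t1a = q14_round((in[0] - in[4]) * 11585);
    int32_t t2a = q14_round(in[2] * 6270 - in[6] * 15137);
    int32_t t3a = q14_round(in[2] * 15137 + in[6] * 6270);
    /* Odd half: inputs 1, 7, 5, 3. */
    int32_t t4a = q14_round(in[1] * 3196 - in[7] * 16069);
    int32_t t5a = q14_round(in[5] * 13623 - in[3] * 9102);
    int32_t t6a = q14_round(in[5] * 9102 + in[3] * 13623);
    int32_t t7a = q14_round(in[1] * 16069 + in[7] * 3196);

    int32_t t0 = t0a + t3a, t1 = t1a + t2a;
    int32_t t2 = t1a - t2a, t3 = t0a - t3a;
    int32_t t4 = t4a + t5a, t5b = t4a - t5a;
    int32_t t7 = t7a + t6a, t6b = t7a - t6a;

    /* Final rotation of the odd half by the 11585 (cos pi/4) constant. */
    int32_t t5 = q14_round((t6b - t5b) * 11585);
    int32_t t6 = q14_round((t6b + t5b) * 11585);

    out[0] = t0 + t7;  out[7] = t0 - t7;
    out[1] = t1 + t6;  out[6] = t1 - t6;
    out[2] = t2 + t5;  out[5] = t2 - t5;
    out[3] = t3 + t4;  out[4] = t3 - t4;
}
```

A DC-only input of 64 produces 45 on all eight outputs, since (64 × 11585 + 8192) >> 14 = 45 and every other term rounds to zero.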

5. Memory layout / SSBO design

Three SSBO bindings, well under C3's limit of 8:

binding	name	type	size	usage
0	`coeffs`	`readonly int16_t[]`	N × 64 × 2 B	input quantised coefficients
1	`dst`	`uint8_t[]`	H × stride B	input pred + output pixels, in/out
2	`meta`	`readonly uvec2[]`	N × 8 B	per-block (block_y, block_x)

Why uint8_t[] for dst and not int32_t[] (revised per phase5.md finding 5): packing 4 pixels per i32 would put 4 row-pass lanes per subgroup writing to the same 32-bit word for every row of every block (lanes with k = 0..3 → bytes 0..3 of one word; lanes with k = 4..7 → the next word). Vulkan does not provide atomic sub-word writes, and concurrent non-atomic writes to overlapping addresses are undefined behaviour. v3dv exposes storageBuffer8BitAccess = true (verified in vulkaninfo_v3d_7_1_7_hertz.txt), so we declare:

#extension GL_EXT_shader_8bit_storage : require
layout(binding = 1) buffer Dst { uint8_t dst[]; };

Each lane then writes dst[(block_y*8+r)*stride + block_x*8 + k] — a unique byte address per lane, no race.
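The difference between the two layouts can be made concrete with a small C model. It is illustrative only: `dst_byte_offset` mirrors the address expression above, and `packed_word_index` (a hypothetical name) shows which 32-bit word each byte would fall into under the rejected int32_t[] layout.

```c
#include <stdint.h>

/* Byte offset written by the row-pass lane owning column k, at row r. */
static uint32_t dst_byte_offset(uint32_t block_y, uint32_t block_x,
                                uint32_t r, uint32_t k, uint32_t stride)
{
    return (block_y * 8u + r) * stride + block_x * 8u + k;
}

/* Word that would hold this byte if dst were packed as int32_t[]. */
static uint32_t packed_word_index(uint32_t byte_offset)
{
    return byte_offset / 4u;
}
```

For any fixed row, the eight values of k give eight distinct byte offsets but only two distinct word indices, so the byte-addressed layout has no overlapping writes while the packed one has four lanes per word.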

Push constants (maxPushConstantsSize limit: 128 B max, we use 16):

layout(push_constant) uniform PC {
    uint n_blocks;        // total 8×8 blocks in this dispatch
    uint blocks_per_row;  // (dst_width / 8), for block_idx → (by, bx)
    uint dst_stride_u8;   // dst row stride in bytes (8-bit storage)
    uint _pad;
} pc;

The 7 Q14 trig constants are baked into the shader source as literal const int — they don't need to be in a buffer.
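On the host side the push-constant block needs a byte-identical mirror. A possible C sketch, assuming the field order of the PC block above; the type name `idct8_push_constants` and the helper `make_push_constants` are hypothetical:

```c
#include <stdint.h>

/* Host-side mirror of the shader's PC block; field order must match. */
typedef struct {
    uint32_t n_blocks;        /* total 8x8 blocks in this dispatch */
    uint32_t blocks_per_row;  /* dst_width / 8                     */
    uint32_t dst_stride_u8;   /* dst row stride in bytes           */
    uint32_t _pad;            /* keeps the block at 16 bytes       */
} idct8_push_constants;

_Static_assert(sizeof(idct8_push_constants) == 16,
               "push-constant block must be exactly 16 bytes");

/* Assumes width and height are multiples of 8. */
static idct8_push_constants make_push_constants(uint32_t width,
                                                uint32_t height,
                                                uint32_t stride)
{
    idct8_push_constants pc;
    pc.blocks_per_row = width / 8u;
    pc.n_blocks       = pc.blocks_per_row * (height / 8u);
    pc.dst_stride_u8  = stride;
    pc._pad           = 0u;
    return pc;
}
```

The resulting struct would be passed to vkCmdPushConstants with a size of sizeof(idct8_push_constants); the static assertion catches any accidental layout drift at compile time.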

Shared memory layout (4 blocks per WG, i32 per element):

shared int tmp_shared[4][64];  // 4 × 64 × 4 B = 1024 B

6. Predicted M2 (the expected outcome per Phase 1)

Phase 1 §"Decision rules" requires a predicted measurement. We state two predictions: a compute-envelope upper bound and a bandwidth-ceiling upper bound. M2 is bounded by the minimum.

Compute-envelope estimate

1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
2 passes per block (column + row) = 60 SIMD cycles per block (idealised — every SIMD cycle issued on the full 16-lane width)
Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at 960 MHz — we use int arithmetic, SMUL24 throughput is comparable. Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
Frame time ≈ 2 M / 92 G ≈ 22 µs / frame (compute only, idealised — see correction below)
Correction per phase5.md finding 8: the v1 half-idle geometry runs only 8 of the 16 subgroup lanes per phase, so the effective SIMD utilisation is 0.5× — compute frame time is closer to ~44 µs / frame. This doesn't change the conclusion because the bandwidth ceiling (below) is the governing constraint, but the 2× note is essential for any future compute-bound optimisation work.
Plus C7 = 33 µs/dispatch overhead, idealised total ≈ 55 µs / frame ≈ 18 000 FPS ≈ 587 Mblock/s
Realistic 23% utilization (per py-videocore7 SGEMM ceiling): → ≈ 135 Mblock/s, R = 135 / 8.171 ≈ 16.5
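The compute-envelope chain above reduces to two small functions. The C sketch below uses hypothetical names; every input (60 cycles per block, 92 GOPS, the dispatch overhead, the 0.23 utilisation factor) is an estimate from this section, not a measurement.

```c
#include <stdint.h>

/* Compute-only frame time in microseconds.
 * gops is in units of 1e9 ops/s, i.e. gops * 1e3 ops per microsecond. */
static double compute_frame_us(uint32_t n_blocks, double cycles_per_block,
                               double gops)
{
    return (double)n_blocks * cycles_per_block / (gops * 1e3);
}

/* Throughput in Mblock/s: blocks per microsecond, scaled by utilisation. */
static double mblocks_per_s(uint32_t n_blocks, double frame_us,
                            double utilisation)
{
    return (double)n_blocks / frame_us * utilisation;
}
```

With the unrounded inputs, `compute_frame_us(32640, 60.0, 92.0)` is a little over 21 µs, and a 55 µs total frame time gives roughly 593 Mblock/s at full utilisation. The list above rounds intermediate values, so its figures differ by about 1 %.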

Bandwidth-ceiling estimate

Per frame: 8.0 MB total traffic (§2)
GPU sustained share: ~4 GB/s (phase0.md Throughput envelopes)
Frame time = 8 MB / 4 GB/s = 2.0 ms / frame = 500 FPS = 16.3 Mblock/s
R = 16.3 / 8.171 ≈ 2.0

Combined prediction

M2 = min(compute envelope, bandwidth ceiling) = ~16 Mblock/s. Predicted R ≈ 2.0, decision rule: ≥ 1.0 → strong PASS.
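The bandwidth ceiling and the min() rule can be written out the same way. This is a C sketch with hypothetical names; the 8 MB per frame, the 4 GB/s share and the 8.171 Mblock/s NEON baseline are the figures quoted in this plan.

```c
#include <stdint.h>

/* Bandwidth-limited throughput in Mblock/s. */
static double bandwidth_mblocks_per_s(uint32_t n_blocks,
                                      double bytes_per_frame,
                                      double bytes_per_s)
{
    double frame_s = bytes_per_frame / bytes_per_s;
    return (double)n_blocks / frame_s / 1e6;
}

/* R = min(compute, bandwidth) / NEON baseline, all in Mblock/s. */
static double predicted_ratio(double compute_mbs, double bandwidth_mbs,
                              double neon_mbs)
{
    double m2 = compute_mbs < bandwidth_mbs ? compute_mbs : bandwidth_mbs;
    return m2 / neon_mbs;
}
```

At 4 GB/s the bandwidth term is about 16.3 Mblock/s and R comes out just under 2.0. Rerunning the same functions with 1 GB/s gives about 4.1 Mblock/s and R just under 0.5, which shows how sensitive the prediction is to the achieved memory bandwidth.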

Honest uncertainty: this is a prediction, not a measurement. Phase 7 may show the GPU sees less than 4 GB/s under this access pattern: the dst read-modify-write is not coalesced, and at 1 GB/s or below R drops to 0.5 or lower. Per-WG launch cost plus the 8 idle lanes per phase may also halve the compute envelope. A conservative expectation is R ≈ 1.0 (memory-bound at roughly 2 GB/s, 8 Mblock/s), which sits right at the strong-PASS threshold; a result in the Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band would still pass as "hybrid concurrent work viable".

What would invalidate the prediction

Vulkan compute launches whose preamble costs more than M5b estimates (e.g., per-dispatch descriptor binding overhead that the noop bench doesn't measure). Phase 7 records the actual cost.
v3dv's compiler producing significantly worse code than idct8_1d's expected ~30 ops (e.g., unable to fuse multiply-adds, or spilling registers). Phase 7 records glsl-spv and spv-disas output for inspection.
Bandwidth contention with concurrent CPU activity (which is guaranteed on hertz — LXD spine). Phase 7 must measure with representative CPU load.

7. What will be touched / not touched

Touched (created or modified in Phase 6):

src/v3d_idct8.comp — the GLSL compute shader.
src/v3d_runner.c + src/v3d_runner.h — Vulkan boilerplate for instance/device/queue/buffer/pipeline setup, common to all kernels.
tests/bench_v3d_idct.c — the M2 throughput bench and the M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
CMakeLists.txt — add the new GLSL shader to the daedalus_shaders custom command + the new bench target.

NOT touched:

tests/bench_neon_idct.c, tests/vp9_idct8_ref.c, tests/bench_vulkan_dispatch.c — Phase 3 baselines stay immutable so re-running them on the same hertz gives the same numbers.
external/ffmpeg-snapshot/ — vendored reference, byte-frozen.
Any kernel-side / DRM / firmware code. Path B specifically scopes us to Vulkan compute on stock v3dv.

8. Phase 5 review handoff