71db72928f
Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.
Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:
RED finding 5 (dst race condition): int32_t[] dst with 4 lanes
writing to overlapping 32-bit words = non-atomic concurrent
writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
(verified exposed). Applied to phase4.md sec 5 + GLSL declaration.
RED finding 7 (early-return before barrier): if (block_idx >=
n_blocks) return; ahead of barrier() is UB by Vulkan spec.
For 1080p (32640 blocks, /4) no partial WGs; for any frame
width not /32 there are. Fix: oob flag, gate work bodies,
barrier() unconditional. Applied to phase4.md sec 4 pseudocode.
YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
subgroup ops in v1 so unaffected, but the wrong constraint
would mislead Phase 6/7. Corrected in phase0.md sec 2,
phase2.md sec 6, phase4.md sec 1 (C4).
GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.
Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."
Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
401 lines
17 KiB
Markdown
401 lines
17 KiB
Markdown
---
|
||
phase: 4
|
||
status: open (awaiting Phase 5 review)
|
||
date_opened: 2026-05-18
|
||
parent: phase3.md
|
||
target_kernel: VP9 8×8 inverse DCT (DCT_DCT, 8-bit pixels)
|
||
expected_artifacts: src/v3d_idct8_kernel.comp, src/v3d_kernel_runner.{c,h},
|
||
tests/bench_v3d_idct.c, CMakeLists.txt updates
|
||
---
|
||
|
||
# Phase 4 — Plan
|
||
|
||
Per `dev_process.md`:
|
||
|
||
> Formulate the approach. Identify what will and will not be touched.
|
||
> State expected outcome of implementation in the *same* measurable
|
||
> terms used in Phase 1/3.
|
||
|
||
And the Phase 6 contract-before-code rule, applied early here so
|
||
Phase 5 review can verify the contracts before any code is written:
|
||
|
||
> Read the API contract — kernel docs, header comments, and upstream
|
||
> source for every call touched. State the contract explicitly
|
||
> before implementing against it.
|
||
|
||
## 1. Constraint recap
|
||
|
||
All numbered constraints are carried verbatim from `phase0.md §6`
|
||
and `phase3.md`. The plan must satisfy every one.
|
||
|
||
| ID | Constraint | Source |
|
||
|---|---|---|
|
||
| C1 | shader arithmetic = int32 only (no FP16, no native FP) | `vulkaninfo` shaderFloat16 = false |
|
||
| C2 | shared mem per WG ≤ 16 KiB | `vulkaninfo` maxComputeSharedMemorySize |
|
||
| C3 | SSBOs per shader stage ≤ 8 | `vulkaninfo` maxPerStageDescriptorStorageBuffers |
|
||
| C4 | subgroup ops = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions, but `subgroupShuffle` *is* available) | `vulkaninfo` subgroupSupportedOperations — corrected per phase5.md finding 6 |
|
||
| C5 | per-block fixed-point multiplies fit SMUL24 (≤17-bit operands) | Q14 constants are 14 bits, coeffs ≤16 bits |
|
||
| C6 | GPU LPDDR4x share ≈ 4–7 GB/s (CPU sees 12–15) | py-videocore7 scopy benchmark |
|
||
| C7 | per-dispatch overhead = **32.95 µs** | `phase3.md` M5 |
|
||
| C8 | 1 WG ≤ 256 invocations, ≤ 16 subgroups of 16 lanes | `vulkaninfo` maxComputeWorkGroupInvocations |
|
||
| C9 | `maxStorageBufferRange` = 1 GiB → single SSBO ≤ 1 GiB | `vulkaninfo` |
|
||
| C10 | bit-exact output must match `ff_vp9_idct_idct_8x8_add_neon` | Phase 1 M1 |
|
||
|
||
Spec contracts the kernel must implement against:
|
||
|
||
- **VP9 specification §8.7 — inverse transform process** (Google,
|
||
2016). The canonical text. Bit-exact verification is the
|
||
ultimate gate, not the spec's words.
|
||
- **FFmpeg n7.1.3** `idct_idct_8x8_add_c` from
|
||
`libavcodec/vp9dsp_template.c` and
|
||
`ff_vp9_idct_idct_8x8_add_neon` from
|
||
`libavcodec/aarch64/vp9itxfm_neon.S` (vendored at
|
||
`external/ffmpeg-snapshot/`, pinned to commit `f46e514`). The
|
||
C reference and NEON implementation differ in output
|
||
*orientation*: FFmpeg's column pass writes transposed (column
|
||
`i` of block IDCT'd into row `i` of `tmp`), and the row pass
|
||
reads columns of `tmp` and writes columns of `dst` via the
|
||
`dst++` pattern. **Our QPU kernel must reproduce this
|
||
orientation** or M1 fails. See
|
||
`tests/vp9_idct8_ref.c` for the corrected reference (which
|
||
itself was caught failing during Phase 3 and fixed to match
|
||
the FFmpeg orientation).
|
||
|
||
## 2. Workload model
|
||
|
||
For a 1920×1088-luma 1080p frame:
|
||
|
||
| Quantity | Value |
|
||
|---|---|
|
||
| 8×8 blocks per luma plane | (1920/8) × (1088/8) = 240 × 136 = 32 640 |
|
||
| Coeff bytes per frame (i16 × 64 / block) | 4 178 880 ≈ 3.99 MiB |
|
||
| Pixel bytes per frame (u8 luma) | 2 088 960 ≈ 1.99 MiB |
|
||
| Coeff read once + pred read + dst write | ≈ 4.0 + 2.0 + 2.0 = **8.0 MB / frame** |
|
||
|
||
Chroma planes are out of scope for Phase 1 (the VP9 8×8 IDCT
|
||
kernel is the same arithmetic; chroma plumbing is a separate
|
||
follow-on once luma works).
|
||
|
||
## 3. Workgroup geometry decision
|
||
|
||
Choices and why:
|
||
|
||
- **Dispatch granularity**: one `vkCmdDispatch` per frame plane.
|
||
Justification: C7 forces ≥ 555 blocks/dispatch to beat NEON
|
||
on overhead alone; frame-level (32 640 blocks) is the only
|
||
granularity that amortises C7 to negligible (1.0 ns/block of
|
||
overhead vs NEON's 122 ns/block).
|
||
|
||
- **Workgroup size**: **64 invocations** (= 4 subgroups × 16
|
||
lanes), `local_size_x = 64`. One workgroup processes **4
|
||
blocks**, one block per subgroup.
|
||
|
||
Why 4 blocks/WG and not 16 (max 16 subgroups per WG of 256):
|
||
Phase 1 deliberately keeps the first kernel simple. 4 blocks
|
||
gives us a clean subgroup-per-block mapping and small shared
|
||
memory footprint (4 × 64 × 4 B = 1024 B for the transpose
|
||
scratch, far below C2's 16 KiB). Phase 7 may sweep wider WGs
|
||
(16 / 32 / 64 / 128 / 256) per `phase1.md §"Secondary
|
||
measurements" M6` to find the optimum.
|
||
|
||
Why 1 block/subgroup: a 16-lane subgroup naturally maps to
|
||
"one cooperating team for one 8×8 block." Lanes 0..7 do the
|
||
column pass (each owning a column of the block); lanes 8..15
|
||
do the row pass (each owning a row of dst output). Lanes are
|
||
not reduced (C4 satisfied — we never call `subgroupAdd`).
|
||
|
||
- **Dispatch count**: ⌈32 640 / 4⌉ = **8 160 workgroups** per 1080p
|
||
luma plane. Per Vulkan spec
|
||
`maxComputeWorkGroupCount.x = 65 535` — plenty of headroom.
|
||
|
||
## 4. Per-thread algorithm
|
||
|
||
The contract reproduces FFmpeg's `idct_idct_8x8_add_c` orientation
|
||
exactly (column pass with transposed write, row pass with
|
||
columnar write). Pseudocode:
|
||
|
||
```glsl
|
||
// Per-thread state derived from gl_GlobalInvocationID.x:
|
||
// block_idx = global_id / 16
|
||
// lane = global_id % 16 // [0..15], subgroup-local
|
||
// col_pass = lane < 8 // first half does columns
|
||
// row_pass = lane >= 8 // second half does rows
|
||
// k = lane & 7 // column or row index 0..7
|
||
|
||
// Out-of-bounds flag — do NOT early-return here. Vulkan requires
|
||
// barrier() to be reached by all WG invocations; a divergent early
|
||
// return ahead of the barrier is undefined behaviour (and would bite
|
||
// us on any frame whose block count isn't a multiple of 4).
|
||
// Per phase5.md finding 7.
|
||
bool oob = (block_idx >= n_blocks);
|
||
|
||
if (col_pass && !oob) {
|
||
// Read column k of block_idx, 8 i16 values.
|
||
int in[8];
|
||
for (r = 0; r < 8; r++)
|
||
in[r] = block[block_idx * 64 + r * 8 + k];
|
||
int out[8];
|
||
idct8_1d(in, out); // 14 mults + 12 adds + 4 shifts
|
||
// Transposed write: row k of tmp_shared[block_local_idx]
|
||
for (r = 0; r < 8; r++)
|
||
tmp_shared[block_local_idx * 64 + k * 8 + r] = out[r];
|
||
}
|
||
|
||
barrier(); // ALL lanes reach this, oob or not
|
||
|
||
if (row_pass && !oob) {
|
||
// Read column k of tmp_shared[block_local_idx], 8 i32 values.
|
||
int in[8];
|
||
for (r = 0; r < 8; r++)
|
||
in[r] = tmp_shared[block_local_idx * 64 + r * 8 + k];
|
||
int out[8];
|
||
idct8_1d(in, out); // same kernel
|
||
// Columnar write into dst: column k of the destination 8x8.
|
||
// Block position in dst: (block_y, block_x) = (block_idx / Wb, block_idx % Wb) × 8.
|
||
// dst is uint8_t[] (NOT packed i32): each lane writes a unique
|
||
// byte address → no sub-word race. Per phase5.md finding 5.
|
||
for (r = 0; r < 8; r++) {
|
||
uint dx = block_x * 8 + k;
|
||
uint dy = block_y * 8 + r;
|
||
uint8_t px = dst[dy * stride + dx];
|
||
int sum = int(px) + ((out[r] + 16) >> 5);
|
||
dst[dy * stride + dx] = uint8_t(clamp(sum, 0, 255));
|
||
}
|
||
}
|
||
```
|
||
|
||
Notes:
|
||
- `idct8_1d` is the same 1D 8-point IDCT used in
|
||
`tests/vp9_idct8_ref.c::idct8_1d`, transcribed unchanged
|
||
(constants 11585 / 6270 / 15137 / 3196 / 16069 / 13623 / 9102,
|
||
Q14 round-shift `(x + (1<<13)) >> 14`).
|
||
- All arithmetic is `int` (32-bit) per C1. Multiplies are
|
||
i16 × i16 → i32, well within SMUL24 (C5).
|
||
- The barrier is a *workgroup* barrier (`barrier()` + memory
|
||
barrier), not a subgroup operation, so C4 is satisfied.
|
||
- DC-only fast path (`eob == 1`) is *not* implemented in v1.
|
||
Per Phase 3 M1 stats, DC-only frequency is 0.11 %; the cost
|
||
of unconditional general-path execution is ~1× per-block. Phase
|
||
7 may add the fast path if M2 measurement is close to the
|
||
decision threshold.
|
||
- The 8 idle lanes per subgroup per phase (lanes 8..15 idle
|
||
during column pass; lanes 0..7 idle during row pass) waste
|
||
half of subgroup-cycle bandwidth. The real alternative geometry
|
||
(corrected per phase5.md finding 3) is **2 blocks per subgroup**:
|
||
lanes 0..7 handle block A's column pass while lanes 8..15
|
||
handle block B's column pass simultaneously; one barrier; lanes
|
||
0..7 handle A's row pass, lanes 8..15 handle B's row pass.
|
||
Doubles useful work per subgroup, halves dispatch count to
|
||
~4080 WGs, shared mem grows to 2 KiB (still 8 % of C2). The v1
|
||
half-idle design is kept deliberately for simplicity-first; the
|
||
2-blocks-per-subgroup rework is the obvious Phase 7 M6 win.
|
||
|
||
## 5. Memory layout / SSBO design
|
||
|
||
Three SSBO bindings, well under C3's limit of 8:
|
||
|
||
| binding | name | type | size | usage |
|
||
|---|---|---|---|---|
|
||
| 0 | `coeffs` | `readonly int16_t[]` | N × 64 × 2 B | input quantised coefficients |
|
||
| 1 | `dst` | `uint8_t[]` | H × stride B | input pred + output pixels, in/out |
|
||
| 2 | `meta` | `readonly uvec2[]` | N × 8 B | per-block (block_y, block_x) |
|
||
|
||
Why `uint8_t[]` for `dst` and not `int32_t[]` (revised per phase5.md
|
||
finding 5): packing 4 pixels per i32 would put 4 lanes per subgroup
|
||
writing to the *same* 32-bit word every block (lanes 0..3 → bytes
|
||
0..3 of one word; lanes 4..7 → next word). Vulkan does not provide
|
||
atomic sub-word writes — concurrent non-atomic writes to overlapping
|
||
addresses are undefined behaviour. v3dv exposes
|
||
`storageBuffer8BitAccess = true` (verified in
|
||
`vulkaninfo_v3d_7_1_7_hertz.txt`), so we declare:
|
||
|
||
```glsl
|
||
#extension GL_EXT_shader_8bit_storage : require
|
||
layout(binding = 1) buffer Dst { uint8_t dst[]; };
|
||
```
|
||
|
||
Each lane then writes `dst[(block_y*8+r)*stride + block_x*8 + k]` —
|
||
a unique byte address per lane, no race.
|
||
|
||
Push constants (C8: 128 B max, we use 16):
|
||
|
||
```
|
||
layout(push_constant) uniform PC {
|
||
uint n_blocks; // total 8×8 blocks in this dispatch
|
||
uint blocks_per_row; // (dst_width / 8), for block_idx → (by, bx)
|
||
uint dst_stride_u8; // dst row stride in bytes (8-bit storage)
|
||
uint _pad;
|
||
} pc;
|
||
```
|
||
|
||
The 7 Q14 trig constants are baked into the shader source as
|
||
literal `const int` — they don't need to be in a buffer.
|
||
|
||
Shared memory layout (4 blocks per WG, i32 per element):
|
||
|
||
```
|
||
shared int tmp_shared[4][64]; // 4 × 64 × 4 B = 1024 B
|
||
```
|
||
|
||
## 6. Predicted M2 (the expected outcome per Phase 1)
|
||
|
||
Phase 1 §"Decision rules" requires a predicted measurement. We
|
||
state two predictions: a compute-envelope upper bound and a
|
||
bandwidth-ceiling upper bound. M2 is bounded by the *minimum*.
|
||
|
||
### Compute-envelope estimate
|
||
|
||
- 1D 8-point IDCT: 14 multiplies + 12 adds + 4 shifts ≈ 30 ops
|
||
- 2 passes per block (column + row) = 60 SIMD cycles per block
|
||
*(idealised — every SIMD cycle issued on the full 16-lane width)*
|
||
- Frame: 32 640 blocks × 60 SIMD cycles = ~2 M SIMD-cycles / frame
|
||
- At V3D 7.1 ~92 GFLOPS theoretical FP32 across all 12 QPUs at
|
||
960 MHz — we use int arithmetic, SMUL24 throughput is comparable.
|
||
Use 92 GOPS (whole-SIMD-width ops) as the optimistic envelope.
|
||
- Frame time ≈ 2 M / 92 G ≈ **22 µs / frame** (compute only,
|
||
idealised — see correction below)
|
||
- **Correction per phase5.md finding 8:** the v1 half-idle
|
||
geometry runs only 8 of the 16 subgroup lanes per phase, so
|
||
the effective SIMD utilisation is 0.5× — compute frame time
|
||
is closer to **~44 µs / frame**. This doesn't change the
|
||
conclusion because the bandwidth ceiling (below) is the
|
||
governing constraint, but the 2× note is essential for any
|
||
future compute-bound optimisation work.
|
||
- Plus C7 = 33 µs/dispatch overhead, idealised total ≈ **55 µs / frame** ≈ **18 000 FPS** ≈ **587 Mblock/s**
|
||
- Realistic 23% utilization (per `py-videocore7` SGEMM ceiling):
|
||
→ ≈ **135 Mblock/s**, R = 135 / 8.171 ≈ **16.5**
|
||
|
||
### Bandwidth-ceiling estimate
|
||
|
||
- Per frame: 8.0 MB total traffic (§2)
|
||
- GPU sustained share: ~4 GB/s (`phase0.md` Throughput envelopes)
|
||
- Frame time = 8 MB / 4 GB/s = **2.0 ms / frame** = **500 FPS** = **16.3 Mblock/s**
|
||
- R = 16.3 / 8.171 ≈ **2.0**
|
||
|
||
### Combined prediction
|
||
|
||
M2 = min(compute envelope, bandwidth ceiling) = **~16 Mblock/s**.
|
||
Predicted **R ≈ 2.0**, decision rule: ≥ 1.0 → strong PASS.
|
||
|
||
Honest uncertainty: this is a *prediction*, not a measurement.
|
||
Phase 7 may show the GPU sees less than 4 GB/s under this access
|
||
pattern (the dst read-modify-write is not coalesced and may push
|
||
us below 1 GB/s, dragging R below 0.5). Or per-WG launch + the 8
|
||
idle lanes per phase may halve the compute envelope. The honest
|
||
*lower bound* is R ≈ 1.0 (memory-bound, 8 Mblock/s) — still passes
|
||
Phase 1 §"Decision rules" 0.5 ≤ R < 1.0 band ("hybrid concurrent
|
||
work viable").
|
||
|
||
### What would invalidate the prediction
|
||
|
||
- Vulkan compute launches that pre-amble more cost than M5b
|
||
estimates (e.g., per-dispatch descriptor binding overhead the
|
||
noop bench doesn't measure). Phase 7 records actual.
|
||
- v3dv's compiler producing significantly worse code than `idct8_1d`'s
|
||
expected ~30 ops (e.g., unable to fuse multiply-adds, or
|
||
spilling registers). Phase 7 records `glsl-spv` and
|
||
`spv-disas` output for inspection.
|
||
- Bandwidth contention with concurrent CPU activity (which is
|
||
guaranteed on hertz — LXD spine). Phase 7 must measure with
|
||
representative CPU load.
|
||
|
||
## 7. What will be touched / not touched
|
||
|
||
**Touched** (created or modified in Phase 6):
|
||
|
||
- `src/v3d_idct8.comp` — the GLSL compute shader.
|
||
- `src/v3d_runner.c` + `src/v3d_runner.h` — Vulkan boilerplate
|
||
for instance/device/queue/buffer/pipeline setup, common to
|
||
all kernels.
|
||
- `tests/bench_v3d_idct.c` — the M2 throughput bench and the
|
||
M1' bit-exact gate (QPU-vs-NEON-vs-C-ref, all three checked).
|
||
- `CMakeLists.txt` — add the new GLSL shader to the
|
||
`daedalus_shaders` custom command + the new bench target.
|
||
|
||
**NOT touched**:
|
||
|
||
- `tests/bench_neon_idct.c`, `tests/vp9_idct8_ref.c`,
|
||
`tests/bench_vulkan_dispatch.c` — Phase 3 baselines stay
|
||
immutable so re-running them on the same hertz gives the
|
||
same numbers.
|
||
- `external/ffmpeg-snapshot/` — vendored reference, byte-frozen.
|
||
- Any kernel-side / DRM / firmware code. Path B specifically
|
||
scopes us to Vulkan compute on stock v3dv.
|
||
|
||
## 8. Phase 5 review handoff
|
||
|
||
Per `dev_process.md`:
|
||
|
||
> Goal, situation, measurements, plan get pasted into DokuWiki.
|
||
> Markus reviews and redacts, then initiates the handover to a
|
||
> fresh model instance. **Claude does not curate the artifact going
|
||
> to the reviewer** — that would re-introduce the blind-spot
|
||
> accumulation the review is meant to escape. Do not summarize
|
||
> when handing over; paste the actual artifacts.
|
||
|
||
Files the second-model reviewer needs verbatim:
|
||
|
||
- `README.md`
|
||
- `docs/phase0.md`
|
||
- `docs/phase1.md`
|
||
- `docs/phase2.md`
|
||
- `docs/phase3.md`
|
||
- `docs/phase4.md` (this file)
|
||
- `docs/vulkaninfo_v3d_7_1_7_hertz.txt`
|
||
- `tests/vp9_idct8_ref.c`
|
||
- `tests/bench_neon_idct.c` (specifically the M3 measurement code)
|
||
- `external/ffmpeg-snapshot/PROVENANCE.md`
|
||
|
||
Specific review prompts (to focus the second model on the highest-
|
||
risk decisions):
|
||
|
||
1. **Orientation correctness.** The plan asserts the QPU kernel
|
||
must reproduce FFmpeg's transposed column-pass orientation.
|
||
Is the §4 pseudocode correct against `tests/vp9_idct8_ref.c`?
|
||
2. **Workgroup geometry.** Is 4 blocks/WG × 64 invocations the
|
||
right starting point, or should v1 already explore wider WGs?
|
||
3. **The 8-idle-lanes-per-phase tradeoff.** Half the subgroup
|
||
lanes idle for each pass. Acceptable for v1, or rework the
|
||
algorithm so all 16 lanes work per phase (e.g., 2 blocks per
|
||
subgroup)?
|
||
4. **Bandwidth prediction.** Is the §6 bandwidth estimate
|
||
defensible, or is the dst read-modify-write going to collapse
|
||
to ~1 GB/s through GPU L2 misses?
|
||
5. **Anything missing.** What load-bearing assumption is unstated?
|
||
|
||
## 9. Phase 6 execution order (if Phase 5 approves)
|
||
|
||
1. Skeleton `v3d_runner.{c,h}`: generalise `bench_vulkan_dispatch.c`'s
|
||
instance/device/queue/CB plumbing into a reusable API.
|
||
2. `v3d_idct8.comp` v0: write the shader, get it to compile (glslang
|
||
accepts it, spirv-val passes).
|
||
3. `bench_v3d_idct.c`: harness that runs the shader on the same
|
||
randomly-generated blocks that `bench_neon_idct` uses, reads
|
||
back, compares bit-exact against `daedalus_vp9_idct_idct_8x8_add_ref`.
|
||
4. First-light run on hertz. Expect non-bit-exact output —
|
||
transpose orientation or barrier placement bugs are likely.
|
||
Iterate to M1' = 100 % bit-exact.
|
||
5. Once bit-exact: measure M2 (throughput). Compare to predicted
|
||
R ≈ 2.0; record actual.
|
||
6. If R ≥ 0.5 → Phase 9 (lessons) → Phase 1 of next kernel.
|
||
If R < 0.5 → Phase 4 revision (loopback) or honest close.
|
||
|
||
## 10. Open questions Phase 4 does not close
|
||
|
||
- **Subgroup size assumption.** Phase 0 says subgroupSize is fixed
|
||
at 16 on V3D 7.1. The shader hard-codes `gl_SubgroupSize == 16`
|
||
via the lane mask `lane = global_id % 16`. If a future v3dv
|
||
exposes a different subgroup size, the algorithm needs revisiting.
|
||
Acceptable risk for Phase 1; document and move on.
|
||
- **Memory coherency between CPU and GPU.** The Phase 0
|
||
`is_host=true` zero-copy story is the *intent*; whether v3dv's
|
||
default memory allocation actually gives us coherent shared
|
||
buffers without explicit cache flushes is *unverified*. Phase 6
|
||
must check — if writes don't appear on the other side, add
|
||
`VkMappedMemoryRange` flush/invalidate calls.
|
||
- **Block-position metadata source.** §5 specifies `meta` SSBO
|
||
with (by, bx). Alternatively, derive position from block_idx
|
||
arithmetic using `blocks_per_row` push constant — saves a
|
||
buffer at the cost of a divide-modulo per thread. Phase 4 picks
|
||
the buffered version; Phase 7 may revisit.
|