Files
marfrit 71db72928f Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)
Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:47:03 +00:00

213 lines
11 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: 2
status: closed 2026-05-18
date_opened: 2026-05-18
parent: phase1.md
target_kernel: VP9 8×8 inverse DCT (DCT_DCT variant, 8-bit pixels)
---
# Phase 2 — Situation analysis
Per `dev_process.md`:
> Document current state. Identify constraints, dependencies, known
> failure modes. Reset context here — do not carry assumptions from
> prior sessions; re-read CLAUDE.md, relevant memory files, run
> `git status`, re-verify reachability.
## 1. Context reset
- Working tree state: dirty (Phase 0/1/2 docs not yet committed).
`.git-broken-2026-05-18/` preserved as a forensic artifact of
the 2026-05-18 10:25 working-tree wipe (cause undetermined).
- CLAUDE.md re-read: no contradictions with the Path B scope set
in README §"Architecture (Path B)".
- hertz reachability: confirmed via SSH; `vcgencmd`, `vulkaninfo`,
`apt`, sudo NOPASSWD all working as of 2026-05-17 inventory.
Mesa 25.0.7 / Vulkan 1.3.305 / V3D 7.1.7 stable.
## 2. Reference implementations — VP9 8×8 IDCT (DCT_DCT)
The Phase 1 kernel has *two* canonical reference implementations
in FFmpeg n7.1.3 (the version installed on hertz). The harness
will link both: the C path as the bit-exact gate (M1), the NEON
path as the throughput baseline (M3).
### 2.1 C reference
- **Source**: `libavcodec/vp9dsp_template.c`, function `idct_idct_8x8_add_c`
- **Spec basis**: VP9 specification §8.7 — Inverse transform process
- **Signature**:
```c
static void idct_idct_8x8_add_c(uint8_t *_dst, ptrdiff_t stride,
int16_t *_block, int eob);
```
- **Algorithm** (8-bit path):
1. If `eob == 1` (DC-only): single `(coef * 11585 * 11585)` round, broadcast to 8×8 with `+pred, clamp[0,255]`.
2. Otherwise: 8 column passes through `idct8_1d` → tmp[64]. Zero the input block. 8 row passes through `idct8_1d` → out[8]. Per-element `(out + 16) >> 5`, add to `dst`, `av_clip_pixel`.
- **`idct8_1d`**: 1-D 8-point inverse DCT, 8 trigonometric multiply-add stages with Q14 fixed-point constants then 8-butterfly add/sub stages. All arithmetic is signed int32 (`dctint`).
- **Q14 constants** (matched against VP9 spec §8.7.1.4):
| symbol | value | trig identity |
|---|---|---|
| cospi_16_64 | 11585 | cos(π/4) × 2^14 ≈ 0.70711 |
| cospi_24_64 | 6270 | cos(3π/8) × 2^14 ≈ 0.38268 |
| cospi_8_64 | 15137 | sin(3π/8) × 2^14 ≈ 0.92388 |
| cospi_28_64 | 3196 | cos(7π/16) × 2^14 ≈ 0.19509 |
| cospi_4_64 | 16069 | sin(7π/16) × 2^14 ≈ 0.98079 |
| cospi_20_64 | 9102 | cos(5π/16) × 2^14 ≈ 0.55557 |
| cospi_12_64 | 13623 | sin(5π/16) × 2^14 ≈ 0.83147 |
Rounding convention: `(product + (1 << 13)) >> 14`, i.e. round-half-up at bit 14.
- **License**: LGPL-2.1-or-later (FFmpeg).
- **Side effect**: zeroes the input `block[]` (idempotency requirement; matches spec).
### 2.2 NEON reference
- **Source**: `libavcodec/aarch64/vp9itxfm_neon.S`, symbol `ff_vp9_idct_idct_8x8_add_neon`
- **Signature** (same as C):
```
void ff_vp9_idct_idct_8x8_add_neon(uint8_t *dst, ptrdiff_t stride,
int16_t *block, int eob);
```
Registers: `x0=dst, x1=stride, x2=block, w3=eob`.
- **Internal dependencies** (must be copied alongside the .S):
| macro / symbol | location | role |
|---|---|---|
| `idct8` | `vp9itxfm_neon.S` | 1-D 8-pt IDCT, fully unrolled with `dmbutterfly*` |
| `dmbutterfly0` | `vp9itxfm_neon.S` | rotation by π/4 (the `cospi_16_64` case) |
| `dmbutterfly` | `vp9itxfm_neon.S` | general 2-input rotation `[a,b] → [a·c1b·c2, a·c2+b·c1]` (`Q14`) |
| `dmbutterfly_l` | `vp9itxfm_neon.S` | wide-form (4×i32 acc) for `dmbutterfly` |
| `butterfly_8h` | `vp9itxfm_neon.S` | trivial `[a+b, ab]` on `int16x8_t` |
| `transpose_8x8H` | `libavcodec/aarch64/neon.S` | in-place 8×8 i16 transpose |
| `idct_coeffs` | `vp9itxfm_neon.S` (`const`) | Q14 trig constants table, aligned 4 |
| `movrel` | `libavutil/aarch64/asm.S` | PIC-aware constant-pool relocation helper |
- **License**: LGPL-2.1-or-later (Google, 2016).
- **Performance shape**: full unrolled 8-pt butterfly with NEON `smull/smlsl/smlal` + `rshrn` for the Q14 round-shift; output uses `sqxtun` for saturated narrow to u8. Estimated ~80 NEON instructions for the steady state (non-DC) path.
### 2.3 AV1 equivalence note
AV1's 8×8 DCT_DCT transform (`av1_iidentity8_iidentity8_c` vs `av1_idct8_idct8_c` family in `libavcodec/av1dsp/...`) shares the same 1-D 8-point structure but with **different** scaling: AV1 uses 12-bit fixed-point (`>> 12`) and a slightly different rounding shift due to its different transform-stage bit growth model. Calling our VP9 IDCT shader on AV1 coefficients will produce wrong output. **AV1 support is out of scope for Phase 1.** A Phase-N variant can fork the shader with the AV1 constants once Phase 1 has proven the VP9 path.
## 3. Vulkan compute dispatch path
Hertz exposes V3D 7.1 via Mesa's v3dv driver as Vulkan
`PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU`, API 1.3.305, conformance
1.3.8.3. The compute-only dispatch path is:
```
host program
├─ vkCreateInstance / vkEnumeratePhysicalDevices (picks V3D 7.1.7.0)
├─ vkCreateDevice (queue family with COMPUTE_BIT, no graphics needed)
├─ vkCreateBuffer x N (SSBOs for block coeffs in / dst pixels in+out)
│ - buffer flags: STORAGE_BUFFER_BIT | TRANSFER_SRC/DST
│ - memory type: HOST_VISIBLE | HOST_COHERENT (zero-copy on shared LPDDR4x)
├─ vkCreateDescriptorSetLayout (≤8 SSBOs per layout — Pi 5 limit)
├─ vkCreateShaderModule (SPIR-V from glslang)
├─ vkCreateComputePipeline
├─ vkBeginCommandBuffer
│ vkCmdBindPipeline / vkCmdBindDescriptorSets / vkCmdPushConstants
│ vkCmdDispatch(group_count_x, 1, 1) # one WG per ~K blocks
├─ vkQueueSubmit + vkQueueWaitIdle (or fence) — this is the measured op
└─ (read back via the HOST_VISIBLE buffer, or alias it to the same memory the CPU populated)
```
Per Phase 0 §2 inside-view limits, the relevant constraints
for this kernel:
- ≤8 SSBOs per stage → group inputs/outputs into ≤8 bindings (we
only need 2: `block[]` in, `dst[]` in/out).
- Shared mem ≤16 KiB → each 8×8 block fits trivially (256 B in
i16 plus 64 B in u8). One WG can carry dozens of blocks of
shared state if useful.
- Subgroup size = 16 (fixed). One workgroup of 64 invocations =
4 subgroups; one block per subgroup is a natural shape (each
16-lane subgroup processes 8×8 = 64 pixels in 4 cycles of
subgroup work).
## 4. Build path on hertz
Already installed (2026-05-17): cmake 3.31, ninja 1.12, gcc (Debian
trixie default), `libvulkan-dev 1.4.309`, `glslang-tools 15.1.0`,
`spirv-tools 2025.1`, `libdrm-dev 2.4.131`, `vulkan-tools 1.4.304`.
Missing but cheap:
- `libavcodec-dev` — only needed if the harness wants to link
against system libavcodec for cross-checks against the dynamic
dispatcher. *Not* needed for the source-copy approach (preferred,
see §5).
## 5. Reference-copy strategy (vs system-libavcodec link)
**Decision: source-copy the 3 FFmpeg files into `external/ffmpeg-snapshot/`.**
Rationale:
- System `libavcodec.so` on hertz is symbol-stripped (`nm` returns
empty for `ff_vp9_idct_*`). Internal NEON entry points are not
reachable via `dlsym`.
- The two reference implementations (C, NEON) plus their macro/
data dependencies total ~3 files / ~600 lines. Source-copy is
smaller than the dlopen plumbing would be.
- LGPL-2.1-or-later (FFmpeg license) is propagation-compatible
with the harness binary if the harness binary itself is GPL
or LGPL. The kernel shaders and dispatch library stay
separately-licensed (BSD-2-Clause, default for this project).
- Pinning to `n7.1.3` matches hertz's runtime libavcodec version,
so any in-session sanity cross-check against the running Mesa
/ video tooling stays consistent.
Files to vendor:
| Source | License | Target path under `daedalus-fourier/` |
|---|---|---|
| `libavcodec/vp9dsp_template.c` | LGPL-2.1+ | `external/ffmpeg-snapshot/vp9dsp_template.c` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/vp9itxfm_neon.S` |
| `libavcodec/aarch64/neon.S` (for `transpose_8x8H`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/neon.S` |
| `libavutil/aarch64/asm.S` (for `movrel`, `function`, `endfunc`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/asm.S` |
| (whatever else `vp9dsp_template.c` transitively needs) | LGPL-2.1+ | as required |
A `external/ffmpeg-snapshot/COPYING.LGPL` and `external/ffmpeg-snapshot/PROVENANCE.md` document the upstream commit (n7.1.3 tag, commit hash) and the verbatim-copy guarantee.
## 6. Known constraints / failure modes carried from Phase 0
Repeated here so Phase 4 (plan) can bind against them without
re-derivation:
- **C1**: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
- **C2**: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
- **C3**: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
- **C4**: subgroupSupportedOperations = BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD (no arithmetic reductions like `subgroupAdd`; but `subgroupShuffle` *is* available — corrected per phase5.md finding 6). Workaround needed for accumulators: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite. `subgroupShuffle` is an alternative to shared-mem transpose for Phase 7.
- **C5**: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
- **C6**: shared LPDDR4x bus; GPU sees ~47 GB/s vs CPU ~1215 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
- **C7**: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
- **C8**: VC7 thermal throttle at 85°C GPU / 80°C CPU. Phase 7 measurements should record temp before/during/after to flag throttling.
## 7. What Phase 2 does *not* close
- The harness architecture (single binary? Two binaries — one for
bit-exact, one for throughput?). Phase 4 picks.
- Block-per-WG dispatch geometry. Phase 4 + Phase 6 sweep.
- Random-coefficient generation strategy (uniform i16 vs
realistic-distribution; the latter affects DC-only path
frequency). Phase 4 picks; Phase 7 may re-evaluate.
- Whether NEON measurement uses `clock_gettime(CLOCK_MONOTONIC_RAW)`
per-call (high overhead) or batched (more realistic for codec
use). Phase 3 picks during baseline collection.
## 8. Hand-off to Phase 3
Phase 3 measures:
- **M3-prelim**: NEON `ff_vp9_idct_idct_8x8_add_neon` throughput
on hertz, batched over 10⁶ random blocks, single-threaded,
4-thread, sched-isolated. This is the *floor*.
- **M5-prelim**: Vulkan dispatch overhead — pipeline create cost
(one-time), per-`vkCmdDispatch` cost (per-frame-equivalent),
per-`vkQueueSubmit + vkQueueWaitIdle` cost (per-completion).
Bound below which kernel batching is mandatory.
Both are measurements on the *existing* substrate. Neither
requires writing any shader code. Phase 3 closes before Phase 4
(plan) begins.