Path B pivot + Phase 0-3 closed with first baseline numbers

This is a from-scratch initial commit on a fresh .git. The original scaffold commit (7510b56) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection. Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A. Phases closed: Phase 0 — substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md) Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md) Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md) Phase 3 — real baseline measurements on hertz (docs/phase3.md): M1 bit-exact 100.0000 % (10000/10000) M3 NEON IDCT8 single 8.171 Mblock/s (122.4 ns/block) M5a empty Vulkan submit 22.66 us M5b 1-WG noop dispatch 55.60 us M5 delta 32.95 us/dispatch => per-dispatch overhead is ~455x per-NEON-block cost; Phase 4 must batch at frame level or close to it. Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim. Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00
commit dcbbc77038
22 changed files with 9030 additions and 0 deletions
@@ -0,0 +1,212 @@
+---
+phase: 2
+status: closed 2026-05-18
+date_opened: 2026-05-18
+parent: phase1.md
+target_kernel: VP9 8×8 inverse DCT (DCT_DCT variant, 8-bit pixels)
+---
+
+# Phase 2 — Situation analysis
+
+Per `dev_process.md`:
+
+> Document current state. Identify constraints, dependencies, known
+> failure modes. Reset context here — do not carry assumptions from
+> prior sessions; re-read CLAUDE.md, relevant memory files, run
+> `git status`, re-verify reachability.
+
+## 1. Context reset
+
+- Working tree state: dirty (Phase 0/1/2 docs not yet committed).
+  `.git-broken-2026-05-18/` preserved as a forensic artifact of
+  the 2026-05-18 10:25 working-tree wipe (cause undetermined).
+- CLAUDE.md re-read: no contradictions with the Path B scope set
+  in README §"Architecture (Path B)".
+- hertz reachability: confirmed via SSH; `vcgencmd`, `vulkaninfo`,
+  `apt`, sudo NOPASSWD all working as of 2026-05-17 inventory.
+  Mesa 25.0.7 / Vulkan 1.3.305 / V3D 7.1.7 stable.
+
+## 2. Reference implementations — VP9 8×8 IDCT (DCT_DCT)
+
+The Phase 1 kernel has *two* canonical reference implementations
+in FFmpeg n7.1.3 (the version installed on hertz). The harness
+will link both: the C path as the bit-exact gate (M1), the NEON
+path as the throughput baseline (M3).
+
+### 2.1 C reference
+
+- **Source**: `libavcodec/vp9dsp_template.c`, function `idct_idct_8x8_add_c`
+- **Spec basis**: VP9 specification §8.7 — Inverse transform process
+- **Signature**:
+
+  ```c
+  static void idct_idct_8x8_add_c(uint8_t *_dst, ptrdiff_t stride,
+                                  int16_t *_block, int eob);
+  ```
+
+- **Algorithm** (8-bit path):
+  1. If `eob == 1` (DC-only): single `(coef * 11585 * 11585)` round, broadcast to 8×8 with `+pred, clamp[0,255]`.
+  2. Otherwise: 8 column passes through `idct8_1d` → tmp[64]. Zero the input block. 8 row passes through `idct8_1d` → out[8]. Per-element `(out + 16) >> 5`, add to `dst`, `av_clip_pixel`.
+- **`idct8_1d`**: 1-D 8-point inverse DCT, 8 trigonometric multiply-add stages with Q14 fixed-point constants then 8-butterfly add/sub stages. All arithmetic is signed int32 (`dctint`).
+- **Q14 constants** (matched against VP9 spec §8.7.1.4):
+  | symbol | value | trig identity |
+  |---|---|---|
+  | cospi_16_64 | 11585 | cos(π/4) × 2^14 ≈ 0.70711 |
+  | cospi_24_64 |  6270 | cos(3π/8) × 2^14 ≈ 0.38268 |
+  | cospi_8_64  | 15137 | sin(3π/8) × 2^14 ≈ 0.92388 |
+  | cospi_28_64 |  3196 | cos(7π/16) × 2^14 ≈ 0.19509 |
+  | cospi_4_64  | 16069 | sin(7π/16) × 2^14 ≈ 0.98079 |
+  | cospi_20_64 |  9102 | cos(5π/16) × 2^14 ≈ 0.55557 |
+  | cospi_12_64 | 13623 | sin(5π/16) × 2^14 ≈ 0.83147 |
+
+  Rounding convention: `(product + (1 << 13)) >> 14`, i.e. round-half-up at bit 14.
+
+- **License**: LGPL-2.1-or-later (FFmpeg).
+- **Side effect**: zeroes the input `block[]` (idempotency requirement; matches spec).
+
+### 2.2 NEON reference
+
+- **Source**: `libavcodec/aarch64/vp9itxfm_neon.S`, symbol `ff_vp9_idct_idct_8x8_add_neon`
+- **Signature** (same as C):
+  ```
+  void ff_vp9_idct_idct_8x8_add_neon(uint8_t *dst, ptrdiff_t stride,
+                                     int16_t *block, int eob);
+  ```
+  Registers: `x0=dst, x1=stride, x2=block, w3=eob`.
+- **Internal dependencies** (must be copied alongside the .S):
+  | macro / symbol | location | role |
+  |---|---|---|
+  | `idct8` | `vp9itxfm_neon.S` | 1-D 8-pt IDCT, fully unrolled with `dmbutterfly*` |
+  | `dmbutterfly0` | `vp9itxfm_neon.S` | rotation by π/4 (the `cospi_16_64` case) |
+  | `dmbutterfly` | `vp9itxfm_neon.S` | general 2-input rotation `[a,b] → [a·c1−b·c2, a·c2+b·c1]` (`Q14`) |
+  | `dmbutterfly_l` | `vp9itxfm_neon.S` | wide-form (4×i32 acc) for `dmbutterfly` |
+  | `butterfly_8h` | `vp9itxfm_neon.S` | trivial `[a+b, a−b]` on `int16x8_t` |
+  | `transpose_8x8H` | `libavcodec/aarch64/neon.S` | in-place 8×8 i16 transpose |
+  | `idct_coeffs` | `vp9itxfm_neon.S` (`const`) | Q14 trig constants table, aligned 4 |
+  | `movrel` | `libavutil/aarch64/asm.S` | PIC-aware constant-pool relocation helper |
+- **License**: LGPL-2.1-or-later (Google, 2016).
+- **Performance shape**: full unrolled 8-pt butterfly with NEON `smull/smlsl/smlal` + `rshrn` for the Q14 round-shift; output uses `sqxtun` for saturated narrow to u8. Estimated ~80 NEON instructions for the steady state (non-DC) path.
+
+### 2.3 AV1 equivalence note
+
+AV1's 8×8 DCT_DCT transform (`av1_iidentity8_iidentity8_c` vs `av1_idct8_idct8_c` family in `libavcodec/av1dsp/...`) shares the same 1-D 8-point structure but with **different** scaling: AV1 uses 12-bit fixed-point (`>> 12`) and a slightly different rounding shift due to its different transform-stage bit growth model. Calling our VP9 IDCT shader on AV1 coefficients will produce wrong output. **AV1 support is out of scope for Phase 1.** A Phase-N variant can fork the shader with the AV1 constants once Phase 1 has proven the VP9 path.
+
+## 3. Vulkan compute dispatch path
+
+Hertz exposes V3D 7.1 via Mesa's v3dv driver as Vulkan
+`PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU`, API 1.3.305, conformance
+1.3.8.3. The compute-only dispatch path is:
+
+```
+host program
+  ├─ vkCreateInstance / vkEnumeratePhysicalDevices (picks V3D 7.1.7.0)
+  ├─ vkCreateDevice (queue family with COMPUTE_BIT, no graphics needed)
+  ├─ vkCreateBuffer x N (SSBOs for block coeffs in / dst pixels in+out)
+  │     - buffer flags: STORAGE_BUFFER_BIT | TRANSFER_SRC/DST
+  │     - memory type: HOST_VISIBLE | HOST_COHERENT (zero-copy on shared LPDDR4x)
+  ├─ vkCreateDescriptorSetLayout (≤8 SSBOs per layout — Pi 5 limit)
+  ├─ vkCreateShaderModule (SPIR-V from glslang)
+  ├─ vkCreateComputePipeline
+  ├─ vkBeginCommandBuffer
+  │     vkCmdBindPipeline / vkCmdBindDescriptorSets / vkCmdPushConstants
+  │     vkCmdDispatch(group_count_x, 1, 1)   # one WG per ~K blocks
+  ├─ vkQueueSubmit + vkQueueWaitIdle (or fence) — this is the measured op
+  └─ (read back via the HOST_VISIBLE buffer, or alias it to the same memory the CPU populated)
+```
+
+Per Phase 0 §2 inside-view limits, the relevant constraints
+for this kernel:
+
+- ≤8 SSBOs per stage → group inputs/outputs into ≤8 bindings (we
+  only need 2: `block[]` in, `dst[]` in/out).
+- Shared mem ≤16 KiB → each 8×8 block fits trivially (256 B in
+  i16 plus 64 B in u8). One WG can carry dozens of blocks of
+  shared state if useful.
+- Subgroup size = 16 (fixed). One workgroup of 64 invocations =
+  4 subgroups; one block per subgroup is a natural shape (each
+  16-lane subgroup processes 8×8 = 64 pixels in 4 cycles of
+  subgroup work).
+
+## 4. Build path on hertz
+
+Already installed (2026-05-17): cmake 3.31, ninja 1.12, gcc (Debian
+trixie default), `libvulkan-dev 1.4.309`, `glslang-tools 15.1.0`,
+`spirv-tools 2025.1`, `libdrm-dev 2.4.131`, `vulkan-tools 1.4.304`.
+
+Missing but cheap:
+- `libavcodec-dev` — only needed if the harness wants to link
+  against system libavcodec for cross-checks against the dynamic
+  dispatcher. *Not* needed for the source-copy approach (preferred,
+  see §5).
+
+## 5. Reference-copy strategy (vs system-libavcodec link)
+
+**Decision: source-copy the 3 FFmpeg files into `external/ffmpeg-snapshot/`.**
+
+Rationale:
+- System `libavcodec.so` on hertz is symbol-stripped (`nm` returns
+  empty for `ff_vp9_idct_*`). Internal NEON entry points are not
+  reachable via `dlsym`.
+- The two reference implementations (C, NEON) plus their macro/
+  data dependencies total ~3 files / ~600 lines. Source-copy is
+  smaller than the dlopen plumbing would be.
+- LGPL-2.1-or-later (FFmpeg license) is propagation-compatible
+  with the harness binary if the harness binary itself is GPL
+  or LGPL. The kernel shaders and dispatch library stay
+  separately-licensed (BSD-2-Clause, default for this project).
+- Pinning to `n7.1.3` matches hertz's runtime libavcodec version,
+  so any in-session sanity cross-check against the running Mesa
+  / video tooling stays consistent.
+
+Files to vendor:
+
+| Source | License | Target path under `daedalus-fourier/` |
+|---|---|---|
+| `libavcodec/vp9dsp_template.c` | LGPL-2.1+ | `external/ffmpeg-snapshot/vp9dsp_template.c` |
+| `libavcodec/aarch64/vp9itxfm_neon.S` | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/vp9itxfm_neon.S` |
+| `libavcodec/aarch64/neon.S` (for `transpose_8x8H`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/neon.S` |
+| `libavutil/aarch64/asm.S` (for `movrel`, `function`, `endfunc`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/asm.S` |
+| (whatever else `vp9dsp_template.c` transitively needs) | LGPL-2.1+ | as required |
+
+A `external/ffmpeg-snapshot/COPYING.LGPL` and `external/ffmpeg-snapshot/PROVENANCE.md` document the upstream commit (n7.1.3 tag, commit hash) and the verbatim-copy guarantee.
+
+## 6. Known constraints / failure modes carried from Phase 0
+
+Repeated here so Phase 4 (plan) can bind against them without
+re-derivation:
+
+- **C1**: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
+- **C2**: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
+- **C3**: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
+- **C4**: subgroupSupportedOperations = BASIC + VOTE only → no `subgroupAdd`/etc. for accumulator reductions. Workaround: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite.
+- **C5**: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
+- **C6**: shared LPDDR4x bus; GPU sees ~4–7 GB/s vs CPU ~12–15 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
+- **C7**: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
+- **C8**: VC7 thermal throttle at 85°C GPU / 80°C CPU. Phase 7 measurements should record temp before/during/after to flag throttling.
+
+## 7. What Phase 2 does *not* close
+
+- The harness architecture (single binary? Two binaries — one for
+  bit-exact, one for throughput?). Phase 4 picks.
+- Block-per-WG dispatch geometry. Phase 4 + Phase 6 sweep.
+- Random-coefficient generation strategy (uniform i16 vs
+  realistic-distribution; the latter affects DC-only path
+  frequency). Phase 4 picks; Phase 7 may re-evaluate.
+- Whether NEON measurement uses `clock_gettime(CLOCK_MONOTONIC_RAW)`
+  per-call (high overhead) or batched (more realistic for codec
+  use). Phase 3 picks during baseline collection.
+
+## 8. Hand-off to Phase 3
+
+Phase 3 measures:
+- **M3-prelim**: NEON `ff_vp9_idct_idct_8x8_add_neon` throughput
+  on hertz, batched over 10⁶ random blocks, single-threaded,
+  4-thread, sched-isolated. This is the *floor*.
+- **M5-prelim**: Vulkan dispatch overhead — pipeline create cost
+  (one-time), per-`vkCmdDispatch` cost (per-frame-equivalent),
+  per-`vkQueueSubmit + vkQueueWaitIdle` cost (per-completion).
+  Bound below which kernel batching is mandatory.
+
+Both are measurements on the *existing* substrate. Neither
+requires writing any shader code. Phase 3 closes before Phase 4
+(plan) begins.