Files
daedalus-fourier/docs/phase2.md
T
marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00

213 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
phase: 2
status: closed 2026-05-18
date_opened: 2026-05-18
parent: phase1.md
target_kernel: VP9 8×8 inverse DCT (DCT_DCT variant, 8-bit pixels)
---
# Phase 2 — Situation analysis
Per `dev_process.md`:
> Document current state. Identify constraints, dependencies, known
> failure modes. Reset context here — do not carry assumptions from
> prior sessions; re-read CLAUDE.md, relevant memory files, run
> `git status`, re-verify reachability.
## 1. Context reset
- Working tree state: dirty (Phase 0/1/2 docs not yet committed).
`.git-broken-2026-05-18/` preserved as a forensic artifact of
the 2026-05-18 10:25 working-tree wipe (cause undetermined).
- CLAUDE.md re-read: no contradictions with the Path B scope set
in README §"Architecture (Path B)".
- hertz reachability: confirmed via SSH; `vcgencmd`, `vulkaninfo`,
`apt`, sudo NOPASSWD all working as of 2026-05-17 inventory.
Mesa 25.0.7 / Vulkan 1.3.305 / V3D 7.1.7 stable.
## 2. Reference implementations — VP9 8×8 IDCT (DCT_DCT)
The Phase 1 kernel has *two* canonical reference implementations
in FFmpeg n7.1.3 (the version installed on hertz). The harness
will link both: the C path as the bit-exact gate (M1), the NEON
path as the throughput baseline (M3).
### 2.1 C reference
- **Source**: `libavcodec/vp9dsp_template.c`, function `idct_idct_8x8_add_c`
- **Spec basis**: VP9 specification §8.7 — Inverse transform process
- **Signature**:
```c
static void idct_idct_8x8_add_c(uint8_t *_dst, ptrdiff_t stride,
int16_t *_block, int eob);
```
- **Algorithm** (8-bit path):
1. If `eob == 1` (DC-only): single `(coef * 11585 * 11585)` round, broadcast to 8×8 with `+pred, clamp[0,255]`.
2. Otherwise: 8 column passes through `idct8_1d` → tmp[64]. Zero the input block. 8 row passes through `idct8_1d` → out[8]. Per-element `(out + 16) >> 5`, add to `dst`, `av_clip_pixel`.
- **`idct8_1d`**: 1-D 8-point inverse DCT, 8 trigonometric multiply-add stages with Q14 fixed-point constants then 8-butterfly add/sub stages. All arithmetic is signed int32 (`dctint`).
- **Q14 constants** (matched against VP9 spec §8.7.1.4):
| symbol | value | trig identity |
|---|---|---|
| cospi_16_64 | 11585 | cos(π/4) × 2^14 ≈ 0.70711 |
| cospi_24_64 | 6270 | cos(3π/8) × 2^14 ≈ 0.38268 |
| cospi_8_64 | 15137 | sin(3π/8) × 2^14 ≈ 0.92388 |
| cospi_28_64 | 3196 | cos(7π/16) × 2^14 ≈ 0.19509 |
| cospi_4_64 | 16069 | sin(7π/16) × 2^14 ≈ 0.98079 |
| cospi_20_64 | 9102 | cos(5π/16) × 2^14 ≈ 0.55557 |
| cospi_12_64 | 13623 | sin(5π/16) × 2^14 ≈ 0.83147 |
Rounding convention: `(product + (1 << 13)) >> 14`, i.e. round-half-up at bit 14.
- **License**: LGPL-2.1-or-later (FFmpeg).
- **Side effect**: zeroes the input `block[]` (idempotency requirement; matches spec).
### 2.2 NEON reference
- **Source**: `libavcodec/aarch64/vp9itxfm_neon.S`, symbol `ff_vp9_idct_idct_8x8_add_neon`
- **Signature** (same as C):
```
void ff_vp9_idct_idct_8x8_add_neon(uint8_t *dst, ptrdiff_t stride,
int16_t *block, int eob);
```
Registers: `x0=dst, x1=stride, x2=block, w3=eob`.
- **Internal dependencies** (must be copied alongside the .S):
| macro / symbol | location | role |
|---|---|---|
| `idct8` | `vp9itxfm_neon.S` | 1-D 8-pt IDCT, fully unrolled with `dmbutterfly*` |
| `dmbutterfly0` | `vp9itxfm_neon.S` | rotation by π/4 (the `cospi_16_64` case) |
| `dmbutterfly` | `vp9itxfm_neon.S` | general 2-input rotation `[a,b] → [a·c1b·c2, a·c2+b·c1]` (`Q14`) |
| `dmbutterfly_l` | `vp9itxfm_neon.S` | wide-form (4×i32 acc) for `dmbutterfly` |
| `butterfly_8h` | `vp9itxfm_neon.S` | trivial `[a+b, ab]` on `int16x8_t` |
| `transpose_8x8H` | `libavcodec/aarch64/neon.S` | in-place 8×8 i16 transpose |
| `idct_coeffs` | `vp9itxfm_neon.S` (`const`) | Q14 trig constants table, aligned 4 |
| `movrel` | `libavutil/aarch64/asm.S` | PIC-aware constant-pool relocation helper |
- **License**: LGPL-2.1-or-later (Google, 2016).
- **Performance shape**: full unrolled 8-pt butterfly with NEON `smull/smlsl/smlal` + `rshrn` for the Q14 round-shift; output uses `sqxtun` for saturated narrow to u8. Estimated ~80 NEON instructions for the steady state (non-DC) path.
### 2.3 AV1 equivalence note
AV1's 8×8 DCT_DCT transform (`av1_iidentity8_iidentity8_c` vs `av1_idct8_idct8_c` family in `libavcodec/av1dsp/...`) shares the same 1-D 8-point structure but with **different** scaling: AV1 uses 12-bit fixed-point (`>> 12`) and a slightly different rounding shift due to its different transform-stage bit growth model. Calling our VP9 IDCT shader on AV1 coefficients will produce wrong output. **AV1 support is out of scope for Phase 1.** A Phase-N variant can fork the shader with the AV1 constants once Phase 1 has proven the VP9 path.
## 3. Vulkan compute dispatch path
Hertz exposes V3D 7.1 via Mesa's v3dv driver as Vulkan
`PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU`, API 1.3.305, conformance
1.3.8.3. The compute-only dispatch path is:
```
host program
├─ vkCreateInstance / vkEnumeratePhysicalDevices (picks V3D 7.1.7.0)
├─ vkCreateDevice (queue family with COMPUTE_BIT, no graphics needed)
├─ vkCreateBuffer x N (SSBOs for block coeffs in / dst pixels in+out)
│ - buffer flags: STORAGE_BUFFER_BIT | TRANSFER_SRC/DST
│ - memory type: HOST_VISIBLE | HOST_COHERENT (zero-copy on shared LPDDR4x)
├─ vkCreateDescriptorSetLayout (≤8 SSBOs per layout — Pi 5 limit)
├─ vkCreateShaderModule (SPIR-V from glslang)
├─ vkCreateComputePipeline
├─ vkBeginCommandBuffer
│ vkCmdBindPipeline / vkCmdBindDescriptorSets / vkCmdPushConstants
│ vkCmdDispatch(group_count_x, 1, 1) # one WG per ~K blocks
├─ vkQueueSubmit + vkQueueWaitIdle (or fence) — this is the measured op
└─ (read back via the HOST_VISIBLE buffer, or alias it to the same memory the CPU populated)
```
Per Phase 0 §2 inside-view limits, the relevant constraints
for this kernel:
- ≤8 SSBOs per stage → group inputs/outputs into ≤8 bindings (we
only need 2: `block[]` in, `dst[]` in/out).
- Shared mem ≤16 KiB → each 8×8 block fits trivially (256 B in
i16 plus 64 B in u8). One WG can carry dozens of blocks of
shared state if useful.
- Subgroup size = 16 (fixed). One workgroup of 64 invocations =
4 subgroups; one block per subgroup is a natural shape (each
16-lane subgroup processes 8×8 = 64 pixels in 4 cycles of
subgroup work).
## 4. Build path on hertz
Already installed (2026-05-17): cmake 3.31, ninja 1.12, gcc (Debian
trixie default), `libvulkan-dev 1.4.309`, `glslang-tools 15.1.0`,
`spirv-tools 2025.1`, `libdrm-dev 2.4.131`, `vulkan-tools 1.4.304`.
Missing but cheap:
- `libavcodec-dev` — only needed if the harness wants to link
against system libavcodec for cross-checks against the dynamic
dispatcher. *Not* needed for the source-copy approach (preferred,
see §5).
## 5. Reference-copy strategy (vs system-libavcodec link)
**Decision: source-copy the 3 FFmpeg files into `external/ffmpeg-snapshot/`.**
Rationale:
- System `libavcodec.so` on hertz is symbol-stripped (`nm` returns
empty for `ff_vp9_idct_*`). Internal NEON entry points are not
reachable via `dlsym`.
- The two reference implementations (C, NEON) plus their macro/
data dependencies total ~3 files / ~600 lines. Source-copy is
smaller than the dlopen plumbing would be.
- LGPL-2.1-or-later (FFmpeg license) is propagation-compatible
with the harness binary if the harness binary itself is GPL
or LGPL. The kernel shaders and dispatch library stay
separately-licensed (BSD-2-Clause, default for this project).
- Pinning to `n7.1.3` matches hertz's runtime libavcodec version,
so any in-session sanity cross-check against the running Mesa
/ video tooling stays consistent.
Files to vendor:
| Source | License | Target path under `daedalus-fourier/` |
|---|---|---|
| `libavcodec/vp9dsp_template.c` | LGPL-2.1+ | `external/ffmpeg-snapshot/vp9dsp_template.c` |
| `libavcodec/aarch64/vp9itxfm_neon.S` | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/vp9itxfm_neon.S` |
| `libavcodec/aarch64/neon.S` (for `transpose_8x8H`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/neon.S` |
| `libavutil/aarch64/asm.S` (for `movrel`, `function`, `endfunc`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/asm.S` |
| (whatever else `vp9dsp_template.c` transitively needs) | LGPL-2.1+ | as required |
A `external/ffmpeg-snapshot/COPYING.LGPL` and `external/ffmpeg-snapshot/PROVENANCE.md` document the upstream commit (n7.1.3 tag, commit hash) and the verbatim-copy guarantee.
## 6. Known constraints / failure modes carried from Phase 0
Repeated here so Phase 4 (plan) can bind against them without
re-derivation:
- **C1**: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
- **C2**: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
- **C3**: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
- **C4**: subgroupSupportedOperations = BASIC + VOTE only → no `subgroupAdd`/etc. for accumulator reductions. Workaround: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite.
- **C5**: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
- **C6**: shared LPDDR4x bus; GPU sees ~47 GB/s vs CPU ~1215 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
- **C7**: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
- **C8**: VC7 thermal throttle at 85°C GPU / 80°C CPU. Phase 7 measurements should record temp before/during/after to flag throttling.
## 7. What Phase 2 does *not* close
- The harness architecture (single binary? Two binaries — one for
bit-exact, one for throughput?). Phase 4 picks.
- Block-per-WG dispatch geometry. Phase 4 + Phase 6 sweep.
- Random-coefficient generation strategy (uniform i16 vs
realistic-distribution; the latter affects DC-only path
frequency). Phase 4 picks; Phase 7 may re-evaluate.
- Whether NEON measurement uses `clock_gettime(CLOCK_MONOTONIC_RAW)`
per-call (high overhead) or batched (more realistic for codec
use). Phase 3 picks during baseline collection.
## 8. Hand-off to Phase 3
Phase 3 measures:
- **M3-prelim**: NEON `ff_vp9_idct_idct_8x8_add_neon` throughput
on hertz, batched over 10⁶ random blocks, single-threaded,
4-thread, sched-isolated. This is the *floor*.
- **M5-prelim**: Vulkan dispatch overhead — pipeline create cost
(one-time), per-`vkCmdDispatch` cost (per-frame-equivalent),
per-`vkQueueSubmit + vkQueueWaitIdle` cost (per-completion).
Bound below which kernel batching is mandatory.
Both are measurements on the *existing* substrate. Neither
requires writing any shader code. Phase 3 closes before Phase 4
(plan) begins.