Path B pivot + Phase 0-3 closed with first baseline numbers

This is a from-scratch initial commit on a fresh .git. The original scaffold commit (7510b56) and the earlier session's working-tree docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted .git is preserved at .git-broken-2026-05-18/ (gitignored) for forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar cores; blocked by BCM2712 silicon-RoT mask-ROM signature check) to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or direct DRM, on stock signed Pi 5 / CM5). See README.md and docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
- Phase 0: substrate audit; Path A blocked, Path B open; codec-back-end-fits-QPU finding (docs/phase0.md)
- Phase 1: first kernel locked (VP9 / AV1 8x8 inverse DCT) with publish-before-measure R = M2/M3 decision rules (docs/phase1.md)
- Phase 2: reference impls mapped; FFmpeg n7.1.3 source vendored under external/ffmpeg-snapshot/ (PROVENANCE.md pins commit f46e514 + per-file SHA-256s) (docs/phase2.md)
- Phase 3: real baseline measurements on hertz (docs/phase3.md):
    M1   bit-exact               100.0000 % (10000/10000)
    M3   NEON IDCT8 single       8.171 Mblock/s (122.4 ns/block)
    M5a  empty Vulkan submit     22.66 us
    M5b  1-WG noop dispatch      55.60 us
    M5   delta                   32.95 us/dispatch
  => a full noop dispatch (M5b, 55.60 us) costs ~455x one NEON block (122.4 ns); the 32.95 us delta alone is ~269x. Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c, vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} + external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM). Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00
commit dcbbc77038
22 changed files with 9030 additions and 0 deletions
@@ -0,0 +1,96 @@
+---
+name: Claude-Assisted Development Process (9(+1)-phase loop)
+description: Default workflow for any non-trivial implementation — substrate/motivation/inventory, formulate, analyze, baseline, plan, second-model review, implement, verify, closing (package+ship), memory-update; with explicit loopback edges
+type: feedback
+originSessionId: 83898ac9-e61f-4c44-8429-0154cb12d124
+---
+Markus's standardized loop for our implementation work. Apply by default whenever a task is bigger than a one-liner. Skipping phases is a deliberate choice that should be flagged, not a default.
+
+## Phase 0 — Substrate / Motivation / Inventory
+
+Pre-formulation. Lock the research question and assemble the substrate *before* Phase 1 commits to a measurable goal. Output: a `phase0_findings.md` artifact that future phases can refer back to without re-deriving.
+
+- **Research question + mechanism captured.** State the question in one sentence. Capture any operator-supplied mechanism (the "why this question, how does it work" insight) verbatim — it's the load-bearing claim Phase 1 binds against.
+- **Predecessor carry-over: state vs data.** When a campaign succeeds another, categorize what transfers. *State* (installed packages, governor settings, system tweaks, source-read file:line pointers, protocol designs, parser scripts) carries forward. *Data* (drop counts, perf percentages, threshold values, baseline floors) does not — it is reference history only. Binding cells in this campaign anchor to in-session-acquired numbers, even if the predecessor measured an identical condition.
+- **Tooling and measurement-instrument inventory.** What's installed, what would need installing, what extensions/protocols the live system actually supports. Live verification, not paper compatibility.
+- **In-session baseline anchor.** Re-run the reference rep — N=3 minimum if the baseline is load-bearing for the campaign's premise — *before* any instrument changes. **If the predecessor's reference floor doesn't replicate at N=3 in the same session, that is the campaign result.** Don't build multi-phase infrastructure on an N=1 historical floor. See `feedback_replicate_baseline_first.md`.
+- **Open questions tabled.** What's not known going into Phase 1. Phase 1 locks against the knowns; Phase 0 surfaces the unknowns explicitly so they don't slip into binding cells unverified.
+
+## Phase 1 — Goal Formulation
+Define the objective in measurable terms. State what success looks like *before* touching anything. The chosen metric is a **hypothesis** about what to measure, not an axiom — Phase 3 may invalidate it.
+
+## Phase 2 — Situation Analysis
+Document current state. Identify constraints, dependencies, known failure modes. **Reset context here** — do not carry assumptions from prior sessions; re-read CLAUDE.md, relevant memory files, run `git status`, re-verify reachability.
+
+## Phase 3 — Baseline Measurements
+Take concrete measurements *before* any changes. Paste raw output into DokuWiki at capture time — verbatim, not paraphrased. The Phase 5 artifact is the raw data, not Claude's summary.
+
+**Real data, not theatre.** Phase 3 exists to use AI capacity for absorbing wide, low-level instrumentation a human reader would skim past. Attaching strace / perf / ftrace / eBPF / custom tripwires to the process under test is real Phase 3; scraping mpv's stdout dropped-frame counter is not. Discriminator: if a human with bash and grep could produce the same baseline, it isn't Phase 3 yet — go down to the syscall / call-path / MMIO / register layer. See `feedback_phase3_no_theatre.md`.
+
+**Anti-fabrication:**
+- Every cited value traces to a visible tool invocation or verbatim paste-in. If a measurement wasn't taken, write "not measured" — never an estimate, inference, or recall from training / prior sessions / sibling-host memory.
+- Raw before derived. A derived number (FPS, p99, error rate) appears alongside the raw stream it came from, never alone.
+- Rig failure is the finding. Empty strace, dead UART, perf counter that didn't increment → that *is* the Phase 3 result. Loop back to Phase 2 to fix the rig; do not synthesize plausible-looking baseline data to keep momentum.
+
+- **If baseline reveals the Phase 1 metric was tracking the wrong thing → loop back to Phase 1** with the corrected target. (Example: "max H.264 FPS" Phase 1 metric, but baseline shows DMA-setup + sync overhead dwarfs decode → real metric is bytes-copied-per-second / EGL surface-import time, not FPS.)
+
+**Measurements describe what the system *does*, not what it *should do*.** Baseline data is evidence, not a specification. Do NOT derive API call sequences, struct layouts, or parameter values from observed behaviour (strace, perf, example output). Observable behaviour may reflect bugs, workarounds, or implementation accidents — anything you copy from it inherits those.
+
+## Phase 4 — Plan
+Formulate the approach. Identify what will and will not be touched. State expected outcome of implementation in the *same* measurable terms used in Phase 1/3.
+
+## Phase 5 — Second Model Review
+Goal, situation, measurements, plan get pasted into **DokuWiki**. Markus reviews and redacts, then initiates the handover to a fresh model instance. **Claude does not curate the artifact going to the reviewer** — that would re-introduce the blind-spot accumulation the review is meant to escape. Do not summarize when handing over; paste the actual artifacts.
+
+## Phase 6 — Implementation
+Execute the plan. Scope strictly to what was planned — resist feature creep, refactor-creep, "while I'm here" cleanups, and over-eager scope expansion. If a plan revision is needed mid-implementation, surface it explicitly and re-enter Phase 4.
+
+**Contract before code.** Before writing or modifying any call site:
+- Read the API contract — kernel docs, header comments, and upstream source for every call touched.
+- State the contract explicitly before implementing against it (in the plan, the commit message, or a comment — somewhere reviewable).
+- If the contract cannot be found: stop and surface the gap. Don't infer it from baseline behaviour or sibling code.
+
+**Copying from baseline measurements is not implementation. It is transcription of potentially broken behaviour.** A deliverable that matches baseline bytes but violates the API contract is not a deliverable — it is a deferred bug.
+
+### What "state the contract explicitly" looks like
+
+Worked example: `0012-h264-omit-scaling-matrix-frame-based.patch` in `~/src/ohm_gl_fix/phase6/step1/`. The commit message opens with the contract before any code:
+
+> VAAPI signals "explicit scaling lists are present in the bitstream" implicitly: the consumer (ffmpeg-vaapi, mpv, etc.) sends a `VAIQMatrixBufferH264` alongside `RenderPicture` iff `sps_scaling_matrix_present_flag || pps_scaling_matrix_present_flag`. When the bitstream uses default (flat) scaling, no IQMatrixBuffer arrives […]
+>
+> Earlier draft of this patch unconditionally omitted SCALING_MATRIX in FRAME_BASED. That's **corpus-correct** (bbb has no explicit scaling lists) but the **wrong predicate**: the kernel-side gating is by "matrix-supplied vs. not," not by decode mode. […]
+>
+> Contract verification (audit_0008_decode_params_2026-05-01.md + hantro_h264.c::assemble_scaling_list): the kernel uses the supplied matrix when SCALING_MATRIX is in the control batch and falls back to spec-defined defaults when absent. Mode-independent.
+
+What this gets right:
+- **Contract first**: per-control rules cited from kernel doc (`ext-ctrls-codec-stateless.rst:752`), kernel driver (`hantro_h264.c::assemble_scaling_list`), and sibling implementation (gst-plugins-bad commit 9e3e775) — *before* any patch hunks.
+- **Corpus-correct ≠ spec-correct, called out by name**: the rejected predicate ("omit SCALING_MATRIX in FRAME_BASED") *did* match the BBB baseline. It still got rejected, because the contract said the gate is "matrix-supplied vs. not," not "decode mode." This is exactly the Phase 3-derived-implementation trap.
+- **Then** the diff implements one branch per contract clause: SPS/PPS/DECODE_PARAMS always, SCALING_MATRIX iff `matrix_set`, SLICE_PARAMS iff SLICE_BASED, PRED_WEIGHTS iff SLICE_BASED + `V4L2_H264_CTRL_PRED_WEIGHTS_REQUIRED`.
+
+Mirror this format anywhere reviewable: PR description, commit message body, plan section, or a header comment block. The shape is "contract clauses with citations → code that maps 1:1 to those clauses."
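
A minimal Python sketch of the "one branch per contract clause" shape described above, assuming a simplified model in which the control batch is just a list of names. `FrameState`, `controls_for`, and the field names are hypothetical illustrations; the real patch works on V4L2 controls in C and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FrameState:
    """Hypothetical predicate inputs; field names are illustrative only."""
    matrix_set: bool             # consumer supplied a scaling matrix buffer
    slice_based: bool            # decode mode is SLICE_BASED (else FRAME_BASED)
    pred_weights_required: bool  # stand-in for V4L2_H264_CTRL_PRED_WEIGHTS_REQUIRED


def controls_for(state: FrameState) -> List[str]:
    """Build the control batch with one branch per contract clause."""
    batch = ["SPS", "PPS", "DECODE_PARAMS"]   # clause 1: always present
    if state.matrix_set:                      # clause 2: gated on matrix supplied,
        batch.append("SCALING_MATRIX")        #           not on decode mode
    if state.slice_based:                     # clause 3: slice mode only
        batch.append("SLICE_PARAMS")
        if state.pred_weights_required:       # clause 4: slice mode + weights needed
            batch.append("PRED_WEIGHTS")
    return batch
```

The rejected predicate would have tested `slice_based` in clause 2; keeping each clause as its own branch makes that kind of substitution visible in review.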
+
+## Phase 7 — Verification Measurements
+Repeat measurements from Phase 3. Compare explicitly against baseline.
+- **If the delta does not match Phase 4's prediction → loop back to Phase 4** (re-plan). Do not declare success when the numbers say otherwise; an unexplained delta is a finding, not a footnote.
+
+## Phase 8 — Closing (Package & Ship)
+Ship the deliverable to its consumption point. Working code that lives only in a checkout is half a deliverable — the next session has to re-discover it, the fleet doesn't get the fix, and the loop's value evaporates.
+
+- **Kernel patch → kernel-agent package.** Route through the kernel-agent flow (`fleet/<host>.yaml` + scope-tagged patches) so the kernel package gets properly built, signed, and published. Don't leave loose `.patch` files in a working tree. See `project_kernel_agent.md` for the manifest shape; `linux-ampere-fourier` and `linux-fresnel-fourier` are the canonical examples.
+- **Program / library change → marfrit-packages.** Add or update a PKGBUILD (Arch/ALARM) or debian/ tree (deb), push to `git.reauktion.de/marfrit/marfrit-packages`, and let `.gitea/workflows/build.yml` produce + sign + publish to `packages.reauktion.de`. See `project_marfrit_packages.md`. Local-only fixes go upstream as PR-quality diffs into the same overlay.
+- **Skipping is a deliberate choice.** If the change is one-shot scratch work (debugging tripwire, throw-away script), say so explicitly in the closing note. The default is: it gets packaged.
+- **Re-verify on the deploy host with the packaged artifact.** A clean Phase 7 result from a hand-rolled dev build (e.g. `meson -Dbuildtype=release && ninja`) is **not** the same as the `.pkg.tar.zst` / `.deb` that the deploy host installs. Distro packaging flags (Arch makepkg's `-O2 + FORTIFY + stack-protector-strong + stack-clash-protection` vs meson's `-O3 -DNDEBUG`, debhelper's hardening defaults, lto toggles) vectorise / unroll loops differently and routinely unmask latent UB the dev build folded away. Pull the published package down via the package manager and re-run the Phase 7 success criterion against it before closing — until that PASSes, the loop is not done. See `feedback_package_build_flags_unmask_bugs.md` for the iter39 incident that codified this.
+
+## Phase 9 — Memory Update
+Loop terminates here. Distill the lesson into a memory entry — what was the mistake the loop caught, what's the rule that would shorten the next cycle. Do not let the lesson rot in chat history.
+
+---
+
+## Loopback edges (summary)
+- Phase 3 → Phase 1 (metric was wrong)
+- Phase 7 → Phase 4 (plan didn't deliver predicted delta)
+- Any phase → Phase 0 (substrate was wrong: predecessor baseline didn't replicate, mechanism doesn't engage on this stack, or the data inverts the premise → re-anchor or honest close)
+- Phase 9 closes the loop
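
The edges above can be read as a small transition table. The sketch below is one possible encoding, assuming hypothetical finding labels (`metric_wrong`, `delta_mismatch`, `substrate_wrong`) that do not appear in the process document itself.

```python
from typing import Optional

# Forward edges: each phase hands off to the next; Phase 9 has no successor.
FORWARD = {n: n + 1 for n in range(9)}

# Phase-specific loopback edges, keyed by (current phase, finding label).
LOOPBACK = {
    (3, "metric_wrong"): 1,    # baseline showed the metric tracked the wrong thing
    (7, "delta_mismatch"): 4,  # verification did not match the plan's prediction
}


def next_phase(current: int, finding: Optional[str] = None) -> Optional[int]:
    """Return the phase to enter next, or None once Phase 9 closes the loop."""
    if finding == "substrate_wrong":
        return 0  # any phase may loop back to Phase 0
    if (current, finding) in LOOPBACK:
        return LOOPBACK[(current, finding)]
    return FORWARD.get(current)  # unknown findings fall through to the forward edge
```

For example, `next_phase(7, "delta_mismatch")` re-enters Phase 4, while `next_phase(9)` returns `None`.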
+
+## Why this exists
+Several recurring failures in prior work have been codified as individual rules: observer-first, simulate-before-flash, three-strikes-then-verify, "trust eyes not vibes," scope-strictly-to-plan, no-fake-dry-run. Each of those addresses a symptom; this loop is the structural fix. Use it as the spine and let those rules show up as rejection patterns inside the appropriate phases.
@@ -0,0 +1,239 @@
+---
+phase: 0
+status: closed 2026-05-18
+date_opened: 2026-05-17
+date_closed: 2026-05-18
+research_method: three rounds of parallel web research (Sonnet via Agent), plus hands-on hertz substrate inventory and live `vulkaninfo` capture
+target_hardware: hertz (Pi 5 8 GB) for dev; higgs (CM5) eventual user target
+---
+
+# Phase 0 — Substrate / motivation / inventory
+
+This is the consolidated Phase 0 record. Path A (custom VPU firmware)
+is **closed at the silicon-RoT step**; Path B (QPU compute via the
+existing Mesa `v3d` driver) is **open**. The remainder of the
+project lives in Path B.
+
+The earlier session produced two separate Phase 0 artifacts that
+were lost when the working tree was wiped at 2026-05-18 10:25
+(`.git-broken-2026-05-18/` retains the corrupted state if needed).
+This document supersedes both.
+
+---
+
+## 1. Research question
+
+Verbatim from `README.md`:
+
+> Community-built VP9 / AV1 software-decode back-end running on the
+> VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
+> Compute Module 5), via the existing Mesa `v3d` userspace driver.
+
+The load-bearing claim: *the QPU is programmable by us, on stock
+production hardware, and the codec back-end is a workload class
+where that programmability buys CPU time on the A76 cluster.*
+Phase 0's job is to test that claim before Phase 1 binds a metric.
+
+## 2. Substrate inventory — hertz
+
+Captured live 2026-05-17 via SSH. Full `vulkaninfo` in
+`vulkaninfo_v3d_7_1_7_hertz.txt`.
+
+| Item | Value |
+|---|---|
+| Host | hertz, Pi 5, 8 GB, eMMC + 1 TB SATA |
+| Role | LXD host for 11 containers (home-LAN spine — DNS / VPN / HA proxy / NCP / SMTP) |
+| OS | Debian 13 Trixie |
+| Kernel | `6.12.75+rpt-rpi-2712` (RPi Foundation kernel, 2026-03-11) |
+| CPU | 4× Cortex-A76 @ 2.8 GHz |
+| GPU clock | V3D 7.1 @ 1000 MHz (slight OC; spec 960 MHz) |
+| Mesa | `25.0.7-2+rpt4` (`libvulkan_broadcom.so` v3dv ICD) |
+| Vulkan loader | `1.4.309` |
+| Vulkan device API | 1.3.305 (conformance 1.3.8.3) |
+| DRM nodes | `card0 → v3d` (compute target), `card1 → vc4-drm` (display), `renderD128` |
+| kernel uAPI hdr | `/usr/include/drm/v3d_drm.h` present |
+| Build tools | cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0, spirv-tools 2025.1, libdrm-dev 2.4.131 (installed 2026-05-17) |
+| User groups | mfritsche ∈ `render`, `video`, `lxd`, `sudo` |
+| Memory pressure | 7.9 GiB RAM, ~3 GiB available; 6 GiB zram, ~2.8 GiB in use (cohabitation with LXD spine) |
+| Watchdog | yes — power-cut reboot via Himbeere plug if hertz crashes (acknowledged dev cost: household DNS/VPN drops during each reboot cycle) |
+
+**Inside-view V3D 7.1 compute envelope** (from
+`vulkaninfo_v3d_7_1_7_hertz.txt`):
+
+| Property | Value | Implication |
+|---|---|---|
+| `maxStorageBufferRange` | 1 GiB | Bounds single-tensor size; codec working sets (frames, planes) fit trivially |
+| `maxPerStageDescriptorStorageBuffers` | 8 | Forces ≤8 SSBO bindings per dispatch — ggml-vulkan binds more, doesn't fit |
+| `maxComputeSharedMemorySize` | 16 KiB | Small tiled kernels only; codec block work (8×8, 16×16) fits easily |
+| `maxComputeWorkGroupInvocations` | 256 | Standard |
+| `maxComputeWorkGroupSize` | 256 / 256 / ? | Standard |
+| `subgroupSize` | 16 (fixed) | Matches QPU SIMD width |
+| `subgroupSupportedOperations` | BASIC + VOTE only | No arithmetic reductions — accumulate via shared memory |
+| `shaderFloat16` | **false** | Storage only; arithmetic runs FP32 |
+| `shaderInt8` | **false** | Storage only; arithmetic on widened ints |
+| `shaderInt16` | **false** | Same |
+| `storageBuffer8/16BitAccess` | true | Can load tightly-packed quantized / packed pixel data |
+| `subgroupSizeControl`, `computeFullSubgroups`, `synchronization2` | true | Modern compute features available |
+
+**Throughput envelopes** (from prior community measurements,
+not yet re-confirmed in-session):
+
+| Metric | Value | Source |
+|---|---|---|
+| V3D 7.1 theoretical FP32 peak | ~92 GFLOPS at 960 MHz | 12 QPU × 4 ALU × 2 op/cycle |
+| Direct-DRM SGEMM sustained | 21.4 GFLOPS (~23%) | `Idein/py-videocore7` |
+| Vulkan-compute `vkpeak` fp32-vec4 | 6.9 GFLOPS (~7.5%) | RPi forum benchmark thread |
+| A76 NEON sustained for matmul | ~50 GFLOPS | Multiple benchmark sources |
+| Shared LPDDR4x bus | ~17 GB/s nominal | LPDDR4x-4267 × 32 bit / 8 |
+| GPU-measured BW share | 4–7 GB/s | py-videocore7 scopy benchmark |
+| CPU NEON BW achievable | 12–15 GB/s | Pi 5 STREAM benchmarks |
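
The derived figures in the table follow from the products quoted in its Source column. A short Python check of that arithmetic, using only numbers from the table (the function names are illustrative):

```python
QPUS = 12            # QPU count, from the table
ALUS_PER_QPU = 4     # ALUs per QPU, from the table
OPS_PER_CYCLE = 2    # ops per ALU per cycle, from the table


def peak_gflops(clock_ghz: float) -> float:
    """Theoretical FP32 peak: QPUs x ALUs x ops/cycle x clock."""
    return QPUS * ALUS_PER_QPU * OPS_PER_CYCLE * clock_ghz


def fraction_of_peak(measured_gflops: float, clock_ghz: float = 0.960) -> float:
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops(clock_ghz)


def bus_gb_per_s(mega_transfers_per_s: int = 4267, bus_bits: int = 32) -> float:
    """Nominal bus bandwidth: transfer rate x bus width in bytes."""
    return mega_transfers_per_s * (bus_bits / 8) / 1000.0


if __name__ == "__main__":
    print(f"peak at 960 MHz:  {peak_gflops(0.960):.2f} GFLOPS")
    print(f"direct-DRM SGEMM: {fraction_of_peak(21.4):.1%} of peak")
    print(f"vkpeak fp32-vec4: {fraction_of_peak(6.9):.1%} of peak")
    print(f"LPDDR4x nominal:  {bus_gb_per_s():.2f} GB/s")
```

Note that the percentages are relative to the 960 MHz spec clock. A measurement taken at hertz's 1000 MHz clock would be compared against `peak_gflops(1.0)` instead, which is 96 GFLOPS.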
+
+## 3. Path A — closed
+
+**Custom VPU firmware loaded onto VC7 scalar cores.** This was the
+README's original framing.
+
+Blocked at the silicon-RoT step:
+
+- **BCM2712 mask ROM hardcodes RPi's public key** and unconditionally
+  verifies the second-stage bootloader (`bootsys`) on every boot
+  path (SPI flash, USB rpiboot, SD recovery). RPi holds the
+  corresponding private key.
+- `EXECUTE_CODE` mailbox tag (the only documented Pi 1–4 runtime
+  "run code on a VPU core" mechanism) **confirmed removed on Pi 5**
+  by an RPi engineer (forum.raspberrypi.com).
+- Pre-CRA EEPROM downgrade is possible (no anti-rollback fuse) but
+  only yields *older RPi-signed* EEPROMs — doesn't help.
+- OTP fuse state on stock CM5 is already the most permissive
+  possible (customer key hash = zero); the RPi-key check is
+  silicon-unconditional, not gated by OTP.
+- CM5 vs retail Pi 5: same silicon, same chain, no meaningful
+  security delta.
+- One non-software escape exists: VPU JTAG via documented test
+  points (`schlae/cm5-reveng`, Dec 2025). It requires a hardware
+  mod, higgs is a sealed-chassis unit rather than the dev unit, and
+  it would be novel research with no published firmware-injection
+  workflow. Out of scope for this project.
+
+Verdict: **structurally blocked for community use without RPi
+cooperation or hardware-RE-grade work on a sacrificial CM5.**
+
+## 4. Path B — open
+
+**QPU compute kernels via the existing Mesa `v3d` driver.** Reachable
+from userspace today on a stock signed Pi 5 / CM5 via
+`/dev/dri/card0` (Vulkan compute through `v3dv`) or `renderD128`
+(direct DRM submit, py-videocore7 style). No firmware loading.
+No signing fight. mfritsche on hertz is in the `render` group and
+can hit the device without sudo.
+
+The substrate is real:
+- `Idein/py-videocore7` runs SGEMM at 21 GFLOPS sustained on stock
+  Pi 5 with no special setup — existence proof of arbitrary QPU
+  programs.
+- Mesa v3dv is Vulkan 1.3-conformant on V3D 7.1 (Mesa 24.3+;
+  hertz runs 25.0.7).
+- The kernel `v3d` DRM driver is fully upstream and open.
+
+Phase 0 does **not** assume Path B leads to a winning result. It
+asserts only that Path B is *reachable*, where Path A isn't.
+
+## 5. Why this isn't the same project as "v3d backend for llama.cpp"
+
+A llama.cpp v3d backend was investigated mid-session and rejected
+as structurally infeasible. The verdict was decisive: GPU loses
+to CPU on raw FP32 (21 vs ~50 GFLOPS), on memory bandwidth share
+(4–7 vs 12–15 GB/s), and on quantized instruction support (no
+INT8 MAC vs A76 SDOT/UDOT). For LLM matmul, the QPU is the wrong
+substrate.
+
+**Codec back-end work is a different workload class** with
+properties that fit the QPU substantively better:
+
+| Property | LLM matmul | Codec back-end (post-entropy) |
+|---|---|---|
+| Working set per dispatch | Whole weight matrices (GB) | Per-block (8×8 / 16×16, hundreds of bytes) — fits in 16 KiB shared mem |
+| Dominant op | INT8 MAC | Integer add / shift / small-constant multiply |
+| Why GPU misses | No INT8 MAC | Less impact — fewer multiplies, mostly add/shift |
+| Memory pattern | Full-tensor stream | Sequential plane reads, TMU-friendly |
+| Parallelism | One big GEMM | Thousands of independent small blocks per frame |
+| A76 advantage | NEON SDOT/UDOT crushing it | Less specialized; QPU advantage plausible but unmeasured |
+| Bandwidth-bound? | Yes (kills the GPU) | Compute-bound at block scale |
+
+This is the load-bearing reframe between the failed llama.cpp
+investigation and the daedalus-fourier scope. Codec back-end
+*might* live on the QPU. Phase 1 measures whether it actually does.
+
+## 6. Honest probability assessment
+
+A competent outside reviewer should rate the project as **hard but
+viable**. There is one concrete precedent: the MulticoreWare /
+Imagination PowerVR OpenCL VP9 decoder (2014) achieved 1080p30 in a
+hybrid model, with CPU entropy decoding and a GPU back-end, on a
+comparable embedded GPU. There is also one concrete recent failure:
+the FFmpeg 8.0 VP9-on-Vulkan-compute attempt (2025) produced
+corrupted output on a much more capable NVIDIA target, but that
+failure was in the *attempt to move entropy onto the GPU*, not in
+the back-end.
+
+The win condition is **not** "GPU beats CPU at the same work." The
+win condition is **"GPU work overlaps with CPU work that has to
+happen anyway"** — concurrent decode where ARM does entropy and
+the QPU finishes the block-level back-end on the previous frame,
+recovering CPU time for the rest of the system (browser, audio,
+UI, the 11 LXD containers on hertz).
+
+Phase 1 measures the building block: one kernel, bit-exact, with
+numbers. Phase 2+ only if Phase 1 numbers justify it.
+
+## 7. Open questions for Phase 1
+
+1. **What's the actual single-kernel QPU throughput on a
+   codec-shaped workload?** SGEMM at 21 GFLOPS is the only public
+   number, and SGEMM is not block-IDCT-shaped. We need an in-session
+   N=3 measurement on a real codec kernel.
+
+2. **What's the ARM NEON baseline for the same kernel on the same
+   hertz?** libavcodec ships highly-tuned NEON paths for IDCT,
+   deblocking, etc. Without measuring NEON in-session, "the QPU
+   wins" or "the QPU loses" is unverifiable.
+
+3. **Vulkan compute vs direct DRM submit — which path?** Vulkan
+   has tooling, documentation, debuggability. Direct DRM has
+   ~10–15% lower per-dispatch overhead and bypasses the
+   v3dv-imposed 16 KiB shared-mem / 8-SSBO limits, at the cost
+   of writing QPU asm against the NDA ISA. Phase 1 picks one.
+
+4. **Memory bandwidth contention with concurrent ARM decode.**
+   The shared ~17 GB/s bus is the ceiling. If the QPU and ARM NEON
+   contend for bandwidth while running concurrently, the
+   "concurrent work" win disappears. Needs in-session measurement
+   once any kernel exists.
+
+5. **VC7 thermal headroom under sustained mixed CPU+GPU load.**
+   Pi 5 throttles GPU at 85°C, CPU at 80°C. hertz idles at ~64°C
+   with the LXD spine; mixed compute will push higher. Whether that
+   is sustainable on hertz with or without active cooling is an
+   open question.
+
+These are Phase 1's burden, not Phase 0's. Phase 0 closes here.
+
+## 8. Sources
+
+The earlier session's web research produced ~7000 words of substrate
+references across 6 parallel threads. The full source list lived
+in the deleted `phase0_findings.md` and `phase0_wall1_bypass.md`.
+The high-value pointers that should follow this project forward:
+
+- [Mesa `src/broadcom/qpu/qpu_instr.h`](https://github.com/Mesa3D/mesa/blob/main/src/broadcom/qpu/qpu_instr.h) — de-facto VC7 QPU ISA reference (no Broadcom-published doc; ISA under NDA)
+- [Mesa `src/broadcom/compiler/`](https://github.com/Mesa3D/mesa/tree/main/src/broadcom/compiler) — NIR→QPU compiler, the open ground truth for what V3D 7.1 can do
+- [`Idein/py-videocore7`](https://github.com/Idein/py-videocore7) — working QPU GPGPU runtime via DRM; SGEMM benchmark; existence proof
+- [`Towdo/py-videocore7`](https://github.com/Towdo/py-videocore7) — fork with more fixes
+- [Mesa `v3dv` driver source](https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/broadcom/vulkan) — Vulkan compute path
+- [Pi 5 HEVC kernel driver patch series](https://patchwork.kernel.org) — closest architectural template for ARM-side V4L2 stateless wrapping a Pi-5 hardware accelerator (search "rpi-hevc-dec")
+- [raspberrypi/usbboot secure-boot.md](https://github.com/raspberrypi/usbboot/blob/master/docs/secure-boot.md) — Wall 1 silicon-RoT confirmation
+- [schlae/cm5-reveng](https://github.com/schlae/cm5-reveng) — CM5 PCB RE; VPU JTAG test points (Dec 2025; out of Path B scope, kept as escape hatch reference)
+- [MulticoreWare / Imagination PowerVR VP9 OpenCL decoder press](https://www.design-reuse.com/news/34030/vp9-decoder-imagination-powervr-series6-gpus.html) — 2014 precedent for hybrid codec back-end on embedded GPU compute
+- [FFmpeg 8.0 part-3 VP9 Vulkan failure post](https://www.rendi.dev/blog/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding) — recent cautionary tale; failure was in entropy stage, not back-end
+- [`Halide/Halide` Vulkan Pi 5 issue #8494](https://github.com/halide/Halide/issues/8494) — known runtime edge cases on Pi 5 Vulkan
+- [Pi Forum p=2330030](https://forums.raspberrypi.com/viewtopic.php?p=2330030) — RPi engineer confirms VC7 ISA NDA + EU CRA signing rationale
+
+Future phases should add citations here as they're consumed, not
+re-derive Phase 0's substrate findings.
@@ -0,0 +1,128 @@
+---
+phase: 1
+status: open
+date_opened: 2026-05-18
+parent: phase0.md
+target_kernel: VP9 / AV1 8×8 inverse DCT (integer fixed-point)
+dev_host: hertz
+---
+
+# Phase 1 — Goal formulation
+
+Per `dev_process.md`:
+
+> Define the objective in measurable terms. State what success looks
+> like *before* touching anything. The chosen metric is a **hypothesis**
+> about what to measure, not an axiom — Phase 3 may invalidate it.
+
+## Kernel under test
+
+**VP9 / AV1 8×8 inverse DCT (DCT_DCT variant), integer 16-bit
+fixed-point input, 8-bit output, with reconstructed-block add.**
+
+Mirrors the `ff_vp9_idct_idct_8x8_add_neon` shape in libavcodec
+(see `libavcodec/aarch64/vp9itxfm_neon.S`) and the equivalent
+dav1d / rav1d / libgav1 implementations for AV1's `IDTX_DCT` /
+`DCT_DCT` 8×8 path.
+
+I/O contract (per VP9 spec § 8.7 inverse transform process):
+
+```
+input:   int16_t coeffs[64]   // dequantized transform coefficients
+input:   uint8_t pred[64]     // predicted block (intra/inter)
+input:   ptrdiff_t stride     // typically 8 for an isolated test
+output:  uint8_t dst[64]      // clamp(pred + idct(coeffs)) per pixel
+```
+
+Bit-exact: integer arithmetic per spec, no rounding ambiguity.
+
+## Measurable success criteria
+
+Three numbers must come out of Phase 7, all measured in-session on
+hertz, all N≥3:
+
+| ID | Measurement | What it tells us |
+|---|---|---|
+| **M1** | **Bit-exactness rate** vs libavcodec C reference, across ≥10 000 random coefficient blocks | Correctness gate. Must be 100.000 %. Anything less means the kernel is wrong and no other number matters. |
+| **M2** | **QPU throughput** in million blocks per second (Mblock/s), single-threaded host driver, sustained over ≥1 s | The substrate's actual delivered capacity for this kernel shape. |
+| **M3** | **NEON throughput** in Mblock/s on the same hertz, single-threaded, running `ff_vp9_idct_idct_8x8_add_neon` via a microbench harness | The floor any GPU offload has to beat or get close to. |
+
+Derived figure for go/no-go: **R = M2 / M3**.
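
A sketch of how the M1 gate could be counted, assuming the reference and the candidate are both callables taking `(coeffs, pred)` and returning 64 output pixels. `clamp_add` is only a stand-in back end for exercising the harness (it is not an IDCT), and all names here are hypothetical rather than taken from the project's C harness.

```python
import random


def clamp_add(residual, pred):
    """Stand-in back end: clamp(pred + residual) to 8 bits. NOT an IDCT."""
    return [max(0, min(255, p + r)) for r, p in zip(residual, pred)]


def count_bit_exact(reference, candidate, n_blocks=10_000, seed=0):
    """Compare two implementations on random blocks; return (matches, total)."""
    rng = random.Random(seed)  # fixed seed so a failing block can be replayed
    matches = 0
    for _ in range(n_blocks):
        coeffs = [rng.randint(-32768, 32767) for _ in range(64)]  # int16 range
        pred = [rng.randint(0, 255) for _ in range(64)]           # uint8 range
        if reference(coeffs, pred) == candidate(coeffs, pred):
            matches += 1
    return matches, n_blocks


def passes_m1(matches, total, min_blocks=10_000):
    """Gate on integer counts, so rounding of a percentage cannot hide a miss."""
    return total >= min_blocks and matches == total
```

Gating on `matches == total` rather than on a formatted percentage is a deliberate choice in this sketch: with a large enough corpus, a single mismatch could still print as 100.000 %.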
+
+## Decision rules (set before measuring, per `feedback_no_motivated_reasoning`)
+
+| R | Interpretation | Next step |
+|---|---|---|
+| ≥ 1.0 | QPU beats NEON on this kernel in isolation. Strong substrate signal. | Phase 9 lessons → Phase 1 of next kernel (deblocking or CDEF). |
+| 0.5 ≤ R < 1.0 | QPU loses in isolation but is in the same order of magnitude. *Concurrent-work* hypothesis becomes viable: at R≈0.5 the QPU sustains about half of NEON's block rate, so with both running flat out it could take roughly a third of the blocks (ideal combined throughput ≈ 1.5× CPU-only, before bus contention), freeing CPU time for everything else. | Add a Phase 1' measurement: M4 = combined CPU+QPU throughput when both run concurrently (does total system delivery exceed pure-CPU?). Then decide. |
+| 0.1 ≤ R < 0.5 | QPU is materially slower. Concurrent-work win unlikely to be worth the integration cost. | Honest close. Phase 9 documents the negative result. |
+| < 0.1 | QPU is structurally wrong for this kernel shape. | Honest close. Phase 9 documents the failure, project shelves. |
+
+These thresholds are deliberately published *before* measurement so
+the result can't be retroactively reframed.
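
The decision table can be written down as code before any measurement exists, which keeps the thresholds fixed. A minimal sketch, with hypothetical verdict labels standing in for the "Next step" column:

```python
def decide(m2_mblock_s: float, m3_mblock_s: float):
    """Map R = M2 / M3 onto the pre-published decision rules.

    Returns (R, verdict). Verdict labels are illustrative shorthand.
    """
    if m2_mblock_s < 0 or m3_mblock_s <= 0:
        raise ValueError("throughputs must be non-negative, and M3 must be > 0")
    r = m2_mblock_s / m3_mblock_s
    if r >= 1.0:
        verdict = "next_kernel"      # QPU beats NEON in isolation
    elif r >= 0.5:
        verdict = "measure_m4"       # add the concurrent CPU+QPU measurement
    elif r >= 0.1:
        verdict = "close_negative"   # honest close, document the negative result
    else:
        verdict = "shelve"           # structurally wrong for this kernel shape
    return r, verdict
```

With the M3 value reported in the commit message (8.171 Mblock/s), a hypothetical M2 of 4.5 Mblock/s would land in the `measure_m4` band; that M2 figure is invented for illustration and is not a measurement.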
+
+## Secondary measurements (not gating, but recorded)
+
+- **M5** — per-kernel-launch overhead in µs, isolated (run with 0
+  blocks, measure submit+wait round-trip). Tells us the floor below
+  which kernel batching is required.
+- **M6** — workgroup-size sweep across {8, 16, 32, 64, 128, 256}
+  invocations to identify the v3dv-optimal launch shape for this
+  kernel. Records the Pareto curve, doesn't change R unless the
+  best-WG result invalidates M2.
+- **M7** — power draw delta at the wall (via the Himbeere Fritz!DECT
+  plug telemetry, if reachable) under idle vs CPU-only vs QPU-only
+  vs CPU+QPU concurrent. Order-of-magnitude only; informs the higgs
+  battery argument that motivates the project.
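
To make the M5 batching floor concrete, the sketch below uses the Phase 3 figures reported in this commit's message (55.60 us noop dispatch, 32.95 us delta, 122.4 ns per NEON block). It assumes, purely for illustration, that per-block kernel cost on the QPU equals the NEON per-block cost; that assumption is not a measurement.

```python
import math

NEON_NS_PER_BLOCK = 122.4   # M3, from the commit message
NOOP_DISPATCH_US = 55.60    # M5b, from the commit message
DISPATCH_DELTA_US = 32.95   # M5b - M5a, from the commit message


def overhead_in_blocks(dispatch_us: float,
                       ns_per_block: float = NEON_NS_PER_BLOCK) -> float:
    """How many per-block costs fit into the time one dispatch takes."""
    return dispatch_us * 1000.0 / ns_per_block


def min_batch(dispatch_us: float, ns_per_block: float,
              max_overhead_fraction: float) -> int:
    """Smallest N with dispatch / (dispatch + N * per_block) <= fraction."""
    if not 0.0 < max_overhead_fraction < 1.0:
        raise ValueError("max_overhead_fraction must be in (0, 1)")
    dispatch_ns = dispatch_us * 1000.0
    n = dispatch_ns * (1.0 - max_overhead_fraction) / (
        max_overhead_fraction * ns_per_block)
    return math.ceil(n)


if __name__ == "__main__":
    print(f"noop dispatch ~ {overhead_in_blocks(NOOP_DISPATCH_US):.0f} blocks")
    print(f"delta only    ~ {overhead_in_blocks(DISPATCH_DELTA_US):.0f} blocks")
    print(f"batch for <=5% overhead: "
          f"{min_batch(NOOP_DISPATCH_US, NEON_NS_PER_BLOCK, 0.05)} blocks")
```

Under these assumptions, keeping dispatch overhead below 5 % needs batches of several thousand blocks. For scale, a 1920×1080 luma plane tiled entirely in 8×8 blocks would hold 240 × 135 = 32 400 of them, which is why frame-level batching is the natural granularity.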
+
+## What Phase 1 does *not* lock
+
+- The dispatch path (Vulkan compute via `v3dv` vs direct DRM
+  submit via `v3d_drm.h` ioctl). Phase 4 picks. Default for
+  Phase 1 = **Vulkan compute** unless Phase 4 has reason to flip:
+  documented, debuggable, doesn't require QPU asm against the
+  NDA ISA.
+- The shader source (GLSL → glslang → SPIR-V) vs hand-written
+  SPIR-V. Default = GLSL.
+- Workgroup partitioning (one-block-per-WG vs many-blocks-per-WG).
+  Phase 4 chooses based on subgroup width and tile cost; Phase 1
+  records the sweep (M6).
+
+## Non-goals for Phase 1
+
+- No V4L2 driver work.
+- No end-to-end VP9 / AV1 decode (entropy + back-end). Just one
+  kernel, isolated, measured.
+- No optimization beyond what's needed to hit the bit-exact gate
+  and produce a single throughput number. Tuning is Phase 7's
+  feedback if R is borderline.
+- No build-system perfection. A CMakeLists that compiles the test
+  harness on hertz is enough.
+
+## Phase 1 → Phase 2 hand-off conditions
+
+Phase 1 closes when:
+- The above metrics + decision rules are reviewed. This is not the
+  Phase 5 second-model review from `dev_process.md`; that review
+  comes after the Phase 4 plan.
+- The metrics are recorded in this file or a sibling
+  `phase1_metrics.md` artifact (TBD).
+
+The next phase (Phase 2 — situation analysis) inventories:
+- libavcodec's NEON IDCT reference (file, function, calling
+  convention, expected I/O contract).
+- VP9 spec § 8.7 transform process (which the C reference
+  implements verbatim).
+- AV1 spec § 7.7 (same 1-D transform structure, larger transform
+  set; whether the 8×8 DCT_DCT path is numerically identical to
+  VP9's at this size is left for Phase 2 to verify).
+- Mesa v3dv's compute-shader compilation path and any known
+  v3dv-specific shader idioms that perform better on V3D 7.1.
+- The hertz Vulkan dispatch overhead floor (M5 candidate, but
+  measured as part of Phase 3 baseline).
+
+## Open questions Phase 1 hands forward
+
+None new. Phase 0 § 7's open questions are the standing list;
+Phase 1 picks off Q1 (single-kernel throughput) and Q2 (NEON
+baseline) directly via M2 and M3.
@@ -0,0 +1,212 @@
+---
+phase: 2
+status: closed 2026-05-18
+date_opened: 2026-05-18
+parent: phase1.md
+target_kernel: VP9 8×8 inverse DCT (DCT_DCT variant, 8-bit pixels)
+---
+
+# Phase 2 — Situation analysis
+
+Per `dev_process.md`:
+
+> Document current state. Identify constraints, dependencies, known
+> failure modes. Reset context here — do not carry assumptions from
+> prior sessions; re-read CLAUDE.md, relevant memory files, run
+> `git status`, re-verify reachability.
+
+## 1. Context reset
+
+- Working tree state: dirty (Phase 0/1/2 docs not yet committed).
+  `.git-broken-2026-05-18/` preserved as a forensic artifact of
+  the 2026-05-18 10:25 working-tree wipe (cause undetermined).
+- CLAUDE.md re-read: no contradictions with the Path B scope set
+  in README §"Architecture (Path B)".
+- hertz reachability: confirmed via SSH; `vcgencmd`, `vulkaninfo`,
+  `apt`, sudo NOPASSWD all working as of 2026-05-17 inventory.
+  Mesa 25.0.7 / Vulkan 1.3.305 / V3D 7.1.7 stable.
+
+## 2. Reference implementations — VP9 8×8 IDCT (DCT_DCT)
+
+The Phase 1 kernel has *two* canonical reference implementations
+in FFmpeg n7.1.3 (the version installed on hertz). The harness
+will link both: the C path as the bit-exact gate (M1), the NEON
+path as the throughput baseline (M3).
+
+### 2.1 C reference
+
+- **Source**: `libavcodec/vp9dsp_template.c`, function `idct_idct_8x8_add_c`
+- **Spec basis**: VP9 specification §8.7 — Inverse transform process
+- **Signature**:
+
+  ```c
+  static void idct_idct_8x8_add_c(uint8_t *_dst, ptrdiff_t stride,
+                                  int16_t *_block, int eob);
+  ```
+
+- **Algorithm** (8-bit path):
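+  1. If `eob == 1` (DC-only): apply the Q14 round-multiply by 11585 to `block[0]` twice in succession (`t = (c * 11585 + 8192) >> 14`, then the same step on `t`), zero `block[0]`, then add `(t + 16) >> 5` to each of the 8×8 `dst` pixels and clamp to [0, 255].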
+  2. Otherwise: 8 column passes through `idct8_1d` → tmp[64]. Zero the input block. 8 row passes through `idct8_1d` → out[8]. Per-element `(out + 16) >> 5`, add to `dst`, `av_clip_pixel`.
+- **`idct8_1d`**: 1-D 8-point inverse DCT, 8 trigonometric multiply-add stages with Q14 fixed-point constants then 8-butterfly add/sub stages. All arithmetic is signed int32 (`dctint`).
+- **Q14 constants** (matched against VP9 spec §8.7.1.4):
+  | symbol | value | definition (× 2^14, rounded) | trig value |
+  |---|---|---|---|
+  | cospi_16_64 | 11585 | cos(π/4) | 0.70711 |
+  | cospi_24_64 |  6270 | cos(3π/8) | 0.38268 |
+  | cospi_8_64  | 15137 | sin(3π/8) | 0.92388 |
+  | cospi_28_64 |  3196 | cos(7π/16) | 0.19509 |
+  | cospi_4_64  | 16069 | sin(7π/16) | 0.98079 |
+  | cospi_20_64 |  9102 | cos(5π/16) | 0.55557 |
+  | cospi_12_64 | 13623 | sin(5π/16) | 0.83147 |
+
+  Rounding convention: `(product + (1 << 13)) >> 14`, i.e. round-half-up at bit 14.
+
+- **License**: LGPL-2.1-or-later (FFmpeg).
+- **Side effect**: zeroes the input `block[]` (idempotency requirement; matches spec).
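+
+A minimal C sketch of the DC-only branch and the Q14 rounding
+convention described above, written from that description rather
+than copied from FFmpeg. The names `q14_mul_round`, `clip_u8` and
+`dc_only_idct8x8_add` are hypothetical, and the code assumes
+arithmetic right shift of negative ints (true for GCC and Clang
+on aarch64).
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Q14 multiply with round-half-up: (x * c + 2^13) >> 14.
+ * Assumption: >> on a negative int is an arithmetic shift. */
+int q14_mul_round(int x, int c)
+{
+    return (x * c + (1 << 13)) >> 14;
+}
+
+/* Clamp to the 8-bit pixel range [0, 255]. */
+uint8_t clip_u8(int v)
+{
+    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
+}
+
+/* Hypothetical helper (not an FFmpeg symbol): DC-only 8x8 inverse
+ * DCT + add. Scales block[0] by 11585/2^14 twice (once per 1-D
+ * pass), applies the final (t + 16) >> 5 rounding, then adds the
+ * same delta to all 64 pixels. Zeroes block[0] like the full path. */
+void dc_only_idct8x8_add(uint8_t *dst, ptrdiff_t stride, int16_t *block)
+{
+    const int cospi_16_64 = 11585;
+    int t = q14_mul_round(q14_mul_round(block[0], cospi_16_64), cospi_16_64);
+    int delta = (t + 16) >> 5;
+
+    block[0] = 0;
+    for (int y = 0; y < 8; y++)
+        for (int x = 0; x < 8; x++)
+            dst[y * stride + x] = clip_u8(dst[y * stride + x] + delta);
+}
+```
+
+For example, `block[0] = 64` gives `t = 32` and a per-pixel delta
+of `(32 + 16) >> 5 = 1`.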
+
+### 2.2 NEON reference
+
+- **Source**: `libavcodec/aarch64/vp9itxfm_neon.S`, symbol `ff_vp9_idct_idct_8x8_add_neon`
+- **Signature** (same as C):
+  ```c
+  void ff_vp9_idct_idct_8x8_add_neon(uint8_t *dst, ptrdiff_t stride,
+                                     int16_t *block, int eob);
+  ```
+  Registers: `x0=dst, x1=stride, x2=block, w3=eob`.
+- **Internal dependencies** (must be copied alongside the .S):
+  | macro / symbol | location | role |
+  |---|---|---|
+  | `idct8` | `vp9itxfm_neon.S` | 1-D 8-pt IDCT, fully unrolled with `dmbutterfly*` |
+  | `dmbutterfly0` | `vp9itxfm_neon.S` | rotation by π/4 (the `cospi_16_64` case) |
+  | `dmbutterfly` | `vp9itxfm_neon.S` | general 2-input rotation `[a,b] → [a·c1−b·c2, a·c2+b·c1]` (`Q14`) |
+  | `dmbutterfly_l` | `vp9itxfm_neon.S` | wide-form (4×i32 acc) for `dmbutterfly` |
+  | `butterfly_8h` | `vp9itxfm_neon.S` | trivial `[a+b, a−b]` on `int16x8_t` |
+  | `transpose_8x8H` | `libavcodec/aarch64/neon.S` | in-place 8×8 i16 transpose |
+  | `idct_coeffs` | `vp9itxfm_neon.S` (`const`) | Q14 trig constants table, aligned 4 |
+  | `movrel` | `libavutil/aarch64/asm.S` | PIC-aware constant-pool relocation helper |
+- **License**: LGPL-2.1-or-later (Google, 2016).
+- **Performance shape**: fully unrolled 8-pt butterfly with NEON `smull/smlsl/smlal` + `rshrn` for the Q14 round-shift; output uses `sqxtun` for saturated narrow to u8. Estimated ~80 NEON instructions for the steady-state (non-DC) path.
+
+### 2.3 AV1 equivalence note
+
+AV1's 8×8 DCT_DCT transform (`av1_iidentity8_iidentity8_c` vs `av1_idct8_idct8_c` family in `libavcodec/av1dsp/...`) shares the same 1-D 8-point structure but with **different** scaling: AV1 uses 12-bit fixed-point (`>> 12`) and a slightly different rounding shift due to its different transform-stage bit growth model. Calling our VP9 IDCT shader on AV1 coefficients will produce wrong output. **AV1 support is out of scope for Phase 1.** A Phase-N variant can fork the shader with the AV1 constants once Phase 1 has proven the VP9 path.
+
+## 3. Vulkan compute dispatch path
+
+Hertz exposes V3D 7.1 via Mesa's v3dv driver as Vulkan
+`PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU`, API 1.3.305, conformance
+1.3.8.3. The compute-only dispatch path is:
+
+```
+host program
+  ├─ vkCreateInstance / vkEnumeratePhysicalDevices (picks V3D 7.1.7.0)
+  ├─ vkCreateDevice (queue family with COMPUTE_BIT, no graphics needed)
+  ├─ vkCreateBuffer x N (SSBOs for block coeffs in / dst pixels in+out)
+  │     - buffer flags: STORAGE_BUFFER_BIT | TRANSFER_SRC/DST
+  │     - memory type: HOST_VISIBLE | HOST_COHERENT (zero-copy on shared LPDDR4x)
+  ├─ vkCreateDescriptorSetLayout (≤8 SSBOs per layout — Pi 5 limit)
+  ├─ vkCreateShaderModule (SPIR-V from glslang)
+  ├─ vkCreateComputePipelines
+  ├─ vkBeginCommandBuffer
+  │     vkCmdBindPipeline / vkCmdBindDescriptorSets / vkCmdPushConstants
+  │     vkCmdDispatch(group_count_x, 1, 1)   # one WG per ~K blocks
+  ├─ vkQueueSubmit + vkQueueWaitIdle (or fence) — this is the measured op
+  └─ (read back via the HOST_VISIBLE buffer, or alias it to the same memory the CPU populated)
+```
+
+Per Phase 0 §2 inside-view limits, the relevant constraints
+for this kernel:
+
+- ≤8 SSBOs per stage → group inputs/outputs into ≤8 bindings (we
+  only need 2: `block[]` in, `dst[]` in/out).
+- Shared mem ≤16 KiB → each 8×8 block fits trivially (128 B of
+  i16 coefficients, or 256 B if widened to i32, plus 64 B of u8
+  pixels). One WG can carry dozens of blocks of shared state if
+  useful.
+- Subgroup size = 16 (fixed). One workgroup of 64 invocations =
+  4 subgroups; one block per subgroup is a natural shape (each
+  16-lane subgroup covers 8×8 = 64 pixels in 4 passes of 16
+  lanes).
+
+## 4. Build path on hertz
+
+Already installed (2026-05-17): cmake 3.31, ninja 1.12, gcc (Debian
+trixie default), `libvulkan-dev 1.4.309`, `glslang-tools 15.1.0`,
+`spirv-tools 2025.1`, `libdrm-dev 2.4.131`, `vulkan-tools 1.4.304`.
+
+Missing but cheap:
+- `libavcodec-dev` — only needed if the harness wants to link
+  against system libavcodec for cross-checks against the dynamic
+  dispatcher. *Not* needed for the source-copy approach (preferred,
+  see §5).
+
+## 5. Reference-copy strategy (vs system-libavcodec link)
+
+**Decision: source-copy the 4 FFmpeg files listed below into `external/ffmpeg-snapshot/`.**
+
+Rationale:
+- System `libavcodec.so` on hertz is symbol-stripped (`nm` returns
+  empty for `ff_vp9_idct_*`). Internal NEON entry points are not
+  reachable via `dlsym`.
+- The two reference implementations (C, NEON) plus their macro/
+  data dependencies total ~4 files / ~600 relevant lines.
+  Source-copy is smaller than the dlopen plumbing would be.
+- LGPL-2.1-or-later (FFmpeg license) is propagation-compatible
+  with the harness binary if the harness binary itself is GPL
+  or LGPL. The kernel shaders and dispatch library stay
+  separately-licensed (BSD-2-Clause, default for this project).
+- Pinning to `n7.1.3` matches hertz's runtime libavcodec version,
+  so any in-session sanity cross-check against the running Mesa
+  / video tooling stays consistent.
+
+Files to vendor:
+
+| Source | License | Target path under `daedalus-fourier/` |
+|---|---|---|
+| `libavcodec/vp9dsp_template.c` | LGPL-2.1+ | `external/ffmpeg-snapshot/vp9dsp_template.c` |
+| `libavcodec/aarch64/vp9itxfm_neon.S` | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/vp9itxfm_neon.S` |
+| `libavcodec/aarch64/neon.S` (for `transpose_8x8H`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/neon.S` |
+| `libavutil/aarch64/asm.S` (for `movrel`, `function`, `endfunc`) | LGPL-2.1+ | `external/ffmpeg-snapshot/aarch64/asm.S` |
+| (whatever else `vp9dsp_template.c` transitively needs) | LGPL-2.1+ | as required |
+
+An `external/ffmpeg-snapshot/COPYING.LGPL` and an `external/ffmpeg-snapshot/PROVENANCE.md` accompany the snapshot; `PROVENANCE.md` records the upstream commit (n7.1.3 tag, commit hash) and the verbatim-copy guarantee.
+
+## 6. Known constraints / failure modes carried from Phase 0
+
+Repeated here so Phase 4 (plan) can bind against them without
+re-derivation:
+
+- **C1**: shaderFloat16 = false → all shader arithmetic must be int32 (we are int anyway — no risk).
+- **C2**: maxComputeSharedMemorySize = 16 KiB → kernel must not require more (8×8 IDCT trivially fits even with many blocks per WG).
+- **C3**: maxPerStageDescriptorStorageBuffers = 8 → we need only 2 (coeffs + dst), no risk.
+- **C4**: subgroupSupportedOperations = BASIC + VOTE only → no `subgroupAdd`/etc. for accumulator reductions. Workaround: the IDCT structure is fully data-parallel without reductions; this constraint doesn't bite.
+- **C5**: VC7 has SMUL24 but no INT8 MAC. Our Q14 multiplies are i16×i16→i32 — the multiplicands fit in 17 bits, so SMUL24 covers it. No INT8/INT4 issues.
+- **C6**: shared LPDDR4x bus; GPU sees ~4–7 GB/s vs CPU ~12–15 GB/s. For 8×8 IDCT, working set is tiny (≤320 B/block), so per-block bandwidth is not the bottleneck; per-dispatch submit overhead is.
+- **C7**: VPM read-stall serialization. If we hand-write QPU asm (we won't, in Phase 1) this would matter; the Vulkan compute path lets the v3d_compiler schedule for us.
+- **C8**: VC7 thermal throttle at 85°C GPU / 80°C CPU. Phase 7 measurements should record temp before/during/after to flag throttling.
+
+## 7. What Phase 2 does *not* close
+
+- The harness architecture (single binary? Two binaries — one for
+  bit-exact, one for throughput?). Phase 4 picks.
+- Block-per-WG dispatch geometry. Phase 4 + Phase 6 sweep.
+- Random-coefficient generation strategy (uniform i16 vs
+  realistic-distribution; the latter affects DC-only path
+  frequency). Phase 4 picks; Phase 7 may re-evaluate.
+- Whether NEON measurement uses `clock_gettime(CLOCK_MONOTONIC_RAW)`
+  per-call (high overhead) or batched (more realistic for codec
+  use). Phase 3 picks during baseline collection.
+
+## 8. Hand-off to Phase 3
+
+Phase 3 measures:
+- **M3-prelim**: NEON `ff_vp9_idct_idct_8x8_add_neon` throughput
+  on hertz, batched over 10⁶ random blocks, single-threaded,
+  4-thread, sched-isolated. This is the *floor*.
+- **M5-prelim**: Vulkan dispatch overhead — pipeline create cost
+  (one-time), per-`vkCmdDispatch` cost (per-frame-equivalent),
+  per-`vkQueueSubmit + vkQueueWaitIdle` cost (per-completion).
+  Bound below which kernel batching is mandatory.
+
+Both are measurements on the *existing* substrate. Neither
+requires writing any shader code. Phase 3 closes before Phase 4
+(plan) begins.
@@ -0,0 +1,105 @@
+---
+phase: 3
+status: closed 2026-05-18
+date_opened: 2026-05-18
+date_closed: 2026-05-18
+parent: phase2.md
+host: hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75+rpt-rpi-2712, Mesa 25.0.7-2+rpt4, V3D 7.1.7 @ 1 GHz, A76 @ 2.8 GHz)
+artifacts: build/bench_neon_idct, build/bench_vulkan_dispatch, build/noop.spv
+---
+
+# Phase 3 — Baseline measurements
+
+Per `dev_process.md`:
+
+> Take concrete measurements *before* any changes. Raw before
+> derived. Real data, not theatre.
+
+These numbers anchor every Phase 4+ decision. Re-run with the
+same harness on the same hertz before drawing any new conclusions
+in later phases.
+
+## M1 — bit-exact correctness gate (Phase 1)
+
+| | |
+|---|---|
+| Method | 10 000 random VP9-plausible coefficient blocks + random `pred[64]`, compare `daedalus_vp9_idct_idct_8x8_add_ref` C output vs vendored FFmpeg `ff_vp9_idct_idct_8x8_add_neon` |
+| Run    | `./bench_neon_idct --blocks 1000000 --iters 5` (built 2026-05-18) |
+| **Result** | **10 000 / 10 000 = 100.0000 %** |
+| DC-only path frequency | 11 / 10 000 = 0.11 % |
+| Notes | Random generator: xorshift64, biased toward 1–16 non-zero coeffs per block; eob mostly ∈ [4, 63]. DC-only frequency is incidental; Phase 7 may revisit if it materially affects the throughput number. |
+
+**Gate passes. Throughput measurement was authorized to run.**
+
+## M3 — NEON throughput (single-core)
+
+| | |
+|---|---|
+| Kernel | `ff_vp9_idct_idct_8x8_add_neon` from FFmpeg n7.1.3 (vendored, see `external/ffmpeg-snapshot/PROVENANCE.md`) |
+| Method | Pre-generate 1 M random blocks + preds. Per iteration: memcpy refresh of all blocks/preds (NEON path zeroes blocks), then call NEON kernel 1 M times. Subtract setup memcpy time from the measured wall-clock. 5 iterations, single thread, no CPU pinning. |
+| Compiler flags | `-O3 -march=armv8-a+simd` |
+| Run | `./bench_neon_idct --blocks 1000000 --iters 5` |
+| **Throughput** | **8.171 Mblock/s** |
+| Per-block | 122.4 ns |
+| Equivalent 1080p frame rate | 252.2 FPS (32 400 blocks per 1080p frame, assuming pure 8×8 work) |
+| Elapsed (kernel) | 0.612 s / 5 M blocks |
+| Elapsed (setup-only) | 0.250 s / 5 M iters |
+| Cross-check | Cycle estimate at 2.8 GHz: 122.4 ns × 2.8 GHz ≈ 343 cycles/block. Plausible for a fully-unrolled NEON 8-point IDCT with butterflies + saturated narrow stores; the FFmpeg implementation interleaves loads/computes/stores aggressively. |
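+
+One way to implement the subtract-the-setup method from the table
+above is sketched here. This is an assumption about the shape of
+such a harness, not the actual `bench_neon_idct.c`:
+`bench_ns_per_block`, `idct_fn`, `elapsed_ns` and the buffer
+layout are hypothetical, only the coefficient blocks are
+refreshed (not the preds), and `CLOCK_MONOTONIC` stands in for
+whichever clock the real harness uses.
+
+```c
+#define _POSIX_C_SOURCE 199309L
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+#include <time.h>
+
+/* Same argument shape as the FFmpeg 8x8 "add" kernels. */
+typedef void (*idct_fn)(uint8_t *dst, ptrdiff_t stride,
+                        int16_t *block, int eob);
+
+/* Nanoseconds elapsed from sample a to sample b. */
+int64_t elapsed_ns(struct timespec a, struct timespec b)
+{
+    return (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL
+         + (int64_t)(b.tv_nsec - a.tv_nsec);
+}
+
+/* Estimated kernel-only ns/block: time (refresh + kernel), time a
+ * refresh alone, subtract. Each block owns 64 coefficients in
+ * `work` and 64 pixels (stride 8) in `dst`. */
+double bench_ns_per_block(idct_fn kernel, int16_t *work,
+                          const int16_t *pristine, uint8_t *dst,
+                          size_t nblocks, int iters, int eob)
+{
+    const size_t bytes = nblocks * 64 * sizeof *work;
+    struct timespec t0, t1, t2;
+    int64_t setup_ns = 0, total_ns = 0;
+
+    for (int it = 0; it < iters; it++) {
+        clock_gettime(CLOCK_MONOTONIC, &t0);
+        memcpy(work, pristine, bytes);   /* setup-only pass */
+        clock_gettime(CLOCK_MONOTONIC, &t1);
+        memcpy(work, pristine, bytes);   /* kernel zeroes its input */
+        for (size_t b = 0; b < nblocks; b++)
+            kernel(dst + b * 64, 8, work + b * 64, eob);
+        clock_gettime(CLOCK_MONOTONIC, &t2);
+        setup_ns += elapsed_ns(t0, t1);
+        total_ns += elapsed_ns(t1, t2);
+    }
+    return (double)(total_ns - setup_ns) / ((double)nblocks * iters);
+}
+```
+
+Timing the refresh separately keeps the memcpy cost out of the
+per-block figure without putting a clock call around every kernel
+invocation.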
+
+### M3 implications
+
+- A single A76 core handles ~8 M blocks/s = **252 FPS at 1080p**. Real decode needs ~60 FPS, so one core has ~4.2× headroom and all four cores together ~17×. **NEON is not the bottleneck for current YouTube workloads on Pi 5.**
+- The QPU offload story is not "make decode faster" — decode is already fast enough single-threaded. The story has to be "free CPU cycles for the rest of the system" (browser, audio, the 11 LXD containers on hertz).
+- For a per-kernel R = QPU / NEON measurement (per `phase1.md §"Decision rules"`), the QPU has to hit ≥ ~4.1 M blocks/s (0.5 × 8.171) to score R ≥ 0.5. That's the gate.
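+
+The derived M3 figures follow from the single measured throughput
+number; a small C sketch of that arithmetic is below. The function
+names are hypothetical, and `blocks_per_frame` assumes frame
+dimensions that are multiples of 8 and counts one plane only, in
+line with the "pure 8×8 work" assumption in the table.
+
+```c
+#include <stdint.h>
+
+/* Number of 8x8 blocks in one width x height plane
+ * (dimensions assumed to be multiples of 8). */
+uint32_t blocks_per_frame(uint32_t width, uint32_t height)
+{
+    return (width / 8) * (height / 8);
+}
+
+/* Mblock/s -> ns per block. */
+double ns_per_block(double mblock_per_s)
+{
+    return 1000.0 / mblock_per_s;
+}
+
+/* Mblock/s -> frames per second for a given block count per frame. */
+double frames_per_second(double mblock_per_s, uint32_t blocks)
+{
+    return mblock_per_s * 1e6 / (double)blocks;
+}
+
+/* ns per block at a given core clock -> cycles per block. */
+double cycles_per_block(double ns, double clock_ghz)
+{
+    return ns * clock_ghz;
+}
+
+/* QPU throughput (Mblock/s) needed to reach a target R = QPU / NEON. */
+double qpu_gate_mblock_per_s(double neon_mblock_per_s, double r_min)
+{
+    return neon_mblock_per_s * r_min;
+}
+```
+
+With the measured 8.171 Mblock/s this gives about 122.4 ns/block,
+about 252 FPS for 32 400 blocks per frame, and an R ≥ 0.5 gate of
+about 4.09 Mblock/s.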
+
+## M5 — Vulkan compute dispatch overhead
+
+| | |
+|---|---|
+| Method | Create a compute pipeline with an empty layout (no descriptors, no push constants), then bind and dispatch a `void main(){}` shader with `local_size_x=64`. Time the `vkQueueSubmit` + `vkQueueWaitIdle` round-trip. 50 000 iterations, warm. |
+| Device | V3D 7.1.7.0 via Mesa v3dv 25.0.7 (selected past llvmpipe by `strstr("V3D")`) |
+| Run | `./bench_vulkan_dispatch --iters 50000` |
+| **M5a — empty CB submit+wait** | **22.66 µs / op** |
+| **M5b — 1-WG noop dispatch submit+wait** | **55.60 µs / op** |
+| **M5 delta — per-vkCmdDispatch + pipeline-bind** | **32.95 µs** |
+
+### M5 implications — the load-bearing finding for Phase 4
+
+This is the single most important number from Phase 3.
+
+- Per-dispatch cost (55.6 µs) is **~455× the NEON per-block cost** (122 ns).
+- A per-block QPU dispatch is structurally impossible — overhead dominates by two-and-a-half orders of magnitude.
+- Break-even batch size for a *hypothetical* zero-cost QPU kernel: **≥ 455 blocks per dispatch** (55.6 µs ÷ 122.4 ns ≈ 454.2). Real kernel cost comes on top of that.
+- Frame-level batching is mandatory: a 1080p frame has 32 400 8×8 blocks; one dispatch per frame amortizes M5b to 1.7 ns/block — well below NEON's 122 ns.
+- Superblock-level batching does not amortize the overhead: a VP9 64×64 superblock holds 64 8×8 blocks; 55.6 µs / 64 ≈ 870 ns/block, ~7× NEON. That granularity is probably too fine; frame-level or full-plane is the right granularity.
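+
+The amortization argument above reduces to two small formulas,
+sketched here in C. The function names are hypothetical; the
+inputs are the M5b submit+wait cost and the M3 per-block NEON
+cost, and the kernel itself is treated as free, as in the
+break-even bullet.
+
+```c
+#include <stdint.h>
+
+/* Dispatch overhead spread over one batch, in ns per block. */
+double overhead_ns_per_block(double dispatch_us, uint32_t blocks_per_dispatch)
+{
+    return dispatch_us * 1000.0 / (double)blocks_per_dispatch;
+}
+
+/* Smallest batch size at which the amortized dispatch overhead
+ * alone no longer exceeds the CPU per-block cost (ceiling of the
+ * ratio, computed without libm). */
+uint32_t break_even_blocks(double dispatch_us, double cpu_ns_per_block)
+{
+    double exact = dispatch_us * 1000.0 / cpu_ns_per_block;
+    uint32_t n = (uint32_t)exact;
+    return ((double)n < exact) ? n + 1 : n;
+}
+```
+
+With 55.6 µs and 122.4 ns, `break_even_blocks` returns 455, and
+`overhead_ns_per_block` gives about 869 ns for a 64-block batch
+and about 1.7 ns for a 32 400-block frame.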
+
+### M5 measurement caveats
+
+- `vkQueueWaitIdle` after each submit forces a full GPU sync, modelling the "submit and need the result now" case. Real decode pipelines can submit multiple frames ahead and wait less often — the per-dispatch cost in a pipelined deployment will be lower (probably bounded below by M5a ≈ 22.66 µs as the pure submit cost).
+- Empty CB (M5a) at 22.66 µs is the *floor*. This is Mesa command-list construction + kernel `DRM_IOCTL_V3D_SUBMIT_CL` + scheduler RTT. Cannot be optimised at the userspace level without changing Mesa or kernel.
+- Both numbers include `vkQueueWaitIdle` overhead; pure submit-without-wait would be lower. For Phase 1's threshold analysis the with-wait number is the right one to use because end-to-end frame decode must wait for its output to be readable.
+
+## Phase 3 closure
+
+Two anchor measurements captured, both with verbatim raw output
+(see `bench_neon_idct` and `bench_vulkan_dispatch` source for the
+print format). No estimates, no inferences, no recall from prior
+sessions or sibling-host memory.
+
+Phase 4 (plan) opens against these numbers. Its first decision:
+**given the 55.60 µs per-dispatch submit+wait cost (M5b), what is
+the batch granularity for the first kernel?** The answer is either
+frame-level (32 400 blocks/dispatch) or row-level (240
+blocks/dispatch for one 1920-wide row of 8×8 → still ~232 ns/block
+overhead, ~1.9× NEON). Frame-level is the only granularity that
+amortises overhead enough to leave kernel compute room to win.
+
+Open threads for later phases (not blocking Phase 4):
+- Multi-core NEON sweep (M3'): single-core NEON is the right
+  *competitor floor*, but the actual ARM headroom on hertz is
+  up to 4× this number with all four cores (scaling not yet measured).
+- Memory-bandwidth contention measurement (M6): does NEON's
+  rate change when concurrent QPU is reading the same LPDDR4x
+  bus? Needs the QPU kernel to exist first.
+- Power-draw delta via Himbeere plug (M7): same — needs a real
+  GPU workload to differentiate from idle.