daedalus-fourier/docs/phase0.md
marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00

---
phase: 0
status: closed 2026-05-18
date_opened: 2026-05-17
date_closed: 2026-05-18
research_method: three rounds of parallel web research (Sonnet via Agent), plus hands-on hertz substrate inventory and live `vulkaninfo` capture
target_hardware: hertz (Pi 5 8 GB) for dev; higgs (CM5) eventual user target
---
# Phase 0 — Substrate / motivation / inventory
This is the consolidated Phase 0 record. Path A (custom VPU firmware)
is **closed at the silicon-RoT step**; Path B (QPU compute via the
existing Mesa `v3d` driver) is **open**. The remainder of the
project lives in Path B.
The earlier session produced two separate Phase 0 artifacts that
were lost when the working tree was wiped at 2026-05-18 10:25
(`.git-broken-2026-05-18/` retains the corrupted state if needed).
This document supersedes both.
---
## 1. Research question
Verbatim from `README.md`:
> Community-built VP9 / AV1 software-decode back-end running on the
> VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
> Compute Module 5), via the existing Mesa `v3d` userspace driver.
The load-bearing claim: *the QPU is programmable by us, on stock
production hardware, and the codec back-end is a workload class
where that programmability buys CPU time on the A76 cluster.*
Phase 0's job is to test that claim before Phase 1 binds a metric.
## 2. Substrate inventory — hertz
Captured live 2026-05-17 via SSH. Full `vulkaninfo` in
`vulkaninfo_v3d_7_1_7_hertz.txt`.
| Item | Value |
|---|---|
| Host | hertz, Pi 5, 8 GB, eMMC + 1 TB SATA |
| Role | LXD host for 11 containers (home-LAN spine — DNS / VPN / HA proxy / NCP / SMTP) |
| OS | Debian 13 Trixie |
| Kernel | `6.12.75+rpt-rpi-2712` (RPi Foundation kernel, 2026-03-11) |
| CPU | 4× Cortex-A76 @ 2.8 GHz |
| GPU clock | V3D 7.1 @ 1000 MHz (slight OC; spec 960 MHz) |
| Mesa | `25.0.7-2+rpt4` (`libvulkan_broadcom.so` v3dv ICD) |
| Vulkan loader | `1.4.309` |
| Vulkan device API | 1.3.305 (conformance 1.3.8.3) |
| DRM nodes | `card0 → v3d` (compute target), `card1 → vc4-drm` (display), `renderD128` |
| kernel uAPI hdr | `/usr/include/drm/v3d_drm.h` present |
| Build tools | cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0, spirv-tools 2025.1, libdrm-dev 2.4.131 (installed 2026-05-17) |
| User groups | mfritsche ∈ `render`, `video`, `lxd`, `sudo` |
| Memory pressure | 7.9 GiB RAM, ~3 GiB available; 6 GiB zram, ~2.8 GiB in use (cohabitation with LXD spine) |
| Watchdog | yes — power-cut reboot via Himbeere plug if hertz crashes (acknowledged dev cost: household DNS/VPN drops during each reboot cycle) |
**Inside-view V3D 7.1 compute envelope** (from
`vulkaninfo_v3d_7_1_7_hertz.txt`):
| Property | Value | Implication |
|---|---|---|
| `maxStorageBufferRange` | 1 GiB | Bounds single-tensor size; codec working sets (frames, planes) fit trivially |
| `maxPerStageDescriptorStorageBuffers` | 8 | Forces ≤8 SSBO bindings per dispatch — ggml-vulkan binds more, doesn't fit |
| `maxComputeSharedMemorySize` | 16 KiB | Small tiled kernels only; codec block work (8×8, 16×16) fits easily |
| `maxComputeWorkGroupInvocations` | 256 | Standard |
| `maxComputeWorkGroupSize` | 256 / 256 / ? | Standard |
| `subgroupSize` | 16 (fixed) | Matches QPU SIMD width |
| `subgroupSupportedOperations` | BASIC + VOTE only | No arithmetic reductions — accumulate via shared memory |
| `shaderFloat16` | **false** | Storage only; arithmetic runs FP32 |
| `shaderInt8` | **false** | Storage only; arithmetic on widened ints |
| `shaderInt16` | **false** | Same |
| `storageBuffer8/16BitAccess` | true | Can load tightly-packed quantized / packed pixel data |
| `subgroupSizeControl`, `computeFullSubgroups`, `synchronization2` | true | Modern compute features available |
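A quick budget check makes the "fits easily" claim concrete. This is a minimal sketch, assuming coefficients are widened to 32-bit integers in shared memory (an assumption that follows from `shaderInt16` being false, not something measured here); the function name is illustrative.

```python
# Shared-memory budget for codec blocks under the v3dv limits in the table above.
SHARED_MEM_BYTES = 16 * 1024      # maxComputeSharedMemorySize
MAX_INVOCATIONS = 256             # maxComputeWorkGroupInvocations
SUBGROUP_SIZE = 16                # subgroupSize (fixed)
BYTES_PER_COEFF = 4               # assumption: coefficients widened to int32


def blocks_per_workgroup(block_dim: int) -> int:
    """How many block_dim x block_dim coefficient blocks fit in shared memory."""
    block_bytes = block_dim * block_dim * BYTES_PER_COEFF
    return SHARED_MEM_BYTES // block_bytes


for dim in (8, 16):
    print(f"{dim}x{dim}: {dim * dim * BYTES_PER_COEFF} B per block, "
          f"{blocks_per_workgroup(dim)} blocks per workgroup")
print(f"subgroups per full workgroup: {MAX_INVOCATIONS // SUBGROUP_SIZE}")
```

Under these assumptions a workgroup can stage dozens of 8×8 blocks at once, so shared memory is unlikely to be the first limit a block kernel hits.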
**Throughput envelopes** (from prior community measurements,
not yet re-confirmed in-session):
| Metric | Value | Source |
|---|---|---|
| V3D 7.1 theoretical FP32 peak | ~92 GFLOPS at 960 MHz | 12 QPU × 4 ALU × 2 op/cycle |
| Direct-DRM SGEMM sustained | 21.4 GFLOPS (~23%) | `Idein/py-videocore7` |
| Vulkan-compute `vkpeak` fp32-vec4 | 6.9 GFLOPS (~7.5%) | RPi forum benchmark thread |
| A76 NEON sustained for matmul | ~50 GFLOPS | Multiple benchmark sources |
| Shared LPDDR4x bus | ~17 GB/s nominal | LPDDR4x-4267 × 32 bit / 8 |
| GPU-measured BW share | 4–7 GB/s | py-videocore7 scopy benchmark |
| CPU NEON BW achievable | 12–15 GB/s | Pi 5 STREAM benchmarks |
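The headline figures in this table can be re-derived from the inputs listed in its Source column. The sketch below does only that arithmetic; it measures nothing, and the variable names are illustrative.

```python
# Re-derive the theoretical peak and bus figures from their stated inputs.
QPUS = 12
ALUS_PER_QPU = 4
OPS_PER_CYCLE = 2        # as listed in the table's Source column
CLOCK_GHZ = 0.960        # spec clock; hertz runs at 1.000 GHz

peak_gflops = QPUS * ALUS_PER_QPU * OPS_PER_CYCLE * CLOCK_GHZ


def share_of_peak(gflops: float) -> float:
    """Percentage of the theoretical FP32 peak."""
    return 100.0 * gflops / peak_gflops


# LPDDR4x-4267 on a 32-bit bus: transfers/s * bits / 8 -> bytes/s
bus_gb_per_s = 4267e6 * 32 / 8 / 1e9

print(f"peak: {peak_gflops:.1f} GFLOPS")
print(f"direct-DRM SGEMM: {share_of_peak(21.4):.1f} % of peak")
print(f"vkpeak fp32-vec4: {share_of_peak(6.9):.1f} % of peak")
print(f"nominal bus: {bus_gb_per_s:.1f} GB/s")
```

The results agree with the rounded values in the table, which suggests the percentages were computed against the 960 MHz spec clock rather than hertz's 1000 MHz.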
## 3. Path A — closed
**Custom VPU firmware loaded onto VC7 scalar cores.** This was the
README's original framing.
Blocked at the silicon-RoT step:
- **BCM2712 mask ROM hardcodes RPi's public key** and unconditionally
verifies the second-stage bootloader (`bootsys`) on every boot
path (SPI flash, USB rpiboot, SD recovery). RPi holds the
corresponding private key.
- `EXECUTE_CODE` mailbox tag (the only documented Pi 1–4 runtime
"run code on a VPU core" mechanism) **confirmed removed on Pi 5**
by a Raspberry Pi engineer (forum.raspberrypi.com).
- Pre-CRA EEPROM downgrade is possible (no anti-rollback fuse) but
only yields *older RPi-signed* EEPROMs — doesn't help.
- OTP fuse state on stock CM5 is already the most permissive
possible (customer key hash = zero); the RPi-key check is
silicon-unconditional, not gated by OTP.
- CM5 vs retail Pi 5: same silicon, same chain, no meaningful
security delta.
- One non-software escape exists: VPU JTAG via documented test
points (`schlae/cm5-reveng`, Dec 2025). It requires a hardware
mod, higgs is a sealed-chassis unit rather than the dev unit, and
it would be novel research with no published firmware-injection
workflow. Out of scope for this project.
Verdict: **structurally blocked for community use without RPi
cooperation or hardware-RE-grade work on a sacrificial CM5.**
## 4. Path B — open
**QPU compute kernels via the existing Mesa `v3d` driver.** Reachable
from userspace today on a stock signed Pi 5 / CM5 via
`/dev/dri/card0` (Vulkan compute through `v3dv`) or `renderD128`
(direct DRM submit, py-videocore7 style). No firmware loading.
No signing fight. mfritsche on hertz is in the `render` group and
can hit the device without sudo.
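The "reachable without sudo" claim is easy to check from userspace. This is a minimal sketch, assuming the render node name from the substrate inventory; `can_open_node` is a hypothetical helper, and the node name may differ on other systems.

```python
import os

# Node name taken from the substrate inventory; treat it as an assumption elsewhere.
RENDER_NODE = "/dev/dri/renderD128"


def can_open_node(path: str) -> bool:
    """True if the node exists and the current user may read and write it."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


ok = can_open_node(RENDER_NODE)
print(f"{RENDER_NODE}: {'accessible' if ok else 'not accessible'}")
```

A passing check shows only that the permissions are in place; it does not exercise the driver.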
The substrate is real:
- `Idein/py-videocore7` runs SGEMM at 21 GFLOPS sustained on stock
Pi 5 with no special setup — existence proof of arbitrary QPU
programs.
- Mesa v3dv is Vulkan 1.3-conformant on V3D 7.1 (Mesa 24.3+;
hertz runs 25.0.7).
- The kernel `v3d` DRM driver is fully upstream and open.
Phase 0 does **not** assume Path B leads to a winning result. It
asserts only that Path B is *reachable*, where Path A isn't.
## 5. Why this isn't the same project as "v3d backend for llama.cpp"
A llama.cpp v3d backend was investigated mid-session and rejected
as structurally infeasible. The verdict was decisive: the GPU loses
to the CPU on raw FP32 (21 vs ~50 GFLOPS), on memory bandwidth share
(4–7 vs 12–15 GB/s), and on quantized instruction support (no
INT8 MAC vs A76 SDOT/UDOT). For LLM matmul, the QPU is the wrong
substrate.
**Codec back-end work is a different workload class** with
properties that fit the QPU substantively better:
| Property | LLM matmul | Codec back-end (post-entropy) |
|---|---|---|
| Working set per dispatch | Whole weight matrices (GB) | Per-block (8×8 / 16×16, hundreds of bytes) — fits in 16 KiB shared mem |
| Dominant op | INT8 MAC | Integer add / shift / small-constant multiply |
| Why GPU misses | No INT8 MAC | Less impact — fewer multiplies, mostly add/shift |
| Memory pattern | Full-tensor stream | Sequential plane reads, TMU-friendly |
| Parallelism | One big GEMM | Thousands of independent small blocks per frame |
| A76 advantage | NEON SDOT/UDOT crushing it | Less specialized; QPU advantage real |
| Bandwidth-bound? | Yes (kills the GPU) | Compute-bound at block scale |
This is the load-bearing reframe between the failed llama.cpp
investigation and the daedalus-fourier scope. Codec back-end
*might* live on the QPU. Phase 1 measures whether it actually does.
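To make the "integer add / shift / small-constant multiply" row concrete, here is a minimal sketch of one fixed-point butterfly of the kind found in VP9-style inverse DCTs. The constant 11585 and the 14-bit rounding shift are assumed from the libvpx / FFmpeg conventions and should be checked against the vendored source before relying on them. The function names are illustrative.

```python
# One fixed-point butterfly in the style of a VP9 inverse DCT even stage.
COSPI_16_64 = 11585      # round(2**14 * cos(pi / 4)); assumed from libvpx/FFmpeg conventions
ROUND_BITS = 14


def round_shift(x: int) -> int:
    """Round-to-nearest arithmetic right shift by ROUND_BITS."""
    return (x + (1 << (ROUND_BITS - 1))) >> ROUND_BITS


def butterfly(a: int, b: int) -> tuple[int, int]:
    """Sum and difference scaled by cos(pi/4): adds, two multiplies, two shifts."""
    return round_shift((a + b) * COSPI_16_64), round_shift((a - b) * COSPI_16_64)


print(butterfly(64, 0))
print(butterfly(100, -28))
```

Every operation here is a 32-bit integer add, multiply, or shift, which is the op mix the table contrasts with the INT8 MAC that LLM matmul depends on.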
## 6. Honest probability assessment
A competent outside reviewer should rate the project as **hard but
viable**, with one concrete prior precedent (MulticoreWare /
Imagination PowerVR OpenCL VP9 decoder, 2014, achieved 1080p30 in
a hybrid model with CPU entropy + GPU back-end on a comparable
embedded GPU) and one concrete recent failure (FFmpeg 8.0 VP9-on-
Vulkan-compute, 2025, produced corrupted output on a much more
capable NVIDIA target — but the failure was in the *attempt to
move entropy onto GPU*, not the back-end).
The win condition is **not** "GPU beats CPU at the same work." The
win condition is **"GPU work overlaps with CPU work that has to
happen anyway"** — concurrent decode where ARM does entropy and
the QPU finishes the block-level back-end on the previous frame,
recovering CPU time for the rest of the system (browser, audio,
UI, the 11 LXD containers on hertz).
Phase 1 measures the building block: one kernel, bit-exact, with
numbers. Phase 2+ only if Phase 1 numbers justify it.
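The overlap argument can be illustrated with a toy two-stage pipeline model. All timings below are hypothetical placeholders chosen so that the QPU back-end is slower than the CPU back-end; the model ignores dispatch overhead and bandwidth contention.

```python
# Toy two-stage pipeline model: CPU entropy decode, then block-level back-end.
# All timings are hypothetical placeholders, not measurements.

def cpu_ms_per_frame(entropy_ms: float, backend_cpu_ms: float, offload: bool) -> float:
    """CPU time spent per frame; offloading removes the back-end from the CPU."""
    return entropy_ms if offload else entropy_ms + backend_cpu_ms


def frame_period_ms(entropy_ms: float, backend_ms: float, pipelined: bool) -> float:
    """Steady-state time between finished frames."""
    if pipelined:
        # Stages overlap across frames: the slower stage sets the pace.
        return max(entropy_ms, backend_ms)
    return entropy_ms + backend_ms


entropy, backend_cpu, backend_qpu = 6.0, 8.0, 12.0   # ms, hypothetical

print("CPU-only:", cpu_ms_per_frame(entropy, backend_cpu, offload=False), "ms CPU,",
      frame_period_ms(entropy, backend_cpu, pipelined=False), "ms/frame")
print("CPU+QPU: ", cpu_ms_per_frame(entropy, backend_cpu, offload=True), "ms CPU,",
      frame_period_ms(entropy, backend_qpu, pipelined=True), "ms/frame")
```

With these placeholder numbers the QPU takes longer than the CPU would on the back-end, yet the CPU cost per frame still drops, which is the shape of win the project is after.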
## 7. Open questions for Phase 1
1. **What's the actual single-kernel QPU throughput on a
codec-shaped workload?** SGEMM at 21 GFLOPS is the only public
number, and SGEMM is not block-IDCT-shaped. We need an in-session
N=3 measurement on a real codec kernel.
2. **What's the ARM NEON baseline for the same kernel on the same
host (hertz)?** libavcodec ships highly-tuned NEON paths for IDCT,
deblocking, etc. Without measuring NEON in-session, "the QPU
wins" or "the QPU loses" is unverifiable.
3. **Vulkan compute vs direct DRM submit — which path?** Vulkan
has tooling, documentation, debuggability. Direct DRM has
~10–15% lower per-dispatch overhead and bypasses the
v3dv-imposed 16 KiB shared-mem / 8-SSBO limits, at the cost
of writing QPU asm against the NDA ISA. Phase 1 picks one.
4. **Memory bandwidth contention with concurrent ARM decode.**
The shared ~17 GB/s bus is the ceiling for both. If the QPU and
ARM NEON contend for bandwidth while running concurrently, the
"concurrent work" win disappears. Needs in-session measurement
once any kernel exists.
5. **VC7 thermal headroom under sustained mixed CPU+GPU load.**
Pi 5 throttles GPU at 85°C, CPU at 80°C. hertz idles at ~64°C
with the LXD spine; mixed compute will push higher. Whether
hertz needs active cooling is an open question.
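Questions 1 and 2 both call for repeated in-session timing. The following is a minimal sketch of an N=3 harness, assuming a callable that processes a known number of blocks per call; `time_runs`, `mblocks_per_s`, and `dummy_kernel` are hypothetical names, and the stand-in workload is not a codec kernel.

```python
import statistics
import time


def time_runs(fn, blocks_per_run: int, runs: int = 3) -> list[float]:
    """Return ns/block for each of `runs` timed calls to fn()."""
    results = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        elapsed = time.perf_counter_ns() - start
        results.append(elapsed / blocks_per_run)
    return results


def mblocks_per_s(ns_per_block: float) -> float:
    """Convert ns/block to Mblock/s."""
    return 1e3 / ns_per_block


def dummy_kernel():
    # Stand-in workload (hypothetical); a real run would call the kernel under test.
    sum(i * i for i in range(10_000))


samples = time_runs(dummy_kernel, blocks_per_run=10_000)
median = statistics.median(samples)
print(f"median {median:.1f} ns/block over {len(samples)} runs, "
      f"spread {max(samples) - min(samples):.1f} ns")
```

Reporting the spread alongside the median makes it visible when three runs are too few to separate the QPU and NEON numbers.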
These are Phase 1's burden, not Phase 0's. Phase 0 closes here.
## 8. Sources
The earlier session's web research produced ~7000 words of substrate
references across 6 parallel threads. The full source list lived
in the deleted `phase0_findings.md` and `phase0_wall1_bypass.md`.
The high-value pointers that should follow this project forward:
- [Mesa `src/broadcom/qpu/qpu_instr.h`](https://github.com/Mesa3D/mesa/blob/main/src/broadcom/qpu/qpu_instr.h) — de-facto VC7 QPU ISA reference (no Broadcom-published doc; ISA under NDA)
- [Mesa `src/broadcom/compiler/`](https://github.com/Mesa3D/mesa/tree/main/src/broadcom/compiler) — NIR→QPU compiler, the open ground truth for what V3D 7.1 can do
- [`Idein/py-videocore7`](https://github.com/Idein/py-videocore7) — working QPU GPGPU runtime via DRM; SGEMM benchmark; existence proof
- [`Towdo/py-videocore7`](https://github.com/Towdo/py-videocore7) — fork with more fixes
- [Mesa `v3dv` driver source](https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/broadcom/vulkan) — Vulkan compute path
- [Pi 5 HEVC kernel driver patch series](https://patchwork.kernel.org) — closest architectural template for ARM-side V4L2 stateless wrapping a Pi-5 hardware accelerator (search "rpi-hevc-dec")
- [raspberrypi/usbboot secure-boot.md](https://github.com/raspberrypi/usbboot/blob/master/docs/secure-boot.md) — Wall 1 silicon-RoT confirmation
- [schlae/cm5-reveng](https://github.com/schlae/cm5-reveng) — CM5 PCB RE; VPU JTAG test points (Dec 2025; out of Path B scope, kept as escape hatch reference)
- [MulticoreWare / Imagination PowerVR VP9 OpenCL decoder press](https://www.design-reuse.com/news/34030/vp9-decoder-imagination-powervr-series6-gpus.html) — 2014 precedent for hybrid codec back-end on embedded GPU compute
- [FFmpeg 8.0 part-3 VP9 Vulkan failure post](https://www.rendi.dev/blog/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding) — recent cautionary tale; failure was in entropy stage, not back-end
- [`Halide/Halide` Vulkan Pi 5 issue #8494](https://github.com/halide/Halide/issues/8494) — known runtime edge cases on Pi 5 Vulkan
- [Pi Forum p=2330030](https://forums.raspberrypi.com/viewtopic.php?p=2330030) — RPi engineer confirms VC7 ISA NDA + EU CRA signing rationale
Future phases should add citations here as they're consumed, not
re-derive Phase 0's substrate findings.