marfrit 71db72928f Phase 4 plan + Phase 5 second-model review (PASS-WITH-REVISIONS)
Phase 4 — Plan first QPU IDCT8 kernel given the M5 32.95us/dispatch
constraint. Frame-level dispatch (8160 WGs for 1080p), 64
invocations/WG = 4 blocks/WG, 3 SSBOs + push constants, reproduces
FFmpeg's transposed column-pass orientation. Predicted M2 ~16 Mblock/s
(BW-limited), R ~ 2.0 -> strong PASS under phase1.md decision rules.

Phase 5 — Second-model review by Claude Sonnet (fresh-context Agent,
no prior session memory). Verdict PASS-WITH-REVISIONS with 2 RED-class
findings + 1 YELLOW that this commit applies:

  RED  finding 5 (dst race condition): int32_t[] dst with 4 lanes
       writing to overlapping 32-bit words = non-atomic concurrent
       writes = Vulkan UB. Fix: uint8_t[] via storageBuffer8BitAccess
       (verified exposed). Applied to phase4.md sec 5 + GLSL declaration.

  RED  finding 7 (early-return before barrier): if (block_idx >=
       n_blocks) return; ahead of barrier() is UB by Vulkan spec.
       For 1080p (32640 blocks, /4) no partial WGs; for any frame
       width not /32 there are. Fix: oob flag, gate work bodies,
       barrier() unconditional. Applied to phase4.md sec 4 pseudocode.

  YELLOW finding 6 (subgroup ops): docs claimed BASIC+VOTE only;
         actual exposed set is BASIC+VOTE+BALLOT+SHUFFLE+
         SHUFFLE_RELATIVE+QUAD per vulkaninfo. Plan doesn't use any
         subgroup ops in v1 so unaffected, but the wrong constraint
         would mislead Phase 6/7. Corrected in phase0.md sec 2,
         phase2.md sec 6, phase4.md sec 1 (C4).

GREEN/YELLOW findings 1-4, 8 (orientation, WG geom, idle lanes, BW
prediction, compute envelope accounting) accepted as-is or deferred to
Phase 7 M6 sweep per plan's existing flagging.

Reviewer verdict post-revisions: "Phase 4 is APPROVED for Phase 6
implementation. No re-review needed; revisions are mechanical and
address verified bugs/errors."

Phase 5 itself just paid for itself: two real UB bugs caught before
any GLSL was written.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:47:03 +00:00

---
phase: 0
status: closed 2026-05-18
date_opened: 2026-05-17
date_closed: 2026-05-18
research_method: three rounds of parallel web research (Sonnet via Agent), plus hands-on hertz substrate inventory and live `vulkaninfo` capture
target_hardware: hertz (Pi 5 8 GB) for dev; higgs (CM5) eventual user target
---
# Phase 0 — Substrate / motivation / inventory
This is the consolidated Phase 0 record. Path A (custom VPU firmware)
is **closed at the silicon-RoT step**; Path B (QPU compute via the
existing Mesa `v3d` driver) is **open**. The remainder of the
project lives in Path B.
The earlier session produced two separate Phase 0 artifacts that
were lost when the working tree was wiped at 2026-05-18 10:25
(`.git-broken-2026-05-18/` retains the corrupted state if needed).
This document supersedes both.
---
## 1. Research question
Verbatim from `README.md`:
> Community-built VP9 / AV1 software-decode back-end running on the
> VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
> Compute Module 5), via the existing Mesa `v3d` userspace driver.
The load-bearing claim: *the QPU is programmable by us, on stock
production hardware, and the codec back-end is a workload class
where that programmability buys CPU time on the A76 cluster.*
Phase 0's job is to test that claim before Phase 1 binds a metric.
## 2. Substrate inventory — hertz
Captured live 2026-05-17 via SSH. Full `vulkaninfo` in
`vulkaninfo_v3d_7_1_7_hertz.txt`.
| Item | Value |
|---|---|
| Host | hertz, Pi 5, 8 GB, eMMC + 1 TB SATA |
| Role | LXD host for 11 containers (home-LAN spine — DNS / VPN / HA proxy / NCP / SMTP) |
| OS | Debian 13 Trixie |
| Kernel | `6.12.75+rpt-rpi-2712` (RPi Foundation kernel, 2026-03-11) |
| CPU | 4× Cortex-A76 @ 2.8 GHz |
| GPU clock | V3D 7.1 @ 1000 MHz (slight OC; spec 960 MHz) |
| Mesa | `25.0.7-2+rpt4` (`libvulkan_broadcom.so` v3dv ICD) |
| Vulkan loader | `1.4.309` |
| Vulkan device API | 1.3.305 (conformance 1.3.8.3) |
| DRM nodes | `card0 → v3d` (compute target), `card1 → vc4-drm` (display), `renderD128` |
| kernel uAPI hdr | `/usr/include/drm/v3d_drm.h` present |
| Build tools | cmake 3.31, ninja 1.12, libvulkan-dev 1.4.309, glslang-tools 15.1.0, spirv-tools 2025.1, libdrm-dev 2.4.131 (installed 2026-05-17) |
| User groups | mfritsche ∈ `render`, `video`, `lxd`, `sudo` |
| Memory pressure | 7.9 GiB RAM, ~3 GiB available; 6 GiB zram, ~2.8 GiB in use (cohabitation with LXD spine) |
| Watchdog | yes — power-cut reboot via Himbeere plug if hertz crashes (acknowledged dev cost: household DNS/VPN drops during each reboot cycle) |
**Inside-view V3D 7.1 compute envelope** (from
`vulkaninfo_v3d_7_1_7_hertz.txt`):
| Property | Value | Implication |
|---|---|---|
| `maxStorageBufferRange` | 1 GiB | Bounds single-tensor size; codec working sets (frames, planes) fit trivially |
| `maxPerStageDescriptorStorageBuffers` | 8 | Forces ≤8 SSBO bindings per dispatch — ggml-vulkan binds more, doesn't fit |
| `maxComputeSharedMemorySize` | 16 KiB | Small tiled kernels only; codec block work (8×8, 16×16) fits easily |
| `maxComputeWorkGroupInvocations` | 256 | Standard |
| `maxComputeWorkGroupSize` | 256 / 256 / ? | Standard |
| `subgroupSize` | 16 (fixed) | Matches QPU SIMD width |
| `subgroupSupportedOperations` | BASIC + VOTE + BALLOT + SHUFFLE + SHUFFLE_RELATIVE + QUAD | No arithmetic reductions — accumulate via shared memory. `subgroupShuffle` *is* available (corrected per phase5.md finding 6 — earlier text said BASIC+VOTE only). |
| `shaderFloat16` | **false** | Storage only; arithmetic runs FP32 |
| `shaderInt8` | **false** | Storage only; arithmetic on widened ints |
| `shaderInt16` | **false** | Same |
| `storageBuffer8/16BitAccess` | true | Can load tightly-packed quantized / packed pixel data |
| `subgroupSizeControl`, `computeFullSubgroups`, `synchronization2` | true | Modern compute features available |
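The limits in the table above can be turned into a quick pre-flight check for a kernel plan. The following is a minimal sketch, not part of the project: the function name `check_plan` and the example kernel shape (3 SSBOs, 64 invocations, four 8×8 blocks of 32-bit values staged in shared memory) are illustrative assumptions; only the limit values come from the `vulkaninfo` capture.

```python
# Limits copied from the compute-envelope table above (vulkaninfo on hertz).
LIMITS = {
    "max_ssbo_bindings": 8,         # maxPerStageDescriptorStorageBuffers
    "max_shared_bytes": 16 * 1024,  # maxComputeSharedMemorySize
    "max_invocations": 256,         # maxComputeWorkGroupInvocations
}


def check_plan(ssbo_bindings, shared_bytes, invocations, limits=LIMITS):
    """Return a list of violated limits; an empty list means the plan fits."""
    problems = []
    if ssbo_bindings > limits["max_ssbo_bindings"]:
        problems.append(f"SSBO bindings {ssbo_bindings} > {limits['max_ssbo_bindings']}")
    if shared_bytes > limits["max_shared_bytes"]:
        problems.append(f"shared memory {shared_bytes} B > {limits['max_shared_bytes']} B")
    if invocations > limits["max_invocations"]:
        problems.append(f"invocations {invocations} > {limits['max_invocations']}")
    return problems


# Hypothetical plan: four 8x8 blocks of 32-bit values per workgroup.
example_shared_bytes = 4 * 8 * 8 * 4  # 1024 bytes
example_problems = check_plan(ssbo_bindings=3,
                              shared_bytes=example_shared_bytes,
                              invocations=64)
print(example_problems or "plan fits the V3D 7.1 envelope")
```

A block-sized working set sits far below the 16 KiB shared-memory limit, which is the point the "Implication" column makes for codec work.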
**Throughput envelopes** (from prior community measurements,
not yet re-confirmed in-session):
| Metric | Value | Source |
|---|---|---|
| V3D 7.1 theoretical FP32 peak | ~92 GFLOPS at 960 MHz | 12 QPU × 4 ALU × 2 op/cycle |
| Direct-DRM SGEMM sustained | 21.4 GFLOPS (~23%) | `Idein/py-videocore7` |
| Vulkan-compute `vkpeak` fp32-vec4 | 6.9 GFLOPS (~7.5%) | RPi forum benchmark thread |
| A76 NEON sustained for matmul | ~50 GFLOPS | Multiple benchmark sources |
| Shared LPDDR4x bus | ~17 GB/s nominal | LPDDR4x-4267 × 32 bit / 8 |
| GPU-measured BW share | 4–7 GB/s | py-videocore7 scopy benchmark |
| CPU NEON BW achievable | 12–15 GB/s | Pi 5 STREAM benchmarks |
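The derived figures in this table follow from the formulas in its "Source" column. A short sketch of that arithmetic is below; the helper names are made up for illustration, and the inputs are the table's own numbers, not new measurements.

```python
def peak_gflops(qpus, alus_per_qpu, ops_per_cycle, clock_mhz):
    """Theoretical FP32 peak: QPUs x ALUs x ops/cycle x clock."""
    return qpus * alus_per_qpu * ops_per_cycle * clock_mhz / 1000.0


def bus_gb_per_s(mega_transfers_per_s, bus_bits):
    """Nominal DRAM bandwidth: transfer rate x bus width in bytes."""
    return mega_transfers_per_s * (bus_bits / 8) / 1000.0


peak = peak_gflops(qpus=12, alus_per_qpu=4, ops_per_cycle=2, clock_mhz=960)
bus = bus_gb_per_s(mega_transfers_per_s=4267, bus_bits=32)
sgemm_fraction = 21.4 / peak   # direct-DRM SGEMM vs theoretical peak
vkpeak_fraction = 6.9 / peak   # Vulkan vkpeak fp32-vec4 vs theoretical peak

print(f"peak   = {peak:.2f} GFLOPS")
print(f"bus    = {bus:.2f} GB/s")
print(f"SGEMM  = {sgemm_fraction:.1%} of peak")
print(f"vkpeak = {vkpeak_fraction:.1%} of peak")
```

The same formula at hertz's 1000 MHz clock gives 96 GFLOPS, so the percentages quoted against the 960 MHz peak are slightly generous for this particular board.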
## 3. Path A — closed
**Custom VPU firmware loaded onto VC7 scalar cores.** This was the
README's original framing.
Blocked at the silicon-RoT step:
- **BCM2712 mask ROM hardcodes RPi's public key** and unconditionally
verifies the second-stage bootloader (`bootsys`) on every boot
path (SPI flash, USB rpiboot, SD recovery). RPi holds the
corresponding private key.
- The `EXECUTE_CODE` mailbox tag (the only documented Pi 1–4 runtime
"run code on a VPU core" mechanism) is **confirmed removed on Pi 5**
by a Pi Foundation engineer (forum.raspberrypi.com).
- Pre-CRA EEPROM downgrade is possible (no anti-rollback fuse) but
only yields *older RPi-signed* EEPROMs — doesn't help.
- OTP fuse state on stock CM5 is already the most permissive
possible (customer key hash = zero); the RPi-key check is
silicon-unconditional, not gated by OTP.
- CM5 vs retail Pi 5: same silicon, same chain, no meaningful
security delta.
- One non-software escape exists: VPU JTAG via documented test
points (`schlae/cm5-reveng`, Dec 2025). Hardware mod only,
sealed-chassis higgs not the dev unit, novel research with no
published firmware-injection workflow. Out of scope for this
project.
Verdict: **structurally blocked for community use without RPi
cooperation or hardware-RE-grade work on a sacrificial CM5.**
## 4. Path B — open
**QPU compute kernels via the existing Mesa `v3d` driver.** Reachable
from userspace today on a stock signed Pi 5 / CM5 via
`/dev/dri/card0` (Vulkan compute through `v3dv`) or `renderD128`
(direct DRM submit, py-videocore7 style). No firmware loading.
No signing fight. mfritsche on hertz is in the `render` group and
can hit the device without sudo.
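The "reachable without sudo" claim is easy to re-check on any host. This is a minimal sketch using only the Python standard library; the function name is illustrative, and the node paths are the ones listed in the hertz inventory (the `card0` to `v3d` mapping may differ on other machines).

```python
import os


def drm_node_access(paths=("/dev/dri/card0", "/dev/dri/renderD128")):
    """Map each DRM node path to True if this process may open it read/write."""
    return {
        path: os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)
        for path in paths
    }


if __name__ == "__main__":
    for path, ok in drm_node_access().items():
        print(f"{path}: {'accessible' if ok else 'not accessible'}")
```

On a host where the user is not in the `render` or `video` group, the nodes exist but the check reports them as not accessible.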
The substrate is real:
- `Idein/py-videocore7` runs SGEMM at 21 GFLOPS sustained on stock
Pi 5 with no special setup — existence proof of arbitrary QPU
programs.
- Mesa v3dv is Vulkan 1.3-conformant on V3D 7.1 (Mesa 24.3+;
hertz runs 25.0.7).
- The kernel `v3d` DRM driver is fully upstream and open.
Phase 0 does **not** assume Path B leads to a winning result. It
asserts only that Path B is *reachable*, where Path A isn't.
## 5. Why this isn't the same project as "v3d backend for llama.cpp"
A llama.cpp v3d backend was investigated mid-session and rejected
as structurally infeasible. The verdict was decisive: GPU loses
to CPU on raw FP32 (21 vs ~50 GFLOPS), on memory bandwidth share
(4–7 vs 12–15 GB/s), and on quantized instruction support (no
INT8 MAC vs A76 SDOT/UDOT). For LLM matmul, the QPU is the wrong
substrate.
**Codec back-end work is a different workload class** with
properties that fit the QPU substantively better:
| Property | LLM matmul | Codec back-end (post-entropy) |
|---|---|---|
| Working set per dispatch | Whole weight matrices (GB) | Per-block (8×8 / 16×16, hundreds of bytes) — fits in 16 KiB shared mem |
| Dominant op | INT8 MAC | Integer add / shift / small-constant multiply |
| Why GPU misses | No INT8 MAC | Less impact — fewer multiplies, mostly add/shift |
| Memory pattern | Full-tensor stream | Sequential plane reads, TMU-friendly |
| Parallelism | One big GEMM | Thousands of independent small blocks per frame |
| A76 advantage | NEON SDOT/UDOT crushing it | Less specialized; QPU advantage real |
| Bandwidth-bound? | Yes (kills the GPU) | Compute-bound at block scale |
This is the load-bearing reframe between the failed llama.cpp
investigation and the daedalus-fourier scope. Codec back-end
*might* live on the QPU. Phase 1 measures whether it actually does.
## 6. Honest probability assessment
A competent outside reviewer should rate the project as **hard but
viable**, with one concrete prior precedent (MulticoreWare /
Imagination PowerVR OpenCL VP9 decoder, 2014, achieved 1080p30 in
a hybrid model with CPU entropy + GPU back-end on a comparable
embedded GPU) and one concrete recent failure (FFmpeg 8.0 VP9-on-
Vulkan-compute, 2025, produced corrupted output on a much more
capable NVIDIA target — but the failure was in the *attempt to
move entropy onto GPU*, not the back-end).
The win condition is **not** "GPU beats CPU at the same work." The
win condition is **"GPU work overlaps with CPU work that has to
happen anyway"** — concurrent decode where ARM does entropy and
the QPU finishes the block-level back-end on the previous frame,
recovering CPU time for the rest of the system (browser, audio,
UI, the 11 LXD containers on hertz).
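The overlap argument can be made concrete with a toy steady-state model. The sketch below is an illustration only: all timings are hypothetical placeholders, not measurements, and the model ignores dispatch overhead and the bandwidth contention raised in the open questions.

```python
def per_frame_ms(cpu_entropy_ms, backend_on_cpu_ms, backend_on_qpu_ms):
    """Compare an all-CPU decode with a one-frame-lag CPU+QPU pipeline."""
    serial_frame_ms = cpu_entropy_ms + backend_on_cpu_ms
    # In steady state the slower pipeline stage sets the frame period.
    pipelined_frame_ms = max(cpu_entropy_ms, backend_on_qpu_ms)
    return {
        "serial_frame_ms": serial_frame_ms,
        "pipelined_frame_ms": pipelined_frame_ms,
        # CPU time per frame no longer spent on the back-end.
        "cpu_ms_recovered": backend_on_cpu_ms,
    }


# Hypothetical numbers: the QPU is slower than the CPU at the back-end
# (14 ms vs 8 ms), yet the pipeline still frees CPU time.
model = per_frame_ms(cpu_entropy_ms=10, backend_on_cpu_ms=8, backend_on_qpu_ms=14)
print(model)
```

With these placeholder inputs the QPU loses the head-to-head comparison but the A76 cluster still gets 8 ms per frame back, which is the win condition as stated above.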
Phase 1 measures the building block: one kernel, bit-exact, with
numbers. Phase 2+ only if Phase 1 numbers justify it.
## 7. Open questions for Phase 1
1. **What's the actual single-kernel QPU throughput on a
codec-shaped workload?** SGEMM at 21 GFLOPS is the only public
number, and SGEMM is not block-IDCT-shaped. We need an in-session
N=3 measurement on a real codec kernel.
2. **What's the ARM NEON baseline for the same kernel on the same
hertz?** libavcodec ships highly-tuned NEON paths for IDCT,
deblocking, etc. Without measuring NEON in-session, "the QPU
wins" or "the QPU loses" is unverifiable.
3. **Vulkan compute vs direct DRM submit — which path?** Vulkan
has tooling, documentation, debuggability. Direct DRM has
~10–15% lower per-dispatch overhead and bypasses the
v3dv-imposed 16 KiB shared-mem / 8-SSBO limits, at the cost
of writing QPU asm against the NDA ISA. Phase 1 picks one.
4. **Memory bandwidth contention with concurrent ARM decode.**
The shared ~17 GB/s bus is the hard limit. If the QPU and ARM NEON
contend for bandwidth while running concurrently, the "concurrent
work" win disappears. Needs in-session measurement once any kernel exists.
5. **VC7 thermal headroom under sustained mixed CPU+GPU load.**
Pi 5 throttles GPU at 85°C, CPU at 80°C. hertz idles at ~64°C
with the LXD spine; mixed compute will push higher. Whether hertz
needs active cooling under that load is an open question.
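Questions 1 and 2 both call for in-session N=3 timings of the same kernel on two back-ends. A minimal timing-harness sketch is below; `measure` and `speed_ratio` are hypothetical helper names, and the workload passed in is a stand-in callable, not a real IDCT or Vulkan dispatch.

```python
import statistics
import time


def measure(fn, runs=3, warmup=1):
    """Time fn() `runs` times after `warmup` untimed calls; report the spread."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


def speed_ratio(neon_seconds, qpu_seconds):
    """Greater than 1.0 means the QPU path finished faster than the NEON path."""
    return neon_seconds / qpu_seconds


# Stand-in workload so the sketch runs anywhere.
timing = measure(lambda: sum(range(100_000)))
print(timing)
```

Reporting min, median, and max together keeps a single noisy run on a shared host from being mistaken for a result.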
These are Phase 1's burden, not Phase 0's. Phase 0 closes here.
## 8. Sources
Earlier session's web research produced ~7000 words of substrate
references across 6 parallel threads. The full source list lived
in the deleted `phase0_findings.md` and `phase0_wall1_bypass.md`.
The high-value pointers that should follow this project forward:
- [Mesa `src/broadcom/qpu/qpu_instr.h`](https://github.com/Mesa3D/mesa/blob/main/src/broadcom/qpu/qpu_instr.h) — de-facto VC7 QPU ISA reference (no Broadcom-published doc; ISA under NDA)
- [Mesa `src/broadcom/compiler/`](https://github.com/Mesa3D/mesa/tree/main/src/broadcom/compiler) — NIR→QPU compiler, the open ground truth for what V3D 7.1 can do
- [`Idein/py-videocore7`](https://github.com/Idein/py-videocore7) — working QPU GPGPU runtime via DRM; SGEMM benchmark; existence proof
- [`Towdo/py-videocore7`](https://github.com/Towdo/py-videocore7) — fork with more fixes
- [Mesa `v3dv` driver source](https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/broadcom/vulkan) — Vulkan compute path
- [Pi 5 HEVC kernel driver patch series](https://patchwork.kernel.org) — closest architectural template for ARM-side V4L2 stateless wrapping a Pi-5 hardware accelerator (search "rpi-hevc-dec")
- [raspberrypi/usbboot secure-boot.md](https://github.com/raspberrypi/usbboot/blob/master/docs/secure-boot.md) — Wall 1 silicon-RoT confirmation
- [schlae/cm5-reveng](https://github.com/schlae/cm5-reveng) — CM5 PCB RE; VPU JTAG test points (Dec 2025; out of Path B scope, kept as escape hatch reference)
- [MulticoreWare / Imagination PowerVR VP9 OpenCL decoder press](https://www.design-reuse.com/news/34030/vp9-decoder-imagination-powervr-series6-gpus.html) — 2014 precedent for hybrid codec back-end on embedded GPU compute
- [FFmpeg 8.0 part-3 VP9 Vulkan failure post](https://www.rendi.dev/blog/ffmpeg-8-0-part-3-failed-attempts-to-use-vulkan-for-av1-encoding-vp9-decoding) — recent cautionary tale; failure was in entropy stage, not back-end
- [`Halide/Halide` Vulkan Pi 5 issue #8494](https://github.com/halide/Halide/issues/8494) — known runtime edge cases on Pi 5 Vulkan
- [Pi Forum p=2330030](https://forums.raspberrypi.com/viewtopic.php?p=2330030) — RPi engineer confirms VC7 ISA NDA + EU CRA signing rationale
Future phases should add citations here as they're consumed, not
re-derive Phase 0's substrate findings.