dcbbc77038
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:

  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1  bit-exact               100.0000 % (10000/10000)
              M3  NEON IDCT8 single       8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit     22.66 us
              M5b 1-WG noop dispatch      55.60 us
              M5  delta                   32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
178 lines
7.6 KiB
Markdown
# daedalus-fourier

Community-built VP9 / AV1 software-decode back-end running on the
VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
Compute Module 5), via the existing Mesa `v3d` userspace driver.
ARM keeps the serial entropy front-end; the QPU takes the parallel
back-end (inverse transforms, deblocking, CDEF, loop restoration,
MC residual add).

> Daedalus built the Labyrinth for King Minos, then escaped from it
> by hand-forging flight firmware out of feathers and wax when no
> sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the
Labyrinth; the Pi Foundation's "use the HEVC block and live with
software decode for everything else" is the official non-exit;
the QPU sits unused inside the labyrinth's walls.

**Status: Phase 0 closed (substrate audit). Phase 1 in progress
(first-kernel proof on hertz).** This is research-track work that
may take months or may yield a single proof-of-concept kernel that
loses to ARM NEON, in which case the negative result ships and the
project closes.

## Why this exists

higgs is a Raspberry Pi Compute Module 5 in a small portable
chassis with a battery. Watching nerds review *Star Wars* on YouTube
while putting Mac Studios into virtual shopping baskets is a
core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer
high-bitrate / high-resolution content). It does not serve HEVC.
Pi 5's BCM2712 has one HW decoder block: HEVC. The intersection
of {what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

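The empty intersection can be checked mechanically. The sketch below only restates the codec lists from the paragraph above as sets; the variable and function names are illustrative and not part of the project.

```python
# Codec lists exactly as stated in the README paragraph above.
youtube_codecs = {"H.264", "VP9", "AV1"}
bcm2712_hw_decoders = {"HEVC"}


def hw_decodable(served, hw_blocks):
    """Return the codecs that are both served and decodable in hardware."""
    return served & hw_blocks


if __name__ == "__main__":
    overlap = hw_decodable(youtube_codecs, bcm2712_hw_decoders)
    if overlap:
        print("hardware-decodable:", sorted(overlap))
    else:
        print("no overlap: every YouTube stream is software-decoded")
```
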
Every YouTube frame on higgs today is software-decoded on Cortex-A76
cores at ~50–90% CPU per video stream. Offloading the parallel
back-end of that decode to the otherwise-idle QPU complex *might*
recover meaningful CPU time and battery on higgs. The honest
prior — measured in Phase 0 — is that the QPU has roughly equal
raw compute to the A76 cluster but a smaller slice of the shared
LPDDR4x bandwidth, so the win, if any, comes from offloading
*concurrent* work the CPU would have done anyway.

The Pi Foundation isn't going to do this work (per their own
statement: chromium-patch sustainment was too much; codec
sustainment would be more so). The kernel `rpi-hevc-dec` series has
been 17 months in review for one decoder block they DID write
themselves. Whatever ships here ships through the community.

## Architecture (Path B)

Phase 0 evaluated two paths and closed one of them:

- **Path A — custom VPU firmware on the VC7 scalar cores.**
  Blocked. BCM2712 has a silicon root of trust: the mask ROM
  hardcodes RPi's public key and unconditionally verifies the
  second-stage bootloader. The `EXECUTE_CODE` mailbox was removed
  on Pi 5. No software-only bypass exists. See `docs/phase0.md §3`.

- **Path B — QPU compute kernels via the existing Mesa `v3d` /
  DRM / Vulkan-compute path.** This is the path. The QPU is
  reachable from userspace today on a stock signed Pi 5 / CM5
  via `/dev/dri/card0`. No firmware loading. No signing fight.
  `Idein/py-videocore7` (SGEMM 21 GFLOPS sustained) is the
  existence proof.

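A quick way to confirm the Path B entry point on a given machine is to ask sysfs which kernel driver each DRM card is bound to. This is a minimal sketch, not project code: the helper names are hypothetical, and it assumes the standard Linux layout in which `/sys/class/drm/cardN/device/driver` is a symlink whose last path component is the driver name (`v3d` for the render block). Card numbering may differ between systems, so the sketch scans all cards instead of assuming `card0`.

```python
import os


def drm_driver(card="card0", sysfs_root="/sys/class/drm"):
    """Return the kernel driver bound to a DRM card, or None if unknown."""
    link = os.path.join(sysfs_root, card, "device", "driver")
    try:
        # The symlink target ends in the driver name, e.g. ".../drivers/v3d".
        return os.path.basename(os.readlink(link))
    except OSError:
        return None


def cards_with_driver(driver="v3d", sysfs_root="/sys/class/drm"):
    """List card nodes (card0, card1, ...) bound to the given driver."""
    try:
        entries = sorted(os.listdir(sysfs_root))
    except OSError:
        return []
    # Keep plain cardN entries; skip connectors such as card1-HDMI-A-1.
    cards = [e for e in entries if e.startswith("card") and e[4:].isdigit()]
    return [c for c in cards if drm_driver(c, sysfs_root) == driver]


if __name__ == "__main__":
    for card in cards_with_driver("v3d"):
        print(f"/dev/dri/{card} is bound to v3d")
```

On a machine without a V3D device the script prints nothing, which is itself a useful first check before debugging anything in Vulkan.
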
The build:

```
┌───────────────────────────────┐
│  userspace VP9 / AV1 decoder  │
│  (fork of dav1d / libvpx)     │
├───────────────────────────────┤
│  ARM: entropy decode          │  ← Cortex-A76 + NEON
│  (bool / arithmetic coder)    │    structurally serial
├───────────────────────────────┤
│  QPU: parallel back-end       │  ← V3D 7.1 via Mesa v3dv
│  (IDCT, CDEF,                 │    Vulkan compute shaders
│   deblock, LR, MC)            │    or direct DRM submit
├───────────────────────────────┤
│  V4L2 stateless wrapper       │  ← out-of-tree kernel module
│  (eventual, kernel-agent)     │    exposing /dev/videoN
└───────────────────────────────┘
```

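The split in the diagram only pays off if the two stages overlap: the ARM cores parse frame n+1 while the QPU reconstructs frame n. The sketch below is a toy timing model of that overlap, not a measurement. The stage times are hypothetical placeholders, and it assumes the entropy stage of a frame never has to wait for reconstructed pixels from the previous frame.

```python
def serial_total_ms(frames, cpu_ms, qpu_ms):
    """Total time when each frame runs front-end then back-end, no overlap."""
    return frames * (cpu_ms + qpu_ms)


def pipelined_total_ms(frames, cpu_ms, qpu_ms):
    """Total time for a two-stage pipeline with one frame in flight per stage.

    The first frame pays both stages; every later frame costs only the
    slower of the two stages.
    """
    if frames <= 0:
        return 0.0
    return cpu_ms + qpu_ms + (frames - 1) * max(cpu_ms, qpu_ms)


if __name__ == "__main__":
    # Hypothetical per-frame stage times, chosen only for illustration.
    cpu_ms, qpu_ms = 4.0, 6.0
    for n in (1, 30, 300):
        print(n, serial_total_ms(n, cpu_ms, qpu_ms),
              pipelined_total_ms(n, cpu_ms, qpu_ms))
```

In this model the steady-state cost per frame approaches `max(cpu_ms, qpu_ms)`, so a QPU back-end that is slower than NEON in isolation can still shorten the frame time as long as it is not slower than the combined serial path.
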
The first deliverable is *not* the V4L2 wrapper. The first
deliverable is one back-end kernel running on the QPU, bit-exact
against a libavcodec reference, with measured throughput. If that
single kernel can neither beat NEON nor get within 50% of its
throughput, the project closes here with a documented negative
result.

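To make the go / no-go rule concrete, the sketch below combines it with the Phase 3 baseline figures recorded in the commit message (NEON IDCT8 at 122.4 ns/block, a 1-workgroup noop dispatch at 55.60 us). It is an illustration under stated assumptions, not the decision rule from `docs/phase1.md`: "within 50%" is read here as "at most twice the NEON cost per block", the per-block QPU kernel time is a hypothetical placeholder, and dispatch overhead is assumed to amortize evenly over the blocks in one dispatch.

```python
import math

NEON_NS_PER_BLOCK = 122.4       # M3: NEON IDCT8, Phase 3 baseline
DISPATCH_OVERHEAD_NS = 55_600   # M5b: 1-WG noop dispatch, 55.60 us


def blocks_per_frame(width, height, block=8):
    """Number of block x block tiles needed to cover a frame."""
    return math.ceil(width / block) * math.ceil(height / block)


def effective_ns_per_block(kernel_ns, batch_blocks,
                           overhead_ns=DISPATCH_OVERHEAD_NS):
    """Per-block cost once one dispatch's overhead is spread over a batch."""
    return kernel_ns + overhead_ns / batch_blocks


def verdict(qpu_ns, neon_ns=NEON_NS_PER_BLOCK):
    """Assumed reading of the rule: continue if QPU cost <= 2x NEON cost."""
    return "continue" if qpu_ns <= 2.0 * neon_ns else "close"


if __name__ == "__main__":
    kernel_ns = 100.0  # hypothetical QPU kernel time per block
    for batch in (1, 454, blocks_per_frame(1920, 1080)):
        cost = effective_ns_per_block(kernel_ns, batch)
        print(batch, round(cost, 1), verdict(cost))
```

The ratio `DISPATCH_OVERHEAD_NS / NEON_NS_PER_BLOCK` is about 454, which is where the "~455x" figure in the commit message comes from. A single 1080p frame holds 32400 8×8 blocks, so frame-level batching brings the amortized overhead down to under 2 ns per block.
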
## In scope

- A small set of codec back-end kernels (IDCT 8×8, CDEF, deblocking,
  loop restoration filter, MC interpolation) compiled as SPIR-V
  compute shaders for Mesa `v3dv`, dispatched via Vulkan compute
  from userspace.
- A test harness on hertz that runs each kernel against libavcodec
  reference outputs and measures throughput (megapixels/sec or
  blocks/sec) against the equivalent NEON path.
- Phase 1 = one kernel, bit-exact, with numbers. Phase 2+ = more
  kernels only if Phase 1 numbers justify it.

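The two things the harness reports, bit-exactness and throughput, reduce to a few lines of arithmetic. The sketch below shows one plausible shape for that reporting and is not the project's harness (which, per the commit message, lives in C under `tests/`). The function names are hypothetical, and blocks are modeled as plain Python lists of coefficients.

```python
def bit_exact_report(candidate_blocks, reference_blocks):
    """Compare outputs block by block; return (matches, total, percent)."""
    if len(candidate_blocks) != len(reference_blocks):
        raise ValueError("candidate and reference block counts differ")
    total = len(reference_blocks)
    if total == 0:
        raise ValueError("no blocks to compare")
    matches = sum(1 for c, r in zip(candidate_blocks, reference_blocks)
                  if c == r)
    return matches, total, 100.0 * matches / total


def ns_per_block(blocks_per_sec):
    """Convert a throughput in blocks/s to a cost in ns/block."""
    return 1e9 / blocks_per_sec


def megapixels_per_sec(blocks_per_sec, block_w=8, block_h=8):
    """Convert blocks/s to megapixels/s for a given block size."""
    return blocks_per_sec * block_w * block_h / 1e6


if __name__ == "__main__":
    matches, total, pct = bit_exact_report([[1, 2], [3, 4]], [[1, 2], [3, 4]])
    print(f"bit-exact {pct:.4f} % ({matches}/{total})")
    # NEON baseline from the commit message: 8.171 Mblock/s.
    print(round(ns_per_block(8.171e6), 1), "ns/block")
    print(round(megapixels_per_sec(8.171e6), 1), "Mpixel/s")
```

Reporting both units matters because blocks/sec is only comparable between kernels that share a block size, while megapixels/sec lets an 8×8 IDCT be compared with filters that work on other tile shapes.
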
## Out of scope (for now)

- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
  budget. Path B *could* extend but isn't the priority.
- Encode. Pi Foundation removed all HW encode in Pi 5; encode on
  VC7 is a separate, larger project.
- Custom VPU firmware (Path A — blocked by silicon RoT, see
  `docs/phase0.md`).
- V4L2 stateless driver wrapping the userspace decoder. Eventual
  consumption point, but Phase 1 lives entirely in userspace.
- Beating ARM NEON unconditionally. The honest target is
  *concurrent* work: QPU runs while CPU does something else.

## Dev substrate

- **hertz** (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712,
  Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement
  host. Watchdog-protected for crash recovery. See
  `docs/vulkaninfo_v3d_7_1_7_hertz.txt` for the inside-view device
  profile.
- **higgs** (CM5 in portable battery chassis) — the eventual user
  target. Not a dev unit; sealed chassis.

## Conventions

This project follows the 9(+1)-phase dev process. See
`docs/dev_process.md`. Phase 0 is closed (`docs/phase0.md`);
Phase 1 is `docs/phase1.md`.

Gitea identity: `claude-noether` (per
`feedback_gitea_as_claude_noether.md`). No `marfrit` pushes from
Claude sessions.

## Layout

```
daedalus-fourier/
├── README.md                    ← this file
├── docs/
│   ├── dev_process.md           ← reference copy of the 9(+1)-phase loop
│   ├── phase0.md                ← substrate audit (closes Path A, opens Path B)
│   ├── phase1.md                ← first-kernel goal + measurement plan
│   └── vulkaninfo_v3d_7_1_7_hertz.txt
│                                ← inside-view device profile from hertz
├── src/                         ← kernels + Vulkan dispatch harness
└── tests/                       ← bit-exact vs libavcodec, throughput
```

No build system yet. Adding CMake when the first kernel lands.

## Sibling projects in the same orbit

- `libva-v4l2-request-fourier` — VA-API consumer-side backend.
  Eventual consumer if daedalus produces a V4L2 stateless node.
- `firefox-fourier` — Firefox fork that routes stateless V4L2
  through libavcodec's `v4l2_request` hwaccel. Same pickup point.
- `chromium-fourier` — sibling for Chromium.
- `kernel-agent` — would house the V4L2 driver wrapping the
  userspace decoder, once one exists.
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
  (rkvdec / vpu981). Provides the userspace conformance harness
  daedalus reuses for VC7-AV1 verification.

## Source attribution

Daedalus-the-myth is public domain. The wax-and-feathers
metaphor is older than software engineering.

Anyone wanting to fail at this project: please file your failures
under `branches/icarus/`. Built-in self-deprecation slot, with
honor.