marfrit dcbbc77038 Path B pivot + Phase 0-3 closed with first baseline numbers
This is a from-scratch initial commit on a fresh .git. The original
scaffold commit (7510b56) and the earlier session's working-tree
docs were lost in a 2026-05-18 10:25 working-tree wipe; the corrupted
.git is preserved at .git-broken-2026-05-18/ (gitignored) for
forensic inspection.

Scope re-anchored from Path A (custom VPU firmware on VC7 scalar
cores; blocked by BCM2712 silicon-RoT mask-ROM signature check)
to Path B (QPU compute kernels via Mesa v3d / Vulkan compute or
direct DRM, on stock signed Pi 5 / CM5). See README.md and
docs/phase0.md for the substrate audit that closed Path A.

Phases closed:
  Phase 0 — substrate audit; Path A blocked, Path B open;
            codec-back-end-fits-QPU finding (docs/phase0.md)
  Phase 1 — first kernel locked (VP9 / AV1 8x8 inverse DCT) with
            publish-before-measure R = M2/M3 decision rules
            (docs/phase1.md)
  Phase 2 — reference impls mapped; FFmpeg n7.1.3 source vendored
            under external/ffmpeg-snapshot/ (PROVENANCE.md pins
            commit f46e514 + per-file SHA-256s) (docs/phase2.md)
  Phase 3 — real baseline measurements on hertz (docs/phase3.md):
              M1 bit-exact            100.0000 % (10000/10000)
              M3 NEON IDCT8 single    8.171 Mblock/s (122.4 ns/block)
              M5a empty Vulkan submit 22.66 us
              M5b 1-WG noop dispatch  55.60 us
              M5 delta                32.95 us/dispatch
            => per-dispatch overhead is ~455x per-NEON-block cost;
               Phase 4 must batch at frame level or close to it.

Build harness in place: CMakeLists.txt + tests/{bench_neon_idct.c,
vp9_idct8_ref.c, bench_vulkan_dispatch.c, shaders/noop.comp} +
external/ffmpeg-snapshot/config.h shim (7 defines + EXTERN_ASM).
Builds clean on Debian Trixie aarch64 with cmake 3.31, ninja 1.12,
libvulkan-dev 1.4.309, glslang-tools 15.1.0. Vendored FFmpeg .S
assembles via the config.h shim.

Next: Phase 4 (plan first QPU IDCT kernel under the M5 batching
constraint) -> Phase 5 second-model review -> Phase 6 implement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:12 +00:00

# daedalus-fourier
Community-built VP9 / AV1 software-decode back-end running on the
VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
Compute Module 5), via the existing Mesa `v3d` userspace driver.
ARM keeps the serial entropy front-end; the QPU takes the parallel
back-end (inverse transforms, deblocking, CDEF, loop restoration,
MC residual add).
> Daedalus built the Labyrinth for King Minos, then escaped from it
> by hand-forging flight firmware out of feathers and wax when no
> sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the
Labyrinth; the Pi Foundation's "use the HEVC block and live with
software decode for everything else" is the official non-exit;
the QPU sits unused inside the labyrinth's walls.

**Status: Phase 0 closed (substrate audit). Phase 1 in progress
(first-kernel proof on hertz).** This is research-track work that
may take months or may yield a single proof-of-concept kernel that
loses to ARM NEON, in which case the negative result ships and the
project closes.
## Why this exists
higgs is a Raspberry Pi Compute Module 5 in a small portable
chassis with a battery. Watching nerds review *Star Wars* on YouTube
while putting Mac Studios into virtual shopping baskets is a
core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer
high-bitrate / high-resolution content). It does not serve HEVC.
Pi 5's BCM2712 has one HW decoder block: HEVC. So
{what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

Every YouTube frame on higgs today is software-decoded on Cortex-A76
cores at ~50-90% CPU per video stream. Offloading the parallel
back-end of that decode to the otherwise-idle QPU complex *might*
recover meaningful CPU time and battery on higgs. The honest
prior, measured in Phase 0, is that the QPU has roughly equal
raw compute to the A76 cluster but a smaller slice of the shared
LPDDR4x bandwidth, so the win, if any, comes from offloading
*concurrent* work the CPU would have done anyway.

The Pi Foundation isn't going to do this work (per their own
statement: chromium-patch sustainment was too much; codec
sustainment would be more so). The kernel `rpi-hevc-dec` series has
been 17 months in review for one decoder block they *did* write
themselves. Whatever ships here ships through the community.
## Architecture (Path B)
Phase 0 evaluated two paths:
- **Path A — custom VPU firmware on the VC7 scalar cores.**
Blocked. BCM2712 has a silicon root of trust: the mask ROM
hardcodes RPi's public key and unconditionally verifies the
second-stage bootloader. The `EXECUTE_CODE` mailbox call was
removed on Pi 5. No software-only bypass exists. See
`docs/phase0.md §3`.
- **Path B — QPU compute kernels via the existing Mesa `v3d` /
DRM / Vulkan-compute path.** This is the path. The QPU is
reachable from userspace today on a stock signed Pi 5 / CM5
via `/dev/dri/card0`. No firmware loading. No signing fight.
`Idein/py-videocore7` (SGEMM 21 GFLOPS sustained) is the
existence proof.
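
As a quick sanity check of the "reachable from userspace" claim, the sketch below lists DRM nodes bound to a given kernel driver by reading sysfs. It is a minimal illustration, not part of this repository: the helper name `find_drm_nodes` is hypothetical, the driver name `v3d` and the sysfs layout under `/sys/class/drm` are assumptions about a stock Pi 5 kernel, and node numbering may differ from `card0` on a given system.

```python
import os


def find_drm_nodes(driver_name="v3d", sysfs_root="/sys/class/drm"):
    """Return /dev/dri paths whose sysfs entry is bound to driver_name.

    Hypothetical helper: assumes each <sysfs_root>/<node>/device/driver
    is a symlink whose final path component is the kernel driver name.
    """
    nodes = []
    if not os.path.isdir(sysfs_root):
        return nodes
    for entry in sorted(os.listdir(sysfs_root)):
        is_node = entry.startswith("card") or entry.startswith("renderD")
        # Connector entries such as "card1-HDMI-A-1" are not device nodes.
        if not is_node or "-" in entry:
            continue
        driver_link = os.path.join(sysfs_root, entry, "device", "driver")
        if not os.path.islink(driver_link):
            continue
        if os.path.basename(os.readlink(driver_link)) == driver_name:
            nodes.append(os.path.join("/dev/dri", entry))
    return nodes


if __name__ == "__main__":
    found = find_drm_nodes()
    if found:
        print("V3D nodes:", ", ".join(found))
    else:
        print("no node bound to the v3d driver was found")
```

On a machine without a V3D device the function returns an empty list rather than failing, which keeps it usable as a pre-flight check before any Vulkan or DRM work.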
The build:
```
┌───────────────────────────────┐
│ userspace VP9 / AV1 decoder │
│ (fork of dav1d / libvpx) │
├───────────────────────────────┤
│ ARM: entropy decode │ ← Cortex-A76 + NEON
│ (Bool coder / ANS) │ structurally serial
├───────────────────────────────┤
│ QPU: parallel back-end │ ← V3D 7.1 via Mesa v3dv
│ (IDCT, CDEF, │ Vulkan compute shaders
│ deblock, LR, MC) │ or direct DRM submit
├───────────────────────────────┤
│ V4L2 stateless wrapper │ ← out-of-tree kernel module
│ (eventual, kernel-agent) │ exposing /dev/videoN
└───────────────────────────────┘
```
The first deliverable is *not* the V4L2 wrapper. The first
deliverable is one back-end kernel running on the QPU, bit-exact
against a libavcodec reference, with measured throughput. If that
single kernel can neither beat NEON nor reach at least 50% of
NEON's throughput, the project closes here with a documented
negative result.
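
To make the go / no-go threshold concrete, the sketch below works through how many blocks one dispatch must carry before the fixed dispatch cost stops dominating. It uses two Phase 3 baseline figures recorded on hertz: 122.4 ns per block for the NEON IDCT8 (M3) and 55.60 us for a one-workgroup noop Vulkan dispatch (M5b). The per-block QPU kernel time is not yet measured, so `qpu_ns_per_block` is a hypothetical input, and the simple "fixed overhead plus linear cost" model is an assumption rather than a measured property of the driver.

```python
import math

# Phase 3 baseline figures from hertz (docs/phase3.md).
NEON_NS_PER_BLOCK = 122.4       # M3: NEON IDCT8, single thread
DISPATCH_OVERHEAD_NS = 55_600   # M5b: one-workgroup noop dispatch


def effective_ns_per_block(batch_blocks, qpu_ns_per_block):
    """Amortized cost per block when one dispatch carries batch_blocks.

    Assumes a fixed per-dispatch overhead plus a linear per-block cost.
    """
    return DISPATCH_OVERHEAD_NS / batch_blocks + qpu_ns_per_block


def min_batch(qpu_ns_per_block, target_fraction_of_neon):
    """Smallest batch that reaches the given fraction of NEON throughput.

    Returns None when the kernel alone already exceeds the time budget,
    in which case no amount of batching helps.
    """
    budget_ns = NEON_NS_PER_BLOCK / target_fraction_of_neon
    headroom_ns = budget_ns - qpu_ns_per_block
    if headroom_ns <= 0:
        return None
    return math.ceil(DISPATCH_OVERHEAD_NS / headroom_ns)


if __name__ == "__main__":
    # Even a kernel with zero compute cost needs this many blocks per
    # dispatch just to match NEON.
    print("free kernel, match NEON:", min_batch(0.0, 1.0))
    # Hypothetical 100 ns/block kernel against the 50% threshold.
    print("100 ns kernel, 50% of NEON:", min_batch(100.0, 0.5))
```

Under this model, a 1920×1080 luma plane treated as 240 × 135 = 32,400 blocks of 8×8 would amortize the dispatch cost to under 2 ns per block, which is why frame-level batching is the natural granularity to plan around.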
## In scope
- A small set of codec back-end kernels (IDCT 8×8, CDEF, deblocking,
loop restoration filter, MC interpolation) compiled as SPIR-V
compute shaders for Mesa `v3dv`, dispatched via Vulkan compute
from userspace.
- A test harness on hertz that runs each kernel against libavcodec
reference outputs and measures throughput (megapixels/sec or
blocks/sec) against the equivalent NEON path.
- Phase 1 = one kernel, bit-exact, with numbers. Phase 2+ = more
kernels only if Phase 1 numbers justify it.
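
The shape of such a harness can be sketched as follows. This is an illustrative outline only, not the code in `tests/`: `bit_exact_report`, `blocks_per_second`, and `random_blocks` are hypothetical names, the two toy functions stand in for the libavcodec reference and the QPU kernel, and the coefficient range is an assumption chosen for the example.

```python
import random
import time


def bit_exact_report(reference_fn, candidate_fn, blocks):
    """Run both implementations over every block and count exact matches."""
    if not blocks:
        raise ValueError("need at least one block")
    matches = 0
    first_mismatch = None
    for index, block in enumerate(blocks):
        if reference_fn(block) == candidate_fn(block):
            matches += 1
        elif first_mismatch is None:
            first_mismatch = index
    return {
        "total": len(blocks),
        "matches": matches,
        "percent": 100.0 * matches / len(blocks),
        "first_mismatch": first_mismatch,
    }


def blocks_per_second(fn, blocks, repeats=5):
    """Best-of-N throughput in blocks per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for block in blocks:
            fn(block)
        best = min(best, time.perf_counter() - start)
    return len(blocks) / best


def random_blocks(count, seed=0):
    """Seeded 8x8 coefficient blocks (range is an assumption)."""
    rng = random.Random(seed)
    return [[rng.randint(-2048, 2047) for _ in range(64)]
            for _ in range(count)]


# Toy stand-ins: NOT an IDCT, only two ways to compute the same result.
def toy_reference(block):
    return [c * 2 for c in block]


def toy_candidate(block):
    return [c << 1 for c in block]


if __name__ == "__main__":
    blocks = random_blocks(10_000)
    print(bit_exact_report(toy_reference, toy_candidate, blocks))
    print(f"{blocks_per_second(toy_reference, blocks):.0f} blocks/s")
```

Seeding the input generator keeps a failing block reproducible, and reporting the index of the first mismatch gives a starting point for debugging before any throughput number is taken seriously.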
## Out of scope (for now)
- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
budget. Path B *could* extend but isn't the priority.
- Encode. Pi Foundation removed all HW encode in Pi 5; encode on
VC7 is a separate, larger project.
- Custom VPU firmware (Path A — blocked by silicon RoT, see
`docs/phase0.md`).
- V4L2 stateless driver wrapping the userspace decoder. Eventual
consumption point, but Phase 1 lives entirely in userspace.
- Beating ARM NEON unconditionally. The honest target is
*concurrent* work: QPU runs while CPU does something else.
## Dev substrate
- **hertz** (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712,
Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement
host. Watchdog-protected for crash recovery. See
`docs/vulkaninfo_v3d_7_1_7_hertz.txt` for the inside-view device
profile.
- **higgs** (CM5 in portable battery chassis) — the eventual user
target. Not a dev unit; sealed chassis.
## Conventions
This project follows the 9(+1)-phase dev process. See
`docs/dev_process.md`. Phase 0 is closed (`docs/phase0.md`);
Phase 1 is `docs/phase1.md`.

Gitea identity: `claude-noether` (per
`feedback_gitea_as_claude_noether.md`). No `marfrit` pushes from
Claude sessions.
## Layout
```
daedalus-fourier/
├── README.md ← this file
├── docs/
│ ├── dev_process.md ← reference copy of the 9(+1)-phase loop
│ ├── phase0.md ← substrate audit (Path A blocked, Path B open)
│ ├── phase1.md ← first-kernel goal + measurement plan
│ └── vulkaninfo_v3d_7_1_7_hertz.txt
│ ← inside-view device profile from hertz
├── src/ ← kernels + Vulkan dispatch harness
└── tests/ ← bit-exact vs libavcodec, throughput
```
No build system yet. Adding CMake when the first kernel lands.
## Sibling projects in the same orbit
- `libva-v4l2-request-fourier` — VA-API consumer-side backend.
Eventual consumer if daedalus produces a V4L2 stateless node.
- `firefox-fourier` — Firefox fork that routes stateless V4L2
through libavcodec's `v4l2_request` hwaccel. Same pickup point.
- `chromium-fourier` — sibling for Chromium.
- `kernel-agent` — would house the V4L2 driver wrapping the
userspace decoder, once one exists.
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
(rkvdec / vpu981). Provides the userspace conformance harness
daedalus reuses for VC7-AV1 verification.
## Source attribution
Daedalus-the-myth is public domain. The wax-and-feathers
metaphor is older than software engineering.

Anyone wanting to fail at this project: please file your failures
under `branches/icarus/`. Built-in self-deprecation slot, with
honor.