# daedalus-fourier

Community-built VP9 / AV1 software-decode back-end running on the
VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
Compute Module 5), via the existing Mesa `v3d` userspace driver.
ARM keeps the serial entropy front-end; the QPU takes the parallel
back-end (inverse transforms, deblocking, CDEF, loop restoration,
MC residual add).

> Daedalus built the Labyrinth for King Minos, then escaped from it
> by hand-forging flight firmware out of feathers and wax when no
> sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the
Labyrinth; the Pi Foundation's "use the HEVC block and live with
software decode for everything else" is the official non-exit;
the QPU sits unused inside the labyrinth's walls.

**Status: Phase 0 closed (substrate audit). Phase 1 in progress
(first-kernel proof on hertz).** This is research-track work that
may take months or may yield a single proof-of-concept kernel that
loses to ARM NEON, in which case the negative result ships and the
project closes.

## Why this exists

higgs is a Raspberry Pi Compute Module 5 in a small portable
chassis with a battery. Watching nerds review *Star Wars* on YouTube
while putting Mac Studios into virtual shopping baskets is a
core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer
high-bitrate / high-resolution content). It does not serve HEVC.
Pi 5's BCM2712 has one HW decoder block: HEVC. So
{what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

Every YouTube frame on higgs today is software-decoded on Cortex-A76
cores at ~50–90% CPU per video stream. Offloading the parallel
back-end of that decode to the otherwise-idle QPU complex *might*
recover meaningful CPU time and battery on higgs. The honest
prior — measured in Phase 0 — is that the QPU has roughly equal
raw compute to the A76 cluster but a smaller slice of the shared
LPDDR4x bandwidth, so the win, if any, comes from offloading
*concurrent* work the CPU would have done anyway.
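
To make the concurrency argument concrete, here is a toy per-frame timing model. Every number in it is a hypothetical placeholder, not a Phase 0 measurement, and `StageCosts`, `cpu_only`, and `offloaded` are illustrative names. It assumes the entropy stage and the QPU back-end overlap perfectly across frames, which real inter-frame dependencies will only partly allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageCosts:
    """Per-frame stage costs in milliseconds (hypothetical values)."""
    entropy_cpu_ms: float   # serial entropy decode, always on the CPU
    backend_cpu_ms: float   # parallel back-end when run on NEON
    backend_qpu_ms: float   # same back-end when run on the QPU


def cpu_only(c: StageCosts) -> tuple[float, float]:
    """Return (frame interval, CPU-busy time) with everything on the CPU."""
    total = c.entropy_cpu_ms + c.backend_cpu_ms
    return total, total


def offloaded(c: StageCosts) -> tuple[float, float]:
    """Return (steady-state frame interval, CPU-busy time) with QPU back-end.

    Assumes ideal overlap: the CPU entropy-decodes frame N+1 while the QPU
    finishes frame N, so the slower stage sets the pace.
    """
    return max(c.entropy_cpu_ms, c.backend_qpu_ms), c.entropy_cpu_ms


# Hypothetical case where the QPU is *slower* than NEON at the back-end.
costs = StageCosts(entropy_cpu_ms=4.0, backend_cpu_ms=6.0, backend_qpu_ms=8.0)
print(cpu_only(costs))    # (10.0, 10.0)
print(offloaded(costs))   # (8.0, 4.0)
```

In this made-up case the QPU loses to NEON on the back-end alone (8 ms vs 6 ms), yet CPU-busy time per frame drops from 10 ms to 4 ms. That is the sense in which the win, if any, comes from concurrency rather than raw speed.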

The Pi Foundation isn't going to do this work (per their own
statement: chromium-patch sustainment was too much; codec
sustainment would be more so). The kernel `rpi-hevc-dec` series has
been 17 months in review for one decoder block they DID write
themselves. Whatever ships here ships through the community.

## Architecture (Path B)

Phase 0 settled two paths, blocking one and selecting the other:

- **Path A — custom VPU firmware on the VC7 scalar cores.**
  Blocked. BCM2712 has a silicon root of trust: the mask ROM
  hardcodes RPi's public key and unconditionally verifies the
  second-stage bootloader. `EXECUTE_CODE` mailbox removed on Pi 5.
  No software-only bypass exists. See `docs/phase0.md §3`.

- **Path B — QPU compute kernels via the existing Mesa `v3d` /
  DRM / Vulkan-compute path.** This is the path. The QPU is
  reachable from userspace today on a stock signed Pi 5 / CM5
  via `/dev/dri/card0`. No firmware loading. No signing fight.
  `Idein/py-videocore7` (SGEMM 21 GFLOPS sustained) is the
  existence proof.
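
A quick way to confirm the Path B entry point on a given board is to ask sysfs which DRM card is bound to the V3D driver. The sketch below is a minimal check, assuming the driver shows up under the name `v3d`; `find_drm_cards` is an illustrative helper, not part of this repo. It looks the driver up rather than assuming an index, since card numbering depends on driver probe order.

```python
import os
from pathlib import Path


def find_drm_cards(driver_name: str = "v3d",
                   sysfs_drm: str = "/sys/class/drm") -> list[str]:
    """Return DRM card names (e.g. 'card0') bound to `driver_name`.

    Reads the `device/driver` symlink exposed for each card under
    /sys/class/drm. Connector entries such as 'card1-HDMI-A-1' are
    skipped; only bare 'cardN' entries are considered.
    """
    matches: list[str] = []
    root = Path(sysfs_drm)
    if not root.is_dir():
        return matches
    for entry in sorted(root.iterdir()):
        name = entry.name
        if not (name.startswith("card") and name[4:].isdigit()):
            continue
        driver_link = entry / "device" / "driver"
        if not driver_link.exists():
            continue
        # The symlink target's last path component is the driver name.
        bound = os.path.basename(os.path.realpath(driver_link))
        if bound == driver_name:
            matches.append(name)
    return matches


print("v3d cards:", find_drm_cards() or "none found")
```

An empty result on a Pi 5 would suggest the `v3d` module is not loaded, in which case nothing further in Path B will work.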

The build:

```
┌───────────────────────────────┐
│ userspace VP9 / AV1 decoder   │
│  (fork of dav1d / libvpx)     │
├───────────────────────────────┤
│  ARM:    entropy decode       │ ← Cortex-A76 + NEON
│          (bool / range coder) │   structurally serial
├───────────────────────────────┤
│  QPU:    parallel back-end    │ ← V3D 7.1 via Mesa v3dv
│          (IDCT, CDEF,         │   Vulkan compute shaders
│           deblock, LR, MC)    │   or direct DRM submit
├───────────────────────────────┤
│ V4L2 stateless wrapper        │ ← out-of-tree kernel module
│  (eventual, kernel-agent)     │   exposing /dev/videoN
└───────────────────────────────┘
```
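
The hand-off between the two middle boxes can be sketched as a two-stage pipeline with a bounded queue. This is only a structural illustration: `run_pipeline`, `entropy_decode`, and `backend` are placeholder names, the "frames" are plain integers, and nothing here touches a real decoder or the QPU. The granularity of the hand-off (frame, tile, or superblock row) is left open.

```python
import queue
import threading


def run_pipeline(frames, entropy_decode, backend, depth=2):
    """Run entropy decode and back-end as two overlapping stages.

    `entropy_decode` stands in for the serial ARM front-end and `backend`
    for the QPU dispatch; both are caller-supplied placeholders.
    `depth` bounds how many decoded units may wait for the back-end.
    """
    handoff = queue.Queue(maxsize=depth)
    results = []
    done = object()  # sentinel marking end of stream

    def front_end():
        try:
            for frame in frames:
                handoff.put(entropy_decode(frame))  # blocks when queue is full
        finally:
            handoff.put(done)  # always unblock the consumer

    producer = threading.Thread(target=front_end)
    producer.start()
    while True:
        item = handoff.get()
        if item is done:
            break
        results.append(backend(item))
    producer.join()
    return results


# Toy stand-ins: "entropy decode" yields coefficients, "back-end" sums them.
out = run_pipeline(
    frames=[1, 2, 3],
    entropy_decode=lambda f: [f, f * 2],
    backend=lambda coeffs: sum(coeffs),
)
print(out)  # [3, 6, 9]
```

The bounded queue is the design point worth noting: it keeps the front-end from running arbitrarily far ahead of the back-end, which caps the memory held in decoded-but-unreconstructed data.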

The first deliverable is *not* the V4L2 wrapper. The first
deliverable is one back-end kernel running on the QPU, bit-exact
against a libavcodec reference, with measured throughput. If that
single kernel can neither beat NEON nor reach at least half of
NEON's throughput, the project closes here with a documented
negative result.
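
The go/no-go rule can be written down as a small gate. The sketch below reads "within 50%" as "QPU throughput is at least half of NEON throughput", which is an interpretation; `phase1_verdict` and its parameters are hypothetical names, and the figures in the example are made up.

```python
def phase1_verdict(bit_exact: bool, qpu_mpix_s: float, neon_mpix_s: float,
                   min_ratio: float = 0.5) -> str:
    """Apply the Phase 1 gate to one kernel's measurements.

    Bit-exactness is treated as a precondition: throughput of a kernel
    that produces wrong output is not worth comparing.
    """
    if neon_mpix_s <= 0:
        raise ValueError("NEON baseline throughput must be positive")
    if not bit_exact:
        return "not-ready: kernel output differs from reference"
    ratio = qpu_mpix_s / neon_mpix_s
    if ratio >= min_ratio:
        return f"continue: QPU at {ratio:.2f}x NEON"
    return f"close: QPU at {ratio:.2f}x NEON, below {min_ratio:.2f}x"


print(phase1_verdict(True, 60.0, 100.0))   # continue: QPU at 0.60x NEON
print(phase1_verdict(True, 30.0, 100.0))   # close: QPU at 0.30x NEON, below 0.50x
```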

## In scope

- A small set of codec back-end kernels (IDCT 8×8, CDEF, deblocking,
  loop restoration filter, MC interpolation) compiled as SPIR-V
  compute shaders for Mesa `v3dv`, dispatched via Vulkan compute
  from userspace.
- A test harness on hertz that runs each kernel against libavcodec
  reference outputs and measures throughput (megapixels/sec or
  blocks/sec) against the equivalent NEON path.
- Phase 1 = one kernel, bit-exact, with numbers. Phase 2+ = more
  kernels only if Phase 1 numbers justify it.
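
The two jobs of the test harness, bit-exact comparison and throughput measurement, might look roughly like this. It is a sketch under stated assumptions: `reference_kernel` and `candidate_kernel` are toy stand-ins for the libavcodec path and the QPU kernel, and `first_mismatch` and `measure_blocks_per_sec` are illustrative names. A real measurement would presumably also have to count dispatch and readback time on the QPU side.

```python
import time


def first_mismatch(reference: bytes, candidate: bytes):
    """Return the index of the first differing byte, or None if bit-exact."""
    for i, (r, c) in enumerate(zip(reference, candidate)):
        if r != c:
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


def measure_blocks_per_sec(kernel, blocks, repeats: int = 5) -> float:
    """Best-of-`repeats` throughput of `kernel` over `blocks`, in blocks/sec."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for block in blocks:
            kernel(block)
        best = min(best, time.perf_counter() - start)
    return len(blocks) / best if best > 0 else float("inf")


# Toy stand-ins: both add 1 to every sample with saturation at 255.
def reference_kernel(block: bytes) -> bytes:
    return bytes(min(255, v + 1) for v in block)


def candidate_kernel(block: bytes) -> bytes:
    return bytes(v + 1 if v < 255 else 255 for v in block)


blocks = [bytes(range(i, i + 64)) for i in range(0, 192, 64)]  # three 8x8 blocks
for n, block in enumerate(blocks):
    diff = first_mismatch(reference_kernel(block), candidate_kernel(block))
    assert diff is None, f"block {n}: first mismatch at byte {diff}"

print(f"reference: {measure_blocks_per_sec(reference_kernel, blocks):.0f} blocks/s")
print(f"candidate: {measure_blocks_per_sec(candidate_kernel, blocks):.0f} blocks/s")
```

Reporting the first mismatching index rather than a bare pass/fail makes it easier to map a failure back to a position inside the block.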

## Out of scope (for now)

- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
  budget. Path B *could* extend but isn't the priority.
- Encode. Pi Foundation removed all HW encode in Pi 5; encode on
  VC7 is a separate, larger project.
- Custom VPU firmware (Path A — blocked by silicon RoT, see
  `docs/phase0.md`).
- V4L2 stateless driver wrapping the userspace decoder. Eventual
  consumption point, but Phase 1 lives entirely in userspace.
- Beating ARM NEON unconditionally. The honest target is
  *concurrent* work: QPU runs while CPU does something else.

## Dev substrate

- **hertz** (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712,
  Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement
  host. Watchdog-protected for crash recovery. See
  `docs/vulkaninfo_v3d_7_1_7_hertz.txt` for the inside-view device
  profile.
- **higgs** (CM5 in portable battery chassis) — the eventual user
  target. Not a dev unit; sealed chassis.

## Conventions

This project follows the 9(+1)-phase dev process. See
`docs/dev_process.md`. Phase 0 is closed (`docs/phase0.md`);
Phase 1 is `docs/phase1.md`.

Gitea identity: `claude-noether` (per
`feedback_gitea_as_claude_noether.md`). No `marfrit` pushes from
Claude sessions.

## Layout

```
daedalus-fourier/
├── README.md             ← this file
├── docs/
│   ├── dev_process.md    ← reference copy of the 9(+1)-phase loop
│   ├── phase0.md         ← substrate audit (settles Paths A and B)
│   ├── phase1.md         ← first-kernel goal + measurement plan
│   └── vulkaninfo_v3d_7_1_7_hertz.txt
│                          ← inside-view device profile from hertz
├── src/                  ← kernels + Vulkan dispatch harness
└── tests/                ← bit-exact vs libavcodec, throughput
```

No build system yet. Adding CMake when the first kernel lands.

## Sibling projects in the same orbit

- `libva-v4l2-request-fourier` — VA-API consumer-side backend.
  Eventual consumer if daedalus produces a V4L2 stateless node.
- `firefox-fourier` — Firefox fork that routes stateless V4L2
  through libavcodec's `v4l2_request` hwaccel. Same pickup point.
- `chromium-fourier` — sibling for Chromium.
- `kernel-agent` — would house the V4L2 driver wrapping the
  userspace decoder, once one exists.
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
  (rkvdec / vpu981). Provides the userspace conformance harness
  daedalus reuses for VC7-AV1 verification.

## Source attribution

Daedalus-the-myth is public domain. The wax-and-feathers
metaphor is older than software engineering.

Anyone wanting to fail at this project: please file your failures
under `branches/icarus/`. Built-in self-deprecation slot, with
honor.