# daedalus-fourier

Community-built VP9 / AV1 software-decode back-end running on the
VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
Compute Module 5), via the existing Mesa `v3d` userspace driver.
ARM keeps the serial entropy front-end; the QPU takes the parallel
back-end (inverse transforms, deblocking, CDEF, loop restoration,
MC residual add).

> Daedalus built the Labyrinth for King Minos, then escaped from it
> by hand-forging flight firmware out of feathers and wax when no
> sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the
Labyrinth; the Pi Foundation's "use the HEVC block and live with
software decode for everything else" is the official non-exit;
the QPU sits unused inside the labyrinth's walls.

**Status (2026-05-18): cycles 1-9 closed across 3 codecs
(VP9 + AV1 CDEF + H.264). Public API exposes all 9 kernels.
3 kernels deploy on QPU, 6 on CPU, 2 with opportunistic-QPU
helper paths. Phase 8 (V4L2 deployment) ongoing in sibling
[daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2).
On hertz, all kernels exceed the 30fps@1080p user-facing floor by
8-30×.**

## Cycles 1-9 deployment recipe

| Cycle | Kernel | NEON M3 | Primary substrate | QPU offload verdict |
|---|---|---|---|---|
| 1 | VP9 IDCT 8×8 | 8.2 Mblock/s | **QPU** | M4 +7.2 %, R=0.92 GREEN |
| 2 | VP9 LPF wd=4 | 48 Medge/s | **QPU** | M4 +6.9 %, R=0.41 |
| 3 | VP9 MC 8h | 7.0 Mblock/s | CPU | R=0.067 RED; QPU dispatch path exists |
| 4 | VP9 LPF wd=8 | 31 Medge/s | **QPU** | M4 +4.1 %, R=0.34 |
| 5 | AV1 CDEF 8×8 | 3.9 Mblock/s | CPU | R=0.116 ORANGE; QPU = opportunistic helper (0.42 Mblock/s in mixed) |
| 6 | H.264 IDCT 4×4 | 175 Mblock/s | CPU | trivially fast on NEON; QPU pointless |
| 7 | H.264 IDCT 8×8 | 151 Mblock/s | CPU | likewise |
| 8 | H.264 deblock luma-v | 92 Medge/s | CPU | R=0.061 RED; QPU = opportunistic helper (6.2 Medge/s in mixed) |
| 9 | H.264 luma qpel MC (mc20) | 131 Mblock/s | CPU | NEON 19× faster than VP9 analog; QPU pointless |

Per-cycle Phase 7 docs live in `docs/k*_phase7.md` (or in
`*_phase3_and_4.md` for cycles closed with Phase 4 deferred).
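
As a rough cross-check on the 30fps@1080p floor, the sketch below converts a frame size and block size into the block rate a kernel has to sustain. It assumes luma only, one pass over every block, and no skipped blocks. Those assumptions and all function names are illustrative and not taken from the project's benchmarks, whose accounting may differ.

```c
#include <stdint.h>

/* Blocks per frame; partial blocks at the right/bottom edge round up. */
static uint64_t blocks_per_frame(uint32_t width, uint32_t height,
                                 uint32_t block)
{
    uint64_t cols = (width + block - 1) / block;
    uint64_t rows = (height + block - 1) / block;
    return cols * rows;
}

/* Block rate needed to sustain `fps` if every block is processed once. */
static uint64_t floor_blocks_per_sec(uint32_t width, uint32_t height,
                                     uint32_t block, uint32_t fps)
{
    return blocks_per_frame(width, height, block) * fps;
}

/* Headroom = measured throughput divided by required throughput. */
static double headroom(double measured_blocks_per_sec, uint64_t required)
{
    return measured_blocks_per_sec / (double)required;
}
```

Under these assumptions `floor_blocks_per_sec(1920, 1080, 8, 30)` comes to 972,000 blocks/s, so the 8.2 Mblock/s VP9 IDCT figure in the table corresponds to roughly 8.4× headroom.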

## Why this exists

higgs is a Raspberry Pi Compute Module 5 in a small portable
chassis with a battery. Watching nerds review *Star Wars* on YouTube
while putting Mac Studios into virtual shopping baskets is a
core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer
high-bitrate / high-resolution content). It does not serve HEVC.
The Pi 5's BCM2712 has exactly one HW decoder block, and it is HEVC,
so {what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

Every YouTube frame on higgs today is software-decoded on Cortex-A76
cores at ~50–90% CPU per video stream. Offloading the parallel
back-end of that decode to the otherwise-idle QPU complex *might*
recover meaningful CPU time and battery on higgs. The honest
prior — measured in Phase 0 — is that the QPU has roughly equal
raw compute to the A76 cluster but a smaller slice of the shared
LPDDR4x bandwidth, so the win, if any, comes from offloading
*concurrent* work the CPU would have done anyway.
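
The concurrency argument can be pictured with a toy two-stage pipeline model. This is a sketch under stated assumptions, not a measurement: it treats entropy decode and the back-end as two stages that overlap perfectly across frames, ignores contention on the shared LPDDR4x bus, and uses hypothetical function names and caller-supplied timings.

```c
/* Per-frame time when the back-end runs on the CPU after entropy decode. */
static double serial_frame_ms(double entropy_ms, double backend_cpu_ms)
{
    return entropy_ms + backend_cpu_ms;
}

/* Steady-state per-frame time when the QPU runs the back-end of frame N
 * while the CPU entropy-decodes frame N+1: the slower stage sets the pace. */
static double overlapped_frame_ms(double entropy_ms, double backend_qpu_ms)
{
    return entropy_ms > backend_qpu_ms ? entropy_ms : backend_qpu_ms;
}

/* CPU time freed per frame: the back-end work the CPU no longer does,
 * minus whatever it spends submitting and waiting on QPU jobs. */
static double cpu_ms_freed(double backend_cpu_ms, double submit_overhead_ms)
{
    double freed = backend_cpu_ms - submit_overhead_ms;
    return freed > 0.0 ? freed : 0.0;
}
```

Even when the QPU is no faster than NEON at the back-end itself, the overlapped frame time in this model is bounded by the slower stage rather than the sum, which is the shape of win the paragraph above describes.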

The Pi Foundation isn't going to do this work (per their own
statement: chromium-patch sustainment was too much; codec
sustainment would be more so). The kernel `rpi-hevc-dec` series has
been in review for 17 months, and that is for the one decoder block
they did write themselves. Whatever ships here ships through the
community.

## Architecture (Path B)

Phase 0 closed two paths:

- **Path A — custom VPU firmware on the VC7 scalar cores.**
  Blocked. BCM2712 has a silicon root of trust: the mask ROM
  hardcodes RPi's public key and unconditionally verifies the
  second-stage bootloader. The `EXECUTE_CODE` mailbox was removed
  on Pi 5. No software-only bypass exists. See `docs/phase0.md §3`.

- **Path B — QPU compute kernels via the existing Mesa `v3d` /
  DRM / Vulkan-compute path.** This is the path. The QPU is
  reachable from userspace today on a stock signed Pi 5 / CM5
  via `/dev/dri/card0`. No firmware loading. No signing fight.
  `Idein/py-videocore7` (SGEMM 21 GFLOPS sustained) is the
  existence proof.
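
To confirm that a DRM node is reachable from unprivileged userspace, a probe along the following lines can ask the kernel which driver backs it. This is a minimal sketch: `probe_drm_driver` is a hypothetical helper, and it relies only on the generic `DRM_IOCTL_VERSION` ioctl from the kernel's `<drm/drm.h>` UAPI header, not on anything specific to this project.

```c
#include <drm/drm.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Writes the NUL-terminated DRM driver name for `path` into `name`.
 * Returns 0 on success, -1 if the node cannot be opened or is not DRM. */
static int probe_drm_driver(const char *path, char *name, size_t cap)
{
    if (cap == 0)
        return -1;

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct drm_version v;
    memset(&v, 0, sizeof v);
    v.name = name;
    v.name_len = cap - 1;          /* leave room for the terminator */

    int rc = ioctl(fd, DRM_IOCTL_VERSION, &v);
    close(fd);
    if (rc != 0)
        return -1;

    /* The kernel reports the full name length and does not terminate. */
    size_t len = v.name_len < cap - 1 ? v.name_len : cap - 1;
    name[len] = '\0';
    return 0;
}
```

If `/dev/dri/card0` is the V3D device, the reported name should read `v3d`. Card numbering can differ between systems, so probing each `/dev/dri/card*` node is the safer pattern.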

The build:

```
┌───────────────────────────────┐
│ userspace VP9 / AV1 decoder   │
│  (fork of dav1d / libvpx)     │
├───────────────────────────────┤
│  ARM:    entropy decode       │ ← Cortex-A76 + NEON
│          (Bool coder / MSAC)  │   structurally serial
├───────────────────────────────┤
│  QPU:    parallel back-end    │ ← V3D 7.1 via Mesa v3dv
│          (IDCT, CDEF,         │   Vulkan compute shaders
│           deblock, LR, MC)    │   or direct DRM submit
├───────────────────────────────┤
│ V4L2 stateless wrapper        │ ← out-of-tree kernel module
│  (eventual, kernel-agent)     │   exposing /dev/videoN
└───────────────────────────────┘
```

The first deliverable was one back-end kernel; nine cycles later,
the public API in `include/daedalus.h` exposes nine kernels, each
with bit-exact NEON and (where worthwhile) QPU paths. The V4L2
wrapper is the next sibling project
([daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2)),
which turns the kernel library into a `/dev/videoN` device for
consumption by libva-v4l2-request-fourier and browsers.

## In scope

- The set of codec back-end kernels documented in the deployment
  recipe table above (9 kernels closed; more added per cycle as
  the codec coverage expands).
- A test harness on hertz that runs each kernel against a
  bit-exact reference (FFmpeg or dav1d NEON) and measures
  throughput vs the equivalent NEON path.
- The public C API in `include/daedalus.h` so the sibling
  daedalus-v4l2 (and any other consumer) can dispatch per-block
  work with recipe-default substrate routing or explicit override.
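
The bit-exactness half of such a harness can be sketched as follows. The kernel here is a toy saturating residual add standing in for a real codec kernel, and the "candidate" is a branch-free scalar variant standing in for a NEON or QPU path. All names are hypothetical, and throughput timing is left out for brevity.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLK 8  /* 8x8 blocks */

/* Scalar reference: dst = clamp(pred + residual, 0, 255). */
static void residual_add_ref(uint8_t *dst, const uint8_t *pred,
                             const int16_t *res, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = pred[i] + res[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Candidate under test: same result via a branch-free clamp. */
static void residual_add_candidate(uint8_t *dst, const uint8_t *pred,
                                   const int16_t *res, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = pred[i] + res[i];
        v &= -(v >= 0);        /* negative  -> 0        */
        v |= -(v > 255);       /* above 255 -> all ones */
        dst[i] = (uint8_t)v;   /* low byte is 255 when saturated */
    }
}

/* Small deterministic LCG so every run sees the same inputs. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Returns 1 if both paths agree byte-for-byte on n_blocks random blocks. */
static int bit_exact(size_t n_blocks, uint32_t seed)
{
    size_t n = n_blocks * BLK * BLK;
    uint8_t *pred = malloc(n), *a = malloc(n), *b = malloc(n);
    int16_t *res = malloc(n * sizeof *res);
    int ok = 0;

    if (pred && a && b && res) {
        for (size_t i = 0; i < n; i++) {
            pred[i] = (uint8_t)lcg_next(&seed);
            res[i] = (int16_t)((int)(lcg_next(&seed) % 1023u) - 511);
        }
        residual_add_ref(a, pred, res, n);
        residual_add_candidate(b, pred, res, n);
        ok = memcmp(a, b, n) == 0;
    }
    free(pred); free(a); free(b); free(res);
    return ok;
}
```

Comparing whole output buffers with `memcmp` keeps the pass criterion strict: any single differing byte fails the run, which is what "bit-exact" requires.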

## Out of scope (lives in sibling repos)

- The V4L2 stateless driver — that's
  [daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2).
- Bitstream parsing — that lives in daedalus-v4l2 too, via
  `dlopen`'d FFmpeg at runtime (Option γ).
- Browser-side consumption — libva-v4l2-request-fourier +
  firefox-fourier / chromium-fourier, already mature.

## Out of scope (permanent)

- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
  budget.
- Encode. Pi Foundation removed all HW encode in Pi 5.
- Custom VPU firmware (Path A — blocked by silicon RoT, see
  `docs/phase0.md`).
- Beating ARM NEON unconditionally. The honest target is
  *concurrent* work: QPU runs while CPU does something else.
  Per Issue 003 (`docs/issues/003-mixed-kernel-m4-bench.md`),
  the mixed-kernel deployment shape is where QPU offload pays —
  same-kernel M4 is the worst-case bound.

## Dev substrate

- **hertz** (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712,
  Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement
  host. Watchdog-protected for crash recovery. See
  `docs/vulkaninfo_v3d_7_1_7_hertz.txt` for the inside-view device
  profile.
- **higgs** (CM5 in portable battery chassis) — the eventual user
  target. Not a dev unit; sealed chassis.

## Conventions

This project follows a 9(+1)-phase dev process per cycle. See
`docs/dev_process.md`. Phase 0 is closed once at project start
(`docs/phase0.md`); each kernel cycle re-runs Phases 1-9.

Phase 5 (second-model independent review) is non-skippable per
project rule. See `~/.claude/CLAUDE.md` "Reviews are never
skippable" — empty/no-finding reviews are themselves a strong
positive signal, not wasted effort.

Gitea identity: `claude-noether` for Claude-driven pushes, via
SSH alias `git.reauktion.de.claude-noether` (see
`memory/reference_gitea_ssh_alias_noether.md`).

## Layout

```
daedalus-fourier/
├── README.md             ← this file
├── include/daedalus.h    ← public C API
├── src/
│   ├── daedalus_core.c   ← API impl: per-kernel CPU+QPU dispatch
│   ├── v3d_runner.{c,h}  ← Vulkan compute plumbing
│   └── v3d_*.comp        ← compute shaders (cycles 1, 2, 4, 5, 8)
├── tests/
│   ├── *_ref.c           ← per-kernel C references (bit-exact)
│   ├── bench_neon_*.c    ← NEON M3 baselines
│   ├── bench_v3d_*.c     ← QPU M2 + 3-way M1 (vs NEON + C ref)
│   ├── bench_concurrent_*.c ← M4 mixed-kernel concurrent bench
│   └── test_api_*.c      ← public API smoke tests
├── docs/
│   ├── dev_process.md    ← reference 9(+1)-phase loop
│   ├── phase0.md         ← substrate audit (closes Path A)
│   ├── phase1.md         ← R-band decision rules
│   ├── phase8_scoping.md ← V4L2 architecture options
│   ├── phase8_status.md  ← decisions locked + status
│   ├── k1_*.md..k9_*.md  ← per-cycle Phase 1/3/4/5/7 docs
│   └── issues/           ← deferred work
├── external/
│   ├── ffmpeg-snapshot/  ← vendored FFmpeg n7.1.3 NEON refs (LGPL-2.1+)
│   └── dav1d-snapshot/   ← vendored dav1d 1.4.3 CDEF (BSD-2-Clause)
└── CMakeLists.txt
```

## Build and run

On a Pi 5 (Debian Trixie or similar) with the Vulkan SDK and Mesa v3dv installed:

```sh
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

# Per-kernel M1+M3 NEON baseline:
./bench_neon_idct
./bench_neon_lpf
./bench_neon_h264deblock
# ... (one per cycle)

# Per-kernel M1+M2 QPU bench (3-way bit-exact vs NEON + C ref):
./bench_v3d_idct
./bench_v3d_lpf
./bench_v3d_h264deblock
# ...

# Public API smoke tests:
./test_api_idct       # VP9 IDCT 8x8, CPU+QPU+AUTO
./test_api_lpf        # VP9 LPF wd=4 + wd=8
./test_api_h264       # H.264 IDCT 4x4 + 8x8 + deblock
./test_api_opportunistic_qpu  # cycles 3+5+8 QPU-override paths

# Mixed-kernel M4 bench (Issue 003 framework):
./bench_concurrent_mixed --cpu-kernel mc --qpu-kernel lpf4 --neon-threads 3 --qpu-core 3 --duration 6
```

## Consuming the kernel library

For integration code (e.g., `daedalus-v4l2` userspace daemon):

```c
#include <daedalus.h>

daedalus_ctx *ctx = daedalus_ctx_create();
// has_qpu == 1 if V3D init succeeded; else NEON-only fallback

// Recipe dispatch: routes to the per-cycle verdict substrate.
daedalus_recipe_dispatch_vp9_idct8(ctx, dst, stride, coeffs, n_blocks, meta);

// Or explicit substrate selection for runtime-aware scheduling:
daedalus_dispatch_vp9_mc_8h(ctx, DAEDALUS_SUBSTRATE_QPU, dst, dst_stride,
                            src, src_stride, n_blocks, meta);

daedalus_ctx_destroy(ctx);
```

See `include/daedalus.h` for the full API.
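
The split between recipe dispatch and explicit substrate selection can be pictured as a small routing table. The sketch below is illustrative only: every `sketch_`/`SK_` name is hypothetical and does not come from `daedalus.h`, the defaults are copied from the deployment recipe table, and the fall-back to CPU when no QPU is available is an assumption based on the NEON-only fallback mentioned in the snippet above.

```c
/* Hypothetical names for this sketch; the real API lives in daedalus.h. */
typedef enum { SKETCH_AUTO, SKETCH_CPU, SKETCH_QPU } sketch_substrate;

typedef enum {
    SK_VP9_IDCT8, SK_VP9_LPF4, SK_VP9_MC_8H, SK_VP9_LPF8, SK_AV1_CDEF8,
    SK_H264_IDCT4, SK_H264_IDCT8, SK_H264_DEBLOCK_LUMA_V, SK_H264_MC20,
    SK_KERNEL_COUNT
} sketch_kernel;

/* Primary substrate per cycle, taken from the deployment recipe table. */
static const sketch_substrate recipe_default[SK_KERNEL_COUNT] = {
    [SK_VP9_IDCT8]           = SKETCH_QPU,
    [SK_VP9_LPF4]            = SKETCH_QPU,
    [SK_VP9_MC_8H]           = SKETCH_CPU,
    [SK_VP9_LPF8]            = SKETCH_QPU,
    [SK_AV1_CDEF8]           = SKETCH_CPU,
    [SK_H264_IDCT4]          = SKETCH_CPU,
    [SK_H264_IDCT8]          = SKETCH_CPU,
    [SK_H264_DEBLOCK_LUMA_V] = SKETCH_CPU,
    [SK_H264_MC20]           = SKETCH_CPU,
};

/* AUTO follows the recipe; an explicit request overrides it. Either way,
 * a QPU choice degrades to CPU when V3D init failed (assumed behaviour). */
static sketch_substrate sketch_resolve(sketch_kernel k,
                                       sketch_substrate requested,
                                       int has_qpu)
{
    if ((unsigned)k >= SK_KERNEL_COUNT)
        return SKETCH_CPU;

    sketch_substrate s =
        requested == SKETCH_AUTO ? recipe_default[k] : requested;
    if (s == SKETCH_QPU && !has_qpu)
        s = SKETCH_CPU;
    return s;
}
```

A real dispatcher would also need to know which kernels actually have a QPU implementation before honouring an explicit QPU request; the sketch leaves that capability check out.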

## Sibling projects in the same orbit

- **[daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2)**
  — V4L2 stateless wrapper. Linux kernel module +
  userspace daemon that consume `libdaedalus_core.a` from this
  repo. Currently scaffold + roadmap; Phase 8 implementation work is ongoing.
- `libva-v4l2-request-fourier` — VA-API consumer; talks to
  daedalus-v4l2's `/dev/videoN`.
- `firefox-fourier` — Firefox fork routing stateless V4L2 through
  libavcodec's `v4l2_request` hwaccel.
- `chromium-fourier` — sibling for Chromium.
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
  (rkvdec / vpu981). Provides the userspace conformance harness
  daedalus reuses for VC7-AV1 verification.

## Source attribution

Daedalus-the-myth is public domain. The wax-and-feathers
metaphor is older than software engineering.

Anyone wanting to fail at this project: please file your failures
under `branches/icarus/`. Built-in self-deprecation slot, with
honor.