# daedalus-fourier
Community-built VP9 / AV1 software-decode back-end running on the
VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 /
Compute Module 5), via the existing Mesa `v3d` userspace driver.
ARM keeps the serial entropy front-end; the QPU takes the parallel
back-end (inverse transforms, deblocking, CDEF, loop restoration,
MC residual add).
> Daedalus built the Labyrinth for King Minos, then escaped from it
> by hand-forging flight firmware out of feathers and wax when no
> sanctioned exit existed.
That's the project shape. The Broadcom-locked VideoCore VII is the
Labyrinth; the Pi Foundation's "use the HEVC block and live with
software decode for everything else" is the official non-exit;
the QPU sits unused inside the labyrinth's walls.
**Status (2026-05-18): cycles 1-9 closed across 3 codecs
(VP9 + AV1 CDEF + H.264). Public API exposes all 9 kernels.
3 kernels deploy on QPU and 6 on CPU; 2 of the CPU kernels also have
opportunistic-QPU helper paths. Phase 8 (V4L2 deployment) is ongoing in the sibling
[daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2).
On hertz (the Pi 5 dev host), all kernels exceed the 30 fps @ 1080p
user-facing floor by 8-30×.**
### Cycles 1-9 deployment recipe
| Cycle | Kernel | NEON M3 | Primary substrate | QPU offload verdict |
|---|---|---|---|---|
| 1 | VP9 IDCT 8×8 | 8.2 Mblock/s | **QPU** | M4 +7.2 %, R=0.92 GREEN |
| 2 | VP9 LPF wd=4 | 48 Medge/s | **QPU** | M4 +6.9 %, R=0.41 |
| 3 | VP9 MC 8h | 7.0 Mblock/s | CPU | R=0.067 RED; QPU dispatch path exists |
| 4 | VP9 LPF wd=8 | 31 Medge/s | **QPU** | M4 +4.1 %, R=0.34 |
| 5 | AV1 CDEF 8×8 | 3.9 Mblock/s | CPU | R=0.116 ORANGE; QPU = opportunistic helper (0.42 Mblock/s in mixed) |
| 6 | H.264 IDCT 4×4 | 175 Mblock/s | CPU | trivially fast on NEON; QPU pointless |
| 7 | H.264 IDCT 8×8 | 151 Mblock/s | CPU | likewise |
| 8 | H.264 deblock luma-v | 92 Medge/s | CPU | R=0.061 RED; QPU = opportunistic helper (6.2 Medge/s in mixed) |
| 9 | H.264 luma qpel MC (mc20) | 131 Mblock/s | CPU | NEON 19× faster than VP9 analog; QPU pointless |
Per-cycle Phase 7 docs are in `docs/k*_phase7.md` (or `*_phase3_and_4.md`
for deferred-Phase-4 closures).
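
As a rough sanity check on what a 30 fps @ 1080p floor means in block terms, the sketch below counts the 8×8 blocks in a 1080p frame and divides a measured kernel rate by the resulting per-second requirement. It assumes every 8×8 luma block is processed exactly once per frame and ignores chroma. That is a simplification: kernels differ in how many blocks they actually touch, and the project's own floor definition may differ. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of block_w x block_h blocks needed to cover one frame, rounding up
 * at the right and bottom edges. */
static uint64_t blocks_per_frame(uint32_t width, uint32_t height,
                                 uint32_t block_w, uint32_t block_h)
{
    assert(block_w > 0 && block_h > 0);
    uint64_t cols = (width + block_w - 1) / block_w;
    uint64_t rows = (height + block_h - 1) / block_h;
    return cols * rows;
}

/* Blocks per second a kernel must sustain to keep up with the given fps
 * (assumption: each block is visited once per frame). */
static uint64_t floor_blocks_per_sec(uint32_t width, uint32_t height,
                                     uint32_t block_w, uint32_t block_h,
                                     uint32_t fps)
{
    return blocks_per_frame(width, height, block_w, block_h) * fps;
}

/* Headroom = measured throughput / required throughput. */
static double headroom(double measured_blocks_per_sec, uint64_t floor_rate)
{
    assert(floor_rate > 0);
    return measured_blocks_per_sec / (double)floor_rate;
}
```

For 1920×1080 with 8×8 blocks this gives 240 × 135 = 32,400 blocks per frame, or 972,000 blocks/s at 30 fps. Against the 8.2 Mblock/s NEON figure for VP9 IDCT 8×8 in the table, `headroom(8.2e6, 972000)` comes out at roughly 8.4×.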
## Why this exists
higgs is a Raspberry Pi Compute Module 5 in a small portable
chassis with a battery. Watching nerds review *Star Wars* on YouTube
while putting Mac Studios into virtual shopping baskets is a
core workload for the higgs class of device.
YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer
high-bitrate / high-resolution content). It does not serve HEVC.
Pi 5's BCM2712 has one HW decoder block: HEVC. The intersection
{what YouTube serves} ∩ {what BCM2712 decodes in HW} is empty (∅).
Every YouTube frame on higgs today is software-decoded on Cortex-A76
cores at ~50-90% CPU per video stream. Offloading the parallel
back-end of that decode to the otherwise-idle QPU complex *might*
recover meaningful CPU time and battery on higgs. The honest
prior — measured in Phase 0 — is that the QPU has roughly equal
raw compute to the A76 cluster but a smaller slice of the shared
LPDDR4x bandwidth, so the win, if any, comes from offloading
*concurrent* work the CPU would have done anyway.
The Pi Foundation isn't going to do this work (per their own
statement: chromium-patch sustainment was too much; codec
sustainment would be more so). The kernel `rpi-hevc-dec` series has
been in review for 17 months, for the one decoder block they did write
themselves. Whatever ships here ships through the community.
## Architecture (Path B)
Phase 0 closed two paths:
- **Path A — custom VPU firmware on the VC7 scalar cores.**
Blocked. BCM2712 has a silicon root of trust: the mask ROM
hardcodes RPi's public key and unconditionally verifies the
second-stage bootloader. The `EXECUTE_CODE` mailbox call was removed on Pi 5.
No software-only bypass exists. See `docs/phase0.md §3`.
- **Path B — QPU compute kernels via the existing Mesa `v3d` /
DRM / Vulkan-compute path.** This is the path. The QPU is
reachable from userspace today on a stock signed Pi 5 / CM5
via `/dev/dri/card0`. No firmware loading. No signing fight.
`Idein/py-videocore7` (SGEMM 21 GFLOPS sustained) is the
existence proof.
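
To make "reachable from userspace" concrete, here is a minimal sketch that opens a DRM node and asks the kernel which driver sits behind it, using the generic `DRM_IOCTL_VERSION` ioctl from the kernel UAPI header `<drm/drm.h>`. The helper name `drm_node_is_driver` is hypothetical and not part of this repo. The node path is a parameter because the index under which the V3D device enumerates can vary between systems.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <drm/drm.h>   /* kernel UAPI: struct drm_version, DRM_IOCTL_VERSION */

/* Returns 1 if the DRM node at `path` is driven by the kernel driver named
 * `want` (e.g. "v3d"), 0 if it is some other driver, -1 on open/ioctl error. */
static int drm_node_is_driver(const char *path, const char *want)
{
    assert(path != NULL && want != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char name[64] = {0};              /* zero-filled so the result is NUL-terminated */
    struct drm_version ver;
    memset(&ver, 0, sizeof ver);
    ver.name = name;
    ver.name_len = sizeof name - 1;   /* leave room for the terminating NUL */

    int rc = ioctl(fd, DRM_IOCTL_VERSION, &ver);
    close(fd);
    if (rc < 0)
        return -1;

    return strcmp(name, want) == 0 ? 1 : 0;
}
```

On a Pi 5, a call such as `drm_node_is_driver("/dev/dri/card0", "v3d")` would be one way to confirm the node before any Vulkan or DRM submit work; on a machine without that node it simply returns -1.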
The build:
```
┌───────────────────────────────┐
│  userspace VP9 / AV1 decoder  │
│   (fork of dav1d / libvpx)    │
├───────────────────────────────┤
│  ARM: entropy decode          │ ← Cortex-A76 + NEON
│  (bool coder / range coder)   │   structurally serial
├───────────────────────────────┤
│  QPU: parallel back-end       │ ← V3D 7.1 via Mesa v3dv
│  (IDCT, CDEF,                 │   Vulkan compute shaders
│   deblock, LR, MC)            │   or direct DRM submit
├───────────────────────────────┤
│  V4L2 stateless wrapper       │ ← out-of-tree kernel module
│  (eventual, kernel-agent)     │   exposing /dev/videoN
└───────────────────────────────┘
```
The first deliverable was one back-end kernel; nine cycles later
the public API in `include/daedalus.h` exposes nine kernels, each
with a bit-exact NEON path and (where worthwhile) a QPU path. The V4L2
wrapper is the next sibling project
([daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2)),
which turns the kernel library into a `/dev/videoNN` device for
consumption by libva-v4l2-request-fourier and browsers.
## In scope
- The set of codec back-end kernels documented in the deployment
recipe table above (9 kernels closed; more added per cycle as
the codec coverage expands).
- A test harness on hertz that runs each kernel against a
bit-exact reference (FFmpeg or dav1d NEON) and measures
throughput vs the equivalent NEON path.
- The public C API in `include/daedalus.h` so the sibling
daedalus-v4l2 (and any other consumer) can dispatch per-block
work with recipe-default substrate routing or explicit override.
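
The two jobs of the harness (bit-exact comparison and throughput measurement) can be sketched as below. This is an illustration under assumptions, not the code in `tests/`: the function names are hypothetical, pixels are taken to be 8-bit, and the elapsed time is expected to come from something like `clock_gettime(CLOCK_MONOTONIC, ...)` wrapped around the kernel call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare two w x h pixel regions row by row, honouring each buffer's stride
 * so that padding bytes past column w are ignored.
 * Returns -1 if the regions are bit-exact, otherwise the index (y * w + x)
 * of the first differing pixel. */
static long first_mismatch(const uint8_t *ref, ptrdiff_t ref_stride,
                           const uint8_t *out, ptrdiff_t out_stride,
                           int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        const uint8_t *r = ref + y * ref_stride;
        const uint8_t *o = out + y * out_stride;
        if (memcmp(r, o, (size_t)w) == 0)
            continue;                       /* whole row matches */
        for (int x = 0; x < w; x++)
            if (r[x] != o[x])
                return (long)y * w + x;
    }
    return -1;
}

/* Throughput in units (blocks, edges) per second from a unit count and the
 * elapsed wall-clock time in nanoseconds. */
static double units_per_sec(uint64_t units, uint64_t elapsed_ns)
{
    assert(elapsed_ns > 0);
    return (double)units * 1e9 / (double)elapsed_ns;
}
```

Reporting the first mismatching index rather than a plain pass/fail makes it easier to tell a systematic rounding difference from a stray edge-handling bug.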
## Out of scope (lives in sibling repos)
- The V4L2 stateless driver — that's
[daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2).
- Bitstream parsing — that lives in daedalus-v4l2 too, via
`dlopen`'d FFmpeg at runtime (Option γ).
- Browser-side consumption — libva-v4l2-request-fourier +
firefox-fourier / chromium-fourier, already mature.
## Out of scope (permanent)
- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute
budget.
- Encode. Pi Foundation removed all HW encode in Pi 5.
- Custom VPU firmware (Path A — blocked by silicon RoT, see
`docs/phase0.md`).
- Beating ARM NEON unconditionally. The honest target is
*concurrent* work: QPU runs while CPU does something else.
Per Issue 003 (`docs/issues/003-mixed-kernel-m4-bench.md`),
the mixed-kernel deployment shape is where QPU offload pays —
same-kernel M4 is the worst-case bound.
## Dev substrate
- **hertz** (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712,
Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement
host. Watchdog-protected for crash recovery. See
`docs/vulkaninfo_v3d_7_1_7_hertz.txt` for the inside-view device
profile.
- **higgs** (CM5 in portable battery chassis) — the eventual user
target. Not a dev unit; sealed chassis.
## Conventions
This project follows a 9(+1)-phase dev process per cycle. See
`docs/dev_process.md`. Phase 0 is closed once at project start
(`docs/phase0.md`); each kernel cycle re-runs Phases 1-9.
Phase 5 (second-model independent review) is non-skippable per
project rule. See `~/.claude/CLAUDE.md` "Reviews are never
skippable" — empty/no-finding reviews are themselves a strong
positive signal, not wasted effort.
Gitea identity: `claude-noether` for Claude-driven pushes, via
SSH alias `git.reauktion.de.claude-noether` (see
`memory/reference_gitea_ssh_alias_noether.md`).
## Layout
```
daedalus-fourier/
├── README.md ← this file
├── include/daedalus.h ← public C API
├── src/
│ ├── daedalus_core.c ← API impl: per-kernel CPU+QPU dispatch
│ ├── v3d_runner.{c,h} ← Vulkan compute plumbing
│ └── v3d_*.comp ← compute shaders (cycles 1, 2, 4, 5, 8)
├── tests/
│ ├── *_ref.c ← per-kernel C references (bit-exact)
│ ├── bench_neon_*.c ← NEON M3 baselines
│ ├── bench_v3d_*.c ← QPU M2 + 3-way M1 (vs NEON + C ref)
│ ├── bench_concurrent_*.c ← M4 mixed-kernel concurrent bench
│ └── test_api_*.c ← public API smoke tests
├── docs/
│ ├── dev_process.md ← reference 9(+1)-phase loop
│ ├── phase0.md ← substrate audit (closes Path A)
│ ├── phase1.md ← R-band decision rules
│ ├── phase8_scoping.md ← V4L2 architecture options
│ ├── phase8_status.md ← decisions locked + status
│ ├── k1_*.md..k9_*.md ← per-cycle Phase 1/3/4/5/7 docs
│ └── issues/ ← deferred work
├── external/
│ ├── ffmpeg-snapshot/ ← vendored FFmpeg n7.1.3 NEON refs (LGPL-2.1+)
│ └── dav1d-snapshot/ ← vendored dav1d 1.4.3 CDEF (BSD-2-Clause)
└── CMakeLists.txt
```
## Build and run
On a Pi 5 (Debian Trixie or similar) with Vulkan SDK + Mesa v3dv:
```sh
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .
# Per-kernel M1+M3 NEON baseline:
./bench_neon_idct
./bench_neon_lpf
./bench_neon_h264deblock
# ... (one per cycle)
# Per-kernel M1+M2 QPU bench (3-way bit-exact vs NEON + C ref):
./bench_v3d_idct
./bench_v3d_lpf
./bench_v3d_h264deblock
# ...
# Public API smoke tests:
./test_api_idct # VP9 IDCT 8x8, CPU+QPU+AUTO
./test_api_lpf # VP9 LPF wd=4 + wd=8
./test_api_h264 # H.264 IDCT 4x4 + 8x8 + deblock
./test_api_opportunistic_qpu # cycles 3+5+8 QPU-override paths
# Mixed-kernel M4 bench (Issue 003 framework):
./bench_concurrent_mixed --cpu-kernel mc --qpu-kernel lpf4 --neon-threads 3 --qpu-core 3 --duration 6
```
## Consuming the kernel library
For integration code (e.g., `daedalus-v4l2` userspace daemon):
```c
#include <daedalus.h>
daedalus_ctx *ctx = daedalus_ctx_create();
// has_qpu == 1 if V3D init succeeded; else NEON-only fallback
// Recipe dispatch: routes to the per-cycle verdict substrate.
daedalus_recipe_dispatch_vp9_idct8(ctx, dst, stride, coeffs, n_blocks, meta);
// Or explicit substrate selection for runtime-aware scheduling:
daedalus_dispatch_vp9_mc_8h(ctx, DAEDALUS_SUBSTRATE_QPU, dst, dst_stride,
                            src, src_stride, n_blocks, meta);
daedalus_ctx_destroy(ctx);
```
See `include/daedalus.h` for the full API.
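
The difference between recipe dispatch and explicit substrate selection can be pictured with the standalone sketch below. Every name in it (`sketch_substrate`, `sketch_kernel`, `resolve_substrate`) is hypothetical; the real types and entry points are in `include/daedalus.h`. The recipe table is copied from the "Primary substrate" column above, and the silent fall-back to CPU when no QPU is available is an assumption based on the NEON-only fallback mentioned in the snippet.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the library's actual API. */
enum sketch_substrate { SKETCH_AUTO, SKETCH_CPU, SKETCH_QPU };

enum sketch_kernel {
    K_VP9_IDCT8, K_VP9_LPF4, K_VP9_MC8H, K_VP9_LPF8, K_AV1_CDEF8,
    K_H264_IDCT4, K_H264_IDCT8, K_H264_DEBLOCK_LUMA_V, K_H264_QPEL_MC20,
    K_COUNT
};

/* Primary substrate per kernel, as listed in the deployment recipe table. */
static const enum sketch_substrate recipe[K_COUNT] = {
    [K_VP9_IDCT8]           = SKETCH_QPU,
    [K_VP9_LPF4]            = SKETCH_QPU,
    [K_VP9_MC8H]            = SKETCH_CPU,
    [K_VP9_LPF8]            = SKETCH_QPU,
    [K_AV1_CDEF8]           = SKETCH_CPU,
    [K_H264_IDCT4]          = SKETCH_CPU,
    [K_H264_IDCT8]          = SKETCH_CPU,
    [K_H264_DEBLOCK_LUMA_V] = SKETCH_CPU,
    [K_H264_QPEL_MC20]      = SKETCH_CPU,
};

/* Decide where a kernel runs. SKETCH_AUTO follows the recipe; an explicit
 * request overrides it. If the QPU is chosen but V3D init failed
 * (has_qpu == 0), fall back to the CPU/NEON path (assumed behaviour). */
static enum sketch_substrate resolve_substrate(enum sketch_kernel k,
                                               enum sketch_substrate requested,
                                               int has_qpu)
{
    assert((int)k >= 0 && (int)k < K_COUNT);
    enum sketch_substrate s = (requested == SKETCH_AUTO) ? recipe[k] : requested;
    if (s == SKETCH_QPU && !has_qpu)
        s = SKETCH_CPU;
    return s;
}
```

Under this model, `resolve_substrate(K_VP9_MC8H, SKETCH_AUTO, 1)` picks the CPU, while an explicit `SKETCH_QPU` request for the same kernel picks the QPU, which mirrors the two calls in the snippet above.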
## Sibling projects in the same orbit
- **[daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2)**
— V4L2 stateless wrapper. Linux kernel module +
userspace daemon that consume `libdaedalus_core.a` from this
repo. Currently a scaffold plus roadmap; Phase 8 implementation work is ongoing.
- `libva-v4l2-request-fourier` — VA-API consumer; talks to
daedalus-v4l2's `/dev/videoNN`.
- `firefox-fourier` — Firefox fork routing stateless V4L2 through
libavcodec's `v4l2_request` hwaccel.
- `chromium-fourier` — sibling for Chromium.
- `ampere-av1-enablement` — software-side AV1 bring-up on RK3588
(rkvdec / vpu981). Provides the userspace conformance harness
daedalus reuses for VC7-AV1 verification.
## Source attribution
Daedalus-the-myth is public domain. The wax-and-feathers
metaphor is older than software engineering.
Anyone wanting to fail at this project: please file your failures
under `branches/icarus/`. Built-in self-deprecation slot, with
honor.