# daedalus-fourier

Community-built VP9 / AV1 software-decode back-end running on the VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 / Compute Module 5), via the existing Mesa `v3d` userspace driver. ARM keeps the serial entropy front-end; the QPU takes the parallel back-end (inverse transforms, deblocking, CDEF, loop restoration, MC residual add).

> Daedalus built the Labyrinth for King Minos, then escaped from it
> by hand-forging flight firmware out of feathers and wax when no
> sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the Labyrinth; the Pi Foundation's "use the HEVC block and live with software decode for everything else" is the official non-exit; the QPU sits unused inside the labyrinth's walls.

**Status (2026-05-18): cycles 1-9 closed across 3 codecs (VP9 + AV1 CDEF + H.264). Public API exposes all 9 kernels. 3 kernels deploy on QPU and 6 on CPU (2 of them with opportunistic-QPU helper paths). Phase 8 (V4L2 deployment) ongoing in sibling [daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2). On hertz, all kernels exceed the 30fps@1080p user-facing floor by 8-30×.**

### Cycles 1-9 deployment recipe

| Cycle | Kernel | NEON M3 | Primary substrate | QPU offload verdict |
|---|---|---|---|---|
| 1 | VP9 IDCT 8×8 | 8.2 Mblock/s | **QPU** | M4 +7.2 %, R=0.92 GREEN |
| 2 | VP9 LPF wd=4 | 48 Medge/s | **QPU** | M4 +6.9 %, R=0.41 |
| 3 | VP9 MC 8h | 7.0 Mblock/s | CPU | R=0.067 RED; QPU dispatch path exists |
| 4 | VP9 LPF wd=8 | 31 Medge/s | **QPU** | M4 +4.1 %, R=0.34 |
| 5 | AV1 CDEF 8×8 | 3.9 Mblock/s | CPU | R=0.116 ORANGE; QPU = opportunistic helper (0.42 Mblock/s in mixed) |
| 6 | H.264 IDCT 4×4 | 175 Mblock/s | CPU | trivially fast on NEON; QPU pointless |
| 7 | H.264 IDCT 8×8 | 151 Mblock/s | CPU | likewise |
| 8 | H.264 deblock luma-v | 92 Medge/s | CPU | R=0.061 RED; QPU = opportunistic helper (6.2 Medge/s in mixed) |
| 9 | H.264 luma qpel MC (mc20) | 131 Mblock/s | CPU | NEON 19× faster than VP9 analog; QPU pointless |

Per-cycle Phase 7 docs are in `docs/k*_phase7.md` (or `*_phase3_and_4.md` for deferred-Phase-4 closures).

## Why this exists

higgs is a Raspberry Pi Compute Module 5 in a small portable chassis with a battery. Watching nerds review *Star Wars* on YouTube while putting Mac Studios into virtual shopping baskets is a core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer high-bitrate / high-resolution content). It does not serve HEVC. Pi 5's BCM2712 has one HW decoder block: HEVC. So {what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

Every YouTube frame on higgs today is software-decoded on Cortex-A76 cores at ~50–90% CPU per video stream.

Offloading the parallel back-end of that decode to the otherwise-idle QPU complex *might* recover meaningful CPU time and battery on higgs.
The honest prior, measured in Phase 0, is that the QPU has roughly equal raw compute to the A76 cluster but a smaller slice of the shared LPDDR4x bandwidth, so the win, if any, comes from offloading *concurrent* work the CPU would have done anyway.

The Pi Foundation isn't going to do this work (per their own statement: chromium-patch sustainment was too much; codec sustainment would be more so). The kernel `rpi-hevc-dec` series has been 17 months in review for one decoder block they DID write themselves. Whatever ships here ships through the community.

## Architecture (Path B)

Phase 0 evaluated two paths and closed one:

- **Path A: custom VPU firmware on the VC7 scalar cores.** Blocked. BCM2712 has a silicon root of trust: the mask ROM hardcodes RPi's public key and unconditionally verifies the second-stage bootloader. The `EXECUTE_CODE` mailbox was removed on Pi 5. No software-only bypass exists. See `docs/phase0.md §3`.
- **Path B: QPU compute kernels via the existing Mesa `v3d` / DRM / Vulkan-compute path.** This is the path. The QPU is reachable from userspace today on a stock signed Pi 5 / CM5 via `/dev/dri/card0`. No firmware loading. No signing fight. `Idein/py-videocore7` (SGEMM 21 GFLOPS sustained) is the existence proof.

The build:

```
┌───────────────────────────────┐
│  userspace VP9 / AV1 decoder  │
│  (fork of dav1d / libvpx)     │
├───────────────────────────────┤
│  ARM: entropy decode          │  ← Cortex-A76 + NEON
│       (Bool coder / ANS)      │    structurally serial
├───────────────────────────────┤
│  QPU: parallel back-end       │  ← V3D 7.1 via Mesa v3dv
│       (IDCT, CDEF,            │    Vulkan compute shaders
│        deblock, LR, MC)       │    or direct DRM submit
├───────────────────────────────┤
│  V4L2 stateless wrapper       │  ← out-of-tree kernel module
│  (eventual, kernel-agent)     │    exposing /dev/videoN
└───────────────────────────────┘
```

The first deliverable was one back-end kernel; nine cycles later the public API in `include/daedalus.h` exposes nine kernels, each with bit-exact NEON and (where worthwhile) QPU paths.
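
To sanity-check the Path B claim on a given board, one option is to ask a DRM node which kernel driver backs it. The sketch below is not part of this repo: `drm_driver_name` and `drm_node_is_v3d` are hypothetical helpers, and the code assumes the Linux UAPI header `<drm/drm.h>` is installed. It issues the standard `DRM_IOCTL_VERSION` query and copies out the driver name, which should read `v3d` on the node that fronts the QPU. The node path is a parameter because DRM minor numbers are assigned at probe time.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <drm/drm.h>   /* Linux UAPI: struct drm_version, DRM_IOCTL_VERSION */

/* Hypothetical helper (not part of daedalus.h): writes the kernel driver
 * name behind a DRM node into `out` as a NUL-terminated string.
 * Returns 0 on success, -1 if the node cannot be opened or queried. */
int drm_driver_name(const char *node, char *out, size_t out_len)
{
    if (node == NULL || out == NULL || out_len == 0)
        return -1;

    int fd = open(node, O_RDWR);
    if (fd < 0)
        return -1;

    struct drm_version v;
    memset(&v, 0, sizeof v);
    v.name = out;
    v.name_len = out_len - 1;          /* leave room for the terminator */

    int rc = ioctl(fd, DRM_IOCTL_VERSION, &v);
    close(fd);
    if (rc != 0)
        return -1;

    /* The kernel reports the full name length but copies at most the
     * number of bytes we offered, so clamp before terminating. */
    size_t n = v.name_len < out_len - 1 ? v.name_len : out_len - 1;
    out[n] = '\0';
    return 0;
}

/* Returns 1 if `node` is backed by the v3d driver, 0 otherwise. */
int drm_node_is_v3d(const char *node)
{
    char name[32];
    return drm_driver_name(node, name, sizeof name) == 0
        && strcmp(name, "v3d") == 0;
}
```

A caller could run `drm_node_is_v3d("/dev/dri/card0")` before creating a context; a result of 0 would suggest trying the other nodes under `/dev/dri/` rather than concluding that the QPU is unreachable.
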
The V4L2 wrapper is the next-up sibling project ([daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2)), which turns the kernel-library into a `/dev/videoNN` device for libva-v4l2-request-fourier / browser consumption.

## In scope

- The set of codec back-end kernels documented in the deployment recipe table above (9 kernels closed; more added per cycle as the codec coverage expands).
- A test harness on hertz that runs each kernel against a bit-exact reference (FFmpeg or dav1d NEON) and measures throughput vs the equivalent NEON path.
- The public C API in `include/daedalus.h` so the sibling daedalus-v4l2 (and any other consumer) can dispatch per-block work with recipe-default substrate routing or explicit override.

## Out of scope (lives in sibling repos)

- The V4L2 stateless driver: that's [daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2).
- Bitstream parsing: that lives in daedalus-v4l2 too, via `dlopen`'d FFmpeg at runtime (Option γ).
- Browser-side consumption: libva-v4l2-request-fourier + firefox-fourier / chromium-fourier, already mature.

## Out of scope (permanent)

- HEVC (Pi 5 has dedicated silicon; `rpi-hevc-dec` covers it).
- Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute budget.
- Encode. Pi Foundation removed all HW encode in Pi 5.
- Custom VPU firmware (Path A, blocked by silicon RoT; see `docs/phase0.md`).
- Beating ARM NEON unconditionally. The honest target is *concurrent* work: QPU runs while CPU does something else. Per Issue 003 (`docs/issues/003-mixed-kernel-m4-bench.md`), the mixed-kernel deployment shape is where QPU offload pays; same-kernel M4 is the worst-case bound.

## Dev substrate

- **hertz** (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712, Mesa 25.0.7 with v3dv, V3D 7.1.7): the dev / test / measurement host. Watchdog-protected for crash recovery. See `docs/vulkaninfo_v3d_7_1_7_hertz.txt` for the inside-view device profile.
- **higgs** (CM5 in portable battery chassis): the eventual user target. Not a dev unit; sealed chassis.

## Conventions

This project follows a 9(+1)-phase dev process per cycle. See `docs/dev_process.md`. Phase 0 is closed once at project start (`docs/phase0.md`); each kernel cycle re-runs Phases 1-9.

Phase 5 (second-model independent review) is non-skippable per project rule. See `~/.claude/CLAUDE.md` "Reviews are never skippable": empty/no-finding reviews are themselves a strong positive signal, not wasted effort.

Gitea identity: `claude-noether` for Claude-driven pushes, via SSH alias `git.reauktion.de.claude-noether` (see `memory/reference_gitea_ssh_alias_noether.md`).

## Layout

```
daedalus-fourier/
├── README.md                 ← this file
├── include/daedalus.h        ← public C API
├── src/
│   ├── daedalus_core.c       ← API impl: per-kernel CPU+QPU dispatch
│   ├── v3d_runner.{c,h}      ← Vulkan compute plumbing
│   └── v3d_*.comp            ← compute shaders (cycles 1, 2, 4, 5, 8)
├── tests/
│   ├── *_ref.c               ← per-kernel C references (bit-exact)
│   ├── bench_neon_*.c        ← NEON M3 baselines
│   ├── bench_v3d_*.c         ← QPU M2 + 3-way M1 (vs NEON + C ref)
│   ├── bench_concurrent_*.c  ← M4 mixed-kernel concurrent bench
│   └── test_api_*.c          ← public API smoke tests
├── docs/
│   ├── dev_process.md        ← reference 9(+1)-phase loop
│   ├── phase0.md             ← substrate audit (closes Path A)
│   ├── phase1.md             ← R-band decision rules
│   ├── phase8_scoping.md     ← V4L2 architecture options
│   ├── phase8_status.md      ← decisions locked + status
│   ├── k1_*.md..k9_*.md      ← per-cycle Phase 1/3/4/5/7 docs
│   └── issues/               ← deferred work
├── external/
│   ├── ffmpeg-snapshot/      ← vendored FFmpeg n7.1.3 NEON refs (LGPL-2.1+)
│   └── dav1d-snapshot/       ← vendored dav1d 1.4.3 CDEF (BSD-2-Clause)
└── CMakeLists.txt
```

## Build and run

On a Pi 5 (Debian Trixie or similar) with Vulkan SDK + Mesa v3dv:

```sh
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build .

# Per-kernel M1+M3 NEON baseline:
./bench_neon_idct
./bench_neon_lpf
./bench_neon_h264deblock
# ... (one per cycle)

# Per-kernel M1+M2 QPU bench (3-way bit-exact vs NEON + C ref):
./bench_v3d_idct
./bench_v3d_lpf
./bench_v3d_h264deblock
# ...

# Public API smoke tests:
./test_api_idct                # VP9 IDCT 8x8, CPU+QPU+AUTO
./test_api_lpf                 # VP9 LPF wd=4 + wd=8
./test_api_h264                # H.264 IDCT 4x4 + 8x8 + deblock
./test_api_opportunistic_qpu   # cycles 3+5+8 QPU-override paths

# Mixed-kernel M4 bench (Issue 003 framework):
./bench_concurrent_mixed --cpu-kernel mc --qpu-kernel lpf4 --neon-threads 3 --qpu-core 3 --duration 6
```

## Consuming the kernel library

For integration code (e.g., `daedalus-v4l2` userspace daemon):

```c
#include <daedalus.h>

daedalus_ctx *ctx = daedalus_ctx_create();
// has_qpu == 1 if V3D init succeeded; else NEON-only fallback

// Recipe dispatch: routes to the per-cycle verdict substrate.
daedalus_recipe_dispatch_vp9_idct8(ctx, dst, stride, coeffs, n_blocks, meta);

// Or explicit substrate selection for runtime-aware scheduling:
daedalus_dispatch_vp9_mc_8h(ctx, DAEDALUS_SUBSTRATE_QPU,
                            dst, dst_stride, src, src_stride,
                            n_blocks, meta);

daedalus_ctx_destroy(ctx);
```

See `include/daedalus.h` for the full API.

## Sibling projects in the same orbit

- **[daedalus-v4l2](https://git.reauktion.de/reauktion/daedalus-v4l2)**: V4L2 stateless wrapper. Linux kernel module + userspace daemon that consume `libdaedalus_core.a` from this repo. Scaffold + roadmap; Phase 8 implementation work.
- `libva-v4l2-request-fourier`: VA-API consumer; talks to daedalus-v4l2's `/dev/videoNN`.
- `firefox-fourier`: Firefox fork routing stateless V4L2 through libavcodec's `v4l2_request` hwaccel.
- `chromium-fourier`: sibling for Chromium.
- `ampere-av1-enablement`: software-side AV1 bring-up on RK3588 (rkvdec / vpu981). Provides the userspace conformance harness daedalus reuses for VC7-AV1 verification.

## Source attribution

Daedalus-the-myth is public domain.
The wax-and-feathers metaphor is older than software engineering. Anyone wanting to fail at this project: please file your failures under `branches/icarus/`. Built-in self-deprecation slot, with honor.