marfrit 356e446a49 Cycle 3 (MC interpolation) closure: M1'''=100%, R'''=0.067 RED, M4=-19.5%
Third daedalus-fourier kernel — VP9 8-tap regular subpel filter,
horizontal direction, 8-wide output. Multiply-heavy by design to
stress V3D's no-DP4A deficit. Full cycle Phase 1-7 + M4'''.

Phase 5''' second-model review delivered cleanly — caught 1 RED
bug pre-implementation (src_off off-by-3 indexing convention) and
2 YELLOW gaps (assert MUST language, shaderdb filter-LUT gate).
Without the review, M1''' would have failed silently on first run
with cryptic "high-index source pixels wrong" symptoms.

Phase 6 v1 first-light: M1''' 100.0000% bit-exact (65536/65536
blocks across all 16 mx phases). Phase 5''' filter-LUT prediction
materialised exactly: 197 uniforms (gate was 144), 2 threads (down
from cycle-2's 4 due to register pressure).

Performance:

  M2''' = 1.413 Mblock/s     (707.9 ns/block)
  M3''' = 20.997 Mblock/s    (NEON baseline phase3)
  R'''  = 0.067              (RED band — structural mismatch)
  shaderdb: 488 inst, 2 threads, 197 uniforms, 25 max-temps, 0 spills

M4''' concurrent matrix (8s windows):

  NEON 1-core           14.479 Mblock/s
  NEON 4-core           15.248 Mblock/s   <- baseline (compute-bound,
                                              not bandwidth-saturated
                                              like cycles 1+2!)
  QPU only               1.380 Mblock/s
  MIXED NEON-3 + QPU    12.277 Mblock/s   <- -19.5% (FAIL gate)
  MIXED NEON-4 + QPU    12.158 Mblock/s   <- -20.3%

NEW cross-cycle finding (Phase 9 lesson 2): compute-bound CPU
workloads make the QPU-offload story collapse. Cycles 1+2 were
bandwidth-saturated (4-core scaling 0.56-0.82x of 1-core), so
freeing a CPU core via QPU offload added throughput. Cycle 3 MC
is compute-bound (4-core scaling 1.05x of 1-core — near-linear),
no free cycles to free. QPU contribution (0.45 Mblock/s in
contention) doesn't compensate for losing 1 NEON core delivering
~3.8 Mblock/s.

But 30fps@1080p floor: PASS in every config (1.4x to 15.7x
isolation margin). Per project_30fps_floor_is_fine.md, user-facing
test never fails — daily YouTube playback works fine on any CPU/QPU
split.

DEPLOYMENT RECIPE for higgs (cycle 3 confirmed split):

  IDCT (k1)  -> QPU   (R=0.92, +7% mixed, frees CPU core)
  LPF  (k2)  -> QPU   (R=0.41, +7% mixed, frees CPU core)
  MC   (k3)  -> CPU   (R=0.067, -19.5% mixed — stays on CPU)
  Entropy    -> CPU   (structurally serial)

Mixed-substrate deployment, not "QPU does everything". Realistic for
higgs: entropy + MC on 2-3 ARM cores; IDCT + LPF dispatched to QPU
concurrently; 1-2 ARM cores left for vscode etc.

New artifacts:
- src/v3d_mc_8h.comp               — GLSL kernel
- tests/vp9_mc_ref.c               — standalone C ref (REGULAR filter
                                     embedded; clean transcription)
- tests/bench_neon_mc.c            — M1'''_c + M3''' bench
- tests/bench_v3d_mc.c             — M1''' + M2''' bench with contract
                                     asserts + 30fps margin display
- tests/bench_concurrent_mc.c      — M4''' pthread bench
- external/ffmpeg-snapshot/libavcodec/aarch64/vp9mc_neon.S    (vendored)
- external/ffmpeg-snapshot/libavcodec/vp9_subpel_filters_table.c
                                     (hand-extracted; provides
                                      ff_vp9_subpel_filters symbol
                                      without dragging in full vp9dsp.c)
- docs/k3_mc_phase{1,2,3,4,5,7}.md — full cycle documentation

Memory updates: project_30fps_floor_is_fine.md (user's 30fps target
recalibration), MEMORY.md index updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:51:43 +00:00

daedalus-fourier

Community-built VP9 / AV1 software-decode back-end running on the VideoCore VII (V3D 7.1) QPUs on Broadcom BCM2712 (Raspberry Pi 5 / Compute Module 5), via the existing Mesa v3d userspace driver. ARM keeps the serial entropy front-end; the QPU takes the parallel back-end (inverse transforms, deblocking, CDEF, loop restoration, MC residual add).

Daedalus built the Labyrinth for King Minos, then escaped from it by hand-forging flight firmware out of feathers and wax when no sanctioned exit existed.

That's the project shape. The Broadcom-locked VideoCore VII is the Labyrinth; the Pi Foundation's "use the HEVC block and live with software decode for everything else" is the official non-exit; the QPU sits unused inside the labyrinth's walls.

Status: Phase 0 closed (substrate audit). Phase 1 in progress (first-kernel proof on hertz). This is research-track work that may take months or may yield a single proof-of-concept kernel that loses to ARM NEON, in which case the negative result ships and the project closes.

Why this exists

higgs is a Raspberry Pi Compute Module 5 in a small portable chassis with a battery. Watching nerds review Star Wars on YouTube while putting Mac Studios into virtual shopping baskets is a core workload for the higgs class of device.

YouTube serves H.264 (legacy), VP9 (typical 4K), and AV1 (newer high-bitrate / high-resolution content). It does not serve HEVC. Pi 5's BCM2712 has one HW decoder block: HEVC. The intersection is empty: {what YouTube serves} ∩ {what BCM2712 decodes in HW} = ∅.

Every YouTube frame on higgs today is software-decoded on Cortex-A76 cores at ~50-90% CPU per video stream. Offloading the parallel back-end of that decode to the otherwise-idle QPU complex might recover meaningful CPU time and battery on higgs. The honest prior, measured in Phase 0, is that the QPU has roughly equal raw compute to the A76 cluster but a smaller slice of the shared LPDDR4x bandwidth, so the win, if any, comes from offloading concurrent work the CPU would have done anyway.

The Pi Foundation isn't going to do this work (per their own statement: chromium-patch sustainment was too much; codec sustainment would be more so). The kernel rpi-hevc-dec series has been 17 months in review for one decoder block they DID write themselves. Whatever ships here ships through the community.

Architecture (Path B)

Phase 0 settled the choice between two paths:

  • Path A — custom VPU firmware on the VC7 scalar cores. Blocked. BCM2712 has a silicon root of trust: the mask ROM hardcodes RPi's public key and unconditionally verifies the second-stage bootloader. EXECUTE_CODE mailbox removed on Pi 5. No software-only bypass exists. See docs/phase0.md §3.

  • Path B — QPU compute kernels via the existing Mesa v3d / DRM / Vulkan-compute path. This is the path. The QPU is reachable from userspace today on a stock signed Pi 5 / CM5 via /dev/dri/card0. No firmware loading. No signing fight. Idein/py-videocore7 (SGEMM 21 GFLOPS sustained) is the existence proof.

The build:

┌───────────────────────────────┐
│ userspace VP9 / AV1 decoder   │
│  (fork of dav1d / libvpx)     │
├───────────────────────────────┤
│  ARM:    entropy decode       │ ← Cortex-A76 + NEON
│          (Bool coder / ANS)   │   structurally serial
├───────────────────────────────┤
│  QPU:    parallel back-end    │ ← V3D 7.1 via Mesa v3dv
│          (IDCT, CDEF,         │   Vulkan compute shaders
│           deblock, LR, MC)    │   or direct DRM submit
├───────────────────────────────┤
│ V4L2 stateless wrapper        │ ← out-of-tree kernel module
│  (eventual, kernel-agent)     │   exposing /dev/videoN
└───────────────────────────────┘

The first deliverable is not the V4L2 wrapper. The first deliverable is one back-end kernel running on the QPU, bit-exact against a libavcodec reference, with measured throughput. If that single kernel can't beat NEON or get within 50% of it, the project closes here with a documented negative result.

In scope

  • A small set of codec back-end kernels (IDCT 8×8, CDEF, deblocking, loop restoration filter, MC interpolation) compiled as SPIR-V compute shaders for Mesa v3dv, dispatched via Vulkan compute from userspace.
  • A test harness on hertz that runs each kernel against libavcodec reference outputs and measures throughput (megapixels/sec or blocks/sec) against the equivalent NEON path.
  • Phase 1 = one kernel, bit-exact, with numbers. Phase 2+ = more kernels only if Phase 1 numbers justify it.

Out of scope (for now)

  • HEVC (Pi 5 has dedicated silicon; rpi-hevc-dec covers it).
  • Pi 4 / BCM2711 / VideoCore VI. Different ISA, smaller compute budget. Path B could extend but isn't the priority.
  • Encode. Pi Foundation removed all HW encode in Pi 5; encode on VC7 is a separate, larger project.
  • Custom VPU firmware (Path A — blocked by silicon RoT, see docs/phase0.md).
  • V4L2 stateless driver wrapping the userspace decoder. Eventual consumption point, but Phase 1 lives entirely in userspace.
  • Beating ARM NEON unconditionally. The honest target is concurrent work: QPU runs while CPU does something else.

Dev substrate

  • hertz (Pi 5, 8 GB, Debian Trixie, kernel 6.12.75-rpt-rpi-2712, Mesa 25.0.7 with v3dv, V3D 7.1.7) — the dev / test / measurement host. Watchdog-protected for crash recovery. See docs/vulkaninfo_v3d_7_1_7_hertz.txt for the inside-view device profile.
  • higgs (CM5 in portable battery chassis) — the eventual user target. Not a dev unit; sealed chassis.

Conventions

This project follows the 9(+1)-phase dev process. See docs/dev_process.md. Phase 0 is closed (docs/phase0.md); Phase 1 is docs/phase1.md.

Gitea identity: claude-noether (per feedback_gitea_as_claude_noether.md). No marfrit pushes from Claude sessions.

Layout

daedalus-fourier/
├── README.md             ← this file
├── docs/
│   ├── dev_process.md    ← reference copy of the 9(+1)-phase loop
│   ├── phase0.md         ← substrate audit (blocks Path A, selects Path B)
│   ├── phase1.md         ← first-kernel goal + measurement plan
│   └── vulkaninfo_v3d_7_1_7_hertz.txt
│                          ← inside-view device profile from hertz
├── src/                  ← kernels + Vulkan dispatch harness
└── tests/                ← bit-exact vs libavcodec, throughput

No build system yet. Adding CMake when the first kernel lands.

Sibling projects in the same orbit

  • libva-v4l2-request-fourier — VA-API consumer-side backend. Eventual consumer if daedalus produces a V4L2 stateless node.
  • firefox-fourier — Firefox fork that routes stateless V4L2 through libavcodec's v4l2_request hwaccel. Same pickup point.
  • chromium-fourier — sibling for Chromium.
  • kernel-agent — would house the V4L2 driver wrapping the userspace decoder, once one exists.
  • ampere-av1-enablement — software-side AV1 bring-up on RK3588 (rkvdec / vpu981). Provides the userspace conformance harness daedalus reuses for VC7-AV1 verification.

Source attribution

Daedalus-the-myth is public domain. The wax-and-feathers metaphor is older than software engineering.

Anyone wanting to fail at this project: please file your failures under branches/icarus/. Built-in self-deprecation slot, with honor.

Description
Community-built VP9/AV1 software-decode back-end running on VideoCore VII (V3D 7.1) QPUs on Pi 5 / CM5 via Mesa v3dv. ARM keeps entropy front-end; QPU takes parallel back-end. Phase 0 closed; cycles 1-5 implemented (IDCT, LPF wd=4, MC, LPF wd=8, CDEF partial).