diff --git a/daedalus_architecture_backlog.md b/daedalus_architecture_backlog.md
new file mode 100644
index 0000000..919f158
--- /dev/null
+++ b/daedalus_architecture_backlog.md
@@ -0,0 +1,254 @@
+# Daedalus architecture backlog
+
+**Status:** design draft, **not** scheduled. Captured 2026-05-23 after the cycle 9 close, while Pi 5 H.264 deployment is still settling on higgs. The pivot described here is **deferred until a second SoC creates a forcing function** — see "Why deferred" at the bottom.
+
+This document is forward-looking. It describes the generalized multi-SoC daedalus daemon architecture, but the immediate work block stays "finish Pi 5". Re-read this when:
+
+- A second aarch64 host without a working kernel-side V4L2 stateless decoder shows up in the fleet (most likely candidate: Pi 4, which has V3D 4.x and no rpivid stable upstream).
+- A specific working-copy slowdown that the current Pi-5-only daedalus can't address motivates the generalization.
+- libva-v4l2-request-fourier evolves to need multi-node negotiation (currently it picks the first matching V4L2 node).
+
+Until then: this is decision context, not a TODO.
+
+---
+
+## What we have today (2026-05-23)
+
+The current stack is **Pi 5 specific** by deliberate construction:
+
+```
+Firefox / mpv
+  └─ libva-fourier (VAAPI)
+      └─ libva-v4l2-request-fourier (V4L2 stateless consumer)
+          └─ /dev/video0 (daedalus_v4l2 kernel char-dev shim)
+              └─ /dev/daedalus-v4l2 → userspace daemon (Option γ)
+                  └─ dlopen libavcodec.so.62 (Kwiboo FFmpeg fork)
+                      └─ daedalus-fourier kernels (NEON + V3D opportunistic)
+                          ├─ cycle 1: VP9 IDCT 8x8          (V3D QPU)
+                          ├─ cycle 2: VP9 LPF wd=4          (V3D QPU)
+                          ├─ cycle 3: VP9 MC 8h             (CPU NEON)
+                          ├─ cycle 4: VP9 LPF wd=8          (V3D QPU)
+                          ├─ cycle 5: AV1 CDEF 8x8          (CPU NEON; QPU opportunistic helper)
+                          ├─ cycle 6: H.264 IDCT 4x4        (CPU NEON)
+                          ├─ cycle 7: H.264 IDCT 8x8        (CPU NEON)
+                          ├─ cycle 8: H.264 luma-v deblk    (CPU NEON; QPU opportunistic helper)
+                          └─ cycle 9: H.264 luma qpel mc20  (CPU NEON)
+```
+
+Two things in this stack **already** look like the generalized architecture:
+
+1. **`daedalus_recipe_dispatch_*` is already the runtime substrate selector.** Public-API functions in `include/daedalus.h` (cycles 6–9 added the H.264 family on 2026-05-21 through 2026-05-23). Per-kernel substrate decisions live in `daedalus_recipe_substrate_for(daedalus_kernel k)` — currently a hard-coded switch, but a data-driven version is a near-mechanical rewrite.
+
+2. **libva-v4l2-request-fourier already abstracts over "any V4L2 stateless decoder node".** On RK3588 the same VAAPI driver consumes rkvdec directly with no daedalus daemon in the path; on Pi 5 it consumes the daedalus_v4l2 shim. The cross-SoC seam is **at the V4L2 device level**, which is the right place — it's how the upstream V4L2 stateless API was designed to work.
+
+So the generalization needed is smaller than it looks. Most of the abstraction surface is already in place; what's missing is **substrate-table data per SoC** and a **second daemon backend** for codec-level pass-through to vendor decoders.
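The "near-mechanical rewrite" of the hard-coded switch into data could look roughly like the sketch below. It is written in Python for brevity, whereas the real selector is C in `include/daedalus.h`. The kernel names, the `Substrate` enum, and `substrate_for` are hypothetical; the table values simply mirror the cycle 1–9 placements in the stack diagram above, and the `gpu_available` flag is an assumed stand-in for contexts created without QPU access.

```python
from enum import Enum


class Substrate(Enum):
    CPU_NEON = "cpu"
    V3D_QPU = "gpu"


# Hypothetical kernel identifiers; substrates copied from the stack diagram.
PI5_TABLE = {
    "vp9_idct8": Substrate.V3D_QPU,         # cycle 1
    "vp9_lpf4": Substrate.V3D_QPU,          # cycle 2
    "vp9_mc_8h": Substrate.CPU_NEON,        # cycle 3
    "vp9_lpf8": Substrate.V3D_QPU,          # cycle 4
    "av1_cdef8": Substrate.CPU_NEON,        # cycle 5
    "h264_idct4": Substrate.CPU_NEON,       # cycle 6
    "h264_idct8": Substrate.CPU_NEON,       # cycle 7
    "h264_deblock_lv": Substrate.CPU_NEON,  # cycle 8
    "h264_qpel_mc20": Substrate.CPU_NEON,   # cycle 9
}


def substrate_for(kernel, table, gpu_available=True):
    """Look up the substrate for a kernel instead of switching on it.

    Kernels missing from the table, and GPU verdicts in a context that
    has no GPU access, both degrade to CPU NEON.
    """
    choice = table.get(kernel, Substrate.CPU_NEON)
    if choice is Substrate.V3D_QPU and not gpu_available:
        return Substrate.CPU_NEON
    return choice


if __name__ == "__main__":
    for name in ("vp9_idct8", "h264_qpel_mc20", "not_in_table"):
        print(name, "->", substrate_for(name, PI5_TABLE).value)
```

With this shape, a second SoC would need only a second table rather than a second switch, which is the sense in which the missing piece is data rather than architecture.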
+
+---
+
+## Problem statement
+
+The mfritsche fleet has heterogeneous aarch64 hardware decoders:
+
+| SoC | Host(s) | H.264 | HEVC | VP9 | AV1 | GPU compute |
+|---|---|---|---|---|---|---|
+| BCM2712 (Pi 5) | higgs, broglie | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
+| BCM2711 (Pi 4) | dcw3 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
+| RK3588 | hertz, tesla | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk) + RK NPU |
+| Allwinner H6 | (not in current fleet, but Cedrus exists) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |
+
+No single SoC has a complete codec set. RK3588 lacks AV1; Pi 5 lacks H.264 + VP9 + AV1; Pi 4 has rpivid (out-of-tree, kernel-version-fragile); Allwinner Cedrus is H.264/HEVC only.
+
+The current daedalus model — "kernel substitution + libavcodec front end" — is the right answer for **Pi 5 specifically**, where no usable kernel V4L2 stateless decoder exists for the codecs we care about, and a Vulkan-capable GPU (V3D7) is available to help on a few kernels.
+
+The model is **not** the right answer for SoCs that already have working V4L2 stateless decoders for the requested codec — those should be passed through, not re-implemented through libavcodec + kernel substitution.
+
+---
+
+## The conceptual gap
+
+A naïve "shaders per SoC" generalization runs into the fact that **hardware decoders are not made of shaders**. rkvdec on RK3588, Hantro G1/G2 on Allwinner, VPU8 on Amlogic, even the rpi-hevc-dec block on Pi 5 — these are **bitstream-in, NV12-out** monoliths that do not expose intermediate kernel slots. You cannot route "their IDCT" through one substrate and "their MC" through another; they are opaque pipelines.
+
+This forces a **two-backend daemon**:
+
+- **Substrate-composed backend.** What we have today. Used when no hardware decoder for the requested codec exists on this SoC. Front end is libavcodec (entropy decode, slice headers); kernel hot paths run through `daedalus_recipe_dispatch_*` with substrate chosen per (SoC × kernel).
+
+- **Pass-through backend.** Used when a hardware decoder for the requested codec exists. Daemon (or, more realistically, the kernel V4L2 shim itself) forwards the bitstream to the vendor V4L2 stateless node and returns the decoded frame. No kernel substitution. Effectively a no-op from the daemon's perspective — and in fact, **libva-v4l2-request-fourier can already talk to the vendor node directly** without going through the daedalus daemon at all.
+
+The routing decision is **per (SoC × codec)**:
+
+| | Pi 5 | Pi 4 | RK3588 | Allwinner H6 |
+|---|---|---|---|---|
+| H.264 | substrate-composed (NEON+QPU) | substrate-composed (NEON only — V3D4 too weak) **or** rpivid pass-through if stable | rkvdec pass-through | Cedrus pass-through |
+| HEVC | rpi-hevc-dec pass-through (when SPS quirks fixed) **or** substrate-composed | rpivid pass-through | rkvdec pass-through | Cedrus pass-through |
+| VP9 | substrate-composed | substrate-composed | rkvdec pass-through | substrate-composed |
+| AV1 | substrate-composed | substrate-composed (slow) | substrate-composed | substrate-composed |
+
+Note: on RK3588 + every codec rkvdec supports, the **daedalus daemon is bypassed entirely** — libva talks to rkvdec directly. The daemon is only ever in the path on SoCs where at least one codec needs substrate-composition.
+
+---
+
+## Refined architecture sketch
+
+If/when we do this:
+
+```
+/usr/lib/daedalus/
+├── shaders/                  # SPIR-V binaries, one set for all Vulkan-
+│                             # capable SoCs (V3D7, V3D4, Mali Valhall,
+│                             # Mali Bifrost, Adreno). SPIR-V is portable
+│                             # by design — the per-SoC fragmentation is
+│                             # *which kernels are worth running on GPU*,
+│                             # not the binaries themselves.
+│
+├── caps/                     # per-SoC substrate selection tables
+│   ├── bcm2712.toml          # Pi 5 (V3D7, no H.264 HW)
+│   ├── bcm2711.toml          # Pi 4 (V3D4, rpivid optional)
+│   ├── rk3588.toml           # RK3588 (rkvdec covers most codecs;
+│   │                         # substrate-composed only for AV1)
+│   ├── allwinner-h6.toml     # Cedrus
+│   └── default.toml          # unknown SoC: CPU NEON only,
+│                             # libavcodec front-end + kernel pack
+│
+└── plugins/                  # ONLY for pass-through to vendor decoders
+    ├── rkvdec_passthrough.so # forward bitstream to /dev/video-rkvdec
+    ├── cedrus_passthrough.so
+    └── rpivid_passthrough.so # if we ever stabilize rpivid
+
+```
+
+Daemon startup probe:
+
+1. Read `/proc/device-tree/compatible` (or `/sys/firmware/devicetree/.../compatible`); fall back to DMI on x86 (won't apply in practice — fleet is aarch64-only).
+2. Match against caps files; load the matching `.toml`.
+3. Enumerate `/dev/video*` and `/dev/media*`; classify each as {daedalus-shim, vendor-stateless, vendor-stateful, unknown}.
+4. For each codec the caps file declares as "pass-through-preferred": load the matching `plugins/_passthrough.so`. On dlopen failure, fall back to substrate-composed.
+5. Build per-codec routing table; advertise the union through V4L2 to libva.
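Steps 1, 2, 4 and 5 of the probe can be sketched as follows. This is an illustration in Python, not the daemon's code: the function names, the `plugin_loads` callback (standing in for a dlopen attempt), and the in-memory `CAPS` dictionaries (standing in for parsed caps files) are all assumptions. The one external fact relied on is that a device-tree `compatible` property is a sequence of NUL-terminated strings ordered most specific first.

```python
from pathlib import Path


def parse_compatible(raw):
    """Split a raw `compatible` property into its strings."""
    return [s.decode("ascii", "replace") for s in raw.split(b"\0") if s]


def read_compatible(path="/proc/device-tree/compatible"):
    """Step 1: read the compatible list from the running system."""
    return parse_compatible(Path(path).read_bytes())


def select_caps(compatible, caps_by_name):
    """Step 2: first caps table sharing a compatible string, else default."""
    for entry in compatible:  # most specific string first
        for name, caps in caps_by_name.items():
            if entry in caps.get("compatible", []):
                return name, caps
    return "default", caps_by_name["default"]


def build_routing(caps, plugin_loads):
    """Steps 4-5: per-codec backend, degrading when a plugin fails to load."""
    routing = {}
    for codec, cfg in caps.get("codecs", {}).items():
        backend = cfg.get("backend", "substrate-composed")
        if backend == "passthrough" and not plugin_loads(cfg.get("plugin", "")):
            backend = "substrate-composed"
        routing[codec] = backend
    return routing


# Hypothetical, trimmed-down stand-ins for parsed caps files.
CAPS = {
    "bcm2712": {
        "compatible": ["raspberrypi,5-model-b", "brcm,bcm2712"],
        "codecs": {"h264": {"backend": "substrate-composed"}},
    },
    "rk3588": {
        "compatible": ["rockchip,rk3588", "rockchip,rk3588s"],
        "codecs": {
            "h264": {"backend": "passthrough", "plugin": "rkvdec_passthrough.so"},
            "av1": {"backend": "substrate-composed"},
        },
    },
    "default": {"compatible": [], "codecs": {}},
}

if __name__ == "__main__":
    raw = b"raspberrypi,5-model-b\0brcm,bcm2712\0"
    name, caps = select_caps(parse_compatible(raw), CAPS)
    print(name, build_routing(caps, plugin_loads=lambda plugin: True))
```

Step 3 (classifying `/dev/video*` nodes) is left out because it needs V4L2 ioctls against real devices. Keeping plugin loading behind a callback also makes the dlopen-failure fallback testable without any plugin present.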
+
+**Caps file shape** (illustrative — final TOML keys TBD):
+
+```toml
+# bcm2712.toml — Pi 5, V3D7 GPU compute available; no H.264/VP9/AV1 HW decoders
+compatible = ["raspberrypi,5-model-b", "brcm,bcm2712"]
+
+[gpu]
+substrate = "v3d-vulkan"
+device_match = "V3D 7" # Vulkan VkPhysicalDeviceProperties.deviceName regex
+
+[codecs.h264]
+backend = "substrate-composed"
+[codecs.h264.kernels]
+idct4 = "cpu"
+idct8 = "cpu"
+deblock_lv = "cpu" # opportunistic = "gpu" — see cycle 8 docs
+qpel_mc20 = "cpu"
+
+[codecs.vp9]
+backend = "substrate-composed"
+[codecs.vp9.kernels]
+idct8 = "gpu"
+lpf4 = "gpu"
+mc_8h = "cpu"
+lpf8 = "gpu"
+
+[codecs.av1]
+backend = "substrate-composed"
+[codecs.av1.kernels]
+cdef = "cpu" # opportunistic = "gpu"
+```
+
+```toml
+# rk3588.toml — rkvdec covers H.264/HEVC/VP9; AV1 falls to substrate-composed
+compatible = ["rockchip,rk3588", "rockchip,rk3588s"]
+
+[gpu]
+substrate = "mali-valhall"
+device_match = "Mali-G610"
+
+[codecs.h264]
+backend = "passthrough"
+plugin = "rkvdec_passthrough.so"
+v4l2_node_match = "rkvdec"
+
+[codecs.hevc]
+backend = "passthrough"
+plugin = "rkvdec_passthrough.so"
+
+[codecs.vp9]
+backend = "passthrough"
+plugin = "rkvdec_passthrough.so"
+
+[codecs.av1]
+backend = "substrate-composed"
+[codecs.av1.kernels]
+cdef = "cpu" # Mali Valhall opportunistic = TBD
+```
+
+Pass-through plugins are *thin* — they translate the daedalus daemon's wire protocol to the vendor's V4L2 stateless ioctls (which they often already are; the plugin is mostly a fd-forward and buffer-copy). The substrate-composed backend stays as it is today.
+
+---
+
+## Where it gets hard
+
+1. **Caps-file authorship.** Each new SoC needs measurement-driven entries (M3 thresholds, R-band verdicts) — that's the entire daedalus-fourier cycle 1–9 dance, done per SoC. Pi 5 took ~3 weeks. Pi 4 V3D4 is probably 1–2 weeks (same kernels, weaker GPU; mostly verifying CPU verdicts hold). RK3588 is mostly pass-through, so caps work is light there.
+
+2. **Probing without hard-coded fragility.** `/proc/device-tree/compatible` strings are not stable identifiers (Raspberry Pi has changed compatible across kernel versions). Caps files should match on multiple compatible strings + Vulkan device-name regex + V4L2 driver-name (`v4l2-ctl -d /dev/video0 -D`), majority-voting style.
+
+3. **Error-fallback paths.** Pass-through plugin dlopen failure → fall back to substrate-composed. Substrate kernel returns error → fall back to libavcodec stock NEON. Each fallback layer adds error-handling code and increases test surface.
+
+4. **Stateful vs stateless decoders.** Some vendors expose stateful V4L2 (Hantro H.264 on some chips); others expose stateless. The daedalus daemon's wire protocol is shaped around stateless. Pass-through plugins for stateful decoders need a state-machine adapter, not just an fd forward.
+
+5. **CI matrix explosion.** Per-SoC build × per-codec smoke × per-plugin presence. Need to decide which combinations are gated CI vs nightly.
+
+6. **The "libva picks the right node" problem.** Today libva-v4l2-request-fourier picks the first matching V4L2 node. On a host with both rkvdec **and** daedalus-v4l2 present (unlikely but possible — e.g. an RK3588 host with daedalus-v4l2 installed for testing), how does it pick? Probably: prefer vendor stateless over daedalus shim, configurable via env. This logic belongs in libva-v4l2-request-fourier, not the daemon.
+
+---
+
+## Why deferred (and the forcing function)
+
+**Today's calculus:**
+
+- Pi 5 daedalus path is the only thing in the fleet that uses daedalus daemon. Generalizing for a single user is overdesign.
+- RK3588 uses rkvdec directly through libva-v4l2-request-fourier; daedalus daemon is **not in the path** for any RK3588 codec. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that doesn't yet measure on Mali.
+- Pi 4 with rpivid is the only realistic second motivator. rpivid upstream stability is the gate — if it lands cleanly, Pi 4 takes the pass-through path with no kernel substitution needed. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
+- The recipe layer in daedalus-fourier already scales cleanly. Adding more substrates is incremental, not architectural.
+
+**The forcing function that flips this from "deferred" to "do it":**
+
+- Pi 4 enters daily use and rpivid is still not stable upstream — implies we need a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate.
+- **Or:** an x86 host enters the fleet running mesa-panvk on a Pi-CM5-like board, and we need the daedalus daemon to discover it dynamically rather than being baked at build time.
+- **Or:** a third-party Pi 5 user needs to swap shaders for V3D firmware experiments without rebuilding the daemon — at that point dynamic shader loading + caps overrides become a feature ask.
+
+Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC abstraction *up* to libva-v4l2-request-fourier (which already does most of it) rather than *down* into the daemon.
+
+---
+
+## Open questions
+
+1. **Where do caps files live?** `/usr/lib/daedalus/caps/` (package-provided) vs `/etc/daedalus/caps/` (admin override) vs both with merge precedence. Final call deferred.
+
+2. **Does the daemon even need plugins?** A simpler design: daemon does substrate-composed only; pass-through is handled by libva-v4l2-request-fourier preferring the vendor node when present. Removes the entire plugin layer and pushes the codec-routing decision to the consumer. Probably the right call — re-evaluate when designing.
+
+3. **Per-process vs per-system substrate choice.** Today libavcodec uses `daedalus_ctx_create_no_qpu()` (no Vulkan init in arbitrary host processes). If the daemon centralizes substrate decisions, the per-process compromise can be relaxed — but at the cost of more daemon ↔ libavcodec round-trips per kernel. Cost/benefit unclear without measurement.
+
+4. **AV1 on Mali compute.** RK3588 has no AV1 HW decoder. Mali Valhall has compute. Is `daedalus_recipe_dispatch_cdef_8x8` worth running on Mali instead of NEON? Unknown — needs a cycle 5–equivalent measurement campaign on RK3588 before any RK3588-specific caps entry can be authored.
+
+5. **What's the deliverable for the architecture revisit?** Probably a fresh repo (`daedalus-platform/` ?) that wraps daedalus-fourier + daedalus-v4l2 + caps files + plugins. Or fold everything into daedalus-v4l2 since the daemon already lives there. Final call deferred until the forcing function is concrete.
+
+---
+
+## Decision log
+
+| Date | Decision | Reason |
+|---|---|---|
+| 2026-05-23 | **Defer generalization.** Finish Pi 5 substitution arc (cycle 9 PR #90 pending), then pivot to bug-fix backlog (daemon SEGV #145, D-state #146) before architecture work. | Architecture pivot is a multi-week scope; Pi 5 path is the only user-visible motivator today; deferring loses nothing because the recipe layer already abstracts kernels and libva-v4l2-request-fourier already abstracts V4L2 nodes. |
+| 2026-05-23 | **Document the design now, even though it's deferred.** | Captures the conceptual gap (shaders ≠ hardware decoders) and the two-backend conclusion while the analysis is fresh; saves re-litigating in 3–6 months. |
+
+---
+
+## References
+
+- `include/daedalus.h` — current public API; the `daedalus_recipe_dispatch_*` family is the kernel-level substrate selector that scales to multi-SoC.
+- `docs/k1_phase7.md` through `docs/k9_h264qpel_mc20.md` — per-cycle Phase 7 / closure docs that record substrate verdicts. The same dance would be repeated per SoC.
+- `docs/phase8_status.md` — Phase 8 status (V4L2 daemon side, sibling daedalus-v4l2).
+- libva-v4l2-request-fourier — the consumer side; already abstracts over any V4L2 stateless decoder node. Most of the multi-SoC abstraction surface is already here.
+- daedalus-v4l2 repository — the kernel char-dev shim + userspace daemon. The natural home for an eventual generalized daemon, if/when the forcing function fires.