daedalus-fourier/daedalus_architecture_backlog.md

# Daedalus architecture backlog

**Status:** design draft, **not** scheduled. Captured 2026-05-23 after the cycle 9 close, while Pi 5 H.264 deployment is still settling on higgs. The pivot described here is **deferred until a second SoC creates a forcing function** — see "Why deferred" at the bottom.

This document is forward-looking. It describes the generalized multi-SoC daedalus daemon architecture, but the immediate work block stays "finish Pi 5". Re-read this when:

- HW decode on noether (Pi 4, the user's interactive workstation) becomes a real ask and rpivid upstream is still unstable. This is the most likely trigger: Pi 4 is the same SoC class as Pi 5 but has the weaker V3D 4.x, so it needs the caps-file mechanism plus an extra row's worth of substrate measurements.
- AV1 playback on boltzmann (RK3588) starts mattering. rkvdec doesn't cover AV1, so the daedalus path becomes the only HW-accelerated option, and Mali Valhall compute substrate decisions need their own caps row.
- libva-v4l2-request-fourier evolves to need multi-node negotiation (today it picks the first matching V4L2 node; a host with both rkvdec and daedalus-v4l2 nodes wants a preference policy).

Until then: this is decision context, not a TODO.

---

## What we have today (2026-05-23)

The current stack is **Pi 5 specific** by deliberate construction:

```
Firefox / mpv
  └─ libva-fourier (VAAPI)
       └─ libva-v4l2-request-fourier (V4L2 stateless consumer)
            └─ /dev/video0 (daedalus_v4l2 kernel char-dev shim)
                 └─ /dev/daedalus-v4l2 → userspace daemon (Option γ)
                      └─ dlopen libavcodec.so.62 (Kwiboo FFmpeg fork)
                           └─ daedalus-fourier kernels (NEON + V3D opportunistic)
                                ├─ cycle 1: VP9 IDCT 8x8       (V3D QPU)
                                ├─ cycle 2: VP9 LPF wd=4       (V3D QPU)
                                ├─ cycle 3: VP9 MC 8h          (CPU NEON)
                                ├─ cycle 4: VP9 LPF wd=8       (V3D QPU)
                                ├─ cycle 5: AV1 CDEF 8x8       (CPU NEON; QPU opportunistic helper)
                                ├─ cycle 6: H.264 IDCT 4x4     (CPU NEON)
                                ├─ cycle 7: H.264 IDCT 8x8     (CPU NEON)
                                ├─ cycle 8: H.264 luma-v deblk (CPU NEON; QPU opportunistic helper)
                                └─ cycle 9: H.264 luma qpel mc20 (CPU NEON)
```

Two things in this stack **already** look like the generalized architecture:

1. **`daedalus_recipe_dispatch_*` is already the runtime substrate selector.** These are public-API functions in `include/daedalus.h` (cycles 6–9 added the H.264 family on 2026-05-21 through 2026-05-23). Per-kernel substrate decisions live in `daedalus_recipe_substrate_for(daedalus_kernel k)`, which is currently a hard-coded switch; a data-driven version is a near-mechanical rewrite.

2. **libva-v4l2-request-fourier already abstracts over "any V4L2 stateless decoder node".** On RK3588 the same VAAPI driver consumes rkvdec directly with no daedalus daemon in the path; on Pi 5 it consumes the daedalus_v4l2 shim. The cross-SoC seam is **at the V4L2 device level**, which is the right place — it's how the upstream V4L2 stateless API was designed to work.

So the generalization needed is smaller than it looks. Most of the abstraction surface is already in place; what's missing is **substrate-table data per SoC** and a **second daemon backend** for codec-level pass-through to vendor decoders.
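
To make the "data-driven version" concrete, here is a minimal sketch of a table-backed substrate lookup. It is written in Python for brevity and is not the daemon's code: the kernel keys, the `Substrate` enum, and `substrate_for` are hypothetical names, and the CPU fallback for unknown kernels is an assumption. Only the per-kernel verdicts are taken from the cycle 1–9 list above.

```python
from enum import Enum


class Substrate(Enum):
    CPU_NEON = "cpu"
    V3D_QPU = "gpu"


# Hypothetical kernel keys; the verdicts copy the cycle 1-9 list above.
PI5_SUBSTRATES = {
    "vp9_idct8": Substrate.V3D_QPU,
    "vp9_lpf4": Substrate.V3D_QPU,
    "vp9_mc_8h": Substrate.CPU_NEON,
    "vp9_lpf8": Substrate.V3D_QPU,
    "av1_cdef8": Substrate.CPU_NEON,
    "h264_idct4": Substrate.CPU_NEON,
    "h264_idct8": Substrate.CPU_NEON,
    "h264_deblock_lv": Substrate.CPU_NEON,
    "h264_qpel_mc20": Substrate.CPU_NEON,
}


def substrate_for(kernel, table=PI5_SUBSTRATES):
    """Look up the substrate for a kernel in a per-SoC table."""
    # Assumption: kernels missing from the table run on CPU NEON.
    return table.get(kernel, Substrate.CPU_NEON)
```

Swapping in a different SoC then means passing a different table, which is the property the caps files later in this document rely on.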

---

## Problem statement

The mfritsche fleet has heterogeneous aarch64 hardware decoders:

| SoC | Host(s) | H.264 | HEVC | VP9 | AV1 | GPU compute |
|---|---|---|---|---|---|---|
| BCM2712 (Pi 5) | higgs, hertz, broglie, tesla (LXD on hertz) | none | V3D7 (rpi-hevc-dec — SPS quirks) | none | none | V3D7 (Vulkan compute, queryable) |
| BCM2711 (Pi 4) | noether (interactive workstation), dcw3, dcw2 | rpivid (out of tree, unstable) | rpivid (out of tree, unstable) | none | none | V3D4 (Vulkan compute, weaker) |
| RK3588 | boltzmann (32 GB, kernel-dev / MCP hub, 8 W always-on) | rkvdec V4L2 stateless (upstream) | rkvdec V4L2 stateless | rkvdec V4L2 stateless | none (rkvdec lacks AV1) | Mali Valhall (panvk-bifrost-video in dev) + RK NPU |
| Allwinner H6 | (not in current fleet, but Cedrus exists upstream) | Cedrus V4L2 | Cedrus V4L2 | none | none | Mali Bifrost |

No single SoC has a complete codec set. RK3588 lacks AV1; Pi 5 lacks H.264 + VP9 + AV1; Pi 4 has rpivid (out-of-tree, kernel-version-fragile); Allwinner Cedrus is H.264/HEVC only.

A note on the Pi 5 row: hertz and tesla share hardware (tesla is an LXD container hosted on hertz) but are operationally distinct — tesla is the distcc/MCP worker, hertz is the LXD host with all the cron automations and the 17-tool lmcp hub. From a daedalus deployment perspective they count as **one** Pi 5 substrate; from a workflow perspective they're separate boxes.

A note on noether: it's the user's interactive workstation (Pi 4, BCM2711). Firefox + mpv run here. Any "I want HW decode on my main box" pressure lands first on this host, which puts Pi 4 (V3D4 + maybe-rpivid) closer to the front of the queue than the original draft of this document suggested.

The current daedalus model — "kernel substitution + libavcodec front end" — is the right answer for **Pi 5 specifically**, where no usable kernel V4L2 stateless decoder exists for the codecs we care about, and a Vulkan-capable GPU (V3D7) is available to help on a few kernels.

The model is **not** the right answer for SoCs that already have working V4L2 stateless decoders for the requested codec — those should be passed through, not re-implemented through libavcodec + kernel substitution.

---

## The conceptual gap

A naïve "shaders per SoC" generalization runs into the fact that **hardware decoders are not made of shaders**. rkvdec on RK3588, Hantro G1/G2 on Allwinner, VPU8 on Amlogic, even the rpi-hevc-dec block on Pi 5 — these are **bitstream-in, NV12-out** monoliths that do not expose intermediate kernel slots. You cannot route "their IDCT" through one substrate and "their MC" through another; they are opaque pipelines.

This forces a **two-backend daemon**:

- **Substrate-composed backend.** What we have today. Used when no hardware decoder for the requested codec exists on this SoC. Front end is libavcodec (entropy decode, slice headers); kernel hot paths run through `daedalus_recipe_dispatch_*` with substrate chosen per (SoC × kernel).

- **Pass-through backend.** Used when a hardware decoder for the requested codec exists. The daemon (or, more realistically, the kernel V4L2 shim itself) forwards the bitstream to the vendor V4L2 stateless node and returns the decoded frame. No kernel substitution takes place. This is effectively a no-op from the daemon's perspective; in fact, **libva-v4l2-request-fourier can already talk to the vendor node directly** without going through the daedalus daemon at all.

The routing decision is **per (SoC × codec)**:

| | Pi 5 | Pi 4 | RK3588 | Allwinner H6 |
|---|---|---|---|---|
| H.264 | substrate-composed (NEON+QPU) | substrate-composed (NEON only — V3D4 too weak) **or** rpivid pass-through if stable | rkvdec pass-through | Cedrus pass-through |
| HEVC | rpi-hevc-dec pass-through (when SPS quirks fixed) **or** substrate-composed | rpivid pass-through | rkvdec pass-through | Cedrus pass-through |
| VP9 | substrate-composed | substrate-composed | rkvdec pass-through | substrate-composed |
| AV1 | substrate-composed | substrate-composed (slow) | substrate-composed | substrate-composed |

Note: on RK3588, for every codec rkvdec supports, the **daedalus daemon is bypassed entirely** and libva talks to rkvdec directly. The daemon is only ever in the path on SoCs where at least one codec needs substrate-composition.
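
The (SoC × codec) table can be read as an ordered preference list per cell. The sketch below encodes the Pi 5, Pi 4, and RK3588 columns that way, in Python for illustration only. The `ROUTES` layout, the short SoC keys, and the `route` helper are assumptions; `usable_decoders` stands in for whatever the probe decides is present and stable on the host.

```python
SUBSTRATE = ("substrate-composed", None)

# Each cell lists candidates in preference order, as in the table above.
ROUTES = {
    ("pi5", "h264"): [SUBSTRATE],
    ("pi5", "hevc"): [("passthrough", "rpi-hevc-dec"), SUBSTRATE],
    ("pi5", "vp9"): [SUBSTRATE],
    ("pi5", "av1"): [SUBSTRATE],
    ("pi4", "h264"): [("passthrough", "rpivid"), SUBSTRATE],
    ("pi4", "hevc"): [("passthrough", "rpivid")],
    ("pi4", "vp9"): [SUBSTRATE],
    ("pi4", "av1"): [SUBSTRATE],
    ("rk3588", "h264"): [("passthrough", "rkvdec")],
    ("rk3588", "hevc"): [("passthrough", "rkvdec")],
    ("rk3588", "vp9"): [("passthrough", "rkvdec")],
    ("rk3588", "av1"): [SUBSTRATE],
}


def route(soc, codec, usable_decoders=frozenset()):
    """Return (backend, decoder) for the first viable candidate, or None."""
    # Assumption: an unknown (SoC, codec) pair falls back to substrate-composed.
    for backend, decoder in ROUTES.get((soc, codec), [SUBSTRATE]):
        if backend == "substrate-composed" or decoder in usable_decoders:
            return backend, decoder
    return None  # no viable route: the caller decides what to do
```

For example, `route("pi5", "hevc")` yields the substrate-composed backend until `"rpi-hevc-dec"` appears in `usable_decoders`, which mirrors the "when SPS quirks fixed" condition in the table.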

---

## Refined architecture sketch

If/when we do this:

```
/usr/lib/daedalus/
├── shaders/                      # SPIR-V binaries, one set for all Vulkan-
│                                 # capable SoCs (V3D7, V3D4, Mali Valhall,
│                                 # Mali Bifrost, Adreno). SPIR-V is portable
│                                 # by design — the per-SoC fragmentation is
│                                 # *which kernels are worth running on GPU*,
│                                 # not the binaries themselves.
│
├── caps/                         # per-SoC substrate selection tables
│   ├── bcm2712.toml              # Pi 5 (V3D7, no H.264 HW)
│   ├── bcm2711.toml              # Pi 4 (V3D4, rpivid optional)
│   ├── rk3588.toml               # RK3588 (rkvdec covers most codecs;
│   │                             # substrate-composed only for AV1)
│   ├── allwinner-h6.toml         # Cedrus
│   └── default.toml              # unknown SoC: CPU NEON only,
│                                 # libavcodec front-end + kernel pack
│
└── plugins/                      # ONLY for pass-through to vendor decoders
    ├── rkvdec_passthrough.so     # forward bitstream to /dev/video-rkvdec
    ├── cedrus_passthrough.so
    └── rpivid_passthrough.so     # if we ever stabilize rpivid
```

Daemon startup probe:

1. Read `/proc/device-tree/compatible` (or `/sys/firmware/devicetree/.../compatible`); fall back to DMI on x86 (won't apply in practice — fleet is aarch64-only).
2. Match against caps files; load the matching `<soc>.toml`.
3. Enumerate `/dev/video*` and `/dev/media*`; classify each as {daedalus-shim, vendor-stateless, vendor-stateful, unknown}.
4. For each codec the caps file declares as "pass-through-preferred": load the matching `plugins/<vendor>_passthrough.so`. On dlopen failure, fall back to substrate-composed.
5. Build per-codec routing table; advertise the union through V4L2 to libva.
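
Steps 1 and 2 can be sketched as follows. The device-tree `compatible` property is a sequence of NUL-terminated strings, ordered from most to least specific, so the parser splits on NUL bytes. This is a Python illustration under stated assumptions, not the daemon's implementation: `CAPS_INDEX`, `select_caps`, and `read_board_compatible` are hypothetical names, and the index contents are copied from the illustrative caps files in this document.

```python
from pathlib import Path

# Hypothetical index: caps file name -> compatible strings it claims.
CAPS_INDEX = {
    "bcm2712.toml": ["raspberrypi,5-model-b", "brcm,bcm2712"],
    "rk3588.toml": ["rockchip,rk3588", "rockchip,rk3588s"],
}


def parse_compatible(raw: bytes) -> list[str]:
    """Split the NUL-separated device-tree compatible property."""
    return [s.decode("ascii", "replace") for s in raw.split(b"\0") if s]


def select_caps(compatible, index=CAPS_INDEX, default="default.toml"):
    """Return the caps file for the first (most specific) matching string."""
    for entry in compatible:
        for caps_file, known in index.items():
            if entry in known:
                return caps_file
    return default


def read_board_compatible(path="/proc/device-tree/compatible"):
    """Read the board's compatible list; empty if the node is absent."""
    try:
        return parse_compatible(Path(path).read_bytes())
    except OSError:
        return []
```

On a host without a device tree, `read_board_compatible()` returns an empty list and `select_caps` falls through to `default.toml`, which matches the "unknown SoC" row in the layout above.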

**Caps file shape** (illustrative — final TOML keys TBD):

```toml
# bcm2712.toml: Pi 5, V3D7 GPU compute available; no H.264/VP9/AV1 HW decoders
compatible = ["raspberrypi,5-model-b", "brcm,bcm2712"]

[gpu]
substrate = "v3d-vulkan"
device_match = "V3D 7"   # Vulkan VkPhysicalDeviceProperties.deviceName regex

[codecs.h264]
backend = "substrate-composed"
[codecs.h264.kernels]
idct4     = "cpu"
idct8     = "cpu"
deblock_lv = "cpu"  # opportunistic = "gpu" — see cycle 8 docs
qpel_mc20 = "cpu"

[codecs.vp9]
backend = "substrate-composed"
[codecs.vp9.kernels]
idct8 = "gpu"
lpf4  = "gpu"
mc_8h = "cpu"
lpf8  = "gpu"

[codecs.av1]
backend = "substrate-composed"
[codecs.av1.kernels]
cdef = "cpu"  # opportunistic = "gpu"
```

```toml
# rk3588.toml — rkvdec covers H.264/HEVC/VP9; AV1 falls to substrate-composed
compatible = ["rockchip,rk3588", "rockchip,rk3588s"]

[gpu]
substrate = "mali-valhall"
device_match = "Mali-G610"

[codecs.h264]
backend = "passthrough"
plugin  = "rkvdec_passthrough.so"
v4l2_node_match = "rkvdec"

[codecs.hevc]
backend = "passthrough"
plugin  = "rkvdec_passthrough.so"

[codecs.vp9]
backend = "passthrough"
plugin  = "rkvdec_passthrough.so"

[codecs.av1]
backend = "substrate-composed"
[codecs.av1.kernels]
cdef = "cpu"   # Mali Valhall opportunistic = TBD
```

Pass-through plugins are *thin*: they translate the daedalus daemon's wire protocol to the vendor's V4L2 stateless ioctls. The wire protocol often already matches those ioctls, so the plugin is mostly an fd forward and a buffer copy. The substrate-composed backend stays as it is today.

---

## Where it gets hard

1. **Caps-file authorship.** Each new SoC needs measurement-driven entries (M3 thresholds, R-band verdicts) — that's the entire daedalus-fourier cycle 1–9 dance, done per SoC. Pi 5 took ~3 weeks. Pi 4 V3D4 is probably 1–2 weeks (same kernels, weaker GPU; mostly verifying CPU verdicts hold). RK3588 is mostly pass-through, so caps work is light there.

2. **Probing without hard-coded fragility.** `/proc/device-tree/compatible` strings are not stable identifiers (Raspberry Pi has changed its compatible strings across kernel versions). Caps files should match on multiple compatible strings, the Vulkan device-name regex, and the V4L2 driver name (`v4l2-ctl -d /dev/video0 -D`), majority-voting style.

3. **Error-fallback paths.** Pass-through plugin dlopen failure → fall back to substrate-composed. Substrate kernel returns error → fall back to libavcodec stock NEON. Each fallback layer adds error-handling code and increases test surface.

4. **Stateful vs stateless decoders.** Some vendors expose stateful V4L2 (Hantro H.264 on some chips); others expose stateless. The daedalus daemon's wire protocol is shaped around stateless. Pass-through plugins for stateful decoders need a state-machine adapter, not just an fd forward.

5. **CI matrix explosion.** Per-SoC build × per-codec smoke × per-plugin presence. Need to decide which combinations are gated CI vs nightly.

6. **The "libva picks the right node" problem.** Today libva-v4l2-request-fourier picks the first matching V4L2 node. On a host with both rkvdec **and** daedalus-v4l2 present (unlikely but possible — e.g. an RK3588 host with daedalus-v4l2 installed for testing), how does it pick? Probably: prefer vendor stateless over daedalus shim, configurable via env. This logic belongs in libva-v4l2-request-fourier, not the daemon.
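
The majority-voting idea from item 2 can be sketched as three independent signals per caps candidate, with a quorum of two. This is a Python illustration only. The `v4l2_driver` key and its values are hypothetical (the illustrative caps files do not define one), while the `device_match` patterns are copied from them.

```python
import re

# Hypothetical candidate records; v4l2_driver is not a key the caps files define yet.
CANDIDATES = {
    "bcm2712.toml": {
        "compatible": {"raspberrypi,5-model-b", "brcm,bcm2712"},
        "device_match": r"V3D 7",
        "v4l2_driver": "daedalus_v4l2",
    },
    "rk3588.toml": {
        "compatible": {"rockchip,rk3588", "rockchip,rk3588s"},
        "device_match": r"Mali-G610",
        "v4l2_driver": "rkvdec",
    },
}


def votes(caps, compatible, gpu_name, v4l2_drivers):
    """Count how many of the three signals agree with this caps candidate."""
    return sum([
        bool(caps["compatible"] & set(compatible)),
        re.search(caps["device_match"], gpu_name or "") is not None,
        caps["v4l2_driver"] in v4l2_drivers,
    ])


def identify(compatible, gpu_name, v4l2_drivers,
             candidates=CANDIDATES, quorum=2):
    """Return the best-scoring caps file, or None if no quorum is reached."""
    scored = {
        name: votes(caps, compatible, gpu_name, v4l2_drivers)
        for name, caps in candidates.items()
    }
    best = max(scored, key=scored.get)
    return best if scored[best] >= quorum else None
```

With this shape, a renamed compatible string alone does not break detection as long as the GPU name and V4L2 driver still agree. A `None` result would map to `default.toml`.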

---

## Why deferred (and the forcing function)

**Today's calculus:**

- Pi 5 (higgs + hertz + broglie + tesla) is **four hosts**, but **one SoC**. Adding a fifth Pi 5 host wouldn't pressure-test the architecture; they all share BCM2712 caps, so the substrate decisions are identical across the row.
- boltzmann (RK3588) is the only non-Pi-5 always-on host in the fleet, and it uses rkvdec directly through libva-v4l2-request-fourier, so the daedalus daemon is **not in the path** for any RK3588 codec on it. The "RK3588 support" the architecture above proposes is mostly a no-op routing decision plus an AV1 fallback that has not yet been measured on Mali. No forcing pressure from boltzmann today.
- noether (Pi 4, the user's interactive workstation) and dcw3/dcw2 (also Pi 4) are the real second-SoC candidates. The gate is rpivid upstream stability: if it lands cleanly, Pi 4 takes the pass-through path with zero kernel substitution work. If it stays out-of-tree-fragile, **then** the substrate-composed path with V3D4 + NEON becomes the right backend for Pi 4, and we need the per-SoC caps mechanism to handle V3D4's weaker compute.
- The recipe layer in daedalus-fourier already scales cleanly. Adding more substrates is incremental, not architectural.

**The forcing function that flips this from "deferred" to "do it":**

- **noether-as-Firefox-host** — the user starts wanting HW decode on their main workstation and rpivid is still not stable upstream. Implies a Pi 4 substrate-composed path, which means at minimum a second caps file and the loader for it. At that point, building the full pluggable scaffold becomes proportionate. This is the most likely trigger; noether is already a daily-driver Pi 4.
- **boltzmann-as-AV1-decoder** — RK3588 has no AV1 HW decoder, and the user wants AV1 playback there (currently CPU-only). Triggers a cycle-5–equivalent measurement campaign on Mali Valhall to see whether `daedalus_recipe_dispatch_cdef_8x8` (or follow-on AV1 kernels) is worth running on Mali compute. If yes, we need an RK3588 caps file that overrides only the AV1 row while leaving H.264/HEVC/VP9 on rkvdec pass-through.
- **Or:** a third-party Pi 5 user needs to swap shaders for V3D firmware experiments without rebuilding the daemon — at that point dynamic shader loading + caps overrides become a feature ask.

Until one of those happens: keep daedalus daemon Pi 5 specific. Push cross-SoC abstraction *up* to libva-v4l2-request-fourier (which already does most of it) rather than *down* into the daemon.

---

## Open questions

1. **Where do caps files live?** `/usr/lib/daedalus/caps/` (package-provided) vs `/etc/daedalus/caps/` (admin override) vs both with merge precedence. Final call deferred.

2. **Does the daemon even need plugins?** A simpler design: daemon does substrate-composed only; pass-through is handled by libva-v4l2-request-fourier preferring the vendor node when present. Removes the entire plugin layer and pushes the codec-routing decision to the consumer. Probably the right call — re-evaluate when designing.

3. **Per-process vs per-system substrate choice.** Today libavcodec uses `daedalus_ctx_create_no_qpu()` (no Vulkan init in arbitrary host processes). If the daemon centralizes substrate decisions, the per-process compromise can be relaxed — but at the cost of more daemon ↔ libavcodec round-trips per kernel. Cost/benefit unclear without measurement.

4. **AV1 on Mali compute.** RK3588 has no AV1 HW decoder. Mali Valhall has compute. Is `daedalus_recipe_dispatch_cdef_8x8` worth running on Mali instead of NEON? Unknown — needs a cycle 5–equivalent measurement campaign on RK3588 before any RK3588-specific caps entry can be authored.

5. **What's the deliverable for the architecture revisit?** Probably a fresh repo (`daedalus-platform/` ?) that wraps daedalus-fourier + daedalus-v4l2 + caps files + plugins. Or fold everything into daedalus-v4l2 since the daemon already lives there. Final call deferred until the forcing function is concrete.
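
Question 2's simpler design depends on the consumer preferring a vendor node when one is present. A minimal sketch of that preference policy follows, in Python for illustration (libva-v4l2-request-fourier is not written in Python). The node kinds reuse the classification from the startup probe; the `DAEDALUS_PREFER_NODE` environment variable and `pick_node` are hypothetical names.

```python
import os

# Default policy from "Where it gets hard": vendor stateless before the shim.
DEFAULT_ORDER = ("vendor-stateless", "daedalus-shim")


def pick_node(nodes, env=None):
    """Pick a V4L2 node path from (path, kind) pairs, or None if none fits.

    Hypothetical override: DAEDALUS_PREFER_NODE=daedalus-shim flips the order.
    """
    env = os.environ if env is None else env
    order = DEFAULT_ORDER
    if env.get("DAEDALUS_PREFER_NODE") == "daedalus-shim":
        order = tuple(reversed(DEFAULT_ORDER))
    for kind in order:
        for path, node_kind in nodes:
            if node_kind == kind:
                return path
    return None  # stateful and unknown nodes are never selected here
```

Keeping the policy this small is part of the argument for question 2: if the consumer can make the choice in a few lines, the daemon may not need a plugin layer at all.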

---

## Decision log

| Date | Decision | Reason |
|---|---|---|
| 2026-05-23 | **Defer generalization.** Finish Pi 5 substitution arc (cycle 9 PR #90 pending), then pivot to bug-fix backlog (daemon SEGV #145, D-state #146) before architecture work. | Architecture pivot is a multi-week scope; Pi 5 path is the only user-visible motivator today; deferring loses nothing because the recipe layer already abstracts kernels and libva-v4l2-request-fourier already abstracts V4L2 nodes. |
| 2026-05-23 | **Document the design now, even though it's deferred.** | Captures the conceptual gap (shaders ≠ hardware decoders) and the two-backend conclusion while the analysis is fresh; saves re-litigating in 3–6 months. |
| 2026-05-23 | **Correct fleet hardware mapping.** Original draft had hertz/tesla under RK3588 and omitted boltzmann + noether entirely. Verified via `/proc/device-tree/compatible`: hertz + tesla are Pi 5 (BCM2712), noether is Pi 4 (BCM2711), boltzmann is the only RK3588 in the fleet. Adjusted "Why deferred" / forcing-function reasoning accordingly — Pi 5 row is now 4 hosts (one SoC), noether is the realistic Pi 4 trigger, boltzmann is the realistic RK3588 trigger via AV1. | Original draft was speculative on host-to-SoC mapping; verified state changes which forcing functions are credible. |

---

## References

- `include/daedalus.h` — current public API; the `daedalus_recipe_dispatch_*` family is the kernel-level substrate selector that scales to multi-SoC.
- `docs/k1_phase7.md` through `docs/k9_h264qpel_mc20.md` — per-cycle Phase 7 / closure docs that record substrate verdicts. Same dance would be repeated per SoC.
- `docs/phase8_status.md` — Phase 8 status (V4L2 daemon side, sibling daedalus-v4l2).
- libva-v4l2-request-fourier — the consumer side; already abstracts over any V4L2 stateless decoder node. Most of the multi-SoC abstraction surface is already here.
- daedalus-v4l2 repository — the kernel char-dev shim + userspace daemon. The natural home for an eventual generalized daemon, if/when the forcing function fires.