daedalus-v4l2/docs/architecture.md

# daedalus-v4l2 — architecture

## Components

### 1. Kernel module (`kernel/`)

A Linux V4L2 stateless decoder driver. Registers as
`/dev/videoNN` with `VFL_TYPE_VIDEO` and supports VP9, AV1,
and H.264 stateless decoder controls (matching the existing
V4L2 stateless uAPI used by libva-v4l2-request-fourier).

Internally it does NOT decode bitstream. It:
1. Accepts V4L2 ioctls (VIDIOC_S_FMT, VIDIOC_S_CTRL with
   STATELESS controls, VIDIOC_QBUF, etc.)
2. Marshals the bitstream and per-frame/per-slice control
   structs onto a chardev (or netlink) channel.
3. Pulls decoded frames back from userspace daemon.
4. Returns them via VIDIOC_DQBUF.

Out-of-tree kernel module (built with `make` against the
running kernel's headers; loaded via `insmod`).

### 2. Userspace daemon (`daemon/`)

A long-running daemon that:
1. Connects to the kernel module's chardev.
2. Pulls bitstream + control blobs.
3. Drives FFmpeg parsers via `dlopen` to get per-block
   metadata (block positions, MVs, coefficients, tc0
   arrays, etc.).
4. Calls `daedalus_dispatch_*` from `libdaedalus_core.a`
   (sibling repo) to do the actual per-block work on
   NEON / V3D.
5. Posts decoded frames back to the kernel module via
   the chardev.

Architecture: single-threaded event loop initially; per-stream
worker threads later if needed.

### 3. Bitstream parser layer (Option γ — runtime dlopen)

Instead of vendoring FFmpeg's parsers, the daemon loads the
system FFmpeg at runtime. Two integration patterns:

- **a. AVCodec/AVPacket through libavcodec**: feed packets to
  `avcodec_send_packet`, intercept the parse-only stage and
  pull out block metadata before the actual decode runs.
- **b. Custom parser via libavcodec internal APIs**: messier
  but avoids running FFmpeg's full decode path.

Plan: try (a) first. If FFmpeg's internal API doesn't expose
the per-block info we need, fall back to (b) or vendor a
minimal parser per codec.

## Communication: kernel ↔ daemon

Initial plan: single chardev `/dev/daedalus-v4l2` with a
simple request/response protocol:

```
REQ_DECODE { stream_id, frame_idx, codec, controls[], bitstream_blob }
RESP_FRAME { stream_id, frame_idx, dma_buf_fd, w, h, format }
```

Alternative: netlink socket (more standard for kernel-userspace
IPC, but more boilerplate). Chardev is simpler for v1.

## Memory: DRM PRIME / dmabuf

For browser zero-copy, the kernel module needs to register the
decoded frame buffers as dmabuf handles that V4L2 hands out via
PRIME export. The daedalus-fourier kernel library writes pixels
to CPU-mapped memory; the kernel module manages the
DMA-coherent allocation and PRIME export.

Two strategies:
- **Strategy A**: kernel module allocates dmabuf, mmaps it into
  daemon via the chardev, daemon writes pixels there.
- **Strategy B**: daemon allocates via libdrm, transfers
  dmabuf-fd to kernel via chardev, kernel exposes via PRIME.

Strategy A is simpler; B is more flexible. Start with A.

## Build & deploy

- Kernel module: `cd kernel && make` (out-of-tree against running
  kernel headers; `/lib/modules/$(uname -r)/build` path).
- Daemon: CMake, depends on installed `libdaedalus_core.a` from
  sibling repo. Run as systemd service or under direct user
  invocation.

## What's NOT in this repo

- The cycles 1-9 video kernels — those live in sibling
  daedalus-fourier and are consumed via `include/daedalus.h`.
- The browser side (firefox-fourier / chromium-fourier) — those
  are their own sibling projects.
- libva-v4l2-request-fourier — sibling, talks to our
  `/dev/videoNN` via V4L2 ioctls.

## Sub-phases (roadmap excerpt; see docs/roadmap.md for the full plan)

1. **8.1**: kernel module skeleton — register /dev/videoNN with
   stub ioctls, no decoding.
2. **8.2**: chardev bridge — kernel ↔ daemon round-trip with
   dummy bitstream/dummy frame data.
3. **8.3**: daemon FFmpeg dlopen + parse path — pull per-frame
   info from FFmpeg without decoding.
4. **8.4**: dispatch one codec end-to-end via daedalus-fourier
   (VP9 first since it has the most QPU-deployed kernels).
5. **8.5**: dmabuf integration — first browser zero-copy frame.
6. **8.6**: AV1 + H.264 added.
7. **8.7**: performance tuning; 30fps@1080p target.

Per-phase effort: each is roughly a week of focused work.