marfrit 760f6a4060 Phase 8 skeleton: public C API + first end-to-end smoke test
include/daedalus.h: stable C API surface exposing the 5 cycles
(VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel
recipe-dispatch helpers default to the cycle 1-5 verdict
substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit
override available for benchmarking and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels
wrapped behind the public API. QPU path stubbed out (returns -1)
since wiring v3d_runner into daedalus_ctx is the next Phase 8
sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe
dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the
API surface compiles, library links, dispatch routing works, and
NEON fallback delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2,
B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly
out-of-scope work tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so
has_qpu=1 and QPU dispatch goes through the API too. After that:
V4L2 ioctl glue, bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 13:54:43 +00:00


---
phase: 8
status: scoping (architecture options + tractable-first-step picked)
date_opened: 2026-05-18
prereqs: cycles 1-5 closed (IDCT, LPF wd=4, MC, LPF wd=8, CDEF)
consumer_target: libva-v4l2-request-fourier → firefox/chromium-fourier
---
# Phase 8 — V4L2 deployment scoping
## What Phase 8 is
The "deliver the work" phase. Cycles 1-5 produced 5 individually
measured per-block kernels (3 deployed on QPU, 2 on CPU per the
deployment recipe). Phase 8 turns those kernels into decoded
video on the user's display.
Per `project_consumer_target.md`, the integration target is
**libva-v4l2-request-fourier**: a V4L2 stateless decoder node
exposing a VP9 (later AV1) contract, bridged via VA-API to
browser-fourier builds. Same plumbing mfritsche already runs for
HEVC/RK3588, different decoder backend.
## Architecture stack
```
+-------------------------------------------------------+
| firefox-fourier / chromium-fourier (already builds) |
+-------------------------------------------------------+
| VA-API |
+-------------------------------------------------------+
| libva-v4l2-request-fourier (already runs for HEVC) |
+-------------------------------------------------------+
| V4L2 stateless ioctl interface (kernel uAPI) |
+-------------------------------------------------------+
| daedalus-fourier V4L2 shim (NEW — Phase 8 work) |
| ↳ Parses bitstream control structs (V4L2_CID_STATELESS_VP9_*)
| ↳ Drives per-superblock decode loop
| ↳ Dispatches per-kernel to CPU NEON or V3D QPU (recipe)
+-------------------------------------------------------+
| daedalus-fourier core library (NEW Phase 8 — wraps |
| ↳ kernels from cycles 1-5) |
+-------------------------------------------------------+
| V3D 7.1 Mesa userspace + ARM NEON |
+-------------------------------------------------------+
```
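The "per-superblock decode loop" in the shim layer can be sketched roughly as follows. VP9 superblocks cover 64x64 luma pixels, so a frame needs ceil(width/64) x ceil(height/64) of them. The names `sb_count`, `sb_for_each`, `sb_visit_fn`, and `sb_count_visits` are hypothetical and not taken from any daedalus header, and the sketch ignores VP9 tiles, which a real decoder would have to respect.

```c
#include <assert.h>
#include <stddef.h>

/* VP9 superblocks are 64x64 luma pixels. */
#define SB_SIZE 64u

/* Hypothetical callback: handle one superblock at (sb_col, sb_row). */
typedef void (*sb_visit_fn)(unsigned sb_col, unsigned sb_row, void *user);

/* Superblocks needed to cover `pixels` along one axis (rounds up). */
static unsigned sb_span(unsigned pixels)
{
    return (pixels + SB_SIZE - 1u) / SB_SIZE;
}

/* Total superblocks in a frame; edge superblocks may be partly outside it. */
unsigned sb_count(unsigned width, unsigned height)
{
    return sb_span(width) * sb_span(height);
}

/* Walk the frame in raster order, calling `visit` once per superblock.
 * Returns the number of superblocks visited. */
unsigned sb_for_each(unsigned width, unsigned height,
                     sb_visit_fn visit, void *user)
{
    unsigned visited = 0;
    assert(visit != NULL);
    for (unsigned row = 0; row < sb_span(height); ++row) {
        for (unsigned col = 0; col < sb_span(width); ++col) {
            visit(col, row, user);
            ++visited;
        }
    }
    return visited;
}

/* Example visitor: counts calls. A real visitor would run the
 * per-kernel dispatch (IDCT, MC, LPF, ...) for this superblock. */
void sb_count_visits(unsigned sb_col, unsigned sb_row, void *user)
{
    (void)sb_col;
    (void)sb_row;
    ++*(unsigned *)user;
}
```

For a 1920x1080 frame this gives 30 columns by 17 rows; the last row of superblocks extends past the bottom edge of the picture.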
## Three architecture options
### Option A — Userspace V4L2 emulation (recommended for v1)
Expose a `videodev2`-compatible loopback device (via
`v4l2loopback` or a custom UIO-style approach) as `/dev/videoNN`
with the VP9 stateless contract, with the decoder itself running
in userspace. libva-v4l2-request-fourier talks to this device the
same way it talks to any other V4L2 node.
**Pros**: the decoder stays entirely in userspace; no new kernel
module to write (`v4l2loopback` is itself an existing out-of-tree
module); quick iteration; isolation from the kernel crash domain.
The daedalus-fourier daemon runs as a regular Linux process, taking
V4L2 ioctls (via the loopback shim) and emitting decoded frames.
**Cons**: v4l2loopback is loosely maintained; userspace V4L2 has
some semantic quirks (DRM/PRIME buffer sharing is harder than in
a real kernel driver).
### Option B — Tiny kernel V4L2 shim
A small kernel module that registers as a V4L2 device, takes the
ioctls, and forwards bitstream blobs + control structs to a
userspace daemon (the actual decoder) over a UNIX socket or a
character device. The daemon decodes and posts frames back.
**Pros**: a real `/dev/videoNN` with proper VFL_TYPE_VIDEO
semantics. DRM PRIME buffer sharing works correctly.
**Cons**: kernel module work. Cross-process buffer marshaling
adds latency. Out-of-tree maintenance burden.
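To make the "forwards bitstream blobs + control structs" step concrete, here is a minimal sketch of a length-prefixed message header that such a channel could use. The wire layout, the magic value, and every name (`fwd_header`, `fwd_header_encode`, `fwd_header_decode`) are assumptions made for illustration; nothing in this document fixes a format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical framing: 12-byte header, then ctrl_len bytes of control
 * structs, then bitstream_len bytes of compressed frame data. */
#define FWD_MAGIC       0x31534644u  /* arbitrary illustrative tag */
#define FWD_HEADER_SIZE 12u

typedef struct {
    uint32_t ctrl_len;       /* bytes of control structs that follow */
    uint32_t bitstream_len;  /* bytes of bitstream after the controls */
} fwd_header;

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xffu);
    p[1] = (uint8_t)((v >> 8) & 0xffu);
    p[2] = (uint8_t)((v >> 16) & 0xffu);
    p[3] = (uint8_t)((v >> 24) & 0xffu);
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Serialize the header into a fixed-size little-endian byte buffer. */
void fwd_header_encode(const fwd_header *h, uint8_t out[FWD_HEADER_SIZE])
{
    assert(h != NULL && out != NULL);
    put_le32(out, FWD_MAGIC);
    put_le32(out + 4, h->ctrl_len);
    put_le32(out + 8, h->bitstream_len);
}

/* Parse a header; returns 0 on success, -1 if the magic does not match. */
int fwd_header_decode(const uint8_t in[FWD_HEADER_SIZE], fwd_header *h)
{
    assert(in != NULL && h != NULL);
    if (get_le32(in) != FWD_MAGIC)
        return -1;
    h->ctrl_len = get_le32(in + 4);
    h->bitstream_len = get_le32(in + 8);
    return 0;
}
```

Every payload byte framed this way is copied across the process boundary, which is where the marshaling latency listed under Cons comes from.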
### Option C — Direct libva integration (not recommended)
Skip V4L2 entirely; implement a libva backend module directly.
**Pros**: avoids the V4L2 wrapper layer entirely.
**Cons**: contradicts `project_consumer_target.md` (the decision
to use the V4L2 path is locked in). The maintenance burden of a
libva backend is roughly equivalent to that of a V4L2 shim, with
no portability gain.
**Pick A** for v1; revisit if userspace V4L2 semantics block
DRM PRIME / dmabuf for browser zero-copy.
## What's tractable this session
Phase 8 in full is **days of work**: V4L2 ioctl glue, bitstream
parser, superblock loop, frame buffer management, dmabuf handling,
and an end-to-end test against a real VP9 clip. That is out of
scope for a single session.
What IS tractable now:
1. **Public C API header** (`include/daedalus.h`): declare the
   library's stable function surface for the 5 kernels +
   substrate selection + init/teardown. The future Phase 8 V4L2
   shim consumes this header. This:
   - Locks the API shape so V4L2 work doesn't need to plumb
     through internal types.
   - Documents which kernels deploy where (recipe encoded in API).
   - Forces a clean separation between "kernel work" (cycles 1-5)
     and "decoder pipeline" (Phase 8).
2. **A minimal core library** (`src/daedalus_core.{h,c}`): a
   skeleton that compiles and has the right typedefs and dispatch
   tables, but the body of each function is `assert(0 && "TODO")`.
   Builds against the existing kernel implementations.
3. **One integration test** (`tests/test_idct_through_api.c`):
   exercise the public API for ONE kernel end-to-end. Proves the
   API can connect to the existing benches.
This commit gives the integration target something concrete to
hook into without prejudging V4L2 architecture (A/B/C).
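As a rough illustration of "recipe encoded in API", the substrate selection could look something like the sketch below. Only `daedalus_ctx`, `has_qpu`, and the kernel-to-substrate assignment (QPU for cycles 1, 2, 4; CPU for cycles 3, 5) come from the project notes; the enum names, the `DAEDALUS_SUBSTRATE_AUTO` value, and `daedalus_resolve_substrate` are hypothetical, and the real header may be shaped differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical substrate selector; AUTO means "use the recipe default". */
typedef enum {
    DAEDALUS_SUBSTRATE_AUTO = 0,
    DAEDALUS_SUBSTRATE_CPU,
    DAEDALUS_SUBSTRATE_QPU
} daedalus_substrate_t;

/* One entry per cycle 1-5 kernel (identifier names are illustrative). */
typedef enum {
    DAEDALUS_KERNEL_VP9_IDCT8X8 = 0,  /* cycle 1 */
    DAEDALUS_KERNEL_VP9_LPF_WD4,      /* cycle 2 */
    DAEDALUS_KERNEL_VP9_MC_8H,        /* cycle 3 */
    DAEDALUS_KERNEL_VP9_LPF_WD8,      /* cycle 4 */
    DAEDALUS_KERNEL_AV1_CDEF,         /* cycle 5 */
    DAEDALUS_KERNEL_COUNT
} daedalus_kernel_t;

/* Assumed minimal context: only the QPU-availability flag is modeled. */
typedef struct {
    int has_qpu;
} daedalus_ctx;

/* Recipe defaults: QPU for cycles 1, 2, 4; CPU for cycles 3, 5. */
static const daedalus_substrate_t k_recipe[DAEDALUS_KERNEL_COUNT] = {
    DAEDALUS_SUBSTRATE_QPU,
    DAEDALUS_SUBSTRATE_QPU,
    DAEDALUS_SUBSTRATE_CPU,
    DAEDALUS_SUBSTRATE_QPU,
    DAEDALUS_SUBSTRATE_CPU
};

/* Pick where a kernel runs. AUTO follows the recipe and falls back to
 * CPU when no QPU is available; an explicit request is returned as-is
 * so a benchmark can observe the failure of an unavailable substrate. */
daedalus_substrate_t daedalus_resolve_substrate(const daedalus_ctx *ctx,
                                                daedalus_kernel_t kernel,
                                                daedalus_substrate_t requested)
{
    assert(ctx != NULL);
    assert(kernel >= 0 && kernel < DAEDALUS_KERNEL_COUNT);
    if (requested != DAEDALUS_SUBSTRATE_AUTO)
        return requested;
    if (k_recipe[kernel] == DAEDALUS_SUBSTRATE_QPU && !ctx->has_qpu)
        return DAEDALUS_SUBSTRATE_CPU;
    return k_recipe[kernel];
}
```

Keeping the recipe in a single table means a future V4L2 shim never needs to know which kernels are QPU-backed; it asks for the default and gets a CPU fallback when `has_qpu` is 0.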
## Out of scope for this session
- v4l2loopback setup (Option A specifics).
- VP9 bitstream parser (huge — borrow from FFmpeg / VP9 reference).
- Superblock-level decode loop.
- Frame buffer / dmabuf integration.
- libva-v4l2-request-fourier modifications (separate sibling repo).
These are tracked as future phases / issues.
## Acceptance for this Phase 8 scoping deliverable
- `include/daedalus.h` exists and is documented.
- `src/daedalus_core.{h,c}` skeleton compiles + links into the
  existing CMake build.
- One pass-through test (`test_idct_through_api`) runs and
  exercises the public API path for at least one kernel,
  producing the same M1 bit-exact result the cycle 1 bench did.
- Recipe table (which kernel runs where) is documented in the
  header and the docs/k* phase7 docs cross-reference it.
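The bit-exact criterion amounts to a byte-for-byte comparison between the API output and a reference buffer. A minimal sketch of that check is below; the helper names are hypothetical, and the 64-byte block size assumes 8x8 blocks of 8-bit samples, which is an inference rather than something the acceptance list states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one 8x8 block of 8-bit samples occupies 64 bytes. */
#define BLOCK_BYTES 64u

/* Count positions where the output equals the reference. Reporting a
 * count (rather than a bare pass/fail) allows "N/total bytes" output. */
size_t count_matching_bytes(const uint8_t *got, const uint8_t *ref, size_t n)
{
    size_t ok = 0;
    assert(got != NULL && ref != NULL);
    for (size_t i = 0; i < n; ++i)
        ok += (got[i] == ref[i]);
    return ok;
}

/* Bit-exact means every byte matches; a single differing bit fails. */
int is_bit_exact(const uint8_t *got, const uint8_t *ref, size_t n)
{
    return count_matching_bytes(got, ref, n) == n;
}
```

A test harness would fill `got` through the public API and `ref` through the C reference, then pass only if `is_bit_exact` holds over all blocks.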