include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes: proves the API surface compiles, library links, dispatch routing works, and NEON fallback delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| phase | status | date_opened | prereqs | consumer_target |
|---|---|---|---|---|
| 8 | scoping (architecture options + tractable-first-step picked) | 2026-05-18 | cycles 1-5 closed (IDCT, LPF wd=4, MC, LPF wd=8, CDEF) | libva-v4l2-request-fourier → firefox/chromium-fourier |
Phase 8 — V4L2 deployment scoping
What Phase 8 is
The "deliver the work" phase. Cycles 1-5 produced 5 individually measured per-block kernels (3 deployed on QPU, 2 on CPU per the deployment recipe). Phase 8 makes those kernels add up to a decoded video at the user's display.
Per project_consumer_target.md, the integration target is
libva-v4l2-request-fourier: a V4L2 stateless decoder node
exposing a VP9 (later AV1) contract, bridged via VA-API to
browser-fourier builds. Same plumbing mfritsche already runs for
HEVC/RK3588, different decoder backend.
Architecture stack
+-------------------------------------------------------+
| firefox-fourier / chromium-fourier (already builds) |
+-------------------------------------------------------+
| VA-API |
+-------------------------------------------------------+
| libva-v4l2-request-fourier (already runs for HEVC) |
+-------------------------------------------------------+
| V4L2 stateless ioctl interface (kernel uAPI) |
+-------------------------------------------------------+
| daedalus-fourier V4L2 shim (NEW — Phase 8 work) |
| ↳ Parses bitstream control structs (V4L2_CID_STATELESS_VP9_*)
| ↳ Drives per-superblock decode loop
| ↳ Dispatches per-kernel to CPU NEON or V3D QPU (recipe)
+-------------------------------------------------------+
| daedalus-fourier core library (NEW, Phase 8)          |
| ↳ Wraps kernels from cycles 1-5
+-------------------------------------------------------+
| V3D 7.1 Mesa userspace + ARM NEON |
+-------------------------------------------------------+
Three architecture options
Option A — Userspace V4L2 emulation (recommended for v1)
Implement a userspace videodev2-compatible loopback device
(via v4l2loopback or a custom UIO-style approach) that exposes
/dev/videoNN with the VP9 stateless contract.
libva-v4l2-request-fourier talks to this normally.
Pros: stays entirely in userspace; no kernel module work; can iterate quickly; isolation from kernel crash domain. The daedalus-fourier daemon runs as a regular Linux process, taking V4L2 ioctls (via the loopback shim) and emitting decoded frames.
Cons: v4l2loopback is loosely maintained; userspace V4L2 has some semantic quirks (DRM/PRIME buffer sharing is harder than in a real kernel driver).
Option B — Tiny kernel V4L2 shim
A small kernel module that registers as a V4L2 device, takes the ioctls, and forwards bitstream blobs + control structs to a userspace daemon (the actual decoder) over a UNIX socket or a character device. The daemon decodes and posts frames back.
Pros: a real /dev/videoNN with proper VFL_TYPE_VIDEO
semantics. DRM PRIME buffer sharing works correctly.
Cons: kernel module work. Cross-process buffer marshaling adds latency. Out-of-tree maintenance burden.
Option C — Direct libva integration (not recommended)
Skip V4L2 entirely; implement a libva backend module directly.
Pros: avoids the V4L2 wrapper layer entirely.
Cons: contradicts project_consumer_target.md (the decision to
use the V4L2 path is locked in). The libva backend maintenance burden is
roughly equivalent to that of a V4L2 shim, with no portability gain.
Pick A for v1; revisit if userspace V4L2 semantics block DRM PRIME / dmabuf for browser zero-copy.
What's tractable this session
Phase 8 in full is days of work (V4L2 ioctl glue, bitstream parser, superblock loop, frame buffer management, dmabuf handling, end-to-end test against a real VP9 clip). That is out of scope for a single session.
What IS tractable now:
- Public C API header (include/daedalus.h): declare the library's stable function surface for the 5 kernels + substrate selection + init/teardown. Future Phase 8 V4L2 shim consumes this header. This:
  - Locks the API shape so V4L2 work doesn't need to plumb through internal types.
  - Documents which kernels deploy where (recipe encoded in API).
  - Forces a clean separation between "kernel work" (cycles 1-5) and "decoder pipeline" (Phase 8).
- A minimal core library (src/daedalus_core.{h,c}): skeleton that compiles, has the right typedefs and dispatch tables, but the body of each function is assert(0 && "TODO"). Builds against existing kernel implementations.
- One integration test (tests/test_idct_through_api.c): exercise the public API for ONE kernel end-to-end. Proves the API can connect to existing benches.
This commit gives the integration target something concrete to hook into without prejudging V4L2 architecture (A/B/C).
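To make "recipe encoded in API" concrete, here is one way the substrate selection could be expressed. This is only a sketch under stated assumptions: every identifier below (the enum names, daedalus_recipe, daedalus_resolve) is hypothetical and not taken from the actual include/daedalus.h. Only the has_qpu flag, the daedalus_ctx name, and the QPU/CPU split per cycle come from the project's own description.

```c
#include <assert.h>

/* Hypothetical kernel identifiers, one per cycle. */
typedef enum {
    DAEDALUS_KERNEL_VP9_IDCT8X8 = 0, /* cycle 1 */
    DAEDALUS_KERNEL_VP9_LPF_WD4,     /* cycle 2 */
    DAEDALUS_KERNEL_VP9_MC_8H,       /* cycle 3 */
    DAEDALUS_KERNEL_VP9_LPF_WD8,     /* cycle 4 */
    DAEDALUS_KERNEL_AV1_CDEF,        /* cycle 5 */
    DAEDALUS_KERNEL_COUNT
} daedalus_kernel;

/* Hypothetical substrate selector; AUTO means "follow the recipe". */
typedef enum {
    DAEDALUS_SUBSTRATE_AUTO = 0,
    DAEDALUS_SUBSTRATE_CPU,
    DAEDALUS_SUBSTRATE_QPU
} daedalus_substrate;

/* Assumed minimal context: only the has_qpu flag is modelled here. */
typedef struct {
    int has_qpu;
} daedalus_ctx;

/* Recipe: QPU for cycles 1, 2 and 4; CPU for cycles 3 and 5. */
static const daedalus_substrate daedalus_recipe[DAEDALUS_KERNEL_COUNT] = {
    [DAEDALUS_KERNEL_VP9_IDCT8X8] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_VP9_LPF_WD4] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_VP9_MC_8H]   = DAEDALUS_SUBSTRATE_CPU,
    [DAEDALUS_KERNEL_VP9_LPF_WD8] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_AV1_CDEF]    = DAEDALUS_SUBSTRATE_CPU,
};

/* Pick the substrate for kernel k: an explicit override wins over the
 * recipe, and any QPU choice degrades to CPU when no QPU is available. */
daedalus_substrate daedalus_resolve(const daedalus_ctx *ctx,
                                    daedalus_kernel k,
                                    daedalus_substrate override)
{
    daedalus_substrate want =
        (override == DAEDALUS_SUBSTRATE_AUTO) ? daedalus_recipe[k] : override;
    if (want == DAEDALUS_SUBSTRATE_QPU && !ctx->has_qpu)
        return DAEDALUS_SUBSTRATE_CPU;
    return want;
}
```

Keeping the recipe in a single table means the V4L2 shim never needs to know which kernel runs where; it passes the AUTO value and lets the library decide, while benchmarks can pass an explicit substrate.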
Out of scope for this session
- v4l2loopback setup (Option A specifics).
- VP9 bitstream parser (huge — borrow from FFmpeg / VP9 reference).
- Superblock-level decode loop.
- Frame buffer / dmabuf integration.
- libva-v4l2-request-fourier modifications (separate sibling repo).
These are tracked as future phases / issues.
Acceptance for this Phase 8 scoping deliverable
- include/daedalus.h exists and is documented.
- src/daedalus_core.{h,c} skeleton compiles + links into the existing CMake build.
- One pass-through test (test_idct_through_api) runs and exercises the public API path for at least one kernel, producing the same M1 bit-exact result the cycle 1 bench did.
- Recipe table (which kernel runs where) is documented in the header and the docs/k* phase7 docs cross-reference it.
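The bit-exact criterion above comes down to a byte-for-byte comparison between the output produced through the public API and the C reference output. A minimal sketch of such a check follows. The helper names count_matching_bytes and report_bit_exact are assumptions made for illustration and are not taken from the project's test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: number of positions where got[i] == ref[i]. */
size_t count_matching_bytes(const uint8_t *got, const uint8_t *ref, size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (got[i] == ref[i]);
    return ok;
}

/* Hypothetical helper: prints a PASS/FAIL line with the matched byte
 * count and returns 0 only when every byte matches. */
int report_bit_exact(const char *name, const uint8_t *got,
                     const uint8_t *ref, size_t n)
{
    size_t ok = count_matching_bytes(got, ref, n);
    printf("%s %s %zu/%zu bytes\n", ok == n ? "PASS" : "FAIL", name, ok, n);
    return ok == n ? 0 : 1;
}
```

For a run of 64 blocks of 8x8 bytes, n would be 4096, and the return value can be used directly as the test's exit status.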