Phase 8 skeleton: public C API + first end-to-end smoke test

include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch helpers default to the cycle 1-5 verdict substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit override available for benchmarking and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped behind the public API. QPU path stubbed out (returns -1) since wiring v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch, bit-exact vs C ref. PASS 4096/4096 bytes: this proves the API surface compiles, the library links, dispatch routing works, and the NEON fallback delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1 and QPU dispatch goes through the API too. After that: V4L2 ioctl glue, bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
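
The recipe dispatch and CPU fallback described above can be sketched roughly as follows. This is an illustration only: every `sketch_*` name is hypothetical and is not taken from include/daedalus.h, and the behaviour of an explicit QPU request when has_qpu=0 (returning -1) is an assumption modelled on the stubbed QPU path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the real declarations live in include/daedalus.h. */
typedef enum { SKETCH_CPU = 0, SKETCH_QPU = 1 } sketch_substrate;

typedef enum {
    SKETCH_VP9_IDCT8 = 0,  /* cycle 1 */
    SKETCH_VP9_LPF4,       /* cycle 2 */
    SKETCH_VP9_MC8H,       /* cycle 3 */
    SKETCH_VP9_LPF8,       /* cycle 4 */
    SKETCH_AV1_CDEF,       /* cycle 5 */
    SKETCH_KERNEL_COUNT
} sketch_kernel;

typedef struct { int has_qpu; } sketch_ctx;

/* Default recipe from the commit message: QPU for cycles 1, 2, 4; CPU for 3, 5. */
static const sketch_substrate sketch_recipe[SKETCH_KERNEL_COUNT] = {
    SKETCH_QPU, SKETCH_QPU, SKETCH_CPU, SKETCH_QPU, SKETCH_CPU
};

/* Substrate the recipe helper would run on: the recipe default,
 * downgraded to CPU when the context has no usable QPU. */
static sketch_substrate sketch_resolve(const sketch_ctx *ctx, sketch_kernel k)
{
    assert(ctx != NULL);
    assert((unsigned)k < (unsigned)SKETCH_KERNEL_COUNT);
    if (sketch_recipe[k] == SKETCH_QPU && !ctx->has_qpu)
        return SKETCH_CPU;
    return sketch_recipe[k];
}

/* Explicit override: no silent fallback (assumed), so a QPU request
 * without a QPU reports -1 and lets the caller decide what to do. */
static int sketch_dispatch_on(const sketch_ctx *ctx, sketch_substrate s)
{
    assert(ctx != NULL);
    if (s == SKETCH_QPU && !ctx->has_qpu)
        return -1;
    return 0; /* the kernel body would run here */
}
```

With `has_qpu` set to 0, `sketch_resolve` returns `SKETCH_CPU` for every kernel, which is the fallback the smoke test exercises; keeping the override path separate means a benchmark can tell "QPU unavailable" apart from "ran on CPU".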
2026-05-18 13:54:43 +00:00
parent 5223d3cb3f
commit 760f6a4060
5 changed files with 733 additions and 0 deletions
@@ -0,0 +1,142 @@
+---
+phase: 8
+status: scoping (architecture options + tractable-first-step picked)
+date_opened: 2026-05-18
+prereqs: cycles 1-5 closed (IDCT, LPF wd=4, MC, LPF wd=8, CDEF)
+consumer_target: libva-v4l2-request-fourier → firefox/chromium-fourier
+---
+
+# Phase 8 — V4L2 deployment scoping
+
+## What Phase 8 is
+
+The "deliver the work" phase. Cycles 1-5 produced 5 individually-
+measured per-block kernels (3 deployed on QPU, 2 on CPU per the
+deployment recipe). Phase 8 makes those kernels add up to a
+decoded video at the user's display.
+
+Per `project_consumer_target.md`, the integration target is
+**libva-v4l2-request-fourier**: a V4L2 stateless decoder node
+exposing a VP9 (later AV1) contract, bridged via VA-API to
+browser-fourier builds. Same plumbing mfritsche already runs for
+HEVC/RK3588, different decoder backend.
+
+## Architecture stack
+
+```
+-------------------------------------------------------+
+| firefox-fourier / chromium-fourier  (already builds)  |
+-------------------------------------------------------+
+| VA-API                                                |
+-------------------------------------------------------+
+| libva-v4l2-request-fourier  (already runs for HEVC)   |
+-------------------------------------------------------+
+| V4L2 stateless ioctl interface  (kernel uAPI)         |
+-------------------------------------------------------+
+| daedalus-fourier V4L2 shim  (NEW — Phase 8 work)      |
+| ↳ Parses bitstream control structs (V4L2_CID_STATELESS_VP9_*)
+| ↳ Drives per-superblock decode loop
+| ↳ Dispatches per-kernel to CPU NEON or V3D QPU (recipe)
+-------------------------------------------------------+
+| daedalus-fourier core library  (NEW in Phase 8)       |
+| ↳ Wraps the kernels from cycles 1-5                   |
+-------------------------------------------------------+
+| V3D 7.1 Mesa userspace + ARM NEON                     |
+-------------------------------------------------------+
+```
+
+## Three architecture options
+
+### Option A — Userspace V4L2 emulation (recommended for v1)
+
+Implement a userspace `videodev2`-compatible loopback device
+(via `v4l2loopback` or a custom UIO-style approach) that exposes
+`/dev/videoNN` with the VP9 stateless contract. libva-v4l2-
+request-fourier talks to this normally.
+
+**Pros**: stays entirely in userspace; no kernel module work; can
+iterate quickly; isolation from kernel crash domain. The
+daedalus-fourier daemon runs as a regular Linux process, taking
+V4L2 ioctls (via the loopback shim) and emitting decoded frames.
+
+**Cons**: v4l2loopback is loosely maintained; userspace V4L2 has
+some semantic quirks (DRM/PRIME buffer sharing is harder than in
+a real kernel driver).
+
+### Option B — Tiny kernel V4L2 shim
+
+A small kernel module that registers as a V4L2 device, takes the
+ioctls, and forwards bitstream blobs + control structs to a
+userspace daemon (the actual decoder) over a UNIX socket or a
+character device. The daemon decodes and posts frames back.
+
+**Pros**: a real `/dev/videoNN` with proper VFL_TYPE_VIDEO
+semantics. DRM PRIME buffer sharing works correctly.
+
+**Cons**: kernel module work. Cross-process buffer marshaling
+adds latency. Out-of-tree maintenance burden.
+
+### Option C — Direct libva integration (not recommended)
+
+Skip V4L2 entirely; implement a libva backend module directly.
+
+**Pros**: avoids the V4L2 wrapper layer entirely.
+
+**Cons**: contradicts `project_consumer_target.md` (decision to
+use V4L2 path locked in). libva backend maintenance burden is
+roughly equivalent to V4L2 shim with no portability gain.
+
+**Pick A** for v1; revisit if userspace V4L2 semantics block
+DRM PRIME / dmabuf for browser zero-copy.
+
+## What's tractable this session
+
+Phase 8 in full is **days of work** (V4L2 ioctl glue, bitstream
+parser, superblock loop, frame buffer management, dmabuf handling,
+end-to-end test against a real VP9 clip), which is out of scope
+for a single session.
+
+What IS tractable now:
+
+1. **Public C API header** (`include/daedalus.h`): declare the
+   library's stable function surface for the 5 kernels +
+   substrate selection + init/teardown. Future Phase 8 V4L2 shim
+   consumes this header. This:
+   - Locks the API shape so V4L2 work doesn't need to plumb
+     through internal types.
+   - Documents which kernels deploy where (recipe encoded in API).
+   - Forces a clean separation between "kernel work" (cycles 1-5)
+     and "decoder pipeline" (Phase 8).
+
+2. **A minimal core library** (`src/daedalus_core.{h,c}`): a
+   skeleton that compiles and has the right typedefs and dispatch
+   tables, but the body of each function is `assert(0 && "TODO")`.
+   It builds against the existing kernel implementations.
+
+3. **One integration test** (`tests/test_idct_through_api.c`):
+   exercise the public API for ONE kernel end-to-end. Proves the
+   API can connect to existing benches.
+
+This commit gives the integration target something concrete to
+hook into without prejudging V4L2 architecture (A/B/C).
+
+## Out of scope for this session
+
+- v4l2loopback setup (Option A specifics).
+- VP9 bitstream parser (huge — borrow from FFmpeg / VP9 reference).
+- Superblock-level decode loop.
+- Frame buffer / dmabuf integration.
+- libva-v4l2-request-fourier modifications (separate sibling repo).
+
+These are tracked as future phases / issues.
+
+## Acceptance for this Phase 8 scoping deliverable
+
+- `include/daedalus.h` exists and is documented.
+- `src/daedalus_core.{h,c}` skeleton compiles + links into the
+  existing CMake build.
+- One pass-through test (`test_idct_through_api`) runs and
+  exercises the public API path for at least one kernel,
+  producing the same M1 bit-exact result the cycle 1 bench did.
+- Recipe table (which kernel runs where) is documented in the
+  header and the docs/k* phase7 docs cross-reference it.
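
The bit-exact acceptance check in the scoping doc can be pictured as a byte-for-byte comparison between the API output and the C reference output. The sketch below assumes 64 blocks of 8x8 samples at one byte each, which would account for the 4096 bytes reported in the commit message; the `sketch_*` names and the buffer layout are hypothetical and are not taken from tests/test_api_idct.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 64 blocks of 8x8 one-byte samples = 4096 bytes. */
enum {
    SKETCH_BLOCKS      = 64,
    SKETCH_BLOCK_BYTES = 8 * 8,
    SKETCH_TOTAL_BYTES = SKETCH_BLOCKS * SKETCH_BLOCK_BYTES
};

/* Count how many bytes of the API output match the reference output. */
static size_t sketch_count_matching(const uint8_t *got, const uint8_t *ref,
                                    size_t n)
{
    assert(got != NULL && ref != NULL);
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        ok += (got[i] == ref[i]);
    return ok;
}

/* Bit-exact means every byte matches; one mismatch fails the test. */
static int sketch_is_bit_exact(const uint8_t *got, const uint8_t *ref,
                               size_t n)
{
    return sketch_count_matching(got, ref, n) == n;
}
```

Reporting the match count rather than only a pass/fail flag gives a "matched/total" line in the test log, which makes a partial regression (for example, a single bad block) easier to spot.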