Phase 8 skeleton: public C API + first end-to-end smoke test

include/daedalus.h: stable C API surface exposing the 5 cycles (VP9 IDCT
8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel recipe-dispatch
helpers default to the cycle 1-5 verdict substrate (QPU for cycles
1+2+4, CPU for cycles 3+5); explicit override available for benchmarking
and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels wrapped
behind the public API. QPU path stubbed out (returns -1) since wiring
v3d_runner into daedalus_ctx is the next Phase 8 sub-step; with
has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe dispatch,
bit-exact vs C ref. PASS 4096/4096 bytes, which proves the API surface
compiles, library links, dispatch routing works, and NEON fallback
delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2, B=kernel
V4L2 shim, C=direct libva); pick A for v1; explicitly out-of-scope work
tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so has_qpu=1
and QPU dispatch goes through the API too. After that: V4L2 ioctl glue,
bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,142 @@
---
phase: 8
status: scoping (architecture options + tractable-first-step picked)
date_opened: 2026-05-18
prereqs: cycles 1-5 closed (IDCT, LPF wd=4, MC, LPF wd=8, CDEF)
consumer_target: libva-v4l2-request-fourier → firefox/chromium-fourier
---

# Phase 8: V4L2 deployment scoping

## What Phase 8 is

The "deliver the work" phase. Cycles 1-5 produced 5 individually
measured per-block kernels (3 deployed on QPU, 2 on CPU per the
deployment recipe). Phase 8 makes those kernels add up to a
decoded video at the user's display.

Per `project_consumer_target.md`, the integration target is
**libva-v4l2-request-fourier**: a V4L2 stateless decoder node
exposing a VP9 (later AV1) contract, bridged via VA-API to
browser-fourier builds. This is the same plumbing mfritsche already
runs for HEVC/RK3588, with a different decoder backend.

## Architecture stack

```
+-----------------------------------------------------------------+
| firefox-fourier / chromium-fourier (already builds)             |
+-----------------------------------------------------------------+
| VA-API                                                          |
+-----------------------------------------------------------------+
| libva-v4l2-request-fourier (already runs for HEVC)              |
+-----------------------------------------------------------------+
| V4L2 stateless ioctl interface (kernel uAPI)                    |
+-----------------------------------------------------------------+
| daedalus-fourier V4L2 shim (NEW: Phase 8 work)                  |
|   ↳ Parses bitstream control structs (V4L2_CID_STATELESS_VP9_*) |
|   ↳ Drives per-superblock decode loop                           |
|   ↳ Dispatches per-kernel to CPU NEON or V3D QPU (recipe)       |
+-----------------------------------------------------------------+
| daedalus-fourier core library (NEW: Phase 8 work)               |
|   ↳ Wraps kernels from cycles 1-5                               |
+-----------------------------------------------------------------+
| V3D 7.1 Mesa userspace + ARM NEON                               |
+-----------------------------------------------------------------+
```

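The "dispatches per-kernel (recipe)" layer in the stack can be pictured as a small lookup table. The following is only a sketch under stated assumptions: every identifier is hypothetical (the real `daedalus.h` may name things differently), and the mapping of cycle numbers to kernels is inferred from the order in which the kernels are listed (cycles 1, 2, 4 on QPU; cycles 3, 5 on CPU).

```c
#include <assert.h>

/* Hypothetical kernel ids, one per cycle (order assumed from the doc). */
typedef enum {
    DAEDALUS_KERNEL_VP9_IDCT8X8 = 0, /* cycle 1 */
    DAEDALUS_KERNEL_VP9_LPF_WD4,     /* cycle 2 */
    DAEDALUS_KERNEL_VP9_MC_8H,       /* cycle 3 */
    DAEDALUS_KERNEL_VP9_LPF_WD8,     /* cycle 4 */
    DAEDALUS_KERNEL_AV1_CDEF,        /* cycle 5 */
    DAEDALUS_KERNEL_COUNT
} daedalus_kernel;

/* Hypothetical substrate selector; AUTO means "use the recipe". */
typedef enum {
    DAEDALUS_SUBSTRATE_AUTO = -1,
    DAEDALUS_SUBSTRATE_CPU  = 0,
    DAEDALUS_SUBSTRATE_QPU  = 1
} daedalus_substrate;

/* Default recipe: QPU for cycles 1, 2, 4; CPU for cycles 3, 5. */
static const daedalus_substrate recipe_default[DAEDALUS_KERNEL_COUNT] = {
    [DAEDALUS_KERNEL_VP9_IDCT8X8] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_VP9_LPF_WD4] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_VP9_MC_8H]   = DAEDALUS_SUBSTRATE_CPU,
    [DAEDALUS_KERNEL_VP9_LPF_WD8] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_AV1_CDEF]    = DAEDALUS_SUBSTRATE_CPU,
};

/* Resolve where a kernel should run. An explicit request overrides the
 * recipe; a QPU choice degrades to CPU when no QPU is available. */
daedalus_substrate daedalus_pick_substrate(daedalus_kernel k, int has_qpu,
                                           daedalus_substrate requested)
{
    assert((int)k >= 0 && (int)k < DAEDALUS_KERNEL_COUNT);
    daedalus_substrate s = (requested == DAEDALUS_SUBSTRATE_AUTO)
                               ? recipe_default[k]
                               : requested;
    if (s == DAEDALUS_SUBSTRATE_QPU && !has_qpu)
        s = DAEDALUS_SUBSTRATE_CPU;
    return s;
}
```

Keeping the recipe as data rather than as branches means a later re-measurement only changes one table entry.
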
## Three architecture options

### Option A: Userspace V4L2 emulation (recommended for v1)

Implement a userspace `videodev2`-compatible loopback device
(via `v4l2loopback` or a custom UIO-style approach) that exposes
`/dev/videoNN` with the VP9 stateless contract.
libva-v4l2-request-fourier talks to this node as it would to any
other V4L2 device.

**Pros**: decoder logic stays entirely in userspace; no new kernel
module to write (`v4l2loopback` itself is an existing out-of-tree
module); can iterate quickly; isolation from kernel crash domain. The
daedalus-fourier daemon runs as a regular Linux process, taking
V4L2 ioctls (via the loopback shim) and emitting decoded frames.

**Cons**: v4l2loopback is loosely maintained; userspace V4L2 has
some semantic quirks (DRM/PRIME buffer sharing is harder than in
a real kernel driver).

### Option B: Tiny kernel V4L2 shim

A small kernel module that registers as a V4L2 device, takes the
ioctls, and forwards bitstream blobs + control structs to a
userspace daemon (the actual decoder) over a UNIX socket or
character device. The daemon decodes and posts frames back.

**Pros**: a real `/dev/videoNN` with proper `VFL_TYPE_VIDEO`
semantics. DRM PRIME buffer sharing works correctly.

**Cons**: kernel module work. Cross-process buffer marshaling
adds latency. Out-of-tree maintenance burden.

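To make the "forwards blobs + control structs to a daemon" idea concrete, here is a userspace-only sketch of length-prefixed framing over a UNIX socket, exercised with `socketpair()`. The message types, the `msg_hdr` layout, and all function names are hypothetical; nothing here is taken from an existing shim, and the kernel-side half is not shown.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical message types for the shim <-> daemon channel. */
enum { MSG_BITSTREAM = 1, MSG_CONTROLS = 2, MSG_FRAME_DONE = 3 };

/* Fixed-size header that precedes every payload. */
struct msg_hdr {
    uint32_t type;
    uint32_t len;
};

/* Write exactly n bytes; any short or failed write is treated as an error. */
static int write_all(int fd, const void *buf, size_t n)
{
    const unsigned char *p = buf;
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w <= 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Read exactly n bytes; EOF or failure is treated as an error. */
static int read_all(int fd, void *buf, size_t n)
{
    unsigned char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

int shim_send(int fd, uint32_t type, const void *payload, uint32_t len)
{
    struct msg_hdr h = { type, len };
    if (write_all(fd, &h, sizeof h) < 0)
        return -1;
    return len ? write_all(fd, payload, len) : 0;
}

/* Returns the payload length, or -1 on error or if it exceeds `cap`. */
int shim_recv(int fd, uint32_t *type, void *buf, uint32_t cap)
{
    struct msg_hdr h;
    if (read_all(fd, &h, sizeof h) < 0 || h.len > cap)
        return -1;
    if (h.len && read_all(fd, buf, h.len) < 0)
        return -1;
    *type = h.type;
    return (int)h.len;
}

/* Send one small message through a socketpair and check it arrives intact. */
int shim_roundtrip_demo(void)
{
    int sv[2];
    unsigned char in[4] = { 0x01, 0x02, 0x03, 0x04 }; /* arbitrary bytes */
    unsigned char out[16];
    uint32_t type = 0;
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    ok = shim_send(sv[0], MSG_BITSTREAM, in, sizeof in) == 0
      && shim_recv(sv[1], &type, out, sizeof out) == (int)sizeof in
      && type == MSG_BITSTREAM
      && memcmp(in, out, sizeof in) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

Copying payloads through a socket like this is the marshaling latency listed under Cons; a production design would more likely pass dmabuf file descriptors than bytes.
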
### Option C: Direct libva integration (not recommended)

Skip V4L2 entirely; implement a libva backend module directly.

**Pros**: avoids the V4L2 wrapper layer entirely.

**Cons**: contradicts `project_consumer_target.md` (the decision to
use the V4L2 path is locked in). The libva backend maintenance burden
is roughly equivalent to that of a V4L2 shim, with no portability gain.

**Pick A** for v1; revisit if userspace V4L2 semantics block
DRM PRIME / dmabuf for browser zero-copy.

## What's tractable this session

Phase 8 in full is **days of work** (V4L2 ioctl glue, bitstream
parser, superblock loop, frame buffer management, dmabuf handling,
end-to-end test against a real VP9 clip). That is out of scope for a
single-session continuation.

What IS tractable now:

1. **Public C API header** (`include/daedalus.h`): declare the
   library's stable function surface for the 5 kernels +
   substrate selection + init/teardown. The future Phase 8 V4L2 shim
   consumes this header. This:
   - Locks the API shape so V4L2 work doesn't need to plumb
     through internal types.
   - Documents which kernels deploy where (recipe encoded in API).
   - Forces a clean separation between "kernel work" (cycles 1-5)
     and "decoder pipeline" (Phase 8).

2. **A minimal core library** (`src/daedalus_core.{h,c}`):
   a skeleton that compiles and has the right typedefs and dispatch
   tables, but the body of each function is `assert(0 && "TODO")`.
   Builds against existing kernel implementations.

3. **One integration test** (`tests/test_idct_through_api.c`):
   exercise the public API for ONE kernel end-to-end. Proves the
   API can connect to existing benches.

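The "typedefs and dispatch tables" in item 2 could look roughly like the sketch below: one table entry per kernel, holding a CPU and a QPU function pointer. All names are hypothetical, the CPU function is a stand-in clamp and not a real IDCT, and falling back to CPU when the QPU path reports failure is a design choice of this sketch rather than something the plan specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-kernel signature: n_blocks blocks of 8x8 coefficients. */
typedef int (*kernel_fn)(uint8_t *dst, const int16_t *coeffs, size_t n_blocks);

typedef struct {
    kernel_fn cpu;        /* always present */
    kernel_fn qpu;        /* may be a stub */
    int       prefer_qpu; /* recipe verdict for this kernel */
} kernel_entry;

typedef struct {
    int has_qpu;          /* 0 until a QPU runner is wired in */
} demo_ctx;

/* Stand-in CPU path: clamps each coefficient to 0..255. NOT an IDCT. */
static int idct_cpu_standin(uint8_t *dst, const int16_t *coeffs, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks * 64; i++) {
        int v = coeffs[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
    return 0;
}

/* QPU path not wired up yet: report failure so the caller can fall back. */
static int idct_qpu_stub(uint8_t *dst, const int16_t *coeffs, size_t n_blocks)
{
    (void)dst; (void)coeffs; (void)n_blocks;
    return -1;
}

static const kernel_entry idct_entry = { idct_cpu_standin, idct_qpu_stub, 1 };

/* Try the QPU when the recipe prefers it and one exists; otherwise, or
 * if the QPU path fails, run the CPU implementation. */
static int demo_dispatch(const demo_ctx *ctx, const kernel_entry *e,
                         uint8_t *dst, const int16_t *coeffs, size_t n_blocks)
{
    assert(ctx && e && e->cpu);
    if (ctx->has_qpu && e->prefer_qpu && e->qpu &&
        e->qpu(dst, coeffs, n_blocks) == 0)
        return 0;
    return e->cpu(dst, coeffs, n_blocks);
}

/* Public-looking wrapper so callers never touch the table directly. */
int demo_idct(int has_qpu, uint8_t *dst, const int16_t *coeffs, size_t n_blocks)
{
    demo_ctx ctx = { has_qpu };
    return demo_dispatch(&ctx, &idct_entry, dst, coeffs, n_blocks);
}
```

Because callers only see `demo_idct`, swapping the stub for a real QPU runner later would not change the calling code.
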
This commit gives the integration target something concrete to
hook into without prejudging V4L2 architecture (A/B/C).

## Out of scope for this session

- v4l2loopback setup (Option A specifics).
- VP9 bitstream parser (huge; borrow from FFmpeg / VP9 reference).
- Superblock-level decode loop.
- Frame buffer / dmabuf integration.
- libva-v4l2-request-fourier modifications (separate sibling repo).

These are tracked as future phases / issues.

## Acceptance for this Phase 8 scoping deliverable

- `include/daedalus.h` exists and is documented.
- `src/daedalus_core.{h,c}` skeleton compiles + links into the
  existing CMake build.
- One pass-through test (`test_idct_through_api`) runs and
  exercises the public API path for at least one kernel,
  producing the same M1 bit-exact result the cycle 1 bench did.
- Recipe table (which kernel runs where) is documented in the
  header and the docs/k* phase7 docs cross-reference it.

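The bit-exact criterion above amounts to running the same input through the implementation under test and a reference, then comparing every output byte. A minimal harness sketch follows, sized for 64 blocks of 8x8 (4096 bytes). The toy transforms are stand-ins and not the project's IDCT, and every name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { N_BLOCKS = 64, BLOCK_BYTES = 64, TOTAL_BYTES = N_BLOCKS * BLOCK_BYTES };

typedef void (*block_fn)(uint8_t *dst, const int16_t *src, size_t n_blocks);

/* Reference stand-in: scale, bias and clamp each sample to 0..255. */
void toy_ref(uint8_t *dst, const int16_t *src, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks * BLOCK_BYTES; i++) {
        int v = src[i] / 8 + 128;
        if (v < 0)   v = 0;
        if (v > 255) v = 255;
        dst[i] = (uint8_t)v;
    }
}

/* "Optimized" stand-in: same arithmetic, written differently. */
void toy_fast(uint8_t *dst, const int16_t *src, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks * BLOCK_BYTES; i++) {
        int v = 128 + src[i] / 8;
        dst[i] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Deliberately wrong variant: flips one bit of the first output byte. */
void toy_broken(uint8_t *dst, const int16_t *src, size_t n_blocks)
{
    toy_ref(dst, src, n_blocks);
    dst[0] ^= 1;
}

/* Feed identical pseudo-random input to both functions and count the
 * output bytes that agree. TOTAL_BYTES means bit-exact. */
size_t bit_exact_matches(block_fn impl, block_fn ref)
{
    static int16_t src[TOTAL_BYTES];
    static uint8_t out_impl[TOTAL_BYTES], out_ref[TOTAL_BYTES];
    uint32_t state = 1u;
    size_t ok = 0;

    for (size_t i = 0; i < TOTAL_BYTES; i++) {
        state = state * 1664525u + 1013904223u;             /* simple LCG */
        src[i] = (int16_t)((int)((state >> 16) & 0xFFFu) - 2048);
    }
    impl(out_impl, src, N_BLOCKS);
    ref(out_ref, src, N_BLOCKS);
    for (size_t i = 0; i < TOTAL_BYTES; i++)
        ok += (out_impl[i] == out_ref[i]);
    return ok;
}
```

Reporting a match count instead of a bare pass/fail makes a near miss, such as a single wrong byte, visible at a glance.
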