
marfrit 760f6a4060 Phase 8 skeleton: public C API + first end-to-end smoke test

include/daedalus.h: stable C API surface exposing the 5 cycles
(VP9 IDCT 8x8, LPF wd=4, MC 8h, LPF wd=8; AV1 CDEF). Per-kernel
recipe-dispatch helpers default to the cycle 1-5 verdict
substrate (QPU for cycles 1+2+4, CPU for cycles 3+5); explicit
override available for benchmarking and runtime-aware scheduling.

src/daedalus_core.c: NEON-path implementation of all 5 kernels
wrapped behind the public API. QPU path stubbed out (returns -1)
since wiring v3d_runner into daedalus_ctx is the next Phase 8
sub-step; with has_qpu=0 the recipe falls back to CPU cleanly.

tests/test_api_idct.c: 64-block IDCT through the public recipe
dispatch, bit-exact vs C ref. PASS 4096/4096 bytes — proves the
API surface compiles, library links, dispatch routing works, and
NEON fallback delivers correct results.

docs/phase8_scoping.md: architecture options (A=userspace V4L2,
B=kernel V4L2 shim, C=direct libva); pick A for v1; explicitly
out-of-scope work tracked.

Next Phase 8 sub-step: wire v3d_runner into daedalus_ctx so
has_qpu=1 and QPU dispatch goes through the API too. After that:
V4L2 ioctl glue, bitstream parser, superblock loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 13:54:43 +00:00

5.6 KiB


phase	status	date_opened	prereqs	consumer_target
8	scoping (architecture options + tractable-first-step picked)	2026-05-18	cycles 1-5 closed (IDCT, LPF wd=4, MC, LPF wd=8, CDEF)	libva-v4l2-request-fourier → firefox/chromium-fourier

Phase 8 — V4L2 deployment scoping

What Phase 8 is

The "deliver the work" phase. Cycles 1-5 produced 5 individually measured per-block kernels (3 deployed on QPU, 2 on CPU, per the deployment recipe). Phase 8 makes those kernels add up to decoded video on the user's display.

Per project_consumer_target.md, the integration target is libva-v4l2-request-fourier: a V4L2 stateless decoder node exposing a VP9 (later AV1) contract, bridged via VA-API to browser-fourier builds. Same plumbing mfritsche already runs for HEVC/RK3588, different decoder backend.

Architecture stack

+-------------------------------------------------------+
| firefox-fourier / chromium-fourier  (already builds)  |
+-------------------------------------------------------+
| VA-API                                                |
+-------------------------------------------------------+
| libva-v4l2-request-fourier  (already runs for HEVC)   |
+-------------------------------------------------------+
| V4L2 stateless ioctl interface  (kernel uAPI)         |
+-------------------------------------------------------+
| daedalus-fourier V4L2 shim  (NEW — Phase 8 work)      |
|  ↳ Parses bitstream control structs                   |
|    (V4L2_CID_STATELESS_VP9_*)                         |
|  ↳ Drives per-superblock decode loop                  |
|  ↳ Dispatches per-kernel to CPU NEON or V3D QPU       |
|    (recipe)                                           |
+-------------------------------------------------------+
| daedalus-fourier core library  (NEW, Phase 8)         |
|  ↳ Wraps the kernels from cycles 1-5                  |
+-------------------------------------------------------+
| V3D 7.1 Mesa userspace + ARM NEON                     |
+-------------------------------------------------------+
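The "drives per-superblock decode loop" layer in the stack above can be pictured with a small traversal sketch. It assumes VP9's 64x64 superblock size and a plain raster walk over the frame; tiles, partitioning, and the actual kernel calls are left out. The names for_each_superblock, sb_visit_fn, and sb_count_visit are hypothetical and do not come from the daedalus-fourier sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* VP9 superblocks cover 64x64 luma samples. */
enum { VP9_SB_SIZE = 64 };

/* Hypothetical callback type: invoked once per superblock. */
typedef void (*sb_visit_fn)(uint32_t sb_row, uint32_t sb_col, void *user);

/* Example visitor that only counts calls; `user` points at a uint32_t. */
void sb_count_visit(uint32_t sb_row, uint32_t sb_col, void *user)
{
    (void)sb_row;
    (void)sb_col;
    ++*(uint32_t *)user;
}

/*
 * Walk the frame in raster order, one superblock at a time.
 * Partial superblocks at the right and bottom edges still count,
 * hence the round-up division. Returns the number of superblocks.
 * Simplification (assumption): tile boundaries are ignored.
 */
uint32_t for_each_superblock(uint32_t width, uint32_t height,
                             sb_visit_fn visit, void *user)
{
    assert(width > 0 && height > 0);

    uint32_t sb_cols = (width + VP9_SB_SIZE - 1) / VP9_SB_SIZE;
    uint32_t sb_rows = (height + VP9_SB_SIZE - 1) / VP9_SB_SIZE;

    for (uint32_t r = 0; r < sb_rows; ++r) {
        for (uint32_t c = 0; c < sb_cols; ++c) {
            if (visit != NULL) {
                /* A real loop would parse the partition tree here and
                 * dispatch IDCT / MC / LPF kernels per block. */
                visit(r, c, user);
            }
        }
    }
    return sb_rows * sb_cols;
}
```

For a 1920x1080 frame this visits 30 columns by 17 rows of superblocks, since 1080 is not a multiple of 64 and the last row is partial.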

Three architecture options

Option A — Userspace V4L2 emulation (recommended for v1)

Implement a userspace videodev2-compatible loopback device (via v4l2loopback or a custom UIO-style approach) that exposes /dev/videoNN with the VP9 stateless contract. libva-v4l2-request-fourier talks to this normally.

Pros: stays entirely in userspace; no kernel module work; can iterate quickly; isolation from kernel crash domain. The daedalus-fourier daemon runs as a regular Linux process, taking V4L2 ioctls (via the loopback shim) and emitting decoded frames.

Cons: v4l2loopback is loosely maintained; userspace V4L2 has some semantic quirks (DRM PRIME buffer sharing is harder than in a real kernel driver).

Option B — Tiny kernel V4L2 shim

A small kernel module that registers as a V4L2 device, takes the ioctls, and forwards bitstream blobs + control structs to a userspace daemon (the actual decoder) over a UNIX socket or a character device. The daemon decodes and posts frames back.

Pros: a real /dev/videoNN with proper VFL_TYPE_VIDEO semantics. DRM PRIME buffer sharing works correctly.

Cons: kernel module work. Cross-process buffer marshaling adds latency. Out-of-tree maintenance burden.
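To make the marshaling cost in Option B concrete, here is a minimal sketch of the daemon-side framing for one forwarded request. It assumes a fixed header followed by the payload, sent as a single message over an AF_UNIX SOCK_SEQPACKET socket. The shim_msg_header layout, the SHIM_* constants, and the shim_send / shim_recv helpers are hypothetical; nothing here is taken from an existing shim, and a real design would more likely pass dmabuf file descriptors than copy frame data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical message type and size cap for this sketch. */
enum { SHIM_MSG_DECODE = 1, SHIM_MAX_PAYLOAD = 4096 };

/* Hypothetical wire header: message type plus payload length in bytes. */
typedef struct {
    uint32_t type;
    uint32_t payload_len;
} shim_msg_header;

/* Send header + payload as ONE SOCK_SEQPACKET record. Returns 0 or -1. */
int shim_send(int fd, uint32_t type, const void *payload, uint32_t len)
{
    shim_msg_header hdr = { type, len };
    struct iovec iov[2] = {
        { &hdr, sizeof hdr },
        { (void *)payload, len },
    };

    if (len > SHIM_MAX_PAYLOAD)
        return -1;

    ssize_t n = writev(fd, iov, 2);
    return n == (ssize_t)(sizeof hdr + len) ? 0 : -1;
}

/* Receive one record; fails if it is short or the length field lies. */
int shim_recv(int fd, shim_msg_header *hdr, void *payload, uint32_t cap)
{
    struct iovec iov[2] = {
        { hdr, sizeof *hdr },
        { payload, cap },
    };

    ssize_t n = readv(fd, iov, 2);
    if (n < (ssize_t)sizeof *hdr)
        return -1;
    if (hdr->payload_len != (size_t)n - sizeof *hdr)
        return -1; /* truncated record or inconsistent header */
    return 0;
}
```

SOCK_SEQPACKET keeps message boundaries, so the receiver never has to reassemble a header split across reads. Every request still costs at least one copy in each direction, which is the latency concern listed above.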

Option C — Direct libva integration (not recommended)

Skip V4L2 entirely; implement a libva backend module directly.

Pros: avoids the V4L2 wrapper layer entirely.

Cons: contradicts project_consumer_target.md (the decision to use the V4L2 path is locked in). The maintenance burden of a libva backend is roughly equivalent to that of a V4L2 shim, with no portability gain.

Pick A for v1; revisit if userspace V4L2 semantics block DRM PRIME / dmabuf for browser zero-copy.

What's tractable this session

Phase 8 in full is days of work (V4L2 ioctl glue, bitstream parser, superblock loop, frame buffer management, dmabuf handling, end-to-end test against a real VP9 clip). That is out of scope for a single session.

What IS tractable now:

1. Public C API header (include/daedalus.h): declare the library's stable function surface for the 5 kernels, substrate selection, and init/teardown. The future Phase 8 V4L2 shim consumes this header. This:
   - Locks the API shape so V4L2 work doesn't need to plumb through internal types.
   - Documents which kernels deploy where (recipe encoded in the API).
   - Forces a clean separation between "kernel work" (cycles 1-5) and "decoder pipeline" (Phase 8).
2. A minimal core library (src/daedalus_core.{h,c}): a skeleton that compiles and has the right typedefs and dispatch tables, but the body of each function is assert(0 && "TODO"). It builds against the existing kernel implementations.
3. One integration test (tests/test_idct_through_api.c): exercise the public API for ONE kernel end-to-end. This proves the API can connect to the existing benches.

This commit gives the integration target something concrete to hook into without prejudging V4L2 architecture (A/B/C).
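As an illustration of "recipe encoded in API", the following sketch shows one way the substrate selection could look. The recipe values follow the commit message (QPU for cycles 1, 2, and 4; CPU for cycles 3 and 5). All identifiers (daedalus_kernel, daedalus_substrate, daedalus_pick_substrate, the has_qpu field layout) are hypothetical and may not match the real include/daedalus.h. Falling back to CPU when an explicit QPU override meets has_qpu=0 is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical kernel IDs, ordered by cycle number. */
typedef enum {
    DAEDALUS_KERNEL_VP9_IDCT8X8 = 0, /* cycle 1 */
    DAEDALUS_KERNEL_VP9_LPF_WD4,     /* cycle 2 */
    DAEDALUS_KERNEL_VP9_MC_8H,       /* cycle 3 */
    DAEDALUS_KERNEL_VP9_LPF_WD8,     /* cycle 4 */
    DAEDALUS_KERNEL_AV1_CDEF,        /* cycle 5 */
    DAEDALUS_KERNEL_COUNT
} daedalus_kernel;

typedef enum {
    DAEDALUS_SUBSTRATE_AUTO = 0, /* use the recipe default */
    DAEDALUS_SUBSTRATE_CPU,      /* ARM NEON path */
    DAEDALUS_SUBSTRATE_QPU       /* V3D QPU path */
} daedalus_substrate;

/* Hypothetical context: only the field this sketch needs. */
typedef struct {
    int has_qpu; /* 0 until the QPU runner is wired in */
} daedalus_ctx;

/* Recipe defaults: QPU for cycles 1, 2, 4; CPU for cycles 3, 5. */
static const daedalus_substrate daedalus_recipe[DAEDALUS_KERNEL_COUNT] = {
    [DAEDALUS_KERNEL_VP9_IDCT8X8] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_VP9_LPF_WD4] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_VP9_MC_8H]   = DAEDALUS_SUBSTRATE_CPU,
    [DAEDALUS_KERNEL_VP9_LPF_WD8] = DAEDALUS_SUBSTRATE_QPU,
    [DAEDALUS_KERNEL_AV1_CDEF]    = DAEDALUS_SUBSTRATE_CPU,
};

/*
 * Resolve where a kernel should run. An explicit override wins over
 * the recipe; a QPU choice degrades to CPU when no QPU is available
 * (assumed behaviour for the override case).
 */
daedalus_substrate daedalus_pick_substrate(const daedalus_ctx *ctx,
                                           daedalus_kernel kernel,
                                           daedalus_substrate override)
{
    assert(ctx != 0);
    assert(kernel >= 0 && kernel < DAEDALUS_KERNEL_COUNT);

    daedalus_substrate s = (override != DAEDALUS_SUBSTRATE_AUTO)
                               ? override
                               : daedalus_recipe[kernel];

    if (s == DAEDALUS_SUBSTRATE_QPU && !ctx->has_qpu)
        s = DAEDALUS_SUBSTRATE_CPU;
    return s;
}
```

Keeping the recipe in one table means a benchmarking tool can override it per call without the V4L2 layer ever seeing internal types.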

Out of scope for this session

- v4l2loopback setup (Option A specifics).
- VP9 bitstream parser (huge; borrow from FFmpeg / VP9 reference).
- Superblock-level decode loop.
- Frame buffer / dmabuf integration.
- libva-v4l2-request-fourier modifications (separate sibling repo).

These are tracked as future phases / issues.

Acceptance for this Phase 8 scoping deliverable

- include/daedalus.h exists and is documented.
- The src/daedalus_core.{h,c} skeleton compiles and links into the existing CMake build.
- One pass-through test (test_idct_through_api) runs and exercises the public API path for at least one kernel, producing the same M1 bit-exact result as the cycle 1 bench.
- The recipe table (which kernel runs where) is documented in the header, and the docs/k* Phase 7 docs cross-reference it.
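The bit-exact criterion can be sketched as a byte-for-byte comparison between a reference kernel and the kernel under test over 64 blocks of 64 bytes. The two kernels below are stand-ins (a clamped DC offset, not an IDCT), and every name here is hypothetical; a real test would call the C reference and the public API dispatch instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_PIXELS = 64, NUM_BLOCKS = 64 }; /* 64 blocks x 64 bytes */

/* Hypothetical per-block kernel signature: 64 coefficients in, 64 pixels out. */
typedef void (*block_kernel_fn)(const int16_t *coeffs, uint8_t *dst);

/* Stand-in "reference": NOT an IDCT, just offset-and-clamp. */
void standin_ref(const int16_t *coeffs, uint8_t *dst)
{
    for (int i = 0; i < BLOCK_PIXELS; ++i) {
        int v = 128 + coeffs[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}

/* Stand-in "device under test": same math, written differently. */
void standin_dut(const int16_t *coeffs, uint8_t *dst)
{
    for (int i = 0; i < BLOCK_PIXELS; ++i) {
        int v = coeffs[i] + 128;
        if (v < 0) v = 0;
        if (v > 255) v = 255;
        dst[i] = (uint8_t)v;
    }
}

/*
 * Feed both kernels the same pseudo-random blocks and count matching
 * output bytes. A bit-exact pass means the count equals
 * NUM_BLOCKS * BLOCK_PIXELS.
 */
size_t bitexact_matching_bytes(block_kernel_fn ref, block_kernel_fn dut,
                               uint32_t seed)
{
    size_t matches = 0;

    for (int b = 0; b < NUM_BLOCKS; ++b) {
        int16_t in[BLOCK_PIXELS];
        uint8_t out_ref[BLOCK_PIXELS];
        uint8_t out_dut[BLOCK_PIXELS];

        for (int i = 0; i < BLOCK_PIXELS; ++i) {
            /* Simple LCG so the inputs are reproducible. */
            seed = seed * 1664525u + 1013904223u;
            in[i] = (int16_t)((int)((seed >> 16) % 1024u) - 512);
        }

        ref(in, out_ref);
        dut(in, out_dut);

        for (int i = 0; i < BLOCK_PIXELS; ++i)
            matches += (out_ref[i] == out_dut[i]);
    }
    return matches;
}
```

Counting matching bytes rather than returning a bare pass/fail gives a failing run a useful number to report, in the same shape as the "4096/4096 bytes" line in the commit message.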
