From fd55f5ebc115851520975e28ef96cbca469e2bde Mon Sep 17 00:00:00 2001
From: Markus Fritsche
Date: Mon, 18 May 2026 14:46:24 +0000
Subject: [PATCH] =?UTF-8?q?Phase=208=20status=20doc=20=E2=80=94=20surfacin?=
 =?UTF-8?q?g=20V4L2=20architecture=20to=20user?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per goal "c8p3..c8p7, then p8 — surface if user intervention is
required": this is the surface point.

Kernel-library work is complete (cycles 1-8 all dispatchable via
public API, all CPU paths bit-exact, 3 QPU paths bit-exact, 3
opportunistic-QPU paths shader-exists-API-TODO).

V4L2 wrapper architecture needs 4 user decisions:
- Q1: Option A (v4l2loopback) / B (kernel V4L2 shim) / C (libva direct)
- Q2: Parser source: FFmpeg-vendored / dav1d+libvpx mix / FFmpeg-dlopen
- Q3: In-repo or sibling repo (daedalus-v4l2)?
- Q4: End-to-end test target (tiny clips / 1080p30 / both)

Recommended defaults (A / γ / sibling / both) documented; explicit
confirmation requested before committing to days of work that locks
in months of follow-on choices.

Mechanical TODOs that can proceed in parallel without blocking V4L2
decision: cycle 3+5+8 opportunistic-QPU dispatch wiring through API,
or cycle 9 (H.264 luma qpel MC, predicted CPU-only per cycle 6/7
pattern).

24 commits pushed to marfrit/daedalus-fourier this autonomous arc.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/phase8_status.md | 136 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 docs/phase8_status.md

diff --git a/docs/phase8_status.md b/docs/phase8_status.md
new file mode 100644
index 0000000..11c0f5f
--- /dev/null
+++ b/docs/phase8_status.md
@@ -0,0 +1,136 @@
+---
+phase: 8
+status: kernel-library complete; V4L2 wrapper needs user decisions
+date_opened: 2026-05-18
+prereqs: cycles 1-8 closed (all 3 codecs covered)
+---
+
+# Phase 8 status — user-intervention point
+
+Per the goal "c8p3..c8p7, then p8 — surface if user intervention
+is required": Phase 8's kernel-library work is **complete enough
+to surface**. The V4L2 deployment layer needs decisions that
+weren't covered in `docs/phase8_scoping.md` and that I should
+NOT make unilaterally because they affect days of follow-on work
+in a separate (sibling) project.
+
+## What's done in Phase 8 so far
+
+### Public API (`include/daedalus.h` + `src/daedalus_core.c`)
+
+Stable C API surface covering all 8 cycles:
+
+| Kernel | Public API entry | Recipe | Status |
+|---|---|---|---|
+| VP9 IDCT 8×8 | `daedalus_dispatch_vp9_idct8` | QPU | CPU+QPU+AUTO wired, bit-exact |
+| VP9 LPF wd=4 | `daedalus_dispatch_vp9_lpf4` | QPU | CPU+QPU+AUTO wired, bit-exact |
+| VP9 MC 8h | `daedalus_dispatch_vp9_mc_8h` | CPU | CPU wired; QPU returns -1 |
+| VP9 LPF wd=8 | `daedalus_dispatch_vp9_lpf8` | QPU | CPU+QPU+AUTO wired, bit-exact |
+| AV1 CDEF 8×8 | `daedalus_dispatch_cdef_8x8` | CPU | CPU wired; QPU returns -1 |
+| H.264 IDCT 4×4 | `daedalus_dispatch_h264_idct4` | CPU | CPU wired (no QPU shader exists) |
+| H.264 IDCT 8×8 | `daedalus_dispatch_h264_idct8` | CPU | CPU wired (no QPU shader exists) |
+| H.264 deblock luma-v | `daedalus_dispatch_h264_deblock_luma_v` | CPU | CPU wired; QPU dispatch via API TODO (shader exists, just not API-wired) |
+
+`daedalus_recipe_substrate_for(kernel)` returns the verdict
+substrate;
`_recipe_dispatch_*` wrappers default to AUTO routing.
+
+### Smoke tests (all passing)
+
+- `test_api_idct` — VP9 IDCT, CPU+QPU+AUTO, 4096/4096
+- `test_api_lpf` — VP9 LPF wd=4 + wd=8, CPU+QPU+AUTO, 2048/2048
+- `test_api_h264` — H.264 IDCT 4×4, IDCT 8×8, deblock luma-v
+  (CPU only), 2048/2048 each
+
+### What's mechanically TODO (not blocking V4L2 surface decision)
+
+- Opportunistic-QPU dispatch through API for cycles 3 (MC),
+  5 (CDEF), 8 (H.264 deblock). The shaders exist; just need
+  the wiring pattern from `dispatch_idct8_qpu` repeated.
+- ~1 hour each per kernel. Can be done in parallel with V4L2 work
+  by anyone (myself in a later session, or any consumer).
+
+## V4L2 wrapper — user decision points
+
+`docs/phase8_scoping.md` outlined 3 architecture options
+(A/B/C). I tentatively picked Option A (userspace
+v4l2loopback) in the scoping doc. Before committing 1+ week
+of work, I need user input on:
+
+### Q1. V4L2 architecture choice (A / B / C)?
+
+- **Option A** (userspace v4l2loopback): documented as my
+  recommendation. Pros: no kernel module. Cons: v4l2loopback is
+  loosely maintained; DRM PRIME / dmabuf integration awkward.
+- **Option B** (tiny kernel V4L2 shim + userspace daemon over
+  chardev): real `/dev/videoNN`. Pros: proper DRM PRIME. Cons:
+  kernel module work, cross-process buffer marshaling.
+- **Option C** (direct libva backend, skip V4L2): contradicts
+  `project_consumer_target.md` decision to use V4L2 path; would
+  require updating that memory entry first.
+
+### Q2. Bitstream parser source?
+
+To actually decode a frame we need: bitstream parse → block
+metadata → per-block dispatch. The parser is huge.
+
+- **Option α**: Vendor FFmpeg's VP9/AV1/H.264 parsers as additional
+  LGPL-2.1+ source (substantial: thousands of LOC). Daedalus
+  becomes ~50 % parser code by volume.
+- **Option β**: Vendor dav1d (BSD-2-Clause) for AV1, libvpx for
+  VP9, and ??? for H.264. Multi-source mix; license-clean.
+- **Option γ**: Use FFmpeg as a SHARED LIBRARY at runtime
+  (`dlopen`), drive its parser via API and dispatch the per-block
+  ops to daedalus. Lightest. Probably easiest for v1.
+
+### Q3. Phase 8 scope: in-repo or sibling repo?
+
+Per `project_consumer_target`, `libva-v4l2-request-fourier`
+itself is a separate sibling. The daedalus-fourier core library
+(this repo) probably exposes the kernel API and a thin demo
+program; the V4L2 driver lives in a new sibling.
+
+- **Option in**: do Phase 8 inside daedalus-fourier as
+  `src/v4l2_wrapper/` or similar.
+- **Option sibling**: open `daedalus-v4l2` sibling repo,
+  daedalus-fourier exports only the kernel API.
+
+### Q4. End-to-end test target?
+
+What clip and what success criterion? Options:
+- Tiny test clips (e.g., a 320×240 VP9 clip from FFmpeg test suite,
+  decoded to PNG, compared to reference).
+- Real 1080p30 H.264 clip (e.g., YouTube-style sample), with
+  timing-based success ("decode at ≥30 fps wall-clock").
+- Both.
+
+## Recommended next moves (my picks, but confirm please)
+
+If I had to pick without your input:
+- Q1: Option A (v4l2loopback) — sticking with scoping doc.
+- Q2: Option γ (dlopen FFmpeg) — lowest scope, fastest to v1.
+- Q3: sibling repo `daedalus-v4l2` — per consumer-target memory.
+- Q4: both — start with tiny test clips for M1, then 1080p30 for
+  timing.
+
+But these are real architecture choices that lock in months of
+follow-on work. Confirm before I proceed.
+
+## Optional: continue the mechanical TODOs now
+
+While you decide on the V4L2 surface, I could continue with the
+non-blocking work:
+- Wire opportunistic-QPU paths for cycles 3, 5, 8 through the
+  API (3 × ~1 hour each)
+- Or: start cycle 9 (H.264 luma qpel MC) — predicted CPU only
+  per the cycle 6/7 pattern, but worth measuring
+
+Let me know which to pick up while V4L2 architecture is decided
+(or in parallel if you want both threads).
+
+## Cycles 1-8 summary state
+
+8 cycles closed.
3 QPU-deployed (VP9 IDCT/LPF), 3 CPU-deployed
+(VP9 MC, H.264 IDCT 4×4, H.264 IDCT 8×8), 2 opportunistic-helper
+(AV1 CDEF, H.264 deblock). Public API exposes all 8 with
+recipe-default routing and explicit-override support. ~24
+commits pushed to `marfrit/daedalus-fourier` on gitea.
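
Reviewer note (not part of the patch): the "recipe-default routing and explicit-override support" described in the summary can be pictured with the minimal sketch below. Every identifier in it (`sketch_substrate_t`, `sketch_kernel_t`, `sketch_recipe_substrate_for`, `sketch_dispatch`, and the stand-in backends) is hypothetical and chosen for illustration only; the real declarations in `include/daedalus.h` are not shown in this patch and may well differ. The only behavior borrowed from the status doc is that AUTO resolves to the recipe verdict and that an unwired QPU path returns -1.

```c
#include <assert.h>

/* Hypothetical types: the real daedalus.h may name and shape these differently. */
typedef enum { SKETCH_CPU, SKETCH_QPU, SKETCH_AUTO } sketch_substrate_t;
typedef enum { SKETCH_VP9_IDCT8, SKETCH_VP9_MC_8H, SKETCH_KERNEL_COUNT } sketch_kernel_t;

/* Recipe verdict per kernel, mirroring the "Recipe" column of the status table
 * for the two kernels modelled here (IDCT 8x8 -> QPU, MC 8h -> CPU). */
sketch_substrate_t sketch_recipe_substrate_for(sketch_kernel_t k) {
    return (k == SKETCH_VP9_IDCT8) ? SKETCH_QPU : SKETCH_CPU;
}

/* Stand-in backends: return 0 on success, -1 when the path is not wired. */
int sketch_run_cpu(sketch_kernel_t k) { (void)k; return 0; }
int sketch_run_qpu(sketch_kernel_t k) { return (k == SKETCH_VP9_IDCT8) ? 0 : -1; }

/* AUTO resolves to the recipe verdict; an explicit CPU or QPU request is
 * honored as given, so an unwired QPU path surfaces as -1 to the caller. */
int sketch_dispatch(sketch_kernel_t k, sketch_substrate_t requested) {
    assert(k < SKETCH_KERNEL_COUNT);
    sketch_substrate_t s =
        (requested == SKETCH_AUTO) ? sketch_recipe_substrate_for(k) : requested;
    return (s == SKETCH_QPU) ? sketch_run_qpu(k) : sketch_run_cpu(k);
}
```

In this sketch, `sketch_dispatch(SKETCH_VP9_MC_8H, SKETCH_AUTO)` takes the CPU path and succeeds, while forcing `SKETCH_QPU` for the same kernel returns -1, which corresponds to the "CPU wired; QPU returns -1" rows of the table. Whether the real API silently falls back to CPU on an explicit QPU request is not stated in the doc, so no fallback is modelled here.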