daedalus-fourier/docs/phase8_status.md

---
phase: 8
status: kernel-library complete; V4L2 wrapper needs user decisions
date_opened: 2026-05-18
prereqs: cycles 1-8 closed (all 3 codecs covered)
---

# Phase 8 status — user-intervention point

Per the goal "c8p3..c8p7, then p8 — surface if user intervention
is required": Phase 8's kernel-library work is **complete enough
to surface**. The V4L2 deployment layer needs decisions that
weren't covered in `docs/phase8_scoping.md` and that I should
NOT make unilaterally because they affect days of follow-on work
in a separate (sibling) project.

## What's done in Phase 8 so far

### Public API (`include/daedalus.h` + `src/daedalus_core.c`)

Stable C API surface covering all 8 cycles:

| Kernel | Public API entry | Recipe | Status |
|---|---|---|---|
| VP9 IDCT 8×8 | `daedalus_dispatch_vp9_idct8` | QPU | CPU+QPU+AUTO wired, bit-exact |
| VP9 LPF wd=4 | `daedalus_dispatch_vp9_lpf4` | QPU | CPU+QPU+AUTO wired, bit-exact |
| VP9 MC 8h | `daedalus_dispatch_vp9_mc_8h` | CPU | CPU wired; QPU returns -1 |
| VP9 LPF wd=8 | `daedalus_dispatch_vp9_lpf8` | QPU | CPU+QPU+AUTO wired, bit-exact |
| AV1 CDEF 8×8 | `daedalus_dispatch_cdef_8x8` | CPU | CPU wired; QPU returns -1 |
| H.264 IDCT 4×4 | `daedalus_dispatch_h264_idct4` | CPU | CPU wired (no QPU shader exists) |
| H.264 IDCT 8×8 | `daedalus_dispatch_h264_idct8` | CPU | CPU wired (no QPU shader exists) |
| H.264 deblock luma-v | `daedalus_dispatch_h264_deblock_luma_v` | CPU | CPU wired; QPU dispatch via API TODO (shader exists, just not API-wired) |

`daedalus_recipe_substrate_for(kernel)` returns the verdict
substrate; `_recipe_dispatch_*` wrappers default to AUTO routing.
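
To make the routing rule concrete, here is a minimal sketch of how AUTO could resolve to the recipe verdict, and how an explicit QPU request fails with -1 when no QPU path is wired. Every identifier prefixed `sketch_` or `SKETCH_` is hypothetical, and the table is cut down to two kernels. The real declarations in `include/daedalus.h` may look different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; not the real daedalus.h declarations. */
typedef enum { SKETCH_CPU, SKETCH_QPU, SKETCH_AUTO } sketch_substrate;

typedef enum {
    SKETCH_VP9_IDCT8,
    SKETCH_VP9_MC_8H,
    SKETCH_KERNEL_COUNT
} sketch_kernel;

/* Per-kernel recipe verdict, mirroring the "Recipe" column above. */
static const sketch_substrate sketch_recipe[SKETCH_KERNEL_COUNT] = {
    [SKETCH_VP9_IDCT8] = SKETCH_QPU,
    [SKETCH_VP9_MC_8H] = SKETCH_CPU,
};

/* Whether a QPU path is wired through the API for this kernel. */
static const int sketch_qpu_wired[SKETCH_KERNEL_COUNT] = {
    [SKETCH_VP9_IDCT8] = 1,
    [SKETCH_VP9_MC_8H] = 0,
};

/* Return the substrate the recipe prefers for kernel k. */
sketch_substrate sketch_recipe_substrate_for(sketch_kernel k)
{
    assert((int)k >= 0 && (int)k < SKETCH_KERNEL_COUNT);
    return sketch_recipe[k];
}

/* Resolve which substrate a request would run on.
 * Returns 0 and writes *out on success, or -1 when the QPU is
 * requested explicitly for a kernel that has no QPU path wired. */
int sketch_resolve(sketch_kernel k, sketch_substrate requested,
                   sketch_substrate *out)
{
    assert(out != NULL);
    if (requested == SKETCH_AUTO)
        requested = sketch_recipe_substrate_for(k);
    if (requested == SKETCH_QPU && !sketch_qpu_wired[k])
        return -1;
    *out = requested;
    return 0;
}
```

With this shape, `sketch_resolve(SKETCH_VP9_MC_8H, SKETCH_AUTO, &s)` succeeds on the CPU, while the same kernel with `SKETCH_QPU` returns -1, matching the "QPU returns -1" rows in the table.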

### Smoke tests (all passing)

- `test_api_idct` — VP9 IDCT, CPU+QPU+AUTO, 4096/4096
- `test_api_lpf` — VP9 LPF wd=4 + wd=8, CPU+QPU+AUTO, 2048/2048
- `test_api_h264` — H.264 IDCT 4×4, IDCT 8×8, deblock luma-v
  (CPU only), 2048/2048 each
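
Assuming the N/N figures count blocks whose output matches a reference byte for byte, the comparison loop can be sketched as below. The kernels here are trivial stand-ins (doubling by multiplication versus doubling by addition), not the real VP9 or H.264 kernels. All `sketch_` names, the 8×8 block size, and the input range are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_ELEMS 64 /* one 8x8 block of coefficients */

typedef void (*sketch_kernel_fn)(const int16_t *in, int16_t *out);

/* Stand-in reference kernel: doubles each sample by multiplication. */
void sketch_double_mul(const int16_t *in, int16_t *out)
{
    for (int i = 0; i < SKETCH_BLOCK_ELEMS; i++)
        out[i] = (int16_t)(in[i] * 2);
}

/* Stand-in candidate kernel: same result computed by addition. */
void sketch_double_add(const int16_t *in, int16_t *out)
{
    for (int i = 0; i < SKETCH_BLOCK_ELEMS; i++)
        out[i] = (int16_t)(in[i] + in[i]);
}

/* Deliberately wrong candidate: the first sample is off by one. */
void sketch_double_broken(const int16_t *in, int16_t *out)
{
    sketch_double_mul(in, out);
    out[0] = (int16_t)(out[0] + 1);
}

/* Small deterministic LCG so that runs are reproducible. */
static uint32_t sketch_lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Run both kernels on n_blocks pseudo-random blocks and return how
 * many blocks match byte for byte. */
int sketch_count_bit_exact(sketch_kernel_fn ref, sketch_kernel_fn cand,
                           int n_blocks, uint32_t seed)
{
    int matches = 0;
    for (int b = 0; b < n_blocks; b++) {
        int16_t in[SKETCH_BLOCK_ELEMS];
        int16_t want[SKETCH_BLOCK_ELEMS];
        int16_t got[SKETCH_BLOCK_ELEMS];
        for (int i = 0; i < SKETCH_BLOCK_ELEMS; i++) {
            /* Top 11 bits of the LCG state, shifted to [-1024, 1023]. */
            in[i] = (int16_t)((int)(sketch_lcg_next(&seed) >> 21) - 1024);
        }
        ref(in, want);
        cand(in, got);
        if (memcmp(want, got, sizeof want) == 0)
            matches++;
    }
    return matches;
}
```

A passing run reports `matches == n_blocks`. A single mismatching sample in a block counts the whole block as a failure, which is the strict reading of "bit-exact".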

### What's mechanically TODO (not blocking V4L2 surface decision)

- Opportunistic-QPU dispatch through API for cycles 3 (MC),
  5 (CDEF), 8 (H.264 deblock). The shaders exist; just need
  the wiring pattern from `dispatch_idct8_qpu` repeated.
- Estimated at ~1 hour per kernel. This can proceed in parallel with
  the V4L2 work and can be done by anyone (me in a later session, or
  any consumer of the library).

## V4L2 wrapper — user decision points

`docs/phase8_scoping.md` outlined 3 architecture options
(A/B/C). I tentatively picked Option A (userspace
v4l2loopback) in the scoping doc. Before committing 1+ week
of work, I need user input on:

### Q1. V4L2 architecture choice (A / B / C)?

- **Option A** (userspace v4l2loopback): documented as my
  recommendation. Pros: no new kernel module to write (v4l2loopback
  is itself an existing out-of-tree module). Cons: v4l2loopback is
  loosely maintained; DRM PRIME / dmabuf integration is awkward.
- **Option B** (tiny kernel V4L2 shim + userspace daemon over
  chardev): real `/dev/videoNN`. Pros: proper DRM PRIME. Cons:
  kernel module work, cross-process buffer marshaling.
- **Option C** (direct libva backend, skip V4L2): contradicts
  `project_consumer_target.md` decision to use V4L2 path; would
  require updating that memory entry first.

### Q2. Bitstream parser source?

To actually decode a frame we need: bitstream parse → block
metadata → per-block dispatch. The parser is huge.

- **Option α**: Vendor FFmpeg's VP9/AV1/H.264 parsers as additional
  LGPL-2.1+ source (substantial: thousands of LOC). Daedalus
  becomes ~50 % parser code by volume.
- **Option β**: Vendor dav1d (BSD-2-Clause) for AV1, libvpx for
  VP9, and a source yet to be identified for H.264. Multi-source
  mix, but license-clean.
- **Option γ**: Use FFmpeg as a SHARED LIBRARY at runtime
  (`dlopen`), drive its parser via API and dispatch the per-block
  ops to daedalus. Lightest. Probably easiest for v1.
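
As a first step toward Option γ, the runtime-loading part could look like the sketch below. It only probes for libavcodec and reads its version through the public `avcodec_version()` function; it does not show parser hooks, which would need separate investigation. The function name `sketch_probe_avcodec`, its return codes, and the library name passed in are assumptions.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Signature of libavcodec's public avcodec_version(). */
typedef unsigned (*sketch_avcodec_version_fn)(void);

/* Try to load a libavcodec shared object at runtime and read its version.
 * Returns  0 and writes *version_out on success,
 *         -1 if the library could not be loaded,
 *         -2 if it loaded but lacks the avcodec_version symbol. */
int sketch_probe_avcodec(const char *soname, unsigned *version_out)
{
    assert(soname != NULL && version_out != NULL);

    void *handle = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL)
        return -1;

    /* POSIX-documented idiom for turning dlsym's void* into a
     * function pointer without a direct cast. */
    sketch_avcodec_version_fn fn = NULL;
    *(void **)(&fn) = dlsym(handle, "avcodec_version");

    int rc = -2;
    if (fn != NULL) {
        *version_out = fn();
        rc = 0;
    }
    dlclose(handle);
    return rc;
}
```

A caller would pass whatever soname the target system ships (often a versioned name rather than plain `libavcodec.so`) and treat -1 as "FFmpeg not installed, disable the parser path". The major version is the returned value shifted right by 16. On older glibc the program must be linked with `-ldl`.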

### Q3. Phase 8 scope: in-repo or sibling repo?

Per `project_consumer_target`, `libva-v4l2-request-fourier`
itself is a separate sibling. The daedalus-fourier core library
(this repo) probably exposes the kernel API and a thin demo
program; the V4L2 driver lives in a new sibling.

- **Option in**: do Phase 8 inside daedalus-fourier as
  `src/v4l2_wrapper/` or similar.
- **Option sibling**: open `daedalus-v4l2` sibling repo,
  daedalus-fourier exports only the kernel API.

### Q4. End-to-end test target?

What clip and what success criterion? Options:
- Tiny test clips (e.g., a 320×240 VP9 clip from FFmpeg test suite,
  decoded to PNG, compared to reference).
- Real 1080p30 H.264 clip (e.g., YouTube-style sample), with
  timing-based success ("decode at ≥30 fps wall-clock").
- Both.
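
For the timing-based criterion, the pass/fail check reduces to integer arithmetic on a frame count and a measured wall-clock interval. The sketch below assumes the caller measures elapsed time with a monotonic clock; `sketch_meets_fps` is a hypothetical helper, not part of the current API.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if `frames` decoded in `elapsed_ns` nanoseconds of wall-clock
 * time meets or exceeds `target_fps`, else 0.
 *
 * frames / (elapsed_ns / 1e9) >= target_fps is rearranged to
 * frames * 1e9 >= target_fps * elapsed_ns to stay in integers.
 * The products fit in 64 bits for any realistic clip length. */
int sketch_meets_fps(uint64_t frames, uint64_t elapsed_ns, uint64_t target_fps)
{
    if (elapsed_ns == 0)
        return frames > 0; /* degenerate measurement: any output passes */
    return frames * UINT64_C(1000000000) >= target_fps * elapsed_ns;
}
```

For example, 300 frames in exactly 10 s meets a 30 fps target, while 299 frames in the same interval does not.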

## Recommended next moves (my picks, but confirm please)

If I had to pick without your input:
- Q1: Option A (v4l2loopback) — sticking with scoping doc.
- Q2: Option γ (dlopen FFmpeg) — lowest scope, fastest to v1.
- Q3: sibling repo `daedalus-v4l2` — per consumer-target memory.
- Q4: both — start with tiny test clips for M1, then 1080p30 for
  timing.

But these are real architecture choices that lock in months of
follow-on work. Confirm before I proceed.

## Optional: continue the mechanical TODOs now

While you decide on the V4L2 surface, I could continue with the
non-blocking work:
- Wire opportunistic-QPU paths for cycles 3, 5, 8 through the
  API (3 kernels × ~1 hour)
- Or: start cycle 9 (H.264 luma qpel MC), predicted CPU-only
  per the cycle 6/7 pattern, but worth measuring

Let me know which to pick up while V4L2 architecture is decided
(or in parallel if you want both threads).

## Cycles 1-8 summary state

8 cycles closed: 3 QPU-deployed (VP9 IDCT 8×8, VP9 LPF wd=4,
VP9 LPF wd=8), 3 CPU-deployed (VP9 MC, H.264 IDCT 4×4,
H.264 IDCT 8×8), and 2 opportunistic-helper (AV1 CDEF,
H.264 deblock). The public API exposes all 8 with recipe-default
routing and explicit-override support. ~24 commits pushed to
`marfrit/daedalus-fourier` on Gitea.