marfrit fd55f5ebc1 Phase 8 status doc — surfacing V4L2 architecture to user

Per goal "c8p3..c8p7, then p8 — surface if user intervention is
required": this is the surface point.

Kernel-library work is complete (cycles 1-8 all dispatchable via
public API, all CPU paths bit-exact, 3 QPU paths bit-exact, 3
opportunistic-QPU paths shader-exists-API-TODO).

V4L2 wrapper architecture needs 4 user decisions:
- Q1: Option A (v4l2loopback) / B (kernel V4L2 shim) / C (libva direct)
- Q2: Parser source: FFmpeg-vendored / dav1d+libvpx mix / FFmpeg-dlopen
- Q3: In-repo or sibling repo (daedalus-v4l2)?
- Q4: End-to-end test target (tiny clips / 1080p30 / both)

Recommended defaults (A / γ / sibling / both) documented;
explicit confirmation requested before committing to days of work
that locks in months of follow-on choices.

Mechanical TODOs that can proceed in parallel without blocking V4L2
decision: cycle 3+5+8 opportunistic-QPU dispatch wiring through API,
or cycle 9 (H.264 luma qpel MC, predicted CPU-only per cycle 6/7
pattern).

24 commits pushed to marfrit/daedalus-fourier this autonomous arc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 14:46:24 +00:00


phase	status	date_opened	prereqs
8	kernel-library complete; V4L2 wrapper needs user decisions	2026-05-18	cycles 1-8 closed (all 3 codecs covered)

Phase 8 status — user-intervention point

Per the goal "c8p3..c8p7, then p8 — surface if user intervention is required": Phase 8's kernel-library work is complete enough to surface. The V4L2 deployment layer needs decisions that weren't covered in docs/phase8_scoping.md and that I should NOT make unilaterally because they affect days of follow-on work in a separate (sibling) project.

What's done in Phase 8 so far

Public API (`include/daedalus.h` + `src/daedalus_core.c`)

Stable C API surface covering all 8 cycles:

Kernel	Public API entry	Recipe	Status
VP9 IDCT 8×8	`daedalus_dispatch_vp9_idct8`	QPU	CPU+QPU+AUTO wired, bit-exact
VP9 LPF wd=4	`daedalus_dispatch_vp9_lpf4`	QPU	CPU+QPU+AUTO wired, bit-exact
VP9 MC 8h	`daedalus_dispatch_vp9_mc_8h`	CPU	CPU wired; QPU returns -1
VP9 LPF wd=8	`daedalus_dispatch_vp9_lpf8`	QPU	CPU+QPU+AUTO wired, bit-exact
AV1 CDEF 8×8	`daedalus_dispatch_cdef_8x8`	CPU	CPU wired; QPU returns -1
H.264 IDCT 4×4	`daedalus_dispatch_h264_idct4`	CPU	CPU wired (no QPU shader exists)
H.264 IDCT 8×8	`daedalus_dispatch_h264_idct8`	CPU	CPU wired (no QPU shader exists)
H.264 deblock luma-v	`daedalus_dispatch_h264_deblock_luma_v`	CPU	CPU wired; QPU dispatch via API TODO (shader exists, just not API-wired)

`daedalus_recipe_substrate_for(kernel)` returns the substrate chosen by the recipe verdict; the `_recipe_dispatch_*` wrappers default to AUTO routing.
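The routing behaviour in the table can be pictured with a minimal sketch. Every name in it (`sk_substrate`, `sk_kernel`, `sk_dispatch`, `sk_copy_cpu`) is hypothetical and not taken from `include/daedalus.h`. It only assumes what the table states: AUTO follows the recipe verdict, an explicit substrate overrides it, and a QPU request for a kernel with no wired QPU path returns -1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; the real daedalus.h may differ. */
typedef enum { SK_CPU, SK_QPU, SK_AUTO } sk_substrate;

typedef int (*sk_kernel_fn)(const unsigned char *in, unsigned char *out, size_t n);

typedef struct {
    sk_substrate recipe; /* substrate chosen by the recipe verdict */
    sk_kernel_fn cpu;    /* CPU path, assumed always present */
    sk_kernel_fn qpu;    /* NULL when no QPU path is wired through the API */
} sk_kernel;

/* Stand-in CPU "kernel": copies n bytes and reports success. */
static int sk_copy_cpu(const unsigned char *in, unsigned char *out, size_t n)
{
    memcpy(out, in, n);
    return 0;
}

/* AUTO resolves to the recipe verdict; CPU and QPU requests pass through. */
static sk_substrate sk_resolve(const sk_kernel *k, sk_substrate requested)
{
    return requested == SK_AUTO ? k->recipe : requested;
}

/* Returns the kernel's result, or -1 when the requested path is not wired. */
static int sk_dispatch(const sk_kernel *k, sk_substrate requested,
                       const unsigned char *in, unsigned char *out, size_t n)
{
    switch (sk_resolve(k, requested)) {
    case SK_CPU:
        return k->cpu(in, out, n);
    case SK_QPU:
        return k->qpu ? k->qpu(in, out, n) : -1;
    default:
        return -1;
    }
}
```

With `sk_kernel mc = { SK_CPU, sk_copy_cpu, NULL }`, a call with `SK_AUTO` runs the CPU path, while a call with `SK_QPU` returns -1, mirroring the "QPU returns -1" rows.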

Smoke tests (all passing)

test_api_idct — VP9 IDCT, CPU+QPU+AUTO, 4096/4096
test_api_lpf — VP9 LPF wd=4 + wd=8, CPU+QPU+AUTO, 2048/2048
test_api_h264 — H.264 IDCT 4×4, IDCT 8×8, deblock luma-v (CPU only), 2048/2048 each
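A bit-exact smoke test of this kind can be sketched as below, assuming the pass counts (e.g. 4096/4096) are the number of input blocks on which two paths produce identical bytes. The harness, the block size, and the toy kernels are hypothetical stand-ins, not the actual test sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 64 /* assumed: one 8x8 block of 16-bit coefficients */

typedef void (*blk_fn)(const int16_t *in, int16_t *out);

/* Toy reference and candidate kernels (stand-ins for CPU and QPU paths). */
static void toy_ref(const int16_t *in, int16_t *out)
{
    for (int j = 0; j < BLK; j++) out[j] = (int16_t)(in[j] * 2);
}

static void toy_same(const int16_t *in, int16_t *out)
{
    for (int j = 0; j < BLK; j++) out[j] = (int16_t)(in[j] + in[j]);
}

static void toy_off_by_one(const int16_t *in, int16_t *out)
{
    for (int j = 0; j < BLK; j++) out[j] = (int16_t)(in[j] * 2 + 1);
}

/* Counts the blocks on which ref and cand agree byte for byte. */
static size_t count_bit_exact(blk_fn ref, blk_fn cand, size_t n_blocks, uint32_t seed)
{
    int16_t in[BLK], a[BLK], b[BLK];
    size_t pass = 0;
    uint32_t state = seed;

    for (size_t i = 0; i < n_blocks; i++) {
        for (int j = 0; j < BLK; j++) {
            /* Small LCG so the inputs are reproducible across platforms. */
            state = state * 1103515245u + 12345u;
            in[j] = (int16_t)((int)((state >> 16) & 0x3FF) - 512);
        }
        ref(in, a);
        cand(in, b);
        if (memcmp(a, b, sizeof a) == 0)
            pass++;
    }
    return pass;
}
```

A run passes when `count_bit_exact(ref, cand, n, seed)` equals `n`; a single differing byte in a block fails that block.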

What's mechanically TODO (not blocking V4L2 surface decision)

Opportunistic-QPU dispatch through the API for cycles 3 (MC), 5 (CDEF), and 8 (H.264 deblock). The shaders exist; they only need the wiring pattern from dispatch_idct8_qpu repeated.
About 1 hour per kernel. Can be done in parallel with V4L2 work by anyone (myself in a later session, or any consumer).

V4L2 wrapper — user decision points

docs/phase8_scoping.md outlined 3 architecture options (A/B/C). I tentatively picked Option A (userspace v4l2loopback) in the scoping doc. Before committing 1+ week of work, I need user input on:

Q1. V4L2 architecture choice (A / B / C)?

Option A (userspace v4l2loopback): documented as my recommendation. Pros: no new kernel module to write (v4l2loopback is itself an existing out-of-tree kernel module). Cons: v4l2loopback is loosely maintained; DRM PRIME / dmabuf integration awkward.
Option B (tiny kernel V4L2 shim + userspace daemon over chardev): real /dev/videoNN. Pros: proper DRM PRIME. Cons: kernel module work, cross-process buffer marshaling.
Option C (direct libva backend, skip V4L2): contradicts project_consumer_target.md decision to use V4L2 path; would require updating that memory entry first.

Q2. Bitstream parser source?

To actually decode a frame we need: bitstream parse → block metadata → per-block dispatch. The parser is huge.

Option α: Vendor FFmpeg's VP9/AV1/H.264 parsers as additional LGPL-2.1+ source (substantial: thousands of LOC). Daedalus becomes ~50 % parser code by volume.
Option β: Vendor dav1d (BSD-2-Clause) for AV1, libvpx for VP9, and ??? for H.264. Multi-source mix; license-clean.
Option γ: Use FFmpeg as a SHARED LIBRARY at runtime (dlopen), drive its parser via API and dispatch the per-block ops to daedalus. Lightest. Probably easiest for v1.
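The runtime-loading half of Option γ can be sketched with POSIX `dlopen`/`dlsym`. The helper `query_lib_version` is hypothetical, and the library name passed to it is an assumption (installed FFmpeg sonames usually carry a major-version suffix). The sketch only shows resolving and calling one symbol; how per-block operations would be routed from FFmpeg to daedalus is not specified above and is not attempted here.

```c
#include <dlfcn.h>
#include <stddef.h>

typedef unsigned (*version_fn)(void);

/*
 * Loads `soname`, resolves `symbol` as an `unsigned f(void)` function,
 * calls it, and stores the result. Returns 1 on success, 0 if the
 * library or the symbol cannot be resolved (version_out is untouched).
 */
static int query_lib_version(const char *soname, const char *symbol,
                             unsigned *version_out)
{
    void *handle = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return 0;

    version_fn fn = NULL;
    /* POSIX idiom for converting a dlsym result to a function pointer. */
    *(void **)(&fn) = dlsym(handle, symbol);
    if (!fn) {
        dlclose(handle);
        return 0;
    }

    *version_out = fn();
    dlclose(handle);
    return 1;
}
```

For example, `query_lib_version("libavcodec.so", "avcodec_version", &v)` would report the libavcodec version if that library is installed under that name, and return 0 otherwise, so a missing FFmpeg is detected at startup rather than at link time. On older glibc versions the program needs to be linked with `-ldl`.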

Q3. Phase 8 scope: in-repo or sibling repo?

Per project_consumer_target, libva-v4l2-request-fourier itself is a separate sibling. The daedalus-fourier core library (this repo) probably exposes the kernel API and a thin demo program; the V4L2 driver lives in a new sibling.

Option in: do Phase 8 inside daedalus-fourier as src/v4l2_wrapper/ or similar.
Option sibling: open daedalus-v4l2 sibling repo, daedalus-fourier exports only the kernel API.

Q4. End-to-end test target?

What clip and what success criterion? Options:

Tiny test clips (e.g., a 320×240 VP9 clip from FFmpeg test suite, decoded to PNG, compared to reference).
Real 1080p30 H.264 clip (e.g., YouTube-style sample), with timing-based success ("decode at ≥30 fps wall-clock").
Both.
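The timing-based criterion can be made concrete with a small check like the one below. Both helper names are hypothetical, and the sketch assumes the start and end timestamps are captured by the caller around the decode loop (for instance with a monotonic clock).

```c
#include <time.h>

/* Wall-clock seconds between two timestamps. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) * 1e-9;
}

/* Returns 1 if `frames` decoded in `seconds` reaches `target_fps`, else 0. */
static int meets_fps_target(unsigned long frames, double seconds, double target_fps)
{
    if (seconds <= 0.0)
        return 0; /* no measurable interval: treat as a failed run */
    return (double)frames / seconds >= target_fps;
}
```

Under this criterion, 300 frames decoded in 10 s passes a 30 fps target and 299 frames in the same interval does not.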

Recommended next moves (my picks, but confirm please)

If I had to pick without your input:

Q1: Option A (v4l2loopback) — sticking with scoping doc.
Q2: Option γ (dlopen FFmpeg) — lowest scope, fastest to v1.
Q3: sibling repo daedalus-v4l2 — per consumer-target memory.
Q4: both — start with tiny test clips for M1, then 1080p30 for timing.

But these are real architecture choices that lock in months of follow-on work. Confirm before I proceed.

Optional: continue the mechanical TODOs now

While you decide on the V4L2 surface, I could continue with the non-blocking work:

Wire opportunistic-QPU paths for cycles 3, 5, 8 through the API (3 kernels × ~1 hour each)
Or: start cycle 9 (H.264 luma qpel MC) — predicted CPU only per the cycle 6/7 pattern, but worth measuring

Let me know which to pick up while V4L2 architecture is decided (or in parallel if you want both threads).

Cycles 1-8 summary state

8 cycles closed: 3 QPU-deployed (VP9 IDCT 8×8, LPF wd=4, LPF wd=8), 3 CPU-deployed (VP9 MC, H.264 IDCT 4×4, H.264 IDCT 8×8), 2 opportunistic-helper (AV1 CDEF, H.264 deblock luma-v). The public API exposes all 8 with recipe-default routing and explicit-override support. ~24 commits pushed to marfrit/daedalus-fourier on Gitea.
