daedalus-v4l2/docs/phase_8_5_closure.md

# Phase 8.5 closure — full V4L2 m2m driver, VP9 decode via QBUF/DQBUF

**Status:** closed 2026-05-18.

Replaces the Phase 8.4 debugfs-triggered chardev path with a real
V4L2 m2m driver.  Userspace clients now drive decoding the
standard way — `S_FMT` / `REQBUFS` / `QBUF` on the OUTPUT
(bitstream) queue, `DQBUF` on the CAPTURE (NV12M) queue.  Kernel
device_run packs the bitstream into REQ_DECODE; the daemon
decodes via FFmpeg; RESP_FRAME's inline NV12 pixel payload lands
in the CAPTURE buffer.  Phase 8.6 swaps the inline payload for
dmabuf so big frames stop being capped at 64 KiB.

## What lands

### Kernel (`kernel/daedalus_v4l2_main.c`)
- Per-open `struct daedalus_ctx`: v4l2_fh, m2m_ctx,
  ctrl_handler, per-queue formats.
- Two vb2_queues via `daedalus_queue_init`:
  - OUTPUT: V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,
    V4L2_PIX_FMT_VP9_FRAME, `vb2_vmalloc_memops`.
  - CAPTURE: V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE,
    V4L2_PIX_FMT_NV12M, `vb2_vmalloc_memops`.
- Full `v4l2_ioctl_ops` table: querycap / enum_fmt /
  g_fmt / s_fmt / try_fmt for both queues; reqbufs / querybuf /
  qbuf / dqbuf / create_bufs / prepare_buf / expbuf /
  streamon / streamoff via the `v4l2_m2m_ioctl_*` helpers.
- `v4l2_m2m_ops.device_run`: pulls the next OUTPUT buf,
  kmaps the bitstream, builds REQ_DECODE inline (capped at
  `DAEDALUS_PROTO_MAX_PAYLOAD - sizeof(struct
  daedalus_req_decode)`), enqueues to the chardev with a
  cookie, stores `{ctx, src_buf, dst_buf}` in a per-device
  inflight list.  Job stays open until RESP_FRAME comes back.
- `daedalus_complete_resp_frame` (called from the chardev
  write path): pops the inflight entry, memcpys inline NV12
  pixels into the CAPTURE vb2 buffer (Y plane + interleaved
  CbCr), finishes the m2m job via
  `v4l2_m2m_buf_done_and_job_finish` so both buffers complete
  cleanly and the scheduler doesn't immediately re-run
  device_run on the same src.

### Kernel header (`kernel/daedalus_v4l2_main.h`)
- New private header so the chardev source can reach
  `daedalus_complete_resp_frame()` without `extern`-only
  declarations.
- Contains the canonical `struct daedalus_dev` definition
  (was previously inline in main.c).

### Kernel (`kernel/daedalus_v4l2_chardev.c`)
- RESP_FRAME handler now passes the inline pixel payload to
  `daedalus_complete_resp_frame` so it can land in the
  CAPTURE buffer.  The previous Phase 8.4 path (debugfs
  `test_decode` injection) still works, just hits a
  ratelimited `unknown cookie` log line because it bypasses
  the V4L2 m2m queue.

### Daemon (`daemon/src/decoder.c`, `decoder.h`)
- `daedalus_decoder_run_request` signature extended with
  `(nv12_out, nv12_cap, nv12_used)`.  After the FNV-1a digest
  it packs the decoded YUV420P planes into NV12 in the
  caller's buffer (Y plane line-by-line stripped of stride
  padding; CbCr interleaved into the chroma plane).
- Truncation silent — kernel only memcpys what fits in the
  CAPTURE plane.

### Daemon (`daemon/src/chardev_client.c`)
- `handle_req_decode` allocates a response buffer sized
  `sizeof(resp) + (MAX_PAYLOAD - sizeof(resp))`, lets the
  decoder fill the pixel area, then sends the full payload
  (struct + pixels) via `send_response`.

### Test harness (`tools/test_m2m_decode.c`)
- New: minimal V4L2 m2m client that drives one full
  QBUF/DQBUF round-trip.  Used for end-to-end verification
  in this phase.  v4l2-ctl could substitute eventually,
  but its `--stream-from-hdr` format isn't compatible with
  a raw VP9 frame; a small custom client is the cleanest
  test until we add framing.

## Verification

Built clean (kernel `make`, daemon `cmake --build`, tools
`make`).  All `-Wall -Wextra` warning-free.

### End-to-end round-trip

```
$ ffmpeg -f lavfi -i 'testsrc=duration=0.04:size=128x96:rate=25' \
       -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 -y /tmp/vp9_small.ivf
$ python3 strip-ivf-header  → /tmp/vp9_small_kf.bin (1566 B)

$ sudo insmod kernel/daedalus_v4l2.ko
$ daedalus_v4l2_daemon -v daemon &
$ sudo ./tools/test_m2m_decode /tmp/vp9_small_kf.bin /tmp/out_m2m.nv12 128 96

  loaded bitstream: 1566 bytes
  OUTPUT sizeimage = 65524
  CAPTURE planes = 2, [0].sizeimage=12288 [1].sizeimage=6144
  OUTPUT REQBUFS -> 2
  CAPTURE REQBUFS -> 2
  QBUF OUTPUT[0] bytesused=1566
  QBUF CAPTURE[0]
  STREAMON both
  poll revents=0x5
  DQBUF OUTPUT[0] flags=0x4001          # V4L2_BUF_FLAG_DONE
  DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144]
  wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12
  OK

daemon log:
  REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes
  decoder: opened vp9 context
  decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe luma=12288 chroma=6144 nv12=18432
```

### Pixel correctness

```
$ ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12
$ cmp out_m2m.nv12 ref.nv12
$ echo $?
0
```

**Byte-for-byte match against `ffmpeg -pix_fmt nv12`.**  Whole
18432-byte NV12 frame matches exactly — the full kernel ↔
daemon ↔ FFmpeg pipeline produces the same pixels as a plain
FFmpeg CLI decode.

### v4l2-compliance

```
$ sudo v4l2-compliance -d /dev/video0
…
Detected Stateless Decoder

Required ioctls:                test VIDIOC_QUERYCAP: OK
                                test invalid ioctls: OK
Allow for multiple opens:       all OK
Format ioctls:                  VIDIOC_S_FMT FAIL  (colorspace mismatch)
Codec ioctls:                   VIDIOC_(TRY_)DECODER_CMD FAIL
                                (stateless decoder requires media controller
                                 OR decoder commands; we have neither)
Buffer ioctls:                  REQBUFS/CREATE_BUFS/QUERYBUF/REMOVE_BUFS/EXPBUF: OK
```

Two expected fails:

- **S_FMT colorspace**: `try_fmt` always rewrites colorspace to
  REC709 from the canonical fill helper.  Should preserve the
  userspace-supplied value when valid.  Trivial fix; lands in 8.6.
- **DECODER_CMD / media controller**: stateless decoders are
  required by spec to provide either a media controller (for the
  request API) OR decoder commands (`V4L2_DEC_CMD_*`).  We have
  neither — the daemon handles per-frame state internally via
  FFmpeg, so we never needed the request API.  Phase 8.6 adds
  the media controller binding when AV1/H.264 controls land.

### Clean teardown

```
$ pkill -TERM daedalus_v4l2_daemon       # SIGTERM, daemon exits cleanly
$ sudo rmmod daedalus_v4l2               # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```

No kernel oops / WARN traces from the production flow.  The
initial run had a bug — see [Bugs found and fixed] below.

## Design decisions

### Why vb2_vmalloc instead of vb2_dma_contig?

Two reasons:

1. **No DMA needed in Phase 8.5.**  device_run reads bitstream
   bytes via `vb2_plane_vaddr` and the chardev path writes
   decoded pixels via `memcpy`.  No hardware DMA touches these
   buffers, so vb2_dma_contig's CMA pressure buys us nothing
   yet.
2. **Phase 8.6 will switch CAPTURE to vb2_dma_contig** for
   dmabuf-export — at that point the daemon mmaps the dmabuf
   directly and writes pixels in place, bypassing the chardev
   payload entirely.  Doing the switch in Phase 8.6 (rather
   than now) keeps each phase's scope clean: 8.5 = "real V4L2
   m2m flow", 8.6 = "stop truncating big frames + add more
   codecs + add request API".

### Why inline pixels in RESP_FRAME?

The same 64 KiB cap as REQ_DECODE.  For Phase 8.5 we want to
prove the QBUF/DQBUF round-trip works without introducing
dmabuf-fd passing (which needs `dma_buf_fd` in the daemon's
task context + a new chardev ioctl + daemon-side mmap of the
returned fd — all real work, but orthogonal to the m2m
verification).  Phase 8.6 adds the dmabuf path.

For now: small frames (≤ ~256×192 NV12 = 73 KiB; safely
128×96 = 18 KiB) work end-to-end.  The pixel-match test
above proves the path is bit-exact, not just "approximately
right."

### Why a custom test client instead of v4l2-ctl?

v4l2-ctl's `--stream-from-hdr` expects a v4l2-ctl-specific
length-prefix format (which I tried with a `>I` length prefix
and got "Unknown header ID" — there's apparently a magic
ID byte we didn't supply).  Writing a 20-line test harness
that does the V4L2 ioctls directly is faster than reverse-
engineering v4l2-ctl's framing, and the harness stays useful
for regression tests after the per-frame format work in 8.6.

### Cookie collision between debugfs and V4L2 paths

Both `daedalus_decode_cookie` (in chardev.c, for debugfs
`test_decode`) and `daedalus_cookie_seq` (in main.c, for m2m
device_run) are independent atomics starting from 0.  After a
fresh insmod, both begin issuing cookie=1 → collisions are
likely in mixed-use scenarios.  This is harmless today —
debugfs is a test fixture and doesn't have a V4L2 inflight to
complete, so RESP_FRAME for "debugfs cookies" just logs
"unknown cookie" and moves on.  Phase 8.6 either unifies the
counter or makes debugfs use cookies with bit 31 set so the
two namespaces don't overlap.

## Bugs found and fixed during the phase

### Bug 1: device_run re-runs on same src buf, eventual stop_streaming oops

**Symptom:** test client logged TWO REQ_DECODE messages for one
QBUF.  After teardown, dmesg showed:

```
lr : daedalus_stop_streaming+0x3c/0x80 [daedalus_v4l2]
```

**Cause:** `daedalus_complete_resp_frame` called
`v4l2_m2m_buf_done(src_buf, ...)` + `v4l2_m2m_buf_done(dst_buf,
...)` + `v4l2_m2m_job_finish(...)`.  This marks the vb2 buffers
as DONE but does NOT pop them off the m2m's internal src/dst
queues.  The scheduler immediately re-runs device_run with the
same still-queued src buf — which then double-frees on
stop_streaming.

**Fix:** use `v4l2_m2m_buf_done_and_job_finish` — the canonical
helper that pops both buffers AND marks them done AND finishes
the job in one atomic sequence.  Caught on the first end-to-end
run; second run was clean.

### Bug 2: missing `<media/v4l2-event.h>`

`v4l2_event_unsubscribe` undeclared.  Trivial; added the
include.

### Bug 3 (build): missing `<stdint.h>` in tools/

The test harness used `(uint32_t)` casts without including
`<stdint.h>`.  Added.

## What's NOT here (deferred to 8.6)

- **Per-frame dmabuf export.**  CAPTURE buffers come back
  through inline pixel data in RESP_FRAME today; ≤ 64 KiB cap
  rules out 1080p.
- **V4L2 stateless controls.**  No
  `V4L2_CID_STATELESS_VP9_FRAME` etc. — the daemon parses VP9
  headers itself.  Compliance complains accordingly.
- **Media controller.**  v4l2-compliance flags this as the
  "stateless decoder requires media controller OR decoder cmds"
  failure.
- **Colorspace round-trip in TRY_FMT.**  Documented compliance
  failure; trivial fix.
- **AV1 + H.264 codec contexts.**  Phase 8.6.

## Phase 8.6 plan

1. dmabuf-export on CAPTURE: switch mem_ops to
   `vb2_dma_contig_memops`; add `DAEDALUS_IOC_GET_DMABUF` on
   the chardev that calls `vb2_core_expbuf` in daemon context.
2. Daemon mmaps the dmabuf, decodes into it directly,
   RESP_FRAME carries metadata only.
3. Add AV1 + H.264 to the decoder's codec switch (FFmpeg
   already supports both).
4. Add V4L2 stateless controls (VP9_FRAME, then AV1_FRAME,
   then H264_SLICE_PARAMS) — these can be NULL-handled by the
   daemon initially (it just ignores them since FFmpeg parses
   on its own), but adding them satisfies the spec / compliance.
5. Media controller binding via `v4l2_m2m_register_media_controller`.
6. Fix the S_FMT colorspace preservation.
7. Fix the cookie namespace collision.