# Phase 8.6 closure — dmabuf + AV1 + H.264 + stateless controls

**Status:** closed 2026-05-18.

Phase 8.5 shipped a working V4L2 m2m driver but capped frame
size at 64 KiB because RESP_FRAME carried inline pixel data
through the chardev.  Phase 8.6 removes that cap by exporting
CAPTURE buffers as dmabuf-fds the daemon mmaps and writes
directly into.  Along the way: AV1 + H.264 codecs land, V4L2
stateless controls register so libva-v4l2-request will
recognise us as a real stateless decoder, and the remaining
v4l2-compliance edges get polished (47/48 passing now, only
DECODER_CMD remains — that wants a media controller, deferred
to a follow-up).

## What lands

### Protocol (`include/daedalus_v4l2_proto.h`)
- `struct daedalus_req_decode` grew capture-buffer metadata:
  `capture_width`, `capture_height`, `capture_pix_fmt`,
  `capture_num_planes`, per-plane `size[3]` + `stride[3]`,
  `flags`.  The daemon needs these to call
  `DAEDALUS_IOC_GET_DMABUF` and to know how to lay out NV12
  in the mapped plane.
- New `struct daedalus_get_dmabuf` + `DAEDALUS_IOC_GET_DMABUF`
  ioctl (`_IOWR('D', 1, ...)`).  Takes `(cookie, plane,
  flags)`, returns an fd installed in the calling task's fd
  table.
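
A minimal userspace sketch of how the new ioctl might be declared and called. Only the `(cookie, plane, flags)` inputs, the returned fd, and `_IOWR('D', 1, ...)` come from the header described above. The field widths, the `fd` output member, the `reserved` pad, and the helper name `get_dmabuf_fd` are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Assumed layout: the real struct lives in include/daedalus_v4l2_proto.h. */
struct daedalus_get_dmabuf {
    uint64_t cookie;   /* in:  identifies the in-flight decode request */
    uint32_t plane;    /* in:  CAPTURE plane index */
    uint32_t flags;    /* in:  open flags, e.g. O_RDWR | O_CLOEXEC */
    int32_t  fd;       /* out: dmabuf fd installed in the caller's fd table */
    uint32_t reserved; /* assumed padding to keep the size a multiple of 8 */
};

#define DAEDALUS_IOC_GET_DMABUF _IOWR('D', 1, struct daedalus_get_dmabuf)

/* Hypothetical helper: returns the dmabuf fd, or -1 with errno set. */
int get_dmabuf_fd(int chardev_fd, uint64_t cookie, uint32_t plane)
{
    struct daedalus_get_dmabuf arg = {
        .cookie = cookie,
        .plane  = plane,
        .flags  = O_RDWR | O_CLOEXEC, /* O_RDWR so the later mmap may write */
    };

    if (ioctl(chardev_fd, DAEDALUS_IOC_GET_DMABUF, &arg) < 0)
        return -1;
    return arg.fd;
}
```

With a layout like this, the daemon would call the helper once per CAPTURE plane and mmap each returned fd.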

### Kernel m2m driver (`kernel/daedalus_v4l2_main.c`)
- **Both queues switched to `vb2_dma_contig_memops`**.  OUTPUT
  was vmalloc in Phase 8.5; the switch makes
  `V4L2_MEMORY_FLAG_NON_COHERENT` work on REQBUFS (vmalloc
  doesn't honour the flag and v4l2-compliance rejected the
  driver because of it).  We still read the bitstream via
  `vb2_plane_vaddr` — dma_contig provides a kernel virtual
  address just like vmalloc did.
- `dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))`
  in probe so the platform device satisfies vb2_dma_contig's
  parent-device requirement.
- `queue_setup` now populates `alloc_devs[plane] =
  &pdev->dev` for both queues.
- Both queues set `allow_cache_hints = 1`.  `io_modes` now
  includes `VB2_DMABUF` on OUTPUT, and CAPTURE gains it too
  for the daemon's mmap path.
- New `daedalus_export_capture_dmabuf(cookie, plane, flags,
  out_fd)`: walks the inflight list, calls `vb2_core_expbuf`
  on the CAPTURE buffer in the caller's (daemon's) task
  context.  Backs the new chardev ioctl.
- `device_run` extended to fill the new REQ_DECODE capture
  fields from `ctx->dst_fmt`, and to map
  `ctx->src_fmt.pixelformat` → `DAEDALUS_CODEC_VP9 / _AV1
  / _H264` (was hard-wired to VP9 in 8.5).
- `daedalus_complete_resp_frame` now handles both the Phase
  8.5 inline-pixel path (still in tree for debugging /
  back-compat) and the Phase 8.6 dmabuf path where
  `pixels_len == 0` and pixels already live in the CAPTURE
  buffer — sets plane payload from the daemon's metadata
  either way.
- `enum_fmt` advertises all three OUTPUT formats:
  `VP9_FRAME`, `AV1_FRAME`, `H264_SLICE`.
- `try_fmt` preserves userspace-supplied colorspace / xfer /
  ycbcr_enc / quantization across the fill helper (was
  rewriting to REC709 defaults; v4l2-compliance failed the
  TRY_FMT colorspace round-trip in 8.5).
- `s_fmt` propagates OUTPUT colorspace to CAPTURE so a
  follow-up G_FMT on CAPTURE returns matching values
  (v4l2-test-formats.cpp:958 stateless-decoder round-trip).
- Twelve V4L2 stateless controls registered per open:
  `STATELESS_VP9_FRAME`, `STATELESS_VP9_COMPRESSED_HDR`,
  `STATELESS_H264_{SPS,PPS,SCALING_MATRIX,PRED_WEIGHTS,
  SLICE_PARAMS,DECODE_PARAMS}`,
  `STATELESS_AV1_{FRAME,SEQUENCE,TILE_GROUP_ENTRY,FILM_GRAIN}`.
  Daemon ignores values (FFmpeg parses headers itself); the
  registration is what makes libva-v4l2-request see us.
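
The OUTPUT-format-to-codec mapping that `device_run` performs can be sketched as a plain switch. The FourCC characters are the ones `v4l2-ctl` reports for this driver; the numeric values of the `DAEDALUS_CODEC_*` enum and the function name `codec_from_fourcc` are hypothetical.

```c
#include <stdint.h>

/* Same byte order as the kernel's v4l2_fourcc() macro (little-endian packing). */
#define FOURCC(a, b, c, d)                                   \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) |                  \
     ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

#define PIX_FMT_VP9_FRAME  FOURCC('V', 'P', '9', 'F')
#define PIX_FMT_AV1_FRAME  FOURCC('A', 'V', '1', 'F')
#define PIX_FMT_H264_SLICE FOURCC('S', '2', '6', '4')

/* Hypothetical numeric values; the real ones live in the protocol header. */
enum daedalus_codec {
    DAEDALUS_CODEC_VP9  = 1,
    DAEDALUS_CODEC_AV1  = 2,
    DAEDALUS_CODEC_H264 = 3,
};

/* Returns the codec id for an OUTPUT pixelformat, or -1 if unsupported. */
int codec_from_fourcc(uint32_t pixelformat)
{
    switch (pixelformat) {
    case PIX_FMT_VP9_FRAME:  return DAEDALUS_CODEC_VP9;
    case PIX_FMT_AV1_FRAME:  return DAEDALUS_CODEC_AV1;
    case PIX_FMT_H264_SLICE: return DAEDALUS_CODEC_H264;
    default:                 return -1;
    }
}
```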

### Kernel chardev (`kernel/daedalus_v4l2_chardev.c`)
- New `unlocked_ioctl` handler dispatching
  `DAEDALUS_IOC_GET_DMABUF` to
  `daedalus_export_capture_dmabuf`.
- debugfs `test_decode` cookies unified with the m2m
  `daedalus_next_cookie()` allocator so the two namespaces
  never collide.

### Daemon

- **New** `daemon/src/dmabuf_capture.{c,h}`:
  - `daedalus_capture_planes_open(chardev_fd, cookie, req,
    out)`: ioctl GET_DMABUF per plane with
    `O_RDWR | O_CLOEXEC` (O_RDWR is essential —
    vb2_core_expbuf uses `flags & O_ACCMODE` to choose the
    dma_buf's access mode, default O_RDONLY gives mmap
    `-EACCES` on writes), mmap each plane, populate
    `{fd, base, size, stride}` per plane.
  - `daedalus_capture_planes_close(p)`: munmap + close each
    plane.

- `daemon/src/decoder.c`:
  - Lazily opens AV1 + H.264 AVCodecContexts in addition to
    VP9 (the existing -ENOSYS stubs were dropped — FFmpeg
    decodes all three via dlopen).
  - `pack_nv12_to_planes`: writes Y line-by-line into
    `planes[0].base` with `planes[0].stride`, interleaves
    Cb/Cr into `planes[1].base` with `planes[1].stride`.
    Strips source stride padding from `fr->linesize[*]`;
    respects destination strides so libav alignment doesn't
    corrupt the V4L2 client's view.
  - `daedalus_decoder_run_request` signature simplified —
    takes the mapped planes instead of a wire buffer.

- `daemon/src/chardev_client.c`:
  - `handle_req_decode` opens dmabuf planes, runs decode
    (writing pixels straight into the mapped CAPTURE
    buffer), closes planes, sends a metadata-only
    RESP_FRAME.  No wire-pixel allocation at all.

- `daemon/CMakeLists.txt`: includes `dmabuf_capture.c`.
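
A sketch of the packing step described for `pack_nv12_to_planes`, assuming 8-bit planar YUV 4:2:0 input with positive line sizes. It takes raw pointers instead of an `AVFrame` so it compiles without libav; `struct capture_plane` and the name `pack_yuv420p_to_nv12` are illustrative, not the daemon's actual types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for one mapped CAPTURE plane. */
struct capture_plane {
    uint8_t *base;   /* start of the mapped plane */
    size_t   size;   /* mapped length in bytes */
    size_t   stride; /* destination bytes per row */
};

/* Copies Y row by row and interleaves Cb/Cr into plane 1.
 * Source padding (linesize > width) is dropped; destination strides are honoured.
 * Returns 0 on success, -1 if a destination plane is too small. */
int pack_yuv420p_to_nv12(const uint8_t *const src[3], const int linesize[3],
                         int width, int height, struct capture_plane planes[2])
{
    int cw = (width + 1) / 2;  /* chroma samples per row */
    int ch = (height + 1) / 2; /* chroma rows */

    if (planes[0].stride < (size_t)width ||
        planes[0].size < planes[0].stride * (size_t)height)
        return -1;
    if (planes[1].stride < (size_t)cw * 2 ||
        planes[1].size < planes[1].stride * (size_t)ch)
        return -1;

    for (int y = 0; y < height; y++)
        memcpy(planes[0].base + (size_t)y * planes[0].stride,
               src[0] + (size_t)y * (size_t)linesize[0], (size_t)width);

    for (int y = 0; y < ch; y++) {
        const uint8_t *cb = src[1] + (size_t)y * (size_t)linesize[1];
        const uint8_t *cr = src[2] + (size_t)y * (size_t)linesize[2];
        uint8_t *dst = planes[1].base + (size_t)y * planes[1].stride;

        for (int x = 0; x < cw; x++) {
            dst[2 * x]     = cb[x]; /* NV12 stores Cb first ... */
            dst[2 * x + 1] = cr[x]; /* ... then Cr */
        }
    }
    return 0;
}
```

Checking plane sizes up front keeps a mismatched format from writing past the end of the mapping.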

### Test harness (`tools/test_m2m_decode.c`)
- Optional 5th argument `codec` (vp9 | av1 | h264) selects
  the OUTPUT fourcc.  Same client drives all three codecs.

## Verification

### Bit-exact end-to-end vs reference FFmpeg

| Codec | Frame size | Bitstream | Decoded NV12 | vs `ffmpeg -pix_fmt nv12` |
|-------|------------|-----------|--------------|---------------------------|
| VP9   | 1920×1080  | 12,690 B  | 3,110,400 B  | **byte-for-byte match ✓** |
| AV1   | 128×96     | 456 B     | 18,432 B     | **byte-for-byte match ✓** |
| H.264 | 128×96     | 1,863 B   | 18,432 B     | **byte-for-byte match ✓** |

VP9 1080p decode went through the full dmabuf path with no
chardev payload bloat — the same chardev that capped at 64 KiB
in Phase 8.5 now ferries metadata only and lets the daemon
mmap+write a 3.1 MB frame directly into the V4L2 client's
buffer.

### v4l2-compliance

```
Total for daedalus_v4l2 device /dev/video0: 48, Succeeded: 47, Failed: 1
```

| Phase | Pass | Fail | Delta |
|-------|------|------|-------|
| 8.1   | 44/48 | 4 | baseline |
| 8.5   | 44/48 | 4 | (different fails — m2m wired) |
| 8.6   | 47/48 | 1 | +3 pass |

The one remaining fail is `VIDIOC_(TRY_)DECODER_CMD`:

```
fail: v4l2-test-codecs.cpp(105):
  (node->codec_mask & STATELESS_DECODER) && !node->has_media
```

Compliance requires stateless decoders to bind a media
controller (for the request API).  That's the next milestone;
not addressed in 8.6.

### Format + control enumeration

```
$ v4l2-ctl -d /dev/video0 --list-formats-out
        [0]: 'VP9F' (VP9 Frame, compressed)
        [1]: 'AV1F' (AV1 Frame, compressed)
        [2]: 'S264' (H.264 Parsed Slice Data, compressed)

$ v4l2-ctl -d /dev/video0 -l
  h264_sequence_parameter_set            (has-payload)
  h264_picture_parameter_set             (has-payload)
  h264_scaling_matrix                    (has-payload)
  h264_prediction_weight_table           (has-payload)
  h264_slice_parameters                  (has-payload)
  h264_decode_parameters                 (has-payload)
  vp9_frame_decode_parameters            (has-payload)
  vp9_probabilities_updates              (has-payload)
  av1_sequence_parameters                (has-payload)
  av1_frame_parameters                   (has-payload)
  av1_film_grain                         (has-payload)

  Standard Compound Controls: 11
```

(`av1_tile_group_entry` is missing from the list, hence 11 of
the twelve controls: registering it set `hdl->error` on this
kernel.  The v6.12 headers expose the CID but the v4l2-core
control table doesn't include it, so we skip it and log at
debug level.)

### Clean shutdown

```
$ pkill -TERM daedalus_v4l2_daemon   # daemon exits cleanly
$ sudo rmmod daedalus_v4l2           # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```

## Design decisions

### Why dmabuf-export over inline pixels?

The chardev payload is capped at 64 KiB to keep kmalloc
allocations under the slab's reliability ceiling.  A 1080p
NV12 frame is 3.1 MB; 4K is 12.4 MB.  No reasonable cap on
inline pixels covers real-world content.
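
The sizes quoted above follow from NV12's layout: one full-resolution luma plane plus one half-resolution interleaved CbCr plane, which works out to 1.5 bytes per pixel. A small helper (hypothetical name, assuming tightly packed planes with no stride padding) makes the arithmetic explicit:

```c
#include <stddef.h>

/* Bytes in one tightly packed 8-bit NV12 frame. */
size_t nv12_frame_bytes(size_t width, size_t height)
{
    size_t luma   = width * height;
    /* One Cb and one Cr byte per 2x2 block, odd dimensions rounded up. */
    size_t chroma = 2 * ((width + 1) / 2) * ((height + 1) / 2);

    return luma + chroma;
}
```

`nv12_frame_bytes(1920, 1080)` is 3,110,400 bytes and `nv12_frame_bytes(3840, 2160)` is 12,441,600 bytes, both far above the 65,536-byte chardev cap.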

dmabuf gives us a zero-copy hand-off:
- Kernel allocates CAPTURE buffers via CMA (vb2_dma_contig).
- Daemon mmaps via the dmabuf-fd we hand it.
- Daemon's decode-loop writes pixels straight into the V4L2
  client's CAPTURE buffer.  No daemon-side allocation, no
  chardev memcpy, no payload limit.

### Why both queues on vb2_dma_contig?

Pure dma_contig everywhere simplifies the `queue_setup`
plane-allocator assignment (no "OUTPUT is vmalloc / CAPTURE
is dma_contig" special-case).  More importantly,
`V4L2_MEMORY_FLAG_NON_COHERENT` only works on memops that
implement the non-coherent allocation path — vb2_vmalloc
doesn't, so v4l2-compliance's REQBUFS test fails when any
queue is on vmalloc.  Switching OUTPUT to dma_contig closed
the compliance gap without any functional cost (the kernel
still gets a kernel-virtual-address out of
`vb2_plane_vaddr`).

### Why O_RDWR in the ioctl call?

`vb2_core_expbuf` extracts the access mode from `flags &
O_ACCMODE` and passes it to the memop's `get_dmabuf`.
If the caller sets no access mode, the flags read as O_RDONLY,
the export creates a read-only dma_buf, and a userspace `mmap`
with PROT_WRITE fails with `EACCES`.  Caught on first run.
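
The failure mode is not specific to dma_buf: a shared writable mapping needs a file description opened for writing. The sketch below reproduces it with an ordinary temporary file standing in for the exported buffer. The function name and the `/tmp` path are illustrative; the real daemon maps the fd returned by `DAEDALUS_IOC_GET_DMABUF`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Opens a scratch file with open_flags and tries a shared read/write mapping.
 * Returns 0 if the mapping succeeds, otherwise the errno of the failing call. */
int shared_rw_map_errno(int open_flags)
{
    char path[] = "/tmp/daedalus-demo-XXXXXX";
    int tmp = mkstemp(path);
    int err;

    if (tmp < 0)
        return errno;
    if (ftruncate(tmp, 4096) < 0) {
        err = errno;
        close(tmp);
        unlink(path);
        return err;
    }
    close(tmp);

    /* Reopen with the access mode under test, then drop the name. */
    int fd = open(path, open_flags | O_CLOEXEC);
    err = (fd < 0) ? errno : 0;
    unlink(path);
    if (fd < 0)
        return err;

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    err = (p == MAP_FAILED) ? errno : 0;
    if (p != MAP_FAILED)
        munmap(p, 4096);
    close(fd);
    return err;
}
```

On Linux this returns `EACCES` for `O_RDONLY` and 0 for `O_RDWR`, which is the same split the daemon saw between the two flag values.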

### Why a unified cookie allocator?

The Phase 8.5 closure flagged a debug-time confusion:
debugfs `test_decode` and m2m `device_run` had independent
atomic counters, both starting at 0 after each insmod.  When
both got used in the same boot, RESP_FRAME logs hit "unknown
cookie" for one path's responses arriving at the other's
namespace.  One shared `daedalus_next_cookie()` makes the
logs deterministic and the namespaces non-colliding.
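
A userspace model of a single shared allocator, using C11 atomics in place of the kernel's atomic helpers. The name `model_next_cookie`, the 64-bit width, and the choice never to hand out 0 are assumptions; the real `daedalus_next_cookie()` may differ.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One counter for every caller, so no two paths can issue the same value. */
static atomic_uint_fast64_t cookie_counter;

uint64_t model_next_cookie(void)
{
    /* atomic_fetch_add returns the previous value; +1 keeps 0 unused. */
    return (uint64_t)atomic_fetch_add(&cookie_counter, 1) + 1;
}
```

Both the debugfs path and the m2m path would call the same function, so a cookie seen in a RESP_FRAME log maps back to exactly one request.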

### Why register stateless controls if we ignore the values?

libva-v4l2-request and other production clients probe
`VIDIOC_QUERYCTRL` to verify the codec is supported.  Without
the controls registered, the driver appears as a
"non-stateless" decoder to those clients and they refuse to
drive it.  Our daemon ignores the per-buffer values because
FFmpeg re-parses the bitstream anyway, but the controls
need to be visible.  A later phase (8.7+) could wire them to a
real codec state machine if we ever want to skip FFmpeg's reparse.

## Bugs found and fixed during the phase

### Bug 1: mmap EACCES on the daemon's CAPTURE dmabuf

Daemon called `ioctl(GET_DMABUF, flags=O_CLOEXEC)`.
vb2_core_expbuf treated the omitted access mode as O_RDONLY
and exported a read-only dma_buf; `mmap(PROT_READ | PROT_WRITE,
MAP_SHARED)` then failed with `EACCES`.  Fix: pass
`O_RDWR | O_CLOEXEC`.

### Bug 2: REQBUFS rejection on V4L2_MEMORY_FLAG_NON_COHERENT

v4l2-compliance set the flag on REQBUFS; vb2 rejected the request for two reasons:

1. OUTPUT queue used vb2_vmalloc which doesn't honour the
   flag.
2. `allow_cache_hints = 0` (default) on both queues blocked
   the flag at the vb2 layer.

Fix: switch OUTPUT to vb2_dma_contig + set
`allow_cache_hints = 1` on both queues.

### Bug 3: TRY_FMT clobbered userspace colorspace

`daedalus_fill_*_fmt` always set `colorspace =
V4L2_COLORSPACE_REC709`.  Compliance fed in
V4L2_COLORSPACE_SMPTE170M and expected it to round-trip
through TRY_FMT and propagate to CAPTURE on S_FMT.  Fix:
preserve the four colorspace fields in TRY_FMT (only
default-fill when zero) and copy them OUTPUT → CAPTURE on
S_FMT.

## What's NOT here (deferred)

- **Media controller binding** (`v4l2_m2m_register_media_controller`).
  This is what the last DECODER_CMD compliance failure needs.
  Roadmap puts it in Phase 8.7 alongside performance work,
  since it brings real complexity (media graph topology,
  request-API entity wiring) but no end-user functionality
  on its own.
- **HDR / 10-bit pixel formats.**  CAPTURE is NV12M only.
  P010M etc. need the AVPixFmtDescriptor walker in
  `pack_nv12_to_planes` to grow a depth-aware pack path.
- **QPU dispatch substitution.**  The daemon decodes
  entirely on CPU via FFmpeg.  Substituting per-block
  `daedalus_dispatch_*` calls into FFmpeg's hot path needs
  FFmpeg-internal block-walker hooks — orthogonal to the
  V4L2 m2m work and lives in Phase 8.7.

## Phase 8.7 plan

1. Media controller integration (closes the last compliance
   fail).
2. Performance work toward the 30fps@1080p user-facing
   target: profile the daemon's per-frame cost; substitute
   `daedalus_dispatch_*` for FFmpeg's per-block calls where
   the kernel matches.
3. HDR / 10-bit support (P010M CAPTURE, depth-aware
   pack_nv12_to_planes).
4. Long-form multi-frame streaming tests (not just one
   keyframe at a time).