Files
daedalus-v4l2/docs/phase_8_6_closure.md
T
marfrit c7f6fb90cb Phase 8.6: dmabuf + AV1 + H.264 + stateless controls
Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE
buffers as dmabuf-fds the daemon mmaps and writes pixels into
directly. Adds AV1 + H.264 codec support, V4L2 stateless control
registration, and the compliance polish that brings the driver
to 47/48 v4l2-compliance pass.

Protocol (include/daedalus_v4l2_proto.h):
- struct daedalus_req_decode grew capture-buffer metadata
  (width/height/pix_fmt/num_planes + per-plane size+stride).
- New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon
  asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf
  in daemon task context so the fd lands in the daemon's table.

Kernel m2m driver (kernel/daedalus_v4l2_main.c):
- Both queues switched to vb2_dma_contig_memops. OUTPUT was
  vmalloc in 8.5; the switch is needed because vmalloc doesn't
  honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's
  REQBUFS test rejected the driver because of it. We still
  read bitstream via vb2_plane_vaddr (dma_contig gives a
  kernel virtual address just like vmalloc did).
- dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe.
- queue_setup populates alloc_devs[plane] = &pdev->dev for
  both queues; allow_cache_hints=1 on both.
- daedalus_export_capture_dmabuf(cookie, plane, flags, *fd):
  walks inflight list, calls vb2_core_expbuf on the CAPTURE
  buffer in the caller's (daemon's) task context.
- device_run fills the new REQ_DECODE capture fields from
  ctx->dst_fmt and maps ctx->src_fmt.pixelformat to
  DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9).
- daedalus_complete_resp_frame handles both the 8.5 inline
  path (kept for debugging) and the 8.6 dmabuf path (pixels
  already in CAPTURE buffer, just set payload from metadata).
- enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264).
- try_fmt preserves userspace colorspace fields instead of
  overwriting with REC709 defaults (fixes 8.5 compliance fail).
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless
  decoder round-trip test at v4l2-test-formats.cpp:958).
- 12 V4L2 stateless controls registered per open (VP9_FRAME,
  VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/
  SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/
  TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg
  re-parses); registration is what makes libva-v4l2-request
  see us.

Kernel chardev (kernel/daedalus_v4l2_chardev.c):
- New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to
  daedalus_export_capture_dmabuf.
- debugfs test_decode cookies unified with the m2m cookie
  allocator via shared daedalus_next_cookie() — kills the
  Phase 8.5 namespace collision.

Daemon (daemon/src/...):
- New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on
  REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC
  is essential — vb2_core_expbuf extracts O_ACCMODE from flags
  and exports read-only by default (caught on first run; mmap
  -EACCES on PROT_WRITE).
- decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in
  addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes
  writes Y line-by-line into planes[0] with planes[0].stride;
  interleaves Cb/Cr into planes[1] with planes[1].stride.
- chardev_client.c handle_req_decode: opens dmabuf planes,
  runs decode (pixels land in CAPTURE buffer directly), closes
  planes, sends metadata-only RESP_FRAME. No wire-pixel
  allocation.

Test harness (tools/test_m2m_decode.c):
- Optional 5th arg `codec` (vp9 | av1 | h264). Same client
  drives all three codecs.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`:
  VP9   1920x1080  3,110,400 bytes  MATCH
  AV1     128x96      18,432 bytes  MATCH
  H.264   128x96      18,432 bytes  MATCH

VP9 1080p went through the full dmabuf path with no chardev
payload bloat — the same chardev that capped at 64 KiB in 8.5
now ferries metadata only and lets the daemon mmap+write a
3.1 MB frame directly into the V4L2 client's buffer.

v4l2-compliance:
  Phase 8.1: 44/48
  Phase 8.5: 44/48 (different fails after m2m landed)
  Phase 8.6: 47/48
  Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media
  controller — explicitly Phase 8.7 work).

11 standard compound controls visible:
  vp9_frame_decode_parameters, vp9_probabilities_updates,
  h264_sequence_parameter_set, h264_picture_parameter_set,
  h264_scaling_matrix, h264_prediction_weight_table,
  h264_slice_parameters, h264_decode_parameters,
  av1_sequence_parameters, av1_frame_parameters,
  av1_film_grain (av1_tile_group_entry refused by hdl->error
  on this kernel — skipped silently).

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- Phase 8.6 marked closed with the closure-doc reference.
- Phase 8.7 reshaped to (1) media controller, (2) perf +
  daedalus_dispatch_* substitution, (3) HDR/10-bit, (4)
  long-form multi-frame streaming.

Per correctness-before-speed:
- Real V4L2 dmabuf via vb2_core_expbuf (not a sideband
  fd-passing hack).
- O_RDWR access mode threaded through correctly.
- Strict pixel-byte comparison against ffmpeg, not "looks
  right" eyeballing.
- Each compliance edge documented with the underlying test
  source-line + the fix.
- All resource paths cleaned (munmap + close per plane on
  every exit, including error paths).

Phase 8.7 next: media controller binding (closes last
compliance fail), per-frame profiling, QPU dispatch
substitution targeting 30fps@1080p from
30fps-floor-is-fine memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:16:06 +00:00

317 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Phase 8.6 closure — dmabuf + AV1 + H.264 + stateless controls
**Status:** closed 2026-05-18.
Phase 8.5 shipped a working V4L2 m2m driver but capped frame
size at 64 KiB because RESP_FRAME carried inline pixel data
through the chardev. Phase 8.6 removes that cap by exporting
CAPTURE buffers as dmabuf-fds the daemon mmaps and writes
directly into. Along the way: AV1 + H.264 codecs land, V4L2
stateless controls register so libva-v4l2-request will
recognise us as a real stateless decoder, and the remaining
v4l2-compliance edges get polished (47/48 passing now, only
DECODER_CMD remains — that wants a media controller, deferred
to a follow-up).
## What lands
### Protocol (`include/daedalus_v4l2_proto.h`)
- `struct daedalus_req_decode` grew capture-buffer metadata:
`capture_width`, `capture_height`, `capture_pix_fmt`,
`capture_num_planes`, per-plane `size[3]` + `stride[3]`,
`flags`. The daemon needs these to call
`DAEDALUS_IOC_GET_DMABUF` and to know how to lay out NV12
in the mapped plane.
- New `struct daedalus_get_dmabuf` + `DAEDALUS_IOC_GET_DMABUF`
ioctl (`_IOWR('D', 1, ...)`). Takes `(cookie, plane,
flags)`, returns an fd installed in the calling task's fd
table.
### Kernel m2m driver (`kernel/daedalus_v4l2_main.c`)
- **Both queues switched to `vb2_dma_contig_memops`**. OUTPUT
was vmalloc in Phase 8.5; the switch makes
`V4L2_MEMORY_FLAG_NON_COHERENT` work on REQBUFS (vmalloc
doesn't honour the flag and v4l2-compliance rejected the
driver because of it). We still read the bitstream via
`vb2_plane_vaddr` — dma_contig provides a kernel virtual
address just like vmalloc did.
- `dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))`
in probe so the platform device satisfies vb2_dma_contig's
parent-device requirement.
- `queue_setup` now populates `alloc_devs[plane] =
&pdev->dev` for both queues.
- Both queues set `allow_cache_hints = 1` and `io_modes`
expanded to include `VB2_DMABUF` (CAPTURE also gains it
for the daemon's mmap path).
- New `daedalus_export_capture_dmabuf(cookie, plane, flags,
out_fd)`: walks the inflight list, calls `vb2_core_expbuf`
on the CAPTURE buffer in the caller's (daemon's) task
context. Backs the new chardev ioctl.
- `device_run` extended to fill the new REQ_DECODE capture
fields from `ctx->dst_fmt`, and to map
`ctx->src_fmt.pixelformat` → `DAEDALUS_CODEC_VP9 / _AV1
/ _H264` (was hard-wired to VP9 in 8.5).
- `daedalus_complete_resp_frame` now handles both the Phase
8.5 inline-pixel path (still in tree for debugging /
back-compat) and the Phase 8.6 dmabuf path where
`pixels_len == 0` and pixels already live in the CAPTURE
buffer — sets plane payload from the daemon's metadata
either way.
- `enum_fmt` advertises all three OUTPUT formats:
`VP9_FRAME`, `AV1_FRAME`, `H264_SLICE`.
- `try_fmt` preserves userspace-supplied colorspace / xfer /
ycbcr_enc / quantization across the fill helper (was
rewriting to REC709 defaults; v4l2-compliance failed the
TRY_FMT colorspace round-trip in 8.5).
- `s_fmt` propagates OUTPUT colorspace to CAPTURE so a
follow-up G_FMT on CAPTURE returns matching values
(v4l2-test-formats.cpp:958 stateless-decoder round-trip).
- Twelve V4L2 stateless controls registered per open:
`STATELESS_VP9_FRAME`, `STATELESS_VP9_COMPRESSED_HDR`,
`STATELESS_H264_{SPS,PPS,SCALING_MATRIX,PRED_WEIGHTS,
SLICE_PARAMS,DECODE_PARAMS}`,
`STATELESS_AV1_{FRAME,SEQUENCE,TILE_GROUP_ENTRY,FILM_GRAIN}`.
Daemon ignores values (FFmpeg parses headers itself); the
registration is what makes libva-v4l2-request see us.
### Kernel chardev (`kernel/daedalus_v4l2_chardev.c`)
- New `unlocked_ioctl` handler dispatching
`DAEDALUS_IOC_GET_DMABUF` to
`daedalus_export_capture_dmabuf`.
- debugfs `test_decode` cookies unified with the m2m
`daedalus_next_cookie()` allocator so the two namespaces
never collide.
### Daemon
- **New** `daemon/src/dmabuf_capture.{c,h}`:
- `daedalus_capture_planes_open(chardev_fd, cookie, req,
out)`: ioctl GET_DMABUF per plane with
`O_RDWR | O_CLOEXEC` (O_RDWR is essential —
vb2_core_expbuf uses `flags & O_ACCMODE` to choose the
dma_buf's access mode, default O_RDONLY gives mmap
`-EACCES` on writes), mmap each plane, populate
`{fd, base, size, stride}` per plane.
- `daedalus_capture_planes_close(p)`: munmap + close each
plane.
- `daemon/src/decoder.c`:
- Lazily opens AV1 + H.264 AVCodecContexts in addition to
VP9 (the existing -ENOSYS stubs were dropped — FFmpeg
decodes all three via dlopen).
- `pack_nv12_to_planes`: writes Y line-by-line into
`planes[0].base` with `planes[0].stride`, interleaves
Cb/Cr into `planes[1].base` with `planes[1].stride`.
Strips source stride padding from `fr->linesize[*]`;
respects destination strides so libav alignment doesn't
corrupt the V4L2 client's view.
- `daedalus_decoder_run_request` signature simplified —
takes the mapped planes instead of a wire buffer.
- `daemon/src/chardev_client.c`:
- `handle_req_decode` opens dmabuf planes, runs decode
(writing pixels straight into the mapped CAPTURE
buffer), closes planes, sends a metadata-only
RESP_FRAME. No wire-pixel allocation at all.
- `daemon/CMakeLists.txt`: includes `dmabuf_capture.c`.
### Test harness (`tools/test_m2m_decode.c`)
- Optional 5th argument `codec` (vp9 | av1 | h264) selects
the OUTPUT fourcc. Same client drives all three codecs.
## Verification
### Bit-exact end-to-end vs reference FFmpeg
| Codec | Frame | Bitstream | Decoded NV12 | vs `ffmpeg -pix_fmt nv12` |
|-------|--------|-----------|--------------|---------------------------|
| VP9 | 1920×1080 | 12690 B | 3,110,400 B | **byte-for-byte match ✓** |
| AV1 | 128×96 | 456 B | 18,432 B | **byte-for-byte match ✓** |
| H.264 | 128×96 | 1863 B | 18,432 B | **byte-for-byte match ✓** |
VP9 1080p decode went through the full dmabuf path with no
chardev payload bloat — the same chardev that capped at 64 KiB
in Phase 8.5 now ferries metadata only and lets the daemon
mmap+write a 3.1 MB frame directly into the V4L2 client's
buffer.
### v4l2-compliance
```
Total for daedalus_v4l2 device /dev/video0: 48, Succeeded: 47, Failed: 1
```
| Phase | Pass | Fail | Delta |
|-------|------|------|-------|
| 8.1 | 44/48 | 4 | baseline |
| 8.5 | 44/48 | 4 | (different fails — m2m wired) |
| 8.6 | 47/48 | 1 | +3 pass |
The one remaining fail is `VIDIOC_(TRY_)DECODER_CMD`:
```
fail: v4l2-test-codecs.cpp(105):
(node->codec_mask & STATELESS_DECODER) && !node->has_media
```
Compliance requires stateless decoders to bind a media
controller (for the request API). That's the next milestone;
not addressed in 8.6.
### Format + control enumeration
```
$ v4l2-ctl -d /dev/video0 --list-formats-out
[0]: 'VP9F' (VP9 Frame, compressed)
[1]: 'AV1F' (AV1 Frame, compressed)
[2]: 'S264' (H.264 Parsed Slice Data, compressed)
$ v4l2-ctl -d /dev/video0 -l
h264_sequence_parameter_set (has-payload)
h264_picture_parameter_set (has-payload)
h264_scaling_matrix (has-payload)
h264_prediction_weight_table (has-payload)
h264_slice_parameters (has-payload)
h264_decode_parameters (has-payload)
vp9_frame_decode_parameters (has-payload)
vp9_probabilities_updates (has-payload)
av1_sequence_parameters (has-payload)
av1_frame_parameters (has-payload)
av1_film_grain (has-payload)
Standard Compound Controls: 11
```
(av1_tile_group_entry got refused by `hdl->error` on this
kernel — the v6.12 headers expose the CID but the v4l2-core
control table doesn't include it, so we just skip and log at
debug.)
### Clean shutdown
```
$ pkill -TERM daedalus_v4l2_daemon # daemon exits cleanly
$ sudo rmmod daedalus_v4l2 # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```
## Design decisions
### Why dmabuf-export over inline pixels?
The chardev payload is capped at 64 KiB to keep kmalloc
allocations under the slab's reliability ceiling. A 1080p
NV12 frame is 3.1 MB; 4K is 12.4 MB. No reasonable cap on
inline pixels covers real-world content.
dmabuf gives us a zero-copy hand-off:
- Kernel allocates CAPTURE buffers via CMA (vb2_dma_contig).
- Daemon mmaps via the dmabuf-fd we hand it.
- Daemon's decode-loop writes pixels straight into the V4L2
client's CAPTURE buffer. No daemon-side allocation, no
chardev memcpy, no payload limit.
### Why both queues on vb2_dma_contig?
Pure dma_contig everywhere simplifies the `queue_setup`
plane-allocator assignment (no "OUTPUT is vmalloc / CAPTURE
is dma_contig" special-case). More importantly,
`V4L2_MEMORY_FLAG_NON_COHERENT` only works on memops that
implement the non-coherent allocation path — vb2_vmalloc
doesn't, so v4l2-compliance's REQBUFS test fails when any
queue is on vmalloc. Switching OUTPUT to dma_contig closed
the compliance gap without any functional cost (the kernel
still gets a kernel-virtual-address out of
`vb2_plane_vaddr`).
### Why O_RDWR in the ioctl call?
`vb2_core_expbuf` extracts the access mode from `flags &
O_ACCMODE` and passes it to the memop's `get_dmabuf`.
Default (O_RDONLY) creates a read-only dma_buf and userspace
gets `mmap -EACCES` on PROT_WRITE. Caught on first run.
### Why a unified cookie allocator?
The Phase 8.5 closure flagged a debug-time confusion:
debugfs `test_decode` and m2m `device_run` had independent
atomic counters, both starting at 0 after each insmod. When
both got used in the same boot, RESP_FRAME logs hit "unknown
cookie" for one path's responses arriving at the other's
namespace. One shared `daedalus_next_cookie()` makes the
logs deterministic and the namespaces non-colliding.
### Why register stateless controls if we ignore the values?
libva-v4l2-request and other production clients probe
`VIDIOC_QUERYCTRL` to verify the codec is supported. Without
the controls registered, the driver appears as a
"non-stateless" decoder to those clients and they refuse to
drive it. Our daemon ignores the per-buffer values because
FFmpeg re-parses the bitstream anyway — but the controls
need to be visible. Phase 8.7+ wires them to a real codec
state machine if we ever want to skip FFmpeg's reparse.
## Bugs found and fixed during the phase
### Bug 1: mmap EACCES on the daemon's CAPTURE dmabuf
Daemon called `ioctl(GET_DMABUF, flags=O_CLOEXEC)`.
vb2_core_expbuf treated the omitted access mode as O_RDONLY
and exported a read-only dma_buf; `mmap(PROT_READ | PROT_WRITE,
MAP_SHARED)` returned `-EACCES`. Fix: O_RDWR | O_CLOEXEC.
### Bug 2: REQBUFS rejection on V4L2_MEMORY_FLAG_NON_COHERENT
v4l2-compliance set the flag; vb2 rejected because:
1. OUTPUT queue used vb2_vmalloc which doesn't honour the
flag.
2. `allow_cache_hints = 0` (default) on both queues blocked
the flag at the vb2 layer.
Fix: switch OUTPUT to vb2_dma_contig + set
`allow_cache_hints = 1` on both queues.
### Bug 3: TRY_FMT clobbered userspace colorspace
`daedalus_fill_*_fmt` always set `colorspace =
V4L2_COLORSPACE_REC709`. Compliance fed in
V4L2_COLORSPACE_SMPTE170M and expected it to round-trip
through TRY_FMT and propagate to CAPTURE on S_FMT. Fix:
preserve the four colorspace fields in TRY_FMT (only
default-fill when zero) and copy them OUTPUT → CAPTURE on
S_FMT.
## What's NOT here (deferred)
- **Media controller binding** (`v4l2_m2m_register_media_controller`).
This is what the last DECODER_CMD compliance failure needs.
Roadmap puts it in Phase 8.7 alongside performance work,
since it brings real complexity (media graph topology,
request-API entity wiring) but no end-user functionality
on its own.
- **HDR / 10-bit pixel formats.** CAPTURE is NV12M only.
P010M etc. need the AVPixFmtDescriptor walker in
`pack_nv12_to_planes` to grow a depth-aware pack path.
- **QPU dispatch substitution.** The daemon decodes
entirely on CPU via FFmpeg. Substituting per-block
`daedalus_dispatch_*` calls into FFmpeg's hot path needs
FFmpeg-internal block-walker hooks — orthogonal to the
V4L2 m2m work and lives in Phase 8.7.
## Phase 8.7 plan
1. Media controller integration (closes the last compliance
fail).
2. Performance work toward the 30fps@1080p user-facing
target: profile the daemon's per-frame cost; substitute
`daedalus_dispatch_*` for FFmpeg's per-block calls where
the kernel matches.
3. HDR / 10-bit support (P010M CAPTURE, depth-aware
pack_nv12_to_planes).
4. Long-form multi-frame streaming tests (not just one
keyframe at a time).