Phase 8.6: dmabuf + AV1 + H.264 + stateless controls
Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE
buffers as dmabuf-fds the daemon mmaps and writes pixels into
directly. Adds AV1 + H.264 codec support, V4L2 stateless control
registration, and the compliance polish that brings the driver
to 47/48 v4l2-compliance pass.
Protocol (include/daedalus_v4l2_proto.h):
- struct daedalus_req_decode grew capture-buffer metadata
(width/height/pix_fmt/num_planes + per-plane size+stride).
- New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon
asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf
in daemon task context so the fd lands in the daemon's table.
Kernel m2m driver (kernel/daedalus_v4l2_main.c):
- Both queues switched to vb2_dma_contig_memops. OUTPUT was
vmalloc in 8.5; the switch is needed because vmalloc doesn't
honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's
REQBUFS test rejected the driver because of it. We still
read bitstream via vb2_plane_vaddr (dma_contig gives a
kernel virtual address just like vmalloc did).
- dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe.
- queue_setup populates alloc_devs[plane] = &pdev->dev for
both queues; allow_cache_hints=1 on both.
- daedalus_export_capture_dmabuf(cookie, plane, flags, *fd):
walks inflight list, calls vb2_core_expbuf on the CAPTURE
buffer in the caller's (daemon's) task context.
- device_run fills the new REQ_DECODE capture fields from
ctx->dst_fmt and maps ctx->src_fmt.pixelformat to
DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9).
- daedalus_complete_resp_frame handles both the 8.5 inline
path (kept for debugging) and the 8.6 dmabuf path (pixels
already in CAPTURE buffer, just set payload from metadata).
- enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264).
- try_fmt preserves userspace colorspace fields instead of
overwriting with REC709 defaults (fixes 8.5 compliance fail).
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless
decoder round-trip test at v4l2-test-formats.cpp:958).
- 12 V4L2 stateless controls registered per open (VP9_FRAME,
VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/
SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/
TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg
re-parses); registration is what makes libva-v4l2-request
see us.
Kernel chardev (kernel/daedalus_v4l2_chardev.c):
- New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to
daedalus_export_capture_dmabuf.
- debugfs test_decode cookies unified with the m2m cookie
allocator via shared daedalus_next_cookie() — kills the
Phase 8.5 namespace collision.
Daemon (daemon/src/...):
- New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on
REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC
is essential — vb2_core_expbuf extracts O_ACCMODE from flags
and exports read-only by default (caught on first run; mmap
-EACCES on PROT_WRITE).
- decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in
addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes
writes Y line-by-line into planes[0] with planes[0].stride;
interleaves Cb/Cr into planes[1] with planes[1].stride.
- chardev_client.c handle_req_decode: opens dmabuf planes,
runs decode (pixels land in CAPTURE buffer directly), closes
planes, sends metadata-only RESP_FRAME. No wire-pixel
allocation.
Test harness (tools/test_m2m_decode.c):
- Optional 5th arg `codec` (vp9 | av1 | h264). Same client
drives all three codecs.
Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):
Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`:
VP9 1920x1080 3,110,400 bytes MATCH
AV1 128x96 18,432 bytes MATCH
H.264 128x96 18,432 bytes MATCH
VP9 1080p went through the full dmabuf path with no chardev
payload bloat — the same chardev that capped at 64 KiB in 8.5
now ferries metadata only and lets the daemon mmap+write a
3.1 MB frame directly into the V4L2 client's buffer.
v4l2-compliance:
Phase 8.1: 44/48
Phase 8.5: 44/48 (different fails after m2m landed)
Phase 8.6: 47/48
Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media
controller — explicitly Phase 8.7 work).
11 standard compound controls visible:
vp9_frame_decode_parameters, vp9_probabilities_updates,
h264_sequence_parameter_set, h264_picture_parameter_set,
h264_scaling_matrix, h264_prediction_weight_table,
h264_slice_parameters, h264_decode_parameters,
av1_sequence_parameters, av1_frame_parameters,
av1_film_grain (av1_tile_group_entry refused by hdl->error
on this kernel — skipped silently).
Clean SIGTERM + rmmod, no oops/WARN.
Roadmap update (docs/roadmap.md):
- Phase 8.6 marked closed with the closure-doc reference.
- Phase 8.7 reshaped to (1) media controller, (2) perf +
daedalus_dispatch_* substitution, (3) HDR/10-bit, (4)
long-form multi-frame streaming.
Per correctness-before-speed:
- Real V4L2 dmabuf via vb2_core_expbuf (not a sideband
fd-passing hack).
- O_RDWR access mode threaded through correctly.
- Strict pixel-byte comparison against ffmpeg, not "looks
right" eyeballing.
- Each compliance edge documented with the underlying test
source-line + the fix.
- All resource paths cleaned (munmap + close per plane on
every exit, including error paths).
Phase 8.7 next: media controller binding (closes last
compliance fail), per-frame profiling, QPU dispatch
substitution targeting 30fps@1080p from
30fps-floor-is-fine memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,316 @@
# Phase 8.6 closure: dmabuf + AV1 + H.264 + stateless controls

**Status:** closed 2026-05-18.

Phase 8.5 shipped a working V4L2 m2m driver but capped frame
size at 64 KiB because RESP_FRAME carried inline pixel data
through the chardev. Phase 8.6 removes that cap by exporting
CAPTURE buffers as dmabuf fds that the daemon mmaps and writes
directly into. Along the way: AV1 + H.264 codecs land, V4L2
stateless controls register so libva-v4l2-request will
recognise us as a real stateless decoder, and the remaining
v4l2-compliance edges get polished (47/48 passing now; only
DECODER_CMD remains, which wants a media controller and is
deferred to a follow-up).

## What lands

### Protocol (`include/daedalus_v4l2_proto.h`)

- `struct daedalus_req_decode` grew capture-buffer metadata:
  `capture_width`, `capture_height`, `capture_pix_fmt`,
  `capture_num_planes`, per-plane `size[3]` + `stride[3]`,
  `flags`. The daemon needs these to call
  `DAEDALUS_IOC_GET_DMABUF` and to know how to lay out NV12
  in the mapped plane.
- New `struct daedalus_get_dmabuf` + `DAEDALUS_IOC_GET_DMABUF`
  ioctl (`_IOWR('D', 1, ...)`). Takes `(cookie, plane, flags)`,
  returns an fd installed in the calling task's fd table.
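
As a rough illustration of the two protocol additions, the header fragment below sketches one possible layout. Only the field names, the struct name `daedalus_get_dmabuf`, and the `_IOWR('D', 1, ...)` encoding come from this document; the field widths, the ordering, the `reserved` padding, and the helper struct name `daedalus_capture_meta` are assumptions, and the real header may differ.

```c
#include <linux/ioctl.h>
#include <linux/types.h>

#define DAEDALUS_MAX_PLANES 3 /* matches size[3] / stride[3] above */

/* Hypothetical grouping of the capture metadata carried by REQ_DECODE.
 * Names follow the closure doc; widths and order are assumed. */
struct daedalus_capture_meta {
	__u32 capture_width;
	__u32 capture_height;
	__u32 capture_pix_fmt;    /* V4L2 fourcc of the CAPTURE format */
	__u32 capture_num_planes;
	__u32 size[DAEDALUS_MAX_PLANES];   /* bytes per plane */
	__u32 stride[DAEDALUS_MAX_PLANES]; /* bytes per line, per plane */
	__u32 flags;
};

/* Hypothetical argument block for the export ioctl: three inputs and
 * one output fd. The reserved word keeps the size a multiple of 8. */
struct daedalus_get_dmabuf {
	__u64 cookie;   /* in: identifies the in-flight decode request */
	__u32 plane;    /* in: plane index */
	__u32 flags;    /* in: O_RDWR | O_CLOEXEC style flags */
	__s32 fd;       /* out: dmabuf fd in the caller's fd table */
	__u32 reserved;
};

#define DAEDALUS_IOC_GET_DMABUF _IOWR('D', 1, struct daedalus_get_dmabuf)
```

`_IOWR` encodes both directions because the kernel reads the request fields and writes `fd` back into the same struct.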

### Kernel m2m driver (`kernel/daedalus_v4l2_main.c`)

- **Both queues switched to `vb2_dma_contig_memops`**. OUTPUT
  was vmalloc in Phase 8.5; the switch makes
  `V4L2_MEMORY_FLAG_NON_COHERENT` work on REQBUFS (vmalloc
  doesn't honour the flag and v4l2-compliance rejected the
  driver because of it). We still read the bitstream via
  `vb2_plane_vaddr`; dma_contig provides a kernel virtual
  address just like vmalloc did.
- `dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))`
  in probe so the platform device satisfies vb2_dma_contig's
  parent-device requirement.
- `queue_setup` now populates `alloc_devs[plane] = &pdev->dev`
  for both queues.
- Both queues set `allow_cache_hints = 1`, and `io_modes` is
  expanded to include `VB2_DMABUF` (CAPTURE also gains it
  for the daemon's mmap path).
- New `daedalus_export_capture_dmabuf(cookie, plane, flags, out_fd)`:
  walks the inflight list, calls `vb2_core_expbuf`
  on the CAPTURE buffer in the caller's (daemon's) task
  context. Backs the new chardev ioctl.
- `device_run` extended to fill the new REQ_DECODE capture
  fields from `ctx->dst_fmt`, and to map
  `ctx->src_fmt.pixelformat` → `DAEDALUS_CODEC_VP9 / _AV1 / _H264`
  (was hard-wired to VP9 in 8.5).
- `daedalus_complete_resp_frame` now handles both the Phase
  8.5 inline-pixel path (still in tree for debugging /
  back-compat) and the Phase 8.6 dmabuf path where
  `pixels_len == 0` and pixels already live in the CAPTURE
  buffer; it sets plane payload from the daemon's metadata
  either way.
- `enum_fmt` advertises all three OUTPUT formats:
  `VP9_FRAME`, `AV1_FRAME`, `H264_SLICE`.
- `try_fmt` preserves userspace-supplied colorspace / xfer /
  ycbcr_enc / quantization across the fill helper (was
  rewriting to REC709 defaults; v4l2-compliance failed the
  TRY_FMT colorspace round-trip in 8.5).
- `s_fmt` propagates OUTPUT colorspace to CAPTURE so a
  follow-up G_FMT on CAPTURE returns matching values
  (v4l2-test-formats.cpp:958 stateless-decoder round-trip).
- Twelve V4L2 stateless controls registered per open:
  `STATELESS_VP9_FRAME`, `STATELESS_VP9_COMPRESSED_HDR`,
  `STATELESS_H264_{SPS,PPS,SCALING_MATRIX,PRED_WEIGHTS,SLICE_PARAMS,DECODE_PARAMS}`,
  `STATELESS_AV1_{FRAME,SEQUENCE,TILE_GROUP_ENTRY,FILM_GRAIN}`.
  Daemon ignores values (FFmpeg parses headers itself); the
  registration is what makes libva-v4l2-request see us.
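
The `device_run` pixelformat-to-codec mapping mentioned above can be pictured with a small standalone sketch. The fourcc values match what `v4l2-ctl` reports for this driver (`VP9F`, `AV1F`, `S264`); the enum's numeric values, the `DAEDALUS_CODEC_UNKNOWN` member, and the function name are assumptions made for illustration, not the driver's actual definitions.

```c
#include <stdint.h>

/* Same byte packing as the kernel's v4l2_fourcc() macro. */
#define FOURCC(a, b, c, d) \
	((uint32_t)(a) | ((uint32_t)(b) << 8) | \
	 ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

#define FMT_VP9_FRAME  FOURCC('V', 'P', '9', 'F')
#define FMT_AV1_FRAME  FOURCC('A', 'V', '1', 'F')
#define FMT_H264_SLICE FOURCC('S', '2', '6', '4')

/* Names follow the closure doc; the numeric values and the UNKNOWN
 * member are assumptions. */
enum daedalus_codec {
	DAEDALUS_CODEC_UNKNOWN = 0,
	DAEDALUS_CODEC_VP9,
	DAEDALUS_CODEC_AV1,
	DAEDALUS_CODEC_H264,
};

/* Map an OUTPUT-queue fourcc to the codec id sent in REQ_DECODE. */
enum daedalus_codec daedalus_codec_from_fourcc(uint32_t pixelformat)
{
	switch (pixelformat) {
	case FMT_VP9_FRAME:
		return DAEDALUS_CODEC_VP9;
	case FMT_AV1_FRAME:
		return DAEDALUS_CODEC_AV1;
	case FMT_H264_SLICE:
		return DAEDALUS_CODEC_H264;
	default:
		return DAEDALUS_CODEC_UNKNOWN;
	}
}
```

Returning an explicit "unknown" value instead of defaulting to VP9 makes a format the driver never advertised fail loudly rather than decode as the wrong codec.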

### Kernel chardev (`kernel/daedalus_v4l2_chardev.c`)

- New `unlocked_ioctl` handler dispatching
  `DAEDALUS_IOC_GET_DMABUF` to
  `daedalus_export_capture_dmabuf`.
- debugfs `test_decode` cookies unified with the m2m
  `daedalus_next_cookie()` allocator so the two namespaces
  never collide.

### Daemon

- **New** `daemon/src/dmabuf_capture.{c,h}`:
  - `daedalus_capture_planes_open(chardev_fd, cookie, req, out)`:
    ioctl GET_DMABUF per plane with `O_RDWR | O_CLOEXEC`, mmap
    each plane, populate `{fd, base, size, stride}` per plane.
    O_RDWR is essential: vb2_core_expbuf uses
    `flags & O_ACCMODE` to choose the dma_buf's access mode,
    and the default O_RDONLY makes a writable mmap fail with
    `EACCES`.
  - `daedalus_capture_planes_close(p)`: munmap + close each
    plane.

- `daemon/src/decoder.c`:
  - Lazily opens AV1 + H.264 AVCodecContexts in addition to
    VP9 (the existing -ENOSYS stubs were dropped; FFmpeg
    decodes all three via dlopen).
  - `pack_nv12_to_planes`: writes Y line-by-line into
    `planes[0].base` with `planes[0].stride`, interleaves
    Cb/Cr into `planes[1].base` with `planes[1].stride`.
    Strips source stride padding from `fr->linesize[*]`;
    respects destination strides so libav alignment doesn't
    corrupt the V4L2 client's view.
  - `daedalus_decoder_run_request` signature simplified:
    it takes the mapped planes instead of a wire buffer.

- `daemon/src/chardev_client.c`:
  - `handle_req_decode` opens dmabuf planes, runs decode
    (writing pixels straight into the mapped CAPTURE
    buffer), closes planes, sends a metadata-only
    RESP_FRAME. No wire-pixel allocation at all.

- `daemon/CMakeLists.txt`: includes `dmabuf_capture.c`.
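
The packing step described for `pack_nv12_to_planes` can be sketched without any libav dependency. The version below assumes the decoder hands back planar 4:2:0 data (separate Y, Cb, Cr planes with their own line sizes); the function name, the `plane_view` struct, and the parameter list are hypothetical stand-ins, not the daemon's actual signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the daemon's per-plane record. */
struct plane_view {
	uint8_t *base;  /* start of the mapped destination plane */
	size_t stride;  /* destination bytes per line */
};

/* Copy planar 4:2:0 (Y, Cb, Cr) into two NV12 planes.
 * Source padding beyond `width` bytes per row is dropped; destination
 * rows are placed `stride` bytes apart and their padding is untouched. */
void pack_yuv420p_to_nv12(const uint8_t *const src[3], const int linesize[3],
			  int width, int height, struct plane_view planes[2])
{
	int chroma_w = (width + 1) / 2;
	int chroma_h = (height + 1) / 2;

	/* Luma: one memcpy per row, source and destination strides differ. */
	for (int row = 0; row < height; row++)
		memcpy(planes[0].base + (size_t)row * planes[0].stride,
		       src[0] + (ptrdiff_t)row * linesize[0], (size_t)width);

	/* Chroma: interleave Cb and Cr samples as CbCr pairs. */
	for (int row = 0; row < chroma_h; row++) {
		const uint8_t *cb = src[1] + (ptrdiff_t)row * linesize[1];
		const uint8_t *cr = src[2] + (ptrdiff_t)row * linesize[2];
		uint8_t *dst = planes[1].base + (size_t)row * planes[1].stride;

		for (int x = 0; x < chroma_w; x++) {
			dst[2 * x] = cb[x];
			dst[2 * x + 1] = cr[x];
		}
	}
}
```

Keeping the source and destination strides as separate inputs is the point of the exercise: the decoder's alignment padding and the V4L2 buffer's line pitch are chosen independently and rarely match.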

### Test harness (`tools/test_m2m_decode.c`)

- Optional 5th argument `codec` (vp9 | av1 | h264) selects
  the OUTPUT fourcc. Same client drives all three codecs.

## Verification

### Bit-exact end-to-end vs reference FFmpeg

| Codec | Resolution | Bitstream | Decoded NV12 | vs `ffmpeg -pix_fmt nv12` |
|-------|------------|-----------|--------------|---------------------------|
| VP9   | 1920×1080  | 12,690 B  | 3,110,400 B  | **byte-for-byte match ✓** |
| AV1   | 128×96     | 456 B     | 18,432 B     | **byte-for-byte match ✓** |
| H.264 | 128×96     | 1,863 B   | 18,432 B     | **byte-for-byte match ✓** |

VP9 1080p decode went through the full dmabuf path with no
chardev payload bloat: the same chardev that capped at 64 KiB
in Phase 8.5 now ferries metadata only and lets the daemon
mmap+write a 3.1 MB frame directly into the V4L2 client's
buffer.
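
The decoded sizes in the table follow directly from the NV12 layout: one full-resolution 8-bit luma plane plus one half-resolution plane of interleaved Cb/Cr pairs. A minimal helper, with a hypothetical name and assuming tightly packed planes with no stride padding, reproduces them:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes in one tightly packed 8-bit NV12 frame.
 * Luma is width*height; chroma is one (Cb, Cr) byte pair per 2x2 block,
 * with odd dimensions rounded up. */
size_t nv12_frame_bytes(uint32_t width, uint32_t height)
{
	size_t luma = (size_t)width * height;
	size_t chroma_w = ((size_t)width + 1) / 2;
	size_t chroma_h = ((size_t)height + 1) / 2;

	return luma + 2 * chroma_w * chroma_h;
}
```

For 1920×1080 this gives 2,073,600 + 1,036,800 = 3,110,400 bytes, and for 128×96 it gives 12,288 + 6,144 = 18,432 bytes, matching the table. Even the small test clips are within a factor of four of the old 64 KiB payload cap, which is why the inline path could never scale.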

### v4l2-compliance

```
Total for daedalus_v4l2 device /dev/video0: 48, Succeeded: 47, Failed: 1
```

| Phase | Pass  | Fail | Delta                        |
|-------|-------|------|------------------------------|
| 8.1   | 44/48 | 4    | baseline                     |
| 8.5   | 44/48 | 4    | (different fails; m2m wired) |
| 8.6   | 47/48 | 1    | +3 pass                      |

The one remaining fail is `VIDIOC_(TRY_)DECODER_CMD`:

```
fail: v4l2-test-codecs.cpp(105):
(node->codec_mask & STATELESS_DECODER) && !node->has_media
```

Compliance requires stateless decoders to bind a media
controller (for the request API). That's the next milestone;
not addressed in 8.6.

### Format + control enumeration

```
$ v4l2-ctl -d /dev/video0 --list-formats-out
[0]: 'VP9F' (VP9 Frame, compressed)
[1]: 'AV1F' (AV1 Frame, compressed)
[2]: 'S264' (H.264 Parsed Slice Data, compressed)

$ v4l2-ctl -d /dev/video0 -l
h264_sequence_parameter_set (has-payload)
h264_picture_parameter_set (has-payload)
h264_scaling_matrix (has-payload)
h264_prediction_weight_table (has-payload)
h264_slice_parameters (has-payload)
h264_decode_parameters (has-payload)
vp9_frame_decode_parameters (has-payload)
vp9_probabilities_updates (has-payload)
av1_sequence_parameters (has-payload)
av1_frame_parameters (has-payload)
av1_film_grain (has-payload)

Standard Compound Controls: 11
```

(`av1_tile_group_entry` was refused via `hdl->error` on this
kernel: the v6.12 headers expose the CID but the v4l2-core
control table doesn't include it, so we skip it and log at
debug. That is why 11 of the 12 registered controls show up.)

### Clean shutdown

```
$ pkill -TERM daedalus_v4l2_daemon # daemon exits cleanly
$ sudo rmmod daedalus_v4l2 # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```

## Design decisions

### Why dmabuf-export over inline pixels?

The chardev payload is capped at 64 KiB to keep kmalloc
allocations under the slab's reliability ceiling. A 1080p
NV12 frame is 3.1 MB; 4K is 12.4 MB. No reasonable cap on
inline pixels covers real-world content.

dmabuf gives us a zero-copy hand-off:

- Kernel allocates CAPTURE buffers via CMA (vb2_dma_contig).
- Daemon mmaps via the dmabuf-fd we hand it.
- Daemon's decode-loop writes pixels straight into the V4L2
  client's CAPTURE buffer. No daemon-side allocation, no
  chardev memcpy, no payload limit.
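
The daemon side of that hand-off amounts to mapping each exported fd read/write and releasing everything again, including on partial failure. The sketch below shows one way to structure that with rollback; `planes_map`, `planes_close`, and `struct mapped_plane` are hypothetical names, and the fds are assumed to have been obtained already from the export ioctl.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct mapped_plane {
	int fd;      /* exported dmabuf fd, or -1 once released */
	void *base;  /* writable mapping, or NULL if not mapped */
	size_t size; /* mapping length in bytes */
};

/* Unmap and close every plane; safe on partially initialised arrays. */
void planes_close(struct mapped_plane *p, int n)
{
	for (int i = 0; i < n; i++) {
		if (p[i].base)
			munmap(p[i].base, p[i].size);
		if (p[i].fd >= 0)
			close(p[i].fd);
		p[i].base = NULL;
		p[i].fd = -1;
	}
}

/* Map n exported fds read/write. Takes ownership of the fds: on any
 * failure all mappings and fds are released and -errno is returned. */
int planes_map(const int *fds, const size_t *sizes, int n,
	       struct mapped_plane *out)
{
	for (int i = 0; i < n; i++) {
		out[i].fd = fds[i];
		out[i].base = NULL;
		out[i].size = sizes[i];
	}
	for (int i = 0; i < n; i++) {
		void *base = mmap(NULL, sizes[i], PROT_READ | PROT_WRITE,
				  MAP_SHARED, fds[i], 0);
		if (base == MAP_FAILED) {
			int err = errno;

			planes_close(out, n);
			return -err;
		}
		out[i].base = base;
	}
	return 0;
}
```

Initialising every record before the first `mmap` is what lets a single `planes_close` call serve as both the normal teardown and the error-path rollback.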

### Why both queues on vb2_dma_contig?

Pure dma_contig everywhere simplifies the `queue_setup`
plane-allocator assignment (no "OUTPUT is vmalloc / CAPTURE
is dma_contig" special-case). More importantly,
`V4L2_MEMORY_FLAG_NON_COHERENT` only works on memops that
implement the non-coherent allocation path; vb2_vmalloc
doesn't, so v4l2-compliance's REQBUFS test fails when any
queue is on vmalloc. Switching OUTPUT to dma_contig closed
the compliance gap without any functional cost (the kernel
still gets a kernel-virtual-address out of
`vb2_plane_vaddr`).

### Why O_RDWR in the ioctl call?

`vb2_core_expbuf` extracts the access mode from
`flags & O_ACCMODE` and passes it to the memop's `get_dmabuf`.
With no access mode given, the value is O_RDONLY, which creates
a read-only dma_buf, and a userspace `mmap` with PROT_WRITE
fails with `EACCES`. Caught on first run.

### Why a unified cookie allocator?

The Phase 8.5 closure flagged a debug-time confusion:
debugfs `test_decode` and m2m `device_run` had independent
atomic counters, both starting at 0 after each insmod. When
both got used in the same boot, RESP_FRAME logs hit "unknown
cookie" for one path's responses arriving at the other's
namespace. One shared `daedalus_next_cookie()` makes the
logs deterministic and the namespaces non-colliding.
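
The idea reduces to a single atomic counter that every request source draws from. The userspace analogue below uses C11 atomics purely for illustration; the in-kernel allocator would use the kernel's own atomic helpers, and both the function name and the choice to start at 1 (leaving 0 free as a "no cookie" value) are assumptions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One counter shared by every caller, so no two requests can ever be
 * handed the same cookie regardless of which path issued them. */
static _Atomic uint64_t cookie_counter;

/* Returns 1, 2, 3, ... across all callers; never returns 0 until the
 * 64-bit counter wraps. */
uint64_t next_cookie_sketch(void)
{
	return atomic_fetch_add(&cookie_counter, 1) + 1;
}
```

Because the increment and the read happen in one atomic operation, concurrent callers on different threads still receive distinct values.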

### Why register stateless controls if we ignore the values?

libva-v4l2-request and other production clients probe
`VIDIOC_QUERYCTRL` to verify the codec is supported. Without
the controls registered, the driver appears as a
"non-stateless" decoder to those clients and they refuse to
drive it. Our daemon ignores the per-buffer values because
FFmpeg re-parses the bitstream anyway, but the controls still
need to be visible. Phase 8.7+ wires them to a real codec
state machine if we ever want to skip FFmpeg's reparse.

## Bugs found and fixed during the phase

### Bug 1: mmap EACCES on the daemon's CAPTURE dmabuf

Daemon called `ioctl(GET_DMABUF, flags=O_CLOEXEC)`.
vb2_core_expbuf treated the omitted access mode as O_RDONLY
and exported a read-only dma_buf, so
`mmap(PROT_READ | PROT_WRITE, MAP_SHARED)` failed with
`EACCES`. Fix: pass `O_RDWR | O_CLOEXEC`.

### Bug 2: REQBUFS rejection on V4L2_MEMORY_FLAG_NON_COHERENT

v4l2-compliance set the flag; vb2 rejected because:

1. OUTPUT queue used vb2_vmalloc which doesn't honour the
   flag.
2. `allow_cache_hints = 0` (default) on both queues blocked
   the flag at the vb2 layer.

Fix: switch OUTPUT to vb2_dma_contig + set
`allow_cache_hints = 1` on both queues.

### Bug 3: TRY_FMT clobbered userspace colorspace

`daedalus_fill_*_fmt` always set
`colorspace = V4L2_COLORSPACE_REC709`. Compliance fed in
V4L2_COLORSPACE_SMPTE170M and expected it to round-trip
through TRY_FMT and propagate to CAPTURE on S_FMT. Fix:
preserve the four colorspace fields in TRY_FMT (only
default-fill when zero) and copy them OUTPUT → CAPTURE on
S_FMT.

## What's NOT here (deferred)

- **Media controller binding** (`v4l2_m2m_register_media_controller`).
  This is what the last DECODER_CMD compliance failure needs.
  Roadmap puts it in Phase 8.7 alongside performance work,
  since it brings real complexity (media graph topology,
  request-API entity wiring) but no end-user functionality
  on its own.
- **HDR / 10-bit pixel formats.** CAPTURE is NV12M only.
  P010M etc. need the AVPixFmtDescriptor walker in
  `pack_nv12_to_planes` to grow a depth-aware pack path.
- **QPU dispatch substitution.** The daemon decodes
  entirely on CPU via FFmpeg. Substituting per-block
  `daedalus_dispatch_*` calls into FFmpeg's hot path needs
  FFmpeg-internal block-walker hooks; that is orthogonal to
  the V4L2 m2m work and lives in Phase 8.7.

## Phase 8.7 plan

1. Media controller integration (closes the last compliance
   fail).
2. Performance work toward the 30fps@1080p user-facing
   target: profile the daemon's per-frame cost; substitute
   `daedalus_dispatch_*` for FFmpeg's per-block calls where
   the kernel matches.
3. HDR / 10-bit support (P010M CAPTURE, depth-aware
   pack_nv12_to_planes).
4. Long-form multi-frame streaming tests (not just one
   keyframe at a time).