# Phase 8.6 closure: dmabuf + AV1 + H.264 + stateless controls

**Status:** closed 2026-05-18.

Phase 8.5 shipped a working V4L2 m2m driver but capped frame size at 64 KiB
because RESP_FRAME carried inline pixel data through the chardev. Phase 8.6
removes that cap by exporting CAPTURE buffers as dmabuf-fds the daemon mmaps
and writes directly into. Along the way: AV1 + H.264 codecs land, V4L2
stateless controls register so libva-v4l2-request will recognise us as a real
stateless decoder, and the remaining v4l2-compliance edges get polished (47/48
passing now; only DECODER_CMD remains, which wants a media controller and is
deferred to a follow-up).

## What lands

### Protocol (`include/daedalus_v4l2_proto.h`)

- `struct daedalus_req_decode` grew capture-buffer metadata: `capture_width`,
  `capture_height`, `capture_pix_fmt`, `capture_num_planes`, per-plane
  `size[3]` + `stride[3]`, `flags`. The daemon needs these to call
  `DAEDALUS_IOC_GET_DMABUF` and to know how to lay out NV12 in the mapped
  plane.
- New `struct daedalus_get_dmabuf` + `DAEDALUS_IOC_GET_DMABUF` ioctl
  (`_IOWR('D', 1, ...)`). Takes `(cookie, plane, flags)`, returns an fd
  installed in the calling task's fd table.

### Kernel m2m driver (`kernel/daedalus_v4l2_main.c`)

- **Both queues switched to `vb2_dma_contig_memops`**. OUTPUT was vmalloc in
  Phase 8.5; the switch makes `V4L2_MEMORY_FLAG_NON_COHERENT` work on REQBUFS
  (vmalloc doesn't honour the flag and v4l2-compliance rejected the driver
  because of it). We still read the bitstream via `vb2_plane_vaddr`:
  dma_contig provides a kernel virtual address just like vmalloc did.
- `dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))` in probe so the
  platform device satisfies vb2_dma_contig's parent-device requirement.
- `queue_setup` now populates `alloc_devs[plane] = &pdev->dev` for both queues.
- Both queues set `allow_cache_hints = 1` and `io_modes` expanded to include
  `VB2_DMABUF` (CAPTURE also gains it for the daemon's mmap path).
- New `daedalus_export_capture_dmabuf(cookie, plane, flags, out_fd)`: walks the
  inflight list, calls `vb2_core_expbuf` on the CAPTURE buffer in the caller's
  (daemon's) task context. Backs the new chardev ioctl.
- `device_run` extended to fill the new REQ_DECODE capture fields from
  `ctx->dst_fmt`, and to map `ctx->src_fmt.pixelformat` →
  `DAEDALUS_CODEC_VP9 / _AV1 / _H264` (was hard-wired to VP9 in 8.5).
- `daedalus_complete_resp_frame` now handles both the Phase 8.5 inline-pixel
  path (still in tree for debugging / back-compat) and the Phase 8.6 dmabuf
  path where `pixels_len == 0` and pixels already live in the CAPTURE buffer.
  It sets plane payload from the daemon's metadata either way.
- `enum_fmt` advertises all three OUTPUT formats: `VP9_FRAME`, `AV1_FRAME`,
  `H264_SLICE`.
- `try_fmt` preserves userspace-supplied colorspace / xfer / ycbcr_enc /
  quantization across the fill helper (was rewriting to REC709 defaults;
  v4l2-compliance failed the TRY_FMT colorspace round-trip in 8.5).
- `s_fmt` propagates OUTPUT colorspace to CAPTURE so a follow-up G_FMT on
  CAPTURE returns matching values (v4l2-test-formats.cpp:958 stateless-decoder
  round-trip).
- Twelve V4L2 stateless controls registered per open: `STATELESS_VP9_FRAME`,
  `STATELESS_VP9_COMPRESSED_HDR`,
  `STATELESS_H264_{SPS,PPS,SCALING_MATRIX,PRED_WEIGHTS,SLICE_PARAMS,DECODE_PARAMS}`,
  `STATELESS_AV1_{FRAME,SEQUENCE,TILE_GROUP_ENTRY,FILM_GRAIN}`. Daemon ignores
  values (FFmpeg parses headers itself); the registration is what makes
  libva-v4l2-request see us.

### Kernel chardev (`kernel/daedalus_v4l2_chardev.c`)

- New `unlocked_ioctl` handler dispatching `DAEDALUS_IOC_GET_DMABUF` to
  `daedalus_export_capture_dmabuf`.
- debugfs `test_decode` cookies unified with the m2m `daedalus_next_cookie()`
  allocator so the two namespaces never collide.
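
To make the ioctl ABI described above concrete, here is a minimal userspace sketch. The struct layout is hypothetical: this note only states that the call takes `(cookie, plane, flags)` and returns an fd, so the field widths, field order, and the `reserved` padding are assumptions, and the real definition lives in `include/daedalus_v4l2_proto.h`. `daedalus_flags_allow_write` is an illustrative helper, not a function from the tree; it only models the `flags & O_ACCMODE` extraction that decides the dma_buf's access mode.

```c
#define _POSIX_C_SOURCE 200809L   /* for O_CLOEXEC under strict -std=c11 */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* HYPOTHETICAL layout: widths, order and padding are assumptions. */
struct daedalus_get_dmabuf {
    uint64_t cookie;    /* in:  identifies the in-flight decode request  */
    uint32_t plane;     /* in:  CAPTURE plane index                      */
    uint32_t flags;     /* in:  open(2)-style flags, e.g. O_RDWR|O_CLOEXEC */
    int32_t  fd;        /* out: dmabuf fd installed in the caller's table */
    uint32_t reserved;  /* assumed padding so the size is a multiple of 8 */
};

/* Catch accidental compiler padding in the assumed layout. */
static_assert(sizeof(struct daedalus_get_dmabuf) == 24, "unexpected padding");

/* Matches the `_IOWR('D', 1, ...)` encoding quoted in the protocol notes. */
#define DAEDALUS_IOC_GET_DMABUF _IOWR('D', 1, struct daedalus_get_dmabuf)

/*
 * Illustrative helper: the access mode is the low bits of the flags word,
 * and O_RDONLY is 0, so passing only O_CLOEXEC means "read-only".
 */
static int daedalus_flags_allow_write(uint32_t flags)
{
    int acc = (int)flags & O_ACCMODE;
    return acc == O_RDWR || acc == O_WRONLY;
}
```

With this model, `daedalus_flags_allow_write(O_CLOEXEC)` is false and `daedalus_flags_allow_write(O_RDWR | O_CLOEXEC)` is true, which is the difference between a read-only export and one the daemon can map for writing.
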

### Daemon

- **New** `daemon/src/dmabuf_capture.{c,h}`:
  - `daedalus_capture_planes_open(chardev_fd, cookie, req, out)`: ioctl
    GET_DMABUF per plane with `O_RDWR | O_CLOEXEC`, mmap each plane, populate
    `{fd, base, size, stride}` per plane. O_RDWR is essential:
    `vb2_core_expbuf` uses `flags & O_ACCMODE` to choose the dma_buf's access
    mode, and the default O_RDONLY makes a writable mmap fail with `EACCES`.
  - `daedalus_capture_planes_close(p)`: munmap + close each plane.
- `daemon/src/decoder.c`:
  - Lazily opens AV1 + H.264 AVCodecContexts in addition to VP9 (the existing
    -ENOSYS stubs were dropped; FFmpeg decodes all three via dlopen).
  - `pack_nv12_to_planes`: writes Y line-by-line into `planes[0].base` with
    `planes[0].stride`, interleaves Cb/Cr into `planes[1].base` with
    `planes[1].stride`. Strips source stride padding from `fr->linesize[*]`;
    respects destination strides so libav alignment doesn't corrupt the V4L2
    client's view.
  - `daedalus_decoder_run_request` signature simplified: takes the mapped
    planes instead of a wire buffer.
- `daemon/src/chardev_client.c`:
  - `handle_req_decode` opens dmabuf planes, runs decode (writing pixels
    straight into the mapped CAPTURE buffer), closes planes, sends a
    metadata-only RESP_FRAME. No wire-pixel allocation at all.
- `daemon/CMakeLists.txt`: includes `dmabuf_capture.c`.

### Test harness (`tools/test_m2m_decode.c`)

- Optional 5th argument `codec` (vp9 | av1 | h264) selects the OUTPUT fourcc.
  Same client drives all three codecs.
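
The stride handling that `pack_nv12_to_planes` is described as doing can be sketched without FFmpeg. This is a simplified stand-in, not the function from `daemon/src/decoder.c`: it assumes an 8-bit planar 4:2:0 source (separate Y, Cb, Cr planes, as a YUV420P frame would provide), and `struct plane_map` and `pack_yuv420p_to_nv12` are hypothetical names standing in for the per-plane `{base, size, stride}` record and the real packer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the per-plane record for a mapped CAPTURE plane. */
struct plane_map {
    uint8_t *base;    /* start of the mapping        */
    size_t   size;    /* mapping length in bytes     */
    size_t   stride;  /* destination bytes per line  */
};

/*
 * Pack planar 4:2:0 into two-plane NV12.
 * src[0..2] are Y, Cb, Cr; src_stride[] may include alignment padding.
 * Returns 0 on success, -1 if a destination plane is too small.
 */
static int pack_yuv420p_to_nv12(const uint8_t *const src[3],
                                const size_t src_stride[3],
                                unsigned width, unsigned height,
                                const struct plane_map dst[2])
{
    assert(width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);

    unsigned cw = width / 2;   /* chroma samples per line */
    unsigned ch = height / 2;  /* chroma lines            */

    /* The last line needs only `width` bytes, not a full stride. */
    if (dst[0].stride < width ||
        dst[0].size < dst[0].stride * (height - 1) + width)
        return -1;
    if (dst[1].stride < width ||
        dst[1].size < dst[1].stride * (ch - 1) + width)
        return -1;

    /* Y: copy only the visible bytes, dropping source padding. */
    for (unsigned y = 0; y < height; y++)
        memcpy(dst[0].base + (size_t)y * dst[0].stride,
               src[0] + (size_t)y * src_stride[0], width);

    /* CbCr: interleave one Cb and one Cr byte per chroma sample. */
    for (unsigned y = 0; y < ch; y++) {
        const uint8_t *cb = src[1] + (size_t)y * src_stride[1];
        const uint8_t *cr = src[2] + (size_t)y * src_stride[2];
        uint8_t *out = dst[1].base + (size_t)y * dst[1].stride;
        for (unsigned x = 0; x < cw; x++) {
            out[2 * x]     = cb[x];
            out[2 * x + 1] = cr[x];
        }
    }
    return 0;
}
```

Keeping the source and destination strides separate is the point: the decoder's line alignment and the V4L2 client's negotiated `bytesperline` need not agree, and bytes past `width` in each destination line are left untouched.
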

## Verification

### Bit-exact end-to-end vs reference FFmpeg

| Codec | Resolution | Bitstream | Decoded NV12 | vs `ffmpeg -pix_fmt nv12` |
|-------|------------|-----------|--------------|---------------------------|
| VP9   | 1920×1080  | 12,690 B  | 3,110,400 B  | **byte-for-byte match ✓** |
| AV1   | 128×96     | 456 B     | 18,432 B     | **byte-for-byte match ✓** |
| H.264 | 128×96     | 1,863 B   | 18,432 B     | **byte-for-byte match ✓** |

VP9 1080p decode went through the full dmabuf path with no chardev payload
bloat: the same chardev that capped at 64 KiB in Phase 8.5 now ferries
metadata only and lets the daemon mmap+write a 3.1 MB frame directly into the
V4L2 client's buffer.

### v4l2-compliance

```
Total for daedalus_v4l2 device /dev/video0: 48, Succeeded: 47, Failed: 1
```

| Phase | Pass  | Fail | Delta                        |
|-------|-------|------|------------------------------|
| 8.1   | 44/48 | 4    | baseline                     |
| 8.5   | 44/48 | 4    | (different fails, m2m wired) |
| 8.6   | 47/48 | 1    | +3 pass                      |

The one remaining fail is `VIDIOC_(TRY_)DECODER_CMD`:

```
fail: v4l2-test-codecs.cpp(105): (node->codec_mask & STATELESS_DECODER) && !node->has_media
```

Compliance requires stateless decoders to bind a media controller (for the
request API). That's the next milestone; not addressed in 8.6.
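
The decoded-NV12 byte counts in the bit-exact table follow from the NV12 layout itself: a full-resolution Y plane plus a half-height interleaved CbCr plane of the same width. The helper below is an illustrative sketch (`nv12_frame_bytes` is a hypothetical name, and it assumes 8-bit samples with no stride padding) that reproduces those figures.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes in a tightly packed 8-bit NV12 frame.
 * Assumes even dimensions and stride == width (no line padding).
 */
static size_t nv12_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    size_t luma   = width * height;        /* one byte per pixel           */
    size_t chroma = width * (height / 2);  /* Cb+Cr pairs, half the lines  */
    return luma + chroma;
}
```

`nv12_frame_bytes(1920, 1080)` gives 3,110,400 and `nv12_frame_bytes(128, 96)` gives 18,432, matching the table; the 1080p figure is far above the 65,536-byte chardev payload cap that Phase 8.5 ran into.
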

### Format + control enumeration

```
$ v4l2-ctl -d /dev/video0 --list-formats-out
[0]: 'VP9F' (VP9 Frame, compressed)
[1]: 'AV1F' (AV1 Frame, compressed)
[2]: 'S264' (H.264 Parsed Slice Data, compressed)

$ v4l2-ctl -d /dev/video0 -l
h264_sequence_parameter_set (has-payload)
h264_picture_parameter_set (has-payload)
h264_scaling_matrix (has-payload)
h264_prediction_weight_table (has-payload)
h264_slice_parameters (has-payload)
h264_decode_parameters (has-payload)
vp9_frame_decode_parameters (has-payload)
vp9_probabilities_updates (has-payload)
av1_sequence_parameters (has-payload)
av1_frame_parameters (has-payload)
av1_film_grain (has-payload)
Standard Compound Controls: 11
```

(`av1_tile_group_entry` got refused by `hdl->error` on this kernel: the v6.12
headers expose the CID but the v4l2-core control table doesn't include it, so
we just skip and log at debug.)

### Clean shutdown

```
$ pkill -TERM daedalus_v4l2_daemon   # daemon exits cleanly
$ sudo rmmod daedalus_v4l2           # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)
```

## Design decisions

### Why dmabuf-export over inline pixels?

The chardev payload is capped at 64 KiB to keep kmalloc allocations under the
slab's reliability ceiling. A 1080p NV12 frame is 3.1 MB; 4K is 12.4 MB. No
reasonable cap on inline pixels covers real-world content.

dmabuf gives us a zero-copy hand-off:

- Kernel allocates CAPTURE buffers via CMA (vb2_dma_contig).
- Daemon mmaps via the dmabuf-fd we hand it.
- Daemon's decode-loop writes pixels straight into the V4L2 client's CAPTURE
  buffer.

No daemon-side allocation, no chardev memcpy, no payload limit.

### Why both queues on vb2_dma_contig?

Pure dma_contig everywhere simplifies the `queue_setup` plane-allocator
assignment (no "OUTPUT is vmalloc / CAPTURE is dma_contig" special-case). More
importantly, `V4L2_MEMORY_FLAG_NON_COHERENT` only works on memops that
implement the non-coherent allocation path. vb2_vmalloc doesn't, so
v4l2-compliance's REQBUFS test fails when any queue is on vmalloc.
Switching OUTPUT to dma_contig closed the compliance gap without any
functional cost (the kernel still gets a kernel virtual address out of
`vb2_plane_vaddr`).

### Why O_RDWR in the ioctl call?

`vb2_core_expbuf` extracts the access mode from `flags & O_ACCMODE` and passes
it to the memop's `get_dmabuf`. The default (O_RDONLY) creates a read-only
dma_buf, and userspace mmap with PROT_WRITE then fails with `EACCES`. Caught
on first run.

### Why a unified cookie allocator?

The Phase 8.5 closure flagged a debug-time confusion: debugfs `test_decode`
and m2m `device_run` had independent atomic counters, both starting at 0 after
each insmod. When both got used in the same boot, RESP_FRAME logs hit "unknown
cookie" for one path's responses arriving at the other's namespace. One shared
`daedalus_next_cookie()` makes the logs deterministic and the namespaces
non-colliding.

### Why register stateless controls if we ignore the values?

libva-v4l2-request and other production clients probe `VIDIOC_QUERYCTRL` to
verify the codec is supported. Without the controls registered, the driver
appears as a "non-stateless" decoder to those clients and they refuse to drive
it. Our daemon ignores the per-buffer values because FFmpeg re-parses the
bitstream anyway, but the controls need to be visible. Phase 8.7+ wires them
to a real codec state machine if we ever want to skip FFmpeg's reparse.

## Bugs found and fixed during the phase

### Bug 1: mmap EACCES on the daemon's CAPTURE dmabuf

Daemon called `ioctl(GET_DMABUF, flags=O_CLOEXEC)`. `vb2_core_expbuf` treated
the omitted access mode as O_RDONLY and exported a read-only dma_buf;
`mmap(PROT_READ | PROT_WRITE, MAP_SHARED)` failed with `EACCES`.

Fix: `O_RDWR | O_CLOEXEC`.

### Bug 2: REQBUFS rejection on V4L2_MEMORY_FLAG_NON_COHERENT

v4l2-compliance set the flag; vb2 rejected because:

1. OUTPUT queue used vb2_vmalloc which doesn't honour the flag.
2. `allow_cache_hints = 0` (default) on both queues blocked the flag at the
   vb2 layer.

Fix: switch OUTPUT to vb2_dma_contig + set `allow_cache_hints = 1` on both
queues.

### Bug 3: TRY_FMT clobbered userspace colorspace

`daedalus_fill_*_fmt` always set `colorspace = V4L2_COLORSPACE_REC709`.
Compliance fed in V4L2_COLORSPACE_SMPTE170M and expected it to round-trip
through TRY_FMT and propagate to CAPTURE on S_FMT.

Fix: preserve the four colorspace fields in TRY_FMT (only default-fill when
zero) and copy them OUTPUT → CAPTURE on S_FMT.

## What's NOT here (deferred)

- **Media controller binding** (`v4l2_m2m_register_media_controller`). This is
  what the last DECODER_CMD compliance failure needs. Roadmap puts it in
  Phase 8.7 alongside performance work, since it brings real complexity (media
  graph topology, request-API entity wiring) but no end-user functionality on
  its own.
- **HDR / 10-bit pixel formats.** CAPTURE is NV12M only. P010M etc. need the
  AVPixFmtDescriptor walker in `pack_nv12_to_planes` to grow a depth-aware
  pack path.
- **QPU dispatch substitution.** The daemon decodes entirely on CPU via
  FFmpeg. Substituting per-block `daedalus_dispatch_*` calls into FFmpeg's hot
  path needs FFmpeg-internal block-walker hooks, which is orthogonal to the
  V4L2 m2m work and lives in Phase 8.7.

## Phase 8.7 plan

1. Media controller integration (closes the last compliance fail).
2. Performance work toward the 30fps@1080p user-facing target: profile the
   daemon's per-frame cost; substitute `daedalus_dispatch_*` for FFmpeg's
   per-block calls where the kernel matches.
3. HDR / 10-bit support (P010M CAPTURE, depth-aware pack_nv12_to_planes).
4. Long-form multi-frame streaming tests (not just one keyframe at a time).