Files

T

marfrit c7f6fb90cb Phase 8.6: dmabuf + AV1 + H.264 + stateless controls

Removes the Phase 8.5 64 KiB frame-size cap by exporting CAPTURE
buffers as dmabuf-fds the daemon mmaps and writes pixels into
directly. Adds AV1 + H.264 codec support, V4L2 stateless control
registration, and the compliance polish that brings the driver
to 47/48 v4l2-compliance pass.

Protocol (include/daedalus_v4l2_proto.h):
- struct daedalus_req_decode grew capture-buffer metadata
  (width/height/pix_fmt/num_planes + per-plane size+stride).
- New DAEDALUS_IOC_GET_DMABUF ioctl on the chardev: daemon
  asks for a per-plane dmabuf fd, kernel calls vb2_core_expbuf
  in daemon task context so the fd lands in the daemon's table.

Kernel m2m driver (kernel/daedalus_v4l2_main.c):
- Both queues switched to vb2_dma_contig_memops. OUTPUT was
  vmalloc in 8.5; the switch is needed because vmalloc doesn't
  honour V4L2_MEMORY_FLAG_NON_COHERENT and v4l2-compliance's
  REQBUFS test rejected the driver because of it. We still
  read bitstream via vb2_plane_vaddr (dma_contig gives a
  kernel virtual address just like vmalloc did).
- dma_coerce_mask_and_coherent(DMA_BIT_MASK(32)) in probe.
- queue_setup populates alloc_devs[plane] = &pdev->dev for
  both queues; allow_cache_hints=1 on both.
- daedalus_export_capture_dmabuf(cookie, plane, flags, *fd):
  walks inflight list, calls vb2_core_expbuf on the CAPTURE
  buffer in the caller's (daemon's) task context.
- device_run fills the new REQ_DECODE capture fields from
  ctx->dst_fmt and maps ctx->src_fmt.pixelformat to
  DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9).
- daedalus_complete_resp_frame handles both the 8.5 inline
  path (kept for debugging) and the 8.6 dmabuf path (pixels
  already in CAPTURE buffer, just set payload from metadata).
- enum_fmt advertises all 3 OUTPUT formats (VP9F, AV1F, S264).
- try_fmt preserves userspace colorspace fields instead of
  overwriting with REC709 defaults (fixes 8.5 compliance fail).
- s_fmt propagates OUTPUT colorspace → CAPTURE (stateless
  decoder round-trip test at v4l2-test-formats.cpp:958).
- 12 V4L2 stateless controls registered per open (VP9_FRAME,
  VP9_COMPRESSED_HDR, H264_SPS/PPS/SCALING/PRED_WEIGHTS/
  SLICE_PARAMS/DECODE_PARAMS, AV1_FRAME/SEQUENCE/
  TILE_GROUP_ENTRY/FILM_GRAIN). Daemon ignores values (FFmpeg
  re-parses); registration is what makes libva-v4l2-request
  see us.

Kernel chardev (kernel/daedalus_v4l2_chardev.c):
- New unlocked_ioctl dispatching DAEDALUS_IOC_GET_DMABUF to
  daedalus_export_capture_dmabuf.
- debugfs test_decode cookies unified with the m2m cookie
  allocator via shared daedalus_next_cookie() — kills the
  Phase 8.5 namespace collision.

Daemon (daemon/src/...):
- New dmabuf_capture.{c,h}: GET_DMABUF + mmap each plane on
  REQ_DECODE; munmap + close on completion. O_RDWR | O_CLOEXEC
  is essential — vb2_core_expbuf extracts O_ACCMODE from flags
  and exports read-only by default (caught on first run; mmap
  -EACCES on PROT_WRITE).
- decoder.{c,h}: lazily opens AV1 + H.264 AVCodecContexts in
  addition to VP9 (dropped the -ENOSYS stubs). pack_nv12_to_planes
  writes Y line-by-line into planes[0] with planes[0].stride;
  interleaves Cb/Cr into planes[1] with planes[1].stride.
- chardev_client.c handle_req_decode: opens dmabuf planes,
  runs decode (pixels land in CAPTURE buffer directly), closes
  planes, sends metadata-only RESP_FRAME. No wire-pixel
  allocation.

Test harness (tools/test_m2m_decode.c):
- Optional 5th arg `codec` (vp9 | av1 | h264). Same client
  drives all three codecs.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

Bit-exact end-to-end vs `ffmpeg -pix_fmt nv12`:
  VP9   1920x1080  3,110,400 bytes  MATCH
  AV1     128x96      18,432 bytes  MATCH
  H.264   128x96      18,432 bytes  MATCH

VP9 1080p went through the full dmabuf path with no chardev
payload bloat — the same chardev that capped at 64 KiB in 8.5
now ferries metadata only and lets the daemon mmap+write a
3.1 MB frame directly into the V4L2 client's buffer.

v4l2-compliance:
  Phase 8.1: 44/48
  Phase 8.5: 44/48 (different fails after m2m landed)
  Phase 8.6: 47/48
  Only remaining: VIDIOC_(TRY_)DECODER_CMD (needs media
  controller — explicitly Phase 8.7 work).

11 standard compound controls visible:
  vp9_frame_decode_parameters, vp9_probabilities_updates,
  h264_sequence_parameter_set, h264_picture_parameter_set,
  h264_scaling_matrix, h264_prediction_weight_table,
  h264_slice_parameters, h264_decode_parameters,
  av1_sequence_parameters, av1_frame_parameters,
  av1_film_grain (av1_tile_group_entry refused by hdl->error
  on this kernel — skipped silently).

Clean SIGTERM + rmmod, no oops/WARN.

Roadmap update (docs/roadmap.md):
- Phase 8.6 marked closed with the closure-doc reference.
- Phase 8.7 reshaped to (1) media controller, (2) perf +
  daedalus_dispatch_* substitution, (3) HDR/10-bit, (4)
  long-form multi-frame streaming.

Per correctness-before-speed:
- Real V4L2 dmabuf via vb2_core_expbuf (not a sideband
  fd-passing hack).
- O_RDWR access mode threaded through correctly.
- Strict pixel-byte comparison against ffmpeg, not "looks
  right" eyeballing.
- Each compliance edge documented with the underlying test
  source-line + the fix.
- All resource paths cleaned (munmap + close per plane on
  every exit, including error paths).

Phase 8.7 next: media controller binding (closes last
compliance fail), per-frame profiling, QPU dispatch
substitution targeting 30fps@1080p from
30fps-floor-is-fine memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-18 16:16:06 +00:00

12 KiB

Raw Permalink Blame History

Phase 8.6 closure — dmabuf + AV1 + H.264 + stateless controls

Status: closed 2026-05-18.

Phase 8.5 shipped a working V4L2 m2m driver but capped frame size at 64 KiB because RESP_FRAME carried inline pixel data through the chardev. Phase 8.6 removes that cap by exporting CAPTURE buffers as dmabuf-fds the daemon mmaps and writes directly into. Along the way: AV1 + H.264 codecs land, V4L2 stateless controls register so libva-v4l2-request will recognise us as a real stateless decoder, and the remaining v4l2-compliance edges get polished (47/48 passing now, only DECODER_CMD remains — that wants a media controller, deferred to a follow-up).

What lands

Protocol (`include/daedalus_v4l2_proto.h`)

struct daedalus_req_decode grew capture-buffer metadata: capture_width, capture_height, capture_pix_fmt, capture_num_planes, per-plane size[3] + stride[3], flags. The daemon needs these to call DAEDALUS_IOC_GET_DMABUF and to know how to lay out NV12 in the mapped plane.
New struct daedalus_get_dmabuf + DAEDALUS_IOC_GET_DMABUF ioctl (_IOWR('D', 1, ...)). Takes (cookie, plane, flags), returns an fd installed in the calling task's fd table.

Kernel m2m driver (`kernel/daedalus_v4l2_main.c`)

Both queues switched to vb2_dma_contig_memops. OUTPUT was vmalloc in Phase 8.5; the switch makes V4L2_MEMORY_FLAG_NON_COHERENT work on REQBUFS (vmalloc doesn't honour the flag and v4l2-compliance rejected the driver because of it). We still read the bitstream via vb2_plane_vaddr — dma_contig provides a kernel virtual address just like vmalloc did.
dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)) in probe so the platform device satisfies vb2_dma_contig's parent-device requirement.
queue_setup now populates alloc_devs[plane] = &pdev->dev for both queues.
Both queues set allow_cache_hints = 1 and io_modes expanded to include VB2_DMABUF (CAPTURE also gains it for the daemon's mmap path).
New daedalus_export_capture_dmabuf(cookie, plane, flags, out_fd): walks the inflight list, calls vb2_core_expbuf on the CAPTURE buffer in the caller's (daemon's) task context. Backs the new chardev ioctl.
device_run extended to fill the new REQ_DECODE capture fields from ctx->dst_fmt, and to map ctx->src_fmt.pixelformat → DAEDALUS_CODEC_VP9 / _AV1 / _H264 (was hard-wired to VP9 in 8.5).
daedalus_complete_resp_frame now handles both the Phase 8.5 inline-pixel path (still in tree for debugging / back-compat) and the Phase 8.6 dmabuf path where pixels_len == 0 and pixels already live in the CAPTURE buffer — sets plane payload from the daemon's metadata either way.
enum_fmt advertises all three OUTPUT formats: VP9_FRAME, AV1_FRAME, H264_SLICE.
try_fmt preserves userspace-supplied colorspace / xfer / ycbcr_enc / quantization across the fill helper (was rewriting to REC709 defaults; v4l2-compliance failed the TRY_FMT colorspace round-trip in 8.5).
s_fmt propagates OUTPUT colorspace to CAPTURE so a follow-up G_FMT on CAPTURE returns matching values (v4l2-test-formats.cpp:958 stateless-decoder round-trip).
Twelve V4L2 stateless controls registered per open: STATELESS_VP9_FRAME, STATELESS_VP9_COMPRESSED_HDR, STATELESS_H264_{SPS,PPS,SCALING_MATRIX,PRED_WEIGHTS, SLICE_PARAMS,DECODE_PARAMS}, STATELESS_AV1_{FRAME,SEQUENCE,TILE_GROUP_ENTRY,FILM_GRAIN}. Daemon ignores values (FFmpeg parses headers itself); the registration is what makes libva-v4l2-request see us.

Kernel chardev (`kernel/daedalus_v4l2_chardev.c`)

New unlocked_ioctl handler dispatching DAEDALUS_IOC_GET_DMABUF to daedalus_export_capture_dmabuf.
debugfs test_decode cookies unified with the m2m daedalus_next_cookie() allocator so the two namespaces never collide.

Daemon

New daemon/src/dmabuf_capture.{c,h}:
- daedalus_capture_planes_open(chardev_fd, cookie, req, out): ioctl GET_DMABUF per plane with O_RDWR | O_CLOEXEC (O_RDWR is essential — vb2_core_expbuf uses flags & O_ACCMODE to choose the dma_buf's access mode, default O_RDONLY gives mmap -EACCES on writes), mmap each plane, populate {fd, base, size, stride} per plane.
- daedalus_capture_planes_close(p): munmap + close each plane.
daemon/src/decoder.c:
- Lazily opens AV1 + H.264 AVCodecContexts in addition to VP9 (the existing -ENOSYS stubs were dropped — FFmpeg decodes all three via dlopen).
- pack_nv12_to_planes: writes Y line-by-line into planes[0].base with planes[0].stride, interleaves Cb/Cr into planes[1].base with planes[1].stride. Strips source stride padding from fr->linesize[*]; respects destination strides so libav alignment doesn't corrupt the V4L2 client's view.
- daedalus_decoder_run_request signature simplified — takes the mapped planes instead of a wire buffer.
daemon/src/chardev_client.c:
- handle_req_decode opens dmabuf planes, runs decode (writing pixels straight into the mapped CAPTURE buffer), closes planes, sends a metadata-only RESP_FRAME. No wire-pixel allocation at all.
daemon/CMakeLists.txt: includes dmabuf_capture.c.

Test harness (`tools/test_m2m_decode.c`)

Optional 5th argument codec (vp9 | av1 | h264) selects the OUTPUT fourcc. Same client drives all three codecs.

Verification

Bit-exact end-to-end vs reference FFmpeg

Codec	Frame	Bitstream	Decoded NV12	vs `ffmpeg -pix_fmt nv12`
VP9	1920×1080	12690 B	3,110,400 B	byte-for-byte match ✓
AV1	128×96	456 B	18,432 B	byte-for-byte match ✓
H.264	128×96	1863 B	18,432 B	byte-for-byte match ✓

VP9 1080p decode went through the full dmabuf path with no chardev payload bloat — the same chardev that capped at 64 KiB in Phase 8.5 now ferries metadata only and lets the daemon mmap+write a 3.1 MB frame directly into the V4L2 client's buffer.

v4l2-compliance

Total for daedalus_v4l2 device /dev/video0: 48, Succeeded: 47, Failed: 1

Phase	Pass	Fail	Delta
8.1	44/48	4	baseline
8.5	44/48	4	(different fails — m2m wired)
8.6	47/48	1	+3 pass

The one remaining fail is VIDIOC_(TRY_)DECODER_CMD:

fail: v4l2-test-codecs.cpp(105):
  (node->codec_mask & STATELESS_DECODER) && !node->has_media

Compliance requires stateless decoders to bind a media controller (for the request API). That's the next milestone; not addressed in 8.6.

Format + control enumeration

$ v4l2-ctl -d /dev/video0 --list-formats-out
        [0]: 'VP9F' (VP9 Frame, compressed)
        [1]: 'AV1F' (AV1 Frame, compressed)
        [2]: 'S264' (H.264 Parsed Slice Data, compressed)

$ v4l2-ctl -d /dev/video0 -l
  h264_sequence_parameter_set            (has-payload)
  h264_picture_parameter_set             (has-payload)
  h264_scaling_matrix                    (has-payload)
  h264_prediction_weight_table           (has-payload)
  h264_slice_parameters                  (has-payload)
  h264_decode_parameters                 (has-payload)
  vp9_frame_decode_parameters            (has-payload)
  vp9_probabilities_updates              (has-payload)
  av1_sequence_parameters                (has-payload)
  av1_frame_parameters                   (has-payload)
  av1_film_grain                         (has-payload)

  Standard Compound Controls: 11

(av1_tile_group_entry got refused by hdl->error on this kernel — the v6.12 headers expose the CID but the v4l2-core control table doesn't include it, so we just skip and log at debug.)

Clean shutdown

$ pkill -TERM daedalus_v4l2_daemon   # daemon exits cleanly
$ sudo rmmod daedalus_v4l2           # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)

Design decisions

Why dmabuf-export over inline pixels?

The chardev payload is capped at 64 KiB to keep kmalloc allocations under the slab's reliability ceiling. A 1080p NV12 frame is 3.1 MB; 4K is 12.4 MB. No reasonable cap on inline pixels covers real-world content.

dmabuf gives us a zero-copy hand-off:

Kernel allocates CAPTURE buffers via CMA (vb2_dma_contig).
Daemon mmaps via the dmabuf-fd we hand it.
Daemon's decode-loop writes pixels straight into the V4L2 client's CAPTURE buffer. No daemon-side allocation, no chardev memcpy, no payload limit.

Why both queues on vb2_dma_contig?

Pure dma_contig everywhere simplifies the queue_setup plane-allocator assignment (no "OUTPUT is vmalloc / CAPTURE is dma_contig" special-case). More importantly, V4L2_MEMORY_FLAG_NON_COHERENT only works on memops that implement the non-coherent allocation path — vb2_vmalloc doesn't, so v4l2-compliance's REQBUFS test fails when any queue is on vmalloc. Switching OUTPUT to dma_contig closed the compliance gap without any functional cost (the kernel still gets a kernel-virtual-address out of vb2_plane_vaddr).

Why O_RDWR in the ioctl call?

vb2_core_expbuf extracts the access mode from flags & O_ACCMODE and passes it to the memop's get_dmabuf. Default (O_RDONLY) creates a read-only dma_buf and userspace gets mmap -EACCES on PROT_WRITE. Caught on first run.

The Phase 8.5 closure flagged a debug-time confusion: debugfs test_decode and m2m device_run had independent atomic counters, both starting at 0 after each insmod. When both got used in the same boot, RESP_FRAME logs hit "unknown cookie" for one path's responses arriving at the other's namespace. One shared daedalus_next_cookie() makes the logs deterministic and the namespaces non-colliding.

Why register stateless controls if we ignore the values?

libva-v4l2-request and other production clients probe VIDIOC_QUERYCTRL to verify the codec is supported. Without the controls registered, the driver appears as a "non-stateless" decoder to those clients and they refuse to drive it. Our daemon ignores the per-buffer values because FFmpeg re-parses the bitstream anyway — but the controls need to be visible. Phase 8.7+ wires them to a real codec state machine if we ever want to skip FFmpeg's reparse.

Bugs found and fixed during the phase

Bug 1: mmap EACCES on the daemon's CAPTURE dmabuf

Daemon called ioctl(GET_DMABUF, flags=O_CLOEXEC). vb2_core_expbuf treated the omitted access mode as O_RDONLY and exported a read-only dma_buf; mmap(PROT_READ | PROT_WRITE, MAP_SHARED) returned -EACCES. Fix: O_RDWR | O_CLOEXEC.

Bug 2: REQBUFS rejection on V4L2_MEMORY_FLAG_NON_COHERENT

v4l2-compliance set the flag; vb2 rejected because:

OUTPUT queue used vb2_vmalloc which doesn't honour the flag.
allow_cache_hints = 0 (default) on both queues blocked the flag at the vb2 layer.

Fix: switch OUTPUT to vb2_dma_contig + set allow_cache_hints = 1 on both queues.

Bug 3: TRY_FMT clobbered userspace colorspace

daedalus_fill_*_fmt always set colorspace = V4L2_COLORSPACE_REC709. Compliance fed in V4L2_COLORSPACE_SMPTE170M and expected it to round-trip through TRY_FMT and propagate to CAPTURE on S_FMT. Fix: preserve the four colorspace fields in TRY_FMT (only default-fill when zero) and copy them OUTPUT → CAPTURE on S_FMT.

What's NOT here (deferred)

Media controller binding (v4l2_m2m_register_media_controller). This is what the last DECODER_CMD compliance failure needs. Roadmap puts it in Phase 8.7 alongside performance work, since it brings real complexity (media graph topology, request-API entity wiring) but no end-user functionality on its own.
HDR / 10-bit pixel formats. CAPTURE is NV12M only. P010M etc. need the AVPixFmtDescriptor walker in pack_nv12_to_planes to grow a depth-aware pack path.
QPU dispatch substitution. The daemon decodes entirely on CPU via FFmpeg. Substituting per-block daedalus_dispatch_* calls into FFmpeg's hot path needs FFmpeg-internal block-walker hooks — orthogonal to the V4L2 m2m work and lives in Phase 8.7.

Phase 8.7 plan

Media controller integration (closes the last compliance fail).
Performance work toward the 30fps@1080p user-facing target: profile the daemon's per-frame cost; substitute daedalus_dispatch_* for FFmpeg's per-block calls where the kernel matches.
HDR / 10-bit support (P010M CAPTURE, depth-aware pack_nv12_to_planes).
Long-form multi-frame streaming tests (not just one keyframe at a time).

12 KiB Raw Permalink Blame History Unescape Escape