Files
daedalus-v4l2/docs/phase_8_5_closure.md
T
marfrit 6f4b580f7c Phase 8.5: full V4L2 m2m driver, VP9 decode via QBUF/DQBUF
Replaces the Phase 8.4 debugfs-triggered chardev path with a
real V4L2 m2m driver. Userspace clients now drive decoding the
standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream)
queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run
packs the bitstream into REQ_DECODE; daemon decodes via FFmpeg;
RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE
buffer. Phase 8.6 swaps the inline payload for dmabuf so big
frames stop being capped at 64 KiB.

Kernel (daedalus_v4l2_main.c, rewritten + main.h added):
- Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler,
  per-queue v4l2_pix_format_mplane.
- Two vb2_queues (vb2_vmalloc_memops for both — no DMA needed
  yet; 8.6 switches CAPTURE to dma_contig for dmabuf-export):
    OUTPUT  = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE,  VP9_FRAME
    CAPTURE = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, NV12M
- Full v4l2_ioctl_ops table: querycap, enum_fmt, g/s/try_fmt
  for both queues, reqbufs/querybuf/qbuf/dqbuf/create_bufs/
  prepare_buf/expbuf/streamon/streamoff via v4l2_m2m_ioctl_*
  helpers.
- v4l2_m2m_ops.device_run: peeks next OUTPUT buf, builds
  REQ_DECODE inline with the bitstream bytes, enqueues with an
  auto-incrementing cookie, stores {ctx, src_buf, dst_buf} in
  a per-device inflight list. Job stays open until RESP_FRAME.
- daedalus_complete_resp_frame(): pops the inflight entry,
  memcpys inline NV12 pixels into the CAPTURE buffer (Y plane
  + interleaved CbCr), finishes via
  v4l2_m2m_buf_done_and_job_finish — NOT plain buf_done +
  job_finish, which leaves the src buf on the m2m queue and
  causes device_run to immediately re-run on the same input
  (caught on first run; second REQ_DECODE for same bitstream +
  eventual oops in stop_streaming on teardown).

Kernel (daedalus_v4l2_chardev.c):
- RESP_FRAME handler now hands inline pixel payload to
  daedalus_complete_resp_frame so it lands in the CAPTURE
  vb2 buffer. Existing PONG and debugfs test_decode paths still
  work; the latter produces a harmless ratelimited "unknown
  cookie" since it bypasses V4L2 m2m.

Daemon (decoder.c, decoder.h):
- daedalus_decoder_run_request signature extended with
  (nv12_out, nv12_cap, nv12_used). After the FNV-1a digest the
  decoder packs YUV420P into NV12 in the caller's buffer: Y
  plane line-by-line stripped of stride padding; Cb/Cr
  interleaved into a single chroma plane. Truncation silent —
  kernel only memcpys what fits in the CAPTURE plane.

Daemon (chardev_client.c):
- handle_req_decode allocates a response buffer sized for the
  full chardev payload, lets decoder fill the pixel area
  after the resp_frame struct, sends the full payload via the
  existing send_response.

Test client (tools/test_m2m_decode.c, new):
- Minimal V4L2 m2m client: S_FMT both queues, REQBUFS 1 each,
  mmap+fill OUTPUT, QBUF both, STREAMON, poll, DQBUF, dump
  CAPTURE planes to a raw NV12 file. ~250 LOC; verifies the
  whole flow without needing v4l2-ctl framing.

Roadmap update (docs/roadmap.md):
- Phase 8.4 retitled "daemon ↔ kernel decode round-trip"
  to reflect what actually shipped (vs. the original V4L2-
  ioctl-driven plan which moved here).
- Phase 8.5 retitled "full V4L2 m2m driver" with closure
  status.
- Phase 8.6 reshaped to two tracks: dmabuf + AV1/H.264/
  stateless controls + media controller. Adds the punch list
  of v4l2-compliance failures (DECODER_CMD, S_FMT colorspace)
  that 8.6 will fix.

Verification on hertz (Pi 5, 6.12.75+rpt-rpi-2712):

  Kernel + daemon build clean (-Wall -Wextra clean both sides).
  Test harness drives one VP9 keyframe end-to-end:
    OUTPUT REQBUFS -> 2
    CAPTURE REQBUFS -> 2
    QBUF OUTPUT[0] bytesused=1566
    QBUF CAPTURE[0]; STREAMON both
    poll revents=0x5
    DQBUF OUTPUT[0] flags=0x4001 (DONE)
    DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144]
    wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12

  Pixel correctness vs reference:
    ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12
    cmp /tmp/out_m2m.nv12 /tmp/ref.nv12 → match ✓
  Byte-for-byte identical to FFmpeg's stock CPU decode.

  v4l2-compliance: detected as Stateless Decoder; most ioctls
  pass; two expected fails documented in closure doc
  (DECODER_CMD/media controller, S_FMT colorspace).

  Clean teardown: SIGTERM the daemon, rmmod the module, no
  oops/WARN in dmesg.

Per correctness-before-speed:
- Real V4L2 ioctl table (not stubs); uses v4l2-core helpers
  where they exist instead of reinventing.
- v4l2_m2m_buf_done_and_job_finish (not the manual sequence)
  to keep scheduler state consistent.
- Bit-exact reference comparison, not just "looks right."
- Documented every compliance failure with the planned fix.
- All resource paths (kmalloc/kfree, inflight list cleanup,
  src/dst buf removal in stop_streaming) handled on every
  error branch.

Phase 8.6 next: dmabuf-export for CAPTURE (removes 64 KiB
frame-size cap), add AV1+H.264 codecs, add V4L2 stateless
controls + media controller binding, fix the colorspace +
cookie-namespace compliance issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 15:55:10 +00:00

11 KiB
Raw Blame History

Phase 8.5 closure — full V4L2 m2m driver, VP9 decode via QBUF/DQBUF

Status: closed 2026-05-18.

Replaces the Phase 8.4 debugfs-triggered chardev path with a real V4L2 m2m driver. Userspace clients now drive decoding the standard way — S_FMT / REQBUFS / QBUF on the OUTPUT (bitstream) queue, DQBUF on the CAPTURE (NV12M) queue. Kernel device_run packs the bitstream into REQ_DECODE; the daemon decodes via FFmpeg; RESP_FRAME's inline NV12 pixel payload lands in the CAPTURE buffer. Phase 8.6 swaps the inline payload for dmabuf so big frames stop being capped at 64 KiB.

What lands

Kernel (kernel/daedalus_v4l2_main.c)

  • Per-open struct daedalus_ctx: v4l2_fh, m2m_ctx, ctrl_handler, per-queue formats.
  • Two vb2_queues via daedalus_queue_init:
    • OUTPUT: V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, V4L2_PIX_FMT_VP9_FRAME, vb2_vmalloc_memops.
    • CAPTURE: V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, V4L2_PIX_FMT_NV12M, vb2_vmalloc_memops.
  • Full v4l2_ioctl_ops table: querycap / enum_fmt / g_fmt / s_fmt / try_fmt for both queues; reqbufs / querybuf / qbuf / dqbuf / create_bufs / prepare_buf / expbuf / streamon / streamoff via the v4l2_m2m_ioctl_* helpers.
  • v4l2_m2m_ops.device_run: pulls the next OUTPUT buf, kmaps the bitstream, builds REQ_DECODE inline (capped at DAEDALUS_PROTO_MAX_PAYLOAD - sizeof(struct daedalus_req_decode)), enqueues to the chardev with a cookie, stores {ctx, src_buf, dst_buf} in a per-device inflight list. Job stays open until RESP_FRAME comes back.
  • daedalus_complete_resp_frame (called from the chardev write path): pops the inflight entry, memcpys inline NV12 pixels into the CAPTURE vb2 buffer (Y plane + interleaved CbCr), finishes the m2m job via v4l2_m2m_buf_done_and_job_finish so both buffers complete cleanly and the scheduler doesn't immediately re-run device_run on the same src.

Kernel header (kernel/daedalus_v4l2_main.h)

  • New private header so the chardev source can reach daedalus_complete_resp_frame() without extern-only declarations.
  • Contains the canonical struct daedalus_dev definition (was previously inline in main.c).

Kernel (kernel/daedalus_v4l2_chardev.c)

  • RESP_FRAME handler now passes the inline pixel payload to daedalus_complete_resp_frame so it can land in the CAPTURE buffer. The previous Phase 8.4 path (debugfs test_decode injection) still works, just hits a ratelimited unknown cookie log line because it bypasses the V4L2 m2m queue.

Daemon (daemon/src/decoder.c, decoder.h)

  • daedalus_decoder_run_request signature extended with (nv12_out, nv12_cap, nv12_used). After the FNV-1a digest it packs the decoded YUV420P planes into NV12 in the caller's buffer (Y plane line-by-line stripped of stride padding; CbCr interleaved into the chroma plane).
  • Truncation silent — kernel only memcpys what fits in the CAPTURE plane.

Daemon (daemon/src/chardev_client.c)

  • handle_req_decode allocates a response buffer sized sizeof(resp) + (MAX_PAYLOAD - sizeof(resp)), lets the decoder fill the pixel area, then sends the full payload (struct + pixels) via send_response.

Test harness (tools/test_m2m_decode.c)

  • New: minimal V4L2 m2m client that drives one full QBUF/DQBUF round-trip. Used for end-to-end verification in this phase. v4l2-ctl could substitute eventually, but its --stream-from-hdr format isn't compatible with a raw VP9 frame; a small custom client is the cleanest test until we add framing.

Verification

Built clean (kernel make, daemon cmake --build, tools make). All -Wall -Wextra warning-free.

End-to-end round-trip

$ ffmpeg -f lavfi -i 'testsrc=duration=0.04:size=128x96:rate=25' \
       -pix_fmt yuv420p -c:v libvpx-vp9 -frames:v 1 -y /tmp/vp9_small.ivf
$ python3 strip-ivf-header  → /tmp/vp9_small_kf.bin (1566 B)

$ sudo insmod kernel/daedalus_v4l2.ko
$ daedalus_v4l2_daemon -v daemon &
$ sudo ./tools/test_m2m_decode /tmp/vp9_small_kf.bin /tmp/out_m2m.nv12 128 96

  loaded bitstream: 1566 bytes
  OUTPUT sizeimage = 65524
  CAPTURE planes = 2, [0].sizeimage=12288 [1].sizeimage=6144
  OUTPUT REQBUFS -> 2
  CAPTURE REQBUFS -> 2
  QBUF OUTPUT[0] bytesused=1566
  QBUF CAPTURE[0]
  STREAMON both
  poll revents=0x5
  DQBUF OUTPUT[0] flags=0x4001          # V4L2_BUF_FLAG_DONE
  DQBUF CAPTURE[0] flags=0x4000 payloads=[12288, 6144]
  wrote 12288 Y + 6144 UV bytes to /tmp/out_m2m.nv12
  OK

daemon log:
  REQ_DECODE cookie=1 codec=1 bitstream=1566 bytes
  decoder: opened vp9 context
  decoder: OK 128x96 fmt=0 (yuv420p) fnv1a=0x1eb34bfe luma=12288 chroma=6144 nv12=18432

Pixel correctness

$ ffmpeg -i vp9_small.ivf -pix_fmt nv12 -f rawvideo -y ref.nv12
$ cmp out_m2m.nv12 ref.nv12
$ echo $?
0

Byte-for-byte match against ffmpeg -pix_fmt nv12. Whole 18432-byte NV12 frame matches exactly — the full kernel ↔ daemon ↔ FFmpeg pipeline produces the same pixels as a plain FFmpeg CLI decode.

v4l2-compliance

$ sudo v4l2-compliance -d /dev/video0
…
Detected Stateless Decoder

Required ioctls:                test VIDIOC_QUERYCAP: OK
                                test invalid ioctls: OK
Allow for multiple opens:       all OK
Format ioctls:                  VIDIOC_S_FMT FAIL  (colorspace mismatch)
Codec ioctls:                   VIDIOC_(TRY_)DECODER_CMD FAIL
                                (stateless decoder requires media controller
                                 OR decoder commands; we have neither)
Buffer ioctls:                  REQBUFS/CREATE_BUFS/QUERYBUF/REMOVE_BUFS/EXPBUF: OK

Two expected fails:

  • S_FMT colorspace: try_fmt always rewrites colorspace to REC709 from the canonical fill helper. Should preserve the userspace-supplied value when valid. Trivial fix; lands in 8.6.
  • DECODER_CMD / media controller: stateless decoders are required by spec to provide either a media controller (for the request API) OR decoder commands (V4L2_DEC_CMD_*). We have neither — the daemon handles per-frame state internally via FFmpeg, so we never needed the request API. Phase 8.6 adds the media controller binding when AV1/H.264 controls land.

Clean teardown

$ pkill -TERM daedalus_v4l2_daemon       # SIGTERM, daemon exits cleanly
$ sudo rmmod daedalus_v4l2               # ok
$ sudo dmesg | grep -E 'BUG|oops'
(empty)

No kernel oops / WARN traces from the production flow. The initial run had a bug — see [Bugs found and fixed] below.

Design decisions

Why vb2_vmalloc instead of vb2_dma_contig?

Two reasons:

  1. No DMA needed in Phase 8.5. device_run reads bitstream bytes via vb2_plane_vaddr and the chardev path writes decoded pixels via memcpy. No hardware DMA touches these buffers, so vb2_dma_contig's CMA pressure buys us nothing yet.
  2. Phase 8.6 will switch CAPTURE to vb2_dma_contig for dmabuf-export — at that point the daemon mmaps the dmabuf directly and writes pixels in place, bypassing the chardev payload entirely. Doing the switch in Phase 8.6 (rather than now) keeps each phase's scope clean: 8.5 = "real V4L2 m2m flow", 8.6 = "stop truncating big frames + add more codecs + add request API".

Why inline pixels in RESP_FRAME?

The same 64 KiB cap as REQ_DECODE. For Phase 8.5 we want to prove the QBUF/DQBUF round-trip works without introducing dmabuf-fd passing (which needs dma_buf_fd in the daemon's task context + a new chardev ioctl + daemon-side mmap of the returned fd — all real work, but orthogonal to the m2m verification). Phase 8.6 adds the dmabuf path.

For now: small frames (≤ ~256×192 NV12 = 73 KiB; safely 128×96 = 18 KiB) work end-to-end. The pixel-match test above proves the path is bit-exact, not just "approximately right."

Why a custom test client instead of v4l2-ctl?

v4l2-ctl's --stream-from-hdr expects a v4l2-ctl-specific length-prefix format (which I tried with a >I length prefix and got "Unknown header ID" — there's apparently a magic ID byte we didn't supply). Writing a 20-line test harness that does the V4L2 ioctls directly is faster than reverse- engineering v4l2-ctl's framing, and the harness stays useful for regression tests after the per-frame format work in 8.6.

Both daedalus_decode_cookie (in chardev.c, for debugfs test_decode) and daedalus_cookie_seq (in main.c, for m2m device_run) are independent atomics starting from 0. After a fresh insmod, both begin issuing cookie=1 → collisions are likely in mixed-use scenarios. This is harmless today — debugfs is a test fixture and doesn't have a V4L2 inflight to complete, so RESP_FRAME for "debugfs cookies" just logs "unknown cookie" and moves on. Phase 8.6 either unifies the counter or makes debugfs use cookies with bit 31 set so the two namespaces don't overlap.

Bugs found and fixed during the phase

Bug 1: device_run re-runs on same src buf, eventual stop_streaming oops

Symptom: test client logged TWO REQ_DECODE messages for one QBUF. After teardown, dmesg showed:

lr : daedalus_stop_streaming+0x3c/0x80 [daedalus_v4l2]

Cause: daedalus_complete_resp_frame called v4l2_m2m_buf_done(src_buf, ...) + v4l2_m2m_buf_done(dst_buf, ...) + v4l2_m2m_job_finish(...). This marks the vb2 buffers as DONE but does NOT pop them off the m2m's internal src/dst queues. The scheduler immediately re-runs device_run with the same still-queued src buf — which then double-frees on stop_streaming.

Fix: use v4l2_m2m_buf_done_and_job_finish — the canonical helper that pops both buffers AND marks them done AND finishes the job in one atomic sequence. Caught on the first end-to-end run; second run was clean.

Bug 2: missing <media/v4l2-event.h>

v4l2_event_unsubscribe undeclared. Trivial; added the include.

Bug 3 (build): missing <stdint.h> in tools/

The test harness used (uint32_t) casts without including <stdint.h>. Added.

What's NOT here (deferred to 8.6)

  • Per-frame dmabuf export. CAPTURE buffers come back through inline pixel data in RESP_FRAME today; ≤ 64 KiB cap rules out 1080p.
  • V4L2 stateless controls. No V4L2_CID_STATELESS_VP9_FRAME etc. — the daemon parses VP9 headers itself. Compliance complains accordingly.
  • Media controller. v4l2-compliance flags this as the "stateless decoder requires media controller OR decoder cmds" failure.
  • Colorspace round-trip in TRY_FMT. Documented compliance failure; trivial fix.
  • AV1 + H.264 codec contexts. Phase 8.6.

Phase 8.6 plan

  1. dmabuf-export on CAPTURE: switch mem_ops to vb2_dma_contig_memops; add DAEDALUS_IOC_GET_DMABUF on the chardev that calls vb2_core_expbuf in daemon context.
  2. Daemon mmaps the dmabuf, decodes into it directly, RESP_FRAME carries metadata only.
  3. Add AV1 + H.264 to the decoder's codec switch (FFmpeg already supports both).
  4. Add V4L2 stateless controls (VP9_FRAME, then AV1_FRAME, then H264_SLICE_PARAMS) — these can be NULL-handled by the daemon initially (it just ignores them since FFmpeg parses on its own), but adding them satisfies the spec / compliance.
  5. Media controller binding via v4l2_m2m_register_media_controller.
  6. Fix the S_FMT colorspace preservation.
  7. Fix the cookie namespace collision.