kernel: per-ctx vb2 lock — Firefox multi-process VAAPI unblock #3

Merged
marfrit merged 1 commit from noether/kernel-per-ctx-vb-mutex into main 2026-05-20 19:25:02 +00:00

Firefox YouTube playback works on a Pi 5 with daedalus_v4l2

Bug

daedalus_queue_init was wiring src_vq->lock and dst_vq->lock to ctx->dev->m2m_lock, a device-wide mutex. As a result, every vb2 ioctl (S_FMT, REQBUFS, QBUF, DQBUF, STREAMON, ...) was serialised across all concurrent clients of /dev/video0.

For single-client consumers (the test_m2m_* harness, simple ffmpeg jobs) this was invisible. For Firefox — which spawns separate content + RDD + GPU processes that each open /dev/video0 and run libva probe simultaneously — the contention surfaced as EBUSY from one libva session's S_FMT(OUTPUT_MPLANE) while another was mid-streamon.

$ MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox
...
v4l2-request: phase 8.10: opened daedalus_v4l2 at video_fd=32 ...
v4l2-request: cap_pool_init: 24 slots ready
v4l2-request: Unable to set format for type 10: Device or resource busy

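Result: VAAPI initialised in only one Firefox process. The other processes that also wanted to decode (e.g. the content process for actual video tags) bailed at S_FMT before any frame was queued. Buffer type 10 in the log above is V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, i.e. the bitstream (OUTPUT) queue.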

Fix

Each open() now gets its own ctx->vb_mutex, initialised in daedalus_open and destroyed in daedalus_release (and in the err_ctrl unwind path). The vb2_queue locks are therefore per-context, so concurrent clients no longer contend with each other.

cedrus / rkvdec / hantro all use the per-ctx vb mutex pattern for the same reason; this mirrors them.
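
To make the difference between the two wirings concrete, here is a minimal userspace sketch. It is not the daedalus_v4l2 source: every `model_*` name is hypothetical, pthread mutexes stand in for the kernel's `struct mutex`, and `pthread_mutex_trylock` is used only as a non-blocking probe for whether two clients share a lock. It makes no claim about how the driver itself produced the EBUSY seen in the log.

```c
/* Userspace model of the lock wiring. All model_* names are
 * hypothetical; this is not kernel code. Link with -pthread. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

struct model_dev {
    pthread_mutex_t m2m_lock;      /* device-wide, shared by every open() */
};

struct model_queue {
    pthread_mutex_t *lock;         /* plays the role of vb2_queue.lock */
};

struct model_ctx {
    struct model_dev *dev;
    pthread_mutex_t vb_mutex;      /* per-open() mutex added by the fix */
    struct model_queue src_vq;
    struct model_queue dst_vq;
};

/* One context per open(). per_ctx = 1 selects the new wiring, 0 the old. */
static struct model_ctx *model_open(struct model_dev *dev, int per_ctx)
{
    struct model_ctx *ctx = calloc(1, sizeof(*ctx));

    if (!ctx)
        return NULL;
    ctx->dev = dev;
    if (pthread_mutex_init(&ctx->vb_mutex, NULL) != 0) {
        free(ctx);                 /* unwind: nothing else to undo yet */
        return NULL;
    }
    ctx->src_vq.lock = per_ctx ? &ctx->vb_mutex : &dev->m2m_lock;
    ctx->dst_vq.lock = ctx->src_vq.lock;
    return ctx;
}

/* Mirror of release: destroy the per-context mutex, then free. */
static void model_release(struct model_ctx *ctx)
{
    pthread_mutex_destroy(&ctx->vb_mutex);
    free(ctx);
}

/* While client a holds its OUTPUT queue lock, can client b take its own?
 * Returns 0 if yes, EBUSY if both queues point at the same mutex. */
static int probe_second_client(struct model_ctx *a, struct model_ctx *b)
{
    int ret;

    pthread_mutex_lock(a->src_vq.lock);
    ret = pthread_mutex_trylock(b->src_vq.lock);
    if (ret == 0)
        pthread_mutex_unlock(b->src_vq.lock);
    pthread_mutex_unlock(a->src_vq.lock);
    return ret;
}

/* Open two clients on one device and run the probe. */
static int run_model(int per_ctx)
{
    struct model_dev dev;
    struct model_ctx *a, *b;
    int ret = ENOMEM;
    int err = pthread_mutex_init(&dev.m2m_lock, NULL);

    assert(err == 0);
    (void)err;
    a = model_open(&dev, per_ctx);
    b = model_open(&dev, per_ctx);
    if (a && b)
        ret = probe_second_client(a, b);
    if (a)
        model_release(a);
    if (b)
        model_release(b);
    pthread_mutex_destroy(&dev.m2m_lock);
    return ret;
}
```

In this model, `run_model(0)` (both queues on the device-wide lock) reports that the second client cannot proceed, while `run_model(1)` (per-context lock) lets both clients proceed independently.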

Verification

On higgs (Pi CM5, kernel 6.18.29):

  1. rmmod daedalus_v4l2 + modprobe the rebuilt .ko
  2. systemctl restart daedalus-v4l2.service
  3. MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox
  4. Open YouTube, play a video

Daemon journal shows clean per-frame REQ_DECODE / decoder:OK pairs:

REQ_DECODE cookie=1 codec=3 bitstream=4217 bytes meta=h264 capture=640x368 1 planes
decoder: opened h264 context
decoder: OK 640x368 fmt=0 (yuv420p) fnv1a=0x31f802b4 luma=235520 chroma=117760
REQ_DECODE cookie=2 codec=3 bitstream=1432 bytes ...
... (sustained ~230 fps over 600ms test window, all unique fnv1a hashes)

Zero EBUSY in firefox stderr or daemon journal during the session.
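
As a sanity check on the journal lines above, the luma and chroma byte counts are what a tightly packed yuv420p frame at 640x368 should contain. The sketch below recomputes them and shows a 32-bit FNV-1a hash of the kind the `fnv1a=` field suggests. It assumes the daemon uses the standard 32-bit FNV-1a parameters and does not try to reproduce which bytes the daemon actually hashes; the function names are illustrative only.

```c
/* Plane sizes for tightly packed yuv420p, plus standard 32-bit FNV-1a.
 * Function names are hypothetical, not taken from the daemon. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Y plane: one byte per pixel. */
static size_t yuv420p_luma_bytes(unsigned w, unsigned h)
{
    return (size_t)w * h;
}

/* U and V planes together: each is subsampled by 2 in both directions
 * (odd dimensions round up). */
static size_t yuv420p_chroma_bytes(unsigned w, unsigned h)
{
    return 2 * (size_t)((w + 1) / 2) * ((h + 1) / 2);
}

/* Standard FNV-1a, 32-bit: XOR each byte in, then multiply by the prime. */
static uint32_t fnv1a32(const uint8_t *data, size_t len)
{
    uint32_t hash = 2166136261u;          /* FNV offset basis */

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= 16777619u;                /* FNV prime */
    }
    return hash;
}
```

For 640x368 these helpers give 235520 luma bytes and 117760 chroma bytes, matching the `luma=` and `chroma=` fields in the journal. Distinct per-frame hashes are then evidence that distinct frames were decoded, rather than one frame being returned repeatedly.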

Throughput headroom: the test decoded ~230 fps at 640x368, roughly 7× the 30 fps frame rate of the 30fps@1080p Pi 5 Fourier target (measured at the lower resolution, not at 1080p), with the daemon's libavcodec dlopen path doing all the work CPU-side.
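
The headroom figure is simple arithmetic, sketched below with hypothetical helper names. The frame-rate ratio uses only the numbers quoted above. The pixel-rate helper is included because the test stream (640x368) and the target (1080p) differ in resolution, and it assumes 1080p means 1920x1080. Nothing here is a measurement at 1080p.

```c
/* Arithmetic behind the throughput comparison. Helper names are
 * illustrative; the inputs are the figures quoted in this PR. */
#include <assert.h>

/* How many times faster than the target frame rate the test ran. */
static double fps_ratio(double measured_fps, double target_fps)
{
    assert(target_fps > 0.0);
    return measured_fps / target_fps;
}

/* Decoded pixels per second for a given resolution and frame rate. */
static double pixel_rate(unsigned width, unsigned height, double fps)
{
    return (double)width * (double)height * fps;
}
```

`fps_ratio(230.0, 30.0)` comes to about 7.7, which the description rounds to ~7×. `pixel_rate(640, 368, 230.0)` is about 54.2 million pixels per second, against about 62.2 million for `pixel_rate(1920, 1080, 30.0)`, so the frame-rate ratio on its own does not establish the same margin at 1080p.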

Closes

The Pi 5 Fourier campaign's H.264-via-libva-via-daedalus end-to-end path for multi-process VAAPI consumers (Firefox; Chromium and mpv should benefit in the same way).

Generated with Claude Code

claude-noether added 1 commit 2026-05-20 19:24:21 +00:00
daedalus_queue_init was wiring both src_vq->lock and dst_vq->lock to
ctx->dev->m2m_lock — a device-wide mutex.  That serialises every
vb2 ioctl (S_FMT, REQBUFS, QBUF, DQBUF, STREAMON, ...) across ALL
concurrent clients of /dev/video0.  For a single-client consumer
like the test_m2m_* tools it doesn't matter; for Firefox, which
spawns separate content + RDD + GPU processes that each open
/dev/video0 and run libva probe simultaneously, the contention
showed up as EBUSY from one libva session's S_FMT(OUTPUT_MPLANE)
when another session was mid-streamon on the same device.

Observable on higgs (Pi CM5):

    $ MOZ_VA_API_ENABLED=1 LIBVA_DRIVER_NAME=v4l2_request firefox
    ...
    v4l2-request: phase 8.10: opened daedalus_v4l2 at video_fd=32 ...
    v4l2-request: cap_pool_init: 24 slots ready
    v4l2-request: Unable to set format for type 10: Device or
                  resource busy

After this fix, each open() gets its own ctx->vb_mutex and the
per-context vb2_queue locks are independent — Firefox's multi-
process VAAPI clients no longer fight each other.  YouTube
playback on higgs runs through daedalus at ~230 fps sustained
(640x368, libavcodec dlopen path), 7× headroom over the 30fps
target.

cedrus / rkvdec / hantro all use the per-ctx vb mutex pattern
for the same reason.  This mirrors them.

Lifecycle:
  - mutex_init in daedalus_open (right after the kzalloc that
    creates ctx, before v4l2_fh_init).
  - mutex_destroy in daedalus_release (after v4l2_fh_exit, before
    kfree), and in the err_ctrl unwind path in daedalus_open.

Verified end-to-end on higgs:
  - rmmod + modprobe the rebuilt .ko.
  - Restart daedalus-v4l2.service.
  - Firefox YouTube playback engages VAAPI, daemon journal shows
    cookie=1..N codec=3 (H.264) REQ_DECODE / decoder:OK pairs
    with unique per-frame fnv1a hashes.
  - No EBUSY in either firefox stderr or daemon journal during
    the entire session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
marfrit merged commit f0d41867f6 into main 2026-05-20 19:25:02 +00:00
marfrit deleted branch noether/kernel-per-ctx-vb-mutex 2026-05-20 19:25:02 +00:00

Reference: reauktion/daedalus-v4l2#3