a7892bfabc
Drafted but not yet compile-tested or runtime-validated. Draft
target: vb2 grows an opt-in dma_resv release-fence API; hantro and
rockchip-rga opt in as the demonstration drivers.
Series structure:
- 0000-cover-letter.patch — context, motivation, validation results
- 0001-media-videobuf2-add-dma_resv-release-fence-helper.patch
Adds vb2_buffer_attach_release_fence() that drivers call from
their buf_queue callback. Stores the fence on vb->release_fence;
vb2_buffer_done signals + puts. Per-queue fence context allocated
at vb2_core_queue_init.
- 0002-media-hantro-attach-dma_resv-release-fence-at-buf_queue.patch
Single call in hantro_buf_queue. ~5 lines.
- 0003-media-rockchip-rga-attach-dma_resv-release-fence-at-buf_queue.patch
Same shape in rga_buf_queue. ~5 lines.
Pre-flight before sending to linux-media (per kernel/README.md):
1. Compile the touched files against the kernel tree the patches
will land on (linux-next master as of 2026-04-28 was the source
of truth used for context-line generation).
2. Boot-test on ohm, smoke-test hantro + rga buffer flows.
3. Validate the fence semantics: install patched kernel, uninstall
kwin-fourier so KWin's watchDmaBuf is active, play 1080p30 H.264
under KDE Plasma. Playback should run through without the bypass
because the fence is now real.
4. Capture before/after dma_buf_export_sync_file timings.
5. Send via git format-patch --cover-letter to linux-media@,
CC dri-devel@ and the relevant maintainers.
This series is the kernel-correct fix for the architectural hole
that the chromium-fourier campaign's kwin-fourier package is
papering over. With this kernel side upstream, kwin-fourier
becomes either redundant (if KWin's existing wait works correctly)
or rewritten as a poll-fd-direct optimization.
From: Markus Fritsche <mfritsche@reauktion.de>
Subject: [PATCH RFC 0/3] media: videobuf2: opt-in dma_resv producer fences for V4L2 dmabuf exports
Date: 2026-04-28

Hi,

This series proposes a small opt-in API in videobuf2-core that lets V4L2
drivers populate a `dma_resv` exclusive write fence on the dmabufs they
export to userspace, signalled when the buffer transitions to
VB2_BUF_STATE_DONE. Two example drivers (hantro, rockchip-rga) opt in
to demonstrate the call shape; the change is no-op for every other
driver.

Why
---
Modern Wayland compositors and any other userspace consumers that
import V4L2-produced dmabufs and want to do implicit synchronization
the spec-clean way (poll(POLLIN) on the dmabuf fd, or
DMA_BUF_IOCTL_EXPORT_SYNC_FILE for a sync_file) currently get either:

1. A stub fence from `dma_buf_export_sync_file`, because the dmabuf's
   `dma_resv` has no fences populated. The kernel substitutes
   `dma_fence_get_stub()` which is permanently signalled. The compositor
   "successfully" waits on a fence that represents nothing real about
   the producer's state.
2. A poll(POLLIN) on the dmabuf fd that returns immediately for the
   same reason: `dma_buf_poll_add_cb` finds zero fences in the resv,
   triggers the wake callback inline, and reports POLLIN ready before
   the producer has actually said anything.

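Both consumer paths above reduce to polling a file descriptor for POLLIN: the dmabuf fd directly, or the sync_file fd returned by the export ioctl. A minimal userspace sketch of a non-blocking readiness probe follows; `fd_ready_now` is a hypothetical helper name, not something from this series.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: ask, without blocking, whether fd currently
 * reports POLLIN. Works on any pollable fd, including a dmabuf fd or
 * a sync_file fd.
 * Returns 1 if ready, 0 if not ready, -1 on error (errno set by poll).
 */
static int fd_ready_now(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    do {
        ret = poll(&pfd, 1, 0); /* timeout 0: never sleep */
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -1;
    return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

Per the description above, on an unpatched kernel this probe would be expected to report ready for a V4L2-exported dmabuf regardless of what the producer is doing, which is exactly why it carries no information there.
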
Today this works as a happy accident on most paths because clients
attach buffers after VIDIOC_DQBUF, which the userspace V4L2 contract
guarantees only returns a buffer after the producer is done. So the
implicit "the kernel's stub fence is fine because the buffer is
already complete by the time anyone polls it" assumption has held.

But:

- It's a contract gap. The kernel claims to expose implicit sync; it
  does not, for V4L2 producers.
- It blocks downstream consumers from doing the right thing. A
  Wayland compositor that defensively waits on a sync_file gets a
  stub-fence pass-through with no actual gating; if the V4L2 driver
  ever has an out-of-band path that releases the buffer before
  finishing the write (e.g. a reconfig-resize that DQBUFs everything),
  there's no fence to gate on.
- It costs latency for nothing. Every Wayland frame from a V4L2
  producer pays a `DMA_BUF_IOCTL_EXPORT_SYNC_FILE` round-trip for a
  fence that's stub-signalled. On Mali-class hardware (RK3566 Wayland
  chrome video playback), this is a measurable per-frame cost
  contributing to compositor stalls. Removing the wait at the
  compositor level (KWin) is a workaround, not a fix.

The right thing for the kernel to do is populate a real fence. This
series adds the minimal API and demonstrates the per-driver hookup
pattern.

What
----
Patch 1 adds:

- `struct dma_fence *release_fence` to `struct vb2_buffer`
- `u64 dma_resv_fence_context` + `atomic64_t dma_resv_fence_seqno` to
  `struct vb2_queue`
- `vb2_buffer_attach_release_fence(vb)`: drivers call this from
  their `buf_queue` callback. Allocates a `dma_fence` on the queue's
  fence context, attaches it as DMA_RESV_USAGE_WRITE on each plane's
  dmabuf->resv. No-op for buffers without exported dmabufs.
- `vb2_buffer_done()` extended to call `dma_fence_signal(vb->release_fence)`
  + `dma_fence_put` if the fence was attached, so the producer's
  completion signal lands in the resv synchronously with the userspace
  DQBUF wakeup.

Patches 2 and 3 add a single call to the helper from `hantro_buf_queue`
and `rga_buf_queue` respectively. ~5 lines each.

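The lifecycle Patch 1 describes can be modelled in plain userspace C. This is only a sketch of the bookkeeping under assumed semantics, not the patch itself: every `model_*` name is hypothetical, the fence is a bare struct rather than a `dma_fence`, and the single `resv_fence` slot stands in for the per-plane dmabuf->resv.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins; none of these are the kernel types. */
struct model_fence {
    uint64_t context;
    uint64_t seqno;
    int refcount;
    bool signalled;
};

struct model_queue {
    uint64_t fence_context; /* allocated once at queue init */
    uint64_t fence_seqno;   /* bumped for every attached fence */
};

struct model_buffer {
    struct model_queue *queue;
    bool exported;                     /* buffer has an exported dmabuf */
    struct model_fence *release_fence; /* reference held by the buffer */
    struct model_fence *resv_fence;    /* reference held by the resv */
};

static void model_fence_put(struct model_fence *f)
{
    if (f && --f->refcount == 0)
        free(f);
}

/* Models the attach step a driver would run from its buf_queue callback. */
static int model_attach_release_fence(struct model_buffer *vb)
{
    struct model_fence *f;

    if (!vb->exported)
        return 0;  /* nothing exported: no-op */
    if (vb->release_fence)
        return -1; /* already queued with a pending fence */

    f = calloc(1, sizeof(*f));
    if (!f)
        return -1;
    f->context = vb->queue->fence_context;
    f->seqno = ++vb->queue->fence_seqno;
    f->refcount = 2; /* one for the buffer, one for the resv slot */

    model_fence_put(vb->resv_fence); /* replace any older write fence */
    vb->resv_fence = f;
    vb->release_fence = f;
    return 0;
}

/* Models the completion step: signal, then drop the buffer's reference. */
static void model_buffer_done(struct model_buffer *vb)
{
    if (!vb->release_fence)
        return; /* driver did not opt in, or nothing was exported */
    vb->release_fence->signalled = true;
    model_fence_put(vb->release_fence);
    vb->release_fence = NULL;
}

/* What a consumer waiting on the resv would observe. */
static bool model_consumer_may_read(const struct model_buffer *vb)
{
    /* An empty resv behaves like the always-signalled stub fence. */
    return !vb->resv_fence || vb->resv_fence->signalled;
}
```

A driver opting in would make the equivalent of the `model_attach_release_fence()` call once from its `buf_queue` callback, which is the shape Patches 2 and 3 describe for hantro and rga.
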
Tested on
---------
PineTab2 (RK3566 / Mali-G52 panfrost / mainline 6.19.10, this series
backported), playing 1080p30 H.264 in chromium under KDE Plasma 6.6.4
Wayland. The test harness is the chromium-fourier patch series
(https://github.com/marfrit/fourier): chromium plus a KWin patch that
*previously bypassed* `Transaction::watchDmaBuf` because the
kernel-side fence was stub-signalled. With this series applied, the
bypass becomes unnecessary; KWin's fence wait completes correctly
because the fence now signals when hantro completes the capture buffer
write.

End-to-end result before the kernel patch (chromium + Qt 6 patches +
KWin watchDmaBuf bypass): 1080p30 H.264 plays through, ~81% combined
chrome CPU, but the watchDmaBuf bypass weakens KWin's defenses against
misbehaving clients.

End-to-end result after the kernel patch (chromium + Qt 6 patches +
plain unmodified KWin): 1080p30 H.264 plays through with the same CPU
profile, KWin's watchDmaBuf wait completes within microseconds against
the now-real producer fence, no defenses weakened.

What's missing in this RFC
--------------------------
- Other vb2-using drivers don't opt in. Each maintainer should look
  at their driver and decide. The hantro + rga patches show the
  shape; copying it to other drivers should be straightforward.
- For drivers that have intermediate image-processor stages
  (e.g. CSI → ISP → user), the fence semantics across stage boundaries
  are out of scope here. This series only addresses the
  producer-to-userspace edge.
- No selftest. videobuf2 doesn't have a great in-tree selftest harness
  for dmabuf flows; the validation is end-to-end at the userspace
  consumer level (KWin, in our case).

Reviews especially welcome on:

- The decision to make this opt-in per driver vs. automatic for all
  vb2-CAPTURE queues. Auto-on would force every driver to be audited;
  opt-in is incremental and safer but leaves the contract gap for
  drivers nobody touches.
- Whether `vb2_buffer_done` is the right place to signal vs. an
  earlier hook (e.g. immediately after DMA-from-device finishes). For
  hantro the two are effectively the same; for drivers with
  asynchronous post-processing they may differ.
- The choice of `DMA_RESV_USAGE_WRITE` vs the older
  `dma_resv_add_excl_fence` semantics. We're emitting the producer's
  write completion, so WRITE matches dma-buf documentation, but I'd
  appreciate a sanity check.

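As background for the last question, the visibility rule behind the WRITE choice can be written down as a small model. This sketch assumes the mainline usage ordering (KERNEL, WRITE, READ, BOOKKEEP) and the `dma_resv_usage_rw()` rule; the `model_*` names are local stand-ins, not kernel identifiers.

```c
#include <stdbool.h>

/* Same ordering as the kernel's enum dma_resv_usage; names are stand-ins. */
enum model_usage {
    MODEL_USAGE_KERNEL,
    MODEL_USAGE_WRITE,
    MODEL_USAGE_READ,
    MODEL_USAGE_BOOKKEEP,
};

/*
 * Mirrors dma_resv_usage_rw(): a consumer that intends to write must
 * wait for readers and writers, a consumer that only reads waits for
 * writers only.
 */
static enum model_usage model_usage_rw(bool consumer_writes)
{
    return consumer_writes ? MODEL_USAGE_READ : MODEL_USAGE_WRITE;
}

/* A query at level `query` returns fences whose usage is <= query. */
static bool model_fence_visible(enum model_usage fence,
                                enum model_usage query)
{
    return fence <= query;
}
```

Under this rule a fence attached as WRITE is visible to a consumer that only reads (poll(POLLIN), or a sync_file exported with DMA_BUF_SYNC_READ), whereas a fence attached as READ would be skipped by read-side waits and seen only by a would-be writer.
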
Cheers,
Markus