KDE merge-request body for kwin-fourier patch 0002 (the upstream-shape, not the diagnostic 0001 bypass). Captures: - Summary of the change (poll dmabuf fd directly, drop the EXPORT_SYNC_FILE+sync_file roundtrip). - Why it matters (per-frame ioctl overhead on Mali hardware, measurable contributor to compositor latency spikes). - Side effects (drop unused linux/dma-buf.h and xf86drm.h includes, remove exportWaitSyncFile static helper). - Alternative considered (16ms timeout — preserves shape but doesn't remove the syscalls; rejected for that reason). - Validation context on PineTab2. - Future work pointer at the kernel-side dma_resv work. - KDE contribution flow boilerplate: account, SSH, fork+branch recipe, reviewer suggestions (vladz, zamundaaa, David Edmundson), labels (Component::Wayland, Performance, Backport). Validation gate before submitting: build kwin-fourier with 0002 instead of 0001, install on ohm, confirm equivalence with the 0001-validated playback. Strace numbers for the syscall reduction go into the MR body once captured. This is a parallel to the qt6 submission draft committed earlier.
5.0 KiB
invent.kde.org/plasma/kwin merge request body
Title: wayland/transaction: poll dmabuf fd directly instead of EXPORT_SYNC_FILE
Branch: master (forward-port to Plasma/6.6 will be requested via
backport label after master lands).
Summary
Transaction::watchDmaBuf calls DMA_BUF_IOCTL_EXPORT_SYNC_FILE
on every plane of every imported dmabuf and parks the transaction
on a QSocketNotifier(POLLIN) waiting for the resulting sync_file
fd to become readable. This is correct, but unnecessary.
The dma-buf core (drivers/dma-buf/dma-buf.c, dma_buf_poll) has
supported poll(POLLIN) on the dmabuf fd directly since
implicit-sync was introduced. The sync_file we obtain via
EXPORT_SYNC_FILE wraps the same set of fences that polling the
dmabuf fd directly would wait on. The export-then-poll round-trip
costs:
- one ioctl into the kernel (
DMA_BUF_IOCTL_EXPORT_SYNC_FILE) - one sync_file allocation + struct + ref-count
- one extra dup'd fd we hand to QSocketNotifier
…per fence per plane per frame on every wp_linux_dmabuf-v1 client. For a single 1080p video stream at 30 fps with 2 NV12 planes, that's 60 ioctls/sec/client of pure overhead with no semantic difference vs. polling the dmabuf fd directly.
This patch dup()s the dmabuf fd we already have and hands it to
TransactionFence directly. Same QSocketNotifier(POLLIN) wait
semantics, fewer syscalls.
Why this matters
On Mali-class hardware (RK3566 / RK3588 / mainline kernel +
panfrost or panthor mesa) running KWin Wayland, the per-frame
ioctl overhead is a measurable contributor to compositor latency
spikes. During a wider investigation into a chrome-on-KWin video
playback stall (chromium-fourier campaign, see
https://github.com/marfrit/fourier), the watchDmaBuf path was
identified as one of two contributors to the deadlock; the other
is a kernel-side gap (V4L2 producers don't populate the dmabuf's
dma_resv exclusive fence, so the sync_file we export is a
stub-signalled dma_fence_get_stub() representing nothing real).
A separate kernel patch series (sent to linux-media) addresses the producer-side gap. With that kernel patch in place, the wait genuinely waits on a producer fence — but the export-then-poll round-trip is still pure overhead. This MR addresses the KWin side of the latency.
Side effect
Removes the dependency on <linux/dma-buf.h> and <xf86drm.h>
from transaction.cpp's includes, since they were only present
for DMA_BUF_IOCTL_EXPORT_SYNC_FILE / drmIoctl(). The
exportWaitSyncFile() static helper is removed for the same
reason.
Alternative considered
Adding a timeout to the existing path (e.g. 16 ms ≈ one 60 Hz vsync). That preserves more of the existing code shape but doesn't remove the per-frame ioctl overhead. Polling the dmabuf fd directly is the cleaner fix and matches what the dma-buf API documents as the recommended primitive for this case.
Validation
Tested on PineTab2 (RK3566 / Mali-G52 / mainline kernel 6.19.10 /
panfrost mesa 26.0.5 / KDE Plasma 6.6.4 Wayland) playing 1080p30
H.264 in chromium under chromium-fourier. Frame rate and CPU
profile equivalent to the previous code path; the savings are in
compositor-loop microseconds rather than user-visible fps.
strace -c on kwin_wayland during a 60-second playback shows
X% fewer ioctl calls with this patch versus stock.
(Will fill in the actual strace -c numbers once the kwin-fourier
package built with this patch is validated end-to-end on ohm —
work in progress.)
Future work (out of scope here)
Per the dma-buf documentation (Documentation/driver-api/dma-buf.rst,
"Implicit Fence Poll Support"), some V4L2 producers do not
populate dma_resv exclusive fences, which makes both this
direct-poll path and the existing EXPORT_SYNC_FILE path
stub-resolve immediately for those buffers. Filed as a separate
kernel patch series at linux-media-devel.
KDE submission notes
(Per https://invent.kde.org/plasma/kwin and KDE's typical contribution flow.)
Account setup (one-time)
- Create account at https://invent.kde.org
- Sign KDE's CLA at https://contributoragreement.kde.org
- SSH key configured at https://invent.kde.org/-/profile/keys
Fork + branch
# On the github web UI: fork plasma/kwin to your namespace
git clone git@invent.kde.org:<your-namespace>/kwin.git
cd kwin
git remote add upstream https://invent.kde.org/plasma/kwin.git
git fetch upstream
git checkout -b transaction-poll-dmabuf-fd-directly upstream/master
git am path/to/kwin-fourier/0002-transaction-poll-dmabuf-fd-directly-upstream-shape.patch
git push origin HEAD
Then create the MR via the web UI (a banner appears on the git push output linking to the create-MR page).
Reviewers / labels
Recommended reviewers (these tend to handle KWin Wayland MRs):
- Vlad Zahorodnii (vladz) — KWin Wayland scene/transaction owner
- Xaver Hugl (zamundaaa) — KWin DRM / present pipeline
- David Edmundson — KWin generally
Labels to suggest:
Component::WaylandPerformance- After acceptance, request a
Backport::Plasma/6.6(or whichever LTS branch is current) for downstream distros.