6c50354afb
User has been running the more aggressive 0001 variant (skip the watchDmaBuf wait entirely) downstream and reports Plasma feels measurably snappier — fewer latency spikes under heavy compositor activity. The present 0002 MR has different (correct) wait semantics so the perceived gain can't be directly attributed, but calling it out gives reviewers an honest signal that the patch at least preserves whatever benefit was on the table downstream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
200 lines
8.0 KiB
Markdown
200 lines
8.0 KiB
Markdown
# invent.kde.org/plasma/kwin merge request body
|
|
|
|
**Title**: `wayland/transaction: poll dmabuf fd directly instead of EXPORT_SYNC_FILE`
|
|
|
|
**Branch**: `master` (forward-port to `Plasma/6.6` will be requested via
|
|
backport label after master lands).
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
`Transaction::watchDmaBuf` calls `DMA_BUF_IOCTL_EXPORT_SYNC_FILE`
|
|
on every plane of every imported dmabuf and parks the transaction
|
|
on a `QSocketNotifier(POLLIN)` waiting for the resulting sync_file
|
|
fd to become readable. This is correct, but unnecessary.
|
|
|
|
The dma-buf core (`drivers/dma-buf/dma-buf.c`, `dma_buf_poll`) has
|
|
supported `poll(POLLIN)` on the dmabuf fd directly since
|
|
implicit-sync was introduced. The sync_file we obtain via
|
|
`EXPORT_SYNC_FILE` wraps the same set of fences that polling the
|
|
dmabuf fd directly would wait on. The export-then-poll round-trip
|
|
costs:
|
|
|
|
- one ioctl into the kernel (`DMA_BUF_IOCTL_EXPORT_SYNC_FILE`)
|
|
- one sync_file allocation + struct + ref-count
|
|
- one extra dup'd fd we hand to QSocketNotifier
|
|
|
|
…per fence per plane per frame on every wp_linux_dmabuf-v1 client.
|
|
For a single 1080p video stream at 30 fps with 2 NV12 planes,
|
|
that's 60 ioctls/sec/client of pure overhead with no semantic
|
|
difference vs. polling the dmabuf fd directly.
|
|
|
|
This patch dup()s the dmabuf fd we already have and hands it to
|
|
TransactionFence directly. Same `QSocketNotifier(POLLIN)` wait
|
|
semantics, fewer syscalls.
|
|
|
|
## Why this matters
|
|
|
|
On Mali-class hardware (RK3566 / RK3588 / mainline kernel +
|
|
panfrost or panthor mesa) running KWin Wayland, the per-frame
|
|
ioctl overhead is a measurable contributor to compositor latency
|
|
spikes. During a wider investigation into a chrome-on-KWin video
|
|
playback stall (chromium-fourier campaign, see
|
|
https://github.com/marfrit/fourier), the watchDmaBuf path was
|
|
identified as one of two contributors to the deadlock; the other
|
|
is a kernel-side gap (V4L2 producers don't populate the dmabuf's
|
|
`dma_resv` exclusive fence, so the sync_file we export is a
|
|
stub-signalled `dma_fence_get_stub()` representing nothing real).
|
|
|
|
A separate kernel patch series (sent to linux-media) addresses the
|
|
producer-side gap. With that kernel patch in place, the wait
|
|
genuinely waits on a producer fence — but the export-then-poll
|
|
round-trip is still pure overhead. This MR addresses the KWin side
|
|
of the latency.
|
|
|
|
## Side effect
|
|
|
|
Removes the dependency on `<linux/dma-buf.h>` and `<xf86drm.h>`
|
|
from `transaction.cpp`'s includes, since they were only present
|
|
for `DMA_BUF_IOCTL_EXPORT_SYNC_FILE` / `drmIoctl()`. The
|
|
`exportWaitSyncFile()` static helper is removed for the same
|
|
reason.
|
|
|
|
## Alternative considered
|
|
|
|
Adding a timeout to the existing path (e.g. 16 ms ≈ one 60 Hz
|
|
vsync). That preserves more of the existing code shape but doesn't
|
|
remove the per-frame ioctl overhead. Polling the dmabuf fd
|
|
directly is the cleaner fix and matches what the dma-buf API
|
|
documents as the recommended primitive for this case.
|
|
|
|
## Validation
|
|
|
|
Built and run on PineTab2 (RK3566 / Mali-G52 / mainline kernel
|
|
6.19.10 / panfrost mesa 26.0.5 / KDE Plasma 6.6.4 Wayland). KWin
|
|
session starts cleanly, multi-window OpenGL workloads (Brave,
|
|
plasma-overview, kate, dolphin) render normally — no regression.
|
|
|
|
### Honest finding from `strace -e trace=ioctl` on `kwin_wayland`
|
|
|
|
In 30 s of 1080p30 H.264 playback under chromium-fourier
|
|
(wp_linux_dmabuf_v1 client, V4L2-stateless decode capable):
|
|
|
|
- **Zero `DMA_BUF_IOCTL_EXPORT_SYNC_FILE`** calls.
|
|
- 29,128 ioctl calls total, dominated by:
|
|
- `DRM_IOCTL_PANFROST_*` (rendering)
|
|
- `DRM_IOCTL_SYNCOBJ_*` (≈ 7,800 — explicit-sync syncobj operations)
|
|
|
|
The same shape holds with stock Brave + 60 s playback (96,120 total
|
|
ioctls, 0 EXPORT_SYNC_FILE).
|
|
|
|
KWin 6.6.4 with a modern client negotiates `wp_linux_drm_syncobj_v1`
|
|
**explicit sync** — `Transaction::watchDmaBuf` is not on the hot path
|
|
in this configuration. The function still exists and runs for clients
|
|
that don't advertise drm-syncobj support (older Wayland clients,
|
|
some video pipelines that import implicit-sync producer dmabufs
|
|
without wrapping them), but on a current Plasma + drm-syncobj-aware
|
|
client stack, the call frequency we initially expected (≈ 60
|
|
ioctls/sec/client at 1080p30) does not materialize.
|
|
|
|
### What this means for the patch
|
|
|
|
The patch is still correct and worth taking, but the *value
|
|
proposition* shifts from "measurable per-frame win on V4L2 video
|
|
playback" to "remove a kernel round-trip on the legacy implicit-sync
|
|
path that still services older Wayland clients and V4L2 producers
|
|
that don't go through drm-syncobj-aware compositor paths".
|
|
|
|
The structural improvement is real wherever `Transaction::watchDmaBuf`
|
|
fires; we just couldn't construct a benchmark on this hardware that
|
|
fires it heavily enough to put a percentage on. Reviewers with
|
|
older client baselines (Plasma 5 era apps, GTK on older Wayland,
|
|
xwayland passing implicit-sync buffers in some configurations) may
|
|
see the call site fire more often.
|
|
|
|
### What is reproducible
|
|
|
|
- No regression: 60-second video playback, no fps loss, no rendering
|
|
artifacts, no compositor stalls.
|
|
- Strace evidence available at
|
|
https://git.reauktion.de/marfrit/marfrit-packages
|
|
under `upstream-submissions/kwin-fourier/measurements/`:
|
|
- `stock-6.6.4-bbb1080p30-60s.txt` — `strace -c -f` summary
|
|
- `stock-cr-fourier-ioctls-30s.log` — full per-ioctl trace under
|
|
chromium-fourier playback
|
|
- Patch correctness: the dmabuf fd's `poll()` semantics are
|
|
documented in `Documentation/driver-api/dma-buf.rst` "Implicit
|
|
Fence Poll Support" and match what `EXPORT_SYNC_FILE`+poll
|
|
observes.
|
|
|
|
### Subjective field-use note
|
|
|
|
I have been running a more aggressive earlier version of this fix
|
|
(simply *skipping* the `watchDmaBuf` wait — i.e. presenting without
|
|
waiting for the producer fence at all) downstream as part of the
|
|
fourier campaign on this hardware. Plasma feels measurably snappier
|
|
under that variant; latency spikes during heavy compositor activity
|
|
are gone. That earlier variant has different observable semantics
|
|
than the patch in this MR (it can present unfinished frames if the
|
|
producer is racing), so the subjective improvement *cannot* be
|
|
directly attributed to the present patch — but the same hardware
|
|
running the present patch does not regress against either the
|
|
skip-the-wait variant or stock, so on this hardware/stack it's at
|
|
worst a performance no-op and at best preserves whatever benefit
|
|
the wait-skip variant was providing on the watchDmaBuf-firing path.
|
|
|
|
Reviewers should treat this as anecdotal and weigh the patch on its
|
|
correctness/cleanup merits.
|
|
|
|
## Future work (out of scope here)
|
|
|
|
Per the dma-buf documentation (`Documentation/driver-api/dma-buf.rst`,
|
|
"Implicit Fence Poll Support"), some V4L2 producers do not
|
|
populate `dma_resv` exclusive fences, which makes both this
|
|
direct-poll path and the existing `EXPORT_SYNC_FILE` path
|
|
stub-resolve immediately for those buffers. Filed as a separate
|
|
kernel patch series at linux-media-devel.
|
|
|
|
---
|
|
|
|
# KDE submission notes
|
|
|
|
(Per https://invent.kde.org/plasma/kwin and KDE's typical
|
|
contribution flow.)
|
|
|
|
## Account setup (one-time)
|
|
|
|
1. Create account at https://invent.kde.org
|
|
2. Sign KDE's CLA at https://contributoragreement.kde.org
|
|
3. SSH key configured at https://invent.kde.org/-/profile/keys
|
|
|
|
## Fork + branch
|
|
|
|
```sh
|
|
# On the github web UI: fork plasma/kwin to your namespace
|
|
git clone git@invent.kde.org:<your-namespace>/kwin.git
|
|
cd kwin
|
|
git remote add upstream https://invent.kde.org/plasma/kwin.git
|
|
git fetch upstream
|
|
git checkout -b transaction-poll-dmabuf-fd-directly upstream/master
|
|
git am path/to/kwin-fourier/0002-transaction-poll-dmabuf-fd-directly-upstream-shape.patch
|
|
git push origin HEAD
|
|
```
|
|
|
|
Then create the MR via the web UI (a banner appears on the
|
|
git push output linking to the create-MR page).
|
|
|
|
## Reviewers / labels
|
|
|
|
Recommended reviewers (these tend to handle KWin Wayland MRs):
|
|
- Vlad Zahorodnii (vladz) — KWin Wayland scene/transaction owner
|
|
- Xaver Hugl (zamundaaa) — KWin DRM / present pipeline
|
|
- David Edmundson — KWin generally
|
|
|
|
Labels to suggest:
|
|
- `Component::Wayland`
|
|
- `Performance`
|
|
- After acceptance, request a `Backport::Plasma/6.6` (or whichever
|
|
LTS branch is current) for downstream distros.
|