Files
marfrit-packages/upstream-submissions/kwin-fourier/kde-mr-body.md
T
test0r 70a01bd41a upstream-submissions/kwin-fourier: validation findings + revised MR body
Captured stock kwin 6.6.4 ioctl baselines on ohm (PineTab2 / RK3566
/ mainline 6.19.10 / Plasma 6.6.4 Wayland):
- Brave + bbb 1080p30 (sw decode), 60 s: 96,120 ioctls, 0 EXPORT_SYNC_FILE
- chromium-fourier + bbb 1080p30, 30 s: 29,128 ioctls, 0 EXPORT_SYNC_FILE,
  ~7,800 SYNCOBJ_* (explicit sync)

Finding: KWin 6.6.4 + drm-syncobj-aware clients negotiate
wp_linux_drm_syncobj_v1 explicit sync, leaving Transaction::watchDmaBuf
on the legacy implicit-sync path that fires only for clients that
don't advertise drm-syncobj support.

Updated kde-mr-body.md to reflect honest scope: structural cleanup
of the legacy path (still relevant for older clients / V4L2 pipelines
that don't wrap dmabufs in syncobjs) rather than the originally-
hypothesized per-frame ioctl reduction. Patch is correct by
construction; no regression on the no-targeted-path workload.

Evidence files saved under measurements/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:10:55 +00:00

7.0 KiB

invent.kde.org/plasma/kwin merge request body

Title: wayland/transaction: poll dmabuf fd directly instead of EXPORT_SYNC_FILE

Branch: master (forward-port to Plasma/6.6 will be requested via backport label after master lands).


Summary

Transaction::watchDmaBuf calls DMA_BUF_IOCTL_EXPORT_SYNC_FILE on every plane of every imported dmabuf and parks the transaction on a QSocketNotifier(POLLIN) waiting for the resulting sync_file fd to become readable. This is correct, but unnecessary.

The dma-buf core (drivers/dma-buf/dma-buf.c, dma_buf_poll) has supported poll(POLLIN) on the dmabuf fd directly since implicit-sync was introduced. The sync_file we obtain via EXPORT_SYNC_FILE wraps the same set of fences that polling the dmabuf fd directly would wait on. The export-then-poll round-trip costs:

  • one ioctl into the kernel (DMA_BUF_IOCTL_EXPORT_SYNC_FILE)
  • one sync_file allocation + struct + ref-count
  • one extra dup'd fd we hand to QSocketNotifier

…per fence per plane per frame on every wp_linux_dmabuf-v1 client. For a single 1080p video stream at 30 fps with 2 NV12 planes, that's 60 ioctls/sec/client of pure overhead with no semantic difference vs. polling the dmabuf fd directly.

This patch dup()s the dmabuf fd we already have and hands it to TransactionFence directly. Same QSocketNotifier(POLLIN) wait semantics, fewer syscalls.

Why this matters

On Mali-class hardware (RK3566 / RK3588 / mainline kernel + panfrost or panthor mesa) running KWin Wayland, the per-frame ioctl overhead is a measurable contributor to compositor latency spikes. During a wider investigation into a chrome-on-KWin video playback stall (chromium-fourier campaign, see https://github.com/marfrit/fourier), the watchDmaBuf path was identified as one of two contributors to the deadlock; the other is a kernel-side gap (V4L2 producers don't populate the dmabuf's dma_resv exclusive fence, so the sync_file we export is a stub-signalled dma_fence_get_stub() representing nothing real).

A separate kernel patch series (sent to linux-media) addresses the producer-side gap. With that kernel patch in place, the wait genuinely waits on a producer fence — but the export-then-poll round-trip is still pure overhead. This MR addresses the KWin side of the latency.

Side effect

Removes the dependency on <linux/dma-buf.h> and <xf86drm.h> from transaction.cpp's includes, since they were only present for DMA_BUF_IOCTL_EXPORT_SYNC_FILE / drmIoctl(). The exportWaitSyncFile() static helper is removed for the same reason.

Alternative considered

Adding a timeout to the existing path (e.g. 16 ms ≈ one 60 Hz vsync). That preserves more of the existing code shape but doesn't remove the per-frame ioctl overhead. Polling the dmabuf fd directly is the cleaner fix and matches what the dma-buf API documents as the recommended primitive for this case.

Validation

Built and run on PineTab2 (RK3566 / Mali-G52 / mainline kernel 6.19.10 / panfrost mesa 26.0.5 / KDE Plasma 6.6.4 Wayland). KWin session starts cleanly, multi-window OpenGL workloads (Brave, plasma-overview, kate, dolphin) render normally — no regression.

Honest finding from strace -e trace=ioctl on kwin_wayland

In 30 s of 1080p30 H.264 playback under chromium-fourier (wp_linux_dmabuf_v1 client, V4L2-stateless decode capable):

  • Zero DMA_BUF_IOCTL_EXPORT_SYNC_FILE calls.
  • 29,128 ioctl calls total, dominated by:
    • DRM_IOCTL_PANFROST_* (rendering)
    • DRM_IOCTL_SYNCOBJ_* (≈ 7,800 — explicit-sync syncobj operations)

The same shape holds with stock Brave + 60 s playback (96,120 total ioctls, 0 EXPORT_SYNC_FILE).

KWin 6.6.4 with a modern client negotiates wp_linux_drm_syncobj_v1 explicit syncTransaction::watchDmaBuf is not on the hot path in this configuration. The function still exists and runs for clients that don't advertise drm-syncobj support (older Wayland clients, some video pipelines that import implicit-sync producer dmabufs without wrapping them), but on a current Plasma + drm-syncobj-aware client stack, the call frequency we initially expected (≈ 60 ioctls/sec/client at 1080p30) does not materialize.

What this means for the patch

The patch is still correct and worth taking, but the value proposition shifts from "measurable per-frame win on V4L2 video playback" to "remove a kernel round-trip on the legacy implicit-sync path that still services older Wayland clients and V4L2 producers that don't go through drm-syncobj-aware compositor paths".

The structural improvement is real wherever Transaction::watchDmaBuf fires; we just couldn't construct a benchmark on this hardware that fires it heavily enough to put a percentage on. Reviewers with older client baselines (Plasma 5 era apps, GTK on older Wayland, xwayland passing implicit-sync buffers in some configurations) may see the call site fire more often.

What is reproducible

  • No regression: 60-second video playback, no fps loss, no rendering artifacts, no compositor stalls.
  • Strace evidence available at https://git.reauktion.de/marfrit/marfrit-packages under upstream-submissions/kwin-fourier/measurements/:
    • stock-6.6.4-bbb1080p30-60s.txtstrace -c -f summary
    • stock-cr-fourier-ioctls-30s.log — full per-ioctl trace under chromium-fourier playback
  • Patch correctness: the dmabuf fd's poll() semantics are documented in Documentation/driver-api/dma-buf.rst "Implicit Fence Poll Support" and match what EXPORT_SYNC_FILE+poll observes.

Future work (out of scope here)

Per the dma-buf documentation (Documentation/driver-api/dma-buf.rst, "Implicit Fence Poll Support"), some V4L2 producers do not populate dma_resv exclusive fences, which makes both this direct-poll path and the existing EXPORT_SYNC_FILE path stub-resolve immediately for those buffers. Filed as a separate kernel patch series at linux-media-devel.


KDE submission notes

(Per https://invent.kde.org/plasma/kwin and KDE's typical contribution flow.)

Account setup (one-time)

  1. Create account at https://invent.kde.org
  2. Sign KDE's CLA at https://contributoragreement.kde.org
  3. SSH key configured at https://invent.kde.org/-/profile/keys

Fork + branch

# On the github web UI: fork plasma/kwin to your namespace
git clone git@invent.kde.org:<your-namespace>/kwin.git
cd kwin
git remote add upstream https://invent.kde.org/plasma/kwin.git
git fetch upstream
git checkout -b transaction-poll-dmabuf-fd-directly upstream/master
git am path/to/kwin-fourier/0002-transaction-poll-dmabuf-fd-directly-upstream-shape.patch
git push origin HEAD

Then create the MR via the web UI (a banner appears on the git push output linking to the create-MR page).

Reviewers / labels

Recommended reviewers (these tend to handle KWin Wayland MRs):

  • Vlad Zahorodnii (vladz) — KWin Wayland scene/transaction owner
  • Xaver Hugl (zamundaaa) — KWin DRM / present pipeline
  • David Edmundson — KWin generally

Labels to suggest:

  • Component::Wayland
  • Performance
  • After acceptance, request a Backport::Plasma/6.6 (or whichever LTS branch is current) for downstream distros.