x11-session-research/phase0_evidence/wayland_baseline_2026-05-03/drmprobe_findings.md
marfrit d1285ca2b2 Phase 0: drmprobe findings + A1 corrigendum + worklist update
The drm_info-during-playback probe under Wayland confirmed
KWin IS GL-compositing per frame: Plane 39 rotates triple-
buffered ABGR8888 framebuffers (FB IDs 60/61/66) during
playback. The earlier "0% kwin CPU = direct-scanout" reading
in a1_summary.md was a CPU-blind-spot artifact — Panfrost
shader work isn't visible to top or perf-on-userspace.

Corrigendum added to a1_summary.md preserving the original
text as the on-the-day record.

Worklist: A1 entry updated to point at both summaries.

The probe itself was invasive (drm_info every 3s perturbed
KWin's atomic-commit fast-path, kwin %CPU jumped 0->18 median
and 6 drops appeared) — usable diagnostically but cannot be
embedded in measurement reps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:46:21 +00:00


drm_info-during-playback probe — verdict

Date: 2026-05-03 ~17:00 CEST.

Goal: confirm or refute the "KWin direct-scanout" hypothesis that A1's 0%-kwin-CPU result suggested.

Method: capture drm_info snapshots every 3 s during a 70 s chromium-fourier-kwin playback rep and inspect Plane 39's framebuffer state.

Outcome: direct-scanout hypothesis refuted. KWin IS GL-compositing every frame and rotating triple-buffered ABGR8888 framebuffers on Plane 39. The CPU-side measurement (top -p kwin_wayland, perf record) was blind to the work because it's all GPU-side (Panfrost shader). The 0%-kwin-CPU result in the A1 reps was misleading.
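The capture procedure (one snapshot every 3 s across the 70 s rep) could be scripted roughly as follows. This is a minimal sketch, not the script that produced the evidence files: the function name `capture_snapshots`, its parameters, and the `NN.txt` naming are assumptions chosen to match the snapshot names in the table below. It assumes `drm_info` is on `PATH` and prints its report to stdout when run without arguments.

```python
import subprocess
import time
from pathlib import Path


def capture_snapshots(out_dir, command=("drm_info",), interval_s=3.0, duration_s=70.0):
    """Run `command` every `interval_s` seconds and save each stdout to NN.txt.

    Hypothetical helper: names and file layout are illustrative only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # 70 s at 3 s spacing yields 23 snapshots (01.txt .. 23.txt).
    count = int(duration_s // interval_s)
    written = []
    for i in range(1, count + 1):
        result = subprocess.run(list(command), capture_output=True, text=True, check=False)
        path = out / f"{i:02d}.txt"
        path.write_text(result.stdout)
        written.append(path)
        time.sleep(interval_s)
    return written
```

Calling `capture_snapshots("probe_out")` while playback runs would write the numbered snapshots. As the later sections show, running this loop during a measurement rep perturbs KWin, so it is only suitable for diagnostic runs.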

Plane 39 framebuffer state — decisive

| Snapshot | Plane 39 FB ID | Format | Plane 45 |
|---|---|---|---|
| 00_pre.txt (no chrome) | 60 | ABGR8888 | unused (FB 0) |
| 01.txt to 23.txt (during 70 s playback) | rotates 60 / 61 / 66 | ABGR8888 | unused (FB 0) |
| 99_post_capture.txt | 66 | ABGR8888 | unused (FB 0) |

Three framebuffers rotating during playback is classic triple-buffering: KWin renders the next frame into the not-currently-scanned-out buffer while DRM scans out from the previous one. Three IDs spread across 23 snapshots taken at arbitrary phase, all ABGR8888 — KWin is doing per-frame GL composite into rotating RGB framebuffers.

No NV12 ever lands on Plane 39. No buffer's FOURCC ever flips from ABGR8888. Direct-scanout of chrome's hardware-decoded NV12 buffer is not engaged.

Plane 45 (Overlay) stays unused throughout — FB ID = 0, CRTC = 0. The hypothesised "video on Plane 39 + desktop on Plane 45" arrangement is structurally available on this hardware but not exercised by KWin in this configuration.
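The reasoning in this section (several rotating FB IDs, a single RGB format, no NV12) can be expressed as a small check over per-snapshot plane state. This is a sketch under assumptions: `PlaneSample` and `summarize_plane` are hypothetical names, and the caller is assumed to have already extracted the FB ID and format for one plane from each snapshot by hand or with their own parser. No drm_info parsing is attempted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaneSample:
    fb_id: int   # framebuffer attached to the plane; 0 means the plane is unused
    fourcc: str  # pixel format name, e.g. "ABGR8888" or "NV12"


def summarize_plane(samples, composite_format="ABGR8888"):
    """Summarize FB rotation and formats seen on one plane across snapshots."""
    active = [s for s in samples if s.fb_id != 0]
    fb_ids = sorted({s.fb_id for s in active})
    formats = sorted({s.fourcc for s in active})
    return {
        "fb_ids": fb_ids,
        "formats": formats,
        # More than one FB ID over time means the compositor is flipping buffers.
        "rotating": len(fb_ids) > 1,
        # True only if every active sample used the compositor's RGB format.
        "only_composite_format": formats == [composite_format],
    }
```

A plane that reports `rotating` and `only_composite_format` together is consistent with per-frame GL compositing. A video format such as NV12 appearing on the plane would instead point toward direct scanout of the decoder's buffer.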

Why A1's top -p kwin_wayland and perf record showed 0 %

KWin's per-frame work happens on the GPU via Mesa Panfrost. Per-frame GL composite invocations dispatch to Panfrost's kernel driver and the Mali-G52 hardware. The kwin_wayland userspace process spends microseconds queuing GL commands and the rest of its time blocked in poll/futex waits. top -p kwin_wayland -d 1 reports %CPU averaged over each 1 s refresh interval, so a few microseconds of CPU per frame at 24 fps rounds to 0 %, especially when most of the wall time is spent waiting on GPU fences. perf record on the userspace PID likewise cannot see work that executes on the GPU or in kernel contexts outside that PID.

What we DID see in perf reports:

  • DBus message handling, libz, Qt event dispatcher, kernel memcpy.

What we DIDN'T see (and now know is happening):

  • The actual GL composite per frame, dispatched as Mesa GL calls to /dev/dri/renderD128 (panfrost). 24 fps × KWin's composite-shader = real GPU work, just not CPU-visible.
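A quick back-of-the-envelope calculation shows why a per-frame burst of this size disappears in top's output. The sketch below is illustrative only: `cpu_percent` is a hypothetical helper, and the 50 µs per-frame figure in the usage note is an assumed value, not a measurement from these reps.

```python
def cpu_percent(fps, cpu_us_per_frame):
    """Average percent of one core consumed by a periodic burst of CPU work.

    fps: bursts per second.
    cpu_us_per_frame: CPU time per burst in microseconds (assumed, not measured).
    """
    busy_seconds_per_second = fps * cpu_us_per_frame / 1_000_000
    return busy_seconds_per_second * 100
```

With an assumed 50 µs of command queuing per frame at 24 fps, `cpu_percent(24, 50)` comes to roughly 0.12 % of one core, which is at the edge of what top's one-decimal display can show and well inside its noise.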

Heisenberg moment: drm_info itself perturbed the result

This rep showed:

| Metric | drmprobe rep | A1 rep 1 (no probe) |
|---|---|---|
| drops_total | 11 | 0 |
| drops_post_warmup | 6 | 0 |
| frames_total | 1662 | 1685 |
| kwin %CPU median | 18.00 | 0.00 |
| kwin %CPU max | 31.9 | 4.0 |

The act of running drm_info every 3 s (opening /dev/dri/card0 and querying every plane property) forced KWin into a slower path. kwin %CPU jumped from 0 to 18 % median, and 6 frames were dropped post-warmup over the 70 s window.

Why: opening /dev/dri/card0 from a third-party client likely did one of three things: (a) invalidated KWin's atomic-commit fast-path, (b) triggered a KWin heuristic that backs off because another DRM master might be present, or (c) made the kernel flush in-flight commits before answering the property query. The mechanism was not isolated. Whichever it is, the probe is invasive: it changes the very behavior it's measuring.

The 18 % kwin %CPU result is informative even though it's "perturbed" — it shows that when KWin can't engage its fastest path, its userspace cost becomes ~18 % for 24 fps H.264 composite. That's roughly half the predecessor's ~36 % kwin %CPU, suggesting the predecessor's reps were running an even slower path (perhaps without the kwin-fourier patches' watchDmaBuf no-op optimization, or with thermal contention).
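To put the %CPU figures on a per-frame footing, the median can be converted into CPU time per frame. This is a sketch with two assumptions: that top's %CPU is relative to one core (its default mode), and that the playback rate is a nominal 24 fps. `cpu_ms_per_frame` is a hypothetical helper name.

```python
def cpu_ms_per_frame(cpu_percent, fps):
    """CPU milliseconds per frame implied by an average %CPU of one core."""
    cpu_ms_per_second = cpu_percent / 100.0 * 1000.0
    return cpu_ms_per_second / fps
```

Under those assumptions, 18 % corresponds to about 7.5 ms of kwin_wayland CPU time per frame and the predecessor's ~36 % to about 15 ms, against a frame period of roughly 41.7 ms at 24 fps.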

What this means for the campaign

The campaign's mechanism is intact and the matrix has real work to do.

Under Plasma Wayland, KWin is doing the per-frame GL composite of chrome's RGB surface (which itself includes the chrome-side GL composite of the NV12 video texture). That's the cost the campaign's "without-KWin" cells were designed to avoid. The cost is real — it's just GPU-side, not CPU-side, so we need GPU-aware metrics to see it.

The matrix's effective_fps and drops_post_warmup metrics are still valid: they measure user-visible playback quality regardless of where the cost is. At 24 fps the GPU has enough headroom for both chrome's internal composite and KWin's additional composite, so both sessions could deliver clean playback today. Under stress the difference should surface, because only the Wayland session does the extra composite.

Better metrics to add to Phase 1 binding cells:

  • Panfrost GPU utilization via /sys/class/devfreq/fde60000.gpu/load_busy_percent or panfrost_pmu perf events — measures the GPU work top can't see.
  • Frame latency (commit → present delta) via wp_presentation_feedback on Wayland and XPresent notify events on X11 — directly measures the per-frame composite delay.
  • Power consumption / battery drain during sustained playback — quantifies the cumulative GPU-cycle cost.

Without these, top + perf alone will miss the campaign's mechanism every time.
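The first metric above could be collected with a small sysfs poller along these lines. This is a sketch under loud assumptions: the path is taken from the bullet above and has not been verified on this kernel, and the file is assumed to contain a busy percentage as its first integer (other text after it is ignored). `read_percent` and `sample_load` are hypothetical helper names.

```python
import re
import statistics
import time
from pathlib import Path


def read_percent(path):
    """Return the first integer found in the file, or None if there is none.

    Assumption: the sysfs node reports a busy percentage as its first number.
    """
    match = re.search(r"\d+", Path(path).read_text())
    return int(match.group()) if match else None


def sample_load(path, samples=70, interval_s=1.0):
    """Poll `path` `samples` times and summarize the readings."""
    values = []
    for _ in range(samples):
        value = read_percent(path)
        if value is not None:
            values.append(value)
        time.sleep(interval_s)
    if not values:
        return None
    return {"median": statistics.median(values), "max": max(values), "n": len(values)}
```

Reading a sysfs attribute does not open the DRM device node, so this kind of polling is less likely to disturb KWin than the drm_info probe did, though that should be confirmed with a probe-on versus probe-off comparison before it is embedded in measurement reps.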

Why the result restores (not weakens) the campaign

The A1 verdict at a1_summary.md interpreted 0 % kwin CPU as "KWin doing nothing per frame, direct-scanout engaged, campaign hypothesis structurally weakened." That interpretation was wrong because of the GPU-vs-CPU blind spot.

The correct reading: KWin is doing exactly the GL composite the campaign's premise named. CPU measurement isn't sufficient to see it. The matrix needs GPU-side metrics. Once added, the X11 cells should be able to demonstrate a real (if subtle) delta against the Wayland baseline.

The original framing "stutterless playback - possible with X11? proven impossible with Wayland" remains intact: Wayland's per-frame composite is real, and under enough load it WILL miss frames (predecessor's data is consistent with this; today's clean Wayland just means GPU has headroom for this specific 24 fps workload). X11 + non-compositing WM removes the second composite step entirely.

What should change in a1_summary.md

The "Major reframing finding" section overstates the result. The data is consistent with KWin doing per-frame GL composite that's invisible to CPU instrumentation. The "structurally weakened campaign hypothesis" claim should be retracted.

I'll add a corrigendum at the top of a1_summary.md linking to this file and noting the corrected interpretation. Not rewriting the original text — keeping it as the original record of what I concluded with the wrong instrumentation.

Phase 1 implications

Updated:

  1. Binding cells need GPU-aware metrics. CPU-only instrumentation will miss the campaign's mechanism in any condition that isn't already at the GPU contention limit.
  2. The drm_info probe is too invasive to embed in measurement reps. Use it diagnostically before/after reps, not during. Direct-scanout decisions can be inferred from the framebuffer-rotation pattern at idle vs at boot vs per-frame; we don't need to probe every rep.
  3. The matrix value isn't lost. The X11 cells will save one GPU pass per frame regardless. Phase 1 should pick a workload that pushes the GPU closer to its limit so the per-frame saving manifests as drops difference. 1080p60 H.264 (vs the current 24 fps) is the obvious bump.
  4. mpv --vo=xv mechanism test still wanted. Now framed as "does the X server schedule NV12 directly to Plane 39 for an Xv client, bypassing the GL composite that both browsers and Wayland need?" Answer requires a native X11 session.
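The case for the 60 fps bump in item 3 comes down to frame-budget arithmetic, sketched below. Both helper names are hypothetical, and the "one pass saved per frame" default simply restates the claim in item 3 rather than a measured value.

```python
def frame_budget_ms(fps):
    """Time available to produce each frame, in milliseconds."""
    return 1000.0 / fps


def composite_passes_saved_per_second(fps, passes_removed_per_frame=1):
    """GPU composite passes avoided per second when a composite step is removed."""
    return fps * passes_removed_per_frame
```

Moving from 24 fps to 60 fps shrinks the per-frame budget from about 41.7 ms to about 16.7 ms and raises the number of avoidable composite passes from 24 to 60 per second, so any fixed per-pass cost consumes 2.5 times as much of the GPU's time.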