x11-session-research/phase0_evidence/wayland_baseline_2026-05-03/drmprobe_findings.md
marfrit d1285ca2b2 Phase 0: drmprobe findings + A1 corrigendum + worklist update
The drm_info-during-playback probe under Wayland confirmed
KWin IS GL-compositing per frame: Plane 39 rotates triple-
buffered ABGR8888 framebuffers (FB IDs 60/61/66) during
playback. The earlier "0% kwin CPU = direct-scanout" reading
in a1_summary.md was a CPU-blind-spot artifact — Panfrost
shader work isn't visible to top or perf-on-userspace.

Corrigendum added to a1_summary.md preserving the original
text as the on-the-day record.

Worklist: A1 entry updated to point at both summaries.

The probe itself was invasive (drm_info every 3s perturbed
KWin's atomic-commit fast-path, kwin %CPU jumped 0->18 median
and 6 drops appeared) — usable diagnostically but cannot be
embedded in measurement reps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:46:21 +00:00

# drm_info-during-playback probe — verdict
**Date:** 2026-05-03 ~17:00 CEST.

**Goal:** confirm or refute the "KWin direct-scanout" hypothesis
that A1's 0%-kwin-CPU result suggested. Capture
`drm_info` snapshots every 3 s during a 70 s
chromium-fourier-kwin playback rep, inspect Plane 39's framebuffer
state.

**Outcome:** **direct-scanout hypothesis refuted.** KWin IS
GL-compositing every frame and rotating triple-buffered ABGR8888
framebuffers on Plane 39. The CPU-side measurement (`top -p kwin_wayland`,
`perf record`) was blind to the work because it's all GPU-side
(Panfrost shader). The 0%-kwin-CPU result in the A1 reps was
misleading.
## Plane 39 framebuffer state — decisive
| Snapshot | Plane 39 FB ID | Format | Plane 45 |
|---|---|---|---|
| `00_pre.txt` (no chrome) | 60 | ABGR8888 | unused (FB 0) |
| `01.txt` to `23.txt` (during 70 s playback) | **rotates 60 / 61 / 66** | ABGR8888 | unused (FB 0) |
| `99_post_capture.txt` | 66 | ABGR8888 | unused (FB 0) |
Three framebuffers rotating during playback is **classic
triple-buffering**: KWin renders the next frame into the
not-currently-scanned-out buffer while DRM scans out from the
previous one. Three IDs spread across 23 snapshots taken at
arbitrary phase, all ABGR8888 — KWin is doing per-frame GL
composite into rotating RGB framebuffers.
**No NV12 ever lands on Plane 39.** No buffer's FOURCC ever flips
from ABGR8888. Direct-scanout of chrome's hardware-decoded NV12
buffer is not engaged.
Plane 45 (Overlay) stays unused throughout — FB ID = 0, CRTC = 0.
The hypothesised "video on Plane 39 + desktop on Plane 45"
arrangement is structurally available on this hardware but not
exercised by KWin in this configuration.
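
The reading of the table above can be expressed as a small check over per-snapshot values. This is a minimal sketch, assuming the `(FB ID, format)` pair for Plane 39 has already been extracted from each `drm_info` snapshot; the helper name `summarise_plane` is hypothetical, and the example sequence is illustrative only (the findings record the set of IDs and the format, not the order).

```python
from collections import Counter


def summarise_plane(snapshots, expected_format="ABGR8888"):
    """Summarise (fb_id, fourcc) pairs captured for one plane.

    `snapshots` holds one tuple per drm_info capture. Extraction from the
    raw text is assumed to have happened elsewhere and is not shown here.
    """
    fb_counts = Counter(fb_id for fb_id, _ in snapshots)
    formats = {fourcc for _, fourcc in snapshots}
    return {
        "distinct_fbs": sorted(fb_counts),       # which FB IDs were seen
        "rotating": len(fb_counts) > 1,           # more than one FB in use
        "format_changed": formats != {expected_format},  # e.g. NV12 showed up
        "fb_counts": dict(fb_counts),
    }


# Hypothetical order: only the IDs 60/61/66 and ABGR8888 come from the table.
example = [(60, "ABGR8888"), (61, "ABGR8888"), (66, "ABGR8888"),
           (61, "ABGR8888"), (60, "ABGR8888")]
summary = summarise_plane(example)
print(summary["distinct_fbs"], summary["rotating"], summary["format_changed"])
```

A direct-scanout result would look different: `format_changed` would be true (an NV12 buffer on the plane), which is the signal the probe was looking for and did not find.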
## Why A1's `top -p kwin_wayland` and perf record showed 0 %
KWin's per-frame work happens on the GPU via Mesa Panfrost. Per-frame
GL composite invocations dispatch to Panfrost's kernel driver and
the Mali-G52 hardware. The kwin_wayland userspace process spends
microseconds queuing GL commands and the rest of its time blocked
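in poll/futex waits. `top -p kwin_wayland -d 1` reports the CPU time
accumulated over each 1 s interval, and 24 bursts of a few
microseconds each sum to well under 0.1 % of a core, so the figure
rounds to 0, especially when most of the time is spent waiting on GPU
fences. `perf record` on the userspace PID likewise shows none of it:
the shader work executes on the GPU, not as CPU time in the profiled
process.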
What we DID see in perf reports:

- DBus message handling, libz, Qt event dispatcher, kernel memcpy.

What we DIDN'T see (and now know is happening):

- The actual GL composite per frame, dispatched as Mesa GL calls
  to `/dev/dri/renderD128` (panfrost). 24 fps × KWin's
  composite-shader = real GPU work, just not CPU-visible.
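
The arithmetic behind the blind spot can be sketched in a few lines. The per-frame CPU cost used here (20 µs) is an assumed figure for illustration, not a measurement from these reps; the function name is hypothetical.

```python
def cpu_share_percent(fps, busy_us_per_frame):
    """Percent of one CPU core consumed by short per-frame bursts."""
    busy_us_per_second = fps * busy_us_per_frame
    return busy_us_per_second / 1_000_000 * 100


# 20 µs of command queuing per frame is an assumption, not measured data.
share = cpu_share_percent(fps=24, busy_us_per_frame=20)
print(f"{share:.3f} % of one core")
```

Under that assumption the CPU share is about 0.05 % of a core, which a one-decimal %CPU column shows as 0.0 no matter how much GPU time each of those 24 submissions triggers.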
## Heisenberg moment: drm_info itself perturbed the result
This rep showed:
| | drmprobe rep | A1 rep 1 (no probe) |
|---|---|---|
| drops_total | 11 | 0 |
| drops_post_warmup | 6 | 0 |
| frames_total | 1662 | 1685 |
| kwin %CPU median | **18.00** | **0.00** |
| kwin %CPU max | 31.9 | 4.0 |
The act of running `drm_info` every 3 s (opening
`/dev/dri/card0` and querying every plane property)
forced KWin into a slower path. Median `kwin %CPU` jumped from 0
to 18 %, and 6 frames were dropped post-warmup over the
70 s window.
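
For scale, the table's drop counts can be put over the frame counts. A minimal sketch using only the `drops_total` and `frames_total` rows; the dictionary keys `drmprobe` and `a1_rep1` are labels chosen here, not names from the measurement tooling.

```python
def drop_rate_percent(drops, frames_total):
    """Dropped frames as a percentage of all frames in the rep."""
    return drops / frames_total * 100


# Values copied from the table above.
reps = {
    "drmprobe": {"drops_total": 11, "frames_total": 1662},
    "a1_rep1": {"drops_total": 0, "frames_total": 1685},
}
rates = {
    name: drop_rate_percent(rep["drops_total"], rep["frames_total"])
    for name, rep in reps.items()
}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} % of frames dropped")
```

The probed rep stays below 1 % dropped frames, so the perturbation is visible mainly in the kwin %CPU column rather than in playback quality.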
Why: the mechanism was not isolated. Opening `/dev/dri/card0` from a
third-party client may have (a) invalidated KWin's atomic-commit
fast-path, (b) triggered a KWin heuristic that backs off when another
DRM master might be present, or (c) made the kernel flush in-flight
commits before answering the property query. Whatever the mechanism,
the probe is invasive: it changes the very behavior it's measuring.
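
For reference, a periodic probe of this kind might look like the sketch below. It is not the script used for this rep: the function name, file naming and defaults are assumptions, and the only external call is `drm_info` with no arguments, whose output is saved verbatim.

```python
import subprocess
import time
from pathlib import Path


def capture_snapshots(out_dir, count, interval_s=3.0, cmd=("drm_info",)):
    """Run `cmd` `count` times, `interval_s` apart.

    Each run's stdout is written to a numbered file (01.txt, 02.txt, ...)
    under `out_dir`. Returns the list of written paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(1, count + 1):
        # check=True raises if the command fails, so a broken probe
        # does not silently produce empty snapshots.
        result = subprocess.run(list(cmd), capture_output=True,
                                text=True, check=True)
        path = out / f"{i:02d}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i < count:
            time.sleep(interval_s)
    return paths
```

Calling `capture_snapshots("snapshots", count=23)` would reproduce the cadence described above. Given the perturbation it causes, the safer use is a single call with `count=1` before and after a rep.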
The 18 % kwin %CPU result is informative even though it's
"perturbed" — it shows that **when KWin can't engage its
fastest path, its userspace cost becomes ~18 %** for 24 fps
H.264 composite. That's roughly half the predecessor's
~36 % kwin %CPU, suggesting the predecessor's reps were
running an even slower path (perhaps without the
kwin-fourier patches' `watchDmaBuf no-op` optimization, or
with thermal contention).
## What this means for the campaign
**The campaign's mechanism is intact and the matrix has real
work to do.**
Under Plasma Wayland, KWin **is** doing the per-frame GL
composite of chrome's RGB surface (which itself includes the
chrome-side GL composite of the NV12 video texture). That's
the cost the campaign's "without-KWin" cells were designed to
avoid. The cost is real — it's just GPU-side, not
CPU-side, so we need GPU-aware metrics to see it.
**The matrix's `effective_fps` and `drops_post_warmup`
metrics are still valid**: they measure user-visible
playback quality regardless of where the cost is. Today the
GPU has enough headroom for both chrome's internal composite
and KWin's additional composite at 24 fps, so both sessions
could deliver clean playback. Under stress the difference
should surface, because only the Wayland session carries the
extra composite.
**Better metrics to add to Phase 1 binding cells:**
- **Panfrost GPU utilization** via
`/sys/class/devfreq/fde60000.gpu/load_busy_percent` or
`panfrost_pmu` perf events — measures the GPU work top
can't see.
- **Frame latency** (commit → present delta) via
`wp_presentation_feedback` on Wayland and
`XPresent` notify events on X11 — directly measures the
per-frame composite delay.
- **Power consumption / battery drain** during sustained
playback, to quantify the cumulative GPU-cycle cost.

Without these, top + perf alone *will* miss the campaign's
mechanism every time.
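
A GPU-load sampler for the first metric could be as small as the sketch below. It assumes the sysfs file begins with an integer percentage (possibly followed by other text); that format, the helper names and the sampling interval are assumptions, and the path is passed in rather than hard-coded because the node name may differ between kernels.

```python
import re
import time
from pathlib import Path


def read_gpu_load(path):
    """Return the leading integer in a sysfs load file, or None."""
    text = Path(path).read_text().strip()
    match = re.match(r"\d+", text)
    return int(match.group()) if match else None


def sample_gpu_load(path, samples, interval_s=0.5):
    """Collect `samples` readings, `interval_s` apart, skipping bad reads."""
    readings = []
    for i in range(samples):
        value = read_gpu_load(path)
        if value is not None:
            readings.append(value)
        if i < samples - 1:
            time.sleep(interval_s)
    return readings
```

Unlike the `drm_info` probe, this only reads a sysfs attribute and never opens `/dev/dri/card0`. Whether it is cheap enough to run inside a measurement rep would still need to be checked against an unprobed rep.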
## Why the result restores (not weakens) the campaign
The A1 verdict at `a1_summary.md` interpreted 0 % kwin CPU as
"KWin doing nothing per frame, direct-scanout engaged,
campaign hypothesis structurally weakened." That interpretation
was wrong because of the GPU-vs-CPU blind spot.
The correct reading: KWin is doing exactly the GL composite
the campaign's premise named. CPU measurement isn't sufficient
to see it. The matrix needs GPU-side metrics. Once added,
the X11 cells should be able to demonstrate a real (if subtle)
delta against the Wayland baseline.
The original framing "stutterless playback - possible with X11?
proven impossible with Wayland" remains intact: Wayland's
per-frame composite is real, and under enough load it WILL
miss frames (predecessor's data is consistent with this; today's
clean Wayland just means GPU has headroom for this specific
24 fps workload). X11 + non-compositing WM removes the second
composite step entirely.
## What should change in `a1_summary.md`
The "Major reframing finding" section overstates the result.
The data is consistent with KWin doing per-frame GL composite
that's invisible to CPU instrumentation. The "structurally
weakened campaign hypothesis" claim should be retracted.
I'll add a corrigendum at the top of `a1_summary.md` linking
to this file and noting the corrected interpretation. Not
rewriting the original text — keeping it as the original
record of what I concluded with the wrong instrumentation.
## Phase 1 implications
Updated:
1. **Binding cells need GPU-aware metrics.** CPU-only
instrumentation will miss the campaign's mechanism in any
condition that isn't already at the GPU contention limit.
2. **The drm_info probe is too invasive to embed in
measurement reps.** Use it diagnostically before/after
reps, not during. Direct-scanout decisions can be inferred
by comparing the framebuffer-rotation pattern at boot, at idle
and during playback in a dedicated diagnostic run; they do not
need to be sampled in every rep.
3. **The matrix value isn't lost.** The X11 cells will save
one GPU pass per frame regardless. Phase 1 should pick a
workload that pushes the GPU closer to its limit so the
per-frame saving shows up as a difference in drops. 1080p60
H.264 (vs the current 24 fps) is the obvious bump.
4. **mpv `--vo=xv` mechanism test still wanted.** Now framed
as "does the X server schedule NV12 directly to Plane 39
for an Xv client, bypassing the GL composite that both
browsers and Wayland need?" Answer requires a native X11
session.
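
The case for the 60 fps bump in item 3 can be made concrete with frame-budget arithmetic. This sketch assumes two composite passes per frame under Wayland (chrome's internal composite plus KWin's) and one under X11 with a non-compositing WM, as described above; it says nothing about how long each pass takes, and the function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def composite_passes_per_second(fps, passes_per_frame):
    """Total GPU composite passes issued per second."""
    return fps * passes_per_frame


for fps in (24, 60):
    wayland = composite_passes_per_second(fps, passes_per_frame=2)
    x11 = composite_passes_per_second(fps, passes_per_frame=1)
    print(f"{fps} fps: budget {frame_budget_ms(fps):.1f} ms/frame, "
          f"{wayland - x11} passes/s saved without the second composite")
```

Moving from 24 to 60 fps cuts the per-frame budget from about 41.7 ms to about 16.7 ms while multiplying the number of saved passes by 2.5, which is why the higher frame rate is more likely to turn the saving into a visible difference in drops.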