d1285ca2b2
The drm_info-during-playback probe under Wayland confirmed KWin IS GL-compositing per frame: Plane 39 rotates triple-buffered ABGR8888 framebuffers (FB IDs 60/61/66) during playback. The earlier "0% kwin CPU = direct-scanout" reading in a1_summary.md was a CPU-blind-spot artifact: Panfrost shader work isn't visible to top or perf-on-userspace. Corrigendum added to a1_summary.md preserving the original text as the on-the-day record. Worklist: A1 entry updated to point at both summaries. The probe itself was invasive (drm_info every 3s perturbed KWin's atomic-commit fast-path, kwin %CPU jumped 0->18 median and 6 drops appeared): usable diagnostically but cannot be embedded in measurement reps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# drm_info-during-playback probe: verdict

**Date:** 2026-05-03 ~17:00 CEST.
**Goal:** confirm or refute the "KWin direct-scanout" hypothesis
that A1's 0%-kwin-CPU result suggested. Capture
`drm_info` snapshots every 3 s during a 70 s
chromium-fourier-kwin playback rep, inspect Plane 39's framebuffer
state.
**Outcome:** **direct-scanout hypothesis refuted.** KWin IS
GL-compositing every frame and rotating triple-buffered ABGR8888
framebuffers on Plane 39. The CPU-side measurement (`top -p kwin_wayland`,
`perf record`) was blind to the work because it's all GPU-side
(Panfrost shader). The 0%-kwin-CPU result in the A1 reps was
misleading.

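
A minimal sketch of the kind of capture loop the goal describes, for readers who want to reproduce the probe. The function name `capture_snapshots` and its parameters are hypothetical, not the script actually used; the `NN.txt` naming mirrors the snapshot files listed in this write-up, and it assumes `drm_info` is on `PATH` and prints its report to stdout. Keep in mind that this very loop turned out to perturb KWin, so it belongs outside measurement reps.

```python
import subprocess
import time
from pathlib import Path


def capture_snapshots(out_dir, count=23, interval_s=3.0,
                      cmd=("drm_info",), sleep=time.sleep):
    """Run `cmd` `count` times, `interval_s` apart, saving stdout to 01.txt, 02.txt, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(1, count + 1):
        # check=False: a failed snapshot still leaves an (empty) file as evidence.
        result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
        path = out / f"{i:02d}.txt"
        path.write_text(result.stdout)
        written.append(path)
        if i < count:
            sleep(interval_s)
    return written


if __name__ == "__main__":
    # 23 snapshots at 3 s spacing span 66 s, which fits inside a 70 s rep.
    capture_snapshots("drmprobe_snapshots")
```

The `cmd` and `sleep` parameters exist only so the loop can be exercised without a DRM device or real delays.
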
## Plane 39 framebuffer state: decisive

| Snapshot | Plane 39 FB ID | Format | Plane 45 |
|---|---|---|---|
| `00_pre.txt` (no chrome) | 60 | ABGR8888 | unused (FB 0) |
| `01.txt`–`23.txt` (during 70 s playback) | **rotates 60 / 61 / 66** | ABGR8888 | unused (FB 0) |
| `99_post_capture.txt` | 66 | ABGR8888 | unused (FB 0) |

Three framebuffers rotating during playback is **classic
triple-buffering**: KWin renders the next frame into a buffer
that is not currently being scanned out while DRM scans out from
the previous one. Three IDs spread across 23 snapshots taken at
arbitrary phase, all ABGR8888, mean KWin is doing a per-frame GL
composite into rotating RGB framebuffers.

**No NV12 ever lands on Plane 39.** No buffer's FOURCC ever flips
from ABGR8888. Direct-scanout of chrome's hardware-decoded NV12
buffer is not engaged.

Plane 45 (Overlay) stays unused throughout: FB ID = 0, CRTC = 0.
The hypothesised "video on Plane 39 + desktop on Plane 45"
arrangement is structurally available on this hardware but not
exercised by KWin in this configuration.

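
The reasoning above reduces to two checks per plane: do several FB IDs appear across snapshots, and does the format ever become NV12? Here is a small sketch of that check. It assumes the FB ID and FOURCC have already been extracted from each snapshot by hand or by a separate parser; `summarize_plane` is a hypothetical helper, and the sample sequence is illustrative only, since the per-snapshot order of IDs is not recorded here.

```python
from collections import Counter


def summarize_plane(samples):
    """Summarize one plane from a list of (fb_id, fourcc) tuples, one per snapshot."""
    ids = Counter(fb_id for fb_id, _ in samples)
    formats = {fourcc for _, fourcc in samples}
    return {
        "distinct_fb_ids": sorted(ids),
        "formats": sorted(formats),
        "rotating": len(ids) > 1,          # more than one FB ID seen on the plane
        "nv12_seen": "NV12" in formats,    # would indicate direct scanout of video
    }


# Illustrative order only: the real snapshots showed IDs 60, 61 and 66,
# but their exact sequence is an assumption made for this example.
samples = [(60, "ABGR8888"), (61, "ABGR8888"), (66, "ABGR8888"), (60, "ABGR8888")]
print(summarize_plane(samples))
```

A plane showing a single FB ID with an NV12 format during playback would point the other way, towards direct scanout.
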
## Why A1's `top -p kwin_wayland` and `perf record` showed 0 %

KWin's per-frame work happens on the GPU via Mesa Panfrost. Per-frame
GL composite invocations dispatch to Panfrost's kernel driver and
the Mali-G52 hardware. The kwin_wayland userspace process spends
microseconds queuing GL commands and the rest of its time blocked
in poll/futex waits. `top -p kwin_wayland -d 1` samples %CPU at
1 Hz, and that resolution can't see microseconds-per-frame bursts,
especially when most of the time is spent waiting on GPU fences.
`perf record` on the userspace PID likewise misses kernel-side
panfrost work.

What we DID see in perf reports:
- DBus message handling, libz, Qt event dispatcher, kernel memcpy.

What we DIDN'T see (and now know is happening):
- The actual GL composite per frame, dispatched as Mesa GL calls
  to `/dev/dri/renderD128` (panfrost). 24 fps × KWin's
  composite-shader = real GPU work, just not CPU-visible.

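
To make the blind spot concrete, here is a back-of-envelope sketch of what a process that is busy for only microseconds per frame averages out to in %CPU. The per-frame figures are assumed values for illustration, not measurements from these reps, and `cpu_percent` is a hypothetical helper.

```python
def cpu_percent(busy_us_per_frame, fps):
    """Average %CPU (of one core) for a process busy `busy_us_per_frame` microseconds per frame."""
    return busy_us_per_frame * fps / 1_000_000 * 100


# Assumed per-frame CPU costs, chosen only to show the order of magnitude.
for busy_us in (50, 200, 1000):
    print(f"{busy_us:>5} us/frame at 24 fps -> {cpu_percent(busy_us, 24):.2f} %CPU")
```

Under these assumptions, even a full millisecond of CPU work per frame averages to only a few percent at 24 fps, and a few hundred microseconds stays under 1 %, which is hard to tell apart from an idle process at 1 Hz sampling.
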
## Heisenberg moment: drm_info itself perturbed the result

This rep showed:

| Metric | drmprobe rep | A1 rep 1 (no probe) |
|---|---|---|
| drops_total | 11 | 0 |
| drops_post_warmup | 6 | 0 |
| frames_total | 1662 | 1685 |
| kwin %CPU median | **18.00** | **0.00** |
| kwin %CPU max | 31.9 | 4.0 |

Running `drm_info` every 3 s (opening `/dev/dri/card0` and
querying every plane property) forced KWin into a slower path.
kwin %CPU jumped from 0 to 18 % median, and 6 frames were dropped
post-warmup over the 70 s window.

Why: a third-party client opening `/dev/dri/card0` likely
(a) invalidated KWin's atomic-commit fast-path,
(b) triggered a KWin heuristic that backs off because another DRM
master might be present, or (c) made the kernel flush
in-flight commits before answering the property query. Whichever
the mechanism, the probe is invasive: it changes the very
behavior it's measuring.

The 18 % kwin %CPU result is informative even though it's
"perturbed". It shows that **when KWin can't engage its
fastest path, its userspace cost becomes ~18 %** for 24 fps
H.264 composite. That's roughly half the predecessor's
~36 % kwin %CPU, suggesting the predecessor's reps were
running an even slower path (perhaps without the
kwin-fourier patches' `watchDmaBuf no-op` optimization, or
with thermal contention).

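
The table's raw counts can be turned into rates for easier comparison. This sketch assumes the counters cover the full 70 s window and expresses drops relative to `frames_total`; both are assumptions about how the harness counts, and `derived` is a hypothetical helper rather than part of the measurement scripts.

```python
def derived(frames_total, drops_total, window_s=70.0):
    """Derive average frame rate and drop percentage from raw rep counters."""
    return {
        "avg_fps": frames_total / window_s,
        "drop_pct": 100.0 * drops_total / frames_total,
    }


# Counts taken from the table above; window_s=70 is the stated rep length.
probe = derived(frames_total=1662, drops_total=11)
baseline = derived(frames_total=1685, drops_total=0)

for name, row in (("drmprobe rep", probe), ("A1 rep 1", baseline)):
    print(f"{name}: {row['avg_fps']:.2f} fps average, {row['drop_pct']:.2f} % drops")
```
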
## What this means for the campaign

**The campaign's mechanism is intact and the matrix has real
work to do.**

Under Plasma Wayland, KWin **is** doing the per-frame GL
composite of chrome's RGB surface (which itself includes the
chrome-side GL composite of the NV12 video texture). That's
the cost the campaign's "without-KWin" cells were designed to
avoid. The cost is real. It's just GPU-side, not
CPU-side, so we need GPU-aware metrics to see it.

**The matrix's `effective_fps` and `drops_post_warmup`
metrics are still valid**: they measure user-visible
playback quality regardless of where the cost is. Today's GPU
has enough headroom for both chrome's internal composite and
KWin's additional composite at 24 fps, which is why both
sessions could deliver clean playback today. Under stress, only
the Wayland session carries the extra composite.

**Better metrics to add to Phase 1 binding cells:**
- **Panfrost GPU utilization** via
  `/sys/class/devfreq/fde60000.gpu/load_busy_percent` or
  `panfrost_pmu` perf events. This measures the GPU work `top`
  can't see.
- **Frame latency** (commit → present delta) via
  `wp_presentation_feedback` on Wayland and
  `XPresent` notify events on X11. This directly measures the
  per-frame composite delay.
- **Power consumption / battery drain** during sustained
  playback. This quantifies the cumulative GPU-cycle cost.

Without these, top + perf alone *will* miss the campaign's
mechanism every time.

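
As a starting point for the GPU-utilization metric, here is a sketch of a sampler that polls a sysfs-style load file once per second and reports the same median/max summary used for kwin %CPU. The path is passed in rather than hard-coded because the file name quoted above has not been verified on this kernel; `read_percent` and `sample_load` are hypothetical helpers, and the parser assumes only that the file starts with an integer.

```python
import statistics
import time
from pathlib import Path


def read_percent(path):
    """Return the leading integer in a sysfs-style file; any trailing text is ignored."""
    text = Path(path).read_text().strip()
    digits = ""
    for ch in text:
        if ch.isdigit():
            digits += ch
        else:
            break
    if not digits:
        raise ValueError(f"no leading integer in {path!r}: {text!r}")
    return int(digits)


def sample_load(path, samples=70, interval_s=1.0, sleep=time.sleep):
    """Poll `path` `samples` times and summarize the readings."""
    values = []
    for i in range(samples):
        values.append(read_percent(path))
        if i < samples - 1:
            sleep(interval_s)
    return {"median": statistics.median(values), "max": max(values)}
```

Reading one small file per second is far lighter than a full `drm_info` run, but whether it is light enough to leave KWin's fast path undisturbed would still need checking against an unprobed rep.
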
## Why the result restores (not weakens) the campaign

The A1 verdict at `a1_summary.md` interpreted 0 % kwin CPU as
"KWin doing nothing per frame, direct-scanout engaged,
campaign hypothesis structurally weakened." That interpretation
was wrong because of the GPU-vs-CPU blind spot.

The correct reading: KWin is doing exactly the GL composite
the campaign's premise named. CPU measurement isn't sufficient
to see it. The matrix needs GPU-side metrics. Once added,
the X11 cells should be able to demonstrate a real (if subtle)
delta against the Wayland baseline.

The original framing "stutterless playback - possible with X11?
proven impossible with Wayland" remains intact: Wayland's
per-frame composite is real, and under enough load it WILL
miss frames (predecessor's data is consistent with this; today's
clean Wayland just means GPU has headroom for this specific
24 fps workload). X11 + non-compositing WM removes the second
composite step entirely.

## What should change in `a1_summary.md`

The "Major reframing finding" section overstates the result.
The data is consistent with KWin doing per-frame GL composite
that's invisible to CPU instrumentation. The "structurally
weakened campaign hypothesis" claim should be retracted.

I'll add a corrigendum at the top of `a1_summary.md` linking
to this file and noting the corrected interpretation. I am not
rewriting the original text; it stays as the original
record of what I concluded with the wrong instrumentation.

## Phase 1 implications

Updated:
1. **Binding cells need GPU-aware metrics.** CPU-only
   instrumentation will miss the campaign's mechanism in any
   condition that isn't already at the GPU contention limit.
2. **The drm_info probe is too invasive to embed in
   measurement reps.** Use it diagnostically before/after
   reps, not during. Direct-scanout decisions can be inferred
   from the framebuffer-rotation pattern at idle vs at boot vs
   per-frame; we don't need a probe in every rep.
3. **The matrix value isn't lost.** The X11 cells will save
   one GPU pass per frame regardless. Phase 1 should pick a
   workload that pushes the GPU closer to its limit so the
   per-frame saving shows up as a difference in drops. 1080p60
   H.264 (vs the current 24 fps) is the obvious bump.
4. **mpv `--vo=xv` mechanism test still wanted.** Now framed
   as "does the X server schedule NV12 directly to Plane 39
   for an Xv client, bypassing the GL composite that both
   browsers and Wayland need?" Answer requires a native X11
   session.

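
Item 3's case for a 60 fps workload follows from simple frame-budget arithmetic, sketched below. It assumes exactly one extra KWin composite pass per displayed frame, as stated above; `frame_budget_ms` is a hypothetical helper, and no GPU pass durations are assumed because none were measured.

```python
def frame_budget_ms(fps):
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


for fps in (24, 60):
    print(f"{fps} fps: {frame_budget_ms(fps):.2f} ms per frame, "
          f"{fps} extra composite passes per second under KWin")

# Ratio of extra passes per second when moving from 24 fps to 60 fps.
print(f"60 fps vs 24 fps: {60 / 24:.1f}x the composite passes, "
      f"each with {frame_budget_ms(60) / frame_budget_ms(24):.0%} of the time budget")
```

Moving from 24 fps to 60 fps shrinks the per-frame budget from about 41.67 ms to about 16.67 ms while multiplying the number of extra passes by 2.5, so any fixed per-pass GPU cost eats a much larger share of each frame.
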