x11-session-research/phase0_evidence/wayland_baseline_2026-05-03/drmprobe_findings.md

# drm_info-during-playback probe — verdict

**Date:** 2026-05-03 ~17:00 CEST.
**Goal:** confirm or refute the "KWin direct-scanout" hypothesis
that A1's 0%-kwin-CPU result suggested. Capture
`drm_info` snapshots every 3 s during a 70 s
chromium-fourier-kwin playback rep, inspect Plane 39's framebuffer
state.
**Outcome:** **direct-scanout hypothesis refuted.** KWin IS
GL-compositing every frame and rotating triple-buffered ABGR8888
framebuffers on Plane 39. The CPU-side measurement (`top -p kwin_wayland`,
`perf record`) was blind to the work because it's all GPU-side
(Panfrost shader). The 0%-kwin-CPU result in the A1 reps was
misleading.

## Plane 39 framebuffer state — decisive

| Snapshot | Plane 39 FB ID | Format | Plane 45 |
|---|---|---|---|
| `00_pre.txt` (no chrome) | 60 | ABGR8888 | unused (FB 0) |
| `01.txt`–`23.txt` (during 70 s playback) | **rotates 60 / 61 / 66** | ABGR8888 | unused (FB 0) |
| `99_post_capture.txt` | 66 | ABGR8888 | unused (FB 0) |

Three framebuffers rotating during playback is **classic
triple-buffering**: KWin renders the next frame into the
not-currently-scanned-out buffer while DRM scans out from the
previous one. Three IDs spread across 23 snapshots taken at
arbitrary phase, all ABGR8888 — KWin is doing per-frame GL
composite into rotating RGB framebuffers.

**No NV12 ever lands on Plane 39.** No buffer's FOURCC ever flips
from ABGR8888. Direct-scanout of chrome's hardware-decoded NV12
buffer is not engaged.
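
The reading above can be sketched as a small check over per-snapshot plane state. This is a minimal sketch, assuming the FB ID and FOURCC for Plane 39 have already been extracted from each `drm_info` dump by hand or by a separate parser. The function name `classify_plane` and the example sequence are hypothetical; the source records only that IDs 60, 61 and 66 rotate, not their order.

```python
from collections import Counter

def classify_plane(snapshots):
    """Summarise one plane's state across snapshots.

    snapshots: list of (fb_id, fourcc) tuples in capture order.
    An fb_id of 0 means the plane was unused in that snapshot.
    """
    active = [(fb, fmt) for fb, fmt in snapshots if fb != 0]
    fb_ids = sorted({fb for fb, _ in active})
    formats = Counter(fmt for _, fmt in active)
    return {
        "distinct_fb_ids": fb_ids,
        "formats": dict(formats),
        "saw_nv12": "NV12" in formats,   # would hint at video direct-scanout
        "rotating": len(fb_ids) > 1,     # several FBs hint at per-frame compositing
    }

# Hypothetical order; only the set of IDs comes from the table above.
example = [(60, "ABGR8888"), (61, "ABGR8888"), (66, "ABGR8888"), (60, "ABGR8888")]
summary = classify_plane(example)
print(summary)
```

A rotating set of RGB framebuffers with `saw_nv12` false matches the compositing reading. A plane that switched to NV12 during playback would point the other way.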

Plane 45 (Overlay) stays unused throughout — FB ID = 0, CRTC = 0.
The hypothesised "video on Plane 39 + desktop on Plane 45"
arrangement is structurally available on this hardware but not
exercised by KWin in this configuration.

## Why A1's `top -p kwin_wayland` and perf record showed 0 %

KWin's per-frame work happens on the GPU via Mesa Panfrost. Per-frame
GL composite invocations dispatch to Panfrost's kernel driver and
the Mali-G52 hardware. The kwin_wayland userspace process spends
microseconds queuing GL commands and the rest of its time blocked
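in poll/futex waits. `top -d 1` on the kwin_wayland PID derives
%CPU from the CPU time accumulated over each 1 s interval, and
microseconds per frame at 24 fps add up to well under 1 % of a
core, so the figure rounds to roughly zero while the process
mostly waits on GPU fences. `perf record` on the userspace PID
likewise misses the work: it samples CPU time, and the composite
itself executes on the Mali-G52, not on a CPU core.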
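
A back-of-envelope check of why microsecond-scale per-frame CPU work shows up as roughly 0 % may help here. The per-frame costs below are hypothetical placeholders, not measurements from these reps; only the 24 fps rate comes from the workload.

```python
def cpu_share_percent(cpu_us_per_frame: float, fps: float) -> float:
    """Percent of one core consumed by a fixed per-frame CPU cost."""
    return cpu_us_per_frame * fps / 1_000_000 * 100

# Hypothetical per-frame command-queuing costs in microseconds.
for us in (10, 50, 200):
    share = cpu_share_percent(us, 24)
    print(f"{us:>4} us/frame at 24 fps -> {share:.3f} % of one core")
```

Even the largest placeholder stays below half a percent of one core, which is consistent with a 0.00 median in a tool that reports one decimal place.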

What we DID see in perf reports:
- DBus message handling, libz, Qt event dispatcher, kernel memcpy.

What we DIDN'T see (and now know is happening):
- The actual GL composite per frame, dispatched as Mesa GL calls
  to `/dev/dri/renderD128` (panfrost). 24 fps × KWin's
  composite-shader = real GPU work, just not CPU-visible.

## Heisenberg moment: drm_info itself perturbed the result

This rep showed:

| Metric | drmprobe rep | A1 rep 1 (no probe) |
|---|---|---|
| drops_total | 11 | 0 |
| drops_post_warmup | 6 | 0 |
| frames_total | 1662 | 1685 |
| kwin %CPU median | **18.00** | **0.00** |
| kwin %CPU max | 31.9 | 4.0 |

Running `drm_info` every 3 s (each run opens
`/dev/dri/card0` and queries every plane property)
appears to have pushed KWin onto a slower path. Median
`kwin %CPU` rose from 0 to 18 %, and 6 frames were dropped
post-warmup over the 70 s window, against 0 in the unprobed
A1 rep.
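
To put the drop counts in proportion, the table's figures can be turned into drop percentages. This is a minimal sketch that assumes `frames_total` is the right denominator; the source does not say whether that counter includes dropped frames, so treat the percentages as approximate.

```python
# Figures copied from the comparison table above.
reps = {
    "drmprobe": {"drops_total": 11, "drops_post_warmup": 6, "frames_total": 1662},
    "a1_rep1": {"drops_total": 0, "drops_post_warmup": 0, "frames_total": 1685},
}

def drop_percent(drops: int, frames: int) -> float:
    """Dropped frames as a percentage of the frame count."""
    return 100.0 * drops / frames

for name, r in reps.items():
    total = drop_percent(r["drops_total"], r["frames_total"])
    post = drop_percent(r["drops_post_warmup"], r["frames_total"])
    print(f"{name}: {total:.2f} % total, {post:.2f} % post-warmup")
```

Both drmprobe figures stay under 1 % of frames, so the perturbation is small in absolute terms even though it is clearly visible next to a zero-drop rep.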

Why: the mechanism is not established. Opening /dev/dri/card0
from a third-party client may have (a) invalidated KWin's
atomic-commit fast-path, (b) triggered a KWin heuristic that
backs off when another DRM master might be present, or
(c) made the kernel flush in-flight commits before answering
the property query. Whichever it is, the probe is invasive:
it changes the very behavior it's measuring.

The 18 % kwin %CPU result is informative even though it's
"perturbed" — it shows that **when KWin can't engage its
fastest path, its userspace cost becomes ~18 %** for 24 fps
H.264 composite. That's roughly half the predecessor's
~36 % kwin %CPU, suggesting the predecessor's reps were
running an even slower path (perhaps without the
kwin-fourier patches' `watchDmaBuf no-op` optimization, or
with thermal contention).

## What this means for the campaign

**The campaign's mechanism is intact and the matrix has real
work to do.**

Under Plasma Wayland, KWin **is** doing the per-frame GL
composite of chrome's RGB surface (which itself includes the
chrome-side GL composite of the NV12 video texture). That's
the cost the campaign's "without-KWin" cells were designed to
avoid. The cost is real — it's just GPU-side, not
CPU-side, so we need GPU-aware metrics to see it.

**The matrix's `effective_fps` and `drops_post_warmup`
metrics are still valid** — they measure user-visible
playback quality regardless of where the cost is. Today's
GPU has enough headroom for both chrome's internal composite
and KWin's additional composite at 24 fps, so both sessions
could deliver clean playback today. Under stress, only the
Wayland session pays for that second composite; the X11
cells are designed to skip it.
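
The headroom argument can be illustrated with a frame-budget calculation. This is a rough sketch: the 9 ms per-pass cost is a hypothetical placeholder, not a measured Panfrost figure, and the model ignores decode time and any overlap between passes.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps

def fits(fps: float, pass_costs_ms: list[float]) -> bool:
    """True if the summed GPU pass costs fit inside one frame period."""
    return sum(pass_costs_ms) <= frame_budget_ms(fps)

PASS_MS = 9.0  # hypothetical cost of one composite pass

for fps in (24, 60):
    one = fits(fps, [PASS_MS])            # browser composite only
    two = fits(fps, [PASS_MS, PASS_MS])   # browser composite plus compositor pass
    print(f"{fps} fps: budget {frame_budget_ms(fps):.2f} ms, "
          f"one pass fits={one}, two passes fit={two}")
```

With these placeholder numbers both passes fit at 24 fps, while at a higher rate such as 60 fps only the single-pass case fits. That is the shape of difference the matrix is meant to expose.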

**Better metrics to add to Phase 1 binding cells:**
- **Panfrost GPU utilization** via
  `/sys/class/devfreq/fde60000.gpu/load_busy_percent` or
  `panfrost_pmu` perf events (both names still need to be
  confirmed on this board's kernel) to measure the GPU work
  top can't see.
- **Frame latency** (commit → present delta) via
  `wp_presentation_feedback` on Wayland and the Present
  extension's `PresentCompleteNotify` events on X11, which
  directly measure the per-frame composite delay.
- **Power consumption / battery drain** during sustained
  playback — quantifies the cumulative GPU-cycle cost.

Without these, top + perf alone *will* miss the campaign's
mechanism every time.
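
One possible way to collect the GPU-utilization metric is the kernel's per-client DRM usage stats, which expose cumulative `drm-engine-<name>: <N> ns` lines in `/proc/<pid>/fdinfo/<fd>`. This is a minimal sketch under an assumption that still needs checking: that this board's Panfrost kernel exposes those keys (it may require enabling profiling). The sample text and engine values below are made up for illustration.

```python
import re

# Matches cumulative busy-time lines such as "drm-engine-fragment:  123 ns".
ENGINE_RE = re.compile(r"^drm-engine-([\w-]+):\s*(\d+)\s*ns\s*$", re.MULTILINE)

def parse_engine_ns(fdinfo_text: str) -> dict[str, int]:
    """Extract per-engine cumulative busy time in nanoseconds."""
    return {name: int(ns) for name, ns in ENGINE_RE.findall(fdinfo_text)}

def busy_percent(before: dict[str, int], after: dict[str, int], wall_ns: int) -> dict[str, float]:
    """Per-engine utilisation between two snapshots taken wall_ns apart."""
    return {e: 100.0 * (after[e] - before[e]) / wall_ns for e in after if e in before}

# Hypothetical fdinfo contents one second apart.
t0 = "drm-driver:\tpanfrost\ndrm-engine-fragment:\t1000000000 ns\ndrm-engine-vertex-tiler:\t200000000 ns\n"
t1 = "drm-driver:\tpanfrost\ndrm-engine-fragment:\t1250000000 ns\ndrm-engine-vertex-tiler:\t230000000 ns\n"

usage = busy_percent(parse_engine_ns(t0), parse_engine_ns(t1), wall_ns=1_000_000_000)
print(usage)
```

Reading fdinfo of the compositor's and the browser's DRM file descriptors separately would attribute GPU time per process, which a global devfreq load figure cannot do.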

## Why the result restores (not weakens) the campaign

The A1 verdict at `a1_summary.md` interpreted 0 % kwin CPU as
"KWin doing nothing per frame, direct-scanout engaged,
campaign hypothesis structurally weakened." That interpretation
was wrong because of the GPU-vs-CPU blind spot.

The correct reading: KWin is doing exactly the GL composite
the campaign's premise named. CPU measurement isn't sufficient
to see it. The matrix needs GPU-side metrics. Once added,
the X11 cells should be able to demonstrate a real (if subtle)
delta against the Wayland baseline.

The original framing "stutterless playback - possible with X11?
proven impossible with Wayland" remains intact: Wayland's
per-frame composite is real, and under enough load it WILL
miss frames (predecessor's data is consistent with this; today's
clean Wayland just means GPU has headroom for this specific
24 fps workload). X11 + non-compositing WM removes the second
composite step entirely.

## What should change in `a1_summary.md`

The "Major reframing finding" section overstates the result.
The data is consistent with KWin doing per-frame GL composite
that's invisible to CPU instrumentation. The "structurally
weakened campaign hypothesis" claim should be retracted.

I'll add a corrigendum at the top of `a1_summary.md` linking
to this file and noting the corrected interpretation. The
original text stays unrewritten, as the record of what I
concluded with the wrong instrumentation.

## Phase 1 implications

Updated:
1. **Binding cells need GPU-aware metrics.** CPU-only
   instrumentation will miss the campaign's mechanism in any
   condition that isn't already at the GPU contention limit.
2. **The drm_info probe is too invasive to embed in
   measurement reps.** Use it diagnostically before/after
   reps, not during. Direct-scanout decisions can be inferred
   from the framebuffer-rotation pattern at boot, at idle, and
   in a separate diagnostic playback run; measured reps don't
   need the probe.
3. **The matrix value isn't lost.** The X11 cells will save
   one GPU pass per frame regardless. Phase 1 should pick a
   workload that pushes the GPU closer to its limit so the
   per-frame saving manifests as drops difference. 1080p60
   H.264 (vs the current 24 fps) is the obvious bump.
4. **mpv `--vo=xv` mechanism test still wanted.** Now framed
   as "does the X server schedule NV12 directly to Plane 39
   for an Xv client, bypassing the GL composite that both
   browsers and Wayland need?" Answer requires a native X11
   session.
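
Item 2 above (probe before and after a rep, never during) could be wired into a harness roughly as follows. This is a sketch only: `bracketed_rep` and `run_rep` are hypothetical names, and the real rep launcher is not shown in this file. The only external command used is plain `drm_info`, whose output is captured as text.

```python
import subprocess

def run_drm_info() -> str:
    """Capture one drm_info dump as text (requires drm_info on PATH)."""
    return subprocess.run(
        ["drm_info"], capture_output=True, text=True, check=True
    ).stdout

def bracketed_rep(run_rep, snapshot=run_drm_info):
    """Take a snapshot, run the measured rep untouched, then snapshot again."""
    pre = snapshot()
    result = run_rep()   # no probing while the rep is running
    post = snapshot()
    return {"pre": pre, "result": result, "post": post}

# Stand-ins so the sketch runs without DRM hardware or drm_info installed.
demo = bracketed_rep(run_rep=lambda: "rep finished", snapshot=lambda: "fake dump")
print(demo["result"])
```

Passing the snapshot function as a parameter keeps the probe swappable, for example to skip it entirely on cells where even an idle-time query is unwanted.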