x11-session-research/phase0_evidence/wayland_baseline_2026-05-03/a1_summary.md
marfrit d1285ca2b2 Phase 0: drmprobe findings + A1 corrigendum + worklist update
The drm_info-during-playback probe under Wayland confirmed
KWin IS GL-compositing per frame: Plane 39 rotates triple-
buffered ABGR8888 framebuffers (FB IDs 60/61/66) during
playback. The earlier "0% kwin CPU = direct-scanout" reading
in a1_summary.md was a CPU-blind-spot artifact — Panfrost
shader work isn't visible to top or perf-on-userspace.

Corrigendum added to a1_summary.md preserving the original
text as the on-the-day record.

Worklist: A1 entry updated to point at both summaries.

The probe itself was invasive (drm_info every 3s perturbed
KWin's atomic-commit fast-path, kwin %CPU jumped 0->18 median
and 6 drops appeared) — usable diagnostically but cannot be
embedded in measurement reps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:46:21 +00:00


A1 baseline — verdict

CORRIGENDUM 2026-05-03 ~17:00 CEST — the "Major reframing finding" section below interpreted 0 % kwin CPU as "KWin doing nothing per frame, direct-scanout engaged, campaign hypothesis structurally weakened." That interpretation is wrong. A subsequent drm_info-during-playback probe (see drmprobe_findings.md in this directory) shows Plane 39 rotates triple-buffered ABGR8888 framebuffers throughout playback — KWin IS doing per-frame GL composite, but the work is GPU-side (Panfrost) and invisible to top / perf-on-userspace. The campaign's mechanism is intact; the matrix needs GPU-aware metrics. The original text below is preserved as the on-the-day record of what I concluded with insufficient instrumentation.


Three in-session reps of chromium-fourier 149 / brave_drops_test.html / Plasma Wayland 6.6.4 + kwin-fourier 6.6.4-3 + qt6-base-fourier 6.11.0-3 acquired 2026-05-03 14:23–14:31 CEST.

Per the campaign-contained-data discipline, this is the only Wayland baseline this campaign uses. Predecessor numbers are referenced below as instrument-sanity context, not as comparison targets.

Results

| Metric | Rep 1 | Rep 2 | Rep 3 | Median (cluster) |
|---|---|---|---|---|
| frames_total (over 70 s) | 1685 | 1685 | 1686 | 1685 |
| effective fps | 24.04 | 24.04 | 24.05 | 24.04 |
| drops_total | 0 | 0 | 0 | 0 |
| drops_post_warmup (after t=10s) | 0 | 0 | 0 | 0 |
| kwin_wayland %CPU median (1 Hz × 70 samples) | 0.00 | 0.00 | 0.00 | 0.00 |
| kwin_wayland %CPU mean | 0.07 | 0.04 | 0.04 | 0.04 |
| kwin_wayland %CPU max | 4.00 | 1.00 | 3.00 | 3.00 |
| perf samples on kwin_wayland (99 Hz × 70 s) | 39 | 28 | (similar) | tens, not thousands |

The IQR of the spread metrics is 0, and because their medians are also 0 the IQR / median ratio is formally undefined (the cluster is degenerate at the lower bound). Per the protocol's exit-condition tree, this is the "tight cluster" branch: Phase 1 binding cells can use these as the anchor with a sub-1% tolerance band on kwin %CPU and a ≤ 1-frame tolerance on drops_post_warmup.
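
A minimal sketch of the tight-cluster check and the tolerance band, in Python. The function names are hypothetical. Reading "sub-1%" as an absolute band of less than one percentage point is an assumption (a relative band would be meaningless around a median of 0); a1_protocol.md remains the authority.

```python
import statistics

def iqr(values):
    """Interquartile range using inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Per-rep values from the Results table.
kwin_cpu_median = [0.00, 0.00, 0.00]
drops_post_warmup = [0, 0, 0]

anchor_cpu = statistics.median(kwin_cpu_median)
anchor_drops = statistics.median(drops_post_warmup)

def within_anchor(cpu_pct, drops, cpu_tol=1.0, drops_tol=1):
    """True if a candidate cell sits inside the band around the anchor.

    Assumption: cpu_tol is in absolute percentage points (strictly below),
    drops_tol is in frames (inclusive).
    """
    return (abs(cpu_pct - anchor_cpu) < cpu_tol
            and abs(drops - anchor_drops) <= drops_tol)

print(iqr(kwin_cpu_median), iqr(drops_post_warmup))
print(within_anchor(0.4, 1), within_anchor(18.0, 6))
```

The second call uses the figures reported for the drm_info-perturbed run (18 % median kwin CPU, 6 drops); under these assumptions it falls outside the band.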

Source video plays at 24 fps for 70.09 s ≈ 1682 frames; observed 1685 frames matches within rounding (chromium counts the DROPS_TRAJECTORY playing-event frame separately).
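
The frame arithmetic can be re-derived in a few lines. This is a sketch using only the numbers quoted above; the helper names are hypothetical. It also suggests that the "effective fps" row was computed against the 70.09 s source duration rather than a flat 70 s, which is an inference from the figures, not something the protocol states.

```python
SOURCE_FPS = 24.0    # nominal frame rate of the source video
DURATION_S = 70.09   # source duration in seconds

def expected_frames(fps, duration_s):
    """Frames a source of this rate and length should present."""
    return fps * duration_s

def effective_fps(frames, duration_s):
    """Observed frame count divided by playback duration."""
    return frames / duration_s

nominal = expected_frames(SOURCE_FPS, DURATION_S)
frames_total = {"rep1": 1685, "rep2": 1685, "rep3": 1686}

for rep, frames in frames_total.items():
    surplus = frames - nominal
    print(rep, frames, round(effective_fps(frames, DURATION_S), 2), round(surplus, 2))
```

The nominal count is about 1682.16 frames, so each rep sits roughly 3 to 4 frames above it.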

What the perf reports say kwin was actually doing

For all three reps, the perf samples on kwin_wayland during playback are dominated by event-loop bookkeeping, not by any GL-composite or dmabuf-import path:

| Rep | Top symbol(s) at non-trivial % |
|---|---|
| 1 | __pi_memcpy_generic (97.18 %); single memcpy event in a 39-sample run |
| 2 | libz.so (37.57 %), call_filldir (31.91 %); readdir + zlib |
| 3 | dbus_message_unref (38.80 %), QUnixEventDispatcherQPA::processEvents (36.72 %), libz.so (23.30 %); DBus + Qt event loop |

Zero samples anywhere in:

  • glEGLImageTargetTexture2DOES (the GL EGL image bind path the predecessor's kwin_overlay_subsurface campaign Phase 2 hypothesised would dominate per-frame KWin cost on this hardware)
  • panfrost_* (Mesa Panfrost driver routines)
  • wp_subsurface_* / WaylandSurface_* (overlay/subsurface protocol handling)
  • any Compositor::* / OpenGLBackend::* / OutputLayer::* KWin internal symbols

The total cycle count across all three reps is in the millions, not billions — kwin_wayland was scheduled out for >99 % of the 70 s capture window in every rep.
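
As a rough cross-check of the ">99 % scheduled out" figure, the perf sample counts can be compared against the full-load sample budget. This is a sketch under the assumption that a thread running for the whole window yields about one sample per sampling period; the function name is hypothetical.

```python
PERF_HZ = 99     # sampling frequency used for perf record
CAPTURE_S = 70   # capture window in seconds

def on_cpu_fraction(samples, hz=PERF_HZ, seconds=CAPTURE_S):
    """Rough share of the window one thread spent on-CPU.

    Assumption: a thread that runs the whole time yields about hz samples
    per second, so hz * seconds is the full-load sample budget.
    """
    return samples / (hz * seconds)

for rep, samples in {"rep1": 39, "rep2": 28}.items():
    print(rep, f"{100 * on_cpu_fraction(samples):.2f} % on-CPU")
```

The budget is 99 × 70 = 6930 samples, so 39 and 28 samples correspond to roughly 0.56 % and 0.40 % on-CPU time. This estimate covers CPU-side execution only; work submitted to the GPU does not appear in it.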

What this means

The "KWin is the bottleneck" framing the campaign was built around is structurally weakened by these data.

README.md § 1 states that "the campaign's load-bearing hypothesis is that this plane-allocation freedom translates into measurable browser-video speedup." That hypothesis was built on top of the predecessor's observation that kwin_wayland consumed ~36 % CPU during similar playback, attributable per the predecessor's Phase 2 source-read to per-frame GL composite of NV12 → RGB. Today's A1 reps show kwin_wayland at 0 % median, with no GL-composite work in the perf samples. There is no Wayland-side KWin-induced overhead for the X11 cells to be faster than.

The most likely mechanism (hypothesis, not yet verified)

KWin 6.6.4 (with kwin-fourier 6.6.4-3 patches applied) appears to have engaged its direct-scanout code path for the chrome-window-displaying-video workload. KWin's direct-scanout support has been there for years on the Wayland backend and has been progressively widened: when there's a single visible "top" surface (the chrome window) whose buffer matches a hardware plane's format/modifier capabilities, KWin can hand that buffer to DRM directly without first GL-compositing it into KWin's own framebuffer. The browser's RGB or NV12 dmabuf goes onto Plane 39 (the Primary plane on rockchip-drm RK3568) without any per-frame KWin GPU work.

This is not the wp_subsurface route the predecessor was investigating — it's a different, simpler scanout path that doesn't require the client to opt into overlay protocol. It just requires the client's surface to be the only visible non-trivial top-level plus a buffer format/modifier that DRM can scan out.

If this hypothesis is correct, two things follow:

  1. The campaign's X11-vs-Wayland delta is much smaller than originally expected. Both sessions can avoid per-frame compositor work for the single-window video case. The X11 cells will not be measurably faster than Wayland for this workload.
  2. The campaign's mechanism is realised under Wayland too, when conditions are right. "Plane-allocation freedom" is not X11-exclusive — it's just easier to engage on X11 because there's no compositor in the path at all. On Wayland, KWin engages an equivalent fast-path when its heuristics allow.

What the data does NOT establish

  • That direct-scanout is the cause. The data is consistent with direct-scanout but a perf-only diagnosis can't pin it down. Phase 1 should add drm_info snapshots during playback (which plane is programmed with the chrome window's buffer FOURCC + modifier?) and KWin DRM-backend debug logging (the kwin_wayland_drm logging category, e.g. QT_LOGGING_RULES="kwin_wayland_drm.debug=true") to confirm.
  • That this behavior holds for multi-window scenarios. A single visible non-trivial top-level window is the simplest case for direct-scanout. If the operator works multi-windowed (panel + chrome + terminal + Konsole), the fast-path may decline and kwin %CPU may rebound. The matrix's relevance to daily-driver scenarios depends on this.
  • That this behavior holds across browsers. chromium-fourier 149 specifically has the patches that enable smooth NV12 dmabuf production. Brave 147 stock and Firefox 150 may produce buffers in different shapes (RGB-pre-composited, different modifier) that don't satisfy the direct-scanout predicate. Phase 1 reps for those browsers will tell.
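
The drm_info snapshot check suggested in the first bullet could be reduced to a summary like the one below. This is a sketch only: the (fb_id, fourcc) tuple shape and the function name are hypothetical, extracting those tuples from drm_info output is not shown, and the snapshot sequence is invented for illustration (the FB IDs and format are the ones quoted in the corrigendum). NV12 as the client's buffer format is likewise an assumption.

```python
from collections import Counter

def summarize_plane(snapshots, client_fourcc):
    """Summarise what one plane scanned out across several snapshots.

    snapshots: list of (fb_id, fourcc) tuples, one per snapshot
    (hypothetical shape, not a drm_info data structure).
    """
    fb_ids = sorted({fb_id for fb_id, _ in snapshots})
    formats = Counter(fourcc for _, fourcc in snapshots)
    return {
        "distinct_fb_ids": fb_ids,
        "rotating": len(fb_ids) > 1,
        "formats": dict(formats),
        "all_match_client": all(f == client_fourcc for _, f in snapshots),
    }

# Illustrative sequence, not recorded data.
observed = [(60, "ABGR8888"), (61, "ABGR8888"), (66, "ABGR8888"), (60, "ABGR8888")]
print(summarize_plane(observed, client_fourcc="NV12"))
```

A plane whose format never matches the client's buffer format is hard to reconcile with direct scanout of that client, but the summary is only one input; the modifier and the compositor's own logs would still need to agree.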

Cross-check: predecessor's same-condition reps

For instrument sanity (NOT as comparison target — these are the predecessor's kwin_timing_nodebug_rep[1-3] numbers from 2026-05-02/03):

| Metric | Predecessor median | This campaign median |
|---|---|---|
| frames_total | 1688 | 1685 |
| drops_total | 44 | 0 |
| drops_post_warmup | 28 | 0 |
| kwin %CPU median | 35.9 | 0.00 |

The frames_total match indicates the test page + chromium-fourier emit at the same rate in both campaigns. The drops + kwin %CPU divergence is too large to be measurement noise — something about the runtime conditions changed between 2026-05-02 (predecessor's reps) and 2026-05-03 14:23 (today's reps), even though the package versions are identical and the test page is the same file.

Possible non-package causes worth listing here so a follow-up can investigate (out of A1 scope):

  • Boot generation: predecessor's reps were on a session that had been running ~9 hours; today's reps were on a session that had been running ~50 minutes after autologin (revert.log entry 6).
  • Cumulative session state: predecessor's session likely had multiple browser instances and other windows open during the campaign's preceding work; today's session was freshly autologin'd from greeter, only the test chrome window visible.
  • Thermal: the predecessor's temp_pre.txt for rep 1 is outside our scope to check (it would require importing predecessor data), but today's reps had cpu-thermal at 36 °C pre-rep, well below thermal-throttle thresholds.
  • kwin / qt patches state: packages identical per pacman -Q, but the runtime state of KWin's heuristics (window-rule cache, scanout-decision history) might differ between sessions. This is a normal property of compositors and explains some run-to-run variance even on the same binary.

The discipline rule already required the in-session re-measurement this campaign just did. The predecessor's number is no longer the reference; this campaign's measured median (0 drops, 0 % kwin) is the reference for any X11 cells the campaign will later compare against.

Phase 1 implications

The matrix design needs revisiting before Phase 1 cells lock:

  1. The mpv --vo=xv cell remains the most informative single point for the campaign's original mechanism (does the X server route NV12 to Plane 39 directly?), per the browser overlay inventory's verdict.
  2. The browser X11 cells become a measurement of "do browsers under X11 get the equivalent direct-scanout benefit they get under Wayland?" rather than the original "does X11 win over Wayland?" framing. Three plausible outcomes:
    • X11 cells match Wayland baseline (both engage direct scanout) → "compositor-or-not is irrelevant for the single-window case on this hardware"
    • X11 cells slightly faster than Wayland (X11 path has less per-frame X protocol overhead) → small but real win for X11 daily-driver
    • X11 cells slower than Wayland (X11 path has issues KWin's direct-scanout doesn't) → unexpected; would need re-investigation
  3. A multi-window variant of the with-KWin baseline should be added before Phase 1 binding-cell lock — otherwise the matrix only measures the easiest scenario. Suggested add: A1' rep with chrome + Konsole + Plasma panel all visible, see if kwin %CPU rebounds. If it does, the matrix's daily-driver-relevance picture is more nuanced.

The campaign continues with the matrix as defined, but with the understanding that the original framing is partially invalidated. Phase 1 will lock around the reframed sub-questions in phase0_evidence/browser_overlay_inventory_2026-05-03.md § "Implications for the matrix" + the multi-window add above.

Files in this evidence dir

a1_rep1/  a1_rep2/  a1_rep3/   — three rep evidence dirs
01_live_session.txt              — Wayland session state at A1 capture time
02_predecessor_assets.txt        — verification that predecessor scripts/assets reusable
a1_protocol.md                   — protocol spec, run beforehand
a1_summary.md                    — this file

Each rep directory contains:

  • start.txt / end.txt / capture_start.txt — wall-clock
  • temp_pre.txt / temp_post.txt — cpu-thermal temp
  • top_kwin.txt — kwin_wayland top samples (70 × 1 Hz)
  • top_full.txt — system top samples (70 × 1 Hz)
  • stderr.log — chromium stderr (full)
  • drops_trajectory.txt — DROPS_TRAJECTORY lines (73 each)
  • drops_summary.txt — frames_total / drops_total / drops_post_warmup
  • kwin_cpu_summary.txt — kwin %CPU stats
  • perf_record_stderr.txt — perf recorder's own stderr
  • perf_report_self.txt / perf_report_top50.txt — perf flamegraph (text)
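
The statistics in kwin_cpu_summary.txt can be recomputed from the per-second samples. The sketch below assumes the samples have already been parsed into a list of floats (the layout of top_kwin.txt is not described here, so no parser is shown) and uses an invented 70-sample series; the function name is hypothetical.

```python
import statistics

def cpu_summary(samples):
    """Median, mean (2 decimals) and max of a %CPU sample series."""
    return {
        "median": statistics.median(samples),
        "mean": round(statistics.mean(samples), 2),
        "max": max(samples),
    }

# Hypothetical series: 68 idle samples plus two brief wake-ups.
samples = [0.0] * 68 + [4.0, 1.0]
print(cpu_summary(samples))
```

A series like this shows how a 0.00 median can coexist with a non-zero mean and a 4.00 maximum: a couple of isolated wake-ups move the mean and max but not the median.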

perf.data files (~400 KB each) are root-owned on ohm (created by sudo perf record) and were not synced to noether. They remain at /home/mfritsche/phase3_prime_runs/x11research_a1_rep[1-3]/perf.data on ohm if re-analysis is needed.