Phase 0: A1 Wayland baseline + state snapshot — major reframing

3 in-session reps of chromium-fourier 149 / brave_drops_test.html /
Plasma Wayland 6.6.4 (kwin-fourier 6.6.4-3 + qt6-base-fourier
6.11.0-3 carry-overs intact). Tight cluster IQR=0:
drops_total=0, drops_post_warmup=0, frames_total=1685, kwin %CPU
median=0.00, mean=0.04. Perf samples on kwin (~30 over 70s) show
zero composite/dmabuf/GL symbols — only event-loop bookkeeping.

Most likely mechanism: KWin direct-scanout fast-path engaged for
the single-visible-client video case. The campaign's load-bearing
hypothesis ("X11 + non-compositing WM avoids per-frame GL composite
of NV12") is structurally weakened — KWin already avoids that work
under Wayland for this workload. Phase 1 needs to add a
multi-window A1' variant and drm_info-during-playback to confirm
direct-scanout, then revisit matrix cell design.

revert.log entry 6: SDDM autologin + state.conf swap that landed
the Plasma Wayland session for the A1 reps. Backup of original
state.conf preserved at /var/lib/sddm/state.conf.x11-research-bak;
single-command revert documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:11:15 +00:00
parent d2e11be430
commit 9a023e9264
45 changed files with 3199 additions and 17 deletions
@@ -0,0 +1,248 @@
# A1 baseline — verdict
**Three in-session reps of chromium-fourier 149 / brave_drops_test.html
/ Plasma Wayland 6.6.4 + kwin-fourier 6.6.4-3 + qt6-base-fourier
6.11.0-3 acquired 2026-05-03 14:23–14:31 CEST.**

Per the campaign-contained-data discipline, this is the only
Wayland baseline this campaign uses. Predecessor numbers are
referenced below as instrument-sanity context, not as comparison
targets.
## Results
| Metric | Rep 1 | Rep 2 | Rep 3 | Median (cluster) |
|---|---|---|---|---|
| frames_total (over 70 s) | 1685 | 1685 | 1686 | **1685** |
| effective fps | 24.04 | 24.04 | 24.05 | **24.04** |
| drops_total | 0 | 0 | 0 | **0** |
| drops_post_warmup (after t=10s) | 0 | 0 | 0 | **0** |
| kwin_wayland %CPU median (1 Hz × 70 samples) | 0.00 | 0.00 | 0.00 | **0.00** |
| kwin_wayland %CPU mean | 0.07 | 0.04 | 0.04 | **0.04** |
| kwin_wayland %CPU max | 4.00 | 1.00 | 3.00 | **3.00** |
| perf samples on kwin_wayland (99 Hz × 70 s) | 39 | 28 | (similar) | tens, not thousands |

**Both the IQR and the median are 0 for the drop counts and for
kwin %CPU median (the cluster is degenerate at the lower bound).**
Per the protocol's exit-condition tree, this is the "tight cluster"
branch: Phase 1 binding cells can use these as the anchor with a
sub-1% tolerance band on kwin %CPU and a ≤ 1-frame tolerance on
drops_post_warmup.
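The anchor statistics and the proposed tolerance check can be reproduced from the Results table. The following is a minimal Python sketch, not the campaign's own tooling. The per-rep numbers are copied from the table, and `iqr` uses the standard library's inclusive quartile method. `within_band` is a hypothetical helper. It reads "sub-1%" as an absolute band of under 1 percentage point of kwin %CPU. That reading is an assumption, because a relative band is meaningless around a 0.00 anchor.

```python
import statistics

# Per-rep values copied from the Results table (rep 1, rep 2, rep 3).
REPS = {
    "frames_total": [1685, 1685, 1686],
    "drops_total": [0, 0, 0],
    "drops_post_warmup": [0, 0, 0],
    "kwin_cpu_median": [0.00, 0.00, 0.00],
    "kwin_cpu_mean": [0.07, 0.04, 0.04],
    "kwin_cpu_max": [4.00, 1.00, 3.00],
}


def iqr(values):
    """Interquartile range (Q3 - Q1) using inclusive quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def within_band(cell_kwin_cpu, cell_drops_post_warmup,
                anchor_kwin_cpu=0.00, anchor_drops_post_warmup=0):
    """Hypothetical Phase 1 check against the A1 anchor.

    Assumes 'sub-1%' means an absolute band of < 1 percentage point of
    kwin %CPU; drops_post_warmup may differ by at most 1 frame.
    """
    cpu_ok = abs(cell_kwin_cpu - anchor_kwin_cpu) < 1.0
    drops_ok = abs(cell_drops_post_warmup - anchor_drops_post_warmup) <= 1
    return cpu_ok and drops_ok


if __name__ == "__main__":
    for metric, values in REPS.items():
        print(f"{metric:18s} median={statistics.median(values):8.2f} "
              f"IQR={iqr(values):.3f}")
```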
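
The source video is 70.09 s long at 24 fps, i.e. ≈ 1682 frames; the
observed 1685 frames is ≈ 3 frames (≈ 0.2 %) above that (chromium
counts the DROPS_TRAJECTORY playing-event frame separately). The
effective-fps row is frames_total divided by the 70.09 s source
duration.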
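Here is the frame-count arithmetic above as a small Python sketch. The constants are the 24 fps and 70.09 s figures quoted in the text. The function names are illustrative only.

```python
SOURCE_FPS = 24.0          # nominal frame rate of the test video
SOURCE_DURATION_S = 70.09  # test video length in seconds


def expected_frames(fps=SOURCE_FPS, duration_s=SOURCE_DURATION_S):
    """Frames the source should present if every frame is shown once."""
    return fps * duration_s


def effective_fps(frames_total, duration_s=SOURCE_DURATION_S):
    """Observed frame rate, computed the way the Results table does."""
    return frames_total / duration_s


if __name__ == "__main__":
    exp = expected_frames()
    for rep, frames in enumerate([1685, 1685, 1686], start=1):
        excess = frames - exp
        print(f"rep {rep}: {effective_fps(frames):.2f} fps, "
              f"{excess:+.2f} frames ({100 * excess / exp:+.2f} %) vs expected")
```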
## What the perf reports say kwin was actually doing
For all three reps, the perf samples on kwin_wayland during
playback are dominated by event-loop bookkeeping, **not** by
any GL-composite or dmabuf-import path:
| Rep | Top symbol(s) at non-trivial % |
|---|---|
| 1 | `__pi_memcpy_generic` (97.18 %) — single memcpy event in a 39-sample run |
| 2 | `libz.so` (37.57 %), `call_filldir` (31.91 %) — readdir + zlib |
| 3 | `dbus_message_unref` (38.80 %), `QUnixEventDispatcherQPA::processEvents` (36.72 %), `libz.so` (23.30 %) — DBus + Qt event loop |

**Zero samples anywhere in:**
- `glEGLImageTargetTexture2DOES` (the GL EGL image bind path
the predecessor's `kwin_overlay_subsurface` campaign Phase 2
hypothesised would dominate per-frame KWin cost on this
hardware)
- `panfrost_*` (Mesa Panfrost driver routines)
- `wp_subsurface_*` / `WaylandSurface_*` (overlay/subsurface
protocol handling)
- any `Compositor::*` / `OpenGLBackend::*` / `OutputLayer::*`
KWin internal symbols

The total cycle count across all three reps is in the millions,
not billions: kwin_wayland was scheduled out for >99 % of the
70 s capture window in every rep (rep 1's 39 samples at 99 Hz
correspond to ≈ 0.4 s of on-CPU time out of 70 s).
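The ">99 % scheduled out" reading follows from the sample counts. The sketch below assumes that perf sampled a task-bound event at `-F 99` for the kwin_wayland process, so samples accrue only while that process is on-CPU. The exact `perf record` invocation is not recorded here. The result is therefore an approximation, not a precise CPU accounting.

```python
PERF_FREQ_HZ = 99   # perf sampling frequency (-F 99)
CAPTURE_S = 70.0    # capture window length in seconds


def on_cpu_seconds(samples, freq_hz=PERF_FREQ_HZ):
    """Approximate on-CPU time: each sample stands for ~1/freq seconds."""
    return samples / freq_hz


def on_cpu_fraction(samples, freq_hz=PERF_FREQ_HZ, window_s=CAPTURE_S):
    """Approximate share of one CPU used over the capture window."""
    return on_cpu_seconds(samples, freq_hz) / window_s


if __name__ == "__main__":
    for rep, samples in (("rep 1", 39), ("rep 2", 28)):
        frac = on_cpu_fraction(samples)
        print(f"{rep}: {on_cpu_seconds(samples):.2f} s on-CPU, "
              f"{100 * frac:.2f} % of one CPU, idle {100 * (1 - frac):.2f} %")
```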
## What this means
**The "KWin is the bottleneck" framing the campaign was built
around is structurally weakened by these data.**
The campaign's load-bearing hypothesis (`README.md` § 1) was that
"this plane-allocation freedom translates into measurable
browser-video speedup." That hypothesis was built on top of the
predecessor's
observation that `kwin_wayland` consumed ~36 % CPU during
similar playback, attributable per the predecessor's Phase 2
source-read to per-frame GL composite of NV12 → RGB. **Today's
A1 reps show kwin_wayland at 0 % median, with no GL-composite
work in the perf samples.** There is no Wayland-side
KWin-induced overhead for the X11 cells to *be faster than*.
### The most likely mechanism (hypothesis, not yet verified)
KWin 6.6.4 (with kwin-fourier 6.6.4-3 patches applied) appears
to have engaged its **direct-scanout** code path for the
chrome-window-displaying-video workload. KWin's direct-scanout
support has been there for years on the Wayland backend and
has been progressively widened: when there's a single visible
"top" surface (the chrome window) whose buffer matches a
hardware plane's format/modifier capabilities, KWin can hand
that buffer to DRM directly without first GL-compositing it
into KWin's own framebuffer. The browser's RGB or NV12 dmabuf
goes onto Plane 39 (the Primary plane on rockchip-drm RK3568)
without any per-frame KWin GPU work.
This is **not** the wp_subsurface route the predecessor was
investigating — it's a different, simpler scanout path that
doesn't require the client to opt into overlay protocol. It
just requires the client's surface to be the only visible
non-trivial top-level plus a buffer format/modifier that DRM
can scan out.
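The following is a deliberately simplified Python model of the two conditions named above. The first condition is exactly one visible non-trivial top-level window. The second is a buffer whose format/modifier pair the plane accepts. This is a hypothetical sketch for reasoning about the multi-window A1' variant, not KWin's implementation. KWin's real checks are more involved and are not reproduced here. All class and field names are illustrative. In practice the plane's accepted pairs would come from the DRM plane's `IN_FORMATS` property.

```python
from dataclasses import dataclass

DRM_FORMAT_MOD_LINEAR = 0  # linear (untiled) DRM format modifier


@dataclass(frozen=True)
class Buffer:
    fourcc: str    # e.g. "NV12" or "XR24"
    modifier: int  # DRM format modifier


@dataclass(frozen=True)
class Window:
    title: str
    visible: bool
    trivial: bool  # hypothetical flag: surfaces that never block scanout
    buffer: Buffer


@dataclass(frozen=True)
class Plane:
    plane_id: int
    accepted: frozenset  # set of (fourcc, modifier) pairs


def can_direct_scanout(windows, plane):
    """Simplified model: one visible non-trivial window whose buffer fits the plane."""
    candidates = [w for w in windows if w.visible and not w.trivial]
    if len(candidates) != 1:
        return False  # e.g. chrome + Konsole both visible: must composite
    buf = candidates[0].buffer
    return (buf.fourcc, buf.modifier) in plane.accepted


if __name__ == "__main__":
    # Hypothetical format list for plane 39, for illustration only.
    plane39 = Plane(39, frozenset({("NV12", DRM_FORMAT_MOD_LINEAR),
                                   ("XR24", DRM_FORMAT_MOD_LINEAR)}))
    chrome = Window("chrome", True, False, Buffer("NV12", DRM_FORMAT_MOD_LINEAR))
    konsole = Window("Konsole", True, False, Buffer("XR24", DRM_FORMAT_MOD_LINEAR))
    print(can_direct_scanout([chrome], plane39))           # True: single window
    print(can_direct_scanout([chrome, konsole], plane39))  # False: multi-window A1'
```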
If this hypothesis is correct, two things follow:
1. **The campaign's X11-vs-Wayland delta is much smaller than
originally expected.** Both sessions can avoid per-frame
compositor work for the single-window video case. The X11
cells will not be measurably faster than Wayland for this
workload.
2. **The campaign's mechanism is realised under Wayland too,
when conditions are right.** "Plane-allocation freedom" is
not X11-exclusive — it's just easier to engage on X11
because there's no compositor in the path at all. On
Wayland, KWin engages an equivalent fast-path when its
heuristics allow.
### What the data does NOT establish
- That direct-scanout is the cause. The data is consistent
with direct-scanout but a perf-only diagnosis can't pin it
down. Phase 1 should add `drm_info` snapshots during
playback (which plane is programmed with the chrome
window's buffer FOURCC + modifier?) and KWin debug
logging (`KWIN_DRM=1` dumps direct-scanout decisions) to
confirm.
- That this behavior holds for multi-window scenarios. A
single visible non-trivial top-level window is the simplest
case for direct-scanout. If the operator works
multi-windowed (panel + chrome + terminal + Konsole), the
fast-path may decline and kwin %CPU may rebound. The
matrix's relevance to daily-driver scenarios depends on
this.
- That this behavior holds across browsers. chromium-fourier
149 specifically has the patches that enable smooth NV12
dmabuf production. Brave 147 stock and Firefox 150 may
produce buffers in different shapes (RGB-pre-composited,
different modifier) that don't satisfy the direct-scanout
predicate. Phase 1 reps for those browsers will tell.
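The Phase 1 check asks which plane holds the chrome buffer, and with what FOURCC and modifier. Besides `drm_info`, one option is the kernel's atomic-state dump in debugfs at `/sys/kernel/debug/dri/<N>/state`. That file is root-only, and the card index on ohm is not stated here. The sketch below parses that text. It assumes the `plane[ID]: NAME`, `fb=`, `format=` and `modifier=` line layout printed by the kernel's DRM state printer. The sample is illustrative, not a capture from ohm, and its names and ids are placeholders.

```python
import re

PLANE_RE = re.compile(r"^plane\[(\d+)\]: (.+)$")
FB_RE = re.compile(r"^\s+fb=(\d+)$")
FORMAT_RE = re.compile(r"^\s+format=(\S+)")
MODIFIER_RE = re.compile(r"^\s+modifier=(0x[0-9a-fA-F]+)$")


def planes_with_fb(state_text):
    """Map plane id -> {name, fb, format, modifier} for planes with an fb attached."""
    planes = {}
    current = None
    for line in state_text.splitlines():
        header = PLANE_RE.match(line)
        if header:
            current = {"name": header.group(2), "fb": 0,
                       "format": None, "modifier": None}
            planes[int(header.group(1))] = current
        elif line and not line[0].isspace():
            current = None  # a crtc[...] / connector[...] block starts
        elif current is not None:
            if m := FB_RE.match(line):
                current["fb"] = int(m.group(1))
            elif m := FORMAT_RE.match(line):
                current["format"] = m.group(1)
            elif m := MODIFIER_RE.match(line):
                current["modifier"] = int(m.group(1), 16)
    return {pid: p for pid, p in planes.items() if p["fb"] != 0}


# Illustrative sample only (placeholder names/ids), not captured on ohm.
SAMPLE = (
    "plane[39]: example-primary\n"
    "\tcrtc=example-crtc\n"
    "\tfb=72\n"
    "\t\tallocated by = kwin_wayland\n"
    "\t\tformat=NV12 little-endian (0x3231564e)\n"
    "\t\tmodifier=0x0\n"
    "plane[45]: example-overlay\n"
    "\tcrtc=(null)\n"
    "\tfb=0\n"
    "crtc[60]: example-crtc\n"
    "\tenable=1\n"
)

if __name__ == "__main__":
    for pid, info in planes_with_fb(SAMPLE).items():
        print(pid, info["format"], hex(info["modifier"]))
```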
## Cross-check: predecessor's same-condition reps
For instrument sanity (NOT as comparison target — these are
the predecessor's `kwin_timing_nodebug_rep[1-3]` numbers from
2026-05-02/03):
| | Predecessor median | This campaign median |
|---|---|---|
| frames_total | 1688 | **1685** |
| drops_total | 44 | **0** |
| drops_post_warmup | 28 | **0** |
| kwin %CPU median | 35.9 | **0.00** |

The frames_total match indicates the test page + chromium-fourier
emit at the same rate in both campaigns. The drops + kwin %CPU
divergence is too large to be measurement noise — something
about the runtime conditions changed between 2026-05-02
(predecessor's reps) and 2026-05-03 14:23 (today's reps), even
though the package versions are identical and the test page is
the same file.
Possible non-package causes worth listing here so a follow-up
can investigate (out of A1 scope):
- **Boot generation:** predecessor's reps were on a session
that had been running ~9 hours; today's reps were on a
session that had been running ~50 minutes after autologin
(revert.log entry 6).
- **Cumulative session state:** predecessor's session likely
had multiple browser instances and other windows open
during the campaign's preceding work; today's session was
freshly autologin'd from greeter, only the test chrome
window visible.
- **Thermal:** predecessor's `temp_pre.txt` for rep 1 isn't
in our scope to check (that would require importing predecessor
data), but today's reps had cpu-thermal at 36 °C pre-rep, well
below thermal-throttle thresholds.
- **kwin / qt patches state:** packages identical per
`pacman -Q`, but the runtime state of KWin's heuristics
(window-rule cache, scanout-decision history) might differ
between sessions. This is a normal property of compositors
and explains some run-to-run variance even on the same
binary.

The discipline rule already required the in-session re-measurement
this campaign just did. The predecessor's number is no longer
the reference; **this campaign's measured median (0 drops, 0 %
kwin) is the reference for any X11 cells the campaign will
later compare against**.
## Phase 1 implications
The matrix design needs revisiting before Phase 1 cells lock:
1. **The mpv `--vo=xv` cell remains the most informative
single point** for the campaign's original mechanism (does
the X server route NV12 to Plane 39 directly?), per the
browser overlay inventory's verdict.
2. **The browser X11 cells become a measurement of "do
browsers under X11 get the equivalent direct-scanout
benefit they get under Wayland?"** rather than the original
"does X11 win over Wayland?" framing. Three plausible
outcomes:
- X11 cells match Wayland baseline (both engage direct
scanout) → "compositor-or-not is irrelevant for the
single-window case on this hardware"
- X11 cells slightly faster than Wayland (X11 path has
less per-frame X protocol overhead) → small but real
win for X11 daily-driver
- X11 cells slower than Wayland (X11 path has issues
KWin's direct-scanout doesn't) → unexpected; would need
re-investigation
3. **A multi-window variant** of the with-KWin baseline
should be added before Phase 1 binding-cell lock —
otherwise the matrix only measures the easiest scenario.
Suggested add: A1' rep with chrome + Konsole + Plasma
panel all visible, see if kwin %CPU rebounds. If it does,
the matrix's daily-driver-relevance picture is more
nuanced.
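The three X11 outcomes in item 2 can be expressed as one comparison against the A1 anchor. So can the A1' rebound question in item 3. The comparison uses the tolerance bands proposed under Results: under 1 percentage point of %CPU, and at most 1 frame of drops_post_warmup. This is a hypothetical Python sketch. Which CPU metric stands in for cost on the X11 cells is an open design choice, for example Xorg or browser %CPU. For that reason `cpu` is deliberately generic here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellResult:
    cpu: float              # %CPU of the display-path process being compared
    drops_post_warmup: int


# A1 Wayland anchor (medians from the Results table).
A1_ANCHOR = CellResult(cpu=0.00, drops_post_warmup=0)


def classify(cell, anchor=A1_ANCHOR, cpu_band=1.0, drops_band=1):
    """Return 'match', 'better', 'worse' or 'mixed' relative to the anchor."""
    d_cpu = cell.cpu - anchor.cpu
    d_drops = cell.drops_post_warmup - anchor.drops_post_warmup
    worse = d_cpu >= cpu_band or d_drops > drops_band
    better = d_cpu <= -cpu_band or d_drops < -drops_band
    if worse and better:
        return "mixed"
    if worse:
        return "worse"
    if better:
        return "better"
    return "match"


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(classify(CellResult(cpu=0.3, drops_post_warmup=1)))   # match
    print(classify(CellResult(cpu=12.0, drops_post_warmup=0)))  # worse (rebound)
```

Against the A1 anchor (0.00 %CPU, 0 drops), this particular comparison can only return "match" or "worse". "better" becomes reachable only against a non-zero anchor, such as the predecessor's numbers or a future A1' median.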

The campaign continues with the matrix as defined, but with
the understanding that the original framing is partially
invalidated. Phase 1 will lock around the reframed sub-questions
in `phase0_evidence/browser_overlay_inventory_2026-05-03.md` §
"Implications for the matrix" + the multi-window add above.
## Files in this evidence dir
```
a1_rep1/ a1_rep2/ a1_rep3/ — three rep evidence dirs
01_live_session.txt — Wayland session state at A1 capture time
02_predecessor_assets.txt — verification that predecessor scripts/assets reusable
a1_protocol.md — protocol spec, run beforehand
a1_summary.md — this file
```
Each rep directory contains:
- `start.txt` / `end.txt` / `capture_start.txt` — wall-clock
- `temp_pre.txt` / `temp_post.txt` — cpu-thermal temp
- `top_kwin.txt` — kwin_wayland top samples (70 × 1 Hz)
- `top_full.txt` — system top samples (70 × 1 Hz)
- `stderr.log` — chromium stderr (full)
- `drops_trajectory.txt` — DROPS_TRAJECTORY lines (73 each)
- `drops_summary.txt` — frames_total / drops_total / drops_post_warmup
- `kwin_cpu_summary.txt` — kwin %CPU stats
- `perf_record_stderr.txt` — perf recorder's own stderr
- `perf_report_self.txt` / `perf_report_top50.txt` — perf
report output (text)

`perf.data` files (~400 KB each) are root-owned on ohm
(created by `sudo perf record`) and were not synced to noether.
They remain at `/home/mfritsche/phase3_prime_runs/x11research_a1_rep[1-3]/perf.data`
on ohm if re-analysis is needed.