d1285ca2b2
The drm_info-during-playback probe under Wayland confirmed KWin IS GL-compositing per frame: Plane 39 rotates triple-buffered ABGR8888 framebuffers (FB IDs 60/61/66) during playback. The earlier "0% kwin CPU = direct-scanout" reading in a1_summary.md was a CPU-blind-spot artifact — Panfrost shader work isn't visible to top or perf-on-userspace. Corrigendum added to a1_summary.md preserving the original text as the on-the-day record. Worklist: A1 entry updated to point at both summaries. The probe itself was invasive (drm_info every 3s perturbed KWin's atomic-commit fast-path, kwin %CPU jumped 0->18 median and 6 drops appeared) — usable diagnostically but cannot be embedded in measurement reps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
266 lines
12 KiB
Markdown
# A1 baseline — verdict

> **CORRIGENDUM 2026-05-03 ~17:00 CEST** — the
> "Major reframing finding" section below interpreted 0 %
> kwin CPU as "KWin doing nothing per frame, direct-scanout
> engaged, campaign hypothesis structurally weakened." That
> interpretation is **wrong**. A subsequent
> drm_info-during-playback probe (see
> `drmprobe_findings.md` in this directory) shows Plane 39
> rotates triple-buffered ABGR8888 framebuffers throughout
> playback — KWin IS doing per-frame GL composite, but
> the work is GPU-side (Panfrost) and invisible to top /
> perf-on-userspace. The campaign's mechanism is intact;
> the matrix needs GPU-aware metrics. The original text
> below is preserved as the on-the-day record of what I
> concluded with insufficient instrumentation.

---

**Three in-session reps of chromium-fourier 149 / brave_drops_test.html
/ Plasma Wayland 6.6.4 + kwin-fourier 6.6.4-3 + qt6-base-fourier
6.11.0-3 acquired 2026-05-03 14:23–14:31 CEST.**

Per the campaign-contained-data discipline, this is the only
Wayland baseline this campaign uses. Predecessor numbers are
referenced below as instrument-sanity context, not as comparison
targets.

## Results

| Metric | Rep 1 | Rep 2 | Rep 3 | Median (cluster) |
|---|---|---|---|---|
| frames_total (over 70 s) | 1685 | 1685 | 1686 | **1685** |
| effective fps | 24.04 | 24.04 | 24.05 | **24.04** |
| drops_total | 0 | 0 | 0 | **0** |
| drops_post_warmup (after t=10s) | 0 | 0 | 0 | **0** |
| kwin_wayland %CPU median (1 Hz × 70 samples) | 0.00 | 0.00 | 0.00 | **0.00** |
| kwin_wayland %CPU mean | 0.07 | 0.04 | 0.04 | **0.04** |
| kwin_wayland %CPU max | 4.00 | 1.00 | 3.00 | **3.00** |
| perf samples on kwin_wayland (99 Hz × 70 s) | 39 | 28 | (similar) | tens, not thousands |

**IQR / median for the spread metrics is 0 (cluster is degenerate
at the lower bound).** Per the protocol's exit-condition tree,
this is the "tight cluster" branch: Phase 1 binding cells can
use these as the anchor with a sub-1% tolerance band on kwin
%CPU and a ≤ 1-frame tolerance on drops_post_warmup.

Source video plays at 24 fps for 70.09 s ≈ 1682 frames; observed
1685 frames matches within rounding (chromium counts the
DROPS_TRAJECTORY playing-event frame separately).

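
As a cross-check on the table and the frame-count arithmetic above, the following Python sketch recomputes the cluster medians, the IQR / median spread, and the nominal frame count from the reported values. The helper names (`relative_spread`, `expected_frames`) and the convention of reporting a spread of 0 when every rep is identical are assumptions made for illustration; they are not taken from the campaign's scripts.

```python
from statistics import median, quantiles

# Per-rep values copied from the Results table.
REPS = {
    "frames_total": [1685, 1685, 1686],
    "drops_total": [0, 0, 0],
    "drops_post_warmup": [0, 0, 0],
    "kwin_cpu_median": [0.00, 0.00, 0.00],
    "kwin_cpu_mean": [0.07, 0.04, 0.04],
    "kwin_cpu_max": [4.00, 1.00, 3.00],
}


def relative_spread(values):
    """IQR / median; 0.0 for a degenerate cluster (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 0.0
    med = median(values)
    # A nonzero IQR around a zero median has no finite ratio.
    return float("inf") if med == 0 else iqr / med


def expected_frames(fps, duration_s):
    """Nominal frame count for a clip of the given rate and length."""
    return fps * duration_s


if __name__ == "__main__":
    for name, values in REPS.items():
        print(f"{name}: median={median(values)} spread={relative_spread(values):.4f}")
    nominal = expected_frames(24, 70.09)
    print(f"nominal frames: {nominal:.2f}, observed median: {median(REPS['frames_total'])}")
```

For the drops rows and the kwin %CPU median row the spread evaluates to 0, which is the degenerate cluster described above. The mean and max rows differ between reps, so their spread is nonzero.
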
## What the perf reports say kwin was actually doing

For all three reps, the perf samples on kwin_wayland during
playback are dominated by event-loop bookkeeping, **not** by
any GL-composite or dmabuf-import path:

| Rep | Top symbol(s) at non-trivial % |
|---|---|
| 1 | `__pi_memcpy_generic` (97.18 %) — single memcpy event in a 39-sample run |
| 2 | `libz.so` (37.57 %), `call_filldir` (31.91 %) — readdir + zlib |
| 3 | `dbus_message_unref` (38.80 %), `QUnixEventDispatcherQPA::processEvents` (36.72 %), `libz.so` (23.30 %) — DBus + Qt event loop |

**Zero samples anywhere in:**
- `glEGLImageTargetTexture2DOES` (the GL EGL image bind path
  the predecessor's `kwin_overlay_subsurface` campaign Phase 2
  hypothesised would dominate per-frame KWin cost on this
  hardware)
- `panfrost_*` (Mesa Panfrost driver routines)
- `wp_subsurface_*` / `WaylandSurface_*` (overlay/subsurface
  protocol handling)
- any `Compositor::*` / `OpenGLBackend::*` / `OutputLayer::*`
  KWin internal symbols

The total cycle count across all three reps is in the millions,
not billions — kwin_wayland was scheduled out for >99 % of the
70 s capture window in every rep.

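
The ">99 % scheduled out" figure can be sanity-checked from the sample counts alone. The sketch below assumes that perf only takes a sample while kwin_wayland is on a CPU, so each sample stands for roughly 1/99 s of CPU time. That sampling model and the helper name `on_cpu_fraction` are assumptions, not something the perf reports state.

```python
def on_cpu_fraction(samples, sample_hz, window_s):
    """Approximate share of the window the process spent on a CPU.

    Assumes one sample per 1/sample_hz seconds of on-CPU time.
    """
    return (samples / sample_hz) / window_s


if __name__ == "__main__":
    # Sample counts for rep 1 and rep 2 from the Results table.
    for rep, samples in (("rep 1", 39), ("rep 2", 28)):
        frac = on_cpu_fraction(samples, sample_hz=99, window_s=70)
        print(f"{rep}: {frac:.2%} on-CPU, {1 - frac:.2%} off-CPU")
```

Under that assumption both reps come out below 1 % on-CPU, which is consistent with the statement above.
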
## What this means

**The "KWin is the bottleneck" framing the campaign was built
around is structurally weakened by these data.**

`README.md` § 1 states that "the campaign's load-bearing
hypothesis is that this plane-allocation freedom translates
into measurable browser-video speedup." That hypothesis was
built on top of the predecessor's observation that
`kwin_wayland` consumed ~36 % CPU during similar playback,
attributable per the predecessor's Phase 2 source-read to
per-frame GL composite of NV12 → RGB. **Today's
A1 reps show kwin_wayland at 0 % median, with no GL-composite
work in the perf samples.** There is no Wayland-side
KWin-induced overhead for the X11 cells to *be faster than*.

### The most likely mechanism (hypothesis, not yet verified)

KWin 6.6.4 (with kwin-fourier 6.6.4-3 patches applied) appears
to have engaged its **direct-scanout** code path for the
chrome-window-displaying-video workload. KWin's direct-scanout
support has been there for years on the Wayland backend and
has been progressively widened: when there's a single visible
"top" surface (the chrome window) whose buffer matches a
hardware plane's format/modifier capabilities, KWin can hand
that buffer to DRM directly without first GL-compositing it
into KWin's own framebuffer. The browser's RGB or NV12 dmabuf
goes onto Plane 39 (the Primary plane on rockchip-drm RK3568)
without any per-frame KWin GPU work.

This is **not** the wp_subsurface route the predecessor was
investigating — it's a different, simpler scanout path that
doesn't require the client to opt into overlay protocol. It
just requires the client's surface to be the only visible
non-trivial top-level plus a buffer format/modifier that DRM
can scan out.

If this hypothesis is correct, two things follow:

1. **The campaign's X11-vs-Wayland delta is much smaller than
   originally expected.** Both sessions can avoid per-frame
   compositor work for the single-window video case. The X11
   cells will not be measurably faster than Wayland for this
   workload.
2. **The campaign's mechanism is realised under Wayland too,
   when conditions are right.** "Plane-allocation freedom" is
   not X11-exclusive — it's just easier to engage on X11
   because there's no compositor in the path at all. On
   Wayland, KWin engages an equivalent fast-path when its
   heuristics allow.

### What the data does NOT establish

- That direct-scanout is the cause. The data is consistent
  with direct-scanout but a perf-only diagnosis can't pin it
  down. Phase 1 should add `drm_info` snapshots during
  playback (which plane is programmed with the chrome
  window's buffer FOURCC + modifier?) and KWin debug
  logging (`KWIN_DRM=1` dumps direct-scanout decisions) to
  confirm.
- That this behavior holds for multi-window scenarios. A
  single visible non-trivial top-level window is the simplest
  case for direct-scanout. If the operator works
  multi-windowed (panel + chrome + terminal + Konsole), the
  fast-path may decline and kwin %CPU may rebound. The
  matrix's relevance to daily-driver scenarios depends on
  this.
- That this behavior holds across browsers. chromium-fourier
  149 specifically has the patches that enable smooth NV12
  dmabuf production. Brave 147 stock and Firefox 150 may
  produce buffers in different shapes (RGB-pre-composited,
  different modifier) that don't satisfy the direct-scanout
  predicate. Phase 1 reps for those browsers will tell.

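
For the `drm_info` snapshots suggested in the first bullet, one way to reduce a series of snapshots to a per-plane summary is sketched below. It assumes the plane ID, FB ID and pixel format have already been transcribed from each snapshot by hand; it does not parse `drm_info` output. The FB IDs 60/61/66 and the ABGR8888 format on Plane 39 are the values reported for the later probe, while the snapshot order, the `Snapshot` type and the `summarise` helper are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Snapshot(NamedTuple):
    plane_id: int
    fb_id: int
    fourcc: str


def summarise(snapshots):
    """Group observations by plane: distinct FB IDs and pixel formats."""
    fb_ids = defaultdict(set)
    formats = defaultdict(set)
    for snap in snapshots:
        fb_ids[snap.plane_id].add(snap.fb_id)
        formats[snap.plane_id].add(snap.fourcc)
    return {
        plane: {"fb_ids": sorted(fb_ids[plane]), "formats": sorted(formats[plane])}
        for plane in fb_ids
    }


if __name__ == "__main__":
    # Illustrative order; only the IDs and the format come from the probe.
    observed = [
        Snapshot(39, 60, "ABGR8888"),
        Snapshot(39, 61, "ABGR8888"),
        Snapshot(39, 66, "ABGR8888"),
        Snapshot(39, 60, "ABGR8888"),
    ]
    print(summarise(observed))
```

The summary only organises the observations. Deciding whether a buffer belongs to KWin or to the client still requires comparing the format and modifier against what the browser exports.
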
## Cross-check: predecessor's same-condition reps

For instrument sanity (NOT as comparison target — these are
the predecessor's `kwin_timing_nodebug_rep[1-3]` numbers from
2026-05-02/03):

| | Predecessor median | This campaign median |
|---|---|---|
| frames_total | 1688 | **1685** |
| drops_total | 44 | **0** |
| drops_post_warmup | 28 | **0** |
| kwin %CPU median | 35.9 | **0.00** |

The frames_total match indicates the test page + chromium-fourier
emit at the same rate in both campaigns. The drops + kwin %CPU
divergence is too large to be measurement noise — something
about the runtime conditions changed between 2026-05-02
(predecessor's reps) and 2026-05-03 14:23 (today's reps), even
though the package versions are identical and the test page is
the same file.

Possible non-package causes worth listing here so a follow-up
can investigate (out of A1 scope):

- **Boot generation:** predecessor's reps were on a session
  that had been running ~9 hours; today's reps were on a
  session that had been running ~50 minutes after autologin
  (revert.log entry 6).
- **Cumulative session state:** predecessor's session likely
  had multiple browser instances and other windows open
  during the campaign's preceding work; today's session was
  freshly autologin'd from greeter, only the test chrome
  window visible.
- **Thermal:** predecessor's `temp_pre.txt` for rep 1 isn't
  in our scope to check (would be predecessor data import);
  but today's reps had cpu-thermal at 36 °C pre-rep, well
  below thermal-throttle thresholds.
- **kwin / qt patches state:** packages identical per
  `pacman -Q`, but the runtime state of KWin's heuristics
  (window-rule cache, scanout-decision history) might differ
  between sessions. This is a normal property of compositors
  and explains some run-to-run variance even on the same
  binary.

The discipline rule already required the in-session re-measurement
this campaign just did. The predecessor's number is no longer
the reference; **this campaign's measured median (0 drops, 0 %
kwin) is the reference for any X11 cells the campaign will
later compare against**.

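
The reference medians and the tolerances from the Results section can be applied mechanically to a later cell. The sketch below is one possible reading: it treats the "sub-1 %" band on kwin %CPU as an absolute band of one percentage point (a relative band is undefined around a 0 % anchor) and the drops tolerance as at most one frame. The `Anchor` type, its field names and that interpretation are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    kwin_cpu_median: float = 0.00   # % CPU, this campaign's A1 median
    drops_post_warmup: int = 0      # frames, this campaign's A1 median
    cpu_band_pp: float = 1.0        # assumed: absolute percentage points
    drops_band_frames: int = 1      # at most one frame of difference


def within_anchor(kwin_cpu_median, drops_post_warmup, anchor=Anchor()):
    """True if a cell's medians fall inside the anchor's tolerance bands."""
    cpu_ok = abs(kwin_cpu_median - anchor.kwin_cpu_median) < anchor.cpu_band_pp
    drops_ok = (
        abs(drops_post_warmup - anchor.drops_post_warmup)
        <= anchor.drops_band_frames
    )
    return cpu_ok and drops_ok


if __name__ == "__main__":
    print(within_anchor(0.00, 0))    # the A1 medians themselves
    print(within_anchor(35.9, 28))   # the predecessor medians from the table
```

With these bands the A1 medians pass trivially and the predecessor medians fail on both metrics, which matches the "too large to be measurement noise" reading above.
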
## Phase 1 implications

The matrix design needs revisiting before Phase 1 cells lock:

1. **The mpv `--vo=xv` cell remains the most informative
   single point** for the campaign's original mechanism (does
   the X server route NV12 to Plane 39 directly?), per the
   browser overlay inventory's verdict.
2. **The browser X11 cells become a measurement of "do
   browsers under X11 get the equivalent direct-scanout
   benefit they get under Wayland?"** rather than the original
   "does X11 win over Wayland?" framing. Three plausible
   outcomes:
   - X11 cells match Wayland baseline (both engage direct
     scanout) → "compositor-or-not is irrelevant for the
     single-window case on this hardware"
   - X11 cells slightly faster than Wayland (X11 path has
     less per-frame X protocol overhead) → small but real
     win for X11 daily-driver
   - X11 cells slower than Wayland (X11 path has issues
     KWin's direct-scanout doesn't) → unexpected; would need
     re-investigation
3. **A multi-window variant** of the with-KWin baseline
   should be added before Phase 1 binding-cell lock —
   otherwise the matrix only measures the easiest scenario.
   Suggested add: A1' rep with chrome + Konsole + Plasma
   panel all visible, see if kwin %CPU rebounds. If it does,
   the matrix's daily-driver-relevance picture is more
   nuanced.

The campaign continues with the matrix as defined, but with
the understanding that the original framing is partially
invalidated. Phase 1 will lock around the reframed sub-questions
in `phase0_evidence/browser_overlay_inventory_2026-05-03.md` §
"Implications for the matrix" + the multi-window add above.

## Files in this evidence dir

```
a1_rep1/  a1_rep2/  a1_rep3/   — three rep evidence dirs
01_live_session.txt            — Wayland session state at A1 capture time
02_predecessor_assets.txt      — verification that predecessor scripts/assets reusable
a1_protocol.md                 — protocol spec, run beforehand
a1_summary.md                  — this file
```

Each rep directory contains:
- `start.txt` / `end.txt` / `capture_start.txt` — wall-clock
- `temp_pre.txt` / `temp_post.txt` — cpu-thermal temp
- `top_kwin.txt` — kwin_wayland top samples (70 × 1 Hz)
- `top_full.txt` — system top samples (70 × 1 Hz)
- `stderr.log` — chromium stderr (full)
- `drops_trajectory.txt` — DROPS_TRAJECTORY lines (73 each)
- `drops_summary.txt` — frames_total / drops_total / drops_post_warmup
- `kwin_cpu_summary.txt` — kwin %CPU stats
- `perf_record_stderr.txt` — perf recorder's own stderr
- `perf_report_self.txt` / `perf_report_top50.txt` — perf
  flamegraph (text)

`perf.data` files (~400 KB each) are root-owned on ohm
(created by `sudo perf record`) and were not synced to noether.
They remain at `/home/mfritsche/phase3_prime_runs/x11research_a1_rep[1-3]/perf.data`
on ohm if re-analysis is needed.