marfrit d1285ca2b2 Phase 0: drmprobe findings + A1 corrigendum + worklist update
The drm_info-during-playback probe under Wayland confirmed
KWin IS GL-compositing per frame: Plane 39 rotates triple-
buffered ABGR8888 framebuffers (FB IDs 60/61/66) during
playback. The earlier "0% kwin CPU = direct-scanout" reading
in a1_summary.md was a CPU-blind-spot artifact — Panfrost
shader work isn't visible to top or perf-on-userspace.

Corrigendum added to a1_summary.md preserving the original
text as the on-the-day record.

Worklist: A1 entry updated to point at both summaries.

The probe itself was invasive (drm_info every 3s perturbed
KWin's atomic-commit fast-path, kwin %CPU jumped 0->18 median
and 6 drops appeared) — usable diagnostically but cannot be
embedded in measurement reps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:46:21 +00:00

# A1 baseline — verdict
> **CORRIGENDUM 2026-05-03 ~17:00 CEST** — the
> "Major reframing finding" section below interpreted 0 %
> kwin CPU as "KWin doing nothing per frame, direct-scanout
> engaged, campaign hypothesis structurally weakened." That
> interpretation is **wrong**. A subsequent
> drm_info-during-playback probe (see
> `drmprobe_findings.md` in this directory) shows Plane 39
> rotates triple-buffered ABGR8888 framebuffers throughout
> playback — KWin IS doing per-frame GL composite, but
> the work is GPU-side (Panfrost) and invisible to top /
> perf-on-userspace. The campaign's mechanism is intact;
> the matrix needs GPU-aware metrics. The original text
> below is preserved as the on-the-day record of what I
> concluded with insufficient instrumentation.
---
**Three in-session reps of chromium-fourier 149 / brave_drops_test.html
/ Plasma Wayland 6.6.4 + kwin-fourier 6.6.4-3 + qt6-base-fourier
6.11.0-3 acquired 2026-05-03 14:23–14:31 CEST.**

Per the campaign-contained-data discipline, this is the only
Wayland baseline this campaign uses. Predecessor numbers are
referenced below as instrument-sanity context, not as comparison
targets.
## Results
| Metric | Rep 1 | Rep 2 | Rep 3 | Median (cluster) |
|---|---|---|---|---|
| frames_total (over 70 s) | 1685 | 1685 | 1686 | **1685** |
| effective fps | 24.04 | 24.04 | 24.05 | **24.04** |
| drops_total | 0 | 0 | 0 | **0** |
| drops_post_warmup (after t=10s) | 0 | 0 | 0 | **0** |
| kwin_wayland %CPU median (1 Hz × 70 samples) | 0.00 | 0.00 | 0.00 | **0.00** |
| kwin_wayland %CPU mean | 0.07 | 0.04 | 0.04 | **0.04** |
| kwin_wayland %CPU max | 4.00 | 1.00 | 3.00 | **3.00** |
| perf samples on kwin_wayland (99 Hz × 70 s) | 39 | 28 | (similar) | tens, not thousands |

**IQR / median for the spread metrics is 0 (cluster is degenerate
at the lower bound).** Per the protocol's exit-condition tree,
this is the "tight cluster" branch: Phase 1 binding cells can
use these as the anchor with a sub-1% tolerance band on kwin
%CPU and a ≤ 1-frame tolerance on drops_post_warmup.
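
As a minimal sketch of the spread check described above, the following recomputes median and IQR from the per-rep values in the Results table. The `within_band` helper and its reading of "sub-1%" as one percentage point of kwin %CPU are assumptions made for illustration, not part of the protocol text.

```python
from statistics import median, quantiles

# Per-rep values copied from the Results table.
reps = {
    "drops_post_warmup": [0, 0, 0],
    "kwin_cpu_median_pct": [0.00, 0.00, 0.00],
    "kwin_cpu_mean_pct": [0.07, 0.04, 0.04],
    "kwin_cpu_max_pct": [4.00, 1.00, 3.00],
}

def iqr(values):
    """Interquartile range, using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def summarise(values):
    return {"median": median(values), "iqr": iqr(values)}

def within_band(candidate, anchor, tolerance):
    """Hypothetical helper: is a later cell within +/- tolerance of the anchor?"""
    return abs(candidate - anchor) <= tolerance

summary = {name: summarise(vals) for name, vals in reps.items()}
for name, stats in summary.items():
    print(name, stats)
```

With three reps the IQR is a coarse statistic, so the zero spread here says the cluster sits at the floor rather than that the instrument is precise.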
Source video plays at 24 fps for 70.09 s ≈ 1682 frames; observed
1685 frames matches within rounding (chromium counts the
DROPS_TRAJECTORY playing-event frame separately).
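
The frame-count arithmetic can be checked directly. This sketch uses only the figures quoted above (24 fps, 70.09 s, and the three observed totals); dividing by 70.09 s rather than 70 s is an assumption that happens to reproduce the table's effective-fps row.

```python
SOURCE_FPS = 24
SOURCE_DURATION_S = 70.09
observed_frames = [1685, 1685, 1686]  # frames_total for reps 1-3

# Frames the source clip should deliver over its full duration.
expected_frames = SOURCE_FPS * SOURCE_DURATION_S

# Effective fps per rep, rounded as in the Results table.
effective_fps = [round(n / SOURCE_DURATION_S, 2) for n in observed_frames]

# Observed frames beyond the rounded expectation.
surplus = [n - round(expected_frames) for n in observed_frames]

print(round(expected_frames), effective_fps, surplus)
```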
## What the perf reports say kwin was actually doing
For all three reps, the perf samples on kwin_wayland during
playback are dominated by event-loop bookkeeping, **not** by
any GL-composite or dmabuf-import path:
| Rep | Top symbol(s) at non-trivial % |
|---|---|
| 1 | `__pi_memcpy_generic` (97.18 %) — single memcpy event in a 39-sample run |
| 2 | `libz.so` (37.57 %), `call_filldir` (31.91 %) — readdir + zlib |
| 3 | `dbus_message_unref` (38.80 %), `QUnixEventDispatcherQPA::processEvents` (36.72 %), `libz.so` (23.30 %) — DBus + Qt event loop |
**Zero samples anywhere in:**
- `glEGLImageTargetTexture2DOES` (the GL EGL image bind path
the predecessor's `kwin_overlay_subsurface` campaign Phase 2
hypothesised would dominate per-frame KWin cost on this
hardware)
- `panfrost_*` (Mesa Panfrost driver routines)
- `wp_subsurface_*` / `WaylandSurface_*` (overlay/subsurface
protocol handling)
- any `Compositor::*` / `OpenGLBackend::*` / `OutputLayer::*`
KWin internal symbols

The total cycle count across all three reps is in the millions,
not billions — kwin_wayland was scheduled out for >99 % of the
70 s capture window in every rep.
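
A rough cross-check of the ">99 %" figure from the sample counts alone, under stated assumptions: each perf sample is treated as standing for about 1/99 s of on-CPU time, and kwin_wayland is treated as a single thread, so the result is an estimate rather than a measurement. Rep 3 is omitted because its count is recorded only as "(similar)".

```python
SAMPLE_HZ = 99
WINDOW_S = 70
samples = {"rep1": 39, "rep2": 28}  # from the Results table

# Samples a continuously running thread would have produced.
max_samples = SAMPLE_HZ * WINDOW_S

# Estimated fraction of the window that kwin_wayland spent on-CPU.
on_cpu_fraction = {rep: n / max_samples for rep, n in samples.items()}

for rep, frac in on_cpu_fraction.items():
    print(f"{rep}: {frac:.2%} on-CPU, {1 - frac:.2%} off-CPU")
```

Both reps land well under 1 % on-CPU, which matches the scheduled-out claim but says nothing about work done off-CPU.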
## What this means
**The "KWin is the bottleneck" framing the campaign was built
around is structurally weakened by these data.**

`README.md` § 1 states that "the campaign's load-bearing
hypothesis is that this plane-allocation freedom translates into
measurable browser-video speedup." That hypothesis was built on
top of the predecessor's
observation that `kwin_wayland` consumed ~36 % CPU during
similar playback, attributable per the predecessor's Phase 2
source-read to per-frame GL composite of NV12 → RGB. **Today's
A1 reps show kwin_wayland at 0 % median, with no GL-composite
work in the perf samples.** There is no Wayland-side
KWin-induced overhead for the X11 cells to *be faster than*.
### The most likely mechanism (hypothesis, not yet verified)
KWin 6.6.4 (with kwin-fourier 6.6.4-3 patches applied) appears
to have engaged its **direct-scanout** code path for the
chrome-window-displaying-video workload. KWin's direct-scanout
support has been there for years on the Wayland backend and
has been progressively widened: when there's a single visible
"top" surface (the chrome window) whose buffer matches a
hardware plane's format/modifier capabilities, KWin can hand
that buffer to DRM directly without first GL-compositing it
into KWin's own framebuffer. The browser's RGB or NV12 dmabuf
goes onto Plane 39 (the Primary plane on rockchip-drm RK3568)
without any per-frame KWin GPU work.

This is **not** the wp_subsurface route the predecessor was
investigating — it's a different, simpler scanout path that
doesn't require the client to opt into overlay protocol. It
just requires the client's surface to be the only visible
non-trivial top-level plus a buffer format/modifier that DRM
can scan out.

If this hypothesis is correct, two things follow:
1. **The campaign's X11-vs-Wayland delta is much smaller than
originally expected.** Both sessions can avoid per-frame
compositor work for the single-window video case. The X11
cells will not be measurably faster than Wayland for this
workload.
2. **The campaign's mechanism is realised under Wayland too,
when conditions are right.** "Plane-allocation freedom" is
not X11-exclusive — it's just easier to engage on X11
because there's no compositor in the path at all. On
Wayland, KWin engages an equivalent fast-path when its
heuristics allow.
### What the data does NOT establish
- That direct-scanout is the cause. The data is consistent
with direct-scanout but a perf-only diagnosis can't pin it
down. Phase 1 should add `drm_info` snapshots during
playback (which plane is programmed with the chrome
window's buffer FOURCC + modifier?) and KWin debug
logging (`KWIN_DRM=1` dumps direct-scanout decisions) to
confirm.
- That this behavior holds for multi-window scenarios. A
single visible non-trivial top-level window is the simplest
case for direct-scanout. If the operator works
multi-windowed (panel + chrome + Konsole), the
fast-path may decline and kwin %CPU may rebound. The
matrix's relevance to daily-driver scenarios depends on
this.
- That this behavior holds across browsers. chromium-fourier
149 specifically has the patches that enable smooth NV12
dmabuf production. Brave 147 stock and Firefox 150 may
produce buffers in different shapes (RGB-pre-composited,
different modifier) that don't satisfy the direct-scanout
predicate. Phase 1 reps for those browsers will tell.
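
The `drm_info` snapshot check proposed in the first bullet could be reduced to something like the sketch below. It assumes the snapshots have already been parsed into `(plane_id, fb_id, fourcc)` tuples; that tuple layout, both function names, and the parsing step itself (not shown) are hypothetical. The sample observations reuse the plane and FB IDs quoted in the corrigendum at the top of this file.

```python
from collections import defaultdict

def summarise_planes(snapshots):
    """Group (plane_id, fb_id, fourcc) observations by plane."""
    planes = defaultdict(lambda: {"fb_ids": set(), "formats": set()})
    for plane_id, fb_id, fourcc in snapshots:
        if fb_id == 0:  # plane had no framebuffer attached in this snapshot
            continue
        planes[plane_id]["fb_ids"].add(fb_id)
        planes[plane_id]["formats"].add(fourcc)
    return dict(planes)

def carries_client_format(plane_summary, client_formats):
    """True if the plane ever carried one of the client's buffer formats."""
    return bool(plane_summary["formats"] & set(client_formats))

# Illustrative observations only, not a capture log.
observations = [
    (39, 60, "ABGR8888"),
    (39, 61, "ABGR8888"),
    (39, 66, "ABGR8888"),
    (39, 60, "ABGR8888"),
]

result = summarise_planes(observations)
print(result[39], carries_client_format(result[39], ["NV12"]))
```

A format match is a necessary condition for direct scanout, not a sufficient one, so a check like this can rule the hypothesis out more easily than it can confirm it.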
## Cross-check: predecessor's same-condition reps
For instrument sanity (NOT as comparison target — these are
the predecessor's `kwin_timing_nodebug_rep[1-3]` numbers from
2026-05-02/03):
| | Predecessor median | This campaign median |
|---|---|---|
| frames_total | 1688 | **1685** |
| drops_total | 44 | **0** |
| drops_post_warmup | 28 | **0** |
| kwin %CPU median | 35.9 | **0.00** |

The frames_total match indicates the test page + chromium-fourier
emit at the same rate in both campaigns. The drops + kwin %CPU
divergence is too large to be measurement noise — something
about the runtime conditions changed between 2026-05-02
(predecessor's reps) and 2026-05-03 14:23 (today's reps), even
though the package versions are identical and the test page is
the same file.

Possible non-package causes worth listing here so a follow-up
can investigate (out of A1 scope):
- **Boot generation:** predecessor's reps were on a session
that had been running ~9 hours; today's reps were on a
session that had been running ~50 minutes after autologin
(revert.log entry 6).
- **Cumulative session state:** predecessor's session likely
had multiple browser instances and other windows open
during the campaign's preceding work; today's session had
freshly autologged in from the greeter, with only the test
chrome window visible.
- **Thermal:** predecessor's `temp_pre.txt` for rep 1 is out of
scope to check here (doing so would be a predecessor data
import), but today's reps had cpu-thermal at 36 °C pre-rep,
well below thermal-throttle thresholds.
- **kwin / qt patches state:** packages identical per
`pacman -Q`, but the runtime state of KWin's heuristics
(window-rule cache, scanout-decision history) might differ
between sessions. This is a normal property of compositors
and explains some run-to-run variance even on the same
binary.

The discipline rule already required the in-session re-measurement
this campaign just did. The predecessor's number is no longer
the reference; **this campaign's measured median (0 drops, 0 %
kwin) is the reference for any X11 cells the campaign will
later compare against**.
## Phase 1 implications
The matrix design needs revisiting before Phase 1 cells lock:
1. **The mpv `--vo=xv` cell remains the most informative
single point** for the campaign's original mechanism (does
the X server route NV12 to Plane 39 directly?), per the
browser overlay inventory's verdict.
2. **The browser X11 cells become a measurement of "do
browsers under X11 get the equivalent direct-scanout
benefit they get under Wayland?"** rather than the original
"does X11 win over Wayland?" framing. Three plausible
outcomes:
   - X11 cells match Wayland baseline (both engage direct
     scanout) → "compositor-or-not is irrelevant for the
     single-window case on this hardware"
   - X11 cells slightly faster than Wayland (X11 path has
     less per-frame X protocol overhead) → small but real
     win for X11 daily-driver
   - X11 cells slower than Wayland (X11 path has issues
     KWin's direct-scanout doesn't) → unexpected; would need
     re-investigation
3. **A multi-window variant** of the with-KWin baseline
should be added before Phase 1 binding-cell lock —
otherwise the matrix only measures the easiest scenario.
Suggested add: A1' rep with chrome + Konsole + Plasma
panel all visible, see if kwin %CPU rebounds. If it does,
the matrix's daily-driver-relevance picture is more
nuanced.

The campaign continues with the matrix as defined, but with
the understanding that the original framing is partially
invalidated. Phase 1 will lock around the reframed sub-questions
in `phase0_evidence/browser_overlay_inventory_2026-05-03.md` §
"Implications for the matrix", plus the multi-window addition above.
## Files in this evidence dir
```
a1_rep1/ a1_rep2/ a1_rep3/ — three rep evidence dirs
01_live_session.txt — Wayland session state at A1 capture time
02_predecessor_assets.txt — verification that predecessor scripts/assets reusable
a1_protocol.md — protocol spec, run beforehand
a1_summary.md — this file
```
Each rep directory contains:
- `start.txt` / `end.txt` / `capture_start.txt` — wall-clock
- `temp_pre.txt` / `temp_post.txt` — cpu-thermal temp
- `top_kwin.txt` — kwin_wayland top samples (70 × 1 Hz)
- `top_full.txt` — system top samples (70 × 1 Hz)
- `stderr.log` — chromium stderr (full)
- `drops_trajectory.txt` — DROPS_TRAJECTORY lines (73 each)
- `drops_summary.txt` — frames_total / drops_total / drops_post_warmup
- `kwin_cpu_summary.txt` — kwin %CPU stats
- `perf_record_stderr.txt` — perf recorder's own stderr
- `perf_report_self.txt` / `perf_report_top50.txt` — perf
report output (text)
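
For readers re-deriving `drops_summary.txt`, the post-warmup figure can be computed along these lines. The sketch assumes the DROPS_TRAJECTORY lines have been reduced to `(t_seconds, cumulative_drops)` pairs in time order; the real line format is not reproduced in this file, so that shape and the function name are hypothetical.

```python
WARMUP_S = 10  # the Results table counts drops after t=10s

def drops_post_warmup(trajectory, warmup_s=WARMUP_S):
    """Drops accrued after warm-up, from (t_seconds, cumulative_drops) pairs."""
    if not trajectory:
        return 0
    baseline = 0
    for t, drops in trajectory:
        if t <= warmup_s:
            baseline = drops  # last cumulative count seen inside warm-up
        else:
            break
    return trajectory[-1][1] - baseline

# Made-up trajectory for illustration, not taken from any rep.
example = [(0, 0), (5, 3), (10, 4), (20, 6), (70, 9)]
print(drops_post_warmup(example))
```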
`perf.data` files (~400 KB each) are root-owned on ohm
(created by `sudo perf record`) and were not synced to noether.
They remain at `/home/mfritsche/phase3_prime_runs/x11research_a1_rep[1-3]/perf.data`
on ohm if re-analysis is needed.