Phase 0: A1 Wayland baseline + state snapshot — major reframing

3 in-session reps of chromium-fourier 149 / brave_drops_test.html / Plasma Wayland 6.6.4 (kwin-fourier 6.6.4-3 + qt6-base-fourier 6.11.0-3 carry-overs intact).

Tight cluster, IQR=0: drops_total=0, drops_post_warmup=0, frames_total=1685, kwin %CPU median=0.00, mean=0.04. Perf samples on kwin (~30 over 70s) show zero composite/dmabuf/GL symbols, only event-loop bookkeeping.

Most likely mechanism: KWin direct-scanout fast-path engaged for the single-visible-client video case. The campaign's load-bearing hypothesis ("X11 + non-compositing WM avoids per-frame GL composite of NV12") is structurally weakened: KWin already avoids that work under Wayland for this workload. Phase 1 needs to add a multi-window A1' variant and drm_info-during-playback to confirm direct-scanout, then revisit matrix cell design.

revert.log entry 6: SDDM autologin + state.conf swap that landed the Plasma Wayland session for the A1 reps. Backup of original state.conf preserved at /var/lib/sddm/state.conf.x11-research-bak; single-command revert documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:11:15 +00:00
parent d2e11be430
commit 9a023e9264
45 changed files with 3199 additions and 17 deletions
@@ -0,0 +1,248 @@
+# A1 baseline — verdict
+
+**Three in-session reps of chromium-fourier 149 / brave_drops_test.html
+/ Plasma Wayland 6.6.4 + kwin-fourier 6.6.4-3 + qt6-base-fourier
+6.11.0-3 acquired 2026-05-03 14:23–14:31 CEST.**
+
+Per the campaign-contained-data discipline, this is the only
+Wayland baseline this campaign uses. Predecessor numbers are
+referenced below as instrument-sanity context, not as comparison
+targets.
+
+## Results
+
+| Metric | Rep 1 | Rep 2 | Rep 3 | Median (cluster) |
+|---|---|---|---|---|
+| frames_total (over 70 s) | 1685 | 1685 | 1686 | **1685** |
+| effective fps | 24.04 | 24.04 | 24.05 | **24.04** |
+| drops_total | 0 | 0 | 0 | **0** |
+| drops_post_warmup (after t=10s) | 0 | 0 | 0 | **0** |
+| kwin_wayland %CPU median (1 Hz × 70 samples) | 0.00 | 0.00 | 0.00 | **0.00** |
+| kwin_wayland %CPU mean | 0.07 | 0.04 | 0.04 | **0.04** |
+| kwin_wayland %CPU max | 4.00 | 1.00 | 3.00 | **3.00** |
+| perf samples on kwin_wayland (99 Hz × 70 s) | 39 | 28 | (similar) | tens, not thousands |
+
+**IQR for the spread metrics (drops and kwin %CPU median) is 0:
+the cluster is degenerate at the lower bound, so a relative
+IQR / median ratio is not meaningful here.** Per the protocol's
+exit-condition tree, this is the "tight cluster" branch: Phase 1
+binding cells can use these as the anchor with an absolute
+tolerance band of under 1 percentage point on kwin %CPU and a
+≤ 1-frame tolerance on drops_post_warmup.
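+
+The tight-cluster check and the tolerance band can be sketched
+as follows. This is an illustrative sketch, not the protocol's
+own script: the quartile method (the default of
+`statistics.quantiles`), the names `iqr` and `within_anchor`,
+and reading the kwin band as an absolute window of 1 percentage
+point are all assumptions.
+
+```python
+import statistics
+
+# Per-rep values copied from the Results table.
+REPS = {
+    "frames_total": [1685, 1685, 1686],
+    "drops_total": [0, 0, 0],
+    "drops_post_warmup": [0, 0, 0],
+    "kwin_cpu_median": [0.00, 0.00, 0.00],
+}
+
+
+def iqr(values):
+    """Interquartile range using the stdlib's default quartile method."""
+    q1, _, q3 = statistics.quantiles(values, n=4)
+    return q3 - q1
+
+
+def within_anchor(kwin_cpu_median, drops_post_warmup,
+                  anchor_cpu=0.0, anchor_drops=0):
+    """Hypothetical Phase 1 check against the A1 anchor."""
+    cpu_ok = abs(kwin_cpu_median - anchor_cpu) < 1.0       # percentage points
+    drops_ok = abs(drops_post_warmup - anchor_drops) <= 1  # frames
+    return cpu_ok and drops_ok
+
+
+for name, values in REPS.items():
+    print(f"{name}: median={statistics.median(values)} IQR={iqr(values)}")
+```
+
+Under these assumptions, a cell matching the predecessor's
+medians (35.9 % kwin, 28 post-warmup drops) would fail
+`within_anchor`, while a cell at 0.5 % kwin with one dropped
+frame would pass.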
+
+Source video plays at 24 fps for 70.09 s ≈ 1682 frames; the
+observed 1685 frames is within 3 frames (about 0.2 %) of that
+(chromium counts the DROPS_TRAJECTORY playing-event frame
+separately).
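+
+The frame arithmetic above can be reproduced directly. A minimal
+sketch: the constant names are illustrative, and the 24 fps rate
+and 70.09 s duration are the values quoted above.
+
+```python
+SOURCE_FPS = 24.0      # nominal rate of the source video
+DURATION_S = 70.09     # source duration quoted above
+OBSERVED_FRAMES = [1685, 1685, 1686]  # frames_total for reps 1-3
+
+expected_frames = SOURCE_FPS * DURATION_S  # about 1682.16
+
+for rep, frames in enumerate(OBSERVED_FRAMES, start=1):
+    effective_fps = frames / DURATION_S
+    surplus = frames - expected_frames
+    print(f"rep {rep}: {effective_fps:.2f} fps, "
+          f"{surplus:+.2f} frames vs nominal "
+          f"({100 * surplus / expected_frames:+.2f} %)")
+```
+
+The effective-fps row of the Results table corresponds to
+`frames / DURATION_S`, i.e. division by the source duration
+rather than by a flat 70 s.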
+
+## What the perf reports say kwin was actually doing
+
+For all three reps, the perf samples on kwin_wayland during
+playback are dominated by event-loop bookkeeping, **not** by
+any GL-composite or dmabuf-import path:
+
+| Rep | Top symbol(s) at non-trivial % |
+|---|---|
+| 1 | `__pi_memcpy_generic` (97.18 %) — single memcpy event in a 39-sample run |
+| 2 | `libz.so` (37.57 %), `call_filldir` (31.91 %) — readdir + zlib |
+| 3 | `dbus_message_unref` (38.80 %), `QUnixEventDispatcherQPA::processEvents` (36.72 %), `libz.so` (23.30 %) — DBus + Qt event loop |
+
+**Zero samples anywhere in:**
+- `glEGLImageTargetTexture2DOES` (the GL EGL image bind path
+  the predecessor's `kwin_overlay_subsurface` campaign Phase 2
+  hypothesised would dominate per-frame KWin cost on this
+  hardware)
+- `panfrost_*` (Mesa Panfrost driver routines)
+- `wp_subsurface_*` / `WaylandSurface_*` (overlay/subsurface
+  protocol handling)
+- any `Compositor::*` / `OpenGLBackend::*` / `OutputLayer::*`
+  KWin internal symbols
+
+The total cycle count across all three reps is in the millions,
+not billions — kwin_wayland was scheduled out for >99 % of the
+70 s capture window in every rep.
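+
+The scheduled-out estimate can be cross-checked from the sample
+counts alone. A rough sketch, assuming perf only records a
+sample while the profiled task is on a CPU and treating
+kwin_wayland as occupying at most one core at a time; the
+function name `on_cpu_fraction` is illustrative.
+
+```python
+def on_cpu_fraction(samples, freq_hz=99, window_s=70):
+    """Estimate the share of the window a task spent on-CPU.
+
+    A task that ran continuously on one core would yield roughly
+    freq_hz * window_s samples, so the ratio approximates on-CPU time.
+    """
+    return samples / (freq_hz * window_s)
+
+
+for rep, samples in ((1, 39), (2, 28)):
+    frac = on_cpu_fraction(samples)
+    print(f"rep {rep}: {samples} samples -> "
+          f"{100 * frac:.2f} % on-CPU, {100 * (1 - frac):.2f} % off-CPU")
+```
+
+With 39 and 28 samples against a ceiling of 6930, both reps land
+well under 1 % on-CPU.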
+
+## What this means
+
+**The "KWin is the bottleneck" framing the campaign was built
+around is structurally weakened by these data.**
+
+`README.md` § 1 states the campaign's load-bearing hypothesis:
+"the campaign's load-bearing hypothesis is that this
+plane-allocation freedom translates into measurable browser-video
+speedup." That hypothesis was built on top of the predecessor's
+observation that `kwin_wayland` consumed ~36 % CPU during
+similar playback, attributable per the predecessor's Phase 2
+source-read to per-frame GL composite of NV12 → RGB. **Today's
+A1 reps show kwin_wayland at 0 % median, with no GL-composite
+work in the perf samples.** There is no Wayland-side
+KWin-induced overhead for the X11 cells to *be faster than*.
+
+### The most likely mechanism (hypothesis, not yet verified)
+
+KWin 6.6.4 (with kwin-fourier 6.6.4-3 patches applied) appears
+to have engaged its **direct-scanout** code path for the
+chrome-window-displaying-video workload. KWin's direct-scanout
+support has been there for years on the Wayland backend and
+has been progressively widened: when there's a single visible
+"top" surface (the chrome window) whose buffer matches a
+hardware plane's format/modifier capabilities, KWin can hand
+that buffer to DRM directly without first GL-compositing it
+into KWin's own framebuffer. The browser's RGB or NV12 dmabuf
+goes onto Plane 39 (the Primary plane on rockchip-drm RK3568)
+without any per-frame KWin GPU work.
+
+This is **not** the wp_subsurface route the predecessor was
+investigating — it's a different, simpler scanout path that
+doesn't require the client to opt into overlay protocol. It
+just requires the client's surface to be the only visible
+non-trivial top-level plus a buffer format/modifier that DRM
+can scan out.
+
+If this hypothesis is correct, two things follow:
+
+1. **The campaign's X11-vs-Wayland delta is likely much smaller
+   than originally expected.** Both sessions can avoid per-frame
+   compositor work for the single-window video case, so the X11
+   cells are unlikely to be measurably faster than Wayland for
+   this workload.
+2. **The campaign's mechanism is realised under Wayland too,
+   when conditions are right.** "Plane-allocation freedom" is
+   not X11-exclusive — it's just easier to engage on X11
+   because there's no compositor in the path at all. On
+   Wayland, KWin engages an equivalent fast-path when its
+   heuristics allow.
+
+### What the data does NOT establish
+
+- That direct-scanout is the cause. The data is consistent
+  with direct-scanout but a perf-only diagnosis can't pin it
+  down. Phase 1 should add `drm_info` snapshots during
+  playback (which plane is programmed with the chrome
+  window's buffer FOURCC + modifier?) and KWin DRM debug
+  logging (the `kwin_wayland_drm` Qt logging category, e.g.
+  `QT_LOGGING_RULES="kwin_wayland_drm.debug=true"`) to look
+  for direct-scanout decisions and confirm.
+- That this behavior holds for multi-window scenarios. A
+  single visible non-trivial top-level window is the simplest
+  case for direct-scanout. If the operator works
+  multi-windowed (panel + chrome + Konsole), the
+  fast-path may decline and kwin %CPU may rebound. The
+  matrix's relevance to daily-driver scenarios depends on
+  this.
+- That this behavior holds across browsers. chromium-fourier
+  149 specifically has the patches that enable smooth NV12
+  dmabuf production. Brave 147 stock and Firefox 150 may
+  produce buffers in different shapes (RGB-pre-composited,
+  different modifier) that don't satisfy the direct-scanout
+  predicate. Phase 1 reps for those browsers will tell.
+
+## Cross-check: predecessor's same-condition reps
+
+For instrument sanity (NOT as comparison target — these are
+the predecessor's `kwin_timing_nodebug_rep[1-3]` numbers from
+2026-05-02/03):
+
+| | Predecessor median | This campaign median |
+|---|---|---|
+| frames_total | 1688 | **1685** |
+| drops_total | 44 | **0** |
+| drops_post_warmup | 28 | **0** |
+| kwin %CPU median | 35.9 | **0.00** |
+
+The frames_total match indicates the test page + chromium-fourier
+emit at the same rate in both campaigns. The drops + kwin %CPU
+divergence is too large to be measurement noise: something
+about the runtime conditions changed between the predecessor's
+reps (2026-05-02/03) and today's reps (2026-05-03 14:23), even
+though the package versions are identical and the test page is
+the same file.
+
+Possible non-package causes worth listing here so a follow-up
+can investigate (out of A1 scope):
+
+- **Session age:** predecessor's reps were on a session
+  that had been running ~9 hours; today's reps were on a
+  session that had been running ~50 minutes after autologin
+  (revert.log entry 6).
+- **Cumulative session state:** predecessor's session likely
+  had multiple browser instances and other windows open
+  during the campaign's preceding work; today's session was
+  freshly autologin'd from greeter, only the test chrome
+  window visible.
+- **Thermal:** predecessor's `temp_pre.txt` for rep 1 isn't
+  in our scope to check (would be predecessor data import);
+  but today's reps had cpu-thermal at 36 °C pre-rep, well
+  below thermal-throttle thresholds.
+- **kwin / qt patches state:** packages identical per
+  `pacman -Q`, but the runtime state of KWin's heuristics
+  (window-rule cache, scanout-decision history) might differ
+  between sessions. If so, this could account for some
+  run-to-run variance even on the same binary, though it is
+  unverified.
+
+The discipline rule already required the in-session re-measurement
+this campaign just did. The predecessor's number is no longer
+the reference; **this campaign's measured median (0 drops, 0 %
+kwin) is the reference for any X11 cells the campaign will
+later compare against**.
+
+## Phase 1 implications
+
+The matrix design needs revisiting before Phase 1 cells lock:
+
+1. **The mpv `--vo=xv` cell remains the most informative
+   single point** for the campaign's original mechanism (does
+   the X server route NV12 to Plane 39 directly?), per the
+   browser overlay inventory's verdict.
+2. **The browser X11 cells become a measurement of "do
+   browsers under X11 get the equivalent direct-scanout
+   benefit they get under Wayland?"** rather than the original
+   "does X11 win over Wayland?" framing. Three plausible
+   outcomes:
+   - X11 cells match Wayland baseline (both engage direct
+     scanout) → "compositor-or-not is irrelevant for the
+     single-window case on this hardware"
+   - X11 cells slightly faster than Wayland (X11 path has
+     less per-frame overhead than the Wayland path) → small but real
+     win for X11 daily-driver
+   - X11 cells slower than Wayland (X11 path has issues
+     KWin's direct-scanout doesn't) → unexpected; would need
+     re-investigation
+3. **A multi-window variant** of the with-KWin baseline
+   should be added before Phase 1 binding-cell lock —
+   otherwise the matrix only measures the easiest scenario.
+   Suggested add: A1' rep with chrome + Konsole + Plasma
+   panel all visible, see if kwin %CPU rebounds. If it does,
+   the matrix's daily-driver-relevance picture is more
+   nuanced.
+
+The campaign continues with the matrix as defined, but with
+the understanding that the original framing is partially
+invalidated. Phase 1 will lock around the reframed sub-questions
+in `phase0_evidence/browser_overlay_inventory_2026-05-03.md` §
+"Implications for the matrix" + the multi-window add above.
+
+## Files in this evidence dir
+
+```
+a1_rep1/  a1_rep2/  a1_rep3/   — three rep evidence dirs
+01_live_session.txt              — Wayland session state at A1 capture time
+02_predecessor_assets.txt        — verification that predecessor scripts/assets reusable
+a1_protocol.md                   — protocol spec, run beforehand
+a1_summary.md                    — this file
+```
+
+Each rep directory contains:
+- `start.txt` / `end.txt` / `capture_start.txt` — wall-clock
+- `temp_pre.txt` / `temp_post.txt` — cpu-thermal temp
+- `top_kwin.txt` — kwin_wayland top samples (70 × 1 Hz)
+- `top_full.txt` — system top samples (70 × 1 Hz)
+- `stderr.log` — chromium stderr (full)
+- `drops_trajectory.txt` — DROPS_TRAJECTORY lines (73 each)
+- `drops_summary.txt` — frames_total / drops_total / drops_post_warmup
+- `kwin_cpu_summary.txt` — kwin %CPU stats
+- `perf_record_stderr.txt` — perf recorder's own stderr
+- `perf_report_self.txt` / `perf_report_top50.txt` — perf
+  report output (text)
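+
+The kwin %CPU statistics reported in the Results table are a
+median, mean, and max over the 70 samples per rep. The sketch
+below shows how a median of 0.00 coexists with a non-zero mean.
+The trace is synthetic (69 idle samples and one 3 % spike),
+chosen only to be consistent with Rep 3's reported numbers; it
+is not the recorded data, and `summarize_cpu` is a hypothetical
+name.
+
+```python
+import statistics
+
+
+def summarize_cpu(samples):
+    """Return median, mean and max of a list of %CPU samples."""
+    return {
+        "median": statistics.median(samples),
+        "mean": statistics.fmean(samples),
+        "max": max(samples),
+    }
+
+
+# Synthetic 70-sample trace: idle except for a single 3 % spike.
+synthetic_trace = [0.0] * 69 + [3.0]
+summary = summarize_cpu(synthetic_trace)
+print({key: round(value, 2) for key, value in summary.items()})
+```
+
+Because the median ignores isolated spikes, it is the more
+robust anchor for a mostly idle process; the mean and max record
+how much those spikes contributed.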
+
+`perf.data` files (~400 KB each) are root-owned on ohm
+(created by `sudo perf record`) and were not synced to noether.
+They remain at `/home/mfritsche/phase3_prime_runs/x11research_a1_rep[1-3]/perf.data`
+on ohm if re-analysis is needed.