x11-session-research/phase0_findings.md

# Phase 0 — substrate and locked research question

This is the campaign's Phase 0 doc: the locked research
question (see § "Research question (LOCKED 2026-05-03)"
below), the substrate inherited as *context* from the
predecessor `kwin_overlay_subsurface`, and the open questions
this campaign answers in-session before Phase 1 binding cells
lock.

## Campaign-contained data discipline (governing rule)

**This campaign acquires all its own measurement data from
scratch in this session.** Predecessor measurement numbers
(drop counts, perf percentages, Δ_present medians, kwin %CPU
values, threshold values) are documented for *context* but
**never imported as binding cells, comparison targets, or
success criteria**.

Concretely, in this Phase 0 doc:

- Numbers from `kwin_overlay_subsurface` Phase 3 / Phase 3-prime
  (e.g. "C5' had 32 drops total / 22 post-warmup"; "median
  Δ_present 45.85 ms"; "kwin %CPU median 36.90") are quoted
  **only as evidence of past measurements that may or may not
  reproduce**. They establish the SHAPE of what's measurable
  and the fact that measurement infrastructure works on this
  hardware. They do **not** establish "this is the baseline
  the X11 cells will be compared against."
- The Phase 0 A1 baseline rep (worklist.md) is the
  in-session anchor for any with-KWin Wayland measurement
  this campaign references.
- No `metrics.csv` row in this campaign is populated from
  predecessor data, even if the predecessor measured an
  identical condition.
- If an X11 cell appears to "match the cage Phase 0 number,"
  that's incidental — the cage Phase 0 number is not the test
  this campaign is running. The test is "X11 vs in-session
  Wayland baseline."
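
One way to make the `metrics.csv` rule above mechanical is a provenance check at write time. The sketch below is illustrative only: the column names (`cell`, `campaign`, `session_date`, `drops_post_warmup`) and the campaign tag are assumptions, not an existing schema in this repository.

```python
import csv
import io

# Hypothetical identifiers: neither the column names nor the
# campaign tag are taken from an existing metrics.csv schema.
THIS_CAMPAIGN = "x11-session-research"
FIELDS = ["cell", "campaign", "session_date", "drops_post_warmup"]


def write_rows(rows, out):
    """Write rows to `out`, refusing any row not measured in this campaign."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        if row["campaign"] != THIS_CAMPAIGN:
            raise ValueError(
                f"row for cell {row['cell']!r} comes from "
                f"{row['campaign']!r}; predecessor data is context only"
            )
        writer.writerow(row)
```

A row tagged with the predecessor's name raises before it reaches the file, so an "identical condition" cannot be back-filled by accident.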

The discipline lesson is concrete: the predecessor's Phase 1
binding cell `drops_post_warmup == 0` was anchored to a single
ohm_gl_fix Phase 0 cage measurement (7 drops total / 0
post-warmup). Three weeks of phase planning ran on the
assumption that floor existed. At N=3 in-session replication
on the closing day of the predecessor, that floor was missing
(cage today: 22 / 26 / 56 post-warmup). The campaign closed
without patch. **A 30-minute N=3 same-session baseline check
on day 1 would have made the campaign different — or made it
honestly close earlier.** This campaign acts on that lesson.
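
The day-1 check described above amounts to very little code. A minimal sketch, assuming the per-rep post-warmup drop counts are already extracted; the function name and the `min_reps` parameter are illustrative:

```python
def floor_reproduces(post_warmup_drops, floor, min_reps=3):
    """Return True only if every in-session rep meets the floor."""
    if len(post_warmup_drops) < min_reps:
        raise ValueError(
            f"need at least {min_reps} reps, got {len(post_warmup_drops)}"
        )
    return all(d <= floor for d in post_warmup_drops)


# Values quoted above as historical context, not as this campaign's data.
print(floor_reproduces([22, 26, 56], floor=0))  # prints False
```

Requiring `min_reps` up front is the point of the design: an N=1 input is an error, not a result.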

## Predecessor close-out summary (context, not data)

[`../kwin_overlay_subsurface/phase8_handover.md`](../kwin_overlay_subsurface/phase8_handover.md)
(closed 2026-05-03 without patch). Three independent reasons
no patch landed (numbers below are quoted as historical
context per the governing rule above):

1. The predecessor's locked Phase 1 reference floor
   (`drops_post_warmup == 0` from cage) was unreachable in the
   predecessor's closing N=3 measurement session — same
   chromium-fourier binary, same hardware, same kernel, same
   Mesa, same kwin-fourier as the original Phase 0 measurement.
   KWin direct's number reproduced; cage's 0-floor did not.
   Numbers quoted in the predecessor's `phase3_prime_findings.md`;
   not used here as a baseline for any cell.
2. The campaign's surface-of-investigation
   (`wp_subsurface` overlay route) is not engaged by
   `brave_drops_test.html`. Chromium-fourier renders the video
   element via internal compositing into its main browser
   window surface — a single-surface case.
3. The Phase 2 hot-path hypothesis
   (`glEGLImageTargetTexture2DOES` dominates `kwin_wayland`'s
   per-frame cost) was rejected by Phase 3 perf measurement:
   the measured value fell on the wrong side of the threshold
   by a 100× margin.

The diagnostic loop terminated at "the campaign's premise was
N=1 to begin with, and the N=3 in-session re-measurement
doesn't replicate it." This is filed as a feedback memory:
*replicate the N=1 baseline at N=3 in the same session BEFORE
building multi-phase infrastructure around it*.

## What stays valid from the predecessor

Durable substrate listed in
`kwin_overlay_subsurface/phase8_handover.md` § "What's left
for a future session to pick up":

- **Phase 1 scanout-promotion archaeology** (rockchip-drm
  RK3568 plane format/modifier table, KWin v6.6.4 promotion
  predicate). Plane 39 (Primary, NV12 LINEAR) is the GL
  framebuffer; Plane 45 (Overlay) does not advertise NV12 in
  any modifier. Both KWin scanout-promotion paths are
  structurally rejected for windowed Brave on this DRM driver.
  This holds regardless of display server.
- **Phase 2 H1 file:line** in
  `kwin_overlay_subsurface/phase2_source_findings.md`. Cold
  per Phase 3 measurement; informational only.
- **Phase 2-prime Shape C source-read** of
  `Display::dispatchEvents` and `TransactionFence` in KWin's
  `src/wayland/`. Specific to the Wayland path; **not relevant
  to an X11-session campaign**. The X11 path uses different
  KWin surface plumbing (`kwin_x11`) and a different per-frame
  protocol (X11 Composite extension + Damage + XPresent), not
  Wayland protocol dispatch.
- **Δ_present-46 ms reproducible side-finding** under Plasma
  Wayland. Across all measured conditions (chromium-fourier on
  KWin, chromium-fourier in cage, stock Brave on KWin), median
  Δ_present was 41-46 ms on a 60 Hz panel (16.7 ms per
  refresh), i.e. a stable queue depth of roughly 2.5 to 2.8
  vsyncs. This finding is independent of the
  cage breakdown and **directly testable under X11** as a
  comparison point.
- **Measurement infrastructure**:
  `kwin_overlay_subsurface/scripts/wayland_debug_to_csv.py`
  (libwayland 1.21+ format, 17 unit tests passing) +
  `phase3_prime_runs/run_browser.sh` orchestrator on ohm
  (handles `WAYLAND_DEBUG=1` capture, perf record, top
  sampling, drops trajectory extraction, kill-cleanly). **The
  WAYLAND_DEBUG portion does not apply under X11**; an X11
  equivalent would be different tooling (`xtrace`, `xev`, or
  XCB-debug instrumentation if the client emits any). The
  perf+top+drops capture portion remains usable under X11
  unchanged.
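
The vsync-depth figure in the Δ_present side-finding above is plain arithmetic on the refresh period. A small sketch of the conversion, using only the historical values quoted in this document (context, not campaign data); the function name is illustrative:

```python
def vsync_depth(delta_present_ms, refresh_hz=60.0):
    """Express a commit-to-present latency in refresh periods."""
    frame_ms = 1000.0 / refresh_hz
    return delta_present_ms / frame_ms


# Predecessor medians quoted above, converted for a 60 Hz panel.
for ms in (41.0, 45.85, 46.0):
    print(f"{ms:5.2f} ms -> {vsync_depth(ms):.2f} vsyncs")
```

The same helper applies unchanged to whatever Δ_present values the X11 cells produce, as long as the panel refresh rate is passed in rather than assumed.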

## Current ohm state (carry-over from predecessor)

Per `kwin_overlay_subsurface/phase1_evidence/ohm_tooling_revert_log.md`,
not reverted at predecessor close-out:

- `qt6-base-fourier 1:6.11.0-3`
- `kwin-fourier 1:6.6.4-3` (Wayland-side compositor; not in
  the hot path under an X11 session)
- `mesa 1:26.0.5-1`
- CPU governor pinned to `performance`
- Baloo permanently disabled
- `drm-info 2.9.0-1`
- Active session: `startplasma-wayland` on tty2,
  `kwin_wayland` PID 3927 (as of 2026-05-03 03:05 UTC).
- Browser binaries available: `/tmp/chromium-ohm-gl-fix-step2/chrome`
  (chromium-fourier, Step 1 + Step 2 patches, 149.0.7812.0),
  `/usr/bin/brave` (`brave-bin 1:1.89.145-1`).

If this campaign needs to switch ohm to an X11 session, that
is a session-level operator action (logout, switch via SDDM,
log back in). It cannot be done unattended.

## Research question (LOCKED 2026-05-03)

> *"Does cutting out the KWin compositor enable faster video
> display of Brave, chromium-fourier, and Firefox — for full
> SW decoding, and for libva decoding (where possible) — on
> PineTab2 RK3568?"*

### Mechanism the question targets

Operator-supplied context 2026-05-03:

> *"hantro emits NV12 which the GPU can't put on a
> compositeable plane. So that is the real bottleneck of
> Wayland."*

This connects directly to the predecessor's Phase 1 finding
(`kwin_overlay_subsurface/phase2_source_findings.md`:170-229):

- Hantro VPU decodes H.264 video into NV12 dmabufs (`DRM_FORMAT_NV12`,
  `DRM_FORMAT_MOD_LINEAR`).
- rockchip-drm's only NV12-LINEAR-capable plane is the
  Primary plane (Plane 39 on CRTC 52), which the running KWin
  uses for its GL framebuffer.
- The overlay plane (Plane 45) advertises no NV12 in any
  modifier in `IN_FORMATS`.
- Therefore **no rockchip-drm scanout plane can accept the
  NV12 buffer hantro produces while KWin owns the primary
  plane.** Some compositing step must convert NV12 → RGB
  before display.
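
The format/modifier argument above is a set-intersection check. The sketch below models it with a hand-written plane table: only the NV12 facts and the RGB/AFBC capability of Plane 45 come from this document; the `XRGB8888` entries are placeholders, and the table layout is an assumption rather than the output format of any tool.

```python
# Illustrative plane table. Only the NV12 facts and Plane 45's
# RGB/AFBC capability come from this document; the XRGB8888
# entries are placeholders.
PLANES = {
    39: {"type": "Primary",
         "formats": {("NV12", "LINEAR"), ("XRGB8888", "LINEAR")}},
    45: {"type": "Overlay",
         "formats": {("XRGB8888", "LINEAR"), ("XRGB8888", "AFBC")}},
}


def planes_accepting(fmt, modifier, planes=PLANES, reserved=()):
    """Return plane IDs that can scan out (fmt, modifier) and are not reserved."""
    return sorted(
        plane_id
        for plane_id, plane in planes.items()
        if plane_id not in reserved and (fmt, modifier) in plane["formats"]
    )


print(planes_accepting("NV12", "LINEAR"))                 # [39]
print(planes_accepting("NV12", "LINEAR", reserved={39}))  # []
```

The `reserved` argument stands in for "KWin owns the primary plane": once Plane 39 is taken, the candidate list is empty and a conversion step becomes mandatory.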

The predecessor named the *constraint* (Path B rejected at
the format/modifier intersection) but the *consequence* —
"some component must GL-composite NV12 → RGB on the GPU
because nothing else on this hardware can put NV12 on a
scanout plane" — was not made explicit. That consequence is
this campaign's motivating insight:

- **Under Plasma Wayland:** when the browser engages the
  Wayland subsurface route (chromium's
  `WaylandBufferManagerHost::CommitOverlays`), KWin receives
  an NV12 dmabuf and must GL-composite it. **KWin's compositor
  is the GL-composite step.** When the browser does NOT
  engage the subsurface route (the predecessor's measured
  case on `brave_drops_test.html` — zero `wp_subsurface` in
  the trace), the browser itself converts NV12 → RGB in its
  own GL context and hands KWin only RGB; KWin then composites
  the RGB to its primary plane.
- **Under X11 without a compositor:** there is no separate
  compositor process. Two paths are open to the client:
  - *RGB-composite path* (browser converts NV12 → RGB in its
    own GL context and presents the RGB result via XPresent/DRI3
    to the X server, which schedules a page-flip on the same
    Primary plane KWin would have used). One fewer hand-off
    than the Wayland-with-subsurface case but the same GL-
    composite cost as the no-subsurface Wayland case.
  - *Hardware-overlay path* (operator-supplied context
    2026-05-03: *"a X11 pipeline would route around that by
    giving a portion of screen real estate directly to the
    video pipeline"*). The X server allocates the Primary
    plane (Plane 39, supports NV12 LINEAR) to the video
    region and the Overlay plane (Plane 45, supports
    RGB/AFBC) to the rest of the desktop. Hardware-blended
    at scanout time. **No GL-composite of NV12 anywhere —
    the cost the operator named as "the real bottleneck"
    is structurally avoided.**

This second X11 path is what Wayland compositors as
designed today cannot do on rockchip-drm-class hardware: KWin
Wayland *must* own the Primary plane for its compositor
framebuffer (because the Wayland model is "compositor presents
one merged surface per output"), so it cannot give Plane 39
to a video-region NV12 buffer while putting the rest of the
desktop on Plane 45. X11 + non-compositing WM has no such
constraint — different windows can be assigned to different
planes by the X server's plane allocator.

This is the X11 hardware-overlay mechanism that historically
made X11 desktops good at video playback (Xv from the late
1990s, and the modern equivalents via DRI3 + XPresent +
Composite-redirection-disabled). It is structurally absent
in Wayland-with-monolithic-compositor designs.

### Hypothesis the matrix tests

There are three potentially separable costs:

1. **The mandatory NV12 → RGB GL conversion**, which is
   *forced* on Wayland-with-KWin because KWin must own the
   only NV12-LINEAR-capable plane on this hardware for its
   compositor framebuffer. **This cost is structurally
   avoidable** under X11 + non-compositing WM via
   hardware-plane-overlay (per the operator-supplied insight
   above). Whether browsers can be coaxed to *use* the X11
   hardware-overlay path — rather than internally compositing
   to RGB before presenting — is browser-specific (see Open
   questions below).
2. **The fallback GL-composite cost** when the
   hardware-overlay path doesn't engage. Both Wayland and X11
   pay this when the buffer shape doesn't match a plane —
   it just runs in different processes (KWin under Wayland,
   browser under X11).
3. **The per-frame compositor overhead** independent of NV12:
   dmabuf import, transaction apply, presentation-feedback
   wiring, frame-callback delivery — which the predecessor
   measured at ~30-37 % of `kwin_wayland`'s CPU during
   steady-state video playback even when KWin only saw RGB
   surfaces.

The X11 hypothesis is strongest if cost (1) is dominant on
the matrix's with-KWin cells AND the X11 cells trigger the
hardware-overlay path. The X11 hypothesis is weakest if
cost (1) is small and cost (3) is small — in which case the
"cutting out KWin" experiment would show only marginal
differences.

The matrix below is designed to surface which of (1) (2) (3)
dominates per browser × decode path.

"Faster video display" is operationally **a combination of**:

- **Effective fps actually rendered** (= `(totalVideoFrames - droppedVideoFrames) / elapsed_s`
  from `getVideoPlaybackQuality()`, since `totalVideoFrames`
  includes dropped frames; for a 30 fps source the upper bound
  is 30, and the question is how close).
- **Drop count** over the same 70 s window (`droppedVideoFrames`).
- **End-to-end latency** if testable (commit → present;
  testable on Wayland via `wp_presentation_feedback`,
  testable on X11 via `XPresent` extension or `RandR` vblank
  events; protocol-side measurement under each
  display-server).
- **Compositor + browser CPU at steady state** (the cost
  saved by cutting the compositor is the upper bound on the
  patch-payoff if a future campaign tries to optimise the
  compositor instead of removing it).
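
The first two metrics above can be derived from two `getVideoPlaybackQuality()` samples taken at the start and end of the window. A minimal sketch, assuming the samples have already been exported from the page as plain numbers; the `QualitySample` type and field names are illustrative, and the subtraction of dropped frames assumes `totalVideoFrames` counts dropped frames too:

```python
from dataclasses import dataclass


@dataclass
class QualitySample:
    t_s: float            # wall-clock time of the sample, in seconds
    total_frames: int     # totalVideoFrames at that time
    dropped_frames: int   # droppedVideoFrames at that time


def window_metrics(start, end, source_fps=30.0):
    """Effective fps and drop count between two samples."""
    elapsed = end.t_s - start.t_s
    if elapsed <= 0:
        raise ValueError("end sample must be later than start sample")
    total = end.total_frames - start.total_frames
    dropped = end.dropped_frames - start.dropped_frames
    rendered = total - dropped  # assumes total includes dropped frames
    return {
        "effective_fps": rendered / elapsed,
        "drops": dropped,
        "fraction_of_source": rendered / (source_fps * elapsed),
    }
```

Working on deltas between two samples, rather than on absolute counters, keeps warmup frames out of the window without needing to reload the page.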

### Experimental matrix

Six browser × decode rows (3 browsers × 2 decode paths), each
measured under 2 session conditions (with-KWin / without-KWin):

| Browser | Decode | with-KWin (Plasma Wayland) | without-KWin (X11 session, no compositor) |
|---|---|---|---|
| Brave 147 | full SW | C-W-brave-sw | C-X-brave-sw |
| Brave 147 | libva (if it works) | C-W-brave-libva | C-X-brave-libva |
| chromium-fourier 149 (Step 1 + Step 2) | full SW | C-W-chrf-sw | C-X-chrf-sw |
| chromium-fourier 149 | libva (Step 1 enables it) | C-W-chrf-libva | C-X-chrf-libva |
| Firefox | full SW | C-W-ff-sw | C-X-ff-sw |
| Firefox | libva | C-W-ff-libva | C-X-ff-libva |

The "(if it works)" / "where possible" qualifiers follow the
operator's directive. On rockchip-drm RK3568, libva is so far
known to work only on chromium-fourier (Step 1 ports
`libva-v4l2-request`); for stock Brave 147 and stock Firefox,
libva probably doesn't engage, and any such cell is documented
as N/A. For Firefox, a system-level `libva-v4l2-request` driver
might still make libva work through Mozilla's VAAPI backend;
this is to be verified in Phase 0 inventory.
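
The cell IDs in the table follow a regular `C-<session>-<browser>-<decode>` pattern, so they can be generated rather than typed. A short sketch; the dictionary names are illustrative, the ID strings match the table:

```python
from itertools import product

BROWSERS = {
    "brave": "Brave 147",
    "chrf": "chromium-fourier 149",
    "ff": "Firefox",
}
DECODES = ("sw", "libva")
SESSIONS = {
    "W": "with-KWin (Plasma Wayland)",
    "X": "without-KWin (X11 session, no compositor)",
}


def cell_ids():
    """All cell IDs, in table order (row by row, Wayland column first)."""
    return [
        f"C-{session}-{browser}-{decode}"
        for browser, decode, session in product(BROWSERS, DECODES, SESSIONS)
    ]


print(len(cell_ids()))  # 12
```

Cells that turn out to be N/A (libva not engaging) stay in the list and get marked N/A in the results, so the matrix shape does not change between runs.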

### What "cutting out the KWin compositor" means

This campaign uses **X11 session with no compositor in the
display path** as the "without-KWin" cell. Specifically:

- Native Xorg server, NOT XWayland (XWayland would still go
  through KWin for display, defeating the purpose).
- Window manager that does NOT composite by default — e.g.
  openbox, fluxbox, xfwm4-with-compositing-off, i3, twm.
  Plasma X11 uses `kwin_x11` as compositing WM, which is
  still a "KWin compositor" — it does not satisfy "cut KWin
  out" and is **excluded** from the without-KWin cell.
- Browser windowed (not fullscreen). Even on a non-compositing
  WM, fullscreen browsers may engage XPresent direct
  presentation paths — testing windowed isolates the
  baseline non-compositor windowed display path.

The exact WM choice is a Phase 0 inventory decision (which
WMs are available on ohm, which install cleanly, which
SDDM-advertised sessions exist). Default candidate: openbox.
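
Before any without-KWin rep, it is worth asserting that the client is not silently running on XWayland. The sketch below classifies a session from its environment variables only; the function name and return labels are illustrative. It does not detect a compositing manager: that would need a separate check of the `_NET_WM_CM_S<screen>` selection owner on the X server, which is not implemented here.

```python
def classify_session(env):
    """Classify a session from environment variables alone.

    Returns "wayland", "x11-native-candidate", or "unknown".
    """
    session_type = env.get("XDG_SESSION_TYPE", "")
    if env.get("WAYLAND_DISPLAY") or session_type == "wayland":
        # An X11 client started here would be talking to XWayland.
        return "wayland"
    if session_type == "x11" and env.get("DISPLAY"):
        # Native Xorg is plausible; a compositor may still be running.
        return "x11-native-candidate"
    return "unknown"
```

Calling it as `classify_session(os.environ)` from the run orchestrator, and refusing to start a `C-X-*` rep on anything other than `"x11-native-candidate"`, keeps an XWayland run from being filed under the wrong column.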

### Three plausible outcome shapes

- **(α)** Without-KWin is materially faster in all 6
  browser × decode rows: confirms the KWin compositor cost is a real
  bottleneck on this hardware, and X11-session-without-
  compositor becomes the recommended daily-driver
  configuration for video work on PineTab2.
- **(β)** Without-KWin is comparable or only marginally
  faster: the compositor isn't the bottleneck; the drop
  phenomenon is hardware/kernel/Mesa-bound, and the
  predecessor's Phase 8 closure stands.
- **(γ)** Mixed picture per browser × decode path: e.g.
  libva paths benefit but SW paths don't; or Firefox benefits
  but chromium-class clients don't. Each cell becomes its own
  characterisation.
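
The three shapes above can be told apart mechanically once each row has a without-KWin / with-KWin ratio. A sketch under stated assumptions: the ratio is of effective fps, N/A rows are passed as `None`, and the `material` threshold is a hypothetical placeholder. Per the governing rule, the real threshold would have to be locked in Phase 1 from in-session data.

```python
def outcome_shape(speedups, material=1.10):
    """Map per-row speedup ratios to "alpha", "beta", or "gamma".

    speedups: row name -> without-KWin / with-KWin ratio, or None for N/A.
    material: hypothetical threshold for "materially faster".
    """
    measured = {row: s for row, s in speedups.items() if s is not None}
    if not measured:
        raise ValueError("no measured rows")
    faster = [row for row, s in measured.items() if s >= material]
    if len(faster) == len(measured):
        return "alpha"
    if not faster:
        return "beta"
    return "gamma"
```

For a (γ) result the interesting output is the list of rows that crossed the threshold, not the label; the label only says that per-row characterisation is needed.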

### Open questions before Phase 1 lock

The hardware-overlay-path mechanism is structurally available
on X11 + non-compositing WM. Whether it actually engages for
each of the three browsers is browser-specific and currently
unknown:

- **Brave / Chromium ozone-x11**: Chromium has overlay-support
  code (`OverlayProcessor`, `GpuMemoryBufferManager`,
  `DCOMPSurface` on Windows; on Linux X11 the path is via
  XPresent + DMA-BUF + `OverlayCandidate`). Whether Brave
  147 / chromium-fourier 149 actually request hardware-overlay
  presentation for a windowed video element under X11 is open.
- **Firefox**: VAAPIVideoDecoder backend produces hardware
  decoded NV12 dmabufs that the GL compositor consumes
  internally. Whether Firefox's X11 backend has a path to
  hand the dmabuf to the X server for hardware-overlay
  presentation (rather than internally composing to RGB) is
  open. Mozilla has a `MOZ_X11_EGL` hint and a "hardware video
  overlay" pref but these are not universally engaged.
- **Reference clients**: mpv with `--vo=xv` and `gst-play-1.0`
  with `xvimagesink` use the Xv path and are the candidates
  for X11 hardware-overlay presentation; mpv with
  `--vo=gpu --hwdec=auto-copy --gpu-context=x11` and
  `gst-play-1.0` with `glimagesink` render through GL and
  serve as GL-composite references. **Adding mpv to the matrix as a
  reference client** would isolate "does the X11
  hardware-overlay path work AT ALL on this hardware" from "do
  browsers actually use it." If mpv hardware-overlays cleanly
  but browsers don't, the conclusion is "the X11 path is fast,
  but browsers leave the speedup on the table."

If the operator agrees, Phase 0 inventory should:

1. Verify Plane 39's NV12-LINEAR availability is reachable to
   X11 clients (it is for KWin Wayland; should be for X11 too
   since Plane 39 is just a DRM resource), and identify which
   X11 path actually programs it (modesetting Xorg driver +
   `Option "PageFlip" "true"`, or DRI3-presented buffer ending
   up on Plane 39 via the X server's plane allocator).
2. Inventory Brave's, chromium-fourier's, and Firefox's X11
   overlay-presentation paths to see which (if any) request
   hardware-overlay presentation.
3. Add mpv as a reference X11-overlay client to the matrix,
   so the campaign has a known-good comparison point.

### What this question does NOT cover

For clarity, since the predecessor was specifically about
the Wayland-overlay-subsurface composite path:

- This campaign is **not** investigating the `wp_subsurface`
  route. The Wayland cells of the matrix (with-KWin) measure
  whatever each browser produces under the existing
  Plasma Wayland session: windowed, default profile, default
  flags. It's a measurement of the as-shipped Plasma Wayland
  stack from the user's perspective, not a probe of a
  specific KWin code path.
- The Δ_present-46 ms finding from the predecessor is
  testable as a free side-finding under both axes (Wayland
  and X11) but is not the campaign's primary question.
- Daily-driver fitness (apps that break under X11, touchscreen
  behavior, multi-monitor edge cases, etc.) is **not in
  scope**. The campaign's deliverable is the matrix above; if
  any cell is decisively faster, daily-driver-fitness becomes
  a follow-up campaign.

## What's NOT in scope (working assumption)

With the research question locked (see above), the following
are treated as out of scope so they don't slip into Phase 1
prematurely:

- Patches to KWin, Xorg, kwin-fourier, qt6-base-fourier, or any
  other component on ohm. This is **research**, not
  patch-development. Per the non-upstreaming default, MR/bug-report
  filing happens only when explicitly tasked and is not scheduled here.
- The Δ_present-46 ms finding's investigation. It's a known
  hook from the predecessor; whether this campaign chases it
  depends on the locked research question.
- Reverting predecessor tooling state. Governor, baloo,
  `qt6-base-fourier`, `kwin-fourier` stay as-is unless the
  operator decides otherwise.
- Filing a bug for any of the predecessor's three documented
  candidate findings. Same non-upstreaming default applies.

## What Phase 0 will deliver, regardless of framing

The following Phase 0 deliverables are useful regardless of
how the research question is framed:

1. **State snapshot of ohm under current Plasma Wayland**
   captured at campaign start. This is the *before* photo for
   any future X11 vs Wayland comparison. Unattended-tractable
   (just scripted SSH).
2. **Inventory of available X11 paths on ohm**: what packages
   are installed, what session candidates SDDM advertises,
   what would need to be installed to enable a Plasma X11
   session, what alternate WMs are available. Read-only,
   unattended-tractable.
3. **Inventory of measurement instruments that work under
   X11**: `xtrace`, `xprop`, `xrandr --verbose --query`, perf
   on `Xorg` PID, frame-timing extraction options. Read-only.
4. **A1 baseline** under current Plasma Wayland: re-run a
   single rep of the predecessor's `kwin_timing_nodebug`
   condition immediately at the start of this campaign, so
   the comparison Wayland-vs-X11 has a same-session anchor.
   This is the "set the baseline before instrument changes"
   discipline from `feedback_replicate_baseline_first.md`.

These steps are unblocked. They don't commit to a specific
research question and they produce evidence that's useful
under any of the candidate framings.
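
Deliverable 1 (the state snapshot) can be a single scripted pass. The sketch below is a starting point under stated assumptions: the command list is illustrative, `drm_info` is assumed to be the binary shipped by the `drm-info` package, and the script is assumed to run on ohm itself (for example via SSH) rather than on the operator's machine.

```python
import subprocess

# Illustrative command set; adjust to what is actually installed on ohm.
SNAPSHOT_COMMANDS = {
    "kernel": ["uname", "-r"],
    "session_type": ["sh", "-c", "echo $XDG_SESSION_TYPE"],
    "governor": [
        "cat", "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    ],
    "drm_planes": ["drm_info"],
}


def run(cmd):
    """Run one command; return stdout, or an ERROR string if it cannot run."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"ERROR: {exc}"
    return proc.stdout.strip()


def snapshot(commands=SNAPSHOT_COMMANDS, runner=run):
    """Collect every command's output into a name -> text mapping."""
    return {name: runner(cmd) for name, cmd in commands.items()}
```

A missing tool shows up as an `ERROR:` entry instead of aborting the pass, which suits a read-only, unattended inventory. The `runner` parameter exists so the collection logic can be exercised without touching the machine.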