x11-session-research/phase0_findings.md
marfrit ac301b4e48 Phase 0: motivation locked + mechanism captured
Operator-supplied research question 2026-05-03: "Does cutting
out the KWin compositor enable faster video display of Brave,
chromium-fourier, and Firefox — for full SW decoding, and for
libva decoding (where possible) — on PineTab2 RK3568?"

Operator-supplied mechanism 2026-05-03 (two messages):

1. "hantro emits NV12 which the GPU can't put on a
   compositeable plane. So that is the real bottleneck of
   Wayland."

   Connects directly to predecessor's Phase 1 finding
   (kwin_overlay_subsurface/phase2_source_findings.md:170-229):
   rockchip-drm overlay Plane 45 advertises no NV12 modifier;
   Primary Plane 39 supports NV12 LINEAR but is owned by KWin
   for its compositor framebuffer. Predecessor named the
   constraint but not the consequence — the consequence is
   that NV12 → RGB GL-composite is forced on Wayland-with-KWin
   regardless of which protocol path the browser uses.

2. "A X11 pipeline would route around that by giving a portion
   of screen real estate directly to the video pipeline."

   The X11 hardware-overlay path: with X11 + non-compositing
   WM, the X server can allocate Plane 39 (NV12 LINEAR) to
   the video region and Plane 45 (RGB AFBC) to the rest of
   the desktop. Hardware-blended at scanout. NO GL-composite
   anywhere — the cost the operator named as "the real
   bottleneck" is structurally avoided.

   This is the X11 hardware-overlay mechanism that historically
   made X11 desktops good at video playback (Xv → modern
   DRI3 + XPresent + Composite-redirection-disabled).
   Wayland-with-monolithic-compositor designs cannot use this
   freedom: the compositor must own the Primary plane, so the
   plane-allocation freedom required to put NV12 video on
   Plane 39 alongside RGB chrome on Plane 45 isn't available.

phase0_findings.md updated with:
- Locked research question + 12-cell experimental matrix
  (3 browsers × 2 decode paths × 2 sessions; some N/A).
- Three separable cost components the matrix tests for
  (mandatory NV12→RGB GL conversion if hardware-overlay
  doesn't engage, fallback GL-composite, per-frame compositor
  overhead independent of NV12).
- Open questions about whether browsers actually request
  hardware-overlay presentation under X11, or whether they
  always internally composite to RGB.
- Recommendation to add mpv as a reference X11-overlay
  client: distinguishes "X11 path works on this hardware"
  from "browsers actually use the X11 path."

worklist.md updated:
- Phase 0 motivation + matrix items ticked.
- Pre-Phase-1 inventory broken out: state snapshot, X11
  path inventory, browser-overlay-path inventory, mpv
  reference, X11 measurement-tool inventory, A1 Wayland
  baseline anchor.
- Phase 1 sketch: binding cells per matrix cell, clear-pass /
  clear-fail thresholds, measurement protocol mirroring the
  predecessor's phase3_protocol.md structure.

README banner updated to reflect locked motivation +
mechanism summary + matrix shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 06:49:18 +00:00

# Phase 0 — substrate and provisional research question
This is the campaign's Phase 0 substrate doc: what we already
know from the predecessor `kwin_overlay_subsurface` close-out,
what's open, and what the research question looks like. **The
research question was locked on 2026-05-03; the open questions
listed further down still need answers before Phase 1 lock.**
## Predecessor close-out summary
[`../kwin_overlay_subsurface/phase8_handover.md`](../kwin_overlay_subsurface/phase8_handover.md)
(closed 2026-05-03 without patch). Three independent reasons
no patch landed:
1. The campaign's locked Phase 1 reference floor
(`drops_post_warmup == 0` from cage) is unreachable at N=3
today. With the same chromium-fourier binary, hardware,
kernel, Mesa, and kwin-fourier, today's median is 26
post-warmup drops. KWin direct reproduces Phase 0's 29
post-warmup drops, but cage now also drops ~22-56 frames
post-warmup instead of Phase 0's 0.
2. The campaign's surface-of-investigation
(`wp_subsurface` overlay route) is not engaged by
`brave_drops_test.html`. Chromium-fourier renders the video
element via internal compositing into its main browser
window surface — a single-surface case.
3. The Phase 2 hot-path hypothesis
(`glEGLImageTargetTexture2DOES` dominates `kwin_wayland`'s
per-frame cost) was rejected by Phase 3 perf measurement,
which landed on the wrong side of the threshold by a 100×
margin.

The diagnostic loop terminated at "the campaign's premise was
N=1 to begin with, and the N=3 in-session re-measurement
doesn't replicate it." This is filed as a feedback memory:
*replicate the N=1 baseline at N=3 in the same session BEFORE
building multi-phase infrastructure around it*.
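
The replication rule above can be sketched as a small gate. This is a minimal illustration, not the campaign's tooling: the function name `baseline_replicates`, the `tolerance` parameter, and the three sample drop counts are assumptions chosen for illustration (the samples merely fall inside the ~22-56 range quoted above).

```python
from statistics import median

def baseline_replicates(reference_drops, replicate_drops, tolerance=0):
    """Return True if the median of the replicate runs is within
    `tolerance` drops of the N=1 reference value."""
    if len(replicate_drops) < 3:
        raise ValueError("need at least N=3 replicate runs")
    return abs(median(replicate_drops) - reference_drops) <= tolerance

# Illustrative only: the Phase 0 cage reference was 0 post-warmup drops.
# The three values below are hypothetical samples, not measured data.
cage_reference = 0
cage_replicates = [22, 26, 56]
print(baseline_replicates(cage_reference, cage_replicates))  # False
```

Running a gate like this before any further phase is planned would flag a non-replicating premise after three runs instead of after several phases.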
## What stays valid from the predecessor
Durable substrate listed in
`kwin_overlay_subsurface/phase8_handover.md` § "What's left
for a future session to pick up":
- **Phase 1 scanout-promotion archaeology** (rockchip-drm
RK3568 plane format/modifier table, KWin v6.6.4 promotion
predicate). Plane 39 (Primary, supports NV12 LINEAR) carries
KWin's GL framebuffer; Plane 45 (Overlay) does not advertise
NV12 under any modifier. Both KWin scanout-promotion paths
are structurally rejected for windowed Brave on this DRM
driver. The plane format/modifier table is a property of the
DRM driver, so it holds regardless of display server.
- **Phase 2 H1 file:line** in
`kwin_overlay_subsurface/phase2_source_findings.md`. Cold
per Phase 3 measurement; informational only.
- **Phase 2-prime Shape C source-read** of
`Display::dispatchEvents` and `TransactionFence` in KWin's
`src/wayland/`. Specific to the Wayland path; **not relevant
to an X11-session campaign**. The X11 path uses different
KWin surface plumbing (`kwin_x11`) and a different per-frame
protocol (X11 Composite extension + Damage + XPresent), not
Wayland protocol dispatch.
- **Δ_present-46 ms reproducible side-finding** under Plasma
Wayland. Across all measured conditions (chromium-fourier on
KWin, chromium-fourier in cage, stock Brave on KWin), median
Δ_present was 41-46 ms on a 60 Hz panel — a stable
~2.7-vsync queue depth. This finding is independent of the
cage breakdown and **directly testable under X11** as a
comparison point.
- **Measurement infrastructure**:
`kwin_overlay_subsurface/scripts/wayland_debug_to_csv.py`
(libwayland 1.21+ format, 17 unit tests passing) +
`phase3_prime_runs/run_browser.sh` orchestrator on ohm
(handles `WAYLAND_DEBUG=1` capture, perf record, top
sampling, drops trajectory extraction, kill-cleanly). **The
WAYLAND_DEBUG portion does not apply under X11**; an X11
equivalent would be different tooling (`xtrace`, `xev`, or
XCB-debug instrumentation if the client emits any). The
perf+top+drops capture portion remains usable under X11
unchanged.
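
The queue-depth figure in the Δ_present bullet above is a unit conversion from milliseconds to refresh intervals. A minimal sketch, assuming an exact 60 Hz refresh (the function name is illustrative):

```python
def vsync_depth(delta_present_ms, refresh_hz=60.0):
    """Convert a commit-to-present latency in ms into refresh intervals."""
    frame_period_ms = 1000.0 / refresh_hz
    return delta_present_ms / frame_period_ms

for delta_ms in (41, 46):
    print(f"{delta_ms} ms = {vsync_depth(delta_ms):.2f} vsync intervals")
```

The 41-46 ms range therefore spans roughly 2.5 to 2.8 refresh intervals, with the ~2.7 figure corresponding to the upper end.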
## Current ohm state (carry-over from predecessor)
Per `kwin_overlay_subsurface/phase1_evidence/ohm_tooling_revert_log.md`,
not reverted at predecessor close-out:
- `qt6-base-fourier 1:6.11.0-3`
- `kwin-fourier 1:6.6.4-3` (Wayland-side compositor; not in
the hot path under an X11 session)
- `mesa 1:26.0.5-1`
- CPU governor pinned to `performance`
- Baloo permanently disabled
- `drm-info 2.9.0-1`
- Active session: `startplasma-wayland` on tty2,
`kwin_wayland` PID 3927 (as of 2026-05-03 03:05 UTC).
- Browser binaries available: `/tmp/chromium-ohm-gl-fix-step2/chrome`
(chromium-fourier, Step 1 + Step 2 patches, 149.0.7812.0),
`/usr/bin/brave` (`brave-bin 1:1.89.145-1`).

If this campaign needs to switch ohm to an X11 session, that
is a session-level operator action (log out, switch via SDDM,
log back in). It cannot be done unattended.
## Research question (LOCKED 2026-05-03)
> *"Does cutting out the KWin compositor enable faster video
> display of Brave, chromium-fourier, and Firefox — for full
> SW decoding, and for libva decoding (where possible) — on
> PineTab2 RK3568?"*
### Mechanism the question targets
Operator-supplied context 2026-05-03:
> *"hantro emits NV12 which the GPU can't put on a
> compositeable plane. So that is the real bottleneck of
> Wayland."*

This connects directly to the predecessor's Phase 1 finding
(`kwin_overlay_subsurface/phase2_source_findings.md`:170-229):
- Hantro VPU decodes H.264 video into NV12 dmabufs (`DRM_FORMAT_NV12`,
`DRM_FORMAT_MOD_LINEAR`).
- rockchip-drm's only NV12-LINEAR-capable plane is the
Primary plane (Plane 39 on CRTC 52), which the running KWin
uses for its GL framebuffer.
- The overlay plane (Plane 45) advertises no NV12 in any
modifier in `IN_FORMATS`.
- Therefore **no rockchip-drm scanout plane can accept the
NV12 buffer hantro produces while KWin owns the primary
plane.** Some compositing step must convert NV12 → RGB
before display.

The predecessor named the *constraint* (Path B rejected at
the format/modifier intersection), but the *consequence*
("some component must GL-composite NV12 → RGB on the GPU
because nothing else on this hardware can put NV12 on a
scanout plane") was not made explicit. That consequence is
this campaign's motivating insight:
- **Under Plasma Wayland:** when the browser engages the
Wayland subsurface route (chromium's
`WaylandBufferManagerHost::CommitOverlays`), KWin receives
an NV12 dmabuf and must GL-composite it. **KWin's compositor
is the GL-composite step.** When the browser does NOT
engage the subsurface route (the predecessor's measured
case on `brave_drops_test.html` — zero `wp_subsurface` in
the trace), the browser itself converts NV12 → RGB in its
own GL context and hands KWin only RGB; KWin then composites
the RGB to its primary plane.
- **Under X11 without a compositor:** there is no separate
compositor process. Two paths are open to the client:
  - **RGB-composite path**: the browser converts NV12 → RGB in
    its own GL context and presents the RGB result via
    XPresent/DRI3 to the X server, which schedules a page-flip
    on the same Primary plane KWin would have used. This is one
    fewer hand-off than the Wayland-with-subsurface case, but
    the same GL-composite cost as the no-subsurface Wayland
    case.
  - **Hardware-overlay path** (operator-supplied context
    2026-05-03: *"a X11 pipeline would route around that by
    giving a portion of screen real estate directly to the
    video pipeline"*): the X server allocates the Primary
    plane (Plane 39, supports NV12 LINEAR) to the video
    region and the Overlay plane (Plane 45, supports
    RGB/AFBC) to the rest of the desktop, and the two are
    hardware-blended at scanout time. **No GL-composite of
    NV12 happens anywhere, so the cost the operator named as
    "the real bottleneck" is structurally avoided.**

This second X11 path is what Wayland compositors, as
designed today, cannot do on rockchip-drm-class hardware: KWin
Wayland *must* own the Primary plane for its compositor
framebuffer (because the Wayland model is "compositor presents
one merged surface per output"), so it cannot give Plane 39
to a video-region NV12 buffer while putting the rest of the
desktop on Plane 45. X11 + non-compositing WM has no such
constraint: in principle, different windows can be assigned
to different planes by the X server's plane allocator.

This is the X11 hardware-overlay mechanism that historically
made X11 desktops good at video playback (Xv from the late
1990s, and the modern equivalents via DRI3 + XPresent +
Composite-redirection-disabled). It is structurally absent
in Wayland-with-monolithic-compositor designs.
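
The plane-availability argument in this section can be restated as a toy lookup. This is a sketch under stated assumptions, not a dump of the real `IN_FORMATS` tables: the only facts taken from the text are that Plane 39 accepts NV12 LINEAR and that Plane 45 advertises no NV12 but accepts RGB/AFBC. The `XRGB8888` format name and the `planes_accepting` helper are illustrative.

```python
# Toy model of the rockchip-drm plane table described above.
# "XRGB8888" is an assumed stand-in for "some RGB format".
PLANES = {
    39: {"type": "Primary", "formats": {("NV12", "LINEAR"), ("XRGB8888", "LINEAR")}},
    45: {"type": "Overlay", "formats": {("XRGB8888", "AFBC")}},
}

def planes_accepting(fmt, modifier, reserved=()):
    """Plane IDs that can scan out (fmt, modifier) and are not reserved."""
    return sorted(
        plane_id
        for plane_id, plane in PLANES.items()
        if (fmt, modifier) in plane["formats"] and plane_id not in reserved
    )

# Wayland with KWin: the compositor framebuffer occupies Plane 39.
print(planes_accepting("NV12", "LINEAR", reserved={39}))    # []
# X11 without a compositor: Plane 39 is free for the video region.
print(planes_accepting("NV12", "LINEAR"))                   # [39]
# The desktop's RGB buffer could then go to Plane 45.
print(planes_accepting("XRGB8888", "AFBC", reserved={39}))  # [45]
```

The empty result in the first case is the constraint the predecessor named; the second and third cases are the allocation the hardware-overlay path would need.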
### Hypothesis the matrix tests
There are three potentially separable costs:
1. **The mandatory NV12 → RGB GL conversion**, which is
*forced* on Wayland-with-KWin because KWin must own the
only NV12-LINEAR-capable plane on this hardware for its
compositor framebuffer. **This cost is structurally
avoidable** under X11 + non-compositing WM via
hardware-plane-overlay (per the operator-supplied insight
above). Whether browsers can be coaxed to *use* the X11
hardware-overlay path — rather than internally compositing
to RGB before presenting — is browser-specific (see Open
questions below).
2. **The fallback GL-composite cost** when the
hardware-overlay path doesn't engage. Both Wayland and X11
pay this when the buffer shape doesn't match a plane —
it just runs in different processes (KWin under Wayland,
browser under X11).
3. **The per-frame compositor overhead** independent of NV12:
dmabuf import, transaction apply, presentation-feedback
wiring, frame-callback delivery. The predecessor
measured this at ~30-37 % of `kwin_wayland`'s CPU during
steady-state video playback even when KWin only saw RGB
surfaces.

The X11 hypothesis is strongest if cost (1) is dominant on
the matrix's with-KWin cells AND the X11 cells trigger the
hardware-overlay path. The X11 hypothesis is weakest if
costs (1) and (3) are both small, in which case the
"cutting out KWin" experiment would show only marginal
differences.

The matrix below is designed to surface which of (1), (2),
or (3) dominates per browser × decode path.

"Faster video display" is operationally **a combination of**:
- **Effective fps actually rendered**
(= `(totalVideoFrames - droppedVideoFrames) / elapsed_s`, both
counters from `getVideoPlaybackQuality()`; `totalVideoFrames`
alone counts dropped frames as well as displayed ones. For a
30 fps source the upper bound is 30; the question is how
close).
- **Drop count** over the 70 s measurement window
(`droppedVideoFrames`).
- **End-to-end latency** if testable (commit → present;
testable on Wayland via `wp_presentation_feedback`,
testable on X11 via `XPresent` extension or `RandR` vblank
events; protocol-side measurement under each
display server).
- **Compositor + browser CPU at steady state** (the cost
saved by cutting the compositor is the upper bound on the
patch-payoff if a future campaign tries to optimise the
compositor instead of removing it).
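
A minimal sketch of the effective-fps metric, assuming the two counters are read from `getVideoPlaybackQuality()` at the end of the window and that `totalVideoFrames` counts displayed plus dropped frames, so dropped frames are subtracted before dividing. The function name and the sample counters are hypothetical.

```python
def effective_fps(total_video_frames, dropped_video_frames, elapsed_s):
    """Frames actually presented per second over the measurement window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    presented = total_video_frames - dropped_video_frames
    return presented / elapsed_s

# Hypothetical counters for a 30 fps source over a 70 s window.
print(round(effective_fps(2100, 26, 70.0), 2))  # 29.63
```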
### Experimental matrix
Six browser × decode rows (3 browsers × 2 decode paths), each
measured under 2 session conditions (with-KWin /
without-KWin), for 12 cells in total:

| Browser | Decode | with-KWin (Plasma Wayland) | without-KWin (X11 session, no compositor) |
|---|---|---|---|
| Brave 147 | full SW | C-W-brave-sw | C-X-brave-sw |
| Brave 147 | libva (if it works) | C-W-brave-libva | C-X-brave-libva |
| chromium-fourier 149 (Step 1 + Step 2) | full SW | C-W-chrf-sw | C-X-chrf-sw |
| chromium-fourier 149 | libva (Step 1 enables it) | C-W-chrf-libva | C-X-chrf-libva |
| Firefox | full SW | C-W-ff-sw | C-X-ff-sw |
| Firefox | libva | C-W-ff-libva | C-X-ff-libva |

The "(if it works)" / "where possible" qualifiers follow the
operator's directive: on rockchip-drm RK3568, libva is so far
known to work only on chromium-fourier (Step 1 ports
`libva-v4l2-request`). For stock Brave 147 and stock Firefox,
libva probably doesn't engage, in which case those cells are
documented N/A. For Firefox, a system-wide `libva-v4l2-request`
may make libva work via Mozilla's VAAPI backend even on stock
Firefox; this is to be verified in Phase 0 inventory.
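
The cell IDs in the table follow a regular `C-<session>-<browser>-<decode>` pattern, so a run orchestrator could generate them rather than hard-code them. A minimal sketch; the constant and function names are illustrative, and cells that turn out to be N/A would still need to be filtered separately.

```python
from itertools import product

SESSIONS = {"W": "with-KWin (Plasma Wayland)", "X": "without-KWin (X11, no compositor)"}
BROWSERS = ["brave", "chrf", "ff"]
DECODES = ["sw", "libva"]

def matrix_cells():
    """All cell IDs in the form C-<session>-<browser>-<decode>."""
    return [
        f"C-{session}-{browser}-{decode}"
        for browser, decode, session in product(BROWSERS, DECODES, SESSIONS)
    ]

cells = matrix_cells()
print(len(cells))   # 12
print(cells[:2])    # ['C-W-brave-sw', 'C-X-brave-sw']
```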
### What "cutting out the KWin compositor" means
This campaign uses **X11 session with no compositor in the
display path** as the "without-KWin" cell. Specifically:
- Native Xorg server, NOT XWayland (XWayland would still go
through KWin for display, defeating the purpose).
- Window manager that does NOT composite by default — e.g.
openbox, fluxbox, xfwm4-with-compositing-off, i3, twm.
Plasma X11 uses `kwin_x11` as compositing WM, which is
still a "KWin compositor" — it does not satisfy "cut KWin
out" and is **excluded** from the without-KWin cell.
- Browser windowed (not fullscreen). Even on a non-compositing
WM, fullscreen browsers may engage XPresent direct
presentation paths; testing windowed isolates the
baseline non-compositor windowed display path.

The exact WM choice is a Phase 0 inventory decision (which
WMs are available on ohm, which install cleanly, which
SDDM-advertised sessions exist). Default candidate: openbox.
### Three plausible outcome shapes
- **(α)** Without-KWin is materially faster across all 6
cells: confirms the KWin compositor cost is a real
bottleneck on this hardware, and
X11-session-without-compositor becomes the recommended
daily-driver configuration for video work on PineTab2.
- **(β)** Without-KWin is comparable or only marginally
faster: the compositor isn't the bottleneck; the drop
phenomenon is hardware/kernel/Mesa-bound, and the
predecessor's Phase 8 closure stands.
- **(γ)** Mixed picture per browser × decode path: e.g.
libva paths benefit but SW paths don't; or Firefox benefits
but chromium-class clients don't. Each cell becomes its own
characterisation.
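
Once each row has a verdict, mapping the results onto the three shapes is mechanical. The sketch below assumes a per-row boolean ("without-KWin is materially faster") already exists; the threshold behind that boolean is not defined in this document, and N/A cells are assumed to be left out of the input. All names and the sample verdicts are hypothetical.

```python
def outcome_shape(faster_by_cell):
    """Map per-cell verdicts (True = without-KWin materially faster)
    onto the alpha / beta / gamma outcome shapes."""
    verdicts = list(faster_by_cell.values())
    if not verdicts:
        raise ValueError("no cells measured")
    if all(verdicts):
        return "alpha"
    if not any(verdicts):
        return "beta"
    return "gamma"

# Hypothetical verdicts: libva rows benefit, SW rows do not.
example = {
    "brave-sw": False, "brave-libva": True,
    "chrf-sw": False, "chrf-libva": True,
    "ff-sw": False, "ff-libva": True,
}
print(outcome_shape(example))  # gamma
```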
### Open questions before Phase 1 lock
The hardware-overlay-path mechanism is structurally available
on X11 + non-compositing WM. Whether it actually engages for
each of the three browsers is browser-specific and currently
unknown:
- **Brave / Chromium ozone-x11**: Chromium has overlay-support
code (`OverlayProcessor`, `GpuMemoryBufferManager`,
`DCOMPSurface` on Windows; on Linux X11 the path is via
XPresent + DMA-BUF + `OverlayCandidate`). Whether Brave
147 / chromium-fourier 149 actually request hardware-overlay
presentation for a windowed video element under X11 is open.
- **Firefox**: the VAAPIVideoDecoder backend produces
hardware-decoded NV12 dmabufs that the GL compositor consumes
internally. Whether Firefox's X11 backend has a path to
hand the dmabuf to the X server for hardware-overlay
presentation (rather than internally compositing to RGB) is
open. Mozilla has a `MOZ_X11_EGL` hint and a "hardware video
overlay" pref but these are not universally engaged.
- **Reference clients**: mpv with `--vo=xv` or
`--vo=gpu --hwdec=auto-copy --gpu-context=x11`, or `gst-play-1.0`
with `xvimagesink` or `glimagesink`, are candidate reference
paths. The Xv variants (`--vo=xv`, `xvimagesink`) are the
candidates for hardware-overlay presentation, subject to how
the Xorg driver implements Xv; the GL variants (`--vo=gpu`,
`glimagesink`) render through GL and serve as a non-overlay
control. **Adding mpv to the matrix as a reference client**
would isolate "does the X11 hardware-overlay path work AT ALL
on this hardware" from "do browsers actually use it." If mpv
hardware-overlays cleanly but browsers don't, the conclusion
is "the X11 path is fast, but browsers leave the speedup on
the table."

If the operator agrees, Phase 0 inventory should:
1. Verify that Plane 39's NV12-LINEAR capability is reachable
by X11 clients (it is for KWin Wayland; it should be for X11
too, since Plane 39 is just a DRM resource), and identify
which X11 path actually programs it (modesetting Xorg driver +
`Option "PageFlip" "true"`, or a DRI3-presented buffer ending
up on Plane 39 via the X server's plane allocator).
2. Inventory Brave's, chromium-fourier's, and Firefox's X11
overlay-presentation paths to see which (if any) request
hardware-overlay presentation.
3. Add mpv as a reference X11-overlay client to the matrix,
so the campaign has a known-good comparison point.
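
For step 3, the mpv invocations named in the reference-client bullet above could be generated from one table so every run uses identical flags. A minimal sketch: the flag sets are quoted from that bullet, while the variant labels, the helper name, and the sample file name are illustrative assumptions.

```python
import shlex

# Flag sets quoted from the reference-client bullet; labels are illustrative.
MPV_VARIANTS = {
    "xv": ["--vo=xv"],
    "gpu-x11": ["--vo=gpu", "--hwdec=auto-copy", "--gpu-context=x11"],
}

def mpv_command(variant, media_path):
    """Build the argv for one mpv reference run (windowed, not fullscreen)."""
    return ["mpv", *MPV_VARIANTS[variant], media_path]

for name in MPV_VARIANTS:
    # "sample_30fps.mp4" is a hypothetical file name.
    print(shlex.join(mpv_command(name, "sample_30fps.mp4")))
```

Keeping the argv as a list (rather than a shell string) lets the same table feed `subprocess.run` directly if the orchestrator is written in Python.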
### What this question does NOT cover
For clarity, since the predecessor was specifically about
the Wayland-overlay-subsurface composite path:
- This campaign is **not** investigating the wp_subsurface
route. The Wayland cell of the matrix (with-KWin) measures
whatever each browser's default configuration produces under
the existing Plasma Wayland session: windowed, default
profile, default flags. It's a measurement of the as-shipped
Plasma Wayland stack from the user's perspective, not a probe
of a specific KWin code path.
- The Δ_present-46 ms finding from the predecessor is
testable as a free side-finding under both session
conditions (Wayland and X11) but is not the campaign's
primary question.
- Daily-driver fitness (apps that break under X11, touchscreen
behavior, multi-monitor edge cases, etc.) is **not in
scope**. The campaign's deliverable is the matrix above; if
any cell is decisively faster, daily-driver-fitness becomes
a follow-up campaign.
## What's NOT in scope (working assumption)
Until the research question is confirmed, the following are
treated as out of scope so they don't slip into Phase 1
prematurely:
- Patches to KWin, Xorg, kwin-fourier, qt6-base-fourier, or any
other component on ohm. This is **research**, not
patch-development. Per the non-upstreaming default,
MR/bug-report filing happens only when explicitly tasked, and
none is scheduled here.
- The Δ_present-46 ms finding's investigation. It's a known
hook from the predecessor; whether this campaign chases it
depends on the locked research question.
- Reverting predecessor tooling state. Governor, baloo,
`qt6-base-fourier`, `kwin-fourier` stay as-is unless the
operator decides otherwise.
- Filing bugs for any of the predecessor's three documented
candidate findings. The same non-upstreaming default applies.
## What Phase 0 will deliver, regardless of framing
Even before the research question is locked, the following are
useful Phase 0 deliverables that don't depend on the specific
question:
1. **State snapshot of ohm under current Plasma Wayland**
captured at campaign start. This is the *before* photo for
any future X11 vs Wayland comparison. Unattended-tractable
(just scripted SSH).
2. **Inventory of available X11 paths on ohm**: what packages
are installed, what session candidates SDDM advertises,
what would need to be installed to enable a Plasma X11
session, what alternate WMs are available. Read-only,
unattended-tractable.
3. **Inventory of measurement instruments that work under
X11**: `xtrace`, `xprop`, `xrandr --verbose --query`, perf
on `Xorg` PID, frame-timing extraction options. Read-only.
4. **A1 baseline** under current Plasma Wayland: re-run a
single rep of the predecessor's `kwin_timing_nodebug`
condition immediately at the start of this campaign, so
the comparison Wayland-vs-X11 has a same-session anchor.
This is the "set the baseline before instrument changes"
discipline from `feedback_replicate_baseline_first.md`.

These steps are unblocked. They don't commit to a specific
research question and they produce evidence that's useful
under any of the candidate framings.