Phase 0: motivation locked + mechanism captured

Operator-supplied research question 2026-05-03: "Does cutting
out the KWin compositor enable faster video display of Brave,
chromium-fourier, and Firefox — for full SW decoding, and for
libva decoding (where possible) — on PineTab2 RK3568?"

Operator-supplied mechanism 2026-05-03 (two messages):

1. "hantro emits NV12 which the GPU can't put on a
   compositeable plane. So that is the real bottleneck of
   Wayland."

   Connects directly to predecessor's Phase 1 finding
   (kwin_overlay_subsurface/phase2_source_findings.md:170-229):
   rockchip-drm overlay Plane 45 advertises no NV12 modifier;
   Primary Plane 39 supports NV12 LINEAR but is owned by KWin
   for its compositor framebuffer. Predecessor named the
   constraint but not the consequence — the consequence is
   that NV12 → RGB GL-composite is forced on Wayland-with-KWin
   regardless of which protocol path the browser uses.

2. "A X11 pipeline would route around that by giving a portion
   of screen real estate directly to the video pipeline."

   The X11 hardware-overlay path: with X11 + non-compositing
   WM, the X server can allocate Plane 39 (NV12 LINEAR) to
   the video region and Plane 45 (RGB AFBC) to the rest of
   the desktop. Hardware-blended at scanout. NO GL-composite
   anywhere — the cost the operator named as "the real
   bottleneck" is structurally avoided.

   This is the X11 hardware-overlay mechanism that historically
   made X11 desktops good at video playback (Xv → modern
   DRI3 + XPresent + Composite-redirection-disabled).
   Wayland-with-monolithic-compositor designs cannot do this:
   the compositor must own the Primary plane, so it cannot put
   NV12 video on Plane 39 alongside RGB chrome on Plane 45.

phase0_findings.md updated with:
- Locked research question + 12-cell experimental matrix
  (3 browsers × 2 decode paths × 2 sessions; some N/A).
- Three separable cost components the matrix tests for
  (mandatory NV12→RGB GL conversion if hardware-overlay
  doesn't engage, fallback GL-composite, per-frame compositor
  overhead independent of NV12).
- Open questions about whether browsers actually request
  hardware-overlay presentation under X11, or whether they
  always internally composite to RGB.
- Recommendation to add mpv as a reference X11-overlay
  client: distinguishes "X11 path works on this hardware"
  from "browsers actually use the X11 path."

worklist.md updated:
- Phase 0 motivation + matrix items ticked.
- Pre-Phase-1 inventory broken out: state snapshot, X11
  path inventory, browser-overlay-path inventory, mpv
  reference, X11 measurement-tool inventory, A1 Wayland
  baseline anchor.
- Phase 1 sketch: binding cells per matrix cell, clear-pass /
  clear-fail thresholds, measurement protocol mirroring the
  predecessor's phase3_protocol.md structure.

README banner updated to reflect locked motivation +
mechanism summary + matrix shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 06:49:18 +00:00
parent eff6fb5b29
commit ac301b4e48
3 changed files with 358 additions and 99 deletions
+249 -55
@@ -99,70 +99,264 @@ If this campaign needs to switch ohm to an X11 session, that
is a session-level operator action (logout, switch via SDDM,
log back in). It cannot be done unattended.
## Research question (provisional — awaits operator confirmation)
## Research question (LOCKED 2026-05-03)
**Candidate framing**, not locked:
> *"Does cutting out the KWin compositor enable faster video
> display of Brave, chromium-fourier, and Firefox — for full
> SW decoding, and for libva decoding (where possible) — on
> PineTab2 RK3568?"*
> *"On PineTab2 RK3568 with the same chromium-fourier binary,
> the same `bbb_1080p30_h264.mp4` 30 fps source clip, and the
> same `brave_drops_test.html` instrumented page, does running
> an X11-session display server (Plasma X11, or an alternative
> X11 desktop) reproduce the drop-inversion phenomenon that
> motivated `kwin_overlay_subsurface`, eliminate it, or
> introduce a different drop characteristic?"*
### Mechanism the question targets
This is the most narrowly relevant question given the
predecessor's close-out. Three plausible outcomes:
Operator-supplied context 2026-05-03:
- **(α)** X11 reproduces low post-warmup drops (matches Phase 0
cage = 0 floor): isolates the dropped-frames mechanism to
the Wayland compositor stack on this hardware. The original
campaign's framing was correct in spirit but the cage
comparison was confounded; X11 becomes the better
comparator.
- **(β)** X11 has comparable or higher post-warmup drops: the
drop phenomenon is hardware/kernel/Mesa-bound and does not
localise to the display-server stack at all. Predecessor's
Phase 8 closure stands; the X11 measurement is the
decisive cross-check.
- **(γ)** X11 has a different failure mode entirely (different
drop pattern, different perf hot symbols, different effective
fps): each finding is its own characterisation; the
campaign becomes "what does running X11 on this hardware
look like end-to-end."
> *"hantro emits NV12 which the GPU can't put on a
> compositeable plane. So that is the real bottleneck of
> Wayland."*
**Alternate framings** the operator may have in mind that
this provisional question doesn't cover:
This connects directly to the predecessor's Phase 1 finding
(`kwin_overlay_subsurface/phase2_source_findings.md`:170-229):
- *Daily-driver fitness*: "Can I use X11 instead of Wayland on
this device for everyday browser/video/desktop work, and
what works/breaks?" — different scope; less measurement-heavy,
more workflow-oriented.
- *Specific X11-only feature investigation*: composite
redirection, XRender, GLAMOR, Xinerama on a single-display
device, etc.
- *XWayland behaviour*: many Linux desktops run X11 clients
under Wayland via XWayland. If an "X11 session" really means
"test under XWayland to compare with native Wayland", the
measurement is fundamentally different.
- *Power consumption / thermal*: X11 vs Wayland on a passively
cooled tablet may differ in idle CPU and thermal envelope.
Different metric set.
- Hantro VPU decodes H.264 video into NV12 dmabufs (`DRM_FORMAT_NV12`,
`DRM_FORMAT_MOD_LINEAR`).
- rockchip-drm's only NV12-LINEAR-capable plane is the
Primary plane (Plane 39 on CRTC 52), which the running KWin
uses for its GL framebuffer.
- The overlay plane (Plane 45) advertises no NV12 in any
modifier in `IN_FORMATS`.
- Therefore **no rockchip-drm scanout plane can accept the
NV12 buffer hantro produces while KWin owns the primary
plane.** Some compositing step must convert NV12 → RGB
before display.
**Operator decision needed before Phase 1**:
The predecessor named the *constraint* (Path B rejected at
the format/modifier intersection) but did not make the
*consequence* explicit: some component must GL-composite
NV12 → RGB on the GPU because nothing else on this hardware
can put NV12 on a scanout plane. That consequence is this
campaign's motivating insight:
1. Which question is in scope? (drop phenomenon, daily-driver,
feature-specific, XWayland-vs-native, power, or something
else).
2. What "X11 session" means specifically: native Xorg + Plasma
X11; native Xorg + lightweight WM (e.g. openbox / i3 / xfwm);
XWayland under the existing Plasma Wayland session; or
another configuration.
3. What the success/failure criteria look like (binding cells,
`metrics.csv` shape).
- **Under Plasma Wayland:** when the browser engages the
Wayland subsurface route (chromium's
`WaylandBufferManagerHost::CommitOverlays`), KWin receives
an NV12 dmabuf and must GL-composite it. **KWin's compositor
is the GL-composite step.** When the browser does NOT
engage the subsurface route (the predecessor's measured
case on `brave_drops_test.html` — zero `wp_subsurface` in
the trace), the browser itself converts NV12 → RGB in its
own GL context and hands KWin only RGB; KWin then composites
the RGB to its primary plane.
- **Under X11 without a compositor:** there is no separate
compositor process. Two paths are open to the client:
- *RGB-composite path*: the browser converts NV12 → RGB in its
own GL context and presents the RGB result via XPresent/DRI3
to the X server, which schedules a page-flip on the same
Primary plane KWin would have used. One fewer hand-off
than the Wayland-with-subsurface case, but the same
GL-composite cost as the no-subsurface Wayland case.
- **Hardware-overlay path** (operator-supplied context
2026-05-03: *"a X11 pipeline would route around that by
giving a portion of screen real estate directly to the
video pipeline"*). The X server allocates the Primary
plane (Plane 39, supports NV12 LINEAR) to the video
region and the Overlay plane (Plane 45, supports
RGB/AFBC) to the rest of the desktop. Hardware-blended
at scanout time. **No GL-composite of NV12 anywhere —
the cost the operator named as "the real bottleneck"
is structurally avoided.**
Until those are answered, Phase 0 documents the question space
and Phase 1 does not lock.
This second X11 path is what Wayland compositors as
designed today cannot do on rockchip-drm-class hardware: KWin
Wayland *must* own the Primary plane for its compositor
framebuffer (because the Wayland model is "compositor presents
one merged surface per output"), so it cannot give Plane 39
to a video-region NV12 buffer while putting the rest of the
desktop on Plane 45. X11 + non-compositing WM has no such
constraint — different windows can be assigned to different
planes by the X server's plane allocator.
This is the X11 hardware-overlay mechanism that historically
made X11 desktops good at video playback (Xv from the late
1990s, and the modern equivalents via DRI3 + XPresent +
Composite-redirection-disabled). It is structurally absent
in Wayland-with-monolithic-compositor designs.
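The plane-ownership argument above can be sketched as a small model. This is a toy illustration, not a query against the real device: the `PLANES` table only restates the capabilities described in this section (Plane 39 accepts NV12 LINEAR, Plane 45 advertises no NV12), the RGB entries (`XR24`) and the `"AFBC"` label are placeholders assumed for the example, and `direct_scanout_plane` is a hypothetical helper.

```python
# Toy model of the plane constraint described above. The capability table
# restates this document's findings; it is not read from the kernel.
LINEAR = "LINEAR"
AFBC = "AFBC"  # placeholder label, not a real DRM modifier value

PLANES = {
    39: {"type": "primary", "formats": {("NV12", LINEAR), ("XR24", LINEAR)}},
    45: {"type": "overlay", "formats": {("XR24", LINEAR), ("XR24", AFBC)}},
}


def direct_scanout_plane(fmt, modifier, planes_in_use):
    """Return a free plane that accepts (fmt, modifier), or None."""
    for plane_id, plane in sorted(PLANES.items()):
        if plane_id in planes_in_use:
            continue
        if (fmt, modifier) in plane["formats"]:
            return plane_id
    return None


# Wayland with KWin: the compositor framebuffer occupies Plane 39.
with_kwin = direct_scanout_plane("NV12", LINEAR, planes_in_use={39})

# X11 without a compositor, with the desktop placed on Plane 45 instead.
without_kwin = direct_scanout_plane("NV12", LINEAR, planes_in_use={45})

print(with_kwin)     # None: NV12 has to be converted to RGB first
print(without_kwin)  # 39: NV12 can reach a scanout plane unconverted
```

The model only captures the format/modifier intersection; whether the X server actually makes this allocation on the real hardware is one of the open questions listed further down.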
### Hypothesis the matrix tests
There are three potentially separable costs:
1. **The mandatory NV12 → RGB GL conversion**, which is
*forced* on Wayland-with-KWin because KWin must own the
only NV12-LINEAR-capable plane on this hardware for its
compositor framebuffer. **This cost is structurally
avoidable** under X11 + non-compositing WM via
hardware-plane-overlay (per the operator-supplied insight
above). Whether browsers can be coaxed to *use* the X11
hardware-overlay path — rather than internally compositing
to RGB before presenting — is browser-specific (see Open
questions below).
2. **The fallback GL-composite cost** when the
hardware-overlay path doesn't engage. Both Wayland and X11
pay this when the buffer shape doesn't match a plane —
it just runs in different processes (KWin under Wayland,
browser under X11).
3. **The per-frame compositor overhead** independent of NV12:
dmabuf import, transaction apply, presentation-feedback
wiring, frame-callback delivery — which the predecessor
measured at ~30-37 % of `kwin_wayland`'s CPU during
steady-state video playback even when KWin only saw RGB
surfaces.
The X11 hypothesis is strongest if cost (1) is dominant on
the matrix's with-KWin cells AND the X11 cells trigger the
hardware-overlay path. The X11 hypothesis is weakest if
cost (1) is small and cost (3) is small — in which case the
"cutting out KWin" experiment would show only marginal
differences.
The matrix below is designed to surface which of costs (1),
(2), and (3) dominates for each browser × decode path.
"Faster video display" is operationally **a combination of**:
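- **Effective fps actually rendered** (=
`(totalVideoFrames - droppedVideoFrames) / elapsed_s` from
`getVideoPlaybackQuality()`, since `totalVideoFrames` counts
dropped frames as well as displayed ones. For a 30 fps source
the upper bound is 30; the question is how close).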
- **Drop count** over the same 70 s window (`droppedVideoFrames`).
- **End-to-end latency** if testable (commit → present;
testable on Wayland via `wp_presentation_feedback`,
testable on X11 via `XPresent` extension or `RandR` vblank
events; protocol-side measurement under each
display-server).
- **Compositor + browser CPU at steady state** (the cost
saved by cutting the compositor is the upper bound on the
patch-payoff if a future campaign tries to optimise the
compositor instead of removing it).
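A minimal sketch of how the first two metrics could be derived from one cell's counters. The function name, the returned field names, and the sample numbers are all hypothetical; the only assumption taken from the web platform is that `totalVideoFrames` counts frames that were either displayed or dropped, so rendered frames are the difference between the two counters.

```python
def playback_metrics(total_frames, dropped_frames, elapsed_s, source_fps=30.0):
    """Summarise one measurement window from getVideoPlaybackQuality() counters.

    total_frames   -- totalVideoFrames at the end of the window
    dropped_frames -- droppedVideoFrames at the end of the window
    elapsed_s      -- wall-clock length of the window in seconds
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    rendered = total_frames - dropped_frames
    effective_fps = rendered / elapsed_s
    return {
        "rendered_frames": rendered,
        "effective_fps": effective_fps,
        "drop_ratio": dropped_frames / total_frames if total_frames else 0.0,
        "fraction_of_source": effective_fps / source_fps,
    }


# Hypothetical sample: a 70 s window of a 30 fps clip with 42 drops.
sample = playback_metrics(total_frames=2100, dropped_frames=42, elapsed_s=70.0)
print(sample)
```

Keeping the raw counters alongside the derived values makes it possible to recompute the metrics later if the window definition changes.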
### Experimental matrix
Six browser × decode rows (3 browsers × 2 decode paths) × 2
session conditions (with-KWin / without-KWin), 12 cells in total:
| Browser | Decode | with-KWin (Plasma Wayland) | without-KWin (X11 session, no compositor) |
|---|---|---|---|
| Brave 147 | full SW | C-W-brave-sw | C-X-brave-sw |
| Brave 147 | libva (if it works) | C-W-brave-libva | C-X-brave-libva |
| chromium-fourier 149 (Step 1 + Step 2) | full SW | C-W-chrf-sw | C-X-chrf-sw |
| chromium-fourier 149 | libva (Step 1 enables it) | C-W-chrf-libva | C-X-chrf-libva |
| Firefox | full SW | C-W-ff-sw | C-X-ff-sw |
| Firefox | libva | C-W-ff-libva | C-X-ff-libva |
The "(if it works)" / "where possible" qualifiers follow the
operator's directive: on rockchip-drm RK3568, libva is only
known to work on chromium-fourier (Step 1 ports
`libva-v4l2-request`). For stock Brave 147 and stock Firefox,
libva probably doesn't engage, and those cells are documented
N/A. For Firefox, the Mesa-side `libva-v4l2-request` may make
libva work via Mozilla's VAAPI backend even on stock Firefox;
this is to be verified in Phase 0 inventory.
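The cell naming in the table follows a regular `C-<session>-<browser>-<decode>` pattern, so the full list can be generated rather than typed. The sketch below is illustrative only: the dictionary layout and the `libva_unverified` set are assumptions made for the example, not part of the campaign's tooling.

```python
from itertools import product

# Short keys are the ones used in the cell IDs of the table above.
BROWSERS = {"brave": "Brave 147", "chrf": "chromium-fourier 149", "ff": "Firefox"}
DECODES = ("sw", "libva")
SESSIONS = {"W": "with-KWin (Plasma Wayland)", "X": "without-KWin (X11, no compositor)"}


def cell_ids():
    """Enumerate all matrix cells in table order (row by row, W before X)."""
    return [
        f"C-{session}-{browser}-{decode}"
        for browser, decode, session in product(BROWSERS, DECODES, SESSIONS)
    ]


cells = cell_ids()

# libva cells outside chromium-fourier may end up N/A pending the inventory.
libva_unverified = {c for c in cells if c.endswith("-libva") and "-chrf-" not in c}

for cell in cells:
    note = " (libva support unverified)" if cell in libva_unverified else ""
    print(cell + note)
```

Generating the IDs from the three axes keeps the matrix and any per-cell result files in step if a browser or session is added later.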
### What "cutting out the KWin compositor" means
This campaign uses **X11 session with no compositor in the
display path** as the "without-KWin" cell. Specifically:
- Native Xorg server, NOT XWayland (XWayland would still go
through KWin for display, defeating the purpose).
- Window manager that does NOT composite by default — e.g.
openbox, fluxbox, xfwm4-with-compositing-off, i3, twm.
Plasma X11 uses `kwin_x11` as compositing WM, which is
still a "KWin compositor" — it does not satisfy "cut KWin
out" and is **excluded** from the without-KWin cell.
- Browser windowed (not fullscreen). Even on a non-compositing
WM, fullscreen browsers may engage XPresent direct
presentation paths — testing windowed isolates the
baseline non-compositor windowed display path.
The exact WM choice is a Phase 0 inventory decision (which
WMs are available on ohm, which install cleanly, which
SDDM-advertised sessions exist). Default candidate: openbox.
### Three plausible outcome shapes
- **(α)** Without-KWin is materially faster across all 6
cells: confirms the KWin compositor cost is a real
bottleneck on this hardware, and
X11-session-without-compositor becomes the recommended
daily-driver configuration for video work on PineTab2.
- **(β)** Without-KWin is comparable or only marginally
faster: the compositor isn't the bottleneck; the drop
phenomenon is hardware/kernel/Mesa-bound, and the
predecessor's Phase 8 closure stands.
- **(γ)** Mixed picture per browser × decode path: e.g.
libva paths benefit but SW paths don't; or Firefox benefits
but chromium-class clients don't. Each cell becomes its own
characterisation.
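One way to make the three outcome shapes mechanical is to compare each browser × decode pair's without-KWin result against its with-KWin result and count how many pairs clear a threshold. The sketch below assumes a per-pair speedup ratio (without-KWin effective fps divided by with-KWin effective fps) and a placeholder threshold of 1.10 for "materially faster"; neither the ratio definition nor the threshold is fixed by this document, and the sample values are invented.

```python
def outcome_shape(speedups, material=1.10):
    """Classify a set of per-pair speedup ratios as alpha, beta, or gamma.

    speedups -- mapping of "<browser>-<decode>" to the ratio
                without-KWin / with-KWin for the chosen metric
    material -- placeholder threshold for "materially faster"
    """
    if not speedups:
        raise ValueError("no cells measured")
    wins = [pair for pair, ratio in speedups.items() if ratio >= material]
    if len(wins) == len(speedups):
        return "alpha"  # faster everywhere
    if not wins:
        return "beta"   # comparable or only marginally faster
    return "gamma"      # mixed picture per browser x decode path


# Invented example: only the libva pair clears the threshold.
example = {"brave-sw": 1.02, "chrf-sw": 1.01, "chrf-libva": 1.35, "ff-sw": 0.99}
print(outcome_shape(example))
```

Cells documented N/A would simply be left out of the mapping, so the classification is over the pairs that were actually measured.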
### Open questions before Phase 1 lock
The hardware-overlay-path mechanism is structurally available
on X11 + non-compositing WM. Whether it actually engages for
each of the three browsers is browser-specific and currently
unknown:
- **Brave / Chromium ozone-x11**: Chromium has overlay-support
code (`OverlayProcessor`, `GpuMemoryBufferManager`,
`DCOMPSurface` on Windows; on Linux X11 the path is via
XPresent + DMA-BUF + `OverlayCandidate`). Whether Brave
147 / chromium-fourier 149 actually request hardware-overlay
presentation for a windowed video element under X11 is open.
- **Firefox**: the VAAPIVideoDecoder backend produces
hardware-decoded NV12 dmabufs that the GL compositor consumes
internally. Whether Firefox's X11 backend has a path to
hand the dmabuf to the X server for hardware-overlay
presentation (rather than internally compositing to RGB) is
open. Mozilla has a `MOZ_X11_EGL` hint and a "hardware video
overlay" pref, but these are not universally engaged.
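- **Reference clients**: mpv with `--vo=xv` or
`--vo=gpu --hwdec=auto-copy --gpu-context=x11`, or `gst-play-1.0`
with `xvimagesink` or `glimagesink`, are well-established X11
video-output paths. The `xv` variants go through the Xv
extension; `--vo=gpu` and `glimagesink` render through GL, and
`--hwdec=auto-copy` copies decoded frames back to system
memory, so which of them actually reaches a hardware plane on
this stack is itself something the reference run would show.
**Adding mpv to the matrix as a reference client** would
isolate "does the X11 hardware-overlay path work AT ALL on
this hardware" from "do browsers actually use it." If mpv
hardware-overlays cleanly but browsers don't, the conclusion
is "the X11 path is fast, but browsers leave the speedup on
the table."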
If the operator agrees, Phase 0 inventory should:
1. Verify that Plane 39's NV12-LINEAR capability is reachable
from X11 clients (it is for KWin Wayland; it should be for X11
too, since Plane 39 is just a DRM resource), and identify which
X11 path actually programs it (modesetting Xorg driver +
`Option "PageFlip" "true"`, or a DRI3-presented buffer ending
up on Plane 39 via the X server's plane allocator).
2. Inventory Brave's, chromium-fourier's, and Firefox's X11
overlay-presentation paths to see which (if any) request
hardware-overlay presentation.
3. Add mpv as a reference X11-overlay client to the matrix,
so the campaign has a known-good comparison point.
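For inventory step 1, plane capabilities are usually inspected as raw format codes and modifier values, so a small helper for translating between fourcc tags and their 32-bit codes can be handy. The sketch below relies on two facts from the DRM headers (fourcc codes pack four ASCII characters little-endian, and `DRM_FORMAT_MOD_LINEAR` is 0); the `plane_accepts` helper and the sample pair list are hypothetical and do not reflect a real dump from this device.

```python
def fourcc_code(tag):
    """Pack a four-character tag such as "NV12" into its DRM fourcc code."""
    if len(tag) != 4:
        raise ValueError("fourcc tags have exactly four characters")
    a, b, c, d = (ord(ch) for ch in tag)
    return a | (b << 8) | (c << 16) | (d << 24)


def fourcc_tag(code):
    """Unpack a DRM fourcc code back into its four-character tag."""
    return "".join(chr((code >> shift) & 0xFF) for shift in (0, 8, 16, 24))


DRM_FORMAT_NV12 = fourcc_code("NV12")
DRM_FORMAT_MOD_LINEAR = 0


def plane_accepts(pairs, fmt=DRM_FORMAT_NV12, modifier=DRM_FORMAT_MOD_LINEAR):
    """Check a list of (format, modifier) pairs for one specific combination."""
    return (fmt, modifier) in set(pairs)


# Hypothetical (format, modifier) pairs, standing in for a decoded IN_FORMATS blob.
sample_pairs = [(fourcc_code("XR24"), 0), (fourcc_code("NV12"), 0)]

print(hex(DRM_FORMAT_NV12))         # 0x3231564e
print(fourcc_tag(DRM_FORMAT_NV12))  # NV12
print(plane_accepts(sample_pairs))  # True
```

Decoding the actual `IN_FORMATS` property blob is left out here; the helper only covers the comparison once the pairs have been extracted by whatever tool the inventory uses.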
### What this question does NOT cover
For clarity, since the predecessor was specifically about
the Wayland-overlay-subsurface composite path:
- This campaign is **not** investigating the wp_subsurface
route. The Wayland cell of the matrix (with-KWin) measures
whatever each browser's default configuration produces under
the existing Plasma Wayland session (windowed, default
profile, default flags). It's a measurement of the as-shipped
Plasma Wayland stack from the user's perspective, not a probe
of a specific KWin code path.
- The Δ_present-46 ms finding from the predecessor is
testable as a free side-finding under both axes (Wayland
and X11) but is not the campaign's primary question.
- Daily-driver fitness (apps that break under X11, touchscreen
behavior, multi-monitor edge cases, etc.) is **not in
scope**. The campaign's deliverable is the matrix above; if
any cell is decisively faster, daily-driver-fitness becomes
a follow-up campaign.
## What's NOT in scope (working assumption)