x11-session-research/phase0_findings.md
marfrit ac301b4e48 Phase 0: motivation locked + mechanism captured
Operator-supplied research question 2026-05-03: "Does cutting
out the KWin compositor enable faster video display of Brave,
chromium-fourier, and Firefox — for full SW decoding, and for
libva decoding (where possible) — on PineTab2 RK3568?"

Operator-supplied mechanism 2026-05-03 (two messages):

1. "hantro emits NV12 which the GPU can't put on a
   compositeable plane. So that is the real bottleneck of
   Wayland."

   Connects directly to predecessor's Phase 1 finding
   (kwin_overlay_subsurface/phase2_source_findings.md:170-229):
   rockchip-drm overlay Plane 45 advertises no NV12 modifier;
   Primary Plane 39 supports NV12 LINEAR but is owned by KWin
   for its compositor framebuffer. Predecessor named the
   constraint but not the consequence — the consequence is
   that NV12 → RGB GL-composite is forced on Wayland-with-KWin
   regardless of which protocol path the browser uses.

2. "A X11 pipeline would route around that by giving a portion
   of screen real estate directly to the video pipeline."

   The X11 hardware-overlay path: with X11 + non-compositing
   WM, the X server can allocate Plane 39 (NV12 LINEAR) to
   the video region and Plane 45 (RGB AFBC) to the rest of
   the desktop. Hardware-blended at scanout. NO GL-composite
   anywhere — the cost the operator named as "the real
   bottleneck" is structurally avoided.

   This is the X11 hardware-overlay mechanism that historically
   made X11 desktops good at video playback (Xv → modern
   DRI3 + XPresent + Composite-redirection-disabled).
   Wayland-with-monolithic-compositor designs cannot use this
   freedom: the compositor must own the Primary plane, so the
   plane-allocation freedom required to put NV12 video on
   Plane 39 alongside RGB chrome on Plane 45 isn't available.

phase0_findings.md updated with:
- Locked research question + 12-cell experimental matrix
  (3 browsers × 2 decode paths × 2 sessions; some N/A).
- Three separable cost components the matrix tests for
  (mandatory NV12→RGB GL conversion if hardware-overlay
  doesn't engage, fallback GL-composite, per-frame compositor
  overhead independent of NV12).
- Open questions about whether browsers actually request
  hardware-overlay presentation under X11, or whether they
  always internally composite to RGB.
- Recommendation to add mpv as a reference X11-overlay
  client: distinguishes "X11 path works on this hardware"
  from "browsers actually use the X11 path."

worklist.md updated:
- Phase 0 motivation + matrix items ticked.
- Pre-Phase-1 inventory broken out: state snapshot, X11
  path inventory, browser-overlay-path inventory, mpv
  reference, X11 measurement-tool inventory, A1 Wayland
  baseline anchor.
- Phase 1 sketch: binding cells per matrix cell, clear-pass /
  clear-fail thresholds, measurement protocol mirroring the
  predecessor's phase3_protocol.md structure.

README banner updated to reflect locked motivation +
mechanism summary + matrix shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 06:49:18 +00:00


Phase 0 — substrate and provisional research question

This is the campaign's Phase 0 substrate doc: what we already know from the predecessor kwin_overlay_subsurface close-out, what's open, and what the research question looks like. The research question started out provisional; it was locked on 2026-05-03 (see "Research question" below), while the open questions listed further down still need resolving before Phase 1 lock.

Predecessor close-out summary

../kwin_overlay_subsurface/phase8_handover.md (closed 2026-05-03 without patch). Three independent reasons no patch landed:

  1. The campaign's locked Phase 1 reference floor (drops_post_warmup == 0 from cage) is unreachable at N=3 today. Today's median is 26 post-warmup with the same chromium-fourier binary, same hardware, same kernel, same Mesa, same kwin-fourier — KWin direct reproduces Phase 0's 29 post-warmup, but cage now also drops ~22-56 post-warmup instead of Phase 0's 0.
  2. The campaign's surface-of-investigation (wp_subsurface overlay route) is not engaged by brave_drops_test.html. Chromium-fourier renders the video element via internal compositing into its main browser window surface — a single-surface case.
  3. The Phase 2 hot-path hypothesis (glEGLImageTargetTexture2DOES dominates kwin_wayland's per-frame cost) was rejected by Phase 3 perf measurement with 100×-margin on the wrong side of the threshold.

The diagnostic loop terminated at "the campaign's premise was N=1 to begin with, and the N=3 in-session re-measurement doesn't replicate it." This is filed as a feedback memory: replicate the N=1 baseline at N=3 in the same session BEFORE building multi-phase infrastructure around it.

What stays valid from the predecessor

Durable substrate listed in kwin_overlay_subsurface/phase8_handover.md § "What's left for a future session to pick up":

  • Phase 1 scanout-promotion archaeology (rockchip-drm RK3568 plane format/modifier table, KWin v6.6.4 promotion predicate). Plane 39 (Primary, NV12 LINEAR) is the GL framebuffer; Plane 45 (Overlay) does not advertise NV12 in any modifier. Both KWin scanout-promotion paths are structurally rejected for windowed Brave on this DRM driver. This holds regardless of display server.
  • Phase 2 H1 file:line in kwin_overlay_subsurface/phase2_source_findings.md. Cold per Phase 3 measurement; informational only.
  • Phase 2-prime Shape C source-read of Display::dispatchEvents and TransactionFence in KWin's src/wayland/. Specific to the Wayland path; not relevant to an X11-session campaign. The X11 path uses different KWin surface plumbing (kwin_x11) and a different per-frame protocol (X11 Composite extension + Damage + XPresent), not Wayland protocol dispatch.
  • Δ_present-46 ms reproducible side-finding under Plasma Wayland. Across all measured conditions (chromium-fourier on KWin, chromium-fourier in cage, stock Brave on KWin), median Δ_present was 41-46 ms on a 60 Hz panel (16.7 ms per vsync), a stable queue depth of roughly 2.5-2.8 vsync intervals. This finding is independent of the cage breakdown and directly testable under X11 as a comparison point.
  • Measurement infrastructure: kwin_overlay_subsurface/scripts/wayland_debug_to_csv.py (libwayland 1.21+ format, 17 unit tests passing) + phase3_prime_runs/run_browser.sh orchestrator on ohm (handles WAYLAND_DEBUG=1 capture, perf record, top sampling, drops trajectory extraction, kill-cleanly). The WAYLAND_DEBUG portion does not apply under X11; an X11 equivalent would be different tooling (xtrace, xev, or XCB-debug instrumentation if the client emits any). The perf+top+drops capture portion remains usable under X11 unchanged.
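The queue-depth figure quoted for the Δ_present side-finding is plain arithmetic on the panel's refresh period. A minimal sketch of that conversion follows; the function name `vsync_depth` is hypothetical and only the 41-46 ms range and the 60 Hz refresh rate come from the findings above.

```python
def vsync_depth(delta_ms: float, refresh_hz: float = 60.0) -> float:
    """Express a commit-to-present delay as a number of vsync intervals."""
    period_ms = 1000.0 / refresh_hz  # about 16.67 ms on a 60 Hz panel
    return delta_ms / period_ms


if __name__ == "__main__":
    # The two ends of the median Δ_present range reported by the predecessor.
    for delta_ms in (41.0, 46.0):
        print(f"{delta_ms:.0f} ms = {vsync_depth(delta_ms):.2f} vsync intervals")
```

The same helper can be reused unchanged for X11 measurements, since it depends only on the refresh rate and not on the display server.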

Current ohm state (carry-over from predecessor)

Per kwin_overlay_subsurface/phase1_evidence/ohm_tooling_revert_log.md, not reverted at predecessor close-out:

  • qt6-base-fourier 1:6.11.0-3
  • kwin-fourier 1:6.6.4-3 (Wayland-side compositor; not in the hot path under an X11 session)
  • mesa 1:26.0.5-1
  • CPU governor pinned to performance
  • Baloo permanently disabled
  • drm-info 2.9.0-1
  • Active session: startplasma-wayland on tty2, kwin_wayland PID 3927 (as of 2026-05-03 03:05 UTC).
  • Browser binaries available: /tmp/chromium-ohm-gl-fix-step2/chrome (chromium-fourier, Step 1 + Step 2 patches, 149.0.7812.0), /usr/bin/brave (brave-bin 1:1.89.145-1).

If this campaign needs to switch ohm to an X11 session, that is a session-level operator action (logout, switch via SDDM, log back in). It cannot be done unattended.

Research question (LOCKED 2026-05-03)

"Does cutting out the KWin compositor enable faster video display of Brave, chromium-fourier, and Firefox — for full SW decoding, and for libva decoding (where possible) — on PineTab2 RK3568?"

Mechanism the question targets

Operator-supplied context 2026-05-03:

"hantro emits NV12 which the GPU can't put on a compositeable plane. So that is the real bottleneck of Wayland."

This connects directly to the predecessor's Phase 1 finding (kwin_overlay_subsurface/phase2_source_findings.md:170-229):

  • Hantro VPU decodes H.264 video into NV12 dmabufs (DRM_FORMAT_NV12, DRM_FORMAT_MOD_LINEAR).
  • rockchip-drm's only NV12-LINEAR-capable plane is the Primary plane (Plane 39 on CRTC 52), which the running KWin uses for its GL framebuffer.
  • The overlay plane (Plane 45) advertises no NV12 in any modifier in IN_FORMATS.
  • Therefore no rockchip-drm scanout plane can accept the NV12 buffer hantro produces while KWin owns the primary plane. Some compositing step must convert NV12 → RGB before display.

The predecessor named the constraint (Path B rejected at the format/modifier intersection) but the consequence — "some component must GL-composite NV12 → RGB on the GPU because nothing else on this hardware can put NV12 on a scanout plane" — was not made explicit. That consequence is this campaign's motivating insight:

  • Under Plasma Wayland: when the browser engages the Wayland subsurface route (chromium's WaylandBufferManagerHost::CommitOverlays), KWin receives an NV12 dmabuf and must GL-composite it. KWin's compositor is the GL-composite step. When the browser does NOT engage the subsurface route (the predecessor's measured case on brave_drops_test.html — zero wp_subsurface in the trace), the browser itself converts NV12 → RGB in its own GL context and hands KWin only RGB; KWin then composites the RGB to its primary plane.
  • Under X11 without a compositor: there is no separate compositor process. Two paths are open to the client:
    • RGB-composite path (browser converts NV12 → RGB in its own GL context and presents the RGB result via XPresent/DRI3 to the X server, which schedules a page-flip on the same Primary plane KWin would have used). One fewer hand-off than the Wayland-with-subsurface case, but the same GL-composite cost as the no-subsurface Wayland case.
    • Hardware-overlay path (operator-supplied context 2026-05-03: "a X11 pipeline would route around that by giving a portion of screen real estate directly to the video pipeline"). The X server allocates the Primary plane (Plane 39, supports NV12 LINEAR) to the video region and the Overlay plane (Plane 45, supports RGB/AFBC) to the rest of the desktop. Hardware-blended at scanout time. No GL-composite of NV12 anywhere — the cost the operator named as "the real bottleneck" is structurally avoided.

This second X11 path is what Wayland compositors as designed today cannot do on rockchip-drm-class hardware: KWin Wayland must own the Primary plane for its compositor framebuffer (because the Wayland model is "compositor presents one merged surface per output"), so it cannot give Plane 39 to a video-region NV12 buffer while putting the rest of the desktop on Plane 45. X11 + non-compositing WM has no such constraint — different windows can be assigned to different planes by the X server's plane allocator.

This is the X11 hardware-overlay mechanism that historically made X11 desktops good at video playback (Xv from the late 1990s, and the modern equivalents via DRI3 + XPresent + Composite-redirection-disabled). It is structurally absent in Wayland-with-monolithic-compositor designs.
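The plane argument above reduces to a format/modifier intersection plus an ownership check. The sketch below models it under stated assumptions: the plane IDs and the NV12 LINEAR / no-NV12 facts come from the predecessor's findings, while the `XRGB8888` entries, the `Plane` class and the `planes_for` helper are illustrative placeholders rather than a dump of the real IN_FORMATS table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    plane_id: int
    kind: str
    formats: frozenset  # set of (format, modifier) pairs the plane advertises


# Illustrative subset only. The RGB entries are assumptions standing in for
# "RGB/AFBC"; the real table lives in the predecessor's Phase 1 evidence.
PLANES = [
    Plane(39, "Primary", frozenset({("NV12", "LINEAR"), ("XRGB8888", "LINEAR")})),
    Plane(45, "Overlay", frozenset({("XRGB8888", "LINEAR"), ("XRGB8888", "AFBC")})),
]


def planes_for(buffer, owned_by_compositor=frozenset()):
    """Return IDs of planes that accept `buffer` and are not already claimed."""
    return [
        p.plane_id
        for p in PLANES
        if buffer in p.formats and p.plane_id not in owned_by_compositor
    ]


if __name__ == "__main__":
    nv12 = ("NV12", "LINEAR")
    # Plasma Wayland: KWin holds the Primary plane for its framebuffer.
    print("with KWin:", planes_for(nv12, owned_by_compositor={39}))
    # X11 with a non-compositing WM: no compositor claims a plane up front.
    print("without KWin:", planes_for(nv12))
```

With KWin holding Plane 39 the result is empty, which is the "some step must convert NV12 → RGB" consequence; with nothing claimed, Plane 39 is the only candidate. Whether an X server actually assigns it that way is one of the open questions, not something this model shows.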

Hypothesis the matrix tests

There are three potentially separable costs:

  1. The mandatory NV12 → RGB GL conversion, which is forced on Wayland-with-KWin because KWin must own the only NV12-LINEAR-capable plane on this hardware for its compositor framebuffer. This cost is structurally avoidable under X11 + non-compositing WM via hardware-plane-overlay (per the operator-supplied insight above). Whether browsers can be coaxed to use the X11 hardware-overlay path — rather than internally compositing to RGB before presenting — is browser-specific (see Open questions below).
  2. The fallback GL-composite cost when the hardware-overlay path doesn't engage. Both Wayland and X11 pay this when the buffer shape doesn't match a plane — it just runs in different processes (KWin under Wayland, browser under X11).
  3. The per-frame compositor overhead independent of NV12: dmabuf import, transaction apply, presentation-feedback wiring, frame-callback delivery — which the predecessor measured at ~30-37 % of kwin_wayland's CPU during steady-state video playback even when KWin only saw RGB surfaces.

The X11 hypothesis is strongest if cost (1) is dominant on the matrix's with-KWin cells AND the X11 cells trigger the hardware-overlay path. The X11 hypothesis is weakest if cost (1) is small and cost (3) is small — in which case the "cutting out KWin" experiment would show only marginal differences.

The matrix below is designed to surface which of (1) (2) (3) dominates per browser × decode path.

"Faster video display" is operationally a combination of:

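  • Effective fps actually rendered (= (totalVideoFrames − droppedVideoFrames) / elapsed_s from getVideoPlaybackQuality(), because totalVideoFrames counts dropped frames as well as presented ones; for a 30 fps source the upper bound is 30, and the question is how close playback gets).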
  • Drop count over the same 70 s window (droppedVideoFrames).
  • End-to-end latency if testable (commit → present; testable on Wayland via wp_presentation_feedback, testable on X11 via XPresent extension or RandR vblank events; protocol-side measurement under each display-server).
  • Compositor + browser CPU at steady state (the cost saved by cutting the compositor is the upper bound on the patch-payoff if a future campaign tries to optimise the compositor instead of removing it).
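The first two metrics in this list can be derived from the same pair of counters. The sketch below assumes that `totalVideoFrames` covers both presented and dropped frames (as the Media Playback Quality definition states), so dropped frames are subtracted before dividing; the function name, the dictionary keys and the sample counters are hypothetical, not measured values.

```python
def playback_metrics(total_video_frames: int,
                     dropped_video_frames: int,
                     elapsed_s: float,
                     source_fps: float = 30.0) -> dict:
    """Summarise getVideoPlaybackQuality() counters for one measurement window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    presented = total_video_frames - dropped_video_frames
    effective_fps = presented / elapsed_s
    return {
        "presented": presented,
        "effective_fps": effective_fps,
        # 1.0 means every source frame was shown on time.
        "fps_ratio": effective_fps / source_fps,
        "drop_fraction": (dropped_video_frames / total_video_frames
                          if total_video_frames else 0.0),
    }


if __name__ == "__main__":
    # Made-up counters for a 70 s window of a 30 fps source.
    metrics = playback_metrics(total_video_frames=2100,
                               dropped_video_frames=21,
                               elapsed_s=70.0)
    for key, value in metrics.items():
        print(f"{key}: {value}")
```

Keeping `fps_ratio` and `drop_fraction` separate makes it visible when a run both decodes fewer frames than expected and drops some of those it did decode.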

Experimental matrix

Six 2-axis cells (3 browsers × 2 decode paths) × 2 session conditions (with-KWin / without-KWin):

| Browser | Decode | with-KWin (Plasma Wayland) | without-KWin (X11 session, no compositor) |
|---|---|---|---|
| Brave 147 | full SW | C-W-brave-sw | C-X-brave-sw |
| Brave 147 | libva (if it works) | C-W-brave-libva | C-X-brave-libva |
| chromium-fourier 149 (Step 1 + Step 2) | full SW | C-W-chrf-sw | C-X-chrf-sw |
| chromium-fourier 149 | libva (Step 1 enables it) | C-W-chrf-libva | C-X-chrf-libva |
| Firefox | full SW | C-W-ff-sw | C-X-ff-sw |
| Firefox | libva | C-W-ff-libva | C-X-ff-libva |

The "(if it works)" / "where possible" qualifiers follow the operator's directive: libva on rockchip-drm RK3568 is only known to work on chromium-fourier (Step 1 ports libva-v4l2-request); for stock Brave 147 and stock Firefox, libva probably doesn't engage, in which case those cells are documented N/A. For Firefox, the Mesa-side libva-v4l2-request may make libva work via Mozilla's VAAPI backend even on stock Firefox; this is to be verified in Phase 0 inventory.
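The twelve cell identifiers follow a regular `C-<session>-<browser>-<decode>` pattern, so a run orchestrator could generate them instead of hard-coding them. The sketch below is one way to do that; the dictionary names and the `cell_ids` helper are hypothetical, and it does not decide which libva cells end up N/A.

```python
from itertools import product

# Short codes as used in the matrix, mapped to the labels from the table.
SESSIONS = {"W": "with-KWin (Plasma Wayland)",
            "X": "without-KWin (X11 session, no compositor)"}
BROWSERS = {"brave": "Brave 147",
            "chrf": "chromium-fourier 149",
            "ff": "Firefox"}
DECODES = {"sw": "full SW", "libva": "libva"}


def cell_ids():
    """Enumerate matrix cells row by row: browser, then decode, then session."""
    return [
        f"C-{session}-{browser}-{decode}"
        for browser, decode, session in product(BROWSERS, DECODES, SESSIONS)
    ]


if __name__ == "__main__":
    for cell in cell_ids():
        print(cell)
```

Iterating with the session code innermost keeps each with-KWin cell next to its without-KWin counterpart, which is the pairing the comparison needs.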

What "cutting out the KWin compositor" means

This campaign uses X11 session with no compositor in the display path as the "without-KWin" cell. Specifically:

  • Native Xorg server, NOT XWayland (XWayland would still go through KWin for display, defeating the purpose).
  • Window manager that does NOT composite by default — e.g. openbox, fluxbox, xfwm4-with-compositing-off, i3, twm. Plasma X11 uses kwin_x11 as compositing WM, which is still a "KWin compositor" — it does not satisfy "cut KWin out" and is excluded from the without-KWin cell.
  • Browser windowed (not fullscreen). Even on a non-compositing WM, fullscreen browsers may engage XPresent direct presentation paths — testing windowed isolates the baseline non-compositor windowed display path.

The exact WM choice is a Phase 0 inventory decision (which WMs are available on ohm, which install cleanly, which SDDM-advertised sessions exist). Default candidate: openbox.

Three plausible outcome shapes

  • (α) Without-KWin is materially faster across all 6 cells: confirms the KWin compositor cost is a real bottleneck on this hardware, and X11-session-without-compositor becomes the recommended daily-driver configuration for video work on PineTab2.
  • (β) Without-KWin is comparable or only marginally faster: the compositor isn't the bottleneck; the drop phenomenon is hardware/kernel/Mesa-bound, and the predecessor's Phase 8 closure stands.
  • (γ) Mixed picture per browser × decode path: e.g. libva paths benefit but SW paths don't; or Firefox benefits but chromium-class clients don't. Each cell becomes its own characterisation.
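Once per-cell results exist, sorting them into the three shapes is mechanical. The sketch below shows one possible rule, assuming each of the six browser × decode pairs has been reduced to a fractional gain of without-KWin over with-KWin. The `material_gain` threshold of 10 % is purely a placeholder: the clear-pass / clear-fail thresholds are a Phase 1 decision and are not defined in this document.

```python
def classify_outcome(gains: dict, material_gain: float = 0.10) -> str:
    """Map per-cell fractional gains to outcome shape alpha, beta or gamma.

    `gains` maps a browser/decode key (e.g. "brave-sw") to the fractional
    improvement of the without-KWin cell over the with-KWin cell.
    `material_gain` is a hypothetical threshold, not a campaign decision.
    """
    if not gains:
        raise ValueError("no cells measured")
    material = [gain >= material_gain for gain in gains.values()]
    if all(material):
        return "alpha"  # materially faster everywhere
    if not any(material):
        return "beta"   # comparable or only marginally faster
    return "gamma"      # mixed picture, characterise per cell


if __name__ == "__main__":
    # Made-up gains, for illustration only.
    example = {"brave-sw": 0.02, "chrf-libva": 0.25, "ff-sw": 0.01}
    print(classify_outcome(example))
```

Cells documented as N/A would simply be left out of `gains`, so a browser without a working libva path does not force the result towards (γ).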

Open questions before Phase 1 lock

The hardware-overlay-path mechanism is structurally available on X11 + non-compositing WM. Whether it actually engages for each of the three browsers is browser-specific and currently unknown:

  • Brave / Chromium ozone-x11: Chromium has overlay-support code (OverlayProcessor, GpuMemoryBufferManager, DCOMPSurface on Windows; on Linux X11 the path is via XPresent + DMA-BUF + OverlayCandidate). Whether Brave 147 / chromium-fourier 149 actually request hardware-overlay presentation for a windowed video element under X11 is open.
  • Firefox: the VAAPIVideoDecoder backend produces hardware-decoded NV12 dmabufs that the GL compositor consumes internally. Whether Firefox's X11 backend has a path to hand the dmabuf to the X server for hardware-overlay presentation (rather than internally compositing to RGB) is open. Mozilla has a MOZ_X11_EGL hint and a "hardware video overlay" pref but these are not universally engaged.
  • Reference clients: mpv with --vo=xv or --vo=gpu --hwdec=auto-copy --gpu-context=x11, or gst-play-1.0 with xvimagesink or glimagesink, are known-good X11 hardware-overlay paths. Adding mpv to the matrix as a reference client would isolate "does the X11 hardware-overlay path work AT ALL on this hardware" from "do browsers actually use it." If mpv hardware-overlays cleanly but browsers don't, the conclusion is "the X11 path is fast, but browsers leave the speedup on the table."

If the operator agrees, Phase 0 inventory should:

  1. Verify that Plane 39's NV12-LINEAR capability is reachable by X11 clients (it is for KWin Wayland; it should be for X11 too, since Plane 39 is just a DRM resource), and identify which X11 path actually programs it (modesetting Xorg driver + Option "PageFlip" "true", or DRI3-presented buffer ending up on Plane 39 via the X server's plane allocator).
  2. Inventory Brave's, chromium-fourier's, and Firefox's X11 overlay-presentation paths to see which (if any) request hardware-overlay presentation.
  3. Add mpv as a reference X11-overlay client to the matrix, so the campaign has a known-good comparison point.

What this question does NOT cover

For clarity, since the predecessor was specifically about the Wayland-overlay-subsurface composite path:

  • This campaign is not investigating the wp_subsurface route. The Wayland cell of the matrix (with-KWin) measures whatever the browser configuration produces under the existing Plasma Wayland session: windowed, default profile, default flags. It's a measurement of the as-shipped Plasma Wayland stack from the user's perspective, not a probe of a specific KWin code path.
  • The Δ_present-46 ms finding from the predecessor is testable as a free side-finding under both axes (Wayland and X11) but is not the campaign's primary question.
  • Daily-driver fitness (apps that break under X11, touchscreen behavior, multi-monitor edge cases, etc.) is not in scope. The campaign's deliverable is the matrix above; if any cell is decisively faster, daily-driver-fitness becomes a follow-up campaign.

What's NOT in scope (working assumption)

Until the research question is confirmed, the following are treated as out of scope so they don't slip into Phase 1 prematurely:

  • Patches to KWin, Xorg, kwin-fourier, qt6-base-fourier, or any other component on ohm. This is research, not patch-development. Per non-upstreaming default, MR/bug-report filing is explicitly tasked and not scheduled here.
  • The Δ_present-46 ms finding's investigation. It's a known hook from the predecessor; whether this campaign chases it depends on the locked research question.
  • Reverting predecessor tooling state. Governor, baloo, qt6-base-fourier, kwin-fourier stay as-is unless the operator decides otherwise.
  • Filing a bug for any of the predecessor's three documented candidate findings. The same non-upstreaming default applies.

What Phase 0 will deliver, regardless of framing

Even before the research question is locked, the following are useful Phase 0 deliverables that don't depend on the specific question:

  1. State snapshot of ohm under current Plasma Wayland captured at campaign start. This is the before photo for any future X11 vs Wayland comparison. Unattended-tractable (just scripted SSH).
  2. Inventory of available X11 paths on ohm: what packages are installed, what session candidates SDDM advertises, what would need to be installed to enable a Plasma X11 session, what alternate WMs are available. Read-only, unattended-tractable.
  3. Inventory of measurement instruments that work under X11: xtrace, xprop, xrandr --verbose --query, perf on Xorg PID, frame-timing extraction options. Read-only.
  4. A1 baseline under current Plasma Wayland: re-run a single rep of the predecessor's kwin_timing_nodebug condition immediately at the start of this campaign, so the comparison Wayland-vs-X11 has a same-session anchor. This is the "set the baseline before instrument changes" discipline from feedback_replicate_baseline_first.md.

These steps are unblocked. They don't commit to a specific research question and they produce evidence that's useful under any of the candidate framings.
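Deliverables 2 and 3 are read-only inventories, so they lend themselves to a small script run over SSH. The sketch below is one possible shape, assuming X11 session entries live as `.desktop` files under `/usr/share/xsessions` (a common default that display managers such as SDDM read, but unverified for ohm); the helper names are hypothetical, and the tool list is taken from deliverable 3.

```python
import shutil
from pathlib import Path

# Tools named in the measurement-instrument inventory above.
TOOLS = ("xtrace", "xprop", "xrandr", "perf")


def tool_inventory(tools=TOOLS) -> dict:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def session_candidates(session_dir="/usr/share/xsessions") -> list:
    """List session names from the .desktop files in `session_dir`.

    The default directory is an assumption about ohm's configuration.
    """
    directory = Path(session_dir)
    if not directory.is_dir():
        return []
    return sorted(entry.stem for entry in directory.glob("*.desktop"))


if __name__ == "__main__":
    for tool, path in tool_inventory().items():
        print(f"{tool}: {path or 'not installed'}")
    print("X11 sessions:", session_candidates())
```

Both helpers only read state, which fits the unattended-tractable constraint: nothing is installed and no session is switched.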