marfrit-packages/arch/chromium-fourier/KWIN_PIVOT.md
marfrit cd25d02e01
KWIN_PIVOT: Phase-2 findings — bug is in Qt 6, not KWin
Source-grep collapsed Phase 1+2 onto a single pass. KWin's own GL paths
use GL_R8 correctly (gltexture.cpp:61, shadowitem.cpp:494). The
glTexImage2D(GL_ALPHA) calls observed in the journal originate from
Qt 6:

- qtbase/src/opengl/qopengltextureglyphcache.cpp:111-117 — text glyph
  cache upload path. The #else branch (active when qtbase is built
  with QT_CONFIG(opengles2)) unconditionally uses GL_ALPHA, with no
  runtime check for ES context major version. Correct on ES 2.x;
  broken on ES 3.x where GL_ALPHA is no longer a valid glTexImage2D
  internalFormat.
- qtbase/src/gui/rhi/qrhigles2.cpp:1373-1378 — Qt-Quick-RHI sibling.
  Same logic, gated only on caps.coreProfile, missing the ES≥3 case.
- qtbase/src/opengl/qopengltextureuploader.cpp:253-257 — QImage→GL
  upload path; same shape.

KWin runs an ES 3.2 context on Mali-G52 panfrost (RK3566), Qt picks
GL_ALPHA, mesa returns GL_INVALID_VALUE, every dependent draw errors
at level 0, the compositor's frame-callback path stalls. KWin is the
visible victim because it's the compositor, but the bug is in Qt.

KWIN_PIVOT.md rewritten: the patch series and packaging now target
qt6-base-fourier instead of kwin-fourier. Three small hunks (~3 lines
each), runtime-safe via existing caps.gles + caps.ctxMajor / surface
format majorVersion checks. Upstream landing path: bugreports.qt.io
+ Gerrit change against qtbase dev branch.
2026-04-28 12:18:25 +00:00

# KWin pivot — fix the `glTexImage2D(GL_ALPHA)` stall
> **2026-04-28 update — Phase 2 collapsed onto Phase 1: it's not KWin.**
> Source-grep nailed the offender on the first pass. Real culprit:
> Qt 6's `QOpenGLTextureGlyphCache` (`src/opengl/qopengltextureglyphcache.cpp:111-117`)
> and `QRhiGles2::toGlTextureFormat` (`src/gui/rhi/qrhigles2.cpp:1373-1378`).
> KWin's own GL paths use `GL_R8` correctly (`src/opengl/gltexture.cpp:61`,
> `src/scene/shadowitem.cpp:494`). The pivot becomes a **Qt-fourier**
> patch, not a kwin-fourier one. Plan rewritten below; the pre-rewrite
> reproduction/triangulation phases are kept verbatim because they
> still apply to whatever lives downstream of the Qt fix.
>
> Qt's broken logic, in plain English: *"If qtbase was built with
> opengles2, just always use `GL_ALPHA`."* That's correct for an
> OpenGL ES 2.x context. It's wrong for OpenGL ES 3.x, where
> `GL_ALPHA` is no longer a valid `glTexImage2D` internalFormat
> (only sized formats — `GL_R8`, etc.). Mali / panfrost on RK3566
> exposes ES 3.2; KWin requests an ES 3.2 context; Qt picks
> `GL_ALPHA`; mesa returns `GL_INVALID_VALUE`; the texture is
> permanently broken; every dependent draw errors at level 0; the
> compositor's frame-callback path stalls. Affects every Qt 6
> application on Mali-class hardware that ends up rendering text
> through `QOpenGLTextureGlyphCache` (KDE's window decorations,
> Plasma overlays, Qt Quick scenegraph via RHI, ad nauseam) — KWin
> just happens to be the most visible victim because it's the
> compositor and its stall takes everyone else down with it.
## What we know
KWin 6.6.4-1 on Arch Linux ARM (Plasma 6.6.4-1, mesa 26.0.5-1, libdrm
2.4.131-1) on ohm (PineTab2 / RK3566 / panfrost) silently corrupts its
GL command queue mid-frame whenever a wayland client posts a video
buffer. The journal carries a rolling stream of:
```
kwin_wayland: 0x4: GL_INVALID_VALUE in glTexImage2D(internalFormat=GL_ALPHA)
kwin_wayland: 0x4: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture level 0) × N
```
`GL_ALPHA` is not a valid `internalFormat` for `glTexImage2D` under
**OpenGL ES 3.x** (it was the GLES 1.x/2.x single-channel alpha format;
GLES 3 deprecates it in favor of sized formats such as `GL_R8`).
Once the texture allocation fails, the `glTexSubImage2D` calls
that should populate it all error at level 0. KWin keeps retrying the
same broken upload every frame, never recovers, and the present-callback
path that depends on that texture stops acking client frames. Every
wayland video client deadlocks on the missing ack.

First occurrence in this box's journal: **2026-03-06**. The bug
predates any chromium-fourier work by roughly seven weeks.
## Triangulation already in hand
| Client | Outcome |
|---|---|
| chromium-fourier 149-r2 (with patch 3/3) | plays ~3 s @ 34.7 % CPU then renderer/GPU park in `futex_do_wait` |
| chromium-fourier 149-r2 (without patch 3/3) | plays ~10 s (slower path delays surfacing) then identical deadlock |
| VLC | `cannot convert decoder/filter output to any format supported by the output` → `could not initialize video chain` |
| mpv `--vo=null --hwdec=v4l2request` | `Could not create device.` (mpv-side bug, separate, unrelated) |
| ffmpeg `-hwaccel v4l2request -i bbb -f null -` | plays through clean at 36 fps; hardware path is healthy |

Decode path is healthy on this hardware. The wall is exclusively the
compositor's GL backend.
## Constraint: ohm is the only test box on hand
ampere (RK3588 / panthor) is in the boxes-from-Shenzhen pile, currently
DOWN. fresnel (RK3399 / Pinebook Pro) is offline. boltzmann (Rock 5
ITX+ build host) doesn't run KWin. We do every step on ohm; we accept
the wifi flakiness and the occasional reboot.
## Phase 1 — Reproduce outside chrome and bound the trigger (1 evening)
Goal: a deterministic, headless-or-near-headless reproduction that
doesn't require launching an 800-MB browser.
1. **Smallest-possible client.** Build a 50-line C wayland client that
creates a `wp_linux_dmabuf_v1` buffer, pumps frames at 30 fps, and
exits when KWin first errors. Use `weston-simple-dmabuf-egl` from
the `weston` package as a starting template — already does exactly
this but without our specific format/modifier matrix.
2. **Vary the format/modifier matrix.** Run the smallest-possible
client with each of: NV12 + LINEAR, NV12 + AFBC, NV12 + AFRC,
AR24 + LINEAR, XR24 + LINEAR. We already know NV12 paths trigger;
confirming AR24/XR24 do *not* trigger localizes the bug to KWin's
YUV import path (vs a generic dmabuf import bug).
3. **Vary the buffer dimensions.** Some KWin texture-cache paths
allocate fixed-size internal scratch textures; non-power-of-two,
non-multiple-of-16, or specifically odd-aspect cases sometimes
trigger paths that healthy aspect ratios skip. Test 1920×1080,
1280×720, 854×480, 640×360 and a deliberately weird 1366×768.
4. **Vary KWin scene type.** Switch
`kwin_wayland --scene-type=opengl` vs `--scene-type=opengl-es`
(current default on this hardware). If the bug only fires under
GLES, that's a strong signal — the offending site is in a
GLES-only fallback.

By the end of Phase 1 we should have a one-line `weston-simple-dmabuf-egl
-format=NV12 -modifier=…` that triggers the GL_ALPHA error within
seconds, plus a yes/no answer to "does AR24 also trigger".
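
The format/modifier and dimension sweeps in steps 2 and 3 multiply out to 25 runs, so it may help to generate the list rather than type it. The sketch below is only an illustration: `buildMatrix`, `TestCase` and the `aligned16` flag are hypothetical names, the fourcc packing follows the `drm_fourcc.h` convention, and the modifiers are kept as plain labels because the real AFBC/AFRC modifier values are not spelled out in this plan.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// DRM fourcc packing (drm_fourcc.h convention): first character in the low byte.
constexpr std::uint32_t fourcc(char a, char b, char c, char d) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(a))
         | static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 8
         | static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 16
         | static_cast<std::uint32_t>(static_cast<unsigned char>(d)) << 24;
}

// Hypothetical types for this sketch only.
struct FormatCase { const char *name; std::uint32_t code; const char *modifier; };
struct Size { int width; int height; };
struct TestCase { FormatCase format; Size size; bool aligned16; };

// Expands the Phase 1 matrix: 5 format/modifier pairs x 5 buffer sizes.
std::vector<TestCase> buildMatrix() {
    // Modifiers are labels here, not real 64-bit DRM modifier values.
    const std::array<FormatCase, 5> formats = {{
        {"NV12", fourcc('N', 'V', '1', '2'), "LINEAR"},
        {"NV12", fourcc('N', 'V', '1', '2'), "AFBC"},
        {"NV12", fourcc('N', 'V', '1', '2'), "AFRC"},
        {"AR24", fourcc('A', 'R', '2', '4'), "LINEAR"},
        {"XR24", fourcc('X', 'R', '2', '4'), "LINEAR"},
    }};
    const std::array<Size, 5> sizes = {{
        {1920, 1080}, {1280, 720}, {854, 480}, {640, 360}, {1366, 768},
    }};

    std::vector<TestCase> out;
    for (const auto &f : formats) {
        for (const auto &s : sizes) {
            // Flag sizes where both dimensions are multiples of 16.
            const bool aligned = (s.width % 16 == 0) && (s.height % 16 == 0);
            out.push_back({f, s, aligned});
        }
    }
    return out;
}
```

Each entry would then be fed to whatever client Phase 1 settles on; the `aligned16` flag just makes it easy to see afterwards whether failures cluster on the non-multiple-of-16 sizes.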
## Phase 2 — Identify the call site (1-2 evenings)
The crime scene is somewhere in `kwin/src/scene/*` or
`kwin/src/effects/*`. Suspects, ranked:
- **`SurfaceItemWayland::createPixmapTexture` → `GLTexture::create`
with `GL_ALPHA`.** This is the most likely path: KWin allocates a
fallback per-plane texture when the dmabuf import path can't take
the buffer whole. NV12 has a Y plane (single-channel) and a CbCr
plane (two-channel); historically the Y plane has been allocated as
`GL_ALPHA` in software fallbacks. If the EGL dmabuf import returned
`EGL_BAD_ATTRIBUTE` for `external_only` modifiers and KWin fell
through to per-plane, this is exactly where it would land.
- **`BlurEffect::initBlurTexture` / `BackgroundContrastEffect::*`.**
Single-channel noise textures for blur dither. Less likely (these
fire on every frame regardless of video clients) but listed for
completeness.
- **Window-decoration text glyph cache.** Qt's QGLTexture historically
requested `GL_ALPHA` for monochrome glyph atlases. Plasma 6 should
have moved to `GL_RED` long ago, but a stale code path in a
third-party theme or systray icon could still hit it.
- **Cursor texture upload via `wl_shm_pool` + ARGB8888.** KWin's
cursor scene sometimes uploads via glTexImage2D — but the format
there is `GL_RGBA`, not `GL_ALPHA`. Probably not the suspect.

Tooling to identify *which*:
1. **`apitrace trace --api egl kwin_wayland …`** then
`apitrace dump trace.trace | grep -B5 GL_ALPHA`. Apitrace gives
us the C++ call stack at the offending site if KWin was built with
debug symbols.
2. **`MESA_DEBUG=context KWIN_GL_DEBUG=1 kwin_wayland --replace`**
plus `glDebugMessageCallback` already installed in KWin's
`OpenGLBackend` will print the source/type/severity for each
`GL_INVALID_VALUE`. Whether the file/line in the message includes
the user-space caller depends on Mesa's debug-extension support;
on panfrost it usually does include the GL function name and an
ID, but not the C++ source — that is what apitrace adds.
3. **Build kwin from source** (`extra/kwin` PKGBUILD on Arch ARM,
patch in `-DDEBUG=ON`, `-DCMAKE_BUILD_TYPE=Debug`) so the call
stacks resolve to file:line.
## Phase 3 — Write the patch (½ evening once Phase 2 is done)
The Qt 6 fix is three ~3-line changes, runtime-safe, no new dependency.

**Fix #1 — `src/opengl/qopengltextureglyphcache.cpp` lines 111-117:**
```diff
 #if !QT_CONFIG(opengles2)
     const GLint internalFormat = isCoreProfile() ? GL_R8 : GL_ALPHA;
     const GLenum format = isCoreProfile() ? GL_RED : GL_ALPHA;
 #else
-    const GLint internalFormat = GL_ALPHA;
-    const GLenum format = GL_ALPHA;
+    // OpenGL ES 3.x deprecated GL_ALPHA as a glTexImage2D
+    // internalFormat; only true ES 2 contexts retain it. Use GL_R8
+    // + the matching swizzle (handled in the fragment shader's .r
+    // sample below) on ES 3+ hardware so Mali / panfrost / panthor
+    // GLES3 contexts stop emitting GL_INVALID_VALUE every frame.
+    const bool useR8 = ctx->format().majorVersion() >= 3;
+    const GLint internalFormat = useR8 ? GL_R8 : GL_ALPHA;
+    const GLenum format = useR8 ? GL_RED : GL_ALPHA;
 #endif
```
The downstream fragment shader path that samples this texture must
read `.r` instead of `.a` when `GL_R8` is used. Qt's text-rendering
fragment program already has both code paths conditioned on context
core-profile; the ES 3+ branch needs the same treatment. Lines
214-216 of the same file (the resize / re-upload path) need the
identical change.

**Fix #2 — `src/gui/rhi/qrhigles2.cpp` lines 1373-1378:**
```diff
     case QRhiTexture::RED_OR_ALPHA8:
-        *glintformat = caps.coreProfile ? GL_R8 : GL_ALPHA;
+        *glintformat = (caps.coreProfile || (caps.gles && caps.ctxMajor >= 3))
+            ? GL_R8 : GL_ALPHA;
         *glsizedintformat = *glintformat;
-        *glformat = caps.coreProfile ? GL_RED : GL_ALPHA;
+        *glformat = (caps.coreProfile || (caps.gles && caps.ctxMajor >= 3))
+            ? GL_RED : GL_ALPHA;
         *gltype = GL_UNSIGNED_BYTE;
         break;
```
`caps.gles` and `caps.ctxMajor` are populated at context creation
(qrhigles2.cpp:804 + :855); the disjunct is free.

**Fix #3 — `src/opengl/qopengltextureuploader.cpp` lines 253-257:**

This is the QImage→GL upload path (used by `QOpenGLPaintEngineEx`
and its descendants). Same pattern, same fix shape: extend the
"core profile or GLES2 fallback" branching to also consider GLES3+
as needing `GL_R8`.

If we want to be aggressive, we can collapse all three sites onto a
single `qt_gl_use_r8_for_alpha8(ctx)` helper in `qopenglhelper_p.h`
so future Qt versions don't drift apart again — but a minimal patch
should keep the three sites independent so each is reviewable in
isolation by the relevant Qt module owner.
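
For reference, the decision all three sites share can be written as one pure function. This is a sketch under stated assumptions, not Qt code: `ContextInfo`, `Alpha8Format`, `useR8ForAlpha8` and `alpha8Format` are hypothetical names, the struct stands in for whatever each site already knows about its context (`caps.gles`, `caps.coreProfile`, `caps.ctxMajor`, or the surface format's major version), and the GL enum values are copied from the Khronos headers so the snippet builds without GL installed.

```cpp
#include <cassert>

// GL enum values copied from the Khronos headers (no GL headers needed here).
namespace gl {
constexpr unsigned RED   = 0x1903;
constexpr unsigned ALPHA = 0x1906;
constexpr unsigned R8    = 0x8229;
}

// Hypothetical stand-in for the context facts each call site already has.
struct ContextInfo {
    bool gles;         // true for an OpenGL ES context
    bool coreProfile;  // true for a desktop core-profile context
    int majorVersion;  // context major version
};

// What an 8-bit single-channel upload should use, plus the channel the
// fragment shader has to sample for coverage.
struct Alpha8Format {
    unsigned internalFormat;
    unsigned format;
    char channel;  // 'r' when GL_R8 is used, 'a' when GL_ALPHA is used
};

// The predicate from Fix #2, factored out: desktop core profile or ES 3+.
constexpr bool useR8ForAlpha8(const ContextInfo &c) {
    return c.coreProfile || (c.gles && c.majorVersion >= 3);
}

constexpr Alpha8Format alpha8Format(const ContextInfo &c) {
    return useR8ForAlpha8(c) ? Alpha8Format{gl::R8, gl::RED, 'r'}
                             : Alpha8Format{gl::ALPHA, gl::ALPHA, 'a'};
}
```

With this shape, an ES 2 context and a desktop compatibility context keep `GL_ALPHA` exactly as today, and only the ES 3+ case changes. Returning the sampled channel alongside the formats keeps the `.r` versus `.a` shader decision tied to the same predicate instead of a second, separately maintained check.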
## Phase 4 — Ship and upstream (1 evening)
1. **Local Arch package** as `qt6-base-fourier` under
`marfrit-packages/arch/qt6-base-fourier/`, sibling to chromium-fourier
and firefox-fourier. PKGBUILD inherits from `extra/qt6-base`, drops
in the three patches above, bumps `pkgrel`. Same
`provides=qt6-base conflicts=qt6-base` pattern. Rebuild is heavy
(qtbase compile is ~30 minutes on boltzmann; ohm rebuild is
sustained-fan-territory and probably better avoided — boltzmann
builds the aarch64 .pkg.tar.zst, then we rsync it to ohm and
`pacman -U` there).
2. **Validate on ohm** by:
   - `pacman -U` the patched qt6-base.
   - Restart the Plasma session (logout / login) so the patched Qt 6
     libraries (`libQt6Gui.so`, `libQt6OpenGL.so`) are mapped into the
     fresh kwin_wayland.
   - Re-run `journalctl --user -u plasma-kwin_wayland.service -f` while
     opening any Qt 6 application that triggers text caching (a
     terminal, kate, the system tray); the GL_INVALID_VALUE spam
     should be **gone**.
   - Then run chromium-fourier 149-r2 + the bbb sample for a full
     minute uninterrupted. Success = smooth playback through to EOF
     at the 34.7 % CPU number, no stall, no audio static, no
     KWin-side errors in the journal.
3. **Upstream** via:
   - File on `bugreports.qt.io` against `QtBase: OpenGL`, with: the
     three diff hunks above, the exact behavior on Mali-G52 panfrost
     RK3566 mainline 6.19, an excerpt of the journal noise, and
     mesa 26.0.5 / qt 6.11.0 / kwin 6.6.4 versions.
   - Push a Gerrit change against `qtbase` `dev` branch
     (`codereview.qt-project.org`). Qt won't accept a GitHub pull
     request; the project lives on Gerrit. Create a Qt account,
     configure `git-review`, push.
   - Reference the chromium-fourier project as the discovery site
     so the next Mali-on-Linux Qt 6 user finds the breadcrumb.
4. **Document** the fix in
`chromium-fourier/docs/dmabuf-zero-copy.md` "Caveat — KWin 6.6.4
GLES backend on this hardware" subsection: replace the "to be
investigated" wording with "fixed by qt6-base-fourier; see
`marfrit-packages/arch/qt6-base-fourier/`. Upstream Qt change
pending review at `<gerrit-link>`."
## Reflection — corporate IT spec leakage, as predicted
The user's Phase-1 hypothesis was that this was the result of code
written by people who never read the spec they were claiming to
implement. They were correct, with one nuance: the Qt code did read
the spec — *the OpenGL ES 2.x spec*, where `GL_ALPHA` is genuinely
the canonical single-channel format for `glTexImage2D`. What it
never went back and re-read is the OpenGL ES 3.0 spec
(section 3.8.3, "Texture Image Specification"), where `GL_ALPHA`
is moved to the deprecated list and only sized formats are
retained. The bug is: *Qt 6 was written assuming "OpenGL ES" is
one thing, and never updated the assumption when ES 3 dropped the
unsized formats.* That's a corporate-IT-style architectural
shortcut: codify the world in two boxes (desktop vs ES), call it
done, ship. The fact that a category had a sub-category which moved
in 2012 is not the framework's job to track. Until the bug report
arrives and someone has to extend the boolean to a triple.
## What success looks like
`chromium-fourier-149-r2` on ohm under KWin Wayland plays
`bbb_1080p30_h264.mp4` end-to-end at the 34.7 % CPU figure already
recorded by the architectural validation, with zero `GL_INVALID_VALUE`
in the journal during playback. That number is the goal of the entire
chromium-fourier campaign for RK3566 — it is currently blocked on a
bug that has nothing to do with chromium.
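
The "zero `GL_INVALID_VALUE`" criterion is easy to check mechanically. A minimal sketch, assuming the journal text is captured to a stream or string first: `GlErrorCount`, `countGlErrors` and `countGlErrorsInText` are hypothetical helper names, and the matching is a plain substring search on the two error names seen in the journal excerpt earlier in this file.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Tally of the two error kinds seen in the kwin_wayland journal excerpt.
struct GlErrorCount {
    int invalidValue = 0;
    int invalidOperation = 0;
};

// Counts lines mentioning each error name; one hit per line at most.
GlErrorCount countGlErrors(std::istream &in) {
    GlErrorCount n;
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("GL_INVALID_VALUE") != std::string::npos)
            ++n.invalidValue;
        if (line.find("GL_INVALID_OPERATION") != std::string::npos)
            ++n.invalidOperation;
    }
    return n;
}

// Convenience wrapper for text that is already in memory.
GlErrorCount countGlErrorsInText(const std::string &text) {
    std::istringstream in(text);
    return countGlErrors(in);
}
```

Fed the journal captured during a playback run (for example by passing `std::cin` to `countGlErrors`), success means both counters stay at zero.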
## Scope discipline
We do not turn this into "audit the entire KWin GLES backend." If
Phase 2 surfaces additional latent GL_INVALID_* errors that don't
matter for video playback, we note them in the bug report and move
on. The pivot is explicitly "remove this single wall so the
chromium-fourier patch series can ship a working stack on RK3566."