cd25d02e01
Source-grep collapsed Phase 1+2 onto a single pass. KWin's own GL paths use GL_R8 correctly (gltexture.cpp:61, shadowitem.cpp:494). The glTexImage2D(GL_ALPHA) calls observed in the journal originate from Qt 6:

- qtbase/src/opengl/qopengltextureglyphcache.cpp:111-117: text glyph cache upload path. The #else branch (active when qtbase is built with QT_CONFIG(opengles2)) unconditionally uses GL_ALPHA, with no runtime check for ES context major version. Correct on ES 2.x; broken on ES 3.x where GL_ALPHA is no longer a valid glTexImage2D internalFormat.
- qtbase/src/gui/rhi/qrhigles2.cpp:1373-1378: Qt-Quick-RHI sibling. Same logic, gated only on caps.coreProfile, missing the ES≥3 case.
- qtbase/src/opengl/qopengltextureuploader.cpp:253-257: QImage→GL upload path; same shape.

KWin runs an ES 3.2 context on Mali-G52 panfrost (RK3566), Qt picks GL_ALPHA, mesa returns GL_INVALID_VALUE, every dependent draw errors at level 0, the compositor's frame-callback path stalls. KWin is the visible victim because it's the compositor, but the bug is in Qt.

KWIN_PIVOT.md rewritten: the patch series and packaging now target qt6-base-fourier instead of kwin-fourier. Three small hunks (~3 lines each), runtime-safe via existing caps.gles + caps.ctxMajor / surface format majorVersion checks. Upstream landing path: bugreports.qt.io + Gerrit change against qtbase dev branch.

# KWin pivot: fix the `glTexImage2D(GL_ALPHA)` stall

> **2026-04-28 update: Phase 2 collapsed onto Phase 1. It's not KWin.**
> Source-grep nailed the offender on the first pass. Real culprit:
> Qt 6's `QOpenGLTextureGlyphCache` (`src/opengl/qopengltextureglyphcache.cpp:111-117`)
> and `QRhiGles2::toGlTextureFormat` (`src/gui/rhi/qrhigles2.cpp:1373-1378`).
> KWin's own GL paths use `GL_R8` correctly (`src/opengl/gltexture.cpp:61`,
> `src/scene/shadowitem.cpp:494`). The pivot becomes a **Qt-fourier**
> patch, not a kwin-fourier one. Plan rewritten below; the pre-rewrite
> reproduction/triangulation phases are kept because they
> still apply to whatever lives downstream of the Qt fix.
>
> Qt's broken logic, in plain English: *"If qtbase was built with
> opengles2, just always use `GL_ALPHA`."* That's correct for an
> OpenGL ES 2.x context. It's wrong for OpenGL ES 3.x, where
> `GL_ALPHA` is no longer a valid `glTexImage2D` internalFormat
> (only sized formats such as `GL_R8`). Mali / panfrost on RK3566
> exposes ES 3.2; KWin requests an ES 3.2 context; Qt picks
> `GL_ALPHA`; mesa returns `GL_INVALID_VALUE`; the texture is
> permanently broken; every dependent draw errors at level 0; the
> compositor's frame-callback path stalls. This affects every Qt 6
> application on Mali-class hardware that ends up rendering text
> through `QOpenGLTextureGlyphCache` (KDE's window decorations,
> Plasma overlays, Qt Quick scenegraph via RHI, ad nauseam). KWin
> just happens to be the most visible victim because it's the
> compositor and its stall takes everyone else down with it.

## What we know

KWin 6.6.4-1 on Arch Linux ARM (Plasma 6.6.4-1, mesa 26.0.5-1, libdrm
2.4.131-1) on ohm (PineTab2 / RK3566 / panfrost) silently corrupts its
GL command queue mid-frame whenever a wayland client posts a video
buffer. The journal carries a rolling stream of:

```
kwin_wayland: 0x4: GL_INVALID_VALUE in glTexImage2D(internalFormat=GL_ALPHA)
kwin_wayland: 0x4: GL_INVALID_OPERATION in glTexSubImage2D(invalid texture level 0) × N
```

`GL_ALPHA` is not a valid `internalFormat` for `glTexImage2D` under
**OpenGL ES 3.x** (it was the GLES 1.x/2.x single-channel alpha format;
GLES3 deprecates it in favor of sized formats such as `GL_R8`,
`GL_LUMINANCE8_ALPHA8`, etc.). Once the texture allocation fails, the
`glTexSubImage2D` calls that should populate it all error at level 0.
KWin keeps retrying the same broken upload every frame, never recovers,
and the present-callback path that depends on that texture stops acking
client frames. Every wayland video client deadlocks on the missing ack.

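To separate the root failure from its fallout when reading a captured journal, the two message kinds quoted above can be tallied independently. The following is a minimal sketch, not part of KWin or Qt: `GlErrorTally` and `tallyGlErrors` are hypothetical names, and the only assumption about the journal is that each line contains one of the two message strings shown in the excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tally of the two message kinds seen in the kwin_wayland journal.
struct GlErrorTally {
    std::size_t failedAllocations = 0; // the glTexImage2D(GL_ALPHA) failure
    std::size_t dependentErrors = 0;   // glTexSubImage2D errors that follow from it
};

// Counts journal lines by substring match on the two messages quoted above.
// Lines matching neither message are ignored.
GlErrorTally tallyGlErrors(const std::vector<std::string>& journalLines) {
    static const std::string rootCause =
        "GL_INVALID_VALUE in glTexImage2D(internalFormat=GL_ALPHA)";
    static const std::string fallout =
        "GL_INVALID_OPERATION in glTexSubImage2D(invalid texture level 0)";

    GlErrorTally tally;
    for (const auto& line : journalLines) {
        if (line.find(rootCause) != std::string::npos) {
            ++tally.failedAllocations;
        } else if (line.find(fallout) != std::string::npos) {
            ++tally.dependentErrors;
        }
    }
    return tally;
}
```

A capture taken after a fix should report zero for both counters; a non-zero `dependentErrors` with zero `failedAllocations` would suggest the capture window started after the allocation had already failed.
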
First occurrence in this box's journal: **2026-03-06**. The bug
predates any chromium-fourier work by roughly seven weeks.

## Triangulation already in hand

| Client | Outcome |
|---|---|
| chromium-fourier 149-r2 (with patch 3/3) | plays ~3 s @ 34.7 % CPU then renderer/GPU park in `futex_do_wait` |
| chromium-fourier 149-r2 (without patch 3/3) | plays ~10 s (slower path delays surfacing) then identical deadlock |
| VLC | `cannot convert decoder/filter output to any format supported by the output` → `could not initialize video chain` |
| mpv `--vo=null --hwdec=v4l2request` | `Could not create device.` (mpv-side bug, separate, unrelated) |
| ffmpeg `-hwaccel v4l2request -i bbb -f null -` | plays through clean at 36 fps; hardware path is healthy |

Decode path is healthy on this hardware. The wall is exclusively the
compositor's GL backend.

## Constraint: ohm is the only test box on hand

ampere (RK3588 / panthor) is in the boxes-from-Shenzhen pile, currently
DOWN. fresnel (RK3399 / Pinebook Pro) is offline. boltzmann (Rock 5
ITX+ build host) doesn't run KWin. We do every step on ohm; we accept
the wifi flakiness and the occasional reboot.

## Phase 1: Reproduce outside chrome and bound the trigger (1 evening)

Goal: a deterministic, headless-or-near-headless reproduction that
doesn't require launching an 800 MB browser.

1. **Smallest-possible client.** Build a 50-line C wayland client that
   creates a `wp_linux_dmabuf_v1` buffer, pumps frames at 30 fps, and
   exits when KWin first errors. Use `weston-simple-dmabuf-egl` from
   the `weston` package as a starting template; it already does exactly
   this, but without our specific format/modifier matrix.
2. **Vary the format/modifier matrix.** Run the smallest-possible
   client with each of: NV12 + LINEAR, NV12 + AFBC, NV12 + AFRC,
   AR24 + LINEAR, XR24 + LINEAR. We already know NV12 paths trigger;
   confirming AR24/XR24 do *not* trigger localizes the bug to KWin's
   YUV import path (vs a generic dmabuf import bug).
3. **Vary the buffer dimensions.** Some KWin texture-cache paths
   allocate fixed-size internal scratch textures; non-power-of-two,
   non-multiple-of-16, or specifically odd-aspect cases sometimes
   trigger paths that healthy aspect ratios skip. Test 1920×1080,
   1280×720, 854×480, 640×360 and a deliberately weird 1366×768.
4. **Vary KWin scene type.** Switch
   `kwin_wayland --scene-type=opengl` vs `--scene-type=opengl-es`
   (current default on this hardware). If the bug only fires under
   GLES, that's a strong signal: the offending site is in a
   GLES-only fallback.

By the end of Phase 1 we should have a one-line `weston-simple-dmabuf-egl
-format=NV12 -modifier=…` that triggers the GL_ALPHA error within
seconds, plus a yes/no answer to "does AR24 also trigger".

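The format/modifier and dimension sweeps in steps 2 and 3 multiply out to 25 runs, which is easy to lose track of by hand. The sketch below enumerates them. It is an illustration under stated assumptions: `ReproCase` and `buildReproMatrix` are hypothetical names, the layout field is a label only (no modifier values are encoded), and `fourcc` packs characters the same way `fourcc_code()` does in `drm_fourcc.h`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Packs four characters into a DRM fourcc code, least significant byte
// first (the same layout fourcc_code() uses in drm_fourcc.h).
constexpr std::uint32_t fourcc(char a, char b, char c, char d) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(a))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(d)) << 24);
}
static_assert(fourcc('N', 'V', '1', '2') == 0x3231564E, "NV12 fourcc");

// Hypothetical record for one run of the reproduction client.
struct ReproCase {
    std::string format;  // fourcc name, e.g. "NV12"
    std::uint32_t code;  // packed fourcc value
    std::string layout;  // label only: LINEAR, AFBC or AFRC
    int width;
    int height;
};

// Builds the cross product of the five format/layout pairs from step 2
// and the five buffer sizes from step 3.
std::vector<ReproCase> buildReproMatrix() {
    struct FormatLayout { const char* format; const char* layout; };
    const FormatLayout pairs[] = {
        {"NV12", "LINEAR"}, {"NV12", "AFBC"}, {"NV12", "AFRC"},
        {"AR24", "LINEAR"}, {"XR24", "LINEAR"},
    };
    struct Size { int width; int height; };
    const Size sizes[] = {
        {1920, 1080}, {1280, 720}, {854, 480}, {640, 360}, {1366, 768},
    };

    std::vector<ReproCase> cases;
    for (const auto& p : pairs) {
        const char* f = p.format;
        for (const auto& s : sizes) {
            cases.push_back({p.format, fourcc(f[0], f[1], f[2], f[3]),
                             p.layout, s.width, s.height});
        }
    }
    return cases;
}
```

Each entry maps to one invocation of the reproduction client; recording trigger / no-trigger per entry gives the yes/no answer for AR24 and XR24 directly.
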
## Phase 2: Identify the call site (1–2 evenings)

The crime scene is somewhere in `kwin/src/scene/*` or
`kwin/src/effects/*`. Suspects, ranked:

- **`SurfaceItemWayland::createPixmapTexture` → `GLTexture::create`
  with `GL_ALPHA`.** This is the most likely path: KWin allocates a
  fallback per-plane texture when the dmabuf import path can't take
  the buffer whole. NV12 has a Y plane (single-channel) and a CbCr
  plane (two-channel); historically the Y plane has been allocated as
  `GL_ALPHA` in software fallbacks. If the EGL dmabuf import returned
  `EGL_BAD_ATTRIBUTE` for `external_only` modifiers and KWin fell
  through to per-plane, this is exactly where it would land.
- **`BlurEffect::initBlurTexture` / `BackgroundContrastEffect::*`.**
  Single-channel noise textures for blur dither. Less likely (these
  fire on every frame regardless of video clients) but listed for
  completeness.
- **Window-decoration text glyph cache.** Qt's QGLTexture historically
  requested `GL_ALPHA` for monochrome glyph atlases. Plasma 6 should
  have moved to `GL_RED` long ago, but a stale code path in a
  third-party theme or systray icon could still hit it.
- **Cursor texture upload via `wl_shm_pool` + ARGB8888.** KWin's
  cursor scene sometimes uploads via `glTexImage2D`, but the format
  there is `GL_RGBA`, not `GL_ALPHA`. Probably not the suspect.

Tooling to identify *which*:

1. **`apitrace trace --api egl kwin_wayland …`** then
   `apitrace dump trace.trace | grep -B5 GL_ALPHA`. Apitrace gives
   us the C++ call stack at the offending site if KWin was built with
   debug symbols.
2. **`MESA_GL_DEBUG=context KWIN_GL_DEBUG=1 kwin_wayland --replace`**
   plus `glDebugMessageCallback` already installed in KWin's
   `OpenGLBackend` will print the source/type/severity for each
   `GL_INVALID_VALUE`. Whether the file/line in the message includes
   the user-space caller depends on Mesa's debug-extension support;
   on panfrost it usually does include the GL function name and an
   ID, but not the C++ source. That is what apitrace adds.
3. **Build kwin from source** (`extra/kwin` PKGBUILD on Arch ARM,
   patch in `-DDEBUG=ON`, `-DCMAKE_BUILD_TYPE=Debug`) so the call
   stacks resolve to file:line.

## Phase 3: Write the patch (½ evening once Phase 2 is done)

The Qt 6 fix is three ~3-line changes, runtime-safe, no new dependency.

**Fix #1: `src/opengl/qopengltextureglyphcache.cpp` lines 111-117:**

```diff
 #if !QT_CONFIG(opengles2)
     const GLint internalFormat = isCoreProfile() ? GL_R8 : GL_ALPHA;
     const GLenum format = isCoreProfile() ? GL_RED : GL_ALPHA;
 #else
-    const GLint internalFormat = GL_ALPHA;
-    const GLenum format = GL_ALPHA;
+    // OpenGL ES 3.x deprecated GL_ALPHA as a glTexImage2D
+    // internalFormat; only true ES 2 contexts retain it. Use GL_R8
+    // + the matching swizzle (handled in the fragment shader's .r
+    // sample below) on ES 3+ hardware so Mali / panfrost / panthor
+    // GLES3 contexts stop emitting GL_INVALID_VALUE every frame.
+    const bool useR8 = ctx->format().majorVersion() >= 3;
+    const GLint internalFormat = useR8 ? GL_R8 : GL_ALPHA;
+    const GLenum format = useR8 ? GL_RED : GL_ALPHA;
 #endif
```

The downstream fragment shader path that samples this texture must
read `.r` instead of `.a` when `GL_R8` is used. Qt's text-rendering
fragment program already has both code paths conditioned on context
core-profile; the ES 3+ branch needs the same treatment. Lines
214-216 of the same file (the resize / re-upload path) need the
identical change.

**Fix #2: `src/gui/rhi/qrhigles2.cpp` lines 1373-1378:**

```diff
 case QRhiTexture::RED_OR_ALPHA8:
-    *glintformat = caps.coreProfile ? GL_R8 : GL_ALPHA;
+    *glintformat = (caps.coreProfile || (caps.gles && caps.ctxMajor >= 3))
+        ? GL_R8 : GL_ALPHA;
     *glsizedintformat = *glintformat;
-    *glformat = caps.coreProfile ? GL_RED : GL_ALPHA;
+    *glformat = (caps.coreProfile || (caps.gles && caps.ctxMajor >= 3))
+        ? GL_RED : GL_ALPHA;
     *gltype = GL_UNSIGNED_BYTE;
     break;
```

`caps.gles` and `caps.ctxMajor` are populated at context creation
(qrhigles2.cpp:804 + :855), so the extra disjunct costs nothing at
runtime.

**Fix #3: `src/opengl/qopengltextureuploader.cpp` lines 253-257:**

This is the QImage→GL upload path (used by `QOpenGLPaintEngineEx`
and its descendants). Same pattern, same fix shape: extend the
"core profile or GLES2 fallback" branching to also consider GLES3+
as needing `GL_R8`.

If we want to be aggressive, we can collapse all three sites onto a
single `qt_gl_use_r8_for_alpha8(ctx)` helper in `qopenglhelper_p.h`
so future Qt versions don't drift apart again. A minimal patch
should nevertheless keep the three sites independent so each is
reviewable in isolation by the relevant Qt module owner.

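The decision all three hunks share can be written once as a standalone function, which is roughly what a shared helper would reduce to. This is a sketch of that rule only, not Qt code: `ContextInfo`, `Alpha8Format`, `useR8ForAlpha8`, and `pickAlpha8Format` are hypothetical names, the GL enum values are redefined locally (matching the Khronos headers) so the block compiles without GL, and the condition is the one used in Fix #2.

```cpp
#include <cassert>
#include <cstdint>

// Values as defined in the Khronos GL headers, redefined here so the
// sketch compiles without them.
constexpr std::uint32_t kGlRed = 0x1903;   // GL_RED
constexpr std::uint32_t kGlAlpha = 0x1906; // GL_ALPHA
constexpr std::uint32_t kGlR8 = 0x8229;    // GL_R8

// Hypothetical stand-in for the context facts each patch site consults
// (caps.gles, caps.coreProfile, caps.ctxMajor in the RHI backend).
struct ContextInfo {
    bool gles;
    bool coreProfile;
    int majorVersion;
};

// Formats to pass to glTexImage2D, plus the channel the fragment
// shader has to sample for coverage.
struct Alpha8Format {
    std::uint32_t internalFormat;
    std::uint32_t format;
    char sampleChannel; // 'r' for GL_R8, 'a' for GL_ALPHA
};

// Same condition as Fix #2: desktop core profile, or GLES 3 and later.
constexpr bool useR8ForAlpha8(const ContextInfo& ctx) {
    return ctx.coreProfile || (ctx.gles && ctx.majorVersion >= 3);
}

constexpr Alpha8Format pickAlpha8Format(const ContextInfo& ctx) {
    return useR8ForAlpha8(ctx) ? Alpha8Format{kGlR8, kGlRed, 'r'}
                               : Alpha8Format{kGlAlpha, kGlAlpha, 'a'};
}

// Compile-time checks for the two contexts the plan cares about.
static_assert(pickAlpha8Format(ContextInfo{true, false, 3}).internalFormat == kGlR8,
              "GLES 3 should select GL_R8");
static_assert(pickAlpha8Format(ContextInfo{true, false, 2}).internalFormat == kGlAlpha,
              "GLES 2 should keep GL_ALPHA");
```

Returning the sample channel alongside the formats keeps the upload path and the shader path from drifting apart, which is the same coupling Fix #1 calls out for the `.r` versus `.a` read.
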
## Phase 4: Ship and upstream (1 evening)

1. **Local Arch package** as `qt6-base-fourier` under
   `marfrit-packages/arch/qt6-base-fourier/`, sibling to chromium-fourier
   and firefox-fourier. PKGBUILD inherits from `extra/qt6-base`, drops
   in the three patches above, bumps `pkgrel`. Same
   `provides=qt6-base conflicts=qt6-base` pattern. Rebuild is heavy:
   the qtbase compile is ~30 minutes on boltzmann, and an ohm rebuild
   is sustained-fan territory and probably better avoided. boltzmann
   builds the aarch64 `.pkg.tar.zst`, then we rsync it to ohm and
   `pacman -U` there.
2. **Validate on ohm** by:
   - `pacman -U` the patched qt6-base.
   - Restart Plasma session (logout / login) so the patched Qt
     libraries (`libQt6Gui.so`, `libQt6OpenGL.so`) are mapped into
     the fresh kwin_wayland.
   - Re-run `journalctl -u plasma-kwin_wayland.service -f` while
     opening any Qt 6 application that triggers text caching (a
     terminal, kate, the system tray). The GL_INVALID_VALUE spam
     should be **gone**.
   - Then run chromium-fourier 149-r2 + the bbb sample for a full
     minute uninterrupted. Success = smooth playback through to EOF
     at the 34.7 % CPU number, no stall, no audio static, no
     KWin-side errors in the journal.
3. **Upstream** via:
   - File on `bugreports.qt.io` against `QtBase: OpenGL`, with: the
     three diff hunks above, the exact behavior on Mali-G52 panfrost
     RK3566 mainline 6.19, an excerpt of the journal noise, and
     mesa 26.0.5 / qt 6.11.0 / kwin 6.6.4 versions.
   - Push a Gerrit change against `qtbase` `dev` branch
     (`codereview.qt-project.org`). Qt won't accept a GitHub pull
     request; they live on Gerrit. Create a Qt account, configure
     `git-review`, push.
   - Reference the chromium-fourier project as the discovery site
     so the next Mali-on-Linux Qt 6 user finds the breadcrumb.
4. **Document** the fix in
   `chromium-fourier/docs/dmabuf-zero-copy.md`, in the "Caveat — KWin 6.6.4
   GLES backend on this hardware" subsection: replace the "to be
   investigated" wording with "fixed by qt6-base-fourier; see
   `marfrit-packages/arch/qt6-base-fourier/`. Upstream Qt change
   pending review at `<gerrit-link>`."

## Reflection: corporate IT spec leakage, as predicted

The user's Phase-1 hypothesis was that this was the result of code
written by people who never read the spec they were claiming to
implement. They were correct, with one nuance: the Qt code did read
the spec, namely *the OpenGL ES 2.x spec*, where `GL_ALPHA` is
genuinely the canonical single-channel format for `glTexImage2D`.
What it never went back and re-read is the OpenGL ES 3.0 spec
(section 3.8.3, "Texture Image Specification"), where `GL_ALPHA`
is moved to the deprecated list and only sized formats are
retained. The bug is: *Qt 6 was written assuming "OpenGL ES" is
one thing, and never updated the assumption when ES 3 dropped the
unsized formats.* That's a corporate-IT-style architectural
shortcut: codify the world in two boxes (desktop vs ES), call it
done, ship. The fact that a category had a sub-category which moved
in 2012 is not the framework's job to track. Until the bug report
arrives and someone has to extend the boolean to a triple.

## What success looks like

`chromium-fourier-149-r2` on ohm under KWin Wayland plays
`bbb_1080p30_h264.mp4` end-to-end at the 34.7 % CPU figure already
recorded by the architectural validation, with zero `GL_INVALID_VALUE`
in the journal during playback. That number is the goal of the entire
chromium-fourier campaign for RK3566. It is currently blocked on a
bug that has nothing to do with chromium.

## Scope discipline

We do not turn this into "audit the entire KWin GLES backend." If
Phase 2 surfaces additional latent GL_INVALID_* errors that don't
matter for video playback, we note them in the bug report and move
on. The pivot is explicitly "remove this single wall so the
chromium-fourier patch series can ship a working stack on RK3566."