marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00


Phase 2 — situation analysis for iter8

Opened 2026-05-19 following the RED result in iter8 (phase0_findings_iter8.md).

What we tested

Per iter8 lock: run eglinfo and other GL clients via Zink-on-PanVk on ohm, force GL → Vulkan translation, verify Zink picks up PanVk-Bifrost (not llvmpipe).

What happened

Zink refused to load on top of PanVk-Bifrost. The error log:

MESA: error: Zink requires the nullDescriptor feature of KHR/EXT robustness2.

(Emitted twice — Zink probes twice during EGL setup.)

Mesa silently fell back to llvmpipe (the LLVM-based software rasterizer). EGL/GL still works, but every pixel is rendered on the CPU. For a workload like TuxRacer this would be unusably slow (single-digit FPS at best on the Cortex-A55s in RK3566).

Root cause (Mesa source)

src/panfrost/vulkan/panvk_vX_physical_device.c (Mesa main):

line  94:  .KHR_robustness2 = PAN_ARCH >= 10,    // extension advertisement (KHR)
line 194:  .EXT_robustness2 = PAN_ARCH >= 10,    // extension advertisement (EXT)
line 590:  .nullDescriptor  = PAN_ARCH >= 10,    // feature bit

These three lines gate the entire robustness2 path on PAN_ARCH >= 10, that is, on Mali architectures strictly newer than Valhall-JM. PAN_ARCH values:

  • 4/5 — Midgard
  • 6/7 — Bifrost ← Mali-G52 r1 on ohm is 7
  • 9 — Valhall (JM)
  • 10+ — Valhall (CSF) and fifth-gen

The gate >= 10 means only CSF-class Valhall and fifth-gen get robustness2. Bifrost is denied even though the underlying NIR/shader plumbing is already arch-agnostic:

panvk_vX_nir_lower_descriptors.c:1309:
  .null_descriptor_support = dev->vk.enabled_features.nullDescriptor,

panvk_vX_shader.c:1355:
  .robust_descriptors = dev->vk.enabled_features.nullDescriptor,

If the feature were exposed on Bifrost, these per-arch code paths would handle it. The gate appears to be conservative ("haven't tested on v6/v7/v9") rather than reflecting hardware incapability.

Why the gate exists

Speculation, but informed by iter1's findings: the entire Bifrost and Valhall-JM path is treated as "not well-tested"; see the arch gate at panvk_physical_device.c:413, which requires PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1. The robustness2 gate is part of the same defensive crouch: don't advertise features that haven't been bench-tested on these archs.

iter1..iter7 proved that the fundamentals of the Bifrost driver work. Specifically, iter4 (phase8_iteration4_close.md) showed COMBINED_IMAGE_SAMPLER descriptors work end-to-end. The risk that "null descriptor" specifically fails on Bifrost is real but bounded: null descriptor means "shader can attempt to read from an unbound descriptor binding without faulting", which is mostly a question of whether the descriptor table has a defined zero entry. PanVk-Bifrost's bifrost/panvk_vX_meta_desc_copy.c exists specifically for descriptor table manipulation, so the building blocks are there.

Why this matters

Without nullDescriptor:

  • Zink refuses to use PanVk-Bifrost ⇒ fallback to llvmpipe ⇒ no GPU acceleration for any GL app on Bifrost.
  • TuxRacer-via-Zink (the README operator-level motivation) is blocked.
  • Likely many other modern Vulkan apps that opt into robustness2 (it's a popular extension; conformance tests use it) will also break.

This is the campaign's first real driver gap. Everything before iter8 was "the gate is defensive but the driver works." This is "the gate genuinely blocks an end-user workload."

Proposed Phase 4 fix

Minimal patch: widen the three PAN_ARCH >= 10 gates so that they include Bifrost:

-      .KHR_robustness2 = PAN_ARCH >= 10,
+      .KHR_robustness2 = true,   /* or PAN_ARCH >= 6 if we want to keep Midgard out */

-      .EXT_robustness2 = PAN_ARCH >= 10,
+      .EXT_robustness2 = true,

-      .nullDescriptor = PAN_ARCH >= 10,
+      .nullDescriptor = true,

Risk register:

  1. Bifrost's descriptor table may handle null-binding-reads differently from Valhall-CSF. If the NIR null_descriptor_support path emits Bifrost ISA that returns zero on null reads (which is the spec'd behavior for nullDescriptor), this works. If Bifrost requires a different sequence and the lowering code doesn't have a v6/v7 branch, we'd get either wrong values or a GPU fault on shaders that read null descriptors.
  2. The KHR/EXT robustness2 extension also defines the robustImageAccess2 and robustBufferAccess2 features. The gated lines only mention nullDescriptor, but the extension's other features may have other code paths. Need to check the per-feature gate code.
  3. Untested code paths in panvk_vX_meta_desc_copy.c: the Bifrost-specific descriptor copy meta path was last touched in 2024 (per iter0 file header) and may have bit-rotted.

Mitigations:

  • Build the patch as a custom libvulkan_panfrost.so, install side-by-side via LD_LIBRARY_PATH, don't overwrite system Mesa. Easy rollback.
  • Validate stepwise: first vulkaninfo (confirms ext list), then eglinfo (confirms Zink picks PanVk), then es2_info (GL context creates), then a simple GL workload.
  • Keep the Vulkan validation layer enabled throughout.

What this needs from the operator

Phase 4 requires building Mesa from source on the workstation (or a beefier compile host: boltzmann, data, or the distcc cluster) and shipping the patched libvulkan_panfrost.so to ohm. That is a substantial action the operator should approve:

  • Compile time: Mesa is a big project; expect 30 to 90 min on a normal aarch64 builder, less with distcc or x86_64 cross-compile.
  • Install path: LD_LIBRARY_PATH=/home/mfritsche/panvk-patched-libs PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1 ... keeps it isolated. No system files modified.
  • If it works: publish via marfrit-packages eventually (per the libva-multiplanar fork model), and send the patch upstream to Collabora (or carry it out-of-tree per feedback_no_upstream).
  • If it doesn't: fall back to system Mesa, document what failed.

Status

iter8 is RED, characterized. Awaiting operator approval to proceed to Phase 4 (the build + patch step).

Reference