initial seed: retrofit campaign lineage from local working trees

panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00
parent 430d0da278
commit a4e7d8ab90
124 changed files with 22551 additions and 1 deletions
@@ -0,0 +1,108 @@
# Phase 2 — situation analysis for iter8
Opened **2026-05-19** following the RED result in iter8 ([phase0_findings_iter8.md](phase0_findings_iter8.md)).
## What we tested
Per iter8 lock: run `eglinfo` and other GL clients via Zink-on-PanVk on ohm, force GL → Vulkan translation, verify Zink picks up PanVk-Bifrost (not llvmpipe).
## What happened
Zink refused to load on top of PanVk-Bifrost. The error log:
```
MESA: error: Zink requires the nullDescriptor feature of KHR/EXT robustness2.
```
(Emitted twice — Zink probes twice during EGL setup.)
Mesa silently fell back to **llvmpipe** (the LLVM-based software rasterizer). EGL/GL still works, but every pixel is rendered on the CPU. For a workload like TuxRacer this would be unusably slow (single-digit FPS at best on the Cortex-A55s in RK3566).
## Root cause (Mesa source)
`src/panfrost/vulkan/panvk_vX_physical_device.c` (Mesa main):
```c
/* line 94  */ .KHR_robustness2 = PAN_ARCH >= 10, /* extension advertisement (KHR) */
/* line 194 */ .EXT_robustness2 = PAN_ARCH >= 10, /* extension advertisement (EXT) */
/* line 590 */ .nullDescriptor = PAN_ARCH >= 10,  /* feature bit */
```
Three lines gate the entire robustness2 path on Mali architectures **strictly newer than Valhall-JM**. PAN_ARCH values:
- 4/5 — Midgard
- 6/7 — Bifrost ← Mali-G52 r1 on ohm is 7
- 9 — Valhall (JM)
- 10+ — Valhall (CSF) and fifth-gen
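As a minimal sketch of the list above, the following standalone C helpers map a `PAN_ARCH` value to its generation name and evaluate the upstream `>= 10` condition. The names `pan_arch_generation` and `robustness2_gate_upstream` are hypothetical and do not exist in Mesa, where `PAN_ARCH` is a compile-time macro rather than a runtime argument.

```c
#include <stdbool.h>

/* Hypothetical helper (not a Mesa function): generation name for a
 * PAN_ARCH value, following the list above. */
static const char *pan_arch_generation(unsigned arch)
{
    switch (arch) {
    case 4:
    case 5:
        return "Midgard";
    case 6:
    case 7:
        return "Bifrost";
    case 9:
        return "Valhall (JM)";
    default:
        /* Values the list does not name (for example 8) are reported as unknown. */
        return arch >= 10 ? "Valhall (CSF) or fifth-gen" : "unknown";
    }
}

/* Models the upstream condition quoted from panvk_vX_physical_device.c. */
static bool robustness2_gate_upstream(unsigned arch)
{
    return arch >= 10;
}
```

For the Mali-G52 on ohm, `robustness2_gate_upstream(7)` evaluates to `false`, which is consistent with the Zink failure described above.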
The gate `>= 10` means **only CSF-class Valhall and fifth-gen get robustness2**. Bifrost is denied even though the underlying NIR/shader plumbing is already arch-agnostic:
```c
/* panvk_vX_nir_lower_descriptors.c:1309 */
.null_descriptor_support = dev->vk.enabled_features.nullDescriptor,
/* panvk_vX_shader.c:1355 */
.robust_descriptors = dev->vk.enabled_features.nullDescriptor,
```
If the feature were *exposed* on Bifrost, these code paths would pick it up, because they key off the enabled feature bit rather than the architecture. The gate appears to be conservative ("haven't tested on v6/v7/v9") rather than a reflection of hardware incapability.
## Why the gate exists
Speculation, but informed by [iter1's findings](phase8_iteration1_close.md): the entire Bifrost+Valhall-JM path was flagged as "not well-tested". See the [arch gate](phase0_findings.md) at `panvk_physical_device.c:413`, which requires `PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1`. The robustness2 gate is part of the same defensive crouch: don't advertise features that haven't been bench-tested on these archs.
iter1..iter7 proved that the *fundamentals* of the Bifrost driver work. Specifically, iter4 ([phase8_iteration4_close.md](phase8_iteration4_close.md)) showed that `COMBINED_IMAGE_SAMPLER` descriptors work end-to-end. The risk that "null descriptor" specifically fails on Bifrost is real but bounded: null descriptor means "shader can attempt to read from an unbound descriptor binding without faulting", which is mostly a question of whether the descriptor table has a defined zero entry. PanVk-Bifrost's `bifrost/panvk_vX_meta_desc_copy.c` exists specifically for descriptor table manipulation, so the building blocks are there.
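To make the "defined zero entry" idea concrete, here is a CPU-side model in plain C. It is not driver code: it assumes an unbound slot is represented by a `NULL` base pointer, every name in it is hypothetical, and the real Bifrost descriptor encoding is not modeled. The only behavior it illustrates is the Vulkan rule that, with `nullDescriptor` enabled, a load through a null descriptor returns zero instead of faulting.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 4

/* Hypothetical buffer descriptor: base == NULL marks an unbound slot. */
struct buffer_desc {
    const uint32_t *base;
    size_t count;
};

struct desc_table {
    struct buffer_desc slots[TABLE_SLOTS];
};

/* Loads element idx from the buffer bound at slot.
 * Unbound (null) descriptors yield 0, which is the behavior this sketch is
 * about. The slot and index range checks only keep the model memory-safe. */
static uint32_t robust_load(const struct desc_table *t, unsigned slot, size_t idx)
{
    if (slot >= TABLE_SLOTS)
        return 0;

    const struct buffer_desc *d = &t->slots[slot];
    if (d->base == NULL || idx >= d->count)
        return 0;

    return d->base[idx];
}
```

A zero-initialized `struct desc_table` has every slot unbound, so any `robust_load` on it returns 0; binding a slot means filling in `base` and `count`.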
## Why this matters
Without `nullDescriptor`:
- Zink refuses to use PanVk-Bifrost ⇒ fallback to llvmpipe ⇒ no GPU acceleration for any GL app on Bifrost.
- TuxRacer-via-Zink (the [README operator-level motivation](README.md)) is **blocked**.
- Likely many other modern Vulkan apps that opt into robustness2 (it's a popular extension; conformance tests use it) will also break.
This is the campaign's **first real driver gap**. Everything before iter8 was "the gate is defensive but the driver works." This is "the gate genuinely blocks an end-user workload."
## Proposed Phase 4 fix
**Minimal patch:** widen the three `PAN_ARCH >= 10` gates so that they include Bifrost:
```diff
- .KHR_robustness2 = PAN_ARCH >= 10,
+ .KHR_robustness2 = true, /* or PAN_ARCH >= 6 if we want to keep Midgard out */
- .EXT_robustness2 = PAN_ARCH >= 10,
+ .EXT_robustness2 = true,
- .nullDescriptor = PAN_ARCH >= 10,
+ .nullDescriptor = true,
```
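The two replacement conditions mentioned in the patch (`true` and `PAN_ARCH >= 6`) differ only in whether Midgard passes the gate. The sketch below models that difference at runtime; the enum and function names are hypothetical, and it says nothing about whether PanVk is actually built for Midgard.

```c
#include <stdbool.h>

/* Hypothetical policy names for this sketch; Mesa has no such enum. */
enum gate_policy {
    GATE_BIFROST_AND_UP, /* PAN_ARCH >= 6 */
    GATE_ALWAYS,         /* true          */
};

/* Returns whether the given policy would advertise robustness2 on arch. */
static bool gate_admits(enum gate_policy policy, unsigned arch)
{
    switch (policy) {
    case GATE_BIFROST_AND_UP:
        return arch >= 6;
    case GATE_ALWAYS:
        return true;
    }
    return false; /* unreachable for valid policies */
}
```

Both policies admit v7 (the Mali-G52 on ohm). Only `GATE_ALWAYS` admits v4/v5.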
Risk register:
1. **Bifrost's descriptor table may handle null-binding-reads differently from Valhall-CSF.** If the NIR `null_descriptor_support` path emits Bifrost ISA that returns zero on null reads (which is the spec'd behavior for `nullDescriptor`), this works. If Bifrost requires a different sequence and the lowering code doesn't have a v6/v7 branch, we'd get either wrong values or a GPU fault on shaders that read null descriptors.
2. **The KHR/EXT robustness2 extension also exposes the `robustBufferAccess2` and `robustImageAccess2` features.** The gate only mentions `nullDescriptor`, but the extension's other features may have other code paths. Need to check the per-feature gate code.
3. **Untested code paths in panvk_vX_meta_desc_copy.c.** The Bifrost-specific descriptor copy meta path was last touched in 2024 (per iter0 file header) and may have bit-rotted.
Mitigations:
- Build the patch as a custom libvulkan_panfrost.so, install side-by-side via `LD_LIBRARY_PATH`, don't overwrite system Mesa. Easy rollback.
- Validate stepwise: first vulkaninfo (confirms ext list), then eglinfo (confirms Zink picks PanVk), then es2_info (GL context creates), then a simple GL workload.
- Validation layer continuously enabled.
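The stepwise validation can be sketched as a small POSIX C runner that stops at the first failing step. `run_steps` and `validation_steps` are hypothetical names, and the command lines assume the tools are on `PATH`. The environment that selects the patched library and Zink is left to the caller. The runner checks exit status only, so confirming the extension list or the selected renderer still means reading the tool output. The final "simple GL workload" step is omitted because the text does not name one.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Assumed command lines for the first three steps; adjust to the actual setup. */
const char *const validation_steps[] = {
    "vulkaninfo > /dev/null", /* step 1: Vulkan loader and driver come up */
    "eglinfo > /dev/null",    /* step 2: EGL initializes */
    "es2_info > /dev/null",   /* step 3: a GLES2 context can be created */
};

/* Runs each command via system(). Returns the index of the first step that
 * does not exit with status 0, or n when every step succeeds. */
static size_t run_steps(const char *const *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int status = system(cmds[i]);
        if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return i;
    }
    return n;
}
```

Calling `run_steps(validation_steps, 3)` returns 3 when all three steps pass; any smaller value is the index of the step to investigate.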
## What this needs from the operator
Building Mesa from source on the workstation (or a beefier compile host — `boltzmann`, `data`, distcc cluster) and shipping the patched `libvulkan_panfrost.so` to ohm. That's a **substantial action** the operator should approve:
- **Compile time:** Mesa is a big project; expect 30-90 min on a normal aarch64 builder, less with distcc or x86_64 cross-compile.
- **Install path:** `LD_LIBRARY_PATH=/home/mfritsche/panvk-patched-libs PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1 ...` keeps it isolated. No system files modified.
- **If it works:** publish via marfrit-packages eventually (per the libva-multiplanar fork model), feed Collabora the patch upstream (or carry out-of-tree per `feedback_no_upstream`).
- **If it doesn't:** fall back to system Mesa, document what failed.
## Status
iter8 is **RED, characterized.** Awaiting operator approval to proceed to Phase 4 (the build + patch step).
## Reference
- Phase 0 lock: [phase0_findings_iter8.md](phase0_findings_iter8.md)
- Evidence: [phase0_evidence/iter8_zink_failure.txt](phase0_evidence/iter8_zink_failure.txt)
- Prior cumulative state: [phase8_iteration7_close.md](phase8_iteration7_close.md)
- Mesa source paths (local clone): `~/src/mesa-ref/mesa/src/panfrost/vulkan/`