marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00

# Phase 2 — situation analysis for iter8
Opened **2026-05-19** following the RED result in iter8 ([phase0_findings_iter8.md](phase0_findings_iter8.md)).
## What we tested
Per iter8 lock: run `eglinfo` and other GL clients via Zink-on-PanVk on ohm, force GL → Vulkan translation, verify Zink picks up PanVk-Bifrost (not llvmpipe).
## What happened
Zink refused to load on top of PanVk-Bifrost. The error log:
```
MESA: error: Zink requires the nullDescriptor feature of KHR/EXT robustness2.
```
(Emitted twice — Zink probes twice during EGL setup.)
Mesa silently fell back to **llvmpipe** (the LLVM-based software rasterizer). EGL/GL still works, but every pixel is rendered on the CPU. For a workload like TuxRacer this would be unusably slow (single-digit FPS at best on the Cortex-A55s in RK3566).
## Root cause (Mesa source)
`src/panfrost/vulkan/panvk_vX_physical_device.c` (Mesa main):
```c
.KHR_robustness2 = PAN_ARCH >= 10, /* line 94: extension advertisement (KHR) */
.EXT_robustness2 = PAN_ARCH >= 10, /* line 194: extension advertisement (EXT) */
.nullDescriptor = PAN_ARCH >= 10,  /* line 590: feature bit */
```
Three lines gate the entire robustness2 path on Mali architectures **strictly newer than Valhall-JM**. PAN_ARCH values:
- 4/5 — Midgard
- 6/7 — Bifrost ← Mali-G52 r1 on ohm is 7
- 9 — Valhall (JM)
- 10+ — Valhall (CSF) and fifth-gen
The gate `>= 10` means **only CSF-class Valhall and fifth-gen get robustness2**. Bifrost is denied even though the underlying NIR/shader plumbing is already arch-agnostic:
```c
/* panvk_vX_nir_lower_descriptors.c:1309 */
.null_descriptor_support = dev->vk.enabled_features.nullDescriptor,
/* panvk_vX_shader.c:1355 */
.robust_descriptors = dev->vk.enabled_features.nullDescriptor,
```
If the feature were *exposed* on Bifrost, these per-arch code paths would handle it. The gate appears to be conservative ("haven't tested on v6/v7/v9") rather than reflecting hardware incapability.
## Why the gate exists
Speculation, but informed by [iter1's findings](phase8_iteration1_close.md): the entire Bifrost+Valhall-JM path is treated as "not well-tested". See the [arch gate](phase0_findings.md) at `panvk_physical_device.c:413`, which requires `PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1`. The robustness2 gate is part of the same defensive crouch: don't advertise features that haven't been bench-tested on these archs.
iter1 through iter7 proved that the *fundamentals* of the Bifrost driver work. Specifically, iter4 ([phase8_iteration4_close.md](phase8_iteration4_close.md)) showed `COMBINED_IMAGE_SAMPLER` descriptors work end-to-end. The risk that "null descriptor" specifically fails on Bifrost is real but bounded: null descriptor means "shader can attempt to read from an unbound descriptor binding without faulting", which is mostly a question of whether the descriptor table has a defined zero entry. PanVk-Bifrost's `bifrost/panvk_vX_meta_desc_copy.c` exists specifically for descriptor table manipulation, so the building blocks are there.
## Why this matters
Without `nullDescriptor`:
- Zink refuses to use PanVk-Bifrost ⇒ fallback to llvmpipe ⇒ no GPU acceleration for any GL app on Bifrost.
- TuxRacer-via-Zink (the [README operator-level motivation](README.md)) is **blocked**.
- Likely many other modern Vulkan apps that opt into robustness2 (it's a popular extension; conformance tests use it) will also break.
This is the campaign's **first real driver gap**. Everything before iter8 was "the gate is defensive but the driver works." This is "the gate genuinely blocks an end-user workload."
## Proposed Phase 4 fix
**Minimal patch:** flip the three `PAN_ARCH >= 10` to a wider range that includes Bifrost:
```diff
- .KHR_robustness2 = PAN_ARCH >= 10,
+ .KHR_robustness2 = true, /* or PAN_ARCH >= 6 if we want to keep Midgard out */
- .EXT_robustness2 = PAN_ARCH >= 10,
+ .EXT_robustness2 = true,
- .nullDescriptor = PAN_ARCH >= 10,
+ .nullDescriptor = true,
```
Risk register:
1. **Bifrost's descriptor table may handle null-binding-reads differently from Valhall-CSF.** If the NIR `null_descriptor_support` path emits Bifrost ISA that returns zero on null reads (which is the spec'd behavior for `nullDescriptor`), this works. If Bifrost requires a different sequence and the lowering code doesn't have a v6/v7 branch, we'd get either wrong values or a GPU fault on shaders that read null descriptors.
2. **KHR/EXT robustness2 also defines the `robustImageAccess2` and `robustBufferAccess2` features.** The gate quoted above only covers `nullDescriptor`, but the extension's other two features may have their own code paths. Need to check the per-feature gate code.
3. **Untested code paths in panvk_vX_meta_desc_copy.c** — the Bifrost-specific descriptor copy meta path was last touched 2024 (per iter0 file header). May have bit-rotted.
Mitigations:
- Build the patch as a custom libvulkan_panfrost.so, install side-by-side via `LD_LIBRARY_PATH`, don't overwrite system Mesa. Easy rollback.
- Validate stepwise: first vulkaninfo (confirms ext list), then eglinfo (confirms Zink picks PanVk), then es2_info (GL context creates), then a simple GL workload.
- Validation layer continuously enabled.
## What this needs from the operator
Building Mesa from source on the workstation (or a beefier compile host — `boltzmann`, `data`, distcc cluster) and shipping the patched `libvulkan_panfrost.so` to ohm. That's a **substantial action** the operator should approve:
- **Compile time:** Mesa is a big project; expect 30-90 min on a normal aarch64 builder, less with distcc or x86_64 cross-compile.
- **Install path:** `LD_LIBRARY_PATH=/home/mfritsche/panvk-patched-libs PAN_I_WANT_A_BROKEN_VULKAN_DRIVER=1 ...` keeps it isolated. No system files modified.
- **If it works:** publish via marfrit-packages eventually (per the libva-multiplanar fork model), feed Collabora the patch upstream (or carry out-of-tree per `feedback_no_upstream`).
- **If it doesn't:** fall back to system Mesa, document what failed.
## Status
iter8 is **RED, characterized.** Awaiting operator approval to proceed to Phase 4 (the build + patch step).
## Reference
- Phase 0 lock: [phase0_findings_iter8.md](phase0_findings_iter8.md)
- Evidence: [phase0_evidence/iter8_zink_failure.txt](phase0_evidence/iter8_zink_failure.txt)
- Prior cumulative state: [phase8_iteration7_close.md](phase8_iteration7_close.md)
- Mesa source paths (local clone): `~/src/mesa-ref/mesa/src/panfrost/vulkan/`