panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.
This retrofit imports:
- mesa-panvk-bifrost/: r1..r4 era phase docs (iter1..iter18). The libmali
  stub blobs at iter18/blob/ are excluded; 109MB of RE artifacts are
  replaced with a README pointer.
- mesa-panvk-bifrost-video/: sibling campaign phase docs + probe
- evidence/: frozen .tgz source snapshots at each milestone (basis for
  the 0005 patch diff generation)
Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.
Total: 1.9 MB across 124 files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 1 — source map for iter17
pan_nir_lower_xfb.c (80 LoC)
Anatomy:
| Lines | Function | What it does |
|---|---|---|
| 9-40 | lower_xfb_output | Per (output, channel) → emit ONE store_global |
| 42-77 | lower_xfb | Per intrinsic: handle load_vertex_id rewrite + dispatch to lower_xfb_output for each non-zero channel in the nir_io_xfb annotation |
| 79-84 | pan_nir_lower_xfb | Top-level wrapper calling nir_shader_intrinsics_pass |
Core formula (lines 23-34)
nir_def *index = nir_iadd(b,
nir_imul(b, nir_load_instance_id(b), nir_load_num_vertices(b)),
nir_load_raw_vertex_id_pan(b));
nir_def *addr = xfb_address[buffer] + index * stride + offset_bytes;
nir_store_global(b, value, addr);
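The address arithmetic above can be modelled on the CPU as follows. This is a minimal sketch, not code from Mesa: the name xfb_store_address and its parameter names are illustrative, and the precondition assumes a non-indexed draw where the raw vertex ID is below num_vertices.

```c
#include <assert.h>
#include <stdint.h>

/* CPU-side model of the address the lowered store writes to.
 * Names are hypothetical; only the arithmetic mirrors the formula above. */
static uint64_t
xfb_store_address(uint64_t buffer_base, uint32_t stride, uint32_t offset_bytes,
                  uint32_t instance_id, uint32_t num_vertices,
                  uint32_t raw_vertex_id)
{
   /* Assumption: non-indexed draw, so vertex IDs are dense in [0, N). */
   assert(raw_vertex_id < num_vertices);

   /* Linear vertex index across all instances of the draw. */
   uint64_t index = (uint64_t)instance_id * num_vertices + raw_vertex_id;

   return buffer_base + index * stride + offset_bytes;
}
```

For instance 2 of a 3-vertex draw, vertex 1 lands at linear index 7, so a 16-byte stride and a 4-byte offset give base + 116.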
Critical observation: nir_load_num_vertices(b) is a sysval that already exists as iter13's panvk_graphics_sysvals.vs.num_vertices. iter16's design added a second sysval (xfb.decomposed_count) for the override case. iter17 doesn't need that one: num_vertices keeps holding input_count, and the decomposition arithmetic happens in the shader, driven by a third sysval, vs.xfb_topology.
NIR builder pattern we'll use
For our panvk-specific replacement pass, the existing single store becomes:
nir_def *topology = load_sysval(b, vs.xfb_topology); /* uint32 */
/* Branch per topology family. Each branch emits 1-3 (or more for TRI_FAN)
 * conditional stores per VS invocation. */
nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
{
   emit_tri_strip_stores(b, /* contribution arithmetic */);
}
nir_push_else(b, NULL);
{
   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
   {
      emit_line_strip_stores(b, ...);
   }
   /* ... etc per topology, each nested in the previous else ... */
   nir_pop_if(b, NULL); /* closes the LINE_STRIP if */
}
nir_pop_if(b, NULL); /* closes the TRI_STRIP if/else */
Per-vertex contribution map
For each affected topology, input vertex v contributes to a small set of (primitive_idx, slot) pairs.
TRIANGLE_STRIP (canonical case)
Decomposition: prim p emits {p, p+1+p%2, p+2-p%2} (even/odd winding flip).
Inverse — for input vertex v on a strip of N vertices, contributes to:
| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N−3 | 0 |
| p = v − 1 | 1 ≤ v ≤ N−2 | 1 if (v−1) even, else 2 |
| p = v − 2 | 2 ≤ v ≤ N−1 | 2 if (v−2) even, else 1 |
Up to 3 stores per VS invocation. Each store guarded by the eligibility predicate.
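The inverse table can be checked against the forward decomposition with a small CPU model. This is a sketch under assumptions: struct xfb_contrib, tri_strip_contribs and tri_strip_roundtrip_ok are hypothetical names, not part of the planned pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct xfb_contrib {
   uint32_t prim; /* output primitive index */
   uint32_t slot; /* position inside that primitive: 0, 1 or 2 */
};

/* Fill out[] with every (prim, slot) that input vertex v lands in for a
 * triangle strip of n vertices. Returns the number of entries (0..3). */
static unsigned
tri_strip_contribs(uint32_t v, uint32_t n, struct xfb_contrib out[3])
{
   unsigned count = 0;

   assert(v < n);

   /* p = v: eligible while 0 <= v <= n - 3 */
   if (v + 3 <= n)
      out[count++] = (struct xfb_contrib){ v, 0 };

   /* p = v - 1: eligible while 1 <= v <= n - 2 */
   if (v >= 1 && v + 2 <= n)
      out[count++] = (struct xfb_contrib){ v - 1, (v - 1) % 2 == 0 ? 1u : 2u };

   /* p = v - 2: eligible while 2 <= v <= n - 1 (upper bound holds as v < n) */
   if (v >= 2)
      out[count++] = (struct xfb_contrib){ v - 2, (v - 2) % 2 == 0 ? 2u : 1u };

   return count;
}

/* Every contribution must map back to v under the forward rule
 * {p, p+1+p%2, p+2-p%2}, and the total must be 3 slots per primitive. */
static bool
tri_strip_roundtrip_ok(uint32_t n)
{
   unsigned total = 0;

   for (uint32_t v = 0; v < n; v++) {
      struct xfb_contrib c[3];
      unsigned count = tri_strip_contribs(v, n, c);

      for (unsigned i = 0; i < count; i++) {
         const uint32_t p = c[i].prim;
         const uint32_t fwd[3] = { p, p + 1 + p % 2, p + 2 - p % 2 };

         if (fwd[c[i].slot] != v)
            return false;
      }
      total += count;
   }

   return total == (n >= 3 ? 3 * (n - 2) : 0);
}
```

In the shader, each of the three guarded assignments would become one conditional store.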
LINE_STRIP
Decomposition: prim p emits {p, p+1}. Vertex v contributes to:
| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N−2 | 0 |
| p = v − 1 | 1 ≤ v ≤ N−1 | 1 |
Up to 2 stores.
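A CPU sketch of the same scatter, with one loop iteration standing in for one VS invocation. The function name and the output layout (index = prim * 2 + slot, i.e. a line-list ordering) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scatter n line-strip vertices into a line-list layout.
 * out must hold 2 * (n - 1) entries. Returns the primitive count. */
static uint32_t
line_strip_scatter(const uint32_t *in, uint32_t n, uint32_t *out)
{
   assert(in != NULL && out != NULL);

   if (n < 2)
      return 0;

   for (uint32_t v = 0; v < n; v++) {
      /* p = v, slot 0: eligible while v <= n - 2 */
      if (v + 2 <= n)
         out[2 * v + 0] = in[v];

      /* p = v - 1, slot 1: eligible while v >= 1 */
      if (v >= 1)
         out[2 * (v - 1) + 1] = in[v];
   }

   return n - 1;
}
```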
TRIANGLE_FAN — the awkward case
Decomposition: prim p emits {p+1, p+2, 0}. Vertex v contributes to:
| Primitive | Eligibility | Slot |
|---|---|---|
| p = v − 1 | 1 ≤ v ≤ N−2 | 0 |
| p = v − 2 | 2 ≤ v ≤ N−1 | 1 |
| p = any in [0, N−2) | v == 0 | 2 |
The central vertex (v=0) contributes to ALL primitives as slot 2. That's O(N) stores from a single VS invocation, requiring a NIR loop bounded by num_vertices.
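The fan case, including the O(N) loop for the central vertex, can be sketched on the CPU like this. The function name and the output layout (index = prim * 3 + slot) are illustrative assumptions; the inner loop corresponds to the NIR loop the shader would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scatter n triangle-fan vertices into a triangle-list layout.
 * out must hold 3 * (n - 2) entries. Returns the primitive count. */
static uint32_t
tri_fan_scatter(const uint32_t *in, uint32_t n, uint32_t *out)
{
   assert(in != NULL && out != NULL);

   if (n < 3)
      return 0;

   const uint32_t num_prims = n - 2;

   for (uint32_t v = 0; v < n; v++) {
      /* p = v - 1, slot 0: eligible while 1 <= v <= n - 2 */
      if (v >= 1 && v + 2 <= n)
         out[3 * (v - 1) + 0] = in[v];

      /* p = v - 2, slot 1: eligible while 2 <= v <= n - 1 */
      if (v >= 2)
         out[3 * (v - 2) + 1] = in[v];

      /* Central vertex: slot 2 of every primitive. */
      if (v == 0) {
         for (uint32_t p = 0; p < num_prims; p++)
            out[3 * p + 2] = in[0];
      }
   }

   return num_prims;
}
```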
Adjacency variants
- LINE_LIST_WITH_ADJACENCY: prim p emits {4p+1, 4p+2}. Vertex v contributes only if (v%4 ∈ {1, 2}): O(1) stores.
- LINE_STRIP_WITH_ADJACENCY: prim p emits {p+1, p+2}. Similar to LINE_STRIP shifted by 1. O(1) stores.
- TRIANGLE_LIST_WITH_ADJACENCY: prim p emits {6p, 6p+2, 6p+4}. Vertex v contributes only if (v%6 ∈ {0, 2, 4}): O(1) stores.
- TRIANGLE_STRIP_WITH_ADJACENCY: prim p emits {2p, 2p+2, 2p+4} for even p; {2p, 2p+4, 2p+2} for odd p. O(1) stores per vertex.
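For the two list-with-adjacency variants the inverse is a plain modulo test. A minimal sketch follows; the function names are hypothetical, and dropping a trailing incomplete primitive (p >= n / 4 or p >= n / 6) is an assumption about how partial input is handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* LINE_LIST_WITH_ADJACENCY: prim p emits {4p+1, 4p+2}.
 * Returns true and fills (prim, slot) if vertex v is emitted. */
static bool
line_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   assert(prim != NULL && slot != NULL);

   const uint32_t p = v / 4, r = v % 4;

   /* Adjacency-only vertices (r == 0 or 3) and incomplete groups emit nothing. */
   if (p >= n / 4 || (r != 1 && r != 2))
      return false;

   *prim = p;
   *slot = r - 1;
   return true;
}

/* TRIANGLE_LIST_WITH_ADJACENCY: prim p emits {6p, 6p+2, 6p+4}. */
static bool
tri_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   assert(prim != NULL && slot != NULL);

   const uint32_t p = v / 6, r = v % 6;

   /* Odd positions within a group are adjacency-only vertices. */
   if (p >= n / 6 || r % 2 != 0)
      return false;

   *prim = p;
   *slot = r / 2;
   return true;
}
```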
Implications for Phase 2
- 6 of 7 affected topologies have O(1) contributions per VS invocation: straightforward nir_push_if + emit.
- TRIANGLE_FAN's central vertex needs a NIR loop: requires nir_push_loop and a conditional nir_break based on num_vertices.
- The runtime topology switch is a 7-way branch on the vs.xfb_topology sysval (plus a pass-through for LIST topologies). NIR generates clean conditional code; the Bifrost backend should optimize it OK.
What the sysval vs.xfb_topology looks like
8-bit integer in graphics_sysvals struct. Enum values:
enum panvk_xfb_topology {
PANVK_XFB_TOPO_LIST = 0, /* pass-through; current iter13 formula */
PANVK_XFB_TOPO_LINE_STRIP = 1,
PANVK_XFB_TOPO_TRI_STRIP = 2,
PANVK_XFB_TOPO_TRI_FAN = 3,
PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};
Driver maps VkPrimitiveTopology → panvk_xfb_topology at draw time, sets the sysval via set_gfx_sysval(cmdbuf, dirty, vs.xfb_topology, val).
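The draw-time mapping could look roughly like the sketch below. To stay compilable without <vulkan/vulkan.h>, it duplicates the numeric VkPrimitiveTopology values under hypothetical VKTOPO_* names; xfb_topology_from_vk is also an illustrative name. Sending POINT_LIST and PATCH_LIST to the pass-through is an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST = 0,
   PANVK_XFB_TOPO_LINE_STRIP = 1,
   PANVK_XFB_TOPO_TRI_STRIP = 2,
   PANVK_XFB_TOPO_TRI_FAN = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};

/* The doc stores the sysval as an 8-bit integer. */
static_assert(PANVK_XFB_TOPO_TRI_STRIP_ADJ < 256, "must fit in 8 bits");

/* Stand-ins for the VkPrimitiveTopology values of the non-list topologies. */
enum {
   VKTOPO_LINE_STRIP = 2,
   VKTOPO_TRIANGLE_STRIP = 4,
   VKTOPO_TRIANGLE_FAN = 5,
   VKTOPO_LINE_LIST_WITH_ADJACENCY = 6,
   VKTOPO_LINE_STRIP_WITH_ADJACENCY = 7,
   VKTOPO_TRIANGLE_LIST_WITH_ADJACENCY = 8,
   VKTOPO_TRIANGLE_STRIP_WITH_ADJACENCY = 9,
};

static enum panvk_xfb_topology
xfb_topology_from_vk(uint32_t vk_topology)
{
   switch (vk_topology) {
   case VKTOPO_LINE_STRIP:                    return PANVK_XFB_TOPO_LINE_STRIP;
   case VKTOPO_TRIANGLE_STRIP:                return PANVK_XFB_TOPO_TRI_STRIP;
   case VKTOPO_TRIANGLE_FAN:                  return PANVK_XFB_TOPO_TRI_FAN;
   case VKTOPO_LINE_LIST_WITH_ADJACENCY:      return PANVK_XFB_TOPO_LINE_LIST_ADJ;
   case VKTOPO_LINE_STRIP_WITH_ADJACENCY:     return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
   case VKTOPO_TRIANGLE_LIST_WITH_ADJACENCY:  return PANVK_XFB_TOPO_TRI_LIST_ADJ;
   case VKTOPO_TRIANGLE_STRIP_WITH_ADJACENCY: return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
   default:
      /* POINT_LIST, LINE_LIST, TRIANGLE_LIST, PATCH_LIST: pass-through. */
      return PANVK_XFB_TOPO_LIST;
   }
}
```

Defaulting to LIST keeps every topology that needs no decomposition on the existing single-store path.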
Risk: shader complexity
The lowered shader after iter17 will have:
- 1 sysval load
- 7 conditional branches
- 2-3 conditional stores per branch (except TRI_FAN which has a loop)
- per-store address arithmetic
That's a lot for what was a single store_global. On Bifrost (in-order architecture), branches are cheap but the increased instruction count + register pressure could hurt throughput.
Mitigation: most XFB workloads are tiny (per-frame, dozens to thousands of vertices). The throughput cost is irrelevant for the CTS-driven correctness target. Real-world XFB-heavy workloads (rare on Bifrost) might prefer iter13's single-store path, but those aren't impacted by iter17's correctness fix because the LIST topology still uses the fast path (topology == PANVK_XFB_TOPO_LIST → emit single store).
What to write in Phase 4
NEW file: src/panfrost/vulkan/panvk_vX_xfb_lower.c — a panvk-specific replacement for pan_nir_lower_xfb. Calls into pieces of pan_nir_lower_xfb for the LIST case (or re-implements its minimal logic) and adds the per-topology contribution branches for the others. Exposed as panvk_per_arch(nir_lower_xfb)(nir_shader *).
MODIFIED: panvk_vX_shader.c — replace the NIR_PASS(_, nir, pan_nir_lower_xfb) call with NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb)).
MODIFIED: panvk_shader.h — add vs.xfb_topology to sysval struct.
MODIFIED: panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals — at draw time, map current topology to enum + set_gfx_sysval(..., vs.xfb_topology, mapped).
Phase 4 LoC estimate: ~250 (replacement pass) + 30 (sysval threading + draw-time topology map) ≈ 280 LoC.
— claude-noether, 2026-05-21