marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00


Phase 1 — source map for iter17

pan_nir_lower_xfb.c (80 LoC)

Anatomy:

| Lines | Function | What it does |
|---|---|---|
| 9-40 | lower_xfb_output | Per (output, channel) → emit ONE store_global |
| 42-77 | lower_xfb | Per intrinsic: handle load_vertex_id rewrite + dispatch to lower_xfb_output for each non-zero channel in the nir_io_xfb annotation |
| 79-84 | pan_nir_lower_xfb | Top-level wrapper calling nir_shader_intrinsics_pass |

Core formula (lines 23-34)

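nir_def *index = nir_iadd(b,
   nir_imul(b, nir_load_instance_id(b), nir_load_num_vertices(b)),
   nir_load_raw_vertex_id_pan(b));
/* Schematic from here on: the pass builds the address and the store with
 * NIR builder calls; plain arithmetic is shown for readability. */
nir_def *addr = xfb_address[buffer] + index * stride + offset_bytes;
nir_store_global(b, value, addr);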

Critical observation: nir_load_num_vertices(b) is a sysval that iter13 already provides as panvk_graphics_sysvals.vs.num_vertices. iter16's design added a second sysval (xfb.decomposed_count) for the override case. iter17 doesn't need that one: we keep input_count in num_vertices and do the decomposition arithmetic in the shader, selected by a third sysval, vs.xfb_topology.
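The index and address arithmetic of the core formula can be checked on the host with plain integers. The sketch below is only a model: the function name `xfb_store_addr` and its parameter names are hypothetical and not part of Mesa, and the index is widened to 64 bits here to keep the model free of overflow, which may differ from what the pass does.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the single-store address formula (hypothetical
 * helper). In the real pass every operand is a NIR SSA value. */
uint64_t
xfb_store_addr(uint64_t buffer_base, uint32_t instance_id,
               uint32_t num_vertices, uint32_t vertex_id,
               uint32_t stride_bytes, uint32_t offset_bytes)
{
   /* A zero stride would make every record alias the same address. */
   assert(stride_bytes != 0);

   /* One output record per (instance, vertex) pair, instance-major. */
   uint64_t index = (uint64_t)instance_id * num_vertices + vertex_id;

   return buffer_base + index * stride_bytes + offset_bytes;
}
```

For example, with 3 vertices per instance and a 16-byte stride, vertex 2 of instance 1 lands in record 5 of the buffer.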

NIR builder pattern we'll use

For our panvk-specific replacement pass, the existing single store becomes:

nir_def *topology = load_sysval(b, vs.xfb_topology);  /* uint32 */

/* Branch per topology family. Each branch emits 1-3 conditional stores
 * per VS invocation (or more for TRI_FAN). */
nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
{
   emit_tri_strip_stores(b, /* contribution arithmetic */);
}
nir_push_else(b, NULL);
{
   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
   {
      emit_line_strip_stores(b, ...);
   }
   /* ... etc per topology ... */
   nir_pop_if(b, NULL);   /* every nir_push_if needs a matching pop */
}
nir_pop_if(b, NULL);

Per-vertex contribution map

For each affected topology, input vertex v contributes to a small set of (primitive_idx, slot) pairs.

TRIANGLE_STRIP (canonical case)

Decomposition: prim p emits {p, p+1+p%2, p+2-p%2} (even/odd winding flip).

Inverse — for input vertex v on a strip of N vertices, contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N-3 | 0 |
| p = v-1 | 1 ≤ v ≤ N-2 | 1 if (v-1) even, else 2 |
| p = v-2 | 2 ≤ v ≤ N-1 | 2 if (v-2) even, else 1 |

Up to 3 stores per VS invocation. Each store guarded by the eligibility predicate.
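The inverse map for TRIANGLE_STRIP can be modelled on the host as follows. This is a sketch under the eligibility rules stated above, not the shader code itself; `struct strip_contrib` and `tri_strip_contribs` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One (primitive, slot) pair that an input vertex writes to. */
struct strip_contrib {
   uint32_t prim;
   uint32_t slot;
};

/* Hypothetical host-side model of the TRIANGLE_STRIP inverse map.
 * Fills out[] with the pairs vertex v contributes to on a strip of n
 * vertices and returns how many there are (0..3). */
unsigned
tri_strip_contribs(uint32_t v, uint32_t n, struct strip_contrib out[3])
{
   unsigned count = 0;

   assert(out != NULL);
   if (n < 3 || v >= n)
      return 0;

   /* p = v: always slot 0, eligible while v <= n - 3. */
   if (v <= n - 3)
      out[count++] = (struct strip_contrib){ v, 0 };

   /* p = v - 1: slot 1 when p is even, slot 2 when p is odd. */
   if (v >= 1 && v <= n - 2)
      out[count++] =
         (struct strip_contrib){ v - 1, ((v - 1) % 2 == 0) ? 1u : 2u };

   /* p = v - 2: slot 2 when p is even, slot 1 when p is odd. */
   if (v >= 2)
      out[count++] =
         (struct strip_contrib){ v - 2, ((v - 2) % 2 == 0) ? 2u : 1u };

   return count;
}
```

On a 5-vertex strip, vertex 2 is the only one that hits all three cases: slot 0 of primitive 2, slot 2 of primitive 1 (odd, so flipped), and slot 2 of primitive 0.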

LINE_STRIP

Decomposition: prim p emits {p, p+1}. Vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N-2 | 0 |
| p = v-1 | 1 ≤ v ≤ N-1 | 1 |

Up to 2 stores.
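A host-side model of the LINE_STRIP case is shorter still. The names below are hypothetical, and `line_list_out_vertex` additionally assumes that the captured output is laid out as a line list with two vertices per primitive, which this document does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct line_contrib {
   uint32_t prim;
   uint32_t slot;
};

/* Hypothetical model of the LINE_STRIP inverse map: returns how many
 * (primitive, slot) pairs vertex v of an n-vertex strip writes (0..2). */
unsigned
line_strip_contribs(uint32_t v, uint32_t n, struct line_contrib out[2])
{
   unsigned count = 0;

   assert(out != NULL);
   if (n < 2 || v >= n)
      return 0;

   /* p = v: slot 0, eligible while v <= n - 2. */
   if (v <= n - 2)
      out[count++] = (struct line_contrib){ v, 0 };

   /* p = v - 1: slot 1, eligible while v >= 1. */
   if (v >= 1)
      out[count++] = (struct line_contrib){ v - 1, 1 };

   return count;
}

/* Assumption: decomposed output is a line list, 2 vertices per prim. */
uint32_t
line_list_out_vertex(struct line_contrib c)
{
   return c.prim * 2 + c.slot;
}
```

Interior vertices produce two stores, while the first and last vertex of the strip produce one each.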

TRIANGLE_FAN — the awkward case

Decomposition: prim p emits {p+1, p+2, 0}. Vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v-1 | 1 ≤ v ≤ N-2 | 0 |
| p = v-2 | 2 ≤ v ≤ N-1 | 1 |
| p = any in [0, N-2) | v == 0 | 2 |

The central vertex (v=0) contributes to ALL primitives as slot 2. That's N-2 stores, i.e. O(N), from a single VS invocation, requiring a NIR loop bounded by num_vertices.
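The asymmetry between the central vertex and the rim vertices shows up clearly in a host-side model. The sketch below uses hypothetical names (`struct fan_contrib`, `tri_fan_contribs`) and a caller-supplied capacity so that the O(N) case cannot overrun the output array; the lowered shader would express the same loop with NIR control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fan_contrib {
   uint32_t prim;
   uint32_t slot;
};

/* Hypothetical model of the TRIANGLE_FAN inverse map. Writes at most
 * cap entries to out[] and returns the total number of contributions. */
unsigned
tri_fan_contribs(uint32_t v, uint32_t n, struct fan_contrib *out,
                 unsigned cap)
{
   unsigned count = 0;

   assert(out != NULL || cap == 0);
   if (n < 3 || v >= n)
      return 0;

   if (v == 0) {
      /* Central vertex: slot 2 of every one of the n - 2 primitives. */
      for (uint32_t p = 0; p < n - 2; p++) {
         if (count < cap)
            out[count] = (struct fan_contrib){ p, 2 };
         count++;
      }
      return count;
   }

   /* p = v - 1: slot 0, eligible while 1 <= v <= n - 2. */
   if (v <= n - 2) {
      if (count < cap)
         out[count] = (struct fan_contrib){ v - 1, 0 };
      count++;
   }

   /* p = v - 2: slot 1, eligible while 2 <= v <= n - 1. */
   if (v >= 2) {
      if (count < cap)
         out[count] = (struct fan_contrib){ v - 2, 1 };
      count++;
   }

   return count;
}
```

For a 5-vertex fan the central vertex yields three stores, while every rim vertex yields at most two.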

Adjacency variants

  • LINE_LIST_WITH_ADJACENCY: prim p emits {4p+1, 4p+2}. Vertex v contributes only if (v%4 ∈ {1, 2}) — O(1) stores.
  • LINE_STRIP_WITH_ADJACENCY: prim p emits {p+1, p+2}. Similar to LINE_STRIP shifted by 1. O(1) stores.
  • TRIANGLE_LIST_WITH_ADJACENCY: prim p emits {6p, 6p+2, 6p+4}. Vertex v contributes only if (v%6 ∈ {0, 2, 4}) — O(1) stores.
  • TRIANGLE_STRIP_WITH_ADJACENCY: prim p emits {2p, 2p+2, 2p+4} for even p; {2p, 2p+4, 2p+2} for odd. O(1) stores per vertex.
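For the two list-with-adjacency variants the inverse map reduces to a remainder test. The following sketch models them on the host; the function names are hypothetical, and it assumes that vertices belonging to an incomplete trailing primitive produce no output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* LINE_LIST_WITH_ADJACENCY: prim p emits {4p+1, 4p+2}.
 * Returns true and fills prim/slot if vertex v of n contributes. */
bool
line_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   uint32_t r = v % 4;

   assert(prim != NULL && slot != NULL);
   /* v / 4 >= n / 4 rejects vertices of an incomplete last primitive. */
   if (v / 4 >= n / 4 || (r != 1 && r != 2))
      return false;

   *prim = v / 4;
   *slot = r - 1;
   return true;
}

/* TRIANGLE_LIST_WITH_ADJACENCY: prim p emits {6p, 6p+2, 6p+4}. */
bool
tri_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   uint32_t r = v % 6;

   assert(prim != NULL && slot != NULL);
   if (v / 6 >= n / 6 || r % 2 != 0)
      return false;

   *prim = v / 6;
   *slot = r / 2;
   return true;
}
```

Each vertex contributes to at most one primitive here, so the lowered shader needs a single guarded store per output for these topologies.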

Implications for Phase 2

  • 6 of 7 affected topologies have O(1) contributions per VS invocation — straightforward nir_push_if + emit.
  • TRIANGLE_FAN's central vertex needs a NIR loop — requires nir_push_loop and a conditional nir_break based on num_vertices.
  • The runtime topology switch is a 7-way branch on the vs.xfb_topology sysval (plus a pass-through for LIST topologies). NIR expresses this as structured if/else control flow, which the Bifrost backend is expected to handle without trouble.

What the sysval vs.xfb_topology looks like

8-bit integer in graphics_sysvals struct. Enum values:

enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST            = 0,  /* pass-through; current iter13 formula */
   PANVK_XFB_TOPO_LINE_STRIP      = 1,
   PANVK_XFB_TOPO_TRI_STRIP       = 2,
   PANVK_XFB_TOPO_TRI_FAN         = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ   = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ  = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ    = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ   = 7,
};

Driver maps VkPrimitiveTopology → panvk_xfb_topology at draw time, sets the sysval via set_gfx_sysval(cmdbuf, dirty, vs.xfb_topology, val).
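The draw-time mapping could look roughly like the sketch below. To stay buildable without the Vulkan headers it uses a local stand-in enum whose values mirror VkPrimitiveTopology; the `SKETCH_VK_*` names and `sketch_map_xfb_topology` are hypothetical, and sending point and patch lists down the pass-through path is an assumption.

```c
/* Local stand-ins mirroring the numeric VkPrimitiveTopology values,
 * so this sketch compiles without vulkan_core.h. */
enum sketch_vk_topology {
   SKETCH_VK_POINT_LIST = 0,
   SKETCH_VK_LINE_LIST = 1,
   SKETCH_VK_LINE_STRIP = 2,
   SKETCH_VK_TRIANGLE_LIST = 3,
   SKETCH_VK_TRIANGLE_STRIP = 4,
   SKETCH_VK_TRIANGLE_FAN = 5,
   SKETCH_VK_LINE_LIST_WITH_ADJACENCY = 6,
   SKETCH_VK_LINE_STRIP_WITH_ADJACENCY = 7,
   SKETCH_VK_TRIANGLE_LIST_WITH_ADJACENCY = 8,
   SKETCH_VK_TRIANGLE_STRIP_WITH_ADJACENCY = 9,
   SKETCH_VK_PATCH_LIST = 10,
};

enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST            = 0,
   PANVK_XFB_TOPO_LINE_STRIP      = 1,
   PANVK_XFB_TOPO_TRI_STRIP       = 2,
   PANVK_XFB_TOPO_TRI_FAN         = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ   = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ  = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ    = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ   = 7,
};

/* Hypothetical draw-time mapping; anything that needs no decomposition
 * falls through to the single-store LIST path. */
enum panvk_xfb_topology
sketch_map_xfb_topology(enum sketch_vk_topology t)
{
   switch (t) {
   case SKETCH_VK_LINE_STRIP:
      return PANVK_XFB_TOPO_LINE_STRIP;
   case SKETCH_VK_TRIANGLE_STRIP:
      return PANVK_XFB_TOPO_TRI_STRIP;
   case SKETCH_VK_TRIANGLE_FAN:
      return PANVK_XFB_TOPO_TRI_FAN;
   case SKETCH_VK_LINE_LIST_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_LINE_LIST_ADJ;
   case SKETCH_VK_LINE_STRIP_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
   case SKETCH_VK_TRIANGLE_LIST_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_TRI_LIST_ADJ;
   case SKETCH_VK_TRIANGLE_STRIP_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
   default:
      /* Point, line and triangle lists (and, by assumption, patches). */
      return PANVK_XFB_TOPO_LIST;
   }
}
```

Keeping PANVK_XFB_TOPO_LIST at 0 means a zero-initialized sysval selects the existing single-store path.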

Risk: shader complexity

The lowered shader after iter17 will have:

  • 1 sysval load
  • 7 conditional branches
  • 2-3 conditional stores per branch (except TRI_FAN which has a loop)
  • per-store address arithmetic

That's a lot for what was a single store_global. On Bifrost (in-order architecture), branches are cheap but the increased instruction count + register pressure could hurt throughput.

Mitigation: most XFB workloads are tiny (per-frame, dozens to thousands of vertices). The throughput cost is irrelevant for the CTS-driven correctness target. Real-world XFB-heavy workloads (rare on Bifrost) might prefer iter13's single-store path, but those aren't impacted by iter17's correctness fix because the LIST topology still uses the fast path (topology == PANVK_XFB_TOPO_LIST → emit single store).

What to write in Phase 4

NEW file: src/panfrost/vulkan/panvk_vX_xfb_lower.c — a panvk-specific replacement for pan_nir_lower_xfb. Calls into pieces of pan_nir_lower_xfb for the LIST case (or re-implements its minimal logic) and adds the per-topology contribution branches for the others. Exposed as panvk_per_arch(nir_lower_xfb)(nir_shader *).

MODIFIED: panvk_vX_shader.c — replace the NIR_PASS(_, nir, pan_nir_lower_xfb) call with NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb)).

MODIFIED: panvk_shader.h — add vs.xfb_topology to sysval struct.

MODIFIED: panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals — at draw time, map current topology to enum + set_gfx_sysval(..., vs.xfb_topology, mapped).

Phase 4 LoC estimate: ~250 (replacement pass) + 30 (sysval threading + draw-time topology map) ≈ 280 LoC.

— claude-noether, 2026-05-21