# Phase 1: source map for iter17

## `pan_nir_lower_xfb.c` (80 LoC)

Anatomy:

| Lines | Function | What it does |
|---|---|---|
| 9-40 | `lower_xfb_output` | Per (output, channel) → emit ONE `store_global` |
| 42-77 | `lower_xfb` | Per intrinsic: handle `load_vertex_id` rewrite + dispatch to `lower_xfb_output` for each non-zero channel in the `nir_io_xfb` annotation |
| 79-84 | `pan_nir_lower_xfb` | Top-level wrapper calling `nir_shader_intrinsics_pass` |

### Core formula (lines 23-34)

```c
nir_def *index = nir_iadd(b,
                          nir_imul(b, nir_load_instance_id(b),
                                      nir_load_num_vertices(b)),
                          nir_load_raw_vertex_id_pan(b));
/* Schematic: address math and store arguments are abbreviated for readability. */
nir_def *addr = xfb_address[buffer] + index * stride + offset_bytes;
nir_store_global(b, value, addr);
```

**Critical observation:** `nir_load_num_vertices(b)` is a sysval, already present in iter13 as `panvk_graphics_sysvals.vs.num_vertices`. iter16's design added a second sysval (`xfb.decomposed_count`) for the override case. iter17 doesn't need that one; we keep input_count in `num_vertices` and do the decomposition arithmetic in the shader using a *third* sysval: `vs.xfb_topology`.

## NIR builder pattern we'll use

For our panvk-specific replacement pass, the existing single store becomes:

```c
nir_def *topology = load_sysval(b, vs.xfb_topology); /* uint32 */

/* Branch per topology family. Each branch emits 1-3 (or more for TRI_FAN)
 * conditional stores per VS invocation. */
nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
{
   emit_tri_strip_stores(b, /* contribution arithmetic */);
}
nir_push_else(b, NULL);
{
   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
   {
      emit_line_strip_stores(b, ...);
   }
   /* ... etc per topology ... */
   nir_pop_if(b, NULL); /* closes the LINE_STRIP if */
}
nir_pop_if(b, NULL); /* closes the TRI_STRIP if/else */
```

## Per-vertex contribution map

For each affected topology, **input vertex v** contributes to a small set of `(primitive_idx, slot)` pairs.

### TRIANGLE_STRIP (canonical case)

Decomposition: prim p emits `{p, p+1+p%2, p+2-p%2}` (even/odd winding flip).

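
To make the winding flip concrete, here is a minimal CPU-side sketch of the forward rule `{p, p+1+p%2, p+2-p%2}`. The helper names (`tri_strip_prim`, `tri_strip_prim_count`, `tri_strip_self_check`) are hypothetical and exist only for illustration; they are not part of Mesa or of the planned pass.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: forward decomposition of a triangle strip.
 * Fills out[0..2] with the input vertex indices of list triangle p,
 * following {p, p+1+p%2, p+2-p%2}. */
static void tri_strip_prim(uint32_t p, uint32_t out[3])
{
   uint32_t odd = p & 1u;   /* 1 for odd primitives, 0 for even */

   out[0] = p;
   out[1] = p + 1 + odd;    /* even: p+1, odd: p+2 */
   out[2] = p + 2 - odd;    /* even: p+2, odd: p+1 */
}

/* Hypothetical helper: number of triangles in a strip of n vertices. */
static uint32_t tri_strip_prim_count(uint32_t n)
{
   return n >= 3 ? n - 2 : 0;
}

/* Small self-check on the first even and the first odd primitive. */
static void tri_strip_self_check(void)
{
   uint32_t v[3];

   tri_strip_prim(0, v);
   assert(v[0] == 0 && v[1] == 1 && v[2] == 2);

   tri_strip_prim(1, v);
   assert(v[0] == 1 && v[1] == 3 && v[2] == 2);
}
```

Odd primitives swap their last two vertices, which is why the slot assignment in the inverse map depends on the parity of the primitive index rather than of the vertex index.
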
Inverse: on a strip of N vertices, input vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N−3 | 0 |
| p = v − 1 | 1 ≤ v ≤ N−2 | 1 if (v−1) even, else 2 |
| p = v − 2 | 2 ≤ v ≤ N−1 | 2 if (v−2) even, else 1 |

Up to 3 stores per VS invocation. Each store is guarded by its eligibility predicate.

### LINE_STRIP

Decomposition: prim p emits `{p, p+1}`. Vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N−2 | 0 |
| p = v − 1 | 1 ≤ v ≤ N−1 | 1 |

Up to 2 stores.

### TRIANGLE_FAN: the awkward case

Decomposition: prim p emits `{p+1, p+2, 0}`. Vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v − 1 | 1 ≤ v ≤ N−2 | 0 |
| p = v − 2 | 2 ≤ v ≤ N−1 | 1 |
| **p = any in [0, N−2)** | **v == 0** | **2** |

The **central vertex (v=0)** contributes to ALL primitives as slot 2. That's O(N) stores from a single VS invocation, requiring a **NIR loop** bounded by `num_vertices`.

### Adjacency variants

- LINE_LIST_WITH_ADJACENCY: prim p emits `{4p+1, 4p+2}`. Vertex v contributes only if v%4 ∈ {1, 2}. O(1) stores.
- LINE_STRIP_WITH_ADJACENCY: prim p emits `{p+1, p+2}`. Similar to LINE_STRIP shifted by 1. O(1) stores.
- TRIANGLE_LIST_WITH_ADJACENCY: prim p emits `{6p, 6p+2, 6p+4}`. Vertex v contributes only if v%6 ∈ {0, 2, 4}. O(1) stores.
- TRIANGLE_STRIP_WITH_ADJACENCY: prim p emits `{2p, 2p+2, 2p+4}` for even p; `{2p, 2p+4, 2p+2}` for odd p. O(1) stores per vertex.

## Implications for Phase 2

- **6 of 7 affected topologies have O(1) contributions per VS invocation**: straightforward `nir_push_if` + emit.
- **TRIANGLE_FAN's central vertex needs a NIR loop**: requires `nir_push_loop` and a conditional `nir_break` based on `num_vertices`.
- **The runtime topology switch** is a 7-way branch on the `vs.xfb_topology` sysval (plus a pass-through for LIST topologies). NIR generates clean conditional code; the Bifrost backend should optimize it OK.

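
Before committing the contribution tables to NIR, they can be cross-checked on the CPU. The sketch below does this for TRIANGLE_STRIP only: it encodes the inverse table as guarded stores and verifies it against the forward rule by brute force. `struct contrib`, `tri_strip_contribs` and `tri_strip_inverse_matches` are hypothetical names introduced for this illustration, and the check assumes `v < n`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair type: which list primitive and which slot a vertex feeds. */
struct contrib {
   uint32_t prim;
   uint32_t slot;
};

/* Inverse map for TRIANGLE_STRIP on a strip of n vertices.
 * Writes up to 3 (prim, slot) pairs for input vertex v; returns the count.
 * Each block mirrors one row of the eligibility table. */
static unsigned tri_strip_contribs(uint32_t v, uint32_t n, struct contrib out[3])
{
   unsigned count = 0;

   if (n < 3)
      return 0;

   if (v + 3 <= n)                      /* p = v, 0 <= v <= n-3 */
      out[count++] = (struct contrib){ v, 0 };

   if (v >= 1 && v + 2 <= n) {          /* p = v-1, 1 <= v <= n-2 */
      uint32_t p = v - 1;
      out[count++] = (struct contrib){ p, (p & 1u) ? 2u : 1u };
   }

   if (v >= 2 && v + 1 <= n) {          /* p = v-2, 2 <= v <= n-1 */
      uint32_t p = v - 2;
      out[count++] = (struct contrib){ p, (p & 1u) ? 1u : 2u };
   }

   return count;
}

/* Brute-force check: every forward (prim, slot) is produced by the inverse
 * map, and the inverse map produces exactly 3 pairs per primitive overall. */
static bool tri_strip_inverse_matches(uint32_t n)
{
   uint32_t total = 0;

   for (uint32_t p = 0; p + 2 < n; p++) {
      uint32_t odd = p & 1u;
      uint32_t fwd[3] = { p, p + 1 + odd, p + 2 - odd };

      for (uint32_t s = 0; s < 3; s++) {
         struct contrib c[3];
         unsigned k = tri_strip_contribs(fwd[s], n, c);
         bool found = false;

         for (unsigned i = 0; i < k; i++)
            found |= (c[i].prim == p && c[i].slot == s);
         if (!found)
            return false;
      }
   }

   for (uint32_t v = 0; v < n; v++) {
      struct contrib c[3];
      total += tri_strip_contribs(v, n, c);
   }

   return total == (n >= 3 ? 3 * (n - 2) : 0);
}
```

Calling `tri_strip_inverse_matches(n)` over a range of strip lengths is a cheap way to catch an off-by-one in the eligibility bounds. The same forward-versus-inverse comparison could be repeated for the other topologies, with TRIANGLE_FAN needing a variable-length output for `v == 0`.
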
## What the sysval `vs.xfb_topology` looks like

8-bit integer in the graphics_sysvals struct. The builder pattern annotates the loaded value as uint32, so the shader-side load is assumed to widen it. Enum values:

```c
enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST           = 0, /* pass-through; current iter13 formula */
   PANVK_XFB_TOPO_LINE_STRIP     = 1,
   PANVK_XFB_TOPO_TRI_STRIP      = 2,
   PANVK_XFB_TOPO_TRI_FAN        = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ  = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ   = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ  = 7,
};
```

The driver maps `VkPrimitiveTopology` → `panvk_xfb_topology` at draw time and sets the sysval via `set_gfx_sysval(cmdbuf, dirty, vs.xfb_topology, val)`.

## Risk: shader complexity

The lowered shader after iter17 will have:

- 1 sysval load
- 7 conditional branches
- 1-3 conditional stores per branch (except TRI_FAN, which has a loop)
- per-store address arithmetic

That's a lot for what was a single `store_global`. On Bifrost (in-order architecture), branches are cheap, but the increased instruction count + register pressure could hurt throughput.

Mitigation: most XFB workloads are tiny (per-frame, dozens to thousands of vertices). The throughput cost is irrelevant for the CTS-driven correctness target. Real-world XFB-heavy workloads (rare on Bifrost) might prefer iter13's single-store path, but those aren't impacted by iter17's correctness fix because the LIST topology still uses the fast path (topology == PANVK_XFB_TOPO_LIST → emit single store).

## What to write in Phase 4

NEW file: `src/panfrost/vulkan/panvk_vX_xfb_lower.c`, a panvk-specific replacement for `pan_nir_lower_xfb`. Calls into pieces of pan_nir_lower_xfb for the LIST case (or re-implements its minimal logic) and adds the per-topology contribution branches for the others. Exposed as `panvk_per_arch(nir_lower_xfb)(nir_shader *)`.

MODIFIED: `panvk_vX_shader.c`. Replace the `NIR_PASS(_, nir, pan_nir_lower_xfb)` call with `NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb))`.

MODIFIED: `panvk_shader.h`. Add `vs.xfb_topology` to the sysval struct.

MODIFIED: `panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals`. At draw time, map current topology to enum + `set_gfx_sysval(..., vs.xfb_topology, mapped)`.

Phase 4 LoC estimate: ~250 (replacement pass) + 30 (sysval threading + draw-time topology map) ≈ 280 LoC.

-- claude-noether, 2026-05-21