# Phase 1 — source map for iter17

## `pan_nir_lower_xfb.c` (80 LoC)

Anatomy:

| Lines | Function | What it does |
|---|---|---|
| 9-40 | `lower_xfb_output` | Per (output, channel) → emit ONE `store_global` |
| 42-77 | `lower_xfb` | Per intrinsic: handle `load_vertex_id` rewrite + dispatch to `lower_xfb_output` for each non-zero channel in the `nir_io_xfb` annotation |
| 79-84 | `pan_nir_lower_xfb` | Top-level wrapper calling `nir_shader_intrinsics_pass` |

### Core formula (lines 23-34)

```c
nir_def *index = nir_iadd(b,
   nir_imul(b, nir_load_instance_id(b), nir_load_num_vertices(b)),
   nir_load_raw_vertex_id_pan(b));
/* Condensed pseudocode: the real pass builds this arithmetic with NIR ops. */
nir_def *addr = xfb_address[buffer] + index * stride + offset_bytes;
nir_store_global(b, value, addr);
```

**Critical observation:** `nir_load_num_vertices(b)` is a sysval that already exists in iter13 as `panvk_graphics_sysvals.vs.num_vertices`. iter16's design added a second sysval (`xfb.decomposed_count`) for the override case. iter17 does not need that one: `num_vertices` keeps holding the input vertex count, and the shader does the decomposition arithmetic itself using a *third* sysval, `vs.xfb_topology`.
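As a sanity check on the formula, here is a minimal host-side model of the single-store address computation in plain C. The function name and parameters are illustrative stand-ins, not part of `pan_nir_lower_xfb.c`, and the bound on `raw_vertex_id` is an assumption of the model.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the iter13 single-store address. Hypothetical helper:
 * the real pass emits the same arithmetic as NIR instructions. */
static uint64_t
xfb_store_address(uint64_t buffer_base, uint32_t stride, uint32_t offset_bytes,
                  uint32_t instance_id, uint32_t num_vertices,
                  uint32_t raw_vertex_id)
{
   /* Model assumption: vertex ids are dense in [0, num_vertices). A larger
    * id would alias a slot that belongs to the next instance. */
   assert(raw_vertex_id < num_vertices);

   /* Linear output slot: instances are laid out back to back. */
   uint64_t index = (uint64_t)instance_id * num_vertices + raw_vertex_id;

   return buffer_base + index * stride + offset_bytes;
}
```

For example, instance 2 of a 3-vertex draw with vertex id 1 lands in slot 7, so with a 16-byte stride and a 4-byte offset the store goes to `base + 116`.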

## NIR builder pattern we'll use

For our panvk-specific replacement pass, the existing single store becomes:

```c
nir_def *topology = load_sysval(b, vs.xfb_topology);  /* uint32 */

/* Branch per topology family. Each branch emits 1-3 (or more for TRI_FAN)
 * conditional stores per VS invocation. */
nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
{
   emit_tri_strip_stores(b, /* contribution arithmetic */);
}
nir_push_else(b, NULL);
{
   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
   {
      emit_line_strip_stores(b, ...);
   }
   /* ... etc per topology, each in a nested nir_push_else ... */
   nir_pop_if(b, NULL); /* every nir_push_if needs a matching pop */
}
nir_pop_if(b, NULL);
```

## Per-vertex contribution map

For each affected topology, **input vertex v** contributes to a small set of `(primitive_idx, slot)` pairs.

### TRIANGLE_STRIP (canonical case)

Decomposition: prim p emits `{p, p+1+p%2, p+2-p%2}` (even/odd winding flip).

Inverse — for input vertex v on a strip of N vertices, contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N−3 | 0 |
| p = v − 1 | 1 ≤ v ≤ N−2 | 1 if (v−1) even, else 2 |
| p = v − 2 | 2 ≤ v ≤ N−1 | 2 if (v−2) even, else 1 |

Up to 3 stores per VS invocation. Each store guarded by the eligibility predicate.
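The inverse table can be checked on the host before any NIR is written. The sketch below is a plain C model, with hypothetical names (`xfb_contrib`, `tri_strip_contribs`, `tri_strip_vertex`) that do not exist in the driver: it lists the `(prim, slot)` pairs for a vertex and provides the forward decomposition to validate them against.

```c
#include <assert.h>
#include <stdint.h>

struct xfb_contrib {
   uint32_t prim; /* decomposed primitive index */
   uint32_t slot; /* position of the vertex inside that primitive (0..2) */
};

/* Fills out[] with the (prim, slot) pairs fed by input vertex v on a
 * triangle strip of n vertices; returns how many were written (0..3). */
static unsigned
tri_strip_contribs(uint32_t v, uint32_t n, struct xfb_contrib out[3])
{
   unsigned count = 0;

   assert(v < n);

   /* p = v: eligible while v <= n - 3. */
   if (v + 3 <= n)
      out[count++] = (struct xfb_contrib){ v, 0 };

   /* p = v - 1: slot 1 on even primitives, slot 2 on odd ones. */
   if (v >= 1 && v + 2 <= n)
      out[count++] = (struct xfb_contrib){ v - 1, (v - 1) % 2 == 0 ? 1u : 2u };

   /* p = v - 2: slot 2 on even primitives, slot 1 on odd ones. */
   if (v >= 2)
      out[count++] = (struct xfb_contrib){ v - 2, (v - 2) % 2 == 0 ? 2u : 1u };

   return count;
}

/* Forward decomposition from the text: prim p emits {p, p+1+p%2, p+2-p%2}. */
static uint32_t
tri_strip_vertex(uint32_t p, uint32_t slot)
{
   if (slot == 0)
      return p;
   return slot == 1 ? p + 1 + p % 2 : p + 2 - p % 2;
}
```

Running every vertex of a strip through `tri_strip_contribs` and feeding each pair back into `tri_strip_vertex` should return the original vertex, and the pair count should total `3 * (n - 2)`.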

### LINE_STRIP

Decomposition: prim p emits `{p, p+1}`. Vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N−2 | 0 |
| p = v − 1 | 1 ≤ v ≤ N−1 | 1 |

Up to 2 stores.
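A host-side sketch of the LINE_STRIP case follows. It goes one step further than the table and converts each `(prim, slot)` pair to an output vertex index, assuming decomposed lines are packed as `prim * 2 + slot`; that packing and the function name are assumptions of this sketch, not something the source pass defines.

```c
#include <assert.h>
#include <stdint.h>

/* Writes the decomposed output-vertex indices fed by input vertex v on a
 * line strip of n vertices; returns the number of stores (0..2).
 * Assumption: output vertex index = prim * 2 + slot. */
static unsigned
line_strip_output_indices(uint32_t v, uint32_t n, uint32_t out[2])
{
   unsigned count = 0;

   assert(v < n);

   /* p = v, slot 0: the vertex starts segment v (needs v <= n - 2). */
   if (v + 2 <= n)
      out[count++] = v * 2 + 0;

   /* p = v - 1, slot 1: the vertex ends segment v - 1. */
   if (v >= 1)
      out[count++] = (v - 1) * 2 + 1;

   return count;
}
```

Under that packing, the vertices of an n-vertex strip together write each of the `2 * (n - 1)` output slots exactly once.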

### TRIANGLE_FAN — the awkward case

Decomposition: prim p emits `{p+1, p+2, 0}`. Vertex v contributes to:

| Primitive | Eligibility | Slot |
|---|---|---|
| p = v − 1 | 1 ≤ v ≤ N−2 | 0 |
| p = v − 2 | 2 ≤ v ≤ N−1 | 1 |
| **p = any in [0, N−2)** | **v == 0** | **2** |

The **central vertex (v=0)** contributes to ALL N−2 primitives as slot 2. That's O(N) stores from a single VS invocation, requiring a **NIR loop** whose trip count (N−2) is derived from `num_vertices`.
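The asymmetry between the central vertex and the rim vertices is easy to see in a host-side model. The sketch below uses hypothetical names (`xfb_contrib`, `tri_fan_contribs`) and a caller-provided array in place of real stores; the loop in the `v == 0` branch is the part that would become the NIR loop.

```c
#include <assert.h>
#include <stdint.h>

struct xfb_contrib {
   uint32_t prim; /* decomposed primitive index */
   uint32_t slot; /* position of the vertex inside that primitive (0..2) */
};

/* Fills out[] (capacity cap) with the (prim, slot) pairs fed by input
 * vertex v on a triangle fan of n vertices; returns the pair count. */
static unsigned
tri_fan_contribs(uint32_t v, uint32_t n, struct xfb_contrib *out, unsigned cap)
{
   unsigned count = 0;

   assert(v < n);

   if (v == 0) {
      /* Central vertex: slot 2 of every primitive in [0, n - 2). */
      for (uint32_t p = 0; p + 2 < n; p++) {
         assert(count < cap);
         out[count++] = (struct xfb_contrib){ p, 2 };
      }
      return count;
   }

   /* p = v - 1, slot 0: eligible while v <= n - 2. */
   if (v + 2 <= n) {
      assert(count < cap);
      out[count++] = (struct xfb_contrib){ v - 1, 0 };
   }

   /* p = v - 2, slot 1: eligible from v = 2 onwards. */
   if (v >= 2) {
      assert(count < cap);
      out[count++] = (struct xfb_contrib){ v - 2, 1 };
   }

   return count;
}
```

For a 6-vertex fan, vertex 0 produces four pairs while every other vertex produces at most two, for a total of 12 pairs across the 4 triangles.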

### Adjacency variants

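- LINE_LIST_WITH_ADJACENCY: prim p emits `{4p+1, 4p+2}`. Vertex v contributes only if (v%4 ∈ {1, 2}), to prim ⌊v/4⌋ at slot (v%4)−1. At most 1 store.
- LINE_STRIP_WITH_ADJACENCY: prim p emits `{p+1, p+2}`, i.e. LINE_STRIP shifted by 1. Vertex v contributes to p = v−1 (slot 0) when 1 ≤ v ≤ N−3 and to p = v−2 (slot 1) when 2 ≤ v ≤ N−2. At most 2 stores.
- TRIANGLE_LIST_WITH_ADJACENCY: prim p emits `{6p, 6p+2, 6p+4}`. Vertex v contributes only if (v%6 ∈ {0, 2, 4}), to prim ⌊v/6⌋ at slot (v%6)/2. At most 1 store.
- TRIANGLE_STRIP_WITH_ADJACENCY: prim p emits `{2p, 2p+2, 2p+4}` for even p; `{2p, 2p+4, 2p+2}` for odd. Only even v contribute, and the mapping is the TRIANGLE_STRIP table with v/2 in place of v and ⌊N/2⌋ in place of N. At most 3 stores per vertex.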
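The two list-with-adjacency cases reduce to a modulo test and a division. A rough host-side sketch, with hypothetical function names and one added assumption: vertices in a trailing incomplete group are treated as contributing nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* LINE_LIST_WITH_ADJACENCY: prim p emits {4p+1, 4p+2}.
 * Returns true and fills prim/slot if vertex v feeds a primitive. */
static bool
line_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   uint32_t r = v % 4;

   assert(v < n);

   /* Assumption: a trailing group of fewer than 4 vertices forms no prim. */
   if ((r != 1 && r != 2) || v / 4 >= n / 4)
      return false;

   *prim = v / 4;
   *slot = r - 1;
   return true;
}

/* TRIANGLE_LIST_WITH_ADJACENCY: prim p emits {6p, 6p+2, 6p+4}. */
static bool
tri_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   uint32_t r = v % 6;

   assert(v < n);

   /* Odd positions are adjacency-only vertices and are never captured. */
   if (r % 2 != 0 || v / 6 >= n / 6)
      return false;

   *prim = v / 6;
   *slot = r / 2;
   return true;
}
```

Each vertex yields at most one pair here, which is why these topologies need only a single guarded store.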

## Implications for Phase 2

- **6 of 7 affected topologies have O(1) contributions per VS invocation** — straightforward `nir_push_if` + emit.
- **TRIANGLE_FAN's central vertex needs a NIR loop** — requires `nir_push_loop` and a conditional `nir_break` based on `num_vertices`.
- **The runtime topology switch** is a 7-way branch on the `vs.xfb_topology` sysval (plus a pass-through for LIST topologies). NIR expresses this as nested `if`/`else` control flow; the Bifrost backend is expected to compile it acceptably, though that has not been measured.

## What the sysval `vs.xfb_topology` looks like

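8-bit enum value in the `graphics_sysvals` struct. The builder sketch above loads it as a `uint32`, so either the field is widened to 32 bits or the load zero-extends it. Enum values: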
```c
enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST            = 0,  /* pass-through; current iter13 formula */
   PANVK_XFB_TOPO_LINE_STRIP      = 1,
   PANVK_XFB_TOPO_TRI_STRIP       = 2,
   PANVK_XFB_TOPO_TRI_FAN         = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ   = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ  = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ    = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ   = 7,
};
```

Driver maps `VkPrimitiveTopology` → `panvk_xfb_topology` at draw time, sets the sysval via `set_gfx_sysval(cmdbuf, dirty, vs.xfb_topology, val)`.
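The draw-time mapping could look roughly like the sketch below. To stay self-contained it uses a stand-in enum (`model_vk_topology`) whose values mirror `VkPrimitiveTopology`; the function name is hypothetical, and routing every non-strip, non-fan, non-adjacency topology (including patches) to the LIST pass-through is an assumption.

```c
#include <assert.h>

/* Stand-in for VkPrimitiveTopology with the same numeric values. */
enum model_vk_topology {
   MODEL_VK_POINT_LIST = 0,
   MODEL_VK_LINE_LIST = 1,
   MODEL_VK_LINE_STRIP = 2,
   MODEL_VK_TRIANGLE_LIST = 3,
   MODEL_VK_TRIANGLE_STRIP = 4,
   MODEL_VK_TRIANGLE_FAN = 5,
   MODEL_VK_LINE_LIST_ADJ = 6,
   MODEL_VK_LINE_STRIP_ADJ = 7,
   MODEL_VK_TRIANGLE_LIST_ADJ = 8,
   MODEL_VK_TRIANGLE_STRIP_ADJ = 9,
   MODEL_VK_PATCH_LIST = 10,
};

enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST = 0,
   PANVK_XFB_TOPO_LINE_STRIP = 1,
   PANVK_XFB_TOPO_TRI_STRIP = 2,
   PANVK_XFB_TOPO_TRI_FAN = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};

/* Hypothetical draw-time helper: picks the value written to vs.xfb_topology. */
static enum panvk_xfb_topology
model_map_xfb_topology(enum model_vk_topology t)
{
   assert((unsigned)t <= (unsigned)MODEL_VK_PATCH_LIST);

   switch (t) {
   case MODEL_VK_LINE_STRIP:         return PANVK_XFB_TOPO_LINE_STRIP;
   case MODEL_VK_TRIANGLE_STRIP:     return PANVK_XFB_TOPO_TRI_STRIP;
   case MODEL_VK_TRIANGLE_FAN:       return PANVK_XFB_TOPO_TRI_FAN;
   case MODEL_VK_LINE_LIST_ADJ:      return PANVK_XFB_TOPO_LINE_LIST_ADJ;
   case MODEL_VK_LINE_STRIP_ADJ:     return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
   case MODEL_VK_TRIANGLE_LIST_ADJ:  return PANVK_XFB_TOPO_TRI_LIST_ADJ;
   case MODEL_VK_TRIANGLE_STRIP_ADJ: return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
   default:
      /* Assumption: point/line/triangle lists and patches need no
       * decomposition and take the iter13 pass-through. */
      return PANVK_XFB_TOPO_LIST;
   }
}
```

Keeping LIST as the default arm means any topology the pass does not know about falls back to the existing single-store behaviour.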

## Risk: shader complexity

The lowered shader after iter17 will have:
- 1 sysval load
- 7 conditional branches
- 1-3 conditional stores per branch (except TRI_FAN which has a loop)
- per-store address arithmetic

That's a lot for what was a single `store_global`. The topology test reads a sysval, so it is uniform across the draw and the branches themselves should be cheap on Bifrost (an in-order architecture), but the increased instruction count and register pressure could hurt throughput.

Mitigation: most XFB workloads are tiny (per-frame, dozens to thousands of vertices), so the throughput cost is irrelevant for the CTS-driven correctness target. Real-world XFB-heavy workloads (rare on Bifrost) might prefer iter13's single-store path. Those that draw LIST topologies keep it, because `topology == PANVK_XFB_TOPO_LIST` still emits a single store; only the strip, fan, and adjacency topologies pay for the extra branches.

## What to write in Phase 4

NEW file: `src/panfrost/vulkan/panvk_vX_xfb_lower.c` — a panvk-specific replacement for `pan_nir_lower_xfb`. Calls into pieces of pan_nir_lower_xfb for the LIST case (or re-implements its minimal logic) and adds the per-topology contribution branches for the others. Exposed as `panvk_per_arch(nir_lower_xfb)(nir_shader *)`.

MODIFIED: `panvk_vX_shader.c` — replace the `NIR_PASS(_, nir, pan_nir_lower_xfb)` call with `NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb))`.

MODIFIED: `panvk_shader.h` — add `vs.xfb_topology` to sysval struct.

MODIFIED: `panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals` — at draw time, map current topology to enum + `set_gfx_sysval(..., vs.xfb_topology, mapped)`.

Phase 4 LoC estimate: ~250 (replacement pass) + 30 (sysval threading + draw-time topology map) ≈ 280 LoC.

— claude-noether, 2026-05-21