panvk-bifrost/mesa-panvk-bifrost/iter17/phase1_source_map.md
marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00

# Phase 1 — source map for iter17
## `pan_nir_lower_xfb.c` (80 LoC)
Anatomy:
| Lines | Function | What it does |
|---|---|---|
| 9-40 | `lower_xfb_output` | Per (output, channel) → emit ONE `store_global` |
| 42-77 | `lower_xfb` | Per intrinsic: handle `load_vertex_id` rewrite + dispatch to `lower_xfb_output` for each non-zero channel in the `nir_io_xfb` annotation |
| 79-84 | `pan_nir_lower_xfb` | Top-level wrapper calling `nir_shader_intrinsics_pass` |
### Core formula (lines 23-34)
```c
nir_def *index = nir_iadd(b,
   nir_imul(b, nir_load_instance_id(b), nir_load_num_vertices(b)),
   nir_load_raw_vertex_id_pan(b));
/* Schematic shorthand for the address arithmetic, not literal C. */
nir_def *addr = xfb_address[buffer] + index * stride + offset_bytes;
nir_store_global(b, value, addr);
```
**Critical observation:** `nir_load_num_vertices(b)` is a sysval — already in iter13's `panvk_graphics_sysvals.vs.num_vertices`. iter16's design added a second sysval (`xfb.decomposed_count`) for the override case. iter17 doesn't need that one; we keep input_count in `num_vertices` and do the decomposition arithmetic in the shader using a *third* sysval: `vs.xfb_topology`.
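The list-path arithmetic above can be checked on the host with a small scalar model. This is a sketch only: the function and parameter names are hypothetical and do not come from the Mesa sources, and the precondition on `raw_vertex_id` is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the list-path XFB address computation.
 * Every name here is illustrative; none of them exist in Mesa. */
static uint64_t
xfb_store_addr(uint64_t buffer_base, uint32_t stride, uint32_t offset_bytes,
               uint32_t instance_id, uint32_t num_vertices,
               uint32_t raw_vertex_id)
{
   /* Assumption: a vertex id at or past the per-instance count would alias
    * a slot that belongs to the next instance. */
   assert(raw_vertex_id < num_vertices);

   /* index = instance_id * num_vertices + raw_vertex_id */
   uint64_t index = (uint64_t)instance_id * num_vertices + raw_vertex_id;

   /* addr = base + index * stride + offset */
   return buffer_base + index * stride + offset_bytes;
}
```

For example, with a 16-byte stride and 3 vertices per instance, vertex 1 of instance 2 has index 7 and lands 7 * 16 bytes past the base, plus the output's byte offset.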
## NIR builder pattern we'll use
For our panvk-specific replacement pass, the existing single store becomes:
```c
nir_def *topology = load_sysval(b, vs.xfb_topology); /* uint32 */

/* Branch per topology family. Each branch emits 1-3 (or more for TRI_FAN)
 * conditional stores per VS invocation. */
nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
{
   emit_tri_strip_stores(b, /* contribution arithmetic */);
}
nir_push_else(b, NULL);
{
   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LINE_STRIP));
   {
      emit_line_strip_stores(b, ...);
   }
   /* ... etc per topology ... */
   nir_pop_if(b, NULL);
}
nir_pop_if(b, NULL);
```
## Per-vertex contribution map
For each affected topology, **input vertex v** contributes to a small set of `(primitive_idx, slot)` pairs.
### TRIANGLE_STRIP (canonical case)
Decomposition: prim p emits `{p, p+1+p%2, p+2-p%2}` (even/odd winding flip).
Inverse — for input vertex v on a strip of N vertices, contributes to:
| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N - 3 | 0 |
| p = v - 1 | 1 ≤ v ≤ N - 2 | 1 if (v - 1) even, else 2 |
| p = v - 2 | 2 ≤ v ≤ N - 1 | 2 if (v - 2) even, else 1 |
Up to 3 stores per VS invocation. Each store guarded by the eligibility predicate.
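The inverse map can be cross-checked against the forward decomposition with a scalar reference model. The sketch below is plain host-side C, not NIR; `struct xfb_contrib` and all function names are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct xfb_contrib {
   uint32_t prim; /* decomposed primitive index */
   uint32_t slot; /* position inside that primitive: 0, 1 or 2 */
};

/* Forward decomposition from the text: prim p emits {p, p+1+p%2, p+2-p%2}. */
static uint32_t
tri_strip_vertex(uint32_t p, uint32_t slot)
{
   const uint32_t v[3] = { p, p + 1 + p % 2, p + 2 - p % 2 };
   return v[slot];
}

/* Inverse map: fills out[] with the (prim, slot) pairs fed by vertex v of an
 * n-vertex strip and returns how many there are (0..3). */
static unsigned
tri_strip_contribs(uint32_t v, uint32_t n, struct xfb_contrib out[3])
{
   unsigned count = 0;

   assert(v < n);
   if (n < 3)
      return 0; /* no complete triangle */

   if (v + 3 <= n) /* 0 <= v <= n-3 */
      out[count++] = (struct xfb_contrib){ v, 0 };
   if (v >= 1 && v + 2 <= n) /* 1 <= v <= n-2 */
      out[count++] = (struct xfb_contrib){ v - 1, (v - 1) % 2 == 0 ? 1u : 2u };
   if (v >= 2) /* 2 <= v <= n-1 */
      out[count++] = (struct xfb_contrib){ v - 2, (v - 2) % 2 == 0 ? 2u : 1u };

   return count;
}

/* Returns 1 if every reported pair maps back to its vertex and the total
 * number of pairs equals 3 * (n - 2), i.e. every output slot is covered. */
static int
tri_strip_check(uint32_t n)
{
   unsigned total = 0;

   for (uint32_t v = 0; v < n; v++) {
      struct xfb_contrib c[3];
      unsigned k = tri_strip_contribs(v, n, c);

      for (unsigned i = 0; i < k; i++) {
         if (tri_strip_vertex(c[i].prim, c[i].slot) != v)
            return 0;
      }
      total += k;
   }
   return total == (n >= 3 ? 3 * (n - 2) : 0);
}
```

Because the three candidate primitives for a vertex are distinct and each stays inside `[0, n-3]`, matching the total count is enough to show the inverse covers every `(primitive, slot)` pair exactly once.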
### LINE_STRIP
Decomposition: prim p emits `{p, p+1}`. Vertex v contributes to:
| Primitive | Eligibility | Slot |
|---|---|---|
| p = v | 0 ≤ v ≤ N - 2 | 0 |
| p = v - 1 | 1 ≤ v ≤ N - 1 | 1 |
Up to 2 stores.
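As a host-side illustration, the table can be turned into a scatter that writes each vertex id into its decomposed output slots. The names below are hypothetical, and the output layout `prim * 2 + slot` is an assumption about how `(primitive_idx, slot)` maps to an output index, not something taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

struct line_contrib {
   uint32_t prim; /* decomposed line index */
   uint32_t slot; /* 0 = first endpoint, 1 = second endpoint */
};

/* Returns the number of (prim, slot) pairs fed by vertex v of an n-vertex
 * line strip (0..2) and stores them in out[]. */
static unsigned
line_strip_contribs(uint32_t v, uint32_t n, struct line_contrib out[2])
{
   unsigned count = 0;

   assert(v < n);
   if (v + 2 <= n) /* 0 <= v <= n-2: v starts line v */
      out[count++] = (struct line_contrib){ v, 0 };
   if (v >= 1) /* 1 <= v <= n-1: v ends line v-1 */
      out[count++] = (struct line_contrib){ v - 1, 1 };

   return count;
}

/* Scatters vertex ids into dst, which must hold 2 * (n - 1) entries.
 * Assumed layout: entry index = prim * 2 + slot. */
static void
line_strip_scatter(uint32_t n, uint32_t *dst)
{
   for (uint32_t v = 0; v < n; v++) {
      struct line_contrib c[2];
      unsigned k = line_strip_contribs(v, n, c);

      for (unsigned i = 0; i < k; i++)
         dst[c[i].prim * 2 + c[i].slot] = v;
   }
}
```

For a 4-vertex strip the scatter produces the list order `0 1 1 2 2 3`, which is what a line-list capture of the same geometry would contain.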
### TRIANGLE_FAN — the awkward case
Decomposition: prim p emits `{p+1, p+2, 0}`. Vertex v contributes to:
| Primitive | Eligibility | Slot |
|---|---|---|
| p = v - 1 | 1 ≤ v ≤ N - 2 | 0 |
| p = v - 2 | 2 ≤ v ≤ N - 1 | 1 |
| **p = any in [0, N - 2)** | **v == 0** | **2** |
The **central vertex (v=0)** contributes to ALL primitives as slot 2. That's O(N) stores from a single VS invocation, requiring a **NIR loop** bounded by `num_vertices`.
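A scalar model makes the asymmetry concrete: rim vertices issue at most two stores, while the hub loops over every triangle. This is a host-side sketch with a hypothetical function name, and the `prim * 3 + slot` output layout is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Writes the contributions of vertex v of an n-vertex fan into dst, which
 * must hold 3 * (n - 2) entries (assumed layout: prim * 3 + slot).
 * Returns the number of stores performed. */
static unsigned
tri_fan_scatter_vertex(uint32_t v, uint32_t n, uint32_t *dst)
{
   unsigned stores = 0;

   assert(v < n);
   if (n < 3)
      return 0; /* no complete triangle */

   const uint32_t num_prims = n - 2;

   if (v == 0) {
      /* Hub vertex: slot 2 of every triangle, so O(n) stores. */
      for (uint32_t p = 0; p < num_prims; p++) {
         dst[p * 3 + 2] = v;
         stores++;
      }
      return stores;
   }

   if (v <= n - 2) { /* 1 <= v <= n-2: slot 0 of prim v-1 */
      dst[(v - 1) * 3 + 0] = v;
      stores++;
   }
   if (v >= 2) { /* 2 <= v <= n-1: slot 1 of prim v-2 */
      dst[(v - 2) * 3 + 1] = v;
      stores++;
   }
   return stores;
}
```

Running it over all vertices of a 5-vertex fan yields the triangles `{1,2,0} {2,3,0} {3,4,0}`, with three of the nine stores coming from the single `v == 0` invocation.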
### Adjacency variants
- LINE_LIST_WITH_ADJACENCY: prim p emits `{4p+1, 4p+2}`. Vertex v contributes only if (v%4 ∈ {1, 2}) — O(1) stores.
- LINE_STRIP_WITH_ADJACENCY: prim p emits `{p+1, p+2}`. Similar to LINE_STRIP shifted by 1. O(1) stores.
- TRIANGLE_LIST_WITH_ADJACENCY: prim p emits `{6p, 6p+2, 6p+4}`. Vertex v contributes only if (v%6 ∈ {0, 2, 4}) — O(1) stores.
- TRIANGLE_STRIP_WITH_ADJACENCY: prim p emits `{2p, 2p+2, 2p+4}` for even p; `{2p, 2p+4, 2p+2}` for odd. O(1) stores per vertex.
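The two list-with-adjacency cases reduce to a modulo test per vertex. The following host-side sketch uses hypothetical function names; discarding vertices that belong to a trailing incomplete primitive is an assumption added here, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* LINE_LIST_WITH_ADJACENCY: prim p emits {4p+1, 4p+2}.
 * Returns true and fills prim/slot if vertex v is captured. */
static bool
line_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   assert(v < n);

   uint32_t r = v % 4;
   if (r != 1 && r != 2)
      return false; /* adjacency-only vertex */
   if (v / 4 >= n / 4)
      return false; /* assumption: incomplete trailing primitive is dropped */

   *prim = v / 4;
   *slot = r - 1;
   return true;
}

/* TRIANGLE_LIST_WITH_ADJACENCY: prim p emits {6p, 6p+2, 6p+4}. */
static bool
tri_list_adj_contrib(uint32_t v, uint32_t n, uint32_t *prim, uint32_t *slot)
{
   assert(v < n);

   uint32_t r = v % 6;
   if (r % 2 != 0)
      return false; /* odd positions are adjacency-only vertices */
   if (v / 6 >= n / 6)
      return false; /* assumption: incomplete trailing primitive is dropped */

   *prim = v / 6;
   *slot = r / 2;
   return true;
}
```

Each vertex produces at most one store in both cases, which is why these topologies fit the same guarded-store shape as the strips without needing a loop.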
## Implications for Phase 2
- **6 of 7 affected topologies have O(1) contributions per VS invocation** — straightforward `nir_push_if` + emit.
- **TRIANGLE_FAN's central vertex needs a NIR loop** — requires `nir_push_loop` and a conditional `nir_break` based on `num_vertices`.
- **The runtime topology switch** is a 7-way branch on the `vs.xfb_topology` sysval (plus a pass-through for LIST topologies). NIR generates clean conditional code, and the Bifrost backend is expected to handle it without special treatment.
## What the sysval `vs.xfb_topology` looks like
An 8-bit integer in the `panvk_graphics_sysvals` struct. Enum values:
```c
enum panvk_xfb_topology {
PANVK_XFB_TOPO_LIST = 0, /* pass-through; current iter13 formula */
PANVK_XFB_TOPO_LINE_STRIP = 1,
PANVK_XFB_TOPO_TRI_STRIP = 2,
PANVK_XFB_TOPO_TRI_FAN = 3,
PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};
```
Driver maps `VkPrimitiveTopology` → `panvk_xfb_topology` at draw time, sets the sysval via `set_gfx_sysval(cmdbuf, dirty, vs.xfb_topology, val)`.
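The draw-time mapping could look roughly like the sketch below. To stay compilable without `vulkan_core.h`, it uses a local stand-in enum whose numeric values mirror `VkPrimitiveTopology`; the stand-in names and the function `xfb_topology_from_vk` are hypothetical, and treating point, patch and unknown topologies as LIST is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for VkPrimitiveTopology (same numeric values as the Vulkan
 * headers) so the sketch builds without them. Names are illustrative. */
enum topo_standin {
   TOPO_POINT_LIST = 0,
   TOPO_LINE_LIST = 1,
   TOPO_LINE_STRIP = 2,
   TOPO_TRIANGLE_LIST = 3,
   TOPO_TRIANGLE_STRIP = 4,
   TOPO_TRIANGLE_FAN = 5,
   TOPO_LINE_LIST_WITH_ADJACENCY = 6,
   TOPO_LINE_STRIP_WITH_ADJACENCY = 7,
   TOPO_TRIANGLE_LIST_WITH_ADJACENCY = 8,
   TOPO_TRIANGLE_STRIP_WITH_ADJACENCY = 9,
   TOPO_PATCH_LIST = 10,
};

enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST = 0,
   PANVK_XFB_TOPO_LINE_STRIP = 1,
   PANVK_XFB_TOPO_TRI_STRIP = 2,
   PANVK_XFB_TOPO_TRI_FAN = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};

/* The sysval is described as 8 bits wide, so every value must fit. */
static_assert(PANVK_XFB_TOPO_TRI_STRIP_ADJ <= UINT8_MAX,
              "xfb topology must fit in 8 bits");

/* Hypothetical helper: maps the API topology to the sysval value. */
static uint8_t
xfb_topology_from_vk(enum topo_standin topo)
{
   switch (topo) {
   case TOPO_LINE_STRIP:                     return PANVK_XFB_TOPO_LINE_STRIP;
   case TOPO_TRIANGLE_STRIP:                 return PANVK_XFB_TOPO_TRI_STRIP;
   case TOPO_TRIANGLE_FAN:                   return PANVK_XFB_TOPO_TRI_FAN;
   case TOPO_LINE_LIST_WITH_ADJACENCY:       return PANVK_XFB_TOPO_LINE_LIST_ADJ;
   case TOPO_LINE_STRIP_WITH_ADJACENCY:      return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
   case TOPO_TRIANGLE_LIST_WITH_ADJACENCY:   return PANVK_XFB_TOPO_TRI_LIST_ADJ;
   case TOPO_TRIANGLE_STRIP_WITH_ADJACENCY:  return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
   default:
      /* Assumption: point/line/triangle/patch lists keep the pass-through. */
      return PANVK_XFB_TOPO_LIST;
   }
}
```

In the driver the result would be the `val` passed to `set_gfx_sysval(cmdbuf, dirty, vs.xfb_topology, val)`.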
## Risk: shader complexity
The lowered shader after iter17 will have:
- 1 sysval load
- 7 conditional branches
- 2-3 conditional stores per branch (except TRI_FAN which has a loop)
- per-store address arithmetic
That's a lot for what was a single `store_global`. On Bifrost (in-order architecture), branches are cheap but the increased instruction count + register pressure could hurt throughput.
Mitigation: most XFB workloads are tiny (per-frame, dozens to thousands of vertices). The throughput cost is irrelevant for the CTS-driven correctness target. Real-world XFB-heavy workloads (rare on Bifrost) might prefer iter13's single-store path, but those aren't impacted by iter17's correctness fix because the LIST topology still uses the fast path (topology == PANVK_XFB_TOPO_LIST → emit single store).
## What to write in Phase 4
NEW file: `src/panfrost/vulkan/panvk_vX_xfb_lower.c` — a panvk-specific replacement for `pan_nir_lower_xfb`. Calls into pieces of pan_nir_lower_xfb for the LIST case (or re-implements its minimal logic) and adds the per-topology contribution branches for the others. Exposed as `panvk_per_arch(nir_lower_xfb)(nir_shader *)`.
MODIFIED: `panvk_vX_shader.c` — replace the `NIR_PASS(_, nir, pan_nir_lower_xfb)` call with `NIR_PASS(_, nir, panvk_per_arch(nir_lower_xfb))`.
MODIFIED: `panvk_shader.h` — add `vs.xfb_topology` to sysval struct.
MODIFIED: `panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals` — at draw time, map current topology to enum + `set_gfx_sysval(..., vs.xfb_topology, mapped)`.
Phase 4 LoC estimate: ~250 (replacement pass) + 30 (sysval threading + draw-time topology map) ≈ 280 LoC.
— claude-noether, 2026-05-21