initial seed: retrofit campaign lineage from local working trees

panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/ — r1..r4 era phase docs (iter1..iter18)
  (libmali stub blobs at iter18/blob/ excluded; 109 MB of RE
  artifacts replaced with a README pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/ — frozen .tgz source snapshots at each milestone
  (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,223 @@
# Phase 2 — design lock for iter17

## Locked decisions

### D1: Replacement pass, not modification of upstream

Write `src/panfrost/vulkan/panvk_vX_xfb_lower.c` as a panvk-specific NIR pass. Call it from `panvk_vX_shader.c` instead of `pan_nir_lower_xfb`. This leaves Panfrost-Gallium and any other panfrost compiler consumers untouched, per [[feedback-no-upstream-proposals]] and Phase 0 safety.

### D2: Runtime topology dispatch via sysval

Add a `vs.xfb_topology` sysval (`uint32_t` in `panvk_graphics_sysvals`, see D7). The driver maps `VkPrimitiveTopology` → `panvk_xfb_topology` enum at draw time. The shader's lowered XFB code switches on this sysval at runtime.

Rejected alternative: per-topology shader variants. That means 7 extra variants per XFB shader; on top of iter13's existing variant doubling, that is a lot of shader cache bloat for marginal runtime benefit. The runtime switch is cheap on Bifrost.

### D3: TRIANGLE_FAN central-vertex handling

**Decision: implement.** The NIR loop is straightforward: `nir_push_loop` with a break bounded by `num_vertices`. Estimated ~30 LoC in the new pass. Closes roughly 22-23 of the 162 winding fails (TRIANGLE_FAN's share: roughly 1/7 of 162 ≈ 23).

Alternative considered: skip TRIANGLE_FAN and document it as not-yet-implemented. That would leave those 22-23 fails on the table. Not worth the docs-vs-code tradeoff; the loop isn't that hard.

### D4: Per-topology contribution emission

For VS invocation `v` on topology `T`, emit conditional stores using `nir_push_if` (eligibility predicate) + `nir_store_global` (address + value).

Each contribution is a `(prim_idx, slot)` pair. Per-topology contribution count:

| Topology | Stores per VS invocation |
|---|---|
| TRIANGLE_STRIP | 1-3 (depends on v's position) |
| LINE_STRIP | 1-2 |
| TRIANGLE_FAN | 1-2, plus O(N) for the central vertex (v=0) via loop |
| LINE_LIST_WITH_ADJACENCY | 0-1 (only when v%4 ∈ {1, 2}) |
| LINE_STRIP_WITH_ADJACENCY | 1-2 |
| TRIANGLE_LIST_WITH_ADJACENCY | 0-1 (only when v%6 ∈ {0, 2, 4}) |
| TRIANGLE_STRIP_WITH_ADJACENCY | 1-3 |

All eligibility predicates are O(1) integer comparisons. All address arithmetic is O(1) integer mul/add. No loops except for TRIANGLE_FAN.

### D5: LIST topologies bypass the new logic

For POINT_LIST, LINE_LIST, TRIANGLE_LIST: keep iter13's single-store fast path. The topology dispatch ladder starts with `if (topology == PANVK_XFB_TOPO_LIST) { iter13_path() }`. The condition depends only on a sysval, so the generic optimizer should be able to hoist it.

### D6: Multiple XFB output channels

The `nir_io_xfb` annotation has up to 4 channels per `store_output`. The current `pan_nir_lower_xfb` loops over them and emits one global store each. Our replacement keeps that outer loop and applies the decomposition logic at the inner store level. Each channel writes to a different offset within the same vertex's output slot.

### D7: Sysval threading

Add to `panvk_graphics_sysvals` struct (in `panvk_shader.h`):

```c
uint32_t xfb_topology; /* panvk_xfb_topology enum */
```

Enum in same header:

```c
enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST = 0,
   PANVK_XFB_TOPO_LINE_STRIP = 1,
   PANVK_XFB_TOPO_TRI_STRIP = 2,
   PANVK_XFB_TOPO_TRI_FAN = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};
```

In `cmd_prepare_draw_sysvals` (around the existing iter13 `vs.num_vertices` line):

```c
uint32_t topo_enum = panvk_topology_to_xfb_enum(
   cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology);
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.xfb_topology, topo_enum);
```

Helper `panvk_topology_to_xfb_enum` lives in `panvk_vX_xfb_lower.c` (or a small helper header).

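
The helper itself is not spelled out above, so the following is a minimal sketch of what the mapping could look like. To stay compilable without the Vulkan headers it uses a stand-in enum (`vk_topology_standin`, a hypothetical name) whose values mirror `VkPrimitiveTopology`. Sending anything outside the seven decomposed topologies, including `PATCH_LIST`, to the LIST path is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for VkPrimitiveTopology so the sketch builds without Vulkan
 * headers; the numeric values match the Vulkan enum. */
enum vk_topology_standin {
   TOPO_POINT_LIST = 0,
   TOPO_LINE_LIST = 1,
   TOPO_LINE_STRIP = 2,
   TOPO_TRIANGLE_LIST = 3,
   TOPO_TRIANGLE_STRIP = 4,
   TOPO_TRIANGLE_FAN = 5,
   TOPO_LINE_LIST_WITH_ADJACENCY = 6,
   TOPO_LINE_STRIP_WITH_ADJACENCY = 7,
   TOPO_TRIANGLE_LIST_WITH_ADJACENCY = 8,
   TOPO_TRIANGLE_STRIP_WITH_ADJACENCY = 9,
   TOPO_PATCH_LIST = 10,
};

/* Same values as the enum in D7. */
enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST = 0,
   PANVK_XFB_TOPO_LINE_STRIP = 1,
   PANVK_XFB_TOPO_TRI_STRIP = 2,
   PANVK_XFB_TOPO_TRI_FAN = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};

/* Map the draw-time topology to the sysval value the shader switches on. */
static enum panvk_xfb_topology
panvk_topology_to_xfb_enum(enum vk_topology_standin topo)
{
   switch (topo) {
   case TOPO_LINE_STRIP:
      return PANVK_XFB_TOPO_LINE_STRIP;
   case TOPO_TRIANGLE_STRIP:
      return PANVK_XFB_TOPO_TRI_STRIP;
   case TOPO_TRIANGLE_FAN:
      return PANVK_XFB_TOPO_TRI_FAN;
   case TOPO_LINE_LIST_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_LINE_LIST_ADJ;
   case TOPO_LINE_STRIP_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_LINE_STRIP_ADJ;
   case TOPO_TRIANGLE_LIST_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_TRI_LIST_ADJ;
   case TOPO_TRIANGLE_STRIP_WITH_ADJACENCY:
      return PANVK_XFB_TOPO_TRI_STRIP_ADJ;
   default:
      /* POINT_LIST, LINE_LIST, TRIANGLE_LIST take the iter13 fast path (D5).
       * Routing everything else here too is an assumption of this sketch. */
      return PANVK_XFB_TOPO_LIST;
   }
}
```

In the real driver the parameter would be `VkPrimitiveTopology` and the case labels the `VK_PRIMITIVE_TOPOLOGY_*` constants; the switch shape stays the same.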
## Code structure

```
src/panfrost/vulkan/
├── panvk_vX_xfb_lower.c   NEW — replacement pass + topology mapping helper
├── panvk_shader.h         MOD — add vs.xfb_topology + enum + load_xfb_topology macro
├── panvk_vX_cmd_draw.c    MOD — set xfb_topology sysval in cmd_prepare_draw_sysvals
└── panvk_vX_shader.c      MOD — replace pan_nir_lower_xfb call with panvk_per_arch(nir_lower_xfb)
```

## NIR pseudocode for the replacement pass

```c
/* Forward declaration: defined below, called from the dispatch ladder. */
static void
emit_tri_strip_stores(nir_builder *b, nir_def *v, nir_def *N,
                      nir_def *instance, nir_def *buf,
                      uint16_t stride, uint16_t offset_bytes,
                      nir_def *value);

static void
lower_xfb_output_iter17(nir_builder *b, nir_intrinsic_instr *intr,
                        unsigned channel_idx, unsigned num_components,
                        unsigned buffer, unsigned offset_words)
{
   uint16_t stride = b->shader->info.xfb_stride[buffer] * 4;
   uint16_t offset_bytes = offset_words * 4;

   nir_def *topology = load_sysval(b, graphics, 32, vs.xfb_topology);
   nir_def *v = nir_load_raw_vertex_id_pan(b);
   nir_def *N = nir_load_num_vertices(b);
   nir_def *instance = nir_load_instance_id(b);
   nir_def *buf = nir_load_xfb_address(b, 64, .base = buffer);
   nir_def *value = nir_channels(b, intr->src[0].ssa,
                                 nir_component_mask(num_components) << channel_idx);

   /* LIST fast path: single store, iter13-compatible formula */
   nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_LIST));
   {
      nir_def *idx = nir_iadd(b, nir_imul(b, instance, N), v);
      nir_def *addr = compute_addr(b, buf, idx, stride, offset_bytes);
      nir_store_global(b, value, addr);
   }
   nir_push_else(b, NULL);
   {
      /* TRIANGLE_STRIP */
      nir_push_if(b, nir_ieq_imm(b, topology, PANVK_XFB_TOPO_TRI_STRIP));
      {
         emit_tri_strip_stores(b, v, N, instance, buf, stride, offset_bytes, value);
      }
      nir_push_else(b, NULL);
      /* ... other topologies ... */
      nir_pop_if(b, NULL);
   }
   nir_pop_if(b, NULL);
}

static void
emit_tri_strip_stores(nir_builder *b, nir_def *v, nir_def *N,
                      nir_def *instance, nir_def *buf,
                      uint16_t stride, uint16_t offset_bytes,
                      nir_def *value)
{
   /* prim p = v, slot 0: when v ≤ N-3 (i.e., v < N-2).
    * Signed compare, so N < 2 yields a negative bound and no store. */
   {
      nir_def *eligible = nir_ilt(b, v, nir_iadd_imm(b, N, -2));
      nir_push_if(b, eligible);
      {
         nir_def *prim = v;
         nir_def *out_idx_in_prim = nir_iadd(b,
            nir_imul(b, instance, ceil_3_times_N(b, N)), /* TODO: precompute */
            nir_iadd(b, nir_imul_imm(b, prim, 3),
                     nir_imm_int(b, 0))); /* slot 0 */
         nir_def *addr = compute_addr(b, buf, out_idx_in_prim, stride, offset_bytes);
         nir_store_global(b, value, addr);
      }
      nir_pop_if(b, NULL);
   }

   /* prim p = v-1, slot = 1 if (v-1) even else 2: when v >= 1 and v ≤ N-2 */
   {
      nir_def *eligible = nir_iand(b, nir_uge_imm(b, v, 1),
                                   nir_ilt(b, v, nir_iadd_imm(b, N, -1)));
      nir_push_if(b, eligible);
      {
         nir_def *prim = nir_iadd_imm(b, v, -1);
         nir_def *parity_even = nir_ieq_imm(b,
            nir_iand_imm(b, prim, 1), 0);
         nir_def *slot = nir_bcsel(b, parity_even,
                                   nir_imm_int(b, 1), nir_imm_int(b, 2));
         /* ... store ... */
      }
      nir_pop_if(b, NULL);
   }

   /* prim p = v-2: when v >= 2 */
   {
      /* analogous */
   }
}
```

For TRIANGLE_FAN central vertex:

```c
/* Special: v == 0 → write to slot 2 of every primitive */
nir_push_if(b, nir_ieq_imm(b, v, 0));
{
   /* Loop p from 0 to N-3 (inclusive), write value to slot 2 of prim p */
   nir_variable *p_var = nir_local_variable_create(b->impl, glsl_uint_type(), "p");
   nir_store_var(b, p_var, nir_imm_int(b, 0), 0x1);
   nir_push_loop(b);
   {
      nir_def *p = nir_load_var(b, p_var);
      /* Break when p >= N-2, written as p+2 >= N so the unsigned compare
       * cannot wrap when N < 2 (N-2 would underflow and the loop would
       * effectively never terminate). */
      nir_push_if(b, nir_uge(b, nir_iadd_imm(b, p, 2), N));
      {
         nir_jump(b, nir_jump_break);
      }
      nir_pop_if(b, NULL);

      /* Single-instance form: the per-instance base from the edge-case
       * section below still has to be added for instanced draws. */
      nir_def *out_idx = nir_iadd_imm(b, nir_imul_imm(b, p, 3), 2); /* slot 2 */
      nir_def *addr = compute_addr(b, buf, out_idx, stride, offset_bytes);
      nir_store_global(b, value, addr);

      nir_store_var(b, p_var, nir_iadd_imm(b, p, 1), 0x1);
   }
   nir_pop_loop(b, NULL);
}
nir_pop_if(b, NULL);
```

## Edge case: instance stride needs the output-side vertex count

In the XFB index calculation, the role that `vs.num_vertices` plays for LIST topologies must be filled by the OUTPUT-SIDE count (`3*(N-2)` for tri_strip, etc.), not the input count.

Solution: don't use `nir_load_num_vertices(b)` for the output index calc in non-LIST branches. Instead, the per-primitive store directly computes `prim * verts_per_prim + slot`, which is the position within one instance's output. The `instance * num_vertices` instance-stride multiplier must ALSO use the output count.

For multi-instance correctness, we need an `output_vertex_count` value that is the DECOMPOSED count per instance. Two ways:

1. Pre-compute it as another sysval, `vs.xfb_output_count = decompose_count(topology, input_count)`, set CPU-side at draw time.
2. Compute it in the shader: a switch over topology plus math (e.g., for tri_strip: `3*(N-2)`).

**Lock: option 1.** Pre-compute on the CPU and set it as the `vs.xfb_output_count` sysval. The arithmetic is trivially cheap on the CPU, and the shader avoids per-VS-invocation math.

So total sysval additions:

- `vs.xfb_topology` (uint32 / enum)
- `vs.xfb_output_count` (uint32): per-instance output vertex count after decomposition

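
As an illustration of option 1, here is a minimal sketch of `decompose_count` and of the output index it feeds. Only the tri_strip formula `3*(N-2)` comes from this document. The other formulas are assumptions based on the usual primitive counts for each topology, the LIST case returns the input count to match the iter13 formula, and `xfb_out_index` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as the enum in D7. */
enum panvk_xfb_topology {
   PANVK_XFB_TOPO_LIST = 0,
   PANVK_XFB_TOPO_LINE_STRIP = 1,
   PANVK_XFB_TOPO_TRI_STRIP = 2,
   PANVK_XFB_TOPO_TRI_FAN = 3,
   PANVK_XFB_TOPO_LINE_LIST_ADJ = 4,
   PANVK_XFB_TOPO_LINE_STRIP_ADJ = 5,
   PANVK_XFB_TOPO_TRI_LIST_ADJ = 6,
   PANVK_XFB_TOPO_TRI_STRIP_ADJ = 7,
};

/* Per-instance output vertex count after decomposition into lists.
 * n is the input vertex count; too-short inputs yield 0 instead of
 * wrapping around. */
static uint32_t
decompose_count(enum panvk_xfb_topology topo, uint32_t n)
{
   switch (topo) {
   case PANVK_XFB_TOPO_LINE_STRIP:
      return n >= 2 ? 2 * (n - 1) : 0;
   case PANVK_XFB_TOPO_TRI_STRIP:
   case PANVK_XFB_TOPO_TRI_FAN:
      return n >= 3 ? 3 * (n - 2) : 0;
   case PANVK_XFB_TOPO_LINE_LIST_ADJ:
      return 2 * (n / 4);
   case PANVK_XFB_TOPO_LINE_STRIP_ADJ:
      return n >= 4 ? 2 * (n - 3) : 0;
   case PANVK_XFB_TOPO_TRI_LIST_ADJ:
      return 3 * (n / 6);
   case PANVK_XFB_TOPO_TRI_STRIP_ADJ:
      return n >= 6 ? 3 * ((n - 4) / 2) : 0;
   case PANVK_XFB_TOPO_LIST:
   default:
      return n; /* iter13 path: one output vertex per input vertex */
   }
}

/* Output vertex index of (prim, slot) for a given instance, using the
 * decomposed count as the instance stride. */
static uint32_t
xfb_out_index(uint32_t instance, uint32_t output_count,
              uint32_t prim, uint32_t verts_per_prim, uint32_t slot)
{
   assert(slot < verts_per_prim);
   return instance * output_count + prim * verts_per_prim + slot;
}
```

With a 5-vertex triangle strip, `decompose_count` gives 9, so instance 1 starts at output vertex 9 and its prim 2, slot 1 lands at index 16. Using the input count 5 as the stride would instead make consecutive instances overlap.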
## Phase 3 next

The probe already exists at `iter16/probe_winding.c`; reuse it. Phase 4 does the actual implementation.

— claude-noether, 2026-05-21