initial seed: retrofit campaign lineage from local working trees

panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00
parent 430d0da278
commit a4e7d8ab90
124 changed files with 22551 additions and 1 deletions
# Phase 2 — design lock for iter16
## Decisions
### Q1: Where does decomposition happen — CPU or GPU?
**Decision: CPU-side index buffer construction.**
Per-draw CPU cost: building a decomposed index buffer for a 4K-vertex triangle strip takes about 12K integer writes (3(N-2) indices), which is microseconds of work and negligible against the per-frame budget. The alternative (a compute shader) adds shader compile and dispatch overhead per draw, which is worse for small draws. For huge meshes (>100K vertices) the trade-off reverses, but XFB on strip topologies is uncommon in real-world apps, and apps that do hit it can be handled by a future GPU-path optimization without an ABI change.
### Q2: Path 2 (CmdDrawIndexed + non-LIST + XFB) — what's the strategy?
**Decision: deferred to follow-up iter.** iter16 handles only CmdDraw (non-indexed) + non-LIST + XFB.
Rationale: CTS's `winding_*` tests use **non-indexed draws**. The 162 fails categorized in iter15 are all from non-indexed paths. Fixing those gets us the parity number we promised the operator. CmdDrawIndexed + non-LIST + XFB exists as a real case but isn't in the CTS subset we measured — adding it would expand scope without moving the measured pass-rate number that's the campaign artifact.
For iter16, we **detect** CmdDrawIndexed + non-LIST + XFB, emit a `mesa_loge` warning, and still capture (with wrong winding). That's a known soft gap. A future iter17 can add the compute-shader path if needed.
### Q3: How to save/restore user's bind state?
**Decision: snapshot before override, restore after `panvk_cmd_draw_indirect` returns.**
```c
/* Before override */
struct panvk_cmd_index_buffer_state ib_save = cmdbuf->state.gfx.ib;
VkPrimitiveTopology topo_save = cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
/* Override + dispatch */
cmdbuf->state.gfx.ib.dev_addr = synthetic_buf.gpu;
cmdbuf->state.gfx.ib.size = decomposed_count * 4;
cmdbuf->state.gfx.ib.index_size = 4;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = list_equiv(topo_save);
/* Dispatch as indexed-LIST */
panvk_cmd_draw_indirect(cmdbuf, &draw_with_decomposed_count);
/* Restore */
cmdbuf->state.gfx.ib = ib_save;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = topo_save;
```
The dirty-tracking mechanism will re-mark IB and topology dirty on the next user-issued draw, so the synthetic state is correctly invalidated.
### Q4: Where does the decomposition table live?
**Decision: a small static-data table in a new file `panvk_vX_winding.c` (under PAN_ARCH < 9 gate).**
Per-topology entries:
- `vertices_per_primitive_after_decomp` (2 or 3)
- `primitive_count(input_vert_count)` lambda
- `decompose_vertex(prim_idx, vert_in_prim) → input_vert_index` lambda
- `equivalent_list_topology` enum
API:
```c
struct panvk_winding_table {
   uint32_t verts_per_prim;                                      /* 2 (lines) or 3 (triangles) */
   uint32_t (*prim_count)(uint32_t in_count);                    /* primitives in in_count input vertices */
   uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);  /* returns the input vertex index */
   VkPrimitiveTopology list_equiv;                               /* LIST topology to draw with */
};
const struct panvk_winding_table *panvk_get_winding_table(VkPrimitiveTopology);
/* Returns NULL for topologies that don't need decomposition
 * (POINT_LIST, LINE_LIST, TRIANGLE_LIST). */
```
Caller:
```c
const struct panvk_winding_table *wt = panvk_get_winding_table(topo);
if (wt && cmdbuf->state.gfx.xfb.active) {
   uint32_t n_prim = wt->prim_count(input_vert_count);
   uint32_t out_count = n_prim * wt->verts_per_prim;
   struct pan_ptr buf = panvk_cmd_alloc_dev_mem(cmdbuf, desc, out_count * 4, 8);
   uint32_t *idx = buf.cpu;
   for (uint32_t p = 0; p < n_prim; p++)
      for (uint32_t v = 0; v < wt->verts_per_prim; v++)
         *idx++ = wt->decompose(p, v);
   /* Override IB + topology + draw as indexed-LIST */
}
```
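The `prim_count` and `decompose` callbacks are only declared above. A minimal sketch of what the TRIANGLE_STRIP and TRIANGLE_FAN entries might look like follows, assuming the vertex order from the Vulkan primitive-topology definitions (strip triangle i is {i, i+1, i+2} for even i and {i, i+2, i+1} for odd i; fan triangle i is {i+1, i+2, 0}). The names `winding_entry`, `tri_prim_count`, `tri_strip_decompose`, `tri_fan_decompose`, and `build_indices` are hypothetical, and the order the real table uses may differ, for example under a different provoking-vertex convention.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the Q4 table entry; the list_equiv field is
 * omitted so the sketch compiles without Vulkan headers. */
struct winding_entry {
   uint32_t verts_per_prim;
   uint32_t (*prim_count)(uint32_t in_count);
   uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
};

/* N vertices form N-2 triangles in both strips and fans; fewer than
 * three vertices form none. */
uint32_t tri_prim_count(uint32_t in_count)
{
   return in_count >= 3 ? in_count - 2 : 0;
}

/* Strip triangle i: {i, i+1, i+2} when i is even, {i, i+2, i+1} when odd,
 * so every output triangle has the same winding as the first one. */
uint32_t tri_strip_decompose(uint32_t prim_idx, uint32_t vert_idx)
{
   static const uint32_t offs[2][3] = { {0, 1, 2}, {0, 2, 1} };
   assert(vert_idx < 3);
   return prim_idx + offs[prim_idx & 1][vert_idx];
}

/* Fan triangle i: {i+1, i+2, 0}. */
uint32_t tri_fan_decompose(uint32_t prim_idx, uint32_t vert_idx)
{
   assert(vert_idx < 3);
   return vert_idx == 2 ? 0 : prim_idx + 1 + vert_idx;
}

const struct winding_entry tri_strip_entry = { 3, tri_prim_count, tri_strip_decompose };
const struct winding_entry tri_fan_entry = { 3, tri_prim_count, tri_fan_decompose };

/* Same double loop as the caller above, writing into a plain array.
 * Returns the number of indices written. */
uint32_t build_indices(const struct winding_entry *wt, uint32_t in_count,
                       uint32_t *idx)
{
   uint32_t n_prim = wt->prim_count(in_count);
   uint32_t n = 0;
   for (uint32_t p = 0; p < n_prim; p++)
      for (uint32_t v = 0; v < wt->verts_per_prim; v++)
         idx[n++] = wt->decompose(p, v);
   return n;
}
```

With these assumptions, an 8-vertex strip yields 18 indices beginning 0, 1, 2, 1, 3, 2, and a 5-vertex fan yields 9 indices beginning 1, 2, 0.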
### Q5: How does `vs.num_vertices` sysval track decomposed count?
**Decision: at sysval upload time, check `cmdbuf->state.gfx.xfb.decomposed_count != 0` and use it instead of `info->vertex.count`.**
Add a field `uint32_t decomposed_count` to `cmdbuf->state.gfx.xfb`. Set in the new decomposition path. Reset to 0 after restore.
In `cmd_prepare_draw_sysvals` (around the existing iter13 `set_gfx_sysval(... vs.num_vertices, info->vertex.count)` line):
```c
uint32_t nv = cmdbuf->state.gfx.xfb.decomposed_count
                 ? cmdbuf->state.gfx.xfb.decomposed_count
                 : info->vertex.count;
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, nv);
```
### Q6: Topology classification — which need decomposition?
**Decision:**
| Topology | Decomposed? | Output verts | List equiv |
|---|---|---|---|
| POINT_LIST | No | input | (same) |
| LINE_LIST | No | input | (same) |
| LINE_STRIP | **Yes** | 2(N-1) | LINE_LIST |
| TRIANGLE_LIST | No | input | (same) |
| TRIANGLE_STRIP | **Yes** | 3(N-2) | TRIANGLE_LIST |
| TRIANGLE_FAN | **Yes** | 3(N-2) | TRIANGLE_LIST |
| LINE_LIST_WITH_ADJACENCY | **Yes** | N/2 | LINE_LIST (drop adjacency verts) |
| LINE_STRIP_WITH_ADJACENCY | **Yes** | 2(N-3) | LINE_LIST |
| TRIANGLE_LIST_WITH_ADJACENCY | **Yes** | N/2 | TRIANGLE_LIST |
| TRIANGLE_STRIP_WITH_ADJACENCY | **Yes** | 3(N/2-2) | TRIANGLE_LIST |
| PATCH_LIST | N/A (tess not advertised) | — | — |
Seven topologies need decomposition tables. Each is a small lambda + count formula.
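The "Output verts" column can be written as a single count function. This is a sketch only: `enum topo` is a local stand-in for `VkPrimitiveTopology` so the block compiles without Vulkan headers, and `decomposed_vertex_count` is a hypothetical name. It also assumes that draws with too few vertices, or with a trailing incomplete primitive, produce no output for the incomplete part, which the table does not spell out.

```c
#include <stdint.h>

/* Local stand-in for VkPrimitiveTopology (hypothetical names). */
enum topo {
   TOPO_POINT_LIST,
   TOPO_LINE_LIST,
   TOPO_LINE_STRIP,
   TOPO_TRIANGLE_LIST,
   TOPO_TRIANGLE_STRIP,
   TOPO_TRIANGLE_FAN,
   TOPO_LINE_LIST_ADJ,
   TOPO_LINE_STRIP_ADJ,
   TOPO_TRIANGLE_LIST_ADJ,
   TOPO_TRIANGLE_STRIP_ADJ,
};

/* Number of vertices in the decomposed LIST draw for n input vertices.
 * Topologies that are not decomposed return n unchanged. */
uint32_t decomposed_vertex_count(enum topo t, uint32_t n)
{
   switch (t) {
   case TOPO_LINE_STRIP:
      return n >= 2 ? 2 * (n - 1) : 0;
   case TOPO_TRIANGLE_STRIP:
   case TOPO_TRIANGLE_FAN:
      return n >= 3 ? 3 * (n - 2) : 0;
   case TOPO_LINE_LIST_ADJ:
      return 2 * (n / 4);           /* N/2 when N is a multiple of 4 */
   case TOPO_LINE_STRIP_ADJ:
      return n >= 4 ? 2 * (n - 3) : 0;
   case TOPO_TRIANGLE_LIST_ADJ:
      return 3 * (n / 6);           /* N/2 when N is a multiple of 6 */
   case TOPO_TRIANGLE_STRIP_ADJ:
      return n >= 6 ? 3 * (n / 2 - 2) : 0;
   default:
      return n;                     /* POINT_LIST, LINE_LIST, TRIANGLE_LIST */
   }
}
```

For example, `decomposed_vertex_count(TOPO_TRIANGLE_STRIP, 8)` evaluates to 3(8-2) = 18.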
### Q7: When does the iter16 path NOT activate?
- XFB not active: no-op (fast path unchanged)
- POINT_LIST, LINE_LIST, or TRIANGLE_LIST topology: no-op (the `*_LIST_WITH_ADJACENCY` variants are decomposed, per Q6)
- CmdDrawIndexed (any topology): falls through with warning log (Q2)
- Tessellation (PATCH_LIST): we don't expose, never hit
- Geometry shaders: not exposed, never hit
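The activation rules above can be read as one small decision function. The sketch below is an illustration, not the driver's code: `enum topo`, `enum iter16_action`, `needs_decomposition`, and `iter16_classify` are hypothetical names, and it assumes (following Q2) that the warning is only emitted for indexed draws whose topology would otherwise have been decomposed. Indirect draws are not modeled.

```c
#include <stdbool.h>

/* Local stand-in for VkPrimitiveTopology (hypothetical names). */
enum topo {
   TOPO_POINT_LIST,
   TOPO_LINE_LIST,
   TOPO_LINE_STRIP,
   TOPO_TRIANGLE_LIST,
   TOPO_TRIANGLE_STRIP,
   TOPO_TRIANGLE_FAN,
   TOPO_LINE_LIST_ADJ,
   TOPO_LINE_STRIP_ADJ,
   TOPO_TRIANGLE_LIST_ADJ,
   TOPO_TRIANGLE_STRIP_ADJ,
};

enum iter16_action {
   PASS_THROUGH,          /* fast path, nothing changes */
   WARN_AND_PASS_THROUGH, /* indexed draw: log and capture as-is */
   DECOMPOSE,             /* build the synthetic index buffer */
};

/* True for the seven topologies marked "Yes" in the Q6 table. */
static bool needs_decomposition(enum topo t)
{
   switch (t) {
   case TOPO_POINT_LIST:
   case TOPO_LINE_LIST:
   case TOPO_TRIANGLE_LIST:
      return false;
   default:
      return true;
   }
}

enum iter16_action iter16_classify(bool xfb_active, bool indexed, enum topo t)
{
   if (!xfb_active || !needs_decomposition(t))
      return PASS_THROUGH;
   if (indexed)
      return WARN_AND_PASS_THROUGH;
   return DECOMPOSE;
}
```

Checking XFB and topology first keeps the common case (no transform feedback) on a single early return.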
## Scope confirmation
- **In:** `vkCmdDraw` + LINE_STRIP / TRIANGLE_STRIP / TRIANGLE_FAN / *_WITH_ADJACENCY topologies + XFB active → driver-side decomposition
- **Out:** indexed draws (`vkCmdDrawIndexed`) — warning only
- **Out:** indirect draws (`vkCmdDrawIndirect`) — unchanged behavior
- **Expected CTS delta:** all 162 winding fails → Pass (since they all use non-indexed strip/fan draws)
- **Expected CTS new fails:** none
## Phase 3 next
Write `probe_winding.c` that exercises XFB+triangle_strip with 8 vertices, captures, and verifies the expected 18-vertex decomposed output. Same probe scaffolding as iter13's probe_xfb.c.
— claude-noether, 2026-05-21