initial seed: retrofit campaign lineage from local working trees

panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan video decode) shipped before this repo existed; the deliverable patches live in marfrit-packages, but the reasoning chain, phase docs, and source-state evidence lived only in local working trees on the development host.

This retrofit imports:
- mesa-panvk-bifrost/: r1..r4 era phase docs (iter1..iter18) (libmali stub blobs at iter18/blob/ excluded; 109MB of RE artifacts replaced with a README pointer)
- mesa-panvk-bifrost-video/: sibling campaign phase docs + probe
- evidence/: frozen .tgz source snapshots at each milestone (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is a commit rather than a snapshot. See [[feedback-session-local-process-pins]] for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00
parent 430d0da278
commit a4e7d8ab90
124 changed files with 22551 additions and 1 deletions
@@ -0,0 +1,139 @@
+# Phase 2 — design lock for iter16
+
+## Decisions
+
+### Q1: Where does decomposition happen — CPU or GPU?
+
+**Decision: CPU-side index buffer construction.**
+
+Per-draw CPU cost: building a decomposed index buffer for a 4K-vertex triangle strip takes 3(N-2), i.e. roughly 12K, 32-bit index writes, which is on the order of microseconds and negligible against the per-frame budget. The alternative (a compute shader) adds shader-compile and dispatch overhead per draw, which is worse for small draws. For huge meshes (>100K vertices) the trade-off flips, but XFB on strip topologies is uncommon in real-world apps, and apps that do hit it can be served by a future GPU-path optimization without an ABI change.
+
+### Q2: Path 2 (CmdDrawIndexed + non-LIST + XFB) — what's the strategy?
+
+**Decision: deferred to follow-up iter.** iter16 handles only CmdDraw (non-indexed) + non-LIST + XFB.
+
+Rationale: CTS's `winding_*` tests use **non-indexed draws**, and the 162 fails categorized in iter15 all come from non-indexed paths. Fixing those delivers the parity number we promised the operator. CmdDrawIndexed + non-LIST + XFB is a real case, but it isn't in the CTS subset we measured; adding it would expand scope without moving the measured pass rate, which is the campaign artifact.
+
+For iter16, we **detect** CmdDrawIndexed + non-LIST + XFB, emit a `mesa_loge` warning, and still capture (with wrong winding). That's a known soft gap. A future iter17 can add the compute-shader path if needed.
+
+### Q3: How to save/restore user's bind state?
+
+**Decision: snapshot before override, restore after `panvk_cmd_draw_indirect` returns.**
+
+```c
+/* Before override */
+struct panvk_cmd_index_buffer_state ib_save = cmdbuf->state.gfx.ib;
+VkPrimitiveTopology topo_save = cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
+
+/* Override + dispatch */
+cmdbuf->state.gfx.ib.dev_addr = synthetic_buf.gpu;
+cmdbuf->state.gfx.ib.size = decomposed_count * 4;
+cmdbuf->state.gfx.ib.index_size = 4;
+cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = list_equiv(topo_save);
+/* Dispatch as indexed-LIST */
+panvk_cmd_draw_indirect(cmdbuf, &draw_with_decomposed_count);
+
+/* Restore */
+cmdbuf->state.gfx.ib = ib_save;
+cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = topo_save;
+```
+
+The dirty-tracking mechanism will re-mark IB and topology dirty on the next user-issued draw, so the synthetic state is correctly invalidated.
+
+### Q4: Where does the decomposition table live?
+
+**Decision: a small static-data table in a new file `panvk_vX_winding.c` (under PAN_ARCH < 9 gate).**
+
+Per-topology entries:
+- `vertices_per_primitive_after_decomp` (2 or 3)
+- `primitive_count(input_vert_count)` function
+- `decompose_vertex(prim_idx, vert_in_prim) → input_vert_index` function
+- `equivalent_list_topology` enum
+
+API:
+
+```c
+struct panvk_winding_table {
+    uint32_t verts_per_prim;
+    uint32_t (*prim_count)(uint32_t in_count);
+    uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
+    VkPrimitiveTopology list_equiv;
+};
+
+/* Returns NULL for topologies that don't need decomposition
+ * (POINT_LIST, LINE_LIST, TRIANGLE_LIST; see the Q6 table). */
+const struct panvk_winding_table *panvk_get_winding_table(VkPrimitiveTopology);
+```
+
+Caller:
+
+```c
+const struct panvk_winding_table *wt = panvk_get_winding_table(topo);
+if (wt && cmdbuf->state.gfx.xfb.active) {
+    uint32_t n_prim = wt->prim_count(input_vert_count);
+    uint32_t out_count = n_prim * wt->verts_per_prim;
+    struct pan_ptr buf = panvk_cmd_alloc_dev_mem(cmdbuf, desc, out_count * 4, 8);
+    uint32_t *idx = buf.cpu;
+    for (uint32_t p = 0; p < n_prim; p++)
+        for (uint32_t v = 0; v < wt->verts_per_prim; v++)
+            *idx++ = wt->decompose(p, v);
+    /* Override IB + topology + draw as indexed-LIST */
+}
+```
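+
+A minimal, self-contained sketch of how such a table could be filled in for three of the topologies. The `enum topo` values, the `winding_table` struct, and every function name below are stand-ins invented for illustration, not the driver's actual symbols. The vertex orderings follow the Vulkan spec's primitive definitions (strip triangle `p` is `{p, p+(1+p%2), p+(2-p%2)}`, fan triangle `p` is `{p+1, p+2, 0}`). The count helpers clamp short inputs to zero primitives, which is an assumption about how degenerate draws should be treated.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Stand-ins for VkPrimitiveTopology; numeric values match the Vulkan enum. */
+enum topo {
+    TOPO_LINE_LIST = 1,
+    TOPO_LINE_STRIP = 2,
+    TOPO_TRIANGLE_LIST = 3,
+    TOPO_TRIANGLE_STRIP = 4,
+    TOPO_TRIANGLE_FAN = 5,
+};
+
+struct winding_table {
+    uint32_t verts_per_prim;
+    uint32_t (*prim_count)(uint32_t in_count);
+    uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
+    enum topo list_equiv;
+};
+
+/* Primitive counts, clamped so short inputs give zero instead of wrapping. */
+static uint32_t count_line_strip(uint32_t n) { return n >= 2 ? n - 1 : 0; }
+static uint32_t count_tri(uint32_t n) { return n >= 3 ? n - 2 : 0; }
+
+/* Line p of a strip is {p, p + 1}. */
+static uint32_t line_strip_vtx(uint32_t p, uint32_t v) { return p + v; }
+
+/* Odd strip triangles swap their last two vertices to keep the winding. */
+static uint32_t tri_strip_vtx(uint32_t p, uint32_t v)
+{
+    if (v == 0)
+        return p;
+    return v == 1 ? p + 1 + (p & 1) : p + 2 - (p & 1);
+}
+
+/* Fan triangle p is {p + 1, p + 2, 0}. */
+static uint32_t tri_fan_vtx(uint32_t p, uint32_t v)
+{
+    return v == 2 ? 0 : p + 1 + v;
+}
+
+/* Returns NULL for topologies this sketch does not decompose. */
+static const struct winding_table *get_winding_table(enum topo t)
+{
+    static const struct winding_table line_strip = {
+        2, count_line_strip, line_strip_vtx, TOPO_LINE_LIST };
+    static const struct winding_table tri_strip = {
+        3, count_tri, tri_strip_vtx, TOPO_TRIANGLE_LIST };
+    static const struct winding_table tri_fan = {
+        3, count_tri, tri_fan_vtx, TOPO_TRIANGLE_LIST };
+
+    switch (t) {
+    case TOPO_LINE_STRIP:     return &line_strip;
+    case TOPO_TRIANGLE_STRIP: return &tri_strip;
+    case TOPO_TRIANGLE_FAN:   return &tri_fan;
+    default:                  return NULL;
+    }
+}
+
+/* Writes the decomposed indices into out[] and returns how many were written. */
+static uint32_t fill_indices(const struct winding_table *wt,
+                             uint32_t in_count, uint32_t *out)
+{
+    uint32_t n_prim = wt->prim_count(in_count);
+    uint32_t n = 0;
+
+    for (uint32_t p = 0; p < n_prim; p++)
+        for (uint32_t v = 0; v < wt->verts_per_prim; v++)
+            out[n++] = wt->decompose(p, v);
+    return n;
+}
+```
+
+With this sketch, an 8-vertex TRIANGLE_STRIP yields 18 indices starting `0,1,2, 1,3,2, 2,3,4`, and a 4-vertex TRIANGLE_FAN yields `1,2,0, 2,3,0`.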
+
+### Q5: How does `vs.num_vertices` sysval track decomposed count?
+
+**Decision: at sysval upload time, check `cmdbuf->state.gfx.xfb.decomposed_count != 0` and use it instead of `info->vertex.count`.**
+
+Add a field `uint32_t decomposed_count` to `cmdbuf->state.gfx.xfb`. Set in the new decomposition path. Reset to 0 after restore.
+
+In `cmd_prepare_draw_sysvals` (around the existing iter13 `set_gfx_sysval(... vs.num_vertices, info->vertex.count)` line):
+
+```c
+uint32_t nv = cmdbuf->state.gfx.xfb.decomposed_count
+              ? cmdbuf->state.gfx.xfb.decomposed_count
+              : info->vertex.count;
+set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, nv);
+```
+
+### Q6: Topology classification — which need decomposition?
+
+**Decision:**
+
+| Topology | Decomposed? | Output verts | List equiv |
+|---|---|---|---|
+| POINT_LIST | No | input | (same) |
+| LINE_LIST | No | input | (same) |
+| LINE_STRIP | **Yes** | 2(N-1) | LINE_LIST |
+| TRIANGLE_LIST | No | input | (same) |
+| TRIANGLE_STRIP | **Yes** | 3(N-2) | TRIANGLE_LIST |
+| TRIANGLE_FAN | **Yes** | 3(N-2) | TRIANGLE_LIST |
+| LINE_LIST_WITH_ADJACENCY | **Yes** | N/2 | LINE_LIST (drop adjacency verts) |
+| LINE_STRIP_WITH_ADJACENCY | **Yes** | 2(N-3) | LINE_LIST |
+| TRIANGLE_LIST_WITH_ADJACENCY | **Yes** | N/2 | TRIANGLE_LIST |
+| TRIANGLE_STRIP_WITH_ADJACENCY | **Yes** | 3(N/2-2) | TRIANGLE_LIST |
+| PATCH_LIST | N/A (tess not advertised) | — | — |
+
+Seven topologies need decomposition tables. Each is a small decompose function plus a count formula.
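+
+The "Output verts" column can be written down as one function. This is a sketch only: the `enum topo` names and `decomposed_vertex_count` are hypothetical, the enum values mirror `VkPrimitiveTopology`, and the handling of too-short or non-multiple inputs (dropping incomplete primitives, returning 0) is an assumption the table itself does not state.
+
+```c
+#include <stdint.h>
+
+/* Stand-ins for VkPrimitiveTopology; numeric values match the Vulkan enum. */
+enum topo {
+    TOPO_POINT_LIST = 0,
+    TOPO_LINE_LIST = 1,
+    TOPO_LINE_STRIP = 2,
+    TOPO_TRIANGLE_LIST = 3,
+    TOPO_TRIANGLE_STRIP = 4,
+    TOPO_TRIANGLE_FAN = 5,
+    TOPO_LINE_LIST_ADJ = 6,
+    TOPO_LINE_STRIP_ADJ = 7,
+    TOPO_TRIANGLE_LIST_ADJ = 8,
+    TOPO_TRIANGLE_STRIP_ADJ = 9,
+};
+
+/* Vertices emitted by the equivalent LIST draw for n input vertices.
+ * Topologies that are not decomposed return n unchanged. */
+static uint32_t decomposed_vertex_count(enum topo t, uint32_t n)
+{
+    switch (t) {
+    case TOPO_LINE_STRIP:
+        return n >= 2 ? 2 * (n - 1) : 0;
+    case TOPO_TRIANGLE_STRIP:
+    case TOPO_TRIANGLE_FAN:
+        return n >= 3 ? 3 * (n - 2) : 0;
+    case TOPO_LINE_LIST_ADJ:
+        return 2 * (n / 4);          /* 4 in, 2 out per line: N/2 */
+    case TOPO_LINE_STRIP_ADJ:
+        return n >= 4 ? 2 * (n - 3) : 0;
+    case TOPO_TRIANGLE_LIST_ADJ:
+        return 3 * (n / 6);          /* 6 in, 3 out per triangle: N/2 */
+    case TOPO_TRIANGLE_STRIP_ADJ:
+        return n >= 6 ? 3 * ((n - 4) / 2) : 0;   /* 3(N/2 - 2) */
+    default:
+        return n;
+    }
+}
+```
+
+For N = 8 this gives 18 for TRIANGLE_STRIP and TRIANGLE_FAN, 14 for LINE_STRIP, and 6 for TRIANGLE_STRIP_WITH_ADJACENCY.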
+
+### Q7: When does the iter16 path NOT activate?
+
+- XFB not active: no-op (fast path unchanged)
+- POINT_LIST, LINE_LIST, or TRIANGLE_LIST topology: no-op (the `*_LIST_WITH_ADJACENCY` variants are decomposed, per Q6)
+- CmdDrawIndexed (any topology): falls through with warning log (Q2)
+- Tessellation (PATCH_LIST): we don't expose, never hit
+- Geometry shaders: not exposed, never hit
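+
+The conditions above, together with Q2, reduce to a small classifier. The sketch below is illustrative only: `enum xfb_path`, `classify_draw`, and the `enum topo` stand-ins are hypothetical names, and it follows the Q2 reading in which the warning is emitted only for indexed draws that would otherwise have been decomposed. Indirect draws are not modeled.
+
+```c
+#include <stdbool.h>
+
+/* Stand-ins for VkPrimitiveTopology; numeric values match the Vulkan enum. */
+enum topo {
+    TOPO_POINT_LIST = 0,
+    TOPO_LINE_LIST = 1,
+    TOPO_LINE_STRIP = 2,
+    TOPO_TRIANGLE_LIST = 3,
+    TOPO_TRIANGLE_STRIP = 4,
+    TOPO_TRIANGLE_FAN = 5,
+    TOPO_LINE_LIST_ADJ = 6,
+    TOPO_LINE_STRIP_ADJ = 7,
+    TOPO_TRIANGLE_LIST_ADJ = 8,
+    TOPO_TRIANGLE_STRIP_ADJ = 9,
+    TOPO_PATCH_LIST = 10,
+};
+
+enum xfb_path {
+    XFB_PATH_FAST,       /* existing path, untouched */
+    XFB_PATH_DECOMPOSE,  /* iter16 CPU-side decomposition */
+    XFB_PATH_WARN_ONLY,  /* indexed draw: log and capture as-is */
+};
+
+/* True for the seven topologies marked "Yes" in the Q6 table. */
+static bool topo_needs_decomposition(enum topo t)
+{
+    switch (t) {
+    case TOPO_LINE_STRIP:
+    case TOPO_TRIANGLE_STRIP:
+    case TOPO_TRIANGLE_FAN:
+    case TOPO_LINE_LIST_ADJ:
+    case TOPO_LINE_STRIP_ADJ:
+    case TOPO_TRIANGLE_LIST_ADJ:
+    case TOPO_TRIANGLE_STRIP_ADJ:
+        return true;
+    default:
+        return false;
+    }
+}
+
+/* Picks the path for one direct draw. */
+static enum xfb_path classify_draw(bool xfb_active, bool indexed, enum topo t)
+{
+    if (!xfb_active || !topo_needs_decomposition(t))
+        return XFB_PATH_FAST;
+    return indexed ? XFB_PATH_WARN_ONLY : XFB_PATH_DECOMPOSE;
+}
+```
+
+Checking XFB first keeps the common case (no transform feedback bound) to a single branch, which matches the "fast path unchanged" goal.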
+
+## Scope confirmation
+
+- **In:** `vkCmdDraw` + LINE_STRIP / TRIANGLE_STRIP / TRIANGLE_FAN / `*_WITH_ADJACENCY` topologies + XFB active → driver-side decomposition
+- **Out:** indexed draws (`vkCmdDrawIndexed`) — warning only
+- **Out:** indirect draws (`vkCmdDrawIndirect`) — unchanged behavior
+- **Expected CTS delta:** all 162 winding fails → Pass (since they all use non-indexed strip/fan draws)
+- **Expected CTS new fails:** none
+
+## Phase 3 next
+
+Write `probe_winding.c` that exercises XFB + TRIANGLE_STRIP with 8 vertices, captures the output, and verifies the expected 18-vertex decomposed output (3(N-2) with N = 8). Same probe scaffolding as iter13's `probe_xfb.c`.
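+
+One way the host-side check in such a probe could look, as a sketch under assumptions: the zigzag vertex layout (vertex `i` at `(i / 2, i % 2)`), the helper names, and the idea of testing winding through the sign of the 2D signed area are all illustrative choices, not the probe's actual design. It builds the expected 18-index reference for an 8-vertex strip and confirms that every decomposed triangle keeps the same orientation.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Vertex i of strip triangle p: {p, p + (1 + p % 2), p + (2 - p % 2)}. */
+static uint32_t strip_vertex(uint32_t p, uint32_t i)
+{
+    if (i == 0)
+        return p;
+    return i == 1 ? p + 1 + (p & 1) : p + 2 - (p & 1);
+}
+
+/* Hypothetical probe geometry: a zigzag strip on a unit grid. */
+static int32_t px(uint32_t i) { return (int32_t)(i / 2); }
+static int32_t py(uint32_t i) { return (int32_t)(i % 2); }
+
+/* Twice the signed area of triangle (a, b, c); the sign encodes winding. */
+static int32_t signed_area2(uint32_t a, uint32_t b, uint32_t c)
+{
+    return (px(b) - px(a)) * (py(c) - py(a)) -
+           (px(c) - px(a)) * (py(b) - py(a));
+}
+
+/* Fills out[] with the reference indices and returns how many were written. */
+static uint32_t expected_strip_indices(uint32_t n_verts, uint32_t *out)
+{
+    uint32_t n = 0;
+
+    for (uint32_t p = 0; p + 2 < n_verts; p++)
+        for (uint32_t i = 0; i < 3; i++)
+            out[n++] = strip_vertex(p, i);
+    return n;
+}
+
+/* True when every triangle in idx[] has the same winding as the first. */
+static bool winding_is_uniform(const uint32_t *idx, uint32_t count)
+{
+    if (count < 3)
+        return true;
+
+    int32_t first = signed_area2(idx[0], idx[1], idx[2]);
+    for (uint32_t t = 3; t + 2 < count; t += 3) {
+        int32_t a = signed_area2(idx[t], idx[t + 1], idx[t + 2]);
+        if ((a > 0) != (first > 0))
+            return false;
+    }
+    return true;
+}
+```
+
+A naive decomposition that emits `{p, p+1, p+2}` for every triangle fails `winding_is_uniform` on the second triangle, which is the failure mode the swap on odd triangles avoids.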
+
+— claude-noether, 2026-05-21