# Phase 2 — design lock for iter16

## Decisions

### Q1: Where does decomposition happen: CPU or GPU?

**Decision: CPU-side index buffer construction.**

Per-draw CPU cost: building a decomposed index buffer for a 4K-vertex strip is ~12K integer writes, which takes microseconds. That is negligible against the per-frame budget. The alternative (compute shader) adds shader compile + dispatch overhead per draw, which is worse for small draws.

For huge meshes (>100K vertices) the calculation flips, but XFB on strip topologies in real-world apps is uncommon, and apps that do hit it can be handled with a future GPU-path optimization without ABI change.

### Q2: Path 2 (CmdDrawIndexed + non-LIST + XFB): what's the strategy?

**Decision: deferred to follow-up iter.** iter16 handles only CmdDraw (non-indexed) + non-LIST + XFB.

Rationale: CTS's `winding_*` tests use **non-indexed draws**. The 162 fails categorized in iter15 are all from non-indexed paths. Fixing those gets us the parity number we promised the operator. CmdDrawIndexed + non-LIST + XFB exists as a real case but isn't in the CTS subset we measured. Adding it would expand scope without moving the measured pass-rate number that's the campaign artifact.

For iter16, we **detect** CmdDrawIndexed + non-LIST + XFB and produce a `mesa_loge` warning + still capture (with wrong winding). That's a known soft-gap. Future iter17 can add the compute-shader path if needed.

### Q3: How to save/restore user's bind state?

**Decision: snapshot before override, restore after `panvk_cmd_draw_indirect` returns.**

```c
/* Before override */
struct panvk_cmd_index_buffer_state ib_save = cmdbuf->state.gfx.ib;
VkPrimitiveTopology topo_save =
   cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;

/* Override + dispatch */
cmdbuf->state.gfx.ib.dev_addr = synthetic_buf.gpu;
cmdbuf->state.gfx.ib.size = decomposed_count * 4;
cmdbuf->state.gfx.ib.index_size = 4;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = list_equiv(topo_save);

/* Dispatch as indexed-LIST */
panvk_cmd_draw_indirect(cmdbuf, &draw_with_decomposed_count);

/* Restore */
cmdbuf->state.gfx.ib = ib_save;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = topo_save;
```

The dirty-tracking mechanism will re-mark IB and topology dirty on the next user-issued draw, so the synthetic state is correctly invalidated.

### Q4: Where does the decomposition table live?

**Decision: a small static-data table in a new file `panvk_vX_winding.c` (under PAN_ARCH < 9 gate).**

Per-topology entries:

- `vertices_per_primitive_after_decomp` (2 or 3)
- `primitive_count(input_vert_count)` lambda
- `decompose_vertex(prim_idx, vert_in_prim) → input_vert_index` lambda
- `equivalent_list_topology` enum

API:

```c
struct panvk_winding_table {
   uint32_t verts_per_prim;
   uint32_t (*prim_count)(uint32_t in_count);
   uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
   VkPrimitiveTopology list_equiv;
};

const struct panvk_winding_table *panvk_get_winding_table(VkPrimitiveTopology);
/* Returns NULL for topologies that don't need decomposition
 * (POINT_LIST and the non-adjacency LIST variants, per Q6). */
```

Caller:

```c
const struct panvk_winding_table *wt = panvk_get_winding_table(topo);
if (wt && cmdbuf->state.gfx.xfb.active) {
   /* Size the synthetic index buffer from the per-topology count formula. */
   uint32_t n_prim = wt->prim_count(input_vert_count);
   uint32_t out_count = n_prim * wt->verts_per_prim;
   struct pan_ptr buf = panvk_cmd_alloc_dev_mem(cmdbuf, desc, out_count * 4, 8);
   uint32_t *idx = buf.cpu;

   /* Emit one 32-bit input-vertex index per decomposed output vertex. */
   for (uint32_t p = 0; p < n_prim; p++)
      for (uint32_t v = 0; v < wt->verts_per_prim; v++)
         *idx++ = wt->decompose(p, v);

   /* Override IB + topology + draw as indexed-LIST (Q3), with
    * xfb.decomposed_count set to out_count (Q5). */
}
```

### Q5: How does `vs.num_vertices` sysval track decomposed count?

**Decision: at sysval upload time, check `cmdbuf->state.gfx.xfb.decomposed_count != 0` and use it instead of `info->vertex.count`.**

Add a field `uint32_t decomposed_count` to `cmdbuf->state.gfx.xfb`. Set it in the new decomposition path. Reset it to 0 after restore.

In `cmd_prepare_draw_sysvals` (around the existing iter13 `set_gfx_sysval(... vs.num_vertices, info->vertex.count)` line):

```c
uint32_t nv = cmdbuf->state.gfx.xfb.decomposed_count
                 ? cmdbuf->state.gfx.xfb.decomposed_count
                 : info->vertex.count;
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, nv);
```

### Q6: Topology classification: which need decomposition?

**Decision:**

| Topology | Decomposed? | Output verts (N = input verts) | List equiv |
|---|---|---|---|
| POINT_LIST | No | input | (same) |
| LINE_LIST | No | input | (same) |
| LINE_STRIP | **Yes** | 2(N-1) | LINE_LIST |
| TRIANGLE_LIST | No | input | (same) |
| TRIANGLE_STRIP | **Yes** | 3(N-2) | TRIANGLE_LIST |
| TRIANGLE_FAN | **Yes** | 3(N-2) | TRIANGLE_LIST |
| LINE_LIST_WITH_ADJACENCY | **Yes** | N/2 | LINE_LIST (drop adjacency verts) |
| LINE_STRIP_WITH_ADJACENCY | **Yes** | 2(N-3) | LINE_LIST |
| TRIANGLE_LIST_WITH_ADJACENCY | **Yes** | N/2 | TRIANGLE_LIST |
| TRIANGLE_STRIP_WITH_ADJACENCY | **Yes** | 3(N/2-2) | TRIANGLE_LIST |
| PATCH_LIST | N/A (tess not advertised) | — | — |

Seven topologies need decomposition tables. Each is a small lambda + count formula.

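As an illustration of the "small lambda + count formula" shape, here is a minimal sketch for the three non-adjacency rows of the Q6 table. All `sketch_*` names and the local enum are hypothetical stand-ins, not the planned `panvk_winding_table` code. The vertex orderings follow the primitive definitions in the Vulkan specification (odd strip triangles swap their last two vertices; fan triangles end on vertex 0). Whether the driver table should use exactly this ordering is an assumption to confirm against the CTS `winding_*` expectations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three non-adjacency topologies in Q6. */
enum sketch_topology {
   SKETCH_LINE_STRIP,
   SKETCH_TRIANGLE_STRIP,
   SKETCH_TRIANGLE_FAN,
};

struct sketch_winding_table {
   uint32_t verts_per_prim;
   uint32_t (*prim_count)(uint32_t in_count);
   uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
};

/* Count formulas, clamped so short inputs yield zero primitives
 * instead of wrapping around in unsigned arithmetic. */
static uint32_t line_strip_count(uint32_t n) { return n >= 2 ? n - 1 : 0; }
static uint32_t tri_count(uint32_t n) { return n >= 3 ? n - 2 : 0; }

/* LINE_STRIP: primitive p is (v[p], v[p+1]). */
static uint32_t line_strip_decompose(uint32_t p, uint32_t v)
{
   return p + v;
}

/* TRIANGLE_STRIP: even p is (p, p+1, p+2); odd p is (p, p+2, p+1),
 * which keeps every triangle's winding consistent. */
static uint32_t tri_strip_decompose(uint32_t p, uint32_t v)
{
   uint32_t odd = p & 1;
   if (v == 0)
      return p;
   return v == 1 ? p + 1 + odd : p + 2 - odd;
}

/* TRIANGLE_FAN: primitive p is (v[p+1], v[p+2], v[0]). */
static uint32_t tri_fan_decompose(uint32_t p, uint32_t v)
{
   return v == 2 ? 0 : p + 1 + v;
}

static const struct sketch_winding_table sketch_tables[] = {
   [SKETCH_LINE_STRIP] = { 2, line_strip_count, line_strip_decompose },
   [SKETCH_TRIANGLE_STRIP] = { 3, tri_count, tri_strip_decompose },
   [SKETCH_TRIANGLE_FAN] = { 3, tri_count, tri_fan_decompose },
};

/* Fills `out` with the decomposed index list and returns its length.
 * The caller must provide room for prim_count * verts_per_prim entries. */
static uint32_t sketch_decompose(enum sketch_topology topo, uint32_t in_count,
                                 uint32_t *out)
{
   assert(topo <= SKETCH_TRIANGLE_FAN);
   const struct sketch_winding_table *wt = &sketch_tables[topo];
   uint32_t n_prim = wt->prim_count(in_count);
   uint32_t n = 0;

   for (uint32_t p = 0; p < n_prim; p++)
      for (uint32_t v = 0; v < wt->verts_per_prim; v++)
         out[n++] = wt->decompose(p, v);
   return n;
}
```

With this sketch, an 8-vertex triangle strip produces 18 indices, matching the 3(N-2) row above, and the second triangle comes out as (1, 3, 2) rather than (1, 2, 3).
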
### Q7: When does the iter16 path NOT activate?

- XFB not active: no-op (fast path unchanged)
- POINT_LIST, LINE_LIST, or TRIANGLE_LIST topology: no-op (the `*_LIST_WITH_ADJACENCY` variants are decomposed, per Q6)
- CmdDrawIndexed: falls through; the non-LIST + XFB case logs a warning (Q2)
- Tessellation (PATCH_LIST): not exposed, never hit
- Geometry shaders: not exposed, never hit

## Scope confirmation

- **In:** `vkCmdDraw` + LINE_STRIP / TRIANGLE_STRIP / TRIANGLE_FAN / `*_WITH_ADJACENCY` topologies + XFB active → driver-side decomposition
- **Out:** indexed draws (`vkCmdDrawIndexed`): warning only
- **Out:** indirect draws (`vkCmdDrawIndirect`): unchanged behavior
- **Expected CTS delta:** all 162 winding fails → Pass (since they all use non-indexed strip/fan draws)
- **Expected CTS new fails:** none

## Phase 3 next

Write `probe_winding.c` that exercises XFB + TRIANGLE_STRIP with 8 vertices, captures, and verifies the expected 18-vertex decomposed output (3 × (8 - 2) = 18). Same probe scaffolding as iter13's `probe_xfb.c`.

— claude-noether, 2026-05-21