panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.
This retrofit imports:
- mesa-panvk-bifrost/: r1..r4 era phase docs (iter1..iter18).
  The libmali stub blobs at iter18/blob/ are excluded; 109 MB of RE
  artifacts are replaced with a README pointer.
- mesa-panvk-bifrost-video/: sibling campaign phase docs + probe
- evidence/: frozen .tgz source snapshots at each milestone
  (basis for the 0005 patch diff generation)
Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.
Total: 1.9 MB across 124 files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 2 — design lock for iter16
Decisions
Q1: Where does decomposition happen — CPU or GPU?
Decision: CPU-side index buffer construction.
Per-draw CPU cost: building a decomposed index buffer for a 4K-vertex strip is ~12K integer writes — microseconds. Negligible against the per-frame budget. The alternative (compute shader) adds shader compile + dispatch overhead per draw which is worse for small draws. For huge meshes (>100K vertices) the calculation flips, but XFB on strip topologies in real-world apps is uncommon, and apps that do hit it can be handled with a future GPU-path optimization without ABI change.
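As a rough check on the ~12K figure: an N-vertex triangle strip expands to N-2 triangles of three 32-bit indices each. The helper names below are illustrative only and do not come from panvk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Indices written when an N-vertex triangle strip is expanded to a
 * triangle list: one triangle per vertex after the first two. */
static inline uint32_t strip_index_count(uint32_t in_verts)
{
    assert(in_verts <= UINT32_MAX / 3); /* keep 3 * (N - 2) in range */
    return in_verts < 3 ? 0 : 3 * (in_verts - 2);
}

/* Size of the synthetic index buffer, assuming 32-bit indices
 * (index_size = 4, as in the Q3 override). */
static inline size_t strip_index_bytes(uint32_t in_verts)
{
    return (size_t)strip_index_count(in_verts) * sizeof(uint32_t);
}
```

For a 4096-vertex strip this gives 12,282 indices, or roughly 48 KiB of index data per draw.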
Q2: Path 2 (CmdDrawIndexed + non-LIST + XFB) — what's the strategy?
Decision: deferred to follow-up iter. iter16 handles only CmdDraw (non-indexed) + non-LIST + XFB.
Rationale: CTS's winding_* tests use non-indexed draws. The 162 fails categorized in iter15 are all from non-indexed paths. Fixing those gets us the parity number we promised the operator. CmdDrawIndexed + non-LIST + XFB exists as a real case but isn't in the CTS subset we measured — adding it would expand scope without moving the measured pass-rate number that's the campaign artifact.
For iter16, we detect CmdDrawIndexed + non-LIST + XFB, emit a mesa_loge warning, and still capture (with wrong winding). That's a known soft gap. A future iter17 can add the compute-shader path if needed.
Q3: How to save/restore user's bind state?
Decision: snapshot before override, restore after panvk_cmd_draw_indirect returns.
/* Before override */
struct panvk_cmd_index_buffer_state ib_save = cmdbuf->state.gfx.ib;
VkPrimitiveTopology topo_save = cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;
/* Override + dispatch */
cmdbuf->state.gfx.ib.dev_addr = synthetic_buf.gpu;
cmdbuf->state.gfx.ib.size = decomposed_count * 4;
cmdbuf->state.gfx.ib.index_size = 4;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = list_equiv(topo_save);
/* Dispatch as indexed-LIST */
panvk_cmd_draw_indirect(cmdbuf, &draw_with_decomposed_count);
/* Restore */
cmdbuf->state.gfx.ib = ib_save;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = topo_save;
The dirty-tracking mechanism will re-mark IB and topology dirty on the next user-issued draw, so the synthetic state is correctly invalidated.
Q4: Where does the decomposition table live?
Decision: a small static-data table in a new file panvk_vX_winding.c (under PAN_ARCH < 9 gate).
Per-topology entries:
- vertices_per_primitive_after_decomp: 2 or 3
- primitive_count(input_vert_count): lambda
- decompose_vertex(prim_idx, vert_in_prim) → input_vert_index: lambda
- equivalent_list_topology: enum
API:
struct panvk_winding_table {
    uint32_t verts_per_prim;
    uint32_t (*prim_count)(uint32_t in_count);
    uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
    VkPrimitiveTopology list_equiv;
};
const struct panvk_winding_table *panvk_get_winding_table(VkPrimitiveTopology);
/* Returns NULL for topologies that don't need decomposition
 * (POINT_LIST, LINE_LIST, TRIANGLE_LIST; see Q6). */
Caller:
const struct panvk_winding_table *wt = panvk_get_winding_table(topo);
if (wt && cmdbuf->state.gfx.xfb.active) {
    uint32_t n_prim = wt->prim_count(input_vert_count);
    uint32_t out_count = n_prim * wt->verts_per_prim;
    struct pan_ptr buf = panvk_cmd_alloc_dev_mem(cmdbuf, desc, out_count * 4, 8);
    uint32_t *idx = buf.cpu;
    for (uint32_t p = 0; p < n_prim; p++)
        for (uint32_t v = 0; v < wt->verts_per_prim; v++)
            *idx++ = wt->decompose(p, v);
    /* Override IB + topology + draw as indexed-LIST */
}
Q5: How does vs.num_vertices sysval track decomposed count?
Decision: at sysval upload time, check cmdbuf->state.gfx.xfb.decomposed_count != 0 and use it instead of info->vertex.count.
Add a field uint32_t decomposed_count to cmdbuf->state.gfx.xfb. Set in the new decomposition path. Reset to 0 after restore.
In cmd_prepare_draw_sysvals (around the existing iter13 set_gfx_sysval(... vs.num_vertices, info->vertex.count) line):
uint32_t nv = cmdbuf->state.gfx.xfb.decomposed_count
                  ? cmdbuf->state.gfx.xfb.decomposed_count
                  : info->vertex.count;
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, nv);
Q6: Topology classification — which need decomposition?
Decision:
| Topology | Decomposed? | Output verts | List equiv |
|---|---|---|---|
| POINT_LIST | No | input | (same) |
| LINE_LIST | No | input | (same) |
| LINE_STRIP | Yes | 2(N-1) | LINE_LIST |
| TRIANGLE_LIST | No | input | (same) |
| TRIANGLE_STRIP | Yes | 3(N-2) | TRIANGLE_LIST |
| TRIANGLE_FAN | Yes | 3(N-2) | TRIANGLE_LIST |
| LINE_LIST_WITH_ADJACENCY | Yes | N/2 | LINE_LIST (drop adjacency verts) |
| LINE_STRIP_WITH_ADJACENCY | Yes | 2(N-3) | LINE_LIST |
| TRIANGLE_LIST_WITH_ADJACENCY | Yes | N/2 | TRIANGLE_LIST |
| TRIANGLE_STRIP_WITH_ADJACENCY | Yes | 3(N/2-2) | TRIANGLE_LIST |
| PATCH_LIST | N/A (tess not advertised) | — | — |
Seven topologies need decomposition tables. Each is a small lambda + count formula.
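The "Output verts" column can be written as a single function. This is a sketch with a hypothetical stand-in enum (`q6_topology`) rather than VkPrimitiveTopology. It rounds down to whole primitives, so it agrees with the table whenever N is a valid vertex count for the topology.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for VkPrimitiveTopology so the sketch builds without Vulkan
 * headers. */
enum q6_topology {
    Q6_POINT_LIST,
    Q6_LINE_LIST,
    Q6_LINE_STRIP,
    Q6_TRIANGLE_LIST,
    Q6_TRIANGLE_STRIP,
    Q6_TRIANGLE_FAN,
    Q6_LINE_LIST_ADJ,
    Q6_LINE_STRIP_ADJ,
    Q6_TRIANGLE_LIST_ADJ,
    Q6_TRIANGLE_STRIP_ADJ,
};

/* Vertex count after decomposition for n input vertices. Topologies
 * that are not decomposed return n unchanged. */
static inline uint32_t
q6_output_vertex_count(enum q6_topology topo, uint32_t n)
{
    assert(n <= UINT32_MAX / 3); /* keep the products in range */
    switch (topo) {
    case Q6_LINE_STRIP:         return n < 2 ? 0 : 2 * (n - 1);
    case Q6_TRIANGLE_STRIP:
    case Q6_TRIANGLE_FAN:       return n < 3 ? 0 : 3 * (n - 2);
    case Q6_LINE_LIST_ADJ:      return 2 * (n / 4);   /* 4 in, 2 out */
    case Q6_LINE_STRIP_ADJ:     return n < 4 ? 0 : 2 * (n - 3);
    case Q6_TRIANGLE_LIST_ADJ:  return 3 * (n / 6);   /* 6 in, 3 out */
    case Q6_TRIANGLE_STRIP_ADJ: return n < 6 ? 0 : 3 * ((n - 4) / 2);
    default:                    return n;
    }
}
```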
Q7: When does the iter16 path NOT activate?
- XFB not active: no-op (fast path unchanged)
- Plain LIST (non-adjacency) or POINT topology: no-op
- CmdDrawIndexed (any topology): falls through; a warning is logged when the topology is non-LIST and XFB is active (Q2)
- Tessellation (PATCH_LIST): we don't expose, never hit
- Geometry shaders: not exposed, never hit
Scope confirmation
- In: vkCmdDraw + LINE_STRIP / TRIANGLE_STRIP / TRIANGLE_FAN / *_WITH_ADJACENCY topologies + XFB active → driver-side decomposition
- Out: indexed draws (vkCmdDrawIndexed): warning only
- Out: indirect draws (vkCmdDrawIndirect): unchanged behavior
- Expected CTS delta: all 162 winding fails → Pass (since they all use non-indexed strip/fan draws)
- Expected CTS new fails: none
Phase 3 next
Write probe_winding.c that exercises XFB+triangle_strip with 8 vertices, captures, and verifies the expected 18-vertex decomposed output. Same probe scaffolding as iter13's probe_xfb.c.
— claude-noether, 2026-05-21