
Phase 2 — design lock for iter16

Decisions

Q1: Where does decomposition happen — CPU or GPU?

Decision: CPU-side index buffer construction.

Per-draw CPU cost: building a decomposed index buffer for a 4K-vertex triangle strip takes about 12K integer writes (3(N-2) indices), which is microseconds of work and negligible against the per-frame budget. The alternative, a compute shader, adds shader-compile and dispatch overhead per draw, which is worse for small draws. For huge meshes (>100K vertices) the calculation flips, but XFB on strip topologies is uncommon in real-world apps, and apps that do hit it can be handled by a future GPU-path optimization without an ABI change.
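The ~12K figure can be checked with a short calculation. This is a minimal sketch, assuming a triangle strip and 32-bit indices; the helper names are hypothetical and not part of panvk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 32-bit indices written when an
 * N-vertex TRIANGLE_STRIP is decomposed into a TRIANGLE_LIST. */
static uint32_t strip_decomposed_index_count(uint32_t n_verts)
{
    /* A strip with fewer than 3 vertices yields no triangles. */
    return n_verts < 3 ? 0 : 3 * (n_verts - 2);
}

/* Size in bytes of the synthetic index buffer (uint32 indices). */
static size_t strip_decomposed_index_bytes(uint32_t n_verts)
{
    return (size_t)strip_decomposed_index_count(n_verts) * sizeof(uint32_t);
}
```

For a 4096-vertex strip this gives 12282 indices, or roughly 48 KiB of index data per draw.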

Q2: Path 2 (CmdDrawIndexed + non-LIST + XFB) — what's the strategy?

Decision: deferred to follow-up iter. iter16 handles only CmdDraw (non-indexed) + non-LIST + XFB.

Rationale: CTS's winding_* tests use non-indexed draws. The 162 fails categorized in iter15 are all from non-indexed paths. Fixing those gets us the parity number we promised the operator. CmdDrawIndexed + non-LIST + XFB exists as a real case but isn't in the CTS subset we measured — adding it would expand scope without moving the measured pass-rate number that's the campaign artifact.

For iter16, we detect CmdDrawIndexed + non-LIST + XFB, emit a mesa_loge warning, and still capture (with wrong winding). That is a known soft gap. A future iter17 can add the compute-shader path if needed.

Q3: How to save/restore user's bind state?

Decision: snapshot before override, restore after panvk_cmd_draw_indirect returns.

/* Before override */
struct panvk_cmd_index_buffer_state ib_save = cmdbuf->state.gfx.ib;
VkPrimitiveTopology topo_save = cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology;

/* Override + dispatch */
cmdbuf->state.gfx.ib.dev_addr = synthetic_buf.gpu;
cmdbuf->state.gfx.ib.size = decomposed_count * 4;
cmdbuf->state.gfx.ib.index_size = 4;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = list_equiv(topo_save);
/* Dispatch as indexed-LIST */
panvk_cmd_draw_indirect(cmdbuf, &draw_with_decomposed_count);

/* Restore */
cmdbuf->state.gfx.ib = ib_save;
cmdbuf->vk.dynamic_graphics_state.ia.primitive_topology = topo_save;

The dirty-tracking mechanism will re-mark IB and topology dirty on the next user-issued draw, so the synthetic state is correctly invalidated.

Q4: Where does the decomposition table live?

Decision: a small static-data table in a new file panvk_vX_winding.c (under PAN_ARCH < 9 gate).

Per-topology entries:

  • vertices_per_primitive_after_decomp (2 or 3)
  • primitive_count(input_vert_count) lambda
  • decompose_vertex(prim_idx, vert_in_prim) → input_vert_index lambda
  • equivalent_list_topology enum

API:

struct panvk_winding_table {
    uint32_t verts_per_prim;
    uint32_t (*prim_count)(uint32_t in_count);
    uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
    VkPrimitiveTopology list_equiv;
};

const struct panvk_winding_table *panvk_get_winding_table(VkPrimitiveTopology);

/* Returns NULL for topologies that don't need decomposition
 * (POINT_LIST, LINE_LIST, TRIANGLE_LIST). The *_LIST_WITH_ADJACENCY
 * variants do need it, see Q6. */

Caller:

const struct panvk_winding_table *wt = panvk_get_winding_table(topo);
if (wt && cmdbuf->state.gfx.xfb.active) {
    uint32_t n_prim = wt->prim_count(input_vert_count);
    uint32_t out_count = n_prim * wt->verts_per_prim;
    struct pan_ptr buf = panvk_cmd_alloc_dev_mem(cmdbuf, desc, out_count * 4, 8);
    uint32_t *idx = buf.cpu;
    for (uint32_t p = 0; p < n_prim; p++)
        for (uint32_t v = 0; v < wt->verts_per_prim; v++)
            *idx++ = wt->decompose(p, v);
    /* Override IB + topology + draw as indexed-LIST */
}
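To make the table shape concrete, here is a minimal sketch of three entries (line strip, triangle strip, triangle fan). It assumes the vertex ordering the Vulkan specification gives for these topologies with the default first-vertex provoking convention; whether the CTS winding tests expect exactly this order is an assumption to confirm in phase 3. The `sketch_*` names and the local enum are hypothetical stand-ins so the block compiles without the Vulkan or Mesa headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the VkPrimitiveTopology values used here. */
enum sketch_topology {
    SKETCH_LINE_LIST,
    SKETCH_LINE_STRIP,
    SKETCH_TRIANGLE_LIST,
    SKETCH_TRIANGLE_STRIP,
    SKETCH_TRIANGLE_FAN,
};

struct sketch_winding_table {
    uint32_t verts_per_prim;
    uint32_t (*prim_count)(uint32_t in_count);
    uint32_t (*decompose)(uint32_t prim_idx, uint32_t vert_idx);
    enum sketch_topology list_equiv;
};

static uint32_t line_strip_prims(uint32_t n) { return n < 2 ? 0 : n - 1; }
static uint32_t tri_prims(uint32_t n) { return n < 3 ? 0 : n - 2; }

/* Line i of a strip is (v[i], v[i+1]). */
static uint32_t line_strip_vert(uint32_t p, uint32_t v) { return p + v; }

/* Triangle i of a strip is (v[i], v[i+1], v[i+2]) for even i; odd
 * triangles swap the last two vertices so the winding stays consistent. */
static uint32_t tri_strip_vert(uint32_t p, uint32_t v)
{
    static const uint32_t even[3] = {0, 1, 2};
    static const uint32_t odd[3] = {0, 2, 1};
    return p + ((p & 1) ? odd[v] : even[v]);
}

/* Triangle i of a fan is (v[i+1], v[i+2], v[0]). */
static uint32_t tri_fan_vert(uint32_t p, uint32_t v)
{
    return v == 2 ? 0 : p + 1 + v;
}

static const struct sketch_winding_table sketch_tables[] = {
    [SKETCH_LINE_STRIP] = {2, line_strip_prims, line_strip_vert, SKETCH_LINE_LIST},
    [SKETCH_TRIANGLE_STRIP] = {3, tri_prims, tri_strip_vert, SKETCH_TRIANGLE_LIST},
    [SKETCH_TRIANGLE_FAN] = {3, tri_prims, tri_fan_vert, SKETCH_TRIANGLE_LIST},
};

static const struct sketch_winding_table *
sketch_get_table(enum sketch_topology t)
{
    if ((size_t)t >= sizeof(sketch_tables) / sizeof(sketch_tables[0]))
        return NULL;
    /* Entries without a designated initializer are zero-filled, which
     * marks topologies that need no decomposition. */
    return sketch_tables[t].decompose ? &sketch_tables[t] : NULL;
}
```

Keeping the per-topology logic behind two function pointers means the caller loop stays identical for every topology; the adjacency variants would add entries with their own index arithmetic.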

Q5: How does vs.num_vertices sysval track decomposed count?

Decision: at sysval upload time, check cmdbuf->state.gfx.xfb.decomposed_count != 0 and use it instead of info->vertex.count.

Add a field uint32_t decomposed_count to cmdbuf->state.gfx.xfb. Set in the new decomposition path. Reset to 0 after restore.

In cmd_prepare_draw_sysvals (around the existing iter13 set_gfx_sysval(... vs.num_vertices, info->vertex.count) line):

uint32_t nv = cmdbuf->state.gfx.xfb.decomposed_count
              ? cmdbuf->state.gfx.xfb.decomposed_count
              : info->vertex.count;
set_gfx_sysval(cmdbuf, dirty_sysvals, vs.num_vertices, nv);

Q6: Topology classification — which need decomposition?

Decision:

| Topology | Decomposed? | Output verts | List equiv |
|---|---|---|---|
| POINT_LIST | No | input | (same) |
| LINE_LIST | No | input | (same) |
| LINE_STRIP | Yes | 2(N-1) | LINE_LIST |
| TRIANGLE_LIST | No | input | (same) |
| TRIANGLE_STRIP | Yes | 3(N-2) | TRIANGLE_LIST |
| TRIANGLE_FAN | Yes | 3(N-2) | TRIANGLE_LIST |
| LINE_LIST_WITH_ADJACENCY | Yes | N/2 | LINE_LIST (drop adjacency verts) |
| LINE_STRIP_WITH_ADJACENCY | Yes | 2(N-3) | LINE_LIST |
| TRIANGLE_LIST_WITH_ADJACENCY | Yes | N/2 | TRIANGLE_LIST |
| TRIANGLE_STRIP_WITH_ADJACENCY | Yes | 3(N/2-2) | TRIANGLE_LIST |
| PATCH_LIST | N/A | (tess not advertised) | |

Seven topologies need decomposition tables. Each is a small lambda + count formula.
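The count formulas above can be sketched as one function. This is illustrative only: the `q6_*` names and the local enum are hypothetical, subtraction saturates so short inputs yield zero instead of wrapping, and the adjacency list cases round down to whole primitives (which equals N/2 whenever N is a multiple of 4 or 6 respectively).

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for VkPrimitiveTopology (hypothetical names). */
enum q6_topology {
    Q6_POINT_LIST,
    Q6_LINE_LIST,
    Q6_LINE_STRIP,
    Q6_TRIANGLE_LIST,
    Q6_TRIANGLE_STRIP,
    Q6_TRIANGLE_FAN,
    Q6_LINE_LIST_ADJ,
    Q6_LINE_STRIP_ADJ,
    Q6_TRIANGLE_LIST_ADJ,
    Q6_TRIANGLE_STRIP_ADJ,
};

/* Saturating n - k, to avoid unsigned wrap-around on short inputs. */
static uint32_t sub_sat(uint32_t n, uint32_t k) { return n > k ? n - k : 0; }

/* Number of vertices captured after decomposition to a plain list. */
static uint32_t q6_output_verts(enum q6_topology t, uint32_t n)
{
    switch (t) {
    case Q6_LINE_STRIP:          return 2 * sub_sat(n, 1);
    case Q6_TRIANGLE_STRIP:
    case Q6_TRIANGLE_FAN:        return 3 * sub_sat(n, 2);
    case Q6_LINE_LIST_ADJ:       return 2 * (n / 4);  /* 4 in, 2 out */
    case Q6_LINE_STRIP_ADJ:      return 2 * sub_sat(n, 3);
    case Q6_TRIANGLE_LIST_ADJ:   return 3 * (n / 6);  /* 6 in, 3 out */
    case Q6_TRIANGLE_STRIP_ADJ:  return 3 * sub_sat(n / 2, 2);
    default:                     return n;            /* plain lists */
    }
}
```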

Q7: When does the iter16 path NOT activate?

  • XFB not active: no-op (fast path unchanged)
  • POINT_LIST or plain (non-adjacency) LIST topology: no-op
  • CmdDrawIndexed with a topology that needs decomposition: falls through with warning log (Q2)
  • Tessellation (PATCH_LIST): we don't expose, never hit
  • Geometry shaders: not exposed, never hit
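The activation rules above can be condensed into a single predicate. A minimal sketch, assuming the indexed-draw warning fires only when XFB is active and the topology would otherwise be decomposed (the reading taken from Q2); the `q7_*` names are hypothetical, and indirect draws are left out because their behavior is unchanged.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for VkPrimitiveTopology (hypothetical names). */
enum q7_topology {
    Q7_POINT_LIST,
    Q7_LINE_LIST,
    Q7_LINE_STRIP,
    Q7_TRIANGLE_LIST,
    Q7_TRIANGLE_STRIP,
    Q7_TRIANGLE_FAN,
    Q7_LINE_LIST_ADJ,
    Q7_LINE_STRIP_ADJ,
    Q7_TRIANGLE_LIST_ADJ,
    Q7_TRIANGLE_STRIP_ADJ,
};

enum q7_action {
    Q7_FAST_PATH,  /* draw unchanged */
    Q7_WARN_ONLY,  /* indexed draw: log, capture with wrong winding */
    Q7_DECOMPOSE,  /* build synthetic index buffer, draw as list */
};

/* True for the seven topologies that the Q6 table marks "Yes". */
static bool q7_needs_decomposition(enum q7_topology t)
{
    switch (t) {
    case Q7_POINT_LIST:
    case Q7_LINE_LIST:
    case Q7_TRIANGLE_LIST:
        return false;
    default:
        return true;
    }
}

static enum q7_action
q7_classify(bool xfb_active, bool indexed, enum q7_topology t)
{
    if (!xfb_active || !q7_needs_decomposition(t))
        return Q7_FAST_PATH;
    return indexed ? Q7_WARN_ONLY : Q7_DECOMPOSE;
}
```

Checking `xfb_active` first keeps the common non-XFB draw on the existing fast path with a single branch.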

Scope confirmation

  • In: vkCmdDraw + LINE_STRIP / TRIANGLE_STRIP / TRIANGLE_FAN / *_WITH_ADJACENCY topologies + XFB active → driver-side decomposition
  • Out: indexed draws (vkCmdDrawIndexed) — warning only
  • Out: indirect draws (vkCmdDrawIndirect) — unchanged behavior
  • Expected CTS delta: all 162 winding fails → Pass (since they all use non-indexed strip/fan draws)
  • Expected CTS new fails: none

Phase 3 next

Write probe_winding.c that exercises XFB+triangle_strip with 8 vertices, captures, and verifies the expected 18-vertex decomposed output. Same probe scaffolding as iter13's probe_xfb.c.
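The reference data the probe compares against could be generated along these lines. This is a sketch, not the probe itself: the `probe_*` names are hypothetical, and the per-triangle vertex order follows the Vulkan specification's triangle-strip ordering, which is assumed to be what the capture should contain.

```c
#include <assert.h>
#include <stdint.h>

#define PROBE_STRIP_VERTS 8u
#define PROBE_EXPECTED_VERTS (3u * (PROBE_STRIP_VERTS - 2u))

/* Fill `out` with the input-vertex index expected at each captured
 * position for an 8-vertex triangle strip; returns the number written. */
static uint32_t probe_expected_indices(uint32_t out[PROBE_EXPECTED_VERTS])
{
    uint32_t n = 0;
    for (uint32_t tri = 0; tri + 2 < PROBE_STRIP_VERTS; tri++) {
        uint32_t odd = tri & 1;
        out[n++] = tri;
        out[n++] = tri + 1 + odd; /* odd triangles swap the last two */
        out[n++] = tri + 2 - odd;
    }
    return n;
}
```

The probe would then compare each captured vertex against the input vertex at the corresponding expected index.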

— claude-noether, 2026-05-21