panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.
This retrofit imports:
- mesa-panvk-bifrost/: r1..r4 era phase docs (iter1..iter18). The libmali
  stub blobs at iter18/blob/ are excluded; 109MB of RE artifacts are
  replaced with a README pointer.
- mesa-panvk-bifrost-video/: sibling campaign phase docs + probe
- evidence/: frozen .tgz source snapshots at each milestone (basis for
  the 0005 patch diff generation)
Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.
Total: 1.9 MB across 124 files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 0 — substrate lock for iter16
Goal: close the 162 winding_* CTS failures from iter15 by implementing driver-side primitive decomposition when XFB is active and topology is strip/fan/adjacency. This brings the driver into spec compliance for the corner that iter13 didn't cover.
Operator framing (2026-05-21, post-iter15-close): "Continue with the winding-order cluster" — going with the proper fix even though it doesn't directly help the iter9/iter13 ANGLE-Vulkan motivator. Upstream value.
What's broken
iter13's pan_nir_lower_xfb (in Mesa's panfrost compiler) computes the XFB output index as:
index = instance_id * num_vertices + raw_vertex_id_pan
store_global(xfb_address[i] + index * stride, captured_value)
This produces ONE XFB output per VS invocation, which equals one output per input vertex. Vulkan spec for transform feedback requires:
| Topology | Output count for N input vertices |
|---|---|
| POINT_LIST | N |
| LINE_LIST | N |
| LINE_STRIP | 2 × (N - 1) |
| TRIANGLE_LIST | N |
| TRIANGLE_STRIP | 3 × (N - 2) |
| TRIANGLE_FAN | 3 × (N - 2) |
| LINE_LIST_WITH_ADJACENCY | N/2 (2 per primitive after dropping adjacency) |
| LINE_STRIP_WITH_ADJACENCY | 2 × (N - 3) |
| TRIANGLE_LIST_WITH_ADJACENCY | N/2 (3 per primitive) |
| TRIANGLE_STRIP_WITH_ADJACENCY | 3 × (N/2 - 2) |
iter13 currently handles only the plain LIST topologies correctly (POINT_LIST, LINE_LIST, TRIANGLE_LIST, where output_count = input_count). All strip/fan/adjacency variants fail because we capture N vertices when the spec wants the decomposed count.
In addition, odd-numbered triangle-strip primitives must have their winding reversed: {i, i+2, i+1}, not {i, i+1, i+2}. The test name "winding" comes from this.
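The table above can be sketched as a single counting function. This is a minimal illustration, not the driver code: the enum is a hypothetical stand-in for VkPrimitiveTopology so the block compiles without Vulkan headers, and the signature of decompose_count is assumed. It also assumes that incomplete trailing primitives are dropped and that inputs too short to form one primitive produce zero outputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for VkPrimitiveTopology (assumption, not Mesa code). */
enum topo {
   TOPO_POINT_LIST,
   TOPO_LINE_LIST,
   TOPO_LINE_STRIP,
   TOPO_TRIANGLE_LIST,
   TOPO_TRIANGLE_STRIP,
   TOPO_TRIANGLE_FAN,
   TOPO_LINE_LIST_ADJ,
   TOPO_LINE_STRIP_ADJ,
   TOPO_TRIANGLE_LIST_ADJ,
   TOPO_TRIANGLE_STRIP_ADJ,
};

/* Number of XFB output vertices for n input vertices.
 * Assumption: incomplete trailing primitives are dropped. */
uint32_t
decompose_count(enum topo t, uint32_t n)
{
   switch (t) {
   case TOPO_POINT_LIST:         return n;
   case TOPO_LINE_LIST:          return (n / 2) * 2;
   case TOPO_LINE_STRIP:         return n >= 2 ? 2 * (n - 1) : 0;
   case TOPO_TRIANGLE_LIST:      return (n / 3) * 3;
   case TOPO_TRIANGLE_STRIP:
   case TOPO_TRIANGLE_FAN:       return n >= 3 ? 3 * (n - 2) : 0;
   case TOPO_LINE_LIST_ADJ:      return (n / 4) * 2;  /* 4 in, 2 out */
   case TOPO_LINE_STRIP_ADJ:     return n >= 4 ? 2 * (n - 3) : 0;
   case TOPO_TRIANGLE_LIST_ADJ:  return (n / 6) * 3;  /* 6 in, 3 out */
   case TOPO_TRIANGLE_STRIP_ADJ: return n >= 6 ? 3 * ((n - 4) / 2) : 0;
   }
   assert(!"unknown topology");
   return 0;
}
```

For an 8-vertex triangle strip this yields 18 outputs, against the 8 that the per-invocation formula captures today.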
The fix architecture (locked early because the operator picked option 1)
When XFB is active and topology requires decomposition:
- At draw record time (in jm/panvk_vX_cmd_draw.c/panvk_vX_cmd_draw.c):
  - Compute decomposed_vertex_count = decompose_count(topology, input_count)
  - Allocate a scratch BO (via panvk_priv_bo_*) sized for decomposed_vertex_count * sizeof(uint32_t)
  - Fill the BO with a synthetic index buffer encoding the decomposition (e.g. for an 8-vertex triangle strip: 0 1 2 1 3 2 2 3 4 3 5 4 4 5 6 5 7 6)
  - Emit the draw as indexed LIST topology with this synthetic index buffer + the decomposed vertex count
- At sysval upload (in panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals):
  - Set vs.num_vertices = decomposed_vertex_count instead of the input count
- No shader changes needed: the VS already runs once per dispatched (indexed) vertex; the existing pan_nir_lower_xfb formula does the right thing once num_vertices and the vertex dispatch count match.
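As an illustration of the "fill the BO" step, the sketch below generates the synthetic index buffer for a non-indexed triangle strip, including the winding swap on odd primitives. The function name and signature are hypothetical, and a plain array stands in for the scratch BO; the real code would write into memory obtained through the driver's allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes TRIANGLE_LIST indices for an n-vertex
 * triangle strip into out, which must hold 3 * (n - 2) entries.
 * Returns the number of indices written. */
uint32_t
fill_tri_strip_indices(uint32_t *out, uint32_t n)
{
   assert(out != NULL);
   if (n < 3)
      return 0;

   uint32_t k = 0;
   for (uint32_t i = 0; i + 2 < n; i++) {
      /* Odd primitives emit {i, i+2, i+1} so the winding stays consistent. */
      uint32_t odd = i & 1;
      out[k++] = i;
      out[k++] = i + 1 + odd;
      out[k++] = i + 2 - odd;
   }
   return k;
}
```

Called with n = 8, this produces the 18-entry sequence quoted in the list above, and the returned count is the value that would be uploaded as vs.num_vertices.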
What about the existing CmdDrawIndexed path?
For indexed draws that are already strip/fan, we need to REMAP the user's index buffer through the decomposition table — read user_index[decomp[k]] for k in 0..decomposed_count. That's an extra indirection in the synthetic index buffer construction.
Cleanest abstraction: build the decomposed buffer as values, not as indices, by reading the user's index buffer on the CPU and emitting the resolved input vertex IDs. But for large input meshes that's a CPU cost.
Alternative: have the GPU do the indirection. The synthetic index buffer holds decomp_indices (positions into the user buffer), and we tell the Bifrost vertex job to use a 2-level index lookup. Bifrost JM doesn't natively support that. So CPU-side resolve is necessary for indexed draws.
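The CPU-side resolve for indexed draws can be sketched as a single pass over the decomposition table. The helper name and parameters are hypothetical; the sketch assumes 32-bit user indices and ignores primitive restart, both of which the real path would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[k] = user_index[decomp[k]].
 * decomp holds positions into the user's index buffer; out receives the
 * resolved input vertex IDs that the GPU consumes directly.
 * Assumes 32-bit indices and no primitive restart. */
void
resolve_decomposed_indices(uint32_t *out, const uint32_t *decomp,
                           uint32_t decomposed_count,
                           const uint32_t *user_index, uint32_t user_count)
{
   assert(out != NULL && decomp != NULL && user_index != NULL);

   for (uint32_t k = 0; k < decomposed_count; k++) {
      /* Each table entry must point inside the user's index buffer. */
      assert(decomp[k] < user_count);
      out[k] = user_index[decomp[k]];
   }
}
```

The cost is one read and one write per decomposed vertex, which is the CPU overhead the paragraph above refers to for large meshes.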
Out-of-scope failure modes
- Tessellation topologies (PATCH_LIST): Not in iter13's exposed feature set; we don't advertise tessellation. CTS test winding_patch_list is in the NotSupported bucket already. No-op.
- Geometry shaders: geometryStreams=false in iter13's properties. No-op.
- Indirect draws (vkCmdDrawIndirect): Vertex count comes from a GPU buffer, not from the CPU. Decomposition would need to happen on the GPU. Out of iter16 scope; we'll keep behavior unchanged for indirect+strip+XFB (will fail iter16 too, but separate followup).
- vkCmdDrawIndirectByteCountEXT: already not implemented (transformFeedbackDraw=false).
Time / complexity estimate
- Phase 1 source map: 1-2h
- Phase 2 design lock: 1h
- Phase 3 probe (regression test for triangle_strip winding): 2-3h
- Phase 4 implementation: 1-2 days
- Phase 5 review: spawn a janet-style reviewer
- Phase 6 CTS rerun: ~2h
- Phase 8 package: standard PKGBUILD update + CI + 3-point close
Total estimate: 3-5 working days for the full cycle.
Next: Phase 1
Source map. Where in panvk does pipeline topology live, where does the draw dispatch read it, where to inject the decomposition.
— claude-noether, 2026-05-21