panvk-bifrost/mesa-panvk-bifrost/iter16/phase0_findings.md
marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00


Phase 0 — substrate lock for iter16

Goal: close the 162 winding_* CTS failures from iter15 by implementing driver-side primitive decomposition when XFB is active and the topology is strip, fan, or adjacency. This closes the spec-compliance gap that iter13 left uncovered.

Operator framing (2026-05-21, post-iter15-close): "Continue with the winding-order cluster". We are going with the proper fix even though it doesn't directly help the iter9/iter13 ANGLE-Vulkan motivator; the value is upstream.

What's broken

iter13's pan_nir_lower_xfb (in Mesa's panfrost compiler) computes the XFB output index as:

index = instance_id * num_vertices + raw_vertex_id_pan
store_global(xfb_address[i] + index * stride, captured_value)
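
To make the indexing concrete, the formula can be written as a small C helper. This is only an illustrative sketch: the function name and parameter names are hypothetical and are not taken from pan_nir_lower_xfb, which emits NIR rather than calling anything like this.

```c
#include <stdint.h>

/* Byte offset of one captured value inside an XFB buffer, following
 * index = instance_id * num_vertices + raw_vertex_id.
 * Hypothetical helper: names are illustrative, not the driver's. */
static uint64_t
xfb_store_offset(uint32_t instance_id, uint32_t num_vertices,
                 uint32_t raw_vertex_id, uint32_t stride)
{
   uint64_t index = (uint64_t)instance_id * num_vertices + raw_vertex_id;
   return index * stride;
}
```

With num_vertices = 4 and a 16-byte stride, instance 0 fills slots 0..3 and instance 1 starts at byte offset 64, so a 4-vertex draw only ever writes 4 slots per instance regardless of topology.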

This produces exactly ONE XFB output per VS invocation, i.e. one output per input vertex. The Vulkan spec for transform feedback instead requires the following output counts:

| Topology | Output count for N input vertices |
| --- | --- |
| POINT_LIST | N |
| LINE_LIST | N |
| LINE_STRIP | 2 × (N - 1) |
| TRIANGLE_LIST | N |
| TRIANGLE_STRIP | 3 × (N - 2) |
| TRIANGLE_FAN | 3 × (N - 2) |
| LINE_LIST_WITH_ADJACENCY | N/2 (2 per primitive after dropping adjacency) |
| LINE_STRIP_WITH_ADJACENCY | 2 × (N - 3) |
| TRIANGLE_LIST_WITH_ADJACENCY | N/2 (3 per primitive) |
| TRIANGLE_STRIP_WITH_ADJACENCY | 3 × (N/2 - 2) |

iter13 currently handles only the non-adjacency LIST topologies correctly (POINT_LIST, LINE_LIST, TRIANGLE_LIST), where output_count = input_count. All strip, fan, and adjacency variants fail because we capture N vertices when the spec wants the decomposed count.
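
The table above translates fairly directly into a counting function. The following is a minimal sketch, not the planned driver code: the enum is a hypothetical stand-in for VkPrimitiveTopology so the block compiles on its own, and rounding down to whole primitives for short or ragged inputs is an assumption (the table only states counts for complete primitives).

```c
#include <stdint.h>

/* Hypothetical stand-in for VkPrimitiveTopology, so the sketch is
 * self-contained. */
enum xfb_topology {
   XFB_POINT_LIST,
   XFB_LINE_LIST,
   XFB_LINE_STRIP,
   XFB_TRIANGLE_LIST,
   XFB_TRIANGLE_STRIP,
   XFB_TRIANGLE_FAN,
   XFB_LINE_LIST_WITH_ADJACENCY,
   XFB_LINE_STRIP_WITH_ADJACENCY,
   XFB_TRIANGLE_LIST_WITH_ADJACENCY,
   XFB_TRIANGLE_STRIP_WITH_ADJACENCY,
};

/* Number of vertices XFB should capture for n input vertices.
 * Assumption: incomplete trailing primitives contribute nothing. */
static uint32_t
decompose_count(enum xfb_topology topology, uint32_t n)
{
   switch (topology) {
   case XFB_POINT_LIST:
      return n;
   case XFB_LINE_LIST:
      return 2 * (n / 2);
   case XFB_LINE_STRIP:
      return n >= 2 ? 2 * (n - 1) : 0;
   case XFB_TRIANGLE_LIST:
      return 3 * (n / 3);
   case XFB_TRIANGLE_STRIP:
   case XFB_TRIANGLE_FAN:
      return n >= 3 ? 3 * (n - 2) : 0;
   case XFB_LINE_LIST_WITH_ADJACENCY:
      return 2 * (n / 4);           /* 4 inputs per line, 2 captured */
   case XFB_LINE_STRIP_WITH_ADJACENCY:
      return n >= 4 ? 2 * (n - 3) : 0;
   case XFB_TRIANGLE_LIST_WITH_ADJACENCY:
      return 3 * (n / 6);           /* 6 inputs per triangle, 3 captured */
   case XFB_TRIANGLE_STRIP_WITH_ADJACENCY:
      return n >= 6 ? 3 * ((n - 4) / 2) : 0;
   }
   return 0;
}
```

For complete primitives these expressions reduce to the table entries, e.g. 8 input vertices give 18 outputs for TRIANGLE_STRIP and 6 for TRIANGLE_STRIP_WITH_ADJACENCY.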

In addition, odd-numbered triangle-strip primitives must have their winding reversed: primitive i is emitted as {i, i+2, i+1}, not {i, i+1, i+2}. The "winding" in the test names comes from this.
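
The winding rule can be sketched as a small index generator for the non-indexed triangle-strip case. The function name is hypothetical, and the caller is assumed to provide an output array with room for 3 × (n - 2) entries.

```c
#include <stdint.h>

/* Decompose an n-vertex triangle strip into list indices, swapping the
 * last two vertices of every odd-numbered primitive.
 * Writes 3 * (n - 2) entries to out and returns that count (0 if n < 3).
 * Hypothetical helper, not an existing panvk function. */
static uint32_t
decompose_triangle_strip(uint32_t n, uint32_t *out)
{
   if (n < 3)
      return 0;

   uint32_t k = 0;
   for (uint32_t i = 0; i + 2 < n; i++) {
      uint32_t odd = i & 1;
      out[k++] = i;
      out[k++] = i + 1 + odd; /* i+1 for even primitives, i+2 for odd */
      out[k++] = i + 2 - odd; /* i+2 for even primitives, i+1 for odd */
   }
   return k;
}
```

For n = 8 this yields the 18 indices 0 1 2, 1 3 2, 2 3 4, 3 5 4, 4 5 6, 5 7 6.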

The fix architecture (locked early because the operator picked option 1)

When XFB is active and topology requires decomposition:

  1. At draw record time (in jm/panvk_vX_cmd_draw.c / panvk_vX_cmd_draw.c):
    • Compute decomposed_vertex_count = decompose_count(topology, input_count)
    • Allocate a scratch BO (via panvk_priv_bo_*) sized for decomposed_vertex_count * sizeof(uint32_t)
    • Fill the BO with a synthetic index buffer encoding the decomposition (e.g. for an 8-vertex triangle strip: 0 1 2, 1 3 2, 2 3 4, 3 5 4, 4 5 6, 5 7 6)
    • Emit the draw as indexed LIST topology with this synthetic index buffer + the decomposed vertex count
  2. At sysval upload (in panvk_vX_cmd_draw.c::cmd_prepare_draw_sysvals):
    • Set vs.num_vertices = decomposed_vertex_count instead of the input count
  3. No shader changes needed — the VS already runs once per dispatched (indexed) vertex; the existing pan_nir_lower_xfb formula does the right thing once num_vertices and the vertex dispatch count match.
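
The CPU-side decisions in steps 1 and 2 can be summarised as a small planning function. This is a sketch under stated assumptions: the enum, the struct, and plan_xfb_draw are all hypothetical names, only the non-adjacency topologies are covered, and the actual BO allocation through panvk_priv_bo_* is left out because its interface is not described here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VkPrimitiveTopology (non-adjacency subset). */
enum draw_topology {
   DRAW_POINT_LIST,
   DRAW_LINE_LIST,
   DRAW_LINE_STRIP,
   DRAW_TRIANGLE_LIST,
   DRAW_TRIANGLE_STRIP,
   DRAW_TRIANGLE_FAN,
};

/* Hypothetical summary of what steps 1 and 2 need for one draw. */
struct xfb_draw_plan {
   bool decompose;                   /* synthetic index buffer needed? */
   enum draw_topology emit_topology; /* topology handed to the hardware */
   uint32_t vertex_count;            /* dispatch count and vs.num_vertices */
   size_t scratch_bytes;             /* size of the uint32_t index buffer */
};

static struct xfb_draw_plan
plan_xfb_draw(bool xfb_active, enum draw_topology topology,
              uint32_t input_count)
{
   struct xfb_draw_plan plan = {
      .decompose = false,
      .emit_topology = topology,
      .vertex_count = input_count,
      .scratch_bytes = 0,
   };

   if (!xfb_active)
      return plan;

   switch (topology) {
   case DRAW_LINE_STRIP:
      plan.decompose = true;
      plan.emit_topology = DRAW_LINE_LIST;
      plan.vertex_count = input_count >= 2 ? 2 * (input_count - 1) : 0;
      break;
   case DRAW_TRIANGLE_STRIP:
   case DRAW_TRIANGLE_FAN:
      plan.decompose = true;
      plan.emit_topology = DRAW_TRIANGLE_LIST;
      plan.vertex_count = input_count >= 3 ? 3 * (input_count - 2) : 0;
      break;
   default:
      break; /* list topologies pass through unchanged */
   }

   if (plan.decompose)
      plan.scratch_bytes = (size_t)plan.vertex_count * sizeof(uint32_t);
   return plan;
}
```

Keeping the dispatch count and vs.num_vertices in a single field reflects the constraint in step 3: the existing XFB index formula only works if the two values agree.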

What about the existing CmdDrawIndexed path?

For indexed draws that are already strip/fan, we need to REMAP the user's index buffer through the decomposition table — read user_index[decomp[k]] for k in 0..decomposed_count. That's an extra indirection in the synthetic index buffer construction.

Cleanest abstraction: build the decomposed buffer as values, not as indices, by reading the user's index buffer on the CPU and emitting the resolved input vertex IDs. For large input meshes, though, that costs CPU time proportional to the decomposed index count.

Alternative: have the GPU do the indirection. The synthetic index buffer holds decomp_indices (positions into the user buffer), and we tell the Bifrost vertex job to use a 2-level index lookup. Bifrost JM doesn't natively support that. So CPU-side resolve is necessary for indexed draws.
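
A minimal sketch of the CPU-side resolve, assuming the user's index buffer has already been mapped and widened to 32-bit indices and that primitive restart is not in play (both are assumptions; neither is settled in this phase). The function name and the bounds-check behaviour are hypothetical.

```c
#include <stdint.h>

/* Resolve a decomposition table against the user's index buffer:
 * out[k] = user_index[decomp[k]] for k in [0, decomposed_count).
 * Returns 0 on success, -1 if a table entry points past the user buffer.
 * Hypothetical helper, not an existing panvk function. */
static int
resolve_indexed_decomposition(const uint32_t *user_index, uint32_t user_count,
                              const uint32_t *decomp,
                              uint32_t decomposed_count, uint32_t *out)
{
   for (uint32_t k = 0; k < decomposed_count; k++) {
      if (decomp[k] >= user_count)
         return -1;
      out[k] = user_index[decomp[k]];
   }
   return 0;
}
```

For a 4-index strip with user indices 10 11 12 13 and the decomposition table 0 1 2 1 3 2, the resolved buffer is 10 11 12 11 13 12, which can then be drawn as an indexed TRIANGLE_LIST.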

Out-of-scope failure modes

  • Tessellation topologies (PATCH_LIST): Not in iter13's exposed feature set; we don't advertise tessellation. CTS test winding_patch_list is in the NotSupported bucket already. No-op.
  • Geometry shaders: geometryStreams=false in iter13's properties. No-op.
  • Indirect draws (vkCmdDrawIndirect): Vertex count comes from a GPU buffer, not from the CPU, so decomposition would need to happen on the GPU. Out of iter16 scope; behavior stays unchanged for indirect+strip+XFB (those tests will still fail after iter16 and are left to a separate follow-up).
  • vkCmdDrawIndirectByteCountEXT: Already not implemented (transformFeedbackDraw=false). No-op.

Time / complexity estimate

  • Phase 1 source map: 1-2h
  • Phase 2 design lock: 1h
  • Phase 3 probe (regression test for triangle_strip winding): 2-3h
  • Phase 4 implementation: 1-2 days
  • Phase 5 review: spawn a janet-style reviewer
  • Phase 6 CTS rerun: ~2h
  • Phase 8 package: standard PKGBUILD update + CI + 3-point close

Total estimate: 3-5 working days for the full cycle.

Next: Phase 1

Source map: where pipeline topology lives in panvk, where the draw dispatch reads it, and where to inject the decomposition.

— claude-noether, 2026-05-21