
Phase 4 progress (incomplete) — iter16

Status: WIP. Probe-correct, infrastructure-in-place, integration-blocked.

What works

  • panvk_vX_winding.c (new file) compiles cleanly and builds into the v6/v7 archives, exporting the panvk_v6_get_winding_table / panvk_v7_get_winding_table symbols. Tables for 7 topologies are verified against the Phase 3 probe expectations.
  • The injection point in jm/panvk_vX_cmd_draw.c::CmdDraw correctly detects xfb.active + non-LIST topology, looks up the winding table, builds the synthetic index buffer with the correct decomposition pattern (0 1 2 1 3 2 2 3 4 3 5 4 4 5 6 5 7 6 for an 8-vert tri-strip), and builds the VkDrawIndexedIndirectCommand with indexCount = 18.
  • The vs.num_vertices sysval override correctly uses decomposed_count (18) instead of info->vertex.count (0 for indexed draws).
  • The IB and topology state overrides, and their dirty bits, are set correctly.
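The decomposition pattern quoted above for the 8-vertex tri-strip can be reproduced with a few lines of C. This is a minimal sketch, not the panvk winding-table code: the function name `decompose_tri_strip` and its signature are hypothetical, and it assumes the Vulkan triangle-strip ordering in which odd-numbered triangles swap their second and third vertices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the panvk winding-table API): expand a triangle
 * strip of vertex_count vertices into a triangle-list index sequence.
 * Returns the number of indices written, 0 if the strip has no triangle. */
static size_t
decompose_tri_strip(uint32_t vertex_count, uint32_t *out, size_t out_cap)
{
   if (vertex_count < 3)
      return 0;

   size_t tri_count = vertex_count - 2;
   assert(out_cap >= tri_count * 3);

   for (uint32_t i = 0; i < tri_count; i++) {
      uint32_t odd = i & 1;

      /* Even triangle: (i, i+1, i+2). Odd triangle: (i, i+2, i+1). */
      out[3 * i + 0] = i;
      out[3 * i + 1] = i + 1 + odd;
      out[3 * i + 2] = i + 2 - odd;
   }

   return tri_count * 3;
}
```

For 8 vertices this yields 6 triangles, i.e. the indexCount of 18 used in the synthetic VkDrawIndexedIndirectCommand.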

What's broken

  • After panvk_cmd_draw_indirect(cmdbuf, &draw) returns, the captured XFB output shows 8 entries of 0,1,2,3,4,5,6,7, identical to the iter13 baseline non-indexed dispatch. Expected: 18 entries of 0,1,2,1,3,2,...
  • Entries 8..63 of the capture buffer are 0xDEADBEEF (sentinels). So the dispatch ran 8 invocations, with gl_VertexIndex values consistent with a non-indexed draw at firstVertex=0.
  • The fall-through trace [iter16] FALL-THROUGH to non-indexed CmdDraw does not print, confirming the return from the injection block fires correctly.
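The "8 written entries, then sentinels" reading of the capture buffer can be mechanized as below. This is a sketch under assumptions: the helper names are hypothetical and not taken from probe_winding.c, and it assumes the capture buffer is pre-filled with 0xDEADBEEF and that no legitimate captured value equals the sentinel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAPTURE_SENTINEL 0xDEADBEEFu

/* Count the entries written before the first sentinel. */
static size_t
count_captured(const uint32_t *capture, size_t capacity)
{
   assert(capture != NULL);

   size_t n = 0;
   while (n < capacity && capture[n] != CAPTURE_SENTINEL)
      n++;
   return n;
}

/* Return 1 if exactly expected_count entries were written before the first
 * sentinel (or the end of the buffer) and they match expected[], else 0. */
static int
capture_matches(const uint32_t *capture, size_t capacity,
                const uint32_t *expected, size_t expected_count)
{
   if (count_captured(capture, capacity) != expected_count)
      return 0;

   for (size_t i = 0; i < expected_count; i++) {
      if (capture[i] != expected[i])
         return 0;
   }
   return 1;
}
```

With the current output, `count_captured` on the 64-entry buffer gives 8, so a comparison against the 18-entry decomposed pattern fails on the count alone.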

What's been verified to NOT be the cause

  • Probe correctness: a parallel sanity probe (probe_idx.c) calls vkCmdBindIndexBuffer + vkCmdDrawIndexed(6 indices, [10..15]) and correctly captures 10,11,12,13,14,15 via XFB. So:
    • iter13's XFB implementation handles indexed draws perfectly via the public CmdDrawIndexed entry.
    • The patched library doesn't regress indexed XFB.
  • IB-state dirty marking: added gfx_state_set_dirty(cmdbuf, IB) after override (matches CmdBindIndexBuffer2). No effect.
  • Topology dynamic-state dirty bit: added BITSET_SET(...dirty, MESA_VK_DYNAMIC_IA_PRIMITIVE_TOPOLOGY). No effect.

Hypothesis (untested)

The difference between "my injection inside CmdDraw" and "the public CmdDrawIndexed entry" must be in implicit state setup that happens BETWEEN the bind and the draw, but specifically requires the bind to have been a real vkCmd call (not just a direct state mutation). Possibilities:

  1. BO tracking: when CmdBindIndexBuffer2 registers the VkBuffer with the batch, that may add the underlying BO to the batch's BO-list for kernel mapping. My synthetic IB, allocated via panvk_cmd_alloc_dev_mem, should be tied to the cmdbuf but may still need explicit BO-list registration.
  2. Vertex-job descriptor cached pre-draw: an earlier point in command recording may have emitted a vertex-job descriptor based on the topology+IB-bound state at that time. My runtime override doesn't trigger a re-emission because the dirty-bit flow doesn't reach the descriptor cache.
  3. Render-pass-scope state snapshot: pBeginRendering may have captured topology/IB into batch-local copies that my mutation doesn't update.

Resolving any of these requires either: deep panvk internals expertise; GPU-side debugging tools (RGP / Mali Graph Profiler); or restructuring the iter16 fix to operate at a different layer (e.g. NIR-pass-level decomposition, or a state-restore pattern that goes through pBindIB).
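Hypothesis 2 can be pictured with a toy model. None of the names below exist in panvk, and whether the driver actually behaves this way is exactly what remains unverified; the sketch only shows how a state override that sets one dirty bit can leave a previously emitted descriptor stale when the descriptor cache watches a different bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in panvk. */
enum { TOY_DIRTY_IB = 1u << 0, TOY_DIRTY_DESC = 1u << 1 };

struct toy_cmdbuf {
   uint32_t index_size;      /* live state; 0 means non-indexed */
   uint32_t dirty;           /* pending dirty bits */
   uint32_t desc_index_size; /* value baked into the cached descriptor */
   bool desc_valid;
};

/* Re-emit the cached descriptor only if the bit it watches is set. */
static void
toy_emit_desc_if_needed(struct toy_cmdbuf *cb)
{
   if (!cb->desc_valid || (cb->dirty & TOY_DIRTY_DESC)) {
      cb->desc_index_size = cb->index_size;
      cb->desc_valid = true;
   }
   cb->dirty = 0;
}

/* "Public entry" path: updates state and invalidates the descriptor. */
static void
toy_bind_index_buffer(struct toy_cmdbuf *cb, uint32_t index_size)
{
   assert(index_size == 1 || index_size == 2 || index_size == 4);
   cb->index_size = index_size;
   cb->dirty |= TOY_DIRTY_IB | TOY_DIRTY_DESC;
}

/* "Direct mutation" path: sets the IB bit but not the descriptor bit. */
static void
toy_override_ib_state(struct toy_cmdbuf *cb, uint32_t index_size)
{
   cb->index_size = index_size;
   cb->dirty |= TOY_DIRTY_IB;
}

/* Returns the index size the "hardware" would see for this draw. */
static uint32_t
toy_draw(struct toy_cmdbuf *cb)
{
   toy_emit_desc_if_needed(cb);
   return cb->desc_index_size;
}
```

In this model, a descriptor emitted before the override keeps reporting index size 0 at draw time, while the same sequence through `toy_bind_index_buffer` reports 4. That mirrors the observed symptom (a non-indexed dispatch despite correct live state) without claiming it is the cause.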

Consulted Sonnet architect 2026-05-21 — verdict + outcome

Architect picked Path B (call panvk_per_arch(CmdDrawIndexed) from inside the injection instead of constructing the indirect command and calling panvk_cmd_draw_indirect manually). Diagnosis: draw->info.index.size = 0 somewhere; using the public entry should fix it.

Tested. Same failure. Captured 8 entries 0,1,2,3,4,5,6,7 (non-indexed pattern). The architect's diagnosis didn't apply — my code already sets .index.size = cmdbuf->state.gfx.ib.index_size = 4. The bug isn't in that struct field.

Additional test: a sanity probe that calls vkCmdBindIndexBuffer AFTER pBeginRendering but before BindPipeline works perfectly (it captures the bound indices via XFB). So render-pass scope itself isn't the gap. The gap is specifically between mutating state from within CmdDraw and issuing vkCmdBindIndexBuffer as its own vkCmd call. Possibly:

  • pipeline-bind-time descriptor emission captures IB-bound state at that moment
  • some BO-list registration happens in CmdBindIndexBuffer2 (via VK_FROM_HANDLE(panvk_buffer) path) that direct state writes skip
  • Mali JM-specific dirty-tracking that needs explicit invalidation we're missing

Architect's Path C (NIR-pass-level decomposition) is the remaining structural option — 200-400 LoC in pan_nir_lower_xfb to emit multiple store_globals per VS invocation. Bypasses dispatch entirely. Multi-day investment in Mesa internals.
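The "multiple store_globals per VS invocation" idea amounts to a scatter: each vertex invocation writes its outputs to every slot of the decomposed triangle list in which its vertex index appears. The following is a CPU-side sketch of that addressing for triangle strips only. It is not NIR code, `strip_vertex_slots` is a hypothetical name, and a real pass in pan_nir_lower_xfb would also have to handle the other topologies and per-output strides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU model of the per-invocation addressing: for strip vertex
 * v, compute the slots of the decomposed triangle list that hold v.
 * A vertex belongs to at most 3 triangles, so slots[] needs 3 entries.
 * Returns the number of slots written. */
static size_t
strip_vertex_slots(uint32_t v, uint32_t vertex_count, uint32_t slots[3])
{
   assert(v < vertex_count);
   if (vertex_count < 3)
      return 0;

   uint32_t last_tri = vertex_count - 3;
   uint32_t first = v >= 2 ? v - 2 : 0;       /* first triangle using v */
   uint32_t last = v < last_tri ? v : last_tri; /* last triangle using v */
   size_t n = 0;

   for (uint32_t i = first; i <= last; i++) {
      uint32_t k = v - i; /* position of v in the strip window: 0, 1 or 2 */

      /* Odd triangles are emitted as (i, i+2, i+1), so k=1 and k=2 swap. */
      uint32_t pos = ((i & 1) && k != 0) ? 3 - k : k;
      slots[n++] = 3 * i + pos;
   }

   return n;
}
```

Scattering vertices 0..7 of an 8-vertex strip through this mapping fills all 18 slots with 0 1 2 1 3 2 2 3 4 3 5 4 4 5 6 5 7 6, the same sequence the dispatch-level approach tries to produce with a synthetic index buffer.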

  1. Path D — defer iter16 (chosen 2026-05-21): documentary close. Campaign's iter13/iter15 deliverables unchanged. 162 winding fails remain known/categorized.
  2. Path C — NIR-pass decomposition: when bandwidth allows. Bypasses the dispatch-level mystery entirely by doing decomposition at shader-compile time. Pure Mesa work; could land upstream alongside iter13's transform_feedback patches.
  3. Path B — deep debug: revisit with Mali Graph Profiler / RGP to see what GPU descriptors are actually being committed at dispatch. Likely 1-2 more days of driver-internals work to isolate the BO-or-cache divergence.

Files modified on ohm (for resume)

  • src/panfrost/vulkan/panvk_cmd_draw.h — extended xfb substruct + winding_table struct + per-arch decl
  • src/panfrost/vulkan/panvk_vX_cmd_draw.c — vs.num_vertices override + debug fprintf (remove before commit)
  • src/panfrost/vulkan/jm/panvk_vX_cmd_draw.c — CmdDraw injection + debug fprintfs (remove before commit)
  • src/panfrost/vulkan/panvk_vX_winding.c — NEW
  • src/panfrost/vulkan/meson.build — register winding.c

Probe state

/home/mfritsche/src/panvk-bifrost/iter16/probe_winding.c works as a regression test. Verified to FAIL on iter13 r3 baseline (captures 8 not 18 for triangle_strip). Will PASS when the fix lands. Pre-iter16 baseline + iter16-WIP both fail identically — useful for confirming "did the fix change anything observable yet."

— claude-noether, 2026-05-21