a4e7d8ab90
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.
This retrofit imports:
- mesa-panvk-bifrost/ — r1..r4 era phase docs (iter1..iter18)
(libmali stub blobs at iter18/blob/ excluded
— 109MB of RE artifacts replaced with a README
pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/ — frozen .tgz source snapshots at each milestone
(basis for the 0005 patch diff generation)
Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.
Total: 1.9 MB across 124 files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Phase 4 progress (incomplete) — iter16
**Status: WIP. Probe-correct, infrastructure-in-place, integration-blocked.**
## What works
- `panvk_vX_winding.c` (new file) compiles cleanly and builds into the v6/v7 archives as the `panvk_v6_get_winding_table` / `panvk_v7_get_winding_table` symbols. Tables for 7 topologies match the Phase 3 probe expectations.
- The injection point in `jm/panvk_vX_cmd_draw.c::CmdDraw` correctly detects `xfb.active` combined with a non-LIST topology. It then looks up the winding table and builds the synthetic index buffer with the correct decomposition pattern (`0 1 2 1 3 2 2 3 4 3 5 4 4 5 6 5 7 6` for an 8-vertex triangle strip). Finally, it builds the `VkDrawIndexedIndirectCommand` with `indexCount = 18`.
- The `vs.num_vertices` sysval override correctly uses `decomposed_count` (18) instead of `info->vertex.count` (0 for indexed draws).
- The IB and topology state overrides are applied, and their dirty bits are set correctly.
## What's broken
- After `panvk_cmd_draw_indirect(cmdbuf, &draw)` returns, the captured XFB output shows **8 entries, `0,1,2,3,4,5,6,7`**, identical to the iter13 baseline non-indexed dispatch. Expected: 18 entries, `0,1,2,1,3,2,...`.
- Entries 8..63 of the capture buffer still hold the 0xDEADBEEF sentinel. So the dispatch ran 8 invocations, with `gl_VertexIndex` values consistent with a non-indexed draw at `firstVertex = 0`.
- The fall-through trace `[iter16] FALL-THROUGH to non-indexed CmdDraw` does **not** print, confirming that the injection block's `return` is taken.
## What's been verified to NOT be the cause
- Probe correctness: a parallel sanity probe (`probe_idx.c`) calls `vkCmdBindIndexBuffer` + `vkCmdDrawIndexed` (6 indices, values [10..15]) and **correctly captures 10,11,12,13,14,15** via XFB. So:
  - iter13's XFB implementation handles indexed draws correctly via the public `CmdDrawIndexed` entry point.
  - The patched library doesn't regress indexed XFB.
- IB-state dirty marking: added `gfx_state_set_dirty(cmdbuf, IB)` after the override (matching what `CmdBindIndexBuffer2` does). No effect.
- Topology dynamic-state dirty bit: added `BITSET_SET(...dirty, MESA_VK_DYNAMIC_IA_PRIMITIVE_TOPOLOGY)`. No effect.
## Hypothesis (untested)
The difference between "my injection inside CmdDraw" and "the public CmdDrawIndexed entry" must lie in implicit state setup that happens BETWEEN the bind and the draw. That setup specifically requires the bind to have been a real vkCmd call rather than a direct state mutation. Possibilities:

1. **BO tracking**: when `CmdBindIndexBuffer2` registers the VkBuffer with the batch, it may add the underlying BO to the batch's BO list so the kernel maps it. My synthetic IB, allocated via `panvk_cmd_alloc_dev_mem`, should already be tied to the cmdbuf, but it may still need explicit BO-list registration.
2. **Vertex-job descriptor cached pre-draw**: an earlier point in command recording may have emitted a vertex-job descriptor from the topology and IB state bound at that time. If the dirty-bit flow doesn't reach the descriptor cache, my runtime override would not trigger a re-emission.
3. **Render-pass-scope state snapshot**: `pBeginRendering` may have captured topology/IB state into batch-local copies that my mutation doesn't update.

Resolving any of these requires one of the following:
- deep panvk-internals expertise;
- GPU-side debugging tools (RGP / Mali Graph Profiler);
- restructuring the iter16 fix to operate at a different layer (e.g. NIR-pass-level decomposition, or a state-restore pattern that goes through pBindIB).
## Consulted Sonnet architect 2026-05-21 — verdict + outcome
The architect picked Path B: call `panvk_per_arch(CmdDrawIndexed)` from inside the injection instead of constructing the indirect command and calling `panvk_cmd_draw_indirect` manually. Diagnosis: `draw->info.index.size` ends up 0 somewhere, and going through the public entry point should fix it.

**Tested. Same failure.** Captured 8 entries, `0,1,2,3,4,5,6,7` (the non-indexed pattern). The architect's diagnosis didn't apply: my code already sets `.index.size = cmdbuf->state.gfx.ib.index_size` (4). So the bug isn't in that struct field.

Additional test: a sanity probe calls `vkCmdBindIndexBuffer` AFTER `pBeginRendering` and before `BindPipeline`. It works correctly, capturing the bound indices via XFB. So **render-pass scope itself isn't the gap**. The gap is specifically between *mutating state from within CmdDraw* and *issuing vkCmdBindIndexBuffer as its own vkCmd*. Possible causes:

- descriptor emission at pipeline-bind time captures the IB state bound at that moment
- `CmdBindIndexBuffer2` performs some BO-list registration (via the `VK_FROM_HANDLE(panvk_buffer)` path) that direct state writes skip
- Mali JM-specific dirty tracking needs an explicit invalidation we're missing

The architect's Path C (NIR-pass-level decomposition) is the remaining structural option: 200-400 LoC in `pan_nir_lower_xfb` to emit multiple `store_global`s per VS invocation. It bypasses dispatch entirely, but it is a multi-day investment in Mesa internals.
## Recommended next attempts (in order)
1. **Path D: defer iter16** (chosen 2026-05-21). Documentary close. The campaign's iter13/iter15 deliverables are unchanged, and the 162 winding failures remain known and categorized.
2. **Path C: NIR-pass decomposition**, when bandwidth allows. It bypasses the dispatch-level mystery entirely by doing the decomposition at shader-compile time. Pure Mesa work; it could land upstream alongside iter13's transform_feedback patches.
3. **Path B: deep debug**. Revisit with Mali Graph Profiler / RGP to see which GPU descriptors are actually committed at dispatch. Likely 1-2 more days of driver-internals work to determine whether the divergence lies in BO tracking or in the descriptor cache.
## Files modified on ohm (for resume)
- `src/panfrost/vulkan/panvk_cmd_draw.h`: extended `xfb` substruct + `winding_table` struct + per-arch declaration
- `src/panfrost/vulkan/panvk_vX_cmd_draw.c`: `vs.num_vertices` override + debug `fprintf` (remove before commit)
- `src/panfrost/vulkan/jm/panvk_vX_cmd_draw.c`: CmdDraw injection + debug `fprintf`s (remove before commit)
- `src/panfrost/vulkan/panvk_vX_winding.c`: NEW
- `src/panfrost/vulkan/meson.build`: registers `panvk_vX_winding.c`
## Probe state
`/home/mfritsche/src/panvk-bifrost/iter16/probe_winding.c` works as a regression test. It is verified to FAIL on the iter13 r3 baseline, capturing 8 entries instead of 18 for triangle_strip. It should PASS once the fix lands. The pre-iter16 baseline and the iter16 WIP both fail identically. That makes the probe useful for checking whether a fix has changed anything observable yet.

-- claude-noether, 2026-05-21