initial seed: retrofit campaign lineage from local working trees

panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00
parent 430d0da278
commit a4e7d8ab90
124 changed files with 22551 additions and 1 deletions
@@ -0,0 +1,222 @@
# Phase 2 — design lock for panvk-bifrost-video
Phase 1 source-map (`phase1_source_map.md`) acquired the architecture. This document locks the implementation-level decisions that bind Phase 4. Where Phase 1 listed options, this picks one.
## Re-anchored constraints (re-verified 2026-05-21)
- ohm reachable, kernel `linux-fresnel-fourier` with `dma_resv` patches
- `/dev/video1` (hantro decoder) + `/dev/media0` (media controller) present
- libva-v4l2-request-fourier installed and exercising the same V4L2 path — proves the protocol works (1.56× / 1.73× realtime). **Coexistence policy: env-mutex (Phase 0 Q1 lock A).** Only one client holds `/dev/video1` at a time; user picks via `LIBVA_DRIVER_NAME=null` or service-level coordination.
- `mesa-panvk-bifrost` r4 source on ohm at `/home/mfritsche/mesa-build/mesa-26.0.6/`. Reuses the same r1..r4 patch lineage in PKGBUILD; the new package `mesa-panvk-bifrost-video` is a sibling (see Phase 0 [[campaign-close-via-pkgbuild]]).
- Vulkan headers: 26.0.6's bundled `vk.xml` has H.264 decode v9 stable. No `--beta` flag needed.
- Test bitstream: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` (725 MB H.264 Main, 1080p30) — proven decoding via libva path 2026-05-21.
- vk-video-samples builds on aarch64 (Phase 0). simple-test binary at `~/panvk-bifrost-video-evidence/Vulkan-Video-Samples/build/vk_video_decoder/test/vulkan-video-dec-simple-test`.
## Locked decisions
### D1 — V4L2 device ownership: per-`VkVideoSessionKHR`, not per-`VkDevice`
Each call to `vkCreateVideoSessionKHR` opens its own `video_fd` to `/dev/video1` and `media_fd` to `/dev/media0`. The PhysicalDevice only holds discovery state (paths + caps flags). Per Phase 1 §J reasoning: kernel V4L2 state is per-fd, multiple sessions need separate fds anyway, idle-when-no-session is good citizenship.
Trade-off rejected: per-device shared fd. Would force a session-arbitration daemon inside panvk. Not worth it for Phase 1; not needed for the simple-test workload (single session).
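The ownership rule in D1 can be sketched as below. This is a minimal illustration, not panvk code: `sketch_session_fds`, `sketch_session_open` and `sketch_session_close` are hypothetical names. It assumes that both nodes are opened at session create, and that a failed second open rolls back the first so an idle or failed session never holds the decoder.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical per-session fd pair; names are illustrative only. */
struct sketch_session_fds {
   int video_fd; /* e.g. /dev/video1 */
   int media_fd; /* e.g. /dev/media0 */
};

/* Open both nodes for one session. Returns 0 on success or a negative
 * errno value; on failure no fd is left open. */
static int
sketch_session_open(struct sketch_session_fds *s,
                    const char *video_path, const char *media_path)
{
   assert(s && video_path && media_path);
   s->video_fd = -1;
   s->media_fd = -1;

   s->video_fd = open(video_path, O_RDWR | O_CLOEXEC);
   if (s->video_fd < 0)
      return -errno;

   s->media_fd = open(media_path, O_RDWR | O_CLOEXEC);
   if (s->media_fd < 0) {
      int err = errno;        /* save before close() can clobber it */
      close(s->video_fd);
      s->video_fd = -1;
      return -err;
   }
   return 0;
}

/* Release both fds; safe to call on an already-closed pair. */
static void
sketch_session_close(struct sketch_session_fds *s)
{
   assert(s);
   if (s->video_fd >= 0)
      close(s->video_fd);
   if (s->media_fd >= 0)
      close(s->media_fd);
   s->video_fd = -1;
   s->media_fd = -1;
}
```

Because each session owns its own pair, two sessions never share V4L2 per-fd state, which is the property D1 relies on.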
### D2 — File layout (committed)
New files in `src/panfrost/vulkan/`:
| File | Purpose | Est. LoC |
|---|---|---|
| `panvk_video_decode.c` | VkVideoSession* + VkCmd*VideoCoding entrypoints; record video_decode_ops dynarray | ~400 |
| `panvk_video_decode.h` | structs: `panvk_video_session`, `panvk_video_decode_op`, `panvk_video_decode_queue` | ~80 |
| `panvk_v4l2.c` | V4L2 probe + per-session init + Std*→v4l2_ctrl_h264_* bridge + submit_op() | ~500 |
| `panvk_vX_video_decode_queue.c` | per-arch queue create/destroy/submit (walks ops, calls panvk_v4l2_submit_op) | ~150 |
Modified files (locations from Phase 1 §I.1):
- `panvk_vX_physical_device.c` (extension list + capability/format entrypoints)
- `panvk_physical_device.c` (queue family list + video properties pNext walk)
- `panvk_device.h` (queue family enum)
- `panvk_vX_device.c` (queue create/destroy/submit dispatch — 4 cases)
- `meson.build` (register new sources)
### D3 — Per-session state struct (locked layout)
```c
struct panvk_video_session {
   struct vk_video_session vk; /* spec-mandated fields */

   /* V4L2 fds: opened in CreateVideoSession, closed in Destroy */
   int video_fd; /* /dev/video1 */
   int media_fd; /* /dev/media0 */

   /* Negotiated formats per OUTPUT / CAPTURE queue */
   struct v4l2_format fmt_output;
   struct v4l2_format fmt_capture;

   /* Request fd pool. Max-in-flight = max_dpb_slots + 2 */
   int *request_fds;
   unsigned num_request_fds;
   uint32_t request_fd_next; /* round-robin index */

   /* DPB slotIndex → V4L2 reference_ts mapping */
   struct {
      bool valid;
      uint64_t reference_ts; /* V4L2 timestamp at QBUF time */
      /* No image-view pointer here: image references via slotIndex
       * only; resolution at record time via vk.params lookup. */
   } dpb[16];

   /* DECODE_PARAMS/SLICE_PARAMS submit mode (locked FRAME_BASED for Phase 1) */
   bool slice_based; /* Phase 1: false */
};
```
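The request fd pool in the struct above is sized as `max_dpb_slots + 2` and walked round-robin. A minimal sketch of that indexing follows. The type and function names (`sketch_request_pool`, `sketch_request_pool_take`, `sketch_request_pool_size`) are hypothetical. The sketch also assumes the pool is fully allocated up front and leaves out the `MEDIA_IOC_REQUEST_ALLOC` fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the pool fields from panvk_video_session. */
struct sketch_request_pool {
   const int *fds;     /* pre-allocated request fds */
   unsigned num_fds;   /* pool size */
   uint32_t next;      /* round-robin index */
};

/* Pool size rule from the design: max in flight = max_dpb_slots + 2. */
static unsigned
sketch_request_pool_size(unsigned max_dpb_slots)
{
   return max_dpb_slots + 2;
}

/* Return the next request fd and advance the index, wrapping at the end. */
static int
sketch_request_pool_take(struct sketch_request_pool *p)
{
   assert(p && p->fds && p->num_fds > 0);
   int fd = p->fds[p->next % p->num_fds];
   p->next = (p->next + 1) % p->num_fds;
   return fd;
}
```

Keeping `next` reduced modulo the pool size means the index can never overflow, however many decodes a session issues.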
DPB mirroring is identical to `libva-v4l2-request-fourier/src/h264.c:140-218` `dpb_insert` / `dpb_update`. Reuse the algorithm; don't link the lib — copy ~80 LoC verbatim into `panvk_v4l2.c`.
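For orientation, the slot table from D3 can be exercised with a sketch like the one below. It is not the `dpb_insert` / `dpb_update` code from libva-v4l2-request-fourier, which is not reproduced here. It only illustrates the slotIndex to `reference_ts` bookkeeping, using hypothetical names (`sketch_dpb`, `sketch_dpb_store`, `sketch_dpb_lookup`, `sketch_dpb_reset`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_DPB_SLOTS 16

/* Mirrors the anonymous dpb[] entry in panvk_video_session. */
struct sketch_dpb_entry {
   bool valid;
   uint64_t reference_ts; /* V4L2 timestamp used at QBUF time */
};

struct sketch_dpb {
   struct sketch_dpb_entry slot[SKETCH_MAX_DPB_SLOTS];
};

/* Invalidate every slot, e.g. on session destroy or a coding-control reset. */
static void
sketch_dpb_reset(struct sketch_dpb *dpb)
{
   assert(dpb);
   memset(dpb, 0, sizeof(*dpb));
}

/* Record the timestamp of the picture just decoded into a slot. */
static void
sketch_dpb_store(struct sketch_dpb *dpb, uint32_t slot, uint64_t ts)
{
   assert(dpb && slot < SKETCH_MAX_DPB_SLOTS);
   dpb->slot[slot].valid = true;
   dpb->slot[slot].reference_ts = ts;
}

/* Look up the timestamp for a reference slot; false if the slot is
 * out of range or holds no valid picture. */
static bool
sketch_dpb_lookup(const struct sketch_dpb *dpb, uint32_t slot, uint64_t *ts_out)
{
   assert(dpb && ts_out);
   if (slot >= SKETCH_MAX_DPB_SLOTS || !dpb->slot[slot].valid)
      return false;
   *ts_out = dpb->slot[slot].reference_ts;
   return true;
}
```

A failed lookup is the natural place to report the stale-reference case listed in the failure-mode table.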
### D4 — Per-cmdbuf decode-op entry (locked layout)
```c
struct panvk_video_decode_op {
   /* Captured at vkCmdDecodeVideoKHR record time */
   uint32_t dst_dpb_slot;           /* output slot */
   struct panvk_image_view *dst_iv; /* output VkImageView */
   uint32_t num_ref_slots;
   struct {
      uint32_t slot_index;
      struct panvk_image_view *iv;  /* reference VkImageView */
   } ref_slots[16];

   /* Bitstream buffer */
   struct panvk_buffer *src_buffer;
   uint64_t src_offset;
   uint64_t src_size;

   /* Cached params at record time (so submit can run after Parameters object updates) */
   const StdVideoH264SequenceParameterSet *sps; /* from vk.params */
   const StdVideoH264PictureParameterSet *pps;
   VkVideoDecodeH264PictureInfoKHR pic_info;    /* the per-frame info */

   /* Filled at submit time */
   int request_fd;   /* allocated from session pool */
   uint64_t qbuf_ts; /* timestamp used for dpb tracking */
};
```
Recorded as a `util_dynarray` on the command buffer. `vkResetCommandBuffer` clears it.
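The record/reset lifecycle can be sketched with a plain growable array. This is a stand-in for illustration only: it does not use Mesa's `util_dynarray` API, and `sketch_op`, `sketch_op_list` and its helpers are hypothetical names. The one property it assumes from the design is that a reset empties the list and keeps the allocation for the next recording.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Trimmed-down stand-in for panvk_video_decode_op. */
struct sketch_op {
   uint32_t dst_dpb_slot;
   uint64_t src_offset;
   uint64_t src_size;
};

struct sketch_op_list {
   struct sketch_op *ops;
   size_t count;
   size_t capacity;
};

/* Append one recorded op, doubling storage when full.
 * Returns 0 on success, -1 on allocation failure. */
static int
sketch_op_list_append(struct sketch_op_list *l, const struct sketch_op *op)
{
   assert(l && op);
   if (l->count == l->capacity) {
      size_t new_cap = l->capacity ? l->capacity * 2 : 4;
      struct sketch_op *grown = realloc(l->ops, new_cap * sizeof(*grown));
      if (!grown)
         return -1;
      l->ops = grown;
      l->capacity = new_cap;
   }
   l->ops[l->count++] = *op;
   return 0;
}

/* Command-buffer reset: drop recorded ops but keep the storage. */
static void
sketch_op_list_clear(struct sketch_op_list *l)
{
   assert(l);
   l->count = 0;
}

/* Command-buffer destroy: release the storage. */
static void
sketch_op_list_free(struct sketch_op_list *l)
{
   assert(l);
   free(l->ops);
   l->ops = NULL;
   l->count = l->capacity = 0;
}
```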
### D5 — Bitstream input: VkBuffer dmabuf import (one-shot)
At record time, the `VkBuffer` (with `VIDEO_DECODE_SRC_BIT_KHR` usage) carries a `panvk_priv_bo` with an exportable dmabuf. At submit time, op-submit does:
```
fd = pan_kmod_bo_export_dma_buf(src_buffer->bo)
VIDIOC_QBUF(video_fd, V4L2_BUF_TYPE_VIDEO_OUTPUT,
            memory=V4L2_MEMORY_DMABUF, m.fd=fd, bytesused=op->src_size,
            request_fd=op->request_fd)
```
Source-side buffers are not pinned to V4L2 OUTPUT slots — each decode gets a fresh QBUF using the dmabuf fd. After DQBUF the slot is implicitly released.
### D6 — Output frames: VkImage permanent CAPTURE slot binding (Strategy B from §G.2)
At `vkBindImageMemory` time, if the VkImage's usage includes `VIDEO_DECODE_DST_BIT_KHR`, the image's underlying BO is `EXPBUF`'d and registered as a permanent CAPTURE buffer slot via `VIDIOC_QBUF(memory=DMABUF)` at session init; the slot index is then stashed in:
```c
struct panvk_image {
   ...
   int v4l2_capture_index; /* -1 if not a video output image */
};
```
Rejected alternative: per-decode-call dmabuf import. Higher per-frame ioctl overhead. Strategy B amortizes the registration cost across the session lifetime.
### D7 — Submit-time ioctl dance (the 14 steps, locked)
```
panvk_per_arch(video_decode_queue_submit)(queue, submit):
  for each cmdbuf in submit:
    for each op in cmdbuf->video_decode_ops:
      panvk_v4l2_submit_op(session, op):
         1. resolve request_fd: pool[round_robin++ % num] or MEDIA_IOC_REQUEST_ALLOC
         2. ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT)
         3. fill v4l2_ctrl_h264_sps from op->sps via panvk_v4l2_h264_std_to_ctrl_sps()
         4. fill v4l2_ctrl_h264_pps from op->pps via panvk_v4l2_h264_std_to_ctrl_pps()
         5. fill v4l2_ctrl_h264_decode_params from op->pic_info + session->dpb[]
         6. ext_controls = { SPS, PPS, DECODE_PARAMS, SCALING_MATRIX }
            (Phase 1: SLICE_PARAMS optional, FRAME_BASED → omit)
         7. VIDIOC_S_EXT_CTRLS(video_fd, which=REQUEST_VAL, request_fd, ext_controls)
         8. VIDIOC_QBUF(video_fd, OUTPUT, memory=DMABUF, request_fd, m.fd=src_dmabuf,
                        bytesused=op->src_size, timestamp=op->qbuf_ts)
         9. VIDIOC_QBUF(video_fd, CAPTURE, memory=DMABUF, index=dst_iv->image->v4l2_capture_index)
        10. MEDIA_REQUEST_IOC_QUEUE(request_fd)
        11. poll(request_fd, POLLPRI, timeout_ms=200)
        12. VIDIOC_DQBUF(video_fd, OUTPUT)   /* release input slot */
        13. VIDIOC_DQBUF(video_fd, CAPTURE)  /* output ready */
        14. session->dpb[op->dst_dpb_slot] = { valid:true, reference_ts:op->qbuf_ts }
  vk_queue_signal_semaphores(submit->signal_semaphores)
```
Per Phase 1 §J. Step 11's 200ms timeout is empirically derived from libva-v4l2-request-fourier behavior (it polls indefinitely; we cap to avoid driver-side hangs surfacing as Vulkan device-lost on bad bitstreams).
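A bounded wait of the kind step 11 describes could look like the sketch below. The helper name `sketch_wait_fd` and its result enum are hypothetical. The event mask is a parameter so the same helper works with `POLLPRI` on a request fd or with `POLLIN` in a test. Retrying on `EINTR` restarts the full timeout, which is a simplification this sketch accepts.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

enum sketch_wait_result {
   SKETCH_WAIT_DONE = 0,    /* requested event arrived */
   SKETCH_WAIT_TIMEOUT = 1, /* nothing happened within timeout_ms */
   SKETCH_WAIT_ERROR = 2,   /* poll() failed or the fd reported an error */
};

/* Wait for `events` on `fd` for at most `timeout_ms` milliseconds.
 * A negative timeout (wait forever) is rejected on purpose: the design
 * caps the wait so a hung decode cannot block submit indefinitely. */
static enum sketch_wait_result
sketch_wait_fd(int fd, short events, int timeout_ms)
{
   assert(timeout_ms >= 0);
   struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
   int ret;

   do {
      ret = poll(&pfd, 1, timeout_ms);
   } while (ret < 0 && errno == EINTR); /* simplification: timeout restarts */

   if (ret == 0)
      return SKETCH_WAIT_TIMEOUT;
   if (ret < 0 || (pfd.revents & (POLLERR | POLLNVAL)))
      return SKETCH_WAIT_ERROR;
   return (pfd.revents & events) ? SKETCH_WAIT_DONE : SKETCH_WAIT_ERROR;
}
```

For the decode path the call would be `sketch_wait_fd(request_fd, POLLPRI, 200)`, with a timeout result feeding the broad error mapping in D10.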
### D8 — Synchronization: standard vk_queue infrastructure
`panvk_per_arch(create_video_decode_queue)` initializes a `struct vk_queue` with `driver_submit = panvk_per_arch(video_decode_queue_submit)`. Wait/signal semaphores are handled by the standard `vk_queue_submit` infrastructure. Inside `submit`, the `poll(request_fd)` in step 11 is the synchronous gate — when it returns, the decode is done in V4L2 land, and the signal semaphores are signaled before returning.
For Phase 1, **all video decodes are synchronous to submit**. Async / pipelined decode is Phase >>1.
### D9 — Hantro probe: by DT compatible name + topology
`panvk_v4l2_probe_hantro()` enumerates `/dev/video*` via `udev`, queries each node with `VIDIOC_QUERYCAP`, and accepts devices whose `card` field starts with `"hantro-vpu"` OR that match the RK3568/RK3566/RK3588 hantro DT compatibles. Falls back to a hard-coded `/dev/video1` if udev is unavailable. Mirrors `libva-v4l2-request-fourier/src/request.c:143-308` `find_decoder_video_node_via_topology`.
Negative probe outcome (no hantro device) → physical_device's video extension advertisement returns false, queue family entry is suppressed, vkEnumerateDeviceExtensionProperties does not list the three KHR_video_*. Driver gracefully degrades to graphics-only.
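The selection logic of the probe, separated from the actual udev and `VIDIOC_QUERYCAP` calls, can be sketched as below. The names `sketch_v4l2_node`, `sketch_card_is_hantro` and `sketch_pick_decoder` are hypothetical, and the card strings in the usage example are made up. The DT-compatible match is not modeled, only the `card` prefix rule and the hard-coded fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One enumerated node: device path plus the `card` string that
 * VIDIOC_QUERYCAP would have reported for it. */
struct sketch_v4l2_node {
   const char *path;
   const char *card;
};

/* Prefix rule from D9: card must start with "hantro-vpu". */
static bool
sketch_card_is_hantro(const char *card)
{
   static const char prefix[] = "hantro-vpu";
   assert(card != NULL);
   return strncmp(card, prefix, sizeof(prefix) - 1) == 0;
}

/* Pick the decoder node. If enumeration is unavailable, fall back to the
 * hard-coded path; if enumeration ran and found nothing, return NULL so
 * the caller can suppress the video extensions and queue family. */
static const char *
sketch_pick_decoder(const struct sketch_v4l2_node *nodes, size_t count,
                    bool enumeration_available)
{
   if (!enumeration_available)
      return "/dev/video1";

   for (size_t i = 0; i < count; i++) {
      if (sketch_card_is_hantro(nodes[i].card))
         return nodes[i].path;
   }
   return NULL;
}
```

Returning `NULL` only when enumeration actually ran keeps the "udev missing" and "no hantro present" cases distinct, which matters for the graceful-degradation path.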
### D10 — Errors: broad first, refine Phase 6
- V4L2 EINVAL / EAGAIN / EBUSY at submit → `VK_ERROR_DEVICE_LOST` (broad)
- Probe failure during CreateVideoSession → `VK_ERROR_INITIALIZATION_FAILED`
- DPB slot conflict → `VK_ERROR_OUT_OF_DEVICE_MEMORY` (closest spec match)
- Refine per-error-class mapping in Phase 6 (conformance hardening).
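The broad mapping above fits in one small function. The sketch below uses local stand-in enums so it compiles without Vulkan headers; the numeric values match the corresponding `VkResult` codes. The failure-class enum and the function name are hypothetical. Mapping every submit-time errno (not only EINVAL / EAGAIN / EBUSY) to device-lost is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins carrying the same numeric values as the VkResult codes. */
enum sketch_vk_result {
   SKETCH_VK_SUCCESS = 0,
   SKETCH_VK_ERROR_OUT_OF_DEVICE_MEMORY = -2,
   SKETCH_VK_ERROR_INITIALIZATION_FAILED = -3,
   SKETCH_VK_ERROR_DEVICE_LOST = -4,
};

/* Where the failure was observed (hypothetical classification). */
enum sketch_fail_class {
   SKETCH_FAIL_PROBE,        /* during CreateVideoSession */
   SKETCH_FAIL_SUBMIT_IOCTL, /* V4L2 ioctl failure at submit */
   SKETCH_FAIL_DPB_CONFLICT, /* DPB slot conflict */
};

/* Map a failure class plus a positive errno (0 = no error) to a result. */
static enum sketch_vk_result
sketch_map_error(enum sketch_fail_class cls, int err)
{
   assert(err >= 0);
   if (err == 0)
      return SKETCH_VK_SUCCESS;

   switch (cls) {
   case SKETCH_FAIL_PROBE:
      return SKETCH_VK_ERROR_INITIALIZATION_FAILED;
   case SKETCH_FAIL_DPB_CONFLICT:
      return SKETCH_VK_ERROR_OUT_OF_DEVICE_MEMORY;
   case SKETCH_FAIL_SUBMIT_IOCTL:
   default:
      /* Broad on purpose: EINVAL, EAGAIN, EBUSY and anything else. */
      return SKETCH_VK_ERROR_DEVICE_LOST;
   }
}
```

Keeping the mapping in one place makes the Phase 6 refinement a local change rather than a hunt through every ioctl call site.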
## Out of scope for this iteration (explicit non-goals)
1. **H.265 / HEVC**: Phase 0 lock — H.264 only.
2. **Encode**: out of scope, ever (until a separate campaign).
3. **Async decode** / pipelined submit: synchronous-to-submit only in Phase 1.
4. **Multi-session concurrent decode**: single session only in Phase 1 (per Phase 0 Q5).
5. **`VkVideoMaintenance1`** (inline parameters, inline queries): not in the simple-test requirements.
6. **Multiplane 444 formats** (`VK_EXT_ycbcr_2plane_444_formats`): optional, not in Phase 1.
7. **`VK_EXT_descriptor_buffer`** integration: optional, not in Phase 1.
8. **Decode correctness verification** (frame-PSNR vs reference): Phase 7 territory.
9. **Brave consumer**: structurally unfixable, see brave-vaapi-fourier close + DokuWiki.
## Failure modes to watch for during Phase 4 (instrumentation plan)
| Failure | Detection |
|---|---|
| hantro device not present on a build target | `panvk_v4l2_probe_hantro` returns false → extension list silently shrinks. Test: `vulkaninfo \| grep VK_KHR_video` empty on a non-hantro box |
| `/dev/video1` held by libva → CreateVideoSession EBUSY | `mesa_loge()` at probe + return VK_ERROR_INITIALIZATION_FAILED. Test: run mpv-fourier in parallel, verify clean error message |
| S_EXT_CTRLS EINVAL on a per-control basis | per-control `failing_ctrl_id` is in libva-v4l2-request-fourier `src/v4l2.c:497-502` (the format we don't have on the iter14 path). Reproduce that diagnostic in our `panvk_v4l2_submit_op` |
| H.264 spec field mismatch between Std* and v4l2_ctrl_* | Add a per-field assertion in the std→v4l2 bridge for the fields where the bitwidth differs (e.g., `bit_depth_luma_minus8` is u8 in std, u8 in v4l2 — but some flags pack differently). Test: assert at translation time |
| DPB slot reuse with stale reference_ts | `session->dpb[].valid` cleared at DestroyVideoSession + at ResetVideoCodingControl. Test: send a `RESET` flag mid-stream and check dpb[] is cleared |
| Driver-side decode hang (bad bitstream) | poll(timeout=200ms) is the gate. Test: feed a truncated bitstream, verify clean VK_ERROR_DEVICE_LOST rather than session hang |
## Phase 4 implementation slice — first three commits
Bite-sized, validated incrementally:
1. **Commit 1**: extension advertisement + queue family registration (no functionality, just enumeration). Validation: `vulkan-video-dec-simple-test` gets past `HasAllDeviceExtensions` check and into device creation. Failure mode: extension list still missing.
2. **Commit 2**: `CreateVideoSessionKHR` + `DestroyVideoSessionKHR` + capability/format entrypoints (returns sane caps, no V4L2 yet; fds opened as `/dev/null` placeholders if necessary). Validation: simple-test creates a session, gets memory requirements (0 entries), destroys it cleanly. Failure mode: session create returns ERROR.
3. **Commit 3**: `panvk_v4l2_probe_hantro` + real video_fd open + per-session V4L2 init (S_FMT, REQBUFS, request fd pool). Validation: simple-test creates a session against real `/dev/video1`. Failure mode: probe fails or EBUSY.
After commit 3, all the plumbing is wired. Commits 4-6 add the per-frame decode plumbing (vkCmdDecodeVideoKHR record + submit dispatch + the ioctl dance). Commit 7 is the Std→v4l2 control bridge.
## Phase 2 close criteria
- [x] All D1..D10 decisions locked
- [x] Non-goals explicit
- [x] Failure-modes table with detection methods
- [x] Phase 4 first-three-commits slice defined
- [x] Constraints re-verified on ohm (substrate side)
Phase 3 next: build a probe test client (smaller than vk-video-samples) that exercises just the extension-advertisement + queue-family-enumeration path. This is the regression test Phase 4 commits 1-2 are validated against, before bringing in the heavier vk-video-samples machinery.
— claude-noether, 2026-05-21