panvk-bifrost/mesa-panvk-bifrost-video/phase2_design.md
marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00

# Phase 2 — design lock for panvk-bifrost-video
Phase 1 source-map (`phase1_source_map.md`) acquired the architecture. This document locks the implementation-level decisions that bind Phase 4. Where Phase 1 listed options, this picks one.
## Re-anchored constraints (re-verified 2026-05-21)
- ohm reachable, kernel `linux-fresnel-fourier` with `dma_resv` patches
- `/dev/video1` (hantro decoder) + `/dev/media0` (media controller) present
- libva-v4l2-request-fourier installed and exercising the same V4L2 path — proves the protocol works (1.56× / 1.73× realtime). **Coexistence policy: env-mutex (Phase 0 Q1 lock A).** Only one client holds `/dev/video1` at a time; user picks via `LIBVA_DRIVER_NAME=null` or service-level coordination.
- `mesa-panvk-bifrost` r4 source on ohm at `/home/mfritsche/mesa-build/mesa-26.0.6/`. Reuses the same r1..r4 patch lineage in PKGBUILD; new package `mesa-panvk-bifrost-video` is a sibling — see Phase 0 [[campaign-close-via-pkgbuild]].
- Vulkan headers: 26.0.6's bundled `vk.xml` has H.264 decode v9 stable. No `--beta` flag needed.
- Test bitstream: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` (725 MB H.264 Main, 1080p30) — proven decoding via libva path 2026-05-21.
- vk-video-samples builds on aarch64 (Phase 0). simple-test binary at `~/panvk-bifrost-video-evidence/Vulkan-Video-Samples/build/vk_video_decoder/test/vulkan-video-dec-simple-test`.
## Locked decisions
### D1 — V4L2 device ownership: per-`VkVideoSessionKHR`, not per-`VkDevice`
Each call to `vkCreateVideoSessionKHR` opens its own `video_fd` to `/dev/video1` and `media_fd` to `/dev/media0`. The PhysicalDevice only holds discovery state (paths + caps flags). Per Phase 1 §J reasoning: kernel V4L2 state is per-fd, multiple sessions need separate fds anyway, idle-when-no-session is good citizenship.
Trade-off rejected: per-device shared fd. Would force a session-arbitration daemon inside panvk. Not worth it for Phase 1; not needed for the simple-test workload (single session).
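A minimal sketch of the per-session open/close pairing described in D1, assuming plain `open(2)` on the two device nodes. The struct and function names (`session_fds`, `session_fds_open`, `session_fds_close`) are hypothetical and not part of panvk; the point is only the "both fds or neither" invariant that makes Destroy trivial.

```c
#define _GNU_SOURCE /* for O_CLOEXEC under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical holder for the two per-session fds from D1. */
struct session_fds {
   int video_fd; /* decoder node, e.g. /dev/video1 */
   int media_fd; /* media controller, e.g. /dev/media0 */
};

/* Opens both nodes or neither. Returns 0 on success, -errno on failure. */
static int
session_fds_open(struct session_fds *s, const char *video_path,
                 const char *media_path)
{
   assert(s && video_path && media_path);
   s->video_fd = s->media_fd = -1;

   s->video_fd = open(video_path, O_RDWR | O_CLOEXEC);
   if (s->video_fd < 0)
      return -errno;

   s->media_fd = open(media_path, O_RDWR | O_CLOEXEC);
   if (s->media_fd < 0) {
      int err = errno; /* save before close() can clobber it */
      close(s->video_fd);
      s->video_fd = -1;
      return -err;
   }
   return 0;
}

/* Safe to call on a partially opened or already closed holder. */
static void
session_fds_close(struct session_fds *s)
{
   if (s->video_fd >= 0)
      close(s->video_fd);
   if (s->media_fd >= 0)
      close(s->media_fd);
   s->video_fd = s->media_fd = -1;
}
```

With this shape, an `EBUSY` from the first `open()` propagates as `-EBUSY` without leaking an fd, which is the case the env-mutex coexistence policy has to report cleanly.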
### D2 — File layout (committed)
New files in `src/panfrost/vulkan/`:
| File | Purpose | Est. LoC |
|---|---|---|
| `panvk_video_decode.c` | VkVideoSession* + VkCmd*VideoCoding entrypoints; record video_decode_ops dynarray | ~400 |
| `panvk_video_decode.h` | structs: `panvk_video_session`, `panvk_video_decode_op`, `panvk_video_decode_queue` | ~80 |
| `panvk_v4l2.c` | V4L2 probe + per-session init + Std*→v4l2_ctrl_h264_* bridge + submit_op() | ~500 |
| `panvk_vX_video_decode_queue.c` | per-arch queue create/destroy/submit (walks ops, calls panvk_v4l2_submit_op) | ~150 |
Modified files (locations from Phase 1 §I.1):
- `panvk_vX_physical_device.c` (extension list + capability/format entrypoints)
- `panvk_physical_device.c` (queue family list + video properties pNext walk)
- `panvk_device.h` (queue family enum)
- `panvk_vX_device.c` (queue create/destroy/submit dispatch — 4 cases)
- `meson.build` (register new sources)
### D3 — Per-session state struct (locked layout)
```c
struct panvk_video_session {
struct vk_video_session vk; /* spec-mandated fields */
/* V4L2 fds — opened in CreateVideoSession, closed in Destroy */
int video_fd; /* /dev/video1 */
int media_fd; /* /dev/media0 */
/* Negotiated formats per OUTPUT / CAPTURE queue */
struct v4l2_format fmt_output;
struct v4l2_format fmt_capture;
/* Request fd pool. Max-in-flight = max_dpb_slots + 2 */
int *request_fds;
unsigned num_request_fds;
uint32_t request_fd_next; /* round-robin index */
/* DPB slotIndex → V4L2 reference_ts mapping */
struct {
bool valid;
uint64_t reference_ts; /* V4L2 timestamp at QBUF time */
/* No image-view pointer here — image references via slotIndex
* only; resolution at record time via vk.params lookup. */
} dpb[16];
/* DECODE_PARAMS/SLICE_PARAMS submit mode (locked FRAME_BASED for Phase 1) */
bool slice_based; /* Phase 1: false */
};
```
DPB mirroring is identical to `libva-v4l2-request-fourier/src/h264.c:140-218` `dpb_insert` / `dpb_update`. Reuse the algorithm; don't link the lib — copy ~80 LoC verbatim into `panvk_v4l2.c`.
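The slot bookkeeping implied by the `dpb[16]` array can be sketched as follows. This is an illustrative stand-in, not the `dpb_insert` / `dpb_update` code from libva-v4l2-request-fourier; all `sketch_*` names are hypothetical. It shows only the three operations the design relies on: store a timestamp after a successful decode, look it up when building reference lists, and clear everything on reset or destroy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_DPB_SLOTS 16

struct sketch_dpb_entry {
   bool valid;
   uint64_t reference_ts; /* V4L2 timestamp used at QBUF time */
};

struct sketch_dpb {
   struct sketch_dpb_entry slot[SKETCH_MAX_DPB_SLOTS];
};

/* Invalidate every slot (session destroy, or a coding-control reset). */
static void
sketch_dpb_reset(struct sketch_dpb *dpb)
{
   memset(dpb, 0, sizeof(*dpb));
}

/* Record that `slot` now holds the picture queued with timestamp `ts`. */
static void
sketch_dpb_store(struct sketch_dpb *dpb, uint32_t slot, uint64_t ts)
{
   assert(slot < SKETCH_MAX_DPB_SLOTS);
   dpb->slot[slot].valid = true;
   dpb->slot[slot].reference_ts = ts;
}

/* Resolve a Vulkan slotIndex to the timestamp V4L2 knows the frame by.
 * Returns false for out-of-range or never-written slots. */
static bool
sketch_dpb_lookup(const struct sketch_dpb *dpb, uint32_t slot,
                  uint64_t *ts_out)
{
   if (slot >= SKETCH_MAX_DPB_SLOTS || !dpb->slot[slot].valid)
      return false;
   *ts_out = dpb->slot[slot].reference_ts;
   return true;
}
```

A failed lookup is the natural place to detect the "DPB slot reuse with stale reference_ts" failure mode before it reaches the kernel.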
### D4 — Per-cmdbuf decode-op entry (locked layout)
```c
struct panvk_video_decode_op {
/* Captured at vkCmdDecodeVideoKHR record time */
uint32_t dst_dpb_slot; /* output slot */
struct panvk_image_view *dst_iv; /* output VkImageView */
uint32_t num_ref_slots;
struct {
uint32_t slot_index;
struct panvk_image_view *iv; /* reference VkImageView */
} ref_slots[16];
/* Bitstream buffer */
struct panvk_buffer *src_buffer;
uint64_t src_offset;
uint64_t src_size;
/* Cached params at record time (so submit can run after Parameters object updates) */
const StdVideoH264SequenceParameterSet *sps; /* from vk.params */
const StdVideoH264PictureParameterSet *pps;
VkVideoDecodeH264PictureInfoKHR pic_info; /* the per-frame info */
/* Filled at submit time */
int request_fd; /* allocated from session pool */
uint64_t qbuf_ts; /* timestamp used for dpb tracking */
};
```
Recorded as a `util_dynarray` on the command buffer. `vkResetCommandBuffer` clears it.
### D5 — Bitstream input: VkBuffer dmabuf import (one-shot)
At record time, the `VkBuffer` (with `VIDEO_DECODE_SRC_BIT_KHR` usage) carries a `panvk_priv_bo` with an exportable dmabuf. At submit time, op-submit does:
```
fd = pan_kmod_bo_export_dma_buf(src_buffer->bo)
VIDIOC_QBUF(video_fd, V4L2_BUF_TYPE_VIDEO_OUTPUT,
memory=V4L2_MEMORY_DMABUF, m.fd=fd, bytesused=op->src_size,
request_fd=op->request_fd)
```
Source-side buffers are not pinned to V4L2 OUTPUT slots — each decode gets a fresh QBUF using the dmabuf fd. After DQBUF the slot is implicitly released.
### D6 — Output frames: VkImage permanent CAPTURE slot binding (Strategy B from §G.2)
At `vkBindImageMemory` time, if the VkImage's usage includes `VIDEO_DECODE_DST_BIT_KHR`, the image's underlying BO is `EXPBUF`'d and registered as a permanent CAPTURE buffer slot via `VIDIOC_QBUF(memory=DMABUF)` at session init; the slot index is then stashed in:
```c
struct panvk_image {
...
int v4l2_capture_index; /* -1 if not a video output image */
};
```
Rejected alternative: per-decode-call dmabuf import. Higher per-frame ioctl overhead. Strategy B amortizes the registration cost across the session lifetime.
### D7 — Submit-time ioctl dance (the 14 steps, locked)
```
panvk_per_arch(video_decode_queue_submit)(queue, submit):
for each cmdbuf in submit:
for each op in cmdbuf->video_decode_ops:
panvk_v4l2_submit_op(session, op):
1. resolve request_fd: pool[round_robin++ % num] or MEDIA_IOC_REQUEST_ALLOC
2. ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT)
3. fill v4l2_ctrl_h264_sps from op->sps via panvk_v4l2_h264_std_to_ctrl_sps()
4. fill v4l2_ctrl_h264_pps from op->pps via panvk_v4l2_h264_std_to_ctrl_pps()
5. fill v4l2_ctrl_h264_decode_params from op->pic_info + session->dpb[]
6. ext_controls = { SPS, PPS, DECODE_PARAMS, SCALING_MATRIX }
(Phase 1: SLICE_PARAMS optional, FRAME_BASED → omit)
7. VIDIOC_S_EXT_CTRLS(video_fd, which=REQUEST_VAL, request_fd, ext_controls)
8. VIDIOC_QBUF(video_fd, OUTPUT, memory=DMABUF, request_fd, m.fd=src_dmabuf,
bytesused=op->src_size, timestamp=op->qbuf_ts)
9. VIDIOC_QBUF(video_fd, CAPTURE, memory=DMABUF, index=dst_iv->image->v4l2_capture_index)
10. MEDIA_REQUEST_IOC_QUEUE(request_fd)
11. poll(request_fd, POLLPRI, timeout_ms=200)
12. VIDIOC_DQBUF(video_fd, OUTPUT) /* release input slot */
13. VIDIOC_DQBUF(video_fd, CAPTURE) /* output ready */
14. session->dpb[op->dst_dpb_slot] = { valid:true, reference_ts:op->qbuf_ts }
vk_queue_signal_semaphores(submit->signal_semaphores)
```
Per Phase 1 §J. Step 11's 200ms timeout is an empirical cap chosen from observed libva-v4l2-request-fourier behavior: that library polls indefinitely, whereas we bound the wait so that a driver-side hang on a bad bitstream surfaces as `VK_ERROR_DEVICE_LOST` instead of blocking the submit forever.
### D8 — Synchronization: standard vk_queue infrastructure
`panvk_per_arch(create_video_decode_queue)` initializes a `struct vk_queue` with `driver_submit = panvk_per_arch(video_decode_queue_submit)`. Wait/signal semaphores are handled by the standard `vk_queue_submit` infrastructure. Inside `submit`, the `poll(request_fd)` in step 11 is the synchronous gate — when it returns, the decode is done in V4L2 land, and the signal semaphores are signaled before returning.
For Phase 1, **all video decodes are synchronous to submit**. Async / pipelined decode is deferred to a much later phase.
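The synchronous gate itself is a bounded `poll(2)` on the request fd, which the media request API signals with `POLLPRI` on completion. A minimal sketch follows; `sketch_wait_request` is a hypothetical name, and the choice to report unexpected events as `-EIO` is an assumption of this sketch rather than a locked decision.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/* Block until the request completes or timeout_ms elapses.
 * Returns 0 on completion, -ETIMEDOUT on timeout, -errno on poll failure,
 * and -EIO if poll woke up for something other than POLLPRI. */
static int
sketch_wait_request(int request_fd, int timeout_ms)
{
   struct pollfd pfd = { .fd = request_fd, .events = POLLPRI };
   int ret;

   assert(request_fd >= 0);

   /* Retrying on EINTR restarts the full timeout; good enough for a
    * sketch, a real implementation may want to track the deadline. */
   do {
      ret = poll(&pfd, 1, timeout_ms);
   } while (ret < 0 && errno == EINTR);

   if (ret < 0)
      return -errno;
   if (ret == 0)
      return -ETIMEDOUT;
   return (pfd.revents & POLLPRI) ? 0 : -EIO;
}
```

Called as `sketch_wait_request(op->request_fd, 200)`, a `-ETIMEDOUT` return is the condition that the submit path would translate into a device-lost result before signaling anything.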
### D9 — Hantro probe: by DT compatible name + topology
`panvk_v4l2_probe_hantro()` enumerates `/dev/video*` via `udev`, queries each node with `VIDIOC_QUERYCAP`, and accepts nodes whose `card` field starts with `"hantro-vpu"` or matches one of the RK3568/RK3566/RK3588 hantro DT compatibles. It falls back to a hard-coded `/dev/video1` if udev is unavailable. This mirrors `find_decoder_video_node_via_topology` in `libva-v4l2-request-fourier/src/request.c:143-308`.
Negative probe outcome (no hantro device) → physical_device's video extension advertisement returns false, queue family entry is suppressed, vkEnumerateDeviceExtensionProperties does not list the three KHR_video_*. Driver gracefully degrades to graphics-only.
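The acceptance test for a single node can be sketched like this, leaving out the udev enumeration. The names `sketch_is_hantro` and `sketch_probe_node` are hypothetical, and the list of DT compatible strings is passed in by the caller because this document does not spell the exact strings out.

```c
#define _GNU_SOURCE /* for O_CLOEXEC under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

/* True if `card` starts with "hantro-vpu" or equals one of the
 * caller-supplied compatible strings. */
static bool
sketch_is_hantro(const char *card, const char *const *compatibles, size_t n)
{
   static const char prefix[] = "hantro-vpu";

   assert(card);
   if (strncmp(card, prefix, sizeof(prefix) - 1) == 0)
      return true;
   for (size_t i = 0; i < n; i++)
      if (strcmp(card, compatibles[i]) == 0)
         return true;
   return false;
}

/* Open one candidate node, QUERYCAP it, and apply the match above.
 * Any failure (missing node, non-V4L2 device) is a negative probe. */
static bool
sketch_probe_node(const char *path, const char *const *compatibles, size_t n)
{
   struct v4l2_capability cap;
   int fd = open(path, O_RDWR | O_CLOEXEC);

   if (fd < 0)
      return false;
   memset(&cap, 0, sizeof(cap));
   bool ok = ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0 &&
             sketch_is_hantro((const char *)cap.card, compatibles, n);
   close(fd);
   return ok;
}
```

Keeping the string match separate from the ioctl makes the graceful-degradation path testable on machines without any V4L2 decoder.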
### D10 — Errors: broad first, refine Phase 6
- V4L2 EINVAL / EAGAIN / EBUSY at submit → `VK_ERROR_DEVICE_LOST` (broad)
- Probe failure during CreateVideoSession → `VK_ERROR_INITIALIZATION_FAILED`
- DPB slot conflict → `VK_ERROR_OUT_OF_DEVICE_MEMORY` (closest spec match)
- Refine per-error-class mapping in Phase 6 (conformance hardening).
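
The broad mapping above fits in one small function. The sketch below uses a local enum as a stand-in for `VkResult` so that it compiles without Vulkan headers; the numeric values match the Vulkan ones, but the `SKETCH_*` names and the `sketch_stage` split are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the VkResult values used in D10. */
enum sketch_vk_result {
   SKETCH_VK_SUCCESS = 0,
   SKETCH_VK_ERROR_OUT_OF_DEVICE_MEMORY = -2,
   SKETCH_VK_ERROR_INITIALIZATION_FAILED = -3,
   SKETCH_VK_ERROR_DEVICE_LOST = -4,
};

/* Where the failure was observed. */
enum sketch_stage {
   SKETCH_STAGE_CREATE_SESSION, /* probe / open during session create */
   SKETCH_STAGE_SUBMIT,         /* any V4L2 ioctl in the submit path */
   SKETCH_STAGE_DPB,            /* slot conflict detected by the driver */
};

/* `err` is a positive errno value, or 0 for success. */
static enum sketch_vk_result
sketch_map_error(enum sketch_stage stage, int err)
{
   assert(err >= 0);
   if (err == 0)
      return SKETCH_VK_SUCCESS;

   switch (stage) {
   case SKETCH_STAGE_CREATE_SESSION:
      return SKETCH_VK_ERROR_INITIALIZATION_FAILED;
   case SKETCH_STAGE_DPB:
      return SKETCH_VK_ERROR_OUT_OF_DEVICE_MEMORY;
   case SKETCH_STAGE_SUBMIT:
   default:
      switch (err) {
      case EINVAL:
      case EAGAIN:
      case EBUSY:
      default: /* broad on purpose; per-class refinement is Phase 6 */
         return SKETCH_VK_ERROR_DEVICE_LOST;
      }
   }
}
```

The inner switch is redundant today; it is written out so that the Phase 6 refinement has an obvious place to split the cases.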
## Out of scope for this iteration (explicit non-goals)
1. **H.265 / HEVC**: Phase 0 lock — H.264 only.
2. **Encode**: out of scope, ever (until a separate campaign).
3. **Async decode** / pipelined submit: synchronous-to-submit only in Phase 1.
4. **Multi-session concurrent decode**: single session only in Phase 1 (per Phase 0 Q5).
5. **`VkVideoMaintenance1`** (inline parameters, inline queries): not in the simple-test requirements.
6. **Multiplane 444 formats** (`VK_EXT_ycbcr_2plane_444_formats`): optional, not in Phase 1.
7. **`VK_EXT_descriptor_buffer`** integration: optional, not in Phase 1.
8. **Decode correctness verification** (frame-PSNR vs reference): Phase 7 territory.
9. **Brave consumer**: structurally unfixable, see brave-vaapi-fourier close + DokuWiki.
## Failure modes to watch for during Phase 4 (instrumentation plan)
| Failure | Detection |
|---|---|
| hantro device not present on a build target | `panvk_v4l2_probe_hantro` returns false → extension list silently shrinks. Test: `vulkaninfo \| grep VK_KHR_video` empty on a non-hantro box |
| `/dev/video1` held by libva → CreateVideoSession EBUSY | `mesa_loge()` at probe + return VK_ERROR_INITIALIZATION_FAILED. Test: run mpv-fourier in parallel, verify clean error message |
| S_EXT_CTRLS EINVAL on a per-control basis | per-control `failing_ctrl_id` is in libva-v4l2-request-fourier `src/v4l2.c:497-502` (the format we don't have on the iter14 path). Reproduce that diagnostic in our `panvk_v4l2_submit_op` |
| H.264 spec field mismatch between Std* and v4l2_ctrl_* | Add a per-field assertion in the std→v4l2 bridge for the fields whose width or packing differs (e.g., `bit_depth_luma_minus8` is u8 on both sides, but the Std* flags are C bitfields while v4l2 packs them into a `flags` bitmask). Test: assert at translation time |
| DPB slot reuse with stale reference_ts | `session->dpb[].valid` cleared at DestroyVideoSession + at ResetVideoCodingControl. Test: send a `RESET` flag mid-stream and check dpb[] is cleared |
| Driver-side decode hang (bad bitstream) | poll(timeout=200ms) is the gate. Test: feed a truncated bitstream, verify clean VK_ERROR_DEVICE_LOST rather than session hang |
## Phase 4 implementation slice — first three commits
Bite-sized, validated incrementally:
1. **Commit 1**: extension advertisement + queue family registration (no functionality, just enumeration). Validation: `vulkan-video-dec-simple-test` gets past `HasAllDeviceExtensions` check and into device creation. Failure mode: extension list still missing.
2. **Commit 2**: `CreateVideoSessionKHR` + `DestroyVideoSessionKHR` + capability/format entrypoints (returns sane caps, no V4L2 yet; fds opened as `/dev/null` placeholders if necessary). Validation: simple-test creates a session, gets memory requirements (0 entries), destroys it cleanly. Failure mode: session create returns ERROR.
3. **Commit 3**: `panvk_v4l2_probe_hantro` + real video_fd open + per-session V4L2 init (S_FMT, REQBUFS, request fd pool). Validation: simple-test creates a session against real `/dev/video1`. Failure mode: probe fails or EBUSY.
After commit 3, all the plumbing is wired. Commits 4-6 add the per-frame decode plumbing (vkCmdDecodeVideoKHR record + submit dispatch + the ioctl dance). Commit 7 is the Std→v4l2 control bridge.
## Phase 2 close criteria
- [x] All D1..D10 decisions locked
- [x] Non-goals explicit
- [x] Failure-modes table with detection methods
- [x] Phase 4 first-three-commits slice defined
- [x] Constraints re-verified on ohm (substrate side)
Phase 3 next: build a probe test client (smaller than vk-video-samples) that exercises just the extension-advertisement + queue-family-enumeration path. This is the regression test Phase 4 commits 1-2 are validated against, before bringing in the heavier vk-video-samples machinery.
— claude-noether, 2026-05-21