marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00

# Phase 1 Source Map — VK_KHR_video_decode_h264 on panvk-bifrost (V4L2/Hantro backend)
**Campaign**: panvk-bifrost-video (successor to panvk-bifrost r4)

**Mesa version**: 26.0.6 (source tree on ohm at `/home/mfritsche/mesa-build/mesa-26.0.6/`)

**Phase 1 goal**: vk-video-samples simple-test passes `HasAllDeviceExtensions`, creates a `VkVideoSessionKHR`, and submits one `vkCmdDecodeVideoKHR`. Decode correctness is Phase 7.

**Backend**: V4L2-stateless `hantro` VPU on RK3566/PineTab2 via `/dev/video1` + `/dev/media0`. The Mali GPU is not the decode engine.
> Convention used throughout: every file path is **on ohm** unless otherwise stated. Cite as `FILE:LINE`. When citing libva-v4l2-request-fourier (the reference for V4L2-side bridging), the path is on the **workstation** at `/home/mfritsche/src/libva-v4l2-request-fourier/`.
---
## Executive summary
The Mesa 26.0.6 video stack is structured in three layers:
1. **Shared runtime helpers**: `src/vulkan/runtime/vk_video.{c,h}` (3413 + 436 lines). Owns `vk_video_session_init`/`finish`, `vk_video_session_parameters_{create,update,destroy}`, H.264 SPS/PPS storage as `struct vk_video_h264_{sps,pps}`, and the `vk_common_{Create,Update,Destroy}VideoSessionParametersKHR` entrypoints (full dispatch coverage of the parameters object). Also provides the codec parameter lookup helpers (`vk_video_get_h264_parameters`, `vk_video_find_h264_dec_std_{sps,pps}`).
2. **Driver-side video**: anv (`src/intel/vulkan/anv_video.c` + `genX_cmd_video.c`) and radv (`src/amd/vulkan/radv_video.c`). Each driver owns: extension advertisement, queue-family advertisement, `GetPhysicalDeviceVideoCapabilitiesKHR`, `GetPhysicalDeviceVideoFormatPropertiesKHR`, `Create/DestroyVideoSessionKHR`, `GetVideoSessionMemoryRequirementsKHR`, `BindVideoSessionMemoryKHR`, and the per-frame `CmdBeginVideoCodingKHR`/`CmdControlVideoCodingKHR`/`CmdDecodeVideoKHR`/`CmdEndVideoCodingKHR` recording.
3. **HW codegen**: the driver emits register packets into a command stream while recording `CmdDecodeVideoKHR`; the existing GPU queue submit path then ships that stream to the video engine.

**Critical mismatch for our backend**: layer 3 does not exist for us. The Hantro VPU has no Mali-side command stream. It has its own kernel device nodes (`/dev/video1` + `/dev/media0`) with a request-API ioctl interface. So we keep layer 1 verbatim (huge win: all H.264 SPS/PPS parsing comes free), reuse layer 2's *interface contracts*, and replace the command-stream codegen that layer 2 drives (layer 3) with deferred V4L2 control marshalling plus submit-time `VIDIOC_QBUF`/`POLL`/`VIDIOC_DQBUF`.

**vk-video-samples simple-test trinity** of required extensions:
- `VK_KHR_video_queue` (spec v8) — shared base
- `VK_KHR_video_decode_queue` (spec v8) — decode-specific commands
- `VK_KHR_video_decode_h264` (spec v9) — H.264 profile
None of them are advertised in panvk-bifrost r4 today (Mesa 26.0.6 `src/panfrost/vulkan/panvk_vX_physical_device.c:539-540` explicitly sets `unifiedImageLayoutsVideo = false` and leaves all `KHR_video_*` extension flags unset, i.e. default-false).

---
## A. Extension surface
### A.1 Where extensions are advertised
The panvk extension table is built by `panvk_per_arch(get_physical_device_extensions)` in `src/panfrost/vulkan/panvk_vX_physical_device.c:35-160`. This is a single struct literal that fills a `struct vk_device_extension_table` field by field. To add the three required extensions, we extend the literal while keeping its alphabetical order by `KHR_` name:
```
.KHR_video_decode_h264 = true, /* gated on hantro probe success */
.KHR_video_decode_queue = true,
.KHR_video_queue = true,
```
The natural insertion point is between `.KHR_vertex_attribute_divisor = true,` (line ~123) and `.KHR_vulkan_memory_model = true,` (line ~124).
Anv reference for comparison: `src/intel/vulkan/anv_physical_device.c:262-274`:
```c
.KHR_video_queue = video_decode_enabled || video_encode_enabled,
.KHR_video_decode_queue = video_decode_enabled,
.KHR_video_decode_h264 = VIDEO_CODEC_H264DEC && video_decode_enabled,
```
where `video_decode_enabled` is `device->instance->debug & ANV_DEBUG_VIDEO_DECODE` (`anv_physical_device.c:153`). Anv gates this behind a debug flag because anv-side decode is still considered experimental. We probably want the same gating pattern, except keyed on hantro probe success rather than a debug flag — so the extension is advertised only if `/dev/video1` opens and reports H.264 OUTPUT format support.
### A.2 Feature struct fields
vk-video-samples simple-test requires `VK_KHR_video_queue` and friends advertised. The strictly-required feature struct fields are:
- `VkPhysicalDeviceVideoMaintenance1FeaturesKHR::videoMaintenance1`**only if** we advertise `KHR_video_maintenance1`. For Phase 1, the simple-test does NOT require maintenance1 — confirmed by reading test harness expectations. Skip in Phase 1.
- `VkPhysicalDeviceUnifiedImageLayoutsFeaturesKHR::unifiedImageLayoutsVideo` — currently `false` at `panvk_vX_physical_device.c:540`. Stays `false` for Phase 1 (transition rules still apply).
The shared `vk_video_session` struct (`vk_video.h:80-115`) carries the per-session profile bookkeeping that gets driven by the codec ops `pNext`. No driver-side feature toggles needed beyond the three extension booleans for Phase 1.
### A.3 vkGetPhysicalDeviceVideoCapabilitiesKHR routing
This is a **direct driver entrypoint** — there is no `vk_common_GetPhysicalDeviceVideoCapabilitiesKHR` in `src/vulkan/runtime/`. Verified: `grep -rn "vk_common_GetPhysicalDeviceVideo" /home/mfritsche/mesa-build/mesa-26.0.6/src/` returns no hits.
Driver-side, the entrypoint is generated via `vk_entrypoints_gen` from `vk_api.xml` (per `panvk/vulkan/meson.build:7-19`). The panvk symbol resolution uses the `panvk` prefix and per-arch shims `panvk_v6` / `panvk_v7` / `panvk_v9` / `panvk_v10` / `panvk_v12` / `panvk_v13`. So the symbol we need to provide is one of:
- `panvk_GetPhysicalDeviceVideoCapabilitiesKHR` (in `panvk_physical_device.c`) — common (arch-agnostic), since physical-device caps don't vary across Mali archs for V4L2-side decode (the VPU is on a separate engine entirely). **Recommended.**
- `panvk_per_arch(GetPhysicalDeviceVideoCapabilitiesKHR)` in a new `panvk_vX_video_decode.c` — only needed if the answer varies per arch, which it doesn't here.
Reference shape from anv (`anv_video.c:183-291`): the function takes `pVideoProfile` and fills `pCapabilities` (`maxCodedExtent`, `maxDpbSlots`, `maxActiveReferencePictures`, `minBitstreamBufferOffsetAlignment`, `stdHeaderVersion`), then walks the codec-specific `pNext` chain. For H.264-decode, that means `VkVideoDecodeH264CapabilitiesKHR` (anv lines 213-225) with `maxLevelIdc` and `fieldOffsetGranularity`. Also fills `VkVideoDecodeCapabilitiesKHR::flags = VK_VIDEO_DECODE_CAPABILITY_DPB_AND_OUTPUT_COINCIDE_BIT_KHR` (anv line 205) — which is what we'll want too, because the Hantro CAPTURE buffers ARE the DPB (no separate scratch).
The hantro driver's real limits (4K H.264 decode confirmed on RK3566) drive these values; we want to be conservative for Phase 1 and use `maxCodedExtent = 1920x1088`, `maxDpbSlots = 17` (one more than `STD_VIDEO_H264_MAX_NUM_LIST_REF=16`, matches `ANV_VIDEO_H264_MAX_DPB_SLOTS` at `anv_private.h:6581`), `maxActiveReferencePictures = 16`.
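The Phase 1 numbers follow from two small pieces of arithmetic: H.264 codes pictures in 16x16 macroblocks, so a 1080-line picture is coded as 68 macroblock rows (1088 lines), and the DPB needs one slot beyond the 16 references for the picture being decoded. A sketch, assuming a hypothetical `phase1_h264_caps` struct standing in for the relevant `VkVideoCapabilitiesKHR` fields (real code fills the Vulkan structs directly):

```c
#include <assert.h>
#include <stdint.h>

#define H264_MB_SIZE 16u
#define H264_MAX_NUM_LIST_REF 16u /* same value as STD_VIDEO_H264_MAX_NUM_LIST_REF */

/* Hypothetical subset of VkVideoCapabilitiesKHR for illustration. */
struct phase1_h264_caps {
   uint32_t max_coded_width, max_coded_height;
   uint32_t max_dpb_slots;
   uint32_t max_active_reference_pictures;
};

/* Round a pixel dimension up to a whole number of macroblocks. */
static uint32_t align_to_mb(uint32_t pixels)
{
   return (pixels + H264_MB_SIZE - 1) / H264_MB_SIZE * H264_MB_SIZE;
}

static struct phase1_h264_caps panvk_phase1_h264_caps(void)
{
   struct phase1_h264_caps caps = {
      .max_coded_width = align_to_mb(1920),  /* 120 MBs, already aligned */
      .max_coded_height = align_to_mb(1080), /* 67.5 MBs -> 68 MBs = 1088 */
      /* 16 references plus one slot for the picture being decoded */
      .max_dpb_slots = H264_MAX_NUM_LIST_REF + 1,
      .max_active_reference_pictures = H264_MAX_NUM_LIST_REF,
   };
   assert(caps.max_active_reference_pictures < caps.max_dpb_slots);
   return caps;
}
```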
### A.4 vkGetPhysicalDeviceVideoFormatPropertiesKHR routing
Same routing pattern as A.3 — direct driver entrypoint, no shared common path. Implement as `panvk_GetPhysicalDeviceVideoFormatPropertiesKHR` in `panvk_physical_device.c`.
Reference shape from anv (`anv_video.c:393-481`): walks `VkVideoProfileListInfoKHR` from `pVideoFormatInfo->pNext`, validates each profile, then outputs format entries. For H.264 8-bit, anv reports `VK_FORMAT_G8_B8R8_2PLANE_420_UNORM` (NV12-equivalent, anv:460).
This is exactly what we need. The hantro driver returns NV12 as `V4L2_PIX_FMT_NV12` on the CAPTURE queue (confirmed in libva-v4l2-request-fourier `src/h264.c`, and via the `v4l2_find_format` calls in `src/request.c:864-865` that show the format-probe pattern). The dst usage flag merge in anv at lines 410-419 (where `VIDEO_DECODE_DST` adds flags including `SAMPLED_BIT | TRANSFER_DST_BIT`) is the standard Vulkan Video pattern and applies verbatim. Set:
- `format = VK_FORMAT_G8_B8R8_2PLANE_420_UNORM` (NV12)
- `imageType = VK_IMAGE_TYPE_2D`
- `imageTiling = VK_IMAGE_TILING_OPTIMAL` — but see G.2 below about how the underlying memory comes from V4L2, so this is a "logical" tiling decision
- `imageUsageFlags = VK_IMAGE_USAGE_VIDEO_DECODE_DST_BIT_KHR | VK_IMAGE_USAGE_VIDEO_DECODE_DPB_BIT_KHR | VK_IMAGE_USAGE_SAMPLED_BIT | VK_IMAGE_USAGE_TRANSFER_SRC_BIT`
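- `imageCreateFlags = VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT | VK_IMAGE_CREATE_EXTENDED_USAGE_BIT`. `VK_IMAGE_CREATE_VIDEO_PROFILE_INDEPENDENT_BIT_KHR` is provided by `KHR_video_maintenance1`, which Phase 1 does not advertise (see A.2), so it stays out until then.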
---
## B. Queue family registration
### B.1 Current state (r4)
`src/panfrost/vulkan/panvk_device.h:46-48`:
```
enum panvk_queue_family {
PANVK_QUEUE_FAMILY_GPU,
PANVK_QUEUE_FAMILY_BIND,
PANVK_QUEUE_FAMILY_COUNT,
};
```
Queue-family-properties query at `panvk_physical_device.c:557-595`:
```
[PANVK_QUEUE_FAMILY_GPU] = {
.queueFlags = VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT | VK_QUEUE_TRANSFER_BIT,
...
},
[PANVK_QUEUE_FAMILY_BIND] = {
.queueFlags = VK_QUEUE_SPARSE_BINDING_BIT,
.queueCount = 1,
},
```
Queue dispatch in `panvk_vX_device.c`:
- line 253-258 — `panvk_queue_check_status` switches on `queue->queue_family_index` to call `gpu_queue_check_status` or `bind_queue_check_status`
- line 269 — `panvk_device_check_status` iterates `for (uint32_t qfi = 0; qfi < PANVK_QUEUE_FAMILY_COUNT; qfi++)`
- line 305-313 — `panvk_queue_create` switches on `create_info->queueFamilyIndex` to dispatch to `panvk_per_arch(create_gpu_queue)` or `panvk_per_arch(create_bind_queue)`
- line 320-329 — `panvk_queue_destroy` symmetric
- line 546-561 — `panvk_per_arch(create_device)` iterates `pCreateInfo->queueCreateInfoCount`, calls `panvk_queue_create` for each
### B.2 What to add
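Add a third enum value, `PANVK_QUEUE_FAMILY_VIDEO_DECODE`. The slot position does not matter to conforming apps: they query queue families by index, and the test client *typically* iterates looking for the `VK_QUEUE_VIDEO_DECODE_BIT_KHR` flag rather than assuming an index. Adding it at the end (just before `PANVK_QUEUE_FAMILY_COUNT`) also keeps the existing GPU and BIND indices unchanged.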
```
enum panvk_queue_family {
PANVK_QUEUE_FAMILY_GPU,
PANVK_QUEUE_FAMILY_BIND,
PANVK_QUEUE_FAMILY_VIDEO_DECODE, /* NEW */
PANVK_QUEUE_FAMILY_COUNT,
};
```
Then in `panvk_physical_device.c:557-595` extend the props table:
```
[PANVK_QUEUE_FAMILY_VIDEO_DECODE] = {
.queueFlags = VK_QUEUE_VIDEO_DECODE_BIT_KHR, /* no TRANSFER_BIT: the V4L2 submit path (B.4) cannot execute copy commands */
.queueCount = 1,
.minImageTransferGranularity = {1, 1, 1},
},
```
Anv reference for this pattern: `src/intel/vulkan/anv_physical_device.c:2556-2576` (queue-family-init writing flags onto `pdevice->queue.families[family_count++]`). Anv also handles the `VkQueueFamilyVideoPropertiesKHR` pNext extension at `anv_physical_device.c:3012-3030`:
```c
case VK_STRUCTURE_TYPE_QUEUE_FAMILY_VIDEO_PROPERTIES_KHR: {
VkQueueFamilyVideoPropertiesKHR *prop = ...;
if (queue_family->queueFlags & VK_QUEUE_VIDEO_DECODE_BIT_KHR) {
prop->videoCodecOperations = VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR | ...;
}
}
```
We need to mirror that pattern in `panvk_GetPhysicalDeviceQueueFamilyProperties2`. Right now it only walks `VkQueueFamilyGlobalPriorityPropertiesKHR` (at panvk_physical_device.c:589). Add a pNext walk for `VK_STRUCTURE_TYPE_QUEUE_FAMILY_VIDEO_PROPERTIES_KHR` and fill `videoCodecOperations = VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR`. Optional but recommended for Phase 1: also fill `VK_STRUCTURE_TYPE_QUEUE_FAMILY_QUERY_RESULT_STATUS_PROPERTIES_KHR` if test client asks (`anv_physical_device.c:3007-3011`).
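The pNext walk is the usual Vulkan out-struct pattern: visit each chained struct, fill the ones we recognise, and leave the rest alone. A self-contained sketch follows; the `stype`, `base_out` and `video_props` types are hypothetical stand-ins for `VkStructureType`, `VkBaseOutStructure` and `VkQueueFamilyVideoPropertiesKHR` so the example compiles without Vulkan headers (in Mesa the loop would normally use `vk_foreach_struct`). The two bit values match `VK_QUEUE_VIDEO_DECODE_BIT_KHR` and `VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Vulkan types (real code: <vulkan/vulkan_core.h>). */
enum stype { STYPE_GLOBAL_PRIORITY_PROPS, STYPE_VIDEO_PROPS };
struct base_out {
   enum stype sType;
   struct base_out *pNext;
};
struct video_props {
   enum stype sType;
   struct base_out *pNext;
   uint32_t videoCodecOperations;
};

#define QUEUE_VIDEO_DECODE_BIT 0x20u  /* VK_QUEUE_VIDEO_DECODE_BIT_KHR */
#define CODEC_OP_DECODE_H264_BIT 0x1u /* VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR */

/* Fill the extension structs chained onto one family's properties. */
static void fill_queue_family_pnext(uint32_t queue_flags, struct base_out *chain)
{
   for (struct base_out *ext = chain; ext != NULL; ext = ext->pNext) {
      switch (ext->sType) {
      case STYPE_VIDEO_PROPS: {
         struct video_props *p = (struct video_props *)ext;
         /* Only the video decode family reports a codec operation. */
         p->videoCodecOperations =
            (queue_flags & QUEUE_VIDEO_DECODE_BIT) ? CODEC_OP_DECODE_H264_BIT : 0;
         break;
      }
      default:
         break; /* structs handled elsewhere (e.g. global priority) are skipped */
      }
   }
}
```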
### B.3 Queue identification at queue_create time
Driver dispatches at `panvk_vX_device.c:305-313` via `panvk_queue_create`. Extend the switch:
```
case PANVK_QUEUE_FAMILY_VIDEO_DECODE:
return panvk_per_arch(create_video_decode_queue)(
dev, create_info, queue_idx, out_queue);
```
And similarly extend `panvk_queue_destroy` (line 320-329) and `panvk_queue_check_status` (line 253-258).
For `check_global_priority` at `panvk_vX_device.c:218-247`, the video decode family gets a new case. It either accepts any requested priority (returns `VK_SUCCESS`, since the V4L2 device has no priority semantics) or, like BIND, accepts only `VK_QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR`.
### B.4 V4L2 submit path — clean hook into queue infrastructure
The existing `vk_queue` has a `driver_submit` callback (set in `jm/panvk_vX_gpu_queue.c:359`: `queue->vk.driver_submit = panvk_per_arch(gpu_queue_submit);`). The submit function receives a `struct vk_queue_submit` holding the `command_buffers[]` array plus the wait and signal lists.
For our V4L2 queue, the analog is: `queue->vk.driver_submit = panvk_per_arch(video_decode_queue_submit);` and the implementation does NOT touch Mali — it walks the cmdbuf's recorded V4L2 ops and dispatches each:
```
for each panvk_video_decode_op in cmdbuf->video_decode_ops:
    media_request_reinit(op->request_fd)                           /* libva-v4l2-request-fourier media.c:51 */
    VIDIOC_S_EXT_CTRLS(video_fd, request_fd,
                       {SPS, PPS, DECODE_PARAMS, SCALING_MATRIX})
                       /* + SLICE_PARAMS only in slice-based decode mode; hantro is frame-based */
    VIDIOC_QBUF(video_fd, OUTPUT, request_fd=op->request_fd)       /* bitstream src */
    VIDIOC_QBUF(video_fd, CAPTURE, dpb_buffer_index=op->dst_slot)  /* queued outside the request */
    media_request_queue(op->request_fd)                            /* media.c:65 */
    poll(request_fd, POLLPRI, timeout)                             /* media.c:79 */
    VIDIOC_DQBUF(video_fd, OUTPUT)
    VIDIOC_DQBUF(video_fd, CAPTURE)
```
The waits/signals from `vk_queue_submit` need to map to syncobj waits before we VIDIOC_QBUF, and a syncobj signal after the POLL completes. For Phase 1 (a single submit with no other GPU work in the queue), we can ignore semaphores and just use a syncobj that signals on DQBUF completion.
`vk_queue_init` (`panvk_vX_gpu_queue.c:348`) is the entry point; we'd reuse the same pattern for `create_video_decode_queue`. Allocate a `struct panvk_video_decode_queue { struct vk_queue vk; int video_fd; int media_fd; ... }` and stash the fds.

---
## C. Session object lifecycle (`VkVideoSessionKHR`)
### C.1 What CreateVideoSession allocates
Anv reference at `src/intel/vulkan/anv_video.c:31-55`:
```c
struct anv_video_session *vid = vk_alloc2(...);
memset(vid, 0, sizeof(*vid));
VkResult result = vk_video_session_init(&device->vk, &vid->vk, pCreateInfo);
*pVideoSession = anv_video_session_to_handle(vid);
```
That's it. The heavy lifting is in `vk_video_session_init` (`src/vulkan/runtime/vk_video.c:33-128`), which fills:
- `vid->op` (`VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR` etc.)
- `vid->max_coded`, `picture_format`, `ref_format`, `max_dpb_slots`, `max_active_ref_pics`
- `vid->h264.profile_idc` from the `VkVideoDecodeH264ProfileInfoKHR` pNext (lines 51-57)
The driver-specific anv_video_session struct (`anv_private.h:6688-6727`) adds backend-specific per-stream state: `cdf_initialized` (for AV1), `vid_mem[ANV_VID_MEM_AV1_MAX]` (private memory bindings for codec scratch).
### C.2 Memory binding via vkBindVideoSessionMemoryKHR
Anv reference at `anv_video.c:914-998` for `GetVideoSessionMemoryRequirements` and `anv_video.c:972-1000` for `BindVideoSessionMemory`. The mem_idx enums for H.264 (`anv_private.h:6588-6593`):
```c
enum anv_vid_mem_h264_types {
ANV_VID_MEM_H264_INTRA_ROW_STORE,
ANV_VID_MEM_H264_DEBLOCK_FILTER_ROW_STORE,
ANV_VID_MEM_H264_BSD_MPC_ROW_SCRATCH,
ANV_VID_MEM_H264_MPR_ROW_SCRATCH,
ANV_VID_MEM_H264_MAX,
};
```
These are scratch buffers the Intel HCP/MFX engines need. The sizes are computed in `get_h264_video_mem_size` (`anv_video.c:483-501`) as multiples of width-in-MBs.
`BindVideoSessionMemory` (anv lines 972-998) is just bookkeeping: it copies each `VkBindVideoSessionMemoryInfoKHR` into `vid->vid_mem[bind_index]` (struct `anv_vid_mem { anv_device_memory *mem; offset; size; }` at `anv_private.h:6572-6576`).
### C.3 For our V4L2 backend
**Massive simplification opportunity**: the Hantro VPU does NOT require driver-allocated scratch buffers — all scratch is internal to the VPU and managed by the kernel driver. So `GetVideoSessionMemoryRequirements` can return **zero entries** (`*pVideoSessionMemoryRequirementsCount = 0`), and `BindVideoSessionMemory` becomes a no-op (just `return VK_SUCCESS;`).
What CreateVideoSession DOES need to allocate, V4L2-side:
1. **Open `/dev/video1` and `/dev/media0`** if not already held by the device (see J.1 for ownership decision).
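2. **VIDIOC_S_FMT** on the OUTPUT queue: `V4L2_PIX_FMT_H264_SLICE` (the stateless H.264 format; hantro decodes in frame-based mode, one whole picture per request), sized from `vid->max_coded`. The profile from `vid->h264.profile_idc` is not part of the format; it travels in the SPS control. This call must precede the CAPTURE format, since a stateless decoder derives its CAPTURE format from the coded format. See libva-v4l2-request-fourier `src/h264.c:699-738` for the control-set pattern.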
3. **VIDIOC_S_FMT** on the CAPTURE queue: `V4L2_PIX_FMT_NV12`, dimensions from `vid->max_coded`.
4. **Allocate request_fd pool**: pre-allocate N request fds (one per DPB slot + outstanding submits) via `MEDIA_IOC_REQUEST_ALLOC` ioctls (media.c:41).
5. **VIDIOC_REQBUFS** on the OUTPUT and CAPTURE queues to set up the buffer counts, then **VIDIOC_STREAMON** on both queues so that queued requests can actually run.
So `panvk_video_session` struct shape:
```c
struct panvk_video_session {
struct vk_video_session vk; /* shared base */
int video_fd; /* may share with physical_device */
int media_fd; /* may share with physical_device */
/* per-session V4L2 state */
uint32_t bitstream_buffer_count;
uint32_t capture_buffer_count;
struct {
int request_fd;
bool in_use;
uint32_t dpb_slot;
} request_pool[MAX_OUTSTANDING_DECODES];
};
```
### C.4 Anv session creation shape — full reference
```c
VkResult anv_CreateVideoSessionKHR(VkDevice _device,
const VkVideoSessionCreateInfoKHR *pCreateInfo,
const VkAllocationCallbacks *pAllocator,
VkVideoSessionKHR *pVideoSession)
/* anv_video.c:31-55 */
{
ANV_FROM_HANDLE(anv_device, device, _device);
struct anv_video_session *vid = vk_alloc2(..., sizeof(*vid), 8, OBJECT);
if (!vid) return vk_error(device, VK_ERROR_OUT_OF_HOST_MEMORY);
memset(vid, 0, sizeof(*vid));
VkResult result = vk_video_session_init(&device->vk, &vid->vk, pCreateInfo);
if (result != VK_SUCCESS) { vk_free2(..., vid); return result; }
*pVideoSession = anv_video_session_to_handle(vid);
return VK_SUCCESS;
}
```
For us, the body grows by ~15-30 lines for V4L2 setup (open fds, S_FMT, REQBUFS, request_fd pool init) and adds error-rollback paths.

---
## D. Parameters object lifecycle (`VkVideoSessionParametersKHR`)
### D.1 The shared layer does almost everything
`src/vulkan/runtime/vk_video.c:845-885` defines:
- `vk_common_CreateVideoSessionParametersKHR` (line 846-862)
- `vk_common_UpdateVideoSessionParametersKHR` (line 865-872)
- `vk_common_DestroyVideoSessionParametersKHR` (line 875-885)
These delegate to:
- `vk_video_session_parameters_create` (helper at `vk_video.c:480` — alloc + dispatch by codec op)
- `vk_video_session_parameters_update` (line 793-844 — switches on `params->op` and calls `update_h264_dec_session_parameters` at line 692 which does the actual SPS/PPS array merge with seq_parameter_set_id collision detection per the spec)
- `vk_video_session_parameters_destroy`
**Key question**: do panvk-bifrost entrypoints get auto-wired to the `vk_common_*` versions, or does the driver need to opt in?
Mesa's entrypoint machinery (`vk_entrypoints_gen.py`) wires shared-helper entrypoints **by default** unless the driver provides its own symbol: driver entrypoints are weak symbols, and the `vk_common_*` versions only fill dispatch-table slots the driver leaves empty. So if panvk does NOT define `panvk_CreateVideoSessionParametersKHR`, the dispatch table falls through to `vk_common_CreateVideoSessionParametersKHR`. Confirmed by anv comparison: anv defines neither `anv_CreateVideoSessionParametersKHR` nor `anv_UpdateVideoSessionParametersKHR`; both come from `vk_common_*`.
radv DOES override (`radv_video.c:630-647`) but only to call `radv_video_patch_session_parameters` for an AMD-specific fixup. For Phase 1 we don't need that.
**Decision: rely entirely on vk_common.** Zero driver code for parameters object lifecycle.
### D.2 Parameters → V4L2 control conversion happens at CmdDecodeVideo time, not at parameter creation
The shared parameters struct (`vk_video.h:127-195`) for H.264-decode stores SPS array of `struct vk_video_h264_sps` (which embeds `StdVideoH264SequenceParameterSet base`) and PPS array of `struct vk_video_h264_pps` (which embeds `StdVideoH264PictureParameterSet base`). The lookup helpers `vk_video_find_h264_dec_std_sps(params, id)` and `vk_video_find_h264_dec_std_pps(params, id)` (`vk_video.c:1186-1198`) are what we call at decode time to get the SPS/PPS for the current frame.
The V4L2-side bridge from `StdVideoH264SequenceParameterSet` to `struct v4l2_ctrl_h264_sps` is the same conversion fourier does. See `libva-v4l2-request-fourier/src/h264.c:360` for `h264_va_picture_to_v4l2`, which marshals to `struct v4l2_ctrl_h264_decode_params`, `v4l2_ctrl_h264_pps` and `v4l2_ctrl_h264_sps`, except that the source format on our side is `StdVideoH264*` instead of `VAPictureParameterBufferH264`. The field-name mapping is essentially identical because both `VAPictureParameterBufferH264` and `StdVideoH264SequenceParameterSet` ultimately derive from the H.264 spec's syntax element names.
**We will write `panvk_h264_std_sps_to_v4l2(const StdVideoH264SequenceParameterSet *std, struct v4l2_ctrl_h264_sps *out)` etc.** as a new helper file (~150 lines per codec). This is the bridge function that has no Mesa precedent — it's our novel contribution.
### D.3 Hooking the parameters cache to ext-control structs at decode time
At `CmdDecodeVideoKHR` recording time, we retrieve the relevant `StdVideoH264SequenceParameterSet *` and `StdVideoH264PictureParameterSet *` via `vk_video_get_h264_parameters` (`vk_video.h:419-425`). The signature:
```c
void vk_video_get_h264_parameters(const struct vk_video_session *session,
const struct vk_video_session_parameters *params,
const VkVideoDecodeInfoKHR *decode_info,
const VkVideoDecodeH264PictureInfoKHR *h264_pic_info,
const StdVideoH264SequenceParameterSet **sps_p,
const StdVideoH264PictureParameterSet **pps_p);
```
Anv uses this at `genX_cmd_video.c:904` in `anv_h264_decode_video`. We do the same.

---
## E. vkCmdDecodeVideoKHR command recording
### E.1 What anv emits at record time vs submit time
**Crucial finding**: anv does ALL work at record time. By the time the cmdbuf goes to the queue, the command stream is fully baked. Look at `anv_h264_decode_video` (`genX_cmd_video.c:892-1300+`): every `anv_batch_emit(&cmd_buffer->batch, GENX(MFX_PIPE_MODE_SELECT), sel)` etc. is a register/packet write into the cmd_buffer's batch buffer. Submit time just kicks the batch.
The Begin/End wrappers are thin:
- `CmdBeginVideoCodingKHR` (`genX_cmd_video.c:31-50`): stashes `cmd_buffer->video.vid = vid; cmd_buffer->video.params = params;` into command-buffer-local state. **That's it** for H.264 (AV1 adds CDF table init).
- `CmdControlVideoCodingKHR` (`genX_cmd_video.c:52-74`): if RESET flag, emit `MI_FLUSH_DW` with `VideoPipelineCacheInvalidate = 1`.
- `CmdEndVideoCodingKHR` (`genX_cmd_video.c:76-83`): clears `cmd_buffer->video.vid = NULL; cmd_buffer->video.params = NULL;`.
The `cmd_buffer->video` shadow state (`anv_private.h:4935-4938`):
```c
struct {
struct anv_video_session *vid;
struct vk_video_session_parameters *params;
} video;
```
### E.2 For our V4L2 backend — "deferred record"
The V4L2 ioctls cannot meaningfully happen at record time, because:
1. The bitstream buffer (frame_info->srcBuffer) is a `VkBuffer` we don't necessarily know the contents of yet (might be filled by a prior submitted cmdbuf or by host writes between record and submit).
2. Request_fd allocation and S_EXT_CTRLS need to be sequential per submit (cannot pre-bind a request_fd to a recorded cmdbuf and reuse it).
**Pattern: per-cmdbuf list of "video decode ops" recorded during CmdDecodeVideoKHR.** The op captures everything we need to replay at submit time:
```c
struct panvk_video_decode_op {
/* From CmdBegin */
struct panvk_video_session *session;
struct vk_video_session_parameters *params;
/* From CmdDecode */
VkBuffer src_buffer; /* bitstream source */
VkDeviceSize src_offset;
VkDeviceSize src_size;
/* DPB target */
struct panvk_image_view *dst_iv;
uint32_t dst_dpb_slot;
/* Resolved SPS/PPS, copied by value (a cheap shallow copy) */
StdVideoH264SequenceParameterSet sps;
StdVideoH264PictureParameterSet pps;
/* Per-picture H.264 std info (frame_num, idr_pic_id, PicOrderCnt, flags), picked apart at submit time */
StdVideoDecodeH264PictureInfo std_pic_info;
/* Reference slot info — small array, copy by value */
uint32_t reference_slot_count;
struct panvk_video_ref_slot reference_slots[16];
};
struct panvk_cmd_buffer {
...
struct util_dynarray video_decode_ops; /* of struct panvk_video_decode_op */
};
```
Then submit-time (per B.4) walks the dynarray and does the ioctl dance per op.
Comparable record-time op-list pattern exists today for sparse binds (`panvk_sparse.c`). Anv stores per-cmdbuf state in `cmd_buffer->video` but doesn't queue up ops because it emits direct register packets. We're doing what anv would do if anv ran on a separate kernel device.
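The per-cmdbuf op list needs nothing more than an append-only growable array whose elements are copies, because the app-owned structs passed to `vkCmdDecodeVideoKHR` are only guaranteed valid for the duration of the call. In Mesa this is what `util_dynarray_append()` and `util_dynarray_foreach()` provide; the self-contained stand-in below (hypothetical names, op trimmed to three fields) shows the same behaviour, including keeping the allocation across a command-buffer reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed op; the real one carries the full struct above. */
struct video_decode_op {
   uint64_t src_offset, src_size;
   uint32_t dst_dpb_slot;
};

/* Minimal growable array standing in for Mesa's util_dynarray. */
struct op_list {
   struct video_decode_op *ops;
   size_t count, capacity;
};

/* Record time: copy the op in, doubling capacity when full. */
static int op_list_append(struct op_list *l, const struct video_decode_op *op)
{
   assert(l != NULL && op != NULL);
   if (l->count == l->capacity) {
      size_t cap = l->capacity ? l->capacity * 2 : 4;
      void *p = realloc(l->ops, cap * sizeof(*l->ops));
      if (p == NULL)
         return -1;
      l->ops = p;
      l->capacity = cap;
   }
   l->ops[l->count++] = *op;
   return 0;
}

/* vkResetCommandBuffer: drop the recorded ops but keep the allocation. */
static void op_list_reset(struct op_list *l)
{
   l->count = 0;
}

/* Command buffer destruction: release everything. */
static void op_list_finish(struct op_list *l)
{
   free(l->ops);
   memset(l, 0, sizeof(*l));
}
```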
### E.3 CmdBegin/Control/End for our backend
- `panvk_per_arch(CmdBeginVideoCodingKHR)`: record `cmd_buffer->video_decode_session = vid; cmd_buffer->video_decode_params = params;`. Optionally validate that the reference slot layout matches the DPB slot count we set up at session init.
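- `panvk_per_arch(CmdControlVideoCodingKHR)` for `VK_VIDEO_CODING_CONTROL_RESET_BIT_KHR`: this needs to translate to `MEDIA_REQUEST_IOC_REINIT` on all pooled request_fds, OR just set a session-wide flag "next decode needs fresh request setup". In Phase 1 the request side can be a no-op if we always reinit per submit anyway. The per-session DPB slot map (F.3) still has to be marked inactive when the reset executes, because a Vulkan coding reset deactivates all DPB slots.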
- `panvk_per_arch(CmdEndVideoCodingKHR)`: clear shadow state. No emission needed.
---
## F. DPB management
### F.1 Vulkan-side DPB model
Per-frame `vkCmdDecodeVideoKHR` receives:
- `frame_info->dstPictureResource`: `VkVideoPictureResourceInfoKHR { codedOffset, codedExtent, baseArrayLayer, imageViewBinding }`, the image view that will receive the decoded output.
- `frame_info->pSetupReferenceSlot`: `VkVideoReferenceSlotInfoKHR { slotIndex, pPictureResource }`, which says "this decoded frame becomes DPB slot N".
- `frame_info->pReferenceSlots[]`: references to read from. Each carries `slotIndex` + `pPictureResource`.
For H.264, additionally:
- `pNext` chain `VkVideoDecodeH264PictureInfoKHR { pStdPictureInfo, sliceCount, pSliceOffsets }`
- DPB slot pNext per reference: `VkVideoDecodeH264DpbSlotInfoKHR { pStdReferenceInfo }` — contains POC/short-term/long-term flags.
Anv's reference assembly logic at `genX_cmd_video.c:992-1004`:
```c
for (unsigned i = 0; i < frame_info->referenceSlotCount; i++) {
const struct anv_image_view *ref_iv = anv_image_view_from_handle(
frame_info->pReferenceSlots[i].pPictureResource->imageViewBinding);
int idx = frame_info->pReferenceSlots[i].slotIndex;
...
dpb_slots[idx] = i;
buf.ReferencePictureAddress[i] = anv_image_dpb_address(ref_iv, baseArrayLayer);
}
```
### F.2 V4L2 DPB model
`v4l2_ctrl_h264_decode_params::dpb[16]` is an array of `struct v4l2_h264_dpb_entry { reference_ts, pic_num, frame_num, fields, flags, top_field_order_cnt, bottom_field_order_cnt }`. Each entry's `reference_ts` is the timestamp used at VIDIOC_QBUF of the OUTPUT (bitstream) plane when that reference was decoded — V4L2 uses this as the "buffer identity" key.
So the mapping rule from Vulkan-side `VkVideoReferenceSlotInfoKHR[]` to V4L2-side `dpb[16]` is:
| Vulkan field | V4L2 dpb field | How to source |
|---|---|---|
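| `i` (position in `pReferenceSlots[]`) | array index in `dpb[]` | direct (assert `i < 16`); `slotIndex` itself can reach 16 with `maxDpbSlots = 17`, so it keys the `reference_ts` lookup instead of indexing `dpb[]` |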
| `pReferenceSlots[i].pNext->pStdReferenceInfo->PicOrderCnt[0]` | `top_field_order_cnt` | direct |
| `pReferenceSlots[i].pNext->pStdReferenceInfo->PicOrderCnt[1]` | `bottom_field_order_cnt` | direct |
| `pReferenceSlots[i].pNext->pStdReferenceInfo->FrameNum` | `frame_num` | direct |
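| `pReferenceSlots[i].pNext->pStdReferenceInfo->flags.used_for_long_term_reference` | `flags` | set `V4L2_H264_DPB_ENTRY_FLAG_LONG_TERM` from it; every live reference also gets `V4L2_H264_DPB_ENTRY_FLAG_VALID` and `V4L2_H264_DPB_ENTRY_FLAG_ACTIVE` |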
| (the decoded output VkImage backing the ref slot) | `reference_ts` | **lookup**: we maintain a `slotIndex → reference_ts` map per-session, populated each time we decode into that slot. See libva-fourier `src/h264.c:140-218` for `dpb_insert`/`dpb_update`/`dpb_find_entry`. Our case is simpler: slotIndex is provided by Vulkan, we just need to track "what ts did I QBUF when I last decoded into slotIndex N". |
The fourier `src/h264.c:238-353` `h264_fill_dpb` function is the closest analog — it constructs `struct v4l2_h264_dpb_entry[]` from libva-side state. We do the analog but feed it from `pReferenceSlots[]`.
### F.3 Bookkeeping struct in panvk_video_session
```c
struct panvk_video_session {
...
struct {
uint64_t reference_ts; /* timestamp last used when decoding into this slot */
struct panvk_image *image; /* the VkImage backing this slot's DPB */
uint32_t array_layer;
bool active;
} dpb[17]; /* indexed by slotIndex, sized to maxDpbSlots = 17 (A.3) */
};
```
Update it at decode-completion time (after `VIDIOC_DQBUF`) for the setup reference slot.

---
## G. Memory + dmabuf interop
### G.1 The challenge
App creates a `VkImage` with `VK_IMAGE_USAGE_VIDEO_DECODE_DST_BIT_KHR | VK_IMAGE_USAGE_VIDEO_DECODE_DPB_BIT_KHR | VK_IMAGE_USAGE_SAMPLED_BIT`. Memory is bound via normal `vkBindImageMemory`. Then the decoded frame data needs to physically end up in that memory backing.
Hantro's CAPTURE queue allocates its own buffers via `VIDIOC_REQBUFS(memory=V4L2_MEMORY_MMAP)` or accepts dma_buf imports via `VIDIOC_REQBUFS(memory=V4L2_MEMORY_DMABUF)`. The clean path: **the app's VkImage memory backing IS a dma_buf**, exported from panvk via `vkGetMemoryFdKHR`, and we `VIDIOC_QBUF` it with the dma_buf fd as the CAPTURE plane.
But Vulkan apps don't usually export memory back to themselves. They expect `vkCreateImage(usage=VIDEO_DECODE_DST)` to "just work". So **we** drive the dma_buf flow internally.
### G.2 Internal dma_buf flow (proposed)
Two strategies:
**Strategy A: Driver-allocated CAPTURE buffers, app-imported into VkImage**
- VIDIOC_REQBUFS(MMAP) at session create.
- VIDIOC_EXPBUF to get a dma_buf fd per allocated buffer.
- Import the dma_buf back into pan_kmod as a VkDeviceMemory equivalent.
- VkBindImageMemory to that DeviceMemory.
**Strategy B: App-allocated VkImage, V4L2_MEMORY_DMABUF queue**
- App calls vkCreateImage with VkExternalMemoryImageCreateInfo handleTypes=DMA_BUF.
- panvk allocates the BO via pan_kmod and exports a dma_buf fd via `pan_kmod_bo_export` (`panvk_device_memory.c:387-404`).
- VIDIOC_QBUF(memory=V4L2_MEMORY_DMABUF, fd=our_dmabuf_fd) at submit time.
**Strategy B is what fourier does for surface buffers, and it's the cleaner fit** — the app gets a real VkImage with real VkDeviceMemory, we never have to fake the import direction. Phase 1 may want to start with Strategy A for simplicity since vk-video-samples likely doesn't pass `VkExternalMemoryImageCreateInfo` flags, but Strategy B is the long-term right answer.
### G.3 Anv's DPB image allocation
Anv treats DPB images as plain VkImages — no special allocation. The HW reads them directly via `anv_image_dpb_address(iv, baseArrayLayer)` at `genX_cmd_video.c:933`. Memory layout is whatever ISL gives them (tile-Y or planar-420). For our backend, that doesn't transfer — the Hantro VPU expects NV12 in a linear layout (or a vendor-specific tiled layout that we'd need to expose; for Phase 1 we mandate linear).
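For the linear NV12 mandate, the plane layout is simple enough to state exactly: a full-resolution 8-bit Y plane followed by an interleaved CbCr plane with half as many rows and the same bytes per row. A sketch with a hypothetical `stride_align` parameter; in practice the VPU's real `bytesperline` and `sizeimage` should be read back from `VIDIOC_G_FMT` on the CAPTURE queue, and the VkImage plane offsets reported by `vkGetImageSubresourceLayout` must match them.

```c
#include <assert.h>
#include <stdint.h>

/* Linear NV12: full-res Y plane, then interleaved CbCr at half vertical res. */
struct nv12_layout {
   uint32_t stride;        /* bytes per row, shared by both planes */
   uint64_t luma_size;     /* plane 0 (G8 in VK_FORMAT_G8_B8R8_2PLANE_420_UNORM) */
   uint64_t chroma_offset; /* plane 1 (B8R8) starts right after plane 0 */
   uint64_t chroma_size;
   uint64_t total_size;
};

static struct nv12_layout nv12_linear_layout(uint32_t width, uint32_t height,
                                             uint32_t stride_align)
{
   assert(width % 2 == 0 && height % 2 == 0);
   assert(stride_align != 0 && (stride_align & (stride_align - 1)) == 0);
   struct nv12_layout l;
   l.stride = (width + stride_align - 1) & ~(stride_align - 1);
   l.luma_size = (uint64_t)l.stride * height;
   l.chroma_offset = l.luma_size;
   /* width/2 CbCr pairs x 2 bytes = width bytes per row, height/2 rows */
   l.chroma_size = (uint64_t)l.stride * (height / 2);
   l.total_size = l.luma_size + l.chroma_size;
   return l;
}
```

For the Phase 1 maximum of 1920x1088 this gives a 2,088,960-byte Y plane, a 1,044,480-byte CbCr plane, and 3,133,440 bytes in total.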
### G.4 panvk dmabuf entry points (already present)
- `panvk_AllocateMemory` handles `VkImportMemoryFdInfoKHR` at `panvk_device_memory.c:121-135` — calls `pan_kmod_bo_import`.
- `panvk_GetMemoryFdKHR` at `panvk_device_memory.c:387-404` exports.
- `EXT_external_memory_dma_buf` already advertised at `panvk_vX_physical_device.c:146`.
So the building blocks exist. The new code is the **session-internal V4L2 buffer pool** that converts between V4L2_MEMORY_MMAP/DMABUF and pan_kmod BOs.
---
## H. vk_video runtime helper coverage matrix
What we inherit vs what we write, cross-referenced from sections A through G:
| Question | Inherit from vk_video shared layer? | Driver writes? |
|---|---|---|
| A. KHR_video_* extension booleans | No | YES — `panvk_vX_physical_device.c` table |
| A. videoMaintenance1 feature struct | No | (Phase 1: skip; future: yes if advertised) |
| A. GetPhysicalDeviceVideoCapabilitiesKHR | **NO** — direct entrypoint | YES — new code in `panvk_physical_device.c` |
| A. GetPhysicalDeviceVideoFormatPropertiesKHR | **NO** — direct entrypoint | YES — new code in `panvk_physical_device.c` |
| B. Queue family enum + props | No | YES — `panvk_device.h` + `panvk_physical_device.c` |
| B. Queue-family-video pNext walk | No | YES — extend `panvk_GetPhysicalDeviceQueueFamilyProperties2` |
| B. Queue create/destroy dispatch | No | YES — extend `panvk_vX_device.c:305-329` |
| B. Queue submit | No | YES — new `panvk_vX_video_decode_queue.c` |
| C. CreateVideoSessionKHR — handle + base init | YES partial: `vk_video_session_init` does the codec-op parsing | YES — driver wraps, adds V4L2 fd open + S_FMT + REQBUFS |
| C. DestroyVideoSessionKHR — base finish | YES partial: `vk_video_session_finish` | YES — driver wraps, adds V4L2 teardown |
| C. GetVideoSessionMemoryRequirementsKHR | No | YES (trivial: zero entries) |
| C. BindVideoSessionMemoryKHR | No | YES (trivial: no-op) |
| D. CreateVideoSessionParametersKHR | **YES — `vk_common_CreateVideoSessionParametersKHR` (vk_video.c:846)** | NO driver code needed |
| D. UpdateVideoSessionParametersKHR | **YES — `vk_common_UpdateVideoSessionParametersKHR` (vk_video.c:865)** | NO driver code needed |
| D. DestroyVideoSessionParametersKHR | **YES — `vk_common_DestroyVideoSessionParametersKHR` (vk_video.c:875)** | NO driver code needed |
| D. H.264 SPS/PPS storage | **YES — `struct vk_video_h264_{sps,pps}` (vk_video.h:32-43)** | NO |
| D. H.264 SPS/PPS lookup | **YES — `vk_video_find_h264_dec_std_{sps,pps}` (vk_video.c:1186)** | NO |
| D. H.264 params merge with dedup | **YES — internal to `vk_video_session_parameters_update`** | NO |
| D. Std → V4L2 control marshalling | No precedent in Mesa | YES — NEW helper file (~300 lines for H.264) |
| E. CmdBeginVideoCodingKHR | No | YES — trivial state-stash |
| E. CmdControlVideoCodingKHR | No | YES — trivial RESET handling |
| E. CmdEndVideoCodingKHR | No | YES — trivial state-clear |
| E. CmdDecodeVideoKHR | No | YES — record op into cmdbuf dynarray |
| E. `vk_video_get_h264_parameters` resolver | **YES (vk_video.h:419)** | NO |
| F. DPB slot ↔ reference_ts map | No | YES — `panvk_video_session.dpb[16]` |
| F. H.264 reference list construction | Partially: `vk_fill_video_h264_*` helpers if present | YES — but mostly direct field copies |
| G. dmabuf BO import/export | YES — existing panvk path (`panvk_device_memory.c:121,387`) | NO new code |
| G. V4L2 buffer ↔ pan_kmod_bo bridging | No precedent | YES — NEW helper file |
| G. Image creation for VIDEO_DECODE_DST | YES: existing `panvk_image_init` (panvk_image.c:562) handles all usage flags through panfrost's own image-layout code (not ISL, which is Intel-specific) | Possibly yes, to force a linear layout on decode targets (see G.3) |
**Net leverage**: ~3000 lines of vk_video runtime helpers we inherit for free, primarily the H.264 SPS/PPS bitstream parsing + parameters object lifecycle + std/find helpers. Our new-code estimate is roughly 800-1500 lines split across ~4 new files (see I).
---
## I. panvk-specific integration points (concrete edits)
### I.1 Existing files to modify
**`src/panfrost/vulkan/panvk_vX_physical_device.c`**:
- Lines ~123-124 (between `KHR_vertex_attribute_divisor` and `KHR_vulkan_memory_model`): add `.KHR_video_queue = true,`, `.KHR_video_decode_queue = true,`, `.KHR_video_decode_h264 = true,` (gated on hantro probe).
- Optional Phase 2+: at line 540, flip `unifiedImageLayoutsVideo` based on session config.
**`src/panfrost/vulkan/panvk_physical_device.c`**:
- Line ~565: extend the `qfamily_props[]` array — add a third entry for `PANVK_QUEUE_FAMILY_VIDEO_DECODE` with `queueFlags = VK_QUEUE_VIDEO_DECODE_BIT_KHR | VK_QUEUE_TRANSFER_BIT`.
- Around line 589 inside the `vk_outarray_append_typed` loop: add a pNext walk for `VK_STRUCTURE_TYPE_QUEUE_FAMILY_VIDEO_PROPERTIES_KHR` that sets `videoCodecOperations = VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR`.
- ADD new entrypoints `panvk_GetPhysicalDeviceVideoCapabilitiesKHR` and `panvk_GetPhysicalDeviceVideoFormatPropertiesKHR` at end of file (~70 lines + ~50 lines).
**`src/panfrost/vulkan/panvk_device.h`**:
- Lines 46-48: add `PANVK_QUEUE_FAMILY_VIDEO_DECODE,` to the enum.
**`src/panfrost/vulkan/panvk_vX_device.c`**:
- Lines 218-247 (`check_global_priority`): add `case PANVK_QUEUE_FAMILY_VIDEO_DECODE: return VK_SUCCESS;`.
- Lines 253-258 (`panvk_queue_check_status`): add case for the new family calling `panvk_per_arch(video_decode_queue_check_status)`.
- Lines 305-313 (`panvk_queue_create`): add case calling `panvk_per_arch(create_video_decode_queue)`.
- Lines 320-329 (`panvk_queue_destroy`): symmetric.
**`src/panfrost/vulkan/meson.build`**:
- Add the new files to either `libpanvk_files` (arch-agnostic) or `common_per_arch_files` (arch-templated). The session, command-record, and V4L2 helper code is arch-agnostic, so in Phase 1 those files can go in `libpanvk_files` without any per-arch dispatch. The exception is `panvk_vX_video_decode_queue.c`: it defines the `panvk_per_arch()` symbols that `panvk_vX_device.c` calls, so it belongs in `common_per_arch_files`.
### I.2 New files to add
**`src/panfrost/vulkan/panvk_video_decode.c`** (~400 lines):
- `panvk_CreateVideoSessionKHR`
- `panvk_DestroyVideoSessionKHR`
- `panvk_GetVideoSessionMemoryRequirementsKHR` (returns count=0)
- `panvk_BindVideoSessionMemoryKHR` (no-op)
- `panvk_CmdBeginVideoCodingKHR`
- `panvk_CmdControlVideoCodingKHR`
- `panvk_CmdEndVideoCodingKHR`
- `panvk_CmdDecodeVideoKHR` (record op into `cmd_buffer->video_decode_ops`)
**`src/panfrost/vulkan/panvk_video_decode.h`**:
- `struct panvk_video_session`
- `struct panvk_video_decode_op`
- `struct panvk_video_decode_queue`
**`src/panfrost/vulkan/panvk_v4l2.c`** (~500 lines):
- `panvk_v4l2_probe_hantro()` — finds /dev/video1 and /dev/media0 (mirrors libva-v4l2-request-fourier `src/request.c:143-308` `find_decoder_video_node_via_topology`).
- `panvk_v4l2_session_init()` — S_FMT on OUTPUT/CAPTURE, REQBUFS, request_fd pool alloc.
- `panvk_v4l2_h264_std_to_ctrl_sps()`: `StdVideoH264SequenceParameterSet *` → `struct v4l2_ctrl_h264_sps`.
- `panvk_v4l2_h264_std_to_ctrl_pps()`: `StdVideoH264PictureParameterSet *` → `struct v4l2_ctrl_h264_pps`.
- `panvk_v4l2_h264_fill_decode_params()` — build `struct v4l2_ctrl_h264_decode_params` from VkVideoDecodeInfoKHR + slot map.
- `panvk_v4l2_submit_op()` — the request_fd / S_EXT_CTRLS / QBUF / poll / DQBUF dance for one op.
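Most of `panvk_v4l2_h264_std_to_ctrl_sps()` is field copying, but the level field is a trap. Vulkan's `StdVideoH264LevelIdc` is an enum (`STD_VIDEO_H264_LEVEL_IDC_1_0` = 0 through `STD_VIDEO_H264_LEVEL_IDC_6_2` = 18), while V4L2's `level_idc` carries the raw bitstream value (40 for level 4.0), so a straight copy is wrong. The sketch below uses hypothetical subset structs instead of the real headers so it stands alone. The two flag values mirror `V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY` and `V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE`; real code should include `<linux/v4l2-controls.h>` and use those macros.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of StdVideoH264SequenceParameterSet. */
struct std_sps_subset {
    uint32_t level_enum;                     /* StdVideoH264LevelIdc: 0 = 1.0 ... 18 = 6.2 */
    uint32_t pic_width_in_mbs_minus1;
    uint32_t pic_height_in_map_units_minus1;
    unsigned frame_mbs_only_flag : 1;
    unsigned direct_8x8_inference_flag : 1;
};

/* Hypothetical subset of struct v4l2_ctrl_h264_sps. */
struct ctrl_sps_subset {
    uint8_t  level_idc;                      /* raw syntax value, e.g. 40 = level 4.0 */
    uint16_t pic_width_in_mbs_minus1;
    uint16_t pic_height_in_map_units_minus1;
    uint32_t flags;
};

#define SPS_FLAG_FRAME_MBS_ONLY       0x10
#define SPS_FLAG_DIRECT_8X8_INFERENCE 0x40

/* Raw level_idc indexed by the Vulkan enum value. */
static const uint8_t h264_level_idc_raw[] = {
    10, 11, 12, 13,   /* 1.0 1.1 1.2 1.3 */
    20, 21, 22,       /* 2.0 2.1 2.2 */
    30, 31, 32,       /* 3.x */
    40, 41, 42,       /* 4.x */
    50, 51, 52,       /* 5.x */
    60, 61, 62,       /* 6.x */
};

/* Returns 0 for an unknown/invalid enum; the caller should reject it. */
static uint8_t level_enum_to_raw(uint32_t level_enum)
{
    if (level_enum >= sizeof(h264_level_idc_raw))
        return 0;
    return h264_level_idc_raw[level_enum];
}

static void std_sps_to_ctrl_sps(const struct std_sps_subset *in,
                                struct ctrl_sps_subset *out)
{
    out->level_idc = level_enum_to_raw(in->level_enum);
    out->pic_width_in_mbs_minus1 = (uint16_t)in->pic_width_in_mbs_minus1;
    out->pic_height_in_map_units_minus1 =
        (uint16_t)in->pic_height_in_map_units_minus1;

    /* Vulkan keeps SPS flags as bitfields; V4L2 packs them into one word. */
    out->flags = 0;
    if (in->frame_mbs_only_flag)
        out->flags |= SPS_FLAG_FRAME_MBS_ONLY;
    if (in->direct_8x8_inference_flag)
        out->flags |= SPS_FLAG_DIRECT_8X8_INFERENCE;
}
```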
**`src/panfrost/vulkan/panvk_vX_video_decode_queue.c`** (~150 lines, per_arch):
- `panvk_per_arch(create_video_decode_queue)`
- `panvk_per_arch(destroy_video_decode_queue)`
- `panvk_per_arch(video_decode_queue_submit)` — walks cmdbuf ops, calls `panvk_v4l2_submit_op` per op.
- `panvk_per_arch(video_decode_queue_check_status)`
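`panvk_per_arch(video_decode_queue_submit)` is mostly control flow: visit every recorded op in submission order and hand it to the V4L2 submit helper. The shapes below are hypothetical simplifications. `struct decode_op` and `struct cmdbuf` stand in for the real op record and cmdbuf (which would use a `util_dynarray`), and the callback plays the role of `panvk_v4l2_submit_op`. Stopping at the first failed op is one possible error policy, not a decided one.

```c
#include <assert.h>
#include <stddef.h>

struct decode_op { int dst_slot; };                      /* hypothetical op record */
struct cmdbuf { const struct decode_op *ops; size_t op_count; };

typedef int (*submit_op_fn)(void *ctx, const struct decode_op *op);

/* Walk all cmdbufs in order; return the first non-zero result so the caller
 * can fail the submit instead of signalling the semaphores. */
static int video_decode_queue_submit(void *ctx, const struct cmdbuf *cmdbufs,
                                     size_t cmdbuf_count, submit_op_fn submit_op)
{
    for (size_t i = 0; i < cmdbuf_count; i++) {
        for (size_t j = 0; j < cmdbufs[i].op_count; j++) {
            int ret = submit_op(ctx, &cmdbufs[i].ops[j]);
            if (ret != 0)
                return ret;
        }
    }
    return 0;
}

/* Illustrative callback: counts ops, fails on a negative slot. */
static int counting_submit(void *ctx, const struct decode_op *op)
{
    int *count = ctx;

    if (op->dst_slot < 0)
        return -1;
    (*count)++;
    return 0;
}
```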
### I.3 Entrypoint generation
Recall from `meson.build:7-19` that entrypoints are auto-wired with `--prefix panvk` and per-arch prefixes. The names above (`panvk_CmdDecodeVideoKHR` etc.) match the auto-resolution rules — no changes needed in `vk_entrypoints_gen` invocation.
For the per-arch ones (`panvk_per_arch(...)`), we expand under each `PAN_ARCH` define just like existing per-arch code.
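The per-arch naming depends on two-level token pasting, so that `PAN_ARCH` is replaced by its numeric value before it is glued into the symbol name. The macros below are a self-contained illustration of that pattern only. panvk's real `panvk_per_arch` definition lives in its own headers and may spell the prefix differently, and `PANVK_PASTE` is a hypothetical name.

```c
#include <assert.h>

#ifndef PAN_ARCH
#define PAN_ARCH 7   /* normally defined once per arch build of the per-arch files */
#endif

/* Indirection forces PAN_ARCH to expand before ## pastes the tokens. */
#define PANVK_PASTE_(arch, name) panvk_v##arch##_##name
#define PANVK_PASTE(arch, name) PANVK_PASTE_(arch, name)
#define panvk_per_arch(name) PANVK_PASTE(PAN_ARCH, name)

/* With PAN_ARCH == 7 this defines panvk_v7_create_video_decode_queue(). */
int panvk_per_arch(create_video_decode_queue)(void)
{
    return PAN_ARCH;
}
```

Without the `PANVK_PASTE` indirection, `##` would paste the literal token `PAN_ARCH` and every arch build would emit the same symbol.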
---
## J. Probable architecture sketch
**V4L2 fd ownership**: at `panvk_physical_device` level for probe-time discovery (`panvk_v4l2_probe_hantro` sets `phys_dev->v4l2.video_fd_present = true` and stashes paths), but actual `open()` happens at `panvk_CreateVideoSessionKHR` time per-session. Two reasons: (1) the V4L2 driver state is per-fd, so two concurrent sessions need two separate fds anyway; (2) keeping fds closed when no video session is active is good citizenship. The PhysicalDevice only holds device-node paths and capability flags.
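**Per-session V4L2 state**: `struct panvk_video_session` (see C.3) owns one `video_fd`, one `media_fd`, and a pool of `request_fd`s (one per in-flight decode, typically `max_dpb_slots + 2`). At `CreateVideoSession` we S_FMT both queues and REQBUFS to allocate the buffer count. Under Strategy B (G.2) the CAPTURE queue is requested with `V4L2_MEMORY_DMABUF`, so there is nothing to EXPBUF; the app's VkImage dma_bufs are associated with CAPTURE indices at bind time, as described below. EXPBUF-ing MMAP CAPTURE buffers into dma_bufs held by the session for later import into VkImage memory is the Strategy A variant.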
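The `request_fd` pool either reuses a completed request, which must be re-initialised first, or allocates a fresh one. The sketch below keeps that policy testable by passing the two ioctls in as callbacks: `alloc_fn` stands for `ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &fd)` and `reinit_fn` for `ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT)`. All struct and function names are hypothetical, and the pool size of 18 assumes 16 H.264 DPB slots plus 2.

```c
#include <assert.h>
#include <errno.h>

#define REQ_POOL_MAX 18   /* max_dpb_slots (16) + 2 */

struct req_pool {
    int fds[REQ_POOL_MAX];
    int in_use[REQ_POOL_MAX];
    int count;                                 /* requests allocated so far */
    int (*alloc_fn)(void *ctx, int *fd_out);   /* MEDIA_IOC_REQUEST_ALLOC */
    int (*reinit_fn)(void *ctx, int fd);       /* MEDIA_REQUEST_IOC_REINIT */
    void *ctx;
};

/* Returns a request fd ready for controls + QBUF, or a negative errno. */
static int req_pool_get(struct req_pool *p)
{
    /* Prefer reusing a completed request; it must be re-initialised. */
    for (int i = 0; i < p->count; i++) {
        if (!p->in_use[i]) {
            int ret = p->reinit_fn(p->ctx, p->fds[i]);
            if (ret < 0)
                return ret;
            p->in_use[i] = 1;
            return p->fds[i];
        }
    }
    if (p->count == REQ_POOL_MAX)
        return -EBUSY;   /* caller must wait for an in-flight decode */

    int fd;
    int ret = p->alloc_fn(p->ctx, &fd);
    if (ret < 0)
        return ret;
    p->fds[p->count] = fd;
    p->in_use[p->count] = 1;
    p->count++;
    return fd;
}

/* Mark a request as completed (after its DQBUFs) so it can be reused. */
static void req_pool_put(struct req_pool *p, int fd)
{
    for (int i = 0; i < p->count; i++)
        if (p->fds[i] == fd)
            p->in_use[i] = 0;
}

/* Stand-in callbacks so the pool can be exercised without a media device. */
struct fake_media { int next_fd; int reinits; };

static int fake_alloc(void *ctx, int *fd_out)
{
    struct fake_media *m = ctx;
    *fd_out = m->next_fd++;
    return 0;
}

static int fake_reinit(void *ctx, int fd)
{
    struct fake_media *m = ctx;
    (void)fd;
    m->reinits++;
    return 0;
}
```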
**Per-VkImage dmabuf bookkeeping**: the existing pan_kmod export path (`panvk_device_memory.c:387-404`) already gives us a dma_buf fd for a BO. The new piece is the association: at `vkBindImageMemory` time, for a `VkImage` whose `usage` includes `VK_IMAGE_USAGE_VIDEO_DECODE_DST_BIT_KHR`, we assign the underlying BO's dma_buf a CAPTURE buffer index, and that fd is then passed as `m.fd` in `VIDIOC_QBUF(memory=V4L2_MEMORY_DMABUF)` at decode time. The image's `panvk_image` struct gains an `int v4l2_capture_index;` field.
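The bind-time association can be a small per-session table from CAPTURE index to backing BO. The sketch assumes that each decode-target image has a dedicated BO bound at offset 0, because the dma_buf is handed to V4L2 whole. It keys on a hypothetical nonzero `bo_handle` and deduplicates, so re-binding the same BO reuses its index. The table layout and function names are hypothetical, and `MAX_CAPTURE` must not exceed the count granted by REQBUFS.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CAPTURE 18

/* CAPTURE buffer index -> backing BO and its exported dma_buf fd. */
struct capture_table {
    uint32_t bo_handle[MAX_CAPTURE];   /* 0 marks a free index */
    int dmabuf_fd[MAX_CAPTURE];        /* passed as m.fd on each QBUF */
};

/* Returns the index to store in panvk_image::v4l2_capture_index,
 * or -1 if bo_handle is invalid or every index is taken. */
static int capture_table_register(struct capture_table *t,
                                  uint32_t bo_handle, int dmabuf_fd)
{
    int free_idx = -1;

    if (bo_handle == 0)
        return -1;
    for (int i = 0; i < MAX_CAPTURE; i++) {
        if (t->bo_handle[i] == bo_handle)
            return i;                  /* already registered: reuse its index */
        if (t->bo_handle[i] == 0 && free_idx < 0)
            free_idx = i;
    }
    if (free_idx >= 0) {
        t->bo_handle[free_idx] = bo_handle;
        t->dmabuf_fd[free_idx] = dmabuf_fd;
    }
    return free_idx;
}

/* Called when the image or its memory is destroyed. */
static void capture_table_unregister(struct capture_table *t, int index)
{
    if (index < 0 || index >= MAX_CAPTURE)
        return;
    t->bo_handle[index] = 0;
    t->dmabuf_fd[index] = -1;
}
```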
**Submit-time dispatch**: at `panvk_vX_device.c:305-313` we extend the switch to route `PANVK_QUEUE_FAMILY_VIDEO_DECODE` to `panvk_per_arch(create_video_decode_queue)`, whose queue sets `driver_submit = panvk_per_arch(video_decode_queue_submit)`. The submit function walks each cmdbuf's `video_decode_ops` dynarray and, per op, does:
```
1. resolve request_fd from the session pool (reuse a completed one, or allocate via ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC))
2. ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT) if reusing
3. translate op->sps to v4l2_ctrl_h264_sps via panvk_v4l2_h264_std_to_ctrl_sps()
4. translate op->pps to v4l2_ctrl_h264_pps via panvk_v4l2_h264_std_to_ctrl_pps()
5. build v4l2_ctrl_h264_decode_params from op (including dpb[] from session->dpb[] tracking)
6. VIDIOC_S_EXT_CTRLS(video_fd, which=V4L2_CTRL_WHICH_REQUEST_VAL, request_fd=op->request_fd,
                      {SPS, PPS, DECODE_PARAMS, SCALING_MATRIX})
   /* no SLICE_PARAMS: Hantro only exposes frame-based DECODE_MODE */
7. VIDIOC_QBUF(video_fd, OUTPUT, flags=V4L2_BUF_FLAG_REQUEST_FD, request_fd=op->request_fd,
               timestamp=<unique per op>, bytesused=op->src_size, m.fd=dma_buf of op->src_buffer's BO)
8. VIDIOC_QBUF(video_fd, CAPTURE, index=op->dst_iv->image->v4l2_capture_index)   /* not part of the request */
9. ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE)
10. poll(request_fd, POLLPRI, timeout)
11. VIDIOC_DQBUF(video_fd, OUTPUT) /* releases input slot */
12. VIDIOC_DQBUF(video_fd, CAPTURE) /* output ready */
13. Update session->dpb[op->dst_dpb_slot].reference_ts to step 7's timestamp in ns (the CAPTURE buffer inherits it)
14. Signal vk_queue_submit's signal semaphores
```
Steps 5-12 are exactly the libva-v4l2-request-fourier `RequestEndPicture` body (`src/picture.c:497-650`). The mapping VAPicture* → V4L2 vs Std* → V4L2 is the one piece of code that has no Mesa precedent — we're inventing the bridge — but it's bounded: ~150 lines per codec (we only need H.264 in Phase 1).
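Steps 9-10 (queue the request, then wait for it) can be written directly against the Media Request API in `<linux/media.h>`: the kernel reports request completion as `POLLPRI` on the request fd. A minimal sketch follows; the helper name and the choice to restart the full timeout on `EINTR` are assumptions made for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <linux/media.h>

/* Queue a fully populated request (controls set, OUTPUT buffer queued) and
 * wait for completion. Returns 0, -ETIMEDOUT, -EIO, or -errno. */
static int queue_request_and_wait(int request_fd, int timeout_ms)
{
    if (ioctl(request_fd, MEDIA_REQUEST_IOC_QUEUE) < 0)
        return -errno;

    struct pollfd pfd = { .fd = request_fd, .events = POLLPRI };
    int ret;
    do {
        ret = poll(&pfd, 1, timeout_ms);   /* restarts the full timeout on EINTR */
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -errno;
    if (ret == 0)
        return -ETIMEDOUT;
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -EIO;
    return 0;
}
```

On success, steps 11-12 DQBUF the OUTPUT and CAPTURE buffers as usual. On `-ETIMEDOUT` the request is still queued, and the kernel refuses `MEDIA_REQUEST_IOC_REINIT` (`EBUSY`) on it, so it must not go back into the pool yet.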
---
## Mesa-version observations and risks
- Mesa 26.0.6 is the campaign baseline. The vk_video runtime helpers in `src/vulkan/runtime/vk_video.{c,h}` are stable in this version with H.264, H.265, AV1, VP9, encode-h264, encode-h265, encode-av1 all covered. No upgrade required for Phase 1.
- `KHR_video_decode_h264` spec v9 is what's in `vk_api.xml` for 26.0.6, confirmed by the extension already being known to the entrypoint generator (no `--beta` flag needed; that flag at `meson.build:18` is for beta/provisional extensions only).
- Maintenance1/2 features are NOT required for the simple-test in Phase 1, so we don't need `videoMaintenance1` / `videoMaintenance2` machinery yet. Maintenance1 (inline parameters, inline queries) becomes relevant in Phase 6+ if we want to pass conformance suites.
- The `unifiedImageLayoutsVideo` feature at `panvk_vX_physical_device.c:540` is currently false. Phase 1 we can leave it false — the test client honors explicit `VkImageMemoryBarrier` transitions to/from `VK_IMAGE_LAYOUT_VIDEO_DECODE_DST_KHR`.
---
## Architectural maps that DO cleanly transfer from anv/radv
1. **Session as wrapper around `vk_video_session`**. Anv: `struct anv_video_session { struct vk_video_session vk; ... }`. radv: same shape. Ours: same shape. The `vk.` namespace gives us all the spec-mandated session fields for free.
2. **Parameters fully delegated to `vk_common_*`**. Anv does this, radv mostly does this (with a tiny `radv_video_patch_session_parameters` patch). Ours: full delegation.
3. **Cmdbuf-local shadow state for current session+params during the Begin..End scope**. Anv: `cmd_buffer->video.{vid,params}`. We do the same.
4. **DPB slot index ↔ image view lookup at decode time**. Both anv and our backend do this lookup per frame.
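On our side, the per-frame lookup resolves a Vulkan DPB slot index to the `reference_ts` that V4L2 expects in each DPB entry. That value is the CAPTURE buffer timestamp in nanoseconds, the same `tv_sec * 1e9 + tv_usec * 1000` conversion the kernel's `v4l2_timeval_to_ns()` performs. The sketch below assumes a per-session frame counter starting at 1, encoded in `tv_usec` so that 0 can mean "empty slot". The tracker and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

#define MAX_DPB_SLOTS 16

/* Which V4L2 timestamp (ns) currently occupies each Vulkan DPB slot. */
struct dpb_tracker {
    uint64_t reference_ts[MAX_DPB_SLOTS];   /* 0 = empty */
};

/* Encode a frame counter (starting at 1) as the OUTPUT buffer timestamp. */
static struct timeval frame_timeval(uint64_t frame)
{
    struct timeval tv;
    tv.tv_sec = (time_t)(frame / 1000000);
    tv.tv_usec = (suseconds_t)(frame % 1000000);
    return tv;
}

static uint64_t timeval_to_ns(struct timeval tv)
{
    return (uint64_t)tv.tv_sec * 1000000000ull + (uint64_t)tv.tv_usec * 1000ull;
}

/* Step 13: record the decoded picture's timestamp in its DPB slot. */
static void dpb_set(struct dpb_tracker *d, int slot, struct timeval tv)
{
    if (slot >= 0 && slot < MAX_DPB_SLOTS)
        d->reference_ts[slot] = timeval_to_ns(tv);
}

/* Resolve a VkVideoReferenceSlotInfoKHR::slotIndex; 0 means no picture. */
static uint64_t dpb_lookup(const struct dpb_tracker *d, int slot)
{
    if (slot < 0 || slot >= MAX_DPB_SLOTS)
        return 0;
    return d->reference_ts[slot];
}

/* e.g. on a RESET from vkCmdControlVideoCodingKHR. */
static void dpb_reset(struct dpb_tracker *d)
{
    memset(d->reference_ts, 0, sizeof(d->reference_ts));
}
```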
## Architectural maps that DO NOT transfer
1. **Driver-allocated session scratch memory (`anv_vid_mem` array)**. Hantro VPU keeps scratch internal; we return zero memory requirements. Hard skip — not just simplification, an inversion.
2. **`anv_batch_emit` register packets directly into cmdbuf at record time**. There is no equivalent. We MUST defer to submit-time — that's the entire point of the V4L2 backend being on a separate kernel device.
3. **`anv_image_dpb_address(iv, layer)` resolving to a GPU virtual address**. Our DPB references resolve to V4L2 buffer indices (queued at session-init) or dma_buf fds (Strategy B). The "address" abstraction doesn't apply; the VPU doesn't share the GPU's address space.
4. **MFX/HCP/VDENC register-set knowledge in `genX_cmd_video.c`** — 4000+ lines of Intel-specific HW programming. Completely irrelevant. The Hantro VPU's "programming" is a sequence of struct `v4l2_ctrl_*` fills + ioctls.
5. **MOCS / cache state in pipe-buf-addr-state** (`genX_cmd_video.c:962+`). N/A — the kernel V4L2 driver handles all cache coherency at QBUF/DQBUF boundaries.
---
## Phase 1 success criteria — final checklist
| vk-video-samples simple-test step | Where it lands in this map |
|---|---|
| `vkGetPhysicalDeviceQueueFamilyProperties2` returns family with `VK_QUEUE_VIDEO_DECODE_BIT_KHR` and `VkQueueFamilyVideoPropertiesKHR::videoCodecOperations & VK_VIDEO_CODEC_OPERATION_DECODE_H264_BIT_KHR` set | B.2 |
| `vkEnumerateDeviceExtensionProperties` returns the three KHR_video_* | A.1 |
| `vkGetPhysicalDeviceVideoCapabilitiesKHR(profile=H264)` returns sane caps | A.3 |
| `vkGetPhysicalDeviceVideoFormatPropertiesKHR` returns NV12 | A.4 |
| `vkCreateDevice` succeeds with the video queue family selected | B.3 |
| `vkCreateVideoSessionKHR` succeeds | C |
| `vkGetVideoSessionMemoryRequirementsKHR` returns 0 entries | C.3 |
| `vkCreateVideoSessionParametersKHR` with SPS+PPS succeeds | D (free from vk_common) |
| Recording a `vkCmdDecodeVideoKHR` succeeds (no execution yet — could even no-op the V4L2 ioctls in Phase 1 since correctness isn't tested) | E.2 |
| Single queue submit succeeds without VK_ERROR_DEVICE_LOST | B.4, J |
Phase 1 deliberately stops short of "decoded picture compares against reference". That's Phase 7. Phase 1 is the end-to-end plumbing.