panvk-bifrost/mesa-panvk-bifrost-video/phase2_design.md
marfrit a4e7d8ab90 initial seed: retrofit campaign lineage from local working trees
panvk-bifrost campaigns (r1..r4 Vulkan compositor + r5.video1 Vulkan
video decode) shipped before this repo existed; the deliverable
patches live in marfrit-packages, but the reasoning chain, phase docs,
and source-state evidence lived only in local working trees on the
development host.

This retrofit imports:
- mesa-panvk-bifrost/   — r1..r4 era phase docs (iter1..iter18)
                          (libmali stub blobs at iter18/blob/ excluded
                          — 109MB of RE artifacts replaced with a README
                          pointer)
- mesa-panvk-bifrost-video/ — sibling campaign phase docs + probe
- evidence/             — frozen .tgz source snapshots at each milestone
                          (basis for the 0005 patch diff generation)

Future iterations should branch off here from day one, so each iter is
a commit rather than a snapshot. See [[feedback-session-local-process-pins]]
for the process drift this retrofit closes.

Total: 1.9 MB across 124 files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:37 +02:00


Phase 2 — design lock for panvk-bifrost-video

Phase 1 source-map (phase1_source_map.md) acquired the architecture. This document locks the implementation-level decisions that bind Phase 4. Where Phase 1 listed options, this picks one.

Re-anchored constraints (re-verified 2026-05-21)

  • ohm reachable, kernel linux-fresnel-fourier with dma_resv patches
  • /dev/video1 (hantro decoder) + /dev/media0 (media controller) present
  • libva-v4l2-request-fourier installed and exercising the same V4L2 path — proves the protocol works (1.56× / 1.73× realtime). Coexistence policy: env-mutex (Phase 0 Q1 lock A). Only one client holds /dev/video1 at a time; user picks via LIBVA_DRIVER_NAME=null or service-level coordination.
  • mesa-panvk-bifrost r4 source on ohm at /home/mfritsche/mesa-build/mesa-26.0.6/. Reuses the same r1..r4 patch lineage in PKGBUILD; new package mesa-panvk-bifrost-video is a sibling (see Phase 0 campaign-close-via-pkgbuild).
  • Vulkan headers: 26.0.6's bundled vk.xml has H.264 decode v9 stable. No --beta flag needed.
  • Test bitstream: /home/mfritsche/fourier-test/bbb_1080p30_h264.mp4 (725 MB H.264 Main, 1080p30) — proven decoding via libva path 2026-05-21.
  • vk-video-samples builds on aarch64 (Phase 0). simple-test binary at ~/panvk-bifrost-video-evidence/Vulkan-Video-Samples/build/vk_video_decoder/test/vulkan-video-dec-simple-test.

Locked decisions

D1 — V4L2 device ownership: per-VkVideoSessionKHR, not per-VkDevice

Each call to vkCreateVideoSessionKHR opens its own video_fd to /dev/video1 and media_fd to /dev/media0. The PhysicalDevice only holds discovery state (paths + caps flags). Per Phase 1 §J reasoning: kernel V4L2 state is per-fd, multiple sessions need separate fds anyway, idle-when-no-session is good citizenship.

Trade-off rejected: per-device shared fd. Would force a session-arbitration daemon inside panvk. Not worth it for Phase 1; not needed for the simple-test workload (single session).
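
A minimal sketch of the per-session ownership in D1, assuming hypothetical names (`sketch_session_fds`, `sketch_session_fds_open`, `sketch_session_fds_close`) that are not part of panvk. It only illustrates the "open both nodes at create, leave nothing open on failure, close both at destroy" discipline; the device paths are passed in by the caller rather than hard-coded.

```c
#define _GNU_SOURCE /* for O_CLOEXEC under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for the two fds a video session would own. */
struct sketch_session_fds {
   int video_fd; /* decoder node, e.g. /dev/video1 */
   int media_fd; /* media controller, e.g. /dev/media0 */
};

/* Opens both nodes for one session. Returns 0 on success, -1 on failure.
 * On failure no fd is left open and both fields are set to -1. */
int
sketch_session_fds_open(struct sketch_session_fds *s, const char *video_path,
                        const char *media_path)
{
   assert(s && video_path && media_path);

   s->video_fd = -1;
   s->media_fd = -1;

   int vfd = open(video_path, O_RDWR | O_CLOEXEC);
   if (vfd < 0)
      return -1;

   int mfd = open(media_path, O_RDWR | O_CLOEXEC);
   if (mfd < 0) {
      close(vfd); /* do not leak the decoder fd on a partial open */
      return -1;
   }

   s->video_fd = vfd;
   s->media_fd = mfd;
   return 0;
}

/* Closes whatever is open and marks both fds as invalid. */
void
sketch_session_fds_close(struct sketch_session_fds *s)
{
   if (s->video_fd >= 0)
      close(s->video_fd);
   if (s->media_fd >= 0)
      close(s->media_fd);
   s->video_fd = -1;
   s->media_fd = -1;
}
```

Because nothing is opened outside a session, the decoder node stays free for the libva path whenever no VkVideoSessionKHR exists, which is the "idle-when-no-session" property D1 relies on.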

D2 — File layout (committed)

New files in src/panfrost/vulkan/:

| File | Purpose | Est. LoC |
|---|---|---|
| `panvk_video_decode.c` | `VkVideoSession*` + `VkCmd*VideoCoding` entrypoints; record video_decode_ops dynarray | ~400 |
| `panvk_video_decode.h` | structs: panvk_video_session, panvk_video_decode_op, panvk_video_decode_queue | ~80 |
| `panvk_v4l2.c` | V4L2 probe + per-session init + `Std*`→`v4l2_ctrl_h264_*` bridge + submit_op() | ~500 |
| `panvk_vX_video_decode_queue.c` | per-arch queue create/destroy/submit (walks ops, calls panvk_v4l2_submit_op) | ~150 |

Modified files (locations from Phase 1 §I.1):

  • panvk_vX_physical_device.c (extension list + capability/format entrypoints)
  • panvk_physical_device.c (queue family list + video properties pNext walk)
  • panvk_device.h (queue family enum)
  • panvk_vX_device.c (queue create/destroy/submit dispatch — 4 cases)
  • meson.build (register new sources)

D3 — Per-session state struct (locked layout)

struct panvk_video_session {
   struct vk_video_session vk;            /* spec-mandated fields */

   /* V4L2 fds — opened in CreateVideoSession, closed in Destroy */
   int video_fd;                          /* /dev/video1 */
   int media_fd;                          /* /dev/media0 */

   /* Negotiated formats per OUTPUT / CAPTURE queue */
   struct v4l2_format fmt_output;
   struct v4l2_format fmt_capture;

   /* Request fd pool. Max-in-flight = max_dpb_slots + 2 */
   int *request_fds;
   unsigned num_request_fds;
   uint32_t request_fd_next;              /* round-robin index */

   /* DPB slotIndex → V4L2 reference_ts mapping */
   struct {
      bool valid;
      uint64_t reference_ts;              /* V4L2 timestamp at QBUF time */
      /* No image-view pointer here — image references via slotIndex
       * only; resolution at record time via vk.params lookup. */
   } dpb[16];

   /* DECODE_PARAMS/SLICE_PARAMS submit mode (locked FRAME_BASED for Phase 1) */
   bool slice_based;                      /* Phase 1: false */
};

DPB mirroring is identical to libva-v4l2-request-fourier/src/h264.c:140-218 dpb_insert / dpb_update. Reuse the algorithm; don't link the lib — copy ~80 LoC verbatim into panvk_v4l2.c.
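
The slotIndex → reference_ts mapping in the struct above can be illustrated with a small standalone sketch. This is not the dpb_insert / dpb_update code from libva-v4l2-request-fourier; the type and function names (`sketch_dpb`, `sketch_dpb_store`, `sketch_dpb_lookup`, `sketch_dpb_reset`) are hypothetical and only show the store / lookup / reset behavior the design describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_DPB_SLOTS 16 /* mirrors dpb[16] in the session struct */

struct sketch_dpb_entry {
   bool valid;
   uint64_t reference_ts; /* V4L2 timestamp used when the frame was queued */
};

struct sketch_dpb {
   struct sketch_dpb_entry slot[SKETCH_MAX_DPB_SLOTS];
};

/* Invalidate every slot, e.g. on session destroy or a coding-control reset. */
void
sketch_dpb_reset(struct sketch_dpb *dpb)
{
   memset(dpb, 0, sizeof(*dpb));
}

/* Record that slot_index now holds the frame queued with timestamp ts. */
void
sketch_dpb_store(struct sketch_dpb *dpb, uint32_t slot_index, uint64_t ts)
{
   assert(slot_index < SKETCH_MAX_DPB_SLOTS);
   dpb->slot[slot_index].valid = true;
   dpb->slot[slot_index].reference_ts = ts;
}

/* Look up the timestamp for a reference slot. Returns false when the slot
 * is out of range or has never been written since the last reset. */
bool
sketch_dpb_lookup(const struct sketch_dpb *dpb, uint32_t slot_index,
                  uint64_t *ts_out)
{
   if (slot_index >= SKETCH_MAX_DPB_SLOTS || !dpb->slot[slot_index].valid)
      return false;
   *ts_out = dpb->slot[slot_index].reference_ts;
   return true;
}
```

A lookup that returns false for a slot the application names as a reference is the "DPB slot conflict" case that D10 maps to an error.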

D4 — Per-cmdbuf decode-op entry (locked layout)

struct panvk_video_decode_op {
   /* Captured at vkCmdDecodeVideoKHR record time */
   uint32_t dst_dpb_slot;                 /* output slot */
   struct panvk_image_view *dst_iv;       /* output VkImageView */
   uint32_t num_ref_slots;
   struct {
      uint32_t slot_index;
      struct panvk_image_view *iv;        /* reference VkImageView */
   } ref_slots[16];

   /* Bitstream buffer */
   struct panvk_buffer *src_buffer;
   uint64_t src_offset;
   uint64_t src_size;

   /* Cached params at record time (so submit can run after Parameters object updates) */
   const StdVideoH264SequenceParameterSet *sps;     /* from vk.params */
   const StdVideoH264PictureParameterSet *pps;
   VkVideoDecodeH264PictureInfoKHR pic_info;        /* the per-frame info */

   /* Filled at submit time */
   int request_fd;                        /* allocated from session pool */
   uint64_t qbuf_ts;                      /* timestamp used for dpb tracking */
};

Recorded as a util_dynarray on the command buffer. vkResetCommandBuffer clears it.

D5 — Bitstream input: VkBuffer dmabuf import (one-shot)

At record time, the VkBuffer (with VIDEO_DECODE_SRC_BIT_KHR usage) carries a panvk_priv_bo with an exportable dmabuf. At submit time, op-submit does:

fd = pan_kmod_bo_export_dma_buf(src_buffer->bo)
VIDIOC_QBUF(video_fd, V4L2_BUF_TYPE_VIDEO_OUTPUT,
            memory=V4L2_MEMORY_DMABUF, m.fd=fd, bytesused=op->src_size,
            request_fd=op->request_fd)

Source-side buffers are not pinned to V4L2 OUTPUT slots — each decode gets a fresh QBUF using the dmabuf fd. After DQBUF the slot is implicitly released.

D6 — Output frames: VkImage permanent CAPTURE slot binding (Strategy B from §G.2)

At vkBindImageMemory time, if the VkImage's usage includes VIDEO_DECODE_DST_BIT_KHR, the image's underlying BO is exported as a dmabuf. That dmabuf is registered as a permanent CAPTURE buffer slot via VIDIOC_QBUF(memory=DMABUF) at session init, and the slot index is stashed in:

struct panvk_image {
   ...
   int v4l2_capture_index;                /* -1 if not a video output image */
};

Rejected alternative: per-decode-call dmabuf import. Higher per-frame ioctl overhead. Strategy B amortizes the registration cost across the session lifetime.

D7 — Submit-time ioctl dance (the 14 steps, locked)

panvk_per_arch(video_decode_queue_submit)(queue, submit):
   for each cmdbuf in submit:
      for each op in cmdbuf->video_decode_ops:
         panvk_v4l2_submit_op(session, op):
            1. resolve request_fd: pool[round_robin++ % num] or MEDIA_IOC_REQUEST_ALLOC
            2. ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT)
            3. fill v4l2_ctrl_h264_sps    from op->sps  via panvk_v4l2_h264_std_to_ctrl_sps()
            4. fill v4l2_ctrl_h264_pps    from op->pps  via panvk_v4l2_h264_std_to_ctrl_pps()
            5. fill v4l2_ctrl_h264_decode_params from op->pic_info + session->dpb[]
            6. ext_controls = { SPS, PPS, DECODE_PARAMS, SCALING_MATRIX }
               (Phase 1: SLICE_PARAMS optional, FRAME_BASED → omit)
            7. VIDIOC_S_EXT_CTRLS(video_fd, which=REQUEST_VAL, request_fd, ext_controls)
            8. VIDIOC_QBUF(video_fd, OUTPUT, memory=DMABUF, request_fd, m.fd=src_dmabuf,
                           bytesused=op->src_size, timestamp=op->qbuf_ts)
            9. VIDIOC_QBUF(video_fd, CAPTURE, memory=DMABUF, index=dst_iv->image->v4l2_capture_index)
           10. MEDIA_REQUEST_IOC_QUEUE(request_fd)
           11. poll(request_fd, POLLPRI, timeout_ms=200)
           12. VIDIOC_DQBUF(video_fd, OUTPUT)       /* release input slot */
           13. VIDIOC_DQBUF(video_fd, CAPTURE)      /* output ready */
           14. session->dpb[op->dst_dpb_slot] = { valid:true, reference_ts:op->qbuf_ts }
   vk_queue_signal_semaphores(submit->signal_semaphores)

Per Phase 1 §J. Step 11's 200ms timeout is empirically derived from libva-v4l2-request-fourier behavior. That library polls indefinitely; we cap the wait so that a driver-side hang on a bad bitstream surfaces as VK_ERROR_DEVICE_LOST (see D10) instead of blocking the submit forever.
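
Steps 1 and 11 are the two places where the dance touches the request fd directly, and they can be sketched in isolation. The helper names (`sketch_next_request_fd`, `sketch_wait_request`) and the negative-errno return convention are assumptions for illustration only. The wait relies on the media request API signalling completion as `POLLPRI` on the request fd.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>

/* Step 1: pick the next request fd from a pre-allocated pool, round-robin.
 * *next is the running counter kept in the session. */
int
sketch_next_request_fd(const int *pool, unsigned num, uint32_t *next)
{
   assert(pool && num > 0 && next);
   return pool[(*next)++ % num];
}

/* Step 11: wait for the request to complete.
 * Returns 0 on completion, -ETIMEDOUT if timeout_ms elapsed, -EIO if poll
 * woke up without POLLPRI, or -errno if poll() itself failed. */
int
sketch_wait_request(int request_fd, int timeout_ms)
{
   struct pollfd pfd = { .fd = request_fd, .events = POLLPRI };
   int ret;

   /* Retry on signal interruption. Note this restarts the full timeout,
    * so the total wait can exceed timeout_ms if signals arrive. */
   do {
      ret = poll(&pfd, 1, timeout_ms);
   } while (ret < 0 && errno == EINTR);

   if (ret < 0)
      return -errno;
   if (ret == 0)
      return -ETIMEDOUT;
   return (pfd.revents & POLLPRI) ? 0 : -EIO;
}
```

With a 200 ms cap, a `-ETIMEDOUT` result is the point where the submit path would stop waiting and report the broad device-lost error described in D10.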

D8 — Synchronization: standard vk_queue infrastructure

panvk_per_arch(create_video_decode_queue) initializes a struct vk_queue with driver_submit = panvk_per_arch(video_decode_queue_submit). Wait/signal semaphores are handled by the standard vk_queue_submit infrastructure. Inside submit, the poll(request_fd) in step 11 is the synchronous gate — when it returns, the decode is done in V4L2 land, and the signal semaphores are signaled before returning.

For Phase 1, all video decodes are synchronous to submit. Async / pipelined decode is Phase >>1.

D9 — Hantro probe: by DT compatible name + topology

panvk_v4l2_probe_hantro() enumerates /dev/video* via udev, queries each node with VIDIOC_QUERYCAP, and accepts devices whose card field starts with "hantro-vpu" OR matches the RK3568/RK3566/RK3588 hantro DT compatibles. Falls back to a hard-coded /dev/video1 if udev is unavailable. Mirrors libva-v4l2-request-fourier/src/request.c:143-308 find_decoder_video_node_via_topology.

Negative probe outcome (no hantro device) → physical_device's video extension advertisement returns false, queue family entry is suppressed, vkEnumerateDeviceExtensionProperties does not list the three KHR_video_*. Driver gracefully degrades to graphics-only.

D10 — Errors: broad first, refine Phase 6

  • V4L2 EINVAL / EAGAIN / EBUSY at submit → VK_ERROR_DEVICE_LOST (broad)
  • Probe failure during CreateVideoSession → VK_ERROR_INITIALIZATION_FAILED
  • DPB slot conflict → VK_ERROR_OUT_OF_DEVICE_MEMORY (closest spec match)
  • Refine per-error-class mapping in Phase 6 (conformance hardening).
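
The broad mapping above amounts to a three-way switch on where the failure happened. The sketch below uses stand-in enum names (`sketch_result`, `sketch_failure`) so it compiles without Vulkan headers; the numeric values are the ones the corresponding VkResult codes have in vulkan_core.h. The split into three failure classes is an assumption drawn from the bullets, not an existing panvk interface.

```c
#include <assert.h>

/* Stand-ins for the VkResult codes named in D10. */
enum sketch_result {
   SKETCH_SUCCESS = 0,
   SKETCH_ERROR_OUT_OF_DEVICE_MEMORY = -2,  /* VK_ERROR_OUT_OF_DEVICE_MEMORY */
   SKETCH_ERROR_INITIALIZATION_FAILED = -3, /* VK_ERROR_INITIALIZATION_FAILED */
   SKETCH_ERROR_DEVICE_LOST = -4,           /* VK_ERROR_DEVICE_LOST */
};

/* Where the failure was detected. */
enum sketch_failure {
   SKETCH_FAIL_NONE,
   SKETCH_FAIL_PROBE,        /* probe failed during CreateVideoSession */
   SKETCH_FAIL_SUBMIT_IOCTL, /* EINVAL / EAGAIN / EBUSY at submit */
   SKETCH_FAIL_DPB_CONFLICT, /* DPB slot conflict */
};

/* Broad first-pass mapping; the errno value is deliberately ignored here
 * because D10 does not yet distinguish between submit-time error codes. */
enum sketch_result
sketch_map_failure(enum sketch_failure failure)
{
   switch (failure) {
   case SKETCH_FAIL_NONE:
      return SKETCH_SUCCESS;
   case SKETCH_FAIL_PROBE:
      return SKETCH_ERROR_INITIALIZATION_FAILED;
   case SKETCH_FAIL_DPB_CONFLICT:
      return SKETCH_ERROR_OUT_OF_DEVICE_MEMORY;
   case SKETCH_FAIL_SUBMIT_IOCTL:
      return SKETCH_ERROR_DEVICE_LOST;
   }
   assert(!"unknown failure class");
   return SKETCH_ERROR_DEVICE_LOST;
}
```

Keeping the mapping in one function gives the later refinement pass a single place to introduce per-errno distinctions.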

Out of scope for this iteration (explicit non-goals)

  1. H.265 / HEVC: Phase 0 lock — H.264 only.
  2. Encode: out of scope, ever (until a separate campaign).
  3. Async decode / pipelined submit: synchronous-to-submit only in Phase 1.
  4. Multi-session concurrent decode: single session only in Phase 1 (per Phase 0 Q5).
  5. VkVideoMaintenance1 (inline parameters, inline queries): not in the simple-test requirements.
  6. Multiplane 444 formats (VK_EXT_ycbcr_2plane_444_formats): optional, not in Phase 1.
  7. VK_EXT_descriptor_buffer integration: optional, not in Phase 1.
  8. Decode correctness verification (frame-PSNR vs reference): Phase 7 territory.
  9. Brave consumer: structurally unfixable, see brave-vaapi-fourier close + DokuWiki.

Failure modes to watch for during Phase 4 (instrumentation plan)

| Failure | Detection |
|---|---|
| hantro device not present on a build target | panvk_v4l2_probe_hantro returns false → extension list silently shrinks. Test: `vulkaninfo \| grep VK_KHR_video` is empty on a non-hantro box |
| /dev/video1 held by libva → CreateVideoSession EBUSY | mesa_loge() at probe + return VK_ERROR_INITIALIZATION_FAILED. Test: run mpv-fourier in parallel, verify clean error message |
| S_EXT_CTRLS EINVAL on a per-control basis | The per-control failing_ctrl_id diagnostic is in libva-v4l2-request-fourier src/v4l2.c:497-502 (the format we don't have on the iter14 path). Reproduce that diagnostic in our panvk_v4l2_submit_op |
| H.264 spec field mismatch between `Std*` and `v4l2_ctrl_*` | Add a per-field assertion in the std→v4l2 bridge for the fields where width or packing differs (e.g., bit_depth_luma_minus8 is u8 on both sides, but the SPS flags are individual bitfields in `Std*` and a single flags bitmask in `v4l2_ctrl_*`). Test: assert at translation time |
| DPB slot reuse with stale reference_ts | session->dpb[].valid cleared at DestroyVideoSession + at ResetVideoCodingControl. Test: send a RESET flag mid-stream and check dpb[] is cleared |
| Driver-side decode hang (bad bitstream) | poll(timeout=200ms) is the gate. Test: feed a truncated bitstream, verify clean VK_ERROR_DEVICE_LOST rather than session hang |
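
For the Std*→v4l2_ctrl_* mismatch row, a translation-time check might look like the sketch below. `sketch_std_sps_flags` is a hypothetical stand-in for a small subset of the Vulkan Std SPS flag bitfields so the example builds without Vulkan headers. The `V4L2_H264_SPS_FLAG_*` macros are the real ones from linux/v4l2-controls.h and need kernel headers recent enough to carry the stable stateless H.264 controls. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/v4l2-controls.h>

/* Hypothetical stand-in: one-bit fields, as the Std flag structs use. */
struct sketch_std_sps_flags {
   uint32_t direct_8x8_inference_flag : 1;
   uint32_t mb_adaptive_frame_field_flag : 1;
   uint32_t frame_mbs_only_flag : 1;
   uint32_t delta_pic_order_always_zero_flag : 1;
};

/* Repack individual bitfields into the single V4L2 flags bitmask. */
uint32_t
sketch_sps_flags_to_v4l2(const struct sketch_std_sps_flags *f)
{
   uint32_t out = 0;

   assert(f);
   if (f->direct_8x8_inference_flag)
      out |= V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE;
   if (f->mb_adaptive_frame_field_flag)
      out |= V4L2_H264_SPS_FLAG_MB_ADAPTIVE_FRAME_FIELD;
   if (f->frame_mbs_only_flag)
      out |= V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY;
   if (f->delta_pic_order_always_zero_flag)
      out |= V4L2_H264_SPS_FLAG_DELTA_PIC_ORDER_ALWAYS_ZERO;
   return out;
}

/* Per-field assertion for values that get narrowed during translation:
 * trips in debug builds instead of silently truncating. */
uint8_t
sketch_narrow_u8(uint32_t value)
{
   assert(value <= UINT8_MAX);
   return (uint8_t)value;
}
```

Routing every narrowing assignment through a checked helper of this kind makes a width mismatch fail at the translation site instead of as a later S_EXT_CTRLS EINVAL.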

Phase 4 implementation slice — first three commits

Bite-sized, validated incrementally:

  1. Commit 1: extension advertisement + queue family registration (no functionality, just enumeration). Validation: vulkan-video-dec-simple-test gets past HasAllDeviceExtensions check and into device creation. Failure mode: extension list still missing.
  2. Commit 2: CreateVideoSessionKHR + DestroyVideoSessionKHR + capability/format entrypoints (returns sane caps, no V4L2 yet; fds opened as /dev/null placeholders if necessary). Validation: simple-test creates a session, gets memory requirements (0 entries), destroys it cleanly. Failure mode: session create returns ERROR.
  3. Commit 3: panvk_v4l2_probe_hantro + real video_fd open + per-session V4L2 init (S_FMT, REQBUFS, request fd pool). Validation: simple-test creates a session against real /dev/video1. Failure mode: probe fails or EBUSY.

After commit 3, the session-level plumbing is wired. Commits 4-6 add the per-frame decode path (vkCmdDecodeVideoKHR record + submit dispatch + the ioctl dance). Commit 7 is the Std→v4l2 control bridge.

Phase 2 close criteria

  • All D1..D10 decisions locked
  • Non-goals explicit
  • Failure-modes table with detection methods
  • Phase 4 first-three-commits slice defined
  • Constraints re-verified on ohm (substrate side)

Phase 3 next: build a probe test client (smaller than vk-video-samples) that exercises just the extension-advertisement + queue-family-enumeration path. This is the regression test Phase 4 commits 1-2 are validated against, before bringing in the heavier vk-video-samples machinery.
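
The core check such a probe client would perform can be sketched without a Vulkan loader: given the extension names a device reports, confirm the video decode extensions are all present. The function name `sketch_has_all_video_exts` is hypothetical, and the list below assumes "the three KHR_video_*" extensions are the H.264 decode set; a real client would fill `available` from vkEnumerateDeviceExtensionProperties.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed required set for H.264 decode. */
static const char *const sketch_required_exts[] = {
   "VK_KHR_video_queue",
   "VK_KHR_video_decode_queue",
   "VK_KHR_video_decode_h264",
};

/* Returns true only if every required extension name appears in the
 * available list (order-independent, exact string match). */
bool
sketch_has_all_video_exts(const char *const *available, size_t count)
{
   const size_t num_required =
      sizeof(sketch_required_exts) / sizeof(sketch_required_exts[0]);

   assert(available || count == 0);

   for (size_t r = 0; r < num_required; r++) {
      bool found = false;
      for (size_t a = 0; a < count && !found; a++)
         found = strcmp(available[a], sketch_required_exts[r]) == 0;
      if (!found)
         return false;
   }
   return true;
}
```

On a box without a hantro decoder the same check is expected to return false, which doubles as the negative test for the silent extension-list shrink.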

— claude-noether, 2026-05-21