Files
libva-multiplanar/phase2_iter2_analysis.md
marfrit 9b1d7737cd Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00

9.9 KiB
Raw Permalink Blame History

Iteration 2 Phase 2 — situation analysis (DMA-BUF lifecycle, WSI alignment, multi-resolution)

CAPTURE buffer lifecycle in libva-v4l2-request (current state)

Three locations that touch the CAPTURE buffer index ↔ surface binding:

  1. src/surface.c::RequestCreateSurfaces2 line 201 — allocates surfaces_count CAPTURE buffers via v4l2_create_buffers(...). Each gets a V4L2 buffer index index_base..index_base+surfaces_count-1.
  2. src/surface.c::RequestCreateSurfaces2 line 280 — binds surface_object->destination_index = index for each surface. Permanent 1:1 binding for the surface's lifetime.
  3. src/picture.c::RequestEndPicture line 375 — re-QBUFs surface_object->destination_index for every decode cycle on that surface.

The bug:

  • mpv allocates ~16 surfaces. Each gets a permanently bound CAPTURE buffer index 0..15.
  • mpv decodes frame N into surface 0 → V4L2 QBUFs buffer 0 → kernel writes frame N into buffer 0 → DQBUF → mpv calls vaExportSurfaceHandle → gets fd to buffer 0 → renders frame N (compositor holds fd briefly).
  • A few frames later (say frame N+5 in display order, but earlier in decode order due to B-frames), mpv decodes new content into surface 0 again → our backend re-QBUFs buffer 0 → kernel writes frame N+5 into the same physical memory the compositor is still rendering frame N from.
  • Visible result on mpv --vo=gpu: stutter, "back and forth," frame content tearing/swapping. Operator-confirmed 2026-05-04.
  • Not visible on mpv --vo=null because no rendering happens.
  • Not visible on mpv --hwdec=vaapi-copy because libva-side vaMapBuffer does an explicit CPU copy out of the V4L2 buffer before returning to mpv, decoupling mpv's display pipeline from the V4L2 buffer lifecycle.

Why the kernel doesn't enforce the constraint

videobuf2-core.c::vb2_qbuf permits requeue of a V4L2 buffer regardless of the dma_buf refcount on its EXPBUF'd fd. The two refcounts are decoupled:

  • V4L2 queue state: dequeued ↔ queued (kernel writable)
  • dma_buf refcount: alive while ANY fd references it

When userspace QBUFs a buffer that has external dma_buf refs, the kernel writes to the same physical memory; the consumer holding the dma_buf just sees its content change underfoot. It's userspace's responsibility to coordinate.

Three viable architectures (in increasing implementation cost)

Option A: more buffers + LRU recycling (cheapest, statistical mitigation)

Allocate max(surfaces_count, MIN_POOL) CAPTURE buffers (e.g., MIN_POOL=24). Track per-buffer "last exported at" timestamp. When recycling for a new decode, prefer the buffer with oldest last-exported timestamp. Allocates +8 buffers ≈ +24 MB at 1080p NV12.

Race window narrows but doesn't close. mpv at 24 fps decode + compositor at 60 fps display + 2-3 frame compositor pool = consumer holds ~3 frames × 16ms = ~50ms. With 24-buffer pool and oldest-first recycling, the recycled buffer was last exported ~24 × 41ms = ~1 second ago. Race window: 50ms vs 1s margin. Statistically safe for typical playback.

Failure modes still possible: pause-then-resume (consumer holds frame longer); jank-induced compositor lag; multi-stream playback exhausting the pool. Acceptable for iteration 2's scope per phase0_findings_iter2.md.

Option B: per-buffer dma_buf-refcount-aware recycling (correct, complex)

Track per CAPTURE buffer: V4L2 queue state + our own EXPBUF fd. On recycle attempt, close OUR fd, then check kernel-side dma_buf refcount via... no public API exists for this in V4L2. The kernel exposes nothing. Would need either:

  • Workaround: dup the fd into a non-blocking pipe and poll; consumer's dup count signals via... no.
  • Use V4L2_MEMORY_DMABUF + userspace allocation (gbm/etc.) — substantial rewrite, inverts the buffer ownership model.

Real fix but iteration 3+ scope.

Option C: kernel-side fix + userspace coordination (deepest)

Patch v4l2-core to expose dma_buf refcount via VIDIOC_QUERYBUF or similar, OR enforce QBUF rejection (EBUSY) when dma_buf refcount > 0. Out of campaign scope (feedback_no_upstream).

Iteration 2 picks Option A.

Rationale: smallest code change, addresses the operator-visible stutter for the canonical case (mpv vaapi --vo=gpu single-stream), defers the architectural redesign to iteration 3+ when the FFmpeg / GStreamer reference implementations can be studied for prior art.

WSI pitch alignment (independent issue)

src/surface.c::RequestExportSurfaceHandle sets:

surface_descriptor->layers[0].pitch[0] = surface_object->destination_bytesperlines[0];

Where destination_bytesperlines[0] comes from v4l2_get_format on the kernel side — for a 864×480 video, kernel reports bytesperline=864. Mesa's WSI rejects with "WSI pitch not properly aligned" because the GPU compositor wants pitch aligned to 64 bytes (or higher: 128, 256 — Mali and panfrost both have constraints).

Two sub-options:

Sub-option WSI-1: round up reported pitch to the nearest 64-byte boundary in our exported descriptor. The kernel-side V4L2 buffer is sized per the actual bytesperline=864; if we report pitch=896 (round up to 64-aligned 896 = 14×64 = 896 vs 864 = 13.5×64), Mesa accepts but reads bytes 864..895 of each row as garbage (alignment padding the kernel doesn't actually allocate).

This produces visual artifacts at the right edge of frames if Mesa actually samples those bytes. Mesa typically clips to width so it's mostly OK. But still wrong technically.

Sub-option WSI-2: convince hantro to allocate buffers with a wider stride. Hantro's CAPTURE buffer stride is what the kernel computes based on the OUTPUT format we set. If we ask hantro to use a wider stride at S_FMT time, the buffer is wider, kernel reports the wider stride, our descriptor reports it, Mesa accepts, no garbage at right edge.

Looking at hantro driver code: drivers/media/platform/verisilicon/hantro_v4l2.c::hantro_set_fmt_cap sets the CAPTURE format from the OUTPUT format's resolution. The stride is computed by hantro_pp_dec_check_format_compatibility etc. There's no userspace knob to force a wider stride.

So sub-option WSI-1 (report wider pitch in descriptor than the kernel actually allocates) is the realistic fix. Pixel-edge artifacts are acceptable for a test corpus; production-grade fix needs kernel changes.

Implementation: in ExportSurfaceHandle, compute aligned_pitch = (bytesperline + 63) & ~63 and use that for pitch. Adjust UV plane offset accordingly: offsets[1] = aligned_pitch * format_height instead of bytesperline * format_height. Wait, that's wrong too — the kernel's UV starts at bytesperline * format_height, not aligned_pitch * format_height. Reporting wrong offset would be worse.

Hmm. The right fix is: report the kernel's actual offsets and pitches AS-IS, but also add DRM_FORMAT_MOD_INVALID or DRM_FORMAT_MOD_LINEAR_NONALIGN_* modifier to tell Mesa "this isn't WSI-aligned, treat it as a non-display intermediate." Then mpv/Firefox would composite via texture upload rather than direct WSI, paying a perf cost but correct.

Actually DRM_FORMAT_MOD_INVALID means "modifier unknown, do whatever you'd do for implicit." DRM_FORMAT_MOD_LINEAR means "explicitly known to be linear." Currently we set DRM_FORMAT_MOD_NONE (= 0 = same value as MOD_LINEAR). If we set DRM_FORMAT_MOD_INVALID, Mesa might be more lenient.

Worth a one-line experiment.

For iteration 2: try DRM_FORMAT_MOD_INVALID first. If Mesa accepts, done. If not, pad-pitch-with-aliasing fallback.

Multi-resolution kernel state corruption

After Firefox plays multiple videos at different resolutions (864→1920→...) in sequence, the next CAPTURE format query returns 48×48 (kernel default). Our LAST_OUTPUT_WIDTH/HEIGHT cache (commit 37c0e72) doesn't catch this because the surface request matches the last set value, but the kernel state has been clobbered between sessions.

Hypothesis: when one playback ends, mpv/Firefox calls vaDestroyContext → our backend cleans up (REQBUFS(0), STREAMOFF) → the kernel's CAPTURE queue is reset to default 48×48. The next vaCreateContext reuses our cache that says "format already 1920×1088, no need to S_FMT" — but the kernel is back at 48×48.

Fix: invalidate LAST_OUTPUT_WIDTH/HEIGHT cache when the OUTPUT queue is REQBUFS(0)'d. That guarantees the next CreateSurfaces2 will re-set the format.

Alternative: always check kernel-side actual format via v4l2_get_format after S_FMT and trust the kernel's value; if kernel doesn't have what we set, force re-set.

Implementation: invalidate cache in RequestDestroyContext / RequestDestroySurfaces / wherever cleanup happens. Find the right hook.

Touch points for iteration 2 implementation

File Line(s) Change
src/surface.h new fields Add enum cap_buffer_state + per-buffer state tracking struct
src/surface.c ~201 Allocate max(surfaces_count, MIN_CAP_POOL) buffers (e.g., MIN_CAP_POOL=24)
src/surface.c ~280 Don't 1:1 bind — instead, push all buffers to the FREE pool
src/surface.c ~640 (Export) Mark slot as exported, save export timestamp
src/surface.c ~700 (Export) Try DRM_FORMAT_MOD_INVALID instead of DRM_FORMAT_MOD_NONE for WSI compat
src/picture.c ~340 (BeginPicture) Allocate slot from pool (LRU recycling), bind surface→slot, update destination_index
src/picture.c ~375 (EndPicture, the existing QBUF) Already uses surface_object->destination_index — works once BeginPicture rebinds
src/surface.c LAST_OUTPUT_WIDTH cache Invalidate on RequestDestroyContext or DestroySurfaces

Phase 4 plan input — to be locked

The pool refactor is the load-bearing fix. WSI alignment + multi-resolution recovery are independent and small (1-line experiments first, escalate if needed).

Suggested fix order:

  1. Multi-resolution cache invalidation (smallest, lowest risk)
  2. WSI modifier change to DRM_FORMAT_MOD_INVALID (1 line, test immediately)
  3. Decoupled buffer pool (the substantive iteration 2 work)
  4. Each fix tested independently before stacking the next.