libva-multiplanar/phase2_iter2_analysis.md

# Iteration 2 Phase 2 — situation analysis (DMA-BUF lifecycle, WSI alignment, multi-resolution)

## CAPTURE buffer lifecycle in libva-v4l2-request (current state)

Three locations that touch the CAPTURE buffer index ↔ surface binding:

1. **`src/surface.c::RequestCreateSurfaces2` line 201** — allocates `surfaces_count` CAPTURE buffers via `v4l2_create_buffers(...)`. Each gets a V4L2 buffer index `index_base..index_base+surfaces_count-1`.
2. **`src/surface.c::RequestCreateSurfaces2` line 280** — binds `surface_object->destination_index = index` for each surface. **Permanent 1:1 binding** for the surface's lifetime.
3. **`src/picture.c::RequestEndPicture` line 375** — re-QBUFs `surface_object->destination_index` for every decode cycle on that surface.

The bug:
- mpv allocates ~16 surfaces. Each gets a permanently bound CAPTURE buffer index 0..15.
- mpv decodes frame N into surface 0 → V4L2 QBUFs buffer 0 → kernel writes frame N into buffer 0 → DQBUF → mpv calls vaExportSurfaceHandle → gets fd to buffer 0 → renders frame N (compositor holds fd briefly).
- A few frames later (say frame N+5 in display order, but earlier in decode order due to B-frames), mpv decodes new content into surface 0 again → our backend re-QBUFs buffer 0 → **kernel writes frame N+5 into the same physical memory the compositor is still rendering frame N from**.
- Visible result on mpv `--vo=gpu`: stutter, "back and forth," frame content tearing/swapping. Operator-confirmed 2026-05-04.
- Not visible on mpv `--vo=null` because no rendering happens.
- Not visible on mpv `--hwdec=vaapi-copy` because libva-side `vaMapBuffer` does an explicit CPU copy out of the V4L2 buffer before returning to mpv, decoupling mpv's display pipeline from the V4L2 buffer lifecycle.

## Why the kernel doesn't enforce the constraint

`videobuf2-core.c::vb2_qbuf` permits requeue of a V4L2 buffer regardless of the dma_buf refcount on its EXPBUF'd fd. The two refcounts are decoupled:
- V4L2 queue state: dequeued ↔ queued (kernel writable)
- dma_buf refcount: alive while ANY fd references it

When userspace QBUFs a buffer that has external dma_buf refs, the kernel writes to the same physical memory; the consumer holding the dma_buf just sees its content change underfoot. **It's userspace's responsibility to coordinate.**

## Three viable architectures (in increasing implementation cost)

### Option A: more buffers + LRU recycling (cheapest, statistical mitigation)

Allocate `max(surfaces_count, MIN_POOL)` CAPTURE buffers (e.g., MIN_POOL=24). Track per-buffer "last exported at" timestamp. When recycling for a new decode, prefer the buffer with oldest last-exported timestamp. Allocates +8 buffers ≈ +24 MB at 1080p NV12.

Race window narrows but doesn't close. mpv at 24 fps decode + compositor at 60 fps display + 2-3 frame compositor pool = consumer holds ~3 frames × 16ms = ~50ms. With 24-buffer pool and oldest-first recycling, the recycled buffer was last exported ~24 × 41ms = ~1 second ago. Race window: 50ms vs 1s margin. Statistically safe for typical playback.

Failure modes still possible: pause-then-resume (consumer holds frame longer); jank-induced compositor lag; multi-stream playback exhausting the pool. Acceptable for iteration 2's scope per `phase0_findings_iter2.md`.

### Option B: per-buffer dma_buf-refcount-aware recycling (correct, complex)

Track per CAPTURE buffer: V4L2 queue state + our own EXPBUF fd. On recycle attempt, close OUR fd, then check kernel-side dma_buf refcount via... no public API exists for this in V4L2. The kernel exposes nothing. Would need either:
- Workaround: dup the fd into a non-blocking pipe and poll; consumer's dup count signals via... no.
- Use V4L2_MEMORY_DMABUF + userspace allocation (gbm/etc.) — substantial rewrite, inverts the buffer ownership model.

Real fix but iteration 3+ scope.

### Option C: kernel-side fix + userspace coordination (deepest)

Patch v4l2-core to expose dma_buf refcount via VIDIOC_QUERYBUF or similar, OR enforce QBUF rejection (EBUSY) when dma_buf refcount > 0. Out of campaign scope (`feedback_no_upstream`).

## Iteration 2 picks Option A.

Rationale: smallest code change, addresses the operator-visible stutter for the canonical case (mpv vaapi --vo=gpu single-stream), defers the architectural redesign to iteration 3+ when the FFmpeg / GStreamer reference implementations can be studied for prior art.

## WSI pitch alignment (independent issue)

`src/surface.c::RequestExportSurfaceHandle` sets:
```c
surface_descriptor->layers[0].pitch[0] = surface_object->destination_bytesperlines[0];
```
Where `destination_bytesperlines[0]` comes from `v4l2_get_format` on the kernel side — for a 864×480 video, kernel reports `bytesperline=864`. Mesa's WSI rejects with "WSI pitch not properly aligned" because the GPU compositor wants pitch aligned to 64 bytes (or higher: 128, 256 — Mali and panfrost both have constraints).

Two sub-options:

**Sub-option WSI-1**: round up reported pitch to the nearest 64-byte boundary in our exported descriptor. The kernel-side V4L2 buffer is sized per the actual `bytesperline=864`; if we report `pitch=896` (round up to 64-aligned 896 = 14×64 = 896 vs 864 = 13.5×64), Mesa accepts but reads bytes 864..895 of each row as garbage (alignment padding the kernel doesn't actually allocate).

This produces visual artifacts at the right edge of frames if Mesa actually samples those bytes. Mesa typically clips to `width` so it's mostly OK. But still wrong technically.

**Sub-option WSI-2**: convince hantro to allocate buffers with a wider stride. Hantro's CAPTURE buffer stride is what the kernel computes based on the OUTPUT format we set. If we ask hantro to use a wider stride at S_FMT time, the buffer is wider, kernel reports the wider stride, our descriptor reports it, Mesa accepts, no garbage at right edge.

Looking at hantro driver code: `drivers/media/platform/verisilicon/hantro_v4l2.c::hantro_set_fmt_cap` sets the CAPTURE format from the OUTPUT format's resolution. The stride is computed by `hantro_pp_dec_check_format_compatibility` etc. There's no userspace knob to force a wider stride.

So sub-option WSI-1 (report wider pitch in descriptor than the kernel actually allocates) is the realistic fix. Pixel-edge artifacts are acceptable for a test corpus; production-grade fix needs kernel changes.

Implementation: in `ExportSurfaceHandle`, compute `aligned_pitch = (bytesperline + 63) & ~63` and use that for `pitch`. Adjust UV plane offset accordingly: `offsets[1] = aligned_pitch * format_height` instead of `bytesperline * format_height`. Wait, that's wrong too — the kernel's UV starts at `bytesperline * format_height`, not `aligned_pitch * format_height`. Reporting wrong offset would be worse.

Hmm. The right fix is: report the kernel's actual offsets and pitches AS-IS, but also add `DRM_FORMAT_MOD_INVALID` or `DRM_FORMAT_MOD_LINEAR_NONALIGN_*` modifier to tell Mesa "this isn't WSI-aligned, treat it as a non-display intermediate." Then mpv/Firefox would composite via texture upload rather than direct WSI, paying a perf cost but correct.

Actually `DRM_FORMAT_MOD_INVALID` means "modifier unknown, do whatever you'd do for implicit." `DRM_FORMAT_MOD_LINEAR` means "explicitly known to be linear." Currently we set `DRM_FORMAT_MOD_NONE` (= 0 = same value as MOD_LINEAR). If we set `DRM_FORMAT_MOD_INVALID`, Mesa might be more lenient.

Worth a one-line experiment.

For iteration 2: try `DRM_FORMAT_MOD_INVALID` first. If Mesa accepts, done. If not, pad-pitch-with-aliasing fallback.

## Multi-resolution kernel state corruption

After Firefox plays multiple videos at different resolutions (864→1920→...) in sequence, the next CAPTURE format query returns 48×48 (kernel default). Our `LAST_OUTPUT_WIDTH/HEIGHT` cache (commit `37c0e72`) doesn't catch this because the surface request matches the last set value, but the kernel state has been clobbered between sessions.

Hypothesis: when one playback ends, mpv/Firefox calls `vaDestroyContext` → our backend cleans up (REQBUFS(0), STREAMOFF) → the kernel's CAPTURE queue is reset to default 48×48. The next vaCreateContext reuses our cache that says "format already 1920×1088, no need to S_FMT" — but the kernel is back at 48×48.

Fix: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache when the OUTPUT queue is REQBUFS(0)'d. That guarantees the next CreateSurfaces2 will re-set the format.

Alternative: always check kernel-side actual format via `v4l2_get_format` after S_FMT and trust the kernel's value; if kernel doesn't have what we set, force re-set.

Implementation: invalidate cache in `RequestDestroyContext` / `RequestDestroySurfaces` / wherever cleanup happens. Find the right hook.

## Touch points for iteration 2 implementation

| File | Line(s) | Change |
|---|---|---|
| `src/surface.h` | new fields | Add `enum cap_buffer_state` + per-buffer state tracking struct |
| `src/surface.c` | ~201 | Allocate `max(surfaces_count, MIN_CAP_POOL)` buffers (e.g., MIN_CAP_POOL=24) |
| `src/surface.c` | ~280 | Don't 1:1 bind — instead, push all buffers to the FREE pool |
| `src/surface.c` | ~640 (Export) | Mark slot as exported, save export timestamp |
| `src/surface.c` | ~700 (Export) | Try `DRM_FORMAT_MOD_INVALID` instead of `DRM_FORMAT_MOD_NONE` for WSI compat |
| `src/picture.c` | ~340 (BeginPicture) | Allocate slot from pool (LRU recycling), bind surface→slot, update destination_index |
| `src/picture.c` | ~375 (EndPicture, the existing QBUF) | Already uses `surface_object->destination_index` — works once BeginPicture rebinds |
| `src/surface.c` | LAST_OUTPUT_WIDTH cache | Invalidate on RequestDestroyContext or DestroySurfaces |

## Phase 4 plan input — to be locked

The pool refactor is the load-bearing fix. WSI alignment + multi-resolution recovery are independent and small (1-line experiments first, escalate if needed).

Suggested fix order:
1. Multi-resolution cache invalidation (smallest, lowest risk)
2. WSI modifier change to `DRM_FORMAT_MOD_INVALID` (1 line, test immediately)
3. Decoupled buffer pool (the substantive iteration 2 work)
4. Each fix tested independently before stacking the next.