Files
libva-multiplanar/phase2_iter2_analysis.md
T
marfrit 9b1d7737cd Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00

117 lines
9.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iteration 2 Phase 2 — situation analysis (DMA-BUF lifecycle, WSI alignment, multi-resolution)
## CAPTURE buffer lifecycle in libva-v4l2-request (current state)
Three locations that touch the CAPTURE buffer index ↔ surface binding:
1. **`src/surface.c::RequestCreateSurfaces2` line 201** — allocates `surfaces_count` CAPTURE buffers via `v4l2_create_buffers(...)`. Each gets a V4L2 buffer index `index_base..index_base+surfaces_count-1`.
2. **`src/surface.c::RequestCreateSurfaces2` line 280** — binds `surface_object->destination_index = index` for each surface. **Permanent 1:1 binding** for the surface's lifetime.
3. **`src/picture.c::RequestEndPicture` line 375** — re-QBUFs `surface_object->destination_index` for every decode cycle on that surface.
The bug:
- mpv allocates ~16 surfaces. Each gets a permanently bound CAPTURE buffer index 0..15.
- mpv decodes frame N into surface 0 → V4L2 QBUFs buffer 0 → kernel writes frame N into buffer 0 → DQBUF → mpv calls vaExportSurfaceHandle → gets fd to buffer 0 → renders frame N (compositor holds fd briefly).
- A few frames later (say frame N+5 in display order, but earlier in decode order due to B-frames), mpv decodes new content into surface 0 again → our backend re-QBUFs buffer 0 → **kernel writes frame N+5 into the same physical memory the compositor is still rendering frame N from**.
- Visible result on mpv `--vo=gpu`: stutter, "back and forth," frame content tearing/swapping. Operator-confirmed 2026-05-04.
- Not visible on mpv `--vo=null` because no rendering happens.
- Not visible on mpv `--hwdec=vaapi-copy` because libva-side `vaMapBuffer` does an explicit CPU copy out of the V4L2 buffer before returning to mpv, decoupling mpv's display pipeline from the V4L2 buffer lifecycle.
## Why the kernel doesn't enforce the constraint
`videobuf2-core.c::vb2_qbuf` permits requeue of a V4L2 buffer regardless of the dma_buf refcount on its EXPBUF'd fd. The two refcounts are decoupled:
- V4L2 queue state: dequeued ↔ queued (kernel writable)
- dma_buf refcount: alive while ANY fd references it
When userspace QBUFs a buffer that has external dma_buf refs, the kernel writes to the same physical memory; the consumer holding the dma_buf just sees its content change underfoot. **It's userspace's responsibility to coordinate.**
## Three viable architectures (in increasing implementation cost)
### Option A: more buffers + LRU recycling (cheapest, statistical mitigation)
Allocate `max(surfaces_count, MIN_POOL)` CAPTURE buffers (e.g., MIN_POOL=24). Track per-buffer "last exported at" timestamp. When recycling for a new decode, prefer the buffer with oldest last-exported timestamp. Allocates +8 buffers ≈ +24 MB at 1080p NV12.
Race window narrows but doesn't close. mpv at 24 fps decode + compositor at 60 fps display + 2-3 frame compositor pool = consumer holds ~3 frames × 16ms = ~50ms. With 24-buffer pool and oldest-first recycling, the recycled buffer was last exported ~24 × 41ms = ~1 second ago. Race window: 50ms vs 1s margin. Statistically safe for typical playback.
Failure modes still possible: pause-then-resume (consumer holds frame longer); jank-induced compositor lag; multi-stream playback exhausting the pool. Acceptable for iteration 2's scope per `phase0_findings_iter2.md`.
### Option B: per-buffer dma_buf-refcount-aware recycling (correct, complex)
Track per CAPTURE buffer: V4L2 queue state + our own EXPBUF fd. On recycle attempt, close OUR fd, then check kernel-side dma_buf refcount via... no public API exists for this in V4L2. The kernel exposes nothing. Would need either:
- Workaround: dup the fd into a non-blocking pipe and poll; consumer's dup count signals via... no.
- Use V4L2_MEMORY_DMABUF + userspace allocation (gbm/etc.) — substantial rewrite, inverts the buffer ownership model.
Real fix but iteration 3+ scope.
### Option C: kernel-side fix + userspace coordination (deepest)
Patch v4l2-core to expose dma_buf refcount via VIDIOC_QUERYBUF or similar, OR enforce QBUF rejection (EBUSY) when dma_buf refcount > 0. Out of campaign scope (`feedback_no_upstream`).
## Iteration 2 picks Option A.
Rationale: smallest code change, addresses the operator-visible stutter for the canonical case (mpv vaapi --vo=gpu single-stream), defers the architectural redesign to iteration 3+ when the FFmpeg / GStreamer reference implementations can be studied for prior art.
## WSI pitch alignment (independent issue)
`src/surface.c::RequestExportSurfaceHandle` sets:
```c
surface_descriptor->layers[0].pitch[0] = surface_object->destination_bytesperlines[0];
```
Where `destination_bytesperlines[0]` comes from `v4l2_get_format` on the kernel side — for a 864×480 video, kernel reports `bytesperline=864`. Mesa's WSI rejects with "WSI pitch not properly aligned" because the GPU compositor wants pitch aligned to 64 bytes (or higher: 128, 256 — Mali and panfrost both have constraints).
Two sub-options:
**Sub-option WSI-1**: round up reported pitch to the nearest 64-byte boundary in our exported descriptor. The kernel-side V4L2 buffer is sized per the actual `bytesperline=864`; if we report `pitch=896` (round up to 64-aligned 896 = 14×64 = 896 vs 864 = 13.5×64), Mesa accepts but reads bytes 864..895 of each row as garbage (alignment padding the kernel doesn't actually allocate).
This produces visual artifacts at the right edge of frames if Mesa actually samples those bytes. Mesa typically clips to `width` so it's mostly OK. But still wrong technically.
**Sub-option WSI-2**: convince hantro to allocate buffers with a wider stride. Hantro's CAPTURE buffer stride is what the kernel computes based on the OUTPUT format we set. If we ask hantro to use a wider stride at S_FMT time, the buffer is wider, kernel reports the wider stride, our descriptor reports it, Mesa accepts, no garbage at right edge.
Looking at hantro driver code: `drivers/media/platform/verisilicon/hantro_v4l2.c::hantro_set_fmt_cap` sets the CAPTURE format from the OUTPUT format's resolution. The stride is computed by `hantro_pp_dec_check_format_compatibility` etc. There's no userspace knob to force a wider stride.
So sub-option WSI-1 (report wider pitch in descriptor than the kernel actually allocates) is the realistic fix. Pixel-edge artifacts are acceptable for a test corpus; production-grade fix needs kernel changes.
Implementation: in `ExportSurfaceHandle`, compute `aligned_pitch = (bytesperline + 63) & ~63` and use that for `pitch`. Adjust UV plane offset accordingly: `offsets[1] = aligned_pitch * format_height` instead of `bytesperline * format_height`. Wait, that's wrong too — the kernel's UV starts at `bytesperline * format_height`, not `aligned_pitch * format_height`. Reporting wrong offset would be worse.
Hmm. The right fix is: report the kernel's actual offsets and pitches AS-IS, but also add `DRM_FORMAT_MOD_INVALID` or `DRM_FORMAT_MOD_LINEAR_NONALIGN_*` modifier to tell Mesa "this isn't WSI-aligned, treat it as a non-display intermediate." Then mpv/Firefox would composite via texture upload rather than direct WSI, paying a perf cost but correct.
Actually `DRM_FORMAT_MOD_INVALID` means "modifier unknown, do whatever you'd do for implicit." `DRM_FORMAT_MOD_LINEAR` means "explicitly known to be linear." Currently we set `DRM_FORMAT_MOD_NONE` (= 0 = same value as MOD_LINEAR). If we set `DRM_FORMAT_MOD_INVALID`, Mesa might be more lenient.
Worth a one-line experiment.
For iteration 2: try `DRM_FORMAT_MOD_INVALID` first. If Mesa accepts, done. If not, pad-pitch-with-aliasing fallback.
## Multi-resolution kernel state corruption
After Firefox plays multiple videos at different resolutions (864→1920→...) in sequence, the next CAPTURE format query returns 48×48 (kernel default). Our `LAST_OUTPUT_WIDTH/HEIGHT` cache (commit `37c0e72`) doesn't catch this because the surface request matches the last set value, but the kernel state has been clobbered between sessions.
Hypothesis: when one playback ends, mpv/Firefox calls `vaDestroyContext` → our backend cleans up (REQBUFS(0), STREAMOFF) → the kernel's CAPTURE queue is reset to default 48×48. The next vaCreateContext reuses our cache that says "format already 1920×1088, no need to S_FMT" — but the kernel is back at 48×48.
Fix: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache when the OUTPUT queue is REQBUFS(0)'d. That guarantees the next CreateSurfaces2 will re-set the format.
Alternative: always check kernel-side actual format via `v4l2_get_format` after S_FMT and trust the kernel's value; if kernel doesn't have what we set, force re-set.
Implementation: invalidate cache in `RequestDestroyContext` / `RequestDestroySurfaces` / wherever cleanup happens. Find the right hook.
## Touch points for iteration 2 implementation
| File | Line(s) | Change |
|---|---|---|
| `src/surface.h` | new fields | Add `enum cap_buffer_state` + per-buffer state tracking struct |
| `src/surface.c` | ~201 | Allocate `max(surfaces_count, MIN_CAP_POOL)` buffers (e.g., MIN_CAP_POOL=24) |
| `src/surface.c` | ~280 | Don't 1:1 bind — instead, push all buffers to the FREE pool |
| `src/surface.c` | ~640 (Export) | Mark slot as exported, save export timestamp |
| `src/surface.c` | ~700 (Export) | Try `DRM_FORMAT_MOD_INVALID` instead of `DRM_FORMAT_MOD_NONE` for WSI compat |
| `src/picture.c` | ~340 (BeginPicture) | Allocate slot from pool (LRU recycling), bind surface→slot, update destination_index |
| `src/picture.c` | ~375 (EndPicture, the existing QBUF) | Already uses `surface_object->destination_index` — works once BeginPicture rebinds |
| `src/surface.c` | LAST_OUTPUT_WIDTH cache | Invalidate on RequestDestroyContext or DestroySurfaces |
## Phase 4 plan input — to be locked
The pool refactor is the load-bearing fix. WSI alignment + multi-resolution recovery are independent and small (1-line experiments first, escalate if needed).
Suggested fix order:
1. Multi-resolution cache invalidation (smallest, lowest risk)
2. WSI modifier change to `DRM_FORMAT_MOD_INVALID` (1 line, test immediately)
3. Decoupled buffer pool (the substantive iteration 2 work)
4. Each fix tested independently before stacking the next.