Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.
Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.
Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.
Three architecture options analyzed:
A. More buffers + LRU recycling (cheapest, statistical mitigation)
B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
C. Kernel patch to enforce QBUF rejection (out of campaign scope)
Picking Option A for iteration 2.
Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.
Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
Fix 3: decoupled CAPTURE buffer pool with LRU recycling
(~150-200 lines, the load-bearing fix).
Phase 5 sonnet review BEFORE Fix 3 implementation.
Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,116 @@
|
||||
# Iteration 2 Phase 2 — situation analysis (DMA-BUF lifecycle, WSI alignment, multi-resolution)
|
||||
|
||||
## CAPTURE buffer lifecycle in libva-v4l2-request (current state)
|
||||
|
||||
Three locations that touch the CAPTURE buffer index ↔ surface binding:
|
||||
|
||||
1. **`src/surface.c::RequestCreateSurfaces2` line 201** — allocates `surfaces_count` CAPTURE buffers via `v4l2_create_buffers(...)`. Each gets a V4L2 buffer index `index_base..index_base+surfaces_count-1`.
|
||||
2. **`src/surface.c::RequestCreateSurfaces2` line 280** — binds `surface_object->destination_index = index` for each surface. **Permanent 1:1 binding** for the surface's lifetime.
|
||||
3. **`src/picture.c::RequestEndPicture` line 375** — re-QBUFs `surface_object->destination_index` for every decode cycle on that surface.
|
||||
|
||||
The bug:
|
||||
- mpv allocates ~16 surfaces. Each gets a permanently bound CAPTURE buffer index 0..15.
|
||||
- mpv decodes frame N into surface 0 → V4L2 QBUFs buffer 0 → kernel writes frame N into buffer 0 → DQBUF → mpv calls vaExportSurfaceHandle → gets fd to buffer 0 → renders frame N (compositor holds fd briefly).
|
||||
- A few frames later (say frame N+5 in display order, but earlier in decode order due to B-frames), mpv decodes new content into surface 0 again → our backend re-QBUFs buffer 0 → **kernel writes frame N+5 into the same physical memory the compositor is still rendering frame N from**.
|
||||
- Visible result on mpv `--vo=gpu`: stutter, "back and forth," frame content tearing/swapping. Operator-confirmed 2026-05-04.
|
||||
- Not visible on mpv `--vo=null` because no rendering happens.
|
||||
- Not visible on mpv `--hwdec=vaapi-copy` because libva-side `vaMapBuffer` does an explicit CPU copy out of the V4L2 buffer before returning to mpv, decoupling mpv's display pipeline from the V4L2 buffer lifecycle.
|
||||
|
||||
## Why the kernel doesn't enforce the constraint
|
||||
|
||||
`videobuf2-core.c::vb2_qbuf` permits requeue of a V4L2 buffer regardless of the dma_buf refcount on its EXPBUF'd fd. The two refcounts are decoupled:
|
||||
- V4L2 queue state: dequeued ↔ queued (kernel writable)
|
||||
- dma_buf refcount: alive while ANY fd references it
|
||||
|
||||
When userspace QBUFs a buffer that has external dma_buf refs, the kernel writes to the same physical memory; the consumer holding the dma_buf just sees its content change underfoot. **It's userspace's responsibility to coordinate.**
|
||||
|
||||
## Three viable architectures (in increasing implementation cost)
|
||||
|
||||
### Option A: more buffers + LRU recycling (cheapest, statistical mitigation)
|
||||
|
||||
Allocate `max(surfaces_count, MIN_POOL)` CAPTURE buffers (e.g., MIN_POOL=24). Track per-buffer "last exported at" timestamp. When recycling for a new decode, prefer the buffer with oldest last-exported timestamp. Allocates +8 buffers ≈ +24 MB at 1080p NV12.
|
||||
|
||||
Race window narrows but doesn't close. mpv at 24 fps decode + compositor at 60 fps display + 2-3 frame compositor pool = consumer holds ~3 frames × 16ms = ~50ms. With 24-buffer pool and oldest-first recycling, the recycled buffer was last exported ~24 × 41ms = ~1 second ago. Race window: 50ms vs 1s margin. Statistically safe for typical playback.
|
||||
|
||||
Failure modes still possible: pause-then-resume (consumer holds frame longer); jank-induced compositor lag; multi-stream playback exhausting the pool. Acceptable for iteration 2's scope per `phase0_findings_iter2.md`.
|
||||
|
||||
### Option B: per-buffer dma_buf-refcount-aware recycling (correct, complex)
|
||||
|
||||
Track per CAPTURE buffer: V4L2 queue state + our own EXPBUF fd. On recycle attempt, close OUR fd, then check kernel-side dma_buf refcount via... no public API exists for this in V4L2. The kernel exposes nothing. Would need either:
|
||||
- Workaround: dup the fd into a non-blocking pipe and poll; consumer's dup count signals via... no.
|
||||
- Use V4L2_MEMORY_DMABUF + userspace allocation (gbm/etc.) — substantial rewrite, inverts the buffer ownership model.
|
||||
|
||||
Real fix but iteration 3+ scope.
|
||||
|
||||
### Option C: kernel-side fix + userspace coordination (deepest)
|
||||
|
||||
Patch v4l2-core to expose dma_buf refcount via VIDIOC_QUERYBUF or similar, OR enforce QBUF rejection (EBUSY) when dma_buf refcount > 0. Out of campaign scope (`feedback_no_upstream`).
|
||||
|
||||
## Iteration 2 picks Option A.
|
||||
|
||||
Rationale: smallest code change, addresses the operator-visible stutter for the canonical case (mpv vaapi --vo=gpu single-stream), defers the architectural redesign to iteration 3+ when the FFmpeg / GStreamer reference implementations can be studied for prior art.
|
||||
|
||||
## WSI pitch alignment (independent issue)
|
||||
|
||||
`src/surface.c::RequestExportSurfaceHandle` sets:
|
||||
```c
|
||||
surface_descriptor->layers[0].pitch[0] = surface_object->destination_bytesperlines[0];
|
||||
```
|
||||
Where `destination_bytesperlines[0]` comes from `v4l2_get_format` on the kernel side — for a 864×480 video, kernel reports `bytesperline=864`. Mesa's WSI rejects with "WSI pitch not properly aligned" because the GPU compositor wants pitch aligned to 64 bytes (or higher: 128, 256 — Mali and panfrost both have constraints).
|
||||
|
||||
Two sub-options:
|
||||
|
||||
**Sub-option WSI-1**: round up reported pitch to the nearest 64-byte boundary in our exported descriptor. The kernel-side V4L2 buffer is sized per the actual `bytesperline=864`; if we report `pitch=896` (round up to 64-aligned 896 = 14×64 = 896 vs 864 = 13.5×64), Mesa accepts but reads bytes 864..895 of each row as garbage (alignment padding the kernel doesn't actually allocate).
|
||||
|
||||
This produces visual artifacts at the right edge of frames if Mesa actually samples those bytes. Mesa typically clips to `width` so it's mostly OK. But still wrong technically.
|
||||
|
||||
**Sub-option WSI-2**: convince hantro to allocate buffers with a wider stride. Hantro's CAPTURE buffer stride is what the kernel computes based on the OUTPUT format we set. If we ask hantro to use a wider stride at S_FMT time, the buffer is wider, kernel reports the wider stride, our descriptor reports it, Mesa accepts, no garbage at right edge.
|
||||
|
||||
Looking at hantro driver code: `drivers/media/platform/verisilicon/hantro_v4l2.c::hantro_set_fmt_cap` sets the CAPTURE format from the OUTPUT format's resolution. The stride is computed by `hantro_pp_dec_check_format_compatibility` etc. There's no userspace knob to force a wider stride.
|
||||
|
||||
So sub-option WSI-1 (report wider pitch in descriptor than the kernel actually allocates) is the realistic fix. Pixel-edge artifacts are acceptable for a test corpus; production-grade fix needs kernel changes.
|
||||
|
||||
Implementation: in `ExportSurfaceHandle`, compute `aligned_pitch = (bytesperline + 63) & ~63` and use that for `pitch`. Adjust UV plane offset accordingly: `offsets[1] = aligned_pitch * format_height` instead of `bytesperline * format_height`. Wait, that's wrong too — the kernel's UV starts at `bytesperline * format_height`, not `aligned_pitch * format_height`. Reporting wrong offset would be worse.
|
||||
|
||||
Hmm. The right fix is: report the kernel's actual offsets and pitches AS-IS, but also add `DRM_FORMAT_MOD_INVALID` or `DRM_FORMAT_MOD_LINEAR_NONALIGN_*` modifier to tell Mesa "this isn't WSI-aligned, treat it as a non-display intermediate." Then mpv/Firefox would composite via texture upload rather than direct WSI, paying a perf cost but correct.
|
||||
|
||||
Actually `DRM_FORMAT_MOD_INVALID` means "modifier unknown, do whatever you'd do for implicit." `DRM_FORMAT_MOD_LINEAR` means "explicitly known to be linear." Currently we set `DRM_FORMAT_MOD_NONE` (= 0 = same value as MOD_LINEAR). If we set `DRM_FORMAT_MOD_INVALID`, Mesa might be more lenient.
|
||||
|
||||
Worth a one-line experiment.
|
||||
|
||||
For iteration 2: try `DRM_FORMAT_MOD_INVALID` first. If Mesa accepts, done. If not, pad-pitch-with-aliasing fallback.
|
||||
|
||||
## Multi-resolution kernel state corruption
|
||||
|
||||
After Firefox plays multiple videos at different resolutions (864→1920→...) in sequence, the next CAPTURE format query returns 48×48 (kernel default). Our `LAST_OUTPUT_WIDTH/HEIGHT` cache (commit `37c0e72`) doesn't catch this because the surface request matches the last set value, but the kernel state has been clobbered between sessions.
|
||||
|
||||
Hypothesis: when one playback ends, mpv/Firefox calls `vaDestroyContext` → our backend cleans up (REQBUFS(0), STREAMOFF) → the kernel's CAPTURE queue is reset to default 48×48. The next vaCreateContext reuses our cache that says "format already 1920×1088, no need to S_FMT" — but the kernel is back at 48×48.
|
||||
|
||||
Fix: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache when the OUTPUT queue is REQBUFS(0)'d. That guarantees the next CreateSurfaces2 will re-set the format.
|
||||
|
||||
Alternative: always check kernel-side actual format via `v4l2_get_format` after S_FMT and trust the kernel's value; if kernel doesn't have what we set, force re-set.
|
||||
|
||||
Implementation: invalidate cache in `RequestDestroyContext` / `RequestDestroySurfaces` / wherever cleanup happens. Find the right hook.
|
||||
|
||||
## Touch points for iteration 2 implementation
|
||||
|
||||
| File | Line(s) | Change |
|
||||
|---|---|---|
|
||||
| `src/surface.h` | new fields | Add `enum cap_buffer_state` + per-buffer state tracking struct |
|
||||
| `src/surface.c` | ~201 | Allocate `max(surfaces_count, MIN_CAP_POOL)` buffers (e.g., MIN_CAP_POOL=24) |
|
||||
| `src/surface.c` | ~280 | Don't 1:1 bind — instead, push all buffers to the FREE pool |
|
||||
| `src/surface.c` | ~640 (Export) | Mark slot as exported, save export timestamp |
|
||||
| `src/surface.c` | ~700 (Export) | Try `DRM_FORMAT_MOD_INVALID` instead of `DRM_FORMAT_MOD_NONE` for WSI compat |
|
||||
| `src/picture.c` | ~340 (BeginPicture) | Allocate slot from pool (LRU recycling), bind surface→slot, update destination_index |
|
||||
| `src/picture.c` | ~375 (EndPicture, the existing QBUF) | Already uses `surface_object->destination_index` — works once BeginPicture rebinds |
|
||||
| `src/surface.c` | LAST_OUTPUT_WIDTH cache | Invalidate on RequestDestroyContext or DestroySurfaces |
|
||||
|
||||
## Phase 4 plan input — to be locked
|
||||
|
||||
The pool refactor is the load-bearing fix. WSI alignment + multi-resolution recovery are independent and small (1-line experiments first, escalate if needed).
|
||||
|
||||
Suggested fix order:
|
||||
1. Multi-resolution cache invalidation (smallest, lowest risk)
|
||||
2. WSI modifier change to `DRM_FORMAT_MOD_INVALID` (1 line, test immediately)
|
||||
3. Decoupled buffer pool (the substantive iteration 2 work)
|
||||
4. Each fix tested independently before stacking the next.
|
||||
Reference in New Issue
Block a user