Files
libva-multiplanar/phase4_iter2_plan.md
T
marfrit 9b1d7737cd Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00

105 lines
6.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Iteration 2 Phase 4 — plan
Three independent fixes, ordered cheapest-first. Each tested before stacking the next.
## Fix 1 — Multi-resolution: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on session teardown
**File**: `src/surface.c` + `src/context.c` (or wherever DestroyContext lives)
**Diff scale**: ~5 lines
**Problem**: Iteration 1 commit `37c0e72` added `LAST_OUTPUT_WIDTH/HEIGHT` static globals to skip re-S_FMT when dimensions unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end) but our cache still says "format already 1920×1088, no re-set needed." Next session inherits a 48×48 default kernel state and our cache lies.
**Fix**: zero out `LAST_OUTPUT_WIDTH/HEIGHT` whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in `RequestDestroyContext`. Forces next CreateSurfaces2 to S_FMT.
**Risk**: low. Worst case extra S_FMT call during normal operation — harmless.
**Validation**: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger `fmt_width=48 fmt_height=48` regression. Operator-confirmable.
## Fix 2 — WSI alignment: try `DRM_FORMAT_MOD_INVALID` in surface descriptor
**File**: `src/surface.c::RequestExportSurfaceHandle` line 657
**Diff scale**: 1 line + 1 fallback case
**Problem**: Mesa's WSI rejects pitch=864 (only 16-aligned) with "WSI pitch not properly aligned." Our `objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE` (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment" — but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.
**Fix attempt 1** (cheapest): change `drm_modifier` to `DRM_FORMAT_MOD_INVALID` (~0ULL). This tells Mesa "modifier unknown, treat as implicit / texture-import-only" — Mesa will texture-upload rather than direct WSI-import, paying a small perf cost but correct.
**Fix attempt 2** (fallback): if `DRM_FORMAT_MOD_INVALID` causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:
- mpv vaapi-copy: use `DRM_FORMAT_MOD_LINEAR` (works as before)
- Firefox + WSI-aligned widths (64-aligned): `DRM_FORMAT_MOD_LINEAR`
- Non-aligned widths: `DRM_FORMAT_MOD_INVALID`
We can't easily detect the consumer at export time, so attempt 1 first universally.
**Validation**: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.
**Risk**: medium. `DRM_FORMAT_MOD_INVALID` is a documented sentinel but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers we currently work for. Mitigation: revert to `DRM_FORMAT_MOD_NONE` if regression observed.
## Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)
**File**: `src/surface.h`, `src/surface.c`, `src/picture.c`
**Diff scale**: ~150-200 lines (new code)
**Problem**: 1:1 surface↔buffer binding causes V4L2 to re-QBUF the same physical buffer while consumer still holds an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms): race window opens during certain playback patterns and produces operator-visible stutter on mpv vaapi --vo=gpu.
**Architecture (Option A from Phase 2)**:
- Add `struct cap_pool_slot` per CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED.
- Allocate `max(surfaces_count, MIN_CAP_POOL)` CAPTURE buffers (MIN_CAP_POOL = 24 — gives ~3x headroom for typical mpv 16-surface pool).
- Decouple: surfaces no longer have permanent `destination_index`. Instead, each surface holds a transient pointer to its currently-bound slot.
- On `RequestBeginPicture(surface_X)`:
- If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
- Find LRU FREE slot (oldest "last_used_at" timestamp)
- Bind surface_X → slot, set destination_index, mmap if needed
- QBUF that slot
- On `RequestSyncSurface(surface_X)`: DQBUF as before, slot transitions FREE→IN_DECODE→EXPORTED→...
- On `RequestExportSurfaceHandle(surface_X)`: VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp
- On `RequestDestroySurfaces`: free all slots, REQBUFS(0)
**Implementation details**:
- `cap_pool_slot` struct in `surface.h`:
```c
enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };
struct cap_pool_slot {
unsigned int v4l2_index;
void *map[VIDEO_MAX_PLANES];
unsigned int map_lengths[VIDEO_MAX_PLANES];
unsigned int map_offsets[VIDEO_MAX_PLANES];
enum cap_slot_state state;
int our_export_fd; /* -1 if not exported */
uint64_t last_used_at_ns; /* for LRU recycling */
};
```
- Pool stored in `request_data` (driver-level), not in `surface_object`.
- LRU helper: scan slots, find FREE with oldest `last_used_at_ns`. If no FREE, force-recycle the oldest EXPORTED slot (close fd, demote to FREE) — race window may still open in pathological cases but rare.
**Validation**:
- mpv `--hwdec=vaapi --vo=gpu` plays bbb 1080p for ≥60s without operator-visible stutter
- Drop count < 100 in 14s (Phase 3 anchor)
- Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
- vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
- strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)
**Risk**: medium-high. Substantive code change. Mitigations:
- Keep pool size + recycling logic configurable
- Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
- Phase 5 sonnet review before committing Fix 3
## Order of attack
1. **Fix 1 (multi-resolution cache)** — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
2. **Fix 2 (WSI modifier)** — 1 line, test on 864-wide video. If regression, revert.
3. **Phase 5 review (sonnet)** before Fix 3 — get an outside read on the buffer-pool architecture before committing the substantive code.
4. **Fix 3 (buffer pool)** — implement, test, iterate.
5. **Phase 7 verification** — full 4-consumer matrix + multi-video session.
6. **Phase 8 close** — memory entry, iteration 3 input doc.
## Out-of-scope for iteration 2 (carried to iteration 3)
- VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
- num_ref_idx for multi-slice streams (Sonnet 7.2)
- HACK block in surface.c MPEG-2 case (Sonnet 7.4)
- Firefox seek-to-non-IDR (Sonnet 7.5)
- DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
- V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
- Performance metrics — iteration 3