libva-multiplanar/phase4_iter2_plan.md

# Iteration 2 Phase 4 — plan

Three independent fixes, ordered cheapest-first. Each is tested before the next is stacked on top.

## Fix 1 — Multi-resolution: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on session teardown

**File**: `src/surface.c` + `src/context.c` (or wherever DestroyContext lives)
**Diff scale**: ~5 lines

**Problem**: Iteration 1 commit `37c0e72` added `LAST_OUTPUT_WIDTH/HEIGHT` static globals to skip a redundant S_FMT when dimensions are unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end), but our cache still says "format already 1920×1088, no re-set needed." The next session therefore inherits the kernel's 48×48 default state while our cache lies about it.

**Fix**: zero out `LAST_OUTPUT_WIDTH/HEIGHT` whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in `RequestDestroyContext`. Forces next CreateSurfaces2 to S_FMT.
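A minimal sketch of the invalidation, assuming the cache is a pair of file-scope statics. The helper names `output_format_cache_invalidate` and `output_format_cache_needs_s_fmt` are hypothetical and exist only for illustration; the real code in `src/surface.c` may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the LAST_OUTPUT_WIDTH/HEIGHT statics named in the plan. */
static unsigned int last_output_width;
static unsigned int last_output_height;

/* Hypothetical helper: forget the cached format so the next check forces S_FMT.
 * Intended call sites per the plan: the REQBUFS(0) reset path and
 * RequestDestroyContext. */
static void output_format_cache_invalidate(void)
{
    last_output_width = 0;
    last_output_height = 0;
}

/* Hypothetical helper: returns true when S_FMT must be issued, and records
 * the new dimensions in the cache. */
static bool output_format_cache_needs_s_fmt(unsigned int width,
                                            unsigned int height)
{
    assert(width > 0 && height > 0);

    if (width == last_output_width && height == last_output_height)
        return false;

    last_output_width = width;
    last_output_height = height;
    return true;
}
```

Because a zeroed cache never matches a real resolution, the first `CreateSurfaces2` after teardown always takes the S_FMT path, which is the behaviour the fix is after.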

**Risk**: low. Worst case extra S_FMT call during normal operation — harmless.

**Validation**: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger `fmt_width=48 fmt_height=48` regression. Operator-confirmable.

## Fix 2 — WSI alignment: try `DRM_FORMAT_MOD_INVALID` in surface descriptor

**File**: `src/surface.c::RequestExportSurfaceHandle` line 657
**Diff scale**: 1 line + 1 fallback case

**Problem**: Mesa's WSI rejects pitch=864 (32-aligned but not 64-aligned: 864 = 27 × 32 = 13.5 × 64) with "WSI pitch not properly aligned." Our `objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE` (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment", but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.

**Fix attempt 1** (cheapest): change `drm_modifier` to `DRM_FORMAT_MOD_INVALID` (`0x00ffffffffffffff` in `drm_fourcc.h`, not `~0ULL`). This tells Mesa "modifier unknown, treat as implicit / texture-import-only". The expectation is that Mesa will texture-upload rather than direct WSI-import, paying a small perf cost but staying correct.

**Fix attempt 2** (fallback): if `DRM_FORMAT_MOD_INVALID` causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:
- mpv vaapi-copy: use `DRM_FORMAT_MOD_LINEAR` (works as before)
- Firefox + WSI-aligned widths (64-aligned): `DRM_FORMAT_MOD_LINEAR`
- Non-aligned widths: `DRM_FORMAT_MOD_INVALID`

We can't easily detect the consumer at export time, so attempt 1 first universally.
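For reference, the width-based part of the fallback branching could look like the sketch below. It assumes the 64-byte pitch requirement stated above is the only criterion, which is an assumption taken from this plan rather than from Mesa's source. The `SKETCH_*` names and `pick_export_modifier` are hypothetical; the two modifier values match `drm_fourcc.h` and are redefined only so the sketch builds without libdrm headers.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as DRM_FORMAT_MOD_LINEAR and DRM_FORMAT_MOD_INVALID. */
#define SKETCH_MOD_LINEAR  UINT64_C(0)
#define SKETCH_MOD_INVALID UINT64_C(0x00ffffffffffffff)

/* Assumption from the plan: WSI wants 64-byte-aligned pitches. */
#define SKETCH_WSI_PITCH_ALIGN 64u

/* Hypothetical helper: advertise an explicit LINEAR modifier only when the
 * pitch meets the assumed WSI alignment, otherwise report "unknown". */
static uint64_t pick_export_modifier(unsigned int pitch)
{
    assert(pitch > 0);

    return (pitch % SKETCH_WSI_PITCH_ALIGN == 0) ? SKETCH_MOD_LINEAR
                                                 : SKETCH_MOD_INVALID;
}
```

In `RequestExportSurfaceHandle` the result would be assigned to `objects[i].drm_format_modifier`. Attempt 1 is the degenerate case where the helper always returns the INVALID value.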

**Validation**: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.

**Risk**: medium. `DRM_FORMAT_MOD_INVALID` is a documented sentinel, but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers that currently work. Mitigation: revert to `DRM_FORMAT_MOD_NONE` if a regression is observed.

## Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)

**File**: `src/surface.h`, `src/surface.c`, `src/picture.c`
**Diff scale**: ~150-200 lines (new code)

**Problem**: with 1:1 surface↔buffer binding, the driver re-QBUFs the same physical buffer while the consumer may still hold an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms), the race window opens during certain playback patterns and produces operator-visible stutter on mpv `--hwdec=vaapi --vo=gpu`.

**Architecture (Option A from Phase 2)**:
- Add `struct cap_pool_slot` per CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED.
- Allocate `max(surfaces_count, MIN_CAP_POOL)` CAPTURE buffers (MIN_CAP_POOL = 24, i.e. 1.5x the typical mpv 16-surface pool, leaving 8 spare slots).
- Decouple: surfaces no longer have permanent `destination_index`. Instead, each surface holds a transient pointer to its currently-bound slot.
- On `RequestBeginPicture(surface_X)`:
  - If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
  - Find LRU FREE slot (oldest "last_used_at" timestamp)
  - Bind surface_X → slot, set destination_index, mmap if needed
  - QBUF that slot
- On `RequestSyncSurface(surface_X)`: DQBUF as before; the slot stays IN_DECODE until it is exported. Over its lifetime a slot cycles FREE→IN_DECODE→EXPORTED→FREE.
- On `RequestExportSurfaceHandle(surface_X)`: VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp
- On `RequestDestroySurfaces`: free all slots, REQBUFS(0)
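The slot lifecycle in the list above can be written down as a small transition check. This is a sketch only: `cap_slot_transition_allowed` and `cap_slot_set_state` are hypothetical helpers, and the table covers just the transitions this plan names (a surface that is decoded but never exported is not addressed here).

```c
#include <assert.h>
#include <stdbool.h>

enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

/* Hypothetical helper: true if the architecture list describes a move
 * from `from` to `to`. */
static bool cap_slot_transition_allowed(enum cap_slot_state from,
                                        enum cap_slot_state to)
{
    switch (from) {
    case CAP_SLOT_FREE:
        return to == CAP_SLOT_IN_DECODE;  /* BeginPicture: bind + QBUF */
    case CAP_SLOT_IN_DECODE:
        return to == CAP_SLOT_EXPORTED    /* ExportSurfaceHandle after DQBUF */
            || to == CAP_SLOT_FREE;       /* DestroySurfaces frees every slot */
    case CAP_SLOT_EXPORTED:
        return to == CAP_SLOT_FREE;       /* rebind in BeginPicture, or destroy */
    }
    return false;
}

/* Hypothetical helper: change state, trapping unexpected transitions in
 * debug builds. */
static void cap_slot_set_state(enum cap_slot_state *state,
                               enum cap_slot_state to)
{
    assert(cap_slot_transition_allowed(*state, to));
    *state = to;
}
```

Routing every state change through one checked setter makes an accidental FREE→EXPORTED jump (exporting a slot that was never queued) fail loudly during bring-up instead of showing up later as a stale frame.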

**Implementation details**:
- `cap_pool_slot` struct in `surface.h`:
  ```c
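  #include <stdint.h>
  #include <linux/videodev2.h> /* VIDEO_MAX_PLANES */

  enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };
  struct cap_pool_slot {
      unsigned int v4l2_index;                    /* buffer index used for QBUF/DQBUF/EXPBUF */
      void *map[VIDEO_MAX_PLANES];                /* per-plane mmap address, NULL if unmapped */
      unsigned int map_lengths[VIDEO_MAX_PLANES]; /* per-plane mapping length in bytes */
      unsigned int map_offsets[VIDEO_MAX_PLANES]; /* per-plane mmap offset from QUERYBUF */
      enum cap_slot_state state;
      int our_export_fd;        /* -1 if not exported */
      uint64_t last_used_at_ns; /* for LRU recycling */
  };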
  ```
- Pool stored in `request_data` (driver-level), not in `surface_object`.
- LRU helper: scan slots and pick the FREE slot with the oldest `last_used_at_ns`. If no slot is FREE, force-recycle the oldest EXPORTED slot (close our fd, demote to FREE). The race window can still open in that pathological case, but it should be rare.
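A sketch of the LRU helper under the rules above. `struct lru_slot` is a trimmed, hypothetical stand-in for `cap_pool_slot` that keeps only the fields the scan needs, and `cap_pool_pick_lru` is an illustrative name, not existing driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

/* Hypothetical trimmed slot: only the fields the LRU scan touches. */
struct lru_slot {
    enum cap_slot_state state;
    int our_export_fd;        /* -1 if not exported */
    uint64_t last_used_at_ns;
};

/* Returns the index of the slot to bind next, or -1 if every slot is
 * IN_DECODE. Prefers the oldest FREE slot; otherwise force-recycles the
 * oldest EXPORTED slot by closing our fd and demoting it to FREE. */
static int cap_pool_pick_lru(struct lru_slot *slots, size_t count)
{
    int best_free = -1;
    int best_exported = -1;

    assert(slots != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        int *best = NULL;

        if (slots[i].state == CAP_SLOT_FREE)
            best = &best_free;
        else if (slots[i].state == CAP_SLOT_EXPORTED)
            best = &best_exported;
        else
            continue; /* IN_DECODE slots are owned by the kernel queue */

        if (*best < 0 ||
            slots[i].last_used_at_ns < slots[*best].last_used_at_ns)
            *best = (int)i;
    }

    if (best_free >= 0)
        return best_free;
    if (best_exported < 0)
        return -1;

    /* Force-recycle: the consumer may still hold its own dup of the fd. */
    if (slots[best_exported].our_export_fd >= 0) {
        close(slots[best_exported].our_export_fd);
        slots[best_exported].our_export_fd = -1;
    }
    slots[best_exported].state = CAP_SLOT_FREE;
    return best_exported;
}
```

A linear scan is adequate here because the pool holds a few dozen slots at most; a caller that gets -1 back has no safe buffer to queue and needs its own policy (wait or fail the picture), which this plan does not specify.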

**Validation**:
- mpv `--hwdec=vaapi --vo=gpu` plays bbb 1080p for ≥60s without operator-visible stutter
- Drop count < 100 in 14s (Phase 3 anchor)
- Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
- vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
- strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)

**Risk**: medium-high. Substantive code change. Mitigations:
- Keep pool size + recycling logic configurable
- Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
- Phase 5 sonnet review before committing Fix 3

## Order of attack

1. **Fix 1 (multi-resolution cache)** — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
2. **Fix 2 (WSI modifier)** — 1 line, test on 864-wide video. If regression, revert.
3. **Phase 5 review (sonnet)** before Fix 3 — get an outside read on the buffer-pool architecture before committing the substantive code.
4. **Fix 3 (buffer pool)** — implement, test, iterate.
5. **Phase 7 verification** — full 4-consumer matrix + multi-video session.
6. **Phase 8 close** — memory entry, iteration 3 input doc.

## Out-of-scope for iteration 2 (carried to iteration 3)

- VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
- num_ref_idx for multi-slice streams (Sonnet 7.2)
- HACK block in surface.c MPEG-2 case (Sonnet 7.4)
- Firefox seek-to-non-IDR (Sonnet 7.5)
- DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
- V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
- Performance metrics — iteration 3