Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan

Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00
parent 7e66abb72f
commit 9b1d7737cd
3 changed files with 330 additions and 0 deletions
+110
@@ -0,0 +1,110 @@
# Iteration 2 — Phase 0 (substrate / motivation / inventory)
Opens 2026-05-04 immediately after iteration 1 close (`phase8_iteration1_close.md`).
## Locked research question (iteration 2)
> **"Make libva-v4l2-request's DMA-BUF (`vaapi`-no-`-copy`) path render decoded H.264 video without visual artifacts on direct-render consumers (mpv `--vo=gpu`), and handle the multi-video / multi-resolution session shape that real-world consumers (Firefox loading typical web pages) generate, on PineTab2 RK3568 hantro G1/G2."**
**This iteration is hardening, not new feature work.** Iteration 1 made the decode + libva engagement correct. The remaining failure modes are all in the buffer-lifecycle, format-state, and pitch-alignment layers — all of which only manifested AFTER the decode landed real frames.
Pass/fail (boolean):
- mpv `--hwdec=vaapi --vo=gpu` plays `bbb_1080p30_h264.mp4` for ≥30s **without visible stutter** (operator inspection on real screen)
- Firefox 150 plays the Mozilla homepage's animated 864-wide intro videos **without logging `MESA: error: WSI pitch not properly aligned`** and falling back to SW
- A multi-video session (Firefox playing several videos at different resolutions in sequence) **does not corrupt kernel CAPTURE format state** (no `fmt_width=48 fmt_height=48` regression)
## Mechanism the question targets
Iteration 1 established the libva backend now correctly:
- engages hantro G1/G2 (decode produces real NV12 pixels)
- exports DMA-BUF descriptor in the SEPARATE_LAYERS shape Firefox needs
- re-sets OUTPUT format on resolution change for the mpv-probe pattern
But the remaining failure modes share a common underlying issue: **V4L2's per-buffer state is decoupled from the DMA-BUF refcount of the EXPBUF'd fd**. The kernel allows VIDIOC_QBUF to re-queue a buffer even when an external consumer holds an EXPBUF'd fd to the same physical memory. Userspace MUST NOT re-QBUF an in-use buffer, OR it must allocate enough buffers that re-use never happens within the consumer's hold window.
The current libva-v4l2-request code (post-iteration-1) binds CAPTURE buffer index 1:1 with the surface. When mpv recycles a surface (asks to decode frame N+M into a previously-used surface), our backend re-QBUFs the same physical buffer. If the consumer hasn't yet released its EXPBUF'd fd from frame N's render, the kernel writes frame N+M's content into the buffer **while the consumer is still reading frame N from it**.
Visible result on mpv `--vo=gpu`: frames overlap, the bunny appears to move backward and forward, "extreme stuttering and back and forth" per operator description.
Visible result on Firefox: not observed (Firefox's compositor pool buffers frames briefly and dups the fd, decoupling display timing from EXPBUF'd-fd lifetime), but in principle the same race exists.
## Predecessor close-out summary (iteration 1 → iteration 2)
### State that carries forward (re-verified in iteration 1, current)
- **Hardware**: ohm RK3568, kernel 6.19.10-danctnix1-1-pinetab2, hantro G1/G2 on `/dev/video1` + `/dev/media0`
- **Userspace**: libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` (sha256 dcf8a7170fbd49bb...), 1920×1080 H.264 24 fps
- **Build harness**: `meson setup --buildtype=release && ninja` directly on ohm; deploy to `/usr/lib/dri/v4l2_request_drv_video.so`
- **Live test rig**: operator's Plasma 6 Wayland session on tty7, env at `XDG_RUNTIME_DIR=/run/user/1001`, `WAYLAND_DISPLAY=wayland-0`, `DISPLAY=:0`, `XAUTHORITY=/run/user/1001/xauth_ilDMqm`, `DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus`. SSH-driven launches need `MOZ_ENABLE_WAYLAND=1` for Firefox; mpv works via either `--vo=gpu` (Wayland EGL) or `--vo=null` (headless probe).
- **Debug instrumentation in fork master** (commits 2517a12 / a047926 / 7da2b27 / 92f5b25 / 21ae311 / fdfee2d / 6be3f3b): ENTER logging on most major libva entry points + ExportSurfaceHandle descriptor dump + CAPTURE plane-0 hex/variance dump with msync cache-fix. Stays in this iteration to support debugging; clean sweep at iteration 2 close per Phase 5 review.
### Data that does NOT carry forward (re-acquire per `feedback_replicate_baseline_first.md`)
- Iteration 1 saw `64/300` drops in 12s for vaapi-copy, `29/300` for vaapi (DMA-BUF). Not anchors — re-measure with consistent rig + binding cells in iteration 2.
- "Firefox sustains 100+ slice_header parses" — true at end of iteration 1, but the 100-count was incidental (test ran 14s); not a binding cell.
- Visual stutter intensity for vaapi (DMA-BUF) — re-measure under iteration 2's binding-cell capture.
### Open questions inherited from Sonnet's Phase 5 review (iteration 1)
7.1 **VIDIOC_G_EXT_CTRLS EACCES on this rig** — try moving readback to before `MEDIA_REQUEST_IOC_QUEUE`. If still EACCES, file a kernel-side investigation note and remove the readback as deadweight.
7.2 **`num_ref_idx_l0/l1_default_active_minus1` from VASlice** is wrong for multi-slice streams with explicit per-slice override. Defer until a real test stream surfaces it.
7.3 **`SET_FORMAT_OF_OUTPUT_ONCE` removal completeness** — replaced with `LAST_OUTPUT_WIDTH/HEIGHT` tracking in iteration 1 commit `37c0e72`. Iteration 1 did NOT verify multi-context safety. Iteration 2 multi-resolution work covers this.
7.4 **`// HACK` block in surface.c** — set OUTPUT format from inside CreateSurfaces2 with hardcoded `V4L2_PIX_FMT_H264_SLICE`. Wrong for MPEG-2 surfaces. Defer until MPEG-2 testing exercises it.
7.5 **Firefox seek-to-non-IDR** — verify Firefox handles a stream where the first played frame isn't an IDR. Add to test corpus.
7.6 **fourier_attribution cell A mechanism** is chromium-internal V4L2 backend, not libva. Documented in iteration 1 close; no further action.
## Tooling and measurement-instrument inventory (live verification)
- `strace -f -e ioctl,openat,close` for libva-side V4L2 ioctl tracing — confirmed working iter1
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle — confirmed working iter1
- `sudo dmesg -w` for kernel-side warnings during decode — generates no output for our test cases (hantro doesn't WARN at the severity level we'd see), but useful for any kernel-side panic
- `sudo lsof /dev/video1` + `sudo ls -l /proc/$RDD_PID/fd/` for current libva-side fd ownership
- `mpv --frames=300 --vo=gpu` with stderr capture for V: timeline + drop count
- Firefox MOZ_LOG=PlatformDecoderModule:5 + grep for ProcessDecode + Broadcast support lines
- Operator visual inspection on real screen for stutter/artifact detection (load-bearing for "boolean correctness")
For this iteration specifically, we ALSO need:
- **Per-CAPTURE-buffer state tracking** in the libva backend itself: which surfaces have outstanding EXPBUF'd fds, which buffers are "owned" by consumer, which are free for re-QBUF
- **mpv vaapi (no -copy) frame consistency check**: a way to detect "frame N's content was overwritten by frame N+M before display." Could be: render to image-sequence (`--vo=image-sequence`), inspect dumped frames for content discontinuity
- **Strace QBUF re-queue pattern** for mpv: time between consecutive QBUFs of same buffer index. If pattern shows re-QBUF happening within ~33ms (one frame at 30fps display), the lifecycle race is structurally guaranteed to fire.
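A minimal sketch of the re-queue timing check, assuming the strace output has already been parsed into (timestamp, buffer index) pairs for each VIDIOC_QBUF on the CAPTURE queue. The `qbuf_event` struct, `MAX_BUFFER_INDEX`, and the function name are hypothetical and not part of the backend.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BUFFER_INDEX 64 /* assumption: larger than any pool size discussed here */

/* One parsed QBUF event: timestamp in microseconds plus V4L2 buffer index. */
struct qbuf_event {
	uint64_t t_us;
	unsigned int index;
};

/*
 * Return the shortest gap, in microseconds, between two consecutive QBUFs of
 * the same buffer index, or UINT64_MAX if no index was queued twice.
 * Events must be sorted by timestamp.
 */
uint64_t min_requeue_interval_us(const struct qbuf_event *ev, size_t n)
{
	uint64_t last_seen[MAX_BUFFER_INDEX] = { 0 };
	int seen[MAX_BUFFER_INDEX] = { 0 };
	uint64_t best = UINT64_MAX;

	for (size_t i = 0; i < n; i++) {
		unsigned int idx = ev[i].index;

		if (idx >= MAX_BUFFER_INDEX)
			continue; /* ignore indices outside the tracked range */
		if (seen[idx]) {
			uint64_t gap = ev[i].t_us - last_seen[idx];

			if (gap < best)
				best = gap;
		}
		seen[idx] = 1;
		last_seen[idx] = ev[i].t_us;
	}
	return best;
}
```

A result at or below roughly 33000 µs would correspond to the "within one frame" case described above; a result near 875000 µs would match a 16-buffer round-robin at the fixture's decode rate.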
## In-session baseline anchor (Phase 3 input — to capture FIRST in this iteration)
Per `feedback_dev_process.md` Phase 0 + `feedback_replicate_baseline_first.md`, before any iteration 2 implementation work, capture:
1. **Real-VO mpv `--hwdec=vaapi --vo=gpu` reference run** (operator inspection): describe stutter pattern, count visual artifacts per second, log mpv stderr for drop count. Already informally captured in iteration 1 close — formalize for binding cell here.
2. **Strace VIDIOC_QBUF/DQBUF re-queue timing** for the same run: how often does the same buffer index get QBUF'd back-to-back? What's the typical interval?
3. **lsof tracking**: how many EXPBUF'd fds does mpv have outstanding at the moment of stutter? If it's >1, lifecycle race confirmed.
## In-scope (LOCKED 2026-05-04 for iteration 2)
- Buffer-pool refactor in libva-v4l2-request `src/surface.c` and `src/picture.c`: decouple CAPTURE buffer pool from surface count. Track per-buffer free/exported/in-decode state.
- WSI alignment in `src/surface.c::ExportSurfaceHandle`: round up reported pitch to 64-byte alignment OR set explicit modifier so Mesa's WSI accepts.
- Multi-resolution kernel-state recovery: detect format-state mismatch on the kernel side (e.g., G_FMT returns 48×48 when we expect 1920×1088) and force REQBUFS(0)+S_FMT recovery.
- Iteration 2 Phase 5 second-model review (Sonnet) before any implementation lands.
- Iteration 2 Phase 7 verification across the same 4-consumer matrix as iteration 1, plus a multi-video Firefox session, plus a long-playing mpv vaapi (no -copy) run.
## Out-of-scope (LOCKED 2026-05-04 for iteration 2)
- New codecs (MPEG-2, VP8, VP9, AV1, HEVC) — iteration 1's H.264-only scope holds.
- New hardware (fresnel RK3399, ampere/boltzmann RK3588) — separate iteration after ohm path is stable.
- Performance metrics binding cells — explicitly deferred to iteration 3 (perf comparison: SW vs HW).
- Bootlin upstreaming — `feedback_no_upstream.md` holds; no PRs unless explicitly tasked.
- Cleanup of DEBUG instrumentation patches — Sonnet's Phase 5 review said "wait until iteration 1 + 2 work is complete." Iteration 2 keeps the debug for diagnosis; iteration 3 sweeps.
- Multi-process / multi-context libva use (e.g., Firefox running while mpv runs) — defer until basic single-context lifecycle is solid.
## Phase 1 success criterion (will lock when Phase 0 completes)
Pre-lock draft:
- mpv `--hwdec=vaapi --vo=gpu` plays bbb_1080p30 for ≥60s without operator-visible stutter or "back and forth" (binding cell: operator inspection)
- Drop count < 100 in 60s (rig-normalized; iteration 1 saw 29 drops in 12s = 145/min, which is above this bound, so meeting it requires a real reduction in drop rate)
- Firefox 150 sustained engagement on mozilla.org main page (multi-resolution video sequence)
- vainfo + mpv vaapi-copy + chromium-fourier 149 → no regression
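For comparing drop counts taken over different run lengths, a small normalization helper is enough. This is only a sketch; both function names are made up for illustration.

```c
/* Scale a drop count observed over window_s seconds to drops per minute. */
double drops_per_minute(unsigned int drops, double window_s)
{
	return (double)drops * 60.0 / window_s;
}

/*
 * Nonzero if the observed drop rate, rescaled to bound_window_s seconds,
 * stays strictly below the given bound.
 */
int within_drop_bound(unsigned int drops, double window_s,
		      unsigned int bound, double bound_window_s)
{
	return (double)drops * bound_window_s / window_s < (double)bound;
}
```

With the iteration 1 figure, `drops_per_minute(29, 12.0)` gives 145, and `within_drop_bound(29, 12.0, 100, 60.0)` is false.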
## What Phase 0 will deliver (regardless of detail)
1. **In-session baseline capture** of the current stutter behavior (the Phase 3 anchor for iteration 2). Specifically: visual confirmation, drop rate, strace-side QBUF re-queue timing.
2. **Code-side situation analysis**: read libva-v4l2-request's surface/buffer lifecycle and identify the specific touch points for the buffer-pool refactor. Map to FFmpeg hwcontext_v4l2request and GStreamer v4l2codecs reference patterns.
3. **WSI alignment investigation**: confirm Mesa's actual alignment requirement (try 64, 128, 256), confirm hantro's actual buffer alignment (read kernel hantro_drv.c). Find the right reported pitch.
4. **Iteration 2 worklist with binding cells**.
+116
@@ -0,0 +1,116 @@
# Iteration 2 Phase 2 — situation analysis (DMA-BUF lifecycle, WSI alignment, multi-resolution)
## CAPTURE buffer lifecycle in libva-v4l2-request (current state)
Three locations that touch the CAPTURE buffer index ↔ surface binding:
1. **`src/surface.c::RequestCreateSurfaces2` line 201** — allocates `surfaces_count` CAPTURE buffers via `v4l2_create_buffers(...)`. Each gets a V4L2 buffer index `index_base..index_base+surfaces_count-1`.
2. **`src/surface.c::RequestCreateSurfaces2` line 280** — binds `surface_object->destination_index = index` for each surface. **Permanent 1:1 binding** for the surface's lifetime.
3. **`src/picture.c::RequestEndPicture` line 375** — re-QBUFs `surface_object->destination_index` for every decode cycle on that surface.
The bug:
- mpv allocates ~16 surfaces. Each gets a permanently bound CAPTURE buffer index 0..15.
- mpv decodes frame N into surface 0 → V4L2 QBUFs buffer 0 → kernel writes frame N into buffer 0 → DQBUF → mpv calls vaExportSurfaceHandle → gets fd to buffer 0 → renders frame N (compositor holds fd briefly).
- A few frames later (say frame N+5 in display order, but earlier in decode order due to B-frames), mpv decodes new content into surface 0 again → our backend re-QBUFs buffer 0 → **kernel writes frame N+5 into the same physical memory the compositor is still rendering frame N from**.
- Visible result on mpv `--vo=gpu`: stutter, "back and forth," frame content tearing/swapping. Operator-confirmed 2026-05-04.
- Not visible on mpv `--vo=null` because no rendering happens.
- Not visible on mpv `--hwdec=vaapi-copy` because libva-side `vaMapBuffer` does an explicit CPU copy out of the V4L2 buffer before returning to mpv, decoupling mpv's display pipeline from the V4L2 buffer lifecycle.
## Why the kernel doesn't enforce the constraint
`vb2_core_qbuf` in `videobuf2-core.c` (reached from the `VIDIOC_QBUF` handler via `vb2_qbuf`) permits requeue of a V4L2 buffer regardless of the dma_buf refcount on its EXPBUF'd fd. The two pieces of state are decoupled:
- V4L2 queue state: dequeued ↔ queued (kernel writable)
- dma_buf refcount: alive while ANY fd references it
When userspace QBUFs a buffer that has external dma_buf refs, the kernel writes to the same physical memory; the consumer holding the dma_buf just sees its content change underfoot. **It's userspace's responsibility to coordinate.**
## Three viable architectures (in increasing implementation cost)
### Option A: more buffers + LRU recycling (cheapest, statistical mitigation)
Allocate `max(surfaces_count, MIN_POOL)` CAPTURE buffers (e.g., MIN_POOL=24). Track per-buffer "last exported at" timestamp. When recycling for a new decode, prefer the buffer with oldest last-exported timestamp. Allocates +8 buffers ≈ +24 MB at 1080p NV12.
Race window narrows but doesn't close. mpv at 24 fps decode + compositor at 60 fps display + 2-3 frame compositor pool = consumer holds ~3 frames × 16ms = ~50ms. With 24-buffer pool and oldest-first recycling, the recycled buffer was last exported ~24 × 41ms = ~1 second ago. Race window: 50ms vs 1s margin. Statistically safe for typical playback.
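The numbers in the two paragraphs above can be reproduced with a few helpers. This is a back-of-envelope sketch; it assumes stride equals width and the 1088-row aligned height used elsewhere in these notes, and the function names are illustrative only.

```c
#include <stdint.h>

/* Bytes in one NV12 frame: full-size luma plane plus half-size interleaved chroma. */
uint64_t nv12_frame_bytes(uint32_t stride, uint32_t height)
{
	return (uint64_t)stride * height * 3 / 2;
}

/* Time between two uses of the same slot when slots are recycled oldest-first. */
double recycle_interval_ms(unsigned int pool_size, double decode_fps)
{
	return pool_size * 1000.0 / decode_fps;
}

/* How long a consumer keeps a frame if it holds it for `frames` display intervals. */
double consumer_hold_ms(unsigned int frames, double display_hz)
{
	return frames * 1000.0 / display_hz;
}
```

`nv12_frame_bytes(1920, 1088)` is 3133440 bytes, so 8 extra buffers cost about 25 MB (roughly 24 MiB). `recycle_interval_ms(24, 24.0)` is 1000 ms against `consumer_hold_ms(3, 60.0)` of 50 ms.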
Failure modes still possible: pause-then-resume (consumer holds frame longer); jank-induced compositor lag; multi-stream playback exhausting the pool. Acceptable for iteration 2's scope per `phase0_findings_iter2.md`.
### Option B: per-buffer dma_buf-refcount-aware recycling (correct, complex)
Track per CAPTURE buffer: V4L2 queue state + our own EXPBUF fd. On recycle attempt, close OUR fd, then check whether any consumer still holds a reference to the dma_buf. V4L2 has no public API for that check; the kernel exposes nothing. Would need either:
- A userspace workaround (e.g., dup the fd and poll it for a release signal): no mechanism found that reports the consumer's reference count.
- Use V4L2_MEMORY_DMABUF + userspace allocation (gbm/etc.) — substantial rewrite, inverts the buffer ownership model.
Real fix but iteration 3+ scope.
### Option C: kernel-side fix + userspace coordination (deepest)
Patch v4l2-core to expose dma_buf refcount via VIDIOC_QUERYBUF or similar, OR enforce QBUF rejection (EBUSY) when dma_buf refcount > 0. Out of campaign scope (`feedback_no_upstream`).
## Iteration 2 picks Option A.
Rationale: smallest code change, addresses the operator-visible stutter for the canonical case (mpv vaapi --vo=gpu single-stream), defers the architectural redesign to iteration 3+ when the FFmpeg / GStreamer reference implementations can be studied for prior art.
## WSI pitch alignment (independent issue)
`src/surface.c::RequestExportSurfaceHandle` sets:
```c
surface_descriptor->layers[0].pitch[0] = surface_object->destination_bytesperlines[0];
```
Where `destination_bytesperlines[0]` comes from `v4l2_get_format` on the kernel side; for an 864×480 video, the kernel reports `bytesperline=864`. Mesa's WSI rejects with "WSI pitch not properly aligned" because the GPU compositor wants pitch aligned to 64 bytes (or higher: 128, 256; the exact requirement on this Mali/panfrost stack is not yet confirmed).
Two sub-options:
**Sub-option WSI-1**: round up reported pitch to the nearest 64-byte boundary in our exported descriptor. The kernel-side V4L2 buffer is sized per the actual `bytesperline=864` (13.5×64); if we report `pitch=896` (14×64, the next 64-byte multiple), Mesa accepts but reads bytes 864..895 of each row as garbage (alignment padding the kernel doesn't actually allocate).
This produces visual artifacts at the right edge of frames if Mesa actually samples those bytes. Mesa typically clips to `width` so it's mostly OK. But still wrong technically: the kernel's rows stay 864 bytes apart, so a reader stepping 896 bytes per row also starts each successive row 32 bytes further off.
**Sub-option WSI-2**: convince hantro to allocate buffers with a wider stride. Hantro's CAPTURE buffer stride is what the kernel computes based on the OUTPUT format we set. If we ask hantro to use a wider stride at S_FMT time, the buffer is wider, kernel reports the wider stride, our descriptor reports it, Mesa accepts, no garbage at right edge.
Looking at hantro driver code: `drivers/media/platform/verisilicon/hantro_v4l2.c::hantro_set_fmt_cap` sets the CAPTURE format from the OUTPUT format's resolution. The stride is computed by `hantro_pp_dec_check_format_compatibility` etc. There's no userspace knob to force a wider stride.
So sub-option WSI-1 (report wider pitch in descriptor than the kernel actually allocates) is the realistic fix. Pixel-edge artifacts are acceptable for a test corpus; production-grade fix needs kernel changes.
Implementation: in `ExportSurfaceHandle`, compute `aligned_pitch = (bytesperline + 63) & ~63` and use that for `pitch`. Adjust UV plane offset accordingly: `offsets[1] = aligned_pitch * format_height` instead of `bytesperline * format_height`. Wait, that's wrong too — the kernel's UV starts at `bytesperline * format_height`, not `aligned_pitch * format_height`. Reporting wrong offset would be worse.
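To make the mismatch concrete, here is a small sketch of the pitch and offset arithmetic for a linear NV12 buffer. The helper names are hypothetical; the 864×480 values are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of align; align must be a power of two. */
uint32_t align_up(uint32_t value, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value + align - 1) & ~(align - 1);
}

/* Byte offset of row `row` in a linear plane that starts at `base`. */
uint64_t row_offset(uint64_t base, uint32_t pitch, uint32_t row)
{
	return base + (uint64_t)pitch * row;
}

/* Byte offset where the NV12 chroma plane starts when luma is pitch x height. */
uint64_t nv12_uv_offset(uint32_t pitch, uint32_t height)
{
	return (uint64_t)pitch * height;
}
```

`align_up(864, 64)` is 896. The kernel places UV at `nv12_uv_offset(864, 480)` = 414720, while the padded pitch would imply 430080, which is why the padded offset cannot be reported. Row 10 alone already differs by 320 bytes between the two pitches.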
Hmm. The right fix is: report the kernel's actual offsets and pitches AS-IS, but also set `DRM_FORMAT_MOD_INVALID` (or a dedicated "linear but not WSI-aligned" modifier if one existed; none is defined in `drm_fourcc.h` as far as we know) to tell Mesa "this isn't WSI-aligned, treat it as a non-display intermediate." Then mpv/Firefox would composite via texture upload rather than direct WSI, paying a perf cost but correct.
Actually `DRM_FORMAT_MOD_INVALID` means "modifier unknown, do whatever you'd do for implicit." `DRM_FORMAT_MOD_LINEAR` means "explicitly known to be linear." Currently we set `DRM_FORMAT_MOD_NONE` (= 0 = same value as MOD_LINEAR). If we set `DRM_FORMAT_MOD_INVALID`, Mesa might be more lenient.
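For reference, the modifier values involved can be written out as below. These macros are local mirrors of the `drm_fourcc.h` definitions, renamed with a `SKETCH_` prefix so they cannot clash with the real header; the two predicate functions are illustrative only.

```c
#include <stdint.h>

/* Local mirrors of the drm_fourcc.h encoding: vendor in the top 8 bits, value in the low 56. */
#define SKETCH_MOD_VENDOR_NONE 0ULL
#define SKETCH_MOD_RESERVED    ((1ULL << 56) - 1)
#define SKETCH_MOD_CODE(vendor, val) \
	(((uint64_t)(vendor) << 56) | ((val) & 0x00ffffffffffffffULL))

/* Explicitly linear layout; numerically 0, the same value as DRM_FORMAT_MOD_NONE. */
#define SKETCH_MOD_LINEAR  SKETCH_MOD_CODE(SKETCH_MOD_VENDOR_NONE, 0)
/* Sentinel meaning "modifier unknown / implicit". */
#define SKETCH_MOD_INVALID SKETCH_MOD_CODE(SKETCH_MOD_VENDOR_NONE, SKETCH_MOD_RESERVED)

/* Nonzero if the descriptor promises an explicit linear layout. */
int modifier_is_explicit_linear(uint64_t modifier)
{
	return modifier == SKETCH_MOD_LINEAR;
}

/* Nonzero if the descriptor leaves the layout to the importer's implicit rules. */
int modifier_is_unknown(uint64_t modifier)
{
	return modifier == SKETCH_MOD_INVALID;
}
```

Note the invalid sentinel is `0x00ffffffffffffff`, with the vendor byte left at zero, rather than all bits set.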
Worth a one-line experiment.
For iteration 2: try `DRM_FORMAT_MOD_INVALID` first. If Mesa accepts, done. If not, pad-pitch-with-aliasing fallback.
## Multi-resolution kernel state corruption
After Firefox plays multiple videos at different resolutions (864→1920→...) in sequence, the next CAPTURE format query returns 48×48 (kernel default). Our `LAST_OUTPUT_WIDTH/HEIGHT` cache (commit `37c0e72`) doesn't catch this because the surface request matches the last set value, but the kernel state has been clobbered between sessions.
Hypothesis: when one playback ends, mpv/Firefox calls `vaDestroyContext` → our backend cleans up (REQBUFS(0), STREAMOFF) → the kernel's CAPTURE queue is reset to default 48×48. The next vaCreateContext reuses our cache that says "format already 1920×1088, no need to S_FMT" — but the kernel is back at 48×48.
Fix: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache when the OUTPUT queue is REQBUFS(0)'d. That guarantees the next CreateSurfaces2 will re-set the format.
Alternative: always check kernel-side actual format via `v4l2_get_format` after S_FMT and trust the kernel's value; if kernel doesn't have what we set, force re-set.
Implementation: invalidate cache in `RequestDestroyContext` / `RequestDestroySurfaces` / wherever cleanup happens. Find the right hook.
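A sketch of the cache behaviour described above, using a hypothetical `output_format_cache` struct as a stand-in for the `LAST_OUTPUT_WIDTH/HEIGHT` globals (the real code uses statics, and where the invalidate call goes is still to be found). Zero is assumed to mean "kernel state unknown".

```c
#include <stdbool.h>

/* Stand-in for the LAST_OUTPUT_WIDTH/HEIGHT globals; 0x0 means "unknown". */
struct output_format_cache {
	unsigned int width;
	unsigned int height;
};

/* True if S_FMT must be issued because the cache does not match the request. */
bool output_format_needs_set(const struct output_format_cache *cache,
			     unsigned int width, unsigned int height)
{
	return cache->width != width || cache->height != height;
}

/* Record the dimensions after a successful S_FMT. */
void output_format_cache_store(struct output_format_cache *cache,
			       unsigned int width, unsigned int height)
{
	cache->width = width;
	cache->height = height;
}

/* Call wherever the OUTPUT queue is torn down, e.g. after REQBUFS(0). */
void output_format_cache_invalidate(struct output_format_cache *cache)
{
	cache->width = 0;
	cache->height = 0;
}
```

After an invalidate, a request for the same 1920×1088 dimensions reports that S_FMT is needed again, which is the behaviour the fix is after.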
## Touch points for iteration 2 implementation
| File | Line(s) | Change |
|---|---|---|
| `src/surface.h` | new fields | Add `enum cap_buffer_state` + per-buffer state tracking struct |
| `src/surface.c` | ~201 | Allocate `max(surfaces_count, MIN_CAP_POOL)` buffers (e.g., MIN_CAP_POOL=24) |
| `src/surface.c` | ~280 | Don't 1:1 bind — instead, push all buffers to the FREE pool |
| `src/surface.c` | ~640 (Export) | Mark slot as exported, save export timestamp |
| `src/surface.c` | ~700 (Export) | Try `DRM_FORMAT_MOD_INVALID` instead of `DRM_FORMAT_MOD_NONE` for WSI compat |
| `src/picture.c` | ~340 (BeginPicture) | Allocate slot from pool (LRU recycling), bind surface→slot, update destination_index |
| `src/picture.c` | ~375 (EndPicture, the existing QBUF) | Already uses `surface_object->destination_index` — works once BeginPicture rebinds |
| `src/surface.c` | LAST_OUTPUT_WIDTH cache | Invalidate on RequestDestroyContext or DestroySurfaces |
## Phase 4 plan input — to be locked
The pool refactor is the load-bearing fix. WSI alignment + multi-resolution recovery are independent and small (1-line experiments first, escalate if needed).
Suggested fix order:
1. Multi-resolution cache invalidation (smallest, lowest risk)
2. WSI modifier change to `DRM_FORMAT_MOD_INVALID` (1 line, test immediately)
3. Decoupled buffer pool (the substantive iteration 2 work)
4. Each fix tested independently before stacking the next.
+104
@@ -0,0 +1,104 @@
# Iteration 2 Phase 4 — plan
Three independent fixes, ordered cheapest-first. Each tested before stacking the next.
## Fix 1 — Multi-resolution: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on session teardown
**File**: `src/surface.c` + `src/context.c` (or wherever DestroyContext lives)
**Diff scale**: ~5 lines
**Problem**: Iteration 1 commit `37c0e72` added `LAST_OUTPUT_WIDTH/HEIGHT` static globals to skip re-S_FMT when dimensions unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end) but our cache still says "format already 1920×1088, no re-set needed." Next session inherits a 48×48 default kernel state and our cache lies.
**Fix**: zero out `LAST_OUTPUT_WIDTH/HEIGHT` whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in `RequestDestroyContext`. Forces next CreateSurfaces2 to S_FMT.
**Risk**: low. Worst case extra S_FMT call during normal operation — harmless.
**Validation**: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger `fmt_width=48 fmt_height=48` regression. Operator-confirmable.
## Fix 2 — WSI alignment: try `DRM_FORMAT_MOD_INVALID` in surface descriptor
**File**: `src/surface.c::RequestExportSurfaceHandle` line 657
**Diff scale**: 1 line + 1 fallback case
**Problem**: Mesa's WSI rejects pitch=864 (32-aligned but not 64-aligned) with "WSI pitch not properly aligned." Our `objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE` (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment", but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.
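**Fix attempt 1** (cheapest): change `drm_modifier` to `DRM_FORMAT_MOD_INVALID` (`0x00ffffffffffffff` per `drm_fourcc.h`, not `~0ULL`). The intent is to tell Mesa "modifier unknown, treat as implicit / texture-import-only", so that Mesa texture-uploads rather than direct WSI-imports, paying a small perf cost but staying correct. Whether Mesa actually behaves this way is what the experiment tests.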
**Fix attempt 2** (fallback): if `DRM_FORMAT_MOD_INVALID` causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:
- mpv vaapi-copy: use `DRM_FORMAT_MOD_LINEAR` (works as before)
- Firefox + WSI-aligned widths (64-aligned): `DRM_FORMAT_MOD_LINEAR`
- Non-aligned widths: `DRM_FORMAT_MOD_INVALID`
We can't easily detect the consumer at export time, so attempt 1 first universally.
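Of the three branches in attempt 2, only the width/pitch one is decidable at export time. A minimal sketch of that part, assuming a 64-byte WSI requirement (still unconfirmed) and using literal modifier values in place of the `drm_fourcc.h` names; `choose_export_modifier` is a hypothetical helper.

```c
#include <stdint.h>

#define SKETCH_MOD_LINEAR   0x0ULL                /* value of DRM_FORMAT_MOD_LINEAR */
#define SKETCH_MOD_INVALID  0x00ffffffffffffffULL /* value of DRM_FORMAT_MOD_INVALID */
#define SKETCH_WSI_PITCH_ALIGN 64u                /* assumption: Mesa's required pitch alignment */

/* Pick the modifier to report from the kernel's bytesperline alone. */
uint64_t choose_export_modifier(uint32_t pitch_bytes)
{
	if (pitch_bytes % SKETCH_WSI_PITCH_ALIGN == 0)
		return SKETCH_MOD_LINEAR;
	return SKETCH_MOD_INVALID;
}
```

For bbb at 1920 bytes per line this keeps the current linear modifier; for the 864-wide videos it switches to the invalid sentinel.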
**Validation**: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.
**Risk**: medium. `DRM_FORMAT_MOD_INVALID` is a documented sentinel but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers we currently work for. Mitigation: revert to `DRM_FORMAT_MOD_NONE` if regression observed.
## Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)
**File**: `src/surface.h`, `src/surface.c`, `src/picture.c`
**Diff scale**: ~150-200 lines (new code)
**Problem**: 1:1 surface↔buffer binding causes V4L2 to re-QBUF the same physical buffer while consumer still holds an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms): race window opens during certain playback patterns and produces operator-visible stutter on mpv vaapi --vo=gpu.
**Architecture (Option A from Phase 2)**:
- Add `struct cap_pool_slot` per CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED.
- Allocate `max(surfaces_count, MIN_CAP_POOL)` CAPTURE buffers (MIN_CAP_POOL = 24, which is 1.5x mpv's typical 16-surface pool and roughly 3x the 9 CAPTURE indices seen in use in the Phase 3 baseline).
- Decouple: surfaces no longer have permanent `destination_index`. Instead, each surface holds a transient pointer to its currently-bound slot.
- On `RequestBeginPicture(surface_X)`:
  - If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
  - Find LRU FREE slot (oldest `last_used_at_ns` timestamp)
  - Bind surface_X → slot, set destination_index, mmap if needed
  - QBUF that slot
- On `RequestSyncSurface(surface_X)`: DQBUF as before; the slot stays IN_DECODE until it is exported (full cycle: FREE → IN_DECODE → EXPORTED → FREE)
- On `RequestExportSurfaceHandle(surface_X)`: VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp
- On `RequestDestroySurfaces`: free all slots, REQBUFS(0)
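The slot lifecycle in the list above can be sketched as a small transition function. The event enum and `cap_slot_transition` are hypothetical names; allowing a repeated export and a release straight from IN_DECODE (a surface that was never exported, as with vaapi-copy) are assumptions, not decisions made in this plan.

```c
#include <stdbool.h>

enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

/* Hypothetical event names, one per backend action that moves a slot. */
enum cap_slot_event {
	CAP_EV_BIND,    /* BeginPicture picked this slot for a decode */
	CAP_EV_EXPORT,  /* ExportSurfaceHandle ran VIDIOC_EXPBUF on it */
	CAP_EV_RELEASE, /* our fd was closed or the surface was destroyed */
};

/* Apply one event; returns false and leaves the state alone if it is not allowed. */
bool cap_slot_transition(enum cap_slot_state *state, enum cap_slot_event ev)
{
	switch (ev) {
	case CAP_EV_BIND:
		/* Only a FREE slot may be queued; this is the re-QBUF guard. */
		if (*state != CAP_SLOT_FREE)
			return false;
		*state = CAP_SLOT_IN_DECODE;
		return true;
	case CAP_EV_EXPORT:
		if (*state == CAP_SLOT_FREE)
			return false;
		*state = CAP_SLOT_EXPORTED;
		return true;
	case CAP_EV_RELEASE:
		*state = CAP_SLOT_FREE;
		return true;
	}
	return false;
}
```

The rejected `CAP_EV_BIND` on an EXPORTED slot is the case the current 1:1 binding cannot express.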
**Implementation details**:
- `cap_pool_slot` struct in `surface.h`:
```c
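#include <stdint.h>
#include <linux/videodev2.h> /* VIDEO_MAX_PLANES */

enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

struct cap_pool_slot {
	unsigned int v4l2_index;                    /* CAPTURE buffer index on the V4L2 queue */
	void *map[VIDEO_MAX_PLANES];                /* per-plane mmap address */
	unsigned int map_lengths[VIDEO_MAX_PLANES]; /* per-plane mmap length */
	unsigned int map_offsets[VIDEO_MAX_PLANES]; /* per-plane mmap offset */
	enum cap_slot_state state;
	int our_export_fd;                          /* -1 if not exported */
	uint64_t last_used_at_ns;                   /* for LRU recycling */
};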
```
- Pool stored in `request_data` (driver-level), not in `surface_object`.
- LRU helper: scan slots, find FREE with oldest `last_used_at_ns`. If no FREE, force-recycle the oldest EXPORTED slot (close fd, demote to FREE) — race window may still open in pathological cases but rare.
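A minimal sketch of the LRU helper as described in the last bullet. `struct lru_slot` is a trimmed, hypothetical version of `cap_pool_slot` carrying only the fields the scan needs, and `cap_pool_pick_lru` is an illustrative name; closing the fd and demoting a force-recycled slot is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

/* Trimmed stand-in for cap_pool_slot: only what the picker reads. */
struct lru_slot {
	enum cap_slot_state state;
	uint64_t last_used_at_ns;
};

/*
 * Return the index of the slot to reuse: the oldest FREE slot if any,
 * otherwise the oldest EXPORTED slot (with *forced set to 1), or -1 when
 * every slot is IN_DECODE.
 */
int cap_pool_pick_lru(const struct lru_slot *slots, size_t count, int *forced)
{
	int best_free = -1;
	int best_exported = -1;

	for (size_t i = 0; i < count; i++) {
		if (slots[i].state == CAP_SLOT_FREE) {
			if (best_free < 0 ||
			    slots[i].last_used_at_ns < slots[best_free].last_used_at_ns)
				best_free = (int)i;
		} else if (slots[i].state == CAP_SLOT_EXPORTED) {
			if (best_exported < 0 ||
			    slots[i].last_used_at_ns < slots[best_exported].last_used_at_ns)
				best_exported = (int)i;
		}
	}

	*forced = (best_free < 0 && best_exported >= 0);
	return best_free >= 0 ? best_free : best_exported;
}
```

A linear scan is assumed to be fine here since the pool is a couple of dozen entries. Logging every `forced` pick would give a direct count of how often the residual race window is entered.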
**Validation**:
- mpv `--hwdec=vaapi --vo=gpu` plays bbb 1080p for ≥60s without operator-visible stutter
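- Drop count < 100 in 14s (the Phase 3 anchor measured 91 drops in 14s, so this bound only catches regressions; improvement is judged by the stutter check above)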
- Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
- vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
- strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)
**Risk**: medium-high. Substantive code change. Mitigations:
- Keep pool size + recycling logic configurable
- Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
- Phase 5 sonnet review before committing Fix 3
## Order of attack
1. **Fix 1 (multi-resolution cache)** — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
2. **Fix 2 (WSI modifier)** — 1 line, test on 864-wide video. If regression, revert.
3. **Phase 5 review (sonnet)** before Fix 3 — get an outside read on the buffer-pool architecture before committing the substantive code.
4. **Fix 3 (buffer pool)** — implement, test, iterate.
5. **Phase 7 verification** — full 4-consumer matrix + multi-video session.
6. **Phase 8 close** — memory entry, iteration 3 input doc.
## Out-of-scope for iteration 2 (carried to iteration 3)
- VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
- num_ref_idx for multi-slice streams (Sonnet 7.2)
- HACK block in surface.c MPEG-2 case (Sonnet 7.4)
- Firefox seek-to-non-IDR (Sonnet 7.5)
- DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
- V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
- Performance metrics — iteration 3