9b1d7737cd
Iteration 2 opens. Hardening, not new feature work.
Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.
Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.
Three architecture options analyzed:
A. More buffers + LRU recycling (cheapest, statistical mitigation)
B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
C. Kernel patch to enforce QBUF rejection (out of campaign scope)
Picking Option A for iteration 2.
Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.
Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
Fix 3: decoupled CAPTURE buffer pool with LRU recycling
(~150-200 lines, the load-bearing fix).
Phase 5 sonnet review BEFORE Fix 3 implementation.
Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
105 lines
6.7 KiB
Markdown
105 lines
6.7 KiB
Markdown
# Iteration 2 Phase 4 — plan
|
||
|
||
Three independent fixes, ordered cheapest-first. Each tested before stacking the next.
|
||
|
||
## Fix 1 — Multi-resolution: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on session teardown
|
||
|
||
**File**: `src/surface.c` + `src/context.c` (or wherever DestroyContext lives)
|
||
**Diff scale**: ~5 lines
|
||
|
||
**Problem**: Iteration 1 commit `37c0e72` added `LAST_OUTPUT_WIDTH/HEIGHT` static globals to skip re-S_FMT when dimensions unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end) but our cache still says "format already 1920×1088, no re-set needed." Next session inherits a 48×48 default kernel state and our cache lies.
|
||
|
||
**Fix**: zero out `LAST_OUTPUT_WIDTH/HEIGHT` whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in `RequestDestroyContext`. Forces next CreateSurfaces2 to S_FMT.
|
||
|
||
**Risk**: low. Worst case extra S_FMT call during normal operation — harmless.
|
||
|
||
**Validation**: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger `fmt_width=48 fmt_height=48` regression. Operator-confirmable.
|
||
|
||
## Fix 2 — WSI alignment: try `DRM_FORMAT_MOD_INVALID` in surface descriptor
|
||
|
||
**File**: `src/surface.c::RequestExportSurfaceHandle` line 657
|
||
**Diff scale**: 1 line + 1 fallback case
|
||
|
||
**Problem**: Mesa's WSI rejects pitch=864 (only 16-aligned) with "WSI pitch not properly aligned." Our `objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE` (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment" — but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.
|
||
|
||
**Fix attempt 1** (cheapest): change `drm_modifier` to `DRM_FORMAT_MOD_INVALID` (~0ULL). This tells Mesa "modifier unknown, treat as implicit / texture-import-only" — Mesa will texture-upload rather than direct WSI-import, paying a small perf cost but correct.
|
||
|
||
**Fix attempt 2** (fallback): if `DRM_FORMAT_MOD_INVALID` causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:
|
||
- mpv vaapi-copy: use `DRM_FORMAT_MOD_LINEAR` (works as before)
|
||
- Firefox + WSI-aligned widths (64-aligned): `DRM_FORMAT_MOD_LINEAR`
|
||
- Non-aligned widths: `DRM_FORMAT_MOD_INVALID`
|
||
|
||
We can't easily detect the consumer at export time, so attempt 1 first universally.
|
||
|
||
**Validation**: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.
|
||
|
||
**Risk**: medium. `DRM_FORMAT_MOD_INVALID` is a documented sentinel but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers we currently work for. Mitigation: revert to `DRM_FORMAT_MOD_NONE` if regression observed.
|
||
|
||
## Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)
|
||
|
||
**File**: `src/surface.h`, `src/surface.c`, `src/picture.c`
|
||
**Diff scale**: ~150-200 lines (new code)
|
||
|
||
**Problem**: 1:1 surface↔buffer binding causes V4L2 to re-QBUF the same physical buffer while consumer still holds an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms): race window opens during certain playback patterns and produces operator-visible stutter on mpv vaapi --vo=gpu.
|
||
|
||
**Architecture (Option A from Phase 2)**:
|
||
- Add `struct cap_pool_slot` per CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED.
|
||
- Allocate `max(surfaces_count, MIN_CAP_POOL)` CAPTURE buffers (MIN_CAP_POOL = 24 — gives ~3x headroom for typical mpv 16-surface pool).
|
||
- Decouple: surfaces no longer have permanent `destination_index`. Instead, each surface holds a transient pointer to its currently-bound slot.
|
||
- On `RequestBeginPicture(surface_X)`:
|
||
- If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
|
||
- Find LRU FREE slot (oldest "last_used_at" timestamp)
|
||
- Bind surface_X → slot, set destination_index, mmap if needed
|
||
- QBUF that slot
|
||
- On `RequestSyncSurface(surface_X)`: DQBUF as before, slot transitions FREE→IN_DECODE→EXPORTED→...
|
||
- On `RequestExportSurfaceHandle(surface_X)`: VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp
|
||
- On `RequestDestroySurfaces`: free all slots, REQBUFS(0)
|
||
|
||
**Implementation details**:
|
||
- `cap_pool_slot` struct in `surface.h`:
|
||
```c
|
||
enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };
|
||
struct cap_pool_slot {
|
||
unsigned int v4l2_index;
|
||
void *map[VIDEO_MAX_PLANES];
|
||
unsigned int map_lengths[VIDEO_MAX_PLANES];
|
||
unsigned int map_offsets[VIDEO_MAX_PLANES];
|
||
enum cap_slot_state state;
|
||
int our_export_fd; /* -1 if not exported */
|
||
uint64_t last_used_at_ns; /* for LRU recycling */
|
||
};
|
||
```
|
||
- Pool stored in `request_data` (driver-level), not in `surface_object`.
|
||
- LRU helper: scan slots, find FREE with oldest `last_used_at_ns`. If no FREE, force-recycle the oldest EXPORTED slot (close fd, demote to FREE) — race window may still open in pathological cases but rare.
|
||
|
||
**Validation**:
|
||
- mpv `--hwdec=vaapi --vo=gpu` plays bbb 1080p for ≥60s without operator-visible stutter
|
||
- Drop count < 100 in 14s (Phase 3 anchor)
|
||
- Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
|
||
- vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
|
||
- strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)
|
||
|
||
**Risk**: medium-high. Substantive code change. Mitigations:
|
||
- Keep pool size + recycling logic configurable
|
||
- Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
|
||
- Phase 5 sonnet review before committing Fix 3
|
||
|
||
## Order of attack
|
||
|
||
1. **Fix 1 (multi-resolution cache)** — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
|
||
2. **Fix 2 (WSI modifier)** — 1 line, test on 864-wide video. If regression, revert.
|
||
3. **Phase 5 review (sonnet)** before Fix 3 — get an outside read on the buffer-pool architecture before committing the substantive code.
|
||
4. **Fix 3 (buffer pool)** — implement, test, iterate.
|
||
5. **Phase 7 verification** — full 4-consumer matrix + multi-video session.
|
||
6. **Phase 8 close** — memory entry, iteration 3 input doc.
|
||
|
||
## Out-of-scope for iteration 2 (carried to iteration 3)
|
||
|
||
- VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
|
||
- num_ref_idx for multi-slice streams (Sonnet 7.2)
|
||
- HACK block in surface.c MPEG-2 case (Sonnet 7.4)
|
||
- Firefox seek-to-non-IDR (Sonnet 7.5)
|
||
- DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
|
||
- V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
|
||
- Performance metrics — iteration 3
|