Iteration 2 opens. Hardening, not new feature work.
Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.
Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.
Three architecture options analyzed:
A. More buffers + LRU recycling (cheapest, statistical mitigation)
B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
C. Kernel patch to enforce QBUF rejection (out of campaign scope)
Picking Option A for iteration 2.
Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.
Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
Fix 3: decoupled CAPTURE buffer pool with LRU recycling
(~150-200 lines, the load-bearing fix).
Phase 5 sonnet review BEFORE Fix 3 implementation.
Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.7 KiB
Iteration 2 Phase 4 — plan
Three independent fixes, ordered cheapest-first. Each tested before stacking the next.
Fix 1 — Multi-resolution: invalidate LAST_OUTPUT_WIDTH/HEIGHT cache on session teardown
File: src/surface.c + src/context.c (or wherever DestroyContext lives)
Diff scale: ~5 lines
Problem: Iteration 1 commit 37c0e72 added LAST_OUTPUT_WIDTH/HEIGHT static globals to skip re-S_FMT when dimensions unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end) but our cache still says "format already 1920×1088, no re-set needed." Next session inherits a 48×48 default kernel state and our cache lies.
Fix: zero out LAST_OUTPUT_WIDTH/HEIGHT whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in RequestDestroyContext. Forces next CreateSurfaces2 to S_FMT.
Risk: low. Worst case extra S_FMT call during normal operation — harmless.
Validation: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger fmt_width=48 fmt_height=48 regression. Operator-confirmable.
Fix 2 — WSI alignment: try DRM_FORMAT_MOD_INVALID in surface descriptor
File: src/surface.c::RequestExportSurfaceHandle line 657
Diff scale: 1 line + 1 fallback case
Problem: Mesa's WSI rejects pitch=864 (only 16-aligned) with "WSI pitch not properly aligned." Our objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment" — but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.
Fix attempt 1 (cheapest): change drm_modifier to DRM_FORMAT_MOD_INVALID (~0ULL). This tells Mesa "modifier unknown, treat as implicit / texture-import-only" — Mesa will texture-upload rather than direct WSI-import, paying a small perf cost but correct.
Fix attempt 2 (fallback): if DRM_FORMAT_MOD_INVALID causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:
- mpv vaapi-copy: use
DRM_FORMAT_MOD_LINEAR(works as before) - Firefox + WSI-aligned widths (64-aligned):
DRM_FORMAT_MOD_LINEAR - Non-aligned widths:
DRM_FORMAT_MOD_INVALID
We can't easily detect the consumer at export time, so attempt 1 first universally.
Validation: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.
Risk: medium. DRM_FORMAT_MOD_INVALID is a documented sentinel but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers we currently work for. Mitigation: revert to DRM_FORMAT_MOD_NONE if regression observed.
Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)
File: src/surface.h, src/surface.c, src/picture.c
Diff scale: ~150-200 lines (new code)
Problem: 1:1 surface↔buffer binding causes V4L2 to re-QBUF the same physical buffer while consumer still holds an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms): race window opens during certain playback patterns and produces operator-visible stutter on mpv vaapi --vo=gpu.
Architecture (Option A from Phase 2):
- Add
struct cap_pool_slotper CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED. - Allocate
max(surfaces_count, MIN_CAP_POOL)CAPTURE buffers (MIN_CAP_POOL = 24 — gives ~3x headroom for typical mpv 16-surface pool). - Decouple: surfaces no longer have permanent
destination_index. Instead, each surface holds a transient pointer to its currently-bound slot. - On
RequestBeginPicture(surface_X):- If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
- Find LRU FREE slot (oldest "last_used_at" timestamp)
- Bind surface_X → slot, set destination_index, mmap if needed
- QBUF that slot
- On
RequestSyncSurface(surface_X): DQBUF as before, slot transitions FREE→IN_DECODE→EXPORTED→... - On
RequestExportSurfaceHandle(surface_X): VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp - On
RequestDestroySurfaces: free all slots, REQBUFS(0)
Implementation details:
cap_pool_slotstruct insurface.h:enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED }; struct cap_pool_slot { unsigned int v4l2_index; void *map[VIDEO_MAX_PLANES]; unsigned int map_lengths[VIDEO_MAX_PLANES]; unsigned int map_offsets[VIDEO_MAX_PLANES]; enum cap_slot_state state; int our_export_fd; /* -1 if not exported */ uint64_t last_used_at_ns; /* for LRU recycling */ };- Pool stored in
request_data(driver-level), not insurface_object. - LRU helper: scan slots, find FREE with oldest
last_used_at_ns. If no FREE, force-recycle the oldest EXPORTED slot (close fd, demote to FREE) — race window may still open in pathological cases but rare.
Validation:
- mpv
--hwdec=vaapi --vo=gpuplays bbb 1080p for ≥60s without operator-visible stutter - Drop count < 100 in 14s (Phase 3 anchor)
- Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
- vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
- strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)
Risk: medium-high. Substantive code change. Mitigations:
- Keep pool size + recycling logic configurable
- Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
- Phase 5 sonnet review before committing Fix 3
Order of attack
- Fix 1 (multi-resolution cache) — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
- Fix 2 (WSI modifier) — 1 line, test on 864-wide video. If regression, revert.
- Phase 5 review (sonnet) before Fix 3 — get an outside read on the buffer-pool architecture before committing the substantive code.
- Fix 3 (buffer pool) — implement, test, iterate.
- Phase 7 verification — full 4-consumer matrix + multi-video session.
- Phase 8 close — memory entry, iteration 3 input doc.
Out-of-scope for iteration 2 (carried to iteration 3)
- VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
- num_ref_idx for multi-slice streams (Sonnet 7.2)
- HACK block in surface.c MPEG-2 case (Sonnet 7.4)
- Firefox seek-to-non-IDR (Sonnet 7.5)
- DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
- V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
- Performance metrics — iteration 3