Files
libva-multiplanar/phase4_iter2_plan.md
marfrit 9b1d7737cd Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00

6.7 KiB
Raw Permalink Blame History

Iteration 2 Phase 4 — plan

Three independent fixes, ordered cheapest-first. Each tested before stacking the next.

Fix 1 — Multi-resolution: invalidate LAST_OUTPUT_WIDTH/HEIGHT cache on session teardown

File: src/surface.c + src/context.c (or wherever DestroyContext lives) Diff scale: ~5 lines

Problem: Iteration 1 commit 37c0e72 added LAST_OUTPUT_WIDTH/HEIGHT static globals to skip re-S_FMT when dimensions unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end) but our cache still says "format already 1920×1088, no re-set needed." Next session inherits a 48×48 default kernel state and our cache lies.

Fix: zero out LAST_OUTPUT_WIDTH/HEIGHT whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in RequestDestroyContext. Forces next CreateSurfaces2 to S_FMT.

Risk: low. Worst case extra S_FMT call during normal operation — harmless.

Validation: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger fmt_width=48 fmt_height=48 regression. Operator-confirmable.

Fix 2 — WSI alignment: try DRM_FORMAT_MOD_INVALID in surface descriptor

File: src/surface.c::RequestExportSurfaceHandle line 657 Diff scale: 1 line + 1 fallback case

Problem: Mesa's WSI rejects pitch=864 (only 16-aligned) with "WSI pitch not properly aligned." Our objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment" — but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.

Fix attempt 1 (cheapest): change drm_modifier to DRM_FORMAT_MOD_INVALID (~0ULL). This tells Mesa "modifier unknown, treat as implicit / texture-import-only" — Mesa will texture-upload rather than direct WSI-import, paying a small perf cost but correct.

Fix attempt 2 (fallback): if DRM_FORMAT_MOD_INVALID causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:

  • mpv vaapi-copy: use DRM_FORMAT_MOD_LINEAR (works as before)
  • Firefox + WSI-aligned widths (64-aligned): DRM_FORMAT_MOD_LINEAR
  • Non-aligned widths: DRM_FORMAT_MOD_INVALID

We can't easily detect the consumer at export time, so attempt 1 first universally.

Validation: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.

Risk: medium. DRM_FORMAT_MOD_INVALID is a documented sentinel but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers we currently work for. Mitigation: revert to DRM_FORMAT_MOD_NONE if regression observed.

Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)

File: src/surface.h, src/surface.c, src/picture.c Diff scale: ~150-200 lines (new code)

Problem: 1:1 surface↔buffer binding causes V4L2 to re-QBUF the same physical buffer while consumer still holds an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms): race window opens during certain playback patterns and produces operator-visible stutter on mpv vaapi --vo=gpu.

Architecture (Option A from Phase 2):

  • Add struct cap_pool_slot per CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED.
  • Allocate max(surfaces_count, MIN_CAP_POOL) CAPTURE buffers (MIN_CAP_POOL = 24 — gives ~3x headroom for typical mpv 16-surface pool).
  • Decouple: surfaces no longer have permanent destination_index. Instead, each surface holds a transient pointer to its currently-bound slot.
  • On RequestBeginPicture(surface_X):
    • If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
    • Find LRU FREE slot (oldest "last_used_at" timestamp)
    • Bind surface_X → slot, set destination_index, mmap if needed
    • QBUF that slot
  • On RequestSyncSurface(surface_X): DQBUF as before, slot transitions FREE→IN_DECODE→EXPORTED→...
  • On RequestExportSurfaceHandle(surface_X): VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp
  • On RequestDestroySurfaces: free all slots, REQBUFS(0)

Implementation details:

  • cap_pool_slot struct in surface.h:
    enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };
    struct cap_pool_slot {
        unsigned int v4l2_index;
        void *map[VIDEO_MAX_PLANES];
        unsigned int map_lengths[VIDEO_MAX_PLANES];
        unsigned int map_offsets[VIDEO_MAX_PLANES];
        enum cap_slot_state state;
        int our_export_fd;        /* -1 if not exported */
        uint64_t last_used_at_ns; /* for LRU recycling */
    };
    
  • Pool stored in request_data (driver-level), not in surface_object.
  • LRU helper: scan slots, find FREE with oldest last_used_at_ns. If no FREE, force-recycle the oldest EXPORTED slot (close fd, demote to FREE) — race window may still open in pathological cases but rare.

Validation:

  • mpv --hwdec=vaapi --vo=gpu plays bbb 1080p for ≥60s without operator-visible stutter
  • Drop count < 100 in 14s (Phase 3 anchor)
  • Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
  • vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
  • strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)

Risk: medium-high. Substantive code change. Mitigations:

  • Keep pool size + recycling logic configurable
  • Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
  • Phase 5 sonnet review before committing Fix 3

Order of attack

  1. Fix 1 (multi-resolution cache) — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
  2. Fix 2 (WSI modifier) — 1 line, test on 864-wide video. If regression, revert.
  3. Phase 5 review (sonnet) before Fix 3 — get an outside read on the buffer-pool architecture before committing the substantive code.
  4. Fix 3 (buffer pool) — implement, test, iterate.
  5. Phase 7 verification — full 4-consumer matrix + multi-video session.
  6. Phase 8 close — memory entry, iteration 3 input doc.

Out-of-scope for iteration 2 (carried to iteration 3)

  • VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
  • num_ref_idx for multi-slice streams (Sonnet 7.2)
  • HACK block in surface.c MPEG-2 case (Sonnet 7.4)
  • Firefox seek-to-non-IDR (Sonnet 7.5)
  • DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
  • V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
  • Performance metrics — iteration 3