mpv --hwdec=vaapi --vo=gpu plays bbb_1080p30 smoothly per operator inspection. Three independent fixes landed in the fork: format-cache invalidation on multi-resolution sessions (Fix 1), conditional DRM_FORMAT_MOD_INVALID for non-64-aligned pitches (Fix 2), decoupled CAPTURE buffer pool with LRU recycling (Fix 3, the load-bearing one). Documented limitations carried into iter3 substrate: Option-A statistical mitigation not provably race-free under pathological workloads; multi-context safety not addressed; LAST_OUTPUT_* process-global cache; iter1 carry-over Sonnet 7.x questions still open. Open recommendation for iter3 substrate: re-run iter1's 4-consumer regression matrix under Fix 3, especially Firefox 150 multi-video. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8.4 KiB
Iteration 2 — Phase 8 close (2026-05-04)
Deliverable
marfrit/libva-v4l2-request-fourier master at 19acc76 — three independent hardening fixes layered on top of iteration 1's working decode:
| Fix | Commit | What it addresses |
|---|---|---|
| Fix 1 | 06beef6 |
Multi-resolution session-state corruption: REQBUFS(0) on CAPTURE in resolution-change path + invalidate LAST_OUTPUT_WIDTH/HEIGHT cache on RequestDestroyContext |
| Fix 2 | e64bb08 |
Mesa WSI rejection of non-64-aligned pitch: conditional DRM_FORMAT_MOD_INVALID for pitches that aren't 64-byte aligned |
| Fix 3 | 19acc76 |
DMA-BUF lifecycle race: decoupled CAPTURE buffer pool with 4-state machine (FREE / IN_DECODE / DECODED / EXPORTED), pthread-locked, LRU recycling. Pool size = max(surfaces_count, MIN_CAP_POOL=24). |
Iteration 2 Phase 1 lock — met
mpv
--hwdec=vaapi --vo=gpuplays bbb_1080p30 for ≥30s without operator-visible stutter or back-and-forth (binding cell: operator inspection on real screen).
Met (operator-confirmed: "smooth" on the goal-test, 2026-05-04 with driver sha256 f27e0064...).
Phase 6 / 7 verification status
| Consumer | Result |
|---|---|
| vainfo | ✓ enumerates 7 H.264 + 2 MPEG-2 profiles |
mpv --hwdec=vaapi-copy --vo=null --frames=200 bbb_1080p30 |
✓ 0 drops, 0 decoder drops, real luma gradient (0x10→0x81), pool LRU visibly recycling slot indices 0..23 → 0,2,1,4,3... |
mpv --hwdec=vaapi --vo=gpu (real-VO, operator inspection) |
✓ smooth — iteration 2 goal met |
| Firefox 150 multi-video session | ⚠ not re-verified post-Fix-3 (deferred — fix 3 changed the surface→buffer model that iter1 Firefox path depends on; smoke-clean on vaapi-copy + vaapi suggests no regression but operator inspection is the binding cell) |
| Brave / chromium-fourier 149 | ✓ unchanged scope (uses chromium-internal V4L2 backend, bypasses libva) |
The Firefox + 4-consumer regression matrix from iteration 1 was NOT re-run. If a regression exists in the Firefox path under Fix 3, the cap_pool's LRU + DECODED-state-retention model should still be correct — but operator validation is the boolean criterion. Recommended for iteration 2.5 or as the first move of iteration 3 substrate.
Architecture-level changes worth knowing for iter3+
cap_poolabstraction (src/cap_pool.{h,c}) is now the single owner of CAPTURE buffers' V4L2 indices, mmap regions, and EXPBUF fd lifetimes. Surfaces hold acurrent_slotreference, populated atBeginPictureand cleared atDestroySurfacesor nextBeginPicturefor the same surface. Format-uniform fields (destination_planes_count,destination_sizes,destination_offsets,destination_bytesperlines) stay onobject_surfacesince they're constant across slot rebinds.- Pool destroy lifecycle is split between two paths: (a)
DestroyContextcallscap_pool_destroyafter the surface unbind cascade; (b) the resolution-change branch inCreateSurfaces2also tears the pool down before issuing REQBUFS(0) to avoid leaving slots pointing at dead V4L2 indices. Both code paths are required. DeriveImagesemantic change: skips the surface→image copy whencurrent_slot == NULL. ffmpeg'sav_hwframe_ctx_initcallsvaDeriveImageon never-decoded surfaces purely as a format probe; pre-Fix-3 this worked because every surface had a permanently-bound zero-filled mmap, post-Fix-3 it would crash. The new behavior gives the consumer a validVAImagestructure with a zero-filled buffer; the actual surface data is read on a later post-decodeDeriveImage.
Documented limitations carried into iteration 3 substrate
- Option-A statistical mitigation, not provably race-free. When the pool exhausts FREE slots and force-recycles the oldest EXPORTED slot, the consumer may still hold a dup'd dma_buf fd to the recycled slot's underlying physical memory. The kernel will happily DMA new content into it on the next QBUF. Closing OUR EXPBUF fd does not invalidate the consumer's dup. There is no public V4L2 API to query dma_buf refcount. For typical mpv 16-surface playback with
MIN_CAP_POOL=24, the force-recycle path never fires. For pathological workloads (paused-with-video-still-visible, very long compositor hold windows, multi-stream), races still possible. - Multi-context concurrent libva use. Each context allocates its own
cap_pool, but there's only one V4L2 device. Two concurrent libva contexts (Firefox playing two tabs through separate RDD instances) would multiply pool size on a single device. Not addressed in iteration 2. LAST_OUTPUT_WIDTH/HEIGHTcache is process-global. Sequential session reuse is fixed (Fix 1's invalidation), but two concurrent contexts in the same process would still race on the static.- Iteration 1 carry-overs not addressed in iteration 2 (Sonnet review 7.x): EACCES on
VIDIOC_G_EXT_CTRLSreadback (7.1),num_ref_idxfor multi-slice streams (7.2),// HACKblock insurface.c::CreateSurfaces2for multi-codec (7.4), Firefox seek-to-non-IDR (7.5).
What landed (commits ahead of iteration 1 close)
19acc76 iter2 Fix 3: decoupled CAPTURE buffer pool with LRU recycling
e64bb08 iter2 Fix 2: conditional DRM_FORMAT_MOD_INVALID for non-64-aligned pitch
06beef6 iter2 Fix 1: invalidate format cache on DestroyContext + REQBUFS(0) on CAPTURE in resolution-change path
DEBUG instrumentation from iteration 1 stays per the original Sonnet review guidance ("clean sweep at iteration 2 close" was the original plan, but Fix 3 introduced new sites that benefit from the existing CAPTURE-dump and ENTER-trace harness). Iteration 3 sweep still pending.
Lessons distilled to memory
feedback_consumer_probe_calls.md(NEW) — VAAPI consumers (notably ffmpeg'sav_hwframe_ctx_init) may callvaDeriveImageand similar surface-introspection APIs on never-decoded surfaces purely as format probes. Backend code must tolerate uninitialized per-decode state (NULL data pointers, no slot bound) on these paths and return valid metadata + zero-filled buffers, not crash. Saved for iter3 work plus future codec backends.
(Phase 5 architecture lessons + the C-comment-fragility / VIDEO_MAX_PLANES redefinition syntax-bugs are not memory-worthy — derivable from current code state and gcc behavior respectively.)
Phase 0 carry-over to iteration 3 (substrate input)
State that carries (re-verified in iteration 2):
- ohm RK3568 hantro G1/G2 on
/dev/video1+/dev/media0, kernel 6.19.10 - libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
- Test fixture:
/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4 - Build harness:
meson setup --buildtype=release && ninjaon ohm; deploy to/usr/lib/dri/v4l2_request_drv_video.so - All three iter2 fixes deployed (driver sha256
f27e0064...)
Open questions for iteration 3:
- Re-run iter1 4-consumer regression matrix under Fix 3. Especially Firefox 150 multi-video on mozilla.org (Fix 1's CAPTURE REQBUFS path) and chromium-fourier-149 (uses chromium-internal V4L2 backend but worth confirming no environmental break).
- DEBUG instrumentation sweep. Remove ENTER logging, CAPTURE/OUTPUT hex dumps, sentinel write in
EndPicture, msync workaround comment-only-or-removable. Required before any upstream snapshot. - Performance comparison binding cell (deferred from iter1, iter2 substrate notes). Iteration 3 should be the first iteration that does perf measurement: drop counts, effective FPS, browser CPU% on a defined corpus, scanout-plane residency for vaapi vs vaapi-copy vs SW.
- Multi-context libva safety (Sonnet review 9.6). Two concurrent contexts (different processes or same-process Firefox + mpv overlap) need pool isolation that doesn't currently exist.
V4L2_MEMORY_DMABUFuserspace allocation (the iter2 plan's Option B). True architectural fix for the EXPBUF lifecycle race — userspace allocates the dma_buf and lifetime is unambiguous. Significant kernel-side test surface; not free.- Sonnet iter1 carry-over questions still open: 7.1 EACCES, 7.2 num_ref_idx, 7.4 HACK refactor, 7.5 Firefox seek-to-non-IDR.
- Cleanup of
LAST_OUTPUT_WIDTH/HEIGHTstatic. Move to per-context (or per-pool) state to fix concurrent-context safety.
Bootlin upstream outlook
iteration 2 widens the gap between this fork and bootlin upstream by ~700 LOC (Fix 3's cap_pool abstraction). Per feedback_no_upstream.md, no PRs unless asked. Iteration 3's DEBUG sweep + multi-codec HACK refactor + perf data are the natural prerequisites if upstreamability becomes a goal.