Files

T

marfrit 2c57816fb1 Phase 8 iteration 2 close — DMA-BUF lifecycle hardened, goal met

mpv --hwdec=vaapi --vo=gpu plays bbb_1080p30 smoothly per operator
inspection. Three independent fixes landed in the fork: format-cache
invalidation on multi-resolution sessions (Fix 1), conditional
DRM_FORMAT_MOD_INVALID for non-64-aligned pitches (Fix 2), decoupled
CAPTURE buffer pool with LRU recycling (Fix 3, the load-bearing one).

Documented limitations carried into iter3 substrate: Option-A
statistical mitigation not provably race-free under pathological
workloads; multi-context safety not addressed; LAST_OUTPUT_*
process-global cache; iter1 carry-over Sonnet 7.x questions still open.

Open recommendation for iter3 substrate: re-run iter1's 4-consumer
regression matrix under Fix 3, especially Firefox 150 multi-video.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-04 22:09:58 +00:00

8.4 KiB

Raw Blame History

Iteration 2 — Phase 8 close (2026-05-04)

Deliverable

marfrit/libva-v4l2-request-fourier master at 19acc76 — three independent hardening fixes layered on top of iteration 1's working decode:

Fix	Commit	What it addresses
Fix 1	`06beef6`	Multi-resolution session-state corruption: REQBUFS(0) on CAPTURE in resolution-change path + invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on `RequestDestroyContext`
Fix 2	`e64bb08`	Mesa WSI rejection of non-64-aligned pitch: conditional `DRM_FORMAT_MOD_INVALID` for pitches that aren't 64-byte aligned
Fix 3	`19acc76`	DMA-BUF lifecycle race: decoupled CAPTURE buffer pool with 4-state machine (FREE / IN_DECODE / DECODED / EXPORTED), pthread-locked, LRU recycling. Pool size = max(surfaces_count, MIN_CAP_POOL=24).

Iteration 2 Phase 1 lock — met

mpv --hwdec=vaapi --vo=gpu plays bbb_1080p30 for ≥30s without operator-visible stutter or back-and-forth (binding cell: operator inspection on real screen).

Met (operator-confirmed: "smooth" on the goal-test, 2026-05-04 with driver sha256 f27e0064...).

Phase 6 / 7 verification status

Consumer	Result
vainfo	✓ enumerates 7 H.264 + 2 MPEG-2 profiles
mpv `--hwdec=vaapi-copy --vo=null --frames=200` bbb_1080p30	✓ 0 drops, 0 decoder drops, real luma gradient (0x10→0x81), pool LRU visibly recycling slot indices `0..23 → 0,2,1,4,3...`
mpv `--hwdec=vaapi --vo=gpu` (real-VO, operator inspection)	✓ smooth — iteration 2 goal met
Firefox 150 multi-video session	⚠ not re-verified post-Fix-3 (deferred — fix 3 changed the surface→buffer model that iter1 Firefox path depends on; smoke-clean on vaapi-copy + vaapi suggests no regression but operator inspection is the binding cell)
Brave / chromium-fourier 149	✓ unchanged scope (uses chromium-internal V4L2 backend, bypasses libva)

The Firefox + 4-consumer regression matrix from iteration 1 was NOT re-run. If a regression exists in the Firefox path under Fix 3, the cap_pool's LRU + DECODED-state-retention model should still be correct — but operator validation is the boolean criterion. Recommended for iteration 2.5 or as the first move of iteration 3 substrate.

Architecture-level changes worth knowing for iter3+

cap_pool abstraction (src/cap_pool.{h,c}) is now the single owner of CAPTURE buffers' V4L2 indices, mmap regions, and EXPBUF fd lifetimes. Surfaces hold a current_slot reference, populated at BeginPicture and cleared at DestroySurfaces or next BeginPicture for the same surface. Format-uniform fields (destination_planes_count, destination_sizes, destination_offsets, destination_bytesperlines) stay on object_surface since they're constant across slot rebinds.
Pool destroy lifecycle is split between two paths: (a) DestroyContext calls cap_pool_destroy after the surface unbind cascade; (b) the resolution-change branch in CreateSurfaces2 also tears the pool down before issuing REQBUFS(0) to avoid leaving slots pointing at dead V4L2 indices. Both code paths are required.
DeriveImage semantic change: skips the surface→image copy when current_slot == NULL. ffmpeg's av_hwframe_ctx_init calls vaDeriveImage on never-decoded surfaces purely as a format probe; pre-Fix-3 this worked because every surface had a permanently-bound zero-filled mmap, post-Fix-3 it would crash. The new behavior gives the consumer a valid VAImage structure with a zero-filled buffer; the actual surface data is read on a later post-decode DeriveImage.

Documented limitations carried into iteration 3 substrate

Option-A statistical mitigation, not provably race-free. When the pool exhausts FREE slots and force-recycles the oldest EXPORTED slot, the consumer may still hold a dup'd dma_buf fd to the recycled slot's underlying physical memory. The kernel will happily DMA new content into it on the next QBUF. Closing OUR EXPBUF fd does not invalidate the consumer's dup. There is no public V4L2 API to query dma_buf refcount. For typical mpv 16-surface playback with MIN_CAP_POOL=24, the force-recycle path never fires. For pathological workloads (paused-with-video-still-visible, very long compositor hold windows, multi-stream), races still possible.
Multi-context concurrent libva use. Each context allocates its own cap_pool, but there's only one V4L2 device. Two concurrent libva contexts (Firefox playing two tabs through separate RDD instances) would multiply pool size on a single device. Not addressed in iteration 2.
LAST_OUTPUT_WIDTH/HEIGHT cache is process-global. Sequential session reuse is fixed (Fix 1's invalidation), but two concurrent contexts in the same process would still race on the static.
Iteration 1 carry-overs not addressed in iteration 2 (Sonnet review 7.x): EACCES on VIDIOC_G_EXT_CTRLS readback (7.1), num_ref_idx for multi-slice streams (7.2), // HACK block in surface.c::CreateSurfaces2 for multi-codec (7.4), Firefox seek-to-non-IDR (7.5).

What landed (commits ahead of iteration 1 close)

19acc76 iter2 Fix 3: decoupled CAPTURE buffer pool with LRU recycling
e64bb08 iter2 Fix 2: conditional DRM_FORMAT_MOD_INVALID for non-64-aligned pitch
06beef6 iter2 Fix 1: invalidate format cache on DestroyContext + REQBUFS(0) on CAPTURE in resolution-change path

DEBUG instrumentation from iteration 1 stays per the original Sonnet review guidance ("clean sweep at iteration 2 close" was the original plan, but Fix 3 introduced new sites that benefit from the existing CAPTURE-dump and ENTER-trace harness). Iteration 3 sweep still pending.

Lessons distilled to memory

feedback_consumer_probe_calls.md (NEW) — VAAPI consumers (notably ffmpeg's av_hwframe_ctx_init) may call vaDeriveImage and similar surface-introspection APIs on never-decoded surfaces purely as format probes. Backend code must tolerate uninitialized per-decode state (NULL data pointers, no slot bound) on these paths and return valid metadata + zero-filled buffers, not crash. Saved for iter3 work plus future codec backends.

(Phase 5 architecture lessons + the C-comment-fragility / VIDEO_MAX_PLANES redefinition syntax-bugs are not memory-worthy — derivable from current code state and gcc behavior respectively.)

Phase 0 carry-over to iteration 3 (substrate input)

State that carries (re-verified in iteration 2):

ohm RK3568 hantro G1/G2 on /dev/video1 + /dev/media0, kernel 6.19.10
libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
Test fixture: /home/mfritsche/fourier-test/bbb_1080p30_h264.mp4
Build harness: meson setup --buildtype=release && ninja on ohm; deploy to /usr/lib/dri/v4l2_request_drv_video.so
All three iter2 fixes deployed (driver sha256 f27e0064...)

Open questions for iteration 3:

Re-run iter1 4-consumer regression matrix under Fix 3. Especially Firefox 150 multi-video on mozilla.org (Fix 1's CAPTURE REQBUFS path) and chromium-fourier-149 (uses chromium-internal V4L2 backend but worth confirming no environmental break).
DEBUG instrumentation sweep. Remove ENTER logging, CAPTURE/OUTPUT hex dumps, sentinel write in EndPicture, msync workaround comment-only-or-removable. Required before any upstream snapshot.
Performance comparison binding cell (deferred from iter1, iter2 substrate notes). Iteration 3 should be the first iteration that does perf measurement: drop counts, effective FPS, browser CPU% on a defined corpus, scanout-plane residency for vaapi vs vaapi-copy vs SW.
Multi-context libva safety (Sonnet review 9.6). Two concurrent contexts (different processes or same-process Firefox + mpv overlap) need pool isolation that doesn't currently exist.
V4L2_MEMORY_DMABUF userspace allocation (the iter2 plan's Option B). True architectural fix for the EXPBUF lifecycle race — userspace allocates the dma_buf and lifetime is unambiguous. Significant kernel-side test surface; not free.
Sonnet iter1 carry-over questions still open: 7.1 EACCES, 7.2 num_ref_idx, 7.4 HACK refactor, 7.5 Firefox seek-to-non-IDR.
Cleanup of LAST_OUTPUT_WIDTH/HEIGHT static. Move to per-context (or per-pool) state to fix concurrent-context safety.

Bootlin upstream outlook

iteration 2 widens the gap between this fork and bootlin upstream by ~700 LOC (Fix 3's cap_pool abstraction). Per feedback_no_upstream.md, no PRs unless asked. Iteration 3's DEBUG sweep + multi-codec HACK refactor + perf data are the natural prerequisites if upstreamability becomes a goal.

8.4 KiB Raw Blame History