Files

T

marfrit c36c61e197 Phase 8 close: append Firefox 150 regression re-verify result

Operator-driven Firefox 150 video session under iter2 driver:
- With MOZ_DISABLE_RDD_SANDBOX=1: our libva backend engages, decodes 10
  frames cleanly through hantro (luma gradient 0x10..0x1c matching BBB
  intro fade), surface ID recycle + cap_pool slot cycling work as
  designed. EINVAL on frame 11 is iter1-carryover (Sonnet 7.x family),
  not an iter2 code regression.
- Without sandbox bypass: Firefox 150 RDD sandbox blocks open of
  /dev/media0 with ENETDOWN, libva init fails, Firefox SW-falls-back.
  iter1 evidence shows libva ran in the utility process
  (sandboxingKind=0) at that time; Firefox routing has since changed
  to RDD. NOT iter2-side; flag for iter3 substrate.

Cap_pool architecture confirmed working with Firefox: 8 surfaces, slots
recycled IN_DECODE -> DECODED -> reacquired across multiple BeginPicture
cycles for the same surface ID. Decoded NV12 content is real (matches
mpv vaapi-copy luma signature on the same clip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-04 22:30:45 +00:00

9.5 KiB

Raw Blame History

Iteration 2 — Phase 8 close (2026-05-04)

Deliverable

marfrit/libva-v4l2-request-fourier master at 19acc76 — three independent hardening fixes layered on top of iteration 1's working decode:

Fix	Commit	What it addresses
Fix 1	`06beef6`	Multi-resolution session-state corruption: REQBUFS(0) on CAPTURE in resolution-change path + invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on `RequestDestroyContext`
Fix 2	`e64bb08`	Mesa WSI rejection of non-64-aligned pitch: conditional `DRM_FORMAT_MOD_INVALID` for pitches that aren't 64-byte aligned
Fix 3	`19acc76`	DMA-BUF lifecycle race: decoupled CAPTURE buffer pool with 4-state machine (FREE / IN_DECODE / DECODED / EXPORTED), pthread-locked, LRU recycling. Pool size = max(surfaces_count, MIN_CAP_POOL=24).

Iteration 2 Phase 1 lock — met

mpv --hwdec=vaapi --vo=gpu plays bbb_1080p30 for ≥30s without operator-visible stutter or back-and-forth (binding cell: operator inspection on real screen).

Met (operator-confirmed: "smooth" on the goal-test, 2026-05-04 with driver sha256 f27e0064...).

Phase 6 / 7 verification status

Consumer	Result
vainfo	✓ enumerates 7 H.264 + 2 MPEG-2 profiles
mpv `--hwdec=vaapi-copy --vo=null --frames=200` bbb_1080p30	✓ 0 drops, 0 decoder drops, real luma gradient (0x10→0x81), pool LRU visibly recycling slot indices `0..23 → 0,2,1,4,3...`
mpv `--hwdec=vaapi --vo=gpu` (real-VO, operator inspection)	✓ smooth — iteration 2 goal met
Firefox 150 (sandbox-disabled)	✓ engages our libva, decodes 10 frames cleanly through hantro (luma gradient `0x10→0x1c` matching BBB intro fade, real NV12 pixels), then EINVAL on `set_controls` at frame 11. The EINVAL is a non-iter2 issue — same Sonnet 7.x family carryover from iter1 (likely 7.5 mid-stream / 7.2 num_ref_idx). cap_pool model is NOT the regression.
Firefox 150 (default sandbox)	✗ libva init fails inside RDD sandbox on `open(/dev/media0)` returning ENETDOWN — Firefox SW-falls-back. NOT an iter2 code regression (iter1 init code is byte-identical), but a Firefox routing change since iter1: iter1's findings.md shows decode happened on the utility process (`sandboxingKind=0`), iter2 today shows the libva path goes through RDD which is sandbox-blocked. Workaround: launch Firefox with `MOZ_DISABLE_RDD_SANDBOX=1`.
Brave / chromium-fourier 149	✓ unchanged scope (uses chromium-internal V4L2 backend, bypasses libva)

The 10-frame decoded sequence under sandbox-bypass confirms Fix 3's cap_pool architecture works correctly with Firefox: surface IDs 67108864..67108871 each acquired their own slot, and surface IDs were recycled across frames 5,6,9 with the slot state machine cycling through IN_DECODE → DECODED → recycle on next BeginPicture for the same surface. Pool was operating exactly as designed.

Architecture-level changes worth knowing for iter3+

cap_pool abstraction (src/cap_pool.{h,c}) is now the single owner of CAPTURE buffers' V4L2 indices, mmap regions, and EXPBUF fd lifetimes. Surfaces hold a current_slot reference, populated at BeginPicture and cleared at DestroySurfaces or next BeginPicture for the same surface. Format-uniform fields (destination_planes_count, destination_sizes, destination_offsets, destination_bytesperlines) stay on object_surface since they're constant across slot rebinds.
Pool destroy lifecycle is split between two paths: (a) DestroyContext calls cap_pool_destroy after the surface unbind cascade; (b) the resolution-change branch in CreateSurfaces2 also tears the pool down before issuing REQBUFS(0) to avoid leaving slots pointing at dead V4L2 indices. Both code paths are required.
DeriveImage semantic change: skips the surface→image copy when current_slot == NULL. ffmpeg's av_hwframe_ctx_init calls vaDeriveImage on never-decoded surfaces purely as a format probe; pre-Fix-3 this worked because every surface had a permanently-bound zero-filled mmap, post-Fix-3 it would crash. The new behavior gives the consumer a valid VAImage structure with a zero-filled buffer; the actual surface data is read on a later post-decode DeriveImage.

Documented limitations carried into iteration 3 substrate

Option-A statistical mitigation, not provably race-free. When the pool exhausts FREE slots and force-recycles the oldest EXPORTED slot, the consumer may still hold a dup'd dma_buf fd to the recycled slot's underlying physical memory. The kernel will happily DMA new content into it on the next QBUF. Closing OUR EXPBUF fd does not invalidate the consumer's dup. There is no public V4L2 API to query dma_buf refcount. For typical mpv 16-surface playback with MIN_CAP_POOL=24, the force-recycle path never fires. For pathological workloads (paused-with-video-still-visible, very long compositor hold windows, multi-stream), races still possible.
Multi-context concurrent libva use. Each context allocates its own cap_pool, but there's only one V4L2 device. Two concurrent libva contexts (Firefox playing two tabs through separate RDD instances) would multiply pool size on a single device. Not addressed in iteration 2.
LAST_OUTPUT_WIDTH/HEIGHT cache is process-global. Sequential session reuse is fixed (Fix 1's invalidation), but two concurrent contexts in the same process would still race on the static.
Iteration 1 carry-overs not addressed in iteration 2 (Sonnet review 7.x): EACCES on VIDIOC_G_EXT_CTRLS readback (7.1), num_ref_idx for multi-slice streams (7.2), // HACK block in surface.c::CreateSurfaces2 for multi-codec (7.4), Firefox seek-to-non-IDR (7.5).

What landed (commits ahead of iteration 1 close)

19acc76 iter2 Fix 3: decoupled CAPTURE buffer pool with LRU recycling
e64bb08 iter2 Fix 2: conditional DRM_FORMAT_MOD_INVALID for non-64-aligned pitch
06beef6 iter2 Fix 1: invalidate format cache on DestroyContext + REQBUFS(0) on CAPTURE in resolution-change path

DEBUG instrumentation from iteration 1 stays per the original Sonnet review guidance ("clean sweep at iteration 2 close" was the original plan, but Fix 3 introduced new sites that benefit from the existing CAPTURE-dump and ENTER-trace harness). Iteration 3 sweep still pending.

Lessons distilled to memory

feedback_consumer_probe_calls.md (NEW) — VAAPI consumers (notably ffmpeg's av_hwframe_ctx_init) may call vaDeriveImage and similar surface-introspection APIs on never-decoded surfaces purely as format probes. Backend code must tolerate uninitialized per-decode state (NULL data pointers, no slot bound) on these paths and return valid metadata + zero-filled buffers, not crash. Saved for iter3 work plus future codec backends.

(Phase 5 architecture lessons + the C-comment-fragility / VIDEO_MAX_PLANES redefinition syntax-bugs are not memory-worthy — derivable from current code state and gcc behavior respectively.)

Phase 0 carry-over to iteration 3 (substrate input)

State that carries (re-verified in iteration 2):

ohm RK3568 hantro G1/G2 on /dev/video1 + /dev/media0, kernel 6.19.10
libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
Test fixture: /home/mfritsche/fourier-test/bbb_1080p30_h264.mp4
Build harness: meson setup --buildtype=release && ninja on ohm; deploy to /usr/lib/dri/v4l2_request_drv_video.so
All three iter2 fixes deployed (driver sha256 f27e0064...)

Open questions for iteration 3:

Firefox 150 RDD sandbox blocks /dev/media0 — workaround MOZ_DISABLE_RDD_SANDBOX=1 confirmed. Either a Firefox routing change since iter1 (decode used to go through utility/sandboxingKind=0; now goes through RDD/sandboxed) or a Mesa-side change; needs investigation. Either upstream a Firefox prefs option, document the env var requirement permanently, or switch to a libva backend pattern that doesn't need /dev/media0 (likely impossible — the V4L2 stateless API requires it).
EINVAL after ~10 frames in Firefox. Same as iter1 carryover Sonnet 7.x family (7.5 mid-stream non-IDR, 7.2 num_ref_idx for multi-slice). Reproducible now; iter3 should narrow which control is rejected.
DEBUG instrumentation sweep. Remove ENTER logging, CAPTURE/OUTPUT hex dumps, sentinel write in EndPicture, msync workaround comment-only-or-removable. Required before any upstream snapshot.
Performance comparison binding cell (deferred from iter1, iter2 substrate notes). Iteration 3 should be the first iteration that does perf measurement: drop counts, effective FPS, browser CPU% on a defined corpus, scanout-plane residency for vaapi vs vaapi-copy vs SW.
Multi-context libva safety (Sonnet review 9.6). Two concurrent contexts (different processes or same-process Firefox + mpv overlap) need pool isolation that doesn't currently exist.
V4L2_MEMORY_DMABUF userspace allocation (the iter2 plan's Option B). True architectural fix for the EXPBUF lifecycle race — userspace allocates the dma_buf and lifetime is unambiguous. Significant kernel-side test surface; not free.
Sonnet iter1 carry-over questions still open: 7.1 EACCES, 7.2 num_ref_idx, 7.4 HACK refactor, 7.5 Firefox seek-to-non-IDR.
Cleanup of LAST_OUTPUT_WIDTH/HEIGHT static. Move to per-context (or per-pool) state to fix concurrent-context safety.

Bootlin upstream outlook

iteration 2 widens the gap between this fork and bootlin upstream by ~700 LOC (Fix 3's cap_pool abstraction). Per feedback_no_upstream.md, no PRs unless asked. Iteration 3's DEBUG sweep + multi-codec HACK refactor + perf data are the natural prerequisites if upstreamability becomes a goal.

9.5 KiB Raw Blame History