diff --git a/phase8_iteration2_close.md b/phase8_iteration2_close.md new file mode 100644 index 0000000..7c7d5fb --- /dev/null +++ b/phase8_iteration2_close.md @@ -0,0 +1,80 @@ +# Iteration 2 — Phase 8 close (2026-05-04) + +## Deliverable + +`marfrit/libva-v4l2-request-fourier` master at `19acc76` — three independent hardening fixes layered on top of iteration 1's working decode: + +| Fix | Commit | What it addresses | +|---|---|---| +| Fix 1 | `06beef6` | Multi-resolution session-state corruption: REQBUFS(0) on CAPTURE in resolution-change path + invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on `RequestDestroyContext` | +| Fix 2 | `e64bb08` | Mesa WSI rejection of non-64-aligned pitch: conditional `DRM_FORMAT_MOD_INVALID` for pitches that aren't 64-byte aligned | +| Fix 3 | `19acc76` | DMA-BUF lifecycle race: decoupled CAPTURE buffer pool with 4-state machine (FREE / IN_DECODE / DECODED / EXPORTED), pthread-locked, LRU recycling. Pool size = max(surfaces_count, MIN_CAP_POOL=24). | + +## Iteration 2 Phase 1 lock — met + +> mpv `--hwdec=vaapi --vo=gpu` plays bbb_1080p30 for ≥30s without operator-visible stutter or back-and-forth (binding cell: operator inspection on real screen). + +**Met (operator-confirmed: "smooth"** on the goal-test, 2026-05-04 with driver sha256 `f27e0064...`). + +## Phase 6 / 7 verification status + +| Consumer | Result | +|---|---| +| vainfo | ✓ enumerates 7 H.264 + 2 MPEG-2 profiles | +| mpv `--hwdec=vaapi-copy --vo=null --frames=200` bbb_1080p30 | ✓ 0 drops, 0 decoder drops, real luma gradient (0x10→0x81), pool LRU visibly recycling slot indices `0..23 → 0,2,1,4,3...` | +| mpv `--hwdec=vaapi --vo=gpu` (real-VO, operator inspection) | ✓ smooth — iteration 2 goal met | +| Firefox 150 multi-video session | ⚠ not re-verified post-Fix-3 (deferred — fix 3 changed the surface→buffer model that iter1 Firefox path depends on; smoke-clean on vaapi-copy + vaapi suggests no regression but operator inspection is the binding cell) | +| Brave / chromium-fourier 149 | ✓ unchanged scope (uses chromium-internal V4L2 backend, bypasses libva) | + +The Firefox + 4-consumer regression matrix from iteration 1 was NOT re-run. If a regression exists in the Firefox path under Fix 3, the cap_pool's LRU + DECODED-state-retention model should still be correct — but operator validation is the boolean criterion. Recommended for iteration 2.5 or as the first move of iteration 3 substrate. + +## Architecture-level changes worth knowing for iter3+ + +- **`cap_pool` abstraction (`src/cap_pool.{h,c}`)** is now the single owner of CAPTURE buffers' V4L2 indices, mmap regions, and EXPBUF fd lifetimes. Surfaces hold a `current_slot` reference, populated at `BeginPicture` and cleared at `DestroySurfaces` or next `BeginPicture` for the same surface. Format-uniform fields (`destination_planes_count`, `destination_sizes`, `destination_offsets`, `destination_bytesperlines`) stay on `object_surface` since they're constant across slot rebinds. +- **Pool destroy lifecycle** is split between two paths: (a) `DestroyContext` calls `cap_pool_destroy` after the surface unbind cascade; (b) the resolution-change branch in `CreateSurfaces2` also tears the pool down before issuing REQBUFS(0) to avoid leaving slots pointing at dead V4L2 indices. Both code paths are required. +- **`DeriveImage` semantic change**: skips the surface→image copy when `current_slot == NULL`. ffmpeg's `av_hwframe_ctx_init` calls `vaDeriveImage` on never-decoded surfaces purely as a format probe; pre-Fix-3 this worked because every surface had a permanently-bound zero-filled mmap, post-Fix-3 it would crash. The new behavior gives the consumer a valid `VAImage` structure with a zero-filled buffer; the actual surface data is read on a later post-decode `DeriveImage`. + +## Documented limitations carried into iteration 3 substrate + +- **Option-A statistical mitigation, not provably race-free.** When the pool exhausts FREE slots and force-recycles the oldest EXPORTED slot, the consumer may still hold a dup'd dma_buf fd to the recycled slot's underlying physical memory. The kernel will happily DMA new content into it on the next QBUF. Closing OUR EXPBUF fd does not invalidate the consumer's dup. There is no public V4L2 API to query dma_buf refcount. For typical mpv 16-surface playback with `MIN_CAP_POOL=24`, the force-recycle path never fires. For pathological workloads (paused-with-video-still-visible, very long compositor hold windows, multi-stream), races still possible. +- **Multi-context concurrent libva use.** Each context allocates its own `cap_pool`, but there's only one V4L2 device. Two concurrent libva contexts (Firefox playing two tabs through separate RDD instances) would multiply pool size on a single device. Not addressed in iteration 2. +- **`LAST_OUTPUT_WIDTH/HEIGHT` cache is process-global.** Sequential session reuse is fixed (Fix 1's invalidation), but two concurrent contexts in the same process would still race on the static. +- **Iteration 1 carry-overs not addressed in iteration 2** (Sonnet review 7.x): EACCES on `VIDIOC_G_EXT_CTRLS` readback (7.1), `num_ref_idx` for multi-slice streams (7.2), `// HACK` block in `surface.c::CreateSurfaces2` for multi-codec (7.4), Firefox seek-to-non-IDR (7.5). + +## What landed (commits ahead of iteration 1 close) + +``` +19acc76 iter2 Fix 3: decoupled CAPTURE buffer pool with LRU recycling +e64bb08 iter2 Fix 2: conditional DRM_FORMAT_MOD_INVALID for non-64-aligned pitch +06beef6 iter2 Fix 1: invalidate format cache on DestroyContext + REQBUFS(0) on CAPTURE in resolution-change path +``` + +DEBUG instrumentation from iteration 1 stays per the original Sonnet review guidance ("clean sweep at iteration 2 close" was the original plan, but Fix 3 introduced new sites that benefit from the existing CAPTURE-dump and ENTER-trace harness). Iteration 3 sweep still pending. + +## Lessons distilled to memory + +- **`feedback_consumer_probe_calls.md`** (NEW) — VAAPI consumers (notably ffmpeg's `av_hwframe_ctx_init`) may call `vaDeriveImage` and similar surface-introspection APIs on never-decoded surfaces purely as format probes. Backend code must tolerate uninitialized per-decode state (NULL data pointers, no slot bound) on these paths and return valid metadata + zero-filled buffers, not crash. Saved for iter3 work plus future codec backends. + +(Phase 5 architecture lessons + the C-comment-fragility / `VIDEO_MAX_PLANES` redefinition syntax-bugs are not memory-worthy — derivable from current code state and gcc behavior respectively.) + +## Phase 0 carry-over to iteration 3 (substrate input) + +State that carries (re-verified in iteration 2): +- ohm RK3568 hantro G1/G2 on `/dev/video1` + `/dev/media0`, kernel 6.19.10 +- libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5 +- Test fixture: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` +- Build harness: `meson setup --buildtype=release && ninja` on ohm; deploy to `/usr/lib/dri/v4l2_request_drv_video.so` +- All three iter2 fixes deployed (driver sha256 `f27e0064...`) + +Open questions for iteration 3: +1. **Re-run iter1 4-consumer regression matrix under Fix 3.** Especially Firefox 150 multi-video on mozilla.org (Fix 1's CAPTURE REQBUFS path) and chromium-fourier-149 (uses chromium-internal V4L2 backend but worth confirming no environmental break). +2. **DEBUG instrumentation sweep.** Remove ENTER logging, CAPTURE/OUTPUT hex dumps, sentinel write in `EndPicture`, msync workaround comment-only-or-removable. Required before any upstream snapshot. +3. **Performance comparison binding cell** (deferred from iter1, iter2 substrate notes). Iteration 3 should be the first iteration that does perf measurement: drop counts, effective FPS, browser CPU% on a defined corpus, scanout-plane residency for vaapi vs vaapi-copy vs SW. +4. **Multi-context libva safety** (Sonnet review 9.6). Two concurrent contexts (different processes or same-process Firefox + mpv overlap) need pool isolation that doesn't currently exist. +5. **`V4L2_MEMORY_DMABUF` userspace allocation** (the iter2 plan's Option B). True architectural fix for the EXPBUF lifecycle race — userspace allocates the dma_buf and lifetime is unambiguous. Significant kernel-side test surface; not free. +6. **Sonnet iter1 carry-over questions** still open: 7.1 EACCES, 7.2 num_ref_idx, 7.4 HACK refactor, 7.5 Firefox seek-to-non-IDR. +7. **Cleanup of `LAST_OUTPUT_WIDTH/HEIGHT` static.** Move to per-context (or per-pool) state to fix concurrent-context safety. + +## Bootlin upstream outlook + +iteration 2 widens the gap between this fork and bootlin upstream by ~700 LOC (Fix 3's cap_pool abstraction). Per `feedback_no_upstream.md`, no PRs unless asked. Iteration 3's DEBUG sweep + multi-codec HACK refactor + perf data are the natural prerequisites if upstreamability becomes a goal.