libva-multiplanar/phase0_findings_iter2.md
marfrit 9b1d7737cd Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00

# Iteration 2 — Phase 0 (substrate / motivation / inventory)
Opens 2026-05-04 immediately after iteration 1 close (`phase8_iteration1_close.md`).
## Locked research question (iteration 2)
> **"Make libva-v4l2-request's DMA-BUF (`vaapi`-no-`-copy`) path render decoded H.264 video without visual artifacts on direct-render consumers (mpv `--vo=gpu`), and handle the multi-video / multi-resolution session shape that real-world consumers (Firefox loading typical web pages) generate, on PineTab2 RK3568 hantro G1/G2."**
**This iteration is hardening, not new feature work.** Iteration 1 made the decode + libva engagement correct. The remaining failure modes are all in the buffer-lifecycle, format-state, and pitch-alignment layers — all of which only manifested AFTER the decode landed real frames.
Pass/fail (boolean):
- mpv `--hwdec=vaapi --vo=gpu` plays `bbb_1080p30_h264.mp4` for ≥30s **without visible stutter** (operator inspection on real screen)
- Firefox 150 plays the Mozilla homepage's animated 864-wide intro videos **without hitting `MESA: error: WSI pitch not properly aligned`** and falling back to SW decode
- A multi-video session (Firefox playing several videos at different resolutions in sequence) **does not corrupt kernel CAPTURE format state** (no `fmt_width=48 fmt_height=48` regression)
## Mechanism the question targets
Iteration 1 established the libva backend now correctly:
- engages hantro G1/G2 (decode produces real NV12 pixels)
- exports DMA-BUF descriptor in the SEPARATE_LAYERS shape Firefox needs
- re-sets OUTPUT format on resolution change for the mpv-probe pattern
But the remaining failure modes share a common underlying issue: **V4L2's per-buffer state is decoupled from the DMA-BUF refcount of the EXPBUF'd fd**. The kernel allows VIDIOC_QBUF to re-queue a buffer even when an external consumer holds an EXPBUF'd fd to the same physical memory. Userspace MUST NOT re-QBUF an in-use buffer, OR it must allocate enough buffers that re-use never happens within the consumer's hold window.
The current libva-v4l2-request code (post-iteration-1) binds CAPTURE buffer index 1:1 with the surface. When mpv recycles a surface (asks to decode frame N+M into a previously-used surface), our backend re-QBUFs the same physical buffer. If the consumer hasn't yet released its EXPBUF'd fd from frame N's render, the kernel writes frame N+M's content into the buffer **while the consumer is still reading frame N from it**.
Visible result on mpv `--vo=gpu`: frames overlap, the bunny appears to move backward and forward, "extreme stuttering and back and forth" per operator description.
Visible result on Firefox: not observed (Firefox's compositor pool buffers frames briefly and dups the fd, decoupling display timing from EXPBUF'd-fd lifetime), but in principle the same race exists.
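A minimal sketch of the userspace coordination described above, not the backend's actual code: each CAPTURE buffer carries an explicit state, and only a `FREE` buffer may be re-QBUF'd. All names (`cap_buf_state`, `cap_buf`, `cap_pool_pick_lru`) are hypothetical, and how a buffer gets from `EXPORTED` back to `FREE` (surface destroy, a hold timeout, something else) is an assumption left open here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-buffer lifecycle states; names are illustrative only. */
enum cap_buf_state {
    CAP_BUF_FREE,      /* safe to VIDIOC_QBUF */
    CAP_BUF_IN_DECODE, /* queued to the kernel, decode pending */
    CAP_BUF_EXPORTED   /* a consumer may still hold an EXPBUF'd fd */
};

struct cap_buf {
    enum cap_buf_state state;
    uint64_t last_release; /* logical tick of the last transition to FREE */
};

/*
 * Pick the least-recently-released FREE buffer.
 * Returns the buffer index, or -1 if every buffer is still in use,
 * in which case the caller has to wait instead of re-QBUFing.
 */
int cap_pool_pick_lru(const struct cap_buf *pool, size_t count)
{
    int best = -1;

    for (size_t i = 0; i < count; i++) {
        if (pool[i].state != CAP_BUF_FREE)
            continue;
        if (best < 0 || pool[i].last_release < pool[(size_t)best].last_release)
            best = (int)i;
    }
    return best;
}
```

Returning -1 rather than falling back to an exported buffer is the design choice that matters: the decode stalls instead of overwriting memory the consumer may still be reading.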
## Predecessor close-out summary (iteration 1 → iteration 2)
### State that carries forward (re-verified in iteration 1, current)
- **Hardware**: ohm RK3568, kernel 6.19.10-danctnix1-1-pinetab2, hantro G1/G2 on `/dev/video1` + `/dev/media0`
- **Userspace**: libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` (sha256 dcf8a7170fbd49bb...), 1920×1080 H.264 24 fps
- **Build harness**: `meson setup --buildtype=release && ninja` directly on ohm; deploy to `/usr/lib/dri/v4l2_request_drv_video.so`
- **Live test rig**: operator's Plasma 6 Wayland session on tty7, env at `XDG_RUNTIME_DIR=/run/user/1001`, `WAYLAND_DISPLAY=wayland-0`, `DISPLAY=:0`, `XAUTHORITY=/run/user/1001/xauth_ilDMqm`, `DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus`. SSH-driven launches need `MOZ_ENABLE_WAYLAND=1` for Firefox; mpv works via either `--vo=gpu` (Wayland EGL) or `--vo=null` (headless probe).
- **Debug instrumentation in fork master** (commits 2517a12 / a047926 / 7da2b27 / 92f5b25 / 21ae311 / fdfee2d / 6be3f3b): ENTER logging on most major libva entry points + ExportSurfaceHandle descriptor dump + CAPTURE plane-0 hex/variance dump with msync cache-fix. Stays in place throughout this iteration to support debugging; the clean sweep happens only after iteration 2 closes (i.e. in iteration 3), per Phase 5 review.
### Data that does NOT carry forward (re-acquire per `feedback_replicate_baseline_first.md`)
- Iteration 1 saw `64/300` drops in 12s for vaapi-copy, `29/300` for vaapi (DMA-BUF). Not anchors — re-measure with consistent rig + binding cells in iteration 2.
- "Firefox sustains 100+ slice_header parses" — true at end of iteration 1, but the 100-count was incidental (test ran 14s); not a binding cell.
- Visual stutter intensity for vaapi (DMA-BUF) — re-measure under iteration 2's binding-cell capture.
### Open questions inherited from Sonnet's Phase 5 review (iteration 1)
7.1 **VIDIOC_G_EXT_CTRLS EACCES on this rig** — try moving readback to before `MEDIA_REQUEST_IOC_QUEUE`. If still EACCES, file a kernel-side investigation note and remove the readback as deadweight.
7.2 **`num_ref_idx_l0/l1_default_active_minus1` from VASlice** is wrong for multi-slice streams with explicit per-slice override. Defer until a real test stream surfaces it.
7.3 **`SET_FORMAT_OF_OUTPUT_ONCE` removal completeness** — replaced with `LAST_OUTPUT_WIDTH/HEIGHT` tracking in iteration 1 commit `37c0e72`. Iteration 1 did NOT verify multi-context safety. Iteration 2 multi-resolution work covers this.
7.4 **`// HACK` block in surface.c** — set OUTPUT format from inside CreateSurfaces2 with hardcoded `V4L2_PIX_FMT_H264_SLICE`. Wrong for MPEG-2 surfaces. Defer until MPEG-2 testing exercises it.
7.5 **Firefox seek-to-non-IDR** — verify Firefox handles a stream where the first played frame isn't an IDR. Add to test corpus.
7.6 **fourier_attribution cell A mechanism** is chromium-internal V4L2 backend, not libva. Documented in iteration 1 close; no further action.
## Tooling and measurement-instrument inventory (live verification)
- `strace -f -e ioctl,openat,close` for libva-side V4L2 ioctl tracing — confirmed working iter1
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle — confirmed working iter1
- `sudo dmesg -w` for kernel-side warnings during decode — generates no output for our test cases (hantro doesn't WARN at the severity level we'd see), but useful for any kernel-side panic
- `sudo lsof /dev/video1` + `sudo ls -l /proc/$RDD_PID/fd/` for current libva-side fd ownership
- `mpv --frames=300 --vo=gpu` with stderr capture for V: timeline + drop count
- Firefox MOZ_LOG=PlatformDecoderModule:5 + grep for ProcessDecode + Broadcast support lines
- Operator visual inspection on real screen for stutter/artifact detection (load-bearing for "boolean correctness")
For this iteration specifically, we ALSO need:
- **Per-CAPTURE-buffer state tracking** in the libva backend itself: which surfaces have outstanding EXPBUF'd fds, which buffers are "owned" by consumer, which are free for re-QBUF
- **mpv vaapi (no -copy) frame consistency check**: a way to detect "frame N's content was overwritten by frame N+M before display." Could be: render to image-sequence (`--vo=image-sequence`), inspect dumped frames for content discontinuity
- **Strace QBUF re-queue pattern** for mpv: time between consecutive QBUFs of same buffer index. If pattern shows re-QBUF happening within ~33ms (one frame at 30fps display), the lifecycle race is structurally guaranteed to fire.
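The re-queue timing check in the last bullet can be sketched as a small post-processing helper. It assumes the strace output has already been reduced to (timestamp, buffer index) pairs for `VIDIOC_QBUF` on the CAPTURE queue; the struct, the function name and the `MAX_BUFS` bound are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BUFS 32 /* assumed upper bound on CAPTURE buffer indices */

struct qbuf_event {
    int64_t t_ms;   /* timestamp of the VIDIOC_QBUF call, in milliseconds */
    unsigned index; /* CAPTURE buffer index that was queued */
};

/*
 * Return the smallest gap between two consecutive QBUFs of the same
 * buffer index, or -1 if no index is queued twice.
 * Events must be sorted by time.
 */
int64_t min_requeue_interval_ms(const struct qbuf_event *ev, size_t n)
{
    int64_t last[MAX_BUFS] = {0};
    int seen[MAX_BUFS] = {0};
    int64_t best = -1;

    for (size_t i = 0; i < n; i++) {
        unsigned idx = ev[i].index;

        if (idx >= MAX_BUFS)
            continue; /* ignore indices outside the assumed bound */
        if (seen[idx]) {
            int64_t gap = ev[i].t_ms - last[idx];
            if (best < 0 || gap < best)
                best = gap;
        }
        last[idx] = ev[i].t_ms;
        seen[idx] = 1;
    }
    return best;
}
```

A result at or below the consumer's hold time would indicate that the race window is open; the threshold to compare against is a measurement question, not something this helper decides.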
## In-session baseline anchor (Phase 3 input — to capture FIRST in this iteration)
Per `feedback_dev_process.md` Phase 0 + `feedback_replicate_baseline_first.md`, before any iteration 2 implementation work, capture:
1. **Real-VO mpv `--hwdec=vaapi --vo=gpu` reference run** (operator inspection): describe stutter pattern, count visual artifacts per second, log mpv stderr for drop count. Already informally captured in iteration 1 close — formalize for binding cell here.
2. **Strace VIDIOC_QBUF/DQBUF re-queue timing** for the same run: how often does the same buffer index get QBUF'd back-to-back? What's the typical interval?
3. **lsof tracking**: how many EXPBUF'd fds does mpv have outstanding at the moment of stutter? If it's >1, the consumer is holding buffers across decode cycles, which is the precondition for the lifecycle race.
## In-scope (LOCKED 2026-05-04 for iteration 2)
- Buffer-pool refactor in libva-v4l2-request `src/surface.c` and `src/picture.c`: decouple CAPTURE buffer pool from surface count. Track per-buffer free/exported/in-decode state.
- WSI alignment in `src/surface.c::ExportSurfaceHandle`: round up reported pitch to 64-byte alignment OR set explicit modifier so Mesa's WSI accepts.
- Multi-resolution kernel-state recovery: detect format-state mismatch on the kernel side (e.g., G_FMT returns 48×48 when we expect 1920×1088) and force REQBUFS(0)+S_FMT recovery.
- Iteration 2 Phase 5 second-model review (Sonnet) before any implementation lands.
- Iteration 2 Phase 7 verification across the same 4-consumer matrix as iteration 1, plus a multi-video Firefox session, plus a long-playing mpv vaapi (no -copy) run.
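For the multi-resolution recovery item, the mismatch test itself is simple. A sketch, assuming the driver reports macroblock-aligned (16-pixel) CAPTURE dimensions, as the 1920×1088 example suggests; the function names are hypothetical, and the `got_*` values would come from a `VIDIOC_G_FMT` on the CAPTURE queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round v up to the next multiple of a; a must be a power of two. */
uint32_t align_up_u32(uint32_t v, uint32_t a)
{
    return (v + a - 1) & ~(a - 1);
}

/*
 * Return true when the CAPTURE format reported by the kernel does not
 * match what the current picture size implies, e.g. a leftover 48x48
 * from an earlier session. Assumes 16-pixel macroblock alignment.
 */
bool capture_format_is_stale(uint32_t got_w, uint32_t got_h,
                             uint32_t pic_w, uint32_t pic_h)
{
    uint32_t want_w = align_up_u32(pic_w, 16);
    uint32_t want_h = align_up_u32(pic_h, 16);

    return got_w != want_w || got_h != want_h;
}
```

When this returns true, the REQBUFS(0)+S_FMT recovery path named above would be the next step; that part touches the device and is not sketched here.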
## Out-of-scope (LOCKED 2026-05-04 for iteration 2)
- New codecs (MPEG-2, VP8, VP9, AV1, HEVC) — iteration 1's H.264-only scope holds.
- New hardware (fresnel RK3399, ampere/boltzmann RK3588) — separate iteration after ohm path is stable.
- Performance metrics binding cells — explicitly deferred to iteration 3 (perf comparison: SW vs HW).
- Bootlin upstreaming — `feedback_no_upstream.md` holds; no PRs unless explicitly tasked.
- Cleanup of DEBUG instrumentation patches — Sonnet's Phase 5 review said "wait until iteration 1 + 2 work is complete." Iteration 2 keeps the debug for diagnosis; iteration 3 sweeps.
- Multi-process / multi-context libva use (e.g., Firefox running while mpv runs) — defer until basic single-context lifecycle is solid.
## Phase 1 success criterion (will lock when Phase 0 completes)
Pre-lock draft:
- mpv `--hwdec=vaapi --vo=gpu` plays bbb_1080p30 for ≥60s without operator-visible stutter or "back and forth" (binding cell: operator inspection)
- Drop count < 100 in 60s (rig-normalized; iteration 1 saw 29 drops in 12s = 145/min, which exceeds this bound, so passing requires a real improvement over the iteration 1 rate)
- Firefox 150 sustained engagement on mozilla.org main page (multi-resolution video sequence)
- vainfo + mpv vaapi-copy + chromium-fourier 149 → no regression
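The rig-normalized drop bound compares runs of different lengths by scaling each count to a common window. A small sketch of that arithmetic (the function name is hypothetical):

```c
#include <stdint.h>

/*
 * Scale a drop count observed over run_s seconds to a window of
 * window_s seconds, rounded to the nearest integer.
 * Returns 0 for a zero-length run.
 */
uint32_t drops_per_window(uint32_t drops, uint32_t run_s, uint32_t window_s)
{
    if (run_s == 0)
        return 0;
    return (drops * window_s + run_s / 2) / run_s;
}
```

With the figures recorded so far, 29 drops in 12s scales to 145 per 60s, and the Phase 3 anchor of 91 drops in 14s scales to 390 per 60s.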
## What Phase 0 will deliver (regardless of detail)
1. **In-session baseline capture** of the current stutter behavior (the Phase 3 anchor for iteration 2). Specifically: visual confirmation, drop rate, strace-side QBUF re-queue timing.
2. **Code-side situation analysis**: read libva-v4l2-request's surface/buffer lifecycle and identify the specific touch points for the buffer-pool refactor. Map to FFmpeg hwcontext_v4l2request and GStreamer v4l2codecs reference patterns.
3. **WSI alignment investigation**: confirm Mesa's actual alignment requirement (try 64, 128, 256), confirm hantro's actual buffer alignment (read kernel hantro_drv.c). Find the right reported pitch.
4. **Iteration 2 worklist with binding cells**.
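For the WSI alignment investigation in item 3, the candidate pitches can be tabulated up front. A sketch of the round-up arithmetic only, assuming an NV12 luma plane at one byte per pixel so that the unpadded pitch equals the width; `align_pitch` is a hypothetical helper, and which alignment Mesa actually requires is still the open question.

```c
#include <stdint.h>

/* Round pitch up to the next multiple of alignment (alignment > 0). */
uint32_t align_pitch(uint32_t pitch, uint32_t alignment)
{
    return (pitch + alignment - 1) / alignment * alignment;
}
```

For an 864-byte pitch this gives 896 at 64- and 128-byte alignment and 1024 at 256; a 1920-byte pitch is already a multiple of 64 and 128, and becomes 2048 only at 256. Note that a rounded-up pitch is only valid if it matches the stride the hardware really writes with, so the value has to be checked against the `bytesperline` the kernel reports rather than simply substituted.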