libva-multiplanar/phase0_findings_iter2.md
marfrit 9b1d7737cd Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.

Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.

Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.

Three architecture options analyzed:
  A. More buffers + LRU recycling (cheapest, statistical mitigation)
  B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
     kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
  C. Kernel patch to enforce QBUF rejection (out of campaign scope)

Picking Option A for iteration 2.

Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.

Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
  Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
  Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
  Fix 3: decoupled CAPTURE buffer pool with LRU recycling
         (~150-200 lines, the load-bearing fix).

Phase 5 sonnet review BEFORE Fix 3 implementation.

Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:55:31 +00:00


Iteration 2 — Phase 0 (substrate / motivation / inventory)

Opens 2026-05-04 immediately after iteration 1 close (phase8_iteration1_close.md).

Locked research question (iteration 2)

"Make libva-v4l2-request's DMA-BUF (vaapi-no--copy) path render decoded H.264 video without visual artifacts on direct-render consumers (mpv --vo=gpu), and handle the multi-video / multi-resolution session shape that real-world consumers (Firefox loading typical web pages) generate, on PineTab2 RK3568 hantro G1/G2."

This iteration is hardening, not new feature work. Iteration 1 made the decode + libva engagement correct. The remaining failure modes are all in the buffer-lifecycle, format-state, and pitch-alignment layers — all of which only manifested AFTER the decode landed real frames.

Pass/fail (boolean):

  • mpv --hwdec=vaapi --vo=gpu plays bbb_1080p30_h264.mp4 for ≥30s without visible stutter (operator inspection on real screen)
  • Firefox 150 plays the Mozilla homepage's animated 864-wide intro videos without MESA: error: WSI pitch not properly aligned falling back to SW
  • A multi-video session (Firefox playing several videos at different resolutions in sequence) does not corrupt kernel CAPTURE format state (no fmt_width=48 fmt_height=48 regression)

Mechanism the question targets

Iteration 1 established the libva backend now correctly:

  • engages hantro G1/G2 (decode produces real NV12 pixels)
  • exports DMA-BUF descriptor in the SEPARATE_LAYERS shape Firefox needs
  • re-sets OUTPUT format on resolution change for the mpv-probe pattern

But the remaining failure modes share a common underlying issue: V4L2's per-buffer state is decoupled from the DMA-BUF refcount of the EXPBUF'd fd. The kernel allows VIDIOC_QBUF to re-queue a buffer even when an external consumer holds an EXPBUF'd fd to the same physical memory. Userspace MUST NOT re-QBUF an in-use buffer, OR it must allocate enough buffers that re-use never happens within the consumer's hold window.

The current libva-v4l2-request code (post-iteration-1) binds CAPTURE buffer index 1:1 with the surface. When mpv recycles a surface (asks to decode frame N+M into a previously-used surface), our backend re-QBUFs the same physical buffer. If the consumer hasn't yet released its EXPBUF'd fd from frame N's render, the kernel writes frame N+M's content into the buffer while the consumer is still reading frame N from it.

Visible result on mpv --vo=gpu: frames overlap, the bunny appears to move backward and forward, "extreme stuttering and back and forth" per operator description.

Visible result on Firefox: not observed (Firefox's compositor pool buffers frames briefly and dups the fd, decoupling display timing from EXPBUF'd-fd lifetime), but in principle the same race exists.
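The rule above (never re-QBUF a buffer a consumer may still be reading, or keep enough buffers that reuse falls outside the hold window) can be modelled as a small pool with per-buffer state. The following is a minimal sketch, not the backend's actual code: `POOL_SIZE`, the struct layout, and every function name are hypothetical, and it assumes the backend has some signal for "consumer is done" (for example, the surface being recycled or destroyed), which V4L2 itself does not provide.

```c
#include <stdint.h>

#define POOL_SIZE 16 /* hypothetical; chosen larger than the surface count */

enum buf_state {
    BUF_FREE,      /* no consumer fd known to be outstanding; safe to QBUF */
    BUF_IN_DECODE, /* handed to the kernel, decode pending */
    BUF_EXPORTED,  /* consumer may still read it through an EXPBUF'd fd */
};

struct capture_buf {
    enum buf_state state;
    uint64_t released_at; /* logical tick of the most recent release */
};

struct capture_pool {
    struct capture_buf bufs[POOL_SIZE];
    uint64_t tick;
};

void pool_init(struct capture_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        p->bufs[i].state = BUF_FREE;
        p->bufs[i].released_at = 0;
    }
    p->tick = 0;
}

/* Pick the FREE buffer released longest ago (LRU); returns -1 if none is free. */
int pool_acquire(struct capture_pool *p)
{
    int best = -1;
    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->bufs[i].state != BUF_FREE)
            continue;
        if (best < 0 || p->bufs[i].released_at < p->bufs[best].released_at)
            best = i;
    }
    if (best >= 0)
        p->bufs[best].state = BUF_IN_DECODE;
    return best;
}

/* Decode finished and the buffer was handed to a consumer. */
void pool_mark_exported(struct capture_pool *p, int idx)
{
    p->bufs[idx].state = BUF_EXPORTED;
}

/* Assumed "consumer is done" signal; stamps the release order for LRU. */
void pool_release(struct capture_pool *p, int idx)
{
    p->bufs[idx].state = BUF_FREE;
    p->bufs[idx].released_at = ++p->tick;
}
```

Picking the least recently released buffer maximizes the time between a release and the next write to the same memory, which widens the margin when the release signal is only approximate. It does not remove the race if the consumer keeps reading after the assumed release point.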

Predecessor close-out summary (iteration 1 → iteration 2)

State that carries forward (re-verified in iteration 1, current)

  • Hardware: ohm RK3568, kernel 6.19.10-danctnix1-1-pinetab2, hantro G1/G2 on /dev/video1 + /dev/media0
  • Userspace: libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
  • Test fixture: /home/mfritsche/fourier-test/bbb_1080p30_h264.mp4 (sha256 dcf8a7170fbd49bb...), 1920×1080 H.264 24 fps
  • Build harness: meson setup --buildtype=release && ninja directly on ohm; deploy to /usr/lib/dri/v4l2_request_drv_video.so
  • Live test rig: operator's Plasma 6 Wayland session on tty7, env at XDG_RUNTIME_DIR=/run/user/1001, WAYLAND_DISPLAY=wayland-0, DISPLAY=:0, XAUTHORITY=/run/user/1001/xauth_ilDMqm, DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus. SSH-driven launches need MOZ_ENABLE_WAYLAND=1 for Firefox; mpv works via either --vo=gpu (Wayland EGL) or --vo=null (headless probe).
  • Debug instrumentation in fork master (commits 2517a12 / a047926 / 7da2b27 / 92f5b25 / 21ae311 / fdfee2d / 6be3f3b): ENTER logging on most major libva entry points + ExportSurfaceHandle descriptor dump + CAPTURE plane-0 hex/variance dump with msync cache-fix. Stays in this iteration to support debugging; the clean sweep happens only after iteration 2 closes (iteration 3), per Phase 5 review.

Data that does NOT carry forward (re-acquire per feedback_replicate_baseline_first.md)

  • Iteration 1 saw 64/300 drops in 12s for vaapi-copy, 29/300 for vaapi (DMA-BUF). Not anchors — re-measure with consistent rig + binding cells in iteration 2.
  • "Firefox sustains 100+ slice_header parses" — true at end of iteration 1, but the 100-count was incidental (test ran 14s); not a binding cell.
  • Visual stutter intensity for vaapi (DMA-BUF) — re-measure under iteration 2's binding-cell capture.

Open questions inherited from Sonnet's Phase 5 review (iteration 1)

  • 7.1 VIDIOC_G_EXT_CTRLS EACCES on this rig: try moving readback to before MEDIA_REQUEST_IOC_QUEUE. If still EACCES, file a kernel-side investigation note and remove the readback as deadweight.
  • 7.2 num_ref_idx_l0/l1_default_active_minus1 from VASlice is wrong for multi-slice streams with explicit per-slice override. Defer until a real test stream surfaces it.
  • 7.3 SET_FORMAT_OF_OUTPUT_ONCE removal completeness: replaced with LAST_OUTPUT_WIDTH/HEIGHT tracking in iteration 1 commit 37c0e72. Iteration 1 did NOT verify multi-context safety. Iteration 2 multi-resolution work covers this.
  • 7.4 // HACK block in surface.c: sets OUTPUT format from inside CreateSurfaces2 with hardcoded V4L2_PIX_FMT_H264_SLICE. Wrong for MPEG-2 surfaces. Defer until MPEG-2 testing exercises it.
  • 7.5 Firefox seek-to-non-IDR: verify Firefox handles a stream where the first played frame isn't an IDR. Add to test corpus.
  • 7.6 fourier_attribution cell A mechanism is chromium-internal V4L2 backend, not libva. Documented in iteration 1 close; no further action.

Tooling and measurement-instrument inventory (live verification)

  • strace -f -e ioctl,openat,close for libva-side V4L2 ioctl tracing — confirmed working iter1
  • sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/* for kernel-side V4L2/vb2 lifecycle — confirmed working iter1
  • sudo dmesg -w for kernel-side warnings during decode — generates no output for our test cases (hantro doesn't WARN at the severity level we'd see), but useful for any kernel-side panic
  • sudo lsof /dev/video1 + sudo ls -l /proc/$RDD_PID/fd/ for current libva-side fd ownership
  • mpv --frames=300 --vo=gpu with stderr capture for V: timeline + drop count
  • Firefox MOZ_LOG=PlatformDecoderModule:5 + grep for ProcessDecode + Broadcast support lines
  • Operator visual inspection on real screen for stutter/artifact detection (load-bearing for "boolean correctness")

For this iteration specifically, we ALSO need:

  • Per-CAPTURE-buffer state tracking in the libva backend itself: which surfaces have outstanding EXPBUF'd fds, which buffers are "owned" by consumer, which are free for re-QBUF
  • mpv vaapi (no -copy) frame consistency check: a way to detect "frame N's content was overwritten by frame N+M before display." Could be: render to image-sequence (--vo=image-sequence), inspect dumped frames for content discontinuity
  • Strace QBUF re-queue pattern for mpv: time between consecutive QBUFs of same buffer index. If pattern shows re-QBUF happening within ~33ms (one frame at 30fps display), the lifecycle race is structurally guaranteed to fire.
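The re-queue timing check in the last bullet reduces to tracking, per buffer index, the gap between consecutive QBUFs and comparing the shortest gap against the consumer's hold window. A minimal sketch follows, assuming timestamps in microseconds taken either from strace output or from a clock read inside the backend; `MAX_BUFS` and all names here are hypothetical.

```c
#include <stdint.h>

#define MAX_BUFS 32            /* hypothetical upper bound on CAPTURE indices */
#define NO_INTERVAL UINT64_MAX /* sentinel: no same-index re-queue seen yet */

struct requeue_stats {
    uint64_t last_qbuf_us[MAX_BUFS]; /* time of the previous QBUF per index */
    uint8_t seen[MAX_BUFS];          /* 1 once an index has been queued */
    uint64_t min_interval_us;        /* shortest same-index gap observed */
};

void requeue_stats_init(struct requeue_stats *s)
{
    for (unsigned i = 0; i < MAX_BUFS; i++) {
        s->last_qbuf_us[i] = 0;
        s->seen[i] = 0;
    }
    s->min_interval_us = NO_INTERVAL;
}

/*
 * Record one QBUF of buffer `idx` at time `now_us`.
 * Returns the gap since the previous QBUF of the same index,
 * or NO_INTERVAL for the first QBUF of that index (or a bad index).
 */
uint64_t requeue_stats_record(struct requeue_stats *s, unsigned idx,
                              uint64_t now_us)
{
    uint64_t gap = NO_INTERVAL;

    if (idx >= MAX_BUFS)
        return NO_INTERVAL;
    if (s->seen[idx]) {
        gap = now_us - s->last_qbuf_us[idx];
        if (gap < s->min_interval_us)
            s->min_interval_us = gap;
    }
    s->seen[idx] = 1;
    s->last_qbuf_us[idx] = now_us;
    return gap;
}

/* Nonzero if some buffer was re-queued faster than the hold window. */
int requeue_within_hold(const struct requeue_stats *s, uint64_t hold_us)
{
    return s->min_interval_us != NO_INTERVAL && s->min_interval_us < hold_us;
}
```

Calling `requeue_within_hold(&stats, 33000)` after a capture run answers the ~33ms question directly. The hold window itself is an input to the check and has to be measured separately.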

In-session baseline anchor (Phase 3 input — to capture FIRST in this iteration)

Per feedback_dev_process.md Phase 0 + feedback_replicate_baseline_first.md, before any iteration 2 implementation work, capture:

  1. Real-VO mpv --hwdec=vaapi --vo=gpu reference run (operator inspection): describe stutter pattern, count visual artifacts per second, log mpv stderr for drop count. Already informally captured in iteration 1 close — formalize for binding cell here.
  2. Strace VIDIOC_QBUF/DQBUF re-queue timing for the same run: how often does the same buffer index get QBUF'd back-to-back? What's the typical interval?
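  3. lsof tracking: how many EXPBUF'd fds does mpv have outstanding at the moment of stutter? If it's >1, the consumer is holding buffers across decode cycles; the lifecycle race is confirmed once one of those held buffer indices is seen being re-QBUF'd.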

In-scope (LOCKED 2026-05-04 for iteration 2)

  • Buffer-pool refactor in libva-v4l2-request src/surface.c and src/picture.c: decouple CAPTURE buffer pool from surface count. Track per-buffer free/exported/in-decode state.
  • WSI alignment in src/surface.c::ExportSurfaceHandle: round up reported pitch to 64-byte alignment OR set explicit modifier so Mesa's WSI accepts.
  • Multi-resolution kernel-state recovery: detect format-state mismatch on the kernel side (e.g., G_FMT returns 48×48 when we expect 1920×1088) and force REQBUFS(0)+S_FMT recovery.
  • Iteration 2 Phase 5 second-model review (Sonnet) before any implementation lands.
  • Iteration 2 Phase 7 verification across the same 4-consumer matrix as iteration 1, plus a multi-video Firefox session, plus a long-playing mpv vaapi (no -copy) run.
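The format-state mismatch detection in the multi-resolution bullet can be expressed as a pure comparison between what G_FMT reports and the coded size the stream implies. The sketch below assumes the expected coded size is the stream size rounded up to the 16-pixel H.264 macroblock grid, which matches the 1920×1088 figure above; whether hantro reports exactly that on every resolution is something to verify on the rig. All names are hypothetical, and the recovery itself (REQBUFS(0)+S_FMT) is left out.

```c
#include <stdbool.h>
#include <stdint.h>

struct fmt_dims {
    uint32_t width;
    uint32_t height;
};

/* Round v up to the next multiple of a (a must be non-zero). */
uint32_t round_up_u32(uint32_t v, uint32_t a)
{
    return (v + a - 1) / a * a;
}

/* Assumed expected CAPTURE size: stream size on the 16x16 macroblock grid. */
struct fmt_dims coded_dims(uint32_t stream_w, uint32_t stream_h)
{
    struct fmt_dims d;
    d.width = round_up_u32(stream_w, 16);
    d.height = round_up_u32(stream_h, 16);
    return d;
}

/* True when the kernel-reported CAPTURE format does not match the stream. */
bool capture_fmt_needs_recovery(struct fmt_dims reported,
                                uint32_t stream_w, uint32_t stream_h)
{
    struct fmt_dims want = coded_dims(stream_w, stream_h);
    return reported.width != want.width || reported.height != want.height;
}
```

Keeping the check separate from the ioctl calls makes it testable without hardware: the 48×48 regression case and the healthy 1920×1088 case can both be exercised as plain function calls.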

Out-of-scope (LOCKED 2026-05-04 for iteration 2)

  • New codecs (MPEG-2, VP8, VP9, AV1, HEVC) — iteration 1's H.264-only scope holds.
  • New hardware (fresnel RK3399, ampere/boltzmann RK3588) — separate iteration after ohm path is stable.
  • Performance metrics binding cells — explicitly deferred to iteration 3 (perf comparison: SW vs HW).
  • Bootlin upstreaming — feedback_no_upstream.md holds; no PRs unless explicitly tasked.
  • Cleanup of DEBUG instrumentation patches — Sonnet's Phase 5 review said "wait until iteration 1 + 2 work is complete." Iteration 2 keeps the debug for diagnosis; iteration 3 sweeps.
  • Multi-process / multi-context libva use (e.g., Firefox running while mpv runs) — defer until basic single-context lifecycle is solid.

Phase 1 success criterion (will lock when Phase 0 completes)

Pre-lock draft:

  • mpv --hwdec=vaapi --vo=gpu plays bbb_1080p30 for ≥60s without operator-visible stutter or "back and forth" (binding cell: operator inspection)
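  • Drop count < 100 in 60s (rig-normalized). Iteration 1 saw 29 drops in 12s, which scales to 145/min and exceeds this bound, so meeting it requires a real improvement over iteration 1, not just no regression.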
  • Firefox 150 sustained engagement on mozilla.org main page (multi-resolution video sequence)
  • vainfo + mpv vaapi-copy + chromium-fourier 149 → no regression
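Because the runs in this document use different capture windows (12s, 14s, 60s), comparing drop counts needs a rate conversion. A small sketch of that normalization, with hypothetical helper names and integer rounding to the nearest drop:

```c
#include <stdint.h>

/* Scale a drop count observed over window_s seconds to a 60-second window. */
uint32_t drops_per_minute(uint32_t drops, uint32_t window_s)
{
    /* Adding window_s / 2 before dividing rounds to nearest, not down. */
    return (drops * 60u + window_s / 2u) / window_s;
}

/* Nonzero if the normalized rate is strictly below the per-minute budget. */
int within_drop_budget(uint32_t drops, uint32_t window_s,
                       uint32_t budget_per_min)
{
    return drops_per_minute(drops, window_s) < budget_per_min;
}
```

For instance, `drops_per_minute(29, 12)` evaluates to 145, and a 60-second run is passed through unchanged.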

What Phase 0 will deliver (regardless of detail)

  1. In-session baseline capture of the current stutter behavior (the Phase 3 anchor for iteration 2). Specifically: visual confirmation, drop rate, strace-side QBUF re-queue timing.
  2. Code-side situation analysis: read libva-v4l2-request's surface/buffer lifecycle and identify the specific touch points for the buffer-pool refactor. Map to FFmpeg hwcontext_v4l2request and GStreamer v4l2codecs reference patterns.
  3. WSI alignment investigation: confirm Mesa's actual alignment requirement (try 64, 128, 256), confirm hantro's actual buffer alignment (read kernel hantro_drv.c). Find the right reported pitch.
  4. Iteration 2 worklist with binding cells.
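For the WSI alignment investigation in item 3, the candidate alignments can be tabulated before touching the driver. The sketch below assumes an unpadded 8-bit NV12 luma plane, so the pitch in bytes equals the width in pixels (864 for the Firefox case); that assumption, and the helper names, are not confirmed by the source. It also only computes numbers: a rounded-up pitch is only valid to report if the buffer really is laid out with that stride.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round value up to a multiple of alignment (alignment must be a power of two). */
uint32_t align_up_pow2(uint32_t value, uint32_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

/* True if pitch is already a multiple of the power-of-two alignment. */
bool pitch_is_aligned(uint32_t pitch, uint32_t alignment)
{
    return (pitch & (alignment - 1)) == 0;
}
```

Under that assumption an 864-byte pitch is a multiple of 32 but not of 64, and rounds up to 896 for both 64 and 128 and to 1024 for 256. A 1920-byte pitch already satisfies 64 and 128, so a 64-byte requirement would be consistent with 864-wide videos failing where 1920-wide ones do not.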