Iteration 2 Phase 0-4: substrate, situation analysis, baseline anchor, plan
Iteration 2 opens. Hardening, not new feature work.
Phase 0 (phase0_findings_iter2.md): locked research question — make
DMA-BUF (vaapi no -copy) path render without artifacts on direct-
render consumers. Plus WSI alignment for 864-wide videos in Firefox
and multi-resolution kernel-state recovery.
Phase 2 (phase2_iter2_analysis.md): the bug origin is picture.c:375
re-QBUFing surface_object->destination_index every decode cycle
while the kernel CAPTURE buffer is still being read by the consumer
via an EXPBUF'd dma_buf fd. V4L2 doesn't enforce the constraint;
userspace must coordinate.
Three architecture options analyzed:
A. More buffers + LRU recycling (cheapest, statistical mitigation)
B. Per-buffer dma_buf-refcount-aware recycling (correct, requires
kernel changes or userspace V4L2_MEMORY_DMABUF rewrite)
C. Kernel patch to enforce QBUF rejection (out of campaign scope)
Picking Option A for iteration 2.
Phase 3 (in-session baseline anchor): mpv vaapi --vo=gpu shows
91 drops in 14s at 1080p, 9 CAPTURE buffer indices used (2-10),
~10-19 re-queues per buffer in 14s window. Per-buffer re-queue
interval ~875ms vs typical compositor hold ~50ms — race window
opens episodically.
Phase 4 (phase4_iter2_plan.md): three independent fixes ordered
cheapest-first.
Fix 1: invalidate LAST_OUTPUT_WIDTH cache on session teardown.
Fix 2: try DRM_FORMAT_MOD_INVALID for WSI compatibility.
Fix 3: decoupled CAPTURE buffer pool with LRU recycling
(~150-200 lines, the load-bearing fix).
Phase 5 Sonnet review BEFORE Fix 3 implementation.
Out-of-scope for iteration 2 (carry to iter 3): EACCES probe,
multi-slice num_ref_idx, HACK block MPEG-2 cleanup, seek-to-non-IDR,
DEBUG instrumentation cleanup, V4L2_MEMORY_DMABUF rewrite, perf
metrics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,110 @@
|
|||||||
|
# Iteration 2 — Phase 0 (substrate / motivation / inventory)
|
||||||
|
|
||||||
|
Opens 2026-05-04 immediately after iteration 1 close (`phase8_iteration1_close.md`).
|
||||||
|
|
||||||
|
## Locked research question (iteration 2)
|
||||||
|
|
||||||
|
> **"Make libva-v4l2-request's DMA-BUF (`vaapi`-no-`-copy`) path render decoded H.264 video without visual artifacts on direct-render consumers (mpv `--vo=gpu`), and handle the multi-video / multi-resolution session shape that real-world consumers (Firefox loading typical web pages) generate, on PineTab2 RK3568 hantro G1/G2."**
|
||||||
|
|
||||||
|
**This iteration is hardening, not new feature work.** Iteration 1 made the decode + libva engagement correct. The remaining failure modes are all in the buffer-lifecycle, format-state, and pitch-alignment layers — all of which only manifested AFTER the decode landed real frames.
|
||||||
|
|
||||||
|
Pass/fail (boolean):
|
||||||
|
- mpv `--hwdec=vaapi --vo=gpu` plays `bbb_1080p30_h264.mp4` for ≥30s **without visible stutter** (operator inspection on real screen)
|
||||||
|
- Firefox 150 plays the Mozilla homepage's animated 864-wide intro videos **without hitting `MESA: error: WSI pitch not properly aligned`** and falling back to SW
|
||||||
|
- A multi-video session (Firefox playing several videos at different resolutions in sequence) **does not corrupt kernel CAPTURE format state** (no `fmt_width=48 fmt_height=48` regression)
|
||||||
|
|
||||||
|
## Mechanism the question targets
|
||||||
|
|
||||||
|
Iteration 1 established the libva backend now correctly:
|
||||||
|
- engages hantro G1/G2 (decode produces real NV12 pixels)
|
||||||
|
- exports DMA-BUF descriptor in the SEPARATE_LAYERS shape Firefox needs
|
||||||
|
- re-sets OUTPUT format on resolution change for the mpv-probe pattern
|
||||||
|
|
||||||
|
But the remaining failure modes share a common underlying issue: **V4L2's per-buffer state is decoupled from the DMA-BUF refcount of the EXPBUF'd fd**. The kernel allows VIDIOC_QBUF to re-queue a buffer even when an external consumer holds an EXPBUF'd fd to the same physical memory. Userspace MUST NOT re-QBUF an in-use buffer, OR it must allocate enough buffers that re-use never happens within the consumer's hold window.
|
||||||
|
|
||||||
|
The current libva-v4l2-request code (post-iteration-1) binds CAPTURE buffer index 1:1 with the surface. When mpv recycles a surface (asks to decode frame N+M into a previously-used surface), our backend re-QBUFs the same physical buffer. If the consumer hasn't yet released its EXPBUF'd fd from frame N's render, the kernel writes frame N+M's content into the buffer **while the consumer is still reading frame N from it**.
|
||||||
|
|
||||||
|
Visible result on mpv `--vo=gpu`: frames overlap, the bunny appears to move backward and forward, "extreme stuttering and back and forth" per operator description.
|
||||||
|
|
||||||
|
Visible result on Firefox: not observed (Firefox's compositor pool buffers frames briefly and dups the fd, decoupling display timing from EXPBUF'd-fd lifetime), but in principle the same race exists.
|
||||||
|
|
||||||
|
## Predecessor close-out summary (iteration 1 → iteration 2)
|
||||||
|
|
||||||
|
### State that carries forward (re-verified in iteration 1, current)
|
||||||
|
|
||||||
|
- **Hardware**: ohm RK3568, kernel 6.19.10-danctnix1-1-pinetab2, hantro G1/G2 on `/dev/video1` + `/dev/media0`
|
||||||
|
- **Userspace**: libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1, Mesa 26.0.5
|
||||||
|
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` (sha256 dcf8a7170fbd49bb...), 1920×1080 H.264 24 fps
|
||||||
|
- **Build harness**: `meson setup --buildtype=release && ninja` directly on ohm; deploy to `/usr/lib/dri/v4l2_request_drv_video.so`
|
||||||
|
- **Live test rig**: operator's Plasma 6 Wayland session on tty7, env at `XDG_RUNTIME_DIR=/run/user/1001`, `WAYLAND_DISPLAY=wayland-0`, `DISPLAY=:0`, `XAUTHORITY=/run/user/1001/xauth_ilDMqm`, `DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus`. SSH-driven launches need `MOZ_ENABLE_WAYLAND=1` for Firefox; mpv works via either `--vo=gpu` (Wayland EGL) or `--vo=null` (headless probe).
|
||||||
|
- **Debug instrumentation in fork master** (commits 2517a12 / a047926 / 7da2b27 / 92f5b25 / 21ae311 / fdfee2d / 6be3f3b): ENTER logging on most major libva entry points + ExportSurfaceHandle descriptor dump + CAPTURE plane-0 hex/variance dump with msync cache-fix. Stays in this iteration to support debugging; clean sweep at iteration 2 close per Phase 5 review.
|
||||||
|
|
||||||
|
### Data that does NOT carry forward (re-acquire per `feedback_replicate_baseline_first.md`)
|
||||||
|
|
||||||
|
- Iteration 1 saw `64/300` drops in 12s for vaapi-copy, `29/300` for vaapi (DMA-BUF). Not anchors — re-measure with consistent rig + binding cells in iteration 2.
|
||||||
|
- "Firefox sustains 100+ slice_header parses" — true at end of iteration 1, but the 100-count was incidental (test ran 14s); not a binding cell.
|
||||||
|
- Visual stutter intensity for vaapi (DMA-BUF) — re-measure under iteration 2's binding-cell capture.
|
||||||
|
|
||||||
|
### Open questions inherited from Sonnet's Phase 5 review (iteration 1)
|
||||||
|
|
||||||
|
7.1 **VIDIOC_G_EXT_CTRLS EACCES on this rig** — try moving readback to before `MEDIA_REQUEST_IOC_QUEUE`. If still EACCES, file a kernel-side investigation note and remove the readback as deadweight.
|
||||||
|
7.2 **`num_ref_idx_l0/l1_default_active_minus1` from VASlice** is wrong for multi-slice streams with explicit per-slice override. Defer until a real test stream surfaces it.
|
||||||
|
7.3 **`SET_FORMAT_OF_OUTPUT_ONCE` removal completeness** — replaced with `LAST_OUTPUT_WIDTH/HEIGHT` tracking in iteration 1 commit `37c0e72`. Iteration 1 did NOT verify multi-context safety. Iteration 2 multi-resolution work covers this.
|
||||||
|
7.4 **`// HACK` block in surface.c** — set OUTPUT format from inside CreateSurfaces2 with hardcoded `V4L2_PIX_FMT_H264_SLICE`. Wrong for MPEG-2 surfaces. Defer until MPEG-2 testing exercises it.
|
||||||
|
7.5 **Firefox seek-to-non-IDR** — verify Firefox handles a stream where the first played frame isn't an IDR. Add to test corpus.
|
||||||
|
7.6 **fourier_attribution cell A mechanism** is chromium-internal V4L2 backend, not libva. Documented in iteration 1 close; no further action.
|
||||||
|
|
||||||
|
## Tooling and measurement-instrument inventory (live verification)
|
||||||
|
|
||||||
|
- `strace -f -e ioctl,openat,close` for libva-side V4L2 ioctl tracing — confirmed working iter1
|
||||||
|
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle — confirmed working iter1
|
||||||
|
- `sudo dmesg -w` for kernel-side warnings during decode — generates no output for our test cases (hantro doesn't WARN at the severity level we'd see), but useful for any kernel-side panic
|
||||||
|
- `sudo lsof /dev/video1` + `sudo ls -l /proc/$RDD_PID/fd/` for current libva-side fd ownership
|
||||||
|
- `mpv --frames=300 --vo=gpu` with stderr capture for V: timeline + drop count
|
||||||
|
- Firefox MOZ_LOG=PlatformDecoderModule:5 + grep for ProcessDecode + Broadcast support lines
|
||||||
|
- Operator visual inspection on real screen for stutter/artifact detection (load-bearing for "boolean correctness")
|
||||||
|
|
||||||
|
For this iteration specifically, we ALSO need:
|
||||||
|
- **Per-CAPTURE-buffer state tracking** in the libva backend itself: which surfaces have outstanding EXPBUF'd fds, which buffers are "owned" by consumer, which are free for re-QBUF
|
||||||
|
- **mpv vaapi (no -copy) frame consistency check**: a way to detect "frame N's content was overwritten by frame N+M before display." Could be: render to image-sequence (`--vo=image-sequence`), inspect dumped frames for content discontinuity
|
||||||
|
- **Strace QBUF re-queue pattern** for mpv: time between consecutive QBUFs of same buffer index. If pattern shows re-QBUF happening within ~33ms (one frame at 30fps display), the lifecycle race is structurally guaranteed to fire.
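
A minimal sketch of the re-queue timing check, assuming the strace log has already been parsed into (buffer index, timestamp) pairs sorted by time. The `qbuf_event` type, `MAX_CAPTURE_BUFFERS` and `min_requeue_interval_ms` are hypothetical names for illustration and are not part of libva-v4l2-request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CAPTURE_BUFFERS 32 /* assumed upper bound for this sketch */

/* One observed VIDIOC_QBUF on the CAPTURE queue (hypothetical record type). */
struct qbuf_event {
    unsigned int index; /* V4L2 buffer index */
    uint64_t time_ms;   /* timestamp parsed from the strace log */
};

/*
 * Returns the shortest interval (ms) between two consecutive QBUFs of the
 * same buffer index, or UINT64_MAX if no index is queued twice.
 * Events must be sorted by time.
 */
uint64_t min_requeue_interval_ms(const struct qbuf_event *events, size_t count)
{
    uint64_t last_seen[MAX_CAPTURE_BUFFERS];
    int seen[MAX_CAPTURE_BUFFERS] = { 0 };
    uint64_t shortest = UINT64_MAX;

    for (size_t i = 0; i < count; i++) {
        unsigned int idx = events[i].index;

        assert(idx < MAX_CAPTURE_BUFFERS);
        if (seen[idx]) {
            uint64_t delta = events[i].time_ms - last_seen[idx];

            if (delta < shortest)
                shortest = delta;
        }
        seen[idx] = 1;
        last_seen[idx] = events[i].time_ms;
    }
    return shortest;
}
```

Comparing the returned value against the ~33ms frame period gives the structural check described in the last bullet.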
|
||||||
|
|
||||||
|
## In-session baseline anchor (Phase 3 input — to capture FIRST in this iteration)
|
||||||
|
|
||||||
|
Per `feedback_dev_process.md` Phase 0 + `feedback_replicate_baseline_first.md`, before any iteration 2 implementation work, capture:
|
||||||
|
|
||||||
|
1. **Real-VO mpv `--hwdec=vaapi --vo=gpu` reference run** (operator inspection): describe stutter pattern, count visual artifacts per second, log mpv stderr for drop count. Already informally captured in iteration 1 close — formalize for binding cell here.
|
||||||
|
2. **Strace VIDIOC_QBUF/DQBUF re-queue timing** for the same run: how often does the same buffer index get QBUF'd back-to-back? What's the typical interval?
|
||||||
|
3. **lsof tracking**: how many EXPBUF'd fds does mpv have outstanding at the moment of stutter? If it's >1, lifecycle race confirmed.
|
||||||
|
|
||||||
|
## In-scope (LOCKED 2026-05-04 for iteration 2)
|
||||||
|
|
||||||
|
- Buffer-pool refactor in libva-v4l2-request `src/surface.c` and `src/picture.c`: decouple CAPTURE buffer pool from surface count. Track per-buffer free/exported/in-decode state.
|
||||||
|
- WSI alignment in `src/surface.c::ExportSurfaceHandle`: round up reported pitch to 64-byte alignment OR set explicit modifier so Mesa's WSI accepts.
|
||||||
|
- Multi-resolution kernel-state recovery: detect format-state mismatch on the kernel side (e.g., G_FMT returns 48×48 when we expect 1920×1088) and force REQBUFS(0)+S_FMT recovery.
|
||||||
|
- Iteration 2 Phase 5 second-model review (Sonnet) before any implementation lands.
|
||||||
|
- Iteration 2 Phase 7 verification across the same 4-consumer matrix as iteration 1, plus a multi-video Firefox session, plus a long-playing mpv vaapi (no -copy) run.
|
||||||
|
|
||||||
|
## Out-of-scope (LOCKED 2026-05-04 for iteration 2)
|
||||||
|
|
||||||
|
- New codecs (MPEG-2, VP8, VP9, AV1, HEVC) — iteration 1's H.264-only scope holds.
|
||||||
|
- New hardware (fresnel RK3399, ampere/boltzmann RK3588) — separate iteration after ohm path is stable.
|
||||||
|
- Performance metrics binding cells — explicitly deferred to iteration 3 (perf comparison: SW vs HW).
|
||||||
|
- Bootlin upstreaming — `feedback_no_upstream.md` holds; no PRs unless explicitly tasked.
|
||||||
|
- Cleanup of DEBUG instrumentation patches — Sonnet's Phase 5 review said "wait until iteration 1 + 2 work is complete." Iteration 2 keeps the debug for diagnosis; iteration 3 sweeps.
|
||||||
|
- Multi-process / multi-context libva use (e.g., Firefox running while mpv runs) — defer until basic single-context lifecycle is solid.
|
||||||
|
|
||||||
|
## Phase 1 success criterion (will lock when Phase 0 completes)
|
||||||
|
|
||||||
|
Pre-lock draft:
|
||||||
|
- mpv `--hwdec=vaapi --vo=gpu` plays bbb_1080p30 for ≥60s without operator-visible stutter or "back and forth" (binding cell: operator inspection)
|
||||||
|
- Drop count < 100 in 60s (rig-normalized; iteration 1 saw 29 drops in 12s ≈ 145/min, which is above this bound, so passing requires a real reduction in drops)
|
||||||
|
- Firefox 150 sustained engagement on mozilla.org main page (multi-resolution video sequence)
|
||||||
|
- vainfo + mpv vaapi-copy + chromium-fourier 149 → no regression
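
The drop-count bound depends on normalizing measurements taken over different windows. A small sketch of that normalization, using integer arithmetic; `drops_per_minute` is a hypothetical helper name.

```c
#include <assert.h>

/* Scale a drop count measured over window_s seconds to a per-minute rate. */
unsigned int drops_per_minute(unsigned int drops, unsigned int window_s)
{
    assert(window_s > 0);
    return drops * 60u / window_s;
}
```

With the figures quoted in these notes, 29 drops in 12s scales to 145 per minute, and the Phase 3 anchor of 91 drops in 14s scales to 390 per minute.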
|
||||||
|
|
||||||
|
## What Phase 0 will deliver (regardless of detail)
|
||||||
|
|
||||||
|
1. **In-session baseline capture** of the current stutter behavior (the Phase 3 anchor for iteration 2). Specifically: visual confirmation, drop rate, strace-side QBUF re-queue timing.
|
||||||
|
2. **Code-side situation analysis**: read libva-v4l2-request's surface/buffer lifecycle and identify the specific touch points for the buffer-pool refactor. Map to FFmpeg hwcontext_v4l2request and GStreamer v4l2codecs reference patterns.
|
||||||
|
3. **WSI alignment investigation**: confirm Mesa's actual alignment requirement (try 64, 128, 256), confirm hantro's actual buffer alignment (read kernel hantro_drv.c). Find the right reported pitch.
|
||||||
|
4. **Iteration 2 worklist with binding cells**.
|
||||||
@@ -0,0 +1,116 @@
|
|||||||
|
# Iteration 2 Phase 2 — situation analysis (DMA-BUF lifecycle, WSI alignment, multi-resolution)
|
||||||
|
|
||||||
|
## CAPTURE buffer lifecycle in libva-v4l2-request (current state)
|
||||||
|
|
||||||
|
Three locations that touch the CAPTURE buffer index ↔ surface binding:
|
||||||
|
|
||||||
|
1. **`src/surface.c::RequestCreateSurfaces2` line 201** — allocates `surfaces_count` CAPTURE buffers via `v4l2_create_buffers(...)`. Each gets a V4L2 buffer index `index_base..index_base+surfaces_count-1`.
|
||||||
|
2. **`src/surface.c::RequestCreateSurfaces2` line 280** — binds `surface_object->destination_index = index` for each surface. **Permanent 1:1 binding** for the surface's lifetime.
|
||||||
|
3. **`src/picture.c::RequestEndPicture` line 375** — re-QBUFs `surface_object->destination_index` for every decode cycle on that surface.
|
||||||
|
|
||||||
|
The bug:
|
||||||
|
- mpv allocates ~16 surfaces. Each gets a permanently bound CAPTURE buffer index 0..15.
|
||||||
|
- mpv decodes frame N into surface 0 → V4L2 QBUFs buffer 0 → kernel writes frame N into buffer 0 → DQBUF → mpv calls vaExportSurfaceHandle → gets fd to buffer 0 → renders frame N (compositor holds fd briefly).
|
||||||
|
- A few frames later (say frame N+5 in display order, but earlier in decode order due to B-frames), mpv decodes new content into surface 0 again → our backend re-QBUFs buffer 0 → **kernel writes frame N+5 into the same physical memory the compositor is still rendering frame N from**.
|
||||||
|
- Visible result on mpv `--vo=gpu`: stutter, "back and forth," frame content tearing/swapping. Operator-confirmed 2026-05-04.
|
||||||
|
- Not visible on mpv `--vo=null` because no rendering happens.
|
||||||
|
- Not visible on mpv `--hwdec=vaapi-copy` because libva-side `vaMapBuffer` does an explicit CPU copy out of the V4L2 buffer before returning to mpv, decoupling mpv's display pipeline from the V4L2 buffer lifecycle.
|
||||||
|
|
||||||
|
## Why the kernel doesn't enforce the constraint
|
||||||
|
|
||||||
|
`videobuf2-core.c::vb2_qbuf` permits requeue of a V4L2 buffer regardless of the dma_buf refcount on its EXPBUF'd fd. The two refcounts are decoupled:
|
||||||
|
- V4L2 queue state: dequeued ↔ queued (kernel writable)
|
||||||
|
- dma_buf refcount: alive while ANY fd references it
|
||||||
|
|
||||||
|
When userspace QBUFs a buffer that has external dma_buf refs, the kernel writes to the same physical memory; the consumer holding the dma_buf just sees its content change underfoot. **It's userspace's responsibility to coordinate.**
|
||||||
|
|
||||||
|
## Three viable architectures (in increasing implementation cost)
|
||||||
|
|
||||||
|
### Option A: more buffers + LRU recycling (cheapest, statistical mitigation)
|
||||||
|
|
||||||
|
Allocate `max(surfaces_count, MIN_POOL)` CAPTURE buffers (e.g., MIN_POOL=24). Track per-buffer "last exported at" timestamp. When recycling for a new decode, prefer the buffer with oldest last-exported timestamp. Allocates +8 buffers ≈ +24 MB at 1080p NV12.
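
The sizing rule and the memory estimate can be checked with a short sketch. It assumes a 1920-byte stride and the 1088-line coded height mentioned elsewhere in these notes; `capture_pool_size` and `nv12_buffer_bytes` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_POOL 24 /* example value from the text */

/* Pool size rule: never fewer buffers than surfaces, never fewer than MIN_POOL. */
unsigned int capture_pool_size(unsigned int surfaces_count)
{
    return surfaces_count > MIN_POOL ? surfaces_count : MIN_POOL;
}

/* NV12: full-resolution luma plane plus a half-height interleaved chroma plane. */
size_t nv12_buffer_bytes(size_t stride, size_t height)
{
    assert(height % 2 == 0); /* 4:2:0 chroma needs an even line count */
    return stride * height + stride * (height / 2);
}
```

For a 16-surface request this gives 24 buffers, so 8 extra buffers of 3,133,440 bytes each, or 25,067,520 bytes (about 23.9 MiB) in total, in line with the ≈ +24 MB estimate.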
|
||||||
|
|
||||||
|
Race window narrows but doesn't close. mpv at 24 fps decode + compositor at 60 fps display + 2-3 frame compositor pool = consumer holds ~3 frames × 16ms = ~50ms. With 24-buffer pool and oldest-first recycling, the recycled buffer was last exported ~24 × 41ms = ~1 second ago. Race window: 50ms vs 1s margin. Statistically safe for typical playback.
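
The margin estimate above is plain arithmetic and can be reproduced as follows. This sketch assumes strict round-robin recycling (every decode takes the oldest buffer) and a consumer that holds a fixed number of display refresh periods; both helper names are hypothetical.

```c
#include <assert.h>

/* Time the consumer keeps a frame: held_frames display refresh periods. */
unsigned int consumer_hold_ms(unsigned int held_frames, unsigned int display_fps)
{
    assert(display_fps > 0);
    return held_frames * 1000u / display_fps;
}

/* Age of the oldest buffer when it is recycled: one full trip around the pool. */
unsigned int recycle_age_ms(unsigned int pool_size, unsigned int decode_fps)
{
    assert(decode_fps > 0);
    return pool_size * 1000u / decode_fps;
}
```

Three held frames at 60 fps give 50ms, and a 24-buffer pool at 24 fps gives 1000ms, a factor of 20 between hold time and recycle age under these assumptions.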
|
||||||
|
|
||||||
|
Failure modes still possible: pause-then-resume (consumer holds frame longer); jank-induced compositor lag; multi-stream playback exhausting the pool. Acceptable for iteration 2's scope per `phase0_findings_iter2.md`.
|
||||||
|
|
||||||
|
### Option B: per-buffer dma_buf-refcount-aware recycling (correct, complex)
|
||||||
|
|
||||||
|
Track per CAPTURE buffer: V4L2 queue state + our own EXPBUF fd. On a recycle attempt we would close OUR fd and then check the kernel-side dma_buf refcount, but V4L2 exposes no public API for that. Would need either:
- A workaround such as dup'ing the fd and polling it for a signal that the consumer has released its reference. No such signal appears to exist, so this idea was dropped.
- Use V4L2_MEMORY_DMABUF + userspace allocation (gbm/etc.): substantial rewrite, inverts the buffer ownership model.
|
||||||
|
|
||||||
|
Real fix but iteration 3+ scope.
|
||||||
|
|
||||||
|
### Option C: kernel-side fix + userspace coordination (deepest)
|
||||||
|
|
||||||
|
Patch v4l2-core to expose dma_buf refcount via VIDIOC_QUERYBUF or similar, OR enforce QBUF rejection (EBUSY) when dma_buf refcount > 0. Out of campaign scope (`feedback_no_upstream`).
|
||||||
|
|
||||||
|
## Iteration 2 picks Option A.
|
||||||
|
|
||||||
|
Rationale: smallest code change, addresses the operator-visible stutter for the canonical case (mpv vaapi --vo=gpu single-stream), defers the architectural redesign to iteration 3+ when the FFmpeg / GStreamer reference implementations can be studied for prior art.
|
||||||
|
|
||||||
|
## WSI pitch alignment (independent issue)
|
||||||
|
|
||||||
|
`src/surface.c::RequestExportSurfaceHandle` sets:
|
||||||
|
```c
surface_descriptor->layers[0].pitch[0] = surface_object->destination_bytesperlines[0];
```
|
||||||
|
Where `destination_bytesperlines[0]` comes from `v4l2_get_format` on the kernel side — for a 864×480 video, kernel reports `bytesperline=864`. Mesa's WSI rejects with "WSI pitch not properly aligned" because the GPU compositor wants pitch aligned to 64 bytes (or higher: 128, 256 — Mali and panfrost both have constraints).
|
||||||
|
|
||||||
|
Two sub-options:
|
||||||
|
|
||||||
|
**Sub-option WSI-1**: round up reported pitch to the nearest 64-byte boundary in our exported descriptor. The kernel-side V4L2 buffer is sized per the actual `bytesperline=864`, which is not 64-aligned (864 = 13.5×64); the next 64-aligned value is 896 (14×64). If we report `pitch=896`, Mesa accepts but reads bytes 864..895 of each row as garbage (alignment padding the kernel doesn't actually allocate).
|
||||||
|
|
||||||
|
This produces visual artifacts at the right edge of frames if Mesa actually samples those bytes. Mesa typically clips to `width` so it's mostly OK. But still wrong technically.
|
||||||
|
|
||||||
|
**Sub-option WSI-2**: convince hantro to allocate buffers with a wider stride. Hantro's CAPTURE buffer stride is what the kernel computes based on the OUTPUT format we set. If we ask hantro to use a wider stride at S_FMT time, the buffer is wider, kernel reports the wider stride, our descriptor reports it, Mesa accepts, no garbage at right edge.
|
||||||
|
|
||||||
|
Looking at hantro driver code: `drivers/media/platform/verisilicon/hantro_v4l2.c::hantro_set_fmt_cap` sets the CAPTURE format from the OUTPUT format's resolution. The stride is computed by `hantro_pp_dec_check_format_compatibility` etc. There's no userspace knob to force a wider stride.
|
||||||
|
|
||||||
|
So sub-option WSI-1 (report wider pitch in descriptor than the kernel actually allocates) is the realistic fix. Pixel-edge artifacts are acceptable for a test corpus; production-grade fix needs kernel changes.
|
||||||
|
|
||||||
|
Implementation: in `ExportSurfaceHandle`, compute `aligned_pitch = (bytesperline + 63) & ~63` and use that for `pitch`. Adjust UV plane offset accordingly: `offsets[1] = aligned_pitch * format_height` instead of `bytesperline * format_height`. Wait, that's wrong too — the kernel's UV starts at `bytesperline * format_height`, not `aligned_pitch * format_height`. Reporting wrong offset would be worse.
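
The pitch rounding and the chroma-offset mismatch described above can be made concrete with a short sketch, assuming NV12 and the 864×480 example. `align_up` and `nv12_uv_offset` are hypothetical helper names, not existing functions in the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of align (align must be a power of two). */
uint32_t align_up(uint32_t value, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Byte offset of the NV12 chroma plane for a given luma pitch and height. */
uint32_t nv12_uv_offset(uint32_t pitch, uint32_t height)
{
    return pitch * height;
}
```

For 864×480 the kernel places the UV plane at 864 × 480 = 414,720 bytes, while an offset derived from the padded 896-byte pitch would be 430,080 bytes, which is 15,360 bytes past where the chroma data actually starts.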
|
||||||
|
|
||||||
|
Hmm. The right fix is: report the kernel's actual offsets and pitches AS-IS, but also add `DRM_FORMAT_MOD_INVALID` or `DRM_FORMAT_MOD_LINEAR_NONALIGN_*` modifier to tell Mesa "this isn't WSI-aligned, treat it as a non-display intermediate." Then mpv/Firefox would composite via texture upload rather than direct WSI, paying a perf cost but correct.
|
||||||
|
|
||||||
|
Actually `DRM_FORMAT_MOD_INVALID` means "modifier unknown, do whatever you'd do for implicit." `DRM_FORMAT_MOD_LINEAR` means "explicitly known to be linear." Currently we set `DRM_FORMAT_MOD_NONE` (= 0 = same value as MOD_LINEAR). If we set `DRM_FORMAT_MOD_INVALID`, Mesa might be more lenient.
|
||||||
|
|
||||||
|
Worth a one-line experiment.
|
||||||
|
|
||||||
|
For iteration 2: try `DRM_FORMAT_MOD_INVALID` first. If Mesa accepts, done. If not, pad-pitch-with-aliasing fallback.
|
||||||
|
|
||||||
|
## Multi-resolution kernel state corruption
|
||||||
|
|
||||||
|
After Firefox plays multiple videos at different resolutions (864→1920→...) in sequence, the next CAPTURE format query returns 48×48 (kernel default). Our `LAST_OUTPUT_WIDTH/HEIGHT` cache (commit `37c0e72`) doesn't catch this because the surface request matches the last set value, but the kernel state has been clobbered between sessions.
|
||||||
|
|
||||||
|
Hypothesis: when one playback ends, mpv/Firefox calls `vaDestroyContext` → our backend cleans up (REQBUFS(0), STREAMOFF) → the kernel's CAPTURE queue is reset to default 48×48. The next vaCreateContext reuses our cache that says "format already 1920×1088, no need to S_FMT" — but the kernel is back at 48×48.
|
||||||
|
|
||||||
|
Fix: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache when the OUTPUT queue is REQBUFS(0)'d. That guarantees the next CreateSurfaces2 will re-set the format.
|
||||||
|
|
||||||
|
Alternative: always check kernel-side actual format via `v4l2_get_format` after S_FMT and trust the kernel's value; if kernel doesn't have what we set, force re-set.
|
||||||
|
|
||||||
|
Implementation: invalidate cache in `RequestDestroyContext` / `RequestDestroySurfaces` / wherever cleanup happens. Find the right hook.
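
A minimal sketch of the invalidation idea, assuming the cache is a pair of file-scope variables. The lowercase variables and the three function names are hypothetical stand-ins for whatever the `LAST_OUTPUT_WIDTH/HEIGHT` code in `src/surface.c` actually looks like.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the LAST_OUTPUT_WIDTH/HEIGHT globals. */
static unsigned int last_output_width;
static unsigned int last_output_height;

/* True when the cached format cannot be trusted and S_FMT must be issued. */
bool output_format_needs_set(unsigned int width, unsigned int height)
{
    return width != last_output_width || height != last_output_height;
}

/* Record the format after a successful S_FMT. */
void output_format_cache_store(unsigned int width, unsigned int height)
{
    assert(width != 0 && height != 0);
    last_output_width = width;
    last_output_height = height;
}

/* Call from every teardown path that issues REQBUFS(0) on the OUTPUT queue. */
void output_format_cache_invalidate(void)
{
    last_output_width = 0;
    last_output_height = 0;
}
```

Zero works as the "unknown" marker because no valid surface request has a zero dimension, so the first request after teardown always takes the S_FMT path.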
|
||||||
|
|
||||||
|
## Touch points for iteration 2 implementation
|
||||||
|
|
||||||
|
| File | Line(s) | Change |
|---|---|---|
| `src/surface.h` | new fields | Add `enum cap_buffer_state` + per-buffer state tracking struct |
| `src/surface.c` | ~201 | Allocate `max(surfaces_count, MIN_CAP_POOL)` buffers (e.g., MIN_CAP_POOL=24) |
| `src/surface.c` | ~280 | Don't 1:1 bind; instead, push all buffers to the FREE pool |
| `src/surface.c` | ~640 (Export) | Mark slot as exported, save export timestamp |
| `src/surface.c` | ~700 (Export) | Try `DRM_FORMAT_MOD_INVALID` instead of `DRM_FORMAT_MOD_NONE` for WSI compat |
| `src/picture.c` | ~340 (BeginPicture) | Allocate slot from pool (LRU recycling), bind surface→slot, update destination_index |
| `src/picture.c` | ~375 (EndPicture, the existing QBUF) | Already uses `surface_object->destination_index`; works once BeginPicture rebinds |
| `src/surface.c` | LAST_OUTPUT_WIDTH cache | Invalidate on RequestDestroyContext or DestroySurfaces |
|
||||||
|
|
||||||
|
## Phase 4 plan input — to be locked
|
||||||
|
|
||||||
|
The pool refactor is the load-bearing fix. WSI alignment + multi-resolution recovery are independent and small (1-line experiments first, escalate if needed).
|
||||||
|
|
||||||
|
Suggested fix order:
|
||||||
|
1. Multi-resolution cache invalidation (smallest, lowest risk)
|
||||||
|
2. WSI modifier change to `DRM_FORMAT_MOD_INVALID` (1 line, test immediately)
|
||||||
|
3. Decoupled buffer pool (the substantive iteration 2 work)
|
||||||
|
4. Each fix tested independently before stacking the next.
|
||||||
@@ -0,0 +1,104 @@
|
|||||||
|
# Iteration 2 Phase 4 — plan
|
||||||
|
|
||||||
|
Three independent fixes, ordered cheapest-first. Each tested before stacking the next.
|
||||||
|
|
||||||
|
## Fix 1 — Multi-resolution: invalidate `LAST_OUTPUT_WIDTH/HEIGHT` cache on session teardown
|
||||||
|
|
||||||
|
**File**: `src/surface.c` + `src/context.c` (or wherever DestroyContext lives)
|
||||||
|
**Diff scale**: ~5 lines
|
||||||
|
|
||||||
|
**Problem**: Iteration 1 commit `37c0e72` added `LAST_OUTPUT_WIDTH/HEIGHT` static globals to skip re-S_FMT when dimensions unchanged. Across multiple playback sessions, the kernel CAPTURE format gets reset (likely on REQBUFS(0) at session end) but our cache still says "format already 1920×1088, no re-set needed." Next session inherits a 48×48 default kernel state and our cache lies.
|
||||||
|
|
||||||
|
**Fix**: zero out `LAST_OUTPUT_WIDTH/HEIGHT` whenever the OUTPUT queue is REQBUFS(0)'d (in our existing reset path on resolution change) AND in `RequestDestroyContext`. Forces next CreateSurfaces2 to S_FMT.
|
||||||
|
|
||||||
|
**Risk**: low. Worst case extra S_FMT call during normal operation — harmless.
|
||||||
|
|
||||||
|
**Validation**: Firefox loading mozilla.org main page (multi-video sequence at varying resolutions) should not trigger `fmt_width=48 fmt_height=48` regression. Operator-confirmable.
|
||||||
|
|
||||||
|
## Fix 2 — WSI alignment: try `DRM_FORMAT_MOD_INVALID` in surface descriptor
|
||||||
|
|
||||||
|
**File**: `src/surface.c::RequestExportSurfaceHandle` line 657
|
||||||
|
**Diff scale**: 1 line + 1 fallback case
|
||||||
|
|
||||||
|
**Problem**: Mesa's WSI rejects pitch=864 (32-aligned but not 64-aligned) with "WSI pitch not properly aligned." Our `objects[i].drm_format_modifier = video_format->drm_modifier = DRM_FORMAT_MOD_NONE` (= 0 = LINEAR explicit). Mesa interprets this as "I am promised this is a strict LINEAR DMA-BUF that meets WSI alignment", but hantro's actual buffer doesn't meet WSI alignment for non-64-aligned widths.
|
||||||
|
|
||||||
|
**Fix attempt 1** (cheapest): change `drm_modifier` to `DRM_FORMAT_MOD_INVALID` (`0x00ffffffffffffffULL`, not `~0ULL`). The intent is to tell Mesa "modifier unknown, treat as implicit / texture-import-only", so that Mesa texture-uploads rather than direct WSI-imports, paying a small perf cost but staying correct. Whether Mesa actually behaves this way is what the experiment checks.
|
||||||
|
|
||||||
|
**Fix attempt 2** (fallback): if `DRM_FORMAT_MOD_INVALID` causes regressions in mpv vaapi-copy or Firefox, fall back to per-modifier branching:
|
||||||
|
- mpv vaapi-copy: use `DRM_FORMAT_MOD_LINEAR` (works as before)
|
||||||
|
- Firefox + WSI-aligned widths (64-aligned): `DRM_FORMAT_MOD_LINEAR`
|
||||||
|
- Non-aligned widths: `DRM_FORMAT_MOD_INVALID`
|
||||||
|
|
||||||
|
We can't easily detect the consumer at export time, so attempt 1 first universally.
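
If attempt 2 is ever needed, the only branch that can be decided at export time is the one on alignment. A sketch of that branch, assuming a 64-byte requirement (still to be confirmed against Mesa) and branching on pitch rather than width. The modifier values are mirrored from `drm_fourcc.h` under local names so the sketch builds without libdrm headers; `pick_export_modifier` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirrored from drm_fourcc.h under local names. */
#define SKETCH_MOD_LINEAR  UINT64_C(0)                  /* DRM_FORMAT_MOD_LINEAR */
#define SKETCH_MOD_INVALID UINT64_C(0x00ffffffffffffff) /* DRM_FORMAT_MOD_INVALID */

#define WSI_PITCH_ALIGN 64u /* assumed; the real Mesa requirement is unconfirmed */

/* Explicit LINEAR only when the pitch meets the assumed WSI alignment. */
uint64_t pick_export_modifier(uint32_t pitch)
{
    assert(pitch != 0);
    return (pitch % WSI_PITCH_ALIGN == 0) ? SKETCH_MOD_LINEAR : SKETCH_MOD_INVALID;
}
```

A 1920-byte pitch keeps the explicit linear modifier, while the 864-byte pitch gets the "unknown" sentinel.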
|
||||||
|
|
||||||
|
**Validation**: Firefox plays mozilla.org's 864-wide intro videos without "MESA: error: WSI pitch not properly aligned"; mpv vaapi-copy and vaapi (DMA-BUF) still render bunny on bbb 1080p.
|
||||||
|
|
||||||
|
**Risk**: medium. `DRM_FORMAT_MOD_INVALID` is a documented sentinel but its semantics in EGL_EXT_image_dma_buf_import vary by driver. Worst case: regression on consumers we currently work for. Mitigation: revert to `DRM_FORMAT_MOD_NONE` if regression observed.
|
||||||
|
|
||||||
|
## Fix 3 — Decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)
|
||||||
|
|
||||||
|
**File**: `src/surface.h`, `src/surface.c`, `src/picture.c`
|
||||||
|
**Diff scale**: ~150-200 lines (new code)
|
||||||
|
|
||||||
|
**Problem**: 1:1 surface↔buffer binding causes V4L2 to re-QBUF the same physical buffer while consumer still holds an EXPBUF'd fd to it. Per Phase 2 analysis + Phase 3 baseline (re-queue interval ~875ms vs compositor hold ~50ms): race window opens during certain playback patterns and produces operator-visible stutter on mpv vaapi --vo=gpu.
|
||||||
|
|
||||||
|
**Architecture (Option A from Phase 2)**:
|
||||||
|
- Add `struct cap_pool_slot` per CAPTURE buffer with state: FREE, IN_DECODE, EXPORTED.
- Allocate `max(surfaces_count, MIN_CAP_POOL)` CAPTURE buffers (MIN_CAP_POOL = 24, which is 1.5× the typical mpv 16-surface pool and roughly 2.7× the 9 buffer indices seen in use in the Phase 3 baseline).
- Decouple: surfaces no longer have permanent `destination_index`. Instead, each surface holds a transient pointer to its currently-bound slot.
- On `RequestBeginPicture(surface_X)`:
  - If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE
  - Find LRU FREE slot (oldest "last_used_at" timestamp)
  - Bind surface_X → slot, set destination_index, mmap if needed
  - QBUF that slot
- On `RequestSyncSurface(surface_X)`: DQBUF as before. The overall slot lifecycle is FREE → IN_DECODE → EXPORTED → FREE.
- On `RequestExportSurfaceHandle(surface_X)`: VIDIOC_EXPBUF on slot's V4L2 index, save our fd, mark slot EXPORTED, record timestamp
- On `RequestDestroySurfaces`: free all slots, REQBUFS(0)
|
||||||
|
|
||||||
|
**Implementation details**:
|
||||||
|
- `cap_pool_slot` struct in `surface.h`:
|
||||||
|
```c
enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

struct cap_pool_slot {
    unsigned int v4l2_index;
    void *map[VIDEO_MAX_PLANES];
    unsigned int map_lengths[VIDEO_MAX_PLANES];
    unsigned int map_offsets[VIDEO_MAX_PLANES];
    enum cap_slot_state state;
    int our_export_fd;        /* -1 if not exported */
    uint64_t last_used_at_ns; /* for LRU recycling */
};
```
|
||||||
|
- Pool stored in `request_data` (driver-level), not in `surface_object`.
|
||||||
|
- LRU helper: scan slots, find FREE with oldest `last_used_at_ns`. If no FREE, force-recycle the oldest EXPORTED slot (close fd, demote to FREE) — race window may still open in pathological cases but rare.
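
The LRU helper could look roughly like the sketch below. It uses a reduced copy of the planned slot struct (mapping fields omitted) and two hypothetical function names, `oldest_in_state` and `cap_pool_acquire`. It is one possible shape under those assumptions, not the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

enum cap_slot_state { CAP_SLOT_FREE, CAP_SLOT_IN_DECODE, CAP_SLOT_EXPORTED };

/* Reduced version of the planned slot: only the fields the LRU scan needs. */
struct cap_pool_slot {
    unsigned int v4l2_index;
    enum cap_slot_state state;
    int our_export_fd;        /* -1 if not exported */
    uint64_t last_used_at_ns; /* for LRU recycling */
};

/* Oldest slot in the given state, or NULL if none is in that state. */
static struct cap_pool_slot *oldest_in_state(struct cap_pool_slot *slots,
                                             size_t count,
                                             enum cap_slot_state state)
{
    struct cap_pool_slot *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (slots[i].state != state)
            continue;
        if (best == NULL || slots[i].last_used_at_ns < best->last_used_at_ns)
            best = &slots[i];
    }
    return best;
}

/*
 * Pick a slot for the next decode. Prefers the least recently used FREE slot
 * and falls back to force-recycling the oldest EXPORTED slot. Returns NULL
 * only when every slot is IN_DECODE.
 */
struct cap_pool_slot *cap_pool_acquire(struct cap_pool_slot *slots, size_t count)
{
    struct cap_pool_slot *slot;

    assert(slots != NULL);
    slot = oldest_in_state(slots, count, CAP_SLOT_FREE);
    if (slot == NULL) {
        slot = oldest_in_state(slots, count, CAP_SLOT_EXPORTED);
        if (slot == NULL)
            return NULL;
        /* Force-recycle: drop our fd; the consumer may still hold its own. */
        if (slot->our_export_fd >= 0)
            close(slot->our_export_fd);
        slot->our_export_fd = -1;
    }
    slot->state = CAP_SLOT_IN_DECODE;
    return slot;
}
```

A linear scan is enough at this pool size (24 slots), and keeping the forced-recycle path in the same function makes the residual race case easy to instrument with a log line.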
|
||||||
|
|
||||||
|
**Validation**:
|
||||||
|
- mpv `--hwdec=vaapi --vo=gpu` plays bbb 1080p for ≥60s without operator-visible stutter
|
||||||
|
- Drop count < 100 in 14s (the Phase 3 anchor of 91 drops in 14s already meets this, so it only guards against regression)
|
||||||
|
- Firefox 150 sustained engagement with no MESA WSI errors after Fix 2
|
||||||
|
- vainfo + mpv vaapi-copy + chromium-fourier 149 — no regression
|
||||||
|
- strace shows recycled buffer indices have re-queue intervals consistent with LRU spread (e.g., wider than current ~875ms)
|
||||||
|
|
||||||
|
**Risk**: medium-high. Substantive code change. Mitigations:
|
||||||
|
- Keep pool size + recycling logic configurable
|
||||||
|
- Preserve Fix 1 + Fix 2 as separate commits before stacking Fix 3 (revert one without losing the others)
|
||||||
|
- Phase 5 Sonnet review before committing Fix 3
|
||||||
|
|
||||||
|
## Order of attack
|
||||||
|
|
||||||
|
1. **Fix 1 (multi-resolution cache)** — small, low-risk, fixes a known regression. Test: Firefox multi-video session.
|
||||||
|
2. **Fix 2 (WSI modifier)** — 1 line, test on 864-wide video. If regression, revert.
|
||||||
|
3. **Phase 5 review (Sonnet)** before Fix 3: get an outside read on the buffer-pool architecture before committing the substantive code.
|
||||||
|
4. **Fix 3 (buffer pool)** — implement, test, iterate.
|
||||||
|
5. **Phase 7 verification** — full 4-consumer matrix + multi-video session.
|
||||||
|
6. **Phase 8 close** — memory entry, iteration 3 input doc.
|
||||||
|
|
||||||
|
## Out-of-scope for iteration 2 (carried to iteration 3)
|
||||||
|
|
||||||
|
- VIDIOC_G_EXT_CTRLS EACCES probe (Sonnet 7.1)
|
||||||
|
- num_ref_idx for multi-slice streams (Sonnet 7.2)
|
||||||
|
- HACK block in surface.c MPEG-2 case (Sonnet 7.4)
|
||||||
|
- Firefox seek-to-non-IDR (Sonnet 7.5)
|
||||||
|
- DEBUG instrumentation cleanup (until iteration 2 verified, per Sonnet)
|
||||||
|
- V4L2_MEMORY_DMABUF mode rewrite (Option B from Phase 2 — proper but expensive)
|
||||||
|
- Performance metrics — iteration 3
|
||||||