libva-multiplanar/phase0_findings_iter9.md
claude-noether c331519507 iter9 phase 0: lock cap_pool/REQBUFS/REINIT cascade as the question
Campaign reopen — iter8's "campaign-closing" status was contingent on
"mpv --hwdec=vaapi smooth", which doesn't hold against fresh-install
interactive testing. iter9 single-track scope:

- Bug #1 (libva-v4l2-request-fourier#1) only
- mpv H.264 fresh-login through ≥30s of decode without any of: cap_pool
  double-init, REQBUFS EBUSY, REINIT bad-fd, OUTPUT ENOMEM
- Phase 0 will source-read cap_pool + request_pool + iter6 REINIT,
  build a vo=null reproduction harness, prepare bisect against iter5
  baseline, and a libva-direct C probe for minimal repro

Bug #2 (presentation green) is dmabuf-modifier-triage's job — peer
campaign opened 2026-05-08 at ~/src/dmabuf-modifier-triage/. README
cross-link now points at it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:01:47 +00:00


Phase 0 — iteration 9 substrate (libva-multiplanar campaign — REOPEN)

Opened 2026-05-08 after the iter8 production-tip artifact (libva-v4l2-request-fourier-1.0.0.r280.65969da-1, shipped to [marfrit] 2026-05-08) was found to fail under fresh-install interactive mpv H.264 playback on ohm. iter8's "campaign-closing" status (per phase0_findings_iter8.md line 3) was contingent on the iter5/8 close claim of "mpv --hwdec=vaapi smooth" — that claim does not hold against a fresh-install + fresh-Plasma-session test path.

This is a campaign reopen, not a continuation. iter8 still represents the validated state under the test paths it was measured against (the tests/run_perf_binding_cell.sh harness). iter9 exists to address what the harness didn't catch.

Predecessor close-out summary (iter8 → iter9)

iter8 landed three commits on top of iter7, of which only the first two appear in the fork log:

  • dcaa1f1 (2026-05-06) — docs/silicon-ID fix (PineTab2 = RK3566 silicon).
  • 65969da (2026-05-06) — tests/run_perf_binding_cell.sh harness for measured per-consumer drop/CPU/freq/memory numbers.
  • (iter8 close commit not in fork log; close artifact phase8_iteration8_close.md records GREEN for E on 2026-05-06.)

iter8 then sat for two days. On 2026-05-08:

  • The fork was packaged as libva-v4l2-request-fourier (PKGBUILD at ~/src/marfrit-packages/arch/libva-v4l2-request-fourier/), pinned to _commit=65969da. CI built the package and published it to [marfrit] (Gitea Actions run #65, success). ohm has the [marfrit] repo enabled in /etc/pacman.conf and pulled the package via pacman -Syu. /etc/profile.d/libva-v4l2-request.sh exports LIBVA_DRIVER_NAME=v4l2_request, LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1, and LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0.
  • vainfo confirmed that the new driver loads cleanly and enumerates the H.264 + MPEG-2 profile list (same shape as predecessor).
  • Interactive mpv --hwdec=vaapi --vo=dmabuf-wayland fourier-test/bbb_1080p30_h264.mp4 immediately hit the cap_pool / REQBUFS / REINIT cascade described in marfrit/libva-v4l2-request-fourier#1.
  • Separately: even the --hwdec=v4l2request (libva-bypassed) path produced solid green frames via --vo=dmabuf-wayland, isolated to a different bug. That second bug moved to its own peer campaign at ~/src/dmabuf-modifier-triage/ and does not gate iter9.
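For reference, the environment described in the packaging bullet corresponds to a profile script along the following lines. This is a reconstruction from the three values listed there, not a copy of the file on ohm; the real script may contain more, and the device numbering is specific to that host.

```shell
# Reconstruction of /etc/profile.d/libva-v4l2-request.sh (layout is assumed).
# The three values are the ones listed above; /dev/video1 and /dev/media0
# are host-specific and may differ on other boards.
export LIBVA_DRIVER_NAME=v4l2_request
export LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1
export LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0
```

Because these are exported from /etc/profile.d, a login session inherits them, which is why the reproduction has to control its environment explicitly rather than rely on whatever the current shell carries.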

Locked research question — iter9

Triage and fix the probe-then-decode lifecycle cascade exposed in interactive mpv H.264 playback at the iter8 production tip. The pass criterion is fresh login through to ≥30s of decode without any of the following events:

  • cap_pool_init firing twice for overlapping slot ranges in a single mpv invocation.
  • VIDIOC_REQBUFS returning EBUSY.
  • "Unable to reinit media request: Bad file descriptor".
  • "Unable to create buffer for type 9: No buffer space available" (buffer type 9 is V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE).

The campaign re-closes only when this test path passes from a freshly-logged-in Plasma session.

This is the test path iter5/8 closes implicitly claimed worked. iter9 makes the claim explicit and verifiable.
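The four failure events in the research question can be checked mechanically against a captured log. The sketch below is one way to do that, under stated assumptions: the two quoted error strings are taken verbatim from the question, while the patterns for cap_pool_init and for the REQBUFS failure are guesses at how the driver logs them, and the double-init check only counts init lines rather than parsing slot ranges. The function name scan_cascade_log is hypothetical.

```shell
# Scan a captured log for the four cascade signatures.
# Returns 0 if the log is clean, 1 if any signature is present.
scan_cascade_log() {
    scan_log=$1
    scan_rc=0

    # 1. cap_pool double-init. Simplification: more than one line mentioning
    #    cap_pool_init counts as a hit; overlapping slot ranges are not parsed.
    #    The log token itself is an assumption.
    scan_inits=$(grep -c 'cap_pool_init' "$scan_log" || true)
    if [ "$scan_inits" -gt 1 ]; then
        echo "SIGNATURE cap_pool double-init ($scan_inits init lines)"
        scan_rc=1
    fi

    # 2. REQBUFS EBUSY. The pattern is a guess at how the failure is logged.
    if grep -Eq 'REQBUFS.*(EBUSY|Device or resource busy)' "$scan_log"; then
        echo "SIGNATURE REQBUFS EBUSY"
        scan_rc=1
    fi

    # 3. REINIT on a bad fd (message quoted verbatim in the research question).
    if grep -Fq 'Unable to reinit media request: Bad file descriptor' "$scan_log"; then
        echo "SIGNATURE REINIT bad fd"
        scan_rc=1
    fi

    # 4. Buffer creation failure (message quoted verbatim as well).
    if grep -Fq 'Unable to create buffer for type 9: No buffer space available' "$scan_log"; then
        echo "SIGNATURE buffer create failed"
        scan_rc=1
    fi

    return "$scan_rc"
}
```

A harness would redirect the mpv and driver output of one run into a file and call scan_cascade_log on it, passing the return code through as its own exit status. Where the driver messages actually end up (stderr, a trace file) has to be confirmed first.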

Hypothesis space (Phase 0 must read source to confirm)

Three layers can produce the observed cascade:

  1. cap_pool lifecycle in src/cap_pool.c + callers. Two cap_pool_init events for slot range 0..23 in close succession before the first decoded frame strongly suggests probe-context + decode-context double-init without teardown between. mpv's VA-API call sequence is roughly: vaInitialize → vaQueryConfigProfiles → vaCreateConfig → vaCreateContext → vaCreateSurfaces → decode loop. If cap_pool_init is wired into vaCreateContext rather than vaCreateConfig, both the probe context and the actual context would init separately and require teardown to be symmetric.

  2. request_pool lifecycle + iter6's REINIT. "Unable to reinit media request: Bad file descriptor" is a direct iter6 output (commit a09c03c, "iter6 fix: per-OUTPUT-slot request_fd binding via REINIT"). EBADF indicates that the request fd was already closed, or was never valid, when REINIT ran. Possible causes: request_pool teardown closes the fd unconditionally, or the iter7 slot-leak fix (commit 988b848 adds request_pool_force_release) mistakenly closes a still-bound request fd.

  3. VIDIOC_REQBUFS without prior STREAMOFF. EBUSY on REQBUFS most likely means the queue is still in the STREAMING state. The fork's STREAMOFF call sites need to be audited: every REQBUFS(count=N) after a previous successful REQBUFS(count=M, M>0) must be preceded by STREAMOFF if the queue was started in between.
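The ordering rule in hypothesis 3 can also be checked from outside the driver by replaying an ioctl trace. The sketch below assumes the trace comes from strace (for example `strace -f -e trace=ioctl -o trace.log mpv ...`) and that strace decodes V4L2 ioctls with the buffer type spelled out as V4L2_BUF_TYPE_*; the exact line format varies between strace versions, so the patterns may need adjusting. The function name audit_reqbufs_order is hypothetical, and state is keyed on the fd number as printed, so a reused fd number would carry stale state.

```shell
# Flag every VIDIOC_REQBUFS issued while the same fd + buffer type is still
# streaming. Input: an strace log file. Returns 1 if any violation is found.
audit_reqbufs_order() {
    awk '
        # Build a "fd:buffer-type" key from one strace line.
        function key(line,    fd, btype) {
            fd = "?"; btype = "?"
            if (match(line, /ioctl\([0-9]+/))
                fd = substr(line, RSTART + 6, RLENGTH - 6)
            if (match(line, /V4L2_BUF_TYPE_[A-Z_]+/))
                btype = substr(line, RSTART, RLENGTH)
            return fd ":" btype
        }
        # Only successful STREAMON/STREAMOFF calls change the queue state.
        /VIDIOC_STREAMON/  && /\) = 0/ { streaming[key($0)] = 1; next }
        /VIDIOC_STREAMOFF/ && /\) = 0/ { streaming[key($0)] = 0; next }
        /VIDIOC_REQBUFS/ {
            k = key($0)
            if (streaming[k]) {
                printf "line %d: REQBUFS on %s while streaming\n", NR, k
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Each reported line number points at the REQBUFS call in the trace, so the preceding STREAMON can be matched back to a call site in the fork during the source read.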

Phase 0 will deliver

  1. Source-read of cap_pool + request_pool + surface.c + context.c at commit 65969da. Output: phase0_iter9_source_read.md capturing the actual call graph: which vaXxx entry point triggers cap_pool_init, when STREAMON happens, when STREAMOFF happens, when REINIT happens, who owns each fd. Read against the iter5 sweep commits (951233a, 848fc0c, 843febc, d3a299b, c8b6ede, b993355) and the iter6/7 fix commits (a09c03c, 988b848, 7bd0818) so the diff that introduced the leak is identifiable.

  2. Reproduction harness — tests/run_iter9_lifecycle_repro.sh. Wraps a single mpv --hwdec=vaapi --vo=null --frames=300 ... call (vo=null isolates the bug from any presentation issues) with LIBVA_TRACE capture, parses output for the four cascade signatures, exits non-zero if any signature fires. Anchored to bbb_1080p30_h264.mp4. Critical: must launch from a fresh subshell with no leftover env / VA state so probe-then-decode lifecycle is exercised.

  3. Bisection plan against the iter5..iter8 commit range. If the source-read in (1) doesn't unambiguously identify the regression-introducing commit, prepare a git bisect script using the harness in (2) so phase 4 can mechanically narrow.

  4. iter5 close re-validation. Re-run the harness from (2) against the iter5-state commit (c8b6ede = "iter5 sweep follow-up"). Two outcomes — both useful:

    • If iter5 also fails → the bug pre-dates iter6/7's additions and the iter5 close was over-claimed.
    • If iter5 passes → iter6 (request_fd REINIT) or iter7 (slot-leak) introduced the regression. Bisect (3) narrows further.
  5. Sanity check with a libva-direct decode probe. Build a small C harness that calls vaInitialize + vaQueryConfigProfiles + vaCreateConfig + vaCreateContext + vaCreateSurfaces + a single decode + teardown, all via libva direct (no mpv, no ffmpeg). If the harness reproduces the cascade with no mpv complexity, the test surface area for phase 4's fix shrinks dramatically.
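The bisection in item 3 could be driven by a step function like the one below, which maps build and harness results onto the exit codes `git bisect run` understands (0 = good, 125 = skip, any other code from 1 to 127 = bad). This is a sketch under assumptions: BUILD_CMD is a placeholder for however the fork is built and installed at each commit, HARNESS defaults to the script path named in item 2, and bisect_step is a hypothetical name.

```shell
# One step of `git bisect run`.
# BUILD_CMD and HARNESS are placeholders (assumptions), overridable via env.
bisect_step() {
    # A commit that does not build cannot be judged: 125 tells git bisect
    # to skip it instead of marking it good or bad.
    sh -c "${BUILD_CMD:-true}" || return 125

    # Harness contract from item 2: non-zero exit means a signature fired.
    if sh -c "${HARNESS:-tests/run_iter9_lifecycle_repro.sh}"; then
        return 0    # good: no cascade signature
    fi
    return 1        # bad: at least one signature fired
}

# Possible usage, valid only if item 4 shows that c8b6ede passes:
#   git bisect start 65969da c8b6ede    # bad tip first, then the good commit
#   git bisect run <script that sources this file and calls bisect_step>
```

Returning a fixed 1 for every harness failure keeps the harness free to use richer exit codes without accidentally hitting 125 or a value above 127, which `git bisect run` treats as skip and abort respectively.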

After Phase 0 closes, Phase 1 will replicate the baseline from items 2 + 4 (per feedback_replicate_baseline_first.md). Phase 2 will do a source deep-dive on the layer that item 1 implicates. Phase 3 will write the deterministic regression test. Phase 4 will fix. Phase 5 review will be sonnet.

In-scope (LOCKED 2026-05-08 for iteration 9)

Single-track. Decoder-side cascade only.

  • Bug #1 (libva-v4l2-request-fourier#1) only.
  • Test fixture: ~/fourier-test/bbb_1080p30_h264.mp4 (already on ohm).
  • Target host: ohm. fresnel sits this iteration out: fresnel-fourier is a peer campaign with a separate iteration cadence, and whatever iter9 fixes on ohm is inherited automatically when fresnel-fourier rebuilds against the same fork master.

Out-of-scope (LOCKED 2026-05-08 for iteration 9)

  • Bug #2 (libva-multiplanar#1) — owned by ~/src/dmabuf-modifier-triage/. Not gating iter9.
  • Performance / measurement. iter8's perf binding cell is already in the fork (tests/run_perf_binding_cell.sh) and re-runs as part of any future iteration close. iter9 only needs to demonstrate that the cascade no longer fires; numbers are not a deliverable.
  • Other consumers (Firefox, chromium-fourier, vainfo). mpv is the consumer that surfaced the bug; mpv is the consumer iter9 closes against. Sweep to other consumers is iter10's call.
  • Other codecs (HEVC, VP9). H.264 only.
  • Other hardware. ohm only.
  • Upstreaming. Per feedback_no_upstream.md.

Reference history

  • phase0_findings.md — original campaign Phase 0 substrate.
  • phase0_findings_iter[2-8].md — per-iter substrate.
  • phase8_iteration[1-8]_close.md — per-iter close artifacts. Re-read iter5 + iter7 + iter8 closes specifically; the iter9 hypothesis space refers back to their explicit fix commits.
  • ~/src/libva-multiplanar/libva-v4l2-request-fourier/ — fork at master = 229d6d1 today (with fresnel-fourier MPEG-2 commits past 65969da); iter9 work happens on this master, with potentially a feature branch if the fix becomes large enough to warrant one.
  • ~/src/marfrit-packages/arch/libva-v4l2-request-fourier/PKGBUILD — bump _commit after iter9 close + close validation passes.