libva-multiplanar/phase0_findings_iter5.md
marfrit e46b2ed2d6 Iteration 5 Phase 1 lock — A + G + B + E
Heavyweight four-track iteration:
- Track A: DEBUG instrumentation sweep (carried four iterations)
- Track G: PGO-disabled Firefox-fourier rebuild
- Track B: mpv libplacebo --vo=gpu segfault investigation
- Track E: multi-context libva safety (Sonnet review 9.6)

Natural sequence: A first (clean codebase), G in parallel on boltzmann
(~2h rebuild offloaded), E next (architectural on clean source), B last
(consumer-side investigation on post-A+E driver).

Phase 4 will subdivide into 4A/4G/4E/4B sub-phases. Phase 5 sonnet
review + Phase 7 verify span all four tracks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:42:49 +00:00


# Iteration 5 — Phase 0 (substrate / motivation / inventory)
Opens 2026-05-05 immediately after iteration 4 close (`phase8_iteration4_close.md`, fork commit `b81ce69`, campaign close `67494ae`).
## Predecessor close-out summary (iteration 4 → iteration 5)
iter4 was the first iteration that closed Track A — the iter1+iter2+iter3 frame-11 EINVAL carryover. Three correctness fixes landed in fork:
- `74d8dd1` — DPB `fields = V4L2_H264_FRAME_REF` + skip stale entries (FFmpeg-semantics match)
- `385dee1` — fresh `request_fd` per frame (THE load-bearing fix)
- `b81ce69` — B-slice L1 reflist `.fields` copy-paste
Plus diagnostic instrumentation (`a12d299`, `4892656`, `f21bdf0`) accumulated during the diagnostic journey.
iter4 verified Track A via mpv direct stress test on ohm: 2130 BeginPictures over 90s with 0 EINVAL of any kind — real-time HW decode through libva-v4l2-request-fourier without `MOZ_DISABLE_RDD_SANDBOX=1`. Track F (sandbox patch) from iter3 stays GREEN; the campaign now has a working H.264 decode pipeline through libva on hantro.
The campaign's original substrate question — "make multi-planar libva work on Rockchip hantro for production VAAPI consumers" — is empirically achieved at the libva-side decode layer.
## Iteration 5 candidate research questions
### A. DEBUG instrumentation sweep (carried from iter1+iter2+iter3+iter4)
> Remove all accumulated diagnostic instrumentation commit-by-commit, building cleanly between each removal. End state: zero `request_log()` calls in non-error paths, no patch-0011 sentinel write in `EndPicture`, no msync workaround (or document why it stays). Driver source builds clean and vaapi-copy + vaapi smoke tests still green.
**Inventory of instrumentation to remove (or keep, as decided per item):**
- iter1 ENTER traces in surface entry points (CreateBuffer, BeginPicture, etc.)
- iter1 patch-0011 sentinel write in `EndPicture`
- iter1 patch-0010 CAPTURE/OUTPUT hex-dumps in SyncSurface
- iter1 msync(MS_SYNC|MS_INVALIDATE) workaround in SyncSurface (probably keep — was load-bearing for cache coherency)
- iter1 POC sentinel strip (KEEP — load-bearing for ffmpeg-vaapi consumers)
- iter1 patch-0014 EACCES retry-skip in `v4l2_get_controls` (KEEP — load-bearing reflective behavior)
- iter1 slice_header bit-precise parser + dec_ref_pic_marking_bit_size etc. (KEEP — fixes hantro hw decode)
- iter3 Y2 v1 in `v4l2_ioctl_controls` (REMOVE — superseded by iter4 Y2 v3)
- iter4 Y2 v3 with TRY_EXT_CTRLS retry (REMOVE — fault no longer reproduces)
- iter4 DPB census + per-entry dump (REMOVE — fault no longer reproduces)
- iter4 per-control TRY isolation (REMOVE — fault no longer reproduces)
**Why first:** required prerequisite for any upstream snapshot (iter5 candidate F). Was deferred at iter1+iter2+iter3+iter4. Smaller scope than C or F, fits in any iteration's slack.
**Risk:** removing instrumentation that's actually load-bearing. Each removal verified by re-running mpv + Firefox + vainfo smoke tests.
### B. mpv libplacebo `--vo=gpu` segfault (carried from iter3 substrate, never in iter3 or iter4 scope)
> Resolve the segfault on `LIBVA_DRIVER_NAME=v4l2_request mpv --hwdec=vaapi --vo=gpu` after 4 frames on bbb_1080p30 when Vulkan init fails (`VK_ERROR_INITIALIZATION_FAILED`).
**Symptom** (captured iter3 substrate): Vulkan init fails, mpv falls through to GPU non-vulkan path, decode runs for 4 frames cleanly, then `Unable to request buffers: Device or resource busy` (REQBUFS EBUSY mid-stream), then bizarre `CreateSurfaces2: surf_width=16 surf_height=16 sizes[1]=1050626` (uninitialized memory shape), then SIGSEGV.
**Hypothesis (iter3-era):** cap_pool resolution-change path doesn't fully drain CAPTURE before REQBUFS → kernel returns EBUSY → driver pushes ahead with garbage → mmap or pool-init crashes. Could be a Mesa update side effect.
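**iter4 evidence point:** mpv with `--hwdec=vaapi-copy --vo=null` works for 2130 frames, which points at the consumer-side compositor path rather than libva-side decode. The working and failing runs differ in both `--hwdec` (`vaapi-copy` vs `vaapi`) and `--vo` (`null` vs `gpu`), so the diagnosis path is to bisect by mpv flags, changing one of the two at a time.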
**Risk:** may surface a Mesa or libplacebo bug we can't fix from the libva side.
### C. Performance binding cell (deferred from iter1+iter2+iter3+iter4)
> Establish a measurement protocol for HW vs SW decode on this rig: drop counts, effective FPS, browser CPU%, scanout-plane residency for {mpv vaapi DMA-BUF, mpv vaapi-copy, Firefox-fourier HW (sandbox-on), SW baseline}. Anchor in `phaseN_evidence/`.
**Why:** anchors all iter1+iter2+iter3+iter4 claims to numbers. Carried four iterations. iter4's mpv stress test is a partial perf measurement (2130 frames clean, but no CPU%/drop count anchor).
**Pairing potential:** A (DEBUG sweep) before C — perf measurements should be on a clean instrumentation-free build. Or, run a baseline-vs-iter4 comparison BEFORE the sweep to capture the value of each instrumentation point.
### D. V4L2_MEMORY_DMABUF (carried from iter2+iter3+iter4)
> Replace V4L2_MEMORY_MMAP with userspace dma-buf allocation. iter2 Fix 3 was statistical (LRU mitigation); Option B is architectural (userspace owns the buffer).
**Why:** the cap_pool LRU is empirically working but doesn't formally close the DMA-BUF lifecycle race window. Option B closes it.
**Risk:** highest unknown. Possibly requires kernel work. Hantro on this kernel may not support `V4L2_MEMORY_DMABUF` at all; gstreamer's v4l2slh264dec uses MMAP only. Worth a probe before commit.
### E. Multi-context libva safety (Sonnet review 9.6 from iter1, carried iter2/3/4)
> Make the backend safe for two concurrent libva contexts in the same process (e.g. Firefox tab playing one video while another tab plays a different resolution). `LAST_OUTPUT_WIDTH/HEIGHT` is a process-global static; `cap_pool` is per-driver_data but the V4L2 device is shared.
**Why:** four iterations carried this. Real consumers (Firefox multi-tab, mpv-while-Firefox) would surface it. With Track A fixed, this becomes the next architectural correctness piece.
**Risk:** moderate. The fix shape is similar to iter2 Fix 1 (per-context state instead of process-global) but applied to more state.
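The fix shape can be illustrated with a minimal sketch, assuming the last-seen output size is used to decide whether the OUTPUT format must be renegotiated. The struct and function names here (`decode_context`, `context_needs_format_update`) are hypothetical; the fork's real data layout is not shown in these notes.

```c
#include <stdbool.h>

/* Hypothetical replacement for the process-global LAST_OUTPUT_WIDTH/HEIGHT:
 * the same two values, but stored once per context. */
struct output_format_state {
    unsigned int last_output_width;
    unsigned int last_output_height;
    bool valid; /* false until the first format has been recorded */
};

/* Hypothetical context object; each libva context owns one. */
struct decode_context {
    int context_id;
    struct output_format_state out;
};

void context_init(struct decode_context *ctx, int context_id)
{
    ctx->context_id = context_id;
    ctx->out.last_output_width = 0;
    ctx->out.last_output_height = 0;
    ctx->out.valid = false;
}

/* Returns true when this context has not yet seen the given size (so the
 * format would need to be set again), and records the size on this context
 * only. Other contexts are never read or written. */
bool context_needs_format_update(struct decode_context *ctx,
                                 unsigned int width, unsigned int height)
{
    if (ctx->out.valid &&
        ctx->out.last_output_width == width &&
        ctx->out.last_output_height == height)
        return false;

    ctx->out.last_output_width = width;
    ctx->out.last_output_height = height;
    ctx->out.valid = true;
    return true;
}
```

With a single static pair, a 1280x720 context running next to a 1920x1080 context would see the other's size on every frame and renegotiate each time, or skip a renegotiation it needed. With per-context storage each context only compares against its own history.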
### F. Bootlin / Mozilla upstreaming (combined from iter3 candidate G + iter4 carryover)
> File the Mozilla Bugzilla bug for `/dev/media*` + V4L2-stateless RDD sandbox with the iter3 firefox-fourier patch. File a bootlin issue on `bootlin/libva-v4l2-request` with iter1+iter2+iter3+iter4 patches as a cohesive working set.
**Why:** with Track A fixed, the libva-v4l2-request-fourier fork has empirical proof of working H.264 decode on hantro for any libva consumer. The patches are upstreamable in shape, just need the DEBUG sweep (A) cleanup first.
**Stance:** per `feedback_no_upstream.md`, no PR/MR/bug-file happens without explicit operator instruction. F is gated on operator decision.
### G. PGO-disabled Firefox rebuild
> Rebuild firefox-fourier without `--enable-profile-generate=cross` to get a release-quality binary suitable for performance measurement and Firefox-side stress testing.
**Why:** iter3's PGO-instrumented build has a 3.6 GB libxul.so and decodes at ~0.23x realtime under sandbox. iter4 verified Track A via mpv directly because the PGO-instrumented Firefox couldn't reach 720+ frames in 90s. A clean Firefox-fourier build would let iter5 do Firefox-side stress testing.
**Risk:** ~2h rebuild on boltzmann. The infrastructure is in place (firefox-fourier LXD container persists). Edit the PKGBUILD to skip PGO, rebuild, redeploy.
**Pairing potential:** G + C (rebuild + perf measurement) is natural. G + B (rebuild + libplacebo investigation through Firefox-side path) is also possible.
### H. New codec / hardware (deferred from iter1+ scope)
> Extend to MPEG-2 (next codec per iter1 lock) or to fresnel RK3399 / ampere RK3588 hardware (next platforms).
**Why:** the campaign's original locked scope was H.264-first then MPEG-2; ohm RK3568 first then fresnel and ampere/boltzmann. With ohm+H.264 working, the natural extensions become possible.
**Risk:** new hardware iterations are their own can-of-worms. Probably one-codec-OR-one-hardware per iteration.
### Recommended pairings
- **A + F** (DEBUG sweep + upstream prep). Most natural sequence — sweep makes the patches mailing-list-ready. Smallest combined scope.
- **A + C** (sweep + perf). Sweep first to get clean measurements, then C anchors the campaign-wide claims.
- **B alone** (libplacebo) — separate consumer-side investigation, doesn't share authoring with anything else.
- **E alone** (multi-context safety) — architectural correctness piece, requires focused attention.
- **G + C** or **G + B** (PGO-disabled rebuild + perf or libplacebo) — Firefox-side validation matrix.
## State that carries (re-verified 2026-05-05 close)
- **Hardware**: ohm RK3568 hantro G1/G2, kernel 6.19.10. `ohm.vpn` access path. Plasma 6 Wayland session interactive.
- **Userspace**: firefox 150.0.1 stock + firefox-fourier 150.0.1-1.1 (PGO-instrumented) at `/opt/firefox-fourier/`, libva 2.23.0, mesa 26.0.5, libdrm 2.4.131, mpv 0.41.0-3.
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` sha256 `dcf8a7170fbd...`.
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` post-iter4 (sha256 to recompute on iter5 start; rebuild on ohm via meson+ninja in /tmp/libva-src to redeploy with iter4 commits).
- **Build container**: `firefox-fourier` LXD on boltzmann, `ssh -J boltzmann builder@firefox-fourier`. Persistent. Source still extracted at `/build/aur/firefox-fourier/src/firefox-150.0.1/` with iter3 patches applied — incremental rebuilds via `./mach build`.
- **Phase 7 scripts**: `/home/mfritsche/iter3_phase7_evidence.sh` + `/tmp/run_phase7_v2.sh` on ohm.vpn.
- **mpv stress test command**: `LIBVA_DRIVER_NAME=v4l2_request mpv --hwdec=vaapi-copy --vo=null --no-audio bbb_1080p30_h264.mp4` — proven Track A verifier.
- **References cache**: `references/ffmpeg-kwiboo/` (FFmpeg V4L2-request reference), `references/linux-mainline/` (kernel hantro source), `references/firefox-master/` (Mozilla sandbox source).
## State that does NOT carry
- Performance numbers. Same caveat as iter1+iter2+iter3+iter4. Candidate C is the natural anchor.
- iter4 driver build state on ohm `/tmp/libva-src` is tmpfs-volatile; rsync+rebuild from rpi at iter5 start.
## Tooling and measurement-instrument inventory
Carried from iter4:
- `strace -f -e trace=openat,close,ioctl` for libva-side V4L2 ioctl tracing
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle
- `sudo dmesg -w` for kernel-side warnings
- `mpv --frames=N --vo=null` with stderr capture for libva stress
- `mpv --frames=N --vo=gpu` with stderr capture for full-pipeline (will surface candidate B's segfault)
- Firefox `MOZ_LOG=PlatformDecoderModule:5,VideoBridge:5` (under firefox-fourier, no MOZ_DISABLE_RDD_SANDBOX needed)
- Operator visual inspection on real screen (load-bearing for "frames reach screen" claims)
- iter3 Y2 v1 + iter4 Y2 v3 + iter4 DPB census + iter4 per-control TRY iso (ALL up for removal in candidate A)
Likely needed for specific iter5 candidates:
- For A (sweep): per-removal smoke test recipe (vainfo + mpv vaapi-copy + Firefox-fourier 30s).
- For B (libplacebo): mpv `--vo=gpu` minimal repro, possibly Mesa bisect or rollback.
- For C (perf): `pidstat -u -p $(pidof ...)` for CPU%, Mali-G52 freq via `/sys/class/devfreq/fde60000.gpu`, scanout-plane query (Wayland `ext-output-management` is hard — may need ftrace).
- For D (DMABUF): `gbm_bo_create` test program + `VIDIOC_QBUF memory=V4L2_MEMORY_DMABUF` exploratory.
- For G (PGO-disabled rebuild): edit firefox-fourier PKGBUILD to skip `--enable-profile-generate=cross`, `./mach build` incremental, redeploy via 600 MB tarball.
## In-scope (LOCKED 2026-05-05 for iteration 5) — A + G + B + E
Heavyweight four-track iteration. Natural execution sequence:
1. **Track A first (DEBUG sweep)** — clean the codebase before other changes ripple. Per-removal commits with smoke verification. Estimated ~30-60 min for the removal commits + smoke runs.
2. **Track G in parallel (PGO-disabled Firefox rebuild)** — kick off on boltzmann firefox-fourier container while local A work proceeds. ~2h rebuild + redeploy to ohm; mostly background time. Edit PKGBUILD to remove `--enable-profile-generate=cross`, `./mach build` from existing source tree (already extracted with iter3 patches).
3. **Track E (multi-context libva safety)** — architectural change after the codebase is clean from A. Move `LAST_OUTPUT_WIDTH/HEIGHT` from process-global static to per-driver-data (or per-context). Audit other process-global state. Verify via two simultaneous mpv processes on different fixtures.
4. **Track B last (mpv libplacebo `--vo=gpu` segfault)** — consumer-side investigation on the post-A post-E driver. Bisect mpv flags between `--vo=null` (works) and `--vo=gpu` (segfaults). May surface a Mesa/libplacebo upstream issue.
iter5 is large. Phase 4 will subdivide into 4A / 4G / 4E / 4B sub-phases. Phase 5 sonnet review + Phase 7 verify span all four tracks.
## Out-of-scope (LOCKED 2026-05-05 for iteration 5)
- Candidates C (perf binding cell), D (V4L2_MEMORY_DMABUF), F (upstreaming), H (new codec/hardware) — deferred to iter6+.
- Same standing OOS: bootlin/Mozilla upstreaming PR/MR/bug-file unless explicitly tasked, new codecs outside H.264/MPEG-2 scope, new platforms beyond ohm RK3568.
## Phase 1 success criterion (LOCKED 2026-05-05)
**Track A:** Driver source builds clean with zero `request_log()` calls in non-error paths. iter1 patch-0010/0011/0014 DEBUG, iter3 Y2 v1, iter4 Y2 v3 / DPB census / per-control TRY iso all removed (or explicitly kept with justification). mpv stress test still 2000+ BeginPictures clean post-sweep.
**Track G:** Firefox-fourier rebuilt without `--enable-profile-generate=cross` (i.e. no PGO instrumentation), redeployed to ohm. `firefox --version` reports `Mozilla Firefox 150.0.1`. Resulting libxul.so is materially smaller than the 3.6 GB instrumented build.
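**Track E:** Two concurrent mpv processes on different bbb fixtures decode independently with no cross-context state corruption. `LAST_OUTPUT_WIDTH/HEIGHT` (and any other audited process-global) moved to per-driver-data or per-context. Note: separate mpv processes share the V4L2 device but not process-global statics, so this check exercises device sharing; the statics themselves only interact when two contexts live in one process (the Firefox multi-tab case in candidate E).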
**Track B:** `mpv --hwdec=vaapi --vo=gpu` plays bbb_1080p30 for ≥30s without segfault — OR root cause documented as Mesa/libplacebo upstream issue with operator-actionable workaround captured in iter5 close.
**Joint success:** all four tracks independently verifiable on the same iter5-end driver build. Phase 7 verifies each.
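The Track A stress criterion (2000+ BeginPictures, zero EINVAL) can be checked mechanically from a captured log. The sketch below is one way to do it, assuming the capture contains one line per BeginPicture and that failures mention `EINVAL` literally; both marker strings, and the names `count_stress_log` and `stress_run_passes`, are assumptions rather than the fork's actual log format or tooling.

```c
#include <stdio.h>
#include <string.h>

struct stress_counts {
    unsigned long begin_pictures; /* lines containing "BeginPicture" */
    unsigned long einval;         /* lines containing "EINVAL" */
};

/* Scan a log stream line by line and count marker occurrences.
 * Lines longer than the buffer are read in chunks, so a marker that
 * straddles a chunk boundary would be missed; 1024 bytes is assumed to be
 * enough for the log lines in question. */
struct stress_counts count_stress_log(FILE *log)
{
    struct stress_counts c = {0, 0};
    char line[1024];

    while (fgets(line, sizeof(line), log)) {
        if (strstr(line, "BeginPicture"))
            c.begin_pictures++;
        if (strstr(line, "EINVAL"))
            c.einval++;
    }
    return c;
}

/* Pass when at least min_begin BeginPictures were seen and no EINVAL. */
int stress_run_passes(struct stress_counts c, unsigned long min_begin)
{
    return c.begin_pictures >= min_begin && c.einval == 0;
}
```

Since Track A removes the driver's own ENTER traces, the per-frame lines would have to come from some other source after the sweep; which source that is remains an open choice for Phase 7.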
## Stop point
Phase 1 LOCKED. iter5 proceeds to Phase 2 (situation analysis covering all four tracks: instrumentation inventory, PKGBUILD PGO edit, libva-side process-global state audit, libplacebo segfault repro), Phase 3 (baseline anchor), Phase 4 (split into 4A/4G/4E/4B sub-phases per Track), Phase 5 (sonnet review of all four), Phase 6 (deploy + smoke), Phase 7 (verify each track), Phase 8 (close). Stop only if user input is needed.