libva-multiplanar/phase0_findings_iter3.md
marfrit f91469abe3 Iteration 3 close — F GREEN, A reproduced + diagnosed for iter4
Phase 1 locked F (Firefox RDD sandbox verify-by-patch) and A (frame-11
EINVAL diagnose) running in parallel on a single firefox-fourier build.

Track F: GREEN. Patched Firefox 150.0.1 (firefox-fourier, pkgrel=1.1)
launches on ohm WITHOUT MOZ_DISABLE_RDD_SANDBOX=1 and engages our
libva-v4l2-request backend end-to-end. Three patches needed (Phase 2
identified one and deferred two):
  - Broker policy (SandboxBrokerPolicyFactory.cpp): allow /dev/media*,
    extend cap-filter to admit stateless decoders that lack M2M caps.
  - Seccomp policy (SandboxFilter.cpp): allow ioctl magic byte '|'
    for <linux/media.h> request-API ioctls.
  - Driver (media.c): replace select() with poll() — Mozilla's RDD
    seccomp common policy admits poll/ppoll/epoll_* but not
    select/pselect6. Driver-side fix preferred; smaller surface,
    portable across sandbox policies, and poll() is the modern API.

Track A: REPRODUCES + DIAGNOSED. Frame-11 EINVAL fires deterministically
on a single-slice P-frame (slice_type=0, frame_num=5, post-IDR) — the
exact iter1/iter2 carryover signature, confirming it isn't environmental.
Y2 instrumentation (in v4l2_ioctl_controls) now logs num_controls /
error_idx / per-control id+size on EINVAL. Sizes match kernel UAPI;
error_idx == num_controls is the kernel's "all bad / no specific control"
sentinel — it's a request-level rejection, not a single-field violation.
Fix is iter4's lock; rig + Y2 in place for fast iter4 turnaround.

Build infrastructure introduced: firefox-fourier LXD container on
boltzmann (RK3588 aarch64, persistent, ssh -J boltzmann
builder@firefox-fourier). Upstream Arch x86_64 wasi packages installed
to work around 4-year-stale ALARM versions. PGO generation crashes at
exit (LXC has no display); obj/dist/ tarball used as the deployable
artifact instead of the pacman package.

Phase 6 surprises captured in phase6_iter3_findings.md: malformed
first-cut patch (descriptive vs numeric hunk headers), --enable-v4l2
isn't a Mozilla 150 flag (auto-set on aarch64+GTK), Mozilla 2025 PGP
key rotation, ALARM-stale wasi, onnxruntime missing in ALARM, and the
"no tricks" lesson (revert workarounds first when redirected).

Carries to iter4 substrate: Track A fix is the natural lock; mpv
libplacebo --vo=gpu segfault stays as separate iter4 candidate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:56:34 +00:00

# Iteration 3 — Phase 0 (substrate / motivation / inventory)
Opens 2026-05-04 immediately after iteration 2 close (`phase8_iteration2_close.md`, fork commit `19acc76`, campaign close commit `c36c61e`).
## Predecessor close-out summary (iteration 2 → iteration 3)
iter2 was hardening, not new feature work. Three independent fixes landed:
- Fix 1 (`06beef6`): multi-resolution session-state corruption
- Fix 2 (`e64bb08`): Mesa WSI rejection of non-64-aligned pitch
- Fix 3 (`19acc76`): DMA-BUF lifecycle race — decoupled CAPTURE buffer pool with LRU recycling, the load-bearing fix
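
A minimal sketch of what "LRU recycling" means for a buffer pool, assuming a pool of fixed slot indices in which the slot that has been idle longest is reused first. `CapturePool`, `acquire`, and `release` are hypothetical names for illustration; the actual Fix 3 lives in the C driver and may be structured differently.

```python
from collections import OrderedDict


class CapturePool:
    """Toy pool of CAPTURE buffer slots with least-recently-used recycling.

    Hypothetical illustration only; not the driver's actual data structure.
    """

    def __init__(self, num_slots):
        # Free slots, ordered oldest-released first.
        self._free = OrderedDict((i, None) for i in range(num_slots))
        self._busy = set()

    def acquire(self):
        """Take the slot that has been idle the longest, or None if all are busy."""
        if not self._free:
            return None
        slot, _ = self._free.popitem(last=False)
        self._busy.add(slot)
        return slot

    def release(self, slot):
        """Return a slot; it joins the back of the queue, so it is reused last."""
        self._busy.remove(slot)
        self._free[slot] = None


pool = CapturePool(3)
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()
pool.release(b)
pool.release(a)
# b was released before a, so b is recycled first.
print(pool.acquire(), pool.acquire())
```

Reusing the longest-idle slot maximises the time before a buffer that a consumer may still hold is overwritten, which narrows a lifecycle race without eliminating it.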
iter2 Phase 1 lock met: `mpv --hwdec=vaapi --vo=gpu` plays bbb_1080p30 smoothly per operator inspection.
iter2 Phase 7 verification:
- vaapi-copy 200 frames: 0 drops, real luma gradient, pool LRU visibly recycling ✓
- vaapi `--vo=gpu` real-VO: smooth ✓
- Firefox 150: engages our libva backend (cap_pool architecture confirmed working with surface ID recycling), decodes 10 frames cleanly through hantro, then EINVAL on frame 11 (iter1 carryover Sonnet 7.x family). Requires `MOZ_DISABLE_RDD_SANDBOX=1` because Firefox now routes VAAPI through RDD instead of utility (changed since iter1's test).
## Iteration 3 candidate research questions (UNLOCKED — user picks)
Multiple substantive items survive iter2 close. Each could anchor an entire iteration; some pair naturally.
### A. Frame-11 EINVAL (Sonnet 7.x carryover, decode correctness)
> Identify which V4L2 control returns EINVAL on the 11th decoded frame in Firefox (and likely also under specific mpv stream-shape patterns), and fix.
**Why first**: it's the load-bearing remaining defect. Firefox decodes 10 frames cleanly, then HW decode terminates. Over a 30s+ session the EINVAL likely recurs across surface-recycle cycles. The `Unable to set control(s): Invalid argument` log from iter2 Phase 7 narrows the suspect set: per-request controls (DECODE_PARAMS, SCALING_MATRIX, SPS, PPS) for slice_type=1 / non-IDR mid-stream frames. Sonnet review 7.5 (mid-stream non-IDR) and 7.2 (`num_ref_idx_l0/l1` for multi-slice) are the named hypotheses. Reading `hantro_g1_h264_dec.c` for which control fields it validates is the first concrete investigation step.
**Risk**: may surface a need for SPS/PPS state tracking across requests that doesn't exist in our backend today.
### B. DEBUG instrumentation sweep
> Remove the 0010 / 0011 / 0014 + ENTER logging + sentinel write + msync workaround commit-by-commit, building cleanly between each removal. End state: zero `request_log()` calls in non-error paths, no patch-0011 sentinel write in `EndPicture`, and either delete the msync or document why it stays.
**Why**: required prerequisite for any upstream snapshot, plus Phase 5 review for iter1 had this on the to-do list and iter2 explicitly deferred it. Smaller scope than A; could pair with A or run standalone.
### C. Performance binding cell (deferred from iter1, named-deferred in iter2)
> Establish a measurement protocol for HW vs SW decode on this rig: drop counts, effective FPS, browser CPU%, scanout-plane residency for {mpv vaapi DMA-BUF, mpv vaapi-copy, Firefox HW (sandbox-bypassed), SW baseline}. Anchor results in an in-session evidence dir.
**Why**: anchors all iter1+iter2 claims to numbers. The iter1 close noted "Performance numbers: drop counts ... not anchors — re-measure in iteration 2 with consistent rig" but iter2 was hardening-only. This is the natural perf iteration.
**Risk**: lots of fixture work for one binding cell.
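
A minimal sketch of how one row of such a binding cell could be reduced to comparable numbers, assuming each run yields a count of presented frames, a count of dropped frames, and a wall-clock duration. The `summarize_run` helper and its field names are hypothetical, and the example inputs are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    effective_fps: float
    drop_ratio: float


def summarize_run(presented, dropped, duration_s):
    """Reduce raw counters from one playback run to effective FPS and drop ratio."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    total = presented + dropped
    drop_ratio = dropped / total if total else 0.0
    return RunSummary(effective_fps=presented / duration_s, drop_ratio=drop_ratio)


# Placeholder numbers for illustration only.
summary = summarize_run(presented=290, dropped=10, duration_s=10.0)
print(f"{summary.effective_fps:.1f} fps, {summary.drop_ratio:.1%} dropped")
```

Keeping the reduction in one function means every configuration in the cell (DMA-BUF, vaapi-copy, Firefox HW, SW baseline) is summarized the same way.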
### D. Multi-context libva safety (Sonnet review 9.6)
> Make the backend safe for two concurrent libva contexts in the same process (e.g. Firefox tab playing one video while another tab plays a different resolution). Today's `LAST_OUTPUT_WIDTH/HEIGHT` is a process-global static and `cap_pool` is per-driver_data but the V4L2 device is shared.
**Why**: iter2 documented this as deferred; the architectural surgery is similar to Fix 3 (per-context pools, per-context format cache). One real consumer might surface this — Firefox tab + mpv together, for example.
### E. V4L2_MEMORY_DMABUF (the iter2 plan's Option B — true architectural fix for DMA-BUF lifecycle)
> Replace V4L2_MEMORY_MMAP with userspace dma-buf allocation: userspace allocates buffers via gbm/dumb-buffer and hands them to V4L2 via VIDIOC_QBUF with the buffer's `memory` field set to V4L2_MEMORY_DMABUF. EXPBUF is no longer needed; lifetime is unambiguous because userspace owns the buffer.
**Why**: iter2 Fix 3 is statistical (Option A): adding more buffers + LRU narrows the race window but doesn't close it. Option B closes it. Significant kernel-side test surface (does hantro on this kernel actually accept DMABUF memory? GStreamer's v4l2slh264dec uses MMAP, so DMABUF on hantro may not be tested upstream).
**Risk**: highest unknown of any candidate; possibly requires kernel work.
### F. Firefox RDD sandbox vs `/dev/media0` (backlog from iter2 close — UPDATED with web research 2026-05-04)
> File a Mozilla Bugzilla report ("RDD sandbox: allow /dev/media* and V4L2-stateless nodes for request-API hardware decoders") referencing Bug 1833354 as the V4L2-M2M precedent, propose a small patch.
**Sonnet web research findings (2026-05-04):**
- NO existing Mozilla bug covers `/dev/media*` or the V4L2-stateless request-API path. We'd be filing the first.
- Closest existing work is Bug 1833354 (FF116 — V4L2-M2M only) and Bug 1965646 (FF141 — extends M2M codecs). Neither addresses the request-API.
- Allowlist is in `security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp::GetRDDPolicy()`. It calls `AddV4l2Dependencies()` which enumerates `/dev/video*` and **filters by `V4L2_CAP_VIDEO_M2M | V4L2_CAP_VIDEO_M2M_MPLANE`**. Hantro stateless uses `CAPTURE_MPLANE + OUTPUT_MPLANE + request-API` — explicitly excluded by this filter. `/dev/media*` is completely absent (no `AddPath` for it anywhere).
- Architecture is broker-on-IPC (same model as `renderD128`): when sandboxed RDD calls `open()`, seccomp intercepts, forwards over socket to broker thread in parent process, which checks path policy and opens on RDD's behalf. **No broker redesign needed** — just two functions in one file:
  1. `GetRDDPolicy()`: add `policy->AddPath(rdwr, "/dev/media0")` (or glob)
  2. `AddV4l2Dependencies()`: extend cap filter to also admit stateless V4L2 nodes (or add a new `AddV4l2RequestDependencies()`)
  3. Possibly `SandboxFilter.cpp`: verify `MEDIA_REQUEST_IOC_QUEUE` ioctl (`_IO('|', 0x80)` in `<linux/media.h>`, i.e. type 0x7c, number 0x80) is allowed
- Verdict: **real Mozilla patch needed, small in scope**. NOT a documentation-only fix.
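
A minimal sketch of the capability test described above, assuming the filter is a plain bitmask check on the capability word returned by `VIDIOC_QUERYCAP`. The flag values are those in `<linux/videodev2.h>`; the helper names and the example capability word for the stateless node are illustrative assumptions taken from the description above, not readings from ohm or from Mozilla's code.

```python
# Capability flags as defined in <linux/videodev2.h>.
V4L2_CAP_VIDEO_CAPTURE_MPLANE = 0x00001000
V4L2_CAP_VIDEO_OUTPUT_MPLANE = 0x00002000
V4L2_CAP_VIDEO_M2M_MPLANE = 0x00004000
V4L2_CAP_VIDEO_M2M = 0x00008000
V4L2_CAP_STREAMING = 0x04000000

M2M_MASK = V4L2_CAP_VIDEO_M2M | V4L2_CAP_VIDEO_M2M_MPLANE
STATELESS_MASK = V4L2_CAP_VIDEO_CAPTURE_MPLANE | V4L2_CAP_VIDEO_OUTPUT_MPLANE


def admitted_by_m2m_filter(caps):
    """Filter as described above: admit only nodes advertising an M2M capability."""
    return bool(caps & M2M_MASK)


def admitted_by_extended_filter(caps):
    """Hypothetical extension: also admit nodes exposing both multiplanar queues."""
    return admitted_by_m2m_filter(caps) or (caps & STATELESS_MASK) == STATELESS_MASK


# Assumed capability word for a node shaped like the one described above.
stateless_caps = STATELESS_MASK | V4L2_CAP_STREAMING
print(admitted_by_m2m_filter(stateless_caps), admitted_by_extended_filter(stateless_caps))
```

A real patch may need a tighter test than this (for example, also checking the OUTPUT queue's pixel formats) so that unrelated nodes are not admitted.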
**Why**: highest-leverage upstream contribution out of any iter3 candidate — a 30-line patch upstream removes the env-var requirement for every V4L2-stateless user, including future Rockchip/Hantro/Cedrus/Allwinner targets. Cross-verification on Intel/NVIDIA test boxes (meitner / clevo) is no longer needed for the diagnosis (cap-filter + missing /dev/media* are mechanically the explanation), but might be useful for "smoke-confirm the same env behaves identically on a working VAAPI box" before filing.
**Sources** (Sonnet's references):
- [Bug 1833354 — Implement V4L2-M2M HW decode](https://bugzilla.mozilla.org/show_bug.cgi?id=1833354)
- [Bug 1683808 — Make VAAPI work in RDD](https://bugzilla.mozilla.org/show_bug.cgi?id=1683808)
- [Bug 1751363 — VAAPI snapshot fails due to RDD sandbox](https://bugzilla.mozilla.org/show_bug.cgi?id=1751363)
- [SandboxBrokerPolicyFactory.cpp searchfox](https://searchfox.org/mozilla-central/source/security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp)
- [SandboxFilter.cpp searchfox](https://searchfox.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp)
### G. Sonnet 7.x miscellany (carryovers from iter1 review)
> 7.1 EACCES on VIDIOC_G_EXT_CTRLS readback (move before MEDIA_REQUEST_IOC_QUEUE; if still EACCES, file kernel issue and remove dead code).
> 7.4 `// HACK` block in `surface.c::CreateSurfaces2` hardcoding `V4L2_PIX_FMT_H264_SLICE` — wrong for MPEG-2.
> 7.5 Firefox seek-to-non-IDR test corpus.
**Why**: cleanup. Fits in slack time of any other iteration's plan.
### Recommended pairings
- A + B (defect fix + clean-up of unused diagnostic before commit)
- C alone (perf measurement is its own thing)
- E alone (high risk, focus needed)
- F + G as a "small-debt cleanup" iteration
## State that carries (re-verified 2026-05-04 23:35)
- **Hardware**: ohm RK3568, kernel `6.19.10-danctnix1-1-pinetab2`, hantro G1/G2 on `/dev/video1` + `/dev/media0` (perms unchanged: `crw-rw----+ root:video` with mfritsche ACL)
- **Userspace**: libva 2.23.0-1, libva-utils 2.22.0-1, mpv 1:0.41.0-3, firefox 150.0.1-1, mesa 1:26.0.5-1, libdrm 2.4.131-1
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` sha256 `dcf8a7170fbd49bb...`
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `f27e006433cd4769...` (iter2 build with all three fixes)
- **Fork master**: `19acc76` (iter2 Fix 3); commits in iter1 + iter2 still ahead of bootlin upstream
- **Build harness**: `meson setup --buildtype=release && ninja` on ohm at `/tmp/libva-src/libva-v4l2-request-fourier` (rebuilds on every session — /tmp is tmpfs); deploy to `/usr/lib/dri/v4l2_request_drv_video.so`
- **Live test rig env** (Plasma 6 Wayland): `XDG_RUNTIME_DIR=/run/user/1001`, `WAYLAND_DISPLAY=wayland-0`, `DISPLAY=:0`, `XAUTHORITY=/run/user/1001/xauth_ZnmtRw` (changes per session), `DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus`. SSH-driven launches need `MOZ_ENABLE_WAYLAND=1` for Firefox + `MOZ_DISABLE_RDD_SANDBOX=1` for VAAPI.
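
A minimal sketch of re-verifying the fixture and the installed driver against the recorded digests. The digests above are truncated, so the check compares only the recorded prefix; `sha256_hex` and `sha256_matches_prefix` are hypothetical helpers.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def sha256_matches_prefix(path, recorded_prefix):
    """True if the file's digest starts with the (possibly truncated) recorded value."""
    return sha256_hex(path).startswith(recorded_prefix.lower())
```

For example, `sha256_matches_prefix("/usr/lib/dri/v4l2_request_drv_video.so", "f27e006433cd4769")` would confirm that the iter2 build is still the one installed, to the precision of the recorded prefix.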
## State that does NOT carry (re-acquire per `feedback_replicate_baseline_first.md`)
- Performance numbers: same caution as iter1+iter2 close. iter2 saw `0 drops in 200 frames` for vaapi-copy and `smooth on operator inspection` for vaapi DMA-BUF; these are smoke verdicts, not anchored measurements. iter3 perf candidate (option C) is the natural place to anchor.
- Firefox 10-frame decode + EINVAL at frame 11: re-acquire under iter3-specific binding cell. Today's evidence is in `/tmp/ff-stdout.log` on ohm but tmpfs-volatile.
## Tooling and measurement-instrument inventory (live verification)
Carried from iter2:
- `strace -f -e trace=openat,close,ioctl` for libva-side V4L2 ioctl tracing
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle
- `sudo dmesg -w` for kernel-side warnings (typically silent on this rig)
- `sudo lsof /dev/video1` for fd ownership snapshots
- `mpv --frames=N --vo=gpu` with stderr capture
- Firefox `MOZ_LOG=PlatformDecoderModule:5,VideoBridge:5` (under `MOZ_DISABLE_RDD_SANDBOX=1`)
- `coredumpctl info <pid>` for crash backtraces
- Operator visual inspection on real screen (load-bearing for boolean correctness)
Likely needed for specific iter3 candidates:
- For A (frame-11 EINVAL): `dmesg | grep hantro` during decode to catch driver-side EINVAL reasons; `ftrace events/hantro/*` if kernel exposes such; reading `hantro_g1_h264_dec.c` for control validation rules
- For C (perf): `pidstat -u -p $(pidof mpv) 1` for CPU%; `gpustat`-equivalent for Mali-G52 (likely needs Panfrost ftrace); compositor scanout query (Wayland `ext-output-management`?) is harder
- For E (DMABUF): `gbm_bo_create` userspace allocation test program; `VIDIOC_QBUF` with `memory=V4L2_MEMORY_DMABUF` exploratory path
- For F (sandbox): meitner / clevo access; Firefox source `security/sandbox/linux/SandboxFilter.cpp`
## In-scope (LOCKED 2026-05-04 for iteration 3) — F + A in parallel
**Track F (sandbox hypothesis verify-by-patch).** Build `firefox-fourier`: a Firefox 150.0.1 fork with the RDD-sandbox patch from candidate F (allow `/dev/media0`, extend `AddV4l2Dependencies()` cap filter to admit stateless V4L2 nodes, verify `MEDIA_REQUEST_IOC_QUEUE` ioctl passes seccomp). Run on ohm without `MOZ_DISABLE_RDD_SANDBOX=1`. Stronger test of the hypothesis than Sonnet's static-analysis verdict — empirically separates "sandbox is the env-var requirement's cause" from any other gating factor.
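
To check a seccomp rule against the right numbers, it helps to decode the ioctl request value. A minimal sketch, assuming the generic Linux `_IOC` bit layout used on aarch64 and x86_64 (8-bit number, 8-bit type, 14-bit size, 2-bit direction) and the `<linux/media.h>` definitions `MEDIA_REQUEST_IOC_QUEUE = _IO('|', 0x80)` and `MEDIA_REQUEST_IOC_REINIT = _IO('|', 0x81)`; the helper names are illustrative.

```python
# Generic _IOC layout (asm-generic/ioctl.h):
# number in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31.
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_NONE = 0


def ioc(direction, type_char, nr, size=0):
    """Compose an ioctl request number the way the kernel's _IOC() macro does."""
    return ((direction << _IOC_DIRSHIFT)
            | (size << _IOC_SIZESHIFT)
            | (ord(type_char) << _IOC_TYPESHIFT)
            | (nr << _IOC_NRSHIFT))


def ioc_type(request):
    """Extract the type ("magic") byte from a request number."""
    return (request >> _IOC_TYPESHIFT) & 0xFF


def ioc_nr(request):
    """Extract the command number from a request number."""
    return (request >> _IOC_NRSHIFT) & 0xFF


MEDIA_REQUEST_IOC_QUEUE = ioc(_IOC_NONE, "|", 0x80)
MEDIA_REQUEST_IOC_REINIT = ioc(_IOC_NONE, "|", 0x81)

print(hex(MEDIA_REQUEST_IOC_QUEUE),
      hex(ioc_type(MEDIA_REQUEST_IOC_QUEUE)),
      hex(ioc_nr(MEDIA_REQUEST_IOC_QUEUE)))
```

A rule keyed on the type byte `'|'` would cover both request ioctls, and also `MEDIA_IOC_REQUEST_ALLOC`, which uses the same type byte.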
**Track A (frame-11 EINVAL).** With the sandbox now controlled (Track F's patched binary), check whether the frame-11 EINVAL still recurs; if it does, it is isolated on a clean rig. Identify which V4L2 control returns EINVAL on the 11th decoded frame in Firefox; the suspect surface was narrowed by Sonnet review to per-request DECODE_PARAMS / SCALING_MATRIX / SPS / PPS for non-IDR slices (7.5) or a `num_ref_idx_l0/l1` mismatch in multi-slice frames (7.2). First concrete step: read `hantro_g1_h264_dec.c` for control validation rules, then run patched Firefox under `MOZ_LOG=PlatformDecoderModule:5` + driver request_log to capture the failing control set.
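
When capturing the failing control set, the `error_idx` field of `struct v4l2_ext_controls` needs careful reading. Per the V4L2 extended-controls documentation, `VIDIOC_S_EXT_CTRLS` sets `error_idx` to `count` when the failure happens in the up-front validation step, so that value does not point at any one control; `VIDIOC_TRY_EXT_CTRLS` reports the index of the control that failed validation instead. A minimal sketch of that interpretation; the function name and the returned strings are illustrative.

```python
def interpret_error_idx(error_idx, count, used_try=False):
    """Describe what error_idx says about a failed extended-controls ioctl.

    used_try=True means the call was VIDIOC_TRY_EXT_CTRLS rather than
    VIDIOC_S_EXT_CTRLS.
    """
    if error_idx >= count:
        if used_try:
            return "error not tied to a specific control"
        # S_EXT_CTRLS cannot say which control failed the validation step.
        return "validation step failed or error not tied to a specific control"
    if used_try:
        return f"control {error_idx} failed validation"
    return f"error while applying control {error_idx}; controls before it were set"


print(interpret_error_idx(error_idx=4, count=4))
print(interpret_error_idx(error_idx=2, count=4, used_try=True))
```

Re-issuing the same control array through `VIDIOC_TRY_EXT_CTRLS` may therefore narrow an `error_idx == count` result down to a single control.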
**Why parallel rather than sequential:** Track F's verification rig (patched Firefox on ohm, running bbb_1080p30 without sandbox bypass) IS the rig that surfaces Track A's signature. Running them in one binding cell is the natural shape; splitting to two iterations would require setting up the same rig twice.
### Build host plan (Phase 4 input prereq)
Build venue: **boltzmann LXD container** (RK3588 aarch64, 8 cores, 30 GB RAM, NVMe, always-on). Native arm64 build avoids cross-compile. **AUR/PKGBUILD-based overlay** preferred over raw mozilla-central checkout — Arch's firefox PKGBUILD already has a working aarch64 mozconfig and dep set; we layer our sandbox patch as an additional `source=()` patch in `prepare()`. On rebuilds use `makepkg -e` to skip re-extraction and re-patching.
Fallback if rust-on-aarch64 toolchain proves unworkable in the container: power up `data` (x86_64 box), prevent its sleep timer, set up cross-compile toolchain to aarch64. AUR rebuild semantics (`makepkg -e`) carry over.
## Out-of-scope finding surfaced 2026-05-05 (carry to iter4)
**mpv libplacebo segfault on `--vo=gpu` post-reboot.** Operator-side reproduction with `LIBVA_DRIVER_NAME=v4l2_request mpv --hwdec=vaapi --vo=gpu --no-audio bbb_1080p30_h264.mp4` after host reboot hit a NEW failure pattern (not the iter2-close "smooth" verdict, not the Track A frame-11 EINVAL):
- Vulkan init fails: `[vo/gpu/libplacebo] EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED` (line 4 of trace)
- 4 frames decode cleanly (surfaces 67108864 to 67108867 sync to real luma data, var=4 on the I-frame)
- After surf 67108868's BeginPicture: two `Unable to request buffers: Device or resource busy` (EBUSY on REQBUFS)
- Then a bizarre `CreateSurfaces2: surf_width=16 surf_height=16 fmt_width=48 fmt_height=48 sizes[1]=1050626` (1050626 = 0x100802, looks uninitialized)
- Segfault
Hypothesis: vulkan-init-failed code path triggers a resolution-probe in libplacebo/mpv that calls `vaCreateSurfaces` with downscale-probe dimensions while CAPTURE is still queued. The cap_pool resolution-change path drains+REQBUFs but doesn't fully flush queued CAPTURE buffers, kernel returns EBUSY, driver pushes ahead with garbage `sizes[1]`, mmap or pool-init crashes.
iter3 disposition: **option 3 selected** (verify-via-Firefox first, defer libplacebo segfault to iter4). Firefox doesn't go through the libplacebo probe paths, so F+A's verification can proceed on patched-Firefox even with mpv broken on the vulkan-fallback path. If `firefox-fourier` works on ohm despite this regression, the lock for iter4 becomes:
- **Track libplacebo:** harden cap_pool resolution-change to drain CAPTURE before REQBUFs; reject `vaCreateSurfaces` with sentinel-shaped sizes[]; investigate the Vulkan init failure (could be Mesa update, kernel reboot reshuffling GPU state, or genuine Mesa/libplacebo regression).
Or, if the mpv segfault ALSO afflicts firefox-fourier (e.g. the same resolution-probe path is shared at a lower libva layer), iter3 expands or yields back at Phase 7. We learn that empirically.
## Out-of-scope (LOCKED 2026-05-04 for iteration 3)
- Candidates B, C, D, E, G — deferred to a later iteration. B (DEBUG sweep) is the most natural candidate for iter4 since it's an upstream prereq.
- New codecs (MPEG-2, VP8, VP9, AV1, HEVC) — H.264-only scope holds from iter1+iter2.
- New target hardware on the libva side (fresnel RK3399, ampere RK3588) — separate iteration after ohm path is hardened. Note: boltzmann (RK3588) is recruited only as a Firefox build host this iteration, NOT as a libva target.
- Bootlin upstreaming PR — `feedback_no_upstream.md` holds; no PRs unless explicitly tasked.
- Mozilla Bugzilla bug-file. Substituted by verify-by-patch; if the patched binary works, the bug filing becomes a follow-up upstream contribution, not part of iter3's Phase 1 success criterion.
- HEVC re-introduction (stripped in fourier port; no hantro G2 HEVC validation in operator's test corpus).
## Phase 1 success criterion (LOCKED 2026-05-04)
**Track F:** Patched `firefox-fourier` (firefox-150.0.1 + RDD-sandbox patch) launched on ohm WITHOUT `MOZ_DISABLE_RDD_SANDBOX=1` engages our libva-v4l2-request backend, opens `/dev/video1` + `/dev/media0` from RDD process, and decodes ≥10 frames of bbb_1080p30 through hantro. (10 frames is the iter2-observed floor before the EINVAL hits — past 10 is Track A's domain.)
**Track A:** Same patched-binary rig decodes ≥30s of bbb_1080p30 without `Unable to set control(s): Invalid argument` emerging in driver stderr. Where this requires changes, the change lives in libva-v4l2-request-fourier (per-request control set construction), not in firefox-fourier.
**Joint success:** Both above, on the same patched binary, in the same operator session, with anchored evidence (driver stderr capture, Firefox MOZ_LOG capture, dmesg capture, operator visual confirmation of decode output on screen).
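
A minimal sketch of checking the Track A criterion against a captured driver stderr log, assuming the log is plain text and that the error string appears verbatim as quoted above. `find_einval_lines` and `track_a_passes` are hypothetical helpers; they do not check the 30-second duration, which has to come from the capture itself.

```python
EINVAL_MARKER = "Unable to set control(s): Invalid argument"


def find_einval_lines(log_text):
    """Return (line_number, line) pairs for every line containing the EINVAL marker."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if EINVAL_MARKER in line
    ]


def track_a_passes(log_text):
    """True if the captured driver stderr contains no set-control EINVAL."""
    return not find_einval_lines(log_text)
```

Returning the line numbers as well as the verdict makes it easy to line a failure up with the Firefox `MOZ_LOG` and dmesg captures from the same session.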
## Stop point
Phase 1 LOCKED. iter3 proceeds to Phase 2 (situation analysis: read Mozilla sandbox source on a local mirror for the two target functions), Phase 3 (baseline anchor: re-verify frame-11 EINVAL still reproduces on ohm with stock Firefox 150 + sandbox bypass — same picture as iter2 close), Phase 4 (write the sandbox patch + plan PKGBUILD overlay + lock container provisioning with his), Phase 5 (sonnet review of patch), Phase 6 (build firefox-fourier in container, deploy to ohm), Phase 7 (verify F + A simultaneously), Phase 8 (iteration close). Stop only if user is needed (e.g. the patch produces multi-way design choice, or the rust-aarch64 fallback to `data` is required).