Files
libva-multiplanar/phase8_iteration3_close.md
marfrit f91469abe3 Iteration 3 close — F GREEN, A reproduced + diagnosed for iter4
Phase 1 locked F (Firefox RDD sandbox verify-by-patch) and A (frame-11
EINVAL diagnose) running in parallel on a single firefox-fourier build.

Track F: GREEN. Patched Firefox 150.0.1 (firefox-fourier, pkgrel=1.1)
launches on ohm WITHOUT MOZ_DISABLE_RDD_SANDBOX=1 and engages our
libva-v4l2-request backend end-to-end. Three patches needed (Phase 2
identified one and deferred two):
  - Broker policy (SandboxBrokerPolicyFactory.cpp): allow /dev/media*,
    extend cap-filter to admit stateless decoders that lack M2M caps.
  - Seccomp policy (SandboxFilter.cpp): allow ioctl magic byte '|'
    for <linux/media.h> request-API ioctls.
  - Driver (media.c): replace select() with poll() — Mozilla's RDD
    seccomp common policy admits poll/ppoll/epoll_* but not
    select/pselect6. Driver-side fix preferred; smaller surface,
    portable across sandbox policies, and poll() is the modern API.

Track A: REPRODUCES + DIAGNOSED. Frame-11 EINVAL fires deterministically
on a single-slice P-frame (slice_type=0, frame_num=5, post-IDR) — the
exact iter1/iter2 carryover signature, confirming it isn't environmental.
Y2 instrumentation (in v4l2_ioctl_controls) now logs num_controls /
error_idx / per-control id+size on EINVAL. Sizes match kernel UAPI;
error_idx == num_controls is the kernel's "all bad / no specific control"
sentinel — it's a request-level rejection, not a single-field violation.
Fix is iter4's lock; rig + Y2 in place for fast iter4 turnaround.

Build infrastructure introduced: firefox-fourier LXD container on
boltzmann (RK3588 aarch64, persistent, ssh -J boltzmann
builder@firefox-fourier). Upstream Arch x86_64 wasi packages installed
to work around 4-year-stale ALARM versions. PGO generation crashes at
exit (LXC has no display); obj/dist/ tarball used as the deployable
artifact instead of the pacman package.

Phase 6 surprises captured in phase6_iter3_findings.md: malformed
first-cut patch (descriptive vs numeric hunk headers), --enable-v4l2
isn't a Mozilla 150 flag (auto-set on aarch64+GTK), Mozilla 2025 PGP
key rotation, ALARM-stale wasi, onnxruntime missing in ALARM, and the
"no tricks" lesson (revert workarounds first when redirected).

Carries to iter4 substrate: Track A fix is the natural lock; mpv
libplacebo --vo=gpu segfault stays as separate iter4 candidate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:56:34 +00:00

120 lines
13 KiB
Markdown

# Iteration 3 close (Phase 8) — F+A locked, F GREEN, A reproduced + diagnosed
Opened 2026-05-04, closing 2026-05-05. Locked candidate: **F (Firefox RDD sandbox verify-by-patch) + A (frame-11 EINVAL diagnose)** running in parallel on a single firefox-fourier build.
## Verdict per track
### Track F: GREEN
Patched Firefox 150.0.1 (firefox-fourier, `pkgrel=1.1`) launched on ohm **without `MOZ_DISABLE_RDD_SANDBOX=1`** engages our libva-v4l2-request backend, opens `/dev/video1` + `/dev/media0` from the sandboxed RDD process, and submits decode requests through `MEDIA_REQUEST_IOC_*` ioctls. ENETDOWN signature from iter2 is gone; libva fully initialized; decode reaches the same frame-10 mark as iter2's sandbox-bypass run — proving the patched-sandbox is functionally equivalent to the bypass for V4L2 stateless decode.
Three distinct gates needed patching to reach this state — Phase 2 had identified one (broker policy) and explicitly deferred the seccomp question to empirical Phase 7. Phase 7 surfaced two MORE gates beyond what Phase 2 anticipated:
1. **Broker policy** (`security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp`):
- `AddV4l2Dependencies()` cap-filter widened: admit `(CAPTURE_MPLANE & OUTPUT_MPLANE & STREAMING)` for stateless decoders that don't advertise `M2M`.
- New `AddV4l2RequestApiDependencies()` enumerates `/dev/media*` as rdwr.
2. **Seccomp policy** (`security/sandbox/linux/SandboxFilter.cpp`):
- Add ioctl magic byte `'|'` (`<linux/media.h>` ioctls) to RDD's allowlist alongside existing `'V'` (V4L2). Without this, MEDIA_REQUEST_IOC_NEW_REQUEST returned ENOSYS; libva couldn't allocate request fds.
3. **Driver-side** (`libva-v4l2-request-fourier/src/media.c`):
- `media_request_wait_completion()` migrated from `select()` to `poll()`. Mozilla's RDD seccomp common policy admits `poll/ppoll/epoll_*` but not `select/pselect6`. Without this, `select()` returned ENOSYS even after the broker + ioctl gates opened. Driver-side fix preferred over expanding Firefox seccomp — smaller surface, more portable across sandbox policies, and `poll()` is the modern API anyway.
The Phase 2 deferral ("if patched binary trips SIGSYS, extend SandboxFilter") was correctly defensive but missed that Mozilla's seccomp returns ENOSYS via `SECCOMP_RET_ERRNO` rather than SIGSYS — silent fall-through that we only caught by reading our driver's own log lines. Lesson distilled below.
### Track A: REPRODUCED + DIAGNOSED, NOT FIXED
Frame-11 EINVAL fires deterministically on the patched-sandbox rig — exactly matching iter1/iter2's carryover signature, ruling out "rig-specific" alibis. Decode succeeds for 10 BeginPictures (luma `var=0..4` confirms real NV12 output), then on the 11th `set_controls` call the kernel rejects with EINVAL.
Y2 instrumentation (`v4l2_ioctl_controls` extension, two iterations) now produces full diagnostic output on the failing call:
```
v4l2-request: S_EXT_CTRLS EINVAL: num_controls=4 error_idx=4
ctrl[0]: id=0x00a40902 size=1048 # V4L2_CID_STATELESS_H264_SPS
ctrl[1]: id=0x00a40903 size=12 # V4L2_CID_STATELESS_H264_PPS
ctrl[2]: id=0x00a40907 size=560 # V4L2_CID_STATELESS_H264_DECODE_PARAMS
ctrl[3]: id=0x00a40904 size=480 # V4L2_CID_STATELESS_H264_SCALING_MATRIX
```
`error_idx == num_controls` is the kernel's "all bad / no specific control identified" sentinel — request-level rejection, not a single-field violation. Sizes match kernel UAPI (`v4l2_ctrl_h264_sps`=1048, etc.) so this is NOT a struct-size mismatch.
The failing frame is a single-slice P-frame post-IDR: `slice_type=0 frame_num=5 poc_lsb=20 flags=SHORT_TERM_REFERENCE`. Sonnet review 7.5 ("mid-stream non-IDR") fits this signature better than 7.2 (multi-slice num_ref_idx) which doesn't apply to single-slice frames.
Phase 4 plan explicitly framed Track A's fix as Phase 7+ work informed by the rig: *"No code fix in Phase 4. The fix requires knowing WHICH V4L2 control field returns EINVAL on frame 11."* iter3 delivered the rig that makes that diagnosis reproducible. The next step — read `hantro_g1_h264_dec.c::set_params()` validation, diff against our DECODE_PARAMS / SLICE_PARAMS / SPS / PPS construction, narrow the failing field — is iter4's locked question.
## What landed
### libva-v4l2-request-fourier commits
- `media.c::media_request_wait_completion`: replace `select(except_fds)` with `poll(POLLPRI)` for sandbox compatibility
- `v4l2.c::v4l2_ioctl_controls`: Y2 instrumentation. On `VIDIOC_S_EXT_CTRLS` returning -EINVAL, log `num_controls`, `error_idx`, and per-control `id`+`size`. Pure diagnostic add-on; no behavior change. Should be removed at iter4's DEBUG sweep alongside iter1's instrumentation.
### libva-multiplanar campaign artifacts
- `firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` — three-hunk Firefox patch (broker policy two hunks, seccomp policy one hunk). Applied via Arch PKGBUILD overlay in the boltzmann LXD container.
- `firefox-fourier/PKGBUILD-overlay.md` — verified working PKGBUILD overlay strategy: `pkgrel=1.1`, `arch=(x86_64 aarch64)`, our patch in `source=()` + `prepare()`, onnxruntime stripped, `--skippgpcheck` for Mozilla key rotation. No `--enable-v4l2` (Mozilla 150 auto-enables on aarch64+GTK).
- `firefox-fourier/bootstrap.sh` — reproducible bootstrap inside the LXD container.
- `phase2_iter3_situation.md` — Mozilla sandbox source verbatim (broker policy + cap filter quoted).
- `phase3_iter3_baseline.md` — pre-patch baseline anchored from iter2-close evidence (ohm offline at Phase 3 time).
- `phase4_iter3_plan.md` — Phase 4 plan + Phase 5 review checklist.
- `phase5_iter3_review.md` — sonnet review (Y1 patch idiom fix, Y2 driver `error_idx` instrumentation requirement, B-slice copy-paste finding kept for iter4).
- `phase6_iter3_findings.md` — six build-side surprises (proper unified-diff, no `--enable-v4l2`, GPG rotation, ALARM-stale wasi cluster, onnxruntime gap, "no tricks" lesson).
- `phase8_iteration3_close.md` — this file.
### Build infrastructure introduced
- `firefox-fourier` LXD container on **boltzmann** (RK3588 aarch64, 8 cores, 24 GB RAM, 787 GB free on `/build` NVMe). Provisioned by the `his` agent. Persistent (autostart=true). Useful for iter4 if Firefox rebuilds become necessary.
- Upstream Arch x86_64 wasi packages (`arch=any`) cached at `/build/aur/wasi/upstream-any/`. ALARM extra is years stale on these — same fix pattern likely needed for any future ALARM container needing current wasi tooling.
- Phase 7 evidence collector: `/home/mfritsche/iter3_phase7_evidence.sh` on ohm.vpn. Honors `LOG=` env override, prints per-track verdict.
- Autonomous Phase 7 runner: `/tmp/run_phase7_v2.sh` on ohm.vpn. Discovers Plasma session env from a long-running user process, launches firefox-fourier, captures stderr, kills cleanly. Tmpfs-volatile.
## State that carries to iter4
- **Hardware**: ohm RK3568 hantro G1/G2, kernel 6.19.10. Userspace versions all unchanged (firefox 150.0.1, libva 2.23.0, mesa 26.0.5, libdrm 2.4.131).
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `70a2bb1e16012a5d...` (iter3 build with poll() fix + Y2 instrumentation).
- **Firefox installed**: `/opt/firefox-fourier/firefox` (Mozilla Firefox 150.0.1, libxul.so 3.59 GB — PGO-instrumented stage-1 binary; functionally equivalent to release for our purposes; iter4 may want a clean PGO-disabled rebuild for performance).
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` (sha256 `dcf8a7170fbd...`).
- **Access path to ohm**: `ohm.vpn` (changed from `ohm.fritz.box` mid-iteration). Autonomous test rig works without operator intervention via Plasma session env discovery.
- **Build container**: `firefox-fourier` LXD on boltzmann, accessed `ssh -J boltzmann builder@firefox-fourier`. Source still extracted at `/build/aur/firefox-fourier/src/firefox-150.0.1/` with iter3 patches applied.
## State that does NOT carry
- The PGO instrumentation profile attempt always crashes at exit with `LLVM Profile Error: Permission denied` writes — irrelevant noise, will recur on every run of this binary.
- `/tmp/ff-fourier-stderr-v2.log` is tmpfs-volatile. Anchor before reboot if needed; iter3's Phase 7 anchored evidence is in this campaign repo's commit history (script outputs were captured in the close).
## Documented limitations carried into iteration 4 substrate
- **Track A unfixed**. The frame-11 EINVAL is the natural iter4 lock. With the rig and Y2 in place, iter4 starts with a richer baseline than iter3 did.
- **Mpv libplacebo `--vo=gpu` regression** (carried from iter3 substrate, never iter3-scope). `Unable to request buffers: Device or resource busy` followed by SEGV during a downscale-probe surface creation. Vulkan init fails on this Plasma session; Mesa/Mozilla update may have shifted the fallback path. iter4 candidate.
- **VAAPI consumer probe robustness** (existing memory `feedback_consumer_probe_calls.md`) — ffmpeg's `av_hwframe_ctx_init` calls vaDeriveImage on never-decoded surfaces. Our cap_pool tolerates this post-iter2; iter4 work shouldn't regress.
- **PGO profile generation under sandbox**. Phase 6 finding: `--enable-profile-generate=cross` PGO step needs an X11/Wayland display the LXC container can't provide. iter4 may want a clean PGO-disabled rebuild.
## Lessons distilled to memory
- **`feedback_no_tricks_revert_first.md`** (NEW) — when the user redirects on an in-flight workaround, the first action is to revert the workaround on disk, not continue diagnosing with the trick still active. iter3 lost ~1h to a stale background makepkg running against a python-edited PKGBUILD that had `--without-wasm-sandboxed-libraries` substituted in after the user said "no tricks." The `his` subagent caught and reverted it; the lesson is: do that proactively.
- **`feedback_seccomp_returns_enosys.md`** (NEW) — Mozilla's RDD seccomp policy returns `SECCOMP_RET_ERRNO` with `ENOSYS` for filtered syscalls, not `SIGSYS`. Phase 2's deferral defaulted to "we'll see SIGSYS if seccomp blocks something" — that assumption was wrong. ENOSYS surfaces as `Function not implemented` strerror in driver logs, easy to miss. Pattern: any "not implemented" errno from a sandboxed process under Mozilla's filter, suspect seccomp first.
- **`reference_alarm_stale_wasi.md`** (NEW) — ALARM (Arch Linux ARM) extra repo's wasi-* packages are 4 years stale (sdk-13 era). Mozilla 150 + clang 22 require sdk-33 wasm32-wasip1 toolchain. Fix: install upstream Arch x86_64 `arch=any` packages directly from `geo.mirror.pkgbuild.com`. Cached at `/build/aur/wasi/upstream-any/` on boltzmann firefox-fourier container.
- **`reference_firefox_fourier_container.md`** (NEW) — boltzmann LXD `firefox-fourier` container: builder@firefox-fourier via `ssh -J boltzmann`, /build is NVMe-backed bind-mount with 787 GB free, all Firefox build prereqs staged. Persistent across boltzmann reboots.
(Process memory `feedback_replicate_baseline_first.md` continues to apply; iter3's Phase 3 anchored from iter2-close evidence rather than re-acquiring with ohm offline, which was the right call when ohm was unreachable but the substrate state was unchanged within hours.)
## Bootlin upstream outlook
iter3 produces a Firefox patch that's a candidate for upstream Mozilla submission (currently no Mozilla bug exists for /dev/media* + V4L2-stateless RDD sandbox per Phase 0 Sonnet research). The patch is ~50 lines across two files; reviewer concerns would center on `/dev/media*` rdwr enumeration on x86 desktop where media controllers can be ISP/webcam (not just codec). For ARM-embedded targets the patch is well-scoped. Per `feedback_no_upstream.md`, no PR/MR happens without explicit operator instruction.
Driver-side `select() → poll()` change is a portable improvement that benefits any sandbox model, not just Mozilla's. Also a candidate for bootlin upstream — but again, deferred per policy.
## Phase 1 success criterion — final
Quoted from `phase0_findings_iter3.md`:
> **Track F:** Patched `firefox-fourier` (firefox-150.0.1 + RDD-sandbox patch) launched on ohm WITHOUT `MOZ_DISABLE_RDD_SANDBOX=1` engages our libva-v4l2-request backend, opens `/dev/video1` + `/dev/media0` from RDD process, and decodes ≥10 frames of bbb_1080p30 through hantro.
✓ HIT. ENETDOWN=0, cap_pool_init=1, BeginPicture=10, SyncSurface=42 (consumer probe overhead), EINVAL=0 in the first 10 frames.
> **Track A:** Same patched-binary rig decodes ≥30s of bbb_1080p30 without `Unable to set control(s): Invalid argument` emerging in driver stderr.
✗ NOT HIT. EINVAL fires on the 11th BeginPicture (single-slice P-frame, `frame_num=5 poc_lsb=20 slice_type=0`), exactly the iter1+iter2 carryover. Track A's fix is iter4 territory; the diagnostic rig and Y2 instrumentation are now in place to make iter4's debug loop short.
> **Joint success:** Both above, on the same patched binary, in the same operator session, with anchored evidence.
PARTIAL — F locked, A surfaced under controlled rig with rich diagnostics. iter3 closes at "F+A in parallel, F achieved, A diagnosed-but-deferred." Honest accounting.