Files
libva-multiplanar/phase8_iteration3_close.md
T
marfrit f91469abe3 Iteration 3 close — F GREEN, A reproduced + diagnosed for iter4
Phase 1 locked F (Firefox RDD sandbox verify-by-patch) and A (frame-11
EINVAL diagnose) running in parallel on a single firefox-fourier build.

Track F: GREEN. Patched Firefox 150.0.1 (firefox-fourier, pkgrel=1.1)
launches on ohm WITHOUT MOZ_DISABLE_RDD_SANDBOX=1 and engages our
libva-v4l2-request backend end-to-end. Three patches needed (Phase 2
identified one and deferred two):
  - Broker policy (SandboxBrokerPolicyFactory.cpp): allow /dev/media*,
    extend cap-filter to admit stateless decoders that lack M2M caps.
  - Seccomp policy (SandboxFilter.cpp): allow ioctl magic byte '|'
    for <linux/media.h> request-API ioctls.
  - Driver (media.c): replace select() with poll() — Mozilla's RDD
    seccomp common policy admits poll/ppoll/epoll_* but not
    select/pselect6. Driver-side fix preferred; smaller surface,
    portable across sandbox policies, and poll() is the modern API.

Track A: REPRODUCES + DIAGNOSED. Frame-11 EINVAL fires deterministically
on a single-slice P-frame (slice_type=0, frame_num=5, post-IDR) — the
exact iter1/iter2 carryover signature, confirming it isn't environmental.
Y2 instrumentation (in v4l2_ioctl_controls) now logs num_controls /
error_idx / per-control id+size on EINVAL. Sizes match kernel UAPI;
error_idx == num_controls is the kernel's "all bad / no specific control"
sentinel — it's a request-level rejection, not a single-field violation.
Fix is iter4's lock; rig + Y2 in place for fast iter4 turnaround.

Build infrastructure introduced: firefox-fourier LXD container on
boltzmann (RK3588 aarch64, persistent, ssh -J boltzmann
builder@firefox-fourier). Upstream Arch x86_64 wasi packages installed
to work around 4-year-stale ALARM versions. PGO generation crashes at
exit (LXC has no display); obj/dist/ tarball used as the deployable
artifact instead of the pacman package.

Phase 6 surprises captured in phase6_iter3_findings.md: malformed
first-cut patch (descriptive vs numeric hunk headers), --enable-v4l2
isn't a Mozilla 150 flag (auto-set on aarch64+GTK), Mozilla 2025 PGP
key rotation, ALARM-stale wasi, onnxruntime missing in ALARM, and the
"no tricks" lesson (revert workarounds first when redirected).

Carries to iter4 substrate: Track A fix is the natural lock; mpv
libplacebo --vo=gpu segfault stays as separate iter4 candidate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:56:34 +00:00

13 KiB

Iteration 3 close (Phase 8) — F+A locked, F GREEN, A reproduced + diagnosed

Opened 2026-05-04, closing 2026-05-05. Locked candidate: F (Firefox RDD sandbox verify-by-patch) + A (frame-11 EINVAL diagnose) running in parallel on a single firefox-fourier build.

Verdict per track

Track F: GREEN

Patched Firefox 150.0.1 (firefox-fourier, pkgrel=1.1) launched on ohm without MOZ_DISABLE_RDD_SANDBOX=1 engages our libva-v4l2-request backend, opens /dev/video1 + /dev/media0 from the sandboxed RDD process, and submits decode requests through MEDIA_REQUEST_IOC_* ioctls. ENETDOWN signature from iter2 is gone; libva fully initialized; decode reaches the same frame-10 mark as iter2's sandbox-bypass run — proving the patched-sandbox is functionally equivalent to the bypass for V4L2 stateless decode.

Three distinct gates needed patching to reach this state — Phase 2 had identified one (broker policy) and explicitly deferred the seccomp question to empirical Phase 7. Phase 7 surfaced two MORE gates beyond what Phase 2 anticipated:

  1. Broker policy (security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp):
    • AddV4l2Dependencies() cap-filter widened: admit (CAPTURE_MPLANE & OUTPUT_MPLANE & STREAMING) for stateless decoders that don't advertise M2M.
    • New AddV4l2RequestApiDependencies() enumerates /dev/media* as rdwr.
  2. Seccomp policy (security/sandbox/linux/SandboxFilter.cpp):
    • Add ioctl magic byte '|' (<linux/media.h> ioctls) to RDD's allowlist alongside existing 'V' (V4L2). Without this, MEDIA_REQUEST_IOC_NEW_REQUEST returned ENOSYS; libva couldn't allocate request fds.
  3. Driver-side (libva-v4l2-request-fourier/src/media.c):
    • media_request_wait_completion() migrated from select() to poll(). Mozilla's RDD seccomp common policy admits poll/ppoll/epoll_* but not select/pselect6. Without this, select() returned ENOSYS even after the broker + ioctl gates opened. Driver-side fix preferred over expanding Firefox seccomp — smaller surface, more portable across sandbox policies, and poll() is the modern API anyway.

The Phase 2 deferral ("if patched binary trips SIGSYS, extend SandboxFilter") was correctly defensive but missed that Mozilla's seccomp returns ENOSYS via SECCOMP_RET_ERRNO rather than SIGSYS — silent fall-through that we only caught by reading our driver's own log lines. Lesson distilled below.

Track A: REPRODUCED + DIAGNOSED, NOT FIXED

Frame-11 EINVAL fires deterministically on the patched-sandbox rig — exactly matching iter1/iter2's carryover signature, ruling out "rig-specific" alibis. Decode succeeds for 10 BeginPictures (luma var=0..4 confirms real NV12 output), then on the 11th set_controls call the kernel rejects with EINVAL.

Y2 instrumentation (v4l2_ioctl_controls extension, two iterations) now produces full diagnostic output on the failing call:

v4l2-request: S_EXT_CTRLS EINVAL: num_controls=4 error_idx=4
  ctrl[0]: id=0x00a40902 size=1048   # V4L2_CID_STATELESS_H264_SPS
  ctrl[1]: id=0x00a40903 size=12     # V4L2_CID_STATELESS_H264_PPS
  ctrl[2]: id=0x00a40907 size=560    # V4L2_CID_STATELESS_H264_DECODE_PARAMS
  ctrl[3]: id=0x00a40904 size=480    # V4L2_CID_STATELESS_H264_SCALING_MATRIX

error_idx == num_controls is the kernel's "all bad / no specific control identified" sentinel — request-level rejection, not a single-field violation. Sizes match kernel UAPI (v4l2_ctrl_h264_sps=1048, etc.) so this is NOT a struct-size mismatch.

The failing frame is a single-slice P-frame post-IDR: slice_type=0 frame_num=5 poc_lsb=20 flags=SHORT_TERM_REFERENCE. Sonnet review 7.5 ("mid-stream non-IDR") fits this signature better than 7.2 (multi-slice num_ref_idx) which doesn't apply to single-slice frames.

Phase 4 plan explicitly framed Track A's fix as Phase 7+ work informed by the rig: "No code fix in Phase 4. The fix requires knowing WHICH V4L2 control field returns EINVAL on frame 11." iter3 delivered the rig that makes that diagnosis reproducible. The next step — read hantro_g1_h264_dec.c::set_params() validation, diff against our DECODE_PARAMS / SLICE_PARAMS / SPS / PPS construction, narrow the failing field — is iter4's locked question.

What landed

libva-v4l2-request-fourier commits

  • media.c::media_request_wait_completion: replace select(except_fds) with poll(POLLPRI) for sandbox compatibility
  • v4l2.c::v4l2_ioctl_controls: Y2 instrumentation. On VIDIOC_S_EXT_CTRLS returning -EINVAL, log num_controls, error_idx, and per-control id+size. Pure diagnostic add-on; no behavior change. Should be removed at iter4's DEBUG sweep alongside iter1's instrumentation.

libva-multiplanar campaign artifacts

  • firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch — three-hunk Firefox patch (broker policy two hunks, seccomp policy one hunk). Applied via Arch PKGBUILD overlay in the boltzmann LXD container.
  • firefox-fourier/PKGBUILD-overlay.md — verified working PKGBUILD overlay strategy: pkgrel=1.1, arch=(x86_64 aarch64), our patch in source=() + prepare(), onnxruntime stripped, --skippgpcheck for Mozilla key rotation. No --enable-v4l2 (Mozilla 150 auto-enables on aarch64+GTK).
  • firefox-fourier/bootstrap.sh — reproducible bootstrap inside the LXD container.
  • phase2_iter3_situation.md — Mozilla sandbox source verbatim (broker policy + cap filter quoted).
  • phase3_iter3_baseline.md — pre-patch baseline anchored from iter2-close evidence (ohm offline at Phase 3 time).
  • phase4_iter3_plan.md — Phase 4 plan + Phase 5 review checklist.
  • phase5_iter3_review.md — sonnet review (Y1 patch idiom fix, Y2 driver error_idx instrumentation requirement, B-slice copy-paste finding kept for iter4).
  • phase6_iter3_findings.md — six build-side surprises (proper unified-diff, no --enable-v4l2, GPG rotation, ALARM-stale wasi cluster, onnxruntime gap, "no tricks" lesson).
  • phase8_iteration3_close.md — this file.

Build infrastructure introduced

  • firefox-fourier LXD container on boltzmann (RK3588 aarch64, 8 cores, 24 GB RAM, 787 GB free on /build NVMe). Provisioned by the his agent. Persistent (autostart=true). Useful for iter4 if Firefox rebuilds become necessary.
  • Upstream Arch x86_64 wasi packages (arch=any) cached at /build/aur/wasi/upstream-any/. ALARM extra is years stale on these — same fix pattern likely needed for any future ALARM container needing current wasi tooling.
  • Phase 7 evidence collector: /home/mfritsche/iter3_phase7_evidence.sh on ohm.vpn. Honors LOG= env override, prints per-track verdict.
  • Autonomous Phase 7 runner: /tmp/run_phase7_v2.sh on ohm.vpn. Discovers Plasma session env from a long-running user process, launches firefox-fourier, captures stderr, kills cleanly. Tmpfs-volatile.

State that carries to iter4

  • Hardware: ohm RK3568 hantro G1/G2, kernel 6.19.10. Userspace versions all unchanged (firefox 150.0.1, libva 2.23.0, mesa 26.0.5, libdrm 2.4.131).
  • Driver installed: /usr/lib/dri/v4l2_request_drv_video.so sha256 70a2bb1e16012a5d... (iter3 build with poll() fix + Y2 instrumentation).
  • Firefox installed: /opt/firefox-fourier/firefox (Mozilla Firefox 150.0.1, libxul.so 3.59 GB — PGO-instrumented stage-1 binary; functionally equivalent to release for our purposes; iter4 may want a clean PGO-disabled rebuild for performance).
  • Test fixture: /home/mfritsche/fourier-test/bbb_1080p30_h264.mp4 (sha256 dcf8a7170fbd...).
  • Access path to ohm: ohm.vpn (changed from ohm.fritz.box mid-iteration). Autonomous test rig works without operator intervention via Plasma session env discovery.
  • Build container: firefox-fourier LXD on boltzmann, accessed ssh -J boltzmann builder@firefox-fourier. Source still extracted at /build/aur/firefox-fourier/src/firefox-150.0.1/ with iter3 patches applied.

State that does NOT carry

  • The PGO instrumentation profile attempt always crashes at exit with LLVM Profile Error: Permission denied writes — irrelevant noise, will recur on every run of this binary.
  • /tmp/ff-fourier-stderr-v2.log is tmpfs-volatile. Anchor before reboot if needed; iter3's Phase 7 anchored evidence is in this campaign repo's commit history (script outputs were captured in the close).

Documented limitations carried into iteration 4 substrate

  • Track A unfixed. The frame-11 EINVAL is the natural iter4 lock. With the rig and Y2 in place, iter4 starts with a richer baseline than iter3 did.
  • Mpv libplacebo --vo=gpu regression (carried from iter3 substrate, never iter3-scope). Unable to request buffers: Device or resource busy followed by SEGV during a downscale-probe surface creation. Vulkan init fails on this Plasma session; Mesa/Mozilla update may have shifted the fallback path. iter4 candidate.
  • VAAPI consumer probe robustness (existing memory feedback_consumer_probe_calls.md) — ffmpeg's av_hwframe_ctx_init calls vaDeriveImage on never-decoded surfaces. Our cap_pool tolerates this post-iter2; iter4 work shouldn't regress.
  • PGO profile generation under sandbox. Phase 6 finding: --enable-profile-generate=cross PGO step needs an X11/Wayland display the LXC container can't provide. iter4 may want a clean PGO-disabled rebuild.

Lessons distilled to memory

  • feedback_no_tricks_revert_first.md (NEW) — when the user redirects on an in-flight workaround, the first action is to revert the workaround on disk, not continue diagnosing with the trick still active. iter3 lost ~1h to a stale background makepkg running against a python-edited PKGBUILD that had --without-wasm-sandboxed-libraries substituted in after the user said "no tricks." The his subagent caught and reverted it; the lesson is: do that proactively.
  • feedback_seccomp_returns_enosys.md (NEW) — Mozilla's RDD seccomp policy returns SECCOMP_RET_ERRNO with ENOSYS for filtered syscalls, not SIGSYS. Phase 2's deferral defaulted to "we'll see SIGSYS if seccomp blocks something" — that assumption was wrong. ENOSYS surfaces as Function not implemented strerror in driver logs, easy to miss. Pattern: any "not implemented" errno from a sandboxed process under Mozilla's filter, suspect seccomp first.
  • reference_alarm_stale_wasi.md (NEW) — ALARM (Arch Linux ARM) extra repo's wasi-* packages are 4 years stale (sdk-13 era). Mozilla 150 + clang 22 require sdk-33 wasm32-wasip1 toolchain. Fix: install upstream Arch x86_64 arch=any packages directly from geo.mirror.pkgbuild.com. Cached at /build/aur/wasi/upstream-any/ on boltzmann firefox-fourier container.
  • reference_firefox_fourier_container.md (NEW) — boltzmann LXD firefox-fourier container: builder@firefox-fourier via ssh -J boltzmann, /build is NVMe-backed bind-mount with 787 GB free, all Firefox build prereqs staged. Persistent across boltzmann reboots.

(Process memory feedback_replicate_baseline_first.md continues to apply; iter3's Phase 3 anchored from iter2-close evidence rather than re-acquiring with ohm offline, which was the right call when ohm was unreachable but the substrate state was unchanged within hours.)

Bootlin upstream outlook

iter3 produces a Firefox patch that's a candidate for upstream Mozilla submission (currently no Mozilla bug exists for /dev/media* + V4L2-stateless RDD sandbox per Phase 0 Sonnet research). The patch is ~50 lines across two files; reviewer concerns would center on /dev/media* rdwr enumeration on x86 desktop where media controllers can be ISP/webcam (not just codec). For ARM-embedded targets the patch is well-scoped. Per feedback_no_upstream.md, no PR/MR happens without explicit operator instruction.

Driver-side select() → poll() change is a portable improvement that benefits any sandbox model, not just Mozilla's. Also a candidate for bootlin upstream — but again, deferred per policy.

Phase 1 success criterion — final

Quoted from phase0_findings_iter3.md:

Track F: Patched firefox-fourier (firefox-150.0.1 + RDD-sandbox patch) launched on ohm WITHOUT MOZ_DISABLE_RDD_SANDBOX=1 engages our libva-v4l2-request backend, opens /dev/video1 + /dev/media0 from RDD process, and decodes ≥10 frames of bbb_1080p30 through hantro.

✓ HIT. ENETDOWN=0, cap_pool_init=1, BeginPicture=10, SyncSurface=42 (consumer probe overhead), EINVAL=0 in the first 10 frames.

Track A: Same patched-binary rig decodes ≥30s of bbb_1080p30 without Unable to set control(s): Invalid argument emerging in driver stderr.

✗ NOT HIT. EINVAL fires on the 11th BeginPicture (single-slice P-frame, frame_num=5 poc_lsb=20 slice_type=0), exactly the iter1+iter2 carryover. Track A's fix is iter4 territory; the diagnostic rig and Y2 instrumentation are now in place to make iter4's debug loop short.

Joint success: Both above, on the same patched binary, in the same operator session, with anchored evidence.

PARTIAL — F locked, A surfaced under controlled rig with rich diagnostics. iter3 closes at "F+A in parallel, F achieved, A diagnosed-but-deferred." Honest accounting.