Iteration 3 close: F GREEN, A reproduced + diagnosed for iter4

Phase 1 locked F (Firefox RDD sandbox verify-by-patch) and A (frame-11
EINVAL diagnose) running in parallel on a single firefox-fourier build.

Track F: GREEN. Patched Firefox 150.0.1 (firefox-fourier, pkgrel=1.1)
launches on ohm WITHOUT MOZ_DISABLE_RDD_SANDBOX=1 and engages our
libva-v4l2-request backend end-to-end. Three patches needed (Phase 2
identified one and deferred two):
- Broker policy (SandboxBrokerPolicyFactory.cpp): allow /dev/media*,
extend cap-filter to admit stateless decoders that lack M2M caps.
- Seccomp policy (SandboxFilter.cpp): allow ioctl magic byte '|'
for <linux/media.h> request-API ioctls.
- Driver (media.c): replace select() with poll() — Mozilla's RDD
seccomp common policy admits poll/ppoll/epoll_* but not
select/pselect6. Driver-side fix preferred; smaller surface,
portable across sandbox policies, and poll() is the modern API.
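A minimal sketch of the driver-side change, not the actual media.c patch: it assumes the driver waits on the media request fd for completion, which the request API reports as POLLPRI. The helper name `wait_fd_ready` and its return convention are illustrative.

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait until `fd` reports one of `events`, or until `timeout_ms` elapses.
 * Returns 1 when ready, 0 on timeout, -1 on error (errno is set).
 * Only poll() is used, so a seccomp policy that admits poll/ppoll but
 * not select/pselect6 does not trap this call.
 */
static int wait_fd_ready(int fd, short events, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
	int rc;

	/* Retry when a signal interrupts the wait. */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc; /* 0 = timeout, -1 = poll() failure */

	if (pfd.revents & events)
		return 1;

	/* Woken only by POLLERR/POLLHUP/POLLNVAL: report as an I/O error. */
	errno = EIO;
	return -1;
}
```

For a request fd the call would look like `wait_fd_ready(request_fd, POLLPRI, timeout_ms)`. Unlike select(), poll() also has no FD_SETSIZE limit on the descriptor value.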
Track A: REPRODUCES + DIAGNOSED. Frame-11 EINVAL fires deterministically
on a single-slice P-frame (slice_type=0, frame_num=5, post-IDR) — the
exact iter1/iter2 carryover signature, confirming it isn't environmental.
Y2 instrumentation (in v4l2_ioctl_controls) now logs num_controls /
error_idx / per-control id+size on EINVAL. Sizes match kernel UAPI;
error_idx == num_controls is the kernel's "all bad / no specific control"
sentinel — it's a request-level rejection, not a single-field violation.
Fix is iter4's lock; rig + Y2 in place for fast iter4 turnaround.
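The error_idx reading above can be written down as a small classifier. This is a sketch, not the Y2 instrumentation itself: the enum and function names are hypothetical, and it relies only on the V4L2 rule that error_idx equal to the control count means the failure is not tied to one control.

```c
#include <stdio.h>

enum ctrl_failure {
	CTRL_FAILURE_REQUEST_LEVEL,  /* error_idx == count: no single control blamed */
	CTRL_FAILURE_SINGLE_CONTROL  /* error_idx < count: controls[error_idx] is at fault */
};

/* Interpret error_idx as reported after a failed VIDIOC_S_EXT_CTRLS. */
static enum ctrl_failure classify_ctrl_failure(unsigned int count,
					       unsigned int error_idx)
{
	return error_idx >= count ? CTRL_FAILURE_REQUEST_LEVEL
				  : CTRL_FAILURE_SINGLE_CONTROL;
}

/* Format a one-line diagnosis into buf; returns snprintf()'s result. */
static int describe_ctrl_failure(char *buf, size_t len,
				 unsigned int count, unsigned int error_idx)
{
	if (classify_ctrl_failure(count, error_idx) == CTRL_FAILURE_REQUEST_LEVEL)
		return snprintf(buf, len,
				"EINVAL: request-level rejection (error_idx=%u, num_controls=%u)",
				error_idx, count);

	return snprintf(buf, len, "EINVAL: control #%u of %u rejected",
			error_idx, count);
}
```

With the frame-11 signature (error_idx == num_controls) this takes the request-level branch, which is why per-control size checks alone cannot locate the fault.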
Build infrastructure introduced: firefox-fourier LXD container on
boltzmann (RK3588 aarch64, persistent, ssh -J boltzmann
builder@firefox-fourier). Upstream Arch x86_64 wasi packages installed
to work around 4-year-stale ALARM versions. PGO generation crashes at
exit (LXC has no display); obj/dist/ tarball used as the deployable
artifact instead of the pacman package.

Phase 6 surprises captured in phase6_iter3_findings.md: malformed
first-cut patch (descriptive vs numeric hunk headers), --enable-v4l2
isn't a Mozilla 150 flag (auto-set on aarch64+GTK), Mozilla 2025 PGP
key rotation, ALARM-stale wasi, onnxruntime missing in ALARM, and the
"no tricks" lesson (revert workarounds first when redirected).

Carries to iter4 substrate: Track A fix is the natural lock; mpv
libplacebo --vo=gpu segfault stays as separate iter4 candidate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+41
-14
@@ -119,28 +119,55 @@ Likely needed for specific iter3 candidates:
- For E (DMABUF): `gbm_bo_create` userspace allocation test program; `VIDIOC_QBUF` with `memory=V4L2_MEMORY_DMABUF` exploratory path
- For F (sandbox): meitner / clevo access; Firefox source `security/sandbox/linux/SandboxFilter.cpp`
## In-scope (LOCKING DEFERRED — Phase 1 user input)
## In-scope (LOCKED 2026-05-04 for iteration 3) — F + A in parallel
To be locked at Phase 1 from candidates A..G above. Recommended pairing or solo flagged per candidate.

**Track F (sandbox hypothesis verify-by-patch).** Build `firefox-fourier`: a Firefox 150.0.1 fork with the RDD-sandbox patch from candidate F (allow `/dev/media0`, extend `AddV4l2Dependencies()` cap filter to admit stateless V4L2 nodes, verify `MEDIA_REQUEST_IOC_QUEUE` ioctl passes seccomp). Run on ohm without `MOZ_DISABLE_RDD_SANDBOX=1`. Stronger test of the hypothesis than Sonnet's static-analysis verdict: it empirically separates "sandbox is the env-var requirement's cause" from any other gating factor.

**Track A (frame-11 EINVAL).** With sandbox now controlled (Track F's patched binary), the frame-11 EINVAL still recurs (clean-rig isolation). Identify which V4L2 control returns EINVAL on the 11th decoded frame in Firefox; suspect surface narrowed by Sonnet review to per-request DECODE_PARAMS / SCALING_MATRIX / SPS / PPS for non-IDR slices (7.5) or `num_ref_idx_l0/l1` mismatch in multi-slice frames (7.2). First concrete step: read `hantro_g1_h264_dec.c` for control validation rules; run patched Firefox under `MOZ_LOG=PlatformDecoderModule:5` + driver request_log to capture the failing control set.

**Why parallel rather than sequential:** Track F's verification rig (patched Firefox on ohm, running bbb_1080p30 without sandbox bypass) IS the rig that surfaces Track A's signature. Running them in one binding cell is the natural shape; splitting to two iterations would require setting up the same rig twice.
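Track F's seccomp question comes down to the type ("magic") byte of the ioctl request number. As an illustration only (this is not Mozilla's `SandboxFilter.cpp` code, and `has_media_magic` is a made-up helper), the byte can be read with the kernel's `_IOC_TYPE` macro; the media-controller ioctls in `<linux/media.h>`, including the request ones, use `'|'`:

```c
#include <linux/ioctl.h>
#include <linux/media.h>

/*
 * Return non-zero when `request` carries the media-controller magic
 * byte '|', i.e. the class of ioctls a seccomp rule keyed on that
 * byte would admit. Hypothetical helper for illustration.
 */
static int has_media_magic(unsigned long request)
{
	return _IOC_TYPE(request) == '|';
}
```

Both `MEDIA_IOC_REQUEST_ALLOC` (issued on `/dev/media0`) and `MEDIA_REQUEST_IOC_QUEUE` (issued on the request fd) satisfy this check, whereas V4L2 ioctls use magic `'V'` and do not.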
### Build host plan (Phase 4 input prereq)
Build venue: **boltzmann LXD container** (RK3588 aarch64, 8 cores, 30 GB RAM, NVMe, always-on). Native arm64 build avoids cross-compile. **AUR/PKGBUILD-based overlay** preferred over raw mozilla-central checkout: Arch's firefox PKGBUILD already has a working aarch64 mozconfig and dep set; we layer our sandbox patch as an additional `source=()` patch in `prepare()`. On rebuilds use `makepkg -e` to skip re-extraction and re-patching.

Fallback if rust-on-aarch64 toolchain proves unworkable in the container: power up `data` (x86_64 box), prevent its sleep timer, set up cross-compile toolchain to aarch64. AUR rebuild semantics (`makepkg -e`) carry over.
## Out-of-scope finding surfaced 2026-05-05 (carry to iter4)
**mpv libplacebo segfault on `--vo=gpu` post-reboot.** Operator-side reproduction with `LIBVA_DRIVER_NAME=v4l2_request mpv --hwdec=vaapi --vo=gpu --no-audio bbb_1080p30_h264.mp4` after host reboot hit a NEW failure pattern (not the iter2-close "smooth" verdict, not the Track A frame-11 EINVAL):
- Vulkan init fails: `[vo/gpu/libplacebo] EnumeratePhysicalDevices ... VK_ERROR_INITIALIZATION_FAILED` (line 4 of trace)
- 4 frames decode cleanly (surfaces 67108864–67108867 sync to real luma data, var=4 on the I-frame)
- After surf 67108868's BeginPicture: two `Unable to request buffers: Device or resource busy` (EBUSY on REQBUFS)
- Then a bizarre `CreateSurfaces2: surf_width=16 surf_height=16 fmt_width=48 fmt_height=48 sizes[1]=1050626 (=0x100802, looks uninitialized)`
- Segfault

Hypothesis: the Vulkan-init-failed code path triggers a resolution probe in libplacebo/mpv that calls `vaCreateSurfaces` with downscale-probe dimensions while CAPTURE buffers are still queued. The cap_pool resolution-change path drains and calls REQBUFS but doesn't fully flush the queued CAPTURE buffers, so the kernel returns EBUSY, the driver pushes ahead with a garbage `sizes[1]`, and the subsequent mmap or pool init crashes.

iter3 disposition: **option 3 selected** (verify-via-Firefox first, defer libplacebo segfault to iter4). Firefox doesn't go through the libplacebo probe paths, so F+A's verification can proceed on patched-Firefox even with mpv broken on the vulkan-fallback path. If `firefox-fourier` works on ohm despite this regression, the lock for iter4 becomes:
- **Track libplacebo:** harden cap_pool resolution-change to drain CAPTURE before REQBUFS; reject `vaCreateSurfaces` with sentinel-shaped `sizes[]`; investigate the Vulkan init failure (could be a Mesa update, a kernel reboot reshuffling GPU state, or a genuine Mesa/libplacebo regression).

Or, if the mpv segfault ALSO afflicts firefox-fourier (e.g. the same resolution-probe path is shared at a lower libva layer), iter3 expands or yields back at Phase 7. We learn that empirically.
## Out-of-scope (LOCKED 2026-05-04 for iteration 3)
- Candidates B, C, D, E, G — deferred to a later iteration. B (DEBUG sweep) is the most natural candidate for iter4 since it's an upstream prereq.
- New codecs (MPEG-2, VP8, VP9, AV1, HEVC) — H.264-only scope holds from iter1+iter2.
- New hardware (fresnel RK3399, ampere/boltzmann RK3588) — separate iteration after ohm path is hardened.
- Bootlin upstreaming PR — `feedback_no_upstream.md` holds; no PRs unless explicitly tasked. iter3 might produce the prerequisites (DEBUG sweep, HACK refactor, perf data) for an eventual upstream.
- New target hardware on the libva side (fresnel RK3399, ampere RK3588) — separate iteration after ohm path is hardened. Note: boltzmann (RK3588) is recruited only as a Firefox build host this iteration, NOT as a libva target.
- Bootlin upstreaming PR — `feedback_no_upstream.md` holds; no PRs unless explicitly tasked.
- Mozilla Bugzilla bug-file. Substituted by verify-by-patch; if the patched binary works, the bug filing becomes a follow-up upstream contribution, not part of iter3's Phase 1 success criterion.
- HEVC re-introduction (stripped in fourier port; no hantro G2 HEVC validation in operator's test corpus).
## Phase 1 success criterion (will lock after user picks candidate)
## Phase 1 success criterion (LOCKED 2026-05-04)
Pre-lock template:
- For candidate A: "Firefox 150 plays bbb_1080p30 for ≥30s through HW decode without `Unable to set control(s)` EINVAL emerging in driver stderr."
- For candidate B: "Driver source builds clean with zero `request_log()` calls in non-error paths and zero patch-0011 sentinel writes; vaapi-copy + vaapi smoke tests still green."
- For candidate C: "Anchored perf table for {mpv vaapi DMA-BUF, mpv vaapi-copy, Firefox HW, SW baseline} across drop count + CPU% + frame timing on bbb_1080p30; reproducible from operator instructions documented in iter3 substrate."
- For candidate D: "Two concurrent libva contexts on the same V4L2 device decode independently without cross-context state corruption."
- For candidate E: "vaapi-copy + vaapi --vo=gpu still produce real frames with `V4L2_MEMORY_DMABUF`-backed CAPTURE buffers; race window mathematically eliminated (no kernel can write to a buffer the consumer holds — userspace owns the dma-buf)."
- For candidate F: "Decision documented (with Mozilla bug filed OR `MOZ_DISABLE_RDD_SANDBOX=1` permanently in README); cross-verified on Intel/NVIDIA test box."
- For candidate G: per Sonnet 7.x sub-item.

**Track F:** Patched `firefox-fourier` (firefox-150.0.1 + RDD-sandbox patch) launched on ohm WITHOUT `MOZ_DISABLE_RDD_SANDBOX=1` engages our libva-v4l2-request backend, opens `/dev/video1` + `/dev/media0` from RDD process, and decodes ≥10 frames of bbb_1080p30 through hantro. (10 frames is the iter2-observed floor before the EINVAL hits; past 10 is Track A's domain.)

**Track A:** Same patched-binary rig decodes ≥30s of bbb_1080p30 without `Unable to set control(s): Invalid argument` emerging in driver stderr. Where this requires changes, the change lives in libva-v4l2-request-fourier (per-request control set construction), not in firefox-fourier.

**Joint success:** Both above, on the same patched binary, in the same operator session, with anchored evidence (driver stderr capture, Firefox MOZ_LOG capture, dmesg capture, operator visual confirmation of decode output on screen).
## Stop point
**Phase 1 lock requires user input**: pick from A..G (and any pairing). After lock, iter3 phases 2..8 proceed autonomously per "Stop only if user is needed."

Phase 1 LOCKED. iter3 proceeds to Phase 2 (situation analysis: read Mozilla sandbox source on a local mirror for the two target functions), Phase 3 (baseline anchor: re-verify frame-11 EINVAL still reproduces on ohm with stock Firefox 150 + sandbox bypass, same picture as iter2 close), Phase 4 (write the sandbox patch + plan PKGBUILD overlay + lock container provisioning with his), Phase 5 (Sonnet review of patch), Phase 6 (build firefox-fourier in container, deploy to ohm), Phase 7 (verify F + A simultaneously), Phase 8 (iteration close). Stop only if user is needed (e.g. the patch produces a multi-way design choice, or the rust-aarch64 fallback to `data` is required).