libva-multiplanar/phase8_iteration5_close.md
claude-noether d2d9107e62 iter5 amendment: extend Firefox sandbox patch to UtilitySandboxPolicy
Real-world YouTube avc1 playback on the iter5-G binary surfaced a
seccomp violation (`syscall 29`, `0x80047C05` = `MEDIA_IOC_REQUEST_ALLOC`)
that the autonomous Phase 7G test missed because seccomp returns
ENOSYS silently and Firefox falls back to SW decode.

Two distinct gaps:
- patch-sync drift: campaign 113-line patch (broker+RDD-seccomp) had
  drifted from container 84-line patch (broker only); iter5-G shipped
  with the broker fix but no RDD seccomp fix.
- coverage gap: FF150 routes VAAPI to the Utility process; iter3's
  RDD-only seccomp allowlist never covered Utility.

Combined patch now hits three gates across two files (six hunks):
- broker: cap-filter widen + AddV4l2RequestApiDependencies + RDD wire-in
- RDD seccomp: kMediaType allow alongside existing kVideoType
- Utility seccomp: new __NR_ioctl override mirroring RDD's allowlist

Build: incremental `makepkg -e` on existing iter5-G object tree took
2:22 wall vs the 2h27m from-scratch alternative.

phase8_iteration5_close.md: appended amendment section with verdict-
gap analysis, patch breakdown, deploy-pending status.

firefox-fourier/README.md: rewrote "The problem" from 2 gates to 3
(broker + RDD seccomp + Utility seccomp); patch summary now explains
the six hunks.

Pending: pkg deploy to ohm + lsof /dev/video1 verification once
network route to ohm is restored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:20:30 +00:00


Iteration 5 close (Phase 8) — A+G+B+E all GREEN

Opened 2026-05-05 just after iter4 close, closing same day. Locked candidates: A (DEBUG instrumentation sweep), G (PGO-disabled Firefox-fourier rebuild), B (mpv libplacebo --vo=gpu segfault), E (multi-context libva safety).

All four tracks closed GREEN with one named caveat carried to iter6 (cap_pool resolution-change race latent under untested consumer probe patterns — Phase 5 sonnet C4 finding).

Verdict per track

Track A: GREEN — DEBUG sweep landed in two passes

First pass (commits 848fc0c, 39498f0, 951233a, d3a299b, 843febc): removed iter1 patch-0010/0011/0014 + iter3 Y2 v1 + iter4 Y2 v3 + iter4 DPB census + iter4 per-control TRY iso. Per-frame v4l2-request log noise dropped from ~30+ lines/frame to ~9 init-time lines.

Second pass (commit c8b6ede, after Phase 5 sonnet C1+C2): removed three additional surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) that the first pass missed because the vaapi-copy + --vo=null stress test didn't exercise the ExportSurfaceHandle path. Also removed h264.c's "3F observability" V4L2 readback block, which contained a static bool readback_warned (new mutable process-global state introduced post-Track-E — inconsistent with Track E's intent, also resolved by the block removal).

Net: ~340 lines of instrumentation removed across 6 commits. Verified clean: 2000-frame mpv vaapi-copy stress on the post-cleanup driver shows 0 EINVAL, 1 v4l2-request log line, 3 KB log (down from 9 lines / 4.4 KB after first pass).
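A minimal Python sketch of how the post-sweep log check could be scripted. It assumes driver lines carry the literal tag `v4l2-request` and that errors appear as symbolic errno names (a log that prints strerror text instead would need different patterns). The function name `summarize_driver_log` and its result keys are hypothetical, not part of the campaign tooling.

```python
import re
from collections import Counter

# Errno names tracked in this campaign's verdicts (assumed to appear verbatim in logs).
ERRNO_PATTERN = re.compile(r"\b(EINVAL|EBUSY|ENETDOWN|EACCES)\b")


def summarize_driver_log(text, tag="v4l2-request"):
    """Count tagged driver lines and errno mentions in captured log text."""
    lines = text.splitlines()
    errno_hits = Counter()
    for line in lines:
        for name in ERRNO_PATTERN.findall(line):
            errno_hits[name] += 1
    return {
        "total_lines": len(lines),
        "tagged_lines": sum(1 for line in lines if tag in line),
        "size_bytes": len(text.encode("utf-8")),
        "errno": dict(errno_hits),
    }
```

A clean run in the sense used above would report `tagged_lines == 1` and no `EINVAL` key in `errno`.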

KEPT (justified):

  • POC sentinel strip (h264_strip_ffmpeg_poc_sentinel) — load-bearing for ffmpeg-vaapi consumers
  • slice_header bit-precise parser — load-bearing for hantro hw decode (DECODE_PARAMS bit_size fields drive MMIO writes)
  • EACCES suppression in v4l2_get_controls — silences per-frame iter1-known-good error noise
  • "slice_header parse FAILED" log — fires only on decode-blocking errors, not per-frame noise

Track E: GREEN — multi-context libva safety

Commit b993355: LAST_OUTPUT_WIDTH/HEIGHT moved from process-global static in surface.c to struct request_data.last_output_width/height. The V4L2 device fd is per-driver_data, so this is the correct binding unit (one fd, one current OUTPUT format).

surface_reset_format_cache() signature changed to take struct request_data *driver_data; one callsite in context.c updated.

Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other statics (formats[], formats_count in video.c) are constant lookup tables — no race.

Verified: two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Re-verified post-cleanup on driver 4bed52ec5d44b389... — both clean.

Limit: same-instant co-launch hits kernel-level fd contention on /dev/video1 (hantro is a single-instance device). Cross-process serialization is out of scope for a libva backend.
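The staggered two-process run could be scripted roughly as follows. This is a sketch, not the harness used in Phase 7; `run_staggered` and its parameters are hypothetical.

```python
import subprocess
import time


def run_staggered(commands, stagger_s=2.0, timeout_s=120.0):
    """Start each command stagger_s seconds after the previous one.

    Returns the exit codes in launch order once every process has finished.
    """
    procs = []
    for index, cmd in enumerate(commands):
        if index > 0:
            time.sleep(stagger_s)  # delay between launches, not before the first
        procs.append(
            subprocess.Popen(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        )
    return [proc.wait(timeout=timeout_s) for proc in procs]
```

For the Track E case each command would be an mpv invocation, for example `["mpv", "--hwdec=vaapi-copy", "--vo=null", "--frames=300", "<fixture>"]` (the exact flags of the original test are an assumption here), with `stagger_s=2.0`. Passing `stagger_s=0` would be the way to probe the same-instant co-launch limit.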

Track B: GREEN — mpv --vo=gpu doesn't segfault

35s mpv --hwdec=vaapi --vo=gpu on the iter5-end driver: stream pos 31s, 29 frames dropped, 0 segfaults. Vulkan init still fails (VK_ERROR_INITIALIZATION_FAILED — steady state on Mali-G52 / Bifrost per reference_pinetab_no_vulkan.md); mpv falls through to GLES via Panfrost gracefully.

Phase 5 sonnet C4 reframed the original "implicit fix" claim: the cap_pool REQBUFS-EBUSY race window remains latent under untested consumer probe patterns. The 35s mpv test sees 5 EBUSY events at init time; mpv falls back to SW once, then continues. The race is documented as an iter6+ candidate (the genuine fix is to drain cap_pool before issuing REQBUFS in CreateSurfaces2, ~30 lines).

Track G: GREEN — PGO-disabled Firefox-fourier 150.0.1-1.1

PKGBUILD overlay edited to replace 3-tier PGO sequence with single-pass optimized build. Single-pass build on boltzmann LXD container: ~2h27m (vs iter3's 2h+ that died at PGO collect step — comparable wall time).

Result:

  • pkg: firefox-150.0.1-1.1-aarch64.pkg.tar.xz, 68.7 MB (sha256 aa94c7290ee7be76...)
  • libxul.so: 169 MB stripped (21× smaller than iter3's 3.6 GB PGO-instrumented)

Installed via pacman -U on ohm replacing stock firefox 150.0.1-1.

Phase 7G test (35s autonomous run, no MOZ_DISABLE_RDD_SANDBOX=1):

  • ENETDOWN: 0 (iter3 sandbox patch holds in release build)
  • EINVAL: 0 (iter4 frame-11 fix holds)
  • RDD ProcessDecode events: 538
  • Stream mTime reached: 22.3s in 35s wall = 0.64× realtime, ~2.7× speedup over PGO-instrumented binary

What landed

Fork commits (libva-v4l2-request-fourier)

iter5 sweep + multi-context fix:

  • 848fc0c — remove iter3+iter4 Y2 instrumentation from v4l2.c (-54)
  • 39498f0 — remove iter4 DPB census from h264.c (-31)
  • 951233a — remove iter1 ENTER traces (4 files, -17 across 13 sites)
  • d3a299b — remove iter1 patch-0010 hex-dumps + patch-0011 sentinel (-81)
  • 843febc — remove iter1 slice_header / VAPicture dumps + Sync RETURN trace, suppress EACCES per-frame log (-49)
  • b993355 — Track E: LAST_OUTPUT_* per-driver_data
  • c8b6ede — Phase 5 follow-up: 3 surface.c debug sites + h264.c readback block (-107)

Net: ~339 lines removed, ~52 lines added (Track E plumbing). Driver source builds clean and per-frame log noise is essentially zero (1 line per 2000-frame run).

Campaign artifacts (libva-multiplanar)

  • phase0_findings_iter5.md — substrate (8 candidates, locked A+G+B+E)
  • phase4_iter5_plan.md — Phase 4 plan + execution + Phase 5 caveat resolutions + Phase 7 anchored evidence
  • phase8_iteration5_close.md — this file
  • ~/src/panvk-bifrost/README.md — chartered as separate top-level future campaign (sequenced after fourier-fresnel)

Build infrastructure

  • firefox-fourier LXD container on boltzmann remains persistent. The PKGBUILD now has the iter5 PGO-disabled edit applied (the source extracted under src/firefox-150.0.1/ is the iter4 state with iter3 patches; iter5 reused that). Future Firefox rebuilds can run cd src/firefox-150.0.1 && ./mach build for an incremental build.

State that carries to iter6 (or campaign close)

  • Hardware: ohm RK3568 hantro G1/G2, kernel 6.19.10. Access: ohm (LAN; ohm.vpn also works).
  • Userspace: firefox 150.0.1-1.1 (iter5 PGO-disabled fourier rebuild), libva 2.23.0, mesa 26.0.5, libdrm 2.4.131, mpv 0.41.0-3.
  • Driver installed: /usr/lib/dri/v4l2_request_drv_video.so sha256 4bed52ec5d44b389... (iter5-end, post-cleanup).
  • Test fixture: bbb_1080p30_h264.mp4 sha256 dcf8a7170fbd....
  • Build container: firefox-fourier LXD on boltzmann, persistent.

Documented limitations carried to iter6+ (or campaign close)

  1. Cap_pool resolution-change race (Phase 5 sonnet C4). mpv's libplacebo Vulkan-fallback path triggers it; mpv recovers via SW fallback (no segfault), but the race exists. Fix: drain CAPTURE properly before issuing REQBUFS(0) on resolution change in CreateSurfaces2. ~30 lines.
  2. No pixel-correctness verification post-msync-removal (Phase 5 sonnet C3). Probably safe (kernel does DMA sync at DQBUF level on this CMA-backed config). A frame-hash spot check would confirm this formally.
  3. Vulkan unavailable on PineTab2 (see reference_pinetab_no_vulkan.md). Out of campaign scope; consumers fall through to GLES via Panfrost.
  4. Sub-second concurrent libva init still races on /dev/video1: the Track E test passed only with a 2s stagger. Cross-process serialization is out of scope for a libva backend.

Lessons distilled to memory

No new memory entries this iteration — the iter5 work was instrumentation cleanup + targeted multi-context fix, no new diagnostic patterns surfaced. Existing memory entries from iter3+iter4 cover the operative discoveries (kernel obfuscation, request_fd lifecycle, FFmpeg as authority, sandbox seccomp, ALARM-stale wasi, firefox-fourier container, follow-on campaigns).

The Phase 5 review caveat (sweep-completion verification needs to exercise EVERY consumer code path, not just the most common one) could become a feedback memory ("re-test post-sweep with each consumer pattern, not just one"), but it is covered implicitly by feedback_dev_process.md's Phase 7 verification discipline.

Bootlin upstream outlook

iter5 shifts the fork toward upstream-readiness. Per feedback_no_upstream.md, no PR/MR happens without explicit operator instruction, but the fork is now in a clean state:

  • Driver source builds with zero non-error request_log calls.
  • Process-global mutable state eliminated (LAST_OUTPUT_* moved to per-driver_data; readback_warned removed entirely).
  • Track A's frame-11 EINVAL fix from iter4 is in place (fresh request_fd per frame, DPB FFmpeg-semantics matching, B-slice L1 reflist .fields).
  • Track F's Firefox sandbox patch from iter3 is documented in campaign repo.
  • Track E's per-context state isolation is in.

Outstanding for upstream-readiness: cap_pool race fix (~30 lines for iter6), msync pixel-verification, possibly a multi-codec audit (MPEG-2 was iter1 lock's "next codec"; never opened).

Phase 1 success criterion — final per track

  • Track A: "Driver builds clean with zero request_log() calls in non-error paths, all iter1+iter3+iter4 DEBUG commits removed (or explicitly justified-and-kept), vaapi-copy + mpv smoke tests still green at 2000+ frames clean." ✓ HIT (2000 frames, 0 EINVAL, 1 log line).

  • Track G: "Firefox-fourier rebuilt without --enable-profile-generate=cross, redeployed to ohm. firefox --version reports Mozilla Firefox 150.0.1. Resulting libxul.so is materially smaller than the 3.6 GB instrumented build." ✓ HIT (169 MB libxul, 21× smaller).

  • Track E: "Two concurrent mpv processes on different bbb fixtures decode independently with no cross-context state corruption." ✓ HIT (with 2s stagger).

  • Track B: "≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with operator-actionable workaround." ✓ HIT (31s stream pos, 0 segfaults; mpv handles cap_pool race via SW fallback gracefully; cap_pool race documented as iter6+ candidate).

Joint success: all four tracks independently verifiable on the same iter5-end driver build (sha256 4bed52ec5d44b389...). Phase 7 verified each. Phase 5 sonnet review caveats addressed. iter5 closes GREEN.


Iter5 amendment (2026-05-05, same-day post-close) — Track F seccomp gap

Trigger. First real-world use of the iter5-G Firefox binary on ohm — playing a YouTube avc1 stream with LIBVA_DRIVER_NAME=v4l2_request env vars + sandbox enabled — emitted seccomp violations:

Sandbox: seccomp sandbox violation: pid <N>, tid <N>, syscall 29, ...

Decoded: 0x80047C05 = _IOR('|', 0x05, int) = MEDIA_IOC_REQUEST_ALLOC. Three distinct gaps surfaced:

Gap 1: Track G "GREEN" verdict was structural-only

The Phase 7G test measured "0 ENETDOWN, 0 EINVAL, 538 RDD ProcessDecode events, 22.3s mTime in 35s wall" on a --headless --screenshot autonomous run with synthetic playback — but never confirmed VAAPI bytes actually flowed through /dev/video1. The seccomp filter was returning ENOSYS to Firefox's media stack silently (SECCOMP_RET_ERRNO, not SIGSYS), so Firefox fell back to SW decode and the autonomous test couldn't tell the difference.

Gap 2: Patch-sync drift between campaign repo and container

Campaign repo's firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch (113 lines, broker + RDD seccomp) had drifted from the container's /build/aur/firefox-fourier/0005-rdd-allow-stateless-v4l2-request-api.patch (84 lines, broker only). The iter5-G build used the stale 84-line patch — the iter3 RDD seccomp fix never made it into the iter5-G binary at all.
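A drift check of this kind is cheap to automate before a build. The sketch below compares two patch files by line count and SHA-256. The helper names are hypothetical, and it assumes both files are readable from one machine (for the container copy that would mean pulling it out first).

```python
import hashlib
from pathlib import Path


def patch_fingerprint(path):
    """Return the line count and SHA-256 hex digest of a patch file."""
    data = Path(path).read_bytes()
    return {
        "lines": len(data.splitlines()),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def patches_in_sync(campaign_patch, container_patch):
    """Compare two patch files; returns (in_sync, campaign_info, container_info)."""
    campaign = patch_fingerprint(campaign_patch)
    container = patch_fingerprint(container_patch)
    return campaign["sha256"] == container["sha256"], campaign, container
```

With the two files described above, the first element would have been False and the line counts would have read 113 versus 84.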

Gap 3: VAAPI work moved from RDD to Utility process

Even with the RDD seccomp restored, FF150 routes VAAPI decode through the Utility process, not RDD. UtilitySandboxPolicy::EvaluateSyscall falls through to SandboxPolicyCommon for __NR_ioctl, which doesn't allow '|' (or 'V', or 'd', or 'b'). The iter3 patch needed extending to mirror the RDD ioctl allowlist into Utility.
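The request number from the violation can be unpacked to show the type byte that an ioctl allowlist keyed on type would need to admit. A small sketch, assuming the asm-generic ioctl bit layout (nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31) as used on aarch64; `decode_ioctl` is a hypothetical helper name.

```python
# Field widths from the asm-generic ioctl encoding (assumed; some
# architectures such as powerpc and mips use a different layout).
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS = 8, 8, 14
_IOC_TYPESHIFT = _IOC_NRBITS                      # 8
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS   # 16
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS    # 30

# Direction bits: 0 = none, 1 = write, 2 = read, 3 = read and write.
_DIR_NAMES = {0: "_IO", 1: "_IOW", 2: "_IOR", 3: "_IOWR"}


def decode_ioctl(request):
    """Split an ioctl request number into direction, size, type and nr."""
    return {
        "dir": _DIR_NAMES[(request >> _IOC_DIRSHIFT) & 0x3],
        "size": (request >> _IOC_SIZESHIFT) & ((1 << _IOC_SIZEBITS) - 1),
        "type": chr((request >> _IOC_TYPESHIFT) & 0xFF),
        "nr": request & 0xFF,
    }


print(decode_ioctl(0x80047C05))
# {'dir': '_IOR', 'size': 4, 'type': '|', 'nr': 5}
```

The `'|'` type with nr 5 and a 4-byte argument matches `_IOR('|', 0x05, int)`, which is why the media-controller type has to be allowed alongside the video type.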

Amendment patch (combined, 160 lines)

firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch regenerated as combined broker + RDD-seccomp + Utility-seccomp:

  • Broker: 3 hunks (cap-filter widen + AddV4l2RequestApiDependencies + RDD wire-in) — unchanged from prior 84-line container patch.
  • RDD seccomp: 2 hunks (kMediaType constant + .ElseIf allow) — restored from campaign 113-line patch.
  • Utility seccomp: 1 hunk (entire case __NR_ioctl: override mirroring RDD's allowlist) — new in iter5 amendment.

Authored as claude-noether <claude@reauktion.de>.

Build approach — incremental, ~2:22 wall

Per operator's standard flow: edit src/ in place, makepkg -e --skippgpcheck. Container src tree from iter5-G already had broker hunks applied; only the seccomp hunks needed manual application. Then makepkg -e skipped extract+prepare and re-ran build/package on existing object cache.

Result:

  • pkg: firefox-150.0.1-1.1-aarch64.pkg.tar.xz (65.5 MB, sha256 675bbc7dd0a187ee8baefef5c3b36d15c64919cd80edf725e29c23f2675ed4a8)
  • Build wall time: 19:01 → 19:03:22 = ~2:22 (vs from-scratch ~2h27m for iter5-G)
  • Saved: ~2h25m by reusing existing ccache + object tree

Backed up prior pkg as firefox-150.0.1-1.1-aarch64.pkg.tar.xz.iter5g.

Verification (pending — deploy to ohm)

Definitive HW-decode test (couldn't run pre-amendment because seccomp blocked the path silently):

LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
firefox <video-url> &
sleep 8
sudo lsof /dev/video1 /dev/media0
# expect: a Firefox-spawned Utility process holding both nodes

If lsof shows the Utility process on /dev/video1 during avc1 playback → Track F closes properly. If empty → another sandbox layer to chase.
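Where lsof is not installed, the same ownership check can be approximated by walking /proc/<pid>/fd. This is a sketch with a hypothetical `find_device_holders` helper; as with sudo lsof, it may need root to see the file descriptors of sandboxed processes.

```python
import os


def find_device_holders(device_paths, proc_root="/proc"):
    """Map pid -> sorted device paths for processes holding an open fd on them."""
    wanted = set(device_paths)
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or access denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed between listing and readlink
            if target in wanted:
                holders.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in holders.items()}
```

Calling `find_device_holders(["/dev/video1", "/dev/media0"])` during playback and finding a Firefox child pid mapped to both nodes would correspond to the passing lsof outcome; an empty dict corresponds to the "another sandbox layer" case.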

Lessons captured

The Track G "GREEN" verdict was right by its own success criterion ("libxul materially smaller, firefox --version works") but the criterion didn't include "actually does HW decode in production use." Lesson for Phase 1 lock: criteria for sandbox/release-build tracks should include end-user playback verification with lsof proof of device handle ownership, not just structural metrics.

Adding to memory: feedback_seccomp_silent_enosys.md already covers the silent-ENOSYS pattern; this episode reinforces it. The patch-sync-drift between campaign repo and container is a process bug — fixed by syncing the combined patch back to container in this amendment, but the underlying mismatch (PKGBUILD numbers patches 0005-... while campaign uses 0001-...) remains. Tolerable for a personal-machine campaign; would need normalizing for upstream.

Status post-amendment

| Element | State |
| --- | --- |
| Combined 160-line patch | Authored, in campaign repo + container 0005-... |
| Container src/ tree | Broker (from iter5-G prepare) + seccomp (manually patched in iter5 amend) |
| Built pkg | 675bbc7d… on boltzmann:/tmp/, awaiting deploy to ohm |
| ohm deploy | Pending operator-side scp (vpn route closed at amendment author time) |
| HW-decode lsof verification | Pending |
| Campaign git commit | Pending |