0d04ae4aad
Patched libmozsandbox.so (sha 4e6c7d58…, +1824 bytes vs iter5-G) deployed to ohm via temporary HTTP server on boltzmann:18080 (vpn route was closed; ohm has lan route to boltzmann.fritz.box). Sandbox gate is fully open. With LIBVA_DRIVER_NAME=v4l2_request + sandbox enabled + bbb_1080p30_h264.mp4, Firefox 150 emits: v4l2-request: cap_pool_init: 24 slots ready v4l2-request: Unable to queue buffer: Invalid argument vs pre-amendment which had the seccomp violation + "Unable to allocate media request: Function not implemented" between those two lines. Track F closes GREEN. YouTube test (watch?v=7DAPd5MGodY) didn't engage v4l2_request because YT negotiated VP9/AV1 with FF150 (no h264ify); v4l2_request only handles H.264. Codec-negotiation issue, not sandbox issue. New iter6 candidate I documented: Firefox VIDIOC_QBUF EINVAL on first frame is consumer-specific (mpv-vaapi-copy clean at 2000 frames on same driver). Diagnostic plan and repro included. phase8_iteration5_close.md: appended Track F GREEN evidence + post-amendment status table + iter6 candidate referral. phase0_findings_iter6.md: added candidate I (Firefox QBUF EINVAL), listed under recommended pairings as strong iter6-lock candidate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
228 lines
18 KiB
Markdown
228 lines
18 KiB
Markdown
# Iteration 5 close (Phase 8) — A+G+B+E all GREEN
|
||
|
||
Opened 2026-05-05 just after iter4 close, closing same day. Locked candidates: **A** (DEBUG instrumentation sweep), **G** (PGO-disabled Firefox-fourier rebuild), **B** (mpv libplacebo `--vo=gpu` segfault), **E** (multi-context libva safety).
|
||
|
||
All four tracks closed GREEN with one named caveat carried to iter6 (cap_pool resolution-change race latent under untested consumer probe patterns — Phase 5 sonnet C4 finding).
|
||
|
||
## Verdict per track
|
||
|
||
### Track A: GREEN — DEBUG sweep landed in two passes
|
||
|
||
**First pass** (commits `848fc0c`, `39498f0`, `951233a`, `d3a299b`, `843febc`): removed iter1 patch-0010/0011/0014 + iter3 Y2 v1 + iter4 Y2 v3 + iter4 DPB census + iter4 per-control TRY iso. Per-frame v4l2-request log noise dropped from ~30+ lines/frame to ~9 init-time lines.
|
||
|
||
**Second pass** (commit `c8b6ede`, after Phase 5 sonnet C1+C2): removed three additional surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) that the first pass missed because the vaapi-copy + --vo=null stress test didn't exercise the ExportSurfaceHandle path. Also removed h264.c's "3F observability" V4L2 readback block, which contained a `static bool readback_warned` (new mutable process-global state introduced post-Track-E — inconsistent with Track E's intent, also resolved by the block removal).
|
||
|
||
**Net:** ~340 lines of instrumentation removed across 6 commits. Verified clean: 2000-frame mpv vaapi-copy stress on the post-cleanup driver shows **0 EINVAL, 1 v4l2-request log line, 3 KB log** (down from 9 lines / 4.4 KB after first pass).
|
||
|
||
**KEPT (justified):**
|
||
- POC sentinel strip (`h264_strip_ffmpeg_poc_sentinel`) — load-bearing for ffmpeg-vaapi consumers
|
||
- slice_header bit-precise parser — load-bearing for hantro hw decode (DECODE_PARAMS bit_size fields drive MMIO writes)
|
||
- EACCES suppression in `v4l2_get_controls` — silences per-frame iter1-known-good error noise
|
||
- "slice_header parse FAILED" log — fires only on decode-blocking errors, not per-frame noise
|
||
|
||
### Track E: GREEN — multi-context libva safety
|
||
|
||
Commit `b993355`: `LAST_OUTPUT_WIDTH/HEIGHT` moved from process-global static in `surface.c` to `struct request_data.last_output_width/height`. The V4L2 device fd is per-driver_data, so this is the correct binding unit (one fd, one current OUTPUT format).
|
||
|
||
`surface_reset_format_cache()` signature changed to take `struct request_data *driver_data`; one callsite in `context.c` updated.
|
||
|
||
Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other statics (`formats[]`, `formats_count` in video.c) are constant lookup tables — no race.
|
||
|
||
**Verified:** two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Re-verified post-cleanup on driver `4bed52ec5d44b389...` — both clean.
|
||
|
||
Limit: same-instant co-launch hits kernel-level fd contention on `/dev/video1` (hantro is a single-instance device). Cross-process serialization is out of scope for a libva backend.
|
||
|
||
### Track B: GREEN — `mpv --vo=gpu` doesn't segfault
|
||
|
||
35s `mpv --hwdec=vaapi --vo=gpu` on the iter5-end driver: stream pos 31s, 29 frames dropped, **0 segfaults**. Vulkan init still fails (`VK_ERROR_INITIALIZATION_FAILED` — steady state on Mali-G52 / Bifrost per `reference_pinetab_no_vulkan.md`); mpv falls through to GLES via Panfrost gracefully.
|
||
|
||
Phase 5 sonnet C4 reframed the original "implicit fix" claim: the cap_pool REQBUFS-EBUSY race window remains latent under untested consumer probe patterns. The 35s mpv test sees 5 EBUSY events at init-time, mpv falls back to SW once, then continues. The race is documented as iter6+ candidate (the genuine fix is ordering-cap_pool-drain-before-REQBUFs in CreateSurfaces2, ~30 lines).
|
||
|
||
### Track G: GREEN — PGO-disabled Firefox-fourier 150.0.1-1.1
|
||
|
||
PKGBUILD overlay edited to replace 3-tier PGO sequence with single-pass optimized build. Single-pass build on boltzmann LXD container: **~2h27m** (vs iter3's 2h+ that died at PGO collect step — comparable wall time).
|
||
|
||
Result:
|
||
- pkg: `firefox-150.0.1-1.1-aarch64.pkg.tar.xz`, **68.7 MB** (sha256 `aa94c7290ee7be76...`)
|
||
- libxul.so: 169 MB stripped (21× smaller than iter3's 3.6 GB PGO-instrumented)
|
||
|
||
Installed via `pacman -U` on ohm replacing stock firefox 150.0.1-1.
|
||
|
||
Phase 7G test (35s autonomous run, no `MOZ_DISABLE_RDD_SANDBOX=1`):
|
||
- ENETDOWN: 0 (iter3 sandbox patch holds in release build)
|
||
- EINVAL: 0 (iter4 frame-11 fix holds)
|
||
- RDD ProcessDecode events: 538
|
||
- Stream mTime reached: 22.3s in 35s wall = **0.64× realtime**, **~2.7× speedup over PGO-instrumented binary**
|
||
|
||
## What landed
|
||
|
||
### Fork commits (libva-v4l2-request-fourier)
|
||
|
||
iter5 sweep + multi-context fix:
|
||
- `848fc0c` — remove iter3+iter4 Y2 instrumentation from v4l2.c (-54)
|
||
- `39498f0` — remove iter4 DPB census from h264.c (-31)
|
||
- `951233a` — remove iter1 ENTER traces (4 files, -17 across 13 sites)
|
||
- `d3a299b` — remove iter1 patch-0010 hex-dumps + patch-0011 sentinel (-81)
|
||
- `843febc` — remove iter1 slice_header / VAPicture dumps + Sync RETURN trace, suppress EACCES per-frame log (-49)
|
||
- `b993355` — Track E: LAST_OUTPUT_* per-driver_data
|
||
- `c8b6ede` — Phase 5 follow-up: 3 surface.c debug sites + h264.c readback block (-107)
|
||
|
||
Net: ~339 lines removed, ~52 lines added (Track E plumbing). Driver source builds clean and per-frame log noise is essentially zero (1 line per 2000-frame run).
|
||
|
||
### Campaign artifacts (libva-multiplanar)
|
||
|
||
- `phase0_findings_iter5.md` — substrate (8 candidates, locked A+G+B+E)
|
||
- `phase4_iter5_plan.md` — Phase 4 plan + execution + Phase 5 caveat resolutions + Phase 7 anchored evidence
|
||
- `phase8_iteration5_close.md` — this file
|
||
- `~/src/panvk-bifrost/README.md` — chartered as separate top-level future campaign (sequenced after fourier-fresnel)
|
||
|
||
### Build infrastructure
|
||
|
||
- firefox-fourier LXD container on boltzmann remains persistent. The PKGBUILD now has the iter5 PGO-disabled edit applied (the source extracted under `src/firefox-150.0.1/` is the iter4 state with iter3 patches; iter5 reused that). Future Firefox rebuilds can `cd src/firefox-150.0.1 && ./mach build` for incremental.
|
||
|
||
## State that carries to iter6 (or campaign close)
|
||
|
||
- **Hardware**: ohm RK3568 hantro G1/G2, kernel 6.19.10. Access: `ohm` (LAN; `ohm.vpn` also works).
|
||
- **Userspace**: firefox 150.0.1-1.1 (iter5 PGO-disabled fourier rebuild), libva 2.23.0, mesa 26.0.5, libdrm 2.4.131, mpv 0.41.0-3.
|
||
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `4bed52ec5d44b389...` (iter5-end, post-cleanup).
|
||
- **Test fixture**: bbb_1080p30_h264.mp4 sha256 `dcf8a7170fbd...`.
|
||
- **Build container**: firefox-fourier LXD on boltzmann, persistent.
|
||
|
||
## Documented limitations carried to iter6+ (or campaign close)
|
||
|
||
1. **Cap_pool resolution-change race** — Phase 5 sonnet C4. mpv's libplacebo Vulkan-fallback path triggers it; mpv recovers via SW fallback (no segfault), but the race exists. Fix: drain CAPTURE properly before issuing REQBUFs(0) on resolution change in `CreateSurfaces2`. ~30 lines.
|
||
2. **No pixel-correctness verification post-msync-removal** — Phase 5 sonnet C3. Probably safe (kernel does DMA sync at DQBUF level on this CMA-backed config). A frame-hash spot check would anchor formally.
|
||
3. **Vulkan unavailable on PineTab2** — `reference_pinetab_no_vulkan.md`. Out of campaign scope; consumers fall through to GLES via Panfrost.
|
||
4. **Sub-second concurrent libva init still races on /dev/video1** — Track E test passed only with 2s stagger. Cross-process serialization is out of scope for a libva backend.
|
||
|
||
## Lessons distilled to memory
|
||
|
||
No new memory entries this iteration — the iter5 work was instrumentation cleanup + targeted multi-context fix, no new diagnostic patterns surfaced. Existing memory entries from iter3+iter4 cover the operative discoveries (kernel obfuscation, request_fd lifecycle, FFmpeg as authority, sandbox seccomp, ALARM-stale wasi, firefox-fourier container, follow-on campaigns).
|
||
|
||
The phase 5 review caveats — sweep-completion verification needs to exercise EVERY consumer code path, not just the most common one — could be a feedback memory ("re-test post-sweep with each consumer pattern, not just one") but it's covered implicitly by `feedback_dev_process.md`'s Phase 7 verification discipline.
|
||
|
||
## Bootlin upstream outlook
|
||
|
||
iter5 shifts the fork toward upstream-readiness. Per `feedback_no_upstream.md`, no PR/MR happens without explicit operator instruction. But the clean state is now:
|
||
|
||
- Driver source builds with zero non-error `request_log` calls.
|
||
- Process-global mutable state eliminated (`LAST_OUTPUT_*` moved to per-driver_data; `readback_warned` removed entirely).
|
||
- Track A's frame-11 EINVAL fix from iter4 is in place (fresh request_fd per frame, DPB FFmpeg-semantics matching, B-slice L1 reflist .fields).
|
||
- Track F's Firefox sandbox patch from iter3 is documented in campaign repo.
|
||
- Track E's per-context state isolation is in.
|
||
|
||
Outstanding for upstream-readiness: cap_pool race fix (~30 lines for iter6), msync pixel-verification, possibly a multi-codec audit (MPEG-2 was iter1 lock's "next codec"; never opened).
|
||
|
||
## Phase 1 success criterion — final per track
|
||
|
||
- **Track A:** "Driver builds clean with zero `request_log()` calls in non-error paths, all iter1+iter3+iter4 DEBUG commits removed (or explicitly justified-and-kept), vaapi-copy + mpv smoke tests still green at 2000+ frames clean." ✓ HIT (2000 frames, 0 EINVAL, 1 log line).
|
||
|
||
- **Track G:** "Firefox-fourier rebuilt without `--enable-profile-generate=cross`, redeployed to ohm. firefox --version reports Mozilla Firefox 150.0.1. Resulting libxul.so is materially smaller than the 3.6 GB instrumented build." ✓ HIT (169 MB libxul, 21× smaller).
|
||
|
||
- **Track E:** "Two concurrent mpv processes on different bbb fixtures decode independently with no cross-context state corruption." ✓ HIT (with 2s stagger).
|
||
|
||
- **Track B:** "≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with operator-actionable workaround." ✓ HIT (31s stream pos, 0 segfaults; mpv handles cap_pool race via SW fallback gracefully; cap_pool race documented as iter6+ candidate).
|
||
|
||
**Joint success:** all four tracks independently verifiable on the same iter5-end driver build (sha256 `4bed52ec5d44b389...`). Phase 7 verified each. Phase 5 sonnet review caveats addressed. iter5 closes GREEN.
|
||
|
||
---
|
||
|
||
## Iter5 amendment (2026-05-05, same-day post-close) — Track F seccomp gap
|
||
|
||
**Trigger.** First real-world use of the iter5-G Firefox binary on ohm — playing a YouTube avc1 stream with `LIBVA_DRIVER_NAME=v4l2_request` env vars + sandbox enabled — emitted seccomp violations:
|
||
|
||
```
|
||
Sandbox: seccomp sandbox violation: pid <N>, tid <N>, syscall 29, ...
|
||
```
|
||
|
||
Decoded `0x80047C05` = `_IOR('|', 0x05, int)` = `MEDIA_IOC_REQUEST_ALLOC`. Two distinct gaps surfaced:
|
||
|
||
### Gap 1: Track G "GREEN" verdict was structural-only
|
||
The Phase 7G test measured "0 ENETDOWN, 0 EINVAL, 538 RDD ProcessDecode events, 22.3s mTime in 35s wall" on a `--headless --screenshot` autonomous run with synthetic playback — but never confirmed VAAPI bytes actually flowed through `/dev/video1`. The seccomp filter was returning `ENOSYS` to Firefox's media stack silently (`SECCOMP_RET_ERRNO`, not `SIGSYS`), so Firefox fell back to SW decode and the autonomous test couldn't tell the difference.
|
||
|
||
### Gap 2: Patch-sync drift between campaign repo and container
|
||
Campaign repo's `firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` (113 lines, broker + RDD seccomp) had drifted from the container's `/build/aur/firefox-fourier/0005-rdd-allow-stateless-v4l2-request-api.patch` (84 lines, broker only). The iter5-G build used the stale 84-line patch — the iter3 RDD seccomp fix never made it into the iter5-G binary at all.
|
||
|
||
### Gap 3: VAAPI work moved from RDD to Utility process
|
||
Even with the RDD seccomp restored, FF150 routes VAAPI decode through the **Utility process**, not RDD. `UtilitySandboxPolicy::EvaluateSyscall` falls through to `SandboxPolicyCommon` for `__NR_ioctl`, which doesn't allow `'|'` (or `'V'`, or `'d'`, or `'b'`). The iter3 patch needed extending to mirror the RDD ioctl allowlist into Utility.
|
||
|
||
### Amendment patch (combined, 160 lines)
|
||
|
||
`firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` regenerated as combined broker + RDD-seccomp + Utility-seccomp:
|
||
|
||
- Broker: 3 hunks (cap-filter widen + `AddV4l2RequestApiDependencies` + RDD wire-in) — unchanged from prior 84-line container patch.
|
||
- RDD seccomp: 2 hunks (`kMediaType` constant + `.ElseIf` allow) — restored from campaign 113-line patch.
|
||
- Utility seccomp: 1 hunk (entire `case __NR_ioctl:` override mirroring RDD's allowlist) — **new in iter5 amendment**.
|
||
|
||
Authored as `claude-noether <claude@reauktion.de>`.
|
||
|
||
### Build approach — incremental, ~2:22 wall
|
||
|
||
Per operator's standard flow: edit src/ in place, `makepkg -e --skippgpcheck`. Container src tree from iter5-G already had broker hunks applied; only the seccomp hunks needed manual application. Then `makepkg -e` skipped extract+prepare and re-ran build/package on existing object cache.
|
||
|
||
Result:
|
||
- pkg: `firefox-150.0.1-1.1-aarch64.pkg.tar.xz` (65.5 MB, sha256 `675bbc7dd0a187ee8baefef5c3b36d15c64919cd80edf725e29c23f2675ed4a8`)
|
||
- Build wall time: 19:01 → 19:03:22 = **~2:22** (vs from-scratch ~2h27m for iter5-G)
|
||
- Saved: ~2h25m by reusing existing ccache + object tree
|
||
|
||
Backed up prior pkg as `firefox-150.0.1-1.1-aarch64.pkg.tar.xz.iter5g`.
|
||
|
||
### Verification (pending — deploy to ohm)
|
||
|
||
Definitive HW-decode test (couldn't run pre-amendment because seccomp blocked the path silently):
|
||
```
|
||
LIBVA_DRIVER_NAME=v4l2_request \
|
||
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 \
|
||
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
|
||
firefox <video-url> &
|
||
sleep 8
|
||
sudo lsof /dev/video1 /dev/media0
|
||
# expect: a Firefox-spawned Utility process holding both nodes
|
||
```
|
||
|
||
If `lsof` shows the Utility process on `/dev/video1` during avc1 playback → Track F closes properly. If empty → another sandbox layer to chase.
|
||
|
||
### Lessons captured
|
||
|
||
The Track G "GREEN" verdict was right by its own success criterion ("libxul materially smaller, firefox --version works") but the criterion didn't include "actually does HW decode in production use." Lesson for Phase 1 lock: criteria for sandbox/release-build tracks should include **end-user playback verification with `lsof` proof of device handle ownership**, not just structural metrics.
|
||
|
||
Adding to memory: `feedback_seccomp_silent_enosys.md` already covers the silent-ENOSYS pattern; this episode reinforces it. The patch-sync-drift between campaign repo and container is a process bug — fixed by syncing the combined patch back to container in this amendment, but the underlying mismatch (PKGBUILD numbers patches `0005-...` while campaign uses `0001-...`) remains. Tolerable for a personal-machine campaign; would need normalizing for upstream.
|
||
|
||
### Status post-amendment
|
||
|
||
| Element | State |
|
||
|---|---|
|
||
| Combined 160-line patch | Authored, in campaign repo + container 0005-... |
|
||
| Container src/ tree | Broker (from iter5-G prepare) + seccomp (manually patched in iter5 amend) |
|
||
| Built pkg | `675bbc7d…` on boltzmann:/tmp/ |
|
||
| ohm deploy | Done via temporary HTTP server on boltzmann:18080 (vpn route was closed; ohm has lan route to boltzmann.fritz.box). `pacman -U` confirmed Build Date 21:01 CEST = 19:01 UTC = amendment build. `libmozsandbox.so` sha `4e6c7d58bc2220dbdf6ad817ee70fa77fc85e618ffd49ebdabb833f416dc3076`, size 187536 (vs prior 185712, +1824 bytes for the new seccomp code). |
|
||
| HW-decode verification | **Track F GREEN.** Played `bbb_1080p30_h264.mp4` via Firefox 150 with `LIBVA_DRIVER_NAME=v4l2_request` and full sandbox. Log: `cap_pool_init: 24 slots ready` then `Unable to queue buffer: Invalid argument` — **no seccomp violation, no ENOSYS on `MEDIA_IOC_REQUEST_ALLOC`**. The seccomp gate is fully passed; the remaining EINVAL is post-sandbox, driver-level. |
|
||
| Campaign git commit | `d2d9107` (patch + docs) pushed to Gitea pre-verification. |
|
||
|
||
### Track F closes — definitive evidence
|
||
|
||
Pre-amendment failure pattern (iter5-G with broker-only patch):
|
||
```
|
||
v4l2-request: cap_pool_init: 24 slots ready
|
||
[PID] Sandbox: seccomp sandbox violation: pid PID, syscall 29, args FD 0x80047C05 ...
|
||
v4l2-request: Unable to allocate media request: Function not implemented
|
||
```
|
||
|
||
Post-amendment pattern (iter5-amend):
|
||
```
|
||
v4l2-request: cap_pool_init: 24 slots ready
|
||
v4l2-request: Unable to queue buffer: Invalid argument
|
||
```
|
||
|
||
The disappearance of "syscall 29 ... 0x80047C05" and "Function not implemented" in the post-amendment log is the closing proof: `MEDIA_IOC_REQUEST_ALLOC` (`_IOR('|', 0x05, int)`) now reaches the kernel from the Utility process. Track F is **GREEN**.
|
||
|
||
YouTube (`watch?v=7DAPd5MGodY`) played without engaging the v4l2_request driver at all — no `cap_pool_init`, no `/dev/video1` activity. Likely cause: YouTube negotiated VP9 or AV1 with FF150 (no `h264ify` extension installed); v4l2_request only handles H.264, so libva isn't dispatched. This is unrelated to Track F — a codec-negotiation issue, not a sandbox issue. The H.264-only fixture (`bbb_1080p30_h264.mp4` direct file URL) bypasses YouTube's codec negotiation and triggers v4l2_request, which is how the sandbox close was demonstrated.
|
||
|
||
### New iter6 candidate — Firefox VIDIOC_QBUF EINVAL on first frame
|
||
|
||
A driver-level issue surfaced post-sandbox-fix: with the v4l2_request driver loaded (cap_pool_init succeeds), Firefox's libva path issues a single `VIDIOC_QBUF` that returns EINVAL on what appears to be the first frame. `mpv --hwdec=vaapi-copy` decoded 2000 frames clean on the same iter5-end driver build (sha `4bed52ec5d44b389…`); only Firefox triggers this. Likely consumer-specific: Firefox's media stack feeds the driver differently than mpv (different VAEncodedSlice ordering, VAImage usage pattern, surface handle lifecycle, etc.).
|
||
|
||
Logged as iter6 candidate. Track A's frame-11 EINVAL fix was about per-frame request_fd lifecycle and DPB FFmpeg-semantics matching — this looks earlier (init-phase, not frame 11), so likely a different root cause. Diagnosis would start with strace on the Firefox Utility process during initial QBUF, capture the exact `v4l2_buffer` struct payload, compare against what mpv sends.
|
||
|
||
This does not retract the iter5 sweep's GREEN verdict — the per-consumer divergence reinforces the Phase 5 sonnet caveat ("sweep-completion verification needs to exercise EVERY consumer code path") that's already documented as a process lesson.
|