claude-noether d2d9107e62 iter5 amendment: extend Firefox sandbox patch to UtilitySandboxPolicy
Real-world YouTube avc1 playback on the iter5-G binary surfaced a
seccomp violation (`syscall 29`, `0x80047C05` = `MEDIA_IOC_REQUEST_ALLOC`)
that the autonomous Phase 7G test missed because seccomp returns
ENOSYS silently and Firefox falls back to SW decode.

Two distinct gaps:
- patch-sync drift: campaign 113-line patch (broker+RDD-seccomp) had
  drifted from container 84-line patch (broker only); iter5-G shipped
  with the broker fix but no RDD seccomp fix.
- coverage gap: FF150 routes VAAPI to the Utility process; iter3's
  RDD-only seccomp allowlist never covered Utility.

Combined patch now hits three gates across two files (six hunks):
- broker: cap-filter widen + AddV4l2RequestApiDependencies + RDD wire-in
- RDD seccomp: kMediaType allow alongside existing kVideoType
- Utility seccomp: new __NR_ioctl override mirroring RDD's allowlist

Build: incremental `makepkg -e` on existing iter5-G object tree took
2:22 wall vs the 2h27m from-scratch alternative.

phase8_iteration5_close.md: appended amendment section with verdict-
gap analysis, patch breakdown, deploy-pending status.

firefox-fourier/README.md: rewrote "The problem" from 2 gates to 3
(broker + RDD seccomp + Utility seccomp); patch summary now explains
the six hunks.

Pending: pkg deploy to ohm + lsof /dev/video1 verification once
network route to ohm is restored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:20:30 +00:00

# Iteration 5 close (Phase 8) — A+G+B+E all GREEN
Opened 2026-05-05 just after iter4 close, closing same day. Locked candidates: **A** (DEBUG instrumentation sweep), **G** (PGO-disabled Firefox-fourier rebuild), **B** (mpv libplacebo `--vo=gpu` segfault), **E** (multi-context libva safety).
All four tracks closed GREEN with one named caveat carried to iter6 (cap_pool resolution-change race latent under untested consumer probe patterns — Phase 5 sonnet C4 finding).
## Verdict per track
### Track A: GREEN — DEBUG sweep landed in two passes
**First pass** (commits `848fc0c`, `39498f0`, `951233a`, `d3a299b`, `843febc`): removed iter1 patch-0010/0011/0014 + iter3 Y2 v1 + iter4 Y2 v3 + iter4 DPB census + iter4 per-control TRY iso. Per-frame v4l2-request log noise dropped from ~30+ lines/frame to ~9 init-time lines.
**Second pass** (commit `c8b6ede`, after Phase 5 sonnet C1+C2): removed three additional surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) that the first pass missed because the vaapi-copy + --vo=null stress test didn't exercise the ExportSurfaceHandle path. Also removed h264.c's "3F observability" V4L2 readback block, which contained a `static bool readback_warned` (new mutable process-global state introduced post-Track-E — inconsistent with Track E's intent, also resolved by the block removal).
**Net:** ~340 lines of instrumentation removed across 6 commits. Verified clean: 2000-frame mpv vaapi-copy stress on the post-cleanup driver shows **0 EINVAL, 1 v4l2-request log line, 3 KB log** (down from 9 lines / 4.4 KB after first pass).
**KEPT (justified):**
- POC sentinel strip (`h264_strip_ffmpeg_poc_sentinel`) — load-bearing for ffmpeg-vaapi consumers
- slice_header bit-precise parser — load-bearing for hantro hw decode (DECODE_PARAMS bit_size fields drive MMIO writes)
- EACCES suppression in `v4l2_get_controls` — silences per-frame iter1-known-good error noise
- "slice_header parse FAILED" log — fires only on decode-blocking errors, not per-frame noise
### Track E: GREEN — multi-context libva safety
Commit `b993355`: `LAST_OUTPUT_WIDTH/HEIGHT` moved from process-global static in `surface.c` to `struct request_data.last_output_width/height`. The V4L2 device fd is per-driver_data, so this is the correct binding unit (one fd, one current OUTPUT format).
`surface_reset_format_cache()` signature changed to take `struct request_data *driver_data`; one callsite in `context.c` updated.
Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other statics (`formats[]`, `formats_count` in video.c) are constant lookup tables — no race.
**Verified:** two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Re-verified post-cleanup on driver `4bed52ec5d44b389...` — both clean.
Limit: same-instant co-launch hits kernel-level fd contention on `/dev/video1` (hantro is a single-instance device). Cross-process serialization is out of scope for a libva backend.
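The staggered two-process check described above could be scripted roughly as follows. This is a minimal sketch, not the campaign's actual test harness: `run_staggered` is a hypothetical helper, and the mpv flags and fixture names in the usage comment are assumptions, since the exact command line used for the Track E run is not recorded here.

```shell
# run_staggered DELAY CMD_A CMD_B   (hypothetical helper)
# Starts CMD_A, waits DELAY seconds, starts CMD_B, then waits for both.
# Returns 0 only if both commands exit 0.
run_staggered() {
    delay=$1
    sh -c "$2" &
    pid_a=$!
    sleep "$delay"
    sh -c "$3" &
    pid_b=$!
    status=0
    wait "$pid_a" || status=1
    wait "$pid_b" || status=1
    return "$status"
}

# Assumed usage (flags and file names are illustrative):
# run_staggered 2 \
#   "mpv --hwdec=vaapi-copy --vo=null --frames=300 fixture_a.mp4" \
#   "mpv --hwdec=vaapi-copy --vo=null --frames=300 fixture_b.mp4"
```

A delay of 0 would approximate the same-instant co-launch case, which is expected to hit the `/dev/video1` contention noted above.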
### Track B: GREEN — `mpv --vo=gpu` doesn't segfault
35s `mpv --hwdec=vaapi --vo=gpu` on the iter5-end driver: stream pos 31s, 29 frames dropped, **0 segfaults**. Vulkan init still fails (`VK_ERROR_INITIALIZATION_FAILED` — steady state on Mali-G52 / Bifrost per `reference_pinetab_no_vulkan.md`); mpv falls through to GLES via Panfrost gracefully.
Phase 5 sonnet C4 reframed the original "implicit fix" claim: the cap_pool REQBUFS-EBUSY race window remains latent under untested consumer probe patterns. The 35s mpv test sees 5 EBUSY events at init time; mpv falls back to SW once, then continues. The race is documented as an iter6+ candidate (the genuine fix is to drain cap_pool before issuing REQBUFS in CreateSurfaces2, ~30 lines).
### Track G: GREEN — PGO-disabled Firefox-fourier 150.0.1-1.1
PKGBUILD overlay edited to replace 3-tier PGO sequence with single-pass optimized build. Single-pass build on boltzmann LXD container: **~2h27m** (vs iter3's 2h+ that died at PGO collect step — comparable wall time).
Result:
- pkg: `firefox-150.0.1-1.1-aarch64.pkg.tar.xz`, **68.7 MB** (sha256 `aa94c7290ee7be76...`)
- libxul.so: 169 MB stripped (21× smaller than iter3's 3.6 GB PGO-instrumented)
Installed via `pacman -U` on ohm replacing stock firefox 150.0.1-1.
Phase 7G test (35s autonomous run, no `MOZ_DISABLE_RDD_SANDBOX=1`):
- ENETDOWN: 0 (iter3 sandbox patch holds in release build)
- EINVAL: 0 (iter4 frame-11 fix holds)
- RDD ProcessDecode events: 538
- Stream mTime reached: 22.3s in 35s wall = **0.64× realtime**, **~2.7× speedup over PGO-instrumented binary**
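
The realtime factor quoted in the last bullet is stream time divided by wall time. A small sketch of that arithmetic, with `realtime_ratio` as a hypothetical helper name:

```shell
# realtime_ratio STREAM_SECONDS WALL_SECONDS   (hypothetical helper)
# Prints stream time / wall time to two decimals.
realtime_ratio() {
    awk -v s="$1" -v w="$2" 'BEGIN { printf "%.2f\n", s / w }'
}

realtime_ratio 22.3 35   # prints 0.64
```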
## What landed
### Fork commits (libva-v4l2-request-fourier)
iter5 sweep + multi-context fix:
- `848fc0c` — remove iter3+iter4 Y2 instrumentation from v4l2.c (-54)
- `39498f0` — remove iter4 DPB census from h264.c (-31)
- `951233a` — remove iter1 ENTER traces (4 files, -17 across 13 sites)
- `d3a299b` — remove iter1 patch-0010 hex-dumps + patch-0011 sentinel (-81)
- `843febc` — remove iter1 slice_header / VAPicture dumps + Sync RETURN trace, suppress EACCES per-frame log (-49)
- `b993355` — Track E: LAST_OUTPUT_* per-driver_data
- `c8b6ede` — Phase 5 follow-up: 3 surface.c debug sites + h264.c readback block (-107)
Net: ~339 lines removed, ~52 lines added (Track E plumbing). Driver source builds clean and per-frame log noise is essentially zero (1 line per 2000-frame run).
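
As a cross-check, the removed-line figure is the sum of the per-commit counts listed above (`b993355` carries no removal count and is left out):

```shell
# Per-commit removed-line counts from the list above, in commit order.
removed="54 31 17 81 49 107"
total=0
for n in $removed; do
    total=$((total + n))
done
echo "$total"   # prints 339
```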
### Campaign artifacts (libva-multiplanar)
- `phase0_findings_iter5.md` — substrate (8 candidates, locked A+G+B+E)
- `phase4_iter5_plan.md` — Phase 4 plan + execution + Phase 5 caveat resolutions + Phase 7 anchored evidence
- `phase8_iteration5_close.md` — this file
- `~/src/panvk-bifrost/README.md` — chartered as separate top-level future campaign (sequenced after fourier-fresnel)
### Build infrastructure
- firefox-fourier LXD container on boltzmann remains persistent. The PKGBUILD now has the iter5 PGO-disabled edit applied (the source extracted under `src/firefox-150.0.1/` is the iter4 state with iter3 patches; iter5 reused that). Future Firefox rebuilds can `cd src/firefox-150.0.1 && ./mach build` for incremental.
## State that carries to iter6 (or campaign close)
- **Hardware**: ohm RK3568 hantro G1/G2, kernel 6.19.10. Access: `ohm` (LAN; `ohm.vpn` also works).
- **Userspace**: firefox 150.0.1-1.1 (iter5 PGO-disabled fourier rebuild), libva 2.23.0, mesa 26.0.5, libdrm 2.4.131, mpv 0.41.0-3.
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `4bed52ec5d44b389...` (iter5-end, post-cleanup).
- **Test fixture**: bbb_1080p30_h264.mp4 sha256 `dcf8a7170fbd...`.
- **Build container**: firefox-fourier LXD on boltzmann, persistent.
## Documented limitations carried to iter6+ (or campaign close)
1. **Cap_pool resolution-change race** — Phase 5 sonnet C4. mpv's libplacebo Vulkan-fallback path triggers it; mpv recovers via SW fallback (no segfault), but the race exists. Fix: drain CAPTURE properly before issuing REQBUFs(0) on resolution change in `CreateSurfaces2`. ~30 lines.
2. **No pixel-correctness verification post-msync-removal** — Phase 5 sonnet C3. Probably safe (kernel does DMA sync at DQBUF level on this CMA-backed config). A frame-hash spot check would anchor formally.
3. **Vulkan unavailable on PineTab2** (per `reference_pinetab_no_vulkan.md`). Out of campaign scope; consumers fall through to GLES via Panfrost.
4. **Sub-second concurrent libva init still races on /dev/video1** — Track E test passed only with 2s stagger. Cross-process serialization is out of scope for a libva backend.
## Lessons distilled to memory
No new memory entries this iteration — the iter5 work was instrumentation cleanup + targeted multi-context fix, no new diagnostic patterns surfaced. Existing memory entries from iter3+iter4 cover the operative discoveries (kernel obfuscation, request_fd lifecycle, FFmpeg as authority, sandbox seccomp, ALARM-stale wasi, firefox-fourier container, follow-on campaigns).
The Phase 5 review caveat (sweep-completion verification needs to exercise EVERY consumer code path, not just the most common one) could become a feedback memory ("re-test post-sweep with each consumer pattern, not just one"), but it is covered implicitly by `feedback_dev_process.md`'s Phase 7 verification discipline.
## Bootlin upstream outlook
iter5 shifts the fork toward upstream-readiness. Per `feedback_no_upstream.md`, no PR/MR happens without explicit operator instruction. But the clean state is now:
- Driver source builds with zero non-error `request_log` calls.
- Process-global mutable state eliminated (`LAST_OUTPUT_*` moved to per-driver_data; `readback_warned` removed entirely).
- Track A's frame-11 EINVAL fix from iter4 is in place (fresh request_fd per frame, DPB FFmpeg-semantics matching, B-slice L1 reflist .fields).
- Track F's Firefox sandbox patch from iter3 is documented in campaign repo.
- Track E's per-context state isolation is in.
Outstanding for upstream-readiness: cap_pool race fix (~30 lines for iter6), msync pixel-verification, possibly a multi-codec audit (MPEG-2 was iter1 lock's "next codec"; never opened).
## Phase 1 success criterion — final per track
- **Track A:** "Driver builds clean with zero `request_log()` calls in non-error paths, all iter1+iter3+iter4 DEBUG commits removed (or explicitly justified-and-kept), vaapi-copy + mpv smoke tests still green at 2000+ frames clean." ✓ HIT (2000 frames, 0 EINVAL, 1 log line).
- **Track G:** "Firefox-fourier rebuilt without `--enable-profile-generate=cross`, redeployed to ohm. firefox --version reports Mozilla Firefox 150.0.1. Resulting libxul.so is materially smaller than the 3.6 GB instrumented build." ✓ HIT (169 MB libxul, 21× smaller).
- **Track E:** "Two concurrent mpv processes on different bbb fixtures decode independently with no cross-context state corruption." ✓ HIT (with 2s stagger).
- **Track B:** "≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with operator-actionable workaround." ✓ HIT (31s stream pos, 0 segfaults; mpv handles cap_pool race via SW fallback gracefully; cap_pool race documented as iter6+ candidate).
**Joint success:** all four tracks independently verifiable on the same iter5-end driver build (sha256 `4bed52ec5d44b389...`). Phase 7 verified each. Phase 5 sonnet review caveats addressed. iter5 closes GREEN.
---
## Iter5 amendment (2026-05-05, same-day post-close) — Track F seccomp gap
**Trigger.** First real-world use of the iter5-G Firefox binary on ohm — playing a YouTube avc1 stream with `LIBVA_DRIVER_NAME=v4l2_request` env vars + sandbox enabled — emitted seccomp violations:
```
Sandbox: seccomp sandbox violation: pid <N>, tid <N>, syscall 29, ...
```
Decoded `0x80047C05` = `_IOR('|', 0x05, int)` = `MEDIA_IOC_REQUEST_ALLOC`. Three distinct gaps surfaced:
### Gap 1: Track G "GREEN" verdict was structural-only
The Phase 7G test measured "0 ENETDOWN, 0 EINVAL, 538 RDD ProcessDecode events, 22.3s mTime in 35s wall" on a `--headless --screenshot` autonomous run with synthetic playback — but never confirmed VAAPI bytes actually flowed through `/dev/video1`. The seccomp filter was returning `ENOSYS` to Firefox's media stack silently (`SECCOMP_RET_ERRNO`, not `SIGSYS`), so Firefox fell back to SW decode and the autonomous test couldn't tell the difference.
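One cheap guard against this kind of silent fallback is to scan the captured Firefox log for the violation string shown above. The sketch below assumes Firefox's stderr was redirected to a file; `count_seccomp_violations` and `sandbox_clean` are hypothetical helper names. A clean log only shows the sandbox did not complain; it does not prove that HW decode actually ran.

```shell
# count_seccomp_violations LOGFILE   (hypothetical helper)
# Prints how many lines contain the sandbox violation message.
count_seccomp_violations() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # keep the printed count and discard that status.
    grep -c 'seccomp sandbox violation' "$1" || true
}

# sandbox_clean LOGFILE   (hypothetical helper)
# Succeeds only if the log contains no violation lines.
sandbox_clean() {
    [ "$(count_seccomp_violations "$1")" -eq 0 ]
}
```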
### Gap 2: Patch-sync drift between campaign repo and container
Campaign repo's `firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` (113 lines, broker + RDD seccomp) had drifted from the container's `/build/aur/firefox-fourier/0005-rdd-allow-stateless-v4l2-request-api.patch` (84 lines, broker only). The iter5-G build used the stale 84-line patch — the iter3 RDD seccomp fix never made it into the iter5-G binary at all.
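Drift of this kind can be caught before a build by comparing the two patch copies byte for byte. A minimal sketch, assuming both files are reachable from one shell (for example after copying the container's copy out); `patch_in_sync` is a hypothetical helper:

```shell
# patch_in_sync REPO_PATCH CONTAINER_PATCH   (hypothetical helper)
# Succeeds if the two files are byte-identical; otherwise reports line counts.
patch_in_sync() {
    if cmp -s "$1" "$2"; then
        echo "in sync"
        return 0
    fi
    lines_a=$(awk 'END { print NR }' "$1")
    lines_b=$(awk 'END { print NR }' "$2")
    echo "DRIFT: $1 has $lines_a lines, $2 has $lines_b lines"
    return 1
}
```

Run against the two paths named above, a non-zero exit would have flagged the 113-line versus 84-line mismatch before the iter5-G build started.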
### Gap 3: VAAPI work moved from RDD to Utility process
Even with the RDD seccomp restored, FF150 routes VAAPI decode through the **Utility process**, not RDD. `UtilitySandboxPolicy::EvaluateSyscall` falls through to `SandboxPolicyCommon` for `__NR_ioctl`, which doesn't allow `'|'` (or `'V'`, or `'d'`, or `'b'`). The iter3 patch needed extending to mirror the RDD ioctl allowlist into Utility.
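To see which type byte an ioctl allowlist has to match, the request number from the violation can be split into its fields. The sketch below assumes the generic Linux ioctl layout (nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31), which is what aarch64 uses; `decode_ioctl` is a hypothetical helper.

```shell
# decode_ioctl REQUEST   (hypothetical helper)
# Splits an ioctl request number into direction, size, type and nr,
# assuming the asm-generic bit layout.
decode_ioctl() {
    req=$(($1))
    nr=$((req & 0xff))
    type=$(((req >> 8) & 0xff))
    size=$(((req >> 16) & 0x3fff))
    dir=$(((req >> 30) & 0x3))
    # Render the type byte as its ASCII character via an octal escape.
    type_char=$(printf "\\$(printf '%03o' "$type")")
    printf "dir=%d size=%d type='%s' nr=0x%02x\n" "$dir" "$size" "$type_char" "$nr"
}

decode_ioctl 0x80047C05   # prints dir=2 size=4 type='|' nr=0x05
```

Here `dir=2` is the read direction and `size=4` matches an `int` argument, consistent with `_IOR('|', 0x05, int)`.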
### Amendment patch (combined, 160 lines)
`firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` regenerated as combined broker + RDD-seccomp + Utility-seccomp:
- Broker: 3 hunks (cap-filter widen + `AddV4l2RequestApiDependencies` + RDD wire-in) — unchanged from prior 84-line container patch.
- RDD seccomp: 2 hunks (`kMediaType` constant + `.ElseIf` allow) — restored from campaign 113-line patch.
- Utility seccomp: 1 hunk (entire `case __NR_ioctl:` override mirroring RDD's allowlist) — **new in iter5 amendment**.
Authored as `claude-noether <claude@reauktion.de>`.
### Build approach — incremental, ~2:22 wall
Per operator's standard flow: edit src/ in place, `makepkg -e --skippgpcheck`. Container src tree from iter5-G already had broker hunks applied; only the seccomp hunks needed manual application. Then `makepkg -e` skipped extract+prepare and re-ran build/package on existing object cache.
Result:
- pkg: `firefox-150.0.1-1.1-aarch64.pkg.tar.xz` (65.5 MB, sha256 `675bbc7dd0a187ee8baefef5c3b36d15c64919cd80edf725e29c23f2675ed4a8`)
- Build wall time: 19:01 → 19:03:22 = **~2:22** (vs from-scratch ~2h27m for iter5-G)
- Saved: ~2h25m by reusing existing ccache + object tree
Backed up prior pkg as `firefox-150.0.1-1.1-aarch64.pkg.tar.xz.iter5g`.
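
The wall-time figure above follows from the two timestamps. A small sketch of the subtraction, assuming the start time `19:01` means `19:01:00` and both timestamps fall on the same day; `elapsed_mmss` is a hypothetical helper:

```shell
# elapsed_mmss START END   (hypothetical helper; HH:MM:SS, same day)
# Prints the elapsed time as M:SS.
elapsed_mmss() {
    awk -v a="$1" -v b="$2" '
        function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
        BEGIN { d = secs(b) - secs(a); printf "%d:%02d\n", int(d / 60), d % 60 }'
}

elapsed_mmss 19:01:00 19:03:22   # prints 2:22
```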
### Verification (pending — deploy to ohm)
Definitive HW-decode test (couldn't run pre-amendment because seccomp blocked the path silently):
```
LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
firefox <video-url> &
sleep 8
sudo lsof /dev/video1 /dev/media0
# expect: a Firefox-spawned Utility process holding both nodes
```
If `lsof` shows the Utility process on `/dev/video1` during avc1 playback → Track F closes properly. If empty → another sandbox layer to chase.
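The pass/fail decision above can be made mechanical by parsing the `lsof` output. A minimal sketch, assuming the default `lsof` column layout (PID in the second column, NAME last); `holds_both_nodes` is a hypothetical helper. It only checks that a single PID holds both nodes, and confirming that this PID is the Firefox Utility process remains a manual step.

```shell
# holds_both_nodes   (hypothetical helper)
# Reads `lsof` output on stdin; succeeds if at least one PID has both
# /dev/video1 and /dev/media0 open, and prints that PID.
holds_both_nodes() {
    awk '
        NR > 1 && $NF == "/dev/video1" { video[$2] = 1 }
        NR > 1 && $NF == "/dev/media0" { media[$2] = 1 }
        END {
            for (pid in video)
                if (pid in media) { found = 1; print "pid " pid " holds both nodes" }
            exit (found ? 0 : 1)
        }'
}

# Assumed usage during avc1 playback:
# sudo lsof /dev/video1 /dev/media0 | holds_both_nodes
```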
### Lessons captured
The Track G "GREEN" verdict was right by its own success criterion ("libxul materially smaller, firefox --version works") but the criterion didn't include "actually does HW decode in production use." Lesson for Phase 1 lock: criteria for sandbox/release-build tracks should include **end-user playback verification with `lsof` proof of device handle ownership**, not just structural metrics.
Memory: no new entry needed. `feedback_seccomp_silent_enosys.md` already covers the silent-ENOSYS pattern, and this episode reinforces it. The patch-sync drift between campaign repo and container is a process bug: it was fixed by syncing the combined patch back to the container in this amendment, but the underlying mismatch (PKGBUILD numbers the patch `0005-...` while the campaign repo uses `0001-...`) remains. Tolerable for a personal-machine campaign; would need normalizing for upstream.
### Status post-amendment
| Element | State |
|---|---|
| Combined 160-line patch | Authored, in campaign repo + container 0005-... |
| Container src/ tree | Broker (from iter5-G prepare) + seccomp (manually patched in iter5 amend) |
| Built pkg | `675bbc7d…` on boltzmann:/tmp/, awaiting deploy to ohm |
| ohm deploy | Pending operator-side scp (vpn route was down when the amendment was written) |
| HW-decode lsof verification | Pending |
| Campaign git commit | Pending |