# Iteration 5 close (Phase 8) — A+G+B+E all GREEN
Opened 2026-05-05 just after iter4 close, closing same day. Locked candidates: **A** (DEBUG instrumentation sweep), **G** (PGO-disabled Firefox-fourier rebuild), **B** (mpv libplacebo `--vo=gpu` segfault), **E** (multi-context libva safety).
All four tracks closed GREEN with one named caveat carried to iter6 (cap_pool resolution-change race latent under untested consumer probe patterns — Phase 5 sonnet C4 finding).
## Verdict per track
### Track A: GREEN — DEBUG sweep landed in two passes
**First pass** (commits `848fc0c`, `39498f0`, `951233a`, `d3a299b`, `843febc`): removed iter1 patch-0010/0011/0014 + iter3 Y2 v1 + iter4 Y2 v3 + iter4 DPB census + iter4 per-control TRY iso. Per-frame v4l2-request log noise dropped from ~30+ lines/frame to ~9 init-time lines.
**Second pass** (commit `c8b6ede`, after Phase 5 sonnet C1+C2): removed three additional surface.c DEBUG sites (CreateSurfaces2 format-dump, ExportSurfaceHandle descriptor-dump, QuerySurfaceStatus status-dump) that the first pass missed because the vaapi-copy + --vo=null stress test didn't exercise the ExportSurfaceHandle path. Also removed h264.c's "3F observability" V4L2 readback block, which contained a `static bool readback_warned` (new mutable process-global state introduced post-Track-E — inconsistent with Track E's intent, also resolved by the block removal).
**Net:** ~340 lines of instrumentation removed across 6 commits. Verified clean: 2000-frame mpv vaapi-copy stress on the post-cleanup driver shows **0 EINVAL, 1 v4l2-request log line, 3 KB log** (down from 9 lines / 4.4 KB after first pass).
**KEPT (justified):**
- POC sentinel strip (`h264_strip_ffmpeg_poc_sentinel`) — load-bearing for ffmpeg-vaapi consumers
- slice_header bit-precise parser — load-bearing for hantro hw decode (DECODE_PARAMS bit_size fields drive MMIO writes)
- EACCES suppression in `v4l2_get_controls` — silences per-frame iter1-known-good error noise
- "slice_header parse FAILED" log — fires only on decode-blocking errors, not per-frame noise
### Track E: GREEN — multi-context libva safety
Commit `b993355`: `LAST_OUTPUT_WIDTH/HEIGHT` moved from process-global static in `surface.c` to `struct request_data.last_output_width/height`. The V4L2 device fd is per-driver_data, so this is the correct binding unit (one fd, one current OUTPUT format).
`surface_reset_format_cache()` signature changed to take `struct request_data *driver_data`; one callsite in `context.c` updated.
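The shape of this change can be sketched in Python as an analogy; it is not the driver's C code. The cached OUTPUT dimensions live on a per-driver object instead of at module level, so two driver instances cannot overwrite each other's cache. `RequestData` stands in for `struct request_data`, and `set_output_format` is a hypothetical helper that does not exist in the fork.

```python
from dataclasses import dataclass


@dataclass
class RequestData:
    """Stand-in for `struct request_data`: one instance per V4L2 device fd."""
    video_fd: int
    last_output_width: int = 0
    last_output_height: int = 0


def surface_reset_format_cache(driver_data: RequestData) -> None:
    """Forget the cached OUTPUT format for this driver instance only."""
    driver_data.last_output_width = 0
    driver_data.last_output_height = 0


def set_output_format(driver_data: RequestData, width: int, height: int) -> bool:
    """Hypothetical helper: return True when the format must be (re)applied."""
    cached = (driver_data.last_output_width, driver_data.last_output_height)
    if cached == (width, height):
        return False  # cache hit: nothing to do for this fd
    driver_data.last_output_width = width
    driver_data.last_output_height = height
    return True


a = RequestData(video_fd=3)
b = RequestData(video_fd=4)
first_a = set_output_format(a, 1920, 1080)
first_b = set_output_format(b, 1280, 720)
repeat_a = set_output_format(a, 1920, 1080)  # still cached despite b's change
```

With a module-level cache, the call on `b` would have invalidated `a`'s cached format; with per-instance fields it does not.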
Audit confirmed only LAST_OUTPUT_* was mutable process-global state. Other statics (`formats[]`, `formats_count` in video.c) are constant lookup tables — no race.
**Verified:** two concurrent mpv processes with 2-second stagger both decoded 300 frames cleanly, no cross-context corruption. Re-verified post-cleanup on driver `4bed52ec5d44b389...` — both clean.
Limit: same-instant co-launch hits kernel-level fd contention on `/dev/video1` (hantro is a single-instance device). Cross-process serialization is out of scope for a libva backend.
### Track B: GREEN — `mpv --vo=gpu` doesn't segfault
35s `mpv --hwdec=vaapi --vo=gpu` on the iter5-end driver: stream pos 31s, 29 frames dropped, **0 segfaults**. Vulkan init still fails (`VK_ERROR_INITIALIZATION_FAILED` — steady state on Mali-G52 / Bifrost per `reference_pinetab_no_vulkan.md`); mpv falls through to GLES via Panfrost gracefully.
Phase 5 sonnet C4 reframed the original "implicit fix" claim: the cap_pool REQBUFS-EBUSY race window remains latent under untested consumer probe patterns. The 35s mpv test sees 5 EBUSY events at init time; mpv falls back to SW once, then continues. The race is documented as an iter6+ candidate (the genuine fix is to drain cap_pool before issuing REQBUFS in `CreateSurfaces2`, ~30 lines).
### Track G: GREEN — PGO-disabled Firefox-fourier 150.0.1-1.1
PKGBUILD overlay edited to replace 3-tier PGO sequence with single-pass optimized build. Single-pass build on boltzmann LXD container: **~2h27m** (vs iter3's 2h+ that died at PGO collect step — comparable wall time).
Result:
- pkg: `firefox-150.0.1-1.1-aarch64.pkg.tar.xz`, **68.7 MB** (sha256 `aa94c7290ee7be76...`)
- libxul.so: 169 MB stripped (21× smaller than iter3's 3.6 GB PGO-instrumented)
Installed via `pacman -U` on ohm replacing stock firefox 150.0.1-1.
Phase 7G test (35s autonomous run, no `MOZ_DISABLE_RDD_SANDBOX=1`):
- ENETDOWN: 0 (iter3 sandbox patch holds in release build)
- EINVAL: 0 (iter4 frame-11 fix holds)
- RDD ProcessDecode events: 538
- Stream mTime reached: 22.3s in 35s wall = **0.64× realtime**, **~2.7× speedup over PGO-instrumented binary**
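The headline ratios for this track follow directly from the measured figures. A quick arithmetic check, taking 3.6 GB as 3600 MB (an assumption about the rounding used):

```python
stream_seconds = 22.3   # stream mTime reached
wall_seconds = 35.0     # wall-clock length of the autonomous run
realtime_ratio = stream_seconds / wall_seconds

pgo_libxul_mb = 3600.0      # iter3 PGO-instrumented libxul.so, 3.6 GB taken as 3600 MB
release_libxul_mb = 169.0   # iter5 stripped libxul.so
size_ratio = pgo_libxul_mb / release_libxul_mb

print(f"{realtime_ratio:.2f}x realtime, libxul {size_ratio:.0f}x smaller")
```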
## What landed
### Fork commits (libva-v4l2-request-fourier)
iter5 sweep + multi-context fix:
- `848fc0c` — remove iter3+iter4 Y2 instrumentation from v4l2.c (-54)
- `39498f0` — remove iter4 DPB census from h264.c (-31)
- `951233a` — remove iter1 ENTER traces (4 files, -17 across 13 sites)
- `d3a299b` — remove iter1 patch-0010 hex-dumps + patch-0011 sentinel (-81)
- `843febc` — remove iter1 slice_header / VAPicture dumps + Sync RETURN trace, suppress EACCES per-frame log (-49)
- `b993355` — Track E: LAST_OUTPUT_* per-driver_data
- `c8b6ede` — Phase 5 follow-up: 3 surface.c debug sites + h264.c readback block (-107)
Net: ~339 lines removed, ~52 lines added (Track E plumbing). Driver source builds clean and per-frame log noise is essentially zero (1 line per 2000-frame run).
### Campaign artifacts (libva-multiplanar)
- `phase0_findings_iter5.md` — substrate (8 candidates, locked A+G+B+E)
- `phase4_iter5_plan.md` — Phase 4 plan + execution + Phase 5 caveat resolutions + Phase 7 anchored evidence
- `phase8_iteration5_close.md` — this file
- `~/src/panvk-bifrost/README.md` — chartered as separate top-level future campaign (sequenced after fourier-fresnel)
### Build infrastructure
- firefox-fourier LXD container on boltzmann remains persistent. The PKGBUILD now has the iter5 PGO-disabled edit applied (the source extracted under `src/firefox-150.0.1/` is the iter4 state with iter3 patches; iter5 reused that). Future Firefox rebuilds can `cd src/firefox-150.0.1 && ./mach build` for incremental.
## State that carries to iter6 (or campaign close)
- **Hardware**: ohm RK3568 hantro G1/G2, kernel 6.19.10. Access: `ohm` (LAN; `ohm.vpn` also works).
- **Userspace**: firefox 150.0.1-1.1 (iter5 PGO-disabled fourier rebuild), libva 2.23.0, mesa 26.0.5, libdrm 2.4.131, mpv 0.41.0-3.
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `4bed52ec5d44b389...` (iter5-end, post-cleanup).
- **Test fixture**: bbb_1080p30_h264.mp4 sha256 `dcf8a7170fbd...`.
- **Build container**: firefox-fourier LXD on boltzmann, persistent.
## Documented limitations carried to iter6+ (or campaign close)
1. **Cap_pool resolution-change race** (Phase 5 sonnet C4). mpv's libplacebo Vulkan-fallback path triggers it; mpv recovers via SW fallback (no segfault), but the race exists. Fix: drain CAPTURE properly before issuing REQBUFS(0) on resolution change in `CreateSurfaces2`. ~30 lines.
2. **No pixel-correctness verification post-msync-removal** (Phase 5 sonnet C3). Probably safe (kernel does DMA sync at DQBUF level on this CMA-backed config). A frame-hash spot check would confirm it formally.
3. **Vulkan unavailable on PineTab2** (`reference_pinetab_no_vulkan.md`). Out of campaign scope; consumers fall through to GLES via Panfrost.
4. **Sub-second concurrent libva init still races on /dev/video1** (Track E test passed only with 2s stagger). Cross-process serialization is out of scope for a libva backend.
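The frame-hash spot check mentioned in item 2 could look roughly like the sketch below. It assumes two raw 8-bit 4:2:0 dumps of the same fixture (one from the HW path, one from a SW decode) and that both paths are expected to produce identical pixels. The function names are hypothetical, and how the dumps are produced is left open.

```python
import hashlib
from typing import List, Optional


def frame_hashes(raw: bytes, width: int, height: int) -> List[str]:
    """Split a raw 8-bit 4:2:0 dump into frames and hash each one."""
    # I420 and NV12 both occupy width*height*3/2 bytes per frame.
    frame_size = width * height * 3 // 2
    if frame_size == 0 or len(raw) % frame_size != 0:
        raise ValueError("dump is not a whole number of frames")
    return [
        hashlib.sha256(raw[off:off + frame_size]).hexdigest()
        for off in range(0, len(raw), frame_size)
    ]


def first_mismatch(hw: List[str], sw: List[str]) -> Optional[int]:
    """Return the index of the first differing frame, or None if all match."""
    for index, (a, b) in enumerate(zip(hw, sw)):
        if a != b:
            return index
    if len(hw) != len(sw):
        return min(len(hw), len(sw))  # one dump ends early
    return None


# Synthetic demo: two 4x2 frames, the second one corrupted on the "HW" side.
sw_dump = bytes(range(24))
hw_dump = sw_dump[:12] + bytes(12)
mismatch = first_mismatch(frame_hashes(hw_dump, 4, 2), frame_hashes(sw_dump, 4, 2))
```

A result of `None` across a few hundred frames would be the formal anchor; any index points at the first frame worth inspecting.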
## Lessons distilled to memory
No new memory entries this iteration — the iter5 work was instrumentation cleanup + targeted multi-context fix, no new diagnostic patterns surfaced. Existing memory entries from iter3+iter4 cover the operative discoveries (kernel obfuscation, request_fd lifecycle, FFmpeg as authority, sandbox seccomp, ALARM-stale wasi, firefox-fourier container, follow-on campaigns).
The Phase 5 review caveat (sweep-completion verification needs to exercise EVERY consumer code path, not just the most common one) could become a feedback memory ("re-test post-sweep with each consumer pattern, not just one"), but it is already covered implicitly by `feedback_dev_process.md`'s Phase 7 verification discipline.
## Bootlin upstream outlook
iter5 shifts the fork toward upstream-readiness. Per `feedback_no_upstream.md`, no PR/MR happens without explicit operator instruction. But the clean state is now:
- Driver source builds with zero non-error `request_log` calls.
- Process-global mutable state eliminated (`LAST_OUTPUT_*` moved to per-driver_data; `readback_warned` removed entirely).
- Track A's frame-11 EINVAL fix from iter4 is in place (fresh request_fd per frame, DPB FFmpeg-semantics matching, B-slice L1 reflist .fields).
- Track F's Firefox sandbox patch from iter3 is documented in campaign repo.
- Track E's per-context state isolation is in.
Outstanding for upstream-readiness: cap_pool race fix (~30 lines for iter6), msync pixel-verification, possibly a multi-codec audit (MPEG-2 was iter1 lock's "next codec"; never opened).
## Phase 1 success criterion — final per track
- **Track A:** "Driver builds clean with zero `request_log()` calls in non-error paths, all iter1+iter3+iter4 DEBUG commits removed (or explicitly justified-and-kept), vaapi-copy + mpv smoke tests still green at 2000+ frames clean." ✓ HIT (2000 frames, 0 EINVAL, 1 log line).
- **Track G:** "Firefox-fourier rebuilt without `--enable-profile-generate=cross`, redeployed to ohm. firefox --version reports Mozilla Firefox 150.0.1. Resulting libxul.so is materially smaller than the 3.6 GB instrumented build." ✓ HIT (169 MB libxul, 21× smaller).
- **Track E:** "Two concurrent mpv processes on different bbb fixtures decode independently with no cross-context state corruption." ✓ HIT (with 2s stagger).
- **Track B:** "≥30s of bbb_1080p30 without segfault — OR root cause documented as upstream issue with operator-actionable workaround." ✓ HIT (31s stream pos, 0 segfaults; mpv handles cap_pool race via SW fallback gracefully; cap_pool race documented as iter6+ candidate).
**Joint success:** all four tracks independently verifiable on the same iter5-end driver build (sha256 `4bed52ec5d44b389...`). Phase 7 verified each. Phase 5 sonnet review caveats addressed. iter5 closes GREEN.
---
## Iter5 amendment (2026-05-05, same-day post-close) — Track F seccomp gap
**Trigger.** First real-world use of the iter5-G Firefox binary on ohm — playing a YouTube avc1 stream with `LIBVA_DRIVER_NAME=v4l2_request` env vars + sandbox enabled — emitted seccomp violations:
```
Sandbox: seccomp sandbox violation: pid <N>, tid <N>, syscall 29, ...
```
The ioctl request code in the violation, `0x80047C05`, decodes to `_IOR('|', 0x05, int)` = `MEDIA_IOC_REQUEST_ALLOC` (syscall 29 is `ioctl` on aarch64). Three distinct gaps surfaced:
### Gap 1: Track G "GREEN" verdict was structural-only
The Phase 7G test measured "0 ENETDOWN, 0 EINVAL, 538 RDD ProcessDecode events, 22.3s mTime in 35s wall" on a `--headless --screenshot` autonomous run with synthetic playback — but never confirmed VAAPI bytes actually flowed through `/dev/video1`. The seccomp filter was returning `ENOSYS` to Firefox's media stack silently (`SECCOMP_RET_ERRNO`, not `SIGSYS`), so Firefox fell back to SW decode and the autonomous test couldn't tell the difference.
### Gap 2: Patch-sync drift between campaign repo and container
Campaign repo's `firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` (113 lines, broker + RDD seccomp) had drifted from the container's `/build/aur/firefox-fourier/0005-rdd-allow-stateless-v4l2-request-api.patch` (84 lines, broker only). The iter5-G build used the stale 84-line patch — the iter3 RDD seccomp fix never made it into the iter5-G binary at all.
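Drift of this kind is cheap to detect before a build. A minimal sketch, assuming both patch files are readable from one place; the file names in the demo are placeholders, and `patch_fingerprint` / `patches_in_sync` are hypothetical helpers rather than part of the campaign tooling:

```python
import hashlib
import tempfile
from pathlib import Path


def patch_fingerprint(path: Path) -> tuple:
    """Return (line count, sha256) so drift is visible at a glance."""
    data = path.read_bytes()
    return (len(data.splitlines()), hashlib.sha256(data).hexdigest())


def patches_in_sync(campaign: Path, container: Path) -> bool:
    """True only when both copies are byte-identical."""
    return patch_fingerprint(campaign) == patch_fingerprint(container)


# Demo with throwaway files: the container copy lacks the seccomp hunk.
with tempfile.TemporaryDirectory() as tmp:
    campaign = Path(tmp) / "0001-example.patch"
    container = Path(tmp) / "0005-example.patch"
    campaign.write_text("broker hunk\nseccomp hunk\n")
    container.write_text("broker hunk\n")
    drifted = not patches_in_sync(campaign, container)
    container.write_text(campaign.read_text())  # sync back
    synced = patches_in_sync(campaign, container)
```

Comparing content rather than file names sidesteps the `0001-...` versus `0005-...` numbering mismatch.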
### Gap 3: VAAPI work moved from RDD to Utility process
Even with the RDD seccomp restored, FF150 routes VAAPI decode through the **Utility process**, not RDD. `UtilitySandboxPolicy::EvaluateSyscall` falls through to `SandboxPolicyCommon` for `__NR_ioctl`, which doesn't allow `'|'` (or `'V'`, or `'d'`, or `'b'`). The iter3 patch needed extending to mirror the RDD ioctl allowlist into Utility.
### Amendment patch (combined, 160 lines)
`firefox-fourier/0001-rdd-allow-stateless-v4l2-request-api.patch` regenerated as combined broker + RDD-seccomp + Utility-seccomp:
- Broker: 3 hunks (cap-filter widen + `AddV4l2RequestApiDependencies` + RDD wire-in) — unchanged from prior 84-line container patch.
- RDD seccomp: 2 hunks (`kMediaType` constant + `.ElseIf` allow) — restored from campaign 113-line patch.
- Utility seccomp: 1 hunk (entire `case __NR_ioctl:` override mirroring RDD's allowlist) — **new in iter5 amendment**.
Authored as `claude-noether <claude@reauktion.de>`.
### Build approach — incremental, ~2:22 wall
Per operator's standard flow: edit src/ in place, `makepkg -e --skippgpcheck`. Container src tree from iter5-G already had broker hunks applied; only the seccomp hunks needed manual application. Then `makepkg -e` skipped extract+prepare and re-ran build/package on existing object cache.
Result:
- pkg: `firefox-150.0.1-1.1-aarch64.pkg.tar.xz` (65.5 MB, sha256 `675bbc7dd0a187ee8baefef5c3b36d15c64919cd80edf725e29c23f2675ed4a8`)
- Build wall time: 19:01:00 → 19:03:22 = **~2m22s** (vs. ~2h27m from scratch for iter5-G)
- Saved: ~2h25m by reusing existing ccache + object tree
Backed up prior pkg as `firefox-150.0.1-1.1-aarch64.pkg.tar.xz.iter5g`.
### Verification (pending — deploy to ohm)
Definitive HW-decode test (couldn't run pre-amendment because seccomp blocked the path silently):
```
LIBVA_DRIVER_NAME=v4l2_request \
LIBVA_V4L2_REQUEST_VIDEO_PATH=/dev/video1 \
LIBVA_V4L2_REQUEST_MEDIA_PATH=/dev/media0 \
firefox <video-url> &
sleep 8
sudo lsof /dev/video1 /dev/media0
# expect: a Firefox-spawned Utility process holding both nodes
```
If `lsof` shows the Utility process on `/dev/video1` during avc1 playback → Track F closes properly. If empty → another sandbox layer to chase.
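The same ownership check can be scripted by walking `/proc/<pid>/fd`, which is the information `lsof` reports. This is a sketch under the assumption of a Linux `/proc` layout; `device_holders` and its `proc_root` parameter are hypothetical names, and seeing other users' processes needs the same privileges as `sudo lsof`.

```python
import os
import tempfile
from pathlib import Path


def device_holders(devices, proc_root="/proc"):
    """Map pid -> set of the given device paths that the process has open."""
    wanted = set(devices)
    holders = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # process exited, or its fd table is not readable
        for fd in fds:
            try:
                target = os.readlink(fd)
            except OSError:
                continue  # fd closed between listing and readlink
            if target in wanted:
                holders.setdefault(int(entry.name), set()).add(target)
    return holders


# Demo against a fake /proc tree so the sketch runs without the hardware.
with tempfile.TemporaryDirectory() as fake_proc:
    fd_dir = Path(fake_proc) / "4242" / "fd"
    fd_dir.mkdir(parents=True)
    os.symlink("/dev/video1", fd_dir / "7")
    os.symlink("/dev/media0", fd_dir / "8")
    os.symlink("/dev/null", fd_dir / "0")
    (Path(fake_proc) / "self").mkdir()
    found = device_holders(["/dev/video1", "/dev/media0"], proc_root=fake_proc)
```

On ohm, an empty result during avc1 playback would correspond to the "another sandbox layer to chase" branch above.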
### Lessons captured
The Track G "GREEN" verdict was right by its own success criterion ("libxul materially smaller, firefox --version works") but the criterion didn't include "actually does HW decode in production use." Lesson for Phase 1 lock: criteria for sandbox/release-build tracks should include **end-user playback verification with `lsof` proof of device handle ownership**, not just structural metrics.
Memory: no new entry needed. `feedback_seccomp_silent_enosys.md` already covers the silent-ENOSYS pattern; this episode reinforces it. The patch-sync drift between campaign repo and container is a process bug. It is fixed for now by syncing the combined patch back to the container in this amendment, but the underlying mismatch (PKGBUILD numbers patches `0005-...` while the campaign uses `0001-...`) remains. Tolerable for a personal-machine campaign; would need normalizing for upstream.
### Status post-amendment
| Element | State |
|---|---|
| Combined 160-line patch | Authored, in campaign repo + container 0005-... |
| Container src/ tree | Broker (from iter5-G prepare) + seccomp (manually patched in iter5 amend) |
| Built pkg | `675bbc7d…` on boltzmann:/tmp/ |
| ohm deploy | Done via temporary HTTP server on boltzmann:18080 (vpn route was closed; ohm has lan route to boltzmann.fritz.box). `pacman -U` confirmed Build Date 21:01 CEST = 19:01 UTC = amendment build. `libmozsandbox.so` sha `4e6c7d58bc2220dbdf6ad817ee70fa77fc85e618ffd49ebdabb833f416dc3076`, size 187536 (vs prior 185712, +1824 bytes for the new seccomp code). |
| HW-decode verification | **Track F GREEN.** Played `bbb_1080p30_h264.mp4` via Firefox 150 with `LIBVA_DRIVER_NAME=v4l2_request` and full sandbox. Log: `cap_pool_init: 24 slots ready` then `Unable to queue buffer: Invalid argument`; **no seccomp violation, no ENOSYS on `MEDIA_IOC_REQUEST_ALLOC`**. The seccomp gate is fully passed; the remaining EINVAL is post-sandbox, driver-level. |
| Campaign git commit | `d2d9107` (patch + docs) pushed to Gitea pre-verification. |
### Track F closes — definitive evidence
Pre-amendment failure pattern (iter5-G with broker-only patch):
```
v4l2-request: cap_pool_init: 24 slots ready
[PID] Sandbox: seccomp sandbox violation: pid PID, syscall 29, args FD 0x80047C05 ...
v4l2-request: Unable to allocate media request: Function not implemented
```
Post-amendment pattern (iter5-amend):
```
v4l2-request: cap_pool_init: 24 slots ready
v4l2-request: Unable to queue buffer: Invalid argument
```
The disappearance of "syscall 29 ... 0x80047C05" and "Function not implemented" in the post-amendment log is the closing proof: `MEDIA_IOC_REQUEST_ALLOC` (`_IOR('|', 0x05, int)`) now reaches the kernel from the Utility process. Track F is **GREEN**.
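The request code from the violation line can be decoded by hand using the generic Linux ioctl bit layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction), which is the layout aarch64 uses. A minimal sketch; `decode_ioctl` is an illustrative helper, not part of any tool used in the campaign:

```python
# Field widths and shifts of the generic Linux ioctl encoding.
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS = 8, 8, 14
_IOC_TYPESHIFT = _IOC_NRBITS
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS
_DIR_MACROS = {0: "_IO", 1: "_IOW", 2: "_IOR", 3: "_IOWR"}


def decode_ioctl(request: int) -> dict:
    """Split an ioctl request code into macro, type char, number, and size."""
    return {
        "macro": _DIR_MACROS[(request >> _IOC_DIRSHIFT) & 0x3],
        "type": chr((request >> _IOC_TYPESHIFT) & 0xFF),
        "nr": request & 0xFF,
        "size": (request >> _IOC_SIZESHIFT) & 0x3FFF,
    }


decoded = decode_ioctl(0x80047C05)
print(decoded)
```

The decoded fields are macro `_IOR`, type `'|'`, number 5, and size 4 (one `int`), matching `_IOR('|', 0x05, int)`.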
YouTube (`watch?v=7DAPd5MGodY`) played without engaging the v4l2_request driver at all — no `cap_pool_init`, no `/dev/video1` activity. Likely cause: YouTube negotiated VP9 or AV1 with FF150 (no `h264ify` extension installed); v4l2_request only handles H.264, so libva isn't dispatched. This is unrelated to Track F — a codec-negotiation issue, not a sandbox issue. The H.264-only fixture (`bbb_1080p30_h264.mp4` direct file URL) bypasses YouTube's codec negotiation and triggers v4l2_request, which is how the sandbox close was demonstrated.
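The three outcomes described in this section (sandbox-blocked, sandbox passed with a QBUF EINVAL, driver never engaged) can be told apart from the log text alone. A rough classifier sketch; the category labels are made up for illustration, and the match strings are taken from the log excerpts above:

```python
def classify_log(text: str) -> str:
    """Classify a Firefox + v4l2-request log excerpt into one of four states."""
    if "cap_pool_init" not in text:
        return "driver-not-engaged"  # e.g. VP9/AV1 negotiated, libva never dispatched
    if "seccomp sandbox violation" in text or "Function not implemented" in text:
        return "sandbox-blocked"
    if "Unable to queue buffer: Invalid argument" in text:
        return "sandbox-passed-qbuf-einval"
    return "sandbox-passed"


pre_amendment = (
    "v4l2-request: cap_pool_init: 24 slots ready\n"
    "[PID] Sandbox: seccomp sandbox violation: pid PID, syscall 29, args FD 0x80047C05 ...\n"
    "v4l2-request: Unable to allocate media request: Function not implemented\n"
)
post_amendment = (
    "v4l2-request: cap_pool_init: 24 slots ready\n"
    "v4l2-request: Unable to queue buffer: Invalid argument\n"
)
states = [classify_log(pre_amendment), classify_log(post_amendment), classify_log("")]
```

The sandbox checks run before the EINVAL check on purpose: a blocked `MEDIA_IOC_REQUEST_ALLOC` makes any later driver error uninformative.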
### New iter6 candidate — Firefox VIDIOC_QBUF EINVAL on first frame
A driver-level issue surfaced post-sandbox-fix: with the v4l2_request driver loaded (cap_pool_init succeeds), Firefox's libva path issues a single `VIDIOC_QBUF` that returns EINVAL on what appears to be the first frame. `mpv --hwdec=vaapi-copy` decoded 2000 frames clean on the same iter5-end driver build (sha `4bed52ec5d44b389…`); only Firefox triggers this. Likely consumer-specific: Firefox's media stack feeds the driver differently than mpv (different VAEncodedSlice ordering, VAImage usage pattern, surface handle lifecycle, etc.).
Logged as iter6 candidate. Track A's frame-11 EINVAL fix was about per-frame request_fd lifecycle and DPB FFmpeg-semantics matching; this looks earlier (init phase, not frame 11), so it likely has a different root cause. Diagnosis would start with `strace` on the Firefox Utility process during the initial QBUF: capture the exact `v4l2_buffer` struct payload and compare it against what mpv sends.
This does not retract the iter5 sweep's GREEN verdict — the per-consumer divergence reinforces the Phase 5 sonnet caveat ("sweep-completion verification needs to exercise EVERY consumer code path") that's already documented as a process lesson.