Iteration 3 Phase 0: substrate doc with 7 candidate questions
Carries iter2 close state forward. Lists 7 candidate research questions (A frame-11 EINVAL, B DEBUG sweep, C perf binding cell, D multi-context safety, E V4L2_MEMORY_DMABUF as Option B for true DMA-BUF lifecycle fix, F Firefox RDD sandbox, G Sonnet 7.x carryovers) plus recommended pairings.

Candidate F enriched with Sonnet web research: NO existing Mozilla bug covers V4L2-stateless request-API; Bug 1833354 (FF116) is V4L2-M2M only. Sandbox allowlist in SandboxBrokerPolicyFactory.cpp::GetRDDPolicy() + AddV4l2Dependencies() filters by V4L2_CAP_VIDEO_M2M, explicitly excluding Hantro stateless. /dev/media* completely absent from allowlist. Patch needed is small in scope (~30 lines, 2 functions, 1 file) following the renderD128 broker pattern. Real Mozilla patch needed, not env-var-only.

Stop point at Phase 1 lock: user picks among A..G.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Iteration 3 — Phase 0 (substrate / motivation / inventory)

Opens 2026-05-04, immediately after the iteration 2 close (`phase8_iteration2_close.md`, fork commit `19acc76`, campaign close commit `c36c61e`).

## Predecessor close-out summary (iteration 2 → iteration 3)

iter2 was hardening, not new feature work. Three independent fixes landed:
- Fix 1 (`06beef6`): multi-resolution session-state corruption
- Fix 2 (`e64bb08`): Mesa WSI rejection of non-64-aligned pitch
- Fix 3 (`19acc76`): DMA-BUF lifecycle race, fixed with a decoupled CAPTURE buffer pool with LRU recycling (the load-bearing fix)

iter2 Phase 1 lock met: `mpv --hwdec=vaapi --vo=gpu` plays bbb_1080p30 smoothly per operator inspection.

iter2 Phase 7 verification:
- vaapi-copy 200 frames: 0 drops, real luma gradient, pool LRU visibly recycling ✓
- vaapi `--vo=gpu` real-VO: smooth ✓
- Firefox 150: engages our libva backend (cap_pool architecture confirmed working with surface ID recycling), decodes 10 frames cleanly through hantro, then hits EINVAL on frame 11 (iter1 carryover, Sonnet 7.x family). Requires `MOZ_DISABLE_RDD_SANDBOX=1` because Firefox now routes VAAPI through the RDD process instead of the utility process (changed since iter1's test).

## Iteration 3 candidate research questions (UNLOCKED, user picks)

Multiple substantive items survive iter2 close. Each could anchor an entire iteration; some pair naturally.

### A. Frame-11 EINVAL (Sonnet 7.x carryover, decode correctness)

> Identify which V4L2 control returns EINVAL on the 11th decoded frame in Firefox (and likely also under specific mpv stream-shape patterns), and fix.

**Why first**: it is the load-bearing remaining defect. Firefox decodes 10 frames cleanly, then HW decode terminates. Over a 30s+ session the EINVAL likely recurs across surface-recycle cycles. The `Unable to set control(s): Invalid argument` log from iter2 Phase 7 narrows the suspect set to the per-request controls (DECODE_PARAMS, SCALING_MATRIX, SPS, PPS) for slice_type=1 / non-IDR mid-stream frames. Sonnet review 7.5 (mid-stream non-IDR) and 7.2 (`num_ref_idx_l0/l1` for multi-slice) are the named hypotheses. The first concrete investigation step is reading `hantro_g1_h264_dec.c` to see which control fields it validates.

**Risk**: may surface a need for SPS/PPS state tracking across requests, which does not exist in our backend today.
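
As a first triage step for A, the failing ioctl can be pulled out of an `strace` capture mechanically. The following is a minimal sketch, not existing tooling: the function name is hypothetical, and the sample lines are invented to approximate strace's output format rather than copied from a log on ohm.

```python
import re
from collections import Counter

# Matches completed calls such as:
#   [pid  4242] ioctl(27, VIDIOC_S_EXT_CTRLS, {...}) = -1 EINVAL (Invalid argument)
# Lines split by strace into "<unfinished ...>" / "resumed" halves are not handled.
IOCTL_RE = re.compile(
    r"ioctl\((?P<fd>\d+),\s*(?P<req>[A-Za-z0-9_]+).*\)"
    r"\s*=\s*(?P<ret>-?\d+)(?:\s+(?P<errno>E[A-Z]+))?"
)


def einval_ioctls(lines):
    """Count ioctl request names whose call returned -1 EINVAL."""
    hits = Counter()
    for line in lines:
        m = IOCTL_RE.search(line)
        if m and m.group("ret") == "-1" and m.group("errno") == "EINVAL":
            hits[m.group("req")] += 1
    return hits


# Invented sample lines (NOT real output from the rig).
SAMPLE = [
    "[pid  4242] ioctl(27, VIDIOC_S_EXT_CTRLS, {which=V4L2_CTRL_WHICH_REQUEST_VAL, count=4}) = 0",
    "[pid  4242] ioctl(31, MEDIA_REQUEST_IOC_QUEUE) = 0",
    "[pid  4242] ioctl(27, VIDIOC_S_EXT_CTRLS, {which=V4L2_CTRL_WHICH_REQUEST_VAL, count=4}) = -1 EINVAL (Invalid argument)",
]

if __name__ == "__main__":
    for req, n in einval_ioctls(SAMPLE).most_common():
        print(f"{req}: {n} EINVAL")
```

Fed with the output of the `strace -f -e trace=openat,close,ioctl` capture from the tooling inventory, this only says which ioctl fails; identifying the offending control field still requires reading the kernel side.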

### B. DEBUG instrumentation sweep

> Remove the 0010 / 0011 / 0014 + ENTER logging + sentinel write + msync workaround commit-by-commit, building cleanly between each removal. End state: zero `request_log()` calls in non-error paths, no patch-0011 sentinel write in `EndPicture`, and the msync either deleted or documented with the reason it stays.

**Why**: required prerequisite for any upstream snapshot. The iter1 Phase 5 review had this on the to-do list and iter2 explicitly deferred it. Smaller scope than A; could pair with A or run standalone.

### C. Performance binding cell (deferred from iter1, named-deferred in iter2)

> Establish a measurement protocol for HW vs SW decode on this rig: drop counts, effective FPS, browser CPU%, scanout-plane residency for {mpv vaapi DMA-BUF, mpv vaapi-copy, Firefox HW (sandbox-bypassed), SW baseline}. Anchor results in an in-session evidence dir.

**Why**: anchors all iter1+iter2 claims to numbers. The iter1 close noted "Performance numbers: drop counts ... not anchors — re-measure in iteration 2 with consistent rig" but iter2 was hardening-only. This is the natural perf iteration.

**Risk**: lots of fixture work for one binding cell.
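
One way to keep the four decode paths comparable is to reduce every run to the same derived numbers. The sketch below is a hypothetical record layout, not an existing harness: all field names are assumptions, and the demo values are made up purely to show the output shape.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class RunSample:
    """One measured playback run (field names are hypothetical)."""
    path: str                 # e.g. "mpv vaapi-copy"
    frames_shown: int         # frames actually presented
    frames_dropped: int       # drops reported by the player
    wall_seconds: float       # wall-clock duration of the run
    cpu_percent: list = field(default_factory=list)  # per-second CPU% samples

    def effective_fps(self) -> float:
        return self.frames_shown / self.wall_seconds

    def drop_ratio(self) -> float:
        total = self.frames_shown + self.frames_dropped
        return self.frames_dropped / total if total else 0.0

    def mean_cpu(self) -> float:
        return mean(self.cpu_percent) if self.cpu_percent else 0.0


def table_row(s: RunSample) -> str:
    """Format one run as a Markdown table row."""
    return (f"| {s.path} | {s.effective_fps():.2f} | "
            f"{100 * s.drop_ratio():.1f}% | {s.mean_cpu():.1f} |")


if __name__ == "__main__":
    # Made-up numbers, only to show the table shape.
    demo = RunSample("demo path", 300, 0, 10.0, [40.0, 42.0, 44.0])
    print("| path | eff. FPS | drops | mean CPU% |")
    print("|---|---|---|---|")
    print(table_row(demo))
```

Scanout-plane residency has no obvious scalar here and would need its own column once a way to query it is settled.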

### D. Multi-context libva safety (Sonnet review 9.6)

> Make the backend safe for two concurrent libva contexts in the same process (e.g. Firefox tab playing one video while another tab plays a different resolution). Today `LAST_OUTPUT_WIDTH/HEIGHT` is a process-global static, and `cap_pool` is per-`driver_data` while the V4L2 device is shared.

**Why**: iter2 documented this as deferred; the architectural surgery is similar to Fix 3 (per-context pools, per-context format cache). One real consumer might surface this, for example a Firefox tab and mpv running together.

### E. V4L2_MEMORY_DMABUF (the iter2 plan's Option B, the true architectural fix for DMA-BUF lifecycle)

> Replace V4L2_MEMORY_MMAP with userspace dma-buf allocation: userspace allocates buffers via gbm/dumb-buffer and hands them to V4L2 via VIDIOC_QBUF with memory type `V4L2_MEMORY_DMABUF`. EXPBUF is no longer needed; lifetime is unambiguous because userspace owns the buffer.

**Why**: iter2 Fix 3 is statistical (Option A): adding more buffers + LRU narrows the race window but doesn't close it. Option B closes it. The kernel-side test surface is significant: does hantro on this kernel actually accept DMABUF memory? GStreamer's v4l2slh264dec uses MMAP, so DMABUF on hantro may not be tested upstream.

**Risk**: highest unknown of any candidate; possibly requires kernel work.
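
A cheap first probe for E is to ask the driver whether the CAPTURE queue accepts DMABUF memory at all, using `VIDIOC_REQBUFS` with `count=0`. The sketch below is exploratory and makes assumptions: the struct layout and constants follow `<linux/videodev2.h>`, the ioctl number uses the generic encoding (x86, ARM, arm64), and `/dev/video1` is taken from this rig. A positive answer only shows the queue advertises DMABUF; it says nothing about whether decode into imported buffers works end to end.

```python
import fcntl
import os
import struct

# Constants from <linux/videodev2.h>.
V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE = 9
V4L2_MEMORY_DMABUF = 4
V4L2_BUF_CAP_SUPPORTS_DMABUF = 1 << 2

# struct v4l2_requestbuffers: count, type, memory, capabilities (4 x u32),
# then flags (u8) and reserved[3] (u8): 20 bytes in total.
REQBUFS_FMT = "=IIIIB3x"


def _iowr(type_char: str, nr: int, size: int) -> int:
    """_IOWR() for the generic ioctl number layout (x86, ARM, arm64)."""
    return (3 << 30) | (size << 16) | (ord(type_char) << 8) | nr


VIDIOC_REQBUFS = _iowr("V", 8, struct.calcsize(REQBUFS_FMT))


def pack_reqbufs(count: int, buf_type: int, memory: int) -> bytearray:
    return bytearray(struct.pack(REQBUFS_FMT, count, buf_type, memory, 0, 0))


def reports_dmabuf(raw: bytes) -> bool:
    """True if the capabilities field filled in by the driver has the DMABUF bit."""
    _count, _type, _memory, caps, _flags = struct.unpack(REQBUFS_FMT, raw)
    return bool(caps & V4L2_BUF_CAP_SUPPORTS_DMABUF)


def probe(path: str) -> bool:
    """Probe DMABUF support on the CAPTURE queue without allocating buffers.

    count=0 releases any buffers owned by this file handle, so only call this
    on a freshly opened fd. Raises OSError if the driver rejects the request.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        req = pack_reqbufs(0, V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE, V4L2_MEMORY_DMABUF)
        fcntl.ioctl(fd, VIDIOC_REQBUFS, req)  # the kernel updates req in place
        return reports_dmabuf(bytes(req))
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(probe("/dev/video1"))
```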

### F. Firefox RDD sandbox vs `/dev/media0` (backlog from iter2 close; UPDATED with web research 2026-05-04)

> File a Mozilla Bugzilla report ("RDD sandbox: allow /dev/media* and V4L2-stateless nodes for request-API hardware decoders") referencing Bug 1833354 as the V4L2-M2M precedent, and propose a small patch.

**Sonnet web research findings (2026-05-04):**
- NO existing Mozilla bug covers `/dev/media*` or the V4L2-stateless request-API path. We'd be filing the first.
- Closest existing work is Bug 1833354 (FF116, V4L2-M2M only) and Bug 1965646 (FF141, extends M2M codecs). Neither addresses the request-API.
- Allowlist is in `security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp::GetRDDPolicy()`. It calls `AddV4l2Dependencies()`, which enumerates `/dev/video*` and **filters by `V4L2_CAP_VIDEO_M2M | V4L2_CAP_VIDEO_M2M_MPLANE`**. Hantro stateless uses `CAPTURE_MPLANE + OUTPUT_MPLANE + request-API`, which this filter explicitly excludes. `/dev/media*` is completely absent (no `AddPath` for it anywhere).
- Architecture is broker-on-IPC (same model as `renderD128`): when the sandboxed RDD calls `open()`, seccomp intercepts and forwards the call over a socket to the broker thread in the parent process, which checks path policy and opens on RDD's behalf. **No broker redesign needed**, just two functions in one file, plus a possible check in a second:
  1. `GetRDDPolicy()`: add `policy->AddPath(rdwr, "/dev/media0")` (or glob)
  2. `AddV4l2Dependencies()`: extend cap filter to also admit stateless V4L2 nodes (or add a new `AddV4l2RequestDependencies()`)
  3. Possibly `SandboxFilter.cpp`: verify the `MEDIA_REQUEST_IOC_QUEUE` ioctl (`_IO('|', 0x80)`, i.e. type `'|'` = 0x7c, number 0x80) is allowed
- Verdict: **real Mozilla patch needed, small in scope**. NOT a documentation-only fix.

**Why**: highest-leverage upstream contribution of any iter3 candidate. A ~30-line patch upstream removes the env-var requirement for every V4L2-stateless user, including future Rockchip/Hantro/Cedrus/Allwinner targets. Cross-verification on Intel/NVIDIA test boxes (meitner / clevo) is no longer needed for the diagnosis (the cap filter and the missing `/dev/media*` entry mechanically explain the failure), but might be useful to smoke-confirm that the same env behaves identically on a working VAAPI box before filing.

**Sources** (Sonnet's references):
- [Bug 1833354: Implement V4L2-M2M HW decode](https://bugzilla.mozilla.org/show_bug.cgi?id=1833354)
- [Bug 1683808: Make VAAPI work in RDD](https://bugzilla.mozilla.org/show_bug.cgi?id=1683808)
- [Bug 1751363: VAAPI snapshot fails due to RDD sandbox](https://bugzilla.mozilla.org/show_bug.cgi?id=1751363)
- [SandboxBrokerPolicyFactory.cpp searchfox](https://searchfox.org/mozilla-central/source/security/sandbox/linux/broker/SandboxBrokerPolicyFactory.cpp)
- [SandboxFilter.cpp searchfox](https://searchfox.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp)
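
Before filing, the two mechanical claims in the findings for F (which capability bits the hantro node reports, and the numeric value of the request ioctl) can be checked on the rig itself. The sketch below is a local probe under stated assumptions: constants and struct layout follow `<linux/videodev2.h>` and `<linux/media.h>`, ioctl numbers use the generic encoding (x86, ARM, arm64), and `would_pass_m2m_filter` is a hypothetical stand-in that mirrors the filter as described in the findings, not Firefox's actual code.

```python
import fcntl
import os
import struct

# Capability bits from <linux/videodev2.h>.
CAPS = {
    "VIDEO_CAPTURE_MPLANE": 0x00001000,
    "VIDEO_OUTPUT_MPLANE": 0x00002000,
    "VIDEO_M2M_MPLANE": 0x00004000,
    "VIDEO_M2M": 0x00008000,
    "STREAMING": 0x04000000,
    "DEVICE_CAPS": 0x80000000,
}

# struct v4l2_capability: driver[16], card[32], bus_info[32],
# version, capabilities, device_caps, reserved[3] (all u32): 104 bytes.
QUERYCAP_FMT = "=16s32s32sIII12x"


def _ioc(direction: int, type_char: str, nr: int, size: int) -> int:
    """Generic ioctl number layout (x86, ARM, arm64)."""
    return (direction << 30) | (size << 16) | (ord(type_char) << 8) | nr


VIDIOC_QUERYCAP = _ioc(2, "V", 0, struct.calcsize(QUERYCAP_FMT))  # _IOR
MEDIA_REQUEST_IOC_QUEUE = _ioc(0, "|", 0x80, 0)                   # _IO


def decode_caps(raw: bytes) -> dict:
    driver, card, _bus, _ver, caps, dev_caps = struct.unpack(QUERYCAP_FMT, raw)
    # device_caps is only meaningful when V4L2_CAP_DEVICE_CAPS is set.
    effective = dev_caps if caps & CAPS["DEVICE_CAPS"] else caps
    return {
        "driver": driver.rstrip(b"\0").decode(),
        "card": card.rstrip(b"\0").decode(),
        "caps": sorted(n for n, bit in CAPS.items() if effective & bit),
    }


def would_pass_m2m_filter(info: dict) -> bool:
    """Hypothetical mirror of the M2M-only test described in the findings."""
    return bool({"VIDEO_M2M", "VIDEO_M2M_MPLANE"} & set(info["caps"]))


def query(path: str) -> dict:
    fd = os.open(path, os.O_RDWR)
    try:
        buf = bytearray(struct.calcsize(QUERYCAP_FMT))
        fcntl.ioctl(fd, VIDIOC_QUERYCAP, buf)  # the kernel fills buf in place
        return decode_caps(bytes(buf))
    finally:
        os.close(fd)


if __name__ == "__main__":
    info = query("/dev/video1")
    print(info, "passes M2M filter:", would_pass_m2m_filter(info))
    print("MEDIA_REQUEST_IOC_QUEUE =", hex(MEDIA_REQUEST_IOC_QUEUE))
```

Attaching the printed capability list to the bug report would let reviewers see exactly which bits the node exposes on this kernel.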

### G. Sonnet 7.x miscellany (carryovers from iter1 review)

> - 7.1 EACCES on VIDIOC_G_EXT_CTRLS readback (move before MEDIA_REQUEST_IOC_QUEUE; if still EACCES, file kernel issue and remove dead code).
> - 7.4 `// HACK` block in `surface.c::CreateSurfaces2` hardcoding `V4L2_PIX_FMT_H264_SLICE`, which is wrong for MPEG-2.
> - 7.5 Firefox seek-to-non-IDR test corpus.

**Why**: cleanup. Fits in slack time of any other iteration's plan.

### Recommended pairings
- A + B (defect fix + clean-up of unused diagnostic before commit)
- C alone (perf measurement is its own thing)
- E alone (high risk, focus needed)
- F + G as a "small-debt cleanup" iteration

## State that carries (re-verified 2026-05-04 23:35)

- **Hardware**: ohm RK3568, kernel `6.19.10-danctnix1-1-pinetab2`, hantro G1/G2 on `/dev/video1` + `/dev/media0` (perms unchanged: `crw-rw----+ root:video` with mfritsche ACL)
- **Userspace**: libva 2.23.0-1, libva-utils 2.22.0-1, mpv 1:0.41.0-3, firefox 150.0.1-1, mesa 1:26.0.5-1, libdrm 2.4.131-1
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` sha256 `dcf8a7170fbd49bb...`
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `f27e006433cd4769...` (iter2 build with all three fixes)
- **Fork master**: `19acc76` (iter2 Fix 3); the iter1 + iter2 commits are still ahead of bootlin upstream
- **Build harness**: `meson setup --buildtype=release && ninja` on ohm at `/tmp/libva-src/libva-v4l2-request-fourier` (rebuilt every session because /tmp is tmpfs); deploy to `/usr/lib/dri/v4l2_request_drv_video.so`
- **Live test rig env** (Plasma 6 Wayland): `XDG_RUNTIME_DIR=/run/user/1001`, `WAYLAND_DISPLAY=wayland-0`, `DISPLAY=:0`, `XAUTHORITY=/run/user/1001/xauth_ZnmtRw` (changes per session), `DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1001/bus`. SSH-driven launches need `MOZ_ENABLE_WAYLAND=1` for Firefox + `MOZ_DISABLE_RDD_SANDBOX=1` for VAAPI.
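
Re-verifying the fixture and the installed driver against the recorded digests can be scripted. The sketch below is a small hypothetical helper that compares a file's SHA-256 against a recorded value that may be truncated with `...`, as the digests in this list are. A prefix match is weaker evidence than a full-digest match and should be reported as such.

```python
import hashlib


def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large fixtures need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def matches_prefix(path: str, recorded: str) -> bool:
    """Compare against a recorded digest that may end in a '...' truncation."""
    prefix = recorded.rstrip(".").lower()
    return bool(prefix) and sha256_of(path).startswith(prefix)


if __name__ == "__main__":
    # Paths and truncated digests as recorded in the list above.
    checks = {
        "/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4": "dcf8a7170fbd49bb...",
        "/usr/lib/dri/v4l2_request_drv_video.so": "f27e006433cd4769...",
    }
    for path, recorded in checks.items():
        print(path, "OK" if matches_prefix(path, recorded) else "MISMATCH")
```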

## State that does NOT carry (re-acquire per `feedback_replicate_baseline_first.md`)

- Performance numbers: same caution as iter1+iter2 close. iter2 saw `0 drops in 200 frames` for vaapi-copy and `smooth on operator inspection` for vaapi DMA-BUF; these are smoke verdicts, not anchored measurements. The iter3 perf candidate (option C) is the natural place to anchor them.
- Firefox 10-frame decode + EINVAL at frame 11: re-acquire under an iter3-specific binding cell. Today's evidence is in `/tmp/ff-stdout.log` on ohm, which is tmpfs-volatile.

## Tooling and measurement-instrument inventory (live verification)

Carried from iter2:
- `strace -f -e trace=openat,close,ioctl` for libva-side V4L2 ioctl tracing
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle
- `sudo dmesg -w` for kernel-side warnings (typically silent on this rig)
- `sudo lsof /dev/video1` for fd ownership snapshots
- `mpv --frames=N --vo=gpu` with stderr capture
- Firefox `MOZ_LOG=PlatformDecoderModule:5,VideoBridge:5` (under `MOZ_DISABLE_RDD_SANDBOX=1`)
- `coredumpctl info <pid>` for crash backtraces
- Operator visual inspection on real screen (load-bearing for boolean correctness)

Likely needed for specific iter3 candidates:
- For A (frame-11 EINVAL): `dmesg | grep hantro` during decode to catch driver-side EINVAL reasons; `ftrace events/hantro/*` if the kernel exposes such events; reading `hantro_g1_h264_dec.c` for control validation rules
- For C (perf): `pidstat -u -p $(pidof mpv) 1` for CPU%; a `gpustat` equivalent for Mali-G52 (likely needs Panfrost ftrace); a compositor scanout query (Wayland `ext-output-management`?) is harder
- For E (DMABUF): `gbm_bo_create` userspace allocation test program; exploratory `VIDIOC_QBUF` path with `memory=V4L2_MEMORY_DMABUF`
- For F (sandbox): meitner / clevo access; Firefox source `security/sandbox/linux/SandboxFilter.cpp`

## In-scope (LOCKING DEFERRED, awaiting Phase 1 user input)

To be locked at Phase 1 from candidates A..G above. Recommended pairing or solo flagged per candidate.

## Out-of-scope (LOCKED 2026-05-04 for iteration 3)

- New codecs (MPEG-2, VP8, VP9, AV1, HEVC): H.264-only scope holds from iter1+iter2.
- New hardware (fresnel RK3399, ampere/boltzmann RK3588): separate iteration after the ohm path is hardened.
- Bootlin upstreaming PR: `feedback_no_upstream.md` holds; no PRs unless explicitly tasked. iter3 might produce the prerequisites (DEBUG sweep, HACK refactor, perf data) for an eventual upstream.
- HEVC re-introduction (stripped in fourier port; no hantro G2 HEVC validation in operator's test corpus).

## Phase 1 success criterion (will lock after user picks candidate)

Pre-lock template:
- For candidate A: "Firefox 150 plays bbb_1080p30 for ≥30s through HW decode without `Unable to set control(s)` EINVAL emerging in driver stderr."
- For candidate B: "Driver source builds clean with zero `request_log()` calls in non-error paths and zero patch-0011 sentinel writes; vaapi-copy + vaapi smoke tests still green."
- For candidate C: "Anchored perf table for {mpv vaapi DMA-BUF, mpv vaapi-copy, Firefox HW, SW baseline} across drop count + CPU% + frame timing on bbb_1080p30; reproducible from operator instructions documented in iter3 substrate."
- For candidate D: "Two concurrent libva contexts on the same V4L2 device decode independently without cross-context state corruption."
- For candidate E: "vaapi-copy + vaapi --vo=gpu still produce real frames with `V4L2_MEMORY_DMABUF`-backed CAPTURE buffers; race window eliminated by construction (the kernel cannot write to a buffer the consumer still holds, because userspace owns the dma-buf)."
- For candidate F: "Decision documented (with Mozilla bug filed OR `MOZ_DISABLE_RDD_SANDBOX=1` permanently in README); cross-verified on Intel/NVIDIA test box."
- For candidate G: per Sonnet 7.x sub-item.

## Stop point

**Phase 1 lock requires user input**: pick from A..G (and any pairing). After lock, iter3 phases 2..8 proceed autonomously per "Stop only if user is needed."