# Iteration 4: Phase 0 (substrate / motivation / inventory)

Opens 2026-05-05 immediately after iteration 3 close (`phase8_iteration3_close.md`, fork commits `4a7a07e` + `086b7ce`, campaign close `f91469a`).

## Predecessor close-out summary (iteration 3 → iteration 4)

iter3 was the first iteration to leave the campaign with **two distinct unmet criteria**: Track F locked, Track A reproduced-but-not-fixed. Honest accounting in `phase8_iteration3_close.md`.

Three patches landed:

- **Firefox-fourier sandbox patch** (Mozilla source, applied via Arch PKGBUILD overlay): broker policy admits `/dev/media*` + cap-filter widened for stateless decoders, seccomp policy admits ioctl magic byte `'|'` for ``. Built on boltzmann LXD container, deployed to ohm at `/opt/firefox-fourier/`.
- **Driver fix** (`media.c`, fork commit `4a7a07e`): `media_request_wait_completion()` migrated from `select()` to `poll()` for Mozilla seccomp compat.
- **Driver instrumentation** (`v4l2.c`, fork commit `086b7ce`): Y2 (on `VIDIOC_S_EXT_CTRLS` returning -EINVAL, log `num_controls`, `error_idx`, per-control `id`+`size`). Used in iter3 Phase 7 to characterize the frame-11 EINVAL as a request-level (not single-control) rejection.

iter3 Phase 7 result: Firefox-fourier launches without `MOZ_DISABLE_RDD_SANDBOX=1`, libva decodes 10 frames cleanly through hantro under the sandboxed RDD, then the iter1+iter2 carryover frame-11 EINVAL fires deterministically on a single-slice P-frame (`slice_type=0 frame_num=5 poc_lsb=20`, post-IDR, SHORT_TERM_REFERENCE).

## Iteration 4 candidate research questions

Track A is the load-bearing carry from THREE iterations now (iter1 surfaced it, iter2 deferred, iter3 reproduced under controlled rig with rich diagnostics). It's the natural primary lock for iter4. Several other candidates carry forward.

### A. Track A: fix the frame-11 EINVAL

> Identify which V4L2 control field's content the kernel rejects on the 11th BeginPicture in `bbb_1080p30_h264.mp4` and fix it in the libva-v4l2-request-fourier driver.

**Why first**: this defect has carried for three iterations. The rig is in place; Y2 instrumentation tells us the exact `(num_controls=4, error_idx=4, ctrl[]=SPS/PPS/DECODE_PARAMS/SCALING_MATRIX)` shape. `error_idx == num_controls` is the kernel's "request-level rejection" sentinel: kernel rejected but couldn't pinpoint a single bad control.

Diagnosis path:

1. Read `drivers/staging/media/hantro/hantro_g1_h264_dec.c::set_params()` validation (and any predecessor validators). Identify which fields are checked for "request-level" coherence (cross-field consistency, ref-list-vs-DPB invariants, etc.).
2. Diff our DECODE_PARAMS / SLICE_PARAMS / SPS / PPS construction at frame-11 against:
   - FFmpeg's `libavcodec/v4l2_request_*.c` reference implementation (see `diff_against_ffmpeg.md` for prior diff scaffold).
   - GStreamer's `v4l2slh264dec` element (which works with this kernel/hardware).
3. Sonnet review 7.5 ("mid-stream non-IDR") fits the failing-frame signature better than 7.2 (multi-slice num_ref_idx) which doesn't apply to single-slice. Suspect surface narrows to:
   - DECODE_PARAMS reference picture list invariants (DPB management between IDR-anchor and non-IDR P-frames).
   - DECODE_PARAMS `flags` (specifically `IDR_PIC_FLAG`, `FIELD_PIC_FLAG`).
   - Sonnet's "B-slice copy-paste bug" (h264.c:663, `ref_pic_list0[i].fields = fields` should be `ref_pic_list1`). It DOESN'T apply to the failing frame (P-slice) but is on the suspect list for any future B-frame test corpus.
4. Speculative-fix in driver, rebuild on ohm (~30 sec), redeploy, retest with iter3's `/tmp/run_phase7_v2.sh`. Fast loop; budget 1-3h depending on whether the bad field is obvious from first inspection.

**Risk**: per-frame state (especially DPB) is the densest part of H.264 spec.
The fix may surface a need for state tracking that the fork doesn't currently maintain. Could pair with B (DEBUG sweep: instrumentation cleanup as part of the fix landing).

### B. DEBUG instrumentation sweep (carried from iter3 backlog, was iter3 candidate B)

> Remove the iter1 ENTER logging + CAPTURE/OUTPUT hex dumps + sentinel write + msync workaround commit-by-commit, building cleanly between each removal. End state: zero `request_log()` calls in non-error paths, no patch-0011 sentinel write in `EndPicture`, and either delete the msync or document why it stays. Also remove iter3's Y2 instrumentation if Track A is fixed (otherwise keep until iter5).

**Why**: required prerequisite for any upstream snapshot. Was deferred at iter1+iter2+iter3. Smaller scope than A. Could pair naturally with A (fix lands, then sweep removes the diagnostic that surfaced the fix).

### C. mpv libplacebo `--vo=gpu` segfault (carried from iter3 substrate)

> Resolve the regression where `LIBVA_DRIVER_NAME=v4l2_request mpv --hwdec=vaapi --vo=gpu` segfaults after 4 frames on bbb_1080p30 when Vulkan init fails (post-reboot).

**Symptom** (captured in iter3 substrate): Vulkan init fails (`VK_ERROR_INITIALIZATION_FAILED`), mpv falls through to GPU non-vulkan path, decode runs for 4 frames cleanly, then `Unable to request buffers: Device or resource busy` (REQBUFS EBUSY mid-stream), then bizarre `CreateSurfaces2: surf_width=16 surf_height=16 sizes[1]=1050626` (uninitialized memory shape), then SIGSEGV.

**Hypothesis**: cap_pool resolution-change path doesn't fully drain CAPTURE before REQBUFS → kernel returns EBUSY → driver pushes ahead with garbage → mmap or pool-init crashes. Also Vulkan init failure may be a Mesa update side effect; it could be the binding constraint.

**Why**: another consumer regression. Less load-bearing than A (mpv is one of several consumers; Firefox-via-libva is the iter3 anchor) but real. Operator has lost the iter2 close's "smooth on operator inspection" verdict.

### D. Performance binding cell (deferred since iter1)

> Establish a measurement protocol for HW vs SW decode on this rig: drop counts, effective FPS, browser CPU%, scanout-plane residency for {mpv vaapi DMA-BUF, mpv vaapi-copy, Firefox-fourier HW (sandbox-on), SW baseline}. Anchor in `phaseN_evidence/` directory.

**Why**: anchors all iter1+iter2+iter3 claims to numbers. Carried from iter1+iter2+iter3. Iter3 Phase 7's Firefox-fourier rig is the missing measurement consumer; this is the natural perf iteration after sandbox is locked.

**Risk**: lots of fixture work for one binding cell. May want to pair with A so we get "before / after" perf for the Track A fix.

### E. V4L2_MEMORY_DMABUF (true Option B for iter2 Fix 3)

> Replace V4L2_MEMORY_MMAP with userspace dma-buf allocation. iter2 Fix 3 is statistical (LRU mitigation); Option B is architectural (userspace owns the buffer; lifetime unambiguous).

**Why**: iter2 documented this as deferred. Significant kernel-side test surface (does hantro on this kernel actually accept DMABUF type? GStreamer's v4l2slh264dec uses MMAP, so DMABUF on hantro may not be tested upstream).

**Risk**: highest unknown of any candidate; possibly requires kernel work.

### F. Multi-context libva safety (carried from iter2/iter3, Sonnet review 9.6)

> Make the backend safe for two concurrent libva contexts in the same process (e.g. Firefox tab playing one video while another tab plays a different resolution). Today's `LAST_OUTPUT_WIDTH/HEIGHT` is a process-global static and `cap_pool` is per-driver_data but the V4L2 device is shared.

**Why**: iter3 didn't trigger this; one-tab-one-video tests don't exercise it. iter4 is a fine time if iter4 tests Firefox HW more aggressively.

### G. Mozilla bug filing (carried from iter3 close)

> File a Mozilla Bugzilla bug for `/dev/media*` + V4L2-stateless RDD sandbox; submit the iter3 firefox-fourier patch as the proposed fix.

**Why**: iter3 produced a working patch (50 lines, 2 files).
Per Phase 0 Sonnet research (iter3 substrate candidate F), no Mozilla bug exists yet for this. Highest upstream-leverage outcome. Also carries the driver-side `select() → poll()` change as a parallel bootlin/upstream-libva-v4l2-request submission.

**Stance**: per `feedback_no_upstream.md`, no PR/MR/bug-file happens without explicit operator instruction. Listed here for completeness; user signal needed.

### Recommended pairings

- **A + B** (defect fix + DEBUG sweep at the end). Most natural: same author, same files, fix lands then instrumentation goes.
- **A + C** (the two known consumer-side bugs in one iteration; Track A fixes Firefox HW, Track C fixes mpv vaapi-DMABUF).
- **A + D** ("fix and measure": anchor perf around the Track A fix).
- **C alone** (riskier than A; but could parallelize with A on a separate session).

## State that carries (re-verified 2026-05-05 close)

- **Hardware**: ohm RK3568 hantro G1/G2 on `/dev/video1` + `/dev/media0`, kernel `6.19.10-danctnix1-1-pinetab2`. Access: `ohm.vpn` (changed from `ohm.fritz.box` mid-iter3).
- **Userspace**: libva 2.23.0, libva-utils 2.22.0, mpv 0.41.0-3, Firefox 150.0.1 stock + Firefox 150.0.1-1.1 fourier installed at `/opt/firefox-fourier/`, mesa 26.0.5, libdrm 2.4.131.
- **Test fixture**: `/home/mfritsche/fourier-test/bbb_1080p30_h264.mp4` sha256 `dcf8a7170fbd...`.
- **Driver installed**: `/usr/lib/dri/v4l2_request_drv_video.so` sha256 `70a2bb1e16012a5d...` (iter3 build with poll() + Y2).
- **Firefox installed**: `/opt/firefox-fourier/firefox` (PGO-instrumented stage-1 binary, ~3.6 GB libxul.so). For iter4 a clean PGO-disabled rebuild may be desired (smaller, faster binary).
- **Build container**: `firefox-fourier` LXD on **boltzmann** (RK3588 aarch64, 8 cores, 24 GB RAM, 787 GB free `/build` NVMe, persistent autostart). Access: `ssh -J boltzmann builder@firefox-fourier`.
  Firefox source still extracted at `/build/aur/firefox-fourier/src/firefox-150.0.1/` with iter3 patches applied; incremental rebuilds via `./mach build`.
- **Phase 7 evidence script**: `/home/mfritsche/iter3_phase7_evidence.sh` on ohm.vpn (honors `LOG=` env override, prints per-track verdict).
- **Autonomous Phase 7 runner**: `/tmp/run_phase7_v2.sh` on ohm.vpn (Plasma session env discovery + 35s decode window). Tmpfs-volatile.

## State that does NOT carry (re-acquire per `feedback_replicate_baseline_first.md`)

- Performance numbers. iter3 didn't measure performance; iter4 candidate D would be the first iteration to anchor numbers.
- `/tmp/ff-fourier-stderr-v2.log` is tmpfs-volatile. iter3's evidence captured into commit `f91469a` close doc; further runs need fresh capture.
- iter3-built firefox-fourier is PGO-instrumented. If iter4 does perf measurement, rebuild without `--enable-profile-generate=cross` first.

## Tooling and measurement-instrument inventory (live verification)

Carried from iter3:

- `strace -f -e trace=openat,close,ioctl` for libva-side V4L2 ioctl tracing
- `sudo ftrace events/v4l2/* events/vb2/* events/dma_fence/*` for kernel-side V4L2/vb2 lifecycle
- `sudo dmesg -w` for kernel-side warnings (typically silent on this rig)
- `mpv --frames=N --vo=gpu` with stderr capture
- Firefox `MOZ_LOG=PlatformDecoderModule:5,VideoBridge:5` (under firefox-fourier, no `MOZ_DISABLE_RDD_SANDBOX` needed)
- Operator visual inspection on real screen (load-bearing for boolean correctness; iter3 confirmed log-driven verification is sufficient for sandbox-traverse questions but visual ack is still needed for "frames reach screen" claims)
- iter3 Y2 instrumentation: S_EXT_CTRLS EINVAL → num_controls / error_idx / per-control id+size
- iter3 Phase 7 evidence script + autonomous runner (above)

Likely needed for specific iter4 candidates:

- For A (Track A fix): need to read `hantro_g1_h264_dec.c` source on ohm.
  Kernel source at `/usr/src/linux-headers-$(uname -r)/` or fetch from `git.kernel.org` against ohm's kernel commit hash.
- For C (libplacebo): Mesa version-pin investigation, ftrace `events/v4l2/vb2_*` for the EBUSY signature, possibly Mesa rollback test.
- For D (perf): `pidstat -u -p $(pidof firefox)` for CPU%; Mali-G52 freq via `/sys/class/devfreq/fde60000.gpu`; compositor scanout query (Wayland `ext-output-management`?) is harder.
- For E (DMABUF): `gbm_bo_create` userspace test program; VIDIOC_QBUF type=V4L2_MEMORY_DMABUF exploratory path.

## In-scope (LOCKED 2026-05-05 for iteration 4): A solo

**Track A.** Identify which V4L2 control field's content the kernel rejects on the 11th `vaBeginPicture` in `bbb_1080p30_h264.mp4` and fix it in `libva-v4l2-request-fourier`.

Diagnosis path per substrate candidate A:

1. **Read kernel hantro validation**: `drivers/staging/media/hantro/hantro_g1_h264_dec.c::set_params()` (and any predecessor V4L2-stateless H.264 validators on this kernel: `v4l2-h264.c`, `v4l2-mem2mem.c`, etc.) on ohm.
2. **Diff our struct construction at frame-11 against FFmpeg reference**: leverage existing `diff_against_ffmpeg.md` scaffold; focus on DECODE_PARAMS reference picture list + flags state for non-IDR P-frames.
3. **Speculative fix**: rebuild driver on ohm, retest with `/tmp/run_phase7_v2.sh`. ~30 sec rebuild + 35 sec test = ~1 min cycle.

Sonnet review 7.5 ("mid-stream non-IDR DPB state") is the suspect surface.

Pairing decision: **solo**. iter3 substrate suggested A+B or A+D, user locked **A solo**. iter5 will pick up the natural follow-on (DEBUG sweep, perf, or whatever).

## Out-of-scope (LOCKED 2026-05-05 for iteration 4)

- Candidates B, C, D, E, F, G: deferred to iter5+. None block A.
- Same standing OOS from iter3: new codecs, new target hardware, bootlin/Mozilla upstream PR/MR, HEVC.
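
For the in-scope Track A work, the way the Y2 `(num_controls, error_idx)` pair is read can be sketched as below. This is a minimal sketch, not driver code: `ExtCtrlsFailure` and `classify` are hypothetical names, and the control order is taken from the Y2 shape recorded in these notes. As far as the kernel's `VIDIOC_S_EXT_CTRLS` documentation describes it, `error_idx == count` means the call failed in the up-front validation step (or for a reason not tied to one control) and the set ioctl does not report which control was at fault in that case, so the sentinel alone may not rule out a single bad control. The same documentation describes `VIDIOC_TRY_EXT_CTRLS` as returning the index of the control that fails validation; whether that works on the request path of this kernel is an assumption to verify in Phase 2.

```python
from dataclasses import dataclass

# Control order as reported by the Y2 log for the failing call (from these notes).
CTRL_NAMES = ("SPS", "PPS", "DECODE_PARAMS", "SCALING_MATRIX")


@dataclass(frozen=True)
class ExtCtrlsFailure:
    """Hypothetical container for one Y2 record of a failed VIDIOC_S_EXT_CTRLS."""
    num_controls: int
    error_idx: int


def classify(failure: ExtCtrlsFailure) -> str:
    """Describe what error_idx does (and does not) say about the failure."""
    n, idx = failure.num_controls, failure.error_idx
    if n <= 0 or not 0 <= idx <= n:
        raise ValueError(f"implausible record: num_controls={n}, error_idx={idx}")
    if idx == n:
        # Sentinel: rejected before any control was applied; the set ioctl
        # does not name the culprit in this case.
        return "pre-apply rejection (culprit not reported)"
    name = CTRL_NAMES[idx] if idx < len(CTRL_NAMES) else f"ctrl[{idx}]"
    return f"failed while applying {name}"


# The frame-11 shape recorded by Y2: num_controls=4, error_idx=4.
print(classify(ExtCtrlsFailure(num_controls=4, error_idx=4)))
```

If a TRY-based probe is added during diagnosis, an `error_idx` below `num_controls` from that probe would map straight to one of the four control names above.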
## Phase 1 success criterion (LOCKED 2026-05-05)

**Track A:** patched-Firefox-fourier on ohm decodes **≥30s of bbb_1080p30 (≥720 frames at 24 fps)** without `Unable to set control(s): Invalid argument` emerging in driver stderr. Anchored evidence: Y2 `S_EXT_CTRLS EINVAL` count = 0 over the run, Phase 7 evidence script verdict line "Track A: GREEN", and visual ack confirming frames render in the Firefox window (the load-bearing op-side check that decode output actually reaches the screen, not just succeeds at the libva layer).

## Stop point

Phase 1 LOCKED on A solo. iter4 proceeds through:

- Phase 2 (situation analysis: read hantro_g1_h264_dec.c set_params validation on ohm, identify which fields are checked at request-level)
- Phase 3 (baseline anchor: re-run iter3's autonomous Phase 7 rig to confirm the EINVAL still reproduces with the same control IDs/sizes)
- Phase 4 (write the fix, possibly multiple iterations of speculative fix → rebuild → retest)
- Phase 5 (Sonnet review of fix)
- Phase 6 (deploy + smoke)
- Phase 7 (verify)
- Phase 8 (close)

Stop only if user input is needed (e.g. the fix surfaces a multi-way design choice, or kernel-side state tracking emerges as required).
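
The machine-checkable part of the Phase 1 success criterion can be sketched as follows. This is an illustration under stated assumptions, not the existing `iter3_phase7_evidence.sh`: `track_a_verdict` is a hypothetical helper, the decoded-frame count is assumed to come from some external source, and the "Track A: RED" string is an assumed counterpart to the "Track A: GREEN" verdict line. The visual ack stays a manual step and is deliberately not modelled.

```python
# String and thresholds are taken from the locked criterion.
EINVAL_MARKER = "Unable to set control(s): Invalid argument"
FPS = 24           # frame rate stated in the criterion
MIN_SECONDS = 30   # minimum clean decode window in seconds


def track_a_verdict(driver_stderr: str, frames_decoded: int) -> str:
    """Return the verdict line for one captured run.

    driver_stderr: full text of the captured driver stderr log.
    frames_decoded: decoded-frame count (how it is obtained is an assumption).
    """
    einval_count = sum(EINVAL_MARKER in line for line in driver_stderr.splitlines())
    min_frames = FPS * MIN_SECONDS  # 24 * 30 = 720 frames
    passed = einval_count == 0 and frames_decoded >= min_frames
    return "Track A: GREEN" if passed else "Track A: RED"


# A run that is long enough but hits the error once is still RED.
print(track_a_verdict("frame 11\n" + EINVAL_MARKER + "\n", 900))
```

Both conditions have to hold together: a clean log from a run shorter than 720 frames would not show that the frame-11 defect is gone for the full window.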