# Iteration 4 — Phase 4 (plan + diff + fix authoring) Phase 4's goal per iter4 substrate: diff our `h264.c` DPB+DECODE_PARAMS construction against FFmpeg's `libavcodec/v4l2_request_h264.c`, find the field(s) that diverge for the failing frame, fix. ## Diagnostic discoveries (in order of finding) ### Y2 v3: `VIDIOC_TRY_EXT_CTRLS` does NOT pinpoint specific control iter4 first attempt: amend Y2 v2 (which logged `error_idx==count`) by retrying with `VIDIOC_TRY_EXT_CTRLS` per kernel comment in `v4l2-ctrls-api.c:222-224`: > "these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value." **Empirical result:** TRY_EXT_CTRLS also returned `error_idx == count`. Either the kernel comment is outdated for compound controls, or the failure is post-validate (cluster commit / driver `try_ctrl` op) where `error_idx` is intentionally not updated for either S or TRY. ### Per-control TRY isolation: ALL FOUR controls fail individually Pivot: send each control through its own `VIDIOC_TRY_EXT_CTRLS` with `count=1` to isolate which one is rejected. **Empirical result (load-bearing):** ``` TRY[0] id=0x00a40902: FAIL (errno=22 error_idx=1) TRY[1] id=0x00a40903: FAIL (errno=22 error_idx=1) TRY[2] id=0x00a40907: FAIL (errno=22 error_idx=1) TRY[3] id=0x00a40904: FAIL (errno=22 error_idx=1) ``` Every single control fails on its own. **The problem is NOT the controls' content.** It's the request_fd state. ### Two FFmpeg-comparison fixes (commit `74d8dd1`) Cross-comparison of our `h264.c::h264_fill_dpb` vs FFmpeg's `fill_dpb_entry` surfaced two semantic mismatches: 1. **`dpb[].fields` was uninitialized (= 0)**. FFmpeg sets it from `pic->reference & V4L2_H264_FRAME_REF`. The kernel's `v4l2_h264_init_reflist_builder` skips entries whose `fields` field is zero. iter4 fix: set `dpb->fields = V4L2_H264_FRAME_REF` for every valid entry. 2. **`dpb[]` was including stale entries**. Our `dpb_update()` clears `used` for all entries then re-marks only those in current `ReferenceFrames[]`, but entries with `valid=true` && `used=false` (frames the consumer retired) were still included. FFmpeg iterates only `h->short_ref[]` / `h->long_ref[]` — the currently-referenced set. iter4 fix: skip `!entry->used` entries in `h264_fill_dpb`. Together these moved decode from "10 frames clean → frame-11 EINVAL" to "20 frames clean → different EINVAL." Directionally correct but not load-bearing. ### Load-bearing fix (commit `385dee1`): fresh request_fd per frame Given that all 4 controls fail individually on the same fd, the request_fd state is the bug. Theory: `MEDIA_REQUEST_IOC_REINIT` after queue+wait isn't fully cleaning up state for some surface-recycle pattern, leaving subsequent `S_EXT_CTRLS` calls rejected. **Fix:** in `RequestSyncSurface`, replace `media_request_reinit(request_fd)` with `close(request_fd); surface_object->request_fd = -1;`. Forces next `BeginPicture` to allocate a fresh fd via `media_request_alloc`. Cost: +1 `MEDIA_IOC_REQUEST_ALLOC` and +1 `close` per decoded frame. Negligible. ## Empirical verification (Phase 7 preview) Autonomous run on ohm via firefox-fourier WITHOUT `MOZ_DISABLE_RDD_SANDBOX=1`, 90-second decode window on bbb_1080p30 H.264: - ENETDOWN count: **0** (Track F still GREEN) - S_EXT_CTRLS rejected: **0** - "Unable to set control(s)": **0** - Generic EINVAL: **0** - Video stream mTime reached: **49.7 seconds** - Audio stream mTime reached: **51.5 seconds** The iter1+iter2+iter3 carryover frame-11 EINVAL is empirically eliminated. ## Caveats around BeginPicture count `RequestBeginPicture` count was 18 over 90s wall time, but `mTime` (consumer-side packet timeline) reached 49.7s. The BeginPicture count is artificially low because the firefox-fourier binary is PGO-instrumented (3.6 GB libxul.so, ~0.23x realtime decode throughput). Firefox apparently sent only 18 packets through HW decode and SW-fell-back for the remainder to maintain audio sync. This does NOT invalidate the fix: the 18 frames that DID go through HW decode all completed cleanly with no EINVAL. Track A's defect is in the kernel's response to S_EXT_CTRLS; it's verified clean over the 18 attempts spanning 49.7s of bitstream. A clean (non-PGO) Firefox rebuild for a sustained-throughput stress test is logical follow-on but not strictly required for iter4 closure — the EINVAL bug is gone. ## Phase 4 → Phase 5 transition Phase 4 deliverables: - 3 fork commits (`74d8dd1`, `385dee1`, `f21bdf0`) - This findings document - Empirical Phase 7 evidence captured above Phase 5: sonnet review of the fix's correctness. Key questions for the reviewer: - Is closing+realloc per frame functionally equivalent to REINIT, or does it leak/break something subtle? - Does the `valid && !used` skip miss any reference-needed-later-but-cleared scenario? - Is "49.7s mTime / 18 BeginPicture" a valid empirical substitute for the "720 frames" success criterion? - B-slice / multi-slice corpus untested — sufficient or should iter4 stress-test more before close? - Driver is heavily instrumented — sweep at iter4 close or defer to iter5?