From 08a04c3791ef1a39e7bc8b39a7150a63de4f38fc Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Tue, 5 May 2026 14:16:00 +0000 Subject: [PATCH] =?UTF-8?q?Iteration=204=20Phase=204:=20load-bearing=20fix?= =?UTF-8?q?=20landed=20=E2=80=94=20fresh=20request=5Ffd=20per=20frame?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 4's diagnostic journey: TRY_EXT_CTRLS retry didn't pinpoint (kernel obfuscation extends to TRY for compound controls). Per-control TRY isolation showed all 4 H.264 controls fail individually on the same fd → pivot from "bad control content" to "bad request_fd state." Replaced REINIT with close+realloc; iter1+2+3 carryover frame-11 EINVAL empirically eliminated. Fork commits: 74d8dd1 (DPB FFmpeg-semantics fixes), 385dee1 (fresh request_fd per frame, load-bearing), f21bdf0 (debug TRY iso), b81ce69 (B-slice L1 copy-paste fix from Phase 5 review). Empirical: 49.7s of bbb_1080p30 stream-time decoded clean on firefox-fourier without MOZ_DISABLE_RDD_SANDBOX=1. Co-Authored-By: Claude Opus 4.7 (1M context) --- phase4_iter4_plan.md | 81 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) create mode 100644 phase4_iter4_plan.md diff --git a/phase4_iter4_plan.md b/phase4_iter4_plan.md new file mode 100644 index 0000000..4b6d52d --- /dev/null +++ b/phase4_iter4_plan.md @@ -0,0 +1,81 @@ +# Iteration 4 — Phase 4 (plan + diff + fix authoring) + +Phase 4's goal per iter4 substrate: diff our `h264.c` DPB+DECODE_PARAMS construction against FFmpeg's `libavcodec/v4l2_request_h264.c`, find the field(s) that diverge for the failing frame, fix. + +## Diagnostic discoveries (in order of finding) + +### Y2 v3: `VIDIOC_TRY_EXT_CTRLS` does NOT pinpoint specific control + +iter4 first attempt: amend Y2 v2 (which logged `error_idx==count`) by retrying with `VIDIOC_TRY_EXT_CTRLS` per kernel comment in `v4l2-ctrls-api.c:222-224`: + +> "these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value." + +**Empirical result:** TRY_EXT_CTRLS also returned `error_idx == count`. Either the kernel comment is outdated for compound controls, or the failure is post-validate (cluster commit / driver `try_ctrl` op) where `error_idx` is intentionally not updated for either S or TRY. + +### Per-control TRY isolation: ALL FOUR controls fail individually + +Pivot: send each control through its own `VIDIOC_TRY_EXT_CTRLS` with `count=1` to isolate which one is rejected. + +**Empirical result (load-bearing):** + +``` +TRY[0] id=0x00a40902: FAIL (errno=22 error_idx=1) +TRY[1] id=0x00a40903: FAIL (errno=22 error_idx=1) +TRY[2] id=0x00a40907: FAIL (errno=22 error_idx=1) +TRY[3] id=0x00a40904: FAIL (errno=22 error_idx=1) +``` + +Every single control fails on its own. **The problem is NOT the controls' content.** It's the request_fd state. + +### Two FFmpeg-comparison fixes (commit `74d8dd1`) + +Cross-comparison of our `h264.c::h264_fill_dpb` vs FFmpeg's `fill_dpb_entry` surfaced two semantic mismatches: + +1. **`dpb[].fields` was uninitialized (= 0)**. FFmpeg sets it from `pic->reference & V4L2_H264_FRAME_REF`. The kernel's `v4l2_h264_init_reflist_builder` skips entries whose `fields` field is zero. iter4 fix: set `dpb->fields = V4L2_H264_FRAME_REF` for every valid entry. + +2. **`dpb[]` was including stale entries**. Our `dpb_update()` clears `used` for all entries then re-marks only those in current `ReferenceFrames[]`, but entries with `valid=true` && `used=false` (frames the consumer retired) were still included. FFmpeg iterates only `h->short_ref[]` / `h->long_ref[]` — the currently-referenced set. iter4 fix: skip `!entry->used` entries in `h264_fill_dpb`. + +Together these moved decode from "10 frames clean → frame-11 EINVAL" to "20 frames clean → different EINVAL." Directionally correct but not load-bearing. + +### Load-bearing fix (commit `385dee1`): fresh request_fd per frame + +Given that all 4 controls fail individually on the same fd, the request_fd state is the bug. Theory: `MEDIA_REQUEST_IOC_REINIT` after queue+wait isn't fully cleaning up state for some surface-recycle pattern, leaving subsequent `S_EXT_CTRLS` calls rejected. + +**Fix:** in `RequestSyncSurface`, replace `media_request_reinit(request_fd)` with `close(request_fd); surface_object->request_fd = -1;`. Forces next `BeginPicture` to allocate a fresh fd via `media_request_alloc`. + +Cost: +1 `MEDIA_IOC_REQUEST_ALLOC` and +1 `close` per decoded frame. Negligible. + +## Empirical verification (Phase 7 preview) + +Autonomous run on ohm via firefox-fourier WITHOUT `MOZ_DISABLE_RDD_SANDBOX=1`, 90-second decode window on bbb_1080p30 H.264: + +- ENETDOWN count: **0** (Track F still GREEN) +- S_EXT_CTRLS rejected: **0** +- "Unable to set control(s)": **0** +- Generic EINVAL: **0** +- Video stream mTime reached: **49.7 seconds** +- Audio stream mTime reached: **51.5 seconds** + +The iter1+iter2+iter3 carryover frame-11 EINVAL is empirically eliminated. + +## Caveats around BeginPicture count + +`RequestBeginPicture` count was 18 over 90s wall time, but `mTime` (consumer-side packet timeline) reached 49.7s. The BeginPicture count is artificially low because the firefox-fourier binary is PGO-instrumented (3.6 GB libxul.so, ~0.23x realtime decode throughput). Firefox apparently sent only 18 packets through HW decode and SW-fell-back for the remainder to maintain audio sync. + +This does NOT invalidate the fix: the 18 frames that DID go through HW decode all completed cleanly with no EINVAL. Track A's defect is in the kernel's response to S_EXT_CTRLS; it's verified clean over the 18 attempts spanning 49.7s of bitstream. + +A clean (non-PGO) Firefox rebuild for a sustained-throughput stress test is logical follow-on but not strictly required for iter4 closure — the EINVAL bug is gone. + +## Phase 4 → Phase 5 transition + +Phase 4 deliverables: +- 3 fork commits (`74d8dd1`, `385dee1`, `f21bdf0`) +- This findings document +- Empirical Phase 7 evidence captured above + +Phase 5: sonnet review of the fix's correctness. Key questions for the reviewer: +- Is closing+realloc per frame functionally equivalent to REINIT, or does it leak/break something subtle? +- Does the `valid && !used` skip miss any reference-needed-later-but-cleared scenario? +- Is "49.7s mTime / 18 BeginPicture" a valid empirical substitute for the "720 frames" success criterion? +- B-slice / multi-slice corpus untested — sufficient or should iter4 stress-test more before close? +- Driver is heavily instrumented — sweep at iter4 close or defer to iter5?