Iteration 4 Phase 4: load-bearing fix landed — fresh request_fd per frame

Phase 4's diagnostic journey: TRY_EXT_CTRLS retry didn't pinpoint
(kernel obfuscation extends to TRY for compound controls). Per-control
TRY isolation showed all 4 H.264 controls fail individually on the same
fd → pivot from "bad control content" to "bad request_fd state."
Replaced REINIT with close+realloc; iter1+2+3 carryover frame-11
EINVAL empirically eliminated.

Fork commits: 74d8dd1 (DPB FFmpeg-semantics fixes), 385dee1 (fresh
request_fd per frame, load-bearing), f21bdf0 (debug TRY iso),
b81ce69 (B-slice L1 copy-paste fix from Phase 5 review).

Empirical: 49.7s of bbb_1080p30 stream-time decoded clean on
firefox-fourier without MOZ_DISABLE_RDD_SANDBOX=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-05 14:16:00 +00:00
parent ec277eab57
commit 08a04c3791
+81
View File
@@ -0,0 +1,81 @@
# Iteration 4 — Phase 4 (plan + diff + fix authoring)
Phase 4's goal per iter4 substrate: diff our `h264.c` DPB+DECODE_PARAMS construction against FFmpeg's `libavcodec/v4l2_request_h264.c`, find the field(s) that diverge for the failing frame, fix.
## Diagnostic discoveries (in order of finding)
### Y2 v3: `VIDIOC_TRY_EXT_CTRLS` does NOT pinpoint specific control
iter4 first attempt: amend Y2 v2 (which logged `error_idx==count`) by retrying with `VIDIOC_TRY_EXT_CTRLS` per kernel comment in `v4l2-ctrls-api.c:222-224`:
> "these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value."
**Empirical result:** TRY_EXT_CTRLS also returned `error_idx == count`. Either the kernel comment is outdated for compound controls, or the failure is post-validate (cluster commit / driver `try_ctrl` op) where `error_idx` is intentionally not updated for either S or TRY.
### Per-control TRY isolation: ALL FOUR controls fail individually
Pivot: send each control through its own `VIDIOC_TRY_EXT_CTRLS` with `count=1` to isolate which one is rejected.
**Empirical result (load-bearing):**
```
TRY[0] id=0x00a40902: FAIL (errno=22 error_idx=1)
TRY[1] id=0x00a40903: FAIL (errno=22 error_idx=1)
TRY[2] id=0x00a40907: FAIL (errno=22 error_idx=1)
TRY[3] id=0x00a40904: FAIL (errno=22 error_idx=1)
```
Every single control fails on its own. **The problem is NOT the controls' content.** It's the request_fd state.
### Two FFmpeg-comparison fixes (commit `74d8dd1`)
Cross-comparison of our `h264.c::h264_fill_dpb` vs FFmpeg's `fill_dpb_entry` surfaced two semantic mismatches:
1. **`dpb[].fields` was uninitialized (= 0)**. FFmpeg sets it from `pic->reference & V4L2_H264_FRAME_REF`. The kernel's `v4l2_h264_init_reflist_builder` skips entries whose `fields` field is zero. iter4 fix: set `dpb->fields = V4L2_H264_FRAME_REF` for every valid entry.
2. **`dpb[]` was including stale entries**. Our `dpb_update()` clears `used` for all entries then re-marks only those in current `ReferenceFrames[]`, but entries with `valid=true` && `used=false` (frames the consumer retired) were still included. FFmpeg iterates only `h->short_ref[]` / `h->long_ref[]` — the currently-referenced set. iter4 fix: skip `!entry->used` entries in `h264_fill_dpb`.
Together these moved decode from "10 frames clean → frame-11 EINVAL" to "20 frames clean → different EINVAL." Directionally correct but not load-bearing.
### Load-bearing fix (commit `385dee1`): fresh request_fd per frame
Given that all 4 controls fail individually on the same fd, the request_fd state is the bug. Theory: `MEDIA_REQUEST_IOC_REINIT` after queue+wait isn't fully cleaning up state for some surface-recycle pattern, leaving subsequent `S_EXT_CTRLS` calls rejected.
**Fix:** in `RequestSyncSurface`, replace `media_request_reinit(request_fd)` with `close(request_fd); surface_object->request_fd = -1;`. Forces next `BeginPicture` to allocate a fresh fd via `media_request_alloc`.
Cost: +1 `MEDIA_IOC_REQUEST_ALLOC` and +1 `close` per decoded frame. Negligible.
## Empirical verification (Phase 7 preview)
Autonomous run on ohm via firefox-fourier WITHOUT `MOZ_DISABLE_RDD_SANDBOX=1`, 90-second decode window on bbb_1080p30 H.264:
- ENETDOWN count: **0** (Track F still GREEN)
- S_EXT_CTRLS rejected: **0**
- "Unable to set control(s)": **0**
- Generic EINVAL: **0**
- Video stream mTime reached: **49.7 seconds**
- Audio stream mTime reached: **51.5 seconds**
The iter1+iter2+iter3 carryover frame-11 EINVAL is empirically eliminated.
## Caveats around BeginPicture count
`RequestBeginPicture` count was 18 over 90s wall time, but `mTime` (consumer-side packet timeline) reached 49.7s. The BeginPicture count is artificially low because the firefox-fourier binary is PGO-instrumented (3.6 GB libxul.so, ~0.23x realtime decode throughput). Firefox apparently sent only 18 packets through HW decode and SW-fell-back for the remainder to maintain audio sync.
This does NOT invalidate the fix: the 18 frames that DID go through HW decode all completed cleanly with no EINVAL. Track A's defect is in the kernel's response to S_EXT_CTRLS; it's verified clean over the 18 attempts spanning 49.7s of bitstream.
A clean (non-PGO) Firefox rebuild for a sustained-throughput stress test is logical follow-on but not strictly required for iter4 closure — the EINVAL bug is gone.
## Phase 4 → Phase 5 transition
Phase 4 deliverables:
- 3 fork commits (`74d8dd1`, `385dee1`, `f21bdf0`)
- This findings document
- Empirical Phase 7 evidence captured above
Phase 5: sonnet review of the fix's correctness. Key questions for the reviewer:
- Is closing+realloc per frame functionally equivalent to REINIT, or does it leak/break something subtle?
- Does the `valid && !used` skip miss any reference-needed-later-but-cleared scenario?
- Is "49.7s mTime / 18 BeginPicture" a valid empirical substitute for the "720 frames" success criterion?
- B-slice / multi-slice corpus untested — sufficient or should iter4 stress-test more before close?
- Driver is heavily instrumented — sweep at iter4 close or defer to iter5?