Phase 4's diagnostic journey: TRY_EXT_CTRLS retry didn't pinpoint (kernel obfuscation extends to TRY for compound controls). Per-control TRY isolation showed all 4 H.264 controls fail individually on the same fd → pivot from "bad control content" to "bad request_fd state." Replaced REINIT with close+realloc; iter1+2+3 carryover frame-11 EINVAL empirically eliminated. Fork commits: 74d8dd1 (DPB FFmpeg-semantics fixes), 385dee1 (fresh request_fd per frame, load-bearing), f21bdf0 (debug TRY iso), b81ce69 (B-slice L1 copy-paste fix from Phase 5 review). Empirical: 49.7s of bbb_1080p30 stream-time decoded clean on firefox-fourier without MOZ_DISABLE_RDD_SANDBOX=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.1 KiB
Iteration 4 — Phase 4 (plan + diff + fix authoring)
Phase 4's goal per iter4 substrate: diff our h264.c DPB+DECODE_PARAMS construction against FFmpeg's libavcodec/v4l2_request_h264.c, find the field(s) that diverge for the failing frame, fix.
Diagnostic discoveries (in order of finding)
Y2 v3: VIDIOC_TRY_EXT_CTRLS does NOT pinpoint specific control
iter4 first attempt: amend Y2 v2 (which logged error_idx==count) by retrying with VIDIOC_TRY_EXT_CTRLS per kernel comment in v4l2-ctrls-api.c:222-224:
"these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value."
Empirical result: TRY_EXT_CTRLS also returned error_idx == count. Either the kernel comment is outdated for compound controls, or the failure is post-validate (cluster commit / driver try_ctrl op) where error_idx is intentionally not updated for either S or TRY.
Per-control TRY isolation: ALL FOUR controls fail individually
Pivot: send each control through its own VIDIOC_TRY_EXT_CTRLS with count=1 to isolate which one is rejected.
Empirical result (load-bearing):
TRY[0] id=0x00a40902: FAIL (errno=22 error_idx=1)
TRY[1] id=0x00a40903: FAIL (errno=22 error_idx=1)
TRY[2] id=0x00a40907: FAIL (errno=22 error_idx=1)
TRY[3] id=0x00a40904: FAIL (errno=22 error_idx=1)
Every single control fails on its own. The problem is NOT the controls' content. It's the request_fd state.
Two FFmpeg-comparison fixes (commit 74d8dd1)
Cross-comparison of our h264.c::h264_fill_dpb vs FFmpeg's fill_dpb_entry surfaced two semantic mismatches:
-
dpb[].fieldswas uninitialized (= 0). FFmpeg sets it frompic->reference & V4L2_H264_FRAME_REF. The kernel'sv4l2_h264_init_reflist_builderskips entries whosefieldsfield is zero. iter4 fix: setdpb->fields = V4L2_H264_FRAME_REFfor every valid entry. -
dpb[]was including stale entries. Ourdpb_update()clearsusedfor all entries then re-marks only those in currentReferenceFrames[], but entries withvalid=true&&used=false(frames the consumer retired) were still included. FFmpeg iterates onlyh->short_ref[]/h->long_ref[]— the currently-referenced set. iter4 fix: skip!entry->usedentries inh264_fill_dpb.
Together these moved decode from "10 frames clean → frame-11 EINVAL" to "20 frames clean → different EINVAL." Directionally correct but not load-bearing.
Load-bearing fix (commit 385dee1): fresh request_fd per frame
Given that all 4 controls fail individually on the same fd, the request_fd state is the bug. Theory: MEDIA_REQUEST_IOC_REINIT after queue+wait isn't fully cleaning up state for some surface-recycle pattern, leaving subsequent S_EXT_CTRLS calls rejected.
Fix: in RequestSyncSurface, replace media_request_reinit(request_fd) with close(request_fd); surface_object->request_fd = -1;. Forces next BeginPicture to allocate a fresh fd via media_request_alloc.
Cost: +1 MEDIA_IOC_REQUEST_ALLOC and +1 close per decoded frame. Negligible.
Empirical verification (Phase 7 preview)
Autonomous run on ohm via firefox-fourier WITHOUT MOZ_DISABLE_RDD_SANDBOX=1, 90-second decode window on bbb_1080p30 H.264:
- ENETDOWN count: 0 (Track F still GREEN)
- S_EXT_CTRLS rejected: 0
- "Unable to set control(s)": 0
- Generic EINVAL: 0
- Video stream mTime reached: 49.7 seconds
- Audio stream mTime reached: 51.5 seconds
The iter1+iter2+iter3 carryover frame-11 EINVAL is empirically eliminated.
Caveats around BeginPicture count
RequestBeginPicture count was 18 over 90s wall time, but mTime (consumer-side packet timeline) reached 49.7s. The BeginPicture count is artificially low because the firefox-fourier binary is PGO-instrumented (3.6 GB libxul.so, ~0.23x realtime decode throughput). Firefox apparently sent only 18 packets through HW decode and SW-fell-back for the remainder to maintain audio sync.
This does NOT invalidate the fix: the 18 frames that DID go through HW decode all completed cleanly with no EINVAL. Track A's defect is in the kernel's response to S_EXT_CTRLS; it's verified clean over the 18 attempts spanning 49.7s of bitstream.
A clean (non-PGO) Firefox rebuild for a sustained-throughput stress test is logical follow-on but not strictly required for iter4 closure — the EINVAL bug is gone.
Phase 4 → Phase 5 transition
Phase 4 deliverables:
- 3 fork commits (
74d8dd1,385dee1,f21bdf0) - This findings document
- Empirical Phase 7 evidence captured above
Phase 5: sonnet review of the fix's correctness. Key questions for the reviewer:
- Is closing+realloc per frame functionally equivalent to REINIT, or does it leak/break something subtle?
- Does the
valid && !usedskip miss any reference-needed-later-but-cleared scenario? - Is "49.7s mTime / 18 BeginPicture" a valid empirical substitute for the "720 frames" success criterion?
- B-slice / multi-slice corpus untested — sufficient or should iter4 stress-test more before close?
- Driver is heavily instrumented — sweep at iter4 close or defer to iter5?