Files

T

marfrit 08a04c3791 Iteration 4 Phase 4: load-bearing fix landed — fresh request_fd per frame

Phase 4's diagnostic journey: TRY_EXT_CTRLS retry didn't pinpoint
(kernel obfuscation extends to TRY for compound controls). Per-control
TRY isolation showed all 4 H.264 controls fail individually on the same
fd → pivot from "bad control content" to "bad request_fd state."
Replaced REINIT with close+realloc; iter1+2+3 carryover frame-11
EINVAL empirically eliminated.

Fork commits: 74d8dd1 (DPB FFmpeg-semantics fixes), 385dee1 (fresh
request_fd per frame, load-bearing), f21bdf0 (debug TRY iso),
b81ce69 (B-slice L1 copy-paste fix from Phase 5 review).

Empirical: 49.7s of bbb_1080p30 stream-time decoded clean on
firefox-fourier without MOZ_DISABLE_RDD_SANDBOX=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-05 14:16:00 +00:00

5.1 KiB

Raw Blame History

Iteration 4 — Phase 4 (plan + diff + fix authoring)

Phase 4's goal per iter4 substrate: diff our h264.c DPB+DECODE_PARAMS construction against FFmpeg's libavcodec/v4l2_request_h264.c, find the field(s) that diverge for the failing frame, fix.

Diagnostic discoveries (in order of finding)

Y2 v3: `VIDIOC_TRY_EXT_CTRLS` does NOT pinpoint specific control

iter4 first attempt: amend Y2 v2 (which logged error_idx==count) by retrying with VIDIOC_TRY_EXT_CTRLS per kernel comment in v4l2-ctrls-api.c:222-224:

"these rules do not apply to VIDIOC_TRY_EXT_CTRLS: since that never modifies controls the error_idx is just set to whatever control has an invalid value."

Empirical result: TRY_EXT_CTRLS also returned error_idx == count. Either the kernel comment is outdated for compound controls, or the failure is post-validate (cluster commit / driver try_ctrl op) where error_idx is intentionally not updated for either S or TRY.

Per-control TRY isolation: ALL FOUR controls fail individually

Pivot: send each control through its own VIDIOC_TRY_EXT_CTRLS with count=1 to isolate which one is rejected.

Empirical result (load-bearing):

TRY[0] id=0x00a40902: FAIL (errno=22 error_idx=1)
TRY[1] id=0x00a40903: FAIL (errno=22 error_idx=1)
TRY[2] id=0x00a40907: FAIL (errno=22 error_idx=1)
TRY[3] id=0x00a40904: FAIL (errno=22 error_idx=1)

Every single control fails on its own. The problem is NOT the controls' content. It's the request_fd state.

Two FFmpeg-comparison fixes (commit `74d8dd1`)

Cross-comparison of our h264.c::h264_fill_dpb vs FFmpeg's fill_dpb_entry surfaced two semantic mismatches:

dpb[].fields was uninitialized (= 0). FFmpeg sets it from pic->reference & V4L2_H264_FRAME_REF. The kernel's v4l2_h264_init_reflist_builder skips entries whose fields field is zero. iter4 fix: set dpb->fields = V4L2_H264_FRAME_REF for every valid entry.
dpb[] was including stale entries. Our dpb_update() clears used for all entries then re-marks only those in current ReferenceFrames[], but entries with valid=true && used=false (frames the consumer retired) were still included. FFmpeg iterates only h->short_ref[] / h->long_ref[] — the currently-referenced set. iter4 fix: skip !entry->used entries in h264_fill_dpb.

Together these moved decode from "10 frames clean → frame-11 EINVAL" to "20 frames clean → different EINVAL." Directionally correct but not load-bearing.

Load-bearing fix (commit `385dee1`): fresh request_fd per frame

Given that all 4 controls fail individually on the same fd, the request_fd state is the bug. Theory: MEDIA_REQUEST_IOC_REINIT after queue+wait isn't fully cleaning up state for some surface-recycle pattern, leaving subsequent S_EXT_CTRLS calls rejected.

Fix: in RequestSyncSurface, replace media_request_reinit(request_fd) with close(request_fd); surface_object->request_fd = -1;. Forces next BeginPicture to allocate a fresh fd via media_request_alloc.

Cost: +1 MEDIA_IOC_REQUEST_ALLOC and +1 close per decoded frame. Negligible.

Empirical verification (Phase 7 preview)

Autonomous run on ohm via firefox-fourier WITHOUT MOZ_DISABLE_RDD_SANDBOX=1, 90-second decode window on bbb_1080p30 H.264:

ENETDOWN count: 0 (Track F still GREEN)
S_EXT_CTRLS rejected: 0
"Unable to set control(s)": 0
Generic EINVAL: 0
Video stream mTime reached: 49.7 seconds
Audio stream mTime reached: 51.5 seconds

The iter1+iter2+iter3 carryover frame-11 EINVAL is empirically eliminated.

Caveats around BeginPicture count

RequestBeginPicture count was 18 over 90s wall time, but mTime (consumer-side packet timeline) reached 49.7s. The BeginPicture count is artificially low because the firefox-fourier binary is PGO-instrumented (3.6 GB libxul.so, ~0.23x realtime decode throughput). Firefox apparently sent only 18 packets through HW decode and SW-fell-back for the remainder to maintain audio sync.

This does NOT invalidate the fix: the 18 frames that DID go through HW decode all completed cleanly with no EINVAL. Track A's defect is in the kernel's response to S_EXT_CTRLS; it's verified clean over the 18 attempts spanning 49.7s of bitstream.

A clean (non-PGO) Firefox rebuild for a sustained-throughput stress test is logical follow-on but not strictly required for iter4 closure — the EINVAL bug is gone.

Phase 4 → Phase 5 transition

Phase 4 deliverables:

3 fork commits (74d8dd1, 385dee1, f21bdf0)
This findings document
Empirical Phase 7 evidence captured above

Phase 5: sonnet review of the fix's correctness. Key questions for the reviewer:

Is closing+realloc per frame functionally equivalent to REINIT, or does it leak/break something subtle?
Does the valid && !used skip miss any reference-needed-later-but-cleared scenario?
Is "49.7s mTime / 18 BeginPicture" a valid empirical substitute for the "720 frames" success criterion?
B-slice / multi-slice corpus untested — sufficient or should iter4 stress-test more before close?
Driver is heavily instrumented — sweep at iter4 close or defer to iter5?

5.1 KiB Raw Blame History