claude-noether 793409b960 iter6 Phase 2: A∪I merge + bug class identified

Phase 1 amended: scope merged (A: cap_pool resolution-change race
+ I: Firefox VIDIOC_QBUF EINVAL) after Phase 2 telemetry showed
they're facets of the same buffer-pool / surface-recycle lifecycle
weakness.

Phase 2 findings:
- Original "S_EXT_CTRLS fails on frame 1" was transient state, does
  not reproduce on iter6-DX diagnostic build.
- Reproducible failure: OUTPUT VIDIOC_QBUF EINVAL after a varying
  number of successful frames (1, 19, 53 across three runs).
- mpv-vaapi-copy clean — single-surface recycle pattern doesn't
  trigger the race; Firefox's multi-surface MediaSource pattern does.
- DQBUF index-mismatch theory: ruled out.
- Control payload divergence: ruled out (first 64 bytes byte-identical
  between mpv and Firefox).
- Surviving hypothesis: request_fd lifecycle race — fd=30 reused on
  every frame after close, kernel-side request object may not release
  synchronously, next QBUF on REQUEST_FD=30 collides with stale state.

Phase 4 leading approach: C — extend iter4's "drain before reuse"
discipline from request_fd to OUTPUT pool slot. Mirror picture.c's
cap_pool unbind-before-rebind pattern in the OUTPUT lifecycle.

iter6-DX diagnostic build is local on ohm (/home/mfritsche/iter6-fork-dx).
Diagnostics are not committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-05 20:37:31 +00:00

Phase 2 (iter6) — situation analysis: Firefox H.264 EINVAL

Opened 2026-05-05 immediately after iter6 Phase 1 lock on candidate I.

Working hypothesis

The iter5-amend Firefox 150 binary (Utility seccomp fix shipped) loads the v4l2_request driver, sets up surfaces (cap_pool_init: 24 slots ready), then fails on either:

- VIDIOC_S_EXT_CTRLS EINVAL: observed on YouTube avc1 (Enhancer for YouTube forcing h264, multiple decode attempts)
- VIDIOC_QBUF EINVAL: observed on bbb_1080p30_h264.mp4 direct file URL (single decode attempt)

mpv --hwdec=vaapi-copy decoded 2000 frames clean on the same iter5-end driver build. So this is consumer-specific (Firefox vs mpv) and stream-specific (S_EXT_CTRLS path vs QBUF path).

Diagnostic build (iter6-DX)

Source tarball'd from the campaign fork at HEAD c8b6ede and pushed to ohm at /home/mfritsche/iter6-fork-dx/. Two diagnostics added (local-only, not committed):

- v4l2_set_controls: per-control VIDIOC_TRY_EXT_CTRLS isolation when S_EXT_CTRLS EINVAL fires. Iterates each compound control alone, escapes the kernel cluster-commit obfuscation per feedback_kernel_obfuscation_compound.md. Reinstates the iter4 diagnostic (f21bdf0) that iter5 sweep removed (848fc0c).
- v4l2_queue_buffer: full v4l2_buffer struct dump on QBUF EINVAL (type, memory, index, length, bytesused, timestamp, flags, request_fd). Untangles which field the kernel objects to.
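
The per-control isolation idea can be sketched without a device. This is a minimal illustration, not the iter6-DX patch: `struct ctrl_desc`, `try_one_fn`, `isolate_failing_control`, `reject_matching_id` and the log line format are all hypothetical names. In the real driver the callback would wrap a `VIDIOC_TRY_EXT_CTRLS` call with `count=1`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one entry of a v4l2_ext_controls array. */
struct ctrl_desc {
    uint32_t id;   /* control id */
    uint32_t size; /* payload size in bytes */
};

/* Hypothetical callback: submits exactly one control and returns 0 or a
 * negative errno. In the driver this would be a count=1 TRY_EXT_CTRLS. */
typedef int (*try_one_fn)(const struct ctrl_desc *ctrl, void *ctx);

/* Try every control alone, log each verdict, and return the index of the
 * first rejected control, or -1 if all of them pass in isolation. */
static int isolate_failing_control(const struct ctrl_desc *ctrls, size_t count,
                                   try_one_fn try_one, void *ctx)
{
    int first_bad = -1;

    assert(ctrls != NULL && try_one != NULL);
    for (size_t i = 0; i < count; i++) {
        int rc = try_one(&ctrls[i], ctx);
        fprintf(stderr, "ITER6_DX TRY ctrl[%zu] id=0x%08x size=%u rc=%d\n",
                i, (unsigned)ctrls[i].id, (unsigned)ctrls[i].size, rc);
        if (rc != 0 && first_bad < 0)
            first_bad = (int)i;
    }
    return first_bad;
}

/* Device-free stub for exercising the loop: rejects the control whose id
 * equals the uint32_t that ctx points at. */
static int reject_matching_id(const struct ctrl_desc *ctrl, void *ctx)
{
    const uint32_t *bad_id = ctx;
    return ctrl->id == *bad_id ? -EINVAL : 0;
}
```

The loop keeps going after the first rejection so that the log also records which controls passed, matching the "and which others passed" goal of the diagnostic.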

Built on ohm with meson/ninja. Output sha e01e2cc05b925cfdb2ad8e79a59ab796c90b9b20cdc4863a7c7f9c64624ed5a2. Installed at /usr/lib/dri/v4l2_request_drv_video.so. Iter5-end driver backed up at /home/mfritsche/v4l2_request_drv_video.so.iter5end.bak.

Telemetry runs

Run 1 — bbb_1080p30_h264.mp4 (with iter6-DX driver, S_EXT_CTRLS diagnostic only)

v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)
v4l2-request: Unable to queue buffer: Invalid argument

Observation: No ITER6_DX TRY-isolation lines fired → S_EXT_CTRLS is NOT failing on bbb. The QBUF EINVAL is the sole error. Single attempt, then Firefox falls back to SW.

This means the diagnostic placement was incomplete for this stream. The QBUF-side diagnostic (just authored in this Phase 2) is needed. After rebuild + re-test, we'll have the full v4l2_buffer payload at the failing call.

Run 2 — YouTube avc1 (planned)

YouTube URL: https://www.youtube.com/watch?v=7DAPd5MGodY with Enhancer for YouTube extension forcing h264 codec.

Prior iter5-amend run (without diagnostic): emitted both Unable to set control(s) and Unable to queue buffer errors across 4 cap_pool_init events (multiple decode attempts on one tab as the player retried).

Run 2 with iter6-DX driver will print the failing compound H.264 control's id + size (and which others passed). Will identify whether the issue is in SPS, PPS, SLICE_PARAMS, DECODE_PARAMS, or SCALING_MATRIX.

Why-but-not-mpv reasoning

mpv --hwdec=vaapi-copy carries its own libva client code, derived from FFmpeg's reference implementation. Firefox's Linux media stack also reaches libva through the FFmpeg path, but its surface management differs: Firefox exports surfaces as DMA-bufs (vaExportSurfaceHandle) for zero-copy, while mpv-vaapi-copy uses VAImage readback (per the -copy semantics).

Possible divergences that lead to S_EXT_CTRLS / QBUF EINVAL:

- DMA-buf attach changes the buffer's exported state: the kernel may reject QBUF on a buffer that has been EXPBUF'd
- VABufferType ordering: Firefox may submit slice data buffers before SPS/PPS for some frames
- Firefox's SPS payload builder might include a field mpv doesn't (e.g. profile_idc edge case, separate_colour_plane_flag)
- request_fd lifecycle: iter4 fixed mpv's per-frame request_fd allocation; Firefox may share request_fd across frames in a way the kernel rejects

The diagnostic output should narrow this in a single retest.

Next steps (Phase 2 → Phase 3 → Phase 4)

- Add QBUF-side diagnostic to ohm's /home/mfritsche/iter6-fork-dx/src/v4l2.c (waiting on ohm-tools MCP reconnection).
- Rebuild + reinstall iter6-DX driver on ohm.
- Re-run bbb fixture: capture full v4l2_buffer payload at QBUF failure.
- Re-run YouTube avc1: capture per-control TRY isolation output identifying failing compound H.264 control.
- Document findings as Phase 2 evidence.
- Phase 3: anchor pre-fix baseline (iter5-end driver behavior + iter5-amend Firefox + iter6-DX driver behavior) for before/after comparison.
- Phase 4: design + implement the actual semantic fix in libva-side struct setup (per feedback_no_fixture_hardcoding.md: must be a general-case fix, not a stream-specific kludge).

Stop point

Waiting on ohm-tools MCP reconnect. After that, autonomous through Phase 4 plan authoring (sonnet review at Phase 5 is mandatory per CLAUDE.md).

Phase 2 finding update (post-payload-comparison telemetry, 2026-05-05)

Working symptom shifted

Initial assumed bug: "Firefox VIDIOC_S_EXT_CTRLS EINVAL on frame 1, all controls fail individually with cluster-validation error_idx=count=1, looks like iter4 request_fd-staleness pattern reappearing in a different consumer."

Actual current bug (after payload comparison + multiple test cycles):

| Phase | mpv-vaapi-copy | Firefox |
| --- | --- | --- |
| Device-init `S_EXT_CTRLS` (DECODE_MODE/START_CODE, request_fd=-1) | rc=0 | rc=0 |
| Per-frame `S_EXT_CTRLS` (4 controls, request_fd≥0) | rc=0 sustained 2000+ frames | rc=0 for 19 frames |
| Per-frame `QBUF` (OUTPUT, request_fd) | rc=0 sustained 2000+ frames | EINVAL on frame 20 |

Failing QBUF detail (steady-state): video_fd=32 request_fd=30 type=10 memory=1 index=1 length=1 bytesused(plane0)=66661 flags=0x800000. Here type=10 is V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE, memory=1 is V4L2_MEMORY_MMAP, and flags=0x800000 is V4L2_BUF_FLAG_REQUEST_FD. The original "first-frame fail" pattern observed before this Phase 2 work seems to have been transient kernel/Firefox state: it does not reproduce on fresh runs with the iter6-DX diagnostic build.
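
For reading such dumps, a small decoder helps. The sketch below redefines the relevant constants locally (values as in the V4L2 uAPI header linux/videodev2.h) so that it builds without kernel headers. The helper names `buf_type_name`, `memory_name` and `has_request_fd_flag` are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values mirrored from linux/videodev2.h; the names are local to this sketch. */
enum {
    BUF_TYPE_VIDEO_CAPTURE_MPLANE = 9,
    BUF_TYPE_VIDEO_OUTPUT_MPLANE  = 10,
    MEMORY_MMAP                   = 1,
    MEMORY_DMABUF                 = 4
};
#define BUF_FLAG_REQUEST_FD 0x00800000u

/* Name the buffer types this investigation cares about. */
static const char *buf_type_name(uint32_t type)
{
    switch (type) {
    case BUF_TYPE_VIDEO_CAPTURE_MPLANE: return "VIDEO_CAPTURE_MPLANE";
    case BUF_TYPE_VIDEO_OUTPUT_MPLANE:  return "VIDEO_OUTPUT_MPLANE";
    default:                            return "other";
    }
}

/* Name the memory models this investigation cares about. */
static const char *memory_name(uint32_t memory)
{
    switch (memory) {
    case MEMORY_MMAP:   return "MMAP";
    case MEMORY_DMABUF: return "DMABUF";
    default:            return "other";
    }
}

/* Non-zero when the buffer was queued as part of a media request. */
static int has_request_fd_flag(uint32_t flags)
{
    return (flags & BUF_FLAG_REQUEST_FD) != 0;
}
```

Applied to the failing call, `buf_type_name(10)` names the OUTPUT multi-planar queue and `has_request_fd_flag(0x800000)` confirms that the buffer was submitted through a request.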

Payload comparison (mpv vs Firefox, frame 1)

First 64 bytes of every per-frame ext_control are byte-identical between mpv (works) and Firefox (works through frame 19, fails at QBUF on frame 20):

SPS:             4d 00 29 00 01 00 00 01 00 03 02 00 00 00 ...   (Main profile, level 4.1)
PPS:             00 00 00 00 00 00 02 00 00 00 8a 00
DECODE_PARAMS:   00 00 00 00 ... (zeroed DPB at first frame)
SCALING_MATRIX:  10 10 10 10 ... (default flat 0x10 = 16)

So the control payloads aren't where mpv and Firefox diverge.
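
The comparison itself is simple enough to sketch. The helper names below are hypothetical. The byte offsets used for profile_idc (byte 0) and level_idc (byte 2) are an assumption about the H.264 SPS control layout, chosen because they agree with the dump above (4d = Main, 29 = level 4.1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMP_WINDOW 64 /* only the first 64 bytes are compared */

/* Offset of the first differing byte within the compared window, or -1 if
 * the common prefix (capped at CMP_WINDOW bytes) is identical. */
static int first_divergence(const uint8_t *a, size_t a_len,
                            const uint8_t *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;

    if (n > CMP_WINDOW)
        n = CMP_WINDOW;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return (int)i;
    }
    return -1;
}

/* Assumed layout: byte 0 = profile_idc, byte 2 = level_idc. */
static unsigned sps_profile_idc(const uint8_t *sps) { return sps[0]; }
static unsigned sps_level_idc(const uint8_t *sps)   { return sps[2]; }
```

A return value of -1 for all per-frame controls is what "byte-identical" means here. Anything past the 64-byte window is not covered by this check.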

New hypothesis (cap_pool / source-index recycling)

mpv-vaapi-copy uses single-surface recycling — always one OUTPUT buffer index recycled in place. Firefox rotates through multiple surfaces → driver issues OUTPUT QBUF on source_index=0, 1, 2, ... over consecutive frames. By frame 20, some source_index is being requeued before the prior submission has been DQBUF'd by RequestSyncSurface, OR the buffer pool is exhausted, OR the kernel-side OUTPUT buffer state machine is rejecting recycle.

This is structurally similar to iter5 Phase 5 sonnet caveat C4 (cap_pool resolution-change race) — buffer-pool lifecycle issue. Possibly a different facet of the same root weakness: cap_pool / OUTPUT buffer drain isn't sequenced cleanly when the consumer issues fast-rotating surface IDs.

Diagnostic build status (iter6-DX)

Source tree: /home/mfritsche/iter6-fork-dx/ on ohm
Local-only patches in src/v4l2.c:
- per-control TRY isolation in v4l2_set_controls failure path
- per-call S_EXT_CTRLS rc + first-control id/value log
- first-perframe-call payload hex dump (one-shot)
- QBUF failure full v4l2_buffer struct dump
Built sha (current): 63ea0e630748cc81e98c3d3108fa5f593f60f880fa6a18a28fbafb6864740648
Installed at /usr/lib/dri/v4l2_request_drv_video.so
Iter5-end backup at /home/mfritsche/v4l2_request_drv_video.so.iter5end.bak

Diagnostics are diagnostic-only (no semantic change). Will not be committed to campaign repo.

Next steps

The OUTPUT-QBUF-on-frame-20 finding warrants:

- Cross-check: does our driver allocate enough OUTPUT buffers? Compare the REQBUFS count against the surface rotation pattern.
- Do we DQBUF OUTPUT buffers correctly between submissions? Look at picture.c QBUF path + surface.c RequestSyncSurface DQBUF path.
- Strace at the frame ~19-20 boundary to see the exact kernel ioctl sequence preceding the EINVAL.
- Check whether iter5 Phase 5 caveat C4 (cap_pool race) and this iter6-I finding share a root cause; they might be the same bug under different names.

User check-in needed: this is structurally similar to iter5 carryover candidate A (cap_pool resolution-change race). Probably worth merging iter6-I scope with candidate A or treating them as one investigation.

Operator decision (2026-05-05): merge accepted. Scope is now A∪I — the merged buffer-pool / surface-recycle lifecycle.

Phase 2 wrap (2026-05-05) — bug is intermittent, request_fd / OUTPUT-buffer race

After three test runs with the iter6-DX diagnostic build:

| Run | Successful frames | Failure point |
| --- | --- | --- |
| 1 | 1 | OUTPUT QBUF EINVAL index=3 |
| 2 | 19 | OUTPUT QBUF EINVAL index=1 |
| 3 | 53 | OUTPUT QBUF EINVAL index=0 |

Failure is intermittent: the frame number and the OUTPUT pool slot index differ on every run. Constant across all three failures: request_fd=30, type=OUTPUT_MPLANE, REQUEST_FD flag set. bytesused also varies, tracking each frame's slice payload size. mpv-vaapi-copy never hits it (verified clean ≥3 frames in this Phase 2 testing; iter5 verified 2000 frames clean).

Theories ruled out

- DQBUF returns wrong index: the diagnostic logs expected vs kernel-returned index and fired zero mismatches across 53 successful + 1 failed cycle. The kernel returns the same index our code passes.
- Control payload divergence: first 64 bytes of every per-frame ext_control byte-identical between mpv (works) and Firefox (works frame N, fails frame N+1).
- Frame-1 setup failure: original symptom does not reproduce on iter6-DX. S_EXT_CTRLS device-init call (DECODE_MODE/START_CODE) succeeds for both consumers.

Surviving hypothesis

request_fd lifecycle race. request_fd=30 is reused on every frame (Linux fd-allocation reuses lowest-free after close). The iter4 fix (385dee1) closes request_fd in RequestSyncSurface after DQBUF. But:

- If close() does not synchronously release the kernel-side request object (it may defer cleanup until the request is fully drained from any in-flight queue), the next MEDIA_IOC_REQUEST_ALLOC may return fd=30 pointing at a new request object while the OLD object is still mid-cleanup with its OUTPUT buffer attached.
- Then frame N+1's QBUF(OUTPUT, request_fd=30) attempts to attach a new buffer to a request that already has one, and the kernel rejects it with EINVAL.

Why mpv doesn't hit this: single-surface recycle means each frame's RequestSyncSurface → close(request_fd) → DQBUF → next BeginPicture cycle fully completes before the next frame's BeginPicture. Firefox's MediaSource pipeline issues multiple BeginPicture in flight (separate libva surface IDs), so a stale request_fd from a not-yet-fully-closed previous frame can collide with a new frame's allocation.
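
The fd-number half of this hypothesis is easy to confirm in isolation. The sketch below uses /dev/null instead of a media device and a hypothetical helper name, `fd_number_is_reused`. It only shows that a closed descriptor number is handed out again by the next open in a single-threaded process. Whether the kernel-side request object behind the old number is released synchronously is the open question, and this code does not test it.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open, close, and open again. Returns 1 if the second open received the
 * same descriptor number as the first, 0 if not, -1 if open() failed. */
static int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;
    close(first);

    /* POSIX allocates the lowest-numbered free descriptor, so with no other
     * thread opening files the number just closed comes straight back. */
    int second = open("/dev/null", O_RDONLY);
    if (second < 0)
        return -1;
    close(second);

    return first == second;
}
```

This is why every frame logs request_fd=30: the number itself carries no information about which request object it refers to.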

Phase 4 fix sketch

Three angles, leading candidate is C:

A. Use unique request fds, not reused-30. After media_request_alloc, dup() the fd to push it past low-fd-reuse range, OR maintain a pool of pre-allocated request fds reset via REINIT. iter4 chose close+alloc over REINIT because REINIT failed for the iter1-3 frame-11 case — but that may have been the DPB / control-payload bug, not this lifecycle race.

B. Serialize close+alloc with explicit barrier. After close(request_fd), sleep / probe before next MEDIA_IOC_REQUEST_ALLOC. Brittle.

C. Drain OUTPUT pool slot before surface re-bind (mirror cap_pool discipline). picture.c:245 unbinds CAPTURE slot if surface_object->current_slot != NULL. Add equivalent for OUTPUT: at start of BeginPicture, if the surface still owns an OUTPUT slot from a previous decode that wasn't sync'd yet, drive that slot to drain (close request_fd, DQBUF its buffer, request_pool_release) before acquiring a new slot. Mirrors iter4's "drain before reuse" discipline at the right layer.

C is the cleanest. The OUTPUT-side gap mirrors the iter4 work for request_fd; this extends the same discipline to the OUTPUT pool slot.
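
Approach C can be modeled as a small state machine. This is a device-free sketch under stated assumptions, not the planned patch: `struct output_pool`, `struct surface`, `pool_acquire`, `drain_slot`, `begin_picture` and `POOL_SLOTS` are hypothetical, and the real drain step would perform the close / DQBUF / request_pool_release sequence described above instead of flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SLOTS 4 /* hypothetical pool size for the sketch */

struct output_pool {
    bool busy[POOL_SLOTS]; /* true while a slot is owned by a surface */
    int drains;            /* number of forced drains, for observation */
};

struct surface {
    int output_slot; /* owned OUTPUT slot, or -1 if none */
    int request_fd;  /* request fd tied to that slot, or -1 if none */
};

/* Hand out the lowest free slot, or -1 if the pool is exhausted. */
static int pool_acquire(struct output_pool *pool)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!pool->busy[i]) {
            pool->busy[i] = true;
            return i;
        }
    }
    return -1;
}

/* Placeholder for the real drain: close request_fd, DQBUF the OUTPUT
 * buffer, release the slot. Here only the bookkeeping is modeled. */
static void drain_slot(struct output_pool *pool, struct surface *surf)
{
    pool->busy[surf->output_slot] = false;
    pool->drains++;
    surf->output_slot = -1;
    surf->request_fd = -1;
}

/* Drain-before-reuse: a surface that still owns a slot from an unsynced
 * decode gives it up before acquiring a new one. */
static int begin_picture(struct output_pool *pool, struct surface *surf,
                         int new_request_fd)
{
    if (surf->output_slot >= 0)
        drain_slot(pool, surf);

    int slot = pool_acquire(pool);
    if (slot < 0)
        return -1;
    surf->output_slot = slot;
    surf->request_fd = new_request_fd;
    return slot;
}
```

The property that matters is that a surface never holds two OUTPUT slots at once, so a second begin_picture on an unsynced surface costs exactly one drain and leaves one slot busy.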

Stop point (revised)

Phase 2 has identified the bug class: OUTPUT-pool / request_fd race, surfaced by Firefox's multi-surface concurrent-BeginPicture pattern that mpv-vaapi-copy doesn't hit.

Phase 4 leading approach: C — extend iter4's "drain before reuse" discipline from request_fd to OUTPUT pool slot, with the same close-then-fresh-alloc pattern applied to the OUTPUT slot lifecycle when surface state forces reuse.

Phase 5 sonnet review will be mandatory before any commit (per CLAUDE.md user-global rule).
