iter6 Phase 2: A∪I merge + bug class identified
Phase 1 amended: scope merged (A: cap_pool resolution-change race + I: Firefox VIDIOC_QBUF EINVAL) after Phase 2 telemetry showed they're facets of the same buffer-pool / surface-recycle lifecycle weakness.

Phase 2 findings:
- Original "S_EXT_CTRLS fails on frame 1" was transient state, does not reproduce on iter6-DX diagnostic build.
- Reproducible failure: OUTPUT VIDIOC_QBUF EINVAL after a varying number of successful frames (1, 19, 53 across three runs).
- mpv-vaapi-copy clean — single-surface recycle pattern doesn't trigger the race; Firefox's multi-surface MediaSource pattern does.
- DQBUF index-mismatch theory: ruled out.
- Control payload divergence: ruled out (first 64 bytes byte-identical between mpv and Firefox).
- Surviving hypothesis: request_fd lifecycle race — fd=30 reused on every frame after close, kernel-side request object may not release synchronously, next QBUF on REQUEST_FD=30 collides with stale state.

Phase 4 leading approach: C — extend iter4's "drain before reuse" discipline from request_fd to OUTPUT pool slot. Mirror picture.c's cap_pool unbind-before-rebind pattern in the OUTPUT lifecycle.

iter6-DX diagnostic build is local on ohm (/home/mfritsche/iter6-fork-dx). Diagnostics are not committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+16
-11
@@ -140,13 +140,17 @@ Same as iter5. Plus for new candidates:
 - For E (perf): `pidstat -u` for CPU%, Mali-G52 freq via `/sys/class/devfreq/fde60000.gpu`.
 - For F (MPEG-2): need an MPEG-2 fixture (`mpv --dump-stream` from a public DVD or transcode bbb to MPEG-2).

-## In-scope (LOCKED 2026-05-05 for iteration 6) — I (Firefox VIDIOC_S_EXT_CTRLS / VIDIOC_QBUF EINVAL)
+## In-scope (LOCKED 2026-05-05 for iteration 6) — A∪I (cap_pool / OUTPUT-buffer-recycle lifecycle)

-Operator locked candidate **I** (Firefox VIDIOC_QBUF EINVAL on first frame, enriched post-iter5-amend telemetry to also include S_EXT_CTRLS EINVAL).
+Operator locked candidate **I**, then **merged with candidate A** after Phase 2 telemetry (2026-05-05) showed they are facets of the same underlying bug:

-Why I: deterministic repro on two consumers (bbb fixture + YouTube avc1), narrowest scope of any candidate, gates the only known consumer hard-failure under the post-amendment binary. Also: iter5-amend just unblocked Firefox's path; closing this completes the Firefox HW-decode story end-to-end.
+- **A (iter5 sonnet C4 carryover)**: cap_pool resolution-change race — REQBUFS-EBUSY when CAPTURE pool isn't drained before re-allocation. mpv libplacebo `--vo=gpu` triggers it via Vulkan-fallback resolution change.

-Other candidates (A, B, C, D, E, F, G) deferred to iter7+. H (fourier-fresnel) remains separate top-level campaign.
+- **I (iter6 candidate)**: Firefox `VIDIOC_QBUF` EINVAL on OUTPUT after ~19 successful S_EXT_CTRLS calls. The original "S_EXT_CTRLS EINVAL on frame 1" framing was transient state; the reproducible failure is at OUTPUT-buffer requeue with rotating `source_index`. mpv-vaapi-copy (single-surface recycle) doesn't hit this; Firefox (multi-surface rotation through libva) does.
+
+Why merge: both are buffer-pool / surface-recycle lifecycle issues at OUTPUT (and CAPTURE) drain ordering. Partial fixes risk just shifting the symptom. The intended Phase 4 fix is a single coherent rework of cap_pool + DQBUF/QBUF sequencing in the surface lifecycle.
+
+Other candidates (B, C, D, E, F, G) deferred to iter7+. H (fourier-fresnel) remains separate top-level campaign.

 ## Out-of-scope (LOCKED 2026-05-05 for iteration 6)

@@ -156,16 +160,17 @@ Other candidates (A, B, C, D, E, F, G) deferred to iter7+. H (fourier-fresnel) r
 - New codecs OUTSIDE H.264 / MPEG-2 (VP8/VP9/AV1/HEVC out per iter1 lock).
 - New target hardware (fresnel, ampere) — separate campaign (H above).

-## Phase 1 success criterion (LOCKED 2026-05-05 for iteration 6)
+## Phase 1 success criterion (LOCKED 2026-05-05 for iteration 6, AMENDED 2026-05-05 for A∪I merge)

-> Firefox 150 (iter5-amend, sandbox enabled, `LIBVA_DRIVER_NAME=v4l2_request`) plays a known-h264 fixture (`bbb_1080p30_h264.mp4`) for ≥30 seconds with HW decode actually engaged: `cap_pool_init` succeeds, **zero `Unable to set control(s)` and zero `Unable to queue buffer`** in driver stderr, `lsof /dev/video1` shows the Firefox Utility process holding the device throughout playback, frames advance, no SW fallback in `about:support`'s "Decoder Backend" fields.
+> All three consumer paths must be GREEN on the iter6 driver:
 >
-> Acceptance evidence (capture all three):
-> 1. Driver stderr lines: only the single per-process `cap_pool_init: 24 slots ready` log, no per-frame errors.
-> 2. `lsof /dev/video1` snapshot at t=15s into playback shows a Firefox process (PID parent or descendant of the launcher) with the device open.
-> 3. about:support's media decoder section names `vaapi`-or-equivalent for video/h264, not `ffvpx`.
+> 1. **Firefox** (iter5-amend binary, sandbox enabled, `LIBVA_DRIVER_NAME=v4l2_request`) plays `bbb_1080p30_h264.mp4` for ≥30 seconds with HW decode engaged: zero `Unable to queue buffer` / `Unable to set control(s)` per-frame, `lsof /dev/video1` shows the Firefox Utility process holding the device, frames advance.
 >
-> Phase 5 sonnet review must explicitly confirm that the fix is on the libva-side (or jointly libva + a Firefox-side patch), and that mpv-vaapi-copy 2000-frame test still GREEN (no regression introduced).
+> 2. **mpv libplacebo `--vo=gpu`** runs ≥30 seconds on the same fixture without segfault and without REQBUFS-EBUSY events at init or resolution-change boundaries (carries iter5 sonnet C4 caveat to closure).
+>
+> 3. **mpv `--hwdec=vaapi-copy`** (regression check) decodes 2000 frames clean, identical pattern to iter5-end driver baseline (sha `4bed52ec5d44b389…`).
+>
+> Phase 5 sonnet review must confirm: (a) fix is libva-side (not consumer-specific kludge), (b) all three consumer paths verified, (c) no new mutable global state introduced (Track E discipline).

 ## Phase 1 LOCKED. Iteration 6 proceeds.

@@ -0,0 +1,174 @@
# Phase 2 (iter6) — situation analysis: Firefox H.264 EINVAL

Opened 2026-05-05 immediately after iter6 Phase 1 lock on candidate I.

## Working hypothesis

The iter5-amend Firefox 150 binary (Utility seccomp fix shipped) loads the v4l2_request driver, sets up surfaces (`cap_pool_init: 24 slots ready`), then fails on either:
- `VIDIOC_S_EXT_CTRLS` EINVAL — observed on YouTube avc1 (Enhancer for YouTube forcing h264, multiple decode attempts)
- `VIDIOC_QBUF` EINVAL — observed on `bbb_1080p30_h264.mp4` direct file URL (single decode attempt)

mpv `--hwdec=vaapi-copy` decoded 2000 frames clean on the same iter5-end driver build. So this is consumer-specific (Firefox vs mpv) and stream-specific (S_EXT_CTRLS path vs QBUF path).

## Diagnostic build (iter6-DX)

Source tarball'd from the campaign fork at HEAD `c8b6ede` and pushed to ohm at `/home/mfritsche/iter6-fork-dx/`. Two diagnostics added (local-only, not committed):

1. `v4l2_set_controls`: per-control `VIDIOC_TRY_EXT_CTRLS` isolation when S_EXT_CTRLS EINVAL fires. Iterates each compound control alone, escapes the kernel cluster-commit obfuscation per `feedback_kernel_obfuscation_compound.md`. Reinstates the iter4 diagnostic (`f21bdf0`) that iter5 sweep removed (`848fc0c`).

2. `v4l2_queue_buffer`: full `v4l2_buffer` struct dump on QBUF EINVAL — type, memory, index, length, bytesused, timestamp, flags, request_fd. Untangles which field the kernel objects to.

Built on ohm with meson/ninja. Output sha `e01e2cc05b925cfdb2ad8e79a59ab796c90b9b20cdc4863a7c7f9c64624ed5a2`. Installed at `/usr/lib/dri/v4l2_request_drv_video.so`. Iter5-end driver backed up at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`.

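The per-control isolation in item 1 amounts to retrying each control on its own, so that a cluster-level failure can be pinned to individual controls. A minimal sketch of that loop in Python, not the driver's C code: `try_one` is a hypothetical callback standing in for a single-control `VIDIOC_TRY_EXT_CTRLS` call, and the control names and payloads are illustrative labels only.

```python
import errno


def isolate_failing_controls(controls, try_one):
    """Try each control alone; return {name: errno} for the ones that fail."""
    failures = {}
    for name, payload in controls.items():
        try:
            try_one(name, payload)
        except OSError as exc:
            failures[name] = exc.errno
    return failures


# Usage with a fake "kernel" that rejects one control (purely illustrative).
def fake_try(name, payload):
    if name == "DECODE_PARAMS":
        raise OSError(errno.EINVAL, "Invalid argument")


controls = {"SPS": b"\x4d\x00\x29", "PPS": b"\x00", "DECODE_PARAMS": b"\x00"}
result = isolate_failing_controls(controls, fake_try)
```

Reporting the controls that passed alongside the failures is what separates a single bad payload from a cluster-wide rejection.
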
## Telemetry runs

### Run 1 — bbb_1080p30_h264.mp4 (with iter6-DX driver, S_EXT_CTRLS diagnostic only)

```
v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)
v4l2-request: Unable to queue buffer: Invalid argument
```

**Observation:** No ITER6_DX TRY-isolation lines fired → S_EXT_CTRLS is NOT failing on bbb. The QBUF EINVAL is the sole error. Single attempt, then Firefox falls back to SW.

This means the diagnostic placement was incomplete for this stream. The QBUF-side diagnostic (just authored in this Phase 2) is needed. After rebuild + re-test, we'll have the full `v4l2_buffer` payload at the failing call.

### Run 2 — YouTube avc1 (planned)

YouTube URL: `https://www.youtube.com/watch?v=7DAPd5MGodY` with Enhancer for YouTube extension forcing h264 codec.

Prior iter5-amend run (without diagnostic): emitted both `Unable to set control(s)` and `Unable to queue buffer` errors across 4 cap_pool_init events (multiple decode attempts on one tab as the player retried).

Run 2 with iter6-DX driver will print the failing compound H.264 control's id + size (and which others passed). Will identify whether the issue is in SPS, PPS, SLICE_PARAMS, DECODE_PARAMS, or SCALING_MATRIX.

## Why-but-not-mpv reasoning

mpv `--hwdec=vaapi-copy` writes its own libva client code inheriting from FFmpeg's reference implementation. Firefox's Linux media stack uses the libva wrapper through the FFmpeg path too, but the surface-management is different: Firefox attaches DMA-buf VAExportSurfaceHandle for zero-copy, while mpv-vaapi-copy uses VAImage readback (per the `-copy` semantics).

Possible divergences that lead to S_EXT_CTRLS / QBUF EINVAL:
- DMA-buf attach changes the buffer's exported state — kernel may reject QBUF on a buffer that has been EXPBUF'd
- VABufferType ordering — Firefox may submit slice data buffers before SPS/PPS for some frames
- Firefox's SPS payload builder might include a field mpv doesn't (e.g. profile_idc edge case, separate_colour_plane_flag)
- request_fd lifecycle: iter4 fixed mpv's per-frame request_fd allocation; Firefox may share request_fd across frames in a way the kernel rejects

The diagnostic output should narrow this in a single retest.

## Next steps (Phase 2 → Phase 3 → Phase 4)

1. Add QBUF-side diagnostic to ohm's `/home/mfritsche/iter6-fork-dx/src/v4l2.c` (waiting on ohm-tools MCP reconnection).
2. Rebuild + reinstall iter6-DX driver on ohm.
3. Re-run bbb fixture: capture full v4l2_buffer payload at QBUF failure.
4. Re-run YouTube avc1: capture per-control TRY isolation output identifying failing compound H.264 control.
5. Document findings as Phase 2 evidence.
6. Phase 3: anchor pre-fix baseline (iter5-end driver behavior + iter5-amend Firefox + iter6-DX driver behavior) for before/after comparison.
7. Phase 4: design + implement the actual semantic fix in libva-side struct setup (per `feedback_no_fixture_hardcoding.md`: must be a general-case fix, not a stream-specific kludge).

## Stop point

Waiting on ohm-tools MCP reconnect. After that, autonomous through Phase 4 plan authoring (sonnet review at Phase 5 is mandatory per CLAUDE.md).

## Phase 2 finding update (post-payload-comparison telemetry, 2026-05-05)

### Working symptom shifted

Initial assumed bug: "Firefox VIDIOC_S_EXT_CTRLS EINVAL on frame 1, all controls fail individually with cluster-validation error_idx=count=1, looks like iter4 request_fd-staleness pattern reappearing in a different consumer."

Actual current bug (after payload comparison + multiple test cycles):

| Phase | mpv-vaapi-copy | Firefox |
|---|---|---|
| Device-init `S_EXT_CTRLS` (DECODE_MODE/START_CODE, request_fd=-1) | rc=0 | rc=0 |
| Per-frame `S_EXT_CTRLS` (4 controls, request_fd≥0) | rc=0 sustained 2000+ frames | rc=0 for 19 frames |
| Per-frame `QBUF` (OUTPUT, request_fd) | rc=0 sustained 2000+ frames | **EINVAL on frame 20** |

Failing QBUF detail (steady-state): `video_fd=32 request_fd=30 type=10 memory=1 index=1 length=1 bytesused(plane0)=66661 flags=0x800000`. type=10=`V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE`, flags `V4L2_BUF_FLAG_REQUEST_FD`. Original "first-frame fail" pattern observed before this Phase 2 work seems to have been transient kernel/Firefox state — does not reproduce on fresh runs with the iter6-DX diagnostic build.

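The numeric fields in the failing-QBUF dump can be decoded mechanically against the V4L2 UAPI constants. A minimal sketch in Python: the constant values are a small subset copied from `linux/videodev2.h` as I understand that header, and `decode_qbuf_dump` is a hypothetical helper name, not something in the driver.

```python
# Subset of constants from linux/videodev2.h (values assumed from the UAPI header).
V4L2_BUF_TYPES = {
    9: "V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE",
    10: "V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE",
}
V4L2_MEMORY = {
    1: "V4L2_MEMORY_MMAP",
    2: "V4L2_MEMORY_USERPTR",
    4: "V4L2_MEMORY_DMABUF",
}
V4L2_BUF_FLAGS = {
    0x00000001: "V4L2_BUF_FLAG_MAPPED",
    0x00000002: "V4L2_BUF_FLAG_QUEUED",
    0x00000004: "V4L2_BUF_FLAG_DONE",
    0x00000040: "V4L2_BUF_FLAG_ERROR",
    0x00000080: "V4L2_BUF_FLAG_IN_REQUEST",
    0x00800000: "V4L2_BUF_FLAG_REQUEST_FD",
}


def decode_qbuf_dump(buf_type, memory, flags):
    """Translate the raw type/memory/flags integers into symbolic names."""
    known_mask = 0
    for bit in V4L2_BUF_FLAGS:
        known_mask |= bit
    return {
        "type": V4L2_BUF_TYPES.get(buf_type, f"unknown({buf_type})"),
        "memory": V4L2_MEMORY.get(memory, f"unknown({memory})"),
        "flags": [name for bit, name in sorted(V4L2_BUF_FLAGS.items()) if flags & bit],
        # Bits not in the table above are reported raw rather than guessed.
        "unknown_flag_bits": flags & ~known_mask,
    }


decoded = decode_qbuf_dump(buf_type=10, memory=1, flags=0x800000)
```

For the logged values this yields an MMAP OUTPUT_MPLANE buffer with only the REQUEST_FD flag set, which matches the reading given in the text.
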
### Payload comparison (mpv vs Firefox, frame 1)

First 64 bytes of every per-frame ext_control are **byte-identical** between mpv (works) and Firefox (works through frame 19, fails at QBUF on frame 20):

```
SPS: 4d 00 29 00 01 00 00 01 00 03 02 00 00 00 ... (Main profile, level 4.1)
PPS: 00 00 00 00 00 00 02 00 00 00 8a 00
DECODE_PARAMS: 00 00 00 00 ... (zeroed DPB at first frame)
SCALING_MATRIX: 10 10 10 10 ... (default flat 0x10 = 16)
```

So the *control payloads* aren't where mpv and Firefox diverge.

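The "Main profile, level 4.1" annotation on the SPS line can be checked from the first three bytes of the dump. A minimal sketch in Python, assuming the dump starts at the beginning of `struct v4l2_ctrl_h264_sps` and that its first three one-byte fields are `profile_idc`, `constraint_set_flags` and `level_idc`; the helper names are hypothetical.

```python
H264_PROFILES = {66: "Baseline", 77: "Main", 100: "High"}


def decode_sps_prefix(hex_dump):
    """Decode profile and level from the leading bytes of an SPS control dump."""
    raw = bytes.fromhex(hex_dump)
    profile_idc, constraint_set_flags, level_idc = raw[0], raw[1], raw[2]
    return {
        "profile": H264_PROFILES.get(profile_idc, f"unknown({profile_idc})"),
        "constraint_set_flags": constraint_set_flags,
        # level_idc is ten times the level number, e.g. 41 -> 4.1
        "level": f"{level_idc // 10}.{level_idc % 10}",
    }


def same_prefix(a, b, n=64):
    """True when the first n bytes of two payloads are identical."""
    return a[:n] == b[:n]


sps = decode_sps_prefix("4d 00 29 00 01 00 00 01 00 03 02 00 00 00")
```

`same_prefix` is the comparison the section describes; it only covers the first 64 bytes, so a divergence later in a larger control would not show up in it.
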
### New hypothesis (cap_pool / source-index recycling)

mpv-vaapi-copy uses single-surface recycling — always one OUTPUT buffer index recycled in place. Firefox rotates through multiple surfaces → driver issues OUTPUT QBUF on `source_index=0, 1, 2, ...` over consecutive frames. By frame 20, some `source_index` is being requeued before the prior submission has been DQBUF'd by RequestSyncSurface, OR the buffer pool is exhausted, OR the kernel-side OUTPUT buffer state machine is rejecting recycle.

This is structurally similar to iter5 Phase 5 sonnet caveat C4 (cap_pool resolution-change race) — buffer-pool lifecycle issue. Possibly a different facet of the same root weakness: cap_pool / OUTPUT buffer drain isn't sequenced cleanly when the consumer issues fast-rotating surface IDs.

### Diagnostic build status (iter6-DX)

- Source tree: `/home/mfritsche/iter6-fork-dx/` on ohm
- Local-only patches in `src/v4l2.c`:
  - per-control TRY isolation in `v4l2_set_controls` failure path
  - per-call S_EXT_CTRLS rc + first-control id/value log
  - first-perframe-call payload hex dump (one-shot)
  - QBUF failure full v4l2_buffer struct dump
- Built sha (current): `63ea0e630748cc81e98c3d3108fa5f593f60f880fa6a18a28fbafb6864740648`
- Installed at `/usr/lib/dri/v4l2_request_drv_video.so`
- Iter5-end backup at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`

Diagnostics are diagnostic-only (no semantic change). Will not be committed to campaign repo.

### Next steps

The OUTPUT-QBUF-on-frame-20 finding warrants:
1. Cross-check: does our driver allocate enough OUTPUT buffers? `REQBUFS` count vs surface rotation pattern.
2. Do we DQBUF OUTPUT buffers correctly between submissions? Look at picture.c QBUF path + surface.c RequestSyncSurface DQBUF path.
3. Strace at frame ~19-20 boundary to see EXACT kernel ioctl sequence preceding the EINVAL.
4. Whether iter5 Phase 5 caveat C4 (cap_pool race) and this iter6-I finding share root cause — they might be the same bug under different names.

User check-in needed: this is structurally similar to iter5 carryover candidate A (cap_pool resolution-change race). Probably worth merging iter6-I scope with candidate A or treating them as one investigation.

Operator decision (2026-05-05): merge accepted. Scope is now A∪I — the merged buffer-pool / surface-recycle lifecycle.

## Phase 2 wrap (2026-05-05) — bug is intermittent, request_fd / OUTPUT-buffer race

After three test runs with the iter6-DX diagnostic build:

| Run | Successful frames | Failure point |
|---|---|---|
| 1 | 1 | OUTPUT QBUF EINVAL index=3 |
| 2 | 19 | OUTPUT QBUF EINVAL index=1 |
| 3 | 53 | OUTPUT QBUF EINVAL index=0 |

Failure is **intermittent**. Different frame numbers. Different OUTPUT pool slot indices. Always: request_fd=30, type=OUTPUT_MPLANE, REQUEST_FD flag set. Always: bytesused varies with the frame's slice payload size. mpv-vaapi-copy never hits it (verified clean ≥3 frames in this Phase 2 testing; iter5 verified 2000 frames clean).

### Theories ruled out

- **DQBUF returns wrong index** — diagnostic logs `expected vs kernel-returned` and fired zero mismatches across 53 successful + 1 failed cycle. Kernel returns the same index our code passes.
- **Control payload divergence** — first 64 bytes of every per-frame ext_control byte-identical between mpv (works) and Firefox (works frame N, fails frame N+1).
- **Frame-1 setup failure** — original symptom does not reproduce on iter6-DX. S_EXT_CTRLS device-init call (DECODE_MODE/START_CODE) succeeds for both consumers.

### Surviving hypothesis

**request_fd lifecycle race.** request_fd=30 is reused on every frame (Linux fd-allocation reuses lowest-free after close). The iter4 fix (`385dee1`) closes request_fd in RequestSyncSurface after DQBUF. But:

1. If close() does not synchronously release the kernel-side request object (it may defer cleanup until the request is fully drained from any in-flight queue), the next `MEDIA_IOC_REQUEST_ALLOC` may return fd=30 pointing at a *new* request object while the OLD object is still mid-cleanup with its OUTPUT buffer attached.
2. Then frame N+1's QBUF(OUTPUT, request_fd=30) attempts to attach a new buffer to a request that already has one — kernel rejects with EINVAL.

Why mpv doesn't hit this: single-surface recycle means each frame's `RequestSyncSurface → close(request_fd) → DQBUF → next BeginPicture` cycle fully completes before the next frame's BeginPicture. Firefox's MediaSource pipeline issues multiple BeginPicture in flight (separate libva surface IDs), so a stale request_fd from a not-yet-fully-closed previous frame can collide with a new frame's allocation.

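The "lowest-free after close" behaviour that makes `request_fd=30` recur can be observed with any file descriptor. A minimal sketch in Python, using `/dev/null` as a stand-in for the media device (an assumption made only so the example runs anywhere); it shows the descriptor number being reused and says nothing about when the kernel releases the object behind it.

```python
import os


def reopen_after_close(path=os.devnull):
    """Open, close, and reopen; return both descriptor numbers."""
    first = os.open(path, os.O_RDONLY)
    os.close(first)
    # The kernel hands out the lowest free descriptor, which is the
    # number that was just released (in a single-threaded process).
    second = os.open(path, os.O_RDONLY)
    os.close(second)
    return first, second


first, second = reopen_after_close()
```

In a single-threaded process the two numbers are equal, which is why a constant fd in the logs does not by itself tell two request objects apart.
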
### Phase 4 fix sketch

Three angles, leading candidate is **C**:

A. **Use unique request fds, not reused-30.** After `media_request_alloc`, `dup()` the fd to push it past low-fd-reuse range, OR maintain a pool of pre-allocated request fds reset via REINIT. iter4 chose close+alloc over REINIT because REINIT failed for the iter1-3 frame-11 case — but that may have been the DPB / control-payload bug, not this lifecycle race.

B. **Serialize close+alloc with explicit barrier.** After `close(request_fd)`, sleep / probe before next `MEDIA_IOC_REQUEST_ALLOC`. Brittle.

C. **Drain OUTPUT pool slot before surface re-bind (mirror cap_pool discipline)**. picture.c:245 unbinds CAPTURE slot if `surface_object->current_slot != NULL`. Add equivalent for OUTPUT: at start of BeginPicture, if the surface still owns an OUTPUT slot from a previous decode that wasn't sync'd yet, drive that slot to drain (close request_fd, DQBUF its buffer, request_pool_release) before acquiring a new slot. Mirrors iter4's "drain before reuse" discipline at the right layer.

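For angle A, note that a plain `dup()` also returns the lowest free descriptor, so moving a request fd out of the low range needs `fcntl` with `F_DUPFD` (or `F_DUPFD_CLOEXEC`) and an explicit floor. A minimal sketch in Python under that assumption: `lift_fd` is a hypothetical helper, `/dev/null` stands in for a freshly allocated request fd, and the floors 64 and 128 are arbitrary.

```python
import fcntl
import os


def lift_fd(fd, floor):
    """Duplicate fd onto the lowest free descriptor >= floor, then close the original."""
    lifted = fcntl.fcntl(fd, fcntl.F_DUPFD_CLOEXEC, floor)
    os.close(fd)
    return lifted


# Usage: two "request fds" lifted to different floors get different numbers.
first = lift_fd(os.open(os.devnull, os.O_RDONLY), floor=64)
second = lift_fd(os.open(os.devnull, os.O_RDONLY), floor=128)
numbers = (first, second)
os.close(first)
os.close(second)
```

A fixed floor only relocates the reuse: once the lifted fd is closed, the next lift with the same floor gets the same number again, so the floor would have to rotate. Whether the fd number matters to the kernel at all is part of the unverified hypothesis above.
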
C is the cleanest. The OUTPUT-side gap mirrors the iter4 work for request_fd; this extends the same discipline to the OUTPUT pool slot.

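The ordering that angle C describes can be modelled without any V4L2 calls. A toy sketch in Python, not the driver's C code: `OutputPool`, `Surface`, `begin_picture` and `sync_surface` are hypothetical names, and `drain` only records the steps the text lists (close request_fd, DQBUF, release) rather than issuing ioctls.

```python
class OutputPool:
    """Toy model of an OUTPUT buffer pool; each slot is 'free' or 'queued'."""

    def __init__(self, count):
        self.state = ["free"] * count
        self.log = []

    def acquire(self):
        index = self.state.index("free")  # raises ValueError if the pool is exhausted
        self.state[index] = "queued"
        self.log.append(("qbuf", index))
        return index

    def drain(self, index):
        # Stand-in for: close(request_fd), DQBUF the buffer, release the slot.
        self.log.append(("close_request_fd", index))
        self.log.append(("dqbuf", index))
        self.state[index] = "free"


class Surface:
    def __init__(self):
        self.output_slot = None


def begin_picture(pool, surface):
    # Drain-before-rebind: a surface that was never synced gives its slot back first.
    if surface.output_slot is not None:
        pool.drain(surface.output_slot)
        surface.output_slot = None
    surface.output_slot = pool.acquire()
    return surface.output_slot


def sync_surface(pool, surface):
    if surface.output_slot is not None:
        pool.drain(surface.output_slot)
        surface.output_slot = None


# Usage: the same surface is reused without an intervening sync.
pool = OutputPool(2)
surface = Surface()
begin_picture(pool, surface)
begin_picture(pool, surface)
```

In this model the second `begin_picture` drains slot 0 before queueing again, so at most one slot per surface is ever in the queued state.
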
### Stop point (revised)

Phase 2 has identified the bug class: **OUTPUT-pool / request_fd race, surfaced by Firefox's multi-surface concurrent-BeginPicture pattern that mpv-vaapi-copy doesn't hit**.

Phase 4 leading approach: **C — extend iter4's "drain before reuse" discipline from request_fd to OUTPUT pool slot**, with the same close-then-fresh-alloc pattern applied to the OUTPUT slot lifecycle when surface state forces reuse.

Phase 5 sonnet review will be mandatory before any commit (per CLAUDE.md user-global rule).