4cd0167d08
Captures the post-Phase-5-design-review experiment:

- pool=4 -> 16 only delayed failure (1/19/53 -> 62 frames), refuting pool exhaustion as the dominant cause
- Surviving hypothesis: kernel-side buffer-state contention with fd-reuse on close+alloc cycle, surfaced by Firefox's multi-surface MediaSource pattern
- Three Phase 4 directions weighed; operator chose 3 (REINIT) with 1 (kernel analysis) as fallback if 3 failed

This file now serves as the iter6 Phase 2 reference for future iterations; the "Phase 5 (front-loaded design review)" section documents the back-and-forth with sonnet that shaped the final fix shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
224 lines
17 KiB
Markdown
# Phase 2 (iter6): situation analysis of the Firefox H.264 EINVAL

Opened 2026-05-05 immediately after iter6 Phase 1 lock on candidate I.

## Working hypothesis

The iter5-amend Firefox 150 binary (Utility seccomp fix shipped) loads the v4l2_request driver, sets up surfaces (`cap_pool_init: 24 slots ready`), then fails on one of:

- `VIDIOC_S_EXT_CTRLS` EINVAL: observed on YouTube avc1 (Enhancer for YouTube forcing h264, multiple decode attempts)
- `VIDIOC_QBUF` EINVAL: observed on `bbb_1080p30_h264.mp4` direct file URL (single decode attempt)

mpv `--hwdec=vaapi-copy` decoded 2000 frames cleanly on the same iter5-end driver build, so the failure is consumer-specific (Firefox vs mpv) and stream-specific (S_EXT_CTRLS path vs QBUF path).

## Diagnostic build (iter6-DX)

Source was tarballed from the campaign fork at HEAD `c8b6ede` and pushed to ohm at `/home/mfritsche/iter6-fork-dx/`. Two diagnostics were added (local-only, not committed):

1. `v4l2_set_controls`: per-control `VIDIOC_TRY_EXT_CTRLS` isolation when S_EXT_CTRLS EINVAL fires. It tries each compound control alone, which sidesteps the kernel cluster-commit obfuscation described in `feedback_kernel_obfuscation_compound.md`. This reinstates the iter4 diagnostic (`f21bdf0`) that the iter5 sweep removed (`848fc0c`).
2. `v4l2_queue_buffer`: full `v4l2_buffer` struct dump on QBUF EINVAL (type, memory, index, length, bytesused, timestamp, flags, request_fd), to show which field the kernel objects to.

Built on ohm with meson/ninja. Output sha `e01e2cc05b925cfdb2ad8e79a59ab796c90b9b20cdc4863a7c7f9c64624ed5a2`. Installed at `/usr/lib/dri/v4l2_request_drv_video.so`. Iter5-end driver backed up at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`.

## Telemetry runs

### Run 1: bbb_1080p30_h264.mp4 (with iter6-DX driver, S_EXT_CTRLS diagnostic only)

```
v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)
v4l2-request: Unable to queue buffer: Invalid argument
```

**Observation:** No ITER6_DX TRY-isolation lines fired, so S_EXT_CTRLS is NOT failing on bbb. The QBUF EINVAL is the sole error. Single attempt, then Firefox falls back to SW.

The diagnostic placement was therefore incomplete for this stream: the QBUF-side diagnostic (just authored in this Phase 2) is needed. After rebuild + re-test, we'll have the full `v4l2_buffer` payload at the failing call.

### Run 2: YouTube avc1 (planned)

YouTube URL: `https://www.youtube.com/watch?v=7DAPd5MGodY` with the Enhancer for YouTube extension forcing the h264 codec.

Prior iter5-amend run (without diagnostic): emitted both `Unable to set control(s)` and `Unable to queue buffer` errors across 4 cap_pool_init events (multiple decode attempts on one tab as the player retried).

Run 2 with the iter6-DX driver will print the failing compound H.264 control's id + size (and which others passed). This will identify whether the issue is in SPS, PPS, SLICE_PARAMS, DECODE_PARAMS, or SCALING_MATRIX.

## Why-but-not-mpv reasoning

mpv `--hwdec=vaapi-copy` uses libva client code derived from FFmpeg's reference implementation. Firefox's Linux media stack also reaches libva through the FFmpeg path, but its surface management differs: Firefox exports surfaces as DMA-bufs (`vaExportSurfaceHandle`) for zero-copy, while mpv's vaapi-copy uses VAImage readback (per the `-copy` semantics).

Possible divergences that lead to S_EXT_CTRLS / QBUF EINVAL:

- DMA-buf attach changes the buffer's exported state: the kernel may reject QBUF on a buffer that has been EXPBUF'd
- VABufferType ordering: Firefox may submit slice data buffers before SPS/PPS for some frames
- Firefox's SPS payload builder might include a field mpv doesn't (e.g. profile_idc edge case, separate_colour_plane_flag)
- request_fd lifecycle: iter4 fixed mpv's per-frame request_fd allocation; Firefox may share request_fd across frames in a way the kernel rejects

The diagnostic output should narrow this in a single retest.

## Next steps (Phase 2 → Phase 3 → Phase 4)

1. Add QBUF-side diagnostic to ohm's `/home/mfritsche/iter6-fork-dx/src/v4l2.c` (waiting on ohm-tools MCP reconnection).
2. Rebuild + reinstall iter6-DX driver on ohm.
3. Re-run bbb fixture: capture full v4l2_buffer payload at QBUF failure.
4. Re-run YouTube avc1: capture per-control TRY isolation output identifying the failing compound H.264 control.
5. Document findings as Phase 2 evidence.
6. Phase 3: anchor pre-fix baseline (iter5-end driver behavior + iter5-amend Firefox + iter6-DX driver behavior) for before/after comparison.
7. Phase 4: design + implement the actual semantic fix in libva-side struct setup (per `feedback_no_fixture_hardcoding.md`: must be a general-case fix, not a stream-specific kludge).

## Stop point

Waiting on ohm-tools MCP reconnect. After that, autonomous through Phase 4 plan authoring (sonnet review at Phase 5 is mandatory per CLAUDE.md).

## Phase 2 finding update (post-payload-comparison telemetry, 2026-05-05)

### Working symptom shifted

Initial assumed bug: "Firefox VIDIOC_S_EXT_CTRLS EINVAL on frame 1, all controls fail individually with cluster-validation error_idx=count=1, looks like iter4 request_fd-staleness pattern reappearing in a different consumer."

Actual current bug (after payload comparison + multiple test cycles):

| Phase | mpv-vaapi-copy | Firefox |
|---|---|---|
| Device-init `S_EXT_CTRLS` (DECODE_MODE/START_CODE, request_fd=-1) | rc=0 | rc=0 |
| Per-frame `S_EXT_CTRLS` (4 controls, request_fd≥0) | rc=0 sustained 2000+ frames | rc=0 for 19 frames |
| Per-frame `QBUF` (OUTPUT, request_fd) | rc=0 sustained 2000+ frames | **EINVAL on frame 20** |

Failing QBUF detail (steady state): `video_fd=32 request_fd=30 type=10 memory=1 index=1 length=1 bytesused(plane0)=66661 flags=0x800000`. Here type=10 is `V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE`, memory=1 is `V4L2_MEMORY_MMAP`, and flags=0x800000 is `V4L2_BUF_FLAG_REQUEST_FD`. The original "first-frame fail" pattern observed before this Phase 2 work seems to have been transient kernel/Firefox state; it does not reproduce on fresh runs with the iter6-DX diagnostic build.

### Payload comparison (mpv vs Firefox, frame 1)

First 64 bytes of every per-frame ext_control are **byte-identical** between mpv (works) and Firefox (works through frame 19, fails at QBUF on frame 20):

```
SPS: 4d 00 29 00 01 00 00 01 00 03 02 00 00 00 ... (Main profile, level 4.1)
PPS: 00 00 00 00 00 00 02 00 00 00 8a 00
DECODE_PARAMS: 00 00 00 00 ... (zeroed DPB at first frame)
SCALING_MATRIX: 10 10 10 10 ... (default flat 0x10 = 16)
```

So the *control payloads* aren't where mpv and Firefox diverge.

### New hypothesis (cap_pool / source-index recycling)

mpv-vaapi-copy uses single-surface recycling: one OUTPUT buffer index is always recycled in place. Firefox rotates through multiple surfaces, so the driver issues OUTPUT QBUF on `source_index=0, 1, 2, ...` over consecutive frames. By frame 20, one of three things is happening: some `source_index` is being requeued before the prior submission has been DQBUF'd by RequestSyncSurface, the buffer pool is exhausted, or the kernel-side OUTPUT buffer state machine is rejecting the recycle.

This is structurally similar to iter5 Phase 5 sonnet caveat C4 (cap_pool resolution-change race), a buffer-pool lifecycle issue. It is possibly a different facet of the same root weakness: cap_pool / OUTPUT buffer drain isn't sequenced cleanly when the consumer issues fast-rotating surface IDs.

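The first alternative (a slot requeued before its prior submission was dequeued) is the kind of thing a small userspace invariant check could confirm or exclude. The sketch below is hypothetical: `demo_slot_tracker` and its functions do not exist in the driver, and the hook points (just before QBUF, just after a successful DQBUF) are assumptions about where such a check would go.

```c
#include <stdbool.h>
#include <stdio.h>

#define DEMO_MAX_SLOTS 32

enum demo_slot_state { DEMO_SLOT_FREE = 0, DEMO_SLOT_QUEUED };

struct demo_slot_tracker {
    enum demo_slot_state state[DEMO_MAX_SLOTS];
    unsigned violations; /* count of requeue-before-dequeue events */
};

/* Call right before QBUF. Returns false if the slot is still queued. */
bool demo_track_qbuf(struct demo_slot_tracker *t, unsigned index, unsigned frame)
{
    if (index >= DEMO_MAX_SLOTS)
        return false;
    if (t->state[index] == DEMO_SLOT_QUEUED) {
        t->violations++;
        fprintf(stderr, "frame %u: slot %u requeued before DQBUF\n",
                frame, index);
        return false;
    }
    t->state[index] = DEMO_SLOT_QUEUED;
    return true;
}

/* Call after a successful DQBUF on the slot. */
void demo_track_dqbuf(struct demo_slot_tracker *t, unsigned index)
{
    if (index < DEMO_MAX_SLOTS)
        t->state[index] = DEMO_SLOT_FREE;
}
```

If the EINVAL fires while the tracker reports zero violations, userspace double-queueing is excluded and the remaining alternatives stay on the table.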
### Diagnostic build status (iter6-DX)

- Source tree: `/home/mfritsche/iter6-fork-dx/` on ohm
- Local-only patches in `src/v4l2.c`:
  - per-control TRY isolation in `v4l2_set_controls` failure path
  - per-call S_EXT_CTRLS rc + first-control id/value log
  - first-perframe-call payload hex dump (one-shot)
  - QBUF failure full v4l2_buffer struct dump
- Built sha (current): `63ea0e630748cc81e98c3d3108fa5f593f60f880fa6a18a28fbafb6864740648`
- Installed at `/usr/lib/dri/v4l2_request_drv_video.so`
- Iter5-end backup at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`

The patches are diagnostic-only (no semantic change) and will not be committed to the campaign repo.

### Next steps

The OUTPUT-QBUF-on-frame-20 finding warrants:

1. Cross-check: does our driver allocate enough OUTPUT buffers? Compare the `REQBUFS` count against the surface rotation pattern.
2. Do we DQBUF OUTPUT buffers correctly between submissions? Look at the picture.c QBUF path + the surface.c RequestSyncSurface DQBUF path.
3. Strace at the frame ~19-20 boundary to see the EXACT kernel ioctl sequence preceding the EINVAL.
4. Check whether iter5 Phase 5 caveat C4 (cap_pool race) and this iter6-I finding share a root cause; they might be the same bug under different names.

User check-in needed: this is structurally similar to iter5 carryover candidate A (cap_pool resolution-change race). Probably worth merging iter6-I scope with candidate A or treating them as one investigation.

Operator decision (2026-05-05): merge accepted. Scope is now A∪I, the merged buffer-pool / surface-recycle lifecycle.

## Phase 2 wrap (2026-05-05): bug is intermittent, request_fd / OUTPUT-buffer race

After three test runs with the iter6-DX diagnostic build:

| Run | Successful frames | Failure point |
|---|---|---|
| 1 | 1 | OUTPUT QBUF EINVAL index=3 |
| 2 | 19 | OUTPUT QBUF EINVAL index=1 |
| 3 | 53 | OUTPUT QBUF EINVAL index=0 |

Failure is **intermittent**: different frame numbers, different OUTPUT pool slot indices. Constant across failures: request_fd=30, type=OUTPUT_MPLANE, REQUEST_FD flag set. bytesused varies with the frame's slice payload size. mpv-vaapi-copy never hits it (verified clean ≥3 frames in this Phase 2 testing; iter5 verified 2000 frames clean).

### Theories ruled out

- **DQBUF returns wrong index**: the diagnostic logs `expected vs kernel-returned` and fired zero mismatches across 53 successful + 1 failed cycle. The kernel returns the same index our code passes.
- **Control payload divergence**: first 64 bytes of every per-frame ext_control are byte-identical between mpv (works) and Firefox (works frame N, fails frame N+1).
- **Frame-1 setup failure**: the original symptom does not reproduce on iter6-DX. The S_EXT_CTRLS device-init call (DECODE_MODE/START_CODE) succeeds for both consumers.

### Surviving hypothesis

**request_fd lifecycle race.** request_fd=30 is reused on every frame (Linux fd allocation reuses the lowest free number after close). The iter4 fix (`385dee1`) closes request_fd in RequestSyncSurface after DQBUF. But:

1. If close() does not synchronously release the kernel-side request object (it may defer cleanup until the request is fully drained from any in-flight queue), the next `MEDIA_IOC_REQUEST_ALLOC` may return fd=30 pointing at a *new* request object while the OLD object is still mid-cleanup with its OUTPUT buffer attached.
2. Then frame N+1's QBUF(OUTPUT, request_fd=30) attempts to attach a new buffer to a request that already has one, and the kernel rejects it with EINVAL.

Why mpv doesn't hit this: single-surface recycle means each frame's `RequestSyncSurface → close(request_fd) → DQBUF → next BeginPicture` cycle fully completes before the next frame's BeginPicture. Firefox's MediaSource pipeline issues multiple BeginPicture in flight (separate libva surface IDs), so a stale request_fd from a not-yet-fully-closed previous frame can collide with a new frame's allocation.

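The fd-number reuse that makes request_fd=30 appear on every frame is ordinary POSIX behavior: a newly opened descriptor gets the lowest number not currently open. A minimal sketch, using `/dev/null` as a stand-in for the media device (an assumption made so it runs anywhere; `fd_number_is_reused` is a hypothetical helper):

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if closing an fd and opening another hands back the same
 * number, 0 if not, -1 on error. Single-threaded, so no other open()
 * can interleave between the close and the second open. */
int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;
    close(first);

    int second = open("/dev/null", O_RDONLY);
    if (second < 0)
        return -1;
    close(second);

    return first == second;
}
```

The same number therefore says nothing about whether the kernel object behind it is the same one, which is why a constant request_fd=30 in the logs is compatible with a fresh request object on every frame.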
### Phase 4 fix sketch

Three angles; the leading candidate is **C**:

A. **Use unique request fds, not reused-30.** After `media_request_alloc`, move the fd above the low-fd-reuse range with `fcntl(fd, F_DUPFD_CLOEXEC, min_fd)` (plain `dup()` would not help, since it also returns the lowest free descriptor), OR maintain a pool of pre-allocated request fds reset via REINIT. iter4 chose close+alloc over REINIT because REINIT failed for the iter1-3 frame-11 case, but that may have been the DPB / control-payload bug, not this lifecycle race.

B. **Serialize close+alloc with explicit barrier.** After `close(request_fd)`, sleep / probe before the next `MEDIA_IOC_REQUEST_ALLOC`. Brittle.

C. **Drain OUTPUT pool slot before surface re-bind (mirror cap_pool discipline)**. picture.c:245 unbinds the CAPTURE slot if `surface_object->current_slot != NULL`. Add the equivalent for OUTPUT: at the start of BeginPicture, if the surface still owns an OUTPUT slot from a previous decode that wasn't sync'd yet, drive that slot to drain (close request_fd, DQBUF its buffer, request_pool_release) before acquiring a new slot. This mirrors iter4's "drain before reuse" discipline at the right layer.

C is the cleanest. The OUTPUT-side gap mirrors the iter4 work for request_fd; this extends the same discipline to the OUTPUT pool slot.

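For angle A, relocating a descriptor above a floor could look roughly like this. `move_fd_at_or_above` is a hypothetical helper, not something in the driver. It uses `F_DUPFD` for portability; a real driver would more likely want `F_DUPFD_CLOEXEC` so the descriptor is not inherited across exec.

```c
#include <fcntl.h>
#include <unistd.h>

/* Duplicate fd onto the lowest free descriptor >= min_fd, then close the
 * original. Both numbers refer to the same open file description, so the
 * underlying object stays alive across the move.
 * Returns the new fd, or -1 on failure (the original is left open). */
int move_fd_at_or_above(int fd, int min_fd)
{
    int moved = fcntl(fd, F_DUPFD, min_fd);
    if (moved < 0)
        return -1;
    close(fd);
    return moved;
}
```

Note that a fixed floor only relocates the reuse: once the moved fd is closed, the next move lands on the same number again. Getting genuinely distinct numbers per frame would need the caller to advance the floor on each allocation, which is an additional assumption about how A would be implemented.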
### Stop point (revised)

Phase 2 has identified the bug class: **OUTPUT-pool / request_fd race, surfaced by Firefox's multi-surface concurrent-BeginPicture pattern that mpv-vaapi-copy doesn't hit**.

Phase 4 leading approach: **C, extending iter4's "drain before reuse" discipline from request_fd to the OUTPUT pool slot**, with the same close-then-fresh-alloc pattern applied to the OUTPUT slot lifecycle when surface state forces reuse.

Phase 5 sonnet review will be mandatory before any commit (per CLAUDE.md user-global rule).

## Phase 5 (front-loaded design review, 2026-05-05)

Sonnet review surfaced two important caveats:

1. **Pool exhaustion as competing hypothesis**: should be tested by bumping `request_pool_init` from 4 to 16 *before* implementing approach C.
2. **Approach C is mostly dead code in normal paths**: `EndPicture` (picture.c:389) calls `RequestSyncSurface` inline, which resets `source_data` to NULL. Drain-before-rebind only matters for error-path recovery.

Run that diagnostic now.

### Pool-size bump diagnostic (2026-05-05)

Changed `context.c:99-100` `request_pool_init(..., 4)` → `..., 16` in iter6-DX. Rebuilt (sha `e7ed864d693449a512c66d1c547b567ce6336d087c421d645eef7d7a7f248765`). Retested Firefox bbb fixture.

| Pool size | Successful frames | Failure slot | Notes |
|---|---|---|---|
| 4 (run 1) | 1 | index=3 | Original |
| 4 (run 2) | 19 | index=1 | |
| 4 (run 3) | 53 | index=0 | |
| 16 | 62 | index=13 | Slot 13 → ~4th reuse of slot |

**Conclusion:** pool size is not the dominant fix. Bumping 4→16 pushed failure from frame 53 to 62 (marginal). The target is 30fps × 30s = 900 frames; pool=24 won't close that gap. The failure mechanism is real per-slot kernel-side state contention, not pool exhaustion. (Pool exhaustion would manifest as `request_pool_acquire` returning -1 and BeginPicture cleanly failing with `ALLOCATION_FAILED`, not QBUF EINVAL on a freshly-acquired slot.)

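The distinction in the parenthetical can be made concrete with a toy pool. This is a hypothetical stand-in, not the driver's `request_pool_*` implementation; `demo_pool`, its size, and its lowest-free policy are assumptions for illustration only.

```c
#include <stdbool.h>

#define DEMO_POOL_SIZE 4

struct demo_pool {
    bool in_use[DEMO_POOL_SIZE];
};

/* Returns the lowest free slot index, or -1 when every slot is in use. */
int demo_pool_acquire(struct demo_pool *p)
{
    for (int i = 0; i < DEMO_POOL_SIZE; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1; /* exhaustion is visible here, before any ioctl is issued */
}

void demo_pool_release(struct demo_pool *p, int slot)
{
    if (slot >= 0 && slot < DEMO_POOL_SIZE)
        p->in_use[slot] = false;
}
```

With a pool of this shape, exhaustion is reported at acquire time. A QBUF EINVAL on a slot that acquire just handed out therefore points at state outside the userspace pool.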
The race fires on what appears to be the **~4th reuse of a slot** under Firefox's surface-rotation rate. mpv-vaapi-copy reuses slot 0 forever and never hits it, because the per-buffer kernel state machine cycles cleanly when each cycle fully drains before the next begins. Firefox's pipeline issues per-surface BeginPicture/EndPicture, but `EndPicture` calls `RequestSyncSurface` inline (picture.c:389), so on its face the per-surface cycle should also be synchronous.

This means the race is happening **between SUCCESSIVE surface-decode cycles**, not within a single one. Specifically: surface A's `RequestSyncSurface(A)` does `close(request_fd_A)` + `DQBUF OUTPUT[slot_A]`, both kernel-clean operations from userspace's perspective. Surface B's next `BeginPicture(B)` then does `media_request_alloc()` (returns a fresh kernel request object, possibly at fd=30 again due to lowest-free-fd reuse) + `request_pool_acquire()` (returns slot_B, which may have been used by an even-earlier surface decode). QBUF on slot_B with the new request_fd then fails: **EINVAL on roughly the 4th reuse of any given slot.**

### Revised hypothesis

The kernel-side V4L2 OUTPUT buffer state machine retains a residual reference to the *last* request object that touched it, even after our DQBUF returns success. Reusing the buffer with a NEW request object (different request_fd) within ~3-4 frames of the prior use fails intermittently because the kernel's buffer-to-request linkage hasn't been fully torn down.

This is hantro-driver kernel-side state. Userspace mitigations:

- **Increase the buffer cycle distance** (pool size): papers over the race but doesn't eliminate it
- **Hold each request_fd for the lifetime of its OUTPUT slot** (one-to-one binding): eliminates the cross-pollination but requires either REINIT (which iter4 ruled out for compound-control reasons) or some way to "reset" the request object without close+alloc
- **Sleep / probe-ioctl barrier** between close-request_fd and next QBUF on the same slot: gross but effective
- **Detect the EINVAL on QBUF and retry once after barrier**: error-path recovery, ugly

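The last mitigation in the list (retry once after a barrier) has a simple control-flow shape, sketched below. Everything here is hypothetical: the callback types, `qbuf_retry_once_on_einval`, and the `demo_device` stand-in are illustrations, and what the barrier would actually do on hantro is exactly the open question.

```c
#include <errno.h>

typedef int (*demo_qbuf_fn)(void *ctx);     /* 0 on success, -1 with errno set */
typedef void (*demo_barrier_fn)(void *ctx); /* sleep, probe ioctl, etc. */

/* Submit once; on EINVAL only, run the barrier and submit exactly once more.
 * Any other errno is returned to the caller untouched. */
int qbuf_retry_once_on_einval(demo_qbuf_fn qbuf, demo_barrier_fn barrier,
                              void *ctx)
{
    if (qbuf(ctx) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;
    barrier(ctx);
    return qbuf(ctx);
}

/* Stand-in device: rejects QBUF with EINVAL until the barrier has run. */
struct demo_device {
    int barrier_done;
    int qbuf_calls;
};

int demo_qbuf(void *ctx)
{
    struct demo_device *d = ctx;

    d->qbuf_calls++;
    if (!d->barrier_done) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

void demo_barrier(void *ctx)
{
    ((struct demo_device *)ctx)->barrier_done = 1;
}
```

Bounding the retry to a single attempt keeps a genuinely malformed buffer from looping forever, at the cost of still failing if the kernel needs longer than one barrier interval.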
### Stop point (Phase 2 closes here)

Phase 2 has narrowed the bug class to: **kernel-side per-buffer state contention with the V4L2 request object lifecycle on hantro G1, surfaced when consumers rotate OUTPUT slots faster than the kernel cleans up the buffer-to-request linkage.** Pool size only mitigates marginally.

**Operator decision needed before Phase 4:**

- **Direction 1**: deeper kernel-side analysis. Read hantro_drv.c / v4l2-mem2mem / videobuf2 source for the OUTPUT-buffer-to-request state machine, identify the exact cleanup ordering, and design a userspace barrier that aligns with it. This may also surface upstream kernel-bug avenues.
- **Direction 2**: pragmatic mitigation. Combine pool size = `surfaces_count + DPB_MAX` with QBUF-EINVAL-retry-with-barrier. This will likely get us past the 900-frame target but isn't architecturally clean.
- **Direction 3**: investigate REINIT-based per-slot request_fd binding. iter4 ruled this out for the iter1-3 frame-11 case (control-payload-related); this is a different failure mode and REINIT may now work. Worth a 30-min experiment.

Direction 3 is the cleanest if it works. Direction 1 is the most thorough. Direction 2 is the fastest.